[CVPR 2019] Yogesh Rawat @ University of Central Florida: Capsule Networks for Computer Vision


We have Yogesh here with us. He’s a postdoc
researcher at the University of Central Florida. He’s also an organizer of the tutorial “Capsule
Networks for Computer Vision”. Thank you so much for coming here to join us. Thank you for inviting me. Tell us a bit about this tutorial that you
organized. How did you pick the speakers and co-organizers? We did some recent work on capsule networks.
We have really good results in different problem domains. Then my supervisor at UCF, Mubarak
Shah actually suggested that we should organize this tutorial because this is a really exciting
area and there’s a lot of attention from the research community. Then we discussed this
for a while, and it turned out to be a good idea, then we submitted a proposal
to the conference. Our co-organizers are the people who published the breakthrough papers in this area. We tried to contact them, mainly via email, because most of the authors of those works were not available in person, and some of them agreed. So it was good. We got positive responses from some of them, then
we built a team. One good thing was, we have done a lot of work on capsule networks at UCF, so most of our speakers are from UCF. We got one collaborator from the University of Toronto, Canada. They did the very initial work in this area; those works are really popular,
that was a breakthrough. And they agreed to collaborate. That was good. Once we had the collaborators, then we decided
on the program, like what content we should cover, and which problems or
different domains we should talk about. Then we came up with the schedule.
So that’s the preparation for this tutorial. What is it about? This is about capsule networks, which are really quite recent. The first research work that talked about this idea came out in 2011, so it’s not very old. It was quite new, but there were some issues, so it didn’t take off at that time. It took five to six years to form these nice ideas, and then we started seeing some good papers in 2015, followed by a big breakthrough in 2017. Since then, there has been a lot of research in this area. The idea is, we saw breakthroughs in standard
convolutional neural networks starting from 2010, so this area of deep learning actually took off. Capsule networks were also proposed around that time, but training these networks was not that easy. Because of that issue, they weren’t as popular as CNNs, which are everywhere these days, but now I think they are picking up. They’re different from standard convolutional neural networks, and they’re more intuitive. The idea is, in a standard convolutional neural network,
what we try to do is get activations that tell us whether some features are present in the input data. But here is the difference: instead of just a single activation, what we do is group those activations together to represent the entities present in the input data. So instead of just saying whether a visual entity is present in the input data, we also describe the different properties of that entity. That’s why we group those neurons together: they represent different parts of the object, or different properties. And on top of that, we can still see whether that entity is present or not. So that’s really different from standard CNNs. It’s quite intuitive. What type of researchers will benefit from
this tutorial? What I mean is, how do you help senior researchers?
What can you provide them? I think one difficulty many researchers are
facing is how to train these capsule networks. That’s another motivation to do this tutorial.
Initially, this capsule network was proposed for images, and the idea was tested on very
small resolution images, for example, 20 by 20 black and white images. When people were
trying to scale them for bigger or high dimensional data, such as videos or big resolution images,
they didn’t get good results. So we were the first to apply these capsule networks to videos, which are high-dimensional data. We did it successfully, probably we were lucky
or we chose the right path or we had the right ideas. We wanted to share that
with the community by doing this tutorial. That’s very nice of you, doing something
really good for this community. From your perspective, what’s the current status of
the capsule networks for computer vision? Where are you at this stage? From computer vision’s point of view, it
has been applied to different problem domains. For some of the domains, we have seen very
good success. For tasks like image classification, we have good results when the size of the
dataset isn’t that big. And then we have seen good results in object segmentation,
we have seen good results in some entity localization in the video. The reason for that is when we represent those
entities using a group of neurons – we call them capsules – those capsules are representing
entities, so that’s like interesting(6:25) CNNs. That’s why when you represent that entity,
it’s actually easier to segment or track those entities. Specifically, these two tasks are
really showing very good results. I hope this can expand to other problem domains as well.
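To make the idea of "a group of neurons representing an entity" concrete, here is a minimal NumPy sketch. The `squash` non-linearity and the "length = presence, direction = properties" reading follow the 2017 capsule-network literature; the specific vector values are made up purely for illustration:

```python
import numpy as np

# In a standard CNN, a unit outputs a single scalar activation:
# "how strongly is this feature present?"
scalar_activation = 0.9

# A capsule groups several neurons into one vector. Its length can be read
# as the probability that an entity is present, while its direction encodes
# the entity's properties (pose, deformation, texture, ...).
capsule = np.array([0.3, -0.1, 0.8, 0.2])  # hypothetical 4-D capsule output

def squash(v, eps=1e-9):
    """Capsule non-linearity: keeps the direction (the properties) and
    maps the length into [0, 1) so it can act as a presence probability."""
    norm_sq = np.sum(v ** 2)
    return (norm_sq / (1.0 + norm_sq)) * v / np.sqrt(norm_sq + eps)

v = squash(capsule)
presence = np.linalg.norm(v)   # "is the entity there?" -> always in [0, 1)
properties = v / presence      # unit direction -> "what does it look like?"
```

This is what makes segmentation and tracking more natural: the same capsule that signals presence also carries the entity’s pose, so the entity can be followed as its properties change.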
So this is the best set of – What are some of the breakthroughs that
you and your team have solved? To me, in this area, the biggest breakthrough
was the routing algorithm that people came up with in 2017 and 2018. That actually
enabled us to train these capsule networks. Before that, it was not very easy to do that.
The idea was there, the intuition was there, but we didn’t know how to train
such networks. So it was introduced. The next breakthrough I’d like to say is,
how we can scale that routing algorithm for high dimensional data. As I mentioned before
that we did for videos, it was never done before that. So I will say the biggest breakthrough
was how we can scale this thing. There were basically two main ideas we proposed. One
is capsule pooling, where we actually reduce the number of capsules we have, because it
would affect our routing algorithms. If you have too many capsules, routing is infeasible
to do. And also with too many capsules, it’s hard to train, they cannot fit in the
memory. So we had two main ideas: one was reducing the number of capsules, and the other was sharing the weights of the transformation mechanism between the capsules. That actually brings down the memory consumption and execution time of the algorithm.
I would say this is the big breakthrough, which is scaling these capsule networks. Awesome. As a researcher, what are the problems
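For readers curious what that routing step looks like, here is a minimal NumPy sketch of routing-by-agreement in the spirit of the 2017 dynamic-routing paper. The shapes, iteration count, and initialization here are illustrative assumptions, not the exact published algorithm or the speaker's video-scale variant:

```python
import numpy as np

def squash(v, axis=-1, eps=1e-9):
    # Capsule non-linearity: preserve direction, map length into [0, 1).
    norm_sq = np.sum(v ** 2, axis=axis, keepdims=True)
    return (norm_sq / (1.0 + norm_sq)) * v / np.sqrt(norm_sq + eps)

def routing_by_agreement(u_hat, num_iters=3):
    """Route between a layer of lower capsules and a layer of higher capsules.

    u_hat: (num_lower, num_higher, dim) prediction vectors: each lower
           capsule's prediction of each higher capsule's output (already
           multiplied by the learned transformation matrices).
    Returns the higher-capsule outputs, shape (num_higher, dim).
    """
    num_lower, num_higher, _ = u_hat.shape
    b = np.zeros((num_lower, num_higher))                     # routing logits
    for _ in range(num_iters):
        c = np.exp(b) / np.exp(b).sum(axis=1, keepdims=True)  # coupling coefficients
        s = (c[..., None] * u_hat).sum(axis=0)                # weighted votes per higher capsule
        v = squash(s)
        # Strengthen routes where a lower capsule's prediction agrees
        # (large dot product) with the higher capsule's current output.
        b = b + np.einsum('ijd,jd->ij', u_hat, v)
    return v

rng = np.random.default_rng(0)
u_hat = rng.normal(size=(8, 3, 4))   # 8 lower capsules, 3 higher capsules, 4-D
v = routing_by_agreement(u_hat)
```

The capsule pooling mentioned above can be thought of as shrinking `num_lower` before this step, since the cost of routing grows with the number of lower–higher capsule pairs.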
that you’re focusing on right now? What’s your future path in this area? Right now, I’m mainly looking into video analytics, like understanding how we can address different kinds of problems in the video domain. And my
main focus right now is activity recognition. If we have a video stream, can we detect what
kind of activities are happening in that video? And one specific problem which I’m looking
at right now is, if you have videos from multiple views, then it actually affects things a lot. As humans, we can easily tell this is the same activity. But for computers, if we change the viewpoint, the data changes a lot, so it’s very challenging. That’s one area. The other area that I’m focusing on is semi-
supervised learning. For videos, like the activity recognition I talked about, getting labels is very hard. You can get video-level labels, but when you have to get pixel-level labels, since there are so many pixels in a video, it’s not that trivial. So what I’m looking
into is how we can actually manage to train our networks using very few labels. We do
have a lot of videos which are not labeled, so how we can make use of those and combine
these two to perform semi-supervised learning? Those are the challenges. Right now, most
of the research is being done in supervised learning and we are seeing very good results.
But now I think the next steps will be like how we can do this in semi-supervised,
unsupervised training. That’s also where the future of this field is going. Alright, thank you so much for coming to
our platform to share with us. Thank you very much for inviting me.
