Deep Learning in Computer Vision
Introduction to Deep Learning
Presenter: Sungjoon Choi
Contents
Optimization methods, CNN basics, Semantic segmentation, Weakly supervised localization, Image detection, RNN, Visual QnA, Word2Vec, Image Captioning
What is deep learning?
Wikipedia says: Deep learning is a branch of machine learning based on a set of algorithms that attempt to model high-level abstractions in data by using multiple processing layers, with complex structures or otherwise, composed of multiple non-linear transformations.
Keywords: machine learning, high-level abstraction, network
Is it brand new?
Neural Nets: McCulloch & Pitts, 1943
Perceptron: Rosenblatt, 1958
RNN: Grossberg, 1973
CNN: Fukushima, 1979
RBM: Hinton, 1999
DBN: Hinton, 2006
D-AE: Vincent, 2008
AlexNet: Alex, 2012
GoogLeNet: Szegedy, 2015
Deep architectures
Feed-Forward: multilayer neural nets, convolutional nets
Feed-Back: Stacked Sparse Coding, Deconvolutional Nets
Bi-Directional: Deep Boltzmann Machines, Stacked Auto-Encoders
Recurrent: Recurrent Nets, Long Short-Term Memory
CNN basics
CNN
CNNs are basically layers of convolutions followed by subsampling, and then fully connected layers. Intuitively speaking, the convolution and subsampling layers work as feature extraction layers, while the fully connected layers classify which category the current input belongs to using the extracted features.
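The pipeline described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the architecture from the slides: the helper names (`conv2d`, `max_pool`, `forward`), the ReLU non-linearity, the 8x8 input, and the random weights are all assumptions made for the example.

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2-D cross-correlation of a single-channel image with one kernel."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool(feature_map, size=2):
    """Non-overlapping max pooling (the subsampling step)."""
    h, w = feature_map.shape
    h, w = h - h % size, w - w % size  # drop rows/cols that do not fill a window
    blocks = feature_map[:h, :w].reshape(h // size, size, w // size, size)
    return blocks.max(axis=(1, 3))

def forward(image, kernels, fc_weights, fc_bias):
    """Feature extraction (conv + ReLU + pooling), then a fully connected classifier."""
    features = [max_pool(np.maximum(conv2d(image, k), 0.0)) for k in kernels]
    flat = np.concatenate([f.ravel() for f in features])
    scores = fc_weights @ flat + fc_bias
    exp = np.exp(scores - scores.max())
    return exp / exp.sum()  # softmax class probabilities

rng = np.random.default_rng(0)
image = rng.random((8, 8))                          # toy single-channel input
kernels = rng.standard_normal((4, 3, 3))            # 4 filters of size 3x3
fc_weights = rng.standard_normal((10, 4 * 3 * 3))   # 10 hypothetical classes
fc_bias = np.zeros(10)
probs = forward(image, kernels, fc_weights, fc_bias)
```

Each 3x3 filter turns the 8x8 input into a 6x6 map, pooling halves it to 3x3, and the four pooled maps are flattened into the 36 features that the fully connected layer classifies.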
Optimization methods
Gradient descent?
There are three variants of gradient descent. They differ in how much data we use to compute the gradient. We make a trade-off between the accuracy of the gradient and the computing time.
Batch gradient descent
In batch gradient descent, we use the entire training dataset to compute the gradient.
Stochastic gradient descent
In stochastic gradient descent (SGD), the gradient is computed from each training sample, one by one.
Mini-batch gradient descent
In mini-batch gradient descent, we take the best of both worlds: the gradient is computed from a small batch of training samples. Common mini-batch sizes range between 50 and 256 (but can vary).
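The three variants differ only in the batch size, which the following sketch makes explicit. The toy linear-regression problem, the function names, and the learning rates are assumptions chosen for illustration; they are not taken from the slides.

```python
import numpy as np

def gradient(w, X, y):
    """Gradient of 0.5 * mean((X @ w - y) ** 2) with respect to w."""
    return X.T @ (X @ w - y) / len(y)

def train(X, y, batch_size, lr=0.1, epochs=200, seed=0):
    """batch_size == len(y): batch GD; 1: SGD; anything in between: mini-batch GD."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        order = rng.permutation(len(y))  # reshuffle the data every epoch
        for start in range(0, len(y), batch_size):
            idx = order[start:start + batch_size]
            w -= lr * gradient(w, X[idx], y[idx])
    return w

rng = np.random.default_rng(1)
X = rng.standard_normal((200, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w  # noiseless toy targets

w_batch = train(X, y, batch_size=len(y))              # one update per epoch
w_sgd = train(X, y, batch_size=1, lr=0.01, epochs=20)  # one update per sample
w_mini = train(X, y, batch_size=50)                    # four updates per epoch
```

On this toy problem all three recover `true_w`; what changes is how many parameter updates are made per pass over the data and how noisy each update is.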
Challenges
Choosing a proper learning rate is cumbersome (learning rate schedule).
Avoiding getting trapped in suboptimal local minima.
Momentum
Nesterov accelerated gradient
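The update rules for these two slides did not survive in the transcript. As a hedged reminder, the sketch below uses the commonly cited formulations of momentum and Nesterov accelerated gradient on a toy objective f(theta) = theta^2; the function names, the objective, and the hyperparameters are assumptions for illustration.

```python
def grad(theta):
    """Gradient of the toy objective f(theta) = theta ** 2."""
    return 2.0 * theta

def momentum(theta, lr=0.1, gamma=0.9, steps=200):
    v = 0.0
    for _ in range(steps):
        v = gamma * v + lr * grad(theta)       # accumulate a velocity
        theta = theta - v
    return theta

def nesterov(theta, lr=0.1, gamma=0.9, steps=200):
    v = 0.0
    for _ in range(steps):
        lookahead = theta - gamma * v          # where the momentum is about to take us
        v = gamma * v + lr * grad(lookahead)   # gradient evaluated at the look-ahead point
        theta = theta - v
    return theta

theta_momentum = momentum(5.0)
theta_nesterov = nesterov(5.0)
```

The only difference between the two is where the gradient is evaluated: at the current parameters for momentum, and at the anticipated next position for Nesterov.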
Adagrad
Adagrad adapts the learning rate to the parameters, performing larger updates for infrequent parameters and smaller updates for frequent parameters.
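A minimal sketch of this behaviour, assuming the standard Adagrad rule (divide the learning rate by the square root of the accumulated squared gradients). The two-parameter setup and the names `adagrad_step` and `cache` are hypothetical.

```python
import numpy as np

def adagrad_step(theta, grad, cache, lr=0.1, eps=1e-8):
    """One Adagrad update; cache holds the per-parameter sum of squared gradients."""
    cache = cache + grad ** 2
    theta = theta - lr * grad / (np.sqrt(cache) + eps)
    return theta, cache

theta = np.zeros(2)
cache = np.zeros(2)
for step in range(100):
    # Parameter 0 receives a gradient on every step (frequent);
    # parameter 1 only on every 10th step (infrequent).
    grad = np.array([1.0, 1.0 if step % 10 == 0 else 0.0])
    before = theta.copy()
    theta, cache = adagrad_step(theta, grad, cache)
    if step == 90:
        last_update = np.abs(theta - before)  # update sizes when both got a gradient
```

At step 90 both parameters see the same gradient, yet the infrequent one moves further because its accumulated squared gradient is smaller.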
Adadelta
Adadelta is an extension of Adagrad that seeks to counter Adagrad's monotonically decreasing learning rate. No learning rate needs to be set!
Exponential moving average
RMSprop
RMSprop is an unpublished, adaptive learning rate method proposed by Geoff Hinton in his lecture.
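A hedged sketch of the usual RMSprop rule, which replaces Adagrad's growing sum with an exponential moving average of squared gradients. The decay of 0.9, the toy objective f(theta) = sum(theta^2), and the function name are assumptions.

```python
import numpy as np

def rmsprop_step(theta, grad, avg_sq, lr=0.01, decay=0.9, eps=1e-8):
    """One RMSprop update; avg_sq is the moving average of squared gradients."""
    avg_sq = decay * avg_sq + (1.0 - decay) * grad ** 2
    theta = theta - lr * grad / (np.sqrt(avg_sq) + eps)
    return theta, avg_sq

theta = np.array([5.0, -3.0])
avg_sq = np.zeros_like(theta)
for _ in range(2000):
    grad = 2.0 * theta  # gradient of sum(theta ** 2)
    theta, avg_sq = rmsprop_step(theta, grad, avg_sq)
```

Because the gradient is divided by its own running magnitude, each step has a size on the order of `lr`, so with a constant learning rate the parameters end up hovering close to the minimum rather than landing on it exactly.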
Adam
Adaptive Moment Estimation (Adam) stores exponentially decaying averages of both past gradients and past squared gradients.
Momentum
Running average of gradient squares
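The two stored averages can be sketched as follows, assuming the standard Adam rule with bias correction. The hyperparameter values and the toy objective f(theta) = sum(theta^2) are assumptions for illustration.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1.0 - beta1) * grad        # momentum: running average of gradients
    v = beta2 * v + (1.0 - beta2) * grad ** 2   # running average of squared gradients
    m_hat = m / (1.0 - beta1 ** t)              # bias correction for the zero initialisation
    v_hat = v / (1.0 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

theta = np.array([5.0, -3.0])
m = np.zeros_like(theta)
v = np.zeros_like(theta)
trajectory = [theta.copy()]
for t in range(1, 501):
    grad = 2.0 * theta  # gradient of sum(theta ** 2)
    theta, m, v = adam_step(theta, grad, m, v, t)
    trajectory.append(theta.copy())
```

After bias correction the very first step has magnitude close to `lr` in every coordinate, regardless of the gradient scale.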
Visualization
Semantic segmentation
Semantic Segmentation?
Image Classification: lion, dog, giraffe
Object Detection: bicycle, person, ball, dog
Semantic Segmentation: person, person, person, person, person, bicycle, bicycle
Results
Weakly supervised localization
Weakly Supervised Object Localization
Supervised learning of localization usually relies on bounding-box annotations.
What if localization were possible from image labels alone, without bounding-box annotations?
Today's seminar: Learning Deep Features for Discriminative Localization, Zhou et al., arXiv:1512.04150v1 (2015), CVPR 2016
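Architecture
AlexNet + GAP + Places205
Input: 227x227x3
Last convolutional feature maps: 11x11x512
11x11 average pooling: Global Average Pooling (GAP), giving 512 values
Output layer: 512 to 205 classes (example prediction: living room)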
Class activation map (CAM)
A CAM identifies important image regions by projecting the weights of the output layer back onto the convolutional feature maps.
CAMs can be generated for each class in a single image. The highlighted regions differ from category to category in a given image (example classes: palace, dome, church).
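The projection described above amounts to a weighted sum of the last convolutional feature maps, using the output-layer weights of the chosen class. The sketch below assumes the 11x11x512 feature maps and 205 classes mentioned on the architecture slide; the random arrays are stand-ins for real activations and trained weights, and the function name is hypothetical.

```python
import numpy as np

def class_activation_map(feature_maps, fc_weights, class_idx):
    """feature_maps: (K, H, W) last conv output; fc_weights: (num_classes, K)."""
    # Weighted sum over the K channels gives one (H, W) map for the class.
    return np.tensordot(fc_weights[class_idx], feature_maps, axes=([0], [0]))

rng = np.random.default_rng(0)
feature_maps = rng.random((512, 11, 11))       # stand-in for real conv activations
fc_weights = rng.standard_normal((205, 512))   # stand-in for trained output weights

gap = feature_maps.mean(axis=(1, 2))           # global average pooling -> (512,)
scores = fc_weights @ gap                      # class scores before softmax
top_class = int(np.argmax(scores))
cam = class_activation_map(feature_maps, fc_weights, top_class)
```

Because GAP is a spatial mean, the class score equals the mean of that class's CAM, which is why the map shows how much each location contributes to the prediction. In practice the 11x11 map would be upsampled to the input resolution for display.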
Results
CAM on top 5 predictions on an image
CAM for one object class in images
GAP vs. GMP
Oquab et al., CVPR 2015, "Is object localization for free? Weakly-supervised learning with convolutional neural networks", use global max pooling (GMP).
What is the intuitive difference between GMP and GAP?
The GAP loss encourages the network to identify the full extent of an object. The GMP loss encourages it to identify just one discriminative part.
With GAP, the average of a map is maximized by finding all discriminative parts of the object; if the activations are all low, the output of that particular map is reduced.
With GMP, low scores for all image regions except the most discriminative part do not impact the score when max pooling is performed.
GAP & GMP
Figure: GAP (upper) vs. GMP (lower)
GAP outperforms GMP: it highlights more complete object regions and less background noise.
The loss for average pooling benefits when the network identifies all discriminative regions of an object.
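A tiny numeric illustration of that intuition, using two hand-made 4x4 activation maps (the values are invented for the example):

```python
import numpy as np

def gap(feature_map):
    """Global average pooling of one feature map."""
    return feature_map.mean()

def gmp(feature_map):
    """Global max pooling of one feature map."""
    return feature_map.max()

# One strongly discriminative location plus a moderately active object region.
full_object = np.array([[0.0, 0.0, 0.0, 0.0],
                        [0.0, 0.6, 0.6, 0.0],
                        [0.0, 0.6, 1.0, 0.0],
                        [0.0, 0.0, 0.0, 0.0]])
# Same peak, but the rest of the object is not activated.
part_only = np.zeros((4, 4))
part_only[2, 2] = 1.0

gap_scores = (gap(full_object), gap(part_only))
gmp_scores = (gmp(full_object), gmp(part_only))
```

GMP gives both maps the same score, so nothing rewards covering the whole object, while GAP scores the map that activates on the full extent higher.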
Concept localization
Concept localization in weakly labeled images
Positive set: images whose text caption contains a short phrase
Negative set: randomly selected images
The model catches the concept, even though phrases are much more abstract than object names.
Weakly supervised text detector
Positive set: 350 Google Street View images that contain text
Negative set: outdoor scene images in the SUN dataset
Text is highlighted without bounding-box annotations.
Image detection
Results
SPPnet
Results
Fast R-CNN
Faster R-CNN
Results
R-CNN
Diagram labels: Image, Regions, Resize, Convolution Features, Classify
SPP net
Diagram labels: Image, Convolution Features, SPP, Regions, Classify
R-CNN vs. SPP net
Side-by-side diagrams: R-CNN, SPP net
Fast R-CNN
Diagram labels: Image, Convolution Features, Regions, RoI Pooling Layer, Class Label, Confidence
R-CNN vs. SPP net vs. Fast R-CNN
Side-by-side diagrams: R-CNN, SPP net, Fast R-CNN
Faster R-CNN
Diagram labels: Image, Fully Convolutional Features, Bounding Box Regression, BB Classification, Fast R-CNN
R-CNN vs. SPP net vs. Fast R-CNN
Side-by-side diagrams: R-CNN, SPP net, Fast R-CNN, Faster R-CNN
Results
RNN
Recurrent Neural Network
http://colah.github.io/posts/2015-08-Understanding-LSTMs/
LSTM comes in!
Long Short-Term Memory
This is just a standard RNN. This is the LSTM!
http://colah.github.io/posts/2015-08-Understanding-LSTMs/
Overall Architecture
Diagram labels: (Cell) State, Hidden State, Forget Gate, Input Gate, Output Gate, Next (Cell) State, Next Hidden State, Input, Output
Output = Hidden state
http://colah.github.io/posts/2015-08-Understanding-LSTMs/
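The parts named on this slide fit together as in the following single-step sketch. It assumes the standard LSTM equations as presented in the linked post; the packing of all four gates into one weight matrix `W`, the sizes, and the random weights are choices made for the example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM step. x: (D,), h_prev and c_prev: (H,), W: (4 * H, H + D), b: (4 * H,)."""
    H = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x]) + b
    f = sigmoid(z[0:H])            # forget gate: how much of the old cell state to keep
    i = sigmoid(z[H:2 * H])        # input gate: how much of the candidate to write
    o = sigmoid(z[2 * H:3 * H])    # output gate: how much of the cell state to expose
    g = np.tanh(z[3 * H:4 * H])    # candidate cell values
    c = f * c_prev + i * g         # next (cell) state
    h = o * np.tanh(c)             # next hidden state, which is also the output
    return h, c

rng = np.random.default_rng(0)
D, H = 3, 4                                    # hypothetical input and hidden sizes
W = rng.standard_normal((4 * H, H + D)) * 0.1
b = np.zeros(4 * H)
h, c = np.zeros(H), np.zeros(H)
for x in rng.standard_normal((5, D)):          # run over a toy sequence of 5 inputs
    h, c = lstm_step(x, h, c, W, b)
```

The cell state is only ever scaled by the forget gate and added to, which is what lets information pass along the sequence largely unchanged.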
The Core Idea
http://colah.github.io/posts/2015-08-Understanding-LSTMs/
Visual QnA
VQA: Dataset and Problem definition
VQA dataset - Example
Q: How many dogs are seen?
Q: What animal is this?
Q: What color is the car?
Q: What is the mustache made of?
Q: Is this vegetarian pizza?
Solving VQA
Approach: various methods have been proposed
[Malinowski et al., 2015]
[Ren et al., 2015]
[Andres et al., 2015]
[Ma et al., 2015]
[Jiang et al., 2015]
DPPnet: Motivation
Common pipeline of using deep learning for vision:
CNN trained on ImageNet
Switch the final layer and fine-tune for the new task
Observation: in VQA, the task is determined by a question.
DPPnet: Main Idea
Switching the parameters of a layer based on a question
Diagram labels: Question, Parameter Prediction Network, Dynamic Parameter Layer
DPPnet: Parameter Explosion
Number of parameters for the fc-layer (R)
Diagram labels: Question Feature, fc-layer, Predicted Parameter, Dynamic Parameter Layer; dimensions M, N, Q, P, including the dimension of the hidden state
For example, Q=1000, P=1000, M=500 gives R=500,000,000, which is 1.86GB for a single layer.
Number of parameters for VGG19: 144,000,000
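The numbers on this slide can be checked with a few lines of arithmetic. The formula for R is not preserved in the transcript, so the sketch assumes R = Q x P x M, which reproduces the stated 500,000,000; it also assumes 4-byte (32-bit float) parameters and binary gigabytes, which reproduces the stated 1.86GB.

```python
Q, P, M = 1000, 1000, 500        # example sizes from the slide

R = Q * P * M                    # assumed formula for the fc-layer parameter count
bytes_per_param = 4              # assumption: 32-bit floats
size_gib = R * bytes_per_param / 2 ** 30

vgg19_params = 144_000_000       # figure quoted on the slide
ratio = R / vgg19_params

print(f"R = {R:,} parameters")
print(f"size = {size_gib:.2f} GiB for a single layer")
print(f"about {ratio:.2f} times the parameter count quoted for VGG19")
```

Under these assumptions a single prediction layer would hold roughly three and a half times as many parameters as the VGG19 figure quoted on the slide.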
Solution: We can control N.
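DPPnet: Weight Sharing with Hashing Trick
Weights of the Dynamic Parameter Layer are picked from candidate weights by hashing [Chen et al., 2015].
Diagram labels: Question Feature, fc-layer, Candidate Weights, Hashing, Dynamic Parameter Layer
Example values in the diagram: candidate weights 0.1, 1.2, -0.7, 0.3, -0.2; dynamic parameter layer entries 0.1, 0.1, -0.2, -0.7, 1.2, -0.2, 0.1, -0.7, -0.7, 1.2, 0.3, -0.2, 0.3, 0.3, 0.1, 1.2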
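A minimal sketch of the idea: every position (i, j) of the large weight matrix is mapped by a fixed hash to one entry of a short candidate vector, so many positions share the same value. The CRC32-based hash, the function name, and the 4x4 layer size are assumptions for illustration and are not the hash function used in the paper; the five candidate values are the ones that appear in the slide's diagram.

```python
import zlib
import numpy as np

def hashed_weights(candidates, rows, cols, seed=0):
    """Build a (rows, cols) matrix whose entries are all picked from `candidates`."""
    W = np.empty((rows, cols))
    for i in range(rows):
        for j in range(cols):
            key = f"{seed}:{i}:{j}".encode()           # fixed, position-dependent key
            W[i, j] = candidates[zlib.crc32(key) % len(candidates)]
    return W

# In DPPnet the candidate weights would be predicted from the question;
# here they are hard-coded example values.
candidates = np.array([0.1, 1.2, -0.7, 0.3, -0.2])
W = hashed_weights(candidates, rows=4, cols=4)
```

Only the 5 candidate values need to be predicted, even though the layer has 16 weights; this is how the size of the predicted parameter vector is kept under control.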
DPPnet: Final Architecture
End-to-end fine-tuning is possible (fully differentiable).
DPPnet: Qualitative Results
Q: What is the boy holding? DPPnet: surfboard / DPPnet: bat
Q: What animal is shown? DPPnet: giraffe / DPPnet: elephant
Q: How does the woman feel? DPPnet: happy
Q: What type of hat is she wearing? DPPnet: cowboy
Q: How many cranes are in the image? DPPnet: 2 (3)
Q: How many people are on the bench? DPPnet: 2 (1)
How to combine image and question?
Multimodal Compact Bilinear Pooling
MCB without Attention
MCB with Attention
Results
Word2Vec
Word2vec?
Image Captioning
Image Captioning?
Overall Architecture
Language Model
Training phase
Test phase
Results
But not always...
Show, attend and tell
Results
Results (mistakes)
Neural Art
Preliminaries
Understanding Deep Image Representations by Inverting Them, CVPR 2015
Texture Synthesis Using Convolutional Neural Networks, NIPS 2015
A Neural Algorithm of Artistic Style
Texture Synthesis Using Convolutional Neural Networks, NIPS 2015
Leon A. Gatys, Alexander S. Ecker, Matthias Bethge
Texture?
Visual texture synthesis
Which one do you think is real? The right one is real. The goal of texture synthesis is to produce (arbitrarily many) new samples from an example texture.
Results of this work
The right ones are the given sources!
How?
Texture Model
Diagram labels: Input a, Input b, number of filters
Feature Correlations
Diagram labels: Input a, Input b, number of filters, Gram matrix
The feature maps are arranged as a (number of filters) x (W*H) matrix; the Gram matrix is (number of filters) x (number of filters).
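The shapes on this slide correspond to the following computation: flatten each filter response to a row of length W*H and take inner products between rows. The array sizes and the function name are assumptions for the example.

```python
import numpy as np

def gram_matrix(feature_maps):
    """feature_maps: (number of filters, H, W) -> (number of filters, number of filters)."""
    n_filters = feature_maps.shape[0]
    F = feature_maps.reshape(n_filters, -1)   # (number of filters, W*H)
    return F @ F.T                            # correlations between filter responses

rng = np.random.default_rng(0)
feature_maps = rng.random((6, 5, 5))          # 6 hypothetical filters on a 5x5 grid
G = gram_matrix(feature_maps)
```

Entry G[i, j] sums the product of filters i and j over all positions, so the spatial layout is discarded and only the co-occurrence of features remains, which is what makes it a texture descriptor.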
Texture Generation
Diagram labels: Input a, Input b
Element-wise squared loss
Total layer-wise loss function
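A hedged sketch of these two losses, assuming the normalisation used in the texture synthesis paper (a factor of 1 / (4 N^2 M^2) per layer, with N filters and M spatial positions) and a weighted sum over layers. The layer shapes, the weights, and the function names are invented for the example.

```python
import numpy as np

def gram(features):
    """features: (N, H, W) -> (N, N) Gram matrix."""
    F = features.reshape(features.shape[0], -1)
    return F @ F.T

def layer_loss(generated, source):
    """Element-wise squared difference between the Gram matrices of one layer."""
    n_filters = generated.shape[0]
    n_positions = generated.shape[1] * generated.shape[2]
    diff = gram(generated) - gram(source)
    return np.sum(diff ** 2) / (4.0 * n_filters ** 2 * n_positions ** 2)

def texture_loss(generated_layers, source_layers, layer_weights):
    """Total loss: weighted sum of the layer-wise losses."""
    return sum(w * layer_loss(g, s)
               for w, g, s in zip(layer_weights, generated_layers, source_layers))

rng = np.random.default_rng(0)
source_layers = [rng.random((4, 8, 8)), rng.random((8, 4, 4))]      # features of Input a
generated_layers = [rng.random((4, 8, 8)), rng.random((8, 4, 4))]   # features of Input b
loss = texture_loss(generated_layers, source_layers, layer_weights=[1.0, 1.0])
```

Texture generation then means changing the generated image by gradient descent until this loss is small; the random arrays here only stand in for real network activations.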
Results
Understanding Deep Image Representations by Inverting Them, CVPR 2015
Aravindh Mahendran, Andrea Vedaldi (VGG group)
Reconstruction from feature map
Diagram labels: Input a, Input b, number of filters
Let's make these features similar, by changing the input image!
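The idea of making the features similar by changing the input can be sketched with gradient descent on the input itself. To keep the example self-contained, a fixed random linear map `W` stands in for the CNN feature extractor; that substitution, the sizes, and the step-size rule are assumptions, and a real inversion would backpropagate through the network instead.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((32, 8))        # toy linear "network": features = W @ image

def features(image):
    return W @ image

target_image = rng.standard_normal(8)   # plays the role of Input a
target_features = features(target_image)

image = np.zeros(8)                      # Input b: start from a blank input
lr = 1.0 / np.linalg.norm(W, 2) ** 2     # step size from the largest singular value
for _ in range(500):
    residual = features(image) - target_features
    image -= lr * (W.T @ residual)       # gradient of 0.5 * ||residual||^2 w.r.t. the input

feature_error = np.linalg.norm(features(image) - target_features)
```

The weights of the feature extractor are never updated; only the input is optimised until its features match the target features.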
Receptive Field
A Neural Algorithm of Artistic Style
Leon A. Gatys, Alexander S. Ecker, Matthias Bethge
How?
Diagram labels: Style Image, Content Image, Mixed Image (Neural Art); Texture Synthesis Using Convolutional Neural Networks; Understanding Deep Image Representations by Inverting Them
Gram matrix
Neural Art
Diagram labels: Content, Style
Total loss = content loss + style loss
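A compact sketch of how the two terms combine. It assumes a squared feature difference for the content term and a Gram-matrix difference for the style term, as in the papers listed earlier; the weights `alpha` and `beta` default to 1 to match the slide's unweighted sum, and all arrays are random stand-ins for real network activations of a single layer.

```python
import numpy as np

def content_loss(generated, content):
    """Squared difference between feature maps of the mixed and the content image."""
    return 0.5 * np.sum((generated - content) ** 2)

def style_loss(generated, style):
    """Squared difference between Gram matrices of the mixed and the style image."""
    n_filters = generated.shape[0]
    n_positions = generated.shape[1] * generated.shape[2]
    G = generated.reshape(n_filters, -1)
    A = style.reshape(n_filters, -1)
    diff = G @ G.T - A @ A.T
    return np.sum(diff ** 2) / (4.0 * n_filters ** 2 * n_positions ** 2)

def total_loss(generated, content, style, alpha=1.0, beta=1.0):
    """Total loss = alpha * content loss + beta * style loss."""
    return alpha * content_loss(generated, content) + beta * style_loss(generated, style)

rng = np.random.default_rng(0)
content_feats = rng.random((4, 6, 6))
style_feats = rng.random((4, 6, 6))
mixed_feats = rng.random((4, 6, 6))
loss = total_loss(mixed_feats, content_feats, style_feats)
```

Raising `beta` relative to `alpha` would push the mixed image toward the style image's texture, while raising `alpha` preserves more of the content image's layout.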
Results