deep learning in computer vision

Introduction to Deep Learning

Presenter: Sungjoon Choi

([email protected])

Contents: Optimization methods, CNN basics, Semantic segmentation, Weakly supervised localization, Image detection, RNN, Visual QnA, Word2Vec, Image Captioning

What is deep learning?

Wikipedia says: Deep learning is a branch of machine learning based on a set of algorithms that attempt to model high-level abstractions in data by using multiple processing layers, with complex structures or otherwise, composed of multiple non-linear transformations.

Key terms: machine learning, high-level abstraction, network.

Is it brand new?

Neural Nets: McCulloch & Pitts, 1943
Perceptron: Rosenblatt, 1958
RNN: Grossberg, 1973
CNN: Fukushima, 1979
RBM: Hinton, 1999
DBN: Hinton, 2006
D-AE: Vincent, 2008
AlexNet: Krizhevsky, 2012
GoogLeNet: Szegedy, 2015

Deep architectures

Feed-Forward: multilayer neural nets, convolutional nets

Feed-Back: Stacked Sparse Coding, Deconvolutional Nets

Bi-Directional: Deep Boltzmann Machines, Stacked Auto-Encoders

Recurrent: Recurrent Nets, Long Short-Term Memory

CNN basics

CNN

CNNs are basically layers of convolutions followed by subsampling and fully connected layers. Intuitively speaking, the convolution and subsampling layers work as feature extraction layers, while the fully connected layer uses the extracted features to classify which category the current input belongs to.
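The pipeline described above can be sketched in a few lines of NumPy. This is a minimal illustration, not a trained network: the 8x8 input, the single random 3x3 filter, the ReLU non-linearity, and the 3-class fully connected layer are all assumptions chosen for brevity.

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2-D cross-correlation (what most frameworks call convolution)."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool(fmap, size=2):
    """Non-overlapping max pooling (subsampling); crops any ragged edge."""
    h, w = fmap.shape[0] // size, fmap.shape[1] // size
    return fmap[:h * size, :w * size].reshape(h, size, w, size).max(axis=(1, 3))

rng = np.random.default_rng(0)
image = rng.random((8, 8))              # toy single-channel input
kernel = rng.standard_normal((3, 3))    # one filter (random here, learned in practice)
W_fc = rng.standard_normal((3, 9))      # fully connected layer: 9 features -> 3 classes

# Feature extraction: convolution -> ReLU -> subsampling.
features = max_pool(np.maximum(conv2d(image, kernel), 0.0))
# Classification: the fully connected layer scores each category.
scores = W_fc @ features.ravel()
predicted_class = int(np.argmax(scores))
```

The 8x8 input shrinks to 6x6 after the 3x3 convolution and to 3x3 after 2x2 pooling, so the fully connected layer sees 9 features.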

Optimization methods

Gradient descent?

There are three variants of gradient descent. They differ in how much data we use to compute the gradient, so we make a trade-off between the accuracy of the gradient and the computing time.

Batch gradient descent

In batch gradient descent, we use the entire training dataset to compute the gradient.

Stochastic gradient descent

In stochastic gradient descent (SGD), the gradient is computed from each training sample, one by one.

Mini-batch gradient descent

In mini-batch gradient descent, we take the best of both worlds and compute the gradient from a small batch of training samples. Common mini-batch sizes range between 50 and 256 (but can vary).
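The three variants differ only in the batch size, which a small sketch can make concrete. The example below assumes a noise-free linear regression problem with a squared-error loss; the function name `gradient_descent`, the learning rates, and the epoch count are illustrative choices, not values from the slides.

```python
import numpy as np

def gradient_descent(X, y, batch_size, lr=0.1, epochs=200, seed=0):
    """Least-squares linear regression trained with (mini-)batch gradient descent."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        order = rng.permutation(n)                 # reshuffle every epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            residual = X[idx] @ w - y[idx]
            grad = 2.0 * X[idx].T @ residual / len(idx)   # gradient of the mean squared error
            w -= lr * grad
    return w

rng = np.random.default_rng(1)
X = rng.standard_normal((64, 2))
true_w = np.array([2.0, -3.0])
y = X @ true_w                                     # noise-free targets

w_batch = gradient_descent(X, y, batch_size=64)          # batch: all 64 samples per update
w_sgd = gradient_descent(X, y, batch_size=1, lr=0.01)    # SGD: one sample per update
w_mini = gradient_descent(X, y, batch_size=16)           # mini-batch: 16 samples per update
```

Batch gradient descent makes one exact update per epoch, SGD makes 64 noisy ones, and the mini-batch run sits in between with 4 updates per epoch.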

Challenges

Choosing a proper learning rate is cumbersome.
Learning rate schedule
Avoiding getting trapped in suboptimal local minima

Momentum

Nesterov accelerated gradient
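The momentum and Nesterov updates can be sketched as follows, assuming the commonly used formulation in which a velocity term accumulates past gradients and Nesterov evaluates the gradient at the look-ahead position. The toy objective J(theta) = theta**2 and the hyperparameters are assumptions for illustration only.

```python
def grad(theta):
    """Gradient of the toy objective J(theta) = theta**2."""
    return 2.0 * theta

def momentum(theta, lr=0.1, gamma=0.9, steps=200):
    """Momentum: v = gamma * v + lr * grad(theta); theta = theta - v."""
    v = 0.0
    for _ in range(steps):
        v = gamma * v + lr * grad(theta)
        theta -= v
    return theta

def nesterov(theta, lr=0.1, gamma=0.9, steps=200):
    """Nesterov: same as momentum, but the gradient is taken at theta - gamma * v."""
    v = 0.0
    for _ in range(steps):
        v = gamma * v + lr * grad(theta - gamma * v)   # look-ahead gradient
        theta -= v
    return theta

theta_momentum = momentum(5.0)
theta_nesterov = nesterov(5.0)
```

Both runs approach the minimum at 0; on this toy problem the look-ahead gradient damps the oscillation that plain momentum shows.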

Adagrad

It adapts the learning rate to the parameters, performing larger updates for infrequent parameters and smaller updates for frequent ones.
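A minimal sketch of this behavior, assuming the usual Adagrad rule that divides the learning rate by the square root of the accumulated squared gradients. The two-parameter setup (one parameter updated every step, one every tenth step) is a made-up example.

```python
import numpy as np

def adagrad_step(theta, grad, accum, lr=0.1, eps=1e-8):
    """One Adagrad update; accum holds the per-parameter sum of squared gradients."""
    accum = accum + grad ** 2
    theta = theta - lr * grad / (np.sqrt(accum) + eps)
    return theta, accum

theta = np.zeros(2)
accum = np.zeros(2)
for t in range(100):
    # Parameter 0 receives a gradient every step (frequent),
    # parameter 1 only every 10th step (infrequent).
    grad = np.array([1.0, 1.0 if t % 10 == 0 else 0.0])
    theta, accum = adagrad_step(theta, grad, accum)

# Step size each parameter would get for its next gradient.
effective_lr = 0.1 / (np.sqrt(accum) + 1e-8)
```

After 100 steps the infrequent parameter has accumulated far less squared gradient, so its effective learning rate is larger than that of the frequent one.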

Adadelta

Adadelta is an extension of Adagrad that seeks to reduce Adagrad's monotonically decreasing learning rate. No learning rate needs to be set!

Exponential moving average

RMSprop

RMSprop is an unpublished, adaptive learning rate method proposed by Geoff Hinton in his lecture.
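RMSprop can be sketched as Adagrad with the running sum replaced by an exponential moving average of squared gradients. The version below assumes the commonly quoted decay of 0.9; the learning rate, step count, and toy objective J(theta) = theta**2 are illustrative assumptions.

```python
import math

def rmsprop(theta, grad_fn, lr=0.01, decay=0.9, eps=1e-8, steps=1000):
    """Scale each step by the root of an exponential moving average of g**2."""
    avg_sq = 0.0
    for _ in range(steps):
        g = grad_fn(theta)
        avg_sq = decay * avg_sq + (1.0 - decay) * g * g   # exponential moving average
        theta -= lr * g / (math.sqrt(avg_sq) + eps)
    return theta

theta_final = rmsprop(1.0, lambda th: 2.0 * th)   # minimize J(theta) = theta**2
```

Because the average forgets old gradients, the step size does not shrink monotonically as it does in Adagrad; on this toy problem the iterate settles into a small neighborhood of the minimum.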

Adam

Adaptive Moment Estimation (Adam) stores exponentially decaying averages of both past gradients and past squared gradients.

Average of past gradients: momentum

Average of past squared gradients: running average of gradient squares
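The two stored averages can be seen directly in a short sketch. It assumes the standard Adam update with bias correction and the usual defaults beta1 = 0.9 and beta2 = 0.999; the learning rate, step count, and toy objective J(theta) = (theta - 3)**2 are illustrative assumptions.

```python
import math

def adam(theta, grad_fn, lr=0.05, beta1=0.9, beta2=0.999, eps=1e-8, steps=1000):
    m = 0.0   # decaying average of past gradients (the momentum part)
    v = 0.0   # decaying average of past squared gradients
    for t in range(1, steps + 1):
        g = grad_fn(theta)
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g * g
        m_hat = m / (1.0 - beta1 ** t)   # bias correction: both averages start at 0
        v_hat = v / (1.0 - beta2 ** t)
        theta -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta

theta_final = adam(0.0, lambda th: 2.0 * (th - 3.0))   # minimum at theta = 3
```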

Visualization

Semantic segmentation

Semantic Segmentation?

Image Classification: lion, dog, giraffe

Object Detection: bicycle, person, ball, dog

Semantic Segmentation: person (x5), bicycle (x2)

Results

Weakly supervised localization

Weakly Supervised Object Localization

Supervised learning of localization usually relies on bounding box annotations.

What if localization is possible from image labels alone, without bounding box annotations?

Today's seminar: Learning Deep Features for Discriminative Localization (arXiv 1512.04150v1), Zhou et al. 2015, CVPR 2016

Architecture

AlexNet + GAP + Places205

Input: 227x227x3
Last convolutional feature maps: 11x11x512
11x11 Avg Pooling: Global Average Pooling (GAP)
Output layer: 512 -> 205
Example prediction: Living room

Class activation map (CAM)

Identify important image regions by projecting the weights of the output layer back onto the convolutional feature maps. CAMs can be generated for each class in a single image, and the highlighted regions differ from category to category within the same image.

Example categories: palace, dome, church
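The projection can be sketched as a weighted sum of the feature maps, using the output-layer weights of the chosen class. The sketch assumes an output layer without bias applied directly to the GAP vector; the shapes (512 maps of 11x11, 205 classes) follow the architecture slide, while the random values and the name `class_activation_map` are placeholders.

```python
import numpy as np

def class_activation_map(feature_maps, fc_weights, class_idx):
    """feature_maps: (K, H, W); fc_weights: (num_classes, K). Returns an (H, W) map."""
    # Weighted sum over the K feature maps with the weights of one class.
    return np.tensordot(fc_weights[class_idx], feature_maps, axes=1)

rng = np.random.default_rng(0)
feature_maps = rng.random((512, 11, 11))         # last convolutional layer (placeholder values)
fc_weights = rng.standard_normal((205, 512))     # output layer on top of GAP (placeholder values)

gap = feature_maps.mean(axis=(1, 2))             # global average pooling -> 512 values
scores = fc_weights @ gap                        # class scores
class_idx = int(np.argmax(scores))
cam = class_activation_map(feature_maps, fc_weights, class_idx)
```

Under these assumptions the class score equals the spatial average of its CAM, which is why high CAM values mark the regions that drive the prediction. In practice the 11x11 map is upsampled to the input size for display.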

Results

CAM on top 5 predictions on an image
CAM for one object class in images

GAP vs. GMP

Oquab et al., CVPR 2015, "Is object localization for free? Weakly-supervised learning with convolutional neural networks", use global max pooling (GMP).

Intuitive difference between GMP and GAP?
The GAP loss encourages the network to identify the full extent of an object.
The GMP loss encourages it to identify just one discriminative part.
GAP: the average of a map is maximized by finding all discriminative parts of an object; if the activations are all low, the output of that map drops.
GMP: low scores for all image regions except the most discriminative part do not affect the score, because only the maximum is taken.
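A toy example makes the difference concrete. The two 3x3 activation maps below are invented: one responds over the whole object, the other only at its single most discriminative location.

```python
import numpy as np

# Activation map that fires over the full extent of the object.
full_extent = np.array([[0.0, 0.9, 0.9],
                        [0.0, 1.0, 0.9],
                        [0.0, 0.9, 0.9]])
# Activation map that fires only at the most discriminative part.
single_part = np.array([[0.0, 0.0, 0.0],
                        [0.0, 1.0, 0.0],
                        [0.0, 0.0, 0.0]])

gmp_full, gmp_part = full_extent.max(), single_part.max()     # global max pooling
gap_full, gap_part = full_extent.mean(), single_part.mean()   # global average pooling
```

GMP gives both maps the same score, so the loss has no reason to prefer the complete map, whereas GAP scores the full-extent map several times higher.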

GAP & GMP

GAP (upper) vs GMP (lower)
GAP outperforms GMP.
GAP highlights more complete object regions and less background noise.
The loss for average pooling benefits when the network identifies all discriminative regions of an object.

Concept localization

Concept localization in weakly labeled images
Positive set: short phrase in text caption
Negative set: randomly selected images
The model catches the concept, even though phrases are much more abstract than object names.

Weakly supervised text detector

Positive set: 350 Google Street View images that contain text.
Negative set: outdoor scene images in the SUN dataset
Text is highlighted without bounding box annotations.

Image detection

Results

SPPnet

Results

Fast R-CNN

Faster R-CNN

Results

R-CNN

Image -> Regions -> Resize -> Convolution Features -> Classify

SPP net

Image -> Convolution Features -> SPP (Regions) -> Classify

R-CNN vs. SPP net

Fast R-CNN

Image -> Convolution Features -> Regions -> RoI Pooling Layer -> Class Label, Confidence (per region)

R-CNN vs. SPP net vs. Fast R-CNN

Faster R-CNN

Image -> Fully Convolutional Features -> Bounding Box Regression, BB Classification -> Fast R-CNN

R-CNN vs. SPP net vs. Fast R-CNN vs. Faster R-CNN

Results

RNN

Recurrent Neural Network

http://colah.github.io/posts/2015-08-Understanding-LSTMs/

LSTM comes in!

Long Short-Term Memory

[Figure: This is just a standard RNN.]

[Figure: This is the LSTM!]


Overall Architecture

Inputs: Input, (Cell) State, Hidden State
Gates: Forget Gate, Input Gate, Output Gate
Outputs: Next (Cell) State, Next Hidden State, Output
Output = Hidden state

http://colah.github.io/posts/2015-08-Understanding-LSTMs/
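The roles of the three gates and the two states can be sketched as a single LSTM step. This assumes the standard formulation described in the linked post; stacking the four weight blocks into one matrix `W`, the gate ordering, and the toy sizes are implementation choices made here for brevity.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM step. W: (4*H, H + D), b: (4*H,); blocks ordered forget, input, output, candidate."""
    H = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x]) + b
    f = sigmoid(z[0:H])             # forget gate: how much of the old cell state to keep
    i = sigmoid(z[H:2 * H])         # input gate: how much of the candidate to write
    o = sigmoid(z[2 * H:3 * H])     # output gate: how much of the cell state to expose
    g = np.tanh(z[3 * H:4 * H])     # candidate cell values
    c = f * c_prev + i * g          # next (cell) state
    h = o * np.tanh(c)              # next hidden state, which is also the output
    return h, c

D, H = 3, 4                         # toy input and hidden sizes
rng = np.random.default_rng(0)
W = 0.1 * rng.standard_normal((4 * H, H + D))
b = np.zeros(4 * H)

h, c = np.zeros(H), np.zeros(H)
for x in rng.standard_normal((5, D)):   # run over a sequence of 5 inputs
    h, c = lstm_step(x, h, c, W, b)
```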

The Core Idea

http://colah.github.io/posts/2015-08-Understanding-LSTMs/

Visual QnA

VQA: Dataset and Problem definition

VQA dataset - Example

Q: How many dogs are seen?

Q: What animal is this?

Q: What color is the car?

Q: What is the mustache made of?

Q: Is this vegetarian pizza?

Solving VQA

Approach

[Malinowski et al., 2015]

[Ren et al., 2015]

[Andres et al., 2015]

[Ma et al., 2015]

[Jiang et al., 2015]

Various methods have been proposed

DPPnet: Motivation

Common pipeline of using deep learning for vision:
CNN trained on ImageNet
Switch the final layer and fine-tune for the new task

Observation: in VQA, the task is determined by a question.

DPPnet: Main Idea

Switching the parameters of a layer based on a question

Question -> Parameter Prediction Network -> Dynamic Parameter Layer

DPPnet: Parameter Explosion

Number of parameters for the fc-layer: R = Q x P x M
Q x P: number of predicted parameters (N) of the Dynamic Parameter Layer, computed from the question feature
M: dimension of the hidden state fed to the fc-layer

For example: Q=1000, P=1000, M=500 gives R=500,000,000
1.86GB for a single layer
Number of parameters for VGG19: 144,000,000
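The figures on this slide can be checked with a few lines of arithmetic. The sketch assumes R = Q x P x M, which is the reading that reproduces R = 500,000,000, and 4 bytes per parameter (32-bit floats), which reproduces the 1.86GB figure.

```python
Q, P, M = 1000, 1000, 500       # values from the slide
N = Q * P                       # parameters of the dynamic parameter layer
R = M * N                       # parameters of the fc-layer that predicts them

BYTES_PER_PARAM = 4             # assumption: 32-bit floats
size_gib = R * BYTES_PER_PARAM / 2 ** 30

vgg19_params = 144_000_000      # figure quoted on the slide
ratio = R / vgg19_params        # how many VGG19s fit in this one layer

print(R)                        # 500000000
print(round(size_gib, 2))       # 1.86
```

A single prediction layer would thus hold more than three times as many parameters as all of VGG19.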

DPPnet: Parameter Explosion

Number of parameters for the fc-layer: R = M x N
N: number of predicted parameters
M: dimension of the hidden state fed to the fc-layer

Solution: we can control N

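DPPnet: Weight Sharing with Hashing Trick

Weights of the Dynamic Parameter Layer are picked from candidate weights by hashing [Chen et al., 2015].

Question Feature -> fc-layer -> Candidate Weights -> Hashing -> Dynamic Parameter Layer

Candidate weights (figure): 0.1, 1.2, -0.7, 0.3, -0.2
Dynamic Parameter Layer weights (figure): 0.1, 0.1, -0.2, -0.7, 1.2, -0.2, 0.1, -0.7, -0.7, 1.2, 0.3, -0.2, 0.3, 0.3, 0.1, 1.2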
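The weight sharing can be sketched as follows: a fixed hash maps every position of the layer's weight matrix to one of a small number of candidate weights, so only the candidates need to be predicted. The hash function below is a toy stand-in, not the one used in the paper, and the five candidate values are the first five numbers from the slide's figure, used here only as example data.

```python
import numpy as np

def hash_index(i, j, num_candidates):
    """Toy deterministic hash of a weight position (hypothetical, for illustration only)."""
    return ((i * 73856093) ^ (j * 19349663)) % num_candidates

def build_weights(candidates, rows, cols):
    """Fill a rows x cols weight matrix by looking up a hashed candidate for each position."""
    idx = np.array([[hash_index(i, j, len(candidates)) for j in range(cols)]
                    for i in range(rows)])
    return candidates[idx]

candidates = np.array([0.1, 1.2, -0.7, 0.3, -0.2])   # example candidate weights
W = build_weights(candidates, rows=4, cols=4)        # 16 weights built from 5 candidates
```

Since the hash is fixed, the mapping from candidates to weights is a plain lookup, and gradients with respect to a candidate are the sum of the gradients of all positions that share it.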

DPPnet: Final Architecture

End-to-End Fine-tuning is possible (Fully-differentiable)

DPPnet: Qualitative Results

Q: What is the boy holding?
DPPnet: surfboard
DPPnet: bat

Q: What animal is shown?
DPPnet: giraffe
DPPnet: elephant

Q: How does the woman feel?
DPPnet: happy
Q: What type of hat is she wearing?
DPPnet: cowboy

Q: How many cranes are in the image?
DPPnet: 2 (3)
Q: How many people are on the bench?
DPPnet: 2 (1)

How to combine image and question?

Multimodal Compact Bilinear Pooling

MCB without Attention

MCB with Attention

Results

Word2Vec

Word2vec?

Image Captioning

Image Captioning?

Overall Architecture

Language Model

Training phase

Test phase

Results

But not always..

Show, attend and tell

Results

Results (mistakes)

Neural Art

Preliminaries

Understanding Deep Image Representations by Inverting Them, CVPR 2015
Texture Synthesis Using Convolutional Neural Networks, NIPS 2015

A Neural Algorithm of Artistic Style

Texture Synthesis Using Convolutional Neural Networks, NIPS 2015
Leon A. Gatys, Alexander S. Ecker, Matthias Bethge

Texture?

Visual texture synthesis

Which one do you think is real? The right one is real. The goal of texture synthesis is to produce (arbitrarily many) new samples from an example texture.

Results of this work

The right ones are the given sources!

How?

Texture Model

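Input a, Input b: feature maps with (number of filters) channels

Feature Correlations

The feature maps of each input are flattened into a matrix of size (number of filters) x (W*H). The feature correlations are given by the Gram matrix, of size (number of filters) x (number of filters).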
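The Gram matrix of one layer can be computed in two lines. The sketch assumes the feature maps are stored as an array of shape (number of filters, H, W); the sizes and random values are placeholders.

```python
import numpy as np

def gram_matrix(features):
    """features: (num_filters, H, W) -> Gram matrix of shape (num_filters, num_filters)."""
    n, h, w = features.shape
    F = features.reshape(n, h * w)   # each row is one flattened feature map (length W*H)
    return F @ F.T                   # entry (i, j): inner product of feature maps i and j

rng = np.random.default_rng(0)
features = rng.random((8, 5, 5))     # toy layer: 8 filters, 5x5 maps
G = gram_matrix(features)
```

The spatial dimensions are summed out, so the Gram matrix records which filters respond together but not where, which is what makes it a texture descriptor.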

Texture Generation

Input a, Input b

Element-wise squared loss
Total layer-wise loss function
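The two loss terms named here can be sketched as follows, assuming the formulation of the texture synthesis paper: per layer, the squared difference between the Gram matrices of the source texture and the generated image, normalized by 4 N^2 M^2 (N filters, M = W*H), then a weighted sum over layers. Function names, layer sizes, and the layer weights are illustrative.

```python
import numpy as np

def gram_matrix(features):
    """features: (num_filters, H, W) -> (num_filters, num_filters)."""
    n, h, w = features.shape
    F = features.reshape(n, h * w)
    return F @ F.T

def layer_loss(gen_features, src_features):
    """Element-wise squared loss between the two Gram matrices of one layer."""
    n, h, w = gen_features.shape
    m = h * w
    diff = gram_matrix(gen_features) - gram_matrix(src_features)
    return np.sum(diff ** 2) / (4.0 * n ** 2 * m ** 2)

def texture_loss(gen_layers, src_layers, weights):
    """Total loss: weighted sum of the layer-wise losses."""
    return sum(w * layer_loss(g, s) for w, g, s in zip(weights, gen_layers, src_layers))

rng = np.random.default_rng(0)
src_layers = [rng.random((4, 8, 8)), rng.random((8, 4, 4))]   # features of the source texture
gen_layers = [rng.random((4, 8, 8)), rng.random((8, 4, 4))]   # features of the generated image
weights = [1.0, 1.0]                                          # assumed equal layer weights

loss = texture_loss(gen_layers, src_layers, weights)
```

Texture generation then starts from a noise image and changes its pixels by gradient descent on this loss, which requires a framework with automatic differentiation and is outside this sketch.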

Results

Understanding Deep Image Representations by Inverting Them, CVPR 2015
Aravindh Mahendran, Andrea Vedaldi (VGG group)

Reconstruction from feature map

Input a, Input b, number of filters
Let's make these features similar by changing the input image!

Receptive Field

A Neural Algorithm of Artistic Style
Leon A. Gatys, Alexander S. Ecker, Matthias Bethge

How?

Style Image + Content Image -> Mixed Image (Neural Art)
Style: Texture Synthesis Using Convolutional Neural Networks
Content: Understanding Deep Image Representations by Inverting Them

Gram matrix

Neural Art

Content, Style
Total loss = content loss + style loss
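The combined objective can be sketched for a single layer. The sketch assumes a squared feature difference for the content loss and a Gram matrix difference for the style loss; the weights `alpha` and `beta` are an addition beyond the slide's unweighted sum (their defaults of 1.0 reproduce it), and all three feature arrays are assumed to share one shape.

```python
import numpy as np

def content_loss(gen_features, content_features):
    """Squared difference between the feature maps themselves."""
    return 0.5 * np.sum((gen_features - content_features) ** 2)

def style_loss(gen_features, style_features):
    """Squared difference between the Gram matrices of the feature maps."""
    n, h, w = gen_features.shape
    G = gen_features.reshape(n, -1)
    A = style_features.reshape(n, -1)
    diff = G @ G.T - A @ A.T
    return np.sum(diff ** 2) / (4.0 * n ** 2 * (h * w) ** 2)

def total_loss(gen, content, style, alpha=1.0, beta=1.0):
    """Total loss = alpha * content loss + beta * style loss."""
    return alpha * content_loss(gen, content) + beta * style_loss(gen, style)

rng = np.random.default_rng(0)
content = rng.random((4, 6, 6))   # features of the content image (placeholder values)
style = rng.random((4, 6, 6))     # features of the style image (placeholder values)
mixed = rng.random((4, 6, 6))     # features of the image being optimized

loss = total_loss(mixed, content, style)
```

Raising `alpha` relative to `beta` keeps the mixed image closer to the content image, and the reverse pushes it toward the style.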

Results
