Computer Vision Lab Seminar (Deep Learning) - Yong Hoon Kwon


TRANSCRIPT

Page 1

Deep Learning: Basic Theory and Other Applications

Yong Hoon Kwon (권용훈)

Page 2

Table of Contents

1. Representation Learning

2. Background

3. Concepts and Principles

4. Applications

Page 3

Page 4

Page 5

Trends in Pattern Recognition

• Upheaval in pattern recognition due to deep learning

Page 6

1. Representation Learning

Page 7

Representation Learning

The computer understands information by itself.

Car

Page 8

Representation Learning

The computer understands information by itself.

Car

HOW?

Page 9

Representation Learning

V: Visible variable, observable in the real world
H: Hidden variable, not observable in the real world

Page 10

Representation Learning

All: everything can be expressed by hidden variables.

Page 11

Representation Learning

[Diagram: visible variables (V) and hidden variables (H) inside the set of everything ("All")]

Page 12

Page 13

Page 14

Representation Learning

[Diagram sequence: hidden variables (H) are connected to progressively more visible variables (V)]

Page 15

Representation Learning

Connect a number of visible variables (V) and hidden variables (H).

[Diagram: visible nodes connected to hidden nodes]

Page 16

Representation Learning

These connections form the structural relationships that make up something.

[Diagram: visible nodes connected to hidden nodes]

Page 17

Representation Learning

V -> H -> X (something)
• Expression
• Summary
• Encoding
• Abstraction

[Diagram: ten V nodes feeding five H nodes, which feed a single X]

Page 18

Representation Learning

Each hidden variable is connected to all of the visible variables.

[Diagram: V nodes fully connected to H nodes; below, v nodes fully connected to a single h node leading to X]

Page 19

Representation Learning

Single layer

[Diagram: visible nodes (v) connected to one layer of hidden nodes (h), which produce X]

Page 20

Representation Learning

Multi layer

[Diagram: visible nodes (v) connected to several stacked layers of hidden nodes (h), which produce X]

Page 21

Representation Learning

Intuitive interpretation of multiple layers

[Diagram: visible nodes, two hidden layers, and X]

Page 22

Representation Learning

Intuitive interpretation of multiple layers: each hidden layer is a further abstraction of the layer below.

[Diagram: visible nodes, two hidden layers, and X, with each step labeled "abstraction"]

Page 23

Representation Learning

Intuitive interpretation of multiple layers: each hidden layer is a further abstraction of the layer below.

[Diagram repeated from the previous slide]

Page 24

2. Background

Page 25

Neural networks history

• Deep learning is all about deep neural networks
• 1949: Hebbian learning (Donald Hebb, the father of neural networks)
• 1958: (single-layer) Perceptron (Frank Rosenblatt)
• 1969: Marvin Minsky points out the limitations of the single-layer perceptron
• 1986: Multilayer perceptron / backpropagation (David Rumelhart, Geoffrey Hinton, and Ronald Williams)
• 2006: Deep neural networks (Geoffrey Hinton and Ruslan Salakhutdinov)

Page 26

Why neural networks?

• Weaknesses of kernel machines (SVMs, etc.):
  • They do not scale well with sample size.
  • They are based on matching local templates: the training data is referenced for each test sample.
  • Local representation vs. distributed representation
• Historical progression: neural networks -> kernel machines -> deep neural networks

Page 27

Artificial Neural Networks (ANNs)

[Figure: neuron and synapse in the brain vs. an artificial neural network]

• ANNs are computational models inspired by the brain
  • Processing units (nodes vs. neurons)
  • Connections (weights vs. synapses)

Page 28

Artificial Neural Network (ANN)

[Diagram: inputs x_1, x_2, x_3, ..., x_n with weights w_1, w_2, w_3, ..., w_n and a bias feeding a single output node]

Page 29

Artificial Neural Network (ANN)

[Diagram: inputs x_1, ..., x_n, weights w_1, ..., w_n, bias b, and output y]

$z = \sum_{i=1}^{n} w_i x_i + b, \quad y = H(z)$

where $H$ is the activation function and $b$ the bias.
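To make the forward computation above concrete, here is a minimal sketch of a single artificial neuron in Python with NumPy; the function names and the choice of a logistic sigmoid for the activation H are illustrative assumptions, not something specified by the slides.

```python
import numpy as np

def sigmoid(z):
    # Logistic activation H(z) = 1 / (1 + exp(-z))
    return 1.0 / (1.0 + np.exp(-z))

def neuron_forward(x, w, b):
    # z = sum_i w_i * x_i + b ; y = H(z)
    z = np.dot(w, x) + b
    return sigmoid(z)

x = np.array([0.5, -1.0, 2.0])   # inputs x_1..x_n
w = np.array([0.1, 0.4, -0.3])   # weights w_1..w_n
b = 0.2                          # bias
print(neuron_forward(x, w, b))
```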

Page 30

Page 31

Page 32

Deep Neural Network

[Diagram: an input layer, several hidden layers, and an output layer of a deep neural network]

Page 33

Training a Deep Neural Network

Iteratively update W along the error gradient -> gradient descent

[Diagram: network with input X, output y, target t, and weights w_ij^(k) on each layer]

Given a training set {(x, t)}, find W that minimizes the error between the network output y and the target t.

Page 34

Gradient Descent

[Figure from http://darkpgmr.tistory.com/133]

Gradient ascent vs. gradient descent: both find a local optimum, not necessarily the global optimum.
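As a concrete illustration of the update rule, a minimal gradient-descent sketch in Python; the quadratic loss, learning rate, and step count are illustrative assumptions.

```python
import numpy as np

def grad_descent(grad, w0, lr=0.1, steps=100):
    # Iteratively move against the gradient: w <- w - lr * dE/dw
    w = np.array(w0, dtype=float)
    for _ in range(steps):
        w -= lr * grad(w)
    return w

# Example: minimize E(w) = (w[0] - 3)^2 + (w[1] + 1)^2
grad = lambda w: np.array([2 * (w[0] - 3), 2 * (w[1] + 1)])
print(grad_descent(grad, [0.0, 0.0]))   # converges toward [3, -1]
```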

Page 35

Backpropagation

[Diagram: network with input X, output y, target t, and weights w_ij^(k)]

• Using the chain rule, propagate error derivatives backwards to compute each node's contribution to the error:
  $\delta_i^{(k)} = \Big(\sum_j w_{ij}^{(k)} \, \delta_j^{(k+1)}\Big) \, \sigma'\big(z_i^{(k)}\big)$
• Compute the error derivative of each weight using $\partial E / \partial w_{ij}^{(k)} = a_i^{(k)} \, \delta_j^{(k+1)}$
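A minimal end-to-end backpropagation sketch for a one-hidden-layer network trained on XOR with a squared-error loss; the hidden size, learning rate, iteration count, and random seed are illustrative assumptions, and convergence may require more iterations for other seeds.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: learn XOR with one hidden layer
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[0], [1], [1], [0]], dtype=float)

W1, b1 = rng.normal(0, 1, (2, 4)), np.zeros(4)   # input -> hidden
W2, b2 = rng.normal(0, 1, (4, 1)), np.zeros(1)   # hidden -> output
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

for _ in range(5000):
    # Forward pass
    h = sigmoid(X @ W1 + b1)
    y = sigmoid(h @ W2 + b2)
    # Backward pass (chain rule): deltas for the squared-error loss
    d_out = (y - T) * y * (1 - y)            # delta at the output layer
    d_hid = (d_out @ W2.T) * h * (1 - h)     # delta at the hidden layer
    # Gradient-descent weight updates
    W2 -= 0.5 * h.T @ d_out;  b2 -= 0.5 * d_out.sum(0)
    W1 -= 0.5 * X.T @ d_hid;  b1 -= 0.5 * d_hid.sum(0)

print(np.round(y, 2))   # should approach [[0], [1], [1], [0]]
```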

Page 36

3. Concepts and Principles

Page 37

Paradigm shift in pattern recognition

Shallow learning:
• Feature extraction by domain experts (SIFT, SURF, ORB, ...)
• Separate modules (feature extractor + trainable classifier)

Deep learning:
• Automatic feature extraction from data
• Unified model: end-to-end learning (trainable features + trainable classifier)

Page 40

Why deep?

• The human brain has at least 5 to 10 layers for visual processing
• A "hierarchical model" is necessary for human-level intelligence

Page 41

What good comes from "deep"? "Deep" means more layers.

• The representation gets more hierarchical and abstract.
• It increases the model complexity, which can lead to higher accuracy.

[Diagram: a shallow network mapping x_1, x_2 to y_1, y_2 through a single set of weights w_1, w_2]

Page 42

What good comes from "deep"? "Deep" means more layers.

• The representation gets more hierarchical and abstract.
• It increases the model complexity, which can lead to higher accuracy.

[Diagram: a deep network mapping x_1, x_2 to y_1, y_2 through hidden layers h^(1), h^(2), h^(3) with weights w^(1), ..., w^(4)]

Page 43

Pre-training

• Backpropagation may not work well with a deep network
  • Vanishing gradient problem: the backward error information vanishes
  • Lower layers may not learn much about the task
• Good initialization is crucial -> pre-training

Page 44

Deep Learning: So Why Now?

• Neural networks have been around since the 60's, but deep NNs were difficult to train, due to:
  • Lack of datasets large enough to train them
  • Lack of computing power
  • Lack of efficient training algorithms and techniques
• Now we have all of the above:
  • Readily available large-scale datasets
  • GPUs, multicore/cluster systems
  • DBN [Hinton 06], ReLU (rectified linear unit), dropout, ...
• Still, more thorough theoretical analysis is needed to understand why it works well (or not)

Page 45

Deep Belief Networks (DBNs)

• Probabilistic generative model with supervised fine-tuning
• Generative fine-tuning: up-down algorithm
• Discriminative fine-tuning: backpropagation

Page 46

Updating Weights

• How much to update?
  • Learning rate ($\eta$): $\Delta w = -\eta \, \partial E / \partial w$
  • Fixed or adaptive
  • Common recipe: reduce the learning rate when the validation error stops decreasing

[Plot: error vs. epoch, with drops where the learning rate is reduced]

Page 47

Updating Weights

• How much to update?
  • Learning rate ($\eta$): fixed or adaptive
  • Common recipe: reduce the learning rate when the validation error stops decreasing
  • Momentum ($v$): forces gradient descent to keep moving in the previous direction

Page 48

Updating Weights

• How often to update?
  • After every training sample (online learning)
  • After iterating over the entire training set (full batch)
  • After some training samples (mini-batch)
• Stochastic gradient descent
  • Faster convergence than full batch
  • Efficient computation on GPUs
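A minimal sketch of mini-batch stochastic gradient descent with momentum, tying the last three slides together; the linear-regression loss, batch size, and hyperparameters are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))                                     # toy inputs
t = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=1000)   # toy targets

w = np.zeros(3)
v = np.zeros(3)                                # momentum buffer
lr, mu, batch = 0.05, 0.9, 32

for epoch in range(20):
    idx = rng.permutation(len(X))              # shuffle each epoch
    for s in range(0, len(X), batch):
        b = idx[s:s + batch]
        y = X[b] @ w
        grad = X[b].T @ (y - t[b]) / len(b)    # dE/dw for the squared error
        v = mu * v - lr * grad                 # momentum update
        w = w + v
print(w)   # close to [1.0, -2.0, 0.5]
```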

Page 49

Regularization

• Ways to avoid overfitting:
  • Weight decay
  • Weight sharing (CNN)
  • Early stopping
  • Model averaging (various models)
  • Dropout (more on this later)
  • Pre-training (good initialization)
  • Adding noise to training data

Page 50

Dropout

• Consider a neural net with one hidden layer
• Each time we present a training example, randomly omit each hidden unit with probability 0.5
• This randomly samples from different architectures; all architectures share weights
• An efficient way to average many large neural nets

[Diagram: units kept when the random value > 0.5, dropped when the random value < 0.5]
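A minimal sketch of applying dropout to a hidden activation during training, using inverted scaling at training time so the test-time forward pass is unchanged; the 0.5 keep probability matches the slide, everything else (names, inverted-dropout variant) is an illustrative assumption.

```python
import numpy as np

def dropout(h, p_drop=0.5, train=True, rng=np.random.default_rng()):
    # Randomly zero each hidden unit with probability p_drop during training.
    if not train:
        return h                      # at test time, use all units
    mask = rng.random(h.shape) >= p_drop
    return h * mask / (1.0 - p_drop)  # inverted scaling keeps the expected value

h = np.array([0.2, 0.9, 0.4, 0.7, 0.1])
print(dropout(h))                 # roughly half the units are zeroed
print(dropout(h, train=False))    # unchanged at test time
```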

Page 51

Other Training Details

• Choice of nonlinear function:
  • Logistic function
  • Tanh
    (both suffer from the saturation problem: slow convergence due to near-zero gradients)
  • ReLU (rectified linear unit)
    • f(x) = max(0, x)
    • Non-saturating
    • Faster convergence [Nair 10]
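To see the saturation issue numerically, a small sketch comparing the gradients of the three activations at a small and a large input; the example values are purely illustrative.

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
# Derivatives of the three activation functions
d_sigmoid = lambda z: sigmoid(z) * (1 - sigmoid(z))
d_tanh    = lambda z: 1 - np.tanh(z) ** 2
d_relu    = lambda z: float(z > 0)

for z in (0.0, 5.0):
    print(z, d_sigmoid(z), d_tanh(z), d_relu(z))
# At z = 5 the sigmoid/tanh gradients are near zero (saturation); ReLU stays at 1.
```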

Page 52

Other Training Details

• Softmax and cross-entropy [Ref.]
  • Normally used instead of the squared-error loss
  • Appropriate for representing a probability distribution
• Input preprocessing
  • Zero-mean, unit-variance input data yields a better-shaped error surface
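A minimal, numerically stable softmax and cross-entropy sketch; the function names and example logits are illustrative assumptions.

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability; the output sums to 1.
    e = np.exp(z - z.max())
    return e / e.sum()

def cross_entropy(p, target_index):
    # Negative log-likelihood of the true class.
    return -np.log(p[target_index])

logits = np.array([2.0, 0.5, -1.0])
p = softmax(logits)
print(p, cross_entropy(p, 0))
```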

Page 53

Recurrent Neural Network (RNN)

[Diagram: an RNN cell with input x_t and hidden state h_t, unrolled over time as (x_0, h_0), (x_1, h_1), (x_2, h_2), ..., (x_t, h_t)]

[http://karpathy.github.io/2015/05/21/rnn-effectiveness]
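A minimal sketch of the vanilla RNN recurrence h_t = tanh(W_xh x_t + W_hh h_{t-1} + b), in the spirit of the linked post; the sizes, initialization, and random input sequence are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid = 4, 8
W_xh = rng.normal(0, 0.1, (n_hid, n_in))
W_hh = rng.normal(0, 0.1, (n_hid, n_hid))
b = np.zeros(n_hid)

def rnn_step(x_t, h_prev):
    # One time step: new hidden state from the current input and the previous state
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b)

h = np.zeros(n_hid)
for x_t in rng.normal(size=(5, n_in)):   # a length-5 input sequence
    h = rnn_step(x_t, h)
print(h.shape)
```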

Page 54

Bidirectional Recurrent Neural Network (BRNN)

• A bidirectional RNN uses both past and future context for every point in the sequence
• Two hidden layers (forward and backward) share the same output layer

[Figure: visualization of the amount of input information used for prediction by different network structures] [Schuster 97]

Page 55

Long Short-Term Memory (LSTM)

• LSTM works successfully with sequential data (handwriting, speech, etc.)
• LSTM can model very long-term sequential patterns
• Longer memory has a stabilizing effect
• A node itself is a deep network

Page 56

Problem of RNNs

[Diagram: RNN vs. LSTM]

• RNNs forget previous inputs (vanishing gradient)
• An LSTM remembers previous data and can recall it when it wants

Page 57

Step-by-Step LSTM Walk-Through

Forget gate: given the previous output $h_{t-1}$ and the current input $x_t$, decide what to discard from the cell state $C_{t-1}$:

$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$

[http://colah.github.io/posts/2015-08-Understanding-LSTMs]

Page 58

Step-by-Step LSTM Walk-Through

Input gate and candidate cell state:

$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$
$\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)$

[http://colah.github.io/posts/2015-08-Understanding-LSTMs]

Page 59

Step-by-Step LSTM Walk-Through

Cell state update:

$C_t = f_t \ast C_{t-1} + i_t \ast \tilde{C}_t$

[http://colah.github.io/posts/2015-08-Understanding-LSTMs]

Page 60

Step-by-Step LSTM Walk-Through

Output gate and new hidden state:

$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$
$h_t = o_t \ast \tanh(C_t)$

[http://colah.github.io/posts/2015-08-Understanding-LSTMs]
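Putting the four steps together, a minimal NumPy sketch of one LSTM cell step following the equations above; the weight shapes, random initialization, and input sequence are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid = 3, 5
# One weight matrix and bias per gate, acting on the concatenation [h_{t-1}, x_t]
W_f, W_i, W_C, W_o = (rng.normal(0, 0.1, (n_hid, n_hid + n_in)) for _ in range(4))
b_f = b_i = b_C = b_o = np.zeros(n_hid)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev):
    z = np.concatenate([h_prev, x_t])
    f = sigmoid(W_f @ z + b_f)            # forget gate
    i = sigmoid(W_i @ z + b_i)            # input gate
    C_tilde = np.tanh(W_C @ z + b_C)      # candidate cell state
    C = f * C_prev + i * C_tilde          # new cell state
    o = sigmoid(W_o @ z + b_o)            # output gate
    h = o * np.tanh(C)                    # new hidden state
    return h, C

h, C = np.zeros(n_hid), np.zeros(n_hid)
for x_t in rng.normal(size=(4, n_in)):    # a short input sequence
    h, C = lstm_step(x_t, h, C)
print(h)
```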

Page 61

LSTM Regularization with Dropout

• The dropout operator is applied only to non-recurrent connections [Zaremba 14]

[Figure: dashed arrows mark connections where the dropout operator D is applied, solid arrows are untouched; h_t^l denotes the hidden state in layer l at timestep t]

[Table: frame-level speech recognition accuracy]

Page 62

Autoencoder

• Regress from an observation to itself (input X1 -> output X1)
• Example: data compression (JPEG, etc.) [Lemme 10]

[Diagram: input, hidden, and output layers; weights W encode the input and weights V decode it back]
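A minimal sketch of a linear autoencoder with a bottleneck hidden layer trained by gradient descent to reconstruct its input; the layer sizes, untied weights, learning rate, and toy data are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))            # toy data, 8-dimensional
W = rng.normal(0, 0.1, (8, 3))           # encoder: 8 -> 3 (the bottleneck)
V = rng.normal(0, 0.1, (3, 8))           # decoder: 3 -> 8

lr = 0.05
for _ in range(500):
    H = X @ W                            # encode (linear, for simplicity)
    R = H @ V                            # decode / reconstruct
    err = R - X                          # reconstruction error
    gV = H.T @ err / len(X)              # gradient w.r.t. the decoder
    gW = X.T @ (err @ V.T) / len(X)      # gradient w.r.t. the encoder
    V -= lr * gV
    W -= lr * gW

print(np.mean((X @ W @ V - X) ** 2))     # reconstruction MSE decreases over training
```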

Page 63

Dark Knowledge [Hinton 14]

Softened outputs reveal the dark knowledge in the ensemble.

[Figure: for the classes cow, dog, cat, bus ("dog" being the true class), the original one-hot target (0, 1, 0, 0), the hard output of the ensemble (values like 0.9, 0.1, 10^-8, 10^-4), and a softened output (values like 0.05, 0.7, 0.5, 0.01); softening spreads probability mass onto similar classes]

Page 64

Dark Knowledge [Hinton 14]

• The distribution of the top layer has more information than the hard targets.
• Model size in a DNN can increase up to tens of GB.

[Diagram: training a DNN on (input, target) pairs vs. training a shallow network on (input, DNN output) pairs]

Page 65

Word Embedding

Language understanding (semantics)

• One-hot vector representation:
  dog: 0 1 0 0 0 0 0 0 0 0
  cat: 0 0 1 0 0 0 0 0 0 0
• A word embedding is a function mapping words to dense, real-valued vectors:
  dog: 0.3 0.2 0.1 0.5 0.7
  cat: 0.2 0.8 0.3 0.1 0.9

[Figure: nearest neighbors of a few words] [Vinyals 14]
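A minimal sketch of an embedding lookup: multiplying a one-hot vector by an embedding matrix is just selecting a row; the vocabulary, dimensions, and random values are illustrative assumptions.

```python
import numpy as np

vocab = {"dog": 1, "cat": 2}          # word -> index in a 10-word vocabulary
E = np.random.default_rng(0).normal(size=(10, 5))   # 10 words x 5-dim embeddings

def embed(word):
    one_hot = np.zeros(10)
    one_hot[vocab[word]] = 1.0
    return one_hot @ E                # equivalent to E[vocab[word]]

print(embed("dog"))
print(np.allclose(embed("cat"), E[vocab["cat"]]))   # True
```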

Page 66

Continuous-Time RNN (CTRNN)

• A dynamical-system model of a biological neural network (walking, biking, etc.)
• Ordinary differential equations model the effect of training on a neuron (trained using a genetic algorithm)
• Input nodes, hidden nodes, and output nodes (a subset of the hidden nodes)

Update equation:

$\tau_i \frac{dy_i}{dt} = -y_i + \sum_j W_{ji}\, \sigma\big(g_j (y_j - b_j)\big) + I_i$

where $dy_i/dt$ is the rate of change of the activation of the postsynaptic neuron, $\tau_i$ the time constant, $g_j$ the gain, $b_j$ the bias, $W_{ji}$ the weight of the connection between neurons $j$ and $i$, $I_i$ the external input for neuron $i$, and $\sigma$ the nonlinear function.
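A minimal sketch of integrating the CTRNN update equation above with a forward-Euler step; the network size, parameter values, and step size are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
tau = np.ones(n)                       # time constants tau_i
g = np.ones(n)                         # gains g_j
b = np.zeros(n)                        # biases b_j
W = rng.normal(0, 1, (n, n))           # W[j, i]: weight from neuron j to neuron i
I = np.array([0.5, 0.0, 0.0, 0.0])     # external input I_i
sigma = lambda z: 1.0 / (1.0 + np.exp(-z))

y = np.zeros(n)
dt = 0.01
for _ in range(1000):
    # tau_i * dy_i/dt = -y_i + sum_j W_ji * sigma(g_j * (y_j - b_j)) + I_i
    dy = (-y + W.T @ sigma(g * (y - b)) + I) / tau
    y = y + dt * dy
print(y)   # network activations after integrating for 10 time units
```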

Page 67

4. Applications

Page 68

Convolutional Neural Network (CNN)

• Handwritten digit recognition [LeCun 98]
• Architecture: N x (convolution + subsampling) layers followed by M fully connected layers
• A neural network that makes use of prior knowledge about images

[Diagram: feature extraction stage followed by classification stage]

Page 69

Convolutional Neural Network (CNN)

• Incorporate prior knowledge about images
  • Locality: each pixel is only related to a small neighborhood of pixels -> local connectivity

Page 70

Convolutional Neural Network (CNN)

• Incorporate prior knowledge about images
  • Locality: each pixel is only related to a small neighborhood of pixels -> local connectivity
  • Stationarity: image statistics are invariant over all image locations -> shared weights

Page 71

Convolutional Neural Network (CNN)

• Convolution kernels with learned parameters
• Learn multiple kernels (filters)
• Still far fewer parameters than a fully connected model
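A minimal sketch of a single 2D convolution (valid mode, stride 1), written as the cross-correlation most CNN libraries actually compute; the image and the edge-detecting kernel are illustrative assumptions.

```python
import numpy as np

def conv2d(img, kernel):
    # Valid "convolution" (cross-correlation), stride 1
    kh, kw = kernel.shape
    H, W = img.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out

img = np.arange(25, dtype=float).reshape(5, 5)
edge_kernel = np.array([[1, 0, -1], [1, 0, -1], [1, 0, -1]], dtype=float)
print(conv2d(img, edge_kernel))   # 3x3 feature map
```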

Page 72

Convolutional Neural Network (CNN)

• Subsampling (pooling)
  • N x N region -> 1 value
  • Max pooling, average pooling
  • Invariance to small translations
  • Larger receptive fields in upper layers
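A minimal 2x2 max-pooling sketch with non-overlapping windows; the window size and input are illustrative assumptions.

```python
import numpy as np

def max_pool(x, k=2):
    # Non-overlapping k x k max pooling; assumes the input divides evenly by k.
    H, W = x.shape
    return x.reshape(H // k, k, W // k, k).max(axis=(1, 3))

x = np.array([[1, 2, 5, 6],
              [3, 4, 7, 8],
              [9, 0, 1, 2],
              [3, 5, 4, 6]], dtype=float)
print(max_pool(x))   # [[4, 8], [9, 6]]
```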

Page 73

Convolutional Neural Network (CNN)

• Backpropagation
  • Convolution layer: dE/dW is the error summed and propagated from all nodes in which the weight W occurs
  • Pooling layer:
    • Max pooling: the error is propagated back to the max node only
    • Average pooling: the error is uniformly propagated back to all pooled nodes

Page 74

Application: Image Classification

• ImageNet Large-Scale Visual Recognition Challenge (ILSVRC, 2010~)
  • Image classification / localization
  • 1,200,000 (1.2M) labeled images, 1000 classes
• 2012: a CNN won the contest by a large margin, and CNNs have dominated the contest since
  • 2012: 15.3% top-5 error (2nd place: 26.2%)
  • 2013: 11.2%
  • 2014: 6.7%

Page 76

SuperVision Team

[Photo: Geoffrey Hinton (right), Alex Krizhevsky, and Ilya Sutskever (left)]

Page 77

ImageNet Challenge 2012

[Krizhevsky 12]
• Deep: 5 conv. layers + 3 fully connected layers
• Trained using 2 GPUs
• Top-5 error: 15.3% vs. 26.2% (2nd place, non-CNN)

Page 78

ImageNet Challenge 2012

[Krizhevsky 12]
• ReLU
• Overfitting prevention
  • Data augmentation: random translation, horizontal flip, color perturbation
  • Dropout
    • Randomly sets node activations to 0
    • Has the effect of simultaneously learning multiple architectures
    • Reduces co-adaptation between neurons [Hinton 12]

Page 79

ImageNet Challenge 2013

[Zeiler 13]: winning submission by Clarifai. Awesome performance!
• (Training details not revealed; see the related publication)
• Applied modifications to [Krizhevsky 12] guided by visualizing the features from each conv. layer

Page 84

ImageNet Challenge 2013

[Howard 13]
• Utilizes the entire input image instead of cropping out the edges (as opposed to [Krizhevsky 12])

[Sermanet 13]
• Multi-scale training
• Efficient computation of dense localization

Page 85

ImageNet Challenge 2014

[Lin 14]: "Network-in-Network"
• Replace the convolution filter with a multilayer perceptron
• Nonlinear: better abstraction

Page 86

ImageNet Challenge 2014

[Lin 14]: "Network-in-Network"
• Replace the convolution filter with a multilayer perceptron
• Nonlinear: better abstraction
• Can replace the full connection with simple averaging

Page 87

ImageNet Challenge 2014

[Lin 14]: "Network-in-Network" (continued; same summary as the previous slide)

Page 88

ImageNet Challenge 2014

[Lin 14]: "Network-in-Network" (continued)

[Figure: CNN vs. NIN architecture comparison]

Page 89

ImageNet Challenge 2014

[Lin 14]: "Network-in-Network" (continued)

Page 90

ImageNet Challenge 2014

[Lin 14]: "Network-in-Network"
• Replace the convolution filter with a multilayer perceptron
• Nonlinear: better abstraction
• Can replace the full connection with simple averaging
• Equivalent to a 1x1 convolution
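A minimal sketch showing that a 1x1 convolution is just a per-pixel linear map across channels, followed here by global average pooling in place of a fully connected layer; the feature-map shape and values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
feat = rng.normal(size=(8, 8, 16))        # H x W x C feature map
W_1x1 = rng.normal(size=(16, 10))         # 1x1 conv: 16 input -> 10 output channels

out = feat @ W_1x1                        # applied independently at every pixel
print(out.shape)                          # (8, 8, 10)

class_scores = out.mean(axis=(0, 1))      # global average pooling over H and W
print(class_scores.shape)                 # (10,)
```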

Page 91

ImageNet Challenge 2014

[Szegedy 14]: "GoogLeNet"

[Diagram: full GoogLeNet architecture; legend: convolution, pooling, softmax, other]

Page 92

ImageNet Challenge 2014

[Szegedy 14]: "GoogLeNet"
• 22-layer network trained on 16k CPU cores [Dean 12]
• 9 "Inception" modules (multi-scale convolution)
• Average pooling
• Auxiliary classifiers
• 12x fewer parameters than [Krizhevsky 12] (which has 60,000,000)

Page 93

ImageNet Challenge 2014

[Szegedy 14]: "GoogLeNet"
• "Inception" modules (multi-scale convolution)
  • Heterogeneous concatenation of multi-scale convolutions
  • [Arora 14]: "cluster correlated neurons together"
  • 1x1 convolutions used for dimension reduction

Page 95

ImageNet Challenge 2014

[Wu 15] (Baidu)
• Beats GoogLeNet: 6.67% -> 5.98% top-5 error
• Custom-built supercomputer: 4 GPUs x 36 nodes (Nvidia Tesla K40m)
• Aggressive data augmentation
• Multi-scale training with high-resolution images

Page 96

ImageNet Challenge 2014

[Wu 15] (Baidu), continued

[Figure: examples of data augmentation]

Page 97

ImageNet Challenge 2015

• ImageNet Challenge 2015 is open (submission deadline: November 13, 2015)
• Progress in top-5 error:
  • 2012 non-CNN: 26.2%
  • 2012 AlexNet: 15.3%
  • 2013 Clarifai: 11.2%
  • 2014 GoogLeNet: 6.7%
  • pre-2015 (Google): 4.9%
• Beyond human-level performance

[ImageNet Challenge]

Page 98

ImageNet Challenge

• Common recipes:
  • Deep (many conv layers), ReLU, dropout
  • Random-crop training (translation, horizontal flip)
  • Multi-scale or random-scale training
  • Color perturbation
  • Multi-crop testing
  • Multi-model averaging
• Focus is gradually moving from classification to classification + localization

Page 99

Auto Caption

• Auto Caption (Google)
• NeuralTalk (Stanford Univ.): http://cs.stanford.edu/people/karpathy/deepimagesent/

Page 100

Auto Caption: Show and Tell, a Neural Image Caption Generator [Vinyals 14]

• Text-image multimodal learning
• Learns a mapping between the image space and the word space
• Generates a sentence describing an image, and finds the image matching a given sentence
• CNN (convolutional neural net) + RNN (recurrent neural net)

[Figure: images with their true describing sentences]

Page 101

Auto Caption: Deep Visual-Semantic Alignments for Generating Image Descriptions (Stanford Univ.) [Karpathy 14]

• Generates dense, free-form descriptions of images
• Infers region-word alignments using R-CNN [Girshick 13] + BRNN + MRF
• Image segmentation (graph cut + disjoint union)

Page 102

Auto Caption: Deep Visual-Semantic Alignments for Generating Image Descriptions (Stanford Univ.) [Karpathy 14]

• Infers region-word alignments using R-CNN + BRNN + MRF
• Image-sentence alignment score between image $k$ (with region set $g_k$) and sentence $l$ (with word set $g_l$):

$S_{kl} = \sum_{t \in g_l} \sum_{i \in g_k} \max(0,\, v_i^{T} s_t)$

• Image regions are embedded through an h x 4096 matrix (h is 1000~1600); words come from a t-dimensional word dictionary
• Combined with their additional multiple-instance learning objective

[Figure: alignment results with a BRNN vs. an RNN]

Page 103

Auto Caption: Deep Visual-Semantic Alignments for Generating Image Descriptions (Stanford Univ.) [Karpathy 14]

Smoothing with an MRF
• Aligning each word to its best region independently is noisy; the MRF encourages similar (neighboring) words to be arranged onto nearby regions
• The argmin can be found with dynamic programming
• Output: (word, region) alignments

Page 104

Auto Caption

• Generation methods for auto captioning:
  1) Compose descriptions directly from recognized content
  2) Retrieve relevant existing text given recognized content

Related work
• Compose descriptions given recognized content: Yao et al. (2010), Yang et al. (2011), Li et al. (2011), Kulkarni et al. (2011)
• Generation as retrieval: Farhadi et al. (2010), Ordonez et al. (2011), Gupta et al. (2012), Kuznetsova et al. (2012)
• Generation using pre-associated relevant text: Leong et al. (2010), Aker and Gaizauskas (2010), Feng and Lapata (2010a)
• Other (image annotation, video description, etc.): Barnard et al. (2003), Pastra et al. (2003), Gupta et al. (2008), Gupta et al. (2009), Feng and Lapata (2010b), del Pero et al. (2011), Krishnamoorthy et al. (2012), Barbu et al. (2012), Das et al. (2013)

Page 106

Other Vision Applications

• Sequence-to-sequence learning [Sutskever 14]
• Sequence representations (1000-D) projected to 2-D with PCA
  • Sensitive to word order
  • Invariant to active vs. passive voice

Page 107

Other Vision Applications

• Data regularities are captured in a multimodal vector space
• Vector arithmetic is possible in a multimodal representation (in a Euclidean space) [Kiros 14]

vec(QS rank) + vec(gist) = vec(world ranking 2)

Page 108

Hierarchical RNN for Skeleton-Based Action Recognition [Yong 15]

• The human body is divided into five parts (two arms, two legs, trunk)
• The movements of these individual parts are modeled by a network composed of 9 layers (BRNNs, fusion layers, a fully connected layer)

Page 109

Leading experts in deep learning

Page 110

Summary

• Deep architectures perform better than existing shallow ones because they learn hierarchical representations of data
• It is now possible to train deep neural networks thanks to the availability of:
  • Large-scale training data
  • High-performance computing devices
  • Newly developed training algorithms and techniques
• Common rules of thumb for improving the performance of a DNN:
  • Make it deeper and larger (ensuring that it does not overfit)
  • Use ReLU for faster convergence and dropout as regularization
  • Apply various data augmentation schemes to increase the effective amount of training data
  • Average predictions from multiple models and input crops

Page 111

Resources

http://deeplearning.net/

• "Learning Deep Architectures for AI" by Y. Bengio, 2009
• "Deep Learning in Neural Networks: An Overview" by J. Schmidhuber, 2014
• "Machine Learning to Deep Learning" by 곽동민

• DBN (Science paper's code): Hinton (Matlab)
  http://www.cs.toronto.edu/~hinton/MatlabForSciencePaper.html
• Convolutional neural networks: LeCun
• cuda-convnet: Alex Krizhevsky, Hinton (Python, C++)
  https://code.google.com/p/cuda-convnet/
• Caffe: UC Berkeley (C++)
  http://caffe.berkeleyvision.org/
• pylearn2: Bengio (Python)
  https://github.com/lisa-lab/pylearn2
• CURRENNT: Weninger et al. (München) (C++)
  http://sourceforge.net/projects/currennt/
• Libraries: Torch (http://torch.ch/), Theano (http://deeplearning.net/software/theano/)

Page 112

THANK YOU