
Page 1:

Deep Convolutional Nets

11th March 2015

Jiaxin Shi

Tsinghua University

Page 2:

A Brief Introduction to CNN

• The replicated feature approach
• Use many different copies of the same feature detector at different positions.
  – Replication greatly reduces the number of free parameters to be learned (see the sketch below).
• Use several different feature types, each with its own map of replicated detectors.
  – This allows each patch of the image to be represented in several ways.
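A back-of-the-envelope sketch of why replication reduces free parameters, with hypothetical sizes (not from the slides): connecting a 32x32 image to a 28x28 grid of detector positions needs one weight per input-output pair if fully connected, but only one shared 5x5 kernel if the detector is replicated.

```python
# Hypothetical sizes: 32x32 input, 28x28 grid of detector positions.
fully_connected = (32 * 32) * (28 * 28)  # a separate weight for every pair
replicated = 5 * 5                       # one 5x5 detector shared across positions
print(fully_connected, replicated)       # 802816 vs. 25 free parameters
```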

Page 3:

A Brief Introduction to CNN

• What does replicating the feature detectors achieve?
• Equivariant activities: replicated features do not make the neural activities invariant to translation; the activities are equivariant.
• Invariant knowledge: if a feature is useful in some locations during training, detectors for that feature will be available in all locations during testing.

Page 4:

A Brief Introduction to CNN

• Pooling the outputs of replicated feature detectors
• Get a small amount of translational invariance at each level by averaging four neighboring replicated detectors to give a single output to the next level (see the sketch after this list).
  – This reduces the number of inputs to the next layer of feature extraction, thus allowing us to have many more different feature maps.
  – Taking the maximum of the four works slightly better.
• Problem: after several levels of pooling, we have lost information about the precise positions of things.
  – This makes it impossible to use the precise spatial relationships between high-level parts for recognition.
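A minimal numpy sketch of both pooling variants over non-overlapping 2x2 blocks; the function name and the input sizes are our own illustration, not the slides'.

```python
import numpy as np

def pool_2x2(fmap, mode="max"):
    """Pool non-overlapping 2x2 blocks of a single feature map."""
    h, w = fmap.shape
    blocks = fmap[: h - h % 2, : w - w % 2].reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3)) if mode == "max" else blocks.mean(axis=(1, 3))

fmap = np.arange(16.0).reshape(4, 4)
print(pool_2x2(fmap, "mean"))  # average of the four neighbors
print(pool_2x2(fmap, "max"))   # the maximum works slightly better in practice
```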

Page 5:

A Brief Introduction to CNN

• Terminology
  [Figure: a 5x5 kernel applied to the image. Kernel: 5x5.]

Page 6:

A Brief Introduction to CNN

• Terminology
  [Figure: the kernel slides over the image in steps of 2 pixels. Kernel: 5x5; Stride: 2.]

Page 7:

A Brief Introduction to CNN

• Terminology
  [Figure: the image border is padded by 1 pixel. Kernel: 5x5; Stride: 2; Padding: 1.]

Page 8:

A Brief Introduction to CNN

• Terminology
  [Figure: the layer produces feature maps 0–3. Convolution Layer (5x5, 2, 1, 4) = (kernel size, stride, padding, number of kernels); Kernel: 5x5; Stride: 2; Padding: 1. The output size is computed below.]
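The output size of such a layer follows the standard arithmetic $\lfloor (n + 2p - k)/s \rfloor + 1$. A small sketch; the 32x32 input size is a hypothetical choice:

```python
def conv_output_size(n, k, s, p):
    """Spatial output size of a convolution layer: floor((n + 2p - k) / s) + 1."""
    return (n + 2 * p - k) // s + 1

# Hypothetical 32x32 input through the slide's layer (5x5, stride 2, padding 1):
print(conv_output_size(32, k=5, s=2, p=1))  # -> 15, i.e. four 15x15 feature maps
```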

Page 9:

A Brief Introduction to CNN

• Terminology
  [Figure: the 4 feature maps are pooled. Pooling Layer (4x4, 4, 0) = (pooling size, stride, padding). The same output-size arithmetic applies, below.]
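Continuing the hypothetical 15x15 maps from the sketch above, the same formula covers the pooling layer:

```python
# Pooling layer (4x4, stride 4, padding 0) on a hypothetical 15x15 map:
# floor((15 + 0 - 4) / 4) + 1 = 3, i.e. each feature map pools down to 3x3.
print((15 + 2 * 0 - 4) // 4 + 1)  # -> 3
```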

Page 10:

A Brief Introduction to CNN

• An example – a ‘VW’ detector
  [Figure: a 1-channel input image containing ‘V’ and ‘W’ strokes; layer 1 produces a 3-channel output.]

Page 11:

A Brief Introduction to CNN

• An example – a ‘VW’ detector
  [Figure: the 1-channel input passes through layer 1 (3 filters/detectors, among them a ‘V’ detector and a ‘W’ detector), giving a 3-channel output; layer 2 has 2 filters (detectors) of depth 3 (“2x3”), giving 2 output channels.]

Page 12:

A Brief Introduction to CNN

• An example – a ‘VW’ detector
  [Figure: the same network shown responding to a ‘W’ in the input; layer 1 filters: 3, layer 2 filters: 2x3, output channels: 2. See the sketch below.]
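To make the channel bookkeeping concrete, here is a naive multi-channel valid convolution; all names and sizes are our own sketch, not the slides' code. Layer 1 applies 3 single-channel detectors; each layer-2 detector is 3 channels deep (the slide's “2x3”), so it can respond to conjunctions of layer-1 features such as a ‘V’ next to a ‘W’.

```python
import numpy as np

def conv2d(x, filters):
    """Naive valid convolution: x is (C, H, W), filters are (K, C, h, w)."""
    K, C, h, w = filters.shape
    H, W = x.shape[1] - h + 1, x.shape[2] - w + 1
    out = np.zeros((K, H, W))
    for k in range(K):
        for i in range(H):
            for j in range(W):
                out[k, i, j] = np.sum(x[:, i:i + h, j:j + w] * filters[k])
    return out

image = np.random.rand(1, 12, 12)                  # 1-channel input (sizes hypothetical)
maps1 = conv2d(image, np.random.rand(3, 1, 5, 5))  # layer 1: 3 detectors -> 3 channels
maps2 = conv2d(maps1, np.random.rand(2, 3, 5, 5))  # layer 2: 2 detectors, each 3 deep
print(maps1.shape, maps2.shape)                    # (3, 8, 8) (2, 4, 4)
```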

Page 13:

History

• 1979, Neocognitron (Fukushima), the first convolutional net. Fukushima, however, did not set the weights by supervised backpropagation, but by local unsupervised learning rules.
• 1989, LeNet (LeCun), backpropagation for convolutional NNs. LeCun re-invented the CNN with BP.
• 1992, Cresceptron (Weng et al., 1992), max pooling. Later integrated with CNNs (MPCNN).
• 2006, CNN trained on GPU (Chellapilla et al., 2006).
• 2011, multi-column GPU-MPCNNs (Ciresan et al., 2011), superhuman performance. The first system to achieve superhuman visual pattern recognition, in the IJCNN 2011 traffic sign recognition contest.
• 2012, ImageNet breakthrough (Krizhevsky et al., 2012). AlexNet, trained on GPUs, won the ImageNet competition.

Page 14:

Outline

• Recent Progress of Supervised Convolutional Nets
  • AlexNet
  • GoogLeNet
  • VGGNet
• Small Break: Microsoft’s Tricks
• Representation Learning and Bayesian Approach
  • Deconvolutional Networks
  • Bayesian Deep Deconvolutional Networks



Page 17:

Recent Progress of Supervised Convolutional Nets

• AlexNet, 2012
  • The architecture that made the 2012 ImageNet breakthrough.
  • NIPS 2012, ImageNet Classification with Deep Convolutional Neural Networks.
  • A general practical guide to training deep supervised convnets.
  • Main techniques
    • ReLU nonlinearity
    • Data augmentation
    • Dropout
    • Overlapping pooling
    • Mini-batch SGD with momentum and weight decay

Page 18:

Recent Progress of Supervised Convolutional Nets

• AlexNet, 2012
• Dropout
  • Reduces overfitting.
  • Model averaging – a brief proof sketch: for an input $x$, the network predicts $y = \arg\max_{k'} (\mathrm{out}_{k'})$; at test time, running the full network with the dropped weights halved approximates the geometric mean of the predictions of all dropped-out sub-networks (see the sketch below).
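A minimal sketch of the two dropout phases; the rate and shapes are hypothetical. Training drops units at random; testing keeps every unit but scales activations, which approximates averaging over all $2^N$ sub-networks.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_train(h, p=0.5):
    """Training: independently zero each unit with probability p."""
    return h * (rng.random(h.shape) >= p)

def dropout_test(h, p=0.5):
    """Testing: keep all units, scale by (1 - p); this approximates the
    geometric mean of the predictions of all 2^N dropped-out sub-networks."""
    return h * (1 - p)

h = np.ones(8)
print(dropout_train(h), dropout_test(h))
```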

Page 19:

Recent Progress of Supervised Convolutional Nets

• AlexNet, 2012
• Dropout
  • Encourages sparsity.

Page 20:

Recent Progress of Supervised Convolutional Nets

• GoogLeNet, 2014
  • The 2014 ImageNet competition winner.
  • CNNs can go further if carefully tuned.
  • Main techniques
    • Carefully designed Inception architecture
    • Network in Network
    • Deeply Supervised Nets

Deep Convolutional Nets

Jiaxin Shi 11th March 2015 Tsinghua University

Recent Progress of Supervised Convolutional Nets• GoogLeNet, 2014

• The 2014 ImageNet competition winner.

• CNN can go further if carefully tuned.

• Main techniques

• Carefully designed inception architecture

• Network in Network

• Deeply Supervised Nets

Page 22:

Recent Progress of Supervised Convolutional Nets

• GoogLeNet, 2014 – main techniques
  • Network in Network (see the sketch below)
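Network in Network's mlpconv layers amount to 1x1 convolutions: a per-pixel linear map across channels followed by a nonlinearity, which GoogLeNet also uses for cheap channel reduction. A numpy sketch with hypothetical sizes:

```python
import numpy as np

def conv1x1(x, w):
    """1x1 convolution: x is (C_in, H, W), w is (C_out, C_in)."""
    c_out, c_in = w.shape
    return (w @ x.reshape(c_in, -1)).reshape(c_out, *x.shape[1:])

x = np.random.rand(64, 7, 7)                     # hypothetical 64-channel feature maps
print(conv1x1(x, np.random.rand(16, 64)).shape)  # (16, 7, 7): channels reduced cheaply
```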


Page 24:

Recent Progress of Supervised Convolutional Nets

• GoogLeNet, 2014 – main techniques
  • Deeply Supervised Nets
    • Associate a “companion” classification output with each hidden layer (see the sketch below).
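A sketch of how deep supervision combines objectives: each hidden layer's companion classifier contributes a weighted loss to the total. The weight `alpha` is a hypothetical choice, and the function is our own illustration, not the paper's code.

```python
def deeply_supervised_loss(companion_losses, output_loss, alpha=0.3):
    """Total objective: final classification loss plus weighted companion
    losses, one per supervised hidden layer."""
    return output_loss + alpha * sum(companion_losses)

print(deeply_supervised_loss([0.9, 0.7, 0.5], output_loss=0.4))  # 0.4 + 0.3 * 2.1
```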


Page 26:

Recent Progress of Supervised Convolutional Nets

• VGGNet, 2014
  • A simple and consistently state-of-the-art architecture, compared to GoogLeNet-like structures (which are very hard to tune).
  • Developed by Oxford (later DeepMind) people. Based on Zeiler & Fergus’s 2013 work.
  • The most widely used architecture now.
  • Small filters (3x3) and small stride (1); see the arithmetic below.
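The arithmetic behind “small filter, small stride”: two stacked 3x3 convolutions cover a 5x5 receptive field, and three cover 7x7, with fewer weights per channel pair and extra nonlinearities in between.

```python
# Weights per input-output channel pair (ignoring biases):
print(2 * 3 * 3, "vs", 5 * 5)  # two 3x3 layers: 18 weights vs. one 5x5 layer: 25
print(3 * 3 * 3, "vs", 7 * 7)  # three 3x3 layers: 27 weights vs. one 7x7 layer: 49
```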


Page 28:

Outline

• Recent Progress of Supervised Convolutional Nets
  • AlexNet
  • GoogLeNet
  • VGGNet
• Small Break: Microsoft’s Tricks
• Representation Learning and Bayesian Approach
  • Deconvolutional Networks
  • Bayesian Deep Deconvolutional Networks


Page 30:

Representation Learning and Bayesian Approach

• Deconvolutional Networks, Zeiler & Fergus, CVPR 2010
  • A deep layered model for representation learning.
  • Takes an optimization perspective.
  • Results are better than previous representation-learning methods, but there is still a gap to supervised CNN models.

Page 31:

Representation Learning and Bayesian Approach

• Deconvolutional Networks, Zeiler & Fergus, CVPR 2010
  • $K_1$: number of filters (dictionaries).
  • $W_0(n,c)$: channel $c$ of the $n$th image.
  • $D(k,c)$: channel $c$ of the $k$th filter (dictionary).
  • $W_1(n,k)$: sparse feature map; indicates the position and pixel-wise strength of $D(k)$.
  • Cost function of the first layer ($K_0$: number of channels; $K_1$: number of filters (dictionaries)):

$$C_1\big(W_0(n)\big) = \frac{\lambda}{2} \sum_{c=1}^{K_0} \Big\| \sum_{k=1}^{K_1} W_1(n,k) \ast D(k,c) - W_0(n,c) \Big\|_2^2 + \sum_{k=1}^{K_1} \big| W_1(n,k) \big|^p$$
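A numpy sketch of evaluating this first-layer cost under hypothetical shapes (the full 2-D convolution of each feature map with each filter channel must match the image size); `layer1_cost` and all sizes are our own illustration, not the paper's code.

```python
import numpy as np
from scipy.signal import convolve2d

def layer1_cost(W1, D, image, lam=1.0, p=1):
    """First-layer deconvnet cost (a sketch): reconstruction error plus
    sparsity. W1: (K1, H, W) feature maps; D: (K1, C, h, w) filters;
    image: (C, H + h - 1, W + w - 1), matching the full convolution."""
    K1, C = D.shape[0], D.shape[1]
    recon = sum(
        np.sum((sum(convolve2d(W1[k], D[k, c], mode="full") for k in range(K1))
                - image[c]) ** 2)
        for c in range(C))
    return lam / 2.0 * recon + np.sum(np.abs(W1) ** p)

# Tiny example with hypothetical sizes: 2 filters, a 1-channel 8x8 image.
W1 = np.random.rand(2, 6, 6)
D = np.random.rand(2, 1, 3, 3)
image = np.random.rand(1, 8, 8)
print(layer1_cost(W1, D, image))
```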

Page 32:

Representation Learning and Bayesian Approach

• Deconvolutional Networks, Zeiler & Fergus, CVPR 2010
  • Stack layers: layer $l$ reconstructs the feature maps of layer $l-1$ ($K_l$: layer $l$'s number of channels).
  • Learning process
    • Optimize layer by layer.
    • Optimize over the feature maps $W_l(n,k)$.
    • Optimize over the filters (dictionaries) $D(k,c)$.

Page 33:

Representation Learning and Bayesian Approach

• Deconvolutional Networks, Zeiler & Fergus, CVPR 2010
  • Stack layers ($K_l$: layer $l$'s number of channels).
  • Learning process
    • Optimize layer by layer.
    • Optimize over the feature maps $W_l(n,k)$.
      • When $p \ge 1$, the problem is convex.
      • But it is poorly conditioned, due to the feature maps being coupled to one another by the filters. (Why?)
    • Optimize over the filters (dictionaries) $D(k,c)$, using gradient descent.

Page 34:

Representation Learning and Bayesian Approach

• Deconvolutional Networks, Zeiler & Fergus, CVPR 2010
  • Learning process
    • Optimize layer by layer.
    • Optimize over the feature maps $W_l(n,k)$.
      • When $p \ge 1$, convex.
      • But poorly conditioned, due to the feature maps being coupled to one another by the filters.
      • Solution: introduce auxiliary variables $x_l(n,k)$ to decouple the sparsity term from the reconstruction term (an update sketch follows the formula):

$$C_l\big(W_{l-1}(n)\big) = \frac{\lambda}{2} \sum_{c=1}^{K_{l-1}} \Big\| \sum_{k=1}^{K_l} W_l(n,k) \ast D(k,c) - W_{l-1}(n,c) \Big\|_2^2 + \sum_{k=1}^{K_l} \big| x_l(n,k) \big|^p + \sum_{k=1}^{K_l} \big\| x_l(n,k) - W_l(n,k) \big\|_2^2$$
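With the auxiliary variables in place, optimization can alternate between the $W_l$-update (now a sparsity-free least-squares problem) and the $x_l$-update, which has a closed form when $p = 1$: elementwise soft-thresholding. A minimal sketch; the function name and threshold are ours.

```python
import numpy as np

def soft_threshold(w, t):
    """Closed-form x-update when p = 1 (a sketch): shrink each
    feature-map value toward zero by threshold t."""
    return np.sign(w) * np.maximum(np.abs(w) - t, 0.0)

print(soft_threshold(np.array([-1.5, -0.2, 0.4, 2.0]), t=0.5))
# -> [-1.   0.   0.   1.5]
```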

Page 35:

Representation Learning and Bayesian Approach

• Deconvolutional Networks, Zeiler & Fergus, CVPR 2010
  • Performance: slightly outperforms SIFT-based approaches and CDBN.

Page 36:

Representation Learning and Bayesian Approach

• Bayesian Deep Deconvolutional Learning, Yunchen, 2015
  • A deep layered model for representation learning.
  • Takes a Bayesian perspective.
  • Claims state-of-the-art classification performance using the learned representations.

Page 37:

Representation Learning and Bayesian Approach

• Bayesian Deep Deconvolutional Learning, Yunchen, 2015
  • $X(n)$: the $n$th image.
  • $S(n,k)$: indicates which shifted version of the dictionary element $D(k)$ is used to represent $X(n)$.
  • $W(n,k)$: indicates the pixel-wise strength of $S(n,k)$.
  • Compared to the Deconvolutional Networks paper
    • The combination of $S(n,k)$ and $W(n,k)$ here is actually an explicit version of the sparse feature maps in the 2010 paper.

Page 38:

Representation Learning and Bayesian Approach

• Bayesian Deep Deconvolutional Learning, Yunchen, 2015
  • Priors [the prior distributions shown on the slide are not preserved in the transcript]

Page 39:

Representation Learning and Bayesian Approach

• Bayesian Deep Deconvolutional Learning, Yunchen, 2015
  • Pooling (see the sampling sketch below)
    • Within each block of $S(n,k_l,l)$, either all $n_x n_y$ pixels are zero, or only one pixel is non-zero, with the position of that pixel selected stochastically via a multinomial distribution.
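A sketch of sampling one pooling block of $S$ under this prior; the block size and probabilities are hypothetical. With the leftover probability mass the block is all zero; otherwise exactly one pixel is set, at a position drawn from a multinomial.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_block(pixel_probs):
    """Sample one n_x-by-n_y block: all zeros, or a single non-zero pixel
    at a multinomially chosen position. pixel_probs must sum to <= 1;
    the remaining mass is the probability of the all-zero block."""
    flat = np.append(pixel_probs.ravel(), 1.0 - pixel_probs.sum())
    idx = rng.choice(flat.size, p=flat)
    block = np.zeros(pixel_probs.size)
    if idx < pixel_probs.size:
        block[idx] = 1.0
    return block.reshape(pixel_probs.shape)

print(sample_block(np.full((2, 2), 0.2)))  # 2x2 block, hypothetical probabilities
```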


Page 42:

Representation Learning and Bayesian Approach

• Learning process
  • Bottom to top: Gibbs sampling, with MAP samples selected.
  • Top to bottom: refinement.

Page 43:

Representation Learning and Bayesian Approach

• Bayesian Deep Deconvolutional Learning, Yunchen, 2015
  • Intuition of deconvolutional networks (generative)
    • An image is made up of patches.
    • These patches are weighted transformations of dictionary elements.
    • We learn the dictionaries from training data.
    • A new image is then represented by the positions and weights of the dictionaries.
  • Intuition of convolutional networks
    • An image is made up of patches.
    • We can learn feature detectors for various kinds of patches.
    • Then we use these feature detectors to scan a new image, and classify it based on the features (kinds of patches) detected.
  • Both are translation-equivariant.

Page 44:

Representation Learning and Bayesian Approach

• Performance [Figure: results table not preserved in the transcript]

Page 45:

Discussion

• Deep supervised CNNs still have limits. Where does further improvement lie?
• Why does Bayesian learning of deconvolutional representations work much better than the optimization-perspective approaches?

Page 46:

Thank you.