ImageNet Classification with Deep Convolutional Neural Networks
Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton, NIPS 2012
Eunsoo Oh (오은수)
2
ILSVRC
● ImageNet Large Scale Visual Recognition Challenge
● An image classification challenge with 1,000 categories (1.2 million images)
reference : http://www.image-net.org/challenges/LSVRC/2013/slides/ILSVRC2013_12_7_13_clsloc.pdf
Processing…
Deep Convolutional Neural Network (ILSVRC-2012 Winner)
3
Why Deep Learning?
● “Shallow” vs. “deep” architectures
reference : http://web.engr.illinois.edu/~slazebni/spring14/lec24_cnn.pdf
Learn a feature hierarchy all the way from pixels to classifier
4
Background
● A neuron
[Figure: inputs x1, x2, x3, …, xd (raw pixels) are multiplied by weights w1, w2, w3, …, wd and passed through f]
Output: f(w·x+b)
reference : http://en.wikipedia.org/wiki/Sigmoid_function#mediaviewer/File:Gjl-t(x).svg
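The neuron output f(w·x+b) can be sketched in a few lines. This is a minimal illustration, assuming f is the logistic sigmoid (the curve in the referenced figure); the function names and the example values are hypothetical.

```python
import math

def sigmoid(z):
    """Logistic sigmoid, one common choice for the nonlinearity f."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron(x, w, b, f=sigmoid):
    """Return f(w·x + b) for an input list x, a weight list w, and a scalar bias b."""
    if len(x) != len(w):
        raise ValueError("x and w must have the same length")
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return f(z)

# Hypothetical values: w·x + b = 0.5*1 - 0.5*1 + 0 = 0, and sigmoid(0) = 0.5
print(neuron([1.0, 1.0], [0.5, -0.5], 0.0))
```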
5
Background
● Multi-Layer Neural Networks
● Nonlinear classifier
● Learning can be done by gradient descent (back-propagation algorithm)
[Figure: input layer, hidden layer, output layer]
6
Background
● Convolutional Neural Networks
● Variation of multi-layer neural networks
● Kernel (Convolution Matrix)
reference : http://en.wikipedia.org/wiki/Kernel_(image_processing)
7
Background
● Convolutional Filter
reference : http://cs.nyu.edu/~fergus/tutorials/deep_learning_cvpr12/fergus_dl_tutorial_final.pptx
[Figure: a convolutional filter maps the input to a feature map]
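A convolutional filter slides a small kernel over the input and writes one weighted sum per position into the feature map. The sketch below is a plain-Python illustration under simplifying assumptions (single channel, no padding, kernel not flipped, as is usual in CNNs); the function name and the example kernel are hypothetical.

```python
def conv2d_valid(image, kernel, stride=1):
    """Slide `kernel` over `image` (both lists of rows) and return the feature map."""
    kh, kw = len(kernel), len(kernel[0])
    ih, iw = len(image), len(image[0])
    feature_map = []
    for r in range(0, ih - kh + 1, stride):
        row = []
        for c in range(0, iw - kw + 1, stride):
            # Weighted sum of the image window under the kernel
            acc = 0.0
            for i in range(kh):
                for j in range(kw):
                    acc += image[r + i][c + j] * kernel[i][j]
            row.append(acc)
        feature_map.append(row)
    return feature_map

image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
kernel = [[1, 0],
          [0, -1]]  # hypothetical diagonal-difference kernel
print(conv2d_valid(image, kernel))
```

A 3x3 input and a 2x2 kernel with stride 1 give a 2x2 feature map.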
8
Proposed Method
● Deep Convolutional Neural Network
● 5 convolutional and 3 fully connected layers
● 650,000 neurons, 60 million parameters
● Several techniques for improving performance
● ReLU nonlinearity
● Training on Multiple GPUs
● Overlapping max pooling
● Data Augmentation
● Dropout
9
Rectified Linear Units (ReLU)
reference : http://www.image-net.org/challenges/LSVRC/2012/supervision.pdf
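The ReLU nonlinearity is f(x) = max(0, x). A minimal sketch, with tanh printed alongside only for comparison: tanh flattens out for large inputs, whereas ReLU does not saturate on the positive side. The function name is illustrative.

```python
import math

def relu(x):
    """Rectified linear unit: f(x) = max(0, x)."""
    return max(0.0, x)

# tanh approaches 1 for large inputs, while ReLU keeps growing linearly
for x in (-2.0, 0.5, 5.0):
    print(x, relu(x), math.tanh(x))
```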
10
Training on Multiple GPUs
● Spread across two GPUs
● GTX 580 GPU with 3GB memory
● Current GPUs are particularly well-suited to cross-GPU parallelization, because they can read from and write to one another's memory directly
● Very efficient implementation of CNN on GPUs
11
Pooling
● Spatial Pooling
● Non-overlapping / overlapping regions
● Sum or max
reference : http://cs.nyu.edu/~fergus/tutorials/deep_learning_cvpr12/fergus_dl_tutorial_final.pptx
[Figure: max pooling and sum pooling]
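Spatial pooling can be sketched with one function that takes a window size and a stride: regions are non-overlapping when the stride equals the window size and overlapping when it is smaller. The paper uses overlapping max pooling with a 3x3 window and stride 2; the function name and the toy feature maps below are hypothetical.

```python
def pool2d(fmap, size, stride, mode="max"):
    """Pool a 2D feature map (list of rows) with a size x size window."""
    reduce_fn = {"max": max, "sum": sum}[mode]
    h, w = len(fmap), len(fmap[0])
    pooled = []
    for r in range(0, h - size + 1, stride):
        row = []
        for c in range(0, w - size + 1, stride):
            window = [fmap[r + i][c + j] for i in range(size) for j in range(size)]
            row.append(reduce_fn(window))
        pooled.append(row)
    return pooled

fmap4 = [[4 * r + c + 1 for c in range(4)] for r in range(4)]  # values 1..16
fmap5 = [[5 * r + c + 1 for c in range(5)] for r in range(5)]  # values 1..25

print(pool2d(fmap4, size=2, stride=2, mode="max"))  # non-overlapping max
print(pool2d(fmap4, size=2, stride=2, mode="sum"))  # non-overlapping sum
print(pool2d(fmap5, size=3, stride=2, mode="max"))  # overlapping max (3x3, stride 2)
```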
12
Data Augmentation
[Figure: 224x224 patches, and their horizontal flips, are extracted from each 256x256 training image to form the training images]
Enlarge the dataset!
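The crop-and-flip augmentation can be sketched as follows: take a random 224x224 patch from a 256x256 image and mirror it horizontally half of the time. This is a minimal illustration on a list-of-rows image; the function name, the `rng` parameter, and the toy image are assumptions.

```python
import random

def random_crop_and_flip(image, crop=224, rng=random):
    """Return a random crop x crop patch of `image`, horizontally flipped with probability 0.5."""
    h, w = len(image), len(image[0])
    top = rng.randrange(h - crop + 1)
    left = rng.randrange(w - crop + 1)
    patch = [row[left:left + crop] for row in image[top:top + crop]]
    if rng.random() < 0.5:
        patch = [row[::-1] for row in patch]  # horizontal flip
    return patch

# Toy 256x256 "image" whose pixel value encodes its (row, column) position
image = [[(r, c) for c in range(256)] for r in range(256)]
patch = random_crop_and_flip(image, rng=random.Random(0))
print(len(patch), len(patch[0]))
```

Each call yields a different patch, so one stored image supplies many distinct training examples.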
13
Dropout
● Independently set each hidden unit activity to zero with 0.5 probability
● Used in the two globally-connected hidden layers at the net's output
reference : http://www.image-net.org/challenges/LSVRC/2012/supervision.pdf
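Dropout as described above can be sketched on a flat list of hidden activities. During training each activity is zeroed independently with probability 0.5; at test time the paper keeps all units and multiplies their outputs by 0.5. The function names and the `rng` parameter are illustrative.

```python
import random

def dropout_train(activations, p=0.5, rng=random):
    """Zero each activity independently with probability p (training time)."""
    return [0.0 if rng.random() < p else a for a in activations]

def dropout_test(activations, p=0.5):
    """Keep every unit but scale outputs by (1 - p) (test time)."""
    return [a * (1.0 - p) for a in activations]

hidden = [1.0] * 10
print(dropout_train(hidden, rng=random.Random(0)))
print(dropout_test(hidden))
```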
14
Overall Architecture
● Trained with stochastic gradient descent on two NVIDIA GPUs for five to six days
● 650,000 neurons, 60 million parameters, 630 million connections
● The last layer contains 1,000 neurons, whose outputs feed a softmax that produces a distribution over the 1,000 class labels.
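Turning the 1,000 outputs of the last layer into a distribution over class labels is done with a softmax. A minimal sketch, assuming a plain list of output scores; the function name and the toy scores are hypothetical.

```python
import math

def softmax(scores):
    """Map a list of real-valued scores to probabilities that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

scores = [0.0] * 1000
scores[42] = 5.0  # hypothetical: class 42 gets the highest score
probs = softmax(scores)
print(max(range(1000), key=lambda c: probs[c]), sum(probs))
```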
15
Results
● ILSVRC-2010 test set
[Table: error rates of the ILSVRC-2010 winner, the previous best published result, and the proposed method]
16
Results
● ILSVRC-2012 results
Proposed method: top-5 error rate 16.422%
Runner-up: top-5 error rate 26.172%
reference : http://image-net.org/challenges/LSVRC/2012/ilsvrc2012.pdf
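The top-5 error rate is the fraction of test images whose correct label is not among the five labels the model scores highest. A minimal sketch, assuming one list of per-class scores per image; the function name and the toy data are hypothetical.

```python
def top_k_error(scores, labels, k=5):
    """Fraction of examples whose true label is not among the k highest-scoring classes."""
    errors = 0
    for s, y in zip(scores, labels):
        top_k = sorted(range(len(s)), key=lambda c: s[c], reverse=True)[:k]
        if y not in top_k:
            errors += 1
    return errors / len(labels)

# Two toy examples over 6 classes, both with true label 0
scores = [[0.1, 0.2, 0.3, 0.4, 0.5, 0.6],   # class 0 ranks last: a top-5 error
          [0.6, 0.5, 0.4, 0.3, 0.2, 0.1]]   # class 0 ranks first: correct
print(top_k_error(scores, [0, 0]))
```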
17
Qualitative Evaluations
18
Qualitative Evaluations
19
ILSVRC-2013 Classification
reference : http://www.image-net.org/challenges/LSVRC/2013/slides/ILSVRC2013_12_7_13_clsloc.pdf
20
ILSVRC-2014 Classification
22 Layers 19 Layers
21
Conclusion
● A large, deep convolutional neural network for large-scale image classification was proposed
● 5 convolutional layers, 3 fully-connected layers
● 650,000 neurons, 60 million parameters
● Several techniques for improving performance
● Several techniques for reducing overfitting
● The proposed method won ILSVRC-2012
● Achieved a winning top-5 error rate of 15.3%, compared to 26.2% achieved by the second-best entry
22
Q & A
???
23
Quiz
● 1. The proposed method used hand-designed features, thus there is no need to learn features and feature hierarchies. (True / False)
● 2. Which technique was not used in this paper?
① Dropout
② Rectified Linear Units nonlinearity
③ Training on multiple GPUs
④ Local contrast normalization
24
Appendix: Feature Visualization
● 96 learned low-level (1st layer) filters
25
Appendix: Visualizing CNN
reference : M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional neural networks. arXiv preprint arXiv:1311.2901, 2013.
26
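Appendix: Local Response Normalization
● $a^i_{x,y}$ : the activity of a neuron computed by applying kernel i at position (x, y)
● The response-normalized activity $b^i_{x,y}$ is given by
$$b^i_{x,y} = a^i_{x,y} \Big/ \Big( k + \alpha \sum_{j=\max(0,\, i-n/2)}^{\min(N-1,\, i+n/2)} \big(a^j_{x,y}\big)^2 \Big)^{\beta}$$
● N : the total # of kernels in the layer
● n : hyper-parameter, n=5 (the number of adjacent kernel maps summed over)
● k : hyper-parameter, k=2
● α : hyper-parameter, α=10^(-4)
● β : hyper-parameter, β=0.75
● This aids generalization even though ReLUs do not require input normalization to avoid saturating.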
● This reduces top-5 error rate by 1.2%
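Local response normalization divides each kernel's activity by a term built from the squared activities of the n adjacent kernels at the same position. The sketch below handles a single (x, y) position and uses the paper's settings (n=5, k=2, α=10^(-4), β=0.75) as defaults; the function name and the input layout are assumptions.

```python
def local_response_norm(a, n=5, k=2.0, alpha=1e-4, beta=0.75):
    """Normalize `a`, a list of N activities at one (x, y) position, one per kernel."""
    num_kernels = len(a)
    half = n // 2
    b = []
    for i in range(num_kernels):
        lo = max(0, i - half)
        hi = min(num_kernels - 1, i + half)
        # Sum of squared activities over the adjacent kernel maps, clipped at the ends
        sq_sum = sum(a[j] ** 2 for j in range(lo, hi + 1))
        b.append(a[i] / (k + alpha * sq_sum) ** beta)
    return b

print(local_response_norm([1.0, 50.0, 1.0, 1.0, 1.0, 1.0]))
```

In the printed example, the kernels next to the large activity are scaled down slightly more than the kernel furthest from it.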
27
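Appendix: Another Data Augmentation
● Alter the intensities of the RGB channels in training images
● Perform PCA on the set of RGB pixel values
● To each training image, add multiples of the found principal components
● To each RGB image pixel $I_{xy} = [I^R_{xy}, I^G_{xy}, I^B_{xy}]^T$, add the following quantity
$$[\mathbf{p}_1, \mathbf{p}_2, \mathbf{p}_3]\,[\alpha_1 \lambda_1, \alpha_2 \lambda_2, \alpha_3 \lambda_3]^T$$
● $\mathbf{p}_i$, $\lambda_i$ : i-th eigenvector and eigenvalue of the 3x3 covariance matrix of RGB pixel values
● $\alpha_i$ : random variable drawn from a Gaussian with mean 0 and standard deviation 0.1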
● This reduces top-1 error rate by over 1%
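The RGB intensity augmentation can be sketched once the PCA has been done: draw one Gaussian multiplier per principal component, scale each eigenvector by its multiplier times its eigenvalue, and add the resulting 3-vector to every pixel of the image. The eigenvectors and eigenvalues below are made-up placeholders rather than values computed from ImageNet, and the function names are hypothetical.

```python
import random

def pca_color_offset(eigvecs, eigvals, sigma=0.1, rng=random):
    """Return the RGB offset sum_i p_i * (alpha_i * lambda_i), alpha_i ~ N(0, sigma)."""
    alphas = [rng.gauss(0.0, sigma) for _ in eigvals]
    return [sum(p[ch] * a * lam for p, a, lam in zip(eigvecs, alphas, eigvals))
            for ch in range(3)]

def shift_pixels(pixels, offset):
    """Add the same RGB offset to every pixel (a list of [R, G, B] lists)."""
    return [[v + d for v, d in zip(px, offset)] for px in pixels]

# Placeholder PCA result (assumed, not from the paper): axis-aligned components
eigvecs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
eigvals = [3.0, 2.0, 1.0]

offset = pca_color_offset(eigvecs, eigvals, rng=random.Random(0))
print(shift_pixels([[100.0, 120.0, 90.0], [10.0, 20.0, 30.0]], offset))
```

The multipliers are drawn once per image, so the whole image shifts by the same amount.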
28
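Appendix: Details of Learning
● Use stochastic gradient descent with a batch size of 128 examples, momentum of 0.9, and weight decay of 0.0005
● The update rule for weight w was
$$v_{i+1} = 0.9 \cdot v_i - 0.0005 \cdot \epsilon \cdot w_i - \epsilon \cdot \left\langle \frac{\partial L}{\partial w} \Big|_{w_i} \right\rangle_{D_i}$$
$$w_{i+1} = w_i + v_{i+1}$$
● i : the iteration index
● v : the momentum variable
● $\epsilon$ : the learning rate, initialized at 0.01 and reduced three times prior to termination
● $\left\langle \frac{\partial L}{\partial w} \big|_{w_i} \right\rangle_{D_i}$ : the average over the i-th batch $D_i$ of the derivative of the objective with respect to w, evaluated at $w_i$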
● Train for 90 cycles through the training set of 1.2 million images
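One step of the update rule (momentum 0.9, weight decay 0.0005, learning rate 0.01) can be sketched for a single scalar weight: the new velocity is 0.9 times the old one, minus the weight-decay term, minus the learning rate times the batch-averaged gradient, and the weight moves by that velocity. The function name and the example gradient are hypothetical.

```python
def sgd_momentum_step(w, v, grad, lr=0.01, momentum=0.9, weight_decay=0.0005):
    """Return (w_next, v_next) for one update of a scalar weight.

    grad is the derivative of the objective with respect to w,
    averaged over the current batch.
    """
    v_next = momentum * v - weight_decay * lr * w - lr * grad
    w_next = w + v_next
    return w_next, v_next

w, v = 1.0, 0.0
for step in range(3):
    w, v = sgd_momentum_step(w, v, grad=2.0)  # assumed constant gradient for illustration
    print(step, w, v)
```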
29