20130925.deeplearning
TRANSCRIPT
![Page 2: 20130925.deeplearning](https://reader035.vdocuments.pub/reader035/viewer/2022070313/554a0a6bb4c905507a8b58e1/html5/thumbnails/2.jpg)
Deep learning is a machine learning method that uses multilayer neural networks.
The neural network "strikes back"
(Shallow) neural network vs. deep neural network
![Page 3: 20130925.deeplearning](https://reader035.vdocuments.pub/reader035/viewer/2022070313/554a0a6bb4c905507a8b58e1/html5/thumbnails/3.jpg)
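Deep learning example:
Unsupervised learning of image features
12-layer DNN
Number of parameters: ~10^10
Automatic feature extraction by unsupervised learning
Input: 10^8 images from YouTube
16-core PCs x 10^3 machines x 3 days
Has a "grandmother cell" emerged?
Le et al., ICML 2012
(Figure panels: Preferred Stimuli in Higher-Level Cell; Examples of Training Images)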
![Page 4: 20130925.deeplearning](https://reader035.vdocuments.pub/reader035/viewer/2022070313/554a0a6bb4c905507a8b58e1/html5/thumbnails/4.jpg)
Deep learning example:
General object recognition
IMAGENET Large Scale Visual Recognition Challenge 2012
1000 categories x about 1000 training images each
Convolutional Neural Network
Krizhevsky et al., NIPS 2012
SIFT + FVs: 0.26 test error
CNN: 0.15 test error
![Page 5: 20130925.deeplearning](https://reader035.vdocuments.pub/reader035/viewer/2022070313/554a0a6bb4c905507a8b58e1/html5/thumbnails/5.jpg)
Deep learning example:
Text mining: Deep Generative Model
Reuters news data represented as a bag of words
804,414 documents (newswire stories)
Unsupervised learning with an autoencoder, modeling P(document)
Hinton & Salakhutdinov, Science 2006
(Figure: 2D-LSA result vs. Deep Generative Model result. Topic labels: Legal/Judicial, Leading Economic Indicators, European Community Monetary/Economic, Accounts/Earnings, Interbank Markets, Government Borrowings, Disasters and Accidents, Energy Markets)
![Page 6: 20130925.deeplearning](https://reader035.vdocuments.pub/reader035/viewer/2022070313/554a0a6bb4c905507a8b58e1/html5/thumbnails/6.jpg)
Historical background of neural networks (NN)
Timeline, 1960 to 2010:
Perceptron (Rosenblatt 57)
Simple/Complex cell (Hubel & Wiesel 59)
Stochastic GD (Amari 67)
"Linear Separable" (Minsky & Papert 68)
Neocognitron (Fukushima 80)
Boltzmann Machine (Hinton+ 85)
Back Prop. (Rumelhart+ 86)
Conv. net (LeCun+ 89)
Sparse Coding (Olshausen & Field 96)
Linear resp. func. (Anzai+ 99)
Deep learning (Hinton+ 06)
1st period, 2nd period
We are here
![Page 7: 20130925.deeplearning](https://reader035.vdocuments.pub/reader035/viewer/2022070313/554a0a6bb4c905507a8b58e1/html5/thumbnails/7.jpg)
NN fundamentals
The basic unit
Network architecture
Learning
Convolutional networks
![Page 8: 20130925.deeplearning](https://reader035.vdocuments.pub/reader035/viewer/2022070313/554a0a6bb4c905507a8b58e1/html5/thumbnails/8.jpg)
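Basic elements of an NN
Linear sum of the inputs:
$$u_j = \sum_{i=1}^{3} w_{ji} x_i + b_j$$
Nonlinear activation function:
$$y_j = f(u_j)$$
Choices of $f$: Logistic-Sigmoid, Rectified Linear, Hyperbolic Tangent, etc.
(Figure: inputs x1, x2, x3 feeding units y1, y2, y3, with a further unit z2; plot of f(u) against u)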
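As an illustration of the unit defined above, here is a minimal sketch in plain Python. The weights, bias, and input values are hypothetical and chosen only so that the weighted sum comes out to zero; the function names are not from the slides.

```python
import math

def logistic_sigmoid(u):
    """Logistic-sigmoid activation."""
    return 1.0 / (1.0 + math.exp(-u))

def rectified_linear(u):
    """Rectified linear activation."""
    return max(0.0, u)

def unit_output(x, w, b, f):
    """Compute y_j = f(u_j) with u_j = sum_i w_ji * x_i + b_j for one unit j."""
    u = sum(w_i * x_i for w_i, x_i in zip(w, x)) + b
    return f(u)

x = [1.0, 2.0, 3.0]      # inputs x1..x3
w = [0.5, -1.0, 0.25]    # hypothetical weights w_j1..w_j3
b = 0.75                 # hypothetical bias b_j, so that u_j = 0

outputs = {
    "sigmoid": unit_output(x, w, b, logistic_sigmoid),
    "relu": unit_output(x, w, b, rectified_linear),
    "tanh": unit_output(x, w, b, math.tanh),
}
```

Only the activation function changes between the three entries; the linear sum is shared.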
![Page 9: 20130925.deeplearning](https://reader035.vdocuments.pub/reader035/viewer/2022070313/554a0a6bb4c905507a8b58e1/html5/thumbnails/9.jpg)
NN architecture
Neural network architectures
Layered (feedforward) type
Mutually connected (recurrent) type
(Figure labels: Input, Output)
![Page 10: 20130925.deeplearning](https://reader035.vdocuments.pub/reader035/viewer/2022070313/554a0a6bb4c905507a8b58e1/html5/thumbnails/10.jpg)
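Learning in an NN (Back Propagation)
Optimize the parameters $\{w_{ij}\}$, $\{b_j\}$
Supervised learning
Back Propagation (Rumelhart+ 86)
Cost function (squared error): $H = \sum_j (t_j - y_j)^2$
Cost function (cross-entropy): $H = -\sum_j t_j \ln y_j$
Learning with derivatives (Gradient Descent):
$$w_{ij} \leftarrow w_{ij} - \eta \frac{\partial H}{\partial w_{ij}}$$
(Figure: a network from Input to Output; outputs y1, y2 are compared with teacher signals t1, t2, with example classes Leopard and Cat coded as 0 and 1; parameters $w_{ij}$, $b_j$)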
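The two cost functions and the gradient descent update can be sketched as follows. This is a minimal illustration that assumes a single layer with linear outputs, for which the squared-error gradient is $\partial H / \partial w_{ji} = -2 (t_j - y_j) x_i$; the input, targets, learning rate, and function names are hypothetical.

```python
import math

def squared_error(t, y):
    """H = sum_j (t_j - y_j)^2"""
    return sum((tj - yj) ** 2 for tj, yj in zip(t, y))

def cross_entropy(t, y):
    """H = -sum_j t_j ln y_j (terms with t_j = 0 contribute nothing)."""
    return -sum(tj * math.log(yj) for tj, yj in zip(t, y) if tj > 0)

def forward(W, b, x):
    """Linear outputs y_j = sum_i w_ji x_i + b_j (an assumption of this sketch)."""
    return [sum(wji * xi for wji, xi in zip(row, x)) + bj for row, bj in zip(W, b)]

def gradient_step(W, b, x, t, eta):
    """One update w <- w - eta * dH/dw for the squared-error cost."""
    y = forward(W, b, x)
    for j, (tj, yj) in enumerate(zip(t, y)):
        dH_dy = -2.0 * (tj - yj)
        for i, xi in enumerate(x):
            W[j][i] -= eta * dH_dy * xi
        b[j] -= eta * dH_dy

x = [1.0, 0.5]                 # hypothetical input
t = [0.0, 1.0]                 # teacher signal
W = [[0.0, 0.0], [0.0, 0.0]]   # weights start at zero
b = [0.0, 0.0]

H_before = squared_error(t, forward(W, b, x))
for _ in range(20):
    gradient_step(W, b, x, t, eta=0.1)
H_after = squared_error(t, forward(W, b, x))
```

With a learning rate that is too large the same loop diverges, which is one reason the step size $\eta$ has to be tuned in practice.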
![Page 11: 20130925.deeplearning](https://reader035.vdocuments.pub/reader035/viewer/2022070313/554a0a6bb4c905507a8b58e1/html5/thumbnails/11.jpg)
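The chain rule in Back Propagation
Output unit k (output $y_k$, teacher signal $t_k$): $\delta_k = t_k - y_k$
Gradient for the weight $w_{kj}$ into output unit k: $\frac{\partial H}{\partial w_{kj}} = -\delta_k y_j$ (the minus sign follows from defining $\delta_k = t_k - y_k$)
Hidden unit j: $\delta_j = f'(u_j) \sum_k \delta_k w_{kj}$
(Figure labels: weights $w_{kj}$ between hidden unit j and output unit k; weights $w_{ji}$ into hidden unit j, which carries the error $\delta_j$)
Chain rule for the gradient
Stochastic descent (Stochastic GD)
Updating on every single sample is inefficient, while averaging the gradient over all samples (batch) is impractical
Mini-batch: average the gradient over a few to about 100 samples
Quasi-Newton and conjugate gradient methods are also used (Le+ 11)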
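The chain rule and the mini-batch update can be sketched for a tiny two-layer network. This is an illustration under stated assumptions, not the training code behind the slides: the hidden layer uses tanh, the outputs are linear, and the cost is $H = \frac{1}{2}\sum_k (t_k - y_k)^2$, so that $\delta_k = t_k - y_k$ and $\partial H / \partial w_{kj} = -\delta_k y_j$ hold exactly. The network size, data, and learning rate are hypothetical.

```python
import math
import random

def init_params(n_in, n_hidden, n_out, seed=0):
    rng = random.Random(seed)
    W1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hidden)]
    W2 = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden)] for _ in range(n_out)]
    return [W1, [0.0] * n_hidden, W2, [0.0] * n_out]

def forward(params, x):
    W1, b1, W2, b2 = params
    h = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b) for row, b in zip(W1, b1)]
    y = [sum(w * hj for w, hj in zip(row, h)) + b for row, b in zip(W2, b2)]
    return h, y

def loss(params, x, t):
    _, y = forward(params, x)
    return 0.5 * sum((tk - yk) ** 2 for tk, yk in zip(t, y))

def backprop(params, x, t):
    """Gradients of H for one sample, obtained with the chain rule."""
    W2 = params[2]
    h, y = forward(params, x)
    delta_k = [tk - yk for tk, yk in zip(t, y)]              # output errors
    delta_j = [(1.0 - hj * hj) * sum(dk * W2[k][j] for k, dk in enumerate(delta_k))
               for j, hj in enumerate(h)]                    # f'(u_j) = 1 - tanh(u_j)^2
    gW1 = [[-dj * xi for xi in x] for dj in delta_j]
    gb1 = [-dj for dj in delta_j]
    gW2 = [[-dk * hj for hj in h] for dk in delta_k]
    gb2 = [-dk for dk in delta_k]
    return gW1, gb1, gW2, gb2

def minibatch_step(params, batch, eta):
    """One descent step along the gradient averaged over a mini-batch."""
    grads = [backprop(params, x, t) for x, t in batch]       # all at the current weights
    scale = eta / len(batch)
    W1, b1, W2, b2 = params
    for gW1, gb1, gW2, gb2 in grads:
        for W, gW in ((W1, gW1), (W2, gW2)):
            for r, grow in enumerate(gW):
                for c, g in enumerate(grow):
                    W[r][c] -= scale * g
        for b, gb in ((b1, gb1), (b2, gb2)):
            for r, g in enumerate(gb):
                b[r] -= scale * g

params = init_params(n_in=2, n_hidden=3, n_out=1)
batch = [([0.0, 0.0], [0.0]), ([0.0, 1.0], [1.0]),
         ([1.0, 0.0], [1.0]), ([1.0, 1.0], [0.0])]           # hypothetical mini-batch
loss_before = sum(loss(params, x, t) for x, t in batch)
for _ in range(10):
    minibatch_step(params, batch, eta=0.05)
loss_after = sum(loss(params, x, t) for x, t in batch)
```

A quick way to validate such code is to compare `backprop` against a finite-difference estimate of the gradient for a few individual weights.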
![Page 12: 20130925.deeplearning](https://reader035.vdocuments.pub/reader035/viewer/2022070313/554a0a6bb4c905507a8b58e1/html5/thumbnails/12.jpg)
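BackProp. in networks with many layers
The overfitting problem
Training error ≪ generalization error
Diffusion of gradient information
A classifier alone can be realized in the upper layers
Training the whole network is difficult
Pronounced in fully connected NNs
Too many parameters relative to the data: $O(M_k M_{k+1})$ per layer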
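To make the $O(M_k M_{k+1})$ growth concrete, the following sketch counts the parameters of a fully connected stack and, for comparison, of one small convolutional layer with shared weights. The layer sizes are hypothetical (1024 units corresponds to a 32x32 input) and do not describe any network in the slides.

```python
def fully_connected_params(m_k, m_k1):
    """Weights plus biases between a layer of m_k units and one of m_k1 units."""
    return m_k * m_k1 + m_k1

def conv_params(kernel_h, kernel_w, maps_in, maps_out):
    """Shared-weight convolution: the count does not depend on the image size."""
    return kernel_h * kernel_w * maps_in * maps_out + maps_out

layer_sizes = [1024, 1024, 1024, 10]   # hypothetical fully connected network
fc_total = sum(fully_connected_params(a, b)
               for a, b in zip(layer_sizes, layer_sizes[1:]))
conv_total = conv_params(5, 5, 1, 8)   # hypothetical 5x5 filters, 1 -> 8 feature maps
```

The gap between the two counts is the reason weight sharing helps when training data is limited.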
![Page 13: 20130925.deeplearning](https://reader035.vdocuments.pub/reader035/viewer/2022070313/554a0a6bb4c905507a8b58e1/html5/thumbnails/13.jpg)
Convolutional NN (CNN) (Neocognitron)
Layered network
Local feature extraction by convolution, plus spatial pooling
Neocognitron (Fukushima 80): an implementation of the hierarchy hypothesis (Hubel & Wiesel 59)
Back Prop. introduced (LeCun 89, Okada 94)
(Figure: Neocognitron layers U0 → Us1 → Uc1 → Us2 → Uc2 → Us3 → Uc3 → Us4 → Uc4, from Input to Recognition ("It's '5'"). S-cells perform feature extraction; C-cells provide tolerance to distortion. Local features are integrated into global features.)
![Page 14: 20130925.deeplearning](https://reader035.vdocuments.pub/reader035/viewer/2022070313/554a0a6bb4c905507a8b58e1/html5/thumbnails/14.jpg)
How a CNN works
Local feature extraction (convolution) + invariance to deformation (pooling)
(Figure: Convolution layer: the input x is matched against a preferred feature (orientation) X, giving the S-cell response. Subsampling layer: blurring of the S-cell response. The network alternates Convolutions and Subsampling.)
![Page 15: 20130925.deeplearning](https://reader035.vdocuments.pub/reader035/viewer/2022070313/554a0a6bb4c905507a8b58e1/html5/thumbnails/15.jpg)
How a CNN works (contd.)
Local feature extraction (convolution) + invariance to deformation (pooling)
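The combination of convolution and pooling can be sketched in a few lines of plain Python. This is a toy illustration, not LeNet or the Neocognitron: the 3x3 vertical-line kernel, the 6x6 images, and the 2x2 max pooling are hypothetical choices. It shows that shifting the input line by one pixel changes the convolution output but leaves the pooled output unchanged.

```python
def conv2d_valid(image, kernel):
    """Valid 2D filtering (cross-correlation; identical to convolution for this symmetric kernel)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h, out_w = len(image) - kh + 1, len(image[0]) - kw + 1
    return [[sum(kernel[p][q] * image[r + p][c + q]
                 for p in range(kh) for q in range(kw))
             for c in range(out_w)] for r in range(out_h)]

def rectify(fmap):
    """Keep only positive responses."""
    return [[max(0.0, v) for v in row] for row in fmap]

def max_pool(fmap, size):
    """Non-overlapping size x size max pooling."""
    return [[max(fmap[r + p][c + q] for p in range(size) for q in range(size))
             for c in range(0, len(fmap[0]) - size + 1, size)]
            for r in range(0, len(fmap) - size + 1, size)]

def vertical_line(col, n=6):
    """n x n image containing a vertical line in the given column."""
    return [[1.0 if c == col else 0.0 for c in range(n)] for _ in range(n)]

kernel = [[-1.0, 2.0, -1.0]] * 3      # prefers a vertical line (the "preferred feature")

conv_a = rectify(conv2d_valid(vertical_line(2), kernel))
conv_b = rectify(conv2d_valid(vertical_line(1), kernel))   # same line shifted by one pixel
pooled_a = max_pool(conv_a, 2)
pooled_b = max_pool(conv_b, 2)
```

Here `conv_a` and `conv_b` differ, while `pooled_a` and `pooled_b` are identical, which is the tolerance to small shifts that the pooling (C-cell) stage provides.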
![Page 16: 20130925.deeplearning](https://reader035.vdocuments.pub/reader035/viewer/2022070313/554a0a6bb4c905507a8b58e1/html5/thumbnails/16.jpg)
CNN demo
http://yann.lecun.com/exdb/lenet/index.html
(Demo conditions: Rotation, Scale, Noise, Multiple Input)
![Page 17: 20130925.deeplearning](https://reader035.vdocuments.pub/reader035/viewer/2022070313/554a0a6bb4c905507a8b58e1/html5/thumbnails/17.jpg)
CNN: the importance of architecture
The network architecture is as important as learning (Jarrett+ 09, Saxe+ 10)
Performance evaluation across architectures (Caltech-101)
Filter Bank Layer - $F_{CSG}$: the input of a filter bank layer is a 3D array with $n_1$ 2D feature maps of size $n_2 \times n_3$. Each component is denoted $x_{ijk}$, and each feature map is denoted $x_i$. The output is also a 3D array, $y$, composed of $m_1$ feature maps of size $m_2 \times m_3$. A filter in the filter bank $k_{ij}$ has size $l_1 \times l_2$ and connects input feature map $x_i$ to output feature map $y_j$. The module computes:
$$y_j = g_j \tanh\Big(\sum_i k_{ij} * x_i\Big) \qquad (1)$$
where tanh is the hyperbolic tangent non-linearity, $*$ is the 2D discrete convolution operator and $g_j$ is a trainable scalar coefficient. By taking into account the border effects, we have $m_1 = n_1 - l_1 + 1$ and $m_2 = n_2 - l_2 + 1$. This layer is denoted by $F_{CSG}$ because it is composed of a set of convolution filters (C), a sigmoid/tanh non-linearity (S), and gain coefficients (G). In the following, superscripts are used to denote the size of the filters. For instance, a filter bank layer with 64 filters of size 9x9 is denoted as $64F_{CSG}^{9 \times 9}$.
Rectification Layer - $R_{abs}$: This module simply applies the absolute value function to all the components of its input: $y_{ijk} = |x_{ijk}|$. Several rectifying non-linearities were tried, including the positive part, and produced similar results.
Local Contrast Normalization Layer - $N$: This module performs local subtractive and divisive normalizations, enforcing a sort of local competition between adjacent features in a feature map, and between features at the same spatial location in different feature maps. The subtractive normalization operation for a given site $x_{ijk}$ computes $v_{ijk} = x_{ijk} - \sum_{ipq} w_{pq} \cdot x_{i,j+p,k+q}$, where $w_{pq}$ is a Gaussian weighting window (of size 9x9 in our experiments) normalized so that $\sum_{ipq} w_{pq} = 1$. The divisive normalization computes $y_{ijk} = v_{ijk} / \max(c, \sigma_{jk})$ where $\sigma_{jk} = \big(\sum_{ipq} w_{pq} \cdot v^2_{i,j+p,k+q}\big)^{1/2}$. For each sample, the constant $c$ is set to $\mathrm{mean}(\sigma_{jk})$ in the experiments. The denominator is the weighted standard deviation of all features over a spatial neighborhood. The local contrast normalization layer is inspired by computational neuroscience models [24, 20].
Average Pooling and Subsampling Layer - $P_A$: The purpose of this layer is to build robustness to small distortions, playing the same role as the complex cells in models of visual perception. Each output value is $y_{ijk} = \sum_{pq} w_{pq} \cdot x_{i,j+p,k+q}$, where $w_{pq}$ is a uniform weighting window ("boxcar filter"). Each output feature map is then subsampled spatially by a factor $S$ horizontally and vertically. In this work, we do not consider pooling over feature types, but only over the spatial dimensions. Therefore, the numbers of input and output feature maps are identical, while the spatial resolution is decreased. Disregarding the border effects in the boxcar averaging, the spatial resolution is decreased by the down-sampling ratio $S$ in both directions, denoted by a superscript, so that an average pooling layer with 4x4 down-sampling is denoted $P_A^{4 \times 4}$.
Figure 1. An example of a feature extraction stage of the type $F_{CSG} - R_{abs} - N - P_A$. An input image (or a feature map) is passed through a non-linear filterbank, followed by rectification, local contrast normalization and spatial pooling/sub-sampling.
Max-Pooling and Subsampling Layer - $P_M$: building local invariance to shift can be performed with any symmetric pooling operation. The max-pooling module is similar to the average pooling, except that the average operation is replaced by a max operation. In our experiments, the pooling windows were non-overlapping. A max-pooling layer with 4x4 down-sampling is denoted $P_M^{4 \times 4}$.
2.1. Combining Modules into a Hierarchy
Different architectures can be produced by cascading the above-mentioned modules in various ways. An architecture is composed of one or two stages of feature extraction, each of which is formed by cascading a filtering layer with different combinations of rectification, normalization, and pooling. Recognition architectures are composed of one or two such stages, followed by a classifier, generally a multinomial logistic regression.
$F_{CSG} - P_A$: This is the basic building block of traditional convolutional networks, alternating tanh-squashed filter banks with average down-sampling layers [14, 10]. A complete convolutional network would have several sequences of "$F_{CSG} - P_A$" followed by a linear classifier.
$F_{CSG} - R_{abs} - P_A$: The tanh-squashed filter bank is followed by an absolute value non-linearity, and by an average down-sampling layer.
$F_{CSG} - R_{abs} - N - P_A$: The tanh-squashed filter bank is followed by an absolute value non-linearity, by a local contrast normalization layer and by an average down-sampling layer.
$F_{CSG} - P_M$: This is also a typical building block of convolutional networks, as well as the basis of the HMAX and other architectures [28, 25], which alternate tanh-squashed filter banks with max-pooling layers.
3. Training Protocol
Given a particular architecture, a number of training protocols have been considered and tested. Each protocol is identified by a letter R, U, R+, or U+. A single letter (e.g. R) indicates an architecture with a single stage of feature extraction, followed by a classifier, while a double letter (e.g. RR) indicates an architecture with two stages of feature extraction followed by a classifier:
Random Features and Supervised Classifier - R and RR: The filters in the feature extraction stages are set to random values and kept fixed (no feature learning takes place), and the classifier stage is trained in supervised mode.
Figure 4. Left: random stage-1 filters, and corresponding optimal inputs that maximize the response of each corresponding complex cell in a $F_{CSG} - R_{abs} - N - P_A$ architecture. The small asymmetry in the random filters is sufficient to make them orientation selective. Middle: same for PSD filters. The optimal input patterns contain several periods since they maximize the output of a complete stage that contains rectification, local normalization, and average pooling with down-sampling. Shifted versions of each pattern yield similar activations. Right panel: subset of stage-2 filters obtained after PSD and supervised refinement on Caltech-101. Some structure is apparent.
4.2. Random Filter Performance
Perhaps the most astonishing result is the surprisingly good performance obtained with random filters with few labeled samples. The NORB experiments show that random filters yield sub-par performance when labeled samples are abundant. But the experiments also show that random filters seem to require the presence of abs and normalization. To explore why random filters work at all, we used gradient descent to find the optimal input patterns that maximize each complex cell (after pooling) in a $F_{CSG} - R_{abs} - N - P_A$ stage. The surprising finding is that the optimal stimuli for random filters are oriented gratings (albeit noisy and faint ones), similar to the optimal stimuli for trained filters. As shown in fig 4, it appears that random weights, combined with the abs/norm/pooling, create a spontaneous orientation selectivity.
4.3. Handwritten Digits Recognition
As a sanity check for the overall training procedures and architectures, experiments were run on the MNIST dataset, which contains 60,000 gray-scale 28x28 pixel digit images for training and 10,000 images for testing. An architecture with two stages of feature extraction was used: the first stage produces 32 feature maps using 5x5 filters, followed by 2x2 average pooling and down-sampling. The second stage produces 64 feature maps, each of which combines 16 feature maps from stage 1 with 5x5 filters (1024 filters total), followed by 2x2 pooling/down-sampling. The classifier is a 2-layer fully-connected neural network with 200 hidden units, and 10 outputs. The loss function is equivalent to that of a 10-way multinomial logistic regression (also known as cross-entropy loss). The two feature stages use abs rectification and normalization.
The parameters for the two feature extraction stages are first trained with PSD as explained in Section 3.1. The classifier is initialized randomly. The whole system is fine-tuned in supervised mode (the protocol could be described as U+U+R+R+). A validation set of size 10,000 was set apart from the training set to tune the only hyper-parameter: the sparsity constant $\lambda$. Nine different values were tested between 0.1 and 1.6 and the best value was found to be 0.2. The system was trained with a form of stochastic gradient descent on the 50,000 non-validation training samples until the best error rate on the validation set was reached (this took 30 epochs). It was then tuned for another 3 epochs on the whole training set. A test error rate of 0.53% was obtained. To our knowledge, this is the best error rate ever reported on the original MNIST dataset, without distortions or preprocessing. The best previously reported error rate was 0.60% [26].
5. Conclusions
This paper addressed the following three questions:
1. how do the non-linearities that follow the filter banks influence the recognition accuracy. The surprising answer is that using a rectifying non-linearity is the single most important factor in improving the performance of a recognition system. This might be due to several reasons: a) the polarity of features is often irrelevant to recognize objects, b) the rectification eliminates cancellations between neighboring filter outputs when combined with average pooling. Without a rectification what is propagated by the average down-sampling is just the noise in the input. Also introducing a local normalization layer improves the performance. It appears to make supervised learning considerably faster, perhaps because all variables have similar variances (akin to the advantages introduced by whitening and other decorrelation methods).
2. does learning the filter banks in an unsupervised or supervised manner improve the performance over hard-wired filters or even random filters: the most surprising result is that random filters used in a two-stage system with the proper non-linearities yield 62.9% recognition rate on Caltech-101. Experiments on NORB show that this surprising performance is only seen in the limit of very small training set sizes. We have also shown that the optimal input patterns for a randomly initialized stage are very similar to the optimal inputs for a stage that uses learned filters. The second important result is that global supervised learning of the filters yields good recognition rate if the proper non-linearities are used. It was thought that the dismal performance of supervised convolutional networks on Caltech-101 was due to overparameterization, but it seems to be due ...
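Recognition rate on Caltech-101 by architecture:

| Architecture | Random filter | Trained filter |
|---|---|---|
| 2 layer + abs | 0.629 | 0.647 |
| 2 layer + mean | 0.196 | 0.310 |
| 1 layer + abs | 0.533 | 0.548 |

(Figure labels: Random, Predictive Sparse Decomp., abs)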
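The gap between the "abs" and "mean" rows is consistent with the cancellation argument in the excerpt: average pooling over filter responses of alternating sign yields almost nothing unless the responses are rectified first. A minimal one-dimensional sketch, with hypothetical response values and function names:

```python
def average_pool(values, size):
    """Non-overlapping average pooling over a 1D list of responses."""
    return [sum(values[i:i + size]) / size
            for i in range(0, len(values) - size + 1, size)]

def abs_rectify(values):
    """Absolute-value rectification."""
    return [abs(v) for v in values]

# Hypothetical filter responses to an oscillating pattern: strong, then weaker.
responses = [1.0, -1.0, 1.0, -1.0, 0.5, -0.5, 0.5, -0.5]

pooled_plain = average_pool(responses, 4)              # signs cancel
pooled_abs = average_pool(abs_rectify(responses), 4)   # response strength survives
```

Without rectification both pooled values are zero, so the difference between the strong and the weak region is lost; with rectification it is preserved.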
![Page 18: 20130925.deeplearning](https://reader035.vdocuments.pub/reader035/viewer/2022070313/554a0a6bb4c905507a8b58e1/html5/thumbnails/18.jpg)
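Properties of the visual cortex (ventral pathway)
Visual cortex: has a hierarchical structure, with a different visual task solved at each level
Early visual cortex: small receptive fields and simple feature extraction; Simple cells and Complex cells are present
Higher visual cortex: large receptive fields, selective for moderately complex features
(Figure: ventral pathway V1 → V2 → V4 → PIT → CIT → AIT, with TEO and TE, alongside a map of visual areas V1, V2, V3, VP, V3A, V4, MT, VA/V4, PIT, AIT/CIT, LIP, MST, DPL, VIP, 7a, 8, TF. V1: small receptive field, edge and line segment detector. IT: large receptive field, face and complex feature detector. V2 and V4: "?")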
![Page 19: 20130925.deeplearning](https://reader035.vdocuments.pub/reader035/viewer/2022070313/554a0a6bb4c905507a8b58e1/html5/thumbnails/19.jpg)
Properties of the early visual cortex
Responds to components such as line segments and edges
Simple cell: sensitive to orientation and phase
Complex cell: tolerant to phase
Complex cell: a cascade connection of Simple cells
(Figure: Simple cell, phase sensitive and orientation selective; for three input stimuli on the receptive field: Fire, Not Fire, Not Fire. Complex cell, phase insensitive; for the same stimuli: Fire, Not Fire, Fire. Ventral pathway diagram V1 → V2 → V4 → IT: small receptive fields with edge and line segment detectors in V1, large receptive fields with face and complex feature detectors in IT, "?" for V2 and V4.)
(Embedded text excerpt on the early visual cortex; the body text is unreadable in this extraction. Source: http://ohzawa-lab.bpe.es.osaka-u.ac.jp/resources/text/KisokouKoukai2009/Ohzawa2009Koukai04.pdf)
![Page 20: 20130925.deeplearning](https://reader035.vdocuments.pub/reader035/viewer/2022070313/554a0a6bb4c905507a8b58e1/html5/thumbnails/20.jpg)
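Properties of the higher visual cortex
Responds to moderately complex features; face cells are present
Very large receptive fields
Tolerant to spatiotemporal changes
(Figure: ventral pathway diagram V1 → V2 → V4 → IT: small receptive fields with edge and line segment detectors in V1, large receptive fields with face and complex feature detectors in IT, "?" for V2 and V4.)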
![Page 21: 20130925.deeplearning](https://reader035.vdocuments.pub/reader035/viewer/2022070313/554a0a6bb4c905507a8b58e1/html5/thumbnails/21.jpg)
Interpreting the CNN in terms of the visual cortex
Hubel & Wiesel's hierarchy hypothesis: a cascade connection of Simple cells and Complex cells
The poorly understood areas between V2 and IT are filled in by structural extrapolation from the early visual cortex
Possibility of tuning by learning
(Figure: ventral pathway diagram V1 → V2 → V4 → IT, with "?" for V2 and V4.)
Layer sizes of the network in the figure:
U0: 41x41x1
Us1: 41x41x8
Uc1: 41x41x8
Us2: 41x41xK2
Uc2: 21x21xK2
Us3: 21x21xK3
Uc3: 11x11xK3
Us4: 11x11xK4
Uc4: 5x5xK4
Us5: 5x5xK5
Uc5: 1x1xK5
![Page 22: 20130925.deeplearning](https://reader035.vdocuments.pub/reader035/viewer/2022070313/554a0a6bb4c905507a8b58e1/html5/thumbnails/22.jpg)
Historical background of NN and neighboring fields
Timeline, 1960 to 2010:
Perceptron (Rosenblatt 57)
Simple/Complex cell (Hubel & Wiesel 59)
Stochastic GD (Amari 67)
"Linear Separable" (Minsky & Papert 68)
Neocognitron (Fukushima 80)
Boltzmann Machine (Hinton+ 85)
Back Prop. (Rumelhart+ 86)
Conv. net (LeCun+ 89)
Sparse Coding (Olshausen & Field 96)
Linear resp. func. (Anzai+ 99)
Deep learning (Hinton+ 06)
1st period, 2nd period
We are here
![Page 23: 20130925.deeplearning](https://reader035.vdocuments.pub/reader035/viewer/2022070313/554a0a6bb4c905507a8b58e1/html5/thumbnails/23.jpg)
Historical background of NN and neighboring fields
Timeline, 1990 to 2010:
Conv. net (LeCun+ 89)
Boosting (Schapire 90)
SVM (Vapnik 95)
Sparse Coding (Olshausen & Field 96)
SIFT (Lowe 99)
Bayesian net (Pearl 00)
Face detection (Viola & Jones 01)
HOG (Dalal & Triggs 05)
Deep learning (Hinton+ 06)
SURF (Bay+ 06)
L1-recovery (Candes+ 06)
Method families: Kernel Method, Bayesian Method
We are here
![Page 24: 20130925.deeplearning](https://reader035.vdocuments.pub/reader035/viewer/2022070313/554a0a6bb4c905507a8b58e1/html5/thumbnails/24.jpg)
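What happened around NN in the late 1990s
Difficulty of designing architectures for Back Prop.
Too few hidden units: poor representational power
Too many hidden units: overfitting
Progress in machine learning methods
Support Vector Machine / kernel methods
Boosting
A prevailing mood of "Isn't a shallow network enough?"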
![Page 25: 20130925.deeplearning](https://reader035.vdocuments.pub/reader035/viewer/2022070313/554a0a6bb4c905507a8b58e1/html5/thumbnails/25.jpg)
Face detection by Viola & Jones
Haar-like features + Boosting (Viola & Jones 01)
Haar-like detectors
Training samples
http://vimeo.com/12774628
![Page 26: 20130925.deeplearning](https://reader035.vdocuments.pub/reader035/viewer/2022070313/554a0a6bb4c905507a8b58e1/html5/thumbnails/26.jpg)
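Image description with SIFT
Scale Invariant Feature Transform (Lowe 99)
Feature description by keypoint detection and histograms
Invariant to rotation and scale changes, robust to illumination changes
(Figure: original image I(u, v) → Gaussian smoothing at scales σ1 to σ5 → difference-of-Gaussians images DoG, D(u, v, l) → extremum search → SIFT feature points (keypoints))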
![Page 27: 20130925.deeplearning](https://reader035.vdocuments.pub/reader035/viewer/2022070313/554a0a6bb4c905507a8b58e1/html5/thumbnails/27.jpg)
Image description with SIFT
Scale Invariant Feature Transform (Lowe 99)
Feature description by keypoint detection and histograms
Invariant to rotation and scale changes, robust to illumination changes
(Figure: difference-of-Gaussians images DoG, D(u, v, l) → extremum search → SIFT feature points (keypoints) → compute gradient information around each keypoint → build histograms → SIFT descriptor, shown as 16 orientation histograms with 8 bins each)
![Page 28: 20130925.deeplearning](https://reader035.vdocuments.pub/reader035/viewer/2022070313/554a0a6bb4c905507a8b58e1/html5/thumbnails/28.jpg)
Image recognition with Bag of Features
Feed the feature descriptors directly into a classifier (Bag of Visual Words) (Csurka+ 04)
http://www.vision.cs.chubu.ac.jp/sift/PDF/sift_tutorial_ppt.pdf
![Page 29: 20130925.deeplearning](https://reader035.vdocuments.pub/reader035/viewer/2022070313/554a0a6bb4c905507a8b58e1/html5/thumbnails/29.jpg)
Image description with HOG
Histograms of Oriented Gradients (HOG) (Dalal & Triggs 05)
Representation by local histograms of edge components
Robust to illumination changes; a descriptor for coarse regions
(Figure: original image I(u, v) → gradient image m(u, v) → division into cells → blocks of cells, showing the original image and the gradient magnitude image within a block → 9-bin orientation histograms → HOG feature V_i → a classifier such as an SVM)
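The core of a HOG-style descriptor is a magnitude-weighted histogram of gradient orientations per cell. The sketch below is a simplified illustration, not the Dalal & Triggs reference implementation: it treats a whole small image as one cell, uses unsigned orientations in 9 bins of 20 degrees, skips the border pixels, and omits bin interpolation and block-level grouping. All names are hypothetical.

```python
import math

def cell_histogram(image, n_bins=9):
    """Magnitude-weighted histogram of unsigned gradient orientations (0 to 180 degrees)."""
    hist = [0.0] * n_bins
    bin_width = 180.0 / n_bins
    for r in range(1, len(image) - 1):           # interior pixels only
        for c in range(1, len(image[0]) - 1):
            gx = image[r][c + 1] - image[r][c - 1]
            gy = image[r + 1][c] - image[r - 1][c]
            magnitude = math.hypot(gx, gy)
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            hist[int(angle // bin_width) % n_bins] += magnitude
    return hist

def l2_normalize(hist, eps=1e-12):
    """L2 normalization, which makes the descriptor robust to contrast changes."""
    norm = math.sqrt(sum(h * h for h in hist)) + eps
    return [h / norm for h in hist]

# Hypothetical 6x6 cells: intensity ramps along the columns and along the rows.
ramp_u = [[float(c) for c in range(6)] for _ in range(6)]
ramp_v = [[float(r) for _ in range(6)] for r in range(6)]

hist_u = cell_histogram(ramp_u)            # gradient along u falls in bin 0
hist_v = cell_histogram(ramp_v)            # gradient along v (90 degrees) falls in bin 4
hist_u_bright = cell_histogram([[3.0 * v for v in row] for row in ramp_u])
```

After `l2_normalize`, `hist_u` and `hist_u_bright` coincide, which illustrates the robustness to illumination changes mentioned on the slide.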
![Page 30: 20130925.deeplearning](https://reader035.vdocuments.pub/reader035/viewer/2022070313/554a0a6bb4c905507a8b58e1/html5/thumbnails/30.jpg)
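An NN interpretation of image recognition
Feature construction based on image properties (edges, etc.) + machine learning
A shallow network model?
(Figure: Input → Feature Detector (Haar, SIFT, HOG, ...) → Machine Learning (SVM, Boosting, ...) → Output: Leopard / Cat)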
![Page 31: 20130925.deeplearning](https://reader035.vdocuments.pub/reader035/viewer/2022070313/554a0a6bb4c905507a8b58e1/html5/thumbnails/31.jpg)
From part features to combined features
Moving away from Bag of Words
Toward features that combine part features (Felzenszwalb+ 10, Divvala+ 12)
Hierarchical-Deep Models
HD Models: Compose hierarchical Bayesian models with deep networks, two influential approaches from unsupervised learning.
Deep Networks:
• learn multiple layers of nonlinearities.
• trained in unsupervised fashion (unsupervised feature learning); no need to rely on human-crafted input representations.
• labeled data is used to slightly adjust the model for a specific task.
Hierarchical Bayes:
• explicitly represent category hierarchies for sharing abstract knowledge.
• explicitly identify only a small number of parameters that are relevant to the new concept being learned.
(Figure labels: Category-based Hierarchy, Collins & Quillian (1969); Part-based Hierarchy, Marr & Nishihara (1978) and Felzenszwalb+ 10; Deep Nets)
![Page 32: 20130925.deeplearning](https://reader035.vdocuments.pub/reader035/viewer/2022070313/554a0a6bb4c905507a8b58e1/html5/thumbnails/32.jpg)
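Designing the feature extraction mechanism
How do we build (moderately complex) feature detectors?
Combinations in the style of "Tokens" (Marr 82)
Object parts: Continuation, Corner, Junction, Cross
Hand-made features are painful to build → acquire representations by machine learning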
![Page 33: 20130925.deeplearning](https://reader035.vdocuments.pub/reader035/viewer/2022070313/554a0a6bb4c905507a8b58e1/html5/thumbnails/33.jpg)
Historical background of NN and neighboring fields
Timeline, 1990 to 2010:
Conv. net (LeCun+ 89)
Boosting (Schapire 90)
SVM (Vapnik 95)
Sparse Coding (Olshausen & Field 96)
SIFT (Lowe 99)
Bayesian net (Pearl 00)
Face detection (Viola & Jones 01)
HOG (Dalal & Triggs 05)
Deep learning (Hinton+ 06)
SURF (Bay+ 06)
L1-recovery (Candes+ 06)
Method families: Kernel Method, Bayesian Method, Sparse Model
We are here
![Page 34: 20130925.deeplearning](https://reader035.vdocuments.pub/reader035/viewer/2022070313/554a0a6bb4c905507a8b58e1/html5/thumbnails/34.jpg)
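Describing data with sparse representations
Represent the data as a linear sum of basis vectors:
$$y = \sum_{i=1}^{M} x_i d_i = x_1 d_1 + x_2 d_2 + x_3 d_3 + \dots$$
Require as many of the coefficients $x_i$ as possible to be 0
The basis vectors $\{d_i\}$ are determined by learning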
![Page 35: 20130925.deeplearning](https://reader035.vdocuments.pub/reader035/viewer/2022070313/554a0a6bb4c905507a8b58e1/html5/thumbnails/35.jpg)
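Describing data with sparse representations
$$y = x_1 d_1 + x_2 d_2 + x_3 d_3 + \dots$$ with as many coefficients as possible equal to 0
$$H = \sum_p \Big\| y^p - \sum_i x_i^p d_i \Big\|^2 + \lambda \sum_i \| x_i^p \|_1$$
First term: represent the image as faithfully as possible
Second term: drive as many coefficients as possible to 0 (LASSO)
Can we obtain $\{d_i\}$ and $\{x_i^p\}$ from the image patches $\{y^p\}$?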
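For a fixed dictionary, the coefficients that minimize $H$ for one patch can be found by iterative soft thresholding (ISTA). ISTA is a standard solver for this LASSO problem, chosen here for illustration; the slides do not say which solver was used, and learning the dictionary $\{d_i\}$ itself is not covered. The dictionary, patch, and $\lambda$ below are hypothetical.

```python
def soft_threshold(v, thr):
    """Shrink v toward zero by thr; values within [-thr, thr] become exactly 0."""
    if v > thr:
        return v - thr
    if v < -thr:
        return v + thr
    return 0.0

def reconstruct(atoms, x):
    """sum_i x_i d_i, with atoms[i] holding the basis vector d_i."""
    return [sum(xi * d[n] for xi, d in zip(x, atoms)) for n in range(len(atoms[0]))]

def objective(atoms, x, y, lam):
    """H = ||y - sum_i x_i d_i||^2 + lam * sum_i |x_i| for a single patch y."""
    residual = [yn - rn for yn, rn in zip(y, reconstruct(atoms, x))]
    return sum(r * r for r in residual) + lam * sum(abs(xi) for xi in x)

def ista(atoms, y, lam, n_iter=200):
    """Minimize H over x with the dictionary held fixed."""
    # 2 * (squared Frobenius norm) bounds the Lipschitz constant of the gradient.
    step = 1.0 / (2.0 * sum(v * v for d in atoms for v in d))
    x = [0.0] * len(atoms)
    for _ in range(n_iter):
        residual = [yn - rn for yn, rn in zip(y, reconstruct(atoms, x))]
        x = [soft_threshold(xi + step * 2.0 * sum(dn * rn for dn, rn in zip(d, residual)),
                            step * lam)
             for xi, d in zip(x, atoms)]
    return x

atoms = [[1.0, 0.0], [0.0, 1.0]]   # hypothetical orthonormal dictionary d_1, d_2
y = [1.0, 0.1]                     # hypothetical patch
x_sparse = ista(atoms, y, lam=0.4)
```

In this example the small second component of `y` is cheaper to ignore than to represent, so its coefficient is set exactly to zero while the first coefficient is shrunk by $\lambda / 2$.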
![Page 36: 20130925.deeplearning](https://reader035.vdocuments.pub/reader035/viewer/2022070313/554a0a6bb4c905507a8b58e1/html5/thumbnails/36.jpg)
Feature extraction with sparse coding
Sparse coding representation of natural images (Olshausen & Field 96)
Resembles the linear response functions of the early visual cortex (Anzai+ 99) and Gabor wavelets
Sparse coding representation of natural sounds (Terashima & Okada 12)
Representation of chords
(Figure: three image panels with both axes running from 50 to 500)
Slide credit: Andrew Ng
![Page 37: 20130925.deeplearning](https://reader035.vdocuments.pub/reader035/viewer/2022070313/554a0a6bb4c905507a8b58e1/html5/thumbnails/37.jpg)
Sparse Representation for MNIST
60K train, 10K test
Dict. size 512
Linear SVM classification
$$H = \sum_p \Big\| y^p - \sum_i x_i^p d_i \Big\|^2 + \lambda \sum_i \| x_i^p \|_1$$
Eval. param.: $\lambda$
Pipeline: Input → Sparse Coding → Feature → Classifier
Slide credit: Kai Yu
![Page 38: 20130925.deeplearning](https://reader035.vdocuments.pub/reader035/viewer/2022070313/554a0a6bb4c905507a8b58e1/html5/thumbnails/38.jpg)
Sparse Representation for MNIST
$\lambda = 5 \times 10^{-4}$
Partial detectors
$$H = \sum_p \Big\| y^p - \sum_i x_i^p d_i \Big\|^2 + \lambda \sum_i \| x_i^p \|_1$$
Eval. param.: $\lambda$
Slide credit: Kai Yu
![Page 39: 20130925.deeplearning](https://reader035.vdocuments.pub/reader035/viewer/2022070313/554a0a6bb4c905507a8b58e1/html5/thumbnails/39.jpg)
Sparse Representation for MNIST
$\lambda = 5 \times 10^{-2}$
Partial digit detectors
$$H = \sum_p \Big\| y^p - \sum_i x_i^p d_i \Big\|^2 + \lambda \sum_i \| x_i^p \|_1$$
Eval. param.: $\lambda$
Slide credit: Kai Yu
![Page 40: 20130925.deeplearning](https://reader035.vdocuments.pub/reader035/viewer/2022070313/554a0a6bb4c905507a8b58e1/html5/thumbnails/40.jpg)
Sparse Representation for MNIST
$\lambda = 5 \times 10^{-4}$
VQ-like representation
$$H = \sum_p \Big\| y^p - \sum_i x_i^p d_i \Big\|^2 + \lambda \sum_i \| x_i^p \|_1$$
Eval. param.: $\lambda$
Slide credit: Kai Yu
![Page 41: 20130925.deeplearning](https://reader035.vdocuments.pub/reader035/viewer/2022070313/554a0a6bb4c905507a8b58e1/html5/thumbnails/41.jpg)
Sparse Auto Encoder
Predictive Sparse Decomposition (Ranzato+ 07)
Encoder: $x^p = f(W y^p)$
Decoder: $y^p = D x^p$
Sparse representation $\{x^p\}$, input patches $\{y^p\}$, L1 constraint:
$$\min_{D, W, x} \sum_p \| y^p - D x^p \|^2 + \| x^p - f(W y^p) \|^2 + \lambda \sum_i \| x_i^p \|_1$$
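The three terms of the Predictive Sparse Decomposition objective can be evaluated for a single patch as in the sketch below. It only computes the loss and does not perform the minimization over $D$, $W$, and $x$; using tanh for $f$, as well as the matrices and values, are assumptions made for illustration.

```python
import math

def matvec(rows, v):
    """Matrix-vector product with the matrix given as a list of rows."""
    return [sum(a * b for a, b in zip(row, v)) for row in rows]

def psd_loss(y, x, D, W, lam, f=math.tanh):
    """Return (reconstruction, prediction, sparsity) terms for one patch y with code x."""
    decoded = matvec(D, x)                        # decoder: D x
    predicted = [f(u) for u in matvec(W, y)]      # encoder: f(W y)
    reconstruction = sum((yn - dn) ** 2 for yn, dn in zip(y, decoded))
    prediction = sum((xi - pi) ** 2 for xi, pi in zip(x, predicted))
    sparsity = lam * sum(abs(xi) for xi in x)
    return reconstruction, prediction, sparsity

D = [[1.0, 0.0], [0.0, 1.0]]   # hypothetical decoder (dictionary) matrix
W = [[1.0, 0.0], [0.0, 1.0]]   # hypothetical encoder matrix
y = [1.0, 0.0]                 # hypothetical input patch
x = [0.5, 0.0]                 # hypothetical sparse code

terms = psd_loss(y, x, D, W, lam=0.1)
total = sum(terms)
```

The prediction term is what distinguishes this objective from plain sparse coding: once trained, the encoder $f(W y)$ approximates the sparse code in a single feedforward pass.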
![Page 42: 20130925.deeplearning](https://reader035.vdocuments.pub/reader035/viewer/2022070313/554a0a6bb4c905507a8b58e1/html5/thumbnails/42.jpg)
Sparseness + Hierarchy?
Hierarchical Sparse Coding (Yu+ 11)
Deep Belief Network (DBN), Deep Boltzmann Machine (DBM) (Hinton & Salakhutdinov 06)
Hierarchical representation
(Figure: Input Patches $\{y^p\}$ ↔ Level 1 Features ↔ Level 2 Features, with an Encoder/Decoder pair between each pair of levels)
![Page 43: 20130925.deeplearning](https://reader035.vdocuments.pub/reader035/viewer/2022070313/554a0a6bb4c905507a8b58e1/html5/thumbnails/43.jpg)
Sparseness + Hierarchy?
Deep Belief Network (DBN), Deep Boltzmann Machine (DBM) (Hinton & Salakhutdinov 06)
Hierarchical representation
(Figure: Input Patches $\{y^p\}$ → Level 1 Features → Level 2 Features, connected by Encoders only)
With the Decoders removed, the stack operates as an NN
![Page 44: 20130925.deeplearning](https://reader035.vdocuments.pub/reader035/viewer/2022070313/554a0a6bb4c905507a8b58e1/html5/thumbnails/44.jpg)
Sparseness + Hierarchy?
Deep Belief Network (DBN), Deep Boltzmann Machine (DBM) (Hinton & Salakhutdinov 06)
Hierarchical representation
(Figure: Level 2 Features → Level 1 Features → Input Patches $\{y^p\}$, connected by Decoders only)
Running the Decoders derives the optimal feature for each unit
![Page 45: 20130925.deeplearning](https://reader035.vdocuments.pub/reader035/viewer/2022070313/554a0a6bb4c905507a8b58e1/html5/thumbnails/45.jpg)
Hierarchical CNN + Sparse Coding
Hierarchical classifiers built with sparse coding (Yu+ 11, Zeiler+ 11)
Sparse Coding
2nd-layer bases: correspond to rotations and translations
(Figure: alternating Convolutions and Subsampling stages)
![Page 46: 20130925.deeplearning](https://reader035.vdocuments.pub/reader035/viewer/2022070313/554a0a6bb4c905507a8b58e1/html5/thumbnails/46.jpg)
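Summary
The idea of deep learning has existed for a long time
Both the network architecture and the learning are important
Why is deep learning popular now?
Performance saturation of shallow networks
Difficulty of building hand-made feature detectors
Learning methods became established through the introduction of sparse modeling
Improved computer performance supplied the means
The arrival of so-called big data supplied the demand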
![Page 47: 20130925.deeplearning](https://reader035.vdocuments.pub/reader035/viewer/2022070313/554a0a6bb4c905507a8b58e1/html5/thumbnails/47.jpg)
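Summary (contd.)
The finer details (probably still) require craftsmanship.
Setting the sparseness, and choosing the number of units to suit the data
Estimating λ from data (Sasaki, DC research fellow)
Better recognition performance → deeper networks → fear of overfitting → cross-validation
→ demand for still more computing power
A generalized design scheme (something like libsvm?) is probably needed
Expectations for feature learning and representation learning
Semi-supervised learning
There seem to be quite a few fields where labeled data is costly
Integration of multiple devices, e.g. image fusion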
![Page 48: 20130925.deeplearning](https://reader035.vdocuments.pub/reader035/viewer/2022070313/554a0a6bb4c905507a8b58e1/html5/thumbnails/48.jpg)
References consulted
Prof. Okatani's slides http://www.vision.is.tohoku.ac.jp/jp/research/
LeCun's website http://yann.lecun.com/
IEEE PAMI special issue http://ieeexplore.ieee.org/xpl/tocresult.jsp?isnumber=6541932
CVPR 2012 Deep Learning tutorial http://cs.nyu.edu/~fergus/tutorials/deep_learning_cvpr12/
神経回路と情報処理 (Neural Networks and Information Processing), Kunihiko Fukushima, Asakura Shoten
ICONIP 2007 Special Session for Neocognitron
Building deep nets with Python and Theano http://deeplearning.net/