20130925.deeplearning

Image Processing with Deep Learning, 2013/09/25. The University of Electro-Communications, Graduate School of Informatics and Engineering. Hayaru Shouno: [email protected]


TRANSCRIPT

Page 1: 20130925.deeplearning

Image Processing with Deep Learning — 2013/09/25

The University of Electro-Communications, Graduate School of Informatics and Engineering. Hayaru Shouno: [email protected]

Page 2: 20130925.deeplearning

Deep learning: a machine learning method that uses multi-layer neural networks.

The "revenge" of neural networks.

(Shallow) neural network → deep neural network

Page 3: 20130925.deeplearning

Deep learning example: unsupervised learning of image features

A 12-layer DNN with ~10^10 parameters.

Automatic feature extraction by unsupervised learning. Input: 10^8 YouTube images; 16-core PCs × 10^3 machines × 3 days of training.

Did it produce "grandmother cells"?

Le et al., ICML 2012. [Figure: preferred stimuli of higher-level cells; examples of training images.]

Page 4: 20130925.deeplearning

Deep learning example: general object recognition

IMAGENET Large Scale Visual Recognition Challenge 2012

1000 categories × roughly 1000 training images each.

Convolutional neural network

Krizhevsky et al., NIPS 2012

SIFT + FVs: 0.26 test error; CNN: 0.15 test error.

Page 5: 20130925.deeplearning

Deep learning example: text mining with a deep generative model

Reuters news data represented as bag-of-words; 804,414 documents.

Unsupervised learning with an autoencoder (Hinton & Salakhutdinov 2006).

[Figure: 2-D embedding with clusters labeled Legal/Judicial, Leading Economic Indicators, European Community Monetary/Economic, Accounts/Earnings, Interbank Markets, Government Borrowings, Disasters and Accidents, Energy Markets.]

Model: P(document) from bag-of-words input. Reuters dataset of 804,414 newswire stories, trained unsupervised. Deep generative model (Hinton & Salakhutdinov, Science 2006).

[Figure: 2-D LSA result vs. deep generative model result.]

Page 6: 20130925.deeplearning

Historical background of neural networks (NN)

Timeline, 1960–2010 (first boom, second boom, and "we are here"):

Perceptron (Rosenblatt 57)
Simple/complex cells (Hubel & Wiesel 59)
Stochastic gradient descent (Amari 67)
"Linearly separable" critique (Minsky & Papert 68)
Neocognitron (Fukushima 80)
Boltzmann machine (Hinton+ 85)
Back propagation (Rumelhart+ 86)
Convolutional net (LeCun+ 89)
Sparse coding (Olshausen & Field 96)
Linear response functions (Anzai+ 99)
Deep learning (Hinton+ 06) ← we are here

Page 7: 20130925.deeplearning

Basics of neural networks

The idea of the basic unit

Network architectures

Learning

Convolutional nets

Page 8: 20130925.deeplearning

Basic elements of a NN: a weighted linear sum of the inputs followed by a nonlinear activation function.

Typical activation functions: logistic sigmoid, rectified linear, hyperbolic tangent, etc.

u_j = \sum_{i=1}^{3} w_{ji} x_i + b_j, \qquad y_j = f(u_j)

[Figure: inputs x_1, x_2, x_3 feeding units y_1, y_2, y_3, which in turn feed a second-layer unit z_2.]
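As a minimal sketch of this unit (the variable names and toy sizes are my own, not from the slides), the NumPy snippet below computes u_j = Σ_i w_ji x_i + b_j for one layer of units and applies a chosen activation:

```python
import numpy as np

def logistic(u):            # logistic sigmoid
    return 1.0 / (1.0 + np.exp(-u))

def relu(u):                # rectified linear
    return np.maximum(0.0, u)

def unit_forward(W, b, x, f=np.tanh):
    """One layer of units: u_j = sum_i W[j, i] * x[i] + b[j], y_j = f(u_j)."""
    u = W @ x + b
    return f(u)

# toy example: 3 inputs -> 2 units
rng = np.random.default_rng(0)
W = rng.normal(size=(2, 3))
b = np.zeros(2)
x = np.array([0.5, -1.0, 2.0])
print(unit_forward(W, b, x, logistic))
```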

Page 9: 20130925.deeplearning

NN architectures

Two broad classes of neural network architecture: layered (feed-forward) networks and mutually connected (recurrent) networks.

[Figure: input → output through a layered network.]

Page 10: 20130925.deeplearning

Training a NN (back propagation)

Optimize the parameters {w_ij}, {b_j}.

Supervised learning: back propagation (Rumelhart+ 86).

[Figure: an input image is mapped to outputs y_1, y_2, which are compared with teacher signals t_1, t_2 (e.g. Cat = 0, Leopard = 1).]

Cost function: H = \sum_j (t_j - y_j)^2 (squared error) or H = -\sum_j t_j \ln y_j (cross entropy).

Learning with derivatives (gradient descent): w_{ij} \leftarrow w_{ij} - \eta \, \partial H / \partial w_{ij}
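A small sketch of the two cost functions and one gradient-descent update, assuming the gradient ∂H/∂w is supplied; the toy output/teacher values are mine:

```python
import numpy as np

def squared_error(y, t):          # H = sum_j (t_j - y_j)^2
    return np.sum((t - y) ** 2)

def cross_entropy(y, t):          # H = -sum_j t_j * ln(y_j)
    return -np.sum(t * np.log(y))

def gd_step(w, grad_H, eta=0.1):  # w <- w - eta * dH/dw
    return w - eta * grad_H

# toy check: network outputs y against teacher t (Cat = 0, Leopard = 1)
y = np.array([0.3, 0.6])
t = np.array([0.0, 1.0])
print(squared_error(y, t), cross_entropy(y, t))
```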

Page 11: 20130925.deeplearning

The chain rule of back propagation

Output layer: \delta_k = t_k - y_k and \partial H / \partial w_{kj} = \delta_k y_j.

Hidden layer: \delta_j = f'(u_j) \sum_k \delta_k w_{kj}, propagated backwards through the weights w_{ji}.

Stochastic gradient descent (SGD): updating on every single sample is inefficient, while the average gradient over all samples (full batch) is impractical; a mini-batch uses the average gradient over a few to ~100 samples. Quasi-Newton and conjugate gradient methods are also used (Le+ 11).
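The chain rule above fits in a few lines of NumPy. Here is a sketch of one mini-batch update for a two-layer net with tanh hidden units and linear outputs; the squared-error choice, shapes, and names are my own assumptions, not the slides' exact setup:

```python
import numpy as np

def forward(x, W1, b1, W2, b2):
    y1 = np.tanh(x @ W1.T + b1)   # hidden layer activations
    y2 = y1 @ W2.T + b2           # linear output layer
    return y1, y2

def backprop_step(x, t, W1, b1, W2, b2, eta=0.1):
    """One mini-batch back-propagation step (squared-error cost)."""
    y1, y2 = forward(x, W1, b1, W2, b2)
    delta2 = t - y2                          # delta_k = t_k - y_k
    delta1 = (1 - y1 ** 2) * (delta2 @ W2)   # delta_j = f'(u_j) * sum_k delta_k w_kj
    n = x.shape[0]                           # mini-batch of a few to ~100 samples
    # moving along +delta reduces the squared error (constant factors absorbed in eta)
    W2 += eta * delta2.T @ y1 / n; b2 += eta * delta2.mean(axis=0)
    W1 += eta * delta1.T @ x  / n; b1 += eta * delta1.mean(axis=0)
    return W1, b1, W2, b2
```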

Page 12: 20130925.deeplearning

Back propagation in deeply layered networks

Overfitting: generalization error ≫ training error.

Diffusion (vanishing) of the gradient information.

A classifier alone can be realized in the upper layers, but training the whole network is difficult.

Especially pronounced in fully connected NNs: the number of parameters, O(M_k M_{k+1}) per layer, is excessive relative to the data.

Page 13: 20130925.deeplearning

Convolutional NN (CNN) / Neocognitron

A hierarchical network: local feature extraction by convolution plus spatial pooling.

Neocognitron (Fukushima 80): an implementation of the hierarchy hypothesis (Hubel & Wiesel 59).

Back propagation introduced later (LeCun 89, Okada 94).

[Figure: input U0 → Us1/Uc1 → Us2/Uc2 → Us3/Uc3 → Us4/Uc4 → recognition ("It's 5"). S-cells: feature extraction; C-cells: tolerance to distortion; local features are integrated stage by stage into global features.]

Page 14: 20130925.deeplearning

How a CNN works: local feature extraction (convolution) + invariance to deformation (pooling).

[Figure: an input x containing the preferred feature (orientation) X passes through a convolution layer (S-cell response to the preferred orientation), then a subsampling layer that blurs and pools the response; stages of convolutions and subsampling alternate.]

Page 15: 20130925.deeplearning

How a CNN works (contd.)

Local feature extraction (convolution) + invariance to deformation (pooling).
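A rough sketch of these two operations, using SciPy's 2-D convolution for the S-like (feature extraction) stage and block averaging for the C-like (pooling) stage; the filters and sizes are illustrative only:

```python
import numpy as np
from scipy.signal import convolve2d

def s_layer(image, filters):
    """Convolution (S-cell-like) stage: local feature extraction with a filter bank."""
    maps = [convolve2d(image, k, mode='same') for k in filters]
    return np.maximum(0.0, np.stack(maps))        # simple rectification

def c_layer(feature_maps, pool=2):
    """Pooling (C-cell-like) stage: average over pool x pool blocks, then subsample,
    giving tolerance to small shifts and deformations."""
    n, h, w = feature_maps.shape
    h, w = h - h % pool, w - w % pool
    x = feature_maps[:, :h, :w].reshape(n, h // pool, pool, w // pool, pool)
    return x.mean(axis=(2, 4))

# toy usage: one vertical-edge and one horizontal-edge filter on a random "image"
img = np.random.rand(16, 16)
filters = [np.array([[1., -1.]]), np.array([[1.], [-1.]])]
print(c_layer(s_layer(img, filters)).shape)       # (2, 8, 8)
```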

Page 16: 20130925.deeplearning

CNN demo

http://yann.lecun.com/exdb/lenet/index.html

[Demos: rotation, scale, noise, multiple inputs.]

Page 17: 20130925.deeplearning

CNN: the importance of architecture. The network architecture matters as much as the learning (Jarrett+ 09, Saxe+ 10).

Performance comparison across architectures (Caltech-101).

(Excerpt from Jarrett+ 09, describing the modules being compared.)

Filter Bank Layer (F_CSG): the input is a 3D array with n1 2D feature maps of size n2 x n3; each component is denoted x_ijk and each feature map x_i. The output is also a 3D array y, composed of m1 feature maps of size m2 x m3. A filter k_ij in the bank has size l1 x l2 and connects input feature map x_i to output feature map y_j. The module computes y_j = g_j \tanh(\sum_i k_{ij} * x_i), where tanh is the hyperbolic tangent non-linearity, * is the 2D discrete convolution operator, and g_j is a trainable scalar gain coefficient. Taking the border effect into account, m2 = n2 - l1 + 1 and m3 = n3 - l2 + 1. The layer is denoted F_CSG because it combines convolution filters (C), a sigmoid/tanh non-linearity (S), and gain coefficients (G); superscripts denote the filter size, so a bank of 64 filters of size 9x9 is written 64F_CSG^{9x9}.

Rectification Layer (R_abs): applies the absolute value to every component of its input, y_ijk = |x_ijk|. Several rectifying non-linearities were tried, including the positive part, with similar results.

Local Contrast Normalization Layer (N): performs local subtractive and divisive normalization, enforcing a sort of local competition between adjacent features in a feature map and between features at the same spatial location in different feature maps. The subtractive step computes v_ijk = x_ijk - \sum_{ipq} w_{pq} x_{i,j+p,k+q}, where w_{pq} is a Gaussian weighting window (9x9 in the experiments) normalized so that \sum_{ipq} w_{pq} = 1. The divisive step computes y_ijk = v_ijk / \max(c, \sigma_{jk}) with \sigma_{jk} = (\sum_{ipq} w_{pq} v_{i,j+p,k+q}^2)^{1/2}; for each sample the constant c is set to mean(\sigma_{jk}). The denominator is the weighted standard deviation of all features over a spatial neighborhood. The layer is inspired by computational-neuroscience models.

Average Pooling and Subsampling Layer (P_A): builds robustness to small distortions, playing the same role as complex cells in models of visual perception. Each output value is y_ijk = \sum_{pq} w_{pq} x_{i,j+p,k+q}, where w_{pq} is a uniform weighting window ("boxcar filter"); each output feature map is then subsampled spatially by a factor S horizontally and vertically. Pooling is done only over the spatial dimensions, not over feature types, so the numbers of input and output maps are identical while the spatial resolution decreases by the down-sampling ratio S, written as a superscript (e.g. P_A^{4x4}).

Figure 1 (Jarrett+ 09): an example feature-extraction stage of the type F_CSG → R_abs → N → P_A; an input image (or feature map) is passed through a non-linear filter bank, followed by rectification, local contrast normalization, and spatial pooling/sub-sampling.

Max-Pooling and Subsampling Layer (P_M): local invariance to shift can be built with any symmetric pooling operation; the max-pooling module is like average pooling with the average replaced by a max, using non-overlapping pooling windows. A max-pooling layer with 4x4 down-sampling is denoted P_M^{4x4}.

2.1. Combining Modules into a Hierarchy: different architectures are produced by cascading the above modules in various ways. An architecture is composed of one or two stages of feature extraction, each formed by cascading a filtering layer with different combinations of rectification, normalization, and pooling, followed by a classifier, generally a multinomial logistic regression. F_CSG → P_A is the basic building block of traditional convolutional networks, alternating tanh-squashed filter banks with average down-sampling layers. F_CSG → R_abs → P_A adds an absolute-value non-linearity before the average down-sampling. F_CSG → R_abs → N → P_A further adds a local contrast normalization layer. F_CSG → P_M alternates tanh-squashed filter banks with max-pooling layers, as in HMAX-type architectures.

3. Training Protocol: each protocol is identified by a letter R, U, R+, or U+. A single letter (e.g. R) indicates an architecture with a single stage of feature extraction followed by a classifier; a double letter (e.g. RR) indicates two stages of feature extraction followed by a classifier. Random Features and Supervised Classifier (R and RR): the filters in the feature extraction stages are set to random values and kept fixed (no feature learning takes place), and the classifier stage is trained in supervised mode.


Figure 4 (Jarrett+ 09). Left: random stage-1 filters and the corresponding optimal inputs that maximize the response of each complex cell in an F_CSG → R_abs → N → P_A architecture; the small asymmetry in the random filters is sufficient to make them orientation selective. Middle: the same for PSD filters; the optimal input patterns contain several periods, since they maximize the output of a complete stage with rectification, local normalization, and average pooling with down-sampling, and shifted versions of each pattern yield similar activations. Right: a subset of stage-2 filters obtained after PSD and supervised refinement on Caltech-101; some structure is apparent.

4.2. Random Filter Performance. Perhaps the most astonishing result is the surprisingly good performance obtained with random filters and few labeled samples. The NORB experiments show that random filters yield sub-par performance when labeled samples are abundant, but also that random filters seem to require the presence of abs rectification and normalization. To explore why random filters work at all, gradient descent was used to find the optimal input patterns that maximize each complex cell (after pooling) in an F_CSG → R_abs → N → P_A stage. The surprising finding is that the optimal stimuli for random filters are oriented gratings (albeit noisy and faint ones), similar to the optimal stimuli for trained filters. As Fig. 4 shows, random weights combined with abs/norm/pooling create a spontaneous orientation selectivity.

4.3. Handwritten Digits Recognition. As a sanity check, experiments were run on MNIST (60,000 gray-scale 28x28 training digits, 10,000 test digits). An architecture with two feature-extraction stages was used: the first stage produces 32 feature maps with 5x5 filters, followed by 2x2 average pooling and down-sampling; the second stage produces 64 feature maps, each combining 16 stage-1 maps with 5x5 filters (1024 filters total), followed by 2x2 pooling/down-sampling. The classifier is a 2-layer fully connected network with 200 hidden units and 10 outputs; the loss is equivalent to 10-way multinomial logistic regression (cross-entropy). Both feature stages use abs rectification and normalization. The two feature stages are first trained with PSD, the classifier is initialized randomly, and the whole system is fine-tuned in supervised mode (protocol U+U+R+R+). A validation set of 10,000 images was used to tune the only hyper-parameter, the sparsity constant λ (nine values between 0.1 and 1.6 were tried; 0.2 was best). Training with stochastic gradient descent on the 50,000 non-validation samples until the best validation error (30 epochs), then 3 more epochs on the whole training set, gave a test error of 0.53% — at the time the best error rate reported on the original MNIST dataset without distortions or preprocessing (previous best: 0.60%).

5. Conclusions. The paper addressed three questions. (1) How do the non-linearities that follow the filter banks influence recognition accuracy? The surprising answer is that a rectifying non-linearity is the single most important factor, possibly because (a) the polarity of features is often irrelevant for recognizing objects and (b) rectification eliminates cancellations between neighboring filter outputs when combined with average pooling (without it, average down-sampling mostly propagates noise). Introducing a local normalization layer further improves performance and appears to make supervised learning considerably faster, perhaps because all variables end up with similar variances (akin to whitening and other decorrelation methods). (2) Does learning the filter banks, supervised or unsupervised, improve performance over hard-wired or random filters? The most surprising result is that random filters in a two-stage system with the proper non-linearities yield a 62.9% recognition rate on Caltech-101, although NORB experiments show this only in the limit of very small training sets; the optimal input patterns for a randomly initialized stage are very similar to those for a stage with learned filters. The second important result is that global supervised learning of the filters yields good recognition rates if the proper non-linearities are used; the dismal performance of supervised convolutional networks on Caltech-101 was thought to be due to overparameterization, but it seems to be due …


Caltech-101 accuracy by architecture and filter type (Jarrett+ 09):

                     Random filters    Trained filters (Predictive Sparse Decomp.)
  2 layers + abs     0.629             0.647
  2 layers + mean    0.196             0.310
  1 layer  + abs     0.533             0.548
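For concreteness, here is a loose NumPy/SciPy approximation of the local contrast normalization (N) layer described in the excerpt above; it uses a per-map Gaussian blur plus an average across maps instead of the paper's exact 9x9 joint weighting window, so it is only a sketch under those assumptions:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def local_contrast_norm(x, sigma=2.0):
    """Subtractive + divisive local contrast normalization over feature maps x
    of shape (n_maps, H, W); the constant c is set to the mean local std."""
    # subtractive step: remove the local (and cross-map) weighted mean
    mean = gaussian_filter(x, sigma=(0, sigma, sigma)).mean(axis=0, keepdims=True)
    v = x - mean
    # divisive step: divide by the local weighted standard deviation
    var = gaussian_filter(v ** 2, sigma=(0, sigma, sigma)).mean(axis=0, keepdims=True)
    std = np.sqrt(var)
    c = std.mean()
    return v / np.maximum(c, std)

x = np.random.rand(8, 32, 32)
print(local_contrast_norm(x).shape)   # (8, 32, 32)
```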

Page 18: 20130925.deeplearning

Properties of the visual cortex (ventral pathway)

The visual cortex is hierarchical, and each level solves a different visual task.

Early visual cortex: small receptive fields, simple feature extraction; existence of simple and complex cells.

Higher visual areas: large receptive fields, selective for moderately complex features.

[Figure: the ventral pathway hierarchy V1 → V2 → V4 → PIT → CIT/AIT (TEO, TE), alongside a map of cortical areas (V1, V2, V3, V3A, VP, V4, MT, VA/V4, PIT, AIT/CIT, 8, TF, LIP, MST, DPL, VIP, 7a). V1: small receptive fields, edge/line-segment detectors; IT: large receptive fields, face/complex-feature detectors; intermediate areas marked "?".]

Page 19: 20130925.deeplearning

Properties of early visual cortex

Responds to components such as line segments and edges.

Simple cells: sensitive to orientation and phase.

Complex cells: tolerant to phase; a complex cell can be modeled as a cascade of simple cells.

[Figure: a simple cell has an oriented receptive field and fires only when the input stimulus matches both its preferred orientation and phase (phase sensitive, orientation selective); a complex cell fires for the preferred orientation regardless of phase (phase insensitive).]

[Figure: the same ventral pathway hierarchy (V1 → V2 → V4 → IT) as on the previous slide, with a Japanese excerpt on V1 receptive-field properties and the Hubel & Wiesel findings. Source: http://ohzawa-lab.bpe.es.osaka-u.ac.jp/resources/text/KisokouKoukai2009/Ohzawa2009Koukai04.pdf]

Page 20: 20130925.deeplearning

Properties of higher visual areas

Respond to moderately complex features; existence of face cells.

Very large receptive fields.

Tolerant to spatio-temporal variation.

[Figure: ventral pathway hierarchy V1 → V2 → V4 → IT, as on the previous slides; V1: small receptive fields (edge/line-segment detectors), IT: large receptive fields (face/complex-feature detectors).]

Page 21: 20130925.deeplearning

A visual-cortex interpretation of CNNs

Hubel & Wiesel's hierarchy hypothesis: complex cells as cascades of simple cells.

The poorly understood areas between V2 and IT are filled in by structural extrapolation of the early visual cortex.

Tunability through learning.

[Figure: the ventral pathway (V1 → V2 → V4 → IT) aligned with a Neocognitron-type CNN: layers U0 → Us1/Uc1 → Us2/Uc2 → Us3/Uc3 → Us4/Uc4 → Us5/Uc5 with map sizes 41x41x1, 41x41x8, 41x41x8, 41x41xK2, 21x21xK2, 21x21xK3, 11x11xK3, 11x11xK4, 5x5xK4, 5x5xK5, 1x1xK5.]

Page 22: 20130925.deeplearning

Historical background of NNs and neighboring fields

Timeline, 1960–2010 (first boom, second boom, and "we are here"):

Perceptron (Rosenblatt 57)
Simple/complex cells (Hubel & Wiesel 59)
Stochastic gradient descent (Amari 67)
"Linearly separable" critique (Minsky & Papert 68)
Neocognitron (Fukushima 80)
Boltzmann machine (Hinton+ 85)
Back propagation (Rumelhart+ 86)
Convolutional net (LeCun+ 89)
Sparse coding (Olshausen & Field 96)
Linear response functions (Anzai+ 99)
Deep learning (Hinton+ 06) ← we are here

Page 23: 20130925.deeplearning

Historical background of NNs and neighboring fields

Timeline, 1990–2010 ("we are here"):

Conv. net (LeCun+ 89)
Boosting (Schapire 90)
SVM (Vapnik 95) — kernel methods
Sparse coding (Olshausen & Field 96)
SIFT (Lowe 99)
Bayesian net (Pearl 00) — Bayesian methods
Face detection (Viola & Jones 01)
HOG (Dalal & Triggs 05)
SURF (Bay+ 06)
L1-recovery (Candes+ 06)
Deep learning (Hinton+ 06) ← we are here

Page 24: 20130925.deeplearning

What happened in the NN community in the late 1990s

Architecture design for back propagation is hard: with too few hidden units the representation is poor, with too many the network overfits.

Meanwhile, other machine learning methods advanced: support vector machines / kernel methods, boosting.

A prevailing mood of "isn't a shallow network good enough?"

Page 25: 20130925.deeplearning

Face detection by Viola & Jones: Haar-like features + boosting (Viola & Jones 01).

[Figure: Haar-like detectors and training samples.] http://vimeo.com/12774628
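A small sketch of the ingredients behind Haar-like features: an integral image makes any rectangle sum O(1), and a two-rectangle feature is a difference of such sums (function names are mine, not Viola & Jones' API):

```python
import numpy as np

def integral_image(img):
    """Summed-area table with a zero border: ii[y, x] = sum of img[:y, :x]."""
    ii = np.cumsum(np.cumsum(img, axis=0), axis=1)
    return np.pad(ii, ((1, 0), (1, 0)))

def box_sum(ii, y0, x0, y1, x1):
    """Sum of img[y0:y1, x0:x1] in O(1) from the integral image."""
    return ii[y1, x1] - ii[y0, x1] - ii[y1, x0] + ii[y0, x0]

def haar_two_rect(img, y, x, h, w):
    """A two-rectangle Haar-like feature: left half minus right half of a window."""
    ii = integral_image(img)
    left  = box_sum(ii, y, x,          y + h, x + w // 2)
    right = box_sum(ii, y, x + w // 2, y + h, x + w)
    return left - right

print(haar_two_rect(np.random.rand(24, 24), 4, 4, 12, 12))
```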

Page 26: 20130925.deeplearning

Image description with SIFT: Scale Invariant Feature Transform (Lowe 99). Keypoint detection plus histogram-based feature description.

Invariant to rotation and scale changes, robust to illumination changes.

[Figure: the original image I(u, v) is Gaussian-smoothed at scales σ1–σ5; differences of adjacent scales form the difference-of-Gaussians (DoG) images D(u, v, l); extrema of the DoG stack give the SIFT keypoints.]
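A hedged sketch of this DoG front end in NumPy/SciPy (the sigma values and threshold are illustrative, not Lowe's exact settings):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def dog_pyramid(img, sigmas=(1.0, 1.6, 2.56, 4.1, 6.5)):
    """Difference-of-Gaussians stack D(u, v, l): differences of Gaussian-smoothed
    images at successive scales."""
    blurred = [gaussian_filter(img.astype(float), s) for s in sigmas]
    return np.stack([b2 - b1 for b1, b2 in zip(blurred[:-1], blurred[1:])])

def local_extrema(dog, thresh=0.03):
    """Candidate keypoints: points that are extrema of their 3x3x3 neighbourhood."""
    L, H, W = dog.shape
    keypoints = []
    for l in range(1, L - 1):
        for v in range(1, H - 1):
            for u in range(1, W - 1):
                patch = dog[l-1:l+2, v-1:v+2, u-1:u+2]
                val = dog[l, v, u]
                if abs(val) > thresh and (val == patch.max() or val == patch.min()):
                    keypoints.append((u, v, l))
    return keypoints
```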

Page 27: 20130925.deeplearning

Image description with SIFT: Scale Invariant Feature Transform (Lowe 99). Keypoint detection plus histogram-based feature description.

Invariant to rotation and scale changes, robust to illumination changes.

[Figure: DoG images D(u, v, l) → extrema search → SIFT keypoints; around each keypoint, the local gradients are computed and summarized as 8-bin orientation histograms.]

SIFT keypoints → SIFT descriptor: compute the gradient information around each keypoint and turn it into orientation histograms.
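A simplified sketch of this descriptor step: magnitude-weighted orientation histograms over a grid of cells around a keypoint (it omits the orientation normalization and Gaussian weighting of the real SIFT descriptor):

```python
import numpy as np

def orientation_histogram(patch, n_bins=8):
    """8-bin histogram of gradient orientations in a patch, weighted by magnitude."""
    gy, gx = np.gradient(patch.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.arctan2(gy, gx) % (2 * np.pi)           # orientation in [0, 2*pi)
    hist, _ = np.histogram(ang, bins=n_bins, range=(0, 2 * np.pi), weights=mag)
    return hist / (hist.sum() + 1e-12)

def sift_like_descriptor(patch, grid=4, n_bins=8):
    """Concatenate orientation histograms over a grid x grid layout around a
    keypoint (a simplified version of the 4x4x8 = 128-D SIFT descriptor)."""
    h, w = patch.shape
    cells = []
    for i in range(grid):
        for j in range(grid):
            cell = patch[i*h//grid:(i+1)*h//grid, j*w//grid:(j+1)*w//grid]
            cells.append(orientation_histogram(cell, n_bins))
    return np.concatenate(cells)

print(sift_like_descriptor(np.random.rand(16, 16)).shape)   # (128,)
```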

Page 28: 20130925.deeplearning

Image recognition with bag of features: feed the feature descriptors directly to a classifier (bag of visual words) (Csurka+ 04).

http://www.vision.cs.chubu.ac.jp/sift/PDF/sift_tutorial_ppt.pdf
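A sketch of the bag-of-visual-words pipeline; the choice of scikit-learn's KMeans and LinearSVC is mine, and the per-image descriptor sets and labels are assumed to be given:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import LinearSVC

def build_vocabulary(all_descriptors, k=256):
    """Visual vocabulary: k-means over local descriptors (e.g. SIFT) pooled
    from the training images."""
    return KMeans(n_clusters=k, n_init=4, random_state=0).fit(all_descriptors)

def bovw_histogram(descriptors, vocab):
    """Bag-of-visual-words histogram for one image."""
    words = vocab.predict(descriptors)
    hist = np.bincount(words, minlength=vocab.n_clusters).astype(float)
    return hist / (hist.sum() + 1e-12)

def train_bovw_classifier(descriptor_sets, labels, k=256):
    """descriptor_sets: list of (n_i, d) arrays per image; labels: class ids."""
    vocab = build_vocabulary(np.vstack(descriptor_sets), k)
    X = np.array([bovw_histogram(d, vocab) for d in descriptor_sets])
    clf = LinearSVC().fit(X, labels)
    return vocab, clf
```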

Page 29: 20130925.deeplearning

Image description with HOG: Histograms of Oriented Gradients (Dalal & Triggs 05). A representation based on local histograms of edge (gradient) orientations.

Robust to illumination changes; describes regions coarsely.

[Figure: original image I(u, v) → gradient-magnitude image m(u, v) → division into cells and blocks → per-cell orientation histograms → HOG feature V_i → a classifier such as an SVM.]
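A minimal per-cell HOG sketch (unsigned 9-bin orientation histograms; the block normalization that Dalal & Triggs apply over groups of cells is omitted here):

```python
import numpy as np

def hog_cells(img, cell=8, n_bins=9):
    """Per-cell histograms of oriented gradients (unsigned, 0-180 degrees)."""
    gy, gx = np.gradient(img.astype(float))
    mag = np.hypot(gx, gy)                             # gradient magnitude m(u, v)
    ang = np.rad2deg(np.arctan2(gy, gx)) % 180.0       # unsigned orientation
    H, W = img.shape
    ny, nx = H // cell, W // cell
    bins = (ang / (180.0 / n_bins)).astype(int) % n_bins
    hists = np.zeros((ny, nx, n_bins))
    for i in range(ny):
        for j in range(nx):
            b = bins[i*cell:(i+1)*cell, j*cell:(j+1)*cell].ravel()
            m = mag[i*cell:(i+1)*cell, j*cell:(j+1)*cell].ravel()
            hists[i, j] = np.bincount(b, weights=m, minlength=n_bins)
    return hists / (np.linalg.norm(hists, axis=-1, keepdims=True) + 1e-12)

print(hog_cells(np.random.rand(64, 64)).shape)   # (8, 8, 9)
```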

Page 30: 20130925.deeplearning

An NN-style view of the image recognition problem: feature construction based on image properties (edges etc.) + machine learning.

A shallow network model?

[Figure: input → feature detector (Haar, SIFT, HOG, ...) → machine learning (SVM, boosting, ...) → output (Leopard / Cat).]

Page 31: 20130925.deeplearning

From part features to compositional features: moving beyond bag of words.

Toward features built from combinations of part features (Felzenszwalb+ 10, Divvala+ 12).

Hierarchical deep models (cf. Collins & Quillian 1969; hierarchical Bayes; category-based hierarchies): compose hierarchical Bayesian models with deep networks, two influential approaches from unsupervised learning.

Deep networks: learn multiple layers of nonlinearities, trained in unsupervised fashion (unsupervised feature learning, with no need to rely on human-crafted input representations); labeled data is used only to slightly adjust the model for a specific task.

Hierarchical Bayes: explicitly represents category hierarchies for sharing abstract knowledge, and explicitly identifies only a small number of parameters relevant to the new concept being learned.

Part-based hierarchies (Marr & Nishihara 78; Felzenszwalb+ 10).

Page 32: 20130925.deeplearning

Designing the feature extraction stage: how do we build (moderately complex) feature detectors?

Combinations of "tokens" in the sense of Marr (82).

Object parts: continuation, corner, junction, cross.

Hand-crafted features are painful → acquire the representation by machine learning.

Page 33: 20130925.deeplearning

Historical background of NNs and neighboring fields

Timeline, 1990–2010 ("we are here"):

Conv. net (LeCun+ 89)
Boosting (Schapire 90)
SVM (Vapnik 95) — kernel methods
Sparse coding (Olshausen & Field 96) — sparse models
SIFT (Lowe 99)
Bayesian net (Pearl 00) — Bayesian methods
Face detection (Viola & Jones 01)
HOG (Dalal & Triggs 05)
SURF (Bay+ 06)
L1-recovery (Candes+ 06) — sparse models
Deep learning (Hinton+ 06) ← we are here

Surrounding tool families: kernel methods, Bayesian methods, and sparse models (highlighted on this slide).

Page 34: 20130925.deeplearning

Describing data with sparse representations

Express the data as a linear combination of basis vectors, requiring as many coefficients as possible to be zero:

y = \sum_{i}^{M} x_i d_i = x_1 d_1 + x_2 d_2 + x_3 d_3 + ...   (most x_i driven to 0)

The basis vectors {d_i} are determined by learning.

Page 35: 20130925.deeplearning

Describing data with sparse representations

y = \sum_i x_i d_i, with as many coefficients x_i as possible set to 0.

H = \sum_p \left\| y^p - \sum_i x_i^p d_i \right\|^2 + \lambda \sum_i \| x_i^p \|_1

The first term represents the image as faithfully as possible; the second drives as many coefficients as possible to zero (LASSO).

Can {d_i} and {x_i^p} be obtained from the image patches {y^p}?
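One way to answer this in practice is dictionary learning; the sketch below uses scikit-learn's DictionaryLearning, which optimizes essentially this objective (the stand-in patch data and hyper-parameters are placeholders, not values from the slides):

```python
import numpy as np
from sklearn.decomposition import DictionaryLearning

# rows are image patches y^p (e.g. flattened 8x8 patches); random stand-in data here
patches = np.random.rand(500, 64)

# minimize sum_p ||y^p - D x^p||^2 + lambda * ||x^p||_1 over the dictionary D and codes x^p
learner = DictionaryLearning(n_components=100, alpha=1.0, max_iter=20)
codes = learner.fit_transform(patches)     # sparse coefficients x^p
D = learner.components_                    # learned basis vectors d_i (rows)

print(codes.shape, D.shape, np.mean(codes == 0))   # most coefficients are exactly zero
```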

Page 36: 20130925.deeplearning

Feature extraction by sparse coding

Sparse-coding representation of natural images (Olshausen & Field 96): the learned bases resemble the linear response functions of early visual cortex (Anzai+ 99) and Gabor wavelets.

Sparse-coding representation of natural sounds (Terashima & Okada 12): representation of chords.


Slide credit: Andrew Ng

Page 37: 20130925.deeplearning

Sparse Representation for MNIST

60K train, 10K test

Dict.size 512

Linear SVM classification

H = \sum_p \left\| y^p - \sum_i x_i^p d_i \right\|^2 + \lambda \sum_i \| x_i^p \|_1   (λ is the parameter evaluated on the following slides)

Slide credit: Kai Yu

[Pipeline: input → sparse-coding feature → classifier.]
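A sketch of this pipeline with scikit-learn (dictionary size 512 as on the slide; the solver choice, λ scaling, and data loading are my assumptions):

```python
import numpy as np
from sklearn.decomposition import DictionaryLearning
from sklearn.svm import LinearSVC

def sparse_feature_pipeline(X_train, y_train, X_test, n_atoms=512, lam=5e-4):
    """Learn a dictionary of n_atoms bases, encode each image as its sparse
    coefficients, then train a linear SVM on the codes."""
    dl = DictionaryLearning(n_components=n_atoms, alpha=lam,
                            transform_algorithm='lasso_lars', transform_alpha=lam)
    Z_train = dl.fit_transform(X_train)     # sparse codes of the training images
    Z_test = dl.transform(X_test)
    clf = LinearSVC().fit(Z_train, y_train)
    return clf, clf.predict(Z_test)
```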

Page 38: 20130925.deeplearning

Sparse Representation for MNIST

λ = 5×10⁻⁴: the learned bases act as partial (part) detectors.

H = \sum_p \left\| y^p - \sum_i x_i^p d_i \right\|^2 + \lambda \sum_i \| x_i^p \|_1   (λ is the evaluated parameter)

Slide credit: Kai Yu

Page 39: 20130925.deeplearning

Sparse Representation for MNIST

λ = 5×10⁻²: the bases act as partial digit detectors.

H = \sum_p \left\| y^p - \sum_i x_i^p d_i \right\|^2 + \lambda \sum_i \| x_i^p \|_1   (λ is the evaluated parameter)

Slide credit: Kai Yu

Page 40: 20130925.deeplearning

Sparse Representation for MNIST

λ = 5×10⁻⁴: the bases give a VQ-like (vector-quantization-like) representation.

H = \sum_p \left\| y^p - \sum_i x_i^p d_i \right\|^2 + \lambda \sum_i \| x_i^p \|_1   (λ is the evaluated parameter)

Slide credit: Kai Yu

Page 41: 20130925.deeplearning

Sparse Auto Encoder

Predictive Sparse Decomposition (Ranzato+ 07)

Decoder: y^p = D x^p; encoder: x^p = f(W y^p).

Sparse representation {x^p}, input patches {y^p}, L1 constraint:

\min_{D, W, x} \sum_p \| y^p - D x^p \|^2 + \| x^p - f(W y^p) \|^2 + \lambda \sum_i \| x^p_i \|_1
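A loose NumPy sketch in the spirit of this model: a tied sparse autoencoder with code x = relu(W y + b) and reconstruction D x. True PSD also optimizes free codes x^p in alternation with the encoder and decoder; that alternation is omitted here, so this is only a simplified variant:

```python
import numpy as np

def sparse_autoencoder_step(Y, W, b, D, lam=0.1, eta=0.01):
    """One gradient step on ||y - D x||^2 + lam * ||x||_1 with x = relu(W y + b).
    Y: (n, d) batch of patches; W: (k, d); b: (k,); D: (d, k)."""
    U = Y @ W.T + b                 # pre-activations, shape (n, k)
    X = np.maximum(0.0, U)          # sparse codes
    R = X @ D.T                     # reconstructions, shape (n, d)
    E = R - Y                       # reconstruction error
    gD = 2 * E.T @ X                # dL/dD
    gX = 2 * E @ D + lam * np.sign(X)
    gU = gX * (U > 0)               # relu derivative
    gW = gU.T @ Y
    gb = gU.sum(axis=0)
    n = Y.shape[0]
    return W - eta * gW / n, b - eta * gb / n, D - eta * gD / n
```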

Page 42: 20130925.deeplearning

Sparseness + Hierarchy?

Hierarchical Sparse Coding (Yu+ 11)

Deep Belief Network (DBN), Deep Boltzmann Machine (DBM) (Hinton & Salakhutdinov 06)

Hierarchical representation:

[Figure: input patches {y^p} → encoder/decoder → level-1 features → encoder/decoder → level-2 features → ...]

Page 43: 20130925.deeplearning

Sparseness + Hierarchy?

Deep Belief Network (DBN), Deep Boltzmann Machine (DBM) (Hinton & Salakhutdinov 06)

Hierarchical representation: input patches {y^p} → level-1 features → level-2 features.

If the decoders are removed, the stack of encoders operates as a (feed-forward) NN.

Page 44: 20130925.deeplearning

Sparseness + Hierarchy?

Deep Belief Network (DBN), Deep Boltzmann Machine (DBM) (Hinton & Salakhutdinov 06)

Hierarchical representation: input patches {y^p} → level-1 features → level-2 features.

Running the decoders (top-down) derives the optimal (preferred) features.
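A sketch of greedy layer-wise pretraining with simple tied-weight autoencoders (the sizes, learning rate, and sigmoid choice are mine); dropping the decoders leaves the encoder stack as a feed-forward NN, as the slide notes:

```python
import numpy as np

def train_autoencoder(Y, k, iters=200, eta=0.1, seed=0):
    """Train one tied-weight autoencoder layer: code H = sigmoid(Y W.T + b),
    reconstruction H @ W; returns the encoder parameters (W, b)."""
    rng = np.random.default_rng(seed)
    n, d = Y.shape
    W = 0.1 * rng.normal(size=(k, d)); b = np.zeros(k)
    for _ in range(iters):
        H = 1 / (1 + np.exp(-(Y @ W.T + b)))    # codes
        E = H @ W - Y                           # reconstruction error
        gU = (E @ W.T) * H * (1 - H)            # back through the encoder
        gW = (gU.T @ Y + H.T @ E) / n           # tied weights: both paths contribute
        gb = gU.mean(axis=0)
        W -= eta * gW; b -= eta * gb
    return W, b

def greedy_stack(Y, layer_sizes=(64, 32)):
    """Greedy layer-wise pretraining: train a layer, encode, train the next layer
    on the codes; the returned encoders form a feed-forward network."""
    encoders, H = [], Y
    for k in layer_sizes:
        W, b = train_autoencoder(H, k)
        encoders.append((W, b))
        H = 1 / (1 + np.exp(-(H @ W.T + b)))
    return encoders
```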

Page 45: 20130925.deeplearning

Hierarchical CNN + Sparse Coding

Hierarchical classifiers built with sparse coding (Yu+ 11, Zeiler+ 11).

The second-layer bases correspond to rotations and translations.

[Figure: convolutions → subsampling → convolutions → subsampling.]

Page 46: 20130925.deeplearning

Summary

The ideas behind deep learning have existed for a long time.

Both the network architecture and the learning procedure are important.

Why is deep learning popular now?

Performance saturation of shallow networks.

The difficulty of hand-made feature detectors, and the establishment of learning schemes through sparse modeling.

On the supply side, growing computing power opened up the possibilities; on the demand side, the arrival of so-called big data.

Page 47: 20130925.deeplearning

Summary (contd.)

The fine details (probably still) require craftsmanship: setting the sparseness, choosing the number of units for the data, estimating λ from the data (Sasaki, DC researcher).

Better recognition → go deeper → fear of overfitting → cross-validation

→ demand for even more computing power.

A generalized design scheme (something like libsvm?) will probably be needed.

Expectations for feature/representation learning:

Semi-supervised learning — there seem to be quite a few fields where labeled data is expensive.

Integration of multiple devices, image fusion.

Page 48: 20130925.deeplearning

References: Prof. Okatani's slides http://www.vision.is.tohoku.ac.jp/jp/research/

LeCun's website http://yann.lecun.com/

IEEE PAMI special issue http://ieeexplore.ieee.org/xpl/tocresult.jsp?isnumber=6541932

CVPR 2012 deep learning tutorial http://cs.nyu.edu/~fergus/tutorials/deep_learning_cvpr12/

"Neural Circuits and Information Processing" (Kunihiko Fukushima, Asakura Shoten)

ICONIP 2007 Special Session on the Neocognitron

Building deep nets with Python and Theano http://deeplearning.net/