chapter 01 #ml-professional

Chapter 1: Introduction - Machine Learning Professional Series Reading Group, "Deep Learning" edition @a_macbee


TRANSCRIPT

Chapter 1: Introduction
Machine Learning Professional Series Reading Group, "Deep Learning" edition
@a_macbee

Many of the topics in this chapter are explained in later chapters, so re-reading it after finishing the book should deepen your understanding.

Note: this talk does not dig very deeply into the content here.

• 1.1 History of the research

• 1.1.1 Hopes for and disappointment with multilayer neural networks

• 1.1.2 Pretraining of multilayer networks

• 1.1.3 Feature learning

• 1.1.4 The rise of deep learning

• 1.2 Structure of this book (* not covered in this talk)


History of multilayer neural networks

• The history of research on artificial neural networks (hereafter "neural networks") has had its ups and downs

• 1940s: research begins

• 1980s-1990s: a second boom, triggered by the invention of backpropagation (Chapter 4)

• Late 1990s to early 2000s: interest fades again

(Apparently there were several more rises and falls along the way.)

Why did it not catch on?

• Training a neural network with backpropagation stops working well once the network has many layers (Figure 1.1)

• Overfitting, together with the underlying vanishing gradient problem, becomes an issue (see the sketch after this list)

• Convolutional neural networks (CNNs) (Chapter 6) are an exception

• How the design parameters (number of layers, number of units, etc.) contribute to performance was poorly understood
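To make the vanishing-gradient point concrete, here is a minimal NumPy sketch (not from the book or the slides; the layer count, width, and weight scale are arbitrary illustrative choices) that backpropagates a gradient through a stack of sigmoid layers and prints how quickly its norm shrinks:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Forward pass through a stack of sigmoid layers with small random weights.
n_layers, width = 20, 50
weights = [rng.normal(scale=0.5, size=(width, width)) for _ in range(n_layers)]

activations = [rng.normal(size=width)]
for W in weights:
    activations.append(sigmoid(W @ activations[-1]))

# Backward pass: each layer multiplies the gradient by W^T and by the
# sigmoid derivative a*(1-a), which is at most 0.25, so the norm decays.
grad = np.ones(width)  # stand-in for dLoss/dOutput
for W, a in zip(reversed(weights), reversed(activations[1:])):
    grad = W.T @ (grad * a * (1 - a))
    print(f"gradient norm: {np.linalg.norm(grad):.3e}")
```

With enough layers the earliest layers receive gradients many orders of magnitude smaller than the last ones, which is one reason plain backpropagation struggled with deep networks.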


Pretraining of multilayer networks

• Hinton et al.'s deep belief network (DBN) appears (2006)

• The DBN works on different principles from an ordinary neural network, but either way, training becomes difficult once the network is deep

• The network is decomposed into single-layer networks called restricted Boltzmann machines (RBMs) (Chapter 8), which are pretrained one layer at a time (a rough sketch follows below)

   → the network does not overfit even when it is deep

   → each layer's pretraining yields good initial values for its parameters
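The slides do not spell out the algorithm, but as a rough, assumption-laden illustration of RBM-based pretraining, the sketch below trains one binary restricted Boltzmann machine with a single step of contrastive divergence (CD-1); the learning rate, layer sizes, and data are placeholders, and a real DBN would stack several such layers before fine-tuning.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def rbm_cd1(V, n_hidden, lr=0.1, epochs=50):
    """Train one binary RBM on 0/1 data V with CD-1 and return its parameters."""
    n, n_vis = V.shape
    W = rng.normal(scale=0.01, size=(n_vis, n_hidden))
    b_v, b_h = np.zeros(n_vis), np.zeros(n_hidden)
    for _ in range(epochs):
        # Positive phase: hidden probabilities given the data.
        ph = sigmoid(V @ W + b_h)
        h = (rng.random(ph.shape) < ph).astype(float)
        # Negative phase: one Gibbs step back down to the visibles and up again.
        pv = sigmoid(h @ W.T + b_v)
        ph2 = sigmoid(pv @ W + b_h)
        # Update with the difference between data-driven and model-driven statistics.
        W += lr * (V.T @ ph - pv.T @ ph2) / n
        b_v += lr * (V - pv).mean(axis=0)
        b_h += lr * (ph - ph2).mean(axis=0)
    return W, b_h

# Hypothetical usage: pretrain one layer on random binary data, then use the
# hidden probabilities as the input to the next layer's RBM.
V = (rng.random((200, 64)) > 0.5).astype(float)
W1, b1 = rbm_cd1(V, n_hidden=32)
H1 = sigmoid(V @ W1 + b1)
```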

• Pretraining is also possible with autoencoders (Chapter 5), which are simpler than DBNs and RBMs (a sketch of stacking them layer by layer follows below)
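As a companion sketch (again only an illustration under made-up sizes and learning rates, not the book's exact procedure), the code below greedily pretrains a small stack of tied-weight autoencoders, each layer learning to reconstruct the codes produced by the layer below; the resulting weights would serve as initial values before supervised fine-tuning.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def pretrain_layer(X, n_hidden, lr=0.1, epochs=200):
    """Fit one tied-weight autoencoder (sigmoid encoder, linear decoder) on X.
    Returns the learned encoder weights/bias and the hidden codes for X."""
    n, n_in = X.shape
    W = rng.normal(scale=0.1, size=(n_in, n_hidden))
    b, c = np.zeros(n_hidden), np.zeros(n_in)
    for _ in range(epochs):
        H = sigmoid(X @ W + b)        # encode
        R = H @ W.T + c               # decode with the transposed (tied) weights
        E = (R - X) / n               # d(mean reconstruction loss)/dR
        G = (E @ W) * H * (1 - H)     # backprop through the encoder nonlinearity
        W -= lr * (E.T @ H + X.T @ G)
        b -= lr * G.sum(axis=0)
        c -= lr * E.sum(axis=0)
    return W, b, sigmoid(X @ W + b)

# Greedy layer-wise pretraining: each layer reconstructs the codes of the layer
# below, giving good initial weights for a deep network that is then fine-tuned.
X = rng.random((256, 64))             # stand-in for real training data
codes, stack = X, []
for n_hidden in (32, 16):
    W, b, codes = pretrain_layer(codes, n_hidden)
    stack.append((W, b))
print([W.shape for W, _ in stack])
```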


Feature learning

• How can we learn from high-dimensional data (e.g. images, audio) that is strongly biased yet spread out in complex ways?

• Introduce into the autoencoder the idea of sparse coding (Chapter 5), which represents each input as a combination of a small number of basis vectors (a minimal sketch follows below) → through learning, a multilayer network builds up an interesting hierarchical structure

An explanation I found easy to follow (in Japanese): http://d.hatena.ne.jp/takmin/20121224/1356315231
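For intuition about sparse coding, here is a minimal sketch (the dictionary, sizes, penalty weight, and step size are all made-up values) that finds a sparse coefficient vector for one input via ISTA: a gradient step on the reconstruction error followed by soft-thresholding of the coefficients.

```python
import numpy as np

rng = np.random.default_rng(0)

def sparse_code(x, D, lam=0.1, lr=0.05, steps=300):
    """Find a sparse vector a such that D @ a approximates x, by ISTA:
    gradient step on 0.5*||x - D a||^2, then L1 soft-thresholding."""
    a = np.zeros(D.shape[1])
    for _ in range(steps):
        grad = D.T @ (D @ a - x)                          # reconstruction gradient
        a = a - lr * grad
        a = np.sign(a) * np.maximum(np.abs(a) - lr * lam, 0.0)  # shrink toward zero
    return a

# Hypothetical example: an overcomplete random dictionary of 128 unit-norm atoms
# for 64-dimensional inputs; only a handful of coefficients should stay non-zero.
D = rng.normal(size=(64, 128))
D /= np.linalg.norm(D, axis=0)
x = D[:, [3, 40, 99]] @ np.array([1.0, -0.5, 2.0])        # built from 3 atoms
a = sparse_code(x, D)
print("non-zero coefficients:", np.count_nonzero(np.abs(a) > 1e-3))
```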

Example of the hierarchical structure a multilayer network learns

• Units that respond selectively to specific objects

Quoted from "Building High-level Features Using Large Scale Unsupervised Learning": http://static.googleusercontent.com/media/research.google.com/ja//pubs/archive/38115.pdf

[Excerpt from the quoted paper, included in the slides as supporting material: even though no supervisory signals were given during training, the best neuron in the network detects faces with 81.7% accuracy (versus 74% for the best linear filter and 64.8% for guessing all negative), and it is robust to scaling, translation, and out-of-plane rotation; the same network also yields cat-face and human-body detectors with 74.8% and 76.7% accuracy respectively. The paper's figures visualize the optimal stimuli of these neurons and their invariance curves.]


The rise of deep learning

• The effectiveness of deep learning becomes widely recognized

• Various deep learning methodologies by domain:

  • Speech recognition: networks whose layers are fully connected are commonly used (pretraining is often applied)

  • Image recognition: convolutional neural networks are the mainstream; pretraining is rarely used

  • Natural language processing / speech recognition: recurrent neural networks (RNNs) are used

Why are multilayer neural networks useful?

• Real-world problems are complex, so neural networks of a matching scale are needed → and there is now enough data to train them

• The computational power of machines has improved dramatically

When multilayer neural networks were tried on large-scale real-world problems, they showed unexpectedly good performance... but why, really?