deep learning by jskim


Page 1: Deep learning by JSKIM

Deep Learning: History, Present, and Application to Public Health

Jinseob Kim (김진섭)

Genomic Epidemiology (유전체역학)

September 10, 2014

Page 2: Deep learning by JSKIM

What is Deep Learning?

Contents

1 What is Deep Learning?

2 History
   Perceptron
   Multilayer Perceptron
   1st Breakthrough: Unsupervised Learning
   2nd Breakthrough: Supervised Learning

3 Apply to Public Health
   Epidemiology vs Machine Learning
   Deep Learning vs Other ML
   Hypothesis Testing vs Hypothesis Generating

4 Conclusion


Page 3: Deep learning by JSKIM

What is Deep Learning?

Machine Learning

A branch of artificial intelligence that develops prediction models so that computers can learn from data and make predictions.

Computer science + Statistics ??

Amazon, Google, Facebook..


Page 4: Deep learning by JSKIM

What is Deep Learning?

Neural Network

Human brain VS Computer

3431 × 3324 = ??

Telling dogs from cats, speech recognition, character recognition

Sequential VS Parallel


Page 5: Deep learning by JSKIM

What is Deep Learning?

Neuron & Artificial Neural Network(ANN)[19]

Figure. (A) Human neuron; (B) artificial neuron or hidden unit; (C) biological synapse; (D) ANN synapses.


Page 6: Deep learning by JSKIM

What is Deep Learning?

http://www.nd.com/welcome/whatisnn.htm


Page 7: Deep learning by JSKIM

What is Deep Learning?

Deep Neural Network (DNN) ≈ Deep Learning


Page 8: Deep learning by JSKIM

What is Deep Learning?

Global IT companies focus on 'machine learning' http://www.dt.co.kr/contents.html?article_no=2014062002010960718002

The world rides an artificial-intelligence boom, a $6 trillion blue ocean, and Korea misses out http://vip.mk.co.kr/news/view/21/20/1178659.html

The MS cloud gets smarter with 'machine learning' http://www.bloter.net/archives/196341

Five emerging key technologies and 'deep learning' http://www.wikitree.co.kr/main/news_view.php?id=157174

Google's Manhattan Project for the age of artificial intelligence http://weekly.chosun.com/client/news/viw.asp?nNewsNumb=002311100009&ctcd=C02


Page 9: Deep learning by JSKIM

History

Contents

1 What is Deep Learning?

2 History
   Perceptron
   Multilayer Perceptron
   1st Breakthrough: Unsupervised Learning
   2nd Breakthrough: Supervised Learning

3 Apply to Public Health
   Epidemiology vs Machine Learning
   Deep Learning vs Other ML
   Hypothesis Testing vs Hypothesis Generating

4 Conclusion


Page 10: Deep learning by JSKIM

History Perceptron

Perceptron

Proposed by Rosenblatt in 1958 [23].

y = \varphi\left(\sum_{i=1}^{n} w_i x_i + b\right) \qquad (1)

(b: bias, \varphi: activation function, e.g. logistic or tanh)

Figure. Concept of a perceptron [Honkela]
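A minimal sketch of equation (1), assuming a logistic activation; the inputs and weights below are made up only to make the notation concrete.

import numpy as np

def perceptron(x, w, b, phi=lambda z: 1.0 / (1.0 + np.exp(-z))):
    """y = phi(sum_i w_i * x_i + b); phi defaults to the logistic function."""
    return phi(np.dot(w, x) + b)

# Hypothetical inputs and weights, purely illustrative.
x = np.array([1.0, 0.0, 1.0])
w = np.array([0.5, -0.3, 0.8])
print(perceptron(x, w, b=-0.2))   # a value in (0, 1)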

Page 11: Deep learning by JSKIM

History Perceptron

Low Performance

It cannot even solve XOR [Hinton].


Page 12: Deep learning by JSKIM

History Multilayer Perceptron

Multilayer Perceptron

Adding hidden layers solves it!!


Page 13: Deep learning by JSKIM

History Multilayer Perceptron

Learning Problem

More hidden layers → more weights to learn..

1985: Error Backpropagation Algorithm [24]

Gradient descent methods, working backwards from the output layer..


Page 14: Deep learning by JSKIM

History Multilayer Perceptron

Gradient Descent Methods

There are far too many weights..

Linear regression: least squares / maximum likelihood allow exact calculation.

MLP: no exact method


Page 15: Deep learning by JSKIM

History Multilayer Perceptron

Gradient Descent Algorithm[Han-Hsing]

(a) Large Gradient (b) Small Gradient

(c) Small Learning Rate (d) Large Learning Rate


Page 16: Deep learning by JSKIM

History Multilayer Perceptron

Example[Hinton]

A toy example to illustrate the iterative method
• Each day you get lunch at the cafeteria.
  – Your diet consists of fish, chips, and ketchup.
  – You get several portions of each.
• The cashier only tells you the total price of the meal.
  – After several days, you should be able to figure out the price of each portion.
• The iterative approach: start with random guesses for the prices and then adjust them to get a better fit to the observed prices of whole meals.


Page 17: Deep learning by JSKIM

History Multilayer Perceptron

Solving the equations iteratively

• Each meal price gives a linear constraint on the prices of the portions:
  price = x_fish w_fish + x_chips w_chips + x_ketchup w_ketchup
• The prices of the portions are like the weights of a linear neuron: w = (w_fish, w_chips, w_ketchup).
• We will start with guesses for the weights and then adjust the guesses slightly to give a better fit to the prices given by the cashier.


Page 18: Deep learning by JSKIM

History Multilayer Perceptron

The true weights used by the cashier (a linear neuron)
  portions: 2 of fish, 5 of chips, 3 of ketchup
  true prices (weights): fish = 150, chips = 50, ketchup = 100
  price of meal = 2·150 + 5·50 + 3·100 = 850 = target


Page 19: Deep learning by JSKIM

History Multilayer Perceptron

A model of the cashier with arbitrary initial weights
  initial weights: fish = 50, chips = 50, ketchup = 50
  portions: 2 of fish, 5 of chips, 3 of ketchup → price of meal = 500
• Residual error = 850 − 500 = 350
• The "delta-rule" for learning is: Δw_i = ε x_i (t − y)
• With a learning rate ε of 1/35, the weight changes are +20, +50, +30.
• This gives new weights of 70, 100, 80.
  – Notice that the weight for chips got worse!


Page 20: Deep learning by JSKIM

History Multilayer Perceptron

Deriving the delta rule

•  Define the error as the squared residuals summed over all training cases:

•  Now differentiate to get error derivatives for weights

•  The batch delta rule changes the weights in proportion to their error derivatives summed over all training cases

E = \frac{1}{2} \sum_{n \in \text{training}} (t^n - y^n)^2

\frac{\partial E}{\partial w_i} = \frac{1}{2} \sum_n \frac{\partial y^n}{\partial w_i} \frac{dE^n}{dy^n} = -\sum_n x_i^n (t^n - y^n)

\Delta w_i = -\varepsilon \frac{\partial E}{\partial w_i} = \varepsilon \sum_n x_i^n (t^n - y^n)
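A small sketch of the iterative delta rule applied to the cafeteria example above (true prices 150/50/100, portions 2/5/3, initial guesses 50/50/50, learning rate 1/35); the numbers come from the slides, the code itself is only an illustration.

import numpy as np

x = np.array([2.0, 5.0, 3.0])           # portions of fish, chips, ketchup
true_w = np.array([150.0, 50.0, 100.0])  # the cashier's true prices
t = np.dot(true_w, x)                    # target price = 850

w = np.array([50.0, 50.0, 50.0])         # arbitrary initial guesses
eps = 1.0 / 35.0                         # learning rate

for step in range(20):
    y = np.dot(w, x)                     # predicted price (500 on the first step)
    w += eps * x * (t - y)               # delta rule: dw_i = eps * x_i * (t - y)
                                         # first update: +20, +50, +30 -> 70, 100, 80

print(w)   # converges to weights that reproduce the meal price (not necessarily the true prices)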


Page 21: Deep learning by JSKIM

History Multilayer Perceptron

Backpropagation Algorithm[Kim]

(e) Forward Propagation (f) Back Propagation


Page 22: Deep learning by JSKIM

History Multilayer Perceptron

Limitations of MLP[Kim]

1 Vanishing gradient problem

2 Typically requires lots of labeled data

3 Overfitting problem: Given limited amounts of labeled data, training via back-propagation does not work well

4 Get stuck in local minima (?)


Page 23: Deep learning by JSKIM

History Multilayer Perceptron

Vanishing Gradient[2]

Figure. Sigmoid functions
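A short numeric illustration of the problem (not from the slides): the logistic sigmoid's derivative is at most 0.25, so back-propagated gradients shrink multiplicatively with depth.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)          # maximum value is 0.25, at z = 0

# Product of layer-wise derivatives for a chain of 10 sigmoid layers
# (pre-activations drawn at random): the signal reaching the first layer is tiny.
rng = np.random.default_rng(0)
z = rng.normal(size=10)
print(np.prod(sigmoid_grad(z)))   # far below 0.25**10, i.e. vanishingly small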


Page 24: Deep learning by JSKIM

History Multilayer Perceptron

Local Minima[Kim]

Figure. Global and Local Minima


Page 25: Deep learning by JSKIM

History 1st Breakthrough: Unsupervised Learning

1st Breakthrough: Unsupervised Learning

2006: Restricted Boltzmann Machine, Deep Belief Network, Deep Boltzmann Machine [25, 13]..

Figure. Description of Unsupervised Learning[Kim]


Page 26: Deep learning by JSKIM

History 1st Breakthrough: Unsupervised Learning

Limitations of MLP[Kim]

1 Vanishing gradient problem
   → Solved by bottom-up layerwise unsupervised pre-training

2 Typically requires lots of labeled data

3 Overfitting problem: Given limited amounts of labeled data, training via back-propagation does not work well
   → Solved by using lots of unlabeled data

4 Get stuck in local minima (?)
   → Unsupervised pre-training may help the network initialize with good parameters


Page 27: Deep learning by JSKIM

History 1st Breakthrough: Unsupervised Learning

Restricted Boltzmann Machine(RBM)

The lower the energy, the higher the probability:

P(v, h) = \frac{1}{Z} \exp(-E(v, h))

(Z: normalizing constant)

Figure. Diagram of a restricted Boltzmann machine [Wikipedia]


Page 28: Deep learning by JSKIM

History 1st Breakthrough: Unsupervised Learning

Energy Function

E(v, h) = -\sum_i a_i v_i - \sum_j b_j h_j - \sum_i \sum_j h_j w_{i,j} v_i = -a^T v - b^T h - h^T W v

(a_i: offset of the visible variable, b_j: offset of the hidden variable, w_{i,j}: weight between v_i and h_j)
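A direct transcription of the energy function into code; a sketch only, assuming binary unit values and the parameter shapes implied above.

import numpy as np

def rbm_energy(v, h, a, b, W):
    """E(v, h) = -a^T v - b^T h - sum_ij v_i w_ij h_j, with W[i, j] = w_ij."""
    return -np.dot(a, v) - np.dot(b, h) - v @ W @ h

def unnormalized_prob(v, h, a, b, W):
    # P(v, h) is proportional to exp(-E(v, h)); dividing by Z would normalize it.
    return np.exp(-rbm_energy(v, h, a, b, W))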


Page 29: Deep learning by JSKIM

History 1st Breakthrough: Unsupervised Learning

Goal

Find the weights that maximize P(v) = \sum_h P(v, h) for the training data v.

E(v, h) = -\sum_i a_i v_i - \sum_j b_j h_j - \sum_i \sum_j h_j w_{i,j} v_i = -a^T v - b^T h - h^T W v

In other words, the intent is to increase the weights where h and v are switched on together.

Synapses that activate together become wired together.


Page 30: Deep learning by JSKIM

History 1st Breakthrough: Unsupervised Learning

Hebb’s Law (Hebbian Learning Rule)

http://www.skewsme.com/behavior.htm

http://lesswrong.com/lw/71x/a_crash_course_in_the_neuroscience_of_human/l


Page 31: Deep learning by JSKIM

History 1st Breakthrough: Unsupervised Learning

Training RBM

Find the weights that maximize P(v) = \sum_h P(v, h) for the training data v.

Gradient Ascent

\log P(v) = \log\left(\frac{\sum_h \exp(-E(v,h))}{Z}\right)
          = \log\left(\sum_h \exp(-E(v,h))\right) - \log Z
          = \log\left(\sum_h \exp(-E(v,h))\right) - \log\left(\sum_{v,h} \exp(-E(v,h))\right)


Page 32: Deep learning by JSKIM

History 1st Breakthrough: Unsupervised Learning

\frac{\partial \log P(v)}{\partial \theta} = -\frac{1}{\sum_h \exp(-E(v,h))} \sum_h \exp(-E(v,h)) \frac{\partial E(v,h)}{\partial \theta} + \frac{1}{\sum_{v,h} \exp(-E(v,h))} \sum_{v,h} \exp(-E(v,h)) \frac{\partial E(v,h)}{\partial \theta}

= -\sum_h p(h|v) \frac{\partial E(v,h)}{\partial \theta} + \sum_{v,h} p(h,v) \frac{\partial E(v,h)}{\partial \theta}


Page 33: Deep learning by JSKIM

History 1st Breakthrough: Unsupervised Learning

P(v|h) = \prod_{i=1}^{m} P(v_i|h)

P(h|v) = \prod_{j=1}^{n} P(h_j|v)

p(h_j = 1|v) = \sigma\left(b_j + \sum_{i=1}^{m} w_{i,j} v_i\right)

p(v_i = 1|h) = \sigma\left(a_i + \sum_{j=1}^{n} w_{i,j} h_j\right)

(\sigma: activation function)


Page 34: Deep learning by JSKIM

History 1st Breakthrough: Unsupervised Learning

\frac{\partial \log P(v)}{\partial \theta} = -\sum_h p(h|v) \frac{\partial E(v,h)}{\partial \theta} + \sum_{v,h} p(h,v) \frac{\partial E(v,h)}{\partial \theta}

Solved by sampling with a modified Gibbs sampler.


Page 35: Deep learning by JSKIM

History 1st Breakthrough: Unsupervised Learning

Figure. Contrastive Divergence (CD-k) [7]
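A sketch of one contrastive-divergence (CD-1) weight update for a binary RBM, assembled from the conditional probabilities on the previous slide; the variable names, learning rate, and single-sample form are assumptions for illustration, not the authors' code.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd1_update(v0, a, b, W, lr=0.1, rng=None):
    """One CD-1 step for a binary RBM; W[i, j] connects visible v_i to hidden h_j."""
    if rng is None:
        rng = np.random.default_rng()
    # Positive phase: p(h_j = 1 | v) = sigmoid(b_j + sum_i w_ij v_i)
    ph0 = sigmoid(b + v0 @ W)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    # Negative phase: reconstruct v, then recompute hidden probabilities
    pv1 = sigmoid(a + h0 @ W.T)          # p(v_i = 1 | h)
    v1 = (rng.random(pv1.shape) < pv1).astype(float)
    ph1 = sigmoid(b + v1 @ W)
    # Approximate gradient of log P(v): data statistics minus model statistics
    W += lr * (np.outer(v0, ph0) - np.outer(v1, ph1))
    a += lr * (v0 - v1)
    b += lr * (ph0 - ph1)
    return W, a, b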


Page 36: Deep learning by JSKIM

History 1st Breakthrough: Unsupervised Learning

Deep Belief Network[11, 12, 1]

1 Multiple RBM

2 Phoneme → Word → Grammar, Sentence

3 Generation is also possible!!!

http://www.cs.toronto.edu/~hinton/adi/index.htm


Page 37: Deep learning by JSKIM

History 2nd Breakthrough: Supervised Learning

2nd Breakthrough: Supervised Learning

1 Vanishing gradient problem
   → Solved by a new non-linear activation: the rectified linear unit (ReLU)

2 Typically requires lots of labeled data
   → Solved by big data & crowdsourcing

3 Overfitting problem: Given limited amounts of labeled data, training via back-propagation does not work well
   → Solved by new regularization methods: dropout, dropconnect, etc.

4 Get stuck in local minima (?)


Page 38: Deep learning by JSKIM

History 2nd Breakthrough: Supervised Learning

Rectified Linear Unit (ReLU)

Figure. The proposed non-linearity, ReLU, and the standard neural network non-linearity, logistic [30]


Page 39: Deep learning by JSKIM

History 2nd Breakthrough: Supervised Learning

Advantages

1 Whenever the input is above 0, the gradient is constant at 1, so it never shrinks.

2 Training is easy.

3 It removes the need for pre-training [20, 8].


Page 40: Deep learning by JSKIM

History 2nd Breakthrough: Supervised Learning

DropOut & DropConnect

Ensemble Model

DropOut: randomly deactivates a fraction of the hidden units [14].

DropConnect: randomly deactivates a fraction of the connections into the hidden units [28].
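A tiny sketch of the difference for a single fully connected layer, assuming a drop probability of p = 0.5 (the near-optimal rate reported on the next slide); the inputs, weights, and ReLU activation are placeholders, and the usual rescaling by 1/(1-p) is omitted.

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=5)           # inputs to the layer
W = rng.normal(size=(5, 4))      # weights into 4 hidden units
p = 0.5                          # drop probability

# DropOut: zero out entire hidden units
unit_mask = (rng.random(4) >= p).astype(float)
h_dropout = unit_mask * np.maximum(0.0, x @ W)

# DropConnect: zero out individual connections (weights) instead
weight_mask = (rng.random((5, 4)) >= p).astype(float)
h_dropconnect = np.maximum(0.0, x @ (weight_mask * W))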


Page 41: Deep learning by JSKIM

History 2nd Breakthrough: Supervised Learning

Figure. Description of DropOut & DropConnect[Wan]


Page 42: Deep learning by JSKIM

History 2nd Breakthrough: Supervised Learning

Figure. Using the MNIST dataset, (a) the ability of DropOut and DropConnect to prevent overfitting as the size of the two fully connected layers increases; (b) varying the drop rate in a 400-400 network shows near-optimal performance around p = 0.5 [28]


Page 43: Deep learning by JSKIM

History 2nd Breakthrough: Supervised Learning

Local Minima Issue

High dimension and non-convex optimization

1 The values of the different local minima are likely to be similar.

2 Local minima ≈ global minimum.

3 In very high dimensions, it is hard for a point to be a local minimum in every dimension at once.


Page 44: Deep learning by JSKIM

History 2nd Breakthrough: Supervised Learning

Local minima are all similar, there are long plateaus, it can take long to break symmetries.

Optimization is not the real problem when:
– the dataset is large
– units do not saturate too much
– a normalization layer is used

Figure. Local minima in high-dimensional, non-convex optimization [Ranzato]


Page 45: Deep learning by JSKIM

History 2nd Breakthrough: Supervised Learning

Others: Convolutional Neural Network

Sparse connectivity & shared weights: well suited to 2-dimensional data [documentation]
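A minimal sketch of what sparse connectivity and shared weights mean for 2-D data: every output pixel is computed from a small neighbourhood using the same 3x3 kernel. The image and kernel values are arbitrary illustrations, not taken from the slides.

import numpy as np

def conv2d(image, kernel):
    """Valid 2-D convolution with one shared kernel (no padding, stride 1)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for r in range(out_h):
        for c in range(out_w):
            out[r, c] = np.sum(image[r:r + kh, c:c + kw] * kernel)
    return out

image = np.arange(36, dtype=float).reshape(6, 6)
kernel = np.array([[1.0, 0.0, -1.0]] * 3)    # a simple vertical-edge filter
print(conv2d(image, kernel).shape)            # (4, 4): sparse, weight-sharing connections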


Page 46: Deep learning by JSKIM

History 2nd Breakthrough: Supervised Learning

http://parse.ele.tue.nl/education/cluster0


Page 47: Deep learning by JSKIM

History 2nd Breakthrough: Supervised Learning

http://eblearn.sourceforge.net/old/demos/mnist/index.shtml

http://yann.lecun.com/exdb/lenet/


Page 48: Deep learning by JSKIM

History 2nd Breakthrough: Supervised Learning

Deep Learning Summary!!!

1 Research on artificial neural networks, which began with the perceptron in the 1950s, advanced in the 1980s when the Error Backpropagation Algorithm made it possible to train the multilayer perceptron.

2 Because vanishing gradients, the shortage of labeled data, overfitting, and the local minima issue were not well resolved, neural network research stagnated until the early 2000s.

3 From 2006, unsupervised learning methods based on Boltzmann machines were developed: the Restricted Boltzmann Machine (RBM), Deep Belief Network (DBN), Deep Boltzmann Machine (DBM), Convolutional Deep Belief Network, and others.

4 Pre-training with unlabeled data became possible, overcoming the limitations of the multilayer perceptron mentioned above.

5 From 2010, the active use of big data made vast amounts of labeled data available, and the rectified linear unit (ReLU), DropOut, DropConnect, and related techniques resolved the vanishing gradient and overfitting issues, so that purely supervised learning became feasible.

6 A consensus formed that the local minima issue is not a major concern in high-dimensional, non-convex optimization.


Page 49: Deep learning by JSKIM

Apply to Public Health

Contents

1 What is Deep Learning?

2 History
   Perceptron
   Multilayer Perceptron
   1st Breakthrough: Unsupervised Learning
   2nd Breakthrough: Supervised Learning

3 Apply to Public Health
   Epidemiology vs Machine Learning
   Deep Learning vs Other ML
   Hypothesis Testing vs Hypothesis Generating

4 Conclusion


Page 50: Deep learning by JSKIM

Apply to Public Health Epidemiology vs Machine Learning

Objective of statistics

1 Expanding knowledge, causal inference
   The statistician Pearson: to prove Darwin's theory of evolution..

2 Decision making
   The statistician R. A. Fisher: choosing the best-performing fertilizer


Page 51: Deep learning by JSKIM

Apply to Public Health Epidemiology vs Machine Learning

Statistics in Epidemiology

Causal inference: what is the cause?

A model that can be interpreted well is what counts. Inferring causal relationships.

Simple models are preferred.

Even the units of the independent variables matter (kilometers vs meters, centering issues)

β, Odds Ratio(OR), Hazard Ratio(HR), p-value, AIC


Page 52: Deep learning by JSKIM

Apply to Public Health Epidemiology vs Machine Learning

Statistics in Machine Learning

Prediction: what will happen in the future?

Whatever predicts best is what counts.

Complex models are fine, as long as they predict efficiently and well.

Independent variables are transformed freely as needed (scale changes).

Y , p, Cross-validation, Accuracy, ROC curve
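For contrast with the epidemiologic workflow, a sketch of the machine-learning workflow listed here, using scikit-learn cross-validation on a simulated binary outcome; the data, model choice, and library are placeholders for illustration.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                               # simulated predictors
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=200) > 0).astype(int)

model = LogisticRegression()
acc = cross_val_score(model, X, y, cv=5, scoring="accuracy")
auc = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
print(acc.mean(), auc.mean())   # judged by held-out prediction, not by p-values for beta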


Page 53: Deep learning by JSKIM

Apply to Public Health Epidemiology vs Machine Learning

Example: Logistic regression

A powerful statistical method for handling binomial data.

It holds an almost unquestioned position, especially in epidemiologic studies.

β → Odds Ratio (OR): easy to interpret.

But..

The logit function... is what makes the computation difficult.

The heritability issue for binomial traits?? The logit function is the culprit..

The probit model can be an alternative.

Computation is easy, but β is hard to interpret..


Page 54: Deep learning by JSKIM

Apply to Public Health Epidemiology vs Machine Learning

Logit VS Probit

Figure. Logit VS Probit

Logit: \Pr(Y = 1 \mid X) = [1 + e^{-X'\beta}]^{-1}

Probit: \Pr(Y = 1 \mid X) = \Phi(X'\beta)
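A quick numeric sketch comparing the two link functions, with scipy's normal CDF playing the role of Φ; the grid of linear-predictor values is arbitrary and purely illustrative.

import numpy as np
from scipy.stats import norm

xb = np.linspace(-4, 4, 9)                   # the linear predictor X'beta
logit_p = 1.0 / (1.0 + np.exp(-xb))          # Pr(Y = 1 | X) under the logit link
probit_p = norm.cdf(xb)                      # Pr(Y = 1 | X) under the probit link
for a, b, c in zip(xb, logit_p, probit_p):
    print(f"{a:+.1f}  logit={b:.3f}  probit={c:.3f}")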


Page 55: Deep learning by JSKIM

Apply to Public Health Epidemiology vs Machine Learning

Example2: Cox proportional hazard model

The standard for analyzing censored data.

http://www.theriac.org/DeskReference/viewDocument.php?id=188


Page 56: Deep learning by JSKIM

Apply to Public Health Epidemiology vs Machine Learning

http://www.uni-kiel.de/psychologie/rexrepos/posts/survivalCoxPH.html


Page 57: Deep learning by JSKIM

Apply to Public Health Epidemiology vs Machine Learning

Assumptions

\ln \lambda(t) = \ln \lambda_0(t) + \beta_1 X_1 + \cdots + \beta_p X_p = \ln \lambda_0(t) + X\beta

\lambda(t) = \lambda_0(t)\, e^{\beta_1 X_1 + \cdots + \beta_p X_p} = \lambda_0(t)\, e^{X\beta}

S(t) = S_0(t)^{\exp(X\beta)} = \exp\left(-\Lambda_0(t)\, e^{X\beta}\right)

\Lambda(t) = \Lambda_0(t)\, e^{X\beta}

\frac{\lambda(t)}{\lambda_0(t)} = e^{X\beta}

e^{\beta}: Hazard Ratio (HR)
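A small numeric check of the proportionality assumption written above: with a Weibull baseline hazard (an arbitrary choice made only for this illustration), the ratio λ(t)/λ0(t) = e^{Xβ} does not depend on t.

import numpy as np

def baseline_hazard(t, shape=1.5, scale=2.0):
    """An arbitrary Weibull baseline hazard lambda_0(t), used only for illustration."""
    return (shape / scale) * (t / scale) ** (shape - 1.0)

beta = np.array([0.7, -0.3])
x = np.array([1.0, 2.0])                       # one subject's covariates

t = np.linspace(0.5, 5.0, 10)
hazard = baseline_hazard(t) * np.exp(x @ beta)
print(np.unique(np.round(hazard / baseline_hazard(t), 6)))   # constant across t
print(np.exp(x @ beta))                                       # the same number: exp(X beta)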


Page 58: Deep learning by JSKIM

Apply to Public Health Epidemiology vs Machine Learning

Hazard Ratio

Easy to interpret, on a par with the odds ratio.

But many assumptions go into it.

The formula is complicated, so computation is hard.

Conditional Logistic Regression..

There is no need to insist on Cox for prediction, either.


Page 59: Deep learning by JSKIM

Apply to Public Health Epidemiology vs Machine Learning

Alternatives

Y_i: time of event

Not censored

p(y_i \mid \mu_i, \sigma^2) = (2\pi\sigma^2)^{-\frac{1}{2}} \exp\left\{-\frac{(y_i - \mu_i)^2}{2\sigma^2}\right\}

Censored

p(y_i \geq t_i \mid \mu_i, \sigma^2) = \int_{t_i}^{\infty} (2\pi\sigma^2)^{-\frac{1}{2}} \exp\left\{-\frac{(y_i - \mu_i)^2}{2\sigma^2}\right\} dy_i = \Phi\left(\frac{\mu_i - t_i}{\sigma}\right)

Expressed simply through the normal CDF → computation is easy!!
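The censored-normal alternative above in code: a sketch of the log-likelihood for right-censored event times under a normal model, using the normal CDF Φ exactly as in the formula; the simulated data and parameter values are placeholders.

import numpy as np
from scipy.stats import norm

def censored_normal_loglik(mu, sigma, times, event):
    """Events contribute the normal density; censored cases contribute P(Y >= t) = Phi((mu - t)/sigma)."""
    ll_event = norm.logpdf(times[event == 1], loc=mu, scale=sigma)
    ll_cens = norm.logcdf((mu - times[event == 0]) / sigma)
    return ll_event.sum() + ll_cens.sum()

rng = np.random.default_rng(0)
y = rng.normal(loc=10.0, scale=2.0, size=100)    # latent event times
c = rng.uniform(8.0, 14.0, size=100)             # censoring times
times = np.minimum(y, c)
event = (y <= c).astype(int)
print(censored_normal_loglik(10.0, 2.0, times, event))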


Page 60: Deep learning by JSKIM

Apply to Public Health Epidemiology vs Machine Learning

Example3: Correlation Structure

Should the correlation structure be taken into account?

1 Epidemiology: Important
   The s.e. of β changes → the p-value changes.

2 Prediction model: Not important
   β itself does not change much → the predicted Y and p barely change.
   Correlation structure: an unmeasured effect → what was not measured cannot be used when predicting on new data.


Page 61: Deep learning by JSKIM

Apply to Public Health Epidemiology vs Machine Learning

Figure. A representation of the tradeoff between flexibility and interpretability, using different statistical learning methods. In general, as the flexibility of a method increases, its interpretability decreases [16]


Page 62: Deep learning by JSKIM

Apply to Public Health Epidemiology vs Machine Learning

Ted Chiang

[This slide reproduces the full text of the essay: Ted Chiang, "Catching crumbs from the table" — In the face of metahuman science, humans have become metascientists. Futures, Nature 405, 517 (1 June 2000).]


Page 63: Deep learning by JSKIM

Apply to Public Health Epidemiology vs Machine Learning

Human VS metahuman[4]

Ted Chiang: a science fiction writer

The overwhelming information-processing power of the metahumans (artificial intelligence).

Human science: reduced to merely interpreting what the metahumans have discovered.

Translating the metahumans' papers is what human science amounts to..


Page 64: Deep learning by JSKIM

Apply to Public Health Deep Learning vs Other ML

Deep Learning vs Other ML

Multiple Hidden Layer: High flexibility

Massive Parallel Computing

Programming language for GPU/parallel computing

CUDA(Compute Unified Device Architecture), OpenCL[21, 26]


Page 65: Deep learning by JSKIM

Apply to Public Health Deep Learning vs Other ML

Examples: Cat recognition

16,000 CPU cores

Recognizing cats from pictures alone (unsupervised learning)

Computing time reduced by using GPUs.

http://www.asiae.co.kr/news/view.htm?idxno=2012062708351993171

http://googleblog.blogspot.kr/2012/06/using-large-scale-brain-simulations-for.html


Page 66: Deep learning by JSKIM

Apply to Public Health Deep Learning vs Other ML

Paper[18, 5]

[This slide reproduces the first pages of the two papers: Le et al., "Building High-level Features Using Large Scale Unsupervised Learning" (ICML 2012) — a 9-layer locally connected sparse autoencoder with 1 billion connections, trained on 10 million unlabeled 200x200 images using 1,000 machines (16,000 cores) for three days, which learns face and cat-face detectors without labels and reaches 15.8% accuracy on 22,000 ImageNet categories; and Coates et al., "Deep learning with COTS HPC systems" (ICML 2013) — a GPU cluster with Infiniband interconnects and MPI that trains 1-billion-parameter networks on just 3 machines and scales to over 11 billion parameters on 16 machines.]


Page 67: Deep learning by JSKIM

Apply to Public Health Hypothesis Testing vs Hypothesis Generating

Hypothesis Testing vs Hypothesis Generating

Figure. Hypothesis-testing and Hypothesis-generating paradigms[3]


Page 68: Deep learning by JSKIM

Apply to Public Health Hypothesis Testing vs Hypothesis Generating


Page 69: Deep learning by JSKIM

Apply to Public Health Hypothesis Testing vs Hypothesis Generating


Page 70: Deep learning by JSKIM

Apply to Public Health Hypothesis Testing vs Hypothesis Generating


Page 71: Deep learning by JSKIM

Conclusion

Contents

1 What is Deep Learning?

2 History
   Perceptron
   Multilayer Perceptron
   1st Breakthrough: Unsupervised Learning
   2nd Breakthrough: Supervised Learning

3 Apply to Public Health
   Epidemiology vs Machine Learning
   Deep Learning vs Other ML
   Hypothesis Testing vs Hypothesis Generating

4 Conclusion


Page 72: Deep learning by JSKIM

Conclusion

Conclusion

Deep Learning is at the core of Mobile Health.

Mobile data: unstructured data such as images, audio, and text.

A parallel computing system needs to be built.

Prediction vs Inference

Understanding concept of Machine Learning

Hypothesis Generating

Paradigm shift: Causal inference → Big data & Prediction


Page 73: Deep learning by JSKIM

Conclusion

Reference I

[1] Bengio, Y. (2009). Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1):1–127.

[2] Bengio, Y., Simard, P., and Frasconi, P. (1994). Learning long-term dependencies with gradient descent is difficult. Neural Networks, IEEE Transactions on, 5(2):157–166.

[3] Biesecker, L. G. (2013). Hypothesis-generating research and predictive medicine. Genome research, 23(7):1051–1053.

[4] Chiang, T. (2000). Catching crumbs from the table. Nature, 405(6786):517–517.

[5] Coates, A., Huval, B., Wang, T., Wu, D., Catanzaro, B., and Andrew, N. (2013). Deep learning with COTS HPC systems. In Proceedings of The 30th International Conference on Machine Learning, pages 1337–1345.

[documentation] DeepLearning.net documentation. Convolutional neural networks (LeNet). http://deeplearning.net/tutorial/lenet.html.

[7] Fischer, A. and Igel, C. (2012). An introduction to restricted boltzmann machines. In Progress in Pattern Recognition,Image Analysis, Computer Vision, and Applications, pages 14–36. Springer.

[8] Glorot, X., Bordes, A., and Bengio, Y. (2011). Deep sparse rectifier networks. In Proceedings of the 14th International Conference on Artificial Intelligence and Statistics. JMLR W&CP Volume, volume 15, pages 315–323.

[Han-Hsing] Han-Hsing, T. [ml, python] gradient descent algorithm (revision 2).http://hhtucode.blogspot.kr/2013/04/ml-gradient-descent-algorithm.html.

[Hinton] Hinton, G. Coursera: Neural networks for machine learning. https://class.coursera.org/neuralnets-2012-001.

[11] Hinton, G., Osindero, S., and Teh, Y.-W. (2006). A fast learning algorithm for deep belief nets. Neural computation,18(7):1527–1554.

[12] Hinton, G. E. (2009). Deep belief networks. Scholarpedia, 4(5):5947.

[13] Hinton, G. E. and Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neural networks. Science,313(5786):504–507.

[14] Hinton, G. E., Srivastava, N., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. R. (2012). Improving neural networks bypreventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580.


Page 74: Deep learning by JSKIM

Conclusion

Reference II

[Honkela] Honkela, A. Multilayer perceptrons. https://www.hiit.fi/u/ahonkela/dippa/node41.html.

[16] James, G., Witten, D., Hastie, T., and Tibshirani, R. (2013). An introduction to statistical learning. Springer.

[Kim] Kim, J. 2014 Summer School on Pattern Recognition and Machine Learning. http://prml.yonsei.ac.kr/.

[18] Le, Q. V. (2013). Building high-level features using large scale unsupervised learning. In Acoustics, Speech and SignalProcessing (ICASSP), 2013 IEEE International Conference on, pages 8595–8598. IEEE.

[19] Maltarollo, V. G., Honorio, K. M., and da Silva, A. B. F. (2013). Applications of artificial neural networks in chemicalproblems.

[20] Nair, V. and Hinton, G. E. (2010). Rectified linear units improve restricted boltzmann machines. In Proceedings of the27th International Conference on Machine Learning (ICML-10), pages 807–814.

[21] Nvidia, C. (2007). Compute unified device architecture programming guide.

[Ranzato] Ranzato, M. Deep learning for vision: Tricks of the trade. www.cs.toronto.edu/~ranzato.

[23] Rosenblatt, F. (1958). The perceptron: a probabilistic model for information storage and organization in the brain. Psychological Review, 65(6):386.

[24] Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1985). Learning internal representations by error propagation. Technical report, DTIC Document.

[25] Smolensky, P. (1986). Information processing in dynamical systems: Foundations of harmony theory.

[26] Stone, J. E., Gohara, D., and Shi, G. (2010). Opencl: A parallel programming standard for heterogeneous computingsystems. Computing in science & engineering, 12(3):66.

[Wan] Wan, L. Regularization of neural networks using dropconnect. http://cs.nyu.edu/~wanli/dropc/.

[28] Wan, L., Zeiler, M., Zhang, S., Cun, Y. L., and Fergus, R. (2013). Regularization of neural networks using dropconnect. InProceedings of the 30th International Conference on Machine Learning (ICML-13), pages 1058–1066.

[Wikipedia] Wikipedia. Restricted Boltzmann machine. http://en.wikipedia.org/wiki/Restricted_Boltzmann_machine.

[30] Zeiler, M. D., Ranzato, M., Monga, R., Mao, M., Yang, K., Le, Q. V., Nguyen, P., Senior, A., Vanhoucke, V., Dean, J.,et al. (2013). On rectified linear units for speech processing. In Acoustics, Speech and Signal Processing (ICASSP), 2013IEEE International Conference on, pages 3517–3521. IEEE.
