Deep Learning Overview (2015-05-09 kistep)

Deep Learning

[email protected]

Introduction

Machine-Learning

Deep learning

Issues

Summary

Deep Learning: A brief explanation

[email protected]

Centre for Digital Music, Queen Mary University of London, UK


1 Introduction

2 Machine-Learning

3 Deep learning: Overview, Nonlinearity, Weights, SGD, Training

4 Issues: Overfitting, Batch processing, Back-propagation, Other architectures, ImageNet

5 Summary


Keunwoo Choi

PhD, QMUL, EECS, c4dm, 2014-present
    Supervised by Mark Sandler and George Fazekas
    Music Recommendation, (Deep) Machine Learning
    Internship, Naver Labs, July-Oct 2015
    Visiting PhD, New York University, July-Dec 2016

ETRI, 2011-2014
    3D Audio (WFS)

Master's, SNU EECS, 2009-2011
    Applied Acoustics Laboratory, 3D Audio, Music Signal Processing

Bachelor's, SNU EECS, 2005-2009


Research Topics

Music Feature Extraction
    Analysis of deep CNNs (ISMIR LDB 2015, MLSP 2016)
    Auto-Tagging using deep CNN (ISMIR 2016)

Playlist Generation
    RNN-based playlist generation (ICML workshop 2016)

Music Captioning

Automatic Composition
    Text-based chords and drums (CSMC 2016)


Machine Learning: More correctly, supervised learning

Given a goal

Given data x, y

Train an algorithm that best matches x → y

and validate it using unseen x (good generalisation): "Do not memorise the examples!"

Conventional approaches: Feature extraction + Classifier
    Researchers and experts hand-craft the features
    A classifier (e.g. SVM) is trained to achieve the goal


Machine Learning: Problems of the conventional approaches

Hand-crafting takes resources

E.g. MFCCs (speech recognition), Histogram of Oriented Gradients, SIFT (computer vision)

Hand-crafting is not automatically optimisable; it is a matter of Jang-in-jeong-sin (craftsmanship).

Is a Jang-in (master artisan) better than machines?


NN vs. DNN [1]

Logistic regression: No hidden layer

Neural Networks: 1 hidden layer

Deep NN: N hidden layers (N>1)

[1] extremetech.com
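
As a rough illustration of the three cases on this slide, the sketch below runs a forward pass through zero, one, and three hidden layers. The helper names (`dense`, `forward`), the ReLU hidden units, the sigmoid output, and all weight values are assumptions made up for this example, not taken from the slides.

```python
import math

def dense(weights, biases, x):
    """One fully connected layer: returns W x + b for a single input vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, biases)]

def relu(v):
    return [max(0.0, a) for a in v]

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def forward(x, hidden_layers, output_layer):
    """hidden_layers is a list of (W, b) pairs; an empty list gives logistic regression."""
    h = x
    for W, b in hidden_layers:
        h = relu(dense(W, b, h))
    W_out, b_out = output_layer
    return sigmoid(dense(W_out, b_out, h)[0])

x = [1.0, -2.0]

# Logistic regression: no hidden layer.
p_logreg = forward(x, [], ([[0.5, 0.25]], [0.0]))

# Neural network: 1 hidden layer with 2 units (illustrative weights).
hidden = ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
p_nn = forward(x, [hidden], ([[1.0, 1.0]], [0.0]))

# Deep NN: N = 3 hidden layers (the same illustrative layer reused three times).
p_dnn = forward(x, [hidden, hidden, hidden], ([[1.0, 1.0]], [0.0]))

print(p_logreg, p_nn, p_dnn)
```

The only structural difference between the three models is the length of the `hidden_layers` list.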


DEMO: TensorFlow Playground

Logistic regression: No hidden layer

Neural Networks: 1 hidden layer

Deep NN: N hidden layers (N>1)

Demo cases: Logistic regression; Logistic regression fails; NN works well!; NN fails; Shallow NN is okay; The bigger, the better

Demo links:

http://goo.gl/fbPSy9

http://goo.gl/s9g39n

http://goo.gl/L3Z6Ax

http://goo.gl/XsBjmp

http://goo.gl/eJkV1L

http://goo.gl/vrWEWj


DL Overview: A motivation for deep learning

Brain and human sensory system

Neurons are identical

Many (100B) identical neurons with suitable structures

Humans learn by examples

Human sensory systems are deep

Parallel and serial neuron structure


DL Overview: A motivation for deep learning

No need to hand-craft features

The black box includes [feature extraction → classifier]

The whole procedure is computationally optimised to achieve the goal

Iterative, computation-heavy methods have outperformed many Jang-ins (master artisans)


Comparison (example task: speech recognition)

Method       Conventional ML                               Deep Learning
Feature      MFCCs (FFT → mel-scale aggregation → DCT →    FFT → NN
             time-derivative → ignore first coeff → ...)
Classifier   SVM, GMM                                      NN

Every computation, parameter, and weight is decided automatically during training


DL: Nonlinearity

A single layer performs a nonlinear mapping using $\sigma$
Let x = input vector, y = output vector,

NN: $y = \sigma_2(W_2 \sigma_1(W_1 x))$

DNN: Stacked (= deep) layers perform a more nonlinear and complex mapping

$y = \sigma_6(W_6 \sigma_5(W_5 \sigma_4(W_4 \sigma_3(W_3 \sigma_2(W_2 \sigma_1(W_1 x))))))$

Stacked layers = stacked nonlinearity! [2]

Without the nonlinearities, multiple linear layers can be compressed into one layer

[2] Best explained in Colah's blog: http://colah.github.io/posts/2014-03-NN-Manifolds-Topology/
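
The point about linear layers collapsing into one can be checked numerically. The sketch below is a minimal illustration with made-up 2x2 weight matrices and hypothetical helper names (`matvec`, `matmul`): two linear layers applied in sequence give the same output as one layer with weights W2 W1, while inserting a ReLU between them breaks that equivalence.

```python
def matvec(W, x):
    """Matrix-vector product W x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(A, B):
    """Matrix-matrix product A B."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def relu(v):
    return [max(0.0, a) for a in v]

# Illustrative weights and input, not taken from the slides.
W1 = [[1.0, -1.0], [2.0, 0.0]]
W2 = [[1.0, 1.0], [0.0, -1.0]]
x = [1.0, 3.0]

# Two linear layers without a nonlinearity...
two_linear = matvec(W2, matvec(W1, x))
# ...equal a single layer whose weight matrix is W2 W1.
one_linear = matvec(matmul(W2, W1), x)

# With a ReLU in between, the mapping is no longer a single matrix product.
with_relu = matvec(W2, relu(matvec(W1, x)))

print(two_linear, one_linear, with_relu)
```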


DL: Weights (= parameters)

NN = nonlinear $\sigma()$ and weights W

For $\sigma()$, we use ReLU and its variants

DNN = Combination of ReLU and many $W_i$'s

We want...
    the network to be trained to do all the dirty work: feature extraction and classification (= $W_i$'s that do what we order them to do)
    the network to learn by examples (= find the optimal W using training data)

How do we train? → SGD
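
For readers who have not met ReLU before, here is a minimal sketch of it and of one common variant, leaky ReLU. The negative-side slope of 0.01 is an assumed default chosen for illustration; the slides do not say which variants or settings were used.

```python
def relu(a):
    """ReLU: passes positive values through and clips negatives to zero."""
    return max(0.0, a)

def leaky_relu(a, slope=0.01):
    """Leaky ReLU: like ReLU, but keeps a small slope for negative inputs."""
    return a if a > 0 else slope * a

inputs = [-2.0, 0.0, 3.0]
relu_out = [relu(a) for a in inputs]
leaky_out = [leaky_relu(a) for a in inputs]
print(relu_out)
print(leaky_out)
```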


Deep Learning: How it learns - by SGD!

SGD: Stochastic Gradient Descent

SGD computationally, iteratively, and gradually finds w so that J(w) is minimised

w is updated to minimise J(w)

$w \leftarrow w - \eta \frac{\partial J(w)}{\partial w}$, where $\eta$ is the learning rate

...if J(w) is differentiable
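
To make the update concrete, the sketch below repeatedly moves w against the derivative of a toy loss. The loss J(w) = (w - 3)^2, the starting point, and the learning rate of 0.1 are all assumptions picked for illustration. It is plain gradient descent on one fixed function; the "stochastic" part of SGD comes from estimating the gradient on randomly chosen training examples.

```python
def J(w):
    """Toy differentiable loss with its minimum at w = 3."""
    return (w - 3.0) ** 2

def dJ_dw(w):
    """Derivative of the toy loss."""
    return 2.0 * (w - 3.0)

w = 0.0
learning_rate = 0.1  # step size, an assumed value
for step in range(100):
    # Move w a small step in the direction that decreases J(w).
    w = w - learning_rate * dJ_dw(w)

print(w, J(w))
```

Each step shrinks the distance to the minimum by a constant factor, which is the "iteratively, gradually" behaviour described on the slide.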



Deep Learning: How it learns - by SGD!

Loss function J(w)

A function that we want to minimise to achieve the goal

$y_{\text{estimation}} = \sigma_4(W_4 \sigma_3(W_3 \sigma_2(W_2 \sigma_1(W_1 x))))$

$y_{\text{true}}$ is given in the dataset

E.g. l2: $J(w) = (y_{\text{estimation}} - y_{\text{true}})^2$

The loss function measures how well the current algorithm is performing
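
A small sketch of the l2 loss for vector outputs follows. The function name `l2_loss` and the example values are hypothetical; summing the squared differences over the output elements is one common convention and is assumed here.

```python
def l2_loss(y_estimation, y_true):
    """Sum of squared differences between estimated and true output vectors."""
    return sum((e - t) ** 2 for e, t in zip(y_estimation, y_true))

y_true = [1.0, 0.0]
perfect = l2_loss([1.0, 0.0], y_true)   # estimate equals the target
worse = l2_loss([0.5, 0.5], y_true)     # estimate is off, so the loss is larger
print(perfect, worse)
```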


Deep Learning: How it learns

We have (a set of) x and $y_{\text{true}}$ (aka dataset)

We decide a loss function

$y_{\text{estimation}} = \sigma_4(W_4 \sigma_3(W_3 \sigma_2(W_2 \sigma_1(W_1 x))))$

$J(w)$ = a function of $(y_{\text{estimation}}, y_{\text{true}})$

w is updated and becomes better weights
= training is performed by SGD
= the DNN is optimised


Deep Learning: The whole learning procedure

Prepare a training dataset (x, y)

Get a DNN configured (number of layers, nodes, loss function)

for many times:
    for every x, y: (do SGD)
        compute y_estimation = f(x, w) (go through the DNN)
        update W according to the current loss, loss(y_true, y_estimation)

Done!
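
The procedure above can be sketched end to end on a toy problem. To keep it short, the "network" here is a single linear unit rather than a DNN, and the dataset (y = 2x + 1), learning rate, epoch count, and random seed are assumptions chosen for illustration.

```python
import random

# Toy dataset: y_true = 2x + 1 (values chosen for illustration).
dataset = [(x, 2.0 * x + 1.0) for x in [-2.0, -1.0, 0.0, 1.0, 2.0]]

def f(x, w):
    """A one-unit linear model with weights w = [slope, bias]."""
    return w[0] * x + w[1]

w = [0.0, 0.0]
learning_rate = 0.05
rng = random.Random(0)  # fixed seed so the run is repeatable

for epoch in range(200):                 # "for many times"
    rng.shuffle(dataset)                 # visit the examples in random order
    for x, y_true in dataset:            # "for every x, y"
        y_estimation = f(x, w)           # go through the model
        error = y_estimation - y_true
        # Gradient of (y_estimation - y_true)^2 with respect to slope and bias.
        w[0] -= learning_rate * 2.0 * error * x
        w[1] -= learning_rate * 2.0 * error

print(w)
```

After training, `w` should be close to the slope and bias that generated the data. A real DNN follows the same loop, with back-propagation supplying the gradients for every layer.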


Break!

Q&A

http://playground.tensorflow.org


Overfitting

When the network memorises the training data and fails to generalise [3]

A general problem in ML

Examples: http://goo.gl/IT0edj, http://goo.gl/W3DQht

[3] cs231n: http://cs231n.github.io/neural-networks-3/
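
An extreme, deliberately simplified picture of overfitting is a model that only memorises its training pairs. The sketch below compares such a memoriser with a one-parameter linear fit on toy data following y = 2x. The data, the model names, and the default answer of 0.0 for unseen inputs are all assumptions made for illustration.

```python
# Toy data following y = 2x (illustrative values).
train = [(0.0, 0.0), (1.0, 2.0), (2.0, 4.0)]
validation = [(3.0, 6.0), (4.0, 8.0)]   # unseen x

# Model A memorises the training pairs and has no answer for anything else.
memory = dict(train)

def memoriser(x):
    return memory.get(x, 0.0)

# Model B fits a single slope by least squares: w = sum(x*y) / sum(x*x).
slope = sum(x * y for x, y in train) / sum(x * x for x, _ in train)

def linear(x):
    return slope * x

def mean_squared_error(model, data):
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)

for name, model in [("memoriser", memoriser), ("linear", linear)]:
    print(name, mean_squared_error(model, train), mean_squared_error(model, validation))
```

Both models are perfect on the training data; only the validation error on unseen x reveals that the memoriser has not generalised.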


Batch Gradient Descent

Compute the gradient over more than one example at a time

Every computation of
$y_{\text{estimation}} = \sigma_4(W_4 \sigma_3(W_3 \sigma_2(W_2 \sigma_1(W_1 x))))$
is done by matrix computations
Quicker on a GPU (because GPUs are specialised for large matrix computations)
Less zig-zag [4]

[4] www.holehouse.org: http://www.holehouse.org/mlclass/17_Large_Scale_Machine_Learning.html
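
The "less zig-zag" effect comes from averaging: a batch gradient is the mean of the per-example gradients, so individual examples pull less erratically. The sketch below shows this for a one-weight model with made-up data and hypothetical function names. Real implementations stack the batch into a matrix so the whole thing runs as a few matrix operations; plain Python loops are used here only to keep the example dependency-free.

```python
def gradient_single(w, x, y_true):
    """Gradient of (w*x - y_true)^2 with respect to the scalar weight w."""
    return 2.0 * (w * x - y_true) * x

def gradient_batch(w, xs, ys):
    """Average gradient over a batch of examples."""
    return sum(gradient_single(w, x, y) for x, y in zip(xs, ys)) / len(xs)

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]   # y_true = 2x, illustrative values
w = 0.0

per_example = [gradient_single(w, x, y) for x, y in zip(xs, ys)]
batch = gradient_batch(w, xs, ys)
print(per_example)   # each example alone suggests a different step size
print(batch)         # one averaged step for the whole batch
```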


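Back-propagation (aka backprop) [5]

The essence of gradient descent in NNs

The way to compute the derivatives of the loss with respect to all weights, $\frac{\partial J(w)}{\partial w}$

so that w can be updated as $w \leftarrow w - \eta \frac{\partial J(w)}{\partial w}$ (with learning rate $\eta$)

Popularised by Rumelhart, Hinton, and Williams (1986)

[5] extremetech.com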
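
As a minimal sketch of what backprop computes, the example below applies the chain rule to a tiny network y = w2 * ReLU(w1 * x) with a squared-error loss, then compares the result with a finite-difference estimate. The network shape, the weight values, and the step `eps` are assumptions chosen for illustration.

```python
def relu(a):
    return max(0.0, a)

def loss(w1, w2, x, y_true):
    """Squared error of the tiny network y = w2 * relu(w1 * x)."""
    return (w2 * relu(w1 * x) - y_true) ** 2

def backprop(w1, w2, x, y_true):
    """Chain rule applied from the loss back to each weight."""
    # Forward pass, keeping intermediate values.
    a = w1 * x
    h = relu(a)
    y = w2 * h
    # Backward pass, one local derivative at a time.
    dJ_dy = 2.0 * (y - y_true)
    dJ_dw2 = dJ_dy * h
    dJ_dh = dJ_dy * w2
    dJ_da = dJ_dh * (1.0 if a > 0 else 0.0)   # derivative of ReLU
    dJ_dw1 = dJ_da * x
    return dJ_dw1, dJ_dw2

w1, w2, x, y_true = 0.5, -1.5, 2.0, 1.0
g1, g2 = backprop(w1, w2, x, y_true)

# Finite-difference check of the same derivatives.
eps = 1e-6
n1 = (loss(w1 + eps, w2, x, y_true) - loss(w1 - eps, w2, x, y_true)) / (2 * eps)
n2 = (loss(w1, w2 + eps, x, y_true) - loss(w1, w2 - eps, x, y_true)) / (2 * eps)
print(g1, n1, g2, n2)
```

The backward pass reuses values stored during the forward pass, which is why all the derivatives cost roughly as much as one extra pass through the network.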


Other architectures

Convolutional Networks
    by LeCun (at Facebook AI Research and NYU)
    Motivated by biological visual systems
    Very widely used in almost every DL problem

Recurrent networks
    For sequences (text) and time-series data (speech, weather, stock prices, ...)


ImageNet competition [6]

14M images in 1K categories
Has made it possible to test new DL algorithms

[6] Slide from NVIDIA: http://www.slideshare.net/NVIDIA/nvidia-ces-2016-press-conference


Resources

Deeplearning4j tutorials (Korean): http://deeplearning4j.org/kr-neuralnet-overview.html

ML lecture on Coursera, Stanford: https://www.coursera.org/learn/machine-learning

cs231n from Stanford: http://cs231n.github.io
