[251] Implementing Deep Learning using cuDNN


Upload: naver-d2

Post on 09-Feb-2017


TRANSCRIPT

Page 1:

Implementing Deep Learning using cuDNN

이예하 (Yeha Lee)

VUNO Inc.

Page 2:

Contents

1. Deep Learning Review
2. Implementation on GPU using cuDNN
3. Optimization Issues
4. Introduction to VUNO-Net

Page 3:

1. Deep Learning Review

Page 4:

Brief History of Neural Network

1943 - Electronic Brain (W. McCulloch & W. Pitts)
• Adjustable weights; weights are not learned

1957 - Perceptron (F. Rosenblatt)
• Learnable weights and threshold (the "Golden Age")

1960 - ADALINE (B. Widrow & M. Hoff)

1969 - XOR Problem (M. Minsky & S. Papert)
• A single-layer perceptron cannot solve XOR; the "Dark Age" ("AI Winter") begins

1986 - Multi-layered Perceptron / Backpropagation (D. Rumelhart, G. Hinton & R. Williams)
• Forward activity, backward error
• Solution to nonlinearly separable problems
• Big computation, local optima and overfitting

1995 - SVM (V. Vapnik & C. Cortes)
• Limitations of learning prior knowledge
• Kernel function: human intervention

2006 - Deep Neural Network / Pretraining (G. Hinton & R. Salakhutdinov)
• Hierarchical feature learning

Page 5:

Machine/Deep Learning is eating the world!

Page 6:

Building Blocks

• Restricted Boltzmann machine
• Auto-encoder
• Deep belief network
• Deep Boltzmann machines
• Generative stochastic networks
• Convolutional neural networks
• Recurrent neural networks

Page 7:

Convolutional Neural Networks

• Receptive Field (Hubel & Wiesel, 1959)
• Neocognitron (Fukushima, 1980)

Page 8:

Convolutional Neural Networks

• LeNet-5 (Yann LeCun, 1998)

Page 9:

Convolutional Neural Networks

• AlexNet (Alex Krizhevsky et al., 2012)
• Clarifai (Matt Zeiler and Rob Fergus, 2013)

Page 10:

ImageNet Challenge

• Detection/Classification and Localization
• 1.2M images for training / 50K for validation
• 1,000 classes

                 Error rate   Remark
Human            5.1%         Karpathy
Google (2014/8)  6.67%        Inception
Baidu (2015/1)   5.98%        Data augmentation
MS (2015/2)      4.94%        SPP, PReLU, weight initialization
Google (2015/2)  4.82%        Batch normalization
Baidu (2015/5)   4.58%        Cheating

Page 11:

Convolutional Neural Networks

Network
• Softmax Layer (Output)
• Fully Connected Layer
• Pooling Layer
• Convolution Layer

Layer
• Input / Output
• Weights
• Neuron activation

VGG (K. Simonyan and A. Zisserman, 2015)

Page 12:

Neural Network

Forward Pass:  Conv Layer → FC Layer → Softmax Layer
Backward Pass: Softmax Layer → FC Layer → Conv Layer

Page 13:

Convolution Layer - Forward

Input (3x3):   Filter (2x2):   Output (2x2):
x1 x4 x7       w1 w3           y1 y3
x2 x5 x8       w2 w4           y2 y4
x3 x6 x9

The filter slides over the input, e.g. y1 = w1*x1 + w3*x4 + w2*x2 + w4*x5.

Page 14:

Convolution Layer - Forward

How to evaluate the convolution layer efficiently? There are several ways to implement convolutions efficiently:

1. Lower the convolutions into a matrix multiplication (cuDNN)
2. Fast Fourier Transform to compute the convolution (cuDNN_v3)
3. Computing the convolutions directly (cuda-convnet)
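To make approach 1 concrete, here is a minimal CPU sketch of the lowering (a hypothetical im2col helper for a single channel, stride 1, no padding; not cuDNN's actual internals). Each k x k receptive field becomes one column, so the whole convolution collapses into a single matrix multiplication:

/* im2col: unroll each k x k window of an h x w input into one column of
   a (k*k) x (out_h*out_w) matrix. The convolution is then one GEMM:
   output_row = filter_row (1 x k*k) * col. */
void im2col(const float* in, int h, int w, int k, float* col) {
    int out_h = h - k + 1, out_w = w - k + 1;
    for (int oy = 0; oy < out_h; ++oy)
        for (int ox = 0; ox < out_w; ++ox)
            for (int fy = 0; fy < k; ++fy)
                for (int fx = 0; fx < k; ++fx)
                    col[(fy * k + fx) * (out_h * out_w) + (oy * out_w + ox)]
                        = in[(oy + fy) * w + (ox + fx)];
}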

Page 15:

Fully Connected Layer - Forward

Inputs x1, x2, x3 map to outputs y1, y2 through weights w11 … w23:

y_j = Σ_i w_ji * x_i   (i.e. y = W x)

• Matrix calculation is very fast on GPU
• cuBLAS library
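A hedged sketch of this forward pass with the cuBLAS v2 API; d_W, d_x, d_y and the dimensions are illustrative names, with W stored column-major as outDim x inDim:

#include <cublas_v2.h>

/* y = W * x on the GPU: a single GEMV call. */
void fc_forward(cublasHandle_t handle, const float* d_W,
                const float* d_x, float* d_y, int outDim, int inDim) {
    const float alpha = 1.0f, beta = 0.0f;
    /* column-major: y = alpha * W * x + beta * y */
    cublasSgemv(handle, CUBLAS_OP_N, outDim, inDim,
                &alpha, d_W, outDim, d_x, 1, &beta, d_y, 1);
}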

Page 16:

Softmax Layer - Forward

y_k = exp(x_k) / Σ_j exp(x_j)

Issues:
1. How to efficiently evaluate the denominator?
2. Numerical instability when x_k is too large or too small

Fortunately, the softmax layer is supported by cuDNN.
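The standard remedy for issue 2 (a well-known trick, not spelled out on the slide) is to shift the inputs by their maximum before exponentiating; the result is unchanged because the common factor cancels:

y_k = exp(x_k - m) / Σ_j exp(x_j - m),   where m = max_j x_j

This is what cuDNN's CUDNN_SOFTMAX_ACCURATE algorithm does internally.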

Page 17:

Learning on Neural Network

Update the network (i.e. the weights) to minimize a loss function.

• Popular loss functions for classification:
  Sum-of-squared error: E = (1/2) Σ_n Σ_k (y_nk - t_nk)^2
  Cross-entropy error:  E = -Σ_n Σ_k t_nk ln y_nk
• Necessary criterion for a minimum: ∇E(w) = 0

Page 18:

Learning on Neural Network

• Due to the complexity of the loss, a closed-form solution is usually not possible
• Use an iterative approach instead:
  Gradient descent: w ← w - η ∇E(w)
  Newton's method:  w ← w - H⁻¹ ∇E(w)   (H: the Hessian of E)
• How to evaluate the gradient? Error backpropagation

Page 19:

Softmax Layer - Backward

Loss function: cross-entropy, E = -Σ_k t_k ln y_k

• The error back-propagated from the softmax layer is simply δ_k = y_k - t_k
• With a one-hot target this means: subtract 1 from the output value at the target index
• This can be done efficiently using GPU threads (see the kernel sketch below)
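A minimal CUDA kernel sketch of that update (illustrative names; one thread per sample, targets stored as class indices):

__global__ void softmaxBackward(float* delta, const float* y,
                                const int* target, int nClasses, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  /* one thread per sample */
    if (i < n) {
        /* start from a copy of the softmax output y ... */
        for (int k = 0; k < nClasses; ++k)
            delta[i * nClasses + k] = y[i * nClasses + k];
        /* ... then subtract 1 at the target class: delta = y - t */
        delta[i * nClasses + target[i]] -= 1.0f;
    }
}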

Page 20:

Fully Connected Layer - Backward

Forward pass: y = W x (inputs x1..x3, outputs y1, y2, weights w11 … w23)

Error:    δ_x = Wᵀ δ_y, multiplied element-wise by the activation derivative when the layer has a nonlinearity
Gradient: ∂E/∂W = δ_y xᵀ

• Matrix calculation is very fast on GPU
• Element-wise multiplication can be done efficiently using GPU threads
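A matching cuBLAS sketch under the same assumptions as the forward example: the back-propagated error is a transposed GEMV, and the weight gradient is a rank-1 update (outer product):

#include <cublas_v2.h>

void fc_backward(cublasHandle_t handle, const float* d_W,
                 const float* d_x, const float* d_dy,
                 float* d_dx, float* d_dW, int outDim, int inDim) {
    const float alpha = 1.0f, beta = 0.0f;
    /* dx = W^T * dy (activation derivative applied separately, element-wise) */
    cublasSgemv(handle, CUBLAS_OP_T, outDim, inDim,
                &alpha, d_W, outDim, d_dy, 1, &beta, d_dx, 1);
    /* dW += dy * x^T: rank-1 update of the column-major outDim x inDim matrix */
    cublasSger(handle, outDim, inDim, &alpha, d_dy, 1, d_x, 1, d_dW, outDim);
}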

Page 21:

Convolution Layer - Backward

Forward: the 2x2 filter

w1 w3
w2 w4

slides over the 3x3 input (x1..x9) to produce the 2x2 output (y1..y4).

Error: the error for the input is obtained by convolving the output error with the filter rotated by 180 degrees,

w4 w2
w3 w1

as a "full" convolution (zero-padding the output error).

Page 22:

Convolution Layer - Backward

Gradient: the filter gradient is likewise obtained by correlating the input (x1..x9) with the back-propagated output error.

• The error and the gradient can both be computed with a convolution scheme
• cuDNN supports this operation

Page 23:

2. Implementation on GPU using cuDNN

Page 24:

Introduction to cuDNN

cuDNN is a GPU-accelerated library of primitives for deep neural networks:

• Convolution forward and backward
• Pooling forward and backward
• Softmax forward and backward
• Neuron activations forward and backward:
  - Rectified linear (ReLU)
  - Sigmoid
  - Hyperbolic tangent (TANH)
• Tensor transformation functions

Page 25:

Introduction to cuDNN (version 2)

cuDNN's convolution routines aim for performance competitive with the fastest GEMM, by lowering the convolutions into a matrix multiplication (Sharan Chetlur et al., 2015).

Page 26:

Introduction to cuDNN

Benchmarks:
https://github.com/soumith/convnet-benchmarks
https://developer.nvidia.com/cudnn

Page 27:

Goal

Learning the VGG model using cuDNN:

• Data Layer
• Convolution Layer
• Pooling Layer
• Fully Connected Layer
• Softmax Layer

Page 28:

Preliminary

Common data structure for a layer:

• Device memory & tensor descriptions for input/output data and error
• A tensor description defines the dimensions of the data

float *d_input, *d_output, *d_inputDelta, *d_outputDelta;
cudnnTensorDescriptor_t inputDesc;
cudnnTensorDescriptor_t outputDesc;

cuDNN initialize & release:

cudaSetDevice( deviceID );
cudnnCreate( &cudnn_handle );
…
cudnnDestroy( cudnn_handle );
cudaDeviceReset();
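Every cuDNN call returns a cudnnStatus_t; a small checking macro (our addition, not from the slides) keeps the layer code readable:

#include <stdio.h>
#include <stdlib.h>
#include <cudnn.h>

/* Abort with file/line info when a cuDNN call fails. */
#define CHECK_CUDNN(call) do { \
    cudnnStatus_t s = (call); \
    if (s != CUDNN_STATUS_SUCCESS) { \
        fprintf(stderr, "cuDNN error %d at %s:%d\n", (int)s, __FILE__, __LINE__); \
        exit(EXIT_FAILURE); \
    } \
} while (0)

/* usage: CHECK_CUDNN(cudnnCreate(&cudnn_handle)); */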

Page 29:

Data Layer

Create & set the tensor descriptor:

cudnnCreateTensorDescriptor(&outputDesc);
cudnnSetTensor4dDescriptor(outputDesc,
    CUDNN_TENSOR_NCHW,
    CUDNN_FLOAT,
    sampleCnt, channels, height, width);

Example: 2 images (3x3x2). In NCHW order the 36 values are laid out sample by sample, channel by channel: sample #1 holds values 1-9 (channel #1) and 10-18 (channel #2); sample #2 holds 19-27 (channel #1) and 28-36 (channel #2).
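The matching device allocation is left implicit on the slides; a minimal sketch:

cudaMalloc((void**)&d_output,
           sampleCnt * channels * height * width * sizeof(float));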

Page 30:

Convolution Layer

1. Initialization

1.1 Create & set Filter Descriptor:
    cudnnCreateFilterDescriptor(&filterDesc);
    cudnnSetFilter4dDescriptor(…);

1.2 Create & set Conv Descriptor:
    cudnnCreateConvolutionDescriptor(&convDesc);
    cudnnSetConvolution2dDescriptor(…);

1.3 Create & set output Tensor Descriptor:
    cudnnGetConvolution2dForwardOutputDim(…);
    cudnnCreateTensorDescriptor(&dstTensorDesc);
    cudnnSetTensor4dDescriptor(…);

1.4 Get Convolution Algorithm:
    cudnnGetConvolutionForwardAlgorithm(…);
    cudnnGetConvolutionForwardWorkspaceSize(…);
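A hedged sketch of how the four steps fit together for one 3x3 conv layer, using the v2-era signatures (argument lists changed in later cuDNN versions); the padding/stride values, and CUDNN_DATA_FLOAT as the enum name in the release headers, are filled in by us:

cudnnCreateFilterDescriptor(&filterDesc);
cudnnSetFilter4dDescriptor(filterDesc, CUDNN_DATA_FLOAT,
                           filterCnt, input_channelCnt, 3, 3);

cudnnCreateConvolutionDescriptor(&convDesc);
cudnnSetConvolution2dDescriptor(convDesc,
                                1, 1,   /* zero-padding h, w */
                                1, 1,   /* stride h, w */
                                1, 1,   /* upscale x, y (unused) */
                                CUDNN_CROSS_CORRELATION);

int n, c, h, w;   /* output dimensions, computed by cuDNN */
cudnnGetConvolution2dForwardOutputDim(convDesc, inputDesc, filterDesc,
                                      &n, &c, &h, &w);
cudnnCreateTensorDescriptor(&outputDesc);
cudnnSetTensor4dDescriptor(outputDesc, CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT,
                           n, c, h, w);

cudnnConvolutionFwdAlgo_t algo;
cudnnGetConvolutionForwardAlgorithm(cudnn_handle, inputDesc, filterDesc,
                                    convDesc, outputDesc,
                                    CUDNN_CONVOLUTION_FWD_PREFER_FASTEST,
                                    0, &algo);
size_t wsSize = 0;
cudnnGetConvolutionForwardWorkspaceSize(cudnn_handle, inputDesc, filterDesc,
                                        convDesc, outputDesc, algo, &wsSize);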

Page 31:

Convolution Layer

1.1 Set Filter Descriptor

cudnnSetFilter4dDescriptor(filterDesc,
    CUDNN_FLOAT,
    filterCnt, input_channelCnt, filter_height, filter_width);

Example: 2 filters (2x2x2). The 16 weights are laid out filter by filter, channel by channel: filter #1 holds values 1-4 (channel #1) and 5-8 (channel #2); filter #2 holds 9-12 (channel #1) and 13-16 (channel #2).

Page 32:

Convolution Layer

1.2 Set Convolution Descriptor

The convolution mode is set to CUDNN_CROSS_CORRELATION; zero-padding and stride are also specified here.

Page 33:

Convolution Layer

1.3 Set output Tensor Descriptor

• cudnnGetConvolution2dForwardOutputDim(…) returns n, c, h and w, the output dimensions
• The tensor description defines the dimensions of the data

Page 34:

Convolution Layer

1.4 Get Convolution Algorithm

cudnnGetConvolutionForwardAlgorithm(…) takes inputDesc and outputDesc together with a preference such as CUDNN_CONVOLUTION_FWD_PREFER_FASTEST, and returns the forward algorithm to use.

Page 35:

Convolution Layer

2. Forward Pass

2.1 Convolution: cudnnConvolutionForward(…);
2.2 Activation:  cudnnActivationForward(…);
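Filled in with the v2-era signatures (a sketch: d_filter, d_workspace and wsSize are assumed to come from the initialization step; bias addition via cudnnAddTensor is omitted):

const float alpha = 1.0f, beta = 0.0f;
/* 2.1 convolution: d_output = conv(d_input, filter) */
cudnnConvolutionForward(cudnn_handle, &alpha,
                        inputDesc, d_input, filterDesc, d_filter,
                        convDesc, algo, d_workspace, wsSize,
                        &beta, outputDesc, d_output);
/* 2.2 activation, applied in place on d_output */
cudnnActivationForward(cudnn_handle, CUDNN_ACTIVATION_RELU, &alpha,
                       outputDesc, d_output, &beta, outputDesc, d_output);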

Page 36:

Convolution Layer

2.1 Convolution

cudnnConvolutionForward(…) reads d_input (described by inputDesc) and writes d_output (described by outputDesc).

Page 37:

Convolution Layer

2.2 Activation

cudnnActivationForward(…) applies sigmoid, tanh or ReLU to d_output (described by outputDesc) in place.

Page 38:

Convolution Layer

3. Backward Pass

3.1 Activation Backward:   cudnnActivationBackward(…);
3.2 Calculate Gradient:    cudnnConvolutionBackwardFilter(…);
3.3 Error Backpropagation: cudnnConvolutionBackwardData(…);
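The corresponding v2-era calls as a sketch (later versions add algorithm/workspace arguments to the backward-filter and backward-data routines); d_output is reused for both the activation input and output here, which is valid for ReLU:

const float alpha = 1.0f, beta = 0.0f;
/* 3.1 d_outputDelta *= f'(activation), computed in place */
cudnnActivationBackward(cudnn_handle, CUDNN_ACTIVATION_RELU, &alpha,
                        outputDesc, d_output,         /* y          */
                        outputDesc, d_outputDelta,    /* dy         */
                        outputDesc, d_output,         /* x (reused) */
                        &beta, outputDesc, d_outputDelta);  /* dx   */
/* 3.2 filter gradient from the input and the back-propagated error */
cudnnConvolutionBackwardFilter(cudnn_handle, &alpha,
                               inputDesc, d_input,
                               outputDesc, d_outputDelta, convDesc,
                               &beta, filterDesc, d_filterGradient);
/* 3.3 error for the layer below */
cudnnConvolutionBackwardData(cudnn_handle, &alpha,
                             filterDesc, d_filter,
                             outputDesc, d_outputDelta, convDesc,
                             &beta, inputDesc, d_inputDelta);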

Page 39:

Convolution Layer

3.1 Activation Backward

• The error back-propagated from layer l+1 (d_outputDelta) is multiplied element-wise by the derivative of the activation function
• See slide 22 (Convolution Layer - Backward)

cudnnActivationBackward(…) reads d_output and d_outputDelta (both described by outputDesc) and overwrites d_outputDelta.

Page 40:

Convolution Layer

3.2 Calculate Gradient

cudnnConvolutionBackwardFilter(…) reads d_input (inputDesc) and d_outputDelta (outputDesc) and writes d_filterGradient (filterDesc).

Page 41:

Convolution Layer

3.3 Error Backpropagation

cudnnConvolutionBackwardData(…) reads d_outputDelta (outputDesc) and writes d_inputDelta (inputDesc).

Page 42:

Pooling Layer / Softmax Layer

Pooling Layer

1. Initialization:
   cudnnCreatePoolingDescriptor(&poolingDesc);
   cudnnSetPooling2dDescriptor(…);
2. Forward Pass:  cudnnPoolingForward(…);
3. Backward Pass: cudnnPoolingBackward(…);

Softmax Layer

Forward Pass: cudnnSoftmaxForward(…);
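A sketch for 2x2 max pooling with stride 2 plus the softmax forward pass (v2-era signatures; CUDNN_SOFTMAX_ACCURATE applies the max-subtraction trick from slide 16):

const float alpha = 1.0f, beta = 0.0f;
cudnnCreatePoolingDescriptor(&poolingDesc);
cudnnSetPooling2dDescriptor(poolingDesc, CUDNN_POOLING_MAX,
                            2, 2,    /* window h, w  */
                            0, 0,    /* padding h, w */
                            2, 2);   /* stride h, w  */
cudnnPoolingForward(cudnn_handle, poolingDesc, &alpha,
                    inputDesc, d_input, &beta, outputDesc, d_output);

cudnnSoftmaxForward(cudnn_handle, CUDNN_SOFTMAX_ACCURATE,
                    CUDNN_SOFTMAX_MODE_CHANNEL, &alpha,
                    inputDesc, d_input, &beta, outputDesc, d_output);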

Page 43:

3. Optimization Issues

Page 44:

Optimization

Learning Very Deep ConvNets

We know that deep ConvNets can be trained without pre-training, thanks to:
• weight sharing
• sparsity
• rectifier units

But "with fixed standard deviations very deep models have difficulties to converge" (Kaiming He et al., 2015):
• e.g. random initialization from a Gaussian dist. with std 0.01
• more than 8 convolution layers

Page 45:

Optimization

VGG (K. Simonyan and A. Zisserman, 2015); GoogleNet (C. Szegedy et al., 2014)

Page 46:

Optimization

Initialization of Weights for Rectifiers (Kaiming He et al., 2015)

• Analyzes the variance of the response in each layer
• Sufficient condition that the gradient is not exponentially large/small: Var[w_l] = 2 / n_l
• Standard deviation for initialization: std = sqrt(2 / n_l), with n_l = (spatial filter size)² x (filter count)
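A one-liner reproducing the numbers on the next slide (our sketch of the He et al. formula):

#include <math.h>
/* He et al. (2015) initialization: std = sqrt(2 / (k*k*c)),
   where k is the spatial filter size and c the filter count.
   he_std(3, 64) ≈ 0.059, he_std(3, 512) ≈ 0.021. */
float he_std(int k, int c) { return sqrtf(2.0f / (k * k * c)); }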

Page 47:

Optimization

Case study: 3x3 filters

Filter count   std = sqrt(2 / (3² x count))
64             0.059
128            0.042
256            0.029
512            0.021

• When a fixed std of 0.01 is used instead, the std of the gradient propagated back from conv10 toward the early conv layers shrinks at every layer: error vanishing

Page 48:

Speed

Data loading & model learning

• Reducing data loading and augmentation time
• Data provider thread (dp_thread)
• Model learning thread (worker_thread)

readData() {
    /* wait for the data-provider thread prefetching the current batch */
    if (is_dp_thread_running)
        pthread_join(dp_thread, NULL);
    …
    /* start prefetching the next batch in the background */
    if (is_data_remain)
        pthread_create(&dp_thread, NULL, data_provider, NULL);  /* data_provider: illustrative name */
}

readData();
for (…) {
    readData();                                          /* overlaps I/O with learning */
    pthread_create(&worker_thread, NULL, worker, NULL);  /* worker: illustrative name */
    …
    pthread_join(worker_thread, NULL);
}

Page 49:

Speed

Multi-GPU

• Data parallelization vs. model parallelization:
  Model parallelism: distribute the model, use the same data
  Data parallelism: distribute the data, use the same model
• Data parallelization & gradient averaging:
  One of the easiest ways to use multiple GPUs
  The result is the same as with a single GPU

Page 50:

Parallelism

Data Parallelization
  The good: easy to implement
  The bad: the cost of synchronization increases with the number of GPUs

Model Parallelization
  The good: larger networks can be trained
  The bad: synchronization is necessary in all layers

Page 51:

Parallelism

Mixing Data Parallelization and Model Parallelization (Krizhevsky, 2014)

Page 52:

Speed

Data Parallelization & Gradient Averaging

Single GPU: data (256) → Forward Pass → Backward Pass → Gradient → Update
Two GPUs:   data1 (128) and data2 (128) each run Forward Pass → Backward Pass → Gradient; the two gradients are then averaged, and the same Update is applied on both replicas
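A hedged sketch of the averaging step for two GPUs (illustrative only: averageKernel is a hypothetical element-wise kernel, and real code also needs per-device streams and synchronization):

/* pull GPU1's gradients over to GPU0, average, and push the result back,
   so both replicas apply the identical update */
cudaMemcpyPeer(d_grad_remote, 0, d_grad_gpu1, 1, gradBytes);
averageKernel<<<blocks, threads>>>(d_grad_gpu0, d_grad_remote, gradCnt); /* g0 = (g0+g1)/2 */
cudaMemcpyPeer(d_grad_gpu1, 1, d_grad_gpu0, 0, gradBytes);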

Page 53:

4. Introduction to VUNO-Net

Page 54:

The Team

Page 55:

VUNO-Net

Theano, Caffe, Cuda-ConvNet, Pylearn, RNNLIB, DL4J, Torch … VUNO-Net

Page 56:

VUNO-Net

Features:

Structure: Convolution, LSTM, MD-LSTM (2D), Pooling, Spatial Pyramid Pooling, Fully Connection, Concatenation
Output:    Softmax, Regression, Connectionist Temporal Classification
Learning:  Stochastic Gradient Descent, Dropout, Data Augmentation, Batch Normalization, Parametric Rectifier, Initialization for Rectifier, Multi-GPU Support

(Legend on the slide marks each feature as Open Source or VUNO-net Only.)

Page 57:

Performance

State-of-the-art performance on image & speech:

Speech Recognition (TIMIT*):      VUNO 84.3% vs. WR1 82.2%
Image Classification (CIFAR-100): VUNO 82.1% vs. Kaggle Top 1 (2014) 79.3%

* TIMIT is one of the most popular benchmark datasets for speech recognition (Texas Instruments - MIT)
WR1 (World Record) - "Speech Recognition with Deep Recurrent Neural Networks", Alex Graves, ICASSP (2013)
WR2 (World Record) - Kaggle competition: https://www.kaggle.com/c/cifar-10

Page 58:

Application

We've achieved record-breaking performance on medical image analysis.

Page 59:

Application

Whole Lung Quantification (normal vs. emphysema)

• Sliding window: pixel-level classification
• But context information is more important
• Ongoing works: MD-LSTM, Recurrent CNN

Page 60:

Application

Whole Lung Quantification - Example #1: Original Image (CT) / VUNO / Gold Standard

Page 61:

Application

Whole Lung Quantification - Example #2: Original Image (CT) / VUNO / Gold Standard

Page 62:

Visualization

SVM features (22 features): Histogram, Gradient, Run-length, Co-occurrence matrix, Cluster analysis, Top-hat transformation

Page 63:

Visualization

Activation of the top hidden layer (200 hidden nodes)

Page 64:

Visualization

Page 65:

We Are Hiring!!

Algorithm Engineer, CUDA Programmer, Application Developer, Staff Member, Business Developer

Page 66:

Q&A

Page 67:

Thank You