Deep Learning: A Primer and Hands-On Practice
肖达
Agenda
• A two-minute tour of DL fundamentals
• DL in Action with GPU/Theano/Pylearn2
Deep Learning in a Nutshell
• Constraints on the black box
– Depth: multiple layers of nonlinear information processing
Input => L1 => L2 => … => Ln => Output
– Learning: the internal structure emerges through learning
Raw data → understanding of the data (representation) and judgment (classification)
Machine learning and feature representation
• Each layer extracts features from the output of the layer below it
• From the raw data all the way to the classifier, every layer has essentially the same structure
• The features of all layers are learned from data
Hierarchical feature learning
(figure: data → Layer 1 → Layer 2 → Layer 3 → Simple Classifier)
(figure, the traditional pipeline: images / video / speech → hand-designed feature extraction → trainable classifier → object classification)
Taking supervised learning as an example: given a training set (x_i, y_i), a neural network provides a nonlinear hypothesis model h_{W,b}(x) with parameters W and b, which is fit to the data.
This "neuron" is a computational unit that takes x1, x2, x3 and an intercept term +1 as its inputs; its output is h_{W,b}(x) = f(W·x + b) = f(W1·x1 + W2·x2 + W3·x3 + b).
The function f is called the "activation function".
Here we choose the sigmoid function f(z) = 1 / (1 + e^(-z)) as the activation function.
A single neuron & logistic regression (LR)
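As a concrete illustration, here is a minimal NumPy sketch (weights and inputs are made-up values, not from the talk) of a single sigmoid neuron, which is exactly a logistic regression unit:

import numpy as np

def sigmoid(z):
    # the activation function f(z) = 1 / (1 + e^-z)
    return 1.0 / (1.0 + np.exp(-z))

def neuron_output(x, W, b):
    # weighted sum of the inputs plus the intercept term, passed through f
    return sigmoid(np.dot(W, x) + b)

x = np.array([0.5, -1.0, 2.0])   # inputs x1, x2, x3 (made-up values)
W = np.array([0.1, 0.4, -0.2])   # weights (made-up values)
b = 0.3                          # weight of the +1 intercept term
print(neuron_output(x, W, b))    # a value in (0, 1) -- exactly logistic regression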
• A neural network is simply many individual "neurons" connected together
• The parameters of the network in the figure are W^(1), b^(1), W^(2), b^(2)
Neural networks
Rumelhart et al., Nature, 1986
The back-propagation algorithm (Back-Prop)
a = f(W^(1) x)
h = softmax(W^(2) a)
• For each training sample, compute the gradient of the loss function (the discrepancy between the actual and the desired output) with respect to every parameter
• Apply the chain rule of differentiation
J = -log h

∂J/∂W^(2) = (∂J/∂h) · (∂h/∂W^(2))

∂J/∂W^(1) = (∂J/∂h) · (∂h/∂a) · (∂a/∂W^(1))
The learning procedure
• 1. Forward-propagate to get the activations
• 2. Compare with the target to get the loss
• 3. Back-propagate to correct the weights
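The three steps above, written out as a minimal NumPy sketch (toy sizes and values chosen for illustration, not the talk's code) for the two-layer model a = f(W^(1) x), h = softmax(W^(2) a) with loss J = -log h, using exactly the chain rule shown on the previous slide:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.RandomState(0)
W1 = rng.randn(5, 4) * 0.1        # hypothetical sizes: 4 inputs, 5 hidden units, 3 classes
W2 = rng.randn(3, 5) * 0.1
x = rng.randn(4)
y = 1                             # index of the correct class

# 1. forward propagation
a = sigmoid(W1.dot(x))            # a = f(W1 x)
h = softmax(W2.dot(a))            # h = softmax(W2 a)

# 2. loss: J = -log h[y]
J = -np.log(h[y])

# 3. back-propagation via the chain rule
dz2 = h.copy(); dz2[y] -= 1.0     # gradient of J w.r.t. W2 a (softmax + log loss)
dW2 = np.outer(dz2, a)            # dJ/dW2
da = W2.T.dot(dz2)                # dJ/da
dz1 = da * a * (1.0 - a)          # through the sigmoid: f'(z) = f(z)(1 - f(z))
dW1 = np.outer(dz1, x)            # dJ/dW1

# gradient-descent update (learning rate is a made-up choice)
lr = 0.01
W2 -= lr * dW2
W1 -= lr * dW1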
• The data acquisition problem
– Training relies on labeled data, which is usually scarce
• The local-minima problem
– Many layers of nonlinearity make training a highly non-convex optimization problem; it is very easy to get stuck in a bad local minimum
• The vanishing-gradient problem
– When the network is deep, the gradient decays severely by the time it reaches the early layers; those layers cannot be trained effectively and training is very slow
Problems in training deep neural networks
Agenda
• A two-minute tour of DL fundamentals
• DL in Action with GPU/Theano/Pylearn2
What you need
• An off-the-shelf PC with a 650 W+ power supply
• A GPU (GTX 580/780/Titan)
• Familiarity with Linux, Python and NumPy
• Total cost < ¥8k
– "DL is no longer the privilege of the rich and handsome; GPU + Theano is the gospel for us ordinary folks." — Xingyuan
Open source libraries
• Theano
– by Bengio's group @ U. Montreal
– transparent use of the GPU
– automatic gradient computation (see the minimal sketch after this list)
• pylearn2
– a high-level DL library built on Theano
– contains most building blocks needed for DL experiments
– contains a cuda-convnet wrapper
• cuda-convnet
– by Hinton's group @ U. Toronto
– VERY fast C++/CUDA implementation of convolutional neural networks
– not a user-friendly library
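A minimal Theano sketch of the two features mentioned above (variable names are made up; it requires a working Theano install): the cost is written once as a symbolic expression, T.grad derives its gradient automatically, and the compiled function runs on the GPU when Theano is configured with device=gpu and floatX=float32.

import numpy as np
import theano
import theano.tensor as T

x = T.vector('x')                                # symbolic input
w = theano.shared(np.asarray([0.1, -0.2, 0.3],   # made-up initial weights
                             dtype=theano.config.floatX), name='w')

# a toy cost, written once as a symbolic expression
cost = T.sum(T.nnet.sigmoid(T.dot(w, x)) ** 2)

# automatic gradient computation
grad_w = T.grad(cost, w)

# compiled function; runs on the GPU when THEANO_FLAGS=device=gpu,floatX=float32
f = theano.function([x], [cost, grad_w])
print(f(np.asarray([1.0, 2.0, 3.0], dtype=theano.config.floatX)))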
How to launch a DL experiment in 5 minutes
• Specify 3 things in my_exp.yaml
– dataset
– model
– training algorithm
• exec "train.py my_exp.yaml"
An example YAML file
dataset: &train !obj:pylearn2.datasets.cifar10.CIFAR10 {
    toronto_prepro: True, which_set: 'train', one_hot: 1,
    axes: ['c', 0, 1, 'b'], start: 0, stop: 40000 },
model: !obj:pylearn2.models.mlp.MLP {
    batch_size: 128,
    input_space: !obj:pylearn2.space.Conv2DSpace {
        shape: [32, 32], num_channels: 3, axes: ['c', 0, 1, 'b'] },
    layers: [
        !obj:pylearn2.models.maxout.MaxoutConvC01B {
            layer_name: 'conv1', pad: 2, num_channels: 32, num_pieces: 1,
            kernel_shape: [5, 5], pool_shape: [3, 3], pool_stride: [2, 2],
            irange: .01, min_zero: True, tied_b: True, max_kernel_norm: 9.9 },
        !obj:pylearn2.models.maxout.MaxoutConvC01B {
            layer_name: 'conv2',
            # ... (remaining layer definitions truncated on the slide)
    ] },
algorithm: !obj:pylearn2.training_algorithms.sgd.SGD {
    batch_size: 128, learning_rate: .01, init_momentum: .9,
    monitoring_dataset: {
        'valid' : !obj:pylearn2.datasets.cifar10.CIFAR10 {
            toronto_prepro: True, axes: ['c', 0, 1, 'b'], which_set: 'train',
            one_hot: 1, start: 40000, stop: 50000 } },
    cost: !obj:pylearn2.costs.cost.SumOfCosts { costs: [
        !obj:pylearn2.costs.cost.MethodCost { method: 'cost_from_X' },
        !obj:pylearn2.costs.mlp.WeightDecay { coeffs: [ .002, .002, .002, .002 ] } ] },
    termination_criterion: !obj:pylearn2.termination_criteria.MonitorBased {
        channel_name: "valid_y_misclass", prop_decrease: 0., N: 10 }
}
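After training finishes, the saved model can be reloaded and used for prediction. A rough sketch, assuming a save_path such as 'my_exp.pkl' was added to the YAML (the file above omits it) and the same GPU setup is available:

import numpy as np
import theano
from pylearn2.utils import serial

model = serial.load('my_exp.pkl')                # hypothetical save_path
X = model.get_input_space().make_theano_batch()  # symbolic input in ('c', 0, 1, 'b') layout
Y = model.fprop(X)                               # symbolic forward pass through the MLP
predict = theano.function([X], Y.argmax(axis=1))

# a dummy batch: (channels, rows, cols, batch) = (3, 32, 32, 128)
batch = np.zeros((3, 32, 32, 128), dtype=theano.config.floatX)
print(predict(batch))                            # predicted class index for each image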
Dataset: CIFAR-10
• 60,000 32x32 colour images in 10 classes
• 50,000 training images and 10,000 test images
Rethinking datasets
• Deep neural nets are data-hungry beasts
• The current mainstream way of preparing labeled datasets for training deep nets is costly and unnatural
– It is difficult to prepare a sufficient amount of data, leaving the net starved
Network architecture
• A baseline 3-layer convolutional network
– conv1: 32x32, 32-channel output
– pool1: 16x16, 32-channel output
– conv2: 16x16, 32-channel output
– pool2: 8x8, 32-channel output
– conv3: 8x8, 64-channel output
– pool3: 4x4, 64-channel output
– fully-connected (softmax): 10 outputs
• Similar to the visual cortex hierarchy?
Statistics
• #neurons
– 32x32x32 + 16x16x32 + 16x16x32 + 8x8x32 + 8x8x64 + 4x4x64 + 10 = 56,330
• #free parameters (learnable weights)
– 5*5*3*32 + 5*5*32*32 + 5*5*32*64 + 4*4*64*10 = 89,440
• #samples seen
– 40k samples * 32 epochs = 1.28m (see the check after this list)
• Run time
– 32 epochs * 10 s/epoch = 320 s on a GTX 780 GPU (~66x faster than CPU, ~4x faster than Theano's GPU implementation)
• Result
– test error = 26%
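The counts follow directly from the layer shapes; a quick arithmetic check in Python:

# neurons: the feature-map outputs of each conv/pool stage plus the 10 softmax outputs
neurons = 32*32*32 + 16*16*32 + 16*16*32 + 8*8*32 + 8*8*64 + 4*4*64 + 10
print(neurons)              # 56330

# free parameters: 5x5 kernels of the three conv layers, then the 4x4x64 -> 10 softmax
params = 5*5*3*32 + 5*5*32*32 + 5*5*32*64 + 4*4*64*10
print(params)               # 89440

# samples seen: 40k training images, 32 passes over the data
print(40000 * 32)           # 1280000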
Does BEING DEEP matter?
• Test error after 10 epochs
– 0 conv layers: 70%
– 1 conv layer: 35%
– 2 conv layers: 30%
– 3 conv layers: 28%
• Further performance gains are possible with better hyperparameter tuning
Improving performance with tricks
• Test error
– Baseline: 26.1%
– Weight decay: -1.5%
– Learning rate decay: -2.5%
– Final: 22.1%
Note: We do not use the most effective trick for image data, i.e. data augmentation
Learning curve
The effect of overfitting
Filters learned by 1st conv layer
• How to visualize filters learned by higher layers?
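For the first conv layer the kernels act directly on RGB pixels, so they can simply be drawn as tiny images. A rough sketch, assuming the trained model was saved to 'my_exp.pkl' and that it exposes get_weights_topo() as pylearn2's convolutional models generally do (pylearn2's show_weights.py script does essentially this):

import numpy as np
import matplotlib.pyplot as plt
from pylearn2.utils import serial

model = serial.load('my_exp.pkl')           # hypothetical save path
W = model.get_weights_topo()                # assumed: first-layer kernels, one per row
for i in range(min(32, W.shape[0])):
    f = W[i].astype(np.float64)
    f = (f - f.min()) / (f.max() - f.min() + 1e-8)   # rescale to [0, 1] for display
    plt.subplot(4, 8, i + 1)
    plt.imshow(f)
    plt.axis('off')
plt.show()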
Which layer matures first?
How to get SOTA results (Maxout Networks, ICML'13)
• Substantially improve the result by:
– data preprocessing: ZCA whitening (sketched after this list)
– max pooling among linear hidden units
– dropout training
– a 60x larger net, 90x longer training time
• Test error
– 22.1% -> 14.5%
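ZCA whitening, the preprocessing step listed above, decorrelates nearby pixels while keeping the images visually recognisable; pylearn2 provides it as a dataset preprocessor. The underlying transform is roughly the following (a NumPy sketch, not pylearn2's exact implementation):

import numpy as np

def zca_whiten(X, eps=1e-2):
    # X: one flattened image per row; eps is a small regularizer (typical value)
    X = X - X.mean(axis=0)                   # remove the per-pixel mean
    cov = np.dot(X.T, X) / X.shape[0]        # pixel covariance matrix
    U, S, _ = np.linalg.svd(cov)
    W = U.dot(np.diag(1.0 / np.sqrt(S + eps))).dot(U.T)   # symmetric whitening matrix
    return np.dot(X, W)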
Raw vs preprocessed images
A new model in 11 lines of code

import theano.tensor as T
from pylearn2.models.maxout import Maxout

class Hypercolumn(Maxout):
    def __init__(self, hcol_size, **kwargs):
        super(Hypercolumn, self).__init__(**kwargs)
        self.hcol_size = hcol_size      # number of units per competing group

    def fprop(self, state_below):
        # ordinary Maxout forward pass
        p = super(Hypercolumn, self).fprop(state_below)
        # arrange the outputs into groups of hcol_size units
        w = p.reshape((p.shape[0], p.shape[1] // self.hcol_size, self.hcol_size))
        # within each group keep only the maximum activation, zero the rest
        hcol_max = w.max(axis=2).dimshuffle(0, 1, 'x') * T.ones_like(w)
        w = w * (w >= hcol_max)
        return w.reshape((p.shape[0], p.shape[1]))
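What fprop does can be seen in a tiny NumPy mock-up (made-up numbers): within every group of hcol_size consecutive Maxout outputs, only the largest activation survives and the rest are zeroed.

import numpy as np

p = np.array([[0.2, 0.9, 0.1, 0.4, 0.3, 0.8]])   # one sample, 6 Maxout outputs (made up)
hcol_size = 3
w = p.reshape(p.shape[0], p.shape[1] // hcol_size, hcol_size)
hcol_max = w.max(axis=2, keepdims=True)
w = w * (w >= hcol_max)                          # zero everything but each group's maximum
print(w.reshape(p.shape))                        # [[0.  0.9  0.  0.  0.  0.8]]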
Discussion