Deep Learning: A Primer and Hands-On Practice
肖达
Agenda
• A two-minute tour of DL fundamentals
• DL in Action with GPU/Theano/Pylearn2
Deep Learning in a Nutshell
• Constraints on the black box
– Depth: multiple layers of nonlinear information processing
Input => L1 => L2 => … => Ln => Output
– Learning: the internal structure emerges through learning
Raw data → understanding of the data (representation) and judgment (classification)
Machine learning and feature representation
• Each layer extracts features from the output of the layer below it
• From the raw data all the way to the classifier, every layer has essentially the same structure
• The features of all layers are learned from data
Hierarchical feature learning
(figure: data → Layer 1 → Layer 2 → Layer 3 → Simple Classifier)
(figure, the traditional pipeline: images / video / speech → hand-designed feature extraction → trainable classifier → object classification)
Taking supervised learning as an example: given a training set (x_i, y_i), a neural network provides a nonlinear hypothesis model h_{W,b}(x) with parameters W and b, which is fit to the data.
This "neuron" is a computational unit that takes x1, x2, x3 and an intercept term +1 as its inputs; its output is h_{W,b}(x) = f(W·x + b) = f(W1·x1 + W2·x2 + W3·x3 + b).
The function f is called the "activation function".
Here we choose the sigmoid function f(z) = 1 / (1 + e^(-z)) as the activation function.
A single neuron & logistic regression (LR)
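As a concrete illustration, here is a minimal NumPy sketch (weights and inputs are made-up values, not from the talk) of a single sigmoid neuron, which is exactly a logistic regression unit:

import numpy as np

def sigmoid(z):
    # the activation function f(z) = 1 / (1 + e^-z)
    return 1.0 / (1.0 + np.exp(-z))

def neuron_output(x, W, b):
    # weighted sum of the inputs plus the intercept term, passed through f
    return sigmoid(np.dot(W, x) + b)

x = np.array([0.5, -1.0, 2.0])   # inputs x1, x2, x3 (made-up values)
W = np.array([0.1, 0.4, -0.2])   # weights (made-up values)
b = 0.3                          # weight of the +1 intercept term
print(neuron_output(x, W, b))    # a value in (0, 1) -- exactly logistic regression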
• A neural network is simply many individual "neurons" connected together
• The parameters of the network in the figure are W^(1), b^(1), W^(2), b^(2)
Neural networks
Rumelhart et al., Nature, 1986
The back-propagation algorithm (Back-Prop)
a = f(W^(1) x)
h = softmax(W^(2) a)
• For each training sample, compute the gradient of the loss function (the discrepancy between the actual and the desired output) with respect to every parameter
• Apply the chain rule of differentiation
J = -log h

∂J/∂W^(2) = (∂J/∂h) · (∂h/∂W^(2))

∂J/∂W^(1) = (∂J/∂h) · (∂h/∂a) · (∂a/∂W^(1))
The learning procedure
• 1. Forward-propagate to get the activations
• 2. Compare with the target to get the loss
• 3. Back-propagate to correct the weights
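The three steps above, written out as a minimal NumPy sketch (toy sizes and values chosen for illustration, not the talk's code) for the two-layer model a = f(W^(1) x), h = softmax(W^(2) a) with loss J = -log h, using exactly the chain rule shown on the previous slide:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.RandomState(0)
W1 = rng.randn(5, 4) * 0.1        # hypothetical sizes: 4 inputs, 5 hidden units, 3 classes
W2 = rng.randn(3, 5) * 0.1
x = rng.randn(4)
y = 1                             # index of the correct class

# 1. forward propagation
a = sigmoid(W1.dot(x))            # a = f(W1 x)
h = softmax(W2.dot(a))            # h = softmax(W2 a)

# 2. loss: J = -log h[y]
J = -np.log(h[y])

# 3. back-propagation via the chain rule
dz2 = h.copy(); dz2[y] -= 1.0     # gradient of J w.r.t. W2 a (softmax + log loss)
dW2 = np.outer(dz2, a)            # dJ/dW2
da = W2.T.dot(dz2)                # dJ/da
dz1 = da * a * (1.0 - a)          # through the sigmoid: f'(z) = f(z)(1 - f(z))
dW1 = np.outer(dz1, x)            # dJ/dW1

# gradient-descent update (learning rate is a made-up choice)
lr = 0.01
W2 -= lr * dW2
W1 -= lr * dW1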
• The data acquisition problem
– Training relies on labeled data, which is usually scarce
• The local-minima problem
– Many layers of nonlinearity make training a highly non-convex optimization problem; it is very easy to get stuck in a bad local minimum
• The vanishing-gradient problem
– When the network is deep, the gradient decays severely by the time it reaches the early layers; those layers cannot be trained effectively and training is very slow
Problems in training deep neural networks
Agenda
• A two-minute tour of DL fundamentals
• DL in Action with GPU/Theano/Pylearn2
What you need
• An off-the-shelf PC with a 650 W+ power supply
• A GPU (GTX 580/780/Titan)
• Familiarity with Linux, Python and NumPy
• Total cost < ¥8k
– "DL is no longer the privilege of the rich and handsome; GPU + Theano is the gospel for us ordinary folks." — Xingyuan
Open source libraries
• Theano
– by Bengio's group @ U. Montreal
– transparent use of the GPU
– automatic gradient computation (see the minimal sketch after this list)
• pylearn2
– a high-level DL library built on Theano
– contains most building blocks needed for DL experiments
– contains a cuda-convnet wrapper
• cuda-convnet
– by Hinton's group @ U. Toronto
– VERY fast C++/CUDA implementation of convolutional neural networks
– not a user-friendly library
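A minimal Theano sketch of the two features mentioned above (variable names are made up; it requires a working Theano install): the cost is written once as a symbolic expression, T.grad derives its gradient automatically, and the compiled function runs on the GPU when Theano is configured with device=gpu and floatX=float32.

import numpy as np
import theano
import theano.tensor as T

x = T.vector('x')                                # symbolic input
w = theano.shared(np.asarray([0.1, -0.2, 0.3],   # made-up initial weights
                             dtype=theano.config.floatX), name='w')

# a toy cost, written once as a symbolic expression
cost = T.sum(T.nnet.sigmoid(T.dot(w, x)) ** 2)

# automatic gradient computation
grad_w = T.grad(cost, w)

# compiled function; runs on the GPU when THEANO_FLAGS=device=gpu,floatX=float32
f = theano.function([x], [cost, grad_w])
print(f(np.asarray([1.0, 2.0, 3.0], dtype=theano.config.floatX)))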
How to launch a DL experiment in 5 minutes
• Specify 3 things in my_exp.yaml
– dataset
– model
– training algorithm
• exec "train.py my_exp.yaml"
An example YAML file
dataset: &train !obj:pylearn2.datasets.cifar10.CIFAR10 {
    toronto_prepro: True, which_set: 'train', one_hot: 1,
    axes: ['c', 0, 1, 'b'], start: 0, stop: 40000 },
model: !obj:pylearn2.models.mlp.MLP {
    batch_size: 128,
    input_space: !obj:pylearn2.space.Conv2DSpace {
        shape: [32, 32], num_channels: 3, axes: ['c', 0, 1, 'b'] },
    layers: [
        !obj:pylearn2.models.maxout.MaxoutConvC01B {
            layer_name: 'conv1', pad: 2, num_channels: 32, num_pieces: 1,
            kernel_shape: [5, 5], pool_shape: [3, 3], pool_stride: [2, 2],
            irange: .01, min_zero: True, tied_b: True, max_kernel_norm: 9.9 },
        !obj:pylearn2.models.maxout.MaxoutConvC01B {
            layer_name: 'conv2',
            # ... (remaining layer definitions truncated on the slide)
    ] },
algorithm: !obj:pylearn2.training_algorithms.sgd.SGD {
    batch_size: 128, learning_rate: .01, init_momentum: .9,
    monitoring_dataset: {
        'valid' : !obj:pylearn2.datasets.cifar10.CIFAR10 {
            toronto_prepro: True, axes: ['c', 0, 1, 'b'], which_set: 'train',
            one_hot: 1, start: 40000, stop: 50000 } },
    cost: !obj:pylearn2.costs.cost.SumOfCosts { costs: [
        !obj:pylearn2.costs.cost.MethodCost { method: 'cost_from_X' },
        !obj:pylearn2.costs.mlp.WeightDecay { coeffs: [ .002, .002, .002, .002 ] } ] },
    termination_criterion: !obj:pylearn2.termination_criteria.MonitorBased {
        channel_name: "valid_y_misclass", prop_decrease: 0., N: 10 }
}
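After training finishes, the saved model can be reloaded and used for prediction. A rough sketch, assuming a save_path such as 'my_exp.pkl' was added to the YAML (the file above omits it) and the same GPU setup is available:

import numpy as np
import theano
from pylearn2.utils import serial

model = serial.load('my_exp.pkl')                # hypothetical save_path
X = model.get_input_space().make_theano_batch()  # symbolic input in ('c', 0, 1, 'b') layout
Y = model.fprop(X)                               # symbolic forward pass through the MLP
predict = theano.function([X], Y.argmax(axis=1))

# a dummy batch: (channels, rows, cols, batch) = (3, 32, 32, 128)
batch = np.zeros((3, 32, 32, 128), dtype=theano.config.floatX)
print(predict(batch))                            # predicted class index for each image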
Dataset: CIFAR-10
• 60,000 32x32 colour images in 10 classes
• 50,000 training images and 10,000 test images
Rethinking datasets
• Deep neural nets are data-hungry beasts
• The current mainstream way of preparing labeled datasets for training deep nets is costly and unnatural
– It is difficult to prepare a sufficient amount of data, leaving the net starved
Network architecture
• A baseline 3-layer convolutional network
– conv1: 32x32, 32-channel output
– pool1: 16x16, 32-channel output
– conv2: 16x16, 32-channel output
– pool2: 8x8, 32-channel output
– conv3: 8x8, 64-channel output
– pool3: 4x4, 64-channel output
– fully-connected (softmax): 10 outputs
• Similar to the visual cortex hierarchy?
Statistics
• #neurons
– 32x32x32 + 16x16x32 + 16x16x32 + 8x8x32 + 8x8x64 + 4x4x64 + 10 = 56,330
• #free parameters (learnable weights)
– 5*5*3*32 + 5*5*32*32 + 5*5*32*64 + 4*4*64*10 = 89,440
• #samples seen
– 40k samples * 32 epochs = 1.28m (see the check after this list)
• Run time
– 32 epochs * 10 s/epoch = 320 s on a GTX 780 GPU (~66x faster than CPU, ~4x faster than Theano's GPU implementation)
• Result
– test error = 26%
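The counts follow directly from the layer shapes; a quick arithmetic check in Python:

# neurons: the feature-map outputs of each conv/pool stage plus the 10 softmax outputs
neurons = 32*32*32 + 16*16*32 + 16*16*32 + 8*8*32 + 8*8*64 + 4*4*64 + 10
print(neurons)              # 56330

# free parameters: 5x5 kernels of the three conv layers, then the 4x4x64 -> 10 softmax
params = 5*5*3*32 + 5*5*32*32 + 5*5*32*64 + 4*4*64*10
print(params)               # 89440

# samples seen: 40k training images, 32 passes over the data
print(40000 * 32)           # 1280000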
Does BEING DEEP matter?
• Test error after 10 epochs
– 0 conv layers: 70%
– 1 conv layer: 35%
– 2 conv layers: 30%
– 3 conv layers: 28%
• Further performance gains are possible with better hyperparameter tuning
Improving performance with tricks
• Test error
– Baseline: 26.1%
– Weight decay: -1.5%
– Learning rate decay: -2.5%
– Final: 22.1%
Note: We do not use the most effective trick for image data, i.e. data augmentation
Learning curve
The effect of overfitting
Filters learned by 1st conv layer
• How to visualize filters learned by higher layers?
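For the first conv layer the kernels act directly on RGB pixels, so they can simply be drawn as tiny images. A rough sketch, assuming the trained model was saved to 'my_exp.pkl' and that it exposes get_weights_topo() as pylearn2's convolutional models generally do (pylearn2's show_weights.py script does essentially this):

import numpy as np
import matplotlib.pyplot as plt
from pylearn2.utils import serial

model = serial.load('my_exp.pkl')           # hypothetical save path
W = model.get_weights_topo()                # assumed: first-layer kernels, one per row
for i in range(min(32, W.shape[0])):
    f = W[i].astype(np.float64)
    f = (f - f.min()) / (f.max() - f.min() + 1e-8)   # rescale to [0, 1] for display
    plt.subplot(4, 8, i + 1)
    plt.imshow(f)
    plt.axis('off')
plt.show()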
Which layer matures first?
How to get SOTA results (Maxout Networks, ICML'13)
• Substantially improve the result by:
– data preprocessing: ZCA whitening (sketched after this list)
– max pooling among linear hidden units
– dropout training
– a 60x larger net, 90x longer training time
• Test error
– 22.1% -> 14.5%
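ZCA whitening, the preprocessing step listed above, decorrelates nearby pixels while keeping the images visually recognisable; pylearn2 provides it as a dataset preprocessor. The underlying transform is roughly the following (a NumPy sketch, not pylearn2's exact implementation):

import numpy as np

def zca_whiten(X, eps=1e-2):
    # X: one flattened image per row; eps is a small regularizer (typical value)
    X = X - X.mean(axis=0)                   # remove the per-pixel mean
    cov = np.dot(X.T, X) / X.shape[0]        # pixel covariance matrix
    U, S, _ = np.linalg.svd(cov)
    W = U.dot(np.diag(1.0 / np.sqrt(S + eps))).dot(U.T)   # symmetric whitening matrix
    return np.dot(X, W)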
Raw vs preprocessed images
A new model in 11 lines of code

import theano.tensor as T
from pylearn2.models.maxout import Maxout

class Hypercolumn(Maxout):
    def __init__(self, hcol_size, **kwargs):
        super(Hypercolumn, self).__init__(**kwargs)
        self.hcol_size = hcol_size      # number of units per competing group

    def fprop(self, state_below):
        # ordinary Maxout forward pass
        p = super(Hypercolumn, self).fprop(state_below)
        # arrange the outputs into groups of hcol_size units
        w = p.reshape((p.shape[0], p.shape[1] // self.hcol_size, self.hcol_size))
        # within each group keep only the maximum activation, zero the rest
        hcol_max = w.max(axis=2).dimshuffle(0, 1, 'x') * T.ones_like(w)
        w = w * (w >= hcol_max)
        return w.reshape((p.shape[0], p.shape[1]))
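What fprop does can be seen in a tiny NumPy mock-up (made-up numbers): within every group of hcol_size consecutive Maxout outputs, only the largest activation survives and the rest are zeroed.

import numpy as np

p = np.array([[0.2, 0.9, 0.1, 0.4, 0.3, 0.8]])   # one sample, 6 Maxout outputs (made up)
hcol_size = 3
w = p.reshape(p.shape[0], p.shape[1] // hcol_size, hcol_size)
hcol_max = w.max(axis=2, keepdims=True)
w = w * (w >= hcol_max)                          # zero everything but each group's maximum
print(w.reshape(p.shape))                        # [[0.  0.9  0.  0.  0.  0.8]]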
Discussion