Applications of Artificial Intelligence Technology, Deep Learning: Optimization and Regularization (Nanjing University)


  • 1

    Applications of Artificial Intelligence Technology, Deep Learning: Optimization and Regularization

  • 2

    Neural networks
    - Model
    - Use Sigmoid or ReLU as the activation function
    - For classification, use cross-entropy as the loss function

    Softmax: maps any given set of values to a probability distribution

    Rectified Linear Unit (ReLU)

  • 3

    Reference: 《神经⽹络与深度学习》 (Neural Networks and Deep Learning)

    Cross-entropy loss function = negative log-likelihood loss function

    For a three-class classification problem where the true label is [0, 0, 1] and the predicted class probabilities are [0.3, 0.3, 0.4], the loss is $L_{ce}(\hat{y}, y) = -\log 0.4 \approx 0.92$.

  • 4

    Neural networks: use Sigmoid or ReLU as the activation function; for classification, use cross-entropy as the loss function

    $x$ = raw input

    $z_1 = W_1^\top x + b_{11}, \quad h_1 = \mathrm{ReLU}(z_1), \quad \theta_1 = U_1^\top h_1 + b_{21}$

    $z_2 = W_2^\top x + b_{12}, \quad h_2 = \mathrm{ReLU}(z_2), \quad \theta_2 = U_2^\top h_2 + b_{22}$

    $z_3 = W_3^\top x + b_{13}, \quad h_3 = \mathrm{ReLU}(z_3), \quad \theta_3 = U_3^\top h_3 + b_{23}$

    $[\hat{y}_1, \hat{y}_2, \hat{y}_3] = \mathrm{softmax}(\theta_1, \theta_2, \theta_3)$

    $L_{ce}(\hat{y}, y) = -\sum_{j=1}^{k} y_j \log \hat{y}_j$

    Gradients required for training: $\frac{\partial L_{ce}(\hat{y}, y)}{\partial W_{11}}, \; \frac{\partial L_{ce}(\hat{y}, y)}{\partial b_{11}}, \; \frac{\partial L_{ce}(\hat{y}, y)}{\partial U_{11}}, \; \cdots$
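    To make the pipeline concrete, here is a minimal NumPy sketch of this forward pass and loss; the input size and hidden widths are illustrative assumptions, not values from the slide:

    import numpy as np

    def relu(z):
        return np.maximum(0.0, z)

    def softmax(theta):
        e = np.exp(theta - np.max(theta))   # subtract the max for numerical stability
        return e / e.sum()

    rng = np.random.default_rng(0)
    x = rng.normal(size=4)                            # raw input (assumed 4-dim)
    W = [rng.normal(size=(4, 5)) for _ in range(3)]   # W_i, hidden width 5 (assumed)
    b1 = [rng.normal(size=5) for _ in range(3)]       # b_1i
    U = [rng.normal(size=(5, 1)) for _ in range(3)]   # U_i
    b2 = [rng.normal(size=1) for _ in range(3)]       # b_2i

    # theta_i = U_i^T ReLU(W_i^T x + b_1i) + b_2i : one scalar score per class
    theta = np.array([(U[i].T @ relu(W[i].T @ x + b1[i]) + b2[i]).item()
                      for i in range(3)])
    y_hat = softmax(theta)                            # [y^_1, y^_2, y^_3]

    y = np.array([0.0, 0.0, 1.0])                     # one-hot target
    loss = -np.sum(y * np.log(y_hat))                 # L_ce = -sum_j y_j log y^_j
    print(y_hat, loss)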

  • 5

    Stochastic gradient descent (SGD)

  • 6

    Coding: Keras

    from keras.models import Sequential
    from keras.layers import Dense, Activation
    from keras.optimizers import SGD

    model = Sequential()
    model.add(Dense(64, input_dim=100))   # Keras 1.x "output_dim=64" is now the first positional argument
    model.add(Activation("relu"))
    model.add(Dense(10))
    model.add(Activation("softmax"))

    model.compile(loss='categorical_crossentropy',
                  optimizer='sgd',
                  metrics=['accuracy'])

    model.fit(X_train, Y_train, epochs=5, batch_size=32)   # "nb_epoch" was renamed to "epochs"

    loss = model.evaluate(X_test, Y_test, batch_size=32)

  • 7

    Why optimizing neural networks is hard

    - Network structures differ greatly, so there is no universal optimization algorithm
    - The problem is non-convex, raising the issues of parameter initialization and of escaping local optima

  • 8

    Non-convex optimization in high-dimensional spaces

    - Saddle points
    - Flat valleys

    Example: on the surface $z = x^2 - y^2$, gradient descent cannot escape the saddle point (0, 0), where the gradient is small.

  • 9

    Optimization: mini-batch gradient descent

    - Draw K training samples and compute the partial derivatives
    - Define the gradient
    - Update the parameters, where α > 0 is the learning rate (a standard form of the update is sketched below)

    Key factors: mini-batch size, the gradient, the learning rate
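    The update formulas on this slide were images in the source; a standard form, consistent with the notation here, is

    $$g_t = \frac{1}{K}\sum_{(x, y)\in \mathcal{S}_t} \frac{\partial \mathcal{L}\big(y, f(x; \theta)\big)}{\partial \theta}, \qquad \theta_t \leftarrow \theta_{t-1} - \alpha\, g_t,$$

    where $\mathcal{S}_t$ is the mini-batch of K samples drawn at step t and α > 0 is the learning rate.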

  • 10

    MNIST dataset: stochastic gradient descent

    In mini-batch gradient descent, the effect of the number of samples drawn per step on how fast the loss decreases.

  • 11

    How can we improve?

    - Learning rate: learning-rate decay, AdaGrad, AdaDelta, RMSprop
    - Gradient: Momentum, Nesterov accelerated gradient, gradient clipping
    - Combined methods: Adam ≈ Momentum + RMSprop; Nadam

    Adam is usually the better choice!

    References:
    1. An overview of gradient descent optimization algorithms, http://ruder.io/optimizing-gradient-descent/index.html
    2. Types of optimization algorithms used in neural networks and ways to optimize gradient descent, https://medium.com/towards-data-science/types-of-optimization-algorithms-used-in-neural-networks-and-ways-to-optimize-gradient-95ae5d39529f

  • 12

    Learning-rate decay

    - If the learning rate α is too large, training does not converge; if it is too small, convergence is too slow.
    - Inverse-time decay
    - Exponential decay
    - Natural exponential decay

    (Common forms of these schedules are sketched below.)
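    The schedule formulas were images in the source; common definitions, with α₀ the initial learning rate and β a decay-rate hyperparameter, are

    $$\text{inverse-time:}\;\; \alpha_t = \frac{\alpha_0}{1 + \beta t}, \qquad \text{exponential:}\;\; \alpha_t = \alpha_0\, \beta^{t} \;\; (\beta < 1), \qquad \text{natural exponential:}\;\; \alpha_t = \alpha_0\, e^{-\beta t}.$$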

  • 13

    Learning-rate decay (continued)

    - If the learning rate α is too large, training does not converge; if it is too small, convergence is too slow.
    - AdaGrad (Adaptive Gradient) algorithm [Duchi et al., 2011]
    - RMSprop algorithm [Tieleman and Hinton, 2012]

    (Update rules for both are sketched below.)
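    The update rules were images in the source; the standard forms are, with $g_t$ the gradient at step t and ϵ a small constant for numerical stability,

    $$\text{AdaGrad:}\;\; G_t = \sum_{\tau=1}^{t} g_\tau \odot g_\tau, \qquad \text{RMSprop:}\;\; G_t = \beta G_{t-1} + (1-\beta)\, g_t \odot g_t \;\; (\beta \approx 0.9),$$

    and in both cases the parameter update is $\Delta\theta_t = -\dfrac{\alpha}{\sqrt{G_t + \epsilon}} \odot g_t$.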

  • 14

    Gradient direction: the momentum method

    - The momentum method [Rumelhart et al., 1988] replaces the true gradient with the momentum accumulated from previous gradients; the gradient at each iteration can be viewed as an acceleration (update rule sketched below).
    - Here ρ is the momentum factor, usually set to 0.9, and α is the learning rate.
    - Early in training, successive gradients point in roughly the same direction, so momentum accelerates progress toward the optimum. Late in training, gradient directions become inconsistent and oscillate around the optimum, so momentum acts as a damper and adds stability.
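    The momentum update, whose formula was an image in the source, is standardly written as

    $$\Delta\theta_t = \rho\, \Delta\theta_{t-1} - \alpha\, g_t, \qquad \theta_t \leftarrow \theta_{t-1} + \Delta\theta_t.$$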

  • 15

    Gradient direction: the Adam algorithm

    - Adaptive Moment Estimation (Adam) [Kingma and Ba, 2015] can be viewed as a combination of the momentum method and RMSprop: it uses momentum as the parameter-update direction and also adapts the learning rate (sketch below).
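    Adam's update equations, shown as an image in the source, take the standard form

    $$m_t = \beta_1 m_{t-1} + (1-\beta_1)\, g_t, \qquad v_t = \beta_2 v_{t-1} + (1-\beta_2)\, g_t \odot g_t,$$

    $$\hat{m}_t = \frac{m_t}{1-\beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1-\beta_2^t}, \qquad \theta_t \leftarrow \theta_{t-1} - \alpha\, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon},$$

    with typical values β₁ = 0.9 and β₂ = 0.999.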

  • 16

    Gradient clipping

    - Gradient clipping is a simple heuristic: confine the magnitude of the gradient to an interval, and clip it whenever it falls outside that interval (sketch below).
    - Clipping by value
    - Clipping by norm
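    A minimal NumPy sketch of the two clipping rules (the thresholds are illustrative assumptions):

    import numpy as np

    def clip_by_value(g, lo=-1.0, hi=1.0):
        # Clip each component of the gradient into the interval [lo, hi].
        return np.clip(g, lo, hi)

    def clip_by_norm(g, max_norm=5.0):
        # If ||g|| exceeds max_norm, rescale g so its norm equals max_norm.
        norm = np.linalg.norm(g)
        return g * (max_norm / norm) if norm > max_norm else g

    g = np.array([3.0, -4.0])      # ||g|| = 5
    print(clip_by_value(g))        # [ 1. -1.]
    print(clip_by_norm(g, 2.5))    # [ 1.5 -2. ], norm 2.5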

  • 17

    A comparison of the convergence of optimization methods on the MNIST dataset

  • 18

    Parameter initialization

    - Parameters must not all be initialized to 0. Why? The symmetric-weight problem!
    - Gaussian initialization: the simplest method; parameters are drawn at random from a Gaussian distribution with a fixed mean (e.g. 0) and a fixed variance (e.g. 0.01).
    - Xavier initialization: parameters are initialized from a uniform distribution on the interval [−r, r] (a common choice of r is sketched below).
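    The slide does not give r; a common choice (the Glorot uniform initializer) for a layer with $n_{\text{in}}$ inputs and $n_{\text{out}}$ outputs is

    $$r = \sqrt{\frac{6}{n_{\text{in}} + n_{\text{out}}}}.$$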

  • 19

    Data preprocessing

    - Data normalization: standardization, min-max scaling, PCA

  • 20

    The effect of data normalization on the gradient

  • 21

    Layer-wise normalization

    - Normalization methods: Batch Normalization (BN), Layer Normalization, Weight Normalization, Local Response Normalization (LRN)

  • 22

    Batch Normalization [Ioffe and Szegedy, 2015]

    - Given a mini-batch of K samples, compute its mean and variance.
    - Apply batch normalization (sketch below).
    - Once training is finished, the mean μ and variance σ² computed over the whole dataset replace the per-mini-batch statistics.
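    The batch-normalization equations, shown as images in the source, take the standard form: for net inputs $z^{(1)}, \ldots, z^{(K)}$ in the mini-batch,

    $$\mu = \frac{1}{K}\sum_{k=1}^{K} z^{(k)}, \qquad \sigma^2 = \frac{1}{K}\sum_{k=1}^{K} \big(z^{(k)} - \mu\big) \odot \big(z^{(k)} - \mu\big),$$

    $$\mathrm{BN}(z) = \gamma \odot \frac{z - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta,$$

    where γ and β are learnable scale and shift parameters and ϵ is a small constant.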

  • 23

    Layer Normalization [Ba et al., 2016]

    - Unlike batch normalization, layer normalization normalizes over all the neurons of one intermediate layer. Let the net input of layer $l$ be $z^{(l)}$; its mean and variance are computed over that layer's neurons.
    - Apply layer normalization (sketch below).
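    In the standard formulation, with $n_l$ neurons in layer $l$,

    $$\mu^{(l)} = \frac{1}{n_l}\sum_{i=1}^{n_l} z_i^{(l)}, \qquad \big(\sigma^{(l)}\big)^2 = \frac{1}{n_l}\sum_{i=1}^{n_l} \big(z_i^{(l)} - \mu^{(l)}\big)^2,$$

    $$\mathrm{LN}\big(z^{(l)}\big) = \gamma \odot \frac{z^{(l)} - \mu^{(l)}}{\sqrt{\big(\sigma^{(l)}\big)^2 + \epsilon}} + \beta.$$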

  • 24

    Hyperparameter optimization

    - Number of layers
    - Number of neurons per layer
    - Activation function
    - Learning rate (and its adaptation algorithm)
    - Regularization coefficient
    - Mini-batch size

  • 25

    Hyperparameter optimization

    - Grid Search: suppose there are K hyperparameters in total, and the k-th hyperparameter can take m_k values.
    - If a hyperparameter is continuous, it can be discretized by choosing a few "empirical" values. For the learning rate α, for example, we can set

    α ∈ {0.01, 0.1, 0.5, 1.0}.

    - These hyperparameters then have m₁ × m₂ × ⋯ × m_K possible value combinations (enumerated in the sketch below).
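    A minimal Python sketch of enumerating the grid; the hyperparameter names and values are illustrative assumptions:

    from itertools import product

    grid = {
        "lr":        [0.01, 0.1, 0.5, 1.0],   # m1 = 4 values
        "batch":     [32, 64],                # m2 = 2 values
        "l2_lambda": [0.0, 1e-4],             # m3 = 2 values
    }

    # All m1 x m2 x m3 = 16 combinations.
    configs = [dict(zip(grid, values)) for values in product(*grid.values())]
    print(len(configs))     # 16
    print(configs[0])       # {'lr': 0.01, 'batch': 32, 'l2_lambda': 0.0}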

  • 26

    超参数优化

  • 27

    Regularization: rethinking generalization

    - Neural networks are heavily over-parameterized and have strong fitting capacity, yet can generalize poorly.

    Zhang C, Bengio S, Hardt M, et al. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, 2016.

  • 28

    Regularization: any method that hampers optimization is a form of regularization

    - Adding constraints to the optimization: ℓ1/ℓ2 constraints, data augmentation
    - Interfering with the optimization process: weight decay, stochastic gradient descent, early stopping

  • 29

    Regularization

    How to improve a neural network's generalization:
    - ℓ1 and ℓ2 regularization
    - Early stopping
    - Weight decay
    - SGD
    - Dropout
    - Data augmentation

  • 30

    Computational learning theory: generalization error bounds

  • 31

    ℓ1 and ℓ2 regularization

    - The optimization problem can be written as shown below.
    - Here ℓp is a norm function; p usually takes its value in {1, 2}, giving the ℓ1 and ℓ2 norms, and λ is the regularization coefficient.
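    The objective was an image in the source; a standard form consistent with the slide's description is

    $$\theta^{*} = \arg\min_{\theta}\; \frac{1}{N}\sum_{n=1}^{N} \mathcal{L}\big(y^{(n)}, f(x^{(n)}; \theta)\big) + \lambda\, \ell_p(\theta).$$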

  • 32

    Neural network demo

    - Different numbers of neurons in the hidden layer

    http://playground.tensorflow.org/

  • 33

    Neural network demo

    - Different regularization coefficients

  • 34

    Early stopping: we use a validation dataset to test, at each iteration, whether the current parameters are the best so far on the validation set. When the validation error stops decreasing, we stop iterating.

  • 35

    Learning Curves (Goodfellow et al., Deep Learning, Chapter 7: Regularization for Deep Learning)

    [Figure 7.3 plot: x-axis "Time (epochs)", 0 to 250; y-axis "Loss (negative log-likelihood)", 0.00 to 0.20; curves: training set loss, validation set loss]

    Figure 7.3: Learning curves showing how the negative log-likelihood loss changes over time (indicated as number of training iterations over the dataset, or epochs). In this example, we train a maxout network on MNIST. Observe that the training objective decreases consistently over time, but the validation set average loss eventually begins to increase again, forming an asymmetric U-shaped curve.

    greatly improved (in proportion with the increased number of examples for the shared parameters, compared to the scenario of single-task models). Of course this will happen only if some assumptions about the statistical relationship between the different tasks are valid, meaning that there is something shared across some of the tasks.

    From the point of view of deep learning, the underlying prior belief is the following: among the factors that explain the variations observed in the data associated with the different tasks, some are shared across two or more tasks.

    7.8 Early Stopping

    When training large models with sufficient representational capacity to overfit the task, we often observe that training error decreases steadily over time, but validation set error begins to rise again. See figure 7.3 for an example of this behavior. This behavior occurs very reliably.

    This means we can obtain a model with better validation set error (and thus, hopefully better test set error) by returning to the parameter setting at the point in time with the lowest validation set error. Every time the error on the validation set improves, we store a copy of the model parameters. When the training algorithm terminates, we return these parameters, rather than the latest parameters.



    Early stopping: terminate while validation set performance is better

  • 36

    Weight decay

    - At each parameter update, introduce a decay coefficient (update rule sketched below).
    - In standard stochastic gradient descent, weight-decay regularization has the same effect as L2 regularization.
    - In more sophisticated optimization methods (such as Adam), weight decay and L2 regularization are not equivalent.
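    The update rule, shown as an image in the source, is standardly written as

    $$\theta_t \leftarrow (1 - \beta)\, \theta_{t-1} - \alpha\, g_t,$$

    where β is a small decay coefficient, $g_t$ is the gradient, and α is the learning rate.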

  • 37

    Weight Decay as Constrained Optimization (Goodfellow et al., Deep Learning, Chapter 7: Regularization for Deep Learning)

    [Figure 7.1 plot: axes w1 and w2; contours of the unregularized objective around w*, contours of the L2 regularizer around the origin, equilibrium at w̃]

    Figure 7.1: An illustration of the effect of L2 (or weight decay) regularization on the value of the optimal w. The solid ellipses represent contours of equal value of the unregularized objective. The dotted circles represent contours of equal value of the L2 regularizer. At the point w̃, these competing objectives reach an equilibrium. In the first dimension, the eigenvalue of the Hessian of J is small. The objective function does not increase much when moving horizontally away from w*. Because the objective function does not express a strong preference along this direction, the regularizer has a strong effect on this axis. The regularizer pulls w1 close to zero. In the second dimension, the objective function is very sensitive to movements away from w*. The corresponding eigenvalue is large, indicating high curvature. As a result, weight decay affects the position of w2 relatively little.

    Only directions along which the parameters contribute significantly to reducing the objective function are preserved relatively intact. In directions that do not contribute to reducing the objective function, a small eigenvalue of the Hessian tells us that movement in this direction will not significantly increase the gradient. Components of the weight vector corresponding to such unimportant directions are decayed away through the use of the regularization throughout training.

    So far we have discussed weight decay in terms of its effect on the optimization of an abstract, general, quadratic cost function. How do these effects relate to machine learning in particular? We can find out by studying linear regression, a model for which the true cost function is quadratic and therefore amenable to the same kind of analysis we have used so far. Applying the analysis again, we will be able to obtain a special case of the same results, but with the solution now phrased in terms of the training data. For linear regression, the cost function is


  • 38

    Dropout

    - For a neural layer, introduce a dropout function that randomly zeroes part of its input (a sketch of the common "inverted" variant follows below).
    - Here m is the dropout mask, generated element-wise at random from a Bernoulli distribution with probability p.
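    A minimal NumPy sketch of dropout; the division by p ("inverted dropout") is a common variant not shown on the slide, added so that no rescaling is needed at test time:

    import numpy as np

    def dropout(x, p=0.5, training=True, rng=np.random.default_rng(0)):
        # m is the dropout mask: each element is kept with probability p.
        if not training:
            return x
        m = rng.binomial(1, p, size=x.shape)
        return m * x / p    # rescale so the expected activation is unchanged

    h = np.array([1.0, 2.0, 3.0, 4.0])
    print(dropout(h, p=0.5))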

  • 39

    Dropout

    (Goodfellow et al., Deep Learning, Chapter 7: Regularization for Deep Learning)

    [Figure 7.6 diagram: a base network with inputs x1, x2, hidden units h1, h2, and output y, alongside the ensemble of all sixteen subnetworks obtained by dropping out subsets of units. Panel titles: "Base network", "Ensemble of subnetworks"]

    Figure 7.6: Dropout trains an ensemble consisting of all sub-networks that can be constructed by removing non-output units from an underlying base network. Here, we begin with a base network with two visible units and two hidden units. There are sixteen possible subsets of these four units. We show all sixteen subnetworks that may be formed by dropping out different subsets of units from the original network. In this small example, a large proportion of the resulting networks have no input units or no path connecting the input to the output. This problem becomes insignificant for networks with wider layers, where the probability of dropping all possible paths from inputs to outputs becomes smaller.


  • 40

    Why dropout works

    - Ensemble-learning interpretation: each application of dropout amounts to sampling a subnetwork from the original network. If a neural network has n neurons, a total of 2^n subnetworks can be sampled.
    - Bayesian-learning interpretation (the averaging formula was an image in the source), where f(x, θm) denotes the network after the m-th application of the dropout method.

  • 41

    Dropout in recurrent neural networks

    - When applying dropout to a recurrent neural network, the hidden state at each time step cannot simply be dropped at random; doing so harms the network's memory along the time dimension.

    Dashed edges denote random dropping; different colors denote different dropout masks.

  • 42

    Dataset augmentation

    - Image data is augmented mainly by transforming the images algorithmically and injecting noise, increasing the diversity of the data.
    - Augmentation methods for image data:
      - Rotation: rotate the image by a random angle, clockwise or counterclockwise
      - Flip: flip the image horizontally or vertically at random
      - Zoom in/out: enlarge or shrink the image by some factor
      - Shift: translate the image horizontally or vertically by some step
      - Noise: add random noise

  • 43

    Dataset Augmentation

    [Example transformations: affine distortion, noise, elastic deformation, horizontal flip, random translation, hue shift]

  • 44

    Label smoothing

    - Add noise to the output labels to keep the model from overfitting.
    - The label of a sample x is normally represented as a one-hot vector; such labels are hard targets.
    - Introduce a noise term that smooths the label, i.e. assume the sample belongs to any other class with total probability ϵ. The smoothed label (a soft target) is sketched below.
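    In a K-class problem the smoothed label, shown as an image in the source, takes the standard form

    $$\tilde{y} = \Big[\frac{\epsilon}{K-1},\; \ldots,\; \frac{\epsilon}{K-1},\; 1-\epsilon,\; \frac{\epsilon}{K-1},\; \ldots,\; \frac{\epsilon}{K-1}\Big]^{\top},$$

    where the entry $1-\epsilon$ sits at the position of the true class.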

  • 45

    Bagging (Goodfellow et al., Deep Learning, Chapter 7: Regularization for Deep Learning)

    [Figure 7.5 illustration: an original dataset containing an 8, a 6 and a 9, plus two resampled datasets. Panel labels: "Original dataset", "First resampled dataset", "First ensemble member", "Second resampled dataset", "Second ensemble member"]

    Figure 7.5: A cartoon depiction of how bagging works. Suppose we train an 8 detector on the dataset depicted above, containing an 8, a 6 and a 9. Suppose we make two different resampled datasets. The bagging training procedure is to construct each of these datasets by sampling with replacement. The first dataset omits the 9 and repeats the 8. On this dataset, the detector learns that a loop on top of the digit corresponds to an 8. On the second dataset, we repeat the 9 and omit the 6. In this case, the detector learns that a loop on the bottom of the digit corresponds to an 8. Each of these individual classification rules is brittle, but if we average their output then the detector is robust, achieving maximal confidence only when both loops of the 8 are present.

    different kind of model using a different algorithm or objective function. Bagging is a method that allows the same kind of model, training algorithm and objective function to be reused several times.

    Specifically, bagging involves constructing k different datasets. Each dataset has the same number of examples as the original dataset, but each dataset is constructed by sampling with replacement from the original dataset. This means that, with high probability, each dataset is missing some of the examples from the original dataset and also contains several duplicate examples (on average around 2/3 of the examples from the original dataset are found in the resulting training set, if it has the same size as the original). Model i is then trained on dataset i. The differences between which examples are included in each dataset result in differences between the trained models. See figure 7.5 for an example.

    Neural networks reach a wide enough variety of solution points that they can often benefit from model averaging even if all of the models are trained on the same dataset. Differences in random initialization, random selection of minibatches, differences in hyperparameters, or different outcomes of non-deterministic implementations of neural networks are often enough to cause different members of the


  • 46

    Multi-Task Learning

    (Goodfellow et al., Deep Learning, Chapter 7: Regularization for Deep Learning)

    factors. The model can generally be divided into two kinds of parts and associated parameters:

    1. Task-specific parameters (which only benefit from the examples of their task to achieve good generalization). These are the upper layers of the neural network in figure 7.2.

    2. Generic parameters, shared across all the tasks (which benefit from the pooled data of all the tasks). These are the lower layers of the neural network in figure 7.2.

    [Figure 7.2 diagram: a shared input x feeding a shared representation h(shared), with task-specific layers h(1) and h(2) predicting y(1) and y(2), and an additional top-level factor h(3) associated with no output task]

    Figure 7.2: Multi-task learning can be cast in several ways in deep learning frameworks and this figure illustrates the common situation where the tasks share a common input but involve different target random variables. The lower layers of a deep network (whether it is supervised and feedforward or includes a generative component with downward arrows) can be shared across such tasks, while task-specific parameters (associated respectively with the weights into and from h(1) and h(2)) can be learned on top of those yielding a shared representation h(shared). The underlying assumption is that there exists a common pool of factors that explain the variations in the input x, while each task is associated with a subset of these factors. In this example, it is additionally assumed that top-level hidden units h(1) and h(2) are specialized to each task (respectively predicting y(1) and y(2)) while some intermediate-level representation h(shared) is shared across all tasks. In the unsupervised learning context, it makes sense for some of the top-level factors to be associated with none of the output tasks (h(3)): these are the factors that explain some of the input variations but are not relevant for predicting y(1) or y(2).

    Improved generalization and generalization error bounds (Baxter, 1995) can be achieved because of the shared parameters, for which statistical strength can be

  • 47

    Sparse Representations (Goodfellow et al., Deep Learning, Chapter 7: Regularization for Deep Learning)

    $$\begin{bmatrix} -14 \\ 1 \\ 19 \\ 2 \\ 23 \end{bmatrix} = \begin{bmatrix} 3 & -1 & 2 & -5 & 4 & 1 \\ 4 & 2 & -3 & -1 & 1 & 3 \\ -1 & 5 & 4 & 2 & -3 & -2 \\ 3 & 1 & 2 & -3 & 0 & -3 \\ -5 & 4 & -2 & 2 & -5 & -1 \end{bmatrix} \begin{bmatrix} 0 \\ 2 \\ 0 \\ 0 \\ -3 \\ 0 \end{bmatrix} \tag{7.47}$$

    $y \in \mathbb{R}^m, \quad B \in \mathbb{R}^{m \times n}, \quad h \in \mathbb{R}^n$

    In the first expression, we have an example of a sparsely parametrized linear regression model. In the second, we have linear regression with a sparse representation h of the data x. That is, h is a function of x that, in some sense, represents the information present in x, but does so with a sparse vector.

    Representational regularization is accomplished by the same sorts of mechanisms that we have used in parameter regularization.

    Norm penalty regularization of representations is performed by adding to the loss function J a norm penalty on the representation. This penalty is denoted $\Omega(h)$. As before, we denote the regularized loss function by $\tilde{J}$:

    $$\tilde{J}(\theta; X, y) = J(\theta; X, y) + \alpha\, \Omega(h) \tag{7.48}$$

    where $\alpha \in [0, \infty)$ weights the relative contribution of the norm penalty term, with larger values of α corresponding to more regularization.

    Just as an L1 penalty on the parameters induces parameter sparsity, an L1 penalty on the elements of the representation induces representational sparsity: $\Omega(h) = \|h\|_1 = \sum_i |h_i|$. Of course, the L1 penalty is only one choice of penalty that can result in a sparse representation. Others include the penalty derived from a Student-t prior on the representation (Olshausen and Field, 1996; Bergstra, 2011) and KL divergence penalties (Larochelle and Bengio, 2008) that are especially useful for representations with elements constrained to lie on the unit interval. Lee et al. (2008) and Goodfellow et al. (2009) both provide examples of strategies based on regularizing the average activation across several examples, $\frac{1}{m}\sum_i h^{(i)}$, to be near some target value, such as a vector with .01 for each entry.

    Other approaches obtain representational sparsity with a hard constraint on the activation values. For example, orthogonal matching pursuit (Pati et al., 1993) encodes an input x with the representation h that solves the constrained optimization problem

    $$\arg\min_{h,\, \|h\|_0 < k} \|x - Wh\|^2$$

  • 48

    Adversarial Examples (Goodfellow et al., Deep Learning, Chapter 7: Regularization for Deep Learning)

    [Figure 7.8 image: $x$ ("panda", 57.7% confidence) $+\; .007 \times \mathrm{sign}(\nabla_x J(\theta, x, y))$ ("nematode", 8.2% confidence) $=\; x + \epsilon\, \mathrm{sign}(\nabla_x J(\theta, x, y))$ ("gibbon", 99.3% confidence)]

    Figure 7.8: A demonstration of adversarial example generation applied to GoogLeNet (Szegedy et al., 2014a) on ImageNet. By adding an imperceptibly small vector whose elements are equal to the sign of the elements of the gradient of the cost function with respect to the input, we can change GoogLeNet's classification of the image. Reproduced with permission from Goodfellow et al. (2014b).

    to optimize. Unfortunately, the value of a linear function can change very rapidly if it has numerous inputs. If we change each input by ε, then a linear function with weights w can change by as much as ε‖w‖₁, which can be a very large amount if w is high-dimensional. Adversarial training discourages this highly sensitive locally linear behavior by encouraging the network to be locally constant in the neighborhood of the training data. This can be seen as a way of explicitly introducing a local constancy prior into supervised neural nets.

    Adversarial training helps to illustrate the power of using a large function family in combination with aggressive regularization. Purely linear models, like logistic regression, are not able to resist adversarial examples because they are forced to be linear. Neural networks are able to represent functions that can range from nearly linear to nearly locally constant and thus have the flexibility to capture linear trends in the training data while still learning to resist local perturbation.

    Adversarial examples also provide a means of accomplishing semi-supervised learning. At a point x that is not associated with a label in the dataset, the model itself assigns some label ŷ. The model's label ŷ may not be the true label, but if the model is high quality, then ŷ has a high probability of providing the true label. We can seek an adversarial example x′ that causes the classifier to output a label y′ with y′ ≠ ŷ. Adversarial examples generated using not the true label but a label provided by a trained model are called virtual adversarial examples (Miyato et al., 2015). The classifier may then be trained to assign the same label to x and x′. This encourages the classifier to learn a function that is


    Training on adversarial examples is mostly intended to improve security, but can sometimes provide generic regularization.

  • 49

    Summary

    - Model: use ReLU as the activation function; for classification, use cross-entropy as the loss function
    - Optimization: SGD + mini-batch (prefer the Adam algorithm); reshuffle the training data before every epoch; preprocess the data (standardization); use a dynamic learning rate (decreasing over time)
    - Regularization: ℓ1 and ℓ2 regularization, Dropout, early stopping, data augmentation, ...

  • Thank you!