(Reading Group) Facial Landmark Detection by Deep Multi-task Learning

Facial Landmark Detection by Deep Multi-task Learning 2015/7/2 Masahiro Suzuki


Page 1: (Reading Group) Facial Landmark Detection by Deep Multi-task Learning

Facial Landmark Detection by Deep Multi-task Learning 2015/7/2

Masahiro Suzuki

Page 2: (研究会輪読) Facial Landmark Detection by Deep Multi-task Learning

Contents

¤ Paper Information

¤ Introduction

¤ Related work

¤ Tasks-Constrained Deep Convolutional Network

¤ Experiment

¤ Conclusion

Page 3: (Reading Group) Facial Landmark Detection by Deep Multi-task Learning

Paper Information

¤ Title: Facial Landmark Detection by Deep Multi-task Learning (2014)

¤ Authors: Zhanpeng Zhang, Ping Luo, Chen Change Loy, and Xiaoou Tang

¤ The Chinese University of Hong Kong / Multimedia Laboratory

Deep Learning (CNN) + Multitask Learning

¤ Motivation: I am studying lifelong learning (online multi-task learning) with deep learning

Page 4: (Reading Group) Facial Landmark Detection by Deep Multi-task Learning

Facial landmark detection

¤ Facial landmark detection is a fundamental component in many face analysis tasks
  ¤ facial attribute inference
  ¤ face verification
  ¤ face recognition
¤ It remains a formidable challenge
  ¤ partial occlusion and large head pose variations

Page 5: (Reading Group) Facial Landmark Detection by Deep Multi-task Learning

Approach

The authors' view:

¤ facial landmark detection is not a standalone problem
¤ its estimation can be influenced by a number of heterogeneous and subtly correlated factors

(Diagram: main task + auxiliary tasks, combined by multi-task learning.)

Page 6: (Reading Group) Facial Landmark Detection by Deep Multi-task Learning

Contribution

They propose the Tasks-Constrained Deep Convolutional Network (TCDCN):

¤ the first attempt to investigate how facial landmark detection can be optimized together with heterogeneous but subtly correlated tasks

¤ they show that:
  ¤ the representations learned from related tasks facilitate the learning of the main task
  ¤ task relatedness is captured implicitly by the proposed model
  ¤ the proposed approach outperforms existing methods

¤ demonstrate the effectiveness of using five-landmark estimation as robust initialization for improving a state-of-the-art face alignment method

Page 7: (Reading Group) Facial Landmark Detection by Deep Multi-task Learning

Contents

¤ Paper Information

¤ Introduction

¤ Related work

¤ Tasks-Constrained Deep Convolutional Network

¤ Experiment

¤ Conclusion

Page 8: (Reading Group) Facial Landmark Detection by Deep Multi-task Learning

Prior work on facial landmark detection

● Regression-based method: estimate the point locations directly by regression
  ¤ landmarks are estimated directly from image patches using SVR
  ¤ many prior works use random regression forests
  ¤ landmarks are estimated first and then iteratively refined, so the result depends on the initialization
  ¤ Valstar, M., Martinez, B., Binefa, X., Pantic, M.: Facial point detection using boosted regression and graph models. In: CVPR, pp. 2729-2736 (2010)

● Template fitting method: fit a model of position and appearance
  ¤ a face template is fitted to the image
  ¤ face detection, facial landmark detection, and pose estimation can be done simultaneously
  ¤ Cootes, T.F., Edwards, G.J., Taylor, C.J.: Active appearance models. PAMI 23(6), 681-685 (2001)

● Cascaded CNN: the method from the same lab
  ¤ splits the face per landmark and applies CNNs stage by stage; requires many CNNs (23 in total)
  ¤ Sun, Y., Wang, X., Tang, X.: Deep convolutional network cascade for facial point detection. In: CVPR, pp. 3476-3483 (2013)

Compared with this prior work, the distinguishing features of the paper are the use of auxiliary tasks and a single raw-pixel-input CNN that runs in little processing time without a cascade.


Page 9: (Reading Group) Facial Landmark Detection by Deep Multi-task Learning

Landmark detection by CNN: cascaded CNN [Sun et al. 2013]

¤ The face is first divided into several parts, a CNN estimates the landmarks for each, and the outputs are averaged at the end
¤ Reading the original paper, the CNNs appear to be applied stage by stage
¤ The closest work to this study (from the same lab as the authors)

Figure 2 (from [Sun et al. 2013]): Three-level cascaded convolutional networks. The input is the face region returned by a face detector. The three networks at level 1 are denoted F1, EN1, and NM1. Networks at level 2 are denoted LE21, LE22, RE21, RE22, N21, N22, LM21, LM22, RM21, and RM22. Both LE21 and LE22 predict the left eye center, and so forth. Networks at level 3 are denoted LE31, LE32, RE31, RE32, N31, N32, LM31, LM32, RM31, and RM32. The green square is the face bounding box given by the face detector. Yellow shaded areas are the input regions of the networks. Red dots are the final predictions at each level. Dots in other colors are predictions given by individual networks.

Figure 3 (from [Sun et al. 2013]): The structure of deep convolutional network F1. Sizes of the input, convolution, and max pooling layers are illustrated by cuboids whose length, width, and height denote the number of maps and the size of each map. Local receptive fields of neurons in different layers are illustrated by small squares in the cuboids.

[...] simultaneously predicts multiple facial points. For each facial point, the predictions of multiple networks are averaged to reduce the variance. Figure 3 illustrates the deep structure of F1, which contains four convolutional layers followed by max pooling, and two fully connected layers. EN1 and NM1 take the same deep structure, but with different sizes at each layer since the sizes of their input regions are different. Networks at the second and third levels take local patches centered at the predicted positions of facial points from previous levels as input and are only allowed to make small changes to previous predictions. The sizes of the patches and search ranges keep shrinking along the cascade. Predictions at the last two levels are strictly restricted because local appearance is sometimes ambiguous and unreliable. The predicted position of each point at the last two levels is given by the average of the two networks with different patch sizes. While networks at the first level aim to estimate keypoint positions robustly with few large errors, networks at the last two levels are designed to achieve high accuracy. All the networks at the last two levels share a common shallower structure since their tasks are low-level.

3.1 Network structure selection

We analyze three important factors in the choice of network structures. The discussion is limited to networks at the first level, which are the hardest to train. First, convolutional networks at the first level should be deep. Predicting keypoints from large input regions is a high-level task. Deeper structures help to form high-level features, which are global, while features extracted by neurons at lower layers are local due to local receptive fields. By combining spatially nearby features extracted at lower layers, neurons at higher layers can extract features from larger regions. Moreover, high-level features are highly non-linear. Adding additional layers increases the non-linearity from input to output and makes it possible to represent the relationship between them.

Second, for neurons in the convolutional layers, absolute value rectification after the hyperbolic tangent activation function (see details in Section 4) can effectively improve the performance. This modification over traditional convolutional networks was proposed in [14], where improvement on Caltech-101 was observed. Our empirical study shows that it is also effective in our application.

Third, locally sharing the weights of neurons on the same map improves the performance. Traditional convolutional networks share the weights of all the neurons on the same map based on two considerations. First, it assumes that the same features may appear everywhere in an image, so filters useful in one place should also be useful elsewhere. Second, weight sharing helps to prevent gradient diffusion when back-propagating through many layers, since gradients of shared weights are aggregated, which makes supervised learning on deep structures easier. However, globally [...]
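For intuition, here is a minimal sketch of the coarse-to-fine cascade described above. The "networks" are dummy stand-ins for the trained CNNs of [Sun et al. 2013], so the structure of the computation, not the numbers, is the point:

```python
import numpy as np

# Dummy stand-ins for trained CNNs: each returns 5 (x, y) landmark
# positions (level 1) or a small correction around a point (levels 2-3).
def level1_net(face):
    return np.full((5, 2), 0.5)   # dummy coarse estimate, image-relative

def refine_net(face, xy):
    return xy + 0.0               # dummy correction from a local patch

def cascaded_prediction(face):
    # Level 1: several networks (F1, EN1, NM1) cover overlapping regions;
    # their per-point predictions are averaged to reduce variance.
    points = np.mean([level1_net(face) for _ in range(3)], axis=0)
    # Levels 2 and 3: for each point, two networks with different patch
    # sizes predict small corrections, again averaged; the allowed search
    # range shrinks along the cascade.
    for _ in range(2):
        points = np.array([np.mean([refine_net(face, xy) for _ in range(2)],
                                    axis=0)
                           for xy in points])
    return points

print(cascaded_prediction(np.zeros((39, 39))))
```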

Page 10: (Reading Group) Facial Landmark Detection by Deep Multi-task Learning

Multi-task learning

¤ Deep learning and multi-task learning are a good match
  ¤ features learned for one task can be reused by other tasks
¤ Ordinary multi-task learning treats every task as having the same difficulty and convergence rate
  ¤ here the tasks are not equal, so it cannot be used as-is

→ This work sets per-task early stopping (inspired by [Caruana et al. 1997])

Page 11: (Reading Group) Facial Landmark Detection by Deep Multi-task Learning

Multi-task learning: The proposed approach falls under the big umbrella of multi-task learning. Multi-task learning has proven effective in many computer vision problems [28,29]. Deep models are well suited for multi-task learning since the features learned from one task may be useful for other tasks. Existing multi-task deep models [7] are not suitable to solve our problem because they assume similar learning difficulties and convergence rates across all tasks. Specifically, the iterative learning on all tasks is performed without early stopping. Applying this assumption to our problem leads to difficulty in learning convergence, as shown in Section 4. We mitigate this shortcoming through task-wise early stopping. Early stopping is not uncommon in vision learning problems [13,19]. Neural network methods [20] have also extensively used it to prevent over-fitting by halting the training process of a single task before a minimum error is achieved on the training set. Our early stopping scheme is inspired by Caruana [5], but his study is limited to shallow multilayer perceptrons. We show that early stopping is equally important for multi-task deep convolutional networks.

3 Tasks-Constrained Facial Landmark Detection

3.1 Problem Formulation

Traditional multi-task learning (MTL) seeks to improve the generalization performance of multiple related tasks by learning them jointly. Suppose we have a total of $T$ tasks and the training data for the $t$-th task are denoted as $(\mathbf{x}^t_i, y^t_i)$, where $t = \{1, \dots, T\}$, $i = \{1, \dots, N\}$, with $\mathbf{x}^t_i \in \mathbb{R}^d$ and $y^t_i \in \mathbb{R}$ being the feature vector and label, respectively¹. The goal of MTL is to minimize

$$\mathop{\arg\min}_{\{\mathbf{w}^t\}_{t=1}^{T}} \; \sum_{t=1}^{T}\sum_{i=1}^{N} \ell\bigl(y^t_i, f(\mathbf{x}^t_i; \mathbf{w}^t)\bigr) + \Phi(\mathbf{w}^t), \qquad (1)$$

where $f(\mathbf{x}^t; \mathbf{w}^t)$ is a function of $\mathbf{x}^t$ parameterized by a weight vector $\mathbf{w}^t$. The loss function is denoted by $\ell(\cdot)$; a typical choice is the least square for regression and the hinge loss for classification. $\Phi(\mathbf{w}^t)$ is the regularization term that penalizes the complexity of weights.

In contrast to conventional MTL that maximizes the performance of all tasks, our aim is to optimize the main task $r$, which is facial landmark detection, with the assistance of an arbitrary number of related/auxiliary tasks $a \in A$. Examples of related tasks include facial pose estimation and attribute inference. To this end, our problem can be formulated as

$$\mathop{\arg\min}_{\mathbf{W}^r, \{\mathbf{W}^a\}_{a \in A}} \; \sum_{i=1}^{N} \ell^r\bigl(y^r_i, f(\mathbf{x}_i; \mathbf{W}^r)\bigr) + \sum_{i=1}^{N}\sum_{a \in A} \lambda^a \, \ell^a\bigl(y^a_i, f(\mathbf{x}_i; \mathbf{W}^a)\bigr), \qquad (2)$$

¹ In this paper, scalar, vector, and matrix are denoted by lowercase, bold lowercase, and bold capital letters, respectively.

Problem Formulation

¤ Conventional multi-task learning improves generalization performance by learning multiple related tasks jointly

(Annotations on Eq. (1) above: the training example set, the tasks, the loss function, the regularization term, and the labels, features, and weights.)

Page 12: (Reading Group) Facial Landmark Detection by Deep Multi-task Learning

Proposed Formulation

¤ The multi-task learning of this work
¤ Characteristics:
  ¤ two different loss functions can be optimized together (even regression combined with classification)
  ¤ the features x are shared across tasks rather than task-dependent


(Annotations on Eq. (2), shown in the excerpt on Page 11: the first term is the main task, the second term sums the auxiliary tasks, and $\lambda^a$ is the importance coefficient of the $a$-th auxiliary task.)

Page 13: (Reading Group) Facial Landmark Detection by Deep Multi-task Learning

Proposed Formulation

¤ Since the main task is a regression problem and the auxiliary tasks are classification, the loss functions are the squared error and the cross-entropy error, respectively
¤ The shared image representation is learned with a deep convolutional network (DCN)
¤ These two equations (Eq. (3) and Eq. (4) below) are trained together

(In Eq. (3) below, the first term is the main task and the second the auxiliary tasks.)

where $\lambda^a$ denotes the importance coefficient of the $a$-th task's error and the regularization terms are omitted for simplification. Besides the aforementioned difference, Eq. (1) and Eq. (2) are distinct in two aspects. First, different types of loss functions can be optimized together by Eq. (2), e.g. regression and classification can be combined, while existing methods [30] that employ Eq. (1) assume implicitly that the loss functions across all tasks are identical. Second, Eq. (1) allows data $\mathbf{x}^t_i$ in different tasks to have different input representations, while Eq. (2) focuses on a shared input representation $\mathbf{x}_i$. The latter is more suitable for our problem, since all tasks share a similar facial representation.

In the following, we formulate our facial landmark detection model based on Eq. (2). Suppose we have a set of feature vectors in a shared feature space across tasks $\{\mathbf{x}_i\}_{i=1}^N$ and their corresponding labels $\{\mathbf{y}^r_i, y^p_i, y^g_i, y^w_i, y^s_i\}_{i=1}^N$, where $\mathbf{y}^r_i$ is the target of landmark detection and the remaining are the targets of the auxiliary tasks, including inferences of 'pose', 'gender', 'wear glasses', and 'smiling'. More specifically, $\mathbf{y}^r_i \in \mathbb{R}^{10}$ is the 2D coordinates of the five landmarks (centers of the eyes, nose, corners of the mouth), $y^p_i \in \{0, 1, \dots, 4\}$ indicates five different poses ($0°, \pm 30°, \pm 60°$), and $y^g_i, y^w_i, y^s_i \in \{0, 1\}$ are binary attributes. It is reasonable to employ the least square and cross-entropy as the loss functions for the main task (regression) and the auxiliary tasks (classification), respectively. Therefore, the objective function can be rewritten as

$$\mathop{\arg\min}_{\mathbf{W}^r, \{\mathbf{W}^a\}} \; \frac{1}{2}\sum_{i=1}^{N} \|\mathbf{y}^r_i - f(\mathbf{x}_i; \mathbf{W}^r)\|^2 - \sum_{i=1}^{N}\sum_{a \in A} \lambda^a y^a_i \log\bigl(p(y^a_i \,|\, \mathbf{x}_i; \mathbf{W}^a)\bigr) + \sum_{t=1}^{T} \|\mathbf{W}^t\|_2^2, \qquad (3)$$

where $f(\mathbf{x}_i; \mathbf{W}^r) = (\mathbf{W}^r)^T \mathbf{x}_i$ in the first term is a linear function. The second term is a softmax function $p(y_i = m \,|\, \mathbf{x}_i) = \frac{\exp\{(\mathbf{W}^a_m)^T \mathbf{x}_i\}}{\sum_j \exp\{(\mathbf{W}^a_j)^T \mathbf{x}_i\}}$, which models the class posterior probability ($\mathbf{W}^a_j$ denotes the $j$-th column of the matrix), and the third term penalizes large weights ($\mathbf{W} = \{\mathbf{W}^r, \{\mathbf{W}^a\}\}$). In this work, we adopt the deep convolutional network (DCN) to jointly learn the shared feature space $\mathbf{x}$, since the unique structure of DCN allows for multitask and shared representation.

In particular, given a face image $\mathbf{x}^0$, the DCN projects it to a higher level representation gradually by learning a sequence of non-linear mappings

$$\mathbf{x}^0 \xrightarrow{\sigma((\mathbf{W}^{s_1})^T \mathbf{x}^0)} \mathbf{x}^1 \xrightarrow{\sigma((\mathbf{W}^{s_2})^T \mathbf{x}^1)} \cdots \xrightarrow{\sigma((\mathbf{W}^{s_l})^T \mathbf{x}^{l-1})} \mathbf{x}^l. \qquad (4)$$

Here, $\sigma(\cdot)$ and $\mathbf{W}^{s_l}$ indicate the non-linear activation function and the filters to be learned in layer $l$ of the DCN. For instance, $\mathbf{x}^l = \sigma\bigl((\mathbf{W}^{s_l})^T \mathbf{x}^{l-1}\bigr)$. Note that $\mathbf{x}^l$ is the shared representation between the main task $r$ and the related tasks $A$. Eq. (4) and Eq. (3) can be trained jointly: the former learns the shared space and the latter optimizes the tasks with respect to this space, and then the errors of the tasks can be propagated back to refine the space. We iterate this learning procedure until convergence. We call the learned model the Tasks-Constrained Deep Convolutional Network (TCDCN).

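To make Eq. (3) concrete, here is a minimal numpy sketch of the objective for a single sample, assuming the shared feature x has already been extracted by the DCN. The function and variable names are mine, not the authors', and the explicit weight_decay coefficient is my addition (Eq. (3) writes the penalty without one):

```python
import numpy as np

def tcdcn_loss(x, y_landmarks, y_aux, W_r, W_aux, lambdas, weight_decay=1e-4):
    """Single-sample version of Eq. (3): least-squares loss for the main
    (regression) task, weighted cross-entropy for each auxiliary
    (classification) task, and an L2 penalty on all task weights.

    x           : shared representation from the DCN, shape (d,)
    y_landmarks : 2D coordinates of the 5 landmarks, shape (10,)
    y_aux       : dict task name -> integer class label
    W_r         : main-task weights, shape (d, 10)
    W_aux       : dict task name -> weight matrix of shape (d, n_classes)
    lambdas     : dict task name -> importance coefficient lambda^a
    """
    # Main task: f(x; W^r) = (W^r)^T x is linear in the shared feature.
    loss = 0.5 * np.sum((y_landmarks - W_r.T @ x) ** 2)
    # Auxiliary tasks: softmax class posterior, cross-entropy loss.
    for a, W_a in W_aux.items():
        logits = W_a.T @ x
        p = np.exp(logits - logits.max())
        p /= p.sum()
        loss -= lambdas[a] * np.log(p[y_aux[a]])
    # Weight penalty over W = {W^r, {W^a}}.
    loss += weight_decay * (np.sum(W_r ** 2)
                            + sum(np.sum(W ** 2) for W in W_aux.values()))
    return loss
```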

Page 14: (Reading Group) Facial Landmark Detection by Deep Multi-task Learning

Tasks-Constrained Deep Convolutional Network

Overall architecture

¤ DCN part: the model is shared across all tasks
¤ Multi-task part

Page 15: (Reading Group) Facial Landmark Detection by Deep Multi-task Learning

Task-wise early stopping

¤ In the multi-task setting, difficulty and convergence rate differ across tasks
  ¤ the auxiliary tasks look easier than the main task, so they should converge earlier
  ¤ continuing multi-task training after an auxiliary task has already reached its optimum makes that task over-fit and harms the main task

→ task-wise early stopping: halt learning on a per-task basis

¤ A criterion for stopping a task automatically:

[At the beginning] of the training process, the TCDCN is constrained by all tasks to avoid being trapped in a bad local minimum. As training proceeds, certain auxiliary tasks are no longer beneficial to the main task after they reach their peak performance, so their learning process should be halted. Note that the regularization offered by early stopping is different from the weight regularization in Eq. (3). The latter globally helps to prevent over-fitting in each task by penalizing certain parameter configurations. In Section 4.2, we show that task-wise early stopping is critical for multi-task learning convergence even with weight regularization.

Now we introduce a criterion to automatically determine when to stop learning an auxiliary task. Let $E^a_{val}$ and $E^a_{tr}$ be the values of the loss function of task $a$ on the validation set and training set, respectively. We stop the task if its measure exceeds a threshold $\epsilon$ as below:

$$\frac{k \cdot \operatorname{med}_{j=t-k}^{t} E^a_{tr}(j)}{\sum_{j=t-k}^{t} E^a_{tr}(j) \; - \; k \cdot \operatorname{med}_{j=t-k}^{t} E^a_{tr}(j)} \;\cdot\; \frac{E^a_{val}(t) - \min_{j=1..t} E^a_{tr}(j)}{\lambda^a \cdot \min_{j=1..t} E^a_{tr}(j)} \; > \; \epsilon, \qquad (5)$$

where $t$ denotes the current iteration and $k$ controls a training strip of length $k$. 'med' denotes the median. The first term in Eq. (5) represents the tendency of the training error: if the training error drops rapidly within a period of length $k$, the value of the first term is small, indicating that training can be continued as the task is still valuable; otherwise, the first term is large and the task is more likely to be stopped. The second term measures the generalization error compared to the training error. $\lambda^a$ is the importance coefficient of the $a$-th task's error, which can be learned through gradient descent; its magnitude reveals that a more important task tends to have a longer impact. This strategy achieves satisfactory results for learning a deep convolutional network given multiple tasks. Its superior performance is demonstrated in Section 4.2.

Learning procedure: We have discussed when and how to switch off an auxiliary task during training before it over-fits. For each iteration, we perform stochastic gradient descent to update the weights of the tasks and the filters of the network. For example, the weight matrix of the main task is updated by $\Delta \mathbf{W}^r = -\eta \, \frac{\partial E^r}{\partial \mathbf{W}^r}$ with $\eta$ being the learning rate ($\eta = 0.003$ in our implementation), and $\frac{\partial E^r}{\partial \mathbf{W}^r} = (\mathbf{y}^r_i - (\mathbf{W}^r)^T \mathbf{x}_i)\,\mathbf{x}_i^T$. The derivative of the auxiliary tasks' weights can be calculated in a similar manner as $\frac{\partial E^a}{\partial \mathbf{W}^a} = \bigl(p(y^a_i \,|\, \mathbf{x}_i; \mathbf{W}^a) - y^a_i\bigr)\,\mathbf{x}_i$. For the filters in the lower layers, we compute the gradients by propagating the loss error back following the back-propagation strategy as

$$\varepsilon^1 \xleftarrow{\;(\mathbf{W}^{s_2})^T \varepsilon^2 \frac{\partial \sigma(u^1)}{\partial u^1}\;} \varepsilon^2 \xleftarrow{\;(\mathbf{W}^{s_3})^T \varepsilon^3 \frac{\partial \sigma(u^2)}{\partial u^2}\;} \cdots \xleftarrow{\;(\mathbf{W}^{s_l})^T \varepsilon^l \frac{\partial \sigma(u^{l-1})}{\partial u^{l-1}}\;} \varepsilon^l, \qquad (6)$$

where $\varepsilon^l$ is the error at the shared representation layer and $\varepsilon^l = (\mathbf{W}^r)^T[\mathbf{y}^r_i - (\mathbf{W}^r)^T \mathbf{x}_i] + \sum_{a \in A} \bigl(p(y^a_i \,|\, \mathbf{x}_i; \mathbf{W}^a) - y^a_i\bigr)\mathbf{W}^a$, which is the integration of all tasks' derivatives. The errors of the lower layers are computed following Eq. (6). For instance, $\varepsilon^{l-1} = (\mathbf{W}^{s_l})^T \varepsilon^l \frac{\partial \sigma(u^{l-1})}{\partial u^{l-1}}$, where $\frac{\partial \sigma(u)}{\partial u}$ is the gradient of the activation function. Then, the gradient of the filter is obtained by $\frac{\partial E}{\partial \mathbf{W}^{s_l}} = \varepsilon^l \mathbf{x}^{l-1}_{\otimes}$, where $\otimes$ represents the receptive field of the filter.

Annotations on Eq. (5):
¤ ε is the threshold
¤ first term, the tendency of the training error: if the training error drops sharply within a strip of length k, the value becomes small → do not stop
¤ second term, the generalization error relative to the training error: when the gap between validation and training error grows large → stop
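To make the stopping rule concrete, here is a minimal Python sketch of Eq. (5). This is my own reading of the criterion; the strip length k and threshold eps below are placeholder values, not the paper's settings:

```python
import numpy as np

def should_stop_task(E_tr, E_val, lam, k=50, eps=0.5):
    """Sketch of the task-wise early stopping criterion, Eq. (5).

    E_tr, E_val : per-iteration training/validation losses of one auxiliary task
    lam         : importance coefficient lambda^a of that task
    k, eps      : strip length and threshold (placeholders, not the paper's)
    """
    t = len(E_tr) - 1
    strip = np.asarray(E_tr[t - k:t + 1])        # the last k iterations
    med = np.median(strip)
    # First factor: tendency of the training error. Small while the error
    # is still dropping quickly inside the strip -> keep training.
    tendency = k * med / (strip.sum() - k * med)
    # Second factor: generalization error relative to the best training error.
    best_tr = np.min(E_tr)
    generalization = (E_val[t] - best_tr) / (lam * best_tr)
    return tendency * generalization > eps
```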

Page 16: (Reading Group) Facial Landmark Detection by Deep Multi-task Learning

Learning procedure

¤ Solved by (stochastic) gradient descent; the update rules and the back-propagation chain of Eq. (6) are in the paper excerpt on Page 15, and a small sketch of the shared-layer error follows below


(Annotations on the update rules of Page 15: the first derivative is the main task's, the second the auxiliary tasks'.)


(Annotation: the back-propagation chain, Eq. (6) on Page 15.)

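As a sketch of how the tasks' derivatives are integrated at the shared representation (the ε^l expression in the excerpt on Page 15), the following helper sums the main task's regression residual and the auxiliary tasks' softmax errors, each mapped back through its weight matrix. Shapes and transpose conventions are my assumptions; the paper's notation is looser:

```python
import numpy as np

def shared_layer_error(x, y_r, y_aux, W_r, W_aux):
    """Error at the shared representation x^l (the epsilon^l expression):
    main-task residual plus every auxiliary task's softmax error, summed.
    Assumed shapes: x (d,), y_r (10,), W_r (d, 10), W_aux[a] (d, n_classes).
    """
    eps = W_r @ (y_r - W_r.T @ x)                 # main-task contribution
    for a, W_a in W_aux.items():                  # auxiliary contributions
        logits = W_a.T @ x
        p = np.exp(logits - logits.max())
        p /= p.sum()                              # softmax posterior
        onehot = np.zeros_like(p)
        onehot[y_aux[a]] = 1.0
        eps += W_a @ (p - onehot)                 # (p(y^a|x) - y^a) via W^a
    return eps
```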

Page 17: (Reading Group) Facial Landmark Detection by Deep Multi-task Learning

Experiments

¤ Network Structure
¤ Model training
  ¤ training dataset: 10,000 outdoor face images from the web
  ¤ collected without much regard for translation, rotation, or zoom
  ¤ test data: AFLW and AFW
¤ Evaluation metrics
  ¤ mean error: the distance between the ground-truth and estimated landmarks, normalized by the inter-ocular distance
  ¤ failure rate: an error above 10% counts as a failure
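As a concrete reading of these metrics, here is a minimal numpy sketch (my own helper, not the authors' evaluation code; the landmark ordering and per-landmark failure counting are assumptions):

```python
import numpy as np

def landmark_metrics(pred, gt, fail_thresh=0.10):
    """Mean error and failure rate as described on this slide.

    pred, gt : arrays of shape (N, 5, 2); landmark order is assumed to be
               [left eye, right eye, nose, left mouth corner, right mouth corner]
    """
    # Inter-ocular distance per face, used to normalize the errors.
    interocular = np.linalg.norm(gt[:, 0] - gt[:, 1], axis=1)        # (N,)
    err = np.linalg.norm(pred - gt, axis=2) / interocular[:, None]   # (N, 5)
    mean_error = err.mean(axis=0)               # mean error per landmark
    failure_rate = (err > fail_thresh).mean()   # error > 10% = failure
    return mean_error, failure_rate
```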

Page 18: (Reading Group) Facial Landmark Detection by Deep Multi-task Learning

The Effectiveness of Learning with Related Tasks

¤ Evaluated on AFLW
¤ Left: mean error for each landmark; right: failure rate over all landmarks
¤ The auxiliary tasks indeed lower both the error and the failure rate
  ¤ using all auxiliary tasks improves the failure rate by more than 10 percentage points
  ¤ pose seems to help the most

Fig. 4. Comparison of different model variants of TCDCN: the mean error over different landmarks, and the overall failure rate. (Failure rates read off the figure: FLD 35.62%, FLD+gender 31.86%, FLD+glasses 32.87%, FLD+smile 32.37%, FLD+pose 28.76%, FLD+all 25.00%.)

4.1 Evaluating the Effectiveness of Learning with Related Tasks

To examine the influence of related tasks, we evaluate five variants of the proposed model. In particular, the first variant is trained only on facial landmark detection. We train another four model variants on facial landmark detection along with the auxiliary task of recognizing 'pose', 'gender', 'wearing glasses', and 'smiling', respectively. The full model is trained using all four related tasks. For simplicity, we name each variant by facial landmark detection (FLD) and the related task, such as "FLD", "FLD+pose", "FLD+all". We employ the popular AFLW [11] for evaluation. This dataset is selected because it is more challenging than other datasets, such as LFPW [2]. For example, AFLW has larger pose variations (39% of faces are non-frontal in our testing images) and severe partial occlusions. Figure 10 provides some examples. We selected 3,000 faces randomly from AFLW for testing.

It is evident from Figure 4 that optimizing landmark detection with related tasks is beneficial. In particular, FLD+all outperforms FLD by a large margin, with a reduction of over 10% in failure rate. When a single related task is present, FLD+pose performs the best. This is not surprising since pose variation affects the locations of all landmarks globally and directly. The other related tasks such as 'smiling' and 'wearing glasses' are observed to have a comparatively smaller influence on the final performance, since they mainly capture local information of the face, such as the mouth and eyes. We examine two specific cases below.

FLD vs. FLD+smile: As shown in Figure 5, landmark detection benefits from smiling attribute inference, mainly at the nose and corners of the mouth. This observation is intuitive since smiling drives the lower part of the face, involving the zygomaticus and levator labii superioris muscles, more than the upper facial region. The learning of smile attributes develops a shared representation that describes the lower facial region, which in turn facilitates the localization of the nose and corners of the mouth.

We use a crude method to investigate the relationship between tasks. Specifically, we study the Pearson correlation of the learned weight vectors of the last fully-connected layer, between the tasks of facial landmark detection and 'smiling' prediction, as shown in Figure 5(b). The correlational relationship is indicative of the performance improvement depicted in Figure 5(a). For instance, the

Page 19: (Reading Group) Facial Landmark Detection by Deep Multi-task Learning

FLD vs. FLD + smile

Examines for which landmarks 'smile' is effective:
¤ (a): effective at the nose and mouth
  ¤ because smiling concerns the lower half of the face
¤ (b): Pearson correlation coefficients of the final-layer weights
  ¤ strong correlation with the mouth

(Figure 5 residue: (a) mean error (%) per landmark for FLD vs. FLD+smile; (b) Pearson correlation between each landmark's detection weights and the 'smiling' task's weights; values read off the figure in order (left eye, right eye, nose, left mouth corner, right mouth corner): 0.11, 0.32, 0.17, 0.22, 0.40.)

Fig. 5. FLD vs. FLD+smile. The smiling attribute helps detection more at the nose and corners of the mouth than at the centers of the eyes, since 'smiling' mainly affects the lower part of a face.


weight vectors which are learned to predict the positions of the mouth's corners have high correlation with the weights of 'smiling' inference. This demonstrates that TCDCN implicitly learns the relationship between tasks.

FLD vs. FLD+pose: As observed in Figure 6(a), detection errors of FLD increase as the head pose deviates from the frontal view towards the profiles, while these errors can be partially recovered by FLD+pose, as depicted in Figure 6(b).

4.2 The Benefits of Task-wise Early Stopping

To verify the effectiveness of task-wise early stopping, we train the proposed TCDCN with and without this technique and compare the landmark detection rates in Figure 7(a), which shows that without task-wise early stopping the accuracy is much lower. Figure 7(b) plots the main task's loss on the training set and the validation set within 2,600 iterations. Without early stopping, the training error converges slowly and exhibits substantial oscillations. However, convergence on both the training and validation sets is fast and stable when using the proposed early stopping scheme. In Figure 7(b), we also point out when and which task has been halted during the training procedure. For example, 'wearing glasses' and 'gender' are stopped at the 250th and 350th iterations, and 'pose' lasts to the 750th iteration, which matches our expectation that 'pose' has the largest benefit to landmark detection compared to the other related tasks.
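The 'crude method' above, Pearson correlation between last-layer weight vectors, is easy to reproduce. A minimal sketch (the names and shapes are my assumptions, not the authors' code):

```python
import numpy as np

def weight_correlations(W_landmark, w_smile):
    """Pearson correlation between each landmark coordinate's weight vector
    in the last fully-connected layer and the 'smiling' task's weight vector.

    W_landmark : (d, 10) weights for the 10 landmark coordinates
    w_smile    : (d,)    weights of the smiling classifier
    """
    return np.array([np.corrcoef(W_landmark[:, j], w_smile)[0, 1]
                     for j in range(W_landmark.shape[1])])
```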

Page 20: (Reading Group) Facial Landmark Detection by Deep Multi-task Learning

FLD vs. FLD + pose

Examines the effect of pose:
¤ (a): the error rate drops for every pose
¤ (b): measured as accuracy improvement as well, every pose gets better

(Figure 6 residue: the two panels plot, across left profile / left / frontal / right / right profile, the mean error (%) of FLD vs. FLD+pose and the accuracy improvement (%) achieved by FLD+pose.)

Fig. 6. FLD vs. FLD+pose. (a) Mean error in different poses, and (b) accuracy improvement by FLD+pose in different poses.


Page 21: (Reading Group) Facial Landmark Detection by Deep Multi-task Learning

The Benefits of Task-wise Early Stopping

¤ (a): task-wise early stopping lowers the error considerably
¤ (b): both the training and validation errors become smaller with early stopping

(Figure 7 residue: validation mean error (%) per landmark for FLD+all vs. FLD+all with task-wise early stopping; the training curve is annotated where 'glasses', 'gender', 'smile', and 'pose' are stopped.)

Fig. 7. (a) Task-wise early stopping leads to substantially lower validation errors over different landmarks. (b) Its benefit is also reflected in the training and validation error convergence rate. The error is measured in L2-norm with respect to the ground truth of the 10 coordinate values (normalized to [0,1]) for the 5 landmarks.


Page 22: (Reading Group) Facial Landmark Detection by Deep Multi-task Learning

Comparison with the Cascaded CNN

¤ Tested on AFLW with the same training data; the only difference is whether multi-task learning is used
¤ Beats the cascaded CNN on four of the five landmarks
¤ Overall, it also wins against the cascaded CNN


4.3 Comparison with the Cascaded CNN [21]

Although both the TCDCN and the cascaded CNN [21] are built upon CNNs, we show that the proposed model can achieve better detection accuracy with a significantly lower computational cost. We use the full model "FLD+all" and the publicly available binary code of the cascaded CNN in this experiment.

Landmark localization accuracy: Similar to Section 4.1, we employ AFLW images for evaluation due to their challenging pose variations and occlusions. Note that we use the same 10,000 training faces as in the cascaded CNN method; thus the only difference is that we exploit a multi-task learning approach. It is observed from Figure 8 that our method performs better on four out of five landmarks, and the overall accuracy is superior to that of the cascaded CNN.

(Figure 8 residue: panels (a) and (b) plot, per landmark, the mean error (%) and the failure rate (%) for the cascaded CNN vs. Ours.)

Fig. 8. The proposed TCDCN vs. the cascaded CNN [21]: (a) mean error over different landmarks and (b) the overall failure rate.

Computational efficiency: Suppose the computation time of a 2D convolution operation is $\tau$; the total time cost for a CNN with $L$ layers can be approximated by $\sum_{l=1}^{L} s_l^2 \, q_l \, q_{l-1} \, \tau$, where $s_l^2$ is the 2D size of the input feature map of the $l$-th layer and $q_l$ is the number of filters. The algorithmic complexity of a CNN is thus $O(s^2 q^2)$, directly related to the input image size and the number of filters. Note that the input face size and network structure of TCDCN are similar to those of the cascaded CNN. The proposed method has only one CNN, whereas the cascaded CNN [21] deploys [...]
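A quick back-of-the-envelope check of this cost model; the layer sizes below are made-up placeholders, not the actual TCDCN or cascaded-CNN configurations:

```python
def conv_cost(layers):
    """Approximate cost sum_l s_l^2 * q_l * q_{l-1} (in units of tau), per
    the model quoted above. 'layers' is a list of (input map size s_l,
    filter count q_l) pairs; the numbers below are illustrative only.
    """
    cost, q_prev = 0, 1                # q_0 = 1 channel (grayscale face)
    for s, q in layers:
        cost += s * s * q * q_prev
        q_prev = q
    return cost

one_cnn = conv_cost([(40, 16), (18, 48), (7, 64), (3, 64)])
cascade = 23 * one_cnn                 # the cascade runs 23 CNNs
print(one_cnn, cascade, cascade / one_cnn)
```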

Page 23: (研究会輪読) Facial Landmark Detection by Deep Multi-task Learning

Comparison with other State-of-the-art Methods

¤ Results on AFLW ¤ Outperforms all existing methods

¤ Results on AFW ¤ Same trend as on AFLW


[Fig. 9 bar-chart residue: mean error (%) per landmark (left eye, right eye, nose, left/right mouth corner) and overall, comparing TSPM, ESR, CDM, Luxand, RCPR, SDM, and Ours; overall mean errors (left to right): 15.9, 13.0, 13.1, 12.4, 11.6, 8.5, 8.0 on AFLW and 14.3, 12.2, 11.1, 10.4, 9.3, 8.8, 8.2 on AFW.]

Fig. 9. Comparison with RCPR [3], TSPM [32], CDM [27], Luxand [18], and SDM [25] on the AFLW [11] (first row) and AFW [32] (second row) datasets. The left subfigures show the mean errors on different landmarks, while the right subfigures show the overall errors.

Hence, TCDCN has a much lower computational cost. The cascaded CNN requires 0.12 s to process an image on an Intel Core i5 CPU, whilst TCDCN takes only 17 ms, which is 7 times faster. TCDCN costs 1.5 ms on an NVIDIA GTX760 GPU.
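To make the cost model concrete, the sketch below tallies $\sum_l s_l^2 q_l q_{l-1}$ for a single network and for a 23-network cascade; the layer sizes are illustrative placeholders, not the exact TCDCN or cascaded-CNN configurations:

```python
# Approximate a CNN's forward cost as the sum over layers of
# s_l^2 * q_{l-1} * q_l (input feature-map area times the product of
# in/out filter counts), in units of tau.
def cnn_cost(layers):
    # layers: list of (s, q_prev, q) = (input map side, in-channels, filters).
    return sum(s * s * q_prev * q for s, q_prev, q in layers)

# Illustrative 4-layer network on a 40x40 grayscale face (placeholder sizes).
single_cnn = [(40, 1, 16), (18, 16, 48), (8, 48, 64), (2, 64, 64)]
one_net = cnn_cost(single_cnn)

# A cascade of 23 comparable CNNs pays roughly 23x the single-network cost,
# consistent with the order-of-magnitude speed gap reported above.
print(one_net, 23 * one_net)
```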

4.4 Comparison with other State-of-the-art Methods

We compare against: (1) Robust Cascaded Pose Regression (RCPR) [3], using the publicly available implementation and parameter settings; (2) Tree Structured Part Model (TSPM) [32], which jointly estimates the head pose and facial landmarks; (3) a commercial software, Luxand Face SDK [18]; (4) Explicit Shape Regression (ESR) [4]; (5) a Cascaded Deformable Model (CDM) [27]; and (6) the Supervised Descent Method (SDM) [25]. For the methods that include their own face detector (TSPM [32] and CDM [27]), we avoid detection errors by cropping the image around the face.

Evaluation on AFLW [11]: Figure 9 shows that TCDCN outperforms all the state-of-the-art methods. Figure 10(a) shows several example detections by TCDCN, with additional tags generated from the related tasks. We observe that the proposed method is robust to faces with large pose variations, lighting changes, and severe occlusion. It is worth pointing out that the input image of our model is 40×40, which means the model can cope with low-resolution images.

Evaluation on AFW [32]: In addition to AFLW, we also test on AFW and observe a similar trend to that on the AFLW dataset. Figure 9 demonstrates the superiority of our method, and Figure 10(b) presents some detection examples from our model.

Page 24: (研究会輪読) Facial Landmark Detection by Deep Multi-task Learning

Comparison with other State-of-the-art Methods

Results on various images ¤ Row 1: wearing glasses ¤ Row 2: pose variations ¤ Row 3:

¤ Columns 1-2: different lighting ¤ Column 3: poor image quality ¤ Columns 4-5: different expressions ¤ Columns 6-8: failure cases (red marks the incorrect parts)



Fig. 10. Example detections by the proposed model on AFLW [11] and AFW [32] images. The labels below each image denote the tagging results for the related tasks: (0°, ±30°, ±60°) for pose; S/NS = smiling/not smiling; G/NG = with glasses/without glasses; M/F = male/female. Red rectangles indicate wrong tagging.

4.5 TCDCN for Robust Initialization

This section shows that, owing to its accuracy and efficiency, the TCDCN can be used to generate a good initialization that improves a state-of-the-art method. We take RCPR [3] as an example. Instead of randomly drawing training samples as initialization, as done in [3], we initialize RCPR by first applying TCDCN to the test image to estimate the five landmarks. We compare the results of RCPR with and without TCDCN initialization on the COFW dataset [3], which includes 507 test faces annotated with 29 landmarks. Figure 11(a) shows the relative improvement for each landmark on the COFW dataset, and Figure 11(b) visualizes several examples. With our robust initialization, the algorithm obtains improvements in difficult cases with rotation and occlusion.
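A hedged sketch of this initialization idea follows: align the 29-point mean shape to TCDCN's five detected landmarks with a least-squares similarity transform and start RCPR from the aligned shape. The helper names, the mean shape, and the index mapping are illustrative assumptions, not the released RCPR code:

```python
import numpy as np

def similarity_align(src, dst):
    # Closed-form (Umeyama) least-squares similarity transform: scale s,
    # rotation R, translation t mapping src points onto dst points,
    # both given as (N, 2) arrays.
    mu_s, mu_d = src.mean(0), dst.mean(0)
    src_c, dst_c = src - mu_s, dst - mu_d
    cov = dst_c.T @ src_c / len(src)
    U, D, Vt = np.linalg.svd(cov)
    S = np.diag([1.0, np.sign(np.linalg.det(U) * np.linalg.det(Vt))])
    R = U @ S @ Vt
    s = np.trace(np.diag(D) @ S) / src_c.var(0).sum()
    t = mu_d - s * R @ mu_s
    return s, R, t

def init_shape_from_tcdcn(mean_shape29, five_idx, tcdcn_five):
    # Warp the 29-point mean shape so that its five reference points
    # coincide (in the least-squares sense) with TCDCN's five detections,
    # and use the result as RCPR's starting shape.
    s, R, t = similarity_align(mean_shape29[five_idx], tcdcn_five)
    return s * mean_shape29 @ R.T + t
```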

5 Conclusions

Instead of learning facial landmark detection in isolation, we have shown that more robust landmark detection can be achieved through joint learning with heterogeneous but subtly correlated tasks, such as appearance attributes, expression, demographics, and head pose.

Page 25: (研究会輪読) Facial Landmark Detection by Deep Multi-task Learning

TCDCN for Robust Initialization

¤ TCDCN can also be used to obtain a good initialization

¤ Compared the existing RCPR method with and without TCDCN initialization. (a): relative improvement on each landmark (reduced error / original error); (b): visualization of the improvement (top: plain RCPR, bottom: RCPR initialized with TCDCN)



Fig. 11. Initialization with our five-landmark estimation for RCPR [3] on the COFW dataset [3]. (a) shows the relative improvement on each landmark (relative improvement = reduced error / original error). (b) visualizes the improvement: the upper row depicts the results of RCPR [3], while the lower row shows the improved results with our initialization.

The proposed Tasks-Constrained DCN allows errors of related tasks to be back-propagated into deep hidden layers, constructing a shared representation relevant to the main task. We have shown that the task-wise early stopping scheme is critical to ensure convergence of the model. Thanks to multi-task learning, the proposed model is more robust to faces with severe occlusions and large pose variations than existing methods. We have observed that a deep model need not be cascaded [21] to achieve better performance. The lighter-weight CNN allows real-time performance without GPUs or parallel computing techniques. Future work will explore deep multi-task learning for dense landmark detection and other vision domains.
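To make the architecture concrete, here is a minimal PyTorch-style sketch of the idea: one shared convolutional trunk, a linear regressor for the 10 landmark coordinates (main task), and linear classifiers for the auxiliary tasks. The layer sizes, task set, and head structure are illustrative assumptions, not the exact TCDCN configuration:

```python
import torch.nn as nn

class MultiTaskLandmarkNet(nn.Module):
    # Shared CNN trunk + linear heads: landmark regression (main task)
    # plus auxiliary classifiers, following the TCDCN idea with
    # illustrative (not the paper's exact) layer sizes.
    def __init__(self, aux_tasks=None):
        super().__init__()
        aux_tasks = aux_tasks or {"smile": 2, "glasses": 2, "gender": 2, "pose": 5}
        self.trunk = nn.Sequential(             # 40x40 grayscale face input
            nn.Conv2d(1, 16, 5), nn.ReLU(), nn.MaxPool2d(2),   # -> 16 x 18 x 18
            nn.Conv2d(16, 48, 3), nn.ReLU(), nn.MaxPool2d(2),  # -> 48 x 8 x 8
            nn.Conv2d(48, 64, 3), nn.ReLU(), nn.MaxPool2d(2),  # -> 64 x 3 x 3
            nn.Flatten(), nn.Linear(64 * 3 * 3, 100), nn.ReLU(),
        )
        self.landmarks = nn.Linear(100, 10)     # 5 (x, y) pairs, main task
        self.aux_heads = nn.ModuleDict(
            {name: nn.Linear(100, k) for name, k in aux_tasks.items()})

    def forward(self, x):
        h = self.trunk(x)
        return self.landmarks(h), {n: head(h) for n, head in self.aux_heads.items()}
```

Training would then minimize an L2 loss on the landmark output plus a weighted sum of cross-entropy losses on the auxiliary heads, with each auxiliary weight set to zero once its task is halted by task-wise early stopping.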

References

1. Asthana, A., Zafeiriou, S., Cheng, S., Pantic, M.: Robust discriminative response map fitting with constrained local models. In: CVPR. pp. 3444–3451 (2013)
2. Belhumeur, P.N., Jacobs, D.W., Kriegman, D.J., Kumar, N.: Localizing parts of faces using a consensus of exemplars. In: CVPR. pp. 545–552 (2011)
3. Burgos-Artizzu, X.P., Perona, P., Dollár, P.: Robust face landmark estimation under occlusion. In: ICCV. pp. 1513–1520 (2013)
4. Cao, X., Wei, Y., Wen, F., Sun, J.: Face alignment by explicit shape regression. In: CVPR. pp. 2887–2894 (2012)
5. Caruana, R.: Multitask learning. Machine Learning 28(1), 41–75 (1997)
6. Chen, K., Gong, S., Xiang, T., Loy, C.C.: Cumulative attribute space for age and crowd density estimation. In: CVPR. pp. 2467–2474 (2013)
7. Collobert, R., Weston, J.: A unified architecture for natural language processing: Deep neural networks with multitask learning. In: ICML. pp. 160–167 (2008)
8. Cootes, T.F., Edwards, G.J., Taylor, C.J.: Active appearance models. PAMI 23(6), 681–685 (2001)
9. Cootes, T.F., Ionita, M.C., Lindner, C., Sauer, P.: Robust and accurate shape model fitting using random forest regression voting. In: ECCV. pp. 278–291 (2012)
10. Dantone, M., Gall, J., Fanelli, G., Van Gool, L.: Real-time facial feature detection using conditional regression forests. In: CVPR. pp. 2578–2585 (2012)

Page 26: (研究会輪読) Facial Landmark Detection by Deep Multi-task Learning

Conclusion ¤ Showed that jointly learning heterogeneous but mutually related tasks yields more robust landmark detection

¤ TCDCN learns a shared representation by back-propagating the errors of the related tasks

¤ Task-wise early stopping is important for reliably ensuring model convergence

¤ Multi-task learning produced a model that is highly robust to varying facial conditions

¤ Future work: applying the approach to denser landmark detection, and applying deep multi-task learning to other image recognition problems

Page 27: (研究会輪読) Facial Landmark Detection by Deep Multi-task Learning

Impressions ¤ Glad to have learned how multi-task learning can be done with a CNN

¤ I didn't know this method (is it new?) ¤ Here it is a CNN plus linear classifiers, but CNN + CNN also seems promising

¤ Personally, the way the paper is written seems like a useful reference ¤ It has many new techniques and a wide range of experiments, which resembles the paper I am writing now