
THE INSTITUTE OF ELECTRONICS, INFORMATION AND COMMUNICATION ENGINEERS

TECHNICAL REPORT OF IEICE

Part-aware CNN for Pedestrian Detection

Cong CAO†, Yu WANG†, Jien KATO†, and Kenji MASE†

† Graduate School of Information Science, Nagoya University
Furo-cho, Chikusa-ku, Nagoya, Aichi, 464–8601 Japan
E-mail: †[email protected]

Abstract  Pedestrian detection is a significant task in computer vision, and in recent years it has been widely used in applications such as surveillance systems and autonomous driving. Although it has been studied exhaustively over the past decade, occlusion remains a very challenging problem. One convincing way to deal with this problem is to use part-based methods, which exploit the information of the visible parts to estimate the pedestrian's position. Many part-based pedestrian detection methods have been proposed in recent years; according to our analysis, however, a clumsy part-combination process has always limited their pedestrian detection performance. In this paper, we propose Part-aware CNN to solve this problem. We focus on the part detector combination phase and introduce a new method that reforms the part detectors into a convolutional layer of a CNN and optimizes the whole pipeline by fine-tuning the CNN. Our experiments show the strong effectiveness of this optimization and its robustness in occlusion handling.

Key words  Pedestrian detection, Body part detection, CNN

1. Introduction

Pedestrian detection is a significant task in computer vision, and in recent years it has been widely used in applications such as surveillance systems and autonomous driving. Although it has been studied exhaustively over the past decade [5], [7], [8], [19], [22], occlusion remains a very challenging problem.

Occlusion means that part of an object, or the whole object, is blocked by other objects. Because fully invisible pedestrians are hard to detect, precise detection at the moment a pedestrian appears is very important, especially when only part of the pedestrian is visible (the occlusion situation). Therefore, part-based pedestrian detection is considered a reasonable solution for occlusion handling.

Many part-based pedestrian detection methods have been proposed in recent years, and most of them focus on how body parts are mined. In Bourdev et al.'s work [2], the Poselet method performs human detection based on the poses of human body parts under different viewpoints. Tian et al. proposed DeepParts [18], which constructs the part pool simply from relative positions within the pedestrian bounding box and trains a strong detector for each part.

However, we notice that most methods, including the approaches mentioned above, do not pay much attention to how the part detector results are combined. Many approaches bluntly apply a linear SVM, which may not exploit the full potential of the part detectors. We therefore propose to merge the part detectors into a simple CNN model, which makes full use of their potential and achieves a remarkable optimization effect.

Figure 1  The overall pipeline of our approach, Part-aware CNN.

2. Related Work

Recent research has shown that using body-part detectors [2]–[4], [18] helps improve pedestrian detection performance. The definition of a body part, which is tied to how the parts are configured, is the key to building a more powerful pedestrian detector. Bourdev et al. proposed Poselets [2], body parts under different viewpoints and poses, which depend on the 3D joint keypoint annotations in the dataset. Recently, Tian et al. proposed DeepParts [18], which constructs a part pool based on relative locations and trains a CNN model for each part. This method achieved state-of-the-art pedestrian detection performance as well as strong occlusion handling.

We found that most part-based methods [2]–[4], [18] focus only on the construction of the part detectors and pay little attention to how the part detection scores are combined. For example, [18] and [3] both use a linear SVM to combine part detection scores into a pedestrian score. For those methods, pedestrian detection performance could be improved with a better combination of the part detectors.

Compared with the methods above, Part-aware CNN first mines high-quality body parts without any additional annotations, which makes training robust part detectors easier and faster. Second, we propose a fusion method that transforms the learned part detectors into CNN middle layers, which, in our experiments, proved to be a remarkably effective way of optimizing the combination of part detections into pedestrian detections.

3. Approach

In this study, we mainly focus on the part detector combination phase of a part-based pedestrian detection approach. Unlike previous works, we pay more attention to how the body-part detection results are combined: we transform the learned part detectors into a CNN middle layer and train the CNN model as a whole, which yields a remarkable improvement in pedestrian detection performance.

The overall pipeline of our approach is shown in Figure 1. In the first stage, we use rule mining to gather body-part clusters based on CNN convolutional-layer features, and then train LDA (Linear Discriminant Analysis) detectors on the selected parts. In the second stage, we transform the LDA detectors into a middle layer of the CNN and add it to the model that was used to extract features in the first stage. Finally, we train the renewed model to obtain the best optimization.

3. 1 Part detector construction
We follow the implementation of MDPM [11], which uses CNN features + association rule mining + LDA, to construct our part detectors. Different from the scene recognition task targeted in [11], the objects in our pedestrian detection task have a fixed shape but flexible scales. Therefore, we propose a part detector construction method better adapted to pedestrians, consisting of CNN convolutional-layer feature extraction and attribute-aware body-part mining. The visual elements that respond to the mined part detectors are shown in Figure 2.
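As a rough illustration of the detector training step, the sketch below shows a common way to fit an LDA detector on conv features using a shared background mean and covariance, in the spirit of MDPM-style mining; the function name, variable layout, and regularization are our assumptions, not details from the paper.

```python
import numpy as np

def train_lda_part_detector(cluster_feats, neg_mean, neg_cov, reg=1e-2):
    """Fit one LDA part detector for a mined patch cluster.
    cluster_feats: (M, 512) conv features of the patches in one cluster.
    neg_mean, neg_cov: mean and covariance of generic background features,
    estimated once and shared by all detectors (the usual LDA shortcut).
    Returns a 512-d weight vector usable as a linear sliding-window detector."""
    mu_pos = cluster_feats.mean(axis=0)
    cov = neg_cov + reg * np.eye(neg_cov.shape[0])   # regularize for stability
    return np.linalg.solve(cov, mu_pos - neg_mean)   # w = Sigma^{-1} (mu_pos - mu_neg)
```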

3. 2 Part Detector Combination
In this section, we give the details of the two methods we used to combine the part detectors into pedestrian scores. The shallow method (the traditional one) encodes images with the part detectors and trains a linear SVM classifier to output the pedestrian score. The deep method (our proposal) transforms the detectors into a convolutional layer of the CNN and trains the model to obtain the pedestrian score. Like many other part-based pedestrian detection methods, both methods essentially perform body-part detection and then combine the part detection activations into pedestrian detection activations. In this section, we use VGG19 [16] as the example CNN model.

Figure 2  Examples of image patch clusters that correspond to the mined rules. Up, Mid, and Down represent the relative position of the source patches; Small and Large represent the scales of the source patches.

3. 2. 1 The shallow method
Similar to the method used in [11], we encode the image with the part detectors trained in Sec. 3.1, which correspond to patterns of pedestrians or pedestrian subclasses.

Assume N part detectors are obtained in the part detector construction phase. We run the N detectors at every location of the W × H × 512 conv-layer feature map to get a new W × H × N feature map. We then apply max pooling five times (one pooling covers the whole image and the other four each cover one quarter of the height) to get a 1 × 1 × 5N feature vector. Finally, we train a linear SVM to produce the pedestrian score. The pipeline of the shallow method is shown in Fig. 3(a).
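A minimal sketch of this shallow encoding is given below, assuming NumPy/scikit-learn and illustrative function names; it is not the authors' implementation.

```python
import numpy as np
from sklearn.svm import LinearSVC

def encode_window(conv_feat, detectors):
    """Encode one candidate window with the mined part detectors.
    conv_feat: (H, W, 512) conv-layer feature map of the window.
    detectors: (N, 512) LDA weights of the N part detectors.
    Returns a 5N-d vector: global max pooling plus max pooling over four
    horizontal bands, each covering one quarter of the height."""
    H = conv_feat.shape[0]
    resp = conv_feat @ detectors.T                 # (H, W, N) part response map
    pools = [resp.max(axis=(0, 1))]                # pooling over the whole window
    for k in range(4):                             # the four 1/4-height bands
        band = resp[k * H // 4:(k + 1) * H // 4]
        pools.append(band.max(axis=(0, 1)))
    return np.concatenate(pools)                   # (5N,)

# Training the combiner: encode all candidate windows, then fit a linear SVM.
# X = np.stack([encode_window(f, detectors) for f in window_feats])
# svm = LinearSVC().fit(X, labels)
```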

3. 2. 2 The deep method
By observing the pipeline of the shallow method, we found that the encoding and classification process is strikingly similar to a CNN forward pass. Since we use conv-layer features to construct the detectors, feature extraction, part detection, encoding, and classification can be seamlessly merged into one simple CNN model, as shown in Fig. 3(b) and 3(c). We then train the entire CNN model on the Caltech-USA dataset to obtain the pedestrian score. The details of the model construction are given as follows.

• Turning part detectors into a Conv layer
(1) Normalization size: we fix the input size to 256 × 128, which fits both the average aspect ratio of the Caltech-USA dataset and the input size of our base model VGG19 [16].
(2) Conv layer: we keep all of the conv layers of VGG19 and add another conv layer, named the DP layer (Discriminative Part layer). The DP layer is transformed from the part detectors: its 1 × 1 × 512 × N weight comes from the LDA classifier weights of the N part detectors. The transform formula is as follows (a code sketch of this transform and of eq. (2) is given after the list):

    conv_W(1, 1, i) = L_i,   (1)

where conv_W(1, 1, i) denotes the i-th channel of the conv weight and L_i denotes the 512-dimensional LDA weight of the i-th part detector.
(3) Fully-connected layers: we add two 4096-dimensional fc layers, one 2-dimensional fc layer, and a softmax.


Figure 3  Two methods to combine the part detectors to obtain the pedestrian score. (a) shows the shallow method; (b) and (c) show the Deep Conv method and the Deep Fc method.

• Turning part detectors into an Fc layer
(1) Normalization size: we fix the input size to 256 × 128, the same as in the conv-layer method above.
(2) Conv layers: we keep all of the conv layers and add the DP fc layer and the normal fc layers behind them.
(3) Fully-connected layers: we transform the part detectors into the first fc layer. Because the detectors mined from the subclasses carry location information (Up, Mid, or Down), each part detector is only used to substitute the fc weights that connect to the corresponding location. We thus obtain an N-dimensional fc layer. The formula is as follows:

    fc_W(h_i, w, i) = L_i,   (2)

where fc_W(h, w) denotes the weight of the fc layer that connects to position (h, w) of the preceding conv layer, and h_i ∈ {h_up, h_mid, h_down} represents the original height location of the i-th detector. Finally, we add one N-dimensional fc layer, one 2-dimensional fc layer, and a softmax.
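To make the two weight transforms concrete, the sketch below shows one way eq. (1) and eq. (2) could be implemented. PyTorch, the function names, and the band layout are our assumptions for illustration; the paper does not specify a framework or these details.

```python
import numpy as np
import torch
import torch.nn as nn

def build_dp_conv_layer(lda_weights):
    """Eq. (1): turn N LDA part detectors into a 1x1 'DP' conv layer.
    lda_weights: (N, 512) array, one 512-d LDA weight vector per detector."""
    n_parts, dim = lda_weights.shape
    dp = nn.Conv2d(dim, n_parts, kernel_size=1, bias=False)
    with torch.no_grad():
        # conv_W(1, 1, i) = L_i: the i-th 1x1 kernel is detector i's weight
        dp.weight.copy_(torch.as_tensor(lda_weights, dtype=torch.float32)
                        .view(n_parts, dim, 1, 1))
    return dp

def build_dp_fc_layer(lda_weights, part_bands, feat_h, feat_w):
    """Eq. (2): turn the detectors into an N-dimensional 'DP' fc layer on top
    of the flattened (512, H, W) conv map.  part_bands gives each detector's
    mined height band (Up / Mid / Down) as (h_start, h_end) row indices;
    weights outside a detector's band stay zero."""
    n_parts, dim = lda_weights.shape
    fc = nn.Linear(dim * feat_h * feat_w, n_parts, bias=False)
    W = np.zeros((n_parts, dim, feat_h, feat_w), dtype=np.float32)
    for i, (h0, h1) in enumerate(part_bands):
        # fc_W(h_i, w, i) = L_i for every column w inside the detector's band
        W[i, :, h0:h1, :] = np.asarray(lda_weights[i], dtype=np.float32)[:, None, None]
    with torch.no_grad():
        # layout matches torch's default flattening of a (C, H, W) feature map
        fc.weight.copy_(torch.from_numpy(W.reshape(n_parts, -1)))
    return fc
```

For ConvDP, the DP layer is followed by two 4096-d fc layers, a 2-d fc layer, and a softmax, and the renewed model is fine-tuned end to end on Caltech-USA as described above; FcDP replaces the DP conv layer with the DP fc layer built here.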

4. Experiments

The experiments are conducted on the Caltech Pedestrian Dataset [6], [9], and we use the log-average miss rate (MR) [6], [9] as our evaluation criterion.
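For reference, the following is a small sketch of this metric in our own code, assuming the standard Caltech protocol of averaging the miss rate at nine FPPI points spaced evenly on a log scale in [10^-2, 10^0]:

```python
import numpy as np

def log_average_miss_rate(fppi, miss_rate):
    """fppi, miss_rate: samples of the detector's miss-rate-vs-FPPI curve,
    with fppi sorted in increasing order.  Returns the geometric mean of the
    miss rate at nine reference FPPI points in [1e-2, 1e0]."""
    refs = np.logspace(-2.0, 0.0, num=9)
    mrs = []
    for r in refs:
        idx = np.searchsorted(fppi, r, side="right") - 1   # last sample with fppi <= r
        mrs.append(miss_rate[idx] if idx >= 0 else 1.0)    # before any detection: MR = 1
    return float(np.exp(np.mean(np.log(np.clip(mrs, 1e-10, 1.0)))))
```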

We evaluate the pedestrian detection performance of the shallow method and the deep methods on the Caltech-USA Reasonable, Partial occlusion, and Heavy occlusion subsets. We also run experiments that use the Caltech-fine-tuned VGG19 as the base model for mining and optimization. The results of all experiments are shown in Table 1.

Table 1  Pedestrian detection performance on the Caltech-USA dataset (MR, %). "+ft" means using the Caltech-fine-tuned VGG19 as the base model.

    Model        Reasonable   Partial occl.   Heavy occl.
    Shallow        39.34         48.21           82.33
    Shallow+ft     29.74         39.10           68.71
    VGG19          18.10         28.48           65.46
    ConvDP         17.47         30.73           66.58
    FcDP           17.14         28.12           62.33
    ConvDP+ft      16.63         28.48           61.38
    FcDP+ft        16.65         27.59           64.92

Comparing the shallow method with the deep methods, the optimization effect is obvious. The best deep method, FcDP, outperforms the shallow method by 22.2% MR on the Reasonable subset and improves MR on Partial occlusion and Heavy occlusion by 20.09% and 20%, respectively. This result not only shows the effectiveness of turning the part detectors into CNN middle layers for the pedestrian detection task, but also suggests how transforming a traditional method into a deep method can improve the final performance. The same idea may also work for other object recognition tasks.

Comparing the fine-tuned VGG19 with our best deep method, FcDP, our method slightly outperforms it by 0.96% MR on the Reasonable subset and improves MR on Partial and Heavy by 0.36% and 3.13%. This means that the part information learned from the mined rules indeed has a positive effect.

Comparing the choice of base CNN model, ImageNet-pretrained VGG19 versus Caltech10x fine-tuned VGG19, the fine-tuned model shows a positive effect in both the shallow and the deep methods. We think the part information learned during regular CNN fine-tuning benefits the body-part mining phase.

To summarize, compared with the two baselines (the shallow method and VGG19), the deep methods demonstrate effective optimization and good occlusion handling. The best performance on the Reasonable subset is an MR of 16.63%, achieved by the ConvDP+ft method.

4. 1 Overall Evaluation
We compare our approach with the existing best-performing methods, including VJ [20], HOG [5], ACF+SDT [15], JointDeep [13], SDN [12], LDCF [21], SCF+AlexNet [10], Katamari [1], SpatialPooling+ [14], TA-CNN [17], and DeepParts [18]. The evaluation results on the Reasonable, Partial Occlusion, and Heavy Occlusion subsets are shown in Fig. 4, Fig. 5, and Fig. 6. Our approach achieves an MR of 16.63% on Reasonable, 28.48% on Partial, and 61.38% on Heavy, which outperforms most of the existing methods. Although our approach does not outperform the state-of-the-art approach DeepParts, our final CNN model has a very simple architecture: a standard VGG19 with one additional convolutional layer, which is much slimmer than DeepParts, which uses 45 GoogLeNets.

Figure 4  Log-average miss rate on the Reasonable subset: 94.73% VJ, 68.46% HOG, 39.32% JointDeep, 37.87% SDN, 37.34% ACF+SDt, 24.8% LDCF, 23.32% SCF+AlexNet, 22.49% Katamari, 21.89% SpatialPooling+, 20.86% TA-CNN, 16.63% DP-CNN (Ours), 11.89% DeepParts.

Figure 5  Log-average miss rate on the Partial Occlusion subset: 98.67% VJ, 84.47% HOG, 56.82% JointDeep, 54.99% ACF+SDt, 49.4% SDN, 48.47% SCF+AlexNet, 43.19% LDCF, 41.74% Katamari, 39.25% SpatialPooling+, 32.8% TA-CNN, 28.48% DP-CNN (Ours), 19.93% DeepParts.

Figure 6  Log-average miss rate on the Heavy Occlusion subset: 98.78% VJ, 95.97% HOG, 87.4% ACF+SDt, 84.38% Katamari, 81.88% JointDeep, 81.34% LDCF, 78.77% SDN, 78.25% SpatialPooling+, 74.65% SCF+AlexNet, 70.35% TA-CNN, 61.38% DP-CNN (Ours), 60.42% DeepParts.

5. Conclusion

In this study, we proposed a part-aware deep learning approach that has excellent occlusion handling ability in pedestrian detection. In this approach, we focus on the part detector combination phase: we transform the part detectors into CNN middle layers and train the renewed CNN model, which yields a remarkable optimization. Our future work will focus on upgrading the approach and on other object recognition applications.

Acknowledgement  This research is supported by the JSPS Grant-in-Aid for Scientific Research B (No. 26280057), the JSPS Grant-in-Aid for Challenging Exploratory Research (No. 16K12460), and the JST Center of Innovation Program.

References

[1] R. Benenson, M. Omran, J. Hosang, and B. Schiele. Ten years of pedestrian detection, what have we learned? In ECCV, CVRSUAD workshop, 2014.
[2] Lubomir Bourdev and Jitendra Malik. Poselets: Body part detectors trained using 3D human pose annotations. In International Conference on Computer Vision (ICCV), 2009.
[3] Lubomir D. Bourdev, Fei Yang, and Rob Fergus. Deep poselets for human detection. CoRR, abs/1407.0717, 2014.
[4] Hyunggi Cho, Paul Rybski, Aharon Bar-Hillel, and Wende Zhang. Real-time pedestrian detection with deformable part models. In IEEE Intelligent Vehicles Symposium, August 2012.
[5] Navneet Dalal and Bill Triggs. Histograms of oriented gradients for human detection. In Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '05), Volume 1, pages 886–893, Washington, DC, USA, 2005. IEEE Computer Society.
[6] P. Dollár, C. Wojek, B. Schiele, and P. Perona. Pedestrian detection: A benchmark. In CVPR, June 2009.
[7] Piotr Dollár, Ron Appel, Serge Belongie, and Pietro Perona. Fast feature pyramids for object detection. IEEE Trans. Pattern Anal. Mach. Intell., 36(8):1532–1545, August 2014.
[8] Piotr Dollár, Zhuowen Tu, Pietro Perona, and Serge Belongie. Integral channel features. In Proceedings of the British Machine Vision Conference, pages 91.1–91.11. BMVA Press, 2009. doi:10.5244/C.23.91.
[9] Piotr Dollár, Christian Wojek, Bernt Schiele, and Pietro Perona. Pedestrian detection: An evaluation of the state of the art. PAMI, 34, 2012.
[10] Jan Hendrik Hosang, Mohamed Omran, Rodrigo Benenson, and Bernt Schiele. Taking a deeper look at pedestrians. CoRR, abs/1501.05790, 2015.
[11] Yao Li, Lingqiao Liu, Chunhua Shen, and Anton van den Hengel. Mid-level deep pattern mining. In CVPR, 2015.
[12] Ping Luo, Yonglong Tian, Xiaogang Wang, and Xiaoou Tang. Switchable deep network for pedestrian detection. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2014.
[13] Wanli Ouyang and Xiaogang Wang. Joint deep learning for pedestrian detection. In The IEEE International Conference on Computer Vision (ICCV), December 2013.
[14] Sakrapee Paisitkriangkrai, Chunhua Shen, and Anton van den Hengel. Pedestrian detection with spatially pooled features and structured ensemble learning. CoRR, abs/1409.5209, 2014.
[15] Dennis Park, C. Lawrence Zitnick, Deva Ramanan, and Piotr Dollár. Exploring weak stabilization for motion feature extraction. In 2013 IEEE Conference on Computer Vision and Pattern Recognition, pages 2882–2889, 2013.
[16] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.
[17] Yonglong Tian, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Pedestrian detection aided by deep learning semantic tasks. CoRR, abs/1412.0069, 2014.
[18] Yonglong Tian, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning strong parts for pedestrian detection. In ICCV, 2015.
[19] Luc Van Gool. Pedestrian detection at 100 frames per second. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR '12), pages 2903–2910, Washington, DC, USA, 2012. IEEE Computer Society.
[20] Paul Viola and Michael J. Jones. Robust real-time face detection. Int. J. Comput. Vision, 57(2):137–154, May 2004.
[21] Woonhyun Nam, Piotr Dollár, and Joon Hee Han. Local decorrelation for improved pedestrian detection. In NIPS, 2014.
[22] Shanshan Zhang, Christian Bauckhage, and Armin B. Cremers. Informed Haar-like features improve pedestrian detection. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition (CVPR '14), pages 947–954, Washington, DC, USA, 2014. IEEE Computer Society.
