
SOANets: Encoder-decoder based Skeleton Orientation Alignment Network for White Cane User Recognition from 2D Human Skeleton Sequence

Naoki Nishida1, Yasutomo Kawanishi1, Daisuke Deguchi2, Ichiro Ide1, Hiroshi Murase1 and Jun Piao3
1Graduate School of Informatics, Nagoya University, Aichi, Japan
2Information Strategy Office, Nagoya University, Aichi, Japan
3Data Science Research Laboratories, NEC Corporation, Kanagawa, Japan
[email protected], [email protected], [email protected], {ide, murase}@i.nagoya-u.ac.jp, [email protected]

Keywords: Skeleton Orientation Alignment, Skeleton Representation Sequence, White Cane User Recognition.

Abstract: Recently, various facilities have been deployed to support visually impaired people. However, accidents caused by visual disabilities still occur. In this paper, to support visually impaired people in public areas, we aim to identify the presence of a white cane user from a surveillance camera by analyzing the temporal transition of a human skeleton, represented as 2D coordinates, in a pedestrian image sequence. Our previously proposed method aligns the skeletons to various orientations and identifies a white cane user from the corresponding sequences, relying on multiple classifiers, one per orientation. That method employs an exemplar-based approach to perform the alignment; its performance heavily depends on the number of exemplars, and it consumes excessive memory. In this paper, we propose a method to align 2D skeleton representation sequences to various orientations using the proposed Skeleton Orientation Alignment Networks (SOANets), based on an encoder-decoder model. Using SOANets, we can obtain 2D skeleton representation sequences aligned to various orientations, extract richer skeleton features, and recognize white cane users accurately. Results of an evaluation experiment show that the proposed method improves the recognition rate by 16% compared to the previous exemplar-based method.

    1 INTRODUCTION

In recent years, various kinds of facilities have been deployed to support visually impaired people. Therefore, they have become able to go out on their own actively. For example, braille blocks can be found around cities and public facilities to guide visually impaired people (e.g., low-vision and partially sighted people). However, accidents involving them still occur, such as falling onto a railway track from a station platform. To prevent such accidents, platform screen doors are installed at stations. However, as their installation is limited to major stations, social support is still necessary to prevent accidents.

Therefore, the necessity of identifying visually impaired people in public spaces is increasing. Information from surveillance cameras installed in public places can be used for this purpose. For example, Tanikawa et al. proposed a method to automatically recognize and track wheelchair users in security camera footage (Tanikawa et al., 2017), and then to notify personnel to support them promptly.

Usually, it is not necessary to provide sighted people with notifications intended for visually impaired people. Therefore, it is essential to distinguish sighted and visually impaired people in order to provide support only for the latter.

Visually impaired people usually employ a white cane to search for obstacles. It also serves as a medium that helps other people recognize their impairment. Therefore, a white cane detector can be used to identify visually impaired people in images from surveillance cameras. However, even state-of-the-art object detectors (He et al., 2017; Redmon and Farhadi, 2018; Cai and Vasconcelos, 2018) may misdetect objects with appearances similar to a white cane, such as a white umbrella.

To overcome this issue, it is preferable to recognize them not only by the presence of a white cane, but also by actions peculiar to persons walking with a white cane.


Figure 1: Walking actions of a white cane user and a sighted person. The difference is mainly seen in the movement of their arms. The sighted person swings his arms back and forth; in contrast, the arm of the white cane user holding his cane is fixed forward.

It is evident that there are differences between the actions of a white cane user and a sighted person, as shown in Figure 1. Therefore, in this paper, we focus on recognizing a white cane user by analyzing only the motions typical of such pedestrians.

Various studies have proposed efficient ways to recognize actions from human skeletons (Le et al., 2018; Baptista et al., 2019; Wang et al., 2016). As it is difficult to create a 3D representation of a skeleton using data from a monocular camera, a sequence of 2D skeleton representations is usually considered for estimating human actions. However, as the appearance of a 2D skeleton varies widely according to the skeleton's orientation, as shown in Figure 2, the performance of action recognition may deteriorate depending on the orientation. Therefore, it is desirable to align the skeleton orientation in the input for white cane user recognition to a specific orientation. However, since there may be several orientations that work effectively for user recognition, not only a single orientation but also several orientations combined together can be effective. Hence, to achieve more accurate recognition, it is required to obtain several human skeleton representation sequences aligned to different orientations.

We have previously proposed a skeleton orientation alignment method based on exemplar 2D skeletons for white cane user recognition (Nishida et al., 2019). We prepared a database of 2D skeletons in various orientations and aligned the input 2D skeletons to various orientations by finding suitable skeletons in this database. The performance of the method depended on the number of skeleton exemplars contained in the database. Moreover, since the dictionary is discrete, slightly different skeletons could be converted to the same skeleton, and unnatural skeleton representation sequences could be generated.

To tackle this problem, we propose an encoder-decoder model named Skeleton Orientation Alignment Networks (SOANets) for skeleton orientation alignment, as an alternative to the above exemplar-based approach. Each SOANet, corresponding to a target orientation, aligns an input 2D skeleton representation in an arbitrary orientation to that target orientation.

Figure 2: Difference in skeletons depending on their orientations (front, right, and left).

As a result, we can obtain skeletons in multiple orientations corresponding to a single input 2D skeleton. Unlike the exemplar-based approach, we can obtain continuous skeleton representations and achieve natural skeleton representation sequences, owing to the fact that a skeleton representation is regressed directly by SOANets.

The process of recognizing a white cane user is similar to that of the exemplar-based approach. Using the aligned skeleton representation sequences obtained through SOANets, whether each skeleton representation sequence corresponds to a white cane user or not is identified for all orientations independently. Finally, the classification result is obtained by weighting the classification results corresponding to each orientation of the input skeleton representation sequences and aggregating them.

Contributions of the present paper are summarized as follows:

• We introduce SOANets, based on an encoder-decoder model, to align the orientation of a 2D skeleton representation to multiple orientations. As the proposed method can output skeleton representations in the form of a continuous sequence, the output is much more natural than that of our previously proposed exemplar-based approach.

• We achieve more accurate classification of white cane users by replacing the skeleton orientation alignment of the exemplar-based approach with the proposed SOANets.

• Through a performance evaluation using images collected in several real environments, we demonstrate that the proposed method achieves the highest accuracy for white cane user recognition.

The rest of this paper is organized as follows. In Section 2, we describe related work.


In Section 3, we present the proposed method, which classifies a pedestrian's 2D skeleton representation sequence via orientation alignment of the 2D skeleton representations. In Section 4, we describe the conducted experiments and discuss their results. In Section 5, we conclude this paper and discuss future directions.

    2 RELATED WORK

For the task of human action recognition, it is necessary to extract the temporal transition of human features. To address this, Recurrent Neural Networks (RNNs) are often used to recognize continuous sequences such as sentences and videos. In particular, Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber, 1997), a type of RNN, is employed to handle long-term sequences, and methods based on this approach have achieved high accuracy for action recognition (Sun et al., 2017; Si et al., 2019).

Recently, representations of human skeletons are often used as features to recognize human actions. Convolutional pose machines (Wei et al., 2016) and OpenPose (Cao et al., 2017) are well-known methods for human skeleton estimation. These methods estimate the 2D coordinates of each body joint composing a human body using a Convolutional Neural Network (CNN).

It is desirable that a skeleton be represented in 3D rather than in 2D, as a 2D skeleton differs greatly depending on the skeleton's orientation. Therefore, some researchers have used 3D skeletons for action recognition (Le et al., 2018; Baptista et al., 2019; Wang et al., 2016). However, in these research works, 3D skeletons are prepared in advance or are estimated from images captured by multiple cameras. In a real scene, as it is difficult to install multiple cameras capturing people simultaneously everywhere, it is more practical to estimate a skeleton from a single image. Recently, a method to estimate a 3D skeleton from a 2D skeleton was proposed (Martinez et al., 2016), but it is still inaccurate.

To realize recognition of human actions, we have previously proposed a 2D skeleton orientation alignment method based on exemplars (Nishida et al., 2019). The alignment is performed by retrieving skeleton representations from a skeleton database prepared in advance. Using the skeleton orientation alignment, richer features of a 2D skeleton with various orientations can be obtained compared to methods based on a single skeleton orientation.

3 WHITE CANE USER RECOGNITION FROM VARIOUS ORIENTATIONS BY SOANets

When recognizing a white cane user based on a skeleton representation sequence, there is a problem that the appearance of a skeleton varies greatly depending on the skeleton's orientation. To address this problem, we basically follow the framework of our previously proposed exemplar-based method (Nishida et al., 2019). However, this method has a deficiency in the skeleton alignment process, as described in the previous section. To overcome this deficiency, we propose SOANets, which regress the orientation-aligned 2D skeleton instead of performing exemplar-based skeleton orientation alignment.

The procedure of the white cane user recognition framework is shown in Figure 3 (the part proposed in this paper is indicated by a double-lined box). First, a pedestrian image sequence is used as the input, and 1) the skeleton of the pedestrian in each frame is estimated. Then, 2) for each frame, 2D skeleton representation sequences with various orientations are obtained from the 2D skeletons by the proposed SOANets. Finally, 3) the aligned 2D skeleton representation sequences are classified by classifiers corresponding to their orientations, and the results are integrated to output the final decision for the input sequence.

Here, we describe the SOANets procedure of step 2). In this method, we align the orientation of a 2D skeleton to various orientations. We use one encoder-decoder model per target orientation to align an input 2D skeleton; each model regresses the 2D skeleton aligned to a certain orientation. We obtain 2D skeletons in various orientations from the encoder-decoder models corresponding to each orientation. The collection of these encoder-decoder models is named SOANets.

Details of each procedure are presented in the rest of this section.

    3.1 Skeleton Estimation of a Pedestrian

We define a human skeleton using a set of 2D coordinates of body joints such as wrists, elbows, knees, etc. Assuming that the number of body joints is $J$, a 2D skeleton with a certain orientation can be represented as $p \in \mathbb{R}^{2J}$. Here, the 2D skeleton of the $n$-th frame in a pedestrian image sequence is estimated as $p_n = (x_n^1, y_n^1, \ldots, x_n^j, y_n^j, \ldots, x_n^J, y_n^J)^{\top}$, $x_n^j, y_n^j \in \mathbb{R}$.

Figure 3: Procedure of the white cane user recognition following the exemplar-based approach: 1. skeleton estimation of a pedestrian; 2. skeleton orientation alignment by SOANets; 3. classification of the pedestrian sequence by per-orientation LSTM classifiers and aggregation of the outputs. The double-lined box indicates the part proposed in the present paper.

The sequence $I = \{I_1, \ldots, I_n, \ldots, I_N\}$ consists of $N$ color images obtained by tracking a pedestrian, each of size $w \times h$ pixels.

We use OpenPose (Cao et al., 2017) for 2D skeleton estimation. For each frame $I_n$, the method estimates heat maps indicating the probabilities of all body joints, and part affinity fields indicating the connection between each body joint pair. These maps and image features are used as the input, and a 2D skeleton representation $p_n$ and its probability $o_n$ are the output.

From the estimated 2D skeleton representation sequence $P = \{p_n\}_{n=1}^{N}$, we construct a 2D skeleton representation sequence $S$ as the input for the next process (skeleton orientation alignment). Here, we eliminate the frames with low-confidence estimations and construct $S$ from $P$ as follows:

$$S = \{p_n \mid \forall n,\ o_n \geq \tau\}, \qquad (1)$$

where $o_n$ is the probability of $p_n$. The value range of the estimated coordinates $(x_q^j, y_q^j)$ of each body joint is normalized as follows:

$$x'^{\,j}_q = \frac{x_q^j - \min_i (x_q^i)}{\max_i (x_q^i) - \min_i (x_q^i)}\, \theta_x, \qquad (2)$$

$$y'^{\,j}_q = \frac{y_q^j - \min_i (y_q^i)}{\max_i (y_q^i) - \min_i (y_q^i)}\, \theta_y, \qquad (3)$$

where $\theta_x$ and $\theta_y$ are constants that adjust the width and height of a skeleton. Examples of the estimated 2D skeletons are shown in Figure 4.

    Figure 4: Examples of estimated 2D skeletons.

In addition, we expand the body joint coordinates $p_n$ and the skeleton representation sequence $S$ to clarify whether the coordinates of each body joint could be estimated:

$$p'_n = (x'^{\,1}_n, y'^{\,1}_n, f_n^1, \ldots, x'^{\,j}_n, y'^{\,j}_n, f_n^j, \ldots, x'^{\,J}_n, y'^{\,J}_n, f_n^J), \qquad (4)$$

$$S' = \{p'_n \mid \forall n,\ o_n \geq \tau\}, \qquad (5)$$

$$f_n^j = \begin{cases} 1 & \text{(successfully detected)} \\ 0 & \text{(misdetected)} \end{cases} \qquad (6)$$
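A small sketch of the expansion of Eq. (4), interleaving the normalized coordinates with the detection flags; the boolean `detected` mask is a hypothetical input standing in for the per-joint detection outcome.

```python
import numpy as np

def expand_with_flags(p_norm, detected):
    """Eq. (4): attach the detection flags f to the normalized coordinates.
    p_norm: (J, 2) normalized skeleton; detected: (J,) boolean mask
    (True where the joint was successfully detected)."""
    f = detected.astype(float)[:, None]                     # f^j_n in {0, 1}
    return np.concatenate([p_norm, f], axis=1).reshape(-1)  # (3J,): (x', y', f) per joint
```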

3.2 Skeleton Orientation Alignment using SOANets

Using SOANets, a 2D skeleton representation $p'_n$ is transformed into a set of 2D skeleton representations $\{\tilde{p}_{nd}\}_{d=1}^{D}$ viewed from $D$ different orientations. Here, the function $T(p'_n)$ performing this transformation is defined as follows:

$$T(p'_n) = \{\tilde{p}_{n1}, \ldots, \tilde{p}_{nd}, \ldots, \tilde{p}_{nD}\}. \qquad (7)$$

We perform the skeleton orientation alignment with encoder-decoder models. Each encoder-decoder model transforms an input 2D skeleton in an arbitrary orientation into an aligned 2D skeleton representation in a specific orientation, as shown in Figure 5. This set of encoder-decoder networks regresses 2D skeletons aligned to the target orientations, so we can obtain skeletons in various orientations from a single input skeleton, as shown in Figure 6. All encoder-decoder models have the same architecture, presented in Figure 7. The collection of these encoder-decoder models is named SOANets. The function $T(p'_n)$ is composed of the individual SOANets, and each SOANet $T_d$ is defined as follows:

$$T_d(p'_n) = \mathrm{Decoder}_d(\mathrm{Encoder}_d(p'_n)) = \tilde{p}_{nd}. \qquad (8)$$

Therefore, $T(p'_n)$ can also be written as follows:

$$T(p'_n) = \{T_1(p'_n), \ldots, T_d(p'_n), \ldots, T_D(p'_n)\}. \qquad (9)$$

Next, we describe the details of the inputs and outputs of SOANets. To realize the skeleton orientation alignment from input skeletons in various orientations, we prepare 2D skeletons in $D$ orientations as the training input.


Figure 5: Skeletons in any orientation are aligned to a single orientation using a SOANet.

Figure 6: Skeleton orientation alignment using SOANets (Encoder_1/Decoder_1 through Encoder_D/Decoder_D).

It can be difficult to obtain the coordinates of some body joints of a skeleton due to occlusion, distance from the camera, etc. Such misdetection of body joints affects not only the skeleton orientation alignment but also the classification using the aligned skeleton representation sequences. To address this issue, we use skeleton representations with missing body joint coordinates as the input, and the complete skeleton representations as the output.

Using the training data, SOANets are trained to regress the complete 2D skeleton representation $\tilde{p}_{nd} = (\tilde{x}^1_{nd}, \tilde{y}^1_{nd}, \ldots, \tilde{x}^J_{nd}, \tilde{y}^J_{nd})$ in the aligned orientation $d$ from an input skeleton representation $p'_n \in \mathbb{R}^{3J}$. As all missing values corresponding to body joints are supposed to be restored, the flag $f_n^j$ indicating a missing body joint is excluded from $\tilde{p}_{nd} \in \mathbb{R}^{2J}$.

Finally, the orientations of all 2D skeletons in the input skeleton representation sequence $S' \in \mathbb{R}^{3JN}$ are aligned by SOANets. The skeleton representation sequences $\{\tilde{S}_d\}_{d=1}^{D}$ aligned to the $D$ orientations are denoted as follows:

$$\{\tilde{S}_d\}_{d=1}^{D} = \{(\tilde{p}_{1d}, \ldots, \tilde{p}_{nd}, \ldots, \tilde{p}_{Nd})\}_{d=1}^{D}. \qquad (10)$$

Each $\tilde{S}_d$ is used as the input for classification to identify whether the pedestrian is a white cane user or not. An example of aligned 2D skeletons obtained by the proposed SOANets from an input 2D skeleton is shown in Figure 8.
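Building on the sketch above, Eq. (10) amounts to applying every per-orientation network to all frames of $S'$; since each SOANet is a per-frame MLP, a whole sequence can be processed as one batch:

```python
def align_sequence(S_prime):
    """Eq. (10): apply each per-orientation network (from the sketch above) to all
    N frames of S' at once. S_prime: (N, 3J) tensor; returns a list of D tensors
    of shape (N, 2J), i.e., the aligned sequences S~_1, ..., S~_D."""
    return [net(S_prime) for net in soanets]
```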

Figure 7: Encoder-decoder architecture of each SOANet, composed of fully connected layers (75-dimensional input, 50-dimensional output).

Figure 8: Examples of aligned 2D pseudo-skeletons (a skeleton sequence shown across frames and orientations).

3.3 Classification of a Pedestrian Sequence

Each 2D skeleton representation sequence $\tilde{S}_d$ obtained by SOANets is classified as corresponding to a white cane user or not by the classifier for its skeleton orientation. Following the exemplar-based approach, we prepare $D$ individual LSTM-based classifiers $C = \{C_1, \ldots, C_d, \ldots, C_D\}$ for the $D$ different skeleton orientations.
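The paper specifies LSTM-based classifiers but not their topology, so the following PyTorch sketch is one plausible minimal form; the hidden size and the single-layer design are assumptions.

```python
import torch
import torch.nn as nn

J = 25  # an aligned skeleton has 2J = 50 values per frame

class OrientationClassifier(nn.Module):
    """Binary classifier C_d for one skeleton orientation."""
    def __init__(self, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(input_size=2 * J, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, 2)  # scores for (white cane user, sighted person)

    def forward(self, seq):           # seq: (batch, N, 2J) aligned sequence S~_d
        _, (h_n, _) = self.lstm(seq)  # final hidden state summarizes the sequence
        return torch.softmax(self.head(h_n[-1]), dim=-1)  # per-class scores g_d

D = 36
classifiers = [OrientationClassifier() for _ in range(D)]  # C = {C_1, ..., C_D}
```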

Following the exemplar-based approach, the classification scores of each class (a white cane user or a sighted person) are weighted and integrated as follows:

$$g^{\ell}(\tilde{S}) = \frac{1}{D} \sum_{d=1}^{D} w_d\, g^{\ell}_d(\tilde{S}), \qquad (11)$$

$$w_d = \begin{cases} 1 & a(C_d) \geq \delta, \\ 0 & a(C_d) < \delta, \end{cases} \qquad (12)$$

where $g^{\ell}(\tilde{S})$ is the integrated classification score of class $\ell$; $g^{\ell}_d(\tilde{S})$ is the classification score of $C_d$; $w_d$ is the weight for $g^{\ell}_d(\tilde{S})$ of all classes; $a(C_d)$ is the accuracy of the classifier $C_d$ on the training dataset; and $\delta$ is a threshold. Finally, the class $\tilde{\ell} = \arg\max_{\ell} g^{\ell}(\tilde{S})$ with the highest score is output as the classification result.
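A minimal NumPy sketch of the aggregation in Eqs. (11)-(12); the shapes of the score and accuracy arrays are assumptions.

```python
import numpy as np

def aggregate(scores, accuracies, delta):
    """Eqs. (11)-(12): accuracy-thresholded weighting and score integration.
    scores: (D, 2) array with scores[d] holding g^l_d for each class l;
    accuracies: (D,) training accuracies a(C_d); delta: the threshold."""
    w = (accuracies >= delta).astype(float)              # Eq. (12)
    g = (w[:, None] * scores).sum(axis=0) / len(scores)  # Eq. (11)
    return int(np.argmax(g))                             # predicted class l~
```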

    4 EVALUATION

In this section, we introduce two experiments: 1) an evaluation of the skeleton orientation alignment performed by SOANets, and 2) a confirmation of the effectiveness of the proposed method in terms of pedestrian classification.

    4.1 Dataset for Training SOANets

For training SOANets, we captured images of pedestrians using three calibrated cameras. We captured the data at one specific location, and the same participant performed the roles of both a sighted and a visually impaired person. We estimated their 3D poses with OpenPose and obtained M (= 4,616) complete 3D skeleton representations. The orientation in which a skeleton faces the front is labeled as 0°, and the other orientations are set by rotating the skeleton counterclockwise in steps of 10° around the vertical axis. As a result, from each 3D skeleton representation, D (= 36) sets of complete 2D skeletons in the D orientations are obtained. Finally, the training data for SOANets consist of MD (= 166,176) complete 2D skeleton representations.
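The per-orientation targets can be generated by rotating each 3D skeleton around the vertical axis and projecting it to 2D; the following sketch assumes an orthographic projection, a right-handed axis convention, and a placeholder `skel3d` array, none of which are stated in the paper.

```python
import numpy as np

J = 25
skel3d = np.zeros((J, 3))  # placeholder for one complete 3D skeleton from OpenPose

def rotate_and_project(p3d, deg):
    """Rotate a (J, 3) 3D skeleton by deg degrees around the vertical (y) axis and
    keep the (x, y) image-plane coordinates. Orthographic projection is an
    assumption; the paper does not state the projection model."""
    t = np.deg2rad(deg)
    R = np.array([[ np.cos(t), 0.0, np.sin(t)],
                  [ 0.0,       1.0, 0.0      ],
                  [-np.sin(t), 0.0, np.cos(t)]])
    return (p3d @ R.T)[:, :2]

# D = 36 orientations in 10-degree steps, with 0 degrees facing the front
targets = [rotate_and_project(skel3d, 10 * d) for d in range(36)]
```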

    4.2 Dataset for Classification

For the training and testing data, we prepared a dataset by capturing sequences of several walking white cane users and sighted pedestrians, in the same manner as in our previous study (Nishida et al., 2019). In the experiment, seventeen sighted participants played both roles, and five visually impaired people also participated as white cane users. The sequences were captured at five different locations, both indoors and outdoors. We composed pedestrian sequences by selecting frames in which pedestrians existed, resulting in 266 pedestrian sequences. The details of the prepared dataset are summarized in Table 1, and examples of the images at each location are presented in Figure 9.

Table 1: Number of pedestrian image sequences in the dataset.

Location            1    2    3    4    5    All
#White-cane user   23   12   12   10   76    133
#Non user          26    6   25    0   76    133
#All sequences     49   18   37   10  152    266

Figure 9: Examples of images at each location.

4.3 Experiment 1: Evaluation of the Skeleton Orientation Alignment

    4.3.1 Settings

In this experiment, we evaluate the performance of the skeleton orientation alignment by SOANets. As the evaluation metric, we use the Root Mean Squared Error (RMSE) computed from a set of $E$ aligned skeletons $\{p^{\mathrm{al}}_e = (x^{\mathrm{al}}_{e1}, y^{\mathrm{al}}_{e1}, \ldots, x^{\mathrm{al}}_{eJ}, y^{\mathrm{al}}_{eJ})\}$ and the $E$ corresponding ground-truth skeletons $\{p^{\mathrm{GT}}_e = (x^{\mathrm{GT}}_{e1}, y^{\mathrm{GT}}_{e1}, \ldots, x^{\mathrm{GT}}_{eJ}, y^{\mathrm{GT}}_{eJ})\}$. Here, the RMSE is defined as follows:

$$\mathrm{RMSE} = \frac{1}{E} \sum_{e=1}^{E} \sqrt{\frac{1}{J} \sum_{j=1}^{J} \left\{ (x^{\mathrm{GT}}_{ej} - x^{\mathrm{al}}_{ej})^2 + (y^{\mathrm{GT}}_{ej} - y^{\mathrm{al}}_{ej})^2 \right\}} \qquad (13)$$
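A direct NumPy transcription of Eq. (13), assuming the aligned and ground-truth skeletons are stored as (E, J, 2) arrays:

```python
import numpy as np

def rmse(aligned, gt):
    """Eq. (13): mean over E skeletons of the per-skeleton root mean squared joint error.
    aligned, gt: (E, J, 2) arrays of aligned and ground-truth joint coordinates."""
    sq = ((gt - aligned) ** 2).sum(axis=2)  # (E, J): squared x-error plus y-error per joint
    return np.sqrt(sq.mean(axis=1)).mean()  # sqrt of the per-skeleton mean, then mean over E
```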

We considered two to five random body joints of the input skeleton as misdetected body joints. This pseudo-misdetection is applied iteratively ten times for each input to augment the data. Therefore, the number of skeleton data for SOANets is 10MD (= 1,661,760), of which 1,600,000 skeletons are used for training and 61,760 skeletons are used for testing.
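A sketch of this pseudo-misdetection augmentation; zeroing the dropped coordinates is an assumption, as the paper only states which joints are treated as misdetected, and `p_norm` is a hypothetical normalized skeleton.

```python
import numpy as np

J = 25
p_norm = np.zeros((J, 2))  # placeholder for one normalized (J, 2) skeleton

def pseudo_misdetect(p, rng):
    """Treat 2 to 5 random joints as misdetected, yielding the expanded input of Eq. (4).
    How the dropped values are filled is not stated in the paper; zeros are assumed here."""
    out = np.concatenate([p, np.ones((J, 1))], axis=1)            # all flags f = 1
    drop = rng.choice(J, size=rng.integers(2, 6), replace=False)  # 2-5 joints
    out[drop] = 0.0                                               # (x', y', f) = 0
    return out.reshape(-1)                                        # flatten to R^{3J}

rng = np.random.default_rng(0)
augmented = [pseudo_misdetect(p_norm, rng) for _ in range(10)]  # ten augmentations per input
```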

    4.3.2 Result

The results of applying the proposed method based on SOANets, for all considered aligned skeleton orientations, are shown in Figure 10. Examples of the skeleton orientation alignment using SOANets are presented in Figure 11, and a comparison of the skeleton orientation alignment results of SOANets and the exemplar-based approach is shown in Figure 12.


Figure 10: Results of the skeleton orientation alignment evaluation (vertical axis: RMSE; horizontal axis: skeleton orientation after alignment).

    Figure 11: Examples of the skeleton orientation alignment.

Figure 12: Examples of the skeleton orientation alignment to 90° (right orientation) by the exemplar-based approach and SOANets (proposed). The exemplar-based result shows an unnatural transition, whereas the SOANets result shows a smooth transition.


    4.3.3 Discussion

In this section, we discuss the results of the skeleton orientation alignment. From Figure 10, we can see that the RMSE is the lowest when the aligned orientations are 0° and 180°, corresponding to the front and back orientations, respectively. In contrast, when the aligned orientations are 90° and 270°, corresponding to the right and left orientations, respectively, the RMSE is the highest. The depth of the arm and leg joints is considered to be the cause of this difference in RMSE.

For a 2D skeleton in the front or back orientation, it is difficult to estimate the depth of the limbs. However, for a 2D skeleton in a side orientation, the limbs appear clearly. Therefore, we consider that the alignment error becomes larger due to the lack of information on limb depth when a front or back skeleton representation sequence is aligned to a side orientation. As we can see in the bottom part of Figure 11, the error in the limb joints is larger than that in the other joints. To address this problem, it is necessary to obtain information on limb depth from a front or back skeleton representation sequence. For example, we can consider using multiple frames as the input of SOANets to obtain more features than from a single frame.

From Figure 12, we can see that the aligned skeleton representations obtained using the previously proposed exemplar-based method change greatly from the third to the fourth frame. As it aligns the skeleton orientation by retrieving skeleton representations from the database, this alignment is limited to the number of skeleton patterns available in the database. Therefore, in Figure 12, the exemplar-based approach aligns the skeleton representation sequence with an unnatural transition. On the other hand, since the proposed SOANets regress the aligned skeleton directly from an input skeleton, they have no such limitation and were able to align the skeleton orientation more naturally.

4.4 Experiment 2: Evaluation of Pedestrian Classification

    4.4.1 Settings

In this experiment, we evaluate the accuracy of classifying pedestrian sequences.

For evaluation, the length of each 2D skeleton representation sequence is set to 64 frames, and each sequence is divided into five overlapping sub-sequences of 32 frames each, extracted with a step of eight frames (see the sketch below). The total number of skeleton representation sequences (before applying the skeleton orientation alignment) is 266 × 5 = 1,330. The number of detected body joints for each 2D skeleton is J = 25, and the value range of the body joint coordinates is [−1.0, 1.0] in the horizontal and vertical directions, with θx = 1.0 and θy = 1.0. For evaluation, five-fold cross-validation is performed over the five locations: the dataset captured at one location is used for evaluation, and the datasets of the other four locations are used for training the SOANets.
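A one-liner sketch of this windowing; with a 64-frame input, windows of 32 frames at a step of 8 start at frames 0, 8, 16, 24, and 32, giving the five sub-sequences:

```python
def split_windows(seq, win=32, step=8):
    """Split a skeleton sequence (e.g., a (64, 3J) array or list of frames)
    into overlapping fixed-length windows."""
    return [seq[s:s + win] for s in range(0, len(seq) - win + 1, step)]
```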

To estimate the performance of the proposed method, we compare the accuracy of the following five methods:




• No alignment of the skeleton orientation of the input skeleton representation, using a single classifier (No alignment).

• Aligning the skeleton orientation by the exemplar-based approach and integrating all classifier results without weighting (Exemplar-based, No weighting).

• Aligning the skeleton orientation by the exemplar-based approach and integrating all results with weighting (Exemplar-based, Weighting).

• Aligning the skeleton orientation by SOANets and integrating all results without weighting (SOANets, No weighting).

• Aligning the skeleton orientation by SOANets and integrating all results with weighting (Proposed: SOANets, Weighting).

    4.4.2 Results

The results are summarized in Table 2. In this table, "All" indicates the average of the results over all considered locations, weighted by the numbers of samples shown in Table 1. The accuracy of the proposed method improved by 16% compared with the method without skeleton orientation alignment. Moreover, the accuracy of the proposed method improved compared with the previously proposed exemplar-based approach, both with and without weighting of each classifier. As a result, the effectiveness of the proposed method was confirmed.

Table 2: Classification results.

Location                       1     2     3     4     5     All
No alignment                  0.82  0.70  0.55  0.53  0.64   0.66
Exemplar-based, No weighting  0.77  0.70  0.76  0.52  0.80   0.77
Exemplar-based, Weighting     0.80  0.69  0.75  0.50  0.81   0.78
SOANets, No weighting         0.85  0.84  0.68  0.64  0.82   0.80
SOANets, Weighting (Ours)     0.87  0.90  0.71  0.70  0.82   0.82

    4.4.3 Discussion

Here, we discuss the experimental results. We focus on the following two points: 1) the locations where the data were captured, and 2) the weighting of classifiers.

First, we discuss the difference in the results in terms of the locations where the data were captured. As shown in Table 2, for all the considered methods, the accuracy at location 4 is relatively low.

Figure 13: Relation of the number of classifiers and the accuracy (horizontal axis: threshold for weighting; vertical axes: accuracy and number of classifiers).

There are two possible reasons for this. One is that, unlike the other locations, location 4 contains only white cane users and no pedestrians without white canes. The other is that, while capturing the data at location 4, the camera position was relatively higher than at the other locations, and thereby the tilt angle was different from the others. Therefore, we consider that the classification accuracy deteriorated because the skeleton patterns were different from those at the other locations. To mitigate this problem, it is necessary to capture data with varying camera positions, target locations, and subjects multiple times.

Second, we discuss applying different weights to the classifiers. Let us investigate the relation between the number of classifiers used for evaluation (those with wd = 1) and the corresponding accuracy, based on the results shown in Figure 13. The graph is drawn by changing the threshold parameter δ, which controls the number of classifiers. The accuracy improved as the number of classifiers used for evaluation decreased; the highest accuracy was obtained with twenty classifiers and δ = 0.972. However, the accuracy decreased rapidly when fewer than twenty classifiers were used. Based on this observation, we conclude that it is necessary to determine the required number of classifiers, corresponding to the skeleton orientations, to maintain high accuracy. The classifiers that were not used in the evaluation mainly correspond to the front and back orientations. The reason for this is that a 2D skeleton representation in the front or back orientation provides less information on the depth of the limbs, as described in Section 4.3.3. However, if a white cane user employs a white cane by swinging his/her arm left and right, the skeleton representations in the front or back orientations are important for recognizing such an action. Therefore, we plan to introduce a mechanism to select important classifiers according to the action presented in the input skeleton representation sequence.


5 CONCLUSION

In this paper, we aimed to solve the problem of recognizing white cane users by classifying pedestrians based on the temporal transition of their skeletons. We proposed a 2D skeleton orientation alignment method named SOANets. Through experiments, we demonstrated that the accuracy of analyzing pedestrian image sequences improves by incorporating the orientation alignment of an input 2D skeleton, and the effectiveness of the proposed method was confirmed.

In the future, we plan to train SOANets with more action patterns. We will also integrate object recognition methods that can directly detect a white cane.

    ACKNOWLEDGMENTS

Parts of this research were supported by MEXT, Grant-in-Aid for Scientific Research.

    REFERENCES

Baptista, R., Ghorbel, E., Papadopoulos, K., Demisse, G. G., Aouada, D., and Ottersten, B. (2019). View-invariant action recognition from RGB data via 3D pose estimation. In Proceedings of the 2019 IEEE International Conference on Acoustics, Speech and Signal Processing, pages 2542–2546.

Cai, Z. and Vasconcelos, N. (2018). Cascade R-CNN: Delving into High Quality Object Detection. In Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition, pages 6154–6162.

Cao, Z., Simon, T., Wei, S. E., and Sheikh, Y. (2017). Realtime multi-person 2D pose estimation using part affinity fields. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, pages 7291–7299.

He, K., Gkioxari, G., Dollar, P., and Girshick, R. (2017). Mask R-CNN. In Proceedings of the 2017 IEEE International Conference on Computer Vision, pages 2961–2969.

Hochreiter, S. and Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8):1735–1780.

Le, T. M., Inoue, N., and Shinoda, K. (2018). A fine-to-coarse convolutional neural network for 3D human action recognition. In Proceedings of the 29th British Machine Vision Conference, pages 184-1–184-13.

Martinez, J., Hossain, R., Romero, J., and Little, J. J. (2016). A simple yet effective baseline for 3D human pose estimation. In Proceedings of the 2017 IEEE International Conference on Computer Vision, pages 2640–2649.

Nishida, N., Kawanishi, Y., Deguchi, D., Ide, I., Murase, H., and Piao, J. (2019). Exemplar-based Pseudo-Viewpoint Rotation for White-Cane User Recognition from a 2D Human Pose Sequence. In Proceedings of the 16th IEEE International Conference on Advanced Video and Signal-based Surveillance, paper ID 29.

Redmon, J. and Farhadi, A. (2018). YOLOv3: An Incremental Improvement. Computing Research Repository, arXiv:1804.02767.

Si, C., Chen, W., Wang, W., Wang, L., and Tan, T. (2019). An attention enhanced graph convolutional LSTM network for skeleton-based action recognition. In Proceedings of the 2019 IEEE Conference on Computer Vision and Pattern Recognition, pages 1227–1236.

Sun, L., Jia, K., Chen, K., Yeung, D. Y., Shi, B. E., and Savarese, S. (2017). Lattice long short-term memory for human action recognition. In Proceedings of the 2017 IEEE International Conference on Computer Vision, pages 2147–2156.

Tanikawa, U., Kawanishi, Y., Deguchi, D., Ide, I., Murase, H., and Kawai, R. (2017). Wheelchair-user Detection Combined with Parts-based Tracking. In Proceedings of the 12th Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications, volume 5, pages 165–172.

Wang, C., Wang, Y., and Yuille, A. L. (2016). Mining 3D key-pose-motifs for action recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, pages 2639–2647.

Wei, S. E., Ramakrishna, V., Kanade, T., and Sheikh, Y. (2016). Convolutional pose machines. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, pages 4724–4732.
