
Digital Human Symposium 2009, March 4th, 2009

A Predefined Command Recognition System Using a Ceiling Microphone Array in Noisy Housing Environments

Yoko Sasaki ∗a∗b, Satoshi Kagami ∗b∗c, Hiroshi Mizoguchi ∗a∗b, Tadashi Enomoto ∗d

[email protected] [email protected] [email protected] [email protected]

Abstract

This paper describes a speech recognition system that detects basic voice commands for a mobile robot operating in a home space. The system recognizes arbitrarily timed speech, together with position information, in a noisy housing environment. The microphone array is attached to the ceiling; it localizes sound source directions in azimuth and elevation, then separates multiple sound sources using the Delay and Sum Beamforming (DSBF) and Frequency Band Selection (FBS) algorithms. We implement the sound localization and separation method on our 32-channel microphone array, and the separated sound sources are recognized with an open-source speech recognizer. The localization, separation, and recognition functions run online in the real world. We define four indices to evaluate the performance of the recognition system, and experiments under varied conditions confirm its effectiveness in a noisy environment and with distant sound sources. Finally, an application as a mobile robot interface is reported.

1. INTRODUCTION

An auditory system is useful for a robot to communicate with humans and for the initial recognition of environmental change. To achieve efficient auditory functionality in the real world, sound localization, separation, and recognition must be combined in a total system.

There is an increasing amount of research on robotic audition systems for human-robot interaction, and various methods have been proposed for each function of an auditory system. Blind Source Separation (BSS) based on Independent Component Analysis (ICA) has been studied extensively. A real-time sound separation system [1] using two microphones was proposed and showed good performance on hands-free speech recognition; it has been applied to a robot as a human-robot verbal conversation interface. The method is known for high-performance sound separation with a small system. On the other hand, its assumption of frequency independence between sound sources limits it in reverberant fields and when handling multiple sound sources.

*a : School of Science and Technology, Tokyo University of Science
*b : Digital Human Research Center, National Institute of Advanced Industrial Science and Technology (AIST)
*c : CREST, Japan Science and Technology Agency (JST)
*d : Kansai Electric Power Company, Inc.

As for room-integrated sensors, work using a 1020-channel microphone array [2] shows good beamforming performance; it can localize a human voice in a room. Recent work [3] reported tracking a human voice using a 64-channel microphone array distributed in a room together with particle filtering techniques, which improved tracking performance.

Several methods have been proposed for real-world implementation. A GMM-based speech end-point detection method and an outlier-robust generalized sidelobe canceler were implemented using a 12-channel microphone array [4]. A missing-feature algorithm has been proposed to achieve robust localization and segmentation [5].

As for total auditory systems, a real-time auditory system [6] with simultaneous speech recognition has been implemented on a robot-embedded microphone array. It works for multiple sound sources at near range and shows a good recognition rate for known speech coming from in front of the robot head, using a preliminarily measured impulse response and parameters optimized for recognition. Nakadai et al. reported an application of simultaneous multiple-sound recognition using this system [7].

In this paper, we present a newly developed online voice command recognition system for noisy environments using a microphone array. To detect verbal commands at arbitrary times and positions in a home space, the following properties are needed: 1) the system works online, 2) it does not use environmental information, and 3) recognition is robust to various sound sources, including non-speech sources. We propose a command recognition system using a beamforming-based sound localization and separation method implemented on a microphone array in each room of a home space. The extracted sound sources are recognized using an open-source recognizer for close-talking microphones. The proposed system needs no preliminarily measured environmental information and can adapt to various conditions, such as multiple target sound sources which require simultaneous recognition, or distant target sound sources together with noisy sounds near the microphone array. We define four indices to evaluate the recognition system and perform experiments to measure performance under varied conditions. Finally, an application of the voice command recognition system to controlling a mobile robot in a home environment is shown.


2. Sound Localization and Separation

This section describes our approach to localizing and separating multiple sound sources using a microphone array.

2.1. FBS Based 2D Sound Localization

The sound localization method has two main parts: first, Delay and Sum Beamforming (DSBF) enhances the signal from the focused direction to obtain a sound pressure distribution, called the spatial spectrum; second, Frequency Band Selection (FBS) [8], a kind of binary mask, filters out the detected louder sound's signal so that quieter sound sources can be localized simultaneously. The localization system detects multiple sound sources from the highest power intensity to the lowest at each time step.

Aligning the phases of the channels amplifies the desired sound and attenuates ambient noise. Let O be the microphone array center, and Cj the focus position on an O-centered hemisphere whose radius is larger than the array size. The focus positions are set by a linear triangular grid on the surface of the sphere [9], so that sound source positions can be localized in azimuth and elevation. Using the median point of each uniform triangle on the sphere's surface as the focus point Cj of DSBF, the system can estimate the sound pressure distribution in two dimensions; a sketch of generating such a grid follows.
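For illustration, here is a minimal sketch of one way to build such a grid: an icosahedral geodesic subdivision, taking the median point of each upper-hemisphere triangle as a focus direction. The subdivision depth and hemisphere orientation are our assumptions, not the paper's exact construction.

```python
import numpy as np

def icosahedron():
    """Vertices and faces of a unit icosahedron."""
    t = (1 + np.sqrt(5)) / 2
    v = np.array([[-1, t, 0], [1, t, 0], [-1, -t, 0], [1, -t, 0],
                  [0, -1, t], [0, 1, t], [0, -1, -t], [0, 1, -t],
                  [t, 0, -1], [t, 0, 1], [-t, 0, -1], [-t, 0, 1]], float)
    v /= np.linalg.norm(v, axis=1, keepdims=True)
    f = [(0, 11, 5), (0, 5, 1), (0, 1, 7), (0, 7, 10), (0, 10, 11),
         (1, 5, 9), (5, 11, 4), (11, 10, 2), (10, 7, 6), (7, 1, 8),
         (3, 9, 4), (3, 4, 2), (3, 2, 6), (3, 6, 8), (3, 8, 9),
         (4, 9, 5), (2, 4, 11), (6, 2, 10), (8, 6, 7), (9, 8, 1)]
    return v, f

def subdivide(v, f):
    """Split each triangle into 4, reprojecting new vertices to the sphere."""
    verts, cache = list(v), {}
    def mid(a, b):
        key = (min(a, b), max(a, b))
        if key not in cache:
            m = (verts[a] + verts[b]) / 2
            cache[key] = len(verts)
            verts.append(m / np.linalg.norm(m))
        return cache[key]
    faces = []
    for a, b, c in f:
        ab, bc, ca = mid(a, b), mid(b, c), mid(c, a)
        faces += [(a, ab, ca), (b, bc, ab), (c, ca, bc), (ab, bc, ca)]
    return np.array(verts), faces

def focus_points(depth=2):
    """Median points of upper-hemisphere triangles: the DSBF focus grid."""
    v, f = icosahedron()
    for _ in range(depth):
        v, f = subdivide(v, f)
    cents = np.array([(v[a] + v[b] + v[c]) / 3 for a, b, c in f])
    cents /= np.linalg.norm(cents, axis=1, keepdims=True)
    return cents[cents[:, 2] >= 0]   # keep the hemisphere facing the room

print(len(focus_points(depth=2)))   # on the order of the paper's 160-point grid
```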

Let t be time and (θ, φ) the azimuth and elevation of Cj. The delay time τji for the i-th microphone (1 ≤ i ≤ N) is determined by the microphone arrangement. The enhanced wave yj for the focus direction is expressed as equation (1):

$$ y_j(\omega, t) = \sum_{i=1}^{N} W_i(\omega, \theta, \phi)\, x_i(t + \tau_{ji}) \tag{1} $$

where N is the number of microphones and Wi is a corrective weight for each microphone's directivity.
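As an illustration, here is a minimal time-domain sketch of DSBF toward one focus direction, assuming far-field plane-wave delays, integer-sample alignment, and uniform weights Wi = 1/N (the paper's Wi corrects for microphone directivity).

```python
import numpy as np

def dsbf(x, mic_pos, focus_dir, fs=16000, c=340.0):
    """Delay-and-sum beamforming toward one focus direction.

    x         : (N, T) array of microphone signals
    mic_pos   : (N, 3) microphone coordinates in meters, array center at origin
    focus_dir : unit vector toward the focus point C_j
    """
    n_mic, n_smp = x.shape
    # tau_ji: relative arrival delay per mic (mics closer along focus_dir
    # receive the wavefront earlier, i.e. negative delay)
    tau = -(mic_pos @ focus_dir) / c
    shift = np.round(tau * fs).astype(int)       # integer-sample approximation
    y = np.zeros(n_smp)
    for i in range(n_mic):
        # x_i(t + tau_ji): advance each channel by its delay so that the
        # focused direction adds coherently (np.roll wraps at frame edges,
        # acceptable for a framewise sketch)
        y += np.roll(x[i], -shift[i])
    return y / n_mic
```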

We apply the FBS method after DSBF to detect multiple sound sources. FBS assumes that the frequency components of the signals are mutually independent; it is a kind of binary mask that segregates a targeted sound source from the mixed sound by selecting the frequency components judged to come from the target.

The process is as follows. Let Xa(ω) be the frequency components of the DSBF-enhanced signal for position a, and Xb(ω) those for position b. The selected frequency components Xas(ω) for position a are expressed as in equation (2):

$$ X_{as}(\omega) = M_a(\omega) X_a(\omega), \qquad M_a(\omega) = \begin{cases} 1 & \text{if } X_a(\omega) \geq X_b(\omega) \\ 0 & \text{otherwise} \end{cases} \tag{2} $$

This process rejects the attenuated noise signal from the DSBF-enhanced signal. The segregated waveform is obtained by the inverse Fourier transform of Xas(ω). When the frequency components of the signals are independent, FBS completely separates the desired sound source. This assumption usually holds for human voices and everyday sounds of limited duration.
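A minimal sketch of the binary mask of equation (2), assuming the comparison is made on magnitude spectra:

```python
import numpy as np

def fbs_mask(xa, xb):
    """Frequency Band Selection between two DSBF-enhanced frames (eq. 2).

    xa, xb: time-domain frames enhanced toward positions a and b.
    Returns the waveform for position a, keeping only the frequency bins
    where a is at least as strong as b (a binary mask).
    """
    Xa, Xb = np.fft.rfft(xa), np.fft.rfft(xb)
    mask = (np.abs(Xa) >= np.abs(Xb)).astype(float)   # M_a(omega)
    return np.fft.irfft(mask * Xa, n=len(xa))
```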

The spatial spectrum for directional localization, which indicates the sound pressure distribution over a frame of length N, is described as follows:

$$ Q_K(\theta, \phi) = \frac{\sum_{i=0}^{\omega} \left( \prod_{k=0}^{K-1} \left(1 - M_k(i)\right) \cdot \sqrt{|Y(i)|^2} \right)}{\sum_{i=0}^{\omega} \prod_{k=0}^{K-1} \left(1 - M_k(i)\right)} \tag{3} $$

where Y(ω) is the Fast Fourier Transform of the DSBF-enhanced signal y, and Mk is the separation vector for the k-th loudest sound source generated by FBS.
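A minimal sketch of equation (3) at one grid point, assuming binary masks over rfft bins:

```python
import numpy as np

def spatial_power(y, masks):
    """Residual power Q_K at one grid point (eq. 3).

    y     : DSBF-enhanced frame for this focus direction
    masks : binary masks M_0..M_{K-1} of already-detected sources
    Bins claimed by any detected source are excluded, so quieter sources
    can still be found in the remaining bands.
    """
    Y = np.abs(np.fft.rfft(y))
    keep = np.ones_like(Y)
    for m in masks:
        keep *= (1.0 - m)               # prod_k (1 - M_k)
    denom = keep.sum()
    return (keep * Y).sum() / denom if denom > 0 else 0.0
```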

Fig. 1 shows the calculation flow of FBS-based multiple sound source localization. In the DSBF phase, the system scans the formed beam over the spherical grid and obtains the spatial spectrum, which indicates the sound pressure distribution on the hemisphere. In the FBS phase, the system first detects the loudest sound direction as the maximum peak of the spatial spectrum, then filters out the loudest sound's signal by FBS and localizes the second strongest sound source, and so on.

Fig. 1 FBS based multiple sound sources localization: the signal input enters the DSBF phase (scanning the focus to obtain the spatial spectrum Q0(θ, φ)); the FBS phase detects the loudest source direction (θ0, φ0), filters out the loudest signal to obtain Q1(θ, φ), detects the second strongest direction (θ1, φ1), and proceeds likewise for three or more sources.

2.2. FBS Based Multiple Sound Separation

The sound source separation algorithm is almost the same as localization: it segregates the sound sources in the directions detected at the localization stage. For robust recognition, the mask M in equation (2) is revised; the mask vector M′ for separation is expressed as equation (4):

$$ M'_a(\omega) = \begin{cases} 1 & \text{if } X_a(\omega) \geq X_b(\omega) \\ 0.5 & \text{else if } X_a(\omega) \geq 0.5\, X_b(\omega) \\ 0 & \text{otherwise} \end{cases} \tag{4} $$
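A minimal sketch of equation (4), again assuming the comparison is on magnitude spectra:

```python
import numpy as np

def separation_mask(Xa, Xb):
    """Revised mask M' for separation (eq. 4).

    Full pass where a dominates, half gain where a is partially dominated
    (within a factor of 0.5, i.e. -6 dB in amplitude), zero otherwise.
    """
    A, B = np.abs(Xa), np.abs(Xb)
    return np.where(A >= B, 1.0, np.where(A >= 0.5 * B, 0.5, 0.0))
```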

At the recognition stage, the interval in which sound exists is extracted from each separated sound source by a commonly used power-based Voice Activity Detection (VAD) function. It assumes there are intervals of silence before and after speech, and detects the beginning and end of speech relative to the maximum power within the separated stream. This simple VAD is sufficient under that assumption, because VAD in our system is mainly used to reject streams broken across a separation interval: it is applied to separated sound streams, each stream always contains some sound signal, and the noise level is stationary within a separated stream.
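A minimal sketch of such a power-based VAD; the 20 dB drop from the maximum frame power used as the threshold below is an assumed parameter, not taken from the paper.

```python
import numpy as np

def power_vad(stream, fs=16000, frame=512, drop_db=20.0):
    """Power-based VAD over one separated stream.

    Frames whose power is within `drop_db` of the stream's maximum frame
    power count as speech; the first and last such frames give the
    detected start and end, as sample indices.
    """
    n = len(stream) // frame
    p = np.array([np.mean(stream[i*frame:(i+1)*frame] ** 2) for i in range(n)])
    thresh = p.max() * 10 ** (-drop_db / 10)
    active = np.flatnonzero(p >= thresh)
    if active.size == 0:
        return None
    return active[0] * frame, (active[-1] + 1) * frame
```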

3. COMMAND RECOGNITION SYSTEM

This section gives an overview of our command recognition system for online implementation.

3.1. System Overview

For online implementation, determining when the system should stop separation and start recognition is an important problem. Our system decides the separation and recognition interval Tmax from the longest sentence in the command dictionary. By separating the past Tmax + Tp (Tp ≥ Tmax) at a cycle of Tmax, the system can detect an arbitrarily timed voice command in at least one separated sound stream.

Fig. 2 shows the calculation flow of the command recognition system. The localization module outputs azimuth and elevation pairs at each cycle of the FFT data length, so instantaneous sound source localization runs continuously. The sound source separation and recognition modules run at Tmax intervals. From the instantaneous localization results of the past Tmax + Tp, the system estimates the number of sound sources and their positions, then segregates each detected sound source. VAD is applied to each separated source, and the extracted sound sources are input to the recognizer if the data is not interrupted by a separation interval. A sketch of this schedule follows.
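A minimal sketch of the windowing schedule, using the Tmax and Tp values reported in section 4 (3.0 and 3.5 sec): every Tmax seconds the past Tmax + Tp seconds are processed, so consecutive windows overlap, and any utterance no longer than Tmax lies wholly inside at least one window.

```python
T_MAX, T_P = 3.0, 3.5   # values used in the paper's experiments

def windows(now, n=4):
    """Yield (start, end) of the n most recent separation windows."""
    for k in range(n):
        end = now - k * T_MAX           # window boundaries every T_MAX sec
        yield max(0.0, end - (T_MAX + T_P)), end

for s, e in windows(now=12.0):
    print(f"separate [{s:.1f}, {e:.1f}] s")
```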

Fig. 2 Calculation flow of the command recognition system: FBS based sound source localization continuously outputs 2D sound directions; at each Tmax interval the system detects sound positions, performs sound source separation and VAD, and runs command recognition, outputting the sound position with time (θ, φ, t) and the recognized command.

For command recognition, we use the Japanese speech recognition engine Julian [10] with a user-defined command dictionary. The command dictionary has 30 words and 4 sentence constructs, such as "go to somewhere", "come here", or greetings. The size of the command dictionary is kept small to prevent erroneous recognition of other sound sources (especially non-speech sources).
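For reference, below is a minimal, hypothetical example of the grammar/vocabulary file pair that Julian's grammar-driven mode accepts. The paper does not publish its dictionary, so the rule, category names, and place word here are illustrative; the phoneme string for "motte itte" follows the paper's own transcription in section 4.2.

```
# command.grammar -- one sentence construct: "(place) ni motte itte"
S   : NS_B CMD NS_E
CMD : PLACE NI TAKE

# command.voca -- per-category words with Japanese phoneme strings
% NS_B
<s>         silB
% NS_E
</s>        silE
% PLACE
daidokoro   d a i d o k o r o
% NI
ni          n i
% TAKE
motteitte   m o q t e i q t e
```

Running the mkdfa.pl tool shipped with Julius on such a pair would produce the .dfa and .dict files that the recognizer loads at startup.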

3.2. Microphone Array

The proposed command recognition system is tested using a 32-channel microphone array unit attached to the ceiling. Fig. 3 shows the microphone array and its microphone arrangement. The array has 32 omnidirectional electret condenser microphones and can sample all 32 channels simultaneously. The sampling frequency is 16 (kHz) and the resolution is 16 (bit). It localizes azimuth omnidirectionally (0 to 359 (deg)) and elevation from directly below (0 (deg)) to the horizontal direction (90 (deg)).

For implementation, our system initially sets 160 grids on the surface of the hemisphere, and each grid is divided into four smaller triangles for a fine search around detected source positions. Using the 32-channel array, the system works well for multiple sound sources with power differences in varied environments, without using environmental information.

Fig. 3 32 channel microphone array on ceiling: a) array on ceiling, b) microphone arrangement (540 mm unit; X and Y axes in mm, from -300 to 300).

Six microphone array units are attached to the ceiling of the experimental house "Holone". Each unit works independently, and recognition results with position information are sent to a mobile robot. Fig. 4 shows pictures of the microphone array units and their arrangement in each room.

Fig. 4 Microphone array units in experimental house "Holone": a) arrangement of units 1-6 on the floor plan (X and Y in m), b) 1: entrance, c) 2: study, d) 3: living, e) 4, 5: kitchen, f) 6: bedroom.

4. EXPERIMENT

This section shows the experimental results of the proposed command recognition system installed on the 32-channel ceiling microphone array. For implementation, the calculation interval Tmax is set to 3.0 (sec) and the data length parameter Tp to 3.5 (sec).

4.1. Evaluation of Sound Source Localization

The accuracy of two-dimensional sound source localization is evaluated. The experimental room has a reverberation time (T60) of 450 (msec), and the background noise level (LA) is 32 (dB). For calculation, the frame length is 1024 points (64 msec) for one instance of localization. To evaluate 2D directional localization, the angular error is defined as the intervector angle between the direction of the estimated sound source and that of the real sound source; a sketch of this computation follows. As shown in Fig. 3, the microphones are arranged rotationally symmetrically in one plane, so localization performance in the azimuth direction is uniform.
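A minimal sketch of the intervector angle, using the array's elevation convention (0 deg directly below, 90 deg horizontal):

```python
import numpy as np

def angular_error(az1, el1, az2, el2):
    """Intervector angle (deg) between two (azimuth, elevation) directions."""
    def unit(az, el):
        az, el = np.radians(az), np.radians(el)
        # el = 0: straight down (-z); el = 90: horizontal
        return np.array([np.sin(el) * np.cos(az),
                         np.sin(el) * np.sin(az),
                         -np.cos(el)])
    cosang = np.clip(unit(az1, el1) @ unit(az2, el2), -1.0, 1.0)
    return np.degrees(np.arccos(cosang))

print(angular_error(90, 60, 95, 55))   # ~6.5 deg for a 5 deg error in each axis
```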

Experiments are performed under two conditions. The first condition (C1) is one sound source at different elevation angles. A loudspeaker playing male speech is used as the sound source. The distance (r) between the microphone array and the speaker is 2.0 (m), and the sound source's SNR is about 15 (dB) above the background noise.

The second condition (C2) is two sound sources at different distances. One loudspeaker is set at (r, θ, φ) = (2 (m), 180 (deg), 60 (deg)) from the array center and plays female speech (not including command sentences) or classical music as the noise source. The other loudspeaker is set at 90 (deg) azimuth, and its horizontal distance is varied over 1, 3, 5, 7, and 10 (m); its height from the microphone array stays constant at 1.15 (m). The volumes of the two sources are the same, and the SNR is about 15 (dB) above the background noise.

Fig. 5 shows the results for condition C1, evaluating the performance shift across elevation angles. The result is the average angular error over 200 (sec) of data for each direction. The angular error is 9 to 18 (deg) and does not depend on elevation angle. For elevation angles above 45 (deg), the azimuth error is smaller than the elevation error, and the angular error is close to the elevation error. This indicates that localization is weaker in elevation than in azimuth, and that the angular error is mainly caused by elevation error for elevations above 45 (deg).

Fig. 5 Localization angular error for 1 sound source (angle, azimuth, and elevation errors in deg versus elevation from 0 to 90 deg).

Next, the performance shift with distance between the microphone array and the sound source is evaluated. Fig. 6 shows the results of the experiment under condition C2. The X axis is the horizontal distance from the microphone array center to the sound source. The angular error is 6 to 14 (deg) and does not depend on distance. The fluctuation of the angular error is considered to be caused by the resolution of the implemented spherical grid. As in Fig. 5, the elevation error is close in value to the angular error.

Fig. 6 Localization angle error for 2 sound sources (angle, azimuth, and elevation errors in deg versus distance from 0 to 10 m).

The results of the two conditions indicate that localization accuracy is not affected by variations in elevation angle or sound pressure level. Excluding locations directly below the microphone array and in horizontal directions, the system can localize multiple sound source positions in azimuth and elevation.

4.2. Evaluation Indices of the Recognition System

Four indices are defined to evaluate the recognition system.

• Word Correct Rate: the proportion of correctly recognized words to total recognized words

• Task Achievement Rate: the proportion of commands correctly recognized at the target position and timing to total command utterances

• Error Recognition Rate: the proportion of erroneously recognized commands to the total number of separated sound sources

• Target Separation Rate: the proportion of sound sources separated at the correct position and time to total command utterances

Word Correct Rate shows the quality of the separated source for recognition; errors are mainly caused by phenotypic variation in spoken Japanese. For example, both "(place) (ni) (mo q te i q te)" and "(place) (e) (mo q te i ke)" mean "bring it to (place)", but the second and third words are registered independently in the word dictionary, so the index is 1/3 in this instance. Task Achievement Rate excludes such differences: it counts a recognized sentence when the meaning is correct, so this index shows the efficiency of an application. Error Recognition Rate contains two different errors: erroneous recognition of a known voice command, and false-positive recognition of a non-command sound source. Both errors affect applications using the recognition system, so this index also shows application efficiency. Target Separation Rate counts separated sound sources after VAD whose position and timing correspond to real events; this index shows the performance of the proposed localization and separation method.

As shown in Fig. 2, separated sound sources can overlap, and the system sometimes recognizes one voice command in two intervals. Such overlapped recognitions are excluded when calculating Task Achievement Rate and Target Separation Rate. A sketch computing the indices from raw counts follows.
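A minimal sketch computing the four indices from raw counts; the function, its parameter names, and the example counts are illustrative, and overlapped recognitions of one utterance are assumed to be excluded from the counts already.

```python
def evaluation_indices(words_correct, words_total,
                       tasks_correct, utterances,
                       errors, separated_total,
                       separated_on_target):
    """The four indices of section 4.2, as percentages."""
    return {
        "word_correct":      100.0 * words_correct / words_total,
        "task_achievement":  100.0 * tasks_correct / utterances,
        "error_recognition": 100.0 * errors / separated_total,
        "target_separation": 100.0 * separated_on_target / utterances,
    }

# Hypothetical counts: 23 of 24 commands understood and separated on
# target, 44 of 48 words correct, 2 false commands in 30 separated streams.
print(evaluation_indices(44, 48, 23, 24, 2, 30, 23))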

In addition, the calculation cost is evaluated as the elapsed time of three parts: (A) the time from the start of voice command phonation to the start of the separation module; (B) the time from the start of the separation module to the output of separated sound sources after VAD; and (C) the elapsed time of command recognition. A+B+C gives the total time from the start of voice command phonation to the output of the recognition result.

4.3. Basic Evaluation of Recognition System

The performance of the recognition system is evaluated. The experimental conditions are the same as in section 4.1.

The results for condition C1 are shown in Fig. 7. At 0 and 90 (deg) elevation, the target separation rate (pink line with square marks) is less than 90 (%); otherwise the performance shows no large difference across elevation angles. The task recognition rate (green line with X marks) is similar to the target separation rate, and the word correct rate is higher than 87 (%). The error recognition rate is near 0 (%). The results suggest that localization performance directly below the microphone array (0 (deg)) and in the horizontal direction (90 (deg)) is slightly worse than elsewhere; otherwise the system shows constant performance over elevation angle.

Fig. 7 Results of elevation changes (1 sound source): a) correct values (word, task, and separated rates, 70-100 %), b) error values (error recognition rate, 0-30 %), versus elevation from 0 to 90 deg.

The results of the distance evaluation are shown in Fig. 8. The experimental setup is condition C2, explained in section 4.1. For distances below 7 (m), the target separation rate is 100 (%). At 7 (m), the sound pressure level of the received signal at the microphone array is -5 (dB) relative to the noise source set 2 (m) from the array. This result shows the effectiveness of the proposed sound localization and separation system for multiple sound sources with different sound pressure levels. On the other hand, the task recognition rate drops with distance beyond 7 (m). The error recognition rate is less than 10 (%) and does not change with distance, but is higher overall than in condition C1. Errors are mainly caused by the noise source, which contains sounds not registered in the command dictionary. Degradation of sound separation for distant sound sources does not affect the error recognition rate.

Fig. 8 Result of distance change (2 sound sources): a) correct values (word, task, and separated rates, 70-100 %), b) error values (error recognition rate, 0-30 %), versus distance from 0 to 10 m.

4.4. System Evaluation for Three Sound Sources

The performance for three sound sources is evaluated. Three experiments are performed as follows:

• EXP-A: command at (2.0 (m), 90 (deg), 60 (deg)), classical music at (2.5 (m), 45 (deg), 72 (deg)), and female speech at (1.4 (m), 180 (deg), 45 (deg))

• EXP-B: command at (2.0 (m), 90 (deg), 60 (deg)), classical music at (2.5 (m), 45 (deg), 72 (deg)), and command at (2.0 (m), 180 (deg), 60 (deg))

• EXP-C: command at (2.0 (m), 60 (deg), 60 (deg)), female speech at (1.4 (m), 135 (deg), 45 (deg)), and command at (3.2 (m), 180 (deg), 72 (deg))

Positions are described as (distance, azimuth, elevation), and the volume of each sound source is at the same level.

Table 1 shows the evaluation indices for the three-sound-source condition. EXP-B and EXP-C have two command utterances each, and the indices are calculated as combined values. The indices of EXP-B are a little smaller than those of EXP-A, but the difference is small. The result of EXP-C is worse than the others; the degradation is mainly caused by the low recognition rate for the distant command utterance at (3.2 (m), 180 (deg), 72 (deg)). The system failed to detect this sound source, affected by the presence of female speech close to the microphone array. The high error recognition rate is mainly caused by erroneous recognition of the female speech. These results suggest that the recognition system performs well for detected commands and for multiple simultaneous command inputs, but is susceptible to unknown signal input.

4.5. Processing Time of the Recognition System

Table 1 Evaluation result of 3 sound sources

         Word (%)   Task (%)        Error (%)   Separated (%)
EXP-A    91.7       95.8 (23/24)    12.0        95.8 (23/24)
EXP-B    89.1       91.7 (44/48)     7.3        93.8 (45/48)
EXP-C    67.3       79.2 (38/48)    19.1        85.4 (41/48)

Table 2 shows the average elapsed time of each experiment. The system has a Pentium 4 3.0 GHz processor running Debian Linux (kernel 2.4.27). The result for one sound source is the average over all 1400 (sec) of condition C1 data, which contains arbitrarily timed utterances. The results for two and three sound sources are the averages over 800 (sec) of data for each condition.

Table 2 Elapsed time from start of voice command phonation

number of sound sources                      1      2      3
(A) time before start of separation (sec)    2.01   2.05   2.02
(B) elapsed time for separation (sec)        0.86   1.24   1.37
(C) elapsed time for recognition (sec)       0.61   0.54   0.61
(A+B+C) total processing time (sec)          3.48   3.83   4.00

Processing time changes with environmental conditions because the calculation cost of sound separation and recognition depends on the number of sound sources. The time before the start of separation, averaged over all conditions, is 2.024 (sec), and the average duration of a command utterance is 2.34 (sec) (1.08 (sec) minimum, 2.89 (sec) maximum). This indicates that the calculation interval Tmax is valid for the tested application.

4.6. Voice Command Recognition in Housing Environment

This section describes experimental results using the ceiling microphone array in the housing environment "Holone". System performance is evaluated in the bedroom (unit 6 in Fig. 4 a)). The room has a reverberation time (T60) of about 600 (msec), and the background noise level is about 40 (dB). Eleven experiments are performed; the experimental conditions are shown in Table 3. The data length of each experiment is 70 (sec). The positions of the sound sources are described as (azimuth, elevation) angles. The height of the sound sources from the microphone array is constant at 1.5 (m), so the distance between the microphone array and a sound source differs between conditions. Command sets A and B are arbitrarily timed command utterances selected randomly from the command dictionary, and the female speech contains no command sentences from the dictionary. All experimental conditions have two sound sources, and the SNR between the sources is about 0 (dB). In EXP-1 to 4, the two sound sources are well separated in direction. In EXP-5 to 8, the azimuth interval between the two sources is 30 (deg) and the elevation of the command utterance is 60 (deg). In EXP-9 to 11, the azimuth interval is 30 (deg) and the elevation of the command utterance is 30 (deg).

Table 3 Experimental setup for evaluation in "Holone"

          command position   command source   noise position   noise source
EXP-1     (315, 30)          set A            (90, 45)         female speech
EXP-2     (315, 30)          set A            (90, 60)         classical music
EXP-3     (315, 60)          set B            (90, 45)         classical music
EXP-4     (315, 60)          set B            (90, 60)         female speech
EXP-5     (315, 60)          set A            (285, 45)        classical music
EXP-6     (315, 60)          set A            (285, 45)        female speech
EXP-7     (315, 60)          set B            (285, 60)        classical music
EXP-8     (315, 60)          set B            (285, 60)        female speech
EXP-9     (315, 30)          set A            (285, 45)        female speech
EXP-10    (315, 30)          set B            (285, 60)        classical music
EXP-11    (315, 30)          set B            (285, 60)        female speech

Table 4 Evaluation result of recognition system in "Holone"

          Word (%)   Task (%)       Error (%)   Separated (%)
EXP-1     61.1       100.0 (8/8)    21.9        100.0 (8/8)
EXP-2     96.7        87.5 (7/8)     0.0         87.5 (7/8)
EXP-3     91.7        75.0 (6/8)     0.0         87.5 (7/8)
EXP-4     64.5        75.0 (6/8)    26.1        100.0 (8/8)
EXP-5     88.9        87.5 (7/8)     0.0        100.0 (8/8)
EXP-6     90.5       100.0 (8/8)    10.0        100.0 (8/8)
EXP-7     94.1        87.5 (7/8)     0.0        100.0 (8/8)
EXP-8     94.1        87.5 (7/8)     0.0        100.0 (8/8)
EXP-9     76.2        62.5 (5/8)     3.4         75.0 (6/8)
EXP-10    85.1        87.5 (7/8)     6.1        100.0 (8/8)
EXP-11    71.8       100.0 (8/8)    16.0        100.0 (8/8)

The evaluation results are shown in Table 4. Excluding EXP-9, the target separation rate is near 100 (%) across the experiments. EXP-9 has the minimum distance between the two sound sources, and the system sometimes fails to detect the two sound sources independently. The worst task achievement rate is 62.5 (%), in EXP-9. The error recognition rate becomes high when the noise source contains speech, compared to when it is classical music.

Directional localization errors, measured as the intervector angle between estimated and real sound positions, are shown in Fig. 9 for EXP-4, 9, and 11. The angular error of EXP-4 and EXP-11 is constant during the experiment, and EXP-4, which has a larger interval between the two sound sources, performs better than EXP-11. The angular error of EXP-9 varies over time; this is attributed to the other sound source being near the target sound source.

Fig. 9 Variation of directional localization error (angular error in deg versus start time from 0 to 70 sec) for EXP-4, EXP-9, and EXP-11.

4.7. Application for Mobile Robot

The recognition system is applied to a mobile robot to confirm the system's capability. The mobile robot can move autonomously in a known environment [11]. Input from a laser range finder mounted on the robot is used for localization, with a particle-filter-based method estimating the position and orientation of the robot in a map. An optimized A* algorithm [12] is used to plan the robot's trajectory to the goal position. The mobile robot is given the recognized verbal command, with sound position information, from the ceiling microphone array system.

Fig. 10 shows video clips of the experiment at "Holone". A user watching TV in the living room tells the robot (currently in the bedroom) to go to the kitchen (a). The living-room microphone array detects the user's utterance and the sound from the TV, and separates them from each other. When the robot receives the command, it plans a path from the bedroom to the kitchen and starts moving (a, b). In the kitchen, a user places a coffee cup on the robot and orders it to go to the study (c); the robot then starts moving to the study (d). In the study, a different user takes the coffee cup from the robot and releases it by saying "Thank you." (e). The robot then goes back to the bedroom (f, g). Throughout the experiment, the proposed recognition system recognized the command utterances even with the TV on as a noise source.

Fig. 10 Snapshots of the experiment at "Holone" with GUI: a) the user calls the robot while watching TV ("Kan-ichi, please go to the kitchen."); b) the robot moves from the bedroom to the kitchen; c) a user puts a coffee cup on the robot and says "Bring it to the study."; d) the robot goes to the study; e) another user takes the coffee and says "Thank you."; f), g) the robot goes back to the bedroom.


5. CONCLUSIONS AND FUTURE WORK

This paper reported an online voice command recognition system using a microphone array. The DSBF and FBS methods are used for multiple sound source localization and separation, implemented on a 32-channel microphone array attached to the ceiling. The separated sound sources are recognized using the open-source recognizer "Julian" with a limited command dictionary, and the recognized commands, with sound position and time information, are sent to a mobile robot.

The system localizes sound source positions in azimuth and elevation and recognizes the separated sound sources simultaneously. It can recognize arbitrarily timed voice commands in noisy environments using a 32-channel microphone array unit, and outputs the recognized command together with sound position information. The system is tested on a compact microphone array unit and can also be applied to robot-embedded microphone arrays. Using the DSBF and FBS methods for multiple sound source localization and separation, the microphone array system localizes sound source positions in azimuth and elevation with 10 (deg) angular error on average. The proposed system works without environmental information such as impulse responses, and evaluations under two different reverberation conditions show similar performance.

We defined four indices to evaluate the performance of the proposed auditory system. The experimental results show robustness to multiple sound sources and distant sound sources. In the "Holone" experiments with two sound sources, the target separation rate is more than 95 (%) on average and the task achievement rate is more than 86 (%) on average. As shown in section 4.3, the task achievement rate is more than 70 (%) within a 10 (m) radius, even with a noise source at near range.

In this paper, the word dictionary was limited to simple commands for the mobile robot to prevent erroneous recognition. The recognizer can work with a moderate-size word dictionary on separated sound sources when the input sound sources are only human voices covered by the dictionary. On the other hand, the error recognition rate becomes high when unknown sound sources exist. Future research is needed to detect human voice and reduce erroneous recognition of non-modeled sound sources.

References

[1] Y. Mori, H. Saruwatari, T. Takatani, S. Ukai, K. Shikano, T. Hiekata and T. Morita, Real-Time Implementation of Two-Stage Blind Source Separation Combining SIMO-ICA and Binary Masking, in Proc. of 2005 International Workshop on Acoustic Echo and Noise Control (IWAENC2005), pp. 229-232, September 2005.

[2] E. Weinstein, K. Steele, A. Agarwal and J. Glass, LOUD: A 1020-node modular microphone array and beamformer for intelligent computing spaces, MIT/LCS Technical Memo MIT-LCS-TM-642, April 2004.

[3] K. Nakadai, H. Nakajima, M. Murase, S. Kaijiri, K. Yamada, T. Nakamura, Y. Hasegawa, H. G. Okuno and H. Tsujino, Robust tracking of multiple sound sources by spatial integration of room and robot microphone arrays, in Proc. of International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2006), pp. IV-29-32, Toulouse, France, May 2006.

[4] C. T. Ishi, S. Matsuda, T. Kanda, T. Jitsuhiro, H. Ishiguro, S. Nakamura and N. Hagita, Robust speech recognition system for communication robots in real environments, in Proc. of IEEE-RAS International Conference on Humanoid Robots (HUMANOIDS2006), pp. 340-345, Genova, Italy, December 2006.

[5] S. Yamamoto, K. Nakadai, H. Tsujino, T. Yokoyama and H. G. Okuno, Improvement of robot audition by interfacing sound source separation and automatic speech recognition with missing feature theory, in Proc. of IEEE International Conference on Robotics and Automation (ICRA2004), pp. 1517-1523, New Orleans, May 2004.

[6] S. Yamamoto, K. Nakadai, M. Nakano, H. Tsujino, J.-M. Valin, K. Komatani, T. Ogata and H. G. Okuno, Real-Time Robot Audition System That Recognizes Simultaneous Speech in The Real World, in Proc. of IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2006), pp. 5333-5338, Beijing, China, October 2006.

[7] K. Nakadai, S. Yamamoto, H. G. Okuno, H. Nakajima, Y. Hasegawa and H. Tsujino, Development of a Robot Referee for Rock-Paper-Scissors Sound Games, JSAI Technical Report SIG-Challenge-A702-10, pp. 59-64, 2007 (in Japanese).

[8] M. Aoki, M. Okamoto, S. Aoki, H. Matsui, T. Sakurai and Y. Kaneda, Sound source segregation based on estimating incident angle of each frequency component of input signals acquired by multiple microphones, Acoustical Science and Technology, pp. 149-157, 2001.

[9] F. X. Giraldo, Lagrange-Galerkin methods on spherical geodesic grids, Journal of Computational Physics, pp. 197-213, 1997.

[10] A. Lee, T. Kawahara and K. Shikano, Julius: an open source real-time large vocabulary recognition engine, in Proc. of European Conference on Speech Communication and Technology, pp. 1691-1694, 2001.

[11] S. Thompson and S. Kagami, Continuous curvature trajectory generation with obstacle avoidance for car-like robots, in Proc. of International Conference on Computational Intelligence for Modelling, Control and Automation (CIMCA2005), Vienna, 2005.

[12] J. J. Kuffner, Efficient optimal search of Euclidean-cost grids and lattices, in Proc. of IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2004), 2004.