
Human-Voice Enhancement based on Online RPCA for a Hose-shaped Rescue Robot with a Microphone Array

Yoshiaki Bando1, Katsutoshi Itoyama1, Masashi Konyo2, Satoshi Tadokoro2, Kazuhiro Nakadai3, Kazuyoshi Yoshii1, and Hiroshi G. Okuno4

Abstract— This paper presents an online real-time method that enhances human voices included in severely noisy audio signals captured by microphones of a hose-shaped rescue robot. To help a remote operator of such a robot pick up a weak voice of a human buried under rubble, it is crucial to suppress the loud ego-noise caused by the movements of the robot in real time. We tackle this task by using online robust principal component analysis (ORPCA) for decomposing the spectrogram of an observed noisy signal into the sum of low-rank and sparse spectrograms that are expected to correspond to periodic ego-noise and human voices. Using a microphone array distributed on the long body of a hose-shaped robot, ego-noise suppression can be further improved by combining the results of ORPCA applied to the observed signal captured by each microphone. Experiments using a 3-m hose-shaped rescue robot with eight microphones show that the proposed method improves the performance of conventional ego-noise suppression using only one microphone by 7.4 dB in SDR and 17.2 dB in SIR.

I. INTRODUCTION

Hose-shaped rescue robots have been developed for gathering information in narrow spaces under collapsed buildings where humans or animals cannot go [1]–[3]. They have thin, long, and flexible bodies, and self-locomotion mechanisms. The Active Hose-II robot [1], for example, has small powered wheels enabling it to move forward, and the Active Scope Camera robot [2], [3] can move forward by vibrating cilia covering its body. In 2008 the Active Scope Camera robot was used in an actual search-and-rescue mission in Jacksonville, Florida, USA [4].

To avoid failing to hear the voice of a human at an unseen and distant place, real-time ego-noise suppression is crucial for a hose-shaped rescue robot. It should be noted that rescue activity is a race against time. Although hose-shaped rescue robots are expected to keep moving to search a wide area in a limited time, conventional robots need to stop their actuators and keep silent to hear the voice of a human. This strategy, however, is inefficient and often misses a voice uttered while the robot is moving. Since the ego-noise of a robot changes dynamically according to the movements of wheels or vibrators and friction between the robot body and

1 Graduate School of Informatics, Kyoto University, Kyoto, 606-8501, Japan {yoshiaki, itoyama, yoshii}@kuis.kyoto-u.ac.jp

2 Graduate School of Information Science, Tohoku University, Miyagi, 980-8579, Japan {konyo, tadokoro}@rm.is.tohoku.ac.jp

3 Graduate School of Information Science and Engineering, Tokyo Institute of Technology, Saitama, 351-0114, Japan / Honda Research Institute Japan Co., Ltd. [email protected]

4 Graduate Program for Embodiment Informatics, Waseda University, Tokyo 169-0072, Japan [email protected]

Fig. 1. Prototype hose-shaped rescue robot with eight-channel microphone array and driving mechanism.


Fig. 2. Overview of proposed online ego-noise suppression system.

surrounding materials, it is difficult to use conventional ego-noise suppression methods [5]–[8] that assume the noise to be stable or learned in advance.

In a real rescue scenario, we need to deal with a severely low signal-to-noise ratio (SNR) condition because the ego-noise generated from the body of the robot is much closer to the microphones than human voices are. One possibility is to use source separation methods based on microphone arrays [9], [10] for emphasizing human voices. These methods, however, cannot be used because they need precise information about the microphone positions, whereas the shape of a hose-shaped robot changes flexibly in narrow gaps and it is difficult to know the positions of the microphones on the robot [11], [12]. Although several methods independent of the microphone positions, called blind source separation [13], [14], have been proposed, they require too much computation to run in real time.

A promising approach to unsupervised and efficient ego-noise suppression for a hose-shaped rescue robot is to use robust principal component analysis (RPCA) [15], which was originally proposed for decomposing an input matrix into the sum of sparse and low-rank matrices. Applying RPCA to an audio spectrogram decomposes it into frequency components that appear repeatedly (e.g., the ego-noise of a

Demo: http://winnie.kuis.kyoto-u.ac.jp/members/yoshiaki/demo/ssrr2015

TABLE I
STRONG AND WEAK POINTS OF PROPOSED METHOD AND RELATED WORK ON EGO-NOISE SUPPRESSION

Method              | Non-stable noise | No prior learning | Independent of microphone positions
Ince et al. [5]     |                  |                   | ✓
Tezuka et al. [6]   | ✓                |                   | ✓
Cauchi et al. [7]   | ✓                |                   | ✓
Nakajima et al. [8] |                  | ✓                 | ✓
Okutani et al. [18] | ✓                | ✓                 |
Proposed            | ✓                | ✓                 | ✓

hose-shaped rescue robot) and other components that occur infrequently (e.g., the voice of a human) without prior learning [16]. RPCA can also suppress periodic ego-noise caused by wheels and vibrators because the noise spectrogram has a low-rank structure. Furthermore, periodic environmental noise (e.g., noise from rescue helicopters and vehicles) is expected to be suppressed.

This paper presents a novel statistical method that suppresses the ego-noise of a hose-shaped rescue robot by applying an online version of RPCA (ORPCA) [17] to a microphone array distributed along the robot as shown in Fig. 1. Our method assumes that a target voice is recorded similarly by all microphones while the ego-noise is recorded differently by each microphone. This enables us to suppress the ego-noise in two steps: 1) suppress the ego-noise at each microphone by using ORPCA, and 2) extract the components common among the microphones by combining the results of ORPCA (Fig. 2). The effectiveness of the proposed method was evaluated using a prototype hose-shaped rescue robot with an eight-channel microphone array.

II. RELATED WORK

This section reviews related work on ego-noise suppression. We first introduce ego-noise suppression methods for a single microphone and then review methods for a microphone array. The strong and weak points of each method are summarized in Table I.

A. Noise Reduction for Single Microphone

Ince et al. [5] developed an ego-noise suppression method for a single microphone using the sensor data of actuators. This method uses a noise database that stores pairs of motor-sensor values and the corresponding ego-noise. The ego-noise is estimated by retrieving a noise template from the database with the sensor values. Although this method works well when the sensor data and the ego-noise are correlated, its performance is degraded by unknown noise sounds and unstable noise.

Tezuka et al. [6] and Cauchi et al. [7] estimated the ego-noise using extensions of non-negative matrix factorization (NMF). NMF decomposes the input matrix into basis and activation matrices [19]. The basis matrix is a set of template frequency components, and the activation matrix holds the coefficients of the bases at each time frame. Since the NMF-based approach represents the noise spectrum with multiple bases, these methods can handle non-stable ego-noise. This approach, however, needs to learn the bases of the ego-noise in advance because it is difficult to distinguish the bases of the ego-noise from those of a target voice.

Fig. 3. Configuration of microphones on a hose-shaped rescue robot.

Nakajima et al. [8] developed a method for estimating stable noise without prior learning. This method, called histogram-based recursive level estimation (HRLE), estimates the noise using the cumulative histogram of the input power spectrum. It therefore estimates stable noise robustly even in the presence of sudden, high-power signals. HRLE is implemented in an open-source robot audition software package called HARK1 [9]. Since HRLE estimates the noise using only the histogram and ignores the error between the estimated noise and the current input, it fails to estimate non-stable noise.

B. Noise Reduction for Microphone Array

Okutani et al. [18] developed an ego-noise-robust sound source localization method that uses a microphone array. It is called multiple signal classification based on incremental generalized eigenvalue decomposition (iGEVD-MUSIC). GEVD-MUSIC is an online sound source localization method that suppresses ego-noise given as a noise correlation matrix between the microphones [20]. iGEVD-MUSIC sequentially predicts the current noise correlation matrix from the previous input signal. Since the correlation matrix represents the relationship between the microphones and the noise source positions, this method is robust against non-stationarity of the noise. Okutani et al. robustly estimated the position of the target sound source with a microphone array equipped on a multicopter even while it was flying. This method, however, assumes that the positions of the microphones are known in advance. There are also several methods that use a microphone array equipped on a rescue robot, but they too need the microphone positions [21]–[23].

III. ONLINE EGO-NOISE SUPPRESSION

This section describes the proposed online ego-noise suppression method, which extracts the components common among the microphones by combining the ORPCA results of the microphones at each frequency bin.

A. Problem Statement

We assume a hose-shaped rescue robot that has micro-phones distributed along its body as depicted in Fig. 3. Themicrophones are given an index ranging from 1 at the handposition to M at the tip of the robot. We define t as the timeframe index, F as the number of frequency bins, and f asthe frequency bin index. The other notations are summarizedin Table II.

The ego-noise suppression problem for a hose-shaped robot is defined as follows:

1Honda Research Institute Japan Audition for Robots with Kyoto Univ.

Fig. 4. Spectrograms of the low-rank and sparse components extracted by ORPCA. The input is the mixture of a human voice and a simple noise (pure tones).

TABLE II
NOTATIONS OF MATHEMATICAL SYMBOLS

Symbol          | Meaning
M               | Number of microphones
F               | Number of frequency bins
m               | Microphone index (m = 1, · · · , M)
f               | Frequency bin index
t               | Time frame index
K               | Number of bases for the low-rank component
y_mt ∈ R^F      | Amplitude spectrum of the input signal
s_t ∈ R^F       | Amplitude spectrum of the output signal
x_mt ∈ R^F      | Low-rank component (= L_m r_mt)
e_mt ∈ R^F      | Sparse component
r_mt ∈ R^K      | Coefficients corresponding to the bases of the low-rank component
L_m ∈ R^{F×K}   | Bases of the low-rank component
w_mt ∈ R^F      | Normalization coefficient

Input: M-channel synchronized amplitude spectra y_1t, · · · , y_Mt ∈ R^F
Output: Denoised amplitude spectrum s_t ∈ R^F

The input amplitude spectrum is defined as the absolute values of the short-time Fourier transform (STFT) of the recorded signals. The time-domain signal of the output is obtained by applying the inverse STFT to the product of the output amplitude spectrum s_t and the phase of the recorded signal.
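This amplitude/phase split and resynthesis can be sketched with SciPy's STFT routines; the frame length and sampling rate below are illustrative choices, not the authors' settings.

```python
import numpy as np
from scipy.signal import stft, istft

def split_spectrum(x, fs=16000, nperseg=512):
    """STFT of a recorded signal, split into amplitude and phase."""
    _, _, Z = stft(x, fs=fs, nperseg=nperseg)
    return np.abs(Z), np.angle(Z)

def resynthesize(amplitude, phase, fs=16000, nperseg=512):
    """Inverse STFT of a (possibly denoised) amplitude with the observed phase."""
    _, x = istft(amplitude * np.exp(1j * phase), fs=fs, nperseg=nperseg)
    return x

fs = 16000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 440 * t)   # 1-second, 440-Hz test tone
amp, ph = split_spectrum(x, fs)
y = resynthesize(amp, ph, fs)     # unmodified amplitude -> near-perfect reconstruction
```

In the proposed method, only the amplitude `amp` would be modified by the noise suppression; the observed phase `ph` is reused at resynthesis, as the paragraph above describes.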

B. Online RPCA

The proposed method uses an online extension of batch RPCA (ORPCA) [17]. The input signal of each channel y_mt is decomposed into a low-rank component x_mt and a sparse component e_mt by ORPCA (Fig. 4):

y_mt = x_mt + e_mt.   (1)

Periodically changing ego-noise is assigned to the low-rank component, while the voice and other sparse noise are assigned to the sparse component [16].

Since ORPCA is independent of the microphone index m, we omit m in the rest of this section. Let the F×t matrices of input spectra, low-rank components, and sparse components be Y = [y_1, · · · , y_t], X = [x_1, · · · , x_t], and E = [e_1, · · · , e_t]. The original batch RPCA decomposes the input matrix into low-rank and sparse components by solving the following problem [15]:

min_{X,E} { (1/2)∥Y − X − E∥_F^2 + λ1∥X∥_* + λ2∥E∥_1 }   (2)

where ∥·∥_F, ∥·∥_*, and ∥·∥_1 represent the Frobenius, nuclear, and L1 norms, respectively, and λ1 and λ2 are scale parameters. Since this optimization is solved by the augmented Lagrangian multiplier method, which performs an SVD that accesses all samples of the input, it was difficult to solve the original RPCA problem in an online manner [15].

Fig. 5. Noise suppression using channel-wise ORPCA for multi-channel signals.

To solve the RPCA problem in an online manner, ORPCA solves the following alternative problem [17]:

min_{L,R,E} { (1/2)∥Y − LR^T − E∥_F^2 + (λ1/2)(∥L∥_F^2 + ∥R∥_F^2) + λ2∥E∥_1 }   (3)

where L ∈ R^{F×K} and R ∈ R^{t×K} (K < F, t) represent the bases of the low-rank components and the coefficients of the bases (X = LR^T). As proved in [17], the local optima of this problem are global optimal solutions of the original RPCA problem because ∥X∥_* is upper bounded by L and R [24]:

∥X∥_* = ∥LR^T∥_* = inf_{L,R} { (1/2)∥L∥_F^2 + (1/2)∥R∥_F^2 }   (4)

ORPCA solves the alternative RPCA problem (Eq. (3)) by minimizing the following cost function obtained by transforming Eq. (3):

f_t(L) = (1/t) Σ_{t'=1}^{t} l(y_{t'}, L) + (λ1/2t)∥L∥_F^2   (5)

l(y_{t'}, L) = min_{r_{t'},e_{t'}} { (1/2)∥y_{t'} − L r_{t'} − e_{t'}∥_2^2 + (λ1/2)∥r_{t'}∥_2^2 + λ2∥e_{t'}∥_1 }   (6)

where l(y_t, L) represents the loss function for each sample. The coefficients of the bases r_t and the sparse component e_t are updated at each time frame by solving Eq. (6) with the basis of the low-rank subspace L fixed. The basis L, on the other hand, is updated at each time frame by minimizing the following objective function g_t(L) with r_t and e_t fixed:

g_t(L) = (1/t) Σ_{t'=1}^{t} l'(y_{t'}, L) + (λ1/2t)∥L∥_F^2   (7)

l'(y_{t'}, L) = (1/2)∥y_{t'} − L r_{t'} − e_{t'}∥_2^2 + (λ1/2)∥r_{t'}∥_2^2 + λ2∥e_{t'}∥_1   (8)

This function provides an upper bound for the cost function f_t(L). ORPCA solves Eqs. (6) and (7) using an off-the-shelf solver and block-coordinate descent with warm restarts, respectively [17].
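A minimal NumPy sketch of the per-frame update, in the spirit of [17], is shown below. The alternating solve of Eq. (6) (ridge regression for r, soft-thresholding for e), the sufficient statistics, and the block-coordinate basis update are standard; the initialization, hyperparameter values, and fixed iteration count are our illustrative choices, not the authors' settings.

```python
import numpy as np

def soft_threshold(x, thr):
    """Elementwise soft-thresholding, the proximal operator of the L1 norm."""
    return np.sign(x) * np.maximum(np.abs(x) - thr, 0.0)

class ORPCA:
    """Per-frame ORPCA sketch for Eqs. (5)-(8); hyperparameters are illustrative."""

    def __init__(self, n_freq, n_basis, lam1=1.0, lam2=0.1):
        rng = np.random.default_rng(0)
        self.L = rng.standard_normal((n_freq, n_basis)) * 0.01  # low-rank bases
        self.A = np.zeros((n_basis, n_basis))  # accumulates r r^T
        self.B = np.zeros((n_freq, n_basis))   # accumulates (y - e) r^T
        self.lam1, self.lam2 = lam1, lam2

    def step(self, y, n_iter=20):
        k = self.L.shape[1]
        r, e = np.zeros(k), np.zeros_like(y)
        G = np.linalg.inv(self.L.T @ self.L + self.lam1 * np.eye(k))
        for _ in range(n_iter):                 # alternating solve of Eq. (6)
            r = G @ self.L.T @ (y - e)          # ridge solve for the coefficients
            e = soft_threshold(y - self.L @ r, self.lam2)  # L1 prox for the sparse part
        x = self.L @ r                          # low-rank component for this frame
        self.A += np.outer(r, r)                # sufficient statistics for Eq. (7)
        self.B += np.outer(y - e, r)
        Atil = self.A + self.lam1 * np.eye(k)
        for j in range(k):                      # block-coordinate basis update
            self.L[:, j] += (self.B[:, j] - self.L @ Atil[:, j]) / Atil[j, j]
        return x, e
```

By construction of the final soft-thresholding step, the residual y − x − e is bounded elementwise by λ2, so each input frame is (approximately) explained as the sum of a low-rank and a sparse part, as in Eq. (1).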

C. Online Normalization of Input Signal

The ego-noise of our hose-shaped rescue robot has peaks at low frequency bins. Since ORPCA estimates the low-rank components with the same weight for all frequency bins (Eq. (7)), it over-fits to the peaks. We therefore apply a normalization coefficient w_mt = [w_mt1, · · · , w_mtF]^T ∈ R^F to the input y_mt:

y'_mtf = y_mtf / w_mtf.   (9)

Since the peaks of the ego-noise change depending on the environment around the robot, the proposed method learns the normalization coefficient in an online manner (Fig. 5). We assume that the average ego-noise does not change frequently or drastically, so the normalization coefficient is updated as follows:

w_mt = α y_mt + (1 − α) w_m(t−1)   (10)

where α is a learning weight parameter that is set to a small value (e.g., 1.0 × 10^−2).
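Eqs. (9) and (10) amount to dividing each bin by an exponential moving average of its magnitude; a small sketch (the `eps` guard against silent bins is our addition, not in the paper):

```python
import numpy as np

def update_normalization(w_prev, y, alpha=1e-2):
    """Eq. (10): exponentially weighted moving average of input magnitudes."""
    return alpha * y + (1.0 - alpha) * w_prev

def normalize(y, w, eps=1e-8):
    """Eq. (9): per-bin division; eps (our addition) guards silent bins."""
    return y / (w + eps)
```

After ORPCA runs on the normalized input, the sparse component is multiplied by the same `w` to undo the scaling, matching step 7 of the algorithm below.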

D. Combining Online RPCA Results at Microphones

The sparse components of the microphones e_mt are integrated to extract the common component. Each sparse component contains large sparse noise measured when the corresponding microphone touches the environment, and small noise consisting of suppression residuals of the ego-noise. To be robust against these noises, the integration is conducted by taking a median at each frequency bin as follows:

s_tf = Median(e_1tf, · · · , e_Mtf) for all f = 1, · · · , F   (11)

where Median(· · · ) represents the median of the arguments. The whole proposed noise suppression algorithm is summarized as follows:

Algorithm 1 Proposed ego-noise suppression method
 1: for t = 1, 2, 3, · · · do
 2:   Observe the audio input signal y_mt (m = 1, · · · , M)
 3:   for m = 1, · · · , M do
 4:     Update the normalization coefficient w_mt (Eq. (10))
 5:     Normalize the input y_mt to get y'_mt (Eq. (9))
 6:     Apply ORPCA to y'_mt to get the sparse component e'_mt
 7:     Multiply e'_mt by the coefficient w_mt to get e_mt
 8:   end for
 9:   for f = 1, · · · , F do
10:     Take the median of [e_1tf, · · · , e_Mtf] to get the output s_tf
11:   end for
12: end for

Fig. 7. Arrangement of the hose-shaped rescue robot and the loudspeaker (distances shown in the figure: 4 m, 1.5 m, and 1 m). We tested three loudspeaker positions: front, right, and back.

IV. EXPERIMENTAL EVALUATION

This section reports an experiment evaluating the proposed method of ego-noise suppression using a prototype hose-shaped rescue robot.

A. Experimental Conditions

Fig. 1 shows the prototype hose-shaped rescue robot used in this experiment. The body was made with a corrugated tube 38 mm in diameter and 3 m long. This robot had the same self-propelling mechanism as the Active Scope Camera robot [3]: the entire surface of the robot was covered by cilia, and the robot moved forward by vibrating the cilia. The robot had M = 8 microphones positioned on its body at a regular interval of 40 cm. The body was rotated 90 degrees after each microphone was installed in order to avoid having all the microphones obstructed by the ground.

We recorded the ego-noise and target voice separately and then mixed them at SNRs varying from −25 dB to +5 dB. This experiment was conducted in an experimental room with a reverberation time (RT60) of 740 ms (Fig. 6). The ego-noise was recorded for 60 seconds while sliding the robot left and right using the vibrators and a hand. The arrangement of the robot and the loudspeaker from which the target voice was emitted is shown in Fig. 7. We tested three positions of the loudspeaker: front, right, and back. A low-noise target voice signal was generated by convolving a clean male voice recorded in an anechoic chamber with the impulse response recorded from the loudspeaker. These synchronized eight-channel acoustic signals were sampled at 16 kHz and 16 bits using HARK [9].
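Mixing a voice and a noise recording at a target SNR, as in this evaluation, uses a standard power-ratio scaling; the derivation below is generic and the synthetic signals are stand-ins for the actual recordings, not the authors' code.

```python
import numpy as np

def mix_at_snr(voice, noise, snr_db):
    """Scale the voice so that 10*log10(P_voice / P_noise) equals snr_db,
    then add the noise (standard construction, not the authors' exact procedure)."""
    p_voice = np.mean(voice ** 2)
    p_noise = np.mean(noise ** 2)
    gain = np.sqrt(p_noise / p_voice * 10 ** (snr_db / 10.0))
    return gain * voice + noise, gain

rng = np.random.default_rng(0)
voice = np.sin(2 * np.pi * 200 * np.arange(16000) / 16000)  # stand-in voice
noise = rng.standard_normal(16000)                          # stand-in ego-noise
mix, g = mix_at_snr(voice, noise, -25.0)   # severest condition in the experiment
```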

We compared the following three methods:
1) Multi-ORPCA: the proposed method using all eight microphones.
2) Single-ORPCA: ORPCA using only the 8th (tip) microphone.
3) Single-HRLE: HRLE using only the 8th microphone.
HRLE is a conventional method that solves the same noise suppression problem as ORPCA. The parameters of the proposed method were determined experimentally, and those of HRLE were set to the default values of the HARK implementation.

We implemented the proposed method as a HARK module using C++ and a linear algebra library called Eigen3. The estimation was conducted on a standard desktop computer with an Intel Core i7-4790 CPU (4-core, 3.6 GHz) and 8.0 GB of memory. The elapsed time for the proposed

Fig. 8. SDRs of the input audio signal at the 8th (tip) microphone and of the signals denoised by Single-HRLE, Single-ORPCA, and Multi-ORPCA (proposed), for loudspeaker positions (a) front, (b) right, and (c) back, over input SNRs from −25 dB to +5 dB.

Fig. 9. SIRs of the input audio signal at the 8th (tip) microphone and of the signals denoised by Single-HRLE, Single-ORPCA, and Multi-ORPCA (proposed), for loudspeaker positions (a) front, (b) right, and (c) back.

method with an 8-ch input signal was 20.0 s. This was short enough compared with the whole signal length of the input (60.0 s) that our method could work in real time.

The ego-noise suppression performance was evaluated using the signal-to-distortion ratio (SDR) and the signal-to-interference ratio (SIR) [25]. SDR measures the overall quality of the retrieved noise-suppressed signal, and SIR measures how much the interference due to the ego-noise is suppressed. They were measured using a Python toolkit called mir_eval [26].
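The paper computes these metrics with the BSS Eval toolkit; as a self-contained illustration of what SDR measures, a simplified single-channel variant by orthogonal projection (our simplification of the definition in [25], not mir_eval's full algorithm) looks like:

```python
import numpy as np

def sdr_db(reference, estimate):
    """Simplified single-channel SDR: project the estimate onto the reference
    and compare the target energy with everything else (cf. [25])."""
    s = reference - np.mean(reference)
    e = estimate - np.mean(estimate)
    target = (np.dot(e, s) / np.dot(s, s)) * s    # orthogonal projection onto s
    distortion = e - target
    return 10.0 * np.log10(np.sum(target ** 2) / np.sum(distortion ** 2))
```

For example, an estimate equal to the reference plus an orthogonal interferer at one tenth its amplitude scores 20 dB, since the distortion carries 1/100 of the target energy.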

B. Experimental Results

As shown in Fig. 8-(a), when the loudspeaker was in front of the robot, the SDR of the proposed method was higher than that of the other methods at SNR conditions between −20 dB and −5 dB. Moreover, as shown in Figs. 8-(b) and 8-(c), when the loudspeaker was to the right of the robot or behind the robot, the SDR of the proposed method was higher than that of the other methods at all SNR conditions of −5 dB or less. The SIR of the proposed method was improved by more than 2.9 dB compared with that of the other methods in all conditions.

SIR measures how much the ego-noise is suppressed. As shown in Figs. 10 and 11, the spectrogram produced by the proposed method had lower noise than those of the other methods. The ego-noise fluctuated at 30 Hz, which appears as vertical stripe patterns in Fig. 10-(a). The result of Single-HRLE retains residuals of this fluctuating ego-noise as vertical stripe patterns (Fig. 10-(d)), whereas in the results of the proposed method and Single-ORPCA (Figs. 10-(b) and (c)) the fluctuation was suppressed.

When the loudspeaker was in front of the robot, the SDRs and SIRs of the proposed method were not as good as when it was in the other two positions. This shows that the suppression performance depends on the relative positions of the microphones and the target sound source. This is because the proposed method integrates all microphones with the same weights, whereas the power of the target sound differs at each microphone. One way to overcome this problem is to integrate the microphones by weighting them with the estimated SNR of the target voice.

V. CONCLUSION

This paper presented an online real-time method that suppresses the ego-noise of a hose-shaped rescue robot. The proposed method suppresses the ego-noise at each microphone by using ORPCA, and extracts the components common among the microphones by combining the ORPCA results. Experiments using a 3-m hose-shaped rescue robot with an eight-channel microphone array showed that the proposed method improves the performance of conventional ego-noise suppression using only one microphone by 7.4 dB in SDR and 17.2 dB in SIR. The results also show that the suppression performance can degrade depending on the relative positions of the microphones and the target sound source.

To improve the proposed method, we plan to develop a microphone array integration method that weights the microphones with the estimated SNR of the target voice. Since the proposed method suppresses noise that appears frequently, it can suppress not only the ego-noise of a hose-shaped rescue robot but also other environmental noise and the ego-noise of other noisy rescue robots without prior learning. Our future work includes investigating the method's performance with other rescue robots, such as multicopters [18]. Furthermore, to evaluate its effectiveness in search-and-rescue tasks, we will conduct more practical experiments in simulated disaster sites.

Fig. 10. Examples of spectrograms obtained when the loudspeaker is to the right of the robot and the SNR is set to −5 dB: (a) signal measured at the 8th (tip) microphone, (b) signal denoised by Multi-ORPCA (proposed), (c) signal denoised by Single-ORPCA, (d) signal denoised by Single-HRLE.

Fig. 11. Spectrogram of the target voice for the condition of Fig. 10.

ACKNOWLEDGMENTS

This study was partially supported by the ImPACT Tough Robotics Challenge and by JSPS KAKENHI No. 24220006 and No. 15J08765.

REFERENCES

[1] A. Kitagawa et al., "Development of small diameter Active Hose-II for search and life-prolongation of victims under debris," Journal of Robotics and Mechatronics, vol. 15, no. 5, pp. 474–481, 2003.

[2] H. Namari et al., "Tube-type Active Scope Camera with high mobility and practical functionality," in IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2012, pp. 3679–3686.

[3] J. Fukuda et al., "Remote vertical exploration by Active Scope Camera into collapsed buildings," in IEEE/RSJ IROS, 2014, pp. 1882–1888.

[4] S. Tadokoro et al., "Application of Active Scope Camera to forensic investigation of construction accident," in IEEE International Workshop on Advanced Robotics and its Social Impacts, 2009, pp. 47–50.

[5] G. Ince et al., "Incremental learning for ego noise estimation of a robot," in IEEE/RSJ IROS, 2011, pp. 131–136.

[6] T. Tezuka et al., "Ego-motion noise suppression for robots based on semi-blind infinite non-negative matrix factorization," in IEEE International Conference on Robotics and Automation, 2014, pp. 6293–6298.

[7] B. Cauchi et al., "Reduction of non-stationary noise for a robotic living assistant using sparse non-negative matrix factorization," in Workshop on Speech and Multimodal Interaction in Assistive Environments, 2012, pp. 28–33.

[8] H. Nakajima et al., "An easily-configurable robot audition system using histogram-based recursive level estimation," in IEEE/RSJ IROS, 2010, pp. 958–963.

[9] K. Nakadai et al., "Design and implementation of robot audition system HARK – open source software for listening to three simultaneous speakers," Advanced Robotics, vol. 24, no. 5-6, pp. 739–761, 2010.

[10] J.-M. Valin et al., "Enhanced robot audition based on microphone array source separation with post-filter," in IEEE/RSJ IROS, 2004, pp. 2123–2128.

[11] M. Ishikura et al., "Shape estimation of flexible cable," in IEEE/RSJ IROS, 2012, pp. 2539–2546.

[12] Y. Bando et al., "A sound-based online method for estimating the time-varying posture of a hose-shaped robot," in IEEE International Symposium on Safety, Security, and Rescue Robotics, 2014, pp. 1–6.

[13] J. Taghia et al., "A variational Bayes approach to the underdetermined blind source separation with automatic determination of the number of sources," in IEEE International Conference on Acoustics, Speech and Signal Processing, 2012, pp. 253–256.

[14] H. Sawada et al., "Underdetermined convolutive blind source separation via frequency bin-wise clustering and permutation alignment," IEEE Transactions on Audio, Speech, and Language Processing (TASLP), vol. 19, no. 3, pp. 516–527, 2011.

[15] E. J. Candès et al., "Robust principal component analysis?" Journal of the ACM, vol. 58, no. 3, p. 11, 2011.

[16] C. Sun et al., "Noise reduction based on robust principal component analysis," Journal of Computational Information Systems, vol. 10, no. 10, pp. 4403–4410, 2014.

[17] J. Feng et al., "Online robust PCA via stochastic optimization," in Annual Conference on Neural Information Processing Systems, 2013, pp. 404–412.

[18] K. Okutani et al., "Outdoor auditory scene analysis using a moving microphone array embedded in a quadrocopter," in IEEE/RSJ IROS, 2012, pp. 3288–3293.

[19] S. A. Abdallah et al., "Polyphonic music transcription by non-negative sparse coding of power spectra," in International Conference on Music Information Retrieval (ISMIR), 2004, pp. 318–325.

[20] K. Nakamura et al., "Intelligent sound source localization for dynamic environments," in IEEE/RSJ IROS, 2009, pp. 664–669.

[21] M. Basiri et al., "Robust acoustic source localization of emergency signals from micro air vehicles," in IEEE/RSJ IROS, 2012, pp. 4737–4742.

[22] H. Sun et al., "A far field sound source localization system for rescue robot," in IEEE International Conference on Control, Automation and Systems Engineering, 2011, pp. 1–4.

[23] J. C. Murray et al., "Robotic sound-source localisation architecture using cross-correlation and recurrent neural networks," Neural Networks, vol. 22, no. 2, pp. 173–189, 2009.

[24] B. Recht et al., "Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization," SIAM Review, vol. 52, no. 3, pp. 471–501, 2010.

[25] E. Vincent et al., "Performance measurement in blind audio source separation," IEEE TASLP, vol. 14, no. 4, pp. 1462–1469, 2006.

[26] C. Raffel et al., "mir_eval: A transparent implementation of common MIR metrics," in ISMIR, 2014, pp. 367–372.