optimizing kinect: audio and acoustics -...

22
4/5/2012 1 CREST Symposium on Human-Harmonized Information Technology Kyoto University, April 2012 Agenda Kinect overview Acoustical design Audio pipeline Kinect overall Speech enabled user interfaces Takeaways 4/2/2012 Audio for Kinect: pushing it to the limits 2

Upload: duongbao

Post on 16-Jun-2018

213 views

Category:

Documents


0 download

TRANSCRIPT

4/5/2012

1

CREST Symposium on Human-Harmonized Information Technology

Kyoto University, April 2012

Agenda Kinect overview

Acoustical design

Audio pipeline

Kinect overall

Speech enabled user interfaces

Takeaways

4/2/2012 Audio for Kinect: pushing it to the limits 2

4/5/2012

2

Vision, sensors, scenarios

4/2/2012 Audio for Kinect: pushing it to the limits 3

Start point Initial customer base

Mostly young male adults

Successes and trends

Nintendo Wii (2006)

Xbox Lips (2008)

Vision

Stand from the couch!

No controller required

Talk and be understood

4/2/2012 Audio for Kinect: pushing it to the limits 4

4/5/2012

3

Kinect sensors Kinect (2010)

Depth camera

640x480 depth image, 11 bits resolution

RGB camera

640x480 color, 8 bits/pixel

Microphone array

4 supercardioid microphoneses, 24 bits ADCs

Motorized pivot

Vertical tilt, ±27°

4/2/2012 Audio for Kinect: pushing it to the limits 5

Depth camera in Kinect Scattered infrared light based

Range set from 0.8 to 4.0 m

1 rad FOV, tilt assisted

Creates 640x480 pixels depth image, 11 bits resolution

The processing is done in the device

4/2/2012 Audio for Kinect: pushing it to the limits 6

4/5/2012

4

Skeleton tracking Executed on the console

Recognize and track in 3D (x, y, z) all major joints

Allows posture and body biometrics tracking

4/2/2012 Audio for Kinect: pushing it to the limits 7

Scenarios Gesture recognition

Person tracking and identification

Gaming (dance, fitness)

4/2/2012 Audio for Kinect: pushing it to the limits 8

4/5/2012

5

Problems to overcome in audio Sound coming from the loudspeakers

Reverberation in the room

Noise: comparable to voice level at 3.5 meters

Dynamic range of the sounds to handle

Building a manufacturable and robust device

End-to-end system integration and optimization

4/2/2012 Audio for Kinect: pushing it to the limits 9

From the microscopic changes in the air pressure

to electrical signal

4/2/2012 Audio for Kinect: pushing it to the limits 10

4/5/2012

6

Kinect microphones and enclosure Desired: acoustically inexistent

In general introduces complexity

Worsens directivity patterns

DI loses 2.3 dB

The enclosure shape can be used to increase the microphones directivity

Modeling the sound wave propagation around the enclosure

4/2/2012 Audio for Kinect: pushing it to the limits 11

A C

D

Rubber booth Kinect enclosure

microphone

B

Kinect microphones and enclosure (2) Optimization of the

microphones placement and vents

Four optimization parameters

Q = w1*DI+…

Results

Final DI improved 1.1 dB

Experimentally confirmed in the anechoic chamber

4/2/2012 Audio for Kinect: pushing it to the limits 12

4/5/2012

7

0 1000 2000 3000 4000 5000 6000 7000 80000

1

2

3

4

5

6

7

8

Frequency, Hz

DI,

dB

MicArray Directivity Index -Weights.DAT

Array geometry Every pair of microphones

Optimal for one frequency

N-microphones array

Has N*(N-1)/2 unique pairs

Given microphone array geometry:

We can design optimal in sense of noise suppression array

Given design we can analyze and compute DI

If we can analyze – we can optimize!

4/2/2012 Audio for Kinect: pushing it to the limits 13

0 1000 2000 3000 4000 5000 6000 7000 80000

2

4

6

8

10

12

14

Frequency, Hz

DI, d

B

Microphone Array Directivity Index

Array geometry optimization Optimization parameters:

Optimization criterion:

Q=w1*mean(DI)+w2*std(DI)

Results

4/2/2012 Audio for Kinect: pushing it to the limits 14

A B

4/5/2012

8

We have the acoustical design. Now what?

4/2/2012 Audio for Kinect: pushing it to the limits 15

Audio pipeline architecture

4/2/2012 Audio for Kinect: pushing it to the limits 16

AEC BF

Microphone Array

AES NS AEC AEC CTR

Calib- ration

AEC AEC AEC MAEC AGC

SSL

Echo Est.

VAD

Surround sound output

Direction, confidence

Audio pipeline output

To all blocks

4/5/2012

9

Audio pipeline architecture

4/2/2012 Audio for Kinect: pushing it to the limits 17

AEC BF

Microphone Array

AES NS AEC AEC CTR

Calib- ration

AEC AEC AEC MAEC AGC

SSL

Echo Est.

VAD

Surround sound output

Direction, confidence

Audio pipeline output

To all blocks

Acoustic echo reduction systems Acoustic echo cancellation

Acoustic echo suppression

Mono AEC – part of each speakerphone

4/2/2012 Audio for Kinect: pushing it to the limits 18

4/5/2012

10

More rendering channels?

4/2/2012 Audio for Kinect: pushing it to the limits 19

Microphone

Loudspeakers

hL

Stereo sound

output

hR

Highly correlated channels Ill conditioned matrix

Multiple solutions

The approach on the left Doesn’t converge well

Has to re-converge after change

Bell Labs, 1991: “You can’t do stereo echo cancellation” … with this architecture

Multichannel AEC Our multichannel AEC

Use calibration pulses, compute mixing filters

Lock mixing filters, use one adaptive filter

In Kinect implemented for surround sound and four element microphone array

4/2/2012 Audio for Kinect: pushing it to the limits 20

Microphone

Loudspeakers

hL

Stereo sound

output

hR

h

4/5/2012

11

Microphone arrays: terminology Beamforming: making the microphone array to

listen to given look-up direction

Beamsteering: electronically change the look-up direction the microphone array listens to

Nullforming: suppressing the sounds coming from given direction

Nullsteering: electronically move the suppression direction

Sound source localization: techniques to detect, localize and track one or multiple sound sources using microphone array

4/2/2012 Audio for Kinect: pushing it to the limits 21

Beamforming: time invariant Beamforming

Time invariant beamformer

Isotropic noise assumption

Off-line design

Set of pre-computed beams

Design criterion: minimize the noise, keep desired

4/2/2012 Audio for Kinect: pushing it to the limits 22

( ) ( )n nY k k k W X

arg min

subject to 1

c

H

c c ck

c c

k k k k

k k

W

W W N W

W D

4/5/2012

12

Beamforming: adaptive Adaptive beamformer

On the fly computation of the weights

Higher CPU requirements

Does nullsteering

MVDR beamformer

Demos

4/2/2012 Audio for Kinect: pushing it to the limits 23

1

1

H

c NN

MVDR H

c NN c

k kk

k k k

D ΦW

D Φ D Angle, deg

Fre

qu

en

cy, H

z

Microphone Array Magnitude Response

-150 -100 -50 0 50 100 1500

1000

2000

3000

4000

5000

6000

7000

8000

-25

-20

-15

-10

-5

0

End-to-end optimization Mean Opinion Score (MOS) and Perceptual

Evaluation of Sound Quality (PESQ)

75 parameters for optimization

Optimization criterion: Q = PESQ+0.05*ERLE+0.001*SNR-0.001*LSD-0.01*MSE

Optimization algorithm Gaussian minimization

Data corpus with various distance, levels, reverberation

Parallelized processing on computing cluster 4/2/2012 Audio for Kinect: pushing it to the limits 24

4/5/2012

13

End-to-end optimization: results

4/2/2012 Audio for Kinect: pushing it to the limits 25

0

10

20

30

40

50

60

70

80

90

100

0 20 40 60 80 100

Imp

rove

me

nts

in

WE

R,

%

Loudspeaker volume, %

Multi-volume recognition results

Do nothing

Before optim

After optim

After training

Speech NUI good up to here

Supported levels up to here

Results: demo

4/2/2012 Audio for Kinect: pushing it to the limits 26

AEC BF

Microphone Array

AES NS AEC AEC CTR

Calib- ration

AEC AEC AEC MAEC AGC

SSL

Echo Est.

VAD

Surround sound output

Direction, confidence

Audio pipeline output

To all blocks

4/5/2012

14

Putting all technologies together

4/2/2012 Audio for Kinect: pushing it to the limits 27

Kinect overall All together: amazing new way to play

10 millions sold for four months

The best selling consumer electronics product ever

Kinect is an HMI sensor!

Goes way beyond gaming

Voice modality preferred for media selection by 54%

Other applications of such technologies

Speech and gesture modalities to the HMI

People recognition and tracking

4/2/2012 Audio for Kinect: pushing it to the limits 28

4/5/2012

15

Kinect for Windows Kinect for Windows Development Kit (KDK)

Beta version released from MSR June 2011

Commercial version 1.0 released February 1st 2012

Contains

Drivers for the cameras and microphones

Depth image processor and skeletal tracking

Audio pipeline with mono AEC

Speech recognizer with acoustic models

Applicable in education and industry

4/2/2012 Audio for Kinect: pushing it to the limits 29

We have the sound. How to use it?

4/2/2012 Audio for Kinect: pushing it to the limits 30

4/5/2012

16

Automatic Speech Recognition ASR is a statistical pattern matching problem

Find most likely word sequence to explain input

Customizing your system you focus mostly on P(W) Context Free Grammar (CFG)

Statistical Language Model (SLM)

4/2/2012 Audio for Kinect: pushing it to the limits 31

Whyp= argmax P(W|S) = argmax P(W)P(S|W))

“Language Model” “Acoustic Model”

W W

Classic CFG based system

4/2/2012 Audio for Kinect: pushing it to the limits 32

Recognizer

CFG Phrases

Speech Map to query to the database

HMM

P(W) P(S|W)

4/5/2012

17

“Please say flight information, new reservation, existing reservation, or customer service”

ASR with fixed grammars (CFG) • Defines items user can say at a given time

Easy to author (XML format)

Compact

4/2/2012 Audio for Kinect: pushing it to the limits 33

<mainmenu>

<one-of>

<item> flight information <wt=0.4> </item>

<item> new reservation <wt=0.3> </item>

<item> existing reservation <wt=0.1> </item>

<item> customer service <wt=0.2> </item>

</one-of>

</mainmenu>

Is this a problem for music selection? MSR Intern data collection

> 50% of the queries contain multiple fields from meta data “Play Yesterday by the Beatles”

Users do not know or use the exact names of the titles or names

Which one you prefer:

“Play track Serenade No. 13 for strings in G major K.525”

“Play Little Night Music by Beethoven” (even if it is composed by Mozart )

4/2/2012 Audio for Kinect: pushing it to the limits 34

0%

20%

40%

60%Exact match

4/5/2012

18

SLM and parser based processing

4/2/2012 Audio for Kinect: pushing it to the limits 35

Recognizer SLU

Parser

SLM

Sampling & Annotation

Training data

Speech

Structured query to the database

HMM

Demo: music selection and SMS reply See full video at

http://research.microsoft.com/apps/video/default.aspx?id=143050

4/2/2012 Audio for Kinect: pushing it to the limits 36

4/5/2012

19

Natural Language Input Non-hierarchical menu, single free phrase query

Support for non-strict queries with fuzzy matching names: More than 60% of the queries have two fields (i.e. “Play Yesterday

from Beatles” – title and artist)

Most of the queries are with non-complete names (i.e. “Play the Myth of Fingerprints” instead “Play the All Around the World or the Myth of Fingerprints” )

Minimizing the dialog turns Confirmation only when necessary, or low confident

Full query support, but with clarifications if necessary

4/2/2012 Audio for Kinect: pushing it to the limits 37

Multimodal User Interface and NUI Speech only = phone call

Multimodal Interface: voice, touch, gesture, GUI, buttons The goal is increasing the usability

Voice is good for long lists, GUI/gesture – for short: The combination requires less time and attention: query by voice,

confirm from the short list by buttons.

Support “voice only” and “GUI/gesture only” modes

Gradually move the boundary between them based on the situation

With the right proportion of the modalities and proper design we can achieve so-called Natural User Interface that doesn’t require training to operate the system

4/2/2012 Audio for Kinect: pushing it to the limits 38

4/5/2012

20

If I had to speak for five minutes

4/2/2012 Audio for Kinect: pushing it to the limits 39

Conclusions Several breakthrough achievements in Kinect audio

The only product with surround sound echo cancellation

Hands free sound capture system for speech recognition

First open microphone speech recognition system

A chain of optimal blocks is suboptimal! If one of the blocks underperforms – the entire pipeline

does!

End-to-end optimization played critical role

Kinect is a HMI sensor!

4/2/2012 Audio for Kinect: pushing it to the limits 40

4/5/2012

21

Future work Fusion of the sensors:

Vision, depth, audio for better results

More advanced audio processing techniques

Sound source separation

Speaker ID

Demo: sound source separation

Two persons, 2.4 meters, 26° separation

And this is just the beginning of the journey …

4/2/2012 Audio for Kinect: pushing it to the limits 41

Mix Speaker 1 Speaker2

Shameless plug

Ivan Tashev. Sound Capture and Processing: Practical Approaches. Wiley, 2009

Contains algorithms, a lot of figures, and sample Matlab code

4/2/2012 Audio for Kinect: pushing it to the limits 42

4/5/2012

22

Finally …

Thank you for your attention!

Questions?

Contact info: [email protected]

Web page: http://research.microsoft.com/en-us/people/ivantash/

4/2/2012 Audio for Kinect: pushing it to the limits 43

© 2012 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions,

it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.

4/2/2012 Audio for Kinect: pushing it to the limits 44