音響信号処理基礎...2018/11/08 · 音響信号処理基礎 fundamentals of acoustic signal...

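Fundamentals of Acoustic Signal Processing (音響信号処理基礎)

Shinnosuke Takamichi (高道慎之介)

Assistant Professor, Graduate School of Information Science and Technology, The University of Tokyo

NAIST, Sound Information Processing, Lecture 3 (2017/11/08)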

Purpose of this talk

Perceive, reproduce, and separate audio.

Sound perception (音知覚): understanding sound

Sound field reproduction (音場再現技術): creating a sound field

Audio source separation (音源分離技術): separating sounds

Report

Python programming on Google Colab

Submit your code and results to the submission page. (I will announce the details after this talk.)

SOUND PERCEPTION (音の知覚)

How we identify the direction of sound

A sound arrives from some position. How do we perceive its direction?

– We judge it using only the information that reaches the two ears.

How do the sounds arriving at the two ears differ?

– Interaural time difference (ITD) and interaural level difference (ILD)

– Both are caused by the difference between the paths the sound travels to each ear.

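How a sound arrives at the ear

If the head is approximated by a sphere, the arrival of the sound can be described explicitly. [高道他, 2011.]

Point sound source (点音源)

– Direct wave (直接波): arrives directly at the ear facing the source.

– Scattered wave (散乱波): the sound first reaches the head, then propagates along the head surface to the shadowed ear.

– Path difference between the direct and scattered waves: with the spherical-head approximation, it can be computed analytically from the source distance and azimuth.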
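The spherical-head geometry described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the head radius (8.75 cm), the speed of sound (343 m/s), the ear positions at ±90° and the function names `path_length` and `itd_seconds` are all choices made here, not values from the lecture. The direct wave uses the law of cosines; the scattered wave travels to the tangent point and then along the surface.

```python
import math

def path_length(distance, ear_angle, head_radius):
    """Shortest path from a point source to a point on a rigid sphere.

    ear_angle is the angle (radians) between the source direction and the ear,
    both seen from the center of the head. Requires distance > head_radius.
    """
    # Largest angle at which the ear is still in line of sight of the source.
    tangent_angle = math.acos(head_radius / distance)
    if ear_angle <= tangent_angle:
        # Direct wave: straight line from the source to the ear (law of cosines).
        return math.sqrt(distance**2 + head_radius**2
                         - 2.0 * distance * head_radius * math.cos(ear_angle))
    # Scattered wave: straight line to the tangent point, then along the surface.
    return (math.sqrt(distance**2 - head_radius**2)
            + head_radius * (ear_angle - tangent_angle))

def itd_seconds(azimuth_deg, distance=2.0, head_radius=0.0875, speed_of_sound=343.0):
    """ITD (left-ear arrival time minus right-ear arrival time) in seconds."""
    azimuth = math.radians(azimuth_deg)
    # Ears sit at +90 deg (right) and -90 deg (left); azimuth 0 is straight ahead.
    angle_right = math.acos(max(-1.0, min(1.0, math.sin(azimuth))))
    angle_left = math.acos(max(-1.0, min(1.0, -math.sin(azimuth))))
    difference = (path_length(distance, angle_left, head_radius)
                  - path_length(distance, angle_right, head_radius))
    return difference / speed_of_sound

for azimuth in (0, 30, 60, 90):
    print(f"azimuth {azimuth:3d} deg: ITD = {1e3 * itd_seconds(azimuth):.3f} msec")
```

With these assumed values the magnitude of the ITD stays below 1 msec for every azimuth, and for a distant source at 90° it approaches the far-field value a(1 + π/2)/c.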

Interaural time difference (ITD)

[Figure: ITD [msec] as a function of source angle [°] (0 to 360), and right/left ear waveforms over time [msec] with the ITD marked.]

The time difference between the two ears is smaller than 1 msec.

– We can nevertheless perceive this time difference.

Interaural level difference (ILD)

[Figure: gain [dB] versus frequency [kHz] for the right and left ears, and ILD [dB] as a function of source angle [°] (0 to 360).]

The difference between the arrival paths changes the sound pressure at each ear.

Effects of ITD & ILD in each frequency band

[Figure: frequency ranges (0 to 20 kHz) over which the interaural time difference (ITD) and the interaural level difference (ILD) are effective.]

A wide-band signal is easily localized.

– Conversely, pure tones are localized poorly.

Other factors that change what we hear

Peaks and notches (ピーク・ノッチ)

Precedence effect (先行音効果)

– The direction of the first-arriving sound dominates the localization of the sound image.

Audio-visual interaction (視覚情報との相互作用)

– Cocktail-party effect (カクテルパーティ効果)

– Ventriloquism effect (腹話術効果)

– McGurk effect (マガーク効果)

Peak/notch

[Figure: gain [dB] versus frequency [kHz] (0.5 to 20 kHz), with the peak P1 and the notch N1 marked.]

The transfer function varies greatly with the shape of the pinna, head, etc.

– Peaks (P1, P2, …): bands in which the signal is amplified

– Notches (N1, N2, …): bands in which the signal is attenuated

Why do peaks and notches occur?

Recall the z-transform:

– Peak: amplification caused by resonance of the sound wave

– Notch: attenuation caused by a delayed copy of the sound wave

Spectral peaks are caused by resonance in the pinna.

Spectral notches are caused by the direct wave interfering with the wave reflected by the pinna. [竹本他, 2010.]
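The z-transform argument can be checked numerically. The sketch below is a toy model, not a pinna model: it assumes a 48 kHz sampling rate, a single reflection delayed by 2 samples with gain 0.8, and a two-pole resonator with pole radius 0.95 at 4 kHz. All of these numbers are made up for illustration. A delayed copy added to the direct wave, H(z) = 1 + g z^{-D}, produces a notch; a resonance (poles near the unit circle) produces a peak.

```python
import numpy as np

fs = 48000.0                                   # sampling rate [Hz] (assumed)
freqs = np.linspace(0.0, fs / 2.0, 2401)       # 10 Hz grid up to Nyquist
z_inv = np.exp(-2j * np.pi * freqs / fs)       # z^{-1} evaluated on the unit circle

# Notch: direct wave plus one reflection delayed by D samples, H(z) = 1 + g z^{-D}.
delay_samples, reflection_gain = 2, 0.8
h_notch = 1.0 + reflection_gain * z_inv**delay_samples
notch_hz = freqs[np.argmin(np.abs(h_notch))]

# Peak: two-pole resonator, H(z) = 1 / (1 - 2 r cos(w0) z^{-1} + r^2 z^{-2}).
resonance_hz, pole_radius = 4000.0, 0.95
w0 = 2.0 * np.pi * resonance_hz / fs
h_peak = 1.0 / (1.0 - 2.0 * pole_radius * np.cos(w0) * z_inv + pole_radius**2 * z_inv**2)
peak_hz = freqs[np.argmax(np.abs(h_peak))]

print(f"notch at {notch_hz:.0f} Hz (fs / (2 D) = {fs / (2 * delay_samples):.0f} Hz)")
print(f"peak near {peak_hz:.0f} Hz (pole angle corresponds to {resonance_hz:.0f} Hz)")
```

The notch sits where the delayed copy arrives in anti-phase with the direct wave, so a longer reflection path (larger D) moves the notch to a lower frequency.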

Precedence effect (Haas effect)

[Figure: in one setup the listener perceives a sound image between the two loudspeakers; in the other, the listener perceives only the sound of the right loudspeaker.]

The direction of the first-arriving sound dominates the localization of the sound image.

– Also called the Haas effect or the law of the first wavefront.

Precedence effect (Haas effect)

[Figure: region of the time / sound-pressure plane in which the precedence effect holds.]

Conditions under which the effect holds

– The signals are coherent.

– The arrival time difference and the level difference between the signals affect it.

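Comparison of human sensory organs

Human sensory organs and their numbers of receptors [“Communication”, P.13, No.61, vol.11, 1996.]

Sense | Receptor (count) | Fibers to the central nervous system
Vision | Photoreceptor cells of the retina (10^8) | 10^6
Hearing | Hair cells of the cochlea (10^4) | 10^4
Smell | Olfactory cells of the olfactory mucosa (10^7) | 10^3
Touch | Tactile cells of the skin (10^5) | 10^4

Compared by the number of fibers to the central nervous system, vision carries about 100 times as much information as hearing.

– Hearing therefore interacts with visual information.

– The following slides show effects caused by this interaction.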

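Cocktail-party effect

Humans process the sounds they hear and reconstruct only the information they need.

– Selective attention to sounds (音声の選択的聴取)

[Figure: among surrounding chatter, a listener picks out one utterance ("Did you hear? A got married!") by combining binaural listening (hearing), lip movement (vision), and reasoning (brain) to judge which cues match.]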

Cocktail-party effect (movie)

https://www.youtube.com/watch?v=mN--nV61gDo

Multiple voices are presented simultaneously, but we can selectively attend to one of them.

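Ventriloquism effect

The position of the sound image is pulled toward the position of the visual image.

– This stems from the ambiguity of binaural information.

The visual image and the sound image are synchronized in time but located at different positions in space.

Nevertheless, the listener perceives the sound as coming from the position of the visual image: the sound is misperceived as emanating from the visual source.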

McGurk effect

Interaction between vision and hearing in the perception of speech phonemes

– Visual stimulus of phoneme A + auditory stimulus of phoneme B = phoneme C is perceived

[Figure: aural information ば (ba) combined with visual information が (ga) is perceived as だ (da).]

McGurk effect (movie)

https://www.youtube.com/watch?v=G-lN8vWm3m0

The sound is always /ba/, but it is heard as /va/ when the picture changes.

SOUND FIELD REPRODUCTION (音場再現技術)

Sound field reproduction

Sound field reproduction (音場再現技術)

– A technique for artificially reproducing a desired sound field (the space in which sound waves exist)

– → High-presence 3D audio systems beyond spatio-temporal constraints

Two types of systems, by playback device

– Loudspeaker type (also called open type): reproduced by loudspeakers

– Binaural type (also called immersive type): reproduced by headphones

Performance evaluation

– Size of the listening area

– Spatial resolution

Relation of sound field reproduction methods

[Figure: methods placed on two axes, size of the listening area and spatial resolution: 5.1ch surround, Binaural, Transaural, 22.2ch surround, Higher Order Ambisonics, Wave Field Synthesis.]

Towards physical sound field reproduction with a wide listening area and high spatial resolution

Conventional methods (stereo, 5.1ch surround)

These methods rely on a psychoacoustic model: they exploit how humans perceive the direction of sound.

Cons.

– The listening position is limited to the center of the loudspeakers (the sweet spot).

– A sound engineer has to design the audio by hand, so the result remains an artificial rendering of the sound.

Sound field reconstruction for high-presence audio

Physical reproduction of the sound field itself (based on a physical acoustic model)

Secondary sources (= loudspeakers) arranged on the boundary surface 𝑺 are driven so that the sound field inside the target region 𝑽 matches the desired sound field.

[Figure: virtual primary sources and the secondary source distribution.]

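Comparison of sound field reproduction methods

Name | Array shape | Method | Reproduces a recorded sound field
1. Wave Field Synthesis (WFS) | Plane / line | Loudspeaker driving signals based on the Kirchhoff-Helmholtz integral / Rayleigh integral | ×
2. Higher Order Ambisonics (HOA) | Sphere | Encoding / decoding based on spherical harmonic expansion | ○
3. Inverse filter-based methods (e.g. boundary surface control) | Arbitrary | Multi-point sound pressure control based on least squares, etc. | ○
4. Wave field reconstruction (WFR) filter | Plane / line / cylinder / sphere / circle | Direct signal conversion in the spatial spectrum domain | ○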

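The Huygens principle

The shape of a wavefront at a given time is explained as the result of spherical waves emitted from every point on the wavefront at the preceding instant.

Figure quoted from IEICE 『知識の森』, Group 2, Part 6, Chapter 7.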

1. Wave Field Synthesis (WFS)

[Berkhout+ JASA 1993] [Spors+ AES Conv 2008]

The sound pressure gradient on the boundary surface is used as the driving signal of the secondary sources to reproduce the sound field.

[Figure: secondary source plane, with the driving signal at spatial position 𝒓s and frequency 𝜔.]

[Photo: loudspeaker panel at Nagaoka University of Technology (2009), quoted from [板倉, 長岡技大卒業論文, 2009.].]

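2. Higher Order Ambisonics (HOA)

[Daniel AES Conf 2003] [Poletti JAES 2005]

The loudspeakers are controlled so that the spherical harmonic spectrum of the synthesized sound field matches that of the desired sound field, in the spherical harmonic domain whose origin is the center of the loudspeaker array.

The driving signals are obtained with the (generalized) inverse of a matrix whose elements are spherical harmonic functions.

A spherical array allows us to reproduce sound fields from all directions.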

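3. Inverse filter-based sound pressure control

[Gauthier+ JASA 2005]

An inverse system is used so that the synthesized sound pressure matches the desired sound pressure at the control points.

[Figure: desired pressure 𝑷des(𝜔) → inverse system of 𝑮(𝜔) → driving signals 𝑫(𝜔) of the 𝐿 loudspeakers → transfer matrix 𝑮(𝜔) → synthesized pressure 𝑷syn(𝜔) at the 𝑀 control points.]

– The inverse filter of 𝑮(𝜔) ∈ ℂ^{𝑀×𝐿} is designed by least squares:

\[ \boldsymbol{D}(\omega) = \left( \boldsymbol{G}^{H}(\omega)\,\boldsymbol{G}(\omega) + \beta \boldsymbol{I} \right)^{-1} \boldsymbol{G}^{H}(\omega)\, \boldsymbol{P}_{\mathrm{des}}(\omega) \]

where 𝛽 is the regularization parameter.

If the inverse filter can be designed, this framework can be applied to arbitrary array shapes.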
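As a minimal sketch of the regularized least-squares design, the following NumPy code solves for the driving signals at a single frequency bin. The transfer matrix here is random toy data rather than a measured or simulated room, the shape convention (M control points as rows, L loudspeakers as columns) is an assumption chosen so that the products are well defined, and the function name `inverse_filter_driving_signals` is hypothetical.

```python
import numpy as np

def inverse_filter_driving_signals(G, p_des, beta):
    """Regularized least-squares driving signals D = (G^H G + beta I)^{-1} G^H p_des.

    G     : (M, L) complex transfer matrix from L loudspeakers to M control points
    p_des : (M,) desired complex sound pressure at the control points
    beta  : nonnegative regularization parameter
    """
    num_speakers = G.shape[1]
    gram = G.conj().T @ G + beta * np.eye(num_speakers)
    # Solve the linear system instead of forming the inverse explicitly.
    return np.linalg.solve(gram, G.conj().T @ p_des)

# Toy data for a single frequency bin (random values, not a real room).
rng = np.random.default_rng(0)
M, L = 16, 8
G = rng.standard_normal((M, L)) + 1j * rng.standard_normal((M, L))
p_des = rng.standard_normal(M) + 1j * rng.standard_normal(M)

D = inverse_filter_driving_signals(G, p_des, beta=1e-2)
p_syn = G @ D
print("relative error at control points:",
      np.linalg.norm(p_syn - p_des) / np.linalg.norm(p_des))
```

A larger 𝛽 trades matching accuracy at the control points for smaller driving signals, which is the usual reason for regularizing an ill-conditioned 𝑮(𝜔).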

Sound field reproduction by headphones

Reproduction by loudspeakers

– The reproduced sound field can be shared by many listeners.

Reproduction by headphones

– The sound field can be personalized for each listener, and the system takes little space.

– Binaural technology

Kinds of headphones

[Figure quoted from [福永, 長岡技大修士論文, 2011.]: Circumaural, Supra-aural, Supra-concha, Intra-concha, Insert.]

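Effects of headphones on sound localization

Circumaural through intra-concha types are strongly affected by the pinna.

– The result depends on the shape of the pinna and on how the headphone is worn.

– Strongly listener-dependent

Insert types do not include the pinna in the transmission path.

– Weakly listener-dependent

– Assuming the ear canal is a one-dimensional acoustic tube, a plane wave propagates from the earphone diaphragm to the eardrum.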

Binaural

[Figure: the original sound field is recorded with a Head And Torso Simulator (HATS), passed through an inverse system that cancels the headphone and ear canal characteristics, and played to the listener.]

– Record with a dummy head (or HATS) and play back through headphones.

– The system is simple to build.

– Not robust to head rotation or movement of the listener, and the sound image is localized inside the head.

→ This can be alleviated by head tracking.

HRTF: Head-Related Transfer Function

Figure quoted from [平原他, 2011.]

The sound field can be reproduced by convolving the source signal with the HRTF, without actually constructing the original sound field.

– HRTF: the acoustic transfer function between a sound source and a point near the listener's eardrum in a free field

[Figure: sound source, listener, microphone, earphone.]
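Rendering by convolution can be sketched as follows. The head-related impulse responses (the time-domain counterpart of the HRTF) used here are hypothetical toy filters containing only a delay and an attenuation; a real system would load measured responses for the desired direction. The function name `binaural_render` is also an assumption.

```python
import numpy as np

def binaural_render(mono, hrir_left, hrir_right):
    """Convolve a mono signal with left/right HRIRs; returns an array of shape (2, N)."""
    left = np.convolve(mono, hrir_left)
    right = np.convolve(mono, hrir_right)
    return np.stack([left, right])

fs = 48000
# Hypothetical toy HRIRs for a source on the right: the left ear receives the
# sound 30 samples (about 0.6 msec) later and about 6 dB quieter than the right ear.
hrir_right = np.zeros(64)
hrir_right[0] = 1.0
hrir_left = np.zeros(64)
hrir_left[30] = 0.5

mono = np.random.default_rng(0).standard_normal(fs // 10)   # 0.1 s of noise
stereo = binaural_render(mono, hrir_left, hrir_right)
print(stereo.shape)
```

Even these toy filters carry an ITD and an ILD; measured HRIRs add the peaks and notches produced by the pinna.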

AUDIO SOURCE SEPARATION (音源分離技術)

Audio source separation

Separate and extract the individual instrument sounds from a music signal in which several instruments are mixed.

– → Music signal separation (音楽信号分解)

Applications

– Remixing: users edit each instrument sound to their liking

– Automatic music transcription

– Audio augmented reality (AR), etc.

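Nonnegative matrix factorization (NMF)

Nonnegative matrix factorization (NMF) [Lee, et al., 1999]

– Takes the sparsity of the data and an additive (superposition) representation into account, with efficient multiplicative update rules.

– Applied in many fields such as image processing and signal processing.

\[ \boldsymbol{Y} = \boldsymbol{F}\boldsymbol{G}, \qquad Y_{\omega,t} \ge 0,\quad F_{\omega,k} \ge 0,\quad G_{k,t} \ge 0 \]

(The subscripts k,t denote the element in row k, column t of the matrix.)

\[ F_{\omega,k} \leftarrow F_{\omega,k}\, \frac{(\boldsymbol{Y}\boldsymbol{G}^{\top})_{\omega,k}}{(\boldsymbol{F}\boldsymbol{G}\boldsymbol{G}^{\top})_{\omega,k}}, \qquad G_{k,t} \leftarrow G_{k,t}\, \frac{(\boldsymbol{F}^{\top}\boldsymbol{Y})_{k,t}}{(\boldsymbol{F}^{\top}\boldsymbol{F}\boldsymbol{G})_{k,t}} \]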
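The multiplicative updates can be tried on a small synthetic matrix. The sketch below assumes the squared Euclidean error as the objective (which is what these two update rules minimize), random initialization, and a small constant `eps` in the denominators to avoid division by zero. The function name `nmf_euclidean` and the toy data are illustrative choices.

```python
import numpy as np

def nmf_euclidean(Y, num_bases, num_iters=200, seed=0, eps=1e-12):
    """Factorize a nonnegative matrix Y (freq x time) as F @ G with multiplicative updates."""
    rng = np.random.default_rng(seed)
    num_freqs, num_frames = Y.shape
    F = rng.random((num_freqs, num_bases)) + eps    # basis spectra
    G = rng.random((num_bases, num_frames)) + eps   # activations
    errors = []
    for _ in range(num_iters):
        # Element-wise multiplicative updates keep F and G nonnegative.
        F *= (Y @ G.T) / (F @ G @ G.T + eps)
        G *= (F.T @ Y) / (F.T @ F @ G + eps)
        errors.append(np.linalg.norm(Y - F @ G) ** 2)
    return F, G, errors

# Toy "spectrogram": two spectral patterns with random activations.
rng = np.random.default_rng(1)
F_true = rng.random((20, 2))
G_true = rng.random((2, 50))
Y = F_true @ G_true

F, G, errors = nmf_euclidean(Y, num_bases=2)
print(f"squared error: first iteration {errors[0]:.4f}, last iteration {errors[-1]:.6f}")
```

Because every update multiplies by a nonnegative ratio, no projection step is needed to keep the factors nonnegative.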

[Figure: a spectrogram 𝒀 (frequency [Hz] versus time [sec]) factorized as 𝒀 = 𝑭𝑮. 𝑭 holds the frequently appearing spectra; 𝑮 holds the timing and power of each spectrum.]

𝑭: basis matrix (スペクトル基底行列), the frequently appearing spectra. 𝑮: activation matrix (アクティベーション行列), the timing and power of each spectrum.

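Objective function of NMF

In NMF, the objective function for optimizing the factors 𝑭 and 𝑮 is given as a distance function:

\[ J_{\mathrm{NMF}} = D(\boldsymbol{Y} \,|\, \boldsymbol{F}\boldsymbol{G}) \]

where 𝐷(⋅|⋅) is an arbitrary distance function.

The distance function is chosen according to the data and the purpose of the decomposition.

– Source separation: generalized KL divergence

– Automatic music transcription: Itakura-Saito divergence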

𝛽-divergence

Generalized distance function: 𝛽-divergence [Eguchi et al., 2001]

– 𝛽 = 2: Euclidean distance

– 𝛽 = 1: generalized KL divergence

– 𝛽 = 0: Itakura-Saito (IS) divergence

Smaller 𝛽 gives a distance measure that puts more weight on sparsity (suited to sparse data).
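The slide names the three special cases without writing the formula out. A commonly used definition of the 𝛽-divergence, given here as an assumption rather than as the exact form shown in the lecture, is D_𝛽(y|x) = y^𝛽/(𝛽(𝛽−1)) + x^𝛽/𝛽 − y x^(𝛽−1)/(𝛽−1), with the KL and IS divergences as the limits 𝛽 → 1 and 𝛽 → 0. Under this definition 𝛽 = 2 gives half the squared Euclidean distance. The function name `beta_divergence` is hypothetical.

```python
import numpy as np

def beta_divergence(y, x, beta):
    """Element-wise beta-divergence D_beta(y | x) for positive y and x."""
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)
    if beta == 0:     # Itakura-Saito divergence
        return y / x - np.log(y / x) - 1.0
    if beta == 1:     # generalized KL divergence
        return y * np.log(y / x) - y + x
    # General case; beta = 2 gives half the squared Euclidean distance.
    return (y**beta / (beta * (beta - 1.0))
            + x**beta / beta
            - y * x**(beta - 1.0) / (beta - 1.0))

y = 5.0
for x in (1.0, 5.0, 9.0):
    print(f"x = {x}: IS = {beta_divergence(y, x, 0):.3f}, "
          f"KL = {beta_divergence(y, x, 1):.3f}, "
          f"EUC = {beta_divergence(y, x, 2):.3f}")
```

For the same |y − x|, the IS and KL divergences are larger when x underestimates the data y than when it overestimates it, while the Euclidean case is symmetric.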

[Figure: 𝐷𝛽(𝑦|𝑥) plotted against 𝑦 − 𝑥 (from −5 to 5) for the IS divergence (𝛽 = 0), the KL divergence (𝛽 = 1), and the Euclidean distance (𝛽 = 2).]

𝑦 − 𝑥 negative → the input variable 𝑥 is larger than the data 𝑦: the IS and KL divergences take small values.

𝑦 − 𝑥 positive → the input variable 𝑥 is smaller than the data 𝑦: the IS and KL divergences take large values.

[Figure: the same three plots (𝛽 = 0, 1, 2) together with two amplitude [dB] versus frequency [kHz] spectra. 𝛽 = 0: strong sparsity. 𝛽 = 2: weak sparsity.]

[Figure: 𝛽-divergence plotted against 𝑦 − 𝑥 (from −5 to 5) for 𝛽 = 3, 𝛽 = 4, and 𝛽 = 100.]

As 𝛽 grows further, the divergence behaves as if the input variable 𝑥 and the data 𝑦 had been exchanged.

NMF with the 𝛽-divergence

\[ J_{\mathrm{NMF}} = D_{\beta}(\boldsymbol{Y} \,|\, \boldsymbol{F}\boldsymbol{G}) \]

Iterative update equations whose convergence is guaranteed for all 𝛽 of the 𝛽-divergence [Nakano, et al., 2010]

𝑓𝜔,𝑘 and 𝑔𝑘,𝑡 are the elements of 𝑭 and 𝑮, respectively.
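The update equations themselves did not survive in this text, so the sketch below uses a widely cited form of the 𝛽-divergence multiplicative updates. The usual ratio is raised to an exponent 𝜑(𝛽) equal to 1/(2 − 𝛽) for 𝛽 < 1, 1 for 1 ≤ 𝛽 ≤ 2, and 1/(𝛽 − 1) for 𝛽 > 2. Treat this as an assumption about what the slide showed, not a transcription of it. The function names and the strictly positive toy data are also illustrative.

```python
import numpy as np

def beta_divergence_sum(Y, X, beta):
    """Sum of element-wise beta-divergences D_beta(Y | X) for positive arrays."""
    if beta == 0:
        return np.sum(Y / X - np.log(Y / X) - 1.0)
    if beta == 1:
        return np.sum(Y * np.log(Y / X) - Y + X)
    return np.sum(Y**beta / (beta * (beta - 1.0)) + X**beta / beta
                  - Y * X**(beta - 1.0) / (beta - 1.0))

def update_exponent(beta):
    """Exponent applied to the multiplicative factor (assumed form, see lead-in)."""
    if beta < 1:
        return 1.0 / (2.0 - beta)
    if beta <= 2:
        return 1.0
    return 1.0 / (beta - 1.0)

def nmf_beta(Y, num_bases, beta, num_iters=100, seed=0):
    """NMF of a strictly positive matrix Y under the beta-divergence."""
    rng = np.random.default_rng(seed)
    F = rng.random((Y.shape[0], num_bases)) + 0.1
    G = rng.random((num_bases, Y.shape[1])) + 0.1
    phi = update_exponent(beta)
    costs = []
    for _ in range(num_iters):
        X = F @ G
        F *= (((Y * X**(beta - 2.0)) @ G.T) / (X**(beta - 1.0) @ G.T)) ** phi
        X = F @ G
        G *= ((F.T @ (Y * X**(beta - 2.0))) / (F.T @ X**(beta - 1.0))) ** phi
        costs.append(beta_divergence_sum(Y, F @ G, beta))
    return F, G, costs

Y = np.random.default_rng(1).random((20, 50)) + 0.1   # strictly positive toy data
for beta in (0, 1, 2):
    _, _, costs = nmf_beta(Y, num_bases=4, beta=beta)
    print(f"beta = {beta}: cost {costs[0]:.3f} -> {costs[-1]:.3f}")
```

With 𝛽 = 2 the exponent is 1 and the updates reduce to the Euclidean multiplicative updates.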

Penalized Supervised NMF (PSNMF) [Kitamura13]

A method that uses training sounds of the instrument to be separated.

– Training: supervised training on data of the target instrument

\[ \boldsymbol{Y}_{\mathrm{target}} \simeq \boldsymbol{F}\boldsymbol{Q} \]

𝒀target is the target spectrogram, and 𝑭 is the basis matrix trained in advance using audio data of the target instrument.

– Separation: estimate 𝑮, 𝑯, 𝑼 while keeping the trained bases 𝑭 fixed

\[ \boldsymbol{Y} \simeq \boldsymbol{F}\boldsymbol{G} + \boldsymbol{H}\boldsymbol{U} \]

𝑯 is regularized so that it is as uncorrelated with 𝑭 as possible.

The spectrogram reconstructed from 𝑭𝑮 is the separation result.
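The two stages can be sketched with plain Euclidean multiplicative updates. This is a simplified illustration, not the cited PSNMF algorithm: the penalty that keeps 𝑯 uncorrelated with 𝑭 is omitted, the squared Euclidean error is assumed as the objective, and the data are random nonnegative matrices. The function names `train_bases` and `separate` are hypothetical.

```python
import numpy as np

EPS = 1e-12

def train_bases(Y_target, num_bases, num_iters=200, seed=0):
    """Training stage: learn F from the target instrument alone, Y_target ~ F Q."""
    rng = np.random.default_rng(seed)
    F = rng.random((Y_target.shape[0], num_bases)) + EPS
    Q = rng.random((num_bases, Y_target.shape[1])) + EPS
    for _ in range(num_iters):
        F *= (Y_target @ Q.T) / (F @ Q @ Q.T + EPS)
        Q *= (F.T @ Y_target) / (F.T @ F @ Q + EPS)
    return F

def separate(Y, F, num_other_bases, num_iters=200, seed=0):
    """Separation stage: Y ~ F G + H U with F fixed; only G, H, U are updated."""
    rng = np.random.default_rng(seed)
    G = rng.random((F.shape[1], Y.shape[1])) + EPS
    H = rng.random((Y.shape[0], num_other_bases)) + EPS
    U = rng.random((num_other_bases, Y.shape[1])) + EPS
    errors = []
    for _ in range(num_iters):
        X = F @ G + H @ U
        G *= (F.T @ Y) / (F.T @ X + EPS)
        X = F @ G + H @ U
        H *= (Y @ U.T) / (X @ U.T + EPS)
        X = F @ G + H @ U
        U *= (H.T @ Y) / (H.T @ X + EPS)
        errors.append(np.linalg.norm(Y - F @ G - H @ U) ** 2)
    return F @ G, H @ U, errors   # separated target, residual, error history

# Toy data: a "target" and an "other" source with random nonnegative spectra.
rng = np.random.default_rng(2)
Y_target = rng.random((30, 3)) @ rng.random((3, 80))
Y_other = rng.random((30, 3)) @ rng.random((3, 60))
Y_mix = Y_target[:, :60] + Y_other

F = train_bases(Y_target[:, 60:], num_bases=3)
target_est, other_est, errors = separate(Y_mix, F, num_other_bases=3)
print(f"reconstruction error: {errors[0]:.3f} -> {errors[-1]:.5f}")
```

Without the penalty, 𝑯 is free to learn spectra that resemble the target bases, in which case part of the target leaks into 𝑯𝑼. Regularizing 𝑯 to be as uncorrelated with 𝑭 as possible is aimed at exactly that failure mode.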

Source separation demo 1

An actual performance was decomposed with supervised NMF.

– Original song

– Training sound 1 (musical instrument 1) → separated sound 1

– Training sound 2 (musical instrument 2) → separated sound 2

CONCLUSION (まとめ)

Perceive, reproduce, and separate audio.

Sound perception

– Interaural time difference and interaural level difference

– Interaction with vision

Sound field reproduction

– Binaural (headphone) type and loudspeaker type

Audio source separation

– Source separation with NMF
