
Appendix

Fundamental Frequency Modeling for Neural-Network-based Statistical Parametric Speech Synthesis

Xin WANG (シンワン)

ID: 20151706, 2017-12-07

Contact: [email protected] for suggestions and discussion

8/2/18

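Linguistic features

INTRODUCTION

Figure: the text front-end that produces the linguistic features $x_{1:T}$ in T frames.
• Input text: 次は新金岡、新金岡です。 ("Next is Shin-Kanaoka, Shin-Kanaoka.")
• Text analyzer: a parser [1] with a dictionary gives the reading and part of speech of each word: ツギ (noun), ワ (particle), シンカナオカ (proper noun), シンカナオカ (proper noun), デス (auxiliary verb)
• "Prosody" analyzer [2, 3]: adds prosodic marks (shown as * and | in the figure) to the reading ツギワシンカナオカシンカナオカデス
• Duration model [4]: decides how many frames each unit covers
• Interface: outputs one feature vector per frame, for example ツ (noun, ...) for frames 1 and 2 and ギ (noun, ...) for frames 3 to 6

[1] T. Kudo. MeCab: Yet Another Part-of-Speech and Morphological Analyzer.
[2] Sagisaka and Sato. IEICE Transactions (Japanese edition), Vol. J66-D, No. 7, pp. 849–856, 1983.
[3] M. Suzuki, et al. "Automatic estimation of accent sandhi in Tokyo-dialect Japanese using CRFs" (in Japanese). (2012): 2-2.
[4] T. Yoshimura, et al. Duration modeling for HMM-based speech synthesis. In ICSLP, volume 98, pages 29–32, 1998.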

Linguistic features

• Japanese data (generated by Open JTalk [5]):
  Phoneme
    - Previous-previous / previous / current / next / next-next phoneme
  Mora
    - Distance from the current mora to the accent nucleus
    - Position of the current mora in the accent phrase
  Word
    - Part of speech of the previous / current / next word
    - Inflected form of the previous / current / next word
    - Conjugation type of the previous / current / next word
  Accent phrase
    - Number of moras in the previous / current / next accent phrase
    - Accent type of the previous / current / next accent phrase
    - Whether the previous / current / next accent phrase is interrogative
    - Position of the current accent phrase in the breath group
    - Whether there is a pause after the previous or before the next accent phrase
  Breath group
    - Number of moras in the previous / current / next breath group
    - Number of accent phrases in the previous / current / next breath group
    - Position of the current breath group in the utterance

• English data (generated by Flite+HTS_engine [6]):
    - Similar features over phoneme / syllable / phrase
    - Pitch accent -> accented or not
    - Part of the ToBI boundary tone (LL, LH)

[5] The HTS Working Group. The Japanese TTS System 'Open JTalk', 2016. http://open-jtalk.sourceforge.net
[6] HTS Working Group. The English TTS system Flite+HTS engine, 2014. http://hts-engine.sourceforge.net

Acoustic features

Figure: extraction of the acoustic features $o_{1:T}$ in T frames.
• Framing and windowing: each slice is called a speech frame (length: 20 ms; overlap: 15 ms)
• Speech vocoder: FFT + cepstral analysis (the spectrum amplitude may be used; the phase may be ignored)
• F0 tracking: one value per frame, for example unvoiced, unvoiced, 200 Hz, ...
• Depending on the task, $o_{1:T}$ may only contain F0 or the spectrum

[1] H. Kawahara et al. Speech Communication, 27:187–207, 1999.
[2] M. Morise, et al. IEICE Trans. on Information and Systems, 99(7):1877–1884, 2016.
[3] K. Tokuda, et al. Mel-generalized cepstral analysis a unified approach. In Proc. ICSLP, pages 1043–1046, 1994.

Source-filter model figure: HTS Slides ver. 2.3, released by the HTS Working Group, http://hts.sp.nitech.ac.jp/ (Nagoya Institute of Technology, Department of Computer Science).
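The framing step can be sketched as follows. This is a minimal illustration only: the 16 kHz sampling rate and the function name `frame_signal` are assumptions, and no window function is applied. With a 20 ms frame and a 15 ms overlap, consecutive frames start 5 ms apart.

```python
def frame_signal(samples, sample_rate=16000, frame_ms=20, overlap_ms=15):
    """Split a waveform (list of samples) into overlapping frames.

    sample_rate is an assumed value; the frame length and overlap
    follow the figures quoted on the slide (20 ms and 15 ms).
    """
    frame_len = sample_rate * frame_ms // 1000              # samples per frame
    shift = sample_rate * (frame_ms - overlap_ms) // 1000   # 5 ms frame shift
    frames = []
    start = 0
    # Keep only frames that fit completely inside the signal.
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += shift
    return frames
```

Each frame would then be windowed and passed to the spectral analysis and the F0 tracker, giving one acoustic feature vector per frame.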

Task definition

• Equal-length sequence-to-sequence conversion: linguistic features -> statistical F0 model -> F0 contour

$$x_{1:T} = \{x_1, \cdots, x_T\}, \qquad o_{1:T} = \{o_1, \cdots, o_T\}$$

Model training:

$$\Theta^{*} = \arg\max_{\Theta} \prod_{k=1}^{|D|} p\left(o^{(k)}_{1:T_k} \mid x^{(k)}_{1:T_k}; \Theta\right)$$

F0 generation:

$$\hat{o}_{1:T} = \arg\max_{o_{1:T}} p\left(o_{1:T} \mid x_{1:T}; \Theta^{*}\right)$$
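To make the two steps concrete, here is a toy sketch under strong assumptions that are not from the slides: features and targets are scalars, and each frame follows a Gaussian with unit variance and a linear mean, $o_t \sim \mathcal{N}(w x_t + b, 1)$. Under these assumptions, maximum-likelihood training reduces to least squares, and the most likely output sequence is the sequence of means. The function names are hypothetical.

```python
def train_linear_gaussian(pairs):
    """Maximum-likelihood fit of o_t ~ N(w * x_t + b, 1) over all frames.

    pairs: list of (x_sequence, o_sequence), equal length per utterance.
    With a fixed variance, maximizing the likelihood is least squares.
    """
    xs = [x for x_seq, _ in pairs for x in x_seq]
    targets = [o for _, o_seq in pairs for o in o_seq]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_o = sum(targets) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    cov = sum((x - mean_x) * (o - mean_o) for x, o in zip(xs, targets))
    w = cov / var_x
    b = mean_o - w * mean_x
    return w, b


def generate(x_seq, params):
    """argmax over o_{1:T}: for a Gaussian, the mode is the mean."""
    w, b = params
    return [w * x + b for x in x_seq]
```

A neural F0 model replaces the linear mean with a network, but training and generation keep the same two-step structure.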

Corpora and features

OVERVIEW OF PHD RESEARCH

Name                                             Size                          Note
Blizzard Challenge 2011 corpus [12], Nancy voice ~12,000 utterances, 16 hours  English, neutral style, read speech
ATR Ximera corpora [13], F009 voice              ~30,000 utterances, 48 hours  Japanese, neutral style, read speech

Feature                                                          Dimension
Linguistic features   phone sequence, prosodic features, ...     ~390
Acoustic features     Mel-generalized cepstral [14]              60
                      F0 (with unvoiced/voiced)                  1 + 1
                      Band aperiodicity                          25

[12] King, S. and Karaiskos, V. (2011). The Blizzard Challenge 2011. In Proc. Blizzard Challenge Workshop, pages 1–10.
[13] Kawai, H., Toda, T., Ni, J., Tsuzaki, M., and Tokuda, K. (2004). Ximera: A new TTS from ATR based on corpus-based technologies. In Proc. SSW5, pages 179–184.
[14] Tokuda, K., Kobayashi, T., Masuko, T., and Imai, S. (1994). Mel-generalized cepstral analysis a unified approach. In Proc. ICSLP, pages 1043–1046.
[15] Kawahara, H., Masuda-Katsuse, I., and Cheveigne, A. d. (1999). Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction: Possible role of a repetitive structure in sounds. Speech Communication, 27:187–207.

Motivation

• Why F0?
  - More than the surface meaning of the words
  - More than imagined ...

Example [10]:

Speaker A: Who made the marmalade?
Speaker B: Marianna made the marmalade.

Speaker A: Bob made the marmalade.
Speaker B: (No,) Marianna made the marmalade.

[10] Nanette Veilleux, et al. 6.911 Transcribing Prosodic Structure of Spoken Utterances with ToBI. January IAP 2006. https://ocw.mit.edu. License: Creative Commons BY-NC-SA.

Word Embeddings

TTS with word vectors

WORD VECTORS

• Replace prosodic tags with word vectors [4]
  - similar to the first work by another Wang [5]
  - why word vectors [6]: unsupervised learning, linguistic regularity, ...

Figure (conventional pipeline): text ("this is a test") -> text analysis -> interface -> acoustic modeling (acoustic model) -> speech waveform. Text analysis consists of grapheme-to-phoneme conversion ("D I s I z eI t e s t"), syntactic analysis ("(S (NP this) (VP is (NP a test)))"), and prosody prediction ("this H test H* L-L%"). The interface turns these into a numeric input vector (0101...00 | 0 0 1 0.2 .4 .).

Figure (with word vectors): prosody prediction is replaced by word vectors, for example this [0.12, 0.34, ...], is [1.2, -23, ...], test ...; the interface output becomes 0101...00 | 0.12 0.34 ...

[4] Wang, X., Takaki, S., & Yamagishi, J. (2016). Investigation of Using Continuous Representation of Various Linguistic Units in Neural Network based TTS. IEICE, Vol. E99-D, No. 10.
[5] Wang, P., Qian, Y., Soong, F. K., He, L., & Zhao, H. (2015). Word embedding for recurrent neural network based TTS synthesis. In ICASSP (pp. 4879–4883).
[6] Mikolov, T., Yih, W., & Zweig, G. (2013). Linguistic regularities in continuous space word representations. In HLT-NAACL (pp. 746–751).
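The interface output "0101...00 | 0.12 0.34 ..." can be read as a discrete phoneme code concatenated with a continuous word vector. The sketch below illustrates that reading only; the phoneme inventory, the two-dimensional word vectors, and the function names are made-up assumptions, not the configuration used in [4].

```python
# Toy phoneme inventory and toy word vectors (illustrative values only).
PHONEMES = ["D", "I", "s", "z", "eI", "t", "e"]
WORD_VECTORS = {"this": [0.12, 0.34], "is": [1.2, -2.3], "test": [0.5, 0.1]}


def one_hot(symbol, inventory):
    """Binary code with a single 1 at the position of the symbol."""
    return [1.0 if s == symbol else 0.0 for s in inventory]


def model_input(phoneme, word):
    """Concatenate the one-hot phoneme code with the vector of its word."""
    return one_hot(phoneme, PHONEMES) + WORD_VECTORS[word]
```

Every phoneme of a word shares the same word vector, so the acoustic model sees word-level information at each frame without any prosodic tags.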

TTS with word vectors

• Results of previous work [4]
  - MUSHRA test with 20 paid native speakers at CSTR

Tab. 1. Systems
ID     Input to the acoustic model (a recurrent neural network)
R_p    phonemes + predicted prosodic tags
R_N    phonemes
R_w    phonemes + word vector

Enhance the word vector with prosodic information

• Sum-up
  - secondary corpus: small, with expert prosodic annotation
  - primary corpus: huge, without expert prosodic annotation

Figure (three stages; M_w is the set of word vectors):
1. Post-filter training: the prosodic annotation (ToBI tags) and prosodic features of the secondary speech corpus are used with M_w to train the vector post-filter.
2. Vector enhancing: the vector post-filter maps each raw word vector in M_w to an enhanced word vector.
3. Acoustic model training: on the primary speech corpus, text -> grapheme-to-phoneme conversion and enhanced word vectors -> acoustic model -> acoustic features.


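Enhance the word vector with prosodic information

• How? Train a post-filter with the triplet ranking loss criterion [12]

$$E = \max\left[0,\; 1 - \mathrm{Sim}\left(p_w, F(m_w)\right) + \mathrm{Sim}\left(p_w, F(m_{w^-})\right)\right]$$

$$\mathrm{Sim}(x, y) = \frac{x \cdot y}{\|x\| \cdot \|y\|}$$

Figure: for a word $w$ in the secondary speech corpus, prosodic tags and speech (and text) pass through feature extraction and an NN-based classifier to give $p_w$. The vector post-filter $F(\cdot)$ is applied to the word vector $m_w$ of $w$ and to the word vector $m_{w^-}$ of another word $w^-$.

[12] Bengio, S., & Heigold, G. (2014). Word embeddings for speech recognition. In INTERSPEECH-2014 (pp. 1053–1057).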
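The loss above can be written out directly. The sketch below assumes the post-filter outputs $F(m_w)$ and $F(m_{w^-})$ have already been computed and are passed in as plain lists; the function and argument names are hypothetical.

```python
import math


def cosine_similarity(x, y):
    """Sim(x, y) = (x . y) / (||x|| * ||y||)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


def triplet_ranking_loss(p_w, f_m_w, f_m_neg, margin=1.0):
    """E = max(0, margin - Sim(p_w, F(m_w)) + Sim(p_w, F(m_neg))).

    p_w:     prosodic vector of word w
    f_m_w:   post-filtered word vector of the same word w
    f_m_neg: post-filtered word vector of a different word
    """
    positive = cosine_similarity(p_w, f_m_w)
    negative = cosine_similarity(p_w, f_m_neg)
    return max(0.0, margin - positive + negative)
```

The loss is zero once the matching word vector is closer to the prosodic vector than the mismatched one by at least the margin, so only violating triplets contribute gradients.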

Experiments

• Systems
  - all systems use another acoustic model to predict spectral features

System ID   Input to the acoustic model (F0 trajectory model)
R_N         phonemes
R_p         phonemes + conventional prosodic context (automatically predicted)
R_wr        phonemes + raw word vector
R_we        phonemes + enhanced word vector
R_wrbp      phonemes + raw word vector tuned by back-propagation in TTS
R_webp      phonemes + enhanced word vector tuned by back-propagation in TTS

Results

• Objective test

Figure: F0 correlation in Mel scale (axis 0.77 to 0.79) and F0 RMSE in Mel scale (axis 38.5 to 40) for R_N (only phonemes), R_p (+ prosodic context), R_wr (+ raw word vector), R_we (+ enhanced word vector), R_wrbp (+ raw word vector, fine-tuned), and R_webp (+ enhanced word vector, fine-tuned).

Results

• Sample

Figure: F0 (Hz, 0 to 500) against frame index (50 to 250) for NATURAL, R_N, R_p, R_wr, and R_we on the phrase "if the move would require".

Results

• Subjective test
  - conducted at CSTR, by 20 paid native speakers
  - some evaluators favor R_we very much while others favor R_wr very much; is this because the context of the sentence is missing?

Figure: preference scores (0% to 100%) for three system pairs: 49.87% vs 50.13%, 46.79% vs 53.21%, and 47.00% vs 53.00%.

Highway

Motivation

• Very deep network for SPSS?
  - Image classification: > 100 hidden layers [14]
  - Speech recognition: > 10 hidden layers [15]

• Just more hidden layers?
  - Image classification and speech recognition are classification tasks
  - SPSS?
    1. a regression task
    2. heterogeneous targets: F0, MGC, ...

ON NETWORK'S DEPTH

[14] He, K., Zhang, X., Ren, S., & Sun, J. (2015). Deep Residual Learning for Image Recognition. CoRR, abs/1512.03385. Retrieved from http://arxiv.org/abs/1512.03385
[15] Lu, L., & Renals, S. (2016). Small-footprint Deep Neural Networks with Highway Connections for Speech Recognition. In Proc. INTERSPEECH (pp. 12–16).

Highway networks [16] for SPSS

• Why highway network?
  - easier training of very deep networks
  - easy investigation of the network's behavior

Figure: a feedforward network (linguistic features -> feedforward layers -> MGC, F0, BAP) next to a highway network (linguistic features -> feedforward layer -> highway blocks -> MGC, F0, BAP). Inside a highway block, the gate output $T(x)$ scales the feedforward output $H(x)$ and $1 - T(x)$ scales the input $x$ before the two are added.

$$y = T(x) \odot H(x) + [1 - T(x)] \odot x$$

$$T(x) = \mathrm{sigmoid}(Wx + b)$$

[16] Srivastava, R. K., Greff, K., & Schmidhuber, J. (2015). Highway Networks. CoRR, abs/1505.00387. Retrieved from http://arxiv.org/abs/1505.00387
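The two highway equations translate into a few lines of code. The sketch below is illustrative only: it uses plain Python lists, a single tanh layer for $H(x)$ (an assumption made for brevity), and hypothetical function names.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def affine(weights, bias, x):
    """Return W x + b, with W given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def highway_block(x, w_h, b_h, w_t, b_t):
    """y = T(x) * H(x) + (1 - T(x)) * x, element-wise.

    H(x) = tanh(W_h x + b_h) is the transform (one layer here, by assumption).
    T(x) = sigmoid(W_t x + b_t) is the gate.
    Input and output must have the same dimension.
    """
    h = [math.tanh(v) for v in affine(w_h, b_h, x)]
    t = [sigmoid(v) for v in affine(w_t, b_t, x)]
    return [ti * hi + (1.0 - ti) * xi for ti, hi, xi in zip(t, h, x)]
```

With a strongly negative gate bias the block copies its input, and with a strongly positive one it behaves like an ordinary feedforward layer, which is why the gate values reveal how much of the network is actually used.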

Highway networks for SPSS

• Why multi-stream?
  - reduce the interaction between MGC and F0 modeling

Figure: three architectures, all taking linguistic features as input.
  - single-stream feedforward network: stacked feedforward layers predict MGC, F0, and BAP together
  - single-stream highway network: a feedforward layer followed by stacked highway blocks predicts MGC, F0, and BAP together
  - multi-stream highway network: a feedforward layer followed by three sub-networks (a feed-forward layer plus stacked highway blocks each), one for F0, one for MGC, and one for BAP

Experiments

• Networks

Notation  System                                    Configuration
DS        deep feedforward single-stream network    layer size: 382
HS        highway single-stream network             layer size: 382; 2 tanh layers for each highway block
HM        highway multi-stream network              layer size: 256 (each sub-network); 2 tanh layers for each highway block

• Corpus: BC2011 Nancy voice

Results

• Single-stream: sufficient depth is necessary
  - Depth: total number of hidden tanh-based feedforward layers

Figure: F0 correlation (0.66 to 0.74), F0 RMSE (Hz, 42.5 to 47.5), and MGC RMSE (1.02 to 1.14) against network depth (2, 4, 8, 14, 20, 40) for DS, HS, and HM.

Results

• Multi-stream: more (feedforward) layers for F0?
  - Similar results given fixed depth but varied layer sizes

Figure: F0 correlation (0.66 to 0.74), F0 RMSE (Hz, 42.5 to 47.5), and MGC RMSE (1.02 to 1.14) for DS, HS, and HM.

Results

• Multi-stream: more feedforward layers for F0?

Figure: F0 correlation (0.66 to 0.74), F0 RMSE (Hz, 42.5 to 47.5), and MGC RMSE (1.02 to 1.14) against the number of model parameters (3.2e+05 to 3.3e+06) for DS, HS, and HM. Each point is labeled with the network depth (2, 4, 8, 14, 20, 40).

Results

• Similar results given fixed depth, varied layer size

Figure: MGC RMSE (1.005 to 1.045), F0 RMSE (Hz, 42.5 to 46.5), and F0 correlation (0.67 to 0.74) against the number of model parameters (1.5e+06 to 3.6e+07). DS points are labeled with layer sizes 382, 782, 882, and 1024; HS points with 382, 482, 582, 782, and 1024; HM points with HM1 to HM4.

Analysis of network behavior

• Investigation tool
  - The histogram of $T(x)$ indicates the network's behavior
    - As $T(x) \approx 0$, $y \approx x$
    - As $T(x) \approx 1$, $y \approx H(x)$

$$y = T(x) \odot H(x) + [1 - T(x)] \odot x$$

Figure: the multi-stream highway network (linguistic features -> feedforward layer -> F0, MGC, and BAP sub-networks, each a feed-forward layer plus stacked highway blocks), with a histogram of $T(x)$ over [0, 1] taken for each highway block.
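A histogram of gate activations is simple to compute once the $T(x)$ values of a block have been collected over a data set. The sketch below is a hedged illustration: the bin count, the 0.1 threshold, and the function names are arbitrary choices, not values from the slides.

```python
def gate_histogram(gate_values, num_bins=10):
    """Count T(x) values, all assumed to lie in [0, 1], in equal-width bins."""
    counts = [0] * num_bins
    for t in gate_values:
        # Clamp the index so that t == 1.0 falls into the last bin.
        index = min(int(t * num_bins), num_bins - 1)
        counts[index] += 1
    return counts


def closed_fraction(gate_values, threshold=0.1):
    """Fraction of gate values below the threshold, i.e. where y is close to x."""
    return sum(1 for t in gate_values if t < threshold) / len(gate_values)
```

A block whose histogram mass sits near 0 mostly passes its input through unchanged, which is the sense in which a sub-network can be called inactive.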

Analysis of network behavior

• Histogram of $T(x)$, HM14

Figure: histograms of $T(x)$ over [0, 1] for highway blocks b.1 to b.7 of the multi-stream highway network (vertical scale up to 6e+05).

Analysis of network behavior

• Histogram of $T(x)$, HM14
  - for MGC
  - for F0
  - The F0 sub-network is partially inactive

Figure: two rows of $T(x)$ histograms over [0, 1] for highway blocks b.1 to b.7 (vertical scales up to 6e+05 and 7e+05).

Analysis of network behavior

• Histogram of $T(x)$, HS14
  - The single-stream network is fully active
  - MGC dominates the network

Figure: the single-stream highway network (linguistic features -> feedforward layer -> highway blocks -> MGC, F0, BAP) with $T(x)$ histograms over [0, 1] for highway blocks b.1 to b.7 (vertical scale up to 8e+05).

Results (HM only)

Figure: objective measures for systems with different depth (HM2, HM4, HM8, HM14, HM20, HM40, HM60, HM80): MGC RMSE (0.98 to 1.14) against the number of model parameters (log scale, 1.4e+06 to 2.4e+07), F0 RMSE (Hz, 42.6 to 43.4), and F0 correlation (0.724 to 0.734).

Results (HM14)

Figure: histograms of $T(x)$ over [0, 1] for each highway block of the MGC sub-network and of the F0 sub-network (block 1: near the input end of the network; block 20: near the output end of the network).

Figures: further per-block $T(x)$ histograms for the MGC and F0 sub-networks, including blocks 11 to 15.

Overfitting of HM80?

Analysis

• Contribution of the linguistic features

• Most useful features

MGC sub-network                                    LF0 sub-network
Phoneme identity                                   Position of phoneme in syllable
Position of phoneme in syllable                    Accent type of next syllable
Position of phoneme in syllable (backward)         Accent type of previous syllable
Number of previous stressed syllables in phrase    Position of syllable in the word
Number of stressed syllables remaining in phrase   Phoneme identity

• Least useful features

MGC sub-network                                    LF0 sub-network
Position of phrase in utterance                    Number of words in previous phrase
Number of words in phrase                          Number of words in next phrase
Number of phrases in utterance                     ToBI boundary tone
Number of syllables in previous phrase             Number of syllables in next phrase
ToBI boundary tone                                 Number of syllables in previous phrase

Experiments

• Other results
  - Sensitivity analysis in the multi-stream highway network
  - MGC and F0 use different linguistic features
  - Similar results on Japanese data

ISSUE 1: JOINT LEARNING FOR F0?

MGC
  - Current phoneme identity
  - Position of current phoneme in syllable (forward)
  - Position of current phoneme in syllable (backward)
  - Number of preceding lexically stressed syllables in phrase
  - Number of following lexically stressed syllables in phrase

F0
  - Position of phoneme in syllable (forward)
  - Is the next syllable bearing an English pitch accent
  - Is the previous syllable bearing an English pitch accent
  - Position of current syllable in the word
  - Current phoneme identity

Samples

Table of audio samples: one row per number of layers (2, 4, 8, 14, 20, 40, 60, 80) and one column per system (DS, HS, HM).

Experiments

• Networks

Notation  System                                            Configuration
HMn       multi-stream highway network                      layer size 256 for each sub-network; 2 tanh layers in one highway block; sigmoid for the highway gate
RNN [1]   recurrent neural network (single-stream)          2 feedforward layers, 512 each layer; 2 bi-directional LSTM layers, 256 each layer; 1 linear projection output layer
DNN       deep feedforward neural network (single-stream)   1 feedforward layer, 1024 units; 3 feedforward layers, 512 each layer; 1 linear projection output layer

• Corpus: ATR F009

[1] Wang, X., Takaki, S., & Yamagishi, J. (2016). A Comparative Study of the Performance of HMM, DNN, and RNN based Speech Synthesis Systems Trained on Very Large Speaker-Dependent Corpora. In Proc. SSW9 (pp. 125–128).

Figure: MGC RMSE (0.98 to 1.10), F0 RMSE (Hz, 31 to 36), and F0 correlation (0 to 1 scale, 0.830 to 0.870) against the number of network weights (5.0e5 to 8.0e6) for the single-stream feedforward, single-stream highway, and multi-stream highway networks. Points carry numeric labels from 2 to 32.

Results (Japanese)

• HM with 2, 7, and 10 highway blocks

Figure: histograms of $T(x)$ over [0, 1] for the MGC stream and the F0 stream: blocks 1 and 2 for the 2-block network, b.1 to b.7 for the 7-block network, and b.1 to b.10 for the 10-block network.

Results (Japanese)

• Sensitivity analysis, 47 classes of contextual features

• Least useful features, MGC stream

Feature | minEs | Description | Dim | Rank
'RR-Phone' | 0.993 | identity of the phoneme after the next phoneme | 44 | 40
'C-Br_Len_Mora' | 0.993 | number of moras in the current breath group | 100 | 41
'C-Br_Bw-Pos-in_Utt_Mora' | 0.994 | position of the current breath group in the utterance, by mora (backward) | 201 | 42
'C-Br_Fw-Pos-in_Utt_Mora' | 0.995 | position of the current breath group in the utterance, by mora (forward) | 201 | 43
'C-Acc_Fw-Pos-in_Br_Mora' | 0.996 | position of the current accent phrase in the current breath group, by mora (forward) | 121 | 44
'Utt_Len_Acc' | 0.997 | number of accent phrases in this utterance | 60 | 45
'Utt_Len_Br' | 0.999 | number of breath groups in this utterance | 30 | 46
'Utt_Len_Mora' | 0.999 | number of moras in this utterance | 200 | 47

• Least useful features, F0 stream

Feature | minEs | Description | Dim | Rank
'L-Acc_Len_Mora' | 0.994 | number of moras in the previous accent phrase | 61 | 40
'C-Br_Len_Mora' | 0.995 | number of moras in the current breath group | 100 | 41
'C-Acc_Fw-Pos-in_Br_Mora' | 0.995 | position of the current accent phrase in the current breath group, by mora (forward) | 121 | 42
'C-Br_Bw-Pos-in_Utt_Mora' | 0.996 | position of the current breath group in the utterance, by mora (backward) | 201 | 43
'C-Br_Fw-Pos-in_Utt_Mora' | 0.997 | position of the current breath group in the utterance, by mora (forward) | 201 | 44
'Utt_Len_Acc' | 0.997 | number of accent phrases in this utterance | 60 | 45
'Utt_Len_Br' | 0.999 | number of breath groups in this utterance | 30 | 46
'Utt_Len_Mora' | 1.000 | number of moras in this utterance | 200 | 47

Summary

• Findings:
  - MGC benefits from deeper networks
  - The F0 sub-network can be shallow
  - Single-stream networks focus more on MGC than on F0
  - Investigation of the linguistic features' usefulness
    - (automatically inferred) F0-related tags are noisy for English
  - Experiments on the Japanese corpus
    - similar: the multi-stream network improves F0 modeling
    - different: F0-related tags are useful

Any reason?

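HIGHWAY ARCHITECTURE

Figure: three output configurations on top of a bottom network with hidden output $h_t$: one linear layer that predicts MGC, F0, and BAP together; separate linear layers for MGC, F0, and BAP; and separate linear layers followed by one more linear layer.

One linear output layer (full $W_o$):

$$\hat{o}_t = \begin{bmatrix} \hat{o}^{(MGC)}_t \\ \hat{o}^{(F0)}_t \\ \hat{o}^{(BAP)}_t \end{bmatrix} = \begin{bmatrix} W_{o,11} & W_{o,12} & W_{o,13} \\ W_{o,21} & W_{o,22} & W_{o,23} \\ W_{o,31} & W_{o,32} & W_{o,33} \end{bmatrix} \begin{bmatrix} h_{t,1} \\ h_{t,2} \\ h_{t,3} \end{bmatrix} = W_o h_t$$

Separate linear output layers (block-diagonal $W_o$):

$$\hat{o}_t = \begin{bmatrix} \hat{o}^{(MGC)}_t \\ \hat{o}^{(F0)}_t \\ \hat{o}^{(BAP)}_t \end{bmatrix} = \begin{bmatrix} W_{o,11} & 0 & 0 \\ 0 & W_{o,22} & 0 \\ 0 & 0 & W_{o,33} \end{bmatrix} \begin{bmatrix} h_{t,1} \\ h_{t,2} \\ h_{t,3} \end{bmatrix} = W_o h_t$$

Additional linear layer $W_s$ on top of the outputs:

$$\hat{o}_t = \begin{bmatrix} \hat{\hat{o}}^{(MGC)}_t \\ \hat{o}^{(F0)}_t \\ \hat{o}^{(BAP)}_t \end{bmatrix} = \begin{bmatrix} W_{s,11} & W_{s,12} & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} \hat{o}^{(MGC)}_t \\ \hat{o}^{(F0)}_t \\ \hat{o}^{(BAP)}_t \end{bmatrix} = W_s W_o h_t$$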


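Figure: bottom network -> separate linear layers for MGC, F0, and BAP -> one more linear layer.

$$\hat{o}_t = \begin{bmatrix} \hat{\hat{o}}^{(MGC)}_t \\ \hat{o}^{(F0)}_t \\ \hat{o}^{(BAP)}_t \end{bmatrix} = \begin{bmatrix} W_{s,11} & W_{s,12} & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} \hat{o}^{(MGC)}_t \\ \hat{o}^{(F0)}_t \\ \hat{o}^{(BAP)}_t \end{bmatrix} = W_s W_o h_t$$

$$\hat{\hat{o}}^{(MGC)}_t = W_{s,11}\, \hat{o}^{(MGC)}_t + W_{s,12}\, \hat{o}^{(F0)}_t$$

• Same argument as SAR vs RMDN
  - Dependency between means, not random variables

$$p\left(o^{(MGC)}, o^{(F0)}, o^{(BAP)}\right) = p\left(o^{(MGC)}\right) p\left(o^{(F0)}\right) p\left(o^{(BAP)}\right)$$

The mean of $p(o^{(MGC)})$ is affected by the mean of $p(o^{(F0)})$.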
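The effect of the extra linear layer on the MGC output can be sketched in a few lines. This is an illustration with plain lists and hypothetical names; only the first block row of $W_s$ is computed, since the F0 and BAP rows are identities and pass those outputs through unchanged.

```python
def adjust_mgc_mean(mgc_mean, f0_mean, w_s11, w_s12):
    """Second-stage MGC output: W_s11 * mgc_mean + W_s12 * f0_mean.

    mgc_mean: list of length D, f0_mean: list of length K.
    w_s11 is a D x D matrix and w_s12 a D x K matrix (lists of rows).
    """
    return [
        sum(a * m for a, m in zip(row11, mgc_mean))
        + sum(b * f for b, f in zip(row12, f0_mean))
        for row11, row12 in zip(w_s11, w_s12)
    ]
```

The F0 prediction shifts the MGC prediction deterministically, so the coupling is between predicted means only; the output distributions themselves still factorize.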


• Training the dependency model?

Figure: bottom network -> separate linear layers for MGC, F0, and BAP -> one more linear layer, trained with observed (natural) data.

$$p(X, Y; \Theta) = p(X \mid Y; \Theta_1)\, p(Y; \Theta_2)$$

$$\Theta^{*} = \arg\max_{\Theta} \prod_{\{x_n, y_n\} \in D} p(X = x_n, Y = y_n) = \arg\max_{\Theta} \prod_{\{x_n, y_n\} \in D} p(X = x_n \mid Y = y_n)\, p(Y = y_n)$$


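• Generate from the dependency model?

Figure: bottom network -> separate linear layers for MGC, F0, and BAP -> one more linear layer.

Ideal solution:

$$p(X, Y; \Theta^{*}) = p(X \mid Y; \Theta^{*}_1)\, p(Y; \Theta^{*}_2)$$

$$\{x^{*}, y^{*}\} = \arg\max_{\{\hat{x}, \hat{y}\}} p(X = \hat{x}, Y = \hat{y}) = \arg\max_{\{\hat{x}, \hat{y}\}} p(X = \hat{x} \mid Y = \hat{y})\, p(Y = \hat{y})$$

Approximation:

$$y^{*} = \arg\max_{\hat{y}} p(Y = \hat{y})$$

$$x^{*} = \arg\max_{\hat{x}} p(X = \hat{x} \mid Y = y^{*})$$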
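The difference between the ideal joint search and the two-step approximation is easiest to see on a small discrete example. The distributions below are made-up numbers chosen only so that the two answers differ; the function names are hypothetical.

```python
def ideal_solution(p_y, p_x_given_y):
    """Joint search: argmax over (x, y) of p(x | y) * p(y)."""
    return max(
        ((x, y) for y in p_y for x in p_x_given_y[y]),
        key=lambda xy: p_x_given_y[xy[1]][xy[0]] * p_y[xy[1]],
    )


def approximation(p_y, p_x_given_y):
    """Two-step search: y* = argmax p(y), then x* = argmax p(x | y*)."""
    y_star = max(p_y, key=p_y.get)
    x_star = max(p_x_given_y[y_star], key=p_x_given_y[y_star].get)
    return x_star, y_star


# Toy distributions (illustrative values only).
P_Y = {"y1": 0.6, "y2": 0.4}
P_X_GIVEN_Y = {
    "y1": {"x1": 0.45, "x2": 0.55},  # joint probabilities 0.27 and 0.33
    "y2": {"x1": 0.90, "x2": 0.10},  # joint probabilities 0.36 and 0.04
}
```

Here the joint search returns ("x1", "y2") while the two-step search returns ("x2", "y1"), so the approximation trades optimality for a search that handles one variable at a time.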

AM & Vocoder

COMPARE NEW MODELS/VOCODERS

Overview

• Still the SPSS framework
• Test on both "vocoder" and acoustic models

Figure: linguistic features -> acoustic models (SAR, RNN, DAR, GAN) -> F0 and MGC -> waveform generators (WaveNet, PML, phase recovery, WORLD, minimum phase). Systems: SAR-Wa, SAR-Pm, SAR-Pr, SAR-Wo, SGA-Wo, RGA-Wo, RNN-Wo.

• Abs*: copy-synthesis
• Without MLPG
• Without formant enhancement

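WAVENET

• 40 blocks, dilation [2, 4, 8, 16, 32, 64, 128, 256, 512, 2, 4, ...]

Figure (one block): the time-shifted waveform passes through a 1-D CNN. In each block, the output of a dilated 1-D CNN is added to the output of the sub-network (linguistic features / MGC through a feedforward layer), goes through tanh and sigmoid branches that are multiplied together, and then through 1-D CNNs with additions. The summed block outputs pass through 1-D CNNs and a softmax to give the waveform.

Figure (whole network): waveform (feedback) -> linear -> Block 1, Block 2, ..., Block 40 -> 1-D CNN -> softmax -> waveform. The conditioning input (linguistic features / MGC, and F0) is at frame level, with a time resolution of 1/(5 ms) = 200 Hz, and is up-sampled to the waveform time resolution of 16 kHz. The conditioning input uses either no sub-network or a Bi-LSTM sub-network (feedforward + Bi-LSTM).

Samples: natural; trained on natural MGC/F0; generated on synthetic MGC & F0.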
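Two details of the configuration can be checked numerically: the dilation assigned to each of the 40 blocks, and the up-sampling factor from 5 ms frames to 16 kHz samples. The sketch below assumes that the listed dilation cycle simply repeats and that up-sampling is done by repeating each frame; the slides do not state the up-sampling method, and the function names are hypothetical.

```python
def dilation_schedule(num_blocks=40,
                      cycle=(2, 4, 8, 16, 32, 64, 128, 256, 512)):
    """Repeat the dilation cycle until num_blocks values are produced."""
    return [cycle[i % len(cycle)] for i in range(num_blocks)]


def upsample_frames(frames, frame_shift_ms=5, sample_rate=16000):
    """Repeat each frame-level vector so there is one per waveform sample.

    Repetition is an assumed up-sampling method, chosen for simplicity.
    """
    repeat = sample_rate * frame_shift_ms // 1000  # 80 samples per frame
    return [frame for frame in frames for _ in range(repeat)]
```

With a 5 ms frame shift and 16 kHz audio, each frame of MGC and F0 conditions 80 consecutive waveform samples.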

Generation method

• Random sampling

$$\hat{o}_t \sim P\left(o_t \mid \hat{o}_{t-R:t-1}, \hat{a}_t\right)$$

• One-best (in voiced regions)

$$\hat{o}_t = \arg\max_{o_t} p\left(o_t \mid \hat{o}_{t-R:t-1}, \hat{a}_t\right)$$

Samples (random sampling, one-best, natural):
  - conditioned on generated MGC & F0
  - conditioned on linguistic features & F0

Samples: WaveNet, natural.
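For a single time step, the two generation methods differ only in how a waveform level is picked from the predicted softmax distribution. The sketch below works on one probability vector and uses hypothetical function names; the autoregressive loop and the network itself are omitted.

```python
import random


def random_sample(probs, rng=random):
    """Draw a waveform level index according to the predicted probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def one_best(probs):
    """Pick the index of the most probable waveform level."""
    return max(range(len(probs)), key=lambda i: probs[i])
```

Random sampling keeps the noise-like variation of the waveform, while one-best always returns the same level for the same distribution, which fits the slide's restriction of one-best to voiced regions.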

Analysis

• Is the WaveNet vocoder strictly speaker/language dependent?
  - Training: WaveNet vocoder on F009 (Japanese data)
  - Generation: the same WaveNet vocoder conditioned on natural MGC & F0 from Nancy (English data)
  - But it may not work for male speakers

Analysis

• Embedding of waveform (conditioned on MGC/F0)
• Embedding of waveform (conditioned on linguistic features/F0)

Figure (one per condition): waveform (feedback) -> linear layer. The weights of the linear layer for each waveform level ID (10 bits) are shown as a 2-D embedding.

Analysis

• Variance of the output from each block (conditioned on MGC/F0)

Figure: range of the output value from each block. Feature value (-4 to 4) against block ID (0 to 40), showing the largest value (98th percentile), the smallest value (2nd percentile), the median, and the standard deviation.
