Appendix Fundamental Frequency Modeling for Neural- Network-based Statistical Parametric Speech Synthesis ID: 20151706 2017-12-07 1 contact: [email protected] we welcome critical comments, suggestions, and discussion 8/2/18 シン ワン Xin WANG


Page 1: Appendix Fundamental Frequency Modeling for Neural ...tonywangx.github.io/pdfs/appendix_highway.pdf · Appendix Fundamental Frequency Modeling for Neural-Network-based Statistical

Appendix

Fundamental Frequency Modeling for Neural-Network-based Statistical Parametric Speech Synthesis

ID: 20151706
2017-12-07

Xin WANG (シン ワン)

Contact: [email protected]
We welcome critical comments, suggestions, and discussion.

Page 2: Linguistic features [INTRODUCTION]

The text analyzer converts the input text into frame-level linguistic features $x_{1:T}$.

Example input: 次は新金岡、新金岡です。 ("Next is Shin-Kanaoka, Shin-Kanaoka.")

• Parser [1] with a dictionary: segments the text into words and assigns a part of speech and a reading to each word
§ 次 (noun, ツギ) / は (particle) / 新金岡 (proper noun, シンカナオカ) / 、 / 新金岡 (proper noun, シンカナオカ) / です (auxiliary verb, デス) / 。
• "Prosody" analyzer [2, 3]: adds accent marks (*) and phrase boundaries (|) to the reading ツギ ワ シンカナオカ シンカナオカ デス
• Duration model [4] and interface: expand the features to $x_{1:T}$ in T frames, for example frames 1 to 2 for ツ (noun, ...) and frames 3 to 6 for ギ (noun, ...)

[1] T. Kudo. MeCab: Yet Another Part-of-Speech and Morphological Analyzer.
[2] 匂坂, 佐藤 (Sagisaka and Sato). 電子情報通信学会論文誌 (IEICE Transactions, Japanese edition), Vol. J66-D, No. 7, pp. 849–856, 1983.
[3] 鈴木雅之 (M. Suzuki), et al. "CRF を用いた日本語東京方言のアクセント結合自動推定" (Automatic estimation of accent sandhi in Tokyo-dialect Japanese using CRFs). (2012): 2-2.
[4] T. Yoshimura, et al. Duration modeling for HMM-based speech synthesis. In ICSLP, volume 98, pages 29–32, 1998.

Page 3: Linguistic features [INTRODUCTION]

• Japanese data (generated by Open JTalk [5]), giving $x_{1:T}$:

Phoneme level
§ Previous-previous / previous / current / next / next-next phoneme

Mora level
§ Distance from the current mora to the accent nucleus
§ Position of the current mora in the accent phrase

Word level
§ Part-of-speech of the previous / current / next word
§ Inflected forms of the previous / current / next word
§ Conjugation type of the previous / current / next word

Accent-phrase level
§ Number of morae of the previous / current / next accent phrase
§ Accent type of the previous / current / next accent phrase
§ Whether the previous / current / next accent phrase is interrogative
§ Position of the current accent phrase in the breath group
§ Whether there is a pause after the previous or before the next accent phrase

[5] The HTS Working Group. The Japanese TTS System 'Open JTalk', 2016. http://open-jtalk.sourceforge.net

Page 4: Linguistic features [INTRODUCTION]

• Japanese data (generated by Open JTalk [5]), continued:

Breath-group level
§ Number of morae of the previous / current / next breath group
§ Number of accent phrases of the previous / current / next breath group
§ Position of the current breath group in the utterance

• English data (generated by Flite+HTS_engine [6]):
§ Similar features over phoneme / syllable / phrase
§ Pitch accent -> accented or not
§ Part of the ToBI boundary tones (LL, LH)

[6] HTS Working Group. The English TTS system Flite+HTS engine, 2014. http://hts-engine.sourceforge.net

Page 5: Acoustic features [INTRODUCTION]

The speech vocoder [1, 2] converts a waveform into acoustic features $o_{1:T}$ in T frames.

• Framing and windowing
§ Each slice is called a speech frame
§ Length: 20 ms; overlap: 15 ms
• FFT + cepstral analysis [3]
§ The spectrum amplitude may be used
§ The phase may be ignored
• F0 tracking
§ Each frame is either unvoiced or carries an F0 value (for example, 200 Hz)
• Depending on the task, $o_{1:T}$ may contain only F0 or only the spectrum

[1] H. Kawahara et al. Speech Communication, 27:187–207, 1999.
[2] M. Morise, et al. IEICE Trans. on Information and Systems, 99(7):1877–1884, 2016.
[3] K. Tokuda, et al. Mel-generalized cepstral analysis: a unified approach. In Proc. ICSLP, pages 1043–1046, 1994.
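A frame length of 20 ms with a 15 ms overlap implies a frame shift of 5 ms. The following is a minimal sketch of the framing and windowing step only, not the vocoder used in this work. The 16 kHz sampling rate, the Hanning window, and the function name `frame_signal` are assumptions made for illustration.

```python
import numpy as np

def frame_signal(signal, sample_rate, frame_ms=20.0, overlap_ms=15.0):
    """Slice a 1-D waveform into overlapping, windowed frames (one row per frame)."""
    signal = np.asarray(signal, dtype=float)
    frame_len = int(round(sample_rate * frame_ms / 1000.0))
    # Frame shift = frame length - overlap (5 ms for the values on the slide).
    shift = int(round(sample_rate * (frame_ms - overlap_ms) / 1000.0))
    if len(signal) < frame_len:
        return np.empty((0, frame_len))
    num_frames = 1 + (len(signal) - frame_len) // shift
    # Row t holds the sample indices of frame t.
    idx = np.arange(frame_len)[None, :] + shift * np.arange(num_frames)[:, None]
    # Hanning window: an assumed choice, the slides do not name the window.
    return signal[idx] * np.hanning(frame_len)[None, :]

sample_rate = 16000  # assumed sampling rate
wave = np.random.default_rng(0).standard_normal(sample_rate)  # 1 s of noise
frames = frame_signal(wave, sample_rate)
print(frames.shape)  # (number of frames T, samples per frame)
```

Each row of `frames` would then go to spectral analysis and F0 tracking, giving one acoustic feature vector per frame.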

Page 6: Source-filter model [INTRODUCTION]

[Figure: source-filter model, from HTS Slides ver. 2.3, released by the HTS Working Group, http://hts.sp.nitech.ac.jp/, Nagoya Institute of Technology, Department of Computer Science]

Page 7: Task definition [INTRODUCTION]

• Equal-length sequence-to-sequence conversion
§ Input, linguistic features: $x_{1:T} = \{x_1, \cdots, x_T\}$
§ Output, F0 contour: $o_{1:T} = \{o_1, \cdots, o_T\}$

Statistical F0 model

§ Model training:
$$\Theta^* = \arg\max_{\Theta} \prod_{k=1}^{|D|} p\left(o^{(k)}_{1:T_k} \mid x^{(k)}_{1:T_k}; \Theta\right)$$

§ F0 generation:
$$\hat{o}_{1:T} = \arg\max_{o_{1:T}} p\left(o_{1:T} \mid x_{1:T}; \Theta^*\right)$$
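To make the two argmax steps concrete, here is a toy sketch under strong assumptions that are not part of the slides: the model is a per-frame Gaussian with fixed variance whose mean is a linear function of the input, and the data are synthetic. Under these assumptions, maximizing the likelihood reduces to least squares, and generation returns the predicted mean.

```python
import numpy as np

rng = np.random.default_rng(0)
T, dim_x = 200, 5                       # toy sizes, chosen for illustration
X = rng.standard_normal((T, dim_x))     # stand-in for linguistic features x_{1:T}
true_w = rng.standard_normal(dim_x)
o = X @ true_w + 0.1 * rng.standard_normal(T)  # stand-in for the F0 contour o_{1:T}

# Training: with p(o_t | x_t; w) = N(o_t; w^T x_t, sigma^2) and sigma fixed,
# the maximum-likelihood w is the least-squares solution.
w_hat, *_ = np.linalg.lstsq(X, o, rcond=None)

# Generation: the argmax of a Gaussian density is its mean.
o_hat = X @ w_hat

print(np.max(np.abs(w_hat - true_w)))   # small estimation error
```

A neural network replaces the linear mean in practice, and richer output distributions make the generation step less trivial than taking the mean.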

Page 8: Corpora and features [OVERVIEW OF PHD RESEARCH]

Name                                              Size                            Note
Blizzard Challenge 2011 corpus [12], Nancy voice  ~12,000 utterances, 16 hours    English, neutral style, reading speech
ATR Ximera corpora [13], F009 voice               ~30,000 utterances, 48 hours    Japanese, neutral style, reading speech

Feature                                                        Dimension
Linguistic features (phone sequence, prosodic features, ...)   ~390
Acoustic features:
  Mel-generalized cepstral [14]                                60
  F0 (with unvoiced/voiced)                                    1 + 1
  Band aperiodicity                                            25

[12] King, S. and Karaiskos, V. (2011). The Blizzard Challenge 2011. In Proc. Blizzard Challenge Workshop, pages 1–10.
[13] Kawai, H., Toda, T., Ni, J., Tsuzaki, M., and Tokuda, K. (2004). Ximera: A new TTS from ATR based on corpus-based technologies. In Proc. SSW5, pages 179–184.
[14] Tokuda, K., Kobayashi, T., Masuko, T., and Imai, S. (1994). Mel-generalized cepstral analysis: a unified approach. In Proc. ICSLP, pages 1043–1046.
[15] Kawahara, H., Masuda-Katsuse, I., and de Cheveigné, A. (1999). Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction: Possible role of a repetitive structure in sounds. Speech Communication, 27:187–207.

Page 9: Motivation [INTRODUCTION]

Why F0?

• More than surface word meaning
• More than imagined ...

Examples [10]:

Speaker A: Who made the marmalade?
Speaker B: Marianna made the marmalade.

Speaker A: Bob made the marmalade.
Speaker B: (No,) Marianna made the marmalade.

Speaker B: Marianna made the marmalade.
Speaker B: Marianna made the marmalade.
Speaker B: Marianna made the marmalade.

[10] Nanette Veilleux, et al. 6.911 Transcribing Prosodic Structure of Spoken Utterances with ToBI. January IAP 2006. https://ocw.mit.edu. License: Creative Commons BY-NC-SA.

Page 10: Word Embeddings

Page 11: TTS with word vectors [WORD VECTORS]

• Replace prosodic tags with word vectors

Conventional pipeline: text → text analysis (grapheme-to-phoneme, syntactic analysis, prosody prediction) → interface → acoustic modeling (acoustic model) → speech waveform

Example for the text "this is a test":
§ Grapheme-to-phoneme: D I s I z eI t e s t
§ Syntactic analysis: (S (NP this) (VP is (NP a test)))
§ Prosody prediction: this H, test H* L-L%
§ Interface output: 0101...00|0010.2.4.

Page 12: TTS with word vectors [WORD VECTORS]

• Replace prosodic tags with word vectors [4]
§ Similar to the first work by another Wang [5]
§ Why word vectors [6]: unsupervised learning, linguistic regularity, ...

Pipeline with word vectors: text → text analysis (grapheme-to-phoneme, word vectors) → interface → acoustic modeling (acoustic model) → speech waveform

Example for the text "this is a test":
§ Grapheme-to-phoneme: D I s I z eI t e s t
§ Syntactic analysis: (S (NP this) (VP is (NP a test)))
§ Word vectors: this [0.12, 0.34, ..], is [1.2, -23, ..], test ...
§ Interface output: 0101...00|0.12 0.34 ...

[4] Wang, X., Takaki, S., & Yamagishi, J. (2016). Investigation of Using Continuous Representation of Various Linguistic Units in Neural Network based TTS. IEICE, Vol. E99-D, No. 10.
[5] Wang, P., Qian, Y., Soong, F. K., He, L., & Zhao, H. (2015). Word embedding for recurrent neural network based TTS synthesis. In ICASSP (pp. 4879-4883).
[6] Mikolov, T., Yih, W., & Zweig, G. (2013). Linguistic regularities in continuous space word representations. In HLT-NAACL (pp. 746–751).

Page 13: TTS with word vectors [WORD VECTORS]

• Results of previous work [4]
§ MUSHRA test with 20 paid native speakers at CSTR

Tab. 1. Systems

ID     Input to the acoustic model (a recurrent neural network)
R_p    phonemes + predicted prosodic tags
R_N    phonemes
R_w    phonemes + word vector

[4] Wang, X., Takaki, S., & Yamagishi, J. (2016). Investigation of Using Continuous Representation of Various Linguistic Units in Neural Network based Text-to-Speech Synthesis. IEICE, Vol. E99-D, No. 10.

Page 14-16: Enhance the word vector with prosodic information [WORD VECTORS]

• Sum up
§ Secondary corpus: small, with prosodic annotation

Post-filter training (secondary speech corpus)
§ The prosodic annotation (ToBI tags) of the secondary speech corpus provides prosodic features
§ The vector post-filter is trained from the word vectors M_w and these prosodic features

Vector enhancing
§ Raw word vector M_w → vector post-filter → enhanced word vector M_w

M is the set of word vectors.

Page 17: Enhance the word vector with prosodic information [WORD VECTORS]

• Sum up
§ Secondary corpus: small, with expert prosodic annotation
§ Primary corpus: huge, without expert prosodic annotation

Post-filter training (secondary speech corpus)
§ Prosodic annotation (ToBI tags) → prosodic features
§ The vector post-filter is trained from the word vectors M_w and the prosodic features

Vector enhancing
§ Raw word vector M_w → vector post-filter → enhanced word vector M_w

Acoustic model training (primary speech corpus)
§ Text → grapheme-to-phoneme and enhanced word vector M_w → acoustic model → acoustic features


Page 19: Enhance the word vector with prosodic information [WORD VECTORS]

• How? Train a post-filter with the triplet ranking loss criterion [12]

$$E = \max\left[0,\; 1 - \mathrm{Sim}(p_w, F(m_w)) + \mathrm{Sim}(p_w, F(m_{w^-}))\right]$$

$$\mathrm{Sim}(x, y) = \frac{x \cdot y}{\|x\| \cdot \|y\|}$$

§ $p_w$: prosodic representation of word $w$, obtained from the speech (and text) of the secondary speech corpus through feature extraction and an NN-based classifier of prosodic tags
§ $m_w$, $m_{w^-}$: word vectors of word $w$ and of another word $w^-$
§ $F(\cdot)$: vector post-filter

[12] Bengio, S., & Heigold, G. (2014). Word embeddings for speech recognition. In INTERSPEECH-2014 (pp. 1053–1057).
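The triplet ranking loss above can be sketched in a few lines. This is an illustration only: the post-filter F(.) is a neural network in the slides, whereas this sketch takes already post-filtered vectors as inputs, and the function names are hypothetical.

```python
import numpy as np

def cosine_sim(x, y):
    """Sim(x, y) = x . y / (||x|| ||y||)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

def triplet_ranking_loss(p_w, f_pos, f_neg, margin=1.0):
    """E = max[0, margin - Sim(p_w, F(m_w)) + Sim(p_w, F(m_w-))].

    f_pos stands for F(m_w) and f_neg for F(m_{w-}); both are assumed to be
    post-filter outputs computed elsewhere.
    """
    return max(0.0, margin - cosine_sim(p_w, f_pos) + cosine_sim(p_w, f_neg))

p_w = np.array([1.0, 0.0])
good = triplet_ranking_loss(p_w, f_pos=[1.0, 0.0], f_neg=[-1.0, 0.0])
bad = triplet_ranking_loss(p_w, f_pos=[-1.0, 0.0], f_neg=[1.0, 0.0])
print(good, bad)
```

The loss is zero when the filtered vector of the matching word is closer to the prosodic representation than that of the other word by at least the margin, and grows when the ordering is reversed.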

Page 20: Experiments [WORD VECTORS]

• Systems
§ All systems use another acoustic model to predict spectral features

System ID   Input to the acoustic model (F0 trajectory model)
R_N         phonemes
R_p         phonemes + conventional prosodic context (automatically predicted)
R_wr        phonemes + raw word vector
R_we        phonemes + enhanced word vector
R_wrbp      phonemes + raw word vector tuned by back-propagation in TTS
R_webp      phonemes + enhanced word vector tuned by back-propagation in TTS

Page 21: Results [WORD VECTORS]

• Objective test

[Figure: two bar charts over the systems R_N, R_p, R_wr, R_we, R_wrbp, R_webp. Left: F0 correlation in mel scale (axis 0.77 to 0.79). Right: F0 RMSE in mel scale (axis 38.5 to 40).]

R_N: only phoneme
R_p: + prosodic context
R_wr: + raw word vector
R_we: + enhanced word vector
R_wrbp: + raw word vector (fine-tuned)
R_webp: + enhanced word vector (fine-tuned)
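The two objective measures, F0 RMSE and F0 correlation, can be computed as sketched below. Several details are assumptions for illustration rather than the evaluation code behind these slides: unvoiced frames are marked with 0, only frames voiced in both contours are scored, and the mel conversion uses the common formula 1127 ln(1 + f/700).

```python
import numpy as np

def hz_to_mel(f_hz):
    """A common Hz-to-mel formula (assumed; the slides do not give one)."""
    return 1127.0 * np.log1p(np.asarray(f_hz, dtype=float) / 700.0)

def f0_rmse_and_corr(f0_ref, f0_gen, unvoiced_value=0.0):
    """RMSE and Pearson correlation over frames voiced in both contours."""
    f0_ref = np.asarray(f0_ref, dtype=float)
    f0_gen = np.asarray(f0_gen, dtype=float)
    voiced = (f0_ref != unvoiced_value) & (f0_gen != unvoiced_value)
    ref, gen = f0_ref[voiced], f0_gen[voiced]
    rmse = float(np.sqrt(np.mean((ref - gen) ** 2)))
    corr = float(np.corrcoef(ref, gen)[0, 1])
    return rmse, corr

natural = [0.0, 100.0, 200.0, 300.0, 0.0]      # toy contour in Hz, 0 = unvoiced
generated = [0.0, 110.0, 210.0, 310.0, 150.0]  # toy generated contour
rmse_hz, corr_hz = f0_rmse_and_corr(natural, generated)
print(rmse_hz, corr_hz)
```

To score in mel scale, the voiced values would be passed through `hz_to_mel` before computing the two measures.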

Page 22: Results [WORD VECTORS]

• Sample

[Figure: F0 (Hz, 0 to 500) against frame index (50 to 250) for NATURAL, R_N, R_p, R_wr, and R_we, for the phrase "if the move would require"]

R_N: only phoneme
R_p: + prosodic context
R_wr: + raw word vector
R_we: + enhanced word vector
R_wrbp: + raw word vector (fine-tuned)
R_webp: + enhanced word vector (fine-tuned)

Page 23: Results [WORD VECTORS]

• Subjective test
§ Conducted at CSTR, by 20 paid native speakers
§ Some evaluators strongly favor R_we while others strongly favor R_wr. Is this because the sentence context was missing?

[Figure: preference bar chart (0% to 100%) for three system pairs: 49.87% / 50.13%, 46.79% / 53.21%, 47.00% / 53.00%]

R_N: only phoneme
R_p: + prosodic context
R_wr: + raw word vector
R_we: + enhanced word vector
R_wrbp: + raw word vector (fine-tuned)
R_webp: + enhanced word vector (fine-tuned)

Page 24: Highway

Page 25: Motivation [ON NETWORK'S DEPTH]

• Very deep network for SPSS?
§ Image classification: >100 hidden layers [14]
§ Speech recognition: >10 hidden layers [15]

• Just more hidden layers?
§ Image classification and speech recognition are classification tasks
Ø SPSS?
1. a regression task
2. heterogeneous targets: F0, MGC, ...

[14] He, K., Zhang, X., Ren, S., & Sun, J. (2015). Deep Residual Learning for Image Recognition. CoRR, abs/1512.03385. Retrieved from http://arxiv.org/abs/1512.03385
[15] Lu, L., & Renals, S. (2016). Small-footprint Deep Neural Networks with Highway Connections for Speech Recognition. In Proc. INTERSPEECH (pp. 12–16).

Page 26: Highway networks [16] for SPSS [ON NETWORK'S DEPTH]

• Why highway network?
§ Easier training of very deep networks
§ Easy investigation of the network's behavior

Highway block: the input x goes through a feedforward transform H(x) and a gate T(x), and the gate mixes the transformed and the untransformed input:

$$y = T(x) \odot H(x) + [1 - T(x)] \odot x$$

$$T(x) = \mathrm{sigmoid}(Wx + b)$$

[Figure: feedforward network (linguistic features → feedforward layers → MGC, F0, BAP) beside highway network (linguistic features → feedforward layer → highway blocks → MGC, F0, BAP)]

[16] Srivastava, R. K., Greff, K., & Schmidhuber, J. (2015). Highway Networks. CoRR, abs/1505.00387. Retrieved from http://arxiv.org/abs/1505.00387
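The highway block equations can be written as a short forward pass. This is a sketch under assumptions, not the implementation used in the experiments: H(x) is taken to be two tanh layers (the experiment configuration later in the deck mentions 2 tanh layers per highway block), inputs are row vectors so `x @ W` plays the role of Wx, and all parameter names and sizes are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def highway_block(x, W_h1, b_h1, W_h2, b_h2, W_t, b_t):
    """Return y = T(x) * H(x) + (1 - T(x)) * x and the gate T(x)."""
    h = np.tanh(x @ W_h1 + b_h1)   # first tanh layer of H(x)
    h = np.tanh(h @ W_h2 + b_h2)   # second tanh layer of H(x)
    t = sigmoid(x @ W_t + b_t)     # transform gate T(x), values in (0, 1)
    return t * h + (1.0 - t) * x, t

rng = np.random.default_rng(0)
dim = 4                                   # toy layer size
x = rng.standard_normal((3, dim))         # 3 frames
h_params = [rng.standard_normal((dim, dim)), np.zeros(dim),
            rng.standard_normal((dim, dim)), np.zeros(dim)]
W_t = np.zeros((dim, dim))

# A very negative gate bias closes the gate: the block copies its input.
y_closed, t_closed = highway_block(x, *h_params, W_t, np.full(dim, -50.0))
# A very positive gate bias opens the gate: the block outputs H(x).
y_open, t_open = highway_block(x, *h_params, W_t, np.full(dim, 50.0))
print(np.allclose(y_closed, x), float(t_open.min()))
```

Because the block can fall back to copying its input, stacking many blocks does not force every layer to transform the signal, which is one reading of why very deep highway networks are easier to train.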

Page 27: Highway networks for SPSS [ON NETWORK'S DEPTH]

• Why multi-stream?
§ Reduce interaction between MGC and F0 modeling

[Figure: three architectures]
§ Single-stream feedforward network: linguistic features → feedforward layers → MGC, F0, BAP
§ Single-stream highway network: linguistic features → feedforward layer → highway blocks → MGC, F0, BAP
§ Multi-stream highway network: linguistic features → feedforward layer → three separate sub-networks, each a feed-forward layer followed by highway blocks, for F0, MGC, and BAP

Page 28: Experiments [ON NETWORK'S DEPTH]

• Networks

Notation   System                                    Configuration
DS         deep feedforward single-stream network    layer size: 382
HS         highway single-stream network             layer size: 382; 2 tanh layers for each highway block
HM         highway multi-stream network              layer size: 256 (each sub-network); 2 tanh layers for each highway block

• Corpus: BC2011 Nancy voice

Page 29: Results [ON NETWORK'S DEPTH]

• Single-stream: sufficient depth is necessary

Depth: total number of hidden tanh-based feedforward layers

[Figure: three panels against network depth (2, 4, 8, 14, 20, 40) for DS, HS, and HM: F0 correlation (0.66 to 0.74), F0 RMSE in Hz (42.5 to 47.5), and MGC RMSE (1.02 to 1.14)]

Page 30: Results [ON NETWORK'S DEPTH]

• Single-stream: sufficient depth is necessary
• Multi-stream: more (feedforward) layers for F0?
Ø Similar results given fixed depth but varied layer sizes

[Figure: three panels against network depth (2, 4, 8, 14, 20, 40) for DS, HS, and HM: F0 correlation (0.66 to 0.74), F0 RMSE in Hz (42.5 to 47.5), and MGC RMSE (1.02 to 1.14)]

Page 31: Results [ON NETWORK'S DEPTH]

• Single-stream: sufficient depth is necessary
• Multi-stream: more feedforward layers for F0?

[Figure: three panels against the number of model parameters (3.2e+05 to 3.3e+06) for DS, HS, and HM, with each data point labelled by network depth (2, 4, 8, 14, 20, 40): F0 correlation (0.66 to 0.74), F0 RMSE in Hz (42.5 to 47.5), and MGC RMSE (1.02 to 1.14)]

Page 32: Results [ON NETWORK'S DEPTH]

Ø Similar results given fixed depth, varied layer size

[Figure: three panels against the number of model parameters (1.5e+06 to 3.6e+07) for DS, HS, and HM: MGC RMSE (1.005 to 1.045), F0 RMSE in Hz (42.5 to 46.5), and F0 correlation (0.67 to 0.74). Data points are labelled by layer size (382, 782, 882, 1024 for one single-stream system; 382, 482, 582, 782, 1024 for the other) and by HM1 to HM4 for the multi-stream system.]

Page 33: Analysis of network behavior [ON NETWORK'S DEPTH]

• Investigation tool: the histogram of $T(x)$, whose values lie between 0 and 1

$$y = T(x) \odot H(x) + [1 - T(x)] \odot x$$

• $T(x)$ indicates network behavior
§ As $T(x) \approx 0$, $y \approx x$
§ As $T(x) \approx 1$, $y \approx H(x)$

[Figure: multi-stream highway network: linguistic features → feedforward layer → three sub-networks, each a feed-forward layer followed by a stack of highway blocks, for F0, MGC, and BAP]
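The histogram of gate values can be collected as sketched below. The number of bins, the 0.1 threshold used to call a gate "closed", and the function names are hypothetical choices for illustration; the slides only state that the histogram of T(x) is inspected.

```python
import numpy as np

def gate_histogram(t_values, num_bins=10):
    """Histogram of gate activations T(x) over the range [0, 1]."""
    counts, edges = np.histogram(np.ravel(t_values), bins=num_bins, range=(0.0, 1.0))
    return counts, edges

def closed_fraction(t_values, threshold=0.1):
    """Fraction of gate activations below the (assumed) threshold, where y is close to x."""
    return float(np.mean(np.ravel(t_values) < threshold))

# Toy gate activations for one highway block: 4 frames x 2 units.
t_block = np.array([[0.01, 0.02], [0.90, 0.50], [0.03, 0.95], [0.04, 0.60]])
counts, edges = gate_histogram(t_block)
print(counts, closed_fraction(t_block))
```

A block whose histogram piles up near 0 mostly copies its input, while mass near 1 means the block relies on H(x).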

Page 34: Analysis of network behavior [ON NETWORK'S DEPTH]

• Histogram of $T(x)$, HM14

[Figure: multi-stream highway network with F0, MGC, and BAP sub-networks, and histograms of $T(x)$ over [0, 1] for highway blocks b.1 to b.7 (vertical axis up to 6e+05)]

Page 35: Analysis of network behavior [ON NETWORK'S DEPTH]

• Histogram of $T(x)$, HM14
§ $T(x)$ for MGC
§ $T(x)$ for F0
§ The F0 sub-network is partially inactive

[Figure: multi-stream highway network with F0, MGC, and BAP sub-networks, and two rows of histograms of $T(x)$ over [0, 1] for highway blocks b.1 to b.7 (vertical axes up to 6e+05 and 7e+05)]

Page 36: Analysis of network behavior [ON NETWORK'S DEPTH]

• Histogram of $T(x)$, HS14
§ The single-stream network is fully active
Ø MGC dominates the network

[Figure: single-stream highway network (linguistic features → feedforward layers → highway blocks → F0, MGC, BAP) and histograms of $T(x)$ over [0, 1] for highway blocks b.1 to b.7 (vertical axis up to 8e+05)]

Page 37: Results (HM only) [ON NETWORK'S DEPTH]

Objective measures for systems with different depth

[Figure: three panels against the number of model parameters (log scale, 1.4e+06 to 2.4e+07) for HM2, HM4, HM8, HM14, HM20, HM40, HM60, and HM80: MGC RMSE (0.98 to 1.14), F0 RMSE in Hz (42.6 to 43.4), and F0 correlation (0.724 to 0.734)]

Page 38: Results (HM14) [ON NETWORK'S DEPTH]

[Figure: histograms of $T(x)$ over [0, 1] for highway blocks 1 to 7 of the MGC sub-network and the F0 sub-network]

Block 1 is near the input end of the network; the last block (block 7) is near the output end.

Page 39: Results (HM20) [ON NETWORK'S DEPTH]

[Figure: histograms of $T(x)$ over [0, 1] for highway blocks 1 to 10 of the MGC sub-network and the F0 sub-network]

Block 1 is near the input end of the network; the last block (block 10) is near the output end.

Page 40: Results (HM40) [ON NETWORK'S DEPTH]

[Figure: histograms of $T(x)$ over [0, 1] for highway blocks 1 to 20 of the MGC sub-network and the F0 sub-network]

Block 1: near the input end of the network. Block 20: near the output end of the network.

Page 41: Results (HM60) [ON NETWORK'S DEPTH]

[Figure: histograms of $T(x)$ over [0, 1] for selected highway blocks (1 to 5, 8, 10, 12, 14, 16, 19, 20, 22 to 24, 26 to 30) of the MGC sub-network and the F0 sub-network]

Block 1: near the input end of the network. Block 30: near the output end of the network.

Page 42: Results (HM80) [ON NETWORK'S DEPTH]

[Figure: histograms of $T(x)$ over [0, 1] for the odd-numbered highway blocks 1 to 39 of the MGC sub-network and the F0 sub-network]

Block 1: near the input end of the network. Block 39: near the output end of the network.

Page 43: Overfitting of HM80? [ON NETWORK'S DEPTH]

Page 44: Analysis [ON NETWORK'S DEPTH]

• Contribution of the linguistic features

Most useful features

MGC sub-network                                    LF0 sub-network
Phoneme identity                                   Position of phoneme in syllable
Position of phoneme in syllable                    Accent type of next syllable
Position of phoneme in syllable (backward)         Accent type of previous syllable
Number of previous stressed syllables in phrase    Position of syllable in the word
Number of stressed syllables remained in phrase    Phoneme identity

Least useful features

MGC sub-network                                    LF0 sub-network
Position of phrase in utterance                    Number of words in previous phrase
Number of words in phrase                          Number of words in next phrase
Number of phrases in utterance                     ToBI boundary tone
Number of syllables in previous phrase             Number of syllables in next phrase
ToBI boundary tone                                 Number of syllables in previous phrase

Page 45: Experiments [ISSUE 1: JOINT LEARNING FOR F0?]

Other results
• Sensitivity analysis in the multi-stream highway network
• MGC and F0 use different linguistic features
• Similar results on Japanese data

MGC
§ Current phoneme identity
§ Position of current phoneme in syllable (forward)
§ Position of current phoneme in syllable (backward)
§ Number of preceding lexically stressed syllables in phrase
§ Number of following lexically stressed syllables in phrase

F0
§ Position of phoneme in syllable (forward)
§ Is the next syllable bearing an English pitch-accent
§ Is the previous syllable bearing an English pitch-accent
§ Position of current syllable in the word
§ Current phoneme identity

Page 46: Samples [ON NETWORK'S DEPTH]

[Table of audio samples: rows are the number of layers (2, 4, 8, 14, 20, 40, 60, 80); columns are DS, HS, and HM]

Page 47: Experiments [ON NETWORK'S DEPTH]

• Networks

Notation   System                                                Configuration
HMn        multi-stream highway network                          layer size 256 for each sub-network; 2 tanh layers in one highway block; sigmoid for the highway gate
RNN [1]    recurrent neural network (single-stream)              2 feedforward layers, 512 each layer; 2 bi-directional LSTM layers, 256 each layer; 1 linear projection output layer
DNN        deep feedforward neural network (single-stream)       1 feedforward layer, 1024; 3 feedforward layers, 512 each layer; 1 linear projection output layer

• Corpus: ATR F009

[1] Wang, X., Takaki, S., & Yamagishi, J. (2016). A Comparative Study of the Performance of HMM, DNN, and RNN based Speech Synthesis Systems Trained on Very Large Speaker-Dependent Corpora. In Proc. SSW9 (pp. 125–128).

Page 48: Results [ON NETWORK'S DEPTH]

[Figure: three panels against the number of network weights (5.0e5 to 8.0e6) for the single-stream feedforward, single-stream highway, and multi-stream highway networks: MGC RMSE (0.98 to 1.10), F0 RMSE in Hz (31 to 36), and F0 correlation (0.830 to 0.870). Data points carry the labels 4, 6, 8 and 2, 3, 4, 8, 16, 32.]


Results (Japanese)
• HM with 2, 7, 10 highway blocks

[Figure: histograms over the range 0 to 1 for each highway block (block 1 to block 2, b.1 to b.7, and b.1 to b.10), shown for the MGC stream and the F0 stream]


Results (Japanese)
• Sensitivity analysis, 47 classes of contextual features

• Least useful features, MGC stream

Feature | minEs | Description | Dim | Rank
'RR-Phone' | 0.993 | the phoneme after the next phoneme identity | 44 | 40
'C-Br_Len_Mora' | 0.993 | the number of moras in the current breath group | 100 | 41
'C-Br_Bw-Pos-in_Utt_Mora' | 0.994 | position of the current breath group identity by mora (backward) | 201 | 42
'C-Br_Fw-Pos-in_Utt_Mora' | 0.995 | position of the current breath group identity by mora (forward) | 201 | 43
'C-Acc_Fw-Pos-in_Br_Mora' | 0.996 | position of the current accent phrase identity in the current breath group by the mora (forward) | 121 | 44
'Utt_Len_Acc' | 0.997 | the number of accent phrases in this utterance | 60 | 45
'Utt_Len_Br' | 0.999 | the number of breath groups in this utterance | 30 | 46
'Utt_Len_Mora' | 0.999 | the number of moras in this utterance | 200 | 47

• Least useful features, F0 stream

Feature | minEs | Description | Dim | Rank
'L-Acc_Len_Mora' | 0.994 | the number of moras in the previous accent phrase | 61 | 40
'C-Br_Len_Mora' | 0.995 | the number of moras in the current breath group | 100 | 41
'C-Acc_Fw-Pos-in_Br_Mora' | 0.995 | position of the current accent phrase identity in the current breath group by the mora (forward) | 121 | 42
'C-Br_Bw-Pos-in_Utt_Mora' | 0.996 | position of the current breath group identity by mora (backward) | 201 | 43
'C-Br_Fw-Pos-in_Utt_Mora' | 0.997 | position of the current breath group identity by mora (forward) | 201 | 44
'Utt_Len_Acc' | 0.997 | the number of accent phrases in this utterance | 60 | 45
'Utt_Len_Br' | 0.999 | the number of breath groups in this utterance | 30 | 46
'Utt_Len_Mora' | 1.000 | the number of moras in this utterance | 200 | 47


Summary
• Findings:
  • MGC benefits from deeper networks
  • F0 sub-network can be shallow
  • Single-stream networks focus more on MGC than on F0
  • Investigation of linguistic features' usefulness
    • (automatically inferred) F0-related tags are noisy for English
  • Experiments on Japanese corpus
    • similar: multi-stream network improves F0 modeling
    • different: F0-related tags are useful

Any reason?


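HIGHWAY ARCHITECTURE

[Diagrams: a bottom network producing h_t, followed by linear output layers for MGC, F0, and BAP, shown in several configurations]

\[
\widehat{o}_t =
\begin{bmatrix}
\widehat{\widehat{o}}^{(\mathrm{MGC})}_t \\
\widehat{o}^{(\mathrm{F0})}_t \\
\widehat{o}^{(\mathrm{BAP})}_t
\end{bmatrix}
=
\begin{bmatrix}
W_{s,11} & W_{s,12} & 0 \\
0 & 1 & 0 \\
0 & 0 & 1
\end{bmatrix}
\begin{bmatrix}
\widehat{o}^{(\mathrm{MGC})}_t \\
\widehat{o}^{(\mathrm{F0})}_t \\
\widehat{o}^{(\mathrm{BAP})}_t
\end{bmatrix}
= W_s W_o h_t
\]

\[
\widehat{o}_t =
\begin{bmatrix}
\widehat{o}^{(\mathrm{MGC})}_t \\
\widehat{o}^{(\mathrm{F0})}_t \\
\widehat{o}^{(\mathrm{BAP})}_t
\end{bmatrix}
=
\begin{bmatrix}
W_{o,11} & 0 & 0 \\
0 & W_{o,22} & 0 \\
0 & 0 & W_{o,33}
\end{bmatrix}
\begin{bmatrix}
h_{t,1} \\
h_{t,2} \\
h_{t,3}
\end{bmatrix}
= W_o h_t,
\]

\[
\widehat{o}_t =
\begin{bmatrix}
\widehat{o}^{(\mathrm{MGC})}_t \\
\widehat{o}^{(\mathrm{F0})}_t \\
\widehat{o}^{(\mathrm{BAP})}_t
\end{bmatrix}
=
\begin{bmatrix}
W_{o,11} & W_{o,12} & W_{o,13} \\
W_{o,21} & W_{o,22} & W_{o,23} \\
W_{o,31} & W_{o,32} & W_{o,33}
\end{bmatrix}
\begin{bmatrix}
h_{t,1} \\
h_{t,2} \\
h_{t,3}
\end{bmatrix}
= W_o h_t,
\]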


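[Diagram: bottom network followed by linear output layers for MGC, F0, and BAP, with an additional linear layer]

\[
\widehat{o}_t =
\begin{bmatrix}
\widehat{\widehat{o}}^{(\mathrm{MGC})}_t \\
\widehat{o}^{(\mathrm{F0})}_t \\
\widehat{o}^{(\mathrm{BAP})}_t
\end{bmatrix}
=
\begin{bmatrix}
W_{s,11} & W_{s,12} & 0 \\
0 & 1 & 0 \\
0 & 0 & 1
\end{bmatrix}
\begin{bmatrix}
\widehat{o}^{(\mathrm{MGC})}_t \\
\widehat{o}^{(\mathrm{F0})}_t \\
\widehat{o}^{(\mathrm{BAP})}_t
\end{bmatrix}
= W_s W_o h_t
\]

\[
\widehat{\widehat{o}}^{(\mathrm{MGC})}_t = W_{s,11}\, \widehat{o}^{(\mathrm{MGC})}_t + W_{s,12}\, \widehat{o}^{(\mathrm{F0})}_t
\]

• Same argument as SAR vs RMDN
  • Dependency between means, not random variables

\[
p(o^{(\mathrm{MGC})}, o^{(\mathrm{F0})}, o^{(\mathrm{BAP})}) = p(o^{(\mathrm{MGC})})\, p(o^{(\mathrm{F0})})\, p(o^{(\mathrm{BAP})})
\]

The mean of \(p(o^{(\mathrm{MGC})})\) is affected by the mean of \(p(o^{(\mathrm{F0})})\).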
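A small numerical sketch of this mean-level dependency is given below. It only illustrates the matrix structure: all dimensions are hypothetical, the weights are random stand-ins, and W_o is assumed to be block-diagonal (one sub-vector of h_t per stream), which is an assumption for this example rather than something the slide states.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration.
dim_h = 4                          # size of each hidden sub-vector h_{t,k}
dim_mgc, dim_f0, dim_bap = 3, 1, 2
dim_out = dim_mgc + dim_f0 + dim_bap

# Assumed block-diagonal W_o: each stream is predicted from its own sub-vector.
W_o = np.zeros((dim_out, 3 * dim_h))
W_o[:dim_mgc, :dim_h] = rng.normal(size=(dim_mgc, dim_h))
W_o[dim_mgc:dim_mgc + dim_f0, dim_h:2 * dim_h] = rng.normal(size=(dim_f0, dim_h))
W_o[dim_mgc + dim_f0:, 2 * dim_h:] = rng.normal(size=(dim_bap, dim_h))

# W_s: identity on F0 and BAP; the MGC rows mix the MGC and F0 predictions.
W_s11 = rng.normal(size=(dim_mgc, dim_mgc))
W_s12 = rng.normal(size=(dim_mgc, dim_f0))
W_s = np.eye(dim_out)
W_s[:dim_mgc, :dim_mgc] = W_s11
W_s[:dim_mgc, dim_mgc:dim_mgc + dim_f0] = W_s12

h_t = rng.normal(size=3 * dim_h)
o_hat = W_o @ h_t          # per-stream predicted means
o_hat_hat = W_s @ o_hat    # MGC mean corrected by the F0 mean

mgc_hat = o_hat[:dim_mgc]
f0_hat = o_hat[dim_mgc:dim_mgc + dim_f0]
mgc_hat_hat = W_s11 @ mgc_hat + W_s12 @ f0_hat   # the row-wise equation
```

Only the predicted MGC mean changes; the F0 and BAP rows pass through unchanged, and no dependency between the random variables themselves is introduced.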


• Training dependency model?

[Diagrams: bottom network with linear output layers for MGC, F0, and BAP; one diagram is labeled "Observed (natural) data"]

\[
p(X, Y; \Theta) = p(X \mid Y; \Theta_1)\, p(Y; \Theta_2)
\]

\[
\begin{aligned}
\Theta^{*} &= \arg\max_{\Theta} \prod_{\{x_n, y_n\} \in D} p(X = x_n, Y = y_n) \\
&= \arg\max_{\Theta} \prod_{\{x_n, y_n\} \in D} p(X = x_n \mid Y = y_n)\, p(Y = y_n)
\end{aligned}
\]
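As a toy illustration of this factorized maximum-likelihood training, the sketch below fits p(Y) and p(X|Y) on observed pairs (x_n, y_n). The linear-Gaussian model, the synthetic data, and all variable names are assumptions made for this example; they are not the acoustic model of the slides. Because the log-likelihood splits into log p(x_n|y_n) + log p(y_n), the two parameter sets can be estimated separately, with the observed y_n used as the conditioning input.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "observed (natural)" data: y_n ~ N(2, 1), x_n | y_n ~ N(0.5 * y_n - 1, 0.1^2).
N = 5000
y = rng.normal(loc=2.0, scale=1.0, size=N)
x = 0.5 * y - 1.0 + rng.normal(scale=0.1, size=N)

# Theta_2: ML estimate of the mean of p(Y) (variance treated as fixed).
mu_y = y.mean()

# Theta_1: ML estimate of p(X | Y) = N(a * Y + b, sigma^2).
# For a Gaussian with fixed variance, this is least squares on the observed y_n.
A = np.stack([y, np.ones(N)], axis=1)
(a_hat, b_hat), *_ = np.linalg.lstsq(A, x, rcond=None)
```

The point of the sketch is that training conditions on the natural y_n, not on a value generated by the model.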


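• Generate from dependency model?

[Diagram: bottom network with linear output layers for MGC, F0, and BAP]

Ideal solution:

\[
p(X, Y; \Theta^{*}) = p(X \mid Y; \Theta^{*}_1)\, p(Y; \Theta^{*}_2)
\]

\[
\begin{aligned}
\{x^{*}, y^{*}\} &= \arg\max_{\{\widehat{x}, \widehat{y}\}} p(X = \widehat{x}, Y = \widehat{y}) \\
&= \arg\max_{\{\widehat{x}, \widehat{y}\}} p(X = \widehat{x} \mid Y = \widehat{y})\, p(Y = \widehat{y})
\end{aligned}
\]

Approximation:

\[
y^{*} = \arg\max_{\widehat{y}} p(Y = \widehat{y})
\]

\[
x^{*} = \arg\max_{\widehat{x}} p(X = \widehat{x} \mid Y = y^{*})
\]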
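The gap between the ideal solution and the approximation can be seen on a tiny discrete example. The probability tables below are invented purely for illustration; the sketch shows that maximizing p(Y) first and then p(X|Y = y*) need not find the maximum of the joint distribution.

```python
import numpy as np

# Hypothetical distributions: Y has 2 values, X has 3 values.
p_y = np.array([0.6, 0.4])
p_x_given_y = np.array([
    [0.40, 0.30, 0.30],   # p(X | Y = 0)
    [0.90, 0.05, 0.05],   # p(X | Y = 1)
])

# Ideal solution: maximize the joint p(X, Y) = p(X | Y) p(Y) over both variables.
joint = p_y[:, None] * p_x_given_y
y_ideal, x_ideal = (int(i) for i in np.unravel_index(np.argmax(joint), joint.shape))

# Approximation: pick y* from p(Y) alone, then x* from p(X | Y = y*).
y_approx = int(np.argmax(p_y))
x_approx = int(np.argmax(p_x_given_y[y_approx]))

p_ideal = joint[y_ideal, x_ideal]
p_approx = joint[y_approx, x_approx]
```

Here the approximation selects Y = 0 because it is the more probable value on its own, while the joint maximum lies at Y = 1; the approximate solution therefore has a lower joint probability than the ideal one.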


AM & Vocoder


COMPARE NEW MODELS/VOCODERS

Overview
• Still the SPSS framework
• Test on both "vocoder" and acoustic models

[Diagram: linguistic features, acoustic models, F0 and MGC, and waveform generators. Labels in the diagram: SAR, RNN, DAR, GAN; WaveNet, PML, phase recovery, minimum phase, WORLD. Systems: SAR-Wa, SAR-Pm, SAR-Pr, SAR-Wo, SGA-Wo, RGA-Wo, RNN-Wo]


• Abs*: copy-synthesis
• Without MLPG
• Without formant enhancement


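WAVENET

Overview

• 40 blocks, dilation [2, 4, 8, 16, 32, 64, 128, 256, 512, 2, 4, …]

[Diagram: the time-shifted waveform passes through a stack of blocks. In each block, the output of a dilated 1-D CNN is added to the output of a sub-network (feedforward) fed with linguistic features / MGC, then passed through tanh and sigmoid units whose outputs are multiplied, followed by 1-D CNNs and additions. The summed outputs pass through 1-D CNNs and a softmax to produce the waveform.]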
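The dilation pattern listed for the 40 blocks can be expanded and turned into a rough receptive-field estimate. This is a back-of-the-envelope sketch: the kernel size of 2 is an assumption (the slide does not state it), the function names are hypothetical, and the 16 kHz rate is the waveform time resolution given on the sub-network slide.

```python
def dilation_schedule(num_blocks, cycle=(2, 4, 8, 16, 32, 64, 128, 256, 512)):
    """Repeat the dilation cycle until num_blocks dilations are listed."""
    return [cycle[i % len(cycle)] for i in range(num_blocks)]

def receptive_field(dilations, kernel_size=2):
    """Receptive field (in samples) of stacked dilated causal convolutions."""
    return 1 + (kernel_size - 1) * sum(dilations)

dilations = dilation_schedule(40)
rf_samples = receptive_field(dilations)        # assumes kernel size 2
rf_ms = 1000.0 * rf_samples / 16000            # at a 16 kHz sampling rate
```

Under these assumptions the stack sees roughly a quarter of a second of past waveform, which is far more than a stack with the same number of undilated layers would cover.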


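Sub-network

[Diagram: linguistic features / MGC and F0 enter a sub-network (feedforward, Bi-LSTM, linear), are up-sampled, and condition blocks 1 to 40. The waveform (feedback) enters through a linear layer, and the block outputs pass through a 1-D CNN and a softmax to produce the waveform.]

• Frame level: time resolution 1/(5 ms) = 200 Hz
• After up-sampling: time resolution 16 kHz

Samples: no sub-network, Bi-LSTM sub-network, natural
(trained on natural MGC/F0, generated on synthetic MGC & F0)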
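The up-sampling step from frame level (one frame every 5 ms) to the 16 kHz waveform rate can be sketched as below. Simple repetition of each frame is an assumption made for illustration; the slide does not say which up-sampling method is used, and the function name is hypothetical.

```python
import numpy as np

def upsample_frames(features, sample_rate=16000, frame_shift_s=0.005):
    """Repeat each frame so that there is one feature vector per waveform sample."""
    samples_per_frame = int(round(sample_rate * frame_shift_s))  # 80 at 16 kHz, 5 ms
    return np.repeat(features, samples_per_frame, axis=0)

# Three frames of 2-dimensional conditioning features (toy values).
frames = np.arange(6, dtype=float).reshape(3, 2)
upsampled = upsample_frames(frames)
```

With a 5 ms frame shift each frame covers 80 waveform samples, so three frames become 240 conditioning vectors.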


Generation method
• Random sampling

\[
\widehat{o}_t \sim P(o_t \mid \widehat{o}_{t-R:t-1}, \widehat{a}_t)
\]

• One-best (in voiced regions)

\[
\widehat{o}_t = \arg\max_{o_t} p(o_t \mid \widehat{o}_{t-R:t-1}, \widehat{a}_t)
\]

Samples conditioned on generated MGC & F0: Random-samp, One-best, Natural
Samples conditioned on linguistic features & F0: Random-samp, One-best, Natural
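The two generation methods differ only in how one waveform level is picked from the softmax output at each step. The sketch below uses made-up logits over four levels; the function names are hypothetical, and the `generate_step` rule (one-best where the frame is voiced, random sampling elsewhere) is one reading of "One-best (in voiced regions)", stated here as an assumption.

```python
import numpy as np

def softmax(logits):
    z = logits - np.max(logits)   # subtract the maximum for numerical stability
    e = np.exp(z)
    return e / e.sum()

def random_sample(probs, rng):
    """Draw one level index from the categorical distribution."""
    return int(rng.choice(len(probs), p=probs))

def one_best(probs):
    """Pick the most probable level index."""
    return int(np.argmax(probs))

def generate_step(probs, voiced, rng):
    # Assumed rule: one-best in voiced regions, random sampling otherwise.
    return one_best(probs) if voiced else random_sample(probs, rng)

rng = np.random.default_rng(0)
logits = np.array([0.1, 2.0, 0.3, 1.5])   # toy network output for one time step
probs = softmax(logits)

best = one_best(probs)
samples = [random_sample(probs, rng) for _ in range(1000)]
voiced_choice = generate_step(probs, voiced=True, rng=rng)
```

One-best always returns the same level for a given distribution, while random sampling spreads its choices according to the predicted probabilities.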


Analysis
• Is WaveNet-vocoder strictly speaker/language dependent?
  • But may not work for male speakers

[Diagram: Training: WaveNet-vocoder on F009 (Japanese data). Generation: WaveNet-vocoder given natural MGC and F0 from Nancy (English data).]

Samples conditioned on natural MGC & F0: WaveNet, Natural


Analysis
• Embedding of waveform (conditioned on MGC/F0)

[Diagram: waveform (feedback) → linear → blocks 1 to 40 → 1-D CNN → softmax → waveform. Waveform level ID (10 bits) → weights of linear layer → 2-D embedding]
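One way to read this analysis: if each 10-bit waveform level is fed to the linear layer as a one-hot vector, the layer's weight row for that level acts as its embedding, and the rows can be projected to 2-D for plotting. The sketch below makes both of these assumptions explicit; it uses random stand-in weights, a hypothetical hidden size, and PCA as the projection, since the slide does not say how the 2-D embedding was obtained.

```python
import numpy as np

rng = np.random.default_rng(0)

num_levels = 2 ** 10       # 10-bit waveform levels
hidden_dim = 64            # hypothetical size of the linear layer output
W = rng.normal(size=(num_levels, hidden_dim))   # stand-in for trained weights

# Feeding a one-hot level ID through the linear layer selects one row of W.
level_id = 513
one_hot = np.zeros(num_levels)
one_hot[level_id] = 1.0
embedding_of_level = one_hot @ W

# Assumed projection: PCA via SVD of the centered weight rows.
W_centered = W - W.mean(axis=0, keepdims=True)
_, _, Vt = np.linalg.svd(W_centered, full_matrices=False)
embedding_2d = W_centered @ Vt[:2].T   # one 2-D point per waveform level
```

Plotting `embedding_2d` for trained weights would show whether neighboring waveform levels end up close to each other.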


Analysis
• Embedding of waveform (conditioned on linguistic features/F0)

[Diagram: waveform (feedback) → linear → blocks 1 to 40 → 1-D CNN → softmax → waveform. Waveform level ID (10 bits) → weights of linear layer → 2-D embedding]


Analysis
• Variance of output from each block (conditioned on MGC/F0)

[Figure: range of output value from each block. x-axis: Block ID (0 to 40); y-axis: feature value (−4 to 4); curves: 98-percentile, 2-percentile, median, stdev; annotations: largest value, smallest value, standard deviation]

[Diagram: waveform (feedback) → linear → blocks 1 to 40 → 1-D CNN → softmax → waveform]
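The per-block statistics named in the plot legend (2-percentile, median, 98-percentile, standard deviation) can be computed as sketched below. The block outputs here are random stand-ins whose spread grows with the block index; they are not the measured WaveNet activations, and the function name is hypothetical.

```python
import numpy as np

def block_output_stats(block_outputs):
    """Summarize the value range of each block's output."""
    stats = []
    for out in block_outputs:
        flat = np.asarray(out).ravel()
        p2, median, p98 = np.percentile(flat, [2, 50, 98])
        stats.append({
            "p2": float(p2),
            "median": float(median),
            "p98": float(p98),
            "stdev": float(flat.std()),
        })
    return stats

rng = np.random.default_rng(0)
# Stand-in data: 40 blocks, each with 1000 time steps of 8 features.
fake_outputs = [rng.normal(scale=1.0 + 0.05 * b, size=(1000, 8)) for b in range(40)]
stats = block_output_stats(fake_outputs)
```

Percentiles are used in place of the raw minimum and maximum so that a few outliers do not dominate the plotted range.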