semi-supervised learning by machine speech chain …...semi-supervised learning by machine speech...

http://www.naist.jp/無限の可能性、ここが最先端－Outgrow your limits－

Semi-supervised Learning by Machine Speech Chain for

Multilingual Speech Processing, and Recent Progress

on Automatic Speech Interpretation

Satoshi Nakamura,

Sakriani Sakti, and Katsuhito Sudoh

Graduate School of Science and Technology,

Nara Institute of Science and Technology, Japan

LT4ALL ©Satoshi Nakamura, AHC Lab, NAIST, Japan

1

Dec. 6. 2019


Topics

Recent advances in speech processing

– ASR and TTS research

– Machine Speech Chain unifies ASR and TTS

– Application to code switching speech

Speech Translation

– Recent Progress on Automatic Speech Interpretation

Dec. 6. 2019 LT4ALL ©Satoshi Nakamura, AHC Lab, NAIST, Japan

2


Motivation Background

In human communication → A closed-loop speech chain mechanism has a critical auditory feedback

mechanism (“Speech Chain”, Denes, Pinson 1973)


3

Sensory nerves

Motornerves

Sensory nerves

Auditory feedback

Speaking Listening


Machine Speech Chain

Proposed Method


4

Develop a closed-loop speech chain model based on deep learning

“Good afternoon”

Sensory nerves

Motornerves

Auditory feedback

Speaking

“How are you?”

Speaking

Auditory feedback

Use the closed-loop

for ASR and TTS

Dec. 6. 2019



Definition:

– 𝑥 = original speech, 𝑦 = original text

– ො𝑥 = predicted speech, ො𝑦 = predicted text

– 𝐴𝑆𝑅(𝑥): 𝑥 → ො𝑦 (seq2seq model transforms speech to text)

– 𝑇𝑇𝑆 𝑦 : 𝑦 → ො𝑥 (seq2seq model transforms text to speech)


5

Dec. 6. 2019

Andros Tjandra, Sakriani Sakti, Satoshi Nakamura, “Listening while Speaking: Speech Chain by Deep Learning”, Proc. IEEE ASRU 2017

𝐿𝐴𝑆𝑅(𝑦, ො𝑦)𝐿𝑇𝑇𝑆(𝑥, ො𝑥)


Speech Chain with One-shot Speaker Adaptation

Proposed model

– Train ASR and TTS models using unpaired data and small amount of paired data.

– Speaker individuality is generated by SPKREC embedding.

6


Andros Tjandra, Sakriani Sakti, Satoshi Nakamura, “Machine Speech Chain with One-shot Speaker Adaptation”, Proc. INTERSPEECH 2018

http://www.naist.jp/無限の可能性、ここが最先端－Outgrow your limits－ LT4ALL ©Satoshi Nakamura, AHC Lab, NAIST, Japan

7

Code-switching Challenges For ASR

“これはstill waterですか？”

Standard ASR is monolingual

ASR?Output text

Japanese

ASR“こんにちは”

English

ASR “Hello”

Japanese

EnglishJapanese

• Typical case where paired speech and transcription are difficult to collect.

Challenge with Code-switching : Mixed multilingual input

Dec. 6. 2019


Code-switching Challenges


8

ASR

TTS

෧𝑡𝑒𝑥𝑡

෧

𝑡𝑒𝑥𝑡

𝐿𝑜𝑠𝑠

෧

𝑠𝑝𝑒𝑒𝑐ℎASR

TTS

෧𝑡𝑒𝑥𝑡𝐿𝑜𝑠𝑠unfold

ASR

TTS

෧𝑡𝑒𝑥𝑡

෧𝑡𝑒𝑥𝑡

෧

෧

t𝑒𝑥𝑡

Given only text Given only speech

• Typical case where paired speech and transcription are difficult to collect.

“これはstill waterですか？”

ASR?Output text

Japanese

EnglishJapanese

Challenge with Code-switching : Mixed multilingual input

Dec. 6. 2019

S. Nakayama, A. Tjandra, S. Sakti, S. Nakamura, “Speech chain for semi-supervised learning of Japanese-English code-switching ASR and TTS”. In Proc. IEEE SLT, 2018

18.11% CER-> 5.08% CER


Topics

Recent advances in speech processing

– ASR and TTS research

– Machine Speech Chain unifies ASR and TTS

– Application to code switching speech

Speech Translation

– Recent Progress on Automatic Speech Translation


9

Dec. 6. 2019


Speech Translation

2019/06/15 CLI9 Keynote Satoshi Nakamura, NAIST10

MultilingualSpeech

Recognition

Spoken Language

Translation

MultilingualSpeech

SynthesisJapanese English

I go to school「私は学校に行く: Watashi wa Gakko he iku」

Watashi wa Gakko he iku

I go to school

• Cascaded process of speech recognition, machine translation, and speech synthesis.

• Machine translation of ASR transcripts.


Cross-lingual Communication

LT4ALL Dec.6 2019 ©Satoshi Nakamura, AHC Lab, NAIST, Japan

11

Input:Text

SpeechVideo

Gesture

Speech⇒TextASR

RealtimeIncremental

MTConversion

Dialog Control

LinguisticInformation

ParalinguisticEmotion,

Style, Personality,

Prosody, Gesture

ParalinguisticEmotion,

Style, Personality,

Prosody, Gesture

Output:Text

SpeechVideo

Gesture

Source Language Target Language

Speech“to o kyo e i ku”

MT results/I/go/to/Tokyo/

TTS results“ai go tu tokyo/

Personality, Prosody Personality, Prosody

DiscourceContext

Domain knowledge

Text

Image⇒textPR

Text

Text⇒SpeechTTS

Text⇒ImageImage Generation

End-to-end Process

Communication

① Simultaneity, Incremental, Latency,

② +Para/non linguistic information

LinguisticInformation

Dec. 6. 2019


Human Interpreter [A.Mizuno 2016]


12

E-J Translation Example

(1) The relief workers (2) say (3) they don’t have (4) enough food, water, shelter, and medical supplies (5) to deal with (6) the gigantic wave of refugees (7) who are ransacking the countryside (8) in search of the basics (9) to stay alive.

(1) 救援担当者は (9) 生きるための (8) 食料を求めて (7) 村を荒らし回っている (6) 大量の難民達の (5) 世話をするための (4) 十分な食料や水，宿泊施設，医療品が (3) 無いと (2) 言っています．

Necessary #Chunk＞３！

(1) 救援担当者達の (2) 話では (4)食料，水，宿泊施設，医薬品が， (3) 足りず (6) 大量の難民達の (5) 世話が出来ないとのことです．(7) 難民達は今村々を荒らし回って， (9) 生きるための (8) 食料を求めているのです．

Necessary #Chunk＜３！

Memory Chunk

http://www.naist.jp/無限の可能性、ここが最先端－Outgrow your limits－ Dec. 6. 2019 LT4ALL ©Satoshi Nakamura, AHC Lab, NAIST, Japan

13

Katsuki Chousa, Katsuhito Sudoh, and Satoshi Nakamura. 2019. Simultaneous Neural Machine Translation using Connectionist Temporal Classification. arXiv preprint , 1911.1193

Simultaneous Speech Translation with Adaptive Delay

Define a special token <wait>

ブッシュ大統領はプーチンと会談する

President Bush <wait> meets with Putin<wait><wait>


Paralinguistic Speech Translation


14

Q. T. Do, S. Sakti, S. Nakamura, “Sequence-to-Sequence Models for Emphasis Speech Translation”. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 26(10):1873–1883, 2018


Summary


– Semi-supervised training using unpaired data

– Code switching speech, under-resource language, and continuous learning

Speech Translation

– Simultaneous speech translation

• Segmentation, anticipation, rewording, evaluation

– Paralinguistic speech translation

• Emphasis, and emotions

Future works

– Understanding and interpretation

– Context, situation and multi-modality

– Common sense, knowledge, and cross-cultural knowledge


15

Dec. 6. 2019

semi-supervised learning by machine speech chain …...semi-supervised learning by machine speech...

Documents