Towards Machine Comprehension of Spoken Content · 2017-11-09


Teaching Machines to Understand Human Speech (讓機器聽懂人說話)

Hung-yi Lee

Outline

Speech: 你好嗎? (“How are you?”)

From speech signal to text: 你好嗎?

Understanding a word

Understanding a sentence

Understanding a whole dialogue

The machine decides what to say (or do): 我很好 (“I am fine”)

Everything is based on Deep Learning

Deep Learning in One Slide

Networks are functions: y = f(x), where the input x and output y can be a vector, a matrix, or a sequence.

Many kinds of networks:

• Fully connected feedforward network
• Convolutional neural network (CNN)
• Recurrent neural network (RNN)

How to find the function?

Given examples of inputs/outputs as training data: {(x1,y1),(x2,y2), ……, (x1000,y1000)} (sketched below)
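To make “networks are functions” concrete, here is a minimal sketch, assuming PyTorch; the toy data and architecture are illustrative, not from the talk:

```python
# Define a function f and "find" it from example (x, y) pairs.
import torch
import torch.nn as nn

f = nn.Sequential(nn.Linear(2, 16), nn.ReLU(), nn.Linear(16, 1))
optimizer = torch.optim.SGD(f.parameters(), lr=0.1)

x = torch.randn(1000, 2)           # {x1, ..., x1000}
y = x.sum(dim=1, keepdim=True)     # toy targets {y1, ..., y1000}

for _ in range(200):               # fit f to the training examples
    loss = ((f(x) - y) ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print(f(torch.tensor([[1.0, 2.0]])))  # should be close to 3.0
```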

Outline

Speech: 你好嗎? → From speech signal to text: 你好嗎?

Speech Recognition

Spoken Content → (Speech Recognition) → Text

A function f maps a speech signal to text: f(speech) = “How are you” / “Hi” / “I am fine” / “Good bye” ……

Typical Deep Learning Approach

• The hierarchical structure of human languages

Text:      what do you think
Phoneme:   hh w aa t d uw y uw th ih ng k
Tri-phone: …… t-d+uw d-uw+y uw-y+uw y-uw+th ……
State:     …… t-d+uw1 t-d+uw2 t-d+uw3 d-uw+y1 d-uw+y2 d-uw+y3 ……

Typical Deep Learning Approach

• The first stage of speech recognition
• Classification: input → acoustic feature, output → state

Determine the state each acoustic feature in the sequence belongs to, e.g. states: a a a b b c c ……

Typical Deep Learning Approach

DNN input: one acoustic feature xi

DNN output: probability of each state, P(a|xi), P(b|xi), P(c|xi), ……

Size of output layer = number of states
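A minimal sketch of such a first-stage classifier, assuming PyTorch; the feature dimension and number of states below are placeholders, not values from the talk:

```python
import torch.nn as nn

FEATURE_DIM = 39   # e.g. one MFCC-based acoustic feature (placeholder value)
N_STATES = 3000    # size of output layer = number of states (placeholder)

# One acoustic feature x_i in, the distribution P(state | x_i) out.
dnn = nn.Sequential(
    nn.Linear(FEATURE_DIM, 512), nn.ReLU(),
    nn.Linear(512, 512), nn.ReLU(),
    nn.Linear(512, N_STATES),
    nn.Softmax(dim=-1),   # P(a|x_i), P(b|x_i), P(c|x_i), ...
)
```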

CNN

Very Deep

MSR: Human Parity!

• Microsoft speech recognition reaches a major milestone: conversational recognition at human parity! (2016.10)

• https://www.bnext.com.tw/article/41414/bn-2016-10-19-020437-216

• Dong Yu, Wayne Xiong, Jasha Droppo, Andreas Stolcke , Guoli Ye, Jinyu Li , Geoffrey Zweig, “Deep Convolutional Neural Networks with Layer-wise Context Expansion and Attention”, Interspeech 2016

• IBM vs Microsoft: 'Human parity' speech recognition record changes hands again (2017.03)

• http://www.zdnet.com/article/ibm-vs-microsoft-human-parity-speech-recognition-record-changes-hands-again/

• George Saon, Gakuto Kurata, Tom Sercu, Kartik Audhkhasi, Samuel Thomas, Dimitrios Dimitriadis, Xiaodong Cui, Bhuvana Ramabhadran, Michael Picheny, Lynn-Li Lim, Bergul Roomi, Phil Hall, “English Conversational Telephone Speech Recognition by Humans and Machines”, arXiv preprint, 2017

Microsoft (2016): Machine 5.9% vs. Human 5.9% (word error rate)

IBM (2017): Machine 5.5% vs. Human 5.1%

End-to-end Approach - Connectionist Temporal Classification (CTC)

• Connectionist Temporal Classification (CTC) [Alex Graves, ICML’06][Alex Graves, ICML’14][Haşim Sak, Interspeech’15][Jie Li, Interspeech’15][Andrew Senior, ASRU’15]

Input: acoustic features (vector sequence)

Output: character sequence, e.g. 好 好 好 棒 棒 棒 棒 棒 → trimming → “好棒”

Problem? Why can’t the output be “好棒棒”?

End-to-end Approach - Connectionist Temporal Classification (CTC)

• Connectionist Temporal Classification (CTC) [Alex Graves, ICML’06][Alex Graves, ICML’14][Haşim Sak, Interspeech’15][Jie Li, Interspeech’15][Andrew Senior, ASRU’15]

Add an extra symbol “φ” representing “null”:

好 φ φ 棒 φ φ φ φ → “好棒”
好 φ φ 棒 φ 棒 φ φ → “好棒棒”
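The collapsing rule fits in a few lines. A minimal pure-Python sketch (not the original implementation): merge consecutive repeats, then delete φ, so a repeated character survives only when a φ separates its copies:

```python
def ctc_collapse(frames, blank="φ"):
    """Merge consecutive repeated symbols, then drop the blank."""
    merged = []
    for symbol in frames:
        if not merged or symbol != merged[-1]:
            merged.append(symbol)
    return "".join(s for s in merged if s != blank)

print(ctc_collapse("好φφ棒φφφφ"))   # -> 好棒
print(ctc_collapse("好φφ棒φ棒φφ"))  # -> 好棒棒
```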

More Approaches

• DNN + structured SVM [Meng & Lee, ICASSP 10]
• DNN + structured DNN [Liao & Lee, ASRU 15]
• Neural Turing Machine [Ko & Lee, ICASSP 17]

(Figure: (a) use DNN phone posteriors as acoustic vectors; (b) structured SVM computing F1(x, y; θ1); (c) structured DNN computing F2(x, y; θ2). The models map an acoustic vector sequence x to a phoneme label sequence y through feature extraction Ψ(x, y) and hidden layers h1 … hL.)

Outline

Speech: 你好嗎? → From speech signal to text → Understanding a word

Word Embedding

• Machine learns the meaning of words from reading a lot of documents without supervision

(Figure: 2-D visualization of word vectors; dog, cat, and rabbit cluster together, as do jump and run, and flower and tree.)

Word Embedding

• Machine learns the meaning of words from reading a lot of documents without supervision

• A word can be understood by its context

“蔡英文 520宣誓就職” (Tsai Ing-wen was sworn in on May 20)

“馬英九 520宣誓就職” (Ma Ying-jeou was sworn in on May 20)

→ 蔡英文 and 馬英九 are something very similar.

“You shall know a word by the company it keeps.”

Demo

• Machine learns the meaning of words from reading a lot of documents without supervision

Can the machine learn PTT netizen slang?

Words semantically closest to 「好棒」 (“great”): 超讚、真不錯、真好、好有趣、好感動

Words semantically closest to 「好棒棒」 (sarcastic “great”): 不就好棒棒、阿不就好棒、好清高、好高尚、不就好棒

Words semantically closest to 「廢宅」 (“useless shut-in”): 宅宅、臭宅、魯宅、魯蛇、窮酸宅

Words semantically closest to 「本魯」 (self-deprecating “this loser”): 小魯、魯妹、魯蛇小弟、魯弟、小弟

Can the machine learn PTT netizen slang?

V(魯夫) − V(海賊王) ≈ V(鳴人) − V(火影忍者)

魯夫 : 海賊王 = 鳴人 : ?  Ans: 火影忍者  (Luffy : One Piece = Naruto : ? Ans: Naruto)

魯蛇 : loser = 溫拿 : ?  Ans: winner

魯蛇 : 窮 = 溫拿 : ?  Ans: 有錢  (loser : poor = winner : ? Ans: rich)

研究生 : 期刊 = 漫畫家 : ?  Ans: 少年Jump  (grad student : journal = manga artist : ? Ans: Shonen Jump)
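A minimal sketch of how such queries are run, assuming gensim word vectors pre-trained on a PTT corpus (the vector file name is hypothetical):

```python
from gensim.models import KeyedVectors

wv = KeyedVectors.load("ptt_word2vec.kv")  # hypothetical pre-trained vectors

# Nearest neighbours, e.g. the words closest in meaning to 好棒
print(wv.most_similar("好棒", topn=5))

# 魯夫 : 海賊王 = 鳴人 : ?  solved as V(海賊王) - V(魯夫) + V(鳴人)
print(wv.most_similar(positive=["海賊王", "鳴人"], negative=["魯夫"], topn=1))
```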

Outline

Speech: 你好嗎? → From speech signal to text → Understanding a word → Understanding a sentence

Sentiment Analysis

A sentence such as 我 覺 得 太 糟 了 (“I think it is terrible”) is fed in character by character.

PTT movie-review ratings: 超好雷 (very positive), 好雷 (positive), 普雷 (so-so), 負雷 (negative), 超負雷 (very negative)

Training examples:

看了這部電影覺得很高興 …… (Watching this movie made me happy ……) → Positive (正雷)

這部電影太糟了 …… (This movie is terrible ……) → Negative (負雷)

這部電影很棒 …… (This movie is great ……) → Positive (正雷)

Recurrent Neural Network (RNN)

• Recurrent structure: usually used when the input is a sequence

h1 = f(h0, x1), h2 = f(h1, x2), h3 = f(h2, x3), y = g(h3)

No matter how long the input sequence is, we only need one function f:

ht = f(ht-1, xt)

f can be a simple recurrent unit, an LSTM, or a GRU.
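A minimal numpy sketch of this recurrence, with a single tanh unit standing in for f (an LSTM or GRU would replace step below):

```python
import numpy as np

rng = np.random.default_rng(0)
W_h = rng.normal(size=(4, 4))   # recurrent weights
W_x = rng.normal(size=(4, 3))   # input weights

def step(h_prev, x_t):
    """The shared function f: h_t = f(h_{t-1}, x_t)."""
    return np.tanh(W_h @ h_prev + W_x @ x_t)

h = np.zeros(4)                       # h0
for x_t in rng.normal(size=(6, 3)):   # a length-6 input sequence
    h = step(h, x_t)                  # the same f at every time step
# h now summarizes the sequence; a final function g maps it to the output y.
```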

Sentiment Analysis

• It is bad. → 0.05
• It is not bad. → 0.90
• AI is hard to learn, but it is powerful. → 0.86
• AI is powerful, but it is hard to learn. → 0.35
• AI is powerful even though it is hard to learn. → 0.73

A smaller number means more negative; a larger number means more positive.

Thanks to 陳冠宇 for providing the experimental results.

Outline

Speech: 你好嗎? → From speech signal to text → Understanding a word → Understanding a sentence → Understanding a whole dialogue

New task for Machine Comprehension of Spoken Content

• TOEFL Listening Comprehension Test by Machine

Question: “ What is a possible origin of Venus’ clouds? ”

Audio Story:

Choices:

(A) gases released as a result of volcanic activity

(B) chemical reactions caused by high surface temperatures

(C) bursts of radio energy from the planet's surface

(D) strong winds that blow dust into the atmosphere

(The original story is 5 min long.)

New task for Machine Comprehension of Spoken Content

• TOEFL Listening Comprehension Test by Machine

Question: “what is a possible origin of Venus’ clouds?”

Audio Story (ASR transcriptions) + Question + 4 Choices → Neural Network → answer, e.g. (A)

Using previous exams to train the network.

Model Architecture

Question: “what is a possible origin of Venus’ clouds?” → Semantic Analysis → Question Semantics

Audio Story → Speech Recognition → Semantic Analysis, with Attention over the transcription, e.g.:

“…… It be quite possible that this be due to volcanic eruption because volcanic eruption often emit gas. If that be the case volcanism could very well be the root cause of Venus 's thick cloud cover. And also we have observe burst of radio energy from the planet 's surface. These burst be similar to what we see when volcano ……”

Answer: select the choice most similar to the answer.

The whole model is learned end-to-end.

Model Architecture - Attention Mechanism

The story (through ASR) is segmented into sentences: words w1 w2 … w8 grouped into Sentence 1, Sentence 2, …, each encoded as a vector St.

Module for Vector Representation (VecRep): a bi-directional GRU reads a word sequence W1 W2 … WT, producing forward outputs yf(1) … yf(T) and backward outputs yb(1) … yb(T).

VQ: vector representation for the question (understand the question), formed by concatenating the output of the last hidden layer of the bi-directional GRU in both directions, i.e. yb(1) and yf(T).

St: vector representation for story sentence t, formed by concatenating the output of the hidden layer at each time step.

αt = (VQ · St) / (|VQ| |St|)   (similarity score)

VS = Σt αt St, summing over the story sentences (t = 1 … 8 in the figure); VS considers both the question and the story through the attention weights α.
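A minimal numpy sketch of this attention step, with VQ and the sentence vectors St as plain arrays (the toy vectors are illustrative):

```python
import numpy as np

def attend(v_q, sentences):
    """V_S = sum_t alpha_t * S_t, alpha_t = cosine similarity of V_Q and S_t."""
    s = np.stack(sentences)
    alpha = s @ v_q / (np.linalg.norm(s, axis=1) * np.linalg.norm(v_q))
    return alpha @ s

rng = np.random.default_rng(0)
v_q = rng.normal(size=8)                 # question vector V_Q
story = list(rng.normal(size=(4, 8)))    # four sentence vectors S_1..S_4
v_s = attend(v_q, story)                 # story vector V_S
```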

Model Architecture

Attention mechanism recap:

1. Process the question with the VecRep module to get VQ.
2. Get VS through the attention module (Att) over the story (through ASR).
3. To keep the question information, add VQ and VS.
4. A “hop” means the machine considers the question and story jointly once; do more hops (hop 1, hop 2, …, hop n) to consider the story again.
5. Get the four choice representations VA, VB, VC, VD through the VecRep module.
6. Compare the similarity between each choice and VQn, e.g. scores 0.6, 0.1, 0.2, 0.1 for choices A to D, and take the choice with the highest score as the answer (see the sketch after this list).
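Combining the pieces, a minimal sketch of the n-hop loop, reusing attend from the previous sketch; scoring the choices by cosine similarity is an assumption:

```python
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def answer(v_q, story, choices, n_hops=2):
    for _ in range(n_hops):              # each hop: question + story jointly once
        v_q = v_q + attend(v_q, story)   # keep question info: add V_Q and V_S
    scores = [cosine(v_q, c) for c in choices]   # similarity to V_A..V_D
    return int(np.argmax(scores))        # index of the highest-scoring choice
```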

Sentence Representation

Alternatives for encoding a sentence w1 w2 w3 w4 into a vector:

• Bi-directional RNN
• Tree-structured Neural Network
• Attention on all phrases

Experimental Results

(Figure: accuracy (%) of the random baseline and several naïve approaches.)

Example naïve approach (sketched below):
1. Find the paragraph containing the most key terms in the question.
2. Select the choice containing the most key terms in that paragraph.
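A minimal sketch of this naïve baseline, assuming simple whitespace tokenization; story is a list of paragraph strings:

```python
def naive_answer(question, story, choices):
    def shared_terms(text, terms):
        return len(terms & set(text.lower().split()))

    key_terms = set(question.lower().split())
    # 1. the paragraph containing the most key terms in the question
    paragraph = max(story, key=lambda p: shared_terms(p, key_terms))
    # 2. the choice containing the most key terms in that paragraph
    para_terms = set(paragraph.lower().split())
    return max(choices, key=lambda c: shared_terms(c, para_terms))
```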

Experimental Results

(Figure: accuracy (%); the random baseline and naïve approaches, vs. 42.2% [Tseng, Shen, Lee, Lee, Interspeech’16] and 48.8% [Fan, Hsu, Lee, Lee, SLT’16].)

Outline

Speech: 你好嗎? → From speech signal to text → Understanding a word / a sentence / a whole dialogue → The machine decides what to say (or do): 我很好

How can we make machines converse with people?

• Humans know how to talk to each other; can we write the rules down and have the machine follow them? (Hand-crafted rules)

• For example, you want to build a chat-bot:

• If the input contains “推薦” (recommend) and “音樂” (music), then the chat-bot says “我推薦五月天” (“I recommend Mayday”).

• You can say “請推薦我一些音樂” (“Please recommend me some music”) or “你推薦誰的音樂?” (“Whose music do you recommend?”). Smart?

• What if someone says “你不推薦誰的音樂?” (“Whose music do you NOT recommend?”) ……

• Problems:

• The rules of human conversation are far too complex to enumerate.

• The chat-bot has no “free style”; every response is scripted in advance.

Is deciding what to say difficult?

• When you type in a sentence, how many possible response sentences does the machine face?

• A Chinese sentence is composed of a string of Chinese characters.

• There are roughly 4,000 commonly used Chinese characters.

• Suppose the sentence length is fixed at 15 characters.

• Ans: 4000 to the 15th power ≈ 1.07 × 10^54 (checked below)

• Far more than the number of water molecules in all the oceans on Earth.

If a machine can answer a sentence correctly, it is not finding a needle in a haystack; it is picking one particular water molecule out of the ocean.
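The count is easy to verify: 4000^15 = 4^15 × 10^45, and 4^15 = 2^30 ≈ 1.07 × 10^9.

```python
print(f"{4000 ** 15:.2e}")  # -> 1.07e+54 possible 15-character sentences
```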

Sequence-to-sequence Learning

• Sequence-to-sequence learning: both the input and the output are sequences, possibly with different lengths.

Seq2seq: speech signal → 你 好 嗎 (speech recognition)

Seq2seq: 機 器 學 習 → Machine Learning (translation)

Seq2seq: 你 好 嗎 → 我 很 好 (dialogue, trained on TV series and movie subtitles; freestyle!)

Sequence-to-sequence Learning

RNN Encoder (can read Chinese) → semantic vector (語義; a language shared by the encoder and decoder?) → RNN Decoder (can write English)

Input sequence (Chinese) → output sequence (English)

Sequence-to-sequence Learning

• Both the input and the output are sequences with different lengths. → Sequence-to-sequence learning

• E.g. machine translation (machine learning → 機器學習)

The encoder reads “machine learning” word by word; its final hidden vector contains all information about the input sequence.

Sequence-to-sequence Learning

• Both the input and the output are sequences with different lengths. → Sequence-to-sequence learning

• E.g. machine translation (machine learning → 機器學習)

Without a stop symbol, the decoder keeps generating: 機 器 學 習 慣 性 …… It doesn’t know when to stop.

Sequence-to-sequence Learning

推 tlkagk: =========斷========== (Word-chain replies, 接龍推文, are a playful PTT custom, similar to but distinct from 推齊, in which a reply continues the words of the reply above it; the exact origin of the game is unknown. Source: 鄉民百科)

Sequence-to-sequence Learning

• Both the input and the output are sequences with different lengths. → Sequence-to-sequence learning

• E.g. machine translation (machine learning → 機器學習)

Add a symbol “===” (斷, “break”) [Ilya Sutskever, NIPS’14][Dzmitry Bahdanau, arXiv’15]: the decoder now outputs 機 器 學 習 === and stops there.
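A minimal sketch of decoding with the stop symbol; encode and decode_step stand in for a trained encoder/decoder pair (both hypothetical):

```python
STOP = "==="   # the extra symbol marking the end of the output sequence

def translate(source_words, encode, decode_step, max_len=50):
    state = encode(source_words)        # vector with all input information
    word, output = "<BOS>", []
    for _ in range(max_len):
        word, state = decode_step(word, state)  # most probable next word
        if word == STOP:                # the decoder learned when to stop
            break
        output.append(word)
    return output
```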

Chat-bot

• Both the input and the output are sequences with different lengths. → Sequence-to-sequence learning

Training data: TV series and movie subtitles. Source of image: https://github.com/farizrahman4u/seq2seq

Chat-bot with GAN

Conditional GAN: the Chatbot (an encoder-decoder, En/De) takes the input sentence/history h and generates a response sentence x; a Discriminator, trained on human dialogues, judges whether the pair (h, x) is real or fake.

Jiwei Li, Will Monroe, Tianlin Shi, Sébastien Jean, Alan Ritter, Dan Jurafsky, “Adversarial Learning for Neural Dialogue Generation”, arXiv preprint, 2017

Ref: 一日搞懂 GAN (“Understand GANs in one day”) https://www.slideshare.net/tw_dsconf/ss-78795326

Example Results (thanks to 段逸林 for providing the experimental results)

Towards Characterization

• Thanks to 王耀賢 for providing the experimental results

• https://github.com/yaushian/simple_sentiment_dialogue

Input: “How do you feel?” → “I am good.” / “I am so embarrassed.”

Input: “I love you.” → “I love you!” / “I wish I wish I wish I could go.”

Negative sentence to positive sentence:

i wish you could be here -> you could be here

it's not a good idea -> it's good idea

i miss you -> i love you

i don't love you -> i love you

i can't do that -> i can do that

i feel so sad -> i happy

it's a bad day -> it's a good day

it's a dummy day -> it's a great day

sorry for doing such a horrible thing -> thanks for doing a great thing

my doggy is sick -> my doggy is my doggy

my little doggy is sick -> my little doggy is my little doggy

Cycle GAN

Thanks to 王耀賢 for providing the experimental results.

Outline

Speech: 你好嗎? → From speech signal to text → Understanding a word / a sentence / a whole dialogue → Summarization (摘要) → The machine decides what to say (or do)

Summarization

Audio file to be summarized → “This is the summary.”

Extractive summaries: select the most informative segments (e.g. “…… deep learning is powerful ……”) to form a compact version. [Lee, et al., Interspeech 12] [Lee, et al., ICASSP 13] [Shiang, et al., Interspeech 13]

The machine does not write summaries in its own words.

Abstractive Summarization

• Now the machine can do abstractive summarization (write summaries in its own words)

Training data: documents paired with human-written titles (Title 1, Title 2, Title 3, ……)

The title is then generated by the machine in its own words, without hand-crafted rules.

Abstractive Summarization

• Input: transcriptions of audio from automatic speech recognition (ASR); output: title

RNN Encoder: read through the input w1 w2 w3 w4 (hidden states h1 h2 h3 h4); an RNN generator then emits the title words wA wB …… (states z1 z2 ……).

Abstractive Summarization

Document: 刑事局偵四隊今天破獲一個中日跨國竊車集團,根據調查國內今年七月開放重型機車上路後 …… (The Criminal Investigation Bureau's Fourth Squad today broke up a Taiwan-Japan cross-border vehicle-theft ring; according to the investigation, since heavy motorcycles were allowed on the road this July ……)

Human: 跨國竊車銷贓情形猖獗值得國內警方注意 (Cross-border vehicle theft and fencing are rampant and deserve the attention of domestic police)

Machine: 刑事局破獲中國車集 (Criminal Investigation Bureau busts Chinese vehicle ring)

Document: 據印度報業托拉斯報道印度北方邦22日發生一起小公共汽車炸彈爆炸事件造成15人死亡3人受傷 …… (According to the Press Trust of India, a minibus bombing in Uttar Pradesh on the 22nd killed 15 people and injured 3 ……)

Human: 印度汽車炸彈爆炸造成15人死亡 (Indian car bombing kills 15)

Machine: 印度發生汽車爆炸事件 (Car bombing occurs in India)

Thanks to 盧柏儒 for providing the experimental results.

Unsupervised Abstractive Summarization

Network 1 compresses the long document (長文) into a short text (短文) used as the summary; network 2 reconstructs the original long document from it, e.g. 台灣大學 … → 灣學 … → 台灣大學 …

(Image source: http://laughl.com/archives/7131/【問號哪裡來】「黑人問號哥」原來大有來頭?!/)

“The machines are plotting to rule humanity ……”

Unsupervised Abstractive Summarization

Adding network 3: a discriminator trained on a large number of human-written sentences judges whether the short text is human-written; network 1 is trained so that network 3 believes its output is human-written, e.g. 台灣大學 … → 台大 … → 台灣大學 …

Unsupervised Abstractive Summarization

Thanks to 王耀賢 for providing the experimental results.

Outline

Speech: 你好嗎? → From speech signal to text → Understanding a word / a sentence / a whole dialogue → The machine decides what to say (or do) → Can the machine draw?

“Girl with red hair and red eyes”

“Girl with yellow ribbon”

Data Collection

http://konachan.net/post/show/239400/aikatsu-clouds-flowers-hikami_sumire-hiten_goane_r

Thanks to TAs 曾柏翔 and 樊恩宇 for collecting the data.

Released Training Data

• Data download link:

https://drive.google.com/open?id=0BwJmB7alR-AvMHEtczZZN0EtdzQ

• Anime Dataset:

• training data: 33.4k (image, tags) pairs; images are 96 × 96

• Training tags file format (tags.csv), with a parsing sketch below:

• img_id <comma> tag1 <colon> #_post <tab> tag2 <colon> …

• Example tags: blue eyes, red hair, short hair
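A minimal sketch of reading that format (my own parser, following the field layout described above):

```python
def load_tags(path="tags.csv"):
    """Map img_id -> list of tags, dropping the per-tag post counts."""
    data = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            img_id, rest = line.rstrip("\n").split(",", 1)
            # each tab-separated field looks like "tag:#_post"
            data[img_id] = [field.rsplit(":", 1)[0]
                            for field in rest.split("\t")]
    return data

# e.g. load_tags()["0"] might give ["blue eyes", "red hair", "short hair"]
```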

Conditional GAN

• Draw anime character portraits from a text description

MLDS Homework 3; TA in charge: 曾柏翔

Black hair, blue eyes

Blue hair, green eyes

Red hair, long hair

Concluding Remarks

Speech: 你好嗎? → From speech signal to text: 你好嗎? → Understanding a word, a sentence, and a whole dialogue → The machine decides what to say (or do): 我很好

Everything is based on Deep Learning
