Towards Machine Comprehension of Spoken Content · 2017-11-09


Teaching Machines to Understand Human Speech (讓機器聽懂人說話)

Hung-yi Lee

Outline

Speech: 你好嗎? (“How are you?”)

From speech signal to text: 你好嗎?

Understanding a word

Understanding a sentence

Understanding a whole dialogue

The machine decides what to say (or do): 我很好 (“I am fine”)

Everything is based on Deep Learning

Deep Learning in One Slide

Networks are functions: y = f(x), where the input x and output y can be a vector, a matrix, or a sequence.

Many kinds of networks:

• Fully connected feedforward network
• Convolutional neural network (CNN)
• Recurrent neural network (RNN)

How to find the function?

Given examples of inputs/outputs as training data: {(x1,y1),(x2,y2), ……, (x1000,y1000)} (sketched below)
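To make “networks are functions” concrete, here is a minimal sketch, assuming PyTorch; the toy data and architecture are illustrative, not from the talk:

```python
# Define a function f and "find" it from example (x, y) pairs.
import torch
import torch.nn as nn

f = nn.Sequential(nn.Linear(2, 16), nn.ReLU(), nn.Linear(16, 1))
optimizer = torch.optim.SGD(f.parameters(), lr=0.1)

x = torch.randn(1000, 2)           # {x1, ..., x1000}
y = x.sum(dim=1, keepdim=True)     # toy targets {y1, ..., y1000}

for _ in range(200):               # fit f to the training examples
    loss = ((f(x) - y) ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print(f(torch.tensor([[1.0, 2.0]])))  # should be close to 3.0
```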

Outline

Speech: 你好嗎? → From speech signal to text: 你好嗎?

Speech Recognition

Spoken Content → (Speech Recognition) → Text

A function f maps a speech signal to text: f(speech) = “How are you” / “Hi” / “I am fine” / “Good bye” ……

Typical Deep Learning Approach

• The hierarchical structure of human languages

Text:      what do you think
Phoneme:   hh w aa t d uw y uw th ih ng k
Tri-phone: …… t-d+uw d-uw+y uw-y+uw y-uw+th ……
State:     …… t-d+uw1 t-d+uw2 t-d+uw3 d-uw+y1 d-uw+y2 d-uw+y3 ……

Typical Deep Learning Approach

• The first stage of speech recognition
• Classification: input → acoustic feature, output → state

Determine the state each acoustic feature in the sequence belongs to, e.g. states: a a a b b c c ……

Typical Deep Learning Approach

DNN input: one acoustic feature xi

DNN output: probability of each state, P(a|xi), P(b|xi), P(c|xi), ……

Size of output layer = number of states
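A minimal sketch of such a first-stage classifier, assuming PyTorch; the feature dimension and number of states below are placeholders, not values from the talk:

```python
import torch.nn as nn

FEATURE_DIM = 39   # e.g. one MFCC-based acoustic feature (placeholder value)
N_STATES = 3000    # size of output layer = number of states (placeholder)

# One acoustic feature x_i in, the distribution P(state | x_i) out.
dnn = nn.Sequential(
    nn.Linear(FEATURE_DIM, 512), nn.ReLU(),
    nn.Linear(512, 512), nn.ReLU(),
    nn.Linear(512, N_STATES),
    nn.Softmax(dim=-1),   # P(a|x_i), P(b|x_i), P(c|x_i), ...
)
```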

CNN

Very Deep

MSR: Human Parity!

• Microsoft speech recognition reaches a major milestone: conversational recognition at human parity! (2016.10)

• https://www.bnext.com.tw/article/41414/bn-2016-10-19-020437-216

• Dong Yu, Wayne Xiong, Jasha Droppo, Andreas Stolcke , Guoli Ye, Jinyu Li , Geoffrey Zweig, “Deep Convolutional Neural Networks with Layer-wise Context Expansion and Attention”, Interspeech 2016

• IBM vs Microsoft: 'Human parity' speech recognition record changes hands again (2017.03)

• http://www.zdnet.com/article/ibm-vs-microsoft-human-parity-speech-recognition-record-changes-hands-again/

• George Saon, Gakuto Kurata, Tom Sercu, Kartik Audhkhasi, Samuel Thomas, Dimitrios Dimitriadis, Xiaodong Cui, Bhuvana Ramabhadran, Michael Picheny, Lynn-Li Lim, Bergul Roomi, Phil Hall, “English Conversational Telephone Speech Recognition by Humans and Machines”, arXiv preprint, 2017

Microsoft (2016): Machine 5.9% vs. Human 5.9% (word error rate)

IBM (2017): Machine 5.5% vs. Human 5.1%

End-to-end Approach - Connectionist Temporal Classification (CTC)

• Connectionist Temporal Classification (CTC) [Alex Graves, ICML’06][Alex Graves, ICML’14][Haşim Sak, Interspeech’15][Jie Li, Interspeech’15][Andrew Senior, ASRU’15]

Input: acoustic features (vector sequence)

Output: character sequence, e.g. 好 好 好 棒 棒 棒 棒 棒 → trimming → “好棒”

Problem? Why can’t the output be “好棒棒”?

End-to-end Approach - Connectionist Temporal Classification (CTC)

• Connectionist Temporal Classification (CTC) [Alex Graves, ICML’06][Alex Graves, ICML’14][Haşim Sak, Interspeech’15][Jie Li, Interspeech’15][Andrew Senior, ASRU’15]

Add an extra symbol “φ” representing “null”:

好 φ φ 棒 φ φ φ φ → “好棒”
好 φ φ 棒 φ 棒 φ φ → “好棒棒”
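The collapsing rule fits in a few lines. A minimal pure-Python sketch (not the original implementation): merge consecutive repeats, then delete φ, so a repeated character survives only when a φ separates its copies:

```python
def ctc_collapse(frames, blank="φ"):
    """Merge consecutive repeated symbols, then drop the blank."""
    merged = []
    for symbol in frames:
        if not merged or symbol != merged[-1]:
            merged.append(symbol)
    return "".join(s for s in merged if s != blank)

print(ctc_collapse("好φφ棒φφφφ"))   # -> 好棒
print(ctc_collapse("好φφ棒φ棒φφ"))  # -> 好棒棒
```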

More Approaches

• DNN + structured SVM [Meng & Lee, ICASSP 10]
• DNN + structured DNN [Liao & Lee, ASRU 15]
• Neural Turing Machine [Ko & Lee, ICASSP 17]

(Figure: (a) use DNN phone posteriors as acoustic vectors; (b) structured SVM computing F1(x, y; θ1); (c) structured DNN computing F2(x, y; θ2). The models map an acoustic vector sequence x to a phoneme label sequence y through feature extraction Ψ(x, y) and hidden layers h1 … hL.)

Outline

Speech: 你好嗎? → From speech signal to text → Understanding a word

Word Embedding

• Machine learns the meaning of words from reading a lot of documents without supervision

(Figure: 2-D visualization of word vectors; dog, cat, and rabbit cluster together, as do jump and run, and flower and tree.)

Word Embedding

• Machine learns the meaning of words from reading a lot of documents without supervision

• A word can be understood by its context

“蔡英文 520宣誓就職” (Tsai Ing-wen was sworn in on May 20)

“馬英九 520宣誓就職” (Ma Ying-jeou was sworn in on May 20)

→ 蔡英文 and 馬英九 are something very similar.

“You shall know a word by the company it keeps.”

Demo

• Machine learns the meaning of words from reading a lot of documents without supervision

Can the machine learn PTT netizen slang?

Words semantically closest to 「好棒」 (“great”): 超讚、真不錯、真好、好有趣、好感動

Words semantically closest to 「好棒棒」 (sarcastic “great”): 不就好棒棒、阿不就好棒、好清高、好高尚、不就好棒

Words semantically closest to 「廢宅」 (“useless shut-in”): 宅宅、臭宅、魯宅、魯蛇、窮酸宅

Words semantically closest to 「本魯」 (self-deprecating “this loser”): 小魯、魯妹、魯蛇小弟、魯弟、小弟

Can the machine learn PTT netizen slang?

V(魯夫) − V(海賊王) ≈ V(鳴人) − V(火影忍者)

魯夫 : 海賊王 = 鳴人 : ?  Ans: 火影忍者  (Luffy : One Piece = Naruto : ? Ans: Naruto)

魯蛇 : loser = 溫拿 : ?  Ans: winner

魯蛇 : 窮 = 溫拿 : ?  Ans: 有錢  (loser : poor = winner : ? Ans: rich)

研究生 : 期刊 = 漫畫家 : ?  Ans: 少年Jump  (grad student : journal = manga artist : ? Ans: Shonen Jump)
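A minimal sketch of how such queries are run, assuming gensim word vectors pre-trained on a PTT corpus (the vector file name is hypothetical):

```python
from gensim.models import KeyedVectors

wv = KeyedVectors.load("ptt_word2vec.kv")  # hypothetical pre-trained vectors

# Nearest neighbours, e.g. the words closest in meaning to 好棒
print(wv.most_similar("好棒", topn=5))

# 魯夫 : 海賊王 = 鳴人 : ?  solved as V(海賊王) - V(魯夫) + V(鳴人)
print(wv.most_similar(positive=["海賊王", "鳴人"], negative=["魯夫"], topn=1))
```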

Outline

Speech: 你好嗎? → From speech signal to text → Understanding a word → Understanding a sentence

Sentiment Analysis

A sentence such as 我 覺 得 太 糟 了 (“I think it is terrible”) is fed in character by character.

PTT movie-review ratings: 超好雷 (very positive), 好雷 (positive), 普雷 (so-so), 負雷 (negative), 超負雷 (very negative)

Training examples:

看了這部電影覺得很高興 …… (Watching this movie made me happy ……) → Positive (正雷)

這部電影太糟了 …… (This movie is terrible ……) → Negative (負雷)

這部電影很棒 …… (This movie is great ……) → Positive (正雷)

Recurrent Neural Network (RNN)

• Recurrent structure: usually used when the input is a sequence

h1 = f(h0, x1), h2 = f(h1, x2), h3 = f(h2, x3), y = g(h3)

No matter how long the input sequence is, we only need one function f:

ht = f(ht-1, xt)

f can be a simple recurrent unit, an LSTM, or a GRU.
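A minimal numpy sketch of this recurrence, with a single tanh unit standing in for f (an LSTM or GRU would replace step below):

```python
import numpy as np

rng = np.random.default_rng(0)
W_h = rng.normal(size=(4, 4))   # recurrent weights
W_x = rng.normal(size=(4, 3))   # input weights

def step(h_prev, x_t):
    """The shared function f: h_t = f(h_{t-1}, x_t)."""
    return np.tanh(W_h @ h_prev + W_x @ x_t)

h = np.zeros(4)                       # h0
for x_t in rng.normal(size=(6, 3)):   # a length-6 input sequence
    h = step(h, x_t)                  # the same f at every time step
# h now summarizes the sequence; a final function g maps it to the output y.
```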

Sentiment Analysis

• It is bad. → 0.05
• It is not bad. → 0.90
• AI is hard to learn, but it is powerful. → 0.86
• AI is powerful, but it is hard to learn. → 0.35
• AI is powerful even though it is hard to learn. → 0.73

A smaller number means more negative; a larger number means more positive.

Thanks to 陳冠宇 for providing the experimental results.

Outline

Speech: 你好嗎? → From speech signal to text → Understanding a word → Understanding a sentence → Understanding a whole dialogue

New task for Machine Comprehension of Spoken Content

• TOEFL Listening Comprehension Test by Machine

Question: “ What is a possible origin of Venus’ clouds? ”

Audio Story:

Choices:

(A) gases released as a result of volcanic activity

(B) chemical reactions caused by high surface temperatures

(C) bursts of radio energy from the planet's surface

(D) strong winds that blow dust into the atmosphere

(The original story is 5 min long.)

New task for Machine Comprehension of Spoken Content

• TOEFL Listening Comprehension Test by Machine

Question: “what is a possible origin of Venus’ clouds?”

Audio Story (ASR transcriptions) + Question + 4 Choices → Neural Network → answer, e.g. (A)

Using previous exams to train the network.

Model Architecture

Question: “what is a possible origin of Venus’ clouds?” → Semantic Analysis → Question Semantics

Audio Story → Speech Recognition → Semantic Analysis, with Attention over the transcription, e.g.:

“…… It be quite possible that this be due to volcanic eruption because volcanic eruption often emit gas. If that be the case volcanism could very well be the root cause of Venus 's thick cloud cover. And also we have observe burst of radio energy from the planet 's surface. These burst be similar to what we see when volcano ……”

Answer: select the choice most similar to the answer.

The whole model is learned end-to-end.

Model Architecture - Attention Mechanism

The story (through ASR) is segmented into sentences: words w1 w2 … w8 grouped into Sentence 1, Sentence 2, …, each encoded as a vector St.

Module for Vector Representation (VecRep): a bi-directional GRU reads a word sequence W1 W2 … WT, producing forward outputs yf(1) … yf(T) and backward outputs yb(1) … yb(T).

VQ: vector representation for the question (understand the question), formed by concatenating the output of the last hidden layer of the bi-directional GRU in both directions, i.e. yb(1) and yf(T).

St: vector representation for story sentence t, formed by concatenating the output of the hidden layer at each time step.

αt = (VQ · St) / (|VQ| |St|)   (similarity score)

VS = Σt αt St, summing over the story sentences (t = 1 … 8 in the figure); VS considers both the question and the story through the attention weights α.
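A minimal numpy sketch of this attention step, with VQ and the sentence vectors St as plain arrays (the toy vectors are illustrative):

```python
import numpy as np

def attend(v_q, sentences):
    """V_S = sum_t alpha_t * S_t, alpha_t = cosine similarity of V_Q and S_t."""
    s = np.stack(sentences)
    alpha = s @ v_q / (np.linalg.norm(s, axis=1) * np.linalg.norm(v_q))
    return alpha @ s

rng = np.random.default_rng(0)
v_q = rng.normal(size=8)                 # question vector V_Q
story = list(rng.normal(size=(4, 8)))    # four sentence vectors S_1..S_4
v_s = attend(v_q, story)                 # story vector V_S
```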

Model Architecture

Attention mechanism recap:

1. Process the question with the VecRep module to get VQ.
2. Get VS through the attention module (Att) over the story (through ASR).
3. To keep the question information, add VQ and VS.
4. A “hop” means the machine considers the question and story jointly once; do more hops (hop 1, hop 2, …, hop n) to consider the story again.
5. Get the four choice representations VA, VB, VC, VD through the VecRep module.
6. Compare the similarity between each choice and VQn, e.g. scores 0.6, 0.1, 0.2, 0.1 for choices A to D, and take the choice with the highest score as the answer (see the sketch after this list).
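Combining the pieces, a minimal sketch of the n-hop loop, reusing attend from the previous sketch; scoring the choices by cosine similarity is an assumption:

```python
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def answer(v_q, story, choices, n_hops=2):
    for _ in range(n_hops):              # each hop: question + story jointly once
        v_q = v_q + attend(v_q, story)   # keep question info: add V_Q and V_S
    scores = [cosine(v_q, c) for c in choices]   # similarity to V_A..V_D
    return int(np.argmax(scores))        # index of the highest-scoring choice
```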

Sentence Representation

Alternatives for encoding a sentence w1 w2 w3 w4 into a vector:

• Bi-directional RNN
• Tree-structured Neural Network
• Attention on all phrases

Experimental Results

(Figure: accuracy (%) of the random baseline and several naïve approaches.)

Example naïve approach (sketched below):
1. Find the paragraph containing the most key terms in the question.
2. Select the choice containing the most key terms in that paragraph.
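A minimal sketch of this naïve baseline, assuming simple whitespace tokenization; story is a list of paragraph strings:

```python
def naive_answer(question, story, choices):
    def shared_terms(text, terms):
        return len(terms & set(text.lower().split()))

    key_terms = set(question.lower().split())
    # 1. the paragraph containing the most key terms in the question
    paragraph = max(story, key=lambda p: shared_terms(p, key_terms))
    # 2. the choice containing the most key terms in that paragraph
    para_terms = set(paragraph.lower().split())
    return max(choices, key=lambda c: shared_terms(c, para_terms))
```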

Experimental Results

(Figure: accuracy (%); the random baseline and naïve approaches, vs. 42.2% [Tseng, Shen, Lee, Lee, Interspeech’16] and 48.8% [Fan, Hsu, Lee, Lee, SLT’16].)

Outline

Speech: 你好嗎? → From speech signal to text → Understanding a word / a sentence / a whole dialogue → The machine decides what to say (or do): 我很好

How can we make machines converse with people?

• Humans know how to talk to each other; can we write the rules down and have the machine follow them? (Hand-crafted rules)

• For example, you want to build a chat-bot:

• If the input contains “推薦” (recommend) and “音樂” (music), then the chat-bot says “我推薦五月天” (“I recommend Mayday”).

• You can say “請推薦我一些音樂” (“Please recommend me some music”) or “你推薦誰的音樂?” (“Whose music do you recommend?”). Smart?

• What if someone says “你不推薦誰的音樂?” (“Whose music do you NOT recommend?”) ……

• Problems:

• The rules of human conversation are far too complex to enumerate.

• The chat-bot has no “free style”; every response is scripted in advance.

Is deciding what to say difficult?

• When you type in a sentence, how many possible response sentences does the machine face?

• A Chinese sentence is composed of a string of Chinese characters.

• There are roughly 4,000 commonly used Chinese characters.

• Suppose the sentence length is fixed at 15 characters.

• Ans: 4000 to the 15th power ≈ 1.07 × 10^54 (checked below)

• Far more than the number of water molecules in all the oceans on Earth.

If a machine can answer a sentence correctly, it is not finding a needle in a haystack; it is picking one particular water molecule out of the ocean.
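The count is easy to verify: 4000^15 = 4^15 × 10^45, and 4^15 = 2^30 ≈ 1.07 × 10^9.

```python
print(f"{4000 ** 15:.2e}")  # -> 1.07e+54 possible 15-character sentences
```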

Sequence-to-sequence Learning

• Sequence-to-sequence learning: both the input and the output are sequences, possibly with different lengths.

Seq2seq: speech signal → 你 好 嗎 (speech recognition)

Seq2seq: 機 器 學 習 → Machine Learning (translation)

Seq2seq: 你 好 嗎 → 我 很 好 (dialogue, trained on TV series and movie subtitles; freestyle!)

Sequence-to-sequence Learning

RNN Encoder (can read Chinese) → semantic vector (語義; a language shared by the encoder and decoder?) → RNN Decoder (can write English)

Input sequence (Chinese) → output sequence (English)

Sequence-to-sequence Learning

• Both the input and the output are sequences with different lengths. → Sequence-to-sequence learning

• E.g. machine translation (machine learning → 機器學習)

The encoder reads “machine learning” word by word; its final hidden vector contains all information about the input sequence.

Sequence-to-sequence Learning

• Both the input and the output are sequences with different lengths. → Sequence-to-sequence learning

• E.g. machine translation (machine learning → 機器學習)

Without a stop symbol, the decoder keeps generating: 機 器 學 習 慣 性 …… It doesn’t know when to stop.

Sequence-to-sequence Learning

推 tlkagk: =========斷========== (Word-chain replies, 接龍推文, are a playful PTT custom, similar to but distinct from 推齊, in which a reply continues the words of the reply above it; the exact origin of the game is unknown. Source: 鄉民百科)

Sequence-to-sequence Learning

• Both the input and the output are sequences with different lengths. → Sequence-to-sequence learning

• E.g. machine translation (machine learning → 機器學習)

Add a symbol “===” (斷, “break”) [Ilya Sutskever, NIPS’14][Dzmitry Bahdanau, arXiv’15]: the decoder now outputs 機 器 學 習 === and stops there.
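A minimal sketch of decoding with the stop symbol; encode and decode_step stand in for a trained encoder/decoder pair (both hypothetical):

```python
STOP = "==="   # the extra symbol marking the end of the output sequence

def translate(source_words, encode, decode_step, max_len=50):
    state = encode(source_words)        # vector with all input information
    word, output = "<BOS>", []
    for _ in range(max_len):
        word, state = decode_step(word, state)  # most probable next word
        if word == STOP:                # the decoder learned when to stop
            break
        output.append(word)
    return output
```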

Chat-bot

• Both the input and the output are sequences with different lengths. → Sequence-to-sequence learning

Training data: TV series and movie subtitles. Source of image: https://github.com/farizrahman4u/seq2seq

Chat-bot with GAN

Conditional GAN: the Chatbot (an encoder-decoder, En/De) takes the input sentence/history h and generates a response sentence x; a Discriminator, trained on human dialogues, judges whether the pair (h, x) is real or fake.

Jiwei Li, Will Monroe, Tianlin Shi, Sébastien Jean, Alan Ritter, Dan Jurafsky, “Adversarial Learning for Neural Dialogue Generation”, arXiv preprint, 2017

Ref: 一日搞懂 GAN (“Understand GANs in one day”) https://www.slideshare.net/tw_dsconf/ss-78795326

Example Results (thanks to 段逸林 for providing the experimental results)

Towards Characterization

• Thanks to 王耀賢 for providing the experimental results

• https://github.com/yaushian/simple_sentiment_dialogue

Input: “How do you feel?” → “I am good.” / “I am so embarrassed.”

Input: “I love you.” → “I love you!” / “I wish I wish I wish I could go.”

Negative sentence to positive sentence:

i wish you could be here -> you could be here

it's not a good idea -> it's good idea

i miss you -> i love you

i don't love you -> i love you

i can't do that -> i can do that

i feel so sad -> i happy

it's a bad day -> it's a good day

it's a dummy day -> it's a great day

sorry for doing such a horrible thing -> thanks for doing a great thing

my doggy is sick -> my doggy is my doggy

my little doggy is sick -> my little doggy is my little doggy

Cycle GAN

Thanks to 王耀賢 for providing the experimental results.

Outline

Speech: 你好嗎? → From speech signal to text → Understanding a word / a sentence / a whole dialogue → Summarization (摘要) → The machine decides what to say (or do)

Summarization

Audio file to be summarized → “This is the summary.”

Extractive summaries: select the most informative segments (e.g. “…… deep learning is powerful ……”) to form a compact version. [Lee, et al., Interspeech 12] [Lee, et al., ICASSP 13] [Shiang, et al., Interspeech 13]

The machine does not write summaries in its own words.

Abstractive Summarization

• Now the machine can do abstractive summarization (write summaries in its own words)

Training data: documents paired with human-written titles (Title 1, Title 2, Title 3, ……)

The title is then generated by the machine in its own words, without hand-crafted rules.

Abstractive Summarization

• Input: transcriptions of audio from automatic speech recognition (ASR); output: title

RNN Encoder: read through the input w1 w2 w3 w4 (hidden states h1 h2 h3 h4); an RNN generator then emits the title words wA wB …… (states z1 z2 ……).

Abstractive Summarization

Document: 刑事局偵四隊今天破獲一個中日跨國竊車集團,根據調查國內今年七月開放重型機車上路後 …… (The Criminal Investigation Bureau's Fourth Squad today broke up a Taiwan-Japan cross-border vehicle-theft ring; according to the investigation, since heavy motorcycles were allowed on the road this July ……)

Human: 跨國竊車銷贓情形猖獗值得國內警方注意 (Cross-border vehicle theft and fencing are rampant and deserve the attention of domestic police)

Machine: 刑事局破獲中國車集 (Criminal Investigation Bureau busts Chinese vehicle ring)

Document: 據印度報業托拉斯報道印度北方邦22日發生一起小公共汽車炸彈爆炸事件造成15人死亡3人受傷 …… (According to the Press Trust of India, a minibus bombing in Uttar Pradesh on the 22nd killed 15 people and injured 3 ……)

Human: 印度汽車炸彈爆炸造成15人死亡 (Indian car bombing kills 15)

Machine: 印度發生汽車爆炸事件 (Car bombing occurs in India)

Thanks to 盧柏儒 for providing the experimental results.

Unsupervised Abstractive Summarization

Network 1 compresses the long document (長文) into a short text (短文) used as the summary; network 2 reconstructs the original long document from it, e.g. 台灣大學 … → 灣學 … → 台灣大學 …

(Image source: http://laughl.com/archives/7131/【問號哪裡來】「黑人問號哥」原來大有來頭?!/)

“The machines are plotting to rule humanity ……”

Unsupervised Abstractive Summarization

Adding network 3: a discriminator trained on a large number of human-written sentences judges whether the short text is human-written; network 1 is trained so that network 3 believes its output is human-written, e.g. 台灣大學 … → 台大 … → 台灣大學 …

Unsupervised Abstractive Summarization

Thanks to 王耀賢 for providing the experimental results.

Outline

Speech: 你好嗎? → From speech signal to text → Understanding a word / a sentence / a whole dialogue → The machine decides what to say (or do) → Can the machine draw?

“Girl with red hair and red eyes”

“Girl with yellow ribbon”

Data Collection

http://konachan.net/post/show/239400/aikatsu-clouds-flowers-hikami_sumire-hiten_goane_r

Thanks to TAs 曾柏翔 and 樊恩宇 for collecting the data.

Released Training Data

• Data download link:

https://drive.google.com/open?id=0BwJmB7alR-AvMHEtczZZN0EtdzQ

• Anime Dataset:

• training data: 33.4k (image, tags) pairs; images are 96 × 96

• Training tags file format (tags.csv), with a parsing sketch below:

• img_id <comma> tag1 <colon> #_post <tab> tag2 <colon> …

• Example tags: blue eyes, red hair, short hair
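A minimal sketch of reading that format (my own parser, following the field layout described above):

```python
def load_tags(path="tags.csv"):
    """Map img_id -> list of tags, dropping the per-tag post counts."""
    data = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            img_id, rest = line.rstrip("\n").split(",", 1)
            # each tab-separated field looks like "tag:#_post"
            data[img_id] = [field.rsplit(":", 1)[0]
                            for field in rest.split("\t")]
    return data

# e.g. load_tags()["0"] might give ["blue eyes", "red hair", "short hair"]
```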

Conditional GAN

• Draw anime character portraits from a text description

MLDS Homework 3; TA in charge: 曾柏翔

Black hair, blue eyes

Blue hair, green eyes

Red hair, long hair

Concluding Remarks

Speech: 你好嗎? → From speech signal to text: 你好嗎? → Understanding a word, a sentence, and a whole dialogue → The machine decides what to say (or do): 我很好

Everything is based on Deep Learning
