towards machine comprehension of spoken content machine... · 2017-11-09 · understand the...
TRANSCRIPT
讓機器聽懂人說話 (Making Machines Understand Human Speech)
Hung-yi Lee
Outline
語音 (speech): 你 好 嗎? ("How are you?")
從語音訊號到文字 (from speech signal to text): 你 好 嗎?
機器決定要說(做)什麼 (the machine decides what to say or do): 我 很 好 ("I am fine")
了解一個詞 (understanding a word)
了解一個句子 (understanding a sentence)
了解一整段對話 (understanding a whole dialogue)
Everything is based on Deep Learning
Deep Learning in One Slide
Neural networks are functions: y = f(x).
Many kinds of networks:
• Fully connected feedforward network (input: vector)
• Convolutional neural network (CNN) (input: matrix)
• Recurrent neural network (RNN) (input: vector sequence)
How to find the function? Given examples of inputs/outputs as training data: {(x1, y1), (x2, y2), ……, (x1000, y1000)}
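The "find the function from training pairs" idea above can be sketched in a few lines. This is a minimal illustration, not the lecture's actual method: the function family is a toy linear model y = w*x + b, fitted by gradient descent on squared error over invented data.

```python
# A minimal sketch of "finding the function" from training pairs
# {(x1, y1), ..., (xN, yN)}: pick w, b minimizing mean squared error.
def fit(pairs, lr=0.01, steps=2000):
    w, b = 0.0, 0.0
    n = len(pairs)
    for _ in range(steps):
        # gradients of mean squared error w.r.t. w and b
        gw = sum(2 * (w * x + b - y) * x for x, y in pairs) / n
        gb = sum(2 * (w * x + b - y) for x, y in pairs) / n
        w -= lr * gw
        b -= lr * gb
    return w, b

# Toy training data generated by the "true" function y = 3x + 1
pairs = [(x, 3 * x + 1) for x in range(10)]
w, b = fit(pairs)  # w, b end up close to 3 and 1
```

A deep network replaces the linear form with stacked nonlinear layers, but the recipe (define a function family, fit its parameters to example input/output pairs) is the same.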
Outline
語音 (speech): 你 好 嗎? ("How are you?")
從語音訊號到文字 (from speech signal to text): 你 好 嗎?
Speech Recognition
Spoken content → text: speech recognition is a function f mapping audio to word sequences, e.g. f(audio) = "How are you" / "Hi" / "I am fine" / "Good bye".
Typical Deep Learning Approach
• The hierarchical structure of human languages
Text: what do you think
Phoneme: hh w aa t d uw y uw th ih ng k
Tri-phone: …… t-d+uw d-uw+y uw-y+uw y-uw+th ……
State: t-d+uw1 t-d+uw2 t-d+uw3, d-uw+y1 d-uw+y2 d-uw+y3
Typical Deep Learning Approach
• The first stage of speech recognition
• Classification: input = acoustic feature, output = state
Determine the state each acoustic feature belongs to, e.g. states a a a b b c c.
DNN input: one acoustic feature xi
DNN output: probability of each state, P(a|xi), P(b|xi), P(c|xi), ……
Size of output layer = number of states
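The first-stage classifier above can be sketched as a tiny feedforward network with a softmax output, one probability per state. The layer sizes, random weights, and the 39-dimensional feature are illustrative assumptions, not the lecture's actual model.

```python
import numpy as np

# Toy sketch of the acoustic-state classifier: input = one acoustic
# feature vector x_i, output = P(state | x_i) for every state.
rng = np.random.default_rng(0)

def dnn_forward(x, W1, b1, W2, b2):
    h = np.maximum(0.0, W1 @ x + b1)      # hidden layer with ReLU
    logits = W2 @ h + b2                  # one logit per state
    e = np.exp(logits - logits.max())     # numerically stable softmax
    return e / e.sum()                    # probabilities over states

n_feat, n_hidden, n_states = 39, 16, 3    # e.g. 39-dim features, states a/b/c
W1 = rng.normal(size=(n_hidden, n_feat)); b1 = np.zeros(n_hidden)
W2 = rng.normal(size=(n_states, n_hidden)); b2 = np.zeros(n_states)

p = dnn_forward(rng.normal(size=n_feat), W1, b1, W2, b2)
```

The output layer's size equals the number of states, exactly as the slide states; in practice the weights would be trained on labeled acoustic frames rather than drawn at random.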
The same idea scales up with CNNs and very deep networks (MSR): Human Parity!
• Microsoft's speech recognition technology hits a major milestone: conversational recognition reaches human parity! (2016.10)
• https://www.bnext.com.tw/article/41414/bn-2016-10-19-020437-216
• Dong Yu, Wayne Xiong, Jasha Droppo, Andreas Stolcke , Guoli Ye, Jinyu Li , Geoffrey Zweig, “Deep Convolutional Neural Networks with Layer-wise Context Expansion and Attention”, Interspeech 2016
• IBM vs Microsoft: 'Human parity' speech recognition record changes hands again (2017.03)
• http://www.zdnet.com/article/ibm-vs-microsoft-human-parity-speech-recognition-record-changes-hands-again/
• George Saon, Gakuto Kurata, Tom Sercu, Kartik Audhkhasi, Samuel Thomas, Dimitrios Dimitriadis, Xiaodong Cui, Bhuvana Ramabhadran, Michael Picheny, Lynn-Li Lim, Bergul Roomi, Phil Hall, “English Conversational Telephone Speech Recognition by Humans and Machines”, arXiv preprint, 2017
Machine 5.9% vs. Human 5.9%
Machine 5.5% vs. Human 5.1%
End-to-end Approach - Connectionist Temporal Classification (CTC)
• Connectionist Temporal Classification (CTC) [Alex Graves, ICML’06][Alex Graves, ICML’14][Haşim Sak, Interspeech’15][Jie Li, Interspeech’15][Andrew Senior, ASRU’15]
Input: acoustic features (vector sequence) → Output: character sequence
Network output per frame: 好 好 好 棒 棒 棒 棒 棒 → after trimming repeated characters: "好棒"
Problem? With plain trimming the output can never be "好棒棒".
End-to-end Approach - Connectionist Temporal Classification (CTC)
• Connectionist Temporal Classification (CTC) [Alex Graves, ICML’06][Alex Graves, ICML’14][Haşim Sak, Interspeech’15][Jie Li, Interspeech’15][Andrew Senior, ASRU’15]
Add an extra symbol "φ" representing "null":
好 φ φ 棒 φ φ φ φ → "好棒"
好 φ φ 棒 φ 棒 φ φ → "好棒棒"
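The CTC decoding rule above (collapse repeats, then drop φ) is simple enough to sketch directly; this is only the post-processing step, not CTC training:

```python
# CTC output post-processing: collapse consecutive repeats, drop the
# blank "φ". With the blank symbol, repeated characters like 棒棒 can
# be represented, which plain trimming cannot distinguish.
def ctc_collapse(frames, blank="φ"):
    out, prev = [], None
    for s in frames:
        if s != prev and s != blank:
            out.append(s)
        prev = s
    return "".join(out)

a = ctc_collapse("好 φ φ 棒 φ φ φ φ".split())   # → "好棒"
b = ctc_collapse("好 φ φ 棒 φ 棒 φ φ".split())  # → "好棒棒"
```

The second frame sequence decodes to "好棒棒" because the φ between the two 棒 frames prevents them from being merged.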
More Approaches
• DNN + structured SVM [Meng & Lee, ICASSP 10]
• DNN + structured DNN [Liao & Lee, ASRU 15]
• Neural Turing Machine [Ko & Lee, ICASSP 17]
[Figure: x = acoustic vector sequence from the speech signal (feature extraction), y = phoneme label sequence (a c b a). (a) Use DNN phone posteriors as acoustic vectors (input layer, hidden layers h1 … hL, output layer, weights W0,0 … W0,L). (b) Structured SVM with joint feature map Ψ(x, y). (c) Structured DNN computing F1(x, y; θ1) and F2(x, y; θ2) with hidden layers h1, h2 and weights W1, W2, …, WL.]
Outline
語音 (speech): 你 好 嗎? ("How are you?")
從語音訊號到文字 (from speech signal to text): 你 好 嗎?
了解一個詞 (understanding a word)
Word Embedding
• Machine learns the meaning of words from reading a lot of documents without supervision
[Embedding space: dog, cat and rabbit cluster together; jump and run cluster together; flower and tree cluster together.]
Word Embedding
• Machine learns the meaning of words from reading a lot of documents without supervision
• A word can be understood by its context:
蔡英文 520宣誓就職 (Tsai Ing-wen was sworn in on May 20)
馬英九 520宣誓就職 (Ma Ying-jeou was sworn in on May 20)
→ 蔡英文 and 馬英九 are something very similar
"You shall know a word by the company it keeps."
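"Know a word by the company it keeps" can be made concrete with plain co-occurrence counts; the three-sentence corpus below is invented for illustration, and real word embeddings (e.g. word2vec) learn dense vectors instead of raw counts.

```python
from collections import Counter

# Tiny invented corpus echoing the slide's example sentences.
corpus = [
    "蔡英文 520 宣誓 就職".split(),
    "馬英九 520 宣誓 就職".split(),
    "小狗 喜歡 跑步".split(),
]

def context_vector(word, sentences, window=2):
    # Represent a word by counts of its neighbours within the window.
    ctx = Counter()
    for sent in sentences:
        for i, w in enumerate(sent):
            if w == word:
                for j in range(max(0, i - window), min(len(sent), i + window + 1)):
                    if j != i:
                        ctx[sent[j]] += 1
    return ctx

def cosine(a, b):
    keys = set(a) | set(b)
    dot = sum(a[k] * b[k] for k in keys)
    na = sum(v * v for v in a.values()) ** 0.5
    nb = sum(v * v for v in b.values()) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

# 蔡英文 and 馬英九 share the context {520, 宣誓}, so they come out similar.
sim = cosine(context_vector("蔡英文", corpus), context_vector("馬英九", corpus))
```

Because the two names appear in identical contexts here, their similarity is 1.0, which is exactly the intuition behind the slide.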
Demo
• Machine learns the meaning of words from reading a lot of documents without supervision
Can machines learn PTT-forum slang?
Words closest in meaning to 「好棒」 ("great"): 超讚、真不錯、真好、好有趣、好感動
Words closest in meaning to 「好棒棒」 (sarcastic "great"): 不就好棒棒、阿不就好棒、好清高、好高尚、不就好棒
Words closest in meaning to 「廢宅」 ("useless shut-in"): 宅宅、臭宅、魯宅、魯蛇、窮酸宅
Words closest in meaning to 「本魯」 (self-deprecating "this loser"): 小魯、魯妹、魯蛇小弟、魯弟、小弟
Word analogies also work: V(魯夫) − V(海賊王) ≈ V(鳴人) − V(火影忍者)
魯夫 : 海賊王 = 鳴人 : ? Ans: 火影忍者
魯蛇 : loser = 溫拿 : ? Ans: winner
魯蛇 : 窮 (poor) = 溫拿 : ? Ans: 有錢 (rich)
研究生 (grad student) : 期刊 (journal) = 漫畫家 (manga artist) : ? Ans: 少年Jump
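The analogy queries above reduce to vector arithmetic: solve "a : b = c : ?" by finding the word nearest to v(b) − v(a) + v(c). The hand-made 2-D vectors below are a toy stand-in for real learned embeddings.

```python
import numpy as np

# Toy vectors (NOT real embeddings): first dimension loosely encodes
# "which franchise", second encodes "character vs. the work itself".
vecs = {
    "魯夫":     np.array([1.0, 0.0]),
    "海賊王":   np.array([1.0, 1.0]),
    "鳴人":     np.array([0.0, 0.0]),
    "火影忍者": np.array([0.0, 1.0]),
    "跑步":     np.array([5.0, 5.0]),   # unrelated distractor word
}

def analogy(a, b, c, vecs):
    # Solve a : b = c : ? via nearest neighbour to v(b) - v(a) + v(c).
    target = vecs[b] - vecs[a] + vecs[c]
    cands = [w for w in vecs if w not in (a, b, c)]
    return min(cands, key=lambda w: np.linalg.norm(vecs[w] - target))

ans = analogy("魯夫", "海賊王", "鳴人", vecs)  # → "火影忍者"
```

With real embeddings trained on PTT text, the same arithmetic recovers the answers shown on the slide.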
Outline
語音 (speech): 你 好 嗎? ("How are you?")
從語音訊號到文字 (from speech signal to text): 你 好 嗎?
了解一個詞 (understanding a word)
了解一個句子 (understanding a sentence)
Sentiment Analysis
Input word sequence: 我 覺 得 太 糟 了 ("I think it's terrible")
Output rating classes (PTT movie-board jargon): 超好雷 (very positive), 好雷 (positive), 普雷 (neutral), 負雷 (negative), 超負雷 (very negative)
Training examples:
看了這部電影覺得很高興 …… ("watching this movie made me happy") → Positive (正雷)
這部電影太糟了 …… ("this movie is terrible") → Negative (負雷)
這部電影很棒 …… ("this movie is great") → Positive (正雷)
Model: an RNN (Recurrent Neural Network)
Recurrent Neural Network
• Recurrent structure: usually used when the input is a sequence
h1 = f(h0, x1), h2 = f(h1, x2), h3 = f(h2, x3), y = g(h3)
No matter how long the input sequence is, we only need one function f:
ht = f(ht−1, xt)
The function f can be an LSTM or a GRU cell.
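The recurrence above is short enough to write out. This is a vanilla tanh cell with random toy weights (an LSTM or GRU would replace it with gated updates); the dimensions are arbitrary.

```python
import numpy as np

# Vanilla RNN sketch: one function f reused at every time step,
# h_t = f(h_{t-1}, x_t), regardless of how long the sequence is.
rng = np.random.default_rng(1)
d_h, d_x = 4, 3
Wh = rng.normal(scale=0.5, size=(d_h, d_h))
Wx = rng.normal(scale=0.5, size=(d_h, d_x))

def f(h_prev, x):
    return np.tanh(Wh @ h_prev + Wx @ x)

def g(h):
    # Read-out producing the final output y from the last hidden state.
    return h.sum()

h = np.zeros(d_h)
xs = [rng.normal(size=d_x) for _ in range(5)]  # a length-5 input sequence
for x in xs:                                   # same f at every step
    h = f(h, x)
y = g(h)
```

For sentiment analysis, g would instead map the final state to a score between 0 (negative) and 1 (positive).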
Sentiment Analysis
• It is bad. → 0.05
• It is not bad. → 0.90
• AI is hard to learn, but it is powerful. → 0.86
• AI is powerful, but it is hard to learn. → 0.35
• AI is powerful even though it is hard to learn. → 0.73
A smaller number means more negative; a larger number means more positive.
(Experimental results courtesy of student 陳冠宇.)
Outline
語音 (speech): 你 好 嗎? ("How are you?")
從語音訊號到文字 (from speech signal to text): 你 好 嗎?
了解一個詞 (understanding a word)
了解一個句子 (understanding a sentence)
了解一整段對話 (understanding a whole dialogue)
New task for Machine Comprehension of Spoken Content
• TOEFL Listening Comprehension Test by Machine
Question: “ What is a possible origin of Venus’ clouds? ”
Audio Story:
Choices:
(A) gases released as a result of volcanic activity
(B) chemical reactions caused by high surface temperatures
(C) bursts of radio energy from the planet's surface
(D) strong winds that blow dust into the atmosphere
(The original story is 5 min long.)
New task for Machine Comprehension of Spoken Content
• TOEFL Listening Comprehension Test by Machine
"what is a possible origin of Venus' clouds?"
Question:
Audio Story (ASR transcriptions) + 4 Choices → Neural Network → answer, e.g. (A)
Using previous exams to train the network.
Model Architecture
Question: "what is a possible origin of Venus' clouds?" → Semantic Analysis → question semantics
Audio Story → Speech Recognition → Semantic Analysis. ASR transcription:
…… It be quite possible that this be due to volcanic eruption because volcanic eruption often emit gas. If that be the case volcanism could very well be the root cause of Venus 's thick cloud cover. And also we have observe burst of radio energy from the planet 's surface. These burst be similar to what we see when volcano ……
Attention over the story produces the answer; select the choice most similar to the answer.
The whole model is learned end-to-end.
Model Architecture - Attention Mechanism
Story (through ASR): words w1 w2 w3 w4 w5 w6 w7 w8 grouped into Sentence 1, Sentence 2, …, each encoded as a sentence vector S1 … S8 by the vector-representation module.
Question w1 w2 … wT → VQ: vector representation for the question (understand the question).
Attention weight (similarity score): $\alpha_t = \dfrac{V_Q \cdot S_t}{|V_Q|\,|S_t|}$
$V_S = \sum_{t=1}^{8} \alpha_t S_t$ : considers both the question and the story with attention weights α.
Module for vector representation: a bi-directional GRU. For the question, concatenate the outputs of the last hidden layer in each direction (yb(1) and yf(T)); for each sentence, concatenate the hidden-layer outputs at every time step (yf(1) yf(2) … yf(T), yb(1) yb(2) … yb(T)).
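The attention step itself, cosine weights followed by a weighted sum of sentence vectors, fits in three lines; the random vectors below stand in for real GRU sentence encodings.

```python
import numpy as np

# Attention sketch: alpha_t = cosine(V_Q, S_t), V_S = sum_t alpha_t * S_t.
rng = np.random.default_rng(2)
V_Q = rng.normal(size=8)        # question vector (toy)
S = rng.normal(size=(8, 8))     # 8 sentence vectors S_1 .. S_8 (toy)

alpha = S @ V_Q / (np.linalg.norm(S, axis=1) * np.linalg.norm(V_Q))
V_S = alpha @ S                 # attention-weighted story vector
```

Sentences whose vectors point in the same direction as the question get larger weights, so V_S summarizes the parts of the story relevant to the question.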
Model Architecture
Question → VecRep module → VQ
Story (through ASR) → attention module (Att) → VS
A hop means the machine considers the question and the story jointly once. To keep the question information, add VQ and VS; do more hops (hop 1, hop 2, …, hop n) to consider the story again.
Attention mechanism: recap of the previous slide.
Choice A, Choice B, Choice C, Choice D → VecRep module → VA, VB, VC, VD
Compare the similarity between each choice and VQn (e.g. 0.6, 0.1, 0.2, 0.1); take the choice with the highest score as the answer.
Sentence Representation
• Bi-directional RNN
• Tree-structured Neural Network
• Attention on all phrases
[Diagram: the words w1 w2 w3 w4 of a sentence are composed into vectors S1 S2 S3 S4.]
Experimental Results
[Chart: accuracy (%)]
• random guessing
• Naïve approaches
Example naïve approach:
1. Find the paragraph containing the most key terms in the question.
2. Select the choice containing the most key terms in that paragraph.
Experimental Results
[Chart: accuracy (%)]
• random guessing
• Naïve approaches
• 42.2% [Tseng, Shen, Lee, Lee, Interspeech'16]
• 48.8% [Fan, Hsu, Lee, Lee, SLT'16]
Outline
語音 (speech): 你 好 嗎? ("How are you?")
從語音訊號到文字 (from speech signal to text): 你 好 嗎?
機器決定要說(做)什麼 (the machine decides what to say or do): 我 很 好 ("I am fine")
了解一個詞 (understanding a word)
了解一個句子 (understanding a sentence)
了解一整段對話 (understanding a whole dialogue)
要怎麼讓機器可以和人對話? (How can we make machines converse with people?)
• People know how to talk to each other; can we write the rules down and have the machine follow them? (Hand-crafted rules)
• For example, you want to build a chat-bot
• If the input contains "推薦" (recommend) and "音樂" (music), then the chat-bot says "我推薦五月天" ("I recommend Mayday")
• You can say "請推薦我一些音樂" ("please recommend me some music") and "你推薦誰的音樂?" ("whose music do you recommend?"). Smart?
• What if someone says "你不推薦誰的音樂?" ("whose music do you NOT recommend?") ……
• Problems:
• The rules of human conversation are far too complex to enumerate
• The chat-bot has no "free style": every response is scripted in advance
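The failure mode of hand-crafted rules described above is easy to demonstrate; this is a sketch of the keyword rule from the slide, not anyone's production chat-bot:

```python
# Hand-crafted-rule chat-bot from the slide: keyword matching only.
def rule_bot(utterance):
    if "推薦" in utterance and "音樂" in utterance:
        return "我推薦五月天"      # canned reply: "I recommend Mayday"
    return "......"

r1 = rule_bot("請推薦我一些音樂")   # fires as intended
r2 = rule_bot("你不推薦誰的音樂?")  # also fires: the rule cannot see
                                    # the negation 不 (NOT)
```

Both inputs trigger the same canned answer, which is exactly why keyword rules cannot scale to real conversation.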
決定要說什麼很難嗎? (Is deciding what to say hard?)
• When you type in a sentence, how many possible responses does the machine have?
• A Chinese sentence is composed of a string of Chinese characters
• There are about 4,000 commonly used Chinese characters
• Suppose the sentence length is fixed at 15 characters
• Ans: 4000 to the 15th power ≈ 1.07 × 10^54
• Far more than the number of water molecules in all the oceans on Earth
If a machine can answer a sentence correctly, it is not finding a needle in a haystack; it is picking one particular water molecule out of the ocean.
Sequence-to-sequence Learning
• Sequence-to-sequence learning: both the input and the output are sequences, possibly of different lengths.
• 語音辨識 (speech recognition): Seq2seq(audio) → 機 器 學 習
• 翻譯 (translation): Seq2seq(機 器 學 習) → "Machine Learning"
• 對話 (dialogue): Seq2seq(你 好 嗎) → 我 很 好, trained on TV series and movie scripts (電視影集、電影台詞)
• freestyle!
Sequence-to-sequence Learning
RNN Encoder (會讀中文: learns to read Chinese) reads the input sequence (Chinese) and compresses its meaning (語義) into a vector; the RNN Decoder (會寫英文: learns to write English) generates the output sequence (English) from that vector. (Is the vector a language shared by the encoder and the decoder?)
Sequence-to-sequence Learning
• Both the input and the output are sequences with different lengths → sequence-to-sequence learning
• E.g. machine translation (machine learning → 機器學習)
The encoder reads "machine" and "learning"; its final hidden vector contains all the information about the input sequence.
Sequence-to-sequence Learning
• Both the input and the output are sequences with different lengths → sequence-to-sequence learning
• E.g. machine translation (machine learning → 機器學習)
The decoder generates 機 器 學 習 慣 性 …… and keeps going: it doesn't know when to stop.
Sequence-to-sequence Learning
推 tlkagk: =========斷==========
("接龍" chain-reply pushes are a playful PTT custom, somewhat similar to 推齊 but different: each push continues the text of the previous one to spell out a longer message. The exact origin of the game is unknown. From 鄉民百科.)
Sequence-to-sequence Learning
• Both the input and the output are sequences with different lengths → sequence-to-sequence learning
• E.g. machine translation (machine learning → 機器學習)
• Add a symbol "===" (斷, "break"): the decoder now generates 機 器 學 習 === and stops.
[Ilya Sutskever, NIPS'14][Dzmitry Bahdanau, arXiv'15]
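The role of the "===" symbol can be sketched as a greedy decoding loop; the `step` function below is a deterministic stand-in for the real RNN decoder's "most probable next character".

```python
# Toy stand-in for the trained decoder: maps each token to the next one.
TRANSITIONS = {"<s>": "機", "機": "器", "器": "學", "學": "習", "習": "==="}

def step(token):
    return TRANSITIONS[token]

def decode(max_len=50):
    # Greedy generation: feed each output back in until the break
    # symbol "===" appears (or a safety length cap is hit).
    out, tok = [], "<s>"
    while len(out) < max_len:
        tok = step(tok)
        if tok == "===":    # the extra "break" symbol ends decoding
            break
        out.append(tok)
    return "".join(out)

sentence = decode()  # → "機器學習"
```

Without the break symbol in the vocabulary, the loop above would only stop at the arbitrary length cap, which is precisely the "doesn't know when to stop" problem from the previous slide.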
Chat-bot
• Both the input and the output are sequences with different lengths → sequence-to-sequence learning
Trained on TV series and movie scripts (電視影集、電影台詞). Source of image: https://github.com/farizrahman4u/seq2seq
Chat-bot with GAN
Chatbot (an encoder-decoder, En/De): input sentence/history h → response sentence x
Discriminator: given the input sentence/history h and a response sentence x, decide real or fake, trained against human dialogues. This is a conditional GAN.
Jiwei Li, Will Monroe, Tianlin Shi, Sébastien Jean, Alan Ritter, Dan Jurafsky, "Adversarial Learning for Neural Dialogue Generation", arXiv preprint, 2017
Ref: 一日搞懂 GAN https://www.slideshare.net/tw_dsconf/ss-78795326
Example Results (experimental results courtesy of student 段逸林)
Towards Characterization
• Experimental results courtesy of student 王耀賢
• https://github.com/yaushian/simple_sentiment_dialogue
Input: How do you feel ? I am good.
I am so embarrassed.
Input: I love you. I love you!
I wish I wish I wish I could go.
Negative sentence to positive sentence:
it's a crappy day -> it's a great day
i wish you could be here -> you could be here
it's not a good idea -> it's good idea
i miss you -> i love you
i don't love you -> i love you
i can't do that -> i can do that
i feel so sad -> i happy
it's a bad day -> it's a good day
it's a dummy day -> it's a great day
sorry for doing such a horrible thing -> thanks for doing a great thing
my doggy is sick -> my doggy is my doggy
my little doggy is sick -> my little doggy is my little doggy
Cycle GAN
(Experimental results courtesy of student 王耀賢.)
Outline
語音 (speech): 你 好 嗎? ("How are you?")
從語音訊號到文字 (from speech signal to text): 你 好 嗎?
機器決定要說(做)什麼 (the machine decides what to say or do)
了解一個詞 (understanding a word)
了解一個句子 (understanding a sentence)
了解一整段對話 (understanding a whole dialogue)
摘要 (Summarization)
Summarization
Audio file to be summarized → "This is the summary."
Extractive summaries: select the most informative segments to form a compact version, e.g. "…… deep learning is powerful ……"
[Lee, et al., Interspeech 12] [Lee, et al., ICASSP 13] [Shiang, et al., Interspeech 13]
The machine does not write summaries in its own words.
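Extractive summarization as described above can be sketched as a scoring-and-selection problem; the frequency-based informativeness score and the three-segment transcript are invented for illustration, not the cited papers' actual method.

```python
from collections import Counter

# Invented transcript segments standing in for an ASR transcript.
segments = [
    "deep learning is powerful",
    "the weather was nice that day",
    "deep learning needs a lot of data",
]

# Score each segment by the average corpus frequency of its words:
# segments made of recurring content words count as more informative.
words = Counter(w for s in segments for w in s.split())

def informativeness(segment):
    toks = segment.split()
    return sum(words[w] for w in toks) / len(toks)

summary = max(segments, key=informativeness)  # the most informative segment
```

The selected segment is quoted verbatim, which is exactly the limitation the slide points out: the machine does not write the summary in its own words.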
Abstractive Summarization
• Now machines can do abstractive summarization (write summaries in their own words)
Training data: documents paired with their titles (Title 1, Title 2, Title 3, …)
At test time the title is generated by the machine, in its own words, without hand-crafted rules.
Abstractive Summarization
• Input: transcription of the audio (from automatic speech recognition, ASR); output: title
RNN encoder: read through the input w1 w2 w3 w4, producing hidden states h1 h2 h3 h4
RNN generator: produce the title wA wB …… from states z1 z2 ……
Abstractive Summarization
Document: 刑事局偵四隊今天破獲一個中日跨國竊車集團,根據調查國內今年七月開放重型機車上路後 …… (The Criminal Investigation Bureau's Fourth Squad today broke up a Sino-Japanese transnational car-theft ring; the investigation found that after heavy motorcycles were allowed on domestic roads this July ……)
Human: 跨國竊車銷贓情形猖獗直得國內警方注意 (transnational car theft and fencing is rampant and deserves the attention of domestic police)
Machine: 刑事局破獲中國車集 (Criminal Investigation Bureau busts China car ring)
Document: 據印度報業托拉斯報道印度北方邦22日發生一起小公共汽車炸彈爆炸事件造成15人死亡3人受傷 …… (According to the Press Trust of India, a minibus bombing in Uttar Pradesh on the 22nd killed 15 people and injured 3 ……)
Human: 印度汽車炸彈爆炸造成15人死亡 (car bombing in India kills 15)
Machine: 印度發生汽車爆炸事件 (car explosion occurs in India)
(Experimental results courtesy of student 盧柏儒.)
Unsupervised Abstractive Summarization
Network 1: long text (長文) → short text (短文), which is used as the summary (當做摘要)
Network 2: short text → reconstruct the original long text (原來的長文)
Example: 台灣大學 … → 灣學 … → 台灣大學 …
http://laughl.com/archives/7131/【問號哪裡來】「黑人問號哥」原來大有來頭?!/
("The machines are plotting to rule humanity ……")
Unsupervised Abstractive Summarization
Network 1: long text (長文) → short text (短文), used as the summary (當做摘要)
Network 2: short text → reconstruct the original long text (原來的長文)
Network 3: a discriminator trained on a large number of human-written sentences to judge whether a sentence is human-written; network 1 must make network 3 believe its output is human-written.
Example: 台灣大學 … → 台大 … → 台灣大學 …
(Experimental results courtesy of student 王耀賢.)
Outline
語音 (speech): 你 好 嗎? ("How are you?")
從語音訊號到文字 (from speech signal to text): 你 好 嗎?
機器決定要說(做)什麼 (the machine decides what to say or do)
了解一個詞 (understanding a word)
了解一個句子 (understanding a sentence)
了解一整段對話 (understanding a whole dialogue)
能不能畫圖? (Can machines draw?)
“Girl with red hair and red eyes”
“Girl with yellow ribbon”
Data Collection
http://konachan.net/post/show/239400/aikatsu-clouds-flowers-hikami_sumire-hiten_goane_r
Thanks to TAs 曾柏翔 and 樊恩宇 for collecting the data.
Released Training Data
• Data download link: https://drive.google.com/open?id=0BwJmB7alR-AvMHEtczZZN0EtdzQ
• Anime Dataset: training data of 33.4k (image, tags) pairs; images are 96 x 96
• Training tags file format (tags.csv):
• img_id <comma> tag1 <colon> #_post <tab> tag2 <colon> …
• Example tags: blue eyes, red hair, short hair
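The tag-file format stated above can be parsed in a few lines; the sample line below is invented to match that format (the real tags.csv may differ in details such as counts).

```python
# Parse one tags.csv line of the stated format:
# img_id <comma> tag1 <colon> #_post <tab> tag2 <colon> #_post ...
def parse_tag_line(line):
    img_id, rest = line.split(",", 1)
    tags = {}
    for field in rest.split("\t"):
        tag, count = field.rsplit(":", 1)   # tag names may contain spaces
        tags[tag] = int(count)              # #_post: how often the tag occurs
    return img_id, tags

# Hypothetical sample line matching the documented format.
img_id, tags = parse_tag_line("96,blue eyes:1000\tred hair:800\tshort hair:600")
```

Using `rsplit(":", 1)` keeps multi-word tags like "blue eyes" intact while still splitting off the trailing count.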
Conditional GAN
• Draw anime character portraits from text descriptions (根據文字敘述畫出動漫人物頭像)
MLDS Assignment 3; TA in charge: 曾柏翔
Examples: "Black hair, blue eyes", "Blue hair, green eyes", "Red hair, long hair"
Concluding Remarks
語音 (speech): 你 好 嗎? ("How are you?")
從語音訊號到文字 (from speech signal to text): 你 好 嗎?
機器決定要說(做)什麼 (the machine decides what to say or do): 我 很 好 ("I am fine")
了解一個詞 (understanding a word)
了解一個句子 (understanding a sentence)
了解一整段對話 (understanding a whole dialogue)
Everything is based on Deep Learning