PowerPoint presentation: Recorded audio playback pane (plays back the recorded audio), note editing pane...
TRANSCRIPT
2015/8/22 Speech Research Workshop (音声研究会) @ Iwate Prefectural University
* Numbers in parentheses are the number of out-of-vocabulary (OOV) queries
…U01:
U02:
U03:
[Figure: system overview]
Indexing phase: Target speech data -> ASR #1, ASR #2, …, ASR #10 (1-best output each) -> Converting to PTN -> PTN-formed index
Search phase: Query term -> Converting to phoneme sequence -> DTW-based term search engine -> Result
Speech utterance “Nepale” ( /N e p a a r u/ )
Outputs of 10 ASRs (all outputs are converted into phoneme sequences)
ASR ID    Output
ASR #1    n e @ h o a @ r e
ASR #2    n e @ p @ a a r u
ASR #3    n e @ p @ a a r u
ASR #4    n e q p @ a a r e
ASR #5    n o @ b @ a @ @ N
ASR #6    n o @ t @ a u m e
ASR #7    n e N p @ a @ @ i
ASR #8    n e u p @ a a r e
ASR #9    n e @ p @ a a r e
ASR #10   n e N p @ a @ @ @
[Figure: PTN-formed index built from the 10 ASR outputs above (legend: arc, node, terminal node), matched against the search term /n e p a a r u/]
[Figure: DTW matching of the search term on the CN (PTN)-based index. Total cost (distance): 0.2. Annotations: NULL transitions; no insertion errors]
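The indexing and search steps shown in the figures can be sketched roughly as follows. This is a minimal illustration, not the presenters' implementation: it assumes the ASR outputs are already aligned position by position (as in the “Nepale” example), builds one set of phonemes per position as a stand-in for the PTN, and uses a simple 0/1 local cost. The names `build_columns` and `dtw_cost` are hypothetical, and the sketch does not attempt to reproduce the total cost of 0.2 shown on the slide, whose cost definition is not given here.

```python
NULL = "@"  # NULL-transition symbol used in the slide's transcriptions

# Aligned phoneme outputs of the 10 ASRs for the utterance "Nepale" (from the slide)
ASR_OUTPUTS = [
    "n e @ h o a @ r e", "n e @ p @ a a r u", "n e @ p @ a a r u",
    "n e q p @ a a r e", "n o @ b @ a @ @ N", "n o @ t @ a u m e",
    "n e N p @ a @ @ i", "n e u p @ a a r e", "n e @ p @ a a r e",
    "n e N p @ a @ @ @",
]

def build_columns(transcripts):
    """Assumption: equal-length, pre-aligned outputs; one phoneme set per position."""
    seqs = [t.split() for t in transcripts]
    if len({len(s) for s in seqs}) != 1:
        raise ValueError("transcripts must be aligned to the same length")
    return [set(col) for col in zip(*seqs)]

def dtw_cost(query, columns):
    """Smallest edit-style cost of matching the query anywhere in the index."""
    q = query.split()
    m, n = len(q), len(columns)
    inf = float("inf")
    dp = [[inf] * (n + 1) for _ in range(m + 1)]
    for j in range(n + 1):
        dp[0][j] = 0.0  # the match may start at any position
    for i in range(1, m + 1):
        dp[i][0] = dp[i - 1][0] + 1.0  # query phoneme with nothing to match
        for j in range(1, n + 1):
            col = columns[j - 1]
            # match (cost 0) or substitution (cost 1, assumed)
            match = dp[i - 1][j - 1] + (0.0 if q[i - 1] in col else 1.0)
            # skipping a position is free only via a NULL transition
            skip = dp[i][j - 1] + (0.0 if NULL in col else 1.0)
            # query phoneme left unmatched
            insert = dp[i - 1][j] + 1.0
            dp[i][j] = min(match, skip, insert)
    return min(dp[m])  # the match may end at any position

columns = build_columns(ASR_OUTPUTS)
cost_hit = dtw_cost("n e p a a r u", columns)
cost_miss = dtw_cost("n e k a a r u", columns)
```

With this simplified cost, the query /n e p a a r u/ is matched without any penalty because the two positions it skips both offer a NULL transition, while replacing /p/ with /k/ costs one substitution.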
Phoneme-based transcriptions by ASRs, with BIO tags for triphone “n-e-p”
ASR #1    s o n o @ n i @ q p o N n o @ d e @ w a
ASR #2    s o n o @ n i @ q p o N n o @ d e @ w a
ASR #3    s o n o @ n i @ @ b a N n o @ u e @ w a
ASR #4    s o n o N n i f u p a N n u N b e e w a
ASR #5    s o n o @ n i @ q p o N n o @ d e @ w a
ASR #6    s o n o @ m i @ @ b a N n o @ u e @ w a
ASR #7    s o n o @ n i @ q p o N n o @ u e @ w a
ASR #8    s o n o @ n i @ @ b a N n o @ u e @ w a
ASR #9    s o n o @ n i @ @ b a N n o @ d e @ w a
ASR #10   s o n o @ n i @ q p o N n o @ d e @ w a
BIO tags  O O O O O B I I I I O O O O O O O O O O
B: beginning tag of the triphone; I: inside tag of the triphone; O: outside tag of the triphone
Features for CRF training: current token, unigrams, in-ASR bigrams, in-ASR trigrams, cross-ASR bigrams
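The feature types listed for CRF training could be extracted per aligned position along the following lines. This is only a sketch under stated assumptions: the slide names the feature types but does not define them, so the window size (one token either side), the pairing of adjacent ASRs for cross-ASR bigrams, the feature-name strings, and the function name `position_features` are all hypothetical choices made for illustration.

```python
def position_features(transcripts, t):
    """Build a feature dict for aligned position t across all ASR outputs.

    Assumed definitions (not given on the slide):
      current token   : phoneme of each ASR at position t
      unigrams        : phonemes of the same ASR at t-1 and t+1
      in-ASR bigrams  : (t-1, t) and (t, t+1) within one ASR
      in-ASR trigrams : (t-1, t, t+1) within one ASR
      cross-ASR bigram: phonemes at position t of ASR k and ASR k+1
    """
    seqs = [s.split() for s in transcripts]
    feats = {}
    for k, seq in enumerate(seqs, start=1):
        prev_tok = seq[t - 1] if t > 0 else "<s>"
        cur_tok = seq[t]
        next_tok = seq[t + 1] if t + 1 < len(seq) else "</s>"
        feats[f"asr{k}:cur={cur_tok}"] = 1
        feats[f"asr{k}:uni-1={prev_tok}"] = 1
        feats[f"asr{k}:uni+1={next_tok}"] = 1
        feats[f"asr{k}:bi-1={prev_tok}|{cur_tok}"] = 1
        feats[f"asr{k}:bi+1={cur_tok}|{next_tok}"] = 1
        feats[f"asr{k}:tri={prev_tok}|{cur_tok}|{next_tok}"] = 1
    for k in range(len(seqs) - 1):
        pair = f"{seqs[k][t]}|{seqs[k + 1][t]}"
        feats[f"asr{k + 1}x{k + 2}:bi={pair}"] = 1
    return feats

# First three transcriptions from the slide, used here as a small example
EXAMPLE = [
    "s o n o @ n i @ q p o N n o @ d e @ w a",
    "s o n o @ n i @ q p o N n o @ d e @ w a",
    "s o n o @ n i @ @ b a N n o @ u e @ w a",
]
features_at_8 = position_features(EXAMPLE, 8)
```

Each position's feature dict would then be paired with its B, I, or O tag to form one training item for a sequence labeller; the choice of CRF toolkit is left open here.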
[Figure labels: Utterance Z, Utterance A]
j-i-s: 0.50
Probability of s-a-N: s: 0.9 × a: 0.9 × N: 0.9
s-a-N: 0.73
Outputs of 10 ASR systems
ASR #1    f u j i s a N
ASR #2    f u z u sh a n
ASR #3    s i z u y a N
ASR #4    f u @ e y a m
ASR #5    k o J u g o N
ASR #6    f u J i s a N
ASR #7    k e z e ch a n
ASR #8    @ i t i s a q
ASR #9    f u r u s a N
ASR #10   s u j i h a q
B label   0.1 0.0 0.8 0.2 0.1 0.0 0.1
I label   0.1 0.4 0.2 0.7 0.9 0.9 0.1
O label   0.8 0.6 0.0 0.1 0.0 0.9 0.9
Utterance A: outputs of 10 ASR systems
ASR #1    f u j i s a N
ASR #2    f u z u sh a n
ASR #3    s i z u y a N
ASR #4    f u @ e y a m
ASR #5    k o J u g o N
ASR #6    f u J i s a N
ASR #7    k e z e ch a n
ASR #8    @ i t i s a q
ASR #9    f u r u s a N
ASR #10   s u j i h a q
B label   0.1 0.0 0.8 0.2 0.1 0.2 0.1
I label   0.1 0.4 0.2 0.7 0.9 0.1 0.1
O label   0.8 0.6 0.0 0.1 0.0 0.7 0.8
Detection probability of query /f u j i s a N/ at utterance A by the CRF models
Triphone “j-i-s” detection result by CRF
Query term /f u j i s a N/, decomposed into triphones: f-u-j, u-j-i, j-i-s, …
0.65
Probability of f-u-j: f: 0.8 × u: 0.7 × j: 0.9
Probability of u-j-i: u: 0.8 × j: 0.7 × i: 0.9
Probability of j-i-s: j: 0.8 × i: 0.7 × s: 0.9
j-i-s: 0.50
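The triphone decomposition and the per-triphone arithmetic in the figure can be checked with a short sketch. It assumes the “××” marks on the slide denote multiplication of the three per-phoneme probabilities, which is consistent with the values shown (0.9 × 0.9 × 0.9 ≈ 0.73 and 0.8 × 0.7 × 0.9 ≈ 0.50). The function names are hypothetical, and how the triphone scores are combined into the query-level detection probability is not shown here, so that step is left out.

```python
from math import prod

def triphones(phoneme_string):
    """Split a space-separated phoneme sequence into overlapping triphones."""
    p = phoneme_string.split()
    return ["-".join(p[i:i + 3]) for i in range(len(p) - 2)]

def triphone_probability(phoneme_probs):
    """Assumption: a triphone's score is the product of its phoneme scores."""
    return prod(phoneme_probs)

query_triphones = triphones("f u j i s a N")
p_jis = triphone_probability([0.8, 0.7, 0.9])  # j, i, s from the slide
p_saN = triphone_probability([0.9, 0.9, 0.9])  # s, a, N from the slide
```

Rounded to two decimals, these give the 0.50 and 0.73 printed on the slide.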
10 ASR systems
[Figure: recall-precision curves. x-axis: Recall [%] (0 to 100); y-axis: Precision [%] (0 to 100). Curves: (1) DTW (Baseline), (2) CRF, (3) DTW+CRF]
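Recorded audio playback pane: plays back the recorded audio
Note editing pane
Screen capture pane: pressing the button makes the captured image appear in the note editing pane
Candidate phrase pane (speech recognition results)
A word selected by touch appears in the editing field
Text can also be entered by keyboard or handwriting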