Doctoral Dissertation

Two Key Technologies for a Flexible Speech Interface:
From the Perspective of Human–Robot Interaction

Chief Advisor: Professor Natsuki Oka
Kyoto Institute of Technology, Graduate School of Science and Technology
Department of Design Engineering
Student ID: 08821007
Name: Xiang Zuo
Submitted on February 10, 2012
Two Key Technologies for a Flexible Speech Interface:
From the Perspective of Human–Robot Interaction
Academic Year 2011 (Heisei 23)   08821007   Xiang Zuo

Abstract

Speech is the most natural means of human communication, and its use as an interface between humans and machines is highly desirable. At present, however, the performance of speech recognition is insufficient, and systems can recognize only commands registered in advance, which makes it difficult to realize a flexible speech interface. I therefore focused on two problems, learning the phoneme sequences of unknown words and detecting the target of utterances, and developed component technologies for realizing a flexible speech interface.

As the first problem, I developed a technique for learning the phoneme sequences of unknown words. For a speech interface operating in a real environment, learning the phoneme sequences of unknown words is extremely important. A system should not only respond to pre-registered commands but should also be able to learn the phoneme sequences (pronunciations) of unknown words online. For example, when the system encounters an unknown person or object, if it can learn the phoneme sequence of the name from the user's utterances, it can extend its vocabulary online and use the learned name to communicate with the user in subsequent conversations.

To learn the phoneme sequence of an unknown word, it suffices in principle to perform phoneme recognition on the word; however, because the performance of current speech recognition is insufficient, phoneme recognition errors are likely to occur. Accurately learning a phoneme sequence therefore requires correcting these recognition errors. In this study, I proposed a method in which the user corrects recognition errors by repeating the unknown word. During correction, the user can confirm the corrected phoneme sequence and correct it interactively. The proposed method has the following two features: (1) the user can repeat not only the whole unknown word but also only the misrecognized part of its phoneme sequence, which allows the system to locate recognition errors more efficiently; and (2) the system uses the history of the correction process to improve the efficiency of correction. For example, if a correction makes the phoneme sequence worse, the user can revert it to the previous version, and every correction is guaranteed to produce a phoneme sequence that differs from the previous ones. When correcting, the proposed method first aligns the phoneme sequence containing recognition errors with the phoneme sequence of the correction utterance using DP matching, and then selects the more reliable phoneme from each phoneme pair, thereby generating a more reliable phoneme sequence. Generalized posterior probability is used as the phoneme confidence measure. The evaluation experiments showed that the proposed method learns the correct phoneme sequence of an unknown word with very high efficiency, requiring only about three utterances on average.

As the second problem, I developed a technique for detecting the target of utterances. To realize smooth conversation with a system, it is necessary to judge whether a user's utterance is directed to the system. The system should respond only to utterances directed to it and must not respond to any other utterances. Without this capability, the system would also respond to utterances that are not directed to it (for example, sound from a TV or chat between people), which would disrupt communication with users and could even cause dangerous behavior. To solve this problem, I proposed a method for detecting the target of utterances. The proposed method is effective in a task in which a robot manipulates objects according to the user's utterances. In this task, it is assumed that the user commands the robot to perform object manipulation actions that the robot can execute in the current physical environment. Under this assumption, the degree of matching between the content of system-directed (robot-directed) utterances and the current physical environment is high. The proposed method therefore first interprets an utterance as an executable object manipulation action in the current physical environment, and then detects the target of the utterance by evaluating the degree of matching between that action and the physical environment.

As the criterion for evaluating this degree of matching, the multimodal semantic confidence (MSC) measure, which is newly proposed in this study, is used. MSC is computed by integrating, with a logistic model, the confidence scores obtained from speech recognition, object recognition, and the generation of the robot's manipulation motions. The speech confidence is computed following a conventional method, whereas the object and motion confidences are newly proposed in this study. The parameters of the logistic model are learned using a maximum-likelihood criterion. In the experiments, the proposed method was evaluated using a real robot and detected the target of utterances with a very high accuracy of more than 95%.
Two Key Technologies for a Flexible Speech Interface:
From the Perspective of Human–Robot Interaction
2011 d8821502 Xiang Zuo
Abstract
This thesis addresses two crucial problems in building a flexible speech interface
between humans and machines: (1) learning the phoneme sequences of out-of-
vocabulary (OOV) words, and (2) detecting the target of utterances. I propose the
following two methods to solve these problems.
First, I propose a method for learning the phoneme sequences of OOV words,
which is crucial for speech interfaces because developers cannot prepare beforehand,
in a system's vocabulary, all the words that may appear in practical use. When the
system encounters OOV words, it must learn their phoneme sequences to build lexical entries
for them. I propose a method called Interactive Phoneme Update (IPU) for this
purpose. Using this method, users can correct misrecognized phoneme sequences
by repeatedly making correction utterances based on the system responses. The fol-
lowing are the originalities of the method: (1) word-segment-based correction that
allows users to use word segments for locating misrecognized phonemes and (2)
history-based correction that utilizes the information of the phoneme sequences
that were recognized and corrected previously during the interactive learning of
each word. The experimental results show that IPU drastically outperformed a
previously proposed maximum-likelihood-based method for learning the phoneme
sequences of OOV words.
Second, I propose a method for detecting the target of utterances to distinguish
speech that users say to a machine from speech that users say to other people or
themselves. Such a functional capability is crucial for speech-based human-machine
interfaces. If the machine lacks this capability, then even utterances that are not
directed to it will be recognized as commands for it. Thus the machine will generate
an erroneous response.
The proposed method, which is used in an object manipulation task performed
by a robot, enables it to detect robot-directed speech. The originality of the method
is the introduction of a multimodal semantic confidence (MSC) measure for the do-
main classification of input speech based on whether the speech can be interpreted
as a feasible action under the current physical situation in the object manipulation
task. This measure is calculated by integrating speech, object, and motion confi-
dences with weightings that are optimized by logistic regression. The experimental
results show that my proposed method achieves a high average precision of 95% in
detecting the utterance targets.
Contents

Chapter 1  Introduction
Chapter 2  Learning the phoneme sequences of OOV words
  2.1  Background
  2.2  Interactive Phoneme Update (IPU)
    2.2.1  Locating and correcting phoneme errors in IPU
    2.2.2  History-based correction
  2.3  Experiments
    2.3.1  Experiment 1: Evaluation of the performance of IPU
    2.3.2  Experiment 2: Investigation of the factors of performance results
  2.4  Discussion
    2.4.1  Improvement of learning performance
    2.4.2  Influence caused by visual feedback
    2.4.3  Integration with OOV word detection
  2.5  Summary
Chapter 3  Detecting utterance targets
  3.1  Background
  3.2  Object Manipulation Task
  3.3  Proposed RD Speech Detection Method
    3.3.1  Speech Understanding
    3.3.2  MSC Measure
  3.4  Experiments
    3.4.1  Experimental Setting
    3.4.2  Off-line Experiment by Simulation
    3.4.3  On-line Experiment Using the Robot
  3.5  Discussion
    3.5.1  Using in a Real World Environment
    3.5.2  Extended Applications
  3.6  Summary
Chapter 4  Conclusion
Acknowledgment
References
Appendix A  Word forms in Japanese
Appendix B  The international phoneme alphabets (IPA) of Japanese syllabary
Appendix C  Recursive equation of open-begin-end dynamic programming matching
Chapter 1 Introduction
Speech, which is one of our most effective daily communication tools, is ex-
pected to eventually be used as a user-friendly interface between humans and
machines. In recent years, many studies have developed speech-based human-
machine interfaces. For example, speech interfaces provide such services on tele-
phones and mobile phones as automated phone calling systems [15, 16], flight and
hotel reservations [56, 13], alphanumeric string inputting [51], name dialing [52],
voice searches [8], multimedia information retrieval and browsing [77, 70], spoken
language translations [79, 71, 11], and voice assistance [1].
Besides telephone-based services, speech interfaces have also been used for other
tasks, including tourist [49] and museum guide tasks [61]. Furthermore, because of
their convenience, speech interfaces are suitable for hands-busy and eyes-busy tasks,
such as car navigation tasks. Speech provides safe interactions between drivers
and car navigation devices, where speech interfaces have nearly become standard
equipment; some manufacturers have even devoted research teams to them. For
example, the speech interface in the car navigation devices developed by Toshiba
is designed to adapt to different driving conditions [10]. Its speech recognition and
speech synthesis modules were optimized for in-car environments [26].
Speech interfaces have also been used for household appliances [74]. Compared
with other interfaces, such as remote controls, speech interfaces provide completely
new user experiences. For example, they were used for TV content search [69,
67]. Speech interfaces have also been used for entertainment, such as computer
games [48, 23, 78].
Speech interfaces have also been used for robots. In the last few years, robots
have been designed to become part of the everyday lives of ordinary people in social and
home environments. Many robotic systems have been implemented with speech
interfaces, including [2, 19]. Unlike such platforms as mobile phones or car naviga-
tion devices, robots are usually equipped with many sensors such as microphones,
cameras, and touch sensors. Thus they can communicate with users through mul-
timodal information. For example, a robot can find the corresponding object by
camera and manipulate it by hand when its name is indicated by users [20, 21].
A robot can learn an object’s visual features and simultaneously learn its name
through interactions with humans [22, 3]. Visual information can be used to help
a robot understand the meaning of utterances [54]. Gaze and hand gestures can
be used to help a robot understand user intentions [30].
Although many speech interfaces have already been developed, they still lack
flexibility in practical use. A number of factors complicate their use in real scenar-
ios. I consider that the following factors are important for speech interfaces:
• Speech recognition accuracy
The interface flexibility is seriously affected by the recognition accuracy of
the automatic speech recognition (ASR) module. Although many studies
have addressed recognition accuracy [41, 25, 34, 40], it remains inadequate
in real scenarios.
Among the various causes that reduce recognition accuracy, an important
one is background noise, which always exists in real scenarios. Methods such
as noise suppression [14] and blind source separation [29] can be used to deal
with background noise. Another important cause of reduced recognition accuracy
is pronunciation variation, such as dialects and emotion in speech. To
deal with pronunciation variations, methods such as pronunciation variation
modeling have been proposed [73, 39, 64, 57].
• Dialogue strategy
A dialogue strategy specifies which action the system will take depending
on the current dialogue context. Designing the dialogue strategy by hand
involves anticipating how users will interact with the system, repeated testing,
and refining, so the task can be difficult. Therefore the system is required to
manage the dialogue strategies by itself. Many studies have dealt with this
problem, such as [38, 42].
• Out-of-vocabulary words
Dealing with out-of-vocabulary (OOV) words is another serious problem for
speech interfaces because developers cannot prepare all the words beforehand
that might be used by individual users in the system’s vocabulary. If user
utterances include such OOV words, then the system will recognize them as
words within the vocabulary, and thus it will generate erroneous responses,
and the user will not know what the problem is.
The OOV problem consists of two sub-problems. When the system encoun-
ters OOV words, it first needs to detect them in the user utterances (OOV
word detection), and then it needs to learn their phoneme sequences to build
lexical entries for them (OOV word learning). Recently many studies have
focused on OOV word detection [6, 75, 50]. However, studies focusing on
OOV word learning are limited.
• Utterance targets
For speech interfaces, the functional capability to detect the target of ut-
terances is crucial. For example, a user’s speech directed to another human
listener should not be recognized as commands directed to a system. The sys-
tem must reject the utterances that are not directed to it. Studies focusing
on this problem are limited.
The goal of this thesis is to improve the flexibility of speech interfaces. Among
the above described problems, I focus on two: (1) learning the phoneme sequences
of OOV words, and (2) detecting the targets of utterances. Even though they are
fundamental problems of speech interfaces, and are crucial especially for robotic
speech interfaces, they have rarely been covered in previous studies.
First, learning the phoneme sequences of OOV words is a serious problem of
robotic speech interfaces because it is impossible to provide beforehand all the names
of things and persons that a robot in home use may encounter.
needs to learn the phoneme sequences of the new words from utterances. In recent
years, learning the phoneme sequences of OOV words has become a basic task for
robots [3]. Next, detecting the targets of the utterances is also essential for robots
because it is quite dangerous for a robot to respond to the utterances that are
not directed to it. For example, unexpected motion of a robot may hurt someone
nearby. I propose two methods in this thesis to solve these problems. They are
described below.
1. Learning the phoneme sequences of OOV words
To solve the OOV word learning problem, I propose a novel method called In-
teractive Phoneme Update (IPU), which enables systems to learn the phoneme
sequences of OOV words through interactions with users. During interaction,
users can correct the phoneme recognition errors by repeatedly making correction
utterances. The method enables the system to automatically extend its
vocabulary in an on-line manner so that the system can adapt to individual
users and the environment.
2. Detecting utterance targets
I propose a novel method for detecting the target of utterances. The proposed
method is used for a robotic dialogue system that enables a robot to detect
robot-directed speech in an object manipulation task. The method is based
on a multimodal semantic confidence (MSC) measure, which is used for the
domain classification of input speech based on whether the speech can be
interpreted as a feasible action under the current physical situation. Using the
method, the robot can detect robot-directed speech with very high accuracy,
even under noisy conditions.
The remainder of this thesis is organized as follows. First, the details of the
proposed OOV word learning method are given in Chapter 2. The details of the
proposed robot-directed speech detection method are given in Chapter 3. Finally,
Chapter 4 concludes the thesis.
Chapter 2 Learning the phoneme sequences of OOV words
2.1 Background
This chapter describes my proposed method for learning the phoneme sequences
of OOV words1. Learning OOV words is a difficult task since every phoneme of
a word should be correctly learned in order for a system to precisely recognize
and synthesize the word in subsequent communications. One kind of method
is to ask the user to spell out the new words [9, 17]. In this method, a graph
of possible phoneme sequences is first estimated by the given spelling, and then
the speech sample of the word is used to search this graph to determine the best
phoneme sequence. For instance the English word “teacher,” whose pronunciation
is /’ti:tS@/, is spelled as “T E A C H E R /ti: i: ei si: eitS i: a:/”. The pronunciation
of an English word and that of its spelled-out form are different. That is, spelling
out gives richer information than repeating the word, enabling better estimation
of its phoneme sequence when a grapheme-phoneme correspondence model is used.
However, spelling out is not effective in some languages such as Japanese and
Chinese, since the pronunciation of a word and that of its spelled-out form are
almost the same as each other in these languages. Therefore spelling out cannot
give richer information than repeating the word.
Another kind of method is to run a speech recognition system in a phoneme
recognition mode. However, this method is unreliable due to the high phoneme
1 OOV word detection is not discussed in this thesis. The position of OOV words is
given by template utterances pre-defined in the system. I assume that OOV words are
properly detected and segmented before the proposed method is applied.
recognition error rates. Although many studies have been done to improve phoneme
recognition accuracy [7, 4, 37], even state-of-the-art speech recognition systems only
achieve about 80% in phoneme recognition accuracy [45, 43]. For such a speech
recognition system, if each phoneme error occurs independently, the probability for
obtaining a correct phoneme sequence of a word with ten phonemes from a single
utterance is less than 11% (0.8^10 ≈ 0.11).
Since learning OOV words by phoneme recognition from a single utterance is
unreliable, some methods learned OOV words from multiple utterances [65, 60,
72, 5]. The maximum-likelihood (ML) based phoneme correction [65] is a widely
used method for this purpose. In this method, the phoneme sequence of a word is
obtained by searching a phoneme sequence that jointly maximizes the likelihood
of all of the input utterances of the word from their N -best phoneme recognition
lists.
This study deals with a word learning task in which users teach the system the
phoneme sequence (pronunciation) of OOV words by repeatedly making utterances
through speech interactions with the system. The target language of this study is
Japanese2 . Converting phoneme sequences to graphemic word forms is not dealt
with in this study since our target is speech interaction3. Rather than improving
the phoneme recognition accuracy in a batch way, I developed Interactive Phoneme
Update (IPU) that learns the phoneme sequences of OOV words in the course
of speech interaction. Using the method, users can correct the mis-recognized
phoneme sequences by repeatedly making correction utterances according to the
system responses. Consider the following dialogue scenario between two persons
(A and B).
2 However, the proposed method can be easily extended for other languages such as English.
3 Japanese words can be written as both kana sequences and kanji sequences, and kana
sequences can be uniquely converted from phoneme sequences based on a mapping ta-
ble. Converting phoneme sequences to kanji sequences is not dealt with in this study. A
description of Japanese word forms is given in Appendix A.
A0: “My name is Taisuke Sumii.”
B0: “Taisuke Sumie?”
A1: “No, Taisuke Sumii.”
B1: “Taisuke Zumie?”
A2: “That’s worse. Listen, Sumii.”
B2: “Taisuke Sumii?”
A3: “That’s right.”
In this dialogue, person A tries to teach his name to person B with an utterance
(A0), and person B makes a mistake. Person A then corrects the errors
by repeating the name (A1 and A2). Such a dialogue is quite common in communication
between humans. IPU aims to realize this kind of dialogue for learning
OOV words. The originalities of IPU are summarized as follows.
1. Word-segment-based correction: Apart from the whole word, IPU en-
ables the user to make a correction with just a segment of a word, according
to the phoneme errors. The advantage of word-segment-based correction is
that locating erroneous phonemes in a phoneme sequence becomes easier,
and the mis-correction of the correct part of the phoneme sequence can be
prevented.
2. History-based correction: IPU uses the historical information of phoneme
sequences that were recognized and corrected previously in the course of
interactive learning of each word to make learning efficient.
IPU can be used as a word pronunciation learning module for a variety of spoken
dialogue systems. For example, a robotic dialogue system in a home environment
probably encounters novel objects whose names do not exist in its vocabulary [17].
For another example, a telephone-based name dialing system needs to add novel
names to its vocabulary [12]. IPU enables all such spoken dialogue systems to
learn the phoneme sequence of OOV words through speech interactions. Once the
words are successfully learned, the system can recognize and synthesize the words
Dialogue Recognized phoneme sequence System phoneme sequence
U0: “It is misesukumiko.” x0: m i sh e s u k u ϕ i g o y0: m i sh e s u k u ϕ i g o
S0: “Is it mishesukuigo?”
U1: “No, it is misesukumiko.” x1:m i s e z u k u m i k o ng y1: m i sh e s u k u m i k o
S1: “Is it mishesukumiko?”
U2: “No, it is misesu.” x2: m i sh e z u y2: m i sh e z u k u m i k o
S2: “Is it mishezukumiko?”
U3: “No, that’s worse.”
U4: “It is misesu.” x3: m i s e s u y3: m i s e s u k u m i k o
S3: “Is it misesukumiko?”
U5: “That’s right.”
Figure 2.1: An example for learning OOV words by IPU. The left column shows the
dialogue between a user (U) and a system (S), the middle column shows the recognized
phoneme sequences of the OOV word in the user’s utterances, and the right column shows
the phoneme sequences in the system internal state. The phoneme errors are indicated by
underlines, and “ϕ” denotes a deletion error.
in the subsequent interactions. An example application of IPU has been already
implemented by [47].
The remainder of this chapter is organized as follows. The detail of the proposed
method is given in Section 2.2. The experimental settings and results are presented
in Section 2.3. Section 2.4 gives a discussion. Finally, Section 2.5 concludes the
chapter.
2.2 Interactive Phoneme Update (IPU)
This section presents the details of IPU. An example of the process of learning
OOV words by IPU is shown in Fig. 2.1. The user first tries to teach the system
a new word “misesukumiko,” whose phoneme sequence is [m i s e s u k u m i k
o]4 , by an initial utterance (U0). The system gets a recognized phoneme sequence
x0 of the OOV word by a pre-defined grammar including a phoneme network
4 The international phoneme alphabets (IPA) of Japanese syllabary are shown in Ap-
pendix B.
SilB It is
ch
a
ng
SilE
Figure 2.2: The grammar used for OOV word extraction. “SilB” and “SilE” denote the
silences in the beginning and end of the utterance.
like the one shown in Fig. 2.2. Using such a grammar, the phoneme sequence
of an OOV word can be extracted from an utterance “It is [oov]5,” where [oov]
represents the OOV word. The system then sets x0 to a system phoneme sequence
y0, and requests the user to confirm y0 by an utterance (S0). According to the
system response, the user makes a correction utterance (U1). The system gets a
recognized phoneme sequence x1 of the OOV word from U1 in the same way as
x0, then uses x1 to correct the phoneme errors in y0, which results in a new
system phoneme sequence y1. The system then requests the user to confirm y1.
The user then continues to make corrections until the word is correctly learned.
In this thesis, the ith recognized phoneme sequence and the ith system phoneme
sequence are respectively denoted by xi and yi. In the example, the user makes
correction utterances not only using the whole word, but also using word segments
(U2 and U4). To perform a correction between a recognized phoneme sequence
and a system phoneme sequence, the system should first locate the phoneme errors
in the system phoneme sequences, then correct the errors.
During the interaction, users follow a pre-defined grammar to help the system
understand the users’ utterances. I assume that the users behave in such a way
according to instructions. I know that this restriction does not hold in natural
5 In this thesis, utterances made in Japanese have been translated into English.
Figure 2.3: A part of the phoneme confusion matrix, with input phonemes β along one axis and recognized phonemes α along the other. The element c(α, β) of the confusion matrix represents the number of times a phoneme β is recognized as another phoneme α.
spoken dialogues between humans and systems. I think, however, it is valuable to
conduct research under this restriction because if efficient OOV word learning is
not possible with this restriction, it will never be possible to learn OOV words in
realistic human-machine interactions. My plan is to first show that it is possible
to learn OOV words with this restriction, and then to explore ways either to effectively
and naturally instruct users to behave under this restriction or to improve speech
understanding so that more natural utterances can be handled without it.
2.2.1 Locating and correcting phoneme errors in IPU
Here I give details about how to locate and correct the phoneme errors in IPU.
First, to locate the phoneme errors in the (i − 1)th system phoneme sequence
yi−1, the ith recognized phoneme sequence xi is aligned to yi−1 and the conflicting
phoneme pairs between them are found. The alignment is performed by open-
begin-end dynamic programming matching (OBE-DPM) [55] in order to deal with
a recognized phoneme sequence obtained from a word segment. The phonemes
in the conflicting phoneme pairs are treated as phoneme errors which need to be
corrected. Then generalized posterior probability (GPP) [58] is used as a confidence
measure to measure the reliability of the phonemes in the conflicting phoneme pairs.
The phoneme with the lower GPP value is replaced by the phoneme with the higher
GPP value, which results in a new system phoneme sequence yi that is more
reliable than yi−1.

Figure 2.4: Alignment matrix (a) and alignment result (b) for the word “gashirakomori” [g a sh i r a k o m o r i]. S and E respectively denote the start and end points. By the OBE-DPM, conflicting phoneme pairs (‘r’, ‘b’), (‘ϕ’, ‘k’), (‘ϕ’, ‘ng’) and (‘n’, ‘m’) are found.
Locating phoneme errors using OBE-DPM
OBE-DPM is an extended version of dynamic programming matching. It has
been widely used for subsequence matching. In this study, I use an OBE-DPM
with phoneme distance measures calculated from a phoneme confusion matrix.
The recursive equation of the OBE-DPM is shown in Appendix C. The phoneme
confusion matrix was built from the ATR Japanese speech database C-set
(a database consisting of 142,480 speech samples from 274 speakers (137 males and
137 females), with a total of 834,521 phonemes) [31]. ATRASR [46], which was
developed by Advanced Telecommunication Research Labs, was used as the phoneme
recognizer to build the confusion matrix. Twenty-six Japanese phonemes are included in
ATRASR. A part of the phoneme confusion matrix is shown in Fig. 2.3.

Figure 2.5: The relationship between GPP values and recognition accuracy (%) for the phonemes in the speech recognizer.

The element c(α, β) of the confusion matrix represents the number of a phoneme β
recognized as a phoneme α. The phoneme distance measure s(α, β) is calculated
by
s(\alpha, \beta) = -\log \frac{c(\alpha, \beta)}{\sum_{\alpha} c(\alpha, \beta)} . \qquad (2.1)
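As a concrete illustration of Eq. (2.1), the following Python sketch computes the distance from raw confusion counts. The dictionary layout, the smoothing constant, and the toy counts are assumptions made for this example, not part of the thesis implementation.

```python
import math

def phoneme_distance(confusion, alpha, beta, eps=1e-12):
    """s(alpha, beta) = -log( c(alpha, beta) / sum_a c(a, beta) ).

    `confusion[a][b]` holds the count of input phoneme `b` being recognized
    as phoneme `a`; `eps` avoids log(0) for unseen pairs (both assumptions).
    """
    total = sum(row.get(beta, 0) for row in confusion.values())
    count = confusion.get(alpha, {}).get(beta, 0)
    return -math.log((count + eps) / (total + eps))

# Toy counts: the input phoneme 'b' is recognized as 'b' 90 times and as 'd' 10 times.
confusion = {"b": {"b": 90, "d": 5}, "d": {"b": 10, "d": 95}}
print(phoneme_distance(confusion, "b", "b"))  # small distance, ~0.11
print(phoneme_distance(confusion, "d", "b"))  # larger distance, ~2.30
```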
The alignment matrix and the alignment result for the word “gashirakomori”
[g a sh i r a k o m o r i] are shown in Fig. 2.4. The system phoneme sequence yi−1
is [g a sh i r a ϕ o n o r i], and the correction phoneme sequence xi is [sh i b a k
o ng m o], each of which includes certain errors. In this example, the sub-sequence
[sh i r a o n o] in yi−1 is obtained as the sub-sequence that corresponds to xi. The
conflicting phoneme pairs between this sub-sequence and xi are (‘r’, ‘b’), (‘ϕ’, ‘k’),
(‘ϕ’, ‘ng’) and (‘n’, ‘m’).
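The following is a minimal sketch of how an open-begin-end DP alignment of this kind can be implemented. The actual recursion used in the thesis is the one given in Appendix C; the constant gap cost, the "phi" symbol, and the function names here are assumptions made purely for illustration.

```python
def obe_align(x, y, dist, gap=2.0):
    """Align a (possibly partial) correction sequence x to the system sequence y.

    Both ends of y are "open": x may align to any contiguous part of y at no
    extra cost.  dist(a, b) is a substitution cost such as s(alpha, beta);
    gap is an assumed constant insertion/deletion cost.  Returns (x, y)
    phoneme pairs, with "phi" marking an insertion or deletion.
    """
    n, m = len(x), len(y)
    INF = float("inf")
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    for j in range(m + 1):
        D[0][j] = 0.0                       # open begin: x may start anywhere in y
    for i in range(1, n + 1):
        for j in range(m + 1):
            cands = [(D[i - 1][j] + gap, (i - 1, j, "ins"))]          # x[i-1] vs phi
            if j > 0:
                cands.append((D[i - 1][j - 1] + dist(x[i - 1], y[j - 1]),
                              (i - 1, j - 1, "sub")))
                cands.append((D[i][j - 1] + gap, (i, j - 1, "del")))  # phi vs y[j-1]
            D[i][j], back[i][j] = min(cands, key=lambda c: c[0])
    j = min(range(m + 1), key=lambda j: D[n][j])                      # open end
    pairs, i = [], n
    while i > 0:                                                      # trace back
        pi, pj, op = back[i][j]
        if op == "sub":
            pairs.append((x[i - 1], y[j - 1]))
        elif op == "ins":
            pairs.append((x[i - 1], "phi"))
        else:
            pairs.append(("phi", y[j - 1]))
        i, j = pi, pj
    return list(reversed(pairs))
```

With such an alignment, the conflicting pairs are exactly those whose two sides differ, including pairs in which one side is "phi".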
Correcting phoneme errors using GPP
Generalized posterior probability (GPP) has been used as a confidence mea-
sure to verify recognized entities at different levels, e.g., sub-word, word, and sen-
tence [58]. It is computed by generalizing the likelihoods of the sub-words with
overlapped time registrations in the word graph. In this study, I use GPP at the
phoneme level.

yi−1:  g  a  sh  i  r  a  ϕ  o  ϕ  n  o  r  i
GPP:   0.93 0.66 0.61 0.72 0.53 0.92 0.74 0.66 0.95 0.99 0.97
xi:    -  -  sh  i  b  a  k  o  ng  m  o  -  -
GPP:   -  -  0.75 0.83 0.11 0.76 0.55 0.92 0.26 0.92 0.43  -  -
yi:    g  a  sh  i  r  a  k  o  ng  m  o  r  i
GPP:   0.93 0.66 0.61 0.72 0.53 0.92 0.55 0.74 0.26 0.92 0.95 0.99 0.97

Figure 2.6: An example of a phoneme replacement. The system phoneme sequences yi−1 and yi, and the recognized phoneme sequence xi, with the GPP values for each phoneme in them, are shown in this figure. The conflicting phonemes are indicated by squares.
I investigated the relationship between GPP values and the recognition accuracy
for all phonemes in the speech recognizer using the ATR Japanese speech database
C-set. The result is shown in Fig. 2.5. I found that the phoneme recognition
accuracy increased consistently with the GPP value, which indicates the appropriateness
of using GPP as a confidence measure6 .
An example of a phoneme replacement is shown in Fig. 2.6. It shows the system
phoneme sequences yi−1 and yi, and the recognized phoneme sequence xi with the
GPP values for each phoneme in them. The conflicting phonemes are indicated
by squares. Among the conflicting phonemes, ‘r’ is not replaced with ‘b’, and ‘n’
is replaced by ‘m’ according to the GPP values. To deal with the insertion and
deletion errors, I give a threshold of 0.5, which was decided empirically by preliminary experiments. In the example, yi−1 is judged to have a
deletion error ‘k’, which is corrected by the threshold. In this example, yi becomes a
correct phoneme sequence after the replacement. GPP values are only updated for
the conflicting phoneme pairs. They are not updated for the consistent phoneme
pairs8 .
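A compact sketch of the replacement rule described above is given below, assuming the aligned (x, y) pairs and per-phoneme GPP values are already available. How the part of yi−1 outside the matched segment is carried over, and the exact handling of each insertion/deletion case, are simplifications of mine rather than the thesis specification.

```python
def merge_by_gpp(pairs, gpp_x, gpp_y, threshold=0.5):
    """Merge aligned phoneme pairs into a new (more reliable) sequence.

    pairs : (x_phoneme, y_phoneme) pairs from the alignment, "phi" = gap.
    gpp_x, gpp_y : GPP values of the non-"phi" phonemes, consumed in order.
    threshold : confidence needed to insert or keep a phoneme at a gap.
    """
    gx, gy = iter(gpp_x), iter(gpp_y)
    merged = []
    for px, py in pairs:
        cx = next(gx) if px != "phi" else None
        cy = next(gy) if py != "phi" else None
        if px == py:                    # consistent pair: keep the system phoneme
            merged.append(py)
        elif py == "phi":               # deletion error in y: insert px if confident
            if cx > threshold:
                merged.append(px)
        elif px == "phi":               # possible insertion error in y: keep py only if confident
            if cy > threshold:
                merged.append(py)
        else:                           # conflicting pair: keep the more confident phoneme
            merged.append(px if cx > cy else py)
    return merged
```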
6 However, I found that the recognition accuracy for phoneme ‘q,’ which represents a
double stop (short pause) in Japanese, does not vary directly with its GPP value. Therefore
the correction for recognition errors of ‘q’ is not performed in this study.
8 GPP values for the consistent phoneme pairs can be updated by at least the following
three ways: (1) updated by the system phoneme sequence yi−1, (2) updated by the recognized
phoneme sequence xi, and (3) updated by the average of yi−1 and xi. In our preliminary
experiments, I found that the performances of these three approaches were almost the
same, and approach (1) was used in IPU.
1. Set i← 0, M ← maximum number of correction utterances.
2. Extract the recognized phoneme sequence x0 of the OOV word from
an initial utterance.
3. Set system phoneme sequence y0 ← x0, and request the user to
confirm y0.
4. According to the user’s response,
go to step 12 if the user gives a stop utterance or,
go to step 5 if the user makes a correction utterance.
5. Set i← i+ 1.
6. If i > M then go to step 12, otherwise go to step 7.
7. Extract the recognized phoneme sequence xi of the OOV word
from a correction utterance.
8. Use xi to correct phoneme errors in yi−1 by OBE-DPM and
GPP.
9. If the correction result equals yi−1, then get another phoneme sequence
x′i from the same N-best recognition list as xi, set xi ← x′i,
and go to step 8; otherwise go to step 10. (Forced-change)
10. Update yi ← correction result, and request the user to
confirm yi.
11. According to the user’s response,
go to step 12 if the user gives a stop utterance or,
go to step 5 if the user makes a correction utterance or,
go to step 5, and set yi ← yi−1 if the user gives an undo
utterance. (Undo behavior)
12. Stop the correction process, and treat yi as the learning result.
Figure 2.7: The interaction with IPU.
2.2.2 History-based correction
Next I give details about history-based correction. Historical information of sys-
tem phoneme sequences {y0, . . . , yi−1} that were obtained previously in the course
of interaction is used to help the system estimate the current system phoneme se-
quence yi. History-based correction consists of undo behavior and forced-change,
each of which is described as follows:
Undo behavior: During the interaction, a correction sometimes results in a
phoneme sequence that is worse than the previous one. IPU enables the user
to undo such corrections.
Forced-change: During the interaction, a system phoneme sequence in which
the user finds errors should become different after a correction. IPU ensures
that each correction results in a system phoneme sequence yi that is different
from the previous system phoneme sequences {y0, . . . , yi−1}. If a recognized
phoneme sequence xi cannot result in a different phoneme sequence, another
recognized phoneme sequence x′i, which is obtained from the same N-best
phoneme recognition list as xi, is used to perform the correction instead of xi.
Finally, the algorithm for learning OOV words by IPU is shown in Fig. 2.7. Four
types of utterances can be used in this algorithm. Initial utterance “It is [oov]” is
used to teach the system new words; correction utterance “No, it is [oov]” is used
to make corrections; undo utterance “No, that’s worse” is used to undo the current
correction; and stop utterance “That’s right” is used to stop the correction when
the words are correctly learned. M denotes the maximum number of correction
utterances for each word.
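The overall interaction of Fig. 2.7 can be summarized in a few lines of Python. The function names (recognize, correct, interact) and the way replies are encoded are hypothetical placeholders for the speech recognizer, the OBE-DPM + GPP correction, and the user interface; the sketch only mirrors the control flow of the algorithm.

```python
def ipu_learn(initial_utterance, recognize, correct, interact, max_corrections=7):
    """Sketch of the IPU loop: recognize -> confirm -> correct -> (undo / stop).

    recognize(u) -> N-best list of phoneme sequences for utterance u
    correct(y, x) -> candidate sequence produced by OBE-DPM + GPP correction
    interact(y)  -> "stop", "undo", or the user's next correction utterance
    """
    history = []                                  # earlier system phoneme sequences
    y = recognize(initial_utterance)[0]           # y0 <- x0
    for _ in range(max_corrections):
        reply = interact(y)                       # system asks "Is it ...?"
        if reply == "stop":
            break
        if reply == "undo":                       # undo behavior: revert to previous sequence
            if history:
                y = history.pop()
            continue
        new_y = y
        for x in recognize(reply):                # forced-change: walk the N-best list
            new_y = correct(y, x)                 # until the correction changes the sequence
            if new_y != y and new_y not in history:
                break
        history.append(y)
        y = new_y
    return y                                      # the learned phoneme sequence
```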
2.3 Experiments
I conducted experiments on a word learning task. I first evaluated the performance
of IPU by comparing it to a baseline method. Then I performed a detailed
analysis to investigate the performance results obtained by IPU.
The experiments were abstracted away from any specific type of spoken dialogue
system, such as a robotic dialogue system or a telephone-based name dialing
system. In the experiments, users interacted with the system to teach it the
phoneme sequences of a word. This interaction was designed to separate the task from
problems that arise in each type of spoken dialogue system, such as the difficulty
of OOV word detection in a variety of user utterances and the quality of speech
synthesis.

Table 2.1: Settings for IPU and the baseline.
             GPP   ML   Word-segment   Undo   Forced-change   Stop
IPU           √    -         √          √           √          √
Baseline      -    √         -          -           -          √
2.3.1 Experiment 1: Evaluation of the performance of
IPU
Baseline
As a baseline for comparison with IPU, I ran the maximum-likelihood (ML) based
phoneme correction [65] in an on-line manner. The baseline required users to make
multiple utterances of the word. Given a set of utterances {u0, . . . , uI} of a word,
where ui denotes the ith utterance, the phoneme sequence s of the word is obtained
by searching a phoneme sequence that jointly maximizes the likelihood of all of the
input utterances from their N-best phoneme recognition lists, as

s = \operatorname*{argmax}_{s \in L(u_0) \cup \cdots \cup L(u_I)} \prod_{i=0}^{I} P(u_i \mid s), \qquad (2.2)

where L(u_0) \cup \cdots \cup L(u_I) denotes the union of the N-best phoneme recognition lists for {u_0, . . . , u_I}.
In the experiment, N was set to 50. I is dynamically given in the experiment. It
equals the number of speech samples of the word uttered by the user during the
interaction.
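For comparison, the baseline of Eq. (2.2) can be sketched as a simple search over the pooled N-best candidates. The scoring function is left abstract because how P(ui | s) is evaluated against the acoustic model is outside the scope of this illustration, and the names below are assumptions.

```python
from itertools import chain

def ml_select(nbest_lists, log_likelihood):
    """Pick the phoneme sequence that jointly maximizes the likelihood (Eq. 2.2).

    nbest_lists : one N-best list of candidate phoneme sequences per utterance
    log_likelihood(i, s) : log P(u_i | s) for the i-th utterance (assumed given)
    """
    candidates = set(map(tuple, chain.from_iterable(nbest_lists)))
    def joint_score(s):
        return sum(log_likelihood(i, s) for i in range(len(nbest_lists)))
    return list(max(candidates, key=joint_score))
```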
In the baseline, users just made repetitions of the word; word-segment-based
corrections were not possible since all speech samples must be given by the whole
word. Moreover, users only performed stop behaviors. They were not allowed to
Table 2.2: The word list used in the experiments. The right column shows the Japanese
phoneme sequences, and the left column shows the order for the words used in the experi-
ments.
No Phoneme sequence1 n a m i h a r i n e z u m i2 m a d a g a s u k a r u m i d o r i j a m o r i3 k u r o s u t e n a g a z a r u4 m i k u r o s u t o n i k u s u5 k i k u g a sh i r a k o m o r i6 m i s e s u k u m i k o7 k a s u m i z a k u r a8 t o k i w a m a ng s a k u9 b u t a ng sh i r o m a ts u10 k i b a n a ky a t a k u r i11 a ng d o r o m e d a s e u ng12 k a m i n o k e z a b e t a s e13 s a ng g u r e z a14 m a z e r a n i k u s u t o r i m u15 r i zh i r u k e ng t a u r u s u16 h a r a t a k a sh i17 g o ng s u ng z a ng18 n o g u ch i h i d e j o19 j o s a n o a k i k o20 b a o z u ng21 a zh i s a i22 t a n u k i23 j o sh i o24 k a r u p i s u25 a k u e r i a s u
perform undo behaviors, and forced-changes were not used by the system. In other
words, the baseline did not have history-based corrections.
The settings for IPU and the baseline are summarized in Table 2.1. “GPP” and
“ML” respectively represent the GPP-based phoneme correction used in IPU and
the ML-based phoneme correction used in the baseline. “Word-segment,” “Undo,”
“Forced-change” and “Stop” respectively represent word-segment-based correction,
undo behavior, forced-change and stop behavior. “√” indicates that the corresponding
factor was used.
Setting
I prepared a word list including 25 Japanese words. The word list includes
names of animals, plants, celestial objects, and persons from Wikipedia9. The total num-
ber of phonemes was 305, and each word included 12.2 phonemes on average.
Table 2.2 shows the Japanese phoneme sequence for the words that were used in
the experiments.
ATRASR, which was used to build the confusion matrix, was used as a speech
recognizer in the experiments. Speaker independent phoneme models were used in
ATRASR. The phoneme models were represented by context-dependent HMMs,
with Gaussian mixture distributions in each state. Mel-scale cepstrum coefficients
and their delta parameters (25-dimensional MFCC) were used as feature parameters.
A grammar including a phoneme network like the one shown in Fig. 2.2
was used in the speech recognizer. The phoneme network was constructed according to
Japanese phonotactic constraints. Phoneme N-gram models were not used in the
phoneme network to avoid their influence on the performance of IPU.
Phoneme accuracy (P%) and word pronunciation accuracy (W%) were used
for evaluation, each of which is defined as
P = \frac{N_p - S - D - I}{N_p} \times 100, \qquad W = \frac{N_w - N_e}{N_w} \times 100, \qquad (2.3)
where Np and Nw denote the number of phonemes and words used in the experi-
ment (Np = 305 and Nw = 25), S, D and I respectively denote the total number of
phonemes with substitution, deletion and insertion errors, and Ne denotes the total
number of words that contain misrecognized phonemes. Phoneme
accuracy Pi and word pronunciation accuracy Wi of the ith system phoneme se-
quences are calculated after the ith corrections.
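The two measures in Eq. (2.3) reduce to the following small helpers; the example numbers in the usage lines are made up and are not results from the experiment.

```python
def phoneme_accuracy(n_phonemes, substitutions, deletions, insertions):
    """Phoneme accuracy P (%) as defined in Eq. (2.3)."""
    return (n_phonemes - substitutions - deletions - insertions) / n_phonemes * 100

def word_pronunciation_accuracy(n_words, n_words_with_errors):
    """Word pronunciation accuracy W (%) as defined in Eq. (2.3)."""
    return (n_words - n_words_with_errors) / n_words * 100

# Hypothetical error counts for the 305 phonemes and 25 words of the word list:
print(phoneme_accuracy(305, 30, 10, 8))        # ~84.3
print(word_pronunciation_accuracy(25, 20))     # 20.0
```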
9 http://ja.wikipedia.org/
• Please teach the system new words as “It is [oov].”
• The system may mis-recognize certain phonemes. Please correct the phoneme
errors as “No, it is [oov].” You can repeat the whole word or just a part of
the word in the utterance.
• According to the system response, you can continue correcting or undo the
correction. Otherwise please stop the correction when the phoneme sequence
becomes correct.
• At most seven corrections can be made for each word.
• During the interaction, do not change your accent on purpose. Please speak
naturally.
Figure 2.8: The details of the instructions given to the participants.
Protocol
18 native Japanese speakers (twelve males and six females) participated in the
experiment. The participants were students and staff in our research institutes.
They initially did not have any knowledge about the proposed method. Each
participant did the experiments according to the following procedure.
First, each participant taught the words to the system using IPU. Before each
experimental session, a trial use of the system was permitted. Then the partic-
ipant sat on a chair 40cm from a SANKEN CS-3e directional microphone and
taught the words in the list to the system in Japanese according to the instruc-
tions whose details are shown in Fig. 2.8. In the experiment, participants uttered
only initial and correction utterances; undo and stop were performed by keyboard
operation in order to avoid recognition errors. The system phoneme sequences
were synthesized10 , and shown in katakana11 on a display to help the participant
find the phoneme errors12 . The maximum number M of correction utterances was
10 I used VoiceText (http://www.voicetext.jp) for speech synthesis.
11 Katakana is a kind of Japanese phonogram. It can be uniquely converted from the phoneme sequences based on a mapping table.
12 Visual feedback is not necessary in IPU. In practical use, IPU can be used with/without visual feedback according to the scenario.
Table 2.3: The statements for subjective evaluations for IPU.
No Statement
Q1 The system was efficient to correct phoneme errors.
Q2 The correction method was easy to understand.
Q3 The interaction with the system was smooth.
Q4 The participant would like to use the system to teach new words.
set to seven. Therefore, at most eight utterances were made for each word. After
the experiment of IPU, each participant used a five-point rating scale to evalu-
ate the relevance (5: very relevant, 4: somewhat relevant, 3: even, 2: somewhat
irrelevant, 1: irrelevant) of the statements shown in Table 2.3.
Then, each participant taught the words to the system using the baseline. The
participant was instructed to repeatedly make utterances “It is [oov]” using the
words in the word list until the words were correctly learned. The participant just
repeated the whole words; word segments were not allowed. During the interaction,
the participants only operated stop; undo behaviors were not allowed.
Finally, a data collection process was performed. In the baseline experiment
some words were learned with fewer than eight utterances. I therefore additionally collected
speech samples of the form “It is [oov]” containing the whole words in the word list to
ensure that each word had eight speech samples. As a result, a total of 3,600 speech
samples (200 speech samples for one participant) were collected in the baseline
experiment and the data collection process. These speech samples were used in the
next experiment.
Results
The phoneme and word pronunciation accuracies that were obtained from 18
participants are respectively shown in Fig. 2.9 (a) and Fig. 2.9 (b). The horizontal
axis represents the number of correction utterances (‘0’ represents the initial utter-
ance). The phoneme and word pronunciation accuracies for the initial utterance
Figure 2.9: The phoneme accuracy (a) and word pronunciation accuracy (b) achieved by IPU and the baseline, plotted against the number of correction utterances.
were 84.1% and 20.4%. These values represent the performance of the speech rec-
ognizer without any corrections. IPU outperformed the baseline in both phoneme
and word pronunciation accuracies. For IPU, the accuracies improved significantly
as the number of correction utterances increased, and achieved 96.8% and 79.1%
respectively in phoneme and word pronunciation accuracies after the seventh cor-
rection, while for the baseline, the accuracies did not improve much, and achieved
only 90.4% and 49.8%.
Figure 2.10: The error rate reductions (%) in (a) phoneme accuracy and (b) word pronunciation accuracy achieved by each correction utterance relative to the previous correction utterance, for IPU and the baseline.
The error rate reductions achieved by each correction utterance relative to
the previous correction utterance in IPU and the baseline are shown in Fig. 2.10.
I found that the error rate reductions achieved by IPU outperformed the error
Figure 2.11: Relationship between the number of phonemes in each word (horizontal axis) and the number of correction utterances used for that word in IPU (vertical axis). Correction utterances containing a word segment and the whole word are shown as “segment” and “whole”, respectively.
rate reductions achieved by the baseline in both phoneme and word pronunciation
accuracies. The average error rate reductions of seven correction utterances for
IPU and the baseline were 20.3% and 6.8% in phoneme accuracy, and 17.4% and
6.1% in word pronunciation accuracy. This means that the error rate reductions
achieved by IPU were about three times those of the baseline.
In IPU, the average number of correction utterances and word-segment-based
corrections used by the participants were 3.27 and 2.97. This means that in IPU,
90.8% of the corrections were done using word segments. The average number
of undo behaviors performed by the participants for each word was 2.21, and the
average number of forced-changes performed by the system for each word was
1.33. Stop behaviors were performed by the participants for the words which were
correctly learned during the interaction.
Moreover, the relationship between the number of phonemes in each word and
the number of correction utterances used for that word in IPU is shown in Fig. 2.11.
The number of correction utterances including a word segment and the whole
word are respectively shown by “segment” and “whole”. I found that the words
Dialogue Recognized phoneme sequence System phoneme sequence
U0: “It is mikurosutonikusu.” x0: m i k u r o s ϕ ϕ o n i k u s u y0: m i k u r o s ϕ ϕ o n i k u s u
S0: “Is it mikurosonikusu?”
U1: “No, it is sutonikusu.” x1: s u t o n i g u s u # m i k u r o s ϕ ϕ o n i g u s u
S1: “Is it mikurosonigusu?”
U2: “No, that’s worse.” y1: m i k u r o s ϕ ϕ o n i k u s u
U3: “It is sutonikusu.” x2: s u t o n i k u s u y2: m i k u r o s u t o n i k u s u
S2: “Is it mikurosutonikusu?”
U4: “That’s right.”
Figure 2.12: An example for the word “mikurosutonikusu” [m i k u r o s u t o n i k u s
u] that was successfully learned.
Dialogue Recognized phoneme sequence System phoneme sequence
U0: “It is butangshiromatsu.” x0: b e d a ng s u ϕ o m a ts u y0: b e d a ng s u ϕ o m a ts u
S0: “Is it bedangsuomatsu?”
U1: “No, it is butangshiromatsu.” x1: b u d a ng sh i d o m a ts u y1: b u d a ng sh i ϕ o m a ts u
S1: “Is it budangshiomatsu?”
U2: “No, it is shiromatsu.” x2: sh i d o m a ts u y2: b u d a ng sh i d o m a ts u
S2: “Is it budangshidomatsu?”
U3: “No, it is shiro.” x3: zh i d o # b u d a ng zh i d o m a ts u
S3: “Is it budangzhidomatsu?”
U4: “No, that’s worse.” y3: b u d a ng sh i d o m a ts u
U5: “It is shiromatsu.” x4: sh i d o n a ts u # b u d a ng sh i d o n a ts u
S4: “Is it budangshidomotsu?”
U6: “No, that’s worse.” y4: b u d a ng sh i d o m a ts u
U7: “It is shiromatsu.” x5: sh i n o m a ts u y5: b u d a ng sh i n o m a ts u
S5: “Is it budangshinomatsu?”
U8: “No, it is butangshiromatsu.” x6: b u e t a ϕ s u n o m a ts u y6: b u t a ng sh i n o m a ts u
S6: “Is it butangshinomatsu?”
U9: “No, it is shiromatsu.” x7: s u ϕ o n a ts u y7: b u t a ng sh i n o n a ts u
S7: “Is it butangshiomatsu?”
Figure 2.13: An example for the word “butangshiromatsu” [b u t a ng sh i r o m a ts u]
that was not successfully learned.
with more phonemes required more correction utterances and more word-segment
corrections.
Furthermore, Fig. 2.12 and Fig. 2.13 respectively show the examples of words
that were successfully and not successfully learned in the experiment. The left
column shows the dialogue, the middle column shows the recognized phoneme
sequence of the OOV word, and the right column shows the system phoneme
sequences after each correction. In the examples, both word segments and the
whole words were used in correction utterances. The corrections that were undone
are indicated by “#”.

Figure 2.14: Subjective evaluations of IPU. Horizontal axes show the opinion scores, and vertical axes show the number of participants. The average scores were Q1: 3.59, Q2: 3.35, Q3: 3.65, and Q4: 3.59.
Finally, the results of the subjective evaluation are shown in Fig. 2.14. The opin-
ion scores for Q1, Q2, Q3 and Q4 are shown in the figure. The horizontal
axes show the opinion scores, and the vertical axes show the number of participants.
The average scores are given in the figure caption. I found that most of the participants
gave scores equal to or greater than three for all of the statements, which indicates
that the participants had positive impressions of the system.
2.3.2 Experiment 2: Investigation of the factors of performance results
Setting
Next I investigated the factors of performance improvement in IPU in order
to evaluate the effectiveness of word-segment-based correction, undo behavior and
forced-change. I ran IPU under the following conditions:
Condition-1: Only the whole words were used for correction. Word-segment-
based corrections were not used.
Condition-2: Only the whole words were used for correction. Word-segment-
based corrections were not used. Forced-changes were not done in the cor-
rection process.
Table 2.4: Settings for IPU, Condition-1, Condition-2, Condition-3, Condition-4 and
Condition-5.
              GPP   ML   Word-segment   Undo   Forced-change   Stop
IPU            √    -         √          √           √          √
Condition-1    √    -         -          √           √          √
Condition-2    √    -         -          √           -          √
Condition-3    √    -         -          -           √          √
Condition-4    √    -         -          -           -          √
Condition-5    -    √         -          √           √          √
Condition-3: Only the whole words were used for correction. Word-segment-
based corrections were not used. Users were not allowed to perform undo
behaviors during the interaction.
Condition-4: Only the whole words were used for correction. Word-segment-
based corrections were not used. Users were not allowed to perform undo
behaviors during the interaction. Forced-changes were not done in the cor-
rection process.
Finally, I combined the ML-based phoneme correction with history-based cor-
rections as in Condition-5.
Condition-5: The phoneme sequences of the words were obtained by the ML-
based phoneme correction using equation (2.2). During the interaction, users
were allowed to perform stop and undo behaviors. Forced-changes were used
by the system. Word-segment-based corrections were not possible in this
condition. Only the whole words were used for correction.
The settings for all of these conditions as well as IPU are summarized in Ta-
ble 2.4. Since word-segment-based correction was not used in these conditions,
the 3,600 speech samples including the whole words collected in experiment 1 were
used as input data for these conditions. The speech samples were automatically
Figure 2.15: The phoneme accuracies (a) and word pronunciation accuracies (b) achieved by Condition-1, Condition-2, Condition-3, Condition-4, Condition-5 and IPU, plotted against the number of correction utterances.
inputted into the system. In the experiment, undo and stop were operated by the
same participants as in experiment 1.
Results
The phoneme and word pronunciation accuracies for Condition-1, Condition-2,
Condition-3, Condition-4, Condition-5, as well as IPU are shown in Fig. 2.15. The
comparison of IPU and Condition-1 shows the effectiveness of word-segment-based
Table 2.5: The detailed results of the t-test in phoneme accuracy.
                     Number of correction utterances
                     1      2      3      4      5      6      7
IPU to C-1   T(898)  0.14   0.43   3.10   3.82   3.54   3.60   4.98
             p       0.89   0.67   <.01   <.01   <.01   <.01   <.01
C-2 to C-4   T(898)  6.02   6.89   7.46   7.52   7.85   8.02   8.14
             p       <.01   <.01   <.01   <.01   <.01   <.01   <.01
C-3 to C-4   T(898)  0.03   0.37   3.03   3.88   4.03   4.43   6.98
             p       0.89   0.67   <.01   <.01   <.01   <.01   <.01
C-1 to C-5   T(898)  0.07   0.12   0.26   0.33   1.00   1.35   1.17
             p       0.96   0.93   0.84   0.81   0.48   0.33   0.42
Table 2.6: The detailed results of the t-test in word pronunciation accuracy.
                     Number of correction utterances
                     1      2      3      4      5      6      7
IPU to C-1   T(898)  1.20   1.87   3.58   3.58   4.41   5.76   5.66
             p       0.24   0.08   <.01   <.01   <.01   <.01   <.01
C-2 to C-4   T(898)  0.63   0.49   0.50   0.14   1.37   2.30   3.03
             p       0.53   0.63   0.62   0.89   0.71   0.49   <.01
C-3 to C-4   T(898)  1.49   4.63   4.92   5.33   5.88   6.51   7.03
             p       0.65   <.01   <.01   <.01   <.01   <.01   <.01
C-1 to C-5   T(898)  0.63   0.55   0.53   0.15   0.40   0.70   0.79
             p       0.58   0.68   0.66   0.90   0.75   0.56   0.55
correction; the comparison of Condition-2 and Condition-4 shows the effective-
ness of undo behavior; the comparison of Condition-3 and Condition-4 shows the
effectiveness of forced-change; the comparison of Condition-1 and Condition-4 ad-
ditionally shows the total effectiveness for both undo behavior and forced-change;
and the comparison of Condition-1 and Condition-5 shows the difference between
the GPP-based phoneme correction and the ML-based phoneme correction. I found
that word-segment-based correction, undo behavior and forced-change contributed
performance improvements in both phoneme and word pronunciation accuracies
in a cumulative way. Moreover, I found that undo behaviors were more efficient
than forced-changes in phoneme accuracy (see Condition-2 and Condition-3 in
Fig. 2.15 (a)). This is because undo behaviors prevent the phoneme sequence
from getting worse, which directly improves the phoneme accuracy. In contrast,
forced-changes were more efficient than undo behaviors in word pronunciation accuracy
(see Condition-2 and Condition-3 in Fig. 2.15 (b)). This is because
forced-changes ensure that each correction results in a different system phoneme
sequence, and thus improve the possibility of obtaining correct system phoneme
sequences. However, the performances of Condition-1 and Condition-5 were almost
the same. This means that the performances of the GPP-based phoneme correction
and the ML-based phoneme correction were almost the same.
I also performed paired t-tests to investigate the statistical differences between
IPU and Condition-1, between Condition-2 and Condition-4, between Condition-3
and Condition-4, and between Condition-1 and Condition-5. The detailed results of
the t-tests in phoneme and word pronunciation accuracy are respectively shown
in Table 2.5 and Table 2.6. In the tables, “C-1,” “C-2,” “C-3,” “C-4” and “C-5”
respectively represent Condition-1, Condition-2, Condition-3, Condition-4 and
Condition-5. “T(898)” represents the t-value obtained by the t-test, with 898 degrees
of freedom. Entries with p < .01 indicate statistically significant differences.
The speech samples used in the experiment were sufficient to obtain statistically
significant differences. The statistical difference between IPU and Condition-1
indicated the validity of word-segment-based correction; the statistical differences
between Condition-2 and Condition-4 indicated the validity of undo behavior; and
the statistical difference between Condition-3 and Condition-4 indicated the validity
of forced-change. The comparison between Condition-1 and Condition-5 shows that
there was no statistical difference between the GPP-based phoneme correction and
the ML-based phoneme correction. The GPP-based phoneme correction, however,
enabled the users to make corrections according to word segments, while the ML-
based phoneme correction only allowed users to make corrections using the whole
words.
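The kind of comparison reported in Tables 2.5 and 2.6 could be reproduced with an off-the-shelf paired t-test; the data below are placeholders, not the accuracies collected in the experiment.

```python
from scipy import stats

# Per-sample accuracies for IPU and Condition-1 after a given number of
# correction utterances (placeholder values, not the thesis data).
acc_ipu = [96.0, 95.5, 97.2, 94.8]
acc_c1 = [93.1, 92.7, 94.0, 91.5]

t, p = stats.ttest_rel(acc_ipu, acc_c1)   # paired t-test
print(f"T = {t:.2f}, p = {p:.3f}")
```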
2.4 Discussion
2.4.1 Improvement of learning performance
In the experiment, IPU was evaluated with a grammar including a phoneme
network in which each phoneme has the same transition probability to all possible
phonemes in the network, in order to avoid the influence of any
specific phoneme N-gram model. The experimental results showed that IPU was
very efficient even under such a simple grammar. I consider that, in practical use,
the learning performance might be further improved by incorporating a phoneme
N -gram model into the speech recognizer. Moreover, by observing the data col-
lected in the experiments, I found that the recognition errors caused by individual
speaking characteristics were hard to correct. This problem can be solved by inte-
grating speaker adaptation technology such as MLLR (maximum-likelihood linear
regression) [36] into IPU.
2.4.2 Influence caused by visual feedback
In the experiment, IPU was evaluated under a condition where the system
responded to the participants with both speech synthesis and visual feedback in
order to help the users find the phoneme errors. This is because the purpose of
this study is to learn the correct phoneme sequences of OOV words. However,
visual feedback is not necessary in IPU. In some scenarios a system is required
to communicate with users without any visual feedback. In such cases, the words
that have been learned by the system might include some erroneous phonemes,
since it is difficult to verify phoneme errors only from synthesized speech; users
might therefore make mistakes when confirming the phoneme sequences that the
system responds with. The phoneme errors will reduce the performance
of speech recognition and speech synthesis in the subsequent system processing.
However, it is unclear how the performance deteriorations in speech recognition
32
and speech synthesis affect communications between humans and systems. I will
investigate such influences in future works.
2.4.3 Integration with OOV word detection
Although I have dealt only with pre-defined template utterances such as "It is [oov]" in this study, I have considered integrating an OOV word detection method with IPU to detect OOV words in arbitrary utterances that are not prepared in the system. For example, if the system can detect an OOV word in an utterance such as "Can you search for migurikon?", in which "migurikon" is an OOV word, it can invoke the word learning process.

A large number of methods for detecting OOV words in utterances have been proposed. There are two basic approaches: (1) OOV word models, which detect OOV words using a sub-word or generic word model [66, 6, 75], and (2) confidence estimation models, which use confidence scores (e.g., sentence- and word-level confidence scores) to find unreliable regions in word lattices (or N-best lists) and label them as OOV words [68, 59]. Moreover, an approach that combines confidence scores and OOV word models to improve OOV word detection was recently presented in [53], and contextual information has been used to improve detection accuracy in [50]. Some of the above methods might be suitable for integration with IPU.
2.5 Summary
This chapter described Interactive Phoneme Update, a method that enables
users to correct phoneme recognition errors of OOV words by speech interaction.
The original features of the method are (1) word-segment-based correction and (2)
history-based correction. The experimental results clearly showed that IPU is very efficient in learning OOV words and indicated the validity of each of these features.

In addition to phoneme sequences, the learning of accent is also important. In languages such as Japanese and Chinese, some words have the same phoneme sequences and can only be distinguished by accent. In future work, I will extend IPU to learn accents.
Chapter 3 Detecting Utterance Targets
3.1 Background
This chapter describes the robot-directed speech detection method, which detects the target of utterances for a robot. Robot-directed (RD) speech detection in previous studies has been based mainly on human behaviors. For example, Lang et al. [33] proposed a method for a robot to detect the direction of a person's attention based on face recognition, sound source localization, and leg detection. Mutlu et al. [44] conducted experiments under conditions of human-robot conversation and studied how a robot could establish the participant roles of its conversational partners using gaze cues. Yonezawa et al. [76] proposed an interface for a robot to communicate with users based on detecting their gaze direction during speech. However, this kind of method leaves the possibility that users may say something irrelevant to the robot while looking at it. Consider a situation where users A and B are talking while looking at the robot in front of them (Fig. 3.1).
A: Cool robot! What can it do?
B: It can understand your command, like “Bring me the red box.”
Note that this speech is referential, not directed to the robot. Moreover, even though user B utters something that sounds like RD speech ("Bring me the red box"), she does not really intend to give such an order, because no red box exists in the current situation. How can we build a robot that responds appropriately in this situation?
Figure 3.1: People talking while looking at a robot.
To settle this issue, the proposed method is based not only on gaze tracking but also on domain classification of the input speech into RD speech and out-of-domain (OOD) speech. Domain classification for robots in previous studies was based mainly on linguistic and prosodic features. For example, a method based on keyword spotting was proposed in [28]. With such a method, however, it is difficult to distinguish RD speech from explanations of system usage (as in the example of Fig. 3.1), which becomes a problem when both types of speech contain the same "keywords." To address this problem, a previous study [62] showed that the difference in prosodic features between RD speech and other speech usually appears at the head and tail of the speech, and proposed a method to detect RD speech using such features. That method, however, requires users to adjust their prosody to fit the system, which places an additional burden on them.
In this study, the robot executes an object manipulation task in which it manipulates objects according to a user's speech. An example of this task in a home
environment is a user telling a robot to “Put the dish in the cupboard.” Solv-
ing this task is fundamental for assistive robots.

Figure 3.2: Robot used in the object manipulation task.

In this task, I assume that a user orders the robot to execute an action that is feasible in the current situation.
Therefore, the word sequence and the object manipulation obtained by the process of understanding RD speech should be possible and meaningful in the given situation. In contrast, the word sequence and the object manipulation obtained by the process of understanding OOD speech would not be feasible. Therefore, I can distinguish between RD and OOD speech by using as a measure the feasibility of the word sequence and the object manipulation obtained from the speech understanding process. Based on this concept, I developed a multimodal semantic confidence (MSC) measure. A key feature of MSC is that it is not based on prosodic features of the input speech, as in the method described above; rather, it is based on semantic features that determine whether the speech can be interpreted as a feasible action under the current physical situation. Moreover, in an object manipulation task a robot must deal with speech and image signals and carry out a motion according to the speech. Therefore, the MSC measure is calculated by integrating information obtained from speech, object images, and robot motion.
The rest of this chapter is organized as follows. Section 3.2 gives the details of
the object manipulation task. Section 3.3 describes the proposed RD speech detection method. The experimental settings and results are presented in Section 3.4, and Section 3.5 gives a discussion. Finally, Section 3.6 concludes the chapter.

Figure 3.3: Cameras, microphone, sensor and head unit of the robot.
3.2 Object Manipulation Task
In this study, humans use a robot to perform an object manipulation task. Fig-
ure 3.2 and Fig. 3.3 show the robot used in this task. It consists of a manipulator
with 7 degrees of freedom (DOFs), a 4-DOF multi-fingered grasper, a SANKEN
CS-3e directional microphone for audio signal input, a Point Grey Research Bum-
blebee 2 stereo vision camera for video signal input, a MESA Swiss Ranger SR4000
infrared sensor for 3-dimensional distance measurement, a Logicool Qcam Pro 9000
camera for human gaze tracking, and a head unit for robot gaze expression.
In the object manipulation task, users sit in front of the robot and command
the robot by speech to manipulate objects on a table located between the robot and
the user. Figure 3.4 shows an example of this task. In this figure, the robot is told
to place Object 1 (Kermit) on Object 2 (big box) by the command speech "Place-on Kermit¹ big box,"² and the robot executes an action according to this speech. The solid line in Fig. 3.4 shows the trajectory of the moving object manipulated by the robot.

¹ Kermit is the name of the stuffed toy used in our experiment.
² Commands made in Japanese have been translated into English in this study.

Figure 3.4: Example of object manipulation tasks.
Commands used in this task are represented by a sequence of phrases, each of
which refers to a motion, an object to be manipulated (“trajector”), or a reference
object for the motion (“landmark”). In the case shown in Fig. 3.4, the phrases
for the motion, trajector, and landmark are “Place-on,” “Kermit,” and “big box,”
respectively. Moreover, fragmentary commands without a trajector phrase or a landmark phrase, such as "Place-on big box" or just "Place-on," are also acceptable.
To execute a correct action according to such a command, the robot must
understand the meaning of each word in it, which is grounded by the physical
situation. The robot must also have a belief about the context information to estimate the corresponding objects for fragmentary commands. In this study,
I used the speech understanding method proposed by [21] to interpret the input
speech as a possible action for the robot under the current physical situation.
However, for an object manipulation task in a real-world environment, there may exist OOD speech such as chatting, soliloquies, or noise. Consequently, an RD speech detection method should be used.

Figure 3.5: Flowchart of the proposed RD speech detection method.
3.3 Proposed RD Speech Detection Method
The proposed RD speech detection method is based on integrating gaze tracking
and the MSC measure. A flowchart is given in Fig. 3.5. First, a Gaussian mixture
model based voice activity detection method (GMM-based VAD) [35] is carried out
to detect speech from the continuous audio signal, and gaze tracking is performed
to estimate the gaze direction from the camera images.³

³ In this study, gaze direction was identified by the human face angle. I used faceAPI (http://www.seeingmachines.com) to extract face angles from images captured by a camera.

If the proportion of the user's gaze at the robot during her/his speech is higher than a certain threshold
η, the robot judges that the user was looking at it while speaking. The speech
during the periods when the user is not looking at the robot is rejected. Then, for
the speech detected while the user was looking at the robot, speech understanding
is performed to output the indices of a trajector object and a landmark object, a
motion trajectory, and corresponding phrases, each of which consists of recognized
words. Then, three confidence measures, i.e., for speech ($C_S$), object image ($C_O$), and motion ($C_M$), are calculated to evaluate the feasibilities of the output word sequence, the trajector and landmark, and the motion, respectively. The weighted
sum of these confidence measures with a bias is input to a logistic function. The bias and weightings $\{\theta_0, \theta_1, \theta_2, \theta_3\}$ are optimized by logistic regression [18].
Here, the MSC measure is defined as the output of the logistic function, and it
represents the probability that the speech is RD speech. If the MSC measure is
higher than a threshold δ, the robot judges that the input speech is RD speech and
executes an action according to it. In the rest of this section, I give details of the
speech understanding process and the MSC measure.
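To make the flow of Fig. 3.5 concrete, the following minimal Python sketch reproduces only its decision logic; the gaze proportion and MSC value are assumed to have been computed by the components described above, and the inputs below are toy values:

    # Sketch of the decision logic in Fig. 3.5; gaze_ratio and c_ms are
    # assumed to come from the gaze tracker and the MSC measure (Eq. 3.3).
    ETA = 0.5     # gaze-proportion threshold (value used in the on-line experiment)
    DELTA = 0.79  # MSC threshold (value optimized in the off-line experiment)

    def is_rd_speech(gaze_ratio: float, c_ms: float) -> bool:
        """Gaze gate first, then the MSC threshold, as in Fig. 3.5."""
        if gaze_ratio <= ETA:   # the user was not looking at the robot: reject
            return False
        return c_ms > DELTA     # RD speech iff the MSC measure exceeds delta

    print(is_rd_speech(gaze_ratio=0.8, c_ms=0.93))  # True: execute the action
    print(is_rd_speech(gaze_ratio=0.2, c_ms=0.93))  # False: rejected by gaze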
3.3.1 Speech Understanding
Given input speech s and a current physical situation consisting of object
information O and behavioral context q, speech understanding selects the opti-
mal action a based on a multimodal integrated user model. O is represented as
$O = \{(o_{1,f}, o_{1,p}), (o_{2,f}, o_{2,p}), \ldots, (o_{m,f}, o_{m,p})\}$, which includes the visual features $o_{i,f}$ and positions $o_{i,p}$ of all objects in the current situation, where $m$ denotes the number of objects and $i$ denotes the index of each object, given dynamically in the situation. $q$ includes information on which objects were the trajector and landmark in the previous action, and on which object the user is now holding. $a$ is defined as $a = (t, \xi)$, where $t$ and $\xi$ denote the index of the trajector and a trajectory of
motion, respectively. A user model integrating the five belief modules – (1) speech,
(2) object image, (3) motion, (4) motion-object relationship, and (5) behavioral
context – is called an integrated belief. Each belief module and the integrated
belief are learned by the interaction between a user and the robot in a real-world
environment.
Lexicon and Grammar
The robot has basic linguistic knowledge, including a lexicon L and a grammar
Gr. L consists of pairs of a word and a concept, each of which represents an object
image or a motion. The words are represented by sequences of phonemes, each of which is represented by an HMM using mel-scale cepstrum coefficients and their delta parameters (25-dimensional) as features.
represented by Gaussian functions in a multi-dimensional visual feature space (size,
color (L∗, a∗, b∗), and shape). The concepts of motions are represented by HMMs
using the sequence of three-dimensional positions and their delta parameters as
features.
The word sequence of speech $s$ is interpreted as a conceptual structure $z = [(\alpha_1, w_{\alpha_1}), (\alpha_2, w_{\alpha_2}), (\alpha_3, w_{\alpha_3})]$, where $\alpha_i$ represents the attribute of a phrase and takes a value in $\{M, T, L\}$; $w_M$, $w_T$, and $w_L$ represent the phrases describing a motion, a trajector, and a landmark, respectively. For example, the user's utterance "Place-on Kermit big box" is interpreted as follows: [($M$, Place-on), ($T$, Kermit), ($L$, big box)]. The grammar $G_r$ is a statistical language model that is represented
by a set of occurrence probabilities for the possible orders of attributes in the
conceptual structure.
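As an illustration only (this is not the system's actual data structure), the conceptual structure can be written down as a list of attribute-phrase pairs:

    # Hypothetical sketch of the conceptual structure z described above.
    from typing import List, Tuple

    ConceptualStructure = List[Tuple[str, str]]  # (attribute, phrase) pairs

    # "Place-on Kermit big box" interpreted as in the example above;
    # attributes: M = motion, T = trajector, L = landmark.
    z: ConceptualStructure = [("M", "Place-on"), ("T", "Kermit"), ("L", "big box")]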
Belief modules and Integrated Belief
Each of the five belief modules in the integrated belief is defined as follows.
Speech BS: This module is represented as the log probability of speech s
conditioned by z, under grammar Gr.
Object image BO: This module is represented as the log likelihood of wT
and wL given the trajector’s and the landmark’s visual features ot,f and ol,f .
Motion BM : This module is represented as the log likelihood of wM given
the trajector’s initial position ot,p, the landmark’s position ol,p, and trajectory ξ.
Motion-object relationship BR: This module represents the belief that in
the motion corresponding to wM , features ot,f and ol,f are typical for a trajector
and a landmark, respectively. This belief is represented by a multivariate Gaussian
distribution of the vector $[o_{t,f},\; o_{t,f} - o_{l,f},\; o_{l,f}]^T$.
Behavioral context BH : This module represents the belief that the current
speech refers to object o, given behavioral context q.
Given weighting parameter set $\Gamma = \{\gamma_1, \ldots, \gamma_5\}$, the degree of correspondence between speech $s$ and action $a$ is represented by the integrated belief function $\Psi$, written as

$$
\begin{aligned}
\Psi(s, a, O, q, \Gamma) = \max_{z,\,l} \Big(\, & \gamma_1 \log P(s \mid z)\, P(z; G_r) && [B_S] \\
+\, & \gamma_2 \big( \log P(o_{t,f} \mid w_T) + \log P(o_{l,f} \mid w_L) \big) && [B_O] \\
+\, & \gamma_3 \log P(\xi \mid o_{t,p}, o_{l,p}, w_M) && [B_M] \\
+\, & \gamma_4 \log P(o_{t,f}, o_{l,f} \mid w_M) && [B_R] \\
+\, & \gamma_5 \big( B_H(o_t, q) + B_H(o_l, q) \big) \Big), && [B_H]
\end{aligned}
\tag{3.1}
$$

where $l$ denotes the index of the landmark, and $o_t$ and $o_l$ denote the trajector and landmark, respectively. Conceptual structure $z$ and landmark $o_l$ are selected to maximize the value of $\Psi$. Then, as the meaning of speech $s$, the corresponding action $a$ is determined by maximizing $\Psi$:

$$ a = (t, \xi) = \operatorname*{argmax}_{a} \Psi(s, a, O, q, \Gamma). \tag{3.2} $$
Finally, the action $a = (t, \xi)$, the index of the selected landmark $l$, and the conceptual structure (recognized word sequence) $z$ are output by the speech understanding process.
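The following minimal sketch illustrates Eqs. (3.1) and (3.2) as a weighted sum of precomputed module scores, maximized over candidate interpretations; the candidate scores are assumed to be supplied by the belief modules, and the γ values are those reported later in Section 3.4.1:

    # Sketch of the integrated belief (Eq. 3.1) and action selection (Eq. 3.2).
    # Each candidate bundles an action with the log-scores of the five belief
    # modules, assumed to be precomputed elsewhere.
    GAMMA = (1.00, 0.75, 1.03, 0.56, 1.88)  # gamma_1 .. gamma_5 (Section 3.4.1)

    def psi(scores):
        """Weighted sum of the five module scores for one candidate (z, a, l)."""
        return sum(g * b for g, b in zip(GAMMA, scores))

    def select_action(candidates):
        """Eq. (3.2): return the action of the candidate that maximizes Psi.
        `candidates` is a list of (action, [B_S, B_O, B_M, B_R, B_H]) pairs."""
        return max(candidates, key=lambda c: psi(c[1]))[0]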
Learning the Parameters
In speech understanding, each belief module and the weighting parameters Γ in the integrated belief are learned online, in a natural way, through human-robot interaction in the environment in which the robot is used [21]. For example, a user shows an object to the robot while uttering a word describing it; this makes the robot learn the phoneme sequence of the spoken word that refers to the object, and learn, by Bayesian learning, the Gaussian parameters representing the object image concept. In addition, the user orders the robot to move an object by making an utterance and a gesture, and the robot acts in response. If the robot responds incorrectly, the user slaps the robot's hand, and the robot acts differently in response. The weighting parameters Γ are learned incrementally online with minimum classification error learning [27] through such interaction. This learning process can be conducted easily by a non-expert user. In contrast, other speech understanding methods need an expert to manually adjust their parameters, which is not practical for ordinary users. Therefore, in comparison with other methods, the speech understanding method used in this study has the advantage of adapting to different environments, depending on the user.
3.3.2 MSC Measure
Next, I describe the proposed MSC measure. The MSC measure $C_{MS}$ is calculated from the outputs of speech understanding and represents an RD speech probability. For input speech $s$ and current physical situation $(O, q)$, speech understanding is performed first, and then $C_{MS}$ is calculated by logistic regression as

$$ C_{MS}(s, O, q) = P(\mathit{domain} = \mathrm{RD} \mid s, O, q) = \frac{1}{1 + e^{-(\theta_0 + \theta_1 C_S + \theta_2 C_O + \theta_3 C_M)}}. \tag{3.3} $$
Logistic regression is a type of predictive model that can be used when the target
variable is a categorical variable with two categories, which is quite suitable for the
domain classification problem in this study. In addition, the output of the logistic
function has a value in the range from 0.0 to 1.0, which can be used directly to
represent an RD speech probability.
Finally, given a threshold δ, speech $s$ with an MSC measure higher than δ is treated as RD speech. The belief modules $B_S$, $B_O$, and $B_M$ are also used for calculating $C_S$, $C_O$, and $C_M$, each of which is described as follows.
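A minimal Python sketch of Eq. (3.3), using as default values the optimized weights reported later in Section 3.4.2; the three confidence values are assumed to come from Eqs. (3.4), (3.5), and (3.7):

    # Sketch of the MSC measure of Eq. (3.3): a logistic function of the
    # weighted confidence measures. Default weights are those of Section 3.4.2.
    import math

    def msc(c_s: float, c_o: float, c_m: float,
            theta=(5.9, 0.00011, 0.053, 0.74)) -> float:
        """Return P(domain = RD | s, O, q), a value in (0, 1)."""
        t0, t1, t2, t3 = theta
        return 1.0 / (1.0 + math.exp(-(t0 + t1 * c_s + t2 * c_o + t3 * c_m)))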
Speech Confidence Measure
Speech confidence measure $C_S$ is used to evaluate the reliability of the recognized word sequence $z$. It is calculated by dividing the likelihood of $z$ by the likelihood of the maximum-likelihood phoneme sequence under phoneme network $G_p$, and it is written as

$$ C_S(s, z) = \frac{1}{n(s)} \log \frac{P(s \mid z)}{\max_{u \in L(G_p)} P(s \mid u)}, \tag{3.4} $$
where $n(s)$ denotes the analysis frame length of the input speech, $P(s \mid z)$ denotes the likelihood of $z$ for input speech $s$ and is given by a part of $B_S$, $u$ denotes a phoneme sequence, and $L(G_p)$ denotes the set of possible phoneme sequences accepted by phoneme network $G_p$. $C_S$ takes a greater value for speech that matches the robot command grammar $G_r$ than for speech that does not.
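In implementation terms, Eq. (3.4) reduces to a frame-normalized difference of two log-likelihoods; the sketch below assumes both values are already available from the recognizer:

    # Sketch of the speech confidence measure of Eq. (3.4). The two
    # log-likelihoods are assumed to be produced by the speech recognizer:
    # that of the recognized word sequence z and that of the best
    # unconstrained phoneme sequence through the phoneme network Gp.
    def speech_confidence(loglik_z: float, loglik_best_phoneme: float,
                          n_frames: int) -> float:
        """Frame-normalized log-likelihood ratio; close to 0 for RD-like speech."""
        return (loglik_z - loglik_best_phoneme) / n_frames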
The speech confidence measure is conventionally used as a confidence measure for speech recognition [24]. The basic idea is to treat the likelihood of the most typical (maximum-likelihood) phoneme sequence for the input speech as a baseline. Based on this idea, the object and motion confidence measures are defined as follows.
Object Confidence Measure
Object confidence measure $C_O$ is used to evaluate the reliability that the output trajector $o_t$ and landmark $o_l$ are the objects referred to by $w_T$ and $w_L$. It is calculated by dividing the likelihood of visual features $o_{t,f}$ and $o_{l,f}$ by a baseline obtained from the likelihood of the most typical visual features under the object models of $w_T$ and $w_L$. In this study, the maximum probability densities of the Gaussian functions are
treated as these baselines. Then, the object confidence measure $C_O$ is written as

$$ C_O(o_{t,f}, o_{l,f}, w_T, w_L) = \log \frac{P(o_{t,f} \mid w_T)\, P(o_{l,f} \mid w_L)}{\max_{o_f} P(o_f \mid w_T)\, \max_{o_f} P(o_f \mid w_L)}, \tag{3.5} $$
where $P(o_{t,f} \mid w_T)$ and $P(o_{l,f} \mid w_L)$ denote the likelihoods of $o_{t,f}$ and $o_{l,f}$ and are given by $B_O$; $\max_{o_f} P(o_f \mid w_T)$ and $\max_{o_f} P(o_f \mid w_L)$ denote the maximum probability densities of the Gaussian functions; and $o_f$ denotes visual features in the object models.
For example, Figure 3.6(a) shows a physical situation under which a low object confidence measure was obtained for the input OOD speech "There is a red box." Here, the speech understanding process recognized the input speech as the word sequence "Raise red box." An action of the robot raising object 1 was then output (solid line), because no "red box" existed and object 1, which has the same color, was selected as the trajector. However, the visual features of object 1 were very different from those of "red box," resulting in a low value of $C_O$.

Figure 3.6: Example cases where the object and motion confidence measures are low: (a) input speech "There is a red box." recognized as [Raise red box.]; (b) input speech "Bring me that Chutotoro." recognized as [Move-away Chutotoro.]. These examples are selected from the raw data of the experimental results.
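The computation of Eq. (3.5) can be sketched as follows; the Gaussian object models and the feature vectors here are hypothetical toy values, and the baseline uses the fact that a Gaussian's density peaks at its mean:

    # Sketch of the object confidence measure of Eq. (3.5); the object
    # concept models and feature values are hypothetical toy parameters.
    import numpy as np
    from scipy.stats import multivariate_normal

    def object_confidence(o_t_f, o_l_f, model_t, model_l):
        """log of [ P(o_t,f|w_T) P(o_l,f|w_L) / (peak_T * peak_L) ]."""
        num = model_t.logpdf(o_t_f) + model_l.logpdf(o_l_f)
        # a Gaussian density is maximized at its mean, giving the baseline
        den = model_t.logpdf(model_t.mean) + model_l.logpdf(model_l.mean)
        return num - den

    # Toy 2-D visual feature concepts (hypothetical means/covariances):
    red_box = multivariate_normal(mean=[0.8, 0.2], cov=np.eye(2) * 0.01)
    big_box = multivariate_normal(mean=[0.5, 0.9], cov=np.eye(2) * 0.02)
    print(object_confidence([0.3, 0.6], [0.5, 0.9], red_box, big_box))  # low value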
Motion Confidence Measure
The motion confidence measure $C_M$ is used to evaluate the reliability that the output trajectory $\xi$ corresponds to $w_M$. It is calculated by dividing the likelihood of $\xi$ by a baseline obtained from the likelihood of the most typical
trajectory $\hat{\xi}$ for the motion model of $w_M$. In this study, $\hat{\xi}$ is written as

$$ \hat{\xi} = \operatorname*{argmax}_{\xi,\, o_p^{traj}} P(\xi \mid o_p^{traj}, o_{l,p}, w_M), \tag{3.6} $$

where $o_p^{traj}$ denotes the initial position of the trajector; $\hat{\xi}$ is obtained by treating $o_p^{traj}$ as a variable. The likelihood of $\hat{\xi}$ is the maximum output probability of the HMMs. In this study, I used the method proposed by [63] to obtain this probability. Unlike $\xi$, the trajector's initial position of $\hat{\xi}$ is unconstrained, so the likelihood of $\hat{\xi}$ is greater than or equal to that of $\xi$. Then, the motion confidence measure $C_M$ is written as

$$ C_M(\xi, w_M) = \log \frac{P(\xi \mid o_{t,p}, o_{l,p}, w_M)}{\max_{\xi,\, o_p^{traj}} P(\xi \mid o_p^{traj}, o_{l,p}, w_M)}, \tag{3.7} $$

where $P(\xi \mid o_{t,p}, o_{l,p}, w_M)$ denotes the likelihood of $\xi$ and is given by $B_M$.
For example, Figure 3.6(b) shows a physical situation under which a low motion confidence measure was obtained for the input OOD speech "Bring me that Chutotoro." Here, the speech understanding process recognized the input speech as the word sequence "Move-away Chutotoro." An action of the robot moving object 1 away from object 2 was then output (solid line). However, the typical trajectory of "move-away" is for one object to move away from another object close to it (dotted line). The trajectory of the output action was very different from this typical trajectory, resulting in a low value of $C_M$.
Figure 3.7: Some of the objects used in the experiments.
Optimization of Weights
I now consider the problem of estimating the weights $\Theta$. The $i$th training sample is given as the pair of input signal $(s^i, O^i, q^i)$ and teaching signal $d^i$. Thus, the training set $T_N$ contains $N$ samples:

$$ T_N = \{(s^i, O^i, q^i, d^i) \mid i = 1, \ldots, N\}, \tag{3.8} $$

where $d^i$ is 0 or 1, representing OOD speech or RD speech, respectively. The likelihood function is written as

$$ P(\mathbf{d} \mid \Theta) = \prod_{i=1}^{N} \big( C_{MS}(s^i, O^i, q^i) \big)^{d^i} \big( 1 - C_{MS}(s^i, O^i, q^i) \big)^{1-d^i}, \tag{3.9} $$

where $\mathbf{d} = (d^1, \ldots, d^N)$. $\Theta$ is optimized by maximum-likelihood estimation of Eq. (3.9) using Fisher's scoring algorithm [32].
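For a logistic model, Fisher's scoring coincides with iteratively reweighted least squares. A minimal sketch, assuming the training confidences are stacked in a matrix X (one row [C_S, C_O, C_M] per sample) with 0/1 labels d:

    # Minimal sketch of maximum-likelihood estimation of Theta (Eq. 3.9)
    # by Fisher's scoring (iteratively reweighted least squares).
    import numpy as np

    def fit_theta(X, d, n_iter=25):
        """Return [theta0, theta1, theta2, theta3] for the logistic model (3.3)."""
        Z = np.hstack([np.ones((len(X), 1)), X])   # prepend a bias column
        theta = np.zeros(Z.shape[1])
        for _ in range(n_iter):
            p = 1.0 / (1.0 + np.exp(-Z @ theta))   # current C_MS values
            w = p * (1.0 - p)                      # Fisher information weights
            grad = Z.T @ (d - p)                   # score vector (gradient)
            info = Z.T @ (Z * w[:, None])          # expected information matrix
            theta += np.linalg.solve(info, grad)   # scoring update
        return theta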
3.4 Experiments
3.4.1 Experimental Setting
I first evaluated the performance of MSC. This evaluation was performed by
an off-line experiment by simulation where gaze tracking is not used, and speech
is extracted manually without the GMM based VAD to avoid its detection errors.
The weighting set Θ and the threshold δ were also optimized in this experiment.
Figure 3.8: Examples for each of the ten kinds of motions used in the experiments: raise, put-down*, place-on*, place-on the right side, place-on the middle, place-on the left side, jump-over, rotate, move-closer*, and move-away*. "*" means that synonymous verbs are given in the lexicon for this motion.
Then I performed an on-line experiment with the robot to evaluate the whole
system.
The robot lexicon L used in both experiments has 50 words, including 26 nouns
and 5 adjectives representing 40 objects, and 19 verbs representing ten kinds of
motions. Figure 3.7 shows some of the objects used in the experiments. Figure 3.8
shows examples of each motion; the solid line in each example represents the motion trajectory. L also includes five Japanese postpositions. Unlike the other words in L, the postpositions are not associated with concepts; using them, users can speak commands in a more natural way. The parameter set Γ in Eq. (3.1) was $\gamma_1 = 1.00$, $\gamma_2 = 0.75$, $\gamma_3 = 1.03$, $\gamma_4 = 0.56$, and $\gamma_5 = 1.88$.
Table 3.1: Examples of the speech spoken in the experiments.

  RD speech                     OOD speech
  Move-away Grover.             Good morning.
  Place-on Kermit small box.    How about lunch?
  Rotate Chutotoro.             There is a big Barbazoo.
  Raise red Elmo.               Let's do an experiment.
The speech detection algorithm was run on a Dell Precision 690 workstation with an Intel Xeon 2.66 GHz CPU and 4 GB of memory for speech understanding and the calculation of the MSC measure. In the on-line experiment, I added a Dell Precision T7400 workstation with an Intel Xeon 3.2 GHz CPU and 4 GB of memory for image processing and gaze tracking.
3.4.2 Off-line Experiment by Simulation
Setting
The off-line experiment was conducted under both clean and noisy conditions
using a set of pairs of speech s and scene information (O, q). Figure 3.6(a) shows
an example of scene information. The yellow box on object 3 represents the behav-
ioral context q, which means object 3 was manipulated most recently. I prepared
160 different such scene files, each of which included three objects on average. I
also prepared 160 different speech samples (80 RD speech and 80 OOD speech)
and paired them with the scene files. The RD speech samples included words
that represent 40 kinds of objects and ten kinds of motions, which were learned
beforehand in lexicon L. Each RD and OOD speech sample included 2.8 and 4.1
words on average, respectively. Table 3.1 shows examples of the speech spoken
in the experiment. In addition, the correct motion phrase, trajectory, and landmark object were given for each RD speech-scene pair. I then recorded the speech samples under both clean and noisy conditions as follows.
• Clean condition: I recorded the speech in a soundproof room without noise.
A subject sat on a chair one meter from the SANKEN CS-3e directional
microphone and read out a text in Japanese.
• Noisy condition: I added dining hall noise, at a level of 50 to 52 dBA, to each speech recording made under the clean condition.
I gathered speech recordings from 16 subjects, eight male and eight female. All subjects were native Japanese speakers and were instructed to speak naturally, as if they were speaking to another human listener. As a result, 16 sets of speech-scene pairs were obtained, each of which included 320 pairs (160 for the clean and 160 for the noisy condition). These pairs were input into the system. For each pair, speech understanding was performed first, and then the MSC measure was calculated. During speech understanding, a Gaussian mixture model based noise suppression method [14] was applied, and ATRASR [46] was used for phoneme and word sequence recognition. With ATRASR, phoneme recognition accuracies of 83% and 67% were obtained under the clean and noisy conditions, respectively.
The evaluation under the clean condition was performed by leave-one-out cross-validation: 15 subjects' data were used as the training set to learn the weighting Θ in Eq. (3.3), the remaining subject's data were used as the test set, and this was repeated 16 times. The cross-validation evaluated the generalization performance across speakers. The average values of the weighting Θ learned from the training sets in cross-validation were used for the evaluation under the noisy condition, where all noisy speech-scene pairs collected from the 16 subjects were treated as the test set.
System performance was evaluated by recall and precision rates, defined as follows:

$$ \mathrm{Recall} = \frac{N_{cor}}{N_{total}}, \tag{3.10} $$

$$ \mathrm{Precision} = \frac{N_{cor}}{N_{det}}, \tag{3.11} $$

where $N_{cor}$ denotes the number of RD speech samples correctly detected, $N_{total}$ denotes the total number of RD speech samples, and $N_{det}$ denotes the total number of speech samples detected as RD speech by the MSC measure.
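Sweeping the threshold δ over the MSC scores and applying Eqs. (3.10) and (3.11) at each value yields precision-recall curves such as those in Fig. 3.9. A small sketch with toy scores and labels:

    # Sketch of Eqs. (3.10)-(3.11); `scores` and `labels` are toy placeholders
    # for the per-utterance MSC values and the RD (1) / OOD (0) ground truth.
    import numpy as np

    def precision_recall(scores, labels, delta):
        detected = scores > delta                     # detected as RD speech
        n_cor = np.sum(detected & (labels == 1))      # correctly detected RD
        recall = n_cor / np.sum(labels == 1)          # Eq. (3.10)
        precision = n_cor / max(np.sum(detected), 1)  # Eq. (3.11)
        return precision, recall

    scores = np.array([0.93, 0.40, 0.85, 0.10])
    labels = np.array([1, 0, 1, 0])
    for delta in (0.5, 0.79):
        print(delta, precision_recall(scores, labels, delta))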
Finally, for comparison, four cases were evaluated for RD speech detection, using: (1) the speech confidence measure only, (2) the speech and object confidence measures, (3) the speech and motion confidence measures, and (4) the MSC measure.
I also evaluated speech understanding using the RD speech-scene pairs. Any difference between the output motion phrase, trajectory, or landmark object and the given ones was treated as a speech understanding error.
Results
The average precision-recall curves for RD speech detection over the 16 subjects under the clean and noisy conditions are shown in Fig. 3.9. The performances of the four cases are shown as "Speech," "Speech + Object," "Speech + Motion," and "MSC." From the figures, I found that (1) MSC outperformed all the others under both clean and noisy conditions, and (2) both the object and motion confidence measures helped to improve performance. The average maximum F-measures under the clean and noisy conditions are shown in Fig. 3.10. Compared with the speech confidence measure alone, MSC achieved absolute increases of 5% and 12% under the clean and noisy conditions, respectively, indicating that MSC was particularly effective under the noisy condition. I also performed paired t-tests. Under the clean condition, there were statistically significant differences between (1) Speech and Speech + Object (p < 0.01), (2) Speech and Speech + Motion (p < 0.05), and (3) Speech and MSC (p < 0.01). Under the noisy condition, there were statistically significant differences (p < 0.01) between Speech and all other cases.
Figure 3.9: Average precision-recall curves obtained in the off-line experiment under (a) clean and (b) noisy conditions, for Speech, Speech + Object, Speech + Motion, and MSC (Speech + Object + Motion).

Examples of the raw data of the experimental results are shown in Fig. 3.6 and Fig. 3.11. The examples in Fig. 3.6 are for OOD speech and were explained in Section 3.3.2. The examples in Fig. 3.11 are for the RD speech "Place-on Elmo big box" and "Jump-over Barbazoo Totoro"; these utterances were successfully detected by the MSC measure. The processing times spent on the speech understanding process and the MSC-based domain classification were 1.09 s and 1.36 s for the examples shown in Figures 3.6(a) and 3.6(b), respectively, and 1.39 s and 1.36 s for the examples shown in Figures 3.11(a) and 3.11(b), respectively.
Figure 3.10: Average maximum F-measures obtained in the off-line experiment under (a) clean and (b) noisy conditions.
These times indicate that the proposed method can respond quickly enough for practical real-time human-robot interaction. Table 3.2 shows the means and variances of the weighted confidence measures for all RD and OOD speech obtained under the noisy condition. Notice that the variances of $C_O$ and $C_M$ are large for OOD speech, which means it is difficult to perform RD speech detection using $C_O$ or $C_M$ alone.
In the experiment, weight Θ and threshold δ were optimized under the clean
condition. The optimized Θ were $\theta_0 = 5.9$, $\theta_1 = 0.00011$, $\theta_2 = 0.053$, and $\theta_3 = 0.74$. The optimized δ was 0.79, which maximized the average F-measure. This means that speech with an MSC measure higher than 0.79 is treated as RD speech, and the robot executes an action according to it. The above Θ and δ were used in the on-line experiment.

Figure 3.11: Examples selected from the raw data of the experiment: (a) "Place-on Elmo big box"; (b) "Jump-over Barbazoo Totoro".

Table 3.2: Means and variances of the weighted confidence measures for all RD and OOD speech obtained under the noisy condition.

             |         RD          |        OOD
             |  CS     CO    CM    |  CS     CO    CM
  Means      | −0.71  −0.88 −0.30  | −3.8   −6.0  −3.3
  Variances  |  1.1    0.55  0.72  |  6.4    130    23
Finally, the accuracies of speech understanding using all RD speech and RD
speech detected with the proposed method are shown in Table 3.3, where “Total”
and “Detected” represent all RD speech and the detected RD speech, respectively,
and “Clean” and “Noisy” represent clean and noisy conditions, respectively.
Table 3.3: Accuracy of RD speech understanding.

           Total     Detected
  Clean    99.8%     100%
  Noisy    96.3%     98.9%
Figure 3.12: Example of on-line experiment.
3.4.3 On-line Experiment Using the Robot
Setting
In the on-line experiment, the whole system was evaluated by using the robot.
In each session of the experiment, two subjects, an “operator” and a “ministrant,”
sat in front of the robot at a distance of about one meter from the microphone.
The operator ordered the robot to manipulate objects in Japanese. He was also
allowed to chat freely with the ministrant. Fig. 3.12 shows an example of this
experiment. The threshold η of gaze tracking was set to 0.5, which means that if the proportion of the operator's gaze at the robot during input speech was higher than 50%, the robot judged that the speech was made while the operator was looking
at it.
I conducted a total of 4 sessions of this experiment using 4 pairs of subjects, and
each session lasted for about 50 minutes. All subjects were adult males. As with
the off-line experiment, the subjects were instructed to speak to the robot as if they were speaking to another human listener. There was constant surrounding noise of about 48 dBA from the robot's power module in all sessions. For comparison, five cases were evaluated for RD speech detection, using (1) gaze only, (2) gaze and the speech confidence measure, (3) gaze and the speech and object confidence measures, (4) gaze and the speech and motion confidence measures, and (5) gaze and the MSC measure.

Table 3.4: Numbers of speech productions in the on-line experiment.

           With gaze   Without gaze   Total
  RD          155            10        165
  OOD         553           265        818
  Total       708           275        983
Results
During the experiment, a total of 983 speech productions were made, each of which was manually labeled as either RD or OOD. Their numbers are shown in Table 3.4: "With gaze" and "Without gaze" give the numbers of speech productions made while the operator was or was not looking at the robot, and "RD"/"OOD" give the numbers of RD/OOD speech productions. Aside from the RD speech, a lot of OOD speech was also made while the subjects were looking at the robot (see "With gaze" in Table 3.4).
The accuracies of speech understanding were 97.6% and 98.1% for all RD speech and the detected RD speech, respectively. The average recall and precision rates for RD speech detection are shown in Fig. 3.13. The performances of the five cases are shown as "Gaze," "Gaze + Speech," "Gaze + Speech + Object," "Gaze + Speech + Motion," and "Gaze + MSC." Using gaze only, an average recall rate of 94% was obtained (see the "Gaze" column in Fig. 3.13(a)), which means that almost all of the RD speech was made while the operator was looking at the robot. The recall rate dropped to 90% when gaze was integrated with the speech
confidence measure; that is, some RD speech was mistakenly rejected by the speech confidence measure. However, when gaze was integrated with MSC, the recall rate returned to 94%, because the mis-rejected RD speech was correctly detected by MSC. In Fig. 3.13(b), the average precision rate using gaze only was 22%. Using MSC, however, the OOD speech made while looking at the robot was correctly rejected, resulting in a high precision rate of 96%. This means the proposed method is particularly effective in situations where users make a lot of OOD speech while looking at a robot.

Figure 3.13: Average recall and precision rates obtained in the on-line experiment: (a) recall rates; (b) precision rates.
3.5 Discussion
3.5.1 Use in a Real-World Environment

Although the proposed method was evaluated in our laboratory, I consider that it could be used in real-world environments, because the speech understanding method it relies on is adaptable to different environments. In some cases, however, physical conditions can change dynamically; for example, lighting conditions may change suddenly due to sunlight. The development of a method that works robustly under such variable conditions is future work.
3.5.2 Extended Applications
This study can be extended in many ways; I mention some of them here. I evaluated the MSC measure in situations where users usually order the robot while looking at it. In some situations, however, users may order a robot without looking at it. For example, in an object manipulation task where a robot manipulates objects together with a user, the user may give an order while looking at the object he is manipulating instead of at the robot itself. For such tasks, the MSC measure should be used separately, without integrating it with gaze. Therefore, a method that automatically decides whether to use the gaze information, according to the task and user situation, should be implemented.

Moreover, aside from the object manipulation task, the MSC measure can also be extended to multi-task dialogs that include both physically grounded and ungrounded tasks. In physically ungrounded tasks, users' utterances refer to no immediate physical objects or motions. For such dialogs, a method that automatically switches between the speech confidence and MSC measures should be implemented. In future work, I will evaluate the MSC measure on various dialog tasks.
In addition, MSC can be used to develop an advanced interface for human-robot interaction. The RD speech probability represented by MSC can be used to provide feedback such as the utterance "Did you speak to me?"; this feedback should be made in situations where the MSC measure has an ambiguous value. Moreover, the object and motion confidence measures can each be used separately. For example, if the object confidence measures for all objects in the robot's vision are particularly low, the robot should actively explore its surroundings to search for a feasible object; and in situations where the motion confidence measure is particularly low, an utterance such as "I cannot do that" should be made.

Finally, in this study, I evaluated the MSC measure obtained by integrating the speech, object, and motion confidence measures. In addition, a confidence measure obtained from the motion-object relationship could be considered. Evaluating the effect of this confidence measure is left for future work.
3.6 Summary
This chapter described an RD speech detection method that enables a robot to distinguish the speech to which it should respond in an object manipulation task, by combining speech, visual, and behavioral context information with human gaze. The remarkable feature of the method is the introduction of the MSC measure, which evaluates the feasibility, under the current physical situation, of the action the robot is about to execute according to the user's speech. The experimental results clearly showed that the method is very effective and provides an essential function for natural and safe human-robot interaction. Finally, I would emphasize that the basic idea adopted in the method is applicable to a broad range of human-robot dialog tasks.
Chapter 4 Conclusion
This study addressed two crucial problems in building a flexible speech interface between humans and machines: (1) learning the phoneme sequences of OOV words, and (2) detecting the target of utterances. It described the two methods I proposed to solve these problems. An important contribution of this study is that it is especially beneficial for robotic speech interfaces: both of the proposed methods can be implemented as sub-modules for robots.
First, I proposed IPU, an interactive learning method for obtaining the phoneme sequences of OOV words. This method was demonstrated to perform well and to be user-friendly. It enables robots to automatically extend their vocabularies through interactions with users.
Next, I proposed MSC, a multimodal method for detecting the targets of utterances. This method was demonstrated to be very effective and to adapt well to noisy conditions. It enables a robot to reject utterances that are not directed to it in an object manipulation task, which provides convenient and safe interactions between users and robots. Moreover, beyond utterance target detection, the basic idea adopted in MSC is applicable to a broad range of human-robot dialog tasks.
Furthermore, in this study, utterance target detection by MSC is based on the integration of information obtained from speech, images, and motion. Technology that integrates multimodal information in a single framework is especially important for robots, since robots are usually equipped with a variety of sensors that monitor the environment through various channels, such as audio, video, and touch. This study provides a new, realistic method for this purpose. I have demonstrated that the integration of multimodal information is valid for robot-directed speech detection in an object manipulation task. Beyond this task, I believe that integrating multimodal information is also crucial for many other tasks, such as context-based speech understanding and human behavior understanding.
Much work remains to be done to improve the flexibility of speech interfaces, and some of it is worth mentioning. One limitation of this study is that neither of the proposed methods allows users to speak freely: users must obey the pre-defined grammar. To improve flexibility, it is desirable for speech interfaces to deal with arbitrary utterances. Moreover, speech interfaces are expected to understand user intentions from user behaviors and utterances. To fulfill these expectations, techniques such as user behavior modeling, together with knowledge from psychology and cognitive science, should be utilized in building speech interfaces.

Another limitation is that both of the proposed methods were evaluated only with limited test data, under short-term human-machine/robot interactions. It would be highly interesting to know how the proposed methods perform under long-term interactions; their evaluation in long-term interactions is left for future work.
Acknowledgment
I express my sincere gratitude to my supervisor, Professor Natsuki Oka, for providing me an invaluable opportunity as a Ph.D. student in his laboratory. I thank Drs. Naoto Iwahashi, Mikio Nakano, and Kotaro Funakoshi for their guidance, encouragement, and discussions, from which my research and my student life greatly benefited. I am also very grateful to Associate Professors Masahiro Araki and Tomoyuki Ozeki for their valuable advice about my research. I also extend thanks to the members of the Interactive Intelligence Lab, Kyoto Institute of Technology, for their cooperation with my experiments. Finally, I thank my wife, Yiyan, for her understanding and love over the past few years; her support and encouragement made this thesis possible.
References
[1] iPhone 4S – Ask Siri to help you get things done. http://www.apple.com/
iphone/features/siri.html, Apple, Retrieved 2011-10-05.
[2] H. Asoh, T. Matsui, J. Fry, F. Asano, and S. Hayamizu. A spoken dialog
system for a mobile robot. In Proceedings of the fifth European Conference on
Speech Communication and Technology (Eurospeech – 1999), pp. 1139–1142,
1999.
[3] M. Attamimi, A. Mizutani, T. Nakamura, K. Sugiura, T. Nagai, N. Iwahashi,
H. Okada, and T. Omori. Learning novel objects using out-of-vocabulary word
segmentation and object extraction for home assistant robots. In Proceed-
ings of the 2010 IEEE International Conference on Robotics and Automation
(ICRA – 2010), pp. 745–750, 2010.
[4] C. Bael, L. Boves, H. Heuvel, and H. Strik. Automatic phonetic transcription
of large speech corpora. Computer Speech and Language, Vol. 21, No. 4, pp.
652–668, 2007.
[5] D. Bansal, N. Nair, R. Singh, and B. Raj. A joint decoding algorithm for
multiple-example-based addition of words to a pronunciation lexicon. In Pro-
ceedings of the 34th International Conference on Acoustics, Speech, and Signal
Processing (ICASSP – 2009), pp. 4293–4296, 2009.
[6] I. Bazzi and J. Glass. A multi-class approach for modelling out-of-vocabulary
words. In Proceedings of the third International Conference on Spoken Lan-
guage Processing (Interspeech – ICLSP – 2002), pp. 1613–1616, 2002.
[7] S. Y. Chang, L. Shastri, and S. Greenberg. Automatic phonetic transcription
of spontaneous speech (American English). In Proceedings of the 25th Inter-
national Conference on Acoustics, Speech, and Signal Processing (ICASSP –
2000), pp. 330–333, 2000.
[8] C. Chelba, J. Schalkwyk, T. Brants, V. Ha, B. Harb, W. Neveitt, C. Parada,
and P. Xu. Query language modeling for voice search. In Proceedings of
the third IEEE Workshop on Spoken Language Technology (SLT – 2010), pp.
127–132, 2010.
[9] G. Chung, S. Seneff, and C. Wang. Automatic acquisition of names using
speak and spell mode in spoken dialogue systems. In Proceedings of the North
American Chapter of the Association for Computational Linguistics – Human
Language Technologies Conference (NAACL – HLT – 2003), pp. 32–39, 2003.
[10] P. Ding, L. He, X. Yan, R. Zhao, and J. Hao. Robust mandarin speech recogni-
tion in car environments for embedded navigation system. IEEE Transactions
on Consumer Electronics, Vol. 54, No. 2, pp. 584–590, 2008.
[11] M. Eck, I. Lane, Y. Zhang, and A. Waibel. Jibbigo: speech-to-speech transla-
tion on mobile devices. In Proceedings of the third IEEE Workshop on Spoken
Language Technology (SLT – 2010), pp. 165–166, 2010.
[12] J. M. Elvira and J. C. Torrecilla. Name dialing using final user defined vocab-
ularies in mobile (GSM and TACS) and fixed telephone networks. In Pro-
ceedings of the 23rd International Conference on Acoustics, Speech and Signal
Processing (ICASSP – 1998), pp. 849–852, 1998.
[13] E. Filisko and S. Seneff. Developing city name acquisition strategies in spoken
dialogue systems via user simulation. In Proceedings of the sixth ACL/ISCA
SIGdial Workshop on Discourse and Dialogue (SIGdial – 2005), pp. 144–155,
2005.
[14] M. Fujimoto and S. Nakamura. Sequential non-stationary noise tracking using
particle filtering with switching dynamical system. In Proceedings of the
31st International Conference on Acoustics, Speech, and Signal Processing
(ICASSP – 2006), Vol. 2, pp. 769–772, 2006.
[15] A.L. Gorin, H. Hanek, R. Rose, and L. Miller. Automated call routing in
a telecommunications network. In Proceedings of the IEEE Workshop on
Interactive Voice Technology for Telecommunications Applications (IVTTA –
1994), pp. 137–140, 1994.
[16] A.L. Gorin, G. Riccardi, and J.H. Wright. How may I help you? Speech
Communication, Vol. 23, pp. 113–127, 1997.
[17] H. Holzapfel, D. Neubig, and A. Waibel. A dialogue approach to learning ob-
ject descriptions and semantic categories. Robotics and Autonomous Systems,
Vol. 56, pp. 1004–1013, 2008.
[18] D. W. Hosmer and S. Lemeshow. Applied Logistic Regression. Wiley-
Interscience, 2009.
[19] C. T. Ishi, S. Matsuda, T. Kanda, T. Jitsuhiro, H. Ishiguro, S. Nakamura,
and N. Hagita. Robust speech recognition system for communication robots
in real environments. In Proceedings of the sixth IEEE-RAS International
Conference on Humanoid Robots, pp. 340–345, 2006.
[20] N. Iwahashi. A method for the coupling of belief systems through human-robot
language interaction. In Proceedings of the 2003 IEEE International Work-
shop on Robot and Human Interactive Communication, pp. 385–390, 2003.
[21] N. Iwahashi. Robots that learn language: A developmental approach to sit-
uated human-robot conversations. Human-Robot Interaction, pp. 95–118,
2007.
[22] N. Iwahashi. Interactive learning of spoken words and their meanings through
an audio-visual interface. IEICE Transactions on Information and Systems,
Vol. E91-D, No. 2, pp. 312–321, 2008.
[23] A. Janicki and D. Wawer. Automatic speech recognition for polish in a com-
puter game interface. In Proceedings of the Federated Conference on Com-
puter Science and Information Systems (FedCSIS – 2011), pp. 711–716, 2011.
[24] H. Jiang. Confidence measures for speech recognition: A survey. Speech
Communication, Vol. 45, pp. 455–470, 2005.
[25] D. Jurafsky, W. Ward, J. P. Zhang, K. Herold, X. Y. Yu, and S. Zhang.
What kind of pronunciation variation is hard for triphones to model? In
Proceedings of the 26th International Conference on Acoustics, Speech, and
Signal processing (ICASSP – 2001), pp. 577–580, 2001.
[26] T. Kagoshima. ToSpeak: high-quality text-to-speech system. Toshiba review
(Japanese Edition), Vol. 62, No. 12, pp. 34–37, 2007.
[27] S. Katagiri, B. H. Juang, and C. H. Lee. Pattern recognition using a family
of design algorithms based upon the generalized probabilistic descent method.
In Proceedings of the IEEE, Vol. 86, pp. 2345–2373, 1998.
[28] T. Kawahara, K. Ishizuka, S. Doshita, and C. H. Lee. Speaking-style de-
pendent lexicalized filler model for key-phrase detection and verification. In
Proceedings of the IEEE International Conference on Spoken Language Pro-
cessing, pp. 3253–3259, 1998.
[29] D. Knowles and Z. Ghahramani. Infinite sparse factor analysis and infinite
independent components analysis. In Proceedings of the 7th International
Conference on Independent Component Analysis and Signal Separation, pp.
381–388, 2007.
[30] D. B. Koons, C. J. Sparrell, and K. R. Thorisson. Integrating simultaneous
input from speech, gaze, and hand gestures. Intelligent Multimedia Interfaces
American Association for Artificial Intelligence, pp. 257–276, 1993.
[31] A. Kurematsu, K. Takeda, Y. Sagisaka, S. Katagiri, H. Kuwabara, and
K. Shikano. ATR Japanese speech database as a tool of speech recognition
and synthesis. Speech Communication, Vol. 9, No. 4, pp. 357–363, 1990.
[32] T. Kurita. Iterative weighted least squares algorithms for neural networks
classifiers. In Proc. Workshop on Algorithmic Learning Theory, 1992.
[33] S. Lang, M. Kleinehagenbrock, S. Hohenner, J. Fritsch, G. A. Fink, and
G. Sagerer. Providing the basis for human-robot-interaction: A multi-modal
attention system for a mobile robot. In Proceedings of the ACM International
Conference on Multimodal Interfaces, pp. 28–35, 2003.
[34] A. Lee, K. Kawahara, and K. Shikano. A new phonetic tied-mixture model
for efficient decoding. In Proceedings of the 26th International Conference on
Acoustics, Speech, and Signal Processing (ICASSP – 2001), pp. 1269–1272,
2001.
[35] A. Lee, K. Nakamura, R. Nishimura, H. Saruwatari, and K. Shikano. Noise
robust real world spoken dialogue system using GMM based rejection of unin-
tended inputs. In Proceedings of the fifth International Conference on Spoken
Language Processing (Interspeech – ICLSP – 2004), pp. 173–176, 2004.
[36] C. J. Leggetter and P. C. Woodland. Maximum likelihood linear regression for
speaker adaptation of continuous density hidden Markov models. Computer
Speech and Language, Vol. 9, pp. 171–185, 1995.
[37] C. Leitner, M. Schickbichler, and S. Petrik. Example-based automatic pho-
netic transcription. In Proceedings of the seventh Conference on International
Language Resources and Evaluation (LREC – 2010), pp. 3278–3284, 2010.
[38] E. Levin, R. Pieraccini, and W. Eckert. Using Markov decision process for
learning dialogue strategies. In Proceedings of the 23rd International Con-
ference on Acoustics, Speech, and Signal Processing (ICASSP – 1998), pp.
201–204, 1998.
[39] Y. Liu and P. Fung. Modeling partial pronunciation variations for spontaneous
mandarin speech recognition. Computer Speech and Language, Vol. 17, No. 4,
pp. 357–379, 2003.
[40] Y. Liu, F. Zheng, L. He, and Y. Q. Xia. State-dependent mixture tying with
variable codebook size for accented speech recognition. In Proceedings of the
IEEE Automatic Speech Recognition and Understanding Workshop (ASRU –
2007), pp. 300–305, 2007.
[41] X. Luo and F. Jelinek. Probabilistic classification of HMM states for large
vocabulary continuous speech recognition. In Proceedings of the 24th Inter-
national Conference on Acoustics, Speech, and Signal Processing (ICASSP –
1999), pp. 353–356, 1999.
[42] T. Misu, K. Sugiura, T. Kawahara, K. Ohtake, C. Hori, H. Kashioka,
H. Kawai, and S. Nakamura. Modeling spoken decision support dialogue and
optimization of its dialogue strategy. ACM Transactions on Speech and Lan-
guage Processing, Vol. 7, No. 3, pp. 1–18, 2011.
[43] A. Mohamed, G. Dahl, and G. Hinton. Deep belief networks for phone recogni-
tion. In Proceedings of the 22nd Neural Information Processing Systems Con-
ference Workshop on Deep Learning for Speech Recognition (NIPS – 2009),
Vol. 22, pp. 1–9, 2009.
[44] B. Mutlu, T. Shiwa, T. Kanda, H. Ishiguro, and N. Hagita. Footing in human-
robot conversations: how robots might shape participant roles using gaze cues.
In Proceedings of the ACM/IEEE International Conference on Human-Robot
Interaction, pp. 61–68, 2009.
[45] S. Nakagawa. Spontaneous speech recognition: its challenge and limit. In
Proceedings of the IEICE General Conference (Japanese Edition), Vol. 1, pp.
13–14, 2006.
[46] S. Nakamura, K. Markov, H. Nakaiwa, G. Kikui, H. Kawai, T. Jitsuhiro,
J. Zhang, H. Yamamoto, E. Sumita, and S. Yamamoto. The ATR multilingual
speech-to-speech translation system. IEEE transactions on Audio, Speech,
and Language Processing, Vol. 14, No. 2, pp. 365–376, 2006.
[47] M. Nakano, N. Iwahashi, T. Nakai, T. Sumii, X. Zuo, R. Taguchi, T. Nose,
A. Mizutani, T. Nakamura, M. Attamimi, H. Narimatsu, K. Funakoshi, and
Y. Hasegawa. Grounding new words on the physical world in multi-domain
human-robot dialogues. In Proceedings of the National Conference on Ar-
tificial Intelligence (AAAI) Fall Symposium Series: Dialog with Robots, pp.
74–79, 2010.
[48] H. Nanjo, H. Mikami, S. Kunimatsu, H. Kawano, and T. Nishiura. A funda-
mental study of novel speech interface for computer games. In Proceedings of
the 13th IEEE International Symposium on Consumer Electronics, pp. 558–
560, 2009.
[49] K. Ohtake, T. Misu, C. Hori, H. Kashioka, and S. Nakamura. Dialogue acts
annotation for NICT Kyoto tour dialogue corpus to construct statistical di-
alogue systems. In Proceedings of the seventh International Conference on
Language Resources and Evaluation (LREC – 2010), pp. 2123–2130, 2010.
[50] C. Parada, M. Dredze, D. Filimonov, and F. Jelinek. Contextual information
improves OOV detection in speech. In Proceedings of the North American
Chapter of the Association for Computational Linguistics – Human Language
Technologies Conference (NAACL – HLT – 2010), pp. 216–224, 2010.
[51] D. Peters and P. Stubley. Dialog methods for improved alphanumeric string
capture. In Proceedings of the 12th International Conference on Spoken Lan-
guage Processing (Interspeech – ICLSP – 2011), pp. 1017–1020, 2011.
[52] C. S. Ramalingam, Y. Gong, L. P. Netsch, W. W. Anderson, J. J. Godfrey, and
Y. H. Kao. Speaker-dependent name dialing in a car environment with out-
of-vocabulary rejection. In Proceedings of the 24th International Conference
on Acoustics, Speech, and Signal Processing (ICASSP – 1999), pp. 165–168,
1999.
[53] A. Rastrow, A. Sethy, and B. Ramabhadran. A new method for OOV detec-
tion using hybrid word/fragment system. In Proceedings of the 34th Inter-
national Conference on Acoustics, Speech and Signal Processing (ICASSP –
2009), pp. 3953–3956, 2009.
[54] D. Roy and N. Mukherjee. Towards situated speech understanding: Visual
context priming of language models. Computer Speech and Language, pp.
227–248, 2005.
[55] H. Sakoe. Two-level DP-matching – a dynamic programming based pattern
matching algorithm for continuous speech recognition. IEEE transactions on
Acoustic, Speech, and Signal Processing, Vol. ASSP-27, No. 6, pp. 588–595,
1979.
[56] S. Seneff. Response planning and generation in the mercury flight reservation
system. Computer Speech and Language, Vol. 16, pp. 283–312, 2002.
[57] M. Shami and W. Verhelst. An evaluation of the robustness of existing super-
vised machine learning approaches to the classification of emotions in speech.
Speech Communication, Vol. 49, No. 3, pp. 201–212, 2007.
[58] F. K. Soong, W. K. Lo, and S. Nakamura. Generalized word posterior proba-
bility (GWPP) for measuring reliability of recognized words. In Proceedings of
the Special Workshop in Maui (SWIM – 2004), 2004.
71
[59] H. Sun, G. L. Zhang, F. Zheng, and M. X. Xu. Using word confidence mea-
sure for OOV words detection in a spontaneous spoken dialog system. In
Proceedings of the eighth European Conference on Speech Communication and
Technology (Eurospeech – 2003), pp. 2713–2716, 2003.
[60] T. Svendsen, F. K. Soong, and H. Purnhagen. Optimizing baseforms for
HMM-based speech recognition. In Proceedings of the second European Con-
ference on Speech Communication and Technology (Eurospeech – 1995), pp.
783–787, 1995.
[61] W. Swartout, D. Traum, R. Artstein, D. Noren, P. Debevec, K. Bronnenkant,
J. Williams, A. Leuski, S. Narayanan, D. Piepol, C. Lane, J. Morie, P. Ag-
garwal, M. Liewer, J. Y. Chiang, J. Gerten, S. Chu, and K. White. Virtual
museum guides demonstration. In Proceedings of the third IEEE Workshop
on Spoken Language Technology (SLT – 2010), pp. 163–164, 2010.
[62] T. Takiguchi, A. Sako, T. Yamagata, and Y. Ariki. System request utter-
ance detection based on acoustic and linguistic features. Speech Recognition,
Technologies and Applications, pp. 539–550, 2008.
[63] K. Tokuda, T. Kobayashi, and S. Imai. Speech parameter generation from
HMM using dynamic features. In Proceedings of the 20th International Con-
ference on Acoustics, Speech, and Signal Processing (ICASSP – 1995), pp.
660–663, 1995.
[64] K. P. Truong and D. A. V. Leeuwen. Automatic discrimination between laugh-
ter and speech. Speech Communication, Vol. 49, No. 2, pp. 144–158, 2007.
[65] R. H. Umbach, P. Beyerlein, and E. Thelen. Automatic transcription of un-
known words in a speech recognition system. In Proceedings of the 20th In-
ternational Conference on Acoustics, Speech and Signal Processing (ICASSP
– 1995), pp. 840–843, 1995.
72
[66] A. Waibel, H. Soltau, T. Schultz, T. Schaaf, and F. Metze. Verbmobil: Foun-
dations of Speech-to-Speech Translation, chapter Multilingual Speech Recog-
nition, pp. 33–45. Springer, 2000.
[67] H. Wakaki, H. Fujii, M. Suzuki, M. Fukui, and K. Sumita. Abbreviation gen-
eration for Japanese multi-word expressions. In Proceedings of the Workshop
on Multiword Expressions: Identification, Interpretation, Disambiguation and
Applications, pp. 63–70, 2009.
[68] F. Wessel, R. Schluter, K. Macherey, and H. Ney. Confidence measures for
large vocabulary continuous speech recognition. IEEE transactions on Speech
and Audio Processing, Vol. 9, No. 3, pp. 288–298, 2001.
[69] K. Wittenburg, T. Lanning, D. Schwenke, H. Shubin, and A. Vetro. The
prospects for unrestricted speech input for TV content search. In Proceedings
of working conference on Advanced Visual Interfaces, pp. 352–359, 2006.
[70] M. Worsley and M. Johnston. Multimodal interactive spaces: MagicTV and
magicMAP. In Proceedings of the third IEEE Workshop on Spoken Language
Technology (SLT – 2010), pp. 161–162, 2010.
[71] H. Wu and H. F. Wang. Revisiting pivot language approach for machine
translation. In Proceedings of Joint Conference of the 47th Annual Meeting
of the Association for Computational Linguistics and the 4th International
Joint Conference on Natural Language Processing of the Asian Federation of
Natural Language Processing (ACL – IJCNLP – 2009), pp. 783–787, 2009.
[72] J. X. Wu and V. Gupta. Application of simultaneous decoding algorithms
to automatic transcription of known and unknown words. In Proceedings of
the 24th International Conference on Acoustics, Speech, and Signal Processing
(ICASSP – 1999), Vol. 2, pp. 589–592, 1999.
73
[73] X. Yan, L. He, P. Ding, R. Zhao, and J. Hao. Multi-accented mandarin
database construction and benchmark evaluations. In Proceedings of the 5th
International Symposium on Chinese Spoken Language Processing (ISCSLP –
2006), Vol. 2, pp. 715–723, 2006.
[74] A. Yates, O. Etzioni, and D. Weld. A reliable natural language interface to
household appliances. In Proceedings of the eighth International Conference
on Intelligent User Interfaces, pp. 189–196, 2003.
[75] A. Yazgan and M. Saraclar. Hybrid language models for out of vocabulary
word detection in large vocabulary conversational speech recognition. In Pro-
ceedings of the 29th International Conference on Acoustics, Speech, and Signal
Processing (ICASSP – 2004), pp. 745–748, 2004.
[76] T. Yonezawa, H. Yamazoe, A. Utsumi, and S. Abe. Evaluating crossmodal
awareness of daily-partner robot to user’s behaviors with gaze and utterance
detection. In Proceedings of the ACM International Workshop on Context-
Awareness for Self-Managing Systems, pp. 1–8, 2009.
[77] S.J. Young, M.G. Brown, J.T. Foote, G.J.F. Jones, and K. S. Jones. Acous-
tic indexing for multimedia retrieval and browsing. In Proceedings of the
22nd International Conference on Acoustics, Speech, and Signal Processing
(ICASSP – 1997), pp. 199–202, 1997.
[78] J. Zhang, J. Zhao, S. Bai, and Z. Huang. Applying speech interface to Mahjong
game. In Proceedings of the 10th International Conference on Multimedia
Modelling, pp. 86–92, 2004.
[79] R. Q. Zhang, H. Yamamoto, M. Paul, H. Okuma, K. Yasuda, Y. Lepage, E. De-
noual, D. Mochihashi, A. Finch, and E. Sumita. The NICT-ATR statistical
machine translation system for the IWSLT 2006 evaluation. In Proceedings
of the International Workshop on Spoken Language Translation, pp. 83–90,
2006.
74
Appendix A Word forms in Japanese
Table A.1: Examples of Japanese word forms.
Kanji sequence   Kana sequence   Phoneme sequence   IPA
仕事 (work)       しごと            sh i g o t o       /ʃigoto/
鮪 (tuna)         まぐろ            m a g u r o        /magɯro/
Japanese words can be written both as kana sequences and as kanji sequences.
Kana are Japanese phonograms, and a kana sequence can be uniquely derived from
a phoneme sequence using a mapping table. Kanji are Japanese logographs; converting
a phoneme sequence into a kanji sequence is not straightforward because of
homophones. Some examples of Japanese word forms are shown in Table A.1.
The name of each Japanese phonogram (except for the prolonged sound, the double
stop, and the nasal) is almost identical to its pronunciation. This differs from
languages written in the Latin alphabet, such as English. For example, the English
letter 'T,' whose name is /tiː/, can stand for other pronunciations such as /t/,
whereas the Japanese phonogram (kana) 'ま,' whose name is /ma/, is always pronounced
as its name. Consequently, the pronunciation of a word and that of its spelled-out
form are almost the same: spelling out 鮪 kana by kana gives ま /ma/, ぐ /gu/,
ろ /ro/, which together reproduce the pronunciation of the word itself.
Appendix B The International Phonetic Alphabet (IPA) of the Japanese syllabary
Table B.1: The International Phonetic Alphabet (IPA) of the Japanese syllabary. 'ng' and
'q' respectively represent the nasal and the double stop (short pause) in Japanese.
[Table B.1 body: the IPA transcription of each syllable of the Japanese syllabary, arranged with one column per consonant (-, p, b, d, z, zh, g, w, r, j, m, h, n, t, ch, ts, s, sh, k, py, by, zy, gy, ry, my, hy, ny, ty, sy, ky, ng, q) and one row per vowel (a, i, u, e, o).]
Appendix C Recursive equation of open-begin-end dynamic programming matching
OBE-DPM differs from ordinary dynamic programming matching in that the start
point and end point in both sequences are unconstrained, which enables partial
alignments between a whole word and a word segment. Assume we have two phoneme
sequences $x = (p_x^1, p_x^2, \ldots, p_x^I)$ and $y = (p_y^1, p_y^2, \ldots, p_y^J)$,
where $p_x^i$ and $p_y^j$ respectively denote the $i$-th and $j$-th phonemes of $x$
and $y$, and $I$ and $J$ denote the lengths of the two sequences. In OBE-DPM, a
start point $(p_x^0, p_y^0)$ and an end point $(p_x^{I+1}, p_y^{J+1})$ are added to
$x$ and $y$, and a trellis with $(I+2)$ columns and $(J+2)$ rows (an example of a
trellis is shown in Fig. 2.4 (a)) is built according to the following recursive
equation:
\[
D_{i,j} =
\begin{cases}
0 & (i = 0,\ j = 0) \\
D_{i-1,j} + \lambda & (1 \le i \le I+1,\ j = 0) \\
D_{i,j-1} + \lambda & (i = 0,\ 1 \le j \le J+1) \\
\min
\begin{cases}
D_{i-2,j-1} + s(p_x^i, p_y^j) + s(p_x^{i-1}, \phi) \\
D_{i-1,j-1} + s(p_x^i, p_y^j) \\
D_{i-1,j-2} + s(p_x^i, p_y^j) + s(\phi, p_y^{j-1})
\end{cases}
& (1 \le i \le I+1,\ 1 \le j \le J+1)
\end{cases}
\tag{C.1}
\]
where $D_{i,j}$ denotes the matching score, $s(p_x^i, p_y^j)$ denotes the phoneme
distance between $p_x^i$ and $p_y^j$, and $s(p_x^{i-1}, \phi)$ and $s(\phi, p_y^{j-1})$
respectively denote the insertion and deletion penalties. $s(p_x^i, p_y^j)$,
$s(p_x^{i-1}, \phi)$, and $s(\phi, p_y^{j-1})$ are calculated from the phoneme
confusion matrix. $\lambda$ is a constant, which was set to 1.5 in the experiments.
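For concreteness, the following is a minimal Python sketch of the recursion in Eq. (C.1). It assumes placeholder values for the confusion-matrix-based distances (0 for identical phonemes, 1 for substitutions, insertions, and deletions) and assumes that the added end point aligns with any phoneme at no cost; the actual values in this work are derived from the phoneme confusion matrix.

```python
import numpy as np

LAMBDA = 1.5      # the constant lambda, set to 1.5 in the experiments
GAP = None        # the empty symbol phi used for insertion/deletion penalties
END = "<end>"     # the added end point p_x^{I+1} / p_y^{J+1} (assumption: zero cost)

def s(p, q):
    """Placeholder phoneme distance.

    In this work s(.,.) is computed from a phoneme confusion matrix; as a
    stand-in, identical phonemes cost 0, and substitutions, insertions, and
    deletions (one argument is GAP) cost 1.
    """
    if p is END or q is END:
        return 0.0
    if p is GAP or q is GAP:
        return 1.0
    return 0.0 if p == q else 1.0

def obe_dpm(x, y, lam=LAMBDA):
    """Fill the OBE-DPM trellis D of Eq. (C.1) for phoneme sequences x and y."""
    I, J = len(x), len(y)
    xs = list(x) + [END]          # p_x^1, ..., p_x^{I+1}
    ys = list(y) + [END]          # p_y^1, ..., p_y^{J+1}
    D = np.full((I + 2, J + 2), np.inf)
    D[0, 0] = 0.0
    for i in range(1, I + 2):     # column j = 0: boundary penalty only
        D[i, 0] = D[i - 1, 0] + lam
    for j in range(1, J + 2):     # row i = 0: boundary penalty only
        D[0, j] = D[0, j - 1] + lam
    for i in range(1, I + 2):
        for j in range(1, J + 2):
            xi, yj = xs[i - 1], ys[j - 1]
            cand = [D[i - 1, j - 1] + s(xi, yj)]
            if i >= 2:            # skip p_x^{i-1}: insertion penalty
                cand.append(D[i - 2, j - 1] + s(xi, yj) + s(xs[i - 2], GAP))
            if j >= 2:            # skip p_y^{j-1}: deletion penalty
                cand.append(D[i - 1, j - 2] + s(xi, yj) + s(GAP, ys[j - 2]))
            D[i, j] = min(cand)
    return D

# Partial alignment of a word segment against a whole word:
D = obe_dpm(list("maguro"), list("gur"))
print(D[-1, -1])                  # overall matching score D_{I+1, J+1}
```

With an actual confusion matrix, s(·,·) would simply be looked up from it; the recursion itself is unchanged.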