Doctoral Dissertation

Two Key Technologies for a Flexible Speech Interface:
From the Perspective of Human–Robot Interaction

Chief Advisor: Professor Natsuki Oka
Kyoto Institute of Technology, Graduate School of Science and Technology
Department of Design Engineering
Student ID: 08821007
Name: Xiang Zuo
Submitted on February 10, 2012
Two Key Technologies for a Flexible Speech Interface:
From the Perspective of Human–Robot Interaction
Academic Year 2011 (Heisei 23)   08821007   Xiang Zuo

Abstract

Speech is the most natural means of human communication, and its use as an interface between humans and machines is highly desirable. At present, however, the performance of speech recognition is insufficient, and systems can recognize only commands registered in advance, which makes it difficult to realize a flexible speech interface. I therefore focused on two problems, learning the phoneme sequences of unknown words and detecting the target of utterances, and developed component technologies for realizing a flexible speech interface.

As the first problem, I developed a technique for learning the phoneme sequences of unknown words. For a speech interface operating in a real environment, learning the phoneme sequences of unknown words is extremely important. A system should not only respond to pre-registered commands but should also be able to learn the phoneme sequences (pronunciations) of unknown words online. For example, when the system encounters an unknown person or object, if it can learn the phoneme sequence of the name from the user's utterances, it can extend its vocabulary online and use the learned name to communicate with the user in subsequent conversations.

To learn the phoneme sequence of an unknown word, it suffices in principle to perform phoneme recognition on the word; however, because the performance of current speech recognition is insufficient, phoneme recognition errors are likely to occur. Accurately learning a phoneme sequence therefore requires correcting these recognition errors. In this study, I proposed a method in which the user corrects recognition errors by repeating the unknown word. During correction, the user can confirm the corrected phoneme sequence and correct it interactively. The proposed method has the following two features: (1) the user can repeat not only the whole unknown word but also only the misrecognized part of its phoneme sequence, which allows the system to locate recognition errors more efficiently; and (2) the system uses the history of the correction process to improve the efficiency of correction. For example, if a correction makes the phoneme sequence worse, the user can revert it to the previous version, and every correction is guaranteed to produce a phoneme sequence that differs from the previous ones. When correcting, the proposed method first aligns the phoneme sequence containing recognition errors with the phoneme sequence of the correction utterance using DP matching, and then selects the more reliable phoneme from each phoneme pair, thereby generating a more reliable phoneme sequence. Generalized posterior probability is used as the phoneme confidence measure. The evaluation experiments showed that the proposed method learns the correct phoneme sequence of an unknown word with very high efficiency, requiring only about three utterances on average.

As the second problem, I developed a technique for detecting the target of utterances. To realize smooth conversation with a system, it is necessary to judge whether a user's utterance is directed to the system. The system should respond only to utterances directed to it and must not respond to any other utterances. Without this capability, the system would also respond to utterances that are not directed to it (for example, sound from a TV or chat between people), which would disrupt communication with users and could even cause dangerous behavior. To solve this problem, I proposed a method for detecting the target of utterances. The proposed method is effective in a task in which a robot manipulates objects according to the user's utterances. In this task, it is assumed that the user commands the robot to perform object manipulation actions that the robot can execute in the current physical environment. Under this assumption, the degree of matching between the content of system-directed (robot-directed) utterances and the current physical environment is high. The proposed method therefore first interprets an utterance as an executable object manipulation action in the current physical environment, and then detects the target of the utterance by evaluating the degree of matching between that action and the physical environment.

As the criterion for evaluating this degree of matching, the multimodal semantic confidence (MSC) measure, which is newly proposed in this study, is used. MSC is computed by integrating, with a logistic model, the confidence scores obtained from speech recognition, object recognition, and the generation of the robot's manipulation motions. The speech confidence is computed following a conventional method, whereas the object and motion confidences are newly proposed in this study. The parameters of the logistic model are learned using a maximum-likelihood criterion. In the experiments, the proposed method was evaluated using a real robot and detected the target of utterances with a very high accuracy of more than 95%.
Two Key Technologies for a Flexible Speech Interface:
From the Perspective of Human–Robot Interaction
2011 d8821502 Xiang Zuo
Abstract
This thesis addresses two crucial problems in building a flexible speech interface
between humans and machines: (1) learning the phoneme sequences of out-of-
vocabulary (OOV) words, and (2) detecting the target of utterances. I propose the
following two methods to solve these problems.
First, I propose a method for learning the phoneme sequences of OOV words,
which is crucial for speech interfaces because developers cannot prepare beforehand,
in a system's vocabulary, all the words that may appear in practical use. When the
system encounters OOV words, it must learn their phoneme sequences to build lexical entries
for them. I propose a method called Interactive Phoneme Update (IPU) for this
purpose. Using this method, users can correct misrecognized phoneme sequences
by repeatedly making correction utterances based on the system responses. The fol-
lowing are the originalities of the method: (1) word-segment-based correction that
allows users to use word segments for locating misrecognized phonemes and (2)
history-based correction that utilizes the information of the phoneme sequences
that were recognized and corrected previously during the interactive learning of
each word. The experimental results show that IPU drastically outperformed a
previously proposed maximum-likelihood-based method for learning the phoneme
sequences of OOV words.
Second, I propose a method for detecting the target of utterances to distinguish
speech that users say to a machine from speech that users say to other people or
themselves. Such a functional capability is crucial for speech-based human-machine
interfaces. If the machine lacks this capability, then even utterances that are not
directed to it will be recognized as commands for it. Thus the machine will generate
an erroneous response.
The proposed method, which is used in an object manipulation task performed
by a robot, enables it to detect robot-directed speech. The originality of the method
is the introduction of a multimodal semantic confidence (MSC) measure for the do-
main classification of input speech based on whether the speech can be interpreted
as a feasible action under the current physical situation in the object manipulation
task. This measure is calculated by integrating speech, object, and motion confi-
dences with weightings that are optimized by logistic regression. The experimental
results show that my proposed method achieves a high average precision of 95% in
detecting the utterance targets.
Contents

Chapter 1  Introduction
Chapter 2  Learning the phoneme sequences of OOV words
  2.1  Background
  2.2  Interactive Phoneme Update (IPU)
    2.2.1  Locating and correcting phoneme errors in IPU
    2.2.2  History-based correction
  2.3  Experiments
    2.3.1  Experiment 1: Evaluation of the performance of IPU
    2.3.2  Experiment 2: Investigation of the factors of performance results
  2.4  Discussion
    2.4.1  Improvement of learning performance
    2.4.2  Influence caused by visual feedback
    2.4.3  Integration with OOV word detection
  2.5  Summary
Chapter 3  Detecting utterance targets
  3.1  Background
  3.2  Object Manipulation Task
  3.3  Proposed RD Speech Detection Method
    3.3.1  Speech Understanding
    3.3.2  MSC Measure
  3.4  Experiments
    3.4.1  Experimental Setting
    3.4.2  Off-line Experiment by Simulation
    3.4.3  On-line Experiment Using the Robot
  3.5  Discussion
    3.5.1  Using in a Real World Environment
    3.5.2  Extended Applications
  3.6  Summary
Chapter 4  Conclusion
Acknowledgment
References
Appendix A  Word forms in Japanese
Appendix B  The international phoneme alphabets (IPA) of Japanese syllabary
Appendix C  Recursive equation of open-begin-end dynamic programming matching
Chapter 1 Introduction
Speech, which is one of our most effective daily communication tools, is ex-
pected to eventually be used as a user-friendly interface between humans and
machines. In recent years, many studies have developed speech-based human-
machine interfaces. For example, speech interfaces provide such services on tele-
phones and mobile phones as automated phone calling systems [15, 16], flight and
hotel reservations [56, 13], alphanumeric string inputting [51], name dialing [52],
voice searches [8], multimedia information retrieval and browsing [77, 70], spoken
language translations [79, 71, 11], and voice assistance [1].
Besides telephone-based services, speech interfaces have also been used for other
tasks, including tourist [49] and museum guide tasks [61]. Furthermore, because of
their convenience, speech interfaces are suitable for hands-busy and eyes-busy tasks,
such as car navigation tasks. Speech provides safe interactions between drivers
and car navigation devices, where speech interfaces have nearly become standard
equipment; some manufacturers have even devoted research teams to them. For
example, the speech interface in the car navigation devices developed by Toshiba
is designed to adapt to different driving conditions [10]. Its speech recognition and
speech synthesis modules were optimized for in-car environments [26].
Speech interfaces have also been used for household appliances [74]. Compared
with other interfaces, such as remote controls, speech interfaces provide completely
new user experiences. For example, they were used for TV content search [69,
67]. Speech interfaces have also been used for entertainment, such as computer
games [48, 23, 78].
Speech interfaces have also been used for robots. In the last few years, robots
have been designed to become part of the everyday lives of ordinary people in social and
home environments. Many robotic systems have been implemented with speech
interfaces, including [2, 19]. Unlike such platforms as mobile phones or car naviga-
tion devices, robots are usually equipped with many sensors such as microphones,
cameras, and touch sensors. Thus they can communicate with users through mul-
timodal information. For example, a robot can find the corresponding object by
camera and manipulate it by hand when its name is indicated by users [20, 21].
A robot can learn an object’s visual features and simultaneously learn its name
through interactions with humans [22, 3]. Visual information can be used to help
a robot understand the meaning of utterances [54]. Gaze and hand gestures can
be used to help a robot understand user intentions [30].
Although many speech interfaces have already been developed, they still lack
flexibility in practical use. A number of factors complicate their use in real scenar-
ios. I consider that the following factors are important for speech interfaces:
• Speech recognition accuracy
The interface flexibility is seriously affected by the recognition accuracy of
the automatic speech recognition (ASR) module. Although many studies
have addressed recognition accuracy [41, 25, 34, 40], it remains inadequate
in real scenarios.
Among the various causes that reduce recognition accuracy, an important
one is background noise, which always exists in real scenarios. Methods such
as noise suppression [14] and blind source separation [29] can be used to deal
with background noise. Another important cause of reduced recognition accuracy
is pronunciation variation, such as dialects and emotion in speech. To
deal with pronunciation variations, methods such as pronunciation variation
modeling have been proposed [73, 39, 64, 57].
• Dialogue strategy
A dialogue strategy specifies which action the system will take depending
on the current dialogue context. Designing the dialogue strategy by hand
involves anticipating how users will interact with the system, repeated testing,
and refining, so the task can be difficult. Therefore the system is required to
manage the dialogue strategies by itself. Many studies have dealt with this
problem, such as [38, 42].
• Out-of-vocabulary words
Dealing with out-of-vocabulary (OOV) words is another serious problem for
speech interfaces because developers cannot prepare all the words beforehand
that might be used by individual users in the system’s vocabulary. If user
utterances include such OOV words, then the system will recognize them as
words within the vocabulary, and thus it will generate erroneous responses,
and the user will not know what the problem is.
The OOV problem consists of two sub-problems. When the system encoun-
ters OOV words, it first needs to detect them in the user utterances (OOV
word detection), and then it needs to learn their phoneme sequences to build
lexical entries for them (OOV word learning). Recently many studies have
focused on OOV word detection [6, 75, 50]. However, studies focusing on
OOV word learning are limited.
• Utterance targets
For speech interfaces, the functional capability to detect the target of ut-
terances is crucial. For example, a user’s speech directed to another human
listener should not be recognized as commands directed to a system. The sys-
tem must reject the utterances that are not directed to it. Studies focusing
on this problem are limited.
The goal of this thesis is to improve the flexibility of speech interfaces. Among
the above described problems, I focus on two: (1) learning the phoneme sequences
of OOV words, and (2) detecting the targets of utterances. Even though they are
fundamental problems of speech interfaces, and are crucial especially for robotic
speech interfaces, they have rarely been covered in previous studies.
First, learning the phoneme sequences of OOV words is a serious problem of
robotic speech interfaces because it is impossible to provide beforehand all the names
of things and persons that a robot in home use may encounter.
needs to learn the phoneme sequences of the new words from utterances. In recent
years, learning the phoneme sequences of OOV words has become a basic task for
robots [3]. Next, detecting the targets of the utterances is also essential for robots
because it is quite dangerous for a robot to respond to the utterances that are
not directed to it. For example, unexpected motion of a robot may hurt someone
nearby. I propose two methods in this thesis to solve these problems. They are
described below.
1. Learning the phoneme sequences of OOV words
To solve the OOV word learning problem, I propose a novel method called In-
teractive Phoneme Update (IPU), which enables systems to learn the phoneme
sequences of OOV words through interactions with users. During interaction,
users can correct the phoneme recognition errors by repeatedly making correction
utterances. The method enables the system to automatically extend its
vocabulary in an on-line manner so that the system can adapt to individual
users and the environment.
2. Detecting utterance targets
I propose a novel method for detecting the target of utterances. The proposed
method is used for a robotic dialogue system that enables a robot to detect
robot-directed speech in an object manipulation task. The method is based
on a multimodal semantic confidence (MSC) measure, which is used for the
domain classification of input speech based on whether the speech can be
interpreted as a feasible action under the current physical situation. Using the
method, the robot can detect robot-directed speech with very high accuracy,
even under noisy conditions.
The remainder of this thesis is organized as follows. First, the details of the
proposed OOV word learning method are given in Chapter 2. The details of the
proposed robot-directed speech detection method are given in Chapter 3. Finally,
Chapter 4 concludes the thesis.
Chapter 2 Learning the phoneme sequences of OOV words
2.1 Background
This chapter describes my proposed method for learning the phoneme sequences
of OOV words1. Learning OOV words is a difficult task since every phoneme of
a word should be correctly learned in order for a system to precisely recognize
and synthesize the word in subsequent communications. One kind of method
is to ask the user to spell out the new words [9, 17]. In this method, a graph
of possible phoneme sequences is first estimated by the given spelling, and then
the speech sample of the word is used to search this graph to determine the best
phoneme sequence. For instance the English word “teacher,” whose pronunciation
is /’ti:tS@/, is spelled as “T E A C H E R /ti: i: ei si: eitS i: a:/”. The pronunciation
of an English word and that of its spelled-out form are different. That is, spelling
out gives richer information than repeating the word, enabling better estimation
of its phoneme sequence when a grapheme-phoneme correspondence model is used.
However, spelling out is not effective in some languages such as Japanese and
Chinese, since the pronunciation of a word and that of its spelled-out form are
almost the same as each other in these languages. Therefore spelling out cannot
give richer information than repeating the word.
Another kind of method is to run a speech recognition system in a phoneme
recognition mode. However, this method is unreliable due to the high phoneme
1 OOV word detection is not discussed in this thesis. The position of OOV words is
given by template utterances pre-defined in the system. I assume that OOV words are
properly detected and segmented before the proposed method is applied.
recognition error rates. Although many studies have been done to improve phoneme
recognition accuracy [7, 4, 37], even state-of-the-art speech recognition systems only
achieve about 80% in phoneme recognition accuracy [45, 43]. For such a speech
recognition system, if each phoneme error occurs independently, the probability for
obtaining a correct phoneme sequence of a word with ten phonemes from a single
utterance is less than 11% (0.8^10 ≈ 0.11).
Since learning OOV words by phoneme recognition from a single utterance is
unreliable, some methods learned OOV words from multiple utterances [65, 60,
72, 5]. The maximum-likelihood (ML) based phoneme correction [65] is a widely
used method for this purpose. In this method, the phoneme sequence of a word is
obtained by searching a phoneme sequence that jointly maximizes the likelihood
of all of the input utterances of the word from their N -best phoneme recognition
lists.
This study deals with a word learning task in which users teach the system the
phoneme sequence (pronunciation) of OOV words by repeatedly making utterances
through speech interactions with the system. The target language of this study is
Japanese2 . Converting phoneme sequences to graphemic word forms is not dealt
with in this study since our target is speech interaction3. Rather than improving
the phoneme recognition accuracy in a batch way, I developed Interactive Phoneme
Update (IPU) that learns the phoneme sequences of OOV words in the course
of speech interaction. Using the method, users can correct the mis-recognized
phoneme sequences by repeatedly making correction utterances according to the
system responses. Consider the following dialogue scenario between two persons
(A and B).
2 However, the proposed method can be easily extended for other languages such as English.
3 Japanese words can be written as both kana sequences and kanji sequences, and kana
sequences can be uniquely converted from phoneme sequences based on a mapping ta-
ble. Converting phoneme sequences to kanji sequences is not dealt with in this study. A
description of Japanese word forms is given in Appendix A.
A0: “My name is Taisuke Sumii.”
B0: “Taisuke Sumie?”
A1: “No, Taisuke Sumii.”
B1: “Taisuke Zumie?”
A2: “That’s worse. Listen, Sumii.”
B2: “Taisuke Sumii?”
A3: “That’s right.”
In this dialogue, person A tries to teach his name to person B with an utterance
(A0), and person B makes a mistake. Person A then corrects the errors
by repeating the name (A1 and A2). Such a dialogue is quite common in communication
between humans. IPU aims to realize this kind of dialogue for learning
OOV words. The originalities of IPU are summarized as follows.
1. Word-segment-based correction: Apart from the whole word, IPU en-
ables the user to make a correction with just a segment of a word, according
to the phoneme errors. The advantage of word-segment-based correction is
that locating erroneous phonemes in a phoneme sequence becomes easier,
and the mis-correction of the correct part of the phoneme sequence can be
prevented.
2. History-based correction: IPU uses the historical information of phoneme
sequences that were recognized and corrected previously in the course of
interactive learning of each word to make learning efficient.
IPU can be used as a word pronunciation learning module for a variety of spoken
dialogue systems. For example, a robotic dialogue system in a home environment
probably encounters novel objects whose names do not exist in its vocabulary [17].
For another example, a telephone-based name dialing system needs to add novel
names to its vocabulary [12]. IPU enables all such spoken dialogue systems to
learn the phoneme sequence of OOV words through speech interactions. Once the
words are successfully learned, the system can recognize and synthesize the words
Dialogue Recognized phoneme sequence System phoneme sequence
U0: “It is misesukumiko.” x0: m i sh e s u k u ϕ i g o y0: m i sh e s u k u ϕ i g o
S0: “Is it mishesukuigo?”
U1: “No, it is misesukumiko.” x1:m i s e z u k u m i k o ng y1: m i sh e s u k u m i k o
S1: “Is it mishesukumiko?”
U2: “No, it is misesu.” x2: m i sh e z u y2: m i sh e z u k u m i k o
S2: “Is it mishezukumiko?”
U3: “No, that’s worse.”
U4: “It is misesu.” x3: m i s e s u y3: m i s e s u k u m i k o
S3: “Is it misesukumiko?”
U5: “That’s right.”
Figure 2.1: An example for learning OOV words by IPU. The left column shows the
dialogue between a user (U) and a system (S), the middle column shows the recognized
phoneme sequences of the OOV word in the user’s utterances, and the right column shows
the phoneme sequences in the system internal state. The phoneme errors are indicated by
underlines, and “ϕ” denotes a deletion error.
in the subsequent interactions. An example application of IPU has been already
implemented by [47].
The remainder of this chapter is organized as follows. The detail of the proposed
method is given in Section 2.2. The experimental settings and results are presented
in Section 2.3. Section 2.4 gives a discussion. Finally, Section 2.5 concludes the
chapter.
2.2 Interactive Phoneme Update (IPU)
This section presents the details of IPU. An example of the process of learning
OOV words by IPU is shown in Fig. 2.1. The user first tries to teach the system
a new word “misesukumiko,” whose phoneme sequence is [m i s e s u k u m i k
o]4 , by an initial utterance (U0). The system gets a recognized phoneme sequence
x0 of the OOV word by a pre-defined grammar including a phoneme network
4 The international phoneme alphabets (IPA) of Japanese syllabary are shown in Ap-
pendix B.
SilB It is
ch
a
ng
SilE
Figure 2.2: The grammar used for OOV word extraction. “SilB” and “SilE” denote the
silences in the beginning and end of the utterance.
like the one shown in Fig. 2.2. Using such a grammar, the phoneme sequence
of an OOV word can be extracted from an utterance “It is [oov]5,” where [oov]
represents the OOV word. The system then sets x0 to a system phoneme sequence
y0, and requests the user to confirm y0 by an utterance (S0). According to the
system response, the user makes a correction utterance (U1). The system gets a
recognized phoneme sequence x1 of the OOV word from U1 in the same way as
x0, then uses x1 to correct the phoneme errors in y0, which results in a new
system phoneme sequence y1. The system then requests the user to confirm y1.
The user then continues to make corrections until the word is correctly learned.
In this thesis, the ith recognized phoneme sequence and the ith system phoneme
sequence are respectively denoted by xi and yi. In the example, the user makes
correction utterances not only using the whole word, but also using word segments
(U2 and U4). To perform a correction between a recognized phoneme sequence
and a system phoneme sequence, the system should first locate the phoneme errors
in the system phoneme sequences, then correct the errors.
During the interaction, users follow a pre-defined grammar to help the system
understand the users’ utterances. I assume that the users behave in such a way
according to instructions. I know that this restriction does not hold in natural
5 In this thesis, utterances made in Japanese have been translated into English.
Figure 2.3: A part of the phoneme confusion matrix, with input phonemes β along one axis and recognized phonemes α along the other. The element c(α, β) of the confusion matrix represents the number of times a phoneme β is recognized as another phoneme α.
spoken dialogues between humans and systems. I think, however, it is valuable to
conduct research under this restriction because if efficient OOV word learning is
not possible with this restriction, it will never be possible to learn OOV words in
realistic human-machine interactions. My plan is to first show that it is possible
to learn OOV words with this restriction, and then to explore ways either to effectively
and naturally instruct users to behave under this restriction or to improve speech
understanding so that more natural utterances can be handled without it.
2.2.1 Locating and correcting phoneme errors in IPU
Here I give details about how to locate and correct the phoneme errors in IPU.
First, to locate the phoneme errors in the (i − 1)th system phoneme sequence
yi−1, the ith recognized phoneme sequence xi is aligned to yi−1 and the conflicting
phoneme pairs between them are found. The alignment is performed by open-
begin-end dynamic programming matching (OBE-DPM) [55] in order to deal with
a recognized phoneme sequence obtained from a word segment. The phonemes
in the conflicting phoneme pairs are treated as phoneme errors which need to be
corrected. Then generalized posterior probability (GPP) [58] is used as a confidence
measure to measure the reliability of the phonemes in the conflicting phoneme pairs.
The phoneme with the lower GPP value is replaced by the phoneme with the higher
GPP value, which results in a new system phoneme sequence yi that is more
reliable than yi−1.

Figure 2.4: Alignment matrix (a) and alignment result (b) for the word “gashirakomori” [g a sh i r a k o m o r i]. S and E respectively denote the start and end points. By the OBE-DPM, conflicting phoneme pairs (‘r’, ‘b’), (‘ϕ’, ‘k’), (‘ϕ’, ‘ng’) and (‘n’, ‘m’) are found.
Locating phoneme errors using OBE-DPM
OBE-DPM is an extended version of dynamic programming matching. It has
been widely used for subsequence matching. In this study, I use an OBE-DPM
with phoneme distance measures calculated from a phoneme confusion matrix.
The recursive equation of the OBE-DPM is shown in Appendix C. The phoneme
confusion matrix was built from the ATR Japanese speech database C-set
(a database consisting of 142,480 speech samples from 274 speakers (137 males and
137 females), with a total of 834,521 phonemes) [31]. ATRASR [46], which was
developed by Advanced Telecommunication Research Labs, was used as the phoneme
recognizer to build the confusion matrix. Twenty-six Japanese phonemes are included in
ATRASR. A part of the phoneme confusion matrix is shown in Fig. 2.3.

Figure 2.5: The relationship between GPP values and recognition accuracy (%) for the phonemes in the speech recognizer.

The element c(α, β) of the confusion matrix represents the number of a phoneme β
recognized as a phoneme α. The phoneme distance measure s(α, β) is calculated
by
s(\alpha, \beta) = -\log \frac{c(\alpha, \beta)}{\sum_{\alpha} c(\alpha, \beta)} . \qquad (2.1)
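As a concrete illustration of Eq. (2.1), the following Python sketch computes the distance from raw confusion counts. The dictionary layout, the smoothing constant, and the toy counts are assumptions made for this example, not part of the thesis implementation.

```python
import math

def phoneme_distance(confusion, alpha, beta, eps=1e-12):
    """s(alpha, beta) = -log( c(alpha, beta) / sum_a c(a, beta) ).

    `confusion[a][b]` holds the count of input phoneme `b` being recognized
    as phoneme `a`; `eps` avoids log(0) for unseen pairs (both assumptions).
    """
    total = sum(row.get(beta, 0) for row in confusion.values())
    count = confusion.get(alpha, {}).get(beta, 0)
    return -math.log((count + eps) / (total + eps))

# Toy counts: the input phoneme 'b' is recognized as 'b' 90 times and as 'd' 10 times.
confusion = {"b": {"b": 90, "d": 5}, "d": {"b": 10, "d": 95}}
print(phoneme_distance(confusion, "b", "b"))  # small distance, ~0.11
print(phoneme_distance(confusion, "d", "b"))  # larger distance, ~2.30
```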
The alignment matrix and the alignment result for the word “gashirakomori”
[g a sh i r a k o m o r i] are shown in Fig. 2.4. The system phoneme sequence yi−1
is [g a sh i r a ϕ o n o r i], and the correction phoneme sequence xi is [sh i b a k
o ng m o], each of which includes certain errors. In this example, the sub-sequence
[sh i r a o n o] in yi−1 is obtained as the sub-sequence that corresponds to xi. The
conflicting phoneme pairs between this sub-sequence and xi are (‘r’, ‘b’), (‘ϕ’, ‘k’),
(‘ϕ’, ‘ng’) and (‘n’, ‘m’).
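The following is a minimal sketch of how an open-begin-end DP alignment of this kind can be implemented. The actual recursion used in the thesis is the one given in Appendix C; the constant gap cost, the "phi" symbol, and the function names here are assumptions made purely for illustration.

```python
def obe_align(x, y, dist, gap=2.0):
    """Align a (possibly partial) correction sequence x to the system sequence y.

    Both ends of y are "open": x may align to any contiguous part of y at no
    extra cost.  dist(a, b) is a substitution cost such as s(alpha, beta);
    gap is an assumed constant insertion/deletion cost.  Returns (x, y)
    phoneme pairs, with "phi" marking an insertion or deletion.
    """
    n, m = len(x), len(y)
    INF = float("inf")
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    for j in range(m + 1):
        D[0][j] = 0.0                       # open begin: x may start anywhere in y
    for i in range(1, n + 1):
        for j in range(m + 1):
            cands = [(D[i - 1][j] + gap, (i - 1, j, "ins"))]          # x[i-1] vs phi
            if j > 0:
                cands.append((D[i - 1][j - 1] + dist(x[i - 1], y[j - 1]),
                              (i - 1, j - 1, "sub")))
                cands.append((D[i][j - 1] + gap, (i, j - 1, "del")))  # phi vs y[j-1]
            D[i][j], back[i][j] = min(cands, key=lambda c: c[0])
    j = min(range(m + 1), key=lambda j: D[n][j])                      # open end
    pairs, i = [], n
    while i > 0:                                                      # trace back
        pi, pj, op = back[i][j]
        if op == "sub":
            pairs.append((x[i - 1], y[j - 1]))
        elif op == "ins":
            pairs.append((x[i - 1], "phi"))
        else:
            pairs.append(("phi", y[j - 1]))
        i, j = pi, pj
    return list(reversed(pairs))
```

With such an alignment, the conflicting pairs are exactly those whose two sides differ, including pairs in which one side is "phi".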
Correcting phoneme errors using GPP
Generalized posterior probability (GPP) has been used as a confidence mea-
sure to verify recognized entities at different levels, e.g., sub-word, word, and sen-
tence [58]. It is computed by generalizing the likelihoods of the sub-words with
overlapped time registrations in the word graph. In this study, I use GPP at the
phoneme level.

yi−1:  g  a  sh  i  r  a  ϕ  o  ϕ  n  o  r  i
GPP:   0.93 0.66 0.61 0.72 0.53 0.92 0.74 0.66 0.95 0.99 0.97
xi:    -  -  sh  i  b  a  k  o  ng  m  o  -  -
GPP:   -  -  0.75 0.83 0.11 0.76 0.55 0.92 0.26 0.92 0.43  -  -
yi:    g  a  sh  i  r  a  k  o  ng  m  o  r  i
GPP:   0.93 0.66 0.61 0.72 0.53 0.92 0.55 0.74 0.26 0.92 0.95 0.99 0.97

Figure 2.6: An example of a phoneme replacement. The system phoneme sequences yi−1 and yi, and the recognized phoneme sequence xi, with the GPP values for each phoneme in them, are shown in this figure. The conflicting phonemes are indicated by squares.
I investigated the relationship between GPP values and the recognition accuracy
for all phonemes in the speech recognizer using the ATR Japanese speech database
C-set. The result is shown in Fig. 2.5. I found that the phoneme recognition
accuracy increased consistently with the GPP value, which indicates the appropriateness
of using GPP as a confidence measure6 .
An example of a phoneme replacement is shown in Fig. 2.6. It shows the system
phoneme sequences yi−1 and yi, and the recognized phoneme sequence xi with the
GPP values for each phoneme in them. The conflicting phonemes are indicated
by squares. Among the conflicting phonemes, ‘r’ is not replaced with ‘b’, and ‘n’
is replaced by ‘m’ according to the GPP values. To deal with the insertion and
deletion errors, I give a threshold of 0.5, which was decided empirically by preliminary experiments. In the example, yi−1 is judged to have a
deletion error ‘k’, which is corrected by the threshold. In this example, yi becomes a
correct phoneme sequence after the replacement. GPP values are only updated for
the conflicting phoneme pairs. They are not updated for the consistent phoneme
pairs8 .
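A compact sketch of the replacement rule described above is given below, assuming the aligned (x, y) pairs and per-phoneme GPP values are already available. How the part of yi−1 outside the matched segment is carried over, and the exact handling of each insertion/deletion case, are simplifications of mine rather than the thesis specification.

```python
def merge_by_gpp(pairs, gpp_x, gpp_y, threshold=0.5):
    """Merge aligned phoneme pairs into a new (more reliable) sequence.

    pairs : (x_phoneme, y_phoneme) pairs from the alignment, "phi" = gap.
    gpp_x, gpp_y : GPP values of the non-"phi" phonemes, consumed in order.
    threshold : confidence needed to insert or keep a phoneme at a gap.
    """
    gx, gy = iter(gpp_x), iter(gpp_y)
    merged = []
    for px, py in pairs:
        cx = next(gx) if px != "phi" else None
        cy = next(gy) if py != "phi" else None
        if px == py:                    # consistent pair: keep the system phoneme
            merged.append(py)
        elif py == "phi":               # deletion error in y: insert px if confident
            if cx > threshold:
                merged.append(px)
        elif px == "phi":               # possible insertion error in y: keep py only if confident
            if cy > threshold:
                merged.append(py)
        else:                           # conflicting pair: keep the more confident phoneme
            merged.append(px if cx > cy else py)
    return merged
```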
6 However, I found that the recognition accuracy for phoneme ‘q,’ which represents a
double stop (short pause) in Japanese, does not vary directly with its GPP value. Therefore
the correction for recognition errors of ‘q’ is not performed in this study.
8 GPP values for the consistent phoneme pairs can be updated by at least the following
three ways: (1) updated by the system phoneme sequence yi−1, (2) updated by the recognized
phoneme sequence xi, and (3) updated by the average of yi−1 and xi. In our preliminary
experiments, I found that the performances of these three approaches were almost the
same, and approach (1) was used in IPU.
1. Set i← 0, M ← maximum number of correction utterances.
2. Extract the recognized phoneme sequence x0 of the OOV word from
an initial utterance.
3. Set system phoneme sequence y0 ← x0, and request the user to
confirm y0.
4. According to the user’s response,
go to step 12 if the user gives a stop utterance or,
go to step 5 if the user makes a correction utterance.
5. Set i← i+ 1.
6. If i > M then go to step 12, otherwise go to step 7.
7. Extract the recognized phoneme sequence xi of the OOV word
from a correction utterance.
8. Use xi to correct phoneme errors in yi−1 by OBE-DPM and
GPP.
9. If the correction result equals yi−1, then get another phoneme sequence
x′i from the same N-best recognition list as xi, set xi ← x′i,
and go to step 8; otherwise go to step 10. (Forced-change)
10. Update yi ← correction result, and request the user to
confirm yi.
11. According to the user’s response,
go to step 12 if the user gives a stop utterance or,
go to step 5 if the user makes a correction utterance or,
go to step 5, and set yi ← yi−1 if the user gives an undo
utterance. (Undo behavior)
12. Stop the correction process, and treat yi as the learning result.
Figure 2.7: The interaction with IPU.
2.2.2 History-based correction
Next I give details about history-based correction. Historical information of sys-
tem phoneme sequences {y0, . . . , yi−1} that were obtained previously in the course
of interaction is used to help the system estimate the current system phoneme se-
quence yi. History-based correction consists of undo behavior and forced-change,
each of which is described as follows:
Undo behavior: During the interaction, a correction sometimes results in a
phoneme sequence that is worse than the previous one. IPU enables the user
to undo such corrections.
Forced-change: During the interaction, a system phoneme sequence in which
the user finds errors should become different after a correction. IPU ensures
that each correction results in a system phoneme sequence yi that is different
from the previous system phoneme sequences {y0, . . . , yi−1}. If a recognized
phoneme sequence xi cannot result in a different phoneme sequence, another
recognized phoneme sequence x′i, which is obtained from the same N-best
phoneme recognition list as xi, is used to perform the correction instead of xi.
Finally, the algorithm for learning OOV words by IPU is shown in Fig. 2.7. Four
types of utterances can be used in this algorithm. Initial utterance “It is [oov]” is
used to teach the system new words; correction utterance “No, it is [oov]” is used
to make corrections; undo utterance “No, that’s worse” is used to undo the current
correction; and stop utterance “That’s right” is used to stop the correction when
the words are correctly learned. M denotes the maximum number of correction
utterances for each word.
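The overall interaction of Fig. 2.7 can be summarized in a few lines of Python. The function names (recognize, correct, interact) and the way replies are encoded are hypothetical placeholders for the speech recognizer, the OBE-DPM + GPP correction, and the user interface; the sketch only mirrors the control flow of the algorithm.

```python
def ipu_learn(initial_utterance, recognize, correct, interact, max_corrections=7):
    """Sketch of the IPU loop: recognize -> confirm -> correct -> (undo / stop).

    recognize(u) -> N-best list of phoneme sequences for utterance u
    correct(y, x) -> candidate sequence produced by OBE-DPM + GPP correction
    interact(y)  -> "stop", "undo", or the user's next correction utterance
    """
    history = []                                  # earlier system phoneme sequences
    y = recognize(initial_utterance)[0]           # y0 <- x0
    for _ in range(max_corrections):
        reply = interact(y)                       # system asks "Is it ...?"
        if reply == "stop":
            break
        if reply == "undo":                       # undo behavior: revert to previous sequence
            if history:
                y = history.pop()
            continue
        new_y = y
        for x in recognize(reply):                # forced-change: walk the N-best list
            new_y = correct(y, x)                 # until the correction changes the sequence
            if new_y != y and new_y not in history:
                break
        history.append(y)
        y = new_y
    return y                                      # the learned phoneme sequence
```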
2.3 Experiments
I conducted experiments on a word learning task. I first evaluated the performance
of IPU by comparing it to a baseline method. Then I performed a detailed
analysis to investigate the performance results obtained by IPU.
The experiments were abstracted away from any specific type of spoken dialogue
system, such as a robotic dialogue system or a telephone-based name dialing
system. In the experiments, users interacted with the system to teach it the
phoneme sequences of a word. This interaction was designed to separate the task from
problems that arise in each type of spoken dialogue system, such as the difficulty
of OOV word detection in a variety of user utterances and the quality of speech
synthesis.

Table 2.1: Settings for IPU and the baseline.
             GPP   ML   Word-segment   Undo   Forced-change   Stop
IPU           √    -         √          √           √          √
Baseline      -    √         -          -           -          √
2.3.1 Experiment 1: Evaluation of the performance of
IPU
Baseline
As a baseline for comparison with IPU, I ran the maximum-likelihood (ML) based
phoneme correction [65] in an on-line manner. The baseline required users to make
multiple utterances of the word. Given a set of utterances {u0, . . . , uI} of a word,
where ui denotes the ith utterance, the phoneme sequence s of the word is obtained
by searching a phoneme sequence that jointly maximizes the likelihood of all of the
input utterances from their N-best phoneme recognition lists, as

s = \operatorname*{argmax}_{s \in L(u_0) \cup \cdots \cup L(u_I)} \prod_{i=0}^{I} P(u_i \mid s), \qquad (2.2)

where L(u_0) \cup \cdots \cup L(u_I) denotes the union of the N-best phoneme recognition lists for {u_0, . . . , u_I}.
In the experiment, N was set to 50. I is dynamically given in the experiment. It
equals the number of speech samples of the word uttered by the user during the
interaction.
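For comparison, the baseline of Eq. (2.2) can be sketched as a simple search over the pooled N-best candidates. The scoring function is left abstract because how P(ui | s) is evaluated against the acoustic model is outside the scope of this illustration, and the names below are assumptions.

```python
from itertools import chain

def ml_select(nbest_lists, log_likelihood):
    """Pick the phoneme sequence that jointly maximizes the likelihood (Eq. 2.2).

    nbest_lists : one N-best list of candidate phoneme sequences per utterance
    log_likelihood(i, s) : log P(u_i | s) for the i-th utterance (assumed given)
    """
    candidates = set(map(tuple, chain.from_iterable(nbest_lists)))
    def joint_score(s):
        return sum(log_likelihood(i, s) for i in range(len(nbest_lists)))
    return list(max(candidates, key=joint_score))
```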
In the baseline, users just made repetitions of the word; word-segment-based
corrections were not possible since all speech samples must be given by the whole
word. Moreover, users only performed stop behaviors. They were not allowed to
Table 2.2: The word list used in the experiments. The right column shows the Japanese
phoneme sequences, and the left column shows the order for the words used in the experi-
ments.
No Phoneme sequence1 n a m i h a r i n e z u m i2 m a d a g a s u k a r u m i d o r i j a m o r i3 k u r o s u t e n a g a z a r u4 m i k u r o s u t o n i k u s u5 k i k u g a sh i r a k o m o r i6 m i s e s u k u m i k o7 k a s u m i z a k u r a8 t o k i w a m a ng s a k u9 b u t a ng sh i r o m a ts u10 k i b a n a ky a t a k u r i11 a ng d o r o m e d a s e u ng12 k a m i n o k e z a b e t a s e13 s a ng g u r e z a14 m a z e r a n i k u s u t o r i m u15 r i zh i r u k e ng t a u r u s u16 h a r a t a k a sh i17 g o ng s u ng z a ng18 n o g u ch i h i d e j o19 j o s a n o a k i k o20 b a o z u ng21 a zh i s a i22 t a n u k i23 j o sh i o24 k a r u p i s u25 a k u e r i a s u
perform undo behaviors, and forced-changes were not used by the system. In other
words, the baseline did not have history-based corrections.
The settings for IPU and the baseline are summarized in Table 2.1. “GPP” and
“ML” respectively represent the GPP-based phoneme correction used in IPU and
the ML-based phoneme correction used in the baseline. “Word-segment,” “Undo,”
“Forced-change” and “Stop” respectively represent word-segment-based correction,
undo behavior, forced-change and stop behavior. “√” indicates that the corresponding
factor was used.
Setting
I prepared a word list including 25 Japanese words. The word list includes
names of animals, plants, celestial objects, and persons from Wikipedia9. The total num-
ber of phonemes was 305, and each word included 12.2 phonemes on average.
Table 2.2 shows the Japanese phoneme sequence for the words that were used in
the experiments.
ATRASR, which was used to build the confusion matrix, was used as a speech
recognizer in the experiments. Speaker independent phoneme models were used in
ATRASR. The phoneme models were represented by context-dependent HMMs,
with Gaussian mixture distributions in each state. Mel-scale cepstrum coefficients
and their delta parameters (25-dimensional MFCC) were used as feature parameters.
A grammar including a phoneme network like the one shown in Fig. 2.2
was used in the speech recognizer. The phoneme network was constructed according to
Japanese phonotactic constraints. Phoneme N-gram models were not used in the
phoneme network to avoid their influence on the performance of IPU.
Phoneme accuracy (P%) and word pronunciation accuracy (W%) were used
for evaluation, each of which is defined as
P = \frac{N_p - S - D - I}{N_p} \times 100, \qquad W = \frac{N_w - N_e}{N_w} \times 100, \qquad (2.3)
where Np and Nw denote the number of phonemes and words used in the experi-
ment (Np = 305 and Nw = 25), S, D and I respectively denote the total number of
phonemes with substitution, deletion and insertion errors, and Ne denotes the total
number of words that contain misrecognized phonemes. Phoneme
accuracy Pi and word pronunciation accuracy Wi of the ith system phoneme se-
quences are calculated after the ith corrections.
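The two measures in Eq. (2.3) reduce to the following small helpers; the example numbers in the usage lines are made up and are not results from the experiment.

```python
def phoneme_accuracy(n_phonemes, substitutions, deletions, insertions):
    """Phoneme accuracy P (%) as defined in Eq. (2.3)."""
    return (n_phonemes - substitutions - deletions - insertions) / n_phonemes * 100

def word_pronunciation_accuracy(n_words, n_words_with_errors):
    """Word pronunciation accuracy W (%) as defined in Eq. (2.3)."""
    return (n_words - n_words_with_errors) / n_words * 100

# Hypothetical error counts for the 305 phonemes and 25 words of the word list:
print(phoneme_accuracy(305, 30, 10, 8))        # ~84.3
print(word_pronunciation_accuracy(25, 20))     # 20.0
```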
9 http://ja.wikipedia.org/
• Please teach the system new words as “It is [oov].”
• The system may mis-recognize certain phonemes. Please correct the phoneme
errors as “No, it is [oov].” You can repeat the whole word or just a part of
the word in the utterance.
• According to the system response, you can continue correcting or undo the
correction. Otherwise please stop the correction when the phoneme sequence
becomes correct.
• At most seven corrections can be made for each word.
• During the interaction, do not change your accent on purpose. Please speak
naturally.
Figure 2.8: The details of the instructions given to the participants.
Protocol
18 native Japanese speakers (twelve males and six females) participated in the
experiment. The participants were students and staff in our research institutes.
They initially did not have any knowledge about the proposed method. Each
participant did the experiments according to the following procedure.
First, each participant taught the words to the system using IPU. Before each
experimental session, a trial use of the system was permitted. Then the partic-
ipant sat on a chair 40cm from a SANKEN CS-3e directional microphone and
taught the words in the list to the system in Japanese according to the instruc-
tions whose details are shown in Fig. 2.8. In the experiment, participants uttered
only initial and correction utterances; undo and stop were performed by keyboard
operation in order to avoid recognition errors. The system phoneme sequences
were synthesized10 , and shown in katakana11 on a display to help the participant
find the phoneme errors12 . The maximum number M of correction utterances was
10 I used VoiceText (http://www.voicetext.jp) for speech synthesis.
11 Katakana is a kind of Japanese phonogram. It can be uniquely converted from the phoneme sequences based on a mapping table.
12 Visual feedback is not necessary in IPU. In practical use, IPU can be used with/without visual feedback according to the scenario.
Table 2.3: The statements for subjective evaluations for IPU.
No Statement
Q1 The system was efficient to correct phoneme errors.
Q2 The correction method was easy to understand.
Q3 The interaction with the system was smooth.
Q4 The participant would like to use the system to teach new words.
set to seven. Therefore, at most eight utterances were made for each word. After
the experiment of IPU, each participant used a five-point rating scale to evalu-
ate the relevance (5: very relevant, 4: somewhat relevant, 3: even, 2: somewhat
irrelevant, 1: irrelevant) of the statements shown in Table 2.3.
Then, each participant taught the words to the system using the baseline. The
participant was instructed to repeatedly make utterances “It is [oov]” using the
words in the word list until the words were correctly learned. The participant just
repeated the whole words; word segments were not allowed. During the interaction,
the participants only operated stop; undo behaviors were not allowed.
Finally, a data collection process was performed. In the baseline experiment
some words were learned with fewer than eight utterances. I therefore additionally collected
speech samples of the form “It is [oov]” containing the whole words in the word list to
ensure that each word had eight speech samples. As a result, a total of 3,600 speech
samples (200 speech samples for one participant) were collected in the baseline
experiment and the data collection process. These speech samples were used in the
next experiment.
Results
The phoneme and word pronunciation accuracies that were obtained from 18
participants are respectively shown in Fig. 2.9 (a) and Fig. 2.9 (b). The horizontal
axis represents the number of correction utterances (‘0’ represents the initial utter-
ance). The phoneme and word pronunciation accuracies for the initial utterance
Figure 2.9: The phoneme accuracy (a) and word pronunciation accuracy (b) achieved by IPU and the baseline, plotted against the number of correction utterances.
were 84.1% and 20.4%. These values represent the performance of the speech rec-
ognizer without any corrections. IPU outperformed the baseline in both phoneme
and word pronunciation accuracies. For IPU, the accuracies improved significantly
as the number of correction utterances increased, and achieved 96.8% and 79.1%
respectively in phoneme and word pronunciation accuracies after the seventh cor-
rection, while for the baseline, the accuracies did not improve much, and achieved
only 90.4% and 49.8%.
Figure 2.10: The error rate reductions (%) in (a) phoneme accuracy and (b) word pronunciation accuracy achieved by each correction utterance relative to the previous correction utterance, for IPU and the baseline.
The error rate reductions achieved by each correction utterance relative to
the previous correction utterance in IPU and the baseline are shown in Fig. 2.10.
I found that the error rate reductions achieved by IPU outperformed the error
Figure 2.11: Relationship between the number of phonemes in each word (horizontal axis) and the number of correction utterances used for that word in IPU (vertical axis). Correction utterances containing a word segment and the whole word are shown as “segment” and “whole”, respectively.
rate reductions achieved by the baseline in both phoneme and word pronunciation
accuracies. The average error rate reductions of seven correction utterances for
IPU and the baseline were 20.3% and 6.8% in phoneme accuracy, and 17.4% and
6.1% in word pronunciation accuracy. This means that the error rate reductions
achieved by IPU were about three times those of the baseline.
In IPU, the average number of correction utterances and word-segment-based
corrections used by the participants were 3.27 and 2.97. This means that in IPU,
90.8% of the corrections were done using word segments. The average number
of undo behaviors performed by the participants for each word was 2.21, and the
average number of forced-changes performed by the system for each word was
1.33. Stop behaviors were performed by the participants for the words which were
correctly learned during the interaction.
Moreover, the relationship between the number of phonemes in each word and
the number of correction utterances used for that word in IPU is shown in Fig. 2.11.
The number of correction utterances including a word segment and the whole
word are respectively shown by “segment” and “whole”. I found that the words
Dialogue Recognized phoneme sequence System phoneme sequence
U0: “It is mikurosutonikusu.” x0: m i k u r o s ϕ ϕ o n i k u s u y0: m i k u r o s ϕ ϕ o n i k u s u
S0: “Is it mikurosonikusu?”
U1: “No, it is sutonikusu.” x1: s u t o n i g u s u # m i k u r o s ϕ ϕ o n i g u s u
S1: “Is it mikurosonigusu?”
U2: “No, that’s worse.” y1: m i k u r o s ϕ ϕ o n i k u s u
U3: “It is sutonikusu.” x2: s u t o n i k u s u y2: m i k u r o s u t o n i k u s u
S2: “Is it mikurosutonikusu?”
U4: “That’s right.”
Figure 2.12: An example for the word “mikurosutonikusu” [m i k u r o s u t o n i k u s
u] that was successfully learned.
Dialogue Recognized phoneme sequence System phoneme sequence
U0: “It is butangshiromatsu.” x0: b e d a ng s u ϕ o m a ts u y0: b e d a ng s u ϕ o m a ts u
S0: “Is it bedangsuomatsu?”
U1: “No, it is butangshiromatsu.” x1: b u d a ng sh i d o m a ts u y1: b u d a ng sh i ϕ o m a ts u
S1: “Is it budangshiomatsu?”
U2: “No, it is shiromatsu.” x2: sh i d o m a ts u y2: b u d a ng sh i d o m a ts u
S2: “Is it budangshidomatsu?”
U3: “No, it is shiro.” x3: zh i d o # b u d a ng zh i d o m a ts u
S3: “Is it budangzhidomatsu?”
U4: “No, that’s worse.” y3: b u d a ng sh i d o m a ts u
U5: “It is shiromatsu.” x4: sh i d o n a ts u # b u d a ng sh i d o n a ts u
S4: “Is it budangshidomotsu?”
U6: “No, that’s worse.” y4: b u d a ng sh i d o m a ts u
U7: “It is shiromatsu.” x5: sh i n o m a ts u y5: b u d a ng sh i n o m a ts u
S5: “Is it budangshinomatsu?”
U8: “No, it is butangshiromatsu.” x6: b u e t a ϕ s u n o m a ts u y6: b u t a ng sh i n o m a ts u
S6: “Is it butangshinomatsu?”
U9: “No, it is shiromatsu.” x7: s u ϕ o n a ts u y7: b u t a ng sh i n o n a ts u
S7: “Is it butangshiomatsu?”
Figure 2.13: An example for the word “butangshiromatsu” [b u t a ng sh i r o m a ts u]
that was not successfully learned.
with more phonemes required more correction utterances and more word-segment
corrections.
Furthermore, Fig. 2.12 and Fig. 2.13 respectively show the examples of words
that were successfully and not successfully learned in the experiment. The left
column shows the dialogue, the middle column shows the recognized phoneme
sequence of the OOV word, and the right column shows the system phoneme
sequences after each correction. In the examples, both word segments and the
whole words were used in correction utterances. The corrections that were undone
are indicated by “#”.

Figure 2.14: Subjective evaluations of IPU. Horizontal axes show the opinion scores, and vertical axes show the number of participants. The average scores were Q1: 3.59, Q2: 3.35, Q3: 3.65, and Q4: 3.59.
Finally, the results of the subjective evaluation are shown in Fig. 2.14. The opin-
ion scores for Q1, Q2, Q3 and Q4 are shown in the figure. The horizontal
axes show the opinion scores, and the vertical axes show the number of participants.
The average scores are given in the figure caption. I found that most of the participants
gave scores equal to or greater than three for all of the statements, which indicates
that the participants had positive impressions of the system.
2.3.2 Experiment 2: Investigation of the factors of performance results
Setting
Next I investigated the factors of performance improvement in IPU in order
to evaluate the effectiveness of word-segment-based correction, undo behavior and
forced-change. I ran IPU under the following conditions:
Condition-1: Only the whole words were used for correction. Word-segment-
based corrections were not used.
Condition-2: Only the whole words were used for correction. Word-segment-
based corrections were not used. Forced-changes were not done in the cor-
rection process.
Table 2.4: Settings for IPU, Condition-1, Condition-2, Condition-3, Condition-4 and
Condition-5.
              GPP   ML   Word-segment   Undo   Forced-change   Stop
IPU            √    -         √          √           √          √
Condition-1    √    -         -          √           √          √
Condition-2    √    -         -          √           -          √
Condition-3    √    -         -          -           √          √
Condition-4    √    -         -          -           -          √
Condition-5    -    √         -          √           √          √
Condition-3: Only the whole words were used for correction. Word-segment-
based corrections were not used. Users were not allowed to perform undo
behaviors during the interaction.
Condition-4: Only the whole words were used for correction. Word-segment-
based corrections were not used. Users were not allowed to perform undo
behaviors during the interaction. Forced-changes were not done in the cor-
rection process.
Finally, I combined the ML-based phoneme correction with history-based cor-
rections as in Condition-5.
Condition-5: The phoneme sequences of the words were obtained by the ML-
based phoneme correction using equation (2.2). During the interaction, users
were allowed to perform stop and undo behaviors. Forced-changes were used
by the system. Word-segment-based corrections were not possible in this
condition. Only the whole words were used for correction.
The settings for all of these conditions as well as IPU are summarized in Ta-
ble 2.4. Since word-segment-based correction was not used in these conditions,
the 3,600 speech samples including the whole words collected in experiment 1 were
used as input data for these conditions. The speech samples were automatically
Figure 2.15: The phoneme accuracies (a) and word pronunciation accuracies (b) achieved by Condition-1, Condition-2, Condition-3, Condition-4, Condition-5 and IPU, plotted against the number of correction utterances.
inputted into the system. In the experiment, undo and stop were operated by the
same participants as in experiment 1.
Results
The phoneme and word pronunciation accuracies for Condition-1, Condition-2,
Condition-3, Condition-4, Condition-5, as well as IPU are shown in Fig. 2.15. The
comparison of IPU and Condition-1 shows the effectiveness of word-segment-based
Table 2.5: The detailed results of the t-test in phoneme accuracy.
                     Number of correction utterances
                     1      2      3      4      5      6      7
IPU to C-1   T(898)  0.14   0.43   3.10   3.82   3.54   3.60   4.98
             p       0.89   0.67   <.01   <.01   <.01   <.01   <.01
C-2 to C-4   T(898)  6.02   6.89   7.46   7.52   7.85   8.02   8.14
             p       <.01   <.01   <.01   <.01   <.01   <.01   <.01
C-3 to C-4   T(898)  0.03   0.37   3.03   3.88   4.03   4.43   6.98
             p       0.89   0.67   <.01   <.01   <.01   <.01   <.01
C-1 to C-5   T(898)  0.07   0.12   0.26   0.33   1.00   1.35   1.17
             p       0.96   0.93   0.84   0.81   0.48   0.33   0.42
Table 2.6: The detailed results of the t-test in word pronunciation accuracy.
                     Number of correction utterances
                     1      2      3      4      5      6      7
IPU to C-1   T(898)  1.20   1.87   3.58   3.58   4.41   5.76   5.66
             p       0.24   0.08   <.01   <.01   <.01   <.01   <.01
C-2 to C-4   T(898)  0.63   0.49   0.50   0.14   1.37   2.30   3.03
             p       0.53   0.63   0.62   0.89   0.71   0.49   <.01
C-3 to C-4   T(898)  1.49   4.63   4.92   5.33   5.88   6.51   7.03
             p       0.65   <.01   <.01   <.01   <.01   <.01   <.01
C-1 to C-5   T(898)  0.63   0.55   0.53   0.15   0.40   0.70   0.79
             p       0.58   0.68   0.66   0.90   0.75   0.56   0.55
correction; the comparison of Condition-2 and Condition-4 shows the effective-
ness of undo behavior; the comparison of Condition-3 and Condition-4 shows the
effectiveness of forced-change; the comparison of Condition-1 and Condition-4 ad-
ditionally shows the total effectiveness for both undo behavior and forced-change;
and the comparison of Condition-1 and Condition-5 shows the difference between
the GPP-based phoneme correction and the ML-based phoneme correction. I found
that word-segment-based correction, undo behavior and forced-change contributed
performance improvements in both phoneme and word pronunciation accuracies
in a cumulative way. Moreover, I found that undo behaviors were more efficient
than forced-changes in phoneme accuracy (see Condition-2 and Condition-3 in
Fig. 2.15 (a)). This is because undo behaviors prevent the phoneme sequence
from getting worse, which directly improves the phoneme accuracy. In contrast,
forced-changes were more efficient than undo behaviors in word pronunciation accuracy
(see Condition-2 and Condition-3 in Fig. 2.15 (b)). This is because
forced-changes ensure that each correction results in a different system phoneme
sequence, and thus improve the possibility of obtaining correct system phoneme
sequences. However, the performances of Condition-1 and Condition-5 were almost
the same. This means that the performances of the GPP-based phoneme correction
and the ML-based phoneme correction were almost the same.
I also performed paired t-tests to investigate the statistical differences between
IPU and Condition-1, between Condition-2 and Condition-4, between Condition-3
and Condition-4, and between Condition-1 and Condition-5. The detailed results of
the t-tests in phoneme and word pronunciation accuracy are respectively shown
in Table 2.5 and Table 2.6. In the tables, “C-1,” “C-2,” “C-3,” “C-4” and “C-5”
respectively represent Condition-1, Condition-2, Condition-3, Condition-4 and
Condition-5. “T(898)” represents the t-value obtained by the t-test, with 898 degrees
of freedom. Entries with p < .01 indicate statistically significant differences.
The speech samples used in the experiment were sufficient to obtain statistically
significant differences. The statistical difference between IPU and Condition-1
indicated the validity of word-segment-based correction; the statistical differences
between Condition-2 and Condition-4 indicated the validity of undo behavior; and
the statistical difference between Condition-3 and Condition-4 indicated the validity
of forced-change. The comparison between Condition-1 and Condition-5 shows that
there was no statistical difference between the GPP-based phoneme correction and
the ML-based phoneme correction. The GPP-based phoneme correction, however,
enabled the users to make corrections according to word segments, while the ML-
based phoneme correction only allowed users to make corrections using the whole
words.
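The kind of comparison reported in Tables 2.5 and 2.6 could be reproduced with an off-the-shelf paired t-test; the data below are placeholders, not the accuracies collected in the experiment.

```python
from scipy import stats

# Per-sample accuracies for IPU and Condition-1 after a given number of
# correction utterances (placeholder values, not the thesis data).
acc_ipu = [96.0, 95.5, 97.2, 94.8]
acc_c1 = [93.1, 92.7, 94.0, 91.5]

t, p = stats.ttest_rel(acc_ipu, acc_c1)   # paired t-test
print(f"T = {t:.2f}, p = {p:.3f}")
```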
2.4 Discussion
2.4.1 Improvement of learning performance
In the experiment, IPU was evaluated with a grammar including a phoneme
network in which each phoneme has the same transition probability to all possible
phonemes in the network, in order to avoid the influence of any
specific phoneme N-gram model. The experimental results showed that IPU was
very efficient even under such a simple grammar. I consider that, in practical use,
the learning performance might be further improved by incorporating a phoneme
N -gram model into the speech recognizer. Moreover, by observing the data col-
lected in the experiments, I found that the recognition errors caused by individual
speaking characteristics were hard to correct. This problem can be solved by inte-
grating speaker adaptation technology such as MLLR (maximum-likelihood linear
regression) [36] into IPU.
2.4.2 Influence caused by visual feedback
In the experiment, IPU was evaluated under a condition where the system
responded to the participants with both speech synthesis and visual feedback in
order to help the users find the phoneme errors. This is because the purpose of
this study is to learn the correct phoneme sequences of OOV words. However,
visual feedback is not necessary in IPU. In some scenarios a system is required
to communicate with users without any visual feedback. In such cases, the words
that have been learned by the system might include some erroneous phonemes,
since it is difficult to verify phoneme errors only from synthesized speech; users
might therefore make mistakes when confirming the phoneme sequences that the
system responds with. The phoneme errors will reduce the performance
of speech recognition and speech synthesis in the subsequent system processing.
However, it is unclear how the performance deteriorations in speech recognition
32
and speech synthesis affect communications between humans and systems. I will
investigate such influences in future works.
2.4.3 Integration with OOV word detection
Although I have dealt only with pre-defined template utterances such as "It is [oov]" in this study, I have considered integrating an OOV word detection method with IPU to detect OOV words in arbitrary utterances that are not prepared in the system. For example, if the system can detect an OOV word in an utterance such as "Can you search for migurikon?", in which "migurikon" is an OOV word, it can invoke the word learning process.

A large number of methods for detecting OOV words in utterances have been proposed. There are two basic approaches: (1) OOV word models, which detect OOV words using a sub-word or generic word model [66, 6, 75], and (2) confidence estimation models, which use confidence scores (e.g., sentence- and word-level confidence scores) to find unreliable regions in word lattices (or N-best lists) and label them as OOV words [68, 59]. Moreover, an approach that combines confidence scores and OOV word models to improve OOV word detection was recently presented in [53], and contextual information has been used to improve detection accuracy in [50]. Some of the above methods might be suitable for integration with IPU.
2.5 Summary
This chapter described Interactive Phoneme Update, a method that enables
users to correct phoneme recognition errors of OOV words by speech interaction.
The original features of the method are (1) word-segment-based correction and (2)
history-based correction. The experimental results clearly showed that IPU is very efficient in learning OOV words and indicated the validity of each of these features.

In addition to phoneme sequences, the learning of accent is also important. In languages such as Japanese and Chinese, some words have the same phoneme sequences and can only be distinguished by accent. In future work, I will extend IPU to learn accents.
Chapter 3 Detecting Utterance Targets
3.1 Background
This chapter describes the robot-directed speech detection method, which detects the target of utterances for a robot. Robot-directed (RD) speech detection in previous studies has been based mainly on human behaviors. For example, Lang et al. [33] proposed a method for a robot to detect the direction of a person's attention based on face recognition, sound source localization, and leg detection. Mutlu et al. [44] conducted experiments under conditions of human-robot conversation and studied how a robot could establish the participant roles of its conversational partners using gaze cues. Yonezawa et al. [76] proposed an interface for a robot to communicate with users based on detecting their gaze direction during speech. However, this kind of method leaves the possibility that users may say something irrelevant to the robot while looking at it. Consider a situation where users A and B are talking while looking at the robot in front of them (Fig. 3.1).
A: Cool robot! What can it do?
B: It can understand your command, like “Bring me the red box.”
Note that this speech is referential, not directed to the robot. Moreover, even though user B utters something that sounds like RD speech ("Bring me the red box"), she does not really intend to give such an order, because no red box exists in the current situation. How can we build a robot that responds appropriately in this situation?
Figure 3.1: People talking while looking at a robot.
To settle this issue, the proposed method is based not only on gaze tracking but also on domain classification of the input speech into RD speech and out-of-domain (OOD) speech. Domain classification for robots in previous studies was based mainly on linguistic and prosodic features. For example, a method based on keyword spotting was proposed in [28]. With such a method, however, it is difficult to distinguish RD speech from explanations of system usage (as in the example of Fig. 3.1), which becomes a problem when both types of speech contain the same "keywords." To address this problem, a previous study [62] showed that the difference in prosodic features between RD speech and other speech usually appears at the head and tail of the speech, and proposed a method to detect RD speech using such features. That method, however, requires users to adjust their prosody to fit the system, which places an additional burden on them.
In this study, the robot executes an object manipulation task in which it manipulates objects according to a user's speech. An example of this task in a home
environment is a user telling a robot to “Put the dish in the cupboard.” Solv-
ing this task is fundamental for assistive robots.

Figure 3.2: Robot used in the object manipulation task.

In this task, I assume that a user orders the robot to execute an action that is feasible in the current situation.
Therefore, the word sequence and the object manipulation obtained by the process of understanding RD speech should be possible and meaningful in the given situation. In contrast, the word sequence and the object manipulation obtained by the process of understanding OOD speech would not be feasible. Therefore, I can distinguish between RD and OOD speech by using as a measure the feasibility of the word sequence and the object manipulation obtained from the speech understanding process. Based on this concept, I developed a multimodal semantic confidence (MSC) measure. A key feature of MSC is that it is not based on prosodic features of the input speech, as in the method described above; rather, it is based on semantic features that determine whether the speech can be interpreted as a feasible action under the current physical situation. Moreover, in an object manipulation task a robot must deal with speech and image signals and carry out a motion according to the speech. Therefore, the MSC measure is calculated by integrating information obtained from speech, object images, and robot motion.
The rest of this chapter is organized as follows. Section 3.2 gives the details of
the object manipulation task. Section 3.3 describes the proposed RD speech detection method. The experimental settings and results are presented in Section 3.4, and Section 3.5 gives a discussion. Finally, Section 3.6 concludes the chapter.

Figure 3.3: Cameras, microphone, sensor and head unit of the robot.
3.2 Object Manipulation Task
In this study, humans use a robot to perform an object manipulation task. Fig-
ure 3.2 and Fig. 3.3 show the robot used in this task. It consists of a manipulator
with 7 degrees of freedom (DOFs), a 4-DOF multi-fingered grasper, a SANKEN
CS-3e directional microphone for audio signal input, a Point Grey Research Bum-
blebee 2 stereo vision camera for video signal input, a MESA Swiss Ranger SR4000
infrared sensor for 3-dimensional distance measurement, a Logicool Qcam Pro 9000
camera for human gaze tracking, and a head unit for robot gaze expression.
In the object manipulation task, users sit in front of the robot and command
the robot by speech to manipulate objects on a table located between the robot and
the user. Figure 3.4 shows an example of this task. In this figure, the robot is told
to place Object 1 (Kermit) on Object 2 (big box) by the command speech "Place-on Kermit¹ big box,"² and the robot executes an action according to this speech. The solid line in Fig. 3.4 shows the trajectory of the moving object manipulated by the robot.

¹ Kermit is the name of the stuffed toy used in our experiment.
² Commands made in Japanese have been translated into English in this study.

Figure 3.4: Example of object manipulation tasks.
Commands used in this task are represented by a sequence of phrases, each of
which refers to a motion, an object to be manipulated (“trajector”), or a reference
object for the motion (“landmark”). In the case shown in Fig. 3.4, the phrases
for the motion, trajector, and landmark are “Place-on,” “Kermit,” and “big box,”
respectively. Moreover, fragmentary commands without a trajector phrase or a landmark phrase, such as "Place-on big box" or just "Place-on," are also acceptable.
To execute a correct action according to such a command, the robot must
understand the meaning of each word in it, which is grounded by the physical
situation. The robot must also have a belief about the context information to estimate the corresponding objects for fragmentary commands. In this study,
I used the speech understanding method proposed by [21] to interpret the input
speech as a possible action for the robot under the current physical situation.
However, for an object manipulation task in a real-world environment, there may exist OOD speech such as chatting, soliloquies, or noise. Consequently, an RD speech detection method should be used.

Figure 3.5: Flowchart of the proposed RD speech detection method.
3.3 Proposed RD Speech Detection Method
The proposed RD speech detection method is based on integrating gaze tracking
and the MSC measure. A flowchart is given in Fig. 3.5. First, a Gaussian mixture
model based voice activity detection method (GMM-based VAD) [35] is carried out
to detect speech from the continuous audio signal, and gaze tracking is performed
to estimate the gaze direction from the camera images.³

³ In this study, gaze direction was identified by the human face angle. I used faceAPI (http://www.seeingmachines.com) to extract face angles from images captured by a camera.

If the proportion of the user's gaze at the robot during her/his speech is higher than a certain threshold
η, the robot judges that the user was looking at it while speaking. The speech
during the periods when the user is not looking at the robot is rejected. Then, for
the speech detected while the user was looking at the robot, speech understanding
is performed to output the indices of a trajector object and a landmark object, a
motion trajectory, and corresponding phrases, each of which consists of recognized
words. Then, three confidence measures, i.e., for speech ($C_S$), object image ($C_O$), and motion ($C_M$), are calculated to evaluate the feasibilities of the output word sequence, the trajector and landmark, and the motion, respectively. The weighted
sum of these confidence measures with a bias is input to a logistic function. The bias and weightings $\{\theta_0, \theta_1, \theta_2, \theta_3\}$ are optimized by logistic regression [18].
Here, the MSC measure is defined as the output of the logistic function, and it
represents the probability that the speech is RD speech. If the MSC measure is
higher than a threshold δ, the robot judges that the input speech is RD speech and
executes an action according to it. In the rest of this section, I give details of the
speech understanding process and the MSC measure.
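To make the flow of Fig. 3.5 concrete, the following minimal Python sketch reproduces only its decision logic; the gaze proportion and MSC value are assumed to have been computed by the components described above, and the inputs below are toy values:

    # Sketch of the decision logic in Fig. 3.5; gaze_ratio and c_ms are
    # assumed to come from the gaze tracker and the MSC measure (Eq. 3.3).
    ETA = 0.5     # gaze-proportion threshold (value used in the on-line experiment)
    DELTA = 0.79  # MSC threshold (value optimized in the off-line experiment)

    def is_rd_speech(gaze_ratio: float, c_ms: float) -> bool:
        """Gaze gate first, then the MSC threshold, as in Fig. 3.5."""
        if gaze_ratio <= ETA:   # the user was not looking at the robot: reject
            return False
        return c_ms > DELTA     # RD speech iff the MSC measure exceeds delta

    print(is_rd_speech(gaze_ratio=0.8, c_ms=0.93))  # True: execute the action
    print(is_rd_speech(gaze_ratio=0.2, c_ms=0.93))  # False: rejected by gaze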
3.3.1 Speech Understanding
Given input speech s and a current physical situation consisting of object
information O and behavioral context q, speech understanding selects the opti-
mal action a based on a multimodal integrated user model. O is represented as
$O = \{(o_{1,f}, o_{1,p}), (o_{2,f}, o_{2,p}), \ldots, (o_{m,f}, o_{m,p})\}$, which includes the visual features $o_{i,f}$ and positions $o_{i,p}$ of all objects in the current situation, where $m$ denotes the number of objects and $i$ denotes the index of each object, given dynamically in the situation. $q$ includes information on which objects were the trajector and landmark in the previous action, and on which object the user is now holding. $a$ is defined as $a = (t, \xi)$, where $t$ and $\xi$ denote the index of the trajector and a trajectory of
motion, respectively. A user model integrating the five belief modules – (1) speech,
(2) object image, (3) motion, (4) motion-object relationship, and (5) behavioral
context – is called an integrated belief. Each belief module and the integrated
belief are learned by the interaction between a user and the robot in a real-world
environment.
Lexicon and Grammar
The robot has basic linguistic knowledge, including a lexicon L and a grammar
Gr. L consists of pairs of a word and a concept, each of which represents an object
image or a motion. The words are represented by sequences of phonemes, each of which is represented by an HMM using mel-scale cepstrum coefficients and their delta parameters (25-dimensional) as features.
represented by Gaussian functions in a multi-dimensional visual feature space (size,
color (L∗, a∗, b∗), and shape). The concepts of motions are represented by HMMs
using the sequence of three-dimensional positions and their delta parameters as
features.
The word sequence of speech $s$ is interpreted as a conceptual structure $z = [(\alpha_1, w_{\alpha_1}), (\alpha_2, w_{\alpha_2}), (\alpha_3, w_{\alpha_3})]$, where $\alpha_i$ represents the attribute of a phrase and takes a value in $\{M, T, L\}$; $w_M$, $w_T$, and $w_L$ represent the phrases describing a motion, a trajector, and a landmark, respectively. For example, the user's utterance "Place-on Kermit big box" is interpreted as follows: [($M$, Place-on), ($T$, Kermit), ($L$, big box)]. The grammar $G_r$ is a statistical language model that is represented
by a set of occurrence probabilities for the possible orders of attributes in the
conceptual structure.
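As an illustration only (this is not the system's actual data structure), the conceptual structure can be written down as a list of attribute-phrase pairs:

    # Hypothetical sketch of the conceptual structure z described above.
    from typing import List, Tuple

    ConceptualStructure = List[Tuple[str, str]]  # (attribute, phrase) pairs

    # "Place-on Kermit big box" interpreted as in the example above;
    # attributes: M = motion, T = trajector, L = landmark.
    z: ConceptualStructure = [("M", "Place-on"), ("T", "Kermit"), ("L", "big box")]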
Belief modules and Integrated Belief
Each of the five belief modules in the integrated belief is defined as follows.
Speech BS: This module is represented as the log probability of speech s
conditioned by z, under grammar Gr.
Object image BO: This module is represented as the log likelihood of wT
and wL given the trajector’s and the landmark’s visual features ot,f and ol,f .
Motion BM : This module is represented as the log likelihood of wM given
the trajector’s initial position ot,p, the landmark’s position ol,p, and trajectory ξ.
Motion-object relationship BR: This module represents the belief that in
the motion corresponding to wM , features ot,f and ol,f are typical for a trajector
and a landmark, respectively. This belief is represented by a multivariate Gaussian
distribution of the vector $[o_{t,f},\; o_{t,f} - o_{l,f},\; o_{l,f}]^T$.
Behavioral context BH : This module represents the belief that the current
speech refers to object o, given behavioral context q.
Given weighting parameter set $\Gamma = \{\gamma_1, \ldots, \gamma_5\}$, the degree of correspondence between speech $s$ and action $a$ is represented by the integrated belief function $\Psi$, written as

$$
\begin{aligned}
\Psi(s, a, O, q, \Gamma) = \max_{z,\,l} \Big(\, & \gamma_1 \log P(s \mid z)\, P(z; G_r) && [B_S] \\
+\, & \gamma_2 \big( \log P(o_{t,f} \mid w_T) + \log P(o_{l,f} \mid w_L) \big) && [B_O] \\
+\, & \gamma_3 \log P(\xi \mid o_{t,p}, o_{l,p}, w_M) && [B_M] \\
+\, & \gamma_4 \log P(o_{t,f}, o_{l,f} \mid w_M) && [B_R] \\
+\, & \gamma_5 \big( B_H(o_t, q) + B_H(o_l, q) \big) \Big), && [B_H]
\end{aligned}
\tag{3.1}
$$

where $l$ denotes the index of the landmark, and $o_t$ and $o_l$ denote the trajector and landmark, respectively. Conceptual structure $z$ and landmark $o_l$ are selected to maximize the value of $\Psi$. Then, as the meaning of speech $s$, the corresponding action $a$ is determined by maximizing $\Psi$:

$$ a = (t, \xi) = \operatorname*{argmax}_{a} \Psi(s, a, O, q, \Gamma). \tag{3.2} $$
Finally, the action $a = (t, \xi)$, the index of the selected landmark $l$, and the conceptual structure (recognized word sequence) $z$ are output by the speech understanding process.
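The following minimal sketch illustrates Eqs. (3.1) and (3.2) as a weighted sum of precomputed module scores, maximized over candidate interpretations; the candidate scores are assumed to be supplied by the belief modules, and the γ values are those reported later in Section 3.4.1:

    # Sketch of the integrated belief (Eq. 3.1) and action selection (Eq. 3.2).
    # Each candidate bundles an action with the log-scores of the five belief
    # modules, assumed to be precomputed elsewhere.
    GAMMA = (1.00, 0.75, 1.03, 0.56, 1.88)  # gamma_1 .. gamma_5 (Section 3.4.1)

    def psi(scores):
        """Weighted sum of the five module scores for one candidate (z, a, l)."""
        return sum(g * b for g, b in zip(GAMMA, scores))

    def select_action(candidates):
        """Eq. (3.2): return the action of the candidate that maximizes Psi.
        `candidates` is a list of (action, [B_S, B_O, B_M, B_R, B_H]) pairs."""
        return max(candidates, key=lambda c: psi(c[1]))[0]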
Learning the Parameters
In speech understanding, each belief module and the weighting parameters Γ in the integrated belief are learned online, in a natural way, through human-robot interaction in the environment in which the robot is used [21]. For example, a user shows an object to the robot while uttering a word describing it; this makes the robot learn the phoneme sequence of the spoken word that refers to the object, and learn, by Bayesian learning, the Gaussian parameters representing the object image concept. In addition, the user orders the robot to move an object by making an utterance and a gesture, and the robot acts in response. If the robot responds incorrectly, the user slaps the robot's hand, and the robot acts differently in response. The weighting parameters Γ are learned incrementally online with minimum classification error learning [27] through such interaction. This learning process can be conducted easily by a non-expert user. In contrast, other speech understanding methods need an expert to manually adjust their parameters, which is not practical for ordinary users. Therefore, in comparison with other methods, the speech understanding method used in this study has the advantage of adapting to different environments, depending on the user.
3.3.2 MSC Measure
Next, I describe the proposed MSC measure. The MSC measure $C_{MS}$ is calculated from the outputs of speech understanding and represents an RD speech probability. For input speech $s$ and current physical situation $(O, q)$, speech understanding is performed first, and then $C_{MS}$ is calculated by logistic regression as

$$ C_{MS}(s, O, q) = P(\mathit{domain} = \mathrm{RD} \mid s, O, q) = \frac{1}{1 + e^{-(\theta_0 + \theta_1 C_S + \theta_2 C_O + \theta_3 C_M)}}. \tag{3.3} $$
Logistic regression is a type of predictive model that can be used when the target
variable is a categorical variable with two categories, which is quite suitable for the
domain classification problem in this study. In addition, the output of the logistic
function has a value in the range from 0.0 to 1.0, which can be used directly to
represent an RD speech probability.
Finally, given a threshold δ, speech $s$ with an MSC measure higher than δ is treated as RD speech. The belief modules $B_S$, $B_O$, and $B_M$ are also used for calculating $C_S$, $C_O$, and $C_M$, each of which is described as follows.
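A minimal Python sketch of Eq. (3.3), using as default values the optimized weights reported later in Section 3.4.2; the three confidence values are assumed to come from Eqs. (3.4), (3.5), and (3.7):

    # Sketch of the MSC measure of Eq. (3.3): a logistic function of the
    # weighted confidence measures. Default weights are those of Section 3.4.2.
    import math

    def msc(c_s: float, c_o: float, c_m: float,
            theta=(5.9, 0.00011, 0.053, 0.74)) -> float:
        """Return P(domain = RD | s, O, q), a value in (0, 1)."""
        t0, t1, t2, t3 = theta
        return 1.0 / (1.0 + math.exp(-(t0 + t1 * c_s + t2 * c_o + t3 * c_m)))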
Speech Confidence Measure
Speech confidence measure $C_S$ is used to evaluate the reliability of the recognized word sequence $z$. It is calculated by dividing the likelihood of $z$ by the likelihood of the maximum-likelihood phoneme sequence under phoneme network $G_p$, and it is written as

$$ C_S(s, z) = \frac{1}{n(s)} \log \frac{P(s \mid z)}{\max_{u \in L(G_p)} P(s \mid u)}, \tag{3.4} $$
where $n(s)$ denotes the analysis frame length of the input speech, $P(s \mid z)$ denotes the likelihood of $z$ for input speech $s$ and is given by a part of $B_S$, $u$ denotes a phoneme sequence, and $L(G_p)$ denotes the set of possible phoneme sequences accepted by phoneme network $G_p$. $C_S$ takes a greater value for speech that matches the robot command grammar $G_r$ than for speech that does not.
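In implementation terms, Eq. (3.4) reduces to a frame-normalized difference of two log-likelihoods; the sketch below assumes both values are already available from the recognizer:

    # Sketch of the speech confidence measure of Eq. (3.4). The two
    # log-likelihoods are assumed to be produced by the speech recognizer:
    # that of the recognized word sequence z and that of the best
    # unconstrained phoneme sequence through the phoneme network Gp.
    def speech_confidence(loglik_z: float, loglik_best_phoneme: float,
                          n_frames: int) -> float:
        """Frame-normalized log-likelihood ratio; close to 0 for RD-like speech."""
        return (loglik_z - loglik_best_phoneme) / n_frames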
The speech confidence measure is conventionally used as a confidence measure for speech recognition [24]. The basic idea is to treat the likelihood of the most typical (maximum-likelihood) phoneme sequence for the input speech as a baseline. Based on this idea, the object and motion confidence measures are defined as follows.
Object Confidence Measure
Object confidence measure $C_O$ is used to evaluate the reliability that the output trajector $o_t$ and landmark $o_l$ are the objects referred to by $w_T$ and $w_L$. It is calculated by dividing the likelihood of visual features $o_{t,f}$ and $o_{l,f}$ by a baseline obtained from the likelihood of the most typical visual features under the object models of $w_T$ and $w_L$. In this study, the maximum probability densities of the Gaussian functions are
treated as these baselines. Then, the object confidence measure $C_O$ is written as

$$ C_O(o_{t,f}, o_{l,f}, w_T, w_L) = \log \frac{P(o_{t,f} \mid w_T)\, P(o_{l,f} \mid w_L)}{\max_{o_f} P(o_f \mid w_T)\, \max_{o_f} P(o_f \mid w_L)}, \tag{3.5} $$
where $P(o_{t,f} \mid w_T)$ and $P(o_{l,f} \mid w_L)$ denote the likelihoods of $o_{t,f}$ and $o_{l,f}$ and are given by $B_O$; $\max_{o_f} P(o_f \mid w_T)$ and $\max_{o_f} P(o_f \mid w_L)$ denote the maximum probability densities of the Gaussian functions; and $o_f$ denotes visual features in the object models.
For example, Figure 3.6(a) shows a physical situation under which a low object confidence measure was obtained for the input OOD speech "There is a red box." Here, the speech understanding process recognized the input speech as the word sequence "Raise red box." An action of the robot raising object 1 was then output (solid line), because no "red box" existed and object 1, which has the same color, was selected as the trajector. However, the visual features of object 1 were very different from those of "red box," resulting in a low value of $C_O$.

Figure 3.6: Example cases where the object and motion confidence measures are low: (a) input speech "There is a red box." recognized as [Raise red box.]; (b) input speech "Bring me that Chutotoro." recognized as [Move-away Chutotoro.]. These examples are selected from the raw data of the experimental results.
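The computation of Eq. (3.5) can be sketched as follows; the Gaussian object models and the feature vectors here are hypothetical toy values, and the baseline uses the fact that a Gaussian's density peaks at its mean:

    # Sketch of the object confidence measure of Eq. (3.5); the object
    # concept models and feature values are hypothetical toy parameters.
    import numpy as np
    from scipy.stats import multivariate_normal

    def object_confidence(o_t_f, o_l_f, model_t, model_l):
        """log of [ P(o_t,f|w_T) P(o_l,f|w_L) / (peak_T * peak_L) ]."""
        num = model_t.logpdf(o_t_f) + model_l.logpdf(o_l_f)
        # a Gaussian density is maximized at its mean, giving the baseline
        den = model_t.logpdf(model_t.mean) + model_l.logpdf(model_l.mean)
        return num - den

    # Toy 2-D visual feature concepts (hypothetical means/covariances):
    red_box = multivariate_normal(mean=[0.8, 0.2], cov=np.eye(2) * 0.01)
    big_box = multivariate_normal(mean=[0.5, 0.9], cov=np.eye(2) * 0.02)
    print(object_confidence([0.3, 0.6], [0.5, 0.9], red_box, big_box))  # low value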
Motion Confidence Measure
The motion confidence measure $C_M$ is used to evaluate the reliability that the output trajectory $\xi$ corresponds to $w_M$. It is calculated by dividing the likelihood of $\xi$ by a baseline obtained from the likelihood of the most typical
trajectory $\hat{\xi}$ for the motion model of $w_M$. In this study, $\hat{\xi}$ is written as

$$ \hat{\xi} = \operatorname*{argmax}_{\xi,\, o_p^{traj}} P(\xi \mid o_p^{traj}, o_{l,p}, w_M), \tag{3.6} $$

where $o_p^{traj}$ denotes the initial position of the trajector; $\hat{\xi}$ is obtained by treating $o_p^{traj}$ as a variable. The likelihood of $\hat{\xi}$ is the maximum output probability of the HMMs. In this study, I used the method proposed by [63] to obtain this probability. Unlike $\xi$, the trajector's initial position of $\hat{\xi}$ is unconstrained, so the likelihood of $\hat{\xi}$ is greater than or equal to that of $\xi$. Then, the motion confidence measure $C_M$ is written as

$$ C_M(\xi, w_M) = \log \frac{P(\xi \mid o_{t,p}, o_{l,p}, w_M)}{\max_{\xi,\, o_p^{traj}} P(\xi \mid o_p^{traj}, o_{l,p}, w_M)}, \tag{3.7} $$

where $P(\xi \mid o_{t,p}, o_{l,p}, w_M)$ denotes the likelihood of $\xi$ and is given by $B_M$.
For example, Figure 3.6(b) shows a physical situation under which a low motion confidence measure was obtained for the input OOD speech "Bring me that Chutotoro." Here, the speech understanding process recognized the input speech as the word sequence "Move-away Chutotoro." An action of the robot moving object 1 away from object 2 was then output (solid line). However, the typical trajectory of "move-away" is for one object to move away from another object close to it (dotted line). The trajectory of the output action was very different from this typical trajectory, resulting in a low value of $C_M$.
Figure 3.7: Some of the objects used in the experiments.
Optimization of Weights
I now consider the problem of estimating the weights $\Theta$. The $i$th training sample is given as the pair of input signal $(s^i, O^i, q^i)$ and teaching signal $d^i$. Thus, the training set $T_N$ contains $N$ samples:

$$ T_N = \{(s^i, O^i, q^i, d^i) \mid i = 1, \ldots, N\}, \tag{3.8} $$

where $d^i$ is 0 or 1, representing OOD speech or RD speech, respectively. The likelihood function is written as

$$ P(\mathbf{d} \mid \Theta) = \prod_{i=1}^{N} \big( C_{MS}(s^i, O^i, q^i) \big)^{d^i} \big( 1 - C_{MS}(s^i, O^i, q^i) \big)^{1-d^i}, \tag{3.9} $$

where $\mathbf{d} = (d^1, \ldots, d^N)$. $\Theta$ is optimized by maximum-likelihood estimation of Eq. (3.9) using Fisher's scoring algorithm [32].
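For a logistic model, Fisher's scoring coincides with iteratively reweighted least squares. A minimal sketch, assuming the training confidences are stacked in a matrix X (one row [C_S, C_O, C_M] per sample) with 0/1 labels d:

    # Minimal sketch of maximum-likelihood estimation of Theta (Eq. 3.9)
    # by Fisher's scoring (iteratively reweighted least squares).
    import numpy as np

    def fit_theta(X, d, n_iter=25):
        """Return [theta0, theta1, theta2, theta3] for the logistic model (3.3)."""
        Z = np.hstack([np.ones((len(X), 1)), X])   # prepend a bias column
        theta = np.zeros(Z.shape[1])
        for _ in range(n_iter):
            p = 1.0 / (1.0 + np.exp(-Z @ theta))   # current C_MS values
            w = p * (1.0 - p)                      # Fisher information weights
            grad = Z.T @ (d - p)                   # score vector (gradient)
            info = Z.T @ (Z * w[:, None])          # expected information matrix
            theta += np.linalg.solve(info, grad)   # scoring update
        return theta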
3.4 Experiments
3.4.1 Experimental Setting
I first evaluated the performance of MSC. This evaluation was performed by
an off-line experiment by simulation where gaze tracking is not used, and speech
is extracted manually without the GMM based VAD to avoid its detection errors.
The weighting set Θ and the threshold δ were also optimized in this experiment.
Figure 3.8: Examples for each of the ten kinds of motions used in the experiments: raise, put-down*, place-on*, place-on the right side, place-on the middle, place-on the left side, jump-over, rotate, move-closer*, and move-away*. "*" means that synonymous verbs are given in the lexicon for this motion.
Then I performed an on-line experiment with the robot to evaluate the whole
system.
The robot lexicon L used in both experiments has 50 words, including 26 nouns
and 5 adjectives representing 40 objects, and 19 verbs representing ten kinds of
motions. Figure 3.7 shows some of the objects used in the experiments. Figure 3.8
shows examples of each motion; the solid line in each example represents the motion trajectory. L also includes five Japanese postpositions. Unlike the other words in L, the postpositions are not associated with concepts; using them, users can speak commands in a more natural way. The parameter set Γ in Eq. (3.1) was $\gamma_1 = 1.00$, $\gamma_2 = 0.75$, $\gamma_3 = 1.03$, $\gamma_4 = 0.56$, and $\gamma_5 = 1.88$.
Table 3.1: Examples of the speech spoken in the experiments.

  RD speech                     OOD speech
  Move-away Grover.             Good morning.
  Place-on Kermit small box.    How about lunch?
  Rotate Chutotoro.             There is a big Barbazoo.
  Raise red Elmo.               Let's do an experiment.
The speech detection algorithm was run on a Dell Precision 690 workstation with an Intel Xeon 2.66 GHz CPU and 4 GB of memory for speech understanding and the calculation of the MSC measure. In the on-line experiment, I added a Dell Precision T7400 workstation with an Intel Xeon 3.2 GHz CPU and 4 GB of memory for image processing and gaze tracking.
3.4.2 Off-line Experiment by Simulation
Setting
The off-line experiment was conducted under both clean and noisy conditions
using a set of pairs of speech s and scene information (O, q). Figure 3.6(a) shows
an example of scene information. The yellow box on object 3 represents the behav-
ioral context q, which means object 3 was manipulated most recently. I prepared
160 different such scene files, each of which included three objects on average. I
also prepared 160 different speech samples (80 RD speech and 80 OOD speech)
and paired them with the scene files. The RD speech samples included words
that represent 40 kinds of objects and ten kinds of motions, which were learned
beforehand in lexicon L. Each RD and OOD speech sample included 2.8 and 4.1
words on average, respectively. Table 3.1 shows examples of the speech spoken
in the experiment. In addition, the correct motion phrase, trajectory, and landmark object were given for each RD speech-scene pair. I then recorded the speech samples under both clean and noisy conditions as follows.
• Clean condition: I recorded the speech in a soundproof room without noise.
A subject sat on a chair one meter from the SANKEN CS-3e directional
microphone and read out a text in Japanese.
• Noisy condition: I added dining hall noise, at a level of 50 to 52 dBA, to each speech recording made under the clean condition.
I gathered speech recordings from 16 subjects, eight male and eight female. All subjects were native Japanese speakers and were instructed to speak naturally, as if they were speaking to another human listener. As a result, 16 sets of speech-scene pairs were obtained, each of which included 320 pairs (160 for the clean and 160 for the noisy condition). These pairs were input into the system. For each pair, speech understanding was performed first, and then the MSC measure was calculated. During speech understanding, a Gaussian mixture model based noise suppression method [14] was applied, and ATRASR [46] was used for phoneme and word sequence recognition. With ATRASR, phoneme recognition accuracies of 83% and 67% were obtained under the clean and noisy conditions, respectively.
The evaluation under the clean condition was performed by leave-one-out cross-validation: 15 subjects' data were used as the training set to learn the weighting Θ in Eq. (3.3), the remaining subject's data were used as the test set, and this was repeated 16 times. The cross-validation evaluated the generalization performance across speakers. The average values of the weighting Θ learned from the training sets in cross-validation were used for the evaluation under the noisy condition, where all noisy speech-scene pairs collected from the 16 subjects were treated as the test set.
System performance was evaluated by recall and precision rates, defined as follows:

$$ \mathrm{Recall} = \frac{N_{cor}}{N_{total}}, \tag{3.10} $$

$$ \mathrm{Precision} = \frac{N_{cor}}{N_{det}}, \tag{3.11} $$

where $N_{cor}$ denotes the number of RD speech samples correctly detected, $N_{total}$ denotes the total number of RD speech samples, and $N_{det}$ denotes the total number of speech samples detected as RD speech by the MSC measure.
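Sweeping the threshold δ over the MSC scores and applying Eqs. (3.10) and (3.11) at each value yields precision-recall curves such as those in Fig. 3.9. A small sketch with toy scores and labels:

    # Sketch of Eqs. (3.10)-(3.11); `scores` and `labels` are toy placeholders
    # for the per-utterance MSC values and the RD (1) / OOD (0) ground truth.
    import numpy as np

    def precision_recall(scores, labels, delta):
        detected = scores > delta                     # detected as RD speech
        n_cor = np.sum(detected & (labels == 1))      # correctly detected RD
        recall = n_cor / np.sum(labels == 1)          # Eq. (3.10)
        precision = n_cor / max(np.sum(detected), 1)  # Eq. (3.11)
        return precision, recall

    scores = np.array([0.93, 0.40, 0.85, 0.10])
    labels = np.array([1, 0, 1, 0])
    for delta in (0.5, 0.79):
        print(delta, precision_recall(scores, labels, delta))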
Finally, for comparison, four cases were evaluated for RD speech detection, using: (1) the speech confidence measure only, (2) the speech and object confidence measures, (3) the speech and motion confidence measures, and (4) the MSC measure.
I also evaluated speech understanding using the RD speech-scene pairs. Any difference between the output motion phrase, trajectory, or landmark object and the given ones was treated as a speech understanding error.
Results
The average precision-recall curves for RD speech detection over the 16 subjects under the clean and noisy conditions are shown in Fig. 3.9. The performances of the four cases are shown as "Speech," "Speech + Object," "Speech + Motion," and "MSC." From the figures, I found that (1) MSC outperformed all the others under both clean and noisy conditions, and (2) both the object and motion confidence measures helped to improve performance. The average maximum F-measures under the clean and noisy conditions are shown in Fig. 3.10. Compared with the speech confidence measure alone, MSC achieved absolute increases of 5% and 12% under the clean and noisy conditions, respectively, indicating that MSC was particularly effective under the noisy condition. I also performed paired t-tests. Under the clean condition, there were statistically significant differences between (1) Speech and Speech + Object (p < 0.01), (2) Speech and Speech + Motion (p < 0.05), and (3) Speech and MSC (p < 0.01). Under the noisy condition, there were statistically significant differences (p < 0.01) between Speech and all other cases.
Figure 3.9: Average precision-recall curves obtained in the off-line experiment under (a) clean and (b) noisy conditions, for Speech, Speech + Object, Speech + Motion, and MSC (Speech + Object + Motion).

Examples of the raw data of the experimental results are shown in Fig. 3.6 and Fig. 3.11. The examples in Fig. 3.6 are for OOD speech and were explained in Section 3.3.2. The examples in Fig. 3.11 are for the RD speech "Place-on Elmo big box" and "Jump-over Barbazoo Totoro"; these utterances were successfully detected by the MSC measure. The processing times spent on the speech understanding process and the MSC-based domain classification were 1.09 s and 1.36 s for the examples shown in Figures 3.6(a) and 3.6(b), respectively, and 1.39 s and 1.36 s for the examples shown in Figures 3.11(a) and 3.11(b), respectively.
Figure 3.10: Average maximum F-measures obtained in the off-line experiment under (a) clean and (b) noisy conditions.
These times indicate that the proposed method can respond quickly enough for practical real-time human-robot interaction. Table 3.2 shows the means and variances of the weighted confidence measures for all RD and OOD speech obtained under the noisy condition. Notice that the variances of $C_O$ and $C_M$ are large for OOD speech, which means it is difficult to perform RD speech detection using $C_O$ or $C_M$ alone.
In the experiment, weight Θ and threshold δ were optimized under the clean
condition. The optimized Θ were $\theta_0 = 5.9$, $\theta_1 = 0.00011$, $\theta_2 = 0.053$, and $\theta_3 = 0.74$. The optimized δ was 0.79, which maximized the average F-measure. This means that speech with an MSC measure higher than 0.79 is treated as RD speech, and the robot executes an action according to it. The above Θ and δ were used in the on-line experiment.

Figure 3.11: Examples selected from the raw data of the experiment: (a) "Place-on Elmo big box"; (b) "Jump-over Barbazoo Totoro".

Table 3.2: Means and variances of the weighted confidence measures for all RD and OOD speech obtained under the noisy condition.

             |         RD          |        OOD
             |  CS     CO    CM    |  CS     CO    CM
  Means      | −0.71  −0.88 −0.30  | −3.8   −6.0  −3.3
  Variances  |  1.1    0.55  0.72  |  6.4    130    23
Finally, the accuracies of speech understanding using all RD speech and RD
speech detected with the proposed method are shown in Table 3.3, where “Total”
and “Detected” represent all RD speech and the detected RD speech, respectively,
and “Clean” and “Noisy” represent clean and noisy conditions, respectively.
Table 3.3: Accuracy of RD speech understanding.

           Total     Detected
  Clean    99.8%     100%
  Noisy    96.3%     98.9%
Figure 3.12: Example of on-line experiment.
3.4.3 On-line Experiment Using the Robot
Setting
In the on-line experiment, the whole system was evaluated by using the robot.
In each session of the experiment, two subjects, an “operator” and a “ministrant,”
sat in front of the robot at a distance of about one meter from the microphone.
The operator ordered the robot to manipulate objects in Japanese. He was also
allowed to chat freely with the ministrant. Fig. 3.12 shows an example of this
experiment. The threshold η of gaze tracking was set to 0.5, which means that if the proportion of the operator's gaze at the robot during input speech was higher than 50%, the robot judged that the speech was made while the operator was looking
at it.
I conducted a total of 4 sessions of this experiment using 4 pairs of subjects, and
each session lasted for about 50 minutes. All subjects were adult males. As with
the off-line experiment, the subjects were instructed to speak to the robot as if they were speaking to another human listener. There was constant surrounding noise of about 48 dBA from the robot's power module in all sessions. For comparison, five cases were evaluated for RD speech detection, using (1) gaze only, (2) gaze and the speech confidence measure, (3) gaze and the speech and object confidence measures, (4) gaze and the speech and motion confidence measures, and (5) gaze and the MSC measure.

Table 3.4: Numbers of speech productions in the on-line experiment.

           With gaze   Without gaze   Total
  RD          155            10        165
  OOD         553           265        818
  Total       708           275        983
Results
During the experiment, a total of 983 speech productions were made, each of which was manually labeled as either RD or OOD. Their numbers are shown in Table 3.4: "With gaze" and "Without gaze" give the numbers of speech productions made while the operator was or was not looking at the robot, and "RD"/"OOD" give the numbers of RD/OOD speech productions. Aside from the RD speech, a lot of OOD speech was also made while the subjects were looking at the robot (see "With gaze" in Table 3.4).
The accuracies of speech understanding were 97.6% and 98.1% for all RD speech and the detected RD speech, respectively. The average recall and precision rates for RD speech detection are shown in Fig. 3.13. The performances of the five cases are shown as "Gaze," "Gaze + Speech," "Gaze + Speech + Object," "Gaze + Speech + Motion," and "Gaze + MSC." Using gaze only, an average recall rate of 94% was obtained (see the "Gaze" column in Fig. 3.13(a)), which means that almost all of the RD speech was made while the operator was looking at the robot. The recall rate dropped to 90% when gaze was integrated with the speech
confidence measure; that is, some RD speech was mistakenly rejected by the speech confidence measure. However, when gaze was integrated with MSC, the recall rate returned to 94%, because the mis-rejected RD speech was correctly detected by MSC. In Fig. 3.13(b), the average precision rate using gaze only was 22%. Using MSC, however, the OOD speech made while looking at the robot was correctly rejected, resulting in a high precision rate of 96%. This means the proposed method is particularly effective in situations where users make a lot of OOD speech while looking at a robot.

Figure 3.13: Average recall and precision rates obtained in the on-line experiment: (a) recall rates; (b) precision rates.
3.5 Discussion
3.5.1 Use in a Real-World Environment

Although the proposed method was evaluated in our laboratory, I consider that it could be used in real-world environments, because the speech understanding method it relies on is adaptable to different environments. In some cases, however, physical conditions can change dynamically; for example, lighting conditions may change suddenly due to sunlight. The development of a method that works robustly under such variable conditions is future work.
3.5.2 Extended Applications
This study can be extended in many ways; I mention some of them here. I evaluated the MSC measure in situations where users usually order the robot while looking at it. In some situations, however, users may order a robot without looking at it. For example, in an object manipulation task where a robot manipulates objects together with a user, the user may give an order while looking at the object he is manipulating instead of at the robot itself. For such tasks, the MSC measure should be used separately, without integrating it with gaze. Therefore, a method that automatically decides whether to use the gaze information, according to the task and user situation, should be implemented.

Moreover, aside from the object manipulation task, the MSC measure can also be extended to multi-task dialogs that include both physically grounded and ungrounded tasks. In physically ungrounded tasks, users' utterances refer to no immediate physical objects or motions. For such dialogs, a method that automatically switches between the speech confidence and MSC measures should be implemented. In future work, I will evaluate the MSC measure on various dialog tasks.
In addition, MSC can be used to develop an advanced interface for human-robot interaction. The RD speech probability represented by MSC can be used to provide feedback such as the utterance "Did you speak to me?"; this feedback should be made in situations where the MSC measure has an ambiguous value. Moreover, the object and motion confidence measures can each be used separately. For example, if the object confidence measures for all objects in the robot's vision are particularly low, the robot should actively explore its surroundings to search for a feasible object; and in situations where the motion confidence measure is particularly low, an utterance such as "I cannot do that" should be made.

Finally, in this study, I evaluated the MSC measure obtained by integrating the speech, object, and motion confidence measures. In addition, a confidence measure obtained from the motion-object relationship could be considered. Evaluating the effect of this confidence measure is left for future work.
3.6 Summary
This chapter described an RD speech detection method that enables a robot to distinguish the speech to which it should respond in an object manipulation task, by combining speech, visual, and behavioral context information with human gaze. The remarkable feature of the method is the introduction of the MSC measure, which evaluates the feasibility, under the current physical situation, of the action the robot is about to execute according to the user's speech. The experimental results clearly showed that the method is very effective and provides an essential function for natural and safe human-robot interaction. Finally, I would emphasize that the basic idea adopted in the method is applicable to a broad range of human-robot dialog tasks.
Chapter 4 Conclusion
This study addressed two crucial problems in building a flexible speech interface between humans and machines: (1) learning the phoneme sequences of OOV words, and (2) detecting the target of utterances. It described the two methods I proposed to solve these problems. An important contribution of this study is that it is especially beneficial for robotic speech interfaces: both of the proposed methods can be implemented as sub-modules for robots.
First, I proposed IPU, an interactive learning method for obtaining the phoneme sequences of OOV words. This method was demonstrated to perform well and to be user-friendly. It enables robots to automatically extend their vocabularies through interactions with users.
Next, I proposed MSC, a multimodal method for detecting the targets of utterances. This method was demonstrated to be very effective and to adapt well to noisy conditions. It enables a robot to reject utterances that are not directed to it in an object manipulation task, which provides convenient and safe interactions between users and robots. Moreover, beyond utterance target detection, the basic idea adopted in MSC is applicable to a broad range of human-robot dialog tasks.
Furthermore, in this study, utterance target detection by MSC is based on the integration of information obtained from speech, images, and motion. Technology that integrates multimodal information in a single framework is especially important for robots, since robots are usually equipped with a variety of sensors that monitor the environment through various channels, such as audio, video, and touch. This study provides a new, realistic method for this purpose. I have demonstrated that the integration of multimodal information is valid for robot-directed speech detection in an object manipulation task. Beyond this task, I believe that integrating multimodal information is also crucial for many other tasks, such as context-based speech understanding and human behavior understanding.
Much work remains to be done to improve the flexibility of speech interfaces, and some of it is worth mentioning. One limitation of this study is that neither of the proposed methods allows users to speak freely: users must obey the pre-defined grammar. To improve flexibility, it is desirable for speech interfaces to deal with arbitrary utterances. Moreover, speech interfaces are expected to understand user intentions from user behaviors and utterances. To fulfill these expectations, techniques such as user behavior modeling, together with knowledge from psychology and cognitive science, should be utilized in building speech interfaces.

Another limitation is that both of the proposed methods were evaluated only with limited test data, under short-term human-machine/robot interactions. It would be highly interesting to know how the proposed methods perform under long-term interactions; their evaluation in long-term interactions is left for future work.
Acknowledgment
I express my sincere gratitude to my supervisor, Professor Natsuki Oka, for providing me an invaluable opportunity as a Ph.D. student in his laboratory. I thank Drs. Naoto Iwahashi, Mikio Nakano, and Kotaro Funakoshi for their guidance, encouragement, and discussions, from which my research and my student life greatly benefited. I am also very grateful to Associate Professors Masahiro Araki and Tomoyuki Ozeki for their valuable advice about my research. I also extend thanks to the members of the Interactive Intelligence Lab, Kyoto Institute of Technology, for their cooperation with my experiments. Finally, I thank my wife, Yiyan, for her understanding and love over the past few years; her support and encouragement made this thesis possible.
References
[1] iPhone 4S – Ask Siri to help you get things done. http://www.apple.com/
iphone/features/siri.html, Apple, Retrieved 2011-10-05.
[2] H. Asoh, T. Matsui, J. Fry, F. Asano, and S. Hayamizu. A spoken dialog
system for a mobile robot. In Proceedings of the fifth European Conference on
Speech Communication and Technology (Eurospeech – 1999), pp. 1139–1142,
1999.
[3] M. Attamimi, A. Mizutani, T. Nakamura, K. Sugiura, T. Nagai, N. Iwahashi,
H. Okada, and T. Omori. Learning novel objects using out-of-vocabulary word
segmentation and object extraction for home assistant robots. In Proceed-
ings of the 2010 IEEE International Conference on Robotics and Automation
(ICRA – 2010), pp. 745–750, 2010.
[4] C. Bael, L. Boves, H. Heuvel, and H. Strik. Automatic phonetic transcription
of large speech corpora. Computer Speech and Language, Vol. 21, No. 4, pp.
652–668, 2007.
[5] D. Bansal, N. Nair, R. Singh, and B. Raj. A joint decoding algorithm for
multiple-example-based addition of words to a pronunciation lexicon. In Pro-
ceedings of the 34th International Conference on Acoustics, Speech, and Signal
Processing (ICASSP – 2009), pp. 4293–4296, 2009.
[6] I. Bazzi and J. Glass. A multi-class approach for modelling out-of-vocabulary
words. In Proceedings of the third International Conference on Spoken Lan-
guage Processing (Interspeech – ICLSP – 2002), pp. 1613–1616, 2002.
[7] S. Y. Chang, L. Shastri, and S. Greenberg. Automatic phonetic transcription
of spontaneous speech (American English). In Proceedings of the 25th Inter-
national Conference on Acoustics, Speech, and Signal Processing (ICASSP –
2000), pp. 330–333, 2000.
[8] C. Chelba, J. Schalkwyk, T. Brants, V. Ha, B. Harb, W. Neveitt, C. Parada,
and P. Xu. Query language modeling for voice search. In Proceedings of
the third IEEE Workshop on Spoken Language Technology (SLT – 2010), pp.
127–132, 2010.
[9] G. Chung, S. Seneff, and C. Wang. Automatic acquisition of names using
speak and spell mode in spoken dialogue systems. In Proceedings of the North
American Chapter of the Association for Computational Linguistics – Human
Language Technologies Conference (NAACL – HLT – 2003), pp. 32–39, 2003.
[10] P. Ding, L. He, X. Yan, R. Zhao, and J. Hao. Robust mandarin speech recogni-
tion in car environments for embedded navigation system. IEEE Transactions
on Consumer Electronics, Vol. 54, No. 2, pp. 584–590, 2008.
[11] M. Eck, I. Lane, Y. Zhang, and A. Waibel. Jibbigo: speech-to-speech transla-
tion on mobile devices. In Proceedings of the third IEEE Workshop on Spoken
Language Technology (SLT – 2010), pp. 165–166, 2010.
[12] J. M. Elvira and J. C. Torrecilla. Name dialing using final user defined vocab-
ularies in mobile (GSM and TACS) and fixed telephone networks. In Pro-
ceedings of the 23rd International Conference on Acoustics, Speech and Signal
Processing (ICASSP – 1998), pp. 849–852, 1998.
[13] E. Filisko and S. Seneff. Developing city name acquisition strategies in spoken
dialogue systems via user simulation. In Proceedings of the sixth ACL/ISCA
SIGdial Workshop on Discourse and Dialogue (SIGdial – 2005), pp. 144–155,
2005.
[14] M. Fujimoto and S. Nakamura. Sequential non-stationary noise tracking using
particle filtering with switching dynamical system. In Proceedings of the
31st International Conference on Acoustics, Speech, and Signal Processing
(ICASSP – 2006), Vol. 2, pp. 769–772, 2006.
[15] A.L. Gorin, H. Hanek, R. Rose, and L. Miller. Automated call routing in
a telecommunications network. In Proceedings of the IEEE Workshop on
Interactive Voice Technology for Telecommunications Applications (IVTTA –
1994), pp. 137–140, 1994.
[16] A.L. Gorin, G. Riccardi, and J.H. Wright. How may I help you? Speech
Communication, Vol. 23, pp. 113–127, 1997.
[17] H. Holzapfel, D. Neubig, and A. Waibel. A dialogue approach to learning ob-
ject descriptions and semantic categories. Robotics and Autonomous Systems,
Vol. 56, pp. 1004–1013, 2008.
[18] D. W. Hosmer and S. Lemeshow. Applied Logistic Regression. Wiley-
Interscience, 2009.
[19] C. T. Ishi, S. Matsuda, T. Kanda, T. Jitsuhiro, H. Ishiguro, S. Nakamura,
and N. Hagita. Robust speech recognition system for communication robots
in real environments. In Proceedings of the sixth IEEE-RAS International
Conference on Humanoid Robots, pp. 340–345, 2006.
[20] N. Iwahashi. A method for the coupling of belief systems through human-robot
language interaction. In Proceedings of the 2003 IEEE International Work-
shop on Robot and Human Interactive Communication, pp. 385–390, 2003.
[21] N. Iwahashi. Robots that learn language: A developmental approach to sit-
uated human-robot conversations. Human-Robot Interaction, pp. 95–118,
2007.
[22] N. Iwahashi. Interactive learning of spoken words and their meanings through
an audio-visual interface. IEICE Transactions on Information and Systems,
Vol. E91-D, No. 2, pp. 312–321, 2008.
[23] A. Janicki and D. Wawer. Automatic speech recognition for polish in a com-
puter game interface. In Proceedings of the Federated Conference on Com-
puter Science and Information Systems (FedCSIS – 2011), pp. 711–716, 2011.
[24] H. Jiang. Confidence measures for speech recognition: A survey. Speech
Communication, Vol. 45, pp. 455–470, 2005.
[25] D. Jurafsky, W. Ward, J. P. Zhang, K. Herold, X. Y. Yu, and S. Zhang.
What kind of pronunciation variation is hard for triphones to model? In
Proceedings of the 26th International Conference on Acoustics, Speech, and
Signal processing (ICASSP – 2001), pp. 577–580, 2001.
[26] T. Kagoshima. ToSpeak: high-quality text-to-speech system. Toshiba review
(Japanese Edition), Vol. 62, No. 12, pp. 34–37, 2007.
[27] S. Katagiri, B. H. Juang, and C. H. Lee. Pattern recognition using a family
of design algorithms based upon the generalized probabilistic descent method.
In Proceedings of the IEEE, Vol. 86, pp. 2345–2373, 1998.
[28] T. Kawahara, K. Ishizuka, S. Doshita, and C. H. Lee. Speaking-style de-
pendent lexicalized filler model for key-phrase detection and verification. In
Proceedings of the IEEE International Conference on Spoken Language Pro-
cessing, pp. 3253–3259, 1998.
[29] D. Knowles and Z. Ghahramani. Infinite sparse factor analysis and infinite
independent components analysis. In Proceedings of the 7th International
Conference on Independent Component Analysis and Signal Separation, pp.
381–388, 2007.
[30] D. B. Koons, C. J. Sparrell, and K. R. Thorisson. Integrating simultaneous
input from speech, gaze, and hand gestures. Intelligent Multimedia Interfaces
American Association for Artificial Intelligence, pp. 257–276, 1993.
[31] A. Kurematsu, K. Takeda, Y. Sagisaka, S. Katagiri, H. Kuwabara, and
K. Shikano. ATR Japanese speech database as a tool of speech recognition
and synthesis. Speech Communication, Vol. 9, No. 4, pp. 357–363, 1990.
[32] T. Kurita. Iterative weighted least squares algorithms for neural networks
classifiers. In Proc. Workshop on Algorithmic Learning Theory, 1992.
[33] S. Lang, M. Kleinehagenbrock, S. Hohenner, J. Fritsch, G. A. Fink, and
G. Sagerer. Providing the basis for human-robot-interaction: A multi-modal
attention system for a mobile robot. In Proceedings of the ACM International
Conference on Multimodal Interfaces, pp. 28–35, 2003.
[34] A. Lee, K. Kawahara, and K. Shikano. A new phonetic tied-mixture model
for efficient decoding. In Proceedings of the 26th International Conference on
Acoustics, Speech, and Signal Processing (ICASSP – 2001), pp. 1269–1272,
2001.
[35] A. Lee, K. Nakamura, R. Nishimura, H. Saruwatari, and K. Shikano. Noise
robust real world spoken dialogue system using GMM based rejection of unin-
tended inputs. In Proceedings of the fifth International Conference on Spoken
Language Processing (Interspeech – ICLSP – 2004), pp. 173–176, 2004.
[36] C. J. Leggetter and P. C. Woodland. Maximum likelihood linear regression for
speaker adaptation of continuous density hidden Markov models. Computer
Speech and Language, Vol. 9, pp. 171–185, 1995.
[37] C. Leitner, M. Schickbichler, and S. Petrik. Example-based automatic pho-
netic transcription. In Proceedings of the seventh Conference on International
Language Resources and Evaluation (LREC – 2010), pp. 3278–3284, 2010.
[38] E. Levin, R. Pieraccini, and W. Eckert. Using Markov decision process for
learning dialogue strategies. In Proceedings of the 23rd International Con-
ference on Acoustics, Speech, and Signal Processing (ICASSP – 1998), pp.
201–204, 1998.
[39] Y. Liu and P. Fung. Modeling partial pronunciation variations for spontaneous
mandarin speech recognition. Computer Speech and Language, Vol. 17, No. 4,
pp. 357–379, 2003.
[40] Y. Liu, F. Zheng, L. He, and Y. Q. Xia. State-dependent mixture tying with
variable codebook size for accented speech recognition. In Proceedings of the
IEEE Automatic Speech Recognition and Understanding Workshop (ASRU –
2007), pp. 300–305, 2007.
[41] X. Luo and F. Jelinek. Probabilistic classification of HMM states for large
vocabulary continuous speech recognition. In Proceedings of the 24th Inter-
national Conference on Acoustics, Speech, and Signal Processing (ICASSP –
1999), pp. 353–356, 1999.
[42] T. Misu, K. Sugiura, T. Kawahara, K. Ohtake, C. Hori, H. Kashioka,
H. Kawai, and S. Nakamura. Modeling spoken decision support dialogue and
optimization of its dialogue strategy. ACM Transactions on Speech and Lan-
guage Processing, Vol. 7, No. 3, pp. 1–18, 2011.
[43] A. Mohamed, G. Dahl, and G. Hinton. Deep belief networks for phone recogni-
tion. In Proceedings of the 22nd Neural Information Processing Systems Con-
ference Workshop on Deep Learning for Speech Recognition (NIPS – 2009),
Vol. 22, pp. 1–9, 2009.
[44] B. Mutlu, T. Shiwa, T. Kanda, H. Ishiguro, and N. Hagita. Footing in human-
robot conversations: how robots might shape participant roles using gaze cues.
In Proceedings of the ACM/IEEE International Conference on Human-Robot
Interaction, pp. 61–68, 2009.
[45] S. Nakagawa. Spontaneous speech recognition: its challenge and limit. In
Proceedings of the IEICE General Conference (Japanese Edition), Vol. 1, pp.
13–14, 2006.
[46] S. Nakamura, K. Markov, H. Nakaiwa, G. Kikui, H. Kawai, T. Jitsuhiro,
J. Zhang, H. Yamamoto, E. Sumita, and S. Yamamoto. The ATR multilingual
speech-to-speech translation system. IEEE transactions on Audio, Speech,
and Language Processing, Vol. 14, No. 2, pp. 365–376, 2006.
[47] M. Nakano, N. Iwahashi, T. Nakai, T. Sumii, X. Zuo, R. Taguchi, T. Nose,
A. Mizutani, T. Nakamura, M. Attamimi, H. Narimatsu, K. Funakoshi, and
Y. Hasegawa. Grounding new words on the physical world in multi-domain
human-robot dialogues. In Proceedings of the National Conference on Ar-
tificial Intelligence (AAAI) Fall Symposium Series: Dialog with Robots, pp.
74–79, 2010.
[48] H. Nanjo, H. Mikami, S. Kunimatsu, H. Kawano, and T. Nishiura. A funda-
mental study of novel speech interface for computer games. In Proceedings of
the 13th IEEE International Symposium on Consumer Electronics, pp. 558–
560, 2009.
[49] K. Ohtake, T. Misu, C. Hori, H. Kashioka, and S. Nakamura. Dialogue acts
annotation for NICT Kyoto tour dialogue corpus to construct statistical di-
alogue systems. In Proceedings of the seventh International Conference on
Language Resources and Evaluation (LREC – 2010), pp. 2123–2130, 2010.
[50] C. Parada, M. Dredze, D. Filimonov, and F. Jelinek. Contextual information
improves OOV detection in speech. In Proceedings of the North American
Chapter of the Association for Computational Linguistics – Human Language
Technologies Conference (NAACL – HLT – 2010), pp. 216–224, 2010.
[51] D. Peters and P. Stubley. Dialog methods for improved alphanumeric string
capture. In Proceedings of the 12th International Conference on Spoken Lan-
guage Processing (Interspeech – ICLSP – 2011), pp. 1017–1020, 2011.
[52] C. S. Ramalingam, Y. Gong, L. P. Netsch, W. W. Anderson, J. J. Godfrey, and
Y. H. Kao. Speaker-dependent name dialing in a car environment with out-
of-vocabulary rejection. In Proceedings of the 24th International Conference
on Acoustics, Speech, and Signal Processing (ICASSP – 1999), pp. 165–168,
1999.
[53] A. Rastrow, A. Sethy, and B. Ramabhadran. A new method for OOV detec-
tion using hybrid word/fragment system. In Proceedings of the 34th Inter-
national Conference on Acoustics, Speech and Signal Processing (ICASSP –
2009), pp. 3953–3956, 2009.
[54] D. Roy and N. Mukherjee. Towards situated speech understanding: Visual
context priming of language models. Computer Speech and Language, pp.
227–248, 2005.
[55] H. Sakoe. Two-level DP-matching – a dynamic programming based pattern
matching algorithm for continuous speech recognition. IEEE transactions on
Acoustic, Speech, and Signal Processing, Vol. ASSP-27, No. 6, pp. 588–595,
1979.
[56] S. Seneff. Response planning and generation in the mercury flight reservation
system. Computer Speech and Language, Vol. 16, pp. 283–312, 2002.
[57] M. Shami and W. Verhelst. An evaluation of the robustness of existing super-
vised machine learning approaches to the classification of emotions in speech.
Speech Communication, Vol. 49, No. 3, pp. 201–212, 2007.
[58] F. K. Soong, W. K. Lo, and S. Nakamura. Generalized word posterior proba-
bility (GWPP) for measuring reliability of recognized words. In Proceedings of
the Special Workshop in Maui (SWIM – 2004), 2004.
71
[59] H. Sun, G. L. Zhang, F. Zheng, and M. X. Xu. Using word confidence mea-
sure for OOV words detection in a spontaneous spoken dialog system. In
Proceedings of the eighth European Conference on Speech Communication and
Technology (Eurospeech – 2003), pp. 2713–2716, 2003.
[60] T. Svendsen, F. K. Soong, and H. Purnhagen. Optimizing baseforms for
HMM-based speech recognition. In Proceedings of the second European Con-
ference on Speech Communication and Technology (Eurospeech – 1995), pp.
783–787, 1995.
[61] W. Swartout, D. Traum, R. Artstein, D. Noren, P. Debevec, K. Bronnenkant,
J. Williams, A. Leuski, S. Narayanan, D. Piepol, C. Lane, J. Morie, P. Ag-
garwal, M. Liewer, J. Y. Chiang, J. Gerten, S. Chu, and K. White. Virtual
museum guides demonstration. In Proceedings of the third IEEE Workshop
on Spoken Language Technology (SLT – 2010), pp. 163–164, 2010.
[62] T. Takiguchi, A. Sako, T. Yamagata, and Y. Ariki. System request utter-
ance detection based on acoustic and linguistic features. Speech Recognition,
Technologies and Applications, pp. 539–550, 2008.
[63] K. Tokuda, T. Kobayashi, and S. Imai. Speech parameter generation from
HMM using dynamic features. In Proceedings of the 20th International Con-
ference on Acoustics, Speech, and Signal Processing (ICASSP – 1995), pp.
660–663, 1995.
[64] K. P. Truong and D. A. V. Leeuwen. Automatic discrimination between laugh-
ter and speech. Speech Communication, Vol. 49, No. 2, pp. 144–158, 2007.
[65] R. H. Umbach, P. Beyerlein, and E. Thelen. Automatic transcription of un-
known words in a speech recognition system. In Proceedings of the 20th In-
ternational Conference on Acoustics, Speech and Signal Processing (ICASSP
– 1995), pp. 840–843, 1995.
72
[66] A. Waibel, H. Soltau, T. Schultz, T. Schaaf, and F. Metze. Verbmobil: Foun-
dations of Speech-to-Speech Translation, chapter Multilingual Speech Recog-
nition, pp. 33–45. Springer, 2000.
[67] H. Wakaki, H. Fujii, M. Suzuki, M. Fukui, and K. Sumita. Abbreviation gen-
eration for Japanese multi-word expressions. In Proceedings of the Workshop
on Multiword Expressions: Identification, Interpretation, Disambiguation and
Applications, pp. 63–70, 2009.
[68] F. Wessel, R. Schluter, K. Macherey, and H. Ney. Confidence measures for
large vocabulary continuous speech recognition. IEEE transactions on Speech
and Audio Processing, Vol. 9, No. 3, pp. 288–298, 2001.
[69] K. Wittenburg, T. Lanning, D. Schwenke, H. Shubin, and A. Vetro. The
prospects for unrestricted speech input for TV content search. In Proceedings
of working conference on Advanced Visual Interfaces, pp. 352–359, 2006.
[70] M. Worsley and M. Johnston. Multimodal interactive spaces: MagicTV and
magicMAP. In Proceedings of the third IEEE Workshop on Spoken Language
Technology (SLT – 2010), pp. 161–162, 2010.
[71] H. Wu and H. F. Wang. Revisiting pivot language approach for machine
translation. In Proceedings of Joint Conference of the 47th Annual Meeting
of the Association for Computational Linguistics and the 4th International
Joint Conference on Natural Language Processing of the Asian Federation of
Natural Language Processing (ACL – IJCNLP – 2009), pp. 783–787, 2009.
[72] J. X. Wu and V. Gupta. Application of simultaneous decoding algorithms
to automatic transcription of known and unknown words. In Proceedings of
the 24th International Conference on Acoustics, Speech, and Signal Processing
(ICASSP – 1999), Vol. 2, pp. 589–592, 1999.
73
[73] X. Yan, L. He, P. Ding, R. Zhao, and J. Hao. Multi-accented mandarin
database construction and benchmark evaluations. In Proceedings of the 5th
International Symposium on Chinese Spoken Language Processing (ISCSLP –
2006), Vol. 2, pp. 715–723, 2006.
[74] A. Yates, O. Etzioni, and D. Weld. A reliable natural language interface to
household appliances. In Proceedings of the eighth International Conference
on Intelligent User Interfaces, pp. 189–196, 2003.
[75] A. Yazgan and M. Saraclar. Hybrid language models for out of vocabulary
word detection in large vocabulary conversational speech recognition. In Pro-
ceedings of the 29th International Conference on Acoustics, Speech, and Signal
Processing (ICASSP – 2004), pp. 745–748, 2004.
[76] T. Yonezawa, H. Yamazoe, A. Utsumi, and S. Abe. Evaluating crossmodal
awareness of daily-partner robot to user’s behaviors with gaze and utterance
detection. In Proceedings of the ACM International Workshop on Context-
Awareness for Self-Managing Systems, pp. 1–8, 2009.
[77] S.J. Young, M.G. Brown, J.T. Foote, G.J.F. Jones, and K. S. Jones. Acous-
tic indexing for multimedia retrieval and browsing. In Proceedings of the
22nd International Conference on Acoustics, Speech, and Signal Processing
(ICASSP – 1997), pp. 199–202, 1997.
[78] J. Zhang, J. Zhao, S. Bai, and Z. Huang. Applying speech interface to Mahjong
game. In Proceedings of the 10th International Conference on Multimedia
Modelling, pp. 86–92, 2004.
[79] R. Q. Zhang, H. Yamamoto, M. Paul, H. Okuma, K. Yasuda, Y. Lepage, E. De-
noual, D. Mochihashi, A. Finch, and E. Sumita. The NICT-ATR statistical
machine translation system for the IWSLT 2006 evaluation. In Proceedings
of the International Workshop on Spoken Language Translation, pp. 83–90,
2006.
74
Appendix A Word forms in Japanese
Table A.1: Examples of Japanese word forms.
Kanji sequence   Kana sequence   Phoneme sequence   IPA
仕事 (work)       しごと            sh i g o t o       /ʃigoto/
鮪 (tuna)         まぐろ            m a g u r o        /magɯro/
Japanese words can be written both as kana sequences and as kanji sequences.
Kana are Japanese phonograms, and a kana sequence can be uniquely derived from
a phoneme sequence using a mapping table. Kanji are Japanese logographs; converting
a phoneme sequence into a kanji sequence is not straightforward because of
homophones. Some examples of Japanese word forms are shown in Table A.1.
The name of each Japanese phonogram (except for the prolonged sound, the double
stop, and the nasal) is almost identical to its pronunciation. This differs from
languages written in the Latin alphabet, such as English. For example, the English
letter 'T,' whose name is /tiː/, can stand for other pronunciations such as /t/,
whereas the Japanese phonogram (kana) 'ま,' whose name is /ma/, is always pronounced
as its name. Consequently, the pronunciation of a word and that of its spelled-out
form are almost the same: spelling out 鮪 kana by kana gives ま /ma/, ぐ /gu/,
ろ /ro/, which together reproduce the pronunciation of the word itself.
Appendix B The International Phonetic Alphabet (IPA) of the Japanese syllabary
Table B.1: The International Phonetic Alphabet (IPA) of the Japanese syllabary. 'ng' and
'q' respectively represent the nasal and the double stop (short pause) in Japanese.
[Table B.1 body: the IPA transcription of each syllable of the Japanese syllabary, arranged with one column per consonant (-, p, b, d, z, zh, g, w, r, j, m, h, n, t, ch, ts, s, sh, k, py, by, zy, gy, ry, my, hy, ny, ty, sy, ky, ng, q) and one row per vowel (a, i, u, e, o).]
Appendix C Recursive equation of open-begin-end dynamic programming matching
OBE-DPM differs from ordinary dynamic programming matching in that the start
point and end point in both sequences are unconstrained, which enables partial
alignments between a whole word and a word segment. Assume we have two phoneme
sequences $x = (p_x^1, p_x^2, \ldots, p_x^I)$ and $y = (p_y^1, p_y^2, \ldots, p_y^J)$,
where $p_x^i$ and $p_y^j$ respectively denote the $i$-th and $j$-th phonemes of $x$
and $y$, and $I$ and $J$ denote the lengths of the two sequences. In OBE-DPM, a
start point $(p_x^0, p_y^0)$ and an end point $(p_x^{I+1}, p_y^{J+1})$ are added to
$x$ and $y$, and a trellis with $(I+2)$ columns and $(J+2)$ rows (an example of a
trellis is shown in Fig. 2.4 (a)) is built according to the following recursive
equation:
\[
D_{i,j} =
\begin{cases}
0 & (i = 0,\ j = 0) \\
D_{i-1,j} + \lambda & (1 \le i \le I+1,\ j = 0) \\
D_{i,j-1} + \lambda & (i = 0,\ 1 \le j \le J+1) \\
\min
\begin{cases}
D_{i-2,j-1} + s(p_x^i, p_y^j) + s(p_x^{i-1}, \phi) \\
D_{i-1,j-1} + s(p_x^i, p_y^j) \\
D_{i-1,j-2} + s(p_x^i, p_y^j) + s(\phi, p_y^{j-1})
\end{cases}
& (1 \le i \le I+1,\ 1 \le j \le J+1)
\end{cases}
\tag{C.1}
\]
where $D_{i,j}$ denotes the matching score, $s(p_x^i, p_y^j)$ denotes the phoneme
distance between $p_x^i$ and $p_y^j$, and $s(p_x^{i-1}, \phi)$ and $s(\phi, p_y^{j-1})$
respectively denote the insertion and deletion penalties. $s(p_x^i, p_y^j)$,
$s(p_x^{i-1}, \phi)$, and $s(\phi, p_y^{j-1})$ are calculated from the phoneme
confusion matrix. $\lambda$ is a constant, which was set to 1.5 in the experiments.
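For concreteness, the following is a minimal Python sketch of the recursion in Eq. (C.1). It assumes placeholder values for the confusion-matrix-based distances (0 for identical phonemes, 1 for substitutions, insertions, and deletions) and assumes that the added end point aligns with any phoneme at no cost; the actual values in this work are derived from the phoneme confusion matrix.

```python
import numpy as np

LAMBDA = 1.5      # the constant lambda, set to 1.5 in the experiments
GAP = None        # the empty symbol phi used for insertion/deletion penalties
END = "<end>"     # the added end point p_x^{I+1} / p_y^{J+1} (assumption: zero cost)

def s(p, q):
    """Placeholder phoneme distance.

    In this work s(.,.) is computed from a phoneme confusion matrix; as a
    stand-in, identical phonemes cost 0, and substitutions, insertions, and
    deletions (one argument is GAP) cost 1.
    """
    if p is END or q is END:
        return 0.0
    if p is GAP or q is GAP:
        return 1.0
    return 0.0 if p == q else 1.0

def obe_dpm(x, y, lam=LAMBDA):
    """Fill the OBE-DPM trellis D of Eq. (C.1) for phoneme sequences x and y."""
    I, J = len(x), len(y)
    xs = list(x) + [END]          # p_x^1, ..., p_x^{I+1}
    ys = list(y) + [END]          # p_y^1, ..., p_y^{J+1}
    D = np.full((I + 2, J + 2), np.inf)
    D[0, 0] = 0.0
    for i in range(1, I + 2):     # column j = 0: boundary penalty only
        D[i, 0] = D[i - 1, 0] + lam
    for j in range(1, J + 2):     # row i = 0: boundary penalty only
        D[0, j] = D[0, j - 1] + lam
    for i in range(1, I + 2):
        for j in range(1, J + 2):
            xi, yj = xs[i - 1], ys[j - 1]
            cand = [D[i - 1, j - 1] + s(xi, yj)]
            if i >= 2:            # skip p_x^{i-1}: insertion penalty
                cand.append(D[i - 2, j - 1] + s(xi, yj) + s(xs[i - 2], GAP))
            if j >= 2:            # skip p_y^{j-1}: deletion penalty
                cand.append(D[i - 1, j - 2] + s(xi, yj) + s(GAP, ys[j - 2]))
            D[i, j] = min(cand)
    return D

# Partial alignment of a word segment against a whole word:
D = obe_dpm(list("maguro"), list("gur"))
print(D[-1, -1])                  # overall matching score D_{I+1, J+1}
```

With an actual confusion matrix, s(·,·) would simply be looked up from it; the recursion itself is unchanged.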