MULTIMODAL INTEGRATION FOR ROBOT SYSTEMS USING DEEP LEARNING
ディープラーニングによるロボットシステムのためのマルチモーダル統合
July 2015
Kuniaki NODA
野田 邦昭
Waseda University Doctoral Dissertation
Waseda University
Graduate School of Fundamental Science and Engineering
Department of Intermedia Art and Science,
Research on Intelligence Dynamics and Representation Systems
Abstract
Intelligent machines such as smartphones, auto-driving cars, and domestic robots
are expected to become increasingly common in everyday life. Consequently, strong
demands for a noise-robust human-machine interface that enables stress-free in-
teraction as well as intelligent technologies that enable stable environmental recog-
nition and adaptive behavior generation for autonomous robots may arise in the
near future. To realize these functions, we need to address two fundamental re-
quirements: (1) robust recognition of poorly reproducible real-world information
and (2) adaptive behavior selection of robots depending on dynamic environmen-
tal changes. The main aim of this study is to address these requirements through a
machine learning approach that implements multimodal integration learning.
Humans succeed in recognizing an environment and mastering many tasks by
combining inputs from multiple modalities, including vision, audition, and somatic
sensation. On the other hand, the sensory inputs to most robotic applications
are commonly preprocessed through dedicated feature extraction mechanisms and
sensory-motor information processing algorithms based on perceptual and action
generation objectives. In essence, mutual intersensory processes are rarely taken
into consideration for realizing environmental recognition and behavior generation.
With regard to sensory feature extraction and multimodal integration learning mech-
anisms, deep learning approaches have recently attracted considerable attention.
One of the main advantages of applying deep neural networks (DNNs) is that they
self-organize highly generalized sensory features from large-scale raw data. The
same approach has also been applied for obtaining fused representations over mul-
tiple modalities, resulting in significant improvements in speech recognition perfor-
mance. However, DNNs have never been applied to multimodal integration learning
of dynamic information such as robot behaviors.
This study aims to address the two fundamental requirements presented above
through the following three approaches: (1) utilization of highly generalized sensory
features, (2) fusional utilization of multimodal information, and (3) memory predic-
tion and association among multiple modalities. In practice, highly generalized sen-
sory features and their integrated features acquired by integration learning of mul-
timodal information enable noise-robust recognition. In addition, a cross-modal
memory retrieval function based on an acquired intersensory synchrony model
enables adaptive behavior selection of robots depending on dynamic environmental
changes.
Our proposed multimodal integration learning framework is evaluated through
the following three experiments: (1) noise robust speech recognition based on audio-
visual integration learning, (2) robust environment recognition and adaptive behav-
ior generation based on visual-motor integration learning of robot behaviors, and
(3) analysis on a multimodal synchrony model acquired from integration learning of
robot behaviors.
In the first evaluation experiment, the audio-visual speech recognition (AVSR)
approach is adopted for realizing noise robust speech recognition. Specifically, sen-
sory features acquired from audio signals and the corresponding mouth area images
are integrated. In practice, two kinds of DNNs, denoising deep autoencoder (DDA)
and convolutional neural network (CNN), are utilized for the feature extraction of au-
dio and visual information, respectively. Moreover, the multi-stream hidden Markov
model (MSHMM) is applied for integrating the two sensory features acquired from
audio signal and mouth area images. We approach noise robust speech recognition
from the following two directions: one involves utilizing DDA for the noise reduc-
tion of audio features, and the other involves utilizing multimodal information in a
complementary style.
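The multi-stream integration described above admits a minimal sketch (illustrative only; the function name and the numeric values below are assumptions, not taken from this dissertation): in an MSHMM, the state output log-probability is a weighted sum of the per-stream log-likelihoods, with stream weights summing to one.

```python
def mshmm_log_likelihood(log_b_audio, log_b_visual, lambda_audio):
    """Combine per-stream log-likelihoods with stream weights.

    In a multi-stream HMM the state output log-probability is
        log b(o) = lambda_a * log b_a(o_a) + lambda_v * log b_v(o_v),
    with lambda_a + lambda_v = 1. Names and values here are
    illustrative, not the dissertation's implementation.
    """
    lambda_visual = 1.0 - lambda_audio
    return lambda_audio * log_b_audio + lambda_visual * log_b_visual

# With clean audio, the audio stream is weighted heavily; under noise,
# the weight shifts toward the visual stream so it can compensate.
clean = mshmm_log_likelihood(-2.0, -5.0, lambda_audio=0.9)   # -2.3
noisy = mshmm_log_likelihood(-20.0, -5.0, lambda_audio=0.2)  # -8.0
```

Choosing the stream weights adaptively, depending on the estimated noise level, is discussed later in the thesis (Section 3.5.3).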
In the second evaluation experiment, a sensory-motor multimodal integration
framework utilizing DNN is proposed for realizing adaptive generation of robot
behaviors depending on dynamic environmental changes. Specifically, synchrony
models between visual and motor modalities are structured in a self-organizing man-
ner by training a DNN with temporal sequences consisting of camera images and
joint angles acquired from six kinds of object manipulation behaviors utilizing a hu-
manoid robot. The acquired model is applied for cross-modal memory retrieval re-
flecting the synchrony model between visual and motor modalities.
In the third experiment, a synchrony model between the three modalities, vision,
audio, and motion, is acquired by conducting a bell-ringing task using a humanoid
robot. The acquired synchrony model is utilized for retrieving image sequences from
audio and motion sequence inputs. To confirm that the correct synchrony is mod-
eled and the corresponding memory retrieval is attained, quantitative evaluation on
the generated images is conducted. Moreover, correspondences among the struc-
ture of the acquired multimodal feature space, the environmental setting, and the
physical motion are analyzed by visualizing the activation patterns acquired from
the central middle layer of the DNN.
This dissertation is organized into seven chapters. Chapter 1 provides the back-
ground, the research objective, and our approaches as an introduction of the current
study.
In Chapter 2, recent research trends on multimodal integration learning are in-
troduced. First, findings from cognitive psychology are summarized. Second, studies
on AVSR and sensory-motor integration learning of robots are summarized to survey
the practical applications of multimodal integration learning. Third, recent research
trends in deep learning studies are summarized. Finally, the positioning of our pro-
posed model with regard to the recent studies is presented.
In Chapter 3, experiments on AVSR utilizing our proposed learning framework
are conducted for evaluating how sensory features acquired by deep learning and
multimodal integration contribute to robust speech recognition. In practice, a
connectionist-HMM system for AVSR is proposed. In the isolated word speech
recognition evaluation, the audio feature acquired by DDA outperformed a
conventional audio feature under noisy sound settings. Moreover, the visual feature
acquired by CNN outperformed the visual features acquired by conventional dimen-
sionality compression algorithms such as principal component analysis. Finally, we
verified that AVSR utilizing MSHMM can exhibit robust speech recognition even un-
der noisy sound settings.
In Chapter 4, a multimodal integration framework based on a deep learning
algorithm for sensory-motor integration learning of robot behaviors is proposed.
The framework first compresses the sensory inputs acquired from multiple modal-
ities utilizing a deep autoencoder. In combination with a variant of a time-delayed
neural network, a novel deep learning framework that integrates sensory-motor se-
quences and self-organizes higher-level multimodal features is introduced. Further,
we showed that our proposed multimodal integration framework can reconstruct full
temporal sequences from input sequences with partial dimensionality.
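The time-delay mechanism above can be sketched as follows (an illustrative reconstruction with hypothetical names and sizes, not the dissertation's code): consecutive frames of a multimodal feature sequence are stacked into fixed-length windows, so that temporal context enters a feedforward autoencoder through the input dimensionality.

```python
import numpy as np

def time_delay_windows(features, window):
    """Stack `window` consecutive feature frames into single vectors.

    `features` has shape (T, D); the result has shape
    (T - window + 1, window * D). Names and sizes are illustrative.
    """
    T = len(features)
    return np.stack([features[t:t + window].ravel()
                     for t in range(T - window + 1)])

# A 10-step sequence of 4-D multimodal features (e.g., compressed
# vision concatenated with joint angles), windowed over 3 frames.
seq = np.arange(40, dtype=float).reshape(10, 4)
windows = time_delay_windows(seq, window=3)
print(windows.shape)  # (8, 12)
```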
In Chapter 5, our proposed sensory-motor integration framework is applied for
learning and generating object manipulation behaviors of a humanoid robot. In
practice, the framework is trained with six different object manipulation behaviors
generated by direct teaching. Results demonstrate that our proposed method can re-
trieve temporal sequences over visual and motion modalities and predict future se-
quences from the past. Moreover, the memory retrieval function enabled the robot to
adaptively switch corresponding behaviors depending on the displayed objects. Fur-
ther, behavior-dependent unified representations that fuse sensory-motor modal-
ities together are extracted from the temporal sequence feature space. The result
of our behavior recognition experiment demonstrated that the multimodal features
significantly improve the robustness and reliability of the behavior recognition per-
formance.
In Chapter 6, a quantitative evaluation experiment on our proposed sensory-
motor integration framework is conducted to analyze the acquired synchrony model.
In practice, a bell-ringing task performed by the same robot is designed and the
framework is trained utilizing sensory-motor sequences consisting of the three
modalities, vision, audio, and motion. To this end, a model representing the cross-
modal synchrony is self-organized in the abstracted feature space of our proposed
framework. Results demonstrated that the cross-modal memory retrieval function
of our proposed model succeeds in predicting visual sequences in correlation with
the sound and joint angles of bell-ringing behaviors. Further, analyzing the image
retrieval performance, we found that our proposed method correctly models the syn-
chrony among the multimodal information.
In Chapter 7, the accomplishments of our study on multimodal integration learn-
ing are summarized. Finally, reviews on the remaining research topics and future
directions conclude this dissertation.
Acknowledgments
This work was carried out at the Graduate School of Fundamental Science and En-
gineering at Waseda University in 2012–2015. I thank the institute for providing me
with excellent research facilities. Here, I would like to express my sincere thanks and
appreciation to those who were involved in my study and life in the past three years.
Firstly, I would like to gratefully and sincerely thank my principal supervisor Prof.
Tetsuya Ogata for his significantly important comments and suggestions. I have
always been deeply impressed by his expert supervision, brilliant ideas, valuable
advice, and extensive knowledge. His bright guidance and warm leadership fostered
a vibrant and positive research atmosphere that gave me splendid and abundant
experiences in this laboratory.
I also express my deep and sincere appreciation to Prof. Kazuhiro Nakadai, for
his constructive guidance and inspiring suggestions. I am grateful to him for giving
me the opportunity to pursue my interest in Robot Audition under his supervision
and continuous care. Without his generous support, this work would not have been
possible.
I owe my deep gratitude to all the coauthors of my manuscripts, Prof. Hiroshi G.
Okuno, Dr. Hiroaki Arie, and Dr. Yuki Suga for their genuine interest, rapid response
and skillful comments that greatly contributed to my manuscripts.
Many thanks to Prof. Yasuhiro Oikawa and Takashi Kawai, who gave me a lot of
advice on how to complete my thesis and corrections to my dissertation. Their
suggestions provided me with many ideas to improve the quality of this Ph.D. thesis,
which may be useful for my future research as well.
Two people who were absolutely indispensable for completing this thesis were
the laboratory's two secretaries, Mrs. Naomi Nakata and Ms. Junko Inaniwa. Their
outstanding work is essential not only to me but to the whole laboratory. Thanks
also to the other members of the Ogata laboratory, especially the students who have
contributed to this research.
This research was supported in part by the special coordination fund for pro-
moting science and technology from the JST PRESTO “Information Environment and
Humans,” MEXT Grant-in-Aid for Scientific Research on Innovative Areas “Construc-
tive Developmental Science” (24119003), Scientific Research (S) (24220006), and JSPS
Fellows (265114).
A special thanks to my family. Words cannot express the feelings I have for my
parents and my brother for their endless patience, valuable advice, and encourage-
ment. Without your relentless support, this work would not even have been started.
Finally, I would like to express my appreciation to my beloved wife Misuzu, who
spent sleepless nights with me, supported my writing, and encouraged me to strive
towards my goal. Thank you.
Tokyo, June 27, 2015 Kuniaki Noda
Contents
List of figures xi
List of tables xv
1 Introduction 1
1.1 Background and Research Objective . . . . . . . . . . . . . . . . . . . . . 1
1.2 Overview of our Approaches . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Thesis Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2 Literature Review 9
2.1 Intersensory Perceptual Phenomena in Humans . . . . . . . . . . . . . 9
2.1.1 Ventriloquism effect . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.1.2 Synesthesia . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.1.3 Active intermodal mapping . . . . . . . . . . . . . . . . . . . . . . 13
2.1.4 Coherent understanding of environment . . . . . . . . . . . . . . 15
2.2 Multimodal Integration for Robot Systems . . . . . . . . . . . . . . . . . 15
2.2.1 Audio-visual speech recognition . . . . . . . . . . . . . . . . . . . 15
2.2.2 Sensory-motor integration learning for robots . . . . . . . . . . . 19
2.3 Deep Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.3.1 Deep Autoencoder . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.3.2 Convolutional Neural Network . . . . . . . . . . . . . . . . . . . . 22
2.4 Positioning of this Thesis towards Related Work . . . . . . . . . . . . . . 23
3 Audio-Visual Speech Recognition 27
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.2 The Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.3 Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.3.1 Audio Feature Extraction by Deep Denoising Autoencoder . . . 30
3.3.2 Visual Feature Extraction by CNN . . . . . . . . . . . . . . . . . . 32
3.3.3 Audio-Visual Integration by MSHMM . . . . . . . . . . . . . . . . 35
3.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.4.1 ASR Performance Evaluation . . . . . . . . . . . . . . . . . . . . . 36
3.4.2 Visual-Based Phoneme Recognition Performance Evaluation . . 38
3.4.3 Visual Feature Space Analysis . . . . . . . . . . . . . . . . . . . . . 40
3.4.4 VSR Performance Evaluation . . . . . . . . . . . . . . . . . . . . . 43
3.4.5 AVSR Performance Evaluation . . . . . . . . . . . . . . . . . . . . 45
3.5 Discussion and Future Work . . . . . . . . . . . . . . . . . . . . . . . . . 50
3.5.1 Current Need for the Speaker Dependent Visual Feature Extrac-
tion Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
3.5.2 Positioning of our VSR Results with Regards to State of the Art in
Lip Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
3.5.3 Adaptive Stream Weight Selection . . . . . . . . . . . . . . . . . . 51
3.5.4 Relations of our AVSR Approach with DNN-HMM Models . . . . 53
3.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
4 Learning Framework for Multimodal Integration of Robot Behaviors 57
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.2 Multimodal Temporal Sequence Learning using a DNN . . . . . . . . . 58
4.2.1 Sensory Feature Extraction . . . . . . . . . . . . . . . . . . . . . . 58
4.2.2 Multimodal Integration Learning using Time-delay Networks . . 59
4.3 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.3.1 Cross-modal Memory Retrieval . . . . . . . . . . . . . . . . . . . . 60
4.3.2 Temporal Sequence Prediction . . . . . . . . . . . . . . . . . . . . 62
4.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
5 Applications for Recognition and Generation of Robot Behaviors 65
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
5.2 Construction of the Proposed Framework . . . . . . . . . . . . . . . . . . 65
5.3 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
5.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
5.4.1 Cross-modal Memory Retrieval and Temporal Sequence Predic-
tion of Object Manipulation Behaviors . . . . . . . . . . . . . . . 70
5.4.2 Real-time Adaptive Behavior Selection According to Environ-
mental Changes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
5.4.3 Multimodal Feature Space Visualization . . . . . . . . . . . . . . 76
5.4.4 Behavior Recognition using Multimodal Features . . . . . . . . . 77
5.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
5.5.1 How Generalization Capability of Deep Neural Networks Con-
tributes for Robot Behavior Learning . . . . . . . . . . . . . . . . 80
5.5.2 Three Factors that Contribute to Robustness in Behavior Recog-
nition Task . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
5.5.3 Difference between our Proposed Time-delay Autoencoder and
the Original Time-delay Neural Network . . . . . . . . . . . . . . 83
5.5.4 Characteristics of the Internal Representation of the Temporal
Sequence Learning Network . . . . . . . . . . . . . . . . . . . . . 83
5.5.5 Length of Contextual Information that a Time-delay Autoen-
coder Handles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
5.5.6 Scalability of our Proposed Multimodal Integration Learning
Mechanism . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
5.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
6 Analysis on Intersensory Synchrony Model 89
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
6.2 Construction of the Proposed Framework . . . . . . . . . . . . . . . . . . 89
6.3 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
6.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
6.4.1 Image Sequence Retrieval from Sound and Motion Sequences . 94
6.4.2 Quantitative Evaluation of Image Retrieval Performance . . . . . 96
6.4.3 The Correlation between Generated Motion and Retrieved Bell
Images . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
6.4.4 Visualization of Multimodal Feature Space . . . . . . . . . . . . . 99
6.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
6.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
7 Conclusion 103
7.1 Overall Summary of the Current Research . . . . . . . . . . . . . . . . . 103
7.2 Significance of the Current Study as a Work in Intermedia Art and Science 105
A Hessian-Free Optimization 107
A.1 Newton-CG Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
A.2 Computing the Matrix-Vector Product . . . . . . . . . . . . . . . . . . . . 109
B FNN with R-operator 111
B.1 Forward propagation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
B.2 Forward propagation with R-operator . . . . . . . . . . . . . . . . . . . . 111
B.3 Error Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
B.4 Backpropagation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
B.4.1 variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
B.4.2 parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
B.5 Backpropagation with R-operator . . . . . . . . . . . . . . . . . . . . . . 113
B.5.1 variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
B.5.2 parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
C RNN with R-operator 115
C.1 Forward Propagation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
C.2 Forward Propagation with R-operator . . . . . . . . . . . . . . . . . . . . 115
C.3 Error Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
C.4 Backpropagation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
C.4.1 variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
C.4.2 parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
C.5 Backpropagation with R-operator . . . . . . . . . . . . . . . . . . . . . . 118
C.5.1 variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
C.5.2 parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
Bibliography 119
Relevant Publications 133
Other Publications 135
List of Figures
1.1 Thesis organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.1 A ventriloquist . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.2 McGurk effect . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.3 Feeling of shapes corresponding to different tastes (Copyright CAVE
Lab., University of Tsukuba) . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.4 Pictures used to demonstrate the bouba/kiki effect (Originally designed
by psychologist Wolfgang Köhler.) . . . . . . . . . . . . . . . . . . . . . . 13
2.5 Infant imitation (From A. N. Meltzoff and M. K. Moore. Imitation of
facial and manual gestures by human neonates. Science, 198:75–78,
1977. Copyright AAAS) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.6 Vanishing gradient problem . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.7 Convolutional neural network . . . . . . . . . . . . . . . . . . . . . . . . 23
3.1 Audio-visual synchronous data recording environment . . . . . . . . . 29
3.2 Architecture of the proposed AVSR system . . . . . . . . . . . . . . . . . 30
3.3 Word recognition rate evaluation results using audio features depend-
ing on the number of Gaussian mixture components for the output
probability distribution models of HMM . . . . . . . . . . . . . . . . . . 37
3.4 Word recognition rate evaluation results utilizing MFCCs depending on
the number of Gaussian mixture components for the output probabil-
ity distribution models of HMM . . . . . . . . . . . . . . . . . . . . . . . 39
3.5 Phoneme-wise visual-based phoneme recognition rates . . . . . . . . . 41
3.6 Visual-based phoneme-recognition confusion matrix (64×64 pixels im-
age input) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.7 Visual feature distribution for the five representative Japanese vowel
phonemes (64×64 pixels image input) . . . . . . . . . . . . . . . . . . . 44
3.8 Word recognition rates using image features . . . . . . . . . . . . . . . . 45
3.9 Word recognition rate evaluation results (8 components) . . . . . . . . 47
3.10 Word recognition rate evaluation results (16 components) . . . . . . . . 48
3.11 Word recognition rate evaluation results (32 components) . . . . . . . . 49
3.12 Word recognition rate evaluation results (32 components, speaker-
close evaluation) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
3.13 The main targets discussed in Chapter 3 . . . . . . . . . . . . . . . . . . 55
4.1 Examples of cross-modal memory retrieval and sequence prediction . 61
4.2 Buffer shift of the recurrent input . . . . . . . . . . . . . . . . . . . . . . . 62
4.3 Buffer shift of the recurrent input for temporal sequence prediction . . 63
4.4 The main targets discussed in Chapter 4 . . . . . . . . . . . . . . . . . . 64
5.1 Multimodal behavior learning and retrieving mechanism . . . . . . . . 66
5.2 Object manipulation behaviors . . . . . . . . . . . . . . . . . . . . . . . . 68
5.3 Example of motion reconstructions by our proposed model . . . . . . . 71
5.4 Example of image reconstructions by our proposed model . . . . . . . 72
5.5 Temporal sequence prediction errors of six object manipulation be-
haviors; plots are horizontally displaced from the original positions to
avoid overlap of the error bars . . . . . . . . . . . . . . . . . . . . . . . . 75
5.6 Real-time transition of object manipulation behaviors . . . . . . . . . . 76
5.7 Acquired multimodal feature space . . . . . . . . . . . . . . . . . . . . . 77
5.8 Behavior recognition rates depending on the changes in standard devi-
ation σ of the Gaussian noise superimposed on the joint angle sequences 79
5.9 The main targets discussed in Chapter 5 . . . . . . . . . . . . . . . . . . 87
6.1 Multimodal behavior learning and retrieval mechanism . . . . . . . . . 90
6.2 Bell placement configurations of the bell-ringing task . . . . . . . . . . 92
6.3 Example of image retrieval results from the sound and joint angle inputs 95
6.4 Bell image retrieval errors . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
6.5 Bell image retrieval errors at step 60 . . . . . . . . . . . . . . . . . . . . . 98
6.6 Multimodal feature space and the correspondence between the coor-
dinates and modal-dependent characteristics . . . . . . . . . . . . . . . 99
6.7 The main targets discussed in Chapter 6 . . . . . . . . . . . . . . . . . . 102
List of Tables
3.1 Settings for audio feature extraction . . . . . . . . . . . . . . . . . . . . . 31
3.2 39 types of Japanese phonemes . . . . . . . . . . . . . . . . . . . . . . . . 33
3.3 Construction of a convolutional neural network . . . . . . . . . . . . . . 34
3.4 Speaker-wise visual-based phoneme recognition rates and averaged
values [%] depending on the input image sizes . . . . . . . . . . . . . . . 38
5.1 Experimental parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
5.2 Reconstruction errors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
6.1 Experimental parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
Chapter 1
Introduction
1.1 Background and Research Objective
Intelligent machines, such as smartphones, auto-driving cars, and domestic robots,
are expected to become increasingly common in everyday life. Consequently, strong
demands for (1) a noise-robust human–machine interface that enables stress-free
interaction and (2) intelligent technologies for autonomous robots that enable stable
environmental recognition and adaptive behavior generation may arise in the near
future. To achieve these functions, we need to address the following two fundamental
requirements:
• Issue 1: Robust recognition of poorly reproducible real-world information
• Issue 2: Adaptive behavior selection of robots depending on dynamic environ-
mental changes
These requirements indicate that robot systems working in an open-ended, real
world environment need to recognize unexperienced variations in sensory informa-
tion by generalizing their already acquired memory. For example, robots need to
promptly regulate their behavior depending on momentarily changing environmen-
tal situations such as the pose or dynamics of manipulation targets. The key un-
derlying principle in this study is to address these requirements through a machine
learning approach that implements multimodal integration learning.
Humans succeed in recognizing an environment and mastering many tasks by
combining inputs from multiple modalities, including vision, audition, and somatic
sensation. All of these different sources of information are efficiently merged to or-
ganize a coherent and robust percept for stable behavior generation [25, 101]. On
the other hand, the sensory inputs in most robotic applications are commonly pre-
processed through dedicated feature extraction mechanisms such as color region ex-
traction and optic flow. It is also common to design dedicated sensory information
recognition algorithms depending on perceptual objectives such as face detection,
speech recognition, and object detection [73]. Consequently, recognized targets are
represented by predefined symbolic descriptions. As for behavior generation, rule-
based automatic decision-making algorithms, such as finite state machines, are
utilized [73].
In essence, environmental recognition and behavior generation have rarely been
attained by considering mutual intersensory processes among multiple streams of
sensory-motor information. Modality-dependent processing approaches have been
inevitable for robotics because conventional machine learning approaches have
faced scalability issues when handling large-scale raw sensory inputs and motor
command outputs in real-world environments. However, these approaches carry a
fundamental side effect: information filtering by designers may eclipse information
that is essential for robots to control their behavior, and it limits the chances for
robots to develop their own capabilities from the sensory input level. Furthermore,
predefined symbolic representations of recognition targets may prevent a general-
ized comprehension of the surrounding environment. Further, rule-based behavior
control mechanisms possibly restrict the adaptability of robots to novel environ-
mental conditions.
Regarding sensory feature extraction and multimodal integration learning mech-
anisms, deep learning approaches have recently attracted considerable attention
among the machine-learning community [9]. One of the main advantages of ap-
plying deep neural networks (DNNs) is that they can self-organize highly general-
ized sensory features from large-scale raw data. For example, DNNs have success-
fully been applied to unsupervised feature learning for single modalities such as text
[103], images [58, 56], and audio [40]. The same approach has also been applied to
the learning of fused representations over multiple modalities, resulting in signifi-
cant improvements in speech recognition performance [75]. However, discussion on
the application of DNNs to more dynamic information such as speech signals has
just recently begun. Thus, DNNs have never been applied to multimodal integration
learning of robot behaviors.
In the context of the background explained above, the main objective of this study
is to address the two fundamental requirements presented above by applying deep
learning for sensory feature extraction and multimodal integration learning.
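As a toy illustration of this feature self-organization (a single denoising-autoencoder layer trained with plain gradient descent on synthetic data; every size, rate, and name below is an assumption, not this dissertation's setup), a network trained to reconstruct clean signals from noisy inputs learns a compressed feature that discards much of the noise:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: clean 8-D signals lying on a 2-D subspace, plus noisy copies.
basis = rng.normal(size=(2, 8)) / np.sqrt(8)
codes = rng.normal(size=(500, 2))
clean = codes @ basis
noisy = clean + 0.3 * rng.normal(size=clean.shape)

# One denoising-autoencoder layer (a deep stack repeats this layer-
# wise): encode the NOISY input, reconstruct the CLEAN target.
W1 = 0.1 * rng.normal(size=(8, 4)); b1 = np.zeros(4)
W2 = 0.1 * rng.normal(size=(4, 8)); b2 = np.zeros(8)

def forward(x):
    h = np.tanh(x @ W1 + b1)   # compressed sensory feature
    return h, h @ W2 + b2      # linear reconstruction

lr = 0.05
for _ in range(3000):
    h, out = forward(noisy)
    err = (out - clean) / len(noisy)          # mean-squared-error gradient
    gW2, gb2 = h.T @ err, err.sum(axis=0)
    dh = (err @ W2.T) * (1.0 - h ** 2)        # backprop through tanh
    gW1, gb1 = noisy.T @ dh, dh.sum(axis=0)
    W1 -= lr * gW1; b1 -= lr * gb1
    W2 -= lr * gW2; b2 -= lr * gb2

_, recon = forward(noisy)
mse = np.mean((recon - clean) ** 2)  # compare with the 0.09 input noise power
```

Because the clean signal occupies only a low-dimensional subspace, the bottleneck feature retains the signal while much of the isotropic noise is projected away, which is the intuition behind using such features for noise-robust recognition.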
1.2 Overview of our Approaches
We address the research objectives explained in the previous section by utilizing the
multiple functionalities of deep learning. Our approaches and the corresponding
technical solutions realized by deep learning are summarized as follows.
• Approach 1: Utilization of highly generalized sensory features
• Solution 1: Self-organization of abstracted features from large amounts of
training data
• Approach 2: Fusional utilization of multimodal information
• Solution 2: Multimodal integration learning
• Approach 3: Memory prediction and association among multiple modalities
• Solution 3: Cross-modal memory retrieval
To address Requirement 1, we employ two approaches: utilization of highly gen-
eralized sensory features and fusional utilization of multimodal information. In
practice, noise-robust recognition is attained by utilizing highly generalized sensory
features self-organized by deep learning. Moreover, the same objective is attained by
utilizing integrated features acquired by an integration learning of multimodal infor-
mation. The integrated representation contributes towards the fusional utilization
of the multimodal information: even if the reliability of one modality degrades,
information from the other modalities can compensate to restore the corresponding
internal representation. We address the robust recognition of poorly reproducible
real-world information by these two approaches.
To address Requirement 2, we employ another approach: memory prediction and
association among multiple modalities. In practice, adaptive behavior selection of
robots depending on dynamic environmental changes is attained by a cross-modal
memory retrieval function of deep learning based on the acquired multimodal inte-
gration representation (intersensory synchrony model).
Our proposed multimodal integration learning mechanism is evaluated through
the following three experiments in a step-by-step manner.
• Evaluation 1: Noise robust speech recognition based on audio-visual integra-
tion learning
• Evaluation 2: Robust environment recognition and adaptive behavior genera-
tion based on visual-motor integration learning of robot behaviors
• Evaluation 3: Analysis on a multimodal synchrony model acquired from inte-
gration learning of robot behaviors
In the first evaluation experiment, the topics from Approaches 1 and 2 are inves-
tigated. In practice, the audio-visual speech recognition (AVSR) approach is adopted
to integrate audio and visual information for realizing robust speech recognition in
noisy environments. Specifically, sensory features acquired from audio signals and
the corresponding mouth area images are integrated to attain AVSR. In the current
experiment, two kinds of DNNs, denoising deep autoencoder (DDA) and convo-
lutional neural network (CNN), are utilized for feature extraction of audio and vi-
sual information, respectively. Moreover, the multi-stream hidden Markov model
(MSHMM) is applied for integration learning of the two sensory features acquired
from the audio signal and mouth area images, respectively. Hence, we approach
noise-robust speech recognition from two directions: one involves utilizing the
DDA for noise reduction of audio features, and the other involves utilizing the CNN
and MSHMM for fusional utilization of the multimodal information.
In the second evaluation experiment, the topics from Approaches 1, 2, and 3 are
investigated. This experiment focuses on the behavior generation function of robots
rather than the recognition function, which is the main focus of the first experiment.
In practice, a sensory-motor multimodal integration learning framework utilizing
DNN is proposed for realizing adaptive generation of robot behaviors depending on
dynamic environmental changes. Specifically, synchrony models between visual and
motor modalities are structured in a self-organizing manner by training a DNN with
temporal sequences consisting of camera images and joint angles acquired from six
types of object manipulation behaviors utilizing a humanoid robot. The acquired
model is applied for cross-modal memory retrieval reflecting the synchrony model
between visual and motor modalities.
In the third experiment, quantitative evaluation and analysis are conducted on the
acquired synchrony model. In practice, a bell-ringing task is conducted by a hu-
manoid robot for acquiring a synchrony model among the following three modal-
ities: vision, audio, and motion. The acquired synchrony model is utilized for re-
trieving image sequences from audio and motion sequence inputs. To confirm that
the correct synchrony is modeled and the corresponding memory retrieval is at-
tained, the generated images are quantitatively evaluated. Moreover, correspon-
dences among the structure of the acquired multimodal feature space, the environ-
mental setting, and the physical motion are analyzed by visualizing the activation
patterns acquired from the central middle layer of the DNN utilized for the mul-
timodal integration learning. By analyzing the structure of the multimodal feature
space, the mechanism to represent the synchrony model in the DNN is revealed.
1.3 Thesis Organization
The remainder of this dissertation is organized as shown in Figure 1.1. In Chapter
2, recent research trends on multimodal integration learning are introduced. First,
findings from cognitive psychology studies are summarized. Second, studies on
AVSR and sensory-motor integration learning of robots are summarized to survey
preceding practical applications of multimodal integration learning. Third, recent
research trends in deep learning, which is the technical foundation of our proposed
multimodal integration mechanism, are summarized. Finally, the positioning of our
proposed model with regard to the recent studies is presented.
In Chapter 3, experiments on AVSR utilizing our proposed learning framework
are conducted for evaluating how sensory features acquired by deep learning and
multimodal integration contribute to robust speech recognition. In practice, a
Figure 1.1: Thesis organization
connectionist-hidden Markov model (HMM) system for noise-robust AVSR is pro-
posed. First, a DDA is utilized for acquiring noise-robust audio features. By preparing
the training data for the network with pairs of consecutive multiple steps of deteri-
orated audio features and the corresponding clean features, the network is trained
to output denoised audio features from the corresponding features deteriorated by
noise. Second, a CNN is utilized to extract visual features from raw mouth area im-
ages. By preparing the training data for the CNN as pairs of raw images and the corre-
sponding phoneme label outputs, the network is trained to predict phoneme labels
from the corresponding mouth area input images. Finally, an MSHMM is applied to
integrate the acquired audio and visual HMMs, which are independently trained with
the respective features. In the isolated word speech recognition evaluation, the
audio features acquired by the DDA outperformed conventional audio features
under noisy sound settings. Moreover, the visual feature acquired by CNN outper-
formed visual features acquired by conventional dimensionality compression algo-
rithms such as principal component analysis (PCA). Finally, we verified that AVSR
utilizing MSHMM can exhibit robust speech recognition capability even under noisy
sound settings compared to cases in which only a single modality is utilized.
In Chapter 4, a multimodal integration learning framework for sensory-motor
integration learning of robot behaviors is presented. As a practical computational
model, a multimodal temporal sequence learning framework based on a deep learn-
ing algorithm [41] is constructed. The proposed model first compresses the dimen-
sionality of the sensory inputs acquired from multiple modalities utilizing a deep
autoencoder [41, 65]. In combination with a variant of a time-delayed neural net-
work [55] learning approach, we then introduce a novel deep learning method that
integrates sensory-motor sequences and self-organizes higher-level multimodal fea-
tures. Further, we show that our proposed temporal sequence learning framework
can internally generate temporal sequences by partially masking the input data from
outside the network and recursively feeding back the previous outputs to the masked
input nodes; this is made possible by utilizing the characteristics of an autoencoder
that models identity mappings between inputs and outputs.
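As an illustrative sketch of this masking-and-feedback mechanism (not the actual trained network of this thesis), the closed loop can be written as follows; the identity-map "autoencoder," the modality dimensionalities, and the function names are placeholders:

```python
import numpy as np

# Toy stand-in for a trained deep autoencoder over a fused
# [vision | motor] vector. Here it is simply an identity map, so the
# feedback mechanics can be shown without real training; a trained
# network would reconstruct/denoise the fused vector instead.
def autoencoder(x):
    return x

def closed_loop_retrieval(vision_seq, motor_dim, steps):
    """Retrieve the masked motor stream from the observed vision stream.

    At every step the vision slots are clamped to the observation,
    while the masked motor slots are filled with the previous output
    (recursive feedback), as in self-generation with an autoencoder.
    """
    motor = np.zeros(motor_dim)           # initial guess for masked slots
    retrieved = []
    for t in range(steps):
        x = np.concatenate([vision_seq[t], motor])  # clamp + feedback
        y = autoencoder(x)
        motor = y[len(vision_seq[t]):]    # feed back reconstructed motor part
        retrieved.append(motor.copy())
    return np.array(retrieved)

vision = np.random.rand(5, 8)             # 5 steps of 8-dim visual features
motor_hat = closed_loop_retrieval(vision, motor_dim=4, steps=5)
print(motor_hat.shape)                    # (5, 4)
```

The same loop works with any subset of modalities masked, which is what allows cross-modal retrieval in either direction.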
In Chapter 5, our proposed sensory-motor integration learning framework is
applied for learning and generating object manipulation behaviors of a humanoid
robot. In practice, the framework is trained with six different object manipulation be-
haviors generated by direct teaching. Results demonstrate that our proposed model
can retrieve temporal sequences over visual and motion modalities and predict fu-
ture sequences from the past. Moreover, the memory retrieval function enabled
the robot to adaptively switch corresponding behaviors depending on the displayed
objects. Further, behavior-dependent unified representations that fuse sensory-
motor modalities together are extracted in the temporal sequence feature space.
Our behavior recognition experiment, which utilizes the integrated features acquired
from the multimodal temporal sequence learning mechanism, demonstrates that
incorporating joint angle information into the multimodal features significantly
improves the robustness and reliability of behavior recognition performance.
In Chapter 6, a quantitative evaluation experiment on the sensory-motor inte-
gration learning framework is conducted by analyzing the “synchrony model.” In
practice, the experimental setting of the multimodal integration learning is extended
by incorporating sound signals in addition to the image and joint angles. A bell-
ringing task performed by the same robot is designed and the proposed model is
trained utilizing sensory-motor sequences consisting of the three modalities, vision,
audio, and motion. As a result, a model representing the cross-modal synchrony is
self-organized in the acquired abstracted feature space. Results demonstrate that the
cross-modal memory retrieval function of the proposed model succeeds in predict-
ing visual sequences in correlation with the sound and joint angles of bell-ringing
behaviors. Further, by analyzing the image retrieval performance, we found that our
proposed method correctly models the synchrony among the multimodal informa-
tion.
In Chapter 7, the accomplishments of our study on multimodal integration learn-
ing are summarized. Finally, reviews on the remaining research topics and future
directions conclude this dissertation.
Chapter 2
Literature Review
2.1 Intersensory Perceptual Phenomena in Humans
Humans perceive the external environment, including their own body, by integrat-
ing multiple channels of sensory inputs acquired from different modalities, such as
vision, audition, and proprioception. Input from one sensory system can influence per-
ception in another, and information transferred across modalities can substitute
for missing input in one of them. This multisensory interaction
can be observed in many human perceptual phenomena such as the ventriloquism
effect, synesthesia, and active intermodal mapping.
2.1.1 Ventriloquism effect
A ventriloquist is an entertainer who “throws his voice” by minimizing his own move-
ments so that the only visual cues the audience can associate with the speech come
from his dummy (Figure 2.1). As a result, audiences tend to feel that the voice is com-
ing from the dummy even if they clearly know which one is the dummy and which
one is not. This trick says more about the audience than the performer, because the
performance owes less to the ventriloquist’s skill than to the dominance of the
visual-auditory intersensory biases of the audience. In psychology, the term “ventrilo-
quism effect” [43] refers to the broad phenomenon of intersensory bias, in which
information from one sensory modality can influence the judgments of another.
Figure 2.1: A ventriloquist
For example, vision can influence judgments about proprioception and audition,
proprioception can bias auditory judgments, and so on [37, 81, 105, 95, 110, 109].
The magnitude of intersensory bias and the dominant modality depend on how com-
pelling and real each individual cue is [111]. In general, the visual modality is known
to predominate in intersensory influences.
One example of the general synergy between the visual and auditory system is
represented in the perception of speech. Even though it is difficult to recognize
someone’s speech in a room under significant background noise, seeing the speaker’s
face will make it easier to understand what is being said. In fact, a neuromagnetic
study indicates that the sight of lip movement actually modifies the activity in the
auditory cortex [92]. It is also known that visual cues enhance the processing of au-
ditory inputs, at a level functionally equivalent to altering the signal-to-noise ratio
(SNR) of the auditory stimulus by 15–20 dB [102]. On the other hand, nonmatching
visual and auditory cues in speech are also known to produce interesting auditory-
Figure 2.2: McGurk effect (audio “ba” paired with visual “ga” is perceived as “da”)
visual illusions, which are discussed in an article entitled “Hearing lips and seeing
voices” [69]. This illusion, commonly referred to as the “McGurk effect,” occurs when
one hears “ba-ba” but sees the mouth form “ga-ga” and perceives the sound “da-da”
(Figure 2.2).
2.1.2 Synesthesia
Synesthesia is another example of intersensory phenomena in humans. This syn-
drome literally means “joining the senses,” and is explained as a condition in which
stimulation of one sensory modality involuntarily elicits a sensation or experience in an-
other modality [21]. For example, sonogenic synesthesia, in which music provokes
intense visual experiences or cutaneous paresthesias, has been a well-known case for
over 100 years [20, 38]. Another example is a synesthete for whom a particular taste al-
ways induces the sensation of a particular geometric shape in his/her left hand (Figure 2.3)
[64]. This syndrome has recently been attracting attention among neurologists and
developmental psychologists, and has become an indispensable topic when multi-
sensory integration is being discussed.
Other research suggests that we all have some capacity for experiencing synes-
thesia. For example, consider two drawings: one looks like an inkblot and the other, a
jagged piece of shattered glass (Figure 2.4). When people are asked “Which of these is
‘bouba,’ and which is ‘kiki’?,” 98 percent of people respond that the inkblot is bouba
Figure 2.3: Feeling of shapes corresponding to different tastes (Copyright CAVE Lab.,University of Tsukuba)
and the other one is kiki [84]. Ramachandran et al. explained this phenomenon
as follows, “the gentle curves of the amoeba-like figure metaphorically mimics the
gentle undulations of the sound ‘bouba’ as represented in the hearing centers in the
brain as well as the gradual inflection of the lips as they produce the curved ‘boo-baa’
sound. In contrast, the waveform of the sound ‘kiki’ and the sharp inflection of the
tongue on the palate mimic the sudden changes in the jagged visual shape.” The au-
thors argue that the brain’s ability to pick out an abstract feature in common items—
such as a jagged visual shape and a harsh-sounding name—could have paved the
way for the development of metaphors and perhaps even a shared vocabulary [84].
Synesthetic experiences are commonly explained as a phenomenon that reflects
a fusion of sensory experiences via association phenomena, in which independent
groups of neurons are activated in close temporal proximity to one another via long
chains of synaptic connections [101]. Their concurrent activity can produce a per-
ceptual synthesis after repeated pairings like a conditioned experience [63, 64]. On
the other hand, synesthetic experiences are also explained as a sort of sensory mixing
Figure 2.4: Pictures used to demonstrate the bouba/kiki effect (Originally designedby psychologist Wolfgang Köhler.)
that is predicted from a survey of brain areas in which different modalities converge
on the same neurons. It is not surprising to find that one dominant input evokes
secondary sensations in other modalities via such multisensory neurons. However,
there is still no shared understanding of these experiences among researchers. Al-
though there is not yet an accepted theoretical explanation, these phenomena should re-
flect some aspect of humans’ multisensory perception abilities. Moreover, whether
due to association or the activation of multisensory neurons, synesthesia reflects the
rich multisensory perceptual experiences that appear to be quite common in some
individuals.
2.1.3 Active intermodal mapping
Meltzoff et al. published a paper in 1977 showing that infants between 12 and 21 days
of age can imitate both facial and manual gestures (Figure 2.5) [71]. They claimed
that the result implies that human neonates can equate their own unseen behaviors
with gestures they see others perform. This experiment was ground-breaking be-
cause it showed that infants can imitate adults at a much earlier age than previously
believed. For example, Piaget claimed that facial imitation does not take place until
1 year of age or more [80]. Moreover, this experiment also showed evidence for early
facial imitation, which had been thought to be impossible at this age because it re-
quires cross-modal and mutual understanding of perception. According to the stan-
Figure 2.5: Infant imitation (From A. N. Meltzoff and M. K. Moore. Imitation of facialand manual gestures by human neonates. Science, 198:75–78, 1977. Copyright AAAS)
dard developmental theory, facial imitation ought to be more difficult than manual
or vocal imitation, because infants have no direct way to compare their own actions
with those of adults. (Infants can see others’ faces, but not their own. They can feel
their own facial movements, but not those of others.) Facial imitation is thought to
represent the infant’s matching of what it sees as some equivalent of the propriocep-
tive signals that it feels when trying to mimic, a process referred to by Meltzoff as
“active intermodal mapping.”
The initial report by Meltzoff et al. met with strong criticism [3, 5], but
the objections were directed not at the fact that infants exhibit intersensory integration
but at the claim that infants’ facial imitation appears very early in life. In fact, the
same claim regarding infants’ intersensory integration had been made previously [6, 14]
and even shown for the imitation of facial gestures [31]. Moreover, follow-up studies
moved the appearance date from weeks after birth to minutes after birth and also
showed an innate capability for detecting at least some forms of cross-modal equiv-
alence. Although some investigators had difficulty in demonstrating some effects in
young infants [88, 97], a number of replicated observations have now been reported
[70].
2.1.4 Coherent understanding of environment
Cognitive science research has revealed that combining sensory information contributes
to enhancing perceptual clarity and reducing ambiguity about the sensory environ-
ment [25, 101]. For example, a simultaneous tone can improve detection of a dimly
flashed light [29, 104], enhance the discriminability of briefly flashed visual patterns
[108], or increase the perceived luminance of light [99]. Moreover, neuroscience re-
search demonstrated that cross-connections between early sensory areas facilitate
processing in one sense by input from another [26], and that the superior colliculus
mediates cross-modal improvements in simple attentive-orientation tasks [100, 101].
In addition, action-effect synchrony perception is known to have a close relationship
with the sense of agency [30], and thus cross-modal grouping plays an important role
in sensation [48].
2.2 Multimodal Integration for Robot Systems
Multimodal integration contributes to forming constant, coherent, and robust per-
ceptions by reducing ambiguities regarding the sensory environment. Hence, we
believe that replicating human multimodal integration learning as a computational
model is essential towards realizing sophisticated cognitive functions of robot intel-
ligence, as well as towards fundamentally understanding human intelligence. In this
section, we briefly review previous practical applications for multimodal integration
learning from an engineering perspective.
2.2.1 Audio-visual speech recognition
AVSR is one of the most representative applications of multimodal inte-
gration learning, putting it into practical use for speech recognition. The fun-
damental idea of AVSR is to use visual information derived from a speaker’s lip mo-
tion to complement corrupted audio speech inputs. In this subsection, we review
recent approaches for the elemental technologies of AVSR from the following three
perspectives: audio feature extraction, image feature extraction, and audio-visual in-
tegration.
Audio feature extraction
The use of mel-frequency cepstral coefficients (MFCCs) has been a de facto stan-
dard for automatic speech recognition (ASR) for decades. However, advances in
deep learning research have led to recent breakthroughs in unsupervised audio fea-
ture extraction methods and exceptional recognition performance improvements
[27, 40, 62]. Advances in novel machine learning algorithms, improved availability
of computational resources, and the development of large databases have led to self-
organization of robust audio features by efficient training of large-scale DNNs with
large-scale datasets.
One of the most successful applications of DNNs to ASR is the deep neural net-
work hidden Markov model (DNN-HMM) [22, 72], which replaces the conventional
Gaussian mixture model (GMM) with a DNN to represent direct projection between
HMM states and corresponding acoustic feature inputs. The idea of utilizing a neural
network to replace a GMM and construct a hybrid model that combines a multilayer
perceptron and HMMs was originally proposed decades ago [85, 13]. However, owing
to limited computational resources, large and deep models could not be explored at
the time, and the resulting hybrid systems could not outperform GMM-HMM
systems.
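The hybrid decoding step can be illustrated as follows: the DNN’s state posteriors are converted into scaled likelihoods by dividing by the state priors before being used in HMM decoding. The numbers below are made up for illustration:

```python
import numpy as np

# Hypothetical DNN outputs: posterior p(state | acoustic frame)
# for 3 HMM states, and the state priors p(state) estimated from
# alignment counts over the training data.
posteriors = np.array([0.7, 0.2, 0.1])
priors = np.array([0.5, 0.3, 0.2])

# Scaled likelihoods p(x | state) proportional to p(state | x) / p(state);
# in a hybrid DNN-HMM decoder these replace the GMM emission likelihoods.
log_scaled_likelihood = np.log(posteriors) - np.log(priors)
print(np.argmax(log_scaled_likelihood))  # most likely emitting state: 0
```

Dividing by the priors matters because the HMM decoder expects likelihoods, not posteriors; frequent states would otherwise be systematically favored.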
Other major approaches for application of DNNs to ASR involve using a deep au-
toencoder as a feature extraction mechanism. For example, Sainath et al. utilized a
deep autoencoder as a dimensionality compression mechanism for self-organizing
higher-level features from raw sensory inputs and utilized the acquired higher-level
features as inputs to a conventional GMM-HMM system [90]. Another example is the
deep denoising autoencoder proposed by Vincent et al. [106, 107]. This model differs
from the former model in that the outputs of the deep autoencoder are utilized as a
sensory feature rather than the compressed vectors acquired from the middle layer
of the network. The key idea of the denoising model is to make the learned represen-
tations robust to partial destruction of the input by training a deep autoencoder to
reconstruct clean repaired inputs from corrupted, partially destroyed inputs.
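A minimal sketch of this denoising training scheme, in plain NumPy with illustrative sizes, noise level, and learning rate (not the architecture used in the cited work):

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny single-hidden-layer denoising autoencoder: the network sees
# corrupted inputs but is trained to reconstruct the CLEAN targets.
d_in, d_hid = 8, 4
W1 = rng.normal(0, 0.1, (d_in, d_hid)); b1 = np.zeros(d_hid)
W2 = rng.normal(0, 0.1, (d_hid, d_in)); b2 = np.zeros(d_in)

def forward(x):
    h = np.tanh(x @ W1 + b1)
    return h, h @ W2 + b2

# "Clean" features lying on a low-dimensional subspace
clean = rng.random((256, 3)) @ rng.random((3, d_in))

lr = 0.1
for _ in range(2000):
    noisy = clean + rng.normal(0, 0.5, clean.shape)   # corrupt the input
    h, recon = forward(noisy)
    err = (recon - clean) / len(clean)                # clean reconstruction target
    gh = err @ W2.T * (1 - h ** 2)                    # backprop through tanh
    W2 -= lr * h.T @ err;    b2 -= lr * err.sum(0)
    W1 -= lr * noisy.T @ gh; b1 -= lr * gh.sum(0)

noisy = clean + rng.normal(0, 0.5, clean.shape)
mse_out = np.mean((forward(noisy)[1] - clean) ** 2)
mse_in = np.mean((noisy - clean) ** 2)
print(mse_out < mse_in)  # denoised output is closer to the clean features
```

The essential point is the asymmetry between input and target: corruption is applied only to the input, which forces the learned representation to be robust to partial destruction.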
Visual feature extraction
Incorporation of speakers’ lip movements as visual information for ASR systems is
known to contribute to robustness and accuracy, especially in environments where
audio information is corrupted by noise. In previous studies, several different
approaches have been proposed for extracting visual features from input images
[67, 54]. These approaches can be broadly classified into two representative cate-
gories.
The first is a top-down approach, where an a priori lip-shape representation
framework is embedded in a model; for example, active shape models (ASMs) [61]
and active appearance models (AAMs) [19]. ASMs and AAMs extract higher-level,
model-based features derived from the shape and appearance of mouth area images.
Model-based features are suitable for explicitly analyzing internal representations;
however, some elaboration of lip-shape models and precise hand-labeled training
data are required to construct a statistical model that represents valid lip shapes.
The second is a bottom-up approach. Various methods can be used to directly es-
timate visual features from the image; for example, dimensionality compression al-
gorithms, such as discrete cosine transform [68, 94], PCA [4, 68], and discrete wavelet
transform [68]. These algorithms are commonly utilized to extract lower-level image-
based features, which are advantageous because they do not require dedicated lip-
shape models or hand-labeled data for training; however, they are vulnerable to
changes in lighting conditions, translation, or rotation of input images. In this study,
we adopt the bottom-up approach by introducing a CNN as a visual feature extrac-
tion mechanism, because it is possible for CNNs to overcome the weaknesses of con-
ventional image-based feature extraction mechanisms. The acquired visual features
are also processed with a GMM-HMM system.
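The translation tolerance that motivates this choice comes from the CNN’s shared convolution filters combined with pooling. A toy NumPy sketch (the 8x8 “mouth image” and the edge filter are invented placeholders):

```python
import numpy as np

# A shared convolution filter followed by ReLU and max pooling: the
# ingredients that make image-based CNN features more tolerant to
# translation than PCA-style global projections.
def conv2d_valid(img, kernel):
    kh, kw = kernel.shape
    out = np.zeros((img.shape[0] - kh + 1, img.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool(fmap, size=2):
    h, w = fmap.shape[0] // size, fmap.shape[1] // size
    return fmap[:h * size, :w * size].reshape(h, size, w, size).max(axis=(1, 3))

img = np.zeros((8, 8)); img[3, 2:6] = 1.0         # a horizontal "lip" edge
kernel = np.array([[1.0, 1.0], [-1.0, -1.0]])     # horizontal-edge detector
features = max_pool(np.maximum(conv2d_valid(img, kernel), 0.0))

shifted = np.roll(img, 1, axis=1)                 # translate the image 1 px
features_shifted = max_pool(np.maximum(conv2d_valid(shifted, kernel), 0.0))
print(features.max() == features_shifted.max())   # strongest response unchanged
```

Because the same filter is applied at every position and pooling keeps only local maxima, the dominant response survives small translations that would change every coefficient of a PCA projection.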
Several approaches for application of CNNs to speech recognition studies have
been proposed. Abdel-Hamid et al. [1, 2] applied their original functionally extended
CNNs for sound spectrogram inputs and demonstrated that their CNN architecture
outperformed earlier basic forms of fully connected DNNs on phone recognition
and large vocabulary speech recognition tasks. Palaz et al. [78] applied a CNN for
phoneme sequence recognition by estimating phoneme class conditional probabil-
ities from raw speech signal inputs. This approach yielded comparable or better
phoneme recognition performance relative to conventional approaches. Lee et al.
[60] applied a convolutional deep belief network (DBN) for various audio classifica-
tion tasks, such as speaker identification, gender classification, and phone classifica-
tion, that showed better performance as compared with conventional hand-crafted
audio features. Thus, CNNs have been attracting considerable attention in speech
recognition studies. However, applications of CNNs have been limited to audio sig-
nal processing, while their application to lip-reading remains unaddressed.
Audio-visual integration
Multimodal recognition can improve performance as compared with unimodal
recognition by utilizing complementary sources of information [15, 36, 86]. Multi-
modal integration is commonly achieved by two different approaches. First, in the
feature fusion approach, feature vectors from multiple modalities are concatenated
and transformed to acquire a multimodal feature vector. For example, Ngiam et al.
[75] utilized a DNN to extract fused representations directly from multimodal signal
inputs by compressing the input dimensionality. Huang et al. [44] utilized a DBN
for audio-visual speech recognition tasks by combining mid-level features learned
by single modality DBNs. However, these approaches have difficulty in explicitly
and adaptively selecting the respective information gains depending on the dynamic
changes in the reliability of multimodal information sources. Alternatively, in the
decision fusion approach, outputs of unimodal classifiers are merged to determine
a final classification. Unlike the previous method, decision fusion techniques can
improve robustness by incorporating stream reliabilities associated with multiple in-
formation sources as a criterion of information gain for a recognition model. For
example, Gurban et al. [35] succeeded in dynamic stream weight adaptation based
on modality confidence estimators in the MSHMM for their AVSR problem.
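The decision fusion idea can be sketched as a stream-weighted combination of per-word log-likelihoods; the scores and weights below are invented for illustration:

```python
import numpy as np

# Per-word log-likelihoods from independently trained audio and
# visual HMMs (made-up numbers for three candidate words).
log_audio = np.array([-10.0, -12.0, -30.0])
log_visual = np.array([-20.0, -15.0, -16.0])

def mshmm_decode(lam):
    """Combine streams with weight lam on audio, 1 - lam on visual."""
    return np.argmax(lam * log_audio + (1.0 - lam) * log_visual)

print(mshmm_decode(0.9))  # clean audio: trust the audio stream -> word 0
print(mshmm_decode(0.1))  # noisy audio: shift weight to vision -> word 1
```

Adapting the weight to an estimate of each stream's reliability, as in the cited work, is what lets the recognizer degrade gracefully when one modality is corrupted.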
2.2.2 Sensory-motor integration learning for robots
Multimodal integration has long been a challenging problem in robotics [16, 18]. Al-
though there is relevant research reported in the literature [46, 82, 93], several is-
sues still remain unsolved. First, multimodal sensory-motor integration has typi-
cally been applied only to a singular problem, such as self-organizing one’s spatial
representation [46, 82]; further functions have not been intensively studied, includ-
ing such functions as the cross-modal complementation of information deficiencies
or the application of cross-modal memory retrieval for behavior generation prob-
lems. Second, discussion in the literature regarding how multimodal information
should be fused together to realize stable environmental recognition has not reached
a comprehensive consensus. Thus, a prevailing multimodal information integra-
tion framework has not been available. Subsequently, in robotics, sensory inputs
acquired from different sources are still typically processed with dedicated feature-
extraction mechanisms [73]. Third, multimodal synchrony modeling as a means of
implementing sensory-motor prediction for robotic applications has not been ade-
quately investigated. Several preceding studies have proposed computational mod-
els developmentally acquiring action-effect synchrony for understanding interaction
rules [52, 77]; however, most causal models have been represented using a limited
number of modalities, often focusing on vision and motion only.
A scalable learning framework that enables multimodal integration learning by
handling large amounts of sensory-motor data with high dimensionality has not yet
been realized. In line with the growing demand for perceptual precision with regard
to the surrounding environment, recent robots are equipped with state-of-the-art
sensory devices, such as high-resolution image sensors, range sensors, multichan-
nel microphones, and so on [32, 47, 91]. As a result, remarkable improvements have
been achieved in the quantity of available sensory information; however, because
of the scalability limitations of conventional machine learning algorithms, ground-
breaking computational models achieving robust behavior control and environmen-
tal recognition by fusing multimodal sensory inputs into a single representation have
not yet been proposed.
2.3 Deep Learning
Regarding computational models addressing large-scale data processing with signif-
icant dimensionality [8], deep learning approaches have recently attracted consider-
able attention in the machine-learning community [9]. For example, DNNs have suc-
cessfully been applied to unsupervised feature learning for single modalities, such
as text [103], images [56], or audio [40]. In such studies, various information sig-
nals, even with high-dimensional representations, were effectively compressed in a
restorable form. Further, brilliant achievements in deep learning technologies have
already succeeded in making advanced applications available to the public. For ex-
ample, competition results from the ImageNet Large Scale Visual Recognition Chal-
lenge [50] have led to significant improvements in web image search engines [89]. As
another example, unsupervised feature-extraction functions of deep learning tech-
nologies have greatly increased the sophistication of a voice recognition engine used
for a virtual assistant service [42]. The same approach has also been applied to the
learning of fused representations over multiple modalities, resulting in significant
improvements in speech recognition performance [75]. Yet another study on multi-
modal integration learning has succeeded in cross-modal memory retrieval by com-
plementing missing modalities [98]. Most current studies on multimodal integration
learning utilize deep networks; however, much work focuses on extracting correla-
tions between static modalities, such as images and text [50]. Thus, few studies have
investigated methods not only for multimodal sensor fusion, but also for dynamic
sensory-motor coordination problems [24] of robot behavior.
The back-propagation algorithm has long been the dominant approach for training
neural networks with multiple non-linear layers. However, the “vanishing gradients
problem” (Figure 2.6), in which the derivative terms can exponentially decay to zero
or explode during deep back-propagation [10], prevented this technique from scaling
to networks with a very large number of hidden layers.
Due to its scalability limitation, the neural network has been regarded as an out-
moded machine learning approach for decades. However, the following three factors
have recently led to a major breakthrough in the application of DNNs to the problems
[Figure: a network layer computing Y = f(WX + b), with the gradient terms (∂E/∂W, ∂E/∂b) propagated backwards through the layers]
Figure 2.6: Vanishing gradient problem
of image classification and speech recognition. First, popularization of low-cost,
high-performance computational environments, i.e., high-end consumer personal
computers equipped with general-purpose graphics processing units (GPGPUs), has
allowed a wider range of users to conduct brute force numerical computations with
large datasets. Second, improved public access to large databases has enabled unsu-
pervised learning mechanisms to self-organize highly generalized features that can
outperform conventional handcrafted features. Third, the development of powerful
machine learning techniques, e.g., improved optimization algorithms, has enabled
large-scale neural network models to be efficiently trained with large datasets, which
has made it possible for deep neural networks to generate highly generalized fea-
tures.
In the following subsection, we introduce representative deep learning architec-
tures that have contributed to the recent development of deep learning studies.
2.3.1 Deep Autoencoder
The deep autoencoder is a variant of a DNN commonly utilized for dimensionality
compression and feature extraction [75, 41]. DNNs are artificial neural network mod-
els with multiple layers of hidden units between inputs and outputs. A multi-layered
artificial neural network is referred to as an autoencoder, particularly when the net-
work structure has a bottleneck shape (the number of nodes for the central hidden
layer becomes smaller than that for the input (encoder) and output (decoder) layers),
and the network is trained to model the identity mappings between inputs and out-
puts. Regarding dimensionality compression mechanisms, a simple and commonly
utilized approach is PCA. However, Hinton et al. demonstrated that the deep autoen-
coder outperformed PCA in image reconstruction and compressed feature acquisi-
tion [41].
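As a structural illustration of the bottleneck mapping, the forward pass can be sketched with randomly initialized (untrained) weights; the layer sizes below are hypothetical, and, following the convention used later in this thesis, the central layer is linear while the others are logistic:

```python
import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(x, weights, biases, center):
    """Forward pass through a bottleneck autoencoder; the activation of the
    central (lowest-dimensional) layer is the compressed feature."""
    h = x
    activations = []
    for k, (W, b) in enumerate(zip(weights, biases)):
        z = h @ W + b
        h = z if k == center else logistic(z)  # linear central layer, logistic elsewhere
        activations.append(h)
    return activations

# Hypothetical layer sizes forming a bottleneck: 64 -> 32 -> 8 -> 32 -> 64
sizes = [64, 32, 8, 32, 64]
rng = np.random.default_rng(0)
weights = [rng.normal(0, 0.1, (m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]

x = rng.normal(size=(5, 64))   # five input samples
acts = forward(x, weights, biases, center=1)
code = acts[1]                 # 8-dimensional compressed representation
recon = acts[-1]               # reconstruction, trained to match the input
```

Training would adjust the weights so that `recon` approximates `x`; `code` then plays the role that the principal components play in PCA.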
To train DNNs, Hinton et al. first proposed an unsupervised learning algorithm
that uses greedy layer-wise unsupervised pretraining followed by fine-tuning meth-
ods to overcome the high prevalence of unsatisfactory local optima in learning objec-
tives of deep models [41]. Subsequently, Martens proposed a novel approach by in-
troducing a second-order optimization method, Hessian-free optimization, to train
deep networks [65]. The proposed method efficiently trained the models by a general
optimizer without pretraining. Placing emphasis on the simplicity of their algorithm,
we adopted the learning method proposed by Martens for optimizing our deep au-
toencoder. In our work, we utilized deep autoencoders for the self-organization of
sensory feature vectors, and for temporal sequence learning.
2.3.2 Convolutional Neural Network
A CNN (Figure 2.7) is a variant of a DNN commonly utilized for image classifica-
tion problems [58, 57, 59]. CNNs integrate three architectural ideas to ensure spatial
invariance: local receptive fields, shared weights, and spatial subsampling. Accord-
ingly, CNNs are advantageous compared with ordinary fully connected feed-forward
networks in the following three ways.
First, the local receptive fields in the convolutional layers extract local visual fea-
tures by connecting each unit only to small local regions of an input image. Local
receptive fields can extract visual features such as oriented-edges, end-points, and
corners. Typically, pixels in close proximity are highly correlated and distant pixels
are weakly correlated. Thus, the stack of convolutional layers is structurally advan-
tageous for recognizing images by effectively extracting and combining the acquired
Figure 2.7: Convolutional neural network
features. Second, CNNs can guarantee some degree of spatial invariance with respect to shifts, scaling, or local distortions of inputs by forcing units to share the same weight configurations across the input space. Units in a plane are thus forced to perform the same operation on different parts of the image. As CNNs are equipped with several local receptive fields, multiple features are extracted at each location. In principle, fully connected networks are also able to learn similar invariances. However, learning such weight configurations requires a very large number of training samples to cover all possible variations. Third, subsampling layers, which perform local averaging and
subsampling, are utilized to reduce the resolution of the feature map and sensitivity
of the output to input shifts and distortions (for implementation details, see [58]).
In terms of computational scalability, shared weights allow CNNs to possess
fewer connections and parameters compared with standard feed-forward neural
networks with similar-sized layers. Moreover, current improvements in compu-
tational resource availability, especially with highly-optimized implementations of
two-dimensional convolution algorithms processed with GPGPUs, has facilitated ef-
ficient training of remarkably large CNNs with millions of image datasets [56, 50].
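The parameter savings from weight sharing can be illustrated with a back-of-the-envelope count; the layer sizes below are hypothetical:

```python
def conv_layer_params(in_maps, out_maps, kh, kw):
    """Weights are shared across the input space: one kh x kw kernel per
    (input map, output map) pair, plus one bias per output map."""
    return out_maps * in_maps * kh * kw + out_maps

def dense_layer_params(in_units, out_units):
    """A fully connected layer needs one weight per input-output pair."""
    return in_units * out_units + out_units

# Hypothetical case: a 32x32 monochrome input mapped to 32 feature maps
conv = conv_layer_params(1, 32, 5, 5)              # 832 parameters
dense = dense_layer_params(32 * 32, 32 * 32 * 32)  # tens of millions of parameters
```

The convolutional layer needs only 832 parameters, whereas a fully connected layer producing the same number of output units would need over 33 million.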
2.4 Positioning of this Thesis towards Related Work
In this chapter, related work regarding multimodal integration has been reviewed
from the following three perspectives:
• How multimodal integration affects the way that humans perceive the external
environment (Section 2.1)
• The main contributions and the outstanding problems of multimodal integra-
tion learning in practical robot applications (Section 2.2)
• How deep learning algorithms have contributed towards achieving perfor-
mance improvements in machine-learning problems including image recogni-
tion, speech recognition, and also, multimodal integration learning (Section 2.3)
The reviews in Section 2.1 clearly show that sensory-motor information from multiple modalities in humans mutually interacts. Therefore, a comprehensive investigation of multimodal information is crucial to understanding human intelligence. Moreover, the reviews in Section 2.2 show that there have been
several engineering approaches to apply multimodal integration learning for realiz-
ing robust environment recognition, such as AVSR in speech recognition. However,
the same section also explains that investigations on the application of multimodal
integration learning in sensory-motor coordination in robotics remain incomplete
mainly due to the scalability limitation of conventional machine learning algorithms
in handling a huge variety and a large amount of sensory-motor information ac-
quired from the robots working in real-world environments.
Against this background, the fundamental research interest of this thesis is to seek possibilities for the further expansion of multimodal integration learning to realize robot intelligence. In practice, we apply deep learning, one of the state-of-the-art machine learning approaches reviewed in Section 2.3, to robot behavior learning. Our approach differs from conventional robot behavior learning practices in that carefully designed, dedicated sensory feature extraction mechanisms are
not required for handling raw sensory inputs. Moreover, the deep learning mecha-
nism enables extraction of highly generalized features by integrating sensory-motor
information from multiple modalities that contribute towards stably abstracting and
perceiving environmental situations. The robust recognition capability enables a
robot to adaptively select corresponding behaviors in response to diverse and irre-
producible environmental changes in real-world environments.
With regard to the research interest explained above, this thesis comprises the following three elemental studies. First, the sensory feature extraction performances of deep learning algorithms are evaluated by conducting an AVSR task in
Chapter 3. Second, a variant of DNN is applied to the dynamic sensory-motor in-
tegration learning of multiple object manipulation behaviors by a humanoid robot
in Chapter 5. Through the experiments, novel approaches to utilizing a DNN model for cross-modal memory retrieval and robust behavior recognition are proposed. Finally, by extending the experimental settings proposed in Chapter 6, a detailed analysis of the intersensory synchrony model acquired by the multimodal integration learning mechanism is conducted to investigate how mutual correlations between multimodal information are self-organized in the memory structure.
Chapter 3
Audio-Visual Speech Recognition
3.1 Introduction
In this chapter, we focus on the evaluation of sensory feature extraction performance
of deep learning algorithms and investigate how multimodal integration learning
contributes towards robust speech recognition. In accordance with the objectives,
we conduct an AVSR task as a practical evaluation experiment. AVSR is thought to
be one of the most promising solutions for reliable speech recognition, particularly
when the audio is corrupted by noise. The fundamental idea of AVSR is to use vi-
sual information derived from a speaker’s lip motion to complement corrupted au-
dio speech inputs. However, cautious selection of sensory features for the audio and
visual inputs is crucial in AVSR because sensory features significantly influence the
recognition performance.
Audio feature extraction by a deep denoising autoencoder is achieved by training the network to predict original clean audio features, such as MFCCs, from deteriorated audio features that are artificially generated by superimposing Gaussian noise of various strengths on the original clean audio inputs. Acquired audio feature
sequences are then processed with a conventional GMM-HMM to conduct an iso-
lated word recognition task. The main advantage of our audio feature extraction
mechanism is that noise-robust audio features are easily acquired through a rather
simple mechanism.
For the visual feature extraction mechanism, we propose the application of a CNN, one of the most successfully utilized neural network architectures for image classification problems. This is achieved by training the CNN with over a hundred thousand mouth area image frames in combination with corresponding phoneme labels.
CNN parameters are learned to maximize the average, across training cases, of the log-probability of the correct label under the prediction distribution. Through
supervised training, multiple layers of convolutional filters, which are responsible
for extracting primitive visual features and predicting phonemes from raw image inputs, are self-organized. Our visual feature extraction mechanism has two main advantages: (1) the proposed model is easy to implement because dedicated lip-shape models or hand-labeled data are not required; and (2) the CNN is robust to shifts and rotations in image recognition.
To perform an AVSR task by integrating both audio and visual features into a single model, we propose an MSHMM [11, 12, 45]. The main advantage of the MSHMM is that we can explicitly shift the observation information source (i.e., from audio input to visual input) by controlling the stream weights of the MSHMM depending on the reliability of the multimodal inputs. Our evaluation results demonstrate that the isolated word recognition performance can be improved by utilizing visual information,
especially when audio information reliability is degraded. The results also demon-
strate that the multimodal recognition attains an even better performance than when
audio and visual features are separately utilized for isolated word recognition tasks.
3.2 The Dataset
A Japanese audio-visual dataset [53, 113] was used for the evaluation of the proposed
models. In the dataset, speech data from six males (400 words: 216 phonetically bal-
anced words and 184 important words from the ATR1 speech database [53]) were
used. In total, 24000 word recordings were prepared (one set of words per speaker;
approximately 1 h of speech in total). The audio-visual synchronous recording en-
vironment is shown in Figure 3.1. Audio data was recorded with a 16 kHz sampling
rate, 16-bit depth, and a single channel. To train the acoustic model utilized for the
1Advanced Telecommunications Research Institute International
[Figure: recording setup comprising a PC, a camera, a light, and a microphone]
Figure 3.1: Audio-visual synchronous data recording environment
assignment of phoneme labels to image sequences, we extracted 39 dimensions of
audio features, composed of 13 MFCCs and their first and second temporal deriva-
tives. To synchronize the acquired features between audio and video, MFCCs were
sampled at 100 Hz. Visual data was a full-frontal 640×480 pixel 8-bit monochrome fa-
cial view recorded at 100 Hz. For visual model training and evaluation, we prepared
a trimmed dataset composed of multiple image resolutions by manually cropping
128× 128 pixels of the mouth area from the original data and resizing the cropped
data to 64×64, 32×32, and 16×16 pixels.
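A minimal sketch of this preprocessing, with placeholder crop coordinates (the thesis crops the mouth area manually) and naive strided subsampling standing in for the unspecified resizing method:

```python
import numpy as np

def crop(frame, top, left, size=128):
    """Extract a size x size region; the coordinates here are placeholders."""
    return frame[top:top + size, left:left + size]

def downsample(img, factor):
    """Naive strided subsampling; the actual resizing method is unspecified."""
    return img[::factor, ::factor]

frame = np.zeros((480, 640), dtype=np.uint8)        # 640x480 monochrome frame
mouth = crop(frame, 200, 250)                        # 128x128 mouth-area crop
resized = [downsample(mouth, f) for f in (2, 4, 8)]  # 64x64, 32x32, 16x16
```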
3.3 Model
A schematic diagram of the proposed AVSR system is shown in Figure 3.2. The pro-
posed architecture consists of two feature extractors to process audio signals syn-
chronized with lip region image sequences. For audio feature extraction, a deep de-
noising autoencoder [106, 107] is utilized to filter out the effect of background noise
from deteriorated audio features. For visual feature extraction, a CNN is utilized to
recognize phoneme labels from lip image inputs. Finally, a multi-stream HMM rec-
ognizes isolated words by binding acquired multimodal feature sequences.
[Figure: (a) a deep denoising autoencoder maps the original audio feature to a denoised audio feature; (b) a convolutional neural network maps a raw lip image to a visual feature (phoneme posteriors /a/, /a:/, /i/, ..., /y/, /z/, /sp/); (c) a multi-stream HMM binds the acquired audio and visual feature sequences]
Figure 3.2: Architecture of the proposed AVSR system
3.3.1 Audio Feature Extraction by Deep Denoising Autoencoder
For the audio feature extraction, we utilized a deep denoising autoencoder [106, 107].
Eleven consecutive frames of audio features are used as the short-time spectral representation of speech signal inputs. To generate audio input feature sequences, partially deteriorated sound data are artificially generated by superimposing Gaussian noise of several strengths on the original sound signals. In addition to the original clean sound data, we prepared six different deteriorated sound datasets; the SNRs ranged from 30 to −20 dB at 10 dB intervals. Utilizing sound feature extraction tools, the following types of sound features are generated from eight variations of original clean and deteriorated sound signals. The HCopy command of the hidden Markov model toolkit (HTK) [114] is utilized to extract 39 dimensions of MFCCs. The Auditory Toolbox [96] is utilized to extract 40 dimensions of log mel-scale filterbank (LMFB) features.
Finally, the deep denoising autoencoder is trained to reconstruct clean audio fea-
tures from deteriorated features by preparing the deteriorated dataset as input and
the corresponding clean dataset as the target of the network. Among a 400-word
Table 3.1: Settings for audio feature extraction
IN* OUT* LAYERS*
429 429 300-150-80-40-80-150-300 (a)
429 39 300-150-80 (b)
429 429 300-300-300-300-300-300-300 (c)
429 429 300-300-300-300-300 (d)
429 429 300-300-300 (e)
429 429 300 (f)
* IN, OUT, and LAYERS indicate the number of input and output dimensions, and the layer-wise dimensions of the network, respectively.
dataset, sound signals from 360 training words (2.76×105 samples) and the remain-
ing 40 test words (2.91×104 samples) from six speakers are used to train and evaluate
the network, respectively.
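The noise superimposition described above can be sketched as follows; the thesis does not give the exact procedure, so the noise variance is derived here from the target SNR in the standard way:

```python
import numpy as np

def superimpose_noise(clean, snr_db, rng):
    """Add zero-mean Gaussian noise so the result has roughly the target SNR."""
    signal_power = np.mean(clean ** 2)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=clean.shape)
    return clean + noise

rng = np.random.default_rng(0)
t = np.linspace(0, 1, 16000, endpoint=False)   # 1 s at a 16 kHz sampling rate
clean = np.sin(2 * np.pi * 440 * t)            # synthetic stand-in for clean speech
noisy = {snr: superimpose_noise(clean, snr, rng) for snr in range(30, -21, -10)}
```

At −20 dB SNR the noise power is 100 times the signal power, which is why the visual stream becomes essential in that regime.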
The denoised audio features are generated by recording the neuronal outputs of
the deep autoencoder when 11 frames of audio features are provided as input. To
compare the denoising performance relative to the construction of the network, sev-
eral different network architectures are compared. Table 3.1 summarizes the num-
ber of input and output dimensions, as well as layer-wise dimensions of the deep
autoencoder.
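The 11-frame input construction can be sketched as follows (39 MFCC dimensions × 11 frames = 429 input dimensions, matching Table 3.1):

```python
import numpy as np

def stack_frames(features, context=11):
    """Concatenate `context` consecutive frames into one input vector,
    sliding the window one frame at a time."""
    n_frames, dim = features.shape
    windows = [features[i:i + context].reshape(-1)
               for i in range(n_frames - context + 1)]
    return np.array(windows)

mfcc = np.zeros((100, 39))        # 100 frames of 39-dimensional MFCC features
inputs = stack_frames(mfcc, 11)   # each row has 39 * 11 = 429 dimensions
```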
In the initial experiment, we compared three different methods to acquire de-
noised features with respect to MFCCs and LMFB audio features. The first generated
11 frames of output audio features and utilized the middle frame (SequenceOut). The
second acquired an audio feature from the activation pattern of the central middle
layer of the network (BottleNeck). For these two experiments, a bottleneck-shaped
network was utilized (Table 3.1 (a)). The last generated a single frame of an output
audio feature that corresponds to the middle frame of the inputs (SingleFrameOut).
For this experiment, a triangle-shaped network was utilized (Table 3.1 (b)).
In the second experiment, we compared the performance relative to the number
of hidden layers of the network utilizing an MFCCs audio feature. In this experiment,
we prepared four straight-shaped networks with different numbers of layers (i.e., one
to seven layers) at intervals of two (Table 3.1 (c)–(f)). Outputs were acquired by gen-
erating 11 frames of output audio features and utilizing the middle frame. Regarding
the activation functions of the neurons, a linear function and logistic nonlinearity
are utilized for the central middle layer of the bottleneck-shaped network and the
remaining network layers, respectively. Parameters for the network structures are
empirically determined with reference to previous studies [41, 49].
The deep autoencoder is optimized to minimize the objective function E, defined by the L2 norm between the outputs of the network and the target vectors across training dataset D under the model parameterized by θ, represented as

E(D, \theta) = \sqrt{\sum_{i=1}^{|D|} \left( \hat{x}^{(i)} - x^{(i)} \right)^2}, \qquad (3.1)

where \hat{x}^{(i)} and x^{(i)} are the output of the network and the corresponding target vector for the i-th data sample, respectively. To optimize the deep autoencoder, we
adopted the Hessian-free optimization algorithm proposed by Martens [65]. In our
experiment, the entire dataset was divided into 12 chunks with approximately 85000
samples per batch. We utilized 2.0×10−5 for the L2 regularization factor on the con-
nection weights. For the connection weight parameter initialization, we adopted the
sparse random initialization scheme to limit the number of non-zero incoming con-
nection weights of each unit to 15. Bias parameters were initialized at 0. To pro-
cess the substantial amount of linear algebra computation involved in this optimiza-
tion algorithm, we developed a software library using the NVIDIA CUDA Basic Lin-
ear Algebra Subprograms [76]. The optimization computation was conducted on a
consumer-class personal computer with an Intel Core i7-3930K processor (3.2 GHz,
6 cores), 32 GB RAM, and a single NVIDIA GeForce GTX Titan graphics processing
unit with 6 GB on-board graphics memory.
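A minimal sketch of the sparse initialization scheme, assuming the non-zero incoming weights are drawn from a Gaussian (the exact distribution is not stated in the text):

```python
import numpy as np

def sparse_init(n_in, n_out, n_nonzero=15, scale=1.0, rng=None):
    """Each unit receives exactly `n_nonzero` non-zero incoming weights;
    all biases are initialized at 0."""
    rng = np.random.default_rng(0) if rng is None else rng
    W = np.zeros((n_in, n_out))
    for j in range(n_out):
        idx = rng.choice(n_in, size=n_nonzero, replace=False)  # pick 15 inputs
        W[idx, j] = rng.normal(0.0, scale, size=n_nonzero)
    b = np.zeros(n_out)
    return W, b

W, b = sparse_init(429, 300)   # e.g., the first layer of the 429-input autoencoder
```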
3.3.2 Visual Feature Extraction by CNN
For visual feature extraction, a CNN is trained to predict phoneme label posterior
probabilities corresponding to the mouth area input images. Mouth area images of
360 training words from six speakers were used to train and evaluate the network.
To assign phoneme labels to every frame of the mouth area image sequences, we
trained a monophone HMM with MFCCs utilizing the HTK and assigned 40 phoneme
Table 3.2: 39 types of Japanese phonemes
Category Phoneme labels
Vowels /a/ /i/ /u/ /e/ /o/
/a:/ /i:/ /u:/ /e:/ /o:/
Consonants
/b/ /d/ /g/ /h/ /k/ /m/ /n/
/p/ /r/ /s/ /t/ /w/ /y/ /z/ /ts/
/sh/ /by/ /ch/ /f/ /gy/ /hy/ /j/
/ky/ /my/ /ny/ /py/ /ry/
Others /N/ /q/
labels, including 39 Japanese phonemes (Table 3.2) and short pause /sp/, to the vi-
sual feature sequence by conducting a forced alignment using the HVite command
in the HTK. To enhance shift- and rotation-invariance, artificially modulated images
created by randomly shifting and rotating the original images are added to the orig-
inal dataset. In addition, images labeled as short pause /sp/ are eliminated, with
the exception of the five adjacent frames before and after the speech segments. The
image dataset (3.05× 105 samples) was shuffled and 5/6 of the data were used for
training; the remainder was used for the evaluation of a phoneme recognition exper-
iment. From our preliminary experiment, we confirmed that phoneme recognition
precision degrades if images from all six speakers are modeled with a single CNN.
Therefore, we prepared an independent CNN for each speaker.2 The visual features
for the isolated word recognition experiment are generated by recording neuronal
outputs (phoneme label posterior probability distribution) from the last layer of the
CNN when mouth area image sequences corresponding to 216 training words were
provided as inputs to the CNN.
A seven-layered CNN is used in reference to the work by Krizhevsky et al. [50].
Table 3.3 summarizes construction of the network containing four weighted layers:
three convolutional (C1, C3, and C5) and one fully connected (F7). The first convolu-
tional layer (C1) filters the input image with 32 kernels of 5×5 pixels with a stride of
one pixel. The second and third convolutional layers (C3 and C5) take the response-
2 We believe that this degradation is mainly due to the limited variations of lip region images that we prepared to train the CNN. To generalize the higher-level visual features that enable a CNN to attain speaker-invariant phoneme recognition, we believe that more image samples from different speakers are needed.
Table 3.3: Construction of a convolutional neural network
IN* OUT* LAYERS*
256/1024/4096 40 C1-P2-C3-P4-C5-P6-F7**
* IN, OUT, and LAYERS indicate the input dimensions, output dimensions, and network construction, respectively.
** C, P, and F denote the convolutional, local-pooling, and fully connected layers, respectively. The numbers after the layer types represent layer indices.
normalized and pooled output of the previous convolutional layers (P2 and P4) as
inputs and filter them with 32 and 64 filters of 5×5 pixels, respectively. The fully con-
nected layer (F7) takes the pooled output of the previous convolutional layer (P6) as
input and outputs a 40-way softmax, regarded as a posterior probability distribution
over the 40 classes of phoneme labels. A max-pooling layer follows the first convolu-
tion layer. Average-pooling layers follow the second and third convolutional layers.
Response-normalization layers follow the first and second pooling layers. Rectified
linear unit nonlinearity is applied to the outputs of the max-pooling layer as well as
the second and third convolutional layers. Parameters for the network structures are
empirically determined in reference to previous studies [58, 50].
The CNN is optimized to maximize the multinomial logistic regression objective
of the correct label. This is equivalent to maximizing the likelihood L defined by the
sum of log-probability of the correct label across training dataset D under the model
parameterized by θ, represented as
L(D, \theta) = \sum_{i=1}^{|D|} \log P\left(Y = y^{(i)} \mid x^{(i)}, \theta\right), \qquad (3.2)

where y^{(i)} and x^{(i)} are the class label and input pattern corresponding to the i-th data sample, respectively. The prediction distribution is defined with the softmax function as

P(Y = i \mid x, \theta) = \frac{\exp(h_i)}{\sum_{j=1}^{C} \exp(h_j)}, \qquad (3.3)

where h_i and C are the total input to output unit i and the number of classes, respec-
tively. The CNN is trained using a stochastic gradient descent method [50]. The update rule for the connection weight w is defined as

v_{i+1} = \alpha v_i - \gamma \epsilon w_i - \epsilon \left\langle \frac{\partial L}{\partial w} \Big|_{w_i} \right\rangle_{D_i}, \qquad (3.4)

w_{i+1} = w_i + v_{i+1}, \qquad (3.5)

where i is the learning iteration index, v_i is the update variable, α is the factor of momentum, ε is the learning rate, γ is the factor of weight decay, and ⟨∂L/∂w |_{w_i}⟩_{D_i} is the average over the i-th batch data D_i of the derivative of the objective with respect to w, evaluated at w_i. In our experiment, the mini-batches are one-sixth of the entire dataset for each speaker (approximately 8500 samples per batch). We utilized α = 0.9, ε = 0.001, and γ = 0.004 in our learning experiment. The weight parameters were
initialized with a zero-mean Gaussian distribution with standard deviation 0.01. The
neuron biases in all layers were initialized at 0. We used open source software (cuda-
convnet) [50] for practical implementation of the CNN. The software was processed
on the same computational hardware as the audio feature extraction experiment.
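The update rule of Eqs. (3.4) and (3.5) can be written directly; `grad` stands in for the batch-averaged derivative term, and the toy values below are only for illustration:

```python
import numpy as np

def update(w, v, grad, lr=0.001, momentum=0.9, weight_decay=0.004):
    """One momentum step following Eqs. (3.4)-(3.5):
    v <- alpha*v - gamma*eps*w - eps*grad ; w <- w + v."""
    v = momentum * v - weight_decay * lr * w - lr * grad
    return w + v, v

w = np.array([1.0, -2.0])       # toy weight vector
v = np.zeros(2)                 # update variable, initially zero
grad = np.array([0.5, -0.5])    # stand-in for the averaged derivative
w, v = update(w, v, grad)
```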
3.3.3 Audio-Visual Integration by MSHMM
In our study, we adopt a simple MSHMM with manually selected stream weights
for the multimodal integration mechanism. We utilize the HTK for the practical
MSHMM implementation. The HTK can model output probability distributions
composed of multiple streams of GMMs [114]. Each observation vector at time t is
modeled by splitting it into S independent data streams o_st. The output probability distribution of state j is represented with multiple data streams as

b_j(o_t) = \prod_{s=1}^{S} \left[ \sum_{m=1}^{M_s} c_{jsm} \, \mathcal{N}(o_{st}; \mu_{jsm}, \Sigma_{jsm}) \right]^{\gamma_s}, \qquad (3.6)
where o_t is a speech vector generated from the probability density b_j(o_t), M_s is the number of mixture components in stream s, c_jsm is the weight of the m-th component, \mathcal{N}(·; μ, Σ) is a multivariate Gaussian with mean vector μ and covariance matrix Σ, and the exponent γ_s is a stream weight for stream s.
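Equation (3.6) can be sketched numerically; the diagonal-covariance Gaussians and toy mixture parameters below are illustrative assumptions:

```python
import numpy as np

def gauss(x, mu, var):
    """Diagonal-covariance multivariate Gaussian density."""
    d = x - mu
    return np.exp(-0.5 * np.sum(d * d / var)) / np.sqrt(np.prod(2 * np.pi * var))

def stream_output_prob(obs, gmms, gammas):
    """Eq. (3.6): product over streams of GMM likelihoods, each raised
    to its stream weight gamma_s."""
    b = 1.0
    for o, comps, g in zip(obs, gmms, gammas):
        mix = sum(c * gauss(o, mu, var) for c, mu, var in comps)
        b *= mix ** g
    return b

# One audio and one visual stream, a single mixture component each (toy values)
obs = [np.zeros(2), np.zeros(3)]
gmms = [[(1.0, np.zeros(2), np.ones(2))], [(1.0, np.zeros(3), np.ones(3))]]
b = stream_output_prob(obs, gmms, [0.7, 0.3])
```

Setting a stream weight to 0 removes that stream's influence entirely, which is how the model can fall back on one modality.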
Definitions of MSHMM are generated by combining multiple HMMs indepen-
dently trained with corresponding audio and visual inputs. In our experiment, we
utilize 16 mixture components for both audio and visual output probability distribu-
tion models. When combining two HMMs, GMM parameters from audio and visual
HMMs are utilized to represent stream-wise output probability distributions. Model
parameters from only the audio HMM are utilized to represent the common state
transition probability distribution. Audio stream weights γ_a are manually prepared from 0 to 1.0 at intervals of 0.1. Accordingly, visual stream weights γ_v are prepared to satisfy γ_v = 1.0 − γ_a. In evaluating the acquired MSHMM, the best recognition rate
is selected from the multiple evaluation results corresponding to all stream weight
pairs.
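The stream-weight sweep can be sketched as follows; `evaluate` is a hypothetical callback standing in for a full recognition run at a given weight pair:

```python
def sweep_stream_weights(evaluate, step=0.1):
    """Evaluate every (gamma_a, gamma_v) pair with gamma_v = 1 - gamma_a
    and return the pair with the best recognition rate [%]."""
    best_pair, best_rate = None, -1.0
    for i in range(int(round(1.0 / step)) + 1):
        ga = round(i * step, 10)          # avoid floating-point drift
        gv = round(1.0 - ga, 10)
        rate = evaluate(ga, gv)
        if rate > best_rate:
            best_pair, best_rate = (ga, gv), rate
    return best_pair, best_rate

# Toy evaluator whose recognition rate peaks at an audio weight of 0.7
pair, rate = sweep_stream_weights(lambda ga, gv: 100 - abs(ga - 0.7) * 50)
```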
3.4 Results
3.4.1 ASR Performance Evaluation
The acquired audio features are evaluated by conducting an isolated word recogni-
tion experiment utilizing a single-stream HMM. To recognize words from the audio
features acquired by the deep denoising autoencoder, monophone HMMs with 8, 16,
and 32 GMM components are utilized. While training is conducted with 360 training
words, evaluation is conducted with 40 test words from the same speaker, thereby
yielding a closed-speaker and open-vocabulary evaluation. To enable comparison
with the baseline performance, word recognition rates utilizing the original audio
features are also prepared. To evaluate the robustness of our proposed mechanism against the degradation of audio input, partially deteriorated sound data were artificially generated by superimposing Gaussian noise of several strengths on the original sound signals. In addition to the original clean sound data, we prepared 11 different deteriorated sound datasets such that the SNRs ranged from 30 dB to −20 dB at 5 dB intervals.
Figure 3.3 shows word recognition rates for the different word recognition mod-
els for MFCCs and LMFB audio features evaluated with 12 different SNRs for sound
inputs. In Figure 3.3, changes of word recognition rates depending on the types of
audio features (MFCCs for (a) to (c) and LMFB for (d) to (f)), the types of feature
extraction mechanism, and changes of the SNR of audio inputs are shown. These
[Figure: six panels of word recognition rate [%] versus SNR [dB] (clean, 30 to −20 dB), each comparing the Original, SequenceOut, BottleNeck, and SingleFrameOut features: (a) 8 components (MFCCs); (b) 16 components (MFCCs); (c) 32 components (MFCCs); (d) 8 components (LMFB); (e) 16 components (LMFB); (f) 32 components (LMFB)]
Figure 3.3: Word recognition rate evaluation results using audio features depending on the number of Gaussian mixture components for the output probability distribution models of HMM
Table 3.4: Speaker-wise visual-based phoneme recognition rates and averaged values[%] depending on the input image sizes
Img. size p1 p2 p3 p4 p5 p6 Avr.
16×16 42.13 43.40 39.92 39.03 47.67 46.73 43.15
32×32 43.77 47.07 42.77 41.05 49.74 50.83 45.87
64×64 45.93 50.06 46.51 43.57 49.95 51.44 47.91
* p1–p6 correspond to the six speakers.
results demonstrate that MFCCs generally outperform LMFB. Sound features acquired by integrating multiple consecutive frames with a deep denoising autoencoder yield higher noise robustness than the original input. Comparing the audio features acquired from the different network architectures, “SingleFrameOut” obtains the highest recognition rates in the higher SNR range, whereas “SequenceOut” performs best in the lower SNR range. While “BottleNeck” performs slightly better than the original input in the middle SNR range, the advantage is minimal. Overall, a word recognition gain of approximately 65% was attained with denoised MFCCs under 10 dB SNR. Although recognition performance differs slightly with the number of Gaussian mixture components, the effect is not significant.
Figure 3.4 shows word recognition rates for different numbers of hidden layers of the deep denoising autoencoder utilizing MFCCs audio features, evaluated with 12 different SNRs for the sound inputs. The figure plots the changes in word recognition rate depending on the number of hidden layers of the DNN and on the SNR of the audio inputs. The deep denoising autoencoder with five hidden layers obtained the most noise-robust word recognition performance across all SNR ranges.
3.4.2 Visual-Based Phoneme Recognition Performance Evaluation
After training the CNN, phoneme recognition performance is evaluated by record-
ing neuronal outputs from the last layer of the CNN when the mouth area image
sequences corresponding to the test image data are provided to the CNN. Table 3.4
shows that the average phoneme recognition performance for the 40 phonemes, nor-
malized with the number of samples for each phoneme over six speakers, attained
approximately 48% when 64×64 pixels of mouth area images are utilized as input.
[Figure: word recognition rate [%] plotted against SNR [dB] (clean, 30 dB down to −20 dB) for deep denoising autoencoders with 1, 3, 5, and 7 hidden layers; panels: (a) 8 components, (b) 16 components, (c) 32 components]
Figure 3.4: Word recognition rate evaluation results utilizing MFCCs depending on the number of Gaussian mixture components for the output probability distribution models of HMM
Figure 3.5 shows the mean and standard deviation of the phoneme-wise recognition rates over the six speakers. Each plot marker shape corresponds to the recognition results obtained with the visual features acquired by the CNN from one of the three resolutions of the mouth area image inputs. This result generally demonstrates that visual phoneme recognition works better for vowels than for consonants. The
result derives from the observation that the mean recognition rate for the vowels is 30–90%, whereas for all other phonemes it is 0–60%. This may be attributed to the fact that the production of vowels correlates strongly with visible cues involving lip and jaw movements [7, 112].
Figure 3.6 shows the confusion matrix of the phoneme recognition evaluation
results. In Figure 3.6, the mean values from six speakers’ results are shown. It should
be noted that, in most cases, wrongly recognized consonants are classified as vowels.
This indicates that the articulation of consonants is attributable not only to the motion of the lips but also to the dynamic interaction of interior oral structures, such as the tongue, teeth, and oral cavity, which are not evident in frontal facial images.
Visually explicit phonemes, such as bilabial consonants (/m/, /p/, or /b/), are ex-
pected to be relatively well discriminated by a VSR system. However, the recognition
performance was not as high as expected. To improve the recognition rate, the pro-
cedure to obtain phoneme target labels for the CNN training should be improved.
In general pronunciation, consonant sounds are shorter than vowel sounds; therefore, the labeling of consonants is more time-critical than that of vowels. In addition, the
accuracy of consonant labels directly affects recognition performance because the
number of training samples is much smaller for consonants than it is for vowels.
3.4.3 Visual Feature Space Analysis
To analyze how the acquired visual feature space is self-organized, the trained CNN
is used to generate phoneme posterior probability sequences from test image se-
quences. Forty dimensions of the resulting sequences are processed by PCA, and
the first three principal components are extracted to visualize the acquired feature
[Figure: phoneme-wise recognition rate [%] with mean and standard deviation bars for each of the 40 phonemes, plotted for the 16×16, 32×32, and 64×64 input image resolutions]
Figure 3.5: Phoneme-wise visual-based phoneme recognition rates
[Figure: confusion matrix with recognized phonemes on the horizontal axis, true phonemes on the vertical axis, and cell values ranging from 0 to 0.8]
Figure 3.6: Visual-based phoneme-recognition confusion matrix (64×64 pixels image input)
space. Figure 3.7 shows the visual feature space corresponding to the five represen-
tative Japanese vowel phonemes, /a/, /i/, /u/, /e/, and /o/, generated from 64×64
pixels image inputs. The cumulative contribution ratio of the three selected principal components was 31.1%. As demonstrated in the graph, raw mouth area images corresponding to
the five vowel phonemes are discriminated by the CNN and clusters corresponding
to the phonemes are self-organized in the visual feature space. This result indicates
that the acquired phoneme posterior probability sequences can be utilized as visual
feature sequences for isolated word recognition tasks.
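The visualization procedure described above, PCA over the 40-dimensional posterior sequences followed by projection onto the first three principal components, can be sketched in numpy as follows; the posterior data here are randomly generated placeholders rather than actual CNN outputs.

```python
import numpy as np

rng = np.random.default_rng(1)

# Placeholder for CNN outputs: 500 frames of 40-dim phoneme posteriors
# (each row normalized like a softmax output).
logits = rng.normal(size=(500, 40))
post = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

# PCA via SVD of the mean-centered data.
X = post - post.mean(axis=0)
U, S, Vt = np.linalg.svd(X, full_matrices=False)
var = S ** 2 / (len(X) - 1)      # variance captured by each component
ratio = var / var.sum()          # per-component contribution ratio
pc3 = X @ Vt[:3].T               # coordinates on the first three PCs
cum3 = float(ratio[:3].sum())    # cumulative contribution ratio of 3 PCs
```

Plotting `pc3` as a 3-D scatter, colored by phoneme label, yields a view of the feature space analogous to Figure 3.7.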
3.4.4 VSR Performance Evaluation
The acquired visual features are evaluated by conducting an isolated word recognition experiment utilizing a single-stream HMM. To recognize words from the phoneme label sequences generated by the CNN, monophone HMMs with 1, 2, 4, 8, 16, 32, and 64 Gaussian components are utilized. Training is conducted with 360 training words and evaluation with 40 test words from the same speaker, thereby yielding a closed-speaker and open-vocabulary evaluation. To compare with the baseline performance, word recognition
rates utilizing two other visual features are also prepared. One feature has 36 dimen-
sions, generated by simply rescaling the images to 6×6 pixels, and the other feature
has 40 dimensions, generated by compressing the raw images by PCA.
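The decision rule behind this kind of evaluation, scoring each candidate word's HMM on the observed feature sequence and choosing the maximum-likelihood word, can be sketched as follows. This is not the GMM-HMM pipeline used in the experiments; it is a minimal numpy forward algorithm over single-Gaussian-emission HMMs with hypothetical dimensions.

```python
import numpy as np

rng = np.random.default_rng(2)

def logsumexp(a, axis):
    m = a.max(axis=axis, keepdims=True)
    return (m + np.log(np.exp(a - m).sum(axis=axis, keepdims=True))).squeeze(axis)

def log_gauss(x, mu, var):
    # Frame-wise log density of a diagonal-covariance Gaussian.
    return -0.5 * (np.log(2 * np.pi * var) + (x - mu) ** 2 / var).sum(-1)

def forward_loglik(obs, log_A, log_pi, mu, var):
    # Log-space forward algorithm: total likelihood of obs under one HMM.
    log_b = np.stack([log_gauss(obs, m, v) for m, v in zip(mu, var)], axis=1)
    alpha = log_pi + log_b[0]
    for t in range(1, len(obs)):
        alpha = log_b[t] + logsumexp(alpha[:, None] + log_A, axis=0)
    return float(logsumexp(alpha, axis=0))

# Two "word" models sharing uniform transitions but differing in emissions.
S, D = 3, 4
log_A = np.log(np.full((S, S), 1.0 / S))
log_pi = np.log(np.full(S, 1.0 / S))
var = np.ones((S, D))
mu_w0, mu_w1 = np.zeros((S, D)), np.full((S, D), 3.0)

obs = rng.normal(size=(20, D))      # a sequence matching word 0's emissions
ll0 = forward_loglik(obs, log_A, log_pi, mu_w0, var)
ll1 = forward_loglik(obs, log_A, log_pi, mu_w1, var)
best = 0 if ll0 > ll1 else 1        # maximum-likelihood word decision
```

Replacing the single Gaussians with mixtures of 1 to 64 components gives the model family swept in the evaluation.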
Figure 3.8 shows the word recognition rates acquired from 35 different models combining five types of visual features with seven different numbers of Gaussian mixture components, evaluated on 40 test words over six speakers. The five visual features comprise two image-based features, one generated by simply resampling the mouth area image to 6×6 pixels and the other generated by compressing the image to 40 dimensions by PCA, and three features acquired by predicting the phoneme label sequences from the three resolutions of the mouth area images utilizing the CNN. Comparison of word recognition rates from different visual features within the same number of Gaussian components shows that visual features acquired by the CNN attain higher recognition rates
[Figure: 3-D scatter plot of the visual feature space spanned by the first three principal components (PC1, PC2, PC3), showing clusters for the five vowel phonemes]
Figure 3.7: Visual feature distribution for the five representative Japanese vowel phonemes (64×64 pixels image input)
[Figure: word recognition rate [%] plotted against the number of Gaussian components (1–64) for the 6x6, PCA, CNN_16x16, CNN_32x32, and CNN_64x64 visual features]
Figure 3.8: Word recognition rates using image features
than the other two visual features. However, the effect of the different input image
resolutions is not prominent. Visual features acquired by the CNN with the 16×16 and 64×64 input image resolutions attain the highest word recognition rate, approximately 22.5%, when a mixture of 32 Gaussian components is used.
3.4.5 AVSR Performance Evaluation
We evaluated the advantages of sensory features acquired by the DNNs and noise
robustness of the AVSR by conducting an isolated word recognition task. Training
data for the MSHMM are composed of image and sound features generated from 360
training words of six speakers. For sound features, we utilized the neuronal outputs
of the straight-shaped deep denoising autoencoder with five hidden layers (Table 3.1
(d)) when clean MFCCs are provided as inputs. For visual features, we utilized the
output phoneme label sequences generated from 32× 32 pixels mouth area image
inputs by the CNN. Evaluation data for the MSHMM are composed of image and
sound features generated from the 40 test words. Thus, closed-speaker and open-
vocabulary evaluation was conducted. To evaluate the robustness of our proposed mechanism against degraded audio input, deteriorated sound data were artificially generated by superimposing Gaussian noise of several strengths on the original sound signals. In addition to the original clean sound data, we prepared 11 deteriorated versions whose SNRs ranged from 30 dB to −20 dB at 5 dB intervals. In our evaluation experiment, we compared the performance under four
different conditions. The initial two models were the unimodal models that utilize
single-frame MFCCs and the denoised MFCCs acquired by the straight-shaped deep
denoising autoencoder with five hidden layers. These are identical to the models
“Original” and “5 layers” presented in Figure 3.3 and Figure 3.4, respectively. The
third model was the unimodal model that utilized visual features acquired by the
CNN. The fourth model was the multimodal model that binds the acquired audio
and visual features by the MSHMM.
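The noise superimposition step described above, scaling white Gaussian noise so that the mixture reaches a target SNR, can be reproduced with a short numpy helper; the 440 Hz tone below is only a stand-in for a speech waveform.

```python
import numpy as np

rng = np.random.default_rng(3)

def add_noise_at_snr(signal, snr_db):
    # Scale Gaussian noise so that 10*log10(P_signal / P_noise) == snr_db.
    noise = rng.normal(size=signal.shape)
    p_signal = np.mean(signal ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_signal / (p_noise * 10.0 ** (snr_db / 10.0)))
    return signal + scale * noise

t = np.linspace(0.0, 1.0, 16000, endpoint=False)
clean = np.sin(2 * np.pi * 440.0 * t)   # toy stand-in for a clean utterance

snrs = (30, 10, -10)                    # cf. the 30 dB to -20 dB range above
measured = []
for snr in snrs:
    noisy = add_noise_at_snr(clean, snr)
    residual = noisy - clean
    measured.append(10 * np.log10(np.mean(clean ** 2) / np.mean(residual ** 2)))
```

Sweeping `snr_db` from 30 to −20 in 5 dB steps produces the 11 deteriorated conditions used in the evaluation.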
Figures 3.9 to 3.11 show word recognition rates from the four different word recognition models under 12 different SNRs for the sound inputs; each figure corresponds to a different number of Gaussian mixture components for the output probability distribution models of the HMM. Graphs on the top show changes
in word recognition rates depending on the types of utilized features and changes
in the SNR of audio inputs. “MFCC,” “DNN_Audio,” “CNN_Visual,” and “Multi-
stream” denote the original MFCCs feature, audio feature extracted by the deep de-
noising autoencoder, visual feature extracted by the CNN, and MSHMM composed of
“DNN_Audio” and “CNN_Visual” features, respectively. Graphs on the bottom show
audio stream weights that yield the best word recognition rates for the MSHMM de-
pending on changes in the audio input’s SNR.
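The decision-fusion principle of the MSHMM, a linear mixture of the per-stream log observation likelihoods governed by a stream weight, can be sketched with hypothetical per-word scores; the numbers below are illustrative, not measured values.

```python
import numpy as np

def fused_score(log_b_audio, log_b_visual, w_audio):
    # MSHMM-style decision fusion: stream-weighted sum of log likelihoods,
    # with the audio and visual weights constrained to sum to one.
    return w_audio * log_b_audio + (1.0 - w_audio) * log_b_visual

# Hypothetical per-word log-likelihoods for one noisy test utterance.
log_b_audio = np.array([-120.0, -80.0])    # audio stream favors word 1
log_b_visual = np.array([-60.0, -90.0])    # visual stream favors word 0

decisions = [int(np.argmax(fused_score(log_b_audio, log_b_visual, w)))
             for w in (1.0, 0.5, 0.0)]
# As the audio weight decreases (i.e., audio reliability drops), the
# decision shifts from the audio-preferred word to the visual-preferred one.
```

Selecting, for each SNR, the weight that maximizes recognition accuracy yields the audio stream weight curves shown in the bottom graphs.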
These results demonstrate that when two modalities are combined to represent
the acoustic model, the word recognition rates are improved, particularly for lower
SNRs. At minimum, the same or better performance was attained compared with the cases in which each feature is utilized independently. For example, the MSHMM attained an additional 10% word recognition rate gain under 0 dB SNR for the audio
signal input compared with the case when single-stream HMM and denoised MFCCs
are utilized as the recognition mechanism and input features, respectively. Although
[Figure: top, word recognition rate [%] vs. SNR [dB] for the MFCC, DNN_Audio, CNN_Visual, and Multi-stream models; bottom, best-performing audio stream weight vs. SNR [dB]]
Figure 3.9: Word recognition rate evaluation results (8 components)
[Figure: top, word recognition rate [%] vs. SNR [dB] for the MFCC, DNN_Audio, CNN_Visual, and Multi-stream models; bottom, best-performing audio stream weight vs. SNR [dB]]
Figure 3.10: Word recognition rate evaluation results (16 components)
[Figure: top, word recognition rate [%] vs. SNR [dB] for the MFCC, DNN_Audio, CNN_Visual, and Multi-stream models; bottom, best-performing audio stream weight vs. SNR [dB]]
Figure 3.11: Word recognition rate evaluation results (32 components)
there is a slight recognition performance difference depending on the increase in the
number of Gaussian mixture components, the effect is not significant.
3.5 Discussion and Future Work
3.5.1 Current Need for the Speaker Dependent Visual Feature Extraction Model
In our study, we demonstrated an isolated word recognition performance from vi-
sual sequence inputs by the integration of CNN and HMM. We showed that the CNN
works as a phoneme recognition mechanism with mouth region image inputs. How-
ever, our current results are attained by preparing an independent CNN correspond-
ing to each speaker. As generally discussed in previous deep learning studies [50, 56],
the number and variation of training samples are critical for maximizing the gener-
alization ability of a DNN. A DNN (CNN) framework is scalable; however, it requires
a sufficient training dataset to reduce overfitting [50]. Therefore, in future work, we
need to investigate the possibility of realizing a VSR system applicable to multiple
speakers with a single CNN model by training and evaluating our current mechanism
with a more diverse audio-visual speech dataset that has large variations, particularly
for mouth region images.
3.5.2 Positioning of our VSR Results with Regard to the State of the Art in Lip Reading
Most current lip reading experiments are still limited to rather simple tasks, such as isolated or connected random words, digits, or letters. Moreover, a universally acknowledged benchmark has not been established in lip reading studies. The major reasons are that (1) preparing a lip reading data corpus manually, or at best semi-automatically, requires an immense amount of time and effort, and (2) consequently, the available data corpora remain very limited. Therefore, although some of the reported experimental
results are listed below, it is important to keep in mind that a fair comparison to mul-
tiple experiments is difficult to provide [17].
For the isolated word recognition task, Nefian et al. [74] report 66.9%, Zhang et
al. [115] report 45.6%, and Kumar et al. [51] report 42.3% recognition rates. These rates are more than twice as high as our current results. However, a closer look at the experimental conditions indicates that the data corpus used by Nefian et al. and Zhang et al. includes only 78 words by ten speakers with ten repetitions; nine examples of each word were used for training and the remaining example for testing. Although the closed speaker condition is common to our experiment, their evaluation does not use unknown words, in contrast to our open vocabulary setting. Moreover, our preliminary experiment with the
trast to our open vocabulary setting. Moreover, our preliminary experiment with the
closed speaker setting attained around 63% recognition rate (Figure 3.12), which is
competitive to previous studies. The experiment by Kumar et al. is conducted with
a corpus that includes 150 words by ten speakers. In this case, the closed speaker
setting is the same, but whether an open vocabulary test is conducted is unclear.
In conclusion, while our current isolated word recognition results did not exhibit
cutting-edge performance, we can consider that our results reached a state-of-the-
art level given the following experimental conditions: closed speaker setting with six
speakers and an open vocabulary setting with a 400-word vocabulary, 360 words for training, and 40 words for testing.
3.5.3 Adaptive Stream Weight Selection
Our AVSR system utilizing MSHMM achieved satisfactory speech recognition perfor-
mance, despite its quite simple mechanism, especially for audio signal inputs with
lower reliability. The transition of the stream weight in accordance with changes in
the SNR for the audio input (Figure 3.9 to Figure 3.11) clearly demonstrates that the
MSHMM can prevent degradation of recognition precision by shifting the observa-
tion information source from audio input to visual input, even if the quality of the
audio input degrades. However, to apply our AVSR approach to real-world applica-
tions, automatic and adaptive selection of the stream weight in relation to changes
in audio input reliability becomes an important issue to be addressed.
[Figure: top, word recognition rate [%] vs. SNR [dB] for the MFCC, DNN_Audio, CNN_Visual, and Multi-stream models; bottom, best-performing audio stream weight vs. SNR [dB]]
Figure 3.12: Word recognition rate evaluation results (32 components, speaker-closed evaluation)
3.5.4 Relations of our AVSR Approach with DNN-HMM Models
As an experimental study for an AVSR task, we adopted a rather simple tandem ap-
proach, a connectionist-HMM [39]. Specifically, we applied heterogeneous deep
learning architectures to extract the dedicated sensory features from audio and vi-
sual inputs and combined the results with an MSHMM. We acknowledge that a DNN-
HMM is known to be advantageous for directly estimating the state posterior prob-
abilities of an HMM from raw sensory feature inputs over conventional GMM-HMM
owing to the powerful nonlinear projection capability of DNN models [40]. In the fu-
ture, it might be interesting to formulate an AVSR model based on the integration of
DNN-HMM and MSHMM. This novel approach may succeed because of the recogni-
tion capability of DNNs and the simplicity and explicitness of the proposed decision
fusion approach.
3.6 Summary
In this chapter, we proposed an AVSR system based on deep learning architectures
for audio and visual feature extraction and an MSHMM for multimodal feature inte-
gration and isolated word recognition. The main targets discussed in this chapter are
summarized in Figure 3.13.
Our experimental results demonstrated that, compared with the original MFCCs,
the deep denoising autoencoder can effectively filter out the effect of noise superim-
posed on original clean audio inputs and that acquired denoised audio features at-
tain significant noise robustness in an isolated word recognition task. Furthermore,
our visual feature extraction mechanism based on the CNN effectively predicted the
phoneme label sequence from the mouth area image sequence, and the acquired
visual features attained significant performance improvement in the isolated word
recognition task relative to conventional image-based visual features, such as PCA-compressed images.
Finally, an MSHMM was utilized for an AVSR task by integrating the acquired audio
and visual features.
The next major target of our work is to examine the possibility of applying our cur-
rent approach to develop practical, real-world applications. Specifically, future work
will include a study to evaluate how the VSR approach utilizing translation, rotation,
or scaling invariant visual features acquired by the CNN contributes to robust speech
recognition performance in a real-world environment, where dynamic changes such
as reverberation, illumination, and facial orientation occur.
[Figure: overview diagram of the dissertation structure highlighting the targets addressed in Chapter 3]
Figure 3.13: The main targets discussed in Chapter 3
Chapter 4
Learning Framework for Multimodal
Integration of Robot Behaviors
4.1 Introduction
In Chapter 3, the sensory feature extraction performances of the two representative
DNN mechanisms are evaluated. As a practical evaluation experiment, an AVSR task
is conducted to investigate how noise robust speech recognition becomes possible
by utilizing the sensory features acquired from different DNN frameworks and by
integrating those multimodal features. By applying an MSHMM for the multimodal
integration learning, the temporal sequences extracted from the speech signals are
modeled with a discrete representation of state transition probability. Moreover, the
multimodal integration is attained by an explicit linear mixture of the observation
probability models.
This approach is an intuitive and straightforward way for temporal sequence
recognition tasks like speech recognition, because the main focus of the task is just
to ‘recognize’ by symbolizing raw sensory signals. However, the approach is not suit-
able for sensory-motor coordination tasks such as robot behavior learning because
recognition using an MSHMM is specialized for acquiring symbolic representation
from raw signals, and thus, the reconstruction of raw signals from the acquired sym-
bolic representation is not considered. Therefore, the approach requires the design of
an external mechanism once generation of action commands corresponding to the
recognized states is considered.
To overcome this issue, we propose a multimodal temporal sequence integration
learning framework utilizing a DNN. In this chapter, we propose the application of a deep autoencoder not only for feature extraction by dimensionality compression but also for multimodal temporal sequence integration learning. Our main
contribution is to demonstrate that our proposed framework serves as a cross-modal
memory retriever, as well as a temporal sequence predictor utilizing its powerful gen-
eralization capabilities. In the sections that follow, we first illustrate the basic mech-
anism of the autoencoder and then explain how the autoencoder is applied to multimodal temporal sequence learning and its further functions.
4.2 Multimodal Temporal Sequence Learning using a
DNN
4.2.1 Sensory Feature Extraction
High-dimensional raw sensory inputs, such as visual images or sound spectrums,
can be converted into low-dimensional feature vectors by multilayer networks with a
small central layer (i.e., a feature-extraction network) [41]. To this end, the networks
are trained with the goal of reconstructing the input data at the output layer with
input-output mappings defined as
u_t = f(r_t),   (4.1)
\hat{r}_t = f^{-1}(u_t),   (4.2)

where r_t, u_t, and \hat{r}_t are the vectors representing the raw input data, the corresponding feature, and the reconstructed data, respectively. Functions f(·) and f^{-1}(·) represent the transformation mappings from the input layer to the central hidden layer and from the central hidden layer to the output layer of the network, respectively.
coder compresses the dimensionality of inputs by decreasing the number of nodes
from the input layer to the central hidden layer. Hence, the number of central hidden
layer nodes determines the dimension of the feature vector. In a symmetric fashion,
the original input is reconstructed from the feature vector by eventually increasing
the number of nodes from the central hidden layer to the output layer.
Regarding dimensionality compression mechanisms, a simple and commonly
utilized approach is PCA; however, Hinton et al. demonstrated that the deep autoen-
coder outperformed PCA in image reconstruction and compressed feature acquisi-
tion [41]. In reference to their work, we utilized the deep autoencoder for our di-
mensionality compression framework because we prioritized the precision of cross-
modal memory retrieval and the sparseness of acquired features to ease the behavior
recognition task via a conventional classifier.
4.2.2 Multimodal Integration Learning using Time-delay Networks
A time-delay neural network (TDNN) is a method for utilizing a feed-forward neu-
ral network for multi-dimensional temporal sequence learning [55]. Motivated by
TDNN, we propose a novel computational framework that utilizes a deep autoen-
coder for temporal sequence learning.
An input to the temporal sequence learning network at a single time step is de-
fined by a time segment of the tuple of joint angle vectors, image feature vectors, and
sound feature vectors, formatted as
s_t = (a_{\bar{t}}, u^i_{\bar{t}}, u^s_{\bar{t}}),   (4.3)
{\bar{t} | t − T + 1 ≤ \bar{t} ≤ t},   (4.4)

where s_t, a_t, u^i_t, and u^s_t are the vectors representing the input to the network, the joint angle, the image feature, and the sound feature at time t, respectively, and T is the length of the time window. Here, \bar{t} represents the previous T steps of the temporal segment ending at t, and a vector with subscript \bar{t} indicates a time series of the vector.
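Concretely, the windowed input of Eqs. (4.3) and (4.4) can be assembled by concatenating the last T steps of each modality; the dimensionalities below are hypothetical, chosen only for the sketch.

```python
import numpy as np

def make_segment(joints, img_feat, snd_feat, t, T):
    # Eqs. (4.3)/(4.4): flatten the last T steps of the joint angle, image
    # feature, and sound feature sequences into one input vector s_t.
    window = slice(t - T + 1, t + 1)
    return np.concatenate([joints[window].ravel(),
                           img_feat[window].ravel(),
                           snd_feat[window].ravel()])

# Hypothetical dimensions: 10 joint angles, 30-dim image feature,
# 24-dim sound feature, and a window of T = 5 steps.
steps = 100
joints = np.zeros((steps, 10))
img_feat = np.zeros((steps, 30))
snd_feat = np.zeros((steps, 24))

s_t = make_segment(joints, img_feat, snd_feat, t=50, T=5)
# s_t has 5 * (10 + 30 + 24) = 320 dimensions.
```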
The input-output mappings of the temporal sequence learning network are defined
as
v_t = g(s_t),   (4.5)
\hat{s}_t = g^{-1}(v_t),   (4.6)

where v_t and \hat{s}_t = (\hat{a}_{\bar{t}}, \hat{u}^i_{\bar{t}}, \hat{u}^s_{\bar{t}}) are the multimodal feature vector and a segment of the restored multimodal temporal sequence, respectively. Functions g(·) and g^{-1}(·) represent the transformation mappings from the input layer to the central hidden layer and from the central hidden layer to the output layer of the network, respectively.
One of the merits of applying neural networks to multimodal temporal sequence
learning is their generalization capability. Because the network can complement de-
ficiencies in the input data, the temporal sequence learning network can be used
in two different ways: (1) to retrieve a temporal sequence of one modality from the others (Figure 4.1(a), (b)) and (2) to predict a future sequence from the past sequence (Figure 4.1(c)). Thus, the temporal sequence learning network serves as a
cross-modal memory retriever or a temporal sequence predictor: the input data from outside the network are masked in either a spatial or a temporal manner, and the generated outputs are iteratively fed back to the inputs as substitutions for the masked inputs. The practical implementation of these functions is described in the following
subsections.
4.3 Applications
4.3.1 Cross-modal Memory Retrieval
Cross-modal memory retrieval is realized by self-generating sequences for a modal-
ity inside the network by providing corresponding sequences for the other modalities
from outside the network. For the retrieved modality, a recurrent loop from the out-
put nodes to the input nodes is prepared. Hence, in the case of generating an image
sequence from motion and sound sequences, input to the network is defined as
s_t = (a_{\bar{t}}, \hat{u}^i_{\bar{t}}, u^s_{\bar{t}}).   (4.7)
[Figure: three configurations of the temporal sequence learning network: (a), (b) cross-modal memory retrieval, with one modality's input masked and fed back from the output; (c) temporal sequence prediction, with future time steps masked and fed back]
Figure 4.1: Examples of cross-modal memory retrieval and sequence prediction
As shown in Figure 4.2, the time segment of the recurrent input is generated by shifting the previous output of the network forward by one time step: (1) the oldest time step of the output is discarded, and (2) the latest time step is filled with the newest value acquired from the output.
[Figure: recurrent input buffer over time steps t−T+1, …, t: the oldest step of the previous output is discarded and the newest generated step is appended]
Figure 4.2: Buffer shift of the recurrent input
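The buffer shift of Figure 4.2 amounts to discarding the oldest step of the recurrent input and appending the newest generated step; a minimal sketch with hypothetical dimensions:

```python
import numpy as np

def shift_buffer(buffer, network_output):
    # Figure 4.2: drop the oldest time step and append the newest time step
    # of the network's output, keeping the window length T fixed.
    newest = network_output[-1]
    return np.concatenate([buffer[1:], newest[None]], axis=0)

# Toy closed-loop feedback for a retrieved modality (T = 5, 3-dim feature).
T, d = 5, 3
buffer = np.zeros((T, d))
for step in range(10):
    # Stand-in for the reconstructed output of the temporal sequence
    # learning network at this iteration.
    output = np.full((T, d), float(step))
    buffer = shift_buffer(buffer, output)
# The buffer now holds the five most recently generated steps.
```

The same shift, applied only to the masked future steps, realizes the prediction loop of the next subsection.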
4.3.2 Temporal Sequence Prediction
Similarly, temporal sequence prediction is realized by constructing a recurrent loop from the output layer to the input layer. The difference is that, among all T steps of the time window, only the first T_in steps (i.e., the past T_in steps up to the present time step t) of each modality are filled with the input data; the rest (i.e., the future T − T_in steps to be predicted) are filled with the outputs from the previous time step. Hence, the input to the network is defined as
s_t = (a_{t_1}, \hat{a}_{t_2}, u^i_{t_1}, \hat{u}^i_{t_2}, u^s_{t_1}, \hat{u}^s_{t_2}),   (4.8)
{t_1 | t − T_in + 1 ≤ t_1 ≤ t},   (4.9)
{t_2 | t + 1 ≤ t_2 ≤ t + (T − T_in)}.   (4.10)
As shown in Figure 4.3, the prediction segment of the recurrent input is generated by shifting the corresponding previous outputs of the network forward by one time step.
[Figure: recurrent input buffer over time steps t−T_in+1, …, t+T−T_in: the first T_in steps are filled with input data and the remaining predicted steps are filled from the previous outputs]
Figure 4.3: Buffer shift of the recurrent input for temporal sequence prediction
4.4 Summary
In this chapter, we proposed a feature extraction framework using a DNN that enables not only the extraction of compressed features from raw sensory inputs by dimensionality reduction but also the reconstruction of the original information from the acquired features. Moreover, theoretical applications of the proposed framework to the multimodal integration learning of temporal sequences, including the visual, auditory, and motion modalities, are presented. The main targets discussed in this chapter are summarized
in Figure 4.4.
[Figure: overview diagram of the dissertation structure highlighting the targets addressed in Chapter 4]
Figure 4.4: The main targets discussed in Chapter 4
Chapter 5
Applications for Recognition and
Generation of Robot Behaviors
5.1 Introduction
In Chapter 4, we proposed a theoretical framework for multimodal integration learn-
ing and cross-modal memory retrieval using a DNN. In this chapter, our proposed
model is evaluated by conducting experiments using a humanoid robot in the real-
world environment. In practice, cross-modal memory retrieval, temporal sequence
prediction, and noise-robust behavior recognition functions are evaluated by training the proposed model with the sensory-motor information acquired by directly teaching a humanoid robot multiple object manipulation behaviors. Through
the experiments, we investigate the possibility of applying a deep learning frame-
work to the sensory-motor coordination problem on robotic applications, especially
with high-dimensional and large-scale raw sensory temporal sequences.
5.2 Construction of the Proposed Framework
Figure 5.1 depicts a schematic diagram of our proposed framework. Two indepen-
dent deep neural networks are utilized for image compression and temporal se-
quence learning. The image compression network, shown in Figure 5.1(a), inputs
[Figure: (a) image compression network mapping raw images to feature vectors and reconstructing them; (b) temporal sequence learning network taking the multimodal segments (a_{t−T+1}, u^i_{t−T+1}), …, (a_t, u^i_t) and reconstructing them via the multimodal feature layer]
Figure 5.1: Multimodal behavior learning and retrieving mechanism
raw RGB color images acquired from a camera mounted on the head of the robot
and outputs the corresponding feature vectors from the central hidden layer. The
image features are synchronized with the joint angle vectors acquired from both arm
joints, and multimodal temporal segments are generated. The multimodal tempo-
ral segments are then fed into the temporal sequence learning network (i.e., Figure
5.1(b)). Accordingly, multimodal features are acquired from the central hidden layer,
while reconstructed multimodal temporal segments are obtained from the output
layer.
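As an illustration only, the two-network pipeline described above can be sketched as follows; the layer sizes, initialization, and the `Autoencoder` class itself are simplified stand-ins for exposition (the actual encoder dimensions appear in Table 5.1), not the trained model:

```python
import numpy as np

rng = np.random.default_rng(0)

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

class Autoencoder:
    """Toy symmetric autoencoder: logistic layers, linear central code layer."""
    def __init__(self, dims):
        full = dims + dims[-2::-1]          # mirror the encoder for the decoder
        self.W = [rng.normal(0.0, 0.1, (a, b)) for a, b in zip(full[:-1], full[1:])]
        self.code_at = len(dims) - 1        # index of the central layer

    def forward(self, x):
        h, code = np.asarray(x, dtype=float), None
        for i, W in enumerate(self.W):
            h = h @ W
            if i == self.code_at - 1:       # central hidden layer stays linear
                code = h.copy()
            else:
                h = logistic(h)
        return code, h                      # (feature vector, reconstruction)

# (a) image compression network: 900-dim frame -> 30-dim image feature
img_ae = Autoencoder([900, 80, 30])
# (b) temporal sequence network: 30-step window of 10 joints + 30 image features
seq_ae = Autoencoder([1200, 80, 30])

frames = rng.random((30, 900))              # flattened 20x15 RGB frames
joints = rng.random((30, 10))               # synchronized joint angle vectors
img_feats = np.stack([img_ae.forward(f)[0] for f in frames])
segment = np.concatenate([joints, img_feats], axis=1).ravel()   # 1200-dim
mm_feat, recon = seq_ae.forward(segment)    # multimodal feature, reconstruction
```

The multimodal feature is read from the central layer, and the reconstruction from the output layer, as in the text.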
The outputs from the temporal sequence learning network are used for both
robot motion generation and image retrieval. The joint angle outputs from the net-
work are rescaled and resent to the robot as joint angle commands for generating
motion. The network can also reconstruct the retrieved images in the original form
by decompressing the image feature outputs, because the image compression network models the identity map from the inputs to the outputs via feature vectors in the central hidden layer.
5.3 Experimental Setup
Our proposed mechanisms are evaluated by conducting object manipulation exper-
iments with the small humanoid robot NAO, developed by Aldebaran Robotics [87].
The multimodal data, including image frames and joint angles, are recorded syn-
chronously at approximately 10 fps. For the image data input, the original 320 × 240
image is resized to a 20 × 15 matrix of pixels in order to meet the memory resource
availability limitation of our computational environment1. For joint angle data in-
put, 10 degrees of freedom of the arms (from the shoulders to the wrists) are used.
Six different object manipulation behaviors identified by different colorful toys
are prepared for training (Figure 5.2). The details of the object manipulation behav-
iors are as follows:
• (a) Ball lift: holding a yellow ball on the table with both hands, then raising the
ball to shoulder height three times with up-and-down movements
• (b) Ball roll: iteratively rolling a blue ball on top of the table to the right and left
by using alternating arm movements
• (c) and (d) Bell ring L/R: ringing a green bell placed on either the right or left side
of the table by the corresponding arm motion
• (e) Ball roll on a plate: rolling an orange ball placed in a deeply edged plate at-
tached to both hands, and alternately swinging both arms up and down
• (f) Ropeway: swinging a red ball hanging from a string attached to both hands
by alternately moving both arms up and down
We record the multimodal temporal sequence data by generating different arm
1 We utilized a personal computer with an Intel Core i7-3930K processor (3.2 GHz, 6 cores), 32 GB main memory, and a single NVIDIA GeForce GTX 680 graphics processing unit with 4 GB of on-board graphics memory. Because the size of the weight matrices of a multi-layered neural network grows rapidly as the input dimension increases, we felt it sensible to keep the number of input dimensions as small as possible, as long as the dimensionality reduction did not critically degrade the quality of our experiments. As a result of preliminary experimentation, we found that all of our memory retrieval experiments are feasible even with this reduced image resolution.
Figure 5.2: Object manipulation behaviors — (a) ball lift, (b) ball roll, (c) bell ring L, (d) bell ring R, (e) ball roll on a plate, (f) ropeway
motions corresponding to each object manipulation by direct teaching. The result-
ing lengths of the motion sequences are between 100 and 200 steps, which is equiv-
alent to between 10 and 20 s. To balance the total motion sequence lengths between
different behaviors, direct teaching is repeated six to 10 times for each behavior,
such that the number of repetitions becomes inversely proportional to the motion
sequence length. Among all of the repetitions, one result is used as test data and the
others are used as training data. For multimodal temporal sequence learning, we use
a contiguous segment of 30 steps from the original time series as a single input. By
sliding the time window by one step, consecutive data segments are generated.
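A minimal sketch of this sliding-window segmentation (the function name and array shapes are illustrative, not from the dissertation):

```python
import numpy as np

def sliding_segments(seq, window=30, step=1):
    """Crop overlapping fixed-length segments from a (T, D) time series
    by sliding a window of `window` steps one step at a time."""
    T = len(seq)
    return np.stack([seq[t:t + window] for t in range(0, T - window + 1, step)])

# A 100-step sequence of 40-dim multimodal vectors yields 71 segments,
# each of which flattens to the 1200-dim network input.
seq = np.zeros((100, 40))
segs = sliding_segments(seq)
print(segs.shape)            # (71, 30, 40)
```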
Table 5.1 summarizes the datasets and associated experimental parameters. For
both the image feature and temporal sequence learning, the same 12-layer deep neu-
ral network is used. In each case, the decoder architecture is a mirror-image of the
encoder, yielding a symmetric autoencoder. The parameter settings of the network
structures are empirically determined with reference to such previous studies as [41]
and [49]. The input and output dimensions of the two networks are defined as fol-
lows: 900 for image feature learning, which is defined by 20 × 15 matrices of pixels
for the RGB colors; and 1200 for temporal sequence learning, which is defined by the
Table 5.1: Experimental parameters

           TRAIN*   TEST*   I/O*   ENCODER DIMS*
  IFEAT**  8444     948     900    1000-500-250-150-80-30
  TSEQ**   20548    776     1200   1000-500-250-150-80-30

* TRAIN, TEST, I/O, and ENCODER DIMS indicate the size of the training data, the test data, the input and output dimensions, and the encoder network architecture, respectively.
** IFEAT and TSEQ stand for image feature and temporal sequence, respectively.
30-step segment of the 40-dimension multimodal vector composed of 10 joint an-
gles and the 30-dimension image feature vector. For the activation functions, linear
functions are used for the central hidden layers of both networks, and logistic functions are used for the rest of the layers, following [41].
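The dimension bookkeeping in this paragraph can be checked directly:

```python
# Input/output dimensions of the two networks (cf. Table 5.1)
image_dim = 20 * 15 * 3                        # resized RGB frame -> 900
window, joint_dim, img_feat_dim = 30, 10, 30
seq_dim = window * (joint_dim + img_feat_dim)  # 30-step segment -> 1200

# The same encoder architecture is shared by both networks; each decoder mirrors it
encoder_dims = [1000, 500, 250, 150, 80, 30]

assert image_dim == 900
assert seq_dim == 1200
```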
The length of the time window is determined by considering the following two
constraints. First, if the length of the time window increases, the network may con-
sider longer contextual information. Second, if the length of the time window be-
comes too long, the dimension of the multimodal temporal vector becomes too big
to be processed in an acceptable amount of time. The implicit policy is to keep the
input dimensions below 3000 because of our computational limitation. As the mul-
timodal vector dimension is 40, the temporal sequence length should be below 75.
Considering the cyclic frequencies of the joint angle trajectories acquired from the
six object manipulation behaviors, we determine that 30 steps are enough to charac-
terize a phase of the behaviors.
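The bound above follows directly from the stated budget:

```python
budget = 3000                  # soft ceiling on network input dimensions
mm_dim = 10 + 30               # joint angles + image features per step
max_window = budget // mm_dim  # longest admissible temporal segment
print(max_window)              # 75
```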
For multimodal integration learning, we trained the temporal sequence learn-
ing network using additional examples that have only a single modality to explicitly
model the correlations across the modalities [75]. In practice, we added examples
that have noisy values for one of the input modalities (e.g., the image feature) and
original values for the other input modality (e.g., the joint angles) but still require the
network to reconstruct both modalities. Thus, one-third of the training data has only
image features for input, while another one-third of the data has only joint angles
and the last one-third has both image features and joint angles. For the noisy values,
we superimpose Gaussian noise with a standard deviation of 0.1 on the original data.
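A sketch of this training-set construction, with hypothetical array shapes (the three variants and the clean reconstruction target follow the scheme of [75]):

```python
import numpy as np

rng = np.random.default_rng(0)

def multimodal_variants(joints, img_feats, sigma=0.1):
    """Return the three input variants for one windowed example:
    both modalities intact, image-features-only (joints corrupted), and
    joint-angles-only (image features corrupted). The reconstruction
    target is the clean pair in every case."""
    corrupt = lambda x: x + rng.normal(0.0, sigma, x.shape)  # superimposed noise
    clean = np.concatenate([joints, img_feats], axis=1)
    img_only = np.concatenate([corrupt(joints), img_feats], axis=1)
    mtn_only = np.concatenate([joints, corrupt(img_feats)], axis=1)
    return np.stack([clean, img_only, mtn_only]), np.stack([clean] * 3)

inputs, targets = multimodal_variants(np.zeros((30, 10)), np.zeros((30, 30)))
```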
5.4 Results
5.4.1 Cross-modal Memory Retrieval and Temporal Sequence Pre-
diction of Object Manipulation Behaviors
We conducted two experiments to evaluate cross-modal memory retrieval perfor-
mance. One experiment generates the joint angle sequence (motion) by providing
image sequences, whereas the other generates an image sequence by providing the
joint angle sequence. For these experiments, inputs to either modality of the full
30 steps are provided, and the sequence for the other modality is internally gener-
ated in a closed-loop manner (see 4.3.1). In the experiment to evaluate temporal sequence prediction, the input window length is defined as T_in = 25, and the corresponding future five steps are internally generated as predictions (see 4.3.2). For
all of the experimental settings above, although the initial values for the recurrent
inputs are randomly generated, the internal values eventually converge to the cor-
responding states in association with the input values of the other modalities by the
generalization capability of the network.
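The closed-loop retrieval described above can be sketched as follows; `net` stands in for the trained temporal sequence network (any function mapping a 1200-dim window to its reconstruction), and the iteration count is an arbitrary choice for this sketch:

```python
import numpy as np

rng = np.random.default_rng(0)

def cross_modal_retrieve(net, joints_seq, img_dim=30, iters=20):
    """Clamp the joint-angle half of the window to observations; initialize
    the image-feature half randomly and iteratively overwrite it with the
    network's own output until it converges."""
    T, jdim = joints_seq.shape
    window = np.concatenate([joints_seq, rng.random((T, img_dim))], axis=1)
    for _ in range(iters):
        recon = net(window.ravel()).reshape(T, jdim + img_dim)
        window[:, jdim:] = recon[:, jdim:]   # feed retrieved features back
    return window[:, jdim:]                  # the recalled image features
```

The joint-angle-from-image direction is symmetric: clamp the image features and recycle the joint-angle outputs instead.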
Figure 5.3 shows the example results of joint angle sequence generation from the
image sequence input and temporal sequence prediction. We generated full length
trajectories of the object manipulation behavior by accumulating the iteratively re-
trieved joint angle vectors acquired from the 30th (final) step of the temporal window.
In the figure, graphs on the top row (Figure 5.3(a)) are the original motion trajectories
in the test data. Graphs on the second row (Figure 5.3(b)), i.e., the reconstructed tra-
jectories acquired by cross-modal memory retrieval from the image sequence, show
that the appropriate trajectories are generated and the configurations of the trajecto-
ries are clearly differentiated according to the provided image sequences. Graphs on
the bottom row (Figure 5.3(c)), i.e., the reconstructed trajectories acquired by tem-
poral sequence prediction, show that our proposed mechanism correctly predicted
future joint angles five steps ahead of the 25 steps of the multimodal temporal se-
quence. The reconstructed trajectories correspond to the same behaviors shown for
the top row. The low reconstruction qualities of the first 30 steps are attributed to the
random values supplied for the recurrent inputs at the initial iteration of the genera-
tion process.
Figure 5.3: Example of motion reconstructions by our proposed model — scaled joint angles plotted over steps for the six behaviors (ball lift, ball roll, bell ring L, bell ring R, ball roll on a plate, ropeway); rows (a), (b), and (c) show the original, cross-modally retrieved, and predicted trajectories, respectively
Figure 5.4: Example of image reconstructions by our proposed model — (a) original and (b) reconstructed frames for ball lift, ball roll, bell ring L, bell ring R, ball roll on a plate, and ropeway
Figure 5.4 shows example results of image sequence generation from the joint
angle sequence input. The images shown in the figure are single frames drawn from
the series of images for each behavior. In the figure, images on the top row (Fig-
ure 5.4(a)) show the original images decompressed from the image feature vector
in the test data. Images on the bottom row (Figure 5.4(b)) show the correspond-
ing reconstructed images decompressed from the feature vectors acquired by cross-
modal memory retrieval from the joint angle sequence. Although the details of the
images are slightly different, the objects showing up in the images are correctly re-
constructed, and the locations of the color blobs are properly synchronized with the
phases of the motion.
We conducted a quantitative evaluation of cross-modal memory retrieval by
preparing 10 different initial model parameter settings for the networks and repli-
cating the experiment of learning the same dataset composed of the six object ma-
nipulation behaviors. Table 5.2 summarizes these results. In the table, IMG → MTN
indicates image to motion, whereas MTN → IMG indicates motion to image; fur-
ther, the temporal sequence prediction (PRED) performances for the six behavior
patterns are also shown. The numbers given in each entry of the table represent the
root mean square (RMS) errors of the reconstructed trajectories (normalized by scal-
ing between 0 and 1) on the test data. The RMS errors in Table 5.2 demonstrate that
the reconstruction errors are below 10 percent for all of the evaluation conditions.
In detail, each of the RMS errors is calculated as

E_{IMG \to MTN} = \sqrt{ \frac{1}{T_{seq}} \sum_{t=1}^{T_{seq}} \left| a_t - \hat{a}_t \right|^2 },   (5.1)

E_{MTN \to IMG} = \sqrt{ \frac{1}{T_{seq}} \sum_{t=1}^{T_{seq}} \left| r^i_t - \hat{r}^i_t \right|^2 },   (5.2)

E_{PRED} = \sqrt{ \frac{1}{T_{seq}} \sum_{t=1}^{T_{seq}} \left| s_t - \hat{s}_t \right|^2 },   (5.3)

where E_{IMG \to MTN}, E_{MTN \to IMG}, and E_{PRED} are the RMS errors corresponding to the reconstruction modes identified by their subscripts; a_t and \hat{a}_t, r^i_t and \hat{r}^i_t, and s_t and \hat{s}_t are the ground-truth and reconstructed vectors representing the joint angles, the raw image data, and the multimodal features at time t, respectively; and T_{seq} is the length of the test sequence for each of the behaviors.
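Under the definitions of Eqs. (5.1)–(5.3), the error measure can be computed as follows (a sketch; the per-step norm is taken as the Euclidean vector norm):

```python
import numpy as np

def rms_error(truth, recon):
    """sqrt( (1/T_seq) * sum_t |x_t - x_hat_t|^2 ) for (T_seq, D) sequences."""
    diff = np.asarray(truth) - np.asarray(recon)
    return np.sqrt(np.mean(np.sum(diff ** 2, axis=-1)))
```

Applied to the joint-angle, image-feature, or multimodal test sequences, this yields E_IMG→MTN, E_MTN→IMG, and E_PRED, respectively.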
Finally, to analyze the temporal sequence prediction performance in more detail,
we evaluated the prediction errors at the last (30th) step of the time window, depend-
ing on the prediction length, by varying the input window length T_in from 25 to five
in decreasing steps of five. Figure 5.5 shows the temporal sequence prediction errors
of six object manipulation behaviors depending on the prediction length. The mean
and standard deviation are calculated from 10 replicated learning experiments. As
expected, the RMS errors demonstrate that the prediction error increases as predic-
Table 5.2: Reconstruction errors

            IMG → MTN                MTN → IMG                PRED
  LIFT*     7.11×10⁻² (1.44×10⁻³)    1.76×10⁻² (8.99×10⁻⁴)    3.91×10⁻² (6.47×10⁻⁴)
  ROLL*     7.05×10⁻² (1.55×10⁻³)    4.45×10⁻² (1.20×10⁻³)    4.41×10⁻² (7.33×10⁻⁴)
  RING-L*   4.95×10⁻² (2.64×10⁻³)    1.83×10⁻² (4.72×10⁻⁴)    2.21×10⁻² (8.19×10⁻⁴)
  RING-R*   3.64×10⁻² (2.61×10⁻³)    1.79×10⁻² (3.64×10⁻³)    1.98×10⁻² (4.90×10⁻⁴)
  PLT*      8.98×10⁻² (1.35×10⁻³)    1.49×10⁻² (2.96×10⁻³)    3.94×10⁻² (4.34×10⁻⁴)
  RWY*      5.63×10⁻² (9.50×10⁻⁴)    1.89×10⁻² (5.32×10⁻³)    2.75×10⁻² (4.32×10⁻⁴)

* LIFT, ROLL, RING-L, RING-R, PLT, and RWY stand for ball lift, ball roll, bell ring L, bell ring R, ball roll on a plate, and ropeway, respectively.
** Standard deviations in parentheses.
tion length increases. Nevertheless, the reconstruction errors are below 10 percent
in all of the evaluation conditions.
5.4.2 Real-time Adaptive Behavior Selection According to Environ-
mental Changes
As a further experiment, we switched the robot’s behavior according to changes in
the objects displayed to the robot. The approach is a combination of cross-modal
memory retrieval and temporal sequence prediction in the sense that the joint an-
gles five steps ahead, considering control delay, are predicted from the previous 25
steps of the image input sequence. By iteratively sending the predicted joint angles
as the target commands for each joint angle of the robot, the robot generates mo-
tion in accordance with environmental changes. For the initial trial, we tested the
raw image input and confirmed that the robot can properly select behaviors accord-
ing to changes in the displayed object. However, we found that the reliability of our
current image feature vector is easily affected by the environmental lighting condi-
tions2. Therefore, we adopted color region segmentation and used the coordinates of the center of gravity of the color blobs as a substitute for the image feature vector, to stabilize perception under various lighting conditions. As a result, we succeeded
in switching multiple behaviors based on the displayed objects. Figure 5.6 shows
2 We recognize that the instability of the image feature vector under the real environment is due to the limitation on the variation in our image dataset utilized for training the image feature-extraction network.
Figure 5.5: Temporal sequence prediction errors of six object manipulation behaviors (RMS error versus prediction length for ball lift, ball roll, bell ring L, bell ring R, ball roll on a plate, and ropeway); plots are horizontally displaced from the original positions to avoid overlap of the error bars
Figure 5.6: Real-time transition of object manipulation behaviors
photos of the transition from one behavior to the next in the order of Ropeway, Bell
ring R, and Bell ring L.
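A sketch of one cycle of this predictive control loop; `net` again stands in for the trained temporal sequence network, and the settling-iteration count and helper names are illustrative assumptions:

```python
import numpy as np

def control_step(net, history, jdim=10, horizon=5):
    """Predict `horizon` steps ahead from the last 25 observed multimodal
    steps and return the joint-angle command at the final (30th) window
    step, compensating for the control delay."""
    obs = history[-25:]
    win = np.vstack([obs, np.zeros((horizon, history.shape[1]))])
    for _ in range(10):                       # settle the self-generated part
        recon = net(win.ravel()).reshape(win.shape)
        win[-horizon:] = recon[-horizon:]     # feed predicted steps back
    return win[-1, :jdim]                     # command sent to the robot
```

Iterating this step, with each new observation appended to `history`, yields motion that follows environmental changes.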
5.4.3 Multimodal Feature Space Visualization
Figure 5.7 presents the scatter plot of the three-dimensional principal components
of the acquired multimodal features. PC1, PC2, and PC3 axes correspond to princi-
pal components 1, 2, and 3, respectively. The multimodal feature vectors are gener-
ated by recognizing the training data from the temporal sequence learning network
and recording the activations of the central hidden layer. This figure demonstrates
that the feature space is segmented according to different object manipulation behaviors and the feature vectors self-organize into multiple clusters. The structure
of the multimodal feature space suggests that a supervised discrimination learning
of multiple behaviors might be possible by modeling correspondences between the
acquired multimodal features and the behavior categories.
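The projection plotted in Figure 5.7 corresponds to a standard principal component analysis; a minimal sketch with stand-in data, not the actual features:

```python
import numpy as np

def pca_project(features, k=3):
    """Project (N, D) feature vectors onto their first k principal components."""
    X = features - features.mean(axis=0)
    # Right singular vectors of the centered data are the principal directions
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return X @ Vt[:k].T

rng = np.random.default_rng(0)
feats = rng.normal(size=(120, 30))     # stand-in for the multimodal features
pc = pca_project(feats)                # columns: PC1, PC2, PC3
```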
Figure 5.7: Acquired multimodal feature space (scatter of the six behaviors: ball lift, ball roll, bell ring L, bell ring R, ball roll on a plate, ropeway)
5.4.4 Behavior Recognition using Multimodal Features
In this section, we examine how the acquired multimodal feature expression con-
tributes to the robustness of a behavior recognition task. In our learning framework,
raw sensory inputs are converted into sensory features, and the multiple sources of
sensory features are integrated together to generate multimodal features utilizing the
dimensionality compression function of an autoencoder. Making efficient use of the
higher-level features, we can expect the following two effects in the behavior recog-
nition task: (1) a discrimination model can improve its categorization performance
against noisy sensory inputs by exploiting the higher generalization capabilities of
the compressed representations; and (2) the integrated representation of multimodal
inputs helps to inhibit the degradation of categorization performance by comple-
menting a decrease in reliability of sensory input with information from the other
modalities.
To verify our hypotheses, we evaluated the noise robustness of a behavior dis-
crimination mechanism under different training conditions using the joint angle test
sequences corresponding to the six object manipulation behaviors. More specifi-
cally, we compare the variation in behavior recognition rates depending on the dif-
ferences of the standard deviation of Gaussian noise superimposed on the joint an-
gle sequences. To investigate the effects of the higher-level features acquired from
dimensionality compression and multimodal integration, we compare the perfor-
mance of the classifier under the following four different training conditions:
• (1a) MTN (raw): Raw joint angles are used as inputs.
• (1b) MTN (compressed): Joint angle feature vectors are used as inputs. Feature
vectors are generated by compressing the joint angle sequences utilizing an au-
toencoder3.
• (2a) MTN+IMG: Multimodal feature vectors are used as inputs. Feature vectors
are generated by compressing the joint angle sequences and the corresponding
image feature sequences utilizing the temporal sequence learning network. Im-
age feature sequences are generated by compressing the clean image sequences
acquired from the test data.
• (2b) MTN+IMG (imaginary): Multimodal feature vectors are used as inputs. In
this case, the image feature sequences are self-generated inside the network in-
stead of externally generated from the test data.
All of the training conditions, except for case (1a), are statistically evaluated on the
10 replicated learning results (see 5.4.1).
The compressed feature vector sequences are acquired by recording the activa-
tion patterns of the central hidden layer of the temporal sequence network. As one
of the most popular classification algorithms with an excellent generalization ca-
pability, a support vector machine (SVM)—namely, the multi-class SVM using one-
against-all decomposition in the Statistical Pattern Recognition Toolbox for MATLAB
[28]—is used as a classifier. An RBF kernel with default parameters (provided by the
toolbox) is used to address the one-against-all multiclass non-linear separation of
the acquired multimodal features; further, the Sequential Minimal Optimizer (SMO)
is used as the solver for the computational efficiency.
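An equivalent setup can be reproduced with scikit-learn in place of the MATLAB toolbox (a sketch with synthetic stand-in features; the RBF parameters are the library defaults, not necessarily the toolbox's):

```python
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic stand-in for the 30-dim multimodal features of six behaviors
centers = 3.0 * rng.normal(size=(6, 30))
y = np.repeat(np.arange(6), 20)
X = centers[y] + rng.normal(size=(120, 30))

# One-against-all decomposition with an RBF kernel, as in the evaluation
clf = OneVsRestClassifier(SVC(kernel="rbf")).fit(X, y)
```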
3 The structure of the autoencoder used in (1b) is the same as that of the temporal sequence learning network used for the multimodal integration learning in (2a), except that the image feature inputs are excluded.
Figure 5.8: Behavior recognition rates depending on the changes in standard deviation σ of the Gaussian noise superimposed on the joint angle sequences (recognition rate [%] versus σ for (1a) MTN (raw), (1b) MTN (compressed), (2a) MTN+IMG, and (2b) MTN+IMG (imaginary))
Figure 5.8 shows the variations of the behavior recognition rates depending on
the changes in standard deviation of the Gaussian noise superimposed on the joint
angle sequences. The amplitudes of the joint angles are normalized to the range 0
to 1. Mean and standard deviation are calculated from 10 replicated learning experi-
ments. The results demonstrate three remarkable advantages of utilizing higher-level
features for the behavior recognition task. First, comparing results of (1b) with (1a)
shows the superior performance of compressed joint angle features over raw joint
angles with regard to behavior recognition robustness. Second, comparing results of
(2a) with (1b) shows that the multimodal features manifest higher noise robustness
over single modal features by suppressing the negative effects caused by the degra-
dation of the reliability of joint angles; this is achieved by making effective use of the
complementary information from the image features. Third, comparing results of
(2b) with (1b) demonstrates that even when the joint angle modality is provided as
the sole input, the self-generated sequences for the image features still help to pre-
vent degradation in behavior recognition performance. From these results, we con-
firmed our hypotheses that the use of higher-level features acquired by compressing
raw sensory inputs and integrating multimodal sequences contributes to the noise resistance of behavior recognition tasks.
5.5 Discussion
5.5.1 How the Generalization Capability of Deep Neural Networks Contributes to Robot Behavior Learning
In this study, we demonstrated the significant scalability of the deep learning al-
gorithm applied to the time-delayed autoencoder on sensory-motor coordination
learning problems. We presented experimental results on cross-modal memory re-
trieval and the subsequent adaptive behavior generation of a humanoid robot in the
real environment. For example, in the image sequence retrieval experiment of the
object manipulation behaviors learning task in 5.4.1, 900 dimensions of the image
feature vector sequence were recalled from only the 300 dimensions of joint angle se-
quence inputs. This result shows that three times as much information was recalled
by the generalization capability of the autoencoder.
This powerful information complementation capability is one of the advantages
of our proposed time-delay autoencoder. By utilizing the generalization capability
of the preceding half layers of the autoencoder, higher-level features that represent
specific object manipulation behavior can be generated even from partial modal in-
puts. Further, as the autoencoder can reconstruct the original inputs from the feature
vector, the predicted outputs can be recursively fed back to the input nodes, and the
inputs can be used as a substitution for any lacking modality information. This re-
cursive information loop in our proposed framework enabled a high level of stability
in cross-modal memory retrieval performance.
The number of layers and the number of nodes are important factors to explain
the memory capacity and generalization capability of a deep neural network; how-
ever, in general, a clear explanation has not been made for the correlation between
the network structure and its learning capability. Thus, the design principle on the
structure of neural networks has little theoretical foundation at the moment. This
might be an important research topic for future consideration.
5.5.2 Three Factors that Contribute to Robustness in the Behavior Recognition Task
Our experimental results regarding behavior recognition evaluation demonstrated
that the compressed temporal features enable robust recognition performance. By
comparing the recognition rates from the different evaluation conditions, we have
shown that the following three factors contribute to noise robustness in behavior
recognition tasks: (1) utilization of higher-level features; (2) utilization of multimodal
information; and (3) utilization of self-generated sequences in multimodal behavior
recognition. Below, we present our views regarding the functions of the three factors
in relation to the internal mechanisms of our proposed framework.
Utilization of Higher-level Features
In previous work, Le et al. showed that it is possible to train neurons to be selec-
tive for high-level concepts using entirely unlabeled data [56]. As a practical result,
they succeeded in acquiring class-specific neurons such as cat and human body de-
tector neurons by training deep neural networks with unlabeled YouTube datasets.
This result—i.e., that meaningful features can be self-organized even with unla-
beled data—demonstrates the advantage of utilizing an autoencoder as a feature-
extraction mechanism. Comparable results are presented in related works involving
image classification tasks [50] and speech recognition tasks [40]. Considering all of
these previous studies, our behavior recognition results seem to coincide with the
view that deep neural networks produce higher-level features that have a prominent
generalization capability by accumulating many layers of nonlinear feature detectors
to progressively represent a complex statistical structure in the data [58].
Utilization of Multimodal Information
From the viewpoint of the amount of information acquired from the multimodal se-
quence, the multimodal temporal sequence learning network has a clear advantage
in generating a more accurate internal representation than a unimodal temporal se-
quence learning network. This fact is presented in the behavior recognition results
with noisy joint angle inputs and clear image inputs from the training dataset (i.e.,
(2a) MTN+IMG) in 5.4.4. These results demonstrate that even after the joint angle
information becomes uninformative, the degradation of the recognition rate con-
verges to a level that surpasses other results. In this case, the clear image feature
inputs served as a source of information for the higher-level features to correctly
represent the behavior category against the uninformative joint angle inputs. Cur-
rent results on the effects of multimodal learning toward robustness in recognition
tasks can be regarded in the same light as, for example, improvements in multimodal
speech recognition tasks utilizing a combination of sound and image inputs [75].
Utilization of Self-generated Sequences in Multimodal Behavior Recognition
Among our behavior recognition evaluation results presented in 5.4.4, the most no-
table outcome is that a higher recognition performance is realized by the multimodal
memory even with the single modal input for the joint angles (i.e., (2b) MTN+IMG
(imaginary)). This result could be explained as follows. Utilizing a multimodal mem-
ory, a multimodal internal representation is generated even from noisy joint angle
inputs, and successively accompanying image features are retrieved from the output
nodes. As the image feature vector is recalled from the internal representation, the
information becomes even more independent of the disturbance superimposed over
the joint angle observations. By feeding the retrieved image features back to the input nodes, this procedure progressively sharpens the internal representation; the effect is equivalent to explicitly providing the image feature sequence to the network in parallel with the noisy joint angle sequence. In recent neu-
ropsychological studies, the positive effects of self-referential strategies in improv-
ing memory in memory-impaired populations have been reported [34, 33]. In future
work, it would be interesting to further investigate how our current self-generating
imaginary sequence mechanism corresponds to such psychological phenomena in
the human cognitive process.
5.5.3 Difference between our Proposed Time-delay Autoencoder
and the Original Time-delay Neural Network
The temporal sequence learning mechanism proposed in our work inputs a fixed
length of time series acquired by cropping a segment of a temporal sequence within
a time window. This approach inherits the idea from the original work of time-delay
neural networks by Lang et al. [55]. The difference here is that the vectors identical
to the inputs define the target outputs of our proposed model, whereas the symbol
labels define the outputs of the original model. Consequently, one of the charac-
teristics of our proposed model is that the compressed representation of temporal
sequences is self-organized by the autoencoder, and the network can self-generate
temporal sequences by recursively feeding back outputs to input nodes. The advan-
tages of the internal sequence generation were shown by the adaptive behavior se-
lection capability utilizing cross-modal memory retrieval and the robust behavior
recognition capability with unreliable joint angle observations.
5.5.4 Characteristics of the Internal Representation of the Tempo-
ral Sequence Learning Network
The temporal sequence learning network effectively models the dynamics of long temporal sequences by cumulatively memorizing multiple phase-wise temporal segments. Thus, a feature vector generated from a one-shot input repre-
sents a temporal phase of a sequence. This phenomenon can be confirmed from
plots of the feature vectors of the bell-ringing task by observing where they formed
closed loop shapes in Figure 5.7. The same phenomenon can be confirmed from the
second task in that the reciprocal transition of the feature vector plots on the two
distinct lines corresponds to each of the right and left arm motion patterns shown in
Figure 6.6.
5.5.5 Length of Contextual Information that a Time-delay Autoen-
coder Handles
The length of the input temporal segment defines the length of the contextual in-
formation handled by the temporal sequence learning network. Hence, in principle,
context information longer than the temporal segment is not considered. In compar-
ison with the other temporal sequence learning mechanisms, such as recurrent neu-
ral networks [66], this is a fundamental difference. Our proposed framework worked
successfully in our experiments despite this limitation of contextual representation
because the execution of robot behaviors in our task settings did not require com-
prehending long contextual situations. For example, for the object manipulation and
bell-ringing behaviors, most of the contextual information is embedded in the envi-
ronment (e.g., the robot’s arm posture, position of the balls, etc.). Thus, an internal
neuronal representation of the context was not required for executing the tasks.
5.5.6 Scalability of our Proposed Multimodal Integration Learning
Mechanism
One of the targets of the current study was to achieve “large-scale learning” of ob-
ject manipulation behaviors by a humanoid robot. We can view the issue of large-
scale learning from the following three perspectives: (1) variations in the behavior
patterns, (2) input and output dimensionality, and (3) number of the training data
samples.
From the first perspective, the set of behavior variations prepared for training our proposed mechanism is not large enough. If the target is just to memorize the motion
trajectory and replay such memorized patterns, there might be a more efficient way
such as creating a motion pattern database using exact joint angle representations.
However, in the current study, we do not value the number and precision of the re-
trieved motion patterns but emphasize the ability to self-organize the synchrony of
sensory-motor relationships. To model the mutual relationship among concurrent
multi-dimensional temporal sequences, it is important to utilize machine learning
mechanisms that can handle distributed representations such as neural networks.
In terms of both synchrony modeling with neural networks and the variety of
dynamics handled by a single neural network model, we consider that the current
achievements reach a new level.
We think the same holds for the second and third perspectives.
With regard to robot behavior learning using a neural network model, conventional
approaches could handle only dozens of input and output dimensions and
hundreds of training samples. In contrast, we have achieved more than ten times
the scalability of the previous studies. For example, the direct memorization and
retrieval of raw image sequences corresponding to multiple motion patterns have
never been achieved with a neural network model.
5.6 Summary
In this chapter, our proposed multimodal integration learning framework is evalu-
ated by modeling multiple behavior patterns represented by multi-dimensional vi-
suomotor temporal sequences. The main targets discussed in this chapter are sum-
marized in Figure 5.9.
We presented two applications of the acquired sensory-motor integration model.
First, cross-modal memory retrieval was realized. Utilizing the generalization ca-
pability of the deep autoencoder, our proposed framework succeeded in retrieving
temporal sequences bidirectionally between the image and motion modalities. Second, ro-
bust behavior recognition was realized by utilizing the acquired multimodal features
as inputs to supervised behavior classification learning.
Through the evaluation experiment, a time-delay deep neural network is applied
for modeling multiple behavior patterns represented by multi-dimensional visuo-
motor temporal sequences. Owing to the efficient training performance of Hessian-
free optimization, the proposed mechanism successfully models six different object
manipulation behaviors in a single network. The generalization capability of the
learning mechanism enables the acquired model to perform the functions of cross-
modal memory retrieval and temporal sequence prediction. The experimental re-
sults show that the motion patterns for object manipulation behaviors are success-
fully generated from the corresponding image sequence, and vice versa. Moreover,
the temporal sequence prediction enables the robot to interactively switch multiple
behaviors in accordance with changes in the displayed objects. The analysis of the
self-organized feature space revealed that the multimodal features can be utilized as
abstracted information for recognizing robot behaviors.
Results from the real-time transition of object manipulation behaviors in a real-
world environment also revealed that our current approach for utilizing raw image
data is still not stable enough for handling drastic changes in lighting conditions. Fu-
ture work includes improving the robustness of the image recognition capabilities
by drawing out the potential of the generalization capabilities of deep networks via
the introduction of convolutional networks trained with more diverse datasets. An-
other important challenge is dynamically combining multiple sensory modalities by
taking into account the relative reliability of different sensory sources. If reliability-
dependent integration is attained in our framework, higher-level features might be
acquired by intentionally suppressing the effects that degraded modalities have on the
internal representation; this might result in more robust behavior recognition per-
formance.
Figure 5.9: The main targets discussed in Chapter 5
Chapter 6
Analysis on Intersensory Synchrony Model
6.1 Introduction
In Chapter 5, we demonstrated that our proposed framework succeeds in cross-
modal memory retrieval and stable behavior recognition utilizing the self-organized
multimodal fused representations. In this chapter, we conduct further analysis on
how our proposed framework extracts the intersensory synchrony from the sensory-
motor experience in the environment and predicts the sensory outcomes utilizing
the acquired synchrony model. To analyze the acquired synchrony model at a more
general level, we extend the experimental setting by incorporating sound signals as
another input modality. As a practical experiment, we prepared a bell-ringing task
using a humanoid robot. Through the experiment, we conduct a quantitative evalua-
tion to demonstrate that our proposed framework can model synchronicity between
the color, pitch, and position of the bell and the corresponding bell-ringing motion.
6.2 Construction of the Proposed Framework
Figure 6.1 shows a schematic diagram of our proposed framework. Three indepen-
dent deep neural networks (i.e., autoencoders) are utilized for sound compression,
Figure 6.1: Multimodal behavior learning and retrieval mechanism
image compression, and temporal sequence learning. Compared with the previous
experimental setup shown in Figure 5.1, this experimental setup incorporates an-
other deep neural network (Figure 6.1(a)) for sound feature extraction. The sound
data acquired from a microphone mounted on the head of the robot is preprocessed
by discrete Fourier transform (DFT). The sound compression network (Figure 6.1(a))
inputs the acquired sound spectrums and outputs the corresponding feature vec-
tors from the central hidden layer. Similarly, the image compression network (Fig-
ure 6.1(b)) inputs raw RGB bitmap images acquired from a camera mounted on the
head of the robot and outputs the corresponding feature vectors. The sound and im-
age features are synchronized with the joint angle vectors, and multimodal temporal
segments are generated. These multimodal temporal segments are then fed into the
temporal sequence learning network (Figure 6.1(c)). Accordingly, multimodal fea-
tures and reconstructed multimodal segments are output from the central hidden
layer and the output layer of the network, respectively.
The outputs from the temporal sequence learning network can be used for robot
motion generation, sound spectrum retrieval, or image retrieval. The joint angles
output from the network are rescaled and resent to the robot as joint angle com-
mands for generating motion. The networks can also reconstruct the retrieved sound
spectrum or images in the original form by decompressing the corresponding fea-
ture outputs because the sound compression network and the image compression
network model the identity map from the inputs to the outputs via feature vectors in
the central hidden layer.
6.3 Experimental Setup
The cross-modal memory retrieval performance of our proposed mechanisms is
evaluated by conducting bell-ringing tasks with the same robot used in our first ex-
periment. The bell-ringing task is set up as follows: three different desktop bells,
which can be identified by either the surface color or the sound pitch, are prepared
for the experiment. Correspondences between the colors and the pitch notations are
shown in Figure 6.2(a). For each bell-ringing trial, two bells are selected and placed
in front of the robot side by side. Then, either one of the two bells is rung by hitting a
Figure 6.2: Bell placement configurations of the bell-ringing task
button on top of the bell. Due to the limited reach of the hands, each bell can be
rung only with the arm on the corresponding side. As shown in Figure 6.2(b), there
are six possible bell placement combinations. Note that under the task configura-
tion, information from at least two different modalities is required to correctly identify
the bell-ringing situation. In practice, the robot cannot (1) determine which bell
is going to be rung only from the initial image, (2) determine the placement of the
ringing bell only from the sound, and (3) predict what sound will come out only from
the arm motion.
We record twelve different multimodal temporal sequence datasets by generating
the right and left bell-striking motions under the six different bell placement con-
figurations. Arm joint angle sequences corresponding to the bell-striking motions
are generated by the angular interpolation of the initial and target postures. Pulse-
code modulation (PCM) sound data is recorded with a 16 kHz sampling rate, a 16-
bit depth, and a single channel with a microphone mounted on the forehead of the
robot¹. The image frames and the joint angles of both arms are recorded at approximately 66 Hz, which includes replicated image frames.

¹ Because of the physical structure of the robot, the microphone is located close to both of the arms, which are utilized to hit the bells. Therefore, the actuation sounds of the geared reducers equipped to the arm joints are inevitably recorded in addition to the bell sounds. To avoid the degradation of memory retrieval performance arising from the actuation sounds, we introduced a brief pause in the bell-hitting motion when the hand contacted the button on top of the bell.

Table 6.1: Experimental parameters

          TRAIN*   I/O*   ENCODER DIMS*
SFEAT**   5352     968    1000-500-250-150-80-30
IFEAT**   2688     3000   1000-500-250-150-80-30
TSEQ**    8736     2100   1000-500-250-150-100

* TRAIN, I/O, and ENCODER DIMS indicate the size of the training data, the input and output dimensions, and the encoder network architecture, respectively.
** SFEAT, IFEAT, and TSEQ stand for sound feature, image feature, and temporal sequence, respectively.

To synchronize the sound data with the image and joint angle data, the sound data is preprocessed by a DFT
with a 242-sample Hamming window and a 242-sample window shift with no over-
lap. A partial region of 320×200 pixels is cropped from the original 320×240 image
and resized to 40×25 pixels to meet the memory resource availability limitation on
our computational environment. For the joint angle data input, 10 degrees of free-
dom of the arms (from the shoulders to the wrists) are used. The resulting lengths
of the motion sequence were approximately 200 steps each, which is equivalent to
about 3 s each. For multimodal temporal sequence learning, we used contiguous
segments of 30 steps from the original time series as a single input. By sliding the
time window by one step, consecutive data segments are generated.
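The segment generation described above can be sketched as follows; the function name and toy data are illustrative assumptions, not the thesis code.

```python
import numpy as np

def sliding_segments(seq, window=30, shift=1):
    """Slice a (steps, dims) sequence into overlapping temporal segments.

    With shift=1, a 200-step sequence yields 200 - window + 1 = 171
    segments, each flattened to window * dims values.
    """
    n_steps, dims = seq.shape
    segs = [seq[s : s + window].reshape(-1)
            for s in range(0, n_steps - window + 1, shift)]
    return np.stack(segs)

# Toy stand-in: sound (30) + image (30) + joint angles (10) = 70 dims per step.
seq = np.random.rand(200, 70)
segments = sliding_segments(seq)
print(segments.shape)               # (171, 2100)
```

Each 2100-dimensional row matches the input size of the temporal sequence learning network in Table 6.1.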
Table 6.1 summarizes the datasets and associated experimental parameters. For
both the sound feature and the image feature learning, the same 12-layered deep
neural networks are used. For temporal sequence learning, a 10-layer network is
used. In each case, the decoder architecture mirrors the encoder, yielding a symmetric
autoencoder. The parameters for the network structures are empirically determined
with reference to previous studies such as [41] and [49]. The input and output
dimensions of the three networks are defined as follows: 968 for sound feature learning,
obtained by binding four consecutive steps of the 242-dimensional sound spectrum into
a single vector; 3000 for image feature learning, defined by the 40×25 pixel matrices
for RGB colors; and 2100 for temporal sequence learning, defined by a 30-step segment
of the 70-dimensional multimodal vector composed of a 30-dimensional sound feature
vector, a 30-dimensional image feature vector, and 10 joint angles. For the central
hidden layer of the temporal sequence learning network, we compared several node
counts, i.e., 30, 50, 70, and 100. By evaluating the performance of image retrieval
from the sound and joint angle inputs, we concluded that 100 nodes are needed to
achieve the desired memory reconstruction precision. For the activation functions,
linear functions are used for the central hidden layers, and logistic functions are used
for the rest of the layers, following [41].
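As a rough illustration of the symmetric autoencoder structure described above, the sketch below builds a randomly initialized network and runs a forward pass only; the helper names and initialization scale are assumptions, and training (Hessian-free optimization) is omitted.

```python
import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

def build_autoencoder(encoder_dims):
    """Random weights for a symmetric autoencoder: the decoder mirrors
    the encoder layer sizes listed from input to central hidden layer."""
    dims = encoder_dims + encoder_dims[-2::-1]      # mirror for the decoder
    rng = np.random.default_rng(0)
    return [(rng.normal(0.0, 0.01, (m, n)), np.zeros(n))
            for m, n in zip(dims[:-1], dims[1:])]

def forward(layers, x, central):
    """Forward pass; `central` indexes the linear central hidden layer,
    whose activation is returned as the feature vector."""
    feat = None
    for i, (W, b) in enumerate(layers):
        x = x @ W + b
        if i == central:
            feat = x            # linear activation at the central layer
        else:
            x = logistic(x)     # logistic activation elsewhere
    return feat, x

# Temporal sequence learning network: 2100 inputs, 100-dim central layer.
layers = build_autoencoder([2100, 1000, 500, 250, 150, 100])
feat, recon = forward(layers, np.random.rand(2100), central=4)
print(feat.shape, recon.shape)  # (100,) (2100,)
```

Mirroring the five encoder sizes produces the 10-layer temporal sequence network of Table 6.1; the six-size encoders of the sound and image networks yield their 12-layer counterparts in the same way.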
6.4 Results
6.4.1 Image Sequence Retrieval from Sound and Motion Sequences
We conducted an evaluation experiment of the cross-modal memory retrieval per-
formance by generating image sequences from the sound and joint angle input se-
quences. Note that in the following results, the number of sequence steps indicates
the generation step rather than the recorded data step. More specifically, data from
29 steps before the beginning of the generation step are used for acquiring the initial
step of the generated sequence.
Figure 6.3 shows an example of image generation results from the sound and joint
angle inputs. In Figure 6.3, the top row and the second row show the original and
retrieved images, respectively. The third and the bottom row show the sound spec-
trums and the 30 previous steps of joint angles sequences used as inputs to the tem-
poral sequence learning network to retrieve the corresponding images. Black dashed
squares in the images at step 1 indicate the bell image regions used for image retrieval
performance evaluation.
At step 1, the bells in the retrieved image are arbitrarily colored, because the color
of the placed bell cannot be derived before any sound input is acquired. By contrast, the
image of the robot's right hand is already included in the retrieved image, because the
joint angle input data indicate that the right arm is going to be used for striking the
bell. At steps 31 and 61, the bell is rung, and the corresponding sound spectrum is
acquired. Then, the task configuration becomes evident, and the information that the
rung bell on the right side has the pitch ‘F’ is correlated with the color green. Thus,
the color of the right bell in the retrieved image changes from a randomly initialized
one to green by associating the sound and joint angle information. Conversely, the
color of the left bell in the retrieved image is not stable during the run because no
information is acquired from the sound input for identifying which bell is placed on
the left side. Nevertheless, the retrieved image shows that once the color of the rung
bell (i.e., green) is identified, the color of the other bell is selected from the remaining
two colors (i.e., red or blue). This result reflects the current task design, in which the
colors of the two bells are always different. From around step 91, the sound of the bell
starts to decay, and the actuation noise of the manipulator caused by the posture
initialization becomes dominant. Thus, the colors of the bells again become arbitrary.

Figure 6.3: Example of image retrieval results from the sound and joint angle inputs (rows: original image, retrieved image, sound spectrum, and scaled joint angles; columns: steps 1, 31, 61, 91, 121, and 151)
6.4.2 Quantitative Evaluation of Image Retrieval Performance
We conducted an evaluation experiment to quantitatively examine whether our pro-
posed model succeeded in modeling the synchrony between the image, sound, and
motion modalities. We prepared 10 different initial model parameter settings for the
networks and replicated the experiment of learning the same dataset composed of
the 12 combinations of the bell placements and bell-striking motion patterns. As a
result of cross-modal image retrieval for the 10 learning results, 120 patterns of the
image sequences were acquired. Image retrieval performance is quantified by the
root mean square (RMS) errors of the manually selected left and right bell regions in
the retrieved image (which are 13×13 pixels each, as indicated in Figure 6.3) against
the corresponding regions of the original image.
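The RMS error measure can be sketched as below; the region coordinates and image values are hypothetical, since the thesis states only that the 13×13 bell regions were selected manually.

```python
import numpy as np

def region_rms_error(original, retrieved, top_left, size=13):
    """RMS error between corresponding size x size bell regions of the
    retrieved and original (H, W, 3) images with values in [0, 1]."""
    r, c = top_left
    diff = (retrieved[r:r + size, c:c + size]
            - original[r:r + size, c:c + size])
    return float(np.sqrt(np.mean(diff ** 2)))

orig = np.zeros((25, 40, 3))                 # toy 40x25 RGB images
retr = np.full((25, 40, 3), 0.2)
print(region_rms_error(orig, retr, top_left=(6, 5)))   # ~0.2 for a constant offset
```

Restricting the error to the bell regions isolates retrieval quality for the task-relevant parts of the image from the static background.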
Figure 6.4 shows the time variation of the image retrieval error displayed in asso-
ciation with the maximum value of the sound power spectrum and the joint angles
sequence. In Figure 6.4, the graphs on the top row show the mean of the image re-
trieval errors from the replicated learning results (each line is acquired from 30 re-
sult sequences). The graphs on the second row show the mean of maximum sound
power spectrums. The graphs on the bottom row show the joint angles command
sequences used for generating the bell-striking motion. Black dashed lines indicate
the time step used for evaluating the significance of the image retrieval error
difference. The evaluation results demonstrate that the image retrieval error of the left bell
becomes smaller than that of the right bell when the left bell is rung, and vice versa.
The time variation of the error trajectory shows that the retrieval error decreases after
the sound of the bell is acquired.

Figure 6.4: Bell image retrieval errors (rows: image retrieval error for the left and right bell regions, sound amplitude, and scaled joint angle; columns: left bell ring and right bell ring)
The shape of the error trajectory is not symmetric between the two graphs when
the left bell or the right bell is rung. When the left bell is rung, the image retrieval error
for the left bell maintains its value even after arm posture initialization. Conversely,
when the right bell is rung, the image retrieval error for the right bell increases after
arm posture initialization. These differ primarily because of the asymmetry of the
arm actuator noise. Owing to the difference in the mechanics of the left and right
actuators, which is beyond our control, the right arm produces more sound than the
left arm. Hence, when the right arm posture is initialized after striking the bell, the
accompanying actuator noise disturbs the internal state of the network (i.e., the data
buffered in the recurrent loop), and the retrieved image is altered.

Figure 6.5: Bell image retrieval errors at step 60 (left bell ring: p=6.1e−06; right bell ring: p=3.1e−07)
6.4.3 The Correlation between Generated Motion and Retrieved Bell Images
To evaluate the significance of the difference between image retrieval performance
of the left and right regions in the same image, we conduct a t-test for the image re-
trieval errors at step 60 of the sequences. At that time step, the arm is brought down
and the hand stably contacts the button on top of the bell. Therefore, there is no in-
fluence of actuation noise on image retrieval. In Figure 6.5, red circles and blue bars
denote the mean and standard deviation of the errors from 10 replicated learning ex-
periments, respectively. A p value less than 0.01 is considered statistically significant
(**: p < 0.01). The evaluation results show that the differences of the image retrieval
errors between the two regions are statistically significant in both the right and left
bell-ringing cases. Results further show that the spatial correlation between the bell
region in the image and the physical motion is correctly modeled, as are the asso-
ciations between the colors and sounds of the bells. Thus, the acquired synchrony
model between the image, sound, and motion modalities is utilized for image re-
trieval.
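The significance test can be sketched as follows. The thesis reports only that a t-test was applied, with the p-values shown in Figure 6.5; the equal-variance two-sample form and the error values below are assumptions for illustration.

```python
import numpy as np

def two_sample_t(a, b):
    """Student's two-sample t statistic (equal-variance form), comparing
    the left- and right-region RMS errors across replicated runs."""
    na, nb = len(a), len(b)
    va, vb = np.var(a, ddof=1), np.var(b, ddof=1)
    sp2 = ((na - 1) * va + (nb - 1) * vb) / (na + nb - 2)   # pooled variance
    return (np.mean(a) - np.mean(b)) / np.sqrt(sp2 * (1 / na + 1 / nb))

# Hypothetical RMS errors at step 60 from 10 replicated runs (left bell rung):
left_region = np.array([0.08, 0.09, 0.07, 0.10, 0.08, 0.09, 0.08, 0.07, 0.09, 0.08])
right_region = np.array([0.30, 0.28, 0.33, 0.31, 0.29, 0.32, 0.30, 0.31, 0.29, 0.32])
t = two_sample_t(left_region, right_region)
print(t)   # strongly negative: the left-region error is clearly smaller
```

The corresponding p-value follows from the t distribution with n_a + n_b − 2 degrees of freedom; a ready-made alternative is a standard statistics library's two-sample t-test routine.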
Figure 6.6: Multimodal feature space and the correspondence between the coordinates and modal-dependent characteristics
6.4.4 Visualization of Multimodal Feature Space
Finally, we conducted an analysis of the multimodal feature space acquired by the
temporal sequence learning network. Among the 10 replicated learning results, we
took a single result and recorded the activation patterns of the central hidden layer
of the network when the 12 patterns of bell-ringing sequences were input. We
applied principal component analysis (PCA) to project the resulting 100-dimensional
feature vector sequences onto a three-dimensional space defined by the acquired
principal components (Figure 6.6).
The abbreviations in the legend box indicate the color combinations of the placed
bells, followed by the position (R or L) of the rung bell. The graph on the left side
(Figure 6.6(a)) demonstrates that the robot’s motion pattern is represented in a two-
dimensional space composed of the first and second principal components, whereas
the graph on the right side (Figure 6.6(b)) shows that the bell placement configura-
tions are structured along the coordinate defined by the third principal component.
Results of this analysis demonstrate that the synchrony between the multiple modal-
ities is self-organized in the temporal sequence learning network.
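The projection used for this visualization can be sketched with a standard SVD-based PCA; the random activations below are an illustrative stand-in for the recorded hidden-layer patterns.

```python
import numpy as np

def pca_project(features, n_components=3):
    """Project feature vectors onto their first principal components
    via an SVD of the mean-centered data matrix."""
    X = features - features.mean(axis=0)
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    return X @ Vt[:n_components].T          # (samples, n_components)

# Stand-in for the central-hidden-layer activations of the 12 sequences.
acts = np.random.rand(12 * 200, 100)        # ~200 steps each, 100 dims
proj = pca_project(acts)
print(proj.shape)                           # (2400, 3)
```

Plotting the three projected coordinates, colored by sequence, reproduces the kind of structure shown in Figure 6.6: motion patterns spread over the first two components and bell placements along the third.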
6.5 Discussion
The experimental results demonstrated that our proposed framework is able to
extract implicit synchronicity among multiple modalities by integrating multimodal
information. Further, the retrieved images in the bell-ringing task demonstrated that
our proposed framework not only deterministically retrieved a bell image reflecting
the acquired synchrony, but also generated plausible alternative information for the
other bell, whose situation is not identifiable, by selecting a candidate among the
multiple possibilities. Thus, we believe that our proposed mechanism can be
utilized as a prediction mechanism for robots to infer the successive consequences
of sensory-motor situations.
In cognitive science studies, a sense of agency is known to be a product of the gen-
eral determination of synchrony between action and effect, and experimental results
suggest that the sense of agency arises when there is temporal contiguity and content
consistency between signals related to action and those related to the putative ef-
fect [83, 23, 25]. Further, a recent study has reported the importance of action-effect
grouping on the production of a sense of agency [48]. In all of these studies, the eval-
uation of spatiotemporal congruity between predicted and actual sensory feedback
is considered to play an important role in the sense of agency. From our current re-
sults, we consider that our cross-modal synchrony modeling and subsequent mem-
ory retrieval capabilities can be utilized as a practical computational framework for
sensory feedback prediction. Hence, we believe that our presented framework can
be utilized in future work to promote a deeper understanding of the sense of agency.
6.6 Summary
In this chapter, we conducted quantitative analysis to show that our proposed multi-
modal integration learning framework correctly models synchronicity between mul-
tiple modalities. The main targets discussed in this chapter are summarized in Figure
6.7.
To analyze the acquired synchrony model at a more general level, we extended the
experimental setting by incorporating sound signals as another input modality. As
a practical experiment, we designed a bell-ringing task for a humanoid robot and
conducted integration learning of the image, sound, and joint angle sequences. The
acquired model was evaluated by retrieving images from the sound and joint angle
sequences. The evaluation results demonstrate that the color of the bell on the side
of the corresponding arm motion correctly changes in association with the input sound.
The analyses of the acquired model show that the proposed framework succeeded
in acquiring the synchrony model over the multiple modalities.
As for the bell-ringing task, we evaluated the image retrieval performance from
sound and motion with only two bell positions. Future work includes modeling a
generalized representation of bell positions by training our system with bell-ringing
behaviors using more variations of bell positions.
Figure 6.7: The main targets discussed in Chapter 6
Chapter 7
Conclusion
7.1 Overall Summary of the Current Research
This dissertation proposed multiple machine learning frameworks for the mutual
understanding of intersensory synchrony of multimodal information in robot sys-
tems. In practice, (1) robust recognition of poorly reproducible real-world informa-
tion and (2) adaptive behavior selection of robots depending on dynamic environ-
mental changes were achieved by utilizing deep learning architectures.
The first requirement was addressed through two approaches: (1) extraction of
highly generalized sensory features and (2) fusional utilization of multimodal
information. The second requirement was addressed by (3) mutually predicting
and retrieving sensory-motor information among multiple modalities. These
three approaches were enabled by the strong performance of DNNs in abstracting
and integrating large amounts of high-dimensional raw real-world sensory-motor
information.
The performances of the proposed multimodal integration learning frameworks
were evaluated by conducting an AVSR task and two robot behavior learning tasks
utilizing a humanoid robot.
The AVSR task was conducted to evaluate the performance of two DNN
architectures, a fully connected DNN and a CNN, in extracting noise-robust
sensory features for the audio and visual information of speech signals, respectively.
Our experimental results demonstrated that a fully connected DNN can serve as a
noise reduction filter that contributes towards recognizing speech under noisy en-
vironments even with audio information only. In addition, we demonstrated that
the CNN can recognize visual appearances of mouth region shapes and predict cor-
responding phoneme labels. Moreover, our AVSR experiments demonstrated that an
MSHMM can achieve noise-robust multimodal speech recognition by complementarily
utilizing the audio and visual information. We suppose that, even though the
experiment handles two seemingly “sensory” signals, the AVSR recognition results
implicitly show the importance of “sensory-motor” integration for robust recognition.
We believe this because the visual input of the mouth region image can indirectly
transmit information about the motor commands corresponding to mouth movements.
However, the current implementation of our AVSR model relies on a rather simple
approach: a linear weighted sum of the observation probabilities corresponding to the
audio and visual features. Therefore, in terms of the fusional utilization of multimodal
information, the current implementation still leaves room for further development.
For example, multimodal temporal sequence learning using a DNN or an RNN, as
accomplished in the robot behavior learning experiments, might be a promising
approach to study.
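The linear weighted-sum fusion mentioned above can be sketched as follows; the probability values and weights are illustrative, and in an actual MSHMM decoder the weighting is applied to each state's observation probabilities at every frame.

```python
def fused_observation_prob(p_audio, p_visual, audio_weight):
    """Linearly weighted sum of the per-stream observation probabilities:
    w * p_audio + (1 - w) * p_visual, the simple fusion scheme above."""
    return audio_weight * p_audio + (1.0 - audio_weight) * p_visual

# Shift weight toward the visual stream as the audio becomes noisier
# (illustrative values only).
clean = fused_observation_prob(0.8, 0.3, audio_weight=0.9)   # trust audio
noisy = fused_observation_prob(0.8, 0.3, audio_weight=0.4)   # trust lips
print(clean, noisy)   # ~0.75 and ~0.5
```

The stream weight is the single tunable parameter of this scheme, which is precisely why a learned, reliability-dependent integration is suggested as a more flexible alternative.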
The robot behavior learning tasks were conducted to evaluate the multimodal in-
tegration learning performance and the cross-modal memory retrieval performance
of a fully connected DNN. One of the innovations in this research is that a DNN ar-
chitecture is applied for mutual integration learning of dynamic sensory-motor in-
formation. Experimental results also showed that the consistent learning framework
can be applied independently of the modality of the information source. The scalable
learning capability of the DNNs enabled the extraction of a compressed representation
of the raw sensory-motor information, and the fused features attained noise-robust
behavior recognition. Moreover, the generalization capability of the DNN enabled
retrieval of the raw sensory-motor signals from one modality to another in a mutual
manner according to the synchrony model acquired through the integration learning
process. The cross-modal memory retrieval and temporal sequence prediction func-
tions of the proposed framework enabled adaptive switching of object manipulation
behaviors of a humanoid robot depending on the displayed object in real-time. In
addition, the bell-ringing behavior learning experiment demonstrated that the
proposed multimodal integration framework self-organizes a structured feature space
that models the synchrony among multiple modalities.
This dissertation presented the effectiveness of multimodal integration learning
not only for the conventional pattern recognition problems but also for dynamic
sensory-motor coordination learning and the consequent behavior generation prob-
lems of a robot. The achievements in the current dissertation are expected to shed
light on a novel design concept of future robot systems. For example, we are confi-
dent of the novelty of our approach regarding directly modeling raw sensory-motor
signals of robot systems with DNNs. Meanwhile, the current strategy of modeling
temporal sequences with a time-delay style DNN is just an illustrative example. The
application of recurrent neural networks may open up a new horizon for acquir-
ing long-term context-dependent robot behaviors with greater scalability. We expect
to advance our current research strategies by applying our proposed frameworks
to practical applications as well as by investigating novel learning frameworks that
incorporate emerging machine learning techniques for robot systems.
7.2 Significance of the Current Study as a Work in Inter-
media Art and Science
The ability to communicate, that is, to express one's mental state and to exchange
information with other individuals, is one of the intellectual foundations that characterize
higher-order animals. Expression involves two processes: (1) the
structuring of one's experience, i.e., reflecting one's experience in an internal repre-
sentation by abstracting the acquired raw sensory-motor perception, and (2) the ex-
pression of one's mental state, i.e., transmitting information to others by creating signals
and symbols that convey messages generated from the internal representation. The
process of expression can explain most creative activities, such as painting,
filmmaking, musical composition, and writing novels. Therefore, we believe that
the process of expression is one of the main topics to be pursued in intermedia
art and science. Although discussions of expression tend to focus on the second
process, the first process is equally important because it
is responsible for generating the abstracted representations, the origin of one's mental
state, by comprehending one's experience.
In terms of the discussion above, we can state that the main contribution of
the current study, as intermedia art and science research, lies in the first
process: the structuring of internal representations by abstracting a robot's experi-
ences. Recent successes of deep learning in image recognition and speech recogni-
tion studies have highlighted the importance of the self-organization of sensory features,
which is equivalent to internal representation in the current context, within the machine
learning community. However, the related idea of applying the self-organization of inter-
nal representations to the realization of robot intelligence has not yet received extensive
attention. The current study showed that our deep learning approach can
self-organize an internal representation from a robot's experience, not only by abstracting
modality-dependent representations but also by mutually integrating the
sensory-motor features acquired from multiple modalities. Thus, we showed how raw real-
world information acquired from the multiple sensors equipped on the robot and
the self-motion commands are integrated and abstracted to structure a compact in-
ternal representation. Moreover, we demonstrated that the robot could take advan-
tage of the acquired multimodal representation for behavior generation by retrieving
associated information across multiple modalities.
The representation intended for information propagation is not discussed in
the current study, because communication among multiple robots and human–robot
communication are outside the current research scope. However, we can natu-
rally view several research topics, such as the social development of robot intelligence or
human–robot communication, as continuations of the current research motivation.
We believe that robotics research from the perspective of intermedia art and science
is of great value, because it lets us reflect more deeply on fundamental questions
such as what the essence of creativity is, whether a machine can become a creative
entity, and what the essential difference is between human intelligence and machine
intelligence.
Appendix A
Hessian-Free Optimization
The Hessian-free algorithm originates from Newton's method, a well-known numer-
ical optimization technique. A canonical second-order optimization scheme, such
as Newton's method, iteratively updates the parameter θ ∈ R^N of an objective function f
by computing a search direction p and updating θ as θ_{k+1} = θ_k + αp with learning rate α.
The core idea of Newton's method is to locally approximate f at the current iterate θ_k
by a model function m_k, up to the second order, with the following quadratic equation:

m_k(θ_k + p) = f_k + ∇f_k^T p + (1/2) p^T B_k p,    (A.1)

where f_k and ∇f_k are the function and gradient values at θ_k, respectively. The ma-
trix B_k is either the Hessian matrix H_k = ∇²f_k or some approximation of it. In
the standard Newton's method, m_k is optimized by computing the N × N matrix B_k and
solving the system

B_k p = −∇f_k.    (A.2)
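For a quadratic objective, solving (A.2) directly yields the exact minimizer in a single step. The following NumPy sketch is our own illustration (the matrix A and vector b are arbitrary examples, not from the dissertation):

```python
import numpy as np

# Minimal sketch: solving Newton's equation B_k p = -grad f_k directly for
# the quadratic objective f(theta) = 1/2 theta^T A theta - b^T theta,
# whose Hessian is A.
A = np.array([[3.0, 1.0], [1.0, 2.0]])   # symmetric positive definite
b = np.array([1.0, 1.0])

theta_k = np.zeros(2)
grad = A @ theta_k - b                    # gradient of f at theta_k
p = np.linalg.solve(A, -grad)             # Newton direction from (A.2)

theta_next = theta_k + p                  # alpha = 1 for a pure quadratic
# One Newton step lands on the minimizer, which satisfies A theta = b.
assert np.allclose(A @ theta_next, b)
```

The explicit `np.linalg.solve` call is exactly the O(N³) step that becomes prohibitive for large N, which motivates the conjugate gradient approach discussed next.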
However, two major difficulties exist in directly solving (A.2). First, this compu-
tation is very expensive for large N, which is common even with modestly
sized neural networks. To overcome this, Hessian-free optimization utilizes the
linear conjugate gradient (CG) method to optimize the quadratic objective. The
name "Hessian-free" indicates that CG does not require the costly,
explicit Hessian matrix; instead, the matrix-vector product between the matrix B_k
and a vector p is sufficient.
Second, the Newton direction p defined by (A.2) may not be a descent
direction, because the Hessian matrix may become negative definite when the intermedi-
ate parameter θ_k is far from the solution. To overcome this, two countermeasures
are introduced. One is to utilize the positive semidefinite Gauss-Newton curvature ma-
trix instead of the possibly indefinite Hessian matrix. The other is to apply a modified
Newton's method that reconditions the Hessian matrix H_k as

B_k = H_k + λI,    (A.3)

where B_k is the damped Hessian matrix of f at θ_k, λ ≥ 0 is a damping parameter, and I
is the identity matrix.
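A small sketch of our own illustrates the damping in (A.3): with an indefinite Hessian H the raw Newton direction can point uphill, while B = H + λI with a sufficiently large λ restores descent. The matrices and the gradient below are arbitrary examples:

```python
import numpy as np

# Illustration of damping (A.3): an indefinite Hessian can yield an
# ascent direction; adding lambda*I makes it positive definite.
H = np.diag([1.0, -2.0])                 # indefinite: eigenvalues 1 and -2
grad = np.array([0.1, 1.0])              # gradient at the current iterate

p_newton = np.linalg.solve(H, -grad)     # raw Newton direction from (A.2)
assert grad @ p_newton > 0               # positive slope: ascent direction

lam = 3.0                                # damping parameter lambda >= 0
B = H + lam * np.eye(2)                  # damped Hessian, positive definite
p_damped = np.linalg.solve(B, -grad)
assert grad @ p_damped < 0               # negative slope: descent direction
```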
A.1 Newton-CG Method
In the Newton-CG method, the search direction is computed by applying the CG
method to Newton's equation (A.2). The CG method is a general framework for
solving a linear system Ax = b with a symmetric positive definite matrix A. Instead
of the squared-error objective ‖Ax − b‖², the quadratic

ψ(x) = (1/2) x^T A x − b^T x,    (A.4)

is optimized. In the context of Hessian-free optimization, the parameters are set as
A = B and b = −∇f.
The CG iteration is terminated at iteration k if the following condition is satisfied:

k > G  and  ψ(x_k) < 0  and  (ψ(x_k) − ψ(x_{k−G})) / ψ(x_k) < ε_G,    (A.5)

where G determines how many iterations into the past are considered when estimating
the current per-iteration reduction rate.
The overview of our CG method is summarized as follows.

Algorithm 1 CG method
  Given x_0
  Set r_0 ← A x_0 − b, p_0 ← −r_0
  for k = 0 to K_max do
    if p_k^T A p_k ≤ 0 then
      break
    end if
    α_k ← (r_k^T r_k) / (p_k^T A p_k)
    x_{k+1} ← x_k + α_k p_k
    r_{k+1} ← r_k + α_k A p_k
    β_{k+1} ← (r_{k+1}^T r_{k+1}) / (r_k^T r_k)
    p_{k+1} ← −r_{k+1} + β_{k+1} p_k
    if k > G and ψ(x_k) < 0 and (ψ(x_k) − ψ(x_{k−G})) / ψ(x_k) < ε_G then
      break
    end if
  end for
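A direct Python transcription of Algorithm 1 might look as follows. The function name `cg` and the constants `k_max`, `G`, and `eps_g` are our own choices, and A is assumed symmetric positive definite in this sketch:

```python
import numpy as np

def cg(A, b, x0, k_max=250, G=10, eps_g=5e-4):
    """Linear CG following Algorithm 1, with the stopping rule (A.5)."""
    psi = lambda x: 0.5 * x @ A @ x - b @ x     # quadratic objective (A.4)
    x = x0
    r = A @ x0 - b                               # initial residual
    p = -r
    history = [psi(x)]
    for k in range(k_max):
        pAp = p @ A @ p
        if pAp <= 0:                             # non-positive curvature
            break
        alpha = (r @ r) / pAp
        x = x + alpha * p
        r_new = r + alpha * (A @ p)
        beta = (r_new @ r_new) / (r @ r)
        p = -r_new + beta * p
        r = r_new
        history.append(psi(x))
        # relative-progress stopping rule (A.5) over the last G iterations
        if (k > G and history[-1] < 0
                and (history[-1] - history[-1 - G]) / history[-1] < eps_g):
            break
    return x

A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
x = cg(A, b, np.zeros(2))
assert np.allclose(A @ x, b, atol=1e-6)          # solves the linear system
```

In the Hessian-free setting, the explicit matrix `A` would be replaced by a function computing the product B_k p, as discussed in the next section.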
A.2 Computing the Matrix-Vector Product
The Newton-CG method does not require explicit knowledge of the Hessian
B_k = ∇²f_k. Rather, it requires Hessian-vector products of the form ∇²f_k p for any given
vector p. When the second derivatives cannot easily be calculated, or when the Hes-
sian requires too much storage, finite-differencing techniques, known as Hessian-free
Newton methods, are commonly applied.
Following the definition of a derivative, the Hessian-vector products are exactly
calculated by the following equation:

∇²f_k p = lim_{r→0} (∇f(θ_k + rp) − ∇f(θ_k)) / r = ∂/∂r ∇f(θ_k + rp)|_{r=0}.    (A.6)
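The forward-difference form of (A.6) is straightforward to sketch. The quadratic objective and the step size r below are illustrative choices of ours, picked so that the estimate can be checked against the exact product A p:

```python
import numpy as np

# Approximating the Hessian-vector product of
# f(theta) = 1/2 theta^T A theta by finite differences of the gradient,
# following (A.6). For a quadratic, the estimate is exact up to rounding.
A = np.array([[2.0, 0.5], [0.5, 1.0]])
grad = lambda th: A @ th                 # exact gradient; the Hessian is A

def hvp_fd(grad, theta, p, r=1e-6):
    """Finite-difference estimate of (nabla^2 f) p at theta."""
    return (grad(theta + r * p) - grad(theta)) / r

theta = np.array([1.0, -1.0])
p = np.array([0.3, 0.7])
assert np.allclose(hvp_fd(grad, theta, p), A @ p, atol=1e-4)
```

For general nonlinear objectives the choice of r trades off truncation against rounding error, which is one motivation for the exact R-operator approach described next.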
This operation can be regarded as a transformation that converts the gradient of a system
into the Hessian-vector product. Pearlmutter [79] defined an operator for this trans-
formation as

R{f(θ_k)} = ∂/∂r f(θ_k + rp)|_{r=0},    (A.7)

where R{·} is called the R-operator. By applying the R{·} operator to the equations for cal-
culating a gradient, e.g., the backpropagation algorithm, we can acquire the Hessian-
vector product. As R{·} is a differential operator, it follows the same rules as the usual
differential operators, such as:
R{c f(θ)} = c R{f(θ)}    (A.8)
R{f(θ) + g(θ)} = R{f(θ)} + R{g(θ)}    (A.9)
R{f(θ)g(θ)} = R{f(θ)} g(θ) + f(θ) R{g(θ)}    (A.10)
R{f(g(θ))} = f′(g(θ)) R{g(θ)}    (A.11)
R{df(θ)/dt} = dR{f(θ)}/dt,    (A.12)

also note that

R{θ} = p.    (A.13)
For a standard feedforward neural network and an Elman-type recurrent neural
network, the forward propagation, the backpropagation, and their counterparts with
the R-operator applied are shown in Appendix B and Appendix C, respectively.
Appendix B
FNN with R-operator
B.1 Forward propagation
h = W_hi I + b_h
H = f(h)
o = W_oh H + b_o
O = g(o)

B.2 Forward propagation with R-operator

Rh = W_hi^v I + b_h^v
RH = f′(H) Rh
Ro = W_oh^v H + W_oh RH + b_o^v
RO = g′(O) Ro
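As a hypothetical numerical check of our own (not part of the dissertation), the R-forward pass above can be verified against a finite-difference directional derivative, here with f = tanh and g the identity, so that f′(H) = 1 − H² and g′ = 1. All weights and the direction are randomly chosen:

```python
import numpy as np

# Check that the R-forward pass of B.2 computes the directional
# derivative of the network output O along a parameter direction v.
rng = np.random.default_rng(0)
Whi, bh = rng.normal(size=(4, 3)), rng.normal(size=4)
Woh, bo = rng.normal(size=(2, 4)), rng.normal(size=2)
I = rng.normal(size=3)

def forward(Whi_, bh_, Woh_, bo_):
    H_ = np.tanh(Whi_ @ I + bh_)        # B.1: h = Whi I + bh, H = f(h)
    return Woh_ @ H_ + bo_              # o = Woh H + bo, O = g(o), g = id

# Direction v = (vWhi, vbh, vWoh, vbo) in parameter space.
vWhi, vbh = rng.normal(size=(4, 3)), rng.normal(size=4)
vWoh, vbo = rng.normal(size=(2, 4)), rng.normal(size=2)

# R-forward pass (B.2), with f'(H) = 1 - H^2 for f = tanh.
h = Whi @ I + bh
H = np.tanh(h)
Rh = vWhi @ I + vbh
RH = (1 - H**2) * Rh
RO = vWoh @ H + Woh @ RH + vbo          # g' = 1, so RO = Ro

# Central-difference directional derivative for comparison.
r = 1e-5
O_plus = forward(Whi + r*vWhi, bh + r*vbh, Woh + r*vWoh, bo + r*vbo)
O_minus = forward(Whi - r*vWhi, bh - r*vbh, Woh - r*vWoh, bo - r*vbo)
assert np.allclose(RO, (O_plus - O_minus) / (2 * r), atol=1e-4)
```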
B.3 Error Function
E = (1/2)(Ō − O)²,

where Ō denotes the target output.
B.4 Backpropagation

B.4.1 Variables

∂E/∂O = −(Ō − O)
∂E/∂o = (∂E/∂O)(∂O/∂o) = −(Ō − O) g′(O)
∂E/∂H = (∂E/∂o)(∂o/∂H) = (∂E/∂o) W_oh
∂E/∂h = (∂E/∂H)(∂H/∂h) = (∂E/∂H) f′(H)

B.4.2 Parameters

∂E/∂W_oh = (∂E/∂o)(∂o/∂W_oh) = (∂E/∂o) H
∂E/∂W_hi = (∂E/∂h)(∂h/∂W_hi) = (∂E/∂h) I
∂E/∂b_o = (∂E/∂o)(∂o/∂b_o) = ∂E/∂o
∂E/∂b_h = (∂E/∂h)(∂h/∂b_h) = ∂E/∂h
B.5 Backpropagation with R-operator
B.5.1 Variables

R{∂E/∂O} = RO
R{∂E/∂o} = R{∂E/∂O}(∂O/∂o) = RO g′(O)
R{∂E/∂H} = R{∂E/∂o}(∂o/∂H) = R{∂E/∂o} W_oh
R{∂E/∂h} = R{∂E/∂H}(∂H/∂h) = R{∂E/∂H} f′(H)

B.5.2 Parameters

R{∂E/∂W_oh} = R{∂E/∂o}(∂o/∂W_oh) = R{∂E/∂o} H
R{∂E/∂W_hi} = R{∂E/∂h}(∂h/∂W_hi) = R{∂E/∂h} I
R{∂E/∂b_o} = R{∂E/∂o}(∂o/∂b_o) = R{∂E/∂o}
R{∂E/∂b_h} = R{∂E/∂h}(∂h/∂b_h) = R{∂E/∂h}
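The equations of B.5, seeded with RO from the R-forward pass, compute a curvature-vector product. The sketch below is our own numerical check (not from the dissertation): with f = tanh, g = identity, and squared error, it verifies the result against the Gauss-Newton product G v = Jᵀ J v mentioned in Appendix A, where J is the Jacobian of the output with respect to the parameters:

```python
import numpy as np

# Verify that the B.2/B.5 recipe yields the Gauss-Newton product J^T J v.
rng = np.random.default_rng(1)
Whi, bh = rng.normal(size=(3, 2)), rng.normal(size=3)
Woh, bo = rng.normal(size=(2, 3)), rng.normal(size=2)
I = rng.normal(size=2)

h = Whi @ I + bh                     # forward pass (B.1), f = tanh
H = np.tanh(h)

# Parameter direction v, split into per-parameter blocks.
vWoh, vWhi = rng.normal(size=(2, 3)), rng.normal(size=(3, 2))
vbo, vbh = rng.normal(size=2), rng.normal(size=3)

Rh = vWhi @ I + vbh                  # R-forward (B.2); RO = J v for g = id
RH = (1 - H**2) * Rh
RO = vWoh @ H + Woh @ RH + vbo

Rd_o = RO                            # B.5 with R{dE/dO} = RO and g' = 1
Rd_H = Woh.T @ Rd_o                  # transpose: column-vector convention
Rd_h = (1 - H**2) * Rd_H
Gv = np.concatenate([np.outer(Rd_o, H).ravel(),   # block for Woh
                     np.outer(Rd_h, I).ravel(),   # block for Whi
                     Rd_o, Rd_h])                 # blocks for bo, bh

# Reference: build J column-by-column with central differences.
def output(Woh_, Whi_, bo_, bh_):
    return Woh_ @ np.tanh(Whi_ @ I + bh_) + bo_

sizes = [Woh.size, Whi.size, bo.size, bh.size]
flat = np.concatenate([Woh.ravel(), Whi.ravel(), bo, bh])
def unflatten(x):
    a, b, c, d = np.split(x, np.cumsum(sizes)[:-1])
    return a.reshape(2, 3), b.reshape(3, 2), c, d
eps = 1e-5
J = np.array([(output(*unflatten(flat + eps * e))
               - output(*unflatten(flat - eps * e))) / (2 * eps)
              for e in np.eye(flat.size)]).T      # shape (2, n_params)
v = np.concatenate([vWoh.ravel(), vWhi.ravel(), vbo, vbh])
assert np.allclose(Gv, J.T @ (J @ v), atol=1e-5)
```

Note that the equations of B.5 drop the second-derivative terms of the full product rule; this is exactly what makes the result the positive semidefinite Gauss-Newton product rather than the full Hessian-vector product.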
Appendix C
RNN with R-operator
C.1 Forward Propagation
h_t = W_hi I_0 + W_hh H_init + b_h,   (t = 0)
h_t = W_hi I_{t−1} + W_hh H_{t−1} + b_h,   (t > 0)
H_t = f(h_t)
o_t = W_oh H_t + b_o
O_t = g(o_t)

C.2 Forward Propagation with R-operator

Rh_t = W_hi^v I_0 + W_hh^v H_init + W_hh RH_init + b_h^v,   (t = 0)
Rh_t = W_hi^v I_{t−1} + W_hh^v H_{t−1} + W_hh RH_{t−1} + b_h^v,   (t > 0)
RH_t = f′(H_t) Rh_t
Ro_t = W_oh^v H_t + W_oh RH_t + b_o^v
RO_t = g′(O_t) Ro_t
C.3 Error Function

E = (1/2)(Ō_t − O_t)²,

where Ō_t denotes the target output at time t.
C.4 Backpropagation
C.4.1 Variables

∂E/∂O_t = −(Ō_t − O_t)
∂E/∂o_t = (∂E/∂O_t)(∂O_t/∂o_t) = −(Ō_t − O_t) g′(O_t)
∂E/∂H_t = (∂E/∂o_t)(∂o_t/∂H_t) + (∂E/∂h_{t+1})(∂h_{t+1}/∂H_t) = (∂E/∂o_t) W_oh + (∂E/∂h_{t+1}) W_hh
∂E/∂h_t = (∂E/∂H_t)(∂H_t/∂h_t) = (∂E/∂H_t) f′(H_t)

C.4.2 Parameters

∂E/∂W_oh = (∂E/∂o_t)(∂o_t/∂W_oh) = (∂E/∂o_t) H_t
∂E/∂W_hh = (∂E/∂h_t)(∂h_t/∂W_hh) = (∂E/∂h_t) H_{t−1}
∂E/∂W_hi = (∂E/∂h_t)(∂h_t/∂W_hi) = (∂E/∂h_t) I_{t−1},   (t > 0)
∂E/∂W_hi = (∂E/∂h_0)(∂h_0/∂W_hi) = (∂E/∂h_0) I_0,   (t = 0)
∂E/∂b_o = (∂E/∂o_t)(∂o_t/∂b_o) = ∂E/∂o_t
∂E/∂b_h = (∂E/∂h_t)(∂h_t/∂b_h) = ∂E/∂h_t
∂E/∂H_init = (∂E/∂h_0)(∂h_0/∂H_init) = (∂E/∂h_0) W_hh
C.5 Backpropagation with R-operator
C.5.1 Variables

R{∂E/∂O_t} = RO_t
R{∂E/∂o_t} = R{∂E/∂O_t}(∂O_t/∂o_t) = RO_t g′(O_t)
R{∂E/∂H_t} = R{∂E/∂o_t}(∂o_t/∂H_t) + R{∂E/∂h_{t+1}}(∂h_{t+1}/∂H_t) = R{∂E/∂o_t} W_oh + R{∂E/∂h_{t+1}} W_hh
R{∂E/∂h_t} = R{∂E/∂H_t}(∂H_t/∂h_t) = R{∂E/∂H_t} f′(H_t)

C.5.2 Parameters

R{∂E/∂W_oh} = R{∂E/∂o_t}(∂o_t/∂W_oh) = R{∂E/∂o_t} H_t
R{∂E/∂W_hh} = R{∂E/∂h_t}(∂h_t/∂W_hh) = R{∂E/∂h_t} H_{t−1}
R{∂E/∂W_hi} = R{∂E/∂h_t}(∂h_t/∂W_hi) = R{∂E/∂h_t} I_{t−1},   (t > 0)
R{∂E/∂W_hi} = R{∂E/∂h_0}(∂h_0/∂W_hi) = R{∂E/∂h_0} I_0,   (t = 0)
R{∂E/∂b_o} = R{∂E/∂o_t}(∂o_t/∂b_o) = R{∂E/∂o_t}
R{∂E/∂b_h} = R{∂E/∂h_t}(∂h_t/∂b_h) = R{∂E/∂h_t}
R{∂E/∂H_init} = R{∂E/∂h_0}(∂h_0/∂H_init) = R{∂E/∂h_0} W_hh
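As with the feedforward case, the recurrent R-forward pass can be checked numerically. The sketch below is our own construction (f = tanh, g = identity, RH_init = 0 since H_init is held fixed, and input indexing simplified so that step t consumes input t):

```python
import numpy as np

# Check that the R-forward pass of C.2 computes the directional derivative
# of the RNN outputs O_t along a parameter direction v.
rng = np.random.default_rng(2)
Whi, Whh = rng.normal(size=(3, 2)), rng.normal(size=(3, 3))
bh = rng.normal(size=3)
Woh, bo = rng.normal(size=(2, 3)), rng.normal(size=2)
H_init = rng.normal(size=3)
Is = rng.normal(size=(4, 2))                  # input sequence

def run(Whi_, Whh_, bh_, Woh_, bo_):
    H, Os = H_init, []
    for I in Is:
        H = np.tanh(Whi_ @ I + Whh_ @ H + bh_)   # C.1 recurrence
        Os.append(Woh_ @ H + bo_)                # g = identity
    return np.array(Os)

# Parameter direction v.
vWhi, vWhh = rng.normal(size=(3, 2)), rng.normal(size=(3, 3))
vbh = rng.normal(size=3)
vWoh, vbo = rng.normal(size=(2, 3)), rng.normal(size=2)

# R-forward pass (C.2), carrying RH across time; RH_init = 0.
H, RH = H_init, np.zeros(3)
ROs = []
for I in Is:
    Rh = vWhi @ I + vWhh @ H + Whh @ RH + vbh    # uses previous H, RH
    H = np.tanh(Whi @ I + Whh @ H + bh)
    RH = (1 - H**2) * Rh                         # f'(H) = 1 - H^2
    ROs.append(vWoh @ H + Woh @ RH + vbo)
ROs = np.array(ROs)

# Central finite difference along v for comparison.
r = 1e-5
O_plus = run(Whi + r*vWhi, Whh + r*vWhh, bh + r*vbh, Woh + r*vWoh, bo + r*vbo)
O_minus = run(Whi - r*vWhi, Whh - r*vWhh, bh - r*vbh, Woh - r*vWoh, bo - r*vbo)
assert np.allclose(ROs, (O_plus - O_minus) / (2 * r), atol=1e-4)
```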
Bibliography
[1] O. Abdel-Hamid and H. Jiang. Rapid and effective speaker adaptation of con-
volutional neural network based models for speech recognition. In Proceed-
ings of the 14th Annual Conference of the International Speech Communication
Association, Lyon, France, Aug. 2013.
[2] O. Abdel-Hamid, A.-r. Mohamed, H. Jiang, and G. Penn. Applying Convolutional Neural Networks concepts to hybrid NN-HMM model for speech recognition. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, pages 4277–4280, Kyoto, Japan, Mar. 2012.
[3] E. Abravanel. Integrating the information from eyes and hands: A develop-
mental account. Intersensory Perception and Sensory Integration, pages 71–
108, 1981.
[4] P. S. Aleksic and A. K. Katsaggelos. Comparison of low- and high-level visual
features for audio-visual continuous automatic speech recognition. In Pro-
ceedings of the IEEE International Conference on Acoustics, Speech, and Signal
Processing, volume 5, pages 917–920, Montreal, Canada, May 2004.
[5] M. Anisfeld. Interpreting “imitative” responses in early infancy. Science,
205(4402):214–215, July 1979.
[6] E. Aronson and S. Rosenbloom. Space perception in early infancy: Perception
within a common auditory-visual space. Science, 172(3988):1161–1163, June
1971.
[7] J. Barker and F. Berthommier. Evidence of correlation between acoustic and
visual features of speech. In Proceedings of the 14th International Congress of
Phonetic Sciences, pages 5–9, San Francisco, CA, USA, Aug. 1999.
[8] R. Bekkerman, M. Bilenko, and J. Langford, editors. Scaling up Machine Learn-
ing: Parallel and Distributed Approaches. Cambridge University Press, 2011.
[9] Y. Bengio. Learning deep architectures for AI. Foundations and Trends in Ma-
chine Learning, 2(1):1–127, Jan. 2009.
[10] Y. Bengio, P. Simard, and P. Frasconi. Learning long-term dependencies with
gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2):157–
166, Jan. 1994.
[11] H. Bourlard and S. Dupont. A new ASR approach based on independent pro-
cessing and recombination of partial frequency bands. In Proceedings of the
4th International Conference on Spoken Language Processing, volume 1, pages
426–429, Philadelphia, PA, USA, Oct. 1996.
[12] H. Bourlard, S. Dupont, and C. Ris. Multi-stream speech recognition. IDIAP
Research Report, 1996.
[13] H. Bourlard and N. Morgan. Connectionist Speech Recognition: A Hybrid
Approach. Springer US, Boston, MA, 1994.
[14] T. G. R. Bower, J. M. Broughton, and M. K. Moore. The coordination of visual
and tactual input in infants. Attention, Perception, & Psychophysics, 8(1):51–53,
Jan. 1979.
[15] N. Brooke and E. D. Petajan. Seeing speech: Investigations into the synthesis
and recognition of visible speech movements using automatic image process-
ing and computer graphics. In Proceedings of the International Conference on
Speech Input and Output, Techniques and Applications, pages 104–109, Lon-
don, UK, Mar. 1986.
[16] R. A. Brooks, C. Breazeal (Ferrell), R. Irie, C. C. Kemp, M. Marjanovic, B. Scassellati,
and M. M. Williamson. Alternative essences of intelligence. In Proceedings of
the 15th National Conference on Artificial Intelligence, pages 961–968, Madi-
son, WI, USA, July 1998.
[17] A. Chitu and L. J. Rothkrantz. Automatic Visual Speech Recognition, chapter 6,
pages 95–120. Speech Enhancement, Modeling and Recognition- Algorithms
and Applications. InTech, 2012.
[18] M. Coen. Multimodal Integration-A Biological View. In Proceedings of the 17th
International Joint Conference on Artificial Intelligence, volume 2, pages 1417–
1424, Seattle, WA, USA, Aug. 2001.
[19] T. Cootes, G. Edwards, and C. Taylor. Active appearance models. IEEE Transac-
tions on Pattern Analysis and Machine Intelligence, 23(6):681–685, June 2001.
[20] M. Critchley. Ecstatic and synaesthetic experience during musical perception.
Music and brain: Studies in the neurology of music. Charles C Thomas, Spring-
field, IL, USA, 1977.
[21] R. E. Cytowic. Synesthesia: A Union of the Senses, 2nd edition. Springer-Verlag,
New York, 1989.
[22] G. E. Dahl and A. Acero. Context-dependent pre-trained deep neural networks
for large-vocabulary speech recognition. IEEE Transactions on Audio, Speech,
and Language Processing, 20(1):30–42, Jan. 2012.
[23] S. Deneve and A. Pouget. Bayesian multisensory integration and cross-modal
spatial links. Journal of Physiology Paris, 98(1-3):249–258, Jan. 2004.
[24] J. Dewey. The reflex arc concept in psychology. Psychological Review, 3:357–
370, 1896.
[25] M. O. Ernst and H. H. Bülthoff. Merging the senses into a robust percept. Trends
in Cognitive Sciences, 8(4):162–169, Apr. 2004.
[26] A. Falchier, S. Clavagnier, P. Barone, and H. Kennedy. Anatomical evidence of
multimodal integration in primate striate cortex. The Journal of Neuroscience,
22(13):5749–5759, July 2002.
[27] X. Feng, Y. Zhang, and J. Glass. Speech feature denoising and dereverberation
via deep autoencoders for noisy reverberant speech recognition. In Proceed-
ings of the IEEE International Conference on Acoustics, Speech, and Signal Pro-
cessing, pages 1759–1763, Florence, Italy, May 2014.
[28] V. Franc and V. Hlavac. Statistical Pattern Recognition Toolbox for Matlab. http://cmp.felk.cvut.cz/cmp/software/stprtool/, Aug. 2008.
[29] F. Frassinetti, N. Bolognini, and E. Làdavas. Enhancement of visual percep-
tion by crossmodal visuo-auditory interaction. Experimental Brain Research,
147(3):332–343, Dec. 2002.
[30] I. Gallagher. Philosophical conceptions of the self: implications for cognitive
science. Trends in Cognitive Sciences, 4(1):14–21, Jan. 2000.
[31] J. Gardner and H. Gardner. A note on selective imitation by a six-week-old
infant. Child Development, 41(4):1209–1213, Dec. 1970.
[32] Willow Garage. Personal Robot 2 (PR2). http://www.willowgarage.com/.
[33] M. D. Grilli and E. L. Glisky. Self-Imagining Enhances Recognition Memory in
Memory-Impaired Individuals with Neurological Damage. Neuropsychology,
24(6):698–710, Nov. 2010.
[34] M. D. Grilli and E. L. Glisky. The self-imagination effect: benefits of a self-
referential encoding strategy on cued recall in memory-impaired individuals
with neurological damage. Journal of the International Neuropsychological So-
ciety, 17(5):929–933, Sept. 2011.
[35] M. Gurban, J.-P. Thiran, T. Drugman, and T. Dutoit. Dynamic modality weight-
ing for multi-stream HMMs in audio-visual speech recognition. In Proceedings
of the 10th International Conference on Multimodal Interfaces, pages 237–240,
Chania, Greece, Oct. 2008.
[36] M. Heckmann, K. Kroschel, and C. Savariaux. DCT-based video features for
audio-visual speech recognition. In Proceedings of the 7th International Con-
ference on Spoken Language Processing, volume 3, pages 1925–1928, Denver,
CO, USA, Sept. 2002.
[37] R. Held. Shifts in binaural localization after prolonged exposures to atypical
combinations of stimuli. The American Journal of Psychology, 68(4):526–548,
Dec. 1955.
[38] R. A. Henson. Neurological Aspects of Musical Experience. Music and the Brain:
Studies in the Neurology of Music. William Heinemann Medical Books Lim-
ited, London, 1977.
[39] H. Hermansky, D. Ellis, and S. Sharma. Tandem connectionist feature extrac-
tion for conventional HMM systems. In Proceedings of the IEEE International
Conference on Acoustics, Speech, and Signal Processing, volume 3, pages 1635–
1638, Istanbul, Turkey, June 2000.
[40] G. Hinton, L. Deng, D. Yu, G. Dahl, A. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. Sainath, and B. Kingsbury. Deep neural networks for acoustic modeling in speech recognition. IEEE Signal Processing Magazine, 29:82–97, Nov. 2012.
[41] G. E. Hinton and R. R. Salakhutdinov. Reducing the dimensionality of data with
neural networks. Science, 313(5786):504–507, July 2006.
[42] R. Hof. Meet The Guy Who Helped Google Beat Apple's Siri. http://www.forbes.com/sites/roberthof/2013/05/01/meet-the-guy-who-helped-google-beat-apples-siri/, May 2013.
[43] I. P. Howard and W. B. Templeton. Human Spatial Orientation. Wiley, London,
1966.
[44] J. Huang and B. Kingsbury. Audio-visual deep learning for noise robust speech
recognition. In Proceedings of the IEEE International Conference on Acous-
tics, Speech, and Signal Processing, pages 7596–7599, Vancouver, Canada, May
2013.
[45] A. Janin, D. Ellis, and N. Morgan. Multi-stream speech recognition: Ready for
prime time? In Proceedings of the 6th European Conference on Speech Commu-
nication and Technology, Budapest, Hungary, Sept. 1999.
[46] A. Jauffret, N. Cuperlier, P. Gaussier, and P. Tarroux. Multimodal integration of
visual place cells and grid cells for navigation tasks of a real robot. In Proceed-
ings of the 12th International Conference on Simulation of Adaptive Behavior,
volume 7426, pages 136–145, Odense, Denmark, Aug. 2012.
[47] K. Kaneko, F. Kanehiro, S. Kajita, H. Hirukawa, T. Kawasaki, M. Hirata,
K. Akachi, and T. Isozumi. Humanoid robot HRP-2. In Proceedings of the IEEE
International Conference on Robotics and Automation, volume 2, pages 1083–
1090, Barcelona, Spain, Apr. 2004.
[48] T. Kawabe, W. Roseboom, and S. Nishida. The sense of agency is action-effect
causality perception based on cross-modal grouping. Proceedings of the Royal
Society B: Biological Sciences, 280(1763):20130991, July 2013.
[49] A. Krizhevsky and G. E. Hinton. Using very deep autoencoders for content-
based image retrieval. In Proceedings of the 19th European Symposium on Ar-
tificial Neural Networks, Bruges, Belgium, Apr. 2011.
[50] A. Krizhevsky, I. Sutskever, and G. Hinton. Imagenet classification with deep
convolutional neural networks. In Proceedings of the Advances in Neural In-
formation Processing Systems 25, pages 1106–1114, Lake Tahoe, NV, USA, Dec.
2012.
[51] K. Kumar, T. Chen, and R. Stern. Profile view lip reading. In Proceedings of
the IEEE International Conference on Acoustics, Speech, and Signal Processing,
Honolulu, Hawaii, Apr 2007.
[52] T. Kuriyama, T. Shibuya, T. Harada, and Y. Kuniyoshi. Learning Interaction
Rules through Compression of Sensori-Motor Causality Space. In Proceed-
ings of the 10th International Conference on Epigenetic Robotics, pages 57–64,
Örenäs Slott, Sweden, Nov. 2010.
[53] H. Kuwabara, K. Takeda, Y. Sagisaka, S. Katagiri, S. Morikawa, and T. Watan-
abe. Construction of a large-scale Japanese speech database and its manage-
ment system. In Proceedings of the IEEE International Conference on Acous-
tics, Speech, and Signal Processing, pages 560–563, Glasgow, Scotland, UK, May
1989.
[54] Y. Lan, B.-j. Theobald, R. Harvey, E.-j. Ong, and R. Bowden. Improving vi-
sual features for lip-reading. In Proceedings of the International Conference
on Auditory-Visual Speech Processing, Hakone, Japan, Oct. 2010.
[55] K. Lang, A. Waibel, and G. Hinton. A time-delay neural network architecture
for isolated word recognition. Neural Networks, 3:23–43, 1990.
[56] Q. V. Le, M. Ranzato, R. Monga, M. Devin, K. Chen, G. S. Corrado, J. Dean,
and A. Y. Ng. Building high-level features using large scale unsupervised learn-
ing. In Proceedings of the 29th International Conference on Machine Learning,
pages 81–88, Edinburgh, Scotland, July 2012.
[57] Y. LeCun and L. Bottou. Learning methods for generic object recognition with
invariance to pose and lighting. In Proceedings of the IEEE Computer Society
Conference on Computer Vision and Pattern Recognition, volume 2, pages 97–
104, Washington, D.C., USA, June 2004.
[58] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning ap-
plied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, Nov.
1998.
[59] H. Lee, R. Grosse, R. Ranganath, and A. Y. Ng. Convolutional deep belief net-
works for scalable unsupervised learning of hierarchical representations. In
Proceedings of the 26th International Conference on Machine Learning, pages
609–616, Montreal, Canada, June 2009.
[60] H. Lee, P. Pham, Y. Largman, and A. Y. Ng. Unsupervised feature learning for
audio classification using convolutional deep belief networks. In Proceedings
of the Advances in Neural Information Processing Systems 22, pages 1096–1104,
Vancouver, Canada, 2009.
[61] J. Luettin, N. Thacker, and S. Beet. Visual speech recognition using active shape
models and hidden Markov models. In Proceedings of the IEEE International
Conference on Acoustics, Speech, and Signal Processing, volume 2, pages 817–
820, Atlanta, GA, USA, May 1996.
[62] A. L. Maas, T. M. O’Neil, A. Y. Hannun, and A. Y. Ng. Recurrent neural network
feature enhancement: The 2nd chime challenge. In Proceedings of the 2nd In-
ternational Workshop on Machine Listening in Multisource Environments, Van-
couver, Canada, June 2013.
[63] L. E. Marks. On colored-hearing synesthesia: Cross-modal translations of sen-
sory dimensions. Psychological Bulletin, 82(3):303–331, May 1975.
[64] L. E. Marks. The Unity of the Senses: Interrelations Among the Modalities. Aca-
demic Press Series in Cognition and Perception. Academic Press, 1978.
[65] J. Martens. Deep learning via Hessian-free optimization. In Proceedings of
the 27th International Conference on Machine Learning, pages 735–742, Haifa,
Israel, June 2010.
[66] J. Martens and I. Sutskever. Learning recurrent neural networks with Hessian-
free optimization. In Proceedings of the 28th International Conference on Ma-
chine Learning, pages 1033–1040, Bellevue, WA, USA, June 2011.
[67] I. Matthews, T. Cootes, J. Bangham, S. Cox, and R. Harvey. Extraction of visual
features for lipreading. IEEE Transactions on Pattern Analysis and Machine
Intelligence, 24(2):198–213, 2002.
[68] I. Matthews, G. Potamianos, C. Neti, and J. Luettin. A comparison of model and transform-based visual features for audio-visual LVCSR. In Proceedings of the IEEE International Conference on Multimedia and Expo, Tokyo, Japan, Aug. 2001.
[69] H. McGurk and J. MacDonald. Hearing lips and seeing voices. Nature, 264:746–
748, Dec. 1976.
[70] A. N. Meltzoff. Towards a developmental cognitive science: The implications
of cross-modal matching and imitation for the development of representation
and memory in infancy. Annals of the New York Academy of Sciences, 608:1–31,
Dec. 1990.
[71] A. N. Meltzoff and M. K. Moore. Imitation of facial and manual gestures by
human neonates. Science, 198(4312):75–78, Oct. 1977.
[72] A. Mohamed, G. E. Dahl, and G. E. Hinton. Acoustic Modeling Using Deep
Belief Networks. IEEE Transactions on Audio, Speech, and Language Processing,
20(1):14–22, 2012.
[73] R. R. Murphy. Introduction to AI Robotics. The MIT Press, 2000.
[74] A. V. Nefian, L. Liang, X. Pi, X. Liu, and K. Murphy. Dynamic bayesian networks
for audio-visual speech recognition. EURASIP Journal on Applied Signal Pro-
cessing, 11:1274–1288, 2002.
[75] J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A. Y. Ng. Multimodal deep
learning. In Proceedings of the 28th International Conference on Machine
Learning, pages 689–696, Bellevue, WA, USA, June 2011.
[76] NVIDIA Corporation. CUBLAS library version 6.0 user guide. CUDA Toolkit
Documentation, Feb. 2014.
[77] M. Ogino, H. Toichi, Y. Yoshikawa, and M. Asada. Interaction rule learning
with a human partner based on an imitation faculty with a simple visuo-motor
mapping. Robotics and Autonomous Systems, 54(5):414–418, May 2006.
[78] D. Palaz, R. Collobert, and M. Magimai.-Doss. Estimating phoneme class con-
ditional probabilities from raw speech signal using convolutional neural net-
works. In Proceedings of the 14th Annual Conference of the International Speech
Communication Association, Lyon, France, Aug. 2013.
[79] B. Pearlmutter. Fast exact multiplication by the Hessian. Neural Computation,
6(1):147–160, Jan. 1994.
[80] J. Piaget. Play, dreams, and imitation in childhood. W. W. Norton, New York,
1962.
[81] H. L. Pick, D. H. Warren, and J. C. Hay. Sensory conflict in judgments of spatial
direction. Perception & Psychophysics, 6(4):203–205, July 1969.
[82] A. Pitti, A. Blanchard, M. Cardinaux, and P. Gaussier. Distinct mechanisms
for multimodal integration and unimodal representation in spatial develop-
ment. In Proceedings of the IEEE International Conference on Development and
Learning and Epigenetic Robotics, pages 1–6, San Diego, CA, USA, Nov. 2012.
[83] A. Pouget, S. Deneve, and J. Duhamel. A computational perspective on the
neural basis of multisensory spatial representations. Nature Reviews Neuro-
science, 3:741–747, Sept. 2002.
[84] V. S. Ramachandran and E. M. Hubbard. Hearing colors, tasting shapes. Scien-
tific American, 16:76–83, May 2006.
[85] S. Renals, N. Morgan, H. Bourlard, M. Cohen, and H. Franco. Connectionist probability estimators in HMM speech recognition. IEEE Transactions on Speech and Audio Processing, 2(1):161–174, 1994.
[86] J. Robert-Ribes, M. Piquemal, J.-L. Schwartz, and P. Escudier. Exploiting sensor
fusion architectures and stimuli complementarity in av speech recognition. In
D. Stork and M. Hennecke, editors, Speechreading by Humans and Machines,
pages 193–210. Springer Berlin Heidelberg, 1996.
[87] Aldebaran Robotics. NAO Humanoid, Nov. 2012.
[88] S. A. Rose. Cross-modal transfer in human infants: What is being transferred?
Annals of the New York Academy of Sciences, 608:38–50, Dec. 1990.
[89] C. Rosenberg. Improving Photo Search: A Step Across the Semantic Gap. http://googleresearch.blogspot.jp/2013/06/improving-photo-search-step-across.html, June 2013.
[90] T. N. Sainath, B. Kingsbury, and B. Ramabhadran. Auto-encoder bottleneck
features using deep belief networks. In Proceedings of the IEEE International
Conference on Acoustics, Speech, and Signal Processing, pages 4153–4156, Ky-
oto, Japan, Mar. 2012.
[91] Y. Sakagami, R. Watanabe, C. Aoyama, S. Matsunaga, N. Higaki, and K. Fu-
jimura. The intelligent ASIMO: system overview and integration. In Proceed-
ings of the IEEE/RSJ International Conference on Intelligent Robots and System,
volume 3, pages 2478–2483, Lausanne, Switzerland, Oct. 2002.
[92] M. Sams, R. Aulanko, M. Hämäläinen, R. Hari, O. V. Lounasmaa, S. Lu, and
J. Simola. Seeing speech: visual information from lip movements modifies ac-
tivity in the human auditory cortex. Neuroscience Letters, 127(1):141–145, June
1991.
[93] E. Sauser and A. Billard. Biologically Inspired Multimodal Integration: Inter-
ferences in a Human-Robot Interaction Game. In Proceedings of the IEEE/RSJ
International Conference on Intelligent Robots and Systems, pages 5619–5624,
Beijing, China, Oct. 2006.
[94] P. Scanlon and R. Reilly. Feature analysis for automatic speechreading. In Pro-
ceedings of the IEEE 4th Workshop on Multimedia Signal Processing, pages 625–
630, Cannes, France, Oct. 2001.
[95] B. R. Shelton and C. L. Searle. The influence of vision on the absolute identi-
fication of sound-source position. Perception & Psychophysics, 28(6):589–596,
1980.
[96] M. Slaney. Auditory Toolbox: A MATLAB Toolbox for Auditory Modeling Work
Version 2. Interval Research Corporation, 1998.
[97] E. S. Spelke. The development of intermodal perception. Handbook of infant
perception. Academic Press, New York, 1987.
[98] N. Srivastava and R. Salakhutdinov. Multimodal learning with deep Boltzmann
machines. In Proceedings of the Advances in Neural Information Processing
Systems 25, pages 2231–2239, Lake Tahoe, NV, USA, Dec. 2012.
[99] B. Stein and N. London. Enhancement of perceived visual intensity by auditory
stimuli: a psychophysical analysis. Journal of Cognitive Neuroscience, 8(6):497–
506, Nov. 1996.
[100] B. E. Stein. Neural mechanisms for synthesizing sensory information and pro-
ducing adaptive behaviors. Experimental Brain Research, 123(1-2):124–135,
Nov. 1998.
[101] B. E. Stein and M. A. Meredith. The merging of the senses. The MIT Press, 1993.
[102] W. H. Sumby and I. Pollack. Visual contribution to speech intelligibility in
noise. Journal of the Acoustical Society of America, 26:212–215, 1954.
[103] I. Sutskever, J. Martens, and G. Hinton. Generating text with recurrent neu-
ral networks. In Proceedings of the 28th International Conference on Machine
Learning, pages 1017–1024, Bellevue, WA, USA, June 2011.
[104] W. A. Teder-Sälejärvi, F. Di Russo, J. J. McDonald, and S. A. Hillyard. Effects of
spatial congruity on audio-visual multimodal integration. Journal of Cognitive
Neuroscience, 17(9):1396–1409, Sept. 2005.
[105] W. R. Thurlow and T. M. Rosenthal. Further study of existence regions for the
“ventriloquism effect”. Journal of the American Audiology Society, 1(6):280–
286, 1976.
[106] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol. Extracting and com-
posing robust features with denoising autoencoders. In Proceedings of the 25th
international conference on Machine learning, pages 1096–1103, New York, NY,
USA, July 2008.
[107] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P.-A. Manzagol. Stacked de-
noising autoencoders: Learning useful representations in a deep network with
a local denoising criterion. Journal of Machine Learning Research, 11:3371–
3408, 2010.
[108] J. Vroomen and B. de Gelder. Sound enhances visual perception: cross-modal
effects of auditory organization on vision. Journal of Experimental Psychology:
Human Perception and Performance, 26(5):1583–1590, Oct. 2000.
[109] D. H. Warren, R. B. Welch, and T. J. McCarthy. The role of visual-auditory “com-
pellingness” in the ventriloquism effect: Implications for transitivity among
the spatial senses. Perception & Psychophysics, 30(6):557–564, Nov. 1981.
[110] R. B. Welch and D. H. Warren. Immediate perceptual response to intersensory
discrepancy. Psychological Bulletin, 88(3):638–667, Nov. 1980.
[111] R. B. Welch and D. H. Warren. Intersensory interactions. In K. R. Boff, L. Kaufman, and J. P. Thomas, editors, Sensory Processes and Perception, volume 1 of Handbook of Perception and Human Performance, pages 25-1–25-36. Wiley, New York, 1986.
[112] H. Yehia, P. Rubin, and E. Vatikiotis-Bateson. Quantitative association of vocal-
tract and facial behavior. Speech Communication, 26:23–43, 1998.
[113] T. Yoshida, K. Nakadai, and H. G. Okuno. Automatic speech recognition im-
proved by two-layered audio-visual integration for robot audition. In Proceed-
ings of the 9th IEEE-RAS International Conference on Humanoid Robots, pages
604–609, Paris, France, Dec. 2009.
[114] S. Young, G. Evermann, M. Gales, T. Hain, X. A. Liu, G. Moore, J. Odell, D. Ol-
lason, D. Povey, V. Valtchev, and P. Woodland. The HTK Book (for HTK Version
3.4). Cambridge University Engineering Department, 2009.
[115] X. Zhang, C. Broun, R. Mersereau, and M. Clements. Automatic speechreading
with applications to human-computer interfaces. EURASIP Journal on Applied
Signal Processing, 11:1228–1247, 2002.
Relevant Publications
Journal Papers
1. K. Noda, Y. Yamaguchi, K. Nakadai, H. G. Okuno, and T. Ogata. Audio-visual
speech recognition using deep learning, Applied Intelligence, Vol.42, Issue 4,
pp. 722–737, Jun. 2015.
2. K. Noda, H. Arie, Y. Suga, and T. Ogata. Multimodal Integration Learning of
Robot Behavior using Deep Neural Networks, Robotics and Autonomous Sys-
tems, Vol.62, Issue 6, pp. 721–736, Jun. 2014.
International Conferences
1. K. Noda, Y. Yamaguchi, K. Nakadai, H. G. Okuno, and T. Ogata. Lipreading using Convolutional Neural Network, Proceedings of Interspeech, pp. 1149–1153, Sep. 2014, Singapore.
2. K. Noda, H. Arie, Y. Suga, and T. Ogata. Intersensory causality modeling using deep neural networks, Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics (SMC 2013), pp. 1995–2000, Oct. 2013, Manchester, UK.
3. K. Noda, H. Arie, Y. Suga, and T. Ogata. Multimodal integration learning of object manipulation behaviors using deep neural networks, Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2013), pp. 1728–1733, Nov. 2013, Tokyo, Japan.
Domestic Conferences
1. K. Noda, H. Arie, Y. Suga, and T. Ogata. Sensory-Motor Integration and Understanding of Co-occurrence in Robots Using Deep Learning, 3rd Annual Meeting of the Japanese Society for Developmental Neuroscience, Oct. 2014.
2. K. Noda, Y. Yamaguchi, K. Nakadai, H. G. Okuno, and T. Ogata. Multimodal Speech Recognition Using Deep Neural Networks, 32nd Annual Conference of the Robotics Society of Japan, 1I1–04, Sep. 2014.
3. K. Noda, H. Arie, Y. Suga, and T. Ogata. Adaptive Behavior Selection for a Humanoid Robot Using Deep Neural Networks, GPU Technology Conference Japan, 2014–8001, July 2014.
4. K. Noda, H. Arie, Y. Suga, and T. Ogata. Object Manipulation Behavior Recognition by a Humanoid Robot via a Sensory-Motor Integration Mechanism Using Deep Neural Networks, JSME Conference on Robotics and Mechatronics, 3P2–P03, May 2014.
5. K. Noda, H. Arie, Y. Suga, and T. Ogata. Integration and Co-occurrence of Visual, Auditory, and Motion Data Using Deep Neural Networks, 28th Annual Conference of the Japanese Society for Artificial Intelligence, 3H4–OS–24b–3, May 2014.
6. Y. Yamaguchi, K. Noda, K. Nakadai, H. G. Okuno, and T. Ogata. Feature Learning for Multimodal Speech Recognition Using Deep Neural Networks, 76th National Convention of the Information Processing Society of Japan, 5S–3, Mar. 2014.
7. K. Noda, H. Arie, Y. Suga, and T. Ogata. Memory Learning and Generation of Object Manipulation Behaviors by a Humanoid Robot Using Deep Neural Networks, 27th Annual Conference of the Japanese Society for Artificial Intelligence, 2G4–OS–19a–2, June 2013.
8. K. Noda, H. Arie, Y. Suga, and T. Ogata. Adaptive Behavior Selection of a Humanoid Robot via an Associative Memory Mechanism Using Deep Neural Networks, JSME Conference on Robotics and Mechatronics, 1P1–B01, May 2013.
Other Publications
Journal Papers
1. Y. Hoshino, K. Kawamoto, K. Noda, and K. Sabe. Self-Regulation Mechanism: A Principle for Continual Autonomous Learning in Open-Ended Environments, Journal of the Robotics Society of Japan, Vol.29, Issue 1, pp. 77–88, Jan. 2011.
2. M. Suzuki, K. Noda, Y. Suga, T. Ogata, and S. Sugano. Dynamic Perception af-
ter Visually-Guided Grasping by a Human-Like Autonomous Robot, Advanced
Robotics, Vol.20, No. 2, pp. 233–254, Feb. 2006.
3. M. Ito, K. Noda, Y. Hoshino, and J. Tani. Dynamic and interactive generation of
object handling behaviors by a small humanoid robot using a dynamic neural
network model, Neural Networks, Vol.19, Issue 3, pp. 323–337, Apr. 2006.
International Conferences
1. A. Schmitz, Y. Bansho, K. Noda, H. Iwata, T. Ogata, and S. Sugano. Tactile Object Recognition Using Deep Learning and Dropout, Proceedings of the IEEE-RAS International Conference on Humanoid Robots (Humanoids 2014), pp. 1044–1050, Nov. 2014, Madrid, Spain.
2. Y. Yamaguchi, K. Noda, S. Nishide, H. G. Okuno, and T. Ogata. Learning and Association of Synesthesia Phenomenon using Deep Neural Networks, Proceedings of the IEEE/SICE International Symposium on System Integration (SII 2013), pp. 659–664, Dec. 2013, Kobe, Japan.
3. H. Nobuta, K. Kawamoto, K. Noda, K. Sabe, H. G. Okuno, S. Nishide, and T. Ogata. Body area segmentation from visual scene based on predictability of neuro-dynamical system, Proceedings of the IEEE International Joint Conference on Neural Networks (IJCNN 2012), Jun. 2012, Brisbane, Australia.
4. K. Noda, K. Kawamoto, T. Hasuo, and K. Sabe. A generative model for developmental understanding of visuomotor experience, Proceedings of the IEEE International Conference on Development and Learning and Epigenetic Robotics (ICDL-EpiRob 2011), Aug. 2011, Frankfurt, Germany.
5. K. Noda, M. Ito, Y. Hoshino, and J. Tani. Dynamic Generation and Switching of Object Handling Behaviors by a Humanoid Robot Using a Recurrent Neural Network Model, Proceedings of the International Conference on the Simulation of Adaptive Behavior (SAB'06), Lecture Notes in Artificial Intelligence, Vol. 4095, pp. 185–196, Sep. 2006, Rome, Italy.
6. F. Tanaka, K. Noda, T. Sawada, and M. Fujita. Associated Emotion and Its Expression in an Entertainment Robot QRIO, Proceedings of the International Conference on Entertainment Computing (ICEC 2004), pp. 499–504, Sep. 2004, Eindhoven, Netherlands.
7. K. Noda, M. Suzuki, N. Tsuchiya, Y. Suga, T. Ogata, and S. Sugano. Robust Modeling of Dynamic Environment based on Robot Embodiment, Proceedings of the IEEE International Conference on Robotics and Automation (ICRA 2003), pp. 3565–3570, Sep. 2003, Taipei, Taiwan.
8. T. Ogata, T. Komiya, K. Noda, and S. Sugano. Influence of the Eye Motions in Human-Robot Communication and Motion Generation based on the Robot Body Structure, Proceedings of the IEEE-RAS International Conference on Humanoid Robots (Humanoids 2001), pp. 83–89, Nov. 2001, Tokyo, Japan.
9. T. Ogata, Y. Matsuyama, T. Komiya, M. Ida, K. Noda, and S. Sugano. Development of Emotional Communication Robot WAMOEBA-2R: Experimental Evaluation of the Emotional Communication between Robots and Humans, Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2000), pp. 175–180, Nov. 2000, Takamatsu, Japan.
Domestic Conferences
1. S. Terada, K. Noda, and T. Ogata. A Manga Author Identification System Applying CNN-Based Image Recognition, 15th SICE System Integration Division Annual Conference (SI2014), 3G2–4, Dec. 2014.
2. K. Sasaki, H. Tjandra, K. Noda, K. Takahashi, and T. Ogata. Drawing Motion Association from Drawn Images Using a Recurrent Neural Network Model, 15th SICE System Integration Division Annual Conference (SI2014), 3H2–4, Dec. 2014.
3. H. Deki, K. Noda, and T. Ogata. Generalization of Spatial Representations by Integrating Visuomotor Information Using Deep Neural Networks, 32nd Annual Conference of the Robotics Society of Japan, 1B2–01, Sep. 2014.
4. K. Takahashi, T. Ogata, H. Tjandra, K. Noda, S. Murata, H. Arie, and S. Sugano. Tool Embodiment and Acquisition of Tool Functions via Neural Network Models and Body Babbling, JSME Conference on Robotics and Mechatronics, 3P2–P02, May 2014.
5. K. Takahashi, T. Ogata, H. Tjandra, K. Noda, S. Murata, H. Arie, and S. Sugano. Tool Embodiment via Body Babbling and a Recurrent Neural Network Model: Image Feature Extraction by Deep Learning, 28th Annual Conference of the Japanese Society for Artificial Intelligence, 1I4–OS–09a–4, May 2014.
6. H. Arie, K. Noda, Y. Suga, and T. Ogata. Self-Other Discrimination Using Predictability with a Recurrent Neural Network Model, 27th Annual Conference of the Japanese Society for Artificial Intelligence, 3J3–OS–20b–1, June 2013.
7. Y. Yamaguchi, K. Noda, S. Nishide, H. G. Okuno, and T. Ogata. Learning and Association of the Synesthesia Phenomenon Using Multilayer Neural Network Models, 75th National Convention of the Information Processing Society of Japan, 1R–2, Mar. 2013.
8. H. Nobuta, K. Kawamoto, K. Noda, K. Sabe, S. Nishide, H. G. Okuno, and T. Ogata. Self-Body Area Extraction and Self-Organization of the Visuomotor System via a Neurodynamical Model, 30th Annual Conference of the Robotics Society of Japan, 2H3–2, Sep. 2012.
9. H. Nobuta, K. Kawamoto, K. Noda, K. Sabe, H. G. Okuno, and T. Ogata. Visual Field Change Prediction and the Emergence of Place-Perception Neurons Using a Recurrent Neural Network Model, 74th National Convention of the Information Processing Society of Japan, 5P–8, Mar. 2012, Nagoya Institute of Technology.
10. K. Noda, M. Suzuki, T. Ogata, and S. Sugano. Embodiment-Based Novelty Detection in the Environment and the Robot Itself, 20th Annual Conference of the Robotics Society of Japan, 1C31, Oct. 2002.
11. T. Komiya, K. Noda, N. Tsuchiya, T. Ogata, and S. Sugano. Motion Generation via Whole-Body Coordination Using Distributed Agents, JSME Conference on Robotics and Mechatronics, 2P1–D06, June 2002.
12. K. Noda, M. Ida, T. Ogata, and S. Sugano. Communication between Humans and a Robot with an Embodiment-Based State Expression Function, JSME Conference on Robotics and Mechatronics, 1P1–D10(1), June 2001.
13. T. Ogata, Y. Matsuyama, T. Komiya, M. Ida, K. Noda, and S. Sugano. An Experimental Study of Communication between Humans and an Autonomous Robot: System Design and Cross-Population Comparison of Psychological Evaluations, 18th Annual Conference of the Robotics Society of Japan, pp. 479–480, Sep. 2000.
14. T. Ogata, Y. Matsuyama, T. Komiya, M. Ida, K. Noda, and S. Sugano. Development of the Autonomous Robot WAMOEBA-2R: Arm System Installation and Psychological Experiments, JSME Conference on Robotics and Mechatronics, 1A1–80–114, May 2000.
15. T. Ogata, Y. Matsuyama, T. Komiya, M. Ida, K. Noda, and S. Sugano. Development of the Emotional Communication Robot Wamoeba-2R: System Configuration and Evaluation Experiments, 5th Robotics Symposia, pp. 68–73, Mar. 2000.