MULTIMODAL INTEGRATION FOR ROBOT SYSTEMS USING DEEP LEARNING
ディープラーニングによるロボットシステムのためのマルチモーダル統合
July 2015
Kuniaki NODA
野田 邦昭
Waseda University Doctoral Dissertation
Waseda University
Graduate School of Fundamental Science and Engineering
Department of Intermedia Art and Science,
Research on Intelligence Dynamics and Representation Systems
Abstract
Intelligent machines such as smartphones, auto-driving cars, and domestic robots
are expected to become increasingly common in everyday life. Consequently, strong
demands for a noise-robust human-machine interface that enables stress-free in-
teraction as well as intelligent technologies that enable stable environmental recog-
nition and adaptive behavior generation for autonomous robots may arise in the
near future. To realize these functions, we need to address two fundamental re-
quirements: (1) robust recognition of poorly reproducible real-world information
and (2) adaptive behavior selection of robots depending on dynamic environmen-
tal changes. The main aim of this study is to address these requirements through a
machine learning approach that implements multimodal integration learning.
Humans succeed in recognizing an environment and mastering many tasks by
combining inputs from multiple modalities, including vision, audition, and somatic
sensation. On the other hand, the sensory inputs to most robotic applications
are commonly preprocessed through dedicated feature extraction mechanisms and
sensory-motor information processing algorithms based on perceptual and action
generation objectives. In essence, mutual intersensory processes are rarely taken
into consideration for realizing environmental recognition and behavior generation.
With regard to sensory feature extraction and multimodal integration learning mech-
anisms, deep learning approaches have recently attracted considerable attention.
One of the main advantages of applying deep neural networks (DNNs) is that they
self-organize highly generalized sensory features from large-scale raw data. The
same approach has also been applied for obtaining fused representations over mul-
tiple modalities, resulting in significant improvements in speech recognition perfor-
mance. However, DNNs have never been applied to multimodal integration learning
of dynamic information such as robot behaviors.
This study aims to address the two fundamental requirements presented above
through the following three approaches: (1) utilization of highly generalized sensory
features, (2) fusional utilization of multimodal information, and (3) memory predic-
tion and association among multiple modalities. In practice, highly generalized sen-
sory features and their integrated features acquired by integration learning of mul-
timodal information enable noise-robust recognition. In addition, a cross-modal
memory retrieval function based on an acquired intersensory synchrony model
enables adaptive behavior selection of robots depending on dynamic environmental
changes.
Our proposed multimodal integration learning framework is evaluated through
the following three experiments: (1) noise robust speech recognition based on audio-
visual integration learning, (2) robust environment recognition and adaptive behav-
ior generation based on visual-motor integration learning of robot behaviors, and
(3) analysis on a multimodal synchrony model acquired from integration learning of
robot behaviors.
In the first evaluation experiment, the audio-visual speech recognition (AVSR)
approach is adopted for realizing noise robust speech recognition. Specifically, sen-
sory features acquired from audio signals and the corresponding mouth area images
are integrated. In practice, two kinds of DNNs, denoising deep autoencoder (DDA)
and convolutional neural network (CNN), are utilized for the feature extraction of au-
dio and visual information, respectively. Moreover, the multi-stream hidden Markov
model (MSHMM) is applied for integrating the two sensory features acquired from
audio signal and mouth area images. We approach noise robust speech recognition
from the following two directions: one involves utilizing DDA for the noise reduc-
tion of audio features, and the other involves utilizing multimodal information in a
complementary style.
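The multi-stream integration described above admits a minimal sketch (illustrative only; the function name and the numeric values below are assumptions, not taken from this dissertation): in an MSHMM, the state output log-probability is a weighted sum of the per-stream log-likelihoods, with stream weights summing to one.

```python
def mshmm_log_likelihood(log_b_audio, log_b_visual, lambda_audio):
    """Combine per-stream log-likelihoods with stream weights.

    In a multi-stream HMM the state output log-probability is
        log b(o) = lambda_a * log b_a(o_a) + lambda_v * log b_v(o_v),
    with lambda_a + lambda_v = 1. Names and values here are
    illustrative, not the dissertation's implementation.
    """
    lambda_visual = 1.0 - lambda_audio
    return lambda_audio * log_b_audio + lambda_visual * log_b_visual

# With clean audio, the audio stream is weighted heavily; under noise,
# the weight shifts toward the visual stream so it can compensate.
clean = mshmm_log_likelihood(-2.0, -5.0, lambda_audio=0.9)   # -2.3
noisy = mshmm_log_likelihood(-20.0, -5.0, lambda_audio=0.2)  # -8.0
```

Choosing the stream weights adaptively, depending on the estimated noise level, is discussed later in the thesis (Section 3.5.3).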
In the second evaluation experiment, a sensory-motor multimodal integration
framework utilizing DNN is proposed for realizing adaptive generation of robot
behaviors depending on dynamic environmental changes. Specifically, synchrony
models between visual and motor modalities are structured in a self-organizing man-
ner by training a DNN with temporal sequences consisting of camera images and
joint angles acquired from six kinds of object manipulation behaviors utilizing a hu-
manoid robot. The acquired model is applied for cross-modal memory retrieval re-
flecting the synchrony model between visual and motor modalities.
In the third experiment, a synchrony model between the three modalities, vision,
audio, and motion, is acquired by conducting a bell-ringing task using a humanoid
robot. The acquired synchrony model is utilized for retrieving image sequences from
audio and motion sequence inputs. To confirm that the correct synchrony is mod-
eled and the corresponding memory retrieval is attained, quantitative evaluation on
the generated images is conducted. Moreover, correspondences among the struc-
ture of the acquired multimodal feature space, the environmental setting, and the
physical motion are analyzed by visualizing the activation patterns acquired from
the central middle layer of the DNN.
This dissertation is organized into seven chapters. Chapter 1 provides the back-
ground, the research objective, and our approaches as an introduction of the current
study.
In Chapter 2, recent research trends on multimodal integration learning are in-
troduced. First, findings from cognitive psychology are summarized. Second, studies
on AVSR and sensory-motor integration learning of robots are summarized to survey
the practical applications of multimodal integration learning. Third, recent research
trends in deep learning studies are summarized. Finally, the positioning of our pro-
posed model with regard to the recent studies is presented.
In Chapter 3, experiments on AVSR utilizing our proposed learning framework
are conducted for evaluating how sensory features acquired by deep learning and
multimodal integration contribute to robust speech recognition. In practice, a
connectionist-HMM system for AVSR is proposed. In the isolated word speech
recognition evaluation, the audio feature acquired by DDA outperformed a
conventional audio feature under noisy sound settings. Moreover, the visual feature
acquired by CNN outperformed the visual features acquired by conventional dimen-
sionality compression algorithms such as principal component analysis. Finally, we
verified that AVSR utilizing MSHMM can exhibit robust speech recognition even un-
der noisy sound settings.
In Chapter 4, a multimodal integration framework based on a deep learning
algorithm for sensory-motor integration learning of robot behaviors is proposed.
The framework first compresses the sensory inputs acquired from multiple modal-
ities utilizing a deep autoencoder. In combination with a variant of a time-delayed
neural network, a novel deep learning framework that integrates sensory-motor se-
quences and self-organizes higher-level multimodal features is introduced. Further,
we showed that our proposed multimodal integration framework can reconstruct full
temporal sequences from input sequences with partial dimensionality.
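The time-delay mechanism above can be sketched as follows (an illustrative reconstruction with hypothetical names and sizes, not the dissertation's code): consecutive frames of a multimodal feature sequence are stacked into fixed-length windows, so that temporal context enters a feedforward autoencoder through the input dimensionality.

```python
import numpy as np

def time_delay_windows(features, window):
    """Stack `window` consecutive feature frames into single vectors.

    `features` has shape (T, D); the result has shape
    (T - window + 1, window * D). Names and sizes are illustrative.
    """
    T = len(features)
    return np.stack([features[t:t + window].ravel()
                     for t in range(T - window + 1)])

# A 10-step sequence of 4-D multimodal features (e.g., compressed
# vision concatenated with joint angles), windowed over 3 frames.
seq = np.arange(40, dtype=float).reshape(10, 4)
windows = time_delay_windows(seq, window=3)
print(windows.shape)  # (8, 12)
```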
In Chapter 5, our proposed sensory-motor integration framework is applied for
learning and generating object manipulation behaviors of a humanoid robot. In
practice, the framework is trained with six different object manipulation behaviors
generated by direct teaching. Results demonstrate that our proposed method can re-
trieve temporal sequences over visual and motion modalities and predict future se-
quences from the past. Moreover, the memory retrieval function enabled the robot to
adaptively switch corresponding behaviors depending on the displayed objects. Fur-
ther, behavior-dependent unified representations that fuse sensory-motor modal-
ities together are extracted from the temporal sequence feature space. The result
of our behavior recognition experiment demonstrated that the multimodal features
significantly improve the robustness and reliability of the behavior recognition per-
formance.
In Chapter 6, a quantitative evaluation experiment on our proposed sensory-
motor integration framework is conducted to analyze the acquired synchrony model.
In practice, a bell-ringing task performed by the same robot is designed and the
framework is trained utilizing sensory-motor sequences consisting of the three
modalities, vision, audio, and motion. To this end, a model representing the cross-
modal synchrony is self-organized in the abstracted feature space of our proposed
framework. Results demonstrated that the cross-modal memory retrieval function
of our proposed model succeeds in predicting visual sequences in correlation with
the sound and joint angles of bell-ringing behaviors. Further, analyzing the image
retrieval performance, we found that our proposed method correctly models the syn-
chrony among the multimodal information.
In Chapter 7, the accomplishments of our study on multimodal integration learn-
ing are summarized. Finally, reviews on the remaining research topics and future
directions conclude this dissertation.
Acknowledgments
This work was carried out at the Graduate School of Fundamental Science and En-
gineering at Waseda University in 2012–2015. I thank the institute for providing me
with excellent research facilities. Here, I would like to express my sincere thanks and
appreciation to those who were involved in my study and life in the past three years.
Firstly, I would like to gratefully and sincerely thank my principal supervisor Prof.
Tetsuya Ogata for his significantly important comments and suggestions. I have
always been deeply impressed by his expert supervision, brilliant ideas, valuable
advice, and extensive knowledge. His bright guidance and warm leadership fostered
a vibrant and positive research atmosphere that gave me splendid and abundant
experiences in this laboratory.
I also express my deep and sincere appreciation to Prof. Kazuhiro Nakadai, for
his constructive guidance and inspiring suggestions. I am grateful to him for giving
me the opportunity to pursue my interest in Robot Audition under his supervision
and continuous care. Without his generous support, this work would not have been
possible.
I owe my deep gratitude to all the coauthors of my manuscripts, Prof. Hiroshi G.
Okuno, Dr. Hiroaki Arie, and Dr. Yuki Suga for their genuine interest, rapid response
and skillful comments that greatly contributed to my manuscripts.
Many thanks to Prof. Yasuhiro Oikawa and Takashi Kawai, who gave me a lot of
advice on how to complete my thesis and corrections to my dissertation. Their
suggestions provided me with many ideas to improve the quality of this Ph.D. thesis,
which may be useful for my future research as well.
Two people who were absolutely indispensable for completing this thesis were
the laboratory's two secretaries, Mrs. Naomi Nakata and Ms. Junko Inaniwa. Their
outstanding work is essential not only to me but to the whole laboratory. Thanks
also to the other members of the Ogata laboratory, especially the students who have
contributed to this research.
This research was supported in part by the special coordination fund for pro-
moting science and technology from the JST PRESTO “Information Environment and
Humans,” MEXT Grant-in-Aid for Scientific Research on Innovative Areas “Construc-
tive Developmental Science” (24119003), Scientific Research (S) (24220006), and JSPS
Fellows (265114).
A special thanks to my family. Words cannot express the feelings I have for my
parents and my brother for their endless patience, valuable advice, and encourage-
ment. Without your relentless support, this work would not even have been started.
Finally, I would like to express my appreciation to my beloved wife Misuzu, who
spent sleepless nights with me, supported my writing, and encouraged me to strive
towards my goal. Thank you.
Tokyo, June 27, 2015 Kuniaki Noda
Contents
List of figures xi
List of tables xv
1 Introduction 1
1.1 Background and Research Objective . . . . . . . . . . . . . . . . . . . . . 1
1.2 Overview of our Approaches . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Thesis Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2 Literature Review 9
2.1 Intersensory Perceptual Phenomena in Humans . . . . . . . . . . . . . 9
2.1.1 Ventriloquism effect . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.1.2 Synesthesia . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.1.3 Active intermodal mapping . . . . . . . . . . . . . . . . . . . . . . 13
2.1.4 Coherent understanding of environment . . . . . . . . . . . . . . 15
2.2 Multimodal Integration for Robot Systems . . . . . . . . . . . . . . . . . 15
2.2.1 Audio-visual speech recognition . . . . . . . . . . . . . . . . . . . 15
2.2.2 Sensory-motor integration learning for robots . . . . . . . . . . . 19
2.3 Deep Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.3.1 Deep Autoencoder . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.3.2 Convolutional Neural Network . . . . . . . . . . . . . . . . . . . . 22
2.4 Positioning of this Thesis towards Related Work . . . . . . . . . . . . . . 23
3 Audio-Visual Speech Recognition 27
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.2 The Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.3 Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.3.1 Audio Feature Extraction by Deep Denoising Autoencoder . . . 30
3.3.2 Visual Feature Extraction by CNN . . . . . . . . . . . . . . . . . . 32
3.3.3 Audio-Visual Integration by MSHMM . . . . . . . . . . . . . . . . 35
3.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.4.1 ASR Performance Evaluation . . . . . . . . . . . . . . . . . . . . . 36
3.4.2 Visual-Based Phoneme Recognition Performance Evaluation . . 38
3.4.3 Visual Feature Space Analysis . . . . . . . . . . . . . . . . . . . . . 40
3.4.4 VSR Performance Evaluation . . . . . . . . . . . . . . . . . . . . . 43
3.4.5 AVSR Performance Evaluation . . . . . . . . . . . . . . . . . . . . 45
3.5 Discussion and Future Work . . . . . . . . . . . . . . . . . . . . . . . . . 50
3.5.1 Current Need for the Speaker Dependent Visual Feature Extrac-
tion Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
3.5.2 Positioning of our VSR Results with Regards to State of the Art in
Lip Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
3.5.3 Adaptive Stream Weight Selection . . . . . . . . . . . . . . . . . . 51
3.5.4 Relations of our AVSR Approach with DNN-HMM Models . . . . 53
3.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
4 Learning Framework for Multimodal Integration of Robot Behaviors 57
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.2 Multimodal Temporal Sequence Learning using a DNN . . . . . . . . . 58
4.2.1 Sensory Feature Extraction . . . . . . . . . . . . . . . . . . . . . . 58
4.2.2 Multimodal Integration Learning using Time-delay Networks . . 59
4.3 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.3.1 Cross-modal Memory Retrieval . . . . . . . . . . . . . . . . . . . . 60
4.3.2 Temporal Sequence Prediction . . . . . . . . . . . . . . . . . . . . 62
4.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
5 Applications for Recognition and Generation of Robot Behaviors 65
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
5.2 Construction of the Proposed Framework . . . . . . . . . . . . . . . . . . 65
5.3 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
5.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
5.4.1 Cross-modal Memory Retrieval and Temporal Sequence Predic-
tion of Object Manipulation Behaviors . . . . . . . . . . . . . . . 70
5.4.2 Real-time Adaptive Behavior Selection According to Environ-
mental Changes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
5.4.3 Multimodal Feature Space Visualization . . . . . . . . . . . . . . 76
5.4.4 Behavior Recognition using Multimodal Features . . . . . . . . . 77
5.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
5.5.1 How Generalization Capability of Deep Neural Networks Con-
tributes for Robot Behavior Learning . . . . . . . . . . . . . . . . 80
5.5.2 Three Factors that Contribute to Robustness in Behavior Recog-
nition Task . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
5.5.3 Difference between our Proposed Time-delay Autoencoder and
the Original Time-delay Neural Network . . . . . . . . . . . . . . 83
5.5.4 Characteristics of the Internal Representation of the Temporal
Sequence Learning Network . . . . . . . . . . . . . . . . . . . . . 83
5.5.5 Length of Contextual Information that a Time-delay Autoen-
coder Handles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
5.5.6 Scalability of our Proposed Multimodal Integration Learning
Mechanism . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
5.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
6 Analysis on Intersensory Synchrony Model 89
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
6.2 Construction of the Proposed Framework . . . . . . . . . . . . . . . . . . 89
6.3 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
6.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
6.4.1 Image Sequence Retrieval from Sound and Motion Sequences . 94
6.4.2 Quantitative Evaluation of Image Retrieval Performance . . . . . 96
6.4.3 The Correlation between Generated Motion and Retrieved Bell
Images . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
6.4.4 Visualization of Multimodal Feature Space . . . . . . . . . . . . . 99
6.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
6.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
7 Conclusion 103
7.1 Overall Summary of the Current Research . . . . . . . . . . . . . . . . . 103
7.2 Significance of the Current Study as a Work in Intermedia Art and Science 105
A Hessian-Free Optimization 107
A.1 Newton-CG Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
A.2 Computing the Matrix-Vector Product . . . . . . . . . . . . . . . . . . . . 109
B FNN with R-operator 111
B.1 Forward propagation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
B.2 Forward propagation with R-operator . . . . . . . . . . . . . . . . . . . . 111
B.3 Error Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
B.4 Backpropagation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
B.4.1 variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
B.4.2 parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
B.5 Backpropagation with R-operator . . . . . . . . . . . . . . . . . . . . . . 113
B.5.1 variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
B.5.2 parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
C RNN with R-operator 115
C.1 Forward Propagation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
C.2 Forward Propagation with R-operator . . . . . . . . . . . . . . . . . . . . 115
C.3 Error Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
C.4 Backpropagation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
C.4.1 variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
C.4.2 parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
C.5 Backpropagation with R-operator . . . . . . . . . . . . . . . . . . . . . . 118
C.5.1 variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
C.5.2 parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
Bibliography 119
Relevant Publications 133
Other Publications 135
List of Figures
1.1 Thesis organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.1 A ventriloquist . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.2 McGurk effect . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.3 Feeling of shapes corresponding to different tastes (Copyright CAVE
Lab., University of Tsukuba) . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.4 Pictures used to demonstrate the bouba/kiki effect (Originally designed
by psychologist Wolfgang Köhler.) . . . . . . . . . . . . . . . . . . . . . . 13
2.5 Infant imitation (From A. N. Meltzoff and M. K. Moore. Imitation of
facial and manual gestures by human neonates. Science, 198:75–78,
1977. Copyright AAAS) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.6 Vanishing gradient problem . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.7 Convolutional neural network . . . . . . . . . . . . . . . . . . . . . . . . 23
3.1 Audio-visual synchronous data recording environment . . . . . . . . . 29
3.2 Architecture of the proposed AVSR system . . . . . . . . . . . . . . . . . 30
3.3 Word recognition rate evaluation results using audio features depend-
ing on the number of Gaussian mixture components for the output
probability distribution models of HMM . . . . . . . . . . . . . . . . . . 37
3.4 Word recognition rate evaluation results utilizing MFCCs depending on
the number of Gaussian mixture components for the output probabil-
ity distribution models of HMM . . . . . . . . . . . . . . . . . . . . . . . 39
3.5 Phoneme-wise visual-based phoneme recognition rates . . . . . . . . . 41
3.6 Visual-based phoneme-recognition confusion matrix (64×64 pixels im-
age input) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.7 Visual feature distribution for the five representative Japanese vowel
phonemes (64×64 pixels image input) . . . . . . . . . . . . . . . . . . . 44
3.8 Word recognition rates using image features . . . . . . . . . . . . . . . . 45
3.9 Word recognition rate evaluation results (8 components) . . . . . . . . 47
3.10 Word recognition rate evaluation results (16 components) . . . . . . . . 48
3.11 Word recognition rate evaluation results (32 components) . . . . . . . . 49
3.12 Word recognition rate evaluation results (32 components, speaker-
close evaluation) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
3.13 The main targets discussed in Chapter 3 . . . . . . . . . . . . . . . . . . 55
4.1 Examples of cross-modal memory retrieval and sequence prediction . 61
4.2 Buffer shift of the recurrent input . . . . . . . . . . . . . . . . . . . . . . . 62
4.3 Buffer shift of the recurrent input for temporal sequence prediction . . 63
4.4 The main targets discussed in Chapter 4 . . . . . . . . . . . . . . . . . . 64
5.1 Multimodal behavior learning and retrieving mechanism . . . . . . . . 66
5.2 Object manipulation behaviors . . . . . . . . . . . . . . . . . . . . . . . . 68
5.3 Example of motion reconstructions by our proposed model . . . . . . . 71
5.4 Example of image reconstructions by our proposed model . . . . . . . 72
5.5 Temporal sequence prediction errors of six object manipulation be-
haviors; plots are horizontally displaced from the original positions to
avoid overlap of the error bars . . . . . . . . . . . . . . . . . . . . . . . . 75
5.6 Real-time transition of object manipulation behaviors . . . . . . . . . . 76
5.7 Acquired multimodal feature space . . . . . . . . . . . . . . . . . . . . . 77
5.8 Behavior recognition rates depending on the changes in standard devi-
ation σ of the Gaussian noise superimposed on the joint angle sequences 79
5.9 The main targets discussed in Chapter 5 . . . . . . . . . . . . . . . . . . 87
6.1 Multimodal behavior learning and retrieval mechanism . . . . . . . . . 90
6.2 Bell placement configurations of the bell-ringing task . . . . . . . . . . 92
6.3 Example of image retrieval results from the sound and joint angle inputs 95
6.4 Bell image retrieval errors . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
6.5 Bell image retrieval errors at step 60 . . . . . . . . . . . . . . . . . . . . . 98
6.6 Multimodal feature space and the correspondence between the coor-
dinates and modal-dependent characteristics . . . . . . . . . . . . . . . 99
6.7 The main targets discussed in Chapter 6 . . . . . . . . . . . . . . . . . . 102
List of Tables
3.1 Settings for audio feature extraction . . . . . . . . . . . . . . . . . . . . . 31
3.2 39 types of Japanese phonemes . . . . . . . . . . . . . . . . . . . . . . . . 33
3.3 Construction of a convolutional neural network . . . . . . . . . . . . . . 34
3.4 Speaker-wise visual-based phoneme recognition rates and averaged
values [%] depending on the input image sizes . . . . . . . . . . . . . . . 38
5.1 Experimental parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
5.2 Reconstruction errors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
6.1 Experimental parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
Chapter 1
Introduction
1.1 Background and Research Objective
Intelligent machines, such as smartphones, auto-driving cars, and domestic robots,
are expected to become increasingly common in everyday life. Consequently, strong
demands for (1) a noise-robust human–machine interface that enables stress-free
interaction and (2) intelligent technologies for autonomous robots that enable stable
environmental recognition and adaptive behavior generation may arise in the near
future. To achieve these functions, we need to address the following two fundamental
requirements:
• Issue 1: Robust recognition of poorly reproducible real-world information
• Issue 2: Adaptive behavior selection of robots depending on dynamic environ-
mental changes
These requirements indicate that robot systems working in an open-ended, real
world environment need to recognize unexperienced variations in sensory informa-
tion by generalizing their already acquired memory. For example, robots need to
promptly regulate their behavior depending on momentarily changing environmen-
tal situations such as the pose or dynamics of manipulation targets. The key un-
derlying principle in this study is to address these requirements through a machine
learning approach that implements multimodal integration learning.
Humans succeed in recognizing an environment and mastering many tasks by
combining inputs from multiple modalities, including vision, audition, and somatic
sensation. All of these different sources of information are efficiently merged to or-
ganize a coherent and robust percept for stable behavior generation [25, 101]. On
the other hand, the sensory inputs in most robotic applications are commonly pre-
processed through dedicated feature extraction mechanisms such as color region ex-
traction and optic flow. It is also common to design dedicated sensory information
recognition algorithms depending on perceptual objectives such as face detection,
speech recognition, and object detection [73]. Consequently, recognized targets are
represented by predefined symbolic descriptions. As for behavior generation, rule-
based automatic decision-making algorithms, such as finite state machines, are
utilized [73].
In essence, environmental recognition and behavior generation have rarely been
attained by considering mutual intersensory processes among multiple streams of
sensory-motor information. Modality-dependent processing approaches have been
inevitable for robotics because conventional machine learning approaches have
faced scalability issues when handling large-scale raw sensory inputs and motor
command outputs in real-world environments. However, these approaches carry a
fundamental side effect: information filtering by designers may eclipse information
that is essential for robots to control their behavior, and it limits the chances for
robots to develop their own capabilities from the sensory input level. Furthermore,
predefined symbolic representations of recognition targets may prevent a general-
ized comprehension of the surrounding environment. Further, rule-based behavior
control mechanisms possibly restrict the adaptability of robots to novel environ-
mental conditions.
Regarding sensory feature extraction and multimodal integration learning mech-
anisms, deep learning approaches have recently attracted considerable attention
among the machine-learning community [9]. One of the main advantages of ap-
plying deep neural networks (DNNs) is that they can self-organize highly general-
ized sensory features from large-scale raw data. For example, DNNs have success-
fully been applied to unsupervised feature learning for single modalities such as text
[103], images [58, 56], and audio [40]. The same approach has also been applied to
the learning of fused representations over multiple modalities, resulting in signifi-
cant improvements in speech recognition performance [75]. However, discussion on
the application of DNNs to more dynamic information such as speech signals has
just recently begun. Thus, DNNs have never been applied to multimodal integration
learning of robot behaviors.
In the context of the background explained above, the main objective of this study
is to address the two fundamental requirements presented above by applying deep
learning for sensory feature extraction and multimodal integration learning.
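As a toy illustration of this feature self-organization (a single denoising-autoencoder layer trained with plain gradient descent on synthetic data; every size, rate, and name below is an assumption, not this dissertation's setup), a network trained to reconstruct clean signals from noisy inputs learns a compressed feature that discards much of the noise:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: clean 8-D signals lying on a 2-D subspace, plus noisy copies.
basis = rng.normal(size=(2, 8)) / np.sqrt(8)
codes = rng.normal(size=(500, 2))
clean = codes @ basis
noisy = clean + 0.3 * rng.normal(size=clean.shape)

# One denoising-autoencoder layer (a deep stack repeats this layer-
# wise): encode the NOISY input, reconstruct the CLEAN target.
W1 = 0.1 * rng.normal(size=(8, 4)); b1 = np.zeros(4)
W2 = 0.1 * rng.normal(size=(4, 8)); b2 = np.zeros(8)

def forward(x):
    h = np.tanh(x @ W1 + b1)   # compressed sensory feature
    return h, h @ W2 + b2      # linear reconstruction

lr = 0.05
for _ in range(3000):
    h, out = forward(noisy)
    err = (out - clean) / len(noisy)          # mean-squared-error gradient
    gW2, gb2 = h.T @ err, err.sum(axis=0)
    dh = (err @ W2.T) * (1.0 - h ** 2)        # backprop through tanh
    gW1, gb1 = noisy.T @ dh, dh.sum(axis=0)
    W1 -= lr * gW1; b1 -= lr * gb1
    W2 -= lr * gW2; b2 -= lr * gb2

_, recon = forward(noisy)
mse = np.mean((recon - clean) ** 2)  # compare with the 0.09 input noise power
```

Because the clean signal occupies only a low-dimensional subspace, the bottleneck feature retains the signal while much of the isotropic noise is projected away, which is the intuition behind using such features for noise-robust recognition.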
1.2 Overview of our Approaches
We address the research objectives explained in the previous section by utilizing the
multiple functionalities of deep learning. Our approaches and the corresponding
technical solutions realized by deep learning are summarized as follows.
• Approach 1: Utilization of highly generalized sensory features
• Solution 1: Self-organization of abstracted features from large amounts of
training data
• Approach 2: Fusional utilization of multimodal information
• Solution 2: Multimodal integration learning
• Approach 3: Memory prediction and association among multiple modalities
• Solution 3: Cross-modal memory retrieval
To address Requirement 1, we employ two approaches: utilization of highly gen-
eralized sensory features and fusional utilization of multimodal information. In
practice, noise-robust recognition is attained by utilizing highly generalized sensory
features self-organized by deep learning. Moreover, the same objective is attained by
utilizing integrated features acquired by an integration learning of multimodal infor-
mation. The integrated representation contributes towards the fusional utilization
of the multimodal information: even if the reliability of one modality degrades,
information from the other modalities can compensate to restore the corresponding
internal representation. We address the robust recognition of poorly reproducible
real-world information by these two approaches.
To address Requirement 2, we employ another approach: memory prediction and
association among multiple modalities. In practice, adaptive behavior selection of
robots depending on dynamic environmental changes is attained by a cross-modal
memory retrieval function of deep learning based on the acquired multimodal inte-
gration representation (intersensory synchrony model).
Our proposed multimodal integration learning mechanism is evaluated through
the following three experiments in a step-by-step manner.
• Evaluation 1: Noise robust speech recognition based on audio-visual integra-
tion learning
• Evaluation 2: Robust environment recognition and adaptive behavior genera-
tion based on visual-motor integration learning of robot behaviors
• Evaluation 3: Analysis on a multimodal synchrony model acquired from inte-
gration learning of robot behaviors
In the first evaluation experiment, the topics from Approaches 1 and 2 are inves-
tigated. In practice, the audio-visual speech recognition (AVSR) approach is adopted
to integrate audio and visual information for realizing robust speech recognition in
noisy environments. Specifically, sensory features acquired from audio signals and
the corresponding mouth area images are integrated to attain AVSR. In the current
experiment, two kinds of DNNs, denoising deep autoencoder (DDA) and convo-
lutional neural network (CNN), are utilized for feature extraction of audio and vi-
sual information, respectively. Moreover, the multi-stream hidden Markov model
(MSHMM) is applied for integration learning of the two sensory features acquired
from the audio signal and mouth area images, respectively. Hence, we approach
noise-robust speech recognition from two directions: one involves utilizing the
DDA for noise reduction of audio features, and the other involves utilizing the CNN
and MSHMM for fusional utilization of the multimodal information.
In the second evaluation experiment, the topics from Approaches 1, 2, and 3 are
investigated. This experiment focuses on the behavior generation function of robots
rather than the recognition function, which is the main focus of the first experiment.
In practice, a sensory-motor multimodal integration learning framework utilizing
DNN is proposed for realizing adaptive generation of robot behaviors depending on
dynamic environmental changes. Specifically, synchrony models between visual and
motor modalities are structured in a self-organizing manner by training a DNN with
temporal sequences consisting of camera images and joint angles acquired from six
types of object manipulation behaviors utilizing a humanoid robot. The acquired
model is applied for cross-modal memory retrieval reflecting the synchrony model
between visual and motor modalities.
In the third experiment, quantitative evaluation and analysis are conducted on the
acquired synchrony model. In practice, a bell-ringing task is conducted by a hu-
manoid robot for acquiring a synchrony model among the following three modal-
ities: vision, audio, and motion. The acquired synchrony model is utilized for re-
trieving image sequences from audio and motion sequence inputs. To confirm that
the correct synchrony is modeled and the corresponding memory retrieval is at-
tained, the generated images are quantitatively evaluated. Moreover, correspon-
dences among the structure of the acquired multimodal feature space, the environ-
mental setting, and the physical motion are analyzed by visualizing the activation
patterns acquired from the central middle layer of the DNN utilized for the mul-
timodal integration learning. By analyzing the structure of the multimodal feature
space, the mechanism to represent the synchrony model in the DNN is revealed.
1.3 Thesis Organization
The remainder of this dissertation is organized as shown in Figure 1.1. In Chapter
2, recent research trends on multimodal integration learning are introduced. First,
findings from cognitive psychology studies are summarized. Second, studies on
AVSR and sensory-motor integration learning of robots are summarized to survey
preceding practical applications of multimodal integration learning. Third, recent
research trends in deep learning, which is the technical foundation of our proposed
multimodal integration mechanism, are summarized. Finally, the positioning of our
proposed model with regard to the recent studies is presented.
In Chapter 3, experiments on AVSR utilizing our proposed learning framework
are conducted for evaluating how sensory features acquired by deep learning and
multimodal integration contribute to robust speech recognition. In practice, a
Figure 1.1: Thesis organization
connectionist-hidden Markov model (HMM) system for noise-robust AVSR is pro-
posed. First, a DDA is utilized for acquiring noise-robust audio features. By preparing
the training data for the network with pairs of consecutive multiple steps of deteri-
orated audio features and the corresponding clean features, the network is trained
to output denoised audio features from the corresponding features deteriorated by
noise. Second, a CNN is utilized to extract visual features from raw mouth area im-
ages. By preparing the training data for the CNN as pairs of raw images and the corre-
sponding phoneme label outputs, the network is trained to predict phoneme labels
from the corresponding mouth area input images. Finally, an MSHMM is applied to
integrate the acquired audio and visual HMMs, which are independently trained with
the respective features. In the isolated word speech recognition evaluation, the
audio features acquired by the DDA outperformed conventional audio features
under noisy sound settings. Moreover, the visual feature acquired by CNN outper-
formed visual features acquired by conventional dimensionality compression algo-
rithms such as principal component analysis (PCA). Finally, we verified that AVSR
utilizing MSHMM can exhibit robust speech recognition capability even under noisy
sound settings compared to cases in which only a single modality is utilized.
In Chapter 4, a multimodal integration learning framework for sensory-motor
integration learning of robot behaviors is presented. As a practical computational
model, a multimodal temporal sequence learning framework based on a deep learn-
ing algorithm [41] is constructed. The proposed model first compresses the dimen-
sionality of the sensory inputs acquired from multiple modalities utilizing a deep
autoencoder [41, 65]. In combination with a variant of a time-delayed neural net-
work [55] learning approach, we then introduce a novel deep learning method that
integrates sensory-motor sequences and self-organizes higher-level multimodal fea-
tures. Further, we show that our proposed temporal sequence learning framework
can internally generate temporal sequences by partially masking the input data from
outside the network and recursively feeding back the previous outputs to the masked
input nodes; this is made possible by utilizing the characteristics of an autoencoder
that models identity mappings between inputs and outputs.
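As an illustrative sketch of this masking-and-feedback mechanism (not the actual trained network of this thesis), the closed loop can be written as follows; the identity-map "autoencoder," the modality dimensionalities, and the function names are placeholders:

```python
import numpy as np

# Toy stand-in for a trained deep autoencoder over a fused
# [vision | motor] vector. Here it is simply an identity map, so the
# feedback mechanics can be shown without real training; a trained
# network would reconstruct/denoise the fused vector instead.
def autoencoder(x):
    return x

def closed_loop_retrieval(vision_seq, motor_dim, steps):
    """Retrieve the masked motor stream from the observed vision stream.

    At every step the vision slots are clamped to the observation,
    while the masked motor slots are filled with the previous output
    (recursive feedback), as in self-generation with an autoencoder.
    """
    motor = np.zeros(motor_dim)           # initial guess for masked slots
    retrieved = []
    for t in range(steps):
        x = np.concatenate([vision_seq[t], motor])  # clamp + feedback
        y = autoencoder(x)
        motor = y[len(vision_seq[t]):]    # feed back reconstructed motor part
        retrieved.append(motor.copy())
    return np.array(retrieved)

vision = np.random.rand(5, 8)             # 5 steps of 8-dim visual features
motor_hat = closed_loop_retrieval(vision, motor_dim=4, steps=5)
print(motor_hat.shape)                    # (5, 4)
```

The same loop works with any subset of modalities masked, which is what allows cross-modal retrieval in either direction.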
In Chapter 5, our proposed sensory-motor integration learning framework is
applied for learning and generating object manipulation behaviors of a humanoid
robot. In practice, the framework is trained with six different object manipulation be-
haviors generated by direct teaching. Results demonstrate that our proposed model
can retrieve temporal sequences over visual and motion modalities and predict fu-
ture sequences from the past. Moreover, the memory retrieval function enabled
the robot to adaptively switch corresponding behaviors depending on the displayed
objects. Further, behavior-dependent unified representations that fuse sensory-
motor modalities together are extracted in the temporal sequence feature space.
Our behavior recognition experiment, which utilizes the integrated features acquired
from the multimodal temporal sequence learning mechanism, demonstrates that
incorporating joint angle information into the multimodal features significantly
improves the robustness and reliability of behavior recognition performance.
In Chapter 6, a quantitative evaluation experiment on the sensory-motor inte-
gration learning framework is conducted by analyzing the “synchrony model.” In
practice, the experimental setting of the multimodal integration learning is extended
by incorporating sound signals in addition to the image and joint angles. A bell-
ringing task performed by the same robot is designed and the proposed model is
trained utilizing sensory-motor sequences consisting of the three modalities, vision,
audio, and motion. As a result, a model representing the cross-modal synchrony is
self-organized in the acquired abstracted feature space. Results demonstrate that the
cross-modal memory retrieval function of the proposed model succeeds in predict-
ing visual sequences in correlation with the sound and joint angles of bell-ringing
behaviors. Further, by analyzing the image retrieval performance, we found that our
proposed method correctly models the synchrony among the multimodal informa-
tion.
In Chapter 7, the accomplishments of our study on multimodal integration learn-
ing are summarized. Finally, reviews on the remaining research topics and future
directions conclude this dissertation.
Chapter 2
Literature Review
2.1 Intersensory Perceptual Phenomena in Humans
Humans perceive the external environment, including their own body, by integrat-
ing multiple channels of sensory inputs acquired from different modalities, such as
vision, audition, and proprioception. Input from one sensory system can influence per-
ception in another, and information transferred across modalities can substitute
for missing input in one of them. This multisensory interaction
can be observed in many human perceptual phenomena such as the ventriloquism
effect, synesthesia, and active intermodal mapping.
2.1.1 Ventriloquism effect
A ventriloquist is an entertainer who “throws his voice” by minimizing his own move-
ments so that the only visual cues the audience can associate with the speech come
from his dummy (Figure 2.1). As a result, audiences tend to feel that the voice is com-
ing from the dummy even if they clearly know which one is the dummy and which
one is not. This trick says more about the audience than the performer, because the
performance owes less to the ventriloquist’s skill than to the dominance of the
visual-auditory intersensory biases of the audience. In psychology, the term “ventrilo-
quism effect” [43] refers to the broad phenomenon of intersensory bias, in which
information from one sensory modality can influence the judgments of another.
Figure 2.1: A ventriloquist
For example, vision can influence judgments about proprioception and audition,
proprioception can bias auditory judgments, and so on [37, 81, 105, 95, 110, 109].
The magnitude of intersensory bias and the dominant modality depend on how com-
pelling and real each individual cue is [111]. In general, the visual modality is known
to predominate in intersensory influences.
One example of the general synergy between the visual and auditory system is
represented in the perception of speech. Even though it is difficult to recognize
someone’s speech in a room under significant background noise, seeing the speaker’s
face will make it easier to understand what is being said. In fact, a neuromagnetic
study indicates that the sight of lip movement actually modifies the activity in the
auditory cortex [92]. It is also known that visual cues enhance the processing of au-
ditory inputs, at a level functionally equivalent to altering the signal-to-noise ratio
(SNR) of the auditory stimulus by 15–20 dB [102]. On the other hand, nonmatching
visual and auditory cues in speech are also known to produce interesting auditory-
Figure 2.2: McGurk effect (audio “ba” paired with visual “ga” is perceived as “da”)
visual illusions, which are discussed in an article entitled “Hearing lips and seeing
voices” [69]. This illusion, commonly referred to as the “McGurk effect,” occurs when
one hears “ba-ba” but sees the mouth form “ga-ga” and perceives the sound “da-da”
(Figure 2.2).
2.1.2 Synesthesia
Synesthesia is another example of intersensory phenomena in humans. This syn-
drome literally means “joining the senses,” and is explained as a condition in which
stimulation of one sensory modality involuntarily elicits a sensation or experience in an-
other modality [21]. For example, sonogenic synesthesia, in which music provokes
intense visual experiences or cutaneous paresthesias, has been a well-known case for
over 100 years [20, 38]. Another example is a synesthete for whom a particular taste al-
ways induces the sensation of a particular geometric shape in his/her left hand (Figure 2.3)
[64]. This syndrome has recently been attracting attention among neurologists and
developmental psychologists, and has become an indispensable topic when multi-
sensory integration is being discussed.
Other research suggests that we all have some capacity for experiencing synes-
thesia. For example, consider two drawings: one looks like an inkblot and the other, a
jagged piece of shattered glass (Figure 2.4). When people are asked “Which of these is
‘bouba,’ and which is ‘kiki’?,” 98 percent of people respond that the inkblot is bouba
Figure 2.3: Feeling of shapes corresponding to different tastes (Copyright CAVE Lab.,University of Tsukuba)
and the other one is kiki [84]. Ramachandran et al. explained this phenomenon
as follows, “the gentle curves of the amoeba-like figure metaphorically mimics the
gentle undulations of the sound ‘bouba’ as represented in the hearing centers in the
brain as well as the gradual inflection of the lips as they produce the curved ‘boo-baa’
sound. In contrast, the waveform of the sound ‘kiki’ and the sharp inflection of the
tongue on the palate mimic the sudden changes in the jagged visual shape.” The au-
thors argue that the brain’s ability to pick out an abstract feature in common items—
such as a jagged visual shape and a harsh-sounding name—could have paved the
way for the development of metaphors and perhaps even a shared vocabulary [84].
Synesthetic experiences are commonly explained as a phenomenon that reflects
a fusion of sensory experiences via association phenomena, in which independent
groups of neurons are activated in close temporal proximity to one another via long
chains of synaptic connections [101]. Their concurrent activity can produce a per-
ceptual synthesis after repeated pairings like a conditioned experience [63, 64]. On
the other hand, synesthetic experiences are also explained as a sort of sensory mixing
Figure 2.4: Pictures used to demonstrate the bouba/kiki effect (Originally designedby psychologist Wolfgang Köhler.)
that is predicted from a survey of brain areas in which different modalities converge
on the same neurons. It is not surprising to find that one dominant input evokes
secondary sensations in other modalities via such multisensory neurons. However,
there is still no shared understanding of these experiences among researchers. Al-
though there is not yet an accepted theoretical explanation, these phenomena should re-
flect some aspect of humans’ multisensory perception abilities. Moreover, whether
due to association or the activation of multisensory neurons, synesthesia reflects the
rich multisensory perceptual experiences that appear to be quite common in some
individuals.
2.1.3 Active intermodal mapping
Meltzoff et al. published a paper in 1977 showing that infants between 12 and 21 days
of age can imitate both facial and manual gestures (Figure 2.5) [71]. They claimed
that the result implies that human neonates can equate their own unseen behaviors
with gestures they see others perform. This experiment was ground-breaking be-
cause it showed that infants can imitate adults at a much earlier age than previously
believed. For example, Piaget claimed that facial imitation does not take place until
1 year of age or more [80]. Moreover, this experiment also showed evidence for early
facial imitation, which had been thought to be impossible at this age because it re-
quires cross-modal and mutual understanding of perception. According to the stan-
Figure 2.5: Infant imitation (From A. N. Meltzoff and M. K. Moore. Imitation of facialand manual gestures by human neonates. Science, 198:75–78, 1977. Copyright AAAS)
dard developmental theory, facial imitation ought to be more difficult than manual
or vocal imitation, because infants have no direct way to compare their own actions
with those of adults. (Infants can see others’ faces, but not their own. They can feel
their own facial movements, but not those of others.) Facial imitation is thought to
represent the infant’s matching of what it sees as some equivalent of the propriocep-
tive signals that it feels when trying to mimic, a process referred to by Meltzoff as
“active intermodal mapping.”
The initial report by Meltzoff et al. met with strong criticism [3, 5], but
the objections were directed not at the fact that infants exhibit intersensory integration
but at the claim that infants’ facial imitation appears very early in life. In fact, the
same claim regarding infants’ intersensory integration had been made previously [6, 14]
and even shown for the imitation of facial gestures [31]. Moreover, follow-up studies
moved the appearance date from weeks after birth to minutes after birth and also
showed an innate capability for detecting at least some forms of cross-modal equiv-
alence. Although some investigators had difficulty in demonstrating some effects in
young infants [88, 97], a number of replicated observations have now been reported
[70].
2.1.4 Coherent understanding of environment
Cognitive science research has revealed that combining sensory information contributes
to enhancing perceptual clarity and reducing ambiguity about the sensory environ-
ment [25, 101]. For example, a simultaneous tone can improve detection of a dimly
flashed light [29, 104], enhance the discriminability of briefly flashed visual patterns
[108], or increase the perceived luminance of light [99]. Moreover, neuroscience re-
search demonstrated that cross-connections between early sensory areas facilitate
processing in one sense by input from another [26], and that the superior colliculus
mediates cross-modal improvements in simple attentive-orientation tasks [100, 101].
In addition, action-effect synchrony perception is known to have a close relationship
with the sense of agency [30], and thus cross-modal grouping plays an important role
in sensation [48].
2.2 Multimodal Integration for Robot Systems
Multimodal integration contributes to forming constant, coherent, and robust per-
ceptions by reducing ambiguities regarding the sensory environment. Hence, we
believe that replicating human multimodal integration learning as a computational
model is essential towards realizing sophisticated cognitive functions of robot intel-
ligence, as well as towards fundamentally understanding human intelligence. In this
section, we briefly review previous practical applications for multimodal integration
learning from an engineering perspective.
2.2.1 Audio-visual speech recognition
AVSR is one of the most representative applications of multimodal inte-
gration learning, putting it into practical use for speech recognition. The fun-
damental idea of AVSR is to use visual information derived from a speaker’s lip mo-
tion to complement corrupted audio speech inputs. In this subsection, we review
recent approaches for the elemental technologies of AVSR from the following three
perspectives: audio feature extraction, image feature extraction, and audio-visual in-
tegration.
Audio feature extraction
The use of mel-frequency cepstral coefficients (MFCCs) has been a de facto stan-
dard for automatic speech recognition (ASR) for decades. However, advances in
deep learning research have led to recent breakthroughs in unsupervised audio fea-
ture extraction methods and exceptional recognition performance improvements
[27, 40, 62]. Advances in novel machine learning algorithms, improved availability
of computational resources, and the development of large databases have led to self-
organization of robust audio features by efficient training of large-scale DNNs with
large-scale datasets.
One of the most successful applications of DNNs to ASR is the deep neural net-
work hidden Markov model (DNN-HMM) [22, 72], which replaces the conventional
Gaussian mixture model (GMM) with a DNN to represent direct projection between
HMM states and corresponding acoustic feature inputs. The idea of utilizing a neural
network to replace a GMM and construct a hybrid model that combines a multilayer
perceptron and HMMs was originally proposed decades ago [85, 13]. However, owing
to limited computational resources, large and deep models could not be explored at
the time, and the resulting hybrid systems could not outperform GMM-HMM
systems.
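The hybrid decoding step can be illustrated as follows: the DNN’s state posteriors are converted into scaled likelihoods by dividing by the state priors before being used in HMM decoding. The numbers below are made up for illustration:

```python
import numpy as np

# Hypothetical DNN outputs: posterior p(state | acoustic frame)
# for 3 HMM states, and the state priors p(state) estimated from
# alignment counts over the training data.
posteriors = np.array([0.7, 0.2, 0.1])
priors = np.array([0.5, 0.3, 0.2])

# Scaled likelihoods p(x | state) proportional to p(state | x) / p(state);
# in a hybrid DNN-HMM decoder these replace the GMM emission likelihoods.
log_scaled_likelihood = np.log(posteriors) - np.log(priors)
print(np.argmax(log_scaled_likelihood))  # most likely emitting state: 0
```

Dividing by the priors matters because the HMM decoder expects likelihoods, not posteriors; frequent states would otherwise be systematically favored.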
Other major approaches for application of DNNs to ASR involve using a deep au-
toencoder as a feature extraction mechanism. For example, Sainath et al. utilized a
deep autoencoder as a dimensionality compression mechanism for self-organizing
higher-level features from raw sensory inputs and utilized the acquired higher-level
features as inputs to a conventional GMM-HMM system [90]. Another example is the
deep denoising autoencoder proposed by Vincent et al. [106, 107]. This model differs
from the former model in that the outputs of the deep autoencoder are utilized as a
sensory feature rather than the compressed vectors acquired from the middle layer
of the network. The key idea of the denoising model is to make the learned represen-
tations robust to partial destruction of the input by training a deep autoencoder to
reconstruct clean repaired inputs from corrupted, partially destroyed inputs.
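A minimal sketch of this denoising training scheme, in plain NumPy with illustrative sizes, noise level, and learning rate (not the architecture used in the cited work):

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny single-hidden-layer denoising autoencoder: the network sees
# corrupted inputs but is trained to reconstruct the CLEAN targets.
d_in, d_hid = 8, 4
W1 = rng.normal(0, 0.1, (d_in, d_hid)); b1 = np.zeros(d_hid)
W2 = rng.normal(0, 0.1, (d_hid, d_in)); b2 = np.zeros(d_in)

def forward(x):
    h = np.tanh(x @ W1 + b1)
    return h, h @ W2 + b2

# "Clean" features lying on a low-dimensional subspace
clean = rng.random((256, 3)) @ rng.random((3, d_in))

lr = 0.1
for _ in range(2000):
    noisy = clean + rng.normal(0, 0.5, clean.shape)   # corrupt the input
    h, recon = forward(noisy)
    err = (recon - clean) / len(clean)                # clean reconstruction target
    gh = err @ W2.T * (1 - h ** 2)                    # backprop through tanh
    W2 -= lr * h.T @ err;    b2 -= lr * err.sum(0)
    W1 -= lr * noisy.T @ gh; b1 -= lr * gh.sum(0)

noisy = clean + rng.normal(0, 0.5, clean.shape)
mse_out = np.mean((forward(noisy)[1] - clean) ** 2)
mse_in = np.mean((noisy - clean) ** 2)
print(mse_out < mse_in)  # denoised output is closer to the clean features
```

The essential point is the asymmetry between input and target: corruption is applied only to the input, which forces the learned representation to be robust to partial destruction.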
Visual feature extraction
Incorporation of speakers’ lip movements as visual information for ASR systems is
known to contribute to robustness and accuracy, especially in environments where
audio information is corrupted by noise. In previous studies, several different
approaches have been proposed for extracting visual features from input images
[67, 54]. These approaches can be broadly classified into two representative cate-
gories.
The first is a top-down approach, where an a priori lip-shape representation
framework is embedded in a model; for example, active shape models (ASMs) [61]
and active appearance models (AAMs) [19]. ASMs and AAMs extract higher-level,
model-based features derived from the shape and appearance of mouth area images.
Model-based features are suitable for explicitly analyzing internal representations;
however, some elaboration of lip-shape models and precise hand-labeled training
data are required to construct a statistical model that represents valid lip shapes.
The second is a bottom-up approach. Various methods can be used to directly es-
timate visual features from the image; for example, dimensionality compression al-
gorithms, such as discrete cosine transform [68, 94], PCA [4, 68], and discrete wavelet
transform [68]. These algorithms are commonly utilized to extract lower-level image-
based features, which are advantageous because they do not require dedicated lip-
shape models or hand-labeled data for training; however, they are vulnerable to
changes in lighting conditions, translation, or rotation of input images. In this study,
we adopt the bottom-up approach by introducing a CNN as a visual feature extrac-
tion mechanism, because it is possible for CNNs to overcome the weaknesses of con-
ventional image-based feature extraction mechanisms. The acquired visual features
are also processed with a GMM-HMM system.
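The translation tolerance that motivates this choice comes from the CNN’s shared convolution filters combined with pooling. A toy NumPy sketch (the 8x8 “mouth image” and the edge filter are invented placeholders):

```python
import numpy as np

# A shared convolution filter followed by ReLU and max pooling: the
# ingredients that make image-based CNN features more tolerant to
# translation than PCA-style global projections.
def conv2d_valid(img, kernel):
    kh, kw = kernel.shape
    out = np.zeros((img.shape[0] - kh + 1, img.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool(fmap, size=2):
    h, w = fmap.shape[0] // size, fmap.shape[1] // size
    return fmap[:h * size, :w * size].reshape(h, size, w, size).max(axis=(1, 3))

img = np.zeros((8, 8)); img[3, 2:6] = 1.0         # a horizontal "lip" edge
kernel = np.array([[1.0, 1.0], [-1.0, -1.0]])     # horizontal-edge detector
features = max_pool(np.maximum(conv2d_valid(img, kernel), 0.0))

shifted = np.roll(img, 1, axis=1)                 # translate the image 1 px
features_shifted = max_pool(np.maximum(conv2d_valid(shifted, kernel), 0.0))
print(features.max() == features_shifted.max())   # strongest response unchanged
```

Because the same filter is applied at every position and pooling keeps only local maxima, the dominant response survives small translations that would change every coefficient of a PCA projection.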
Several approaches for application of CNNs to speech recognition studies have
been proposed. Abdel-Hamid et al. [1, 2] applied their original functionally extended
CNNs for sound spectrogram inputs and demonstrated that their CNN architecture
outperformed earlier basic forms of fully connected DNNs on phone recognition
and large vocabulary speech recognition tasks. Palaz et al. [78] applied a CNN for
phoneme sequence recognition by estimating phoneme class conditional probabil-
ities from raw speech signal inputs. This approach yielded comparable or better
phoneme recognition performance relative to conventional approaches. Lee et al.
[60] applied a convolutional deep belief network (DBN) for various audio classifica-
tion tasks, such as speaker identification, gender classification, and phone classifica-
tion, that showed better performance as compared with conventional hand-crafted
audio features. Thus, CNNs have been attracting considerable attention in speech
recognition studies. However, applications of CNNs have been limited to audio sig-
nal processing, while their application to lip-reading remains unaddressed.
Audio-visual integration
Multimodal recognition can improve performance as compared with unimodal
recognition by utilizing complementary sources of information [15, 36, 86]. Multi-
modal integration is commonly achieved by two different approaches. First, in the
feature fusion approach, feature vectors from multiple modalities are concatenated
and transformed to acquire a multimodal feature vector. For example, Ngiam et al.
[75] utilized a DNN to extract fused representations directly from multimodal signal
inputs by compressing the input dimensionality. Huang et al. [44] utilized a DBN
for audio-visual speech recognition tasks by combining mid-level features learned
by single modality DBNs. However, these approaches have difficulty in explicitly
and adaptively selecting the respective information gains depending on the dynamic
changes in the reliability of multimodal information sources. Alternatively, in the
decision fusion approach, outputs of unimodal classifiers are merged to determine
a final classification. Unlike the previous method, decision fusion techniques can
improve robustness by incorporating stream reliabilities associated with multiple in-
formation sources as a criterion of information gain for a recognition model. For
example, Gurban et al. [35] succeeded in dynamic stream weight adaptation based
on modality confidence estimators in the MSHMM for their AVSR problem.
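The decision fusion idea can be sketched as a stream-weighted combination of per-word log-likelihoods; the scores and weights below are invented for illustration:

```python
import numpy as np

# Per-word log-likelihoods from independently trained audio and
# visual HMMs (made-up numbers for three candidate words).
log_audio = np.array([-10.0, -12.0, -30.0])
log_visual = np.array([-20.0, -15.0, -16.0])

def mshmm_decode(lam):
    """Combine streams with weight lam on audio, 1 - lam on visual."""
    return np.argmax(lam * log_audio + (1.0 - lam) * log_visual)

print(mshmm_decode(0.9))  # clean audio: trust the audio stream -> word 0
print(mshmm_decode(0.1))  # noisy audio: shift weight to vision -> word 1
```

Adapting the weight to an estimate of each stream's reliability, as in the cited work, is what lets the recognizer degrade gracefully when one modality is corrupted.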
2.2.2 Sensory-motor integration learning for robots
Multimodal integration has long been a challenging problem in robotics [16, 18]. Al-
though there is relevant research reported in the literature [46, 82, 93], several is-
sues still remain unsolved. First, multimodal sensory-motor integration has typi-
cally been applied only to a singular problem, such as self-organizing one’s spatial
representation [46, 82]; further functions have not been intensively studied, includ-
ing such functions as the cross-modal complementation of information deficiencies
or the application of cross-modal memory retrieval for behavior generation prob-
lems. Second, discussion in the literature regarding how multimodal information
should be fused together to realize stable environmental recognition has not reached
a comprehensive consensus. Thus, a prevailing multimodal information integra-
tion framework has not been available. Subsequently, in robotics, sensory inputs
acquired from different sources are still typically processed with dedicated feature-
extraction mechanisms [73]. Third, multimodal synchrony modeling as a means of
implementing sensory-motor prediction for robotic applications has not been ade-
quately investigated. Several preceding studies have proposed computational mod-
els developmentally acquiring action-effect synchrony for understanding interaction
rules [52, 77]; however, most causal models have been represented using a limited
number of modalities, often focusing on vision and motion only.
A scalable learning framework that enables multimodal integration learning by
handling large amounts of sensory-motor data with high dimensionality has not yet
been realized. In line with the growing demand for perceptual precision with regard
to the surrounding environment, recent robots are equipped with state-of-the-art
sensory devices, such as high-resolution image sensors, range sensors, multichan-
nel microphones, and so on [32, 47, 91]. As a result, remarkable improvements have
been achieved in the quantity of available sensory information; however, because
of the scalability limitations of conventional machine learning algorithms, ground-
breaking computational models achieving robust behavior control and environmen-
tal recognition by fusing multimodal sensory inputs into a single representation have
not yet been proposed.
2.3 Deep Learning
Regarding computational models addressing large-scale data processing with signif-
icant dimensionality [8], deep learning approaches have recently attracted consider-
able attention in the machine-learning community [9]. For example, DNNs have suc-
cessfully been applied to unsupervised feature learning for single modalities, such
as text [103], images [56], or audio [40]. In such studies, various information sig-
nals, even with high-dimensional representations, were effectively compressed in a
restorable form. Further, brilliant achievements in deep learning technologies have
already succeeded in making advanced applications available to the public. For ex-
ample, competition results from the ImageNet Large Scale Visual Recognition Chal-
lenge [50] have led to significant improvements in web image search engines [89]. As
another example, unsupervised feature-extraction functions of deep learning tech-
nologies have greatly increased the sophistication of a voice recognition engine used
for a virtual assistant service [42]. The same approach has also been applied to the
learning of fused representations over multiple modalities, resulting in significant
improvements in speech recognition performance [75]. Yet another study on multi-
modal integration learning has succeeded in cross-modal memory retrieval by com-
plementing missing modalities [98]. Most current studies on multimodal integration
learning utilize deep networks; however, much work focuses on extracting correla-
tions between static modalities, such as images and text [50]. Thus, few studies have
investigated methods not only for multimodal sensor fusion, but also for dynamic
sensory-motor coordination problems [24] of robot behavior.
The back-propagation algorithm has long been the dominant approach for training
neural networks with multiple non-linear layers. However, the “vanishing gradients
problem” (Figure 2.6), in which the derivative terms can exponentially decay to zero
or explode during deep back-propagation [10], prevented this technique from scaling
to networks with a very large number of hidden layers.
Due to its scalability limitation, the neural network has been regarded as an out-
moded machine learning approach for decades. However, the following three factors
have recently led to a major breakthrough in the application of DNNs to the problems
[Figure: a network layer computing Y = f(WX + b), with the gradient terms (∂E/∂W, ∂E/∂b) propagated backwards through the layers]
Figure 2.6: Vanishing gradient problem
of image classification and speech recognition. First, popularization of low-cost,
high-performance computational environments, i.e., high-end consumer personal
computers equipped with general-purpose graphics processing units (GPGPUs), has
allowed a wider range of users to conduct brute force numerical computations with
large datasets. Second, improved public access to large databases has enabled unsu-
pervised learning mechanisms to self-organize highly generalized features that can
outperform conventional handcrafted features. Third, the development of powerful
machine learning techniques, e.g., improved optimization algorithms, has enabled
large-scale neural network models to be efficiently trained with large datasets, which
has made it possible for deep neural networks to generate highly generalized fea-
tures.
In the following subsection, we introduce representative deep learning architec-
tures that have contributed to the recent development of deep learning studies.
2.3.1 Deep Autoencoder
The deep autoencoder is a variant of a DNN commonly utilized for dimensionality
compression and feature extraction [75, 41]. DNNs are artificial neural network mod-
els with multiple layers of hidden units between inputs and outputs. A multi-layered
artificial neural network is referred to as an autoencoder, particularly when the net-
work structure has a bottleneck shape (the number of nodes for the central hidden
layer becomes smaller than that for the input (encoder) and output (decoder) layers),
and the network is trained to model the identity mappings between inputs and out-
puts. Regarding dimensionality compression mechanisms, a simple and commonly
utilized approach is PCA. However, Hinton et al. demonstrated that the deep autoen-
coder outperformed PCA in image reconstruction and compressed feature acquisi-
tion [41].
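As a structural illustration of the bottleneck mapping, the forward pass can be sketched with randomly initialized (untrained) weights; the layer sizes below are hypothetical, and, following the convention used later in this thesis, the central layer is linear while the others are logistic:

```python
import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(x, weights, biases, center):
    """Forward pass through a bottleneck autoencoder; the activation of the
    central (lowest-dimensional) layer is the compressed feature."""
    h = x
    activations = []
    for k, (W, b) in enumerate(zip(weights, biases)):
        z = h @ W + b
        h = z if k == center else logistic(z)  # linear central layer, logistic elsewhere
        activations.append(h)
    return activations

# Hypothetical layer sizes forming a bottleneck: 64 -> 32 -> 8 -> 32 -> 64
sizes = [64, 32, 8, 32, 64]
rng = np.random.default_rng(0)
weights = [rng.normal(0, 0.1, (m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]

x = rng.normal(size=(5, 64))   # five input samples
acts = forward(x, weights, biases, center=1)
code = acts[1]                 # 8-dimensional compressed representation
recon = acts[-1]               # reconstruction, trained to match the input
```

Training would adjust the weights so that `recon` approximates `x`; `code` then plays the role that the principal components play in PCA.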
To train DNNs, Hinton et al. first proposed an unsupervised learning algorithm
that uses greedy layer-wise unsupervised pretraining followed by fine-tuning meth-
ods to overcome the high prevalence of unsatisfactory local optima in learning objec-
tives of deep models [41]. Subsequently, Martens proposed a novel approach by in-
troducing a second-order optimization method, Hessian-free optimization, to train
deep networks [65]. The proposed method efficiently trained the models by a general
optimizer without pretraining. Placing emphasis on the simplicity of their algorithm,
we adopted the learning method proposed by Martens for optimizing our deep au-
toencoder. In our work, we utilized deep autoencoders for the self-organization of
sensory feature vectors, and for temporal sequence learning.
2.3.2 Convolutional Neural Network
A CNN (Figure 2.7) is a variant of a DNN commonly utilized for image classifica-
tion problems [58, 57, 59]. CNNs integrate three architectural ideas to ensure spatial
invariance: local receptive fields, shared weights, and spatial subsampling. Accord-
ingly, CNNs are advantageous compared with ordinary fully connected feed-forward
networks in the following three ways.
First, the local receptive fields in the convolutional layers extract local visual fea-
tures by connecting each unit only to small local regions of an input image. Local
receptive fields can extract visual features such as oriented-edges, end-points, and
corners. Typically, pixels in close proximity are highly correlated and distant pixels
are weakly correlated. Thus, the stack of convolutional layers is structurally advan-
tageous for recognizing images by effectively extracting and combining the acquired
Figure 2.7: Convolutional neural network
features. Second, CNNs can guarantee some degree of spatial invariance with respect to shifts, scaling, or local distortions of inputs by forcing units to share the same weight configurations across the input space. Units in a plane are thus forced to perform the same operation on different parts of the image. As CNNs are equipped with several local receptive fields, multiple features are extracted at each location. In principle, fully connected networks are also able to learn similar invariances. However, learning such weight configurations requires a very large number of training samples to cover all possible variations. Third, subsampling layers, which perform local averaging and
subsampling, are utilized to reduce the resolution of the feature map and sensitivity
of the output to input shifts and distortions (for implementation details, see [58]).
In terms of computational scalability, shared weights allow CNNs to possess
fewer connections and parameters compared with standard feed-forward neural
networks with similar-sized layers. Moreover, current improvements in compu-
tational resource availability, especially with highly-optimized implementations of
two-dimensional convolution algorithms processed with GPGPUs, has facilitated ef-
ficient training of remarkably large CNNs with millions of image datasets [56, 50].
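The parameter savings from weight sharing can be illustrated with a back-of-the-envelope count; the layer sizes below are hypothetical:

```python
def conv_layer_params(in_maps, out_maps, kh, kw):
    """Weights are shared across the input space: one kh x kw kernel per
    (input map, output map) pair, plus one bias per output map."""
    return out_maps * in_maps * kh * kw + out_maps

def dense_layer_params(in_units, out_units):
    """A fully connected layer needs one weight per input-output pair."""
    return in_units * out_units + out_units

# Hypothetical case: a 32x32 monochrome input mapped to 32 feature maps
conv = conv_layer_params(1, 32, 5, 5)              # 832 parameters
dense = dense_layer_params(32 * 32, 32 * 32 * 32)  # tens of millions of parameters
```

The convolutional layer needs only 832 parameters, whereas a fully connected layer producing the same number of output units would need over 33 million.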
2.4 Positioning of this Thesis towards Related Work
In this chapter, related work regarding multimodal integration has been reviewed
from the following three perspectives:
• How multimodal integration affects the way that humans perceive the external
environment (Section 2.1)
• The main contributions and the outstanding problems of multimodal integra-
tion learning in practical robot applications (Section 2.2)
• How deep learning algorithms have contributed towards achieving perfor-
mance improvements in machine-learning problems including image recogni-
tion, speech recognition, and also, multimodal integration learning (Section 2.3)
The reviews in Section 2.1 clearly show that sensory-motor information from multiple modalities in humans mutually interacts. Therefore, a comprehensive investigation of multimodal information is crucial to understanding human intelligence. Moreover, the reviews in Section 2.2 show that there have been
several engineering approaches to apply multimodal integration learning for realiz-
ing robust environment recognition, such as AVSR in speech recognition. However,
the same section also explains that investigations on the application of multimodal
integration learning in sensory-motor coordination in robotics remain incomplete
mainly due to the scalability limitation of conventional machine learning algorithms
in handling a huge variety and a large amount of sensory-motor information ac-
quired from the robots working in real-world environments.
Against this background, the fundamental research interest of this thesis is to seek possibilities for the further expansion of multimodal integration learning to realize robot intelligence. In practice, we apply deep learning, one of the state-of-the-art machine learning approaches reviewed in Section 2.3, to robot behavior learning. Our approach differs from conventional robot behavior learning practices in that carefully designed, dedicated sensory feature extraction mechanisms are
not required for handling raw sensory inputs. Moreover, the deep learning mecha-
nism enables extraction of highly generalized features by integrating sensory-motor
information from multiple modalities that contribute towards stably abstracting and
perceiving environmental situations. The robust recognition capability enables a
robot to adaptively select corresponding behaviors in response to diverse and irre-
producible environmental changes in real-world environments.
With regard to the research interest explained above, this thesis comprises the following three elemental studies. First, the sensory feature extraction performances of deep learning algorithms are evaluated by conducting an AVSR task in
Chapter 3. Second, a variant of DNN is applied to the dynamic sensory-motor in-
tegration learning of multiple object manipulation behaviors by a humanoid robot
in Chapter 5. Through the experiments, novel approaches to utilizing a DNN model for cross-modal memory retrieval and robust behavior recognition are proposed. Finally, by extending the experimental settings proposed in Chapter 6, a detailed analysis of the intersensory synchrony model acquired by the multimodal integration learning mechanism is conducted to investigate how mutual correlations between multimodal information are self-organized in the memory structure.
Chapter 3
Audio-Visual Speech Recognition
3.1 Introduction
In this chapter, we focus on the evaluation of sensory feature extraction performance
of deep learning algorithms and investigate how multimodal integration learning
contributes towards robust speech recognition. In accordance with the objectives,
we conduct an AVSR task as a practical evaluation experiment. AVSR is thought to
be one of the most promising solutions for reliable speech recognition, particularly
when the audio is corrupted by noise. The fundamental idea of AVSR is to use vi-
sual information derived from a speaker’s lip motion to complement corrupted au-
dio speech inputs. However, cautious selection of sensory features for the audio and
visual inputs is crucial in AVSR because sensory features significantly influence the
recognition performance.
Audio feature extraction by a deep denoising autoencoder is achieved by training the network to predict original clean audio features, such as MFCCs, from deteriorated audio features that are artificially generated by superimposing Gaussian noise of various strengths on the original clean audio inputs. Acquired audio feature
sequences are then processed with a conventional GMM-HMM to conduct an iso-
lated word recognition task. The main advantage of our audio feature extraction
mechanism is that noise-robust audio features are easily acquired through a rather
simple mechanism.
For the visual feature extraction mechanism, we propose the application of a CNN, one of the most successfully utilized neural network architectures for image classification problems. This is achieved by training the CNN with over a hundred thousand mouth area image frames in combination with corresponding phoneme labels.
CNN parameters are learned to maximize the average, across training cases, of the log-probability of the correct label under the prediction distribution. Through
supervised training, multiple layers of convolutional filters, which are responsible
for extracting primitive visual features and predicting phonemes from raw image inputs, are self-organized. Our visual feature extraction mechanism has two main advantages: (1) the proposed model is easy to implement because dedicated lip-shape models or hand-labeled data are not required; and (2) the CNN is robust to shifts and rotations in image recognition.
To perform an AVSR task by integrating both audio and visual features into a single model, we propose an MSHMM [11, 12, 45]. The main advantage of the MSHMM is that we can explicitly shift the observation information source (i.e., from audio input to visual input) by controlling the stream weights of the MSHMM depending on the reliability of the multimodal inputs. Our evaluation results demonstrate that the isolated word recognition performance can be improved by utilizing visual information,
especially when audio information reliability is degraded. The results also demon-
strate that the multimodal recognition attains an even better performance than when
audio and visual features are separately utilized for isolated word recognition tasks.
3.2 The Dataset
A Japanese audio-visual dataset [53, 113] was used for the evaluation of the proposed
models. In the dataset, speech data from six males (400 words: 216 phonetically bal-
anced words and 184 important words from the ATR1 speech database [53]) were
used. In total, 24000 word recordings were prepared (one set of words per speaker;
approximately 1 h of speech in total). The audio-visual synchronous recording en-
vironment is shown in Figure 3.1. Audio data was recorded with a 16 kHz sampling
rate, 16-bit depth, and a single channel. To train the acoustic model utilized for the
1Advanced Telecommunications Research Institute International
[Figure: recording setup comprising a PC, a camera, a light, and a microphone]
Figure 3.1: Audio-visual synchronous data recording environment
assignment of phoneme labels to image sequences, we extracted 39 dimensions of
audio features, composed of 13 MFCCs and their first and second temporal deriva-
tives. To synchronize the acquired features between audio and video, MFCCs were
sampled at 100 Hz. Visual data was a full-frontal 640×480 pixel 8-bit monochrome fa-
cial view recorded at 100 Hz. For visual model training and evaluation, we prepared
a trimmed dataset composed of multiple image resolutions by manually cropping
128× 128 pixels of the mouth area from the original data and resizing the cropped
data to 64×64, 32×32, and 16×16 pixels.
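A minimal sketch of this preprocessing, with placeholder crop coordinates (the thesis crops the mouth area manually) and naive strided subsampling standing in for the unspecified resizing method:

```python
import numpy as np

def crop(frame, top, left, size=128):
    """Extract a size x size region; the coordinates here are placeholders."""
    return frame[top:top + size, left:left + size]

def downsample(img, factor):
    """Naive strided subsampling; the actual resizing method is unspecified."""
    return img[::factor, ::factor]

frame = np.zeros((480, 640), dtype=np.uint8)        # 640x480 monochrome frame
mouth = crop(frame, 200, 250)                        # 128x128 mouth-area crop
resized = [downsample(mouth, f) for f in (2, 4, 8)]  # 64x64, 32x32, 16x16
```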
3.3 Model
A schematic diagram of the proposed AVSR system is shown in Figure 3.2. The pro-
posed architecture consists of two feature extractors to process audio signals syn-
chronized with lip region image sequences. For audio feature extraction, a deep de-
noising autoencoder [106, 107] is utilized to filter out the effect of background noise
from deteriorated audio features. For visual feature extraction, a CNN is utilized to
recognize phoneme labels from lip image inputs. Finally, a multi-stream HMM rec-
ognizes isolated words by binding acquired multimodal feature sequences.
[Figure: (a) a deep denoising autoencoder maps the original audio feature to a denoised audio feature; (b) a convolutional neural network maps a raw lip image to a visual feature (phoneme posteriors /a/, /a:/, /i/, ..., /y/, /z/, /sp/); (c) a multi-stream HMM binds the acquired audio and visual feature sequences]
Figure 3.2: Architecture of the proposed AVSR system
3.3.1 Audio Feature Extraction by Deep Denoising Autoencoder
For the audio feature extraction, we utilized a deep denoising autoencoder [106, 107].
Eleven consecutive frames of audio features are used as the short-time spectral representation of speech signal inputs. To generate audio input feature sequences, partially deteriorated sound data are artificially generated by superimposing Gaussian noise of several strengths on the original sound signals. In addition to the original clean sound data, we prepared six different deteriorated sound datasets; the SNRs ranged from 30 to −20 dB at 10 dB intervals. Utilizing sound feature extraction tools, the following types of sound features are generated from eight variations of original clean and deteriorated sound signals. The HCopy command of the hidden Markov model toolkit (HTK) [114] is utilized to extract 39 dimensions of MFCCs. The Auditory Toolbox [96] is utilized to extract 40 dimensions of log mel-scale filterbank (LMFB) features.
Finally, the deep denoising autoencoder is trained to reconstruct clean audio fea-
tures from deteriorated features by preparing the deteriorated dataset as input and
the corresponding clean dataset as the target of the network. Among a 400-word
Table 3.1: Settings for audio feature extraction
IN* OUT* LAYERS*
429 429 300-150-80-40-80-150-300 (a)
429 39 300-150-80 (b)
429 429 300-300-300-300-300-300-300 (c)
429 429 300-300-300-300-300 (d)
429 429 300-300-300 (e)
429 429 300 (f)
* IN, OUT, and LAYERS indicate the number of input and output dimensions, and the layer-wise dimensions of the network, respectively.
dataset, sound signals from 360 training words (2.76×105 samples) and the remain-
ing 40 test words (2.91×104 samples) from six speakers are used to train and evaluate
the network, respectively.
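The noise superimposition described above can be sketched as follows; the thesis does not give the exact procedure, so the noise variance is derived here from the target SNR in the standard way:

```python
import numpy as np

def superimpose_noise(clean, snr_db, rng):
    """Add zero-mean Gaussian noise so the result has roughly the target SNR."""
    signal_power = np.mean(clean ** 2)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=clean.shape)
    return clean + noise

rng = np.random.default_rng(0)
t = np.linspace(0, 1, 16000, endpoint=False)   # 1 s at a 16 kHz sampling rate
clean = np.sin(2 * np.pi * 440 * t)            # synthetic stand-in for clean speech
noisy = {snr: superimpose_noise(clean, snr, rng) for snr in range(30, -21, -10)}
```

At −20 dB SNR the noise power is 100 times the signal power, which is why the visual stream becomes essential in that regime.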
The denoised audio features are generated by recording the neuronal outputs of
the deep autoencoder when 11 frames of audio features are provided as input. To
compare the denoising performance relative to the construction of the network, sev-
eral different network architectures are compared. Table 3.1 summarizes the num-
ber of input and output dimensions, as well as layer-wise dimensions of the deep
autoencoder.
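The 11-frame input construction can be sketched as follows (39 MFCC dimensions × 11 frames = 429 input dimensions, matching Table 3.1):

```python
import numpy as np

def stack_frames(features, context=11):
    """Concatenate `context` consecutive frames into one input vector,
    sliding the window one frame at a time."""
    n_frames, dim = features.shape
    windows = [features[i:i + context].reshape(-1)
               for i in range(n_frames - context + 1)]
    return np.array(windows)

mfcc = np.zeros((100, 39))        # 100 frames of 39-dimensional MFCC features
inputs = stack_frames(mfcc, 11)   # each row has 39 * 11 = 429 dimensions
```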
In the initial experiment, we compared three different methods to acquire de-
noised features with respect to MFCCs and LMFB audio features. The first generated
11 frames of output audio features and utilized the middle frame (SequenceOut). The
second acquired an audio feature from the activation pattern of the central middle
layer of the network (BottleNeck). For these two experiments, a bottleneck-shaped
network was utilized (Table 3.1 (a)). The last generated a single frame of an output
audio feature that corresponds to the middle frame of the inputs (SingleFrameOut).
For this experiment, a triangle-shaped network was utilized (Table 3.1 (b)).
In the second experiment, we compared the performance relative to the number
of hidden layers of the network utilizing an MFCCs audio feature. In this experiment,
we prepared four straight-shaped networks with different numbers of layers (i.e., one
to seven layers) at intervals of two (Table 3.1 (c)–(f)). Outputs were acquired by gen-
erating 11 frames of output audio features and utilizing the middle frame. Regarding
the activation functions of the neurons, a linear function and logistic nonlinearity
are utilized for the central middle layer of the bottleneck-shaped network and the
remaining network layers, respectively. Parameters for the network structures are
empirically determined with reference to previous studies [41, 49].
The deep autoencoder is optimized to minimize the objective function E, defined by the L2 norm between the outputs of the network and the target vectors across training dataset D under the model parameterized by θ, represented as

E(D, \theta) = \sqrt{\sum_{i=1}^{|D|} \left( \hat{x}^{(i)} - x^{(i)} \right)^2}, \qquad (3.1)

where \hat{x}^{(i)} and x^{(i)} are the output of the network and the corresponding target vector for the i-th data sample, respectively. To optimize the deep autoencoder, we
adopted the Hessian-free optimization algorithm proposed by Martens [65]. In our
experiment, the entire dataset was divided into 12 chunks with approximately 85000
samples per batch. We utilized 2.0×10−5 for the L2 regularization factor on the con-
nection weights. For the connection weight parameter initialization, we adopted the
sparse random initialization scheme to limit the number of non-zero incoming con-
nection weights of each unit to 15. Bias parameters were initialized at 0. To pro-
cess the substantial amount of linear algebra computation involved in this optimiza-
tion algorithm, we developed a software library using the NVIDIA CUDA Basic Lin-
ear Algebra Subprograms [76]. The optimization computation was conducted on a
consumer-class personal computer with an Intel Core i7-3930K processor (3.2 GHz,
6 cores), 32 GB RAM, and a single NVIDIA GeForce GTX Titan graphics processing
unit with 6 GB on-board graphics memory.
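A minimal sketch of the sparse initialization scheme, assuming the non-zero incoming weights are drawn from a Gaussian (the exact distribution is not stated in the text):

```python
import numpy as np

def sparse_init(n_in, n_out, n_nonzero=15, scale=1.0, rng=None):
    """Each unit receives exactly `n_nonzero` non-zero incoming weights;
    all biases are initialized at 0."""
    rng = np.random.default_rng(0) if rng is None else rng
    W = np.zeros((n_in, n_out))
    for j in range(n_out):
        idx = rng.choice(n_in, size=n_nonzero, replace=False)  # pick 15 inputs
        W[idx, j] = rng.normal(0.0, scale, size=n_nonzero)
    b = np.zeros(n_out)
    return W, b

W, b = sparse_init(429, 300)   # e.g., the first layer of the 429-input autoencoder
```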
3.3.2 Visual Feature Extraction by CNN
For visual feature extraction, a CNN is trained to predict phoneme label posterior
probabilities corresponding to the mouth area input images. Mouth area images of
360 training words from six speakers were used to train and evaluate the network.
To assign phoneme labels to every frame of the mouth area image sequences, we
trained a monophone HMM with MFCCs utilizing the HTK and assigned 40 phoneme
Table 3.2: 39 types of Japanese phonemes
Category Phoneme labels
Vowels /a/ /i/ /u/ /e/ /o/
/a:/ /i:/ /u:/ /e:/ /o:/
Consonants
/b/ /d/ /g/ /h/ /k/ /m/ /n/
/p/ /r/ /s/ /t/ /w/ /y/ /z/ /ts/
/sh/ /by/ /ch/ /f/ /gy/ /hy/ /j/
/ky/ /my/ /ny/ /py/ /ry/
Others /N/ /q/
labels, including 39 Japanese phonemes (Table 3.2) and short pause /sp/, to the vi-
sual feature sequence by conducting a forced alignment using the HVite command
in the HTK. To enhance shift- and rotation-invariance, artificially modulated images
created by randomly shifting and rotating the original images are added to the orig-
inal dataset. In addition, images labeled as short pause /sp/ are eliminated, with
the exception of the five adjacent frames before and after the speech segments. The
image dataset (3.05× 105 samples) was shuffled and 5/6 of the data were used for
training; the remainder was used for the evaluation of a phoneme recognition exper-
iment. From our preliminary experiment, we confirmed that phoneme recognition
precision degrades if images from all six speakers are modeled with a single CNN.
Therefore, we prepared an independent CNN for each speaker.2 The visual features
for the isolated word recognition experiment are generated by recording neuronal
outputs (phoneme label posterior probability distribution) from the last layer of the
CNN when mouth area image sequences corresponding to 216 training words were
provided as inputs to the CNN.
A seven-layered CNN is used in reference to the work by Krizhevsky et al. [50].
Table 3.3 summarizes construction of the network containing four weighted layers:
three convolutional (C1, C3, and C5) and one fully connected (F7). The first convolu-
tional layer (C1) filters the input image with 32 kernels of 5×5 pixels with a stride of
one pixel. The second and third convolutional layers (C3 and C5) take the response-
2 We believe that this degradation is mainly due to the limited variations of lip region images that we prepared to train the CNN. To generalize the higher-level visual features that enable a CNN to attain speaker-invariant phoneme recognition, we believe that more image samples from different speakers are needed.
Table 3.3: Construction of a convolutional neural network
IN* OUT* LAYERS*
256/1024/4096 40 C1-P2-C3-P4-C5-P6-F7**
* IN, OUT, and LAYERS indicate the input dimensions, output dimensions, and network construction, respectively.
** C, P, and F denote the convolutional, local-pooling, and fully connected layers, respectively. The numbers after the layer types represent layer indices.
normalized and pooled output of the previous convolutional layers (P2 and P4) as
inputs and filter them with 32 and 64 filters of 5×5 pixels, respectively. The fully con-
nected layer (F7) takes the pooled output of the previous convolutional layer (P6) as
input and outputs a 40-way softmax, regarded as a posterior probability distribution
over the 40 classes of phoneme labels. A max-pooling layer follows the first convolu-
tion layer. Average-pooling layers follow the second and third convolutional layers.
Response-normalization layers follow the first and second pooling layers. Rectified
linear unit nonlinearity is applied to the outputs of the max-pooling layer as well as
the second and third convolutional layers. Parameters for the network structures are
empirically determined in reference to previous studies [58, 50].
The CNN is optimized to maximize the multinomial logistic regression objective
of the correct label. This is equivalent to maximizing the likelihood L defined by the
sum of log-probability of the correct label across training dataset D under the model
parameterized by θ, represented as
L(D, \theta) = \sum_{i=1}^{|D|} \log P\left(Y = y^{(i)} \mid x^{(i)}, \theta\right), \qquad (3.2)

where y^{(i)} and x^{(i)} are the class label and input pattern corresponding to the i-th data sample, respectively. The prediction distribution is defined with the softmax function as

P(Y = i \mid x, \theta) = \frac{\exp(h_i)}{\sum_{j=1}^{C} \exp(h_j)}, \qquad (3.3)

where h_i and C are the total input to output unit i and the number of classes, respec-
tively. The CNN is trained using a stochastic gradient descent method [50]. The update rule for the connection weight w is defined as

v_{i+1} = \alpha v_i - \gamma \epsilon w_i - \epsilon \left\langle \frac{\partial L}{\partial w} \Big|_{w_i} \right\rangle_{D_i}, \qquad (3.4)

w_{i+1} = w_i + v_{i+1}, \qquad (3.5)

where i is the learning iteration index, v_i is the update variable, α is the factor of momentum, ε is the learning rate, γ is the factor of weight decay, and ⟨∂L/∂w |_{w_i}⟩_{D_i} is the average over the i-th batch data D_i of the derivative of the objective with respect to w, evaluated at w_i. In our experiment, the mini-batches are one-sixth of the entire dataset for each speaker (approximately 8500 samples per batch). We utilized α = 0.9, ε = 0.001, and γ = 0.004 in our learning experiment. The weight parameters were
initialized with a zero-mean Gaussian distribution with standard deviation 0.01. The
neuron biases in all layers were initialized at 0. We used open source software (cuda-
convnet) [50] for practical implementation of the CNN. The software was processed
on the same computational hardware as the audio feature extraction experiment.
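The update rule of Eqs. (3.4) and (3.5) can be written directly; `grad` stands in for the batch-averaged derivative term, and the toy values below are only for illustration:

```python
import numpy as np

def update(w, v, grad, lr=0.001, momentum=0.9, weight_decay=0.004):
    """One momentum step following Eqs. (3.4)-(3.5):
    v <- alpha*v - gamma*eps*w - eps*grad ; w <- w + v."""
    v = momentum * v - weight_decay * lr * w - lr * grad
    return w + v, v

w = np.array([1.0, -2.0])       # toy weight vector
v = np.zeros(2)                 # update variable, initially zero
grad = np.array([0.5, -0.5])    # stand-in for the averaged derivative
w, v = update(w, v, grad)
```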
3.3.3 Audio-Visual Integration by MSHMM
In our study, we adopt a simple MSHMM with manually selected stream weights
for the multimodal integration mechanism. We utilize the HTK for the practical
MSHMM implementation. The HTK can model output probability distributions
composed of multiple streams of GMMs [114]. Each observation vector at time t is
modeled by splitting it into S independent data streams o_st. The output probability distribution of state j is represented with multiple data streams as

b_j(o_t) = \prod_{s=1}^{S} \left[ \sum_{m=1}^{M_s} c_{jsm} \, \mathcal{N}(o_{st}; \mu_{jsm}, \Sigma_{jsm}) \right]^{\gamma_s}, \qquad (3.6)
where o_t is a speech vector generated from the probability density b_j(o_t), M_s is the number of mixture components in stream s, c_jsm is the weight of the m-th component, \mathcal{N}(·; μ, Σ) is a multivariate Gaussian with mean vector μ and covariance matrix Σ, and the exponent γ_s is a stream weight for stream s.
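Equation (3.6) can be sketched numerically; the diagonal-covariance Gaussians and toy mixture parameters below are illustrative assumptions:

```python
import numpy as np

def gauss(x, mu, var):
    """Diagonal-covariance multivariate Gaussian density."""
    d = x - mu
    return np.exp(-0.5 * np.sum(d * d / var)) / np.sqrt(np.prod(2 * np.pi * var))

def stream_output_prob(obs, gmms, gammas):
    """Eq. (3.6): product over streams of GMM likelihoods, each raised
    to its stream weight gamma_s."""
    b = 1.0
    for o, comps, g in zip(obs, gmms, gammas):
        mix = sum(c * gauss(o, mu, var) for c, mu, var in comps)
        b *= mix ** g
    return b

# One audio and one visual stream, a single mixture component each (toy values)
obs = [np.zeros(2), np.zeros(3)]
gmms = [[(1.0, np.zeros(2), np.ones(2))], [(1.0, np.zeros(3), np.ones(3))]]
b = stream_output_prob(obs, gmms, [0.7, 0.3])
```

Setting a stream weight to 0 removes that stream's influence entirely, which is how the model can fall back on one modality.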
Definitions of MSHMM are generated by combining multiple HMMs indepen-
dently trained with corresponding audio and visual inputs. In our experiment, we
utilize 16 mixture components for both audio and visual output probability distribu-
tion models. When combining two HMMs, GMM parameters from audio and visual
HMMs are utilized to represent stream-wise output probability distributions. Model
parameters from only the audio HMM are utilized to represent the common state
transition probability distribution. Audio stream weights γ_a are manually prepared from 0 to 1.0 at intervals of 0.1. Accordingly, visual stream weights γ_v are prepared to satisfy γ_v = 1.0 − γ_a. In evaluating the acquired MSHMM, the best recognition rate
is selected from the multiple evaluation results corresponding to all stream weight
pairs.
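The stream-weight sweep can be sketched as follows; `evaluate` is a hypothetical callback standing in for a full recognition run at a given weight pair:

```python
def sweep_stream_weights(evaluate, step=0.1):
    """Evaluate every (gamma_a, gamma_v) pair with gamma_v = 1 - gamma_a
    and return the pair with the best recognition rate [%]."""
    best_pair, best_rate = None, -1.0
    for i in range(int(round(1.0 / step)) + 1):
        ga = round(i * step, 10)          # avoid floating-point drift
        gv = round(1.0 - ga, 10)
        rate = evaluate(ga, gv)
        if rate > best_rate:
            best_pair, best_rate = (ga, gv), rate
    return best_pair, best_rate

# Toy evaluator whose recognition rate peaks at an audio weight of 0.7
pair, rate = sweep_stream_weights(lambda ga, gv: 100 - abs(ga - 0.7) * 50)
```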
3.4 Results
3.4.1 ASR Performance Evaluation
The acquired audio features are evaluated by conducting an isolated word recogni-
tion experiment utilizing a single-stream HMM. To recognize words from the audio
features acquired by the deep denoising autoencoder, monophone HMMs with 8, 16,
and 32 GMM components are utilized. While training is conducted with 360 training
words, evaluation is conducted with 40 test words from the same speaker, thereby
yielding a closed-speaker and open-vocabulary evaluation. To enable comparison
with the baseline performance, word recognition rates utilizing the original audio
features are also prepared. To evaluate the robustness of our proposed mechanism against the degradation of audio input, partially deteriorated sound data were artificially generated by superimposing Gaussian noise of several strengths on the original sound signals. In addition to the original clean sound data, we prepared 11 different deteriorated sound datasets such that the SNRs ranged from 30 dB to −20 dB at 5 dB intervals.
Figure 3.3 shows word recognition rates for the different word recognition mod-
els for MFCCs and LMFB audio features evaluated with 12 different SNRs for sound
inputs. In Figure 3.3, changes of word recognition rates depending on the types of
audio features (MFCCs for (a) to (c) and LMFB for (d) to (f)), the types of feature
extraction mechanism, and changes of the SNR of audio inputs are shown. These
[Figure: six panels of word recognition rate [%] versus SNR [dB] (clean, 30 to −20 dB), each comparing the Original, SequenceOut, BottleNeck, and SingleFrameOut features: (a) 8 components (MFCCs); (b) 16 components (MFCCs); (c) 32 components (MFCCs); (d) 8 components (LMFB); (e) 16 components (LMFB); (f) 32 components (LMFB)]
Figure 3.3: Word recognition rate evaluation results using audio features depending on the number of Gaussian mixture components for the output probability distribution models of HMM
Table 3.4: Speaker-wise visual-based phoneme recognition rates and averaged values[%] depending on the input image sizes
Img. size p1 p2 p3 p4 p5 p6 Avr.
16×16 42.13 43.40 39.92 39.03 47.67 46.73 43.15
32×32 43.77 47.07 42.77 41.05 49.74 50.83 45.87
64×64 45.93 50.06 46.51 43.57 49.95 51.44 47.91
* p1–p6 correspond to the six speakers.
results demonstrate that MFCCs generally outperform LMFB. Sound features acquired by integrating multiple consecutive frames with a deep denoising autoencoder yield higher noise robustness than the original input. Comparing the audio features acquired from the different network architectures, “SingleFrameOut” obtains the highest recognition rates in the higher SNR range, whereas “SequenceOut” performs best in the lower SNR range. While “BottleNeck” performs slightly better than the original input in the middle SNR range, the advantage is minimal. Overall, a word recognition gain of approximately 65% was attained with denoised MFCCs under 10 dB SNR. Although recognition performance differs slightly with the number of Gaussian mixture components, the effect is not significant.
Figure 3.4 shows word recognition rates for different numbers of hidden layers of the deep denoising autoencoder utilizing MFCCs audio features, evaluated with 12 different SNRs for the sound inputs. The figure plots the changes in word recognition rate depending on the number of hidden layers of the DNN and on the SNR of the audio inputs. The deep denoising autoencoder with five hidden layers obtained the most noise-robust word recognition performance across all SNR ranges.
3.4.2 Visual-Based Phoneme Recognition Performance Evaluation
After training the CNN, phoneme recognition performance is evaluated by record-
ing neuronal outputs from the last layer of the CNN when the mouth area image
sequences corresponding to the test image data are provided to the CNN. Table 3.4
shows that the average phoneme recognition performance for the 40 phonemes, nor-
malized with the number of samples for each phoneme over six speakers, attained
approximately 48% when 64×64 pixels of mouth area images are utilized as input.
[Figure: word recognition rate [%] plotted against SNR [dB] (clean, 30 dB down to −20 dB) for deep denoising autoencoders with 1, 3, 5, and 7 hidden layers; panels: (a) 8 components, (b) 16 components, (c) 32 components]
Figure 3.4: Word recognition rate evaluation results utilizing MFCCs depending on the number of Gaussian mixture components for the output probability distribution models of HMM
Figure 3.5 shows the mean and standard deviation of the phoneme-wise recognition rates over the six speakers. Each plot marker shape corresponds to the recognition results obtained with the visual features acquired by the CNN from one of the three resolutions of the mouth area image inputs. This result generally demonstrates that visual phoneme recognition works better for vowels than for consonants. The
result derives from the observation that the mean recognition rate for the vowels is 30–90%, whereas for all other phonemes it is 0–60%. This may be attributed to the fact that the production of vowels correlates strongly with visible cues involving lip and jaw movements [7, 112].
Figure 3.6 shows the confusion matrix of the phoneme recognition evaluation
results. In Figure 3.6, the mean values from six speakers’ results are shown. It should
be noted that, in most cases, wrongly recognized consonants are classified as vowels.
This indicates that the articulation of consonants is attributable not only to the motion of the lips but also to the dynamic interaction of interior oral structures, such as the tongue, teeth, and oral cavity, which are not evident in frontal facial images.
Visually explicit phonemes, such as bilabial consonants (/m/, /p/, or /b/), are ex-
pected to be relatively well discriminated by a VSR system. However, the recognition
performance was not as high as expected. To improve the recognition rate, the pro-
cedure to obtain phoneme target labels for the CNN training should be improved.
In general pronunciation, consonant sounds are shorter than vowel sounds; therefore, the labeling of consonants is more time-critical than that of vowels. In addition, the
accuracy of consonant labels directly affects recognition performance because the
number of training samples is much smaller for consonants than it is for vowels.
3.4.3 Visual Feature Space Analysis
To analyze how the acquired visual feature space is self-organized, the trained CNN
is used to generate phoneme posterior probability sequences from test image se-
quences. Forty dimensions of the resulting sequences are processed by PCA, and
the first three principal components are extracted to visualize the acquired feature
[Figure: phoneme-wise recognition rate [%] with mean and standard deviation bars for each of the 40 phonemes, plotted for the 16×16, 32×32, and 64×64 input image resolutions]
Figure 3.5: Phoneme-wise visual-based phoneme recognition rates
[Figure: confusion matrix with recognized phonemes on the horizontal axis, true phonemes on the vertical axis, and cell values ranging from 0 to 0.8]
Figure 3.6: Visual-based phoneme-recognition confusion matrix (64×64 pixels image input)
space. Figure 3.7 shows the visual feature space corresponding to the five represen-
tative Japanese vowel phonemes, /a/, /i/, /u/, /e/, and /o/, generated from 64×64
pixels image inputs. The cumulative contribution ratio of the three selected principal components was 31.1%. As demonstrated in the graph, raw mouth area images corresponding to
the five vowel phonemes are discriminated by the CNN and clusters corresponding
to the phonemes are self-organized in the visual feature space. This result indicates
that the acquired phoneme posterior probability sequences can be utilized as visual
feature sequences for isolated word recognition tasks.
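The visualization procedure described above, PCA over the 40-dimensional posterior sequences followed by projection onto the first three principal components, can be sketched in numpy as follows; the posterior data here are randomly generated placeholders rather than actual CNN outputs.

```python
import numpy as np

rng = np.random.default_rng(1)

# Placeholder for CNN outputs: 500 frames of 40-dim phoneme posteriors
# (each row normalized like a softmax output).
logits = rng.normal(size=(500, 40))
post = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

# PCA via SVD of the mean-centered data.
X = post - post.mean(axis=0)
U, S, Vt = np.linalg.svd(X, full_matrices=False)
var = S ** 2 / (len(X) - 1)      # variance captured by each component
ratio = var / var.sum()          # per-component contribution ratio
pc3 = X @ Vt[:3].T               # coordinates on the first three PCs
cum3 = float(ratio[:3].sum())    # cumulative contribution ratio of 3 PCs
```

Plotting `pc3` as a 3-D scatter, colored by phoneme label, yields a view of the feature space analogous to Figure 3.7.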
3.4.4 VSR Performance Evaluation
The acquired visual features are evaluated by conducting an isolated word recognition experiment utilizing a single-stream HMM. To recognize words from the phoneme label sequences generated by the CNN, monophone HMMs with 1, 2, 4, 8, 16, 32, and 64 Gaussian components are utilized. Training is conducted with 360 training words and evaluation with 40 test words from the same speaker, thereby yielding a closed-speaker and open-vocabulary evaluation. To compare with the baseline performance, word recognition
rates utilizing two other visual features are also prepared. One feature has 36 dimen-
sions, generated by simply rescaling the images to 6×6 pixels, and the other feature
has 40 dimensions, generated by compressing the raw images by PCA.
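The decision rule behind this kind of evaluation, scoring each candidate word's HMM on the observed feature sequence and choosing the maximum-likelihood word, can be sketched as follows. This is not the GMM-HMM pipeline used in the experiments; it is a minimal numpy forward algorithm over single-Gaussian-emission HMMs with hypothetical dimensions.

```python
import numpy as np

rng = np.random.default_rng(2)

def logsumexp(a, axis):
    m = a.max(axis=axis, keepdims=True)
    return (m + np.log(np.exp(a - m).sum(axis=axis, keepdims=True))).squeeze(axis)

def log_gauss(x, mu, var):
    # Frame-wise log density of a diagonal-covariance Gaussian.
    return -0.5 * (np.log(2 * np.pi * var) + (x - mu) ** 2 / var).sum(-1)

def forward_loglik(obs, log_A, log_pi, mu, var):
    # Log-space forward algorithm: total likelihood of obs under one HMM.
    log_b = np.stack([log_gauss(obs, m, v) for m, v in zip(mu, var)], axis=1)
    alpha = log_pi + log_b[0]
    for t in range(1, len(obs)):
        alpha = log_b[t] + logsumexp(alpha[:, None] + log_A, axis=0)
    return float(logsumexp(alpha, axis=0))

# Two "word" models sharing uniform transitions but differing in emissions.
S, D = 3, 4
log_A = np.log(np.full((S, S), 1.0 / S))
log_pi = np.log(np.full(S, 1.0 / S))
var = np.ones((S, D))
mu_w0, mu_w1 = np.zeros((S, D)), np.full((S, D), 3.0)

obs = rng.normal(size=(20, D))      # a sequence matching word 0's emissions
ll0 = forward_loglik(obs, log_A, log_pi, mu_w0, var)
ll1 = forward_loglik(obs, log_A, log_pi, mu_w1, var)
best = 0 if ll0 > ll1 else 1        # maximum-likelihood word decision
```

Replacing the single Gaussians with mixtures of 1 to 64 components gives the model family swept in the evaluation.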
Figure 3.8 shows the word recognition rates acquired from 35 different models combining five types of visual features with seven different numbers of Gaussian mixture components, evaluated on 40 test words over six speakers. The five visual features comprise two image-based features, one generated by simply resampling the mouth area image to 6×6 pixels and the other generated by compressing the image to 40 dimensions by PCA, and three features acquired by predicting the phoneme label sequences from the three resolutions of the mouth area images utilizing the CNN. Comparison of word recognition rates from different visual features within the same number of Gaussian components shows that visual features acquired by the CNN attain higher recognition rates
[Figure: 3-D scatter plot of the visual feature space spanned by the first three principal components (PC1, PC2, PC3), showing clusters for the five vowel phonemes]
Figure 3.7: Visual feature distribution for the five representative Japanese vowel phonemes (64×64 pixels image input)
[Figure: word recognition rate [%] plotted against the number of Gaussian components (1–64) for the 6x6, PCA, CNN_16x16, CNN_32x32, and CNN_64x64 visual features]
Figure 3.8: Word recognition rates using image features
than the other two visual features. However, the effect of the different input image
resolutions is not prominent. Visual features acquired by the CNN with the 16×16 and 64×64 input image resolutions attain the highest word recognition rate, approximately 22.5%, when a mixture of 32 Gaussian components is used.
3.4.5 AVSR Performance Evaluation
We evaluated the advantages of sensory features acquired by the DNNs and noise
robustness of the AVSR by conducting an isolated word recognition task. Training
data for the MSHMM are composed of image and sound features generated from 360
training words of six speakers. For sound features, we utilized the neuronal outputs
of the straight-shaped deep denoising autoencoder with five hidden layers (Table 3.1
(d)) when clean MFCCs are provided as inputs. For visual features, we utilized the
output phoneme label sequences generated from 32× 32 pixels mouth area image
inputs by the CNN. Evaluation data for the MSHMM are composed of image and
sound features generated from the 40 test words. Thus, closed-speaker and open-
vocabulary evaluation was conducted. To evaluate the robustness of our proposed mechanism against degraded audio input, deteriorated sound data were artificially generated by superimposing Gaussian noise of several strengths on the original sound signals. In addition to the original clean sound data, we prepared 11 deteriorated versions whose SNRs ranged from 30 dB to −20 dB at 5 dB intervals. In our evaluation experiment, we compared the performance under four
different conditions. The initial two models were the unimodal models that utilize
single-frame MFCCs and the denoised MFCCs acquired by the straight-shaped deep
denoising autoencoder with five hidden layers. These are identical to the models
“Original” and “5 layers” presented in Figure 3.3 and Figure 3.4, respectively. The
third model was the unimodal model that utilized visual features acquired by the
CNN. The fourth model was the multimodal model that binds the acquired audio
and visual features by the MSHMM.
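The noise superimposition step described above, scaling white Gaussian noise so that the mixture reaches a target SNR, can be reproduced with a short numpy helper; the 440 Hz tone below is only a stand-in for a speech waveform.

```python
import numpy as np

rng = np.random.default_rng(3)

def add_noise_at_snr(signal, snr_db):
    # Scale Gaussian noise so that 10*log10(P_signal / P_noise) == snr_db.
    noise = rng.normal(size=signal.shape)
    p_signal = np.mean(signal ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_signal / (p_noise * 10.0 ** (snr_db / 10.0)))
    return signal + scale * noise

t = np.linspace(0.0, 1.0, 16000, endpoint=False)
clean = np.sin(2 * np.pi * 440.0 * t)   # toy stand-in for a clean utterance

snrs = (30, 10, -10)                    # cf. the 30 dB to -20 dB range above
measured = []
for snr in snrs:
    noisy = add_noise_at_snr(clean, snr)
    residual = noisy - clean
    measured.append(10 * np.log10(np.mean(clean ** 2) / np.mean(residual ** 2)))
```

Sweeping `snr_db` from 30 to −20 in 5 dB steps produces the 11 deteriorated conditions used in the evaluation.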
Figures 3.9 to 3.11 show word recognition rates from the four different word recognition models under 12 different SNRs for the sound inputs; each figure corresponds to a different number of Gaussian mixture components for the output probability distribution models of the HMM. Graphs on the top show changes
in word recognition rates depending on the types of utilized features and changes
in the SNR of audio inputs. “MFCC,” “DNN_Audio,” “CNN_Visual,” and “Multi-
stream” denote the original MFCCs feature, audio feature extracted by the deep de-
noising autoencoder, visual feature extracted by the CNN, and MSHMM composed of
“DNN_Audio” and “CNN_Visual” features, respectively. Graphs on the bottom show
audio stream weights that yield the best word recognition rates for the MSHMM de-
pending on changes in the audio input’s SNR.
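The decision-fusion principle of the MSHMM, a linear mixture of the per-stream log observation likelihoods governed by a stream weight, can be sketched with hypothetical per-word scores; the numbers below are illustrative, not measured values.

```python
import numpy as np

def fused_score(log_b_audio, log_b_visual, w_audio):
    # MSHMM-style decision fusion: stream-weighted sum of log likelihoods,
    # with the audio and visual weights constrained to sum to one.
    return w_audio * log_b_audio + (1.0 - w_audio) * log_b_visual

# Hypothetical per-word log-likelihoods for one noisy test utterance.
log_b_audio = np.array([-120.0, -80.0])    # audio stream favors word 1
log_b_visual = np.array([-60.0, -90.0])    # visual stream favors word 0

decisions = [int(np.argmax(fused_score(log_b_audio, log_b_visual, w)))
             for w in (1.0, 0.5, 0.0)]
# As the audio weight decreases (i.e., audio reliability drops), the
# decision shifts from the audio-preferred word to the visual-preferred one.
```

Selecting, for each SNR, the weight that maximizes recognition accuracy yields the audio stream weight curves shown in the bottom graphs.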
These results demonstrate that when two modalities are combined to represent
the acoustic model, the word recognition rates are improved, particularly for lower
SNRs. At minimum, the same or better performance was attained compared with the cases in which each feature is utilized independently. For example, the MSHMM attained an additional 10% word recognition rate gain under 0 dB SNR for the audio
signal input compared with the case when single-stream HMM and denoised MFCCs
are utilized as the recognition mechanism and input features, respectively. Although
[Figure: top, word recognition rate [%] vs. SNR [dB] for the MFCC, DNN_Audio, CNN_Visual, and Multi-stream models; bottom, best-performing audio stream weight vs. SNR [dB]]
Figure 3.9: Word recognition rate evaluation results (8 components)
[Figure: top, word recognition rate [%] vs. SNR [dB] for the MFCC, DNN_Audio, CNN_Visual, and Multi-stream models; bottom, best-performing audio stream weight vs. SNR [dB]]
Figure 3.10: Word recognition rate evaluation results (16 components)
[Figure: top, word recognition rate [%] vs. SNR [dB] for the MFCC, DNN_Audio, CNN_Visual, and Multi-stream models; bottom, best-performing audio stream weight vs. SNR [dB]]
Figure 3.11: Word recognition rate evaluation results (32 components)
there is a slight recognition performance difference depending on the increase in the
number of Gaussian mixture components, the effect is not significant.
3.5 Discussion and Future Work
3.5.1 Current Need for the Speaker Dependent Visual Feature Extraction Model
In our study, we demonstrated an isolated word recognition performance from vi-
sual sequence inputs by the integration of CNN and HMM. We showed that the CNN
works as a phoneme recognition mechanism with mouth region image inputs. How-
ever, our current results are attained by preparing an independent CNN correspond-
ing to each speaker. As generally discussed in previous deep learning studies [50, 56],
the number and variation of training samples are critical for maximizing the gener-
alization ability of a DNN. A DNN (CNN) framework is scalable; however, it requires
a sufficient training dataset to reduce overfitting [50]. Therefore, in future work, we
need to investigate the possibility of realizing a VSR system applicable to multiple
speakers with a single CNN model by training and evaluating our current mechanism
with a more diverse audio-visual speech dataset that has large variations, particularly
for mouth region images.
3.5.2 Positioning of our VSR Results with Regard to the State of the Art in Lip Reading
Most current lip reading experiments are still limited to rather simple tasks, such as isolated or connected random words, digits, or letters. Moreover, a universally acknowledged benchmark has not been established in lip reading studies. The major reasons are that (1) preparing a lip reading data corpus manually, or at best semi-automatically, requires an immense amount of time and effort, and (2) consequently, the available data corpora remain very limited. Therefore, although some of the reported experimental
results are listed below, it is important to keep in mind that a fair comparison to mul-
tiple experiments is difficult to provide [17].
For the isolated word recognition task, Nefian et al. [74] report 66.9%, Zhang et
al. [115] report 45.6%, and Kumar et al. [51] report 42.3% recognition rates. These rates are more than twice as high as our current results. However, a closer look at the experimental conditions indicates that the data corpus used by Nefian et al. and Zhang et al. includes only 78 words by ten speakers with ten repetitions; nine examples of each word were used for training and the remaining example for testing. Although the closed speaker condition is common to our experiment, their evaluation does not use unknown words, in contrast to our open vocabulary setting. Moreover, our preliminary experiment with the
trast to our open vocabulary setting. Moreover, our preliminary experiment with the
closed speaker setting attained around 63% recognition rate (Figure 3.12), which is
competitive to previous studies. The experiment by Kumar et al. is conducted with
a corpus that includes 150 words by ten speakers. In this case, the closed speaker
setting is the same, but whether an open vocabulary test is conducted is unclear.
In conclusion, while our current isolated word recognition results did not exhibit
cutting-edge performance, we can consider that our results reached a state-of-the-
art level given the following experimental conditions: closed speaker setting with six
speakers and an open vocabulary setting with a 400-word vocabulary, 360 words for training, and 40 words for testing.
3.5.3 Adaptive Stream Weight Selection
Our AVSR system utilizing MSHMM achieved satisfactory speech recognition perfor-
mance, despite its quite simple mechanism, especially for audio signal inputs with
lower reliability. The transition of the stream weight in accordance with changes in
the SNR for the audio input (Figure 3.9 to Figure 3.11) clearly demonstrates that the
MSHMM can prevent degradation of recognition precision by shifting the observa-
tion information source from audio input to visual input, even if the quality of the
audio input degrades. However, to apply our AVSR approach to real-world applica-
tions, automatic and adaptive selection of the stream weight in relation to changes
in audio input reliability becomes an important issue to be addressed.
[Figure: top, word recognition rate [%] vs. SNR [dB] for the MFCC, DNN_Audio, CNN_Visual, and Multi-stream models; bottom, best-performing audio stream weight vs. SNR [dB]]
Figure 3.12: Word recognition rate evaluation results (32 components, speaker-closed evaluation)
3.5.4 Relations of our AVSR Approach with DNN-HMM Models
As an experimental study for an AVSR task, we adopted a rather simple tandem ap-
proach, a connectionist-HMM [39]. Specifically, we applied heterogeneous deep
learning architectures to extract the dedicated sensory features from audio and vi-
sual inputs and combined the results with an MSHMM. We acknowledge that a DNN-
HMM is known to be advantageous for directly estimating the state posterior prob-
abilities of an HMM from raw sensory feature inputs over conventional GMM-HMM
owing to the powerful nonlinear projection capability of DNN models [40]. In the fu-
ture, it might be interesting to formulate an AVSR model based on the integration of
DNN-HMM and MSHMM. This novel approach may succeed because of the recogni-
tion capability of DNNs and the simplicity and explicitness of the proposed decision
fusion approach.
3.6 Summary
In this chapter, we proposed an AVSR system based on deep learning architectures
for audio and visual feature extraction and an MSHMM for multimodal feature inte-
gration and isolated word recognition. The main targets discussed in this chapter are
summarized in Figure 3.13.
Our experimental results demonstrated that, compared with the original MFCCs,
the deep denoising autoencoder can effectively filter out the effect of noise superim-
posed on original clean audio inputs and that acquired denoised audio features at-
tain significant noise robustness in an isolated word recognition task. Furthermore,
our visual feature extraction mechanism based on the CNN effectively predicted the
phoneme label sequence from the mouth area image sequence, and the acquired
visual features attained significant performance improvement in the isolated word
recognition task relative to conventional image-based visual features, such as PCA-compressed images.
Finally, an MSHMM was utilized for an AVSR task by integrating the acquired audio
and visual features.
The next major target of our work is to examine the possibility of applying our cur-
rent approach to develop practical, real-world applications. Specifically, future work
will include a study to evaluate how the VSR approach utilizing translation, rotation,
or scaling invariant visual features acquired by the CNN contributes to robust speech
recognition performance in a real-world environment, where dynamic changes such
as reverberation, illumination, and facial orientation occur.
[Figure: overview diagram of the dissertation structure highlighting the targets addressed in Chapter 3]
Figure 3.13: The main targets discussed in Chapter 3
Chapter 4
Learning Framework for Multimodal
Integration of Robot Behaviors
4.1 Introduction
In Chapter 3, the sensory feature extraction performances of the two representative
DNN mechanisms are evaluated. As a practical evaluation experiment, an AVSR task
is conducted to investigate how noise robust speech recognition becomes possible
by utilizing the sensory features acquired from different DNN frameworks and by
integrating those multimodal features. By applying an MSHMM for the multimodal
integration learning, the temporal sequences extracted from the speech signals are
modeled with a discrete representation of state transition probability. Moreover, the
multimodal integration is attained by an explicit linear mixture of the observation
probability models.
This approach is an intuitive and straightforward way for temporal sequence
recognition tasks like speech recognition, because the main focus of the task is just
to ‘recognize’ by symbolizing raw sensory signals. However, the approach is not suit-
able for sensory-motor coordination tasks such as robot behavior learning because
recognition using an MSHMM is specialized for acquiring symbolic representation
from raw signals, and thus, the reconstruction of raw signals from the acquired sym-
bolic representation is not considered. Therefore, the approach requires the design of
an external mechanism once generation of action commands corresponding to the
recognized states is considered.
To overcome this issue, we propose a multimodal temporal sequence integration
learning framework utilizing a DNN. In this chapter, we propose the application of a deep autoencoder not only for feature extraction by dimensionality compression but also for multimodal temporal sequence integration learning. Our main
contribution is to demonstrate that our proposed framework serves as a cross-modal
memory retriever, as well as a temporal sequence predictor utilizing its powerful gen-
eralization capabilities. In the sections that follow, we first illustrate the basic mech-
anism of the autoencoder and then explain how the autoencoder is applied to multimodal temporal sequence learning and its further functions.
4.2 Multimodal Temporal Sequence Learning using a
DNN
4.2.1 Sensory Feature Extraction
High-dimensional raw sensory inputs, such as visual images or sound spectrums,
can be converted into low-dimensional feature vectors by multilayer networks with a
small central layer (i.e., a feature-extraction network) [41]. To this end, the networks
are trained with the goal of reconstructing the input data at the output layer with
input-output mappings defined as
u_t = f(r_t),   (4.1)
\hat{r}_t = f^{-1}(u_t),   (4.2)

where r_t, u_t, and \hat{r}_t are the vectors representing the raw input data, the corresponding feature, and the reconstructed data, respectively. Functions f(·) and f^{-1}(·) represent the transformation mappings from the input layer to the central hidden layer and from the central hidden layer to the output layer of the network, respectively.
coder compresses the dimensionality of inputs by decreasing the number of nodes
from the input layer to the central hidden layer. Hence, the number of central hidden
layer nodes determines the dimension of the feature vector. In a symmetric fashion,
the original input is reconstructed from the feature vector by eventually increasing
the number of nodes from the central hidden layer to the output layer.
Regarding dimensionality compression mechanisms, a simple and commonly
utilized approach is PCA; however, Hinton et al. demonstrated that the deep autoen-
coder outperformed PCA in image reconstruction and compressed feature acquisi-
tion [41]. In reference to their work, we utilized the deep autoencoder for our di-
mensionality compression framework because we prioritized the precision of cross-
modal memory retrieval and the sparseness of acquired features to ease the behavior
recognition task via a conventional classifier.
4.2.2 Multimodal Integration Learning using Time-delay Networks
A time-delay neural network (TDNN) is a method for utilizing a feed-forward neu-
ral network for multi-dimensional temporal sequence learning [55]. Motivated by
TDNN, we propose a novel computational framework that utilizes a deep autoen-
coder for temporal sequence learning.
An input to the temporal sequence learning network at a single time step is de-
fined by a time segment of the tuple of joint angle vectors, image feature vectors, and
sound feature vectors, formatted as
s_t = (a_{\bar{t}}, u^i_{\bar{t}}, u^s_{\bar{t}}),   (4.3)
{\bar{t} | t − T + 1 ≤ \bar{t} ≤ t},   (4.4)

where s_t, a_t, u^i_t, and u^s_t are the vectors representing the input to the network, the joint angle, the image feature, and the sound feature at time t, respectively, and T is the length of the time window. Here, \bar{t} represents the previous T steps of the temporal segment ending at t, and a vector with subscript \bar{t} indicates a time series of the vector.
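Concretely, the windowed input of Eqs. (4.3) and (4.4) can be assembled by concatenating the last T steps of each modality; the dimensionalities below are hypothetical, chosen only for the sketch.

```python
import numpy as np

def make_segment(joints, img_feat, snd_feat, t, T):
    # Eqs. (4.3)/(4.4): flatten the last T steps of the joint angle, image
    # feature, and sound feature sequences into one input vector s_t.
    window = slice(t - T + 1, t + 1)
    return np.concatenate([joints[window].ravel(),
                           img_feat[window].ravel(),
                           snd_feat[window].ravel()])

# Hypothetical dimensions: 10 joint angles, 30-dim image feature,
# 24-dim sound feature, and a window of T = 5 steps.
steps = 100
joints = np.zeros((steps, 10))
img_feat = np.zeros((steps, 30))
snd_feat = np.zeros((steps, 24))

s_t = make_segment(joints, img_feat, snd_feat, t=50, T=5)
# s_t has 5 * (10 + 30 + 24) = 320 dimensions.
```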
The input-output mappings of the temporal sequence learning network are defined
as
v_t = g(s_t),   (4.5)
\hat{s}_t = g^{-1}(v_t),   (4.6)

where v_t and \hat{s}_t = (\hat{a}_{\bar{t}}, \hat{u}^i_{\bar{t}}, \hat{u}^s_{\bar{t}}) are the multimodal feature vector and a segment of the restored multimodal temporal sequence, respectively. Functions g(·) and g^{-1}(·) represent the transformation mappings from the input layer to the central hidden layer and from the central hidden layer to the output layer of the network, respectively.
One of the merits of applying neural networks to multimodal temporal sequence
learning is their generalization capability. Because the network can complement de-
ficiencies in the input data, the temporal sequence learning network can be used
in two different ways: (1) to retrieve a temporal sequence of one modality from the others (Figure 4.1(a), (b)) and (2) to predict a future sequence from the past sequence (Figure 4.1(c)). Thus, the temporal sequence learning network serves as a
cross-modal memory retriever or a temporal sequence predictor: the input data from outside the network are masked in either a spatial or a temporal manner, and the generated outputs are iteratively fed back to the inputs as substitutions for the masked inputs. The practical implementation of these functions is described in the following
subsections.
4.3 Applications
4.3.1 Cross-modal Memory Retrieval
Cross-modal memory retrieval is realized by self-generating sequences for a modal-
ity inside the network by providing corresponding sequences for the other modalities
from outside the network. For the retrieved modality, a recurrent loop from the out-
put nodes to the input nodes is prepared. Hence, in the case of generating an image
sequence from motion and sound sequences, input to the network is defined as
s_t = (a_{\bar{t}}, \hat{u}^i_{\bar{t}}, u^s_{\bar{t}}).   (4.7)
[Figure: three configurations of the temporal sequence learning network: (a), (b) cross-modal memory retrieval, with one modality's input masked and fed back from the output; (c) temporal sequence prediction, with future time steps masked and fed back]
Figure 4.1: Examples of cross-modal memory retrieval and sequence prediction
As shown in Figure 4.2, the time segment of the recurrent input is generated by shifting the previous output of the network forward by one time step: (1) the oldest time step of the output is discarded, and (2) the latest time step is filled with the newest value acquired from the output.
[Figure: recurrent input buffer over time steps t−T+1, …, t: the oldest step of the previous output is discarded and the newest generated step is appended]
Figure 4.2: Buffer shift of the recurrent input
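The buffer shift of Figure 4.2 amounts to discarding the oldest step of the recurrent input and appending the newest generated step; a minimal sketch with hypothetical dimensions:

```python
import numpy as np

def shift_buffer(buffer, network_output):
    # Figure 4.2: drop the oldest time step and append the newest time step
    # of the network's output, keeping the window length T fixed.
    newest = network_output[-1]
    return np.concatenate([buffer[1:], newest[None]], axis=0)

# Toy closed-loop feedback for a retrieved modality (T = 5, 3-dim feature).
T, d = 5, 3
buffer = np.zeros((T, d))
for step in range(10):
    # Stand-in for the reconstructed output of the temporal sequence
    # learning network at this iteration.
    output = np.full((T, d), float(step))
    buffer = shift_buffer(buffer, output)
# The buffer now holds the five most recently generated steps.
```

The same shift, applied only to the masked future steps, realizes the prediction loop of the next subsection.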
4.3.2 Temporal Sequence Prediction
Similarly, temporal sequence prediction is realized by constructing a recurrent loop from the output layer to the input layer. The difference is that, among all T steps of the time window, only the first T_in steps (i.e., the past T_in steps up to the present time step t) of each modality are filled with the input data; the rest (i.e., the future T − T_in steps to be predicted) are filled with the outputs from the previous time step. Hence, the input to the network is defined as
s_t = (a_{t_1}, \hat{a}_{t_2}, u^i_{t_1}, \hat{u}^i_{t_2}, u^s_{t_1}, \hat{u}^s_{t_2}),   (4.8)
{t_1 | t − T_in + 1 ≤ t_1 ≤ t},   (4.9)
{t_2 | t + 1 ≤ t_2 ≤ t + (T − T_in)}.   (4.10)
As shown in Figure 4.3, the prediction segment of the recurrent input is generated by shifting the corresponding previous outputs of the network forward by one time step.
[Figure: recurrent input buffer over time steps t−T_in+1, …, t+T−T_in: the first T_in steps are filled with input data and the remaining predicted steps are filled from the previous outputs]
Figure 4.3: Buffer shift of the recurrent input for temporal sequence prediction
4.4 Summary
In this chapter, we proposed a feature extraction framework using a DNN that enables not only the extraction of compressed features from raw sensory inputs by dimensionality reduction but also the reconstruction of the original information from the acquired features. Moreover, theoretical applications of the proposed framework to the multimodal integration learning of temporal sequences, including the visual, auditory, and motion modalities, are presented. The main targets discussed in this chapter are summarized
in Figure 4.4.
[Figure: overview diagram of the dissertation structure highlighting the targets addressed in Chapter 4]
Figure 4.4: The main targets discussed in Chapter 4
Chapter 5
Applications for Recognition and
Generation of Robot Behaviors
5.1 Introduction
In Chapter 4, we proposed a theoretical framework for multimodal integration learn-
ing and cross-modal memory retrieval using a DNN. In this chapter, our proposed
model is evaluated by conducting experiments using a humanoid robot in the real-
world environment. In practice, cross-modal memory retrieval, temporal sequence
prediction, and noise-robust behavior recognition functions are evaluated by training the proposed model with the sensory-motor information acquired by directly teaching a humanoid robot multiple object manipulation behaviors. Through
the experiments, we investigate the possibility of applying a deep learning frame-
work to the sensory-motor coordination problem on robotic applications, especially
with high-dimensional and large-scale raw sensory temporal sequences.
5.2 Construction of the Proposed Framework
Figure 5.1 depicts a schematic diagram of our proposed framework. Two indepen-
dent deep neural networks are utilized for image compression and temporal se-
quence learning. The image compression network, shown in Figure 5.1(a), inputs
[Figure: (a) image compression network mapping raw images to feature vectors and reconstructing them; (b) temporal sequence learning network taking the multimodal segments (a_{t−T+1}, u^i_{t−T+1}), …, (a_t, u^i_t) and reconstructing them via the multimodal feature layer]
Figure 5.1: Multimodal behavior learning and retrieving mechanism
raw RGB color images acquired from a camera mounted on the head of the robot
and outputs the corresponding feature vectors from the central hidden layer. The
image features are synchronized with the joint angle vectors acquired from both arm
joints, and multimodal temporal segments are generated. The multimodal tempo-
ral segments are then fed into the temporal sequence learning network (i.e., Figure
5.1(b)). Accordingly, multimodal features are acquired from the central hidden layer,
while reconstructed multimodal temporal segments are obtained from the output
layer.
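As an illustration only, the two-network pipeline described above can be sketched as follows; the layer sizes, initialization, and the `Autoencoder` class itself are simplified stand-ins for exposition (the actual encoder dimensions appear in Table 5.1), not the trained model:

```python
import numpy as np

rng = np.random.default_rng(0)

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

class Autoencoder:
    """Toy symmetric autoencoder: logistic layers, linear central code layer."""
    def __init__(self, dims):
        full = dims + dims[-2::-1]          # mirror the encoder for the decoder
        self.W = [rng.normal(0.0, 0.1, (a, b)) for a, b in zip(full[:-1], full[1:])]
        self.code_at = len(dims) - 1        # index of the central layer

    def forward(self, x):
        h, code = np.asarray(x, dtype=float), None
        for i, W in enumerate(self.W):
            h = h @ W
            if i == self.code_at - 1:       # central hidden layer stays linear
                code = h.copy()
            else:
                h = logistic(h)
        return code, h                      # (feature vector, reconstruction)

# (a) image compression network: 900-dim frame -> 30-dim image feature
img_ae = Autoencoder([900, 80, 30])
# (b) temporal sequence network: 30-step window of 10 joints + 30 image features
seq_ae = Autoencoder([1200, 80, 30])

frames = rng.random((30, 900))              # flattened 20x15 RGB frames
joints = rng.random((30, 10))               # synchronized joint angle vectors
img_feats = np.stack([img_ae.forward(f)[0] for f in frames])
segment = np.concatenate([joints, img_feats], axis=1).ravel()   # 1200-dim
mm_feat, recon = seq_ae.forward(segment)    # multimodal feature, reconstruction
```

The multimodal feature is read from the central layer, and the reconstruction from the output layer, as in the text.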
The outputs from the temporal sequence learning network are used for both
robot motion generation and image retrieval. The joint angle outputs from the net-
work are rescaled and resent to the robot as joint angle commands for generating
motion. The network can also reconstruct the retrieved images in the original form
by decompressing the image feature outputs, because the image compression network models the identity map from the inputs to the outputs via feature vectors in the central hidden layer.
5.3 Experimental Setup
Our proposed mechanisms are evaluated by conducting object manipulation exper-
iments with the small humanoid robot NAO, developed by Aldebaran Robotics [87].
The multimodal data, including image frames and joint angles, are recorded syn-
chronously at approximately 10 fps. For the image data input, the original 320 × 240
image is resized to a 20 × 15 matrix of pixels in order to meet the memory resource
availability limitation of our computational environment1. For joint angle data in-
put, 10 degrees of freedom of the arms (from the shoulders to the wrists) are used.
Six different object manipulation behaviors identified by different colorful toys
are prepared for training (Figure 5.2). The details of the object manipulation behav-
iors are as follows:
• (a) Ball lift: holding a yellow ball on the table with both hands, then raising the
ball to shoulder height three times with up-and-down movements
• (b) Ball roll: iteratively rolling a blue ball on top of the table to the right and left
by using alternating arm movements
• (c) and (d) Bell ring L/R: ringing a green bell placed on either the right or left side
of the table by the corresponding arm motion
• (e) Ball roll on a plate: rolling an orange ball placed in a deeply edged plate at-
tached to both hands, and alternately swinging both arms up and down
• (f) Ropeway: swinging a red ball hanging from a string attached to both hands
by alternately moving both arms up and down
We record the multimodal temporal sequence data by generating different arm
1 We utilized a personal computer with an Intel Core i7-3930K processor (3.2 GHz, 6 cores), 32 GB main memory, and a single NVIDIA GeForce GTX 680 graphics processing unit with 4 GB of on-board graphics memory. Because the size of the weight matrices of a multi-layered neural network grows rapidly as the input dimension increases, we felt it sensible to keep the number of input dimensions as small as possible, as long as the dimensionality reduction did not critically degrade the quality of our experiments. As a result of preliminary experimentation, we found that all of our memory retrieval experiments are feasible even with this reduced image resolution.
Figure 5.2: Object manipulation behaviors — (a) ball lift, (b) ball roll, (c) bell ring L, (d) bell ring R, (e) ball roll on a plate, (f) ropeway
motions corresponding to each object manipulation by direct teaching. The result-
ing lengths of the motion sequences are between 100 and 200 steps, which is equiv-
alent to between 10 and 20 s. To balance the total motion sequence lengths between
different behaviors, direct teaching is repeated six to 10 times for each behavior,
such that the number of repetitions becomes inversely proportional to the motion
sequence length. Among all of the repetitions, one result is used as test data and the
others are used as training data. For multimodal temporal sequence learning, we use
a contiguous segment of 30 steps from the original time series as a single input. By
sliding the time window by one step, consecutive data segments are generated.
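A minimal sketch of this sliding-window segmentation (the function name and array shapes are illustrative, not from the dissertation):

```python
import numpy as np

def sliding_segments(seq, window=30, step=1):
    """Crop overlapping fixed-length segments from a (T, D) time series
    by sliding a window of `window` steps one step at a time."""
    T = len(seq)
    return np.stack([seq[t:t + window] for t in range(0, T - window + 1, step)])

# A 100-step sequence of 40-dim multimodal vectors yields 71 segments,
# each of which flattens to the 1200-dim network input.
seq = np.zeros((100, 40))
segs = sliding_segments(seq)
print(segs.shape)            # (71, 30, 40)
```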
Table 5.1 summarizes the datasets and associated experimental parameters. For
both the image feature and temporal sequence learning, the same 12-layer deep neu-
ral network is used. In each case, the decoder architecture is a mirror-image of the
encoder, yielding a symmetric autoencoder. The parameter settings of the network
structures are empirically determined with reference to such previous studies as [41]
and [49]. The input and output dimensions of the two networks are defined as fol-
lows: 900 for image feature learning, which is defined by 20 × 15 matrices of pixels
for the RGB colors; and 1200 for temporal sequence learning, which is defined by the
Table 5.1: Experimental parameters

           TRAIN*   TEST*   I/O*   ENCODER DIMS*
  IFEAT**  8444     948     900    1000-500-250-150-80-30
  TSEQ**   20548    776     1200   1000-500-250-150-80-30

* TRAIN, TEST, I/O, and ENCODER DIMS indicate the size of the training data, the test data, the input and output dimensions, and the encoder network architecture, respectively.
** IFEAT and TSEQ stand for image feature and temporal sequence, respectively.
30-step segment of the 40-dimension multimodal vector composed of 10 joint an-
gles and the 30-dimension image feature vector. For the activation functions, linear
functions are used for the central hidden layers of both networks, and logistic functions are used for the rest of the layers, following [41].
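The dimension bookkeeping in this paragraph can be checked directly:

```python
# Input/output dimensions of the two networks (cf. Table 5.1)
image_dim = 20 * 15 * 3                        # resized RGB frame -> 900
window, joint_dim, img_feat_dim = 30, 10, 30
seq_dim = window * (joint_dim + img_feat_dim)  # 30-step segment -> 1200

# The same encoder architecture is shared by both networks; each decoder mirrors it
encoder_dims = [1000, 500, 250, 150, 80, 30]

assert image_dim == 900
assert seq_dim == 1200
```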
The length of the time window is determined by considering the following two
constraints. First, if the length of the time window increases, the network may con-
sider longer contextual information. Second, if the length of the time window be-
comes too long, the dimension of the multimodal temporal vector becomes too big
to be processed in an acceptable amount of time. The implicit policy is to keep the
input dimensions below 3000 because of our computational limitation. As the mul-
timodal vector dimension is 40, the temporal sequence length should be below 75.
Considering the cyclic frequencies of the joint angle trajectories acquired from the
six object manipulation behaviors, we determine that 30 steps are enough to charac-
terize a phase of the behaviors.
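The bound above follows directly from the stated budget:

```python
budget = 3000                  # soft ceiling on network input dimensions
mm_dim = 10 + 30               # joint angles + image features per step
max_window = budget // mm_dim  # longest admissible temporal segment
print(max_window)              # 75
```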
For multimodal integration learning, we trained the temporal sequence learn-
ing network using additional examples that have only a single modality to explicitly
model the correlations across the modalities [75]. In practice, we added examples
that have noisy values for one of the input modalities (e.g., the image feature) and
original values for the other input modality (e.g., the joint angles) but still require the
network to reconstruct both modalities. Thus, one-third of the training data has only
image features for input, while another one-third of the data has only joint angles
and the last one-third has both image features and joint angles. For the noisy values,
we superimpose Gaussian noise with a standard deviation of 0.1 on the original data.
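A sketch of this training-set construction, with hypothetical array shapes (the three variants and the clean reconstruction target follow the scheme of [75]):

```python
import numpy as np

rng = np.random.default_rng(0)

def multimodal_variants(joints, img_feats, sigma=0.1):
    """Return the three input variants for one windowed example:
    both modalities intact, image-features-only (joints corrupted), and
    joint-angles-only (image features corrupted). The reconstruction
    target is the clean pair in every case."""
    corrupt = lambda x: x + rng.normal(0.0, sigma, x.shape)  # superimposed noise
    clean = np.concatenate([joints, img_feats], axis=1)
    img_only = np.concatenate([corrupt(joints), img_feats], axis=1)
    mtn_only = np.concatenate([joints, corrupt(img_feats)], axis=1)
    return np.stack([clean, img_only, mtn_only]), np.stack([clean] * 3)

inputs, targets = multimodal_variants(np.zeros((30, 10)), np.zeros((30, 30)))
```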
5.4 Results
5.4.1 Cross-modal Memory Retrieval and Temporal Sequence Pre-
diction of Object Manipulation Behaviors
We conducted two experiments to evaluate cross-modal memory retrieval perfor-
mance. One experiment generates the joint angle sequence (motion) by providing
image sequences, whereas the other generates an image sequence by providing the
joint angle sequence. For these experiments, inputs to either modality of the full
30 steps are provided, and the sequence for the other modality is internally gener-
ated in a closed-loop manner (see 4.3.1). In the experiment to evaluate temporal sequence prediction, the input window length is defined as T_in = 25, and the corresponding future five steps are internally generated as predictions (see 4.3.2). For
all of the experimental settings above, although the initial values for the recurrent
inputs are randomly generated, the internal values eventually converge to the cor-
responding states in association with the input values of the other modalities by the
generalization capability of the network.
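The closed-loop retrieval described above can be sketched as follows; `net` stands in for the trained temporal sequence network (any function mapping a 1200-dim window to its reconstruction), and the iteration count is an arbitrary choice for this sketch:

```python
import numpy as np

rng = np.random.default_rng(0)

def cross_modal_retrieve(net, joints_seq, img_dim=30, iters=20):
    """Clamp the joint-angle half of the window to observations; initialize
    the image-feature half randomly and iteratively overwrite it with the
    network's own output until it converges."""
    T, jdim = joints_seq.shape
    window = np.concatenate([joints_seq, rng.random((T, img_dim))], axis=1)
    for _ in range(iters):
        recon = net(window.ravel()).reshape(T, jdim + img_dim)
        window[:, jdim:] = recon[:, jdim:]   # feed retrieved features back
    return window[:, jdim:]                  # the recalled image features
```

The joint-angle-from-image direction is symmetric: clamp the image features and recycle the joint-angle outputs instead.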
Figure 5.3 shows the example results of joint angle sequence generation from the
image sequence input and temporal sequence prediction. We generated full length
trajectories of the object manipulation behavior by accumulating the iteratively re-
trieved joint angle vectors acquired from the 30th (final) step of the temporal window.
In the figure, graphs on the top row (Figure 5.3(a)) are the original motion trajectories
in the test data. Graphs on the second row (Figure 5.3(b)), i.e., the reconstructed tra-
jectories acquired by cross-modal memory retrieval from the image sequence, show
that the appropriate trajectories are generated and the configurations of the trajecto-
ries are clearly differentiated according to the provided image sequences. Graphs on
the bottom row (Figure 5.3(c)), i.e., the reconstructed trajectories acquired by tem-
poral sequence prediction, show that our proposed mechanism correctly predicted
future joint angles five steps ahead of the 25 steps of the multimodal temporal se-
quence. The reconstructed trajectories correspond to the same behaviors shown for
the top row. The low reconstruction qualities of the first 30 steps are attributed to the
random values supplied for the recurrent inputs at the initial iteration of the genera-
tion process.
Figure 5.3: Example of motion reconstructions by our proposed model — scaled joint angles plotted over steps for the six behaviors (ball lift, ball roll, bell ring L, bell ring R, ball roll on a plate, ropeway); rows (a), (b), and (c) show the original, cross-modally retrieved, and predicted trajectories, respectively
Figure 5.4: Example of image reconstructions by our proposed model — (a) original and (b) reconstructed frames for ball lift, ball roll, bell ring L, bell ring R, ball roll on a plate, and ropeway
Figure 5.4 shows example results of image sequence generation from the joint
angle sequence input. The images shown in the figure are single frames drawn from
the series of images for each behavior. In the figure, images on the top row (Fig-
ure 5.4(a)) show the original images decompressed from the image feature vector
in the test data. Images on the bottom row (Figure 5.4(b)) show the correspond-
ing reconstructed images decompressed from the feature vectors acquired by cross-
modal memory retrieval from the joint angle sequence. Although the details of the
images are slightly different, the objects showing up in the images are correctly re-
constructed, and the locations of the color blobs are properly synchronized with the
phases of the motion.
We conducted a quantitative evaluation of cross-modal memory retrieval by
preparing 10 different initial model parameter settings for the networks and repli-
cating the experiment of learning the same dataset composed of the six object ma-
nipulation behaviors. Table 5.2 summarizes these results. In the table, IMG → MTN
indicates image to motion, whereas MTN → IMG indicates motion to image; fur-
ther, the temporal sequence prediction (PRED) performances for the six behavior
patterns are also shown. The numbers given in each entry of the table represent the
root mean square (RMS) errors of the reconstructed trajectories (normalized by scal-
ing between 0 and 1) on the test data. The RMS errors in Table 5.2 demonstrate that
the reconstruction errors are below 10 percent for all of the evaluation conditions.
In detail, each of the RMS errors is calculated as

E_{IMG \to MTN} = \sqrt{ \frac{1}{T_{seq}} \sum_{t=1}^{T_{seq}} \left| a_t - \hat{a}_t \right|^2 },   (5.1)

E_{MTN \to IMG} = \sqrt{ \frac{1}{T_{seq}} \sum_{t=1}^{T_{seq}} \left| r^i_t - \hat{r}^i_t \right|^2 },   (5.2)

E_{PRED} = \sqrt{ \frac{1}{T_{seq}} \sum_{t=1}^{T_{seq}} \left| s_t - \hat{s}_t \right|^2 },   (5.3)

where E_{IMG \to MTN}, E_{MTN \to IMG}, and E_{PRED} are the RMS errors corresponding to the reconstruction modes identified by their subscripts; a_t and \hat{a}_t, r^i_t and \hat{r}^i_t, and s_t and \hat{s}_t are the ground-truth and reconstructed vectors representing the joint angles, the raw image data, and the multimodal features at time t, respectively; and T_{seq} is the length of the test sequence for each of the behaviors.
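Under the definitions of Eqs. (5.1)–(5.3), the error measure can be computed as follows (a sketch; the per-step norm is taken as the Euclidean vector norm):

```python
import numpy as np

def rms_error(truth, recon):
    """sqrt( (1/T_seq) * sum_t |x_t - x_hat_t|^2 ) for (T_seq, D) sequences."""
    diff = np.asarray(truth) - np.asarray(recon)
    return np.sqrt(np.mean(np.sum(diff ** 2, axis=-1)))
```

Applied to the joint-angle, image-feature, or multimodal test sequences, this yields E_IMG→MTN, E_MTN→IMG, and E_PRED, respectively.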
Finally, to analyze the temporal sequence prediction performance in more detail,
we evaluated the prediction errors at the last (30th) step of the time window, depend-
ing on the prediction length, by varying the input window length T_in from 25 to five
in decreasing steps of five. Figure 5.5 shows the temporal sequence prediction errors
of six object manipulation behaviors depending on the prediction length. The mean
and standard deviation are calculated from 10 replicated learning experiments. As
expected, the RMS errors demonstrate that the prediction error increases as predic-
Table 5.2: Reconstruction errors

            IMG → MTN                MTN → IMG                PRED
  LIFT*     7.11×10⁻² (1.44×10⁻³)    1.76×10⁻² (8.99×10⁻⁴)    3.91×10⁻² (6.47×10⁻⁴)
  ROLL*     7.05×10⁻² (1.55×10⁻³)    4.45×10⁻² (1.20×10⁻³)    4.41×10⁻² (7.33×10⁻⁴)
  RING-L*   4.95×10⁻² (2.64×10⁻³)    1.83×10⁻² (4.72×10⁻⁴)    2.21×10⁻² (8.19×10⁻⁴)
  RING-R*   3.64×10⁻² (2.61×10⁻³)    1.79×10⁻² (3.64×10⁻³)    1.98×10⁻² (4.90×10⁻⁴)
  PLT*      8.98×10⁻² (1.35×10⁻³)    1.49×10⁻² (2.96×10⁻³)    3.94×10⁻² (4.34×10⁻⁴)
  RWY*      5.63×10⁻² (9.50×10⁻⁴)    1.89×10⁻² (5.32×10⁻³)    2.75×10⁻² (4.32×10⁻⁴)

* LIFT, ROLL, RING-L, RING-R, PLT, and RWY stand for ball lift, ball roll, bell ring L, bell ring R, ball roll on a plate, and ropeway, respectively.
** Standard deviations in parentheses.
tion length increases. Nevertheless, the reconstruction errors are below 10 percent
in all of the evaluation conditions.
5.4.2 Real-time Adaptive Behavior Selection According to Environ-
mental Changes
As a further experiment, we switched the robot’s behavior according to changes in
the objects displayed to the robot. The approach is a combination of cross-modal
memory retrieval and temporal sequence prediction in the sense that the joint an-
gles five steps ahead, considering control delay, are predicted from the previous 25
steps of the image input sequence. By iteratively sending the predicted joint angles
as the target commands for each joint angle of the robot, the robot generates mo-
tion in accordance with environmental changes. For the initial trial, we tested the
raw image input and confirmed that the robot can properly select behaviors accord-
ing to changes in the displayed object. However, we found that the reliability of our
current image feature vector is easily affected by the environmental lighting condi-
tions2. Therefore, we adopted color region segmentation and used the coordinates of the center of gravity of the color blobs as a substitute for the image feature vector, to stabilize perception under various lighting conditions. As a result, we succeeded
in switching multiple behaviors based on the displayed objects. Figure 5.6 shows
2 We recognize that the instability of the image feature vector under the real environment is due to the limitation on the variation in our image dataset utilized for training the image feature-extraction network.
Figure 5.5: Temporal sequence prediction errors of six object manipulation behaviors (RMS error versus prediction length for ball lift, ball roll, bell ring L, bell ring R, ball roll on a plate, and ropeway); plots are horizontally displaced from the original positions to avoid overlap of the error bars
Figure 5.6: Real-time transition of object manipulation behaviors
photos of the transition from one behavior to the next in the order of Ropeway, Bell
ring R, and Bell ring L.
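A sketch of one cycle of this predictive control loop; `net` again stands in for the trained temporal sequence network, and the settling-iteration count and helper names are illustrative assumptions:

```python
import numpy as np

def control_step(net, history, jdim=10, horizon=5):
    """Predict `horizon` steps ahead from the last 25 observed multimodal
    steps and return the joint-angle command at the final (30th) window
    step, compensating for the control delay."""
    obs = history[-25:]
    win = np.vstack([obs, np.zeros((horizon, history.shape[1]))])
    for _ in range(10):                       # settle the self-generated part
        recon = net(win.ravel()).reshape(win.shape)
        win[-horizon:] = recon[-horizon:]     # feed predicted steps back
    return win[-1, :jdim]                     # command sent to the robot
```

Iterating this step, with each new observation appended to `history`, yields motion that follows environmental changes.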
5.4.3 Multimodal Feature Space Visualization
Figure 5.7 presents the scatter plot of the three-dimensional principal components
of the acquired multimodal features. PC1, PC2, and PC3 axes correspond to princi-
pal components 1, 2, and 3, respectively. The multimodal feature vectors are gener-
ated by recognizing the training data from the temporal sequence learning network
and recording the activations of the central hidden layer. This figure demonstrates
that the feature space is segmented according to different object manipulation behaviors and the feature vectors self-organize into multiple clusters. The structure
of the multimodal feature space suggests that a supervised discrimination learning
of multiple behaviors might be possible by modeling correspondences between the
acquired multimodal features and the behavior categories.
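The projection plotted in Figure 5.7 corresponds to a standard principal component analysis; a minimal sketch with stand-in data, not the actual features:

```python
import numpy as np

def pca_project(features, k=3):
    """Project (N, D) feature vectors onto their first k principal components."""
    X = features - features.mean(axis=0)
    # Right singular vectors of the centered data are the principal directions
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return X @ Vt[:k].T

rng = np.random.default_rng(0)
feats = rng.normal(size=(120, 30))     # stand-in for the multimodal features
pc = pca_project(feats)                # columns: PC1, PC2, PC3
```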
Figure 5.7: Acquired multimodal feature space (scatter of the six behaviors: ball lift, ball roll, bell ring L, bell ring R, ball roll on a plate, ropeway)
5.4.4 Behavior Recognition using Multimodal Features
In this section, we examine how the acquired multimodal feature expression con-
tributes to the robustness of a behavior recognition task. In our learning framework,
raw sensory inputs are converted into sensory features, and the multiple sources of
sensory features are integrated together to generate multimodal features utilizing the
dimensionality compression function of an autoencoder. Making efficient use of the
higher-level features, we can expect the following two effects in the behavior recog-
nition task: (1) a discrimination model can improve its categorization performance
against noisy sensory inputs by exploiting the higher generalization capabilities of
the compressed representations; and (2) the integrated representation of multimodal
inputs helps to inhibit the degradation of categorization performance by comple-
menting a decrease in reliability of sensory input with information from the other
modalities.
To verify our hypotheses, we evaluated the noise robustness of a behavior dis-
crimination mechanism under different training conditions using the joint angle test
sequences corresponding to the six object manipulation behaviors. More specifi-
cally, we compare the variation in behavior recognition rates depending on the dif-
ferences of the standard deviation of Gaussian noise superimposed on the joint an-
gle sequences. To investigate the effects of the higher-level features acquired from
dimensionality compression and multimodal integration, we compare the perfor-
mance of the classifier under the following four different training conditions:
• (1a) MTN (raw): Raw joint angles are used as inputs.
• (1b) MTN (compressed): Joint angle feature vectors are used as inputs. Feature
vectors are generated by compressing the joint angle sequences utilizing an au-
toencoder3.
• (2a) MTN+IMG: Multimodal feature vectors are used as inputs. Feature vectors
are generated by compressing the joint angle sequences and the corresponding
image feature sequences utilizing the temporal sequence learning network. Im-
age feature sequences are generated by compressing the clean image sequences
acquired from the test data.
• (2b) MTN+IMG (imaginary): Multimodal feature vectors are used as inputs. In
this case, the image feature sequences are self-generated inside the network in-
stead of externally generated from the test data.
All of the training conditions, except for case (1a), are statistically evaluated on the
10 replicated learning results (see 5.4.1).
The compressed feature vector sequences are acquired by recording the activa-
tion patterns of the central hidden layer of the temporal sequence network. As one
of the most popular classification algorithms with an excellent generalization ca-
pability, a support vector machine (SVM)—namely, the multi-class SVM using one-
against-all decomposition in the Statistical Pattern Recognition Toolbox for MATLAB
[28]—is used as a classifier. An RBF kernel with default parameters (provided by the
toolbox) is used to address the one-against-all multiclass non-linear separation of
the acquired multimodal features; further, the Sequential Minimal Optimizer (SMO)
is used as the solver for the computational efficiency.
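An equivalent setup can be reproduced with scikit-learn in place of the MATLAB toolbox (a sketch with synthetic stand-in features; the RBF parameters are the library defaults, not necessarily the toolbox's):

```python
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic stand-in for the 30-dim multimodal features of six behaviors
centers = 3.0 * rng.normal(size=(6, 30))
y = np.repeat(np.arange(6), 20)
X = centers[y] + rng.normal(size=(120, 30))

# One-against-all decomposition with an RBF kernel, as in the evaluation
clf = OneVsRestClassifier(SVC(kernel="rbf")).fit(X, y)
```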
3 The structure of the autoencoder used in (1b) is the same as that of the temporal sequence learning network used for the multimodal integration learning in (2a), except that the image feature inputs are excluded.
Figure 5.8: Behavior recognition rates depending on the changes in standard deviation σ of the Gaussian noise superimposed on the joint angle sequences (recognition rate [%] versus σ for (1a) MTN (raw), (1b) MTN (compressed), (2a) MTN+IMG, and (2b) MTN+IMG (imaginary))
Figure 5.8 shows the variations of the behavior recognition rates depending on
the changes in standard deviation of the Gaussian noise superimposed on the joint
angle sequences. The amplitudes of the joint angles are normalized to the range 0
to 1. Mean and standard deviation are calculated from 10 replicated learning experi-
ments. The results demonstrate three remarkable advantages of utilizing higher-level
features for the behavior recognition task. First, comparing results of (1b) with (1a)
shows the superior performance of compressed joint angle features over raw joint
angles with regard to behavior recognition robustness. Second, comparing results of
(2a) with (1b) shows that the multimodal features manifest higher noise robustness
over single modal features by suppressing the negative effects caused by the degra-
dation of the reliability of joint angles; this is achieved by making effective use of the
complementary information from the image features. Third, comparing results of
(2b) with (1b) demonstrates that even when the joint angle modality is provided as
the sole input, the self-generated sequences for the image features still help to pre-
vent degradation in behavior recognition performance. From these results, we con-
firmed our hypotheses that the use of higher-level features acquired by compressing
raw sensory inputs and integrating multimodal sequences contributes to the noise resistance of behavior recognition tasks.
5.5 Discussion
5.5.1 How the Generalization Capability of Deep Neural Networks Contributes to Robot Behavior Learning
In this study, we demonstrated the significant scalability of the deep learning al-
gorithm applied to the time-delayed autoencoder on sensory-motor coordination
learning problems. We presented experimental results on cross-modal memory re-
trieval and the subsequent adaptive behavior generation of a humanoid robot in the
real environment. For example, in the image sequence retrieval experiment of the
object manipulation behaviors learning task in 5.4.1, 900 dimensions of the image
feature vector sequence were recalled from only the 300 dimensions of joint angle se-
quence inputs. This result shows that three times as much information was recalled
by the generalization capability of the autoencoder.
This powerful information complementation capability is one of the advantages
of our proposed time-delay autoencoder. By utilizing the generalization capability
of the preceding half layers of the autoencoder, higher-level features that represent
specific object manipulation behavior can be generated even from partial modal in-
puts. Further, as the autoencoder can reconstruct the original inputs from the feature
vector, the predicted outputs can be recursively fed back to the input nodes, and the
inputs can be used as a substitution for any lacking modality information. This re-
cursive information loop in our proposed framework enabled a high level of stability
in cross-modal memory retrieval performance.
The number of layers and the number of nodes are important factors to explain
the memory capacity and generalization capability of a deep neural network; how-
ever, in general, a clear explanation has not been made for the correlation between
the network structure and its learning capability. Thus, the design principle on the
structure of neural networks has little theoretical foundation at the moment. This
might be an important research topic for future consideration.
5.5.2 Three Factors that Contribute to Robustness in the Behavior Recognition Task
Our experimental results regarding behavior recognition evaluation demonstrated
that the compressed temporal features enable robust recognition performance. By
comparing the recognition rates from the different evaluation conditions, we have
shown that the following three factors contribute to noise robustness in behavior
recognition tasks: (1) utilization of higher-level features; (2) utilization of multimodal
information; and (3) utilization of self-generated sequences in multimodal behavior
recognition. Below, we present our views regarding the functions of the three factors
in relation to the internal mechanisms of our proposed framework.
Utilization of Higher-level Features
In previous work, Le et al. showed that it is possible to train neurons to be selec-
tive for high-level concepts using entirely unlabeled data [56]. As a practical result,
they succeeded in acquiring class-specific neurons such as cat and human body de-
tector neurons by training deep neural networks with unlabeled YouTube datasets.
This result—i.e., that meaningful features can be self-organized even with unla-
beled data—demonstrates the advantage of utilizing an autoencoder as a feature-
extraction mechanism. Comparable results are presented in related works involving
image classification tasks [50] and speech recognition tasks [40]. Considering all of
these previous studies, our behavior recognition results seem to coincide with the
view that deep neural networks produce higher-level features that have a prominent
generalization capability by accumulating many layers of nonlinear feature detectors
to progressively represent a complex statistical structure in the data [58].
Utilization of Multimodal Information
From the viewpoint of the amount of information acquired from the multimodal se-
quence, the multimodal temporal sequence learning network has a clear advantage
in generating a more accurate internal representation than a unimodal temporal se-
quence learning network. This fact is presented in the behavior recognition results
with noisy joint angle inputs and clear image inputs from the training dataset (i.e.,
(2a) MTN+IMG) in 5.4.4. These results demonstrate that even after the joint angle
information becomes uninformative, the degradation of the recognition rate con-
verges to a level that surpasses other results. In this case, the clear image feature
inputs served as a source of information for the higher-level features to correctly
represent the behavior category against the uninformative joint angle inputs. Cur-
rent results on the effects of multimodal learning toward robustness in recognition
tasks can be regarded in the same light as, for example, improvements in multimodal
speech recognition tasks utilizing a combination of sound and image inputs [75].
Utilization of Self-generated Sequences in Multimodal Behavior Recognition
Among our behavior recognition evaluation results presented in 5.4.4, the most no-
table outcome is that a higher recognition performance is realized by the multimodal
memory even with the single modal input for the joint angles (i.e., (2b) MTN+IMG
(imaginary)). This result could be explained as follows. Utilizing a multimodal mem-
ory, a multimodal internal representation is generated even from noisy joint angle
inputs, and successively accompanying image features are retrieved from the output
nodes. As the image feature vector is recalled from the internal representation, the
information becomes even more independent of the disturbance superimposed over
the joint angle observations. By feeding the retrieved image features back to the input nodes, this procedure progressively sharpens the internal representation; the effect is equivalent to explicitly providing the image feature sequence to the network in parallel with the noisy joint angle sequence. In recent neu-
ropsychological studies, the positive effects of self-referential strategies in improv-
ing memory in memory-impaired populations have been reported [34, 33]. In future
work, it would be interesting to further investigate how our current self-generating
imaginary sequence mechanism corresponds to such psychological phenomena in
the human cognitive process.
5.5.3 Difference between our Proposed Time-delay Autoencoder
and the Original Time-delay Neural Network
The temporal sequence learning mechanism proposed in our work inputs a fixed
length of time series acquired by cropping a segment of a temporal sequence within
a time window. This approach inherits the idea from the original work of time-delay
neural networks by Lang et al. [55]. The difference here is that the vectors identical
to the inputs define the target outputs of our proposed model, whereas the symbol
labels define the outputs of the original model. Consequently, one of the charac-
teristics of our proposed model is that the compressed representation of temporal
sequences is self-organized by the autoencoder, and the network can self-generate
temporal sequences by recursively feeding back outputs to input nodes. The advan-
tages of the internal sequence generation were shown by the adaptive behavior se-
lection capability utilizing cross-modal memory retrieval and the robust behavior
recognition capability with unreliable joint angle observations.
5.5.4 Characteristics of the Internal Representation of the Tempo-
ral Sequence Learning Network
The temporal sequence learning network effectively models the dynamics of long temporal sequences by cumulatively memorizing multiple phase-wise temporal segments. Thus, a feature vector generated from a one-shot input repre-
sents a temporal phase of a sequence. This phenomenon can be confirmed from
plots of the feature vectors of the bell-ringing task by observing where they formed
closed loop shapes in Figure 5.7. The same phenomenon can be confirmed from the
second task in that the reciprocal transition of the feature vector plots on the two
distinct lines corresponds to each of the right and left arm motion patterns shown in
Figure 6.6.
5.5.5 Length of Contextual Information that a Time-delay Autoen-
coder Handles
The length of the input temporal segment defines the length of the contextual in-
formation handled by the temporal sequence learning network. Hence, in principle,
context information longer than the temporal segment is not considered. In compar-
ison with the other temporal sequence learning mechanisms, such as recurrent neu-
ral networks [66], this is a fundamental difference. Our proposed framework worked
successfully in our experiments despite this limitation of contextual representation
because the execution of robot behaviors in our task settings did not require com-
prehending long contextual situations. For example, for the object manipulation and
bell-ringing behaviors, most of the contextual information is embedded in the envi-
ronment (e.g., the robot’s arm posture, position of the balls, etc.). Thus, an internal
neuronal representation of the context was not required for executing the tasks.
5.5.6 Scalability of our Proposed Multimodal Integration Learning
Mechanism
One of the targets of the current study was to achieve “large-scale learning” of ob-
ject manipulation behaviors by a humanoid robot. We can view the issue of large-
scale learning from the following three perspectives: (1) variations in the behavior
patterns, (2) input and output dimensionality, and (3) number of the training data
samples.
From the first perspective, the set of behavior variations prepared for training our proposed mechanism is not large enough. If the target is just to memorize the motion
trajectory and replay such memorized patterns, there might be a more efficient way
such as creating a motion pattern database using exact joint angle representations.
However, in the current study, we do not value the number and precision of the re-
trieved motion patterns but emphasize the ability to self-organize the synchrony of
sensory-motor relationships. To model the mutual relationship among concurrent
multi-dimensional temporal sequences, it is important to utilize machine learning
mechanisms that can handle distributed representations such as neural networks.
In terms of both synchrony modeling with neural networks and the variety of
dynamics handled by a single neural network model, we consider that the current
achievements reach a new level.
We think the same holds for the second and third perspectives.
With regard to robot behavior learning using a neural network model, conventional
approaches could handle only dozens of input and output dimensions and
hundreds of training samples. In contrast, we have achieved more than ten times
the scalability of the previous studies. For example, the direct memorization and
retrieval of raw image sequences corresponding to multiple motion patterns have
never been achieved with a neural network model.
5.6 Summary
In this chapter, our proposed multimodal integration learning framework is evalu-
ated by modeling multiple behavior patterns represented by multi-dimensional vi-
suomotor temporal sequences. The main targets discussed in this chapter are sum-
marized in Figure 5.9.
We presented two applications of the acquired sensory-motor integration model.
First, cross-modal memory retrieval was realized. Utilizing the generalization ca-
pability of the deep autoencoder, our proposed framework succeeded in retrieving
temporal sequences bidirectionally between the image and motion modalities. Second, ro-
bust behavior recognition was realized by utilizing the acquired multimodal features
as inputs to supervised behavior classification learning.
Through the evaluation experiment, a time-delay deep neural network is applied
for modeling multiple behavior patterns represented by multi-dimensional visuo-
motor temporal sequences. Owing to the efficient training performance of Hessian-
free optimization, the proposed mechanism successfully models six different object
manipulation behaviors in a single network. The generalization capability of the
learning mechanism enables the acquired model to perform the functions of cross-
modal memory retrieval and temporal sequence prediction. The experimental re-
sults show that the motion patterns for object manipulation behaviors are success-
fully generated from the corresponding image sequence, and vice versa. Moreover,
the temporal sequence prediction enables the robot to interactively switch multiple
behaviors in accordance with changes in the displayed objects. The analysis of the
self-organized feature space revealed that the multimodal features can be utilized as
abstracted information for recognizing robot behaviors.
Results from the real-time transition of object manipulation behaviors in a real-
world environment also revealed that our current approach for utilizing raw image
data is still not stable enough for handling drastic changes in lighting conditions. Fu-
ture work includes improving the robustness of the image recognition capabilities
by drawing out the potential of the generalization capabilities of deep networks via
the introduction of convolutional networks trained with more diverse datasets. An-
other important challenge is dynamically combining multiple sensory modalities by
taking into account the relative reliability of different sensory sources. If reliability-
dependent integration is attained in our framework, higher-level features might be
acquired by intentionally suppressing the effects that degraded modalities have on the
internal representation; this might result in more robust behavior recognition per-
formance.
Figure 5.9: The main targets discussed in Chapter 5
Chapter 6
Analysis on Intersensory Synchrony Model
6.1 Introduction
In Chapter 5, we demonstrated that our proposed framework succeeds in cross-
modal memory retrieval and stable behavior recognition utilizing the self-organized
multimodal fused representations. In this chapter, we conduct further analysis on
how our proposed framework extracts the intersensory synchrony from the sensory-
motor experience in the environment and predicts the sensory outcomes utilizing
the acquired synchrony model. To analyze the acquired synchrony model at a more
general level, we extend the experimental setting by incorporating sound signals as
another input modality. As a practical experiment, we prepared a bell-ringing task
using a humanoid robot. Through the experiment, we conduct a quantitative evalua-
tion to demonstrate that our proposed framework can model synchronicity between
the color, pitch, and position of the bell and the corresponding bell-ringing motion.
6.2 Construction of the Proposed Framework
Figure 6.1 shows a schematic diagram of our proposed framework. Three indepen-
dent deep neural networks (i.e., autoencoders) are utilized for sound compression,
Figure 6.1: Multimodal behavior learning and retrieval mechanism
image compression, and temporal sequence learning. Compared with the previous
experimental setup shown in Figure 5.1, this experimental setup incorporates an-
other deep neural network (Figure 6.1(a)) for sound feature extraction. The sound
data acquired from a microphone mounted on the head of the robot is preprocessed
by discrete Fourier transform (DFT). The sound compression network (Figure 6.1(a))
inputs the acquired sound spectrums and outputs the corresponding feature vec-
tors from the central hidden layer. Similarly, the image compression network (Fig-
ure 6.1(b)) inputs raw RGB bitmap images acquired from a camera mounted on the
head of the robot and outputs the corresponding feature vectors. The sound and im-
age features are synchronized with the joint angle vectors, and multimodal temporal
segments are generated. These multimodal temporal segments are then fed into the
temporal sequence learning network (Figure 6.1(c)). Accordingly, multimodal fea-
tures and reconstructed multimodal segments are output from the central hidden
layer and the output layer of the network, respectively.
The outputs from the temporal sequence learning network can be used for robot
motion generation, sound spectrum retrieval, or image retrieval. The joint angles
output from the network are rescaled and resent to the robot as joint angle com-
mands for generating motion. The networks can also reconstruct the retrieved sound
spectrum or images in the original form by decompressing the corresponding fea-
ture outputs because the sound compression network and the image compression
network model the identity map from the inputs to the outputs via feature vectors in
the central hidden layer.
6.3 Experimental Setup
The cross-modal memory retrieval performance of our proposed mechanisms is
evaluated by conducting bell-ringing tasks with the same robot used in our first ex-
periment. The bell-ringing task is set up as follows: three different desktop bells,
which can be identified by either the surface color or the sound pitch, are prepared
for the experiment. Correspondences between the colors and the pitch notations are
shown in Figure 6.2(a). For each bell-ringing trial, two bells are selected and placed
in front of the robot side by side. Then, either one of the two bells is rung by hitting a
Figure 6.2: Bell placement configurations of the bell-ringing task
button on top of the bell. Due to the limited reach of the hands, each bell can be
rung only with the arm on the corresponding side. As shown in Figure 6.2(b), there
are six possible bell placement combinations. Note that under the task configura-
tion, information from at least two different modalities is required to correctly identify
the bell-ringing situation. In practice, the robot cannot (1) determine which bell
is going to be rung only from the initial image, (2) determine the placement of the
ringing bell only from the sound, and (3) predict what sound will come out only from
the arm motion.
We record twelve different multimodal temporal sequence datasets by generating
the right and left bell-striking motions under the six different bell placement con-
figurations. Arm joint angle sequences corresponding to the bell-striking motions
are generated by the angular interpolation of the initial and target postures. Pulse-
code modulation (PCM) sound data is recorded with a 16 kHz sampling rate, a 16-
bit depth, and a single channel with a microphone mounted on the forehead of the
robot¹. The image frames and the joint angles of both arms are recorded at approximately 66 Hz, which includes replicated image frames.

¹ Because of the physical structure of the robot, the microphone is located close to both of the arms, which are utilized to hit the bells. Therefore, the actuation sounds of the geared reducers equipped to the arm joints are inevitably recorded in addition to the bell sounds. To avoid the degradation of memory retrieval performance arising from the actuation sounds, we introduced a brief pause in the bell-hitting motion when the hand contacted the button on top of the bell.

Table 6.1: Experimental parameters

          TRAIN*   I/O*   ENCODER DIMS*
SFEAT**   5352     968    1000-500-250-150-80-30
IFEAT**   2688     3000   1000-500-250-150-80-30
TSEQ**    8736     2100   1000-500-250-150-100

* TRAIN, I/O, and ENCODER DIMS indicate the size of the training data, the input and output dimensions, and the encoder network architecture, respectively.
** SFEAT, IFEAT, and TSEQ stand for sound feature, image feature, and temporal sequence, respectively.

To synchronize the sound data with the image and joint angle data, the sound data is preprocessed by a DFT
with a 242-sample Hamming window and a 242-sample window shift with no over-
lap. A partial region of 320×200 pixels is cropped from the original 320×240 image
and resized to 40×25 pixels to meet the memory resource availability limitation on
our computational environment. For the joint angle data input, 10 degrees of free-
dom of the arms (from the shoulders to the wrists) are used. The resulting lengths
of the motion sequence were approximately 200 steps each, which is equivalent to
about 3 s each. For multimodal temporal sequence learning, we used contiguous
segments of 30 steps from the original time series as a single input. By sliding the
time window by one step, consecutive data segments are generated.
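The segment generation described above can be sketched as follows; the function name and toy data are illustrative assumptions, not the thesis code.

```python
import numpy as np

def sliding_segments(seq, window=30, shift=1):
    """Slice a (steps, dims) sequence into overlapping temporal segments.

    With shift=1, a 200-step sequence yields 200 - window + 1 = 171
    segments, each flattened to window * dims values.
    """
    n_steps, dims = seq.shape
    segs = [seq[s : s + window].reshape(-1)
            for s in range(0, n_steps - window + 1, shift)]
    return np.stack(segs)

# Toy stand-in: sound (30) + image (30) + joint angles (10) = 70 dims per step.
seq = np.random.rand(200, 70)
segments = sliding_segments(seq)
print(segments.shape)               # (171, 2100)
```

Each 2100-dimensional row matches the input size of the temporal sequence learning network in Table 6.1.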
Table 6.1 summarizes the datasets and associated experimental parameters. For
both the sound feature and the image feature learning, the same 12-layered deep
neural networks are used. For temporal sequence learning, a 10-layer network is
used. In each case, the decoder architecture mirrors the encoder, yielding a symmetric
autoencoder. The parameters for the network structures are empirically determined
with reference to previous studies such as [41] and [49]. The input and output
dimensions of the three networks are defined as follows: 968 for sound feature learning,
obtained by binding four consecutive steps of the 242-dimensional sound spectrum into
a single vector; 3000 for image feature learning, defined by the 40×25 pixel matrices
for RGB colors; and 2100 for temporal sequence learning, defined by a 30-step segment
of the 70-dimensional multimodal vector composed of a 30-dimensional sound feature
vector, a 30-dimensional image feature vector, and 10 joint angles. For the central
hidden layer of the temporal sequence learning network, we compared several node
counts, i.e., 30, 50, 70, and 100. By evaluating the performance of image retrieval
from the sound and joint angle inputs, we concluded that 100 nodes are needed to
achieve the desired memory reconstruction precision. For the activation functions,
linear functions are used for the central hidden layers, and logistic functions are used
for the rest of the layers, following [41].
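As a rough illustration of the symmetric autoencoder structure described above, the sketch below builds a randomly initialized network and runs a forward pass only; the helper names and initialization scale are assumptions, and training (Hessian-free optimization) is omitted.

```python
import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

def build_autoencoder(encoder_dims):
    """Random weights for a symmetric autoencoder: the decoder mirrors
    the encoder layer sizes listed from input to central hidden layer."""
    dims = encoder_dims + encoder_dims[-2::-1]      # mirror for the decoder
    rng = np.random.default_rng(0)
    return [(rng.normal(0.0, 0.01, (m, n)), np.zeros(n))
            for m, n in zip(dims[:-1], dims[1:])]

def forward(layers, x, central):
    """Forward pass; `central` indexes the linear central hidden layer,
    whose activation is returned as the feature vector."""
    feat = None
    for i, (W, b) in enumerate(layers):
        x = x @ W + b
        if i == central:
            feat = x            # linear activation at the central layer
        else:
            x = logistic(x)     # logistic activation elsewhere
    return feat, x

# Temporal sequence learning network: 2100 inputs, 100-dim central layer.
layers = build_autoencoder([2100, 1000, 500, 250, 150, 100])
feat, recon = forward(layers, np.random.rand(2100), central=4)
print(feat.shape, recon.shape)  # (100,) (2100,)
```

Mirroring the five encoder sizes produces the 10-layer temporal sequence network of Table 6.1; the six-size encoders of the sound and image networks yield their 12-layer counterparts in the same way.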
6.4 Results
6.4.1 Image Sequence Retrieval from Sound and Motion Sequences
We conducted an evaluation experiment of the cross-modal memory retrieval per-
formance by generating image sequences from the sound and joint angle input se-
quences. Note that in the following results, the number of sequence steps indicates
the generation step rather than the recorded data step. More specifically, data from
29 steps before the beginning of the generation step are used for acquiring the initial
step of the generated sequence.
Figure 6.3 shows an example of image generation results from the sound and joint
angle inputs. In Figure 6.3, the top row and the second row show the original and
retrieved images, respectively. The third and the bottom row show the sound spec-
trums and the 30 previous steps of joint angles sequences used as inputs to the tem-
poral sequence learning network to retrieve the corresponding images. Black dashed
squares in the images at step 1 indicate the bell image regions used for image retrieval
performance evaluation.
At step 1, the bells in the retrieved image are arbitrarily colored, because the color
of the placed bell cannot be derived before any sound input is acquired. By contrast, the
image of the robot's right hand is already included in the retrieved image, because the
joint angle input data indicate that the right arm is going to be used for striking the
bell. At steps 31 and 61, the bell is rung, and the corresponding sound spectrum is
acquired. Then, the task configuration becomes evident, and the information that the
rung bell on the right side has the pitch ‘F’ is correlated with the color green. Thus,
the color of the right bell in the retrieved image changes from a randomly initialized
one to green by associating the sound and joint angle information. Conversely, the
color of the left bell in the retrieved image is not stable during the run because no
information is acquired from the sound input for identifying which bell is placed on
the left side. Nevertheless, the retrieved image shows that once the color of the rung
bell (i.e., green) is identified, the color of the other bell is selected from the remaining
two colors (i.e., red or blue). This result reflects the current task design, in which the
colors of the two bells are always different. From around step 91, the sound of the bell
starts to decay, and the actuation noise of the manipulator caused by the posture
initialization becomes dominant. Thus, the colors of the bells again become arbitrary.

Figure 6.3: Example of image retrieval results from the sound and joint angle inputs (rows: original image, retrieved image, sound spectrum, and scaled joint angles; columns: steps 1, 31, 61, 91, 121, and 151)
6.4.2 Quantitative Evaluation of Image Retrieval Performance
We conducted an evaluation experiment to quantitatively examine whether our pro-
posed model succeeded in modeling the synchrony between the image, sound, and
motion modalities. We prepared 10 different initial model parameter settings for the
networks and replicated the experiment of learning the same dataset composed of
the 12 combinations of the bell placements and bell-striking motion patterns. As a
result of cross-modal image retrieval for the 10 learning results, 120 patterns of the
image sequences were acquired. Image retrieval performance is quantified by the
root mean square (RMS) errors of the manually selected left and right bell regions in
the retrieved image (which are 13×13 pixels each, as indicated in Figure 6.3) against
the corresponding regions of the original image.
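The RMS error measure can be sketched as below; the region coordinates and image values are hypothetical, since the thesis states only that the 13×13 bell regions were selected manually.

```python
import numpy as np

def region_rms_error(original, retrieved, top_left, size=13):
    """RMS error between corresponding size x size bell regions of the
    retrieved and original (H, W, 3) images with values in [0, 1]."""
    r, c = top_left
    diff = (retrieved[r:r + size, c:c + size]
            - original[r:r + size, c:c + size])
    return float(np.sqrt(np.mean(diff ** 2)))

orig = np.zeros((25, 40, 3))                 # toy 40x25 RGB images
retr = np.full((25, 40, 3), 0.2)
print(region_rms_error(orig, retr, top_left=(6, 5)))   # ~0.2 for a constant offset
```

Restricting the error to the bell regions isolates retrieval quality for the task-relevant parts of the image from the static background.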
Figure 6.4 shows the time variation of the image retrieval error displayed in asso-
ciation with the maximum value of the sound power spectrum and the joint angles
sequence. In Figure 6.4, the graphs on the top row show the mean of the image re-
trieval errors from the replicated learning results (each line is acquired from 30 re-
sult sequences). The graphs on the second row show the mean of maximum sound
power spectrums. The graphs on the bottom row show the joint angles command
sequences used for generating the bell-striking motion. Black dashed lines indicate
the time step used for evaluating the significance of the image retrieval error
difference. The evaluation results demonstrate that the image retrieval error of the left bell
becomes smaller than that of the right bell when the left bell is rung, and vice versa.
The time variation of the error trajectory shows that the retrieval error decreases after
the sound of the bell is acquired.

Figure 6.4: Bell image retrieval errors (rows: image retrieval error for the left and right bell regions, sound amplitude, and scaled joint angle; columns: left bell ring and right bell ring)
The shape of the error trajectory is not symmetric between the two graphs when
the left bell or the right bell is rung. When the left bell is rung, the image retrieval error
for the left bell maintains its value even after arm posture initialization. Conversely,
when the right bell is rung, the image retrieval error for the right bell increases after
arm posture initialization. These differ primarily because of the asymmetry of the
arm actuator noise. Owing to the difference in the mechanics of the left and right
actuators, which is beyond our control, the right arm produces more sound than the
left arm. Hence, when the right arm posture is initialized after striking the bell, the
accompanying actuator noise disturbs the internal state of the network (i.e., the data
buffered in the recurrent loop), and the retrieved image is altered.

Figure 6.5: Bell image retrieval errors at step 60 (left bell ring: p=6.1e−06; right bell ring: p=3.1e−07)
6.4.3 The Correlation between Generated Motion and Retrieved Bell Images
To evaluate the significance of the difference between image retrieval performance
of the left and right regions in the same image, we conduct a t-test for the image re-
trieval errors at step 60 of the sequences. At that time step, the arm is brought down
and the hand stably contacts the button on top of the bell. Therefore, there is no in-
fluence of actuation noise on image retrieval. In Figure 6.5, red circles and blue bars
denote the mean and standard deviation of the errors from 10 replicated learning ex-
periments, respectively. A p value less than 0.01 is considered statistically significant
(**: p < 0.01). The evaluation results show that the differences of the image retrieval
errors between the two regions are statistically significant in both the right and left
bell-ringing cases. Results further show that the spatial correlation between the bell
region in the image and the physical motion is correctly modeled, as are the asso-
ciations between the colors and sounds of the bells. Thus, the acquired synchrony
model between the image, sound, and motion modalities is utilized for image re-
trieval.
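The significance test can be sketched as follows. The thesis reports only that a t-test was applied, with the p-values shown in Figure 6.5; the equal-variance two-sample form and the error values below are assumptions for illustration.

```python
import numpy as np

def two_sample_t(a, b):
    """Student's two-sample t statistic (equal-variance form), comparing
    the left- and right-region RMS errors across replicated runs."""
    na, nb = len(a), len(b)
    va, vb = np.var(a, ddof=1), np.var(b, ddof=1)
    sp2 = ((na - 1) * va + (nb - 1) * vb) / (na + nb - 2)   # pooled variance
    return (np.mean(a) - np.mean(b)) / np.sqrt(sp2 * (1 / na + 1 / nb))

# Hypothetical RMS errors at step 60 from 10 replicated runs (left bell rung):
left_region = np.array([0.08, 0.09, 0.07, 0.10, 0.08, 0.09, 0.08, 0.07, 0.09, 0.08])
right_region = np.array([0.30, 0.28, 0.33, 0.31, 0.29, 0.32, 0.30, 0.31, 0.29, 0.32])
t = two_sample_t(left_region, right_region)
print(t)   # strongly negative: the left-region error is clearly smaller
```

The corresponding p-value follows from the t distribution with n_a + n_b − 2 degrees of freedom; a ready-made alternative is a standard statistics library's two-sample t-test routine.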
Figure 6.6: Multimodal feature space and the correspondence between the coordinates and modal-dependent characteristics
6.4.4 Visualization of Multimodal Feature Space
Finally, we conducted an analysis of the multimodal feature space acquired by the
temporal sequence learning network. Among the 10 replicated learning results, we
took a single result and recorded the activation patterns of the central hidden layer
of the network when the 12 patterns of bell-ringing sequences were input. We
applied principal component analysis (PCA) to project the resulting 100-dimensional
feature vector sequences onto a three-dimensional space defined by the acquired
principal components (Figure 6.6).
The abbreviations in the legend box indicate the color combinations of the placed
bells, followed by the position (R or L) of the rung bell. The graph on the left side
(Figure 6.6(a)) demonstrates that the robot’s motion pattern is represented in a two-
dimensional space composed of the first and second principal components, whereas
the graph on the right side (Figure 6.6(b)) shows that the bell placement configura-
tions are structured along the coordinate defined by the third principal component.
Results of this analysis demonstrate that the synchrony between the multiple modal-
ities is self-organized in the temporal sequence learning network.
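The projection used for this visualization can be sketched with a standard SVD-based PCA; the random activations below are an illustrative stand-in for the recorded hidden-layer patterns.

```python
import numpy as np

def pca_project(features, n_components=3):
    """Project feature vectors onto their first principal components
    via an SVD of the mean-centered data matrix."""
    X = features - features.mean(axis=0)
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    return X @ Vt[:n_components].T          # (samples, n_components)

# Stand-in for the central-hidden-layer activations of the 12 sequences.
acts = np.random.rand(12 * 200, 100)        # ~200 steps each, 100 dims
proj = pca_project(acts)
print(proj.shape)                           # (2400, 3)
```

Plotting the three projected coordinates, colored by sequence, reproduces the kind of structure shown in Figure 6.6: motion patterns spread over the first two components and bell placements along the third.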
6.5 Discussion
The experimental results demonstrated that our proposed framework is able to
extract implicit synchronicity among multiple modalities by integrating multimodal
information. Further, the retrieved images in the bell-ringing task demonstrated that
our proposed framework not only deterministically retrieved a bell image reflecting
the acquired synchrony, but also generated plausible alternative information for the
other bell, whose situation is not identifiable, by selecting a candidate among the
multiple possibilities. Thus, we believe that our proposed mechanism can be
utilized as a prediction mechanism for robots to infer the successive consequences
of sensory-motor situations.
In cognitive science studies, a sense of agency is known to be a product of the gen-
eral determination of synchrony between action and effect, and experimental results
suggest that the sense of agency arises when there is temporal contiguity and content
consistency between signals related to action and those related to the putative ef-
fect [83, 23, 25]. Further, a recent study has reported the importance of action-effect
grouping on the production of a sense of agency [48]. In all of these studies, the eval-
uation of spatiotemporal congruity between predicted and actual sensory feedback
is considered to play an important role in the sense of agency. From our current re-
sults, we consider that our cross-modal synchrony modeling and subsequent mem-
ory retrieval capabilities can be utilized as a practical computational framework for
sensory feedback prediction. Hence, we believe that our presented framework can
be utilized in future work to promote a deeper understanding of the sense of agency.
6.6 Summary
In this chapter, we conducted quantitative analysis to show that our proposed multi-
modal integration learning framework correctly models synchronicity between mul-
tiple modalities. The main targets discussed in this chapter are summarized in Figure
6.7.
To analyze the acquired synchrony model at a more general level, we extended the
experimental setting by incorporating sound signals as another input modality. As
a practical experiment, we designed a bell-ringing task for a humanoid robot and
conducted integration learning of the image, sound, and joint angle sequences. The
acquired model was evaluated by retrieving images from the sound and joint angle
sequences. The evaluation results demonstrate that the color of the bell on the side
of the corresponding arm motion correctly changes in association with the input sound.
The analyses of the acquired model show that the proposed framework succeeded
in acquiring the synchrony model over the multiple modalities.
As for the bell-ringing task, we evaluated the image retrieval performance from
sound and motion with only two bell positions. Future work includes modeling a
generalized representation of bell positions by training our system with bell-ringing
behaviors using more variations of bell positions.
Figure 6.7: The main targets discussed in Chapter 6
Chapter 7
Conclusion
7.1 Overall Summary of the Current Research
This dissertation proposed multiple machine learning frameworks for the mutual
understanding of intersensory synchrony of multimodal information in robot sys-
tems. In practice, (1) robust recognition of poorly reproducible real-world informa-
tion and (2) adaptive behavior selection of robots depending on dynamic environ-
mental changes were achieved by utilizing deep learning architectures.
The first requirement was addressed through two approaches: (1) extraction of
highly generalized sensory features and (2) fusional utilization of multimodal
information. The second requirement was addressed by (3) mutually predicting
and retrieving sensory-motor information among multiple modalities. These
three approaches were enabled by the strong performance of DNNs in abstracting
and integrating large amounts of high-dimensional raw real-world sensory-motor
information.
The performances of the proposed multimodal integration learning frameworks
were evaluated by conducting an AVSR task and two robot behavior learning tasks
utilizing a humanoid robot.
The AVSR task was conducted to evaluate the performance of two DNN
architectures, a fully connected DNN and a CNN, in extracting noise-robust
sensory features for the audio and visual information of speech signals, respectively.
Our experimental results demonstrated that a fully connected DNN can serve as a
noise reduction filter that contributes towards recognizing speech under noisy en-
vironments even with audio information only. In addition, we demonstrated that
the CNN can recognize visual appearances of mouth region shapes and predict cor-
responding phoneme labels. Moreover, our AVSR experiments demonstrated that an
MSHMM can achieve noise-robust multimodal speech recognition by complementarily
utilizing the audio and visual information. We suppose that, even though the
experiment handles two seemingly “sensory” signals, the AVSR recognition results
implicitly show the importance of “sensory-motor” integration for robust recognition.
We believe this because the visual input of the mouth region image can indirectly
transmit information about the motor commands corresponding to mouth movements.
However, the current implementation of our AVSR model relies on a rather simple
approach: a linear weighted sum of the observation probabilities corresponding to the
audio and visual features. Therefore, in terms of the fusional utilization of multimodal
information, the current implementation still leaves room for further development.
For example, multimodal temporal sequence learning using a DNN or an RNN, as
accomplished in the robot behavior learning experiments, might be a promising
approach to study.
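The linear weighted-sum fusion mentioned above can be sketched as follows; the probability values and weights are illustrative, and in an actual MSHMM decoder the weighting is applied to each state's observation probabilities at every frame.

```python
def fused_observation_prob(p_audio, p_visual, audio_weight):
    """Linearly weighted sum of the per-stream observation probabilities:
    w * p_audio + (1 - w) * p_visual, the simple fusion scheme above."""
    return audio_weight * p_audio + (1.0 - audio_weight) * p_visual

# Shift weight toward the visual stream as the audio becomes noisier
# (illustrative values only).
clean = fused_observation_prob(0.8, 0.3, audio_weight=0.9)   # trust audio
noisy = fused_observation_prob(0.8, 0.3, audio_weight=0.4)   # trust lips
print(clean, noisy)   # ~0.75 and ~0.5
```

The stream weight is the single tunable parameter of this scheme, which is precisely why a learned, reliability-dependent integration is suggested as a more flexible alternative.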
The robot behavior learning tasks were conducted to evaluate the multimodal in-
tegration learning performance and the cross-modal memory retrieval performance
of a fully connected DNN. One of the innovations in this research is that a DNN ar-
chitecture is applied for mutual integration learning of dynamic sensory-motor in-
formation. Experimental results also showed that the consistent learning framework
can be applied independently of the modality of the information source. The scalable
learning capability of the DNNs enabled the extraction of a compressed representation
of the raw sensory-motor information, and the fused features attained noise-robust
behavior recognition. Moreover, the generalization capability of the DNN enabled
retrieval of the raw sensory-motor signals from one modality to another in a mutual
manner according to the synchrony model acquired through the integration learning
process. The cross-modal memory retrieval and temporal sequence prediction func-
tions of the proposed framework enabled adaptive switching of object manipulation
behaviors of a humanoid robot depending on the displayed object in real-time. In
addition, the bell-ringing behavior learning experiment demonstrated that the
proposed multimodal integration framework self-organizes a structured feature space
that models the synchrony among multiple modalities.
This dissertation presented the effectiveness of multimodal integration learning
not only for the conventional pattern recognition problems but also for dynamic
sensory-motor coordination learning and the consequent behavior generation prob-
lems of a robot. The achievements in the current dissertation are expected to shed
light on a novel design concept of future robot systems. For example, we are confi-
dent of the novelty of our approach regarding directly modeling raw sensory-motor
signals of robot systems with DNNs. Meanwhile, the current strategy of modeling
temporal sequences with a time-delay style DNN is just an illustrative example. The
application of recurrent neural networks may open up a new horizon for acquir-
ing long-term context-dependent robot behaviors with greater scalability. We expect
to advance our current research strategies by applying our proposed frameworks
to practical applications as well as by investigating novel learning frameworks that
incorporate emerging machine learning techniques for robot systems.
7.2 Significance of the Current Study as a Work in Inter-
media Art and Science
The ability to communicate, that is, to express one's mental state and to exchange
information with other individuals, is one of the intellectual foundations that characterize
higher-order animals. Expression involves two processes: (1) the
structuring of one's experience, i.e., reflecting one's experience in an internal repre-
sentation by abstracting the acquired raw sensory-motor perception, and (2) the ex-
pression of one's mental state, i.e., transmitting information to others by creating signals
and symbols that convey messages generated from the internal representation. The
process of expression can explain most creative activities, such as painting,
filmmaking, musical composition, and writing novels. Therefore, we believe that
the process of expression is one of the main topics to be pursued in intermedia
art and science. Although discussions of expression tend to focus on the second
process, the first process is equally important because it
is responsible for generating the abstracted representations, the origin of one's mental
state, by comprehending one's experience.
In terms of the discussion above, we can state that the main contribution of
the current study, as intermedia art and science research, lies in the first
process: the structuring of internal representations by abstracting a robot's experi-
ences. Recent successes of deep learning in image recognition and speech recogni-
tion studies have highlighted the importance of the self-organization of sensory features,
which is equivalent to internal representation in the current context, within the machine
learning community. However, the related idea of applying the self-organization of inter-
nal representations to the realization of robot intelligence has not yet received extensive
attention. The current study showed that our deep learning approach can
self-organize an internal representation from a robot's experience, not only by abstracting
modality-dependent representations but also by mutually integrating the
sensory-motor features acquired from multiple modalities. Thus, we showed how raw real-
world information acquired from the multiple sensors equipped on the robot and
the self-motion commands are integrated and abstracted to structure a compact in-
ternal representation. Moreover, we demonstrated that the robot could take advan-
tage of the acquired multimodal representation for behavior generation by retrieving
associated information across multiple modalities.
The representation intended for information propagation is not discussed in
the current study, because communication among multiple robots and human–robot
communication are outside the current research scope. However, we can natu-
rally view several research topics, such as the social development of robot intelligence or
human–robot communication, as continuations of the current research motivation.
We believe that robotics research from the perspective of intermedia art and science
is of great value, because it lets us reflect more deeply on fundamental questions
such as what the essence of creativity is, whether a machine can become a creative
entity, and what the essential difference is between human intelligence and machine
intelligence.
Appendix A
Hessian-Free Optimization
The Hessian-free algorithm originates from Newton's method, a well-known numer-
ical optimization technique. A canonical second-order optimization scheme, such
as Newton's method, iteratively updates the parameter θ ∈ R^N of an objective function f
by computing a search direction p and updating θ as θ_{k+1} = θ_k + αp with learning rate α.
The core idea of Newton's method is to locally approximate f at the current iterate θ_k
by a model function m_k, up to the second order, with the following quadratic equation:

m_k(θ_k + p) = f_k + ∇f_k^T p + (1/2) p^T B_k p,    (A.1)

where f_k and ∇f_k are the function and gradient values at θ_k, respectively. The ma-
trix B_k is either the Hessian matrix H_k = ∇²f_k or some approximation of it. In
the standard Newton's method, m_k is optimized by computing the N × N matrix B_k and
solving the system

B_k p = −∇f_k.    (A.2)
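For a quadratic objective, solving (A.2) directly yields the exact minimizer in a single step. The following NumPy sketch is our own illustration (the matrix A and vector b are arbitrary examples, not from the dissertation):

```python
import numpy as np

# Minimal sketch: solving Newton's equation B_k p = -grad f_k directly for
# the quadratic objective f(theta) = 1/2 theta^T A theta - b^T theta,
# whose Hessian is A.
A = np.array([[3.0, 1.0], [1.0, 2.0]])   # symmetric positive definite
b = np.array([1.0, 1.0])

theta_k = np.zeros(2)
grad = A @ theta_k - b                    # gradient of f at theta_k
p = np.linalg.solve(A, -grad)             # Newton direction from (A.2)

theta_next = theta_k + p                  # alpha = 1 for a pure quadratic
# One Newton step lands on the minimizer, which satisfies A theta = b.
assert np.allclose(A @ theta_next, b)
```

The explicit `np.linalg.solve` call is exactly the O(N³) step that becomes prohibitive for large N, which motivates the conjugate gradient approach discussed next.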
However, two major difficulties exist in directly solving (A.2). First, this compu-
tation is very expensive for large N, which is common even with modestly
sized neural networks. To overcome this, Hessian-free optimization utilizes the
linear conjugate gradient (CG) method to optimize the quadratic objective. The
name "Hessian-free" indicates that CG does not require the costly,
explicit Hessian matrix; instead, the matrix-vector product between the matrix B_k
and a vector p is sufficient.
Second, the Newton direction p defined by (A.2) may not be a descent
direction, because the Hessian matrix may become negative definite when the intermedi-
ate parameter θ_k is far from the solution. To overcome this, two countermeasures
are introduced. One is to utilize the positive semidefinite Gauss-Newton curvature ma-
trix instead of the possibly indefinite Hessian matrix. The other is to apply a modified
Newton's method that reconditions the Hessian matrix H_k as

B_k = H_k + λI,    (A.3)

where B_k is the damped Hessian matrix of f at θ_k, λ ≥ 0 is a damping parameter, and I
is the identity matrix.
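A small sketch of our own illustrates the damping in (A.3): with an indefinite Hessian H the raw Newton direction can point uphill, while B = H + λI with a sufficiently large λ restores descent. The matrices and the gradient below are arbitrary examples:

```python
import numpy as np

# Illustration of damping (A.3): an indefinite Hessian can yield an
# ascent direction; adding lambda*I makes it positive definite.
H = np.diag([1.0, -2.0])                 # indefinite: eigenvalues 1 and -2
grad = np.array([0.1, 1.0])              # gradient at the current iterate

p_newton = np.linalg.solve(H, -grad)     # raw Newton direction from (A.2)
assert grad @ p_newton > 0               # positive slope: ascent direction

lam = 3.0                                # damping parameter lambda >= 0
B = H + lam * np.eye(2)                  # damped Hessian, positive definite
p_damped = np.linalg.solve(B, -grad)
assert grad @ p_damped < 0               # negative slope: descent direction
```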
A.1 Newton-CG Method
In the Newton-CG method, the search direction is computed by applying the CG
method to Newton's equation (A.2). The CG method is a general framework for
solving a linear system Ax = b with a symmetric positive definite matrix A. Instead
of the squared-error objective ‖Ax − b‖², the quadratic

ψ(x) = (1/2) x^T A x − b^T x,    (A.4)

is optimized. In the context of Hessian-free optimization, the parameters are set as
A = B and b = −∇f.
The CG iteration is terminated at iteration k if the following condition is satisfied:

k > G  and  ψ(x_k) < 0  and  (ψ(x_k) − ψ(x_{k−G})) / ψ(x_k) < ε_G,    (A.5)

where G determines how many iterations into the past are considered when estimating
the current per-iteration reduction rate.
The overview of our CG method is summarized as follows.

Algorithm 1 CG method
  Given x_0
  Set r_0 ← A x_0 − b, p_0 ← −r_0
  for k = 0 to K_max do
    if p_k^T A p_k ≤ 0 then
      break
    end if
    α_k ← (r_k^T r_k) / (p_k^T A p_k)
    x_{k+1} ← x_k + α_k p_k
    r_{k+1} ← r_k + α_k A p_k
    β_{k+1} ← (r_{k+1}^T r_{k+1}) / (r_k^T r_k)
    p_{k+1} ← −r_{k+1} + β_{k+1} p_k
    if k > G and ψ(x_k) < 0 and (ψ(x_k) − ψ(x_{k−G})) / ψ(x_k) < ε_G then
      break
    end if
  end for
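A direct Python transcription of Algorithm 1 might look as follows. The function name `cg` and the constants `k_max`, `G`, and `eps_g` are our own choices, and A is assumed symmetric positive definite in this sketch:

```python
import numpy as np

def cg(A, b, x0, k_max=250, G=10, eps_g=5e-4):
    """Linear CG following Algorithm 1, with the stopping rule (A.5)."""
    psi = lambda x: 0.5 * x @ A @ x - b @ x     # quadratic objective (A.4)
    x = x0
    r = A @ x0 - b                               # initial residual
    p = -r
    history = [psi(x)]
    for k in range(k_max):
        pAp = p @ A @ p
        if pAp <= 0:                             # non-positive curvature
            break
        alpha = (r @ r) / pAp
        x = x + alpha * p
        r_new = r + alpha * (A @ p)
        beta = (r_new @ r_new) / (r @ r)
        p = -r_new + beta * p
        r = r_new
        history.append(psi(x))
        # relative-progress stopping rule (A.5) over the last G iterations
        if (k > G and history[-1] < 0
                and (history[-1] - history[-1 - G]) / history[-1] < eps_g):
            break
    return x

A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
x = cg(A, b, np.zeros(2))
assert np.allclose(A @ x, b, atol=1e-6)          # solves the linear system
```

In the Hessian-free setting, the explicit matrix `A` would be replaced by a function computing the product B_k p, as discussed in the next section.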
A.2 Computing the Matrix-Vector Product
The Newton-CG method does not require explicit knowledge of the Hessian
B_k = ∇²f_k. Rather, it requires Hessian-vector products of the form ∇²f_k p for any given
vector p. When the second derivatives cannot easily be calculated, or when the Hes-
sian requires too much storage, finite-differencing techniques, known as Hessian-free
Newton methods, are commonly applied.
Following the definition of a derivative, the Hessian-vector products are exactly
calculated by the following equation:

∇²f_k p = lim_{r→0} (∇f(θ_k + rp) − ∇f(θ_k)) / r = ∂/∂r ∇f(θ_k + rp)|_{r=0}.    (A.6)
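The forward-difference form of (A.6) is straightforward to sketch. The quadratic objective and the step size r below are illustrative choices of ours, picked so that the estimate can be checked against the exact product A p:

```python
import numpy as np

# Approximating the Hessian-vector product of
# f(theta) = 1/2 theta^T A theta by finite differences of the gradient,
# following (A.6). For a quadratic, the estimate is exact up to rounding.
A = np.array([[2.0, 0.5], [0.5, 1.0]])
grad = lambda th: A @ th                 # exact gradient; the Hessian is A

def hvp_fd(grad, theta, p, r=1e-6):
    """Finite-difference estimate of (nabla^2 f) p at theta."""
    return (grad(theta + r * p) - grad(theta)) / r

theta = np.array([1.0, -1.0])
p = np.array([0.3, 0.7])
assert np.allclose(hvp_fd(grad, theta, p), A @ p, atol=1e-4)
```

For general nonlinear objectives the choice of r trades off truncation against rounding error, which is one motivation for the exact R-operator approach described next.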
This operation can be regarded as a transformation that converts the gradient of a system
into the Hessian-vector product. Pearlmutter [79] defined an operator for this trans-
formation as

R{f(θ_k)} = ∂/∂r f(θ_k + rp)|_{r=0},    (A.7)

where R{·} is called the R-operator. By applying the R{·} operator to the equations for cal-
culating a gradient, e.g., the backpropagation algorithm, we can acquire the Hessian-
vector product. As R{·} is a differential operator, it follows the same rules as the usual
differential operators, such as:
R{c f(θ)} = c R{f(θ)}    (A.8)
R{f(θ) + g(θ)} = R{f(θ)} + R{g(θ)}    (A.9)
R{f(θ)g(θ)} = R{f(θ)} g(θ) + f(θ) R{g(θ)}    (A.10)
R{f(g(θ))} = f′(g(θ)) R{g(θ)}    (A.11)
R{df(θ)/dt} = dR{f(θ)}/dt,    (A.12)

also note that

R{θ} = p.    (A.13)
For a standard feedforward neural network and an Elman-type recurrent neural
network, the forward propagation, the backpropagation, and their counterparts with
the R-operator applied are shown in Appendix B and Appendix C, respectively.
Appendix B
FNN with R-operator
B.1 Forward propagation
h = W_hi I + b_h
H = f(h)
o = W_oh H + b_o
O = g(o)

B.2 Forward propagation with R-operator

Rh = W_hi^v I + b_h^v
RH = f′(H) Rh
Ro = W_oh^v H + W_oh RH + b_o^v
RO = g′(O) Ro
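As a hypothetical numerical check of our own (not part of the dissertation), the R-forward pass above can be verified against a finite-difference directional derivative, here with f = tanh and g the identity, so that f′(H) = 1 − H² and g′ = 1. All weights and the direction are randomly chosen:

```python
import numpy as np

# Check that the R-forward pass of B.2 computes the directional
# derivative of the network output O along a parameter direction v.
rng = np.random.default_rng(0)
Whi, bh = rng.normal(size=(4, 3)), rng.normal(size=4)
Woh, bo = rng.normal(size=(2, 4)), rng.normal(size=2)
I = rng.normal(size=3)

def forward(Whi_, bh_, Woh_, bo_):
    H_ = np.tanh(Whi_ @ I + bh_)        # B.1: h = Whi I + bh, H = f(h)
    return Woh_ @ H_ + bo_              # o = Woh H + bo, O = g(o), g = id

# Direction v = (vWhi, vbh, vWoh, vbo) in parameter space.
vWhi, vbh = rng.normal(size=(4, 3)), rng.normal(size=4)
vWoh, vbo = rng.normal(size=(2, 4)), rng.normal(size=2)

# R-forward pass (B.2), with f'(H) = 1 - H^2 for f = tanh.
h = Whi @ I + bh
H = np.tanh(h)
Rh = vWhi @ I + vbh
RH = (1 - H**2) * Rh
RO = vWoh @ H + Woh @ RH + vbo          # g' = 1, so RO = Ro

# Central-difference directional derivative for comparison.
r = 1e-5
O_plus = forward(Whi + r*vWhi, bh + r*vbh, Woh + r*vWoh, bo + r*vbo)
O_minus = forward(Whi - r*vWhi, bh - r*vbh, Woh - r*vWoh, bo - r*vbo)
assert np.allclose(RO, (O_plus - O_minus) / (2 * r), atol=1e-4)
```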
B.3 Error Function
E = (1/2)(Ō − O)²,

where Ō denotes the target output.
B.4 Backpropagation

B.4.1 Variables

∂E/∂O = −(Ō − O)
∂E/∂o = (∂E/∂O)(∂O/∂o) = −(Ō − O) g′(O)
∂E/∂H = (∂E/∂o)(∂o/∂H) = (∂E/∂o) W_oh
∂E/∂h = (∂E/∂H)(∂H/∂h) = (∂E/∂H) f′(H)

B.4.2 Parameters

∂E/∂W_oh = (∂E/∂o)(∂o/∂W_oh) = (∂E/∂o) H
∂E/∂W_hi = (∂E/∂h)(∂h/∂W_hi) = (∂E/∂h) I
∂E/∂b_o = (∂E/∂o)(∂o/∂b_o) = ∂E/∂o
∂E/∂b_h = (∂E/∂h)(∂h/∂b_h) = ∂E/∂h
B.5 Backpropagation with R-operator
B.5.1 Variables

R{∂E/∂O} = RO
R{∂E/∂o} = R{∂E/∂O}(∂O/∂o) = RO g′(O)
R{∂E/∂H} = R{∂E/∂o}(∂o/∂H) = R{∂E/∂o} W_oh
R{∂E/∂h} = R{∂E/∂H}(∂H/∂h) = R{∂E/∂H} f′(H)

B.5.2 Parameters

R{∂E/∂W_oh} = R{∂E/∂o}(∂o/∂W_oh) = R{∂E/∂o} H
R{∂E/∂W_hi} = R{∂E/∂h}(∂h/∂W_hi) = R{∂E/∂h} I
R{∂E/∂b_o} = R{∂E/∂o}(∂o/∂b_o) = R{∂E/∂o}
R{∂E/∂b_h} = R{∂E/∂h}(∂h/∂b_h) = R{∂E/∂h}
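The equations of B.5, seeded with RO from the R-forward pass, compute a curvature-vector product. The sketch below is our own numerical check (not from the dissertation): with f = tanh, g = identity, and squared error, it verifies the result against the Gauss-Newton product G v = Jᵀ J v mentioned in Appendix A, where J is the Jacobian of the output with respect to the parameters:

```python
import numpy as np

# Verify that the B.2/B.5 recipe yields the Gauss-Newton product J^T J v.
rng = np.random.default_rng(1)
Whi, bh = rng.normal(size=(3, 2)), rng.normal(size=3)
Woh, bo = rng.normal(size=(2, 3)), rng.normal(size=2)
I = rng.normal(size=2)

h = Whi @ I + bh                     # forward pass (B.1), f = tanh
H = np.tanh(h)

# Parameter direction v, split into per-parameter blocks.
vWoh, vWhi = rng.normal(size=(2, 3)), rng.normal(size=(3, 2))
vbo, vbh = rng.normal(size=2), rng.normal(size=3)

Rh = vWhi @ I + vbh                  # R-forward (B.2); RO = J v for g = id
RH = (1 - H**2) * Rh
RO = vWoh @ H + Woh @ RH + vbo

Rd_o = RO                            # B.5 with R{dE/dO} = RO and g' = 1
Rd_H = Woh.T @ Rd_o                  # transpose: column-vector convention
Rd_h = (1 - H**2) * Rd_H
Gv = np.concatenate([np.outer(Rd_o, H).ravel(),   # block for Woh
                     np.outer(Rd_h, I).ravel(),   # block for Whi
                     Rd_o, Rd_h])                 # blocks for bo, bh

# Reference: build J column-by-column with central differences.
def output(Woh_, Whi_, bo_, bh_):
    return Woh_ @ np.tanh(Whi_ @ I + bh_) + bo_

sizes = [Woh.size, Whi.size, bo.size, bh.size]
flat = np.concatenate([Woh.ravel(), Whi.ravel(), bo, bh])
def unflatten(x):
    a, b, c, d = np.split(x, np.cumsum(sizes)[:-1])
    return a.reshape(2, 3), b.reshape(3, 2), c, d
eps = 1e-5
J = np.array([(output(*unflatten(flat + eps * e))
               - output(*unflatten(flat - eps * e))) / (2 * eps)
              for e in np.eye(flat.size)]).T      # shape (2, n_params)
v = np.concatenate([vWoh.ravel(), vWhi.ravel(), vbo, vbh])
assert np.allclose(Gv, J.T @ (J @ v), atol=1e-5)
```

Note that the equations of B.5 drop the second-derivative terms of the full product rule; this is exactly what makes the result the positive semidefinite Gauss-Newton product rather than the full Hessian-vector product.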
Appendix C
RNN with R-operator
C.1 Forward Propagation
h_t = W_hi I_0 + W_hh H_init + b_h,   (t = 0)
h_t = W_hi I_{t−1} + W_hh H_{t−1} + b_h,   (t > 0)
H_t = f(h_t)
o_t = W_oh H_t + b_o
O_t = g(o_t)

C.2 Forward Propagation with R-operator

Rh_t = W_hi^v I_0 + W_hh^v H_init + W_hh RH_init + b_h^v,   (t = 0)
Rh_t = W_hi^v I_{t−1} + W_hh^v H_{t−1} + W_hh RH_{t−1} + b_h^v,   (t > 0)
RH_t = f′(H_t) Rh_t
Ro_t = W_oh^v H_t + W_oh RH_t + b_o^v
RO_t = g′(O_t) Ro_t
C.3 Error Function

E = (1/2)(Ō_t − O_t)²,

where Ō_t denotes the target output at time t.
C.4 Backpropagation
C.4.1 Variables

∂E/∂O_t = −(Ō_t − O_t)
∂E/∂o_t = (∂E/∂O_t)(∂O_t/∂o_t) = −(Ō_t − O_t) g′(O_t)
∂E/∂H_t = (∂E/∂o_t)(∂o_t/∂H_t) + (∂E/∂h_{t+1})(∂h_{t+1}/∂H_t) = (∂E/∂o_t) W_oh + (∂E/∂h_{t+1}) W_hh
∂E/∂h_t = (∂E/∂H_t)(∂H_t/∂h_t) = (∂E/∂H_t) f′(H_t)

C.4.2 Parameters

∂E/∂W_oh = (∂E/∂o_t)(∂o_t/∂W_oh) = (∂E/∂o_t) H_t
∂E/∂W_hh = (∂E/∂h_t)(∂h_t/∂W_hh) = (∂E/∂h_t) H_{t−1}
∂E/∂W_hi = (∂E/∂h_t)(∂h_t/∂W_hi) = (∂E/∂h_t) I_{t−1},   (t > 0)
∂E/∂W_hi = (∂E/∂h_0)(∂h_0/∂W_hi) = (∂E/∂h_0) I_0,   (t = 0)
∂E/∂b_o = (∂E/∂o_t)(∂o_t/∂b_o) = ∂E/∂o_t
∂E/∂b_h = (∂E/∂h_t)(∂h_t/∂b_h) = ∂E/∂h_t
∂E/∂H_init = (∂E/∂h_0)(∂h_0/∂H_init) = (∂E/∂h_0) W_hh
C.5 Backpropagation with R-operator
C.5.1 Variables

R{∂E/∂O_t} = RO_t
R{∂E/∂o_t} = R{∂E/∂O_t}(∂O_t/∂o_t) = RO_t g′(O_t)
R{∂E/∂H_t} = R{∂E/∂o_t}(∂o_t/∂H_t) + R{∂E/∂h_{t+1}}(∂h_{t+1}/∂H_t) = R{∂E/∂o_t} W_oh + R{∂E/∂h_{t+1}} W_hh
R{∂E/∂h_t} = R{∂E/∂H_t}(∂H_t/∂h_t) = R{∂E/∂H_t} f′(H_t)

C.5.2 Parameters

R{∂E/∂W_oh} = R{∂E/∂o_t}(∂o_t/∂W_oh) = R{∂E/∂o_t} H_t
R{∂E/∂W_hh} = R{∂E/∂h_t}(∂h_t/∂W_hh) = R{∂E/∂h_t} H_{t−1}
R{∂E/∂W_hi} = R{∂E/∂h_t}(∂h_t/∂W_hi) = R{∂E/∂h_t} I_{t−1},   (t > 0)
R{∂E/∂W_hi} = R{∂E/∂h_0}(∂h_0/∂W_hi) = R{∂E/∂h_0} I_0,   (t = 0)
R{∂E/∂b_o} = R{∂E/∂o_t}(∂o_t/∂b_o) = R{∂E/∂o_t}
R{∂E/∂b_h} = R{∂E/∂h_t}(∂h_t/∂b_h) = R{∂E/∂h_t}
R{∂E/∂H_init} = R{∂E/∂h_0}(∂h_0/∂H_init) = R{∂E/∂h_0} W_hh
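As with the feedforward case, the recurrent R-forward pass can be checked numerically. The sketch below is our own construction (f = tanh, g = identity, RH_init = 0 since H_init is held fixed, and input indexing simplified so that step t consumes input t):

```python
import numpy as np

# Check that the R-forward pass of C.2 computes the directional derivative
# of the RNN outputs O_t along a parameter direction v.
rng = np.random.default_rng(2)
Whi, Whh = rng.normal(size=(3, 2)), rng.normal(size=(3, 3))
bh = rng.normal(size=3)
Woh, bo = rng.normal(size=(2, 3)), rng.normal(size=2)
H_init = rng.normal(size=3)
Is = rng.normal(size=(4, 2))                  # input sequence

def run(Whi_, Whh_, bh_, Woh_, bo_):
    H, Os = H_init, []
    for I in Is:
        H = np.tanh(Whi_ @ I + Whh_ @ H + bh_)   # C.1 recurrence
        Os.append(Woh_ @ H + bo_)                # g = identity
    return np.array(Os)

# Parameter direction v.
vWhi, vWhh = rng.normal(size=(3, 2)), rng.normal(size=(3, 3))
vbh = rng.normal(size=3)
vWoh, vbo = rng.normal(size=(2, 3)), rng.normal(size=2)

# R-forward pass (C.2), carrying RH across time; RH_init = 0.
H, RH = H_init, np.zeros(3)
ROs = []
for I in Is:
    Rh = vWhi @ I + vWhh @ H + Whh @ RH + vbh    # uses previous H, RH
    H = np.tanh(Whi @ I + Whh @ H + bh)
    RH = (1 - H**2) * Rh                         # f'(H) = 1 - H^2
    ROs.append(vWoh @ H + Woh @ RH + vbo)
ROs = np.array(ROs)

# Central finite difference along v for comparison.
r = 1e-5
O_plus = run(Whi + r*vWhi, Whh + r*vWhh, bh + r*vbh, Woh + r*vWoh, bo + r*vbo)
O_minus = run(Whi - r*vWhi, Whh - r*vWhh, bh - r*vbh, Woh - r*vWoh, bo - r*vbo)
assert np.allclose(ROs, (O_plus - O_minus) / (2 * r), atol=1e-4)
```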
Bibliography
[1] O. Abdel-Hamid and H. Jiang. Rapid and effective speaker adaptation of con-
volutional neural network based models for speech recognition. In Proceed-
ings of the 14th Annual Conference of the International Speech Communication
Association, Lyon, France, Aug. 2013.
[2] O. Abdel-Hamid, A.-r. Mohamed, H. Jiang, and G. Penn. Applying Convolutional Neural Networks concepts to hybrid NN-HMM model for speech recognition. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, pages 4277–4280, Kyoto, Japan, Mar. 2012.
[3] E. Abravanel. Integrating the information from eyes and hands: A develop-
mental account. Intersensory Perception and Sensory Integration, pages 71–
108, 1981.
[4] P. S. Aleksic and A. K. Katsaggelos. Comparison of low- and high-level visual
features for audio-visual continuous automatic speech recognition. In Pro-
ceedings of the IEEE International Conference on Acoustics, Speech, and Signal
Processing, volume 5, pages 917–920, Montreal, Canada, May 2004.
[5] M. Anisfeld. Interpreting “imitative” responses in early infancy. Science,
205(4402):214–215, July 1979.
[6] E. Aronson and S. Rosenbloom. Space perception in early infancy: Perception
within a common auditory-visual space. Science, 172(3988):1161–1163, June
1971.
[7] J. Barker and F. Berthommier. Evidence of correlation between acoustic and
visual features of speech. In Proceedings of the 14th International Congress of
Phonetic Sciences, pages 5–9, San Francisco, CA, USA, Aug. 1999.
[8] R. Bekkerman, M. Bilenko, and J. Langford, editors. Scaling up Machine Learn-
ing: Parallel and Distributed Approaches. Cambridge University Press, 2011.
[9] Y. Bengio. Learning deep architectures for AI. Foundations and Trends in Ma-
chine Learning, 2(1):1–127, Jan. 2009.
[10] Y. Bengio, P. Simard, and P. Frasconi. Learning long-term dependencies with
gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2):157–
166, Jan. 1994.
[11] H. Bourlard and S. Dupont. A new ASR approach based on independent pro-
cessing and recombination of partial frequency bands. In Proceedings of the
4th International Conference on Spoken Language Processing, volume 1, pages
426–429, Philadelphia, PA, USA, Oct. 1996.
[12] H. Bourlard, S. Dupont, and C. Ris. Multi-stream speech recognition. IDIAP
Research Report, 1996.
[13] H. Bourlard and N. Morgan. Connectionist Speech Recognition: A Hybrid
Approach. Springer US, Boston, MA, 1994.
[14] T. G. R. Bower, J. M. Broughton, and M. K. Moore. The coordination of visual
and tactual input in infants. Attention, Perception, & Psychophysics, 8(1):51–53,
Jan. 1979.
[15] N. Brooke and E. D. Petajan. Seeing speech: Investigations into the synthesis
and recognition of visible speech movements using automatic image process-
ing and computer graphics. In Proceedings of the International Conference on
Speech Input and Output, Techniques and Applications, pages 104–109, Lon-
don, UK, Mar. 1986.
[16] R. A. Brooks, C. Breazeal (Ferrell), R. Irie, C. C. Kemp, M. Marjanovic, B. Scassellati,
and M. M. Williamson. Alternative essences of intelligence. In Proceedings of
the 15th National Conference on Artificial Intelligence, pages 961–968, Madi-
son, WI, USA, July 1998.
[17] A. Chitu and L. J. Rothkrantz. Automatic Visual Speech Recognition, chapter 6,
pages 95–120. Speech Enhancement, Modeling and Recognition- Algorithms
and Applications. InTech, 2012.
[18] M. Coen. Multimodal Integration-A Biological View. In Proceedings of the 17th
International Joint Conference on Artificial Intelligence, volume 2, pages 1417–
1424, Seattle, WA, USA, Aug. 2001.
[19] T. Cootes, G. Edwards, and C. Taylor. Active appearance models. IEEE Transac-
tions on Pattern Analysis and Machine Intelligence, 23(6):681–685, June 2001.
[20] M. Critchley. Ecstatic and synaesthetic experience during musical perception.
Music and brain: Studies in the neurology of music. Charles C Thomas, Spring-
field, IL, USA, 1977.
[21] R. E. Cytowic. Synesthesia: A Union of the Senses, 2nd edition. Springer-Verlag,
New York, 1989.
[22] G. E. Dahl and A. Acero. Context-dependent pre-trained deep neural networks
for large-vocabulary speech recognition. IEEE Transactions on Audio, Speech,
and Language Processing, 20(1):30–42, Jan. 2012.
[23] S. Deneve and A. Pouget. Bayesian multisensory integration and cross-modal
spatial links. Journal of Physiology Paris, 98(1-3):249–258, Jan. 2004.
[24] J. Dewey. The reflex arc concept in psychology. Psychological Review, 3:357–
370, 1896.
[25] M. O. Ernst and H. H. Bülthoff. Merging the senses into a robust percept. Trends
in Cognitive Sciences, 8(4):162–169, Apr. 2004.
[26] A. Falchier, S. Clavagnier, P. Barone, and H. Kennedy. Anatomical evidence of
multimodal integration in primate striate cortex. The Journal of Neuroscience,
22(13):5749–5759, July 2002.
[27] X. Feng, Y. Zhang, and J. Glass. Speech feature denoising and dereverberation
via deep autoencoders for noisy reverberant speech recognition. In Proceed-
ings of the IEEE International Conference on Acoustics, Speech, and Signal Pro-
cessing, pages 1759–1763, Florence, Italy, May 2014.
[28] V. Franc and V. Hlavac. Statistical Pattern Recognition Toolbox for Matlab. http://cmp.felk.cvut.cz/cmp/software/stprtool/, Aug. 2008.
[29] F. Frassinetti, N. Bolognini, and E. Làdavas. Enhancement of visual percep-
tion by crossmodal visuo-auditory interaction. Experimental Brain Research,
147(3):332–343, Dec. 2002.
[30] I. Gallagher. Philosophical conceptions of the self: implications for cognitive
science. Trends in Cognitive Sciences, 4(1):14–21, Jan. 2000.
[31] J. Gardner and H. Gardner. A note on selective imitation by a six-week-old
infant. Child Development, 41(4):1209–1213, Dec. 1970.
[32] Willow Garage. Personal Robot 2 (PR2). http://www.willowgarage.com/.
[33] M. D. Grilli and E. L. Glisky. Self-Imagining Enhances Recognition Memory in
Memory-Impaired Individuals with Neurological Damage. Neuropsychology,
24(6):698–710, Nov. 2010.
[34] M. D. Grilli and E. L. Glisky. The self-imagination effect: benefits of a self-
referential encoding strategy on cued recall in memory-impaired individuals
with neurological damage. Journal of the International Neuropsychological So-
ciety, 17(5):929–933, Sept. 2011.
[35] M. Gurban, J.-P. Thiran, T. Drugman, and T. Dutoit. Dynamic modality weight-
ing for multi-stream HMMs in audio-visual speech recognition. In Proceedings
of the 10th International Conference on Multimodal Interfaces, pages 237–240,
Chania, Greece, Oct. 2008.
[36] M. Heckmann, K. Kroschel, and C. Savariaux. DCT-based video features for
audio-visual speech recognition. In Proceedings of the 7th International Con-
ference on Spoken Language Processing, volume 3, pages 1925–1928, Denver,
CO, USA, Sept. 2002.
[37] R. Held. Shifts in binaural localization after prolonged exposures to atypical
combinations of stimuli. The American Journal of Psychology, 68(4):526–548,
Dec. 1955.
[38] R. A. Henson. Neurological Aspects of Musical Experience. Music and the Brain:
Studies in the Neurology of Music. William Heinemann Medical Books Lim-
ited, London, 1977.
[39] H. Hermansky, D. Ellis, and S. Sharma. Tandem connectionist feature extrac-
tion for conventional HMM systems. In Proceedings of the IEEE International
Conference on Acoustics, Speech, and Signal Processing, volume 3, pages 1635–
1638, Istanbul, Turkey, June 2000.
[40] G. Hinton, L. Deng, D. Yu, G. Dahl, A. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. Sainath, and B. Kingsbury. Deep neural networks for acoustic modeling in speech recognition. IEEE Signal Processing Magazine, 29:82–97, Nov. 2012.
[41] G. E. Hinton and R. R. Salakhutdinov. Reducing the dimensionality of data with
neural networks. Science, 313(5786):504–507, July 2006.
[42] R. Hof. Meet The Guy Who Helped Google Beat Apple's Siri. http://www.forbes.com/sites/roberthof/2013/05/01/meet-the-guy-who-helped-google-beat-apples-siri/, May 2013.
[43] I. P. Howard and W. B. Templeton. Human Spatial Orientation. Wiley, London,
1966.
[44] J. Huang and B. Kingsbury. Audio-visual deep learning for noise robust speech
recognition. In Proceedings of the IEEE International Conference on Acous-
tics, Speech, and Signal Processing, pages 7596–7599, Vancouver, Canada, May
2013.
[45] A. Janin, D. Ellis, and N. Morgan. Multi-stream speech recognition: Ready for
prime time? In Proceedings of the 6th European Conference on Speech Commu-
nication and Technology, Budapest, Hungary, Sept. 1999.
[46] A. Jauffret, N. Cuperlier, P. Gaussier, and P. Tarroux. Multimodal integration of
visual place cells and grid cells for navigation tasks of a real robot. In Proceed-
ings of the 12th International Conference on Simulation of Adaptive Behavior,
volume 7426, pages 136–145, Odense, Denmark, Aug. 2012.
[47] K. Kaneko, F. Kanehiro, S. Kajita, H. Hirukawa, T. Kawasaki, M. Hirata,
K. Akachi, and T. Isozumi. Humanoid robot HRP-2. In Proceedings of the IEEE
International Conference on Robotics and Automation, volume 2, pages 1083–
1090, Barcelona, Spain, Apr. 2004.
[48] T. Kawabe, W. Roseboom, and S. Nishida. The sense of agency is action-effect
causality perception based on cross-modal grouping. Proceedings of the Royal
Society B: Biological Sciences, 280(1763):20130991, July 2013.
[49] A. Krizhevsky and G. E. Hinton. Using very deep autoencoders for content-
based image retrieval. In Proceedings of the 19th European Symposium on Ar-
tificial Neural Networks, Bruges, Belgium, Apr. 2011.
[50] A. Krizhevsky, I. Sutskever, and G. Hinton. Imagenet classification with deep
convolutional neural networks. In Proceedings of the Advances in Neural In-
formation Processing Systems 25, pages 1106–1114, Lake Tahoe, NV, USA, Dec.
2012.
[51] K. Kumar, T. Chen, and R. Stern. Profile view lip reading. In Proceedings of
the IEEE International Conference on Acoustics, Speech, and Signal Processing,
Honolulu, Hawaii, Apr 2007.
[52] T. Kuriyama, T. Shibuya, T. Harada, and Y. Kuniyoshi. Learning Interaction
Rules through Compression of Sensori-Motor Causality Space. In Proceed-
ings of the 10th International Conference on Epigenetic Robotics, pages 57–64,
Örenäs Slott, Sweden, Nov. 2010.
[53] H. Kuwabara, K. Takeda, Y. Sagisaka, S. Katagiri, S. Morikawa, and T. Watan-
abe. Construction of a large-scale Japanese speech database and its manage-
ment system. In Proceedings of the IEEE International Conference on Acous-
tics, Speech, and Signal Processing, pages 560–563, Glasgow, Scotland, UK, May
1989.
[54] Y. Lan, B.-j. Theobald, R. Harvey, E.-j. Ong, and R. Bowden. Improving vi-
sual features for lip-reading. In Proceedings of the International Conference
on Auditory-Visual Speech Processing, Hakone, Japan, Oct. 2010.
[55] K. Lang, A. Waibel, and G. Hinton. A time-delay neural network architecture
for isolated word recognition. Neural Networks, 3:23–43, 1990.
[56] Q. V. Le, M. Ranzato, R. Monga, M. Devin, K. Chen, G. S. Corrado, J. Dean,
and A. Y. Ng. Building high-level features using large scale unsupervised learn-
ing. In Proceedings of the 29th International Conference on Machine Learning,
pages 81–88, Edinburgh, Scotland, July 2012.
[57] Y. LeCun and L. Bottou. Learning methods for generic object recognition with
invariance to pose and lighting. In Proceedings of the IEEE Computer Society
Conference on Computer Vision and Pattern Recognition, volume 2, pages 97–
104, Washington, D.C., USA, June 2004.
[58] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning ap-
plied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, Nov.
1998.
[59] H. Lee, R. Grosse, R. Ranganath, and A. Y. Ng. Convolutional deep belief net-
works for scalable unsupervised learning of hierarchical representations. In
Proceedings of the 26th International Conference on Machine Learning, pages
609–616, Montreal, Canada, June 2009.
[60] H. Lee, P. Pham, Y. Largman, and A. Y. Ng. Unsupervised feature learning for
audio classification using convolutional deep belief networks. In Proceedings
of the Advances in Neural Information Processing Systems 22, pages 1096–1104,
Vancouver, Canada, 2009.
[61] J. Luettin, N. Thacker, and S. Beet. Visual speech recognition using active shape
models and hidden Markov models. In Proceedings of the IEEE International
Conference on Acoustics, Speech, and Signal Processing, volume 2, pages 817–
820, Atlanta, GA, USA, May 1996.
[62] A. L. Maas, T. M. O’Neil, A. Y. Hannun, and A. Y. Ng. Recurrent neural network
feature enhancement: The 2nd chime challenge. In Proceedings of the 2nd In-
ternational Workshop on Machine Listening in Multisource Environments, Van-
couver, Canada, June 2013.
[63] L. E. Marks. On colored-hearing synesthesia: Cross-modal translations of sen-
sory dimensions. Psychological Bulletin, 82(3):303–331, May 1975.
[64] L. E. Marks. The Unity of the Senses: Interrelations Among the Modalities. Aca-
demic Press Series in Cognition and Perception. Academic Press, 1978.
[65] J. Martens. Deep learning via Hessian-free optimization. In Proceedings of
the 27th International Conference on Machine Learning, pages 735–742, Haifa,
Israel, June 2010.
[66] J. Martens and I. Sutskever. Learning recurrent neural networks with Hessian-
free optimization. In Proceedings of the 28th International Conference on Ma-
chine Learning, pages 1033–1040, Bellevue, WA, USA, June 2011.
[67] I. Matthews, T. Cootes, J. Bangham, S. Cox, and R. Harvey. Extraction of visual
features for lipreading. IEEE Transactions on Pattern Analysis and Machine
Intelligence, 24(2):198–213, 2002.
[68] I. Matthews, G. Potamianos, C. Neti, and J. Luettin. A comparison of model and transform-based visual features for audio-visual LVCSR. In Proceedings of the IEEE International Conference on Multimedia and Expo, Tokyo, Japan, Aug. 2001.
[69] H. McGurk and J. MacDonald. Hearing lips and seeing voices. Nature, 264:746–
748, Dec. 1976.
[70] A. N. Meltzoff. Towards a developmental cognitive science: The implications
of cross-modal matching and imitation for the development of representation
and memory in infancy. Annals of the New York Academy of Sciences, 608:1–31,
Dec. 1990.
[71] A. N. Meltzoff and M. K. Moore. Imitation of facial and manual gestures by
human neonates. Science, 198(4312):75–78, Oct. 1977.
[72] A. Mohamed, G. E. Dahl, and G. E. Hinton. Acoustic Modeling Using Deep
Belief Networks. IEEE Transactions on Audio, Speech, and Language Processing,
20(1):14–22, 2012.
[73] R. R. Murphy. Introduction to AI Robotics. The MIT Press, 2000.
[74] A. V. Nefian, L. Liang, X. Pi, X. Liu, and K. Murphy. Dynamic bayesian networks
for audio-visual speech recognition. EURASIP Journal on Applied Signal Pro-
cessing, 11:1274–1288, 2002.
[75] J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A. Y. Ng. Multimodal deep
learning. In Proceedings of the 28th International Conference on Machine
Learning, pages 689–696, Bellevue, WA, USA, June 2011.
[76] NVIDIA Corporation. CUBLAS library version 6.0 user guide. CUDA Toolkit
Documentation, Feb. 2014.
[77] M. Ogino, H. Toichi, Y. Yoshikawa, and M. Asada. Interaction rule learning
with a human partner based on an imitation faculty with a simple visuo-motor
mapping. Robotics and Autonomous Systems, 54(5):414–418, May 2006.
[78] D. Palaz, R. Collobert, and M. Magimai.-Doss. Estimating phoneme class con-
ditional probabilities from raw speech signal using convolutional neural net-
works. In Proceedings of the 14th Annual Conference of the International Speech
Communication Association, Lyon, France, Aug. 2013.
[79] B. Pearlmutter. Fast exact multiplication by the Hessian. Neural Computation,
6(1):147–160, Jan. 1994.
[80] J. Piaget. Play, dreams, and imitation in childhood. W. W. Norton, New York,
1962.
[81] H. L. Pick, D. H. Warren, and J. C. Hay. Sensory conflict in judgments of spatial
direction. Perception & Psychophysics, 6(4):203–205, July 1969.
[82] A. Pitti, A. Blanchard, M. Cardinaux, and P. Gaussier. Distinct mechanisms
for multimodal integration and unimodal representation in spatial develop-
ment. In Proceedings of the IEEE International Conference on Development and
Learning and Epigenetic Robotics, pages 1–6, San Diego, CA, USA, Nov. 2012.
[83] A. Pouget, S. Deneve, and J. Duhamel. A computational perspective on the
neural basis of multisensory spatial representations. Nature Reviews Neuro-
science, 3:741–747, Sept. 2002.
[84] V. S. Ramachandran and E. M. Hubbard. Hearing colors, tasting shapes. Scien-
tific American, 16:76–83, May 2006.
[85] S. Renals, N. Morgan, H. Bourlard, M. Cohen, and H. Franco. Connectionist probability estimators in HMM speech recognition. IEEE Transactions on Speech and Audio Processing, 2(1):161–174, 1994.
[86] J. Robert-Ribes, M. Piquemal, J.-L. Schwartz, and P. Escudier. Exploiting sensor
fusion architectures and stimuli complementarity in av speech recognition. In
D. Stork and M. Hennecke, editors, Speechreading by Humans and Machines,
pages 193–210. Springer Berlin Heidelberg, 1996.
[87] Aldebaran Robotics. NAO Humanoid, Nov. 2012.
[88] S. A. Rose. Cross-modal transfer in human infants: What is being transferred?
Annals of the New York Academy of Sciences, 608:38–50, Dec. 1990.
[89] C. Rosenberg. Improving Photo Search: A Step Across the Semantic Gap. http://googleresearch.blogspot.jp/2013/06/improving-photo-search-step-across.html, June 2013.
[90] T. N. Sainath, B. Kingsbury, and B. Ramabhadran. Auto-encoder bottleneck
features using deep belief networks. In Proceedings of the IEEE International
Conference on Acoustics, Speech, and Signal Processing, pages 4153–4156, Ky-
oto, Japan, Mar. 2012.
[91] Y. Sakagami, R. Watanabe, C. Aoyama, S. Matsunaga, N. Higaki, and K. Fu-
jimura. The intelligent ASIMO: system overview and integration. In Proceed-
ings of the IEEE/RSJ International Conference on Intelligent Robots and System,
volume 3, pages 2478–2483, Lausanne, Switzerland, Oct. 2002.
[92] M. Sams, R. Aulanko, M. Hämäläinen, R. Hari, O. V. Lounasmaa, S. Lu, and
J. Simola. Seeing speech: visual information from lip movements modifies ac-
tivity in the human auditory cortex. Neuroscience Letters, 127(1):141–145, June
1991.
[93] E. Sauser and A. Billard. Biologically Inspired Multimodal Integration: Inter-
ferences in a Human-Robot Interaction Game. In Proceedings of the IEEE/RSJ
International Conference on Intelligent Robots and Systems, pages 5619–5624,
Beijing, China, Oct. 2006.
[94] P. Scanlon and R. Reilly. Feature analysis for automatic speechreading. In Pro-
ceedings of the IEEE 4th Workshop on Multimedia Signal Processing, pages 625–
630, Cannes, France, Oct. 2001.
[95] B. R. Shelton and C. L. Searle. The influence of vision on the absolute identi-
fication of sound-source position. Perception & Psychophysics, 28(6):589–596,
1980.
[96] M. Slaney. Auditory Toolbox: A MATLAB Toolbox for Auditory Modeling Work
Version 2. Interval Research Corporation, 1998.
[97] E. S. Spelke. The development of intermodal perception. Handbook of infant
perception. Academic Press, New York, 1987.
[98] N. Srivastava and R. Salakhutdinov. Multimodal learning with deep Boltzmann
machines. In Proceedings of the Advances in Neural Information Processing
Systems 25, pages 2231–2239, Lake Tahoe, NV, USA, Dec. 2012.
[99] B. Stein and N. London. Enhancement of perceived visual intensity by auditory
stimuli: a psychophysical analysis. Journal of Cognitive Neuroscience, 8(6):497–
506, Nov. 1996.
[100] B. E. Stein. Neural mechanisms for synthesizing sensory information and pro-
ducing adaptive behaviors. Experimental Brain Research, 123(1-2):124–135,
Nov. 1998.
[101] B. E. Stein and M. A. Meredith. The merging of the senses. The MIT Press, 1993.
[102] W. H. Sumby and I. Pollack. Visual contribution to speech intelligibility in
noise. Journal of the Acoustical Society of America, 26:212–215, 1954.
[103] I. Sutskever, J. Martens, and G. Hinton. Generating text with recurrent neu-
ral networks. In Proceedings of the 28th International Conference on Machine
Learning, pages 1017–1024, Bellevue, WA, USA, June 2011.
[104] W. A. Teder-Sälejärvi, F. Di Russo, J. J. McDonald, and S. A. Hillyard. Effects of
spatial congruity on audio-visual multimodal integration. Journal of Cognitive
Neuroscience, 17(9):1396–1409, Sept. 2005.
[105] W. R. Thurlow and T. M. Rosenthal. Further study of existence regions for the
“ventriloquism effect”. Journal of the American Audiology Society, 1(6):280–
286, 1976.
[106] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol. Extracting and com-
posing robust features with denoising autoencoders. In Proceedings of the 25th
international conference on Machine learning, pages 1096–1103, New York, NY,
USA, July 2008.
[107] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P.-A. Manzagol. Stacked de-
noising autoencoders: Learning useful representations in a deep network with
a local denoising criterion. Journal of Machine Learning Research, 11:3371–
3408, 2010.
[108] J. Vroomen and B. de Gelder. Sound enhances visual perception: cross-modal
effects of auditory organization on vision. Journal of Experimental Psychology:
Human Perception and Performance, 26(5):1583–1590, Oct. 2000.
[109] D. H. Warren, R. B. Welch, and T. J. McCarthy. The role of visual-auditory “com-
pellingness” in the ventriloquism effect: Implications for transitivity among
the spatial senses. Perception & Psychophysics, 30(6):557–564, Nov. 1981.
[110] R. B. Welch and D. H. Warren. Immediate perceptual response to intersensory
discrepancy. Psychological Bulletin, 88(3):638–667, Nov. 1980.
[111] R. B. Welch and D. H. Warren. Intersensory interactions. In K. R. Boff, L. Kaufman, and J. P. Thomas, editors, Sensory Processes and Perception, volume 1 of Handbook of Perception and Human Performance, pages 25-1–25-36. Wiley, New York, 1986.
[112] H. Yehia, P. Rubin, and E. Vatikiotis-Bateson. Quantitative association of vocal-
tract and facial behavior. Speech Communication, 26:23–43, 1998.
[113] T. Yoshida, K. Nakadai, and H. G. Okuno. Automatic speech recognition im-
proved by two-layered audio-visual integration for robot audition. In Proceed-
ings of the 9th IEEE-RAS International Conference on Humanoid Robots, pages
604–609, Paris, France, Dec. 2009.
[114] S. Young, G. Evermann, M. Gales, T. Hain, X. A. Liu, G. Moore, J. Odell, D. Ol-
lason, D. Povey, V. Valtchev, and P. Woodland. The HTK Book (for HTK Version
3.4). Cambridge University Engineering Department, 2009.
[115] X. Zhang, C. Broun, R. Mersereau, and M. Clements. Automatic speechreading
with applications to human-computer interfaces. EURASIP Journal on Applied
Signal Processing, 11:1228–1247, 2002.
Relevant Publications
Journal Papers
1. K. Noda, Y. Yamaguchi, K. Nakadai, H. G. Okuno, and T. Ogata. Audio-visual
speech recognition using deep learning, Applied Intelligence, Vol.42, Issue 4,
pp. 722–737, Jun. 2015.
2. K. Noda, H. Arie, Y. Suga, and T. Ogata. Multimodal Integration Learning of
Robot Behavior using Deep Neural Networks, Robotics and Autonomous Sys-
tems, Vol.62, Issue 6, pp. 721–736, Jun. 2014.
International Conferences
1. K. Noda, Y. Yamaguchi, K. Nakadai, H. G. Okuno, and T. Ogata. Lipreading using Convolutional Neural Network, Proceedings of Interspeech, pp. 1149–1153, Sep. 2014, Singapore.
2. K. Noda, H. Arie, Y. Suga, and T. Ogata. Intersensory causality modeling using deep neural networks, Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics (SMC 2013), pp. 1995–2000, Oct. 2013, Manchester, UK.
3. K. Noda, H. Arie, Y. Suga, and T. Ogata. Multimodal integration learning of object manipulation behaviors using deep neural networks, Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2013), pp. 1728–1733, Nov. 2013, Tokyo, Japan.
Domestic Conferences
1. K. Noda, H. Arie, Y. Suga, and T. Ogata. Sensory-Motor Integration and Understanding of Co-occurrence in Robots Using Deep Learning, 3rd Annual Meeting of the Japanese Society for Developmental Neuroscience, Oct. 2014.
2. K. Noda, Y. Yamaguchi, K. Nakadai, H. G. Okuno, and T. Ogata. Multimodal Speech Recognition Using Deep Neural Networks, 32nd Annual Conference of the Robotics Society of Japan, 1I1–04, Sep. 2014.
3. K. Noda, H. Arie, Y. Suga, and T. Ogata. Adaptive Behavior Selection for a Humanoid Robot Using Deep Neural Networks, GPU Technology Conference Japan, 2014–8001, July 2014.
4. K. Noda, H. Arie, Y. Suga, and T. Ogata. Object Manipulation Behavior Recognition by a Humanoid Robot via a Sensory-Motor Integration Mechanism Using Deep Neural Networks, JSME Conference on Robotics and Mechatronics, 3P2–P03, May 2014.
5. K. Noda, H. Arie, Y. Suga, and T. Ogata. Integration and Co-occurrence of Visual, Auditory, and Motion Data Using Deep Neural Networks, 28th Annual Conference of the Japanese Society for Artificial Intelligence, 3H4–OS–24b–3, May 2014.
6. Y. Yamaguchi, K. Noda, K. Nakadai, H. G. Okuno, and T. Ogata. Feature Learning for Multimodal Speech Recognition Using Deep Neural Networks, 76th National Convention of the Information Processing Society of Japan, 5S–3, Mar. 2014.
7. K. Noda, H. Arie, Y. Suga, and T. Ogata. Memory Learning and Generation of Object Manipulation Behaviors by a Humanoid Robot Using Deep Neural Networks, 27th Annual Conference of the Japanese Society for Artificial Intelligence, 2G4–OS–19a–2, June 2013.
8. K. Noda, H. Arie, Y. Suga, and T. Ogata. Adaptive Behavior Selection of a Humanoid Robot via an Associative Memory Mechanism Using Deep Neural Networks, JSME Conference on Robotics and Mechatronics, 1P1–B01, May 2013.
Other Publications
Journal Papers
1. Y. Hoshino, K. Kawamoto, K. Noda, and K. Sabe. Self-Regulation Mechanism: A Principle for Continual Autonomous Learning in Open-Ended Environments, Journal of the Robotics Society of Japan, Vol.29, Issue 1, pp. 77–88, Jan. 2011.
2. M. Suzuki, K. Noda, Y. Suga, T. Ogata, and S. Sugano. Dynamic Perception af-
ter Visually-Guided Grasping by a Human-Like Autonomous Robot, Advanced
Robotics, Vol.20, No. 2, pp. 233–254, Feb. 2006.
3. M. Ito, K. Noda, Y. Hoshino, and J. Tani. Dynamic and interactive generation of
object handling behaviors by a small humanoid robot using a dynamic neural
network model, Neural Networks, Vol.19, Issue 3, pp. 323–337, Apr. 2006.
International Conferences
1. A. Schmitz, Y. Bansho, K. Noda, H. Iwata, T. Ogata, and S. Sugano. Tactile Object Recognition Using Deep Learning and Dropout, Proceedings of the IEEE-RAS International Conference on Humanoid Robots (Humanoids 2014), pp. 1044–1050, Nov. 2014, Madrid, Spain.
2. Y. Yamaguchi, K. Noda, S. Nishide, H. G. Okuno, and T. Ogata. Learning and Association of Synesthesia Phenomenon using Deep Neural Networks, Proceedings of the IEEE/SICE International Symposium on System Integration (SII 2013), pp. 659–664, Dec. 2013, Kobe, Japan.
3. H. Nobuta, K. Kawamoto, K. Noda, K. Sabe, H. G. Okuno, S. Nishide, and T. Ogata. Body area segmentation from visual scene based on predictability of neuro-dynamical system, Proceedings of the IEEE International Joint Conference on Neural Networks (IJCNN 2012), Jun. 2012, Brisbane, Australia.
4. K. Noda, K. Kawamoto, T. Hasuo, and K. Sabe. A generative model for developmental understanding of visuomotor experience, Proceedings of the IEEE International Conference on Development and Learning and Epigenetic Robotics (ICDL-EpiRob 2011), Aug. 2011, Frankfurt, Germany.
5. K. Noda, M. Ito, Y. Hoshino, and J. Tani. Dynamic Generation and Switching of Object Handling Behaviors by a Humanoid Robot Using a Recurrent Neural Network Model, Proceedings of the International Conference on the Simulation of Adaptive Behavior (SAB'06), Lecture Notes in Artificial Intelligence, Vol. 4095, pp. 185–196, Sep. 2006, Rome, Italy.
6. F. Tanaka, K. Noda, T. Sawada, and M. Fujita. Associated Emotion and Its Expression in an Entertainment Robot QRIO, Proceedings of the International Conference on Entertainment Computing (ICEC 2004), pp. 499–504, Sep. 2004, Eindhoven, Netherlands.
7. K. Noda, M. Suzuki, N. Tsuchiya, Y. Suga, T. Ogata, and S. Sugano. Robust Modeling of Dynamic Environment based on Robot Embodiment, Proceedings of the IEEE International Conference on Robotics and Automation (ICRA 2003), pp. 3565–3570, Sep. 2003, Taipei, Taiwan.
8. T. Ogata, T. Komiya, K. Noda, and S. Sugano. Influence of the Eye Motions in Human-Robot Communication and Motion Generation based on the Robot Body Structure, Proceedings of the IEEE-RAS International Conference on Humanoid Robots (Humanoids 2001), pp. 83–89, Nov. 2001, Tokyo, Japan.
9. T. Ogata, Y. Matsuyama, T. Komiya, M. Ida, K. Noda, and S. Sugano. Development of Emotional Communication Robot WAMOEBA-2R: Experimental Evaluation of the Emotional Communication between Robots and Humans, Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2000), pp. 175–180, Nov. 2000, Takamatsu, Japan.
Domestic Conferences
1. S. Terada, K. Noda, and T. Ogata. A Manga Author Identification System Applying CNN-Based Image Recognition, 15th SICE System Integration Division Annual Conference (SI2014), 3G2–4, Dec. 2014.
2. K. Sasaki, H. Tjandra, K. Noda, K. Takahashi, and T. Ogata. Drawing Motion Association from Drawn Images Using a Recurrent Neural Network Model, 15th SICE System Integration Division Annual Conference (SI2014), 3H2–4, Dec. 2014.
3. H. Deki, K. Noda, and T. Ogata. Generalization of Spatial Representations by Integrating Visuomotor Information Using Deep Neural Networks, 32nd Annual Conference of the Robotics Society of Japan, 1B2–01, Sep. 2014.
4. K. Takahashi, T. Ogata, H. Tjandra, K. Noda, S. Murata, H. Arie, and S. Sugano. Tool Embodiment and Acquisition of Tool Functions via Neural Network Models and Body Babbling, JSME Conference on Robotics and Mechatronics, 3P2–P02, May 2014.
5. K. Takahashi, T. Ogata, H. Tjandra, K. Noda, S. Murata, H. Arie, and S. Sugano. Tool Embodiment via Body Babbling and a Recurrent Neural Network Model: Image Feature Extraction by Deep Learning, 28th Annual Conference of the Japanese Society for Artificial Intelligence, 1I4–OS–09a–4, May 2014.
6. H. Arie, K. Noda, Y. Suga, and T. Ogata. Self-Other Discrimination Using Predictability with a Recurrent Neural Network Model, 27th Annual Conference of the Japanese Society for Artificial Intelligence, 3J3–OS–20b–1, June 2013.
7. Y. Yamaguchi, K. Noda, S. Nishide, H. G. Okuno, and T. Ogata. Learning and Association of the Synesthesia Phenomenon Using Multilayer Neural Network Models, 75th National Convention of the Information Processing Society of Japan, 1R–2, Mar. 2013.
8. H. Nobuta, K. Kawamoto, K. Noda, K. Sabe, S. Nishide, H. G. Okuno, and T. Ogata. Self-Body Area Extraction and Self-Organization of the Visuomotor System via a Neurodynamical Model, 30th Annual Conference of the Robotics Society of Japan, 2H3–2, Sep. 2012.
9. H. Nobuta, K. Kawamoto, K. Noda, K. Sabe, H. G. Okuno, and T. Ogata. Visual Field Change Prediction and the Emergence of Place-Perception Neurons Using a Recurrent Neural Network Model, 74th National Convention of the Information Processing Society of Japan, 5P–8, Mar. 2012, Nagoya Institute of Technology.
10. K. Noda, M. Suzuki, T. Ogata, and S. Sugano. Embodiment-Based Novelty Detection in the Environment and the Robot Itself, 20th Annual Conference of the Robotics Society of Japan, 1C31, Oct. 2002.
11. T. Komiya, K. Noda, N. Tsuchiya, T. Ogata, and S. Sugano. Motion Generation via Whole-Body Coordination Using Distributed Agents, JSME Conference on Robotics and Mechatronics, 2P1–D06, June 2002.
12. K. Noda, M. Ida, T. Ogata, and S. Sugano. Communication between Humans and a Robot with an Embodiment-Based State Expression Function, JSME Conference on Robotics and Mechatronics, 1P1–D10(1), June 2001.
13. T. Ogata, Y. Matsuyama, T. Komiya, M. Ida, K. Noda, and S. Sugano. An Experimental Study of Communication between Humans and an Autonomous Robot: System Design and Cross-Population Comparison of Psychological Evaluations, 18th Annual Conference of the Robotics Society of Japan, pp. 479–480, Sep. 2000.
14. T. Ogata, Y. Matsuyama, T. Komiya, M. Ida, K. Noda, and S. Sugano. Development of the Autonomous Robot WAMOEBA-2R: Arm System Installation and Psychological Experiments, JSME Conference on Robotics and Mechatronics, 1A1–80–114, May 2000.
15. T. Ogata, Y. Matsuyama, T. Komiya, M. Ida, K. Noda, and S. Sugano. Development of the Emotional Communication Robot Wamoeba-2R: System Configuration and Evaluation Experiments, 5th Robotics Symposia, pp. 68–73, Mar. 2000.