chapter 3: three classes of deep learning networksjcho/presentation/180313_dl_chap3.pdf ·...
TRANSCRIPT
SNU Spoken Language Processing Lab / 서울대학교 음성언어처리연구실
Deep Learning: Methods and Applications
Chapter 3: Three Classes of Deep Learning Network
발표자 : 조성재, 최상우
SNU Spoken Language Processing Lab / 서울대학교 음성언어처리연구실
목차
3.1. A Three-Way Categorization
딥 러닝을 3가지 방식으로 분류
3.2. Deep Networks for Unsupervised or Generative Learning
첫번째 방식인 Unsupervised learning or Generative Learning
3.3 Deep Networks for Supervised Learning
두번째 방식인 Supervised Learning
3.4 Hybrid Deep Networks
2
SNU Spoken Language Processing Lab / 서울대학교 음성언어처리연구실
Contents Overview
Generative model vs Discriminative model
Joint Distribution vs Conditional distribution
Denoising Autoencoder
Mean square reconstruction error and KL divergence
DBN RBM DBM
Sum-product network
Hessian-Free Optimization
Conditional Random fields
Deep stacking Network
Time delayed neural network
Convolutional neural network
Hybrid models4
SNU Spoken Language Processing Lab / 서울대학교 음성언어처리연구실
3.1 A Three-Way Categorization(OVERVIEW)
Category 1: Deep networks for unsupervised or generative learning
intended to capture high-order correlation of the observed or visible data for pattern analysis or synthesis
purposes when no information about target class labels is available.
Unsupervised feature or representation learning in the literature refers to this category of the deep networks.
Unsupervised learning in generative mode, may also be intended to characterize joint statistical distributions of
the visible data and their associated classes
Category 2: Deep networks for supervised learning
intended to directly provide discriminative power for pattern classification purposes, often by characterizing
the posterior distributions of classes conditioned on the visible data.
Target label data are always available in direct or indirect forms for such supervised learning.
They are also called discriminative deep networks.
Category 3: Hybrid deep Networks
the goal is discrimination which is assisted, often in a significant way, with the outcomes of generative or
unsupervised deep networks. This can be accomplished by better optimization or/and regularization of the deep
networks in category
5
SNU Spoken Language Processing Lab / 서울대학교 음성언어처리연구실
3.1 Basic Deep Learning Terminologies - 7가지 용어
1. Deep Learning
A class of machine learning techniques
The essence of deep learning is to compute hierarchical features or representations of the
observational data where the higher-level features or factors are defined from lower-level
ones.
2. Deep Belief network
Probabilistic generative models composed of multiple layers of stochastic, hidden variables.
3. Boltzmann machine
A network of symmetrically connected, neuron-like units that make stochastic decisions about
whether to be on or off
6
SNU Spoken Language Processing Lab / 서울대학교 음성언어처리연구실
3.1 Basic Deep Learning Terminologies - 7가지 용어
4. Restricted Boltzmann machine (RBM)
A special type of BM consisting of a layer of visible units and a layer of hidden units with no
visible-visible or hidden-hidden connections
5. Deep neural network(DNN)
a multilayer perceptron with many hidden layers, whose weights are fully connected and are
often initialized using either an unsupervised or a supervised pretraining technique
6. Deep Autoencoder
a “discriminative” DNN whose output targets are the data input itself rather than class labels;
hence an unsupervised learning model
7. Distributed representation
an internal representation of the observed data in such a way that they are modeled as being
explained by the interactions of many hidden factors. 7
SNU Spoken Language Processing Lab / 서울대학교 음성언어처리연구실
3.2 Deep Networks for Unsupervised or Generative Learning - Intro
Many deep networks in this category can be used to meaningfully generate samples
by sampling from the networks, with examples of RBMs, DBNs, DBMs, and generalized
denoising autoencoders and are thus generative models
Some networks in this category, however, cannot be easily sampled, with examples of
sparse coding networks and the original forms of deep autoencoders, and are thus not
generative in nature
Among the various subclasses of generative or unsupervised deep networks, the
energy-based deep models are the most common
Such composition leads to deep belief network (DBN)(Chapter 5).
8
SNU Spoken Language Processing Lab / 서울대학교 음성언어처리연구실
3.2 Deep Networks for Unsupervised or Generative Learning - Introduced Deep Networks
Deep autoencoders
Transforming autoencoder
Predictive sparse coders
De-noising (stacked) autoencoders
Deep Boltzmann machines
Mean-covariance RBM
Deep Belief Networks
Sum-product networks
Recurrent neural networks
9
SNU Spoken Language Processing Lab / 서울대학교 음성언어처리연구실
3.2 Deep Networks for Unsupervised or Generative Learning- Autoencoders
The original form of the deep autoencoder which we will give more detail about in
Chapter 4, is a typical example of this unsupervised model category
Most other forms of deep autoencoders are also unsupervised in nature, but with
quite different properties and implementations.
Examples are transforming autoencoders , predictive sparse coders and their stacked
version, and de-noising autoencoders and their stacked versions
11
SNU Spoken Language Processing Lab / 서울대학교 음성언어처리연구실
3.2 Deep Networks for Unsupervised or Generative Learning- Autoencoders
Specifically, in de-noising autoencoders, the input vectors are first corrupted by, for
example, randomly selecting a percentage of the inputs and setting them to zeros or
adding Gaussian noise to them.(안개 낀 곳에서 구별하는 것과 같은)
The encoded representations transformed from the uncorrupted data are used as the
inputs to the next level of the stacked de-noising autoencoder.
12
SNU Spoken Language Processing Lab / 서울대학교 음성언어처리연구실
3.2 Deep Networks for Unsupervised or Generative Learning- Deep Boltzmann Machines
A DBM contains many layers of hidden variables, and has no connections between the
variables within the same layer
While having a simple learning algorithm, the general BMs are very complex to study
and very slow to train.
In a DBM, each layer captures complicated, higher-order correlations between the
activities of hidden features in the layer below.
DBMs have the potential of learning internal representations that become increasingly
complex, highly desirable for solving object and speech recognition problems.
Further, the high-level representations can be built from a large supply of unlabeled
sensory inputs and very limited labeled data can then be used to only slightly fine-
tune the model for a specific task at hand.
13
SNU Spoken Language Processing Lab / 서울대학교 음성언어처리연구실
3.2 Deep Networks for Unsupervised or Generative Learning- Deep Boltzmann Machines
When the number of hidden layers of DBM is reduced to one, we have restricted
Boltzmann machine (RBM).
Like DBM, there are no hidden-to-hidden and no visible-to-visible connections in the
RBM.
The main virtue of RBM is that via composing many RBMs, many hidden layers can be
learned efficiently using the feature activations of one RBM as the training data for the
next.
14
SNU Spoken Language Processing Lab / 서울대학교 음성언어처리연구실
3.2 Deep Networks for Unsupervised or Generative Learning - Boltzmann Machines
15
SNU Spoken Language Processing Lab / 서울대학교 음성언어처리연구실
3.2 Deep Networks for Unsupervised or Generative Learning - Restricted Boltzmann Machines
16
SNU Spoken Language Processing Lab / 서울대학교 음성언어처리연구실
3.2 Deep Networks for Unsupervised or Generative Learning - Restricted Boltzmann Machines
17
SNU Spoken Language Processing Lab / 서울대학교 음성언어처리연구실
3.2 Deep Networks for Unsupervised or Generative Learning - Deep Boltzmann Machines
18
SNU Spoken Language Processing Lab / 서울대학교 음성언어처리연구실
3.2 Deep Networks for Unsupervised or Generative Learning - Deep Belief Networks
19
SNU Spoken Language Processing Lab / 서울대학교 음성언어처리연구실
3.2 Deep Networks for Unsupervised or Generative Learning - DBM vs. DBN
20
SNU Spoken Language Processing Lab / 서울대학교 음성언어처리연구실
3.2 Deep Networks for Unsupervised or Generative Learning - Mean-Covariance RBM(mcRBM)
The standard DBN has been extended to the factored higher-order Boltzmann machine
in its bottom layer, with strong results for phone recognition obtained (Dahl et. al., 2010).
This model, called the mean-covariance RBM or mcRBM, recognizes the limitation of the
standard RBM in its ability to represent the covariance structure of the data.
However, it is difficult to train mcRBMs and to use them at the higher levels of the deep
architecture.
the mcRBM parameters in the full DBN are not fine-tuned using the discriminative
information, which is used for fine tuning the higher layers of RBMs, due to the high
computational cost.
21
SNU Spoken Language Processing Lab / 서울대학교 음성언어처리연구실
3.2 Deep Networks for Unsupervised or Generative Learning - Sum Product Networks (SPN)
Another representative deep generative network that can be used for unsupervised
(as well as supervised) learning is the sum-product network or SPN.
An SPN is a directed acyclic graph with the observed variables as leaves, and with sum and
product operations as internal nodes in the deep network.
The “sum” nodes give mixture models, and the “product” nodes build up the feature
hierarchy.
Properties of “completeness” and “consistency” constrain the SPN in a desirable way. The
learning of SPNs is carried out using the EM algorithm together with back-propagation.
The learning procedure starts with a dense SPN. It then finds an SPN structure by
learning its weights, where zero weights indicate removed connections.
22
SNU Spoken Language Processing Lab / 서울대학교 음성언어처리연구실
3.2 Deep Networks for Unsupervised or Generative Learning - Sum Product Networks (SPN)
23
SNU Spoken Language Processing Lab / 서울대학교 음성언어처리연구실
3.2 Deep Networks for Unsupervised or Generative Learning - Sum Product Networks (SPN)
The main difficulty in learning SPNs
the learning signal (i.e., the gradient) quickly dilutes when it propagates to deep layers.
It was pointed out in that early paper that despite the many desirable generative
properties in the SPN, it is difficult to fine tune the parameters using the
discriminative information, limiting its effectiveness in classification tasks.
However, this difficulty has been overcome in the subsequent work reported, where an
efficient backpropagation-style discriminative training algorithm for SPN was
presented.
24
SNU Spoken Language Processing Lab / 서울대학교 음성언어처리연구실
3.2 Deep Networks for Unsupervised or Generative Learning - Sum Product Networks (SPN)
Importantly, the standard gradient descent, based on the derivative of the conditional
likelihood, suffers from the same gradient diffusion problem well known in the regular
DNNs.
The trick to alleviate this problem in learning SPNs is to replace the marginal inference
with the most probable state of the hidden variables and to propagate gradients
through this “hard” alignment only.
Excellent results on small-scale image recognition tasks were reported by Gens and
Domingo (2012).
25
SNU Spoken Language Processing Lab / 서울대학교 음성언어처리연구실
3.2 Deep Networks for Unsupervised or Generative Learning - Recurrent Neural Networks (RNN)
Recurrent neural networks (RNNs) can be considered as another class of deep
networks for unsupervised (as well as supervised) learning, where the depth can be as
large as the length of the input data sequence.
In the unsupervised learning mode, the RNN is used to predict the data sequence in
the future using the previous data samples, and no additional class information is used
for learning.
26
SNU Spoken Language Processing Lab / 서울대학교 음성언어처리연구실
3.2 Deep Networks for Unsupervised or Generative Learning - Recurrent Neural Networks (RNN)
The RNN is very powerful for modeling sequence data (e.g., speech or text), but until
recently they had not been widely used partly because they are difficult to train to
capture long term dependencies, giving rise to gradient vanishing or gradient
explosion problems.
Recent advances in Hessian-free optimization have also partially overcome this difficulty
using approximated second-order information or stochastic curvature estimates.
In the more recent work , RNNs that are trained with Hessian-free optimization are
used as a generative deep network in the character-level language
27
SNU Spoken Language Processing Lab / 서울대학교 음성언어처리연구실
3.2 Deep Networks for Unsupervised or Generative Learning - Recurrent Neural Networks (RNN)
28
SNU Spoken Language Processing Lab / 서울대학교 음성언어처리연구실
3.2 Deep Networks for Unsupervised or Generative Learning - Recurrent Neural Networks (RNN)
Modeling tasks, where gated connections are introduced to allow the current input
characters to predict the transition from one latent state vector to the next.
Such generative RNN models are demonstrated to be well capable of generating
sequential text characters.
More recently, Bengio et al. (2013) and Sutskever (2013) have explored variations of
stochastic gradient descent optimization algorithms in training generative RNNs and
shown that these algorithms can outperform Hessian-free optimization methods.
Molotov et al. (2010) have reported excellent results on using RNNs for language
modeling.
More recently, Mesnil et al. (2013) and Yao et al. (2013) reported the success of RNNs in
spoken language understanding.
29
SNU Spoken Language Processing Lab / 서울대학교 음성언어처리연구실
3.2 Deep Networks for Unsupervised or Generative Learning - Dynamic and Deep Structures in Speech Recognition
There has been a long history in speech recognition research where human speech
production mechanisms are exploited to construct dynamic and deep structure in
probabilistic generative models
Specifically, the early work generalized and extended the conventional shallow and
conditionally independent HMM structure by imposing dynamic constraints, in the
form of polynomial trajectory, on the HMM parameters.
A variant of this approach has been more recently developed using different learning
techniques for time-varying HMM parameters and with the applications extended to
speech recognition robustness (Yu and Deng, 2009; Yu et al., 2009a).
Similar trajectory HMMs also form the basis for parametric speech synthesis
30
SNU Spoken Language Processing Lab / 서울대학교 음성언어처리연구실
Subsequent work added a new hidden layer into the dynamic model to explicitly
account for the target-directed, articulatory-like properties in human speech generation
More efficient implementation of this deep architecture with hidden dynamics is
achieved with non-recursive or finite impulse response (FIR) filters in more recent
studies
31
3.2 Deep Networks for Unsupervised or Generative Learning - Dynamic and Deep Structures in Speech Recognition
SNU Spoken Language Processing Lab / 서울대학교 음성언어처리연구실
3.2 Deep Networks for Unsupervised or Generative Learning - Deep Structured Generative Models
The above deep structured generative models of speech can be shown as special cases
of the more general dynamic network model and even more general dynamic
graphical models.
The graphical models can comprise many hidden layers to characterize the complex
relationship between the variables in speech generation.
Armed with powerful graphical modeling tool, the deep architecture of speech has
more recently been successfully applied to solve the very difficult problem of single-
channel, multi-talker speech recognition, where the mixed speech is the visible
variable while the un-mixed speech becomes represented in a new hidden layer in the
deep generative architecture (Rennie et al., 2010; Wohlmayr et al., 2011).
32
SNU Spoken Language Processing Lab / 서울대학교 음성언어처리연구실
3.2 Deep Networks for Unsupervised or Generative Learning - Deep Structured Generative Models
Deep generative graphical models are indeed a powerful tool in many applications due
to their capability of embedding domain knowledge.
However, they are often used with inappropriate approximations in inference, learning,
prediction, and topology design, all arising from inherent intractability in these tasks for
most real-world applications.
An even more drastic way to deal with this intractability was proposed recently by
Bengio et al. (2013b), where the need to marginalize latent variables is avoided
altogether.
33
SNU Spoken Language Processing Lab / 서울대학교 음성언어처리연구실
3.2 Deep Networks for Unsupervised or Generative Learning – to delete?
The standard statistical methods used for large-scale speech recognition and
understanding combine (shallow) hidden Markov models for speech acoustics with
higher layers of structure representing different levels of natural language hierarchy.
This combined hierarchical model can be suitably regarded as a deep generative
architecture, whose motivation and some technical detail may be found in Chapter 7 of
the recent book on “Hierarchical HMM” or HHMM.
These early deep models were formulated as directed graphical models, missing the key
aspect of “distributed representation” embodied in the more recent deep generative
networks of the DBN and DBM
Filling in this missing aspect would help improve these generative models.
34
SNU Spoken Language Processing Lab / 서울대학교 음성언어처리연구실
3.2 Deep Networks for Unsupervised or Generative Learning – to delete?
Finally, dynamic or temporally recursive generative models based on neural network
architectures can be found in (Taylor et al., 2007) for human motion modeling for
natural language and natural scene parsing.
The latter model is particularly interesting because the learning algorithms are capable of
automatically determining the optimal model structure. This contrasts with other deep
architectures such as DBN where only the parameters are learned while the architectures
need to be pre-defined.
Specifically, the recursive structure commonly found in natural scene images and in natural
language sentences can be discovered using a max-margin structure prediction architecture.
It is shown that the units contained in the images or sentences are identified, and the
way in which these units interact with each other to form the whole is also identified.
35
SNU Spoken Language Processing Lab / 서울대학교 음성언어처리연구실
Many of the discriminative techniques for supervised learning in signal and
information processing are shallow architectures such as conditional random fields
(CRFs)
A CRF is intrinsically a shallow discriminative architecture, Characterized by the linear
relationship between the input features and the transition features.
36
3.3 Deep Networks for Supervised Learning - CRF
SNU Spoken Language Processing Lab / 서울대학교 음성언어처리연구실
Recently, deep-structured CRFs have been developed
by stacking the output in each lower layer of the CRF,
together with the original input data, onto its higher
layer.
Various versions of deep-structured CRFs are successfully
applied to phone recognition, spoken language
identification, and natural language processing.
37
3.3 Deep Networks for Supervised Learning – deep-structured CRF
SNU Spoken Language Processing Lab / 서울대학교 음성언어처리연구실
3.3 Deep Networks for Supervised Learning – Multilayer perceptron (MLP)
Morgan (2012) gives an excellent review on other major existing discriminative
models in speech recognition based mainly on the traditional neural network or MLP
architecture using backpropagation learning with random initialization.
It argues for the importance of both the increased width of each layer of the neural
networks and the increased depth.
In particular, a class of deep neural network models forms the basis of the popular
“tandem” approach (Morgan et al., 2005), where the output of the discriminatively
learned neural network is treated as part of the observation variable in HMMs.
38
SNU Spoken Language Processing Lab / 서울대학교 음성언어처리연구실
3.3 Deep Networks for Supervised Learning – Deep Stacking Network (DSN)
In the most recent work of a new deep learning
architecture, sometimes called Deep Stacking Network
(DSN), together with its tensor variant and its kernel
version are developed that all focus on discrimination
with scalable, parallelizable learning relying on little or
no generative component.
We will describe this type of discriminative deep
architecture in detail in Chapter 6.
39
SNU Spoken Language Processing Lab / 서울대학교 음성언어처리연구실
3.3 Deep Networks for Supervised Learning – Recurrent Neural Network (RNN)
RNNs can also be used as a discriminative model where the output is a label
sequence associated with the input data sequence.
40
SNU Spoken Language Processing Lab / 서울대학교 음성언어처리연구실
3.3 Deep Networks for Supervised Learning – Recurrent Neural Network (RNN)
A set of new models and methods were proposed more recently that enable the RNNs
themselves to perform sequence classification while embedding the long-short-term
memory (LSTM) into the model.
Underlying this method is the idea of interpreting RNN outputs as the conditional
distributions over all possible label sequences given the input sequences.
𝒑(𝒚|𝒙𝟏, 𝒙𝟐, ⋯ , 𝒙𝑵)
Then, a differentiable objective function can be derived to optimize these conditional
distributions over the correct label sequences.
The effectiveness of this method has been demonstrated in handwriting recognition tasks
and in a small speech task (Graves et al., 2013, 2013a) to be discussed in more detail in
Chapter 7 of this book.
41
SNU Spoken Language Processing Lab / 서울대학교 음성언어처리연구실
Another type of discriminative deep architecture is the convolutional neural network
(CNN), in which each module consists of a convolutional layer and a pooling layer.
These modules are often stacked up with one on top of another, or with a DNN on top
of it, to form a deep model.
42
3.3 Deep Networks for Supervised Learning – convolutional neural network (CNN)
SNU Spoken Language Processing Lab / 서울대학교 음성언어처리연구실
3.3 Deep Networks for Supervised Learning – convolutional neural network (CNN)
The convolutional layer shares many weights, and the pooling layer subsamples the
output of the convolutional layer and reduces the data rate from the layer below.
The weight sharing in the convolutional layer, together with appropriately chosen
pooling schemes, endows the CNN with some “invariance” properties (e.g., translation
invariance).
CNNs have been found highly effective and been commonly used in computer vision
and image. And recently, the CNN is also found effective for speech recognition.43
SNU Spoken Language Processing Lab / 서울대학교 음성언어처리연구실
3.3 Deep Networks for Supervised Learning – time-delay neural network (TDNN)
It is useful to point out that the time-delay neural network (TDNN) developed for early
speech recognition is a special case and predecessor of the CNN when weight sharing is
limited to one of the two dimensions, i.e., time dimension, and there is no pooling layer.
It was not until recently that researchers have discovered that the time-dimension
invariance is less important than the frequency-dimension invariance for speech
recognition.
A careful analysis on the underlying reasons is described in (Deng et al., 2013), together
with a new strategy for designing the CNN’s pooling layer demonstrated to be more
effective than all previous CNNs in phone recognition
44
SNU Spoken Language Processing Lab / 서울대학교 음성언어처리연구실
3.4 Hybrid Deep Networks – two viewpoints
The term “hybrid” for this third category refers to the deep architecture that either
comprises or makes use of both generative and discriminative model components.
In the existing hybrid architectures published in the literature, the generative
component is mostly exploited to help with discrimination, which is the final goal of
the hybrid architecture.
How and why generative modeling can help with discrimination can be examined from
two viewpoints.
The optimization viewpoint where generative models trained in an unsupervised fashion can
provide excellent initialization points in highly nonlinear parameter estimation problems
The regularization perspective where the unsupervised-learning models can effectively
provide a prior on the set of functions representable by the model.
45
SNU Spoken Language Processing Lab / 서울대학교 음성언어처리연구실
3.4 Hybrid Deep Networks – DBN-DNN model
The DBN can be converted to and used as the initial model of a DNN for supervised
learning with the same network structure, which is further discriminatively trained or
fine-tuned using the target labels provided.
When the DBN is used in this way we consider this DBN-DNN model as a hybrid deep
model, where the model trained using unsupervised data helps to make the
discriminative model effective for supervised learning.
We will review details of the discriminative DNN for supervised learning in the context of
RBM/DBN generative, unsupervised pre-training in Chapter 5.
46
SNU Spoken Language Processing Lab / 서울대학교 음성언어처리연구실
3.4 Hybrid Deep Networks – DNN-CRF and DNN-HMM
Another example of the hybrid deep network is developed in (Mohamed et al., 2010),
where the DNN weights are also initialized from a generative DBN but are further
fine-tuned with a sequence-level discriminative criterion, which is the conditional
probability of the label sequence given the input feature sequence, instead of the frame-
level criterion of cross-entropy commonly used.
This can be viewed as a combination of the static DNN with the shallow
discriminative architecture of CRF. It can be shown that such a DNN-CRF is equivalent
to a hybrid deep architecture of DNN and HMM whose parameters are learned
jointly using the full-sequence maximum mutual information (MMI) criterion between
the entire label sequence and the input feature sequence.
47
SNU Spoken Language Processing Lab / 서울대학교 음성언어처리연구실
3.4 Hybrid Deep Networks - RBM
In (Larochelle and Bengio, 2008), the generative model of RBM is learned using the
discriminative criterion of posterior class-label probabilities.
Here the label vector is concatenated with the input data vector to form the
combined visible layer in the RBM.
In this way, RBM can serve as a stand-alone solution to classification problems and
the authors derived a discriminative learning algorithm for RBM as a shallow generative
model.
48
SNU Spoken Language Processing Lab / 서울대학교 음성언어처리연구실
3.4 Hybrid Deep Networks – pretraining of CNNs
A further example of hybrid deep networks is the use of generative models of DBNs to
pre-train deep convolutional neural networks (Lee et al., 2009, 2010, 2011).
Like the fully connected DNN discussed earlier, pre-training also helps to improve the
performance of deep CNNs over random initialization.
Pre-training DNNs or CNNs using a set of regularized deep autoencoders (Bengio et
al., 2013a), including denoising autoencoders, contractive autoencoders, and sparse
autoencoders, is also a similar example of the category of hybrid deep network.ks
49
SNU Spoken Language Processing Lab / 서울대학교 음성언어처리연구실
3.4 Hybrid Deep Networks – two-stage architecture
The final example given here for hybrid deep networks is based on the idea and work of
(Ney, 1999; He and Deng, 2011), where one task of discrimination (e.g., speech
recognition) produces the output (text) that serves as the input to the second task of
discrimination (e.g., machine translation).
The overall system, giving the functionality of speech translation – translating speech in
one language into text in another language – is a two-stage deep architecture
consisting of both generative and discriminative elements.
Both models of speech recognition (e.g., HMM) and of machine translation (e.g., phrasal
mapping and non-monotonic alignment) are generative in nature, but their parameters
are all learned for discrimination of the ultimate translated text given the speech data.
50
SNU Spoken Language Processing Lab / 서울대학교 음성언어처리연구실
Question
p216에 On the other hand 다음 문장에, unsupervised-learning model은 해석하기 쉽고,
domain knowledge를 embed하기 쉽고, uncertainty를 핸들하기 쉽다고 나오는데, 이
부분이 명확히 이해가 안가서, 이 부분에 대해 예를 들어 자세히 설명해 주실 것 부탁드려요.
On the other hand, the deep unsupervised-learning models, especially the probabilistic generative
ones, are easier to interpret, easier to embed domain knowledge, easier to compose, and easier to
handle uncertainty, but they are typically intractable in inference and learning for complex systems.
51
SNU Spoken Language Processing Lab / 서울대학교 음성언어처리연구실
Answer
답은 아래 논문들에 있을 것으로 예상됨. 2 -> 1 -> 3 순서로 중요.
1. G. Alain and Y. Bengio. What regularized autoencoders learn from the data
generating distribution. In Proceedings of International Conference on Learning
Representations (ICLR). 2013.
2. Y. Bengio, A. Courville, and P. Vincent. Representation learning: A review and new
perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI),
38:1798–1828, 2013.
3. Y. Bengio, E. Thibodeau-Laufer, and J. Yosinski. Deep generative stochastic networks
trainable by backprop. arXiv 1306:1091, 2013. also accepted to appear in
Proceedings of International Conference on Machine Learning (ICML), 2014.
52
SNU Spoken Language Processing Lab / 서울대학교 음성언어처리연구실
Answer
On the other hand, the deep unsupervised-learning models, especially the probabilistic generative ones, are
easier to interpret, easier to embed domain knowledge, easier to compose, and easier to handle uncertainty,
but they are typically intractable in inference and learning for complex systems.
The deep unsupervised-learning models: autoencoders, sparse coding networks, etc.
비교 대상: deep supervised-learning models: DNNs, etc.
“Easier to interpret”: 전 페이지의 논문2에서 다양한 interpret 사례를 다룸.
“Easier to embed domain knowledge”: 0부터 9까지 숫자 이미지의 representation을
학습할 때 autoencoder의 layer를 10으로 줄 수 있음.
“Easier to handle uncertainty”: ? Uncertainty = probability. “the deep unsupervised-
learning models, especially the probabilistic generative ones, are … easier to handle
uncertainty ”. Probabilistic models 이니 uncertainty(=probability)를 다루가 쉽다고 생각할
수 있음. 답은 모름.53
SNU Spoken Language Processing Lab / 서울대학교 음성언어처리연구실
Question
p220에서 SPN 구조에 대해서 그림과 함께 설명 부탁드립니다.
54
SNU Spoken Language Processing Lab / 서울대학교 음성언어처리연구실
Answer
References
[Paper (first introduced)] H. Poon and P. Domingos. Sum-product networks: A new deep a
rchitecture. In Proceedings of Uncertainty in Artificial Intelligence. 2011.
[Paper] R. Gens and P. Domingo. Discriminative learning of sum-product networks. Neural
Information Processing Systems (NIPS), 2012.
[Lecture by the author] Sum-Product Networks: The Next Generation of Deep Models by
P. Domingos (YouTube) (PDF)
Comprehensive web page (http://spn.cs.washington.edu/)
55
SNU Spoken Language Processing Lab / 서울대학교 음성언어처리연구실
Answer: Basic Structure
56
Deep structure
Basic structure
Forward
Backward
SNU Spoken Language Processing Lab / 서울대학교 음성언어처리연구실
Question
RNN을 Unsupervised learning 에 사용한다는 부분에 궁금한 점이 있습니다. 이전의 data
sample을 사용하여 future sequence를 추론하는 경우 중간중간 잘못 추론한 결과가 쌓여
결국 전체 sequence를 잘못 추론하는 결과를 야기할 수 있다고 생각합니다. 이 문제를
해결할 수 있는 장치가 있는지 궁금합니다.
58
SNU Spoken Language Processing Lab / 서울대학교 음성언어처리연구실
Answer
Training할 때 teacher forcing이라는 기법을 이용해서 문제를 완화시킬 수 있습니다 (Goodfellow
et al. 2016). 현재 예측값이 아니라 현재 참값(ground truth)을 이용해 다음 상태를 예측합니다.
59Vanilla RNN RNN with teacher forcing
Goodfellow et al. Deep Learning. The MIT Press. 2016
SNU Spoken Language Processing Lab / 서울대학교 음성언어처리연구실
Question & Answer
17,18년 오디오 관련한 여러 가지 테스크에 대해서 제출되는 여러 모델들은 아직도 기존의 머신러닝
방법인 HMM등을 사용하는 경우를 볼 수 있습니다. RNN 기반의 variants가 HMM 기반의 모델의
성능에 필적하지 못하는 경우도 있는데 이에 대해서 발표자 분들이 어떻게 생각하시는지 궁금합니다.
제가 오디오처리 전공이 아니라 ‘17,18년 오디오 관련한 여러 가지 테스크에 대해서 제출되는 여러 모델들’에
대해 알지 못합니다.
‘RNN 기반의 variants가 HMM 기반의 모델의 성능에 필적하지 못하는 경우’를 구체적으로 명시해주셨다면
이야기가 가능했을 것 같습니다.
HMM과 RNN의 가능 큰 차이는 HMM는 현재 상태를 이용해 다음 상태를 예측하고, RNN은 현재 상태와 이전
여러 상태를 이용해 예측한다는 점입니다. 다만, HMM은 현재 상태과 이전의 모든 상태를 담고 있다고
가정합니다(Markov property). RNN은 long-term dependency에 약하고 Markov property가 없기 때문에
long-term dependency를 잘 할 수 있도록 해야합니다. 그것을 잘하게끔 하기가 어렵기 때문에 RNN이 더 좋은
성능을 못내는 것이라고 생각합니다. 이와 달리 HMM은 long-term dependency를 가지게 끔 설계하기가
RNN보다 쉬워서 HMM이 더 좋은 성능을 낸다고 생각합니다.
60
SNU Spoken Language Processing Lab / 서울대학교 음성언어처리연구실
Question & Answer
상대적으로 최근에 개발되었다는 방법인 time-varying HMM parameters를 사용하는 방법
을 알고 싶습니다.
질문을 통해 time-varying HMM을 처음 접했습니다.
‘time-varying HMM parameters를 사용하는 방법’을 질문하셨는데, time-varying HMM
parameters 라는 것이 무엇을 의미하는지 찾지 못했고 사용 목적이 무엇인지 몰라서 질
문의 의도를 파악하기 어렵습니다. Wang et al. 2009에서 소개한 time varying HMM이
맞는지 궁금합니다.
61
SNU Spoken Language Processing Lab / 서울대학교 음성언어처리연구실
Question & Answer
CRF 에 대한 기본개념이 부족한 상태에서 드리는 문의사항입니다. CRF 가 shallow layer 라고 할 때 이
를 deep structure 로 쌓는다고 하는데요.(p31) 이렇게 쌓아진 구조와 DNN 구조는 어떤 차이를 갖는지
궁금합니다. 가장 단순한 비교로 1 layer DNN 의 구조와 CRF는 어떻게 다른지 궁금합니다.
62
CRF 구조의 예시
Deep structured CRF 구조의 예시
DNN 구조의 예시
SNU Spoken Language Processing Lab / 서울대학교 음성언어처리연구실
Question & Answer
Are all discrimitive models learned by supervised learning methods?
Discriminative 모델은 p(y|x)를 학습하는 것이 목적이기 때문에 y가 주어지지 않다면
이를 학습할 수 없다고 생각합니다. 하지만 hybrid 모델처럼 discriminative 모델의
pretraining에 unsupervised learning 방법을 적용하는 것은 가능합니다.
63
SNU Spoken Language Processing Lab / 서울대학교 음성언어처리연구실
Question & Answer
3.4에서 hybrid deep neural network의 한 예로, pre-trained DNN, CNN이 deep
autoencoder를 사용하는 예를 제시하였는데, 그렇다면 deep autoencoder에서 학습된
output이 pre-trained DNN, CNN에서 input으로 feed 된다는 의미인지요?
이때 deep autoencoder와 pre-trained DNN/CNN 각각의 역할이 무엇일지 궁금합니다.
Deep autoencoder에서 학습된 encoder를 사용하여 input으로부터 feature를 얻게 되고
이를 DNN/CNN의 input으로 사용한다는 의미입니다. 이때 deep autoencoder의
encoder가 feature extractor의 역할을 하게되고 DNN/CNN은 classifier의 역할을 하게
됩니다.
64
SNU Spoken Language Processing Lab / 서울대학교 음성언어처리연구실
Question & Answer
음성 데이터에 대한 전처리의 결과는 많은 경우 시간 축과 frequency 축을 가지게 됩니다.
이는 1차원 시계열 데이터와도, 그리고 2차원의 이미지와도 유사한 성질을 가지고 있는 것
같은데 이렇게 전처리된 데이터에 대해 RNN, 시간 축에 대한 1차원 CNN(TDNN), 2차원
CNN을 적용하는 것이 어떻게 다른지, 그리고 각 구조는 어떤 상황에 사용하는 것이 좋은지
궁금합니다.
먼저 CNN 모델이 RNN 모델과 달리 병렬화가 쉬워 모델의 수행속도가 빠르기 때문에
수행속도가 중요한 경우에 사용하기 좋다고 생각합니다. 그리고 2차원 CNN이 서로 다른
시간 축들을 동시에 고려하기 때문에 1차원 CNN에 비하여 성능이 우수할 것으로
예상합니다.
65
SNU Spoken Language Processing Lab / 서울대학교 음성언어처리연구실
Question & Answer
hybrid 모델에는 역의 방향도 있는데 (generative for discriminative), 책에서는 전자가 후자
의 prior로 작용하는 regularization을 언급하였습니다(p.226). prior가 된다는 것은 무슨 과정
을 의미하는지 (generative 모델의 결과가 어떻게 discriminative 모델로 feed-in 되는지), 그
리고 이것이 적절한 prior라는 근거가 있는지 궁금합니다.
Generative 모델이 discriminative 모델의 초깃값으로 활용된다는 의미입니다. 이것이 적
절한 priror라는 것을 증명한 논문은 읽어보지 못했으며 경험적으로 모델의 weight들을
랜덤으로 초기화하는 것보다 학습이 더 잘되고 성능이 우수하기 때문이라고 생각합니다.
66
SNU Spoken Language Processing Lab / 서울대학교 음성언어처리연구실
Question & Answer
교재 221 page 1~9번째 줄 문장에서 언급한 "Hessian-free optimization" 방법이 궁금합니
다. "stochastic gradient descent optimization algorithms" 방법과 비교하였을 때 어떤 차이
점(장단점)이 있나요?
67
이론적으로는 SGD보다 더 최적해를 잘
찾지만 구현이 SGD에 비해 어려우며,
굉장히 불안정하고 conjugate gradient
단계의 계산량이 너무 많다고 합니다.
SNU Spoken Language Processing Lab / 서울대학교 음성언어처리연구실
Question & Answer
Can hybrid deep networks solve issues of sparsity and the existence of multiple
conceptual interpretations to the same word in speech processing context? If still not,
then what do we need to achieve that?
Hybrid deep network가 아니더라도 충분한 양의 데이터가 주어지고, RNN 계열의 모델
을 사용하면 장단기적인 문맥을 모두 고려할 수 있기때문에 문제를 해결할 수 있다고 생
각합니다.
68
SNU Spoken Language Processing Lab / 서울대학교 음성언어처리연구실
Question & Answer
vanilla RNN은 gradient vanishing문제에 직면하기 쉽다고 알고 있는데, text book에서는
"Mesnil et al. (2013) and Yao et al. (2013) reported the success of RNNs in spoken
language understanding." 라고 합니다. 그렇다면 이 실험에서는 vanilla RNN이 아닌 발전된
RNN이 적용된 것인가요?
두 연구에서는 Elman-type RNN과 Jordan-type RNN 이라는 vanilla RNN을 변형한 모델
을 적용하였습니다. 그렇지만 이 모델들은 long short-term memory (LSTM), gated
recurrent unit (GRU)과 달리 gradient vanishing 문제를 해결하기 위해 제안된 모델은 아
닙니다.
69
SNU Spoken Language Processing Lab / 서울대학교 음성언어처리연구실
Question
현재는 많이 사용되고 있지는 않을지도 모르지만, Deep Learning에서 Deep Neural
Network의 initialization 이야기할 때 빠지지 않는 방법 중 하나가 RBM 혹은 이것을 확장한
DBN을 이용하여 weight initialization을 하는 방법입니다. 다만 RBM이 bipartite구조로 각
노드의 확률을 sampling 한다는 점에서 이걸로 어떻게 weight initialization을 하는지 잘 모
르겠습니다. 4단원 앞부분에 그 내용이 설명되어있기는 한데, 이렇게 초기화된 weight에 어
떤 통계적인 의의가 있을까요?
70
SNU Spoken Language Processing Lab / 서울대학교 음성언어처리연구실
Answer
RBM에서 hidden layer에 대응되는 확률을 구하는 과정에서 visible layer의 값들에
weight들을 곱하게 되는데 이 weight들로 DNN의 weight를 초기화한다는 의미입니다.
이렇게 초기화된 weight가 모델에 좋은 prior를 제공해준다는 통계적인 의의가 있습니다.
71
SNU Spoken Language Processing Lab / 서울대학교 음성언어처리연구실
Question & Answer
25쪽에서는 "Unsupervised learning refers to no use of task specific supervision
information"이라고 하고 있습니다. 하지만 여기서 표현하는 "task specific supervision
information"을 input 그 자체라고 생각하면 곧 generative model이 되는 것이고, 이렇게
생각할 수 있다면 Unsupervised learning과 Supervised learning의 차이가 없어지는
것입니다. 즉, x와 y가 모두 주어지고 y=f(x)에서 함수 f를 학습시키는것이 감독학습인데,
x=f(x)라고 볼 수 있다면 무감독학습이 되는것이죠. 이런 관점에서 본다면 감독학습과
무감독학습의 수학적 차이는 없는것인데, 서로 다른 차이점을 어떻게 설명할 수 있을까요?
Unsupervised learning과 supervised learning의 목적에 차이가 존재합니다. Supervised
learning의 경우에는 레이블이 주어진 상황에서 데이터의 분류나 회귀분석이 목적이라면
unsupervised learning의 목적은 데이터의 레이블이 없는 상황에서 데이터의 hidden
structure를 파악하는 것이 목적입니다.
72
SNU Spoken Language Processing Lab / 서울대학교 음성언어처리연구실
Question & Answer
28쪽에서는 DBM이 speech recognition problem에 사용될 수 있다고 합니다. 하지만 제가
알기론 보통 DBM은 고정되어 있는 길이의 모델을 통해서 inference를 하는 것인데, speech
recognition을 하려고 하면 주어지는 문장의 길이가 모르는 상태에서 어떻게 inference를 할
수 있는건가요? DBM이 speech recognition을 처리하는 방법이 궁금합니다.
먼저, 주어진 음성데이터를 고정된 길이들로 나눈 후에 이들의 mfcc, melspectrogram
feature들을 추출하고, 이들로부터 다시 한번 DBM를 사용하여 고차원의 feature를 추출
합니다. 이들의 시퀀스를 마지막으로 RNN에 통과시키면 길이가 모르는 상태에서도
inference 가 가능하다고 생각합니다.
73