chapter 3: three classes of deep learning networksjcho/presentation/180313_dl_chap3.pdf ·...

74
SNU Spoken Language Processing Lab / 서울대학교 음성언어처리연구실 Deep Learning: Methods and Applications Chapter 3: Three Classes of Deep Learning Network 발표자 : 조성재, 최상우

Upload: others

Post on 21-May-2020

5 views

Category:

Documents


0 download

TRANSCRIPT

SNU Spoken Language Processing Lab / 서울대학교 음성언어처리연구실

Deep Learning: Methods and Applications

Chapter 3: Three Classes of Deep Learning Network

발표자 : 조성재, 최상우

SNU Spoken Language Processing Lab / 서울대학교 음성언어처리연구실

목차

3.1. A Three-Way Categorization

딥 러닝을 3가지 방식으로 분류

3.2. Deep Networks for Unsupervised or Generative Learning

첫번째 방식인 Unsupervised learning or Generative Learning

3.3 Deep Networks for Supervised Learning

두번째 방식인 Supervised Learning

3.4 Hybrid Deep Networks

2

SNU Spoken Language Processing Lab / 서울대학교 음성언어처리연구실

Machine Learning Basic Concept

3

SNU Spoken Language Processing Lab / 서울대학교 음성언어처리연구실

Contents Overview

Generative model vs Discriminative model

Joint Distribution vs Conditional distribution

Denoising Autoencoder

Mean square reconstruction error and KL divergence

DBN RBM DBM

Sum-product network

Hessian-Free Optimization

Conditional Random fields

Deep stacking Network

Time delayed neural network

Convolutional neural network

Hybrid models4

SNU Spoken Language Processing Lab / 서울대학교 음성언어처리연구실

3.1 A Three-Way Categorization(OVERVIEW)

Category 1: Deep networks for unsupervised or generative learning

intended to capture high-order correlation of the observed or visible data for pattern analysis or synthesis

purposes when no information about target class labels is available.

Unsupervised feature or representation learning in the literature refers to this category of the deep networks.

Unsupervised learning in generative mode, may also be intended to characterize joint statistical distributions of

the visible data and their associated classes

Category 2: Deep networks for supervised learning

intended to directly provide discriminative power for pattern classification purposes, often by characterizing

the posterior distributions of classes conditioned on the visible data.

Target label data are always available in direct or indirect forms for such supervised learning.

They are also called discriminative deep networks.

Category 3: Hybrid deep Networks

the goal is discrimination which is assisted, often in a significant way, with the outcomes of generative or

unsupervised deep networks. This can be accomplished by better optimization or/and regularization of the deep

networks in category

5

SNU Spoken Language Processing Lab / 서울대학교 음성언어처리연구실

3.1 Basic Deep Learning Terminologies - 7가지 용어

1. Deep Learning

A class of machine learning techniques

The essence of deep learning is to compute hierarchical features or representations of the

observational data where the higher-level features or factors are defined from lower-level

ones.

2. Deep Belief network

Probabilistic generative models composed of multiple layers of stochastic, hidden variables.

3. Boltzmann machine

A network of symmetrically connected, neuron-like units that make stochastic decisions about

whether to be on or off

6

SNU Spoken Language Processing Lab / 서울대학교 음성언어처리연구실

3.1 Basic Deep Learning Terminologies - 7가지 용어

4. Restricted Boltzmann machine (RBM)

A special type of BM consisting of a layer of visible units and a layer of hidden units with no

visible-visible or hidden-hidden connections

5. Deep neural network(DNN)

a multilayer perceptron with many hidden layers, whose weights are fully connected and are

often initialized using either an unsupervised or a supervised pretraining technique

6. Deep Autoencoder

a “discriminative” DNN whose output targets are the data input itself rather than class labels;

hence an unsupervised learning model

7. Distributed representation

an internal representation of the observed data in such a way that they are modeled as being

explained by the interactions of many hidden factors. 7

SNU Spoken Language Processing Lab / 서울대학교 음성언어처리연구실

3.2 Deep Networks for Unsupervised or Generative Learning - Intro

Many deep networks in this category can be used to meaningfully generate samples

by sampling from the networks, with examples of RBMs, DBNs, DBMs, and generalized

denoising autoencoders and are thus generative models

Some networks in this category, however, cannot be easily sampled, with examples of

sparse coding networks and the original forms of deep autoencoders, and are thus not

generative in nature

Among the various subclasses of generative or unsupervised deep networks, the

energy-based deep models are the most common

Such composition leads to deep belief network (DBN)(Chapter 5).

8

SNU Spoken Language Processing Lab / 서울대학교 음성언어처리연구실

3.2 Deep Networks for Unsupervised or Generative Learning - Introduced Deep Networks

Deep autoencoders

Transforming autoencoder

Predictive sparse coders

De-noising (stacked) autoencoders

Deep Boltzmann machines

Mean-covariance RBM

Deep Belief Networks

Sum-product networks

Recurrent neural networks

9

SNU Spoken Language Processing Lab / 서울대학교 음성언어처리연구실 10

SNU Spoken Language Processing Lab / 서울대학교 음성언어처리연구실

3.2 Deep Networks for Unsupervised or Generative Learning- Autoencoders

The original form of the deep autoencoder which we will give more detail about in

Chapter 4, is a typical example of this unsupervised model category

Most other forms of deep autoencoders are also unsupervised in nature, but with

quite different properties and implementations.

Examples are transforming autoencoders , predictive sparse coders and their stacked

version, and de-noising autoencoders and their stacked versions

11

SNU Spoken Language Processing Lab / 서울대학교 음성언어처리연구실

3.2 Deep Networks for Unsupervised or Generative Learning- Autoencoders

Specifically, in de-noising autoencoders, the input vectors are first corrupted by, for

example, randomly selecting a percentage of the inputs and setting them to zeros or

adding Gaussian noise to them.(안개 낀 곳에서 구별하는 것과 같은)

The encoded representations transformed from the uncorrupted data are used as the

inputs to the next level of the stacked de-noising autoencoder.

12

SNU Spoken Language Processing Lab / 서울대학교 음성언어처리연구실

3.2 Deep Networks for Unsupervised or Generative Learning- Deep Boltzmann Machines

A DBM contains many layers of hidden variables, and has no connections between the

variables within the same layer

While having a simple learning algorithm, the general BMs are very complex to study

and very slow to train.

In a DBM, each layer captures complicated, higher-order correlations between the

activities of hidden features in the layer below.

DBMs have the potential of learning internal representations that become increasingly

complex, highly desirable for solving object and speech recognition problems.

Further, the high-level representations can be built from a large supply of unlabeled

sensory inputs and very limited labeled data can then be used to only slightly fine-

tune the model for a specific task at hand.

13

SNU Spoken Language Processing Lab / 서울대학교 음성언어처리연구실

3.2 Deep Networks for Unsupervised or Generative Learning- Deep Boltzmann Machines

When the number of hidden layers of DBM is reduced to one, we have restricted

Boltzmann machine (RBM).

Like DBM, there are no hidden-to-hidden and no visible-to-visible connections in the

RBM.

The main virtue of RBM is that via composing many RBMs, many hidden layers can be

learned efficiently using the feature activations of one RBM as the training data for the

next.

14

SNU Spoken Language Processing Lab / 서울대학교 음성언어처리연구실

3.2 Deep Networks for Unsupervised or Generative Learning - Boltzmann Machines

15

SNU Spoken Language Processing Lab / 서울대학교 음성언어처리연구실

3.2 Deep Networks for Unsupervised or Generative Learning - Restricted Boltzmann Machines

16

SNU Spoken Language Processing Lab / 서울대학교 음성언어처리연구실

3.2 Deep Networks for Unsupervised or Generative Learning - Restricted Boltzmann Machines

17

SNU Spoken Language Processing Lab / 서울대학교 음성언어처리연구실

3.2 Deep Networks for Unsupervised or Generative Learning - Deep Boltzmann Machines

18

SNU Spoken Language Processing Lab / 서울대학교 음성언어처리연구실

3.2 Deep Networks for Unsupervised or Generative Learning - Deep Belief Networks

19

SNU Spoken Language Processing Lab / 서울대학교 음성언어처리연구실

3.2 Deep Networks for Unsupervised or Generative Learning - DBM vs. DBN

20

SNU Spoken Language Processing Lab / 서울대학교 음성언어처리연구실

3.2 Deep Networks for Unsupervised or Generative Learning - Mean-Covariance RBM(mcRBM)

The standard DBN has been extended to the factored higher-order Boltzmann machine

in its bottom layer, with strong results for phone recognition obtained (Dahl et. al., 2010).

This model, called the mean-covariance RBM or mcRBM, recognizes the limitation of the

standard RBM in its ability to represent the covariance structure of the data.

However, it is difficult to train mcRBMs and to use them at the higher levels of the deep

architecture.

the mcRBM parameters in the full DBN are not fine-tuned using the discriminative

information, which is used for fine tuning the higher layers of RBMs, due to the high

computational cost.

21

SNU Spoken Language Processing Lab / 서울대학교 음성언어처리연구실

3.2 Deep Networks for Unsupervised or Generative Learning - Sum Product Networks (SPN)

Another representative deep generative network that can be used for unsupervised

(as well as supervised) learning is the sum-product network or SPN.

An SPN is a directed acyclic graph with the observed variables as leaves, and with sum and

product operations as internal nodes in the deep network.

The “sum” nodes give mixture models, and the “product” nodes build up the feature

hierarchy.

Properties of “completeness” and “consistency” constrain the SPN in a desirable way. The

learning of SPNs is carried out using the EM algorithm together with back-propagation.

The learning procedure starts with a dense SPN. It then finds an SPN structure by

learning its weights, where zero weights indicate removed connections.

22

SNU Spoken Language Processing Lab / 서울대학교 음성언어처리연구실

3.2 Deep Networks for Unsupervised or Generative Learning - Sum Product Networks (SPN)

23

SNU Spoken Language Processing Lab / 서울대학교 음성언어처리연구실

3.2 Deep Networks for Unsupervised or Generative Learning - Sum Product Networks (SPN)

The main difficulty in learning SPNs

the learning signal (i.e., the gradient) quickly dilutes when it propagates to deep layers.

It was pointed out in that early paper that despite the many desirable generative

properties in the SPN, it is difficult to fine tune the parameters using the

discriminative information, limiting its effectiveness in classification tasks.

However, this difficulty has been overcome in the subsequent work reported, where an

efficient backpropagation-style discriminative training algorithm for SPN was

presented.

24

SNU Spoken Language Processing Lab / 서울대학교 음성언어처리연구실

3.2 Deep Networks for Unsupervised or Generative Learning - Sum Product Networks (SPN)

Importantly, the standard gradient descent, based on the derivative of the conditional

likelihood, suffers from the same gradient diffusion problem well known in the regular

DNNs.

The trick to alleviate this problem in learning SPNs is to replace the marginal inference

with the most probable state of the hidden variables and to propagate gradients

through this “hard” alignment only.

Excellent results on small-scale image recognition tasks were reported by Gens and

Domingo (2012).

25

SNU Spoken Language Processing Lab / 서울대학교 음성언어처리연구실

3.2 Deep Networks for Unsupervised or Generative Learning - Recurrent Neural Networks (RNN)

Recurrent neural networks (RNNs) can be considered as another class of deep

networks for unsupervised (as well as supervised) learning, where the depth can be as

large as the length of the input data sequence.

In the unsupervised learning mode, the RNN is used to predict the data sequence in

the future using the previous data samples, and no additional class information is used

for learning.

26

SNU Spoken Language Processing Lab / 서울대학교 음성언어처리연구실

3.2 Deep Networks for Unsupervised or Generative Learning - Recurrent Neural Networks (RNN)

The RNN is very powerful for modeling sequence data (e.g., speech or text), but until

recently they had not been widely used partly because they are difficult to train to

capture long term dependencies, giving rise to gradient vanishing or gradient

explosion problems.

Recent advances in Hessian-free optimization have also partially overcome this difficulty

using approximated second-order information or stochastic curvature estimates.

In the more recent work , RNNs that are trained with Hessian-free optimization are

used as a generative deep network in the character-level language

27

SNU Spoken Language Processing Lab / 서울대학교 음성언어처리연구실

3.2 Deep Networks for Unsupervised or Generative Learning - Recurrent Neural Networks (RNN)

28

SNU Spoken Language Processing Lab / 서울대학교 음성언어처리연구실

3.2 Deep Networks for Unsupervised or Generative Learning - Recurrent Neural Networks (RNN)

Modeling tasks, where gated connections are introduced to allow the current input

characters to predict the transition from one latent state vector to the next.

Such generative RNN models are demonstrated to be well capable of generating

sequential text characters.

More recently, Bengio et al. (2013) and Sutskever (2013) have explored variations of

stochastic gradient descent optimization algorithms in training generative RNNs and

shown that these algorithms can outperform Hessian-free optimization methods.

Molotov et al. (2010) have reported excellent results on using RNNs for language

modeling.

More recently, Mesnil et al. (2013) and Yao et al. (2013) reported the success of RNNs in

spoken language understanding.

29

SNU Spoken Language Processing Lab / 서울대학교 음성언어처리연구실

3.2 Deep Networks for Unsupervised or Generative Learning - Dynamic and Deep Structures in Speech Recognition

There has been a long history in speech recognition research where human speech

production mechanisms are exploited to construct dynamic and deep structure in

probabilistic generative models

Specifically, the early work generalized and extended the conventional shallow and

conditionally independent HMM structure by imposing dynamic constraints, in the

form of polynomial trajectory, on the HMM parameters.

A variant of this approach has been more recently developed using different learning

techniques for time-varying HMM parameters and with the applications extended to

speech recognition robustness (Yu and Deng, 2009; Yu et al., 2009a).

Similar trajectory HMMs also form the basis for parametric speech synthesis

30

SNU Spoken Language Processing Lab / 서울대학교 음성언어처리연구실

Subsequent work added a new hidden layer into the dynamic model to explicitly

account for the target-directed, articulatory-like properties in human speech generation

More efficient implementation of this deep architecture with hidden dynamics is

achieved with non-recursive or finite impulse response (FIR) filters in more recent

studies

31

3.2 Deep Networks for Unsupervised or Generative Learning - Dynamic and Deep Structures in Speech Recognition

SNU Spoken Language Processing Lab / 서울대학교 음성언어처리연구실

3.2 Deep Networks for Unsupervised or Generative Learning - Deep Structured Generative Models

The above deep structured generative models of speech can be shown as special cases

of the more general dynamic network model and even more general dynamic

graphical models.

The graphical models can comprise many hidden layers to characterize the complex

relationship between the variables in speech generation.

Armed with powerful graphical modeling tool, the deep architecture of speech has

more recently been successfully applied to solve the very difficult problem of single-

channel, multi-talker speech recognition, where the mixed speech is the visible

variable while the un-mixed speech becomes represented in a new hidden layer in the

deep generative architecture (Rennie et al., 2010; Wohlmayr et al., 2011).

32

SNU Spoken Language Processing Lab / 서울대학교 음성언어처리연구실

3.2 Deep Networks for Unsupervised or Generative Learning - Deep Structured Generative Models

Deep generative graphical models are indeed a powerful tool in many applications due

to their capability of embedding domain knowledge.

However, they are often used with inappropriate approximations in inference, learning,

prediction, and topology design, all arising from inherent intractability in these tasks for

most real-world applications.

An even more drastic way to deal with this intractability was proposed recently by

Bengio et al. (2013b), where the need to marginalize latent variables is avoided

altogether.

33

SNU Spoken Language Processing Lab / 서울대학교 음성언어처리연구실

3.2 Deep Networks for Unsupervised or Generative Learning – to delete?

The standard statistical methods used for large-scale speech recognition and

understanding combine (shallow) hidden Markov models for speech acoustics with

higher layers of structure representing different levels of natural language hierarchy.

This combined hierarchical model can be suitably regarded as a deep generative

architecture, whose motivation and some technical detail may be found in Chapter 7 of

the recent book on “Hierarchical HMM” or HHMM.

These early deep models were formulated as directed graphical models, missing the key

aspect of “distributed representation” embodied in the more recent deep generative

networks of the DBN and DBM

Filling in this missing aspect would help improve these generative models.

34

SNU Spoken Language Processing Lab / 서울대학교 음성언어처리연구실

3.2 Deep Networks for Unsupervised or Generative Learning – to delete?

Finally, dynamic or temporally recursive generative models based on neural network

architectures can be found in (Taylor et al., 2007) for human motion modeling for

natural language and natural scene parsing.

The latter model is particularly interesting because the learning algorithms are capable of

automatically determining the optimal model structure. This contrasts with other deep

architectures such as DBN where only the parameters are learned while the architectures

need to be pre-defined.

Specifically, the recursive structure commonly found in natural scene images and in natural

language sentences can be discovered using a max-margin structure prediction architecture.

It is shown that the units contained in the images or sentences are identified, and the

way in which these units interact with each other to form the whole is also identified.

35

SNU Spoken Language Processing Lab / 서울대학교 음성언어처리연구실

Many of the discriminative techniques for supervised learning in signal and

information processing are shallow architectures such as conditional random fields

(CRFs)

A CRF is intrinsically a shallow discriminative architecture, Characterized by the linear

relationship between the input features and the transition features.

36

3.3 Deep Networks for Supervised Learning - CRF

SNU Spoken Language Processing Lab / 서울대학교 음성언어처리연구실

Recently, deep-structured CRFs have been developed

by stacking the output in each lower layer of the CRF,

together with the original input data, onto its higher

layer.

Various versions of deep-structured CRFs are successfully

applied to phone recognition, spoken language

identification, and natural language processing.

37

3.3 Deep Networks for Supervised Learning – deep-structured CRF

SNU Spoken Language Processing Lab / 서울대학교 음성언어처리연구실

3.3 Deep Networks for Supervised Learning – Multilayer perceptron (MLP)

Morgan (2012) gives an excellent review on other major existing discriminative

models in speech recognition based mainly on the traditional neural network or MLP

architecture using backpropagation learning with random initialization.

It argues for the importance of both the increased width of each layer of the neural

networks and the increased depth.

In particular, a class of deep neural network models forms the basis of the popular

“tandem” approach (Morgan et al., 2005), where the output of the discriminatively

learned neural network is treated as part of the observation variable in HMMs.

38

SNU Spoken Language Processing Lab / 서울대학교 음성언어처리연구실

3.3 Deep Networks for Supervised Learning – Deep Stacking Network (DSN)

In the most recent work of a new deep learning

architecture, sometimes called Deep Stacking Network

(DSN), together with its tensor variant and its kernel

version are developed that all focus on discrimination

with scalable, parallelizable learning relying on little or

no generative component.

We will describe this type of discriminative deep

architecture in detail in Chapter 6.

39

SNU Spoken Language Processing Lab / 서울대학교 음성언어처리연구실

3.3 Deep Networks for Supervised Learning – Recurrent Neural Network (RNN)

RNNs can also be used as a discriminative model where the output is a label

sequence associated with the input data sequence.

40

SNU Spoken Language Processing Lab / 서울대학교 음성언어처리연구실

3.3 Deep Networks for Supervised Learning – Recurrent Neural Network (RNN)

A set of new models and methods were proposed more recently that enable the RNNs

themselves to perform sequence classification while embedding the long-short-term

memory (LSTM) into the model.

Underlying this method is the idea of interpreting RNN outputs as the conditional

distributions over all possible label sequences given the input sequences.

𝒑(𝒚|𝒙𝟏, 𝒙𝟐, ⋯ , 𝒙𝑵)

Then, a differentiable objective function can be derived to optimize these conditional

distributions over the correct label sequences.

The effectiveness of this method has been demonstrated in handwriting recognition tasks

and in a small speech task (Graves et al., 2013, 2013a) to be discussed in more detail in

Chapter 7 of this book.

41

SNU Spoken Language Processing Lab / 서울대학교 음성언어처리연구실

Another type of discriminative deep architecture is the convolutional neural network

(CNN), in which each module consists of a convolutional layer and a pooling layer.

These modules are often stacked up with one on top of another, or with a DNN on top

of it, to form a deep model.

42

3.3 Deep Networks for Supervised Learning – convolutional neural network (CNN)

SNU Spoken Language Processing Lab / 서울대학교 음성언어처리연구실

3.3 Deep Networks for Supervised Learning – convolutional neural network (CNN)

The convolutional layer shares many weights, and the pooling layer subsamples the

output of the convolutional layer and reduces the data rate from the layer below.

The weight sharing in the convolutional layer, together with appropriately chosen

pooling schemes, endows the CNN with some “invariance” properties (e.g., translation

invariance).

CNNs have been found highly effective and been commonly used in computer vision

and image. And recently, the CNN is also found effective for speech recognition.43

SNU Spoken Language Processing Lab / 서울대학교 음성언어처리연구실

3.3 Deep Networks for Supervised Learning – time-delay neural network (TDNN)

It is useful to point out that the time-delay neural network (TDNN) developed for early

speech recognition is a special case and predecessor of the CNN when weight sharing is

limited to one of the two dimensions, i.e., time dimension, and there is no pooling layer.

It was not until recently that researchers have discovered that the time-dimension

invariance is less important than the frequency-dimension invariance for speech

recognition.

A careful analysis on the underlying reasons is described in (Deng et al., 2013), together

with a new strategy for designing the CNN’s pooling layer demonstrated to be more

effective than all previous CNNs in phone recognition

44

SNU Spoken Language Processing Lab / 서울대학교 음성언어처리연구실

3.4 Hybrid Deep Networks – two viewpoints

The term “hybrid” for this third category refers to the deep architecture that either

comprises or makes use of both generative and discriminative model components.

In the existing hybrid architectures published in the literature, the generative

component is mostly exploited to help with discrimination, which is the final goal of

the hybrid architecture.

How and why generative modeling can help with discrimination can be examined from

two viewpoints.

The optimization viewpoint where generative models trained in an unsupervised fashion can

provide excellent initialization points in highly nonlinear parameter estimation problems

The regularization perspective where the unsupervised-learning models can effectively

provide a prior on the set of functions representable by the model.

45

SNU Spoken Language Processing Lab / 서울대학교 음성언어처리연구실

3.4 Hybrid Deep Networks – DBN-DNN model

The DBN can be converted to and used as the initial model of a DNN for supervised

learning with the same network structure, which is further discriminatively trained or

fine-tuned using the target labels provided.

When the DBN is used in this way we consider this DBN-DNN model as a hybrid deep

model, where the model trained using unsupervised data helps to make the

discriminative model effective for supervised learning.

We will review details of the discriminative DNN for supervised learning in the context of

RBM/DBN generative, unsupervised pre-training in Chapter 5.

46

SNU Spoken Language Processing Lab / 서울대학교 음성언어처리연구실

3.4 Hybrid Deep Networks – DNN-CRF and DNN-HMM

Another example of the hybrid deep network is developed in (Mohamed et al., 2010),

where the DNN weights are also initialized from a generative DBN but are further

fine-tuned with a sequence-level discriminative criterion, which is the conditional

probability of the label sequence given the input feature sequence, instead of the frame-

level criterion of cross-entropy commonly used.

This can be viewed as a combination of the static DNN with the shallow

discriminative architecture of CRF. It can be shown that such a DNN-CRF is equivalent

to a hybrid deep architecture of DNN and HMM whose parameters are learned

jointly using the full-sequence maximum mutual information (MMI) criterion between

the entire label sequence and the input feature sequence.

47

SNU Spoken Language Processing Lab / 서울대학교 음성언어처리연구실

3.4 Hybrid Deep Networks - RBM

In (Larochelle and Bengio, 2008), the generative model of RBM is learned using the

discriminative criterion of posterior class-label probabilities.

Here the label vector is concatenated with the input data vector to form the

combined visible layer in the RBM.

In this way, RBM can serve as a stand-alone solution to classification problems and

the authors derived a discriminative learning algorithm for RBM as a shallow generative

model.

48

SNU Spoken Language Processing Lab / 서울대학교 음성언어처리연구실

3.4 Hybrid Deep Networks – pretraining of CNNs

A further example of hybrid deep networks is the use of generative models of DBNs to

pre-train deep convolutional neural networks (Lee et al., 2009, 2010, 2011).

Like the fully connected DNN discussed earlier, pre-training also helps to improve the

performance of deep CNNs over random initialization.

Pre-training DNNs or CNNs using a set of regularized deep autoencoders (Bengio et

al., 2013a), including denoising autoencoders, contractive autoencoders, and sparse

autoencoders, is also a similar example of the category of hybrid deep network.ks

49

SNU Spoken Language Processing Lab / 서울대학교 음성언어처리연구실

3.4 Hybrid Deep Networks – two-stage architecture

The final example given here for hybrid deep networks is based on the idea and work of

(Ney, 1999; He and Deng, 2011), where one task of discrimination (e.g., speech

recognition) produces the output (text) that serves as the input to the second task of

discrimination (e.g., machine translation).

The overall system, giving the functionality of speech translation – translating speech in

one language into text in another language – is a two-stage deep architecture

consisting of both generative and discriminative elements.

Both models of speech recognition (e.g., HMM) and of machine translation (e.g., phrasal

mapping and non-monotonic alignment) are generative in nature, but their parameters

are all learned for discrimination of the ultimate translated text given the speech data.

50

SNU Spoken Language Processing Lab / 서울대학교 음성언어처리연구실

Question

p216에 On the other hand 다음 문장에, unsupervised-learning model은 해석하기 쉽고,

domain knowledge를 embed하기 쉽고, uncertainty를 핸들하기 쉽다고 나오는데, 이

부분이 명확히 이해가 안가서, 이 부분에 대해 예를 들어 자세히 설명해 주실 것 부탁드려요.

On the other hand, the deep unsupervised-learning models, especially the probabilistic generative

ones, are easier to interpret, easier to embed domain knowledge, easier to compose, and easier to

handle uncertainty, but they are typically intractable in inference and learning for complex systems.

51

SNU Spoken Language Processing Lab / 서울대학교 음성언어처리연구실

Answer

답은 아래 논문들에 있을 것으로 예상됨. 2 -> 1 -> 3 순서로 중요.

1. G. Alain and Y. Bengio. What regularized autoencoders learn from the data

generating distribution. In Proceedings of International Conference on Learning

Representations (ICLR). 2013.

2. Y. Bengio, A. Courville, and P. Vincent. Representation learning: A review and new

perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI),

38:1798–1828, 2013.

3. Y. Bengio, E. Thibodeau-Laufer, and J. Yosinski. Deep generative stochastic networks

trainable by backprop. arXiv 1306:1091, 2013. also accepted to appear in

Proceedings of International Conference on Machine Learning (ICML), 2014.

52

SNU Spoken Language Processing Lab / 서울대학교 음성언어처리연구실

Answer

On the other hand, the deep unsupervised-learning models, especially the probabilistic generative ones, are

easier to interpret, easier to embed domain knowledge, easier to compose, and easier to handle uncertainty,

but they are typically intractable in inference and learning for complex systems.

The deep unsupervised-learning models: autoencoders, sparse coding networks, etc.

비교 대상: deep supervised-learning models: DNNs, etc.

“Easier to interpret”: 전 페이지의 논문2에서 다양한 interpret 사례를 다룸.

“Easier to embed domain knowledge”: 0부터 9까지 숫자 이미지의 representation을

학습할 때 autoencoder의 layer를 10으로 줄 수 있음.

“Easier to handle uncertainty”: ? Uncertainty = probability. “the deep unsupervised-

learning models, especially the probabilistic generative ones, are … easier to handle

uncertainty ”. Probabilistic models 이니 uncertainty(=probability)를 다루가 쉽다고 생각할

수 있음. 답은 모름.53

SNU Spoken Language Processing Lab / 서울대학교 음성언어처리연구실

Question

p220에서 SPN 구조에 대해서 그림과 함께 설명 부탁드립니다.

54

SNU Spoken Language Processing Lab / 서울대학교 음성언어처리연구실

Answer

References

[Paper (first introduced)] H. Poon and P. Domingos. Sum-product networks: A new deep a

rchitecture. In Proceedings of Uncertainty in Artificial Intelligence. 2011.

[Paper] R. Gens and P. Domingo. Discriminative learning of sum-product networks. Neural

Information Processing Systems (NIPS), 2012.

[Lecture by the author] Sum-Product Networks: The Next Generation of Deep Models by

P. Domingos (YouTube) (PDF)

Comprehensive web page (http://spn.cs.washington.edu/)

55

SNU Spoken Language Processing Lab / 서울대학교 음성언어처리연구실

Answer: Basic Structure

56

Deep structure

Basic structure

Forward

Backward

SNU Spoken Language Processing Lab / 서울대학교 음성언어처리연구실

Answer: Architecture with an Example

57

SNU Spoken Language Processing Lab / 서울대학교 음성언어처리연구실

Question

RNN을 Unsupervised learning 에 사용한다는 부분에 궁금한 점이 있습니다. 이전의 data

sample을 사용하여 future sequence를 추론하는 경우 중간중간 잘못 추론한 결과가 쌓여

결국 전체 sequence를 잘못 추론하는 결과를 야기할 수 있다고 생각합니다. 이 문제를

해결할 수 있는 장치가 있는지 궁금합니다.

58

SNU Spoken Language Processing Lab / 서울대학교 음성언어처리연구실

Answer

Training할 때 teacher forcing이라는 기법을 이용해서 문제를 완화시킬 수 있습니다 (Goodfellow

et al. 2016). 현재 예측값이 아니라 현재 참값(ground truth)을 이용해 다음 상태를 예측합니다.

59Vanilla RNN RNN with teacher forcing

Goodfellow et al. Deep Learning. The MIT Press. 2016

SNU Spoken Language Processing Lab / 서울대학교 음성언어처리연구실

Question & Answer

17,18년 오디오 관련한 여러 가지 테스크에 대해서 제출되는 여러 모델들은 아직도 기존의 머신러닝

방법인 HMM등을 사용하는 경우를 볼 수 있습니다. RNN 기반의 variants가 HMM 기반의 모델의

성능에 필적하지 못하는 경우도 있는데 이에 대해서 발표자 분들이 어떻게 생각하시는지 궁금합니다.

제가 오디오처리 전공이 아니라 ‘17,18년 오디오 관련한 여러 가지 테스크에 대해서 제출되는 여러 모델들’에

대해 알지 못합니다.

‘RNN 기반의 variants가 HMM 기반의 모델의 성능에 필적하지 못하는 경우’를 구체적으로 명시해주셨다면

이야기가 가능했을 것 같습니다.

HMM과 RNN의 가능 큰 차이는 HMM는 현재 상태를 이용해 다음 상태를 예측하고, RNN은 현재 상태와 이전

여러 상태를 이용해 예측한다는 점입니다. 다만, HMM은 현재 상태과 이전의 모든 상태를 담고 있다고

가정합니다(Markov property). RNN은 long-term dependency에 약하고 Markov property가 없기 때문에

long-term dependency를 잘 할 수 있도록 해야합니다. 그것을 잘하게끔 하기가 어렵기 때문에 RNN이 더 좋은

성능을 못내는 것이라고 생각합니다. 이와 달리 HMM은 long-term dependency를 가지게 끔 설계하기가

RNN보다 쉬워서 HMM이 더 좋은 성능을 낸다고 생각합니다.

60

SNU Spoken Language Processing Lab / 서울대학교 음성언어처리연구실

Question & Answer

상대적으로 최근에 개발되었다는 방법인 time-varying HMM parameters를 사용하는 방법

을 알고 싶습니다.

질문을 통해 time-varying HMM을 처음 접했습니다.

‘time-varying HMM parameters를 사용하는 방법’을 질문하셨는데, time-varying HMM

parameters 라는 것이 무엇을 의미하는지 찾지 못했고 사용 목적이 무엇인지 몰라서 질

문의 의도를 파악하기 어렵습니다. Wang et al. 2009에서 소개한 time varying HMM이

맞는지 궁금합니다.

61

SNU Spoken Language Processing Lab / 서울대학교 음성언어처리연구실

Question & Answer

CRF 에 대한 기본개념이 부족한 상태에서 드리는 문의사항입니다. CRF 가 shallow layer 라고 할 때 이

를 deep structure 로 쌓는다고 하는데요.(p31) 이렇게 쌓아진 구조와 DNN 구조는 어떤 차이를 갖는지

궁금합니다. 가장 단순한 비교로 1 layer DNN 의 구조와 CRF는 어떻게 다른지 궁금합니다.

62

CRF 구조의 예시

Deep structured CRF 구조의 예시

DNN 구조의 예시

SNU Spoken Language Processing Lab / 서울대학교 음성언어처리연구실

Question & Answer

Are all discrimitive models learned by supervised learning methods?

Discriminative 모델은 p(y|x)를 학습하는 것이 목적이기 때문에 y가 주어지지 않다면

이를 학습할 수 없다고 생각합니다. 하지만 hybrid 모델처럼 discriminative 모델의

pretraining에 unsupervised learning 방법을 적용하는 것은 가능합니다.

63

SNU Spoken Language Processing Lab / 서울대학교 음성언어처리연구실

Question & Answer

3.4에서 hybrid deep neural network의 한 예로, pre-trained DNN, CNN이 deep

autoencoder를 사용하는 예를 제시하였는데, 그렇다면 deep autoencoder에서 학습된

output이 pre-trained DNN, CNN에서 input으로 feed 된다는 의미인지요?

이때 deep autoencoder와 pre-trained DNN/CNN 각각의 역할이 무엇일지 궁금합니다.

Deep autoencoder에서 학습된 encoder를 사용하여 input으로부터 feature를 얻게 되고

이를 DNN/CNN의 input으로 사용한다는 의미입니다. 이때 deep autoencoder의

encoder가 feature extractor의 역할을 하게되고 DNN/CNN은 classifier의 역할을 하게

됩니다.

64

SNU Spoken Language Processing Lab / 서울대학교 음성언어처리연구실

Question & Answer

음성 데이터에 대한 전처리의 결과는 많은 경우 시간 축과 frequency 축을 가지게 됩니다.

이는 1차원 시계열 데이터와도, 그리고 2차원의 이미지와도 유사한 성질을 가지고 있는 것

같은데 이렇게 전처리된 데이터에 대해 RNN, 시간 축에 대한 1차원 CNN(TDNN), 2차원

CNN을 적용하는 것이 어떻게 다른지, 그리고 각 구조는 어떤 상황에 사용하는 것이 좋은지

궁금합니다.

먼저 CNN 모델이 RNN 모델과 달리 병렬화가 쉬워 모델의 수행속도가 빠르기 때문에

수행속도가 중요한 경우에 사용하기 좋다고 생각합니다. 그리고 2차원 CNN이 서로 다른

시간 축들을 동시에 고려하기 때문에 1차원 CNN에 비하여 성능이 우수할 것으로

예상합니다.

65

SNU Spoken Language Processing Lab / 서울대학교 음성언어처리연구실

Question & Answer

hybrid 모델에는 역의 방향도 있는데 (generative for discriminative), 책에서는 전자가 후자

의 prior로 작용하는 regularization을 언급하였습니다(p.226). prior가 된다는 것은 무슨 과정

을 의미하는지 (generative 모델의 결과가 어떻게 discriminative 모델로 feed-in 되는지), 그

리고 이것이 적절한 prior라는 근거가 있는지 궁금합니다.

Generative 모델이 discriminative 모델의 초깃값으로 활용된다는 의미입니다. 이것이 적

절한 priror라는 것을 증명한 논문은 읽어보지 못했으며 경험적으로 모델의 weight들을

랜덤으로 초기화하는 것보다 학습이 더 잘되고 성능이 우수하기 때문이라고 생각합니다.

66

SNU Spoken Language Processing Lab / 서울대학교 음성언어처리연구실

Question & Answer

교재 221 page 1~9번째 줄 문장에서 언급한 "Hessian-free optimization" 방법이 궁금합니

다. "stochastic gradient descent optimization algorithms" 방법과 비교하였을 때 어떤 차이

점(장단점)이 있나요?

67

이론적으로는 SGD보다 더 최적해를 잘

찾지만 구현이 SGD에 비해 어려우며,

굉장히 불안정하고 conjugate gradient

단계의 계산량이 너무 많다고 합니다.

SNU Spoken Language Processing Lab / 서울대학교 음성언어처리연구실

Question & Answer

Can hybrid deep networks solve issues of sparsity and the existence of multiple

conceptual interpretations to the same word in speech processing context? If still not,

then what do we need to achieve that?

Hybrid deep network가 아니더라도 충분한 양의 데이터가 주어지고, RNN 계열의 모델

을 사용하면 장단기적인 문맥을 모두 고려할 수 있기때문에 문제를 해결할 수 있다고 생

각합니다.

68

SNU Spoken Language Processing Lab / 서울대학교 음성언어처리연구실

Question & Answer

vanilla RNN은 gradient vanishing문제에 직면하기 쉽다고 알고 있는데, text book에서는

"Mesnil et al. (2013) and Yao et al. (2013) reported the success of RNNs in spoken

language understanding." 라고 합니다. 그렇다면 이 실험에서는 vanilla RNN이 아닌 발전된

RNN이 적용된 것인가요?

두 연구에서는 Elman-type RNN과 Jordan-type RNN 이라는 vanilla RNN을 변형한 모델

을 적용하였습니다. 그렇지만 이 모델들은 long short-term memory (LSTM), gated

recurrent unit (GRU)과 달리 gradient vanishing 문제를 해결하기 위해 제안된 모델은 아

닙니다.

69

SNU Spoken Language Processing Lab / 서울대학교 음성언어처리연구실

Question

현재는 많이 사용되고 있지는 않을지도 모르지만, Deep Learning에서 Deep Neural

Network의 initialization 이야기할 때 빠지지 않는 방법 중 하나가 RBM 혹은 이것을 확장한

DBN을 이용하여 weight initialization을 하는 방법입니다. 다만 RBM이 bipartite구조로 각

노드의 확률을 sampling 한다는 점에서 이걸로 어떻게 weight initialization을 하는지 잘 모

르겠습니다. 4단원 앞부분에 그 내용이 설명되어있기는 한데, 이렇게 초기화된 weight에 어

떤 통계적인 의의가 있을까요?

70

SNU Spoken Language Processing Lab / 서울대학교 음성언어처리연구실

Answer

RBM에서 hidden layer에 대응되는 확률을 구하는 과정에서 visible layer의 값들에

weight들을 곱하게 되는데 이 weight들로 DNN의 weight를 초기화한다는 의미입니다.

이렇게 초기화된 weight가 모델에 좋은 prior를 제공해준다는 통계적인 의의가 있습니다.

71

SNU Spoken Language Processing Lab / 서울대학교 음성언어처리연구실

Question & Answer

25쪽에서는 "Unsupervised learning refers to no use of task specific supervision

information"이라고 하고 있습니다. 하지만 여기서 표현하는 "task specific supervision

information"을 input 그 자체라고 생각하면 곧 generative model이 되는 것이고, 이렇게

생각할 수 있다면 Unsupervised learning과 Supervised learning의 차이가 없어지는

것입니다. 즉, x와 y가 모두 주어지고 y=f(x)에서 함수 f를 학습시키는것이 감독학습인데,

x=f(x)라고 볼 수 있다면 무감독학습이 되는것이죠. 이런 관점에서 본다면 감독학습과

무감독학습의 수학적 차이는 없는것인데, 서로 다른 차이점을 어떻게 설명할 수 있을까요?

Unsupervised learning과 supervised learning의 목적에 차이가 존재합니다. Supervised

learning의 경우에는 레이블이 주어진 상황에서 데이터의 분류나 회귀분석이 목적이라면

unsupervised learning의 목적은 데이터의 레이블이 없는 상황에서 데이터의 hidden

structure를 파악하는 것이 목적입니다.

72

SNU Spoken Language Processing Lab / 서울대학교 음성언어처리연구실

Question & Answer

28쪽에서는 DBM이 speech recognition problem에 사용될 수 있다고 합니다. 하지만 제가

알기론 보통 DBM은 고정되어 있는 길이의 모델을 통해서 inference를 하는 것인데, speech

recognition을 하려고 하면 주어지는 문장의 길이가 모르는 상태에서 어떻게 inference를 할

수 있는건가요? DBM이 speech recognition을 처리하는 방법이 궁금합니다.

먼저, 주어진 음성데이터를 고정된 길이들로 나눈 후에 이들의 mfcc, melspectrogram

feature들을 추출하고, 이들로부터 다시 한번 DBM를 사용하여 고차원의 feature를 추출

합니다. 이들의 시퀀스를 마지막으로 RNN에 통과시키면 길이가 모르는 상태에서도

inference 가 가능하다고 생각합니다.

73

SNU Spoken Language Processing Lab / 서울대학교 음성언어처리연구실

Thank you!

Q & A

74