Recurrent Neural Networks and Natural Language Processing: 순환 신경망 & 자연언어처리
2019.10.8 — Seung-Hoon Na, Chonbuk National University

Page 1:

Recurrent Neural Networks and

Natural Language Processing:

순환 신경망 & 자연언어처리

2019.10.8

Seung-Hoon Na

Chonbuk National University

Page 2:

Contents

• Recurrent neural networks
  – Classical RNN
  – LSTM
  – Recurrent language model
  – Sequence labelling
  – Neural encoder-decoder
  – Memory-augmented neural networks

• Deep learning for NLP
  – Introduction to NLP
  – Word embedding
  – ELMo, BERT: Contextualized word embedding

• Summary

Page 3:

Neural Network: Two types

• Feedforward neural networks (FNN)
  – = Deep feedforward networks = multilayer perceptrons (MLP)
  – No feedback connections
    • Information flows: x → f(x) → y
  – Represented by a directed acyclic graph

• Recurrent neural networks (RNN)
  – Feedback connections are included
  – Long short-term memory (LSTM)
  – Recently, RNNs using explicit memories, such as the Neural Turing machine (NTM), have been studied extensively
  – Represented by a cyclic graph

Page 4:

FNN: Notation

• For simplicity, assume the network has a single hidden layer only

  – 𝑦𝑘: k-th output unit, ℎ𝑗: j-th hidden unit, 𝑥𝑖: i-th input
  – 𝑢𝑗𝑘: weight between the j-th hidden and the k-th output unit
  – 𝑤𝑖𝑗: weight between the i-th input and the j-th hidden unit

• Bias terms are also contained in the weights

[Figure: three-layer network — input layer 𝑥1…𝑥𝑛, hidden layer ℎ1…ℎ𝑚, output layer 𝑦1…𝑦𝐾; weight 𝑤𝑖𝑗 connects 𝑥𝑖 to ℎ𝑗 and 𝑢𝑗𝑘 connects ℎ𝑗 to 𝑦𝑘]

Page 5:

FNN: Matrix Notation

[Figure: the same three-layer network in matrix form — input 𝒙, hidden 𝒉, output 𝒚, with weight matrices 𝑾 (input → hidden) and 𝑼 (hidden → output)]

𝒚 = 𝑓(𝑼𝑔(𝑾𝒙))

𝒚 = 𝑓(𝑼𝑔(𝑾𝒙 + 𝒃) + 𝒅)  for explicit bias terms
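As a concrete reading of 𝒚 = 𝑓(𝑼𝑔(𝑾𝒙 + 𝒃) + 𝒅), here is a minimal NumPy sketch of a single-hidden-layer forward pass; the sizes, random weights, and the choice of tanh for 𝑔 and softmax for the output are illustrative assumptions, not taken from the slides.

    import numpy as np

    def softmax(z):
        z = z - z.max()              # numerical stability
        e = np.exp(z)
        return e / e.sum()

    n, m, K = 5, 4, 3                # input, hidden, output sizes (arbitrary)
    rng = np.random.default_rng(0)
    W, b = rng.normal(size=(m, n)), np.zeros(m)   # input -> hidden
    U, d = rng.normal(size=(K, m)), np.zeros(K)   # hidden -> output

    x = rng.normal(size=n)
    h = np.tanh(W @ x + b)           # hidden layer: g(Wx + b)
    y = U @ h + d                    # output scores
    p = softmax(y)                   # probabilities over K labels
    print(p, p.sum())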

Page 6:

Typical Setting for Classification

ŷ𝑖 = exp(𝑦𝑖) / Σ𝑡 exp(𝑦𝑡)

– K: the number of labels
– Input layer: input values (raw features)
– Output layer: scores of labels
– Softmax layer: normalization of the output values
  • Scores are transformed into probabilities of the labels

[Figure: input layer → hidden layer → output layer 𝑦1…𝑦𝐾 → softmax layer ŷ1…ŷ𝐾]

Page 7:

Recurrent neural networks

• A family of neural networks for processing sequential data

• Specialized for processing a sequence of values

– 𝒙^(1), 𝒙^(2), ⋯ , 𝒙^(𝜏)

• Use parameter sharing across time steps

– “I went to Nepal in 2009”

– “In 2009, I went to Nepal”

Traditional nets need to learn all of the rules of the language separately at each position in the sentence

Page 8:

RNN as a Dynamical System

• The classical form of a dynamical system:

  𝒔^(t) = 𝑓(𝒔^(t−1); 𝜽)

  – 𝒔^(t): the state of the system

• Unfolding the equation gives a directed acyclic computational graph:

  𝒔^(t) = 𝑓(𝒔^(t−1); 𝜽) = 𝑓(𝑓(𝒔^(t−2); 𝜽); 𝜽) = ⋯

Page 9:

RNN as a Dynamical System

• An RNN can be considered as a dynamical system that takes an external signal 𝒙^(t) at time 𝑡:

  𝒉^(t) = 𝑓(𝒉^(t−1), 𝒙^(t); 𝜽)

• Using the recurrence, an RNN maps an arbitrary-length sequence (𝒙^(t), 𝒙^(t−1), ⋯ , 𝒙^(2), 𝒙^(1)) to a fixed-length vector 𝒉^(t)

Page 10:

Recurrent Neural Networks

[Figure: a feedforward network (input 𝒙 → hidden 𝒉 → output 𝒐, weights 𝑼, 𝑽) next to a recurrent network that adds the hidden-to-hidden weight 𝑾]

Feedforward NN:              𝒉 = 𝑔(𝑼𝒙)
Recurrent neural network:    𝒉^(t) = 𝑔(𝑾𝒉^(t−1) + 𝑼𝒙^(t)),  𝒐^(t) = 𝑽𝒉^(t)

Parameter sharing: The same weights across several time steps

Page 11:

Classical RNN: Update Formula

𝒉^(t) = 𝒇(𝒙^(t), 𝒉^(t−1)):

• 𝒉^(t) = tanh(𝑾𝒙^(t) + 𝑼𝒉^(t−1))
• 𝒐^(t) = 𝑽𝒉^(t)

Using explicit bias terms:

• 𝒉^(t) = tanh(𝑾𝒙^(t) + 𝑼𝒉^(t−1) + 𝒃)
• 𝒐^(t) = 𝑽𝒉^(t)
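A minimal NumPy sketch of this classical update run over a short sequence; the dimensions, random weights, and sequence length are illustrative assumptions.

    import numpy as np

    d_x, d_h, d_o, T = 3, 4, 2, 5                # illustrative sizes and length
    rng = np.random.default_rng(1)
    W = rng.normal(scale=0.5, size=(d_h, d_x))   # input -> hidden
    U = rng.normal(scale=0.5, size=(d_h, d_h))   # hidden -> hidden (recurrence)
    V = rng.normal(scale=0.5, size=(d_o, d_h))   # hidden -> output
    b = np.zeros(d_h)

    xs = rng.normal(size=(T, d_x))
    h = np.zeros(d_h)                            # h^(0)
    for t in range(T):
        h = np.tanh(W @ xs[t] + U @ h + b)       # h^(t) = tanh(W x^(t) + U h^(t-1) + b)
        o = V @ h                                # o^(t) = V h^(t)
        print(t, o)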

Page 12:

Computational Graph of RNN

• Unfolding: The process that maps a circuit-style graph to a computational graph with repeated units

• Unfolded graph has a size that depends on the sequence length

RNN with no outputs

Indicates a delay of 1 time step

Page 13:

RNNs with Classical Setting

• RNNs that produce an output at each time step and have recurrent connections between hidden units

Loss 𝐿: measures how far each 𝒐 is from the corresponding training target 𝒚

Page 14:

Classical RNNs: Computational Power

• Classical RNNs are universal in the sense that any function computable by a Turing machine can be computed by an RNN [Siegelmann ’91, ’95], where the update formula is given as

  – 𝒂^(t) = 𝒃 + 𝑾𝒉^(t−1) + 𝑼𝒙^(t)
  – 𝒉^(t) = tanh(𝒂^(t))
  – 𝒐^(t) = 𝒄 + 𝑽𝒉^(t)
  – ŷ^(t) = softmax(𝒐^(t))

Page 15:

Classical RNNs: Computational Power

• Theorems:
  • Classical rational-weighted RNNs are computationally equivalent to Turing machines
  • Classical real-weighted RNNs are strictly more powerful than rational-weighted RNNs and Turing machines → super-Turing computation

Page 16:

Classical RNNs: Loss function

• The total loss for a given sequence of 𝒙 values paired with a sequence of 𝒚:

– the sum of the losses over all the time steps.

• 𝐿^(t): the negative log-likelihood of 𝑦^(t) given 𝒙^(1), ⋯ , 𝒙^(t)

  𝐿(𝒙^(1), ⋯ , 𝒙^(𝜏), 𝒚^(1), ⋯ , 𝒚^(𝜏)) = Σ𝑡 𝐿^(t) = −Σ𝑡 log 𝑝model(𝑦^(t) | 𝒙^(1), ⋯ , 𝒙^(t))

Page 17:

Backpropagation through Time (BPTT)

forward propagation

backward propagation through time (BPTT)

Page 18:

RNN with Output Recurrence

• Lack hidden-to-hidden connections

– Less powerful than classical RNNs

– This type of RNN cannot simulate a universal TM

Page 19:

RNN with Single Output

• At the end of the sequence, the network obtains a representation of the entire input sequence and produces a single output

Page 20:

RNN with Output Dependency

• The output layer of the RNN forms a directed graphical model that contains edges from some past outputs 𝒚^(i) to the current output
  – This model is able to perform CRF-style tagging

[Figure: inputs 𝒙^(1)…𝒙^(t) feed hidden states 𝒉^(1)…𝒉^(t); the outputs 𝒚^(1)…𝒚^(t) are chained to each other by weights 𝑾, adding output-to-output dependencies]

Page 21:

Recurrent Language Model:

RNN as Directed Graphical Models

• Introducing the state variable in the graphical model of the RNN

Page 22:

Recurrent Language Model: Teacher Forcing

• At training time, teacher forcing feeds the correct output 𝒚^(t) from the training set.

• At test time, because the true output is not available, the correct output is approximated by the model’s own output.

During training, the hidden representation at time t is assumed to receive the gold answer from time t−1 as input; at test time, the predicted output from time t−1 is fed in instead.

Page 23:

Modeling Sequences Conditioned on

Context with RNNs

• Generating sequences given a fixed vector 𝒙

– Context: a fixed vector 𝒙

– Take only a single vector 𝒙 as input and generate the 𝒚 sequence

• Some common ways

– 1. as an extra input at each time step, or

– 2. as the initial state 𝒉(0) , or

– 3. both.

Page 24:

Modeling Sequences Conditioned on

Context with RNNs

• Maps a fixed-length vector 𝒙 into a distribution over sequences 𝒀
• E.g., image labelling

Page 25:

Modeling Sequences Conditioned on Context with RNNs

• Input: a sequence of vectors 𝒙^(t)
• Output: a sequence with the same length as the input

  𝑃(𝒚^(1), ⋯ , 𝒚^(𝜏) | 𝒙^(1), ⋯ , 𝒙^(𝜏)) ≈ Π𝑡 𝑃(𝒚^(t) | 𝒙^(1), ⋯ , 𝒙^(t), 𝒚^(1), ⋯ , 𝒚^(t−1))

Page 26:

Bidirectional RNN

• Combines two RNNs
  – Forward RNN: an RNN that moves forward, beginning from the start of the sequence
  – Backward RNN: an RNN that moves backward, beginning from the end of the sequence
  – It can make the prediction of 𝒚^(t) depend on the whole input sequence.

[Figure: forward RNN and backward RNN run over the same input; their hidden states are combined at each position]

Page 27:

Encoder-Decoder Sequence-to-Sequence

• Input: a sequence
• Output: a sequence (but with a different length)
  – e.g., machine translation

• Generates an output sequence (𝒚^(1), ⋯ , 𝒚^(𝑛𝑦)) given an input sequence (𝒙^(1), ⋯ , 𝒙^(𝑛𝑥))

• An RNN is used as the encoder and a recurrent language model as the decoder

Page 28:

RNN: Extensions (1/3)

• Classical RNN
  – Suffers from the challenge of long-term dependencies

• LSTM (long short-term memory)
  – Gated units, dealing with vanishing gradients
  – Deals with the challenge of long-term dependencies

• Bidirectional LSTM
  – Forward & backward RNNs

• Bidirectional LSTM CRF
  – Output dependency with a linear-chain CRF

• Recurrent language model
  – RNN for sequence generation
  – Predicts the next word conditioned on all the previous words

• Recursive neural network & Tree LSTM
  – Generalized RNN for representing tree structures

Page 29:

RNN: Extensions (2/3)

• Neural encoder-decoder

– Conditional recurrent language model

– Encoder: RNN for encoding a source sentence

– Decoder: RNN for generating a target sentence

• Neural machine translation

– Neural encoder-decoder with attention mechanism

– Attention-based decoder: Selectively conditioning source words, when generating a target word

• Pointer network

– Attention as generation: Output vocabulary is the set of given source words

Page 30:

RNN: Extensions (3/3)

• Stack LSTM
  – An LSTM for representing a stack structure
    • Extends the standard LSTM with a stack pointer
    • Previously, only the push() operation was allowed; now the pop() operation is also supported

• Memory-augmented LSTMs
  – Neural Turing machine
  – Differentiable neural computer
  – Cf. neural encoder-decoder and stack LSTM: special cases of memory-augmented LSTMs

• RNN architecture search with reinforcement learning
  – Training neural architectures that maximize the expected accuracy on a specific task

Page 31:

The Challenge of Long-Term Dependencies

• Example: a very simple recurrent network with no nonlinear activation function and no inputs

  – 𝒉^(t) = 𝑾^T 𝒉^(t−1)
  – 𝒉^(t) = (𝑾^t)^T 𝒉^(0)

• With the eigendecomposition 𝑾 = 𝑸𝚲𝑸^T:

  𝒉^(t) = 𝑸^T 𝚲^t 𝑸 𝒉^(0)

• Ill-posed form: components along eigenvalues with magnitude greater than 1 explode, and those with magnitude less than 1 vanish
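A quick numerical illustration of why 𝒉^(t) = (𝑾^t)^T 𝒉^(0) is ill-posed; the matrix below is an arbitrary example whose eigenvalues straddle 1.

    import numpy as np

    W = np.array([[1.2, 0.0],
                  [0.0, 0.7]])          # eigenvalues 1.2 and 0.7 (illustrative)
    h = np.array([1.0, 1.0])
    for t in range(1, 31):
        h = W.T @ h                     # h^(t) = W^T h^(t-1), no nonlinearity, no input
        if t % 10 == 0:
            print(t, h)                 # first component explodes, second vanishes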

Page 32:

The Challenge of Long-Term Dependencies

• Gradients vanish or explode in deep models

• BPTT for recurrent neural networks is a typical example

[Figure: unrolled RNN 𝒙1…𝒙𝑡, 𝒉1…𝒉𝑡 sharing the weight 𝑾 at every step; the error signal 𝜹𝑡 at the last step is propagated backwards]

  𝜹 at 𝒉3:  𝒈′(𝒛3) ∗ 𝑾^T 𝜹𝑡
  𝜹 at 𝒉2:  𝒈′(𝒛2) ∘ 𝒈′(𝒛3) ∗ (𝑾^2)^T 𝜹𝑡
  𝜹 at 𝒉1:  𝒈′(𝒛1) ∘ 𝒈′(𝒛2) ∘ 𝒈′(𝒛3) ∗ (𝑾^3)^T 𝜹𝑡

• Error signal: the delta is obtained by repeated multiplication by 𝑾
• With 𝑾 = 𝑽 diag(𝝀) 𝑽^(−1), we have 𝑾^k = 𝑽 diag(𝝀)^k 𝑽^(−1):
  – Explode if |𝜆𝑖| > 1
  – Vanish if |𝜆𝑖| < 1

Page 33:

Exploding and vanishing gradients

[Bengio ‘94; Pascanu ‘13]

• Let:
  – ‖Diag(𝒈′(𝒛𝑖−1))‖ ≤ 𝛾 for a bounded nonlinearity 𝑔 (bounded 𝑔′)
  – 𝜆1: the largest singular value of 𝑾

• Sufficient condition for the vanishing gradient problem:
  – 𝜆1 < 1/𝛾

• Necessary condition for the exploding gradient problem:
  – 𝜆1 > 1/𝛾

[Figure: hidden states 𝒉𝑘 … 𝒉𝑡 connected by the shared weight 𝑾]

  𝜹𝑇−1 = 𝒈′(𝒛𝑇−1) ∗ 𝑾^T 𝜹𝑇
  𝜹𝑘 = Π𝑘<𝑖≤𝑇 Diag(𝒈′(𝒛𝑖−1)) 𝑾^T 𝜹𝑇

  ‖Diag(𝒈′(𝒛𝑖−1)) 𝑾^T‖ ≤ ‖Diag(𝒈′(𝒛𝑖−1))‖ ‖𝑾^T‖ < 𝛾 · (1/𝛾) = 1

• The exploding-gradient condition is obtained by just inverting the condition for the vanishing gradient problem

Page 34:

Gradient Clipping [Pascanu ’13]

• Deals with exploding gradients
• Clip the norm ‖𝒈‖ of the gradient 𝒈 just before the parameter update:

  If ‖𝒈‖ > 𝑣:  𝒈 ← (𝑣 / ‖𝒈‖) 𝒈

[Figure: error-surface trajectories without gradient clipping vs. with clipping]
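A minimal sketch of the norm-clipping rule above; the threshold 𝑣 is arbitrary, and in practice the rule is applied to the full gradient vector of all parameters just before the update.

    import numpy as np

    def clip_by_norm(g, v):
        """If ||g|| > v, rescale g to have norm v; otherwise leave it unchanged."""
        norm = np.linalg.norm(g)
        return g * (v / norm) if norm > v else g

    g = np.array([3.0, 4.0])        # ||g|| = 5
    print(clip_by_norm(g, v=1.0))   # -> [0.6, 0.8], norm 1
    print(clip_by_norm(g, v=10.0))  # unchanged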

Page 35:

Long Short Term Memory (LSTM)

• LSTM makes it easier for RNNs to capture long-term dependencies by using gated units
  – Basic LSTM [Hochreiter and Schmidhuber ’97]
    • Cell state unit 𝒄^(t): an internal memory
    • Introduces an input gate & an output gate
    • Problem: the output is close to zero as long as the output gate is closed.
  – Modern LSTM: uses a forget gate [Gers et al ’00]
  – Variants of LSTM
    • Add peephole connections [Gers et al ’02]
      – Allow all gates to inspect the current cell state even when the output gate is closed.

Page 36:

Long Short Term Memory (LSTM)

[Figure: a vanilla recurrent network (input 𝒙, hidden 𝒉, weights 𝑼, 𝑾) next to an LSTM, which adds a memory cell 𝒄 (cell state unit) and a forget gate 𝒇 between 𝒙 and 𝒉]

Page 37:

Long Short Term Memory (LSTM)

• Memory cell 𝒄: a gated unit
  – Controlled by the input/output/forget gates
    • 𝒊: input gate, 𝒐: output gate, 𝒇: forget gate

• Gated flow: a gate 𝒖 modulates a signal 𝒙 element-wise

  𝒚 = 𝒖 ∘ 𝒙

[Figure: the new input 𝒛, memory cell 𝒄, and hidden output 𝒉, with the gates 𝒊, 𝒇, 𝒐 controlling the flow]

Page 38:

Long Short Term Memory (LSTM)

[Figure: LSTM with input 𝒙, memory cell 𝒄 (cell state unit), hidden state 𝒉, and gates 𝒊, 𝒐, 𝒇]

Computing gate values:
  𝒇^(t) = 𝑔𝑓(𝒙^(t), 𝒉^(t−1))  (forget gate)
  𝒊^(t) = 𝑔𝑖(𝒙^(t), 𝒉^(t−1))  (input gate)
  𝒐^(t) = 𝑔𝑜(𝒙^(t), 𝒉^(t−1))  (output gate)
  𝒄̃^(t) = tanh(𝑾^(c)𝒙^(t) + 𝑼^(c)𝒉^(t−1))  (new memory cell)

  𝒄^(t) = 𝒊^(t) ∘ 𝒄̃^(t) + 𝒇^(t) ∘ 𝒄^(t−1)
  𝒉^(t) = 𝒐^(t) ∘ tanh(𝒄^(t))

Page 39:

Long Short Term Memory (LSTM)

[Figure: the same LSTM diagram, now highlighting how the gate values 𝒊, 𝒇, 𝒐 control the memory cell]

Computing gate values:
  𝒇^(t) = 𝑔𝑓(𝒙^(t), 𝒉^(t−1)),  𝒊^(t) = 𝑔𝑖(𝒙^(t), 𝒉^(t−1)),  𝒐^(t) = 𝑔𝑜(𝒙^(t), 𝒉^(t−1))
  𝒄̃^(t) = tanh(𝑾^(c)𝒙^(t) + 𝑼^(c)𝒉^(t−1))  (new memory cell)

Controlling by gate values:
  𝒄^(t) = 𝒊^(t) ∘ 𝒄̃^(t) + 𝒇^(t) ∘ 𝒄^(t−1)
  𝒉^(t) = 𝒐^(t) ∘ tanh(𝒄^(t))

Page 40:

Long Short Term Memory (LSTM):

Cell Unit Notation (Simplified)

[Figure: simplified cell-unit notation — input 𝒙, memory cell 𝒄, hidden state 𝒉, gates 𝒊, 𝒐, 𝒇]

  𝒄̃^(t) = tanh(𝑾^(c)𝒙^(t) + 𝑼^(c)𝒉^(t−1))
  𝒄^(t) = 𝒊^(t) ∘ 𝒄̃^(t) + 𝒇^(t) ∘ 𝒄^(t−1)
  𝒉^(t) = 𝒐^(t) ∘ tanh(𝒄^(t))

Page 41:

Long Short Term Memory (LSTM):

Long-term dependencies

[Figure: an LSTM unrolled over four time steps — inputs 𝒙^(1)…𝒙^(4), memory cells 𝒄^(1)…𝒄^(4), hidden states 𝒉^(1)…𝒉^(4), each step with gates 𝒊, 𝒇, 𝒐]

𝒙^(1) → 𝒉^(4): early inputs can be preserved in the memory cell over many time steps by the gating mechanism

Page 42:

LSTM: Update Formula

𝒉^(t) = 𝒇(𝒙^(t), 𝒉^(t−1)):

• 𝒊^(t) = 𝜎(𝑾^(i)𝒙^(t) + 𝑼^(i)𝒉^(t−1))  (input gate)
• 𝒇^(t) = 𝜎(𝑾^(f)𝒙^(t) + 𝑼^(f)𝒉^(t−1))  (forget gate)
• 𝒐^(t) = 𝜎(𝑾^(o)𝒙^(t) + 𝑼^(o)𝒉^(t−1))  (output/exposure gate)
• 𝒄̃^(t) = tanh(𝑾^(c)𝒙^(t) + 𝑼^(c)𝒉^(t−1))  (new memory cell)
• 𝒄^(t) = 𝒇^(t) ∘ 𝒄^(t−1) + 𝒊^(t) ∘ 𝒄̃^(t)  (final memory cell)
• 𝒉^(t) = 𝒐^(t) ∘ tanh(𝒄^(t))
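The update formula above, written as a NumPy step function; the weight shapes, random initialization, and sigmoid helper are illustrative assumptions, and biases are omitted as in the slide.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def lstm_step(x, h_prev, c_prev, P):
        """One LSTM step following the slide's update formula (no biases)."""
        i = sigmoid(P["Wi"] @ x + P["Ui"] @ h_prev)        # input gate
        f = sigmoid(P["Wf"] @ x + P["Uf"] @ h_prev)        # forget gate
        o = sigmoid(P["Wo"] @ x + P["Uo"] @ h_prev)        # output gate
        c_tilde = np.tanh(P["Wc"] @ x + P["Uc"] @ h_prev)  # new memory cell
        c = f * c_prev + i * c_tilde                       # final memory cell
        h = o * np.tanh(c)                                 # hidden state
        return h, c

    d_x, d_h = 3, 4
    rng = np.random.default_rng(2)
    P = {k: rng.normal(scale=0.5, size=(d_h, d_x if k.startswith("W") else d_h))
         for k in ["Wi", "Wf", "Wo", "Wc", "Ui", "Uf", "Uo", "Uc"]}
    h, c = np.zeros(d_h), np.zeros(d_h)
    h, c = lstm_step(rng.normal(size=d_x), h, c, P)
    print(h, c)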

Page 43:

LSTM: Memory Cell

𝒄^(t) = 𝒇^(t) ∘ 𝒄^(t−1) + 𝒊^(t) ∘ 𝒄̃^(t)
𝒉^(t) = 𝒐^(t) ∘ tanh(𝒄^(t))

[Figure: the memory-cell data path from 𝒄^(t−1) to 𝒄^(t) and 𝒉^(t); the gates 𝒊^(t), 𝒇^(t), 𝒐^(t) are each computed from (𝒙^(t), 𝒉^(t−1))]

Page 44:

LSTM: Memory Cell

• 𝒄^(t) behaves like a memory (MEMORY)
  – 𝒄^(t) = 𝒇^(t) ∘ 𝒄^(t−1) + 𝒊^(t) ∘ 𝒄̃^(t)

• M(t) = FORGET * M(t−1) + INPUT * NEW_INPUT
• H(t) = OUTPUT * M(t)

• FORGET: erase operation (memory reset)
• INPUT: write operation
• OUTPUT: read operation

Page 45:

Memory Cell - Example

[Figure: a worked element-wise example. Starting from M[t] = (0, 0.5, 0, 1) and a new-input vector, 0/1 forget and input gate values combine to give M[t+1] = (0, 0.5, 1, 2); the output gate (0, 0, 1, 1) then exposes H[t] = output ∘ tanh(M[t+1]) = (0, 0, 0.76, 0.96)]
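A sketch of this kind of element-wise update with made-up 0/1 gate values (chosen for illustration, not read off the original figure): the forget gate erases slots, the input gate writes the new content, and the output gate exposes part of tanh(M[t+1]).

    import numpy as np

    M_t       = np.array([0.0, 0.5, 0.0, 1.0])   # current memory (illustrative)
    new_input = np.array([-1.0, 0.5, 1.0, 1.0])  # candidate content (illustrative)
    forget    = np.array([0.0, 1.0, 1.0, 1.0])   # 0 = erase slot, 1 = keep
    write     = np.array([0.0, 0.0, 1.0, 1.0])   # 1 = write new content into slot
    read      = np.array([0.0, 0.0, 1.0, 1.0])   # 1 = expose slot in the output

    M_next = forget * M_t + write * new_input    # M(t+1) = FORGET*M(t) + INPUT*NEW_INPUT
    H_t    = read * np.tanh(M_next)              # H(t)   = OUTPUT * tanh(M(t+1))
    print(M_next, H_t)                           # -> [0, 0.5, 1, 2], [0, 0, 0.76, 0.96]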

Page 46:

Long Short Term Memory (LSTM):

Backpropagation

• Error signal in gated flow

[Figure: gated flow 𝒙 → 𝒚 modulated by 𝒖]

  𝒚 = 𝒖 ∘ 𝒙 = Diag(𝒖) 𝒙
  𝜹𝒙 = Diag(𝒖)^T 𝜹𝒚 = 𝒖 ∘ 𝜹𝒚

Page 47:

Long Short Term Memory (LSTM):

Backpropagation

[Figure: LSTM flow graph around time t — 𝒉𝑡−1, 𝒄𝑡−1, gates 𝒊𝑡, 𝒇𝑡, 𝒐𝑡−1, and 𝒄𝑡, with the error signal 𝜹𝒄𝑡 arriving at 𝒄𝑡]

  𝒛𝑡 = 𝑾𝑐𝒙𝑡 + 𝑼𝑐𝒉𝑡−1
  𝒄𝑡 = 𝒊𝑡 ∘ tanh(𝒛𝑡) + 𝒇𝑡 ∘ 𝒄𝑡−1
  𝒉𝑡 = 𝒐𝑡 ∘ tanh(𝒄𝑡)

Page 48:

Long Short Term Memory (LSTM):

Backpropagation

With the same flow graph and forward equations
  𝒛𝑡 = 𝑾𝑐𝒙𝑡 + 𝑼𝑐𝒉𝑡−1,  𝒄𝑡 = 𝒊𝑡 ∘ tanh(𝒛𝑡) + 𝒇𝑡 ∘ 𝒄𝑡−1,  𝒉𝑡 = 𝒐𝑡 ∘ tanh(𝒄𝑡),

the error signal 𝜹𝒄𝑡 propagates back to 𝒉𝑡−1 through 𝒛𝑡:

  𝜹𝒛𝑡 = tanh′(𝒛𝑡) ∘ 𝒊𝑡 ∘ 𝜹𝒄𝑡
  𝜹𝒉𝑡−1 = 𝑼𝑐^T 𝜹𝒛𝑡 = 𝑼𝑐^T (tanh′(𝒛𝑡) ∘ 𝒊𝑡 ∘ 𝜹𝒄𝑡)

Page 49:

Long Short Term Memory (LSTM):

Backpropagation

The error signal 𝜹𝒄𝑡 also propagates to the previous cell state 𝒄𝑡−1, both through 𝒉𝑡−1 and directly through the forget-gate path:

  𝜹𝒉𝑡−1 = 𝑼𝑐^T (tanh′(𝒛𝑡) ∘ 𝒊𝑡 ∘ 𝜹𝒄𝑡)
  𝜹𝒄𝑡−1 = tanh′(𝒄𝑡−1) ∘ 𝒐𝑡−1 ∘ 𝜹𝒉𝑡−1 + 𝒇𝑡 ∘ 𝜹𝒄𝑡
         = tanh′(𝒄𝑡−1) ∘ 𝒐𝑡−1 ∘ 𝑼𝑐^T (tanh′(𝒛𝑡) ∘ 𝒊𝑡 ∘ 𝜹𝒄𝑡) + 𝒇𝑡 ∘ 𝜹𝒄𝑡

Page 50:

Long Short Term Memory (LSTM):

Backpropagation

Summarizing, with 𝒛𝑡 = 𝑾𝑐𝒙𝑡 + 𝑼𝑐𝒉𝑡−1, 𝒄𝑡 = 𝒊𝑡 ∘ tanh(𝒛𝑡) + 𝒇𝑡 ∘ 𝒄𝑡−1, 𝒉𝑡 = 𝒐𝑡 ∘ tanh(𝒄𝑡):

  𝜹𝒉𝑡−1 = tanh′(𝒛𝑡) ∘ 𝒊𝑡 ∘ 𝑼𝑐^T 𝜹𝒄𝑡
  𝜹𝒄𝑡−1 = tanh′(𝒄𝑡−1) ∘ 𝒐𝑡−1 ∘ 𝜹𝒉𝑡−1 + 𝒇𝑡 ∘ 𝜹𝒄𝑡
         = tanh′(𝒄𝑡−1) ∘ 𝒐𝑡−1 ∘ tanh′(𝒛𝑡) ∘ 𝒊𝑡 ∘ 𝑼𝑐^T 𝜹𝒄𝑡 + 𝒇𝑡 ∘ 𝜹𝒄𝑡

  where tanh′(𝑥) = 1 − tanh²(𝑥)

Page 51:

LSTM vs. Vanilla RNN: Backpropagation

Vanilla RNN:
  𝒛𝑡 = 𝑾𝒉𝑡−1 + 𝑼𝒙𝑡,  𝒉𝑡 = tanh(𝒛𝑡)
  𝜹𝒉𝑡−1 = 𝑔′(𝒛𝑡) ∗ 𝑾^T 𝜹𝒉𝑡

LSTM (with tanh(𝑥) = 𝑔(𝑥)):
  𝜹𝒄𝑡−1 = 𝑔′(𝒄𝑡−1) ∘ 𝑔′(𝒛𝑡) ∘ 𝒐𝑡−1 ∘ 𝒊𝑡 ∘ 𝑼^T 𝜹𝒄𝑡 + 𝒇𝑡 ∘ 𝜹𝒄𝑡

The additive term 𝒇𝑡 ∘ 𝜹𝒄𝑡 is the key to dealing with the vanishing gradient problem.

Page 52:

Exercise: Backpropagation for LSTM

[Figure: partially drawn LSTM flow graph — inputs 𝒙𝑡, 𝒉𝑡−1, 𝒄𝑡−1; gates 𝒊𝑡, 𝒇𝑡, 𝒐𝑡; new input 𝒄̃𝑡; memory cell 𝒄𝑡; outputs 𝒉𝑡, 𝒚𝑡]

Complete the flow graph & derive the weight update formula.

Page 53:

Gated Recurrent Units [Cho et al ’14]

• Alternative architecture to handle long-term dependencies

𝒉^(t) = 𝒇(𝒙^(t), 𝒉^(t−1)):

• 𝒛^(t) = 𝜎(𝑾^(z)𝒙^(t) + 𝑼^(z)𝒉^(t−1))  (update gate)
• 𝒓^(t) = 𝜎(𝑾^(r)𝒙^(t) + 𝑼^(r)𝒉^(t−1))  (reset gate)
• 𝒉̃^(t) = tanh(𝒓^(t) ∘ 𝑼𝒉^(t−1) + 𝑾𝒙^(t))  (new memory)
• 𝒉^(t) = (1 − 𝒛^(t)) ∘ 𝒉̃^(t) + 𝒛^(t) ∘ 𝒉^(t−1)  (hidden state)
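The GRU update above as a NumPy sketch; the shapes and random weights are illustrative assumptions, and biases are omitted as in the slide.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def gru_step(x, h_prev, Wz, Uz, Wr, Ur, W, U):
        z = sigmoid(Wz @ x + Uz @ h_prev)               # update gate
        r = sigmoid(Wr @ x + Ur @ h_prev)               # reset gate
        h_tilde = np.tanh(r * (U @ h_prev) + W @ x)     # new memory
        return (1 - z) * h_tilde + z * h_prev           # hidden state

    d_x, d_h = 3, 4
    rng = np.random.default_rng(3)
    Wz, Wr, W = (rng.normal(scale=0.5, size=(d_h, d_x)) for _ in range(3))
    Uz, Ur, U = (rng.normal(scale=0.5, size=(d_h, d_h)) for _ in range(3))
    print(gru_step(rng.normal(size=d_x), np.zeros(d_h), Wz, Uz, Wr, Ur, W, U))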

Page 54:

LSTM CRF: RNN with Output Dependency

• The output layer of the RNN forms a directed graphical model that contains edges from some past outputs 𝒚^(i) to the current output
  – This model is able to perform CRF-style tagging

[Figure: inputs 𝒙^(1)…𝒙^(t) and hidden states 𝒉^(1)…𝒉^(t); the outputs 𝒚^(1)…𝒚^(t) are chained by transition weights 𝑾]

Page 55:

Recurrent Language Model

• Introducing the state variable in the graphical model of the RNN


Page 56:

Bidirectional RNN

• Combines two RNNs
  – Forward RNN: an RNN that moves forward, beginning from the start of the sequence
  – Backward RNN: an RNN that moves backward, beginning from the end of the sequence
  – It can make the prediction of 𝒚^(t) depend on the whole input sequence.

[Figure: forward RNN and backward RNN run over the same input; their hidden states are combined at each position]

Page 57:

Bidirectional LSTM CRF [Huang ‘15]

• One of the state-of-the art models for sequence labelling tasks

BI-LSTM-CRF model applied to named entity tasks

Page 58:

Bidirectional LSTM CRF [Huang ‘15]

Comparison of tagging performance on POS, chunking and NER tasks for various models [Huang et al. 15]

Page 59:

Neural Machine Translation

• RNN encoder-decoder
  – Neural encoder-decoder: a conditional recurrent language model

• Neural machine translation with attention mechanism
  – Encoder: bidirectional LSTM
  – Decoder: attention mechanism [Bahdanau et al ’15]

• Character-based NMT
  – Hierarchical RNN encoder-decoder [Ling ’16]
  – Subword-level neural MT [Sennrich ’15]
  – Hybrid NMT [Luong & Manning ’16]
  – Google’s NMT [Wu et al ’16]

Page 60:

Neural Encoder-Decoder

[Figure: input text → encoder → decoder → translated text]

Credit: http://nlp.stanford.edu/projects/nmt/Luong-Cho-Manning-NMT-ACL2016-v4.pdf

Page 61:

Neural Encoder-Decoder:

Conditional Recurrent Language Model

Credit: http://nlp.stanford.edu/projects/nmt/Luong-Cho-Manning-NMT-ACL2016-v4.pdf

Page 62:

Neural Encoder-Decoder [Cho et al ’14]

Encoder: RNN

Decoder:

Recurrent language model

• Computing the log of translation probability 𝑙𝑜𝑔 𝑃(𝑦|𝑥) by two RNNs

Page 63:

Decoder: Recurrent Language Model

Credit: http://nlp.stanford.edu/projects/nmt/Luong-Cho-Manning-NMT-ACL2016-v4.pdf

Page 64:

Neural Encoder-Decoder with Attention

Mechanism [Bahdanau et al ’15]

• Decoder with attention mechanism

– Apply attention first to the encoded representations before generating a next target word

– Attention: find aligned source words for a target word

• Considered as implicit alignment process

– Context vector 𝒄:
  • Previously, the last hidden state from the RNN encoder [Cho et al ’14]
  • Now, chosen content-sensitively as a mixture of the hidden states of the input sentence when generating each target word

[Figure: at each decoding step, attention over the encoder states conditions the sampling of the next target word]

Page 65:

Decoder with Attention Mechanism

• Attention: 𝒂𝑡 = softmax(𝑓𝑎(𝒉𝑡−1, 𝑯̄𝑆)) — directly computes a soft alignment

• Attention scoring function:

  score(𝒉𝑡−1, 𝒉̄𝑠) = 𝒗^T tanh(𝑾𝒉𝑡−1 + 𝑽𝒉̄𝑠)

• Normalized weights (expected annotation):

  𝒂𝑡(𝑠) = exp(score(𝒉𝑡−1, 𝒉̄𝑠)) / Σ𝑠′ exp(score(𝒉𝑡−1, 𝒉̄𝑠′))

  – 𝒉̄𝑠: a source hidden state; 𝑯̄𝑆 = [𝒉̄1, ⋯ , 𝒉̄𝑛]: the encoded representations
  – 𝒉𝑡−1: the previous decoder hidden state
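A sketch of the additive scoring above and of the resulting expected-annotation (context) vector; the dimensions, random weights, and source length are illustrative assumptions.

    import numpy as np

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    d_h, d_s, d_a, n = 4, 4, 5, 6        # decoder/encoder/attention sizes, source length
    rng = np.random.default_rng(4)
    W = rng.normal(scale=0.5, size=(d_a, d_h))
    V = rng.normal(scale=0.5, size=(d_a, d_s))
    v = rng.normal(scale=0.5, size=d_a)

    h_prev = rng.normal(size=d_h)        # decoder state h_{t-1}
    H_bar = rng.normal(size=(n, d_s))    # encoder states h_bar_1 .. h_bar_n

    scores = np.array([v @ np.tanh(W @ h_prev + V @ h_bar_s) for h_bar_s in H_bar])
    a_t = softmax(scores)                # soft alignment over source positions
    context = a_t @ H_bar                # expected annotation (context vector)
    print(a_t, context)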

Page 66:

• Original scoring function [Bahdanau et al ’15]

• Extension of scoring functions [Luong et al ‘15]

Decoder with Attention Mechanism

Bilinear function

score(𝒉𝑡−1, 𝒉̄𝑠) = 𝒗^T tanh(𝑾𝒉𝑡−1 + 𝑽𝒉̄𝑠)

Page 67:

• Attention scoring function

http://aclweb.org/anthology/D15-1166

• Computation path: 𝒉𝑡 → 𝒂𝑡 → 𝒄𝑡 → 𝒉̃𝑡
  – Previously, 𝒉𝑡−1 → 𝒂𝑡 → 𝒄𝑡 → 𝒉𝑡

Neural Encoder-Decoder with Attention

Mechanism [Luong et al ‘15]

Page 68:

Neural Encoder-Decoder with Attention Mechanism [Luong et al ’15]

• Input-feeding approach
  – Attentional vectors 𝒉̃𝑡 are fed as inputs to the next time steps to inform the model about past alignment decisions
  – The attentional vector at time t is concatenated with the next input vector to form the input at time t+1

Page 69:

GNMT: Google’s Neural Machine Translation [Wu et al ’16]

• A deep LSTM network with 8 encoder and 8 decoder layers, using residual connections as well as attention connections from the decoder network to the encoder.

Trained by Google’s Tensor Processing Unit (TPU)

Page 70:

GNMT: Google’s Neural Machine

Translation [Wu et al ‘16]

Mean of side-by-side scores on production data

Reduces translation errors by an average of 60% compared to Google’s phrase-based production system.

Page 71:

Pointer Network

Neural encoder-decoder Pointer network

• Attention as a pointer to select a member of the input sequence as the output.

Attention as output

Page 72:

Neural Conversational Model

[Vinyals and Le ’ 15]

• Using neural encoder-decoder for conversations

– Response generation

http://arxiv.org/pdf/1506.05869.pdf

Page 73:

BiDAF for Machine Reading Comprehension [Seo ’17]

• Bidirectional attention flow

Page 74:

Memory-Augmented Neural Networks

• Extend the capabilities of neural networks by coupling them to external memory resources, which they can interact with through attentional processes
  – Writing & reading mechanisms are added

• Examples
  – Neural Turing machine
  – Differentiable neural computer
  – Memory networks

Page 75:

Neural Turing Machine [Graves ‘14]

• Two basic components: a neural network controller and a memory bank.

• The controller network receives inputs from an external environment and emits outputs in response.
  – It also reads from and writes to a memory matrix via a set of parallel read and write heads.

Page 76:

Memory

• Memory 𝑀𝑡

– The contents of the 𝑁 x 𝑀 memory matrix at time 𝑡

Page 77:

Read/Write Operations for Memory

• Read from memory (“blurry”)

– 𝒘𝑡: a vector of weightings over the N locations emitted by a read head at time 𝑡 (Σ𝑖 𝑤𝑡(𝑖) = 1)

– 𝒓𝑡: The length M read vector

• Write to memory (“blurry”)

– Each write: an erase followed by an add
– 𝒆𝑡: the erase vector, 𝒂𝑡: the add vector

Page 78:

Addressing by Content

• Based on Attention mechanism

– Focuses attention on locations based on the similarity between the current memory values and values emitted by the controller
– 𝒌𝑡: the length-M key vector
– 𝛽𝑡: a key strength, which can amplify or attenuate the precision of the focus
– K[𝒖, 𝒗]: similarity measure (cosine similarity)
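A sketch of content-based addressing and of the blurry read it drives, following the NTM formulation (cosine similarity K and key strength β); the blurry write (erase then add) uses the same weighting. The memory size and all values below are illustrative assumptions.

    import numpy as np

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    def cosine(u, v):
        return u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8)

    N, M = 6, 4                          # memory locations x slot width (illustrative)
    rng = np.random.default_rng(5)
    Mem = rng.normal(size=(N, M))        # memory matrix M_t
    k = rng.normal(size=M)               # key vector k_t from the controller
    beta = 2.0                           # key strength

    w = softmax(beta * np.array([cosine(k, Mem[i]) for i in range(N)]))  # content weighting
    r = w @ Mem                          # blurry read: r_t = sum_i w_t(i) M_t(i)

    e, a = rng.uniform(size=M), rng.normal(size=M)      # erase / add vectors
    Mem = Mem * (1 - np.outer(w, e)) + np.outer(w, a)   # blurry write: erase then add
    print(w, r)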

Page 79:

Addressing

• Interpolating content-based weights with previous weights

– which results in the gated weighting

• A scalar interpolation gate 𝑔𝑡
  – Blends between the weighting 𝒘𝑡−1 produced by the head at the previous time step and the weighting 𝒘𝑡^c produced by the content system at the current time step

Page 80:

Addressing by Location

• Based on Shifting

– 𝒔𝑡: a shift weighting that defines a normalized distribution over the allowed integer shifts
  • E.g., the simplest way: use a softmax layer
  • Scalar-based: if the shift scalar is 6.7, then 𝑠𝑡(6) = 0.3, 𝑠𝑡(7) = 0.7, and the rest of 𝒔𝑡 is zero
– 𝛾𝑡: an additional scalar which sharpens the final weighting

Page 81:

Addressing: Architecture

Page 82:

Controller

• Input to the controller: the read vector 𝒓𝑡 ∈ ℝ^M (together with the external input)
• The network for the controller: an FNN or an RNN
• The controller also produces the external output

Outputs for the read head:
  • 𝒌𝑡^R ∈ ℝ^M, 𝛽𝑡^R ∈ ℝ₊, 𝑔𝑡^R ∈ (0,1), 𝒔𝑡^R ∈ [0,1]^N, 𝛾𝑡^R ≥ 1

Outputs for the write head:
  • 𝒆𝑡, 𝒂𝑡, 𝒌𝑡^W ∈ ℝ^M, 𝛽𝑡^W ∈ ℝ₊, 𝑔𝑡^W ∈ (0,1), 𝒔𝑡^W ∈ [0,1]^N, 𝛾𝑡^W ≥ 1

Page 83:

NTM vs. LSTM: Copy task

• Task: copy sequences of eight-bit random vectors, where sequence lengths were randomised between 1 and 20

Page 84:

NTM vs. LSTM: Mult copy

Page 85:

Differentiable Neural Computers

• An extension of the NTM with more advanced memory addressing

• Memory addressing is defined by three main attention mechanisms:
  – Content (also used in the NTM)
  – Memory allocation
  – Temporal order

• The controller interpolates among these mechanisms using scalar gates

Credit: http://people.idsia.ch/~rupesh/rnnsymposium2016/slides/graves.pdf

Page 86:

DNC: Overall architecture

Page 87:

DNC: bAbI Results

• Each story is treated as a separate sequence and presented to the network in the form of word vectors, one word at a time.

  mary journeyed to the kitchen. mary moved to the bedroom. john went back to the hallway. john picked up the milk there. what is john carrying ? - john travelled to the garden. john journeyed to the bedroom. what is john carrying ? - mary travelled to the bathroom. john took the apple there. what is john carrying ? - -

• The answers required at the ‘−’ symbols, grouped by question into braces, are {milk}, {milk}, {milk apple}

• The network was trained to minimize the cross-entropy of the softmax outputs with respect to the target words

Page 88:

DNC: bAbI Results

http://www.nature.com/nature/journal/v538/n7626/full/nature20101.html

Page 89:

Deep learning for

Natural language processing

• Short intro to NLP

• Word embedding

• Deep learning for NLP

Page 90:

Natural Language Processing

• What is NLP?
  – The automatic processing of human language
    • Give computers the ability to process human language
  – Its goal is to enable computers to achieve human-like comprehension of texts/languages

• Tasks
  – Text processing
    • POS tagging / parsing / discourse analysis
  – Information extraction
  – Question answering
  – Dialog systems / chatbots
  – Machine translation

Page 91:

Linguistics and NLP

• Many NLP tasks correspond to structural subfields of linguistics

  Subfields of linguistics: phonetics, phonology, morphology, syntax, semantics, pragmatics

  Corresponding NLP tasks: speech recognition, word segmentation, POS tagging, parsing, word sense disambiguation, semantic role labeling, semantic parsing, named entity recognition/disambiguation, reading comprehension

[Figure: a diagram mapping each subfield of linguistics to the NLP tasks that build on it]

Page 92:

Information Extraction

Entity extraction:

  According to Robert Callahan, president of Eastern’s flight attendants union, the past practice of Eastern’s parent, Houston-based Texas Air Corp., has involved ultimatums to unions to accept the carrier’s terms

  According to <Per> Robert Callahan </Per>, president of <Org> Eastern’s </Org> flight attendants union, the past practice of <Org> Eastern’s </Org> parent, <Loc> Houston </Loc> -based <Org> Texas Air Corp. </Org>, has involved ultimatums to unions to accept the carrier’s terms

Relation extraction:

  <Employee_Of>(Robert Callahan, Eastern’s)
  <Located_In>(Texas Air Corp, Houston)

Page 93:

POS Tagging

• Input: Plays well with others

• Ambiguity: NNS/VBZ UH/JJ/NN/RB IN NNS

• Output: Plays/VBZ well/RB with/IN others/NNS

Page 94:

Parsing

• Sentence: “John ate the apple”

• Parse tree (PSG tree)

  (S (NP (N John)) (VP (V ate) (NP (DET the) (N apple))))

• Grammar rules:
  S → NP VP
  NP → N
  NP → DET N
  VP → V NP
  N → John
  V → ate
  DET → the
  N → apple

Page 95:

Dependency Parsing

• PSG tree: (S (NP (N John)) (VP (V ate) (NP (DET the) (N apple))))

• Dependency tree for “John ate the apple”:
  – ate → John  (SUBJ)
  – ate → apple (OBJ)
  – apple → the (MOD)

Page 96:

Semantic Role Labeling

[Agent Jim] gave [Patient the book] [Goal to the professor.]

Jim gave the book to the professor

Semantic roles and descriptions:
  – Agent: initiator of the action, capable of volition
  – Patient: affected by the action, undergoes a change of state
  – Theme: entity moving, or being “located”
  – Experiencer: perceives the action but is not in control
  – Others: Beneficiary, Instrument, Location, Source, Goal

Page 97:

Sentiment analysis

(1) I bought a Samsung camera and my friends brought a Canon camera yesterday. (2) In the past week, we both used the cameras a lot. (3) The photos from my Samy are not that great, and the battery life is short too. (4) My friend was very happy with his camera and loves its picture quality. (5) I want a camera that can take good photos. (6) I am going to return it tomorrow.

Posted by: big John

(Samsung, picture_quality, negative, big John)(Samsung, battery_life, negative, big John)(Canon, GENERAL, positive, big John’s_friend)(Canon, picture_quality, positive, big John’s_friend)

Page 98:

Coreference Resolution

[A man named Lionel Gaedi]1 went to [the Port-au-Prince morgue]2 in search of [[his]1 brother]3, [ Josef ]3, but was unable to find [[his]3 body]4 among [the piles of corpses that had been left [there]2 ]5.

[A man named Lionel Gaedi] went to [the Port-au-Prince morgue]2 in search of [[his] brother], [Josef], but was unable to find [[his] body] among [the piles of corpses that had been left [there] ].

Page 99:

Question Answering

• One of the oldest NLP tasks

• Modern QA systems

– IBM’s Watson, Apple’s Siri, etc.

• Examples of Factoid questions

Questions Answers

Where is the Louvre Museum located? In Paris, France

What’s the abbreviation for limited partnership? L.P.

What are the names of Odin’s ravens? Huginn and Muninn

What currency is used in China? The yuan

Page 100:

Example: IBM Watson System

• Open-domain question answering system (DeepQA)
  – In 2011, Watson defeated Brad Rutter and Ken Jennings in the Jeopardy! challenge

Page 101:

Machine Reading Comprehension

• SQuAD / KorQuAD

Page 102:

Chatbot

Page 103:

Conversational Question Answering

Page 104:

Word embedding:

Distributed representation

– Distributed representation

• n-dimensional latent vector for a word

• Semantically similar words are closely located in vector space

Page 105:

Word embedding matrix:

Lookup Table

• A word is represented by a one-hot vector 𝒆 (|V|-dimensional: a single 1, all other entries 0)

• 𝑳 ∈ ℝ^(d×|V|): the word embedding (lookup) matrix, one d-dimensional column per vocabulary word (the, cat, mat, …)

• The word vector 𝒙 is obtained from the one-hot vector 𝒆 by referring to the lookup table:

  𝒙 = 𝑳𝒆
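The lookup 𝒙 = 𝑳𝒆 is just a column selection; a minimal sketch with an arbitrary vocabulary size and embedding dimension.

    import numpy as np

    V, d = 10, 4                        # vocabulary size, embedding dimension
    rng = np.random.default_rng(6)
    L = rng.normal(size=(d, V))         # embedding matrix L in R^{d x |V|}

    w = 3                               # word index, e.g. "cat"
    e = np.zeros(V); e[w] = 1.0         # one-hot vector
    x = L @ e                           # word vector via the lookup table
    assert np.allclose(x, L[:, w])      # equivalent to selecting column w
    print(x)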

Page 106:

Word embedding matrix:

Context Input Layer

• A word sequence 𝑤1 ⋯ 𝑤𝑛 forms the input
• The n context words 𝑤𝑡−𝑛+1, ⋯ , 𝑤𝑡−2, 𝑤𝑡−1 are each mapped to a d-dimensional vector through the lookup table: 𝑳𝒆𝑤 (𝒆𝑤: the one-hot vector of 𝑤)
• The d-dimensional vectors are concatenated into an 𝑛·𝑑-dimensional input vector

Page 107:

Natural Language Processing

using Word Embedding

[Figure: pipeline for using word embeddings in NLP]

• Unsupervised step: learn the word embedding matrix from a raw corpus
• The learned matrix initializes the lookup table of an application-specific neural network
• Supervised step: the application-specific network is trained on an annotated corpus, and the lookup table is further fine-tuned

Page 108:

Language Model

• Defines a probability distribution over sequences of tokens in a natural language

• N-gram model
  – An n-gram is a sequence of n tokens
  – Defines the conditional probability of the n-th token given the preceding n−1 tokens

  𝑃(𝑤1, ⋯ , 𝑤𝑇) = 𝑃(𝑤1, ⋯ , 𝑤𝑛−1) Π𝑡=𝑛..𝑇 𝑃(𝑤𝑡 | 𝑤𝑡−𝑛+1, ⋯ , 𝑤𝑡−1)

  – To avoid zero probabilities, smoothing needs to be employed
    • Back-off methods
    • Interpolation methods
    • Class-based language models

Page 109:

Neural Language Model [Bengio ’03]

– Instead of raw symbols, a distributed representation of words is used
– Word embedding: raw symbols are projected into a low-dimensional vector space
– Unlike class-based n-gram models, it can recognize that two words are semantically similar while still encoding each word as distinct from the others

• Method
  – Estimate 𝑃(𝑤𝑖 | 𝑤𝑖−(𝑛−1), ⋯ , 𝑤𝑖−1) by an FNN for classification
  – 𝐱: concatenated input features (input layer)
  – 𝒚 = 𝐔 tanh(𝐝 + 𝐇𝐱)
  – 𝒚 = 𝐖𝐱 + 𝐛 + 𝐔 tanh(𝐝 + 𝐇𝐱)  (with direct input–output connections)
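A sketch of the forward pass 𝒚 = 𝐖𝐱 + 𝐛 + 𝐔 tanh(𝐝 + 𝐇𝐱) followed by a softmax over the vocabulary, with 𝐱 the concatenation of the context word vectors; all sizes, random weights, and the example context indices are illustrative assumptions.

    import numpy as np

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    V, d_emb, n, d_h = 1000, 30, 4, 50          # vocab, embedding dim, n-gram order, hidden
    rng = np.random.default_rng(7)
    L = rng.normal(scale=0.1, size=(d_emb, V))  # lookup table
    H = rng.normal(scale=0.1, size=(d_h, (n - 1) * d_emb))
    d = np.zeros(d_h)
    U = rng.normal(scale=0.1, size=(V, d_h))
    W = rng.normal(scale=0.1, size=(V, (n - 1) * d_emb))  # direct input->output connections
    b = np.zeros(V)

    context = [17, 42, 99]                            # indices of the n-1 previous words
    x = np.concatenate([L[:, w] for w in context])    # concatenated input features
    y = W @ x + b + U @ np.tanh(d + H @ x)            # output scores
    p = softmax(y)                                    # P(w_t | context)
    print(p.shape, p.sum())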

Page 110:

Neural Language Model [Bengio ‘03]

• Words are projected by a linear operation onto the projection layer
• A softmax function is used at the output layer to ensure that 0 ≤ p ≤ 1

Page 111:

Neural Language Model

Image credit: Tomas Mikolov

[Figure: the n−1 previous words 𝑤𝑡−𝑛+1, ⋯ , 𝑤𝑡−1 pass through the lookup table into the input layer 𝒙, then through the hidden layer 𝒉 (weights 𝑯, plus direct connections 𝑾) to the output layer 𝒚 (weights 𝑼) and a softmax layer of size |V|]

  𝒚 = 𝐖𝐱 + 𝐛 + 𝐔 tanh(𝐝 + 𝐇𝐱)

  𝑃(𝑤𝑡 | 𝑤𝑡−𝑛+1, ⋯ , 𝑤𝑡−1) = exp(𝑦𝑤𝑡) / Σ𝑡′ exp(𝑦𝑡′)

Page 112:

Experiments: Neural Language Model

[Bengio ’03]

NLM

Page 113:

Neural Language Model: Discussion

• Limitation: Computational complexity

– Softmax layer requires computing scores over all vocabulary words

• Vocabulary size is very large

• Using a shortlist [Schwenk ’02]
  – The vocabulary 𝑉 is split into a shortlist 𝐿 of the most frequent words and a tail 𝑇 of rarer words

  𝑃(𝑦 = 𝑖 | 𝐶) = 𝛿(𝑖 ∈ 𝐿) 𝑃(𝑦 = 𝑖 | 𝐶, 𝑖 ∈ 𝐿)(1 − 𝑃(𝑖 ∈ 𝑇 | 𝐶)) + 𝛿(𝑖 ∈ 𝑇) 𝑃(𝑦 = 𝑖 | 𝐶, 𝑖 ∈ 𝑇) 𝑃(𝑖 ∈ 𝑇 | 𝐶)

  – The shortlist part is handled by the neural language model; the tail part by an n-gram model

Page 114:

Hierarchical Softmax

[Morin and Bengio ‘05]

• Requires O(log|V|) computation, instead of O(|V|)

• The next-word conditional probability is computed by

  𝑃(𝑣 | 𝑤𝑡−1, ⋯ , 𝑤𝑡−𝑛+1) = Π𝑗=1..𝑚 𝑃(𝑏𝑗(𝑣) | 𝑏1(𝑣), ⋯ , 𝑏𝑗−1(𝑣), 𝑤𝑡−1, ⋯ , 𝑤𝑡−𝑛+1)

  where 𝑏1(𝑣), ⋯ , 𝑏𝑚(𝑣) is the sequence of binary decisions on the path to word 𝑣 in the tree

Page 115:

Importance Sampling [Bengio ’03; Jean ’15]

• Gradient of the log-probability (𝑦𝑤: the score before applying the softmax):

  ∂log 𝑝(𝑤|𝐶)/∂𝜃 = ∂𝑦𝑤/∂𝜃 − Σ𝑤′∈𝑉 𝑃(𝑤′|𝐶) ∂𝑦𝑤′/∂𝜃

• The second term is the expected gradient 𝐸𝑃(𝑤′|𝐶)[∂𝑦𝑤′/∂𝜃]; it is approximated by importance sampling over a small set 𝑉′ drawn from a proposal distribution 𝑄:

  Σ𝑤∈𝑉′ (𝜔𝑤 / Σ𝑤′∈𝑉′ 𝜔𝑤′) ∂𝑦𝑤/∂𝜃,   where 𝜔𝑤 = exp(𝑦𝑤 − log 𝑄(𝑤))

Page 116:

Ranking Loss [Collobert & Weston ’08]

• Sampling a negative example

– 𝒔: a given sequence of words (from the training data)
– 𝒔′: a negative example, in which the last word is replaced with another word
– 𝑓(𝒔): the score of the sequence 𝒔
– Goal: make the score difference 𝑓(𝒔) − 𝑓(𝒔′) large
– Various loss functions are possible
  • Hinge loss: max(0, 1 − 𝑓(𝒔) + 𝑓(𝒔′))

Page 117:

Ranking Loss [Collobert & Weston ’08]

[Figure: the scoring network is applied twice with shared weights — once to the input 𝒙 of the true sequence (score 𝑦) and once to the perturbed input 𝒙′ of the negative example (score 𝑦′)]

  Loss: max(0, (𝑦′ + 1) − 𝑦)

Page 118:

Recurrent Neural Language Model [Mikolov ’10]

• The hidden layer at the previous word connects to the hidden layer at the next word
• No need to specify the context length
  – NLM: only the (n−1) previous words are conditioned on
  – RLM: all previous words are conditioned on

Page 119:

Recurrent Neural Language Model

• Unfolded flow graph

http://arxiv.org/abs/1511.07916

Page 120:

Word2Vec [Mikolov ‘13]

𝑝(𝑤𝑂 | 𝑤𝐼) = exp(𝒗′𝑤𝑂^T 𝒗𝑤𝐼) / Σ𝑤 exp(𝒗′𝑤^T 𝒗𝑤𝐼)

http://arxiv.org/pdf/1301.3781.pdf
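The skip-gram softmax above, computed directly over a toy vocabulary; in practice hierarchical softmax or negative sampling replaces the full sum, and the sizes and random vectors below are illustrative assumptions.

    import numpy as np

    V, d = 8, 5                           # toy vocabulary size, embedding dimension
    rng = np.random.default_rng(8)
    v_in = rng.normal(size=(V, d))        # input ("center word") vectors v_w
    v_out = rng.normal(size=(V, d))       # output ("context word") vectors v'_w

    w_I, w_O = 2, 5                       # center word, context word (indices)
    scores = v_out @ v_in[w_I]            # v'_w^T v_{w_I} for every w
    p = np.exp(scores - scores.max()); p /= p.sum()
    print(p[w_O])                         # p(w_O | w_I)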

Page 121:

Word Vectors have linear relationships

Mikolov ‘13

Page 122:

ELMo: Contextualized Word Embedding

• The embedding of a word is differentiated according to its context

Page 123:

ELMo: Contextualized Word Embedding

• The ELMo architecture is based on an LSTM language model
  – Two tasks of predicting the next word from the previous words are set up, one in the forward direction and one in the backward direction → a bidirectional language model (BiLM)

Page 124:

ELMo: Contextualized Word Embedding

• Contextualized word embedding for the k-th word:
  – After encoding the sentence with ELMo, a linear combination of the representations of that word from all layers
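A sketch of this layer-mixing step: given the biLM representations of the k-th word from each layer, ELMo takes a softmax-normalized, task-specific weighted sum scaled by a learned factor γ. The layer vectors and weights below are random placeholders for illustration.

    import numpy as np

    L_layers, d = 3, 6                    # number of biLM layers, representation size
    rng = np.random.default_rng(9)
    h_k = rng.normal(size=(L_layers, d))  # representations of word k from each layer

    s_raw = rng.normal(size=L_layers)     # task-specific layer weights (learned)
    s = np.exp(s_raw - s_raw.max()); s /= s.sum()   # softmax-normalized weights
    gamma = 1.0                           # task-specific scale (learned)

    elmo_k = gamma * (s @ h_k)            # ELMo_k = gamma * sum_j s_j h_{k,j}
    print(elmo_k)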

Page 125:

ELMo: Contextualized Word Embedding

• The BiLM (ELMo) is pre-trained on a raw corpus
  – The pre-trained model is then used for other NLP tasks

Page 126:

ELMo: Contextualized Word Embedding

• Experimental results
  – Improved the state of the art (at the time) on each NLP task

Page 127:

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

• Transformer encoder [Vaswani et al ’17]
  • Multi-headed self-attention
  • Layer norm and residuals
  • Positional embeddings

Page 128:

BERT: Pre-training of Deep Bidirectional

Transformers for Language Understanding

Page 129:

BERT: Pre-training of Deep Bidirectional

Transformers for Language Understanding

• Experimental results
  – Pre-training on a large corpus → fine-tuning
    • Achieved state-of-the-art performance on each NLP task

Page 130:

BERT: Pre-training of Deep Bidirectional

Transformers for Language Understanding

Page 131:

Summary

• Recurrent neural networks
  – Classical RNN, LSTM, RNN encoder-decoder, attention mechanism, neural Turing machine

• Deep learning for NLP
  – Word embedding
  – ELMo, BERT: contextualized word embedding
  – Applications
    • POS tagging, named entity recognition, semantic role labelling
    • Information extraction, parsing, sentiment analysis, neural machine translation, question answering, response generation, sentence completion, reading comprehension, information retrieval, sentence retrieval, knowledge completion