
Page 1: Natural Language Processing (Word Embedding)

Pohang University of Science and Technology (POSTECH), Department of Industrial and Management Engineering

Natural Language Processing

(Word Embedding)

Seonghwi Kim

Statistics and Data Science Lab.

August 12, 2020

Page 2: Natural Language Processing (Word Embedding)

Contents

1. Introduction

2. Embedding in NLP

3. Word Embedding

1. LSA

2. NNLM

3. word2vec

4. Limitation

5. Next seminar

Page 3: Natural Language Processing (Word Embedding)

3

Introduction

• Natural Language Processing

-A subfield of linguistics, computer science, and information engineering concerned with the interactions between computers and human (natural) languages

-In particular, it studies how to program computers to process large amounts of natural language data

-Major areas: Natural Language Understanding (NLU), Natural Language Generation (NLG), and speech recognition

[Various topics in NLP]

[1] 최요종, “2018-2020 NLU Research Trends”, Kakao Brain AI Trend Report, https://www.kakaobrain.com

Page 4: Natural Language Processing (Word Embedding)

4

Embedding in NLP

• Embedding in NLP

-Embedding: the result of converting natural language into vectors that a machine can understand, or the whole process of doing so

-Word Embedding (word vector), Document Embedding (document vector)

e.g.) Document Embedding (Bag-of-Words)

[1] Marco Bonzanini, ‘Word Embedding for Natural Language Processing in Python’, 2017

e.g.) Word Embedding (One-hot encoding)
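As a small illustration of both examples (my own toy sentences and vocabulary, not from the slides), a V-dimensional one-hot word vector and a bag-of-words document vector can be built as follows:

```python
# Toy corpus (illustrative only)
docs = ["the cat sat on the mat", "the dog sat"]
vocab = sorted({w for d in docs for w in d.split()})   # ['cat', 'dog', 'mat', 'on', 'sat', 'the']
index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """V-dimensional one-hot word vector."""
    vec = [0] * len(vocab)
    vec[index[word]] = 1
    return vec

def bag_of_words(doc):
    """V-dimensional document vector of word counts."""
    vec = [0] * len(vocab)
    for w in doc.split():
        vec[index[w]] += 1
    return vec

print(one_hot("cat"))         # [1, 0, 0, 0, 0, 0]
print(bag_of_words(docs[0]))  # [1, 0, 1, 1, 1, 2]
```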

Page 5: Natural Language Processing (Word Embedding)

5

Embedding in NLP

• Limitation of One-hot encoding as Word Embedding

-Sparsity: a one-hot vector is V-dimensional, where V is the vocabulary size (computationally inefficient)

-One-hot encoding does not represent semantic/syntactic information

• Why is Word Embedding important in NLP?

1. Calculation of relevance (similarity) between words/sentences

2. Implication of semantic/syntactic information

3. Transfer Learning

Page 6: Natural Language Processing (Word Embedding)

6

Embedding in NLP

• Why is a Word Embedding important?

1. Calculation of relevance (similarity) between words/sentences

[2D visualization of word2vec embeddings]

[1] Marco Bonzanini, ‘Word Embedding for Natural Language Processing in Python’, 2017
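To make word-level similarity concrete, here is a minimal sketch with made-up 3-dimensional vectors (real word2vec vectors would typically have 100 or more dimensions):

```python
import numpy as np

# Toy word vectors (illustrative values, not trained embeddings)
vectors = {
    "king":  np.array([0.8, 0.3, 0.1]),
    "queen": np.array([0.7, 0.4, 0.1]),
    "apple": np.array([0.1, 0.2, 0.9]),
}

def cosine_similarity(a, b):
    """Cosine of the angle between two word vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(vectors["king"], vectors["queen"]))  # close to 1: similar words
print(cosine_similarity(vectors["king"], vectors["apple"]))  # much smaller: dissimilar words
```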

Page 7: Natural Language Processing (Word Embedding)

7

Embedding in NLP

• Why is a Word Embedding important?

2. Implication of semantic/syntactic information

[Word embedding example (word2vec)] [Learns analogies]

[1] Marco Bonzanini, ‘Word Embedding for Natural Language Processing in Python’, 2017
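A hedged sketch of the analogy property (king − man + woman ≈ queen), assuming gensim is installed and can download the small pre-trained 'glove-wiki-gigaword-50' vectors; any pre-trained KeyedVectors would work the same way:

```python
import gensim.downloader as api

# Small pre-trained GloVe vectors, loaded as a gensim KeyedVectors object
wv = api.load("glove-wiki-gigaword-50")

# vector('king') - vector('man') + vector('woman') should land near vector('queen')
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))

# Cosine similarity between two words
print(wv.similarity("king", "queen"))
```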

Page 8: Natural Language Processing (Word Embedding)

8

Embedding in NLP

• Why is a Word Embedding important?

3. Transfer learning

-A pre-trained word vector is used as the input to another deep learning model

-e.g.) BERT (Bidirectional Encoder Representations from Transformers) for document classification

[‘This’, ‘movie’, ‘is’, ‘awesome’]: positive

[‘This’, ‘movie’, ‘is’, ‘weird’]: negative

[Plots of test accuracy and loss function for the classification example]

[1] Gichang Lee (2019). Sentence Embeddings Using Korean Corpora. pp. 38-39
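A minimal PyTorch sketch of the transfer-learning idea (my own placeholder, not the deck's BERT example): pre-trained word vectors are loaded into a frozen embedding layer and fed to a small downstream classifier.

```python
import torch
import torch.nn as nn

# Pretend these are pre-trained word vectors (V = 1000 words, M = 100 dimensions)
pretrained = torch.randn(1000, 100)

class SentimentClassifier(nn.Module):
    def __init__(self, pretrained_vectors, num_classes=2):
        super().__init__()
        # Transfer learning: reuse the pre-trained vectors and keep them frozen
        self.embedding = nn.Embedding.from_pretrained(pretrained_vectors, freeze=True)
        self.fc = nn.Linear(pretrained_vectors.size(1), num_classes)

    def forward(self, token_ids):
        vectors = self.embedding(token_ids)   # (batch, seq_len, M)
        pooled = vectors.mean(dim=1)          # average the word vectors per document
        return self.fc(pooled)                # class logits

model = SentimentClassifier(pretrained)
logits = model(torch.tensor([[1, 5, 9, 42]]))  # token ids for e.g. ['This', 'movie', 'is', 'awesome']
```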

Page 9: Natural Language Processing (Word Embedding)

9

Embedding in NLP

• Embedding models in NLP

Latent Semantic Analysis (1998)

NNLM (2003)

word2vec (2013)

GloVe (2014)

FastText (2017)

Doc2vec (2014)

Transformer Network (2017)

ELMo (2018)

BERT (2018)

GPT 1, 2, 3 (2018-2020)

XLNet (2019)

[Diagram: Natural Language Processing → Embedding → Word Embedding / Document Embedding]

Page 10: Natural Language Processing (Word Embedding)

10

Word Embedding

• Latent Semantic Analysis (LSA)

[Document Term Matrix]

-Matrix factorization-based method (SVD of a document-term matrix)

-Finds the latent/hidden relationships between words and contexts

-When a new document or word is added, the factorization has to be recomputed from scratch

[1] SVD, PCA and Latent Semantic Analysis, ratsgo.github.io/from%20frequency%20to%20semantics/2017/04/06/pcasvdlsa/
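A minimal LSA sketch (my own toy corpus, assuming scikit-learn is installed): build the document-term matrix and truncate its SVD to obtain low-dimensional document and term representations.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD

# Toy corpus (illustrative only)
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]

# Document-term matrix (rows: documents, columns: vocabulary terms)
vectorizer = CountVectorizer()
dtm = vectorizer.fit_transform(docs)

# Truncated SVD keeps k latent dimensions ("topics")
svd = TruncatedSVD(n_components=2, random_state=0)
doc_vectors = svd.fit_transform(dtm)          # k-dimensional document representations
term_vectors = svd.components_.T              # k-dimensional term representations

print(doc_vectors.shape, term_vectors.shape)  # (3, 2) and (V, 2)
```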

Page 11: Natural Language Processing (Word Embedding)

11

Word Embedding

• Feed-forward Neural net Language Model (NNLM)

-Prediction-based method, like the N-gram Language Model

-N-gram Language Model: a probabilistic language model that predicts the next item in a sequence from the preceding (n − 1) items

-NNLM uses a neural network structure for this learning

[Example of an N-gram language model] “A little boy is spreading __” (N = 4): the previous N − 1 words are used to predict the next word
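A toy count-based sketch of the idea (my own corpus and code, not from the slides): the next word is predicted from the previous N − 1 words.

```python
from collections import Counter, defaultdict

N = 3  # trigram model: predict the next word from the previous N - 1 = 2 words
corpus = "a little boy is spreading butter a little girl is spreading jam".split()

# Count how often each word follows each (N - 1)-word context
counts = defaultdict(Counter)
for i in range(len(corpus) - N + 1):
    context = tuple(corpus[i:i + N - 1])
    counts[context][corpus[i + N - 1]] += 1

def predict_next(*context):
    """Most frequent continuation of the given (N - 1)-word context."""
    return counts[tuple(context)].most_common(1)

print(predict_next("is", "spreading"))  # [('butter', 1)]: 'butter' and 'jam' tie, first seen wins
```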

Page 12: Natural Language Processing (Word Embedding)

12

Word Embedding

• Feed-forward Neural net Language Model (NNLM)

“what will the fat cat sit on” (N=5)

[EXAMPLE]

[General NNLM structure: input layer → projection layer → hidden layer → output layer]

[1] Bengio, Y., Ducharme, R., Vincent, P., & Jauvin, C. (2003). A neural probabilistic language model. JMLR

[2] 조경현(2018). Natural Language Processing with deep learning. https://wikidocs.net/45609

Page 13: Natural Language Processing (Word Embedding)

13

Word Embedding

• Feed-forward Neural net Language Model (NNLM)

[1] Bengio, Y., Ducharme, R., Vincent, P., & Jauvin, C. (2003). A neural probabilistic language model. JMLR

[2] 조경현(2018). Natural Language Processing with deep learning. https://wikidocs.net/45609

-The weight matrix learned in the projection layer is used as the word embedding

-Word Embedding: V-dimensional sparse (one-hot) vectors are converted into M-dimensional dense vectors

[M-dimensional word vector]
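A compact PyTorch sketch of this structure (my own illustration, loosely following the feed-forward NNLM above; V, M, N, H are arbitrary values): the rows of `embed.weight` are the learned M-dimensional word vectors.

```python
import torch
import torch.nn as nn

# Arbitrary illustrative sizes: vocabulary V, embedding dim M, context words N, hidden units H
V, M, N, H = 5000, 100, 4, 256

class NNLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(V, M)    # projection layer: V-dim one-hot -> M-dim dense vector
        self.hidden = nn.Linear(N * M, H)  # hidden layer over the concatenated context vectors
        self.output = nn.Linear(H, V)      # scores for every word in the vocabulary

    def forward(self, context_ids):        # context_ids: (batch, N) previous-word indices
        x = self.embed(context_ids)        # (batch, N, M)
        x = x.view(x.size(0), -1)          # concatenate the N word vectors
        h = torch.tanh(self.hidden(x))
        return self.output(h)              # next-word logits; softmax is applied in the loss

model = NNLM()
logits = model(torch.randint(0, V, (2, N)))  # two contexts of N previous words
word_vectors = model.embed.weight            # (V, M): the word embedding matrix
```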

Page 14: Natural Language Processing (Word Embedding)

14

Word Embedding

• Feed-forward Neural net Language Model (NNLM)

[1] Bengio, Y., Ducharme, R., Vincent, P., & Jauvin, C. (2003). A neural probabilistic language model. JMLR

[2] 조경현(2018). Natural Language Processing with deep learning. https://wikidocs.net/45609

-Time complexity: $Q = (N \times M) + (N \times M \times H) + (H \times V)$

-Computationally inefficient for large V: the $H \times V$ output term dominates (see the illustrative calculation below)

-Only the previous n − 1 words, in one direction, are referenced; context from both directions cannot be used
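For a rough sense of scale (the values below are my own illustrative choices, not from the slides), take $N = 5$, $M = 100$, $H = 500$, $V = 100{,}000$:

$$Q = \underbrace{5 \times 100}_{500} + \underbrace{5 \times 100 \times 500}_{250{,}000} + \underbrace{500 \times 100{,}000}_{50{,}000{,}000} \approx 5 \times 10^{7},$$

so almost the entire cost comes from the $H \times V$ softmax over the vocabulary.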

Page 15: Natural Language Processing (Word Embedding)

15

Word Embedding

• word2vec

[1] T. Mikolov, et al., (2013a). “Efficient Estimation of Word Representations in Vector Space”

[2] T. Mikolov, et al. (2013b), “Distributed Representations of Words and Phrases and their Compositionality”, Advances in NIPS

[3] 조경현(2018). Natural Language Processing with deep learning. https://wikidocs.net/45609

[word2vec architecture (CBOW/Skip-gram)]

-Idea (distributional hypothesis): words that are semantically similar often occur near each other in context

-2013a: introduces two neural-net models for embedding, CBOW and Skip-gram

-2013b: introduces two training methods for efficient learning, negative sampling and hierarchical softmax

Page 16: Natural Language Processing (Word Embedding)

16

Word Embedding

• word2vec

1. CBOW

-CBOW predicts a center word from the surrounding context words, in terms of word vectors

[EXAMPLE]

The fat cat sat on the mat

The fat cat sat on the mat

The fat cat sat on the mat

The fat cat sat on the mat

The fat cat sat on the mat

The fat cat sat on the mat

The fat cat sat on the mat

center word

surrounding word

train sample

[1] T. Mikolov, et al., (2013a). “Efficient Estimation of Word Representations in Vector Space”

[2] T. Mikolov, et al. (2013b), “Distributed Representations of Words and Phrases and their Compositionality”, Advances in NIPS

[3] 조경현(2018). Natural Language Processing with deep learning. https://wikidocs.net/45609
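A minimal training sketch (assuming the gensim library; the corpus and hyperparameters are toy choices): `sg=0` selects CBOW, and `window` is the number of surrounding words taken on each side of the center word.

```python
from gensim.models import Word2Vec

# Toy corpus: a list of tokenized sentences (illustrative only)
sentences = [
    ["the", "fat", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "log"],
]

# sg=0 -> CBOW (sg=1 would select Skip-gram); vector_size is the embedding dimension M
model = Word2Vec(sentences, vector_size=50, window=2, sg=0, min_count=1, epochs=50)

print(model.wv["cat"].shape)             # (50,): the dense word vector for "cat"
print(model.wv.most_similar("cat", topn=3))
```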


Page 17: Natural Language Processing (Word Embedding)

17

Word Embedding

• word2vec

[CBOW loss function: maximizing the probability of the center word ↔ minimizing the loss]

[1] T. Mikolov, et al., (2013a). “Efficient Estimation of Word Representations in Vector Space”

[2] T. Mikolov, et al. (2013b), “Distributed Representations of Words and Phrases and their Compositionality”, Advances in NIPS

[3] 조경현(2018). Natural Language Processing with deep learning. https://wikidocs.net/45609

1. CBOW

-Loss function $J$: the cross-entropy between the predicted probability vector (softmax output) and the one-hot vector of the center word; a standard form is given below
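The equation on the slide is an image; in one standard notation (my own choice: window size $m$, output vector $u_c$ of the center word $w_c$, and $\hat{v}$ the average of the input vectors of the $2m$ context words), the CBOW loss is

$$J = -\log P(w_c \mid w_{c-m}, \dots, w_{c+m}) = -\,u_c^{\top}\hat{v} + \log \sum_{j=1}^{V} \exp\!\left(u_j^{\top}\hat{v}\right).$$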

Page 18: Natural Language Processing (Word Embedding)

18

Word Embedding

• word2vec

[1] T. Mikolov, et al., (2013a). “Efficient Estimation of Word Representations in Vector Space”

[2] T. Mikolov, et al. (2013b), “Distributed Representations of Words and Phrases and their Compositionality”, Advances in NIPS

[3] 조경현(2018). Natural Language Processing with deep learning. https://wikidocs.net/45609

2. Skip-gram

-Skip-gram aims to predict the probability of the surrounding (context) words from a center word

[EXAMPLE]

-Loss function $J$: the sum of the cross-entropies between each predicted probability vector and the one-hot vector of the corresponding context word

-Loss function $J$ is too expensive to compute because of the softmax normalization over the whole vocabulary (a standard form is given below)
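Again the slide's equation is an image; in standard word2vec notation (my choice: window $m$, input vector $v_{w_c}$ of the center word, output vectors $u_k$), the skip-gram loss for one center word is

$$J = -\sum_{\substack{-m \le j \le m \\ j \ne 0}} \log P(w_{c+j} \mid w_c), \qquad P(w_O \mid w_c) = \frac{\exp\!\left(u_{w_O}^{\top} v_{w_c}\right)}{\sum_{k=1}^{V} \exp\!\left(u_k^{\top} v_{w_c}\right)},$$

and the denominator, a sum over all $V$ words, is what makes the softmax expensive.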

Page 19: Natural Language Processing (Word Embedding)

19

Word Embedding

• word2vec

[1] T. Mikolov, et al., (2013a). “Efficient Estimation of Word Representations in Vector Space”

[2] T. Mikolov, et al. (2013b), “Distributed Representations of Words and Phrases and their Compositionality”, Advances in NIPS

[3] 조경현(2018). Natural Language Processing with deep learning. https://wikidocs.net/45609

1. Negative sampling

• At every training step, sample a few negative examples instead of looping over the entire vocabulary

• Given a pair of a center word and a context word, the model is trained to classify whether the pair is a positive sample or a negative sample

[EXAMPLE]

The fat cat sat on the mat …   (center word c = “cat”)

positive samples (c, w): (cat, The), (cat, fat), (cat, sat), (cat, on)

negative samples (K = 2 per positive pair): (cat, dog), (cat, is), (cat, cute), (cat, and), (cat, adorable), (cat, then), (cat, cat), (cat, agree)

[Probability for negative sampling]

$$P_{\text{negative}}(w_i) = \frac{f(w_i)^{3/4}}{\sum_{j=0}^{V} f(w_j)^{3/4}}, \qquad \text{where } f(w_i) = \frac{\text{frequency of } w_i}{\text{total vocabulary set}}$$

[EXAMPLE] If the relative frequencies are $f(\text{is}) = 0.99$ and $f(\text{word2vec}) = 0.01$, then $P_{\text{negative}}(\text{is}) = 0.97$ and $P_{\text{negative}}(\text{word2vec}) = 0.03$: the $3/4$ power smooths the distribution, so rare words are sampled as negatives somewhat more often than their raw frequency suggests.
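A small sketch of this sampling distribution (my own code, using the two-word frequency table above) and of drawing K negatives with numpy:

```python
import numpy as np

# Relative word frequencies f(w) from the example above
freq = {"is": 0.99, "word2vec": 0.01}
words = list(freq)

# Raise to the 3/4 power and renormalize -> negative-sampling distribution
weights = np.array([freq[w] ** 0.75 for w in words])
p_negative = weights / weights.sum()
print(dict(zip(words, p_negative.round(2))))   # {'is': 0.97, 'word2vec': 0.03}

# Draw K = 2 negative words for one positive (center, context) pair
rng = np.random.default_rng(0)
print(rng.choice(words, size=2, p=p_negative))
```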

Page 20: Natural Language Processing (Word Embedding)

20

Word Embedding

• word2vec

[1] T. Mikolov, et al., (2013a). “Efficient Estimation of Word Representations in Vector Space”

[2] T. Mikolov, et al. (2013b), “Distributed Representations of Words and Phrases and their Compositionality”, Advances in NIPS

[3] 조경현(2018). Natural Language Processing with deep learning. https://wikidocs.net/45609

1. Skip-gram with negative sampling

-Turns the problem from unsupervised learning into supervised learning

-Binary classification with the sigmoid function: the label is 1 if the pair (center word, context word) comes from a positive sample and 0 if it comes from a negative sample

-This yields a new objective function and update rules (the slide shows the derivation; a standard form is sketched below)
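In the notation of Mikolov et al. (2013b), with $\sigma(x) = 1/(1 + e^{-x})$, center word $w_I$, true context word $w_O$, and $K$ negative words $w_k$ drawn from $P_{\text{negative}}$ (the expectation over negatives replaced by the $K$ sampled words), the per-pair objective to minimize is

$$J = -\log \sigma\!\left(u_{w_O}^{\top} v_{w_I}\right) - \sum_{k=1}^{K} \log \sigma\!\left(-\,u_{w_k}^{\top} v_{w_I}\right).$$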

Page 21: Natural Language Processing (Word Embedding)

21

Word Embedding

• word2vec

[1] T. Mikolov, et al., (2013a). “Efficient Estimation of Word Representations in Vector Space”

[2] T. Mikolov, et al. (2013b), “Distributed Representations of Words and Phrases and their Compositionality”, Advances in NIPS

[3] 조경현(2018). Natural Language Processing with deep learning. https://wikidocs.net/45609

2. Hierarchical softmax

• Hierarchical softmax uses a Huffman binary tree to represent all the words in the vocabulary

-The probability of a word is the product of the branch probabilities along the path from the root to its leaf, so the computational cost becomes $O(\log V)$ instead of $O(V)$

[Binary tree for hierarchical softmax, with branch probabilities (e.g. 0.7 / 0.3 and 0.4 / 0.6) at the internal nodes]
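In the standard formulation (my notation: each internal node $n$ on the path to word $w$ has a vector $u_n$, and the sign is chosen by whether the path branches left or right at $n$), the probability assigned to a word is

$$P(w \mid w_I) = \prod_{n \,\in\, \mathrm{path}(w)} \sigma\!\left(\pm\, u_n^{\top} v_{w_I}\right),$$

which touches only the $O(\log V)$ nodes on the path instead of summing over all $V$ words; since $\sigma(x) + \sigma(-x) = 1$, the probabilities over all leaves still sum to one.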

Page 22: Natural Language Processing (Word Embedding)

22

Limitation

• Limitation of word embedding

-Words with multiple meanings (homonyms) cannot be distinguished: a static word embedding assigns a single vector to every occurrence of a word

[EXAMPLE] The Korean word “배” means pear, belly, or boat depending on context: “배”는 수분이 많은 과일이다 (a pear is a juicy fruit), “배” 고프다 (I am hungry), “배” 나온다 (my belly sticks out), “배”가 불렀다 (I am full), “배” 멀미가 난다 (I get seasick), “배”를 바다에 띄웠다 (we floated the boat on the sea); a word embedding maps every “배” to the same point

[2D visualization of a BERT embedding: the same sentences are separated according to the meaning of “배”]

-Only n − 1 words are used for learning

-word embedding → document embedding

[1] Gichang Lee (2019). Sentence Embeddings Using Korean Corpora. pp. 38-39

Page 23: Natural Language Processing (Word Embedding)

23

Next seminar

• Embedding models in NLP

Latent Semantic Analysis (1998)

NNLM (2003)

word2vec (2013)

GloVe (2014)

FastText (2017)

Doc2vec (2014)

Transformer Network (2017)

ELMo (2018)

BERT (2018)

GPT 1, 2, 3 (2018-2020)

XLNet (2019)

[Diagram: Natural Language Processing → Embedding → Word Embedding / Document Embedding]