TRANSCRIPT
Department of Industrial and Management Engineering, POSTECH (Pohang University of Science and Technology)
Natural Language Processing
(Word Embedding)
Seonghwi Kim
Statistics and Data Science Lab.
August 12, 2020
Contents
1. Introduction
2. Embedding in NLP
3. Word Embedding
   1. LSA
   2. NNLM
   3. word2vec
4. Limitation
5. Next seminar
Introduction
• Natural Language Processing
-A subfield of linguistics, computer science, and information engineering concerned with the interactions between computers and human (natural) languages
-In particular, studies how to program computers to process and analyze large amounts of natural language data
-Natural Language Understanding (NLU), Natural Language Generation (NLG), speech recognition, etc.
[Various topics in NLP]
[1] 최요종, "2018-2020 NLU Research Trends", Kakao Brain AI Trend Report, https://www.kakaobrain.com
Embedding in NLP
• Embedding in NLP
-Embedding: the result of converting natural language into a vector that a machine can understand, or the whole process of producing such a vector
-Word Embedding (word vector), Document Embedding (document vector)
e.g.) Document Embedding (Bag-of-Words)
e.g.) Word Embedding (One-hot encoding)
[1] Marco Bonzanini, 'Word Embedding for Natural Language Processing in Python', 2017
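A minimal sketch of the bag-of-words document embedding above, using scikit-learn's CountVectorizer (the two toy documents are assumptions for illustration):

from sklearn.feature_extraction.text import CountVectorizer

# Bag-of-words: each document becomes a vector of word counts over the vocabulary.
docs = ["the movie is awesome", "the movie is weird"]
bow = CountVectorizer()
X = bow.fit_transform(docs)
print(bow.get_feature_names_out())  # ['awesome' 'is' 'movie' 'the' 'weird']
print(X.toarray())                  # [[1 1 1 1 0]
                                    #  [0 1 1 1 1]]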
Embedding in NLP
• Limitation of one-hot encoding as a word embedding
-Sparsity: a one-hot vector is V-dimensional, where V is the vocabulary size (computationally inefficient)
-One-hot encoding does not represent semantic/syntactic information
• Why is Word Embedding important in NLP?
1. Calculation of relevance (similarity) between words/sentences (see the sketch below)
2. Implication of semantic/syntactic information
3. Transfer learning
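A minimal sketch of why one-hot encoding fails at point 1 (toy three-word vocabulary assumed): any two distinct one-hot vectors are orthogonal, so their cosine similarity is 0 even for related words.

import numpy as np

# One-hot word vectors over a toy vocabulary: V-dimensional and sparse.
vocab = ["movie", "film", "awesome"]
one_hot = {w: np.eye(len(vocab))[i] for i, w in enumerate(vocab)}

cos = lambda a, b: a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
print(cos(one_hot["movie"], one_hot["film"]))  # 0.0, although the words are related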
Embedding in NLP
• Why is Word Embedding important?
1. Calculation of relevance (similarity) between words/sentences
[2D visualization of a word2vec embedding]
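A minimal sketch of such similarity queries using gensim's KeyedVectors (the pretrained model file name is a placeholder; any word2vec-format file works):

from gensim.models import KeyedVectors

# Load pretrained word2vec vectors (file path is a placeholder).
kv = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

print(kv.similarity("coffee", "tea"))     # cosine similarity between the two word vectors
print(kv.most_similar("coffee", topn=3))  # nearest neighbors in the embedding space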
Embedding in NLP
• Why is Word Embedding important?
2. Implication of semantic/syntactic information
[Word embedding example (word2vec): it learns analogies]
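The classic analogy king - man + woman ≈ queen can be checked with vector arithmetic; a minimal gensim sketch (model file is a placeholder, and the printed score is only typical, not exact):

from gensim.models import KeyedVectors

kv = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

# king - man + woman: add/subtract word vectors, then find the nearest word.
print(kv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
# typically [('queen', ...)] with a score around 0.7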
Embedding in NLP
• Why is Word Embedding important?
3. Transfer learning
-A pre-trained word vector is used as the input to another deep learning model (see the sketch below)
-e.g. BERT (Bidirectional Encoder Representations from Transformers) for document classification:
['This', 'movie', 'is', 'awesome']: positive
['This', 'movie', 'is', 'weird']: negative
[Plots of test accuracy and the loss function]
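A minimal PyTorch sketch of the idea (the "pretrained" matrix below is a random placeholder standing in for real word2vec vectors, and the token ids are made up):

import torch
import torch.nn as nn

# Pretrained word vectors used as the input embedding of a downstream classifier.
pretrained = torch.randn(10000, 300)       # placeholder for real pretrained vectors
embed = nn.Embedding.from_pretrained(pretrained, freeze=False)  # fine-tuned during training

classifier = nn.Linear(300, 2)             # e.g. positive/negative logits
tokens = torch.tensor([[1, 42, 7, 99]])    # ['This', 'movie', 'is', 'awesome'] as word ids
sentence_vec = embed(tokens).mean(dim=1)   # simple average pooling over the words
print(classifier(sentence_vec).shape)      # torch.Size([1, 2])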
[1] Gichang Lee (2019). Sentence Embeddings Using Korean Corpora. pp. 38-39.
Embedding in NLP
• Embedding models in NLP
[Natural Language Processing] → [Embedding] → [Word Embedding] / [Document Embedding]
Word Embedding:
Latent Semantic Analysis (1998)
NNLM (2003)
word2vec (2013)
GloVe (2014)
FastText (2017)
Document Embedding:
Doc2vec (2014)
Transformer Network (2017)
ELMo (2018)
BERT (2018)
GPT-1, 2, 3 (2018-2020)
XLNet (2019)
Word Embedding
• Latent Semantic Analysis (LSA)
[Document-Term Matrix]
-A matrix-factorization-based method (truncated SVD of the document-term matrix)
-Finds the latent/hidden meaning shared between words and their contexts
-When a new document or word is added, the factorization has to be recomputed from scratch (see the sketch below)
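A minimal numpy sketch of the factorization (the count matrix is a toy assumption):

import numpy as np

# Toy document-term count matrix: rows = documents, columns = terms.
A = np.array([[2., 1., 0., 0.],
              [1., 2., 0., 1.],
              [0., 0., 3., 1.],
              [0., 1., 2., 2.]])

# Truncated SVD keeps the k largest singular values: the latent dimensions.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
doc_vecs  = U[:, :k] * s[:k]   # k-dimensional document embeddings
word_vecs = Vt[:k].T           # k-dimensional word embeddings
print(doc_vecs.round(2), word_vecs.round(2), sep="\n")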
[1] SVD, PCA and Latent Semantic Analysis, ratsgo.github.io/from%20frequency%20to%20semantics/2017/04/06/pcasvdlsa/
Word Embedding
• Feed-forward Neural net Language Model (NNLM)
-A prediction-based method, like the N-gram language model
-N-gram language model: a probabilistic language model that predicts the next item in a sequence from the preceding n-1 items (an (n-1)-order Markov model)
-NNLM uses a neural network structure for the learning
[EXAMPLE of an N-gram language model] "A little boy is spreading __" (N = 4): the N-1 = 3 preceding words ("boy is spreading") are used to predict the blank
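A toy count-based 4-gram model (the two-sentence corpus is made up) to make the prediction step concrete:

from collections import Counter, defaultdict

# Count how often each word follows a 3-word history (N = 4, so N-1 = 3 history words).
corpus = "a little boy is spreading butter a little boy is spreading jam".split()
counts = defaultdict(Counter)
for i in range(len(corpus) - 3):
    counts[tuple(corpus[i:i+3])][corpus[i+3]] += 1

# P(next | history) estimated by relative frequency.
history = ("boy", "is", "spreading")
total = sum(counts[history].values())
print({w: c / total for w, c in counts[history].items()})  # {'butter': 0.5, 'jam': 0.5}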
Word Embedding
• Feed-forward Neural net Language Model (NNLM)
[EXAMPLE] "what will the fat cat sit on" (N = 5): the previous N-1 = 4 words are used to predict the next word
[General NNLM structure: input layer → projection layer → hidden layer → output layer]
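A minimal PyTorch sketch of this structure (layer sizes are illustrative; Bengio's optional direct input-to-output connections are omitted):

import torch
import torch.nn as nn

V, M, H, N = 10000, 100, 500, 5   # vocab size, embedding dim, hidden dim, n-gram order

class NNLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.C = nn.Embedding(V, M)              # projection layer: lookup table C
        self.hidden = nn.Linear((N - 1) * M, H)  # hidden layer
        self.out = nn.Linear(H, V)               # output layer over the vocabulary

    def forward(self, context):                  # context: (batch, N-1) word indices
        x = self.C(context).view(context.size(0), -1)  # concatenate N-1 word vectors
        return self.out(torch.tanh(self.hidden(x)))    # logits for the next word

logits = NNLM()(torch.randint(V, (2, N - 1)))    # predict the next word from N-1 words
print(logits.shape)                              # torch.Size([2, 10000])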
[1] Bengio, Y., Ducharme, R., Vincent, P., & Jauvin, C. (2003). A neural probabilistic language model. JMLR.
[2] 조경현 (2018). Natural Language Processing with Deep Learning. https://wikidocs.net/45609
Word Embedding
• Feed-forward Neural net Language Model (NNLM)
-The weight matrix of the projection layer serves as the word embedding
-Word embedding converts V-dimensional sparse vectors into M-dimensional dense vectors
[Figure: looking up an M-dimensional word vector as a row of the projection-layer weight matrix]
Word Embedding
• Feed-forward Neural net Language Model (NNLM)
-Time complexity per example: $Q = (N \times M) + (N \times M \times H) + (H \times V)$
-Computationally inefficient for large V, since the $H \times V$ output term dominates
-There is also the limit of referencing only the previous n-1 words, in one direction rather than both
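For illustration with assumed values (not from the slide), say $N = 5$, $M = 100$, $H = 500$, $V = 100{,}000$: $Q = 500 + 250{,}000 + 50{,}000{,}000$, so the $H \times V$ output layer accounts for nearly all of the cost. word2vec's negative sampling and hierarchical softmax (later slides) attack exactly this term.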
Word Embedding
• word2vec
[1] T. Mikolov et al. (2013a). "Efficient Estimation of Word Representations in Vector Space"
[2] T. Mikolov et al. (2013b). "Distributed Representations of Words and Phrases and their Compositionality", Advances in NIPS
[3] 조경현 (2018). Natural Language Processing with Deep Learning. https://wikidocs.net/45609
[word2vec architecture (CBOW/Skip-gram)]
-Idea (distributional hypothesis): words that are semantically similar often occur near each other in context
-2013a: introduces two neural-net models for embedding, CBOW and Skip-gram
-2013b: introduces two training methods for efficient learning, negative sampling and hierarchical softmax
Word Embedding
• word2vec
1. CBOW
-CBOW predicts a center word from the surrounding context words, in terms of word vectors
[EXAMPLE] "The fat cat sat on the mat": a window slides over the sentence, each word in turn serving as the center word and the surrounding words within the window forming one training sample (see the sketch below)
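A minimal sketch of generating the training samples (window size 2 is an assumption; the slide does not fix it):

# Generate (context words, center word) training samples with a sliding window.
sentence = "The fat cat sat on the mat".split()
window = 2
samples = []
for i, center in enumerate(sentence):
    lo, hi = max(0, i - window), min(len(sentence), i + window + 1)
    context = [sentence[j] for j in range(lo, hi) if j != i]
    samples.append((context, center))
for context, center in samples[:3]:
    print(context, "->", center)
# ['fat', 'cat'] -> The
# ['The', 'cat', 'sat'] -> fat
# ['The', 'fat', 'sat', 'on'] -> cat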
Word Embedding
• word2vec
1. CBOW
-Loss function $J$: maximizing the probability of the center word given its context is equivalent to minimizing $J$
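In commonly used notation (an assumption here: $\hat{v}$ denotes the average of the context word vectors and $u_j$ the output vector of word $j$), the CBOW loss is the negative log-likelihood of the center word $w_c$:

$$J = -\log P(w_c \mid w_{c-m}, \ldots, w_{c+m}) = -u_c^{\top}\hat{v} + \log \sum_{j=1}^{V} \exp\!\left(u_j^{\top}\hat{v}\right)$$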
Word Embedding
• word2vec
2. Skip-gram
-Skip-gram aims to predict the probability of the context words given a center word
[EXAMPLE]
-Loss function $J$ is a sum of cross-entropy terms $H(\hat{y}, y)$ between the predicted probability vector $\hat{y}$ and the one-hot vector $y$ of each true context word
-Loss function $J$ is too expensive to compute because of the softmax normalization over the whole vocabulary
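In the same assumed notation ($v_c$ the center word vector, $u_o$ the output vector of a context word), the loss sums over the window, and each softmax denominator runs over all $V$ words, which is the expensive part:

$$J = -\sum_{\substack{-m \le j \le m \\ j \ne 0}} \log P(w_{c+j} \mid w_c), \qquad P(o \mid c) = \frac{\exp(u_o^{\top} v_c)}{\sum_{w=1}^{V} \exp(u_w^{\top} v_c)}$$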
Word Embedding
• word2vec
1. Negative sampling
• For every training step, sample a few negative examples instead of looping over the entire vocabulary
• Given a pair of a center word and a context word, the model is trained to classify whether the pair is a positive sample or a negative sample
[EXAMPLE] "The fat cat sat on the mat …" (center word c = cat)
positive samples (c, w): (cat, The), (cat, fat), (cat, sat), (cat, on)
negative samples (K = 2, two negatives drawn for each positive pair): (cat, dog), (cat, is), (cat, cute), (cat, and), (cat, adorable), (cat, then), (cat, cat), (cat, agree)
[Probability for negative sampling]
$$P_{\text{negative}}(w_i) = \frac{f(w_i)^{3/4}}{\sum_{j=0}^{V} f(w_j)^{3/4}}, \quad \text{where } f(w_i) = \frac{\text{count of } w_i}{\text{total number of words in the corpus}}$$
[EXAMPLE] If $f(\text{is}) = 0.99$ and $f(\text{word2vec}) = 0.01$, then $P_{\text{negative}}(\text{is}) = 0.97$ and $P_{\text{negative}}(\text{word2vec}) = 0.03$: raising frequencies to the $3/4$ power slightly suppresses very frequent words and boosts rare ones.
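A minimal numpy sketch of drawing negatives from this distribution (the four-word vocabulary and its frequencies are assumptions for illustration):

import numpy as np

# Toy relative frequencies f(w) for a 4-word vocabulary.
vocab = np.array(["is", "the", "cat", "word2vec"])
freq = np.array([0.40, 0.35, 0.20, 0.05])

# Raise to the 3/4 power and renormalize (Mikolov et al., 2013b).
p_neg = freq ** 0.75
p_neg /= p_neg.sum()
print(dict(zip(vocab, p_neg.round(3))))

# Draw K = 2 negative words for one (center, context) training pair.
print(np.random.choice(vocab, size=2, replace=False, p=p_neg))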
Word Embedding
• word2vec
1. Skip-gram with negative sampling
-From unsupervised learning to supervised learning: a binary classification problem with the sigmoid function
-Let's denote the label $t$ of a pair of center word $c$ and word $w$:
  $t = 1$ if $(c, w)$ is a positive sample
  $t = 0$ if $(c, w)$ is drawn from the negative corpus
-Then $P(t = 1 \mid c, w) = \sigma(u_w^{\top} v_c)$, and the new objective function (whose gradients give the update rules) is
$$J = -\log \sigma(u_o^{\top} v_c) - \sum_{k=1}^{K} \log \sigma(-u_k^{\top} v_c),$$
where $u_o$ is the output vector of the positive context word and $u_1, \ldots, u_K$ are those of the sampled negatives
Word Embedding
• word2vec
2. Hierarchical softmax
• Hierarchical softmax uses a Huffman binary tree to represent all the words in the vocabulary
-Computational cost becomes $O(\log V)$ instead of $O(V)$
[Binary tree for hierarchical softmax: each internal node splits the probability between its two children, e.g. 0.7/0.3 at the root and 0.4/0.6 at the next node]
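For illustration with the branch probabilities shown in the tree above: a leaf reached via the 0.7 branch and then the 0.4 branch has probability $P(w) = 0.7 \times 0.4 = 0.28$. In general, a word's probability is the product of the branch probabilities along its root-to-leaf path, so only about $\log_2 V$ binary decisions are needed instead of a $V$-way softmax.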
Limitation
• Limitations of word embedding
-Words with multiple meanings (homonyms) cannot be distinguished
[EXAMPLE] The Korean word "배" can mean a pear, a belly/stomach, or a boat:
"배"는 수분이 많은 과일이다. ("배" is a juicy fruit. → pear)
"배" 고프다. (My "배" is empty, i.e., I am hungry. → stomach)
"배" 나온다. (My "배" sticks out. → belly)
"배"가 불렀다. (My "배" is full. → stomach)
"배" 멀미가 난다. (I get "배"-sick. → boat, as in seasick)
"배"를 바다에 띄웠다. (I floated the "배" on the sea. → boat)
[2D visualization of a BERT embedding: the occurrences of "배" are separated according to their contexts]
[2D visualization of a static word embedding: every occurrence of "배" collapses to the single point "배"]
-Only the surrounding n-1 words are used for learning
word embedding → document embedding
Next seminar
• Embedding models in NLP
[Natural Language Processing] → [Embedding] → [Word Embedding] / [Document Embedding]
Word Embedding:
Latent Semantic Analysis (1998)
NNLM (2003)
word2vec (2013)
GloVe (2014)
FastText (2017)
…
Document Embedding:
Doc2vec (2014)
Transformer Network (2017)
ELMo (2018)
BERT (2018)
GPT-1, 2, 3 (2018-2020)
XLNet (2019)