TRANSCRIPT
Department of Industrial and Management Engineering, POSTECH (Pohang University of Science and Technology)
Natural Language Processing
(Word Embedding)
Seonghwi Kim
Statistics and Data Science Lab.
August 12, 2020
Contents
1. Introduction
2. Embedding in NLP
3. Word Embedding
   1. LSA
   2. NNLM
   3. word2vec
4. Limitation
5. Next seminar
Introduction
• Natural Language Processing
-A subfield of linguistics, computer science, and information engineering concerned with the interactions between computers and human (natural) languages
-In particular, studies how to program computers to process and analyze large amounts of natural language data
-Natural Language Understanding (NLU), Natural Language Generation (NLG), speech recognition, etc.
[Various topics in NLP]
[1] 최요종, "2018-2020 NLU Research Trends", Kakao Brain AI Trend Report, https://www.kakaobrain.com
Embedding in NLP
• Embedding in NLP
-Embedding: the result of converting natural language into a vector that a machine can understand, or the whole process of producing such a vector
-Word Embedding (word vector), Document Embedding (document vector)
e.g.) Document Embedding (Bag-of-Words)
e.g.) Word Embedding (One-hot encoding)
[1] Marco Bonzanini, 'Word Embedding for Natural Language Processing in Python', 2017
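A minimal sketch of the bag-of-words document embedding above, using scikit-learn's CountVectorizer (the two toy documents are assumptions for illustration):

from sklearn.feature_extraction.text import CountVectorizer

# Bag-of-words: each document becomes a vector of word counts over the vocabulary.
docs = ["the movie is awesome", "the movie is weird"]
bow = CountVectorizer()
X = bow.fit_transform(docs)
print(bow.get_feature_names_out())  # ['awesome' 'is' 'movie' 'the' 'weird']
print(X.toarray())                  # [[1 1 1 1 0]
                                    #  [0 1 1 1 1]]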
Embedding in NLP
• Limitation of one-hot encoding as a word embedding
-Sparsity: a one-hot vector is V-dimensional, where V is the vocabulary size (computationally inefficient)
-One-hot encoding does not represent semantic/syntactic information
• Why is Word Embedding important in NLP?
1. Calculation of relevance (similarity) between words/sentences (see the sketch below)
2. Implication of semantic/syntactic information
3. Transfer learning
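A minimal sketch of why one-hot encoding fails at point 1 (toy three-word vocabulary assumed): any two distinct one-hot vectors are orthogonal, so their cosine similarity is 0 even for related words.

import numpy as np

# One-hot word vectors over a toy vocabulary: V-dimensional and sparse.
vocab = ["movie", "film", "awesome"]
one_hot = {w: np.eye(len(vocab))[i] for i, w in enumerate(vocab)}

cos = lambda a, b: a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
print(cos(one_hot["movie"], one_hot["film"]))  # 0.0, although the words are related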
Embedding in NLP
• Why is Word Embedding important?
1. Calculation of relevance (similarity) between words/sentences
[2D visualization of a word2vec embedding]
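A minimal sketch of such similarity queries using gensim's KeyedVectors (the pretrained model file name is a placeholder; any word2vec-format file works):

from gensim.models import KeyedVectors

# Load pretrained word2vec vectors (file path is a placeholder).
kv = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

print(kv.similarity("coffee", "tea"))     # cosine similarity between the two word vectors
print(kv.most_similar("coffee", topn=3))  # nearest neighbors in the embedding space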
Embedding in NLP
• Why is Word Embedding important?
2. Implication of semantic/syntactic information
[Word embedding example (word2vec): it learns analogies]
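The classic analogy king - man + woman ≈ queen can be checked with vector arithmetic; a minimal gensim sketch (model file is a placeholder, and the printed score is only typical, not exact):

from gensim.models import KeyedVectors

kv = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

# king - man + woman: add/subtract word vectors, then find the nearest word.
print(kv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
# typically [('queen', ...)] with a score around 0.7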
Embedding in NLP
• Why is Word Embedding important?
3. Transfer learning
-A pre-trained word vector is used as the input to another deep learning model (see the sketch below)
-e.g. BERT (Bidirectional Encoder Representations from Transformers) for document classification:
['This', 'movie', 'is', 'awesome']: positive
['This', 'movie', 'is', 'weird']: negative
[Plots of test accuracy and the loss function]
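A minimal PyTorch sketch of the idea (the "pretrained" matrix below is a random placeholder standing in for real word2vec vectors, and the token ids are made up):

import torch
import torch.nn as nn

# Pretrained word vectors used as the input embedding of a downstream classifier.
pretrained = torch.randn(10000, 300)       # placeholder for real pretrained vectors
embed = nn.Embedding.from_pretrained(pretrained, freeze=False)  # fine-tuned during training

classifier = nn.Linear(300, 2)             # e.g. positive/negative logits
tokens = torch.tensor([[1, 42, 7, 99]])    # ['This', 'movie', 'is', 'awesome'] as word ids
sentence_vec = embed(tokens).mean(dim=1)   # simple average pooling over the words
print(classifier(sentence_vec).shape)      # torch.Size([1, 2])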
[1] Gichang Lee (2019). Sentence Embeddings Using Korean Corpora. pp. 38-39.
Embedding in NLP
• Embedding models in NLP
[Natural Language Processing] → [Embedding] → [Word Embedding] / [Document Embedding]
Word Embedding:
Latent Semantic Analysis (1998)
NNLM (2003)
word2vec (2013)
GloVe (2014)
FastText (2017)
Document Embedding:
Doc2vec (2014)
Transformer Network (2017)
ELMo (2018)
BERT (2018)
GPT-1, 2, 3 (2018-2020)
XLNet (2019)
Word Embedding
• Latent Semantic Analysis (LSA)
[Document-Term Matrix]
-A matrix-factorization-based method (truncated SVD of the document-term matrix)
-Finds the latent/hidden meaning shared between words and their contexts
-When a new document or word is added, the factorization has to be recomputed from scratch (see the sketch below)
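A minimal numpy sketch of the factorization (the count matrix is a toy assumption):

import numpy as np

# Toy document-term count matrix: rows = documents, columns = terms.
A = np.array([[2., 1., 0., 0.],
              [1., 2., 0., 1.],
              [0., 0., 3., 1.],
              [0., 1., 2., 2.]])

# Truncated SVD keeps the k largest singular values: the latent dimensions.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
doc_vecs  = U[:, :k] * s[:k]   # k-dimensional document embeddings
word_vecs = Vt[:k].T           # k-dimensional word embeddings
print(doc_vecs.round(2), word_vecs.round(2), sep="\n")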
[1] SVD, PCA and Latent Semantic Analysis, ratsgo.github.io/from%20frequency%20to%20semantics/2017/04/06/pcasvdlsa/
Word Embedding
• Feed-forward Neural net Language Model (NNLM)
-A prediction-based method, like the N-gram language model
-N-gram language model: a probabilistic language model that predicts the next item in a sequence from the preceding n-1 items (an (n-1)-order Markov model)
-NNLM uses a neural network structure for the learning
[EXAMPLE of an N-gram language model] "A little boy is spreading __" (N = 4): the N-1 = 3 preceding words ("boy is spreading") are used to predict the blank
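A toy count-based 4-gram model (the two-sentence corpus is made up) to make the prediction step concrete:

from collections import Counter, defaultdict

# Count how often each word follows a 3-word history (N = 4, so N-1 = 3 history words).
corpus = "a little boy is spreading butter a little boy is spreading jam".split()
counts = defaultdict(Counter)
for i in range(len(corpus) - 3):
    counts[tuple(corpus[i:i+3])][corpus[i+3]] += 1

# P(next | history) estimated by relative frequency.
history = ("boy", "is", "spreading")
total = sum(counts[history].values())
print({w: c / total for w, c in counts[history].items()})  # {'butter': 0.5, 'jam': 0.5}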
Word Embedding
• Feed-forward Neural net Language Model (NNLM)
[EXAMPLE] "what will the fat cat sit on" (N = 5): the previous N-1 = 4 words are used to predict the next word
[General NNLM structure: input layer → projection layer → hidden layer → output layer]
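A minimal PyTorch sketch of this structure (layer sizes are illustrative; Bengio's optional direct input-to-output connections are omitted):

import torch
import torch.nn as nn

V, M, H, N = 10000, 100, 500, 5   # vocab size, embedding dim, hidden dim, n-gram order

class NNLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.C = nn.Embedding(V, M)              # projection layer: lookup table C
        self.hidden = nn.Linear((N - 1) * M, H)  # hidden layer
        self.out = nn.Linear(H, V)               # output layer over the vocabulary

    def forward(self, context):                  # context: (batch, N-1) word indices
        x = self.C(context).view(context.size(0), -1)  # concatenate N-1 word vectors
        return self.out(torch.tanh(self.hidden(x)))    # logits for the next word

logits = NNLM()(torch.randint(V, (2, N - 1)))    # predict the next word from N-1 words
print(logits.shape)                              # torch.Size([2, 10000])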
[1] Bengio, Y., Ducharme, R., Vincent, P., & Jauvin, C. (2003). A neural probabilistic language model. JMLR.
[2] 조경현 (2018). Natural Language Processing with Deep Learning. https://wikidocs.net/45609
Word Embedding
• Feed-forward Neural net Language Model (NNLM)
-The weight matrix of the projection layer serves as the word embedding
-Word embedding converts V-dimensional sparse vectors into M-dimensional dense vectors
[Figure: looking up an M-dimensional word vector as a row of the projection-layer weight matrix]
Word Embedding
• Feed-forward Neural net Language Model (NNLM)
-Time complexity per example: $Q = (N \times M) + (N \times M \times H) + (H \times V)$
-Computationally inefficient for large V, since the $H \times V$ output term dominates
-There is also the limit of referencing only the previous n-1 words, in one direction rather than both
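For illustration with assumed values (not from the slide), say $N = 5$, $M = 100$, $H = 500$, $V = 100{,}000$: $Q = 500 + 250{,}000 + 50{,}000{,}000$, so the $H \times V$ output layer accounts for nearly all of the cost. word2vec's negative sampling and hierarchical softmax (later slides) attack exactly this term.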
Word Embedding
• word2vec
[1] T. Mikolov et al. (2013a). "Efficient Estimation of Word Representations in Vector Space"
[2] T. Mikolov et al. (2013b). "Distributed Representations of Words and Phrases and their Compositionality", Advances in NIPS
[3] 조경현 (2018). Natural Language Processing with Deep Learning. https://wikidocs.net/45609
[word2vec architecture (CBOW/Skip-gram)]
-Idea (distributional hypothesis): words that are semantically similar often occur near each other in context
-2013a: introduces two neural-net models for embedding, CBOW and Skip-gram
-2013b: introduces two training methods for efficient learning, negative sampling and hierarchical softmax
Word Embedding
• word2vec
1. CBOW
-CBOW predicts a center word from the surrounding context words, in terms of word vectors
[EXAMPLE] "The fat cat sat on the mat": a window slides over the sentence, each word in turn serving as the center word and the surrounding words within the window forming one training sample (see the sketch below)
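A minimal sketch of generating the training samples (window size 2 is an assumption; the slide does not fix it):

# Generate (context words, center word) training samples with a sliding window.
sentence = "The fat cat sat on the mat".split()
window = 2
samples = []
for i, center in enumerate(sentence):
    lo, hi = max(0, i - window), min(len(sentence), i + window + 1)
    context = [sentence[j] for j in range(lo, hi) if j != i]
    samples.append((context, center))
for context, center in samples[:3]:
    print(context, "->", center)
# ['fat', 'cat'] -> The
# ['The', 'cat', 'sat'] -> fat
# ['The', 'fat', 'sat', 'on'] -> cat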
Word Embedding
• word2vec
1. CBOW
-Loss function $J$: maximizing the probability of the center word given its context is equivalent to minimizing $J$
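In commonly used notation (an assumption here: $\hat{v}$ denotes the average of the context word vectors and $u_j$ the output vector of word $j$), the CBOW loss is the negative log-likelihood of the center word $w_c$:

$$J = -\log P(w_c \mid w_{c-m}, \ldots, w_{c+m}) = -u_c^{\top}\hat{v} + \log \sum_{j=1}^{V} \exp\!\left(u_j^{\top}\hat{v}\right)$$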
Word Embedding
• word2vec
2. Skip-gram
-Skip-gram aims to predict the probability of the context words given a center word
[EXAMPLE]
-Loss function $J$ is a sum of cross-entropy terms $H(\hat{y}, y)$ between the predicted probability vector $\hat{y}$ and the one-hot vector $y$ of each true context word
-Loss function $J$ is too expensive to compute because of the softmax normalization over the whole vocabulary
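In the same assumed notation ($v_c$ the center word vector, $u_o$ the output vector of a context word), the loss sums over the window, and each softmax denominator runs over all $V$ words, which is the expensive part:

$$J = -\sum_{\substack{-m \le j \le m \\ j \ne 0}} \log P(w_{c+j} \mid w_c), \qquad P(o \mid c) = \frac{\exp(u_o^{\top} v_c)}{\sum_{w=1}^{V} \exp(u_w^{\top} v_c)}$$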
Word Embedding
• word2vec
1. Negative sampling
• For every training step, sample a few negative examples instead of looping over the entire vocabulary
• Given a pair of a center word and a context word, the model is trained to classify whether the pair is a positive sample or a negative sample
[EXAMPLE] "The fat cat sat on the mat …" (center word c = cat)
positive samples (c, w): (cat, The), (cat, fat), (cat, sat), (cat, on)
negative samples (K = 2, two negatives drawn for each positive pair): (cat, dog), (cat, is), (cat, cute), (cat, and), (cat, adorable), (cat, then), (cat, cat), (cat, agree)
[Probability for negative sampling]
$$P_{\text{negative}}(w_i) = \frac{f(w_i)^{3/4}}{\sum_{j=0}^{V} f(w_j)^{3/4}}, \quad \text{where } f(w_i) = \frac{\text{count of } w_i}{\text{total number of words in the corpus}}$$
[EXAMPLE] If $f(\text{is}) = 0.99$ and $f(\text{word2vec}) = 0.01$, then $P_{\text{negative}}(\text{is}) = 0.97$ and $P_{\text{negative}}(\text{word2vec}) = 0.03$: raising frequencies to the $3/4$ power slightly suppresses very frequent words and boosts rare ones.
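A minimal numpy sketch of drawing negatives from this distribution (the four-word vocabulary and its frequencies are assumptions for illustration):

import numpy as np

# Toy relative frequencies f(w) for a 4-word vocabulary.
vocab = np.array(["is", "the", "cat", "word2vec"])
freq = np.array([0.40, 0.35, 0.20, 0.05])

# Raise to the 3/4 power and renormalize (Mikolov et al., 2013b).
p_neg = freq ** 0.75
p_neg /= p_neg.sum()
print(dict(zip(vocab, p_neg.round(3))))

# Draw K = 2 negative words for one (center, context) training pair.
print(np.random.choice(vocab, size=2, replace=False, p=p_neg))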
Word Embedding
• word2vec
1. Skip-gram with negative sampling
-From unsupervised learning to supervised learning: a binary classification problem with the sigmoid function
-Let's denote the label $t$ of a pair of center word $c$ and word $w$:
  $t = 1$ if $(c, w)$ is a positive sample
  $t = 0$ if $(c, w)$ is drawn from the negative corpus
-Then $P(t = 1 \mid c, w) = \sigma(u_w^{\top} v_c)$, and the new objective function (whose gradients give the update rules) is
$$J = -\log \sigma(u_o^{\top} v_c) - \sum_{k=1}^{K} \log \sigma(-u_k^{\top} v_c),$$
where $u_o$ is the output vector of the positive context word and $u_1, \ldots, u_K$ are those of the sampled negatives
Word Embedding
• word2vec
2. Hierarchical softmax
• Hierarchical softmax uses a Huffman binary tree to represent all the words in the vocabulary
-Computational cost becomes $O(\log V)$ instead of $O(V)$
[Binary tree for hierarchical softmax: each internal node splits the probability between its two children, e.g. 0.7/0.3 at the root and 0.4/0.6 at the next node]
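For illustration with the branch probabilities shown in the tree above: a leaf reached via the 0.7 branch and then the 0.4 branch has probability $P(w) = 0.7 \times 0.4 = 0.28$. In general, a word's probability is the product of the branch probabilities along its root-to-leaf path, so only about $\log_2 V$ binary decisions are needed instead of a $V$-way softmax.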
Limitation
• Limitations of word embedding
-Words with multiple meanings (homonyms) cannot be distinguished
[EXAMPLE] The Korean word "배" can mean a pear, a belly/stomach, or a boat:
"배"는 수분이 많은 과일이다. ("배" is a juicy fruit. → pear)
"배" 고프다. (My "배" is empty, i.e., I am hungry. → stomach)
"배" 나온다. (My "배" sticks out. → belly)
"배"가 불렀다. (My "배" is full. → stomach)
"배" 멀미가 난다. (I get "배"-sick. → boat, as in seasick)
"배"를 바다에 띄웠다. (I floated the "배" on the sea. → boat)
[2D visualization of a BERT embedding: the occurrences of "배" are separated according to their contexts]
[2D visualization of a static word embedding: every occurrence of "배" collapses to the single point "배"]
-Only the surrounding n-1 words are used for learning
word embedding → document embedding
Next seminar
• Embedding models in NLP
[Natural Language Processing] → [Embedding] → [Word Embedding] / [Document Embedding]
Word Embedding:
Latent Semantic Analysis (1998)
NNLM (2003)
word2vec (2013)
GloVe (2014)
FastText (2017)
…
Document Embedding:
Doc2vec (2014)
Transformer Network (2017)
ELMo (2018)
BERT (2018)
GPT-1, 2, 3 (2018-2020)
XLNet (2019)