Word Vectorization (Embedding) with NNLM

Intern 이현성, SungKyunKwan University, Data Mining Lab


TRANSCRIPT

Page 1:

Word Vectorization (Embedding) with NNLM
Intern 이현성
SungKyunKwan University, Data Mining Lab

Page 2:

Contents
• Brief intro to Keras
• Backgrounds: simple linear algebra
• Model description
• Discussion
• Go further

Page 3:

Brief intro to Keras

Page 4:

What is it?
• Deep learning library (wrapper) for Theano and TensorFlow
• High-level neural network API

Page 5:

Example: Multilayer perceptron
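The MLP code on this slide is not preserved in the transcript. A minimal sketch of the kind of example given in the Keras sequential-model guide linked on page 7 (the layer sizes and random placeholder data are illustrative, not from the slide):

import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.utils import to_categorical

# Two hidden ReLU layers with dropout, softmax over 10 classes.
model = Sequential()
model.add(Dense(64, activation='relu', input_dim=20))
model.add(Dropout(0.5))
model.add(Dense(64, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(10, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='sgd', metrics=['accuracy'])

# Random placeholder data, just to show the training API.
x = np.random.random((1000, 20))
y = to_categorical(np.random.randint(10, size=(1000, 1)), num_classes=10)
model.fit(x, y, epochs=20, batch_size=128)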

Page 6:

Example: Convolutional neural network
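Likewise, the CNN code itself is missing from the transcript; a comparable sketch (the MNIST-like input shape is an assumption):

from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout

# Two convolution layers, pooling, then a small classifier head.
model = Sequential()
model.add(Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)))
model.add(Conv2D(32, (3, 3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))
model.add(Flatten())
model.add(Dense(128, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(10, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])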

Page 7: Word vectorization(embedding) with nnlm

If you want to know deeper• https://keras.io

• https://keras.io/getting-started/sequential-model-guide/#examples

Page 8:

Backgrounds: simple linear algebra

Page 9:

Backgrounds

Page 10:

Backgrounds

Page 11:

Then, what happens to the one-hot encoded vectors?

Page 12:

Page 13:
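The slide images for pages 12 and 13 are not in the transcript. Judging from the question on page 11 and the conclusion below, they showed the standard identity: if e_i is the one-hot vector for word i and C is a |V| x d embedding matrix, then

\[ e_i^{\top} C = C_{i,:} \]

i.e., multiplying a one-hot vector by C simply selects the i-th row of C.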

So we can use C's row vectors as dense vector representations of words (embeddings).
• How do we implement it?
• Does it work well?
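A tiny numpy check of that row-selection identity (the sizes here are arbitrary):

import numpy as np

vocab_size, dim = 5, 3
C = np.arange(vocab_size * dim).reshape(vocab_size, dim)  # embedding matrix

i = 2                                  # index of some word
e = np.zeros(vocab_size)
e[i] = 1.0                             # one-hot vector for word i

print(e @ C)   # [6. 7. 8.]
print(C[i])    # [6 7 8] -- the same row, with no matrix multiply needed

This is why an embedding layer can be implemented as a table lookup rather than an actual one-hot matrix product.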

Page 14:

Model description

Page 15:

Dataset description
• Corpus of Contemporary American English: http://corpus.byu.edu/coca/
• The 1 million most frequent 5-grams in the corpus
• No stemming or lemmatization done
• Approximately 25,000 words

Page 16:

Example of the dataset, preprocessed by me:

W0        W1        W2         W3       W4
Both      men       and        women    reported
i         wanted    something  that     was
the       hospital  when       he       was
to        have      a          baby     that
policies  of        the        clinton  administration

Page 17:

Model architecture
• Goal: similar words should have similar vector representations
• Input: N-gram word list
• Output: list of probabilities that word t is word i

Page 18:

Model description

[Architecture diagram, reconstructed from the slide labels]
• Each input word W0 through W3 is multiplied by the embedding matrix C, giving four vectors CW0 through CW3, each of dimension 30.
• The four vectors are flattened (concatenated) into one vector of dimension 120.
• A neural network layer with ReLU maps this to another vector of dimension 120.
• A softmax layer over the vocabulary produces the prediction W4_hat.
• The difference between W4 and W4_hat (negative log likelihood, i.e., categorical cross-entropy) is used for backpropagation.
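In equation form (a sketch; the weight names H, U, b, d are mine, not from the slides):

\[
\begin{aligned}
x &= [C_{W_0}; C_{W_1}; C_{W_2}; C_{W_3}] \in \mathbb{R}^{120} \\
h &= \mathrm{ReLU}(Hx + b) \in \mathbb{R}^{120} \\
\hat{W}_4 &= \mathrm{softmax}(Uh + d)
\end{aligned}
\]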

Page 19:

How the loss is calculated
• V = { "Miku is so cute again today" }  (example sentence, translated from Korean)
• Vector representation:
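The worked example itself is not preserved, but from the description on page 18 the loss is the categorical cross-entropy between the one-hot target y for W4 and the predicted distribution y_hat:

\[ L = -\sum_i y_i \log \hat{y}_i = -\log \hat{y}_{W_4} \]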

Page 20:

Model description
• Number of samples = 1 million
• Minibatch training, with epoch = 1000
• Number of iterations = 50

Page 21:

Implementation with Keras

Page 22:

Implementation with Keras

Page 23:

Implementation with Keras
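The code on pages 21 to 23 is not preserved in the transcript. A minimal sketch of the page 18 architecture with the Keras Sequential API (the optimizer and anything not fixed by the slides are my assumptions):

from keras.models import Sequential
from keras.layers import Embedding, Flatten, Dense

vocab_size = 25000  # approximate vocabulary size from page 15

model = Sequential()
# C: maps each of the 4 context words to a 30-dim row vector.
model.add(Embedding(input_dim=vocab_size, output_dim=30, input_length=4))
# Concatenate the four 30-dim vectors into one 120-dim vector.
model.add(Flatten())
# Hidden layer of dimension 120, with ReLU.
model.add(Dense(120, activation='relu'))
# Softmax over the vocabulary predicts W4.
model.add(Dense(vocab_size, activation='softmax'))

model.compile(loss='categorical_crossentropy', optimizer='adam')
# X: int array of shape (n_samples, 4) holding word indices for W0..W3;
# y: one-hot array of shape (n_samples, vocab_size) for W4.
# model.fit(X, y, batch_size=..., epochs=...)

With a 25,000-word vocabulary, sparse_categorical_crossentropy with integer targets avoids materializing the huge one-hot label matrix.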

Page 24:

Discussion

Page 25:

Is this vector representation actually a 'vector representation'?
• Do similar vectors have similar (syntactic, semantic) meanings?

Page 26:

Results
• Find similar vectors using the trained feature vectors Ci
• KNN with the Euclidean metric was used

word   1st     2nd       3rd      4th     5th
Look   Looks   Looking   Stared   Peek    glance
Run    Ran     Running   term     Pass    Runs
Talk   Talked  Talking   Story    Bones   Truth
Know   Guess   Thinking  Knowing  Knows   sure
Boy    Girl    Woman     Man      Africa  Doctor
Year   Week    Weeks     Days     Decade  Month
Times  Moment  Day       Nights   Night   Pause
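The slides only state "KNN with Euclidean metric"; a sketch of how that lookup could be done with scikit-learn (model is the Keras model from the sketch above; words and word_index are hypothetical placeholders):

import numpy as np
from sklearn.neighbors import NearestNeighbors

C = model.layers[0].get_weights()[0]   # learned embedding matrix, (vocab_size, 30)
# words: list of vocab_size strings; word_index: index of the query word

knn = NearestNeighbors(n_neighbors=6, metric='euclidean').fit(C)
_, idx = knn.kneighbors(C[word_index].reshape(1, -1))
print([words[j] for j in idx[0][1:]])  # idx[0][0] is the query word itself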

Page 27:

Results
• Cases that did not work well:

word    1st           2nd           3rd           4th         5th
The     Our           United        Your          White       Main
Japan   Russia        Slavery       Terrorism     Britain     Sector
Indian  Competitive   Humanitarian  Regulatory    Canadian    Investigative
New     His           Our           Its           My          your
Your    Our           My            His           White       Their
Gay     Missile       Reproductive  Governmental  Preventive  Same-sex
A       Presidential  San           Foreign       The         domestic

Page 28: Word vectorization(embedding) with nnlm

Discussion• Good syntactic similarity for most words.• Good semantic(meaning) similarity for nouns and verbs• Bad semantic similarity for other words(adjectives, or

etc…)

• I think this is mainly because I skipped • lemmatization(erasing unimportant words such as ‘a’,

‘the’.......)• stemming (hashing words like ‘did’, ‘do’ and ‘done’ into single

‘do’)

Page 29:

Go further (topics for the next presentation)
• Use Skip-gram or CBOW
  • Toward better word-to-vector representations
  • Better efficiency
  • Larger corpus size
• Visualization for word models

Page 30:

Use Skip-gram or CBOW
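The figure for this slide is not preserved. As a pointer, both objectives are available off the shelf in gensim, for instance (gensim 4.x argument names; the toy corpus is illustrative):

from gensim.models import Word2Vec

sentences = [["both", "men", "and", "women", "reported"],
             ["i", "wanted", "something", "that", "was"]]
cbow = Word2Vec(sentences, vector_size=30, window=2, min_count=1, sg=0)      # CBOW
skipgram = Word2Vec(sentences, vector_size=30, window=2, min_count=1, sg=1)  # Skip-gram
print(skipgram.wv.most_similar("men", topn=5))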

Page 31:

Proper visualization for word models
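One common approach, sketched with scikit-learn's t-SNE (C and words as in the KNN snippet above; which points to label is arbitrary):

import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

xy = TSNE(n_components=2).fit_transform(C)   # project 30-dim vectors to 2-D
plt.scatter(xy[:, 0], xy[:, 1], s=2)
for i in range(0, len(words), 1000):         # label a sparse subset of words
    plt.annotate(words[i], xy[i])
plt.show()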

Page 32:

In reality, they don't resemble each other at all... ;;;