NTC TensorFlow Deep Learning Quick-Start Class, Part 4: Natural Language

TensorFlow Deep Learning Quick-Start Class, 4. Natural Language Processing Applications, by Mark Chang

Upload: notch-training-center

Post on 23-Jan-2018


Category:

Technology



TRANSCRIPT

Page 1: NTC TensorFlow Deep Learning Quick-Start Class, Part 4: Natural Language

TensorFlow Deep Learning Quick-Start Class

4. Natural Language Processing Applications

By Mark Chang

Page 2:

•  Introduction to natural language processing
•  The Word2vec neural network
•  Semantic operations in practice

Page 3:

Introduction to natural language processing

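Page 4:

Natural language processing

•  Natural language processing is a branch of artificial intelligence and linguistics.
   – It studies how to process and make use of natural language.
•  Natural language understanding systems
   – Convert natural language into a form that computers can process easily.
•  Natural language generation systems
   – Convert data from computer programs into natural language.
•  https://zh.wikipedia.org/wiki/%E8%87%AA%E7%84%B6%E8%AF%AD%E8%A8%80%E5%A4%84%E7%90%86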

Page 5:

Semantic understanding

https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf

Page 6:

Machine translation

http://arxiv.org/abs/1409.0473

Page 7:

Poetry generation

http://emnlp2014.org/papers/pdf/EMNLP2014074.pdf

Page 8:

Image caption generation

http://arxiv.org/pdf/1411.4555v2.pdf

Page 9:

Visual question answering

http://arxiv.org/pdf/1505.00468v6.pdf

Page 10:

The Word2vec neural network

Page 11:

Word meaning

•  The meaning of a word can be inferred from the contexts it appears in.

The dog run. A cat run. A dog sleep. The cat sleep. A dog bark. The cat meows.

dog and cat appear in similar contexts, so their meanings are similar.

Page 12:

Semantic vectors

The dog run. A cat run. A dog sleep. The cat sleep. A dog bark. The cat meows.

       the   a   run   sleep   bark   meow
dog     1    2    2      2      1      0
cat     2    1    2      2      0      1

Page 13:

Semantic vectors

dog (1, 2, ..., xn)

cat (2, 1, ..., xn)

car (0, 0, ..., xn)

Page 14:

Semantic vector similarity

•  The cosine similarity of A and B is:

$$\cos(A, B) = \frac{A \cdot B}{|A|\,|B|}$$

dog (a1, a2, ..., an)

cat (b1, b2, ..., bn)

The cosine similarity of dog and cat is:

$$\frac{a_1 b_1 + a_2 b_2 + \dots + a_n b_n}{\sqrt{a_1^2 + a_2^2 + \dots + a_n^2}\,\sqrt{b_1^2 + b_2^2 + \dots + b_n^2}}$$
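As a worked instance of the cosine similarity formula, the sketch below applies it to the dog and cat count vectors from the co-occurrence table on Page 12. The function name `cosine_similarity` is an illustrative choice, not something taken from the slides.

```python
import math


def cosine_similarity(a, b):
    """Return (a . b) / (|a| |b|) for two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Counts from the Page 12 table, in the order (the, a, run, sleep, bark, meow).
dog = [1, 2, 2, 2, 1, 0]
cat = [2, 1, 2, 2, 0, 1]

# Dot product is 12 and both norms are sqrt(14), so the result is 12 / 14.
similarity = cosine_similarity(dog, cat)
print(round(similarity, 3))
```

A value close to 1 means the two words occur in nearly the same contexts; a vector of all zeros would need special handling because its norm is 0.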

Page 15:

Adding and subtracting semantic vectors

Woman + King - Man = Queen

[Figure: the points Man, King, Woman and Queen, with the offset King - Man drawn twice: once from Man to King and once from Woman to Queen.]
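The arithmetic Woman + King - Man = Queen can be tried on toy data. The sketch below uses made-up 2-D vectors (every number is hypothetical, chosen so that the King - Man offset equals the Queen - Woman offset) and a hypothetical helper `analogy`; real word2vec vectors only satisfy such relations approximately.

```python
import math


def cosine_similarity(a, b):
    """Return (a . b) / (|a| |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


# Hypothetical 2-D vectors, for illustration only.
vectors = {
    "man": (1.0, 1.0),
    "woman": (1.0, 3.0),
    "king": (4.0, 1.0),
    "queen": (4.0, 3.0),
    "car": (0.5, -2.0),
}


def analogy(a, b, c):
    """Return the word closest to vec(a) + vec(b) - vec(c), excluding a, b and c."""
    target = tuple(x + y - z for x, y, z in zip(vectors[a], vectors[b], vectors[c]))
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine_similarity(vectors[w], target))


result = analogy("woman", "king", "man")
print(result)
```

Excluding the three input words from the candidates is a common convention, since the nearest vector to the target is otherwise often one of the inputs.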

Page 16:

Semantic vectors have too many dimensions

dog (x1 = the, x2 = a, ..., xn)

The dimension of a semantic vector equals the total vocabulary size.

Page 17:

Word2vec neural network

dog → one-hot encoding (1, 0, 0, 0) → word2vec neural network → compressed semantic vector (1.2, 0.7, 0.5)

Page 18:

One-Hot Encoding

[Figure: one-hot encoding over the vocabulary (dog, cat, run, fly), with a single entry set to 1.]
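One-hot encoding over the four-word vocabulary can be sketched as follows. The helper name `one_hot` is hypothetical, and the word order (dog, cat, run, fly) follows the order used on the slides.

```python
vocabulary = ["dog", "cat", "run", "fly"]


def one_hot(word, vocabulary):
    """Return a list with 1 at the word's position and 0 everywhere else."""
    index = vocabulary.index(word)
    return [1 if i == index else 0 for i in range(len(vocabulary))]


dog_vector = one_hot("dog", vocabulary)
run_vector = one_hot("run", vocabulary)
print(dog_vector, run_vector)
```

The vector length always equals the vocabulary size, which is why one-hot vectors become impractical for large vocabularies.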

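Page 19:

Initialize Weights

The rows of both matrices correspond to the words dog, cat, run, fly.

$$W = \begin{bmatrix} w_{11} & w_{12} & w_{13} \\ w_{21} & w_{22} & w_{23} \\ w_{31} & w_{32} & w_{33} \\ w_{41} & w_{42} & w_{43} \end{bmatrix}, \qquad V = \begin{bmatrix} v_{11} & v_{12} & v_{13} \\ v_{21} & v_{22} & v_{23} \\ v_{31} & v_{32} & v_{33} \\ v_{41} & v_{42} & v_{43} \end{bmatrix}$$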

Page 20:

Compressing the semantic vector

[Figure: dog is mapped from its high-dimensional vector to the low-dimensional vector (v11, v12, v13).]
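Why a one-hot input yields (v11, v12, v13) for dog: multiplying a one-hot row vector by the matrix V simply picks out one row of V. The sketch below shows this with a hypothetical 4 x 3 matrix; only the first row reuses the example values (1.2, 0.7, 0.5) from Page 17, and the other numbers are invented.

```python
# Hypothetical 4 x 3 matrix V, one row per word in the order dog, cat, run, fly.
V = [
    [1.2, 0.7, 0.5],   # dog (example values from Page 17)
    [1.1, 0.8, 0.4],   # cat (made up)
    [-0.3, 0.9, 0.2],  # run (made up)
    [0.6, -0.5, 0.1],  # fly (made up)
]


def compress(one_hot_vector, matrix):
    """Row vector times matrix: result[j] = sum over i of x[i] * matrix[i][j]."""
    columns = len(matrix[0])
    return [sum(x * row[j] for x, row in zip(one_hot_vector, matrix)) for j in range(columns)]


dog_one_hot = [1, 0, 0, 0]
dog_compressed = compress(dog_one_hot, V)
print(dog_compressed)
```

Because only one input entry is non-zero, practical implementations skip the multiplication and index the row directly.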

Page 21:

Compressed Vectors

[Figure: the network over the words dog, cat, run, fly, with these vectors labelled:]

dog: (v11, v12, v13)
cat: (v21, v22, v23)
run: (w31, w32, w33)
fly: (w41, w42, w43)

Page 22:

Context Word

[Figure: the input dog (one-hot entry 1) selects V1 = (v11, v12, v13); the output run corresponds to W3 = (w31, w32, w33).]

$$V_1 \cdot W_3 = v_{11}w_{31} + v_{12}w_{32} + v_{13}w_{33}$$

$$\frac{1}{1 + e^{-V_1 \cdot W_3}} \approx 1$$

Page 23:

Context Word

[Figure: the input cat selects V2 = (v21, v22, v23); the output run corresponds to W3 = (w31, w32, w33).]

$$V_2 \cdot W_3 = v_{21}w_{31} + v_{22}w_{32} + v_{23}w_{33}$$

$$\frac{1}{1 + e^{-V_2 \cdot W_3}} \approx 1$$

Page 24:

Non-context Word

[Figure: the input dog (one-hot entry 1) selects V1 = (v11, v12, v13); the output fly corresponds to W4 = (w41, w42, w43).]

$$V_1 \cdot W_4 = v_{11}w_{41} + v_{12}w_{42} + v_{13}w_{43}$$

$$\frac{1}{1 + e^{-V_1 \cdot W_4}} \approx 0$$

Page 25:

Non-context Word

[Figure: the input cat (one-hot entry 1) selects V2 = (v21, v22, v23); the output fly corresponds to W4 = (w41, w42, w43).]

$$V_2 \cdot W_4 = v_{21}w_{41} + v_{22}w_{42} + v_{23}w_{43}$$

$$\frac{1}{1 + e^{-V_2 \cdot W_4}} \approx 0$$
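The four cases above (dog and cat against the context word run and the non-context word fly) can be checked numerically. In the sketch below, all vector values are hypothetical stand-ins for trained weights, chosen so that the dot products with run are large and positive and those with fly are large and negative.

```python
import math


def sigmoid(x):
    """Logistic function 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# Hypothetical "trained" rows of V (inputs) and W (outputs).
V = {"dog": [2.0, 1.0, 0.5], "cat": [1.8, 1.2, 0.4]}
W = {"run": [2.0, 1.5, 1.0], "fly": [-2.0, -1.5, -1.0]}

for word in ("dog", "cat"):
    p_run = sigmoid(dot(V[word], W["run"]))  # context word: close to 1
    p_fly = sigmoid(dot(V[word], W["fly"]))  # non-context word: close to 0
    print(word, round(p_run, 4), round(p_fly, 4))
```

Since dog and cat must both score high against run and low against fly, training pushes their V rows toward each other, which is how similar contexts lead to similar compressed vectors.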

Page 26:

Result

[Figure: the trained network over the words dog, cat, run, fly, with these vectors labelled:]

dog: (v11, v12, v13)
cat: (v21, v22, v23)
run: (w31, w32, w33)
fly: (w41, w42, w43)

Page 27:

Semantic operations in practice

Page 28:

Semantic operations in practice

https://github.com/ckmarkoh/ntc_deeplearning_tensorflow/blob/master/sec4/semantics.ipynb

Page 29:

Training data

anarchism originated as a term of abuse first used against early working class radicals including the diggers of the english revolution and the sans culottes of the french revolution whilst the term is still used in a pejorative way to describe any act that used violent means to destroy the organization of society it has also been taken up as a positive label by self defined anarchists the word anarchism is derived from the greek without archons ruler chief king anarchism as a political philosophy is the belief that rulers are unnecessary and should be abolished although there are differing interpretations of what this means anarchism also refers to related social movements that advocate the elimination of authoritarian institutions particularly the state the word anarchy as most anarchists use it does not imply chaos nihilism or anomie but rather a harmonious anti authoritarian society in place of what

Page 30:

Preprocessing

anarchism originated as a term of abuse first used against early working class radicals including the diggers of the english revolution and the sans culottes of the french revolution whilst the term is still used in a pejorative way to describe any act that used violent means to destroy the organization of society it has also been taken up ...

['anarchism', 'originated', 'as', 'a', 'term', 'of', 'abuse', 'first', 'used', 'against', 'early', 'working', 'class', 'radicals', 'including', 'the', 'diggers', 'of', 'the', 'english', 'revolution', 'and', 'the', 'sans', 'culottes', 'of', 'the', 'french', 'revolution', 'whilst', 'the', 'term', 'is', 'still', 'used', 'in', 'a', 'pejorative', 'way', 'to', 'describe', 'any', 'act', 'that', 'used', 'violent', 'means', 'to', 'destroy', 'the', ... ]

Page 31:

Preprocessing

Build the dictionary from word frequencies:

# dictionary size
vocabulary_size = 50000

{"UNK": 0, "the": 1, "of": 2, "and": 3, "one": 4, "in": 5, "a": 6, "to": 7, "zero": 8, "nine": 9, ... }

Replace words that are not in the dictionary with UNK:

'the', 'english', 'revolution', 'and', 'the', 'sans', 'culottes', 'of', 'the', 'french', 'revolution', ...

'the', 'english', 'revolution', 'and', 'the', 'sans', UNK, 'of', 'the', 'french', 'revolution', ...

Convert the words to their codes in the dictionary:

1, 103, 855, 3, 1, 15068, 0, 2, 1, 151, 855, ...
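The three preprocessing steps on this slide (frequency-ranked dictionary, UNK for out-of-dictionary words, words converted to codes) can be sketched in a few lines. The function name `build_dataset` and the tiny `vocabulary_size` of 3 are assumptions made for this illustration; the codes it produces therefore differ from the ones on the slide, which come from the full corpus with `vocabulary_size = 50000`.

```python
from collections import Counter


def build_dataset(words, vocabulary_size):
    """Give the most frequent words codes 1, 2, ... and map every other word to UNK (0)."""
    dictionary = {"UNK": 0}
    # Reserve one slot for UNK, then rank the remaining words by frequency.
    for word, _count in Counter(words).most_common(vocabulary_size - 1):
        dictionary[word] = len(dictionary)
    data = [dictionary.get(word, 0) for word in words]
    return data, dictionary


words = "the english revolution and the sans culottes of the french revolution".split()
data, dictionary = build_dataset(words, vocabulary_size=3)
print(dictionary)
print(data)
```

Keeping code 0 for UNK means any unseen word can still be fed to the network, at the cost of all rare words sharing one vector.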

Page 32:

Preprocessing

5239, 3084, 12, 6, 195, 2, 3137, 46, 59, 156, 128, 742, 477, 10572, 134, 1, 27549, 2, 1, 103, 855, 3, 1, 15068, 0, 2, 1, 151, 855, ...

[Figure: each (input, output) pair is fed to word2vec, for example input 3084 with output 5239.]

input   output
3084    5239
3084    12
12      3084
12      6
6       12
6       195
195     6
195     2

Page 33:

Preprocessing

5239, 3084, 12, 6, 195, 2, 3137, 46, 59, 156, 128, 742, 477, 10572, 134, 1, 27549, 2, 1, 103, 855, 3, 1, 15068, 0, 2, 1, 151, 855, ...

generate_batch(batch_size=8, num_skips=2, skip_window=1)

•  batch_size: the number of (input, output) pairs in a batch.
•  num_skips: the number of outputs generated for each input word.
•  skip_window: the number of words considered on each side of the input word.

input    3084   3084   12     12   6    6     195   195
output   5239   12     3084   6    12   195   6     2
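The input/output pairs shown for `skip_window=1` can be reproduced with a simplified, deterministic pairing routine. The function `skipgram_pairs` below is a hypothetical stand-in: the notebook's own `generate_batch` is not shown on the slide and may differ (for example by sampling context words at random), and this sketch always takes every neighbour, which corresponds to `num_skips = 2 * skip_window`.

```python
def skipgram_pairs(data, skip_window=1):
    """Pair each word with every neighbour at most skip_window positions away."""
    pairs = []
    # Only words with a full window on both sides are used as inputs.
    for center in range(skip_window, len(data) - skip_window):
        for offset in range(-skip_window, skip_window + 1):
            if offset != 0:
                pairs.append((data[center], data[center + offset]))
    return pairs


data = [5239, 3084, 12, 6, 195, 2]  # first codes of the sequence on the slide
pairs = skipgram_pairs(data, skip_window=1)

batch_size = 8
batch_inputs = [inp for inp, _out in pairs[:batch_size]]
batch_labels = [out for _inp, out in pairs[:batch_size]]
print(batch_inputs)
print(batch_labels)
```

With these six codes the routine yields exactly eight pairs, matching the input and output rows listed for the batch.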

Page 34:

Computational Graph

import math
import tensorflow as tf

# batch_size, embedding_size, num_sampled and vocabulary_size are assumed to be defined earlier in the notebook.
train_inputs = tf.placeholder(tf.int32, shape=[batch_size])
train_labels = tf.placeholder(tf.int32, shape=[batch_size, 1])

with tf.device('/cpu:0'):
    embeddings = tf.Variable(
        tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
    embed = tf.nn.embedding_lookup(embeddings, train_inputs)
    nce_weights = tf.Variable(
        tf.truncated_normal([vocabulary_size, embedding_size],
                            stddev=1.0 / math.sqrt(embedding_size)))
    nce_biases = tf.Variable(tf.zeros([vocabulary_size]))

# Keyword arguments avoid relying on the positional order of labels and inputs, which differs between TensorFlow releases.
loss = tf.reduce_mean(
    tf.nn.nce_loss(weights=nce_weights, biases=nce_biases,
                   labels=train_labels, inputs=embed,
                   num_sampled=num_sampled, num_classes=vocabulary_size))

optimizer = tf.train.GradientDescentOptimizer(1.0).minimize(loss)

Page 35:

Device

with tf.device('/cpu:0'):

Runs the computational graph defined inside this block on the CPU.

Because TensorFlow (as of these slides) does not support running embedding_lookup on the GPU, it has to be placed on the CPU.

Page 36:

Inputs & Outputs

train_inputs = tf.placeholder(tf.int32, shape=[batch_size])
train_labels = tf.placeholder(tf.int32, shape=[batch_size, 1])

[Figure: train_inputs and train_labels are the input and output of word2vec.]

train_inputs   train_labels
3084           5239
3084           12
12             3084
12             6
6              12
6              195
195            6
195            2

Page 37:

Embedding Lookup

embeddings = tf.Variable(
    tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
embed = tf.nn.embedding_lookup(embeddings, train_inputs)

[Figure: embedding_lookup uses each value in train_inputs (for example 2) to select the corresponding row of embeddings.]
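What the lookup does can be illustrated without TensorFlow: each id in `train_inputs` selects one row of the embedding matrix. The sketch below is a plain-Python analogue, not the TensorFlow implementation; the function `embedding_lookup` here is a local stand-in, and the sizes and ids are toy values chosen for the example.

```python
import random

random.seed(0)  # fixed seed so the toy matrix is reproducible

vocabulary_size, embedding_size = 5, 3  # toy sizes, not the ones in the notebook

# Analogue of tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0).
embeddings = [
    [random.uniform(-1.0, 1.0) for _ in range(embedding_size)]
    for _ in range(vocabulary_size)
]


def embedding_lookup(params, ids):
    """Return the rows of params selected by ids, in the same order as ids."""
    return [params[i] for i in ids]


train_inputs = [2, 2, 4, 0]
embed = embedding_lookup(embeddings, train_inputs)
print(len(embed), len(embed[0]))
```

The result has one row per input id, so a batch of `batch_size` ids becomes a `batch_size` x `embedding_size` matrix.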

Page 38:

NCE Weights

•  NCE: Noise Contrastive Estimation

nce_weights = tf.Variable(
    tf.truncated_normal([vocabulary_size, embedding_size],
                        stddev=1.0 / math.sqrt(embedding_size)))

nce_biases = tf.Variable(tf.zeros([vocabulary_size]))

[Figure: nce_weights and nce_biases.]

Page 39:

NCE Loss

loss = tf.reduce_mean(
    tf.nn.nce_loss(weights=nce_weights, biases=nce_biases,
                   labels=train_labels, inputs=embed,
                   num_sampled=num_sampled, num_classes=vocabulary_size))

Positive: V2 = (v21, v22, v23) with W3 = (w31, w32, w33)

$$\frac{1}{1 + e^{-V_2 \cdot W_3}} \approx 1$$

Negative: V2 = (v21, v22, v23) with W4 = (w41, w42, w43)

$$\frac{1}{1 + e^{-V_2 \cdot W_4}} \approx 0$$

$$\text{cost} = \log\left(\frac{1}{1 + e^{-v_I^T w_{pos}}}\right) + \sum_{neg} \log\left(1 - \frac{1}{1 + e^{-v_I^T w_{neg}}}\right)$$
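The expression on this slide can be evaluated by hand for one positive and one negative word. The sketch below does so with hypothetical vectors; it covers only the simplified formula shown here, not everything `tf.nn.nce_loss` does (which also samples the negative classes, among other details not modelled). Note that, as written, the expression is at most 0 and approaches 0 as the predictions improve, so a minimizer would work with its negative.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def slide_objective(v_input, w_pos, w_negs):
    """log sigmoid(v . w_pos) + sum over negatives of log(1 - sigmoid(v . w_neg))."""
    value = math.log(sigmoid(dot(v_input, w_pos)))
    for w_neg in w_negs:
        value += math.log(1.0 - sigmoid(dot(v_input, w_neg)))
    return value


# Hypothetical vectors: cat as input, run as context word, fly as non-context word.
v_cat = [1.8, 1.2, 0.4]
w_run = [2.0, 1.5, 1.0]
w_fly = [-2.0, -1.5, -1.0]

good = slide_objective(v_cat, w_run, [w_fly])  # labels agree with the vectors
bad = slide_objective(v_cat, w_fly, [w_run])   # labels swapped
print(round(good, 4), round(bad, 4))
```

The swapped case scores far lower, which is the signal that gradient descent uses to pull context pairs together and push sampled negatives apart.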

Page 40:

Train

feed_dict = {train_inputs: batch_inputs,
             train_labels: batch_labels}
_, loss_val = session.run([optimizer, loss], feed_dict=feed_dict)

[Figure: batch_inputs and batch_labels are fed to the graph, which returns loss_val.]

batch_inputs   batch_labels
3084           5239
3084           12
12             3084
12             6
6              12
6              195
195            6
195            2

Page 41:

Result

final_embeddings

array([[-0.02782757, -0.16879494, -0.06111901, ..., -0.25700757, -0.07137159,  0.0191142 ],
       [-0.00155336, -0.00928817, -0.0535327 , ..., -0.23261793, -0.13980433,  0.18055709],
       [ 0.02576068, -0.06805354, -0.03688766, ..., -0.15378961,  0.00459271,  0.0717089 ],
       ...,
       [ 0.01061165, -0.09820389, -0.09913248, ...,  0.00818674, -0.12992384,  0.05826835],
       [ 0.0849214 , -0.14137401,  0.09674817, ...,  0.04111136, -0.05420518, -0.01920278],
       [ 0.08318492, -0.08202577,  0.11284919, ...,  0.03887166,  0.01556483,  0.12496017]], dtype=float32)

Page 42:

Visualization

Page 43:

Most Similar Words

import numpy as np

def get_most_similar(word, top=10):
    # Words missing from the dictionary fall back to the UNK entry (code 0).
    wid = dictionary.get(word, 0)
    # Dot product of this word's embedding with every embedding.
    scores = np.dot(final_embeddings[wid:wid + 1, :], final_embeddings.T)[0]
    # Indices from highest to lowest score, limited to the first `top`.
    for idx in scores.argsort()[::-1][:top].tolist():
        print(reverse_dictionary[idx])

get_most_similar("one")

one
six
two
four
seven
three
...

Page 44:

Instructor information

Mark Chang

•  Email: ckmarkoh at gmail dot com
•  Blog: http://cpmarkchang.logdown.com
•  Github: https://github.com/ckmarkoh
•  Facebook: https://www.facebook.com/ckmarkoh.chang
•  Slideshare: http://www.slideshare.net/ckmarkohchang
•  Linkedin: https://www.linkedin.com/pub/mark-chang/85/25b/847