[book reading] 機械翻訳 - section 3 no.1

Language Model MT STUDY MEETING 5/21 HIROYUKI FUDABA

Upload: naist-machine-translation-study-group

Post on 15-Apr-2017


TRANSCRIPT

Page 1: [Book Reading] 機械翻訳 - Section 3 No.1

Language Model
MT STUDY MEETING 5/21

HIROYUKI FUDABA

Page 2: [Book Reading] 機械翻訳 - Section 3 No.1

How can you say whether a sentence is natural or not?

$e_1$ = he is big

$e_2$ = is big he

$e_3$ = this is a purple dog

Page 3: [Book Reading] 機械翻訳 - Section 3 No.1

How can you say whether a sentence is natural or not?

$e_1$ = he is big ← correct

$e_2$ = is big he ← grammatically wrong

$e_3$ = this is a purple dog ← semantically wrong

Page 4: [Book Reading] 機械翻訳 - Section 3 No.1

Language model probability

We want to treat "naturalness" statistically.

We represent it with the language model probability $P(e)$.

$P(e = \text{he is big}) = 0.7$

$P(e = \text{is big he}) = 0.3$

$P(e = \text{this is a purple dog}) = 0.5$

Page 5: [Book Reading] 機械翻訳 - Section 3 No.1

Some ways to estimate $P(e)$:

n-gram model

positional language model

factored language model

cache language model

Page 6: [Book Reading] 機械翻訳 - Section 3 No.1

Basis of the n-gram model

We denote a sentence as $\mathbf{e} = e_1^I$, where $I$ is its length.

$e$ = he is big

$e_1 = \text{he}, \quad e_2 = \text{is}, \quad e_3 = \text{big}, \quad I = 3$

We can define $P(e)$ as follows:

$P(e = \text{he is big}) = P(I = 3, e_1 = \text{he}, e_2 = \text{is}, e_3 = \text{big})$

$= P(e_1 = \text{he}, e_2 = \text{is}, e_3 = \text{big}, e_4 = \langle\mathrm{eos}\rangle)$

$= P(e_0 = \langle\mathrm{bos}\rangle, e_1 = \text{he}, e_2 = \text{is}, e_3 = \text{big}, e_4 = \langle\mathrm{eos}\rangle)$

Page 7: [Book Reading] 機械翻訳 - Section 3 No.1

Estimating $P(e)$ in a simple way

If we assume that natural sentences appear more frequently than unnatural ones, a simple way to estimate $P(e)$ is the following:

Bring a large training dataset $E_{\mathrm{train}}$.

Count the frequency of each sentence in $E_{\mathrm{train}}$.

$$P_S(e) = \frac{\mathrm{freq}(e)}{\mathrm{size}(E_{\mathrm{train}})} = \frac{c_{\mathrm{train}}(e)}{\sum_{\tilde{e}} c_{\mathrm{train}}(\tilde{e})}$$

$c_{\mathrm{train}}(e = \text{he is big})$ returns how many training sentences exactly match "he is big".

Page 8: [Book Reading] 機械翻訳 - Section 3 No.1

Problem with the simple estimation

When $E_{\mathrm{train}}$ contains neither sentence $e_1$ nor $e_2$, you cannot say which one is more natural.

$c_{\mathrm{train}}(e_1) = c_{\mathrm{train}}(e_2) = 0$

$$P_S(e_1) = \frac{c_{\mathrm{train}}(e_1)}{\sum_{\tilde{e}} c_{\mathrm{train}}(\tilde{e})} = 0 \qquad P_S(e_2) = \frac{c_{\mathrm{train}}(e_2)}{\sum_{\tilde{e}} c_{\mathrm{train}}(\tilde{e})} = 0$$

You cannot compare them if both values are 0...
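The whole-sentence estimator and its zero-count problem are easy to see in code. Below is a minimal Python sketch; the toy corpus and the names `E_train` and `p_s` are made up for illustration, not from the book:

```python
from collections import Counter

# Toy training data, one sentence per string (made up for illustration).
E_train = [
    "he is big",
    "he is big",
    "she is small",
    "this is a dog",
]

# c_train(e): how many training sentences exactly match e.
c_train = Counter(E_train)

def p_s(sentence: str) -> float:
    """Whole-sentence estimate P_S(e) = c_train(e) / size(E_train)."""
    return c_train[sentence] / len(E_train)

print(p_s("he is big"))    # 0.5
print(p_s("is big he"))    # 0.0 -- every unseen sentence gets 0
print(p_s("he is small"))  # 0.0 -- even a perfectly natural unseen sentence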

Page 9: [Book Reading] 機械翻訳 - Section 3 No.1

Solution to $P(e) = 0$

Rather than treating a sentence as a single whole, let's treat it as data composed of words.

$P(X, Y) = P(X \mid Y)\,P(Y)$

$P(e = \text{he is big}) = P(e_1 = \text{he} \mid e_0 = \langle\mathrm{bos}\rangle)$

$\times P(e_2 = \text{is} \mid e_0 = \langle\mathrm{bos}\rangle, e_1 = \text{he})$

$\times P(e_3 = \text{big} \mid e_0 = \langle\mathrm{bos}\rangle, e_1 = \text{he}, e_2 = \text{is})$

$\times P(e_4 = \langle\mathrm{eos}\rangle \mid e_0 = \langle\mathrm{bos}\rangle, e_1 = \text{he}, e_2 = \text{is}, e_3 = \text{big})$

Page 10: [Book Reading] 機械翻訳 - Section 3 No.1

Solution to $P(e) = 0$

$$P_S(e) = \frac{c_{\mathrm{train}}(e)}{\sum_{\tilde{e}} c_{\mathrm{train}}(\tilde{e})} = P(e_1^I) = \prod_{i=1}^{I+1} P_{\mathrm{ML}}(e_i \mid e_0^{i-1})$$

$$P_{\mathrm{ML}}(e_i \mid e_0^{i-1}) = \frac{c_{\mathrm{train}}(e_0^i)}{c_{\mathrm{train}}(e_0^{i-1})}$$

So far $P(e_1^I)$ is exactly equal to $P_S(e)$, which means it still doesn't work.
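As a sanity check on that last claim, here is a small sketch of the chain-rule decomposition with full-history maximum-likelihood estimates (same made-up toy corpus as above, helper names are mine); the product collapses back to exactly the whole-sentence estimate, so nothing is gained yet:

```python
from collections import Counter

# Toy training data (illustrative, not from the book).
E_train = ["he is big", "he is big", "she is small", "this is a dog"]

# Add sentence boundaries and count every sentence prefix e_0^i.
prefix_counts = Counter()
for s in E_train:
    words = ["<bos>"] + s.split() + ["<eos>"]
    for i in range(1, len(words) + 1):
        prefix_counts[tuple(words[:i])] += 1

def p_ml(history, word):
    """P_ML(e_i | e_0^{i-1}) = c_train(e_0^i) / c_train(e_0^{i-1})."""
    den = prefix_counts[tuple(history)]
    return prefix_counts[tuple(history) + (word,)] / den if den else 0.0

def p_sentence(sentence):
    """Chain-rule product over the full history of each word."""
    words = ["<bos>"] + sentence.split() + ["<eos>"]
    prob = 1.0
    for i in range(1, len(words)):
        prob *= p_ml(words[:i], words[i])
    return prob

print(p_sentence("he is big"))   # 0.5 -- identical to the whole-sentence count 2/4
```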

Page 11: [Book Reading] 機械翻訳 - Section 3 No.1

Idea of the n-gram model

Rather than considering all the words that appear before the word we are looking at, let's consider only the $n-1$ words that appear immediately before it.

Instead of considering all the words...

[Figure: the sequence <bos> he is big <eos>, with each word depending on all of the preceding words]

Page 12: [Book Reading] 機械翻訳 - Section 3 No.1

Idea of the n-gram model

Rather than considering all the words that appear before the word we are looking at, let's consider only the $n-1$ words that appear immediately before it.

Consider only the $n-1$ preceding words:

[Figure: the sequence <bos> he is big <eos>, with each word depending only on the $n-1$ words immediately before it]

Page 13: [Book Reading] 機械翻訳 - Section 3 No.1

The n-gram model, precisely

From the previous expression

$$P(e_1^I) = \prod_{i=1}^{I+1} P_{\mathrm{ML}}(e_i \mid e_0^{i-1})$$

we can approximate $P(e)$ as follows:

$$P(e_1^I) \approx \prod_{i=1}^{I+1} P_{\mathrm{ML}}(e_i \mid e_{i-n+1}^{i-1})$$

Page 14: [Book Reading] 機械翻訳 - Section 3 No.1

How does this help?

$P(e = \text{he is big}) \approx P(e_i = \text{he} \mid e_{i-1} = \langle\mathrm{bos}\rangle)$

$\times P(e_i = \text{is} \mid e_{i-1} = \text{he})$

$\times P(e_i = \text{big} \mid e_{i-1} = \text{is})$

$\times P(e_i = \langle\mathrm{eos}\rangle \mid e_{i-1} = \text{big})$

Intuitively, a sub-sequence appears at least as often as any sequence that contains it, so $P(e)$ estimated with an n-gram model is less likely to be 0.
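A bigram ($n = 2$) version of the earlier sketch shows how the approximation helps; the corpus and helper names are again illustrative:

```python
from collections import Counter

E_train = ["he is big", "he is big", "she is small", "this is a dog"]

# Collect bigram counts and context counts.
bigram_counts = Counter()
context_counts = Counter()
for s in E_train:
    words = ["<bos>"] + s.split() + ["<eos>"]
    for prev, cur in zip(words, words[1:]):
        bigram_counts[(prev, cur)] += 1
        context_counts[prev] += 1

def p_ml_bigram(prev: str, cur: str) -> float:
    """P_ML(e_i | e_{i-1}) = c_train(e_{i-1}, e_i) / c_train(e_{i-1})."""
    den = context_counts[prev]
    return bigram_counts[(prev, cur)] / den if den else 0.0

def p_sentence(sentence: str) -> float:
    words = ["<bos>"] + sentence.split() + ["<eos>"]
    prob = 1.0
    for prev, cur in zip(words, words[1:]):
        prob *= p_ml_bigram(prev, cur)
    return prob

print(p_sentence("she is big"))   # 0.125
```

"she is big" never occurs as a whole sentence in the toy data, but every bigram in it does, so the model assigns it a non-zero probability.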

Page 15: [Book Reading] 機械翻訳 - Section 3 No.1

Smoothing the n-gram model

The n-gram model is less likely to estimate $P(e) = 0$,

but it still has a possibility of estimating 0.

→ Smoothing

Page 16: [Book Reading] 機械翻訳 - Section 3 No.1

Idea of smoothing

Combine the probability of the n-gram with that of the (n-1)-gram.

Even if the probability of word $w$ cannot be estimated with the n-gram, there is a chance it can be estimated with the (n-1)-gram.

$P_{\text{3-gram}}(\text{small} \mid \text{he is}) = 0$

$P_{\text{2-gram}}(\text{small} \mid \text{is}) = 0.03$

[Bar chart comparing 3-gram and 2-gram probabilities for P(he|<bos>), P(is|<bos> he), P(big|he is), P(small|he is), P(<eos>|is big); y-axis: probability, 0 to 0.25]

Page 17: [Book Reading] 機械翻訳 - Section 3 No.1

Linear interpolation

The easiest, most basic way to express this idea:

$$P(e_i \mid e_{i-n+1}^{i-1}) = (1 - a)\,P_{\mathrm{ML}}(e_i \mid e_{i-n+1}^{i-1}) + a\,P_{\mathrm{ML}}(e_i \mid e_{i-n+2}^{i-1}), \qquad 0 \le a \le 1$$

The problem is adjusting $a$ to a good value.

So how can we do that?
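A minimal sketch of linear interpolation between a bigram and a unigram ML estimate, assuming the same toy corpus as before; the fixed weight `a = 0.3` is arbitrary:

```python
from collections import Counter

E_train = ["he is big", "he is big", "she is small", "this is a dog"]

bigram, context, unigram = Counter(), Counter(), Counter()
for s in E_train:
    w = ["<bos>"] + s.split() + ["<eos>"]
    unigram.update(w[1:])                     # token counts (excluding <bos>)
    for prev, cur in zip(w, w[1:]):
        bigram[(prev, cur)] += 1
        context[prev] += 1

def p_ml_bigram(prev: str, cur: str) -> float:
    return bigram[(prev, cur)] / context[prev] if context[prev] else 0.0

def p_ml_unigram(cur: str) -> float:
    return unigram[cur] / sum(unigram.values())

def p_interp(prev: str, cur: str, a: float = 0.3) -> float:
    """(1 - a) * P_ML(e_i | e_{i-1}) + a * P_ML(e_i), with 0 <= a <= 1."""
    return (1 - a) * p_ml_bigram(prev, cur) + a * p_ml_unigram(cur)

print(p_ml_bigram("he", "small"))   # 0.0  -- the bigram "he small" is unseen
print(p_interp("he", "small"))      # > 0  -- the unigram term rescues it
```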

Page 18: [Book Reading] 機械翻訳 - Section 3 No.1

Adjusting $a$ to a good value

An easy way to achieve this is the following:

Bring a dataset that is different from the training data (a held-out set).

Select the $a$ that gives the highest likelihood on that dataset.

Performance can be improved further by considering each context separately.
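One possible way to implement this held-out tuning is a simple grid search over $a$; the held-out sentences and the grid below are made up for the example:

```python
import math
from collections import Counter

E_train = ["he is big", "he is big", "she is small", "this is a dog"]
E_dev   = ["he is small", "she is big"]   # held-out data, disjoint from E_train

bigram, context, unigram = Counter(), Counter(), Counter()
for s in E_train:
    w = ["<bos>"] + s.split() + ["<eos>"]
    unigram.update(w[1:])
    for prev, cur in zip(w, w[1:]):
        bigram[(prev, cur)] += 1
        context[prev] += 1

def p_interp(prev, cur, a):
    p_bi = bigram[(prev, cur)] / context[prev] if context[prev] else 0.0
    p_uni = unigram[cur] / sum(unigram.values())
    return (1 - a) * p_bi + a * p_uni

def heldout_log_likelihood(a):
    total = 0.0
    for s in E_dev:
        w = ["<bos>"] + s.split() + ["<eos>"]
        total += sum(math.log(p_interp(p, c, a)) for p, c in zip(w, w[1:]))
    return total

# Pick the a on a small grid that maximizes held-out log-likelihood.
best_a = max((i / 10 for i in range(1, 10)), key=heldout_log_likelihood)
print(best_a)   # 0.1 here: every dev bigram was already seen in training
```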

Page 19: [Book Reading] 機械翻訳 - Section 3 No.1

Witten-Bell smoothing

How should we choose $a$ if the counts look like the following?

Context "President was"        Context "President Ronald"
  elected     5                  Reagan       38
  the         3                  Caza          1
  in          3                  Venetiaan     1
  First       3
  ...
  52 types, total 110            3 types, total 40

Page 20: [Book Reading] 機械翻訳 - Section 3 No.1

Witten-Bell smoothing

It is likely that a previously unseen word follows the context "President was".

$a$ should be large, so that the (n-1)-gram is emphasized more.

$$P(e_i \mid e_{i-n+1}^{i-1}) = (1 - a)\,P_{\mathrm{ML}}(e_i \mid e_{i-n+1}^{i-1}) + a\,P_{\mathrm{ML}}(e_i \mid e_{i-n+2}^{i-1})$$

(Counts as on Page 19: "President was" is followed by 52 types, 110 tokens in total.)

Page 21: [Book Reading] 機械翻訳 - Section 3 No.1

Witten-Bell smoothing

It is unlikely that a previously unseen word follows the context "President Ronald".

$a$ should be small, so that the n-gram is emphasized more.

$$P(e_i \mid e_{i-n+1}^{i-1}) = (1 - a)\,P_{\mathrm{ML}}(e_i \mid e_{i-n+1}^{i-1}) + a\,P_{\mathrm{ML}}(e_i \mid e_{i-n+2}^{i-1})$$

(Counts as on Page 19: almost everything that follows "President Ronald" is "Reagan".)

Page 22: [Book Reading] 機械翻訳 - Section 3 No.1

Idea of Witten-Bell smoothing

If you only have a single coefficient $a$ to adjust, you cannot take the context of each word into account.

→ Why not use a different $a$ for each context?

Page 23: [Book Reading] 機械翻訳 - Section 3 No.1

Witten-Bell smoothing, precisely

Simple smoothing:

$$P(e_i \mid e_{i-n+1}^{i-1}) = (1 - a)\,P_{\mathrm{ML}}(e_i \mid e_{i-n+1}^{i-1}) + a\,P_{\mathrm{ML}}(e_i \mid e_{i-n+2}^{i-1})$$

Witten-Bell smoothing:

$$P_{\mathrm{WB}}(e_i \mid e_{i-n+1}^{i-1}) = \bigl(1 - a_{e_{i-n+1}^{i-1}}\bigr)\,P_{\mathrm{ML}}(e_i \mid e_{i-n+1}^{i-1}) + a_{e_{i-n+1}^{i-1}}\,P_{\mathrm{ML}}(e_i \mid e_{i-n+2}^{i-1})$$

$$a_{e_{i-n+1}^{i-1}} = \frac{u(e_{i-n+1}^{i-1}, *)}{u(e_{i-n+1}^{i-1}, *) + c(e_{i-n+1}^{i-1})}$$

Page 24: [Book Reading] 機械翻訳 - Section 3 No.1

Witten-Bell smoothing, precisely

$$a_{e_{i-n+1}^{i-1}} = \frac{u(e_{i-n+1}^{i-1}, *)}{u(e_{i-n+1}^{i-1}, *) + c(e_{i-n+1}^{i-1})}$$

$u(e_{i-n+1}^{i-1}, *)$ is the number of distinct word types that appear after the context $e_{i-n+1}^{i-1}$.

$u(\text{President was}, *) = 52$

$u(\text{President Ronald}, *) = 3$

(Counts as on Page 19: "President was" is followed by 52 types, 110 tokens in total; "President Ronald" by 3 types, 40 tokens in total.)

Page 25: [Book Reading] 機械翻訳 - Section 3 No.1

Witten-Bell smoothing, precisely

$$a_{e_{i-n+1}^{i-1}} = \frac{u(e_{i-n+1}^{i-1}, *)}{u(e_{i-n+1}^{i-1}, *) + c(e_{i-n+1}^{i-1})}$$

$a_{\text{President was}} = \dfrac{52}{110 + 52} \approx 0.32$

$a_{\text{President Ronald}} = \dfrac{3}{40 + 3} \approx 0.07$

(Counts as on Page 19.)
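The coefficient itself is a one-line computation. A small sketch using the totals from the slide's table (the function name is mine):

```python
def witten_bell_alpha(u: int, c: int) -> float:
    """a_context = u(context, *) / (u(context, *) + c(context)),
    where u is the number of distinct word types seen after the context
    and c is the total number of tokens seen after it."""
    return u / (u + c)

# Totals from the table on Page 19:
print(round(witten_bell_alpha(52, 110), 2))  # 0.32 -- "President was": back off more
print(round(witten_bell_alpha(3, 40), 2))    # 0.07 -- "President Ronald": trust the higher-order n-gram
```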

Page 26: [Book Reading] 機械翻訳 - Section 3 No.1

Absolute discounting

Yet another smoothing method.

Unlike Witten-Bell smoothing, which uses $P_{\mathrm{ML}}$, it subtracts a constant value $d$ from the frequency of each word when estimating the probability:

$$P_d(e_i \mid e_0^{i-1}) = \frac{\max\bigl(c_{\mathrm{train}}(e_0^i) - d,\ 0\bigr)}{c_{\mathrm{train}}(e_0^{i-1})}$$

Page 27: [Book Reading] 機械翻訳 - Section 3 No.1

Absolute discounting

So why do we subtract?

We want to treat low-frequency words like unknown words, because low-frequency counts cannot really be trusted.

By doing this, the (n-1)-gram gets emphasized more.

Page 28: [Book Reading] 機械翻訳 - Section 3 No.1

Absolute discounting

$$P_d(e_i \mid e_{i-n+1}^{i-1}) = \frac{\max\bigl(c_{\mathrm{train}}(e_{i-n+1}^i) - d,\ 0\bigr)}{c_{\mathrm{train}}(e_{i-n+1}^{i-1})}$$

$P_d(e_i = \text{Reagan} \mid e_{i-2}^{i-1} = \text{President Ronald}) = \dfrac{38 - 0.5}{40} = 0.9375$

$P_d(e_i = \text{Caza} \mid e_{i-2}^{i-1} = \text{President Ronald}) = \dfrac{1 - 0.5}{40} = 0.0125$

$P_d(e_i = \text{Venetiaan} \mid e_{i-2}^{i-1} = \text{President Ronald}) = \dfrac{1 - 0.5}{40} = 0.0125$

(Counts as on Page 19; here $d = 0.5$.)
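A small sketch of the discounted estimate using the slide's counts after "President Ronald" and $d = 0.5$ (the function name is mine):

```python
def p_discount(count: int, context_total: int, d: float = 0.5) -> float:
    """P_d(e_i | context) = max(c(context, e_i) - d, 0) / c(context)."""
    return max(count - d, 0.0) / context_total

# Counts after "President Ronald" (Page 19 table): Reagan 38, Caza 1, Venetiaan 1; total 40.
print(p_discount(38, 40))   # 0.9375
print(p_discount(1, 40))    # 0.0125  (Caza)
print(p_discount(1, 40))    # 0.0125  (Venetiaan)
```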

Page 29: [Book Reading] 機械翻訳 - Section 3 No.1

Absolute discounting

$P_d(e_i = \text{Reagan} \mid e_{i-2}^{i-1} = \text{President Ronald}) = 0.9375$

$P_d(e_i = \text{Caza} \mid e_{i-2}^{i-1} = \text{President Ronald}) = 0.0125$

$P_d(e_i = \text{Venetiaan} \mid e_{i-2}^{i-1} = \text{President Ronald}) = 0.0125$

$$a_{e_{i-n+1}^{i-1}} = 1 - (0.9375 + 0.0125 + 0.0125) = 0.0375$$

An efficient way to compute this is the following:

$$a_{e_{i-n+1}^{i-1}} = \frac{u(e_{i-n+1}^{i-1}, *) \times d}{c(e_{i-n+1}^{i-1})}$$
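A quick numerical check of the leftover probability mass and of the shortcut formula, again with the slide's counts:

```python
# Discounted probabilities of the three words seen after "President Ronald".
probs = [max(c - 0.5, 0) / 40 for c in (38, 1, 1)]

# Leftover probability mass = what the discounting took away.
print(round(1 - sum(probs), 4))   # 0.0375

# Shortcut from the slide: u(context, *) * d / c(context).
print(3 * 0.5 / 40)               # 0.0375, the same value
```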

Page 30: [Book Reading] 機械翻訳 - Section 3 No.1

Absolute discounting

Now that we do not use the maximum-likelihood estimate, the n-gram probability is estimated as follows:

$$P(e_i \mid e_{i-n+1}^{i-1}) = P_d(e_i \mid e_{i-n+1}^{i-1}) + a_{e_{i-n+1}^{i-1}}\,P(e_i \mid e_{i-n+2}^{i-1})$$

This is quite similar to simple interpolation, but differs in that absolute discounting uses $P_d$:

$$P(e_i \mid e_{i-n+1}^{i-1}) = (1 - a)\,P_{\mathrm{ML}}(e_i \mid e_{i-n+1}^{i-1}) + a\,P_{\mathrm{ML}}(e_i \mid e_{i-n+2}^{i-1})$$

Page 31: [Book Reading] 機械翻訳 - Section 3 No.1

Kneser-Ney smoothing

Achieves excellent performance.

Similar to absolute discounting.

Pays attention to words that only appear in specific contexts.

Page 32: [Book Reading] 機械翻訳 - Section 3 No.1

Kneser-Ney smoothing

The lower-order model is needed only when the count in the higher-order model is small.

Suppose "San Francisco" is common, but "Francisco" appears only after "San".

Both "San" and "Francisco" get a high unigram probability,

but we want to give "Francisco" a low unigram probability!!
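The quantity Kneser-Ney cares about is the number of distinct contexts a word follows, not its raw frequency. Here is a sketch with a made-up toy corpus that mimics the "San Francisco" situation (data and variable names are illustrative):

```python
from collections import Counter

# Made-up toy corpus mimicking the "San Francisco" effect.
E_train = [
    "san francisco is big", "san francisco is far",
    "he went to san francisco", "the san francisco office",
    "the dog is big", "a dog is small", "my dog is big",
]

bigram_types = set()
for s in E_train:
    w = ["<bos>"] + s.split() + ["<eos>"]
    bigram_types.update(zip(w, w[1:]))

freq = Counter(tok for s in E_train for tok in s.split())
# u(*, w): number of distinct words that appear immediately before w.
continuation = Counter(cur for _, cur in bigram_types)

print(freq["francisco"], continuation["francisco"])   # 4 1  -- frequent, but only ever after "san"
print(freq["dog"], continuation["dog"])               # 3 3  -- follows many different words
```

"francisco" is about as frequent as "dog" in this toy data, but it has only one distinct left context, so its Kneser-Ney lower-order probability would be much smaller.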

Page 33: [Book Reading] 機械翻訳 - Section 3 No.1

Kneser-Ney smoothing

The Kneser-Ney lower-order probability is defined as follows:

$$P_{\mathrm{KN}}(e_i \mid e_{i-n+2}^{i-1}) = \frac{\max\bigl(u(*, e_{i-n+2}^{i}) - d,\ 0\bigr)}{u(*, e_{i-n+2}^{i-1}, *)}$$

where $u(*, \cdot)$ counts distinct preceding word types rather than raw frequencies.

Page 34: [Book Reading] 機械翻訳 - Section 3 No.1

Unknown words

Even though smoothing reduces the chance of getting $P(e) = 0$, the possibility of getting 0 still remains.

We may give unknown words a probability as follows:

$$P_{\mathrm{unk}}(e_i) = \frac{1}{V}$$

where $V$ is the vocabulary size.
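A minimal sketch of this uniform fallback; the tiny vocabulary below is just the toy training vocabulary from the earlier examples, while in practice $V$ would be much larger:

```python
# Uniform probability for out-of-vocabulary words: P_unk(e_i) = 1 / V.
vocab = {"he", "is", "big", "she", "small", "this", "a", "dog", "<eos>"}
V = len(vocab)

def p_unk(word: str) -> float:
    return 1.0 / V

print(p_unk("purple"))   # every unknown word gets the same probability 1/V
```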