[book reading] Machine Translation (機械翻訳) - section 3 no.1
TRANSCRIPT
Language Model - MT Study Meeting 5/21
HIROYUKI FUDABA
How can you say whether a sentence is natural or not?
e_1 = he is big
  ↑ correct
e_2 = is big he
  ↑ grammatically wrong
e_3 = this is a purple dog
  ↑ semantically wrong
Language model probability
We want to treat "naturalness" statistically.
We represent this with the language model probability P(e).
P(e = he is big) = 0.7
P(e = is big he) = 0.3
P(e = this is a purple dog) = 0.5
Some ways to estimate P(e)
n-gram model
Positional language model
Factored language model
Cache language model
Basis of the n-gram model
We write a sentence as e = e_1^I, I being its length.
e = he is big
e_1 = he, e_2 = is, e_3 = big, I = 3
We can define P(e) as follows:
P(e = he is big) = P(I = 3, e_1 = he, e_2 = is, e_3 = big)
 = P(e_1 = he, e_2 = is, e_3 = big, e_4 = eos)
 = P(e_0 = bos, e_1 = he, e_2 = is, e_3 = big, e_4 = eos)
Estimating P(e) in a simple way
Assuming that natural sentences appear more frequently than unnatural ones, a simple way to estimate P(e) is the following:
Bring a big training dataset E_train
Count the frequency of each sentence in E_train
P_S(e) = freq(e) / size(E_train) = c_train(e) / Σ_ẽ c_train(ẽ)
c_train(e = he is big) returns how many sentences in E_train exactly match "he is big".
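A minimal sketch of this counting estimator in Python (the toy corpus below is made up for illustration):

```python
from collections import Counter

def sentence_lm(train_sentences):
    """P_S(e) = c_train(e) / size(E_train): count exact sentence matches."""
    counts = Counter(train_sentences)   # c_train(e)
    total = len(train_sentences)        # size(E_train)
    return lambda e: counts[e] / total

# Toy training data, made up for illustration.
E_train = ["he is big", "he is big", "she is small", "he is big"]
P_S = sentence_lm(E_train)
print(P_S("he is big"))  # 0.75
print(P_S("is big he"))  # 0.0 -- every unseen sentence gets zero
```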
Problem with the simple estimation
When E_train contains neither sentence e_1 nor e_2, you cannot say which one is more natural:
c_train(e_1) = c_train(e_2) = 0
P_S(e_1) = c_train(e_1) / Σ_ẽ c_train(ẽ) = 0
P_S(e_2) = c_train(e_2) / Σ_ẽ c_train(ẽ) = 0
You cannot compare them if both values are 0 …
Solution to P(e) = 0
Rather than treating a sentence as a whole, let's treat it as data composed of words, applying the chain rule P(X, Y) = P(X | Y) P(Y):
P(e = he is big) = P(e_1 = he | e_0 = bos)
 * P(e_2 = is | e_0 = bos, e_1 = he)
 * P(e_3 = big | e_0 = bos, e_1 = he, e_2 = is)
 * P(e_4 = eos | e_0 = bos, e_1 = he, e_2 = is, e_3 = big)
Solution to P(e) = 0
Estimate each factor by maximum likelihood:
P_ML(e_i | e_0^{i-1}) = c_train(e_0^i) / c_train(e_0^{i-1})
The product then telescopes back to the sentence counts:
P(e_1^I) = Π_{i=1}^{I+1} P_ML(e_i | e_0^{i-1}) = c_train(e) / Σ_ẽ c_train(ẽ) = P_S(e)
So far P(e_1^I) is exactly equal to P_S(e), which means it still doesn't work.
Idea of the n-gram model
Rather than considering all the words that appear before the current word, let's consider only the n − 1 words immediately before it.
[token diagram: bos he is big eos, attending only to the n − 1 preceding words]
The n-gram model, precisely
From the previous expression
P(e_1^I) = Π_{i=1}^{I+1} P_ML(e_i | e_0^{i-1})
we can approximate P(e) as follows:
P(e_1^I) ≈ Π_{i=1}^{I+1} P_ML(e_i | e_{i-n+1}^{i-1})
How does this help?
P(e = he is big) ≈ P(e_i = he | e_{i-1} = bos)
 * P(e_i = is | e_{i-1} = he)
 * P(e_i = big | e_{i-1} = is)
 * P(e_i = eos | e_{i-1} = big)
Intuitively, a subsequence appears at least as often as any sequence that contains it, so P(e) estimated with an n-gram model is less likely to be 0.
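A small sketch of the counting behind this, with bos/eos padding as above; the two-sentence corpus is a made-up example:

```python
from collections import Counter

def train_ngram(sentences, n=2):
    """Collect c_train counts for n-grams and their (n-1)-gram contexts."""
    ngrams, contexts = Counter(), Counter()
    for s in sentences:
        words = ["<bos>"] * (n - 1) + s.split() + ["<eos>"]
        for i in range(n - 1, len(words)):
            ngrams[tuple(words[i - n + 1 : i + 1])] += 1
            contexts[tuple(words[i - n + 1 : i])] += 1
    return ngrams, contexts

def p_ml(word, context, ngrams, contexts):
    """P_ML(e_i | context) = c_train(context, e_i) / c_train(context)."""
    c = contexts[tuple(context)]
    return ngrams[tuple(context) + (word,)] / c if c else 0.0

ngrams, contexts = train_ngram(["he is big", "he is small"], n=2)
print(p_ml("big", ["is"], ngrams, contexts))  # 0.5
print(p_ml("is", ["he"], ngrams, contexts))   # 1.0
```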
Smoothing the n-gram model
An n-gram model is less likely to estimate P(e) = 0, but it can still assign 0.
→ Smoothing
Idea of smoothing
Combine the probabilities of the n-gram and the (n−1)-gram.
Even if the probability of word w cannot be estimated with the n-gram, there is a chance it can still be estimated with the (n−1)-gram:
P_3-gram(small | he is) = 0
P_2-gram(small | is) = 0.03
[bar chart comparing trigram and bigram probabilities for P(he|<bos>), P(is|<bos> he), P(big|he is), P(small|he is), P(<eos>|is big); y-axis: probability, 0 to 0.25]
Linear interpolation
The easiest, most basic way to express this idea:
P(e_i | e_{i-n+1}^{i-1}) = (1 − a) P_ML(e_i | e_{i-n+1}^{i-1}) + a P_ML(e_i | e_{i-n+2}^{i-1}), with 0 ≤ a ≤ 1
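A one-line sketch of this interpolation, reusing the trigram/bigram values from the slide above; the weight a = 0.2 is an arbitrary choice for illustration:

```python
def interpolate(p_high, p_low, a):
    """(1 - a) * P_ML(w | longer context) + a * P_ML(w | shorter context)."""
    return (1 - a) * p_high + a * p_low

# Trigram gives 0, bigram gives 0.03 (the values from the slide above).
print(interpolate(0.0, 0.03, a=0.2))  # 0.006 -- no longer zero
```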
Adjusting a to a good value is the problem.
So how can we do that?
Adjusting a to a good value
An easy way to achieve this is the following (see the sketch below):
Bring a dataset that is different from the training data
Select the a that gives the highest likelihood to that dataset
We can improve performance further by considering each context
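A sketch of that selection, assuming we have already computed each held-out token's higher-order and lower-order P_ML values (the pairs below are made up); a simple grid search picks the a with the highest held-out log-likelihood:

```python
import math

# Per-token (P_ML(w | long context), P_ML(w | short context)) pairs
# from a held-out set; made-up values for illustration.
heldout = [(0.0, 0.03), (0.2, 0.1), (0.5, 0.3), (0.0, 0.01)]

def loglik(a, pairs):
    """Held-out log-likelihood of the interpolated model."""
    return sum(math.log((1 - a) * ph + a * pl) for ph, pl in pairs)

# Try a grid of values in (0, 1) and keep the best one.
grid = [i / 100 for i in range(1, 100)]
best = max(grid, key=lambda a: loglik(a, heldout))
print(best, loglik(best, heldout))
```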
Witten-Bell smoothing
How should we choose a if the n-gram counts looked like the following?
Context "President was":    elected 5, the 3, in 3, First 3, …  (52 word types, total count 110)
Context "President Ronald": Reagan 38, Caza 1, Venetiaan 1      (3 word types, total count 40)
Witten-Bell smoothing
An unknown word is likely to follow the context "President was", so a should be large, so that the (n−1)-gram is emphasized more:
P(e_i | e_{i-n+1}^{i-1}) = (1 − a) P_ML(e_i | e_{i-n+1}^{i-1}) + a P_ML(e_i | e_{i-n+2}^{i-1})
Witten-Bell smoothing
An unknown word is unlikely to follow the context "President Ronald" (almost everything that follows it is "Reagan"), so a should be small, so that the n-gram is emphasized more.
Idea of Witten-Bell smoothing
If you have only a single coefficient a to adjust, you cannot take the context of each word into account.
→ Why not use a different a for each context?
Witten-Bell smoothing, precisely
Simple smoothing:
P(e_i | e_{i-n+1}^{i-1}) = (1 − a) P_ML(e_i | e_{i-n+1}^{i-1}) + a P_ML(e_i | e_{i-n+2}^{i-1})
Witten-Bell smoothing:
P_WB(e_i | e_{i-n+1}^{i-1}) = (1 − a_{e_{i-n+1}^{i-1}}) P_ML(e_i | e_{i-n+1}^{i-1}) + a_{e_{i-n+1}^{i-1}} P_ML(e_i | e_{i-n+2}^{i-1})
a_{e_{i-n+1}^{i-1}} = u(e_{i-n+1}^{i-1}, *) / (u(e_{i-n+1}^{i-1}, *) + c(e_{i-n+1}^{i-1}))
Witten-Bell smoothing, precisely
u(e_{i-n+1}^{i-1}, *) represents how many distinct word types follow the context e_{i-n+1}^{i-1}:
u(President was, *) = 52
u(President Ronald, *) = 3
Witten-Bell smoothing, precisely
Plugging in the counts from the table:
a_{President was} = 52 / (110 + 52) ≈ 0.32
a_{President Ronald} = 3 / (40 + 3) ≈ 0.07
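A one-function sketch of this context-dependent coefficient, checked against the two contexts in the table (52 types / 110 tokens, and 3 types / 40 tokens):

```python
def witten_bell_a(u_types, c_tokens):
    """a_context = u(context, *) / (u(context, *) + c(context)): the more
    distinct continuations a context has, the more weight goes to the
    (n-1)-gram."""
    return u_types / (u_types + c_tokens)

# Counts from the table above.
print(round(witten_bell_a(52, 110), 2))  # 0.32 for "President was"
print(round(witten_bell_a(3, 40), 2))    # 0.07 for "President Ronald"
```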
Absolute discounting
Yet another smoothing method.
Unlike Witten-Bell smoothing, which uses P_ML, it subtracts a constant value d from the frequency of each word in order to estimate the probability:
P_d(e_i | e_0^{i-1}) = max(c_train(e_0^i) − d, 0) / c_train(e_0^{i-1})
Absolute discounting
So why subtract? We want to treat low-frequency words like unknown words, because low-frequency counts cannot really be trusted.
By doing this, the (n−1)-gram is emphasized more.
Absolute discounting
P_d(e_i | e_{i-n+1}^{i-1}) = max(c_train(e_{i-n+1}^i) − d, 0) / c_train(e_{i-n+1}^{i-1})
With d = 0.5 and the "President Ronald" counts from before:
P_d(e_i = Reagan | e_{i-2}^{i-1} = President Ronald) = (38 − 0.5) / 40 = 0.9375
P_d(e_i = Caza | e_{i-2}^{i-1} = President Ronald) = (1 − 0.5) / 40 = 0.0125
P_d(e_i = Venetiaan | e_{i-2}^{i-1} = President Ronald) = (1 − 0.5) / 40 = 0.0125
Absolute discounting
These probabilities no longer sum to 1; the leftover mass goes to the (n−1)-gram:
a_{e_{i-n+1}^{i-1}} = 1 − (0.9375 + 0.0125 + 0.0125) = 0.0375
An efficient way of computing this is the following:
a_{e_{i-n+1}^{i-1}} = u(e_{i-n+1}^{i-1}, *) × d / c(e_{i-n+1}^{i-1})
(here: 3 × 0.5 / 40 = 0.0375)
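A sketch of the discounted estimate and the efficient leftover-mass formula, reproducing the "President Ronald" numbers above (d = 0.5):

```python
def p_d(count, context_count, d=0.5):
    """P_d(e_i | context) = max(c(context, e_i) - d, 0) / c(context)."""
    return max(count - d, 0) / context_count

def leftover_a(u_types, context_count, d=0.5):
    """Mass reserved for the (n-1)-gram: a = u(context, *) * d / c(context)."""
    return u_types * d / context_count

# "President Ronald": Reagan 38, Caza 1, Venetiaan 1 (3 types, total 40).
probs = [p_d(38, 40), p_d(1, 40), p_d(1, 40)]
print(probs)                        # [0.9375, 0.0125, 0.0125]
print(round(1 - sum(probs), 4))     # 0.0375, the leftover mass
print(round(leftover_a(3, 40), 4))  # 0.0375, same value computed directly
```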
Absolute discounting
Now that we do not use maximum likelihood, the n-gram probability is estimated as follows:
P(e_i | e_{i-n+1}^{i-1}) = P_d(e_i | e_{i-n+1}^{i-1}) + a_{e_{i-n+1}^{i-1}} P(e_i | e_{i-n+2}^{i-1})
Quite similar to linear interpolation, but it differs in that absolute discounting uses P_d:
P(e_i | e_{i-n+1}^{i-1}) = (1 − a) P_ML(e_i | e_{i-n+1}^{i-1}) + a P_ML(e_i | e_{i-n+2}^{i-1})
Kneser-Ney smoothing
Achieves excellent performance.
Similar to absolute discounting.
Pays attention to words that appear only in specific contexts.
Kneser-Ney smoothing
A lower-order model is needed only when the count in the higher-order model is small.
Suppose "San Francisco" is common, but "Francisco" appears only after "San".
Both "San" and "Francisco" then get a high unigram probability,
but we want to give "Francisco" a low unigram probability!!
Kneser-Ney smoothing
The lower-order Kneser-Ney model is defined over continuation counts rather than raw counts:
P_KN(e_i | e_{i-n+2}^{i-1}) = max(u(*, e_{i-n+2}^i) − d, 0) / u(*, e_{i-n+2}^{i-1}, *)
where u(*, ·) counts how many distinct words precede the given sequence in the training data.
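A sketch of the continuation-count idea behind this (an undiscounted unigram version, to keep the numbers simple); the corpus is made up to mimic the "San Francisco" example:

```python
from collections import defaultdict

def continuation_unigram(sentences):
    """P_cont(w) proportional to u(*, w), the number of DISTINCT words that
    precede w -- so "francisco" scores low even though "san francisco" is
    frequent."""
    preceders = defaultdict(set)
    bigram_types = set()
    for s in sentences:
        words = ["<bos>"] + s.split()
        for prev, w in zip(words, words[1:]):
            preceders[w].add(prev)
            bigram_types.add((prev, w))
    total = len(bigram_types)  # number of distinct bigram types (normalizer)
    return {w: len(p) / total for w, p in preceders.items()}

corpus = ["san francisco is big", "i love san francisco",
          "go to san francisco", "san jose is near"]
P_cont = continuation_unigram(corpus)
print(P_cont["francisco"])  # low: it only ever follows "san"
print(P_cont["san"])        # higher: it follows several distinct words
```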
Unknown words
Even though smoothing reduces the chance of P(e) = 0, the possibility of getting 0 still remains.
We may give a probability to unknown words as follows, with V being the vocabulary size:
P_unk(e_i) = 1 / V
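A trivial sketch; the vocabulary size and mixing weight are assumed values, and interpolating P_unk with the unigram model is one common way to use it (not spelled out on the slide):

```python
def p_unk(vocab_size):
    """Uniform unknown-word model: P_unk(e_i) = 1 / V."""
    return 1.0 / vocab_size

def p_word(p_unigram, vocab_size=10**6, unk_weight=0.05):
    """Mix the uniform unknown-word model into the unigram model
    (assumed scheme, not from the slide)."""
    return (1 - unk_weight) * p_unigram + unk_weight * p_unk(vocab_size)

print(p_word(0.0))   # an unseen word still gets a small nonzero probability
print(p_word(0.01))  # a seen word keeps most of its unigram mass
```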