[book reading] Machine Translation (機械翻訳) - section 3 no.1
TRANSCRIPT
Language Model - MT Study Meeting 5/21
HIROYUKI FUDABA
How can you say whether a sentence is natural or not?
e_1 = he is big
  ↑ correct
e_2 = is big he
  ↑ grammatically wrong
e_3 = this is a purple dog
  ↑ semantically wrong
Language model probability
We want to treat "naturalness" statistically.
We represent this with the language model probability P(e).
P(e = he is big) = 0.7
P(e = is big he) = 0.3
P(e = this is a purple dog) = 0.5
Some ways to estimate P(e)
n-gram model
Positional language model
Factored language model
Cache language model
Basis of the n-gram model
We write a sentence as e = e_1^I, I being its length.
e = he is big
e_1 = he, e_2 = is, e_3 = big, I = 3
We can define P(e) as follows:
P(e = he is big) = P(I = 3, e_1 = he, e_2 = is, e_3 = big)
 = P(e_1 = he, e_2 = is, e_3 = big, e_4 = eos)
 = P(e_0 = bos, e_1 = he, e_2 = is, e_3 = big, e_4 = eos)
Estimating P(e) in a simple way
Assuming that natural sentences appear more frequently than unnatural ones, a simple way to estimate P(e) is the following:
Bring a big training dataset E_train
Count the frequency of each sentence in E_train
P_S(e) = freq(e) / size(E_train) = c_train(e) / Σ_ẽ c_train(ẽ)
c_train(e = he is big) returns how many sentences in E_train exactly match "he is big".
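A minimal sketch of this counting estimator in Python (the toy corpus below is made up for illustration):

```python
from collections import Counter

def sentence_lm(train_sentences):
    """P_S(e) = c_train(e) / size(E_train): count exact sentence matches."""
    counts = Counter(train_sentences)   # c_train(e)
    total = len(train_sentences)        # size(E_train)
    return lambda e: counts[e] / total

# Toy training data, made up for illustration.
E_train = ["he is big", "he is big", "she is small", "he is big"]
P_S = sentence_lm(E_train)
print(P_S("he is big"))  # 0.75
print(P_S("is big he"))  # 0.0 -- every unseen sentence gets zero
```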
Problem with the simple estimation
When E_train contains neither sentence e_1 nor e_2, you cannot say which one is more natural:
c_train(e_1) = c_train(e_2) = 0
P_S(e_1) = c_train(e_1) / Σ_ẽ c_train(ẽ) = 0
P_S(e_2) = c_train(e_2) / Σ_ẽ c_train(ẽ) = 0
You cannot compare them if both values are 0 …
Solution to P(e) = 0
Rather than treating a sentence as a whole, let's treat it as data composed of words, applying the chain rule P(X, Y) = P(X | Y) P(Y):
P(e = he is big) = P(e_1 = he | e_0 = bos)
 * P(e_2 = is | e_0 = bos, e_1 = he)
 * P(e_3 = big | e_0 = bos, e_1 = he, e_2 = is)
 * P(e_4 = eos | e_0 = bos, e_1 = he, e_2 = is, e_3 = big)
Solution to P(e) = 0
Estimate each factor by maximum likelihood:
P_ML(e_i | e_0^{i-1}) = c_train(e_0^i) / c_train(e_0^{i-1})
The product then telescopes back to the sentence counts:
P(e_1^I) = Π_{i=1}^{I+1} P_ML(e_i | e_0^{i-1}) = c_train(e) / Σ_ẽ c_train(ẽ) = P_S(e)
So far P(e_1^I) is exactly equal to P_S(e), which means it still doesn't work.
Idea of the n-gram model
Rather than considering all the words that appear before the current word, let's consider only the n − 1 words immediately before it.
[token diagram: bos he is big eos, attending only to the n − 1 preceding words]
The n-gram model, precisely
From the previous expression
P(e_1^I) = Π_{i=1}^{I+1} P_ML(e_i | e_0^{i-1})
we can approximate P(e) as follows:
P(e_1^I) ≈ Π_{i=1}^{I+1} P_ML(e_i | e_{i-n+1}^{i-1})
How does this help?
P(e = he is big) ≈ P(e_i = he | e_{i-1} = bos)
 * P(e_i = is | e_{i-1} = he)
 * P(e_i = big | e_{i-1} = is)
 * P(e_i = eos | e_{i-1} = big)
Intuitively, a subsequence appears at least as often as any sequence that contains it, so P(e) estimated with an n-gram model is less likely to be 0.
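A small sketch of the counting behind this, with bos/eos padding as above; the two-sentence corpus is a made-up example:

```python
from collections import Counter

def train_ngram(sentences, n=2):
    """Collect c_train counts for n-grams and their (n-1)-gram contexts."""
    ngrams, contexts = Counter(), Counter()
    for s in sentences:
        words = ["<bos>"] * (n - 1) + s.split() + ["<eos>"]
        for i in range(n - 1, len(words)):
            ngrams[tuple(words[i - n + 1 : i + 1])] += 1
            contexts[tuple(words[i - n + 1 : i])] += 1
    return ngrams, contexts

def p_ml(word, context, ngrams, contexts):
    """P_ML(e_i | context) = c_train(context, e_i) / c_train(context)."""
    c = contexts[tuple(context)]
    return ngrams[tuple(context) + (word,)] / c if c else 0.0

ngrams, contexts = train_ngram(["he is big", "he is small"], n=2)
print(p_ml("big", ["is"], ngrams, contexts))  # 0.5
print(p_ml("is", ["he"], ngrams, contexts))   # 1.0
```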
Smoothing the n-gram model
An n-gram model is less likely to estimate P(e) = 0, but it can still assign 0.
→ Smoothing
Idea of smoothing
Combine the probabilities of the n-gram and the (n−1)-gram.
Even if the probability of word w cannot be estimated with the n-gram, there is a chance it can still be estimated with the (n−1)-gram:
P_3-gram(small | he is) = 0
P_2-gram(small | is) = 0.03
[bar chart comparing trigram and bigram probabilities for P(he|<bos>), P(is|<bos> he), P(big|he is), P(small|he is), P(<eos>|is big); y-axis: probability, 0 to 0.25]
Linear interpolation
The easiest, most basic way to express this idea:
P(e_i | e_{i-n+1}^{i-1}) = (1 − a) P_ML(e_i | e_{i-n+1}^{i-1}) + a P_ML(e_i | e_{i-n+2}^{i-1}), with 0 ≤ a ≤ 1
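A one-line sketch of this interpolation, reusing the trigram/bigram values from the slide above; the weight a = 0.2 is an arbitrary choice for illustration:

```python
def interpolate(p_high, p_low, a):
    """(1 - a) * P_ML(w | longer context) + a * P_ML(w | shorter context)."""
    return (1 - a) * p_high + a * p_low

# Trigram gives 0, bigram gives 0.03 (the values from the slide above).
print(interpolate(0.0, 0.03, a=0.2))  # 0.006 -- no longer zero
```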
Adjusting a to a good value is the problem.
So how can we do that?
Adjusting a to a good value
An easy way to achieve this is the following (see the sketch below):
Bring a dataset that is different from the training data
Select the a that gives the highest likelihood to that dataset
We can improve performance further by considering each context
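A sketch of that selection, assuming we have already computed each held-out token's higher-order and lower-order P_ML values (the pairs below are made up); a simple grid search picks the a with the highest held-out log-likelihood:

```python
import math

# Per-token (P_ML(w | long context), P_ML(w | short context)) pairs
# from a held-out set; made-up values for illustration.
heldout = [(0.0, 0.03), (0.2, 0.1), (0.5, 0.3), (0.0, 0.01)]

def loglik(a, pairs):
    """Held-out log-likelihood of the interpolated model."""
    return sum(math.log((1 - a) * ph + a * pl) for ph, pl in pairs)

# Try a grid of values in (0, 1) and keep the best one.
grid = [i / 100 for i in range(1, 100)]
best = max(grid, key=lambda a: loglik(a, heldout))
print(best, loglik(best, heldout))
```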
Witten-Bell smoothing
How should we choose a if the n-gram counts looked like the following?
Context "President was":    elected 5, the 3, in 3, First 3, …  (52 word types, total count 110)
Context "President Ronald": Reagan 38, Caza 1, Venetiaan 1      (3 word types, total count 40)
Witten-Bell smoothing
An unknown word is likely to follow the context "President was", so a should be large, so that the (n−1)-gram is emphasized more:
P(e_i | e_{i-n+1}^{i-1}) = (1 − a) P_ML(e_i | e_{i-n+1}^{i-1}) + a P_ML(e_i | e_{i-n+2}^{i-1})
Witten-Bell smoothing
An unknown word is unlikely to follow the context "President Ronald" (almost everything that follows it is "Reagan"), so a should be small, so that the n-gram is emphasized more.
Idea of Witten-Bell smoothing
If you have only a single coefficient a to adjust, you cannot take the context of each word into account.
→ Why not use a different a for each context?
Witten-Bell smoothing, precisely
Simple smoothing:
P(e_i | e_{i-n+1}^{i-1}) = (1 − a) P_ML(e_i | e_{i-n+1}^{i-1}) + a P_ML(e_i | e_{i-n+2}^{i-1})
Witten-Bell smoothing:
P_WB(e_i | e_{i-n+1}^{i-1}) = (1 − a_{e_{i-n+1}^{i-1}}) P_ML(e_i | e_{i-n+1}^{i-1}) + a_{e_{i-n+1}^{i-1}} P_ML(e_i | e_{i-n+2}^{i-1})
a_{e_{i-n+1}^{i-1}} = u(e_{i-n+1}^{i-1}, *) / (u(e_{i-n+1}^{i-1}, *) + c(e_{i-n+1}^{i-1}))
Witten-Bell smoothing, precisely
u(e_{i-n+1}^{i-1}, *) represents how many distinct word types follow the context e_{i-n+1}^{i-1}:
u(President was, *) = 52
u(President Ronald, *) = 3
Witten-Bell smoothing, precisely
Plugging in the counts from the table:
a_{President was} = 52 / (110 + 52) ≈ 0.32
a_{President Ronald} = 3 / (40 + 3) ≈ 0.07
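A one-function sketch of this context-dependent coefficient, checked against the two contexts in the table (52 types / 110 tokens, and 3 types / 40 tokens):

```python
def witten_bell_a(u_types, c_tokens):
    """a_context = u(context, *) / (u(context, *) + c(context)): the more
    distinct continuations a context has, the more weight goes to the
    (n-1)-gram."""
    return u_types / (u_types + c_tokens)

# Counts from the table above.
print(round(witten_bell_a(52, 110), 2))  # 0.32 for "President was"
print(round(witten_bell_a(3, 40), 2))    # 0.07 for "President Ronald"
```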
Absolute discounting
Yet another smoothing method.
Unlike Witten-Bell smoothing, which uses P_ML, it subtracts a constant value d from the frequency of each word in order to estimate the probability:
P_d(e_i | e_0^{i-1}) = max(c_train(e_0^i) − d, 0) / c_train(e_0^{i-1})
Absolute discounting
So why subtract? We want to treat low-frequency words like unknown words, because low-frequency counts cannot really be trusted.
By doing this, the (n−1)-gram is emphasized more.
Absolute discounting
P_d(e_i | e_{i-n+1}^{i-1}) = max(c_train(e_{i-n+1}^i) − d, 0) / c_train(e_{i-n+1}^{i-1})
With d = 0.5 and the "President Ronald" counts from before:
P_d(e_i = Reagan | e_{i-2}^{i-1} = President Ronald) = (38 − 0.5) / 40 = 0.9375
P_d(e_i = Caza | e_{i-2}^{i-1} = President Ronald) = (1 − 0.5) / 40 = 0.0125
P_d(e_i = Venetiaan | e_{i-2}^{i-1} = President Ronald) = (1 − 0.5) / 40 = 0.0125
Absolute discounting
These probabilities no longer sum to 1; the leftover mass goes to the (n−1)-gram:
a_{e_{i-n+1}^{i-1}} = 1 − (0.9375 + 0.0125 + 0.0125) = 0.0375
An efficient way of computing this is the following:
a_{e_{i-n+1}^{i-1}} = u(e_{i-n+1}^{i-1}, *) × d / c(e_{i-n+1}^{i-1})
(here: 3 × 0.5 / 40 = 0.0375)
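A sketch of the discounted estimate and the efficient leftover-mass formula, reproducing the "President Ronald" numbers above (d = 0.5):

```python
def p_d(count, context_count, d=0.5):
    """P_d(e_i | context) = max(c(context, e_i) - d, 0) / c(context)."""
    return max(count - d, 0) / context_count

def leftover_a(u_types, context_count, d=0.5):
    """Mass reserved for the (n-1)-gram: a = u(context, *) * d / c(context)."""
    return u_types * d / context_count

# "President Ronald": Reagan 38, Caza 1, Venetiaan 1 (3 types, total 40).
probs = [p_d(38, 40), p_d(1, 40), p_d(1, 40)]
print(probs)                        # [0.9375, 0.0125, 0.0125]
print(round(1 - sum(probs), 4))     # 0.0375, the leftover mass
print(round(leftover_a(3, 40), 4))  # 0.0375, same value computed directly
```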
Absolute discounting
Now that we do not use maximum likelihood, the n-gram probability is estimated as follows:
P(e_i | e_{i-n+1}^{i-1}) = P_d(e_i | e_{i-n+1}^{i-1}) + a_{e_{i-n+1}^{i-1}} P(e_i | e_{i-n+2}^{i-1})
Quite similar to linear interpolation, but it differs in that absolute discounting uses P_d:
P(e_i | e_{i-n+1}^{i-1}) = (1 − a) P_ML(e_i | e_{i-n+1}^{i-1}) + a P_ML(e_i | e_{i-n+2}^{i-1})
Kneser-Ney smoothing
Achieves excellent performance.
Similar to absolute discounting.
Pays attention to words that appear only in specific contexts.
Kneser-Ney smoothing
A lower-order model is needed only when the count in the higher-order model is small.
Suppose "San Francisco" is common, but "Francisco" appears only after "San".
Both "San" and "Francisco" then get a high unigram probability,
but we want to give "Francisco" a low unigram probability!!
Kneser-Ney smoothing
The lower-order Kneser-Ney model is defined over continuation counts rather than raw counts:
P_KN(e_i | e_{i-n+2}^{i-1}) = max(u(*, e_{i-n+2}^i) − d, 0) / u(*, e_{i-n+2}^{i-1}, *)
where u(*, ·) counts how many distinct words precede the given sequence in the training data.
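A sketch of the continuation-count idea behind this (an undiscounted unigram version, to keep the numbers simple); the corpus is made up to mimic the "San Francisco" example:

```python
from collections import defaultdict

def continuation_unigram(sentences):
    """P_cont(w) proportional to u(*, w), the number of DISTINCT words that
    precede w -- so "francisco" scores low even though "san francisco" is
    frequent."""
    preceders = defaultdict(set)
    bigram_types = set()
    for s in sentences:
        words = ["<bos>"] + s.split()
        for prev, w in zip(words, words[1:]):
            preceders[w].add(prev)
            bigram_types.add((prev, w))
    total = len(bigram_types)  # number of distinct bigram types (normalizer)
    return {w: len(p) / total for w, p in preceders.items()}

corpus = ["san francisco is big", "i love san francisco",
          "go to san francisco", "san jose is near"]
P_cont = continuation_unigram(corpus)
print(P_cont["francisco"])  # low: it only ever follows "san"
print(P_cont["san"])        # higher: it follows several distinct words
```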
Unknown words
Even though smoothing reduces the chance of P(e) = 0, the possibility of getting 0 still remains.
We may give a probability to unknown words as follows, with V being the vocabulary size:
P_unk(e_i) = 1 / V
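A trivial sketch; the vocabulary size and mixing weight are assumed values, and interpolating P_unk with the unigram model is one common way to use it (not spelled out on the slide):

```python
def p_unk(vocab_size):
    """Uniform unknown-word model: P_unk(e_i) = 1 / V."""
    return 1.0 / vocab_size

def p_word(p_unigram, vocab_size=10**6, unk_weight=0.05):
    """Mix the uniform unknown-word model into the unigram model
    (assumed scheme, not from the slide)."""
    return (1 - unk_weight) * p_unigram + unk_weight * p_unk(vocab_size)

print(p_word(0.0))   # an unseen word still gets a small nonzero probability
print(p_word(0.01))  # a seen word keeps most of its unigram mass
```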