Chapter 6: Statistical Inference: n-gram Models over Sparse Data
TDM Seminar
Jonathan Henke
http://www.sims.berkeley.edu/~jhenke/Tdm/TDM-Ch6.ppt
Slide set modified slightly by Juggy for teaching a class on NLP using the same book: http://www.csee.wvu.edu/classes/nlp/Spring_2007/ Modified slides are marked.
Basic Idea:
• Examine short sequences of words
• How likely is each sequence?
• “Markov Assumption” – a word is affected only by its “prior local context” (the last few words)
Possible Applications:
• OCR / Voice recognition – resolve ambiguity
• Spelling correction
• Machine translation
• Confirming the author of a newly discovered work
• “Shannon game”
“Shannon Game”
• Claude E. Shannon. “Prediction and Entropy of Printed English”, Bell System Technical Journal 30:50-64. 1951.
• Predict the next word, given (n-1) previous words
• Determine probability of different sequences by examining training corpus
Forming Equivalence Classes (Bins)
• “n-gram” = sequence of n words
– bigram
– trigram
– four-gram
• Task at hand: estimate P(w_n | w_1, …, w_{n−1})
Reliability vs. Discrimination
“large green ___________”
tree? mountain? frog? car?
“swallowed the large green ________”
pill? broccoli?
Reliability vs. Discrimination
• larger n: more information about the context of the specific instance (greater discrimination)
• smaller n: more instances in training data, better statistical estimates (more reliability)
Selecting an n
Vocabulary (V) = 20,000 words

n             Number of bins
2 (bigrams)   20,000 × 19,999 ≈ 400 million
3 (trigrams)  20,000 × 19,999 × 19,998 ≈ 8 trillion
4 (4-grams)   ≈ 1.6 × 10^17
Statistical Estimators
• Given the observed training data…
• How do you develop a model (probability distribution) to predict future events?

P(w_n | w_1 … w_{n−1}) = P(w_1 … w_n) / P(w_1 … w_{n−1})
Maximum Likelihood Estimation (MLE)
• P_MLE(w_n | w_1 … w_{n−1}) = C(w_1 … w_n) / C(w_1 … w_{n−1})
• Example – 10 training instances of “comes across”
– 8 of them were followed by “as”
– 1 followed by “a”
– 1 followed by “more”
– P(as) = 0.8
– P(a) = 0.1
– P(more) = 0.1
– P(x) = 0 for any other word x
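A minimal sketch of how such MLE estimates fall out of raw counts (the function and the toy token list are illustrative, not from the slides):

```python
from collections import Counter

def mle_bigram_probs(tokens, context_word):
    """MLE estimate: P(w | context_word) = C(context_word, w) / C(context_word)."""
    followers = Counter(tokens[i + 1]
                        for i in range(len(tokens) - 1)
                        if tokens[i] == context_word)
    total = sum(followers.values())
    # Any word never observed after context_word implicitly gets probability 0.
    return {w: count / total for w, count in followers.items()}

tokens = ["comes", "across", "as", "x", "comes", "across", "a"]
print(mle_bigram_probs(tokens, "across"))  # {'as': 0.5, 'a': 0.5}
```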
Statistical Estimators
Example:
Corpus: five Jane Austen novels
N = 617,091 words
V = 14,585 unique words
Task: predict the next word of the trigram “inferior to ________”
from test data, Persuasion: “[In person, she was] inferior to both [sisters.]”
“Smoothing”
• Develop a model which decreases probability of seen events and allows the occurrence of previously unseen n-grams
• a.k.a. “Discounting methods”
• “Validation” – smoothing methods which utilize a second batch of (held-out) training data
Laplace’s Law (adding one)

P_Lap(w_1 … w_n) = (C(w_1 … w_n) + 1) / (N + B)
Lidstone’s Law
P_Lid(w_1 … w_n) = (C(w_1 … w_n) + λ) / (N + Bλ)

P = probability of specific n-gram
C = count of that n-gram in training data
N = total n-grams in training data
B = number of “bins” (possible n-grams)
λ = small positive number

M.L.E.: λ = 0
Laplace’s Law: λ = 1
Jeffreys-Perks Law: λ = ½
Expected Likelihood Estimation

Rank   Word       MLE     ELE
1      not        0.065   0.036
2      a          0.052   0.030
3      the        0.033   0.019
4      to         0.031   0.017
…
1482   inferior   0       0.00003
“was” appeared 9409 times; “not” appeared after “was” 608 times.
Total # of word types = 14589
MLE = 608 / 9409 = 0.065
ELE = (608 + 0.5) / (9409 + 14589 × 0.5) = 0.036
The new estimate has been discounted by roughly 50%.
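A minimal sketch of the Lidstone calculation, checked against the “was not” numbers above (the function name is my own):

```python
def lidstone(count, context_count, num_types, lam):
    """Lidstone's law for P(w | context): (C + lambda) / (N + B * lambda),
    where N is the context count and B is the number of word types."""
    return (count + lam) / (context_count + num_types * lam)

# "not" after "was": C = 608, C("was") = 9409, B = 14589 word types
print(lidstone(608, 9409, 14589, 0.0))  # MLE     ~ 0.065
print(lidstone(608, 9409, 14589, 0.5))  # ELE     ~ 0.036
print(lidstone(608, 9409, 14589, 1.0))  # Laplace ~ 0.025
```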
Jeffreys-Perks Law
Objections to Lidstone’s Law
• Need an a priori way to determine λ
• Predicts all unseen events to be equally likely
• Gives probability estimates linear in the M.L.E. frequency
Smoothing
• Lidstone’s Law (incl. LaPlace’s Law and Jeffreys-Perks Law): modifies the observed counts
• Other methods: modify probabilities.
Held-Out Estimator
• How much of the probability distribution should be “held out” to allow for previously unseen events?
• Validate by holding out part of the training data.
• How often do events unseen in training data occur in validation data? (e.g., to choose λ for the Lidstone model)
Held-Out Estimator
T_r = Σ C_2(w_1 … w_n), summed over all n-grams w_1 … w_n with C_1(w_1 … w_n) = r

P_ho(w_1 … w_n) = T_r / (N_r · N), where r = C_1(w_1 … w_n)

C_1(w_1 … w_n) = frequency of w_1 … w_n in training data
C_2(w_1 … w_n) = frequency of w_1 … w_n in held-out data
N_r = number of n-grams with frequency r in the training text
T_r = total number of times that all n-grams which appeared r times in the training text appeared in the held-out data
Average frequency of those n-grams in the held-out data = T_r / N_r
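A minimal sketch of the held-out computation for bigrams (the names, and taking N to be the held-out size so the estimates sum to one, are my own reading of the formula):

```python
from collections import Counter

def held_out_estimates(train_bigrams, heldout_bigrams):
    """Held-out estimator: for each training frequency r,
    P(any one n-gram seen r times in training) = T_r / (N_r * N)."""
    c1 = Counter(train_bigrams)          # C_1: training counts
    c2 = Counter(heldout_bigrams)        # C_2: held-out counts
    N = len(heldout_bigrams)             # total n-grams in held-out data (assumption)
    n_r = Counter(c1.values())           # N_r: # of types with training count r
    t_r = Counter()
    for gram, r in c1.items():
        t_r[r] += c2[gram]               # T_r: held-out mass of those types
    return {r: t_r[r] / (n_r[r] * N) for r in n_r}
```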
Testing Models
• Hold out ~ 5 – 10% for testing
• Hold out ~ 10% for validation (smoothing)
• For testing: useful to test on multiple sets of data and report the variance of results.
– Are results (good or bad) just the result of chance?
Cross-Validation (a.k.a. deleted estimation)
• Use data for both training and validation
• Divide the training data into 2 parts (A and B)
(1) Train on A, validate on B → Model 1
(2) Train on B, validate on A → Model 2
(3) Combine the two models → Final Model
Cross-Validation
Two estimates:

P_ho(w_1 … w_n) = T_r^01 / (N_r^0 · N)   [train on part 0, validate on part 1]
P_ho(w_1 … w_n) = T_r^10 / (N_r^1 · N)   [train on part 1, validate on part 0]

N_r^a = number of n-grams occurring r times in the a-th part of the training set
T_r^ab = total number of those found in the b-th part

Combined estimate (arithmetic mean):

P_ho(w_1 … w_n) = (T_r^01 + T_r^10) / (N · (N_r^0 + N_r^1))
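A minimal sketch of deleted estimation over two halves of the training data (names are illustrative):

```python
from collections import Counter

def deleted_estimates(part0, part1):
    """Deleted estimation: P_r = (T_r^01 + T_r^10) / (N * (N_r^0 + N_r^1))
    for any n-gram occurring r times; N = total n-grams in both parts."""
    c0, c1 = Counter(part0), Counter(part1)
    N = len(part0) + len(part1)
    nr0, nr1 = Counter(c0.values()), Counter(c1.values())  # N_r^0, N_r^1
    tr01, tr10 = Counter(), Counter()
    for gram, r in c0.items():
        tr01[r] += c1[gram]   # T_r^01: part-1 mass of grams seen r times in part 0
    for gram, r in c1.items():
        tr10[r] += c0[gram]   # T_r^10: the symmetric quantity
    return {r: (tr01[r] + tr10[r]) / (N * (nr0[r] + nr1[r]))
            for r in set(nr0) | set(nr1)}
```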
Good-Turing Estimator
r* = “adjusted frequency”
N_r = number of n-gram types which occur r times
E(N_r) = expected value of N_r
E(N_{r+1}) < E(N_r)
Typically this is applied only for r < some constant k, since N_{r+1} = 0 at the largest observed r.

r* = (r + 1) · E(N_{r+1}) / E(N_r)
P_GT = r* / N
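A minimal sketch of the Good-Turing adjustment, using the observed N_r as the estimate of E(N_r) and leaving counts at or above k unadjusted (a common simplification; the names are mine):

```python
from collections import Counter

def good_turing_adjusted(ngram_counts, k=5):
    """r* = (r + 1) * N_{r+1} / N_r for r < k; larger r kept as-is."""
    n_r = Counter(ngram_counts.values())      # count of counts
    adjusted = {}
    for r in sorted(n_r):
        if r < k and n_r.get(r + 1, 0) > 0:
            adjusted[r] = (r + 1) * n_r[r + 1] / n_r[r]
        else:
            adjusted[r] = float(r)            # no reliable N_{r+1}: keep raw count
    return adjusted   # P_GT for an n-gram seen r times is adjusted[r] / N
```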
[Table omitted: count of counts (N_r) in the Austen corpus]
Good-Turing Estimates for Austen Corpus
• N1 = number of bigrams seen exactly once in training instance = 138741
• N = 617091 [number of words in Austen corpus]• N1 /N = 0.2248 [mass reserved for unseen bigrams using Good-Turing
approach]• Space of bigrams is vocabulary squared: 145852 • Total # of bigrams seen in training set: 199,252• Probability estimate for
unseen bigrams = 0.2248/(145852 -199,252) = 1.058 x 10-9
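The arithmetic above can be checked directly:

```python
N1, N = 138741, 617091          # singleton bigrams; corpus size
V, seen = 14585, 199252         # vocabulary; distinct bigrams observed
unseen_mass = N1 / N            # ~0.2248, reserved for unseen bigrams
unseen_types = V**2 - seen      # ~2.125e8 possible-but-unseen bigrams
print(unseen_mass / unseen_types)   # ~1.058e-09 per unseen bigram
```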
Discounting Methods
First, determine held-out probability
• Absolute discounting: Decrease probability of each observed n-gram by subtracting a small constant
• Linear discounting: Decrease probability of each observed n-gram by multiplying by the same proportion
Combining Estimators
(Sometimes a trigram model is best, sometimes a bigram model is best, and sometimes a unigram model is best.)
• How can you develop a model to utilize different length n-grams as appropriate?
Simple Linear Interpolation (a.k.a. finite mixture models; a.k.a. deleted interpolation)
• weighted average of unigram, bigram, and trigram probabilities

P_li(w_n | w_{n−2}, w_{n−1}) = λ_1 P(w_n) + λ_2 P(w_n | w_{n−1}) + λ_3 P(w_n | w_{n−2}, w_{n−1}),
where the λ_i sum to 1
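A minimal sketch of the mixture (the λ values shown are placeholders; in practice they are tuned on held-out data, e.g. with EM):

```python
def interpolated_prob(w, context, uni, bi, tri, lambdas=(0.2, 0.3, 0.5)):
    """P_li(w | w2, w1) = l1*P(w) + l2*P(w | w1) + l3*P(w | w2, w1).
    uni, bi, tri are dicts of (MLE or smoothed) probabilities keyed by
    the word / word tuples; the lambdas must sum to 1."""
    w2, w1 = context                       # (w_{n-2}, w_{n-1})
    l1, l2, l3 = lambdas
    return (l1 * uni.get(w, 0.0)
            + l2 * bi.get((w1, w), 0.0)
            + l3 * tri.get((w2, w1, w), 0.0))
```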
Katz’s Backing-Off
• Use n-gram probability when enough training data– (when adjusted count > k; k usu. = 0 or 1)
• If not, “back-off” to the (n-1)-gram probability
• (Repeat as needed)
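A minimal sketch of the recursive back-off idea; note it omits the discounting and α normalization weights that full Katz back-off needs, so it only illustrates the control flow (names are mine):

```python
def backoff_prob(ngram, counts, k=0):
    """Use the n-gram relative frequency if its count exceeds k,
    otherwise recurse on the (n-1)-gram.  `counts` maps word tuples
    to frequencies, and counts[()] holds the total token count N."""
    if len(ngram) == 1:
        return counts.get(ngram, 0) / counts[()]       # unigram: C(w) / N
    context = ngram[:-1]
    if counts.get(ngram, 0) > k and counts.get(context, 0) > 0:
        return counts[ngram] / counts[context]         # enough data: use it
    return backoff_prob(ngram[1:], counts, k)          # back off and repeat
```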
Problems with Backing-Off
• If bigram w_1 w_2 is common
• but trigram w_1 w_2 w_3 is unseen
• may be a meaningful gap, rather than a gap due to chance and scarce data– i.e., a “grammatical null”
• May not want to back-off to lower-order probability
Comparison of Estimators