Chapter 6: Statistical Inference: n-gram Models over Sparse Data
TDM Seminar
Jonathan Henke
http://www.sims.berkeley.edu/~jhenke/Tdm/TDM-Ch6.ppt
Slide set modified slightly by Juggy for teaching a class on NLP using the same book: http://www.csee.wvu.edu/classes/nlp/Spring_2007/ Modified slides are marked.
Basic Idea:
• Examine short sequences of words
• How likely is each sequence?
• “Markov Assumption” – a word is affected only by its “prior local context” (the last few words)
Possible Applications:
• OCR / Voice recognition – resolve ambiguity
• Spelling correction
• Machine translation
• Confirming the author of a newly discovered work
• “Shannon game”
“Shannon Game”
• Claude E. Shannon. “Prediction and Entropy of Printed English”, Bell System Technical Journal 30:50-64. 1951.
• Predict the next word, given (n-1) previous words
• Determine probability of different sequences by examining training corpus
Forming Equivalence Classes (Bins)
• “n-gram” = sequence of n words
– bigram
– trigram
– four-gram
• Task at hand: estimate P(w_n | w_1, …, w_{n−1})
Reliability vs. Discrimination
“large green ___________”
tree? mountain? frog? car?
“swallowed the large green ________”
pill? broccoli?
Reliability vs. Discrimination
• larger n: more information about the context of the specific instance (greater discrimination)
• smaller n: more instances in training data, better statistical estimates (more reliability)
Selecting an n
Vocabulary (V) = 20,000 words

n             Number of bins
2 (bigrams)   20,000 × 19,999 ≈ 400 million
3 (trigrams)  20,000 × 19,999 × 19,998 ≈ 8 trillion
4 (4-grams)   ≈ 1.6 × 10^17
Statistical Estimators
• Given the observed training data…
• How do you develop a model (probability distribution) to predict future events?

P(w_n | w_1 … w_{n−1}) = P(w_1 … w_n) / P(w_1 … w_{n−1})
Maximum Likelihood Estimation (MLE)
• P_MLE(w_n | w_1 … w_{n−1}) = C(w_1 … w_n) / C(w_1 … w_{n−1})
• Example – 10 training instances of “comes across”
– 8 of them were followed by “as”
– 1 followed by “a”
– 1 followed by “more”
– P(as) = 0.8
– P(a) = 0.1
– P(more) = 0.1
– P(x) = 0 for any other word x
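A minimal sketch of how such MLE estimates fall out of raw counts (the function and the toy token list are illustrative, not from the slides):

```python
from collections import Counter

def mle_bigram_probs(tokens, context_word):
    """MLE estimate: P(w | context_word) = C(context_word, w) / C(context_word)."""
    followers = Counter(tokens[i + 1]
                        for i in range(len(tokens) - 1)
                        if tokens[i] == context_word)
    total = sum(followers.values())
    # Any word never observed after context_word implicitly gets probability 0.
    return {w: count / total for w, count in followers.items()}

tokens = ["comes", "across", "as", "x", "comes", "across", "a"]
print(mle_bigram_probs(tokens, "across"))  # {'as': 0.5, 'a': 0.5}
```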
Statistical Estimators
Example:
Corpus: five Jane Austen novels
N = 617,091 words
V = 14,585 unique words
Task: predict the next word of the trigram “inferior to ________”
from test data, Persuasion: “[In person, she was] inferior to both [sisters.]”
“Smoothing”
• Develop a model which decreases probability of seen events and allows the occurrence of previously unseen n-grams
• a.k.a. “Discounting methods”
• “Validation” – smoothing methods which utilize a second batch of (held-out) training data
Laplace’s Law (adding one)

P_Lap(w_1 … w_n) = (C(w_1 … w_n) + 1) / (N + B)
Lidstone’s Law
P_Lid(w_1 … w_n) = (C(w_1 … w_n) + λ) / (N + Bλ)

P = probability of specific n-gram
C = count of that n-gram in training data
N = total n-grams in training data
B = number of “bins” (possible n-grams)
λ = small positive number

M.L.E.: λ = 0
Laplace’s Law: λ = 1
Jeffreys-Perks Law: λ = ½
Expected Likelihood Estimation

Rank   Word       MLE     ELE
1      not        0.065   0.036
2      a          0.052   0.030
3      the        0.033   0.019
4      to         0.031   0.017
…
1482   inferior   0       0.00003
“was” appeared 9409 times; “not” appeared after “was” 608 times.
Total # of word types = 14589
MLE = 608 / 9409 = 0.065
ELE = (608 + 0.5) / (9409 + 14589 × 0.5) = 0.036
The new estimate has been discounted by roughly 50%.
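A minimal sketch of the Lidstone calculation, checked against the “was not” numbers above (the function name is my own):

```python
def lidstone(count, context_count, num_types, lam):
    """Lidstone's law for P(w | context): (C + lambda) / (N + B * lambda),
    where N is the context count and B is the number of word types."""
    return (count + lam) / (context_count + num_types * lam)

# "not" after "was": C = 608, C("was") = 9409, B = 14589 word types
print(lidstone(608, 9409, 14589, 0.0))  # MLE     ~ 0.065
print(lidstone(608, 9409, 14589, 0.5))  # ELE     ~ 0.036
print(lidstone(608, 9409, 14589, 1.0))  # Laplace ~ 0.025
```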
Jeffreys-Perks Law
Objections to Lidstone’s Law
• Need an a priori way to determine λ
• Predicts all unseen events to be equally likely
• Gives probability estimates linear in the M.L.E. frequency
Smoothing
• Lidstone’s Law (incl. LaPlace’s Law and Jeffreys-Perks Law): modifies the observed counts
• Other methods: modify probabilities.
Held-Out Estimator
• How much of the probability distribution should be “held out” to allow for previously unseen events?
• Validate by holding out part of the training data.
• How often do events unseen in training data occur in validation data? (e.g., to choose λ for the Lidstone model)
Held-Out Estimator
T_r = Σ C_2(w_1 … w_n), summed over all n-grams w_1 … w_n with C_1(w_1 … w_n) = r

P_ho(w_1 … w_n) = T_r / (N_r · N), where r = C_1(w_1 … w_n)

C_1(w_1 … w_n) = frequency of w_1 … w_n in training data
C_2(w_1 … w_n) = frequency of w_1 … w_n in held-out data
N_r = number of n-grams with frequency r in the training text
T_r = total number of times that all n-grams which appeared r times in the training text appeared in the held-out data
Average frequency of those n-grams in the held-out data = T_r / N_r
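A minimal sketch of the held-out computation for bigrams (the names, and taking N to be the held-out size so the estimates sum to one, are my own reading of the formula):

```python
from collections import Counter

def held_out_estimates(train_bigrams, heldout_bigrams):
    """Held-out estimator: for each training frequency r,
    P(any one n-gram seen r times in training) = T_r / (N_r * N)."""
    c1 = Counter(train_bigrams)          # C_1: training counts
    c2 = Counter(heldout_bigrams)        # C_2: held-out counts
    N = len(heldout_bigrams)             # total n-grams in held-out data (assumption)
    n_r = Counter(c1.values())           # N_r: # of types with training count r
    t_r = Counter()
    for gram, r in c1.items():
        t_r[r] += c2[gram]               # T_r: held-out mass of those types
    return {r: t_r[r] / (n_r[r] * N) for r in n_r}
```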
Testing Models
• Hold out ~ 5 – 10% for testing
• Hold out ~ 10% for validation (smoothing)
• For testing: useful to test on multiple sets of data and report the variance of results.
– Are results (good or bad) just the result of chance?
Cross-Validation (a.k.a. deleted estimation)
• Use data for both training and validation
• Divide the training data into 2 parts (A and B)
(1) Train on A, validate on B → Model 1
(2) Train on B, validate on A → Model 2
(3) Combine the two models → Final Model
Cross-Validation
Two estimates:

P_ho(w_1 … w_n) = T_r^01 / (N_r^0 · N)   [train on part 0, validate on part 1]
P_ho(w_1 … w_n) = T_r^10 / (N_r^1 · N)   [train on part 1, validate on part 0]

N_r^a = number of n-grams occurring r times in the a-th part of the training set
T_r^ab = total number of those found in the b-th part

Combined estimate (arithmetic mean):

P_ho(w_1 … w_n) = (T_r^01 + T_r^10) / (N · (N_r^0 + N_r^1))
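A minimal sketch of deleted estimation over two halves of the training data (names are illustrative):

```python
from collections import Counter

def deleted_estimates(part0, part1):
    """Deleted estimation: P_r = (T_r^01 + T_r^10) / (N * (N_r^0 + N_r^1))
    for any n-gram occurring r times; N = total n-grams in both parts."""
    c0, c1 = Counter(part0), Counter(part1)
    N = len(part0) + len(part1)
    nr0, nr1 = Counter(c0.values()), Counter(c1.values())  # N_r^0, N_r^1
    tr01, tr10 = Counter(), Counter()
    for gram, r in c0.items():
        tr01[r] += c1[gram]   # T_r^01: part-1 mass of grams seen r times in part 0
    for gram, r in c1.items():
        tr10[r] += c0[gram]   # T_r^10: the symmetric quantity
    return {r: (tr01[r] + tr10[r]) / (N * (nr0[r] + nr1[r]))
            for r in set(nr0) | set(nr1)}
```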
Good-Turing Estimator
r* = “adjusted frequency”
N_r = number of n-gram types which occur r times
E(N_r) = expected value of N_r
E(N_{r+1}) < E(N_r)
Typically this is applied only for r < some constant k, since N_{r+1} = 0 at the largest observed r.

r* = (r + 1) · E(N_{r+1}) / E(N_r)
P_GT = r* / N
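A minimal sketch of the Good-Turing adjustment, using the observed N_r as the estimate of E(N_r) and leaving counts at or above k unadjusted (a common simplification; the names are mine):

```python
from collections import Counter

def good_turing_adjusted(ngram_counts, k=5):
    """r* = (r + 1) * N_{r+1} / N_r for r < k; larger r kept as-is."""
    n_r = Counter(ngram_counts.values())      # count of counts
    adjusted = {}
    for r in sorted(n_r):
        if r < k and n_r.get(r + 1, 0) > 0:
            adjusted[r] = (r + 1) * n_r[r + 1] / n_r[r]
        else:
            adjusted[r] = float(r)            # no reliable N_{r+1}: keep raw count
    return adjusted   # P_GT for an n-gram seen r times is adjusted[r] / N
```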
[Table omitted: count of counts (N_r) in the Austen corpus]
Good-Turing Estimates for Austen Corpus
• N1 = number of bigrams seen exactly once in training instance = 138741
• N = 617091 [number of words in Austen corpus]• N1 /N = 0.2248 [mass reserved for unseen bigrams using Good-Turing
approach]• Space of bigrams is vocabulary squared: 145852 • Total # of bigrams seen in training set: 199,252• Probability estimate for
unseen bigrams = 0.2248/(145852 -199,252) = 1.058 x 10-9
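The arithmetic above can be checked directly:

```python
N1, N = 138741, 617091          # singleton bigrams; corpus size
V, seen = 14585, 199252         # vocabulary; distinct bigrams observed
unseen_mass = N1 / N            # ~0.2248, reserved for unseen bigrams
unseen_types = V**2 - seen      # ~2.125e8 possible-but-unseen bigrams
print(unseen_mass / unseen_types)   # ~1.058e-09 per unseen bigram
```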
Discounting Methods
First, determine held-out probability
• Absolute discounting: Decrease probability of each observed n-gram by subtracting a small constant
• Linear discounting: Decrease probability of each observed n-gram by multiplying by the same proportion
Combining Estimators
(Sometimes a trigram model is best, sometimes a bigram model is best, and sometimes a unigram model is best.)
• How can you develop a model to utilize different length n-grams as appropriate?
Simple Linear Interpolation (a.k.a. finite mixture models; a.k.a. deleted interpolation)
• weighted average of unigram, bigram, and trigram probabilities

P_li(w_n | w_{n−2}, w_{n−1}) = λ_1 P(w_n) + λ_2 P(w_n | w_{n−1}) + λ_3 P(w_n | w_{n−2}, w_{n−1}),
where the λ_i sum to 1
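A minimal sketch of the mixture (the λ values shown are placeholders; in practice they are tuned on held-out data, e.g. with EM):

```python
def interpolated_prob(w, context, uni, bi, tri, lambdas=(0.2, 0.3, 0.5)):
    """P_li(w | w2, w1) = l1*P(w) + l2*P(w | w1) + l3*P(w | w2, w1).
    uni, bi, tri are dicts of (MLE or smoothed) probabilities keyed by
    the word / word tuples; the lambdas must sum to 1."""
    w2, w1 = context                       # (w_{n-2}, w_{n-1})
    l1, l2, l3 = lambdas
    return (l1 * uni.get(w, 0.0)
            + l2 * bi.get((w1, w), 0.0)
            + l3 * tri.get((w2, w1, w), 0.0))
```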
Katz’s Backing-Off
• Use n-gram probability when enough training data– (when adjusted count > k; k usu. = 0 or 1)
• If not, “back-off” to the (n-1)-gram probability
• (Repeat as needed)
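A minimal sketch of the recursive back-off idea; note it omits the discounting and α normalization weights that full Katz back-off needs, so it only illustrates the control flow (names are mine):

```python
def backoff_prob(ngram, counts, k=0):
    """Use the n-gram relative frequency if its count exceeds k,
    otherwise recurse on the (n-1)-gram.  `counts` maps word tuples
    to frequencies, and counts[()] holds the total token count N."""
    if len(ngram) == 1:
        return counts.get(ngram, 0) / counts[()]       # unigram: C(w) / N
    context = ngram[:-1]
    if counts.get(ngram, 0) > k and counts.get(context, 0) > 0:
        return counts[ngram] / counts[context]         # enough data: use it
    return backoff_prob(ngram[1:], counts, k)          # back off and repeat
```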
Problems with Backing-Off
• If bigram w_1 w_2 is common
• but trigram w_1 w_2 w_3 is unseen
• may be a meaningful gap, rather than a gap due to chance and scarce data– i.e., a “grammatical null”
• May not want to back-off to lower-order probability
Comparison of Estimators