n-gram Language Models and Smoothing Algorithms


  • n-gram Language Models and Smoothing Algorithms, 2005.10.17

  • Today's topics: n-gram language models; smoothing algorithms; a brief introduction to statistical LM toolkits

    Borrows heavily from slides on the Internet, including but not limited to those by Joshua Goodman, Jonathan Henke, Dragomir R. Radev, Liu Ting (刘挺), and Jim Martin

  • A bad language model

  • What is a Language Model? A language model is a probability distribution over word sequences.

    P(And nothing but the truth) ≈ 0.001;  P(And nuts sing on the roof) ≈ 0

    The sum of probabilities of all word sequences has to be 1.

  • Applications: speech recognition, handwriting recognition, spelling correction, optical character recognition, machine translation

  • Very useful for distinguishing "nothing but the truth" from "nuts sing on de roof"

  • Chain rule

    P(And nothing but the truth) = P(And) * P(nothing | And) * P(but | And nothing) * P(the | And nothing but) * P(truth | And nothing but the)

    P(w1, w2, ..., wn) = P(w1) * P(w2 | w1) * P(w3 | w1 w2) * ... * P(wn | w1 w2 ... wn-1)

  • Markov approximation: assume each word depends only on a limited local context, e.g. on the previous two words. This is called a trigram model (sketched below).

    P(the | whole truth and nothing but) ≈ P(the | nothing but);  P(truth | whole truth and nothing but the) ≈ P(truth | but the)
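    A minimal sketch of this approximation (the probability table and helper below are made up purely for illustration): the chain-rule product becomes a product of P(wi | wi-2, wi-1).

        # Sketch: sentence probability under a trigram (2nd-order Markov) model.
        trigram_prob = {
            ("<s>", "<s>", "and"): 0.02,
            ("<s>", "and", "nothing"): 0.01,
            ("and", "nothing", "but"): 0.3,
            ("nothing", "but", "the"): 0.5,
            ("but", "the", "truth"): 0.1,
        }

        def sentence_prob(words, probs):
            """P(w1..wn) ~= prod_i P(w_i | w_{i-2}, w_{i-1}), padding with <s>."""
            padded = ["<s>", "<s>"] + list(words)
            p = 1.0
            for i in range(2, len(padded)):
                p *= probs.get((padded[i - 2], padded[i - 1], padded[i]), 0.0)
            return p

        print(sentence_prob("and nothing but the truth".split(), trigram_prob))  # 3e-06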

  • With the Markov assumption: P(wn | w1 ... wn-1) ≈ P(wn | wn-N+1 ... wn-1), i.e. only the previous N-1 words matter.

  • Caveat: the formulation P(Word | Some fixed prefix) is not really appropriate in many applications. It is if we're dealing with real-time speech, where we only have access to prefixes. But if we're dealing with text, we already have both the right and left contexts; there's no a priori reason to stick to left contexts.

  • An n-gram language model (LM) conditions each word on the previous n-1 words: p(W) = ∏_{i=1..d} p(wi | wi-n+1, ..., wi-1), where d = |W|

    Number of parameters, assuming a vocabulary of 20,000 words:

    history of 0 words (unigram): 19,999
    history of 1 word (bigram): 20,000 × 19,999 ≈ 400 million
    history of 2 words (trigram): 20,000² × 19,999 ≈ 8 trillion
    history of 3 words (four-gram): 20,000³ × 19,999 ≈ 1.6 × 10^17

  • N-gram terminology: N = 1 unigram, N = 2 bigram, N = 3 trigram. (Unigram and trigram are words but bigram is not? http://shuan.justrockandroll.com/arc.php?topic=14)

  • An aside: Monogram? Digram?

    Learn something!http://phrontistery.info/numbers.html

  • In practice n is usually 3 (trigram), sometimes 4.

  • n = 3: large green ___________  (tree? mountain? frog? car? ...)

    n = 5: swallowed the large green ________  (pill? broccoli?)

  • Reliability vs. Discrimination: a larger n gives more information about the context of the specific instance (greater discrimination power), but the counts become very sparse; a smaller n means more instances in the training data and better statistical estimates (more reliability), but more candidate continuations remain plausible.

  • Trigrams: how do we find the probabilities? Get real text, and start counting!

  • Counting Words. Example: "He stepped out into the hall, was delighted to encounter a water brother" - how many words? Word forms and lemmas: cat and cats share the same lemma (compare tokens and types). Shakespeare's complete works: 884,647 word tokens and 29,066 word types. Brown corpus: 61,805 types and 37,851 lemmas (1 million words from 500 texts). The American Heritage dictionary, 3rd edition, has 200,000 boldface forms (including some multiword phrases).


  • The most frequent bigrams and their counts: 's 3550; of the 2507; to be 2235; in the 1917; I am 1366; of her 1268; to the 1142; it was 1010; had been 995; she had 978; to her 965; could not 945; I have 898; of his 880; and the 862; she was 843; have been 837; of a 745; for the 712; in a 707

  • MLE

    Let T be the training text. Count the trigrams C3(wi-2, wi-1, wi) and their bigram prefixes C2(wi-2, wi-1) in T. Then pMLE(wi | wi-2, wi-1) = C3(wi-2, wi-1, wi) / C2(wi-2, wi-1), e.g. P(the | nothing but) = C3(nothing but the) / C2(nothing but).
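    A sketch of this MLE from raw counts (the toy corpus below is illustrative, not from the slides):

        from collections import Counter

        def mle_trigram(tokens):
            """p_MLE(w3 | w1, w2) = C3(w1, w2, w3) / C2(w1, w2)."""
            c3 = Counter(zip(tokens, tokens[1:], tokens[2:]))
            c2 = Counter(zip(tokens, tokens[1:]))
            return lambda w1, w2, w3: c3[(w1, w2, w3)] / c2[(w1, w2)] if c2[(w1, w2)] else 0.0

        tokens = "and nothing but the truth and nothing but lies".split()
        p = mle_trigram(tokens)
        print(p("nothing", "but", "the"))   # C3(nothing but the) / C2(nothing but) = 1/2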

  • Bigram Probabilities (from Martin)

  • An Aside on Logs: when computing sentence probabilities, you don't really do all those multiplies; the numbers are too small and lead to underflow. Convert the probabilities to logs and then do additions. To get the real probability back (if you need it), take the antilog.
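    A small illustration of the log trick (the per-word probabilities are arbitrary):

        import math

        probs = [0.25, 0.25, 0.125, 0.125]           # per-word probabilities
        log_p = sum(math.log(p) for p in probs)      # add logs instead of multiplying
        print(log_p)                                  # about -6.93, no underflow risk
        print(math.exp(log_p))                        # antilog: back to about 0.00098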

  • Some Observations: the following numbers are very informative. Think about what they capture. P(want|I) = .32, P(to|want) = .65, P(eat|to) = .26, P(food|Chinese) = .56, P(lunch|eat) = .055

  • Some More Observations: P(I | I), P(want | I), P(I | food), as in "I I I want", "I want I want to", "The food I want is"

  • The problem with MLE in NLP: data are sparse. Many perfectly possible n-grams never occur in the training data, so MLE assigns them probability 0, and a single zero makes the probability of the whole sentence 0.

  • Example 1: what is p(z|xy)? Suppose the training data contains xya, xyd, xyd, but xyz never occurs. MLE gives p(a|xy) = 1/3, p(d|xy) = 2/3, p(z|xy) = 0/3. Is xyz really impossible, or did we just not see it?

  • Note that 1/3 and 100/300 are the same ratio, but an estimate based on 100/300 is far more reliable than one based on 1/3; likewise a count of 0 out of 300 is much stronger evidence than 0 out of 3.

  • Smoothing: develop a model which decreases the probability of seen events and allows the occurrence of previously unseen n-grams; a.k.a. discounting methods.

  • The idea: MLE gives probability 0 to unseen events and correspondingly overestimates p(w) for seen events; smoothing takes some probability mass away from the seen events and redistributes it to the unseen ones.
  • Add-one smoothing. Let T be the training data and V the vocabulary; for a word w with history h: p(w|h) = (c(h,w) + 1) / (c(h) + |V|); for unigrams, p(w) = (c(w) + 1) / (|T| + |V|). This behaves badly when |V| > c(h), and very badly when |V| >> c(h).

    Example. T: "what is it what is small?", |T| = 8; V = {what, is, it, small, ?, ",", flying, birds, are, a, bird, .}, |V| = 12.
    MLE: p(it) = 0.125, p(what) = 0.25, p(.) = 0; p(what is it?) ≈ 0.25² × 0.125² ≈ 0.001; p(it is flying.) = 0.125 × 0.25 × 0 × 0 = 0.
    Add-one: p(it) = 0.1, p(what) = 0.15, p(.) = 0.05; p(what is it?) ≈ 0.15² × 0.1² ≈ 0.0002; p(it is flying.) = 0.1 × 0.15 × 0.05² ≈ 0.00004.

  • Add-one smoothing is also known as Laplace's Law. Laplace's Law actually gives far too much of the probability space to unseen events.

  • ELE. Since the adding-one process may be adding too much, we can add a smaller value λ: P_Lid(w1,..,wn) = (C(w1,..,wn) + λ) / (|T| + λ|V|), λ > 0. ==> Lidstone's Law. If λ = 1/2, Lidstone's Law corresponds to the expectation of the likelihood and is called the Expected Likelihood Estimation (ELE) or the Jeffreys-Perks Law.

  • Same example. T: "what is it what is small?", |T| = 8; V as before, |V| = 12.
    MLE: p(it) = 0.125, p(what) = 0.25, p(.) = 0; p(what is it?) ≈ 0.25² × 0.125² ≈ 0.001; p(it is flying.) = 0.125 × 0.25 × 0² = 0.
    With λ = 0.1: p(it) ≈ 0.12, p(what) ≈ 0.23, p(.) ≈ 0.01; p(what is it?) ≈ 0.23² × 0.12² ≈ 0.0007; p(it is flying.) ≈ 0.12 × 0.23 × 0.01² ≈ 0.000003.
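    A sketch of add-λ (Lidstone) smoothing for unigrams on the toy T and V from these slides; the exact tokenisation of T (including a comma token) is my assumption, chosen so that the numbers match the slides. λ = 1 reproduces the add-one estimates, λ = 0.1 the Lidstone ones.

        from collections import Counter

        def add_lambda_unigram(tokens, vocab, lam):
            """p(w) = (c(w) + lam) / (|T| + lam * |V|); lam = 1 is add-one (Laplace)."""
            counts = Counter(tokens)
            denom = len(tokens) + lam * len(vocab)
            return {w: (counts[w] + lam) / denom for w in vocab}

        T = "what is it , what is small ?".split()    # |T| = 8
        V = ["what", "is", "it", "small", "?", ",", "flying", "birds", "are", "a", "bird", "."]

        for lam in (1.0, 0.1):
            p = add_lambda_unigram(T, V, lam)
            print(lam, round(p["it"], 3), round(p["what"], 3), round(p["."], 3))
            # 1.0 -> 0.1, 0.15, 0.05;  0.1 -> ~0.12, ~0.23, ~0.01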

  • Held-Out Estimator: how much of the probability distribution should be held out to allow for previously unseen events? Validate by holding out part of the training data: how often do events unseen in the training data occur in the validation data? (E.g., to choose λ for the Lidstone model.)

  • For each n-gram w1,..,wn, we compute C1(w1,..,wn) and C2(w1,..,wn), its frequencies in the training and held-out data, respectively.
    Let Nr be the number of n-grams with frequency r in the training text.
    Let Hr be the total number of times that all n-grams that appeared r times in the training text appeared in the held-out data.
    Hr/Nr is then the average held-out frequency of one of these n-grams, and an estimate for the probability of one of them is: Pho(w1,..,wn) = Hr / (Nr * H), where C1(w1,..,wn) = r and H is the number of n-grams in the held-out data.
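    A sketch of the held-out estimator, for unigrams to keep it short (the two toy corpora are made up; unseen words, r = 0, would additionally need an estimate of the number of unseen types and are simply returned as 0 here):

        from collections import Counter

        def held_out(train_tokens, heldout_tokens):
            """P_ho(w) = H_r / (N_r * H), where r is w's frequency in the training data."""
            c_train = Counter(train_tokens)
            c_held = Counter(heldout_tokens)
            H = len(heldout_tokens)                   # size of the held-out data
            n_r = Counter(c_train.values())           # N_r: types with training count r
            h_r = Counter()                           # H_r: held-out count of those types
            for w, r in c_train.items():
                h_r[r] += c_held[w]
            return lambda w: h_r[c_train[w]] / (n_r[c_train[w]] * H) if c_train[w] else 0.0

        p_ho = held_out("a b b c c c d".split(), "a a b c c e".split())
        print(p_ho("b"))   # all types with the same training count share one estimate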

  • Pots of Data for Developing and Testing Models: training data (80% of the total), held-out data (10%), test data (5-10%). Write an algorithm, train it, test it, note what it does wrong, revise it, and repeat many times. Keep development test data and final test data separate, as development data is seen by the system during repeated testing. Give final results by testing on n smaller samples of the test data and averaging.

  • Cross-Validation (a.k.a. deleted estimation): held-out data is used to validate the model, and all of the data is used for both training and validation. Divide the data into two parts, A and B: train on A, validate on B (Model 1); train on B, validate on A (Model 2); combine the two models into the final model.

  • How should the two estimates be combined? One option is P = λ·Pt + (1 - λ)·Pho, with λ set on held-out data. Deleted estimation: divide the data into parts 0 and 1. In one model use 0 as the training data and 1 as the held-out data; in the other use 1 as training and 0 as held out. Do a weighted average of the two: Pdel(w1,..,wn) = (Hr01 + Hr10) / ((Nr0 + Nr1) * N), where N is the number of n-grams in each half.

  • Jelinek's formulation (data split into parts 0 and 1, each of size N) weights the two directions by the relative sizes of Nr0 and Nr1:

    Pdel(w1,..,wn) = [Hr01 / (Nr0 * N)] * [Nr0 / (Nr0 + Nr1)]

    + [Hr10 / (Nr1 * N)] * [Nr1 / (Nr0 + Nr1)]

    = (Hr01 + Hr10) / ((Nr0 + Nr1) * N)
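    A sketch of two-way deleted estimation, reading the combined formula above literally (unigrams; r is taken from part 0 here, and the r = 0 case, which needs the number of unseen types, is skipped):

        from collections import Counter

        def deleted_estimate(part0, part1):
            """P_del(w) = (H_r^01 + H_r^10) / ((N_r^0 + N_r^1) * N), r = count of w in part 0."""
            c0, c1 = Counter(part0), Counter(part1)
            N = len(part0)                            # size of one half (halves assumed equal)
            n_r0, n_r1 = Counter(c0.values()), Counter(c1.values())
            h01, h10 = Counter(), Counter()
            for w in set(part0) | set(part1):
                h01[c0[w]] += c1[w]   # types seen r times in part 0: total count in part 1
                h10[c1[w]] += c0[w]   # types seen r times in part 1: total count in part 0
            def p(w):
                r = c0[w]
                if r == 0 or n_r0[r] + n_r1[r] == 0:
                    return 0.0        # unseen in part 0: would need N_0, omitted here
                return (h01[r] + h10[r]) / ((n_r0[r] + n_r1[r]) * N)
            return p

        p_del = deleted_estimate("a b b c c c d".split(), "a a b c c e f".split())
        print(p_del("b"))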

  • Exercise 1: FSNLP uses Jane Austen's novels to compare these estimators: ELE, held-out estimation, and deleted estimation (e.g., training on 2/3 of the data and holding out 1/3). Compare the resulting estimates: which comes closest to the empirical test frequencies, and why?

  • Evaluation: perplexity. For a test set of N words w1,..,wN, PP = p(w1,..,wN)^(-1/N) = 2^(-(1/N) * Σi log2 p(wi | history)); lower perplexity means a better model.
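    A minimal perplexity computation, assuming the per-word log2 probabilities have already been obtained from some model:

        import math

        def perplexity(log2_probs):
            """PP = 2 ** ( -(1/N) * sum_i log2 p(w_i | history) )."""
            return 2 ** (-sum(log2_probs) / len(log2_probs))

        # a model that gives each of 4 test words probability 0.1 has perplexity 10
        print(perplexity([math.log2(0.1)] * 4))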

  • A subtlety: suppose the held-out data is the same size as the training data, H = N. For the Nr n-grams seen r times in training, Pho = Hr / (H * Nr), while pooling the training and held-out counts would give P = (r * Nr + Hr) / (2H). Since Pho < P, deleted estimation underestimates the expected frequency of objects that were seen once in the training data.

  • Leaving-one-out (Ney et al., 1997): the data is divided into K sets and the hold-out method is repeated K times.

  • Witten-Bell: first compute the probability of an unseen event occurring; then distribute that probability mass among the as yet unseen types (the ones with zero counts).

  • Probability of an Unseen Event, in the simple case of unigrams: T is the number of events that are seen for the first time in the corpus; this is just the number of types, since each type had to occur for a first time once. N is just the number of observations (tokens).

  • Distributing: the amount to be distributed is T / (N + T).

    Let Z be the number of events with count zero.

    So distributing it evenly gives each unseen event probability T / (Z * (N + T)).

  • Caveat: the unigram case is weird. Z is the number of things with count zero, i.e. the number of things we didn't see at all. Huh? Fortunately it makes more sense in the N-gram case. Take Shakespeare: recall that he produced only about 29,000 types, so there are potentially 29,000^2 bigrams, of which only about 300k occur; so Z is 29,000^2 - 300k.

  • Witten-Bell: in the case of bigrams, not all conditioning events are equally promiscuous; compare P(x | the) vs. P(x | going). So distribute the mass assigned to the zero-count bigrams according to their promiscuity; that is, condition the redistribution on how many different types occurred with a given prefix.

  • Distributing Among the Zeros: if a bigram wx wi has a zero count, give it probability T(wx) / (Z(wx) * (N(wx) + T(wx))), where Z(wx) is the number of bigrams starting with wx that were not seen, N(wx) is the actual frequency (token count) of bigrams beginning with wx, and T(wx) is the number of bigram types starting with wx.
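    A sketch of Witten-Bell for bigrams, using the quantities named above: T(wx) is the number of seen continuation types, N(wx) the number of bigram tokens starting with wx, Z(wx) the number of vocabulary words never seen after wx. The corpus and vocabulary below are toy examples.

        from collections import Counter, defaultdict

        def witten_bell_bigram(tokens, vocab):
            bigrams = Counter(zip(tokens, tokens[1:]))
            followers = defaultdict(set)
            for (a, b) in bigrams:
                followers[a].add(b)                   # continuation types per prefix
            prefix_tokens = Counter(tokens[:-1])      # N(wx): bigram tokens starting with wx

            def p(w, prev):
                N = prefix_tokens[prev]
                T = len(followers[prev])              # T(wx)
                Z = len(vocab) - T                    # Z(wx)
                c = bigrams[(prev, w)]
                if c > 0:
                    return c / (N + T)                # discounted seen bigram
                return T / (Z * (N + T)) if Z and (N + T) else 0.0   # even share of unseen mass
            return p

        p_wb = witten_bell_bigram("a b a b a c".split(), ["a", "b", "c", "d"])
        print(p_wb("b", "a"), p_wb("d", "a"))   # 0.4 for a seen bigram, 0.2 for an unseen one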

  • Good-Turing: replace the raw count r of an n-gram seen r times with an adjusted count r* = (r + 1) * N(r+1) / Nr, where Nr is the number of n-grams occurring exactly r times.

  • (Table of Nr, the number of n-grams occurring exactly r times, against r.)

  • For the n-grams with count 0, Good-Turing reserves a total probability mass of N1/N (N1 = number of n-grams seen exactly once, N = total number of n-gram tokens).

  • By Zipf's law, Nr decreases rapidly as r grows; for large r, N(r+1) may well be 0, which would make r* = 0! In practice, GT discounting is therefore applied only to small counts, r < k for some small threshold k, and larger counts are left unchanged.
  • Simple Good-Turing (Gale & Sampson): instead of the raw Nr values, use smoothed values S(r, Nr); Gale and Sampson fit Nr = a * r^b (with b < -1) and plug the fitted values into the GT formula.
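
    A sketch of the basic Good-Turing adjustment r* = (r + 1) * N(r+1) / Nr, applied only below a cutoff k as suggested above (the cutoff and the lack of any smoothing of the Nr values are simplifications; Simple Good-Turing would fit Nr = a * r^b first).

        from collections import Counter

        def good_turing(counts, k=5):
            """Adjusted counts r* = (r + 1) * N_{r+1} / N_r for r < k; larger r left unchanged.
               Also returns N_1 / N, the total probability mass reserved for unseen items."""
            n_r = Counter(counts.values())
            adjusted = {}
            for item, r in counts.items():
                if r < k and n_r[r + 1] > 0:
                    adjusted[item] = (r + 1) * n_r[r + 1] / n_r[r]
                else:
                    adjusted[item] = float(r)
            return adjusted, n_r[1] / sum(counts.values())

        counts = Counter("a b b c c c d d e".split())
        adj, unseen_mass = good_turing(counts)
        print(adj, unseen_mass)   # on real (Zipfian) data N_r falls with r, so r* < r
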
  • For more on GT and other smoothing methods, see Stanley F. Chen and Joshua Goodman, "An Empirical Study of Smoothing Techniques for Language Modeling", TR-10-98, August 1998.

  • Chen and Goodman's experiments show that Kneser-Ney smoothing performs best.

  • Combining Estimators: if we have several models of how the history predicts what comes next, then we might wish to combine them in the hope of producing an even better model. Combination methods considered: simple linear interpolation, Katz's backing-off, general linear interpolation.

  • Simple Linear Interpolation. One way of solving the sparseness problem in a trigram model is to mix that model with bigram and unigram models that suffer less from data sparseness. This can be done by linear interpolation (also called finite mixture models). When the functions being interpolated all use a subset of the conditioning information of the most discriminating function, this method is referred to as deleted interpolation.

  • Pli(wn | wn-2, wn-1) = λ1·P1(wn) + λ2·P2(wn | wn-1) + λ3·P3(wn | wn-1, wn-2), where 0 ≤ λi ≤ 1 and Σi λi = 1. The weights can be set automatically using the Expectation-Maximization (EM) algorithm.
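    A sketch of this interpolation, assuming the three component models are supplied as functions (the toy components and weights below are made up):

        def interpolate(p_uni, p_bi, p_tri, lambdas):
            """P_li(w | u, v) = l1*P1(w) + l2*P2(w | v) + l3*P3(w | u, v), l1 + l2 + l3 = 1."""
            l1, l2, l3 = lambdas
            return lambda w, u, v: l1 * p_uni(w) + l2 * p_bi(w, v) + l3 * p_tri(w, u, v)

        p_li = interpolate(lambda w: 0.01,             # stand-in unigram model
                           lambda w, v: 0.1,           # stand-in bigram model
                           lambda w, u, v: 0.4,        # stand-in trigram model
                           (0.2, 0.3, 0.5))
        print(p_li("the", "nothing", "but"))           # 0.2*0.01 + 0.3*0.1 + 0.5*0.4 = 0.232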

  • EM for the weights: 1. Initialize λj = 1/3, j = 1..3. 2. Compute the expected count of each component j on held-out data. 3. Re-estimate the λj (next λj) by normalizing these expected counts.

    4. Go back to step 2 until |λj - next λj| < ε for every j.

  • Expected counts on the held-out data H:

    cj = Σ_{w in H} λj · Pj(w | h) / Pli(w | h), for j = 1..3

    next λj = cj / Σ_k ck
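    A sketch of the EM procedure above for setting the weights on held-out data (the component models, held-out events, and all names below are toy stand-ins):

        def em_lambdas(components, heldout, iters=100, eps=1e-6):
            """components: functions p_j(w, u, v); heldout: (u, v, w) events."""
            k = len(components)
            lams = [1.0 / k] * k                          # step 1: uniform initialisation
            for _ in range(iters):
                counts = [0.0] * k
                for (u, v, w) in heldout:                 # step 2: expected counts
                    shares = [lam * pj(w, u, v) for lam, pj in zip(lams, components)]
                    total = sum(shares)
                    if total > 0:
                        for j in range(k):
                            counts[j] += shares[j] / total
                new = [c / sum(counts) for c in counts]   # step 3: re-estimate by normalising
                done = max(abs(a - b) for a, b in zip(lams, new)) < eps
                lams = new
                if done:                                  # step 4: stop once the change is tiny
                    break
            return lams

        comps = [lambda w, u, v: 0.05,
                 lambda w, u, v: 0.2 if w == v else 0.01,
                 lambda w, u, v: 0.5 if (u, v) == ("b", "c") else 0.0]
        print(em_lambdas(comps, [("a", "b", "c"), ("b", "c", "c"), ("b", "c", "d")]))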

  • Katz's Backing-Off: use the n-gram probability when there is enough training data; if not, back off to the (n-1)-gram probability (repeat as needed).

  • Katz: use the Good-Turing estimate for seen n-grams, and back off with weight α otherwise: P_katz(wi | wi-2, wi-1) = P_GT(wi | wi-2, wi-1) if C(wi-2 wi-1 wi) > 0, else α(wi-2, wi-1) · P_katz(wi | wi-1).

    Works pretty well. α is calculated so that the probabilities sum to 1.
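    A simplified back-off sketch in the spirit of Katz: seen trigrams get a discounted estimate (a fixed absolute discount here rather than the Good-Turing discount, to keep the sketch short), and α spreads the left-over mass over the bigram probabilities of the unseen continuations.

        from collections import Counter

        def backoff_trigram(tokens, discount=0.5):
            c3 = Counter(zip(tokens, tokens[1:], tokens[2:]))
            c2 = Counter(zip(tokens, tokens[1:]))
            c1 = Counter(tokens)

            def p_bigram(w, v):
                return c2[(v, w)] / c1[v] if c1[v] else 0.0

            def p(w, u, v):
                if c3[(u, v, w)] > 0:
                    return (c3[(u, v, w)] - discount) / c2[(u, v)]     # discounted trigram
                seen = {x for (a, b, x) in c3 if (a, b) == (u, v)}      # seen continuations
                left_over = (discount * len(seen) / c2[(u, v)]) if c2[(u, v)] else 1.0
                unseen_mass = 1.0 - sum(p_bigram(x, v) for x in seen)   # bigram mass left over
                alpha = left_over / unseen_mass if unseen_mass > 0 else 0.0
                return alpha * p_bigram(w, v)
            return p

        p = backoff_trigram("x a b c y a b d z b e".split())
        print(p("c", "a", "b"), p("e", "a", "b"))   # seen: 0.25; backed off: ~0.5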

  • Problems with Backing-Off: if the bigram w2 w3 is common but the trigram w1 w2 w3 is unseen, this may be a meaningful gap rather than a gap due to chance and scarce data (i.e., a grammatical null); in that case we may not want to back off to the lower-order probability.

  • General Linear Interpolation: in simple linear interpolation the weights were just single numbers, but one can define a more general and powerful model where the weights are a function of the history. For k probability functions P1,..,Pk, the general form of a linear interpolation model is: Pli(w | h) = Σ_{i=1..k} λi(h) · Pi(w | h), where 0 ≤ λi(h) ≤ 1 and Σi λi(h) = 1.

  • For more on deleted interpolation and linear interpolation (details omitted here), see David Magerman's work and Eugene Charniak's Statistical Language Learning.

  • Statistical Language Modeling Toolkits. CMU: http://mi.eng.cam.ac.uk/~prc14/toolkit.html  SRI: http://www.speech.sri.com/projects/srilm/

    Both toolkits are freely available: download them and try them out!

  • The Shannon Game (Claude E. Shannon, "Prediction and Entropy of Printed English", Bell System Technical Journal 30:50-64, 1951): predict the next word, given the (n-1) previous words.

  • Exercise: play the Shannon game on about 1000 words with n = 3, n = 4, and larger n (even n = 100), i.e., predicting from the previous n-1 words; mind Unicode when working with Chinese text.

  • Cache models: a word that has occurred recently is likely to occur again soon, so keep a cache of the recent history and interpolate its statistics with the static n-gram model.

  • A Shannon-game / cache demo: http://www.poeming.com/

  • CLUSTERING = CLASSES (same thing). What is P(Tuesday | party on)? Similar to P(Monday | party on); similar to P(Tuesday | celebration on). Put words in clusters: WEEKDAY = {Sunday, Monday, Tuesday, ...}, EVENT = {party, celebration, birthday, ...}

  • One cluster per word: hard clustering. WEEKDAY = {Sunday, Monday, Tuesday, ...}, MONTH = {January, February, April, May, June, ...}

    Multiple clusters per word: soft clustering. MONTH = {January, February, April, May, June, ...}, AUXILIARY = {Will, Should, May, Can, ...}

  • Let z be a word and Z its cluster: Pcl(z | xy) ≈ P(Z | XY) · P(z | Z), e.g. P(Tuesday | party on) ≈ P(WEEKDAY | EVENT PREP) · P(Tuesday | WEEKDAY). In practice, interpolate with a smoothed word model: P(z | xy) ≈ λ·Psmooth(z | xy) + (1 - λ)·Pcl(z | xy).
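    A sketch of the hard-clustering formula, with made-up cluster assignments and probability tables:

        cluster = {"Tuesday": "WEEKDAY", "Monday": "WEEKDAY",
                   "party": "EVENT", "celebration": "EVENT", "on": "PREP"}
        p_cluster = {("WEEKDAY", ("EVENT", "PREP")): 0.2}    # P(Z | XY), toy values
        p_word = {("Tuesday", "WEEKDAY"): 0.15}              # P(z | Z), toy values

        def p_cl(z, x, y):
            """P_cl(z | xy) ~= P(Z | XY) * P(z | Z), with Z, X, Y the words' clusters."""
            Z, X, Y = cluster.get(z), cluster.get(x), cluster.get(y)
            return p_cluster.get((Z, (X, Y)), 0.0) * p_word.get((z, Z), 0.0)

        print(p_cl("Tuesday", "party", "on"))   # 0.2 * 0.15 = 0.03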

  • With soft clustering, each word has a distribution over clusters, P(C | w), with Σ_C P(C | w) = 1; then Pcl(z | xy) = Σ_Z P(Z | XY) · P(z | Z).

  • Similarity-based smoothing: when C(xyz) = 0, use words w similar to z for which C(xyw) > 0:

    P(z | xy) = Σ_w P(w | xy) · sim_prob(w | z)

    sim_prob(w | z) = sim(w, z) / Σ_w' sim(w', z)

    How should the similarity sim(w, z) be defined?
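    A sketch of the similarity-based formula, with an arbitrary toy similarity function (all values below are made up):

        def sim_prob(w, z, sim, candidates):
            """sim_prob(w | z) = sim(w, z) / sum over w' of sim(w', z)."""
            total = sum(sim(c, z) for c in candidates)
            return sim(w, z) / total if total else 0.0

        def p_similar(z, p_given_xy, sim):
            """P(z | xy) = sum over seen w of P(w | xy) * sim_prob(w | z)."""
            seen = list(p_given_xy)
            return sum(p_given_xy[w] * sim_prob(w, z, sim, seen) for w in seen)

        p_given_xy = {"w1": 0.6, "w2": 0.4}                       # P(w | xy) for seen w
        sim = lambda a, b: {"w1": 0.3, "w2": 0.1}.get(a, 0.0)     # toy sim(w, z)
        print(p_similar("z", p_given_xy, sim))                    # 0.6*0.75 + 0.4*0.25 = 0.55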

  • Beyond n-grams: in "one of life's best things is a good job", whether the verb is "is" or "are" is determined by "one", which lies more than n words away; capturing such long-distance dependencies calls for structured models such as PCFGs.

  • Exercises: read about Katz backing-off in Stanley Chen and Joshua Goodman, "An Empirical Study of Smoothing Techniques for Language Modeling"; read Gale and Sampson, "Good-Turing Frequency Estimation Without Tears"; download the CMU and SRI SLM toolkits and experiment with them.

  • Further reading: Ronald Rosenfeld, "Two decades of statistical language modeling: where do we go from here?"