Chapter 6. Statistical Inference: n-gram Model over Sparse Data


Page 1: Chapter6. Statistical Inference :  n-gram Model over  Sparse Data

1

CHAPTER 6. STATISTICAL INFERENCE: N-GRAM MODEL OVER SPARSE DATA

Pusan National University, 2014. 4. 22

Myoungjin, Jung

Foundations of Statistical Natural Language Processing

Page 2: Chapter6. Statistical Inference :  n-gram Model over  Sparse Data

2

INTRODUCTION

Objective of Statistical NLP: perform statistical inference for the domain of natural language.

Statistical inference (broadly, two steps):
1. Take some data generated by an unknown probability distribution (a corpus is needed).
2. Make some inferences about this distribution (infer the probability distribution from that corpus).

The problem divides into three areas (the three steps of statistical language processing):
1. Dividing the training data into equivalence classes.
2. Finding a good statistical estimator for each equivalence class.
3. Combining multiple estimators.

Page 3: Chapter6. Statistical Inference :  n-gram Model over  Sparse Data

3

BINS : FORMING EQUIVALENCE CLASSES

Reliability vs. discrimination

Ex) "large green ___________" : tree? mountain? frog? car?

"swallowed the large green ________" : pill? broccoli?

Smaller n: more instances in the training data, better statistical estimates (more reliability).

Larger n: more information about the context of the specific instance (greater discrimination).

Page 4: Chapter6. Statistical Inference :  n-gram Model over  Sparse Data

4

BINS : FORMING EQUIVALENCE CLASSES N-gram models

"n-gram" = a sequence of n words. Predicting the next word relies on the Markov assumption:

Only the prior local context (the last few words) affects the next word.

The prediction task is to estimate $P(w_n \mid w_1, \ldots, w_{n-1})$.

Selecting an n: with a vocabulary size of 20,000 words,

n              Number of bins
2 (bigrams)    400,000,000
3 (trigrams)   8,000,000,000,000
4 (4-grams)    1.6 x 10^17
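As a quick check of the table above, here is a short sketch in Python; it assumes the bin count is simply V^n, all possible n-word sequences over a vocabulary of V = 20,000 word types, as on the slide.

from math import prod  # not strictly needed; kept minimal

# Number of parameter bins for an n-gram model: one bin per possible
# n-word sequence, i.e. V**n for a vocabulary of V word types.
V = 20_000

for n, name in [(2, "bigrams"), (3, "trigrams"), (4, "4-grams")]:
    print(f"{name}: {V ** n:,}  (= {V ** n:.1e} bins)")
# Matches the table: 4.0e+08, 8.0e+12, 1.6e+17.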

Page 5: Chapter6. Statistical Inference :  n-gram Model over  Sparse Data

5

Probability distribution: P(s), where s is a sentence. Ex.

P(If you're going to San Francisco, be sure ...) = P(If) * P(you're | If) * P(going | If you're) * P(to | If you're going) * ...

Markov assumption: only the last n-1 words are relevant for a prediction.

Ex. With n = 5: P(sure | If you're going to San Francisco, be) = P(sure | San Francisco, be)

BINS : FORMING EQUIVALENCE CLASSES

Page 6: Chapter6. Statistical Inference :  n-gram Model over  Sparse Data

6

BINS : FORMING EQUIVALENCE CLASSES

N-gram: a sequence of length n, together with its count. Ex. 5-gram: "If you're going to San". Sequence notation: $w_1^n$ denotes the sequence $w_1 \cdots w_n$.

Markov assumption formalized:

$P(w_k \mid w_1, \ldots, w_{k-1}) = P(w_k \mid w_{k-n+1}, \ldots, w_{k-1})$

i.e., the next word is conditioned only on the last n-1 words.

Page 7: Chapter6. Statistical Inference :  n-gram Model over  Sparse Data

7

BINS : FORMING EQUIVALENCE CLASSES

Instead of P(s), we need only one conditional probability, $P(w_k \mid w_1, \ldots, w_{k-1})$, which the Markov assumption simplifies to $P(w_k \mid w_{k-n+1}, \ldots, w_{k-1})$ (the last n-1 words).

Next word prediction: $\mathrm{NWP}(w_1, \ldots, w_{k-1}) = \arg\max_{w \in V} P(w \mid w_{k-n+1}, \ldots, w_{k-1})$, where $V$ is the set of all words in the corpus.

Page 8: Chapter6. Statistical Inference :  n-gram Model over  Sparse Data

8

BINS : FORMING EQUIVALENCE CLASSES

Ex. The easiest way (relative frequency):

$P(w_k \mid w_{k-n+1}, \ldots, w_{k-1}) = \dfrac{C(w_{k-n+1} \cdots w_k)}{C(w_{k-n+1} \cdots w_{k-1})}$

$P(\text{San} \mid \text{If you're going to}) = \dfrac{C(\text{If you're going to San})}{C(\text{If you're going to})}$

Page 9: Chapter6. Statistical Inference :  n-gram Model over  Sparse Data

9

STATISTICAL ESTIMATORS Given the observed training data.

How do you develop a model (a probability distribution) to predict future events? (i.e., obtain better probability estimates)

Target feature: the probability estimate of an n-gram.

Estimating the unknown probability distribution of n-grams:

$P(w_1 \cdots w_n)$ and $P(w_n \mid w_1, \ldots, w_{n-1}) = \dfrac{P(w_1 \cdots w_n)}{P(w_1 \cdots w_{n-1})}$

Page 10: Chapter6. Statistical Inference :  n-gram Model over  Sparse Data

10

STATISTICAL ESTIMATORS Notation for the statistical estimation chapter.

N                    Number of training instances

B                    Number of bins the training instances are divided into

$w_1^n$              An n-gram $w_1 \cdots w_n$ in the training text

$C(w_1 \cdots w_n)$  Frequency of the n-gram $w_1 \cdots w_n$ in the training text

r                    Frequency of an n-gram

$f(\cdot)$           Frequency estimate of a model

$N_r$                Number of bins that have r training instances in them

$T_r$                Total count of n-grams of frequency r in further data

h                    'History' of preceding words

Page 11: Chapter6. Statistical Inference :  n-gram Model over  Sparse Data

11

STATISTICAL ESTIMATORS Example - Instances in the training corpus:

“inferior to ________”

Page 12: Chapter6. Statistical Inference :  n-gram Model over  Sparse Data

12

MAXIMUM LIKELIHOOD ESTIMATION (MLE) Definition

Using the relative frequency as a probability estimate. Example :

In the corpus, we find 10 training instances of "comes across".

8 times they were followed by "as": P(as | comes across) = 0.8. Once each by "more" and by "a": P(more | comes across) = 0.1, P(a | comes across) = 0.1. For any word x other than these three: P(x | comes across) = 0.0.

Formula

$P_{MLE}(w_1 \cdots w_n) = \dfrac{C(w_1 \cdots w_n)}{N} = \dfrac{r}{N}$

$P_{MLE}(w_n \mid w_1, \ldots, w_{n-1}) = \dfrac{C(w_1 \cdots w_n)}{C(w_1 \cdots w_{n-1})}$
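A minimal sketch of the MLE conditional estimate in Python; the helper name and the tiny example corpus are illustrative only, not from the slides.

from collections import Counter

def mle_bigram(tokens):
    """Return a function computing P_MLE(w2 | w1) = C(w1 w2) / C(w1)."""
    unigram = Counter(tokens)
    bigram = Counter(zip(tokens, tokens[1:]))

    def prob(w1, w2):
        # Unseen histories and unseen bigrams both get probability 0 under MLE;
        # that is exactly the sparse-data problem the later estimators address.
        return bigram[(w1, w2)] / unigram[w1] if unigram[w1] else 0.0

    return prob

p = mle_bigram("the bigram model uses the preceding word".split())
print(p("the", "bigram"))   # 0.5: "the" occurs twice, once followed by "bigram"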

Page 13: Chapter6. Statistical Inference :  n-gram Model over  Sparse Data

13

MAXIMUM LIKELIHOOD ESTIMATION (MLE)

Page 14: Chapter6. Statistical Inference :  n-gram Model over  Sparse Data

14

MAXIMUM LIKELIHOOD ESTIMATION (MLE)

Example 1. A Paragraph Using Training Data

The bigram model uses the preceding word to help predict the next word. (End) In general, this helps enormously, and gives us a much better model. (End) In some cases the estimated probability of the word that actually comes next has gone up by about an order of magnitude (was, to, sisters). (End) However, note that the bigram model is not guaranteed to increase the probability estimate. (End)

Word counts (N = 79): C(the) = 7, C(bigram) = 2, C(model) = 3, C(the, bigram) = 2, C(the, bigram, model) = 2

1-gram: P(the) = 7/79, P(bigram) = 2/79
2-gram: P(bigram | the) = 2/7
3-gram: P(model | the, bigram) = 2/2

Page 15: Chapter6. Statistical Inference :  n-gram Model over  Sparse Data

15

LAPLACE'S LAW, LIDSTONE'S LAW AND THE JEFFREYS-PERKS LAW

Laplace's law (1814; 1995)

Add a little bit of probability space to unseen events:

$P_{LAP}(w_1 \cdots w_n) = \dfrac{C(w_1 \cdots w_n) + 1}{N + B} = \dfrac{r + 1}{N + B}$
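A minimal sketch of Laplace's law for bigrams, following the joint-probability form above; the function name and the vocab_size argument are assumptions for illustration.

from collections import Counter

def laplace_bigram(tokens, vocab_size):
    """P_LAP(w1 w2) = (C(w1 w2) + 1) / (N + B), where N is the number of
    bigram tokens in the training data and B = vocab_size**2 is the number
    of bigram bins."""
    bigram = Counter(zip(tokens, tokens[1:]))
    N = sum(bigram.values())
    B = vocab_size ** 2
    return lambda w1, w2: (bigram[(w1, w2)] + 1) / (N + B)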

Page 16: Chapter6. Statistical Inference :  n-gram Model over  Sparse Data

16

LAPLACE'S LAW, LIDSTONE'S LAW AND THE JEFFREYS-PERKS LAW

Words (N = 79; B = seen (51) + unseen (70) = 121)

       MLE          Laplace's law
A      0.0886076    0.0400000
B      0.2857143    0.0050951
C      1.0000000    0.0083089

(A: P(the), B: P(bigram | the), C: P(model | the, bigram), from the earlier example.)

Page 17: Chapter6. Statistical Inference :  n-gram Model over  Sparse Data

17

LAPLACE'S LAW, LIDSTONE'S LAW AND THE JEFFREYS-PERKS LAW

Pages 202-203 (the Associated Press [AP] newswire vocabulary example)

Laplace's law adds a little probability space for unseen events, but it adds far too much:

With a 44-million-word corpus the vocabulary is 400,653 words, giving about 160,000,000,000 possible bigrams, so the number of bins is far larger than the number of training instances. Laplace's law puts B in the denominator to reserve probability space for unseen events, but as a result about 46.5% of the probability space ends up assigned to unseen events: N_0 * P_LAP = 74,671,100,000 * 0.000137 / 22,000,000 ≈ 0.465.

Page 18: Chapter6. Statistical Inference :  n-gram Model over  Sparse Data

18

Lidstone's law (1920) and the Jeffreys-Perks law (1973)

Lidstone's law: add some positive value $\lambda$ instead of 1:

$P_{Lid}(w_1 \cdots w_n) = \dfrac{C(w_1 \cdots w_n) + \lambda}{N + B\lambda}$

Jeffreys-Perks law: $\lambda = 0.5$, also called ELE (Expected Likelihood Estimation).

LAPLACE'S LAW, LIDSTONE'S LAW AND THE JEFFREYS-PERKS LAW
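The previous sketch generalizes directly to Lidstone's law: lambda = 1 recovers Laplace's law, lambda = 0.5 gives the Jeffreys-Perks law (ELE), and lambda -> 0 approaches the MLE. Function name and arguments are again illustrative.

from collections import Counter

def lidstone_bigram(tokens, vocab_size, lam=0.5):
    """P_Lid(w1 w2) = (C(w1 w2) + lam) / (N + B*lam), with B = vocab_size**2."""
    bigram = Counter(zip(tokens, tokens[1:]))
    N = sum(bigram.values())
    B = vocab_size ** 2
    return lambda w1, w2: (bigram[(w1, w2)] + lam) / (N + B * lam)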

Page 19: Chapter6. Statistical Inference :  n-gram Model over  Sparse Data

19

LIDSTONE’S LAW

Using Lidstone's law, instead of adding one we add some smaller positive value $\lambda$, where the parameter is in the range $0 < \lambda < 1$. The estimate can be rewritten as a linear interpolation between the MLE and the uniform prior:

$P_{Lid}(w_1 \cdots w_n) = \mu \dfrac{C(w_1 \cdots w_n)}{N} + (1 - \mu) \dfrac{1}{B}$, where $\mu = \dfrac{N}{N + B\lambda}$.

Page 20: Chapter6. Statistical Inference :  n-gram Model over  Sparse Data

20

LIDSTONE’S LAW

Here, $\lambda = 0$ gives the maximum likelihood estimate, $\lambda = 1$ gives Laplace's law, and as $\lambda$ tends to infinity we approach the uniform estimate $1/B$.

$\mu$ represents the trust we have in relative frequencies: $\lambda < 1$ implies more trust in relative frequencies than Laplace's law, while $\lambda > 1$ represents less trust in relative frequencies.

In practice, people use values of $\lambda$ in the range $0 < \lambda < 1$, a common value being $\lambda = 0.5$ (the Jeffreys-Perks law).

Page 21: Chapter6. Statistical Inference :  n-gram Model over  Sparse Data

21

JEFFREYS-PERKS LAW

Using Lidstone's law with different values of $\lambda$ (the MLE is $\lambda = 0$, the Jeffreys-Perks law is $\lambda = 0.5$, Laplace's law is $\lambda = 1$):

       MLE        Lidstone    Jeffreys-Perks  Lidstone    Laplace     Lidstone
       (λ = 0)    (λ = 0.3)   (λ = 0.5)       (λ = 0.7)   (λ = 1)     (λ = 2)
A      0.0886     0.0633      0.0538          0.0470      0.0400      0.0280
B      0.2857     0.0081      0.0063          0.0056      0.0051      0.0049
C      1.0000     0.0084      0.0085          0.0083      0.0083      0.0083

*A: P(the), B: P(bigram | the), C: P(model | the, bigram)

Page 22: Chapter6. Statistical Inference :  n-gram Model over  Sparse Data

22

HELD OUT ESTIMATION(JELINEK AND MERCER, 1985)

For each n-gram $w_1 \cdots w_n$, let:

$C_1(w_1 \cdots w_n)$ = frequency of $w_1 \cdots w_n$ in the training data

$C_2(w_1 \cdots w_n)$ = frequency of $w_1 \cdots w_n$ in the held-out data

Let $T_r = \sum_{\{w_1 \cdots w_n \,:\, C_1(w_1 \cdots w_n) = r\}} C_2(w_1 \cdots w_n)$ be the total number of times that all n-grams that appeared r times in the training text appeared in the held-out data. An estimate for the probability of one of these n-grams is:

$P_{ho}(w_1 \cdots w_n) = \dfrac{T_r}{N_r N}$, where $r = C_1(w_1 \cdots w_n)$.
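A sketch of held-out estimation for bigrams; the function name and the optional total_bins argument (the bin count B, e.g. V**2 for a bigram model) are mine, and real systems would handle the r = 0 class more carefully.

from collections import Counter

def held_out_probs(train_tokens, heldout_tokens, total_bins=None):
    """Held-out estimation for bigrams: every bigram whose training frequency
    is r receives probability T_r / (N_r * N), where T_r is the total held-out
    count of all bigrams with training frequency r, N_r is the number of such
    bigrams (bins), and N is the number of bigram tokens in the held-out data."""
    c_train = Counter(zip(train_tokens, train_tokens[1:]))
    c_held = Counter(zip(heldout_tokens, heldout_tokens[1:]))
    N = sum(c_held.values())

    Nr, Tr = Counter(), Counter()
    for bg, r in c_train.items():
        Nr[r] += 1
        Tr[r] += c_held[bg]
    # Bigrams unseen in training (r = 0): T_0 is the held-out mass they carry;
    # N_0 is the number of unseen bins, which requires knowing the bin total B.
    Tr[0] = sum(c for bg, c in c_held.items() if bg not in c_train)
    if total_bins is not None:
        Nr[0] = total_bins - len(c_train)

    def prob(w1, w2):
        r = c_train[(w1, w2)]
        return Tr[r] / (Nr[r] * N) if Nr[r] and N else 0.0

    return prob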

Page 23: Chapter6. Statistical Inference :  n-gram Model over  Sparse Data

23

[Full text ( ): , respectively], unseen words: unknown. [Words ( ): , unseen words: 70, respectively]: (Training Data)

[Words ( ): , unseen words: 51 - , respectively]: (Held-out Data)

(1-gram) Training data: , , ( ); Held-out data: , , ( )

HELD OUT ESTIMATION(JELINEK AND MERCER, 1985)

Page 24: Chapter6. Statistical Inference :  n-gram Model over  Sparse Data

24

The idea is to see how many times a bigram that occurred r times in the training text occurs in additionally drawn text (further text).

Held-out estimation: a method for predicting how often a bigram that appeared r times in the training text will appear in further text.

Test data (independent of the training data) amounts to only 5-10% of the whole data set, but that is enough to be reliable. We want to split the data into training data and test data (data used for estimation and data held back for validation). The held-out data is about 10%, and the held-out estimates for the n-grams are computed against it.

HELD OUT ESTIMATION(JELINEK AND MERCER, 1985)

Page 25: Chapter6. Statistical Inference :  n-gram Model over  Sparse Data

25

CROSS-VALIDATION (DELETED ESTIMATION; JELINEK AND MERCER, 1985)

Use the data for both training and validation:

Divide the training data into two parts. Train on A, validate on B; train on B, validate on A. Then combine the two models.

[Diagram: parts A and B alternate between the roles of training set and validation set, producing Model 1 and Model 2, which are combined into the final model.]

Page 26: Chapter6. Statistical Inference :  n-gram Model over  Sparse Data

26

CROSS-VALIDATION (DELETED ESTIMATION; JELINEK AND MERCER, 1985)

Cross-validation: the training data is used both as initial training data and as held-out data.

On large training corpora, deleted estimation works better than held-out estimation.

Let $N_r^a$ be the number of n-grams occurring r times in part a of the training data, and $T_r^{ab}$ the total number of occurrences in part b of those n-grams that occur r times in part a. Then:

$P_{ho}(w_1 \cdots w_n) = \dfrac{T_r^{01}}{N_r^0 N}$ or $\dfrac{T_r^{10}}{N_r^1 N}$, where $r = C(w_1 \cdots w_n)$

$P_{del}(w_1 \cdots w_n) = \dfrac{T_r^{01} + T_r^{10}}{N (N_r^0 + N_r^1)}$, where $r = C(w_1 \cdots w_n)$
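A sketch of the pooled deleted-estimation probabilities for bigrams; again the function name and the total_bins argument (the bin count B) are illustrative assumptions.

from collections import Counter

def deleted_estimation(part_a, part_b, total_bins):
    """Return {r: P_del} for bigrams, following
        P_del = (T_r^{01} + T_r^{10}) / (N * (N_r^0 + N_r^1)),
    where part A and part B are the two halves of the training data."""
    c_a = Counter(zip(part_a, part_a[1:]))
    c_b = Counter(zip(part_b, part_b[1:]))
    N = sum(c_a.values())                       # assume equally sized halves

    def counts_of_counts(c_train, c_other):
        """N_r and T_r when training on one half and validating on the other."""
        Nr, Tr = Counter(), Counter()
        for bg, r in c_train.items():
            Nr[r] += 1
            Tr[r] += c_other[bg]
        Nr[0] = total_bins - len(c_train)       # unseen bins
        Tr[0] = sum(c for bg, c in c_other.items() if bg not in c_train)
        return Nr, Tr

    Nr0, Tr01 = counts_of_counts(c_a, c_b)      # train on A, validate on B
    Nr1, Tr10 = counts_of_counts(c_b, c_a)      # train on B, validate on A

    return {r: (Tr01[r] + Tr10[r]) / (N * (Nr0[r] + Nr1[r]))
            for r in set(Nr0) | set(Nr1) if Nr0[r] + Nr1[r] > 0}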

Page 27: Chapter6. Statistical Inference :  n-gram Model over  Sparse Data

27

CROSS-VALIDATION (DELETED ESTIMATION; JELINEK AND MERCER, 1985)

[Full text ( ): , respectively], unseen words: unknown. [Words ( ): , unseen words: 70, respectively]: (Training Data)

[A-part words ( ): , unseen words: 101, respectively] [B-part words ( ): , unseen words: 90 - , respectively]

A-part data: , ( )

B-part data: , ( )

, .

Page 28: Chapter6. Statistical Inference :  n-gram Model over  Sparse Data

28

CROSS-VALIDATION (DELETED ESTIMATION; JELINEK AND MERCER, 1985)

28

[B-part words ( ): , unseen words: 90 - , respectively] [A-part words ( ): , unseen words: 101 + , respectively]

B-part data: , ( )

A-part data: , ( )

, .

[Result]

, .

Page 29: Chapter6. Statistical Inference :  n-gram Model over  Sparse Data

29

CROSS-VALIDATION (DELETED ESTIMATION; JELINEK AND MERCER, 1985)

With the held-out estimation idea, we can get the same effect by splitting the training data into two parts; this kind of method is called cross-validation.

It is a more effective method: combining the two directions reduces the discrepancy between $N_r^0$ and $N_r^1$.

On a large training corpus, deleted estimation is more reliable than held-out estimation.

Page 30: Chapter6. Statistical Inference :  n-gram Model over  Sparse Data

30

GOOD-TURING ESTIMATION (GOOD, 1953): [BINOMIAL DISTRIBUTION]

Idea: re-estimate the probability mass assigned to n-grams with zero counts.

Adjust actual counts to expected counts with the formula:

$r^* = (r + 1) \dfrac{E(N_{r+1})}{E(N_r)}$

$P_{GT} = \dfrac{r^*}{N}$

($r^*$ is an adjusted frequency; $E$ denotes the expectation of a random variable)
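A sketch of the Good-Turing count adjustment, using the observed counts-of-counts N_r directly in place of their expectations; practical implementations (e.g. Simple Good-Turing) smooth the N_r curve first because high-r bins are sparse and noisy.

from collections import Counter

def good_turing_adjusted_counts(ngram_counts):
    """Good-Turing: replace each raw count r by r* = (r + 1) * N_{r+1} / N_r,
    using the observed counts-of-counts N_r as a stand-in for expectations.
    Returns ({r: r*}, P0), where P0 = N_1 / N is the total probability mass
    reallocated to unseen n-grams."""
    N = sum(ngram_counts.values())
    Nr = Counter(ngram_counts.values())            # counts of counts
    r_star = {}
    for r in sorted(Nr):
        if Nr[r + 1] > 0:
            r_star[r] = (r + 1) * Nr[r + 1] / Nr[r]
        else:
            r_star[r] = float(r)                   # no N_{r+1} data: keep r as is
    P0 = Nr[1] / N if N else 0.0                   # mass reserved for unseen events
    return r_star, P0

# Usage: for a seen n-gram with training count r, P_GT = r_star[r] / N;
# unseen n-grams share the leftover mass P0.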

Page 31: Chapter6. Statistical Inference :  n-gram Model over  Sparse Data

31

GOOD-TURING ESTIMATION (GOOD, 1953): [BINOMIAL DISTRIBUTION]

If

If

When it is small: . When it is large: . So, counts that were over-estimated are adjusted downward (under-estimated).

Page 32: Chapter6. Statistical Inference :  n-gram Model over  Sparse Data

32

NOTE

Drawback: over-estimation.

[Two discounting models] (Ney and Essen, 1993; Ney et al., 1994)

Absolute discounting: lower each over-estimated nonzero count by a fixed amount $\delta$:

$P_{abs}(w_1 \cdots w_n) = \dfrac{r - \delta}{N}$ if $r > 0$, and $\dfrac{(B - N_0)\,\delta}{N_0 N}$ otherwise.

Linear discounting: scale the nonzero estimates down using $\alpha$:

$P_{lin}(w_1 \cdots w_n) = \dfrac{(1 - \alpha)\, r}{N}$ if $r > 0$, and $\dfrac{\alpha}{N_0}$ otherwise.

Page 33: Chapter6. Statistical Inference :  n-gram Model over  Sparse Data

33

NOTE

Drawback: over-estimation.

[Natural Law of Succession] (Ristad, 1995)

Page 34: Chapter6. Statistical Inference :  n-gram Model over  Sparse Data

34

COMBINING ESTIMATORS Basic Idea

Consider how to combine multiple probability estimates from various different models.

How can you develop a model that uses different-length n-grams as appropriate?

Simple linear interpolation:

$P_{li}(w_n \mid w_{n-2}, w_{n-1}) = \lambda_1 P_1(w_n) + \lambda_2 P_2(w_n \mid w_{n-1}) + \lambda_3 P_3(w_n \mid w_{n-2}, w_{n-1})$

where $0 \le \lambda_i \le 1$ and $\sum_i \lambda_i = 1$.

Combination of trigram, bigram and unigram
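A sketch of the trigram/bigram/unigram interpolation; the weights shown are placeholder values, since in practice the lambdas are tuned on held-out data (for example with EM).

from collections import Counter

def interpolated_trigram(tokens, lambdas=(0.5, 0.3, 0.2)):
    """P_li(w3 | w1, w2) = l1*P(w3) + l2*P(w3 | w2) + l3*P(w3 | w1, w2),
    with MLE component models and fixed illustrative weights summing to 1."""
    l1, l2, l3 = lambdas
    uni = Counter(tokens)
    bi = Counter(zip(tokens, tokens[1:]))
    tri = Counter(zip(tokens, tokens[1:], tokens[2:]))
    N = len(tokens)

    def prob(w1, w2, w3):
        p1 = uni[w3] / N
        p2 = bi[(w2, w3)] / uni[w2] if uni[w2] else 0.0
        p3 = tri[(w1, w2, w3)] / bi[(w1, w2)] if bi[(w1, w2)] else 0.0
        return l1 * p1 + l2 * p2 + l3 * p3

    return prob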

Page 35: Chapter6. Statistical Inference :  n-gram Model over  Sparse Data

35

COMBINING ESTIMATORS [Katz’s backing-off] (Katz, 1987)

Example

Page 36: Chapter6. Statistical Inference :  n-gram Model over  Sparse Data

36

COMBINING ESTIMATORS [Katz's backing-off] (Katz, 1987)

If the sequence is unseen, use a shorter sequence. Ex. if P(San | going to) = 0, use P(San | to).

$P_{bo}(w_n \mid w_{n-i+1}, \ldots, w_{n-1}) = \begin{cases} \tau(w_n \mid w_{n-i+1}, \ldots, w_{n-1}) & \text{if } C(w_{n-i+1} \cdots w_n) > 0 \\ \lambda(w_{n-i+1}, \ldots, w_{n-1})\, P_{bo}(w_n \mid w_{n-i+2}, \ldots, w_{n-1}) & \text{if } C(w_{n-i+1} \cdots w_n) = 0 \end{cases}$

Here $\tau$ is the (discounted) higher-order probability and $\lambda$ is the back-off weight applied to the lower-order probability.
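A back-off sketch for bigrams in the spirit of the formula above; a real Katz model derives tau from Good-Turing discounted counts, whereas here a simple fixed discount stands in (the function name and discount value are assumptions).

from collections import Counter

def backoff_bigram(tokens, discount=0.5):
    """If the bigram was seen, use a discounted estimate tau(w2 | w1);
    otherwise back off to the unigram, scaled by alpha(w1) so that the
    distribution over w2 still sums to 1."""
    uni = Counter(tokens)
    bi = Counter(zip(tokens, tokens[1:]))
    N = len(tokens)

    def tau(w1, w2):
        return max(bi[(w1, w2)] - discount, 0) / uni[w1]

    def alpha(w1):
        # Probability mass left over after discounting the seen bigrams ...
        left = 1.0 - sum(tau(w1, w2) for (a, w2) in bi if a == w1)
        # ... divided by the unigram mass of the unseen continuations.
        unseen = 1.0 - sum(uni[w2] / N for (a, w2) in bi if a == w1)
        return left / unseen if unseen > 0 else 0.0

    def prob(w1, w2):
        if bi[(w1, w2)] > 0:
            return tau(w1, w2)
        return alpha(w1) * (uni[w2] / N)

    return prob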

Page 37: Chapter6. Statistical Inference :  n-gram Model over  Sparse Data

37

COMBINING ESTIMATORS [General linear interpolation]

$P_{li}(w \mid h) = \sum_{i=1}^{k} \lambda_i(h)\, P_i(w \mid h)$, where $0 \le \lambda_i(h) \le 1$ and $\sum_i \lambda_i(h) = 1$ (the weights may depend on the history h).

Page 38: Chapter6. Statistical Inference :  n-gram Model over  Sparse Data

38

COMBINING ESTIMATORS

Interpolated smoothing: $P_{int}(w_n \mid w_{n-1}) = \tau(w_n \mid w_{n-1}) + \lambda(w_{n-1})\, P_{int}(w_n)$

(higher-order probability + weight x lower-order probability; unlike back-off, the lower-order model contributes even when the higher-order count is nonzero)

Seems to work better than back-off smoothing

Page 39: Chapter6. Statistical Inference :  n-gram Model over  Sparse Data

39

Witten-Bell smoothing

$P_{WB}(w_n \mid w_{n-1}) = \lambda_{w_{n-1}}\, P_{MLE}(w_n \mid w_{n-1}) + (1 - \lambda_{w_{n-1}})\, P_{WB}(w_n)$

$1 - \lambda_{w_{n-1}} = \dfrac{N_{1+}(w_{n-1}\,\bullet)}{N_{1+}(w_{n-1}\,\bullet) + C(w_{n-1})}$

where $N_{1+}(w_{n-1}\,\bullet) = |\{ w : C(w_{n-1} w) > 0 \}|$ is the number of distinct word types seen after $w_{n-1}$.

NOTE
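A sketch of Witten-Bell smoothing for bigrams, with the history weight computed from the number of distinct continuation types T(w1); the function name is illustrative, and the lower-order model is simply the unigram MLE rather than a recursively smoothed one.

from collections import Counter

def witten_bell_bigram(tokens):
    """Interpolate the MLE bigram with the unigram, giving the lower-order
    model the weight 1 - lambda(w1) = T(w1) / (T(w1) + C(w1))."""
    uni = Counter(tokens)
    bi = Counter(zip(tokens, tokens[1:]))
    N = len(tokens)
    # T(w1) = |{w2 : C(w1 w2) > 0}|, the number of observed continuation types.
    T = Counter(w1 for (w1, _w2) in bi)

    def prob(w1, w2):
        c1, t1 = uni[w1], T[w1]
        p_uni = uni[w2] / N
        if c1 == 0:
            return p_uni                    # unknown history: fall back entirely
        lam = c1 / (c1 + t1)
        return lam * (bi[(w1, w2)] / c1) + (1 - lam) * p_uni

    return prob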

Page 40: Chapter6. Statistical Inference :  n-gram Model over  Sparse Data

40

Absolute discounting

Like Jelinek-Mercer smoothing, it interpolates higher- and lower-order models. But instead of multiplying the higher-order term by a weight $\lambda$, we subtract a fixed discount $D \in [0, 1]$ from each nonzero count:

$P_{abs}(w_n \mid w_{n-1}) = \dfrac{\max(C(w_{n-1} w_n) - D,\, 0)}{C(w_{n-1})} + (1 - \lambda_{w_{n-1}})\, P(w_n)$

To make it sum to 1: $(1 - \lambda_{w_{n-1}}) = \dfrac{D}{C(w_{n-1})}\, N_{1+}(w_{n-1}\,\bullet)$

Choose $D$ using held-out estimation.

NOTE
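A sketch of interpolated absolute discounting for bigrams; D = 0.75 is an assumed value, normally chosen on held-out data, and the lower-order model here is the plain unigram MLE.

from collections import Counter

def absolute_discount_bigram(tokens, D=0.75):
    """Subtract a fixed discount D from every nonzero bigram count and hand the
    freed-up mass, (D / C(w1)) * N1+(w1 .), to the unigram distribution so that
    each history still sums to 1."""
    uni = Counter(tokens)
    bi = Counter(zip(tokens, tokens[1:]))
    N = len(tokens)
    T = Counter(w1 for (w1, _w2) in bi)      # N1+(w1 .): seen continuation types

    def prob(w1, w2):
        c1 = uni[w1]
        p_uni = uni[w2] / N
        if c1 == 0:
            return p_uni
        backoff_weight = D * T[w1] / c1
        return max(bi[(w1, w2)] - D, 0) / c1 + backoff_weight * p_uni

    return prob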

Page 41: Chapter6. Statistical Inference :  n-gram Model over  Sparse Data

41

KN smoothing (1995)

An extension of absolute discounting with a clever way of constructing the lower-order (back-off) model. Idea: the lower-order model is significant only when the count is small or zero in the higher-order model, and so it should be optimized for that purpose:

$P_{KN}(w_n \mid w_{n-1}) = \dfrac{\max(C(w_{n-1} w_n) - D,\, 0)}{C(w_{n-1})} + \dfrac{D}{C(w_{n-1})}\, N_{1+}(w_{n-1}\,\bullet)\, P_{cont}(w_n)$

NOTE

Page 42: Chapter6. Statistical Inference :  n-gram Model over  Sparse Data

42

An empirical study of smoothing techniques for language modeling (Chen and Goodman, 1999)

For a bigram model, we would like to select a smoothed distribution $P_{KN}$ that satisfies the following constraint on unigram marginals for all $w_i$:

(1) (constraint) $\sum_{w_{i-1}} P_{KN}(w_{i-1} w_i) = \dfrac{C(w_i)}{N}$
(2) From (1): $\dfrac{C(w_i)}{N} = \sum_{w_{i-1}} P(w_{i-1})\, P_{KN}(w_i \mid w_{i-1})$
(3) From (2), taking $P(w_{i-1}) = \dfrac{C(w_{i-1})}{N}$: $C(w_i) = \sum_{w_{i-1}} C(w_{i-1})\, P_{KN}(w_i \mid w_{i-1})$

NOTE

Page 43: Chapter6. Statistical Inference :  n-gram Model over  Sparse Data

43

$C(w_i) = \sum_{w_{i-1}} C(w_{i-1}) \left[ \dfrac{\max(C(w_{i-1} w_i) - D,\, 0)}{C(w_{i-1})} + \dfrac{D}{C(w_{i-1})}\, N_{1+}(w_{i-1}\,\bullet)\, P_{cont}(w_i) \right]$

$= \sum_{w_{i-1}:\, C(w_{i-1} w_i) > 0} \big( C(w_{i-1} w_i) - D \big) \;+\; D\, P_{cont}(w_i) \sum_{w_{i-1}} N_{1+}(w_{i-1}\,\bullet)$

$= C(w_i) - D\, N_{1+}(\bullet\, w_i) \;+\; D\, P_{cont}(w_i)\, N_{1+}(\bullet\, \bullet)$

NOTE

Page 44: Chapter6. Statistical Inference :  n-gram Model over  Sparse Data

44

$N_{1+}(\bullet\, w_i) = |\{ w_{i-1} : C(w_{i-1} w_i) > 0 \}|$

$N_{1+}(\bullet\, \bullet) = |\{ (w_{i-1}, w_i) : C(w_{i-1} w_i) > 0 \}| = \sum_{w_i} N_{1+}(\bullet\, w_i)$

Solving the previous equation for the lower-order distribution gives $P_{cont}(w_i) = \dfrac{N_{1+}(\bullet\, w_i)}{N_{1+}(\bullet\, \bullet)}$

NOTE

Page 45: Chapter6. Statistical Inference :  n-gram Model over  Sparse Data

45

Generalizing to higher-order models, we have that

$P_{KN}(w_i \mid w_{i-n+2}^{i-1}) = \dfrac{N_{1+}(\bullet\, w_{i-n+2}^{i})}{N_{1+}(\bullet\, w_{i-n+2}^{i-1}\, \bullet)}$

where $N_{1+}(\bullet\, w_{i-n+2}^{i}) = |\{ w_{i-n+1} : C(w_{i-n+1}^{i}) > 0 \}|$ and $N_{1+}(\bullet\, w_{i-n+2}^{i-1}\, \bullet) = |\{ (w_{i-n+1}, w_i) : C(w_{i-n+1}^{i}) > 0 \}| = \sum_{w_i} N_{1+}(\bullet\, w_{i-n+2}^{i})$

NOTE
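Putting the pieces together, a sketch of interpolated Kneser-Ney for bigrams, where the lower-order model uses the continuation counts derived above; D = 0.75 is again an assumed value, normally tuned on held-out data.

from collections import Counter

def kneser_ney_bigram(tokens, D=0.75):
    """Absolute discounting on the bigram counts, but with a lower-order model
    based on continuation counts: P_cont(w2) = N1+(. w2) / N1+(. .)."""
    uni = Counter(tokens)
    bi = Counter(zip(tokens, tokens[1:]))
    T_after = Counter(w1 for (w1, _w2) in bi)    # N1+(w1 .)
    T_before = Counter(w2 for (_w1, w2) in bi)   # N1+(. w2)
    total_types = len(bi)                        # N1+(. .)

    def prob(w1, w2):
        c1 = uni[w1]
        p_cont = T_before[w2] / total_types if total_types else 0.0
        if c1 == 0:
            return p_cont
        backoff_weight = D * T_after[w1] / c1
        return max(bi[(w1, w2)] - D, 0) / c1 + backoff_weight * p_cont

    return prob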