6. N-grams
Seongja Choi, Artificial Intelligence Laboratory, Pusan National University
Word Prediction
“I’d like to make a collect …” (likely continuations: call, telephone, or person-to-person)
- Spelling error detection
- Augmentative communication
- Context-sensitive spelling error correction
Language Model
Language Model (LM): a statistical model of word sequences
n-gram: use the previous n − 1 words to predict the next word
Applications
Context-sensitive spelling error detection and correction:
– “He is trying to fine out.” (fine → find)
– “The design an construction will take a year.” (an → and)
Machine translation
Counting Words in Corpora
Corpora (on-line text collections): which words to count?
– What we are going to count
– Where we are going to find the things to count
Brown Corpus
1 million words, 500 texts
Varied genres (newspaper, novels, non-fiction, academic, etc.)
Assembled at Brown University in 1963-64
The first large on-line text collection used in corpus-based NLP research
Issues in Word Counting
Punctuation symbols (. , ? !)
Capitalization (“He” vs. “he”, “Bush” vs. “bush”)
Inflected forms (“cat” vs. “cats”)
– Wordform: cat, cats, eat, eats, ate, eating, eaten
– Lemma (stem): cat, eat
Types vs. Tokens
Tokens (N): the total number of running words
Types (B): the number of distinct words in a corpus (the size of the vocabulary)
Example: “They picnicked by the pool, then lay back on the grass and looked at the stars.”
– 16 word tokens, 14 word types (not counting punctuation)
※ “Types” will mean wordform types, not lemma types, and punctuation marks will generally be counted as words.
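A quick check of these numbers in Python (a minimal sketch; the regex tokenizer below simply strips punctuation, matching the slide's “not counting punctuation”):

import re

sentence = ("They picnicked by the pool, then lay back on the grass "
            "and looked at the stars.")
tokens = re.findall(r"[A-Za-z]+", sentence)   # crude tokenizer: letters only
print(len(tokens))        # 16 word tokens
print(len(set(tokens)))   # 14 word types ("the" occurs three times)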
How Many Words in English?
Shakespeare’s complete works
– 884,647 wordform tokens
– 29,066 wordform types
Brown Corpus
– 1 million wordform tokens
– 61,805 wordform types
– 37,851 lemma types
Simple (Unsmoothed) N-grams
Task: estimating the probability of a word
First attempt:
– Suppose there is no corpus available
– Use a uniform distribution
– Assume: word types = V (e.g., 100,000)
– Then P(w) = 1/V for every word
Simple (Unsmoothed) N-grams
Task: estimating the probability of a word
Second attempt:
– Suppose there is a corpus
– Assume: word tokens = N
– # times w appears in corpus = C(w)
– Then P(w) = C(w)/N, the relative frequency
Simple (Unsmoothed) N-grams
Task: estimating the probability of a word
Third attempt:
– Suppose there is a corpus
– Assume a word depends on its n − 1 previous words
Simple (Unsmoothed) N-grams
By the chain rule, the probability of a word sequence decomposes into a product of conditional probabilities:
P(w_1 … w_k) = P(w_1) P(w_2 | w_1) P(w_3 | w_1 w_2) … P(w_k | w_1 … w_{k−1})

Simple (Unsmoothed) N-grams
n-gram approximation: w_k only depends on its previous n − 1 words:
P(w_k | w_1 … w_{k−1}) ≈ P(w_k | w_{k−n+1} … w_{k−1})
Bigram Approximation
Example:
P(I want to eat British food)
= P(I | <s>) P(want | I) P(to | want) P(eat | to) P(British | eat) P(food | British)
<s>: a special word meaning “start of sentence”
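A minimal sketch of this computation in Python; the probabilities in bigram_prob are made-up illustrative numbers, not estimates from a corpus:

bigram_prob = {
    ("<s>", "I"): 0.25, ("I", "want"): 0.32, ("want", "to"): 0.65,
    ("to", "eat"): 0.26, ("eat", "British"): 0.001, ("British", "food"): 0.60,
}

def sentence_prob(words, table):
    p = 1.0
    for w1, w2 in zip(["<s>"] + words, words):  # pair each word with its predecessor
        p *= table[(w1, w2)]
    return p

print(sentence_prob("I want to eat British food".split(), bigram_prob))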
Note on Practical Problem
Multiplying many probabilities yields a very small number and can cause numerical underflow.
Use log probabilities (logprobs) in the actual computation: products of probabilities become sums of logs.
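The same computation in log space, reusing the hypothetical bigram_prob table from the sketch above:

import math

def sentence_logprob(words, table):
    logp = 0.0
    for w1, w2 in zip(["<s>"] + words, words):
        logp += math.log(table[(w1, w2)])  # multiplication becomes addition
    return logp

logp = sentence_logprob("I want to eat British food".split(), bigram_prob)
print(logp, math.exp(logp))  # exponentiate only for display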
Estimating N-gram Probability
Maximum Likelihood Estimate (MLE): the relative frequency of the n-gram in a training corpus,
P(w_n | w_{n−N+1} … w_{n−1}) = C(w_{n−N+1} … w_{n−1} w_n) / C(w_{n−N+1} … w_{n−1})
For bigrams this is P(w_n | w_{n−1}) = C(w_{n−1} w_n) / C(w_{n−1}).
Estimating Bigram Probability
Example:
– C(to eat) = 860
– C(to) = 3256
– P(eat | to) = C(to eat) / C(to) = 860 / 3256 ≈ 0.26
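A sketch of MLE bigram estimation by relative frequency on a tiny hypothetical corpus (with the slide's counts, the same formula gives 860 / 3256 ≈ 0.26):

from collections import Counter

corpus = "<s> I want to eat British food </s> <s> I want to go </s>".split()
unigram = Counter(corpus)
bigram = Counter(zip(corpus, corpus[1:]))  # note: also counts the pair across </s> <s>

def p_mle(w2, w1):
    return bigram[(w1, w2)] / unigram[w1]  # C(w1 w2) / C(w1)

print(p_mle("eat", "to"))  # C(to eat) / C(to) = 1 / 2 = 0.5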
Two Important Facts
N-gram models become increasingly accurate as we increase the value of N.
They depend very strongly on their training corpus (in particular, its genre and its size in words).
Smoothing
Any particular training corpus is finite, so many acceptable n-grams are missing from it: the sparse data problem.
Smoothing deals with these zero-probability n-grams.
Smoothing
Smoothing: re-evaluating zero-probability n-grams and assigning them non-zero probability
Also called discounting: lowering non-zero n-gram counts in order to assign some probability mass to the zero-count n-grams
Add-One Smoothing for Bigram
Add one to every bigram count and renormalize by the vocabulary size V (the number of word types):
P(w_n | w_{n−1}) = (C(w_{n−1} w_n) + 1) / (C(w_{n−1}) + V)
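A sketch of the add-one estimate, reusing the unigram/bigram Counters from the MLE sketch above:

def p_addone(w2, w1, unigram, bigram):
    V = len(unigram)  # vocabulary size = number of word types
    return (bigram[(w1, w2)] + 1) / (unigram[w1] + V)

print(p_addone("eat", "to", unigram, bigram))   # seen bigram, count discounted
print(p_addone("food", "to", unigram, bigram))  # unseen bigram, no longer zero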
Things Seen Once
Use the count of things seen once to help estimate the count of things never seen
Witten-Bell Discounting
Key idea: estimate the probability of a zero-count n-gram by the probability of seeing a new type. With N observed tokens and T observed types, the total probability mass reserved for unseen events is T / (N + T).

Witten-Bell Discounting for Bigram
Condition the counts on the previous word: let N(w_{i−1}) be the number of tokens observed after w_{i−1}, T(w_{i−1}) the number of distinct word types seen after it, and Z(w_{i−1}) the number of word types never seen after it.
– Seen count: P(w_i | w_{i−1}) = C(w_{i−1} w_i) / (N(w_{i−1}) + T(w_{i−1}))
– Unseen count: P(w_i | w_{i−1}) = T(w_{i−1}) / (Z(w_{i−1}) · (N(w_{i−1}) + T(w_{i−1})))
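A sketch of these formulas in Python, again reusing the unigram/bigram Counters from the MLE sketch:

def p_witten_bell(w2, w1, unigram, bigram):
    V = len(unigram)
    followers = [b for b in bigram if b[0] == w1]  # bigrams starting with w1
    T = len(followers)                             # distinct types seen after w1
    N = sum(bigram[b] for b in followers)          # tokens seen after w1
    Z = V - T                                      # types never seen after w1
    c = bigram[(w1, w2)]
    if c > 0:
        return c / (N + T)        # seen: discounted relative frequency
    return T / (Z * (N + T))      # unseen: equal share of the reserved mass

print(p_witten_bell("eat", "to", unigram, bigram))
print(p_witten_bell("food", "to", unigram, bigram))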
Good-Turing Discounting for Bigram
Re-estimate each count c from N_c, the number of distinct bigrams that occur exactly c times:
c* = (c + 1) · N_{c+1} / N_c
The total probability mass assigned to unseen bigrams is N_1 / N: the mass of things seen once.
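A sketch of the count-of-counts computation on the bigram Counter from the MLE sketch (the re-estimate is only defined where N_{c+1} > 0; practical implementations smooth the N_c values):

from collections import Counter

def good_turing_counts(bigram):
    N_c = Counter(bigram.values())  # N_c: how many bigrams occur exactly c times
    c_star = {}
    for c in sorted(N_c):
        if N_c.get(c + 1, 0) > 0:
            c_star[c] = (c + 1) * N_c[c + 1] / N_c[c]
    return c_star

N = sum(bigram.values())
print(good_turing_counts(bigram))
print(Counter(bigram.values())[1] / N)  # total mass for unseen bigrams: N_1 / N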
Backoff
If an n-gram has a zero count, back off to the (n−1)-gram:
– Use the trigram estimate if C(w_{n−2} w_{n−1} w_n) > 0
– Otherwise back off to the bigram estimate, and from there to the unigram if needed
In Katz backoff, the higher-order estimates are discounted and the backed-off estimates are weighted by factors α so that the result remains a proper probability distribution.
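A deliberately simplified backoff sketch (not full Katz backoff: the discounting and the normalizing α weights are omitted, so the values do not sum to 1); it reuses the Counters from the MLE sketch:

def p_backoff(w2, w1, unigram, bigram, alpha=0.4):
    if bigram[(w1, w2)] > 0:
        return bigram[(w1, w2)] / unigram[w1]     # bigram seen: use its MLE
    N = sum(unigram.values())
    return alpha * unigram[w2] / N                # else back off to the unigram

print(p_backoff("eat", "to", unigram, bigram))
print(p_backoff("food", "to", unigram, bigram))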
Entropy
Measure of uncertainty; used to evaluate the quality of n-gram models (how well a language model matches a given language).
Entropy H(X) of a random variable X:
H(X) = − Σ_x p(x) log2 p(x)
Measured in bits: the number of bits it takes to encode the information in the optimal coding scheme.
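A minimal sketch of the definition:

import math

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))    # 1 bit: a fair coin
print(entropy([0.99, 0.01]))  # ~0.08 bits: nearly certain outcome
print(entropy([0.25] * 4))    # 2 bits: uniform over 4 outcomes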
Example 1
Example 2
Perplexity
Perplexity is 2 raised to the entropy: PP = 2^H.
Equivalently, for a test sequence W = w_1 … w_N, PP(W) = P(w_1 … w_N)^(−1/N): the inverse probability of the test sequence, normalized by its length. Lower perplexity means a better model.
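A minimal sketch relating the two forms (the sentence probability here is a made-up value):

import math

def perplexity(log2prob, n_words):
    return 2 ** (-log2prob / n_words)  # 2 to the per-word entropy

print(perplexity(math.log2(1 / 64), 6))  # a 6-word sentence with P = 1/64 -> PP = 2.0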
Entropy of a Sequence
For sequences of words from a language L, entropy is computed per word (the entropy rate):
(1/n) H(w_1 … w_n) = −(1/n) Σ p(w_1 … w_n) log2 p(w_1 … w_n), summed over all sequences w_1 … w_n in L
Entropy of a Language
The entropy of a language L is the limit of the per-word entropy as sequence length grows:
H(L) = lim_{n→∞} (1/n) H(w_1 … w_n)
Cross Entropy
Used for comparing two language models:
– p: the actual probability distribution that generated some data
– m: a model of p (an approximation to p)
Cross entropy of m on p:
H(p, m) = lim_{n→∞} −(1/n) Σ p(w_1 … w_n) log2 m(w_1 … w_n)
Cross Entropy
By Shannon-McMillan-Breimantheorem:
Property of cross entropy:
Difference between H(p,m) and H(p) is a measure of how accurate model m is
The more accurate a model, the lower its cross-entropy
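A sketch of estimating a model's cross entropy on test data as the average negative log2 probability per word, reusing the hypothetical bigram_prob table from earlier:

import math

def cross_entropy(words, table):
    pairs = list(zip(["<s>"] + words, words))
    return -sum(math.log2(table[p]) for p in pairs) / len(pairs)  # bits per word

test = "I want to eat British food".split()
H = cross_entropy(test, bigram_prob)
print(H, 2 ** H)  # cross entropy and the corresponding perplexity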