CS60057 Speech & Natural Language Processing


Page 1: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 1

CS60057 Speech & Natural Language Processing

Autumn 2007

Lecture 8

9 August 2007

Page 2: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 2

POS Tagging

Task: assign the right part-of-speech tag, e.g. noun, verb, conjunction, to a word in context.

POS taggers need to be fast in order to process large corpora:
they should take no more than time linear in the size of the corpora
full parsing is slow: e.g. parsing with a context-free grammar takes O(n^3) time, where n is the length of the sentence
POS taggers try to assign the correct tag without actually parsing the sentence

Page 3: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 3

POS Tagging

Components: Dictionary of words

Exhaustive list of closed-class items. Examples:
the, a, an: determiner
from, to, of, by: preposition
and, or: coordinating conjunction

Large set of open-class items (e.g. nouns, verbs, adjectives) with frequency information

Page 4: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 4

POS Tagging

Components: Mechanism to assign tags

Context-free: by frequency
Context: bigram, trigram, HMM, hand-coded rules
Example: Det Noun/*Verb ("the walk…")

Mechanism to handle unknown words (extra-dictionary):
Capitalization
Morphology: -ed, -tion

Page 5: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 5

POS Tagging

Words often have more than one POS: back
The back door = JJ
On my back = NN
Win the voters back = RB
Promised to back the bill = VB

The POS tagging problem is to determine the POS tag for a particular instance of a word.

These examples from Dekang Lin

Page 6: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 6

How hard is POS tagging? Measuring ambiguity

Page 7: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 7

Algorithms for POS Tagging

•Ambiguity – In the Brown corpus, 11.5% of the word types are ambiguous (using 87 tags):

Worse, 40% of the tokens are ambiguous.

Page 8: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 8

Problem Setup There are M types of POS tags

Tag set: {t1,..,tM}.

The word vocabulary size is V

Vocabulary set: {w1,..,wV}.

We have a word sequence of length n:

W = w1,w2…wn

Want to find the best sequence of POS tags:

T = t1,t2…tn

T_best = argmax_T Pr(T | W)

Page 9: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 9

Information sources for tagging

All techniques are based on the same observations… some tag sequences are more probable than others

ART+ADJ+N is more probable than ART+ADJ+VB

Lexical information: knowing the word to be tagged gives a lot of information about the correct tag

“table”: {noun, verb} but not {adj, prep, …}
“rose”: {noun, adj, verb} but not {prep, ...}

Page 10: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 10

Algorithms for POS Tagging

Why can’t we just look them up in a dictionary?

•Words that aren’t in the dictionary

http://story.news.yahoo.com/news?tmpl=story&cid=578&ncid=578&e=1&u=/nm/20030922/ts_nm/iraq_usa_dc

•One idea: P(ti | wi) = the probability that a random hapax legomenon in the corpus has tag ti.

Nouns are more likely than verbs, which are more likely than pronouns.

•Another idea: use morphology.

Page 11: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 11

Algorithms for POS Tagging - Knowledge

•Dictionary

•Morphological rules, e.g.:
• _____-tion
• _____-ly
• capitalization

•N-gram frequencies
• to _____
• DET _____ N
• But what about rare words, e.g., smelt (two verb forms, melt and past tense of smell, and one noun form, a small fish)?

•Combining these
• V _____-ing: "I was gracking" vs. "Gracking is fun."

Page 12: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 12

POS Tagging - Approaches

Approaches:
Rule-based tagging (ENGTWOL)
Stochastic (=Probabilistic) tagging: HMM (Hidden Markov Model) tagging
Transformation-based tagging: Brill tagger

• Do we return one best answer or several answers and let later steps decide?

• How does the requisite knowledge get entered?

Page 13: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 13

3 methods for POS tagging

1. Rule-based tagging
Example: Karlsson (1995) EngCG tagger, based on the Constraint Grammar architecture and the ENGTWOL lexicon

Basic Idea:
Assign all possible tags to words (a morphological analyzer is used)
Remove wrong tags according to a set of constraint rules (typically more than 1000 hand-written constraint rules, but they may be machine-learned)

Page 14: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 14

Sample rules

N-IP rule: a tag N (noun) cannot be followed by a tag IP (interrogative pronoun)
... man who …
man: {N}
who: {RP, IP} --> {RP}   (relative pronoun)

ART-V rule: a tag ART (article) cannot be followed by a tag V (verb)
... the book …
the: {ART}
book: {N, V} --> {N}

Page 15: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 15

After The First Stage

Example: He had a book. After the first stage:

he: he/pronoun
had: have/verb-past, have/auxiliary-past
a: a/article
book: book/noun, book/verb

Rule-1:

if (the previous tag is an article)

then eliminate all verb tags

Rule-2:

if (the next tag is verb)

then eliminate all verb tags

Tagging Rule
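As a rough illustration of how constraint rules like Rule-1 eliminate candidate tags, here is a minimal Python sketch for the "He had a book" example above. The tag names and the single rule are simplified stand-ins, not ENGTWOL's actual lexicon or constraints.

```python
# Minimal sketch of constraint-based tag elimination (illustrative, not ENGTWOL).
# Each word starts with every tag its dictionary entry allows; rules remove tags.

sentence = [
    ("he",   {"pronoun"}),
    ("had",  {"verb-past", "auxiliary-past"}),
    ("a",    {"article"}),
    ("book", {"noun", "verb"}),
]

def apply_rule_1(tagged):
    """Rule-1: if the previous tag is an article, eliminate all verb tags."""
    for i in range(1, len(tagged)):
        word, candidates = tagged[i]
        _, prev_candidates = tagged[i - 1]
        if prev_candidates == {"article"}:
            remaining = {t for t in candidates if t != "verb"}
            if remaining:                      # never delete the last candidate
                tagged[i] = (word, remaining)
    return tagged

print(apply_rule_1(sentence))
# 'book' is left with {'noun'}; 'had' stays ambiguous until another rule fires.
```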

Page 16: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 16

Rule-Based POS Tagging

ENGTWOL tagger (now ENGCG-2) http://www.lingsoft.fi/cgi-bin/engcg

Page 17: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 17

3 methods for POS tagging

2. Transformation-based tagging
Example: Brill (1995) tagger, a combination of rule-based and stochastic (probabilistic) tagging methodologies

Basic Idea:
Start with a tagged corpus + dictionary (with most frequent tags)
Set the most probable tag for each word as a start value
Change tags according to rules of the type "if word-1 is a determiner and word is a verb then change the tag to noun", applied in a specific order (like rule-based taggers)
Machine learning is used: the rules are automatically induced from a previously tagged training corpus (like the stochastic approach)

Page 18: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 18

1. Assign to words their most likely tag
P(NN|race) = .98, P(VB|race) = .02

2. Change some tags by applying transformation rules

An example (Rule | Context/trigger: apply the rule when… | Examples):

NN --> VB (noun --> verb) | the previous tag is the preposition to | go to sleep (VB), go to school (VB)
VBR --> VB (past tense --> base form) | one of the previous 3 tags is a modal (MD) | you may cut (VB)
JJR --> RBR (comparative adj --> comparative adv) | the next tag is an adjective (JJ) | a more (RBR) valuable
VBP --> VB (past tense --> base form) | one of the previous 2 words is "n't" | should (VB) n't

Page 19: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 19

Types of context: lots of latitude… can be:

tag-triggered transformation: the preceding/following word is tagged this way; the word two before/after is tagged this way; ...
word-triggered transformation: the preceding/following word is this word; …
morphology-triggered transformation: the preceding/following word finishes with an s; …
a combination of the above: the preceding word is tagged this way AND the following word is this word

Page 20: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 20

Learning the transformation rules

Input: a corpus with each word:
correctly tagged (for reference)
tagged with its most frequent tag (C0)

Output: a bag of transformation rules

Algorithm:
Instantiate a small set of hand-written templates (generic rules) by comparing the reference corpus to C0
Change tag a to tag b when…
the preceding/following word is tagged z
the word two before/after is tagged z
one of the 2 preceding/following words is tagged z
one of the 2 preceding words is z
…

Page 21: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 21

Learning the transformation rules (con't)

Run the initial tagger and compile types of errors:
<incorrect tag, desired tag, # of occurrences>
For each error type, instantiate all templates to generate candidate transformations
Apply each candidate transformation to the corpus and count the number of corrections and errors that it produces
Save the transformation that yields the greatest improvement
Stop when no transformation can reduce the error rate by a predetermined threshold

Page 22: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 22

Example

If the initial tagger mistags 159 words as verbs instead of nouns, create the error triple: <verb, noun, 159>

Suppose template #3 is instantiated as the rule:
Change the tag from <verb> to <noun> if one of the two preceding words is tagged as a determiner.

When this template is applied to the corpus:
it corrects 98 of the 159 errors
but it also creates 18 new errors
Error reduction is 98-18 = 80

Page 23: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 23

Learning the best transformations

input: a corpus with each word:

correctly tagged (for reference) tagged with its most frequent tag (C0)

a bag of unordered transformation rules

output: an ordering of the best transformation rules

Page 24: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 24

let:
E(Ck) = number of words incorrectly tagged in the corpus at iteration k
v(C) = the corpus obtained after applying rule v to the corpus C
ε = the predetermined improvement threshold (minimum error reduction desired)

for k := 0 step 1 do
  bt := argmin_t E(t(Ck))                         // find the transformation t that minimizes the error rate
  if (E(Ck) - E(bt(Ck))) < ε then goto finished   // bt does not improve the tagging significantly
  Ck+1 := bt(Ck)                                  // apply rule bt to the current corpus
  Tk+1 := bt                                      // bt is kept as the current transformation rule
end
finished: the sequence T1 T2 … Tk is the ordered list of transformation rules

Learning the best transformations (con’t)
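The greedy selection loop above can be rendered compactly in Python. This is only a sketch under simplifying assumptions: `candidate_rules` and `apply_rule` are hypothetical placeholders for the instantiated rule templates and their application; only the error-driven selection logic is shown.

```python
# Sketch of the greedy selection loop above. `candidate_rules` and `apply_rule`
# are hypothetical placeholders for instantiated templates and their application.

def errors(corpus_tags, reference_tags):
    """E(C): number of positions whose current tag differs from the reference."""
    return sum(1 for cur, ref in zip(corpus_tags, reference_tags) if cur != ref)

def learn_transformations(corpus_tags, reference_tags, candidate_rules,
                          apply_rule, epsilon=1):
    learned = []                                      # ordered rules T1, T2, ...
    while True:
        current_err = errors(corpus_tags, reference_tags)          # E(Ck)
        best_rule, best_tags, best_err = None, None, current_err
        for rule in candidate_rules:                  # bt := argmin_t E(t(Ck))
            candidate = apply_rule(rule, corpus_tags)
            err = errors(candidate, reference_tags)
            if err < best_err:
                best_rule, best_tags, best_err = rule, candidate, err
        if best_rule is None or current_err - best_err < epsilon:
            return learned                            # no significant improvement
        learned.append(best_rule)                     # Tk+1 := bt
        corpus_tags = best_tags                       # Ck+1 := bt(Ck)
```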

Page 25: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 25

Strengths of transformation-based tagging

exploits a wider range of lexical and syntactic regularities

can look at a wider context condition the tags on preceding/next words not just preceding

tags. can use more context than bigram or trigram.

transformation rules are easier to understand than matrices of probabilities

Page 26: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 26

How TBL Rules are Applied

Before the rules are applied, the tagger labels every word with its most likely tag.
We get these most likely tags from a tagged corpus.

Example: He is expected to race tomorrow
he/PRN is/VBZ expected/VBN to/TO race/NN tomorrow/NN

After selecting the most-likely tags, we apply transformation rules:
Change NN to VB when the previous tag is TO
This rule converts race/NN into race/VB (see the sketch below).

This may not work for every case: … according to race …
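A minimal Python sketch of the two-step process just described (most-likely tags first, then one transformation), using the tags and the NN-to-VB rule from the example above.

```python
# Most-likely tags first, then one Brill-style transformation:
# "change NN to VB when the previous tag is TO" (as in the example above).

tagged = [("he", "PRN"), ("is", "VBZ"), ("expected", "VBN"),
          ("to", "TO"), ("race", "NN"), ("tomorrow", "NN")]

def nn_to_vb_after_to(tagged):
    out = list(tagged)
    for i in range(1, len(out)):
        word, tag = out[i]
        if tag == "NN" and out[i - 1][1] == "TO":
            out[i] = (word, "VB")          # race/NN -> race/VB
    return out

print(nn_to_vb_after_to(tagged))
```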

Page 27: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 27

How TBL Rules are Learned

We will assume that we have a tagged corpus. Brill’s TBL algorithm has three major steps:

1. Tag the corpus with the most likely tag for each word (unigram model)
2. Choose a transformation that deterministically replaces an existing tag with a new tag such that the resulting tagged training corpus has the lowest error rate out of all transformations
3. Apply the transformation to the training corpus

These steps are repeated until a stopping criterion is reached.
The result (which will be our tagger) will be:
first tag using most-likely tags,
then apply the learned transformations

Page 28: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 28

Transformations

A transformation is selected from a small set of templates.

Change tag a to tag b when

- The preceding (following) word is tagged z.

- The word two before (after) is tagged z.

- One of two preceding (following) words is tagged z.

- One of three preceding (following) words is tagged z.

- The preceding word is tagged z and the following word is tagged w.

- The preceding (following) word is tagged z and the word

two before (after) is tagged w.

Page 29: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 29

3 methods for POS tagging

3. Stochastic (=Probabilistic) tagging

Assume that a word’s tag only depends on the previous tags (not the following ones)
Use a training set (manually tagged corpus) to:
learn the regularities of tag sequences
learn the possible tags for a word
model this information through a language model (n-gram)

Example: HMM (Hidden Markov Model) tagging: a training corpus is used to compute the probability (frequency) of a given word having a given POS tag in a given context

Page 30: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 30

Topics

Probability, Conditional Probability, Independence, Bayes Rule, HMM tagging, Markov Chains, Hidden Markov Models

Page 31: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 31

6. Introduction to Probability

Experiment (trial) Repeatable procedure with well-defined possible outcomes

Sample Space (S) the set of all possible outcomes finite or infinite

Example coin toss experiment possible outcomes: S = {heads, tails}

Example die toss experiment possible outcomes: S = {1,2,3,4,5,6}

Page 32: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 32

Introduction to Probability

Definition of sample space depends on what we are asking Sample Space (S): the set of all possible outcomes Example

die toss experiment for whether the number is even or odd possible outcomes: {even,odd} not {1,2,3,4,5,6}

Page 33: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 33

More definitions

Events an event is any subset of outcomes from the sample space

Example die toss experiment let A represent the event such that the outcome of the die toss

experiment is divisible by 3 A = {3,6} A is a subset of the sample space S= {1,2,3,4,5,6}

Page 34: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 34

Introduction to Probability

Some definitions Events

an event is a subset of sample space simple and compound events

Example deck of cards draw experiment suppose sample space S = {heart,spade,club,diamond} (four suits) let A represent the event of drawing a heart let B represent the event of drawing a red card A = {heart} (simple event) B = {heart} u {diamond} = {heart,diamond} (compound event)

a compound event can be expressed as a set union of simple events Example

alternative sample space S = set of 52 cards A and B would both be compound events


Page 35: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 35

Introduction to Probability

Some definitions Counting

suppose an operation o_i can be performed in n_i ways; then a set of k operations o_1 o_2 ... o_k can be performed in n_1 × n_2 × ... × n_k ways

Example: dice toss experiment, 6 possible outcomes; two dice are thrown at the same time; number of sample points in the sample space = 6 × 6 = 36

Page 36: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 36

Definition of Probability

The probability law assigns to an event a nonnegative number

Called P(A), also called the probability of A
It encodes our knowledge or belief about the collective likelihood of all the elements of A
A probability law must satisfy certain properties

Page 37: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 37

Probability Axioms

Nonnegativity P(A) >= 0, for every event A

Additivity If A and B are two disjoint events, then the probability

of their union satisfies: P(A U B) = P(A) + P(B)

Normalization The probability of the entire sample space S is equal

to 1, i.e. P(S) = 1.

Page 38: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 38

An example

An experiment involving a single coin toss There are two possible outcomes, H and T Sample space S is {H,T} If coin is fair, should assign equal probabilities to 2 outcomes Since they have to sum to 1 P({H}) = 0.5 P({T}) = 0.5 P({H,T}) = P({H})+P({T}) = 1.0

Page 39: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 39

Another example

Experiment involving 3 coin tosses Outcome is a 3-long string of H or T S ={HHH,HHT,HTH,HTT,THH,THT,TTH,TTT} Assume each outcome is equiprobable

“Uniform distribution” What is probability of the event that exactly 2 heads occur? A = {HHT,HTH,THH} 3 events/outcomes P(A) = P({HHT})+P({HTH})+P({THH}) additivity - union of the

probability of the individual events

= 1/8 + 1/8 + 1/8 total 8 events/outcomes

= 3/8

Page 40: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 40

Probability definitions

In summary:

Probability of drawing a spade from 52 well-shuffled playing cards: 13/52 = 1/4 = 0.25

Page 41: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 41

Moving toward language What’s the probability of drawing a 2 from a deck

of 52 cards with four 2s?

What’s the probability of a random word (from a random dictionary page) being a verb?

P(drawing a two) = 4/52 = 1/13 ≈ .077

P(drawing a verb) = (# of ways to get a verb) / (# of all words)

Page 42: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 42

Probability and part of speech tags• What’s the probability of a random word (from a random dictionary

page) being a verb?

• How to compute each of these• All words = just count all the words in the dictionary• # of ways to get a verb: # of words which are verbs!• If a dictionary has 50,000 entries, and 10,000 are verbs…. P(V) is

10000/50000 = 1/5 = .20

P(drawing a verb) = (# of ways to get a verb) / (# of all words)

Page 43: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 43

Conditional Probability

A way to reason about the outcome of an experiment based on partial information In a word guessing game the first letter for the word is

a “t”. What is the likelihood that the second letter is an “h”?

How likely is it that a person has a disease given that a medical test was negative?

A spot shows up on a radar screen. How likely is it that it corresponds to an aircraft?

Page 44: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 44

More precisely

Given an experiment, a corresponding sample space S, and a probability law

Suppose we know that the outcome is some event B We want to quantify the likelihood that the outcome also belongs to

some other event A We need a new probability law that gives us the conditional

probability of A given B P(A|B)

Page 45: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 45

An intuition

• Let’s say A is “it’s raining”.• Let’s say P(A) in Kharagpur is 0.2• Let’s say B is “it was sunny ten minutes ago”• P(A|B) means “what is the probability of it raining now if it was sunny

10 minutes ago”• P(A|B) is probably way less than P(A)• Perhaps P(A|B) is .0001• Intuition: The knowledge about B should change our estimate of the

probability of A.

Page 46: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 46

Conditional Probability

let A and B be events in the sample space
P(A|B) = the conditional probability of event A occurring given some fixed event B occurring
definition: P(A|B) = P(A ∩ B) / P(B)

Page 47: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 47

Conditional probability

P(A|B) = P(A ∩ B) / P(B), or equivalently:

P(A|B) = P(A, B) / P(B)

(Figure: Venn diagram of events A and B with their overlap A,B.)

Note: P(A,B) = P(A|B) · P(B)
Also: P(A,B) = P(B,A)
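A quick sanity check of the definition P(A|B) = P(A,B) / P(B) and the chain rule, using the fair-die events from earlier in the lecture (A = outcome divisible by 3, B = outcome even). This is only a worked example, not part of the original slides.

```python
# Sanity check of P(A|B) = P(A,B)/P(B) on the fair-die events used earlier:
# A = outcome divisible by 3, B = outcome is even.
from fractions import Fraction

S = {1, 2, 3, 4, 5, 6}
A = {x for x in S if x % 3 == 0}                 # {3, 6}
B = {x for x in S if x % 2 == 0}                 # {2, 4, 6}

def P(event):                                    # uniform probability law
    return Fraction(len(event), len(S))

assert P(A & B) / P(B) == Fraction(1, 3)         # P(A|B) = (1/6)/(3/6) = 1/3
assert P(A & B) == (P(A & B) / P(B)) * P(B)      # chain rule: P(A,B) = P(A|B)·P(B)
print(P(A & B) / P(B))                           # 1/3
```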

Page 48: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 48

Independence

What is P(A,B) if A and B are independent?

P(A,B)=P(A) · P(B) iff A,B independent.

P(heads,tails) = P(heads) · P(tails) = .5 · .5 = .25

Note: P(A|B)=P(A) iff A,B independent

Also: P(B|A)=P(B) iff A,B independent

Page 49: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 49

Bayes Theorem

P(B|A) = P(A|B) · P(B) / P(A)

• Idea: The probability of an A conditional on another event B is generally different from the probability of B conditional on A. There is a definite relationship between the two.

Page 50: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 50

Deriving Bayes Rule

The probability of event A given event B is:

P(A|B) = P(A ∩ B) / P(B)

Page 51: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 51

Deriving Bayes Rule

The probability of event B given event A is:

P(B|A) = P(A ∩ B) / P(A)

Page 52: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 52

Deriving Bayes Rule

P(B|A) = P(A ∩ B) / P(A)
P(A|B) = P(A ∩ B) / P(B)

Rearranging both definitions:
P(B|A) · P(A) = P(A ∩ B)
P(A|B) · P(B) = P(A ∩ B)

Page 53: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 53

Deriving Bayes Rule

P(B|A) = P(A ∩ B) / P(A)
P(A|B) = P(A ∩ B) / P(B)

P(B|A) · P(A) = P(A ∩ B)
P(A|B) · P(B) = P(A ∩ B)

Therefore:
P(A|B) · P(B) = P(B|A) · P(A)

P(A|B) = P(B|A) · P(A) / P(B)

Page 54: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 54

Deriving Bayes Rule

P(A|B) = P(B|A) · P(A) / P(B)

the theorem may be paraphrased as

conditional/posterior probability = (LIKELIHOOD multiplied by PRIOR) divided by NORMALIZING CONSTANT

Page 55: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 55

Hidden Markov Model (HMM) Tagging

Using an HMM to do POS tagging

HMM is a special case of Bayesian inference

It is also related to the “noisy channel” model in ASR (Automatic Speech Recognition)

Page 56: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 56

Goal: maximize P(word|tag) x P(tag|previous n tags)

P(word|tag) word/lexical likelihood probability that given this tag, we have this word NOT probability that this word has this tag modeled through language model (word-tag matrix)

P(tag|previous n tags) tag sequence likelihood probability that this tag follows these previous tags modeled through language model (tag-tag matrix)

Hidden Markov Model (HMM) Taggers

Lexical information Syntagmatic information

Page 57: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 57

POS tagging as a sequence classification task

We are given a sentence (an “observation” or “sequence of observations”) Secretariat is expected to race tomorrow sequence of n words w1…wn.

What is the best sequence of tags which corresponds to this sequence of observations?

Probabilistic/Bayesian view: Consider all possible sequences of tags Out of this universe of sequences, choose the tag sequence

which is most probable given the observation sequence of n words w1…wn.

Page 58: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 58

Getting to HMM

Let T = t1,t2,…,tn

Let W = w1,w2,…,wn

Goal: Out of all sequences of tags t1…tn, get the most probable sequence of POS tags T underlying the observed sequence of words w1,w2,…,wn

Hat ^ means “our estimate of the best = the most probable tag sequence”
Argmax_x f(x) means “the x such that f(x) is maximized”; it maximizes our estimate of the best tag sequence

Page 59: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 59

Getting to HMM

This equation is guaranteed to give us the best tag sequence

But how do we make it operational? How do we compute this value? Intuition of Bayesian classification:

Use Bayes rule to transform it into a set of other probabilities that are easier to compute

Thomas Bayes: British mathematician (1702-1761)

Page 60: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 60

Bayes Rule

Breaks down any conditional probability P(x|y) into three other probabilities

P(x|y): The conditional probability of an event x assuming that y has occurred

Page 61: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 61

Bayes Rule

We can drop the denominator: it does not change for each tag sequence; we are looking for the best tag sequence for the same observation, for the same fixed set of words

Page 62: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 62

Bayes Rule

Page 63: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 63

Likelihood and prior

T^ = argmax_T P(T|W) = argmax_T P(W|T) · P(T)   (likelihood × prior)

Page 64: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 64

Likelihood and prior: Further Simplifications

1. The probability of a word appearing depends only on its own POS tag, i.e., it is independent of the other words around it:
P(W|T) ≈ ∏_{i=1..n} P(w_i | t_i)

2. BIGRAM assumption: the probability of a tag appearing depends only on the previous tag:
P(T) ≈ ∏_{i=1..n} P(t_i | t_{i-1})

3. The most probable tag sequence estimated by the bigram tagger:
T^ = argmax_T ∏_{i=1..n} P(w_i | t_i) · P(t_i | t_{i-1})

Page 65: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 65

Likelihood and prior: Further Simplifications

1. The probability of a word appearing depends only on its own POS tag, i.e., it is independent of the other words around it:
P(W|T) ≈ ∏_{i=1..n} P(w_i | t_i)

(Figure: WORDS "the koala put the keys on the table", each word generated from its own TAG among DET, N, V, P.)

Page 66: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 66

Likelihood and prior Further Simplifications

2. BIGRAM assumption: the probability of a tag appearing depends only on the previous tag

Bigrams are groups of two written letters, two syllables, or two words; they are a special case of N-gram.

Bigrams are used as the basis for simple statistical analysis of text

The bigram assumption is related to the first-order Markov assumption

Page 67: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 67

Likelihood and prior Further Simplifications

3. The most probable tag sequence estimated by the bigram tagger:

T^ = argmax_T ∏_{i=1..n} P(w_i | t_i) · P(t_i | t_{i-1})   (bigram assumption)

Page 68: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 68

Two kinds of probabilities (1)

Tag transition probabilities p(ti|ti-1)
Determiners are likely to precede adjectives and nouns:
That/DT flight/NN
The/DT yellow/JJ hat/NN
So we expect P(NN|DT) and P(JJ|DT) to be high
But what do we expect P(DT|JJ) to be?

Page 69: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 69

Two kinds of probabilities (1)

Tag transition probabilities p(ti|ti-1)
Compute P(NN|DT) by counting in a labeled corpus:

P(NN|DT) = C(DT, NN) / C(DT)   (the # of times DT is followed by NN, divided by the # of times DT occurs)

Page 70: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 70

Two kinds of probabilities (2)

Word likelihood probabilities p(wi|ti)
P(is|VBZ) = probability of VBZ (3sg Pres verb) being “is”
Compute P(is|VBZ) by counting in a labeled corpus:

P(is|VBZ) = C(VBZ, is) / C(VBZ)

If we were expecting a third person singular verb, how likely is it that this verb would be "is"? (A counting sketch follows below.)
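Both kinds of probabilities can be estimated by simple counting. The sketch below does this for a tiny, made-up hand-tagged corpus; the counts (and therefore the printed values) are purely illustrative.

```python
# Estimating P(tag | previous tag) and P(word | tag) by counting in a tiny,
# made-up hand-tagged corpus (values below are illustrative only).
from collections import Counter

corpus = [("the", "DT"), ("flight", "NN"), ("is", "VBZ"),
          ("the", "DT"), ("yellow", "JJ"), ("hat", "NN")]

tags = [t for _, t in corpus]
tag_counts = Counter(tags)
tag_bigrams = Counter(zip(tags, tags[1:]))
word_tag_counts = Counter(corpus)

def p_transition(tag, prev_tag):        # P(tag | prev) = C(prev, tag) / C(prev)
    return tag_bigrams[(prev_tag, tag)] / tag_counts[prev_tag]

def p_emission(word, tag):              # P(word | tag) = C(tag, word) / C(tag)
    return word_tag_counts[(word, tag)] / tag_counts[tag]

print(p_transition("NN", "DT"))         # 0.5 in this toy corpus
print(p_emission("is", "VBZ"))          # 1.0
```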

Page 71: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 71

An Example: the verb “race”

Secretariat/NNP is/VBZ expected/VBN to/TO race/VB tomorrow/NR

People/NNS continue/VB to/TO inquire/VB the/DT reason/NN for/IN the/DT race/NN for/IN outer/JJ space/NN

How do we pick the right tag?

Page 72: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 72

Disambiguating “race”

Page 73: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 73

Disambiguating “race”

P(NN|TO) = .00047, P(VB|TO) = .83
The tag transition probabilities P(NN|TO) and P(VB|TO) answer the question: “How likely are we to expect a noun/verb given the previous tag TO?”

P(race|NN) = .00057, P(race|VB) = .00012
Lexical likelihoods from the Brown corpus for “race” given the POS tag NN or VB.

P(NR|VB) = .0027, P(NR|NN) = .0012
Tag sequence probabilities for the likelihood of an adverb (NR) occurring given the previous tag verb or noun.

P(VB|TO) · P(NR|VB) · P(race|VB) = .00000027
P(NN|TO) · P(NR|NN) · P(race|NN) = .00000000032
Multiply the lexical likelihoods with the tag sequence probabilities: the verb wins (checked below).
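The two products above can be checked with a couple of lines of arithmetic:

```python
# Reproducing the two products with the probabilities quoted on the slide.
p_vb = 0.83 * 0.0027 * 0.00012       # P(VB|TO) * P(NR|VB) * P(race|VB)
p_nn = 0.00047 * 0.0012 * 0.00057    # P(NN|TO) * P(NR|NN) * P(race|NN)
print(f"{p_vb:.2e}  {p_nn:.2e}")     # ~2.7e-07 vs ~3.2e-10: the verb reading wins
```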

Page 74: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 74

Hidden Markov Models

What we’ve described with these two kinds of probabilities is a Hidden Markov Model (HMM)

Let’s just spend a bit of time tying this into the model In order to define HMM, we will first introduce the Markov

Chain, or observable Markov Model.

Page 75: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 75

Definitions

A weighted finite-state automaton adds probabilities to the arcs
The probabilities on the arcs leaving any state must sum to one
A Markov chain is a special case of a WFST in which the input sequence uniquely determines which states the automaton will go through
Markov chains can’t represent inherently ambiguous problems
Useful for assigning probabilities to unambiguous sequences

Page 76: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 76

Markov chain = “First-order observed Markov Model”

a set of states Q = q1, q2…qN; the state at time t is qt

a set of transition probabilities:
A = a01, a02, …, an1, …, ann
Each aij represents the probability of transitioning from state i to state j
The set of these is the transition probability matrix A
aij = P(qt = j | qt-1 = i),  1 ≤ i, j ≤ N
Σ_{j=1..N} aij = 1,  1 ≤ i ≤ N

Distinguished start and end states

Special initial probability vector π:
πi = the probability that the MM will start in state i; each πi expresses the probability p(qi|START)

Page 77: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 77

Markov chain = “First-order observed Markov Model”

Markov Chain for weather: Example 1 three types of weather: sunny, rainy, foggy we want to find the following conditional probabilities:

P(qn|qn-1, qn-2, …, q1)

- I.e., the probability of the unknown weather on day n, depending on the (known) weather of the preceding days

- We could infer this probability from the relative frequency (the statistics) of past observations of weather sequences

Problem: the larger n is, the more observations we must collect.

Suppose that n = 6; then we have to collect statistics for 3^(6-1) = 243 past histories

Page 78: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 78

Markov chain = “First-order observed Markov Model”

Therefore, we make a simplifying assumption, called the (first-order) Markov assumption:

for a sequence of observations q1, … qn, the current state only depends on the previous state:
P(qn | q1, …, qn-1) ≈ P(qn | qn-1)

the joint probability of certain past and current observations:
P(q1, …, qn) ≈ ∏_{i=1..n} P(qi | qi-1)

Page 79: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 79

Markov chain = “First-order observable Markov Model”

Page 80: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 80

Markov chain = “First-order observed Markov Model”

Given that today the weather is sunny, what's the probability that tomorrow is sunny and the day after is rainy?

Using the Markov assumption and the probabilities in table 1, this translates into:
P(q2 = sunny, q3 = rainy | q1 = sunny) = P(q3 = rainy | q2 = sunny) · P(q2 = sunny | q1 = sunny)

Page 81: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 81

The weather figure: specific example Markov Chain for weather: Example 2

Page 82: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 82

Markov chain for weather

What is the probability of 4 consecutive rainy days?
Sequence is rainy-rainy-rainy-rainy, i.e., state sequence is 3-3-3-3
P(3,3,3,3) = π3 · a33 · a33 · a33 = 0.2 × (0.6)^3 = 0.0432

Page 83: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 83

Hidden Markov Model

For Markov chains, the output symbols are the same as the states. See sunny weather: we’re in state sunny

But in part-of-speech tagging (and other things) The output symbols are words But the hidden states are part-of-speech tags

So we need an extension! A Hidden Markov Model is an extension of a Markov

chain in which the output symbols are not the same as the states.

This means we don’t know which state we are in.

Page 84: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 84

Markov chain for weather

Page 85: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 85

Markov chain for words

Observed events: words

Hidden events: tags

Page 86: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 86

States Q = q1, q2…qN
Observations O = o1, o2…oN; each observation is a symbol from a vocabulary V = {v1, v2, …, vV}

Transition probabilities (prior):
transition probability matrix A = {aij}

Observation likelihoods (likelihood):
output probability matrix B = {bi(ot)}: a set of observation likelihoods, each expressing the probability of an observation ot being generated from a state i (emission probabilities)

Special initial probability vector π:
πi = the probability that the HMM will start in state i; each πi expresses the probability p(qi|START)

Hidden Markov Models

Page 87: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 87

Assumptions

Markov assumption: the probability of a particular state depends only on the previous state

Output-independence assumption: the probability of an output observation depends only on the state that produced that observation

P(qi | q1 … qi-1) = P(qi | qi-1)

Page 88: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 88

HMM for Ice Cream

You are a climatologist in the year 2799 Studying global warming You can’t find any records of the weather in Boston, MA

for summer of 2007 But you find Jason Eisner’s diary Which lists how many ice-creams Jason ate every date

that summer Our job: figure out how hot it was

Page 89: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 89

Noam task

Given Ice Cream Observation Sequence: 1,2,3,2,2,2,3…

(cp. with output symbols) Produce:

Weather Sequence: C,C,H,C,C,C,H …

(cp. with hidden states, causing states)

Page 90: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 90

HMM for ice cream

Page 91: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 91

Different types of HMM structure

Bakis = left-to-right Ergodic = fully-connected

Page 92: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 92

HMM Taggers

Two kinds of probabilities A transition probabilities (PRIOR) B observation likelihoods (LIKELIHOOD)

HMM Taggers choose the tag sequence which maximizes the product of word likelihood and tag sequence probability

Page 93: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 93

Weighted FSM corresponding to hidden states of HMM, showing A probs

Page 94: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 94

B observation likelihoods for POS HMM

Page 95: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 95

The A matrix for the POS HMM

Page 96: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 96

The B matrix for the POS HMM

Page 97: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 97

HMM Taggers

The probabilities are trained on hand-labeled training corpora (training set)

Combine different N-gram levels Evaluated by comparing their output from a test set to

human labels for that test set (Gold Standard)

Page 98: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 98

The Viterbi Algorithm
What is the best tag sequence for "John likes to fish in the sea"?
Viterbi efficiently computes the most likely state sequence given a particular output sequence
It is based on dynamic programming

Page 99: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 99

A smaller example

(Figure: a small weighted automaton with states start, q, r, end; its arcs carry output symbols a/b and probabilities such as 0.6, 0.7, 0.5, 0.3, 0.4, 0.8, 0.2, 1.)

What is the best sequence of states for the input string “bbba”?
Computing all possible paths and finding the one with the max probability is exponential

Page 100: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 100

A smaller example (con’t)

For each state, store the most likely sequence that could lead to it (and its probability)

Path probability matrix:
an array of states versus time (tags versus words)
that stores the probability of being at each state at each time, in terms of the probability of being in each state at the preceding time

Best sequence by input prefix / time:

Leading to q, coming from q:
  ε --> b:   ε --> q    0.6       (1.0 × 0.6)
  b --> b:   q --> q    0.108     (0.6 × 0.3 × 0.6)
  bb --> b:  qq --> q   0.01944   (0.108 × 0.3 × 0.6)
  bbb --> a: qrq --> q  0.018144  (0.1008 × 0.3 × 0.4)

Leading to q, coming from r:
  b --> b:   r --> q    0         (0 × 0.5 × 0.6)
  bb --> b:  qr --> q   0.1008    (0.336 × 0.5 × 0.6)
  bbb --> a: qrr --> q  0.02688   (0.1344 × 0.5 × 0.4)

Leading to r, coming from q:
  ε --> b:   ε --> r    0         (0 × 0.8)
  b --> b:   q --> r    0.336     (0.6 × 0.7 × 0.8)
  bb --> b:  qq --> r   0.0648    (0.108 × 0.7 × 0.8)
  bbb --> a: qrq --> r  0.014112  (0.1008 × 0.7 × 0.2)

Leading to r, coming from r:
  b --> b:   r --> r    0         (0 × 0.5 × 0.8)
  bb --> b:  qr --> r   0.1344    (0.336 × 0.5 × 0.8)
  bbb --> a: qrr --> r  0.01344   (0.1344 × 0.5 × 0.2)

Page 101: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 101

Viterbi intuition: we are looking for the best ‘path’

(Figure: a tag lattice over states S1 S2 S3 S4 S5 for the sentence “promised to back the bill”; each word has candidate tags drawn from VBD, VBN, TO, VB, JJ, NN, RB, DT, NNP, and Viterbi searches for the best path through the lattice.)

Slide from Dekang Lin

Page 102: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 102

The Viterbi Algorithm

Page 103: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 103

Intuition

The value in each cell is computed by taking the MAX over all paths that lead to this cell.

An extension of a path from state i at time t-1 is computed by multiplying:
the previous path probability from the previous cell, viterbi[t-1, i]
the transition probability aij from previous state i to current state j
the observation likelihood bj(ot) that current state j matches observation symbol ot
(see the sketch below)
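A compact Viterbi decoder that follows this recurrence (previous path probability × transition aij × emission bj(ot)) is sketched below. The two-tag toy HMM is hypothetical and only meant to make the code runnable; it is not the lattice from the lecture's figures.

```python
# Viterbi recurrence: viterbi[t, j] = max_i viterbi[t-1, i] * a[i][j] * b[j][o_t].
# The toy two-tag HMM below is hypothetical, purely for illustration.

def viterbi(obs, states, start_p, trans_p, emit_p):
    # V[t][s] = (best probability of any path ending in state s at time t, backpointer)
    V = [{s: (start_p[s] * emit_p[s].get(obs[0], 0.0), None) for s in states}]
    for t in range(1, len(obs)):
        V.append({})
        for j in states:
            prob, back = max(
                (V[t - 1][i][0] * trans_p[i][j] * emit_p[j].get(obs[t], 0.0), i)
                for i in states)
            V[t][j] = (prob, back)
    # Backtrace from the best final state.
    last = max(states, key=lambda s: V[-1][s][0])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(V[t][path[-1]][1])
    return list(reversed(path)), V[-1][last][0]

states  = ["DT", "NN"]
start_p = {"DT": 0.7, "NN": 0.3}
trans_p = {"DT": {"DT": 0.1, "NN": 0.9}, "NN": {"DT": 0.4, "NN": 0.6}}
emit_p  = {"DT": {"the": 0.9, "dog": 0.0, "walk": 0.1},
           "NN": {"the": 0.05, "dog": 0.5, "walk": 0.45}}

print(viterbi(["the", "dog"], states, start_p, trans_p, emit_p))
# (['DT', 'NN'], 0.2835) for this toy model
```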

Page 104: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 104

Viterbi example

Page 105: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 105

Smoothing of probabilities

Data sparseness is a problem when estimating probabilities based on corpus data.

The “add one” smoothing technique:
P(w_{n-1}, w_n) = (C(w_{n-1}, w_n) + 1) / (N + B)
C: absolute frequency (count)
N: number of training instances
B: number of different types

Linear interpolation methods can compensate for data sparseness with higher order models. A common method is interpolating trigrams, bigrams and unigrams:
P(t_i | t_{i-1}, t_{i-2}) = λ1 · P(t_i) + λ2 · P(t_i | t_{i-1}) + λ3 · P(t_i | t_{i-1}, t_{i-2}),  with 0 ≤ λ_j ≤ 1 and λ1 + λ2 + λ3 = 1

The lambda values are automatically determined using a variant of the Expectation Maximization algorithm.
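A minimal sketch of the two smoothing ideas above; the counts and the fixed λ weights passed in are illustrative placeholders (in practice the λs come from EM or deleted interpolation, as noted).

```python
# Sketch: add-one smoothing and fixed-weight linear interpolation.
# Counts, type counts, and lambda weights are illustrative placeholders.

def add_one(count, n_training_instances, n_types):
    """(C + 1) / (N + B), as in the formula above."""
    return (count + 1) / (n_training_instances + n_types)

def interpolated_trigram(p_uni, p_bi, p_tri, lambdas=(0.1, 0.3, 0.6)):
    """lambda1*P(t) + lambda2*P(t|t-1) + lambda3*P(t|t-1,t-2); lambdas sum to 1."""
    l1, l2, l3 = lambdas
    return l1 * p_uni + l2 * p_bi + l3 * p_tri

print(add_one(0, 100000, 50))                 # an unseen event still gets some mass
print(interpolated_trigram(0.02, 0.10, 0.0))  # unseen trigram, backed up by lower orders
```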

Page 106: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 108

In bigram POS tagging, we condition a tag only on the preceding tag

Why not...
use more context (e.g. a trigram model)?
more precise: “is clearly marked” --> verb, past participle; “he clearly marked” --> verb, past tense
combine trigram, bigram, unigram models
condition on words too
but with an n-gram approach, this is too costly (too many parameters to model)

Possible improvements

Page 107: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 110

Further issues with Markov Model tagging

Unknown words are a problem since we don’t have the required probabilities. Possible solutions: Assign the word probabilities based on corpus-wide distribution

of POS Use morphological cues (capitalization, suffix) to assign a more

calculated guess. Using higher order Markov models:

Using a trigram model captures more context However, data sparseness is much more of a problem.

Page 108: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 111

TnT

Efficient statistical POS tagger developed by Thorsten Brants, ANLP-2000

Underlying model:
Trigram modelling: the probability of a POS tag only depends on its two preceding POS tags
The probability of a word appearing at a particular position, given that its POS occurs at that position, is independent of everything else.

T^ = argmax_{t1…tT} [ ∏_{i=1..T} P(t_i | t_{i-1}, t_{i-2}) · P(w_i | t_i) ] · P(t_{T+1} | t_T)

Page 109: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 112

Training

Maximum likelihood estimates:

Unigrams:  P^(t3) = c(t3) / N
Bigrams:   P^(t3 | t2) = c(t2, t3) / c(t2)
Trigrams:  P^(t3 | t1, t2) = c(t1, t2, t3) / c(t1, t2)
Lexical:   P^(w3 | t3) = c(w3, t3) / c(t3)

Smoothing: context-independent variant of linear interpolation:

P(t3 | t1, t2) = λ1 · P^(t3) + λ2 · P^(t3 | t2) + λ3 · P^(t3 | t1, t2)

Page 110: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 113

Smoothing algorithm

Set λ1 = λ2 = λ3 = 0

For each trigram t1 t2 t3 with f(t1,t2,t3) > 0, depending on the max of the following three values:
Case (f(t1,t2,t3) - 1) / f(t1,t2): increment λ3 by f(t1,t2,t3)
Case (f(t2,t3) - 1) / f(t2): increment λ2 by f(t1,t2,t3)
Case (f(t3) - 1) / (N - 1): increment λ1 by f(t1,t2,t3)

Normalize the λi (see the sketch below)
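The λ-setting procedure can be transcribed almost line by line. The sketch below follows the slide's formulation (with guards against zero denominators) and assumes `tri`, `bi`, `uni` are Counters of tag n-gram frequencies and N is the total number of tag tokens.

```python
# Sketch of the lambda-setting loop above (deleted-interpolation style, following
# the slide's formulation). `tri`, `bi`, `uni` are assumed Counters of tag
# trigram/bigram/unigram frequencies; N is the total number of tag tokens.

def estimate_lambdas(tri, bi, uni, N):
    l1 = l2 = l3 = 0.0
    for (t1, t2, t3), f in tri.items():
        if f == 0:
            continue
        c3 = (f - 1) / bi[(t1, t2)] if bi[(t1, t2)] else 0.0
        c2 = (bi[(t2, t3)] - 1) / uni[t2] if uni[t2] else 0.0
        c1 = (uni[t3] - 1) / (N - 1)
        best = max(c1, c2, c3)
        if best == c3:
            l3 += f            # the trigram estimate was the most reliable here
        elif best == c2:
            l2 += f
        else:
            l1 += f
    total = l1 + l2 + l3
    return l1 / total, l2 / total, l3 / total     # normalized lambda1..lambda3
```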

Page 111: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 114

Evaluation of POS taggers

compared with gold-standard of human performance metric:

accuracy = % of tags that are identical to gold standard most taggers ~96-97% accuracy must compare accuracy to:

ceiling (best possible results) how do human annotators score compared to each other? (96-

97%) so systems are not bad at all!

baseline (worst possible results) what if we take the most-likely tag (unigram model) regardless of

previous tags ? (90-91%) so anything less is really bad
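The accuracy metric itself is a one-liner; a small sketch for comparing a tagger's output against the gold standard:

```python
# Accuracy = fraction of predicted tags identical to the gold standard.
def accuracy(predicted, gold):
    assert len(predicted) == len(gold)
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)

print(accuracy(["DT", "NN", "VB"], ["DT", "NN", "NN"]))   # 0.666...
```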

Page 112: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 115

More on tagger accuracy: is 95% good?

That’s 5 mistakes every 100 words; if, on average, a sentence is 20 words, that’s 1 mistake per sentence

When comparing tagger accuracy, beware of:
size of the training corpus: the bigger, the better the results
difference between training & testing corpora (genre, domain…): the closer, the better the results
size of the tag set: prediction versus classification
unknown words: the more unknown words (not in the dictionary), the worse the results

Page 113: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 116

Error Analysis

Look at a confusion matrix (contingency table)

E.g. 4.4% of the total errors caused by mistagging VBD as VBN See what errors are causing problems

Noun (NN) vs ProperNoun (NNP) vs Adj (JJ) Adverb (RB) vs Particle (RP) vs Prep (IN) Preterite (VBD) vs Participle (VBN) vs Adjective (JJ)

ERROR ANALYSIS IS ESSENTIAL!!!

Page 114: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 117

Tag indeterminacy

Page 115: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 118

Major difficulties in POS tagging

Unknown words (proper names)
because we do not know the set of tags they can take, and knowing this takes you a long way (cf. the baseline POS tagger)
possible solutions:
assign all possible tags, with a probability distribution identical to the lexicon as a whole
use morphological cues to infer the possible tags, e.g. words ending in -ed are likely to be past tense verbs or past participles

Frequently confused tag pairs
preposition vs particle: <running> <up> a hill (prep) / <running up> a bill (particle)
verb, past tense vs. past participle vs. adjective

Page 116: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 119

Unknown Words

Most-frequent-tag approach. What about words that don’t appear in the training set? Suffix analysis:

The probability distribution for a particular suffix is generated from all words in the training set that share the same suffix.

Suffix estimation – Calculate the probability of a tag t given the last i letters of an n letter word.

Smoothing: successive abstraction through sequences of increasingly more general contexts (i.e., omit more and more characters of the suffix)

Use a morphological analyzer to get the restriction on the possible tags.
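A rough sketch of suffix-based guessing with successive abstraction (backing off to shorter suffixes). The suffix table and its probabilities are invented for illustration; a real one would be estimated from all training words sharing each suffix.

```python
# Sketch: suffix-based tag guessing with successive abstraction
# (back off from longer to shorter suffixes). The suffix table is invented.

def guess_tag_dist(word, suffix_dist, max_suffix=4, fallback=None):
    for i in range(min(max_suffix, len(word)), 0, -1):
        dist = suffix_dist.get(word[-i:])
        if dist:
            return dist                      # most specific suffix seen in training
    return fallback or {"NN": 1.0}           # e.g. a noun-heavy open-class prior

suffix_dist = {"ed": {"VBD": 0.6, "VBN": 0.35, "JJ": 0.05}}   # toy table
print(guess_tag_dist("gracked", suffix_dist))
```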

Page 117: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 120

Unknown words

Page 118: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 121

Alternative graphical models for part of speech tagging

Page 119: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 122

Different Models for POS tagging

HMM Maximum Entropy Markov Models Conditional Random Fields

Page 120: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 123

Hidden Markov Model (HMM) : Generative Modeling

Source model P(Y):      P(y) = ∏_i P(y_i | y_{i-1})
Noisy channel P(X|Y):   P(x | y) = ∏_i P(x_i | y_i)

y --> x (labels generate observations)

Page 121: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 124

Dependency (1st order)

(Figure: first-order dependency structure. Each state Y_k depends on the previous state Y_{k-1} through P(Y_k | Y_{k-1}), and each observation X_k depends only on its own state through P(X_k | Y_k).)

Page 122: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 125

Disadvantage of HMMs (1)

No rich feature information, but rich information is required:
when x_k is complex
when the data for x_k is sparse

Example: POS tagging
How to evaluate P(w_k | t_k) for unknown words w_k?
Useful features: suffix (e.g., -ed, -tion, -ing, etc.), capitalization

Generative model
Parameter estimation: maximize the joint likelihood of the training examples:
Σ_{(x,y) ∈ T} log P(X = x, Y = y)

Page 123: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 126

Generative Models

Hidden Markov models (HMMs) and stochastic grammars Assign a joint probability to paired observation and label sequences The parameters typically trained to maximize the joint likelihood of train examples

Page 124: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 127

Generative Models (cont’d)

Difficulties and disadvantages Need to enumerate all possible observation sequences Not practical to represent multiple interacting features or long-range

dependencies of the observations Very strict independence assumptions on the observations

Page 125: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 128

Better Approach
Discriminative model which models P(y|x) directly
Maximize the conditional likelihood of the training examples:

Σ_{(x,y) ∈ T} log P(Y = y | X = x)

Page 126: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 129

Maximum Entropy modeling

N-gram model : probabilities depend on the previous few tokens. We may identify a more heterogeneous set of features which contribute in some way

to the choice of the current word. (whether it is the first word in a story, whether the next word is to, whether one of the last 5 words is a preposition, etc)

Maxent combines these features in a probabilistic model. The given features provide a constraint on the model. We would like to have a probability distribution which, outside of these constraints, is

as uniform as possible – has the maximum entropy among all models that satisfy these constraints.

Page 127: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 130

Maximum Entropy Markov Model: Discriminative Sub-Models

Unify the two parameters of the generative model into one conditional model
The two parameters in the generative model are the source-model parameter P(y_k | y_{k-1}) and the noisy-channel parameter P(x_k | y_k)
Unified conditional model: P(y_k | x_k, y_{k-1})
Employ the maximum entropy principle

P(y | x) = ∏_i P(y_i | y_{i-1}, x_i)

Maximum Entropy Markov Model

Page 128: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 131

General Maximum Entropy Principle

Model: model the distribution P(Y|X) with a set of features f_1, f_2, …, f_l defined on X and Y

Idea: collect information about the features from the training data

Principle: model what is known; assume nothing else
--> the flattest distribution
--> the distribution with the maximum entropy

Page 129: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 132

Example

(Berger et al., 1996) example: model the translation of the word “in” from English to French
Need to model P(word_French)
Constraints:
1: possible translations: dans, en, à, au cours de, pendant
2: “dans” or “en” are used 30% of the time
3: “dans” or “à” are used 50% of the time

Page 130: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 133

Features

Features: 0-1 indicator functions
1 if (x, y) satisfies a predefined condition, 0 if not

Example: POS tagging

f1(x, y) = 1 if x ends with -tion and y is NN, 0 otherwise
f2(x, y) = 1 if x starts with a capitalization and y is NNP, 0 otherwise
(see the sketch below)
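The same two indicator features, written as small Python predicates over a (word, tag) pair:

```python
# The two indicator features above as Python predicates over a (word, tag) pair.
def f1(x, y):
    return 1 if x.endswith("tion") and y == "NN" else 0

def f2(x, y):
    return 1 if x[:1].isupper() and y == "NNP" else 0

print(f1("motivation", "NN"), f2("Kharagpur", "NNP"))   # 1 1
```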

Page 131: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 134

Constraints

Empirical information: statistics from the training data T

P^(f_i) = (1/|T|) · Σ_{(x,y) ∈ T} f_i(x, y)

Expected value from the distribution P(Y|X) we want to model:

P(f_i) = (1/|T|) · Σ_{(x,y) ∈ T} Σ_{y' ∈ D(Y)} P(Y = y' | X = x) · f_i(x, y')

Constraints:  P^(f_i) = P(f_i)

Page 132: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 135

Maximum Entropy: Objective

Entropy:

I = -(1/|T|) · Σ_{(x,·) ∈ T} Σ_y P(Y = y | X = x) · log P(Y = y | X = x)
  = -Σ_x P^(x) · Σ_y P(Y = y | X = x) · log P(Y = y | X = x)

Maximization problem:

maximize I over P(Y|X), subject to P^(f) = P(f)

Page 133: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 136

Dual Problem

Conditional model:
P(Y = y | X = x) ∝ exp( Σ_{i=1..l} λ_i · f_i(x, y) )

Maximum likelihood of the conditional data:
max_{λ1,…,λl} Σ_{(x,y) ∈ T} log P(Y = y | X = x)

Solution:
Improved iterative scaling (IIS) (Berger et al. 1996)
Generalized iterative scaling (GIS) (McCallum et al. 2000)

Page 134: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 137

Maximum Entropy Markov Model
Use the maximum entropy approach to model the 1st-order conditional:

P(Y_k = y_k | X_k = x_k, Y_{k-1} = y_{k-1})

Features:
Basic features (like the parameters in an HMM):
bigram (1st order) or trigram (2nd order) features in the source model
state-output pair features (X_k = x_k, Y_k = y_k)
Advantage: can incorporate other advanced features on (x_k, y_k)

Page 135: CS60057 Speech &Natural Language Processing

HMM vs MEMM (1st order)

(Figure: the HMM uses the two distributions P(Y_k | Y_{k-1}) and P(X_k | Y_k); the Maximum Entropy Markov Model (MEMM) replaces them with the single conditional P(Y_k | X_k, Y_{k-1}).)

Page 136: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 139

Performance in POS Tagging

POS Tagging Data set: WSJ Features:

HMM features, spelling features (like –ed, -tion, -s, -ing, etc.)

Results (Lafferty et al. 2001) 1st order HMM

94.31% accuracy, 54.01% OOV accuracy 1st order MEMM

95.19% accuracy, 73.01% OOV accuracy

Page 137: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 140

ME applications

Part of Speech (POS) Tagging (Ratnaparkhi, 1996) P(POS tag | context) Information sources

Word window (4) Word features (prefix, suffix, capitalization) Previous POS tags

Page 138: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 141

ME applications

Abbreviation expansion (Pakhomov, 2002) Information sources

Word window (4) Document title

Word Sense Disambiguation (WSD) (Chao & Dyer, 2002) Information sources

Word window (4) Structurally related words (4)

Sentence Boundary Detection (Reynar & Ratnaparkhi, 1997) Information sources

Token features (prefix, suffix, capitalization, abbreviation) Word window (2)

Page 139: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 142

Solution

Global Optimization Optimize parameters in a global model simultaneously,

not in sub models separately Alternatives

Conditional random fields Application of perceptron algorithm

Page 140: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 143

Why ME?

Advantages Combine multiple knowledge sources

Local Word prefix, suffix, capitalization (POS - (Ratnaparkhi, 1996)) Word POS, POS class, suffix (WSD - (Chao & Dyer, 2002)) Token prefix, suffix, capitalization, abbreviation (Sentence Boundary -

(Reynar & Ratnaparkhi, 1997)) Global

N-grams (Rosenfeld, 1997) Word window Document title (Pakhomov, 2002) Structurally related words (Chao & Dyer, 2002) Sentence length, conventional lexicon (Och & Ney, 2002)

Combine dependent knowledge sources

Page 141: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 144

Why ME?

Advantages Add additional knowledge sources Implicit smoothing

Disadvantages Computational

Expected value at each iteration Normalizing constant

Overfitting Feature selection

Cutoffs Basic Feature Selection (Berger et al., 1996)

Page 142: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 145

Conditional Models

Conditional probability P(label sequence y | observation sequence x) rather than joint probability P(y, x) Specify the probability of possible label sequences given an observation

sequence

Allow arbitrary, non-independent features on the observation sequence X

The probability of a transition between labels may depend on past and future observations Relax strong independence assumptions in generative models

Page 143: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 146

Discriminative ModelsMaximum Entropy Markov Models (MEMMs)

Exponential model Given training set X with label sequence Y:

Train a model θ that maximizes P(Y|X, θ) For a new data sequence x, the predicted label y maximizes P(y|x, θ) Notice the per-state normalization

Page 144: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 147

MEMMs (cont’d)

MEMMs have all the advantages of Conditional Models

Per-state normalization: all the mass that arrives at a state must be distributed among the possible successor states (“conservation of score mass”)

Subject to Label Bias Problem

Bias toward states with fewer outgoing transitions

Page 145: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 148

Label Bias Problem

• P(1 and 2 | ro) = P(2 | 1 and ro) · P(1 | ro) = P(2 | 1 and o) · P(1 | r)
  P(1 and 2 | ri) = P(2 | 1 and ri) · P(1 | ri) = P(2 | 1 and i) · P(1 | r)

• Since P(2 | 1 and x) = 1 for all x, P(1 and 2 | ro) = P(1 and 2 | ri)
  In the training data, label value 2 is the only label value observed after label value 1
  Therefore P(2 | 1) = 1, so P(2 | 1 and x) = 1 for all x

• However, we expect P(1 and 2 | ri) to be greater than P(1 and 2 | ro).

• Per-state normalization does not allow the required expectation

• Consider this MEMM:

Page 146: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 149

Solve the Label Bias Problem

Change the state-transition structure of the model
Not always practical to change the set of states

Start with a fully-connected model and let the training procedure figure out a good structure
This precludes the use of prior structural knowledge, which is very valuable (e.g. in information extraction)

Page 147: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 150

Random Field

Page 148: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 151

Conditional Random Fields (CRFs)

CRFs have all the advantages of MEMMs without label bias problem MEMM uses per-state exponential model for the conditional probabilities

of next states given the current state CRF has a single exponential model for the joint probability of the entire

sequence of labels given the observation sequence Undirected acyclic graph Allow some transitions “vote” more strongly than others depending on the

corresponding observations

Page 149: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 152

Definition of CRFs

X is a random variable over data sequences to be labeled

Y is a random variable over corresponding label sequences

Page 150: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 153

Example of CRFs

Page 151: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 154

Graphical comparison among HMMs, MEMMs and CRFs

HMM MEMM CRF

Page 152: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 155

Conditional Distribution

θ = (λ1, λ2, …, λn; μ1, μ2, …, μk) are the parameters to be estimated

x is a data sequence
y is a label sequence
v is a vertex from the vertex set V = set of label random variables
e is an edge from the edge set E over V
f_k and g_k are given and fixed: g_k is a Boolean vertex feature, f_k is a Boolean edge feature
k is the number of features
y|e is the set of components of y defined by edge e
y|v is the set of components of y defined by vertex v

If the graph G = (V, E) of Y is a tree, the conditional distribution over the label sequence Y = y, given X = x, by the fundamental theorem of random fields is:

p(y | x) ∝ exp( Σ_{e∈E, k} λ_k · f_k(e, y|e, x)  +  Σ_{v∈V, k} μ_k · g_k(v, y|v, x) )

Page 153: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 156

Conditional Distribution (cont’d)

• CRFs use the observation-dependent normalization Z(x) for the conditional distributions:

p(y | x) = (1/Z(x)) · exp( Σ_{e∈E, k} λ_k · f_k(e, y|e, x)  +  Σ_{v∈V, k} μ_k · g_k(v, y|v, x) )

Z(x) is a normalization over the data sequence x

Page 154: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 157

Parameter Estimation for CRFs

The paper provided iterative scaling algorithms

It turns out to be very inefficient

Prof. Dietterich’s group applied Gradient Descendent Algorithm, which is quite efficient

Page 155: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 158

Training of CRFs (From Prof. Dietterich)

• First, we take the log of the equation:

log p(y | x) = Σ_{e∈E, k} λ_k · f_k(e, y|e, x) + Σ_{v∈V, k} μ_k · g_k(v, y|v, x) − log Z(x)

• Then, take the derivative of the above equation:

∂ log p(y | x) / ∂λ_k = Σ_{e∈E} f_k(e, y|e, x) − ∂ log Z(x) / ∂λ_k   (and similarly for μ_k with g_k)

• For training, the first 2 items are easy to get.
• For example, for each k, f_k is a sequence of Boolean numbers, such as 00101110100111;
  Σ_{e∈E} f_k(e, y|e, x) is just the total number of 1’s in the sequence.

• The hardest thing is how to calculate Z(x)

Page 156: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 159

Training of CRFs (From Prof. Dietterich) (cont’d)

• Maximal cliques

(Figure: a chain y1 - y2 - y3 - y4 with maximal cliques c1 = {y1, y2}, c2 = {y2, y3}, c3 = {y3, y4}.)

Z(x) = Σ_{y1,y2,y3,y4} c1(y1, y2, x) · c2(y2, y3, x) · c3(y3, y4, x)
     = Σ_{y3,y4} c3(y3, y4, x) · Σ_{y2} c2(y2, y3, x) · Σ_{y1} c1(y1, y2, x)

c1: c1(y1, y2, x) = exp( g(y1, x) + g(y2, x) + f(y1, y2, x) )
c2: c2(y2, y3, x) = exp( g(y3, x) + f(y2, y3, x) )
c3: c3(y3, y4, x) = exp( g(y4, x) + f(y3, y4, x) )

Page 157: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 160

POS tagging Experiments

Page 158: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 161

POS tagging Experiments (cont’d)

• Compared HMMs, MEMMs, and CRFs on Penn treebank POS tagging• Each word in a given input sentence must be labeled with one of 45 syntactic tags• Add a small set of orthographic features: whether a spelling begins with a number

or upper case letter, whether it contains a hyphen, and if it contains one of the following suffixes: -ing, -ogy, -ed, -s, -ly, -ion, -tion, -ity, -ies

• oov = out-of-vocabulary (not observed in the training set)

Page 159: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 162

Summary

Discriminative models are prone to the label bias problem

CRFs provide the benefits of discriminative models

CRFs solve the label bias problem well, and demonstrate good performance