Finite-State Methods in Natural Language Processing

Lauri Karttunen, LSA 2005 Summer Institute, August 3, 2005


Page 1: Finite-State Methods in Natural Language Processing

Finite-State Methods in Natural Language Processing

Lauri Karttunen
LSA 2005 Summer Institute
August 3, 2005

Page 2: Finite-State Methods in Natural Language Processing

August 1: Non-concatenative morphotactics
• Reduplication, interdigitation
• Realizational morphology

Readings
• Chapter 8. “Non-Concatenative Morphotactics”
• Gregory T. Stump. Inflectional Morphology. A Theory of Paradigm Structure. Cambridge U. Press. 2001. (An excerpt)
• Lauri Karttunen, “Computing with Realizational Morphology”, Lecture Notes in Computer Science, Volume 2588, Alexander Gelbukh (ed.), 205-216, Springer Verlag. 2003.

August 3: Optimality theory

Readings
• Paul Kiparsky, “Finnish Noun Inflection”, Generative Approaches to Finnic and Saami Linguistics, Diane Nelson and Satu Manninen (eds.), pp. 109-161, CSLI Publications, 2003.
• Nine Elenbaas and René Kager. "Ternary rhythm and the lapse constraint". Phonology 16. 273-329.

Page 3: Finite-State Methods in Natural Language Processing

Background

• Two old strains of finite-state (morpho)phonology
    - rewrite rules (Chomsky & Halle 1968)
    - two-level constraints (Koskenniemi 1983)

• Optimality theory (Prince & Smolensky 1993)
    - two-level model with ranked, violable constraints

• Formal power
    - OT is not a finite-state system if it involves unlimited counting of constraint violations. (Ellison 1994, Eisner 1997, Frank & Satta 1998)
    - But a finite-state model can be useful for OT.

Page 4: Finite-State Methods in Natural Language Processing

Optimality theory

Prince & Smolensky 1993

• eliminate
    - rules
    - derivations

• introduce
    - violable ranked constraints

Instant success!

Page 5: Finite-State Methods in Natural Language Processing

Brief Introduction to OT

• Input: A language of underlying lexical forms.

• GEN: A function that generates alternate surface realizations for each input form, possibly an infinite set.

• Constraints: A finite set of principles, preferably universal, that filter out unwanted realizations.

• Ranking: A language-specific ordering of the constraints.

Page 6: Finite-State Methods in Natural Language Processing

Computational perspective

• Ellison 1994
    - OT deals with regular sets and relations: a finite-state system
    - constraint transducers mark violations, marks sorted and counted

• Tesar 1995
    - dynamic algorithm for optimal path computations

• Eisner 1996
    - two-level typology of optimality constraints: restrict, prohibit
    - “FootForm Decomposed”, MIT Working Papers in Linguistics, 31:115-143, proposes Primitive Optimality Theory (no generalized alignment)

• Karttunen 1998
    - introduces lenient composition

• Frank & Satta 1998
    - prove that OT is regular if the number of violations is bounded

Page 7: Finite-State Methods in Natural Language Processing

Comparisons

                          Application                 Merging
rewrite rules             composition                 composition
two-level constraints     intersecting composition    intersection
optimality constraints    lenient composition         lenient composition

Page 8: Finite-State Methods in Natural Language Processing

Finnish OT Prosody

Lauri Karttunen
CLS-41
April 7, 2005

Page 9: Finite-State Methods in Natural Language Processing

Finnish Prosody: basic facts

• The nucleus of a Finnish syllable must consist of a short vowel, a long vowel, or a diphthong.

• Main stress is always on the first syllable; secondary stress occurs on non-initial syllables.

• Adjacent syllables are never stressed.

• The stressed syllable is initial in the foot.

ilmoittautuminen ‘registering’ (Nom Sg)
(íl.moit).(tàu.tu).(mì.nen)

Page 10: Finite-State Methods in Natural Language Processing

Ternary feet in Finnish

Stress that would fall on a light syllable shifts to the following heavy syllable, creating a ternary foot.

(ká.las).te.(lèm.me) ‘we are fishing’
(íl.moit).(tàu.tu).mi.(sès.ta) ‘registering’ (Ela Sg)
(rá.kas).ta.(jàt.ta).ri.(àn.sa) ‘his mistresses’ (Par Pl)

Can we get these facts to come out “for free”, from the interaction of independently motivated principles?

Yes!

• Paul Kiparsky, “Finnish Noun Inflection”, Generative Approaches to Finnic and Saami Linguistics, Diane Nelson and Satu Manninen (eds.), pp. 109-161, CSLI Publications, 2003.
• Nine Elenbaas and René Kager. "Ternary rhythm and the lapse constraint". Phonology 16. 273-329.

Page 11: Finite-State Methods in Natural Language Processing

Non-OT and OT solutions

It is possible to define a cascade of replace rules that produce the desired result.

http://www.stanford.edu/~laurik/fsmbook/examples/FinnishProsody.html

But, following Kiparsky, we are going to do OT today, and in a more elegant way than is shown at

http://www.stanford.edu/~laurik/fsmbook/examples/FinnishOTProsody.html

Page 12: Finite-State Methods in Natural Language Processing

Prelude: Built-in Functions in fst

Case conversion

    UpCase(   OptUpCase(   DownCase(   OptDownCase(   Cap(   OptCap(   AnyCase(

    Cap({hello}) is equivalent to {Hello}
    OptUpCase(a:b, L) is equivalent to [a:B | a:b] ;

Symbol manipulation

    Explode(   Implode(

    regex Explode("+Test") is equivalent to regex {+Test};

Page 13: Finite-State Methods in Natural Language Processing

Functions: User-defined

The function definition is attached to a symbol ending with (

The definition is any regular expression.
There may be any number of arguments.

    define Redup(X) [X X];
    define Apply(X, Y) [X .o. Y].l ;

When the function is used in a regular expression, the arguments are bound and the function is evaluated.

    regex Apply({abc}, a -> x || _ b);
    print words
    xbc

The definition of a function may contain other functions.

Page 14: Finite-State Methods in Natural Language Processing

Pig Latin

# This script creates a function for translating from English to Pig Latin:

# pig -> igpay, brown -> ownbray, script -> iptscray

define C [b|c|d|f|g|h|j|k|l|m|n|p|q|r|s|t|v|w|x|y|z];

define V [a|e|i|o|u] ;

define Redup(X) [X "." X];

define DelCons(X) [X .o. C+ @-> 0 || .#. _ ];

define TailToAy(X) [X .o. V ?* @-> {ay} || "." C* _ ];

define DelMiddle(X) [X .o. "." -> 0];

define Pig(X) [DelMiddle(TailToAy(DelCons(Redup(X))))];
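The four steps of the script can be traced outside fst as well. The following is a minimal Python sketch, not the fst implementation: it uses regular-expression substitution in place of the replace rules, and the function names are hypothetical Python counterparts of Redup, DelCons, TailToAy, DelMiddle and Pig.

```python
import re

C = "bcdfghjklmnpqrstvwxyz"  # consonants, as listed in the script
V = "aeiou"                  # vowels, as listed in the script


def redup(word):
    # Redup(X): two copies of the word separated by "."
    return word + "." + word


def del_cons(s):
    # DelCons: delete the maximal consonant cluster at the start of the string
    return re.sub(rf"^[{C}]+", "", s)


def tail_to_ay(s):
    # TailToAy: after "." and any consonants, replace the rest (from the
    # first vowel to the end) with "ay"
    return re.sub(rf"(\.[{C}]*)[{V}].*", r"\1ay", s, count=1)


def del_middle(s):
    # DelMiddle: remove the "." separator
    return s.replace(".", "")


def pig(word):
    return del_middle(tail_to_ay(del_cons(redup(word))))


for w in ("pig", "brown", "script"):
    print(w, "->", pig(w))
```

For "pig" the intermediate strings are pig.pig, ig.pig, ig.pay and igpay, which matches the examples in the script's comment.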

Page 15: Finite-State Methods in Natural Language Processing

Demo!

fst -l piglatin.script

Page 16: Finite-State Methods in Natural Language Processing

Computing with OT

Input language .o. GEN
    Compose the input language with GEN to produce a mapping from each input form to all of its output candidates.

Constraint 1, Constraint 2, ...
    Eliminate suboptimal candidates by applying constraints in the ranked order. At least one output candidate always survives.

By what finite-state operation?

Page 17: Finite-State Methods in Natural Language Processing

Priority union .P.

R = { a:x, b:y }

Q = { b:z, c:w }

R .P. Q = { a:x, b:y, c:w }

All pairs from R and those pairs from Q that do not conflict with the mapping established by R.

R .P. Q = [ R | [~[R.u] .o. Q] ]

Kaplan 1987
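The definition of priority union can be checked on finite relations. The sketch below is a Python model rather than fst code: a relation is a set of (upper, lower) pairs, and `priority_union` is a hypothetical helper name. It assumes the example relations are R = {a:x, b:y} and Q = {b:z, c:w}.

```python
def priority_union(r, q):
    """All pairs of r, plus the pairs of q whose upper side r does not map."""
    r_upper = {upper for upper, _ in r}  # corresponds to R.u
    return set(r) | {(upper, lower) for upper, lower in q
                     if upper not in r_upper}


R = {("a", "x"), ("b", "y")}
Q = {("b", "z"), ("c", "w")}

# b:z is dropped because R already maps b; c:w is kept.
print(sorted(priority_union(R, Q)))
```

The operation is not symmetric: Q .P. R would keep b:z and drop b:y.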

Page 18: Finite-State Methods in Natural Language Processing

Lenient Composition .O.

Let R be a relation that maps each input string to one or more outputs.

Let C be a constraint that eliminates some outputs.

R .O. C is the relation that maps each input string that can meet the constraint C to the outputs that meet C and leaves the rest of the relation R unchanged. (Karttunen 1998)

R .O. C = [ [R .o. C] .P. R ]

Is constraint ranking rule ordering in disguise? Yes.
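The definition R .O. C = [ [R .o. C] .P. R ] can be illustrated on a finite relation. This is a minimal Python sketch under simplifying assumptions: the relation is a set of (input, output) pairs, the constraint is a predicate on outputs, and the toy data and helper names are hypothetical.

```python
def compose(relation, constraint):
    # R .o. C: keep only the pairs whose output satisfies the constraint
    return {(i, o) for i, o in relation if constraint(o)}


def priority_union(r, q):
    # R .P. Q: pairs of r, plus pairs of q for inputs that r does not map
    r_inputs = {i for i, _ in r}
    return set(r) | {(i, o) for i, o in q if i not in r_inputs}


def lenient_compose(relation, constraint):
    # R .O. C = [R .o. C] .P. R
    return priority_union(compose(relation, constraint), relation)


def no_star(output):
    return "*" not in output


# in1 has an output that meets the constraint; in2 has none.
R = {("in1", "ab"), ("in1", "a*b"), ("in2", "c*d"), ("in2", "c**d")}

print(sorted(lenient_compose(R, no_star)))
```

For in1 only the violation-free output survives. For in2 no output meets the constraint, so both outputs are passed through unchanged, which is why at least one candidate always survives.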

Page 19: Finite-State Methods in Natural Language Processing

Need a prolific GEN

ka.la  ka.lá  ka.là  ka.(là)  ka.(lá)  ká.la  ká.lá  ká.là  ká.(là)  ká.(lá)  kà.la

(kà.la)  (ká).la  (ká).lá  (ká).là  (ká).(là)  (ká).(lá)  (ká.là)  (ká.lá)  (ká.la) ☜  (ka.là)  (ka.lá)

kà.lá  kà.là  kà.(là)  kà.(lá)  (kà).la  (kà).lá  (kà).là  (kà).(là)  (kà).(lá)  (kà.là)  (kà.lá)

kala ‘fish’ (Nom Sg): 33 candidates

Page 20: Finite-State Methods in Natural Language Processing

Basic definitions 1

Using Parc/XRCE regular expression syntax:

define C [b | c | d | f | g | h | j | k | l | m |

n | p | q | r | s | t | v | w | x | z]; # Consonant

define HighV [u | y | i]; # High vowel

define MidV [e | o | ö]; # Mid vowel

define LowV [a | ä] ; # Low vowel

define USV [HighV | MidV | LowV]; # Unstressed Vowel

define MSV [á | é | í | ó | ú | ý | ä´ | ö´]; # Vowel with main stress

define SSV [à | è | ì | ò | ù | y` | ä` | ö`]; # Vowel with secondary stress

define SV [MSV | SSV]; # Stressed vowel

define V [USV | SV] ; # Vowel

Page 21: Finite-State Methods in Natural Language Processing

Basic definitions 2

define P [V | C]; # Phone

define B [[\P+] | .#.]; # Boundary

define E .#. | "."; # Edge

define Light [C* V]; # Light syllable

define Heavy [Light P+]; # Heavy syllable

define S [Heavy | Light]; # Syllable

define SS [S & $SV]; # Stressed syllable

define US [S & ~$SV]; # Unstressed syllable

define MSS [S & $MSV] ; # Syllable with main stress

Page 22: Finite-State Methods in Natural Language Processing

GEN 1

define MarkNonDiphthongs [
    [. .] -> "." || [HighV | MidV] _ LowV,   # i.a, e.a
                    LowV _ MidV,             # a.e
                    i _ [MidV - e],          # i.o, i.ö
                    u _ [MidV - o],          # u.e
                    y _ [MidV - ö],          # y.e
                    $V i _ e,                # poiki.en
                    $V u _ o,
                    $V y _ ö ];

Insert a syllable boundary between vowels that cannot form a diphthong: i.a, e.a, a.e, i.o, u.e, y.e, etc.

define Syllabify C* V+ C* @-> ... "." || _ C V ;

Insert a syllable boundary after a maximal C* V+ C* pattern that is followed by C V. For example, strukturalismi -> struk.tu.ra.lis.mi.
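The effect of the Syllabify rule can be approximated with an ordinary regular expression. The following Python sketch is an approximation, not the fst rule: it relies on greedy matching with a lookahead instead of the left-to-right longest-match operator, it covers only unstressed vowels, and it does not include the MarkNonDiphthongs step. The name `syllabify` is hypothetical.

```python
import re

C = "bcdfghjklmnpqrstvwxz"  # consonants from the basic definitions
V = "aeiouyäö"              # unstressed vowels only

# A maximal C* V+ C* stretch that is followed by C V.
PATTERN = re.compile(rf"[{C}]*[{V}]+[{C}]*(?=[{C}][{V}])")


def syllabify(word):
    # Insert "." after each match, scanning from left to right.
    return PATTERN.sub(lambda m: m.group(0) + ".", word)


for w in ("strukturalismi", "kalastelemme", "ilmoittautuminen"):
    print(syllabify(w))
```

In strukturalismi the first match is struk rather than strukt because the match must be followed by a consonant and a vowel, so the final consonant before each vowel is left to begin the next syllable.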

Page 23: Finite-State Methods in Natural Language Processing

GEN 2

define Stress a (->) á|à, e (->) é|è, i (->) í|ì,

o (->) ó|ò, u (->) ú|ù, y (->) "y´"|"y`",

ä (->) "ä´"|"ä`", ö (->) "ö´"|"ö`";

Optionally stress any vowel with a primary or secondary stress.

define Scan [[S ("." S ("." S)) & $SS] (->) "(" ... ")" || E _ E] ;

Optionally group syllables into unary, binary, or ternary feet when there is at least one stressed syllable.

define Gen [MarkNonDiphthongs .o. Syllabify .o.

Stress .o. Scan];
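The candidate set produced by the Stress and Scan steps can be enumerated directly for a short word. The sketch below is a Python model under stated assumptions, not the fst network: the word is given already syllabified, each syllable is unstressed, primary-stressed or secondary-stressed (marked on its first vowel), and only the five plain vowels are handled. All names are hypothetical.

```python
from itertools import product

ACUTE = {"a": "á", "e": "é", "i": "í", "o": "ó", "u": "ú"}
GRAVE = {"a": "à", "e": "è", "i": "ì", "o": "ò", "u": "ù"}


def stress_variants(syllable):
    """Unstressed, primary-stressed and secondary-stressed forms."""
    for i, ch in enumerate(syllable):
        if ch in ACUTE:  # assumption: stress is marked on the first vowel
            head, tail = syllable[:i], syllable[i + 1:]
            return [syllable, head + ACUTE[ch] + tail, head + GRAVE[ch] + tail]
    return [syllable]


def gen(syllables):
    """Yield every stressed and footed parse as a list of chunks."""
    if not syllables:
        yield []
        return
    # Leave the first syllable unfooted, with any stress.
    for form in stress_variants(syllables[0]):
        for rest in gen(syllables[1:]):
            yield [form] + rest
    # Or open a unary, binary or ternary foot with at least one stress.
    for size in (1, 2, 3):
        if size > len(syllables):
            break
        for forms in product(*map(stress_variants, syllables[:size])):
            if list(forms) == syllables[:size]:
                continue  # no stressed syllable in the foot
            foot = "(" + ".".join(forms) + ")"
            for rest in gen(syllables[size:]):
                yield [foot] + rest


def candidates(syllables):
    return {".".join(chunks) for chunks in gen(syllables)}


kala = candidates(["ka", "la"])
print(len(kala))
```

For kala this enumeration yields 33 strings, the same number as the candidate list for kala ‘fish’.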

Page 24: Finite-State Methods in Natural Language Processing

Demo!

fst -utf8 -l gen.script

regex {kala} .o. Gen (compose)

print lower-words (show output candidates)

print size (count them)

Page 25: Finite-State Methods in Natural Language Processing

Kiparsky's nine constraints

• Clash
• AlignLeft
• MainStress
• FootBin
• Lapse
• NonFinal
• StressToWeight
• Parse
• AllFeetFirst

Page 26: Finite-State Methods in Natural Language Processing

Counting constraint violations

We use asterisks to mark constraint violations. We need a way to prefer candidates with the fewest violation marks.

define Viol ${*};

define Viol0 ~Viol;       # No violations
define Viol1 ~[Viol^2];   # At most one violation
define Viol2 ~[Viol^3];   # At most two violations
define Viol3 ~[Viol^4];   # At most three violations

This eliminates the violation marks after the candidate set has been pruned by a constraint.

define Pardon {*} -> 0;

Page 27: Finite-State Methods in Natural Language Processing

Defining OT Constraints

Three types:

• Unviolable constraints
    - Primary stress in Finnish

• Ordinary violable constraints
    - Lapse

• Gradient alignment constraints
    - All-Feet-First

Strategy: We define an evaluation template for each of the three types and then define the individual constraints with the help of the templates.

Page 28: Finite-State Methods in Natural Language Processing

Evaluation Template for Unviolable Constraints

define Unviolable(Candidates, Constraint) [ Candidates .o. Constraint ];

Example:

define MainStress(X) Unviolable(X, B MSS ~$MSS);

# B is the left edge of the word or "(".
# MSS is a syllable with a primary stress.

Page 29: Finite-State Methods in Natural Language Processing

Evaluation Template for Ordinary Constraints

define Eval(Candidates, Violation, Left, Right) [
    Candidates
    .o. Violation -> ... {*} || Left _ Right
    .O. Viol3 .O. Viol2 .O. Viol1 .O. Viol0
    .o. Pardon ];

where Viol0 is ~${*}, Viol1 is ~[[${*}]^2], etc. and Pardon is {*} -> 0, deleting all violation marks.
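The chain Viol3 .O. Viol2 .O. Viol1 .O. Viol0 picks the candidates with the fewest marks, as long as the best candidate has at most three. The following Python sketch models this for the candidates of a single input; it is an illustration with hypothetical helper names and a made-up marking function, not the fst template.

```python
def lenient_filter(cands, ok):
    # Keep the candidates that pass; if none pass, keep them all.
    kept = {c for c in cands if ok(c)}
    return kept or set(cands)


def at_most(n):
    # Counterpart of Viol0 .. Viol3: at most n violation marks.
    return lambda c: c.count("*") <= n


def evaluate(cands, mark, bound=3):
    marked = {mark(c) for c in cands}          # add violation marks
    for n in range(bound, -1, -1):             # Viol3, Viol2, Viol1, Viol0
        marked = lenient_filter(marked, at_most(n))
    return {c.replace("*", "") for c in marked}  # Pardon


def mark_x(cand):
    # Hypothetical constraint: every "x" counts as one violation.
    return cand.replace("x", "x*")


print(evaluate({"axb", "axxb", "axxxb"}, mark_x))   # fewest marks wins
print(evaluate({"xxxx", "xxxxx"}, mark_x))          # both exceed the bound
```

In the second call every candidate has more than three marks, so no filter can separate them and both survive. This is the bounded-counting limitation: the template distinguishes violation counts only up to the highest ViolN it contains.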

Page 30: Finite-State Methods in Natural Language Processing

Evaluation Template for Left-Oriented Gradient Alignment

define EvalGradientLeft(Candidates, Violation, Left, Right) [
    Candidates
    .o. Violation -> {*} ... || .#. Left _ Right
    .o. Violation -> {*}^2 ... || .#. Left^2 _ Right
    .o. Violation -> {*}^3 ... || .#. Left^3 _ Right
    .o. Violation -> {*}^4 ... || .#. Left^4 _ Right
    .o. Violation -> {*}^5 ... || .#. Left^5 _ Right
    .o. Violation -> {*}^6 ... || .#. Left^6 _ Right
    .o. Violation -> {*}^7 ... || .#. Left^7 _ Right
    .o. Violation -> {*}^8 ... || .#. Left^8 _ Right
    .O. Viol12 .O. Viol11 .O. Viol10 .O. Viol9 .O. Viol8 .O. Viol7
    .O. Viol6 .O. Viol5 .O. Viol4 .O. Viol3 .O. Viol2 .O. Viol1 .O. Viol0
    .o. Pardon ];

Page 31: Finite-State Methods in Natural Language Processing

Clash, AlignLeft, MainStress

Clash: No stress on adjacent syllables.

define Clash(X) Eval(X, SS, SS B, ?*);

Align-Left: The stressed syllable is initial in the foot.

define AlignLeft(X) Eval(X, SV, .#. ~[?* "(" C*], ?*);

Main Stress: The primary stress in Finnish is on the first syllable.

define MainStress(X) Unviolable(X, B MSS ~$MSS);

Page 32: Finite-State Methods in Natural Language Processing

FootBin, Lapse, NonFinal

Foot-Bin: Feet are minimally bimoraic and maximally bisyllabic.

define FootBin(X) Eval(X, "(" Light ")" | "(" S ["." S]^>1, ?*, ?*);

Lapse: Every unstressed syllable must be adjacent to a stressed syllable or to the word edge.

define Lapse(X) Eval(X, US, [B US B], [B US B]);

Non-Final: The final syllable is not stressed.

define NonFinal(X) Eval(X, SS, ?*, ~$S .#.);

Page 33: Finite-State Methods in Natural Language Processing

StressToWeight, Parse, AllFeetFirst

Stress-To-Weight: Stressed syllables are heavy.

define StressToWeight(X) Eval(X, SS & Light, ?*, ")" | E);

License-σ: Syllables are parsed into feet.

define Parse(X) Eval(X, S, E, E);

All-Ft-Left: The left edge of every foot coincides with the left edge of some prosodic word.

define AllFeetFirst(X) [ EvalGradientLeft(X, "(", $".", ?*) ];

Page 34: Finite-State Methods in Natural Language Processing

Finnish Prosody

Kiparsky 2003:

define FinnishProsody(Input) [

AllFeetFirst( Parse( StressToWeight(

NonFinal( Lapse( FootBin( MainStress(

AlignLeft( Clash( Input .o. Gen)))))))))];
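The nesting above applies the constraints from the innermost (Clash) outwards. The Python sketch below traces the first four steps on a hand-picked subset of the kala candidates. The violation counters are simplified stand-ins written for this illustration, not translations of the fst definitions; they handle only short plain vowels, and `keep_best` takes the true minimum instead of counting marks up to a bound.

```python
import re

PRIMARY, SECONDARY = set("áéíóú"), set("àèìòù")
VOWELS = set("aeiou") | PRIMARY | SECONDARY


def parse(cand):
    """Split a candidate into (is_foot, [syllables]) groups."""
    groups = []
    for m in re.finditer(r"\(([^()]*)\)|([^().]+)", cand):
        if m.group(1) is not None:
            groups.append((True, m.group(1).split(".")))
        else:
            groups.append((False, [m.group(2)]))
    return groups


def syllables(cand):
    return [s for _, syls in parse(cand) for s in syls]


def stressed(syl):
    return any(ch in PRIMARY or ch in SECONDARY for ch in syl)


def light(syl):
    return syl[-1] in VOWELS and sum(ch in VOWELS for ch in syl) == 1


def clash(cand):  # stressed syllable right after a stressed syllable
    flags = [stressed(s) for s in syllables(cand)]
    return sum(a and b for a, b in zip(flags, flags[1:]))


def align_left(cand):  # stressed syllable that is not foot-initial
    return sum(stressed(s) for is_foot, syls in parse(cand)
               for i, s in enumerate(syls) if not (is_foot and i == 0))


def main_stress(cand):  # unviolable: primary stress on the first syllable only
    prim = [any(ch in PRIMARY for ch in s) for s in syllables(cand)]
    return prim[0] and not any(prim[1:])


def foot_bin(cand):  # unary light feet and feet longer than two syllables
    return sum(1 for is_foot, syls in parse(cand) if is_foot
               and (len(syls) > 2 or (len(syls) == 1 and light(syls[0]))))


def keep_best(cands, violations):
    best = min(violations(c) for c in cands)
    return {c for c in cands if violations(c) == best}


def evaluate(cands):
    cands = keep_best(cands, clash)
    cands = keep_best(cands, align_left)
    cands = {c for c in cands if main_stress(c)}
    return keep_best(cands, foot_bin)


subset = {"ka.la", "(ká.la)", "(ká).la", "(kà.la)",
          "ka.(lá)", "(ká).(là)", "(ká.là)", "(ka.lá)"}
print(evaluate(subset))
```

On this subset Clash removes (ká).(là) and (ká.là), AlignLeft removes (ka.lá), MainStress leaves (ká.la) and (ká).la, and FootBin rejects the unary light foot, so (ká.la) is the only survivor.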

Page 35: Finite-State Methods in Natural Language Processing

FinnWords

regex FinnishProsody( {kalastelet} | {kalasteleminen} |

{ilmoittautuminen} | {järjestelmättömyydestänsä} |

{kalastelemme} | {ilmoittautumisesta} |

{järjestelmällisyydelläni} | {järjestelmällistämätöntä} |

{voimisteluttelemasta} | {opiskelija} | {opettamassa} |

{kalastelet} | {strukturalismi} | {onnittelemanikin} |

{mäki} | {perijä} | {repeämä} | {ergonomia} |

{puhelimellani} | {matematiikka} | {puhelimistani} |

{rakastajattariansa} | {kuningas} | {kainostelijat} |

{ravintolat} | {merkonomin} ) ;

Demo!

Page 36: Finite-State Methods in Natural Language Processing

Result

(ér.go).(nò.mi).a
(íl.moit).(tàu.tu).mi.(sès.ta)
(íl.moit).(tàu.tu).(mì.nen)
(ón.nit).(tè.le).(mà.ni).kin
(ó.pis).(kè.li).ja
(ó.pet).ta.(màs.sa)
(vói.mis).te.(lùt.te).le.(màs.ta)
(strúk.tu).ra.(lìs.mi)
(rá.vin).(tò.lat)
(rá.kas).ta.(jàt.ta).ri.(àn.sa)
(ré.pe).(ä`.mä)
(pé.ri).jä
(pú.he).li.(mèl.la).ni
(pú.he).li.(mìs.ta).ni
(mä´.ki)
(má.te).ma.(tìik.ka)
(mér.ko).(nò.min)
(kái.nos).(tè.li).jat
(ká.las).te.(lèm.me)
(ká.las).te.(lè.mi).nen
(ká.las).(tè.let)
(kú.nin).gas
(jä´r.jes).tel.(mä`l.li).syy.(dèl.lä).ni
(jä´r.jes).(tèl.mät).tö.(my`y.des).(tä`n.sä)
(jä´r.jes).(tèl.mäl).(lìs.tä).mä.(tö`n.tä)

Page 37: Finite-State Methods in Natural Language Processing

Two Errors

(ká.las).te.(lè.mi).nen
(jä´r.jes).tel.(mä`l.li).syy.(dèl.lä).ni

The interaction of Lapse and StressToWeight does not produce the desired result in these cases.

Page 38: Finite-State Methods in Natural Language Processing

What is wrong?

define Debug(Input) [ DebugStressToWeight( NonFinal( Lapse( FootBin(
                      MainStress( AlignLeft( Clash( Input .o. Gen))))))) ];

regex Debug({kalasteleminen});

(ká*.las).te.(lè*.mi).nen                       <-- actual winner
(ká*.las).(tè*.le).(mì*.nen)                    <-- desired output

(jä´r.jes).tel.(mä`l.li).syy.(dèl.lä).ni        <-- actual winner
(jä´r.jes).(tèl.mäl).li.(sy`y.del).(lä`*.ni)    <-- desired output

The StressToWeight constraint eliminates some of the desired winning candidates.
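The asterisks in the debug output mark stressed light syllables. The count can be reproduced with a short Python sketch, given here as an illustration under simplifying assumptions: a syllable counts as light when it ends in its only vowel, stress is read off an acute or grave mark (precomposed, combining, or written after the vowel), and the function names are hypothetical.

```python
import re
import unicodedata

# Combining acute/grave, spacing acute, and the backtick used after ä, ö, y.
STRESS_MARKS = {"\u0301", "\u0300", "\u00b4", "`"}
VOWELS = set("aeiouyäö")


def split_stress(syl):
    """Return the syllable without stress marks, and whether it had one."""
    decomposed = unicodedata.normalize("NFD", syl)
    is_stressed = any(ch in STRESS_MARKS for ch in decomposed)
    base = "".join(ch for ch in decomposed if ch not in STRESS_MARKS)
    return unicodedata.normalize("NFC", base), is_stressed


def is_light(base):
    return base[-1] in VOWELS and sum(ch in VOWELS for ch in base) == 1


def stress_to_weight(cand):
    """Number of stressed light syllables in a candidate."""
    count = 0
    for syl in re.findall(r"[^().]+", cand):
        base, is_stressed = split_stress(syl)
        if is_stressed and is_light(base):
            count += 1
    return count


pairs = [
    ("(ká.las).te.(lè.mi).nen", "(ká.las).(tè.le).(mì.nen)"),
    ("(jä´r.jes).tel.(mä`l.li).syy.(dèl.lä).ni",
     "(jä´r.jes).(tèl.mäl).li.(sy`y.del).(lä`.ni)"),
]
for actual, desired in pairs:
    print(stress_to_weight(actual), stress_to_weight(desired))
```

For both words the desired output has one more stressed light syllable than the actual winner (3 against 2, and 1 against 0), so a filter that prefers fewer violations removes the desired output at this step.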

Page 39: Finite-State Methods in Natural Language Processing

Nine Elenbaas

A unified account of binary and ternary stress. Ph.D. dissertation. University of Utrecht. 1999.

Based on Kiparsky & Hanson 1996. The only difference is that Elenbaas has a special constraint *(L’ H) or AntiLStressH( in place of Kiparsky’s more general StressToWeight constraint.

define FinnishProsody(Input) [

AllFeetFirst( Parse( AntiLStressH(

NonFinal( Lapse( AlignLeft( FootBin(

MainStress( Clash( Input .o. Gen))))))))) ];

define AntiLStressH(X) Eval(X, SS & Light, "(" , "." Heavy);

Page 40: Finite-State Methods in Natural Language Processing

Result

(ér.go).(nò.mi).a
(íl.moit).(tàu.tu).mi.(sès.ta)
(íl.moit).(tàu.tu).(mì.nen)
(ón.nit).(tè.le).(mà.ni).kin
(ó.pis).(kè.li).ja
(ó.pet).ta.(màs.sa)
(vói.mis).te.(lùt.te).le.(màs.ta)
(strúk.tu).ra.(lìs.mi)
(rá.vin).(tò.lat)
(rá.kas).ta.(jàt.ta).ri.(àn.sa)
(ré.pe).(ä`.mä)
(pé.ri).jä
(pú.he).li.(mèl.la).ni
(pú.he).li.(mìs.ta).ni
(mä´.ki)
(má.te).ma.(tìik.ka)
(mér.ko).(nò.min)
(kái.nos).(tè.li).jat
(ká.las).te.(lèm.me)
(ká.las).te.(lè.mi).nen
(ká.las).(tè.let)
(kú.nin).gas
(jä´r.jes).(tèl.mäl).li.(sy`y.del).(lä`.ni)
(jä´r.jes).(tèl.mät).tö.(my`y.des).(tä`n.sä)
(jä´r.jes).(tèl.mäl).(lìs.tä).mä.(tö`n.tä)

Page 41: Finite-State Methods in Natural Language Processing

Did She Know?

Six syllables (Appendix of Elenbaas thesis)

X X L L L L

áterìanàni áteriànani 'meal (Ess 1SG)'
érgonòmiàna 'ergonomics (Ess)'
káinostèlijàna 'shy person (Ess)'
káinostèlijàni 'shy person (Nom 1SG)'
kúnnallìsenàni 'council (Ess 1SG)'
kúnnallìsiàni 'councils (Part 1SG)'
kúnnallìsinàni 'councils (Ess 1SG)'
mérkonòmiàni 'degree in economics (Part 1SG)'
mérkonòminàni 'degree in economics (Ess 1SG)'
ópiskèlijàni 'student (Nom 1SG)'
púhelìmenàni 'telephone (Ess 1SG)'
púhelìmiàni 'telephone (Part 1SG)'

Missing pattern: X X L L L H

Page 42: Finite-State Methods in Natural Language Processing

Conclusion

• Can we get ternary feet in Finnish “for free”, from the interaction of independently motivated principles?
    - We don’t know.
    - We know that the Kiparsky and Elenbaas accounts fail.

• Optimality Prosody is computationally very difficult.
    - The number of initial candidates is huge:
          kalasteleminen              70653
          järjestelmällisyydelläni    21767579

• Simple tableau methods do not work.
    - Finite-state implementation guards against errors made by a human GEN and EVAL.

• But even when an error can be pinpointed, the fix is not obvious.
    - Debugging OT constraints is as hard as debugging two-level rules, in practice more difficult than rewrite systems.
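The two candidate counts can be reproduced with a short recurrence. The Python sketch below is a counting model under stated assumptions, not the fst computation: each syllable has three stress options, a foot spans one to three syllables and contains at least one stressed syllable, and the words have 6 and 9 syllables (ka.las.te.le.mi.nen and jär.jes.tel.mäl.li.syy.del.lä.ni). The function names are hypothetical.

```python
from functools import lru_cache

STRESS_OPTIONS = 3  # unstressed, primary, secondary


def foot_variants(size):
    # All stress assignments for the foot except the fully unstressed one.
    return STRESS_OPTIONS ** size - 1


@lru_cache(maxsize=None)
def count_candidates(n):
    """Number of stressed and footed parses of an n-syllable word."""
    if n == 0:
        return 1
    # The first syllable is unfooted ...
    total = STRESS_OPTIONS * count_candidates(n - 1)
    # ... or it opens a unary, binary or ternary foot.
    for size in (1, 2, 3):
        if size <= n:
            total += foot_variants(size) * count_candidates(n - size)
    return total


for syllable_count in (2, 6, 9):
    print(syllable_count, count_candidates(syllable_count))
```

Under these assumptions the recurrence gives 33 for two syllables, 70653 for six and 21767579 for nine, matching the figures for kala, kalasteleminen and järjestelmällisyydelläni. The count grows by a factor of roughly 6.8 per added syllable, which is why checking candidates by hand quickly becomes impractical.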

Page 43: Finite-State Methods in Natural Language Processing

Final Thoughts

• Morphology is a regular relation.
    - The composition of words (morphosyntax), morphological alternations, and prosody can be described in finite-state terms.

• A complex relation can be decomposed in different ways.
    - There are many flavors of finite-state morphology: Item-and-Arrangement, Rewrite rules, Two-level rules, Realizational Morphology, Classical optimality constraints.

• Computing with finite-state tools is fun and easy.
    - We have a sophisticated formalism for describing regular relations, efficient compilers and runtime software.

• ‘Pen-and-pencil’ morphology badly needs computational support.
    - It is difficult to get globally correct results relying on a handful of interesting words, rules, and constraints.