lecturer: mathias creutz · esimerkki: onko luontevampaa sanoa ’strong tea’ vai ’powerful...

Statistical and Adaptive NaturalLanguage Processing

T-61.5020 (5 cr) L, Spring 2007

Sixth lecture

Contextual Information andCompositionality

Lecturer: Mathias Creutz

Slides: Krista Lagus, Mathias Creutz, Timo Honkela

6. Contextual Information andCompositionality . . . . . . . . . . . . . . . . . . . . . . . 36.1 Word Context: “Meaning Is Use” . . . . . . . . . . 46.2 Expressions with Limited Compositionality:

Collocations . . . . . . . . . . . . . . . . . . . . . 76.3 Hypothesis Testing (for the Discovery of Colloca-

tions) . . . . . . . . . . . . . . . . . . . . . . . . 156.4 Compositionality Within Words . . . . . . . . . . . 29

2

6. Contextual Information andCompositionality

• The importance of context in the interpretation of language:for instance ambiguous word meanings:

– “Aloitin alusta.”

– “Alusta kovalevy!”

– “Nain monta alusta.”

• Compositionality of meaning and form (within and beyond words):

– “I saw you yesterday.” (= i + see + you + yesterday)

– “openmindedness” (= open + mind + -ed + -ness)

– “What is chewing gum made of?”

– “The New Scientist reports that people who chewed gum duringthe memory tests scored higher than those who did not.”

– “How do you do.”

3

6.1 Word Context: “Meaning Is Use”

• What is meaning? What is meaning in language?

– cognitive linguistics (1970 – ) vs. structuralists (18xx – 1970)

• Words are not labels for real things in the world. They refer to ideasthat we have of the world. (Saussure)

• To understand a particular word, one has to know in which possiblecombinations with other words it occurs. Consider less obvious casessuch as time, consider, the, of. (Harris)

• You shall know a word by the company it keeps. (Firth)

• Meaning is use. (Wittgenstein)

⇒ Computational approach: The typical context in which a word occurs canbe used to characterize the word in relation to other words.

4

Types of Contexts

• n-grams, dynamical context

– I don’t like . . .

• fixed window, word-document matrix

– I don’t like icecream , but I do like cookies.

– Represent words as vectors

• bag of words vs. position sensitivity

– Are different positions in the word context mapped onto differentsubvectors?

• distant dependencies and phrasal structure:

5

Example: Phrasal Structure Using Functional Dependency Grammar(FDG)

• President Bush gave the Idol finalists a tour of the White House.

6

6.2 Expressions with Limited Compositionality:Collocations

• Kollokaatio on kahdesta tai useammasta sanasta koostuva konventio-naalistunut ilmaus

• Esimerkkeja:

– ’weapons of mass destruction’, ’disk drive’, ’part of speech’(suomessa yhdyssanoina ’joukkotuhoaseet’, ’levyasema’, ’sana-luokkatieto’)

– ’bacon and eggs’

– verbin valinta: ’make a decision’ ei ’take a decision’.

– adjektiivin valinta: ’strong tea’ mutta ei ’powerful tea’; ’vahvaateeta’, harvemmin ’voimakasta teeta’ (valinnat voivat heijastaakulttuurin asenteita: strong → tea, coffee, cigarettes powerful →drugs, antidote)

7

– ’kick the bucket’, ’heittaa veivinsa’ (kiertoilmaus, sanonta, idio-mi)

• Olentoja, yhteisoja, paikkoja tai tapahtumia yksiloivat nimet: ’WhiteHouse’ Valkoinen talo, ’Tarja Halonen’

• Kollokaation kanssa osittain paallekkaisia kasitteita: termi, tekninentermi, terminologinen fraasi. Huom: tiedonhaussa sanalla ’termi’ laa-jempi merkitys: ’sana tai kollokaatio’.

8

Sanan frekvenssi ja sanaluokkasuodatus

Pelkan frekvenssin kaytto:

Esimerkki: Onko luontevampaa sanoa ’strong tea’ vai ’powerful tea’?Ratkaisu: Etsitaan Googlella: ’strong tea’ 9270, ’powerful tea’ 201

Joihinkin tasmallisiin kysymyksiin riittava tapa. Kuitenkin jarjestettaess bi-grammeja frekvenssin mukaan, parhaita ovat ’of the’, ’in the’, ’to the’, . . .

Frekvenssi + sanaluokka:

Jos tunnetaan kunkin sanan sanaluokka, seka osataan kuvailla kollokaatioiden’sallitut’ sanaluokkahahmot:

• Jarjestetaan sanaparit tai -kolmikot yleisyyden (lukumaara) mukaan

• Hyvaksytaan vain tietyt sanaluokkahahmot:AN, NN, AAN, ANN, NAN, NNN, NPN (Justeson & Katz’s POS filter)

9

Sanojen etaisyyden keskiarvo ja varianssi

Enta joustavammat kollokaatiot, joiden keskella on kollokaatioon kuulumat-tomia sanoja?

Lasketaan etaisyyden keskiarvo ja varianssi. Jos keskiarvo nollasta poikkeavaja varianssi pieni, potentiaalinen kollokaatio (Huom: oletetaan siis etaisyydenjakautuvan gaussisesti).

Esim. ’knock . . . door ’ (ei ’hit’, ’beat’, tai ’rap’):a) ’She knocked on his door ’b) ’They knocked at the door ’c) ’100 women knocked on Donaldson’s door ’d) ’a man knocked on the metal front door ’

11

Algoritmi

• Liu’uta kiintean kokoista ikkunaa tekstin yli (leveys esim. 9) ja keraakaikki sanaparin esiintymat koko tekstissa

• Laske sanojen etaisyyksien keskiarvo:d = 1/n

∑ni=1 di = 1/4(3 + 3 + 5 + 5) = 4.0

(jos heittomerkki ja ’s’ lasketaan sanoiksi)

• Estimoi varianssi s2 (pienilla naytemaarilla):

s2 =Pn

i=1(di−d)2

n−1= 1/3((3−4.0)2+(3−4.0)2+(5−4.0)2+(5−4.0)2)

s = 1.15

12

Pohdittavaksi:

1. Mita tapahtuu jos sanoilla on kaksi tai useampia tyypillisia positioita suh-teessa toisiinsa?2. Mika merkitys on ikkunan leveydella?

14

6.3 Hypothesis Testing (for the Discovery of Collocations)

Onko suuri osumamaara yhteensattumaa (esim. johtuen siita etta jomman-kumman perusfrekvenssi on suuri)? Osuvatko kaksi sanaa yhteen useamminkuin sattuma antaisi olettaa?

1. Formuloi nollahypoteesi H0: assosiaatio on sattumaa

2. Laske tn p etta sanat esiintyvat yhdessa jos H0 on tosi

3. Hylkaa H0 jos p liian alhainen, alle merkitsevyystason, esim p < 0.05tai p < 0.01.

Nollahypoteesia varten sovelletaan riippumattomuuden maaritelmaa.

Oletetaan etta sanaparin todennakoisyys, jos H0 on tosi, on kummankinsanan oman todennakoisyyden tulo:P (w1w2) = P (w1)P (w2)

15

The t Test

Tilastollinen testi sille eroaako havaintojoukon odotusarvo oletetun, datangeneroineen jakauman odotusarvosta. Olettaa, etta todennakoisyydet ovatsuunnilleen normaalijakautuneita.

t =x− µ√

s2

N

, jossa (1)

x, s2 : naytejoukon keskiarvo ja varianssi, N = naytteiden lukumaara, ja µ =jakauman keskiarvo. Valitaan haluttu p-taso (0.05 tai pienempi). Luetaantata vastaava t:n ylaraja taulukosta. Jos t suurempi, H0 hylataan.

16

Discovery of Collocations Using the t Test

Nollahypoteesina etta sanojen yhteisosumat ovat satunnaisia: Esimerkki: H0 :P (new companies) = P (new)P (companies)

µ = P (new)P (companies)

x = c(new companies)c(·,·) = p

s2 = p(1− p) = p(1− p) ≈ p (patee Bernoulli-jakaumalle)N = c(·, ·)

Bernoulli: special case of binomial distribution: b(r;n, p) =(nr

)pr(1− p)n−r, such

that n = 1 and r ∈ {0, 1}.• Jarjestetaan sanat paremmuusjarjestykseen mitan mielessa TAI

• Hypoteesin testaus: valitaan merkittavyystaso (p=0.05 tai p=0.01) jakatsotaan t-testin taulukosta arvo, jonka ylittaminen tarkoittaa nolla-hypoteesin hylkaysta.

17

Pearson’s Chi-Square Test χ2

• χ2-testi mittaa muuttujien valista riippuvuutta perustuen riippumat-tomuuden maaritelmaan: jos muuttujat ovat riippumattomia, yhteisja-kauman arvo tietyssa jakauman pisteessa on marginaalijakaumien (reu-najakaumien) tulo.

• Kahden muuttujan jakauma voidaan kuvata 2-ulotteisena kontingens-sitaulukkona (r × c).

• Lasketaan kussakin taulukon pisteessa (i, j) erotus havaitun jakaumanO (tod. yhteisjakauma) ja odotetun jakauman E (marginaalijakaumientulo) valilla, ja summataan skaalattuna jakauman odotusarvolla:

χ2 =∑i,j

(Oi,j − Ei,j)2

Ei,j

(2)

jossa siis E(i, j) = O(i, ·) ∗O(·, j).

• χ2 on asymptoottisesti χ2-jakautunut. Ongelma kuitenkin: herkka har-valle datalle.

18

• Nyrkkisaanto: ala kayta testia jos N < 20 tai jos 20 ≤ N ≤ 40 jajokin Ei,j ≤ 5

19

Discovery of Collocations Using the χ2 Test

Formuloidaan ongelma siten etta kumpaakin sanaa vastaa yksi satunnais-muuttuja joka voi saada kaksi arvoa (sana joko esiintyy tai ei esiinny yk-sittaisessa sanaparissa).

Sanojen yhteistnjakauma voidaan talloin esittaa 2× 2 taulukkoina. Esim.w1 = new w1 6= new

w2 = companies 8 4667w2 6= companies 15280 14287173

2× 2-taulukon tapauksessa kaava 2 voidaan johtaa muotoon:

χ2 =N(O11O22 −O12O21)

2

(O11 + O12)(O11 + O21)(O12 + O22)(O21 + O22)

• Jarjestetaan sanat paremmuusjarjestykseen mitan mielessa TAI

• Hypoteesin testaus: valitaan merkittavyystaso (p=0.05 tai p=0.01) jakatsotaan χ2-taulukosta arvo jonka ylittaminen tarkoittaa nollahypo-

20

teesin hylkaysta.

Ongelmallisuus kollokaatioiden tunnistamisen kannalta

Tassa soveltamistavassa ei erotella negatiivista ja positiivista riippuvuutta.Ts. jos sanat vierastavat toisiaan, testi antaa myos suuren arvon, koskatalloin sanojen esiintymisten valilla todellakin on riippuvuus. Kollokaatioitaetsittaessa ollaan kuitenkin kiinnostuttu vain positiivisista riippuvuuksista.

Muita sovelluksia χ2-testille:

• Konekaannos: Kohdistettujen korpusten sana-kaannosparien tunnista-minen (cow, vache yhteensattumat johtuvat riippuvuudesta)

• Metriikka kahden korpuksen valiselle samankaltaisuudelle: n × 2-taulukko jossa kullekin tutkittavalle sanalle wi, i ∈ (1 . . . n) kerrotaanko. sanan lukumaara korpuksessa j

21

Likelihood Ratios

Kuinka paljon uskottavampi H2 on kuin H1? Lasketaan hypoteesien uskot-tavuuksien suhde λ:

log λ = logL(H1)

L(H2)

Esimerkki:H1: w1 ja w2 riippumattomia: P (w2|w1) = p = P (w2| 6 w1)H2: w1 ja w2 eivat riippumattomia: P (w2|w1) = p1 6= p2 = P (w2| 6 w1)

Oletetaan selva positiivinen riippuvuus, eli p1 � p2.

Kaytetaan ML-estimaatteja (keskiarvoja) laskettaessa p, p1 ja p2:p = c2

N, p1 = c12

c1, p2 = c2−c12

N−c1

Oletetaan binomijakaumat. Esim. p(w2|w1) = b(c12; c1, p). Ilmaistaan kunkinmallin yhtaaikaa voimassa olevat rajoitteet tulona. Lopputulos: kirjan kaava5.10.

log λ on asymptoottisesti χ2-jakautunut. On lisaksi osoitettu etta harval-

22

la datalla uskottavuuksien suhteella saadaan parempi approksimaatio χ2-jakaumalle kuin χ2-testilla.

23

Pointwise Mutual Information

Muistellaan entropian H(x) ja yhteisinformaation I(x; y) kaavoja:

H(x) = −E(log p(x))

I(X; Y ) = H(Y )−H(Y |X) = (H(X) + H(Y ))−H(X, Y )

= EX,Y (log p(X,Y )p(X)p(Y )

)

joka kuvastaa keskimaaraista informaatiota jonka seka x etta y sisaltavat.

Maaritellaan pisteittainen yhteisinformaatio joidenkin tiettyjen tapahtumienx ja y valilla (Fano, 1961):

I(x, y) = log p(x,y)p(x)p(y)

Voidaanko kayttaa kollokaatioiden valintaan? Motivaationa intuitio: jos sano-jen valilla on suuri yhteisinformaatio (ts. niiden kummankin kommunikoimaninformaation yhteinen osuus on suuri), voisi olettaa etta kyse on kollokaa-tiosta.

24

Taulukosta 5.16 huomataan etta jos jompikumpi sanoista on harvinainen,saadaan korkeita lukuja.

25

Taydellisen riippuville sanoille yhteisinformaatio:

I(x, y) = logp(x, y)

p(x)p(y)= log

p(x)

p(x)p(y)= log

1

P (y)

kasvaa kun sanat muuttuvat harvinaisemmiksi. Aaritilanne: kaksi sanaa esiin-tyy kumpikin vain kerran, ja talloin yhdessa. Kuitenkin talloin evidenssia kol-lokaationa toimimisesta on vahan, mika jaa huomiotta.

Johtopaatos: Ei kovin hyva mitta tahan tarkoitukseen, harhaanjohtava eten-kin pienille todennakoisyyksille. Seurauksena karsii datan harvuudesta erityi-sen paljon.

26

Summary: Hypothesis Testing

My guess is that you will not use hypothesis testing in order to discovercollocations (except in the exercise session and possibly in the exam of thiscourse).

However, you will need hypothesis testing if you do research and compare theperformance of your system to some other system. You will need to prove thatyour system performs better than your competitor in a statistically significantway.

• T-test (assumes Gaussian distribution)

• Pearson’s chi-square test (small sample sizes are not sufficient)

Also take a look at (not taught in this course):

• Sign test

• Wilcoxon signed-rank test

28

6.4 Compositionality Within Words

Just as it may be important to recognize that the vocabulary of a languagepartly consists of multi-word expressions (collocations), it is important torecognize that words have inner structure and are related to one another.

Linguistic morphology studies how words are formed from morphemes, whichare the minimal meaningful form-units. Morphemes are portions of utterancesthat recur in other utterances with approximately the same meaning.

For instance, the Finnish morpheme talo (house):

aave+talo+i+sta, aika+kaus+lehti+talo+j+en, aika+talo+a, ai-kuis+koulu+t+us+talo+n, aikuis+talo+uks+i+en, aito+talo+ude+n,akatemia+talo, akatemia+talo+n, akatemia+talo+ssa, akvaario+talo+us,alas+talo, alas+talo+n, alas+talo+on, ala+talo, ala+talo+a,ala+talo+kin, ala+talo+lla, ala+talo+n, ala+talo+on, ala+talo+sta,. . .

29

Zellig Harris’ Heuristic Morpheme Segmentation Method (1955)

Morpheme boundaries are proposed at intra-word locations with a peak insuccessor and predecessor variety.

Let us take a look at some examples of successor variety:

• How many different letters can continue an English word starting with‘d’? (day, debt, dig, dog, drill, . . .)

• How many different letters can continue a word starting with ‘drea’?(dream, dreadful, . . .)

• How many different letters can continue a word starting with ‘dream’?(dreams, dreamy, dreamily, dreamboat, . . .)

Predecessor variety works in the same way, but the words are read backwards:

• How many different letters can precede ‘ily’ at the end of an Englishword? (dreamily, happily, funnily, . . .)

30

Example of Harris’ Method

Note that there are more refined versions, e.g., Hafer and Weiss (1974).

31

Statistical Approaches (1995–)

Think of it as a compression problem. If we observe a corpus of natural language

containing lots of different word forms, what kind of model could explain the kind

of data we observe in an elegant manner?

From what kind of lexicon could words such as apple, orange, lemon,

juice, applejuice, orangejuice, appletree, lemontree emerge?

Suppose that lexicon candidates are created using a statistical generative process:

characters are drawn from an alphabet including a morph break character:

Lexicon 1 “a c e g i j l m n o p r t u ” (14 morphs),

P (Lexicon 1) = ( 127)29

Lexicon 2 “apple juice lemon orange tree ” (5 morphs),

P (Lexicon 2) = ( 127)31

Lexicon 3 “apple applejuice appletree juice lemon lemontree orange orangejuice ” (8 morphs), P (Lexicon 3) = ( 1

27)69

32

Statistical Approaches: Continued

Rewrite the words in the corpus (data) as morphs and compute probability(# is a word boundary morph):

Data | Lexicon 1 “a p p l e # o r a n g e # l e m o n # j u i c e # a p

p l e j u i c e # o r a n g e j u i c e # a p p l e t r e e # l e m

o n t r e e # #” The sequence consists of 69 morphs and its probability

conditioned on Lexicon 1 is: P (Data | Lexicon 1) = ( 114+1)69 ≈ 7.1 · 10−82.

Data | Lexicon 2 “apple # orange # lemon # juice # apple juice #

orange juice # apple tree # lemon tree # #”. The sequence con-

sist of 21 morphs, and P (Data | Lexicon 2) = ( 15+1)21 ≈ 4.6 · 10−17.

Data | Lexicon 3 “apple # orange # lemon # juice # applejuice #

orangejuice # appletree # lemontree # #”. The sequence consists

of 17 morphs, and P (Data | Lexicon 3) = ( 18+1)17 ≈ 6.0 · 10−17.

33

Statistical Approaches: Maximum A Posteriori (MAP) Optimization

The model selection procedure is based on maximizing the posterior probabi-lity of the model P (Lexicon X |Data). The posterior can be rewritten usingBayes’ rule:

P (Lexicon X |Data) =P (Lexicon X) · P (Data | Lexicon X)

P (Data)

∝ P (Lexicon X) · P (Data | Lexicon X).

P (Lexicon 2) · P (Data | Lexicon 2) ≈ 1.9 · 10−61 > P (Lexicon 3) ·P (Data | Lexicon 3) ≈ 1.0 · 10−115 > P (Lexicon 1) · P (Data | Lexicon 1) ≈2.2 · 10−123.

In this comparison the complexity of the model has been balanced againstthe fit of the training data, which favors a good compromise, that is, a modelthat does not overlearn and that adequately generalizes to unseen data.

34

Statistical Approaches: Demo

Variations of the approach sketched out above have been used for both wordsegmentation (Asian languages, where words are written without explicitboundaries) and morphology modeling, e.g.,

• Carl de Marcken (1995)

• Michael R. Brent (1999)

• John Goldsmith (2001) (Linguistica)

• Mathias Creutz and Krista Lagus (2002–) (Morfessor)

• Sharon Goldwater, Thomas L. Griffiths, and Mark Johnson (2006)

Morfessor demo: http://www.cis.hut.fi/projects/morpho/

35

lecturer: mathias creutz · esimerkki: onko luontevampaa sanoa ’strong tea’ vai ’powerful...

Documents