Collocation

Presenter: 이도관

Contents: Introduction, Frequency, Mean & Variance, Hypothesis Testing, Mutual Information
Collocation
Definition: a sequence of two or more consecutive words that has the characteristics of a syntactic and semantic unit, and whose exact and unambiguous meaning or connotation cannot be derived directly from the meaning or connotation of its components.

Characteristics:
- Non-compositionality. Ex) white wine, white hair, white woman
- Non-substitutability. Ex) white wine vs. yellow wine
- Non-modifiability. Ex) as poor as church mice vs. as poor as a church mouse
1. Introduction
Frequency(1)
The simplest method for finding collocations: counting word frequencies.

Relying on raw frequency alone gives results like the following:
2. Frequency
C(w1, w2)   w1   w2
80871       of   the
58841       in   the
26430       to   the
…
Frequency(2)
Using frequency together with part-of-speech tag patterns:
2. Frequency
C(w1 w2)   w1       w2       Tag pattern
11487      New      York     A N
7261       United   States   A N
3301       last     year     A N
…          …        …        …
Tag patterns   2. Frequency

Tag pattern   Example
A N           linear function
N N           regression coefficient
A A N         Gaussian random variable
A N N         cumulative distribution function
N A N         mean squared error
N N N         class probability function
N P N         degrees of freedom
Properties

Advantages:
- Simple, yet gives relatively good results.
- Especially good for fixed phrases.

Disadvantages:
- The results are not precise. Ex) "powerful tea" was found 17 times in web pages.
- Hard to apply to anything but fixed phrases. Ex) knock and door
2. Frequency
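As a sketch of the frequency-plus-tag-pattern method above, the snippet below counts bigrams whose POS tags match an allowed pattern. The tagged sentence and the simplified tag set (A/N/V/Det/P) are hypothetical toy data, not from the original corpus.

```python
from collections import Counter

# Toy POS-tagged text (hypothetical example, simplified tag set)
tagged = [("she", "N"), ("saw", "V"), ("a", "Det"),
          ("linear", "A"), ("function", "N"),
          ("in", "P"), ("New", "A"), ("York", "N")]

# Bigram tag patterns that are likely to yield collocations
PATTERNS = {("A", "N"), ("N", "N")}

bigrams = Counter()
for (w1, t1), (w2, t2) in zip(tagged, tagged[1:]):
    if (t1, t2) in PATTERNS:
        bigrams[(w1, w2)] += 1

print(bigrams.most_common())
```

On real data the same loop over millions of tokens yields tables like the "New York" one above; the tag filter is what keeps pairs like "of the" out.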
Mean & Variance
Finding collocations that consist of two words standing in a more flexible relationship to one another: "They knocked at the door" / "A man knocked on the metal front door".
Compute the mean distance and variance between the two words.

Low deviation: a good candidate for a collocation.
3. Mean & Variance
Tools
- Relative position: mean (average offset) and variance
- Collocation window: collocations are a local phenomenon
3. Mean & Variance
s^2 = Σ_{i=1}^{n} (d_i - d̄)^2 / (n - 1)
Example   3. Mean & Variance

Positions of "door" relative to "knock" in sentences like those above.
[Histogram: frequency of "strong" at each position from -4 to 4 relative to "for"]

d̄ = -1.12, s = 2.15
Properties

Advantages: good for finding collocations that have
- a looser relationship between the words
- intervening material and variable relative position

Disadvantage: compositional phrases like "new company" can still be selected as collocation candidates.
3. Mean & Variance
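The mean-and-variance method can be sketched directly from the definitions above; the offset list is made up for illustration.

```python
import math

# Hypothetical offsets d_i = position("door") - position("knock")
# collected from a collocation window (illustrative numbers).
offsets = [3, 3, 5, 2, 3, 5]

n = len(offsets)
d_bar = sum(offsets) / n                                # mean offset
s2 = sum((d - d_bar) ** 2 for d in offsets) / (n - 1)   # sample variance
s = math.sqrt(s2)                                       # standard deviation

print(f"mean = {d_bar:.2f}, s = {s:.2f}")  # mean = 3.50, s = 1.22
```

A low s relative to the window size marks the pair as a collocation candidate; a flat offset histogram (large s) argues against it.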
Hypothesis Testing
Used to avoid selecting many word pairs that co-occur just by chance ("new company" is just a composition).

H0 (null hypothesis): there is no association between the words, i.e., p(w1 w2) = p(w1) p(w2).

Methods: t test, test of differences, chi-square test, likelihood ratios.
4. Hypothesis Test
t test
The t statistic tells us how likely it is to draw a sample with the observed mean and variance if H0 holds.

Also used in probabilistic parsing and word sense disambiguation.
4. Hypothesis Test
t = (x̄ - μ) / √(s^2 / N)
t test example
t test applied to 10 bigrams, each with frequency 20.

Significance level 0.005 (critical value 2.576): H0 can be rejected for the top two candidates below.
4. Hypothesis Test
t        C(w1)   C(w2)   C(w1 w2)   w1          w2
4.4721   42      20      20         Ayatollah   Ruhollah
4.4721   41      27      20         Bette       Midler
1.2176   14093   14776   20         like        people
0.8036   15019   15629   20         time        last
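The first table row can be reproduced with a short calculation. Under H0 the bigram process is treated as Bernoulli, so s^2 ≈ x̄; the corpus size N below is an assumption (roughly the 14.3-million-token corpus behind the textbook tables).

```python
import math

N = 14_307_668            # assumed corpus size (tokens)
c1, c2, c12 = 42, 20, 20  # C(Ayatollah), C(Ruhollah), C(Ayatollah Ruhollah)

x_bar = c12 / N            # observed bigram probability (sample mean)
mu = (c1 / N) * (c2 / N)   # expected probability under H0: p(w1)p(w2)
s2 = x_bar * (1 - x_bar)   # Bernoulli variance, approximately x_bar

t = (x_bar - mu) / math.sqrt(s2 / N)
print(f"t = {t:.4f}")      # t = 4.4721, matching the table
```

Since mu is negligible here, t is essentially √C(w1 w2); for frequent words like "like people", mu becomes comparable to x̄ and t falls below the critical value.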
Hypothesis test of differences

Used to find words whose co-occurrence patterns best distinguish between two words, e.g., "strong" vs. "powerful".

t score; H0: the average difference is 0.
4. Hypothesis Test
t = (x̄_1 - x̄_2) / √(s_1^2 / n_1 + s_2^2 / n_2)
Difference test example

powerful & strong: strong suggests an intrinsic quality; powerful suggests the power to move things.
4. Hypothesis Test
t        C(w)   C(strong w)   C(powerful w)   word
3.1622   933    0             10              computers
2.8284   2377   0             8               computer
3.6055   851    13            0               gains
3.6055   832    13            0               criticism
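With the Bernoulli-variance approximation, the difference-test t score collapses to a function of the two co-occurrence counts alone, which reproduces the table rows; this simplification follows the textbook's treatment.

```python
import math

def diff_t(c_strong_w, c_powerful_w):
    """t = (x1 - x2) / sqrt(s1^2/n1 + s2^2/n2); with Bernoulli variances
    s_i^2 ~ x_i, this reduces to (c1 - c2) / sqrt(c1 + c2)."""
    return (c_strong_w - c_powerful_w) / math.sqrt(c_strong_w + c_powerful_w)

print(abs(diff_t(0, 10)))   # "computers": about 3.1622
print(abs(diff_t(13, 0)))   # "gains":     about 3.6055
```

The sign of the score says which word the collocate distinguishes: negative means it goes with "powerful", positive with "strong".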
chi-square test
Does not assume a normal distribution (the t test does).

Compares expected and observed frequencies; if the difference is large, H0 (independence) can be rejected.

Also used to identify translation pairs in aligned corpora. Chi-square statistic:
4. Hypothesis Test
X^2 = Σ_{i,j} (O_ij - E_ij)^2 / E_ij
chi-square example
'new companies'

Significance level 0.05 (critical value 3.841); X^2 = 1.55, so H0 cannot be rejected.
4. Hypothesis Test
                  w1 = new                      w1 != new
w2 = companies    8 (new companies)             4667 (e.g., old companies)
w2 != companies   15820 (e.g., new machines)    14287181 (e.g., old machines)
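For a 2x2 table the chi-square statistic has a closed form that avoids computing the expected counts explicitly; it is algebraically equal to the Σ (O - E)^2 / E definition. Applying it to the counts above:

```python
def chi_square_2x2(o11, o12, o21, o22):
    """X^2 for a 2x2 contingency table via the standard closed form:
    X^2 = N * (O11*O22 - O12*O21)^2 / (product of the four margins)."""
    n = o11 + o12 + o21 + o22
    num = n * (o11 * o22 - o12 * o21) ** 2
    den = (o11 + o12) * (o11 + o21) * (o12 + o22) * (o21 + o22)
    return num / den

# Counts for "new companies" from the contingency table above
x2 = chi_square_2x2(8, 4667, 15820, 14287181)
print(f"X^2 = {x2:.2f}")  # X^2 = 1.55, below the 3.841 critical value
```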
Likelihood ratios
Better suited to sparse data than the chi-square test, and more interpretable.

Hypotheses:
H1: p(w2|w1) = p = p(w2|~w1)
H2: p(w2|w1) = p1 != p2 = p(w2|~w1)
where p = c2/N, p1 = c12/c1, p2 = (c2 - c12)/(N - c1)

Likelihood ratio (pp. 173):
4. Hypothesis Test
log λ = log [ L(H1) / L(H2) ]
Likelihood ratios (2)

Table 5.12 (pp. 174): "powerful computers" is 1.3 × 10^18 times more likely than its base rate of occurrence would suggest.

Relative frequency ratio: compares relative frequencies of a phrase between two or more different corpora; useful for subject-specific collocations. Table 5.13 (pp. 176): Karim Obeid (1990 vs. 1989): 0.0241
4. Hypothesis Test
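The likelihood ratio can be computed directly from the binomial likelihoods of H1 and H2; the binomial coefficients cancel in the ratio, so only the x^k (1-x)^(n-k) terms are needed. The counts below are made-up illustrations, not corpus data.

```python
import math

def log_l(k, n, x):
    # log of x^k * (1-x)^(n-k); the binomial coefficient cancels in the ratio
    return k * math.log(x) + (n - k) * math.log(1 - x)

def log_lambda(c1, c2, c12, n):
    """log [L(H1) / L(H2)] for bigram (w1, w2) with counts c1, c2, c12."""
    p = c2 / n                   # H1: p(w2|w1) = p(w2|~w1) = p
    p1 = c12 / c1                # H2: p(w2|w1) = p1
    p2 = (c2 - c12) / (n - c1)   # H2: p(w2|~w1) = p2
    return (log_l(c12, c1, p) + log_l(c2 - c12, n - c1, p)
            - log_l(c12, c1, p1) - log_l(c2 - c12, n - c1, p2))

c1, c2, c12, n = 2_000, 1_500, 100, 1_000_000  # illustrative counts
llr = -2 * log_lambda(c1, c2, c12, n)
print(llr)
```

Because -2 log λ is asymptotically chi-square distributed, it can be compared against the same critical values as X^2 while behaving better on sparse counts.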
Mutual Information
Tells us how much information one word gives us about the other.

Ex) table 5.14 (pp. 178): I(Ayatollah, Ruhollah) = 18.38, i.e., the information we have about the occurrence of Ayatollah at position i increases by 18.38 bits if Ruhollah occurs at position i+1.
5. Mutual Info.
I(x, y) = log2 [ p(xy) / (p(x) p(y)) ] = log2 [ p(x|y) / p(x) ] = log2 [ p(y|x) / p(y) ]
A good measure of independence, but a bad measure of dependence (the score for perfect dependence grows as the words get rarer).

Mutual Information (2)   5. Mutual Info.

Perfect dependence (p(xy) = p(x)):
I(x, y) = log2 [ p(xy) / (p(x) p(y)) ] = log2 [ p(x) / (p(x) p(y)) ] = log2 [ 1 / p(y) ]

Perfect independence (p(xy) = p(x) p(y)):
I(x, y) = log2 [ p(xy) / (p(x) p(y)) ] = log2 [ (p(x) p(y)) / (p(x) p(y)) ] = log2 1 = 0
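The I(Ayatollah, Ruhollah) = 18.38 figure can be reproduced from the counts in the t-test table; the corpus size N is an assumption (roughly the 14.3-million-token corpus behind the textbook tables).

```python
import math

N = 14_307_668             # assumed corpus size (tokens)
c1, c2, c12 = 42, 20, 20   # C(Ayatollah), C(Ruhollah), C(Ayatollah Ruhollah)

# Pointwise mutual information with MLE probabilities:
# I(x, y) = log2 [ p(xy) / (p(x) p(y)) ]
pmi = math.log2((c12 / N) / ((c1 / N) * (c2 / N)))
print(f"I = {pmi:.2f} bits")  # I = 18.38 bits
```

Note how the all-MLE estimate inflates rare pairs: two words that each occur once and co-occur once would score log2(N) ≈ 23.8 bits, which is exactly the "bad measure of dependence" problem above.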