Collocation

Presenter: 이도관

Contents: Introduction, Frequency, Mean & Variance, Hypothesis Testing, Mutual Information
Collocation
Definition: a sequence of two or more consecutive words that has the characteristics of a syntactic and semantic unit, and whose exact and unambiguous meaning or connotation cannot be derived directly from the meaning or connotation of its components.

Characteristics:
- Non-compositionality. Ex) white wine, white hair, white woman
- Non-substitutability. Ex) white wine vs. yellow wine
- Non-modifiability. Ex) as poor as church mice vs. as poor as a church mouse
1. Introduction
Frequency(1)
The simplest method for finding collocations: counting word frequencies.

Relying on raw frequency alone gives results like the following:
2. Frequency
C(w1, w2)   w1   w2
80871       of   the
58841       in   the
26430       to   the
…
Frequency(2)
Using frequency together with part-of-speech tag patterns:
2. Frequency
C(w1 w2)   w1       w2       Tag pattern
11487      New      York     A N
7261       United   States   A N
3301       last     year     A N
…          …        …        …
Tag patterns   2. Frequency

Tag pattern   Example
A N           linear function
N N           regression coefficient
A A N         Gaussian random variable
A N N         cumulative distribution function
N A N         mean squared error
N N N         class probability function
N P N         degrees of freedom
Properties

Advantages:
- Simple, yet gives relatively good results.
- Especially good for fixed phrases.

Disadvantages:
- The results are not precise. Ex) "powerful tea" was found 17 times in web pages.
- Hard to apply to anything but fixed phrases. Ex) knock and door
2. Frequency
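As a sketch of the frequency-plus-tag-pattern method above, the snippet below counts bigrams whose POS tags match an allowed pattern. The tagged sentence and the simplified tag set (A/N/V/Det/P) are hypothetical toy data, not from the original corpus.

```python
from collections import Counter

# Toy POS-tagged text (hypothetical example, simplified tag set)
tagged = [("she", "N"), ("saw", "V"), ("a", "Det"),
          ("linear", "A"), ("function", "N"),
          ("in", "P"), ("New", "A"), ("York", "N")]

# Bigram tag patterns that are likely to yield collocations
PATTERNS = {("A", "N"), ("N", "N")}

bigrams = Counter()
for (w1, t1), (w2, t2) in zip(tagged, tagged[1:]):
    if (t1, t2) in PATTERNS:
        bigrams[(w1, w2)] += 1

print(bigrams.most_common())
```

On real data the same loop over millions of tokens yields tables like the "New York" one above; the tag filter is what keeps pairs like "of the" out.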
Mean & Variance
Finding collocations that consist of two words standing in a more flexible relationship to one another: "They knocked at the door" / "A man knocked on the metal front door".
Compute the mean distance and variance between the two words.

Low deviation: a good candidate for a collocation.
3. Mean & Variance
Tools
- Relative position: mean (average offset) and variance
- Collocation window: collocations are a local phenomenon
3. Mean & Variance
s^2 = Σ_{i=1}^{n} (d_i - d̄)^2 / (n - 1)
Example   3. Mean & Variance

Positions of "door" relative to "knock" in sentences like those above.
[Histogram: frequency of "strong" at each position from -4 to 4 relative to "for"]

d̄ = -1.12, s = 2.15
Properties

Advantages: good for finding collocations that have
- a looser relationship between the words
- intervening material and variable relative position

Disadvantage: compositional phrases like "new company" can still be selected as collocation candidates.
3. Mean & Variance
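The mean-and-variance method can be sketched directly from the definitions above; the offset list is made up for illustration.

```python
import math

# Hypothetical offsets d_i = position("door") - position("knock")
# collected from a collocation window (illustrative numbers).
offsets = [3, 3, 5, 2, 3, 5]

n = len(offsets)
d_bar = sum(offsets) / n                                # mean offset
s2 = sum((d - d_bar) ** 2 for d in offsets) / (n - 1)   # sample variance
s = math.sqrt(s2)                                       # standard deviation

print(f"mean = {d_bar:.2f}, s = {s:.2f}")  # mean = 3.50, s = 1.22
```

A low s relative to the window size marks the pair as a collocation candidate; a flat offset histogram (large s) argues against it.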
Hypothesis Testing
Used to avoid selecting many word pairs that co-occur just by chance ("new company" is just a composition).

H0 (null hypothesis): there is no association between the words, i.e., p(w1 w2) = p(w1) p(w2).

Methods: t test, test of differences, chi-square test, likelihood ratios.
4. Hypothesis Test
t test
The t statistic tells us how likely it is to draw a sample with the observed mean and variance if H0 holds.

Also used in probabilistic parsing and word sense disambiguation.
4. Hypothesis Test
t = (x̄ - μ) / √(s^2 / N)
t test example
t test applied to 10 bigrams, each with frequency 20.

Significance level 0.005 (critical value 2.576): H0 can be rejected for the top two candidates below.
4. Hypothesis Test
t        C(w1)   C(w2)   C(w1 w2)   w1          w2
4.4721   42      20      20         Ayatollah   Ruhollah
4.4721   41      27      20         Bette       Midler
1.2176   14093   14776   20         like        people
0.8036   15019   15629   20         time        last
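The first table row can be reproduced with a short calculation. Under H0 the bigram process is treated as Bernoulli, so s^2 ≈ x̄; the corpus size N below is an assumption (roughly the 14.3-million-token corpus behind the textbook tables).

```python
import math

N = 14_307_668            # assumed corpus size (tokens)
c1, c2, c12 = 42, 20, 20  # C(Ayatollah), C(Ruhollah), C(Ayatollah Ruhollah)

x_bar = c12 / N            # observed bigram probability (sample mean)
mu = (c1 / N) * (c2 / N)   # expected probability under H0: p(w1)p(w2)
s2 = x_bar * (1 - x_bar)   # Bernoulli variance, approximately x_bar

t = (x_bar - mu) / math.sqrt(s2 / N)
print(f"t = {t:.4f}")      # t = 4.4721, matching the table
```

Since mu is negligible here, t is essentially √C(w1 w2); for frequent words like "like people", mu becomes comparable to x̄ and t falls below the critical value.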
Hypothesis test of differences

Used to find words whose co-occurrence patterns best distinguish between two words, e.g., "strong" vs. "powerful".

t score; H0: the average difference is 0.
4. Hypothesis Test
t = (x̄_1 - x̄_2) / √(s_1^2 / n_1 + s_2^2 / n_2)
Difference test example

powerful & strong: strong suggests an intrinsic quality; powerful suggests the power to move things.
4. Hypothesis Test
t        C(w)   C(strong w)   C(powerful w)   word
3.1622   933    0             10              computers
2.8284   2377   0             8               computer
3.6055   851    13            0               gains
3.6055   832    13            0               criticism
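With the Bernoulli-variance approximation, the difference-test t score collapses to a function of the two co-occurrence counts alone, which reproduces the table rows; this simplification follows the textbook's treatment.

```python
import math

def diff_t(c_strong_w, c_powerful_w):
    """t = (x1 - x2) / sqrt(s1^2/n1 + s2^2/n2); with Bernoulli variances
    s_i^2 ~ x_i, this reduces to (c1 - c2) / sqrt(c1 + c2)."""
    return (c_strong_w - c_powerful_w) / math.sqrt(c_strong_w + c_powerful_w)

print(abs(diff_t(0, 10)))   # "computers": about 3.1622
print(abs(diff_t(13, 0)))   # "gains":     about 3.6055
```

The sign of the score says which word the collocate distinguishes: negative means it goes with "powerful", positive with "strong".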
chi-square test
Does not assume a normal distribution (the t test does).

Compares expected and observed frequencies; if the difference is large, H0 (independence) can be rejected.

Also used to identify translation pairs in aligned corpora. Chi-square statistic:
4. Hypothesis Test
X^2 = Σ_{i,j} (O_ij - E_ij)^2 / E_ij
chi-square example
'new companies'

Significance level 0.05 (critical value 3.841); X^2 = 1.55, so H0 cannot be rejected.
4. Hypothesis Test
                  w1 = new                      w1 != new
w2 = companies    8 (new companies)             4667 (e.g., old companies)
w2 != companies   15820 (e.g., new machines)    14287181 (e.g., old machines)
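For a 2x2 table the chi-square statistic has a closed form that avoids computing the expected counts explicitly; it is algebraically equal to the Σ (O - E)^2 / E definition. Applying it to the counts above:

```python
def chi_square_2x2(o11, o12, o21, o22):
    """X^2 for a 2x2 contingency table via the standard closed form:
    X^2 = N * (O11*O22 - O12*O21)^2 / (product of the four margins)."""
    n = o11 + o12 + o21 + o22
    num = n * (o11 * o22 - o12 * o21) ** 2
    den = (o11 + o12) * (o11 + o21) * (o12 + o22) * (o21 + o22)
    return num / den

# Counts for "new companies" from the contingency table above
x2 = chi_square_2x2(8, 4667, 15820, 14287181)
print(f"X^2 = {x2:.2f}")  # X^2 = 1.55, below the 3.841 critical value
```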
Likelihood ratios
Better suited to sparse data than the chi-square test, and more interpretable.

Hypotheses:
H1: p(w2|w1) = p = p(w2|~w1)
H2: p(w2|w1) = p1 != p2 = p(w2|~w1)
where p = c2/N, p1 = c12/c1, p2 = (c2 - c12)/(N - c1)

Likelihood ratio (pp. 173):
4. Hypothesis Test
log λ = log [ L(H1) / L(H2) ]
Likelihood ratios (2)

Table 5.12 (pp. 174): "powerful computers" is 1.3 × 10^18 times more likely than its base rate of occurrence would suggest.

Relative frequency ratio: compares relative frequencies of a phrase between two or more different corpora; useful for subject-specific collocations. Table 5.13 (pp. 176): Karim Obeid (1990 vs. 1989): 0.0241
4. Hypothesis Test
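The likelihood ratio can be computed directly from the binomial likelihoods of H1 and H2; the binomial coefficients cancel in the ratio, so only the x^k (1-x)^(n-k) terms are needed. The counts below are made-up illustrations, not corpus data.

```python
import math

def log_l(k, n, x):
    # log of x^k * (1-x)^(n-k); the binomial coefficient cancels in the ratio
    return k * math.log(x) + (n - k) * math.log(1 - x)

def log_lambda(c1, c2, c12, n):
    """log [L(H1) / L(H2)] for bigram (w1, w2) with counts c1, c2, c12."""
    p = c2 / n                   # H1: p(w2|w1) = p(w2|~w1) = p
    p1 = c12 / c1                # H2: p(w2|w1) = p1
    p2 = (c2 - c12) / (n - c1)   # H2: p(w2|~w1) = p2
    return (log_l(c12, c1, p) + log_l(c2 - c12, n - c1, p)
            - log_l(c12, c1, p1) - log_l(c2 - c12, n - c1, p2))

c1, c2, c12, n = 2_000, 1_500, 100, 1_000_000  # illustrative counts
llr = -2 * log_lambda(c1, c2, c12, n)
print(llr)
```

Because -2 log λ is asymptotically chi-square distributed, it can be compared against the same critical values as X^2 while behaving better on sparse counts.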
Mutual Information
Tells us how much information one word gives us about the other.

Ex) table 5.14 (pp. 178): I(Ayatollah, Ruhollah) = 18.38, i.e., the information we have about the occurrence of Ayatollah at position i increases by 18.38 bits if Ruhollah occurs at position i+1.
5. Mutual Info.
I(x, y) = log2 [ p(xy) / (p(x) p(y)) ] = log2 [ p(x|y) / p(x) ] = log2 [ p(y|x) / p(y) ]
A good measure of independence, but a bad measure of dependence (the score for perfect dependence grows as the words get rarer).

Mutual Information (2)   5. Mutual Info.

Perfect dependence (p(xy) = p(x)):
I(x, y) = log2 [ p(xy) / (p(x) p(y)) ] = log2 [ p(x) / (p(x) p(y)) ] = log2 [ 1 / p(y) ]

Perfect independence (p(xy) = p(x) p(y)):
I(x, y) = log2 [ p(xy) / (p(x) p(y)) ] = log2 [ (p(x) p(y)) / (p(x) p(y)) ] = log2 1 = 0
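The I(Ayatollah, Ruhollah) = 18.38 figure can be reproduced from the counts in the t-test table; the corpus size N is an assumption (roughly the 14.3-million-token corpus behind the textbook tables).

```python
import math

N = 14_307_668             # assumed corpus size (tokens)
c1, c2, c12 = 42, 20, 20   # C(Ayatollah), C(Ruhollah), C(Ayatollah Ruhollah)

# Pointwise mutual information with MLE probabilities:
# I(x, y) = log2 [ p(xy) / (p(x) p(y)) ]
pmi = math.log2((c12 / N) / ((c1 / N) * (c2 / N)))
print(f"I = {pmi:.2f} bits")  # I = 18.38 bits
```

Note how the all-MLE estimate inflates rare pairs: two words that each occur once and co-occur once would score log2(N) ≈ 23.8 bits, which is exactly the "bad measure of dependence" problem above.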