
Page 1:

2. Mathematical Foundations

2001. 7. 10.

Artificial Intelligence Laboratory, 성경희

Foundations of Statistical Natural Language Processing

Page 2:

Contents – Part 1

1. Elementary Probability Theory

– Conditional probability

– Bayes’ theorem

– Random variable

– Joint and conditional distributions

– Standard distributions

Page 3:

Conditional probability (1/2)

P(A) : the probability of the event A

Ex1> A coin is tossed 3 times.

Ω = {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT}

A = {HHT, HTH, THH} : 2 heads, P(A)=3/8

B = {HHH, HHT, HTH, HTT} : head on the first toss, P(B)=1/2

Conditional probability:

P(A|B) = P(A ∩ B) / P(B)
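Below is a minimal sketch (mine, not from the slides) that checks Ex1 by enumerating the eight outcomes; the event definitions mirror A and B above.

```python
from itertools import product
from fractions import Fraction

# Sample space: all sequences of 3 tosses.
omega = list(product("HT", repeat=3))

def prob(event):
    # Uniform probability over the 8 equally likely outcomes.
    return Fraction(len(event), len(omega))

A = [w for w in omega if w.count("H") == 2]   # exactly 2 heads
B = [w for w in omega if w[0] == "H"]         # head on the first toss
A_and_B = [w for w in A if w in B]

print(prob(A))                      # 3/8
print(prob(B))                      # 1/2
print(prob(A_and_B) / prob(B))      # P(A|B) = (2/8)/(1/2) = 1/2
```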

Page 4:

Conditional probability (2/2)

Multiplication rule:

P(A ∩ B) = P(B) P(A|B) = P(A) P(B|A)

Chain rule:

P(A_1 ∩ ... ∩ A_n) = P(A_1) P(A_2|A_1) P(A_3|A_1 ∩ A_2) ... P(A_n | ∩_{i=1}^{n−1} A_i)

Two events A, B are independent iff

P(A ∩ B) = P(A) P(B)   (equivalently, if P(B) ≠ 0, P(A|B) = P(A))
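A small follow-up sketch (mine), reusing the coin-toss space of Ex1, that checks the multiplication rule numerically and shows that A and B above are not independent.

```python
from itertools import product
from fractions import Fraction

omega = list(product("HT", repeat=3))
P = lambda ev: Fraction(len(ev), len(omega))

A = [w for w in omega if w.count("H") == 2]   # exactly 2 heads
B = [w for w in omega if w[0] == "H"]         # head on the first toss
AB = [w for w in A if w in B]

# Conditional probabilities computed by counting inside the conditioning event.
P_A_given_B = Fraction(len(AB), len(B))
P_B_given_A = Fraction(len(AB), len(A))

# Multiplication rule: P(A ∩ B) = P(B) P(A|B) = P(A) P(B|A)
assert P(AB) == P(B) * P_A_given_B == P(A) * P_B_given_A

# Independence check: P(A ∩ B) = P(A) P(B)?
print(P(AB) == P(A) * P(B))   # False: 1/4 vs 3/16, so A and B are dependent
```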

Page 5:

Bayes’ theorem (1/2)

Bayes' theorem:

P(B|A) = P(A ∩ B) / P(A) = P(A|B) P(B) / P(A)

P(A) = P(A ∩ B) + P(A ∩ ¬B) = P(A|B) P(B) + P(A|¬B) P(¬B)

Generally, if A ⊆ ∪_{i=1}^n B_i and the B_i are disjoint (B_i ∩ B_j = ∅ for i ≠ j), then

P(A) = Σ_i P(A|B_i) P(B_i)

and so

P(B_j|A) = P(A|B_j) P(B_j) / P(A) = P(A|B_j) P(B_j) / Σ_{i=1}^n P(A|B_i) P(B_i)

Page 6:

Bayes’ theorem (2/2)

Ex2> G : the event of the sentence having a parasitic gap

T : the event of the test being positive

With P(G) = 0.00001, P(T|G) = 0.95, and P(T|¬G) = 0.005:

P(G|T) = P(T|G) P(G) / [ P(T|G) P(G) + P(T|¬G) P(¬G) ]
       = (0.95 × 0.00001) / (0.95 × 0.00001 + 0.005 × 0.99999)
       ≈ 0.002

This poor result comes about because the prior probability of a sentence containing a parasitic gap is so low.
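A quick sketch (mine) that plugs the Ex2 numbers into Bayes' theorem; the variable names are just illustrative.

```python
# Ex2 numbers: prior P(G), hit rate P(T|G), false-positive rate P(T|not G).
p_g = 0.00001
p_t_given_g = 0.95
p_t_given_not_g = 0.005

# Bayes' theorem with the two-way partition {G, not G} in the denominator.
p_t = p_t_given_g * p_g + p_t_given_not_g * (1 - p_g)
p_g_given_t = p_t_given_g * p_g / p_t

print(round(p_g_given_t, 4))   # 0.0019, i.e. about 0.002
```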

Page 7:

Random variable

Ex3> Random variable X for the sum of two dice.

First die \ Second die |  1   2   3   4   5   6
          6            |  7   8   9  10  11  12
          5            |  6   7   8   9  10  11
          4            |  5   6   7   8   9  10
          3            |  4   5   6   7   8   9
          2            |  3   4   5   6   7   8
          1            |  2   3   4   5   6   7

x 2 3 4 5 6 7 8 9 10 11 12

p(X=x) 1/36 1/18 1/12 1/9 5/36 1/6 5/36 1/9 1/12 1/18 1/36

S={2,…,12}

probability mass function (pmf): p(x) = P(X = x), X ~ p(x)

If X: Ω → {0, 1}, then X is called an indicator random variable or a Bernoulli trial

Σ_i p(x_i) = P(Ω) = 1

Expectation: E(X) = Σ_x x p(x)

Variance: Var(X) = E((X − E(X))²) = E(X²) − E²(X)
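A short sketch (mine) that rebuilds the pmf of Ex3 by enumeration and checks the expectation and variance formulas.

```python
from fractions import Fraction
from itertools import product

# pmf of X = sum of two fair dice, built by enumeration.
counts = {}
for d1, d2 in product(range(1, 7), repeat=2):
    counts[d1 + d2] = counts.get(d1 + d2, 0) + 1
pmf = {x: Fraction(c, 36) for x, c in counts.items()}

E = sum(x * p for x, p in pmf.items())            # E(X) = sum_x x p(x)
E2 = sum(x * x * p for x, p in pmf.items())
Var = E2 - E**2                                   # Var(X) = E(X^2) - E(X)^2

print(E)     # 7
print(Var)   # 35/6, i.e. about 5.83
```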

Page 8:

Joint and conditional distributions

The joint pmf for two discrete random variables X, Y

Marginal pmfs, which total up the probability mass for the values of each variable separately.

Conditional pmf

p(x, y) = P(X = x, Y = y)

p_X(x) = Σ_y p(x, y),    p_Y(y) = Σ_x p(x, y)

p_{X|Y}(x|y) = p(x, y) / p_Y(y),   for y such that p_Y(y) > 0
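A minimal sketch (mine) of marginal and conditional pmfs computed from a joint table; the joint values here are a made-up toy example, not from the slides.

```python
from fractions import Fraction

F = Fraction
# Hypothetical joint pmf p(x, y) for X in {0, 1}, Y in {0, 1}.
joint = {(0, 0): F(1, 4), (0, 1): F(1, 4), (1, 0): F(3, 8), (1, 1): F(1, 8)}

# Marginals: total up the probability mass for each variable separately.
p_X, p_Y = {}, {}
for (x, y), p in joint.items():
    p_X[x] = p_X.get(x, 0) + p
    p_Y[y] = p_Y.get(y, 0) + p

# Conditional pmf p(x | y) = p(x, y) / p_Y(y), defined where p_Y(y) > 0.
p_X_given_Y = {(x, y): p / p_Y[y] for (x, y), p in joint.items() if p_Y[y] > 0}

print(p_X[0], p_X[1])        # 1/2 1/2
print(p_Y[0], p_Y[1])        # 5/8 3/8
print(p_X_given_Y[(0, 1)])   # (1/4)/(3/8) = 2/3
```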

Page 9:

Standard distributions (1/3)

Discrete distributions: The binomial distribution

– Arises when one has a series of trials with only two outcomes, each trial being independent from all the others.

– The number r of successes out of n trials, given that the probability of success in any trial is p, follows

b(r; n, p) = P(R = r) = C(n, r) p^r (1 − p)^(n−r),   where C(n, r) = n! / ((n − r)! r!),   0 ≤ r ≤ n

– Expectation: np, variance: np(1 − p)
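A sketch (mine) of the binomial pmf b(r; n, p) together with a numerical check of the stated mean np and variance np(1 − p); n = 10 and p = 0.7 are arbitrary choices.

```python
from math import comb

def b(r, n, p):
    # b(r; n, p) = C(n, r) * p**r * (1 - p)**(n - r)
    return comb(n, r) * p**r * (1 - p)**(n - r)

n, p = 10, 0.7
pmf = [b(r, n, p) for r in range(n + 1)]

mean = sum(r * pr for r, pr in enumerate(pmf))
var = sum(r * r * pr for r, pr in enumerate(pmf)) - mean**2

print(round(sum(pmf), 6))   # 1.0: the pmf sums to one
print(round(mean, 6))       # 7.0  = np
print(round(var, 6))        # 2.1  = np(1 - p)
```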

Page 10:

Standard distributions (2/3)

Discrete distributions: The binomial distribution

[Figure: binomial distributions b(r; n, 0.5) and b(r; n, 0.7) for n = 10, 20, 40; x-axis: count r (0–40), y-axis: probability (0–0.3)]

Page 11:


Standard distributions (3/3)

Continuous distributions: The normal distribution

– For mean μ and standard deviation σ:

n(x; μ, σ) = 1 / (√(2π) σ) · e^( −(x − μ)² / (2σ²) )

Probability density function (pdf)

[Figure: probability density functions of N(0, 1), N(0, 0.7), and N(1.5, 2), plotted for x from −5 to 5]
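A small sketch (mine) of the density n(x; μ, σ); the peak of N(0, 1) should be 1/√(2π) ≈ 0.3989.

```python
from math import exp, pi, sqrt

def normal_pdf(x, mu, sigma):
    # n(x; mu, sigma) = 1 / (sqrt(2*pi) * sigma) * exp(-(x - mu)**2 / (2 * sigma**2))
    return exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sqrt(2 * pi) * sigma)

print(round(normal_pdf(0, 0, 1), 4))      # 0.3989, the peak of N(0, 1)
print(round(normal_pdf(1.5, 1.5, 2), 4))  # 0.1995, the peak of N(1.5, 2)
```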

Page 12:

Contents – Part 2

2. Essential Information Theory

– Entropy

– Joint entropy and conditional entropy

– Mutual information

– The noisy channel model

– Relative entropy or Kullback-Leibler divergence

Page 13:

Shannon’s Information Theory

Maximizing the amount of information that one can transmit over an imperfect communication channel such as a noisy phone line.

Theoretical maxima for data compression – Entropy H

Theoretical maxima for the transmission rate – Channel Capacity

Page 14:

Entropy (1/4)

The entropy H (or self-information) is the average uncertainty of a single random variable X.

Entropy is a measure of uncertainty.
– The more we know about something, the lower the entropy will be.

– We can use entropy as a measure of the quality of our models.

Entropy measures the amount of information in a random variable (measured in bits).

H(X) = H(p) = − Σ_{x∈X} p(x) log₂ p(x),   where p(x) is the pmf of X
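A direct transcription (mine) of the entropy formula into code, reused informally in the later examples.

```python
from math import log2

def entropy(pmf):
    # H(p) = - sum_x p(x) * log2 p(x); terms with p(x) = 0 contribute 0.
    return -sum(p * log2(p) for p in pmf if p > 0)

print(entropy([0.5, 0.5]))             # 1.0 bit: a fair coin
print(entropy([1/8] * 8))              # 3.0 bits: a uniform 8-sided die
print(round(entropy([0.9, 0.1]), 3))   # 0.469 bits: a heavily weighted coin
```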

Page 15:

Entropy (2/4)

The entropy of a weighted coin. The horizontal axis shows the probability of a weighted coin to come up heads. The vertical axis shows the entropy of tossing the corresponding coin once.

H(p) = − p log₂ p − (1 − p) log₂ (1 − p)

(See also Page 23: the noisy channel model.)

Page 16:

Entropy (3/4)

Ex7> The result of rolling an 8-sided die. (uniform distribution)

– Entropy : The average length of the message needed to transmit an outcome of that variable.

H(X) = − Σ_{i=1}^{8} p(i) log p(i) = − Σ_{i=1}^{8} (1/8) log (1/8) = log 8 = 3 bits

An optimal code assigns a 3-bit string to each outcome:

outcome | 1   2   3   4   5   6   7   8
code    | 001 010 011 100 101 110 111 000

In terms of the expectation E:   H(X) = E( log (1 / p(X)) )

Page 17:

Entropy (4/4)

Ex8> Simplified Polynesian

– We can design a code that on average takes 2½ bits to transmit a letter.

– Entropy can be interpreted as a measure of the size of the ‘search space’ consisting of the possible values of a random variable.

Letter probabilities:

letter | p    t    k    a    i    u
P(i)   | 1/8  1/4  1/8  1/4  1/8  1/8

H(P) = − Σ_{i∈{p,t,k,a,i,u}} P(i) log P(i) = 2½ bits

A code achieving this average length:

letter | p    t    k    a    i    u
code   | 100  00   101  01   110  111
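A check (mine) of the 2½-bit figure and of the average length of the code above.

```python
from math import log2

probs = {"p": 1/8, "t": 1/4, "k": 1/8, "a": 1/4, "i": 1/8, "u": 1/8}
code = {"p": "100", "t": "00", "k": "101", "a": "01", "i": "110", "u": "111"}

H = -sum(P * log2(P) for P in probs.values())
avg_len = sum(probs[c] * len(code[c]) for c in probs)

print(H)         # 2.5 bits
print(avg_len)   # 2.5 bits per letter: the code meets the entropy bound
```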

Page 18:

Joint entropy and conditional entropy (1/3)

The joint entropy of a pair of discrete random variables X, Y ~ p(x, y):

H(X, Y) = − Σ_{x∈X} Σ_{y∈Y} p(x, y) log₂ p(x, y)

The conditional entropy:

H(Y|X) = Σ_{x∈X} p(x) H(Y|X = x) = − Σ_{x∈X} Σ_{y∈Y} p(x, y) log₂ p(y|x)

The chain rule for entropy:

H(X, Y) = H(X) + H(Y|X)

H(X_1, ..., X_n) = H(X_1) + H(X_2|X_1) + ... + H(X_n|X_1, ..., X_{n−1})

Page 19:

Joint entropy and conditional entropy (2/3)

Ex9> Simplified Polynesian revisited
– All words consist of sequences of CV (consonant-vowel) syllables.

Per-syllable joint probabilities P(C, V), with marginal probabilities:

        p      t      k     | P(V)
  a     1/16   3/8    1/16  | 1/2
  i     1/16   3/16   0     | 1/4
  u     0      3/16   1/16  | 1/4
  ---------------------------------
  P(C)  1/8    3/4    1/8   |

The marginal probabilities are on a per-syllable basis; they are double the per-letter basis probabilities. (cf. Page 8)

Page 20:

Joint entropy and conditional entropy (3/3)

Using the per-syllable joint distribution P(C, V) from the previous slide:

        p      t      k     | P(V)
  a     1/16   3/8    1/16  | 1/2
  i     1/16   3/16   0     | 1/4
  u     0      3/16   1/16  | 1/4
  ---------------------------------
  P(C)  1/8    3/4    1/8   |

H(C) = − 2 · (1/8) log (1/8) − (3/4) log (3/4) = 9/4 − (3/4) log 3 ≈ 1.061 bits

H(V|C) = Σ_{c∈{p,t,k}} p(c) H(V|C = c)
       = (1/8) H(1/2, 1/2, 0) + (3/4) H(1/2, 1/4, 1/4) + (1/8) H(1/2, 0, 1/2)
       = 1/8 + 9/8 + 1/8 = 11/8 = 1.375 bits

H(C, V) = H(C) + H(V|C) = 29/8 − (3/4) log 3 ≈ 2.44 bits
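A sketch (mine) that recomputes H(C), H(V|C), and H(C, V) from the per-syllable joint table and confirms the chain rule numerically.

```python
from fractions import Fraction
from math import log2

F = Fraction
# Per-syllable joint distribution P(C, V) from the table above (vowels a, i, u).
joint = {("p", "a"): F(1, 16), ("t", "a"): F(3, 8),  ("k", "a"): F(1, 16),
         ("p", "i"): F(1, 16), ("t", "i"): F(3, 16), ("k", "i"): F(0),
         ("p", "u"): F(0),     ("t", "u"): F(3, 16), ("k", "u"): F(1, 16)}

def H(pmf):
    # Entropy in bits; zero-probability entries contribute nothing.
    return -sum(p * log2(p) for p in pmf if p > 0)

# Marginal distribution of the consonant C.
p_C = {c: sum(p for (ci, _), p in joint.items() if ci == c) for c in "ptk"}

H_C = H(p_C.values())
H_CV = H(joint.values())
# Conditional entropy from the definition: H(V|C) = sum_c p(c) H(V | C = c).
H_V_given_C = sum(p_C[c] * H([joint[(c, v)] / p_C[c] for v in "aiu"]) for c in "ptk")

print(round(H_C, 3))          # 1.061 bits
print(round(H_V_given_C, 3))  # 1.375 bits
print(round(H_CV, 3))         # 2.436 bits = H(C) + H(V|C): the chain rule
```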

Page 21:

Mutual information (1/2)

By the chain rule for entropy,

H(X, Y) = H(X) + H(Y|X) = H(Y) + H(X|Y)

so

I(X; Y) = H(X) − H(X|Y) = H(Y) − H(Y|X)   : mutual information

Mutual information between X and Y

– The amount of information one random variable contains about another (symmetric, non-negative).

– It is 0 only when two variables are independent.

– It grows not only with the degree of dependence, but also according to the entropy of the variables.

– It is actually better to think of it as a measure of independence.

Page 22:

Mutual information (2/2)

I(X; Y) = H(X) − H(X|Y) = H(X) + H(Y) − H(X, Y)
        = Σ_x p(x) log (1/p(x)) + Σ_y p(y) log (1/p(y)) + Σ_{x,y} p(x, y) log p(x, y)
        = Σ_{x,y} p(x, y) log [ p(x, y) / (p(x) p(y)) ]

[Figure: relationship between H(X), H(Y), H(X|Y), H(Y|X), I(X; Y), and H(X, Y)]

– Since H(X|X) = 0,   H(X) = H(X) − H(X|X) = I(X; X)
  (this is why entropy is also called self-information)

– Conditional MI and a chain rule:

I(X; Y | Z) = I((X; Y) | Z) = H(X|Z) − H(X | Y, Z)

I(X_1, ..., X_n; Y) = I(X_1; Y) + ... + I(X_n; Y | X_1, ..., X_{n−1}) = Σ_{i=1}^n I(X_i; Y | X_1, ..., X_{i−1})

– Pointwise MI:   I(x, y) = log [ p(x, y) / (p(x) p(y)) ]
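A sketch (mine) of I(X; Y) computed directly from a joint distribution, using the per-syllable table of Pages 19–20; since H(V) = 1.5 bits and H(V|C) = 1.375 bits, I(C; V) should come out to 0.125 bits.

```python
from fractions import Fraction
from math import log2

F = Fraction
# Per-syllable joint distribution P(C, V) (Pages 19-20).
joint = {("p", "a"): F(1, 16), ("t", "a"): F(3, 8),  ("k", "a"): F(1, 16),
         ("p", "i"): F(1, 16), ("t", "i"): F(3, 16), ("k", "i"): F(0),
         ("p", "u"): F(0),     ("t", "u"): F(3, 16), ("k", "u"): F(1, 16)}

# Marginal distributions of the consonant and the vowel.
p_C = {c: sum(p for (ci, _), p in joint.items() if ci == c) for c in "ptk"}
p_V = {v: sum(p for (_, vi), p in joint.items() if vi == v) for v in "aiu"}

# I(C; V) = sum_{c,v} p(c, v) * log2[ p(c, v) / (p(c) p(v)) ]
I = sum(p * log2(p / (p_C[c] * p_V[v])) for (c, v), p in joint.items() if p > 0)

print(round(I, 3))   # 0.125 bits = H(V) - H(V|C)
```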

Page 23:

Noisy channel model

Channel capacity: the rate at which one can transmit information through the channel (at the optimum):

C = max_{p(X)} I(X; Y)

[Diagram: W (message from a finite alphabet) → Encoder → X (input to channel) → Channel p(y|x) → Y (output from channel) → Decoder → Ŵ (attempt to reconstruct the message based on the output)]

Binary symmetric channel

[Diagram: 0 → 0 and 1 → 1 each with probability 1 − p; the bit is flipped (0 → 1, 1 → 0) with probability p]

I(X; Y) = H(Y) − H(Y|X) = H(Y) − H(p)

C = max_{p(X)} I(X; Y) = 1 − H(p)

– Since entropy is non-negative, C ≤ 1.

(For H(p), see Page 15.)
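A sketch (mine) of the binary symmetric channel capacity C = 1 − H(p) for a few crossover probabilities.

```python
from math import log2

def H(p):
    # Binary entropy of a coin with P(heads) = p (cf. Page 15).
    return 0.0 if p in (0, 1) else -(p * log2(p) + (1 - p) * log2(1 - p))

def bsc_capacity(p):
    # C = max over input distributions of I(X; Y) = 1 - H(p) for crossover prob p.
    return 1 - H(p)

for p in (0.0, 0.1, 0.5):
    print(p, round(bsc_capacity(p), 3))
# 0.0 -> 1.0 (noiseless), 0.1 -> 0.531, 0.5 -> 0.0 (output independent of input)
```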

Page 24:

Relative entropy or Kullback-Leibler divergence

Relative entropy for two pmfs p(x), q(x):

D(p || q) = Σ_{x∈X} p(x) log ( p(x) / q(x) ) = E_p [ log ( p(X) / q(X) ) ]

– A measure of how close two pmfs are.

– Non-negative, and D(p || q) = 0 iff p = q.

– Mutual information is the relative entropy between the joint distribution and the product of the marginals:   I(X; Y) = D( p(x, y) || p(x) p(y) )

– Conditional relative entropy and chain rule:

D( p(y|x) || q(y|x) ) = Σ_x p(x) Σ_y p(y|x) log ( p(y|x) / q(y|x) )

D( p(x, y) || q(x, y) ) = D( p(x) || q(x) ) + D( p(y|x) || q(y|x) )
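A closing sketch (mine) of D(p || q) for two small pmfs; the distributions are made-up examples, and the asymmetry of the measure is shown as well.

```python
from math import log2

def kl(p, q):
    # D(p || q) = sum_x p(x) log2(p(x) / q(x)); assumes q(x) > 0 wherever p(x) > 0.
    return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.25, 0.25]
q = [1/3, 1/3, 1/3]

print(round(kl(p, q), 4))   # 0.085 bits: how far p is from the uniform q
print(kl(p, p))             # 0.0: D(p||p) = 0
print(round(kl(q, p), 4))   # 0.0817 bits: D(p||q) != D(q||p), so it is not symmetric
```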