2. Mathematical Foundations
2001. 7. 10.
Artificial Intelligence Laboratory, 성경희
Foundations of Statistical Natural Language Processing
2
Contents – Part 1
1. Elementary Probability Theory
– Conditional probability
– Bayes' theorem
– Random variable
– Joint and conditional distributions
– Standard distributions
3
Conditional probability (1/2)
P(A) : the probability of the event A
Ex1> A coin is tossed 3 times.
Ω = {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT}
A = {HHT, HTH, THH} : exactly two heads, P(A) = 3/8
B = {HHH, HHT, HTH, HTT} : first toss is a head, P(B) = 1/2
Conditional probability:
P(A|B) = P(A ∩ B) / P(B)
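As a sanity check on Ex1 (our own sketch, not part of the slide), enumerating the sample space recovers P(A|B) = P(A ∩ B)/P(B); the helper name P is ours:

from itertools import product
from fractions import Fraction

# Sample space: all 3-toss outcomes, each with probability 1/8
omega = [''.join(t) for t in product('HT', repeat=3)]
A = {o for o in omega if o.count('H') == 2}   # exactly two heads
B = {o for o in omega if o[0] == 'H'}         # first toss is a head

P = lambda E: Fraction(len(E), len(omega))    # uniform probability over omega
print(P(A), P(B), P(A & B) / P(B))            # 3/8 1/2 1/2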
4
Conditional probability (2/2)
Multiplication rule:
P(A ∩ B) = P(A|B) P(B) = P(B|A) P(A)
Chain rule:
P(A_1 ∩ ⋯ ∩ A_n) = P(A_1) P(A_2|A_1) P(A_3|A_1 ∩ A_2) ⋯ P(A_n | ∩_{i=1}^{n−1} A_i)
Two events A, B are independent:
– P(A ∩ B) = P(A) P(B)
– If P(B) > 0, then P(A|B) = P(A)
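A numeric check (ours) of the chain rule and independence on the same three-toss sample space; the events B, C, D below pick out a head on each of the three tosses:

from itertools import product
from fractions import Fraction

omega = [''.join(t) for t in product('HT', repeat=3)]
P = lambda E: Fraction(len(E), len(omega))   # uniform probability over omega

B = {o for o in omega if o[0] == 'H'}   # first toss is a head
C = {o for o in omega if o[1] == 'H'}   # second toss is a head
D = {o for o in omega if o[2] == 'H'}   # third toss is a head

# Chain rule: P(B ∩ C ∩ D) = P(B) · P(C|B) · P(D|B ∩ C)
lhs = P(B & C & D)
rhs = P(B) * (P(C & B) / P(B)) * (P(D & B & C) / P(B & C))
assert lhs == rhs == Fraction(1, 8)

# The tosses are independent: P(B ∩ C) = P(B) P(C)
assert P(B & C) == P(B) * P(C)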
5
Bayes’ theorem (1/2)
Generally, if A ⊆ ∪_{i=1}^{n} B_i and the B_i are disjoint (B_i ∩ B_j = ∅ for i ≠ j), then
P(A) = Σ_i P(A ∩ B_i) = Σ_i P(A|B_i) P(B_i)
(in the simplest case, P(A) = P(A ∩ B) + P(A ∩ B̄) = P(A|B) P(B) + P(A|B̄) P(B̄))
Bayes' theorem:
P(B|A) = P(A|B) P(B) / P(A)
P(B_j|A) = P(A|B_j) P(B_j) / P(A) = P(A|B_j) P(B_j) / Σ_{i=1}^{n} P(A|B_i) P(B_i)
6
Bayes’ theorem (2/2)
Ex2> G : the event of the sentence having a parasitic gap
T : the event of the test being positive
With P(G) = 0.00001 (so P(Ḡ) = 0.99999), P(T|G) = 0.95, and P(T|Ḡ) = 0.005:
P(G|T) = P(T|G) P(G) / ( P(T|G) P(G) + P(T|Ḡ) P(Ḡ) )
       = (0.95 × 0.00001) / (0.95 × 0.00001 + 0.005 × 0.99999) ≈ 0.002
This poor result comes about because the prior probability of a sentence containing a parasitic gap is so low.
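The same computation in a few lines of Python (variable names ours; the probabilities are those on the slide):

p_G = 0.00001            # prior probability of a parasitic gap
p_T_given_G = 0.95       # P(test positive | gap)
p_T_given_notG = 0.005   # P(test positive | no gap)

p_T = p_T_given_G * p_G + p_T_given_notG * (1 - p_G)   # total probability
p_G_given_T = p_T_given_G * p_G / p_T                  # Bayes' theorem
print(round(p_G_given_T, 3))   # 0.002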
7
Random variable
Ex3> Random variable X for the sum of two dice:

First die \ Second die:  1  2  3  4  5  6
        6                7  8  9 10 11 12
        5                6  7  8  9 10 11
        4                5  6  7  8  9 10
        3                4  5  6  7  8  9
        2                3  4  5  6  7  8
        1                2  3  4  5  6  7

x       2     3     4     5    6     7    8     9    10    11    12
p(X=x)  1/36  1/18  1/12  1/9  5/36  1/6  5/36  1/9  1/12  1/18  1/36
S={2,…,12}
probability mass function (pmf): p(x) = P(X = x), X ~ p(x)
If X maps to {0,1}, then X is called an indicator random variable or a Bernoulli trial.
Σ_i p(x_i) = 1
Expectation: E(X) = Σ_x x p(x)
Variance: Var(X) = E((X − E(X))²) = E(X²) − E²(X)
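A short sketch (ours) that rebuilds the pmf of Ex3 and confirms Σ_i p(x_i) = 1 while computing E(X) and Var(X):

from itertools import product
from fractions import Fraction

# pmf of X = sum of two dice (Ex3): p(x) = P(X = x)
pmf = {}
for d1, d2 in product(range(1, 7), repeat=2):
    pmf[d1 + d2] = pmf.get(d1 + d2, 0) + Fraction(1, 36)

assert sum(pmf.values()) == 1                        # sum_i p(x_i) = 1
E = sum(x * p for x, p in pmf.items())               # E(X) = sum_x x p(x)
Var = sum(x**2 * p for x, p in pmf.items()) - E**2   # E(X^2) - E^2(X)
print(E, Var)   # 7 35/6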
8
Joint and conditional distributions
The joint pmf for two discrete random variables X, Y:
– p(x, y) = P(X = x, Y = y)
Marginal pmfs, which total up the probability mass for the values of each variable separately:
– p_X(x) = Σ_y p(x, y),  p_Y(y) = Σ_x p(x, y)
Conditional pmf:
– p_{X|Y}(x|y) = p(x, y) / p_Y(y) for y such that p_Y(y) > 0
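A sketch of these definitions (ours), using X = first die and Y = the sum from Ex3 as the joint distribution:

from itertools import product
from fractions import Fraction
from collections import defaultdict

# Joint pmf p(x, y) with X = first die, Y = sum of both dice
joint = defaultdict(Fraction)
for d1, d2 in product(range(1, 7), repeat=2):
    joint[(d1, d1 + d2)] += Fraction(1, 36)

pX, pY = defaultdict(Fraction), defaultdict(Fraction)
for (x, y), p in joint.items():     # marginals: total mass per variable
    pX[x] += p
    pY[y] += p

# Conditional pmf p_{X|Y}(x|y) = p(x, y) / p_Y(y), for y with p_Y(y) > 0
p_x_given_y = {(x, y): p / pY[y] for (x, y), p in joint.items()}
print(pX[1], pY[7], p_x_given_y[(1, 7)])   # 1/6 1/6 1/6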
9
Standard distributions (1/3)
Discrete distributions: The binomial distribution
– When one has a series of trials with only two outcomes, each trial being independent from all the others.
– The number r of successes out of n trials, given that the probability of success in any trial is p:
P(R = r) = b(r; n, p) = C(n, r) p^r (1 − p)^(n−r), where 0 ≤ r ≤ n and C(n, r) = n! / ((n − r)! r!)
– Expectation: np; variance: np(1 − p)
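A direct transcription of b(r; n, p) into Python (math.comb provides C(n, r)); the brute-force mean and variance match np and np(1 − p):

from math import comb

def b(r, n, p):
    """Binomial pmf b(r; n, p): probability of r successes in n trials."""
    return comb(n, r) * p**r * (1 - p)**(n - r)

n, p = 10, 0.7
mean = sum(r * b(r, n, p) for r in range(n + 1))
var = sum((r - mean)**2 * b(r, n, p) for r in range(n + 1))
print(round(mean, 6), round(var, 6))   # 7.0 2.1, i.e. np and np(1-p)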
10
Standard distributions (2/3)
Discrete distributions: The binomial distribution
[Figure: binomial pmfs b(r; n, 0.5) and b(r; n, 0.7) for n = 10, 20, 40, plotted as probability against count r from 0 to 40]
11
Standard distributions (3/3)
Continuous distributions: The normal distribution
– For the mean μ and the standard deviation σ, the probability density function (pdf) is:
n(x; μ, σ) = (1 / (σ √(2π))) e^(−(x − μ)² / (2σ²))
[Figure: pdfs of N(0,1), N(0,0.7), and N(1.5,2), plotted over values −5 to 5]
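A small sketch of the pdf (the name n_pdf is ours), checking the peak height of N(0,1) and that the density integrates to about 1:

from math import exp, pi, sqrt

def n_pdf(x, mu, sigma):
    """Normal pdf n(x; mu, sigma)."""
    return exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sqrt(2 * pi) * sigma)

print(round(n_pdf(0, 0, 1), 4))   # 0.3989, the peak of N(0,1)
dx = 0.001                        # crude Riemann-sum check over [-5, 5]
print(round(sum(n_pdf(i * dx, 0, 1) * dx for i in range(-5000, 5000)), 4))  # ~1.0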
12
Contents – Part 2
2. Essential Information Theory
– Entropy
– Joint entropy and conditional entropy
– Mutual information
– The noisy channel model
– Relative entropy or Kullback-Leibler divergence
13
Shannon’s Information Theory
Maximizing the amount of information that one can transmit over an imperfect communication channel such as a noisy phone line.
Theoretical maximum for data compression – Entropy H
Theoretical maximum for the transmission rate – Channel Capacity
14
Entropy (1/4)
The entropy H (or self-information) is the average uncertainty of a single random variable X.
Entropy is a measure of uncertainty.
– The more we know about something, the lower the entropy will be.
– We can use entropy as a measure of the quality of our models.
Entropy measures the amount of information in a random variable (measured in bits).
H(X) = H(p) = −Σ_{x∈X} p(x) log₂ p(x), where p(x) is the pmf of X
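A tiny Python sketch of this definition (the function name H follows the slide's notation; the example distributions are ours):

from math import log2

def H(pmf):
    """Entropy in bits: H(p) = -sum_x p(x) log2 p(x); zero-probability terms are skipped."""
    return -sum(p * log2(p) for p in pmf if p > 0)

print(H([0.5, 0.5]))            # 1.0 bit: a fair coin is maximally uncertain
print(round(H([0.9, 0.1]), 3))  # 0.469: the more we know, the lower the entropy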
15
Entropy (2/4)
[Figure: the entropy of a weighted coin. The horizontal axis shows the probability of the coin coming up heads; the vertical axis shows the entropy of tossing the corresponding coin once.]
H(p) = −p log₂ p − (1 − p) log₂ (1 − p)
16
Entropy (3/4)
Ex7> The result of rolling an 8-sided die. (uniform distribution)
– H(X) = −Σ_{i=1}^{8} p(i) log₂ p(i) = −8 · (1/8) log₂ (1/8) = log₂ 8 = 3 bits
– Entropy: the average length of the message needed to transmit an outcome of that variable, e.g. the 3-bit code:

  1    2    3    4    5    6    7    8
  001  010  011  100  101  110  111  000

– In terms of the expectation E: H(X) = E( log₂ (1/p(X)) )
17
Entropy (4/4)
Ex8> Simplified Polynesian
– Letter probabilities:

  p    t    k    a    i    u
  1/8  1/4  1/8  1/4  1/8  1/8

– H(P) = −Σ_{i∈{p,t,k,a,i,u}} P(i) log₂ P(i) = 2½ bits
– We can design a code that on average takes 2½ bits to transmit a letter:

  p    t    k    a    i    u
  100  00   101  01   110  111

– Entropy can be interpreted as a measure of the size of the 'search space' consisting of the possible values of a random variable.
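A quick check (our own sketch) that the letter distribution above has entropy 2½ bits and that the displayed code achieves exactly that average length:

from math import log2

# Letter probabilities and the code from the slide
P = {'p': 1/8, 't': 1/4, 'k': 1/8, 'a': 1/4, 'i': 1/8, 'u': 1/8}
code = {'p': '100', 't': '00', 'k': '101', 'a': '01', 'i': '110', 'u': '111'}

H = -sum(p * log2(p) for p in P.values())
avg_len = sum(P[c] * len(code[c]) for c in P)
print(H, avg_len)   # 2.5 2.5 -- the average code length meets the entropy bound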
18
Joint entropy and conditional entropy (1/3)
The joint entropy of a pair of discrete random variables X, Y ~ p(x, y):
– H(X,Y) = −Σ_{x∈X} Σ_{y∈Y} p(x, y) log₂ p(x, y)
The conditional entropy:
– H(Y|X) = Σ_{x∈X} p(x) H(Y|X = x) = −Σ_{x∈X} Σ_{y∈Y} p(x, y) log₂ p(y|x)
The chain rule for entropy:
– H(X,Y) = H(X) + H(Y|X)
– H(X_1, …, X_n) = H(X_1) + H(X_2|X_1) + ⋯ + H(X_n|X_1, …, X_{n−1})
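A minimal sketch of these definitions as Python helpers over a joint pmf stored as a dict; the function names (H_joint, H_X, H_Y_given_X) are ours:

from math import log2

def H_joint(joint):
    """H(X,Y) = -sum_{x,y} p(x,y) log2 p(x,y); joint is a dict {(x, y): prob}."""
    return -sum(p * log2(p) for p in joint.values() if p > 0)

def H_X(joint):
    """Entropy of the first variable, computed from its marginal pmf."""
    pX = {}
    for (x, _), p in joint.items():
        pX[x] = pX.get(x, 0) + p
    return -sum(p * log2(p) for p in pX.values() if p > 0)

def H_Y_given_X(joint):
    """Conditional entropy via the chain rule: H(Y|X) = H(X,Y) - H(X)."""
    return H_joint(joint) - H_X(joint)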
19
Joint entropy and conditional entropy (2/3)
Ex9> Simplified Polynesian revisited
– All words consist of sequences of CV (consonant-vowel) syllables.
– Per-syllable joint probabilities p(C,V), with the marginal probabilities in the margins:

            p      t      k     p(V)
   a       1/16   3/8    1/16   1/2
   i       1/16   3/16   0      1/4
   u       0      3/16   1/16   1/4
   p(C)    1/8    3/4    1/8

– Per-letter basis probabilities:

   p      t     k      a     i     u
   1/16   3/8   1/16   1/4   1/8   1/8
20
Joint entropy and conditional entropy (3/3)
– Using the per-syllable joint distribution p(C,V) from the previous slide:
– H(C) = −2 · (1/8) log₂ (1/8) − (3/4) log₂ (3/4) = 9/4 − (3/4) log₂ 3 ≈ 1.061 bits
– H(V|C) = Σ_{c∈{p,t,k}} p(C = c) H(V|C = c)
         = (1/8) H(1/2, 1/2, 0) + (3/4) H(1/2, 1/4, 1/4) + (1/8) H(1/2, 0, 1/2)
         = 11/8 bits = 1.375 bits
– H(C,V) = H(C) + H(V|C) = 29/8 − (3/4) log₂ 3 ≈ 2.44 bits
21
Mutual information (1/2)
By the chain rule for entropy (equations below), H(X) − H(X|Y) = H(Y) − H(Y|X) = I(X;Y) : mutual information
Mutual information between X and Y
– The amount of information one random variable contains about another. (symmetric, non-negative)
– It is 0 only when two variables are independent.
– It grows not only with the degree of dependence, but also according to the entropy of the variables.
– It is actually better to think of it as a measure of independence.
H(X,Y) = H(X) + H(Y|X) = H(Y) + H(X|Y)
I(X;Y) = H(X) − H(X|Y) = H(Y) − H(Y|X)
22
Mutual information (2/2)
– I(X;Y) = H(X) − H(X|Y) = H(X) + H(Y) − H(X,Y)
  = Σ_x p(x) log₂ (1/p(x)) + Σ_y p(y) log₂ (1/p(y)) + Σ_{x,y} p(x,y) log₂ p(x,y)
  = Σ_{x,y} p(x,y) log₂ ( p(x,y) / (p(x) p(y)) )
– Since H(X) = H(X) − H(X|X) = I(X;X) and H(X|X) = 0, entropy is called self-information.
[Figure: Venn diagram relating H(X), H(Y), H(X|Y), H(Y|X), I(X;Y), and H(X,Y)]
– Conditional MI and a chain rule:
I(X;Y|Z) = I((X;Y)|Z) = H(X|Z) − H(X|Y,Z)
I(X_1, …, X_n; Y) = I(X_1;Y) + ⋯ + I(X_n;Y | X_1, …, X_{n−1}) = Σ_{i=1}^{n} I(X_i;Y | X_1, …, X_{i−1})
– Pointwise MI: I(x, y) = log₂ ( p(x,y) / (p(x) p(y)) )
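A short sketch (ours) computing I(C;V) for the Ex9 distribution directly from the sum formula above:

from math import log2

joint = {('p', 'a'): 1/16, ('t', 'a'): 3/8,  ('k', 'a'): 1/16,
         ('p', 'i'): 1/16, ('t', 'i'): 3/16, ('k', 'i'): 0,
         ('p', 'u'): 0,    ('t', 'u'): 3/16, ('k', 'u'): 1/16}

pX, pY = {}, {}
for (x, y), p in joint.items():   # marginals p(x) and p(y)
    pX[x] = pX.get(x, 0) + p
    pY[y] = pY.get(y, 0) + p

I = sum(p * log2(p / (pX[x] * pY[y])) for (x, y), p in joint.items() if p > 0)
print(round(I, 3))   # 0.125 bits; it would be 0 if C and V were independent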
23
Noisy channel model
Channel capacity: the rate at which one can transmit information through the channel (optimal):
C = max_{p(X)} I(X;Y)

W → [Encoder] → X → [Channel p(y|x)] → Y → [Decoder] → Ŵ
(W: message from a finite alphabet; X: input to channel; Y: output from channel; Ŵ: attempt to reconstruct the message based on the output)

Binary symmetric channel
[Figure: inputs 0 and 1 are transmitted correctly with probability 1 − p and flipped with probability p]
– I(X;Y) = H(Y) − H(Y|X) = H(Y) − H(p)
– C = max_{p(X)} I(X;Y) = 1 − H(p); and since entropy is non-negative, C ≤ 1.
(cf. the weighted-coin entropy curve on slide 15)
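A minimal sketch (function names ours) of the binary symmetric channel capacity C = 1 − H(p):

from math import log2

def H2(p):
    """Binary entropy H(p), the weighted-coin curve from slide 15."""
    return 0.0 if p in (0.0, 1.0) else -p * log2(p) - (1 - p) * log2(1 - p)

def bsc_capacity(p):
    """Capacity of a binary symmetric channel with crossover probability p."""
    return 1 - H2(p)

print(bsc_capacity(0.0), round(bsc_capacity(0.1), 3), bsc_capacity(0.5))
# 1.0 0.531 0.0 -- a noiseless channel carries 1 bit; a coin-flip channel, none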
24
Relative entropy or Kullback-Leibler divergence
Relative entropy for two pmfs p(x), q(x):
D(p‖q) = Σ_{x∈X} p(x) log₂ ( p(x) / q(x) ) = E_p ( log₂ ( p(X) / q(X) ) )
– A measure of how close two pmfs are.
– Non-negative, and D(p‖q) = 0 if p = q.
– Mutual information as a relative entropy: I(X;Y) = D( p(x,y) ‖ p(x) p(y) )
– Conditional relative entropy and chain rule:
D( p(y|x) ‖ q(y|x) ) = Σ_x p(x) Σ_y p(y|x) log₂ ( p(y|x) / q(y|x) )
D( p(x,y) ‖ q(x,y) ) = D( p(x) ‖ q(x) ) + D( p(y|x) ‖ q(y|x) )
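A small sketch of D(p‖q) (ours; the example pmfs are illustrative, not from the slides), showing non-negativity and asymmetry:

from math import log2

def D(p, q):
    """Relative entropy D(p||q) in bits; q must be nonzero wherever p is."""
    return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [1/2, 1/4, 1/4]   # illustrative pmfs over the same three outcomes
q = [1/3, 1/3, 1/3]
print(round(D(p, q), 3), D(p, p))   # 0.085 0.0 -- zero only when p = q
print(round(D(q, p), 3))            # 0.082 -- note D is not symmetric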