TRANSCRIPT: Statistical Pattern Recognition
Ricardo Gutierrez-Osuna, Texas A&M University (TAMU-CSE)
Bayes theorem
• Problem: deciding whether a patient has a particular condition based on a particular test
  – However, the test is imperfect
    • Someone with the condition may go undetected (false negative)
    • Someone without the condition may come out positive (false positive)
  – Test properties
    • SPECIFICITY, or true-negative rate: P(NEG|¬COND)
    • SENSITIVITY, or true-positive rate: P(POS|COND)
• Problem definition
  – Assume a population of 10,000 where 1 out of every 100 people has the medical condition
  – Assume that we design a test with 98% specificity P(NEG|¬COND) and 90% sensitivity P(POS|COND)
  – You take the test, and it comes out POSITIVE
  – What conditional probability are we after? How likely is it that you have the condition?
• Solution: joint frequency table
  – The answer is the ratio of individuals with the condition to the total number of individuals, counting only individuals that tested positive: 90/288 = 0.3125

                      TEST IS POSITIVE               TEST IS NEGATIVE               ROW TOTAL
  HAS CONDITION       True positive P(POS|COND)      False negative P(NEG|COND)     100
                      100 × 0.90 = 90                100 × (1−0.90) = 10
  FREE OF CONDITION   False positive P(POS|¬COND)    True negative P(NEG|¬COND)     9,900
                      9,900 × (1−0.98) = 198         9,900 × 0.98 = 9,702
  COLUMN TOTAL        288                            9,712                          10,000
• Conditional probability

    $$P(A|B) = \frac{P(A \cap B)}{P(B)}, \quad \text{for } P(B) > 0 \qquad \text{("B has occurred")}$$

  [Figure: Venn diagrams of events A and B and their intersection A∩B]

• Total probability

    $$P(A) = P(A \cap S) = P(A \cap B_1) + \dots + P(A \cap B_N) = P(A|B_1)P(B_1) + \dots + P(A|B_N)P(B_N) = \sum_{k=1}^{N} P(A|B_k)\,P(B_k)$$

  [Figure: sample space S partitioned into B_1, B_2, ..., B_N, with event A overlapping the partition]
• Alternative solution: Bayes theorem

    $$P(B|A) = \frac{P(A|B)\,P(B)}{P(A)}$$

    $$P(\text{COND}|+) = \frac{P(+|\text{COND})\,P(\text{COND})}{P(+)} = \frac{P(+|\text{COND})\,P(\text{COND})}{P(+|\text{COND})\,P(\text{COND}) + P(+|\neg\text{COND})\,P(\neg\text{COND})}$$

    $$= \frac{0.90 \times 0.01}{0.90 \times 0.01 + (1-0.98) \times 0.99} = 0.3125$$
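As a sanity check on the arithmetic above, here is a minimal Python sketch (the variable names are mine, not from the slides) that reproduces the 0.3125 posterior:

```python
# Posterior probability of having the condition given a positive test,
# computed with Bayes theorem (numbers taken from the slide).
p_cond = 0.01          # prior: 1 in 100 has the condition
sensitivity = 0.90     # P(POS | COND)
specificity = 0.98     # P(NEG | ~COND)

# Total probability of testing positive
p_pos = sensitivity * p_cond + (1 - specificity) * (1 - p_cond)

# Bayes theorem: P(COND | POS)
posterior = sensitivity * p_cond / p_pos
print(posterior)  # 0.3125
```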
• In SPR, Bayes theorem is expressed as

    $$P(\omega_j|x) = \frac{P(x|\omega_j)\,P(\omega_j)}{P(x)} = \frac{P(x|\omega_j)\,P(\omega_j)}{\sum_{k=1}^{N} P(x|\omega_k)\,P(\omega_k)}$$

    Posterior = (Likelihood × Prior) / Normalization constant

  – And we assign sample x to the class ω_k with the highest posterior (a sketch of this rule follows)
    • It can be shown that this rule minimizes the probability of error
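To make the decision rule concrete, here is a minimal sketch of MAP classification, assuming the likelihoods and priors are already known; the numbers below are hypothetical, purely for illustration:

```python
import numpy as np

def map_classify(likelihoods, priors):
    """Assign x to the class with the highest posterior.
    likelihoods[j] = P(x | w_j); priors[j] = P(w_j).
    The normalization constant P(x) is the same for every class,
    so the argmax can skip it."""
    posteriors = likelihoods * priors
    return int(np.argmax(posteriors))

# Hypothetical numbers for illustration
likelihoods = np.array([0.05, 0.20, 0.10])
priors = np.array([0.5, 0.2, 0.3])
print(map_classify(likelihoods, priors))  # 1 (0.20*0.2 is the largest product)
```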
Discriminant functions

  [Figure: the classifier as a network: features x1, x2, x3, ..., xd feed discriminant functions g1(x), g2(x), ..., gC(x); a "select max" stage (optionally including costs) produces the class assignment]

    $$x \in \omega_i \iff g_i(x) > g_j(x)\ \ \forall j \neq i, \quad \text{where } g_i(x) = P(\omega_i|x)$$
Quadratic classifiers
• For normally distributed classes, the posterior can be reduced to a very simple expression
  – Recall that an n-dimensional Gaussian density is

    $$p(x) = \frac{1}{(2\pi)^{n/2}\,|\Sigma|^{1/2}} \exp\left[-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)\right]$$

  – Using Bayes rule, the DF can be written as

    $$g_i(x) = P(\omega_i|x) = \frac{P(x|\omega_i)\,P(\omega_i)}{P(x)} = \frac{1}{(2\pi)^{n/2}\,|\Sigma_i|^{1/2}} \exp\left[-\frac{1}{2}(x-\mu_i)^T \Sigma_i^{-1} (x-\mu_i)\right] \frac{P(\omega_i)}{P(x)}$$
  – Eliminating constant terms

    $$g_i(x) = |\Sigma_i|^{-1/2} \exp\left[-\frac{1}{2}(x-\mu_i)^T \Sigma_i^{-1} (x-\mu_i)\right] P(\omega_i)$$

  – And taking logs

    $$g_i(x) = -\frac{1}{2}(x-\mu_i)^T \Sigma_i^{-1} (x-\mu_i) - \frac{1}{2}\log|\Sigma_i| + \log P(\omega_i)$$

  – This is known as a quadratic discriminant function (because it is a function of x²)
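A minimal numpy sketch of this quadratic discriminant function, evaluated on a hypothetical two-class problem (the means, covariances, and test point are invented for illustration):

```python
import numpy as np

def quadratic_discriminant(x, mu, sigma, prior):
    """g_i(x) = -1/2 (x-mu)^T Sigma^{-1} (x-mu) - 1/2 log|Sigma| + log P(w_i)."""
    d = x - mu
    return (-0.5 * d @ np.linalg.solve(sigma, d)   # Mahalanobis term
            - 0.5 * np.log(np.linalg.det(sigma))   # log-determinant term
            + np.log(prior))                       # log-prior term

# Hypothetical two-class example with equal priors
mu1, sigma1 = np.array([0.0, 0.0]), np.eye(2)
mu2, sigma2 = np.array([3.0, 3.0]), 2 * np.eye(2)
x = np.array([2.5, 2.0])
g = [quadratic_discriminant(x, m, s, 0.5)
     for m, s in [(mu1, sigma1), (mu2, sigma2)]]
print(np.argmax(g))  # 1 -> x is assigned to the second class
```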
Case 1: Σᵢ = σ²I
• Features are statistically independent and have the same variance for all classes
  – In this case, the quadratic discriminant function becomes

    $$g_i(x) = -\frac{1}{2}(x-\mu_i)^T (\sigma^2 I)^{-1} (x-\mu_i) - \frac{1}{2}\log|\sigma^2 I| + \log P(\omega_i) = -\frac{1}{2\sigma^2}(x-\mu_i)^T (x-\mu_i) + \log P(\omega_i)$$

  – Assuming equal priors and dropping constant terms

    $$g_i(x) = -(x-\mu_i)^T (x-\mu_i) = -\sum_{k=1}^{DIM} \left(x_k - \mu_{i,k}\right)^2$$

  – This is called a Euclidean-distance or nearest-mean classifier (see the sketch below)
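A minimal sketch of the nearest-mean classifier; the class means are taken from the Case 1 example that follows (as reconstructed from the slide):

```python
import numpy as np

def nearest_mean_classify(x, means):
    """Euclidean-distance (nearest-mean) classifier: equal priors and
    Sigma_i = sigma^2 I, so g_i(x) reduces to -||x - mu_i||^2."""
    dists = [np.sum((x - mu) ** 2) for mu in means]
    return int(np.argmin(dists))

# Class means reconstructed from the Case 1 example below
means = [np.array([3.0, 2.0]), np.array([7.0, 4.0]), np.array([2.0, 5.0])]
print(nearest_mean_classify(np.array([4.0, 3.0]), means))  # 0 -> class w1
```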
• Example: three classes with

    $$\mu_1 = [3\ 2]^T, \quad \mu_2 = [7\ 4]^T, \quad \mu_3 = [2\ 5]^T$$
    $$\Sigma_1 = \Sigma_2 = \Sigma_3 = \begin{bmatrix} 2 & 0 \\ 0 & 2 \end{bmatrix}$$

  [Figure: the resulting (linear) decision boundaries for the three classes]
Case 2: Σᵢ = Σ
• All classes have the same covariance matrix, but the matrix is not diagonal
  – In this case, the quadratic discriminant becomes

    $$g_i(x) = -\frac{1}{2}(x-\mu_i)^T \Sigma^{-1} (x-\mu_i) - \frac{1}{2}\log|\Sigma| + \log P(\omega_i)$$

  – Assuming equal priors and eliminating constants

    $$g_i(x) = -(x-\mu_i)^T \Sigma^{-1} (x-\mu_i)$$

  – This is known as a Mahalanobis-distance classifier (a sketch follows)

  [Figure: contours of constant Euclidean distance ‖x−μᵢ‖² = K versus constant Mahalanobis distance (x−μᵢ)ᵀΣ⁻¹(x−μᵢ) = K in the x1–x2 plane]
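A matching sketch of the Mahalanobis-distance classifier; the shared covariance and the means are taken from the Case 2 example that follows (as reconstructed from the slide):

```python
import numpy as np

def mahalanobis_classify(x, means, sigma):
    """Mahalanobis-distance classifier: shared covariance, equal priors.
    Assigns x to the class minimizing (x-mu_i)^T Sigma^{-1} (x-mu_i)."""
    sigma_inv = np.linalg.inv(sigma)
    dists = [(x - mu) @ sigma_inv @ (x - mu) for mu in means]
    return int(np.argmin(dists))

# Shared covariance and means reconstructed from the Case 2 example below
sigma = np.array([[1.0, 0.7], [0.7, 2.0]])
means = [np.array([3.0, 2.0]), np.array([5.0, 4.0]), np.array([2.0, 5.0])]
print(mahalanobis_classify(np.array([4.5, 3.0]), means, sigma))  # 1 -> class w2
```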
• Example: three classes with

    $$\mu_1 = [3\ 2]^T, \quad \mu_2 = [5\ 4]^T, \quad \mu_3 = [2\ 5]^T$$
    $$\Sigma_1 = \Sigma_2 = \Sigma_3 = \begin{bmatrix} 1 & 0.7 \\ 0.7 & 2 \end{bmatrix}$$

  [Figure: the resulting decision boundaries for the three classes]
General case
• Example: three classes with

    $$\mu_1 = [3\ 2]^T, \quad \mu_2 = [5\ 4]^T, \quad \mu_3 = [2\ 5]^T$$
    $$\Sigma_1 = \begin{bmatrix} 1 & -1 \\ -1 & 2 \end{bmatrix}, \quad \Sigma_2 = \begin{bmatrix} 1 & -1 \\ -1 & 7 \end{bmatrix}, \quad \Sigma_3 = \begin{bmatrix} 0.5 & 0.5 \\ 0.5 & 3 \end{bmatrix}$$

  [Figure: the resulting quadratic decision boundaries, shown close up and zoomed out]
k nearest neighbors
• Non-parametric approximation
  – Likelihood of each class

    $$P(x|\omega_i) = \frac{k_i}{N_i V}$$

  – And priors

    $$P(\omega_i) = \frac{N_i}{N}$$

  – Then, the posterior becomes

    $$P(\omega_i|x) = \frac{P(x|\omega_i)\,P(\omega_i)}{P(x)} = \frac{\frac{k_i}{N_i V} \cdot \frac{N_i}{N}}{\frac{k}{N V}} = \frac{k_i}{k}$$
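A minimal k-NN sketch implementing the resulting majority (k_i/k) rule; the tiny dataset is hypothetical, purely for illustration:

```python
import numpy as np

def knn_classify(x, X_train, y_train, k=5):
    """k-NN: the posterior P(w_i|x) = k_i / k, so we pick the class that
    holds the majority among the k nearest training samples."""
    dists = np.sum((X_train - x) ** 2, axis=1)    # squared Euclidean distances
    nearest = y_train[np.argsort(dists)[:k]]      # labels of the k nearest
    classes, counts = np.unique(nearest, return_counts=True)
    return classes[np.argmax(counts)]

# Tiny hypothetical dataset: two well-separated classes
X = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1])
print(knn_classify(np.array([0.5, 0.5]), X, y, k=3))  # 0
```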
• Example
  – Given the three classes, assign a class label to the unknown example x_u
  – Assume the Euclidean distance and k = 5 neighbors
  – Of the 5 closest neighbors, 4 belong to ω1 and 1 belongs to ω3, so x_u is assigned to ω1, the predominant class

  [Figure: scatter plot of classes ω1, ω2, ω3 with the unknown example x_u and its 5 nearest neighbors]
  [Figure: decision boundaries produced by 1-NN, 5-NN, and 20-NN classifiers]
• Advantages
  – Simple implementation
  – Nearly optimal in the large-sample limit (N→∞)
    • P[error]_Bayes < P[error]_1NN < 2 P[error]_Bayes
  – Uses local information, which can yield highly adaptive behavior
  – Lends itself very easily to parallel implementations
• Disadvantages
  – Large storage requirements
  – Computationally intensive recall
  – Highly susceptible to the curse of dimensionality
Dimensionality reduction

Why do dimensionality reduction?
• The so-called "curse of dimensionality"
  – Exponential growth in the number of examples required to accurately estimate a function
• Exploratory data analysis
  – Visualizing the structure of the data in a low-dimensional subspace
• Two approaches to perform dimensionality reduction
  – Feature selection: choose a subset of all the features

    $$[x_1\ x_2\ \dots\ x_N] \longrightarrow [x_{i_1}\ x_{i_2}\ \dots\ x_{i_M}]$$

  – Feature extraction: create new features by combining the existing ones

    $$[x_1\ x_2\ \dots\ x_N] \longrightarrow [y_1\ y_2\ \dots\ y_M] = f\!\left([x_1\ x_2\ \dots\ x_N]\right)$$

  – Feature extraction is typically a linear transform (a sketch follows)

    $$\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_M \end{bmatrix} = \begin{bmatrix} w_{11} & w_{12} & \cdots & w_{1N} \\ w_{21} & w_{22} & \cdots & w_{2N} \\ \vdots & \vdots & \ddots & \vdots \\ w_{M1} & w_{M2} & \cdots & w_{MN} \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_N \end{bmatrix}$$
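A short numpy illustration of the linear transform y = Wx; the matrix W here is random, purely to show the shapes involved:

```python
import numpy as np

# Linear feature extraction y = W x: an M x N matrix maps the
# N-dimensional feature vector to M < N new features.
N, M = 5, 2
rng = np.random.default_rng(0)
W = rng.standard_normal((M, N))   # rows are the M projection directions
x = rng.standard_normal(N)        # an N-dimensional sample
y = W @ x                         # extracted feature vector
print(y.shape)                    # (2,)
```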
Representation vs. classification

  [Figure: scatter plot of two classes (labeled 1 and 2) on axes Feature 1 and Feature 2, illustrating that a projection that is good for representation is not necessarily good for classification]
PCA
• Solution
  – Project the data onto the eigenvectors corresponding to the largest eigenvalues of the covariance matrix (see the sketch below)
  – PCA finds orthogonal directions of largest variance
• Properties
  – If the data is Gaussian, PCA finds independent axes
  – Otherwise, it simply de-correlates the axes
• Limitation
  – Directions of high variance do not necessarily contain discriminatory information
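A minimal PCA sketch following the recipe above (eigenvectors of the covariance matrix, sorted by eigenvalue); the data here are synthetic:

```python
import numpy as np

def pca(X, m):
    """Project X (n_samples x n_features) onto the m eigenvectors of the
    covariance matrix with the largest eigenvalues."""
    Xc = X - X.mean(axis=0)                # center the data
    cov = np.cov(Xc, rowvar=False)         # covariance matrix
    evals, evecs = np.linalg.eigh(cov)     # eigh returns ascending eigenvalues
    order = np.argsort(evals)[::-1][:m]    # top-m directions of variance
    return Xc @ evecs[:, order]

# Hypothetical correlated 3-D data reduced to 2-D
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 3)) @ np.array([[3, 0, 0], [1, 1, 0], [0, 0, 0.1]])
print(pca(X, 2).shape)  # (100, 2)
```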
LDA
• Define scatter matrices
  – Within-class

    $$S_W = \sum_{i=1}^{C} S_i = \sum_{i=1}^{C} \sum_{x \in \omega_i} (x-\mu_i)(x-\mu_i)^T$$

  – Between-class

    $$S_B = \sum_{i=1}^{C} N_i (\mu_i-\mu)(\mu_i-\mu)^T$$

  – Then maximize the ratio

    $$J(W) = \frac{|W^T S_B W|}{|W^T S_W W|}$$

  [Figure: three classes in the x1–x2 plane with class means μ1, μ2, μ3, overall mean μ, within-class scatters S_W1, S_W2, S_W3, and between-class scatters S_B1, S_B2, S_B3]
• Solution
  – The optimal projections are the eigenvectors corresponding to the largest eigenvalues of the generalized eigenvalue problem (see the sketch after the limitations below)

    $$(S_B - \lambda_i S_W)\, w_i = 0$$

• NOTE
  – S_B is the sum of C matrices of rank one or less, and the mean vectors are constrained by μ = (1/N) Σᵢ Nᵢ μᵢ
  – Therefore, S_B will be at most of rank (C−1), and LDA produces at most C−1 feature projections
• Limitations
  – Overfitting
  – Information not in the mean of the data
  – Classes that are significantly non-Gaussian
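A minimal sketch of LDA as described above: build S_W and S_B, solve the generalized eigenproblem, and keep at most C−1 projections. The dataset is synthetic, and scipy.linalg.eigh is used as the generalized symmetric eigensolver:

```python
import numpy as np
from scipy.linalg import eigh

def lda_projections(X, y, m):
    """Solve the generalized eigenproblem S_B w = lambda S_W w and return
    the m eigenvectors with the largest eigenvalues (m <= C-1)."""
    mu = X.mean(axis=0)
    d = X.shape[1]
    Sw = np.zeros((d, d))
    Sb = np.zeros((d, d))
    for c in np.unique(y):
        Xc = X[y == c]
        mu_c = Xc.mean(axis=0)
        Sw += (Xc - mu_c).T @ (Xc - mu_c)                # within-class scatter
        Sb += len(Xc) * np.outer(mu_c - mu, mu_c - mu)   # between-class scatter
    evals, evecs = eigh(Sb, Sw)          # generalized eigenproblem (ascending)
    order = np.argsort(evals)[::-1][:m]  # keep the m largest eigenvalues
    return evecs[:, order]

# Hypothetical 3-class data: LDA can produce at most C-1 = 2 projections
rng = np.random.default_rng(0)
X = np.vstack([rng.standard_normal((50, 4)) + mu
               for mu in ([0, 0, 0, 0], [3, 0, 0, 0], [0, 3, 0, 0])])
y = np.repeat([0, 1, 2], 50)
print(lda_projections(X, y, 2).shape)  # (4, 2)
```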
  [Figure: a 10-class dataset projected onto its first three PCA axes and its first three LDA axes (axis 1, axis 2, axis 3); the classes appear much better separated in the LDA projection than in the PCA projection]
• LDA and overfitting
  – Generate an artificial dataset
    • Three classes, 50 examples per class, with the exact same likelihood: a multivariate Gaussian with zero mean and identity covariance

  [Figure: LDA projections of the artificial dataset for 10, 50, 100, and 150 dimensions; since the three classes are identically distributed, the increasingly clean separation at higher dimensionality is pure overfitting]