Independent Component Analysis
Lecturer: 虞台文
Content
– What is ICA?
– Nongaussianity Measurement — Kurtosis
– ICA by Maximization of Nongaussianity
– Gradient and FastICA Algorithms Using Kurtosis
– Measuring Nongaussianity by Negentropy
– FastICA Using Negentropy
Independent Component Analysis
What is ICA?
Motivation
Example: three people are speaking simultaneously in a room that has three microphones.
Denote the microphone signals by x1(t), x2(t), and x3(t).
They are mixtures of sources s1(t), s2(t), and s3(t).
The goal is to estimate the original speech signals using only the recorded signals.
This is called the cocktail-party problem.
$$\begin{pmatrix} x_1(t) \\ x_2(t) \\ x_3(t) \end{pmatrix} = \begin{pmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{pmatrix} \begin{pmatrix} s_1(t) \\ s_2(t) \\ s_3(t) \end{pmatrix}$$
The Cocktail-Party Problem
(Figure: the original speech signals and the mixed speech signals.)
The Cocktail-Party Problem
(Figure: the original speech signals and the estimated sources.)
The Problem
$$x_1(t) = a_{11}s_1(t) + a_{12}s_2(t) + a_{13}s_3(t)$$
$$x_2(t) = a_{21}s_1(t) + a_{22}s_2(t) + a_{23}s_3(t)$$
$$x_3(t) = a_{31}s_1(t) + a_{32}s_2(t) + a_{33}s_3(t)$$
$$\mathbf{x} = \mathbf{A}\mathbf{s}$$
Find the sources s1(t), s2(t), and s3(t), and the coefficients aij, from the observed signals x1(t), x2(t), and x3(t).
It turns out that the problem can be solved just by assuming that the sources si(t) are nongaussian and statistically independent.
$$s_1(t) = b_{11}x_1(t) + b_{12}x_2(t) + b_{13}x_3(t)$$
$$s_2(t) = b_{21}x_1(t) + b_{22}x_2(t) + b_{23}x_3(t)$$
$$s_3(t) = b_{31}x_1(t) + b_{32}x_2(t) + b_{33}x_3(t)$$
$$\mathbf{s} = \mathbf{A}^{-1}\mathbf{x} = \mathbf{B}\mathbf{x}$$
Applications
– Cocktail party problem: separation of voices, music, or sounds
– Sensor array processing, e.g., radar
– Biomedical signal processing with multiple sensors: EEG, ECG, MEG, fMRI
– Telecommunications: e.g., multiuser detection in CDMA
– Financial and other time series
– Noise removal from signals and images
– Feature extraction for images and signals
– Brain modelling
Basic ICA Model
$$x_i(t) = a_{i1}s_1(t) + a_{i2}s_2(t) + \cdots + a_{in}s_n(t), \qquad i = 1, 2, \ldots, n$$
The $x_i(t)$ are the mixed signals (observable); the $s_i(t)$ are the latent variables (the independent components).
(Figure: the joint density p(x1, x2) of two mixtures and the marginal densities p(x1), p(x2).)
$$\mathbf{x} = \mathbf{A}\mathbf{s}$$
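As a quick illustration of this mixing model, here is a minimal numpy sketch (the source distributions, the matrix entries, and all names are illustrative assumptions, not taken from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 10_000

# Two zero-mean, unit-variance, nongaussian sources (latent variables):
# a supergaussian (Laplacian) one and a subgaussian (uniform) one.
s1 = rng.laplace(scale=1.0 / np.sqrt(2), size=n_samples)    # Var = 2 * scale^2 = 1
s2 = rng.uniform(-np.sqrt(3), np.sqrt(3), size=n_samples)   # Var = (2a)^2 / 12 = 1
S = np.vstack([s1, s2])                                      # shape (2, n_samples)

# An arbitrary square mixing matrix A (illustrative values).
A = np.array([[1.0, 0.5],
              [0.3, 2.0]])

X = A @ S   # observed mixtures x = A s, one column per time instant
print(X.shape)  # (2, 10000)
```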
The Basic Assumptions
The independent components are assumed statistically independent.
The independent components must have nongaussian distributions.
For simplicity, we assume that the unknown mixing matrix A is square.
Assumption I: Statistical Independence
Basically, random variables y1, y2, …, yn are said to be independent if information on the value of yi does not give any information on the value of yj for i ≠ j.
Mathematically, the joint pdf is factorizable in the following way:
$$p(y_1, y_2, \ldots, y_n) = p_1(y_1)\,p_2(y_2)\cdots p_n(y_n)$$
Note that uncorrelatedness does not necessarily imply independence.
Assumption II: Nongaussian Distributions
Note that in the basic model we do not have to know what the nongaussian distributions of the ICs look like.
Assumption III: The Mixing Matrix Is Square
In other words, the number of independent components is equal to the number of observed mixtures.
– This simplifies our discussion in the first stage.
However, in the basic ICA model this is not a real restriction, as long as the number of observations xi is at least as large as the number of sources sj.
Ambiguities of ICA
We cannot determine the variances (energies) of IC’s.
– We therefore fix E[si2] = 1 (after centering, so that E[x] = 0); even then, the sign of si cannot be determined.
We cannot determine the order of IC’s.
$$\mathbf{x} = \sum_{i=1}^{n} \mathbf{a}_i s_i$$
Therefore, we assume
$$E[s_i^2] = 1, \qquad E[s_i] = 0$$
$$\mathbf{x} = (\mathbf{A}\mathbf{P}^{-1})(\mathbf{P}\mathbf{s}), \quad \text{where } \mathbf{P} \text{ is any permutation matrix.}$$
Illustration of ICA
$$p(s_i) = \begin{cases} \dfrac{1}{2\sqrt{3}}, & |s_i| \le \sqrt{3} \\ 0, & \text{otherwise} \end{cases}$$
Mixing with
$$\mathbf{A} = \begin{pmatrix} 5 & 10 \\ 10 & 2 \end{pmatrix}, \qquad \mathbf{x} = \mathbf{A}\mathbf{s}$$
Whitening Is Only Half of ICA
Whitening: $\mathbf{z} = \mathbf{V}\mathbf{x}$, where V is the whitening matrix and $\mathbf{x} = \mathbf{A}\mathbf{s}$.
Whitening Is Only Half of ICA
$$\mathbf{z} = \mathbf{V}\mathbf{x}$$
(Figure: the marginal densities p(zi) after whitening.)
By whitening, we have $E[\mathbf{z}\mathbf{z}^T] = \mathbf{I}$.
This, however, doesn't imply that the zi's are independent, i.e., we may have
$$p(z_1, z_2, \ldots, z_n) \neq \prod_{i=1}^{n} p_i(z_i)$$
Uncorrelatedness is related to independence, but is weaker than independence.
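A minimal whitening sketch (numpy; illustrative names, assuming a data matrix X with one column per sample, e.g., the mixtures generated above):

```python
import numpy as np

def whiten(X):
    """Whiten zero-mean data X (rows = variables, columns = samples): z = V x with E[zz^T] = I."""
    X = X - X.mean(axis=1, keepdims=True)       # centering: E[x] = 0
    C = np.cov(X)                                # covariance matrix E[xx^T]
    d, E = np.linalg.eigh(C)                     # C = E diag(d) E^T
    V = E @ np.diag(1.0 / np.sqrt(d)) @ E.T      # whitening matrix V = E D^{-1/2} E^T
    Z = V @ X
    return Z, V

# Example: Z is uncorrelated with unit variance, but its components need not be independent.
# Z, V = whiten(X)
# print(np.cov(Z))   # approximately the identity matrix
```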
Independent Component Analysis
z = Vx. The central limit theorem implicitly tells us that adding (mixing) components makes the distribution 'more' Gaussian. Therefore, nongaussianity is an important criterion for ICA.
Degaussianization is hence the central theme in ICA.
Independent Component Analysis
Nongaussianity Measurement — Kurtosis
Moments
The jth moment:
$$\alpha_j = E[X^j] = \int x^j p(x)\,dx$$
The jth central moment:
$$\beta_j = E[(X - m_x)^j] = \int (x - m_x)^j p(x)\,dx$$
Mean: $m_x = \alpha_1 = E[X]$
Variance: $\sigma_x^2 = \beta_2 = E[(X - m_x)^2]$
Skewness: $\mathrm{skew}(X) = \beta_3 = E[(X - m_x)^3]$
Moment Generating Function
The moment generating function MX(t) of a random variable X is defined by:
$$M_X(t) = E[e^{tX}] = \int e^{tx} p(x)\,dx$$
For $X \sim N(\mu, \sigma^2)$: $\quad M_X(t) = e^{\mu t}\,e^{\sigma^2 t^2/2}$
For $Z \sim N(0, 1)$: $\quad M_Z(t) = e^{t^2/2}$
Expanding the exponential,
$$M_X(t) = E[e^{tX}] = 1 + E[X]\frac{t}{1!} + E[X^2]\frac{t^2}{2!} + E[X^3]\frac{t^3}{3!} + \cdots$$
Standard Normal Distribution N(0, 1)
$$M_Z(t) = e^{t^2/2} = 1 + \frac{t^2}{2\cdot 1!} + \frac{t^4}{2^2\cdot 2!} + \frac{t^6}{2^3\cdot 3!} + \cdots$$
Comparing with the expansion $M_Z(t) = 1 + E[Z]\frac{t}{1!} + E[Z^2]\frac{t^2}{2!} + E[Z^3]\frac{t^3}{3!} + \cdots$ gives
$$E[Z^{2k+1}] = 0 \ \text{(zero for all odd moments)}, \qquad E[Z^{2k}] = \frac{(2k)!}{2^k\,k!}$$
In particular, $E[Z^2] = 1$ and $E[Z^4] = 3$.
Kurtosis
Kurtosis of a zero-mean random variable X is defined by
$$\mathrm{kurt}(X) = E[X^4] - 3(E[X^2])^2$$
Normalized kurtosis:
$$\widetilde{\mathrm{kurt}}(X) = \frac{E[X^4]}{(E[X^2])^2} - 3$$
For a Gaussian variable, $\mathrm{kurt}(Z) = 0$.
Gaussianity
Gaussian: e.g., $p(x) = \frac{1}{\sqrt{2\pi}}\,e^{-x^2/2}$
Supergaussian: e.g., $p(x) = \frac{\lambda}{2}\,e^{-\lambda|x|}$ (Laplacian)
Subgaussian: e.g., $p(x) = \frac{1}{2a}, \ x \in [-a, a]$ (uniform)
Kurtosis for Supergaussian
Consider the Laplacian distribution:
$$p(x) = \frac{\lambda}{2}\,e^{-\lambda|x|}$$
By symmetry, $E[X] = 0$.
$$E[X^2] = \frac{\lambda}{2}\int_{-\infty}^{\infty} x^2 e^{-\lambda|x|}\,dx = \lambda\int_0^{\infty} x^2 e^{-\lambda x}\,dx = \frac{2}{\lambda^2}$$
$$E[X^4] = \frac{\lambda}{2}\int_{-\infty}^{\infty} x^4 e^{-\lambda|x|}\,dx = \lambda\int_0^{\infty} x^4 e^{-\lambda x}\,dx = \frac{24}{\lambda^4}$$
$$\mathrm{kurt}(X) = E[X^4] - 3(E[X^2])^2 = \frac{24}{\lambda^4} - \frac{12}{\lambda^4} = \frac{12}{\lambda^4} > 0, \qquad \widetilde{\mathrm{kurt}}(X) = \frac{24/\lambda^4}{(2/\lambda^2)^2} - 3 = 3 > 0$$
Kurtosis for Subgaussian
Consider the uniform distribution: $p(x) = \frac{1}{2a}, \ x \in [-a, a]$
By symmetry, $E[X] = 0$.
$$E[X^2] = \frac{1}{2a}\int_{-a}^{a} x^2\,dx = \frac{a^2}{3}, \qquad E[X^4] = \frac{1}{2a}\int_{-a}^{a} x^4\,dx = \frac{a^4}{5}$$
$$\mathrm{kurt}(X) = E[X^4] - 3(E[X^2])^2 = \frac{a^4}{5} - \frac{a^4}{3} = -\frac{2a^4}{15} < 0, \qquad \widetilde{\mathrm{kurt}}(X) = \frac{9}{5} - 3 = -\frac{6}{5} < 0$$
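These two results can be checked with sample moments; a small sketch (the estimator and names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

def kurt(x):
    """Sample kurtosis of zero-mean data: E[x^4] - 3 (E[x^2])^2."""
    x = x - x.mean()
    return np.mean(x**4) - 3.0 * np.mean(x**2)**2

lap = rng.laplace(scale=1.0, size=n)                  # Laplacian with lambda = 1 -> kurt = 12
uni = rng.uniform(-np.sqrt(3), np.sqrt(3), size=n)    # uniform with a = sqrt(3)   -> kurt = -2a^4/15 = -1.2
gau = rng.normal(size=n)                              # Gaussian                   -> kurt = 0

print(kurt(lap))   # approximately 12   (supergaussian, positive)
print(kurt(uni))   # approximately -1.2 (subgaussian, negative)
print(kurt(gau))   # approximately 0    (Gaussian)
```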
Nongaussianity Measurement By Kurtosis
Kurtosis, or rather its absolute value, has been widely used as a measure of nongaussianity in ICA and related fields.
Computationally, kurtosis can be estimated simply by using the 4th moment of the sample data (if the variance is kept constant).
$$\mathrm{kurt}(X) = E[X^4] - 3(E[X^2])^2$$
Properties of Kurtosis
Let X1 and X2 be two independent random variables, both with zero mean. Then
$$\mathrm{kurt}(X_1 + X_2) = \mathrm{kurt}(X_1) + \mathrm{kurt}(X_2)$$
$$\mathrm{kurt}(\alpha X_1) = \alpha^4\,\mathrm{kurt}(X_1)$$
$$\mathrm{kurt}(X) = E[X^4] - 3(E[X^2])^2$$
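A quick numerical sanity check of both properties (illustrative sketch):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500_000

def kurt(x):
    x = x - x.mean()
    return np.mean(x**4) - 3.0 * np.mean(x**2)**2

x1 = rng.laplace(size=n)          # independent, zero-mean variables
x2 = rng.uniform(-1, 1, size=n)
alpha = 2.0

print(kurt(x1 + x2), kurt(x1) + kurt(x2))      # approximately equal (additivity)
print(kurt(alpha * x1), alpha**4 * kurt(x1))   # approximately equal (scaling by alpha^4)
```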
Independent Component Analysis
ICA By Maximization of Nongaussianity
Restate the Problem
$$\mathbf{x} = \mathbf{A}\mathbf{s}$$
x: zero-mean observations (observable); A: mixing matrix (unknown); s: zero-mean, unit-variance ICs (latent, unknown).
Ultimate goal: recover $\mathbf{s} = \mathbf{A}^{-1}\mathbf{x}$.
How?
Simplification
$$\mathbf{x} = \mathbf{A}\mathbf{s}, \qquad \text{ultimate goal: } \mathbf{s} = \mathbf{A}^{-1}\mathbf{x} \ \text{(via whitening)}$$
For simplicity, we assume the sources are i.i.d.
To estimate one independent component, form
$$y = \mathbf{b}^T\mathbf{x} = \mathbf{b}^T\mathbf{A}\mathbf{s} = \mathbf{q}^T\mathbf{s}$$
If b is properly identified, $\mathbf{q}^T = \mathbf{b}^T\mathbf{A}$ contains only one nonzero entry, with value one.
This implies that b will be one row of $\mathbf{A}^{-1}$.
Nongaussian Is Independent
As above, $y = \mathbf{b}^T\mathbf{x} = \mathbf{b}^T\mathbf{A}\mathbf{s} = \mathbf{q}^T\mathbf{s}$.
We will take the b that maximizes the nongaussianity of $\mathbf{b}^T\mathbf{x}$.
Nongaussian Is Independent
$$y = \mathbf{b}^T\mathbf{x} = \mathbf{b}^T\mathbf{A}\mathbf{s} = \mathbf{q}^T\mathbf{s}$$
(Figure: scatter plots of the sources (s1, s2) and of the mixtures x = As obtained with the mixing matrix A of the earlier illustration.)
Nongaussian Is Independent
(Figure: scatter plot of the whitened data z = Vx.)
Nongaussian Is Independent
(Figure: the marginal densities p(zi) of the whitened data z = Vx.)
An additive mixture of components becomes more Gaussian.
Nongaussian Is Independent
$$\mathbf{y} = \mathbf{W}\mathbf{z}, \qquad \mathbf{W} = (\mathbf{w}_1, \mathbf{w}_2)^T$$
(Figure: scatter plot of (y1, y2) after rotating the whitened data z = Vx.)
Nongaussian Is Independent
(Figure: the estimated densities p(yi) of the components of y = Wz.)
Nongaussian Is Independent
Consider obtaining one independent component:
$$y = \mathbf{w}^T\mathbf{z} = \mathbf{w}^T\mathbf{V}\mathbf{A}\mathbf{s} = \mathbf{q}^T\mathbf{s} \qquad (\mathbf{b}^T = \mathbf{w}^T\mathbf{V}, \ \text{so } y = \mathbf{b}^T\mathbf{x})$$
$$\|\mathbf{q}\|^2 = \mathbf{w}^T(\mathbf{V}\mathbf{A})(\mathbf{V}\mathbf{A})^T\mathbf{w} = \|\mathbf{w}\|^2 = 1$$
Nongaussian Is Independent
Project the whitened data onto a unit vector w, $y = \mathbf{w}^T\mathbf{z}$ with $\|\mathbf{w}\| = 1$, to get one independent component.
Nongaussian Is Independent
Write $y = q_1 s_1 + q_2 s_2$. Then
$$\mathrm{kurt}(y) = \mathrm{kurt}(q_1 s_1) + \mathrm{kurt}(q_2 s_2) = q_1^4\,\mathrm{kurt}(s_1) + q_2^4\,\mathrm{kurt}(s_2)$$
We require that
$$E[y^2] = q_1^2\,\mathrm{Var}(s_1) + q_2^2\,\mathrm{Var}(s_2) = 1$$
so the search space is the unit circle $q_1^2 + q_2^2 = 1$ in the $(q_1, q_2)$ plane.
Using kurtosis as the nongaussianity measure (e.g., with $\mathrm{kurt}(s_1) = \mathrm{kurt}(s_2) = c$, we maximize $|c|\,(q_1^4 + q_2^4)$), the maxima on this circle occur where exactly one of $q_1, q_2$ equals $\pm 1$, i.e., where y equals one of the independent components up to sign.
Independent Component Analysis
Gradient Algorithm Using Kurtosis
Criterion for ICA Using Kurtosis
maximize $|\mathrm{kurt}(\mathbf{w}^T\mathbf{z})|$ subject to $\|\mathbf{w}\|^2 = 1$,
where $\mathrm{kurt}(X) = E[X^4] - 3(E[X^2])^2$.
Since z is white, $E[(\mathbf{w}^T\mathbf{z})^2] = \|\mathbf{w}\|^2$, so
$$|\mathrm{kurt}(\mathbf{w}^T\mathbf{z})| = \left|E[(\mathbf{w}^T\mathbf{z})^4] - 3\|\mathbf{w}\|^4\right|$$
and its gradient is
$$\frac{\partial}{\partial\mathbf{w}}|\mathrm{kurt}(\mathbf{w}^T\mathbf{z})| = 4\,\mathrm{sign}(\mathrm{kurt}(\mathbf{w}^T\mathbf{z}))\left\{E[\mathbf{z}(\mathbf{w}^T\mathbf{z})^3] - 3\mathbf{w}\|\mathbf{w}\|^2\right\}$$
Gradient Algorithm
$$\frac{\partial}{\partial\mathbf{w}}|\mathrm{kurt}(\mathbf{w}^T\mathbf{z})| = 4\,\mathrm{sign}(\mathrm{kurt}(\mathbf{w}^T\mathbf{z}))\left\{E[\mathbf{z}(\mathbf{w}^T\mathbf{z})^3] - 3\mathbf{w}\|\mathbf{w}\|^2\right\}$$
The term $3\mathbf{w}\|\mathbf{w}\|^2$ only changes the norm of w, not its direction, so it is unrelated to the search and can be dropped. This gives the gradient algorithm:
$$\Delta\mathbf{w} \propto \mathrm{sign}(\mathrm{kurt}(\mathbf{w}^T\mathbf{z}))\,E[\mathbf{z}(\mathbf{w}^T\mathbf{z})^3], \qquad \mathbf{w} \leftarrow \mathbf{w}/\|\mathbf{w}\|$$
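A batch-mode sketch of this gradient rule (numpy; the step size, iteration count, and names are assumptions):

```python
import numpy as np

def ica_kurtosis_gradient(Z, n_iter=200, mu=0.1, seed=0):
    """Estimate one IC direction w from whitened data Z (rows = variables)
    by gradient ascent on |kurt(w^T z)|."""
    rng = np.random.default_rng(seed)
    w = rng.normal(size=Z.shape[0])
    w /= np.linalg.norm(w)
    for _ in range(n_iter):
        y = w @ Z                                      # projections w^T z
        k = np.mean(y**4) - 3.0                        # kurt(w^T z); unit variance since Z is white
        grad = np.sign(k) * (Z * y**3).mean(axis=1)    # sign(kurt) * E[z (w^T z)^3]
        w = w + mu * grad
        w /= np.linalg.norm(w)                         # project back onto the unit sphere
    return w
```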
FastICA Algorithm
maximize $|\mathrm{kurt}(\mathbf{w}^T\mathbf{z})|$ subject to $\|\mathbf{w}\|^2 = 1$.
At a stable point, the gradient must point in the direction of w. Using fixed-point iteration,
$$\mathbf{w} \propto E[\mathbf{z}(\mathbf{w}^T\mathbf{z})^3] - 3\|\mathbf{w}\|^2\mathbf{w}$$
(the sign is not important). This gives the FastICA update:
$$\mathbf{w} \leftarrow E[\mathbf{z}(\mathbf{w}^T\mathbf{z})^3] - 3\mathbf{w}, \qquad \mathbf{w} \leftarrow \mathbf{w}/\|\mathbf{w}\|$$
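A fixed-point sketch of the same estimate (the convergence test and names are assumptions):

```python
import numpy as np

def fastica_kurtosis(Z, n_iter=100, tol=1e-8, seed=0):
    """One-unit FastICA on whitened data Z using the kurtosis nonlinearity:
    w <- E[z (w^T z)^3] - 3 w, then renormalize."""
    rng = np.random.default_rng(seed)
    w = rng.normal(size=Z.shape[0])
    w /= np.linalg.norm(w)
    for _ in range(n_iter):
        y = w @ Z
        w_new = (Z * y**3).mean(axis=1) - 3.0 * w
        w_new /= np.linalg.norm(w_new)
        if 1.0 - abs(w_new @ w) < tol:   # converged when the direction stops changing (up to sign)
            return w_new
        w = w_new
    return w
```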
Independent Component Analysis
Measuring Nongaussianity by Negentropy
Critique of Kurtosis
Kurtosis can be very sensitive to outliers.
– Kurtosis may depend on only a few observations in the tails of the distribution.
It is therefore not a robust measure of nongaussianity.
$$\mathrm{kurt}(X) = E[X^4] - 3(E[X^2])^2$$
Negentropy
Differential entropy:
$$H(\mathbf{X}) = -\int p_{\mathbf{X}}(\mathbf{x})\log p_{\mathbf{X}}(\mathbf{x})\,d\mathbf{x}$$
Negentropy:
$$J(\mathbf{X}) = H(\mathbf{X}_{\mathrm{gauss}}) - H(\mathbf{X}) \ge 0$$
Negentropy is zero only when the random variable is Gaussian distributed. Here $\mathbf{X}_{\mathrm{gauss}}$ is a Gaussian random vector with the same covariance matrix $\mathbf{\Sigma}$ as $\mathbf{X}$, and
$$H(\mathbf{X}_{\mathrm{gauss}}) = \frac{1}{2}\log\det\mathbf{\Sigma} + \frac{n}{2}\left[1 + \log 2\pi\right]$$
Negentropy is invariant under invertible linear transformations.
Approximation of Negentropy (I)
For a zero-mean, unit-variance random variable,
$$J(X) \approx \frac{\kappa_3(X)^2}{12} + \frac{\kappa_4(X)^2}{48}$$
where $\kappa_3(X) = E[X^3]$ is the skewness and $\kappa_4(X) = E[X^4] - 3$ is the kurtosis.
This approximation, however, does not help much, because it is still sensitive to outliers.
Approximation of Negentropy (II)
$$J(X) \approx k_1\left(E[G_1(X)]\right)^2 + k_2\left(E[G_2(X)] - E[G_2(Z)]\right)^2$$
Choose two nonpolynomial functions: $G_1(x)$ odd and $G_2(x)$ even. The first term measures asymmetry; it is zero if the underlying density is even (symmetric). The second term measures bimodality versus a peak at zero. Here
$$E[G_2(Z)] = \frac{1}{\sqrt{2\pi}}\int G_2(z)\exp[-z^2/2]\,dz$$
Usually, only the second term is used.
Approximation of Negentropy (II)
If only an even nonpolynomial function, say G, is used, we have
$$J(X) \propto \left(E[G(X)] - E[G(Z)]\right)^2$$
The following two functions are useful:
$$G_1(x) = \frac{1}{a_1}\log\cosh(a_1 x), \quad 1 \le a_1 \le 2$$
$$G_2(x) = -\exp[-x^2/2]$$
(For comparison, $G_3(x) = x^4$ corresponds to the kurtosis-based measure.)
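A small sketch of this one-term approximation using G1 (the proportionality constant is dropped; the choice a1 = 1 and the Monte Carlo estimate of E[G(Z)] are illustrative assumptions):

```python
import numpy as np

def G1(x, a1=1.0):
    return np.log(np.cosh(a1 * x)) / a1

def negentropy_approx(x, G=G1, n_gauss=1_000_000, seed=0):
    """Approximate negentropy of zero-mean, unit-variance samples x:
    J(x) ~ (E[G(x)] - E[G(Z)])^2 with Z standard normal."""
    rng = np.random.default_rng(seed)
    Ez = G(rng.normal(size=n_gauss)).mean()     # E[G(Z)] estimated by Monte Carlo
    return (G(x).mean() - Ez) ** 2

rng = np.random.default_rng(1)
print(negentropy_approx(rng.normal(size=100_000)))                        # approximately 0 for Gaussian data
print(negentropy_approx(rng.laplace(scale=1/np.sqrt(2), size=100_000)))   # clearly > 0 for nongaussian data
```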
Degaussian
$$J(X) \propto \left(E[G(X)] - E[G(Z)]\right)^2$$
For ICA, we want to maximize this quantity. Specifically, let z = Vx be the whitened data. For one-unit ICA, we want to find a weight vector, say w, to
maximize $\ J(\mathbf{w}^T\mathbf{z}) = k\left(E[G(\mathbf{w}^T\mathbf{z})] - E[G(Z)]\right)^2$
subject to $\ \|\mathbf{w}\|^2 = 1$.
Gradient Algorithm
maximize $J(\mathbf{w}^T\mathbf{z}) = k\left(E[G(\mathbf{w}^T\mathbf{z})] - E[G(Z)]\right)^2$ subject to $\|\mathbf{w}\|^2 = 1$.
Fact: $E[(\mathbf{w}^T\mathbf{z})^2] = 1$.
The gradient is
$$\frac{\partial J(\mathbf{w}^T\mathbf{z})}{\partial\mathbf{w}} = \gamma\,E[\mathbf{z}\,g(\mathbf{w}^T\mathbf{z})], \qquad \gamma = 2k\left\{E[G(\mathbf{w}^T\mathbf{z})] - E[G(Z)]\right\}$$
where $g = G'$ and $\gamma$ is treated as a constant.
Algorithm (batch mode):
$$\Delta\mathbf{w} \propto \gamma\,E[\mathbf{z}\,g(\mathbf{w}^T\mathbf{z})], \qquad \mathbf{w} \leftarrow \mathbf{w}/\|\mathbf{w}\|$$
On-line mode:
$$\Delta\mathbf{w} \propto \gamma\,\mathbf{z}\,g(\mathbf{w}^T\mathbf{z}), \qquad \mathbf{w} \leftarrow \mathbf{w}/\|\mathbf{w}\|$$
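A batch-mode sketch of this rule with g(u) = tanh(u), i.e., G = G1 with a1 = 1 (the step size, the Monte Carlo estimate of E[G(Z)], and the names are assumptions):

```python
import numpy as np

def ica_negentropy_gradient(Z, n_iter=300, mu=0.1, seed=0):
    """One-unit gradient algorithm on whitened data Z with g(u) = tanh(u)."""
    rng = np.random.default_rng(seed)
    w = rng.normal(size=Z.shape[0]); w /= np.linalg.norm(w)
    Ez = np.mean(np.log(np.cosh(rng.normal(size=100_000))))    # E[G(Z)] for the standard normal
    for _ in range(n_iter):
        y = w @ Z
        gamma = np.mean(np.log(np.cosh(y))) - Ez                # its sign controls the search direction
        w = w + mu * gamma * (Z * np.tanh(y)).mean(axis=1)      # delta w proportional to gamma E[z g(w^T z)]
        w /= np.linalg.norm(w)
    return w
```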
Analysis
maximize $J(\mathbf{w}^T\mathbf{z}) = k\left(E[G(\mathbf{w}^T\mathbf{z})] - E[G(Z)]\right)^2$.
Consider the term inside the braces:
$$f(X) = E[G(X)] - E[G(Z)]$$
For the functions G1 and G2 above, f has the following property:
f(X) < 0 if X is supergaussian;
f(X) > 0 if X is subgaussian.
(For G3(x) = x4 the signs are reversed, since f(X) then equals the kurtosis of X.)
Analysis
Therefore:
– Minimize E[G(wTz)] if the IC is supergaussian.
– Maximize E[G(wTz)] if the IC is subgaussian.
Analysis
The corresponding derivatives g = G' are
$$g_1(x) = \tanh(a_1 x), \qquad g_2(x) = x\exp[-x^2/2], \qquad g_3(x) = 4x^3$$
Both g1 and g2 are far less sensitive to outliers than g3.
Analysis
In the update rules
$$\Delta\mathbf{w} \propto \gamma\,E[\mathbf{z}\,g(\mathbf{w}^T\mathbf{z})] \ \text{(batch mode)}, \qquad \Delta\mathbf{w} \propto \gamma\,\mathbf{z}\,g(\mathbf{w}^T\mathbf{z}) \ \text{(on-line mode)}, \qquad \mathbf{w} \leftarrow \mathbf{w}/\|\mathbf{w}\|$$
the factor $\gamma = 2k\{E[G(\mathbf{w}^T\mathbf{z})] - E[G(Z)]\}$ controls the search direction; its sign depends on the super- or subgaussianity of the samples. The nonlinearity $g(\mathbf{w}^T\mathbf{z})$ weights the samples.
Stability Analysis
Assume that the input data follow the ICA model with whitened data z = VAs, and that G is a sufficiently smooth even function. Then the local maxima (resp. minima) of E[G(wTz)] under the constraint ||w|| = 1 include those rows of the inverse of the mixing matrix VA for which the corresponding independent component si satisfies
$$E[s_i\,g(s_i) - g'(s_i)] > 0 \quad (\text{resp.} < 0)$$
Stability Analysis
In terms of the negentropy criterion, the corresponding stability condition can be written as
$$E[s_i\,g(s_i) - g'(s_i)]\left\{E[G(s_i)] - E[G(Z)]\right\} > 0$$
This condition is, in general, true for reasonable choices of G.
Independent Component Analysis
FastICA Using Negentropy
Clue From Gradient Algorithm
The gradient algorithm's updates
$$\Delta\mathbf{w} \propto \gamma\,E[\mathbf{z}\,g(\mathbf{w}^T\mathbf{z})] \ \text{(batch mode)} \quad\text{or}\quad \Delta\mathbf{w} \propto \gamma\,\mathbf{z}\,g(\mathbf{w}^T\mathbf{z}) \ \text{(on-line mode)}, \qquad \mathbf{w} \leftarrow \mathbf{w}/\|\mathbf{w}\|$$
suggest the fixed-point iteration
$$\mathbf{w} \leftarrow E[\mathbf{z}\,g(\mathbf{w}^T\mathbf{z})], \qquad \mathbf{w} \leftarrow \mathbf{w}/\|\mathbf{w}\|$$
However, nonpolynomial moments do not have the same nice algebraic properties as kurtosis, so such an iteration scheme is poor.
Newton’s Method
Maximize or minimize $E[G(\mathbf{w}^T\mathbf{z})]$ subject to $\|\mathbf{w}\|^2 = 1$.
Construct the Lagrangian as follows:
$$L(\mathbf{w}) = E[G(\mathbf{w}^T\mathbf{z})] + \frac{\beta}{2}\left(\mathbf{w}^T\mathbf{w} - 1\right)$$
Newton's method finds an extreme point by letting
$$\mathbf{w} \leftarrow \mathbf{w} - \left(\frac{\partial^2 L(\mathbf{w})}{\partial\mathbf{w}\,\partial\mathbf{w}^T}\right)^{-1}\frac{\partial L(\mathbf{w})}{\partial\mathbf{w}}$$
Newton's Method
The gradient and the Hessian of the Lagrangian are
$$\frac{\partial L(\mathbf{w})}{\partial\mathbf{w}} = E[\mathbf{z}\,g(\mathbf{w}^T\mathbf{z})] + \beta\mathbf{w}$$
$$\frac{\partial^2 L(\mathbf{w})}{\partial\mathbf{w}\,\partial\mathbf{w}^T} = E[\mathbf{z}\mathbf{z}^T g'(\mathbf{w}^T\mathbf{z})] + \beta\mathbf{I}$$
so the Newton step is
$$\mathbf{w} \leftarrow \mathbf{w} - \left[E[\mathbf{z}\mathbf{z}^T g'(\mathbf{w}^T\mathbf{z})] + \beta\mathbf{I}\right]^{-1}\left[E[\mathbf{z}\,g(\mathbf{w}^T\mathbf{z})] + \beta\mathbf{w}\right]$$
Evaluating the Hessian matrix and its inverse is time consuming; we want to approximate it.
Newton's Method
Since the data are whitened ($E[\mathbf{z}\mathbf{z}^T] = \mathbf{I}$), the first term of the Hessian can be approximated by
$$E[\mathbf{z}\mathbf{z}^T g'(\mathbf{w}^T\mathbf{z})] \approx E[\mathbf{z}\mathbf{z}^T]\,E[g'(\mathbf{w}^T\mathbf{z})] = E[g'(\mathbf{w}^T\mathbf{z})]\,\mathbf{I}$$
so that
$$\frac{\partial^2 L(\mathbf{w})}{\partial\mathbf{w}\,\partial\mathbf{w}^T} \approx \left(E[g'(\mathbf{w}^T\mathbf{z})] + \beta\right)\mathbf{I}$$
which is a diagonal matrix and is trivial to invert.
Newton's Method
With this approximation, the Newton step becomes
$$\mathbf{w} \leftarrow \mathbf{w} - \frac{E[\mathbf{z}\,g(\mathbf{w}^T\mathbf{z})] + \beta\mathbf{w}}{E[g'(\mathbf{w}^T\mathbf{z})] + \beta}$$
FastICA
$$\mathbf{w} \leftarrow \mathbf{w} - \frac{E[\mathbf{z}\,g(\mathbf{w}^T\mathbf{z})] + \beta\mathbf{w}}{E[g'(\mathbf{w}^T\mathbf{z})] + \beta}$$
Multiplying both sides by $E[g'(\mathbf{w}^T\mathbf{z})] + \beta$ (the scaling is irrelevant because w is renormalized) gives the algorithm:
$$\mathbf{w} \leftarrow E[\mathbf{z}\,g(\mathbf{w}^T\mathbf{z})] - E[g'(\mathbf{w}^T\mathbf{z})]\,\mathbf{w}$$
$$\mathbf{w} \leftarrow \mathbf{w}/\|\mathbf{w}\|$$
FastICA
$$\mathbf{w} \leftarrow E[\mathbf{z}\,g(\mathbf{w}^T\mathbf{z})] - E[g'(\mathbf{w}^T\mathbf{z})]\,\mathbf{w}, \qquad \mathbf{w} \leftarrow \mathbf{w}/\|\mathbf{w}\|$$
1. Center the data to make the mean zero.
2. Whiten the data to give z.
3. Choose an initial vector w of unit norm.
4. Update $\mathbf{w} \leftarrow E[\mathbf{z}\,g(\mathbf{w}^T\mathbf{z})] - E[g'(\mathbf{w}^T\mathbf{z})]\,\mathbf{w}$.
5. Normalize $\mathbf{w} \leftarrow \mathbf{w}/\|\mathbf{w}\|$.
6. If not converged, go back to step 4.
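A one-unit sketch of these steps with g(u) = tanh(u) and g'(u) = 1 - tanh2(u) (tolerance, iteration limit, and names are illustrative assumptions):

```python
import numpy as np

def fastica_one_unit(Z, n_iter=200, tol=1e-8, seed=0):
    """One-unit FastICA on centered, whitened data Z (rows = variables, columns = samples)."""
    rng = np.random.default_rng(seed)
    w = rng.normal(size=Z.shape[0]); w /= np.linalg.norm(w)    # step 3
    for _ in range(n_iter):
        y = w @ Z
        g, g_prime = np.tanh(y), 1.0 - np.tanh(y)**2
        w_new = (Z * g).mean(axis=1) - g_prime.mean() * w      # step 4: E[z g(w^T z)] - E[g'(w^T z)] w
        w_new /= np.linalg.norm(w_new)                         # step 5
        if 1.0 - abs(w_new @ w) < tol:                         # step 6: converged (up to sign)
            return w_new
        w = w_new
    return w
```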
Typical choices of the contrast function G, its derivative g, and g':
$$G_1(x) = \frac{1}{a_1}\log\cosh(a_1 x), \qquad g_1(x) = \tanh(a_1 x), \qquad g_1'(x) = a_1\left(1 - \tanh^2(a_1 x)\right)$$
$$G_2(x) = -\exp[-x^2/2], \qquad g_2(x) = x\exp[-x^2/2], \qquad g_2'(x) = (1 - x^2)\exp[-x^2/2]$$
$$G_3(x) = x^4, \qquad g_3(x) = 4x^3, \qquad g_3'(x) = 12x^2$$
FastICA
(Figure: plots of the contrast functions G1, G2, G3 and of their derivatives g1, g2, g3.)
Estimating Several IC’s
– Deflation orthogonalization: based on the Gram-Schmidt method.
– Symmetric orthogonalization: adjust all vectors in parallel.
Deflation Orthogonalization
1. Center the data to make the mean zero.
2. Whiten the data to give z.
3. Choose m, the number of ICs to estimate; set the counter p ← 1.
4. Choose an initial vector wp of unit norm, randomly.
5. Update $\mathbf{w}_p \leftarrow E[\mathbf{z}\,g(\mathbf{w}_p^T\mathbf{z})] - E[g'(\mathbf{w}_p^T\mathbf{z})]\,\mathbf{w}_p$.
6. Orthogonalize against the previously found vectors (Gram-Schmidt): $\mathbf{w}_p \leftarrow \mathbf{w}_p - \sum_{j=1}^{p-1}(\mathbf{w}_p^T\mathbf{w}_j)\,\mathbf{w}_j$.
7. Normalize $\mathbf{w}_p \leftarrow \mathbf{w}_p/\|\mathbf{w}_p\|$.
8. If wp has not converged, go back to step 5.
9. Set p ← p + 1; if p ≤ m, go back to step 4.
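A sketch of the deflation scheme built on the one-unit update (illustrative names; it assumes centered, whitened data Z with rows as variables):

```python
import numpy as np

def fastica_deflation(Z, m, n_iter=200, tol=1e-8, seed=0):
    """Estimate m independent components from whitened data Z by deflation (Gram-Schmidt)."""
    rng = np.random.default_rng(seed)
    W = np.zeros((m, Z.shape[0]))
    for p in range(m):
        w = rng.normal(size=Z.shape[0]); w /= np.linalg.norm(w)
        for _ in range(n_iter):
            y = w @ Z
            w_new = (Z * np.tanh(y)).mean(axis=1) - (1 - np.tanh(y)**2).mean() * w
            w_new -= W[:p].T @ (W[:p] @ w_new)      # Gram-Schmidt against the rows found so far
            w_new /= np.linalg.norm(w_new)
            if 1.0 - abs(w_new @ w) < tol:
                break
            w = w_new
        W[p] = w_new
    return W                                         # estimated sources: S_hat = W @ Z
```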
Symmetric Orthogonalization
1. Choose the number of independent components to estimate, say, m.
2. Initialize the wi, i = 1, …, m.
3. Do an iteration of the one-unit algorithm on every wi in parallel.
4. Do a symmetric orthogonalization of the matrix W = (w1, …, wm)T.
5. If not converged, go back to step 3.
Symmetric Orthogonalization
Method 1 (Classic Method):
$$\mathbf{W} \leftarrow (\mathbf{W}\mathbf{W}^T)^{-1/2}\,\mathbf{W}$$
Method 2 (Iteration Method):
1. Let $\mathbf{W} \leftarrow \mathbf{W}/\|\mathbf{W}\|$.
2. Let $\mathbf{W} \leftarrow \frac{3}{2}\mathbf{W} - \frac{1}{2}\mathbf{W}\mathbf{W}^T\mathbf{W}$.
3. If WWT is not close enough to the identity, go back to step 2.
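A sketch of symmetric FastICA using the classic orthogonalization computed via an eigendecomposition (illustrative names; assumes centered, whitened data Z, and m no larger than the data dimension):

```python
import numpy as np

def sym_decorrelate(W):
    """Classic symmetric orthogonalization: W <- (W W^T)^{-1/2} W."""
    d, E = np.linalg.eigh(W @ W.T)
    return E @ np.diag(d ** -0.5) @ E.T @ W

def fastica_symmetric(Z, m, n_iter=200, seed=0):
    """Estimate m ICs from whitened data Z, updating all rows of W in parallel."""
    rng = np.random.default_rng(seed)
    W = sym_decorrelate(rng.normal(size=(m, Z.shape[0])))
    for _ in range(n_iter):
        Y = W @ Z
        gY, gY_prime = np.tanh(Y), 1 - np.tanh(Y)**2
        W_new = (gY @ Z.T) / Z.shape[1] - gY_prime.mean(axis=1)[:, None] * W  # one-unit step per row
        W_new = sym_decorrelate(W_new)
        if np.max(np.abs(np.abs(np.diag(W_new @ W.T)) - 1)) < 1e-8:           # all rows converged (up to sign)
            return W_new
        W = W_new
    return W
```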