Independent Component Analysis
Lecturer: 虞台文
Content
– What is ICA?
– Nongaussianity Measurement — Kurtosis
– ICA by Maximization of Nongaussianity
– Gradient and FastICA Algorithms Using Kurtosis
– Measuring Nongaussianity by Negentropy
– FastICA Using Negentropy
Independent Component Analysis
What is ICA?
Motivation
Example: three people are speaking simultaneously in a room that has three microphones.
Denote the microphone signals by x1(t), x2(t), and x3(t).
They are mixtures of sources s1(t), s2(t), and s3(t).
The goal is to estimate the original speech signals using only the recorded signals.
This is called the cocktail-party problem.
$$\begin{pmatrix} x_1(t) \\ x_2(t) \\ x_3(t) \end{pmatrix} = \begin{pmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{pmatrix} \begin{pmatrix} s_1(t) \\ s_2(t) \\ s_3(t) \end{pmatrix}$$
The Cocktail-Party Problem
(Figure: the original speech signals and the mixed speech signals.)
The Cocktail-Party Problem
(Figure: the original speech signals and the estimated sources.)
The Problem
$$x_1(t) = a_{11}s_1(t) + a_{12}s_2(t) + a_{13}s_3(t)$$
$$x_2(t) = a_{21}s_1(t) + a_{22}s_2(t) + a_{23}s_3(t)$$
$$x_3(t) = a_{31}s_1(t) + a_{32}s_2(t) + a_{33}s_3(t)$$
$$\mathbf{x} = \mathbf{A}\mathbf{s}$$
Find the sources s1(t), s2(t), and s3(t), and the coefficients aij, from the observed signals x1(t), x2(t), and x3(t).
It turns out that the problem can be solved just by assuming that the sources si(t) are nongaussian and statistically independent.
$$s_1(t) = b_{11}x_1(t) + b_{12}x_2(t) + b_{13}x_3(t)$$
$$s_2(t) = b_{21}x_1(t) + b_{22}x_2(t) + b_{23}x_3(t)$$
$$s_3(t) = b_{31}x_1(t) + b_{32}x_2(t) + b_{33}x_3(t)$$
$$\mathbf{s} = \mathbf{A}^{-1}\mathbf{x} = \mathbf{B}\mathbf{x}$$
Applications
– Cocktail party problem: separation of voices, music, or sounds
– Sensor array processing, e.g., radar
– Biomedical signal processing with multiple sensors: EEG, ECG, MEG, fMRI
– Telecommunications: e.g., multiuser detection in CDMA
– Financial and other time series
– Noise removal from signals and images
– Feature extraction for images and signals
– Brain modelling
Basic ICA Model
$$x_i(t) = a_{i1}s_1(t) + a_{i2}s_2(t) + \cdots + a_{in}s_n(t), \qquad i = 1, 2, \ldots, n$$
The $x_i(t)$ are the mixed signals (observable); the $s_i(t)$ are the latent variables (the independent components).
(Figure: the joint density p(x1, x2) of two mixtures and the marginal densities p(x1), p(x2).)
$$\mathbf{x} = \mathbf{A}\mathbf{s}$$
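As a quick illustration of this mixing model, here is a minimal numpy sketch (the source distributions, the matrix entries, and all names are illustrative assumptions, not taken from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 10_000

# Two zero-mean, unit-variance, nongaussian sources (latent variables):
# a supergaussian (Laplacian) one and a subgaussian (uniform) one.
s1 = rng.laplace(scale=1.0 / np.sqrt(2), size=n_samples)    # Var = 2 * scale^2 = 1
s2 = rng.uniform(-np.sqrt(3), np.sqrt(3), size=n_samples)   # Var = (2a)^2 / 12 = 1
S = np.vstack([s1, s2])                                      # shape (2, n_samples)

# An arbitrary square mixing matrix A (illustrative values).
A = np.array([[1.0, 0.5],
              [0.3, 2.0]])

X = A @ S   # observed mixtures x = A s, one column per time instant
print(X.shape)  # (2, 10000)
```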
The Basic Assumptions
The independent components are assumed statistically independent.
The independent components must have nongaussian distributions.
For simplicity, we assume that the unknown mixing matrix A is square.
Assumption I: Statistical Independence
Basically, random variables y1, y2, …, yn are said to be independent if information on the value of yi does not give any information on the value of yj for i ≠ j.
Mathematically, the joint pdf is factorizable in the following way:
$$p(y_1, y_2, \ldots, y_n) = p_1(y_1)\,p_2(y_2)\cdots p_n(y_n)$$
Note that uncorrelatedness does not necessarily imply independence.
Assumption II: Nongaussian Distributions
Note that in the basic model we do not have to know what the nongaussian distributions of the ICs look like.
Assumption III: The Mixing Matrix Is Square
In other words, the number of independent components is equal to the number of observed mixtures.
– This simplifies our discussion in the first stage.
However, in the basic ICA model this is not a real restriction, as long as the number of observations xi is at least as large as the number of sources sj.
Ambiguities of ICA
We cannot determine the variances (energies) of IC’s.
– We therefore fix E[si2] = 1 (after centering, so that E[x] = 0); even then, the sign of si cannot be determined.
We cannot determine the order of IC’s.
$$\mathbf{x} = \sum_{i=1}^{n} \mathbf{a}_i s_i$$
Therefore, we assume
$$E[s_i^2] = 1, \qquad E[s_i] = 0$$
$$\mathbf{x} = (\mathbf{A}\mathbf{P}^{-1})(\mathbf{P}\mathbf{s}), \quad \text{where } \mathbf{P} \text{ is any permutation matrix.}$$
Illustration of ICA
$$p(s_i) = \begin{cases} \dfrac{1}{2\sqrt{3}}, & |s_i| \le \sqrt{3} \\ 0, & \text{otherwise} \end{cases}$$
Mixing with
$$\mathbf{A} = \begin{pmatrix} 5 & 10 \\ 10 & 2 \end{pmatrix}, \qquad \mathbf{x} = \mathbf{A}\mathbf{s}$$
Whitening Is Only Half of ICA
Whitening: $\mathbf{z} = \mathbf{V}\mathbf{x}$, where V is the whitening matrix and $\mathbf{x} = \mathbf{A}\mathbf{s}$.
Whitening Is Only Half of ICA
$$\mathbf{z} = \mathbf{V}\mathbf{x}$$
(Figure: the marginal densities p(zi) after whitening.)
By whitening, we have $E[\mathbf{z}\mathbf{z}^T] = \mathbf{I}$.
This, however, doesn't imply that the zi's are independent, i.e., we may have
$$p(z_1, z_2, \ldots, z_n) \neq \prod_{i=1}^{n} p_i(z_i)$$
Uncorrelatedness is related to independence, but is weaker than independence.
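A minimal whitening sketch (numpy; illustrative names, assuming a data matrix X with one column per sample, e.g., the mixtures generated above):

```python
import numpy as np

def whiten(X):
    """Whiten zero-mean data X (rows = variables, columns = samples): z = V x with E[zz^T] = I."""
    X = X - X.mean(axis=1, keepdims=True)       # centering: E[x] = 0
    C = np.cov(X)                                # covariance matrix E[xx^T]
    d, E = np.linalg.eigh(C)                     # C = E diag(d) E^T
    V = E @ np.diag(1.0 / np.sqrt(d)) @ E.T      # whitening matrix V = E D^{-1/2} E^T
    Z = V @ X
    return Z, V

# Example: Z is uncorrelated with unit variance, but its components need not be independent.
# Z, V = whiten(X)
# print(np.cov(Z))   # approximately the identity matrix
```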
Independent Component Analysis
z = Vx. The central limit theorem implicitly tells us that adding (mixing) components makes the distribution 'more' Gaussian. Therefore, nongaussianity is an important criterion for ICA.
Degaussianization is hence the central theme in ICA.
Independent Component Analysis
Nongaussianity Measurement — Kurtosis
Moments
The jth moment:
$$\alpha_j = E[X^j] = \int x^j p(x)\,dx$$
The jth central moment:
$$\beta_j = E[(X - m_x)^j] = \int (x - m_x)^j p(x)\,dx$$
Mean: $m_x = \alpha_1 = E[X]$
Variance: $\sigma_x^2 = \beta_2 = E[(X - m_x)^2]$
Skewness: $\mathrm{skew}(X) = \beta_3 = E[(X - m_x)^3]$
Moment Generating Function
The moment generating function MX(t) of a random variable X is defined by:
$$M_X(t) = E[e^{tX}] = \int e^{tx} p(x)\,dx$$
For $X \sim N(\mu, \sigma^2)$: $\quad M_X(t) = e^{\mu t}\,e^{\sigma^2 t^2/2}$
For $Z \sim N(0, 1)$: $\quad M_Z(t) = e^{t^2/2}$
Expanding the exponential,
$$M_X(t) = E[e^{tX}] = 1 + E[X]\frac{t}{1!} + E[X^2]\frac{t^2}{2!} + E[X^3]\frac{t^3}{3!} + \cdots$$
Standard Normal Distribution N(0, 1)
$$M_Z(t) = e^{t^2/2} = 1 + \frac{t^2}{2\cdot 1!} + \frac{t^4}{2^2\cdot 2!} + \frac{t^6}{2^3\cdot 3!} + \cdots$$
Comparing with the expansion $M_Z(t) = 1 + E[Z]\frac{t}{1!} + E[Z^2]\frac{t^2}{2!} + E[Z^3]\frac{t^3}{3!} + \cdots$ gives
$$E[Z^{2k+1}] = 0 \ \text{(zero for all odd moments)}, \qquad E[Z^{2k}] = \frac{(2k)!}{2^k\,k!}$$
In particular, $E[Z^2] = 1$ and $E[Z^4] = 3$.
Kurtosis
Kurtosis of a zero-mean random variable X is defined by
$$\mathrm{kurt}(X) = E[X^4] - 3(E[X^2])^2$$
Normalized kurtosis:
$$\widetilde{\mathrm{kurt}}(X) = \frac{E[X^4]}{(E[X^2])^2} - 3$$
For a Gaussian variable, $\mathrm{kurt}(Z) = 0$.
Gaussianity
Gaussian: e.g., $p(x) = \frac{1}{\sqrt{2\pi}}\,e^{-x^2/2}$
Supergaussian: e.g., $p(x) = \frac{\lambda}{2}\,e^{-\lambda|x|}$ (Laplacian)
Subgaussian: e.g., $p(x) = \frac{1}{2a}, \ x \in [-a, a]$ (uniform)
Kurtosis for Supergaussian
Consider the Laplacian distribution:
$$p(x) = \frac{\lambda}{2}\,e^{-\lambda|x|}$$
By symmetry, $E[X] = 0$.
$$E[X^2] = \frac{\lambda}{2}\int_{-\infty}^{\infty} x^2 e^{-\lambda|x|}\,dx = \lambda\int_0^{\infty} x^2 e^{-\lambda x}\,dx = \frac{2}{\lambda^2}$$
$$E[X^4] = \frac{\lambda}{2}\int_{-\infty}^{\infty} x^4 e^{-\lambda|x|}\,dx = \lambda\int_0^{\infty} x^4 e^{-\lambda x}\,dx = \frac{24}{\lambda^4}$$
$$\mathrm{kurt}(X) = E[X^4] - 3(E[X^2])^2 = \frac{24}{\lambda^4} - \frac{12}{\lambda^4} = \frac{12}{\lambda^4} > 0, \qquad \widetilde{\mathrm{kurt}}(X) = \frac{24/\lambda^4}{(2/\lambda^2)^2} - 3 = 3 > 0$$
Kurtosis for Subgaussian
Consider the uniform distribution: $p(x) = \frac{1}{2a}, \ x \in [-a, a]$
By symmetry, $E[X] = 0$.
$$E[X^2] = \frac{1}{2a}\int_{-a}^{a} x^2\,dx = \frac{a^2}{3}, \qquad E[X^4] = \frac{1}{2a}\int_{-a}^{a} x^4\,dx = \frac{a^4}{5}$$
$$\mathrm{kurt}(X) = E[X^4] - 3(E[X^2])^2 = \frac{a^4}{5} - \frac{a^4}{3} = -\frac{2a^4}{15} < 0, \qquad \widetilde{\mathrm{kurt}}(X) = \frac{9}{5} - 3 = -\frac{6}{5} < 0$$
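These two results can be checked with sample moments; a small sketch (the estimator and names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

def kurt(x):
    """Sample kurtosis of zero-mean data: E[x^4] - 3 (E[x^2])^2."""
    x = x - x.mean()
    return np.mean(x**4) - 3.0 * np.mean(x**2)**2

lap = rng.laplace(scale=1.0, size=n)                  # Laplacian with lambda = 1 -> kurt = 12
uni = rng.uniform(-np.sqrt(3), np.sqrt(3), size=n)    # uniform with a = sqrt(3)   -> kurt = -2a^4/15 = -1.2
gau = rng.normal(size=n)                              # Gaussian                   -> kurt = 0

print(kurt(lap))   # approximately 12   (supergaussian, positive)
print(kurt(uni))   # approximately -1.2 (subgaussian, negative)
print(kurt(gau))   # approximately 0    (Gaussian)
```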
Nongaussianity Measurement By Kurtosis
Kurtosis, or rather its absolute value, has been widely used as a measure of nongaussianity in ICA and related fields.
Computationally, kurtosis can be estimated simply by using the 4th moment of the sample data (if the variance is kept constant).
$$\mathrm{kurt}(X) = E[X^4] - 3(E[X^2])^2$$
Properties of Kurtosis
Let X1 and X2 be two independent random variables, both with zero mean. Then
$$\mathrm{kurt}(X_1 + X_2) = \mathrm{kurt}(X_1) + \mathrm{kurt}(X_2)$$
$$\mathrm{kurt}(\alpha X_1) = \alpha^4\,\mathrm{kurt}(X_1)$$
$$\mathrm{kurt}(X) = E[X^4] - 3(E[X^2])^2$$
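A quick numerical sanity check of both properties (illustrative sketch):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500_000

def kurt(x):
    x = x - x.mean()
    return np.mean(x**4) - 3.0 * np.mean(x**2)**2

x1 = rng.laplace(size=n)          # independent, zero-mean variables
x2 = rng.uniform(-1, 1, size=n)
alpha = 2.0

print(kurt(x1 + x2), kurt(x1) + kurt(x2))      # approximately equal (additivity)
print(kurt(alpha * x1), alpha**4 * kurt(x1))   # approximately equal (scaling by alpha^4)
```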
Independent Component Analysis
ICA By Maximization of Nongaussianity
Restate the Problem
$$\mathbf{x} = \mathbf{A}\mathbf{s}$$
x: zero-mean observations (observable); A: mixing matrix (unknown); s: zero-mean, unit-variance ICs (latent, unknown).
Ultimate goal: recover $\mathbf{s} = \mathbf{A}^{-1}\mathbf{x}$.
How?
Simplification
$$\mathbf{x} = \mathbf{A}\mathbf{s}, \qquad \text{ultimate goal: } \mathbf{s} = \mathbf{A}^{-1}\mathbf{x} \ \text{(via whitening)}$$
For simplicity, we assume the sources are i.i.d.
To estimate one independent component, form
$$y = \mathbf{b}^T\mathbf{x} = \mathbf{b}^T\mathbf{A}\mathbf{s} = \mathbf{q}^T\mathbf{s}$$
If b is properly identified, $\mathbf{q}^T = \mathbf{b}^T\mathbf{A}$ contains only one nonzero entry, with value one.
This implies that b will be one row of $\mathbf{A}^{-1}$.
Nongaussian Is Independent
As above, $y = \mathbf{b}^T\mathbf{x} = \mathbf{b}^T\mathbf{A}\mathbf{s} = \mathbf{q}^T\mathbf{s}$.
We will take the b that maximizes the nongaussianity of $\mathbf{b}^T\mathbf{x}$.
Nongaussian Is Independent
$$y = \mathbf{b}^T\mathbf{x} = \mathbf{b}^T\mathbf{A}\mathbf{s} = \mathbf{q}^T\mathbf{s}$$
(Figure: scatter plots of the sources (s1, s2) and of the mixtures x = As obtained with the mixing matrix A of the earlier illustration.)
Nongaussian Is Independent
(Figure: scatter plot of the whitened data z = Vx.)
Nongaussian Is Independent
(Figure: the marginal densities p(zi) of the whitened data z = Vx.)
An additive mixture of components becomes more Gaussian.
Nongaussian Is Independent
$$\mathbf{y} = \mathbf{W}\mathbf{z}, \qquad \mathbf{W} = (\mathbf{w}_1, \mathbf{w}_2)^T$$
(Figure: scatter plot of (y1, y2) after rotating the whitened data z = Vx.)
Nongaussian Is Independent
(Figure: the estimated densities p(yi) of the components of y = Wz.)
Nongaussian Is Independent
Consider obtaining one independent component:
$$y = \mathbf{w}^T\mathbf{z} = \mathbf{w}^T\mathbf{V}\mathbf{A}\mathbf{s} = \mathbf{q}^T\mathbf{s} \qquad (\mathbf{b}^T = \mathbf{w}^T\mathbf{V}, \ \text{so } y = \mathbf{b}^T\mathbf{x})$$
$$\|\mathbf{q}\|^2 = \mathbf{w}^T(\mathbf{V}\mathbf{A})(\mathbf{V}\mathbf{A})^T\mathbf{w} = \|\mathbf{w}\|^2 = 1$$
Nongaussian Is Independent
Project the whitened data onto a unit vector w, $y = \mathbf{w}^T\mathbf{z}$ with $\|\mathbf{w}\| = 1$, to get one independent component.
Nongaussian Is Independent
Write $y = q_1 s_1 + q_2 s_2$. Then
$$\mathrm{kurt}(y) = \mathrm{kurt}(q_1 s_1) + \mathrm{kurt}(q_2 s_2) = q_1^4\,\mathrm{kurt}(s_1) + q_2^4\,\mathrm{kurt}(s_2)$$
We require that
$$E[y^2] = q_1^2\,\mathrm{Var}(s_1) + q_2^2\,\mathrm{Var}(s_2) = 1$$
so the search space is the unit circle $q_1^2 + q_2^2 = 1$ in the $(q_1, q_2)$ plane.
Using kurtosis as the nongaussianity measure (e.g., with $\mathrm{kurt}(s_1) = \mathrm{kurt}(s_2) = c$, we maximize $|c|\,(q_1^4 + q_2^4)$), the maxima on this circle occur where exactly one of $q_1, q_2$ equals $\pm 1$, i.e., where y equals one of the independent components up to sign.
Independent Component Analysis
Gradient Algorithm Using Kurtosis
Criterion for ICA Using Kurtosis
maximize $|\mathrm{kurt}(\mathbf{w}^T\mathbf{z})|$ subject to $\|\mathbf{w}\|^2 = 1$,
where $\mathrm{kurt}(X) = E[X^4] - 3(E[X^2])^2$.
Since z is white, $E[(\mathbf{w}^T\mathbf{z})^2] = \|\mathbf{w}\|^2$, so
$$|\mathrm{kurt}(\mathbf{w}^T\mathbf{z})| = \left|E[(\mathbf{w}^T\mathbf{z})^4] - 3\|\mathbf{w}\|^4\right|$$
and its gradient is
$$\frac{\partial}{\partial\mathbf{w}}|\mathrm{kurt}(\mathbf{w}^T\mathbf{z})| = 4\,\mathrm{sign}(\mathrm{kurt}(\mathbf{w}^T\mathbf{z}))\left\{E[\mathbf{z}(\mathbf{w}^T\mathbf{z})^3] - 3\mathbf{w}\|\mathbf{w}\|^2\right\}$$
Gradient Algorithm
$$\frac{\partial}{\partial\mathbf{w}}|\mathrm{kurt}(\mathbf{w}^T\mathbf{z})| = 4\,\mathrm{sign}(\mathrm{kurt}(\mathbf{w}^T\mathbf{z}))\left\{E[\mathbf{z}(\mathbf{w}^T\mathbf{z})^3] - 3\mathbf{w}\|\mathbf{w}\|^2\right\}$$
The term $3\mathbf{w}\|\mathbf{w}\|^2$ only changes the norm of w, not its direction, so it is unrelated to the search and can be dropped. This gives the gradient algorithm:
$$\Delta\mathbf{w} \propto \mathrm{sign}(\mathrm{kurt}(\mathbf{w}^T\mathbf{z}))\,E[\mathbf{z}(\mathbf{w}^T\mathbf{z})^3], \qquad \mathbf{w} \leftarrow \mathbf{w}/\|\mathbf{w}\|$$
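A batch-mode sketch of this gradient rule (numpy; the step size, iteration count, and names are assumptions):

```python
import numpy as np

def ica_kurtosis_gradient(Z, n_iter=200, mu=0.1, seed=0):
    """Estimate one IC direction w from whitened data Z (rows = variables)
    by gradient ascent on |kurt(w^T z)|."""
    rng = np.random.default_rng(seed)
    w = rng.normal(size=Z.shape[0])
    w /= np.linalg.norm(w)
    for _ in range(n_iter):
        y = w @ Z                                      # projections w^T z
        k = np.mean(y**4) - 3.0                        # kurt(w^T z); unit variance since Z is white
        grad = np.sign(k) * (Z * y**3).mean(axis=1)    # sign(kurt) * E[z (w^T z)^3]
        w = w + mu * grad
        w /= np.linalg.norm(w)                         # project back onto the unit sphere
    return w
```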
FastICA Algorithm
maximize $|\mathrm{kurt}(\mathbf{w}^T\mathbf{z})|$ subject to $\|\mathbf{w}\|^2 = 1$.
At a stable point, the gradient must point in the direction of w. Using fixed-point iteration,
$$\mathbf{w} \propto E[\mathbf{z}(\mathbf{w}^T\mathbf{z})^3] - 3\|\mathbf{w}\|^2\mathbf{w}$$
(the sign is not important). This gives the FastICA update:
$$\mathbf{w} \leftarrow E[\mathbf{z}(\mathbf{w}^T\mathbf{z})^3] - 3\mathbf{w}, \qquad \mathbf{w} \leftarrow \mathbf{w}/\|\mathbf{w}\|$$
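A fixed-point sketch of the same estimate (the convergence test and names are assumptions):

```python
import numpy as np

def fastica_kurtosis(Z, n_iter=100, tol=1e-8, seed=0):
    """One-unit FastICA on whitened data Z using the kurtosis nonlinearity:
    w <- E[z (w^T z)^3] - 3 w, then renormalize."""
    rng = np.random.default_rng(seed)
    w = rng.normal(size=Z.shape[0])
    w /= np.linalg.norm(w)
    for _ in range(n_iter):
        y = w @ Z
        w_new = (Z * y**3).mean(axis=1) - 3.0 * w
        w_new /= np.linalg.norm(w_new)
        if 1.0 - abs(w_new @ w) < tol:   # converged when the direction stops changing (up to sign)
            return w_new
        w = w_new
    return w
```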
Independent Component Analysis
Measuring Nongaussianity by Negentropy
Critique of Kurtosis
Kurtosis can be very sensitive to outliers.
– Kurtosis may depend on only a few observations in the tails of the distribution.
It is therefore not a robust measure of nongaussianity.
$$\mathrm{kurt}(X) = E[X^4] - 3(E[X^2])^2$$
Negentropy
Differential entropy:
$$H(\mathbf{X}) = -\int p_{\mathbf{X}}(\mathbf{x})\log p_{\mathbf{X}}(\mathbf{x})\,d\mathbf{x}$$
Negentropy:
$$J(\mathbf{X}) = H(\mathbf{X}_{\mathrm{gauss}}) - H(\mathbf{X}) \ge 0$$
Negentropy is zero only when the random variable is Gaussian distributed. Here $\mathbf{X}_{\mathrm{gauss}}$ is a Gaussian random vector with the same covariance matrix $\mathbf{\Sigma}$ as $\mathbf{X}$, and
$$H(\mathbf{X}_{\mathrm{gauss}}) = \frac{1}{2}\log\det\mathbf{\Sigma} + \frac{n}{2}\left[1 + \log 2\pi\right]$$
Negentropy is invariant under invertible linear transformations.
Approximation of Negentropy (I)
For a zero-mean, unit-variance random variable,
$$J(X) \approx \frac{\kappa_3(X)^2}{12} + \frac{\kappa_4(X)^2}{48}$$
where $\kappa_3(X) = E[X^3]$ is the skewness and $\kappa_4(X) = E[X^4] - 3$ is the kurtosis.
This approximation, however, does not help much, because it is still sensitive to outliers.
Approximation of Negentropy (II)
$$J(X) \approx k_1\left(E[G_1(X)]\right)^2 + k_2\left(E[G_2(X)] - E[G_2(Z)]\right)^2$$
Choose two nonpolynomial functions: $G_1(x)$ odd and $G_2(x)$ even. The first term measures asymmetry; it is zero if the underlying density is even (symmetric). The second term measures bimodality versus a peak at zero. Here
$$E[G_2(Z)] = \frac{1}{\sqrt{2\pi}}\int G_2(z)\exp[-z^2/2]\,dz$$
Usually, only the second term is used.
Approximation of Negentropy (II)
If only an even nonpolynomial function, say G, is used, we have
$$J(X) \propto \left(E[G(X)] - E[G(Z)]\right)^2$$
The following two functions are useful:
$$G_1(x) = \frac{1}{a_1}\log\cosh(a_1 x), \quad 1 \le a_1 \le 2$$
$$G_2(x) = -\exp[-x^2/2]$$
(For comparison, $G_3(x) = x^4$ corresponds to the kurtosis-based measure.)
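A small sketch of this one-term approximation using G1 (the proportionality constant is dropped; the choice a1 = 1 and the Monte Carlo estimate of E[G(Z)] are illustrative assumptions):

```python
import numpy as np

def G1(x, a1=1.0):
    return np.log(np.cosh(a1 * x)) / a1

def negentropy_approx(x, G=G1, n_gauss=1_000_000, seed=0):
    """Approximate negentropy of zero-mean, unit-variance samples x:
    J(x) ~ (E[G(x)] - E[G(Z)])^2 with Z standard normal."""
    rng = np.random.default_rng(seed)
    Ez = G(rng.normal(size=n_gauss)).mean()     # E[G(Z)] estimated by Monte Carlo
    return (G(x).mean() - Ez) ** 2

rng = np.random.default_rng(1)
print(negentropy_approx(rng.normal(size=100_000)))                        # approximately 0 for Gaussian data
print(negentropy_approx(rng.laplace(scale=1/np.sqrt(2), size=100_000)))   # clearly > 0 for nongaussian data
```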
Degaussian
$$J(X) \propto \left(E[G(X)] - E[G(Z)]\right)^2$$
For ICA, we want to maximize this quantity. Specifically, let z = Vx be the whitened data. For one-unit ICA, we want to find a weight vector, say w, to
maximize $\ J(\mathbf{w}^T\mathbf{z}) = k\left(E[G(\mathbf{w}^T\mathbf{z})] - E[G(Z)]\right)^2$
subject to $\ \|\mathbf{w}\|^2 = 1$.
Gradient Algorithm
maximize $J(\mathbf{w}^T\mathbf{z}) = k\left(E[G(\mathbf{w}^T\mathbf{z})] - E[G(Z)]\right)^2$ subject to $\|\mathbf{w}\|^2 = 1$.
Fact: $E[(\mathbf{w}^T\mathbf{z})^2] = 1$.
The gradient is
$$\frac{\partial J(\mathbf{w}^T\mathbf{z})}{\partial\mathbf{w}} = \gamma\,E[\mathbf{z}\,g(\mathbf{w}^T\mathbf{z})], \qquad \gamma = 2k\left\{E[G(\mathbf{w}^T\mathbf{z})] - E[G(Z)]\right\}$$
where $g = G'$ and $\gamma$ is treated as a constant.
Algorithm (batch mode):
$$\Delta\mathbf{w} \propto \gamma\,E[\mathbf{z}\,g(\mathbf{w}^T\mathbf{z})], \qquad \mathbf{w} \leftarrow \mathbf{w}/\|\mathbf{w}\|$$
On-line mode:
$$\Delta\mathbf{w} \propto \gamma\,\mathbf{z}\,g(\mathbf{w}^T\mathbf{z}), \qquad \mathbf{w} \leftarrow \mathbf{w}/\|\mathbf{w}\|$$
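A batch-mode sketch of this rule with g(u) = tanh(u), i.e., G = G1 with a1 = 1 (the step size, the Monte Carlo estimate of E[G(Z)], and the names are assumptions):

```python
import numpy as np

def ica_negentropy_gradient(Z, n_iter=300, mu=0.1, seed=0):
    """One-unit gradient algorithm on whitened data Z with g(u) = tanh(u)."""
    rng = np.random.default_rng(seed)
    w = rng.normal(size=Z.shape[0]); w /= np.linalg.norm(w)
    Ez = np.mean(np.log(np.cosh(rng.normal(size=100_000))))    # E[G(Z)] for the standard normal
    for _ in range(n_iter):
        y = w @ Z
        gamma = np.mean(np.log(np.cosh(y))) - Ez                # its sign controls the search direction
        w = w + mu * gamma * (Z * np.tanh(y)).mean(axis=1)      # delta w proportional to gamma E[z g(w^T z)]
        w /= np.linalg.norm(w)
    return w
```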
Analysis
maximize $J(\mathbf{w}^T\mathbf{z}) = k\left(E[G(\mathbf{w}^T\mathbf{z})] - E[G(Z)]\right)^2$.
Consider the term inside the braces:
$$f(X) = E[G(X)] - E[G(Z)]$$
For the functions G1 and G2 above, f has the following property:
f(X) < 0 if X is supergaussian;
f(X) > 0 if X is subgaussian.
(For G3(x) = x4 the signs are reversed, since f(X) then equals the kurtosis of X.)
Analysis
Therefore:
– Minimize E[G(wTz)] if the IC is supergaussian.
– Maximize E[G(wTz)] if the IC is subgaussian.
Analysis
The corresponding derivatives g = G' are
$$g_1(x) = \tanh(a_1 x), \qquad g_2(x) = x\exp[-x^2/2], \qquad g_3(x) = 4x^3$$
Both g1 and g2 are far less sensitive to outliers than g3.
Analysis
In the update rules
$$\Delta\mathbf{w} \propto \gamma\,E[\mathbf{z}\,g(\mathbf{w}^T\mathbf{z})] \ \text{(batch mode)}, \qquad \Delta\mathbf{w} \propto \gamma\,\mathbf{z}\,g(\mathbf{w}^T\mathbf{z}) \ \text{(on-line mode)}, \qquad \mathbf{w} \leftarrow \mathbf{w}/\|\mathbf{w}\|$$
the factor $\gamma = 2k\{E[G(\mathbf{w}^T\mathbf{z})] - E[G(Z)]\}$ controls the search direction; its sign depends on the super- or subgaussianity of the samples. The nonlinearity $g(\mathbf{w}^T\mathbf{z})$ weights the samples.
Stability Analysis
Assume that the input data follow the ICA model with whitened data z = VAs, and that G is a sufficiently smooth even function. Then the local maxima (resp. minima) of E[G(wTz)] under the constraint ||w|| = 1 include those rows of the inverse of the mixing matrix VA for which the corresponding independent component si satisfies
$$E[s_i\,g(s_i) - g'(s_i)] > 0 \quad (\text{resp.} < 0)$$
Stability Analysis
In terms of the negentropy criterion, the corresponding stability condition can be written as
$$E[s_i\,g(s_i) - g'(s_i)]\left\{E[G(s_i)] - E[G(Z)]\right\} > 0$$
This condition is, in general, true for reasonable choices of G.
Independent Component Analysis
FastICA Using Negentropy
Clue From Gradient Algorithm
The gradient algorithm's updates
$$\Delta\mathbf{w} \propto \gamma\,E[\mathbf{z}\,g(\mathbf{w}^T\mathbf{z})] \ \text{(batch mode)} \quad\text{or}\quad \Delta\mathbf{w} \propto \gamma\,\mathbf{z}\,g(\mathbf{w}^T\mathbf{z}) \ \text{(on-line mode)}, \qquad \mathbf{w} \leftarrow \mathbf{w}/\|\mathbf{w}\|$$
suggest the fixed-point iteration
$$\mathbf{w} \leftarrow E[\mathbf{z}\,g(\mathbf{w}^T\mathbf{z})], \qquad \mathbf{w} \leftarrow \mathbf{w}/\|\mathbf{w}\|$$
However, nonpolynomial moments do not have the same nice algebraic properties as kurtosis, so such an iteration scheme is poor.
Newton’s Method
Maximize or minimize $E[G(\mathbf{w}^T\mathbf{z})]$ subject to $\|\mathbf{w}\|^2 = 1$.
Construct the Lagrangian as follows:
$$L(\mathbf{w}) = E[G(\mathbf{w}^T\mathbf{z})] + \frac{\beta}{2}\left(\mathbf{w}^T\mathbf{w} - 1\right)$$
Newton's method finds an extreme point by letting
$$\mathbf{w} \leftarrow \mathbf{w} - \left(\frac{\partial^2 L(\mathbf{w})}{\partial\mathbf{w}\,\partial\mathbf{w}^T}\right)^{-1}\frac{\partial L(\mathbf{w})}{\partial\mathbf{w}}$$
Newton's Method
The gradient and the Hessian of the Lagrangian are
$$\frac{\partial L(\mathbf{w})}{\partial\mathbf{w}} = E[\mathbf{z}\,g(\mathbf{w}^T\mathbf{z})] + \beta\mathbf{w}$$
$$\frac{\partial^2 L(\mathbf{w})}{\partial\mathbf{w}\,\partial\mathbf{w}^T} = E[\mathbf{z}\mathbf{z}^T g'(\mathbf{w}^T\mathbf{z})] + \beta\mathbf{I}$$
so the Newton step is
$$\mathbf{w} \leftarrow \mathbf{w} - \left[E[\mathbf{z}\mathbf{z}^T g'(\mathbf{w}^T\mathbf{z})] + \beta\mathbf{I}\right]^{-1}\left[E[\mathbf{z}\,g(\mathbf{w}^T\mathbf{z})] + \beta\mathbf{w}\right]$$
Evaluating the Hessian matrix and its inverse is time consuming; we want to approximate it.
Newton's Method
Since the data are whitened ($E[\mathbf{z}\mathbf{z}^T] = \mathbf{I}$), the first term of the Hessian can be approximated by
$$E[\mathbf{z}\mathbf{z}^T g'(\mathbf{w}^T\mathbf{z})] \approx E[\mathbf{z}\mathbf{z}^T]\,E[g'(\mathbf{w}^T\mathbf{z})] = E[g'(\mathbf{w}^T\mathbf{z})]\,\mathbf{I}$$
so that
$$\frac{\partial^2 L(\mathbf{w})}{\partial\mathbf{w}\,\partial\mathbf{w}^T} \approx \left(E[g'(\mathbf{w}^T\mathbf{z})] + \beta\right)\mathbf{I}$$
which is a diagonal matrix and is trivial to invert.
Newton's Method
With this approximation, the Newton step becomes
$$\mathbf{w} \leftarrow \mathbf{w} - \frac{E[\mathbf{z}\,g(\mathbf{w}^T\mathbf{z})] + \beta\mathbf{w}}{E[g'(\mathbf{w}^T\mathbf{z})] + \beta}$$
FastICA
$$\mathbf{w} \leftarrow \mathbf{w} - \frac{E[\mathbf{z}\,g(\mathbf{w}^T\mathbf{z})] + \beta\mathbf{w}}{E[g'(\mathbf{w}^T\mathbf{z})] + \beta}$$
Multiplying both sides by $E[g'(\mathbf{w}^T\mathbf{z})] + \beta$ (the scaling is irrelevant because w is renormalized) gives the algorithm:
$$\mathbf{w} \leftarrow E[\mathbf{z}\,g(\mathbf{w}^T\mathbf{z})] - E[g'(\mathbf{w}^T\mathbf{z})]\,\mathbf{w}$$
$$\mathbf{w} \leftarrow \mathbf{w}/\|\mathbf{w}\|$$
FastICA
$$\mathbf{w} \leftarrow E[\mathbf{z}\,g(\mathbf{w}^T\mathbf{z})] - E[g'(\mathbf{w}^T\mathbf{z})]\,\mathbf{w}, \qquad \mathbf{w} \leftarrow \mathbf{w}/\|\mathbf{w}\|$$
1. Center the data to make the mean zero.
2. Whiten the data to give z.
3. Choose an initial vector w of unit norm.
4. Update $\mathbf{w} \leftarrow E[\mathbf{z}\,g(\mathbf{w}^T\mathbf{z})] - E[g'(\mathbf{w}^T\mathbf{z})]\,\mathbf{w}$.
5. Normalize $\mathbf{w} \leftarrow \mathbf{w}/\|\mathbf{w}\|$.
6. If not converged, go back to step 4.
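A one-unit sketch of these steps with g(u) = tanh(u) and g'(u) = 1 - tanh2(u) (tolerance, iteration limit, and names are illustrative assumptions):

```python
import numpy as np

def fastica_one_unit(Z, n_iter=200, tol=1e-8, seed=0):
    """One-unit FastICA on centered, whitened data Z (rows = variables, columns = samples)."""
    rng = np.random.default_rng(seed)
    w = rng.normal(size=Z.shape[0]); w /= np.linalg.norm(w)    # step 3
    for _ in range(n_iter):
        y = w @ Z
        g, g_prime = np.tanh(y), 1.0 - np.tanh(y)**2
        w_new = (Z * g).mean(axis=1) - g_prime.mean() * w      # step 4: E[z g(w^T z)] - E[g'(w^T z)] w
        w_new /= np.linalg.norm(w_new)                         # step 5
        if 1.0 - abs(w_new @ w) < tol:                         # step 6: converged (up to sign)
            return w_new
        w = w_new
    return w
```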
Typical choices of the contrast function G, its derivative g, and g':
$$G_1(x) = \frac{1}{a_1}\log\cosh(a_1 x), \qquad g_1(x) = \tanh(a_1 x), \qquad g_1'(x) = a_1\left(1 - \tanh^2(a_1 x)\right)$$
$$G_2(x) = -\exp[-x^2/2], \qquad g_2(x) = x\exp[-x^2/2], \qquad g_2'(x) = (1 - x^2)\exp[-x^2/2]$$
$$G_3(x) = x^4, \qquad g_3(x) = 4x^3, \qquad g_3'(x) = 12x^2$$
FastICA
(Figure: plots of the contrast functions G1, G2, G3 and of their derivatives g1, g2, g3.)
Estimating Several IC’s
– Deflation orthogonalization: based on the Gram-Schmidt method.
– Symmetric orthogonalization: adjust all vectors in parallel.
Deflation Orthogonalization
1. Center the data to make the mean zero.
2. Whiten the data to give z.
3. Choose m, the number of ICs to estimate; set the counter p ← 1.
4. Choose an initial vector wp of unit norm, randomly.
5. Update $\mathbf{w}_p \leftarrow E[\mathbf{z}\,g(\mathbf{w}_p^T\mathbf{z})] - E[g'(\mathbf{w}_p^T\mathbf{z})]\,\mathbf{w}_p$.
6. Orthogonalize against the previously found vectors (Gram-Schmidt): $\mathbf{w}_p \leftarrow \mathbf{w}_p - \sum_{j=1}^{p-1}(\mathbf{w}_p^T\mathbf{w}_j)\,\mathbf{w}_j$.
7. Normalize $\mathbf{w}_p \leftarrow \mathbf{w}_p/\|\mathbf{w}_p\|$.
8. If wp has not converged, go back to step 5.
9. Set p ← p + 1; if p ≤ m, go back to step 4.
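A sketch of the deflation scheme built on the one-unit update (illustrative names; it assumes centered, whitened data Z with rows as variables):

```python
import numpy as np

def fastica_deflation(Z, m, n_iter=200, tol=1e-8, seed=0):
    """Estimate m independent components from whitened data Z by deflation (Gram-Schmidt)."""
    rng = np.random.default_rng(seed)
    W = np.zeros((m, Z.shape[0]))
    for p in range(m):
        w = rng.normal(size=Z.shape[0]); w /= np.linalg.norm(w)
        for _ in range(n_iter):
            y = w @ Z
            w_new = (Z * np.tanh(y)).mean(axis=1) - (1 - np.tanh(y)**2).mean() * w
            w_new -= W[:p].T @ (W[:p] @ w_new)      # Gram-Schmidt against the rows found so far
            w_new /= np.linalg.norm(w_new)
            if 1.0 - abs(w_new @ w) < tol:
                break
            w = w_new
        W[p] = w_new
    return W                                         # estimated sources: S_hat = W @ Z
```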
Symmetric Orthogonalization
1. Choose the number of independent components to estimate, say, m.
2. Initialize the wi, i = 1, …, m.
3. Do an iteration of the one-unit algorithm on every wi in parallel.
4. Do a symmetric orthogonalization of the matrix W = (w1, …, wm)T.
5. If not converged, go back to step 3.
Symmetric Orthogonalization
Method 1 (Classic Method):
$$\mathbf{W} \leftarrow (\mathbf{W}\mathbf{W}^T)^{-1/2}\,\mathbf{W}$$
Method 2 (Iteration Method):
1. Let $\mathbf{W} \leftarrow \mathbf{W}/\|\mathbf{W}\|$.
2. Let $\mathbf{W} \leftarrow \frac{3}{2}\mathbf{W} - \frac{1}{2}\mathbf{W}\mathbf{W}^T\mathbf{W}$.
3. If WWT is not close enough to the identity, go back to step 2.
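A sketch of symmetric FastICA using the classic orthogonalization computed via an eigendecomposition (illustrative names; assumes centered, whitened data Z, and m no larger than the data dimension):

```python
import numpy as np

def sym_decorrelate(W):
    """Classic symmetric orthogonalization: W <- (W W^T)^{-1/2} W."""
    d, E = np.linalg.eigh(W @ W.T)
    return E @ np.diag(d ** -0.5) @ E.T @ W

def fastica_symmetric(Z, m, n_iter=200, seed=0):
    """Estimate m ICs from whitened data Z, updating all rows of W in parallel."""
    rng = np.random.default_rng(seed)
    W = sym_decorrelate(rng.normal(size=(m, Z.shape[0])))
    for _ in range(n_iter):
        Y = W @ Z
        gY, gY_prime = np.tanh(Y), 1 - np.tanh(Y)**2
        W_new = (gY @ Z.T) / Z.shape[1] - gY_prime.mean(axis=1)[:, None] * W  # one-unit step per row
        W_new = sym_decorrelate(W_new)
        if np.max(np.abs(np.abs(np.diag(W_new @ W.T)) - 1)) < 1e-8:           # all rows converged (up to sign)
            return W_new
        W = W_new
    return W
```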