tutorial workshop on - 國立臺灣大學 · pythagoras theorem 3. 4 information divergence class...

1

Learning with Information Divergence Geometry

Shinto Eguchiand

Osamu Komori

The Institute of Statistical Mathematics, Japan Email: [email protected], [email protected]: http//www.ism.ac.jp/~eguchi/

Tutorial Workshop on

1

2

Outline

9：30~10:30 Information divergence class and robust statistical methods I

11：00~12：00 Information divergence class and robust statistical methods II

13:30~14:30 Information geometry on model uncertainty

15：00~16：00 Boosting leaning algorithm and U-loss functions I

9：30~10:30 Boosting leaning algorithm and U-loss functions II

11：00~12：00 Pattern recognition from genome and omics data

2

3

Key words

Linear connections

Statistical model and estimation

Divergence geometry

Duality of inference and model

GeodesicABC of Information geometry

Transversality

Riemannian metric (information metric)

Metric and dual connections

Pythagoras theorem

Max entropy model

minimaxity

U-boost

Observational bias Selection bias

Sensitivity analysis The worst case

ε –perturbed model

Pythagoras theorem

3

4

Information divergence class and robust statistical methods I

4

5

geometry

learning

statistics

quantum

1982 paperS. Amari

1975 paperB. Efron

P. Dawidcomment

Historical comment

1984WorkshopRSS150

C.R.Rao1945

R.A.Fisher1922

5

6

What is IG?

Geometry Uncertainty

Information space

It is a method to quantify uncertainty,

Or, a viewpoint to understand uncertainty?

6

7

Dual Riemannian Geometry

Dual Riemannian geometry gives reformulation for Fisher’s foundation

Cf. Einstein field equations (1916)

Information Geometery aims to geometrizeA dualistic structure between modeling and estimation

Estimation is projection of data onto model.

“Estimation is an action by projection and model is an object to be projected”

The interplay of action and object is elucidated by a geometric structure.

Cf. Erlangen program (Klein, 1872)http://en.wikipedia.org/wiki/Erlangen_program

7

8

2 x 2 table

The space of all 2×2 tables associates with a regular tetrahedron

We know the ruled surface is exponential geodesic, butdoes anyone know the minimality of the ruled surface?

The space of all independent 2×2 tables associates with a ruled surfaceIn the regular tetrahedron

8

9

regular tetrahedron

P

Q

S

R

0010

1000

0100

0001

1/4

1/4

1/4

1/4

1−p0p0

0p0 1−p

9

10

regular tetrahedron

P

Q

S

R

2/301/30

02/30 1/3

002/3 1/3

2/31/300

+

=

32

32

31

32

32

31

31

31

××

××

×31

×32

02/30 1/3

2/301/30

+×31

×32=

002/3 1/3

2/31/300

10

11

P

Q

S

R

P

Q

S

R

Ruled surface

11

12

．

1 00 0

p q p (1-q)(1-p)q (1-p) (1-q)

0 00 1

0 10 0

hgfeyzxwn

22 )( −

=χ

0)1()1(

})1)(1()1)(1({ 2

=−−

−−−−−=

qqppqpqpqppqn

0 01 0

nhg

fwzD

eyxC

BA

wzyxnwyhzxgwzfyxe

+++=+=+=+=+=

,,,

11−qq1−p(1−p)(1−q)(1−p)qD

Pp(1−q)pqC

BA

12

13

e-geodesical

-10

0

10 -5

0

5

-5

0

5

-10

0

10

-5

0

5

-10

0

10-10

-5

0

5

10

-5

0

5

-10

0

10

}){( 10,10,1

log,1

log,)1)(1(

log <<<<−−−−

qpq

qp

pqp

qp }{ RI,RI),,,( ∈∈+ yxyxyx

2112112222

21

22

12

22

11211211 1wherelog,log,log),,( )( ππππ

ππ

ππ

πππππ −−−=→−

3S 3RI

13

14

Two parametrization

-10

0

10 -5

0

5

-5

0

5

-10

0

10

-5

0

5

)( )1(),1(, qpqpqp −−

)( ,qp

)(1

log,1

log,)1)(1(

logq

qp

pqp

qp−−−−

14

15

0

5

10

0

5

10

0

0.05

0.1

0.15

0

5

10

Two Gaussian distributions ),(and),( 21 ININ μμ

),( 1 IN μ ),( 2 IN μ

1μ2μ

15

16

0

5

10

0

5

10

0

0.05

0.1

0.15

0

5

10

),( 1 IN μ

),( 2 IN μ

),( 3 IN μ

Pythagoras theorem in a functional space

1μ

2μ

3μ

222 543 =+

16

17

Kullback-Leibler Divergence

∫= xxxx d)()(log)(),(KL q

ppqpD

p

q

r

),(),(),( KLKLKL rqDqpDrpD +=

Let p(x) and q(x) be probabiulity density fubnctions.

Then Kullback-Leibler Divergence defined by

17

18

Two one-parameter families

)PI,(),()()1()()( ∈+−= qptqptp mt xxx

Let the space of all pdfs on a data space.PI

)PI,(,)}({)}({)( 1)( ∈= − rqqrcr sss

es xxx

xxx dqrc sss ∫ −= )}({)}({/1 1

ここで

p

q

r

)(mtp

)(esr

m-geodesic

e-geodesic

18

19

Pythagoras theorem

),(),(),( rqDqpDrpD +=

])1,0[]1,0[),((),(),(),( )()()()( ×∈∀+= tsrqDqpDrpD es

mt

es

mt

p

q

r

)(mtp

)(esr

⇒

19

20

Proof}),(),({),( )(

KL)(

KL)()(

KLe

smt

es

mt rqDqpDrpD +−

∫ −−= xxxxx drqqp es

mt ))(log)())(log()(( )()(

∫ −−−−−= xxxxx dcrqsqpt s}log)(log)()(log1)){(()()(1(

∫ −−−−= xxxxx drqqpst ))(log)())(log()(()1)(1(

∫ −−− xxx dqpct s ))()((log)1(

)},(),(),(){1)(1( KLKLKL rqDqpDrpDst −−−−=

0=

20

21

ABC in differential geometry

Linear connection defines parallelism along a vector field.

Geodesic = a minimal arc between x and y

Riemannian metric defines an inner product on any tangent space of a manifold.

RI)()(:),( →× MMggM XX

∫=

====

1

0))(),((argmin

10})1(,)0(:)({

dttxtxgtyxxxtxc

γ

)()()(: MMM XXX →×∇

))(,),(()()()2(

,)1(

MYXMfYfXYfYf

YfY

XX

XXf

XF ∈∀∈∀+∇=∇

∇=∇

∑=

Γ=∇d

kk

kjijX XX

i1

Componetwise

21

22

GeodesicA one-parameter family (curve) is called geodesic }:)({ εε ≤≤−= ttC θ

),...,1(0)()())(()(1 1

pitttt kkp

k

p

j

ikj

i =∀=Γ+∑∑= =

θθθ θ

with respect to a linear connection },,1:)({ pkjiikj ≤≤Γ θ

Remark 2：If , any geodesic is a line.),,1(0)( pkjii

kj ≤≤=Γ θ

The property is not invariant with parametrization.

The gedesic C is expressed by a transform rule from parameter θ to ω

),...,1(0)()())((~)(1 1

patttt cbp

c

p

b

abc

a =∀=Γ+ ∑∑= =

ωωω ω

,)()(~11 1 1

∑∑∑∑== = = ∂

∂+Γ=Γ

p

ac

iba

i

p

k

p

j

p

i

ikj

ai

jb

kc

abc

BBBBBω

θω )(for, θaaa

iiai

aai tBB =

∂∂

=∂∂

= ωωθ

θω

22

23

Change rule on connectionsLet us consider a transform φ from parameter θ to ω . Then a geodesic C is

written by . Hence we have the following:

),...,1(0)()())((~)(1 1

patttt cbp

c

p

b

abc

a =∀=Γ+ ∑∑= =

ωωω ω

,)()(~11 1 1

∑∑∑∑== = = ∂

∂+Γ=Γ

p

ac

iba

i

p

k

p

j

p

i

ikj

ai

jb

kc

abc

BBBBBω

θω )(for)( 1θaa

a

iiaB ϕω

ωϕ

=∂

∂=

−

}:))(()({ εεϕ <<−= ttt θω

),...,1()())(()(1

pattBtp

i

iai

a =∀= ∑=

θω θ

),...,1()()())(()())(()(1 11

patttBttBtp

j

p

i

jij

ai

p

i

iai

a =∀∂

∂+= ∑∑∑

= ==

θθθ

θω θθ

⎟⎟⎠

⎞⎜⎜⎝

⎛

∂∂

= i

aaiB

θϕ

23

24

What reference books?Foundations of Differential Geometry (Wiley Classics Library) Shoshichi Kobayashi, Katsumi Nomizu

Methods of Information Geometry Shun-Ichi Amari , Hiroshi NagaokaAmer Mathematical Society (2001)

Differential Geometrical Methods in StatisticsShun-Ichi Amari 1985 年 Springer

24

25

Statistical model and information

Statistical model }:),()({ Θ∈== θθxxθ ppM

Score vector ),(log),( θxθxsθ

p∂∂=

Space of score vector }RI:),({ T psT ∈⋅= αθαθ

Fisher information metric ),(}{),( θθθ TvuuvEvug ∈∀=

Fisher information matrix }),(),({ Tθxθxθθ ssEI =

)RI( p⊂Θ

25

26

Score vector spaceNote 1：

0),(),(

),(),(),(),()(T

T

==

==⇒∈

∫∫∫xθxθxsα

xθxθxsαxθxθxθθ

dp

dpdpuuETu

βαβxθxθxsθxsα

xθxθxθx

θ

θθ

Idp

dpvuvugTvuTTT }),(),(),({

),(),(),(),(,

==

=⇒∈

∫∫

})),((,0)),((:),({ ∞<== θxθxθx θθθ tVtEtS

The space of all random variables with mean 0 and finite variance

Note 2: is an infinite dimensional vector space that includesθS θT

),(),(For θxsθx Tu α=

,....2,1},{ ),(...

),(...

},),({),(11

=∂∂

∂−

∂∂∂

− kuEuuEuiki

k

iki

kkk θxθxθxθx θθ θθθθ

belong to θS

26

27

Linear connections

),,1()( }{ '1'

'e

pkjissEg ijk

p

i

iiikj ≤≤=Γ

∂

∂∑= θ

θθ

),,1()( }]{}{[ ''1'

'm

pkjisssEssEg ijkijk

p

i

iiikj ≤≤+=Γ

∂

∂∑=

θθθθ

e-connection

m-connection

)PI,(),()()1()()( ∈+−= qptqptp mt xxx

)PI,(,)}({)}({)( 1)( ∈= − rqqrcr sss

es xxx

m-geodesic

e-geodesic

Score vectorpiis 1)),((),( == θxθxs

27

28

Geodesical modelsDefinition

))1,0(()()()(),( 1 ∈∀∈⇒∈ − tMqpcMqp ttt xxxx

(i) Statistical model M is said to be e-geodesical if

(ii) Statistical model M is said to be m-geodesic if

))1,0(()()()1()(),( ∈∀∈+−⇒∈ tMqtptMqp xxxx

Note : Let P be a space of all probability density functions.

∫ −= xxx dqpc ttt )()(/1 1where

By definition P is e-geodesical and m-geodesical.

However the theoretical framework for P is not perfectly complete.

Cf. Pistone and Sempi (1995, AS)

28

29

Two type of modeling

})};()(exp{)(),({ 0)e( Θ∈−== θθxtθxθx κTppM

})}({)}({:)({0

)m( xtxtx pp EEpM ==

Exponential model

Matched mean model

})(:RI{ ∞<∈=Θ θθ κp

Let p0(x) be a pdf and t (x) a p-dimensional statistic.

})};()()(logexp{)(),({

1 00

)e( Θ∈−== ∑=

θθxxxθx κθ

I

i

ii p

pppM

}:)()()1(),({1

01

)m( Θ∈+−== ∑∑==

θθθI

iii

I

ii pppM xxθxMixture model

})(:RI{ ∞<∈=Θ θθ κp

Let be a set of pdfs.},...,1,0:)({ Iipi =x

)},...,1(0,10:),...,({1

1 Iii

I

iiI =∀><<==Θ ∑

=

θθθθθ

Exponential model

29

30

Statistical functional

A functional f (p) is called a statistical functional if p(x) is a pdf.

A statistical functional f (p) is said to be Fisher-consistent for a model

,)()1( Θ∈pf

)()()2( Θ∈∀= θθf θp

if f (p) satisfies that}:)({ Θ∈= θxθpM

For a normal model }RIRI),(},2

)(exp{2

1)({ 22

2

2+×∈=

−−== σμ

σμ

πσθxθ

xpM

Tdxxxpdxxdxxxpp )))((,)(()( 22∫ ∫∫ −=f

is Fisher-consistent for θ = (μ, σ 2 ).

Example.

30

31

TransversalityLet f (p) be Fisher consistent functional.

Is called a leaf transverse to M.

})(:{)( θffθ == ppL

}{)()1( θθ f pML =∩

)()2( fθLΘ∈

⊕θ

is a local neighborhood.

)( fθL

)(* fθL

θp

*θp

M

31

32

Foliation structure

)( fP θLΘ∈

⊕=θ

))((),(),( fP θθθθLTvMTuTt ppp ∈∃∈∃∈∀

),(),(),(such that θxθxθx vut +=

))(()()( fP θθθθLTMTT ppp ⊕=

)( fθL

M

θp

),( θxt ),( θxu

),( θxv

Foliation

Decomposition of tanget spaces

Statistical model }:),()({ Θ∈== θθxxθ ppM )RI( p⊂Θ

32

33

Transversality for MLE

})};()(exp{)(),({ 0)e( Θ∈−== θθxtθxθx κTppMExponential model

Hence the foliation associated with fML is a matched mean model.

})(:RI{ ∞<∈=Θ θθ κp

For the exponential model the MLE functional)e(M

)},({logmaxarg:)(ML θxfθ

pEp pΘ∈

=

}{ )}({)}({arg)( ),(ML xtxtf θθ

⋅Θ∈

== pp EEp

is written by

})}({)}({:{)( ),(ML xtxtf θθ ⋅== pp EEpL

Let p0(x) be a pdf and t (x) a p-dimensional statistic.

33

34

Maximum likelihood foliation

),(),(),( KLKLKL baba rpDppDrpD θθ +=

)( MLfθL

θp

)e(M

1p 2p

3p

2r

1r)2,1,3,2,1( == ba

34

35

Estimating function

),( θxu

Statistical model }:),()({ Θ∈== θθxxθ ppM )RI( p⊂Θ

p-variable function is unbiased

)(0)}),({(det,}),({ ),(),( Θ∈∀≠∂

∂= ⋅⋅ θ

θθxuθxu θθ pp EE 0

def⇔

The statistical functional }),({solvearg)( 0==Θ∈

θxufθ

pEp

is Fisher-consitent, and the leaf transverse to M

}),(:{)( 0== θxufθ pEpL

is m-geodesical.)( fθL

Θ∈⊕

θ

35

36

Information Geometry

}:),({ Θ∈= θθxpMStatistical model

Information metric (Rao, 1945)

∫ = 1),(stpdfais),(where dxxpxp θθ

)},(),({)( ),( θθθ θ xexeEg jixpij =

Dual connection (Amari, 1982)

),(log),(where θθθ

xpxe ii ∂

∂=

)()( ),(,e

kjixpkij eeEθθθ

∂

∂=Γ

)()()( ),(),(,m

kjixpkjixpkij eeeEeeE θθθθ +=Γ∂

∂

Exponential connection

Mixture connection

36

37

Mixture and exponential model

}:),({ Θ∈= θθxpM

Mixture model (mixture geodesic space)

}1,0:),...,{(,)(),(0

00

=>=Θ= ∑∑==

K

iiiK

K

iii xpxp θθθθθθ

}:),({ Θ∈= θθxpM

Exponential model (exponential geodesic space)

})(:{,)}()(exp{),(1

∞<=Θ−= ∑=

θθθ ψψθθK

iii xtxp

∫= dxxtii )}(exp{log)(where θψ θ

37

38

Triangle with KL divergence

)10()()()( 1e≤≤= − sxqxrzxr ss

ssExponential geodesic

Let p, q and r in M

)10()()1()()(m≤≤−+= txqtxptxptmixture geodesic

∫= dxxqxpxpqpD)()(log)(),(KLKL divergence

}),(),(),(){1)(1( KLKLKL rqDqpDrpDst −−−−=

∫ −−= })log)(log{(emst rqqp

∫ −−−−= })log)(log{()1)(1( rqqpts

),(),(),(e

KLm

KLem

KL stst rqDqpDrpD −−p

q

r

mtp

esr

38

39

Pythagorean Theorem

)(m xps

)(xp

)(xq

)(xr

)(e xr t

),(KL rpD

),(KL rqD

),(KL qpD

),(),(),(e

KLm

KLem

KL stst rqDqpDrpD +=

),(),(e

KLem

KL sst rqDrpD ≥

),(),( mKL

emKL qpDrpD tst ≥ (em algorithm, Amari, 1995)

39

40

Minimum divergence geometryD : M×M → R is an information divergence on a statistical model M

(i) D(p,q) ≧ 0 with equality if and only if p = q

(ii) D is a differentialble on M×M

Then we get a Riemannian metric and dual connections on M (Eguchi, 1983,1992)

)|(),()( YXDYXg D −=

))(()|(),()( MXZZXYDZYg XD ∈∀−=∇

))(()|(),( *)( MXZXYZDZYg XD ∈∀−=∇

pqqqp qpDYXZpXYZD == |),())(|(

⇒⇐

Let

40

41

)(),()|(),( )()( ZZYfgZXYfDZYg XD

XfD ∀∇=−=∇

)|()|()|(),()( XYDXYDYXDYXg D ⋅=⋅=−=

0)|],([)|()|(),(),( )()( =⋅=⋅−⋅=− YXDYXDXYDXYgYXg DD

)(Dg

sconnection dual areand *XX ∇∇

)(),)(()|)(()),(( )()( ZZYXfYfgZfYXDZfYg XD

XD ∀+∇=−=∇

is a Riemann metric

)( *21

XX ∇+∇=∇Xwhere

),(),(),(),(),( )()(*)()()( ZYgZYgZYgZYgZYXg XD

XD

XD

XDD ∇+∇=∇+∇=

Remarks

0)|( =⋅XD

41

42

∫ −= ))}(()({),( qUqpqpLU ξξU cross-entropy

thatsuch),,( tripleaLet ξuU U is a convex function,

U entropy

1,' −== uUu ξ

),()(),( qpLpHqpD UUU −=U divergence

)( pξ )(qξ

))(( pU ξ

))(( qU ξ

U divergence

)}({maxarg)( tUtsst

−=∞<<∞−

ξ

))(()()(where * sUsssU ξξ −=

)}({maxarg)( * sUsttus

−=∞<<∞−

∫== )(),()( * pUppLpH UU

42

4343

Example of U divergence

∫∫ −=−−−= )log(log)}log(log{),(KL qpppqppqqpD

}{ )(1

11),( ∫ −

+

+−+−=

β

ββ

β

βββ

pqppqqpD

KL divergence

Beta (power) divergence

))log(),exp(),(exp())(),(),(( sttstutU =ξ

)( 1,)1(,)1())(),(),(( /11

/)1(

β

ββ

β

ββ

βββ ββξ −

++

=+

+ sttstutU

Note )exp()(lim0

ttU =→ ββ

),(),(lim KL0qpDqpD =

→ββ

)}()({))(())((),( pqppUqUqpDU ξξξξ −−−= ∫

43

4444

Geometric formula with DU

)|(),()( YXDYXg UU −=

))(()|(),()( MXZZXYDZYg UXU ∈∀−=∇

))(()|(),( *)( MXZXYZDZYg XD ∈∀−=∇

s.t.),,( )(*)()( UUUg ∇∇

dxxqxqg iiji

U )),((),(()( θξθθθθ∫ ∂

∂

∂

∂=)

dxxqxq kjikijU )),((),(( 2

,)( θξθθ

θθθ∫ ∂

∂

∂∂

∂=)Γ

dxxqxq jikkijU )),((),((

2

,)(* θξθθ

θθθ ∂∂

∂

∂

∂∫=)Γ

2005)Eguchi,()(m

)( UU ∀∇=∇

)divergence KL(exp)( =⇒⇐= Ugg U

44

4545

Triangle with DU

])1,0[]1,0[),(()},(),(),(){1)(1(

),(),(),( )()()()(

×∈∀−−−−=

−−

tsrqDqpDrpDst

rqDqpDrpD

UUU

UsU

mtU

Us

mtU

p

q

r

)(mtp

)(Usr

)()()1()(m

xxx tqptpt +−=

)( ))(())(()1()()(s

Us qsrsur κξξ ++−= xxx

mixture geodesic

U geodesic

),(),(),(

}{}{)()()()(

)()(

UsU

mtU

Us

mtU

Usq

mt

rqDqpDrpD

rp

+=

⇒⊥

45

46

Light and shadow of MLE

1．Invariance under data-transformations

2．Asymptotic efficiency

1. Non-robust

Sufficiency and efficiencylog exp

ξ u

2. Overfitting

Log-likelihood on exponential family

46

1

Learning with Information Divergence

GeometryShinto Eguchi

andOsamu Komori



47

2

Outline







48

3

Information divergence class and

robust statistical methods II

49

4

Light and shadow of MLE

1．Invariance under data-transformations

2．Asymptotic efficiency

1. Non-robust

Sufficiency and efficiency

log exp

ξ u

2. Over-fitting

Log-likelihood on exponential family

Likelihood method

U-method

50

5

))(()}({),( qUqEqpL pU ξξ −=

U-entropy

U-cross entropy

))(()}({)( pUpEpH pU ξξ −=

Let U(t) = exp(t) ．

.1

)1()(Let

1

ββ β

β

++

=

+

ttU

)}({log)( XpEpH pU =

}1)({)(β

β −=

XpEpH pU

),,,( ΞξuUTake a quadruplet

U-entropy

Example 2.

Example 1.

51

6

U-divergence

),()( qpLpH UU ≥Information inequality

),()(),( qpLpHqpD UUU −=U-divergence

),()( qpLpH UU −

)( pξ )(qξ

))(( pU ξ

))(( qU ξ

0)}()()){(())(())(( ≥−−−= pqpupUqU ξξξξξ

}log{log),(KL pqppqqpD −−−=

ββ

ββββ

β)(),(

1

11 qpppqqpD −−

−=

+

++

KL-divergence

β-power divergence

52

7

Max U-entropy distribution

})}({:{ τXtΓτ == PEpEqual mean space

Let us fix a statistics t (x).

)(maxarg)(* pHp Up τΓ

x∈

=

))()(()( T* θxtθx κ−= up

)(})({

|)}1)1(())1)((())1(({

T*

0**T*

τ

ε

κξ

εελεεεεε

Γ∈∀+−=

−+−−+−−−+−∂∂

=

qpq

qpqpqpH

tθ

τtθ

Euler-Lagrange’s 1st variation

)()())((Hence T* θxtθx κξ −=p

U-model

53

8

))(()}({),()( θθθθ qUqEqpLL pUU ξξ −==

U-estimate

U-loss function

)(minargˆ emp θθθ

UU LΘ∈

=U-estimate

)},(),({),(),(),( θXθXθxθxθxsθ

swEsw qU −=U-estimating function

．function score is),(,))((')(),( where θXsxxθx θθ qqw ξ=

))(())((1)(1

empθθ xθ qUq

nL

n

iiU ξξ +−= ∑

=U-empirical loss function

Let p(x) be a data density function with statistical model )(xθq

54

9

Γ -minimax

})}({:{ τXtΓτ == PEpEqual mean space

),(maxmin)*,*(),(minmax qpLppLqpL UUUpqqp ττ

ττττ Γ∈Γ∈Γ∈Γ∈

==

∫ =−

−=

τxtθxt

xtθx

θ

θτ

))(()(

,))(()(* T

Twhere

κ

κ

u

up

)(maxarg* pHp Up τ

τΓ∈

=

τΓ

τ'Γ

UM

τΓ

'τΓ

}{τ

*τp

*'τp

Let us fix a statistics t (x).

55

10

U-model

}:))(()({ Θ∈−== θxtθx θθ κTU uqMU-model

mean parameter is)}({ xtτθqE=U-estimator of ∑

=

=n

iiU n 1)(1ˆ xtτ

))(())((1)(1

empθθ xθ qUq

nL

n

iiU ξξ −= ∑

=

))((})({1 T

1

Tθθ xtθxtθ κκ −−−= ∑

=

Un

n

ii

0 )}({)(1)(1

emp =−= ∑=

∂∂ Xtxtθ

θθ q

n

iiU E

nL

τ'Γ

UM

τΓ

'τΓ

*τp

*'τpτΓ

UM

).,()ˆ()( ˆempemp

θθθθ qqDLLU

UUUU =−

We observe that

which implies

Furthermore,

56

11

U-Boost density learning

)2())1((minarg),( 1]1,0[),(

TkL kUkk ≤≤+−= −×∈

πφξπφπφπ D

},...,1:)({ )( Mjj == xD φDictionary of ξ-densities

)(minarg11 φφξφ

UD

L∈

==

)()()1()( 1 xxx kkkkk φπξπξ +−= −U-boost

)(minarg*)cov(

ξξξ

ULD∈

=Goal is to find

U(t) = t 2 (Klemela, ML, 2006) U(t)=exp(t) (Friedman et al., JASA,1998)

57

12

Proposed density estimator

))()(()( T θxtθx kuf −=

)1()1(with),...,( 11 TkkkT πππθθθ −−== +θ

T1 ))(,),(()( xxxt Tφφ=

U-estimate under U-model

*ξ

1−kξ

kξ kφ

Nj

jD 1)( }{ == φ

)cov( D

58

13

*ξ

1−kξ

kξ kφ

1+kφ

1+kξ

59

14

*ξ

1−kξ

kξ kφ

1+kφ

1+kξ2+kξ

2+kφ

∞→→ kk as*ξξ

60

15

Statistical machine learning

Brain function

LearningData sets

Signal processingPattern recognition

Vapnik (1995)

Hastie, R. Tibishirani, J. Friedman (2001)

・・・・

61

16

))}(()({)( θθθ qUqpLU ξξ −= ∫

Minimum U divergence

U-loss function

)(minargˆ emp θθθ

UU LΘ∈

=MinimumU-estimator

)},(),({),(),(),( θXθXθxθxθxsθ

swEsw qU −=U-estimating function

．function score is),(,))((')(),( where θXsxxθx θθ qqw ξ=

∫∑ −==

))(())((1)(1

empθθ xθ qUq

nL

n

iiU ξξThe empirical loss

Let p(x) be a data density function with statistical model )(xθq

62

17

• Huber’s M-estimator ∑=

Θ∈

n

iix

n 1

),(1min θρθ

⎩⎨⎧

−<−−

=otherwise)sgn(

||if),(

θθθ

θρyk

kyyy

As M-estimation

• location case

• relation ∫−= ))(())((),( θθ qUxqx ξξρ θ

63

18

Influence function

Statistical functional

})),((()()),(({minarg)( ∫∫ +−=Θ∈

dxxpUxdGxpGTU θξθξθ

0

)(),(IF=

⎥⎦⎤

⎢⎣⎡∂∂

≡ε

εεGTxT

xFG δεε θε +−= )1(

)}],(),({),(),()[(),(IF 1 θθθθθ θ xSxwExSxwJxT −= −ΨΨ

Influence function

||),(IF||sup)(GES xTTx

ΨΨ =

64

19

Efficiency

Asymptotic variance

))()()(,0()ˆ( 11 −−⇒− θθθθθ UUUD

U JHJNn

)),(),(()(

),),(),(),(()(

θθθ

θθθθ

XSXwVarH

XSXSXwEJ

U

TU

=

=

111 )()()()( −−− ≤ θθθθ UUU JHJI

Information inequality

where

( equality holds iff U(x) = exp(x) )

65

20

Normal mean

-6 -4 -2 2 4 6

-6

-4

-2

2

4

6

0.125

0.1

0.075

0.05

0.01

0

-6 -4 -2 2 4 6

-6

-4

-2

2

4

6

2.5

0.8

0.4

0.15

0.015

0

β-power estimates η-sigmoid estimates

Influence functionβ= 0

= 0.015= 0.15= 0.4= 0.8= 2.5

η = 0= 0.01= 0.05= 0.075= 0.1= 0.125

66

21

Gross Error Sensitivity

β-power estimates η-sigmoid estimates

0.1970.6942.52.040.6890.1250.4550.7530.82.210.7420.10.6780.8020.42.470.7990.0751.040.8730.152.920.8610.051.90.9720.0156.160.970.01∞10∞10

GES効率ηGES効率β

67

22

Multivariate normal

pdf is

⎭⎬⎫

⎩⎨⎧ −Σ−Σ=Σ

−−− 2

)()(exp)det)2((),,(1

21 μμπμ yyyf

Tp

尤度方程式 Ψ

),(

)(}))(({)(1

0)()(1

Σ=

⎪⎩

⎪⎨

⎧

Σ=Σ−−−

=−

∑

∑

μθ

θμμθψ

μθψ

ψcxxn

xn

Tjjj

jj

68

23

( ) ( )111 , , +++ ∑=→∑= kkkkkk μθμθ

∑∑=+ ),(

),(1

ki

jkik xw

xxwθ

θμ

( ){ },

det ),(

)()(),(1

∑

∑

Σ−

−−=Σ +

kUki

Tkikiki

k

cxw

xxxw

θ

μμθ

繰り返し重み付け平均と分散

Under a mild condition

),...1 ( )(L )(L 1 =∀≥+ kkk θθ ψψ

U-algorithm

69

24

Simulation

ε-conmination

)ˆ( 0KL θ,θD

, N , N-εG

, -

N .

. , N-εG

ε

ε

⎟⎟⎠

⎞⎜⎜⎝

⎛⎟⎟⎠

⎞⎜⎜⎝

⎛⎟⎟⎠

⎞⎜⎜⎝

⎛+⎟⎟

⎠

⎞⎜⎜⎝

⎛⎟⎟⎠

⎞⎜⎜⎝

⎛⎟⎟⎠

⎞⎜⎜⎝

⎛=

⎟⎟⎠

⎞⎜⎜⎝

⎛⎟⎟⎠

⎞⎜⎜⎝

⎛⎟⎟⎠

⎞⎜⎜⎝

⎛+⎟⎟

⎠

⎞⎜⎜⎝

⎛⎟⎟⎠

⎞⎜⎜⎝

⎛⎟⎟⎠

⎞⎜⎜⎝

⎛=

9999

00

1001

00

)1(

1004

55

250501

00

)1(

(2)

(1)

ε

ε

ε

ε

(2)

(1)

under 6516702under2439033

05.00

G . . G . .

εε

→

→

=→=KL error

with MLE θ̂

70

25

β-power estimates v.s. η-sigmoid estimates

31.640.3012.400.208.910.10

20.930.0535.300.0139.240

6.040.0014.640.000754.460.00056.700.00025

23.040.000139.240

βηKL error KL error

29.680.3012.200.205.140.106.650.05

13.910.0116.50

3.070.0033.040.0023.90.0013.190.000753.360.000516.50

(1)εG

)2(εG

71

26

Plot of MLEs （no outliers）

11σ12σ

22σ

100 replications with100 size of sampleunder Normal dis.

η- MLE (η=0.0025)β- MLE (β=0.1)MLEtrue (1,0,1)

⎥⎦

⎤⎢⎣

⎡=Σ

2212

1211

σσσσ

72

27

Plot of MLEs with ouliers

η- MLE (η=0.0025)β- MLE (β=0.1)MLEtrue (1,0,1)

)2(εG

100 replications with100 size of sampleunder

11.5

2

2.5

0

0.5

1

1.5

0.5

1

1.5

2

11.5

2

2.5

0

0.5

1

1.5

73

28

Selection for tuning parameter

Squared loss function

dyygyf 221 )}()ˆ,({)ˆ(Loss −= ∫ θθ

∫∑ +−= −

=

dyyfxfn

ii

n

i

2)(

1

)ˆ,(21)ˆ,(1)(CV ββ θθβ

)(CVminargˆ βββ

=

)ˆ,(IF1

1ˆˆ )(βββ θθθ i

i xn −

+≈−Approximate

74

29

Selection for tuning parameter

∫∑ +−≅=

dyyfxfn

V i

n

i

2

1)ˆ,(

21)ˆ,(1)(C ββ θθβ

θθ

θ ββ ∂

∂−

+)ˆ,(

)ˆ(1

1,

iTi

xfxIF

n

The third term is dominant in no outlier case,in which CV(β) has a minimum around at β = 0.When there are substantial outliers, the first and second terms are dominant and has a minimum around at β = 1Cf. GIC in Konishi and Kitagawa (1996)

75

30

⎟⎟⎠

⎞⎜⎜⎝

⎛⎟⎟⎠

⎞⎜⎜⎝

⎛.26.1-

.1-.26 varianceand

00

mean with Normal

0.0250.050.0750.10.1250.150.175

2.82

2.84

2.86

2.88

2.9

2.92

2.94 ⎟⎟⎠

⎞⎜⎜⎝

⎛⎟⎟⎠

⎞⎜⎜⎝

⎛.261.126-.126-.228

0.081-

0.054

signalMLE

⎟⎟⎠

⎞⎜⎜⎝

⎛⎟⎟⎠

⎞⎜⎜⎝

⎛.383.263-.263-1.059

0.184-

0.204

ContamiMLE

⎟⎟⎠

⎞⎜⎜⎝

⎛⎟⎟⎠

⎞⎜⎜⎝

⎛.286.134-.134-.293

0.132-

0.086 MLE-β

β

)(C βV

07.0ˆ=β

76

31

What is ICA?

Cocktail party effect

Blind source separation

s1 s2 ……… sm

x1 x2 ……… xm

77

32

ICA model

sWxW mmm 1 s.t., −× =∈∈ RR μ

(Independent signals）

(Linear mixture of signals)

0)(,,0)()()()(),...,(

1

111

====

m

mmm

SESEspspspsss ～

),,( 1 nxx

))(())((|)det(|),,( 111 mmmppWpWf μxwμxwx −−=

Aim is to learn W from a dataset

in which unknown. is)()()( 11 mm spspp =s

78

33

ICA likelihood

Log-likelihood function

|)det(|log))((log)(11

WpW jijj

m

j

n

i+−= ∑∑

==

μxw

TTm WWxWxhI

WpWxpWxF −−=

∂∂

= ))()((),,(),,(

)( )(log,,)(log)(1

11

m

mm

ssp

sspsh

∂∂

∂∂=

Estimating equation

Natural gradient algorithm, Amari et al. (1996)

79

34

Beta-ICA

β power equation ),(),,(),,(11

μμμ ββ WBWxFWxf

n

n

iii =∑

=

decompsability：

0]))}({E[

)]()}({E[

])}({E[,

=−×

−−×

−∏≠≠

XwXwp

XwhXwp

Xwp

tttt

ssssss

tqsqqq

β

β

β

μ

μμ

μ

ts ≠∀

80

35

Likelihood ICA

0.5 1 1.5 2 2.5

0.2

0.4

0.6

0.8

1

1.2

1.4

-1.5 -1 -0.5 0.5 1 1.5

-1.5

-1

-0.5

0.5

1

1.5

⎥⎦

⎤⎢⎣

⎡=−

5.01211W

１５０ signals U(0,1)× U(0,1)

Maximum likelihood

Mixture matrix

81

36

Non-robustness

-2 -1 1 2 3

-2

-1

1

2

-2 -1 1

-2

-1

1

2

５０ Gaussian noise N(0, 1)× N(0, 1)

Maximum likelihood

82

37

β power-ICA (β=0.2)

-2 -1 1 2 3

-2

-1

1

2

-10 -7.5 -5 -2.5 2.5 5 7.5

-6

-4

-2

2

4

Minimum β-power

83

38

U-PCA

)()(exp)()),exp(log()( * zzzzz Ψ−=Ψ+=Ψ ηη

∑=

−=n

iiU xrUL

1))),((log()( γμγ

2

22

||||)(||||),(

γγγ yyyr

T

−=

η-sigmoid

)},(min{minargˆ μγγμγ

UU L=

γγγγγ

γγ T

T

iSSxxr max)(tr),(min −=−∑Classical PCA

ryγ

84

39

U-PCA

),(into),(Update ** γμγμ

Tiii xxxw

xwixr

xr

))((),(),S(

),,())((

))((

μμγμμγ

γμγμψ

γμψ

−−−=

∑=

∑

,−

,−where

⎪⎩

⎪⎨

⎧

=

=

∑=

i

n

ii x,,xw

,S

)(

)(

1

*

***

μγμ

γλγμγ

85

40

Non-robustness in Classical PCA

-4 -2 2 4 6

-3

-2

-1

1

2

3

-4 -2 2 4 6

-3

-2

-1

1

2

3

-10-5

0

5

10

-10

-5

0

510

-10

-5

0

5

10

-10-5

0

5

10

-10

-5

0

510

MLE for Pc vector = (.55, .82, .01, .07, .01, .032, .10)

MLE for Pc vector = (.00, .01, .05, .04, .02, .99, .00)

86

41

Data-weights in U-PCA

20 40 60 80 100 120 140

-0.2

0.2

0.4

0.6

0.8

1

U-w

eight

Pc vector = (.55, .82, .01, .07, .01, .032, .10)

U-estimator for Pc vector = (. 64, .75, .01, .09, .01, .03, .09)

87

42

2.5 5 7.5 10 12.5 15

0.2

0.4

0.6

0.8

1

radius r

u(r)

1

Tube neighborhood in U-PCA

Ψγ̂

)−− ΨΨ γμx ˆ,ˆ( izix

88

43

Kernel methods

KMs : mapping the data into a high dimensional feature space

Use of kernel functions for vectors, sequence data, text, images.

SVM, Fisher's LDA, PCA, ICA, CCA, SIR, spectral clustering

(kernel trick, cf. Aizeman et al. 1964)

(Any kernel can be used with any kernel-algorithm)

Kernel-Machines Org (http://agbs.kyb.tuebingen.mpg.de/km/bb/)

SVM Org (http://www.support-vector-machines.org/)

(Spline method, cf. Wahba. 1979)

89

44

RKHS

ZX →Φ:

X : data space in

Z : high dimensional feature space

pRI

: kernel spectrum wrt )()(),( uxuxK Φ⋅Φ=

(kernel trick, cf. Aizeman et al. 1964)

: feature map HxKhx x ∈⋅= ),(

)(),,( xffxK H =⋅

ZXh →:

22 ||||),(||)(|| HxZ hxxKx ==Φ

90

45

Kernel data

-1

-0.5

0

0.5

1-1

-0.5

0

0.5

1

0

0.25

0.5

0.75

1

-1

-0.5

0

0.5

1

}30,...,1:{ =ixi }30,...,1:{ =ihix

91

46

Ψ-loss function

Let p be a true density function on H and m a proto model function.

Definition 1. U-loss function is

∫+−== τξξθ θθθ dmUmEmpCL PUU ))(()(),()(

We assume an exponential model defined by

)),()(exp()( T θκθθ −= hxhm

If we rewrite Ψ-loss function is

.1))(exp(satisfieswhere T∫ =− τκθκ θθ dhx

then)),(exp()( tt ξ−=Ψ

∫ −Ψ−+−Ψ= τκθκθθ θθ dhxUhxEL TTPU )))((()})(({)(

92

47

KPCA Cf. Scholkopf et al 1998 NC

Robust PCA

Wang, Karhunen, Oja (NN 1995)

Higuchi, Eguchi (NC 1998; JMLR 2004)

Xu, Yuille (IEEE NN 1995)

Hubert, Rousseeuw, Branden (Technometrics, 2005)

Robust KPCA

93

48

U-KPCA

∑=∈Γ∈

ΨΨ ΓΨ=Γn

iiOHm

mhzm xk 1,

)),,((minarg)ˆ,ˆ(

Ψ-Kernel Principal Component Analysis Cf. Haung, Yi-Ren , Eguchi (2008)

where Ψ(z) is strictly increasing function and

Remark Ψ0-KPCA = KPCA if

Example )}exp(1{)( 11 zz ββ −−=Ψ −

))(exp(1)exp(1log)( 1

2 zz

−++

=Ψ −

ηββηβ

zz =Ψ )(0

)()(lim 010zz Ψ=Ψ

→βη

β−Ψ=Ψ

→)()(lim 020

zz

2 4 6 8 10

2

4

6

8

10)(1 zΨ

)()(lim 02 zz Ψ=Ψ∞→η

1.0=β5.0=β1=β

0=β

}||)(||||{||21),,( 2*2

HxHxx mhmhmhz −Γ−−=Γ

94

49

Toy example95

50

Functional data

5 phonemes (TIMIT database) http://www.lsp.ups-tlse.fr/staph/npfda/

sh iy aadcl ao

she she dark dark water

}2000,...,1:),{( == iyfD ii 5% contamination

)15,...,1,2000,...,1(,)()(~==+= jiutftf ijijjiji δ

)15,10(Uniform~,)05.0(Bernoulli~ ijij uδ

96

51

FPCA97

1

Learning with Information Divergence

GeometryShinto Eguchi

andOsamu Komori



98

2

Outline







99

3

Information geometry on model uncertainty

100

4

Observational bias

The theory for statistical inference is formulated under an assumption of random mechanism, for example, random sample.

However, the assumption is frequently in-testable in a situation that involves observational studies. In this sense we have to make a sufficientlycautious inference.

Typically missing data comes from a variety of missing mechanism such asmissing at completely at random, missing at random, missing not at random.In particular, missing not at random bring about a serious bias in the inference.

101

5

Hidden Bias

Publication bias －not all studies are reviewed

Confounding －causal effect only partly explained

Measurement error －errors in measure of exposure

102

6

Lung cancer & passive smoking

1.00.50.3 1.5 2.0 3.0 4.0 5.0 10.0

stud

y

Odds ratio

510

1520

2530

103

7

Passive smoke and lung cancer

Log relative risk estimates (j =1,…,30) from 30 2x2tables jθ

)weighte varinancinverse theis( jj

jj ww

w

∑∑=

θθ

The estimated relative risk 1.24 with 95% confidence interval (1.13, 1.36)

104

Conventional analysis

1.00.50.3 1.5 2.0 3.0 4.0 5.0 10.0

stud

y

Odds ratio

510

1520

2530

1.24

105

9

Incomplete Data

z = (data on all studies, selection indicators)

y = (data on selected studies)

z = (response, treatment, potential confounders)

y = (response, treatment)

z = (disease status, true exposure, error)

y = (disease status, observed exposure)

y = h(z)

106

10

Level Sets of h(z)

1. One-to-one 2. Missing 3. Measurement error

4. Interval censor 5. Competing risk 6.Hidden confounder

107

11

Tubular Neighborhood

MεM

})),,((KLmin:),,({ 221 εεε ≤⋅=

∈ YYY gfg θθyΘθ

N

}:),({ Θ∈= θθyYfMModel

Copas, Eguchi (2001)

2

2

),(KL εε ≤MM

Near-model }:),,({ Θ∈= θθyY εε gM

108

12

Mis-specification

}{ ),(exp),(),,( θzθzθz ZZZ ufg εε =

0=== )(E,1)(,0)(E 2ZZZZ suuEu fff

21

}),KL(2{ ZZ fg=ε

"direction cationmisspecifi"=Zu

109

13

Near model

}{ ),(exp),(),,( θyθyθy YYY ufg εε =

],|),([E),(where yθzθy ZθY uu =

)(:By zhyzh =

h),( θzZf ),( θyYfModel

Near-modelh

),,( εθzZg ),,( εθyYg

110

14

Ignorable incompleteness

Let Y = h(Z) be a many-to-one mapping.

If Z has then Y has

∫ −=

)(1d),(),(

yhZY zzθy θff

Z is complete; Y is incomplete

),( θzZf

ZθZ onData ˆMLE ←

YθY onData ˆMLE ←

)ˆE()ˆ(E trueis ZYZ θθ =⇒f

)ˆE()ˆ(E wrongis ZYZ θθ ≠⇒f

111

15

bias),,,( from },...,{ˆMLE virtual 1 ZZZ θzzzθ ugn ε←

),,,(from},...,{ˆMLEactual 1 YYY θyyyθ ugn ε←

)()|(|),()ˆE()ˆ(E 12 −++=− nOOu εε θbθθ ZZY

)(( ,cov),cov),( 11YYYZZZYZZ ssbbθb uGuGu −− −=−=

⎥⎥⎦

⎤

⎢⎢⎣

⎡⎟⎟⎠

⎞⎜⎜⎝

⎛⎟⎟⎠

⎞⎜⎜⎝

⎛⎯⎯ →⎯⎟

⎟⎠

⎞⎜⎜⎝

⎛

−−−−

−−

−−

11

11Dist 1,ˆ

ˆ

YY

YZ

YY

ZZ

bθθbθθ

GGGG

nNn00

εε

bias

Limit distribution

112

16

Asymtotic bias

)1( ||||max min22

|λε

ε−=b

Zu

ifonly and if holds bounds The

.),(),( θysθy YY ∈u

)ˆE()ˆ(E ZY θθb −=

loss ninformatio of eigenvaluesmallest theisminλ

2/12/1 −−=Λ YZY III

113

17

Problem in estimation of bias

The nonignorable model

}{ 22 )(),(exp),(),,( 21 ερεε θθyθyθy YYY −= ufg

gives the worst case if ).,(),( T θysωθy YY =u

However is inestimable and untestable: ω

The profile likelihood ∑=

∈=

n

iigPL

1

)},,,({logmax),( εε ωθyω YΘθY

is flat at 0=ω

114

18

Heckman model for MNAR

))(()1( T1T, xβxψX|R −+Φ== − yrg Y σωεσ+= xβTy

),,()(),,( )( rthrt r xzxz ==

ω

115

19

From pure misspecification

Unbiased perturbed biased perturbedh

116

20

The worst case for bias

),(),(),( T2 θbθbθ ZYZZ uGuu =β

)( 11,cov),( YYZZZZ ssθb −− −= GGuu

ddddθθ

YZYY

YZYYZYYZZ GGGG

GGGGGGGuu)(

)()(),(),( 11T

1111T*22

−−

−−−−

−−−

=≤ ββ

dd

szhsdz

ZY

ZZYYZ

)(

))(()(iff attainsBound

11T

11T* }{

−−

−−

−

−=

GG

GGu

Cauchy-Schwartz’s inequality

)()(

)(11T

11T* )(

ysdd

dy Y

ZY

ZYY −−

−−

−

−=

GG

GGu εM

M

117

21

The worst case

),,,( εωθyYg

),( θyYf

),( ωθy YY If ε+ ).,(),( T θysωθy YY =u

*)),(),,,,(KL(minarg*

θωθωθ YYΘθ

Y ⋅⋅=+∈

fgI εε

118

22

Sensitivity analysis

}{ TT

21),(exp),(),,( ωωθysωθyωθy YYY YIfg −=

θθysωθysωθy

θY

YY ∂∂

+=∂∂ − ),(),()},,(log{ 2/1T

YIg

The most sensitive model

Estimating function of θ with fixed ε, ω

Yθ̂}const.:ˆ{ T

, =ωωθ Yω IY

119

23

Behavior of two MLEs

⎟⎟⎠

⎞⎜⎜⎝

⎛⎟⎟⎠

⎞⎜⎜⎝

⎛

−−−

⎟⎟⎠

⎞⎜⎜⎝

⎛→⎟

⎟⎠

⎞⎜⎜⎝

⎛

−−

−−−−

−−−

1111

111

,,,ˆˆ

ˆ

ZYZY

ZYY

ZY

Y

bb

θθθθ

IIIIIIInNn

Dε

The two MLEs and are asymptotically normal asYθ̂ Zθ̂

⎟⎠⎞

⎜⎝⎛→=−− −11,)ˆˆ(|)ˆ( Zuuθθθθ ZYY InN

D

Note：this asymptotic expression is valid only when )( 2/1−= nOε

Note： The conditioning cancels selection bias.

120

24

Scenarios A, B, C

Inference from using fYnyy ...,,1

}.)()ˆ()ˆ(:{)(C 2αrkIk T ≤−−= YYY θθθθθ

Scenario A: 10 =⇒= Akε

Scenario C: 1unknown0 >⇒> Ckε

Scenario B:acceptable

n

found had and

,..., observed had weif,0 1 zz>ε

CBA kkk <<⇒

121

25

Scenarios A and C

),0(~)ˆ(2/1 INI fYθθYY −

}.)()ˆ()ˆ(:{)(C 2αrkIk T ≤−−= YYY θθθθθ

Scenario A: ⇒=0ε

Scenario C:),0(~)ˆ(

unknown02/1 INI gY

bθθYY ε

ε

−−

⇒>

?!)(,1 AA kCk =

?!)(,1 22CC kCk κε+=

122

26

Scenario B

andˆ MLEhave couldwe,,..., observe could weIf 1 Zθzz n

),0(~)ˆˆ()( 2/12/1 INII fYZYY θθU −Λ−= −

),0(~|)(* INgYUSUS =

)}ˆˆ()ˆ{()ˆ( 2/12/1ZYYZZZ θθθθθθS −−−=−= II

Conditional confidence interval

}||)(||:{)( 22*αrC ≤= uSθu

123

27

ノンランダムネスの仮定

),(~)ˆˆ( 11 −− −− ZYZY Yθθ IINn f 0

),(~)ˆˆ(|)ˆ( 1−=−− ZZYY uuθθθθY

INg

なので，条件付信頼領域は

})ˆ()ˆ(:{)( 21αα rIC T ≤−−−−= − uθθuθθθu YZY

})(:{ 2111αα rIIB T ≤−= −−− uuu ZY

なので，仮説 H：バイアス b = 0 の検定領域は

仮説 H：バイアス b = 0 が水準 α で採択される下での

条件付信頼領域の和集合は，

∪α

αB

CC∈

=u

u)(

)(uαC

αB

C

124

28

Theorem

}.)()ˆ()ˆ(:{)(CLet 2αrkIk T ≤−−= YYY θθθθθ

).2()()1( Then22||||

CCCr

⊆⊂≤

uu α

∪

-1.5 -1 -0.5 0.5 1 1.5

-1.5

-1

-0.5

0.5

1

1.5

-1.5 -1 -0.5 0.5 1 1.5

-1.5

-1

-0.5

0.5

1

1.5

-1.5 -1 -0.5 0.5 1 1.5

-1.5

-1

-0.5

0.5

1

1.5

125

29

-1 -0.5 0.5 1

-1

-0.5

0.5

1

-1 -0.5 0.5 1

-1

-0.5

0.5

1

-1 -0.5 0.5 1

-1

-0.5

0.5

1

-1 -0.5 0.5 1

-1

-0.5

0.5

1

)001.0,001.0(),( 21

=λλ

)5.0,5.0(),( 21 =λλ)1.0,1.0(),( 21 =λλ )9.0,1.0(),( 21 =λλ

s.t.),( if attainable is boundupper The21

jiji λλ ≤≤∃

αB

})ˆ()ˆ(var)ˆ(:{)( 21Tαα rrCR A ≤−−= − θθθθθθ

)2()( αα rCRCrCR ⊆⊆

1-dimensional case P = 5% → 0.3%

126

30

-1 -0.5 0.5 1

-1

-0.5

0.5

1

Double-the-variance rule}:);({ Θ∈= θθyfMStatistical model

Random sample Mfn ∈);( ,...,iid

1 θyyy ～

α % confidence interval

})ˆ()ˆ(var)ˆ(:{)( 21Tαα rrCR A ≤−−= − θθθθθθ

)( αrCR )2( αrCR

M }{ εM

127

31



Square root rule95% confidence interval (1.08, 1.41)

128

32

Risk from passive smoke

129

Root-2-rule

1.00.50.3 1.5 2.0 3.0 4.0 5.0 10.0

stud

y

Odds ratio

510

1520

2530

1.24

130

34



Square root rule95% confidence interval (1.08, 1.41)

131

35

Reference

Eguchi & Copas (1998) JRSSB (near-parametric)

Copas & Eguchi (2001) JRSSB (ε -perturbed model)

Copas & Eguchi (2005) JRSSB (discussions) (Double-the-variance rule)

Eguchi & Copas (2002) Biometrika (Kulback-Leibler divergence)

Henmi, Copas & Eguchi (2007) Biometrics (Meta analysis)

Henmi & Eguchi (2004) Biometrika (Propensity score)

Statistical Analysis With Missing Data. R. J. A. Little, D. B. Rubin. Wiley (2002)

S. Greenland. JRSSA (2005) (Multiple-bias modelling)

Copas & Eguchi (2010) JRSSB (statistical equivalent models)

132

36

Present and Future

Does all this matter?

Statistics ( missing data, response bias, censoring)

Biostatistics (drop-outs, compliance)

Epidemiology ( confounding, measurement error)

Econometrics (identifiability, instruments)

Psychometrics (publication bias, SEM)

causality, counter-factuals, ...

133

1

Learning with

Information Divergence Geometry

Shinto Eguchiand

Osamu Komori



134

2

Outline







135

3

Boosting leaning algorithm

and U-loss functions I

136

4

Which chameleon wins?

137

5

Pattern recognition…

138

6

What is pattern recognition?

□ There are a lot of examples for pattern recognition.

□ In principle pattern recognition is a prediction problem for class label

which would classify a human interestingness and importance.

□ Originally human brain wants to label phenomena a few words , for

example (good, bad), (yes, no), (dead, alive), (success, failure), (effective,

no effect)….

□ Brain intrinsically predicts the class label from empirical evidence.

139

7

Practice

● Character recognitionvoice recognition

Image recognition

☆ Credit scoring

☆ Medical screening

☆ Default prediction

☆ Weather forcast

●

●

● face recognition

finger print recognition●

speaker recognition●

☆ Treatment effect

☆ Failure prediction☆ Infectious desease

☆ Drug response

140

8

Feature vector & class label

),...,( 1 pxx=xFeature vector

Class label y

Feature space pR⊆X

},,1{ GLabel set

Training data

Test data

}...,,1:),({train niyD ii == x

}...,,1:),({ testtesttest mjyD jj == x

141

9

Classification rule

),...,( 1 pxx=x },,{ Gy 1∈

zyF →),(: x

Classification rule ),(maxarg)(},,1{

yFhGy

F xx∈

=

Feature vector Class label

yh →x:classifier

Discriminant function (score)

3)( =xFh

1 2 3 4 5

)1,( xF)2,( xF

)3,( xF

)4,( xF)5,( xF

)( R∈z

Consider a case of G = 5 for a given x

142

10

Binary classification

),...,( 1 pxx=x }1,1{ +−∈y

zF →x:

classifier )}(sgn{)( xx FhF =


score

)1,()1,()( −−+= xxx FFF

In a binary class y = −1, 1 (G=2)

)1,()1,(1))((sgn)1,()1,(1))((sgn

−<+⇔−=−>+⇔=

xxxxxx

FFFFFF

),(maxarg))(sgn(}1,1{

yFFy

xx−∈

=

143

11

Multi -class

),...,( 1 pxx=x },,{ Gy 1∈

zyF →),(: x

),(maxarg)(},,1{

yFhGy

F xx∈

=

yh →x:

3)( =xFh

1 2 3 4 5

)1,( xF)2,( xF

)3,( xF

)4,( xF)5,( xF

)( R∈z

G = 5,

Classifier


Score function

Classifier by score

144

12

Probability distribution

).,( vector random of pdf a be),(Let yyp xx

∫ ∑∈

=B Cy

ypCBp }d),({),( xx

∫= Xxx d),()( ypypMarginal density

Conditional density

∑∈

=},...,1{

),()(Gy

ypp xx

)(),()|(

xxx

pypyp =

)(),()|(

ypypyp xx =

)()|()()|(),( xxxx pypypypyp ==

)1|()1|(

)1,()1,(

−==

=−==

ypyp

ypyp

xx

xx

145

13

Error ratepR∈x },,{ Gy 1∈

Classifier )( xFh

Feature vector ，class label

has error rate

))(Pr()(Err yhh FF ≠= x

∑∑=≠

==−====G

iF

jiFF iyihjyihh

1),)(Pr(1),)(Pr()(Err xx

Training error

}...,,1:),({for train niyD ii == x

}...,,1:),({for testtesttest mjyD jj == x

nyhih iiF

F})(:{#)(Err train ≠

=x

Test error

myhj

h jjFF

})(:{#)(Err

testtesttest ≠

=x

146

14

False negative/positive

FNP )11)(Pr( )FN( +=−== y|hh FF x

)11)(pr( )FP( −=+== y|hh FF x

True Negative

True Positive False Positive

False Negative

1+=y 1−=y

1)( +=xFh

1)( −=xFh

FPR

)1pr()FP()1pr()FN( )Err( −=++== yhyhh FFF

147

15

Bayes rule

Let p(y|x) be a conditional probability given x.

.)|1(

)|1(log)(Define 0 xxx

−==

=ypypF

The classifier leads to Bayes rule. ))(sgn()( 0Bayes xx Fh =

For any classifier h

)(Err)(Err Bayes hh ≤

Note： The optimal classifier is equivalent to the likelihood ratio.However, in practice p(y|x) is unknown, so we have tolearn hBayes(x) based on the training data set.

Theorem 1

148

16

Discriminant space associated with Bayes classifier is given by

{ }{ })|1()|1(: }1)(:{

)|1()|1(: }1)(:{

BayesB

BayesB

xxRxxRx

xxRxxRx

−=<+=∈=−=∈=

−=≥+=∈=+=∈=−

+

ypyphR

ypyphRpp

pp

},{ −+ RR

Error rate for Bayes rule

In general, when a classifier h associates with spaces

)}(Err1{)}(Err1{

)|1()()|1()(

)|1()()|1()(

)|1()()|1()(

)|1()()|1()()(Err)(Err

Bayes

\\\\

\\\\

)()(

)()(

)()(

)()(

BB

BB

Bayes

hh

dyppdypp

dyppdypp

dyppdypp

dyppdypphh

RRRR

RRRRRRRR

RRRRRRRR

RRRR

BBBB

BBBB

−−−=

+=−+−=−=

+=−+−=−≥

−=−++=−=

−=−++=−=−

∫∫∫∫

∫∫∫∫

∫∫∫∫

∫∫∫∫

++−−

++++−−−−

++++−−−−

++−−

xxxxxx

xxxxxx

xxxxxx

xxxxxx

)(Err)(Err Bayes hh ≥

−B−−

BRR \ −− ∩ BRR

−R −BR

−− RRB \

149

17

Multi-normal distribution

p-variate normal (Gaussian) distribution is defined by the pdf

)}()(21exp{

)det()2(1),,( 1

2/12/ μxVμxV

Vμx −−−= −Tpπ

ϕ

pdf a has),( that Asuume yx

}),...,1{,(),,()(),( GyVypyp py ∈∈= Rxxx μϕ　

- 4- 2

02

4- 4

- 2

0

2

4

00 . 2 5

0 . 50 . 7 5

1

- 4- 2

02

4

G = 2

(Assumption for equal variance matrix)

150

18

Bayes classifier

01Bayes )|1()|1(log)( α+=

−=+=

= xαxxx T

ypypF 　

)1()1(log)(

21,)( 1

111

110

1111 −=

+=+−=−= −

−−+

−+

−−+ yp

ypTTT μVμμVμVμμα α

We call this the Fisher linear discriminant function

∑

∑

∑

∑

=

=−

=

=+

−=

−==

+=

+== n

ii

n

iii

n

ii

n

iii

yI

yI

yI

yI

1

11

1

11

)1(

)1(ˆ,

)1(

)1(ˆ

xμ

xμ

∑∑==

=−−=n

ii

n

ii

Ti nn 11

1ˆ),ˆ()ˆ(1ˆ xμμxμxV

)(in Plug Bayes xF

151

19

Boost learning

Boost by filter (Schapire, 1990)

Bagging, Arching （bootstrap）(Breiman, Friedman, Hasite)

AdaBoost (Schapire, Freund, Batrlett, Lee)

Weak learners can be combined into a strong leanrner?

Weak learner (classifier) = error rate is slightly less than .5

Strong learner (classifier) = error rate is slightly less than that Bayes calssifier

152

20

Web-page on Boost

http://www.boosting.org/

http://www.fml.tuebingen.mpg.de/boosting.org/tutorials

R. Meir and G. Rätsch. An introduction to boosting and leveraginghttp://www.boosting.org/papers/MeiRae03.pdf

R.E. Schapire. A brief introduction to boosting. In Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence, 1999.http://www.boosting.org/papers/Sch99e.ps.gz

Robert Schapire’s home page:http://www.cs.princeton.edu/~schapire/Yoav Freund's home page :http://www1.cs.columbia.edu/~freund/

153

21

Set of weak learners

{ }RI},1,1{},,,1{:)(sgn),,(stamp ∈+−∈∈−== bapjbxabaf jj xFDecision stumps

{ }1010

T1linear RI),(:)(sgn),( +∈=+== pf ββ ββxββxF

Linear classifiers

Neural net SVM k-nearest neighbur

Note: not strong but a variety of characters

linearstamp FF ⊆

154

22

Exponential loss function

Empirical exponential loss function for a score function F(x)

)}(exp{1)(1

exp ii

n

i

D Fyn

FL x−= ∑=

where q(y|x) is the conditional distribution given x, q(x) is the pdf of x.

Let be a training (example) set.}...,,1:),({train niyD ii == x

xxxxX

d)()}|()}(exp{{)(}1,1{

expqyqyFFL

y

E ∫ ∑−+∈

−=

Expected exponential loss function for a score function F(x)

155

23

Learning algorithm

)},( ),...,,{( 11 nn yy xx

)(,),1( 11 nww

)(,),1( 22 nww1ε

2ε

1α

2α

Tα

)()1( xf

∑=

T

1)( )(

ttt f xα

)(,),1( nww TT

)()2( xf

)()( xTf

Final learner ∑=

=T

tttT fF

1)( )()( xx α

156

24

Learning curve

50 100 150 200 250

0.05

0.1

0.15

0.2

Iteration number

Training curve

157

25

0)(),1()(:Intial.1 011

=== xFniiw n 　　　

,)'(

)())((I)(∑∑ ≠=

iwiwfyf

t

tiit xε

)(

)(121

)(

)(log)b(tt

ttt

f

f

ε

εα

−=

∑=

=T

tttTT fFF

1)( )()( where,)(sign.3 )( xxx α

Tt ,,1For .2 =

))(exp()()()c( )(1 iitttt yfiwiw xα−=+

)(min)()a( )( ff tftt εεF∈

=

Adaboost

158

26

Update weight

21)( )(1 =+ tt fε The worst case

)()( 1 iwiw tt +→

　　　　　　　

　　　　　　

t

t

eyf

eyf

iit

iit

α

α

−⇒=

⇒≠

Multiply )(

Multiply )(

)(

)(

x

x

update

Weighted error rate )()()( )1(1)(1)( +++ →→ tttttt fff εεε

159

27

21)( )(1 =+ tt fε

∑∑+

+

=+ ≠=

)'()())(()(

1

1

1)(1 iw

iwyfIft

tn

iiittt xε

∑

∑

=

=

−

−≠= n

ititit

n

itititiit

iwfy

iwfyyfI

1)(

1)()(

)()}(exp{

)()}(exp{))((

x

xx

α

α

∑∑

∑

==

=

=−+≠

≠= n

itiitt

n

itiitt

n

itiitt

iwyfIiwyfI

iwyfI

1)(

1)(

1)(

)())((}exp{)())((}exp{

)())((}exp{

xx

x

αα

α

21

)}(1{)(1

)()(

)()(1

)()(

)(1

)()(

)()(

)(

)(

)()(

)(

=

−−

+−

−

=

tttt

tttt

tt

tt

tttt

tt

ff

ff

ff

ff

f

εε

εε

εε

εεε

160

28

Update in decision stumps

Wrong example 5 5 5 6 466 5 565

}|)(|21{minarg ∑ −=

iiij yfb

bx

Decision stump⎩⎨⎧

<−>+

=−=jj

jjjjjj bx

bxfffs

if1 if1

)(where)(or)()( xxxx

jbThe j-th feature vector njj xx ,...,1

jx

161

29

Next step

Errror 数 45.5 7 9

87.56 798.5

Update foe weights:

jx

4.5

jbWeight up to 2

}|)(|)(21{minarg ∑ −=

iiijj ysiwb

bx

2log4

16log5.0ans. false of nb.ans.correct of nb.log5.01 ==⎥⎦

⎤⎢⎣⎡=α

Weight down to 0.5

jb

162

30

Update in exponential loss

)}(exp{1)(1

exp ii

n

i

Fyn

FL x−= ∑=

)()()( xxx fFF α+→

}))(1()(){(exp fefeFL εε αα −+= −

Consider

[ ]∑=

− =+≠−=n

iiiiiii yfeyfeFy

n 1

))(I())(I()}(exp{1 xxx αα

∑=

−−=+n

iiiii fyFy

nfFL

1exp )}(exp{)}(exp{1)( xx αα

)(

)}(exp{))(()(

exp

1

FL

FyyfIf

n

iiiij∑

=

−≠=

xxεwhere

163

31

Sequential optimization

}))(1()(){()( expexpαα εεα −−+=+ efefFLfFL

)()(1log

21

opt ff

εεα −

=

)}(1){(2 ff εε −≥

)}(1){(2)()(12

ffefe

f εεεε αα −+

⎭⎬⎫

⎩⎨⎧

−−

=

Inequality holds if and only if

αα εε −−+ efef ))(1()(

164

32

AdaBooost = Seq-min exponential loss

)(minarg)a( )( ff tFf

t ε∈

=

)}(exp{)()((c) 1 itittt xfyiwiw α∝+

)}(1){()()(min )()(1exp)(1exp ttttt ffFLfFL εεαα

−=+ −−∈ R

)()(1

log21

)(

)(opt

t

t

ff

εε

α−

=

)(minarg)b( )(1exp ttt fFL ααα

+= −∈ R

165

33

Proposal from machine learning

Learnability: boosting weak learners?

AdaBoost : Freund & Schapire (1997)

Weak learners (machines)

})(,....),({ 1 xx pff

Strong machine

)()( )()1(1 xx tt ff αα ++

)( xf

)()1(1 xfαForward stagewise

166

34

Simulation (complete separable)

-1 -0.5 0.5 1

-1

-0.5

0.5

1

[-1,1]×[-1,1]

Feature space

Decision boundary

}1000,,1:),({ =iyiix

]1,1[]1,1[ −×−∈ix

)2sin( 12 xx π=

}1,1{ +−∈iy

167

35

-1 -0.5 0.5 1

-1

-0.5

0.5

1

Set of linear classifiers

⎩⎨⎧

<++−≥+++

=++=0 if10 if1

)(sgn),(32211

322113221121 rxrxr

rxrxrrxrxrxxf

Linear classification machines

3321 )1,1(},,{ −Urrr ～

Random generation

168

36

Learning process

•

-1 -0.5 0 0.5 1

-1

-0.5

0

0.5

1

-1 -0.5 0 0.5 1

-1

-0.5

0

0.5

1

-1 -0.5 0 0.5 1

-1

-0.5

0

0.5

1

-1 -0.5 0 0.5 1

-1

-0.5

0

0.5

1

-1 -0.5 0 0.5 1

-1

-0.5

0

0.5

1

-1 -0.5 0 0.5 1

-1

-0.5

0

0.5

1

Iter = 1, train err = 0.21 Iter = 13, train err = 0.18 Iter = 17, train err = 0.10

Iter = 23, train err = 0.10 Iter = 31, train err = 0.095 Iter = 47, train err = 0.08

169

37

Learning process (II)

-1 -0.5 0 0.5 1

-1

-0.5

0

0.5

1

Iter = 55, train err = 0.061

-1 -0.5 0 0.5 1

-1

-0.5

0

0.5

1


-1 -0.5 0 0.5 1

-1

-0.5

0

0.5

1


170

38

Final decision boundary

-1 -0.5 0 0.5 1-1

-0.5

0

0.5

1

-1 -0.5 0 0.5 1

-1

-0.5

0

0.5

1

Contour of F(x) Sign(F(x))

171

39

KL divergence

},,1{ G=Y

For conditinal distribution p(y|x), q(y|x) given x with commonmarginal density p(x) we can write

Nonnegative function

Feature space pR⊆X

),(),(),,( YXxxx ∈∈ yyym μ

∫ ∑=

+−=X

xxxxxx

G

yyym

yymymmD

1KL d)},(),(

),(),(log),({),( μ

μμ

Note

)()|(),(),()|(),( xxxxxx pyqypypym == μ

∫ ∑=

=X

xxxxx

G

yp

yqypypmD

1KL d)(}

)|()|(log)|({),( μ

Label set

Then,

172

40

Twine of KL loss functionFor data distribution q(x, y) = q(x) q(y|x) we model as

∑=

−=G

gyFgFym

11 )},(),(exp{)|( xxx

xxxxxxx

Xd)()}|()|(

)|()|(log)|({),(

}1,1{KL qymyq

ymyqyqmqD

y∫ ∑

−+∈

+−=

Then

∑=

= G

ggF

yFym

1

2

)},(exp{

)},(exp{)|(x

xx

Log lossExp loss

173

41

Bound for exponential loss

)}({exp(E)(exp XYFFL −=

))(exp()(1

1exp ∑

=

−=n

iiin FyFL xEmpirical exp loss

Expected exp loss

).(minarg

and functionsnt discrimina all of space a beLet

expopt FLFF F

F

∈=

Theorem.

.)|1()|1(log

21)(Then opt x

xx−=+=

=ypypF

174

42

Variational calculas

⎥⎦⎤

⎢⎣⎡ −∂∂

=∂∂ ))(exp(E)(exp XYF

FFL

δ[ ]))(exp(E XYFY −=

[ ])|1())(exp()|1())(exp(E XXXX −=−+=−= ypFypF

}]{[)|1()|1())(2exp()|1())(exp(E

XXXXX

−=+=

−+=−=ypypFypF

.)|1()|1(log

21)(Hence opt x

xx−=+=

=ypypF

175

43

On AdaBoostFDA, or Logistic regression

Parametric approach to Bayes classifier

01

101)( ααα +=+= ∑=

p

jjj xF xαx

AdaBoost ∑=

=T

ttt fF

11 )()( xx α

AdaBoost. real cf. ,classifier a is itself )(Each xtf

The stopping time T can be selected according to the state of learning.

Parametric approach to Bayes classifier, but dimension and basis function are flexible

176

44

On AdaBoost (II)

1. Unbalanced examples

2. Onverleaning

EtaBoost

GroupBoost

AsymAdaBoost

LocalBoost

Robust against mislabel examples

Hi-dimension and small sample

Local learning

Balancing the false n/ps’

177

45

Simulation (complete random)

-1 -0.5 0.5 1

-1

-0.5

0.5

1

178

46

Overlearning of AdaBoost

-1 -0.5 0 0.5 1-1

-0.5

0

0.5

1


-1 -0.5 0 0.5 1-1

-0.5

0

0.5

1

-1 -0.5 0 0.5 1-1

-0.5

0

0.5

1

Iter = 151, train err = 0.06 Iter =301, train err = 0.0

179

47

U-Boost

))(())((1)(1

empθθ xθ qUq

nL

n

iiU ξξ +−= ∑

=U-empirical loss function

∑∑∑= ==

+−=n

i

G

gi

n

iiiU gFU

nyF

nFL

1 11

emp )),((1),(1)( xx

In a context of classification

Unnormalized U-loss

∑∑= =

−=n

i

G

giiiU yFgFU

nFL

1 1

(0) )),(),((1)( xx

Normalized U-loss

∑∑∑= ==

+−=n

i

G

gi

n

iiiU gFU

nyF

nFL

1 11

(1) )),((1),(1)( xx

∑=

=G

ggFu

11)),(( subject to x

180

48

U-Boost (binary)

Unnormalized U-loss

∑ ∑= ±=

−=n

i giiU FyU

nFL

1 1

(0) ))((1)( x

Note ))(()),(),((1

iig

iii FyUyFgFU xxx −=−∑±=

)}1,()1,({)(where21 −−= xxx FFF

Bayes risk consistency

)}|1())(()|1())(({)(

))(()( 1

xxxxx

xx

−=+=−∂∂

=−∂∂ ∑

±=ypFUypFU

FyFU

F y

))}(({minarg where)|1(

)|1())((

))(( **

*xyFUEF

ypyp

FuFu

F−=

−==

=− x

xx

x

F*(x) is Bayes risk consistent because 0)}({

)()()()()(

)(2 >

−−+−

=−∂

∂Fu

FuFuFuFuFu

FuF

181

49

Eta-loss function

regularized

AdaBoost with margin

FFFU ηη +−= )exp()1()( generator with

182

50

EtaBoost for Mislabels Expected Eta-loss function

)}())({exp()1()( xx yFyFEFL ηηη −−−=

Optimal score )(minarg* FLFF

η=

The variational argument leads to

ηη

ηη

2))(1(

)1()|()()(

)(

*21*

21

*21

++−

+−=

− xx

x

xFF

yF

ee

eyp

)()(

)(

**

*

1

1)(1

))(1(xx

xxx

FF

F

ee

e

++

+−= εε

ηη

ηε2))(-(1

)( where)()(

21

21

++=

− xxx

FFee

Mislabel modeling

183

51

EtaBoost

0)(),1()(: settings Initial.1 011

=== xFniiw n 　　　

,)())((I)(1

iwfyf m

n

iiim ∑

=

≠∝ xε

∑=

=T

tttTT fFF

1)( )()( where,)(sign.3 )( xxx α

Tm ,,1For .2 =

))(exp()()()c( )(*

1 iimmmm yfiwiw xα−∝+

)(min)()a( )( ff mfmm εε =

(b)

184

52

A toy example

185

53

Examples partly mislabeled

186

54

AdaBoost vs. EtaBoost

AdaBoost EtaBoost

187

55

EtaBoost

-1 -0.5 0 0.5 1-1

-0.5

0

0.5

1

Iter = 51, train err = 0.25 Iter = 51, train err = 0.15 Iter =351, train err = 0.18

-1 -0.5 0 0.5 1-1

-0.5

0

0.5

1

-1 -0.5 0 0.5 1-1

-0.5

0

0.5

1

188

1

Learning with Information Divergence Geometry

Shinto Eguchiand

Osamu Komori



189

2

Outline







190

3

Boosting leaning algorithm and U-loss functions II

191

4

GOAL

Statistical inference

Microarray

SNPs

Proteome

A variety of functions associated with genes

Statistical Learning

genetics

informatics

medicine

Modeling and Prediction forKnowledge and discovery

192

5

Project for genome polymorphism analysis

mRNA

Protein

Genome

Microarray

SNPs

Proteome

[ Gene expression ]

[ Protein expression ]

[Single Nucleotide Polymorphism ]

193

6

Project mission

Microarray

SNPs

Proteome

[ Gene expression ]


[Polymorphism ]

Drug effect / adverse effect

Disease

[ anticancer drug, aspirin ]

[Cancer, diabetes, cardiopathy … ]

Metabolic syndrome

biomarker

Genomic/protemic

Phenotype

194

7

Expression Arrays and the p n Problem

T. Hastie, R. Tibshirani 20 November, 2003

Gene expression arrays typically have 50 to 100 samples and 5,000

to 20,000 variables (genes). There have been many attempts to adapt

statistical models for regression and classification to these data, and in

many cases these attempts have challenged the computational resources.．．．．．．

>>

The Dantzig selector:Statistical estimation when p is much larger than n

E. Candes and T. Tao

Ann. Statist. 35, 6 (2007), 2313-2351.

195

8

Recent issue

Fan and Lv (2008)Sure independence screening Dantzig selector

Candes, E. and Tao, T. (2007)

p is much larger than n

次元 1 100 1000 30000

196

9

Unified view in machine learning

Dantzig selector (SVM, programming)

LASSO (LAR, Elastic net)

L2Boosting (Early stopping, εBoosting, Stagewize LASSO)

A tale of three cousins (Meinshausen, Rocha, Yu, 2007 )

L1 regularization

197

10

What is genomic data?

mRNA

Protein

Genome

microarry

SNPs

proteome

[ 遺伝子発現 ]

[ 蛋白発現 ]

[ 一塩基多型 ]

多様性代謝

翻訳

02040

0204060

0204060

204060

255075100

10203040

255075

255075100

255075100

1500 1750 2000 2250 2500

198

11

Target

Microarray

SNPs

Proteome

[ Gene expression ]


[Polymorphism ]

Drug effect / adverse effect

Disease

[ anticancer drug, aspirin ]

[Cancer, diabetes, cardiopathy … ]

biomarker phenotype

),( 1 pxx=x }1,1{ +−∈y

199

12

GeneChip® HT Human Genome U133

AffymetrixのU133 contains probes more tha 54,000,Which is utilized semicondotor technology.

For one gene a set of 11 prob pairs are designed.All the probes consist of 25-mer DNA.

200

13

Predection from microarrays

),( 1 pxx=x

}1,1{ +−∈y

Feature vector dimension p = the nb of geneseach component is of gene expression

Class label names of disease, effect of drugs, adverse effect of drug

training data }1:),({train niyD ii ≤≤= x

yf →x:ˆclassifier )(ˆˆ xfy =predictor

201

14

Leukemic diseases Golub,T. et al. (1999) Science.

The first successful result202

15

Web microarray data

7129

2000

7129p

242549Estrogen

224062Colon

353772ALLAMLy = －1y = +1n

p >> n

http://microarray.princeton.edu/oncology/http://mgm.duke.edu/genome/dna micro/work/

Open access data

×

203

16

source("http://www.bioconductor.org/biocLite.R")

biocLite("GEOquery")

library(GEOquery)

d <- getGEO(file = "GSE2034_family.soft.gz")

http://www.ncbi.nlm.nih.gov/geo/

Gene Expression OmnibusNational Center for Biotechnology Information

204

17

Clustering for successful case

205

18

Jones, M. et al. Lancet 2004

Lung cancer

Pathological Classification

Prognosisidentification

Not successful case

206

19

Prediction from mass-spectrometry

),( 1 pxx=x

Feature vector dimension p = peak numbers of molecular masscomponents express peak value

}1,1{ +−∈y

Class label names of disease, effect of drugs, adverse effect of drug

training data }1:),({train niyD ii ≤≤= x

yf →x:ˆclassifier )(ˆˆ xfy =predictor

207

20

MW (Time of Flight)

LensLaser beam source

Mirror

Ion detector

High voltage

+++

+++

+++Protein ChipFlight tube (High vacuum)

Proteome method

Koichi Tanaka，MALDI-TOF MS，2002年Nobel Prize for chemistry

[SELDI TOF/MS]

208

21

0

20

40

0

20

40

60

0

20

40

60

20

40

60

25

50

75

100

10

20

30

40

25

50

75

25

50

75

100

25

50

75

100

1500 1750 2000 2250 2500

Lungcancer

Normaltissue

Proteomic data財団法人癌研究会

209

2222

Total data

Proteomic data

Fileterd by common peaks

AdaBoost learning

203人（130人（卵巣癌），73人cotrols）

Fushiki, Fujisawa, Eguchi, 2006

210

23

Goal = Prediction score

Machine learning

Tarining data = (clinical data, genomic data)

Prediction score

Pattern recognition

Genom

ic data

Clinical data

{xi} {yi}

Sensitivity

x y

Train and test practical realization

F(x)

211

24

Concordance among Gene-Expression–Based Predictors for Breast Cancer

Fan, et al NEJM 355:560-569, 2006

Prognosis prediction for breast cancer

70 gene file : van 't Veer J, et al. Nature 2002;415(6871):530-6.Recurrence score : Paik S, et al. NEJM 2004;351:2817-26.Mechanism-derived： Chang H Y, et al. PNAUS 2005;102(10):3738-43.Proper sutype : Sorlie T, et al. PNAUS 2001;98(19):10869-74.2 gene rate： Ma X J, et al. Cancer Cell 2004;5(6):607-16.

5 studies suggest different sets of genes that are related withprognosis for breast cancer

Four of five studies show substantial performance for new validation test data (295 samples)

212

25

Nature 2002; 415: 530-6.

213

26

Single variable study

y|x ),...,1(| pjyx j =

Let x be a feature vector from genomic monitor andy outcome value or phenotype.

)|( yp x )|( yxp j

Note： The multi-vaiate analysis for (x, y) is essential different froma set of all single varaiable analyses.Basically the genome data are correlated with in biologicalnetwork, which implies the information amount is not so huge even if the number of observed genes is larger than several ten thousands.

Joint analysis single analysis

214

27

2 sample test approach

)),(,),,{( 11 nn yyD xx=Data set

is decomposed into

仮説 H：

),...,(}1:{00010 n

jjii

jj xxyx =−==x

),...,1(),...,(}1:{ 11111 pjxxyx njj

iijj ==+==x

},...,1:),{( 10 pjjj =xx

where n = n0 + n1

)1|()1|( −==+= yxpyxp jj

を考えよう．

2標本 , ．01)}{(

nkk

jx =11)}{(

nii

jx =

215

28

典型的な2群の検定量

Z スコア

スチューデント t 検定

ウィルコクソン検定

j

jjj

sxx

XZ 01)(ˆ −=

)2/())1()1(()(ˆ

200

211

01

−−+−

−=

nsnsn

xxXt

jj

jjj

)(1)(ˆ0

1 11

10

1 0

ij

n

k

n

ikjj xxI

nnXC ∑ ∑

= =>=

注意： Zスコアと t-検定はデータ変換 x → ax+b (a > 0) に不変である．

ウィルコクソン検定は任意の単調変換 F に対して不変である．

)(ˆ))((ˆ jj XCXFC =

しかし異なる j と k に対して上の検定統計量は何の関係も持たない．

)(ˆ),(ˆ),(ˆ),(ˆ jkjkjk bXXaCXXCXCXC ++

○

×

（後半II部でこのことを再考する．）

216

29

p 個の2群比較

p 個の２群比較のための検定統計量を採用して

},...,1:)(ˆ{ pjXT j =

p 個の検定統計量 T(Xj) を計算して，順番を以下のように付ける．

)(ˆ)(ˆ)(ˆ )()()1( pd XTXTXT ≥≥≥≥

を考察する．このとき，検定の多重性の問題（FDR) やランキングの

正確さの問題は重要な問題である．しかし，現在，ターゲットにしている

問題は，表現形予測の問題であるので次のフィルターリングに使われる．

1 100 1000 30000d p

)(ˆ jXT

217

30

Hierarchal clusteringclustering ⊆ unspervized learning

Hierarchal clustering

Optimal partition clustering k-means, self-organizing map

Minimum distance method

Eisen et al. (1998) PNAS

http://derisilab.ucsf.edu/data/microarray/software.html

218

31

ArrayMaker Version 2

Gal File Maker v1.2

Cluster:

Tree View:

J-Express:

http://derisilab.ucsf.edu/data/microarray/software.html

219

32

LASSO

),...,1(T niy iii =++= εαβx線形モデル

次元ベクトルとする．はここで pi , βx

typ

jj

n

iii =−−= ∑∑

== 11

2T ||subject to)(minarg)ˆ,ˆ( βαα βxβ

Lasso推定量 (Least absolute shrinkage and selection operator)

Tibshirani, 1996 JRSSB

220

33

Sparse learning

)}({ 11

2T21 ||)( ty

pj j

n

iii −−−−

∂∂ ∑∑ =

=βλαβx

β

0=−−−= ∑∑==

)sgn()(11

T )( βxβxx λαn

iii

n

iii y

The Lagarange method for optomization with constraints

)ˆsgn(ˆˆ 1

1

TOLS )( βxxββ

−

=∑+=n

iiiλ

Lasso estimator and LSE have the following connection

).( that Assume1

T 単位行列In

iii =∑

=xx

),...,1()ˆsgn(ˆˆOLS pjβββ jjj =+= λ

OLSβ̂β̂

221

34

Sparseness representation

),...,1()ˆsgn(ˆˆOLS pjβββ jjj =+= λ

+−= )|ˆ)(|ˆsgn(ˆOLSOLS λjjj βββ

Othonormal design matrix leads to

Inverting the above provides

jβ̂

jβOLSˆ

jβ̂

jβOLSˆ

λ

λ−

λ−

λ

)(T 単位行列Iii =∑ xx

⎩⎨⎧ >

=+

otherwise00if

whereAA

A

Note： λ is uniquely determined by the constraint tj =∑ |ˆ| β

222

35

Control of sparseness

tj =∑ |ˆ| β

t

Control is reduced to a choice of t

223

36

Microarray data

),( 1 pxx=x }1,1{ +−∈y

Leaning machines

)( xfy =

p signals Class labels

ABoost

)},(),...,,{( 11 nn yy xx

)(,),1( 11 nww

)(,),1( 22 nww1ε

2ε

1α

2α

Tα

∑=

T

1)( )(

ttt f xα

)(,),1( nww TT

)()2( xf

)()( xTf

)()1( xf

)}(...,),({ 1 xx Kff ∑=

T

1)( )(

ttt f xα

Difficult Problem: p >> n

224

37

Data

)(,),1( 11 nww

)(,),1( 22 nww

1ε

2ε

Tα

),1()1,1(

),1()1,1(

,,)(,),(

G

Gffαα

xx

∑=

T

ttf

1)( )(x

)(,),1( nww TT

Grouping G machines )( ),(),()1,()1,(1

)( GtGtttGt fff αα ++=

)()1( xf

)()2( xf

)()( xTf

),2()1,2(

),2()1,2(

,,)(,),(

G

Gffαα

xx

),()1,(

),()1,(

,,)(,),(

GTT

GTT ffαα

xx

GroupBoost

GroupBoost )}(...,),({ 1 xx Kff ∑=

T

1)( )(

ttt f xα

225

38

Results Data size

Test error

226

39

Lung cancer analysisGene number Test error

true genes

false genes

227

40

228

41

MW (Time of Flight)

LensLaser beam source

Mirror

Ion detector

High voltage

+++

+++

+++Protein ChipFlight tube (High vacuum)

Proteome[SELDI TOF/MS]

229

42

0

20

40

0

20

40

60

0

20

40

60

20

40

60

25

50

75

100

10

20

30

40

25

50

75

25

50

75

100

25

50

75

100

1500 1750 2000 2250 2500

Lungcancer

Normaltissue

Proteome data

230

43

Averaged curve203 subjects (130 cases (ovarian cancer),73 controls)

231

44

Common peaks

hth

SpecAlign, Wong et al. (2005)

232

45

Peak pattern recognittion

),( 1 pxx=x }1,1{ +−∈y

Learning machines

)( xfy =

p peaks Class labels，

AdaBoost

例題

)(,),1( 11 nww

)(,),1( 22 nww1ε

2ε

1α

2α

Tα

∑=

T

1)( )(

ttt f xα

)(,),1( nww TT

)()2( xf

)()( xTf

)()1( xf

Test data（32cases+18contols)

233

46

Association with drug effect

A joint work with Japan Institute Cancer Research Foundation

Breast cancer patients and a drug effect association

Supervised detection for common peaks

0

0.05

0.1

0.15

0.2

0.25

0.3

0 5 10 15 20 25 30 35

training error

CV error

test error

(peaks)

error rate

234

47

Inference for prediction

Microarray

SNPs

Proteome

Genome Function

Expression pattern(GroupBoost)

Peak pattern (Common peak)

SNP haploblock

Statistical machine Learning

235

tutorial workshop on - 國立臺灣大學 · pythagoras theorem 3. 4 information divergence class...

Documents