
1

Learning with Information Divergence Geometry

Shinto Eguchi and Osamu Komori

The Institute of Statistical Mathematics, Japan
Email: eguchi@ism.ac.jp, komori@ism.ac.jp
URL: http://www.ism.ac.jp/~eguchi/

Tutorial Workshop on

1

2

Outline

9:30~10:30 Information divergence class and robust statistical methods I

11:00~12:00 Information divergence class and robust statistical methods II

13:30~14:30 Information geometry on model uncertainty

15:00~16:00 Boosting learning algorithm and U-loss functions I

9:30~10:30 Boosting learning algorithm and U-loss functions II

11:00~12:00 Pattern recognition from genome and omics data

2

3

Key words

Linear connections

Statistical model and estimation

Divergence geometry

Duality of inference and model

Geodesic

ABC of information geometry

Transversality

Riemannian metric (information metric)

Metric and dual connections

Pythagoras theorem

Max entropy model

minimaxity

U-boost

Observational bias Selection bias

Sensitivity analysis The worst case

ε –perturbed model

Pythagoras theorem

3

4

Information divergence class and robust statistical methods I

4

5

geometry

learning

statistics

quantum

S. Amari, 1982 paper

B. Efron, 1975 paper

P. Dawid, comment

Historical comment

1984 Workshop, RSS 150

C. R. Rao, 1945

R. A. Fisher, 1922

5

6

What is IG?

Geometry Uncertainty

Information space

It is a method to quantify uncertainty,

Or, a viewpoint to understand uncertainty?

6

7

Dual Riemannian Geometry

Dual Riemannian geometry gives a reformulation of Fisher's foundation.

Cf. Einstein field equations (1916)

Information geometry aims to geometrize a dualistic structure between modeling and estimation.

Estimation is projection of data onto model.

“Estimation is an action by projection and model is an object to be projected”

The interplay of action and object is elucidated by a geometric structure.

Cf. Erlangen program (Klein, 1872)http://en.wikipedia.org/wiki/Erlangen_program

7

8

2 x 2 table

The space of all 2×2 tables is associated with a regular tetrahedron.

We know the ruled surface is an exponential geodesic surface, but does anyone know the minimality of the ruled surface?

The space of all independent 2×2 tables is associated with a ruled surface in the regular tetrahedron.

8

9

regular tetrahedron

[Figure: the regular tetrahedron with vertices P, Q, R, S labelled by the degenerate tables (1,0,0,0), (0,1,0,0), (0,0,1,0), (0,0,0,1); the barycenter (1/4, 1/4, 1/4, 1/4); and edge points of the form (1−p, 0, p, 0) and (0, p, 0, 1−p).]

9

10

regular tetrahedron

[Figure: points of the ruled surface written as convex combinations with weights 1/3 and 2/3 of the edge tables (2/3, 0, 1/3, 0), (0, 2/3, 0, 1/3), (0, 0, 2/3, 1/3) and (2/3, 1/3, 0, 0).]

10

11

[Figure: the ruled surface inside the regular tetrahedron with vertices P, Q, R, S.]

Ruled surface

11

12

[Figure: the four vertex tables (1 0 / 0 0), (0 1 / 0 0), (0 0 / 1 0), (0 0 / 0 1) and an independent table with cells pq, p(1−q), (1−p)q, (1−p)(1−q), row margins p, 1−p and column margins q, 1−q.]

For a 2×2 table with counts (x, y; z, w), write the margins as e = x + y, f = z + w, g = x + z, h = y + w and n = x + y + z + w. For the independent table the cross term vanishes:

n { pq (1−p)(1−q) − p(1−q) (1−p)q }^2 = 0.

12

13

e-geodesical

[Figure: two 3-D views of the ruled surface and of its image under the log-odds coordinates.]

Under the coordinates

( log pq/((1−p)(1−q)), log p/(1−p), log q/(1−q) ),  0 < p < 1, 0 < q < 1,

the independence surface is mapped onto the plane { (x, y, x + y) : x ∈ R, y ∈ R }.

More generally, the simplex S3 of 2×2 tables (π11, π12, π21, π22), with π22 = 1 − π11 − π12 − π21, is mapped into R^3 by

(π11, π12, π21) → ( log π11/π22, log π12/π22, log π21/π22 ).

13

14

Two parametrizations

[Figure: the surface plotted in two coordinate systems.]

( pq, p(1−q), (1−p)q )  as a function of (p, q), and

( log pq/((1−p)(1−q)), log p/(1−p), log q/(1−q) ).

14

15

Two Gaussian distributions N(μ1, I) and N(μ2, I)

[Figure: two bivariate normal densities N(μ1, I) and N(μ2, I) with their means μ1 and μ2.]

15

16

[Figure: three normal densities N(μ1, I), N(μ2, I), N(μ3, I).]

Pythagoras theorem in a functional space

3^2 + 4^2 = 5^2

16

17

Kullback-Leibler Divergence

Let p(x) and q(x) be probability density functions. The Kullback-Leibler divergence is defined by

D_KL(p, q) = ∫ p(x) log { p(x) / q(x) } dx.

For three densities p, q, r as in the figure,

D_KL(p, r) = D_KL(p, q) + D_KL(q, r).
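A minimal R sketch (not part of the slides; the probability vectors are illustrative) of the divergence for discrete distributions:

# KL divergence between two discrete distributions on a common support
kl_div <- function(p, q) {
  stopifnot(all(p > 0), all(q > 0))
  sum(p * log(p / q))
}

p <- c(0.2, 0.3, 0.5)
q <- c(0.25, 0.25, 0.5)
kl_div(p, q)   # nonnegative, and 0 only when p == q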

17

18

Two one-parameter families

Let P be the space of all pdfs on a data space.

m-geodesic: p_t^(m)(x) = (1 − t) p(x) + t q(x),  (p, q ∈ P)

e-geodesic: r_s^(e)(x) = c_s {q(x)}^s {r(x)}^(1−s),  (q, r ∈ P),

where 1/c_s = ∫ {q(x)}^s {r(x)}^(1−s) dx.

[Figure: p, q, r with the m-geodesic p_t^(m) joining p and q and the e-geodesic r_s^(e) joining r and q.]

18

19

Pythagoras theorem

If D_KL(p, r) = D_KL(p, q) + D_KL(q, r), then

D_KL(p_t^(m), r_s^(e)) = D_KL(p_t^(m), q) + D_KL(q, r_s^(e))  for all (s, t) ∈ [0, 1] × [0, 1].

[Figure: the right angle at q between the m-geodesic p_t^(m) and the e-geodesic r_s^(e).]

19

20

Proof.

D_KL(p_t^(m), r_s^(e)) − { D_KL(p_t^(m), q) + D_KL(q, r_s^(e)) }

= ∫ ( p_t^(m)(x) − q(x) ) ( log q(x) − log r_s^(e)(x) ) dx

= (1 − t)(1 − s) ∫ ( p(x) − q(x) ) ( log q(x) − log r(x) ) dx − (1 − t) log c_s ∫ ( p(x) − q(x) ) dx

= (1 − t)(1 − s) { D_KL(p, r) − D_KL(p, q) − D_KL(q, r) }

= 0,

where the second integral vanishes because p and q both integrate to one.
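A minimal R numerical check (arbitrary illustrative discrete distributions) of the identity used in the proof:

kl <- function(p, q) sum(p * log(p / q))

p <- c(0.2, 0.3, 0.5); q <- c(0.4, 0.4, 0.2); r <- c(0.1, 0.6, 0.3)
s <- 0.3; t <- 0.7
p_t <- (1 - t) * p + t * q                       # m-geodesic between p and q
r_s <- q^s * r^(1 - s); r_s <- r_s / sum(r_s)    # e-geodesic between r and q (normalized)

lhs <- kl(p_t, r_s) - kl(p_t, q) - kl(q, r_s)
rhs <- (1 - t) * (1 - s) * (kl(p, r) - kl(p, q) - kl(q, r))
all.equal(lhs, rhs)   # TRUE up to numerical error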

20

21

ABC in differential geometry

A Riemannian metric defines an inner product on any tangent space of a manifold M:

g : X(M) × X(M) → R.

A geodesic is a minimal arc between x and y:

argmin { ∫_0^1 g( ẋ(t), ẋ(t) ) dt : x(0) = x, x(1) = y }.

A linear connection defines parallelism along a vector field:

∇ : X(M) × X(M) → X(M),

(1) ∇_{fX} Y = f ∇_X Y,
(2) ∇_X (fY) = (Xf) Y + f ∇_X Y,  for all X, Y ∈ X(M) and all functions f on M.

Componentwise, ∇_{∂_i} ∂_j = Σ_k Γ_{ij}^k ∂_k.

21

22

Geodesic

A one-parameter family (curve) C = { θ(t) : −ε ≤ t ≤ ε } is called a geodesic with respect to a linear connection { Γ_{jk}^i(θ) : 1 ≤ i, j, k ≤ p } if

θ̈^i(t) + Σ_{j=1}^p Σ_{k=1}^p Γ_{jk}^i( θ(t) ) θ̇^j(t) θ̇^k(t) = 0  for all i = 1, ..., p.

Remark 2: If Γ_{jk}^i(θ) = 0 (1 ≤ i, j, k ≤ p), any geodesic is a line.

The property is not invariant under reparametrization. Under a transform from the parameter θ to ω, the geodesic C satisfies

ω̈^a(t) + Σ_{b=1}^p Σ_{c=1}^p Γ̃_{bc}^a( ω(t) ) ω̇^b(t) ω̇^c(t) = 0  for all a = 1, ..., p,

with the transformed connection

Γ̃_{bc}^a(ω) = Σ_{i,j,k} B_i^a B_b^j B_c^k Γ_{jk}^i(θ) + Σ_i B_i^a ∂B_b^i/∂ω^c,

where B_i^a = ∂ω^a/∂θ^i and B_a^i = ∂θ^i/∂ω^a for ω = ω(θ).

22

23

Change rule on connections

Let us consider a transform φ from the parameter θ to ω. The geodesic C is then written as ω(t) = φ( θ(t) ), −ε < t < ε, with B_i^a = ∂φ^a/∂θ^i. Differentiating,

ω̇^a(t) = Σ_i B_i^a( θ(t) ) θ̇^i(t),

ω̈^a(t) = Σ_i B_i^a( θ(t) ) θ̈^i(t) + Σ_i Σ_j ∂B_i^a/∂θ^j ( θ(t) ) θ̇^i(t) θ̇^j(t),

and substituting into the geodesic equation gives the transformed connection

Γ̃_{bc}^a(ω) = Σ_{i,j,k} B_i^a B_b^j B_c^k Γ_{jk}^i(θ) + Σ_i B_i^a ∂B_b^i/∂ω^c.

23

24

What reference books?

Foundations of Differential Geometry (Wiley Classics Library). Shoshichi Kobayashi and Katsumi Nomizu.

Methods of Information Geometry. Shun-Ichi Amari and Hiroshi Nagaoka. American Mathematical Society (2001).

Differential-Geometrical Methods in Statistics. Shun-Ichi Amari. Springer (1985).

24

25

Statistical model and information

Statistical model: M = { p_θ(x) = p(x, θ) : θ ∈ Θ }  (Θ ⊂ R^p)

Score vector: s(x, θ) = ∂/∂θ log p(x, θ)

Space of score vectors: T_θ = { α^T s(·, θ) : α ∈ R^p }

Fisher information metric: g_θ(u, v) = E_θ{ u v }  for all u, v ∈ T_θ

Fisher information matrix: I_θ = E_θ{ s(x, θ) s(x, θ)^T }

25

26

Score vector space

Note 1: For u(x, θ) = α^T s(x, θ) ∈ T_θ,

E_θ(u) = ∫ α^T s(x, θ) p(x, θ) dx = α^T ∂/∂θ ∫ p(x, θ) dx = 0,

and for u = α^T s, v = β^T s in T_θ,

g_θ(u, v) = ∫ α^T s(x, θ) s(x, θ)^T β p(x, θ) dx = α^T I_θ β.

Let S_θ = { t(x, θ) : E_θ( t(x, θ) ) = 0, V_θ( t(x, θ) ) < ∞ } be the space of all random variables with mean 0 and finite variance.

Note 2: S_θ is an infinite-dimensional vector space that includes T_θ. For u(x, θ) = α^T s(x, θ), the centered higher-order derivatives

∂^k u(x, θ)/∂θ_{i_1} ⋯ ∂θ_{i_k} − E_θ{ ∂^k u(x, θ)/∂θ_{i_1} ⋯ ∂θ_{i_k} },  k = 1, 2, ...,

belong to S_θ.

26

27

Linear connections

Score vector: s(x, θ) = ( s_i(x, θ) )_{i=1,...,p}

e-connection: Γ_{ij}^{(e) k}(θ) = Σ_{i'} g^{k i'}(θ) E_θ{ (∂_i s_j) s_{i'} }  (1 ≤ i, j, k ≤ p)

m-connection: Γ_{ij}^{(m) k}(θ) = Σ_{i'} g^{k i'}(θ) [ E_θ{ (∂_i s_j) s_{i'} } + E_θ{ s_i s_j s_{i'} } ]  (1 ≤ i, j, k ≤ p)

m-geodesic: p_t^(m)(x) = (1 − t) p(x) + t q(x),  (p, q ∈ P)

e-geodesic: r_s^(e)(x) = c_s {q(x)}^s {r(x)}^(1−s),  (q, r ∈ P)

27

28

Geodesical models

Definition

(i) A statistical model M is said to be e-geodesical if

p, q ∈ M  ⇒  c_t {p(x)}^t {q(x)}^(1−t) ∈ M  (for all t ∈ (0, 1)),  where 1/c_t = ∫ {p(x)}^t {q(x)}^(1−t) dx.

(ii) A statistical model M is said to be m-geodesical if

p, q ∈ M  ⇒  (1 − t) p(x) + t q(x) ∈ M  (for all t ∈ (0, 1)).

Note: Let P be the space of all probability density functions. By definition P is e-geodesical and m-geodesical. However, the theoretical framework for P is not perfectly complete.

Cf. Pistone and Sempi (1995, AS)

28

29

Two types of modeling

Let p0(x) be a pdf and t(x) a p-dimensional statistic.

Exponential model: M^(e) = { p(x, θ) = p0(x) exp{ θ^T t(x) − κ(θ) } : θ ∈ Θ },  Θ = { θ ∈ R^p : κ(θ) < ∞ }

Matched mean model: M^(m) = { p : E_p{ t(x) } = E_{p0}{ t(x) } }

Let { p_i(x) : i = 0, 1, ..., I } be a set of pdfs.

Exponential model: M^(e) = { p(x, θ) = p0(x) exp{ Σ_{i=1}^I θ_i log( p_i(x)/p0(x) ) − κ(θ) } : θ ∈ Θ },  Θ = { θ ∈ R^I : κ(θ) < ∞ }

Mixture model: M^(m) = { p(x, θ) = (1 − Σ_{i=1}^I θ_i) p0(x) + Σ_{i=1}^I θ_i p_i(x) : θ ∈ Θ },  Θ = { (θ_1, ..., θ_I) : θ_i > 0 (i = 1, ..., I), 0 < Σ_{i=1}^I θ_i < 1 }

29

30

Statistical functional

A functional f(p) of a pdf p(x) is called a statistical functional.

A statistical functional f(p) is said to be Fisher-consistent for a model M = { p_θ(x) : θ ∈ Θ } if f(p) satisfies

(1) f(p) ∈ Θ, and

(2) f(p_θ) = θ for all θ ∈ Θ.

Example. For the normal model

M = { p_θ(x) = (2πσ^2)^(−1/2) exp{ −(x − μ)^2 / (2σ^2) } : θ = (μ, σ^2) ∈ R × R_+ },

the functional

f(p) = ( ∫ x p(x) dx, ∫ ( x − ∫ x p(x) dx )^2 p(x) dx )^T

is Fisher-consistent for θ = (μ, σ^2).

30

31

Transversality

Let f(p) be a Fisher-consistent functional. Then

L_θ(f) = { p : f(p) = θ }

is called a leaf transverse to M. It satisfies

(1) L_θ(f) ∩ M = { p_θ },

(2) ⊕_{θ ∈ Θ} L_θ(f) is a local neighborhood of M.

[Figure: the model M with leaves L_θ(f) and L_θ*(f) through p_θ and p_θ*.]

31

32

Foliation structure

Statistical model: M = { p_θ(x) = p(x, θ) : θ ∈ Θ }  (Θ ⊂ R^p)

Foliation: P = ⊕_{θ ∈ Θ} L_θ(f)

Decomposition of tangent spaces: T_{p_θ}(P) = T_{p_θ}(M) ⊕ T_{p_θ}( L_θ(f) ), that is,

for all t(x, θ) ∈ T_{p_θ}(P) there exist u(x, θ) ∈ T_{p_θ}(M) and v(x, θ) ∈ T_{p_θ}( L_θ(f) ) such that t(x, θ) = u(x, θ) + v(x, θ).

[Figure: the leaf L_θ(f) crossing M at p_θ, with tangent vectors t, u, v.]

32

33

Transversality for MLE

Let p0(x) be a pdf and t(x) a p-dimensional statistic, and consider the exponential model

M^(e) = { p(x, θ) = p0(x) exp{ θ^T t(x) − κ(θ) } : θ ∈ Θ },  Θ = { θ ∈ R^p : κ(θ) < ∞ }.

For the exponential model M^(e) the MLE functional

f_ML(p) := argmax_{θ ∈ Θ} E_p{ log p(x, θ) }

is written as the θ ∈ Θ satisfying E_p{ t(x) } = E_{p(·,θ)}{ t(x) }, so that

L_θ(f_ML) = { p : E_p{ t(x) } = E_{p(·,θ)}{ t(x) } }.

Hence the foliation associated with f_ML is a matched mean model.

33

34

Maximum likelihood foliation

D_KL(p_a, r_b) = D_KL(p_a, p_θ) + D_KL(p_θ, r_b)  (a = 1, 2, 3; b = 1, 2)

[Figure: the exponential model M^(e), the leaf L_θ(f_ML) through p_θ, densities p_1, p_2, p_3 on the leaf and r_1, r_2 on the model.]

34

35

Estimating function

Statistical model: M = { p_θ(x) = p(x, θ) : θ ∈ Θ }  (Θ ⊂ R^p)

A p-variate function u(x, θ) is unbiased if, by definition,

E_{p(·,θ)}{ u(x, θ) } = 0  and  det( E_{p(·,θ)}{ ∂u(x, θ)/∂θ } ) ≠ 0  for all θ ∈ Θ.

The statistical functional

f(p) = the solution θ ∈ Θ of E_p{ u(x, θ) } = 0

is Fisher-consistent, and the leaf transverse to M,

L_θ(f) = { p : E_p{ u(x, θ) } = 0 },

is m-geodesical, with P = ⊕_{θ ∈ Θ} L_θ(f).

35

36

Information Geometry

Statistical model: M = { p(x, θ) : θ ∈ Θ }, where p(x, θ) is a pdf with ∫ p(x, θ) dx = 1.

Information metric (Rao, 1945): g_ij(θ) = E_{p(·,θ)}{ e_i(x, θ) e_j(x, θ) },  where e_i(x, θ) = ∂/∂θ_i log p(x, θ).

Dual connections (Amari, 1982):

Exponential connection: Γ_{ij,k}^(e)(θ) = E_{p(·,θ)}{ (∂_i e_j) e_k }

Mixture connection: Γ_{ij,k}^(m)(θ) = E_{p(·,θ)}{ (∂_i e_j) e_k } + E_{p(·,θ)}{ e_i e_j e_k }

36

37

Mixture and exponential model

Mixture model (mixture-geodesic space):

M = { p(x, θ) = Σ_{i=0}^K θ_i p_i(x) : θ ∈ Θ },  Θ = { (θ_0, ..., θ_K) : θ_i > 0, Σ_{i=0}^K θ_i = 1 }

Exponential model (exponential-geodesic space):

M = { p(x, θ) = exp{ Σ_{i=1}^K θ_i t_i(x) − ψ(θ) } : θ ∈ Θ },  Θ = { θ : ψ(θ) < ∞ },

where ψ(θ) = log ∫ exp{ Σ_{i=1}^K θ_i t_i(x) } dx.

37

38

Triangle with KL divergence

Let p, q and r be in M.

Mixture geodesic: p_t^(m)(x) = (1 − t) p(x) + t q(x)  (0 ≤ t ≤ 1)

Exponential geodesic: r_s^(e)(x) = z_s {q(x)}^s {r(x)}^(1−s)  (0 ≤ s ≤ 1)

KL divergence: D_KL(p, q) = ∫ p(x) log { p(x)/q(x) } dx

Then

D_KL(p_t^(m), r_s^(e)) − D_KL(p_t^(m), q) − D_KL(q, r_s^(e))

= ∫ ( p_t^(m) − q ) ( log q − log r_s^(e) )

= (1 − s)(1 − t) ∫ ( p − q ) ( log q − log r )

= (1 − s)(1 − t) { D_KL(p, r) − D_KL(p, q) − D_KL(q, r) }.

[Figure: the triangle p, q, r with the two geodesics p_t^(m) and r_s^(e).]

38

39

Pythagorean Theorem

D_KL(p_t^(m), r_s^(e)) = D_KL(p_t^(m), q) + D_KL(q, r_s^(e)).

Hence

D_KL(p_t^(m), r_s^(e)) ≥ D_KL(q, r_s^(e))  and  D_KL(p_t^(m), r_s^(e)) ≥ D_KL(p_t^(m), q)

(em algorithm, Amari, 1995).

[Figure: p(x), q(x), r(x) with the geodesics p_s^(m)(x) and r_t^(e)(x) and the divergences D_KL(p, r), D_KL(p, q), D_KL(q, r).]

39

40

Minimum divergence geometry

Let D : M × M → R be an information divergence on a statistical model M:

(i) D(p, q) ≥ 0, with equality if and only if p = q;

(ii) D is differentiable on M × M.

Then we get a Riemannian metric and dual connections on M (Eguchi, 1983, 1992):

g^(D)(X, Y) = −D(X | Y)

g^(D)(∇_X Y, Z) = −D(XY | Z)  (for all X, Y, Z ∈ X(M))

g^(D)(∇*_X Y, Z) = −D(Y | XZ)  (for all X, Y, Z ∈ X(M)),

where D(XY | Z)(p) = X_p Y_p Z_q D(p, q) |_{q=p}.

40

41

Remarks

Since D(· | X) = 0, we have g^(D)(X, Y) = −D(X | Y) = −D(Y | X), and

g^(D)(X, Y) − g^(D)(Y, X) = −D(X | Y) + D(Y | X) = D([X, Y] | ·) = 0,

so g^(D) is a Riemannian metric.

g^(D)(∇_X (fY), Z) = −D( (Xf)Y + f XY | Z ) = g^(D)( (Xf)Y + f ∇_X Y, Z )  for all Z,

and ∇_X, ∇*_X are dual connections:

X g^(D)(Y, Z) = g^(D)(∇_X Y, Z) + g^(D)(Y, ∇*_X Z),  where ∇_X := (∇_X + ∇*_X)/2 is metric.

41

42

U cross-entropy

Let (U, u, ξ) be a triple such that U is a convex function, u = U', ξ = u^{-1}.

U cross-entropy: L_U(p, q) = ∫ { p ξ(q) − U(ξ(q)) }

U entropy: H_U(p) = L_U(p, p) = ∫ U*(p),  where U*(s) = s ξ(s) − U(ξ(s))

U divergence: D_U(p, q) = H_U(p) − L_U(p, q)

Legendre duality: ξ(s) = argmax_{−∞ < t < ∞} { s t − U(t) },  u(t) = argmax_{−∞ < s < ∞} { s t − U*(s) }.

[Figure: the tangent-line picture of the convexity gap between U(ξ(p)) and U(ξ(q)).]

42

4343

Example of U divergence

D_U(p, q) = ∫ { U(ξ(q)) − U(ξ(p)) − p ( ξ(q) − ξ(p) ) }

KL divergence: with ( U(t), u(t), ξ(s) ) = ( exp(t), exp(t), log s ),

D_KL(p, q) = ∫ { p (log p − log q) − (p − q) }.

Beta (power) divergence: with ( U(t), u(t), ξ(s) ) = ( (1 + βt)^{(1+β)/β}/(1+β), (1 + βt)^{1/β}, (s^β − 1)/β ),

D_β(p, q) = ∫ { p ( p^β − q^β ) / β − ( p^{1+β} − q^{1+β} ) / (1 + β) }.

Note: lim_{β→0} U_β(t) = exp(t)  and  lim_{β→0} D_β(p, q) = D_KL(p, q).
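A minimal R sketch (densities evaluated on a grid with a simple Riemann sum; all inputs are illustrative) of the β-power divergence, which approaches the KL divergence as β → 0:

beta_div <- function(p, q, beta, dx = 1) {
  if (beta == 0)                       # extended KL divergence as the limit case
    return(sum((p * log(p / q) - p + q) * dx))
  sum((p * (p^beta - q^beta) / beta -
       (p^(beta + 1) - q^(beta + 1)) / (beta + 1)) * dx)
}

x <- seq(-6, 6, by = 0.01)
p <- dnorm(x, 0, 1); q <- dnorm(x, 0.5, 1.2)
beta_div(p, q, beta = 0.5,  dx = 0.01)
beta_div(p, q, beta = 1e-6, dx = 0.01)   # close to the KL value below
beta_div(p, q, beta = 0,    dx = 0.01)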

43

4444

Geometric formula with DU

( g^(U), ∇^(U), ∇^(U)* )  such that

g^(U)(X, Y) = −D_U(X | Y)

g^(U)(∇^(U)_X Y, Z) = −D_U(XY | Z)  (for all X, Y, Z ∈ X(M))

g^(U)(∇^(U)*_X Y, Z) = −D_U(Y | XZ)  (for all X, Y, Z ∈ X(M)).

For a model { q(x, θ) }:

g_ij^(U)(θ) = ∫ ∂_i q(x, θ) ∂_j ξ( q(x, θ) ) dx

Γ_{ij,k}^(U)(θ) = ∫ ∂_i ∂_j q(x, θ) ∂_k ξ( q(x, θ) ) dx

Γ_{ij,k}^(U)*(θ) = ∫ ∂_k q(x, θ) ∂_i ∂_j ξ( q(x, θ) ) dx

∇^(U) = ∇^(m) for all U;  g^(U) = g ⇔ U = exp (KL divergence).  (Eguchi, 2005)

44

4545

Triangle with DU

mixture geodesic: p_t^(m)(x) = (1 − t) p(x) + t q(x)

U geodesic: r_s^(U)(x) = u( (1 − s) ξ( r(x) ) + s ξ( q(x) ) + κ_s )

D_U(p_t^(m), r_s^(U)) − D_U(p_t^(m), q) − D_U(q, r_s^(U))

= (1 − s)(1 − t) { D_U(p, r) − D_U(p, q) − D_U(q, r) }  (for all (s, t) ∈ [0, 1] × [0, 1]).

Hence, if p_t^(m) ⊥_q r_s^(U), then

D_U(p_t^(m), r_s^(U)) = D_U(p_t^(m), q) + D_U(q, r_s^(U)).

[Figure: the triangle p, q, r with the geodesics p_t^(m) and r_s^(U).]

45

46

Light and shadow of MLE

1.Invariance under data-transformations

2.Asymptotic efficiency

1. Non-robust

Sufficiency and efficiencylog exp

ξ u

2. Overfitting

Log-likelihood on exponential family

46

1

Learning with Information Divergence Geometry

Shinto Eguchi and Osamu Komori

The Institute of Statistical Mathematics, Japan
Email: eguchi@ism.ac.jp, komori@ism.ac.jp
URL: http://www.ism.ac.jp/~eguchi/

Tutorial Workshop on

47

2

Outline

9:30~10:30 Information divergence class and robust statistical methods I

11:00~12:00 Information divergence class and robust statistical methods II

13:30~14:30 Information geometry on model uncertainty

15:00~16:00 Boosting learning algorithm and U-loss functions I

9:30~10:30 Boosting learning algorithm and U-loss functions II

11:00~12:00 Pattern recognition from genome and omics data

48

3

Information divergence class and

robust statistical methods II

49

4

Light and shadow of MLE

1.Invariance under data-transformations

2.Asymptotic efficiency

1. Non-robust

Sufficiency and efficiency

log exp

ξ u

2. Over-fitting

Log-likelihood on exponential family

Likelihood method

U-method

50

5

Take a quadruplet (U, u, ξ, Ξ).

U-cross entropy: L_U(p, q) = E_p{ ξ(q) } − ∫ U( ξ(q) )

U-entropy: H_U(p) = E_p{ ξ(p) } − ∫ U( ξ(p) )

Example 1. Let U(t) = exp(t). Then H_U(p) = E_p{ log p(X) } (up to an additive constant).

Example 2. Let U(t) = (1 + βt)^{(1+β)/β} / (1 + β). Then H_U(p) is the β-power entropy built from E_p{ ( p(X)^β − 1 ) / β }.

51

6

U-divergence

Information inequality: H_U(p) ≥ L_U(p, q)

U-divergence: D_U(p, q) = H_U(p) − L_U(p, q)

= ∫ [ U(ξ(q)) − U(ξ(p)) − u(ξ(p)) { ξ(q) − ξ(p) } ] ≥ 0  (by the convexity of U).

KL-divergence: D_KL(p, q) = ∫ { p (log p − log q) − (p − q) }

β-power divergence: D_β(p, q) = ∫ { p ( p^β − q^β ) / β − ( p^{1+β} − q^{1+β} ) / (1 + β) }

[Figure: the convexity gap between U(ξ(p)) and U(ξ(q)) interpreted as the divergence.]

52

7

Max U-entropy distribution

Let us fix a statistic t(x).

Equal mean space: Γ_τ = { p : E_p{ t(X) } = τ }

p*_τ(x) = argmax_{p ∈ Γ_τ} H_U(p)

Euler-Lagrange first variation: ∂/∂ε H_U( (1 − ε) p*_τ + ε q ) |_{ε=0} = 0  for all q ∈ Γ_τ.

Hence ξ( p*_τ(x) ) = θ^T t(x) − κ(θ), that is, p*_τ(x) = u( θ^T t(x) − κ(θ) )  (the U-model).

53

8

U-estimate

Let p(x) be a data density function with statistical model q_θ(x).

U-loss function: L_U(θ) = −E_p{ ξ( q_θ(X) ) } + ∫ U( ξ( q_θ(x) ) ) dx,  so that L_U(θ) = D_U(p, q_θ) − H_U(p).

U-empirical loss function: L_U^emp(θ) = −(1/n) Σ_{i=1}^n ξ( q_θ(x_i) ) + ∫ U( ξ( q_θ(x) ) ) dx

U-estimate: θ̂_U = argmin_{θ ∈ Θ} L_U^emp(θ)

U-estimating function: s_U(x, θ) = w(x, θ) s(x, θ) − E_{q_θ}{ w(X, θ) s(X, θ) },

where w(x, θ) = ξ'( q_θ(x) ) q_θ(x) and s(x, θ) is the score function.

Let p(x) be a data density function with statistical model )(xθq

54

9

Γ -minimax

})}({:{ τXtΓτ == PEpEqual mean space

),(maxmin)*,*(),(minmax qpLppLqpL UUUpqqp ττ

ττττ Γ∈Γ∈Γ∈Γ∈

==

∫ =−

−=

τxtθxt

xtθx

θ

θτ

))(()(

,))(()(* T

Twhere

κ

κ

u

up

)(maxarg* pHp Up τ

τΓ∈

=

τΓ

τ'Γ

UM

τΓ

'τΓ

}{τ

*τp

*'τp

Let us fix a statistic t(x).

55

10

U-model

}:))(()({ Θ∈−== θxtθx θθ κTU uqMU-model

mean parameter is)}({ xtτθqE=U-estimator of ∑

=

=n

iiU n 1)(1ˆ xtτ

))(())((1)(1

empθθ xθ qUq

nL

n

iiU ξξ −= ∑

=

))((})({1 T

1

Tθθ xtθxtθ κκ −−−= ∑

=

Un

n

ii

0 )}({)(1)(1

emp =−= ∑=

∂∂ Xtxtθ

θθ q

n

iiU E

nL

τ'Γ

UM

τΓ

'τΓ

*τp

*'τpτΓ

UM

).,()ˆ()( ˆempemp

θθθθ qqDLLU

UUUU =−

We observe that

which implies

Furthermore,

56

11

U-Boost density learning

)2())1((minarg),( 1]1,0[),(

TkL kUkk ≤≤+−= −×∈

πφξπφπφπ D

},...,1:)({ )( Mjj == xD φDictionary of ξ-densities

)(minarg11 φφξφ

UD

L∈

==

)()()1()( 1 xxx kkkkk φπξπξ +−= −U-boost

)(minarg*)cov(

ξξξ

ULD∈

=Goal is to find

U(t) = t 2 (Klemela, ML, 2006) U(t)=exp(t) (Friedman et al., JASA,1998)

57

12

Proposed density estimator

))()(()( T θxtθx kuf −=

)1()1(with),...,( 11 TkkkT πππθθθ −−== +θ

T1 ))(,),(()( xxxt Tφφ=

U-estimate under U-model

1−kξ

kξ kφ

Nj

jD 1)( }{ == φ

)cov( D

58

13

1−kξ

kξ kφ

1+kφ

1+kξ

59

14

1−kξ

kξ kφ

1+kφ

1+kξ2+kξ

2+kφ

∞→→ kk as*ξξ

60

15

Statistical machine learning

Brain function

LearningData sets

Signal processingPattern recognition

Vapnik (1995)

T. Hastie, R. Tibshirani, J. Friedman (2001)

・・・・

61

16

))}(()({)( θθθ qUqpLU ξξ −= ∫

Minimum U divergence

U-loss function

)(minargˆ emp θθθ

UU LΘ∈

=MinimumU-estimator

)},(),({),(),(),( θXθXθxθxθxsθ

swEsw qU −=U-estimating function

.function score is),(,))((')(),( where θXsxxθx θθ qqw ξ=

∫∑ −==

))(())((1)(1

empθθ xθ qUq

nL

n

iiU ξξThe empirical loss

Let p(x) be a data density function with statistical model )(xθq

62

17

• Huber’s M-estimator ∑=

Θ∈

n

iix

n 1

),(1min θρθ

⎩⎨⎧

−<−−

=otherwise)sgn(

||if),(

θθθ

θρyk

kyyy

As M-estimation

• location case

• relation ∫−= ))(())((),( θθ qUxqx ξξρ θ

63

18

Influence function

Statistical functional

})),((()()),(({minarg)( ∫∫ +−=Θ∈

dxxpUxdGxpGTU θξθξθ

0

)(),(IF=

⎥⎦⎤

⎢⎣⎡∂∂

≡ε

εεGTxT

xFG δεε θε +−= )1(

)}],(),({),(),()[(),(IF 1 θθθθθ θ xSxwExSxwJxT −= −ΨΨ

Influence function

||),(IF||sup)(GES xTTx

ΨΨ =

64

19

Efficiency

Asymptotic variance

))()()(,0()ˆ( 11 −−⇒− θθθθθ UUUD

U JHJNn

)),(),(()(

),),(),(),(()(

θθθ

θθθθ

XSXwVarH

XSXSXwEJ

U

TU

=

=

111 )()()()( −−− ≤ θθθθ UUU JHJI

Information inequality

where

( equality holds iff U(x) = exp(x) )

65

20

Normal mean

[Figure: influence functions for the normal mean; left, β-power estimates for β = 0, 0.015, 0.15, 0.4, 0.8, 2.5; right, η-sigmoid estimates for η = 0, 0.01, 0.05, 0.075, 0.1, 0.125.]

66

21

Gross Error Sensitivity

[Table: gross error sensitivity (GES) and efficiency of the β-power estimates (β = 0, 0.015, 0.15, 0.4, 0.8, 2.5) and of the η-sigmoid estimates (η = 0, 0.01, 0.05, 0.075, 0.1, 0.125); at β = 0 and η = 0 the GES is infinite and the efficiency is 1.]

67

22

Multivariate normal

The pdf is

f(y, μ, Σ) = (2π)^{−p/2} (det Σ)^{−1/2} exp{ −(y − μ)^T Σ^{−1} (y − μ) / 2 }.

Ψ-likelihood (estimating) equations:

(1/n) Σ_j ψ(x_j, θ) (x_j − μ) = 0,

(1/n) Σ_j ψ(x_j, θ) { (x_j − μ)(x_j − μ)^T − Σ } = c_ψ(θ) Σ,  with θ = (μ, Σ).

68

23

U-algorithm: iteratively re-weighted mean and covariance

Update θ_k = (μ_k, Σ_k) into θ_{k+1} = (μ_{k+1}, Σ_{k+1}) by

μ_{k+1} = Σ_i w(x_i, θ_k) x_i / Σ_i w(x_i, θ_k),

Σ_{k+1} = the corresponding w(x_i, θ_k)-weighted covariance of the (x_i − μ_k), corrected by a term c_U involving det(Σ_k).

Under a mild condition,

L_ψ(θ_{k+1}) ≥ L_ψ(θ_k)  (for all k = 1, ...).
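A minimal R sketch of such an iteratively re-weighted update (assumptions: the weights are taken as w_i = f(x_i; μ, Σ)^β as in the β-power case, bias-correction constants are omitted, and the mvtnorm package is available):

library(mvtnorm)   # for dmvnorm (assumed available)

beta_normal_fit <- function(X, beta = 0.1, iters = 50) {
  mu <- colMeans(X); Sigma <- cov(X)
  for (k in seq_len(iters)) {
    w  <- dmvnorm(X, mean = mu, sigma = Sigma)^beta   # down-weights outlying points
    w  <- w / sum(w)
    mu <- colSums(w * X)                              # weighted mean
    Xc <- sweep(X, 2, mu)
    Sigma <- t(Xc) %*% (w * Xc)                       # weighted covariance (no correction)
  }
  list(mean = mu, cov = Sigma)
}

set.seed(1)
X <- rbind(matrix(rnorm(200), ncol = 2),              # bulk of the data around (0, 0)
           matrix(rnorm(20, mean = 8), ncol = 2))     # a few outliers
beta_normal_fit(X, beta = 0.1)$mean                   # stays close to (0, 0)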

69

24

Simulation

ε-contamination: two settings G_ε^(1) and G_ε^(2), each mixing a standard bivariate normal N(0, I) with a second normal component with mixing weight ε.

[Table: KL error D_KL(θ_0, θ̂) of the MLE θ̂ under G_0 (ε = 0) and under G_0.05 (ε = 0.05) for the two settings; the error grows markedly under contamination.]

70

25

β-power estimates vs. η-sigmoid estimates

[Table: KL errors of the β-power estimates and of the η-sigmoid estimates over a range of β and η values under the contaminated models G_ε^(1) and G_ε^(2).]

71

26

Plot of MLEs (no outliers)

11σ12σ

22σ

100 replications with100 size of sampleunder Normal dis.

η- MLE (η=0.0025)β- MLE (β=0.1)MLEtrue (1,0,1)

⎥⎦

⎤⎢⎣

⎡=Σ

2212

1211

σσσσ

72

27

Plot of MLEs with ouliers

η- MLE (η=0.0025)β- MLE (β=0.1)MLEtrue (1,0,1)

)2(εG

100 replications with100 size of sampleunder

11.5

2

2.5

0

0.5

1

1.5

0.5

1

1.5

2

11.5

2

2.5

0

0.5

1

1.5

73

28

Selection for tuning parameter

Squared loss function

dyygyf 221 )}()ˆ,({)ˆ(Loss −= ∫ θθ

∫∑ +−= −

=

dyyfxfn

ii

n

i

2)(

1

)ˆ,(21)ˆ,(1)(CV ββ θθβ

)(CVminargˆ βββ

=

)ˆ,(IF1

1ˆˆ )(βββ θθθ i

i xn −

+≈−Approximate

74

29

Selection for tuning parameter

∫∑ +−≅=

dyyfxfn

V i

n

i

2

1)ˆ,(

21)ˆ,(1)(C ββ θθβ

θθ

θ ββ ∂

∂−

+)ˆ,(

)ˆ(1

1,

iTi

xfxIF

n

The third term is dominant in the no-outlier case, in which CV(β) has a minimum around β = 0. When there are substantial outliers, the first and second terms are dominant and CV(β) has a minimum around β = 1. Cf. GIC in Konishi and Kitagawa (1996).

75

30

⎟⎟⎠

⎞⎜⎜⎝

⎛⎟⎟⎠

⎞⎜⎜⎝

⎛.26.1-

.1-.26 varianceand

00

mean with Normal

0.0250.050.0750.10.1250.150.175

2.82

2.84

2.86

2.88

2.9

2.92

2.94 ⎟⎟⎠

⎞⎜⎜⎝

⎛⎟⎟⎠

⎞⎜⎜⎝

⎛.261.126-.126-.228

0.081-

0.054

signalMLE

⎟⎟⎠

⎞⎜⎜⎝

⎛⎟⎟⎠

⎞⎜⎜⎝

⎛.383.263-.263-1.059

0.184-

0.204

ContamiMLE

⎟⎟⎠

⎞⎜⎜⎝

⎛⎟⎟⎠

⎞⎜⎜⎝

⎛.286.134-.134-.293

0.132-

0.086 MLE-β

β

)(C βV

07.0ˆ=β

76

31

What is ICA?

Cocktail party effect

Blind source separation

s1 s2 ……… sm

x1 x2 ……… xm

77

32

ICA model

sWxW mmm 1 s.t., −× =∈∈ RR μ

(Independent signals)

(Linear mixture of signals)

0)(,,0)()()()(),...,(

1

111

====

m

mmm

SESEspspspsss ~

),,( 1 nxx

))(())((|)det(|),,( 111 mmmppWpWf μxwμxwx −−=

Aim is to learn W from a dataset

in which unknown. is)()()( 11 mm spspp =s

78

33

ICA likelihood

Log-likelihood function

|)det(|log))((log)(11

WpW jijj

m

j

n

i+−= ∑∑

==

μxw

TTm WWxWxhI

WpWxpWxF −−=

∂∂

= ))()((),,(),,(

)( )(log,,)(log)(1

11

m

mm

ssp

sspsh

∂∂

∂∂=

Estimating equation

Natural gradient algorithm, Amari et al. (1996)

79

34

Beta-ICA

β power equation ),(),,(),,(11

μμμ ββ WBWxFWxf

n

n

iii =∑

=

decompsability:

0]))}({E[

)]()}({E[

])}({E[,

=−×

−−×

−∏≠≠

XwXwp

XwhXwp

Xwp

tttt

ssssss

tqsqqq

β

β

β

μ

μμ

μ

ts ≠∀

80

35

Likelihood ICA

0.5 1 1.5 2 2.5

0.2

0.4

0.6

0.8

1

1.2

1.4

-1.5 -1 -0.5 0.5 1 1.5

-1.5

-1

-0.5

0.5

1

1.5

⎥⎦

⎤⎢⎣

⎡=−

5.01211W

150 signals U(0,1)× U(0,1)

Maximum likelihood

Mixture matrix

81

36

Non-robustness

-2 -1 1 2 3

-2

-1

1

2

-2 -1 1

-2

-1

1

2

50 Gaussian noise N(0, 1)× N(0, 1)

Maximum likelihood

82

37

β power-ICA (β=0.2)

-2 -1 1 2 3

-2

-1

1

2

-10 -7.5 -5 -2.5 2.5 5 7.5

-6

-4

-2

2

4

Minimum β-power

83

38

U-PCA

)()(exp)()),exp(log()( * zzzzz Ψ−=Ψ+=Ψ ηη

∑=

−=n

iiU xrUL

1))),((log()( γμγ

2

22

||||)(||||),(

γγγ yyyr

T

−=

η-sigmoid

)},(min{minargˆ μγγμγ

UU L=

γγγγγ

γγ T

T

iSSxxr max)(tr),(min −=−∑Classical PCA

ryγ

84

39

U-PCA

),(into),(Update ** γμγμ

Tiii xxxw

xwixr

xr

))((),(),S(

),,())((

))((

μμγμμγ

γμγμψ

γμψ

−−−=

∑=

,−

,−where

⎪⎩

⎪⎨

=

=

∑=

i

n

ii x,,xw

,S

)(

)(

1

*

***

μγμ

γλγμγ

85

40

Non-robustness in Classical PCA

-4 -2 2 4 6

-3

-2

-1

1

2

3

-4 -2 2 4 6

-3

-2

-1

1

2

3

-10-5

0

5

10

-10

-5

0

510

-10

-5

0

5

10

-10-5

0

5

10

-10

-5

0

510

MLE for Pc vector = (.55, .82, .01, .07, .01, .032, .10)

MLE for Pc vector = (.00, .01, .05, .04, .02, .99, .00)

86

41

Data-weights in U-PCA

20 40 60 80 100 120 140

-0.2

0.2

0.4

0.6

0.8

1

U-w

eight

Pc vector = (.55, .82, .01, .07, .01, .032, .10)

U-estimator for Pc vector = (. 64, .75, .01, .09, .01, .03, .09)

87

42

2.5 5 7.5 10 12.5 15

0.2

0.4

0.6

0.8

1

radius r

u(r)

1

Tube neighborhood in U-PCA

Ψγ̂

)−− ΨΨ γμx ˆ,ˆ( izix

88

43

Kernel methods

KMs : mapping the data into a high dimensional feature space

Use of kernel functions for vectors, sequence data, text, images.

SVM, Fisher's LDA, PCA, ICA, CCA, SIR, spectral clustering

(kernel trick, cf. Aizerman et al. 1964)

(Any kernel can be used with any kernel-algorithm)

Kernel-Machines Org (http://agbs.kyb.tuebingen.mpg.de/km/bb/)

SVM Org (http://www.support-vector-machines.org/)

(Spline method, cf. Wahba. 1979)

89

44

RKHS

Φ : X → Z, where X is the data space in R^p and Z is a high-dimensional feature space.

Kernel: K(x, u) = Φ(x) · Φ(u)  (kernel trick, cf. Aizerman et al. 1964)

Feature map: h_x = K(·, x) ∈ H, with the reproducing property ( K(x, ·), f )_H = f(x), and

||Φ(x)||_Z^2 = K(x, x) = ||h_x||_H^2.
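A minimal R sketch (illustrative data; the Gaussian/RBF kernel is only one choice) of building a kernel Gram matrix without ever forming Φ explicitly:

rbf_kernel <- function(X, sigma = 1) {
  D2 <- as.matrix(dist(X))^2            # squared Euclidean distances
  exp(-D2 / (2 * sigma^2))              # K(x, u) = exp(-||x - u||^2 / (2 sigma^2))
}

set.seed(1)
X <- matrix(rnorm(20), ncol = 2)        # 10 points in R^2
K <- rbf_kernel(X, sigma = 1)
all(eigen(K, symmetric = TRUE)$values > -1e-10)   # Gram matrix is positive semi-definite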

90

45

Kernel data

[Figure: 30 data points x_i in the square [−1, 1]² (left) and their kernel features h_{x_i} (right), { x_i : i = 1, ..., 30 } and { h_{x_i} : i = 1, ..., 30 }.]

91

46

Ψ-loss function

Let p be a true density function on H and m a proto model function.

Definition 1. U-loss function is

∫+−== τξξθ θθθ dmUmEmpCL PUU ))(()(),()(

We assume an exponential model defined by

)),()(exp()( T θκθθ −= hxhm

If we rewrite Ψ-loss function is

.1))(exp(satisfieswhere T∫ =− τκθκ θθ dhx

then)),(exp()( tt ξ−=Ψ

∫ −Ψ−+−Ψ= τκθκθθ θθ dhxUhxEL TTPU )))((()})(({)(

92

47

KPCA Cf. Scholkopf et al 1998 NC

Robust PCA

Wang, Karhunen, Oja (NN 1995)

Higuchi, Eguchi (NC 1998; JMLR 2004)

Xu, Yuille (IEEE NN 1995)

Hubert, Rousseeuw, Branden (Technometrics, 2005)

Robust KPCA

93

48

U-KPCA

∑=∈Γ∈

ΨΨ ΓΨ=Γn

iiOHm

mhzm xk 1,

)),,((minarg)ˆ,ˆ(

Ψ-Kernel Principal Component Analysis Cf. Haung, Yi-Ren , Eguchi (2008)

where Ψ(z) is strictly increasing function and

Remark Ψ0-KPCA = KPCA if

Example )}exp(1{)( 11 zz ββ −−=Ψ −

))(exp(1)exp(1log)( 1

2 zz

−++

=Ψ −

ηββηβ

zz =Ψ )(0

)()(lim 010zz Ψ=Ψ

→βη

β−Ψ=Ψ

→)()(lim 020

zz

2 4 6 8 10

2

4

6

8

10)(1 zΨ

)()(lim 02 zz Ψ=Ψ∞→η

1.0=β5.0=β1=β

0=β

}||)(||||{||21),,( 2*2

HxHxx mhmhmhz −Γ−−=Γ

94

49

Toy example95

50

Functional data

5 phonemes (TIMIT database) http://www.lsp.ups-tlse.fr/staph/npfda/

sh iy aadcl ao

she she dark dark water

}2000,...,1:),{( == iyfD ii 5% contamination

)15,...,1,2000,...,1(,)()(~==+= jiutftf ijijjiji δ

)15,10(Uniform~,)05.0(Bernoulli~ ijij uδ

96

51

FPCA97

1

Learning with Information Divergence Geometry

Shinto Eguchi and Osamu Komori

The Institute of Statistical Mathematics, Japan
Email: eguchi@ism.ac.jp, komori@ism.ac.jp
URL: http://www.ism.ac.jp/~eguchi/

Tutorial Workshop on

98

2

Outline

9:30~10:30 Information divergence class and robust statistical methods I

11:00~12:00 Information divergence class and robust statistical methods II

13:30~14:30 Information geometry on model uncertainty

15:00~16:00 Boosting learning algorithm and U-loss functions I

9:30~10:30 Boosting learning algorithm and U-loss functions II

11:00~12:00 Pattern recognition from genome and omics data

99

3

Information geometry on model uncertainty

100

4

Observational bias

The theory of statistical inference is formulated under an assumption about the random mechanism, for example random sampling.

However, the assumption is frequently untestable in situations that involve observational studies. In this sense we have to make a sufficiently cautious inference.

Typically, missing data arise from a variety of missing mechanisms, such as missing completely at random, missing at random, and missing not at random. In particular, missing not at random brings about a serious bias in the inference.

101

5

Hidden Bias

Publication bias -not all studies are reviewed

Confounding -causal effect only partly explained

Measurement error -errors in measure of exposure

102

6

Lung cancer & passive smoking

1.00.50.3 1.5 2.0 3.0 4.0 5.0 10.0

stud

y

Odds ratio

510

1520

2530

103

7

Passive smoke and lung cancer

Log relative risk estimates (j =1,…,30) from 30 2x2tables jθ

)weighte varinancinverse theis( jj

jj ww

w

∑∑=

θθ

The estimated relative risk 1.24 with 95% confidence interval (1.13, 1.36)

104

Conventional analysis

1.00.50.3 1.5 2.0 3.0 4.0 5.0 10.0

stud

y

Odds ratio

510

1520

2530

1.24

105

9

Incomplete Data

z = (data on all studies, selection indicators)

y = (data on selected studies)

z = (response, treatment, potential confounders)

y = (response, treatment)

z = (disease status, true exposure, error)

y = (disease status, observed exposure)

y = h(z)

106

10

Level Sets of h(z)

1. One-to-one 2. Missing 3. Measurement error

4. Interval censor 5. Competing risk 6.Hidden confounder

107

11

Tubular Neighborhood

MεM

})),,((KLmin:),,({ 221 εεε ≤⋅=

∈ YYY gfg θθyΘθ

N

}:),({ Θ∈= θθyYfMModel

Copas, Eguchi (2001)

2

2

),(KL εε ≤MM

Near-model }:),,({ Θ∈= θθyY εε gM

108

12

Mis-specification

}{ ),(exp),(),,( θzθzθz ZZZ ufg εε =

0=== )(E,1)(,0)(E 2ZZZZ suuEu fff

21

}),KL(2{ ZZ fg=ε

"direction cationmisspecifi"=Zu

109

13

Near model

}{ ),(exp),(),,( θyθyθy YYY ufg εε =

],|),([E),(where yθzθy ZθY uu =

)(:By zhyzh =

h),( θzZf ),( θyYfModel

Near-modelh

),,( εθzZg ),,( εθyYg

110

14

Ignorable incompleteness

Let Y = h(Z) be a many-to-one mapping.

If Z has then Y has

∫ −=

)(1d),(),(

yhZY zzθy θff

Z is complete; Y is incomplete

),( θzZf

ZθZ onData ˆMLE ←

YθY onData ˆMLE ←

)ˆE()ˆ(E trueis ZYZ θθ =⇒f

)ˆE()ˆ(E wrongis ZYZ θθ ≠⇒f

111

15

bias),,,( from },...,{ˆMLE virtual 1 ZZZ θzzzθ ugn ε←

),,,(from},...,{ˆMLEactual 1 YYY θyyyθ ugn ε←

)()|(|),()ˆE()ˆ(E 12 −++=− nOOu εε θbθθ ZZY

)(( ,cov),cov),( 11YYYZZZYZZ ssbbθb uGuGu −− −=−=

⎥⎥⎦

⎢⎢⎣

⎡⎟⎟⎠

⎞⎜⎜⎝

⎛⎟⎟⎠

⎞⎜⎜⎝

⎛⎯⎯ →⎯⎟

⎟⎠

⎞⎜⎜⎝

−−−−

−−

−−

11

11Dist 1,ˆ

ˆ

YY

YZ

YY

ZZ

bθθbθθ

GGGG

nNn00

εε

bias

Limit distribution

112

16

Asymtotic bias

)1( ||||max min22

|λε

ε−=b

Zu

ifonly and if holds bounds The

.),(),( θysθy YY ∈u

)ˆE()ˆ(E ZY θθb −=

loss ninformatio of eigenvaluesmallest theisminλ

2/12/1 −−=Λ YZY III

113

17

Problem in estimation of bias

The nonignorable model

}{ 22 )(),(exp),(),,( 21 ερεε θθyθyθy YYY −= ufg

gives the worst case if ).,(),( T θysωθy YY =u

However is inestimable and untestable: ω

The profile likelihood ∑=

∈=

n

iigPL

1

)},,,({logmax),( εε ωθyω YΘθY

is flat at 0=ω

114

18

Heckman model for MNAR

))(()1( T1T, xβxψX|R −+Φ== − yrg Y σωεσ+= xβTy

),,()(),,( )( rthrt r xzxz ==

ω

115

19

From pure misspecification

Unbiased perturbed biased perturbedh

116

20

The worst case for bias

),(),(),( T2 θbθbθ ZYZZ uGuu =β

)( 11,cov),( YYZZZZ ssθb −− −= GGuu

ddddθθ

YZYY

YZYYZYYZZ GGGG

GGGGGGGuu)(

)()(),(),( 11T

1111T*22

−−

−−−−

−−−

=≤ ββ

dd

szhsdz

ZY

ZZYYZ

)(

))(()(iff attainsBound

11T

11T* }{

−−

−−

−=

GG

GGu

Cauchy-Schwartz’s inequality

)()(

)(11T

11T* )(

ysdd

dy Y

ZY

ZYY −−

−−

−=

GG

GGu εM

M

117

21

The worst case

),,,( εωθyYg

),( θyYf

),( ωθy YY If ε+ ).,(),( T θysωθy YY =u

*)),(),,,,(KL(minarg*

θωθωθ YYΘθ

Y ⋅⋅=+∈

fgI εε

118

22

Sensitivity analysis

}{ TT

21),(exp),(),,( ωωθysωθyωθy YYY YIfg −=

θθysωθysωθy

θY

YY ∂∂

+=∂∂ − ),(),()},,(log{ 2/1T

YIg

The most sensitive model

Estimating function of θ with fixed ε, ω

Yθ̂}const.:ˆ{ T

, =ωωθ Yω IY

119

23

Behavior of two MLEs

⎟⎟⎠

⎞⎜⎜⎝

⎛⎟⎟⎠

⎞⎜⎜⎝

−−−

⎟⎟⎠

⎞⎜⎜⎝

⎛→⎟

⎟⎠

⎞⎜⎜⎝

−−

−−−−

−−−

1111

111

,,,ˆˆ

ˆ

ZYZY

ZYY

ZY

Y

bb

θθθθ

IIIIIIInNn

The two MLEs and are asymptotically normal asYθ̂ Zθ̂

⎟⎠⎞

⎜⎝⎛→=−− −11,)ˆˆ(|)ˆ( Zuuθθθθ ZYY InN

D

Note:this asymptotic expression is valid only when )( 2/1−= nOε

Note: The conditioning cancels selection bias.

120

24

Scenarios A, B, C

Inference from using fYnyy ...,,1

}.)()ˆ()ˆ(:{)(C 2αrkIk T ≤−−= YYY θθθθθ

Scenario A: 10 =⇒= Akε

Scenario C: 1unknown0 >⇒> Ckε

Scenario B:acceptable

n

found had and

,..., observed had weif,0 1 zz>ε

CBA kkk <<⇒

121

25

Scenarios A and C

),0(~)ˆ(2/1 INI fYθθYY −

}.)()ˆ()ˆ(:{)(C 2αrkIk T ≤−−= YYY θθθθθ

Scenario A: ⇒=0ε

Scenario C:),0(~)ˆ(

unknown02/1 INI gY

bθθYY ε

ε

−−

⇒>

?!)(,1 AA kCk =

?!)(,1 22CC kCk κε+=

122

26

Scenario B

andˆ MLEhave couldwe,,..., observe could weIf 1 Zθzz n

),0(~)ˆˆ()( 2/12/1 INII fYZYY θθU −Λ−= −

),0(~|)(* INgYUSUS =

)}ˆˆ()ˆ{()ˆ( 2/12/1ZYYZZZ θθθθθθS −−−=−= II

Conditional confidence interval

}||)(||:{)( 22*αrC ≤= uSθu

123

27

Assumption of non-randomness

√n ( θ̂_Y − θ̂_Z ) ~_f N( 0, I_Y^{−1} − I_Z^{−1} ),

( θ̂_Y − θ ) | ( θ̂_Y − θ̂_Z = u ) ~_g N( u, I_Z^{−1} ),

so the conditional confidence region is

C_α(u) = { θ : ( θ̂_Y − θ − u )^T I_Z ( θ̂_Y − θ − u ) ≤ r_α^2 },

and the acceptance region of the test of the hypothesis H: bias b = 0 is

B_α = { u : u^T ( I_Y^{−1} − I_Z^{−1} )^{−1} u ≤ r_α^2 }.

The union of the conditional confidence regions over all u accepted at level α under H: b = 0 is

C = ∪_{u ∈ B_α} C_α(u).

124

28

Theorem

}.)()ˆ()ˆ(:{)(CLet 2αrkIk T ≤−−= YYY θθθθθ

).2()()1( Then22||||

CCCr

⊆⊂≤

uu α

-1.5 -1 -0.5 0.5 1 1.5

-1.5

-1

-0.5

0.5

1

1.5

-1.5 -1 -0.5 0.5 1 1.5

-1.5

-1

-0.5

0.5

1

1.5

-1.5 -1 -0.5 0.5 1 1.5

-1.5

-1

-0.5

0.5

1

1.5

125

29

-1 -0.5 0.5 1

-1

-0.5

0.5

1

-1 -0.5 0.5 1

-1

-0.5

0.5

1

-1 -0.5 0.5 1

-1

-0.5

0.5

1

-1 -0.5 0.5 1

-1

-0.5

0.5

1

)001.0,001.0(),( 21

=λλ

)5.0,5.0(),( 21 =λλ)1.0,1.0(),( 21 =λλ )9.0,1.0(),( 21 =λλ

s.t.),( if attainable is boundupper The21

jiji λλ ≤≤∃

αB

})ˆ()ˆ(var)ˆ(:{)( 21Tαα rrCR A ≤−−= − θθθθθθ

)2()( αα rCRCrCR ⊆⊆

1-dimensional case P = 5% → 0.3%

126

30

-1 -0.5 0.5 1

-1

-0.5

0.5

1

Double-the-variance rule}:);({ Θ∈= θθyfMStatistical model

Random sample Mfn ∈);( ,...,iid

1 θyyy ~

α % confidence interval

})ˆ()ˆ(var)ˆ(:{)( 21Tαα rrCR A ≤−−= − θθθθθθ

)( αrCR )2( αrCR

M }{ εM

127

31

Passive smoke and lung cancer

The estimated relative risk 1.24 with 95% confidence interval (1.13, 1.36)

Square root rule95% confidence interval (1.08, 1.41)

128

32

Risk from passive smoke

129

Root-2-rule

1.00.50.3 1.5 2.0 3.0 4.0 5.0 10.0

stud

y

Odds ratio

510

1520

2530

1.24

130

34

Passive smoke and lung cancer

The estimated relative risk 1.24 with 95% confidence interval (1.13, 1.36)

Square root rule95% confidence interval (1.08, 1.41)

131

35

Reference

Eguchi & Copas (1998) JRSSB (near-parametric)

Copas & Eguchi (2001) JRSSB (ε -perturbed model)

Copas & Eguchi (2005) JRSSB (discussions) (Double-the-variance rule)

Eguchi & Copas (2002) Biometrika (Kullback-Leibler divergence)

Henmi, Copas & Eguchi (2007) Biometrics (Meta analysis)

Henmi & Eguchi (2004) Biometrika (Propensity score)

Statistical Analysis With Missing Data. R. J. A. Little, D. B. Rubin. Wiley (2002)

S. Greenland. JRSSA (2005) (Multiple-bias modelling)

Copas & Eguchi (2010) JRSSB (statistical equivalent models)

132

36

Present and Future

Does all this matter?

Statistics ( missing data, response bias, censoring)

Biostatistics (drop-outs, compliance)

Epidemiology ( confounding, measurement error)

Econometrics (identifiability, instruments)

Psychometrics (publication bias, SEM)

causality, counter-factuals, ...

133

1

Learning with Information Divergence Geometry

Shinto Eguchi and Osamu Komori

The Institute of Statistical Mathematics, Japan
Email: eguchi@ism.ac.jp, komori@ism.ac.jp
URL: http://www.ism.ac.jp/~eguchi/

Tutorial Workshop on

134

2

Outline

9:30~10:30 Information divergence class and robust statistical methods I

11:00~12:00 Information divergence class and robust statistical methods II

13:30~14:30 Information geometry on model uncertainty

15:00~16:00 Boosting learning algorithm and U-loss functions I

9:30~10:30 Boosting learning algorithm and U-loss functions II

11:00~12:00 Pattern recognition from genome and omics data

135

3

Boosting learning algorithm

and U-loss functions I

136

4

Which chameleon wins?

137

5

Pattern recognition…

138

6

What is pattern recognition?

□ There are a lot of examples for pattern recognition.

□ In principle, pattern recognition is a prediction problem for a class label that encodes what humans find interesting and important.

□ Originally the human brain wants to label phenomena with a few words, for example (good, bad), (yes, no), (dead, alive), (success, failure), (effective, no effect), ….

□ The brain intrinsically predicts the class label from empirical evidence.

139

7

Practice

● Character recognitionvoice recognition

Image recognition

☆ Credit scoring

☆ Medical screening

☆ Default prediction

☆ Weather forecast

● face recognition

finger print recognition●

speaker recognition●

☆ Treatment effect

☆ Failure prediction ☆ Infectious disease

☆ Drug response

140

8

Feature vector & class label

),...,( 1 pxx=xFeature vector

Class label y

Feature space pR⊆X

},,1{ GLabel set

Training data

Test data

}...,,1:),({train niyD ii == x

}...,,1:),({ testtesttest mjyD jj == x

141

9

Classification rule

),...,( 1 pxx=x },,{ Gy 1∈

zyF →),(: x

Classification rule ),(maxarg)(},,1{

yFhGy

F xx∈

=

Feature vector Class label

yh →x:classifier

Discriminant function (score)

3)( =xFh

1 2 3 4 5

)1,( xF)2,( xF

)3,( xF

)4,( xF)5,( xF

)( R∈z

Consider a case of G = 5 for a given x

142

10

Binary classification

),...,( 1 pxx=x }1,1{ +−∈y

zF →x:

classifier )}(sgn{)( xx FhF =

Feature vector Class label

score

)1,()1,()( −−+= xxx FFF

In a binary class y = −1, 1 (G=2)

)1,()1,(1))((sgn)1,()1,(1))((sgn

−<+⇔−=−>+⇔=

xxxxxx

FFFFFF

),(maxarg))(sgn(}1,1{

yFFy

xx−∈

=

143

11

Multi -class

),...,( 1 pxx=x },,{ Gy 1∈

zyF →),(: x

),(maxarg)(},,1{

yFhGy

F xx∈

=

yh →x:

3)( =xFh

1 2 3 4 5

)1,( xF)2,( xF

)3,( xF

)4,( xF)5,( xF

)( R∈z

G = 5,

Classifier

Feature vector Class label

Score function

Classifier by score

144

12

Probability distribution

).,( vector random of pdf a be),(Let yyp xx

∫ ∑∈

=B Cy

ypCBp }d),({),( xx

∫= Xxx d),()( ypypMarginal density

Conditional density

∑∈

=},...,1{

),()(Gy

ypp xx

)(),()|(

xxx

pypyp =

)(),()|(

ypypyp xx =

)()|()()|(),( xxxx pypypypyp ==

)1|()1|(

)1,()1,(

−==

=−==

ypyp

ypyp

xx

xx

145

13

Error ratepR∈x },,{ Gy 1∈

Classifier )( xFh

Feature vector ,class label

has error rate

))(Pr()(Err yhh FF ≠= x

∑∑=≠

==−====G

iF

jiFF iyihjyihh

1),)(Pr(1),)(Pr()(Err xx

Training error

}...,,1:),({for train niyD ii == x

}...,,1:),({for testtesttest mjyD jj == x

nyhih iiF

F})(:{#)(Err train ≠

=x

Test error

myhj

h jjFF

})(:{#)(Err

testtesttest ≠

=x

146

14

False negative/positive

FNP )11)(Pr( )FN( +=−== y|hh FF x

)11)(pr( )FP( −=+== y|hh FF x

True Negative

True Positive False Positive

False Negative

1+=y 1−=y

1)( +=xFh

1)( −=xFh

FPR

)1pr()FP()1pr()FN( )Err( −=++== yhyhh FFF

147

15

Bayes rule

Let p(y|x) be the conditional probability given x. Define

F_0(x) = log { p(y = 1 | x) / p(y = −1 | x) }.

The classifier h_Bayes(x) = sgn( F_0(x) ) leads to the Bayes rule.

Theorem 1. For any classifier h,

Err(h_Bayes) ≤ Err(h).

Note: The optimal classifier is equivalent to the likelihood ratio. However, in practice p(y|x) is unknown, so we have to learn h_Bayes(x) based on the training data set.

148

16

Discriminant space associated with Bayes classifier is given by

{ }{ })|1()|1(: }1)(:{

)|1()|1(: }1)(:{

BayesB

BayesB

xxRxxRx

xxRxxRx

−=<+=∈=−=∈=

−=≥+=∈=+=∈=−

+

ypyphR

ypyphRpp

pp

},{ −+ RR

Error rate for Bayes rule

In general, when a classifier h associates with spaces

)}(Err1{)}(Err1{

)|1()()|1()(

)|1()()|1()(

)|1()()|1()(

)|1()()|1()()(Err)(Err

Bayes

\\\\

\\\\

)()(

)()(

)()(

)()(

BB

BB

Bayes

hh

dyppdypp

dyppdypp

dyppdypp

dyppdypphh

RRRR

RRRRRRRR

RRRRRRRR

RRRR

BBBB

BBBB

−−−=

+=−+−=−=

+=−+−=−≥

−=−++=−=

−=−++=−=−

∫∫∫∫

∫∫∫∫

∫∫∫∫

∫∫∫∫

++−−

++++−−−−

++++−−−−

++−−

xxxxxx

xxxxxx

xxxxxx

xxxxxx

)(Err)(Err Bayes hh ≥

−B−−

BRR \ −− ∩ BRR

−R −BR

−− RRB \

149

17

Multivariate normal distribution

The p-variate normal (Gaussian) distribution is defined by the pdf

φ(x, μ, V) = (2π)^{−p/2} (det V)^{−1/2} exp{ −(x − μ)^T V^{−1} (x − μ) / 2 }.

Assume that (x, y) has pdf

p(x, y) = p(y) φ(x, μ_y, V),  x ∈ R^p, y ∈ {1, ..., G}

(assumption of an equal variance matrix across classes).

[Figure: two Gaussian class-conditional densities, G = 2.]

150

18

Bayes classifier

F_Bayes(x) = log { p(y = +1 | x) / p(y = −1 | x) } = α^T x + α_0,

α = V^{−1} ( μ_{+1} − μ_{−1} ),

α_0 = −(1/2) ( μ_{+1} + μ_{−1} )^T V^{−1} ( μ_{+1} − μ_{−1} ) + log { p(y = +1) / p(y = −1) }.

We call this the Fisher linear discriminant function.

Plug into F_Bayes(x) the sample estimates

μ̂_{+1} = Σ_i I(y_i = +1) x_i / Σ_i I(y_i = +1),  μ̂_{−1} = Σ_i I(y_i = −1) x_i / Σ_i I(y_i = −1),

V̂ = (1/n) Σ_i ( x_i − μ̂_{y_i} ) ( x_i − μ̂_{y_i} )^T.
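A minimal R sketch (not from the slides) of this plug-in Fisher linear discriminant with simulated two-class Gaussian data and equal priors:

set.seed(1)
n  <- 100
x1 <- matrix(rnorm(2 * n, mean =  1), ncol = 2)   # class y = +1
x0 <- matrix(rnorm(2 * n, mean = -1), ncol = 2)   # class y = -1
mu1 <- colMeans(x1); mu0 <- colMeans(x0)
V   <- ((n - 1) * cov(x1) + (n - 1) * cov(x0)) / (2 * n - 2)   # pooled covariance

alpha  <- solve(V, mu1 - mu0)
alpha0 <- -0.5 * sum((mu1 + mu0) * alpha)   # equal priors, so no log-odds term

F_lda <- function(x) sum(alpha * x) + alpha0
sign(F_lda(c(2, 2)))   # +1: classified into the y = +1 group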

151

19

Boost learning

Boosting by filtering (Schapire, 1990)

Bagging, Arcing (bootstrap) (Breiman, Friedman, Hastie)

AdaBoost (Schapire, Freund, Bartlett, Lee)

Can weak learners be combined into a strong learner?

Weak learner (classifier) = error rate slightly less than 0.5

Strong learner (classifier) = error rate only slightly larger than that of the Bayes classifier

152

20

Web-page on Boost

http://www.boosting.org/

http://www.fml.tuebingen.mpg.de/boosting.org/tutorials

R. Meir and G. Rätsch. An introduction to boosting and leveraginghttp://www.boosting.org/papers/MeiRae03.pdf

R.E. Schapire. A brief introduction to boosting. In Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence, 1999.http://www.boosting.org/papers/Sch99e.ps.gz

Robert Schapire’s home page:http://www.cs.princeton.edu/~schapire/Yoav Freund's home page :http://www1.cs.columbia.edu/~freund/

153

21

Set of weak learners

{ }RI},1,1{},,,1{:)(sgn),,(stamp ∈+−∈∈−== bapjbxabaf jj xFDecision stumps

{ }1010

T1linear RI),(:)(sgn),( +∈=+== pf ββ ββxββxF

Linear classifiers

Neural net SVM k-nearest neighbur

Note: not strong but a variety of characters

linearstamp FF ⊆

154

22

Exponential loss function

Let D_train = { (x_i, y_i) : i = 1, ..., n } be a training (example) set.

Empirical exponential loss for a score function F(x):

L_exp^D(F) = (1/n) Σ_{i=1}^n exp{ −y_i F(x_i) }

Expected exponential loss for a score function F(x):

L_exp^E(F) = ∫ Σ_{y ∈ {−1,+1}} exp{ −y F(x) } q(y | x) q(x) dx,

where q(y|x) is the conditional distribution given x and q(x) is the pdf of x.

155

23

Learning algorithm

)},( ),...,,{( 11 nn yy xx

)(,),1( 11 nww

)(,),1( 22 nww1ε

)()1( xf

∑=

T

1)( )(

ttt f xα

)(,),1( nww TT

)()2( xf

)()( xTf

Final learner ∑=

=T

tttT fF

1)( )()( xx α

156

24

Learning curve

[Figure: training error curve over 250 boosting iterations.]

Iteration number

Training curve

157

25

AdaBoost

1. Initial: w_1(i) = 1/n (i = 1, ..., n), F_0(x) = 0.

2. For t = 1, ..., T:

(a) ε_t(f_t) = min_{f ∈ F} ε_t(f),  where ε_t(f) = Σ_i w_t(i) I( f(x_i) ≠ y_i ) / Σ_{i'} w_t(i')

(b) α_t = (1/2) log { ( 1 − ε_t(f_t) ) / ε_t(f_t) }

(c) w_{t+1}(i) = w_t(i) exp{ −α_t y_i f_t(x_i) }

3. Output sign( F_T(x) ),  where F_T(x) = Σ_{t=1}^T α_t f_t(x).
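Below is a minimal R sketch (not taken from the slides) of steps 1–3 with decision stumps as the weak learners; the data, the noise level and the number of rounds are illustrative.

set.seed(1)
n <- 200
x <- runif(n, -1, 1)
y <- ifelse(x > 0.2, 1, -1)
flip <- sample(n, 10); y[flip] <- -y[flip]      # a few mislabeled examples

stump <- function(x, b, s) s * sign(x - b)      # weak learner f(x) = s * sgn(x - b)

w <- rep(1 / n, n); Fx <- rep(0, n); T_rounds <- 30
for (t in seq_len(T_rounds)) {
  # (a) pick the stump with the smallest weighted error
  best <- list(err = Inf)
  for (b in x) for (s in c(-1, 1)) {
    err <- sum(w * (stump(x, b, s) != y))
    if (err < best$err) best <- list(err = err, b = b, s = s)
  }
  # (b) coefficient alpha_t
  alpha <- 0.5 * log((1 - best$err) / best$err)
  f <- stump(x, best$b, best$s)
  # (c) re-weight the examples and renormalize
  w <- w * exp(-alpha * y * f)
  w <- w / sum(w)
  Fx <- Fx + alpha * f
}
mean(sign(Fx) != y)   # training error of the combined classifier sign(F_T(x))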

158

26

Update weight

Update w_t(i) → w_{t+1}(i): multiply w_t(i) by exp(α_t) if f_t(x_i) ≠ y_i, and by exp(−α_t) if f_t(x_i) = y_i.

The weighted error rates then satisfy ε_t( f_t ) → ε_{t+1}( f_t ) = 1/2 (the worst case) → ε_{t+1}( f_{t+1} ).

Weighted error rate )()()( )1(1)(1)( +++ →→ tttttt fff εεε

159

27

21)( )(1 =+ tt fε

∑∑+

+

=+ ≠=

)'()())(()(

1

1

1)(1 iw

iwyfIft

tn

iiittt xε

=

=

−≠= n

ititit

n

itititiit

iwfy

iwfyyfI

1)(

1)()(

)()}(exp{

)()}(exp{))((

x

xx

α

α

∑∑

==

=

=−+≠

≠= n

itiitt

n

itiitt

n

itiitt

iwyfIiwyfI

iwyfI

1)(

1)(

1)(

)())((}exp{)())((}exp{

)())((}exp{

xx

x

αα

α

21

)}(1{)(1

)()(

)()(1

)()(

)(1

)()(

)()(

)(

)(

)()(

)(

=

−−

+−

=

tttt

tttt

tt

tt

tttt

tt

ff

ff

ff

ff

f

εε

εε

εε

εεε

160

28

Update in decision stumps

Wrong example 5 5 5 6 466 5 565

}|)(|21{minarg ∑ −=

iiij yfb

bx

Decision stump⎩⎨⎧

<−>+

=−=jj

jjjjjj bx

bxfffs

if1 if1

)(where)(or)()( xxxx

jbThe j-th feature vector njj xx ,...,1

jx

161

29

Next step

[Figure: weighted error counts of the candidate decision stumps after re-weighting.]

Update of weights:

jx

4.5

jbWeight up to 2

}|)(|)(21{minarg ∑ −=

iiijj ysiwb

bx

2log4

16log5.0ans. false of nb.ans.correct of nb.log5.01 ==⎥⎦

⎤⎢⎣⎡=α

Weight down to 0.5

jb

162

30

Update in exponential loss

)}(exp{1)(1

exp ii

n

i

Fyn

FL x−= ∑=

)()()( xxx fFF α+→

}))(1()(){(exp fefeFL εε αα −+= −

Consider

[ ]∑=

− =+≠−=n

iiiiiii yfeyfeFy

n 1

))(I())(I()}(exp{1 xxx αα

∑=

−−=+n

iiiii fyFy

nfFL

1exp )}(exp{)}(exp{1)( xx αα

)(

)}(exp{))(()(

exp

1

FL

FyyfIf

n

iiiij∑

=

−≠=

xxεwhere

163

31

Sequential optimization

L_exp(F + αf) = L_exp(F) { ε(f) e^{α} + ( 1 − ε(f) ) e^{−α} }

≥ L_exp(F) · 2 √{ ε(f) ( 1 − ε(f) ) },

with equality if and only if

α = α_opt = (1/2) log { ( 1 − ε(f) ) / ε(f) }.
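As a quick numerical check (illustrative values only), the following R lines confirm that α_opt minimizes ε e^α + (1 − ε) e^{−α} and attains the bound 2 √(ε(1 − ε)):

eps <- 0.3
g   <- function(a) eps * exp(a) + (1 - eps) * exp(-a)
a_opt <- 0.5 * log((1 - eps) / eps)
c(g(a_opt), 2 * sqrt(eps * (1 - eps)), optimize(g, c(-5, 5))$minimum, a_opt)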

164

32

AdaBoost = sequential minimization of the exponential loss

)(minarg)a( )( ff tFf

t ε∈

=

)}(exp{)()((c) 1 itittt xfyiwiw α∝+

)}(1){()()(min )()(1exp)(1exp ttttt ffFLfFL εεαα

−=+ −−∈ R

)()(1

log21

)(

)(opt

t

t

ff

εε

α−

=

)(minarg)b( )(1exp ttt fFL ααα

+= −∈ R

165

33

Proposal from machine learning

Learnability: boosting weak learners?

AdaBoost : Freund & Schapire (1997)

Weak learners (machines)

})(,....),({ 1 xx pff

Strong machine

)()( )()1(1 xx tt ff αα ++

)( xf

)()1(1 xfαForward stagewise

166

34

Simulation (complete separable)

Training data { (x_i, y_i) : i = 1, ..., 1000 } with x_i ∈ [−1, 1] × [−1, 1] and y_i ∈ {−1, +1}.

Feature space: [−1, 1] × [−1, 1]; decision boundary: x_2 = sin(2π x_1).

[Figure: the completely separable training sample and the sinusoidal decision boundary.]

167

35

Set of linear classifiers

Linear classification machines:

f(x_1, x_2) = sgn( r_1 x_1 + r_2 x_2 + r_3 ) = +1 if r_1 x_1 + r_2 x_2 + r_3 ≥ 0, and −1 otherwise.

Random generation: (r_1, r_2, r_3) ~ U(−1, 1)^3.

[Figure: one randomly generated linear classifier on [−1, 1]².]

168

36

Learning process

[Figure: decision regions of the combined classifier during learning — iter = 1 (train err = 0.21), 13 (0.18), 17 (0.10), 23 (0.10), 31 (0.095), 47 (0.08).]

169

37

Learning process (II)

[Figure: iter = 55 (train err = 0.061), 99 (0.032), 155 (0.016).]

170

38

Final decision boundary

[Figure: contour of F(x) (left) and sign(F(x)) (right).]

171

39

KL divergence

},,1{ G=Y

For conditinal distribution p(y|x), q(y|x) given x with commonmarginal density p(x) we can write

Nonnegative function

Feature space pR⊆X

),(),(),,( YXxxx ∈∈ yyym μ

∫ ∑=

+−=X

xxxxxx

G

yyym

yymymmD

1KL d)},(),(

),(),(log),({),( μ

μμ

Note

)()|(),(),()|(),( xxxxxx pyqypypym == μ

∫ ∑=

=X

xxxxx

G

yp

yqypypmD

1KL d)(}

)|()|(log)|({),( μ

Label set

Then,

172

40

Twine of KL loss functionFor data distribution q(x, y) = q(x) q(y|x) we model as

∑=

−=G

gyFgFym

11 )},(),(exp{)|( xxx

xxxxxxx

Xd)()}|()|(

)|()|(log)|({),(

}1,1{KL qymyq

ymyqyqmqD

y∫ ∑

−+∈

+−=

Then

∑=

= G

ggF

yFym

1

2

)},(exp{

)},(exp{)|(x

xx

Log lossExp loss

173

41

Bound for exponential loss

Expected exp loss: L_exp(F) = E[ exp{ −Y F(X) } ]

Empirical exp loss: L_exp^n(F) = (1/n) Σ_{i=1}^n exp{ −y_i F(x_i) }

Theorem. Let F be the space of all discriminant functions and F_opt = argmin_{F ∈ F} L_exp(F). Then

F_opt(x) = (1/2) log { p(y = +1 | x) / p(y = −1 | x) }.

174

42

Variational calculus

∂/∂δ L_exp(F + δ) = E[ −Y exp{ −Y F(X) } ]

= E[ −exp{ −F(X) } p(y = +1 | X) + exp{ F(X) } p(y = −1 | X) ],

and setting this to zero gives

F_opt(x) = (1/2) log { p(y = +1 | x) / p(y = −1 | x) }.

175

43

On AdaBoost

FDA or logistic regression: a parametric approach to the Bayes classifier,

F_1(x) = α_0 + Σ_{j=1}^p α_j x_j = α_0 + α^T x.

AdaBoost: F_1(x) = Σ_{t=1}^T α_t f_t(x); each f_t(x) is itself a classifier (cf. Real AdaBoost).

The stopping time T can be selected according to the state of learning.

This is also a parametric approach to the Bayes classifier, but the dimension and the basis functions are flexible.

176

44

On AdaBoost (II)

1. Unbalanced examples → AsymAdaBoost: balancing the false negatives/positives

2. Overlearning → LocalBoost: local learning

Mislabeled examples → EtaBoost: robust against mislabeled examples

High dimension and small sample → GroupBoost

45

Simulation (complete random)

-1 -0.5 0.5 1

-1

-0.5

0.5

1

178

46

Overlearning of AdaBoost

[Figure: AdaBoost decision regions at iter = 51 (train err = 0.21), 151 (0.06), 301 (0.0), showing over-learning.]

179

47

U-Boost

))(())((1)(1

empθθ xθ qUq

nL

n

iiU ξξ +−= ∑

=U-empirical loss function

∑∑∑= ==

+−=n

i

G

gi

n

iiiU gFU

nyF

nFL

1 11

emp )),((1),(1)( xx

In a context of classification

Unnormalized U-loss

∑∑= =

−=n

i

G

giiiU yFgFU

nFL

1 1

(0) )),(),((1)( xx

Normalized U-loss

∑∑∑= ==

+−=n

i

G

gi

n

iiiU gFU

nyF

nFL

1 11

(1) )),((1),(1)( xx

∑=

=G

ggFu

11)),(( subject to x

180

48

U-Boost (binary)

Unnormalized U-loss

∑ ∑= ±=

−=n

i giiU FyU

nFL

1 1

(0) ))((1)( x

Note ))(()),(),((1

iig

iii FyUyFgFU xxx −=−∑±=

)}1,()1,({)(where21 −−= xxx FFF

Bayes risk consistency

)}|1())(()|1())(({)(

))(()( 1

xxxxx

xx

−=+=−∂∂

=−∂∂ ∑

±=ypFUypFU

FyFU

F y

))}(({minarg where)|1(

)|1())((

))(( **

*xyFUEF

ypyp

FuFu

F−=

−==

=− x

xx

x

F*(x) is Bayes risk consistent because 0)}({

)()()()()(

)(2 >

−−+−

=−∂

∂Fu

FuFuFuFuFu

FuF

181

49

Eta-loss function

Regularized AdaBoost ("AdaBoost with margin"), with generator

U(F) = (1 − η) exp(F) + η F.
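A minimal R sketch (illustrative η values) comparing the exponential generator with the eta-loss generator; η = 0 recovers AdaBoost's exponential generator:

U_eta <- function(t, eta) (1 - eta) * exp(t) + eta * t

t <- seq(-4, 2, by = 0.1)
plot(t, U_eta(t, 0), type = "l", xlab = "t", ylab = "U(t)")
lines(t, U_eta(t, 0.3), lty = 2)   # the linear term dominates for t << 0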

182

50

EtaBoost for Mislabels Expected Eta-loss function

)}())({exp()1()( xx yFyFEFL ηηη −−−=

Optimal score )(minarg* FLFF

η=

The variational argument leads to

ηη

ηη

2))(1(

)1()|()()(

)(

*21*

21

*21

++−

+−=

− xx

x

xFF

yF

ee

eyp

)()(

)(

**

*

1

1)(1

))(1(xx

xxx

FF

F

ee

e

++

+−= εε

ηη

ηε2))(-(1

)( where)()(

21

21

++=

− xxx

FFee

Mislabel modeling

183

51

EtaBoost

0)(),1()(: settings Initial.1 011

=== xFniiw n    

,)())((I)(1

iwfyf m

n

iiim ∑

=

≠∝ xε

∑=

=T

tttTT fFF

1)( )()( where,)(sign.3 )( xxx α

Tm ,,1For .2 =

))(exp()()()c( )(*

1 iimmmm yfiwiw xα−∝+

)(min)()a( )( ff mfmm εε =

(b)

184

52

A toy example

185

53

Examples partly mislabeled

186

54

AdaBoost vs. EtaBoost

AdaBoost EtaBoost

187

55

EtaBoost

-1 -0.5 0 0.5 1-1

-0.5

0

0.5

1

Iter = 51, train err = 0.25 Iter = 51, train err = 0.15 Iter =351, train err = 0.18

-1 -0.5 0 0.5 1-1

-0.5

0

0.5

1

-1 -0.5 0 0.5 1-1

-0.5

0

0.5

1

188

1

Learning with Information Divergence Geometry

Shinto Eguchi and Osamu Komori

The Institute of Statistical Mathematics, Japan
Email: eguchi@ism.ac.jp, komori@ism.ac.jp
URL: http://www.ism.ac.jp/~eguchi/

Tutorial Workshop on

189

2

Outline

9:30~10:30 Information divergence class and robust statistical methods I

11:00~12:00 Information divergence class and robust statistical methods II

13:30~14:30 Information geometry on model uncertainty

15:00~16:00 Boosting learning algorithm and U-loss functions I

9:30~10:30 Boosting learning algorithm and U-loss functions II

11:00~12:00 Pattern recognition from genome and omics data

190

3

Boosting learning algorithm and U-loss functions II

191

4

GOAL

Statistical inference

Microarray

SNPs

Proteome

A variety of functions associated with genes

Statistical Learning

genetics

informatics

medicine

Modeling and Prediction forKnowledge and discovery

192

5

Project for genome polymorphism analysis

mRNA

Protein

Genome

Microarray

SNPs

Proteome

[ Gene expression ]

[ Protein expression ]

[Single Nucleotide Polymorphism ]

193

6

Project mission

Microarray

SNPs

Proteome

[ Gene expression ]

[ Protein expression ]

[Polymorphism ]

Drug effect / adverse effect

Disease

[ anticancer drug, aspirin ]

[Cancer, diabetes, cardiopathy … ]

Metabolic syndrome

biomarker

Genomic/protemic

Phenotype

194

7

Expression Arrays and the p ≫ n Problem

T. Hastie, R. Tibshirani 20 November, 2003

Gene expression arrays typically have 50 to 100 samples and 5,000

to 20,000 variables (genes). There have been many attempts to adapt

statistical models for regression and classification to these data, and in

many cases these attempts have challenged the computational resources.......

>>

The Dantzig selector:Statistical estimation when p is much larger than n

E. Candes and T. Tao

Ann. Statist. 35, 6 (2007), 2313-2351.

195

8

Recent issue

Fan and Lv (2008)Sure independence screening Dantzig selector

Candes, E. and Tao, T. (2007)

p is much larger than n

Dimension: 1, 100, 1000, 30000

196

9

Unified view in machine learning

Dantzig selector (SVM, programming)

LASSO (LAR, Elastic net)

L2Boosting (Early stopping, εBoosting, Stagewize LASSO)

A tale of three cousins (Meinshausen, Rocha, Yu, 2007 )

L1 regularization

197

10

What is genomic data?

mRNA

Protein

Genome

microarry

SNPs

proteome

[ Gene expression ]

[ Protein expression ]

[ Single nucleotide polymorphism ]

diversity, metabolism

translation

[Figure: example expression and mass-spectrometry profiles.]

198

11

Target

Microarray

SNPs

Proteome

[ Gene expression ]

[ Protein expression ]

[Polymorphism ]

Drug effect / adverse effect

Disease

[ anticancer drug, aspirin ]

[Cancer, diabetes, cardiopathy … ]

biomarker phenotype

),( 1 pxx=x }1,1{ +−∈y

199

12

GeneChip® HT Human Genome U133

Affymetrix's U133 contains more than 54,000 probes and is fabricated using semiconductor technology.

For each gene a set of 11 probe pairs is designed. All probes are 25-mer DNA.

200

13

Prediction from microarrays

),( 1 pxx=x

}1,1{ +−∈y

Feature vector: dimension p = the number of genes; each component is a gene expression level.

Class label: disease status, drug effect, adverse drug effect.

training data }1:),({train niyD ii ≤≤= x

yf →x:ˆclassifier )(ˆˆ xfy =predictor

201

14

Leukemic diseases Golub,T. et al. (1999) Science.

The first successful result.

15

Web microarray data

Open-access microarray data (p >> n):

[Table: ALL/AML (n = 72, p = 7129), Colon (n = 62, p = 2000), Estrogen (n = 49, p = 7129), with the numbers of y = +1 and y = −1 samples in each set.]

http://microarray.princeton.edu/oncology/
http://mgm.duke.edu/genome/dna micro/work/

203

16

source("http://www.bioconductor.org/biocLite.R")

biocLite("GEOquery")

library(GEOquery)

d <- getGEO(file = "GSE2034_family.soft.gz")

http://www.ncbi.nlm.nih.gov/geo/

Gene Expression OmnibusNational Center for Biotechnology Information

204

17

Clustering for successful case

205

18

Jones, M. et al. Lancet 2004

Lung cancer

Pathological Classification

Prognosis identification

Not successful case

206

19

Prediction from mass-spectrometry

),( 1 pxx=x

Feature vector: dimension p = the number of molecular-mass peaks; each component is a peak intensity.

}1,1{ +−∈y

Class label names of disease, effect of drugs, adverse effect of drug

training data }1:),({train niyD ii ≤≤= x

yf →x:ˆclassifier )(ˆˆ xfy =predictor

207

20

MW (Time of Flight)

LensLaser beam source

Mirror

Ion detector

High voltage

+++

+++

+++Protein ChipFlight tube (High vacuum)

Proteome method

Koichi Tanaka, MALDI-TOF MS, 2002 Nobel Prize in Chemistry

[SELDI TOF/MS]

208

21

[Figure: SELDI TOF/MS proteomic spectra for lung cancer and normal tissue samples.]

Proteomic data (Japanese Foundation for Cancer Research)

209

2222

Total data

Proteomic data

Filtered by common peaks

AdaBoost learning

203 subjects (130 ovarian cancer cases, 73 controls)

Fushiki, Fujisawa, Eguchi, 2006

210

23

Goal = Prediction score

Machine learning

Training data = (clinical data, genomic data)

Prediction score

Pattern recognition

Genom

ic data

Clinical data

{xi} {yi}

Sensitivity

x y

Train and test practical realization

F(x)

211

24

Concordance among Gene-Expression–Based Predictors for Breast Cancer

Fan, et al NEJM 355:560-569, 2006

Prognosis prediction for breast cancer

70-gene profile: van 't Veer J, et al. Nature 2002;415(6871):530-6. Recurrence score: Paik S, et al. NEJM 2004;351:2817-26. Mechanism-derived: Chang H Y, et al. PNAS 2005;102(10):3738-43. Proper subtype: Sorlie T, et al. PNAS 2001;98(19):10869-74. Two-gene ratio: Ma X J, et al. Cancer Cell 2004;5(6):607-16.

Five studies suggest different sets of genes that are related to prognosis for breast cancer.

Four of the five studies show substantial performance on new validation test data (295 samples).

212

25

Nature 2002; 415: 530-6.

213

26

Single variable study

y|x ),...,1(| pjyx j =

Let x be a feature vector from genomic monitor andy outcome value or phenotype.

)|( yp x )|( yxp j

Note: The multivariate analysis of (x, y) is essentially different from a set of single-variable analyses. Genome data are correlated within a biological network, which implies that the information content is not so huge even when the number of observed genes exceeds several tens of thousands.

Joint analysis single analysis

214

27

2 sample test approach

The data set D = { (x_1, y_1), ..., (x_n, y_n) }

is decomposed, for each gene j = 1, ..., p, into the two samples

x_j^0 = { x_{ij} : y_i = −1 } = ( x_{1j}^0, ..., x_{n_0 j}^0 ),

x_j^1 = { x_{ij} : y_i = +1 } = ( x_{1j}^1, ..., x_{n_1 j}^1 ),

where n = n_0 + n_1. Let us consider the hypothesis

H: p( x_j | y = +1 ) = p( x_j | y = −1 ).

215

28

Typical two-group test statistics

Z score: Ẑ(X_j) = ( x̄_j^1 − x̄_j^0 ) / s_j

Student t test: t̂(X_j) = ( x̄_j^1 − x̄_j^0 ) / ŝ_j, with the pooled variance ŝ_j^2 based on ( (n_1 − 1) s_{1j}^2 + (n_0 − 1) s_{0j}^2 ) / (n − 2)

Wilcoxon test: Ĉ(X_j) = (1 / (n_0 n_1)) Σ_{i=1}^{n_1} Σ_{k=1}^{n_0} I( x_{ij}^1 > x_{kj}^0 )

Note: the Z score and the t test are invariant under the data transformation x → ax + b (a > 0); the Wilcoxon test is invariant under any monotone transformation F: Ĉ( F(X_j) ) = Ĉ( X_j ).

However, for different j and k the above test statistics bear no relation to each other, e.g. Ĉ(X_j), Ĉ(X_k) versus Ĉ(aX_j + b), Ĉ(X_k). (We reconsider this point in Part II of the second half.)
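A minimal R sketch (simulated expression values; the pooled standard deviation is one possible choice for s_j) of the three statistics for a single gene j:

set.seed(1)
x0 <- rnorm(20, mean = 0)    # group y = -1
x1 <- rnorm(25, mean = 1)    # group y = +1
n0 <- length(x0); n1 <- length(x1)

s_pool  <- sqrt(((n1 - 1) * var(x1) + (n0 - 1) * var(x0)) / (n0 + n1 - 2))
z_score <- (mean(x1) - mean(x0)) / s_pool
t_stat  <- t.test(x1, x0, var.equal = TRUE)$statistic
wilcox  <- wilcox.test(x1, x0)$statistic / (n0 * n1)   # fraction of pairs with x1 > x0
c(z_score, t_stat, wilcox)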

216

29

p two-group comparisons

Adopting a test statistic T̂(X_j) for each of the p two-group comparisons, compute the p statistics and order them as

T̂(X_(1)) ≥ T̂(X_(2)) ≥ ⋯ ≥ T̂(X_(p)).

The multiplicity of testing (FDR) and the accuracy of the ranking are important problems here. However, since the problem currently targeted is phenotype prediction, the ranking is used for the following filtering:

d selected genes out of p, on a scale 1 – 100 – 1000 – 30000.

217

30

Hierarchical clustering

Clustering ⊆ unsupervised learning.

Hierarchical clustering; optimal-partition clustering (k-means, self-organizing maps); minimum distance method.

Eisen et al. (1998) PNAS

http://derisilab.ucsf.edu/data/microarray/software.html
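A minimal R sketch with base R functions (simulated expression matrix; average linkage is an illustrative choice):

set.seed(1)
expr <- matrix(rnorm(50 * 8), nrow = 50,
               dimnames = list(paste0("gene", 1:50), paste0("sample", 1:8)))
d  <- dist(expr)                     # Euclidean distances between gene profiles
hc <- hclust(d, method = "average")  # hierarchical (agglomerative) clustering
plot(hc, labels = FALSE)             # dendrogram
cutree(hc, k = 4)[1:10]              # cluster assignment of the first 10 genes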

218

31

ArrayMaker Version 2

Gal File Maker v1.2

Cluster:

Tree View:

J-Express:

http://derisilab.ucsf.edu/data/microarray/software.html

219

32

LASSO

Linear model: y_i = β^T x_i + α + ε_i  (i = 1, ..., n),

where x_i and β are p-dimensional vectors.

Lasso estimator (Least absolute shrinkage and selection operator):

(α̂, β̂) = argmin Σ_{i=1}^n ( y_i − α − β^T x_i )^2  subject to  Σ_{j=1}^p |β_j| ≤ t.

Tibshirani, 1996 JRSSB

220

33

Sparse learning

The Lagrange method for the constrained optimization gives

∂/∂β { (1/2) Σ_{i=1}^n ( y_i − α − β^T x_i )^2 − λ ( t − Σ_{j=1}^p |β_j| ) }

= −Σ_{i=1}^n x_i ( y_i − α − β^T x_i ) + λ sgn(β) = 0.

Assume that Σ_{i=1}^n x_i x_i^T = I (the identity matrix). Then the lasso estimator β̂ and the least-squares estimator β̂^OLS are connected by

β̂^OLS = β̂ + λ sgn(β̂),  that is,  β̂_j^OLS = β̂_j + λ sgn(β̂_j)  (j = 1, ..., p).

221

34

Sparseness representation

An orthonormal design ( Σ_i x_i x_i^T = I, the identity matrix ) leads to

β̂_j^OLS = β̂_j + λ sgn(β̂_j)  (j = 1, ..., p),

and inverting this gives the soft-threshold form

β̂_j = sgn( β̂_j^OLS ) ( |β̂_j^OLS| − λ )_+,  where A_+ = A if A > 0 and 0 otherwise.

[Figure: the soft-threshold map between β̂_j^OLS and β̂_j, flat on (−λ, λ).]

Note: λ is uniquely determined by the constraint Σ_j |β̂_j| = t.
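A minimal R sketch of this soft-thresholding map (the coefficients below are illustrative):

soft_threshold <- function(b_ols, lambda) sign(b_ols) * pmax(abs(b_ols) - lambda, 0)

b_ols <- c(3.0, -0.4, 0.1, -2.2)
soft_threshold(b_ols, lambda = 0.5)   # small coefficients are set exactly to 0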

222

35

Control of sparseness

tj =∑ |ˆ| β

t

Control is reduced to a choice of t

223

36

Microarray data

),( 1 pxx=x }1,1{ +−∈y

Leaning machines

)( xfy =

p signals Class labels

ABoost

)},(),...,,{( 11 nn yy xx

)(,),1( 11 nww

)(,),1( 22 nww1ε

∑=

T

1)( )(

ttt f xα

)(,),1( nww TT

)()2( xf

)()( xTf

)()1( xf

)}(...,),({ 1 xx Kff ∑=

T

1)( )(

ttt f xα

Difficult Problem: p >> n

224

37

Data

)(,),1( 11 nww

)(,),1( 22 nww

),1()1,1(

),1()1,1(

,,)(,),(

G

Gffαα

xx

∑=

T

ttf

1)( )(x

)(,),1( nww TT

Grouping G machines )( ),(),()1,()1,(1

)( GtGtttGt fff αα ++=

)()1( xf

)()2( xf

)()( xTf

),2()1,2(

),2()1,2(

,,)(,),(

G

Gffαα

xx

),()1,(

),()1,(

,,)(,),(

GTT

GTT ffαα

xx

GroupBoost

GroupBoost )}(...,),({ 1 xx Kff ∑=

T

1)( )(

ttt f xα

225

38

Results Data size

Test error

226

39

Lung cancer analysisGene number Test error

true genes

false genes

227

40

228

41

MW (Time of Flight)

LensLaser beam source

Mirror

Ion detector

High voltage

+++

+++

+++Protein ChipFlight tube (High vacuum)

Proteome[SELDI TOF/MS]

229

42

0

20

40

0

20

40

60

0

20

40

60

20

40

60

25

50

75

100

10

20

30

40

25

50

75

25

50

75

100

25

50

75

100

1500 1750 2000 2250 2500

Lungcancer

Normaltissue

Proteome data

230

43

Averaged curve203 subjects (130 cases (ovarian cancer),73 controls)

231

44

Common peaks

hth

SpecAlign, Wong et al. (2005)

232

45

Peak pattern recognittion

),( 1 pxx=x }1,1{ +−∈y

Learning machines

)( xfy =

p peaks Class labels,

AdaBoost

Example

)(,),1( 11 nww

)(,),1( 22 nww1ε

∑=

T

1)( )(

ttt f xα

)(,),1( nww TT

)()2( xf

)()( xTf

)()1( xf

Test data(32cases+18contols)

233

46

Association with drug effect

A joint work with Japan Institute Cancer Research Foundation

Breast cancer patients and a drug effect association

Supervised detection for common peaks

0

0.05

0.1

0.15

0.2

0.25

0.3

0 5 10 15 20 25 30 35

training error

CV error

test error

(peaks)

error rate

234

47

Inference for prediction

Microarray

SNPs

Proteome

Genome Function

Expression pattern(GroupBoost)

Peak pattern (Common peak)

SNP haploblock

Statistical machine Learning

235

top related