
Information Geometry

Shinto Eguchi

The Institute of Statistical Mathematics, Japan. Email: eguchi@ism.ac.jp, komori@ism.ac.jp. URL: http://www.ism.ac.jp/~eguchi/

2013 Summer Graduate School, The Institute of Statistical Mathematics, Seminar Room 1, September 26, 14:40–17:50

ISM Summer School 2013

Outline

Information geometry

historical remarks

treasure example

Information divergence class

robust statistical methods

robust ICA

robust kernel PCA

2

Historical remarks

R. A. Fisher (1922); C. R. Rao (1945)

B. Efron (1975 paper); A. P. Dawid (comment)

S. Amari (1982 paper)

1984 Workshop; RSS 150th anniversary

Connections with geometry, learning, statistics, and quantum theory

3

Dual Riemannian Geometry

Dual Riemannian geometry gives a reformulation of Fisher's foundation.

Cf. Einstein field equations (1916)

Information geometry aims to geometrize a dualistic structure between modeling and estimation.

Estimation is a projection of the data onto the model.

"Estimation is an action by projection and the model is an object to be projected."

The interplay of action and object is elucidated by a geometric structure.

Cf. Erlangen program (Klein, 1872)http://en.wikipedia.org/wiki/Erlangen_program

4

2 x 2 table

The space of all 2×2 tables is associated with a regular tetrahedron.

We know the ruled surface is exponential geodesic, but does anyone know about the minimality of the ruled surface?

The space of all independent 2×2 tables is associated with a ruled surface in the regular tetrahedron.

5

regular tetrahedron

[Figure: regular tetrahedron with vertices P, Q, R, S representing the extreme 2×2 tables (1,0;0,0), (0,1;0,0), (0,0;1,0), (0,0;0,1); the center (1/4,1/4;1/4,1/4); and edge points such as (0, p; 0, 1−p) and (1−p, 0; p, 0).]

6

regular tetrahedron

[Figure: regular tetrahedron with vertices P, Q, R, S; convex combinations with weights 1/3 and 2/3 of the tables (1/3, 0; 2/3, 0), (0, 1/3; 0, 2/3), (1/3, 2/3; 0, 0), (0, 0; 1/3, 2/3).]

7

[Figure: the ruled surface inside the regular tetrahedron with vertices P, Q, R, S.]

Ruled surface

8

Each 2×2 table with cells x, y, z, w has margins e = x + y, f = z + w, g = x + z, h = y + w and total n = x + y + z + w:

      A      B
C:    x      y     | e
D:    z      w     | f
      g      h     | n

An independence table has the form

C:    pq        p(1−q)        | p
D:    (1−p)q    (1−p)(1−q)    | 1−p
      q         1−q           | 1

for which the chi-squared statistic χ² = n(xw − yz)²/(efgh) vanishes, since pq(1−p)(1−q) − p(1−q)(1−p)q = 0.

e-geodesical: in the exponential coordinates

θ(π) = ( log(π11/π22), log(π12/π22), log(π21/π22) ), where π11 + π12 + π21 + π22 = 1,

the simplex S3 is mapped onto R^3, and the independence surface is parameterized by

θ(p, q) = ( log{ pq/((1−p)(1−q)) }, log{ p/(1−p) }, log{ q/(1−q) } ), 0 < p, q < 1.

[Figure: the independence surface plotted in the exponential coordinates.]

10

Two parametrizations

Mixture (mean) parametrization: π(p, q) = ( pq, p(1−q), (1−p)q, (1−p)(1−q) ), (p, q) ∈ (0, 1)².

Exponential parametrization: θ(p, q) = ( log{ pq/((1−p)(1−q)) }, log{ p/(1−p) }, log{ q/(1−q) } ).

[Figure: the independence surface plotted in the two coordinate systems.]

11

Kullback-Leibler Divergence

Let p(x) and q(x) be probability density functions. Then the Kullback-Leibler divergence is defined by

D_KL(p, q) = ∫ p(x) log{ p(x) / q(x) } dx.

[Figure: the triangle p, q, r.]

12
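A minimal numerical sketch of this definition (added for illustration; the function name and the example vectors are mine, not from the lecture), showing in particular that the divergence is asymmetric in its arguments:

import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """Kullback-Leibler divergence D_KL(p, q) of two discrete
    distributions given as probability vectors on the same support."""
    p = np.asarray(p, dtype=float); p = p / p.sum()
    q = np.asarray(q, dtype=float); q = q / q.sum()
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

# Example: two 2x2 tables flattened to probability vectors.
p = np.array([0.25, 0.25, 0.25, 0.25])
q = np.array([0.40, 0.10, 0.10, 0.40])
print(kl_divergence(p, q), kl_divergence(q, p))   # asymmetric in general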

Two one-parameter families

Let P be the space of all pdfs on a data space. Here, for p, q ∈ P,

m-geodesic: r_t(x) = (1 − t) p(x) + t q(x),

e-geodesic: r_t(x) = c_t p(x)^{1−t} q(x)^t, where c_t is the normalizing constant.

[Figure: the m-geodesic and the e-geodesic joining p and q, with a third point r.]

13

Pythagoras theorem, Amari-Nagaoka (1982)

[Figure: the m-geodesic from p to q and the e-geodesic from q to r meeting at q.]

14
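The statement behind the figure, in the form it is usually given (a standard statement of the Amari-Nagaoka Pythagorean relation, added here for completeness; notation as on slide 12): if the m-geodesic joining p and q meets the e-geodesic joining q and r orthogonally at q with respect to the Fisher metric, then

D_KL(p, r) = D_KL(p, q) + D_KL(q, r).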

Proof

15

ABC in differential geometry

A linear connection defines parallelism along a vector field:

∇ : X(M) × X(M) → X(M), with (1) ∇_{fX} Y = f ∇_X Y and (2) ∇_X(fY) = (Xf) Y + f ∇_X Y, for smooth f and X, Y ∈ X(M).

Componentwise, (∇_X Y)^i = X(Y^i) + Σ_{j,k} Γ^i_{jk} X^j Y^k. Cf. slide 41.

A geodesic is a minimal arc between x and y:

argmin { ∫_0^1 g( ẋ(t), ẋ(t) ) dt : x(0) = x, x(1) = y }.

A Riemannian metric defines an inner product on any tangent space of the manifold:

g_x : T_x(M) × T_x(M) → R.

16

Geodesic

A one-parameter family (curve) C = { θ(t) : t } is called a geodesic with respect to a linear connection { Γ^i_{jk}(θ) : i, j, k = 1, …, p } if

θ̈^i(t) + Σ_{j=1}^p Σ_{k=1}^p Γ^i_{jk}( θ(t) ) θ̇^j(t) θ̇^k(t) = 0   (i = 1, …, p).

Remark 2: If Γ^i_{jk}(θ) = 0 (i, j, k = 1, …, p), any geodesic is a line. This property is not invariant under reparametrization.

Under a transform from the parameter θ to ω, with B_{ai}(θ) = ∂ω_a/∂θ_i, the geodesic C satisfies

ω̈^a(t) + Σ_{b=1}^p Σ_{c=1}^p Γ̃^a_{bc}( ω(t) ) ω̇^b(t) ω̇^c(t) = 0   (a = 1, …, p),

Γ̃^a_{bc} = Σ_{i,j,k} (∂ω^a/∂θ^i)(∂θ^j/∂ω^b)(∂θ^k/∂ω^c) Γ^i_{jk} + Σ_i (∂ω^a/∂θ^i)(∂²θ^i/∂ω^b ∂ω^c).

(θ̈ is the acceleration; cf. Newton's law of inertia.)

17

Change rule on connections

Let us consider a transform from the parameter θ to ω = ω(θ). Then the geodesic C is written as { ω(t) = ω(θ(t)) : t }. Hence, with B_{ai}(θ) = ∂ω_a/∂θ_i, we have the following:

ω̇_a(t) = Σ_i B_{ai}( θ(t) ) θ̇_i(t)   (a = 1, …, p),

ω̈_a(t) = Σ_i B_{ai}( θ(t) ) θ̈_i(t) + Σ_{i,j} ∂_j B_{ai}( θ(t) ) θ̇_i(t) θ̇_j(t)   (a = 1, …, p),

so that the geodesic equation in ω holds with the transformed connection Γ̃^a_{bc} given on the previous slide.

18

What are reference books?

Foundations of Differential Geometry (Wiley Classics Library), Shoshichi Kobayashi and Katsumi Nomizu.

Methods of Information Geometry, Shun-ichi Amari and Hiroshi Nagaoka, American Mathematical Society (2001).

Differential-Geometrical Methods in Statistics, Shun-ichi Amari, Springer (1985).

19

Statistical model and information

Statistical model: M = { p(x, θ) : θ ∈ Θ },  Θ ⊂ R^p.

Score vector: s(x, θ) = ∂/∂θ log p(x, θ).

Space of score vectors: T_θ = { α^T s(x, θ) : α ∈ R^p }.

Fisher information metric: g_θ(u, v) = E_θ{ u v }   (u, v ∈ T_θ).

Fisher information matrix: I(θ) = E_θ{ s(x, θ) s(x, θ)^T }.

20
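A small numerical sketch of these definitions (illustrative only, not from the slides): the Fisher information matrix of the normal model θ = (μ, σ²), estimated by Monte Carlo as E_θ{ s sᵀ }, which should approach diag(1/σ², 1/(2σ⁴)).

import numpy as np

rng = np.random.default_rng(0)

def score_normal(x, mu, sigma2):
    """Score vector s(x, theta) = d/d theta log p(x, theta)
    for the normal model theta = (mu, sigma^2)."""
    s_mu = (x - mu) / sigma2
    s_s2 = ((x - mu) ** 2 - sigma2) / (2.0 * sigma2 ** 2)
    return np.stack([s_mu, s_s2], axis=-1)

mu, sigma2 = 0.0, 1.0
x = rng.normal(mu, np.sqrt(sigma2), size=200_000)
s = score_normal(x, mu, sigma2)
I_hat = s.T @ s / len(x)        # Monte Carlo estimate of E_theta{ s s^T }
print(I_hat)                    # approx [[1/sigma2, 0], [0, 1/(2 sigma2^2)]]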

Score vector space

Note 1: For u(x, θ) = α^T s(x, θ) ∈ T_θ,

E_θ{ u(x, θ) } = α^T ∫ s(x, θ) p(x, θ) dx = α^T (∂/∂θ) ∫ p(x, θ) dx = 0,

and for u = α^T s, v = β^T s,

g_θ(u, v) = ∫ u(x, θ) v(x, θ) p(x, θ) dx = α^T I(θ) β.

Let S_θ = { t(x, θ) : E_θ( t(x, θ) ) = 0, V_θ( t(x, θ) ) < ∞ } be the space of all random variables with mean 0 and finite variance.

Note 2: S_θ is an infinite-dimensional vector space that includes T_θ. For u(x, θ) = α^T s(x, θ) ∈ T_θ, the centered products

u_{i1}(x, θ) ⋯ u_{ik}(x, θ) − E_θ{ u_{i1}(x, θ) ⋯ u_{ik}(x, θ) },   k = 1, 2, …,

belong to S_θ. Cf. Tsiatis (2006)

21

Linear connections

Score vector: s(x, θ) = ( s_i(x, θ) )_{i=1,…,p},  s_i = ∂_i log p,  s_{jk} = ∂_j ∂_k log p.

e-connection: Γ^{(e) i}_{jk}(θ) = Σ_{i'=1}^p g^{i i'}(θ) E_θ{ s_{jk}(x, θ) s_{i'}(x, θ) }   (i, j, k = 1, …, p).

m-connection: Γ^{(m) i}_{jk}(θ) = Σ_{i'=1}^p g^{i i'}(θ) [ E_θ{ s_{jk}(x, θ) s_{i'}(x, θ) } + E_θ{ s_j(x, θ) s_k(x, θ) s_{i'}(x, θ) } ]   (i, j, k = 1, …, p).

An e-geodesic is a geodesic of the e-connection; an m-geodesic is a geodesic of the m-connection.

22

Semiparametric theory

Semiparametric Theory and Missing Data (Springer Series in Statistics), Anastasios Tsiatis

1 Introduction to Semiparametric Models

2 Hilbert Space for Random Vectors

3 The Geometry of Influence Functions

4 Semiparametric Models

5 Other Examples of Semiparametric Models

6 Models and Methods for Missing Data

7 Missing and Coarsening at Random for Semiparametric Models

8 The Nuisance Tangent Space and Its Orthogonal Complement

9 …14.

23

Geodesical models

Definition

(i) A statistical model M is said to be e-geodesical if p, q ∈ M implies c_t p(x)^{1−t} q(x)^t ∈ M (t ∈ (0, 1)), where 1/c_t = ∫ p(x)^{1−t} q(x)^t dx.

(ii) A statistical model M is said to be m-geodesical if p, q ∈ M implies (1 − t) p(x) + t q(x) ∈ M (t ∈ (0, 1)).

Note: Let P be the space of all probability density functions. By definition P is e-geodesical and m-geodesical. However, the theoretical framework for P is not perfectly complete. Cf. Pistone and Sempi (1995, AS)

24

Two types of modeling

Let p0(x) be a pdf and t(x) a p-dimensional statistic.

Exponential model: M^(e) = { p(x, θ) = p0(x) exp{ θ^T t(x) − ψ(θ) } : θ ∈ Θ },  Θ = { θ ∈ R^p : ψ(θ) < ∞ }.

Matched mean model: M^(m) = { p : E_p{ t(x) } = E_{p0}{ t(x) } }.

Let { p_i(x) : i = 0, 1, …, I } be a set of pdfs.

Exponential model: M^(e) = { p(x, θ) = exp{ log p0(x) + Σ_{i=1}^I θ_i log( p_i(x)/p0(x) ) − ψ(θ) } : θ ∈ Θ },  Θ = { θ ∈ R^I : ψ(θ) < ∞ }.

Mixture model: M^(m) = { p(x, θ) = (1 − Σ_{i=1}^I θ_i) p0(x) + Σ_{i=1}^I θ_i p_i(x) : θ ∈ Θ },  Θ = { (θ_1, …, θ_I) : 0 < θ_i, Σ_{i=1}^I θ_i < 1 }.

25

Statistical functional

A functional f(p) defined for pdfs p(x) is called a statistical functional. (Hampel, 1974)

A statistical functional f(p) is said to be Fisher-consistent for a model M = { p_θ(x) : θ ∈ Θ } if f(p) satisfies

(1) f(p) ∈ Θ,  (2) f(p_θ) = θ  (θ ∈ Θ).

Example. For the normal model M = { p_θ(x) = (2πσ²)^{-1/2} exp{ −(x − μ)²/(2σ²) } : θ = (μ, σ²) ∈ R × R_+ },

f(p) = ( ∫ x p(x) dx,  ∫ x² p(x) dx − ( ∫ x p(x) dx )² )^T

is Fisher-consistent for θ = (μ, σ²).

26

Transversality

Let f(p) be a Fisher-consistent functional, and define

L_θ(f) = { p : f(p) = θ },

which is called a leaf transverse to M. Then

(1) L_θ(f) ∩ M = { p_θ },

(2) ∪_θ L_θ(f) is a local neighborhood of M.

[Figure: leaves L_θ(f) and L_{θ*}(f) cutting the model M at p_θ and p_{θ*}.]

27

Foliation structure

Statistical model M = { p(x, θ) : θ ∈ Θ },  Θ ⊂ R^p.

Foliation: P = ∪_θ L_θ(f).

Decomposition of tangent spaces: T_{p_θ}(P) = T_{p_θ}(M) ⊕ T_{p_θ}( L_θ(f) ), that is, any t(x, θ) ∈ T_{p_θ}(P) is written as t(x, θ) = u(x, θ) + v(x, θ) with u ∈ T_{p_θ}(M) and v ∈ T_{p_θ}( L_θ(f) ).

[Figure: the model M, the leaf L_θ(f), and the tangent decomposition at p_θ.]

28

Transversality for MLE

Let p0(x) be a pdf and t(x) a p-dimensional statistic, and let M^(e) = { p(x, θ) = p0(x) exp{ θ^T t(x) − ψ(θ) } : θ ∈ Θ } be the exponential model.

For the exponential model M^(e), the MLE functional f_ML(p) = argmax_θ E_p{ log p(x, θ) } is written as

f_ML(p) = the θ solving E_p{ t(x) } = E_{p(·,θ)}{ t(x) },

so the leaf is L_θ(f_ML) = { p : E_p{ t(x) } = E_{p(·,θ)}{ t(x) } }.

Hence the foliation associated with f_ML consists of matched mean models.

29

Maximum likelihood foliation

[Figure: the leaf L_θ(f_ML) through p_θ, transverse to the exponential model M^(e); data densities p1, p2, p3 are projected onto the model along the leaves, with reference points r1, r2.]

30

Estimating function

Statistical model M = { p(x, θ) : θ ∈ Θ },  Θ ⊂ R^p.

A p-variate function u(x, θ) is, by definition, an unbiased estimating function if

E_{p(·,θ)}{ u(x, θ) } = 0  and  det E_{p(·,θ)}{ ∂u(x, θ)/∂θ } ≠ 0   (θ ∈ Θ).

The statistical functional f(p) = the solution θ of E_p{ u(x, θ) } = 0 is Fisher-consistent, and the leaf transverse to M,

L_θ(f) = { p : E_p{ u(x, θ) } = 0 },

is m-geodesical.

31

Minimum divergence geometry

D : M×M → R is an information divergence on a statistical model M

(i) D(p, q) ≧ 0, with equality if and only if p = q.

(ii) D is differentiable on M × M.

Then we get a Riemannian metric and dual connections on M (Eguchi, 1983, 1992). Let

g^{(D)}_{ij}(θ) = −∂_{θ_i} ∂_{θ'_j} D(p_θ, p_{θ'}) |_{θ'=θ},

Γ^{(D)}_{ij,k}(θ) = −∂_{θ_i} ∂_{θ_j} ∂_{θ'_k} D(p_θ, p_{θ'}) |_{θ'=θ},   Γ^{(D*)}_{ij,k}(θ) = −∂_{θ'_i} ∂_{θ'_j} ∂_{θ_k} D(p_θ, p_{θ'}) |_{θ'=θ}.

32

g^{(D)} is a Riemannian metric.

Remarks

Cf. Slide 21

33

U cross-entropy

U is a convex function with derivative u = U′ and inverse derivative ξ = u^{-1}.

U cross-entropy: C_U(p, q) = ∫ { U( ξ(q(x)) ) − p(x) ξ(q(x)) } dx.

U entropy: H_U(p) = C_U(p, p).

U divergence: D_U(p, q) = C_U(p, q) − H_U(p) ≧ 0.

34


Example of U divergence

KL divergence: U(t) = exp(t) gives D_U(p, q) = D_KL(p, q).

Beta (power) divergence: U_β(t) = (1/(1+β)) (1 + βt)^{(1+β)/β} gives

D_β(p, q) = (1/β) ∫ p(x) { p(x)^β − q(x)^β } dx − (1/(1+β)) ∫ { p(x)^{1+β} − q(x)^{1+β} } dx.

Note: D_β → D_KL as β → 0.

35
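A numerical sketch of the beta (power) divergence in the form written above (an illustration under that convention, not code from the lecture); as β → 0 the value should approach D_KL(p, q), which is 0.5 for these two unit-variance normals.

import numpy as np

def beta_power_divergence(p, q, beta, dx):
    """Beta (power) divergence between two densities sampled on a grid
    with spacing dx; beta -> 0 recovers the KL divergence."""
    if beta == 0.0:
        return float(np.sum(p * np.log(p / q)) * dx)
    term1 = np.sum(p * (p**beta - q**beta)) / beta
    term2 = np.sum(p**(1 + beta) - q**(1 + beta)) / (1 + beta)
    return float((term1 - term2) * dx)

x = np.linspace(-8, 8, 4001); dx = x[1] - x[0]
p = np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)          # N(0, 1)
q = np.exp(-0.5 * (x - 1)**2) / np.sqrt(2 * np.pi)    # N(1, 1)
for b in (0.0, 0.1, 0.5, 1.0):
    print(b, beta_power_divergence(p, q, b, dx))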


Geometric formula with DU

36


Triangle with DU

[Figure: the triangle p, q, r joined by a mixture geodesic and a U geodesic.]

37

Information divergence class

and

robust statistical methods

38

Light and shadow of MLE

Likelihood method (light): 1. Invariance under data transformations; 2. Asymptotic efficiency; sufficiency and efficiency; log-likelihood on an exponential family (log and exp).

Shadow: 1. Non-robust; 2. Over-fitting.

U-method: exp is replaced by a general u.

39

Max U-entropy distribution

Equal mean space

Let us fix a statistic t(x).

Euler-Lagrange’s 1st variation

U-model

40

U-estimate

U-loss function

U-estimate

U-estimating function

U-empirical loss function

Let p(x) be a data density function with statistical model

41

-minimax

Equal mean space

Let us fix a statistic t(x).

42

U-estimator for U-model

U-model

mean parameterU-estimator of

We observe that

which implies

Furthermore,

43

Influence function

Statistical functional: T_U(G) = argmin_θ [ ∫ U( ξ(p(x, θ)) ) dx − ∫ ξ(p(x, θ)) dG(x) ].

Influence function: IF(x, T) = (∂/∂ε) T(G_ε) |_{ε=0}, where G_ε = (1 − ε) F + ε δ_x.

For the U-estimator: IF(x, T_U) = J_U(θ)^{-1} [ w(x, θ) S(x, θ) − E{ w(x, θ) S(x, θ) } ].

Gross error sensitivity: GES(T_U) = sup_x ‖ IF(x, T_U) ‖.

44

Efficiency

Asymptotic variance:

√n ( θ̂_U − θ ) →_D N( 0, J_U(θ)^{-1} H_U(θ) J_U(θ)^{-1} ),

where J_U(θ) = E{ w(X, θ) S(X, θ) S(X, θ)^T } and H_U(θ) = Var{ w(X, θ) S(X, θ) }.

Information inequality: I(θ)^{-1} ≦ J_U(θ)^{-1} H_U(θ) J_U(θ)^{-1}   (equality holds iff U(x) = exp(x)).

45

Normal mean

[Figure: influence functions of the normal-mean estimators; β-power estimates (β = 0, 0.01, 0.05, 0.075, 0.1, 0.125) and η-sigmoid estimates (η = 0, 0.015, 0.15, 0.4, 0.8, 2.5); the case β = η = 0, i.e. U(t) = exp(t), is the MLE.]

46

Gross Error Sensitivity

β-power estimates vs. η-sigmoid estimates:

β      efficiency  GES        η      efficiency  GES
0      1           ∞          0      1           ∞
0.01   0.97        6.16       0.015  0.972       1.9
0.05   0.861       2.92       0.15   0.873       1.04
0.075  0.799       2.47       0.4    0.802       0.678
0.1    0.742       2.21       0.8    0.753       0.455
0.125  0.689       2.04       2.5    0.694       0.197

47

Multivariate normal

The pdf is f(y, μ, Σ) = (2π)^{-p/2} (det Σ)^{-1/2} exp{ −(y − μ)^T Σ^{-1} (y − μ) / 2 }.

Estimating equations, with θ = (μ, Σ):

(1/n) Σ_{i=1}^n w(x_i, θ) (x_i − μ) = 0,

(1/n) Σ_{i=1}^n w(x_i, θ) (x_i − μ)(x_i − μ)^T = c_U(θ) Σ.

48

U-algorithm: iteratively reweighted mean and covariance, θ_k = (μ_k, Σ_k) → θ_{k+1} = (μ_{k+1}, Σ_{k+1}):

μ_{k+1} = Σ_i w(x_i, θ_k) x_i / Σ_j w(x_j, θ_k),

Σ_{k+1} = Σ_i w(x_i, θ_k) (x_i − μ_k)(x_i − μ_k)^T / { c_U(θ_k) Σ_j w(x_j, θ_k) }.

Under a mild condition, L_U(θ_{k+1}) ≦ L_U(θ_k)   (k = 1, …).

49
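A runnable sketch in the spirit of the U-algorithm above, for β-power weights w(x, θ) = f(x; μ, Σ)^β (the function name, toy data, and the omission of the covariance correction constant c_U are simplifications of mine, not the lecture's):

import numpy as np

def beta_power_normal_fit(X, beta=0.1, n_iter=50):
    """Iteratively reweighted estimate of a multivariate normal mean and
    covariance with beta-power weights; beta = 0 reduces to the sample
    mean and covariance (MLE). The U-dependent normalizing constant for
    Sigma is omitted, so the scale is only approximate."""
    n, p = X.shape
    mu = X.mean(axis=0)
    Sigma = np.cov(X, rowvar=False)
    for _ in range(n_iter):
        diff = X - mu
        inv = np.linalg.inv(Sigma)
        maha = np.einsum("ij,jk,ik->i", diff, inv, diff)
        w = np.exp(-0.5 * beta * maha)      # downweights outlying points
        w = w / w.sum()
        mu = w @ X                          # reweighted mean
        diff = X - mu
        Sigma = (diff * w[:, None]).T @ diff   # reweighted covariance (constant omitted)
    return mu, Sigma

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (95, 2)), rng.normal(8, 1, (5, 2))])   # 5% outliers
print(beta_power_normal_fit(X, beta=0.2)[0])   # should stay near (0, 0) despite the outliers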

Simulation

ε-contamination: the data distribution is G = (1 − ε) N(μ0, Σ0) + ε N(μ1, Σ1), in two settings G^(1) and G^(2).

Performance is measured by the KL error D_KL(θ0, θ̂), with the MLE θ̂ as benchmark.

50

β-power estimates vs. η-sigmoid estimates (KL error)

Under G^(1):
β      KL error      η        KL error
0      39.24         0        39.24
0.01   35.30         0.0001   23.04
0.05   20.93         0.00025  6.70
0.10   8.91          0.0005   4.46
0.20   12.40         0.00075  4.64
0.30   31.64         0.001    6.04

Under G^(2):
β      KL error      η        KL error
0      16.5          0        16.5
0.01   13.91         0.0005   3.36
0.05   6.65          0.00075  3.19
0.10   5.14          0.001    3.9
0.20   12.20         0.002    3.04
0.30   29.68         0.003    3.07

51

Plot of MLEs (no outliers)

[Figure: scatter plots of the estimates of (σ11, σ12) and (σ12, σ22) over 100 replications of sample size 100 under the normal distribution; η-estimate (η = 0.0025), β-estimate (β = 0.1), and MLE; true value (1, 0, 1).]

52

Plot of MLEs with outliers

[Figure: the same scatter plots over 100 replications of sample size 100 under the contaminated distribution G^(2); η-estimate (η = 0.0025), β-estimate (β = 0.1), and MLE; true value (1, 0, 1).]

53

Selection of the tuning parameter

Squared loss function: Loss(θ̂) = ∫ { f(y, θ̂) − g(y) }² dy.

Cross-validation: CV(β) = ∫ f(y, θ̂)² dy − (2/n) Σ_{i=1}^n f(x_i, θ̂^{(−i)}),   β̂ = argmin_β CV(β),

with the approximation θ̂^{(−i)} ≈ θ̂ − (1/(n−1)) IF(x_i, θ̂).

54
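A brute-force illustration of this selection rule under stated simplifications (the helper fit_beta_normal_1d is a hypothetical 1-D β-power fit; leave-one-out refits are done naively instead of via the influence-function approximation on the slide):

import numpy as np

def fit_beta_normal_1d(x, beta, n_iter=30):
    """1-D beta-power fit of (mu, sigma^2); beta = 0 gives the MLE.
    (Normalizing constant for sigma^2 omitted, as in the sketch above.)"""
    mu, s2 = x.mean(), x.var()
    for _ in range(n_iter):
        w = np.exp(-0.5 * beta * (x - mu) ** 2 / s2)
        w = w / w.sum()
        mu = w @ x
        s2 = w @ (x - mu) ** 2
    return mu, s2

def cv_score(x, beta):
    """CV(beta) = int f(y; theta_hat)^2 dy - (2/n) sum_i f(x_i; theta_hat^(-i))."""
    n = len(x)
    mu, s2 = fit_beta_normal_1d(x, beta)
    int_f2 = 1.0 / (2.0 * np.sqrt(np.pi * s2))       # exact for a normal density
    loo = 0.0
    for i in range(n):
        mu_i, s2_i = fit_beta_normal_1d(np.delete(x, i), beta)
        loo += np.exp(-0.5 * (x[i] - mu_i) ** 2 / s2_i) / np.sqrt(2 * np.pi * s2_i)
    return int_f2 - 2.0 * loo / n

rng = np.random.default_rng(2)
x = np.concatenate([rng.normal(0, 1, 95), rng.normal(6, 1, 5)])   # contaminated sample
betas = [0.0, 0.05, 0.1, 0.2]
print(min(betas, key=lambda b: cv_score(x, b)))   # beta selected by CV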

[Figure: CV(β) plotted over β ∈ (0.025, 0.175), with values between about 2.82 and 2.94 and minimizer β̂ = 0.07. Data: bivariate normal with mean (0, 0) and covariance (.26, −.1; −.1, .26), plus contamination.]

Estimates (covariance entries; mean):
MLE on the signal only: (.261, −.126; −.126, .228); (−0.081, 0.054)
MLE on the contaminated data: (.383, −.263; −.263, 1.059); (−0.184, 0.204)
β-estimate on the contaminated data: (.286, −.134; −.134, .293); (−0.132, 0.086)

55

What is ICA?

Cocktail party effect

Blind source separation

s1 s2 ……… sm

x1 x2 ……… xm

56

ICA model

Independent signals: s = (s_1, …, s_m), p(s) = p_1(s_1) ⋯ p_m(s_m), E(S_1) = 0, …, E(S_m) = 0.

Linear mixture of signals: x = W^{-1} s, W ∈ R^{m×m}, so that the density of x is

f(x, W, μ) = |det(W)| p_1( w_1^T(x − μ) ) ⋯ p_m( w_m^T(x − μ) ).

The aim is to learn W from a dataset (x_1, …, x_n), in which p(s) = p_1(s_1) ⋯ p_m(s_m) is unknown.

57

ICA likelihood

Log-likelihood function:

ℓ(W, μ) = Σ_{i=1}^n Σ_{j=1}^m log p_j( w_j^T(x_i − μ) ) + n log |det(W)|.

Estimating equation based on

F(x, W, μ) = { I_m − h( W(x − μ) ) ( W(x − μ) )^T } W,   h(s) = ( −∂/∂s_1 log p_1(s_1), …, −∂/∂s_m log p_m(s_m) ).

Natural gradient algorithm, Amari et al. (1996)

58
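A runnable sketch of a natural-gradient update in the spirit of the Amari et al. (1996) algorithm cited above (the tanh score, learning rate, toy data, and whitening step are my illustrative choices, not from the lecture):

import numpy as np

def natural_gradient_ica(X, n_iter=300, lr=0.1):
    """Natural-gradient ICA sketch: W <- W + lr * (I - phi(Y) Y^T / n) W,
    with phi(y) = tanh(y) as a generic score for heavy-tailed sources.
    X is an (n, m) array of centered, whitened observations."""
    n, m = X.shape
    W = np.eye(m)
    for _ in range(n_iter):
        Y = X @ W.T                       # current source estimates
        phi = np.tanh(Y)                  # plays the role of h(s) = -d log p / ds
        W = W + lr * (np.eye(m) - phi.T @ Y / n) @ W
    return W

# Demo: two Laplace sources, linearly mixed, centered, whitened, unmixed.
rng = np.random.default_rng(3)
S = rng.laplace(size=(2000, 2))
A = np.array([[1.0, 0.5], [1.0, 2.0]])        # mixing matrix
X = S @ A.T
X = X - X.mean(0)
L = np.linalg.cholesky(np.linalg.inv(np.cov(X.T)))
Xw = X @ L                                     # whitening
W = natural_gradient_ica(Xw)
print(W @ L.T @ A)    # should be close to a scaled permutation matrix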

Beta-ICA

β-power estimating equation:

(1/n) Σ_{i=1}^n f(x_i, W, μ)^β F(x_i, W, μ) = B_β(W, μ),

which retains a decomposability property over the pairs (s, t) of independent components, s ≠ t.

59

Likelihood ICA

[Figure: 150 signals from U(0,1) × U(0,1), mixed by the 2×2 matrix W with entries (1, 1; 2, 0.5); scatter plots of the mixed data and the likelihood-ICA solution.]

60

Non-robustness

[Figure: the same data contaminated by 50 Gaussian noise points from N(0,1) × N(0,1); the likelihood-ICA solution is distorted by the noise.]

61

power-ICA (β=0.2)

[Figure: the minimum β-power ICA solution (β = 0.2) for the contaminated data.]

62

We studied the entropy and divergence obtained by imposing projective invariance on the Tsallis power entropy.

On the family of distributions with fixed mean and variance, the maximum projective power entropy distribution model is derived in a form that includes the normal distribution, the t-distribution, and the Wigner distribution.

A natural estimation method for the mean and variance on that model is proposed, and the estimators are shown to be the sample mean and the sample variance.

63

γ-cross entropy

γ-diagonal entropy

Cf. Tsallis (1988), Basu et al. (1998), Fujisawa-Eguchi (2008), Eguchi-Kato (2010)

γ-divergence

Remark: γ = 0 reduces to the Boltzmann-Shannon entropy and the Kullback-Leibler divergence.

Projective γ-power divergence

64
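A numerical sketch using one common form of the projective γ-power (γ-) divergence (the form given in Fujisawa-Eguchi, 2008; the grid-based evaluation and function name are illustrative, not from the lecture). The last line checks the projective invariance in the second argument:

import numpy as np

def gamma_divergence(g, f, gamma, dx):
    """Projective power (gamma-) divergence between densities g and f
    sampled on a grid with spacing dx; gamma -> 0 recovers the KL
    divergence, and the value is invariant under rescaling f -> c f."""
    if gamma == 0.0:
        return float(np.sum(g * np.log(g / f)) * dx)
    a = np.log(np.sum(g ** (1 + gamma)) * dx) / (gamma * (1 + gamma))
    b = np.log(np.sum(g * f ** gamma) * dx) / gamma
    c = np.log(np.sum(f ** (1 + gamma)) * dx) / (1 + gamma)
    return float(a - b + c)

x = np.linspace(-10, 10, 5001); dx = x[1] - x[0]
g = np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)          # N(0, 1)
f = np.exp(-0.5 * (x - 1)**2) / np.sqrt(2 * np.pi)    # N(1, 1)
print(gamma_divergence(g, f, 0.0, dx), gamma_divergence(g, f, 0.5, dx))
print(gamma_divergence(g, 3.0 * f, 0.5, dx))   # same value: scale invariance in f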

Max γ-entropy

γ-entropy

Equal moment space

Max γ-entropy

γ-model

65

γ-estimation

Parametric model

γ-loss function

γ-estimator

Gaussian

γ-estimator

Remark. Cf. Fujisawa-Eguchi (2008)

66

γ-estimation on the γ-model

γ-loss function

Theorem.

67

Loss function vs. model:

                            normal model            γ-model (robust model)
0-loss (= log-likelihood)   MLE                     MLE
γ-loss                      robust γ-estimator      moment estimator

(the γ-estimator applied to the γ′-model)

68

Mixture of independent signals

Whitening

Preprocessed form

Statistical model

γ-ICA

Gamma-ICA

69

70

71

Boosting learning algorithms

for supervised/unsupervised learning

U-loss functions

Outline

72

Which chameleon wins?

73

Pattern recognition…

74

What is pattern recognition?

□ There are a lot of examples of pattern recognition.

□ In principle, pattern recognition is a prediction problem for a class label that encodes human interest and importance.

□ Originally, the human brain wants to label phenomena with a few words, for example (good, bad), (yes, no), (dead, alive), (success, failure), (effective, no effect), ….

□ The brain intrinsically predicts the class label from empirical evidence.

75

Practice

● Character recognition, voice recognition, image recognition

● Face recognition, fingerprint recognition, speaker recognition

☆ Credit scoring ☆ Medical screening ☆ Default prediction ☆ Weather forecast

☆ Treatment effect ☆ Failure prediction ☆ Infectious disease ☆ Drug response

76

Binary classification

classifier

Feature vector Class label

score

In the binary case y = +1, −1 (G = 2)

77

Probability distribution

Let p(x, y) be a pdf of the random vector (x, y), x ∈ X, y ∈ {1, …, G}, so that p(B, C) = Σ_{y ∈ C} ∫_B p(x, y) dx.

Marginal densities: p(x) = Σ_y p(x, y),   p(y) = ∫_X p(x, y) dx.

Conditional densities: p(y | x) = p(x, y) / p(x),   p(x | y) = p(x, y) / p(y),

so that p(x, y) = p(y | x) p(x) = p(x | y) p(y) and

p(y = 1 | x) / p(y = −1 | x) = p(x, y = 1) / p(x, y = −1).

78

Error rate

A classifier h_F, built from a feature vector x to predict the class label y ∈ {1, …, G}, has error rate

Err( h_F ) = Pr( h_F(x) ≠ y ) = Σ_{i=1}^G Pr( h_F(x) ≠ i, y = i ) = 1 − Σ_{i=1}^G Pr( h_F(x) = i, y = i ).

Training error

Test error

79

False negative/positive

FNR

True Negative

True Positive False Positive

False Negative

FPR

80

Bayes rule

Let p(y|x) be the conditional probability of y given x. The classifier h_Bayes(x) = sgn( F_0(x) ), with F_0(x) = log{ p(y = 1 | x) / p(y = −1 | x) }, is the Bayes rule.

Theorem 1. For any classifier h, Err( h_Bayes ) ≦ Err( h ).

Note: The optimal classifier is equivalent to the likelihood ratio. However, in practice p(y|x) is unknown, so we have to learn h_Bayes(x) from the training data set.

81

The discriminant space associated with the Bayes classifier is given by

Error rate for the Bayes rule

In general, when a classifier h is associated with spaces

Err( h_Bayes ) ≦ Err( h ). Cf. McLachlan (2002)

82

Boost learning

Boosting by filtering (Schapire, 1990)

Bagging, arcing (bootstrap) (Breiman, Friedman, Hastie)

AdaBoost (Schapire, Freund, Bartlett, Lee)

Can weak learners be combined into a strong learner?

Weak learner (classifier) = error rate slightly less than 0.5

Strong learner (classifier) = error rate close to that of the Bayes classifier

83

Set of weak learners

Decision stumps

Linear classifiers

Neural net, SVM, k-nearest neighbour

Note: not strong individually, but with a variety of characters

84

Exponential loss function

Let { (x_1, y_1), …, (x_n, y_n) } be a training (example) set.

Empirical exponential loss function for a score function F(x):

L_exp(F) = (1/n) Σ_{i=1}^n exp{ −y_i F(x_i) }.

Expected exponential loss function for a score function F(x):

L_exp(F) = ∫_X Σ_{y ∈ {−1, 1}} exp{ −y F(x) } q(y | x) q(x) dx,

where q(y|x) is the conditional distribution given x and q(x) is the pdf of x.

85

AdaBoost

1. Initialize: w_1(i) = 1/n (i = 1, …, n), F_0(x) = 0.

2. For t = 1, …, T:

(a) f_t = argmin_{f ∈ F} ε_t(f), where ε_t(f) = Σ_i w_t(i) I( f(x_i) ≠ y_i ) / Σ_{i'} w_t(i').

(b) α_t = (1/2) log{ (1 − ε_t(f_t)) / ε_t(f_t) }.

(c) w_{t+1}(i) = w_t(i) exp{ −α_t y_i f_t(x_i) }.

3. Output sign( F_T(x) ), where F_T(x) = Σ_{t=1}^T α_t f_t(x).

86
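A runnable sketch of the algorithm above, using decision stumps as the weak-learner class F (the stump search, seed, and toy data are illustrative choices of mine, not from the slides):

import numpy as np

def adaboost_stumps(X, y, T=50):
    """AdaBoost with decision stumps f(x) = s * sign(x_j - c); labels y in {-1, +1}."""
    n, d = X.shape
    w = np.full(n, 1.0 / n)
    learners = []                      # (feature j, threshold c, sign s, weight alpha)
    for _ in range(T):
        best = None
        for j in range(d):
            for c in np.unique(X[:, j]):
                for s in (+1.0, -1.0):
                    pred = s * np.sign(X[:, j] - c)
                    pred[pred == 0] = s
                    err = np.sum(w * (pred != y)) / np.sum(w)   # weighted error (step a)
                    if best is None or err < best[0]:
                        best = (err, j, c, s, pred)
        err, j, c, s, pred = best
        err = np.clip(err, 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)                   # step (b)
        w = w * np.exp(-alpha * y * pred)                       # step (c)
        learners.append((j, c, s, alpha))
    return learners

def predict(learners, X):
    F = np.zeros(len(X))
    for j, c, s, alpha in learners:
        pred = s * np.sign(X[:, j] - c)
        pred[pred == 0] = s
        F += alpha * pred
    return np.sign(F)                                           # sign(F_T(x))

rng = np.random.default_rng(4)
X = rng.uniform(-1, 1, (300, 2))
y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 < 0.5, 1.0, -1.0)      # circular boundary
model = adaboost_stumps(X, y, T=100)
print((predict(model, X) == y).mean())                          # training accuracy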

Learning algorithm

[Diagram: training set {(x_1, y_1), …, (x_n, y_n)}; weights w_1(1), …, w_1(n) give f^(1)(x); updated weights w_2(1), …, w_2(n) give f^(2)(x); …; weights w_T(1), …, w_T(n) give f^(T)(x).]

Final learner: F_T(x) = Σ_{t=1}^T α_t f^(t)(x).

87

Update weight

The worst case: ε_{t+1}( f_t ) = 1/2.

Update: w_{t+1}(i) = w_t(i) exp(−α_t) if f_t(x_i) = y_i, and w_{t+1}(i) = w_t(i) exp(α_t) if f_t(x_i) ≠ y_i.

Weighted error rate: ε_t(f) = Σ_i w_t(i) I( f(x_i) ≠ y_i ) / Σ_{i'} w_t(i').

88

Claim: ε_{t+1}( f_t ) = 1/2.

ε_{t+1}(f_t) = Σ_i w_{t+1}(i) I( f_t(x_i) ≠ y_i ) / Σ_i w_{t+1}(i)

= Σ_i w_t(i) exp{ −α_t y_i f_t(x_i) } I( f_t(x_i) ≠ y_i ) / Σ_i w_t(i) exp{ −α_t y_i f_t(x_i) }

= e^{α_t} ε_t(f_t) / { e^{α_t} ε_t(f_t) + e^{−α_t} ( 1 − ε_t(f_t) ) } = 1/2,

since e^{2 α_t} = { 1 − ε_t(f_t) } / ε_t(f_t), so that both terms in the denominator equal √{ ε_t(f_t)(1 − ε_t(f_t)) }.

89

Update in exponential loss

L_exp(F) = (1/n) Σ_{i=1}^n exp{ −y_i F(x_i) }.

Consider F(x) → F(x) + α f(x). Then

L_exp(F + αf) = (1/n) Σ_i exp{ −y_i F(x_i) } [ e^{−α} I( f(x_i) = y_i ) + e^{α} I( f(x_i) ≠ y_i ) ]

= L_exp(F) { e^{α} ε(f) + e^{−α} ( 1 − ε(f) ) },

where ε(f) = Σ_i I( f(x_i) ≠ y_i ) exp{ −y_i F(x_i) } / Σ_i exp{ −y_i F(x_i) }.

90

Sequential optimization

L_exp(F + αf) = L_exp(F) { e^{α} ε(f) + e^{−α} ( 1 − ε(f) ) } ≧ L_exp(F) · 2 √{ ε(f) ( 1 − ε(f) ) },

with equality if and only if e^{α} ε(f) = e^{−α} ( 1 − ε(f) ), that is,

α_opt(f) = (1/2) log{ ( 1 − ε(f) ) / ε(f) }.

91

AdaBoost = sequential minimization of the exponential loss

(a) f_t = argmin_{f ∈ F} ε_t(f)

(b) α_t = argmin_{α ∈ R} L_exp( F_{t−1} + α f_t ) = (1/2) log{ ( 1 − ε_t(f_t) ) / ε_t(f_t) }

(c) w_{t+1}(i) = w_t(i) exp{ −α_t y_i f_t(x_i) }

min_{α ∈ R} L_exp( F_{t−1} + α f_t ) = L_exp( F_{t−1} ) · 2 √{ ε_t(f_t) ( 1 − ε_t(f_t) ) }.

92

Simulation (completely separable)

[Figure: feature space [−1, 1] × [−1, 1] with a nonlinear decision boundary; the two classes are completely separable.]

93

[Figure: a set of randomly generated linear classifiers over the feature space.]

Set of weak learners: linear classification machines, generated at random.

94

Learning process

[Figure: snapshots of the learning process; iteration 1, training error 0.21; iteration 13, 0.18; iteration 17, 0.10; iteration 23, 0.10; iteration 31, 0.095; iteration 47, 0.0895.]

Final decision boundary

[Figure: contour of F(x) (left) and sign(F(x)) (right) after learning.]

96

Learning curve

[Figure: training error curve against the iteration number (up to 250 iterations), decreasing from about 0.2 to about 0.05.]

97

KL divergence

For conditional distributions p(y|x), q(y|x) given x with common marginal density p(x) we can write

Nonnegative function

Feature space

Note

Label set

Then,

98

Twin of the KL loss function

For the data distribution q(x, y) = q(x) q(y|x) we model the conditional distribution m(y|x) in terms of a score function F(x, y); then

D_KL(q, m) = ∫_X Σ_{y ∈ {−1, 1}} q(y|x) log{ q(y|x) / m(y|x) } q(x) dx.

The normalized (logistic) model m(y|x) = exp{ F(x, y) } / Σ_{g=1}^G exp{ F(x, g) } yields the log loss, while the exponential loss arises from its unnormalized counterpart.

99

Bound for exponential loss

Empirical exp loss

Expected exp loss

Theorem.

100

Variational calculus

101

On AdaBoost

FDA, or logistic regression: a parametric approach to the Bayes classifier.

AdaBoost: also a parametric approach to the Bayes classifier, but the dimension and the basis functions are flexible; the stopping time T can be selected according to the state of learning.

102

U-Boost

U-empirical loss function

In a context of classification

Unnormalized U-loss

Normalized U-loss

103

U-Boost (binary)

Unnormalized U-loss

Note

Bayes risk consistency

F*(x) is Bayes risk consistent because

104

Eta-loss function

regularized

AdaBoost with margin

105

EtaBoost for mislabels

Expected eta-loss function

Optimal score

The variational argument leads to

Mislabel modeling

106

EtaBoost

1. Initial settings: w_1(i) = 1/n (i = 1, …, n), F_0(x) = 0.

2. For m = 1, …, T:

(a) f_m = argmin_f ε_m(f), where ε_m(f) = Σ_{i=1}^n w_m(i) I( f(x_i) ≠ y_i ).

(b)

(c) w_{m+1}(i) = w_m(i) exp{ −α*_m y_i f_m(x_i) }.

3. Output sign( F_T(x) ), where F_T(x) = Σ_{t=1}^T α_t f_t(x).

107

A toy example

108

Examples partly mislabeled

109

AdaBoost vs. EtaBoost

[Figure: AdaBoost (left) vs. EtaBoost (right).]

110

ROC curve, AUC

F : X → R, discriminant function; c: threshold.

FPR(c) = P( F(X) ≧ c | Y = 0 )   (false positive rate)

TPR(c) = P( F(X) ≧ c | Y = 1 )   (true positive rate)

ROC(F) = { ( FPR(c), TPR(c) ) : c ∈ R }

111

(X, Y), X ∈ R^p, Y ∈ {0, 1}. Classify as positive (Y = 1) if F(X) ≧ c and as negative (Y = 0) if F(X) < c.

AUC(F) = ∫ TPR(c) dFPR(c)

[Figure: the receiver operating characteristic (ROC) curve in the (FPR, TPR) unit square, plotting the true positive probability against the false positive probability as c varies; the AUC is the area under the curve.]

Cf. Bamber (1975), Pepe et al. (2003)

112
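A small sketch computing the empirical AUC as the Wilcoxon-Mann-Whitney statistic (the same statistic Ĉ(F) that appears later in the AUCBoost discussion); the scores here are simulated purely for illustration:

import numpy as np

def empirical_auc(scores_pos, scores_neg):
    """Empirical AUC of a discriminant function F: the Wilcoxon-Mann-Whitney
    statistic (1/(n1 n0)) * sum_k sum_i I( F(x_1i) > F(x_0k) ), ties counted 1/2."""
    s1 = np.asarray(scores_pos)[:, None]   # F on the positive sample (Y = 1)
    s0 = np.asarray(scores_neg)[None, :]   # F on the negative sample (Y = 0)
    return float(np.mean((s1 > s0) + 0.5 * (s1 == s0)))

rng = np.random.default_rng(5)
pos = rng.normal(1.0, 1.0, 500)    # scores of positive (diseased) cases
neg = rng.normal(0.0, 1.0, 500)    # scores of negative (healthy) cases
print(empirical_auc(pos, neg))     # approx Phi(1/sqrt(2)) ~ 0.76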

113

114

115

116

117


Results of 5-fold CV for λ (AUCBoost)

118

119

120

121

Relation between the two-sample problem and AUCBoost

The AUCBoost learning algorithm is designed so that, at each step t, the Wilcoxon statistic of the obtained discriminant function F_t(x),

Ĉ(F_t) = (1 / (n_0 n_1)) Σ_{k=1}^{n_0} Σ_{i=1}^{n_1} I( F_t(x_{1i}) > F_t(x_{0k}) ),

is increased as much as possible at step (t+1):

F_{t+1}(x) = F_t(x) + α_{t+1} f_{t+1}(x),   Ĉ(F_t) ≦ Ĉ(F_{t+1}).

In this way all single-point (single-marker) analyses are integrated; in this sense every single-point analysis can be incorporated into boosting. Eguchi-Komori (2009)

122

123

124

Partial AUC

pAUC = ∫_{FPR(c_1)}^{FPR(c_0)} TPR(c) dFPR(c) = P( F(X_1) ≧ F(X_0), c_1 ≦ F(X_0) ≦ c_0 ),

where X_1 denotes a feature vector of a diseased subject and X_0 one of a healthy subject.

[Figure: score densities F(x) for the healthy and diseased groups with the thresholds, and the ROC curve with the partial area pAUC between FPR(c_0) and FPR(c_1) shaded.]

125

126

Robustness to noninformative genes

127

Robustness to noninformative genes

128

Generating function

True positive probability for U

129

Boost learning algorithm

130

Bayes risk consistency

131

Benchmark data

132

[Figure: the true density and the dictionary W of component densities.]

133

U-Boost learning

U-loss function: L_U(f) = ∫ U( ξ(f(x)) ) dx − (1/n) Σ_{i=1}^n ξ( f(x_i) ),   f : R^p → R.

Dictionary of density functions: W = { g(x) : g(x) ≧ 0, ∫ g(x) dx = 1 }.

Learning space (U-model): W*_U = ξ^{-1}( co( ξ(W) ) ).

Goal: find f* = argmin_{f ∈ W*_U} L_U(f).

Let f(x; g, π) = ξ^{-1}( Σ_j π_j ξ(g_j(x)) ). Then f(x; g, (0, …, 1, …, 0)) = g_j(x), so W ⊂ W*_U.

134

Inner step in the convex hull

[Figure: the learning space W*_U = ξ^{-1}(co(ξ(W))) spanned by dictionary elements g_1, …, g_7, the current fit f(x), and the target f* = argmin_{f ∈ W*_U} L_U(f).]

Goal: f*(x) = ξ^{-1}( α̂_1 ξ( f̂_1(x) ) + ⋯ + α̂_k ξ( f̂_k(x) ) ).

135

Convergence of the algorithm

The update is the α-mixture f_{K+1} = ξ^{-1}( (1 − α_{K+1}) ξ(f_K) + α_{K+1} ξ(g_{K+1}) ) in W*_U, and

L_U(f_k) − L_U(f_{k+1}) ≧ D_U( f_{k+1}, f_k ),

so that L_U(f_1) − L_U(f_{k+1}) ≧ Σ_{j=1}^k D_U( f_{j+1}, f_j ).

136

Simplified boosting

Start with f_1 = argmin_{g ∈ W} L_U(g). For k = 1, …, K − 1:

f_{k+1} = ξ^{-1}( (1 − c_{k+1}) ξ(f_k) + c_{k+1} ξ(g_{k+1}) ),

where g_{k+1} = argmin_{g ∈ W} L_U( ξ^{-1}( (1 − c_{k+1}) ξ(f_k) + c_{k+1} ξ(g) ) ), with prescribed step lengths c_k.

Output f̂_K(x) = ξ^{-1}( Σ_{k=1}^K α_k ξ( g_k(x) ) ).

137

Non-asymptotic bound

Theorem. Assume that the data distribution has a density g(x) and that

(A) sup { ( ξ(f(x)) − ξ(f′(x)) )² : f, f′ ∈ ξ^{-1}(co(ξ(W))), x } ≦ 2 b_U.

Then we have

E D_U( g, f̂_K ) ≦ FA(W, g) + EE(W) + IE(K, W),

where

FA(W, g) = inf_{f ∈ W*_U} D_U(g, f)   (functional approximation),

EE(W): the estimation error, bounded by 2 E sup_{f ∈ W} | (1/n) Σ_i ξ(f(x_i)) − E ξ(f(X)) |,

IE(K, W) = 2 b_U c² / K, with c the step-length constant   (iteration effect).

Remark. There is a trade-off between FA(W) and EE(W).

138

Lemmas

Lemma 1. Let f̃ = ξ^{-1}( Σ_{j=1}^N π_j ξ(f_j(x)) ) be a density estimator. Then

L_U( ξ^{-1}( (1 − π) ξ(f̃) + π ξ(f*) ) ) − L_U(f*) ≦ (1 − π) { L_U(f̃) − L_U(f*) } + Δ_U(π, f̃, f*).

Lemma 2. Let f̃ be in ξ^{-1}(co(ξ(W))). Then Δ_U(π, f̃, f*) ≦ π² b_U   (π ∈ [0, 1]).

Lemma 3. Under assumption (A), L_U(f̂_K) − inf_f L_U(f) ≦ 2 b_U c² / (K + 1).

Lemma 4. If L_U(f̂_K) ≦ inf_f L_U(f), then

E D_U( g, f̂_K ) ≦ inf_f D_U(g, f) + 2 E sup_{f ∈ W} | (1/n) Σ_{i=1}^n ξ(f(x_i)) − E ξ(f(X)) |.

139

[Figure: density estimates by KDE, RSDE, and the U-Boost methods for the test densities C (skewed-unimodal), L (quadrimodal), and H (bimodal).]

140

geometry

learning

statistics

quantum

Future of information geometry

Deepening

Optimal transport

Free probability

Semiparametrics

Compressed sensing

Poincare conjecture

Dynamical systems

Mathematical evolutionary theory

Dempster-Shafer theory

141

Poincaré conjecture 1904

Every simply connected, closed 3-manifold is

homeomorphic to the 3-sphere

Ricci flow by Richard Hamilton (1981); Perelman (2002, 2003)

Optimal control theory due to Pontryagin and Bellman

Optimal transport

Wasserstein space: d_W(f, g)² = min { E ‖X − Y‖² : X ∼ f, Y ∼ g }.

142

Optimal transport

Theorem (Brenier, 1991)

Talagrand inequality

Log-Sobolev inequality

Optimal transport theory is being extended to general manifolds.

Geometers consider not a space itself, but a family of distributions on the space.

C. Villani received the 2010 Fields Medal for work on optimal transport theory.

143

Thank you

144
