
Page 1:

Kenji Fukumizu (福水健次)

The Institute of Statistical Mathematics / The Graduate University for Advanced Studies (SOKENDAI)

Intensive course, Graduate School of Engineering Science, Osaka University

September 2014

Introduction to Kernel Methods (カーネル法入門)
7. Nonparametric Bayesian Inference with Kernel Methods

Page 2:

Bayesian inference: Bayes' rule

$q(x|y) = \dfrac{p(y|x)\,\pi(x)}{\int p(y|x')\,\pi(x')\,dx'}$

• PROS
– Principled and flexible method for statistical inference.
– Can incorporate prior knowledge.

• CONS
– Computation: an integral is needed.
» Numerical integration: MCMC, SMC, etc.
» Approximation: variational Bayes, belief propagation, expectation propagation, etc.

Thomas Bayes (1701-1761) [portrait]

Page 3:

Motivating Example

Robot localization
– State $X_t \in \mathbb{R}^3$: 2-D coordinate and orientation of a robot.
– Observation $Y_t$: image sequence taken by a camera on the robot.
– Task: estimate the location of the robot from the image sequence.
– The problem is well formulated as a hidden Markov model (HMM).

[Figure: HMM graphical model. States $X_0 \to X_1 \to \cdots \to X_T$ (location & orientation) with state transitions; each $X_t$ emits an observation $Y_t$ (image of the environment).]

Page 4:

– Estimation / prediction with the HMM: sequential application of Bayes' rule solves the task.

– A nonparametric approach is needed: the observation model $p(y_t \mid x_t)$ (location & orientation → image of the environment) is very difficult to model with a simple parametric family.

⇒ "Nonparametric" implementation of Bayesian inference.

c.f. the Monte Carlo approach needs to know the density functions.

Page 5:

Kernel methods for Bayesian inference: the topic of this chapter

– A new nonparametric / kernel approach to Bayesian inference

• Kernel mean approach to represent and manipulate probabilities.

• "Nonparametric" Bayesian inference.
– No densities are needed, only data.

• Bayesian inference with matrix computation.
– Computation is done with Gram matrices.
– No integral, no approximate inference.

Page 6:

Outline

1. Introduction

2. Representing conditional probabilities

3. Kernel sum rule and chain rule

4. Kernel Bayes' rule: kernel methods for Bayesian inference

5. Conclusions

Page 7:

Representing conditional probabilities

Page 8:

Review: kernel mean

$X$: random variable taking values in a measurable space $\Omega$, with distribution $P$. $k$: positive definite kernel on $\Omega$. $H$: RKHS defined by $k$.

– Kernel mean:

$m_P := E[\Phi(X)] = E[k(\cdot, X)] = \int k(\cdot, x)\,dP(x) \in H$

– With a characteristic kernel, the kernel mean uniquely identifies the distribution.
• Embedding of the probabilities into the RKHS.
• Examples: Gaussian kernel, Laplace kernel.
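As a concrete illustration (not from the slides), here is a minimal Python sketch of the empirical kernel mean $\hat m_P = \frac{1}{n}\sum_i k(\cdot, X_i)$ with a Gaussian kernel; the function names and toy data are our own:

```python
import numpy as np

def gauss_kernel(A, B, sigma=1.0):
    """Gaussian kernel matrix K[i, j] = exp(-||A_i - B_j||^2 / (2 sigma^2))."""
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * sigma**2))

def empirical_kernel_mean(X, x_query, sigma=1.0):
    """Evaluate m_hat(x) = (1/n) sum_i k(x, X_i) at the query points."""
    return gauss_kernel(x_query, X, sigma).mean(axis=1)

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))         # i.i.d. sample from P = N(0, I_2)
x0 = np.zeros((1, 2))
print(empirical_kernel_mean(X, x0))   # estimates m_P(0) = E[k(0, X)]
```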

Page 9:

Nonparametric inference with kernels

Principle: with characteristic kernels,
inference on probabilities $P$ ⇒ inference on kernel means $m_P$.

• Two-sample test: $m_P = m_Q$? ⇒ MMD

• Independence test: $m_{P_{XY}} = m_{P_X \otimes P_Y}$? ⇒ HSIC

• Bayesian inference ⇒ this chapter.

Page 10:

Operations for inference
Operations on probabilities necessary for inference:

– Conditioning: $p(x, y) \to p(y|x)$

– Sum rule: $p(y|x), \pi(x) \to q(y) = \sum_x p(y|x)\,\pi(x)$

– Product rule (aka chain rule): $p(y|x), \pi(x) \to q(x, y) = p(y|x)\,\pi(x)$

– Bayes' rule: $p(y|x), \pi(x) \to q(x|y)$

Page 11:

Conditional kernel mean

Review: conditional mean of Gaussian random variables.
$X, Y$: Gaussian random vectors with mean 0 ($\in \mathbb{R}^m, \mathbb{R}^\ell$, resp.)

– Least squares prediction:

$\hat A = \operatorname{argmin}_{A \in \mathbb{R}^{\ell \times m}} E\|Y - AX\|^2 = V_{YX} V_{XX}^{-1}$

– Conditional expectation:

$E[Y \mid X = x] = V_{YX} V_{XX}^{-1} x$

($V_{YX}, V_{XX}$: covariance matrices)

Page 12:

General random variables
With characteristic kernels, for general $X$ and $Y$:

– Least squares prediction:

$F^* = \operatorname{argmin}_{F: H_X \to H_Y} E\|\Phi_Y(Y) - F\,\Phi_X(X)\|^2 = C_{YX} C_{XX}^{-1}$

– Conditional expectation:

$m_{Y|x} := E[\Phi_Y(Y) \mid X = x] = C_{YX} C_{XX}^{-1} \Phi_X(x)$,

so that $E[f(Y) \mid X = x] = \langle f, m_{Y|x} \rangle$ for $f \in H_Y$.

Kernel mean of $p(y|x)$: an expression by kernel means and covariance operators ⇒ estimable!

Page 13:

Empirical estimation

$\hat m_{Y|x} := \hat C_{YX} (\hat C_{XX} + \varepsilon_n I)^{-1} \Phi_X(x)$

($\varepsilon_n$: regularization coefficient)

– Remark: even for the population operators, regularized inversion is needed if the dimensionality is infinite. Since $\dim H = \infty$, there are always arbitrarily small eigenvalues of $C_{XX}$.

Page 14:

Theorem (consistency)

Assume that $E[k_Y(\tilde y, Y) \mid X = x]$, as a function of $(x, \tilde y)$, belongs to $\mathrm{Range}(C_{XX}) \otimes \mathrm{Range}(C_{YY})$. Then, for $\varepsilon_n \to 0$ at an appropriate rate,

$\hat C_{YX} (\hat C_{XX} + \varepsilon_n I)^{-1} \Phi_X(x) \;\to\; E[\Phi_Y(Y) \mid X = x]$.

Proposition (Gram matrix expression)

$\hat m_{Y|x} = \sum_{i=1}^n w_i(x)\, k_Y(\cdot, Y_i), \qquad w(x) = (G_X + n\varepsilon_n I_n)^{-1} \mathbf{k}_X(x)$,

where $(X_1, Y_1), \dots, (X_n, Y_n)$: i.i.d. sample, $G_X = \big(k_X(X_i, X_j)\big)_{ij}$, $\mathbf{k}_X(x) = \big(k_X(X_1, x), \dots, k_X(X_n, x)\big)^T \in \mathbb{R}^n$.
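A minimal sketch of the Gram matrix expression above (our own names, toy data). The same weights give $E[f(Y) \mid X = x] \approx \sum_i w_i(x)\, f(Y_i)$:

```python
import numpy as np

def gauss_kernel(A, B, sigma=1.0):
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * sigma**2))

def cond_mean_weights(X, x, eps=0.01, sigma=1.0):
    """w(x) = (G_X + n*eps*I_n)^{-1} k_X(x), so m_{Y|x} ~ sum_i w_i k_Y(., Y_i)."""
    n = len(X)
    G = gauss_kernel(X, X, sigma)
    kx = gauss_kernel(X, np.atleast_2d(x), sigma).ravel()
    return np.linalg.solve(G + n * eps * np.eye(n), kx)

rng = np.random.default_rng(1)
X = rng.uniform(-2, 2, size=(200, 1))
Y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=200)
w = cond_mean_weights(X, np.array([[0.5]]))
print(w @ Y)      # should be near E[Y | X = 0.5] = sin(0.5) ~ 0.479
print(w.sum())    # total weight, close to 1
```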

Page 15:

Proof (Gram matrix expression)

Let $f = (\hat C_{XX} + \varepsilon_n I)^{-1} \Phi_X(x)$, and decompose it as $f = \sum_j a_j \Phi_X(X_j) + f_\perp$, where $f_\perp$ is orthogonal to $\mathrm{Span}\{\Phi_X(X_j)\}_{j=1}^n$. Then

$\Phi_X(x) = (\hat C_{XX} + \varepsilon_n I) f = \frac{1}{n} \sum_{i=1}^n \langle \Phi_X(X_i), f \rangle\, \Phi_X(X_i) + \varepsilon_n f$.  …(∗)

Taking the inner product of (∗) with $\Phi_X(X_\ell)$ and using $\langle \Phi_X(X_i), f \rangle = (G_X a)_i$ gives

$k_X(X_\ell, x) = \frac{1}{n} \big( G_X (G_X + n\varepsilon_n I_n)\, a \big)_\ell$, i.e. $\frac{1}{n} G_X a = (G_X + n\varepsilon_n I_n)^{-1} \mathbf{k}_X(x)$.

The target is

$\hat m_{Y|x} = \hat C_{YX} f = \frac{1}{n} \sum_{i=1}^n \langle \Phi_X(X_i), f \rangle\, \Phi_Y(Y_i) = \sum_{i=1}^n \big( \tfrac{1}{n} G_X a \big)_i\, \Phi_Y(Y_i) = \sum_{i=1}^n w_i(x)\, \Phi_Y(Y_i)$. ∎

– The proof of consistency is omitted.

Page 16:

Uses of the conditional kernel mean

– Nonparametric regression:

$\hat E[Y \mid X = x] = \sum_{i=1}^n w_i(x)\, Y_i, \qquad w(x) = (G_X + n\varepsilon_n I_n)^{-1} \mathbf{k}_X(x)$

⇒ kernel ridge regression.

– Application to conditional independence (Fukumizu et al. JMLR 2004, AoS 2009, NIPS 2010)

– Bayesian inference

Note: for consistency, the kernel (bandwidth) is fixed and only the regularization coefficient is taken to 0 ($\varepsilon_n \to 0$).
c.f. smoothing kernels.

Page 17:

Addendum: kernel ridge regression

– (Linear) ridge regression:

$\min_{\beta \in \mathbb{R}^d} \sum_{i=1}^n (Y_i - \beta^T X_i)^2 + \lambda \|\beta\|^2$

Popularly used when the design matrix is (almost) degenerate.

– Kernel ridge regression:

$\min_{f \in H} \sum_{i=1}^n (Y_i - f(X_i))^2 + \lambda \|f\|_H^2$

Nonparametric regression with a kernel. The solution is $\hat f(x) = \mathbf{k}_X(x)^T (G_X + \lambda I_n)^{-1} \mathbf{Y}$.
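A minimal kernel ridge regression sketch using the closed-form solution above (Gaussian kernel; names and data are ours):

```python
import numpy as np

def gauss_kernel(A, B, sigma=1.0):
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * sigma**2))

def krr_fit(X, Y, lam=0.1, sigma=1.0):
    """alpha = (G_X + lam I)^{-1} Y; the predictor is f(x) = k_X(x)^T alpha."""
    alpha = np.linalg.solve(gauss_kernel(X, X, sigma) + lam * np.eye(len(X)), Y)
    return lambda X_new: gauss_kernel(X_new, X, sigma) @ alpha

rng = np.random.default_rng(2)
X = rng.uniform(-3, 3, size=(150, 1))
Y = np.sinc(X[:, 0]) + 0.05 * rng.normal(size=150)
f = krr_fit(X, Y)
print(f(np.array([[0.0]])))   # should be near sinc(0) = 1
```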

Page 18:

Nonparametric regression: theoretical comparison

Assume $Y$ is 1-dimensional, and the kernel is used only for $X$:

$\hat E[Y \mid X = x] := \mathbf{k}_X(x)^T (G_X + n\varepsilon_n I_n)^{-1} \mathbf{Y}$

Same as Gaussian process regression / kernel ridge regression.

– Consistency 1 (Eberts & Steinwart 2011)

If $k$ is Gaussian and $f_0 = E[Y \mid X = \cdot] \in W_2^\alpha$*, then (under some technical assumptions), for any $\xi > 0$,

$\|\hat f - f_0\|_{L_2}^2 = O_p\big(n^{-\frac{2\alpha}{2\alpha + d} + \xi}\big)$.

Note: $n^{-\frac{2\alpha}{2\alpha + d}}$ is the optimal rate for a linear estimator (Stone 1982).

* $W_2^\alpha$: Sobolev space of order $\alpha$.

Page 19:

– Consistency 2 (case: $f_0 \in H$)

Suppose $f_0 = E[Y \mid X = \cdot] \in H$ and $\varepsilon_n = n^{-\beta}$ with $\beta > 0$ chosen appropriately. Then, with a characteristic kernel,

$\|\hat f - f_0\|_H \to 0$ in probability ($n \to \infty$).

• The rates do not depend on $d$ (the dimension of $X$), since the analysis can be done within the RKHS.

• $\|\cdot\|_H$ is stronger than $\|\cdot\|_\infty$: $|f(x)| = |\langle f, \Phi(x) \rangle| \le \|f\|_H \sqrt{k(x, x)}$. Thus,

$\sup_x |\hat f(x) - f_0(x)| \le \|\hat f - f_0\|_H\, \sup_x \sqrt{k(x, x)} \to 0$.

Page 20:

Nonparametric regression: experimental comparison

Data: $X \sim N(0, I_d)$, $Y = f(X) + \varepsilon$, $\varepsilon \sim N(0, 0.1^2)$.

– Kernel ridge regression: Gaussian kernel.

– Local linear regression: Epanechnikov kernel ('locfit' in R is used).

Sample size $n = 100$; 500 runs. Bandwidth parameters are chosen by CV.

[Figure: mean square errors vs. the dimension of X (up to 10) for the kernel method and local linear regression.]

Page 21:

– Epanechnikov kernel

$K(u) = \frac{3}{4}(1 - u^2)\,\mathbf{1}_{\{|u| \le 1\}}$

Page 22:

Kernel Sum and Product Rules

Page 23:

Kernel Sum Rule

Sum rule: $q(y) = \int p(y|x)\,\pi(x)\,dx$

Kernel sum rule: $m_{Q_Y} = C_{YX} C_{XX}^{-1} m_\Pi$

– Input: kernel mean $m_\Pi$ of $\Pi$; covariance operators $C_{XX}$, $C_{YX}$.
– Output: kernel mean $m_{Q_Y}$ of $Q_Y$.

The conditional probability $p(y|x)$ is given in the form of the covariance operators $C_{XX}$, $C_{YX}$, or by a sample $(X_1, Y_1), \dots, (X_n, Y_n) \sim P$, where $P$ is a joint distribution whose conditional probability is $p(y|x)$.

Page 24:

Intuition

– (Assume $\pi / p_X \in H_X$.) The kernel sum rule follows from

$C_{YX}\Big(\frac{\pi}{p_X}\Big) = \iint k_Y(\cdot, y)\,\frac{\pi(x)}{p_X(x)}\,p(x, y)\,dx\,dy = \iint k_Y(\cdot, y)\,p(y|x)\,\pi(x)\,dx\,dy = m_{Q_Y}$,

$C_{XX}\Big(\frac{\pi}{p_X}\Big) = \int k_X(\cdot, x)\,\frac{\pi(x)}{p_X(x)}\,p_X(x)\,dx = \int k_X(\cdot, x)\,\pi(x)\,dx = m_\Pi$.

Page 25:

Estimator:

$\hat m_{Q_Y} = \hat C_{YX} (\hat C_{XX} + \varepsilon_n I)^{-1} \hat m_\Pi$

– Gram matrix expression:

Input: $\hat m_\Pi = \sum_{\ell=1}^L \gamma_\ell\, k_X(\cdot, U_\ell)$, $\quad (X_1, Y_1), \dots, (X_n, Y_n) \sim P$

Output: $\hat m_{Q_Y} = \sum_{i=1}^n w_i\, k_Y(\cdot, Y_i), \qquad w = (G_X + n\varepsilon_n I_n)^{-1} G_{XU}\,\gamma$,

where $G_{XU} = \big(k_X(X_i, U_\ell)\big)_{i\ell}$.

Consistency

Theorem: under range assumptions on $\pi / p_X$, with $\varepsilon_n = n^{-\beta}$ ($\beta > 0$ suitably chosen),

$\|\hat m_{Q_Y} - m_{Q_Y}\|_{H_Y} \to 0$ in probability, $\quad n \to \infty$.
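A minimal sketch of the Gram matrix expression of the kernel sum rule (our own names, toy data). Given prior weights gamma on points U and a joint sample, it returns weights w with $\hat m_{Q_Y} = \sum_i w_i k_Y(\cdot, Y_i)$; expectations under $Q_Y$ are then $\sum_i w_i f(Y_i)$:

```python
import numpy as np

def gauss_kernel(A, B, sigma=1.0):
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * sigma**2))

def kernel_sum_rule(X, U, gamma, eps=0.01, sigma=1.0):
    """w = (G_X + n*eps*I_n)^{-1} G_{XU} gamma."""
    n = len(X)
    G_X = gauss_kernel(X, X, sigma)
    G_XU = gauss_kernel(X, U, sigma)
    return np.linalg.solve(G_X + n * eps * np.eye(n), G_XU @ gamma)

# Toy model: Y = X + noise; prior Pi = uniform weights on a prior sample U.
rng = np.random.default_rng(3)
X = rng.normal(size=(300, 1))
Y = X + 0.1 * rng.normal(size=(300, 1))
U = rng.normal(loc=1.0, scale=0.3, size=(100, 1))   # prior sample
gamma = np.full(100, 1.0 / 100)
w = kernel_sum_rule(X, U, gamma)
print(w.sum(), w @ Y[:, 0])   # total weight close to 1; E_Q[Y] near 1.0
```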

Page 26:

Kernel Chain Rule
– Chain rule: $q(x, y) = p(y|x)\,\pi(x)$

– Kernelization: $m_{Q_{XY}} = C_{(YX)X} C_{XX}^{-1} m_\Pi \in H_Y \otimes H_X$

– Gram matrix expression:

Input: $\hat m_\Pi = \sum_{\ell=1}^L \gamma_\ell\, \Phi_X(U_\ell)$, $\quad (X_1, Y_1), \dots, (X_n, Y_n) \sim P$

Output: $\hat m_{Q_{XY}} = \sum_{i=1}^n w_i\, \Phi_Y(Y_i) \otimes \Phi_X(X_i), \qquad w = (G_X + n\varepsilon_n I_n)^{-1} G_{XU}\,\gamma$.

Note: the coefficients $w$ are exactly the same as in the kernel sum rule.

Page 27:

– Intuition: $C_{(YX)X}: H_X \to H_Y \otimes H_X$ is the covariance operator with $\Phi_Y(Y) \otimes \Phi_X(X)$ regarded as one variable, i.e. built from $E\big[(\Phi_Y(Y) \otimes \Phi_X(X)) \otimes \Phi_X(X)\big]$.

Consider the pair of random variables $Z = (Y, X)$ and $X$. Applying the kernel sum rule to $Z$ given $X$,

$C_{(YX)X} C_{XX}^{-1} m_\Pi = \iint \big(\Phi_Y(y) \otimes \Phi_X(x)\big)\, p(y|x)\,\pi(x)\,dy\,dx = m_{Q_{XY}}$.

Page 28:

Interpretation as a weighted sample
– The kernel sum / chain rule output

$\hat m = \sum_{i=1}^n w_i\, k(\cdot, Y_i)$

is identical to the empirical kernel mean of the discrete measure $\sum_i w_i\, \delta_{Y_i}$ (or $\sum_i w_i\, \delta_{(X_i, Y_i)}$ for the chain rule).

– Since the weights $w_i$ may take negative values, this is not necessarily a probability measure: it is the kernel mean of a "signed measure".

– Still, the interpretation as a weighted sample is possible:

• The total weight approaches 1: $\sum_i w_i \to 1$ ($n \to \infty$).

• Expectations can be computed: $\sum_i w_i\, f(Y_i) \to E_{Q_Y}[f(Y)]$ ($n \to \infty$).

Page 29:

Theorem

Let $\hat m_{Q_Y} = \sum_i w_i\, k_Y(\cdot, Y_i)$ be the estimator of the kernel sum rule. If the range assumptions hold and $\varepsilon_n = n^{-\beta}$ ($0 < \beta < 1$), then for any $Q_Y$-square-integrable function $f$ satisfying suitable regularity conditions,

$\sum_{i=1}^n w_i\, f(Y_i) \to E_{Q_Y}[f(Y)], \quad n \to \infty$,

with a polynomial convergence rate depending on $\beta$.

e.g.
• $f = I_A$ (indicator of a set $A$): $\sum_{i: Y_i \in A} w_i \to$ probability $Q_Y(A)$.
• In particular, $\sum_i w_i \to 1$ as $n \to \infty$.
• $f = y^r$: $\sum_i w_i\, Y_i^r \to$ $r$-th moment of $Q_Y$.

Page 30:

Kernel method for Bayesian inference

Page 31:

Kernel realization of Bayes' rule

Bayes' rule

$q(x|y) = \dfrac{p(y|x)\,\pi(x)}{q(y)}, \qquad q(y) = \int p(y|x)\,\pi(x)\,dx$

$\Pi$: prior with p.d.f. $\pi$; $p(y|x)$: conditional probability (likelihood).

Kernel realization:
Goal: estimate the kernel mean of the posterior

$m_{X|y^*} := \int k_X(\cdot, x)\, q(x|y^*)\,dx$

given
– $m_\Pi$: kernel mean of the prior $\Pi$,
– $C_{XX}, C_{YX}$: covariance operators for $(X, Y) \sim P$,
where $P$ is the joint probability that gives $p(y|x)$ by conditioning.

Page 32:

Kernel Bayes' Rule: overview

Input
– Prior: a consistent estimator of the kernel mean,
$\hat m_\Pi = \sum_{\ell=1}^L \gamma_\ell\, \Phi_X(U_\ell)$: weighted sample expression $(U_1, \gamma_1), \dots, (U_L, \gamma_L)$.
– Conditional probability: covariance of a JOINT distribution,
$(X_1, Y_1), \dots, (X_n, Y_n)$: (joint) sample $\sim P$, where conditioning $P$ gives $p(y|x)$.
– Observation: $y^*$

Output
– Posterior:
$\hat m_{X|y^*} = \sum_{i=1}^n w_i(y^*)\, \Phi_X(X_i)$: weighted sample expression.

Page 33:

[Figure: schematic of KBR. Top: the prior (a weighted sample on X) and a joint sample (X, Y). Bottom: combining them with an observation y* yields the posterior q(x | y*), expressed as a weighted sample on X.]

Page 34:

Derivation of KBR: Bayes' rule revisited

Bayes' rule consists of two steps:
1. Product: $q(x, y) = p(y|x)\,\pi(x)$
2. Conditioning: $q(x|y^*) = \dfrac{q(x, y^*)}{q(y^*)}$

We already know how to compute them!

Denote $(X, Y) \sim Q$ with $q(x, y) = p(y|x)\,\pi(x)$. Note: $m_{Q_{XY}} \cong C_{XY}^{Q}$ (a joint kernel mean in $H_X \otimes H_Y$ can be identified with a covariance operator).

Product rule (kernel chain rule): $m_{Q_{XY}} = C_{(XY)X} C_{XX}^{-1} m_\Pi \cong C_{XY}^{Q}$

Sum rule: $m_{Q_{YY}} = C_{(YY)X} C_{XX}^{-1} m_\Pi \cong C_{YY}^{Q}$

Conditioning: $m_{X|y^*} = C_{XY}^{Q}\, (C_{YY}^{Q})^{-1}\, \Phi_Y(y^*)$

Page 35:

Kernel Bayes' Rule
(Fukumizu, Song, Gretton JMLR 2014)

Input:
– $(X_1, Y_1), \dots, (X_n, Y_n) \sim P$ (to give the conditional probability),
– $\hat m_\Pi = \sum_{\ell=1}^L \gamma_\ell\, \Phi_X(U_\ell)$ (prior): a consistent estimator of $m_\Pi$.

1. [Product] Compute

$\hat\mu = (G_X + n\varepsilon_n I_n)^{-1} G_{XU}\,\gamma, \qquad \Lambda = \mathrm{Diag}(\hat\mu)$.

2. [Conditioning] Compute

$R_{X|Y} = \Lambda G_Y \big( (\Lambda G_Y)^2 + n\delta_n I_n \big)^{-1} \Lambda$.

Output: estimator of the kernel mean of the posterior, given observation $y^*$:

$\hat m_{X|y^*} = \sum_{i=1}^n \big( R_{X|Y}\, \mathbf{k}_Y(y^*) \big)_i\, k_X(\cdot, X_i)$

(* $\varepsilon_n, \delta_n$: regularization coefficients)
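A minimal sketch of the two steps above in Python, following our reconstruction of the slide's formulas (names and toy data are ours). It returns the posterior weights $w(y^*) = R_{X|Y}\,\mathbf{k}_Y(y^*)$:

```python
import numpy as np

def gauss_kernel(A, B, sigma=1.0):
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * sigma**2))

def kernel_bayes_rule(X, Y, U, gamma, y_star, eps=0.01, delta=0.01, sx=1.0, sy=1.0):
    """Posterior weights w with m_{X|y*} ~ sum_i w_i k_X(., X_i)."""
    n = len(X)
    G_X, G_Y = gauss_kernel(X, X, sx), gauss_kernel(Y, Y, sy)
    # 1. [Product] mu = (G_X + n*eps*I)^{-1} G_{XU} gamma, Lambda = Diag(mu)
    mu = np.linalg.solve(G_X + n * eps * np.eye(n), gauss_kernel(X, U, sx) @ gamma)
    LG = np.diag(mu) @ G_Y
    # 2. [Conditioning] R = Lambda G_Y ((Lambda G_Y)^2 + n*delta*I)^{-1} Lambda
    R = LG @ np.linalg.solve(LG @ LG + n * delta * np.eye(n), np.diag(mu))
    return R @ gauss_kernel(Y, np.atleast_2d(y_star), sy).ravel()

# Toy check: Y = X + noise, so the posterior mean should track y*.
rng = np.random.default_rng(4)
X = rng.normal(size=(300, 1)); Y = X + 0.2 * rng.normal(size=(300, 1))
U = rng.normal(size=(100, 1)); gamma = np.full(100, 1.0 / 100)
w = kernel_bayes_rule(X, Y, U, gamma, np.array([0.8]))
print(w.sum(), w @ X[:, 0])   # total weight near 1; posterior mean near 0.8
```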

Page 36:

Inference with KBR

Weighted sample expression

$\hat m_{X|y^*} = \sum_{i=1}^n w_i(y^*)\, k_X(\cdot, X_i), \qquad w(y^*) = R_{X|Y}\, \mathbf{k}_Y(y^*)$

Equivalent to the kernel mean of

$\sum_{i=1}^n w_i(y^*)\, \delta_{X_i}$ ($\delta$: Dirac's delta),

which is a signed measure (not necessarily a probability): some weights may be negative.

– $\sum_i w_i(y^*) \to 1$ in probability as $n \to \infty$.

Page 37:

How to use it?
– Expectation: if $f \in \mathrm{Range}(C_{XX})$ and the corresponding range conditions with respect to the posterior operators are satisfied, then

$\sum_{i=1}^n w_i(y^*)\, f(X_i) \to E[f(X) \mid y^*], \quad n \to \infty$ (consistent).

e.g.
• $f = I_A$: $\sum_{i: X_i \in A} w_i \to$ posterior probability of the set $A$.
• $f = x^r$: $\sum_i w_i\, X_i^r \to$ $r$-th moment of the posterior.

(More general discussions in Kanagawa and Fukumizu, AISTATS 2014)

– Point estimation (quasi-MAP):

$\hat x = \operatorname{argmin}_x \big\| \hat m_{X|y^*} - \Phi_X(x) \big\|_{H_X}^2$

Solved numerically.
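For a Gaussian kernel, $k(x, x)$ is constant, so expanding the quasi-MAP objective gives $\hat x = \operatorname{argmax}_x \sum_i w_i\, k_X(x, X_i)$. A minimal sketch that searches over candidate points (here the training sample itself; in practice one would refine by numerical optimization), with w the posterior weights from a KBR step:

```python
import numpy as np

def quasi_map(X, w, sigma=1.0):
    """argmin_x ||m_hat - Phi(x)||^2 over candidates x in X: since k(x, x) is
    constant for a Gaussian kernel, this is argmax_x sum_i w_i k(x, X_i)."""
    sq = np.sum(X**2, axis=1)[:, None] + np.sum(X**2, axis=1)[None, :] - 2 * X @ X.T
    scores = np.exp(-sq / (2 * sigma**2)) @ w
    return X[np.argmax(scores)]
```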

Page 38:

– A completely nonparametric way of computing Bayes' rule: no parametric models are needed; data or samples are used to express the probabilistic relations.

Examples:
1. Nonparametric HMM

$p(X_0, Y_0, \dots, X_T, Y_T) = \pi(X_0)\, \prod_{t=0}^{T} p(Y_t | X_t)\, \prod_{t=0}^{T-1} q(X_{t+1} | X_t)$

$p(y|x)$ and/or $q(x'|x)$ are unknown, but data are available.

2. The explicit form of the likelihood $p(y|x)$ or the prior $\pi$ is unavailable, but sampling is possible.
c.f. Approximate Bayesian Computation (ABC)
(Kernel ABC: Nakagome, Mano, Fukumizu 2013)

[Figure: HMM graphical model, $X_0 \to X_1 \to \cdots \to X_T$ with emissions $Y_0, \dots, Y_T$.]

Page 39:

Convergence rate

Theorem (Fukumizu, Song, Gretton 2014)

Let $\Pi$ be a prior and $(X, Y) \sim P$ with p.d.f. $p(x, y)$.
Assumptions:
- $\varepsilon_n = n^{-\alpha}$, $\delta_n = n^{-\beta}$ for some $0 < \alpha, \beta < 1/2$,
- $\pi / p_X \in \mathrm{Range}(C_{XX}^{\nu})$ for some $\nu > 0$,
- $E[k_X(\cdot, X) \mid Y = \cdot] \in \mathrm{Range}(C_{YY}^{\kappa})$ for some $\kappa > 0$.

Then, with appropriately chosen $\varepsilon_n$ and $\delta_n$, for any $y^*$,

$\big\| \hat m_{X|y^*} - m_{X|y^*} \big\|_{H_X} \to 0$ in probability as $n \to \infty$,

at a polynomial rate in $n$.

Remark: the explicit rate depending on the smoothness of the functions $\pi / p_X$ and $E[k_X(\cdot, X) \mid Y = \cdot]$ is also available; it improves with the smoothness and reaches its best value once the smoothness parameters are at least 1/2.

Page 40:

Choice of kernel and hyperparameters

– Parameters to be chosen:
• kernel (parameters in the kernel),
• regularization parameters.

– Cross-validation is recommended, if possible.
• Straightforward in the supervised setting (incl. nonparametric HMM).
• Otherwise, construct a relevant supervised problem and apply CV to it.

Supporting evidence:
- CV has been used successfully for SVM.
- The rate for regression is attained with the parameter choice by validation (Eberts & Steinwart 2011).

– Heuristic: $\mathrm{median}\{\|X_i - X_j\| : i \ne j\}$ for the bandwidth $\sigma$ in a Gaussian kernel.

Page 41:

Basic experiments
– Comparison with KDE + importance weighting (IW)

• Data: a joint Gaussian sample $(X_i, Y_i)$, $i = 1, \dots, n$, with $Y_i$ generated from $X_i$ with additive Gaussian noise; prior $\Pi$: Gaussian, given by a sample $U_1, \dots, U_L$.

• KDE + importance weighting:

KDE: estimate the densities $\hat p(x, y)$ and $\hat p_X(x)$ from the joint sample by kernel density estimation.

IW: the posterior is estimated by the weighted sample $(X_i, \hat w_i)$ with

$\hat w_i \propto \hat p(y^* \mid X_i)\, \dfrac{\hat\pi(X_i)}{\hat p_X(X_i)}, \qquad E[f(X) \mid y^*] \approx \dfrac{\sum_i \hat w_i\, f(X_i)}{\sum_\ell \hat w_\ell}$.

Page 42:

– $n$ (sample size for $(X, Y)$) $=$ sample size for the prior $= 200$.
– Dimensionality: $d = 2, \dots, 64$.

– Gaussian kernels are used for both methods.

– Bandwidth parameters are selected by CV for both methods.

Page 43:

Example: KBR for nonparametric HMM

– Assume:
$p(y|x)$ and/or $q(x'|x)$ is not known, but data $(X_t, Y_t)$ are available in the training phase.

Examples:
• measurement of hidden states is expensive,
• hidden states are measured with a time delay.

– Testing phase (e.g., filtering): given $\tilde y_1, \dots, \tilde y_t$, estimate the hidden state $x_t$.
KBR point estimator: $\hat x_t = \operatorname{argmin}_x \big\| \hat m_{X_t \mid \tilde y_1, \dots, \tilde y_t} - \Phi_X(x) \big\|^2$

– General sequential inference uses Bayes' rule ⇒ KBR applies.

[Figure: HMM graphical model, $X_0 \to X_1 \to \cdots \to X_T$ with emissions $Y_0, \dots, Y_T$.]

Page 44:

Numerical examples

(a) Noisy rotation

$X_{t+1} = (\cos\theta_{t+1}, \sin\theta_{t+1}) + Z_t, \qquad \theta_{t+1} = \arctan(v_t / u_t) + 0.3$,

where $X_t = (u_t, v_t)$, $Y_t = X_t + W_t$, and $Z_t, W_t \sim N(0, 0.04\,I)$ i.i.d.

Filtering with the point estimator by KBR.

KBR does NOT know the dynamics, while the EKF and UKF use it.

Page 45:

(b) Noisy oscillation

$X_{t+1} = \big(1 + 0.4\sin(8\theta_{t+1})\big)\,(\cos\theta_{t+1}, \sin\theta_{t+1}) + Z_t, \qquad \theta_{t+1} = \arctan(v_t / u_t) + 0.4$,

where $X_t = (u_t, v_t)$, $Y_t = X_t + W_t$, and $Z_t, W_t \sim N(0, 0.04\,I)$ i.i.d.

Page 46:

Camera angles
– Hidden $X_t$: angles of a video camera located at a corner of a room.
– Observed $Y_t$: movie frame of the room + additive Gaussian noise.
– $Y_t$: 3600 downsampled frames of 20 x 20 RGB pixels (1200-dim.).
– The first 1800 frames are used for training, the second half for testing.

Average MSE for camera angles (10 runs):

noise            KBR (trace)     Kalman filter (quaternion)
σ² = 10⁻⁴       0.15 ± 0.01     0.56 ± 0.02
σ² = 10⁻³       0.21 ± 0.01     0.54 ± 0.02

To represent the SO(3) model, Tr[AB⁻¹] is used for KBR, and the quaternion expression for the Kalman filter.

Page 47:

Robot localization

COLD (COsy Localization Database, IJRR 2009)
State $X_t \in \mathbb{R}^3$: 2-D coordinate and orientation of a robot.
Observation $Y_t$: image sequence (SIFT features, 4200 dim.)

Training sample $(X_t, Y_t)$: $t = 1, \dots, N$.

Estimate the location of the robot from the image sequence.

– Observation model: difficult to model ⇒ KBR.

– State transition: linear Gaussian ⇒ Kernel Monte Carlo
(Kanagawa, Nishiyama, Fukumizu 2013)

Page 48:

NAI: naive method (closest image in the training data)

NN: PF + K-nearest neighbor (Vlassis, Terwijn, Kröse 2002)

KBR: KBR + KBR

KMC: KBR + Monte Carlo

Page 49:

[Figure: localization with # training samples = 200. Markers show the true location and the estimate; red (+) / blue (−) circles show the weights on the training sample.]

Page 50:

Conclusions

A new nonparametric / kernel approach to Bayesian inference

• Kernel mean embedding: positive definite kernels are used to represent and manipulate probabilities.

• "Nonparametric" Bayesian inference.
– No densities are needed, only data.

• Bayesian inference with matrix computation.
– Computation is done with Gram matrices.
– No integral, no approximate inference.

• More suitable for high-dimensional data than the smoothing kernel approach.

Page 51:

Kernel Herding

Page 52:

Kernel Herding
(Chen, Welling, Smola 2010)

– Goal: find sample points $x_1, \dots, x_T$ that efficiently approximate the kernel mean $m_P := E[\Phi(X)]$ in the form

$\hat m_T = \frac{1}{T} \sum_{t=1}^T \Phi(x_t)$.

– Assumption: $m_P$ is known.

– Algorithm: generate $x_1, x_2, \dots$ by the update rule

$x_{t+1} = \operatorname{argmax}_x\, \langle h_t, \Phi(x) \rangle$,

$h_{t+1} = h_t + m_P - \Phi(x_{t+1}), \qquad h_0 = m_P$.
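A minimal sketch of the update rule over a finite candidate pool, with $m_P$ taken as the pool's empirical kernel mean (a common practical setting; names and data are ours):

```python
import numpy as np

def gauss_kernel(A, B, sigma=1.0):
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * sigma**2))

def kernel_herding(pool, T, sigma=1.0):
    """Greedy selection: x_{t+1} = argmax_x <m_P, Phi(x)> - (1/(t+1)) sum_{j<=t} k(x_j, x)."""
    K = gauss_kernel(pool, pool, sigma)
    mu = K.mean(axis=1)          # <m_P, Phi(x)> for every candidate x
    idx = []
    for t in range(T):
        penalty = K[:, idx].sum(axis=1) / (t + 1) if idx else 0.0
        idx.append(int(np.argmax(mu - penalty)))
    return pool[idx]

rng = np.random.default_rng(5)
pool = rng.normal(size=(2000, 2))
S = kernel_herding(pool, 20)     # 20 "super-samples" approximating m_P
print(S.mean(axis=0))            # should be roughly E[X] = 0
```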

Page 53:

Interpretation
Assume $\|\Phi(x)\|^2 = k(x, x)$ is constant (shift-invariant kernel). Unrolling the update with $h_0 = m_P$ gives $h_t = (t+1)\, m_P - \sum_{j=1}^{t} \Phi(x_j)$, so

$x_1 = \operatorname{argmax}_x\, \langle m_P, \Phi(x) \rangle$

$x_2 = \operatorname{argmax}_x\, \big\langle m_P - \tfrac{1}{2}\Phi(x_1), \Phi(x) \big\rangle$

⋮

$x_{t+1} = \operatorname{argmax}_x\, \big\langle m_P - \tfrac{1}{t+1} \sum_{j=1}^{t} \Phi(x_j), \Phi(x) \big\rangle$

Each step projects the residual of the approximation, $m_P - \frac{1}{t+1}\sum_{j \le t} \Phi(x_j)$, onto $\Phi(x)$: a greedy reduction of the approximation error.

Page 54:

Page 55:

[Figure: the first 20 samples generated by kernel herding vs. 20 i.i.d. samples from the same distribution.]

Page 56:

– Even when $m_P$ is not known exactly, herding may be applied with $m_P$ approximated by a large number of samples. Re-approximating with a small number of samples reduces the size of the Gram matrices ⇒ speeding up kernel-method computations.

– Convergence:

• The approximation error $\big\| m_P - \frac{1}{T} \sum_{t=1}^T \Phi(x_t) \big\|$ is conjectured to be $O(1/T)$.

• This is the optimal convergence rate.

c.f. for i.i.d. samples, the rate is $O(1/\sqrt{T})$.

Page 57:

Page 58:

References

Fukumizu, K., Song, L., Gretton, A. (2014) Kernel Bayes' Rule: Bayesian Inference with Positive Definite Kernels. Journal of Machine Learning Research 14:3753-3783.

Song, L., Gretton, A., Fukumizu, K. (2013) Kernel Embeddings of Conditional Distributions. IEEE Signal Processing Magazine 30(4):98-111.

Kanagawa, M., Nishiyama, Y., Gretton, A., Fukumizu, K. (2013) Kernel Monte Carlo Filter. arXiv:1312.4664.

Chen, Y., Welling, M., Smola, A. (2010) Super-Samples from Kernel Herding. Proc. Uncertainty in Artificial Intelligence (UAI 2010).