Introduction to Kernel Methods – The Institute of Statistical Mathematics
TRANSCRIPT
Kenji Fukumizu
The Institute of Statistical Mathematics / The Graduate University for Advanced Studies (SOKENDAI)
Intensive course, Graduate School of Engineering Science, Osaka University
September 2014
Introduction to Kernel Methods
7. Nonparametric Bayesian Inference with Kernel Methods
Bayesian inference – Bayes' rule

$q(x \mid y) = \dfrac{p(y \mid x)\,\pi(x)}{\int p(y \mid x)\,\pi(x)\,dx}$

• PROS
– Principled and flexible method for statistical inference.
– Can incorporate prior knowledge.
• CONS
– Computation: an integral is needed.
» Numerical integration: MCMC, SMC, etc.
» Approximation: variational Bayes, belief propagation, expectation propagation, etc.

[Portrait: Thomas Bayes (1701-1761)?]
Motivating Example
Robot localization
– State $X_t$: 2-D coordinate and orientation of a robot.
– Observation $Y_t$: image sequence taken by a camera on the robot.
– Task: estimate the location of the robot from image sequences.
– It is well formulated by an HMM.

[HMM diagram: X0 → X1 → X2 → X3 → … → XT (transition of state: location & orientation), with each Xt emitting an observation Yt (image of the environment).]

– Estimation / prediction with an HMM: sequential application of Bayes' rule solves the task.
– A nonparametric approach is needed: the observation model $p(y \mid x)$ (location & orientation → image of the environment) is very difficult to model with a simple parametric model.
⇒ "Nonparametric" implementation of Bayesian inference.
c.f. the Monte Carlo approach needs to know the density functions.
Kernel method for Bayesian inference (topic of this chapter)
– A new nonparametric / kernel approach to Bayesian inference.
• Kernel mean approach to represent and manipulate probabilities.
• "Nonparametric" Bayesian inference: no densities are needed, but data are needed.
• Bayesian inference with matrix computation.
– Computation is done with Gram matrices.
– No integral, no approximate inference.
Outline
1. Introduction
2. Representing conditional probabilities
3. Kernel sum rule and chain rule
4. Kernel Bayes rule: kernel methods for Bayesian inference
5. Conclusions
Representing conditional probabilities
Review: kernel mean

$X$: random variable taking values on a measurable space $\Omega$, with distribution $P$. $k$: pos. def. kernel on $\Omega$. $H$: RKHS defined by $k$.
– Kernel mean:
$m_P := E[\Phi(X)] = E[k(\cdot, X)] = \int k(\cdot, \tilde{x})\, dP(\tilde{x}) \in H$
– With a characteristic kernel, the kernel mean uniquely identifies the distribution.
• Embedding of probabilities into the RKHS.
• Examples: Gaussian kernel, Laplace kernel.
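As a concrete illustration (my own sketch, not from the lecture; all function names are illustrative), here is the empirical kernel mean with a Gaussian kernel, and the RKHS distance between two empirical kernel means, which is the (biased) MMD mentioned on the next slide:

import numpy as np

def gauss_kernel(X, Z, sigma):
    # Gaussian kernel matrix k(x, z) = exp(-||x - z||^2 / (2 sigma^2))
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma**2))

def empirical_kernel_mean(X, sigma):
    # Returns the function x -> (1/n) sum_i k(x, X_i), i.e. the empirical kernel mean.
    return lambda x: gauss_kernel(np.atleast_2d(x), X, sigma).mean(axis=1)

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(200, 1))
Y = rng.normal(0.5, 1.0, size=(200, 1))
print(empirical_kernel_mean(X, 1.0)([0.0]))      # evaluation of the kernel mean at x = 0
mmd2 = (gauss_kernel(X, X, 1.0).mean() - 2 * gauss_kernel(X, Y, 1.0).mean()
        + gauss_kernel(Y, Y, 1.0).mean())
print(mmd2)                                      # ||m_P - m_Q||^2 > 0: the samples differ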
Nonparametric inference with kernels
Principle: with characteristic kernels,
inference on probabilities $P$ ⇒ inference on kernel means $m_P$.
• Two-sample test: $P = Q$? → MMD.
• Independence test: $P_{XY} = P_X \otimes P_Y$? → HSIC.
• Bayesian inference → this chapter.
Operations for inference
Operations on probabilities necessary for inference:
– Conditioning: $p(x, y) \to p(y \mid x)$
– Sum rule: $p(y \mid x),\ \pi(x) \to q(y) = \sum_x p(y \mid x)\,\pi(x)$
– Product rule: $p(y \mid x),\ \pi(x) \to q(x, y) = p(y \mid x)\,\pi(x)$ (a.k.a. chain rule)
– Bayes' rule: $p(y \mid x),\ \pi(x) \to q(x \mid y) = \dfrac{p(y \mid x)\,\pi(x)}{\sum_x p(y \mid x)\,\pi(x)}$
Conditional kernel mean

Review: conditional mean of Gaussian r.v.
$X, Y$: Gaussian random vectors with mean 0 ($\in \mathbb{R}^m, \mathbb{R}^\ell$, resp.)
– Least square prediction:
$A^* = \mathrm{argmin}_{A \in \mathbb{R}^{\ell \times m}} E\|Y - AX\|^2 = V_{YX} V_{XX}^{-1}$
– Conditional expectation:
$E[Y \mid X = x] = V_{YX} V_{XX}^{-1} x$
($V_{XX}$, $V_{YX}$: covariance matrices)

General random variables
With characteristic kernels, for general $X$ and $Y$:
– Least square prediction:
$\mathrm{argmin}_{F : H_X \to H_Y} E\|\Phi(Y) - F\,\Phi(X)\|^2 = C_{YX} C_{XX}^{-1}$
– Conditional expectation:
$E[\Phi(Y) \mid X = x] = C_{YX} C_{XX}^{-1} \Phi(x)$, and $E[f(Y) \mid X = x] = \langle f,\ C_{YX} C_{XX}^{-1} \Phi(x) \rangle$ for $f \in H_Y$.
Kernel mean of $P(Y \mid X = x)$
Expression by kernel mean and covariance operators. Estimable!
$m_{Y|x} = E[\Phi(Y) \mid X = x] = C_{YX} C_{XX}^{-1} \Phi(x)$

Empirical estimation:
$\hat{m}_{Y|x} := \hat{C}_{YX} \big( \hat{C}_{XX} + \varepsilon_n I \big)^{-1} \Phi(x)$
($\varepsilon_n$: regularization coefficient)
– Remark: even with the population operators, regularized inversion is needed if the dimensionality is infinite. Since $\dim H = \infty$, there are always arbitrarily small eigenvalues.

Theorem (consistency)
Assume that $E[k_Y(\cdot, Y) \mid X = x]$, as a function of $(x, y)$, belongs to Range$(C_{XX} \otimes C_{YY})$. Then, for a suitable $\varepsilon_n \to 0$,
$\hat{C}_{YX} \big( \hat{C}_{XX} + \varepsilon_n I \big)^{-1} \Phi(x) \to E[\Phi(Y) \mid X = x] \quad (n \to \infty)$.

Proposition (Gram matrix expression)
Given an i.i.d. sample $(X_1, Y_1), \ldots, (X_n, Y_n) \sim P$,
$\hat{m}_{Y|x} = \sum_{i=1}^n w_i\, k_Y(\cdot, Y_i), \qquad w = (G_X + n\varepsilon_n I_n)^{-1} \mathbf{k}_X(x)$,
where $(G_X)_{ij} = k_X(X_i, X_j)$ and $\mathbf{k}_X(x) = \big( k_X(x, X_1), \ldots, k_X(x, X_n) \big)^\top$.
Proof (Gram matrix expression)
Let $f = (\hat{C}_{XX} + \varepsilon_n I)^{-1} \Phi(x)$, and decompose it as $f = \sum_j a_j \Phi(X_j) + f_\perp$, where $f_\perp$ is orthogonal to Span$\{\Phi(X_j)\}$. Then
$\Phi(x) = (\hat{C}_{XX} + \varepsilon_n I) f = \frac{1}{n} \sum_i \Phi(X_i)\, (G_X a)_i + \varepsilon_n \sum_j a_j \Phi(X_j) + \varepsilon_n f_\perp$.
Taking the inner product with $\Phi(X_\ell)$,
$k_X(X_\ell, x) = \frac{1}{n} (G_X^2 a)_\ell + \varepsilon_n (G_X a)_\ell, \qquad \cdots (*)$
i.e. $(G_X + n \varepsilon_n I)\, \frac{1}{n} G_X a = \mathbf{k}_X(x)$ by $(*)$.
The target is
$\hat{m}_{Y|x} = \hat{C}_{YX} f = \frac{1}{n} \sum_i \Phi(Y_i)\, \langle \Phi(X_i), f \rangle = \sum_i w_i\, k_Y(\cdot, Y_i)$,
with $w = \frac{1}{n} G_X a = (G_X + n\varepsilon_n I)^{-1} \mathbf{k}_X(x)$. ∎
– The proof of consistency is omitted.
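A minimal numerical sketch of this Gram matrix expression (my own code, not from the lecture): the weighted sum of f(Y_i) with the weights w approximates E[f(Y) | X = x], here for a toy model Y = X^2 + noise.

import numpy as np

def gauss_gram(X, Z, sigma):
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma**2))

def cond_mean_weights(X, x, sigma, eps):
    # w = (G_X + n * eps * I)^{-1} k_X(x): weights of the conditional kernel mean
    n = X.shape[0]
    G = gauss_gram(X, X, sigma)
    kx = gauss_gram(X, np.atleast_2d(x), sigma)[:, 0]
    return np.linalg.solve(G + n * eps * np.eye(n), kx)

rng = np.random.default_rng(1)
X = rng.uniform(-2, 2, size=(500, 1))
Y = X[:, 0] ** 2 + 0.1 * rng.standard_normal(500)
w = cond_mean_weights(X, [1.0], sigma=0.3, eps=1e-3)
print(w @ Y)   # approximates E[Y | X = 1] = 1 (up to estimation error)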
How to use the conditional kernel mean
– Nonparametric regression: from $(X_1, Y_1), \ldots, (X_n, Y_n) \in \mathcal{X} \times \mathbb{R}$, estimate $E[Y \mid X = x]$
⇒ kernel ridge regression.
– Application to conditional independence (Fukumizu et al. JMLR 2004, AoS 2009, NIPS 2010).
– Bayesian inference.
Note: for consistency, the kernel (bandwidth) is fixed, and only the regularization coefficient → 0.
c.f. smoothing kernels.
Addendum: kernel ridge regression
– (Linear) ridge regression:
$\min_{\beta \in \mathbb{R}^d} \sum_{i=1}^n \big( Y_i - \beta^\top X_i \big)^2 + \lambda \|\beta\|^2$
Popularly used when $X^\top X$ is (almost) degenerate.
– Kernel ridge regression:
$\min_{f \in H} \sum_{i=1}^n \big( Y_i - f(X_i) \big)^2 + \lambda \|f\|_H^2$
⇒ nonparametric regression with kernels; the solution is $\hat{f}(x) = \mathbf{k}_X(x)^\top (G_X + \lambda I_n)^{-1} \mathbf{Y}$.
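A short, self-contained sketch of kernel ridge regression (my code; the toy data and parameter values are illustrative):

import numpy as np

def gauss_gram(X, Z, sigma):
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma**2))

def krr_fit(X, Y, sigma, lam):
    # Solve (G_X + lam I) alpha = Y; the predictor is f(x) = k_X(x)^T alpha.
    alpha = np.linalg.solve(gauss_gram(X, X, sigma) + lam * np.eye(len(X)), Y)
    return lambda Xnew: gauss_gram(Xnew, X, sigma) @ alpha

rng = np.random.default_rng(2)
X = rng.uniform(-3, 3, size=(200, 1))
Y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(200)
f = krr_fit(X, Y, sigma=0.5, lam=0.1)
print(f(np.array([[1.0]])))   # close to sin(1) ≈ 0.84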
Nonparametric regression: theoretical comparison
Assume $Y$ is 1-dimensional, and a kernel is used only for $X$:
$\hat{f}(x) := \mathbf{k}_X(x)^\top (G_X + n\varepsilon_n I_n)^{-1} \mathbf{Y}$
(same as Gaussian process regression / kernel ridge regression).
– Consistency 1 (Eberts & Steinwart 2011)
If $k$ is Gaussian and $f^* \in W^s$ (*$W^s$: Sobolev space of order $s$), then (under some technical assumptions) for any $\xi > 0$,
$\|\hat{f} - f^*\|_{L_2}^2 = O_p\big( n^{-2s/(2s+d)+\xi} \big)$.
Note: $n^{-2s/(2s+d)}$ is the optimal rate for a linear estimator (Stone 1982).
– Consistency 2 (case $f^* \in H$)
Suppose $f^* \in H$ and $\varepsilon_n = n^{-\beta}$ with suitable $\beta > 0$. Then, with a characteristic kernel $k$,
$\|\hat{f} - f^*\|_H \to 0 \quad (n \to \infty)$.
• The rates do not depend on $d$ (the dimension of $X$), since the analysis can be done within the RKHS.
• $\|\cdot\|_H$ is stronger than $\|\cdot\|_\infty$: $|f(x)| = |\langle f, \Phi(x) \rangle| \le \sqrt{k(x, x)}\, \|f\|_H$. Thus,
$\sup_x |\hat{f}(x) - f^*(x)| \le \sup_x \sqrt{k(x, x)}\ \|\hat{f} - f^*\|_H \to 0$.
Nonparametric regression: experimental comparison

Data: $Y = 1/(1.5 + \|X\|) + Z$, $X \sim N(0, I_d)$, $Z \sim N(0, 0.1)$.
– Kernel ridge regression: Gaussian kernel.
– Local linear regression: Epanechnikov kernel, $K(t) = \frac{3}{4}(1 - t^2)$ for $|t| \le 1$ ('locfit' in R is used).
$n = 100$, 500 runs. Bandwidth parameters are chosen by CV.

[Figure: mean square errors vs. dimension of X (2 to 10), comparing the kernel method and local linear regression.]
Kernel Sum and Product Rules
Kernel Sum Rule
Sum rule: $q(y) = \int p(y \mid x)\, \pi(x)\, dx$.

Kernel sum rule:
– Input: the kernel mean $m_\Pi$ of $\Pi$ and the covariance operators $C_{XX}, C_{YX}$.
– Output: the kernel mean of $Q$: $m_Q = C_{YX} C_{XX}^{-1} m_\Pi$.
The conditional probability $p(y \mid x)$ is given in the form of the covariance operators $C_{XX}, C_{YX}$,
or via a sample $(X_1, Y_1), \ldots, (X_n, Y_n) \sim P$, where $P$ is a joint distribution whose conditional probability is $p(y \mid x)$.
Intuition
– (Assume the conditions for $E[\Phi(Y) \mid X = x] = C_{YX} C_{XX}^{-1} \Phi(x)$ to hold.)
$m_Q = \int\!\!\int k_Y(\cdot, y)\, p(y \mid x)\, \pi(x)\, dx\, dy = \int E[k_Y(\cdot, Y) \mid X = x]\, d\Pi(x)$
$= C_{YX} C_{XX}^{-1} \int k_X(\cdot, x)\, d\Pi(x) = C_{YX} C_{XX}^{-1} m_\Pi$,
which follows from $E[\Phi(Y) \mid X = x] = C_{YX} C_{XX}^{-1} \Phi(x)$.
Estimator:
– Gram matrix expression:
Input: $\hat{m}_\Pi = \sum_{j=1}^\ell \gamma_j\, k_X(\cdot, U_j)$ and $(X_1, Y_1), \ldots, (X_n, Y_n) \sim P$.
Output: $\hat{m}_Q = \sum_{i=1}^n w_i\, k_Y(\cdot, Y_i)$,
$w = (G_X + n\varepsilon_n I_n)^{-1} G_{XU}\, \gamma$, where $(G_{XU})_{ij} = k_X(X_i, U_j)$.

Consistency
Theorem: if $\|\hat{m}_\Pi - m_\Pi\|_H = O_p(n^{-\alpha})$ for some $\alpha > 0$, then, for a suitable $\varepsilon_n \to 0$,
$\|\hat{m}_Q - m_Q\|_H \to 0 \quad (n \to \infty)$.
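A minimal sketch of the kernel sum rule in this matrix form (my code; the toy distributions are illustrative): given prior weights gamma on points U and a joint sample (X, Y), it returns the weights w on Y_1, ..., Y_n that represent the kernel mean of q(y).

import numpy as np

def gauss_gram(X, Z, sigma):
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma**2))

def kernel_sum_rule(X, U, gamma, sigma, eps):
    # w = (G_X + n * eps * I)^{-1} G_XU gamma
    n = X.shape[0]
    G_X = gauss_gram(X, X, sigma)
    G_XU = gauss_gram(X, U, sigma)
    return np.linalg.solve(G_X + n * eps * np.eye(n), G_XU @ gamma)

# Toy check: p(y|x) = N(x, 0.1^2) and prior Pi = N(1, 0.2^2), so E_Q[Y] = 1.
rng = np.random.default_rng(3)
X = rng.normal(0, 1, size=(500, 1))           # joint sample expressing p(x, y)
Y = X + 0.1 * rng.standard_normal((500, 1))
U = rng.normal(1.0, 0.2, size=(200, 1))       # sample expressing the prior
gamma = np.full(200, 1.0 / 200)
w = kernel_sum_rule(X, U, gamma, sigma=0.5, eps=1e-2)
print(w.sum(), w @ Y[:, 0])                   # sum of weights ≈ 1; weighted mean ≈ 1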
Kernel Chain Rule
– Chain rule: $q(x, y) = p(y \mid x)\, \pi(x)$.
– Kernelization: $C_{(YX)X} C_{XX}^{-1} : H_X \to H_Y \otimes H_X$,
$m_Q := C_{(YX)X} C_{XX}^{-1} m_\Pi$.
– Gram matrix expression:
Input: $\hat{m}_\Pi = \sum_{j=1}^\ell \gamma_j\, \Phi(U_j)$ and $(X_1, Y_1), \ldots, (X_n, Y_n) \sim P$.
Output: $\hat{m}_Q = \sum_{i=1}^n w_i\, \Phi(Y_i) \otimes \Phi(X_i)$, with $w = (G_X + n\varepsilon_n I_n)^{-1} G_{XU}\, \gamma$.
Note: the coefficients $w$ are exactly the same as in the kernel sum rule.
– Intuition: $C_{(YX)X} = E\big[ (\Phi(Y) \otimes \Phi(X)) \otimes \Phi(X) \big]$; regard the pair $(Y, X)$ as a single random variable with feature map $\Phi(Y) \otimes \Phi(X)$. Applying the kernel sum rule to this variable,
$m_Q = E_{(X, Y) \sim Q}\big[ \Phi(Y) \otimes \Phi(X) \big] = C_{(YX)X} C_{XX}^{-1} m_\Pi$.
Interpretation as a weighted sample
– Kernel sum / chain rule output
$\hat{m}_Q = \sum_i w_i\, k(\cdot, Y_i)$
is identical to the empirical kernel mean of the discrete measure $\sum_i w_i\, \delta_{Y_i}$.
The weights can take negative values, so this is not necessarily a probability measure: it is the kernel mean of a signed measure.
Nevertheless, an interpretation as a weighted sample is possible:
• the sum approaches 1: $\sum_i w_i \to 1$ ($n \to \infty$);
• expectations can be computed: $\sum_i w_i\, f(Y_i) \to E_Q[f(Y)]$ ($n \to \infty$).
Theorem
Let $\hat{m}_Q = \sum_i w_i\, k(\cdot, Y_i)$ be the estimator of the kernel sum rule.
If $\|\hat{m}_\Pi - m_\Pi\| = O_p(n^{-\alpha})$ and $\varepsilon_n = n^{-\beta}$ ($0 < \beta < 1$), then for any $Q$-square-integrable function $f$,
$\sum_i w_i\, f(Y_i) \to E_Q[f(Y)] \quad (n \to \infty)$,
with a convergence rate determined by $\alpha$ and $\beta$.
e.g.
• $f = I_A$: $\sum_{i : Y_i \in A} w_i \to$ probability $Q(A)$. In particular, $\sum_i w_i \to 1$ as $n \to \infty$.
• $f = y^p$: $\sum_i w_i\, Y_i^p \to$ $p$-th moment of $Q$.
Kernel method for Bayesian inference
Kernel realization of Bayes' rule

Bayes' rule:
$q(x \mid y) = \dfrac{p(y \mid x)\, \pi(x)}{q_Y(y)}$
$\Pi$: prior with p.d.f. $\pi$; $p(y \mid x)$: conditional probability (likelihood).

Kernel realization:
Goal: estimate the kernel mean of the posterior,
$m_{X|y^*} := E\big[ k_X(\cdot, X) \mid Y = y^* \big]$,
given
– $m_\Pi$: kernel mean of the prior $\Pi$;
– $C_{XX}, C_{YX}$: covariance operators for $(X, Y) \sim P$,
where $P$ is the joint probability that gives $p(y \mid x)$ by conditioning.
Kernel Bayes' Rule: overview
Input
– Prior: a consistent estimator of the kernel mean,
$\hat{m}_\Pi = \sum_{j=1}^\ell \gamma_j\, \Phi(U_j)$;
$(U_1, \gamma_1), \ldots, (U_\ell, \gamma_\ell)$: weighted sample expression.
– Conditional probability: covariance of a JOINT distribution;
$(X_1, Y_1), \ldots, (X_n, Y_n)$: (joint) sample $\sim P$, where conditioning $P$ gives $p(y \mid x)$.
– Observation: $y^*$.
Output
– Posterior:
$\hat{m}_{X|y^*} = \sum_{i=1}^n \rho_i(y^*)\, \Phi(X_i)$:
weighted sample expression.

[Figure: the prior (weighted sample on X), the joint sample (X_i, Y_i), the observation y*, and the resulting posterior m_{X|y*} (weighted sample on X).]
Derivation of KBR: Bayes' rule revisited
Bayes' rule consists of two steps:
1. Product: $q(x, y) = p(y \mid x)\, \pi(x)$.
2. Conditioning: $q(x \mid y^*) = q(x, y^*) / q_Y(y^*)$.
We already know how to compute them!
Denote $(Z, W) \sim Q$. Note: the covariance operators of $Q$ can be identified with kernel means on product spaces, e.g. $C_{WW}^Q \cong E[\Phi(W) \otimes \Phi(W)]$.
– Product rule (kernel chain rule) ⇒ $C_{ZW}^Q \cong C_{(XY)X} C_{XX}^{-1} m_\Pi$.
– Sum rule ⇒ $C_{WW}^Q \cong C_{(YY)X} C_{XX}^{-1} m_\Pi$.
– Conditioning: $m_{X|y^*} = C_{ZW}^Q \big( C_{WW}^Q \big)^{-1} \Phi(y^*)$.
Kernel Bayes' Rule (Fukumizu, Song, Gretton JMLR 2014)

Input: $(X_1, Y_1), \ldots, (X_n, Y_n) \sim P$ (to give the conditional probability);
$\hat{m}_\Pi = \sum_{j=1}^\ell \gamma_j\, \Phi(U_j)$ (prior), a consistent estimator of $m_\Pi$.
1. [Product] Compute
$\hat{\mu} = (G_X + n\varepsilon_n I_n)^{-1} G_{XU}\, \gamma$, and set $\Lambda = \mathrm{Diag}(\hat{\mu})$.
2. [Conditioning] Compute
$R_{X|Y} = \Lambda\, G_Y \big( (\Lambda G_Y)^2 + n\delta_n I_n \big)^{-1} \Lambda$.
Output: estimator of the kernel mean of the posterior given observation $y^*$:
$\hat{m}_{X|y^*} = \sum_{i=1}^n \rho_i(y^*)\, k_X(\cdot, X_i), \qquad \rho(y^*) = R_{X|Y}\, \mathbf{k}_Y(y^*)$.
(* $\varepsilon_n$, $\delta_n$: regularization coefficients.)
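A direct numerical sketch of these two steps (my own code following the formulas above; the toy model and parameter values are illustrative):

import numpy as np

def gauss_gram(X, Z, sigma):
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma**2))

def kbr_weights(X, Y, U, gamma, y_star, sx, sy, eps, delta):
    # Posterior weights rho(y*) on X_1..X_n by the Kernel Bayes' Rule.
    n = X.shape[0]
    G_X, G_Y = gauss_gram(X, X, sx), gauss_gram(Y, Y, sy)
    # 1. [Product] weights mu of the joint embedding under the prior
    mu = np.linalg.solve(G_X + n * eps * np.eye(n), gauss_gram(X, U, sx) @ gamma)
    LG = np.diag(mu) @ G_Y                       # Lambda G_Y
    # 2. [Conditioning] rho = Lambda G_Y ((Lambda G_Y)^2 + n delta I)^{-1} Lambda k_Y(y*)
    ky = gauss_gram(Y, np.atleast_2d(y_star), sy)[:, 0]
    return LG @ np.linalg.solve(LG @ LG + n * delta * np.eye(n), mu * ky)

# Toy check: X ~ N(0,1), Y = X + N(0, 0.1^2), prior = N(0,1) (equal to p_X);
# the true posterior mean given y* = 0.8 is 0.8 * 100/101 ≈ 0.79.
rng = np.random.default_rng(4)
X = rng.normal(0, 1, size=(400, 1))
Y = X + 0.1 * rng.standard_normal((400, 1))
U = rng.normal(0, 1, size=(200, 1))
gamma = np.full(200, 1.0 / 200)
rho = kbr_weights(X, Y, U, gamma, [0.8], sx=0.5, sy=0.5, eps=0.01, delta=0.01)
print(rho.sum(), rho @ X[:, 0])   # sum ≈ 1; posterior mean ≈ 0.79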
Inference with KBR

Weighted sample expression:
$\hat{m}_{X|y^*} = \sum_i \rho_i(y^*)\, k_X(\cdot, X_i)$,
equivalent to the kernel mean of
$\sum_i \rho_i(y^*)\, \delta_{X_i}$ ($\delta$: Dirac's delta),
which is a signed measure (not necessarily a probability): some weights may be negative.
– $\sum_i \rho_i(y^*) \to 1$ in probability as $n \to \infty$.

How to use?
– Expectation: if $f \in H_X$ satisfies suitable range conditions (e.g. $f \in$ Range$(C_{XX})$),
$\sum_i \rho_i(y^*)\, f(X_i) \to E[f(X) \mid Y = y^*] \quad (n \to \infty)$.
e.g.
• $f = I_A$: $\sum_{i : X_i \in A} \rho_i(y^*) \to$ posterior probability of the set $A$.
• $f = x^p$: $\sum_i \rho_i(y^*)\, X_i^p \to$ $p$-th moment of the posterior.
(More general discussion in Kanagawa and Fukumizu, AISTATS 2014.)
– Point estimation (quasi-MAP):
$\hat{x} = \mathrm{argmin}_x \big\| \hat{m}_{X|y^*} - \Phi(x) \big\|$ (consistent),
solved numerically, as sketched below.
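Continuing the sketch above (assumes rho, X, and gauss_gram from the previous snippet), posterior expectations are weighted sums, and the quasi-MAP argmin can be solved by brute force over a grid, since for a Gaussian kernel k(x, x) is constant and ||m - Phi(x)||^2 = const - 2 sum_i rho_i k(x, X_i):

post_mean = rho @ X[:, 0]                  # f(x) = x
post_2nd = rho @ X[:, 0] ** 2              # f(x) = x^2
print(post_mean, post_2nd - post_mean**2)  # posterior mean and (approximate) variance

grid = np.linspace(-3, 3, 601).reshape(-1, 1)
obj = -2 * (gauss_gram(grid, X, 0.5) @ rho)   # objective up to an additive constant
print(grid[np.argmin(obj), 0])                # quasi-MAP point estimate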
– A completely nonparametric way of computing Bayes' rule: no parametric models are needed, but data or samples are used to express the probabilistic relations.

Examples:
1. Nonparametric HMM:
$p(X, Y) = p(X_0) \prod_t p(Y_t \mid X_t) \prod_t q(X_{t+1} \mid X_t)$,
where $q(x' \mid x)$ and/or $p(y \mid x)$ are unknown, but data are available.
2. The explicit form of the likelihood $p(y \mid x)$ or the prior $\pi$ is unavailable, but sampling is possible.
c.f. Approximate Bayesian Computation (ABC); kernel ABC: Nakagome, Mano, Fukumizu (2013).

[HMM diagram: X0 → X1 → … → XT, each Xt → Yt.]
Convergence rate

Theorem (Fukumizu, Song, Gretton 2014)
Let $f \in H_X$ and $(X, Y) \sim P$ with p.d.f. $p(x, y)$.
Assumptions:
- $\varepsilon_n = n^{-\alpha}$ for some $0 < \alpha < 1/2$;
- $\pi / p_X \in$ Range$\big( C_{XX}^{\beta} \big)$ for some $\beta > 0$;
- $E[f(X) \mid Y = \cdot\,] \in$ Range$\big( C_{YY}^{\nu} \big)$ for some $\nu > 0$.
Then, with suitable $\varepsilon_n, \delta_n$, for any $y^*$,
$\sum_i \rho_i(y^*)\, f(X_i) \to E[f(X) \mid Y = y^*] \quad (n \to \infty)$.
Remark: a rate depending on the smoothness of the functions $\pi / p_X$ and $E[f(X) \mid Y = \cdot\,]$ is also available; in the smoothest case the rate is an explicit negative power of $n$.
Choice of kernel and hyperparameters

– Parameters to be chosen:
• kernel (parameters in the kernel);
• regularization parameters $\varepsilon_n, \delta_n$.
– Cross-validation is recommended, if possible.
• Straightforward in the supervised setting (incl. nonparametric HMM).
• Otherwise, make a relevant supervised problem and apply CV.
Supports:
- CV has been used successfully for SVM.
- The rate for the regression is attained with parameter choice by validation (Eberts & Steinwart 2011).
– Heuristics: the median of $\{\|X_i - X_j\|\}$ for the bandwidth $\sigma$ in a Gaussian kernel, as sketched below.
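A one-function sketch of the median heuristic (my code):

import numpy as np

def median_heuristic(X):
    # Median of the pairwise distances ||X_i - X_j||, i < j, used as the bandwidth.
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.sqrt(np.median(d2[np.triu_indices(len(X), k=1)]))

X = np.random.default_rng(5).normal(size=(300, 2))
print(median_heuristic(X))   # typical pairwise scale of the sample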
Basic experiments
– Comparison with KDE + importance weighting (IW).
• $(X_i, Y_i) \sim P$, $i = 1, \ldots, n$: jointly Gaussian sample in dimension $2d$.
• Prior $\Pi$: $U_j \sim N(0,\ 0.5 \cdot \Sigma_{XX})$, $j = 1, \ldots, \ell$.
• KDE + importance weighting:
KDE: $\hat{p}(x, y) = \frac{1}{n} \sum_i K_h(x - X_i)\, K_h(y - Y_i)$;
IW: the posterior is estimated by the weighted sample $(U_j, w_j)$ with
$w_j = \hat{p}(y^* \mid U_j) \big/ \sum_{\ell'} \hat{p}(y^* \mid U_{\ell'})$.
– $n$ (sample size for $(X, Y)$) $= \ell$ (sample size for the prior) $= 200$.
– $d = 2, \ldots, 64$.
– Gaussian kernels are used for both methods.
– Bandwidth parameters are selected by CV for both methods.
Example: KBR for nonparametric HMM

– Assume: $p(y \mid x)$ and/or $q(x_{t+1} \mid x_t)$ is not known,
but data $(X_t, Y_t)$ are available in a training phase.
Examples:
• measurement of hidden states is expensive;
• hidden states are measured with a time delay.
– Testing phase (e.g., filtering): given $\tilde{y}_1, \ldots, \tilde{y}_t$, estimate the hidden state $x_t$.
KBR point estimator: $\mathrm{argmin}_x \big\| \hat{m}_{X_t \mid \tilde{y}_1, \ldots, \tilde{y}_t} - \Phi(x) \big\|$.
– General sequential inference uses Bayes' rule ⇒ KBR is applied (see the sketch below).

[HMM diagram: X0 → X1 → … → XT, each Xt → Yt.]
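A simplified sketch of sequential filtering with KBR (my own code; the actual algorithm of the paper differs in details such as the handling of the initial state and the training triples). Prediction applies the kernel sum rule to the transition pairs (X_t, X_{t+1}); the update applies KBR to the observation pairs (X_t, Y_t):

import numpy as np

def gauss_gram(X, Z, s):
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * s**2))

def kbr_filter(Xtr, Ytr, y_seq, sx, sy, eps, delta):
    n = len(Xtr) - 1
    S, S_next = Xtr[:-1], Xtr[1:]        # transition sample (X_t, X_{t+1})
    Xo, Yo = Xtr[:-1], Ytr[:-1]          # observation sample (X_t, Y_t)
    G_S, G_X, G_Y = gauss_gram(S, S, sx), gauss_gram(Xo, Xo, sx), gauss_gram(Yo, Yo, sy)
    K_XS = gauss_gram(Xo, S_next, sx)
    I = np.eye(n)
    a = np.full(n, 1.0 / n)              # posterior weights on Xo (uniform start)
    estimates = []
    for y in y_seq:
        w = np.linalg.solve(G_S + n * eps * I, G_S @ a)       # predict (sum rule)
        mu = np.linalg.solve(G_X + n * eps * I, K_XS @ w)     # KBR product step
        LG = np.diag(mu) @ G_Y
        ky = gauss_gram(Yo, np.atleast_2d(y), sy)[:, 0]
        a = LG @ np.linalg.solve(LG @ LG + n * delta * I, mu * ky)  # KBR update
        estimates.append(a @ Xo)         # posterior mean as the point estimate
    return np.array(estimates)

# Toy HMM: X_{t+1} = 0.9 X_t + noise (hidden), Y_t = X_t + noise (observed).
rng = np.random.default_rng(6)
X = np.zeros((300, 1))
for t in range(299):
    X[t + 1] = 0.9 * X[t] + 0.3 * rng.standard_normal()
Y = X + 0.1 * rng.standard_normal((300, 1))
est = kbr_filter(X[:200], Y[:200], Y[200:210], sx=0.3, sy=0.3, eps=0.01, delta=0.01)
print(est[:, 0])   # filtered estimates, tracking the hidden states X[200:210]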
Numerical examples

(a) Noisy rotation:
$X_{t+1} = \big( \cos \theta_{t+1},\ \sin \theta_{t+1} \big)$, $\theta_{t+1} = \arctan\big( X_t^{(2)} / X_t^{(1)} \big) + 0.3$,
$Y_t = X_t + Z_t$, $Z_t \sim N(0,\ 0.04\, I)$ i.i.d.
Filtering with the point estimator by KBR.
KBR does NOT know the dynamics, while the EKF and UKF use it.

(b) Noisy oscillation:
$X_{t+1} = \big( 1 + 0.4 \sin(8 \theta_{t+1}) \big) \big( \cos \theta_{t+1},\ \sin \theta_{t+1} \big)$, $\theta_{t+1} = \arctan\big( X_t^{(2)} / X_t^{(1)} \big) + 0.4$,
$Y_t = X_t + Z_t$, $Z_t \sim N(0,\ 0.04\, I)$ i.i.d.
Camera angles
– Hidden $X_t$: angles of a video camera located at a corner of a room.
– Observed $Y_t$: movie frame of the room + additive Gaussian noise.
– Data: 3600 downsampled frames of 20 x 20 RGB pixels (1200-dim. $Y_t$).
– The first 1800 frames are used for training, and the second half for testing.

Average MSE for camera angles (10 runs):

noise        | KBR (Trace)  | Kalman filter (quaternion)
σ² = 10⁻⁴    | 0.15 ± 0.01  | 0.56 ± 0.02
σ² = 10⁻³    | 0.21 ± 0.01  | 0.54 ± 0.02

To represent the SO(3) model, Tr[AB⁻¹] is used for KBR, and the quaternion expression for the Kalman filter.
Robot localization

COLD (COsy Localization Dataset, IJRR 2009)
State $X_t$: 2-D coordinate and orientation of a robot.
Observation $Y_t$: image sequence (SIFT features, 4200 dim.).
Training sample: $(X_t, Y_t)$, $t = 1, \ldots, n$.
Task: estimate the location of the robot from image sequences.
– Observation model: difficult to model ⇒ KBR.
– State transition: linear Gaussian ⇒ Kernel Monte Carlo (Kanagawa, Nishiyama, KF 2013).

Compared methods:
NAI: naïve method (closest image in the training data).
NN: PF + K-nearest neighbor (Vlassis, Terwijn, Kröse 2002).
KBR: KBR + KBR.
KMC: KBR + Monte Carlo.

[Figure: localization results with #training samples = 200, showing the true location, the estimates, and red (+) / blue (−) circles for the weights on the training samples.]
Conclusions
A new nonparametric / kernel approach to Bayesian inference
• Kernel mean embedding: using positive definite kernels to represent and manipulate probabilities.
• "Nonparametric" Bayesian inference: no densities are needed, but data.
• Bayesian inference with matrix computation.
– Computation is done with Gram matrices.
– No integral, no approximate inference.
• More suitable for high-dimensional data than the smoothing kernel approach.
Kernel Herding
Kernel Herding (Chen, Welling, Smola 2010)
– Goal: find sample points $x_1, \ldots, x_T$ that efficiently approximate the kernel mean
$m_P := E[\Phi(X)]$
in the form
$m_P \approx \dfrac{1}{T} \sum_{t=1}^T \Phi(x_t)$.
– Assumption: $m_P$ is known.
– Algorithm: generate $x_1, x_2, \ldots$ by the update rule
$x_{T+1} = \mathrm{argmax}_x\, \langle h_T, \Phi(x) \rangle$,
$h_{T+1} = h_T + m_P - \Phi(x_{T+1})$, with $h_0 = m_P$.
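A minimal sketch of kernel herding over a finite candidate grid (my code; it uses the empirical kernel mean of a large sample in place of the known m_P, the approximation use mentioned below):

import numpy as np

def gauss_kern(x, Z, s):
    return np.exp(-((x - Z) ** 2).sum(-1) / (2 * s**2))

def kernel_herding(Z, candidates, T, s):
    # h is represented implicitly by its inner products <h, Phi(c)> over the candidates c.
    m = np.array([gauss_kern(c, Z, s).mean() for c in candidates])   # <m_P, Phi(c)>
    h = m.copy()
    chosen = []
    for _ in range(T):
        i = int(np.argmax(h))                              # x_{T+1} = argmax <h_T, Phi(x)>
        chosen.append(candidates[i])
        h += m - gauss_kern(candidates[i], candidates, s)  # h_{T+1} = h_T + m_P - Phi(x_{T+1})
    return np.array(chosen)

rng = np.random.default_rng(7)
Z = rng.normal(0, 1, size=(2000, 1))       # large sample approximating P = N(0, 1)
cand = np.linspace(-4, 4, 401).reshape(-1, 1)
pts = kernel_herding(Z, cand, T=20, s=0.5)
print(np.sort(pts[:, 0]))   # 20 "super-samples" spread to match the kernel mean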
Interpretation
Assume $\|\Phi(x)\|$ is constant (translation-invariant kernel), and set $\tilde{h}_T := h_T / (T+1)$, so $\tilde{h}_0 = m_P$. Then
$x_1 = \mathrm{argmax}_x\, \langle m_P,\ \Phi(x) \rangle$,
$x_2 = \mathrm{argmax}_x\, \big\langle m_P - \tfrac{1}{2} \Phi(x_1),\ \Phi(x) \big\rangle$,
⋮
$x_{T+1} = \mathrm{argmax}_x\, \big\langle m_P - \tfrac{1}{T+1} \textstyle\sum_{t=1}^T \Phi(x_t),\ \Phi(x) \big\rangle$.

Residual of the approximation
Each new point maximizes the projection of the residual $m_P - \frac{1}{T+1} \sum_{t=1}^T \Phi(x_t)$ onto $\Phi(x)$, i.e. herding greedily reduces the approximation error $\big\| m_P - \frac{1}{T} \sum_{t=1}^T \Phi(x_t) \big\|$.
[Figure: the first 20 points generated by kernel herding, compared with 20 i.i.d. samples drawn from the same distribution.]

– Even when $m_P = E[\Phi(X)]$ is not known, herding may be used when $m_P$ is approximated by a large sample. Approximating it with a small number of points reduces the size of the Gram matrices ⇒ computational efficiency for kernel methods.
– Convergence:
• The approximation error $\big\| m_P - \frac{1}{T} \sum_{t=1}^T \Phi(x_t) \big\|$ is conjectured to be $O(1/T)$.
• This is the optimal convergence rate.
c.f. for i.i.d. samples the error is $O(1/\sqrt{T})$.
References
Fukumizu, K., Song, L., Gretton, A. (2014) Kernel Bayes' Rule: Bayesian Inference with Positive Definite Kernels. Journal of Machine Learning Research 14:3753−3783.
Song, L., Gretton, A., Fukumizu, K. (2013) Kernel Embeddings of Conditional Distributions. IEEE Signal Processing Magazine 30(4):98−111.
Kanagawa, M., Nishiyama, Y., Gretton, A., Fukumizu, K. (2013) Kernel Monte Carlo Filter. arXiv:1312.4664.
Chen, Y., Welling, M., Smola, A. (2010) Super-Samples from Kernel Herding. Proc. Uncertainty in Artificial Intelligence (UAI 2010).