Introduction to Kernel Methods – The Institute of Statistical Mathematics
TRANSCRIPT
Kenji Fukumizu
The Institute of Statistical Mathematics / The Graduate University for Advanced Studies (SOKENDAI)
Intensive course, Graduate School of Engineering Science, Osaka University
September 2014
Introduction to Kernel Methods
7. Nonparametric Bayesian Inference with Kernel Methods
Bayesian inference – Bayes' rule

$q(x \mid y) = \dfrac{p(y \mid x)\,\pi(x)}{\int p(y \mid x)\,\pi(x)\,dx}$

• PROS
– Principled and flexible method for statistical inference.
– Can incorporate prior knowledge.
• CONS
– Computation: an integral is needed.
» Numerical integration: MCMC, SMC, etc.
» Approximation: variational Bayes, belief propagation, expectation propagation, etc.

[Portrait: Thomas Bayes (1701-1761)?]
Motivating Example
Robot localization
– State $X_t$: 2-D coordinate and orientation of a robot.
– Observation $Y_t$: image sequence taken by a camera on the robot.
– Task: estimate the location of the robot from image sequences.
– It is well formulated by an HMM.

[HMM diagram: X0 → X1 → X2 → X3 → … → XT (transition of state: location & orientation), with each Xt emitting an observation Yt (image of the environment).]

– Estimation / prediction with an HMM: sequential application of Bayes' rule solves the task.
– A nonparametric approach is needed: the observation model $p(y \mid x)$ (location & orientation → image of the environment) is very difficult to model with a simple parametric model.
⇒ "Nonparametric" implementation of Bayesian inference.
c.f. the Monte Carlo approach needs to know the density functions.
Kernel method for Bayesian inference (topic of this chapter)
– A new nonparametric / kernel approach to Bayesian inference.
• Kernel mean approach to represent and manipulate probabilities.
• "Nonparametric" Bayesian inference: no densities are needed, but data are needed.
• Bayesian inference with matrix computation.
– Computation is done with Gram matrices.
– No integral, no approximate inference.
Outline
1. Introduction
2. Representing conditional probabilities
3. Kernel sum rule and chain rule
4. Kernel Bayes rule: kernel methods for Bayesian inference
5. Conclusions
Representing conditional probabilities
Review: kernel mean

$X$: random variable taking values on a measurable space $\Omega$, with distribution $P$. $k$: pos. def. kernel on $\Omega$. $H$: RKHS defined by $k$.
– Kernel mean:
$m_P := E[\Phi(X)] = E[k(\cdot, X)] = \int k(\cdot, \tilde{x})\, dP(\tilde{x}) \in H$
– With a characteristic kernel, the kernel mean uniquely identifies the distribution.
• Embedding of probabilities into the RKHS.
• Examples: Gaussian kernel, Laplace kernel.
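As a concrete illustration (my own sketch, not from the lecture; all function names are illustrative), here is the empirical kernel mean with a Gaussian kernel, and the RKHS distance between two empirical kernel means, which is the (biased) MMD mentioned on the next slide:

import numpy as np

def gauss_kernel(X, Z, sigma):
    # Gaussian kernel matrix k(x, z) = exp(-||x - z||^2 / (2 sigma^2))
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma**2))

def empirical_kernel_mean(X, sigma):
    # Returns the function x -> (1/n) sum_i k(x, X_i), i.e. the empirical kernel mean.
    return lambda x: gauss_kernel(np.atleast_2d(x), X, sigma).mean(axis=1)

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(200, 1))
Y = rng.normal(0.5, 1.0, size=(200, 1))
print(empirical_kernel_mean(X, 1.0)([0.0]))      # evaluation of the kernel mean at x = 0
mmd2 = (gauss_kernel(X, X, 1.0).mean() - 2 * gauss_kernel(X, Y, 1.0).mean()
        + gauss_kernel(Y, Y, 1.0).mean())
print(mmd2)                                      # ||m_P - m_Q||^2 > 0: the samples differ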
Nonparametric inference with kernels
Principle: with characteristic kernels,
inference on probabilities $P$ ⇒ inference on kernel means $m_P$.
• Two-sample test: $P = Q$? → MMD.
• Independence test: $P_{XY} = P_X \otimes P_Y$? → HSIC.
• Bayesian inference → this chapter.
Operations for inference
Operations on probabilities necessary for inference:
– Conditioning: $p(x, y) \to p(y \mid x)$
– Sum rule: $p(y \mid x),\ \pi(x) \to q(y) = \sum_x p(y \mid x)\,\pi(x)$
– Product rule: $p(y \mid x),\ \pi(x) \to q(x, y) = p(y \mid x)\,\pi(x)$ (a.k.a. chain rule)
– Bayes' rule: $p(y \mid x),\ \pi(x) \to q(x \mid y) = \dfrac{p(y \mid x)\,\pi(x)}{\sum_x p(y \mid x)\,\pi(x)}$
Conditional kernel mean

Review: conditional mean of Gaussian r.v.
$X, Y$: Gaussian random vectors with mean 0 ($\in \mathbb{R}^m, \mathbb{R}^\ell$, resp.)
– Least square prediction:
$A^* = \mathrm{argmin}_{A \in \mathbb{R}^{\ell \times m}} E\|Y - AX\|^2 = V_{YX} V_{XX}^{-1}$
– Conditional expectation:
$E[Y \mid X = x] = V_{YX} V_{XX}^{-1} x$
($V_{XX}$, $V_{YX}$: covariance matrices)

General random variables
With characteristic kernels, for general $X$ and $Y$:
– Least square prediction:
$\mathrm{argmin}_{F : H_X \to H_Y} E\|\Phi(Y) - F\,\Phi(X)\|^2 = C_{YX} C_{XX}^{-1}$
– Conditional expectation:
$E[\Phi(Y) \mid X = x] = C_{YX} C_{XX}^{-1} \Phi(x)$, and $E[f(Y) \mid X = x] = \langle f,\ C_{YX} C_{XX}^{-1} \Phi(x) \rangle$ for $f \in H_Y$.
Kernel mean of $P(Y \mid X = x)$
Expression by kernel mean and covariance operators. Estimable!
$m_{Y|x} = E[\Phi(Y) \mid X = x] = C_{YX} C_{XX}^{-1} \Phi(x)$

Empirical estimation:
$\hat{m}_{Y|x} := \hat{C}_{YX} \big( \hat{C}_{XX} + \varepsilon_n I \big)^{-1} \Phi(x)$
($\varepsilon_n$: regularization coefficient)
– Remark: even with the population operators, regularized inversion is needed if the dimensionality is infinite. Since $\dim H = \infty$, there are always arbitrarily small eigenvalues.

Theorem (consistency)
Assume that $E[k_Y(\cdot, Y) \mid X = x]$, as a function of $(x, y)$, belongs to Range$(C_{XX} \otimes C_{YY})$. Then, for a suitable $\varepsilon_n \to 0$,
$\hat{C}_{YX} \big( \hat{C}_{XX} + \varepsilon_n I \big)^{-1} \Phi(x) \to E[\Phi(Y) \mid X = x] \quad (n \to \infty)$.

Proposition (Gram matrix expression)
Given an i.i.d. sample $(X_1, Y_1), \ldots, (X_n, Y_n) \sim P$,
$\hat{m}_{Y|x} = \sum_{i=1}^n w_i\, k_Y(\cdot, Y_i), \qquad w = (G_X + n\varepsilon_n I_n)^{-1} \mathbf{k}_X(x)$,
where $(G_X)_{ij} = k_X(X_i, X_j)$ and $\mathbf{k}_X(x) = \big( k_X(x, X_1), \ldots, k_X(x, X_n) \big)^\top$.
Proof (Gram matrix expression)
Let $f = (\hat{C}_{XX} + \varepsilon_n I)^{-1} \Phi(x)$, and decompose it as $f = \sum_j a_j \Phi(X_j) + f_\perp$, where $f_\perp$ is orthogonal to Span$\{\Phi(X_j)\}$. Then
$\Phi(x) = (\hat{C}_{XX} + \varepsilon_n I) f = \frac{1}{n} \sum_i \Phi(X_i)\, (G_X a)_i + \varepsilon_n \sum_j a_j \Phi(X_j) + \varepsilon_n f_\perp$.
Taking the inner product with $\Phi(X_\ell)$,
$k_X(X_\ell, x) = \frac{1}{n} (G_X^2 a)_\ell + \varepsilon_n (G_X a)_\ell, \qquad \cdots (*)$
i.e. $(G_X + n \varepsilon_n I)\, \frac{1}{n} G_X a = \mathbf{k}_X(x)$ by $(*)$.
The target is
$\hat{m}_{Y|x} = \hat{C}_{YX} f = \frac{1}{n} \sum_i \Phi(Y_i)\, \langle \Phi(X_i), f \rangle = \sum_i w_i\, k_Y(\cdot, Y_i)$,
with $w = \frac{1}{n} G_X a = (G_X + n\varepsilon_n I)^{-1} \mathbf{k}_X(x)$. ∎
– The proof of consistency is omitted.
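A minimal numerical sketch of this Gram matrix expression (my own code, not from the lecture): the weighted sum of f(Y_i) with the weights w approximates E[f(Y) | X = x], here for a toy model Y = X^2 + noise.

import numpy as np

def gauss_gram(X, Z, sigma):
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma**2))

def cond_mean_weights(X, x, sigma, eps):
    # w = (G_X + n * eps * I)^{-1} k_X(x): weights of the conditional kernel mean
    n = X.shape[0]
    G = gauss_gram(X, X, sigma)
    kx = gauss_gram(X, np.atleast_2d(x), sigma)[:, 0]
    return np.linalg.solve(G + n * eps * np.eye(n), kx)

rng = np.random.default_rng(1)
X = rng.uniform(-2, 2, size=(500, 1))
Y = X[:, 0] ** 2 + 0.1 * rng.standard_normal(500)
w = cond_mean_weights(X, [1.0], sigma=0.3, eps=1e-3)
print(w @ Y)   # approximates E[Y | X = 1] = 1 (up to estimation error)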
How to use the conditional kernel mean
– Nonparametric regression: from $(X_1, Y_1), \ldots, (X_n, Y_n) \in \mathcal{X} \times \mathbb{R}$, estimate $E[Y \mid X = x]$
⇒ kernel ridge regression.
– Application to conditional independence (Fukumizu et al. JMLR 2004, AoS 2009, NIPS 2010).
– Bayesian inference.
Note: for consistency, the kernel (bandwidth) is fixed, and only the regularization coefficient → 0.
c.f. smoothing kernels.
Addendum: kernel ridge regression
– (Linear) ridge regression:
$\min_{\beta \in \mathbb{R}^d} \sum_{i=1}^n \big( Y_i - \beta^\top X_i \big)^2 + \lambda \|\beta\|^2$
Popularly used when $X^\top X$ is (almost) degenerate.
– Kernel ridge regression:
$\min_{f \in H} \sum_{i=1}^n \big( Y_i - f(X_i) \big)^2 + \lambda \|f\|_H^2$
⇒ nonparametric regression with kernels; the solution is $\hat{f}(x) = \mathbf{k}_X(x)^\top (G_X + \lambda I_n)^{-1} \mathbf{Y}$.
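A short, self-contained sketch of kernel ridge regression (my code; the toy data and parameter values are illustrative):

import numpy as np

def gauss_gram(X, Z, sigma):
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma**2))

def krr_fit(X, Y, sigma, lam):
    # Solve (G_X + lam I) alpha = Y; the predictor is f(x) = k_X(x)^T alpha.
    alpha = np.linalg.solve(gauss_gram(X, X, sigma) + lam * np.eye(len(X)), Y)
    return lambda Xnew: gauss_gram(Xnew, X, sigma) @ alpha

rng = np.random.default_rng(2)
X = rng.uniform(-3, 3, size=(200, 1))
Y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(200)
f = krr_fit(X, Y, sigma=0.5, lam=0.1)
print(f(np.array([[1.0]])))   # close to sin(1) ≈ 0.84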
Nonparametric regression: theoretical comparison
Assume $Y$ is 1-dimensional, and a kernel is used only for $X$:
$\hat{f}(x) := \mathbf{k}_X(x)^\top (G_X + n\varepsilon_n I_n)^{-1} \mathbf{Y}$
(same as Gaussian process regression / kernel ridge regression).
– Consistency 1 (Eberts & Steinwart 2011)
If $k$ is Gaussian and $f^* \in W^s$ (*$W^s$: Sobolev space of order $s$), then (under some technical assumptions) for any $\xi > 0$,
$\|\hat{f} - f^*\|_{L_2}^2 = O_p\big( n^{-2s/(2s+d)+\xi} \big)$.
Note: $n^{-2s/(2s+d)}$ is the optimal rate for a linear estimator (Stone 1982).
– Consistency 2 (case $f^* \in H$)
Suppose $f^* \in H$ and $\varepsilon_n = n^{-\beta}$ with suitable $\beta > 0$. Then, with a characteristic kernel $k$,
$\|\hat{f} - f^*\|_H \to 0 \quad (n \to \infty)$.
• The rates do not depend on $d$ (the dimension of $X$), since the analysis can be done within the RKHS.
• $\|\cdot\|_H$ is stronger than $\|\cdot\|_\infty$: $|f(x)| = |\langle f, \Phi(x) \rangle| \le \sqrt{k(x, x)}\, \|f\|_H$. Thus,
$\sup_x |\hat{f}(x) - f^*(x)| \le \sup_x \sqrt{k(x, x)}\ \|\hat{f} - f^*\|_H \to 0$.
Nonparametric regression: experimental comparison

Data: $Y = 1/(1.5 + \|X\|) + Z$, $X \sim N(0, I_d)$, $Z \sim N(0, 0.1)$.
– Kernel ridge regression: Gaussian kernel.
– Local linear regression: Epanechnikov kernel, $K(t) = \frac{3}{4}(1 - t^2)$ for $|t| \le 1$ ('locfit' in R is used).
$n = 100$, 500 runs. Bandwidth parameters are chosen by CV.

[Figure: mean square errors vs. dimension of X (2 to 10), comparing the kernel method and local linear regression.]
Kernel Sum and Product Rules
Kernel Sum Rule
Sum rule: $q(y) = \int p(y \mid x)\, \pi(x)\, dx$.

Kernel sum rule:
– Input: the kernel mean $m_\Pi$ of $\Pi$ and the covariance operators $C_{XX}, C_{YX}$.
– Output: the kernel mean of $Q$: $m_Q = C_{YX} C_{XX}^{-1} m_\Pi$.
The conditional probability $p(y \mid x)$ is given in the form of the covariance operators $C_{XX}, C_{YX}$,
or via a sample $(X_1, Y_1), \ldots, (X_n, Y_n) \sim P$, where $P$ is a joint distribution whose conditional probability is $p(y \mid x)$.
Intuition
– (Assume the conditions for $E[\Phi(Y) \mid X = x] = C_{YX} C_{XX}^{-1} \Phi(x)$ to hold.)
$m_Q = \int\!\!\int k_Y(\cdot, y)\, p(y \mid x)\, \pi(x)\, dx\, dy = \int E[k_Y(\cdot, Y) \mid X = x]\, d\Pi(x)$
$= C_{YX} C_{XX}^{-1} \int k_X(\cdot, x)\, d\Pi(x) = C_{YX} C_{XX}^{-1} m_\Pi$,
which follows from $E[\Phi(Y) \mid X = x] = C_{YX} C_{XX}^{-1} \Phi(x)$.
Estimator:
– Gram matrix expression:
Input: $\hat{m}_\Pi = \sum_{j=1}^\ell \gamma_j\, k_X(\cdot, U_j)$ and $(X_1, Y_1), \ldots, (X_n, Y_n) \sim P$.
Output: $\hat{m}_Q = \sum_{i=1}^n w_i\, k_Y(\cdot, Y_i)$,
$w = (G_X + n\varepsilon_n I_n)^{-1} G_{XU}\, \gamma$, where $(G_{XU})_{ij} = k_X(X_i, U_j)$.

Consistency
Theorem: if $\|\hat{m}_\Pi - m_\Pi\|_H = O_p(n^{-\alpha})$ for some $\alpha > 0$, then, for a suitable $\varepsilon_n \to 0$,
$\|\hat{m}_Q - m_Q\|_H \to 0 \quad (n \to \infty)$.
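A minimal sketch of the kernel sum rule in this matrix form (my code; the toy distributions are illustrative): given prior weights gamma on points U and a joint sample (X, Y), it returns the weights w on Y_1, ..., Y_n that represent the kernel mean of q(y).

import numpy as np

def gauss_gram(X, Z, sigma):
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma**2))

def kernel_sum_rule(X, U, gamma, sigma, eps):
    # w = (G_X + n * eps * I)^{-1} G_XU gamma
    n = X.shape[0]
    G_X = gauss_gram(X, X, sigma)
    G_XU = gauss_gram(X, U, sigma)
    return np.linalg.solve(G_X + n * eps * np.eye(n), G_XU @ gamma)

# Toy check: p(y|x) = N(x, 0.1^2) and prior Pi = N(1, 0.2^2), so E_Q[Y] = 1.
rng = np.random.default_rng(3)
X = rng.normal(0, 1, size=(500, 1))           # joint sample expressing p(x, y)
Y = X + 0.1 * rng.standard_normal((500, 1))
U = rng.normal(1.0, 0.2, size=(200, 1))       # sample expressing the prior
gamma = np.full(200, 1.0 / 200)
w = kernel_sum_rule(X, U, gamma, sigma=0.5, eps=1e-2)
print(w.sum(), w @ Y[:, 0])                   # sum of weights ≈ 1; weighted mean ≈ 1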
Kernel Chain Rule
– Chain rule: $q(x, y) = p(y \mid x)\, \pi(x)$.
– Kernelization: $C_{(YX)X} C_{XX}^{-1} : H_X \to H_Y \otimes H_X$,
$m_Q := C_{(YX)X} C_{XX}^{-1} m_\Pi$.
– Gram matrix expression:
Input: $\hat{m}_\Pi = \sum_{j=1}^\ell \gamma_j\, \Phi(U_j)$ and $(X_1, Y_1), \ldots, (X_n, Y_n) \sim P$.
Output: $\hat{m}_Q = \sum_{i=1}^n w_i\, \Phi(Y_i) \otimes \Phi(X_i)$, with $w = (G_X + n\varepsilon_n I_n)^{-1} G_{XU}\, \gamma$.
Note: the coefficients $w$ are exactly the same as in the kernel sum rule.
– Intuition: $C_{(YX)X} = E\big[ (\Phi(Y) \otimes \Phi(X)) \otimes \Phi(X) \big]$; regard the pair $(Y, X)$ as a single random variable with feature map $\Phi(Y) \otimes \Phi(X)$. Applying the kernel sum rule to this variable,
$m_Q = E_{(X, Y) \sim Q}\big[ \Phi(Y) \otimes \Phi(X) \big] = C_{(YX)X} C_{XX}^{-1} m_\Pi$.
Interpretation as a weighted sample
– Kernel sum / chain rule output
$\hat{m}_Q = \sum_i w_i\, k(\cdot, Y_i)$
is identical to the empirical kernel mean of the discrete measure $\sum_i w_i\, \delta_{Y_i}$.
The weights can take negative values, so this is not necessarily a probability measure: it is the kernel mean of a signed measure.
Nevertheless, an interpretation as a weighted sample is possible:
• the sum approaches 1: $\sum_i w_i \to 1$ ($n \to \infty$);
• expectations can be computed: $\sum_i w_i\, f(Y_i) \to E_Q[f(Y)]$ ($n \to \infty$).
Theorem
Let $\hat{m}_Q = \sum_i w_i\, k(\cdot, Y_i)$ be the estimator of the kernel sum rule.
If $\|\hat{m}_\Pi - m_\Pi\| = O_p(n^{-\alpha})$ and $\varepsilon_n = n^{-\beta}$ ($0 < \beta < 1$), then for any $Q$-square-integrable function $f$,
$\sum_i w_i\, f(Y_i) \to E_Q[f(Y)] \quad (n \to \infty)$,
with a convergence rate determined by $\alpha$ and $\beta$.
e.g.
• $f = I_A$: $\sum_{i : Y_i \in A} w_i \to$ probability $Q(A)$. In particular, $\sum_i w_i \to 1$ as $n \to \infty$.
• $f = y^p$: $\sum_i w_i\, Y_i^p \to$ $p$-th moment of $Q$.
Kernel method for Bayesian inference
Kernel realization of Bayes' rule

Bayes' rule:
$q(x \mid y) = \dfrac{p(y \mid x)\, \pi(x)}{q_Y(y)}$
$\Pi$: prior with p.d.f. $\pi$; $p(y \mid x)$: conditional probability (likelihood).

Kernel realization:
Goal: estimate the kernel mean of the posterior,
$m_{X|y^*} := E\big[ k_X(\cdot, X) \mid Y = y^* \big]$,
given
– $m_\Pi$: kernel mean of the prior $\Pi$;
– $C_{XX}, C_{YX}$: covariance operators for $(X, Y) \sim P$,
where $P$ is the joint probability that gives $p(y \mid x)$ by conditioning.
Kernel Bayes' Rule: overview
Input
– Prior: a consistent estimator of the kernel mean,
$\hat{m}_\Pi = \sum_{j=1}^\ell \gamma_j\, \Phi(U_j)$;
$(U_1, \gamma_1), \ldots, (U_\ell, \gamma_\ell)$: weighted sample expression.
– Conditional probability: covariance of a JOINT distribution;
$(X_1, Y_1), \ldots, (X_n, Y_n)$: (joint) sample $\sim P$, where conditioning $P$ gives $p(y \mid x)$.
– Observation: $y^*$.
Output
– Posterior:
$\hat{m}_{X|y^*} = \sum_{i=1}^n \rho_i(y^*)\, \Phi(X_i)$:
weighted sample expression.

[Figure: the prior (weighted sample on X), the joint sample (X_i, Y_i), the observation y*, and the resulting posterior m_{X|y*} (weighted sample on X).]
Derivation of KBR: Bayes' rule revisited
Bayes' rule consists of two steps:
1. Product: $q(x, y) = p(y \mid x)\, \pi(x)$.
2. Conditioning: $q(x \mid y^*) = q(x, y^*) / q_Y(y^*)$.
We already know how to compute them!
Denote $(Z, W) \sim Q$. Note: the covariance operators of $Q$ can be identified with kernel means on product spaces, e.g. $C_{WW}^Q \cong E[\Phi(W) \otimes \Phi(W)]$.
– Product rule (kernel chain rule) ⇒ $C_{ZW}^Q \cong C_{(XY)X} C_{XX}^{-1} m_\Pi$.
– Sum rule ⇒ $C_{WW}^Q \cong C_{(YY)X} C_{XX}^{-1} m_\Pi$.
– Conditioning: $m_{X|y^*} = C_{ZW}^Q \big( C_{WW}^Q \big)^{-1} \Phi(y^*)$.
Kernel Bayes' Rule (Fukumizu, Song, Gretton JMLR 2014)

Input: $(X_1, Y_1), \ldots, (X_n, Y_n) \sim P$ (to give the conditional probability);
$\hat{m}_\Pi = \sum_{j=1}^\ell \gamma_j\, \Phi(U_j)$ (prior), a consistent estimator of $m_\Pi$.
1. [Product] Compute
$\hat{\mu} = (G_X + n\varepsilon_n I_n)^{-1} G_{XU}\, \gamma$, and set $\Lambda = \mathrm{Diag}(\hat{\mu})$.
2. [Conditioning] Compute
$R_{X|Y} = \Lambda\, G_Y \big( (\Lambda G_Y)^2 + n\delta_n I_n \big)^{-1} \Lambda$.
Output: estimator of the kernel mean of the posterior given observation $y^*$:
$\hat{m}_{X|y^*} = \sum_{i=1}^n \rho_i(y^*)\, k_X(\cdot, X_i), \qquad \rho(y^*) = R_{X|Y}\, \mathbf{k}_Y(y^*)$.
(* $\varepsilon_n$, $\delta_n$: regularization coefficients.)
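A direct numerical sketch of these two steps (my own code following the formulas above; the toy model and parameter values are illustrative):

import numpy as np

def gauss_gram(X, Z, sigma):
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma**2))

def kbr_weights(X, Y, U, gamma, y_star, sx, sy, eps, delta):
    # Posterior weights rho(y*) on X_1..X_n by the Kernel Bayes' Rule.
    n = X.shape[0]
    G_X, G_Y = gauss_gram(X, X, sx), gauss_gram(Y, Y, sy)
    # 1. [Product] weights mu of the joint embedding under the prior
    mu = np.linalg.solve(G_X + n * eps * np.eye(n), gauss_gram(X, U, sx) @ gamma)
    LG = np.diag(mu) @ G_Y                       # Lambda G_Y
    # 2. [Conditioning] rho = Lambda G_Y ((Lambda G_Y)^2 + n delta I)^{-1} Lambda k_Y(y*)
    ky = gauss_gram(Y, np.atleast_2d(y_star), sy)[:, 0]
    return LG @ np.linalg.solve(LG @ LG + n * delta * np.eye(n), mu * ky)

# Toy check: X ~ N(0,1), Y = X + N(0, 0.1^2), prior = N(0,1) (equal to p_X);
# the true posterior mean given y* = 0.8 is 0.8 * 100/101 ≈ 0.79.
rng = np.random.default_rng(4)
X = rng.normal(0, 1, size=(400, 1))
Y = X + 0.1 * rng.standard_normal((400, 1))
U = rng.normal(0, 1, size=(200, 1))
gamma = np.full(200, 1.0 / 200)
rho = kbr_weights(X, Y, U, gamma, [0.8], sx=0.5, sy=0.5, eps=0.01, delta=0.01)
print(rho.sum(), rho @ X[:, 0])   # sum ≈ 1; posterior mean ≈ 0.79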
Inference with KBR

Weighted sample expression:
$\hat{m}_{X|y^*} = \sum_i \rho_i(y^*)\, k_X(\cdot, X_i)$,
equivalent to the kernel mean of
$\sum_i \rho_i(y^*)\, \delta_{X_i}$ ($\delta$: Dirac's delta),
which is a signed measure (not necessarily a probability): some weights may be negative.
– $\sum_i \rho_i(y^*) \to 1$ in probability as $n \to \infty$.

How to use?
– Expectation: if $f \in H_X$ satisfies suitable range conditions (e.g. $f \in$ Range$(C_{XX})$),
$\sum_i \rho_i(y^*)\, f(X_i) \to E[f(X) \mid Y = y^*] \quad (n \to \infty)$.
e.g.
• $f = I_A$: $\sum_{i : X_i \in A} \rho_i(y^*) \to$ posterior probability of the set $A$.
• $f = x^p$: $\sum_i \rho_i(y^*)\, X_i^p \to$ $p$-th moment of the posterior.
(More general discussion in Kanagawa and Fukumizu, AISTATS 2014.)
– Point estimation (quasi-MAP):
$\hat{x} = \mathrm{argmin}_x \big\| \hat{m}_{X|y^*} - \Phi(x) \big\|$ (consistent),
solved numerically, as sketched below.
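Continuing the sketch above (assumes rho, X, and gauss_gram from the previous snippet), posterior expectations are weighted sums, and the quasi-MAP argmin can be solved by brute force over a grid, since for a Gaussian kernel k(x, x) is constant and ||m - Phi(x)||^2 = const - 2 sum_i rho_i k(x, X_i):

post_mean = rho @ X[:, 0]                  # f(x) = x
post_2nd = rho @ X[:, 0] ** 2              # f(x) = x^2
print(post_mean, post_2nd - post_mean**2)  # posterior mean and (approximate) variance

grid = np.linspace(-3, 3, 601).reshape(-1, 1)
obj = -2 * (gauss_gram(grid, X, 0.5) @ rho)   # objective up to an additive constant
print(grid[np.argmin(obj), 0])                # quasi-MAP point estimate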
– A completely nonparametric way of computing Bayes' rule: no parametric models are needed, but data or samples are used to express the probabilistic relations.

Examples:
1. Nonparametric HMM:
$p(X, Y) = p(X_0) \prod_t p(Y_t \mid X_t) \prod_t q(X_{t+1} \mid X_t)$,
where $q(x' \mid x)$ and/or $p(y \mid x)$ are unknown, but data are available.
2. The explicit form of the likelihood $p(y \mid x)$ or the prior $\pi$ is unavailable, but sampling is possible.
c.f. Approximate Bayesian Computation (ABC); kernel ABC: Nakagome, Mano, Fukumizu (2013).

[HMM diagram: X0 → X1 → … → XT, each Xt → Yt.]
Convergence rate

Theorem (Fukumizu, Song, Gretton 2014)
Let $f \in H_X$ and $(X, Y) \sim P$ with p.d.f. $p(x, y)$.
Assumptions:
- $\varepsilon_n = n^{-\alpha}$ for some $0 < \alpha < 1/2$;
- $\pi / p_X \in$ Range$\big( C_{XX}^{\beta} \big)$ for some $\beta > 0$;
- $E[f(X) \mid Y = \cdot\,] \in$ Range$\big( C_{YY}^{\nu} \big)$ for some $\nu > 0$.
Then, with suitable $\varepsilon_n, \delta_n$, for any $y^*$,
$\sum_i \rho_i(y^*)\, f(X_i) \to E[f(X) \mid Y = y^*] \quad (n \to \infty)$.
Remark: a rate depending on the smoothness of the functions $\pi / p_X$ and $E[f(X) \mid Y = \cdot\,]$ is also available; in the smoothest case the rate is an explicit negative power of $n$.
Choice of kernel and hyperparameters

– Parameters to be chosen:
• kernel (parameters in the kernel);
• regularization parameters $\varepsilon_n, \delta_n$.
– Cross-validation is recommended, if possible.
• Straightforward in the supervised setting (incl. nonparametric HMM).
• Otherwise, make a relevant supervised problem and apply CV.
Supports:
- CV has been used successfully for SVM.
- The rate for the regression is attained with parameter choice by validation (Eberts & Steinwart 2011).
– Heuristics: the median of $\{\|X_i - X_j\|\}$ for the bandwidth $\sigma$ in a Gaussian kernel, as sketched below.
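A one-function sketch of the median heuristic (my code):

import numpy as np

def median_heuristic(X):
    # Median of the pairwise distances ||X_i - X_j||, i < j, used as the bandwidth.
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.sqrt(np.median(d2[np.triu_indices(len(X), k=1)]))

X = np.random.default_rng(5).normal(size=(300, 2))
print(median_heuristic(X))   # typical pairwise scale of the sample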
Basic experiments
– Comparison with KDE + importance weighting (IW).
• $(X_i, Y_i) \sim P$, $i = 1, \ldots, n$: jointly Gaussian sample in dimension $2d$.
• Prior $\Pi$: $U_j \sim N(0,\ 0.5 \cdot \Sigma_{XX})$, $j = 1, \ldots, \ell$.
• KDE + importance weighting:
KDE: $\hat{p}(x, y) = \frac{1}{n} \sum_i K_h(x - X_i)\, K_h(y - Y_i)$;
IW: the posterior is estimated by the weighted sample $(U_j, w_j)$ with
$w_j = \hat{p}(y^* \mid U_j) \big/ \sum_{\ell'} \hat{p}(y^* \mid U_{\ell'})$.
– $n$ (sample size for $(X, Y)$) $= \ell$ (sample size for the prior) $= 200$.
– $d = 2, \ldots, 64$.
– Gaussian kernels are used for both methods.
– Bandwidth parameters are selected by CV for both methods.
Example: KBR for nonparametric HMM

– Assume: $p(y \mid x)$ and/or $q(x_{t+1} \mid x_t)$ is not known,
but data $(X_t, Y_t)$ are available in a training phase.
Examples:
• measurement of hidden states is expensive;
• hidden states are measured with a time delay.
– Testing phase (e.g., filtering): given $\tilde{y}_1, \ldots, \tilde{y}_t$, estimate the hidden state $x_t$.
KBR point estimator: $\mathrm{argmin}_x \big\| \hat{m}_{X_t \mid \tilde{y}_1, \ldots, \tilde{y}_t} - \Phi(x) \big\|$.
– General sequential inference uses Bayes' rule ⇒ KBR is applied (see the sketch below).

[HMM diagram: X0 → X1 → … → XT, each Xt → Yt.]
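A simplified sketch of sequential filtering with KBR (my own code; the actual algorithm of the paper differs in details such as the handling of the initial state and the training triples). Prediction applies the kernel sum rule to the transition pairs (X_t, X_{t+1}); the update applies KBR to the observation pairs (X_t, Y_t):

import numpy as np

def gauss_gram(X, Z, s):
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * s**2))

def kbr_filter(Xtr, Ytr, y_seq, sx, sy, eps, delta):
    n = len(Xtr) - 1
    S, S_next = Xtr[:-1], Xtr[1:]        # transition sample (X_t, X_{t+1})
    Xo, Yo = Xtr[:-1], Ytr[:-1]          # observation sample (X_t, Y_t)
    G_S, G_X, G_Y = gauss_gram(S, S, sx), gauss_gram(Xo, Xo, sx), gauss_gram(Yo, Yo, sy)
    K_XS = gauss_gram(Xo, S_next, sx)
    I = np.eye(n)
    a = np.full(n, 1.0 / n)              # posterior weights on Xo (uniform start)
    estimates = []
    for y in y_seq:
        w = np.linalg.solve(G_S + n * eps * I, G_S @ a)       # predict (sum rule)
        mu = np.linalg.solve(G_X + n * eps * I, K_XS @ w)     # KBR product step
        LG = np.diag(mu) @ G_Y
        ky = gauss_gram(Yo, np.atleast_2d(y), sy)[:, 0]
        a = LG @ np.linalg.solve(LG @ LG + n * delta * I, mu * ky)  # KBR update
        estimates.append(a @ Xo)         # posterior mean as the point estimate
    return np.array(estimates)

# Toy HMM: X_{t+1} = 0.9 X_t + noise (hidden), Y_t = X_t + noise (observed).
rng = np.random.default_rng(6)
X = np.zeros((300, 1))
for t in range(299):
    X[t + 1] = 0.9 * X[t] + 0.3 * rng.standard_normal()
Y = X + 0.1 * rng.standard_normal((300, 1))
est = kbr_filter(X[:200], Y[:200], Y[200:210], sx=0.3, sy=0.3, eps=0.01, delta=0.01)
print(est[:, 0])   # filtered estimates, tracking the hidden states X[200:210]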
Numerical examples

(a) Noisy rotation:
$X_{t+1} = \big( \cos \theta_{t+1},\ \sin \theta_{t+1} \big)$, $\theta_{t+1} = \arctan\big( X_t^{(2)} / X_t^{(1)} \big) + 0.3$,
$Y_t = X_t + Z_t$, $Z_t \sim N(0,\ 0.04\, I)$ i.i.d.
Filtering with the point estimator by KBR.
KBR does NOT know the dynamics, while the EKF and UKF use it.

(b) Noisy oscillation:
$X_{t+1} = \big( 1 + 0.4 \sin(8 \theta_{t+1}) \big) \big( \cos \theta_{t+1},\ \sin \theta_{t+1} \big)$, $\theta_{t+1} = \arctan\big( X_t^{(2)} / X_t^{(1)} \big) + 0.4$,
$Y_t = X_t + Z_t$, $Z_t \sim N(0,\ 0.04\, I)$ i.i.d.
Camera angles
– Hidden $X_t$: angles of a video camera located at a corner of a room.
– Observed $Y_t$: movie frame of the room + additive Gaussian noise.
– Data: 3600 downsampled frames of 20 x 20 RGB pixels (1200-dim. $Y_t$).
– The first 1800 frames are used for training, and the second half for testing.

Average MSE for camera angles (10 runs):

noise        | KBR (Trace)  | Kalman filter (quaternion)
σ² = 10⁻⁴    | 0.15 ± 0.01  | 0.56 ± 0.02
σ² = 10⁻³    | 0.21 ± 0.01  | 0.54 ± 0.02

To represent the SO(3) model, Tr[AB⁻¹] is used for KBR, and the quaternion expression for the Kalman filter.
Robot localization

COLD (COsy Localization Dataset, IJRR 2009)
State $X_t$: 2-D coordinate and orientation of a robot.
Observation $Y_t$: image sequence (SIFT features, 4200 dim.).
Training sample: $(X_t, Y_t)$, $t = 1, \ldots, n$.
Task: estimate the location of the robot from image sequences.
– Observation model: difficult to model ⇒ KBR.
– State transition: linear Gaussian ⇒ Kernel Monte Carlo (Kanagawa, Nishiyama, KF 2013).

Compared methods:
NAI: naïve method (closest image in the training data).
NN: PF + K-nearest neighbor (Vlassis, Terwijn, Kröse 2002).
KBR: KBR + KBR.
KMC: KBR + Monte Carlo.

[Figure: localization results with #training samples = 200, showing the true location, the estimates, and red (+) / blue (−) circles for the weights on the training samples.]
Conclusions
A new nonparametric / kernel approach to Bayesian inference
• Kernel mean embedding: using positive definite kernels to represent and manipulate probabilities.
• "Nonparametric" Bayesian inference: no densities are needed, but data.
• Bayesian inference with matrix computation.
– Computation is done with Gram matrices.
– No integral, no approximate inference.
• More suitable for high-dimensional data than the smoothing kernel approach.
Kernel Herding
Kernel Herding (Chen, Welling, Smola 2010)
– Goal: find sample points $x_1, \ldots, x_T$ that efficiently approximate the kernel mean
$m_P := E[\Phi(X)]$
in the form
$m_P \approx \dfrac{1}{T} \sum_{t=1}^T \Phi(x_t)$.
– Assumption: $m_P$ is known.
– Algorithm: generate $x_1, x_2, \ldots$ by the update rule
$x_{T+1} = \mathrm{argmax}_x\, \langle h_T, \Phi(x) \rangle$,
$h_{T+1} = h_T + m_P - \Phi(x_{T+1})$, with $h_0 = m_P$.
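A minimal sketch of kernel herding over a finite candidate grid (my code; it uses the empirical kernel mean of a large sample in place of the known m_P, the approximation use mentioned below):

import numpy as np

def gauss_kern(x, Z, s):
    return np.exp(-((x - Z) ** 2).sum(-1) / (2 * s**2))

def kernel_herding(Z, candidates, T, s):
    # h is represented implicitly by its inner products <h, Phi(c)> over the candidates c.
    m = np.array([gauss_kern(c, Z, s).mean() for c in candidates])   # <m_P, Phi(c)>
    h = m.copy()
    chosen = []
    for _ in range(T):
        i = int(np.argmax(h))                              # x_{T+1} = argmax <h_T, Phi(x)>
        chosen.append(candidates[i])
        h += m - gauss_kern(candidates[i], candidates, s)  # h_{T+1} = h_T + m_P - Phi(x_{T+1})
    return np.array(chosen)

rng = np.random.default_rng(7)
Z = rng.normal(0, 1, size=(2000, 1))       # large sample approximating P = N(0, 1)
cand = np.linspace(-4, 4, 401).reshape(-1, 1)
pts = kernel_herding(Z, cand, T=20, s=0.5)
print(np.sort(pts[:, 0]))   # 20 "super-samples" spread to match the kernel mean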
Interpretation
Assume $\|\Phi(x)\|$ is constant (translation-invariant kernel), and set $\tilde{h}_T := h_T / (T+1)$, so $\tilde{h}_0 = m_P$. Then
$x_1 = \mathrm{argmax}_x\, \langle m_P,\ \Phi(x) \rangle$,
$x_2 = \mathrm{argmax}_x\, \big\langle m_P - \tfrac{1}{2} \Phi(x_1),\ \Phi(x) \big\rangle$,
⋮
$x_{T+1} = \mathrm{argmax}_x\, \big\langle m_P - \tfrac{1}{T+1} \textstyle\sum_{t=1}^T \Phi(x_t),\ \Phi(x) \big\rangle$.

Residual of the approximation
Each new point maximizes the projection of the residual $m_P - \frac{1}{T+1} \sum_{t=1}^T \Phi(x_t)$ onto $\Phi(x)$, i.e. herding greedily reduces the approximation error $\big\| m_P - \frac{1}{T} \sum_{t=1}^T \Phi(x_t) \big\|$.
[Figure: the first 20 points generated by kernel herding, compared with 20 i.i.d. samples drawn from the same distribution.]

– Even when $m_P = E[\Phi(X)]$ is not known, herding may be used when $m_P$ is approximated by a large sample. Approximating it with a small number of points reduces the size of the Gram matrices ⇒ computational efficiency for kernel methods.
– Convergence:
• The approximation error $\big\| m_P - \frac{1}{T} \sum_{t=1}^T \Phi(x_t) \big\|$ is conjectured to be $O(1/T)$.
• This is the optimal convergence rate.
c.f. for i.i.d. samples the error is $O(1/\sqrt{T})$.
References
Fukumizu, K., Song, L., Gretton, A. (2014) Kernel Bayes' Rule: Bayesian Inference with Positive Definite Kernels. Journal of Machine Learning Research 14:3753−3783.
Song, L., Gretton, A., Fukumizu, K. (2013) Kernel Embeddings of Conditional Distributions. IEEE Signal Processing Magazine 30(4):98−111.
Kanagawa, M., Nishiyama, Y., Gretton, A., Fukumizu, K. (2013) Kernel Monte Carlo Filter. arXiv:1312.4664.
Chen, Y., Welling, M., Smola, A. (2010) Super-Samples from Kernel Herding. Proc. Uncertainty in Artificial Intelligence (UAI 2010).