
1ISM

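Classification, Transformation, and Knowledge Discovery for Large-Scale Data Using Statistical Models

Tomoyuki Higuchi, The Institute of Statistical Mathematics, Research Organization of Information and Systems & JST CREST

The 10th Workshop on Information-Based Induction Sciences (IBIS 2007), November 5 (Mon), 6 (Tue), and 7 (Wed), 2007, Tokyo Institute of Technology, Suzukakedai Campus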

ISM2

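Outline

1. Handling outliers and missing values

2. Online processing and time series models

3. Non-Gaussian information processing

- Constructing distributions numerically

- Model Averaging

- Conditional Dynamic Linear Model

4. Sequential Monte Carlo (SMC)

5. Online Fixed-Lag Model Averaging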

3ISM

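Is massive data a giant trash can?

Left as it is, large-scale data is in practice just a heap of scrap.

Food waste, plastic

Bottles, aluminum cans

Newspapers and paper

By sorting and organizing it ...

So, is analyzing massive data like panning for gold?

This is not a story about alchemy.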

4ISM

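How the terms are used

Information

Knowledge

Data

○○ wisdom

Wisdom

Not covered here

Information science (AI, information processing, computational statistics)

Statistical science

○○ extraction

○○ discovery

○○ processing

Sensing

The part that is not clearly recognized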

5ISM

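Research areas related to ultra-large-scale data (information) processing

Statistical science

Machine learning

Data mining

•Pattern enumeration

•Fast search

•Generative model building ※

"Express the data-generating process with conditional probabilities, write down the joint distribution of all variables, and then apply Bayes' formula as needed."

Modeling the whole → prediction and control tasks can be carried out with a clear view.

•Tradition and accumulated experience

•Discriminative model building ※ (discriminant function building)

"Extract and model only the conditional probabilities needed for the given purpose."

Modeling similarity: $K(x_i, x_j)$

•Optimization

Creation of a new academic field

IBIS

ISM

※: terms from Bishop, "Pattern Recognition and Machine Learning" (2006); quoted passages: wording from the commentary by Iba (IEICE Technical Report NC2006-55 (2006-10), 61-66)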

NSF: Office of Cyberinfrastructure

■ Cyber-Enabled Discovery and Innovation

Cyber-Enabled Discovery and Innovation (CDI) is NSF’s bold five-year initiative to create revolutionary science and engineering research outcomes made possible by innovations and advances in computational thinking.

Computational thinking is defined comprehensively to encompass computational concepts, methods, models, algorithms, and tools.

* From Data to Knowledge: enhancing human cognition and generating new knowledge from a wealth of heterogeneous digital data;

* Understanding Complexity in Natural, Built, and Social Systems: deriving fundamental insights on systems comprising multiple interacting elements; and

* Building Virtual Organizations: enhancing discovery and innovation by bringing people and resources together across institutional, geographical and cultural boundaries.

※ This program is expected to start at $26M (approx. 3 billion yen) for this fiscal year and increase significantly in future years.

■ Sustainable Digital Data Preservation and Access Network Partners

7ISM

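Noise handling up front is in fact essential

Sifting out junk data

Sifting flour removes dirt and foreign matter, breaks up the flour into a finer texture, and works air into it.

Removing outliers, filling in missing values, re-sorting the order, ...

If the mesh is too fine, only water passes through: data that might yield new insights is thrown away as well.

If the mesh is too coarse, even pebbles pass through: a large amount of data containing outliers is handed to the next step.

Straining: a mesh of just the right size

(sound of sifting)

Now, where should we start ...

To the next analysis process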

8ISM

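Getting the degree of information reduction (irreversible transformation) right

Overcook it and the nutrients and flavor leach out

Underboil it and the bitterness remains

The optimal degree of cooking

If the processing is insufficient, a jumble of good and bad information overflows.

If it is overdone, even the necessary information is thrown away.

No matter how good the ingredients are, ...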

9ISM

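Chain Structure Graphical Model

State $x_t$ (not observable): $x_0 \to x_1 \to x_2 \to \cdots \to x_t$, linked by the system model

Observation $y_t$ (observable): $y_1, y_2, \ldots, y_t$, each linked to the state by the observation model

$\{x_0, x_1, \ldots, x_t, \ldots, x_N\}$, $\{y_1, \ldots, y_t, \ldots, y_N\}$: vector quantities

Past & present

Present & future

--- Taking daily stock price data as an example ---

$p(x_t \mid y_{1:t-1}) \equiv p(x_t \mid [y_1, y_2, \ldots, y_{t-1}])$: today's state based on the data up to yesterday

$p(x_t \mid y_{1:t}) \equiv p(x_t \mid [y_1, y_2, \ldots, y_t])$: today's state based on the data up to today

$p(x_t \mid y_{1:T}) \equiv p(x_t \mid [y_1, y_2, \ldots, y_T])$: today's state seen in hindsight some years later, once all the data are in

$x_t = F x_{t-1} + G v_t$, $y_t = H x_t + e_t$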

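10ISM

Interpolation and extrapolation

State (not observable): $x_0 \to x_1 \to \cdots \to x_{t-1} \to x_t \to \cdots \to x_T \to x_{T+1}$, linked by the system model

Observation (observable): $y_1, \ldots, y_{t-1}, y_t, \ldots, y_T$, each linked to the state by the observation model

Missing values, outliers

$x_t = [x_{1,t}, x_{2,t}, \ldots, x_{M,t}]'$

Prepare many latent variables

Number of data points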
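As a rough illustration of how a state space model interpolates and extrapolates, the sketch below runs a scalar Kalman filter for $x_t = F x_{t-1} + G v_t$, $y_t = H x_t + e_t$ and skips the update step whenever an observation is missing. The function name, the default parameter values, and the use of `None` to mark a missing value are assumptions made for this example and do not come from the slides.

```python
def kalman_filter(ys, F=1.0, G=1.0, H=1.0, tau2=0.1, sigma2=1.0,
                  x0=0.0, V0=10.0):
    """Scalar Kalman filter; entries of ys equal to None are treated as missing."""
    x, V = x0, V0
    predicted, filtered = [], []
    for y in ys:
        # One-step-ahead prediction: mean and variance of p(x_t | y_{1:t-1})
        x_pred = F * x
        V_pred = F * V * F + G * tau2 * G
        predicted.append((x_pred, V_pred))
        if y is None:
            # Missing value: no new information, so the filter
            # distribution is the predictive distribution.
            x, V = x_pred, V_pred
        else:
            # Filtering: mean and variance of p(x_t | y_{1:t})
            S = H * V_pred * H + sigma2      # innovation variance
            K = V_pred * H / S               # Kalman gain
            x = x_pred + K * (y - H * x_pred)
            V = (1.0 - K * H) * V_pred
        filtered.append((x, V))
    return predicted, filtered


# Two missing values in the middle (interpolation) and two at the end (extrapolation).
ys = [1.0, 1.2, None, None, 0.9, None, None]
predicted, filtered = kalman_filter(ys)
```

The variance grows over each run of missing values and shrinks again when an observation arrives, which is the behavior that lets the same recursion serve for both gaps and forecasts.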

11ISM

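The art of modeling matters more than developing clever algorithms!

Example: seasonal adjustment (Kitagawa and Higuchi, 1998)

(monthly data)

・Change from the same month of the previous year

・Seasonally adjusted data (US Census)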

12ISM

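Seasonal adjustment model (for quarterly data)

$y_t = \mu_t + s_t + e_t, \quad e_t \sim N(0, \sigma^2)$

$s_t = -(s_{t-1} + s_{t-2} + s_{t-3}) + v_{s,t}, \quad v_{s,t} \sim N(0, \tau_s^2)$

$\mu_t = 2\mu_{t-1} - \mu_{t-2} + v_{\mu,t}, \quad v_{\mu,t} \sim N(0, \tau_\mu^2)$

$\mu_t$: trend component, $s_t$: seasonal component, $e_t$: observation noise

State space representation of the seasonal adjustment model

$x_t = F x_{t-1} + G v_t$, $y_t = H x_t + e_t$

$x_t = [\mu_t, \mu_{t-1}, s_t, s_{t-1}, s_{t-2}]'$, $v_t = [v_{\mu,t}, v_{s,t}]'$

$$
F = \begin{bmatrix} 2 & -1 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & -1 & -1 & -1 \\ 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 \end{bmatrix}, \quad
G = \begin{bmatrix} 1 & 0 \\ 0 & 0 \\ 0 & 1 \\ 0 & 0 \\ 0 & 0 \end{bmatrix}, \quad
H = \begin{bmatrix} 1 & 0 & 1 & 0 & 0 \end{bmatrix}
$$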
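To make the state space form of the quarterly seasonal adjustment model concrete, the following sketch writes out $F$, $G$, $H$ for the state $x_t = [\mu_t, \mu_{t-1}, s_t, s_{t-1}, s_{t-2}]'$ and simulates data from $x_t = F x_{t-1} + G v_t$, $y_t = H x_t + e_t$. The helper names (`matvec`, `simulate`), the initial state, and the noise levels are illustrative assumptions.

```python
import random

# Quarterly seasonal adjustment model in state space form.
F = [[2, -1, 0, 0, 0],    # mu_t = 2 mu_{t-1} - mu_{t-2} + v_mu
     [1, 0, 0, 0, 0],     # carries mu_{t-1} forward
     [0, 0, -1, -1, -1],  # s_t = -(s_{t-1} + s_{t-2} + s_{t-3}) + v_s
     [0, 0, 1, 0, 0],     # carries s_{t-1} forward
     [0, 0, 0, 1, 0]]     # carries s_{t-2} forward
G = [[1, 0], [0, 0], [0, 1], [0, 0], [0, 0]]
H = [1, 0, 1, 0, 0]       # y_t = mu_t + s_t + e_t


def matvec(A, x):
    """Matrix-vector product for small list-based matrices."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]


def simulate(x0, n, tau_mu, tau_s, sigma, rng):
    """Simulate n steps; returns the list of states and the list of observations."""
    x, states, ys = list(x0), [], []
    for _ in range(n):
        v = [rng.gauss(0.0, tau_mu), rng.gauss(0.0, tau_s)]
        x = [fx + gv for fx, gv in zip(matvec(F, x), matvec(G, v))]
        y = sum(h * xi for h, xi in zip(H, x)) + rng.gauss(0.0, sigma)
        states.append(x)
        ys.append(y)
    return states, ys


rng = random.Random(0)
states, ys = simulate([0.0, 0.0, 1.0, -0.5, -1.0], 12, 0.1, 0.05, 0.3, rng)
```

With all noise terms set to zero the seasonal component repeats with period four and sums to zero over any four consecutive quarters, which is a quick sanity check on the third row of $F$.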

13ISM

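Empirical Bayes: determining the hyperparameters

$\mathrm{AIC} = -2 \log p(y_{1:T} \mid \alpha^2, \sigma^2) + 2\,(\#\,\text{hyper-parameters})$

$p(y_{1:T}) = \prod_{t=1}^{T} p(y_t \mid y_{1:t-1})$: each factor is obtained at the filtering step of each time point

$\alpha^2 = 3.21 \times 10^{-6}$: AIC = 2556

$\alpha^2 = 3.21 \times 10^{-4}$: AIC = 2506

$\alpha^2 = 0.0321$: AIC = 2562

[Figure: smoothed trend $\mu_{t|T}$ for each $\alpha^2$; panel labels: Too smooth, AIC best, Too rough]

(Kitagawa, 1994)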
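The decomposition $p(y_{1:T}) = \prod_t p(y_t \mid y_{1:t-1})$ means that the log-likelihood, and hence the AIC, falls out of the Kalman filter as a by-product. A minimal sketch follows, assuming a simpler local-level model ($\mu_t = \mu_{t-1} + v_t$, $y_t = \mu_t + w_t$) than the trend model behind the slide's figures; the data, the grid of variances, and the function names are made up for illustration.

```python
import math


def log_likelihood(ys, tau2, sigma2, x0=0.0, V0=10.0):
    """Log-likelihood of a local-level model via the prediction error decomposition."""
    x, V, ll = x0, V0, 0.0
    for y in ys:
        V_pred = V + tau2            # variance of p(x_t | y_{1:t-1})
        S = V_pred + sigma2          # variance of p(y_t | y_{1:t-1})
        err = y - x                  # one-step-ahead prediction error
        ll += -0.5 * (math.log(2.0 * math.pi * S) + err * err / S)
        K = V_pred / S               # Kalman gain
        x = x + K * err
        V = (1.0 - K) * V_pred
    return ll


def aic(ys, tau2, sigma2, n_hyper=2):
    """AIC = -2 log-likelihood + 2 (number of hyperparameters)."""
    return -2.0 * log_likelihood(ys, tau2, sigma2) + 2.0 * n_hyper


ys = [0.1, 0.3, 0.2, 0.8, 1.1, 0.9, 1.4, 1.3]
grid = [1e-4, 1e-2, 1.0]                       # candidate system noise variances
scores = {tau2: aic(ys, tau2, sigma2=0.05) for tau2 in grid}
best_tau2 = min(scores, key=scores.get)
```

In practice the hyperparameters would be optimized numerically rather than chosen from a three-point grid; the grid only mirrors the slide's comparison of a too-smooth, a best, and a too-rough setting.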

14

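Fixed-Lag Smoother

[Figure: time axis marked $t = 1$, $t = t'$, $t = T$; horizontal axis from December 2001 through March, June, and September to December 2002]

Filter distribution: $p(x_{t'} \mid y_{1:t'})$

Fixed-interval smoothing distribution: $p(x_{t'} \mid y_{1:T})$

$p(x_{t'} \mid y_{1:T}) \Rightarrow p(x_{t'} \mid y_{1:t'+4})$

$\Xi^{(i)}_{t'+4|t'+4} = \left[ x^{(i)}_{t'+4|t'+4},\; x^{(i)}_{t'+3|t'+4},\; x^{(i)}_{t'+2|t'+4},\; x^{(i)}_{t'+1|t'+4},\; x^{(i)}_{t'|t'+4} \right]$

$p(x_t \mid y_{1:T}) \approx p(x_t \mid y_{1:t+L})$: take $L$ long ($L = 20$)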
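With particles, fixed-lag smoothing amounts to keeping a short buffer of past states per particle and resampling the whole buffer, so that earlier states are also reweighted by later observations. The sketch below shows only that bookkeeping on a toy random-walk model; the model, the noise levels, the data, and the function names are assumptions for illustration.

```python
import math
import random


def extend_histories(histories, new_states, lag):
    """Append the newest state to each particle's buffer, keeping lag + 1 entries."""
    return [(h + [x])[-(lag + 1):] for h, x in zip(histories, new_states)]


def resample_histories(histories, weights, rng):
    """Resample whole buffers, so stored past states are reweighted by y_t too."""
    idx = rng.choices(range(len(histories)), weights=weights, k=len(histories))
    return [list(histories[i]) for i in idx]


rng = random.Random(1)
lag = 4      # short lag for the demo; the slides recommend a longer lag such as L = 20
histories = [[rng.gauss(0.0, 1.0)] for _ in range(100)]
for y in [0.2, 0.1, 0.4, 0.3, 0.5, 0.6]:
    # Propagate with a random-walk system model (assumed).
    new_states = [h[-1] + rng.gauss(0.0, 0.3) for h in histories]
    histories = extend_histories(histories, new_states, lag)
    # Weight by a Gaussian observation model with standard deviation 0.5 (assumed).
    weights = [math.exp(-0.5 * ((y - h[-1]) / 0.5) ** 2) for h in histories]
    histories = resample_histories(histories, weights, rng)

# The oldest entry of each buffer is a draw approximating p(x_{t-lag} | y_{1:t}).
smoothed_mean = sum(h[0] for h in histories) / len(histories)
```

A longer lag brings the result closer to the fixed-interval smoother, at the price of fewer distinct values surviving in the oldest buffer entries after repeated resampling.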

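15ISM

How non-Gaussian information processing works: non-Gaussian smoothing

Trend model

$\mu_t = \mu_{t-1} + v_t$, $y_t = \mu_t + w_t$

Noise distributions

Normal distribution: $w_t \sim N(0, \sigma^2)$

Cauchy distribution: $v_t \sim C(0, \tau^2)$

[Figure: Data and Smoother panels, Gaussian vs. Non-Gaussian]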

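16ISM

Automatic identification of jumps (Kitagawa and Gersch, 1996)

System model: $v_t = \mu_t - \mu_{t-1}$

Pearson family:

$$p(v \mid b, \tau^2) = \frac{\beta}{(\tau^2 + v^2)^{b}}, \quad \beta = \frac{\tau^{2b-1}\,\Gamma(b)}{\Gamma(1/2)\,\Gamma(b - 1/2)}, \quad 0.5 < b \le +\infty$$

$b = 1$: Cauchy distribution (Lorentz distribution)

$b = +\infty$: Gaussian distribution

[Figure: $\mu_{t-1}$, $\mu_t$, $v_t$, and the density $p(v \mid \cdot)$]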
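A quick numerical check of the Pearson family can be sketched as follows, assuming the density $p(v \mid b, \tau^2) = \beta / (\tau^2 + v^2)^b$ with $\beta = \tau^{2b-1}\Gamma(b) / (\Gamma(1/2)\Gamma(b - 1/2))$. The function names and the evaluation points are illustrative choices.

```python
import math


def pearson_pdf(v, b, tau):
    """Pearson family density beta / (tau^2 + v^2)^b, defined for b > 1/2."""
    if b <= 0.5:
        raise ValueError("b must exceed 1/2")
    beta = (tau ** (2.0 * b - 1.0) * math.gamma(b)
            / (math.gamma(0.5) * math.gamma(b - 0.5)))
    return beta / (tau * tau + v * v) ** b


def cauchy_pdf(v, tau):
    """Cauchy (Lorentz) density with scale tau, for comparison with b = 1."""
    return tau / (math.pi * (tau * tau + v * v))


# Heavier tails for smaller b: density ten scale units away from the center.
tail_b1 = pearson_pdf(10.0, 1.0, 1.0)
tail_b3 = pearson_pdf(10.0, 3.0, 1.0)
```

The heavy tail at small $b$ is what allows an occasional large $v_t$, that is, a jump in the trend, without inflating the noise scale everywhere else.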

17ISM

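Automatic identification of outliers

Observation model: $e_t = y_t - \mu_t$

[Figure: $y_t$, $\mu_t$, $e_t$, and the density $p(e_t \mid \cdot)$]

Time series data

Time series data after outlier handling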

18ISM

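Modeling the idiosyncrasies of outliers

■ Modeling the quirks of the measuring instrument

Observation model: $e_t = y_t - \mu_t$

Normal Mixture: $p(e_t \mid \cdot) = \alpha\, N(0, \sigma_s^2) + (1 - \alpha)\, N(\mu_{out}, \sigma_{out}^2)$

$1 - \alpha$: fraction of outliers

[Figure: $y_t$, $\mu_t$, $e_t$, and $\mu_{out}$]

Time series data

Time series data after outlier handling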
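The normal mixture $p(e_t \mid \cdot) = \alpha N(0, \sigma_s^2) + (1 - \alpha) N(\mu_{out}, \sigma_{out}^2)$ also gives, by Bayes' rule, the probability that a given residual came from the outlier component. A small sketch under assumed parameter values (the numbers and function names are not from the slides):

```python
import math


def normal_pdf(e, mean, var):
    """Density of N(mean, var) at e."""
    return math.exp(-0.5 * (e - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def mixture_pdf(e, alpha, var_s, mu_out, var_out):
    """Observation noise density: alpha N(0, var_s) + (1 - alpha) N(mu_out, var_out)."""
    return (alpha * normal_pdf(e, 0.0, var_s)
            + (1.0 - alpha) * normal_pdf(e, mu_out, var_out))


def outlier_probability(e, alpha, var_s, mu_out, var_out):
    """Posterior probability that residual e was generated by the outlier component."""
    out = (1.0 - alpha) * normal_pdf(e, mu_out, var_out)
    return out / mixture_pdf(e, alpha, var_s, mu_out, var_out)


# Assumed setting: 5% outliers, shifted by +5 with a wider spread.
p_small = outlier_probability(0.0, 0.95, 1.0, 5.0, 4.0)
p_large = outlier_probability(5.0, 0.95, 1.0, 5.0, 4.0)
```

A nonzero $\mu_{out}$ encodes a systematic bias of the instrument's bad readings, which a symmetric heavy-tailed noise model cannot express.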

19

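Self-navigating guide robot in the Smithsonian museum

•Position tracking

•Global localization problem (initial position unknown)

•Kidnapped robot problem (carried off somewhere without warning)

•Multi-robot localization problem

•The exhibition areas inside the museum have complex shapes

•The positions of glass cases and the like change for special exhibitions

•Several similar-looking places exist within the exhibition area

•The robot has to navigate through crowds

•A system that is inexpensive and easy to implement is desirable

Difficulty

ISM

Mobile Robot Localization (D. Fox et al., “Particle filter for mobile robot localization,” 2001)

Experiences with Interactive Museum Tour-Guide Robots

Wolfram Burgard

University of Freiburg, Department of Computer Science, Autonomous Intelligent Systems

http://www.informatik.uni-freiburg.de/

Probabilistic Robotics (Japanese edition: 確率ロボティクス)

Sebastian Thrun, Wolfram Burgard, Dieter Fox (authors), Ryuichi Ueda (translator)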

21ISM

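Motion model: collect patterns and construct the system model $p(x_t \mid x_{t-1}, u_{t-1})$ numerically

$x_t = f(x_{t-1}, u_{t-1}, v_t)$, $y_t = h(x_t, w_t)$

$$
\begin{aligned}
p(x_t \mid x_{t-1}, u_{t-1}) &= \int p(x_t, v_t \mid x_{t-1}, u_{t-1})\, dv_t \\
&= \int p(x_t \mid x_{t-1}, u_{t-1}, v_t)\, p(v_t \mid x_{t-1}, u_{t-1})\, dv_t \\
&= \int \delta\bigl(x_t - f(x_{t-1}, u_{t-1}, v_t)\bigr)\, p(v_t)\, dv_t \\
&= p\bigl(v_t = f^{*}(x_t, x_{t-1}, u_{t-1})\bigr)
\end{aligned}
$$

Convolution of conventional robot kinematics and two independent zero-mean random variables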
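Constructing $p(x_t \mid x_{t-1}, u_{t-1})$ numerically can be as simple as pushing noise samples through the kinematics $x_t = f(x_{t-1}, u_{t-1}, v_t)$. The sketch below assumes a planar pose (x, y, heading), a control consisting of a travel distance and a turn, and two independent zero-mean Gaussian noises; this particular kinematic model, the noise levels, and the function names are illustrative assumptions, not taken from the cited robot system.

```python
import math
import random


def motion(pose, control, noise):
    """x_t = f(x_{t-1}, u_{t-1}, v_t): turn, then travel, both perturbed by noise."""
    x, y, heading = pose
    distance, turn = control
    distance_noise, turn_noise = noise
    new_heading = heading + turn + turn_noise
    step = distance + distance_noise
    return (x + step * math.cos(new_heading),
            y + step * math.sin(new_heading),
            new_heading)


def sample_motion_model(pose, control, n, rng, distance_sd=0.05, turn_sd=0.02):
    """Draw n samples from p(x_t | x_{t-1}, u_{t-1}) by simulating the noise."""
    return [motion(pose, control,
                   (rng.gauss(0.0, distance_sd), rng.gauss(0.0, turn_sd)))
            for _ in range(n)]


rng = random.Random(0)
samples = sample_motion_model((0.0, 0.0, 0.0), (1.0, 0.0), 2000, rng)
mean_x = sum(s[0] for s in samples) / len(samples)
```

Because a particle filter only needs to draw from the system model, the density itself never has to be written down, however nonlinear `motion` is.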

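22ISM

Perceptual model: the observation error model can likewise be constructed numerically

Observation model $p(y_t \mid x_t)$

$x_t = f(x_{t-1}, u_{t-1}, v_t)$, $y_t = h(x_t, w_t)$

$$
\begin{aligned}
p(y_t \mid x_t) &= \int p(y_t, w_t \mid x_t)\, dw_t \\
&= \int p(y_t \mid x_t, w_t)\, p(w_t \mid x_t)\, dw_t \\
&= \int \delta\bigl(y_t - h(x_t, w_t)\bigr)\, p(w_t)\, dw_t \\
&= p\bigl(w_t = h^{*}(y_t, x_t)\bigr)
\end{aligned}
$$

Observation error incurred by the sensor: ordinary observation error + errors that would normally be treated as outliers

Case of a planar 2D laser range finder

$x_t \sim p(\cdot \mid x_{t-1}, \theta_1)$, $y_t \sim p(\cdot \mid x_t, \theta_2)$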

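23ISM

Graphical model of the self-organizing (self-tuning) time series model (Kitagawa, 1996)

Parameter: $\theta_0 \to \theta_1 \to \cdots \to \theta_{t-1} \to \theta_t$

State (not observable): $x_0 \to x_1 \to \cdots \to x_{t-1} \to x_t$

Observation (observable): $y_1, \ldots, y_{t-1}, y_t$

$\theta_t \equiv [\theta_{1,t}', \theta_{2,t}']'$, $\theta_t \sim p(\cdot \mid \theta_{t-1}, \psi)$

Applied research and development is very active in robotics, mainly for online processing.

{Ghahramani, Jordan, Hinton} {Shumway & Stoffer}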

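24ISM

Embedding in the state vector and online Model Averaging

[Figure: $x_{t-1} \to x_t$ with observations $y_{t-1}$, $y_t$]

$x_t = [x_{1,t}, x_{2,t}, \ldots, x_{M,t}, \theta_t]'$

SOSSM with latent switching variable: $\alpha_t = [x_t, I_t]'$

(Higuchi, 2000, 2001) in Sequential Monte Carlo Methods in Practice (eds. A. Doucet, J.F.G. de Freitas, and N.J. Gordon)

All that remains is to apply the particle filter!

Evolution of $I_t$: Markov switching prior. $I_t$ specifies which model is used.

Consider many heterogeneous models simultaneously and achieve Model Averaging online.

Set of observation models: $y_t = h_i(x_t, w_t)$, $i = 1, \ldots, M$

Prepare many Box-Cox transformations:

$$
y_t = \begin{cases} (x_t^{\lambda} - 1)/\lambda + w_t, & \lambda \neq 0 \\ \log x_t + w_t, & \lambda = 0 \end{cases}
$$

Fixed-Lag Smoother with Model Averaging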
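A family of candidate observation models indexed by the Box-Cox parameter $\lambda$ might be set up as below, with the switching variable $I_t$ later selecting one member of the list. The grid of $\lambda$ values and the function names are assumptions for illustration.

```python
import math


def box_cox(x, lam):
    """Box-Cox transformation of a positive value x."""
    if x <= 0:
        raise ValueError("x must be positive")
    if lam == 0:
        return math.log(x)
    return (x ** lam - 1.0) / lam


def make_observation_model(lam):
    """Return h(x, w) = box_cox(x, lam) + w for a fixed lambda."""
    def h(x, w):
        return box_cox(x, lam) + w
    return h


lambdas = [-1.0, -0.5, 0.0, 0.5, 1.0]          # candidate transformations (assumed grid)
models = [make_observation_model(lam) for lam in lambdas]

# The indicator I_t would pick one entry of `models` at each time step.
y_linear = models[4](3.0, 0.5)                 # lambda = 1: (3 - 1) / 1 + 0.5
```

The $\lambda = 0$ branch is the continuous limit of the general formula, so neighboring models on the grid behave similarly and the averaging over $I_t$ is well behaved.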

25ISM

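Automatic identification of jumps (example)

$\mu_t = 2\mu_{t-1} - \mu_{t-2} + v_t, \quad v_t \sim N(0, \tau^2)$ or $C(0, \tau^2)$

$y_t = \mu_t + w_t, \quad w_t \sim N(0, \sigma^2)$

[Figure: true trend with a small dip; estimates from the linear Gaussian trend model and from the linear non-Gaussian trend model]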

26ISM

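Self-organizing state space model

$\mu_t = 2\mu_{t-1} - \mu_{t-2} + v_t, \quad v_t \sim C(0, \tau^2)$

$y_t = \mu_t + w_t, \quad w_t \sim N(0, \sigma_t^2)$

$\log_{10} \sigma_t^2 = \log_{10} \sigma_{t-1}^2 + \varepsilon_t, \quad \varepsilon_t \sim C(0, \xi^2)$

Automatic simultaneous estimation of variance changes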

ISM27

Local level model with switching system/observation variance

•Kim and Nelson (1999)

•Frühwirth-Schnatter (2001)

[Figure: given trend $\mu_t^{true}$; regime $I_t^{true} = 1$: small observation noise; regime $I_t^{true} = 2$: large observation noise]

US/UK real exchange rate from Jan. 1885 to Nov. 1999 (Grilli and Kaminsky (1991), Engle and Kim (1999))

The real exchange rate is defined as the relative price of UK to US producer goods: US/UK nominal exchange rate times the UK producer price index divided by the US producer price index

ISM28

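Simulation Data

Model: $x_t = \mu_t$, $y_t = H_{I_t} x_t + E_{I_t} w_t$, $H_{I_t} = 1$

$$
E_{I_t} = \begin{cases} \sigma_{small}, & I_t = 1 \\ \sigma_{large}, & I_t = 2 \end{cases}
$$

This is the same as outlier identification, except that a time series structure with the Markov property lies behind it.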

29

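Posterior distribution: regime estimation

$\hat{\Pr}(I_{t,k} = l \mid y_{1:T}) = \dfrac{1}{N}\, \#\bigl\{\, z^{(i)}_{t|1:T} \text{ with } I^{(i)}_{t,k} = l \;\big|\; i = 1, \ldots, N \,\bigr\}$

[Figure: points with larger obs. noise; regimes $I_t^{true} = 1$ and $I_t^{true} = 2$]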

ISM

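Conditional Dynamic Linear Model (CDLM)

Time-Dependent Gaussian Mixture Model

$x_t = F_\lambda x_{t-1} + G_\lambda u_t$, $y_t = H_\lambda x_t + D_\lambda w_t$, $\lambda = I_t$

Given $\lambda = I_t$, the matrices $F_\lambda, G_\lambda, H_\lambda, D_\lambda$ are constant.

$I_t$: latent indicator variable; a stationary, discrete, first-order homogeneous Markov chain

Transition probability: $\pi_{ij} = \Pr(I_t = j \mid I_{t-1} = i)$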
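To see what data from a conditional dynamic linear model look like, the sketch below draws the indicator $I_t$ from a two-state Markov chain with transition probabilities $\pi_{ij}$ and then generates a scalar local-level series whose observation noise scale $D_\lambda$ depends on $I_t$. The transition matrix, the noise scales, the choice $F = H = 1$, and the zero-based regime labels are all assumptions made for the example.

```python
import random


def simulate_regimes(n, P, rng, start=0):
    """Sample I_1..I_n from a homogeneous first-order Markov chain.

    P[i][j] is the probability of moving from regime i to regime j
    (regimes are labeled 0, 1, ... here instead of 1, 2, ...).
    """
    regimes, state = [], start
    for _ in range(n):
        state = rng.choices(range(len(P)), weights=P[state])[0]
        regimes.append(state)
    return regimes


def simulate_observations(regimes, D, tau, rng):
    """x_t = x_{t-1} + tau u_t and y_t = x_t + D[I_t] w_t with standard normal u_t, w_t."""
    x, ys = 0.0, []
    for regime in regimes:
        x = x + tau * rng.gauss(0.0, 1.0)
        ys.append(x + D[regime] * rng.gauss(0.0, 1.0))
    return ys


P = [[0.95, 0.05],
     [0.10, 0.90]]
rng = random.Random(0)
regimes = simulate_regimes(200, P, rng)
ys = simulate_observations(regimes, D=[0.1, 2.0], tau=0.05, rng=rng)
```

Large diagonal entries in `P` make the regimes persistent, which is what distinguishes this setting from independent outliers.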

31

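Rao-Blackwellization

$$
\iint g(x_1, x_2)\, p(x_1, x_2)\, dx_1\, dx_2 = \int \left[ \int g(x_1, x_2)\, p(x_1 \mid x_2)\, dx_1 \right] p(x_2)\, dx_2
$$

If the bracketed term $\left[ \int g(x_1, x_2)\, p(x_1 \mid x_2)\, dx_1 \right]$ can be obtained analytically, it suffices to carry out the Monte Carlo integration with samples $\{x_2^{(j)}\}_{j=1}^{m}$, $x_2^{(j)} \sim p(x_2)$.

In the case of the CDLM: $p(x_1, x_2)$ corresponds to $p(x_t, I_{1:t} \mid y_{1:t})$

$p(x_t \mid I^{(j)}_{1:t}, y_{1:t}) = N(x^{(j)}_{t|t}, V^{(j)}_{t|t})$

How should the samples $I^{(j)}_{1:t}$ following $p(I_{1:t} \mid y_{1:t})$ be obtained?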
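The gain from Rao-Blackwellization can be seen on a toy problem where the inner integral is known in closed form. The sketch assumes $x_2 \sim N(0, 1)$, $x_1 \mid x_2 \sim N(x_2, 1)$, and $g(x_1, x_2) = x_1^2$, so that $E[g \mid x_2] = x_2^2 + 1$ and the exact answer is 2; this setup is invented purely for illustration.

```python
import random


def mean(values):
    return sum(values) / len(values)


def variance(values):
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)


rng = random.Random(0)
m_samples = 20000
crude, rao_blackwell = [], []
for _ in range(m_samples):
    x2 = rng.gauss(0.0, 1.0)
    x1 = rng.gauss(x2, 1.0)
    crude.append(x1 * x1)                  # g evaluated at a joint sample (x1, x2)
    rao_blackwell.append(x2 * x2 + 1.0)    # inner integral E[g | x2] in closed form

estimate_crude = mean(crude)
estimate_rb = mean(rao_blackwell)
```

Both estimators target the same value, but the Rao-Blackwellized terms have variance 2 instead of 8 in this example, so fewer samples are needed for the same accuracy. In the CDLM the Kalman filter plays the role of the closed-form inner integral.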

32

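Notation for conditioning: filter distribution

$\mathrm{KF}^{(j)}_{t-1}$: $(I^{(j)}_{1:t-1}, y_{1:t-1})$ and $(x^{(j)}_{t-1|t-1}, V^{(j)}_{t-1|t-1})$

$$
p(x_{t-1} \mid I^{(j)}_{1:t-1}, y_{1:t-1}) = N\bigl(x^{(j)}_{t-1|t-1}, V^{(j)}_{t-1|t-1}\bigr) = p\bigl(x_{t-1} \mid \mathrm{KF}^{(j)}_{t-1}\bigr)
$$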

ISM

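33

Notation for conditioning: predictive distribution

$x^{(j)}_{t|t-1} = F_{I^{(j)}_t}\, x^{(j)}_{t-1|t-1}$

$V^{(j)}_{t|t-1} = F_{I^{(j)}_t}\, V^{(j)}_{t-1|t-1}\, F_{I^{(j)}_t}' + G_{I^{(j)}_t}\, G_{I^{(j)}_t}'$

$$
\begin{aligned}
p(x_t \mid I^{(j)}_{1:t}, y_{1:t-1}) &= p(x_t \mid I^{(j)}_t, I^{(j)}_{1:t-1}, y_{1:t-1}) \\
&= N\bigl(x^{(j)}_{t|t-1}, V^{(j)}_{t|t-1}\bigr) \\
&= p\bigl(x_t \mid I^{(j)}_t, \mathrm{KF}^{(j)}_{t-1}\bigr)
\end{aligned}
$$

$$
\begin{aligned}
p(y_t \mid I^{(j)}_{1:t}, y_{1:t-1}) &= p(y_t \mid I^{(j)}_t, I^{(j)}_{1:t-1}, y_{1:t-1}) \\
&= N\bigl(H_{I^{(j)}_t}\, x^{(j)}_{t|t-1},\; H_{I^{(j)}_t}\, V^{(j)}_{t|t-1}\, H_{I^{(j)}_t}' + D_{I^{(j)}_t}\, D_{I^{(j)}_t}'\bigr) \\
&= p\bigl(y_t \mid I^{(j)}_t, \mathrm{KF}^{(j)}_{t-1}\bigr)
\end{aligned}
$$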

34

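Basics of Sequential Monte Carlo (SMC) 1.

Posterior probability (target function): $\pi(I_{1:T}) = p(I_{1:T} \mid y_{1:T})$

SIS framework, with trial function $q$ and importance weight $w$:

$$
w(I_{1:T}) = \frac{\pi(I_{1:T})}{q(I_{1:T})} = \frac{\pi(I_1)\, \pi(I_2 \mid I_1) \cdots \pi(I_T \mid I_{1:T-1})}{q(I_1)\, q(I_2 \mid I_1) \cdots q(I_T \mid I_{1:T-1})}
$$

$\pi(I_t \mid I_{1:t-1}) = \pi(I_{1:t}) / \pi(I_{1:t-1})$, where $\pi(I_{1:t}) = p(I_{1:t} \mid y_{1:T})$ requires all the data: unsuitable for online computation.

Approximation by the filter distribution: $p(I_{1:t} \mid y_{1:T}) \approx p(I_{1:t} \mid y_{1:t}) = \pi_t(I_{1:t})$

$$
w_t = w_{t-1}\, \frac{\pi_t(I_{1:t})}{\pi_{t-1}(I_{1:t-1})\, q(I_t \mid I_{1:t-1})}
$$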

ISM

35

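Basics of SMC 2.

General SIS framework: with $\pi_t(x_{1:t}) = p(x_{1:t} \mid y_{1:t})$, let

$$
w_t = w_{t-1}\, u_t, \quad u_t = \frac{\pi_t(x_{1:t})}{\pi_{t-1}(x_{1:t-1})\, q_t(x_t \mid x_{1:t-1})} = \frac{\pi_t(x_{1:t-1})}{\pi_{t-1}(x_{1:t-1})} \cdot \frac{\pi_t(x_t \mid x_{1:t-1})}{q_t(x_t \mid x_{1:t-1})}
$$

First factor:

$$
\begin{aligned}
\frac{\pi_t(x_{1:t-1})}{\pi_{t-1}(x_{1:t-1})} &= \frac{p(x_{1:t-1} \mid y_{1:t})}{p(x_{1:t-1} \mid y_{1:t-1})} \\
&= \frac{p(y_t \mid x_{1:t-1}, y_{1:t-1})\, p(x_{1:t-1} \mid y_{1:t-1})}{p(y_t \mid y_{1:t-1})} \cdot \frac{1}{p(x_{1:t-1} \mid y_{1:t-1})} \\
&\propto p(y_t \mid x_{1:t-1}, y_{1:t-1}) \equiv p(y_t \mid I_{1:t-1}, y_{1:t-1}) = p(y_t \mid \mathrm{KF}_{t-1})
\end{aligned}
$$

Second factor:

$$
\begin{aligned}
\pi_t(x_t \mid x_{1:t-1}) &= p(x_t \mid x_{1:t-1}, y_{1:t}) \\
&= \frac{p(y_t \mid x_t, x_{1:t-1}, y_{1:t-1})\, p(x_t \mid x_{1:t-1}, y_{1:t-1})}{p(y_t \mid x_{1:t-1}, y_{1:t-1})} \\
&\propto p(y_t \mid x_t, x_{1:t-1}, y_{1:t-1}) \cdot p(x_t \mid x_{1:t-1}, y_{1:t-1}) \equiv p(y_t \mid I_t, \mathrm{KF}_{t-1}) \cdot p(I_t \mid I_{t-1})
\end{aligned}
$$

For each particle, a sum over $M^K$ terms has to be taken.

$M$: number of models

$K$: number of elements of $I_t$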

36

Basics of SMC 3.

Basic particle filter in the SIS framework (Monte Carlo filter (Kitagawa, 1993), Bootstrap filter (Gordon et al., 1993))

$$
\begin{aligned}
\pi_t(x_{1:t}) = p(x_{1:t} \mid y_{1:t}) &= \frac{p(x_{1:t}, y_t \mid y_{1:t-1})}{p(y_t \mid y_{1:t-1})} \\
&\propto p(y_t \mid x_{1:t}, y_{1:t-1})\, p(x_{1:t} \mid y_{1:t-1}) \\
&= p(y_t \mid x_t)\, p(x_t \mid x_{1:t-1}, y_{1:t-1})\, p(x_{1:t-1} \mid y_{1:t-1}) \\
&= p(y_t \mid x_t)\, p(x_t \mid x_{t-1})\, \pi_{t-1}(x_{1:t-1})
\end{aligned}
$$

$$
u_t = \frac{\pi_t(x_{1:t})}{\pi_{t-1}(x_{1:t-1})\, q_t(x_t \mid x_{1:t-1})} \propto \frac{p(y_t \mid x_t)\, p(x_t \mid x_{t-1})\, \pi_{t-1}(x_{1:t-1})}{\pi_{t-1}(x_{1:t-1})\, q_t(x_t \mid x_{1:t-1})} = \frac{p(y_t \mid x_t)\, p(x_t \mid x_{t-1})}{q_t(x_t \mid x_{1:t-1})}
$$

Trial function = system model: $q_t(x_t \mid x_{1:t-1}) = p(x_t \mid x_{t-1})$

Incremental weight = observation model: $u_t \propto p(y_t \mid x_t)$
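When the system model is used as the trial function, the incremental weight reduces to the likelihood $p(y_t \mid x_t)$, and the filter becomes a loop of sampling, weighting, and resampling. A minimal sketch for a scalar random-walk model with Gaussian observation noise follows; the model, the particle count, the noise levels, and the function name are assumptions, and there is no safeguard against all weights underflowing to zero.

```python
import math
import random


def bootstrap_filter(ys, n_particles, tau, sigma, rng):
    """Bootstrap (Monte Carlo) filter for x_t = x_{t-1} + v_t, y_t = x_t + w_t.

    Returns the filtered mean E[x_t | y_{1:t}] at each time step.
    """
    particles = [rng.gauss(0.0, 1.0) for _ in range(n_particles)]
    means = []
    for y in ys:
        # Sampling: draw from the system model p(x_t | x_{t-1}).
        particles = [x + rng.gauss(0.0, tau) for x in particles]
        # Weighting: incremental weight u_t proportional to p(y_t | x_t).
        weights = [math.exp(-0.5 * ((y - x) / sigma) ** 2) for x in particles]
        total = sum(weights)
        means.append(sum(w * x for w, x in zip(weights, particles)) / total)
        # Resampling: draw particles with probability proportional to the weights.
        particles = rng.choices(particles, weights=weights, k=n_particles)
    return means


filtered_means = bootstrap_filter([1.0] * 30, 500, 0.2, 0.5, random.Random(0))
```

Nothing in the loop relies on linearity or Gaussianity: replacing the two marked lines with any system model that can be simulated and any observation density that can be evaluated gives a filter for that model.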

ISM37

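Simplified version using the predictive distribution as the trial function

$$
\begin{aligned}
p(I_{1:t} \mid y_{1:t}) &= \frac{p(I_{1:t}, y_t \mid y_{1:t-1})}{p(y_t \mid y_{1:t-1})} \\
&\propto p(y_t \mid I_{1:t}, y_{1:t-1})\, p(I_{1:t} \mid y_{1:t-1}) \\
&= p(y_t \mid I_t, I_{1:t-1}, y_{1:t-1})\, p(I_t \mid I_{1:t-1}, y_{1:t-1})\, p(I_{1:t-1} \mid y_{1:t-1})
\end{aligned}
$$

For the $j$-th particle:

$$
p(I^{(j)}_{1:t} \mid y_{1:t}) \propto p(y_t \mid I^{(j)}_t, I^{(j)}_{1:t-1}, y_{1:t-1})\, p(I^{(j)}_t \mid I^{(j)}_{1:t-1}, y_{1:t-1})\, p(I^{(j)}_{1:t-1} \mid y_{1:t-1})
$$

$p(I^{(j)}_t \mid I^{(j)}_{1:t-1}, y_{1:t-1})$: to Sampling

$p(y_t \mid I^{(j)}_t, I^{(j)}_{1:t-1}, y_{1:t-1})$: to Resampling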
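Putting the pieces together, each particle can carry an indicator $I_t$ together with a Kalman filter mean and variance: the indicator is sampled from its Markov prior, the particle is weighted by the predictive likelihood $p(y_t \mid I_t, \mathrm{KF}_{t-1})$, and the Kalman update is done in closed form. The sketch below does this for a scalar local-level model with regime-dependent observation noise; the model, the zero-based regime labels, the initial variance, and the function names are assumptions, and the code is meant as an outline of the idea rather than a reproduction of the method on the slides.

```python
import math
import random


def normal_pdf(y, mean, var):
    """Density of N(mean, var) at y."""
    return math.exp(-0.5 * (y - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def mixture_kalman_filter(ys, P, obs_sd, tau, n_particles, rng):
    """Rao-Blackwellized particle filter for x_t = x_{t-1} + v_t, y_t = x_t + D[I_t] w_t.

    Each particle is (I_t, x_{t|t}, V_{t|t}). Returns, per time step, the
    estimated probabilities Pr(I_t = k | y_{1:t}) for each regime k.
    """
    n_regimes = len(P)
    particles = [(rng.randrange(n_regimes), 0.0, 10.0) for _ in range(n_particles)]
    probs = []
    for y in ys:
        proposed, weights = [], []
        for regime, x, V in particles:
            # Sampling: draw I_t from the Markov prior p(I_t | I_{t-1}).
            new_regime = rng.choices(range(n_regimes), weights=P[regime])[0]
            # Kalman prediction given the sampled regime (F = G = H = 1).
            x_pred, V_pred = x, V + tau ** 2
            S = V_pred + obs_sd[new_regime] ** 2
            # Weight: predictive likelihood p(y_t | I_t, KF_{t-1}).
            weights.append(normal_pdf(y, x_pred, S))
            # Kalman update in closed form.
            K = V_pred / S
            proposed.append((new_regime, x_pred + K * (y - x_pred), (1.0 - K) * V_pred))
        total = sum(weights)
        probs.append([sum(w for w, p in zip(weights, proposed) if p[0] == k) / total
                      for k in range(n_regimes)])
        # Resampling according to the weights.
        particles = rng.choices(proposed, weights=weights, k=n_particles)
    return probs
```

A usage example would feed it a series whose noise level changes partway through and inspect the returned regime probabilities; only the discrete indicator is sampled, so comparatively few particles are needed.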

ISM38

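Target Tracking Problem: multiple models

1. Constant Position Model: $\dfrac{d s_{t,x}}{dt} = w^{cv}_t$

2. Constant Velocity Model: $\dfrac{d v_{t,x}}{dt} = w^{cv}_t$

3. Constant Acceleration Model: $\dfrac{d a_{t,x}}{dt} = w^{ca}_t$

4. Constant Jerk Model: $\dfrac{d \nabla a_{t,x}}{dt} = w^{cj}_t$

[Figure: axes $x$, $y$, and $t$; observation $y_{t,x}$]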

ISM39

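Target Tracking Problem: state vector

$x_t = (s_{t,x}, v_{t,x}, a_{t,x}, s_{t,y}, v_{t,y}, a_{t,y})'$: position, velocity, acceleration

$I_t = [I_{t,x}, I_{t,y}]'$, $u_t = [u_{t,x}, u_{t,y}]'$

$x_t = F_{I_t} x_{t-1} + G_{I_t} u_t$

Constant Velocity Model: $\dfrac{d v_{t,x}}{dt} = u^{cv}_t$,
$F_{I_{t,x}=1} = \begin{bmatrix} 1 & \Delta t \\ 0 & 1 \end{bmatrix}$

Constant Acceleration Model: $\dfrac{d a_{t,x}}{dt} = u^{ca}_t$,
$F_{I_{t,x}=2} = \begin{bmatrix} 1 & \Delta t & (\Delta t)^2/2 \\ 0 & 1 & \Delta t \\ 0 & 0 & 1 \end{bmatrix}$

Differences in variance can be expressed through $G$.

$x_t = [s_{t,x}, v_{t,x}, a_{t,x}, \nabla a_{t,x}]'$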
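The transition matrices of the constant velocity and constant acceleration models follow directly from integrating the kinematics over one time step $\Delta t$. A small sketch for one coordinate, with function names chosen for illustration:

```python
def constant_velocity_F(dt):
    """Transition matrix for the state (position, velocity)."""
    return [[1.0, dt],
            [0.0, 1.0]]


def constant_acceleration_F(dt):
    """Transition matrix for the state (position, velocity, acceleration)."""
    return [[1.0, dt, 0.5 * dt * dt],
            [0.0, 1.0, dt],
            [0.0, 0.0, 1.0]]


def propagate(F, state):
    """Noise-free one-step prediction F x_{t-1}."""
    return [sum(f * s for f, s in zip(row, state)) for row in F]


# Position 0, velocity 1, acceleration 1, propagated over dt = 2.
next_state = propagate(constant_acceleration_F(2.0), [0.0, 1.0, 1.0])
```

Since the two models have states of different length, switching between them requires a rule for the components that one model lacks, for instance how to initialize the acceleration when moving from the first model to the second.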

ISM

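Exchanging information between state vectors of different dimensions

Constant position model ⇒ constant velocity model

Constant position model: $s_{t,x} = s_{t-1,x} + w_{t,s_x}$, $w_{t,s_x} \sim N(0, \tau_{s_x}^2)$, $v_{t,x} = 0$, $a_{t,x} = 0$, $\nabla a_{t,x} = 0$

Constant velocity model: $s_{t,x} = s_{t-1,x} + v_{t-1,x} + \tfrac{1}{2} w_{t,v_x}$, $a_{t,x} = 0$, $\nabla a_{t,x} = 0$

$v_{t,x} = v_{t-1,x} + w_{t,v_x}$, $w_{t,v_x} \sim N(0, \tau_{v_x}^2)$

1) Set the velocity to 0

2) Resample from the initial distribution of the constant velocity model

3) Resample from the constant velocity model distribution of one step earlier:

$v^{(i)}_{t-1,x} \sim \bigl\{\, v^{(j)}_{t-1,x|t-1} \;\big|\; I^{(j)}_{t-1} = \text{constant velocity model} \,\bigr\}_{j=1}^{N}$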

41ISM

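Procedure for the simplified online Model Averaging

Fixed-Lag Smoother with Model Averaging

$P_t \equiv \{x_{t|t-1}, V_{t|t-1}\}$, $S_t \equiv \{x_{t|t+1}, V_{t|t+1}\}$

$j$-th particle ($j = 1, \ldots, m$): $I^{(j)}_{t-L|t-1},\; I^{(j)}_{(t-L+1)|t-1},\; \ldots,\; I^{(j)}_{t-1|t-1},\; I^{(j)}_{t|t-1}$, with $\mathrm{KF}^{(j)}_{t-1}$ and $\mathrm{KF}^{(j)}_{t}$

$\hat{I}_{(t-L+1)}$: mode of $\{I^{(j)}_{(t-L+1)}\}_{j=1}^{m}$

[Figure: $y_{(t-L+1)}, \ldots, y_t$; $\hat{P}_{(t-L+1)}$, $\widehat{\mathrm{KF}}_{(t-L+1)}$, $\widehat{\mathrm{KF}}_{(t-L+2)}$, $\hat{S}_{(t-L+2)}$]

$\hat{S}$: estimate of the state vector $x$ given the data up to the next time point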

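42ISM

TESD: the fourth science, the fourth methodology

T: Theory

E: Experiment

S: Simulation

D: Large-scale data processing

Deductive

Inductive

Data assimilation

The driving force of science

Prediction and control

It moves forward, but control is needed over which way it goes

Modeling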

43ISM

Published June 2007

Real-World Innovation through Bayesian Modeling

From global models to local models: state space models and simulation

Tomoyuki Higuchi

This month's issue
