
Learning with Intractable Inference and Partial Supervision

Jacob Steinhardt

Stanford University

jsteinhardt@cs.stanford.edu

September 8, 2015


Motivation

An Example

Company officials refused to comment. → 公司官员拒绝对此发表评论。

He said the company would appeal. → 他表示该公司将提出上诉。

Statistical reasoning: aggregate data across sentences to reach conclusions.

Computational reasoning: focus on easily disambiguated words first.

Tension: statistics wants to expose information (aggregation), while computer science wants to hide it (abstraction, adaptivity).

Statistical inference is computationally intractable.

How can we bring these two paradigms together?

Formal Setting

1 Motivation

2 Formal Setting

3 Reified Context Models

4 Relaxed Supervision

5 Open Questions


Formal Setting

Setting: Structured Prediction

input x : (image; not shown in this transcript)
output y : v o l c a n i c

Goal: learn θ to maximize E_{x,y∼D}[log p_θ(y | x)]

Structured output space Y — requires inference

Formal Setting

Supervised Learning is Easy

Recall: want to maximize E[log p_θ(y | x)].

Suppose p_θ(y | x) ∝ exp(θ⊤φ(x, y)). Then:

∇_θ log p_θ(y | x) = φ(x, y) − E_{ỹ∼p_θ(·|x)}[φ(x, ỹ)],

where the first term is given and the second term requires inference.

Inference errors will be corrected by the supervision signal φ(x, y) over the course of learning.

In practice, anything reasonable (MCMC, beam search) works.

Conceptually, can use Searn (Daumé III et al., 2009) or pseudolikelihood (Besag, 1975) to obviate the need for inference.

Approximate inference is easy in supervised settings.

Unless we care about estimating uncertainty (calibration, precision/recall).
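A minimal numerical sketch of the gradient above, assuming a toy log-linear model whose output space is small enough to enumerate exactly; the feature function phi and the output space Y below are made up for illustration, and in real structured prediction the expectation term is the part that needs (approximate) inference.

```python
# Hypothetical sketch: gradient of a log-linear model log p_theta(y | x) over a
# small, explicitly enumerated output space Y. In real structured prediction Y is
# exponentially large, so the expectation is approximated (MCMC, beam search), but
# the "given minus inference" structure of the gradient is the same.
import numpy as np

def phi(x, y, dim=4):
    """Toy feature map (stand-in for a real feature function)."""
    rng = np.random.default_rng(abs(hash((x, y))) % (2**32))
    return rng.standard_normal(dim)

def supervised_gradient(theta, x, y_true, Y):
    """grad log p_theta(y_true | x) = phi(x, y_true) - E_{y ~ p_theta(.|x)}[phi(x, y)]."""
    scores = np.array([theta @ phi(x, y) for y in Y])
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()                              # p_theta(y | x) for each y in Y
    expected_phi = sum(p * phi(x, y) for p, y in zip(probs, Y))
    return phi(x, y_true) - expected_phi              # "given" term minus "inference" term

theta = np.zeros(4)
Y = ["volcanic", "volcanos", "vacation"]              # toy output space
print(supervised_gradient(theta, x="img_017", y_true="volcanic", Y=Y))
```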

Formal Setting

Partially Supervised Structured Prediction

input x : Company officials refused to comment.
latent z : (unobserved)
output y : 公司官员拒绝对此发表评论。

Goal: learn θ to maximize E_{x,y∼D}[log p_θ(y | x)], where p_θ(y | x) = ∑_z p_θ(y, z | x).

Again assume p_θ(y, z | x) ∝ exp(θ⊤φ(x, z, y)). Then

∇_θ log p_θ(y | x) = E_{z∼p_θ(·|x,y)}[φ(x, z, y)] − E_{z̃,ỹ∼p_θ(·|x)}[φ(x, z̃, ỹ)],

where the first expectation requires inference on z and the second requires inference on (z, y).

Inference errors on z get reinforced during learning.

Inference is often hardest (and most consequential) at the beginning of learning!
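For contrast with the fully supervised case, here is a hedged sketch of the same gradient when part of the structure is latent, again over toy spaces small enough to enumerate; phi, Z, and Y are illustrative placeholders. The first term conditions on the observed y (a posterior over z), the second does not.

```python
# Hypothetical sketch: gradient of log p_theta(y | x) = log sum_z p_theta(y, z | x)
# for a log-linear joint model, over small enumerable spaces. The first expectation
# is over the posterior p_theta(z | x, y) ("inference on z"); the second is over
# p_theta(z, y | x) ("inference on z, y"). Early mistakes in the first term are
# exactly what gets reinforced during learning.
import itertools
import numpy as np

def phi(x, z, y, dim=4):
    rng = np.random.default_rng(abs(hash((x, z, y))) % (2**32))
    return rng.standard_normal(dim)

def latent_gradient(theta, x, y_obs, Z, Y):
    pairs = list(itertools.product(Z, Y))
    scores = np.array([theta @ phi(x, z, y) for z, y in pairs])
    p_joint = np.exp(scores - scores.max())
    p_joint /= p_joint.sum()                                   # p_theta(z, y | x)

    mask = np.array([float(y == y_obs) for _, y in pairs])
    p_post = p_joint * mask
    p_post /= p_post.sum()                                     # p_theta(z | x, y_obs)

    e_post = sum(p * phi(x, z, y) for p, (z, y) in zip(p_post, pairs))
    e_full = sum(p * phi(x, z, y) for p, (z, y) in zip(p_joint, pairs))
    return e_post - e_full

theta = np.zeros(4)
print(latent_gradient(theta, x="sent_1", y_obs="y1", Z=["z1", "z2"], Y=["y1", "y2"]))
```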

Formal Setting

This Work

Two thrusts:

1 How can we reify computation as part of a statistical model?
2 How can we relax the supervision signal to aid computation while still maintaining consistent parameter estimates?

Formal Setting

Related Work

Learning tractable models / accounting for approximations:

sum-product networks (Poon & Domingos, 2011)

max-violation perceptron (Huang, Fayong, & Guo, 2012; Zhang et al., 2013; Yu et al., 2013)

fast-mixing Markov chains (S. & Liang, 2015)

many others (Barbu, 2009; Daumé III, Langford, & Marcu, 2009; Domke, 2011; Stoyanov, Ropson, & Eisner, 2011; Niepert & Domingos, 2014; Li & Zemel, 2014; Shi, S., & Liang, 2015)

Improving expressivity of variational inference:

combining with MCMC (Salimans, Kingma, & Welling, 2015)

using neural networks (Kingma & Welling, 2013; Mnih & Gregor, 2014)

Computational-statistical tradeoffs:

huge body of recent work (Berthet & Rigollet, 2013; Chandrasekaran & Jordan, 2013; Zhang et al., 2013; Zhang, Wainwright, & Jordan, 2014; Christiano, 2014; Daniely, Linial, & Shalev-Shwartz, 2014; Garg, Ma, & Nguyen, 2014; Shamir, 2014; Braverman et al., 2015; S. & Duchi, 2015; S., Valiant, & Wager, 2015)


Reified Context Models

Structured Prediction Task

input x : (image; not shown in this transcript)
output y : v o l c a n i c

Reified Context Models

Contexts Are Key

For the output v o l c a ...:

DP: contexts v, *o, **l, ***c (keep only the most recent character)

beam search: contexts v, vo, vol, volc (keep full prefixes)

Key idea: contexts! A wildcard context stands for a set of prefixes, e.g.

*o ≝ {ao, bo, co, ...}
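As a concrete reading of the wildcard notation just introduced, here is a tiny sketch (plain Python, illustrative only) of when a context such as *o matches a prefix.

```python
# Minimal sketch of the wildcard-context idea: a context like "*o" stands for the
# set {ao, bo, co, ...}, i.e. any prefix of the same length whose known (non-'*')
# positions agree. Contexts and prefixes are plain strings here.
def matches(context: str, prefix: str) -> bool:
    """True if `prefix` is one of the strings the wildcard `context` stands for."""
    return len(context) == len(prefix) and all(
        c == "*" or c == p for c, p in zip(context, prefix)
    )

assert matches("*o", "vo") and matches("*o", "bo")
assert not matches("*o", "va")
assert matches("***c", "volc")   # DP-style context: only the last character is known
assert matches("volc", "volc")   # beam-search-style context: the full prefix is known
print("ok")
```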

Reified Context Models

Desiderata

Short contexts (e.g. r, *o, **l, ***c):
coverage
better uncertainty estimates (precision)
stabler partially supervised learning updates

Long contexts (e.g. r, ro, rol, rolc):
expressivity
capture complex dependencies

Mixing long contexts with wildcarded fallbacks (e.g. the position-4 set {*olc, ***c, ***r, ****}): best of both worlds.

Reified Context Models

Reifying Contexts

input x : (image; not shown in this transcript)
output y : v o l c a n i c
context c: v *o *ol *olc · · ·

"Context sets" (one set of candidate contexts per position):

C1 = {r, v, y, *}
C2 = {ro, ra, *o, **}
C3 = {rol, ral, *ol, ***}
C4 = {*olc, ***c, ***r, ****}

Challenge: how to trade off contexts of different lengths?
⇒ Reify contexts as part of the model!

Reified Context Models

Reified Context Models

Given:

context sets C1, ..., CL
features φ_i(c_{i−1}, y_i)

Define the model

p_θ(y_{1:L}, c_{1:L−1}) ∝ exp( ∑_{i=1}^{L} θ⊤ φ_i(c_{i−1}, y_i) ) · κ(y, c),

where the factor κ(y, c) enforces consistency between the output sequence and the contexts.

Graphical model structure: a chain over Y1, ..., Y5 and C1, ..., C4, with consistency factors κ and feature factors φ2, ..., φ5 — inference via forward-backward!
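To make "inference via forward-backward" concrete, here is a hedged sketch of the forward pass under one simplifying assumption that is not spelled out on the slide: the consistency factor κ deterministically maps the previous context plus the next character to the most specific matching context in C_i. Contexts are wildcard strings as on the earlier slides, and score_fn stands in for θ⊤φ_i(c_{i−1}, y_i).

```python
# Hypothetical sketch of the forward pass over a reified context model, under a
# simplifying assumption about kappa: each (previous context, next character) pair
# is summarized by the most specific matching context in C_i. All names here are
# illustrative; this is not the talk's implementation.
import math
from collections import defaultdict

def coarsens(context, pattern):
    """`context` never claims to know more than `pattern` does."""
    return len(context) == len(pattern) and all(
        c == "*" or c == p for c, p in zip(context, pattern)
    )

def next_context(c_prev, y, C_i):
    pattern = c_prev + y
    candidates = [c for c in C_i if coarsens(c, pattern)]
    return min(candidates, key=lambda c: c.count("*"))   # most specific candidate

def forward(context_sets, alphabet, score_fn):
    """alpha[i][c] = log of the total unnormalized mass of prefixes summarized by c."""
    alpha = [{"": 0.0}]                                   # empty context before position 1
    for i, C_i in enumerate(context_sets, start=1):
        new = defaultdict(lambda: -math.inf)
        for c_prev, logm in alpha[-1].items():
            for y in alphabet:
                c = next_context(c_prev, y, C_i)
                s = logm + score_fn(i, c_prev, y)
                # naive log-sum-exp; a real implementation would do this stably
                new[c] = s if new[c] == -math.inf else math.log(math.exp(new[c]) + math.exp(s))
        alpha.append(dict(new))
    return alpha

# Toy run: three positions, alphabet {a, b}, uniform scores. Every context set
# contains a full-wildcard fallback, so coverage is never lost.
context_sets = [["a", "*"], ["*a", "**"], ["**a", "***"]]
alpha = forward(context_sets, alphabet="ab", score_fn=lambda i, c, y: 0.0)
log_Z = math.log(sum(math.exp(v) for v in alpha[-1].values()))
print(round(log_Z, 3))   # log(2**3) ~ 2.079 under uniform scores
```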

Reified Context Models

Adaptive Context Selection

Select the context sets C_i during the forward pass of inference.

Greedily select the contexts with the largest mass.

Example (from the slide's figure): at position 1, from the candidates {a, b, c, d, e, ...}, keep c and e as explicit contexts and collapse the rest into a wildcard, giving C1 = {c, e, ?}; at position 2, from the candidates {ca, cb, ..., ea, eb, ..., ?a, ...}, keep ca and ?a, giving C2 = {ca, ?a, ??}; etc.

Biases towards short contexts unless there is high confidence.
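A small illustrative sketch of this greedy step (with * written for the slide's ? wildcard); the budget K, the particular coarsening rule, and the numbers are assumptions for the example, not details from the talk.

```python
# Hypothetical sketch of greedy context selection: keep the K highest-mass
# candidate contexts as explicit contexts, and summarize everything else with
# coarser wildcard contexts so that no probability mass is ever dropped.
from collections import defaultdict

def select_contexts(mass, K):
    """mass: dict mapping a fully specified candidate context -> probability mass."""
    keep = set(sorted(mass, key=mass.get, reverse=True)[:K])
    leftovers = defaultdict(float)
    for c, m in mass.items():
        if c not in keep:
            leftovers["*" * (len(c) - 1) + c[-1]] += m    # remember only the last character
    selected = keep | set(leftovers)
    selected.add("*" * len(next(iter(mass))))             # full-wildcard fallback
    return selected

mass = {"ca": 0.55, "ea": 0.20, "cb": 0.10, "eb": 0.10, "da": 0.05}
print(select_contexts(mass, K=1))    # e.g. {'ca', '*a', '*b', '**'}
```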

Reified Context Models

Precision

input x : (image; not shown in this transcript)
output y : v o l c a n i c

The model assigns a probability to each prediction, so it can predict on the most confident subset.

Measure precision (# of correct words) vs. recall (# of words predicted).

Comparison: beam search.

[Figure: precision vs. recall curves for Word Recognition, comparing beam search and RCM; precision axis from 0.86 to 1.00, recall axis from 0.0 to 1.0.]
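For clarity, here is the computation behind a precision/recall curve of this kind; the confidences and correctness flags below are made-up data, not the talk's results.

```python
# Sketch of the precision/recall trade-off: only answer on predictions whose model
# confidence clears a threshold, then sweep the threshold to trace out a curve.
def precision_recall(predictions, threshold):
    """predictions: list of (confidence, is_correct). Returns (precision, recall)."""
    answered = [correct for conf, correct in predictions if conf >= threshold]
    if not answered:
        return 1.0, 0.0
    return sum(answered) / len(answered), len(answered) / len(predictions)

preds = [(0.99, True), (0.95, True), (0.90, True), (0.70, False), (0.60, True)]
for t in (0.5, 0.8, 0.97):
    p, r = precision_recall(preds, t)
    print(f"threshold={t}: precision={p:.2f}, recall={r:.2f}")
```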

Reified Context Models

Partially Supervised Learning

Decipherment task:

cipher: am ↦ 5, I ↦ 13, what ↦ 54, ...
latent z : I am what I am
output y : 13 5 54 13 5

Goal: determine the cipher.

Fit a 2nd-order HMM with EM, using RCMs for the approximate E-step.
Use the learned emissions to determine the cipher.
Again compare to beam search (Nuhn et al., 2013).

Fraction of correctly mapped words:

[Figure: mapping accuracy (0.0 to 0.8) vs. training passes (0 to 20) on Decipherment, comparing RCM and beam search.]
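The training loop described above has a simple shape; the skeleton below is a hedged sketch of it, with placeholder functions for the approximate E-step and the M-step rather than any API from the talk.

```python
# Hedged skeleton of EM with an intractable E-step replaced by an approximation
# (beam search, or an RCM-style context model). approx_e_step and m_step are
# stand-ins: the first returns expected sufficient statistics for one cipher
# sequence, the second re-estimates transition/emission parameters from them.
def em(init_params, data, approx_e_step, m_step, num_passes=20):
    params = init_params
    for _ in range(num_passes):
        stats = [approx_e_step(params, y) for y in data]   # approximate E-step
        params = m_step(stats)                             # exact M-step
    return params

# Toy call showing the shapes involved (contents are placeholders).
toy_data = [["13", "5", "54", "13", "5"]]
params = em(
    init_params={},
    data=toy_data,
    approx_e_step=lambda p, y: {("I", "13"): 2, ("am", "5"): 2, ("what", "54"): 1},
    m_step=lambda stats: {"emission_counts": stats[0]},
)
print(params)
```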

Reified Context Models

Contexts During Training

Context lengths increase smoothly during training:

[Figure: average context length (roughly 1.5 to 4.5) vs. number of passes (0 to 20) on Decipherment, annotated with example contexts progressing from ****** to ***ing to idding.]

Start of training: little information, short contexts.
End of training: lots of information, long contexts.

Reified Context Models

Discussion

RCMs provide both expressivity and coverage, which enable:

More accurate uncertainty estimates (precision)

Better partially supervised learning updates

Reproducible experiments on Codalab: codalab.org/worksheets



Relaxed Supervision

Intractable Supervision

Sometimes, even supervision is intractable:

input x : What is the largest city in California?
latent z : argmax(λx. CITY(x) ∧ LOC(x, CA), λx. POPULATION(x))
output y : Los Angeles

Intractable no matter how simple the model is!

but there are likely statistical relationships (e.g. between CITY and Los Angeles)

Need a way to relax the likelihood.

while maintaining good statistical properties (asymptotic consistency)

Relaxed Supervision

Approach

[Figure: the (θ, β) parameter space, divided into a tractable region and an intractable region.]

Start with intractable likelihood q(y | z), model family p_θ(z | x).

Replace q(y | z) with a family of likelihoods q_β(y | z) (some very easy).

Derive constraints on (θ, β) that ensure tractability.

Learn within the tractable region.

Relaxed Supervision

Relaxed Supervision: Example

input x : Company officials refused to comment.
latent z : (unobserved)
output y : 公司官员拒绝对此发表评论。

Idea: instead of requiring the model's output ỹ to match the observed output y exactly, penalize based on some weighted distance dist_β(ỹ, y):

ℓ(θ, β; x, y) = −log( ∑_{z, ỹ} p_θ(z, ỹ | x) exp(−dist_β(ỹ, y)) )

As β → ∞, we recover the original objective.

but optimizing will send β → 0!

Two questions:

How to create natural pressure to increase β?

How to define distances for general problems?
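A toy numerical illustration (not from the talk's experiments) of why optimizing the relaxed objective alone drives β down: with a single exact-match penalty, the loss equals the exact negative log-likelihood as β → ∞ and collapses to 0 for every model as β → 0.

```python
# Toy illustration of the relaxed objective
#   l(theta, beta; x, y) = -log sum_{z, y~} p_theta(z, y~ | x) exp(-dist_beta(y~, y)),
# using the single penalty dist_beta(y~, y) = beta * [y~ != y] and a frozen model
# p_theta(z | x) with deterministic y~ = f(z). All numbers are made up.
import math

p_z = {"z1": 0.6, "z2": 0.3, "z3": 0.1}     # p_theta(z | x)
f = {"z1": "A", "z2": "B", "z3": "A"}       # deterministic output of each z
y_obs = "B"

def relaxed_loss(beta):
    total = sum(p * math.exp(-beta * (f[z] != y_obs)) for z, p in p_z.items())
    return -math.log(total)

exact_nll = -math.log(sum(p for z, p in p_z.items() if f[z] == y_obs))
for beta in (0.0, 1.0, 5.0, 50.0):
    print(f"beta={beta:5.1f}  relaxed={relaxed_loss(beta):.4f}  exact={exact_nll:.4f}")
```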

Relaxed Supervision

Relaxed Supervision: Formal Framework

Assume (WLOG) that z → y is deterministic: y = f(z).

Let S(z, y) ∈ {0, 1} encode the constraint [f(z) = y].

Take projections π_j : Y → Y_j, j = 1, ..., k.

Let S_j(z, y) = [π_j(f(z)) = π_j(y)] be the projected constraint.

Define the distance function:

dist_β(z, y) = ∑_{j=1}^{k} β_j · (1 − S_j(z, y)).

Note: can featurize dist_β as −β⊤ψ(z, y), where ψ_j = S_j − 1.

Lemma. Suppose that π_1 × ··· × π_k is injective. Then

S(z, y) = ∧_{j=1}^{k} S_j(z, y).
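The definitions above translate almost line for line into code; the sketch below assumes finite projections given as plain functions, purely for illustration.

```python
# Direct transcription of the slide's definitions:
#   S_j(z, y) = [pi_j(f(z)) == pi_j(y)],
#   dist_beta(z, y) = sum_j beta_j * (1 - S_j(z, y)),
#   psi_j(z, y) = S_j(z, y) - 1, so that dist_beta = -beta . psi.
def dist_beta(beta, projections, f, z, y):
    """beta: list of weights; projections: list of functions pi_j applied to outputs."""
    S = [int(pi(f(z)) == pi(y)) for pi in projections]
    return sum(b * (1 - s) for b, s in zip(beta, S))

def psi(projections, f, z, y):
    return [int(pi(f(z)) == pi(y)) - 1 for pi in projections]

# Toy check: projections onto the two characters of a length-2 string, with a toy
# deterministic map f. Their product is injective, so dist is 0 iff f(z) == y.
projections = [lambda s: s[0], lambda s: s[1]]
f = lambda z: z.upper()
print(dist_beta([1.0, 2.0], projections, f, z="ab", y="AB"))   # 0.0: all constraints hold
print(dist_beta([1.0, 2.0], projections, f, z="ab", y="AC"))   # 2.0: second projection violated
```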

Relaxed Supervision

Example: Unordered Supervision

input x : a b a a
latent z : d c d d
output y : {c : 1, d : 3}

Let count(·, j) count the number of occurrences of character j.

Decomposition:

[y = multiset(z)]  ⇐⇒  ∧_{j=1}^{V} [count(z, j) = count(y, j)],

where the left-hand side is the exact constraint S(z, y) (with f(z) = multiset(z)) and each conjunct on the right is a projected constraint S_j(z, y) (with π_j(y) = count(y, j)).
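A tiny sketch of this decomposition using Python's Counter; the vocabulary and strings are illustrative.

```python
# Sketch of unordered supervision: y is the multiset of characters of z, and each
# projection just counts one character. Counting every vocabulary character is a
# jointly injective set of projections, so the conjunction recovers the exact
# constraint, as in the slide's equivalence.
from collections import Counter

def S(z, y):                         # exact constraint: y = multiset(z)
    return Counter(z) == Counter(y)

def S_j(z, y, j):                    # projected constraint for character j
    return Counter(z)[j] == Counter(y)[j]

vocab = "abcd"
y = Counter({"c": 1, "d": 3})
for z in ("dcdd", "dccd"):
    print(z, S(z, y), all(S_j(z, y, j) for j in vocab))
# dcdd True True    (constraint holds, every projection holds)
# dccd False False  (constraint fails, and some projection catches it)
```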

Relaxed Supervision

Example: Conjunctive Semantic Parsing

Side information: predicates {Q_1, ..., Q_m}, e.g. Q_6 = [DOG] = the set of all dogs.

input x : brown dog (input utterance)
latent z : (Q_11, Q_6) (set of all brown objects, set of all dogs)
output y : Q_11 ∩ Q_6 (denotation, observed as a set)

For z = (Q_{j1}, ..., Q_{jL}), define the denotation ⟦z⟧ = Q_{j1} ∩ ··· ∩ Q_{jL}.

Decomposition:

[y = ⟦z⟧]  ⇐⇒  ∧_{j=1}^{m} ( I[⟦z⟧ ⊆ Q_j] = I[y ⊆ Q_j] ),

where the left-hand side is the exact constraint S(z, y) and each conjunct on the right is a projected constraint S_j(z, y) (with π_j(y) = I[y ⊆ Q_j]).

Relaxed Supervision

Normalization Constant

Create pressure to increase β by adding a normalization constant:

$$q_\beta(y \mid z) = \exp\!\Big(\underbrace{\beta^{\top}\psi(z, y)}_{-\,\mathrm{dist}_\beta(z,\,y)} - A(\beta)\Big)$$

$$\ell(\theta, \beta; x, y) = -\log\Big(\sum_z p_\theta(z \mid x)\, q_\beta(y \mid z)\Big).$$

Lemma

Given $\pi_1, \ldots, \pi_k$, let $A(\beta) \stackrel{\text{def}}{=} \sum_{j=1}^{k} \log\!\big(1 + (|Y_j| - 1)\exp(-\beta_j)\big)$. Then $\sum_y \exp(-\mathrm{dist}_\beta(z, y)) \le \exp(A(\beta))$ for all $z$.

Lemma

Jointly minimizing $L(\theta, \beta) = \mathbb{E}[\ell(\theta, \beta; x, y)]$ yields a consistent estimate of the true parameters $\theta^{*}$.

J. Steinhardt (Stanford) Learning and Inference September 8, 2015 27 / 31
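The following sketch transcribes these formulas for a toy problem (illustrative only; it enumerates a tiny set of candidate latents, and all names such as `dist_beta`, `A_beta`, and `relaxed_loss` are mine): it computes dist_β, A(β), q_β, and the relaxed loss ℓ(θ, β; x, y).

```python
import math

def dist_beta(beta, violated):
    """dist_beta(z, y) = sum_j beta_j * 1[S_j(z, y) violated] = -beta^T psi(z, y)."""
    return sum(b for b, v in zip(beta, violated) if v)

def A_beta(beta, Y_sizes):
    """A(beta) = sum_j log(1 + (|Y_j| - 1) * exp(-beta_j))."""
    return sum(math.log(1.0 + (s - 1) * math.exp(-b)) for b, s in zip(beta, Y_sizes))

def relaxed_loss(p_z_given_x, violations, beta, Y_sizes):
    """ell(theta, beta; x, y) = -log sum_z p_theta(z | x) * q_beta(y | z),
    where q_beta(y | z) = exp(-dist_beta(z, y) - A(beta))."""
    A = A_beta(beta, Y_sizes)
    total = sum(p * math.exp(-dist_beta(beta, violations[z]) - A)
                for z, p in p_z_given_x.items())
    return -math.log(total)

# Toy problem: two candidate latents, two relaxed constraints, |Y_j| = 3 for each j.
p_z_given_x = {"z_good": 0.6, "z_bad": 0.4}       # stands in for p_theta(z | x)
violations  = {"z_good": [False, False],          # z_good satisfies both S_j
               "z_bad":  [True,  False]}          # z_bad violates S_1
for b in (0.1, 1.0, 5.0):
    print(b, relaxed_loss(p_z_given_x, violations, [b, b], Y_sizes=[3, 3]))
```

Because A(β) shrinks toward 0 as β grows while the penalty on violated constraints grows, larger β lowers the loss only when p_θ puts its mass on latents that satisfy the constraints, which is the pressure to increase β described above.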


Relaxed Supervision

Constraints for Efficient Inference

Inference task:

$$\nabla_\theta \log p_\theta(y \mid x) = \underbrace{\mathbb{E}_{z \sim p_\theta(\cdot \mid x, y)}[\phi(x, z, y)]}_{\text{sample } z \text{ given } x, y} - \underbrace{\mathbb{E}_{z, y \sim p_\theta(\cdot \mid x)}[\phi(x, z, y)]}_{\text{sample } z \text{ given } x}.$$

$$p_{\theta,\beta}(z \mid x, y) \;\propto\; p_\theta(z \mid x)\, q_\beta(y \mid z) \;\propto\; p_\theta(z \mid x)\exp\!\big(\beta^{\top}\psi(z, y)\big).$$

Rejection sampler:

sample $z \sim p_\theta(\cdot \mid x)$, then accept with probability $\exp\!\big(\beta^{\top}\psi(z, y)\big)$.

Bound expected number of samples:

$$\sum_{(x, y) \in \mathrm{Data}} \Big(\sum_z p_\theta(z \mid x)\exp\!\big(\beta^{\top}\psi(z, y)\big)\Big)^{-1} \le \tau. \tag{1}$$

Ratio of normalization constants: can optimize subject to (1) (similar to CCCP).

J. Steinhardt (Stanford) Learning and Inference September 8, 2015 28 / 31
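A minimal sketch of such a rejection sampler, under a made-up toy setup (the names `sample_prior` and `accept_logweight` are my placeholders for p_θ(z | x) and β^⊤ψ(z, y)): propose z from the prior and accept with probability exp(β^⊤ψ(z, y)), which is at most 1 because each ψ_j is 0 or −1. The expected number of proposals per example is the inner quantity that (1) sums over the data and bounds by τ.

```python
import math, random

def rejection_sample(sample_prior, accept_logweight, y, max_tries=100_000, rng=random):
    """Draw z ~ p_{theta,beta}(. | x, y): propose z ~ p_theta(. | x) and
    accept with probability exp(beta^T psi(z, y)) <= 1."""
    for tries in range(1, max_tries + 1):
        z = sample_prior()
        if rng.random() < math.exp(accept_logweight(z, y)):
            return z, tries          # tries = number of proposals used
    raise RuntimeError("acceptance rate too low; consider smaller beta (cf. constraint (1))")

# Toy setup: two latents, one relaxed constraint, beta = 1.5.
beta = [1.5]

def sample_prior():
    return "z_good" if random.random() < 0.3 else "z_bad"

def accept_logweight(z, y):
    violated = [z != y]              # psi_1(z, y) = -1[z != y]
    return -sum(b for b, v in zip(beta, violated) if v)

random.seed(0)
draws = [rejection_sample(sample_prior, accept_logweight, y="z_good") for _ in range(1000)]
print(sum(t for _, t in draws) / len(draws))   # ~ (0.3 + 0.7 * exp(-1.5))**-1, about 2.2
```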



Relaxed Supervision

Experiments

Conjunctive semantic parsing:

[Figure: two panels over 50 training iterations comparing AdaptBeta(500) with FixedBeta(0.5), FixedBeta(0.2), and FixedBeta(0.1). Left panel: accuracy (0 to 1) vs. iteration. Right panel: number of samples (log scale, 10^0 to 10^5) vs. iteration.]

J. Steinhardt (Stanford) Learning and Inference September 8, 2015 29 / 31

Open Questions

1 Motivation

2 Formal Setting

3 Reified Context Models

4 Relaxed Supervision

5 Open Questions

J. Steinhardt (Stanford) Learning and Inference September 8, 2015 30 / 31

Open Questions

Scale up to larger tasks: semantic parsing, reinforcement learning, program induction

Extend to Bayesian models

Understand non-convex optimization

Metacomputation using Reified Context Models?

Probabilistic abstract interpretation

Statistics & Computation: still a long way to go

Thanks! 谢谢 (Thank you)

J. Steinhardt (Stanford) Learning and Inference September 8, 2015 31 / 31

