
Learning with Intractable Inference and Partial Supervision

Jacob Steinhardt

Stanford University

jsteinhardt@cs.stanford.edu

September 8, 2015


Motivation

An Example

Company officials refused to comment. → 公司官员拒绝对此发表评论。

He said the company would appeal. → 他表示该公司将提出上诉。

Statistical reasoning: aggregate data across sentences to reach conclusions.

Computational reasoning: focus on easily disambiguated words first.

Tension: statistics wants to expose information (aggregation), while computer science wants to hide it (abstraction, adaptivity).

Statistical inference is computationally intractable.

How can we bring these two paradigms together?

Formal Setting

1 Motivation

2 Formal Setting

3 Reified Context Models

4 Relaxed Supervision

5 Open Questions


Formal Setting

Setting: Structured Prediction

input x : (image; not shown in this transcript)
output y : v o l c a n i c

Goal: learn θ to maximize E_{x,y∼D}[log p_θ(y | x)]

Structured output space Y — requires inference

Formal Setting

Supervised Learning is Easy

Recall: want to maximize E[log p_θ(y | x)].

Suppose p_θ(y | x) ∝ exp(θ⊤φ(x, y)). Then:

∇_θ log p_θ(y | x) = φ(x, y) − E_{ỹ∼p_θ(·|x)}[φ(x, ỹ)],

where the first term is given and the second term requires inference.

Inference errors will be corrected by the supervision signal φ(x, y) over the course of learning.

In practice, anything reasonable (MCMC, beam search) works.

Conceptually, can use Searn (Daumé III et al., 2009) or pseudolikelihood (Besag, 1975) to obviate the need for inference.

Approximate inference is easy in supervised settings.

Unless we care about estimating uncertainty (calibration, precision/recall).
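A minimal numerical sketch of the gradient above, assuming a toy log-linear model whose output space is small enough to enumerate exactly; the feature function phi and the output space Y below are made up for illustration, and in real structured prediction the expectation term is the part that needs (approximate) inference.

```python
# Hypothetical sketch: gradient of a log-linear model log p_theta(y | x) over a
# small, explicitly enumerated output space Y. In real structured prediction Y is
# exponentially large, so the expectation is approximated (MCMC, beam search), but
# the "given minus inference" structure of the gradient is the same.
import numpy as np

def phi(x, y, dim=4):
    """Toy feature map (stand-in for a real feature function)."""
    rng = np.random.default_rng(abs(hash((x, y))) % (2**32))
    return rng.standard_normal(dim)

def supervised_gradient(theta, x, y_true, Y):
    """grad log p_theta(y_true | x) = phi(x, y_true) - E_{y ~ p_theta(.|x)}[phi(x, y)]."""
    scores = np.array([theta @ phi(x, y) for y in Y])
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()                              # p_theta(y | x) for each y in Y
    expected_phi = sum(p * phi(x, y) for p, y in zip(probs, Y))
    return phi(x, y_true) - expected_phi              # "given" term minus "inference" term

theta = np.zeros(4)
Y = ["volcanic", "volcanos", "vacation"]              # toy output space
print(supervised_gradient(theta, x="img_017", y_true="volcanic", Y=Y))
```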

Formal Setting

Partially Supervised Structured Prediction

input x : Company officials refused to comment.
latent z : (unobserved)
output y : 公司官员拒绝对此发表评论。

Goal: learn θ to maximize E_{x,y∼D}[log p_θ(y | x)], where p_θ(y | x) = ∑_z p_θ(y, z | x).

Again assume p_θ(y, z | x) ∝ exp(θ⊤φ(x, z, y)). Then

∇_θ log p_θ(y | x) = E_{z∼p_θ(·|x,y)}[φ(x, z, y)] − E_{z̃,ỹ∼p_θ(·|x)}[φ(x, z̃, ỹ)],

where the first expectation requires inference on z and the second requires inference on (z, y).

Inference errors on z get reinforced during learning.

Inference is often hardest (and most consequential) at the beginning of learning!
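For contrast with the fully supervised case, here is a hedged sketch of the same gradient when part of the structure is latent, again over toy spaces small enough to enumerate; phi, Z, and Y are illustrative placeholders. The first term conditions on the observed y (a posterior over z), the second does not.

```python
# Hypothetical sketch: gradient of log p_theta(y | x) = log sum_z p_theta(y, z | x)
# for a log-linear joint model, over small enumerable spaces. The first expectation
# is over the posterior p_theta(z | x, y) ("inference on z"); the second is over
# p_theta(z, y | x) ("inference on z, y"). Early mistakes in the first term are
# exactly what gets reinforced during learning.
import itertools
import numpy as np

def phi(x, z, y, dim=4):
    rng = np.random.default_rng(abs(hash((x, z, y))) % (2**32))
    return rng.standard_normal(dim)

def latent_gradient(theta, x, y_obs, Z, Y):
    pairs = list(itertools.product(Z, Y))
    scores = np.array([theta @ phi(x, z, y) for z, y in pairs])
    p_joint = np.exp(scores - scores.max())
    p_joint /= p_joint.sum()                                   # p_theta(z, y | x)

    mask = np.array([float(y == y_obs) for _, y in pairs])
    p_post = p_joint * mask
    p_post /= p_post.sum()                                     # p_theta(z | x, y_obs)

    e_post = sum(p * phi(x, z, y) for p, (z, y) in zip(p_post, pairs))
    e_full = sum(p * phi(x, z, y) for p, (z, y) in zip(p_joint, pairs))
    return e_post - e_full

theta = np.zeros(4)
print(latent_gradient(theta, x="sent_1", y_obs="y1", Z=["z1", "z2"], Y=["y1", "y2"]))
```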

Formal Setting

This Work

Two thrusts:

1 How can we reify computation as part of a statistical model?
2 How can we relax the supervision signal to aid computation while still maintaining consistent parameter estimates?

Formal Setting

Related Work

Learning tractable models / accounting for approximations:

sum-product networks (Poon & Domingos, 2011)

max-violation perceptron (Huang, Fayong, & Guo, 2012; Zhang et al., 2013; Yu et al., 2013)

fast-mixing Markov chains (S. & Liang, 2015)

many others (Barbu, 2009; Daumé III, Langford, & Marcu, 2009; Domke, 2011; Stoyanov, Ropson, & Eisner, 2011; Niepert & Domingos, 2014; Li & Zemel, 2014; Shi, S., & Liang, 2015)

Improving expressivity of variational inference:

combining with MCMC (Salimans, Kingma, & Welling, 2015)

using neural networks (Kingma & Welling, 2013; Mnih & Gregor, 2014)

Computational-statistical tradeoffs:

huge body of recent work (Berthet & Rigollet, 2013; Chandrasekaran & Jordan, 2013; Zhang et al., 2013; Zhang, Wainwright, & Jordan, 2014; Christiano, 2014; Daniely, Linial, & Shalev-Shwartz, 2014; Garg, Ma, & Nguyen, 2014; Shamir, 2014; Braverman et al., 2015; S. & Duchi, 2015; S., Valiant, & Wager, 2015)


Reified Context Models

Structured Prediction Task

input x : (image; not shown in this transcript)
output y : v o l c a n i c

Reified Context Models

Contexts Are Key

For the output v o l c a ...:

DP: contexts v, *o, **l, ***c (keep only the most recent character)

beam search: contexts v, vo, vol, volc (keep full prefixes)

Key idea: contexts! A wildcard context stands for a set of prefixes, e.g.

*o ≝ {ao, bo, co, ...}
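As a concrete reading of the wildcard notation just introduced, here is a tiny sketch (plain Python, illustrative only) of when a context such as *o matches a prefix.

```python
# Minimal sketch of the wildcard-context idea: a context like "*o" stands for the
# set {ao, bo, co, ...}, i.e. any prefix of the same length whose known (non-'*')
# positions agree. Contexts and prefixes are plain strings here.
def matches(context: str, prefix: str) -> bool:
    """True if `prefix` is one of the strings the wildcard `context` stands for."""
    return len(context) == len(prefix) and all(
        c == "*" or c == p for c, p in zip(context, prefix)
    )

assert matches("*o", "vo") and matches("*o", "bo")
assert not matches("*o", "va")
assert matches("***c", "volc")   # DP-style context: only the last character is known
assert matches("volc", "volc")   # beam-search-style context: the full prefix is known
print("ok")
```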

Reified Context Models

Desiderata

Short contexts (e.g. r, *o, **l, ***c):
coverage
better uncertainty estimates (precision)
stabler partially supervised learning updates

Long contexts (e.g. r, ro, rol, rolc):
expressivity
capture complex dependencies

Mixing long contexts with wildcarded fallbacks (e.g. the position-4 set {*olc, ***c, ***r, ****}): best of both worlds.

Reified Context Models

Reifying Contexts

input x : (image; not shown in this transcript)
output y : v o l c a n i c
context c: v *o *ol *olc · · ·

"Context sets" (one set of candidate contexts per position):

C1 = {r, v, y, *}
C2 = {ro, ra, *o, **}
C3 = {rol, ral, *ol, ***}
C4 = {*olc, ***c, ***r, ****}

Challenge: how to trade off contexts of different lengths?
⇒ Reify contexts as part of the model!

Reified Context Models

Reified Context Models

Given:

context sets C1, ..., CL
features φ_i(c_{i−1}, y_i)

Define the model

p_θ(y_{1:L}, c_{1:L−1}) ∝ exp( ∑_{i=1}^{L} θ⊤ φ_i(c_{i−1}, y_i) ) · κ(y, c),

where the factor κ(y, c) enforces consistency between the output sequence and the contexts.

Graphical model structure: a chain over Y1, ..., Y5 and C1, ..., C4, with consistency factors κ and feature factors φ2, ..., φ5 — inference via forward-backward!
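To make "inference via forward-backward" concrete, here is a hedged sketch of the forward pass under one simplifying assumption that is not spelled out on the slide: the consistency factor κ deterministically maps the previous context plus the next character to the most specific matching context in C_i. Contexts are wildcard strings as on the earlier slides, and score_fn stands in for θ⊤φ_i(c_{i−1}, y_i).

```python
# Hypothetical sketch of the forward pass over a reified context model, under a
# simplifying assumption about kappa: each (previous context, next character) pair
# is summarized by the most specific matching context in C_i. All names here are
# illustrative; this is not the talk's implementation.
import math
from collections import defaultdict

def coarsens(context, pattern):
    """`context` never claims to know more than `pattern` does."""
    return len(context) == len(pattern) and all(
        c == "*" or c == p for c, p in zip(context, pattern)
    )

def next_context(c_prev, y, C_i):
    pattern = c_prev + y
    candidates = [c for c in C_i if coarsens(c, pattern)]
    return min(candidates, key=lambda c: c.count("*"))   # most specific candidate

def forward(context_sets, alphabet, score_fn):
    """alpha[i][c] = log of the total unnormalized mass of prefixes summarized by c."""
    alpha = [{"": 0.0}]                                   # empty context before position 1
    for i, C_i in enumerate(context_sets, start=1):
        new = defaultdict(lambda: -math.inf)
        for c_prev, logm in alpha[-1].items():
            for y in alphabet:
                c = next_context(c_prev, y, C_i)
                s = logm + score_fn(i, c_prev, y)
                # naive log-sum-exp; a real implementation would do this stably
                new[c] = s if new[c] == -math.inf else math.log(math.exp(new[c]) + math.exp(s))
        alpha.append(dict(new))
    return alpha

# Toy run: three positions, alphabet {a, b}, uniform scores. Every context set
# contains a full-wildcard fallback, so coverage is never lost.
context_sets = [["a", "*"], ["*a", "**"], ["**a", "***"]]
alpha = forward(context_sets, alphabet="ab", score_fn=lambda i, c, y: 0.0)
log_Z = math.log(sum(math.exp(v) for v in alpha[-1].values()))
print(round(log_Z, 3))   # log(2**3) ~ 2.079 under uniform scores
```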

Reified Context Models

Adaptive Context Selection

Select the context sets C_i during the forward pass of inference.

Greedily select the contexts with the largest mass.

Example (from the slide's figure): at position 1, from the candidates {a, b, c, d, e, ...}, keep c and e as explicit contexts and collapse the rest into a wildcard, giving C1 = {c, e, ?}; at position 2, from the candidates {ca, cb, ..., ea, eb, ..., ?a, ...}, keep ca and ?a, giving C2 = {ca, ?a, ??}; etc.

Biases towards short contexts unless there is high confidence.
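A small illustrative sketch of this greedy step (with * written for the slide's ? wildcard); the budget K, the particular coarsening rule, and the numbers are assumptions for the example, not details from the talk.

```python
# Hypothetical sketch of greedy context selection: keep the K highest-mass
# candidate contexts as explicit contexts, and summarize everything else with
# coarser wildcard contexts so that no probability mass is ever dropped.
from collections import defaultdict

def select_contexts(mass, K):
    """mass: dict mapping a fully specified candidate context -> probability mass."""
    keep = set(sorted(mass, key=mass.get, reverse=True)[:K])
    leftovers = defaultdict(float)
    for c, m in mass.items():
        if c not in keep:
            leftovers["*" * (len(c) - 1) + c[-1]] += m    # remember only the last character
    selected = keep | set(leftovers)
    selected.add("*" * len(next(iter(mass))))             # full-wildcard fallback
    return selected

mass = {"ca": 0.55, "ea": 0.20, "cb": 0.10, "eb": 0.10, "da": 0.05}
print(select_contexts(mass, K=1))    # e.g. {'ca', '*a', '*b', '**'}
```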

Reified Context Models

Precision

input x : (image; not shown in this transcript)
output y : v o l c a n i c

The model assigns a probability to each prediction, so it can predict on the most confident subset.

Measure precision (# of correct words) vs. recall (# of words predicted).

Comparison: beam search.

[Figure: precision vs. recall curves for Word Recognition, comparing beam search and RCM; precision axis from 0.86 to 1.00, recall axis from 0.0 to 1.0.]
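For clarity, here is the computation behind a precision/recall curve of this kind; the confidences and correctness flags below are made-up data, not the talk's results.

```python
# Sketch of the precision/recall trade-off: only answer on predictions whose model
# confidence clears a threshold, then sweep the threshold to trace out a curve.
def precision_recall(predictions, threshold):
    """predictions: list of (confidence, is_correct). Returns (precision, recall)."""
    answered = [correct for conf, correct in predictions if conf >= threshold]
    if not answered:
        return 1.0, 0.0
    return sum(answered) / len(answered), len(answered) / len(predictions)

preds = [(0.99, True), (0.95, True), (0.90, True), (0.70, False), (0.60, True)]
for t in (0.5, 0.8, 0.97):
    p, r = precision_recall(preds, t)
    print(f"threshold={t}: precision={p:.2f}, recall={r:.2f}")
```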

Reified Context Models

Partially Supervised Learning

Decipherment task:

cipher: am ↦ 5, I ↦ 13, what ↦ 54, ...
latent z : I am what I am
output y : 13 5 54 13 5

Goal: determine the cipher.

Fit a 2nd-order HMM with EM, using RCMs for the approximate E-step.
Use the learned emissions to determine the cipher.
Again compare to beam search (Nuhn et al., 2013).

Fraction of correctly mapped words:

[Figure: mapping accuracy (0.0 to 0.8) vs. training passes (0 to 20) on Decipherment, comparing RCM and beam search.]
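The training loop described above has a simple shape; the skeleton below is a hedged sketch of it, with placeholder functions for the approximate E-step and the M-step rather than any API from the talk.

```python
# Hedged skeleton of EM with an intractable E-step replaced by an approximation
# (beam search, or an RCM-style context model). approx_e_step and m_step are
# stand-ins: the first returns expected sufficient statistics for one cipher
# sequence, the second re-estimates transition/emission parameters from them.
def em(init_params, data, approx_e_step, m_step, num_passes=20):
    params = init_params
    for _ in range(num_passes):
        stats = [approx_e_step(params, y) for y in data]   # approximate E-step
        params = m_step(stats)                             # exact M-step
    return params

# Toy call showing the shapes involved (contents are placeholders).
toy_data = [["13", "5", "54", "13", "5"]]
params = em(
    init_params={},
    data=toy_data,
    approx_e_step=lambda p, y: {("I", "13"): 2, ("am", "5"): 2, ("what", "54"): 1},
    m_step=lambda stats: {"emission_counts": stats[0]},
)
print(params)
```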

Reified Context Models

Contexts During Training

Context lengths increase smoothly during training:

[Figure: average context length (roughly 1.5 to 4.5) vs. number of passes (0 to 20) on Decipherment, annotated with example contexts progressing from ****** to ***ing to idding.]

Start of training: little information, short contexts.
End of training: lots of information, long contexts.

Reified Context Models

Discussion

RCMs provide both expressivity and coverage, which enable:

More accurate uncertainty estimates (precision)

Better partially supervised learning updates

Reproducible experiments on Codalab: codalab.org/worksheets



Relaxed Supervision

Intractable Supervision

Sometimes, even supervision is intractable:

input x : What is the largest city in California?
latent z : argmax(λx. CITY(x) ∧ LOC(x, CA), λx. POPULATION(x))
output y : Los Angeles

Intractable no matter how simple the model is!

but there are likely statistical relationships (e.g. between CITY and Los Angeles)

Need a way to relax the likelihood.

while maintaining good statistical properties (asymptotic consistency)

Relaxed Supervision

Approach

[Figure: the (θ, β) parameter space, divided into a tractable region and an intractable region.]

Start with intractable likelihood q(y | z), model family p_θ(z | x).

Replace q(y | z) with a family of likelihoods q_β(y | z) (some very easy).

Derive constraints on (θ, β) that ensure tractability.

Learn within the tractable region.

Relaxed Supervision

Relaxed Supervision: Example

input x : Company officials refused to comment.
latent z : (unobserved)
output y : 公司官员拒绝对此发表评论。

Idea: instead of requiring the model's output ỹ to match the observed output y exactly, penalize based on some weighted distance dist_β(ỹ, y):

ℓ(θ, β; x, y) = −log( ∑_{z, ỹ} p_θ(z, ỹ | x) exp(−dist_β(ỹ, y)) )

As β → ∞, we recover the original objective.

but optimizing will send β → 0!

Two questions:

How to create natural pressure to increase β?

How to define distances for general problems?
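A toy numerical illustration (not from the talk's experiments) of why optimizing the relaxed objective alone drives β down: with a single exact-match penalty, the loss equals the exact negative log-likelihood as β → ∞ and collapses to 0 for every model as β → 0.

```python
# Toy illustration of the relaxed objective
#   l(theta, beta; x, y) = -log sum_{z, y~} p_theta(z, y~ | x) exp(-dist_beta(y~, y)),
# using the single penalty dist_beta(y~, y) = beta * [y~ != y] and a frozen model
# p_theta(z | x) with deterministic y~ = f(z). All numbers are made up.
import math

p_z = {"z1": 0.6, "z2": 0.3, "z3": 0.1}     # p_theta(z | x)
f = {"z1": "A", "z2": "B", "z3": "A"}       # deterministic output of each z
y_obs = "B"

def relaxed_loss(beta):
    total = sum(p * math.exp(-beta * (f[z] != y_obs)) for z, p in p_z.items())
    return -math.log(total)

exact_nll = -math.log(sum(p for z, p in p_z.items() if f[z] == y_obs))
for beta in (0.0, 1.0, 5.0, 50.0):
    print(f"beta={beta:5.1f}  relaxed={relaxed_loss(beta):.4f}  exact={exact_nll:.4f}")
```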

Relaxed Supervision

Relaxed Supervision: Formal Framework

Assume (WLOG) that z → y is deterministic: y = f(z).

Let S(z, y) ∈ {0, 1} encode the constraint [f(z) = y].

Take projections π_j : Y → Y_j, j = 1, ..., k.

Let S_j(z, y) = [π_j(f(z)) = π_j(y)] be the projected constraint.

Define the distance function:

dist_β(z, y) = ∑_{j=1}^{k} β_j · (1 − S_j(z, y)).

Note: can featurize dist_β as −β⊤ψ(z, y), where ψ_j = S_j − 1.

Lemma. Suppose that π_1 × ··· × π_k is injective. Then

S(z, y) = ∧_{j=1}^{k} S_j(z, y).
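The definitions above translate almost line for line into code; the sketch below assumes finite projections given as plain functions, purely for illustration.

```python
# Direct transcription of the slide's definitions:
#   S_j(z, y) = [pi_j(f(z)) == pi_j(y)],
#   dist_beta(z, y) = sum_j beta_j * (1 - S_j(z, y)),
#   psi_j(z, y) = S_j(z, y) - 1, so that dist_beta = -beta . psi.
def dist_beta(beta, projections, f, z, y):
    """beta: list of weights; projections: list of functions pi_j applied to outputs."""
    S = [int(pi(f(z)) == pi(y)) for pi in projections]
    return sum(b * (1 - s) for b, s in zip(beta, S))

def psi(projections, f, z, y):
    return [int(pi(f(z)) == pi(y)) - 1 for pi in projections]

# Toy check: projections onto the two characters of a length-2 string, with a toy
# deterministic map f. Their product is injective, so dist is 0 iff f(z) == y.
projections = [lambda s: s[0], lambda s: s[1]]
f = lambda z: z.upper()
print(dist_beta([1.0, 2.0], projections, f, z="ab", y="AB"))   # 0.0: all constraints hold
print(dist_beta([1.0, 2.0], projections, f, z="ab", y="AC"))   # 2.0: second projection violated
```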

Relaxed Supervision

Example: Unordered Supervision

input x : a b a a
latent z : d c d d
output y : {c : 1, d : 3}

Let count(·, j) count the number of occurrences of character j.

Decomposition:

[y = multiset(z)]  ⇐⇒  ∧_{j=1}^{V} [count(z, j) = count(y, j)],

where the left-hand side is the exact constraint S(z, y) (with f(z) = multiset(z)) and each conjunct on the right is a projected constraint S_j(z, y) (with π_j(y) = count(y, j)).
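A tiny sketch of this decomposition using Python's Counter; the vocabulary and strings are illustrative.

```python
# Sketch of unordered supervision: y is the multiset of characters of z, and each
# projection just counts one character. Counting every vocabulary character is a
# jointly injective set of projections, so the conjunction recovers the exact
# constraint, as in the slide's equivalence.
from collections import Counter

def S(z, y):                         # exact constraint: y = multiset(z)
    return Counter(z) == Counter(y)

def S_j(z, y, j):                    # projected constraint for character j
    return Counter(z)[j] == Counter(y)[j]

vocab = "abcd"
y = Counter({"c": 1, "d": 3})
for z in ("dcdd", "dccd"):
    print(z, S(z, y), all(S_j(z, y, j) for j in vocab))
# dcdd True True    (constraint holds, every projection holds)
# dccd False False  (constraint fails, and some projection catches it)
```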

Relaxed Supervision

Example: Conjunctive Semantic Parsing

Side information: predicates {Q_1, ..., Q_m}, e.g. Q_6 = [DOG] = the set of all dogs.

input x : brown dog (input utterance)
latent z : (Q_11, Q_6) (set of all brown objects, set of all dogs)
output y : Q_11 ∩ Q_6 (denotation, observed as a set)

For z = (Q_{j1}, ..., Q_{jL}), define the denotation ⟦z⟧ = Q_{j1} ∩ ··· ∩ Q_{jL}.

Decomposition:

[y = ⟦z⟧]  ⇐⇒  ∧_{j=1}^{m} ( I[⟦z⟧ ⊆ Q_j] = I[y ⊆ Q_j] ),

where the left-hand side is the exact constraint S(z, y) and each conjunct on the right is a projected constraint S_j(z, y) (with π_j(y) = I[y ⊆ Q_j]).

Relaxed Supervision

Normalization Constant

Create pressure to increase β by adding a normalization constant:

$$q_\beta(y \mid z) = \exp\!\Big(\underbrace{\beta^{\top}\psi(z, y)}_{-\,\mathrm{dist}_\beta(z,\,y)} - A(\beta)\Big)$$

$$\ell(\theta, \beta; x, y) = -\log\Big(\sum_z p_\theta(z \mid x)\, q_\beta(y \mid z)\Big).$$

Lemma

Given $\pi_1, \ldots, \pi_k$, let $A(\beta) \stackrel{\text{def}}{=} \sum_{j=1}^{k} \log\!\big(1 + (|Y_j| - 1)\exp(-\beta_j)\big)$. Then $\sum_y \exp(-\mathrm{dist}_\beta(z, y)) \le \exp(A(\beta))$ for all $z$.

Lemma

Jointly minimizing $L(\theta, \beta) = \mathbb{E}[\ell(\theta, \beta; x, y)]$ yields a consistent estimate of the true parameters $\theta^{*}$.

J. Steinhardt (Stanford) Learning and Inference September 8, 2015 27 / 31
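The following sketch transcribes these formulas for a toy problem (illustrative only; it enumerates a tiny set of candidate latents, and all names such as `dist_beta`, `A_beta`, and `relaxed_loss` are mine): it computes dist_β, A(β), q_β, and the relaxed loss ℓ(θ, β; x, y).

```python
import math

def dist_beta(beta, violated):
    """dist_beta(z, y) = sum_j beta_j * 1[S_j(z, y) violated] = -beta^T psi(z, y)."""
    return sum(b for b, v in zip(beta, violated) if v)

def A_beta(beta, Y_sizes):
    """A(beta) = sum_j log(1 + (|Y_j| - 1) * exp(-beta_j))."""
    return sum(math.log(1.0 + (s - 1) * math.exp(-b)) for b, s in zip(beta, Y_sizes))

def relaxed_loss(p_z_given_x, violations, beta, Y_sizes):
    """ell(theta, beta; x, y) = -log sum_z p_theta(z | x) * q_beta(y | z),
    where q_beta(y | z) = exp(-dist_beta(z, y) - A(beta))."""
    A = A_beta(beta, Y_sizes)
    total = sum(p * math.exp(-dist_beta(beta, violations[z]) - A)
                for z, p in p_z_given_x.items())
    return -math.log(total)

# Toy problem: two candidate latents, two relaxed constraints, |Y_j| = 3 for each j.
p_z_given_x = {"z_good": 0.6, "z_bad": 0.4}       # stands in for p_theta(z | x)
violations  = {"z_good": [False, False],          # z_good satisfies both S_j
               "z_bad":  [True,  False]}          # z_bad violates S_1
for b in (0.1, 1.0, 5.0):
    print(b, relaxed_loss(p_z_given_x, violations, [b, b], Y_sizes=[3, 3]))
```

Because A(β) shrinks toward 0 as β grows while the penalty on violated constraints grows, larger β lowers the loss only when p_θ puts its mass on latents that satisfy the constraints, which is the pressure to increase β described above.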


Relaxed Supervision

Constraints for Efficient Inference

Inference task:

$$\nabla_\theta \log p_\theta(y \mid x) = \underbrace{\mathbb{E}_{z \sim p_\theta(\cdot \mid x, y)}[\phi(x, z, y)]}_{\text{sample } z \text{ given } x, y} - \underbrace{\mathbb{E}_{z, y \sim p_\theta(\cdot \mid x)}[\phi(x, z, y)]}_{\text{sample } z \text{ given } x}.$$

$$p_{\theta,\beta}(z \mid x, y) \;\propto\; p_\theta(z \mid x)\, q_\beta(y \mid z) \;\propto\; p_\theta(z \mid x)\exp\!\big(\beta^{\top}\psi(z, y)\big).$$

Rejection sampler:

sample $z \sim p_\theta(\cdot \mid x)$, then accept with probability $\exp\!\big(\beta^{\top}\psi(z, y)\big)$.

Bound expected number of samples:

$$\sum_{(x, y) \in \mathrm{Data}} \Big(\sum_z p_\theta(z \mid x)\exp\!\big(\beta^{\top}\psi(z, y)\big)\Big)^{-1} \le \tau. \tag{1}$$

Ratio of normalization constants: can optimize subject to (1) (similar to CCCP).

J. Steinhardt (Stanford) Learning and Inference September 8, 2015 28 / 31
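A minimal sketch of such a rejection sampler, under a made-up toy setup (the names `sample_prior` and `accept_logweight` are my placeholders for p_θ(z | x) and β^⊤ψ(z, y)): propose z from the prior and accept with probability exp(β^⊤ψ(z, y)), which is at most 1 because each ψ_j is 0 or −1. The expected number of proposals per example is the inner quantity that (1) sums over the data and bounds by τ.

```python
import math, random

def rejection_sample(sample_prior, accept_logweight, y, max_tries=100_000, rng=random):
    """Draw z ~ p_{theta,beta}(. | x, y): propose z ~ p_theta(. | x) and
    accept with probability exp(beta^T psi(z, y)) <= 1."""
    for tries in range(1, max_tries + 1):
        z = sample_prior()
        if rng.random() < math.exp(accept_logweight(z, y)):
            return z, tries          # tries = number of proposals used
    raise RuntimeError("acceptance rate too low; consider smaller beta (cf. constraint (1))")

# Toy setup: two latents, one relaxed constraint, beta = 1.5.
beta = [1.5]

def sample_prior():
    return "z_good" if random.random() < 0.3 else "z_bad"

def accept_logweight(z, y):
    violated = [z != y]              # psi_1(z, y) = -1[z != y]
    return -sum(b for b, v in zip(beta, violated) if v)

random.seed(0)
draws = [rejection_sample(sample_prior, accept_logweight, y="z_good") for _ in range(1000)]
print(sum(t for _, t in draws) / len(draws))   # ~ (0.3 + 0.7 * exp(-1.5))**-1, about 2.2
```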



Relaxed Supervision

Experiments

Conjunctive semantic parsing:

[Figure: two panels over 50 training iterations comparing AdaptBeta(500) with FixedBeta(0.5), FixedBeta(0.2), and FixedBeta(0.1). Left panel: accuracy (0 to 1) vs. iteration. Right panel: number of samples (log scale, 10^0 to 10^5) vs. iteration.]

J. Steinhardt (Stanford) Learning and Inference September 8, 2015 29 / 31

Open Questions

1 Motivation

2 Formal Setting

3 Reified Context Models

4 Relaxed Supervision

5 Open Questions

J. Steinhardt (Stanford) Learning and Inference September 8, 2015 30 / 31

Open Questions

Scale up to larger tasks: semantic parsing, reinforcement learning, program induction

Extend to Bayesian models

Understand non-convex optimization

Metacomputation using Reified Context Models?

Probabilistic abstract interpretation

Statistics & Computation: still a long way to go

Thanks! 谢谢 (Thank you)

J. Steinhardt (Stanford) Learning and Inference September 8, 2015 31 / 31

