Query Linguistic Intent Detection (john.blitzer.com/harbin/supervised.pdf)

John Blitzer, Natural Language Computing Group (自然语言计算组), http://research.microsoft.com/asia/group/nlc/

Post on 09-Jul-2020


Page 1:

John Blitzer

Natural Language Computing Group (自然语言计算组)

http://research.microsoft.com/asia/group/nlc/

Page 2:

This is an NLP summer school. Why should I care about machine learning?

ACL 2008: 50 of 96 full papers mention learning or statistics in their titles

4 of 4 outstanding papers propose new learning or statistical inference methods

Page 3:

Input: Product Review

Running with Scissors: A Memoir

Title: Horrible book, horrible.

This book was horrible. I read half of it, suffering from a headache the entire time, and eventually i lit it on fire. One less copy in the world...don't waste your money. I wish i had the time spent reading this book back so i could use it for better purposes. This book wasted my life

Output: Labels

Positive

Negative

Page 4:

From the MSRA Machine Learning Group (机器学习组): http://research.microsoft.com/research/china/DCCUE/ml.aspx

Page 5:

[Figure: Un-ranked List and Ranked List]

Page 6:

Input: English sentence

The national track & field championships concluded

Output: Chinese sentence

全国田径冠军赛结束

Page 7:

1) Supervised Learning [2.5 hrs]

2) Semi-supervised learning [3 hrs]

3) Learning bounds for domain adaptation [30 mins]

Page 8:

1) Notation and Definitions [5 mins]

2) Generative Models [25 mins]

3) Discriminative Models [55 mins]

4) Machine Learning Examples [15 mins]

Page 9:

Training data: labeled pairs . . .

Use training data to learn a function

Use this function to label unlabeled testing data: ?? ?? . . . ??

Page 10:

[Figure: reviews encoded as sparse vectors of feature counts, with columns such as horrible, read_half, waste, excellent, and loved_it; most entries are 0]
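A review can be turned into a vector of feature counts like this with a few lines of code. The following is a minimal sketch: the tokenizer, the convention of joining adjacent words with an underscore (as in read_half), and the name `extract_features` are assumptions made for illustration, not the feature extractor used in the lecture.

```python
import re
from collections import Counter

def extract_features(text):
    """Count unigrams and adjacent-word bigrams (joined with '_') in a review."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    counts.update(f"{a}_{b}" for a, b in zip(tokens, tokens[1:]))
    return counts

# Hypothetical review text, loosely based on the example review.
review = "This book was horrible. I read half of it. Horrible, horrible waste."
features = extract_features(review)

# Features that never occur have count 0, so the vector is sparse.
print(features["horrible"], features["read_half"], features["excellent"])
```

Storing only the non-zero counts keeps the representation small even when the vocabulary has many thousands of features.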

Page 11:
Page 12:

Encode a multivariate probability distribution

Nodes indicate random variables

Edges indicate conditional dependency

Page 13:

[Figure: graphical model over the label y and the features horrible, waste, read_half, with the label prior p(y = -1)]

Page 14:

Given an unlabeled instance, how can we find its label?

Just choose the most probable label y
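Choosing the most probable label can be sketched with a small naïve Bayes classifier whose parameters are estimated from counts. This is an illustrative sketch, not the lecture's code: the add-one smoothing, the toy training set, and the function names are all assumptions.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(examples):
    """examples: list of (feature list, label). The parameters are just counts."""
    label_counts = Counter()
    feature_counts = defaultdict(Counter)
    vocab = set()
    for features, label in examples:
        label_counts[label] += 1
        feature_counts[label].update(features)
        vocab.update(features)
    return label_counts, feature_counts, vocab

def predict(model, features):
    """Return the label y maximizing log p(y) + sum_j log p(x_j | y)."""
    label_counts, feature_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in label_counts.items():
        # Add-one smoothing (an assumption) avoids log(0) for unseen pairs.
        denom = sum(feature_counts[label].values()) + len(vocab)
        score = math.log(count / total)
        for f in features:
            if f in vocab:  # ignore features never seen in training
                score += math.log((feature_counts[label][f] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical labeled training data.
train = [
    (["horrible", "waste", "read_half"], -1),
    (["horrible", "waste"], -1),
    (["excellent", "loved_it"], +1),
    (["excellent"], +1),
]
model = train_naive_bayes(train)
print(predict(model, ["horrible", "read_half"]))
print(predict(model, ["excellent", "loved_it"]))
```

The same code handles more than two labels without change, since prediction simply loops over every label seen in training.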

Page 15:

Back to labeled training data:

. . .

Page 16:

Query classification

Input query: “自然语言处理” ("natural language processing")

Output labels: Travel, Technology, News, Entertainment, . . .

Training and testing are the same as in the binary case

Page 17:

Why set parameters to counts?

Page 18:
Page 19:

Predicting broken traffic lights

Lights are broken: both lights are always red

Lights are working: one is red and one is green

Page 20:

Now, suppose both lights are red. What will our model predict?

We got the wrong answer. Is there a better model?

The MLE generative model is not the best model!!
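The failure can be reproduced numerically. The sketch below assumes a hypothetical data set, since the slide's counts did not survive: six observations of working lights (three red/green, three green/red) and one observation of broken lights (red/red). The model treats the two lights as independent given the state, as naïve Bayes does.

```python
from fractions import Fraction

# Hypothetical observations: (light 1, light 2, state of the intersection).
data = (
    [("red", "green", "working")] * 3
    + [("green", "red", "working")] * 3
    + [("red", "red", "broken")] * 1
)

def mle_joint(light1, light2, state):
    """p(state) * p(light1 | state) * p(light2 | state), all estimated by counts."""
    rows = [d for d in data if d[2] == state]
    prior = Fraction(len(rows), len(data))
    p1 = Fraction(sum(d[0] == light1 for d in rows), len(rows))
    p2 = Fraction(sum(d[1] == light2 for d in rows), len(rows))
    return prior * p1 * p2

working = mle_joint("red", "red", "working")  # 6/7 * 1/2 * 1/2 = 3/14
broken = mle_joint("red", "red", "broken")    # 1/7 * 1 * 1 = 1/7
prediction = "working" if working > broken else "broken"
print(working, broken, prediction)
```

Under these assumed counts the model labels two red lights as "working", even though working lights are never both red: the independence assumption throws away exactly the dependency that identifies a broken intersection.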

Page 21:

We can introduce more dependencies

This can explode parameter space

Discriminative models minimize error -- next

Further reading

K. Toutanova. Competitive generative models with structure learning for NLP classification tasks. EMNLP 2006.

A. Ng and M. Jordan. On Discriminative vs. Generative Classifiers: A comparison of logistic regression and naïve Bayes. NIPS 2002

Page 22:

We will focus on linear models

Model training error

Page 23:

[Figure: loss as a function of score, for scores from -1.5 to 1.5]

0-1 loss (error): NP-hard to minimize over all data points

Exp loss, exp(-score): minimized by AdaBoost

Hinge loss: minimized by support vector machines
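The three losses are easy to compare side by side. In this sketch the score is assumed to be the label times the linear model's output (positive when the example is classified correctly), the hinge loss is written in its usual form max(0, 1 - score), and a score of exactly 0 is counted as an error; these conventions are assumptions, since the slide's formulas were lost.

```python
import math

def zero_one_loss(score):
    """1 if the example is misclassified (score <= 0), else 0."""
    return 1.0 if score <= 0 else 0.0

def exp_loss(score):
    """Exponential loss, the quantity AdaBoost minimizes."""
    return math.exp(-score)

def hinge_loss(score):
    """Hinge loss, the quantity support vector machines minimize."""
    return max(0.0, 1.0 - score)

for score in (-1.5, -0.5, 0.0, 0.5, 1.5):
    print(score, zero_one_loss(score), round(exp_loss(score), 3), hinge_loss(score))
```

Both surrogate losses are convex upper bounds on the 0-1 loss, which is what makes them tractable substitutes for it.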

Page 24:

In NLP, a feature can be a weak learner

Sentiment example:

Page 25:

Input: training sample

(1) Initialize the example weights

(2) For t = 1 … T,

Train a weak hypothesis to minimize error on the weighted training sample

Set the weight of the weak hypothesis [later]

Update the example weights

(3) Output model
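The loop above can be sketched in Python. This is an illustration under stated assumptions rather than the lecture's implementation: each weak hypothesis is a single feature that votes +1 or -1 depending on whether it is present, the hypothesis weight uses the standard choice alpha = 0.5 * ln((1 - err) / err), and the training examples and function names are hypothetical.

```python
import math

def adaboost(examples, rounds):
    """examples: list of (set of features, label in {+1, -1})."""
    n = len(examples)
    weights = [1.0 / n] * n                      # (1) initialize uniform weights
    vocab = sorted(set().union(*(x for x, _ in examples)))
    model = []                                   # list of (alpha, feature, sign)
    for _ in range(rounds):                      # (2) for t = 1 ... T
        # Weak hypothesis: predict `sign` if the feature is present, else -sign.
        best = None
        for feature in vocab:
            for sign in (+1, -1):
                err = sum(w for w, (x, y) in zip(weights, examples)
                          if (sign if feature in x else -sign) != y)
                if best is None or err < best[0]:
                    best = (err, feature, sign)
        err, feature, sign = best
        err = min(max(err, 1e-10), 1 - 1e-10)    # avoid log(0) on perfect stumps
        alpha = 0.5 * math.log((1 - err) / err)  # weight of this hypothesis
        model.append((alpha, feature, sign))
        # Update: raise the weight of mistakes, lower the rest, renormalize.
        weights = [w * math.exp(-alpha * y * (sign if feature in x else -sign))
                   for w, (x, y) in zip(weights, examples)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return model                                 # (3) output model

def predict(model, x):
    score = sum(a * (s if f in x else -s) for a, f, s in model)
    return 1 if score >= 0 else -1

# Hypothetical training sample of feature sets with sentiment labels.
train = [
    ({"excellent", "book", "the_plot", "riveting"}, +1),
    ({"excellent", "read"}, +1),
    ({"terrible", "the_plot", "boring", "opaque"}, -1),
    ({"awful", "book", "follow", "the_plot"}, -1),
]
model = adaboost(train, rounds=3)
print([predict(model, x) for x, _ in train])
```

On this toy sample the feature "excellent" alone separates the classes, so it is selected in the first round; on realistic data the reweighting step forces later rounds to focus on the examples earlier hypotheses got wrong.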

Page 26:

[Figure: boosting on four example reviews, with + and – marking positive and negative examples]

+ Excellent book. The_plot was riveting

+ Excellent read

– Terrible: The_plot was boring and opaque

– Awful book. Couldn’t follow the_plot.

Page 27:

Bound on training error [Freund & Schapire 1995]

We greedily minimize error by minimizing

Page 28:

For proofs and a more complete discussion:

Robert Schapire and Yoram Singer. Improved Boosting Algorithms Using Confidence-rated Predictions. Machine Learning Journal 1998.

Page 29:

We chose to minimize . Was that the right choice?

This gives

Plugging in our solution for , we have
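For reference, the standard AdaBoost analysis can be written out as follows. The notation is an assumption borrowed from Schapire and Singer, since the slide's own symbols did not survive: $D_t$ is the example-weight distribution at round $t$, $h_t$ the weak hypothesis with outputs in $\{-1,+1\}$, $\alpha_t$ its weight, $\epsilon_t$ its weighted error, and $Z_t$ the normalizer of the weight update.

```latex
% Training-error bound for H(x) = sign(\sum_t \alpha_t h_t(x))
\frac{1}{N}\sum_{i=1}^{N} \mathbf{1}[H(x_i) \neq y_i]
  \;\le\; \frac{1}{N}\sum_{i=1}^{N} \exp\Big(-y_i \sum_{t=1}^{T} \alpha_t h_t(x_i)\Big)
  \;=\; \prod_{t=1}^{T} Z_t,
\qquad
Z_t = \sum_{i=1}^{N} D_t(i)\,\exp\big(-\alpha_t\, y_i\, h_t(x_i)\big).

% For h_t(x) \in \{-1,+1\} with weighted error \epsilon_t,
% minimizing Z_t over \alpha_t gives:
Z_t = (1-\epsilon_t)\,e^{-\alpha_t} + \epsilon_t\,e^{\alpha_t}
\;\Rightarrow\;
\alpha_t = \tfrac{1}{2}\ln\frac{1-\epsilon_t}{\epsilon_t},
\qquad
Z_t = 2\sqrt{\epsilon_t\,(1-\epsilon_t)}.
```

Whenever $\epsilon_t < 1/2$, each factor $Z_t$ is strictly less than 1, so the bound on training error shrinks with every round.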

Page 30:

[Figure: loss as a function of score, for scores from -1.5 to 1.5]

What happens when an example is mis-labeled or an outlier?

Exp loss exponentially penalizes incorrect scores.

Hinge loss linearly penalizes incorrect scores.

Page 31:

[Figure: two scatter plots of + and – points, labeled "Linearly separable" and "Non-separable"]

Page 32:

Lots of separating hyperplanes. Which should we choose?

[Figure: + and – points with candidate separating hyperplanes]

Choose the hyperplane with the largest margin

Page 33:

score of correct label greater than margin

Why do we fix norm of w to be less than 1?

Scaling the weight vector doesn’t change the optimal hyperplane

Page 34:

Minimize the norm of the weight vector

With fixed margin for each example
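The two views of the separable problem can be written as optimization problems. This is the standard textbook formulation, given here as a sketch; the symbols ($w$ for the weight vector, $\gamma$ for the margin, labels $y_i \in \{-1,+1\}$, no bias term) are assumptions, since the slide's equations were lost.

```latex
% Maximize the margin \gamma with a bounded weight vector
\max_{\gamma,\, w} \; \gamma
\quad \text{s.t.} \quad y_i\,(w \cdot x_i) \ge \gamma \;\; \forall i,
\qquad \|w\| \le 1

% Equivalent form: fix the margin at 1 and minimize the norm
\min_{w} \; \tfrac{1}{2}\|w\|^2
\quad \text{s.t.} \quad y_i\,(w \cdot x_i) \ge 1 \;\; \forall i
```

The two forms are related by rescaling: dividing a solution of the first by $\gamma$ gives a vector with margin 1 and norm $1/\gamma$, so maximizing $\gamma$ is the same as minimizing the norm.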

Page 35:

We can’t satisfy the margin constraints

But some hyperplanes are better than others

[Figure: non-separable + and – points]

Page 36:

Add slack variables to the optimization

Allow margin constraints to be violated

But minimize the violation as much as possible
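In the usual soft-margin formulation, sketched here with assumed symbols ($\xi_i$ for the slack of example $i$ and $C$ for the trade-off constant, neither of which appears in the surviving slide text), the slack variables enter as follows.

```latex
\min_{w,\, \xi} \; \tfrac{1}{2}\|w\|^2 + C \sum_{i=1}^{N} \xi_i
\quad \text{s.t.} \quad y_i\,(w \cdot x_i) \ge 1 - \xi_i, \;\; \xi_i \ge 0 \;\; \forall i

% Eliminating the slack variables gives the hinge-loss form
\min_{w} \; \tfrac{1}{2}\|w\|^2 + C \sum_{i=1}^{N} \max\big(0,\; 1 - y_i\,(w \cdot x_i)\big)
```

At the optimum each $\xi_i$ equals the amount by which example $i$ violates its margin constraint, which is exactly the hinge loss of that example.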

Page 37:
Page 38:

[Figure: plot of the loss around its non-differentiable point, x-axis from -0.5 to 1.5]

Max creates a non-differentiable point, but there is a subgradient

Subgradient:

Page 39:

Subgradient descent is like gradient descent.

Also guaranteed to converge, but slow

Pegasos [Shalev-Shwartz and Singer 2007]

Sub-gradient descent for a randomly selected subset of examples. Convergence bound:

[Equation annotations: objective after T iterations; best objective value; linear convergence]
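A stochastic subgradient method in the style of Pegasos can be sketched as follows. This is a simplified illustration, not the published algorithm in full: it uses a subset of size one per iteration, the step size 1 / (lam * t), and omits the optional projection step; the regularization constant `lam`, the toy data, and the function names are assumptions.

```python
import random

def pegasos(examples, lam=0.1, iterations=1000, seed=0):
    """Stochastic subgradient descent on lam/2 * ||w||^2 + average hinge loss.

    examples: list of (dict feature -> value, label in {+1, -1}).
    """
    rng = random.Random(seed)
    w = {}
    for t in range(1, iterations + 1):
        x, y = rng.choice(examples)          # randomly selected subset of size 1
        eta = 1.0 / (lam * t)                # decreasing step size
        score = y * sum(w.get(f, 0.0) * v for f, v in x.items())
        # Subgradient of the regularizer: shrink every weight.
        for f in w:
            w[f] *= (1.0 - eta * lam)
        # Subgradient of the hinge loss is -y * x when the margin is violated.
        if score < 1.0:
            for f, v in x.items():
                w[f] = w.get(f, 0.0) + eta * y * v
    return w

def predict(w, x):
    return 1 if sum(w.get(f, 0.0) * v for f, v in x.items()) >= 0 else -1

# Hypothetical training sample with binary features.
train = [
    ({"excellent": 1, "the_plot": 1}, +1),
    ({"excellent": 1, "read": 1}, +1),
    ({"terrible": 1, "the_plot": 1}, -1),
    ({"awful": 1, "the_plot": 1}, -1),
]
w = pegasos(train)
print([predict(w, x) for x, _ in train])
```

Each iteration touches only the features of one example plus a rescaling of the weights, so the cost per step does not grow with the size of the training set.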

Page 40:

We’ve been looking at binary classification

But most NLP problems aren’t binary

Piece-wise linear decision boundaries

We showed 2-dimensional examples

But NLP is typically very high dimensional

Joachims [2000] discusses linear models in high-dimensional spaces

Page 41:

Kernels let us efficiently map training data into a high-dimensional feature space

Then learn a model which is linear in the new space, but non-linear in our original space

But for NLP, we already have a high-dimensional representation!

Optimization with non-linear kernels is often super-linear in number of examples

Page 42:

John Shawe-Taylor and Nello Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press 2004.

Dan Klein and Ben Taskar. Max Margin Methods for NLP: Estimation, Structure, and Applications. ACL 2005 Tutorial.

Ryan McDonald. Generalized Linear Classifiers in NLP. Tutorial at the Swedish Graduate School in Language Technology. 2007.

Page 43:

SVMs with slack are noise tolerant

AdaBoost has no explicit regularization

Must resort to early stopping

AdaBoost easily extends to non-linear models

Non-linear optimization for SVMs is super-linear in the number of examples

Can be important for examples with hundreds or thousands of features

Page 44:

Logistic regression: Also known as Maximum Entropy

Probabilistic discriminative model which directly models p(y | x)
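For binary labels, modeling p(y | x) directly looks like the sketch below. The sparse dictionary representation, the example weights, and the function names are assumptions for illustration; training (choosing the weights to minimize the log loss) is not shown.

```python
import math

def prob_positive(w, x):
    """p(y = +1 | x) under a linear logistic model with weights w."""
    score = sum(w.get(f, 0.0) * v for f, v in x.items())
    return 1.0 / (1.0 + math.exp(-score))

def log_loss(w, x, y):
    """Negative log-likelihood of label y in {+1, -1}."""
    p = prob_positive(w, x)
    return -math.log(p if y == 1 else 1.0 - p)

w = {"excellent": 2.0, "horrible": -3.0}     # hypothetical weights
print(prob_positive(w, {"excellent": 1}))
print(prob_positive(w, {"horrible": 1}))
```

Unlike the generative model, nothing here describes how the features themselves are distributed; only the conditional probability of the label is modeled.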

A good general machine learning book

On discriminative learning and more

Chris Bishop. Pattern Recognition and Machine Learning. Springer 2006.

Page 45:

Page 46:

Good features for this model?

(1) How many words are shared between the query and the web page?

(2) What is the PageRank of the webpage?

(3) Other ideas?

Page 47:

Loss for a query and a pair of documents

Score for documents of different ranks must be separated by a margin
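A margin-based loss over a pair of documents can be sketched as a hinge loss on the difference of their scores. Everything specific here is an assumption made for illustration: the two features (shared words and a PageRank value), the weights, the margin of 1, and the function names.

```python
def shared_words(query, page):
    """Number of distinct words shared by the query and the page."""
    return len(set(query.lower().split()) & set(page.lower().split()))

def score(w, features):
    """Linear score of one document for the query."""
    return sum(w[name] * value for name, value in features.items())

def pairwise_hinge_loss(w, better, worse, margin=1.0):
    """Zero only if the higher-ranked document outscores the other by `margin`."""
    return max(0.0, margin - (score(w, better) - score(w, worse)))

query = "track and field championships"
doc_a = {
    "shared_words": shared_words(query, "national track and field championships results"),
    "pagerank": 0.4,  # hypothetical value
}
doc_b = {
    "shared_words": shared_words(query, "field hockey rules"),
    "pagerank": 0.9,  # hypothetical value
}
w = {"shared_words": 1.0, "pagerank": 1.0}   # hypothetical weights
print(pairwise_hinge_loss(w, better=doc_a, worse=doc_b))
print(pairwise_hinge_loss(w, better=doc_b, worse=doc_a))
```

Summing this loss over all pairs of documents with different ranks for each query turns ranking into a problem with the same structure as binary classification with a margin.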

MSRA Web Search and Mining Group (互联网搜索与挖掘组): http://research.microsoft.com/asia/group/wsm/

Page 48:

http://www.msra.cn/recruitment/