# Supervised and semi-supervised learning for NLP

John Blitzer

Natural Language Computing Group (自然语言计算组): http://research.microsoft.com/asia/group/nlc/
## Slide 2

This is an NLP summer school. Why should I care about machine learning?

- ACL 2008: 50 of 96 full papers mention learning or statistics in their titles.
- 4 of 4 outstanding papers propose new learning or statistical inference methods.
## Slide 3

Input: Product Review

> Running with Scissors: A Memoir
> Title: Horrible book, horrible.
>
> This book was horrible. I read half of it, suffering from a headache the entire time, and eventually i lit it on fire. One less copy in the world...don't waste your money. I wish i had the time spent reading this book back so i could use it for better purposes. This book wasted my life

Output: Label (Positive or Negative)
## Slide 4

From the MSRA Machine Learning Group (机器学习组): http://research.microsoft.com/research/china/DCCUE/ml.aspx
## Slide 5

[Figure: an un-ranked list of items mapped to a ranked list]
## Slide 6

Input: Chinese sentence: 全国田径冠军赛结束

Output: English sentence: The national track & field championships concluded
## Slide 7

1) Supervised learning [2.5 hrs]
2) Semi-supervised learning [3 hrs]
3) Learning bounds for domain adaptation [30 mins]
## Slide 8
1) Notation and Definitions [5 mins]
2) Generative Models [25 mins]
3) Discriminative Models [55 mins]
4) Machine Learning Examples [15 mins]
## Slide 9

Training data: labeled pairs $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$

Use the training data to learn a function $f: X \to Y$.

Use this function to label unlabeled testing data $x_{n+1}, x_{n+2}, \ldots$ whose labels are unknown.
## Slide 10

[Figure: reviews represented as sparse bag-of-words count vectors, with nonzero entries for features such as horrible (3), read_half, and waste in a negative review, and excellent (2) and loved_it in a positive one.]
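A minimal sketch of building such a vector (the tokenizer is hypothetical; the underscore-joined features like read_half suggest adjacent-word pairs, which is an assumption here):

```python
from collections import Counter

def bag_of_words(text):
    """Sparse count vector over unigrams and underscore-joined
    adjacent-word pairs (features like read_half, loved_it)."""
    tokens = text.lower().split()
    feats = Counter(tokens)
    feats.update("_".join(p) for p in zip(tokens, tokens[1:]))
    return feats

x = bag_of_words("this book was horrible horrible horrible")
print(x["horrible"], x["was_horrible"])  # 3 1
```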
## Slide 11

## Slide 12
Encode a multivariate probability distribution as a graph:

- Nodes indicate random variables.
- Edges indicate conditional dependency.
## Slide 13

[Figure: Naive Bayes as a graphical model for sentiment. The label node carries the prior $p(y = -1)$ and points to one node per feature, each with its own conditional, e.g. $p(\text{horrible} \mid y = -1)$, $p(\text{read\_half} \mid y = -1)$, $p(\text{waste} \mid y = -1)$.]
## Slide 14

Given an unlabeled instance $x$, how can we find its label?

Just choose the most probable label $y$:

$\hat{y} = \arg\max_y \; p(y \mid x) = \arg\max_y \; p(y) \prod_j p(x_j \mid y)$
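A toy sketch of count-based training and argmax prediction (the function names, data, and smoothing constant are illustrative, not from the slides):

```python
import math
from collections import Counter, defaultdict

def train_nb(pairs, alpha=1.0):
    """Estimate p(y) and p(word | y) by smoothed relative counts,
    then classify by choosing the most probable label."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for words, y in pairs:
        label_counts[y] += 1
        word_counts[y].update(words)
        vocab.update(words)
    n = sum(label_counts.values())

    def log_p(words, y):
        # log p(y) + sum_j log p(x_j | y), with add-alpha smoothing
        lp = math.log(label_counts[y] / n)
        total = sum(word_counts[y].values()) + alpha * len(vocab)
        for w in words:
            lp += math.log((word_counts[y][w] + alpha) / total)
        return lp

    return lambda words: max(label_counts, key=lambda y: log_p(words, y))

predict = train_nb([(["horrible", "waste"], -1), (["excellent"], +1)])
print(predict(["horrible"]))  # prints -1
```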
## Slide 15

Back to labeled training data: $(x_1, y_1), \ldots, (x_n, y_n)$
## Slide 16

Query classification. Input query: “自然语言处理” (“natural language processing”)

Classes: Travel, Technology, News, Entertainment, . . .

Training and testing are the same as in the binary case.
## Slide 17
Why set parameters to counts?
## Slide 18
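Slide 18 is blank in this capture; presumably it carried the maximum-likelihood answer. A standard sketch (not taken from the slide): for the class prior, with $n_y$ examples of label $y$ out of $n$,

```latex
\max_{\theta}\; \sum_{y} n_y \log \theta_y
\quad \text{s.t.} \quad \sum_{y} \theta_y = 1
% Lagrange multipliers give n_y / \theta_y = \lambda for every y,
% so \theta_y = n_y / n: exactly the relative-frequency count.
% The same argument applies to each conditional p(x_j | y).
```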
## Slide 19
Predicting broken traffic lights:

- Lights are broken: both lights are always red.
- Lights are working: one light is red and one is green.
## Slide 20
Now, suppose both lights are red. What will our model predict?
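Concretely (with an invented prior, since the slide's numbers are not in this capture): suppose broken days are rare, say $p(b) = 1/7$ and $p(w) = 6/7$. Naive Bayes treats the two lights as conditionally independent given the state:

```latex
p(b \mid r, r) \;\propto\; p(b)\, p(r \mid b)\, p(r \mid b) = \tfrac{1}{7} \cdot 1 \cdot 1 = \tfrac{1}{7}
\qquad
p(w \mid r, r) \;\propto\; p(w)\, p(r \mid w)\, p(r \mid w) = \tfrac{6}{7} \cdot \tfrac{1}{2} \cdot \tfrac{1}{2} = \tfrac{6}{28}
```

Since $6/28 > 1/7$, the model says the lights are working, even though two red lights can only happen when they are broken.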
We got the wrong answer. Is there a better model?
The MLE generative model is not the best model!!
## Slide 21

We can introduce more dependencies, but this can explode the parameter space.

Discriminative models minimize error -- next.

Further reading:

- K. Toutanova. Competitive generative models with structure learning for NLP classification tasks. EMNLP 2006.
- A. Ng and M. Jordan. On Discriminative vs. Generative Classifiers: A comparison of logistic regression and naïve Bayes. NIPS 2002.
## Slide 22

We will focus on linear models: score an input with $w \cdot x$ and predict $\mathrm{sign}(w \cdot x)$.

Model training error: $\frac{1}{n} \sum_{i=1}^{n} \mathbf{1}\left[y_i \ne \mathrm{sign}(w \cdot x_i)\right]$
## Slide 23

[Plot: loss as a function of the margin $y\,(w \cdot x)$]

- 0-1 loss (error): NP-hard to minimize over all data points.
- Exp loss $\exp(-\text{score})$: minimized by AdaBoost.
- Hinge loss: minimized by support vector machines.
## Slide 24

In NLP, a feature can be a weak learner.

Sentiment example (worked through on the following slides).
## Slide 25

Input: training sample $\{(x_i, y_i)\}_{i=1}^{n}$

(1) Initialize $D_1(i) = 1/n$.

(2) For $t = 1 \ldots T$:
   - Train a weak hypothesis $h_t$ to minimize error on $D_t$.
   - Set $\alpha_t$ [derived later].
   - Update $D_{t+1}(i) = \frac{D_t(i)\,\exp(-\alpha_t y_i h_t(x_i))}{Z_t}$.

(3) Output model $f(x) = \mathrm{sign}\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right)$.
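A compact runnable sketch of this loop, using one-word decision stumps as the weak hypotheses (the stump construction and the tiny dataset are illustrative assumptions, not from the slides):

```python
import math

def adaboost(X, y, T=10):
    """X: list of feature sets; y: list of +/-1 labels."""
    n = len(X)
    D = [1.0 / n] * n                       # (1) uniform weights
    model = []                              # list of (alpha, word, sign)
    vocab = set().union(*X)
    for _ in range(T):                      # (2)
        # Weak hypothesis: h(x) = s if word in x else -s; pick the
        # stump with the lowest weighted error on D.
        word, s = min(
            ((w, s) for w in vocab for s in (+1, -1)),
            key=lambda ws: sum(
                D[i] for i in range(n)
                if (ws[1] if ws[0] in X[i] else -ws[1]) != y[i]),
        )
        h = [s if word in xi else -s for xi in X]
        eps = sum(D[i] for i in range(n) if h[i] != y[i])
        eps = min(max(eps, 1e-10), 1 - 1e-10)    # clip for stability
        alpha = 0.5 * math.log((1 - eps) / eps)  # set alpha_t
        D = [D[i] * math.exp(-alpha * y[i] * h[i]) for i in range(n)]
        Z = sum(D)
        D = [d / Z for d in D]                   # renormalize by Z_t
        model.append((alpha, word, s))
    def f(x):                               # (3) weighted vote
        score = sum(a * (s if w in x else -s) for a, w, s in model)
        return 1 if score >= 0 else -1
    return f

f = adaboost([{"excellent", "plot"}, {"horrible", "plot"}], [+1, -1], T=3)
print(f({"excellent"}))  # prints 1
```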
## Slide 26

Four labeled reviews:

- (+) Excellent book. The_plot was riveting
- (+) Excellent read
- (−) Terrible: The_plot was boring and opaque
- (−) Awful book. Couldn’t follow the_plot.

[Figure: the +/− marks show example weights growing and shrinking over boosting rounds.]
## Slide 27

Bound on training error [Freund & Schapire 1995]:

$\frac{1}{n} \sum_{i=1}^{n} \mathbf{1}\left[y_i \ne f(x_i)\right] \;\le\; \prod_{t=1}^{T} Z_t, \qquad Z_t = \sum_i D_t(i)\,\exp(-\alpha_t y_i h_t(x_i))$

We greedily minimize training error by minimizing each $Z_t$.
## Slide 28

For proofs and a more complete discussion:

Robert Schapire and Yoram Singer. Improved Boosting Algorithms Using Confidence-rated Predictions. Machine Learning Journal, 1998.
## Slide 29

We chose to minimize $Z_t$. Was that the right choice?

Setting $\frac{\partial Z_t}{\partial \alpha_t} = 0$ gives $\alpha_t = \frac{1}{2} \ln\left(\frac{1 - \epsilon_t}{\epsilon_t}\right)$.

Plugging in our solution for $\alpha_t$, we have $Z_t = 2\sqrt{\epsilon_t (1 - \epsilon_t)}$, which is less than 1 whenever $\epsilon_t < 1/2$.
## Slide 30

[Plot: exp loss and hinge loss as functions of the margin]

What happens when an example is mis-labeled or an outlier?

- Exp loss exponentially penalizes incorrect scores.
- Hinge loss linearly penalizes incorrect scores.
## Slide 31

[Figure: two 2-D datasets of + and − points, one linearly separable, one non-separable]
## Slide 32

Lots of separating hyperplanes. Which should we choose?

Choose the hyperplane with the largest margin.

[Figure: several hyperplanes separating the same + and − points]
## Slide 33

Margin constraint: the score of the correct label must be greater than the margin: $y_i\,(w \cdot x_i) \ge \gamma$ for all $i$, with $\|w\| \le 1$.

Why do we fix the norm of $w$ to be less than 1? Scaling the weight vector doesn’t change the optimal hyperplane, but it scales the scores, so without the constraint the margin could be made arbitrarily large.
## Slide 34

Equivalent formulation: minimize the norm of the weight vector, with a fixed margin for each example:

$\min_w \; \frac{1}{2}\|w\|^2 \quad \text{s.t.} \quad y_i\,(w \cdot x_i) \ge 1 \quad \forall i$
## Slide 35

When the data are non-separable, we can’t satisfy the margin constraints. But some hyperplanes are still better than others.

[Figure: non-separable + and − points]
## Slide 36

Add slack variables to the optimization:

- Allow margin constraints to be violated.
- But minimize the violation as much as possible.
## Slide 37
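Slide 37 is blank in this capture; presumably it showed the soft-margin objective. The standard formulation, with slack variables $\xi_i$ and trade-off parameter $C$:

```latex
\min_{w,\,\xi}\;\; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i
\qquad \text{s.t.} \quad y_i\,(w \cdot x_i) \ge 1 - \xi_i,\;\; \xi_i \ge 0 \;\; \forall i
```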
## Slide 38
[Plot: the hinge loss $\max(0,\; 1 - y\,(w \cdot x))$ as a function of the margin]

The max creates a non-differentiable point, but there is a subgradient:

$g = \begin{cases} -y\,x & \text{if } y\,(w \cdot x) < 1 \\ 0 & \text{otherwise} \end{cases}$
## Slide 39

Subgradient descent is like gradient descent: it is also guaranteed to converge, but slowly.

Pegasos [Shalev-Shwartz and Singer 2007]: sub-gradient descent on a randomly selected subset of examples. Convergence bound:

$\underbrace{f(w_T)}_{\text{objective after } T \text{ iterations}} \;\le\; \underbrace{f(w^*)}_{\text{best objective value}} + \underbrace{O\!\left(\tfrac{\log T}{T}\right)}_{\text{linear convergence}}$
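A minimal sketch of the Pegasos update with batch size 1 (the λ value, dense feature encoding, and toy data are assumptions; the projection step from the paper is omitted):

```python
import random

def pegasos(X, y, lam=0.1, T=1000, seed=0):
    """Stochastic sub-gradient descent on the SVM objective
    lam/2 * ||w||^2 + (1/n) * sum_i max(0, 1 - y_i * <w, x_i>)."""
    rng = random.Random(seed)
    dim = len(X[0])
    w = [0.0] * dim
    for t in range(1, T + 1):
        i = rng.randrange(len(X))           # pick a random example
        eta = 1.0 / (lam * t)               # step size 1/(lam * t)
        margin = y[i] * sum(w[j] * X[i][j] for j in range(dim))
        for j in range(dim):
            w[j] *= (1 - eta * lam)         # gradient of the regularizer
            if margin < 1:                  # sub-gradient of the hinge term
                w[j] += eta * y[i] * X[i][j]
    return w

# Two separable points: w should score them with the right signs.
w = pegasos([[1.0, 0.0], [0.0, 1.0]], [+1, -1])
print(w[0] > 0 and w[1] < 0)  # True
```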
## Slide 40

We’ve been looking at binary classification, but most NLP problems aren’t binary. Multiclass linear models give piece-wise linear decision boundaries.

We showed 2-dimensional examples, but NLP is typically very high-dimensional. Joachims [2000] discusses linear models in high-dimensional spaces.
## Slide 41

Kernels let us efficiently map training data into a high-dimensional feature space, then learn a model which is linear in the new space but non-linear in our original space.

But for NLP, we already have a high-dimensional representation! And optimization with non-linear kernels is often super-linear in the number of examples.
## Slide 42

- John Shawe-Taylor and Nello Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, 2004.
- Dan Klein and Ben Taskar. Max Margin Methods for NLP: Estimation, Structure, and Applications. ACL 2005 tutorial.
- Ryan McDonald. Generalized Linear Classifiers in NLP. Tutorial at the Swedish Graduate School in Language Technology, 2007.
## Slide 43

- SVMs with slack are noise-tolerant; AdaBoost has no explicit regularization and must resort to early stopping.
- AdaBoost easily extends to non-linear models; non-linear optimization for SVMs is super-linear in the number of examples.
- This can be important for examples with hundreds or thousands of features.
## Slide 44

Logistic regression, also known as Maximum Entropy: a probabilistic discriminative model which directly models $p(y \mid x)$.

A good general machine learning book, covering discriminative learning and more: Chris Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
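In the binary case, with the same linear score $w \cdot x$ used throughout these slides, the model is:

```latex
p(y \mid x) = \frac{1}{1 + \exp\left(-y\,(w \cdot x)\right)}, \qquad y \in \{-1, +1\}
```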
## Slide 45
## Slide 46

Good features for this model?

(1) How many words are shared between the query and the web page?
(2) What is the PageRank of the webpage?
(3) Other ideas?
## Slide 47

Loss for a query and a pair of documents: the scores of documents at different ranks must be separated by a margin.
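A standard pairwise hinge loss matching this description (a hedged reconstruction; $s(q, d)$ denotes the model’s score and $d_i$ is ranked above $d_j$):

```latex
\ell(w;\, q, d_i, d_j) = \max\bigl(0,\; 1 - (s(q, d_i) - s(q, d_j))\bigr)
```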
MSRA Web Search and Mining Group (互联网搜索与挖掘组): http://research.microsoft.com/asia/group/wsm/
## Slide 48
http://www.msra.cn/recruitment/