Semi-supervised Learning — speech.ee.ntu.edu.tw/~tlkagk/courses/ml_2016/lecture/semi (v3).pdf


Semi-supervised Learning

Introduction

• Supervised learning: {(x^r, ŷ^r)}_{r=1}^{R}
  • E.g. x^r: image, ŷ^r: class label
• Semi-supervised learning: {(x^r, ŷ^r)}_{r=1}^{R} plus {x^u}_{u=R+1}^{R+U}
  • A set of unlabeled data, usually U >> R

• Transductive learning: the unlabeled data is the testing data
• Inductive learning: the unlabeled data is not the testing data
• Why semi-supervised learning?
  • Collecting data is easy, but collecting "labelled" data is expensive
  • We do semi-supervised learning in our lives

Why does semi-supervised learning help?

(Figure: labelled data — a cat and a dog image with their labels; unlabeled data — images of cats and dogs without labels.)

Why does semi-supervised learning help?

The distribution of the unlabeled data tells us something.

Usually with some assumptions

Who knows?

Outline

Semi-supervised Learning for Generative Model

Low-density Separation Assumption

Smoothness Assumption

Better Representation

Semi-supervised Learning for Generative Model

Supervised Generative Model

• Given labelled training examples x^r ∈ C1, C2
• Looking for the most likely prior probability P(Ci) and class-dependent probability P(x|Ci)
• P(x|Ci) is a Gaussian parameterized by μ^i and Σ

(Figure: two Gaussians, one with mean μ^1 and one with mean μ^2, sharing the covariance Σ.)

With P(C1), P(C2), μ^1, μ^2, Σ:

P(C1|x) = P(x|C1)P(C1) / [P(x|C1)P(C1) + P(x|C2)P(C2)]

This posterior gives the decision boundary.
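As a quick illustration (not from the lecture; the numbers and the scipy usage are my own), the posterior can be evaluated directly once the two Gaussians and the priors are fixed:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Illustrative parameters: class-dependent means, a shared covariance, and priors.
mu1, mu2 = np.array([0.0, 0.0]), np.array([3.0, 1.0])
Sigma = np.array([[1.0, 0.2], [0.2, 1.0]])
P_C1, P_C2 = 0.6, 0.4

def posterior_C1(x):
    """P(C1|x) = P(x|C1)P(C1) / [P(x|C1)P(C1) + P(x|C2)P(C2)]."""
    p1 = multivariate_normal.pdf(x, mean=mu1, cov=Sigma) * P_C1
    p2 = multivariate_normal.pdf(x, mean=mu2, cov=Sigma) * P_C2
    return p1 / (p1 + p2)

print(posterior_C1(np.array([1.0, 0.5])))  # x is assigned to C1 when this exceeds 0.5
```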

Semi-supervised Generative Model

• Given labelled training examples x^r ∈ C1, C2
• Looking for the most likely prior probability P(Ci) and class-dependent probability P(x|Ci)
• P(x|Ci) is a Gaussian parameterized by μ^i and Σ

The unlabeled data x^u help re-estimate P(C1), P(C2), μ^1, μ^2, Σ, which changes the decision boundary.

(Figure: the two Gaussians μ^1, Σ and μ^2, Σ re-estimated once the unlabeled data are taken into account.)

Semi-supervised Generative Model

• Initialization: θ = {P(C1), P(C2), μ^1, μ^2, Σ}
• Step 1: compute the posterior probability of the unlabeled data, P_θ(C1|x^u) (it depends on the model θ)
• Step 2: update the model:

P(C1) = (N_1 + Σ_{x^u} P(C1|x^u)) / N

μ^1 = (1/N_1) Σ_{x^r∈C1} x^r + (1 / Σ_{x^u} P(C1|x^u)) Σ_{x^u} P(C1|x^u) x^u   … (and similarly for the other parameters)

N: total number of examples
N_1: number of labelled examples belonging to C1

Then go back to Step 1.

The algorithm converges eventually, but the initialization influences the results.

Step 1 is the E-step and Step 2 is the M-step of the EM algorithm.
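A minimal numpy sketch of this loop for two classes with a shared covariance (function and variable names are mine; Σ is kept fixed for brevity, and the mean update uses the pooled soft-count average rather than the slide's two-term shorthand):

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_semi_supervised(x_lab, y_lab, x_unlab, n_iter=20):
    """y_lab is 0/1 (C1/C2); returns the priors and means after EM-style updates."""
    N, N1 = len(x_lab) + len(x_unlab), int(np.sum(y_lab == 0))
    # Initialization: theta = {P(C1), P(C2), mu1, mu2, Sigma} from the labelled data only
    mu = [x_lab[y_lab == c].mean(axis=0) for c in (0, 1)]
    Sigma = np.cov(x_lab.T)
    prior = np.array([N1, len(x_lab) - N1], dtype=float) / len(x_lab)
    for _ in range(n_iter):
        # Step 1 (E-step): posterior P(C1 | x^u) under the current model
        p1 = multivariate_normal.pdf(x_unlab, mu[0], Sigma) * prior[0]
        p2 = multivariate_normal.pdf(x_unlab, mu[1], Sigma) * prior[1]
        g = p1 / (p1 + p2)
        # Step 2 (M-step): update priors and means with soft counts from x^u
        prior = np.array([N1 + g.sum(), len(x_lab) - N1 + (1 - g).sum()]) / N
        mu[0] = (x_lab[y_lab == 0].sum(0) + (g[:, None] * x_unlab).sum(0)) / (N1 + g.sum())
        mu[1] = (x_lab[y_lab == 1].sum(0) + ((1 - g)[:, None] * x_unlab).sum(0)) \
                / (len(x_lab) - N1 + (1 - g).sum())
    return prior, mu, Sigma
```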

Why?

• Maximum likelihood with labelled data (closed-form solution):

log L(θ) = Σ_{x^r} log P_θ(x^r, ŷ^r),   where P_θ(x^r, ŷ^r) = P_θ(x^r|ŷ^r) P(ŷ^r)

• Maximum likelihood with labelled + unlabeled data (solved iteratively):

log L(θ) = Σ_{x^r} log P_θ(x^r, ŷ^r) + Σ_{x^u} log P_θ(x^u)

P_θ(x^u) = P_θ(x^u|C1) P(C1) + P_θ(x^u|C2) P(C2)

(x^u can come from either C1 or C2)

θ = {P(C1), P(C2), μ^1, μ^2, Σ}

Semi-supervised Learning: Low-density Separation

้ž้ป‘ๅณ็™ฝโ€œBlack-or-whiteโ€

Self-training

• Given: labelled data set = {(x^r, ŷ^r)}_{r=1}^{R}, unlabeled data set = {x^u}_{u=R+1}^{R+U}
• Repeat:
  • Train model f* from the labelled data set
  • Apply f* to the unlabeled data set
    • Obtain pseudo-labels {(x^u, y^u)}_{u=R+1}^{R+U}
  • Remove a set of data from the unlabeled data set, and add them into the labelled data set

The method is independent of the model f*. How to choose which data to move remains open; you can also give each pseudo-labelled example a weight instead of moving it outright. (For regression this does not help: the pseudo-label is the model's own output, so retraining on it leaves f* unchanged.) A sketch of the loop is given below.
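A sketch with scikit-learn (the classifier, the "most confident first" selection rule, and the batch size are my choices; the lecture leaves the selection rule open):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def self_training(x_lab, y_lab, x_unlab, n_rounds=10, batch=10):
    x_lab, y_lab, pool = x_lab.copy(), y_lab.copy(), x_unlab.copy()
    f = LogisticRegression().fit(x_lab, y_lab)               # train f* from labelled data
    for _ in range(n_rounds):
        if len(pool) == 0:
            break
        prob = f.predict_proba(pool)                          # apply f* to the unlabeled set
        pseudo, conf = prob.argmax(axis=1), prob.max(axis=1)  # hard pseudo-labels + confidence
        pick = np.argsort(-conf)[:batch]                      # move the most confident examples
        x_lab = np.vstack([x_lab, pool[pick]])
        y_lab = np.concatenate([y_lab, pseudo[pick]])
        pool = np.delete(pool, pick, axis=0)
        f = LogisticRegression().fit(x_lab, y_lab)            # retrain on the enlarged set
    return f
```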

Self-training

• Similar to semi-supervised learning for the generative model
• Hard label v.s. Soft label

Consider using a neural network: let θ* be the network parameters trained from the labelled data. Suppose the network's output for an unlabeled example x^u is 70% Class 1 and 30% Class 2.

Hard label: the new target for x^u is [1, 0] — it looks like class 1, then it is class 1.

Soft label: the new target for x^u is [0.7, 0.3] — this doesn't work, because the network already outputs [0.7, 0.3] for x^u, so there is nothing new to learn.

Entropy-based Regularization

๐œƒโˆ—๐‘ฅ๐‘ข ๐‘ฆ๐‘ข

Distribution

๐‘ฆ๐‘ข

1 2 3 4 5

๐‘ฆ๐‘ข

1 2 3 4 5

๐‘ฆ๐‘ข

1 2 3 4 5

Good!

Good!

Bad!

๐ธ ๐‘ฆ๐‘ข = โˆ’

๐‘š=1

5

๐‘ฆ๐‘š๐‘ข ๐‘™๐‘› ๐‘ฆ๐‘š

๐‘ข

Entropy of ๐‘ฆ๐‘ข : Evaluate how concentrate the distribution ๐‘ฆ๐‘ข is

๐ธ ๐‘ฆ๐‘ข = 0

๐ธ ๐‘ฆ๐‘ข = 0

๐ธ ๐‘ฆ๐‘ข

= ๐‘™๐‘›5

= โˆ’๐‘™๐‘›1

5

As small as possible

L = Σ_{x^r} C(y^r, ŷ^r) + λ Σ_{x^u} E(y^u)

The first term is the usual cross-entropy over the labelled data; the second is the entropy of the outputs on the unlabeled data, weighted by λ.
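A small numpy sketch of this loss for a 5-class softmax classifier (names and the logits interface are mine; in practice both terms are minimized jointly by gradient descent, since E(y^u) is differentiable):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def semi_supervised_loss(logits_lab, labels, logits_unlab, lam=0.1):
    """L = sum_r C(y^r, yhat^r) + lambda * sum_u E(y^u)."""
    y_r = softmax(logits_lab)
    ce = -np.log(y_r[np.arange(len(labels)), labels] + 1e-12).sum()   # cross-entropy (labelled)
    y_u = softmax(logits_unlab)
    ent = -(y_u * np.log(y_u + 1e-12)).sum()                          # entropy E(y^u) (unlabeled)
    return ce + lam * ent
```

A uniform 5-class output contributes ln 5 ≈ 1.61 to the entropy term, while a one-hot output contributes 0, matching the examples above.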

Outlook: Semi-supervised SVM

Find a boundary that can provide the largest margin and least error

Enumerate all possible labels for the unlabeled data

Thorsten Joachims, "Transductive Inference for Text Classification using Support Vector Machines", ICML, 1999.
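The enumeration is only feasible for toy-sized problems (2^U labelings). A brute-force sketch of the idea, where "largest margin and least error" is simplified to the largest soft-margin of a linear SVM (all names are mine, and this is not Joachims' actual algorithm, which avoids exhaustive enumeration):

```python
import itertools
import numpy as np
from sklearn.svm import SVC

def enumerate_tsvm(x_lab, y_lab, x_unlab, C=1.0):
    best = None
    for labels in itertools.product([0, 1], repeat=len(x_unlab)):
        X = np.vstack([x_lab, x_unlab])
        y = np.concatenate([y_lab, labels])
        svm = SVC(kernel="linear", C=C).fit(X, y)     # soft margin trades off label errors via C
        margin = 1.0 / np.linalg.norm(svm.coef_)      # larger margin = smaller ||w||
        if best is None or margin > best[0]:
            best = (margin, labels, svm)
    return best                                        # (margin, chosen labels, fitted SVM)
```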

Semi-supervised Learning: Smoothness Assumption

近朱者赤，近墨者黑 — "You are known by the company you keep"

Smoothness Assumption

• Assumption: "similar" x has the same ŷ
• More precisely:
  • x is not uniformly distributed.
  • If x^1 and x^2 are close in a high-density region (connected by a high-density path), then ŷ^1 and ŷ^2 are the same.

(Figure: pinwheel-shaped data with three points marked — x^1 and x^2 have the same label, while x^2 and x^3 have different labels. Source of image: http://hips.seas.harvard.edu/files/pinwheel.png)

Smoothness Assumption

"Indirectly" similar, connected through stepping stones.

(The example is from the tutorial slides of Xiaojin Zhu.)

(Figure: which pairs are similar? Not similar? Source of image: http://www.moehui.com/5833.html/5/)

Smoothness Assumption

• Classify astronomy vs. travel articles

(The example is from the tutorial slides of Xiaojin Zhu.)


Cluster and then Label

(Figure: the labelled and unlabeled data form three clusters. Cluster 1 is assigned Class 1; Cluster 2 and Cluster 3 are assigned Class 2, and every unlabeled point takes its cluster's class.)

Then use all the data to learn a classifier as usual.
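A sketch of this procedure with scikit-learn (K-Means, the majority vote, and the final classifier are my choices; the lecture does not prescribe a particular clustering method):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

def cluster_then_label(x_lab, y_lab, x_unlab, n_clusters=3):
    X = np.vstack([x_lab, x_unlab])
    cluster = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(X)   # cluster all the data
    y_all = np.zeros(len(X), dtype=int)
    for c in range(n_clusters):
        members = cluster == c
        votes = y_lab[members[:len(x_lab)]]                 # labelled points in this cluster
        if len(votes):
            y_all[members] = np.bincount(votes).argmax()    # majority label of the cluster
    y_all[:len(x_lab)] = y_lab                              # keep the true labels
    return LogisticRegression().fit(X, y_all)               # learn a classifier as usual
```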

Graph-based Approach

• How do we know that x^1 and x^2 are close in a high-density region (connected by a high-density path)?

Represent the data points as a graph, e.g. hyperlinks between webpages, citations between papers. The graph representation is sometimes natural; sometimes you have to construct the graph yourself.

Graph-based Approach — Graph Construction

• Define the similarity s(x^i, x^j) between x^i and x^j
• Add edges:
  • K Nearest Neighbor
  • ε-Neighborhood
• Edge weight is proportional to s(x^i, x^j)

Gaussian Radial Basis Function: s(x^i, x^j) = exp(−γ ‖x^i − x^j‖²)

(The image is from the tutorial slides of Amarnag Subramanya and Partha Pratim Talukdar.)
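A sketch of the construction (kNN edges weighted by the Gaussian RBF similarity; k and γ are illustrative):

```python
import numpy as np

def build_graph(X, k=5, gamma=0.5):
    """Return a symmetric weight matrix W with s(x^i, x^j) = exp(-gamma * ||x^i - x^j||^2)
    on the k-nearest-neighbour edges and 0 elsewhere."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # squared pairwise distances
    S = np.exp(-gamma * d2)
    W = np.zeros_like(S)
    for i in range(len(X)):
        nn = np.argsort(d2[i])[1:k + 1]                    # k nearest neighbours (skip self)
        W[i, nn] = S[i, nn]
    return np.maximum(W, W.T)                              # symmetrize the kNN graph
```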

Graph-based Approach

(Figure: a point labelled Class 1 spreads its label through the graph; the connected neighbours, and their neighbours in turn, become Class 1.)

Propagate through the graph: the labelled data influence their neighbors.
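One simple way to realize this propagation (the clamp-and-average rule below is a common choice, not something the slide specifies): repeatedly replace each node's score with the weighted average of its neighbours while keeping the labelled nodes fixed.

```python
import numpy as np

def propagate(W, y_init, labelled_idx, n_iter=100):
    """W: weight matrix; y_init: scores with labelled nodes at 0/1 and the rest at 0.5."""
    y = y_init.astype(float).copy()
    deg = W.sum(axis=1)
    for _ in range(n_iter):
        y = W @ y / np.maximum(deg, 1e-12)       # average the neighbours' scores
        y[labelled_idx] = y_init[labelled_idx]   # clamp the labelled data
    return y                                     # threshold at 0.5 to read off the classes
```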

Graph-based Approach

• Define the smoothness of the labels on the graph, summing over all data points (labelled or not):

S = (1/2) Σ_{i,j} w_{i,j} (y^i − y^j)²

Smaller S means smoother.

(Example: a four-node graph with edge weights w_{1,2} = 2, w_{1,3} = 3, w_{2,3} = 1, w_{3,4} = 1. With labels y^1 = 1, y^2 = 1, y^3 = 1, y^4 = 0 the smoothness is S = 0.5; with y^1 = 0, y^2 = 1, y^3 = 1, y^4 = 0 it is S = 3, so the first labelling is smoother.)

Graph-based Approach

• Define the smoothness of the labels on the graph:

S = (1/2) Σ_{i,j} w_{i,j} (y^i − y^j)² = y^T L y

y: (R+U)-dim vector, y = [⋯ y^i ⋯ y^j ⋯]^T
L: (R+U) × (R+U) matrix, the Graph Laplacian, L = D − W

For the example graph above:

W = [[0, 2, 3, 0],
     [2, 0, 1, 0],
     [3, 1, 0, 1],
     [0, 0, 1, 0]]

D = diag(5, 3, 5, 1)
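The example values can be checked numerically; a small sketch (the extra factor of 1/2 in front of y^T L y reproduces the slide's S = 0.5 and S = 3, since the sum in S counts each edge once):

```python
import numpy as np

W = np.array([[0, 2, 3, 0],
              [2, 0, 1, 0],
              [3, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
D = np.diag(W.sum(axis=1))          # degree matrix diag(5, 3, 5, 1)
L = D - W                           # Graph Laplacian

def smoothness(y):
    # y^T L y sums w_ij * (y_i - y_j)^2 over each edge once; halved to match the slide's values
    return 0.5 * y @ L @ y

print(smoothness(np.array([1., 1., 1., 0.])))  # 0.5 -> smooth
print(smoothness(np.array([0., 1., 1., 0.])))  # 3.0 -> not smooth
```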

Graph-based Approach

• Define the smoothness of the labels on the graph:

S = (1/2) Σ_{i,j} w_{i,j} (y^i − y^j)² = y^T L y

Here the y values are the network outputs, so S depends on the network parameters. Use it as a regularization term:

L = Σ_{x^r} C(y^r, ŷ^r) + λS

J. Weston, F. Ratle, and R. Collobert, "Deep learning via semi-supervised embedding," ICML, 2008.

(The smoothness term can be applied not only to the output layer but also to any hidden layer / embedding of the network.)
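A forward-pass sketch of this objective for a toy one-layer softmax model (the model, names, and the use of the class-1 probability as y_i are my simplifications):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def loss_with_smoothness(Wp, b, X_lab, labels, X_all, W_graph, lam=0.1):
    """L = sum_r C(y^r, yhat^r) + lambda * S, with S = y^T L y over ALL points."""
    probs_lab = softmax(X_lab @ Wp + b)
    ce = -np.log(probs_lab[np.arange(len(labels)), labels] + 1e-12).sum()
    Lap = np.diag(W_graph.sum(axis=1)) - W_graph        # graph Laplacian L = D - W
    y = softmax(X_all @ Wp + b)[:, 1]                   # network output for every point
    return ce + lam * (y @ Lap @ y)                     # smoothness as a regularization term
```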

Semi-supervised Learning: Better Representation

ๅŽป่•ชๅญ˜่๏ผŒๅŒ–็น็‚บ็ฐก

Looking for Better Representation

• Find the latent factors behind the observation
• The latent factors (usually simpler) are better representations of the observation

(Covered in the unsupervised learning part of the course.)

Reference

http://olivier.chapelle.cc/ssl-book/

Acknowledgement

• Thanks to classmate 劉議隆 for pointing out typos on the slides.

• Thanks to classmate 丁勃雄 for pointing out typos on the slides.
