MNCS 16-10 Week 1 - Byeon Seung-gyu - Introduction to Machine Learning #2


Upload: seung-gyu-byeon

Posted on 21-Jan-2018


TRANSCRIPT

Page 1: Mncs 16-10-1주-변승규-introduction to the machine learning #2


Page 2: Mncs 16-10-1주-변승규-introduction to the machine learning #2

Review: What is Machine Learning

Main Ingredients

Examples of Models

Models: Output of Machine Learning

Geometric Models

Probabilistic Models

Logical Models

Features: Workhorses of Machine Learning

Two Uses of Features

Feature Construction and Transformation

Conclusion

Page 3: Mncs 16-10-1주-변승규-introduction to the machine learning #2


Page 4: Mncs 16-10-1주-변승규-introduction to the machine learning #2


Features: describe the relevant objects in our domain as data points

Task: an abstract representation of a problem relating domain objects to an output, e.g., classifying them into two or more classes

Model: a mapping from data points to output, produced as the output of a machine learning algorithm applied to training data

[Figure: machine learning workflow. Domain objects are described by features and become data; training data is fed to a learning algorithm, which produces a model (solving the learning problem); the model maps new data to the output required by the task.]

Page 5: Mncs 16-10-1주-변승규-introduction to the machine learning #2

SpamAssassin: a linear equation of the form $\sum_{i=1}^{n} w_i x_i > t$

$x_i$: Boolean features indicating whether the $i$-th test succeeded

$w_i$: feature weights learned from the training set

$t$: threshold for classification, learned from the training set

Bayesian classifier: a decision rule of the form $\prod_{i=0}^{n} o_i > 1$

$o_i$: the likelihood ratio associated with each word $x_i$

$o_0$: the prior odds, estimated from the training set
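To make the weighted-sum rule concrete, here is a minimal sketch; the tests, weights, and threshold below are made-up illustrative values, not actual SpamAssassin parameters.

```python
# Minimal sketch of a SpamAssassin-style linear decision rule.
# The feature values, weights, and threshold are illustrative assumptions.

def spam_score(x, w):
    """Weighted sum of Boolean test results: sum_i w_i * x_i."""
    return sum(w_i * x_i for w_i, x_i in zip(w, x))

def is_spam(x, w, t):
    """Classify as spam when the score exceeds the learned threshold t."""
    return spam_score(x, w) > t

x = [1, 0, 1]          # x_i = 1 if the i-th test succeeded, 0 otherwise
w = [4.0, 1.2, 0.8]    # feature weights (would be learned from the training set)
t = 4.5                # threshold (would be learned from the training set)
print(is_spam(x, w, t))  # True, since 4.0 + 0.8 = 4.8 > 4.5
```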


Page 6: Mncs 16-10-1주-변승규-introduction to the machine learning #2


Page 7: Mncs 16-10-1주-변승규-introduction to the machine learning #2

Distinction according to intuition:

Geometric models

Probabilistic Models

Logical models

Characterization by modus operandi:

Grouping models: limited resolution

Grading models: unlimited resolution


Page 8: Mncs 16-10-1주-변승규-introduction to the machine learning #2

Basic Linear Classifier

Let $P$ and $N$ be the sets of positive and negative examples, respectively

$\mathbf{p} = \frac{1}{|P|} \sum_{\mathbf{x} \in P} \mathbf{x}$ and $\mathbf{n} = \frac{1}{|N|} \sum_{\mathbf{x} \in N} \mathbf{x}$

The decision boundary is $\mathbf{w} \cdot \mathbf{x} = t$ with $\mathbf{w} = \mathbf{p} - \mathbf{n}$

Since $(\mathbf{p} + \mathbf{n})/2$ is on the decision boundary,

$t = (\mathbf{p} - \mathbf{n}) \cdot \frac{\mathbf{p} + \mathbf{n}}{2} = \frac{\|\mathbf{p}\|^2 - \|\mathbf{n}\|^2}{2}$
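A minimal sketch of this classifier, assuming a small made-up training set; the class means, weight vector, and threshold follow the formulas above.

```python
import numpy as np

# Basic linear classifier: w = p - n and t = (|p|^2 - |n|^2) / 2,
# where p and n are the means of the positive and negative examples.
# The example points are assumptions for illustration only.

def basic_linear_classifier(P, N):
    p = P.mean(axis=0)                        # mean of the positive examples
    n = N.mean(axis=0)                        # mean of the negative examples
    w = p - n
    t = (np.dot(p, p) - np.dot(n, n)) / 2
    return w, t

def predict(x, w, t):
    return "positive" if np.dot(w, x) > t else "negative"

P = np.array([[0.0, 3.0], [2.0, 3.0]])        # positive mean p = (1, 3)
N = np.array([[1.0, 1.0], [3.0, 1.0]])        # negative mean n = (2, 1)
w, t = basic_linear_classifier(P, N)
print(w, t)                                   # [-1.  2.] 2.5
print(predict(np.array([0.0, 2.0]), w, t))    # positive, since w . x = 4 > 2.5
```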


Page 9: Mncs 16-10-1주-변승규-introduction to the machine learning #2

K-Nearest Neighbor Classifier

Predictions are made locally, based on the $k$ most similar instances

Popular similarity measures:

Euclidean distance: $\left( \sum_{i=1}^{d} (x_i - y_i)^2 \right)^{1/2}$

Manhattan distance: $\sum_{i=1}^{d} |x_i - y_i|$

Lazy method*: if your mother told you "clean up your room!", ...
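A minimal 1-NN sketch using the distance measures above; the tiny training set is a made-up illustration, not the exercise data from the following slides.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def manhattan(x, y):
    return sum(abs(xi - yi) for xi, yi in zip(x, y))

def knn_predict(x, training_data, k=1, dist=euclidean):
    """Return the majority label among the k training instances closest to x."""
    neighbors = sorted(training_data, key=lambda item: dist(x, item[0]))[:k]
    labels = [label for _, label in neighbors]
    return max(set(labels), key=labels.count)

# Illustrative (assumed) training data: (point, label) pairs
training_data = [((0.0, 0.0), "negative"), ((3.0, 3.0), "positive")]
print(knn_predict((1.0, 0.5), training_data, k=1))   # negative: (0, 0) is closest
```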


Page 10: Mncs 16-10-1주-변승규-introduction to the machine learning #2

We want to learn a binary classifier from the training data shown in the figure

(a) Derive the equation of the decision boundary of a basic linear classifier, which is a perpendicular bisector of the line between the positive and negative means.

(b) What is the error rate of the above basic linear classifier on the training data?

(c) How is the test data (-1, 1) classified by 1-NN (k nearest neighbor with k=1)?

(d) Is there any way to have (-1, 1) classified differently than in (c)?

(e) Draw the decision boundary of 1-NN in the figure.

[Figure: training data plotted on a grid with both axes running from 0 to 4; ● positive, ○ negative]

Page 11: Mncs 16-10-1주-변승규-introduction to the machine learning #2


Derive the equation of the decision boundary of a basic linear classifier, which is a perpendicular bisector of the line between the positive and negative means.

$\mathbf{w} = \mathbf{p} - \mathbf{n} = (1, 3) - (2, 1) = (-1, 2)$

$\mathbf{w} \cdot \mathbf{x} = t = (\mathbf{p} - \mathbf{n}) \cdot \frac{\mathbf{p} + \mathbf{n}}{2} = \frac{\|\mathbf{p}\|^2 - \|\mathbf{n}\|^2}{2} = \frac{10 - 5}{2}$

$\mathbf{w} \cdot \mathbf{x} = t \iff (-1, 2) \cdot \mathbf{x} = \frac{5}{2}$
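As a quick sanity check (not on the original slide), the midpoint of the two class means lies on this boundary:

$(\mathbf{p} + \mathbf{n})/2 = (1.5,\, 2)$, and $(-1, 2) \cdot (1.5, 2) = -1.5 + 4 = 2.5 = \frac{5}{2}$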

What is the error rate of the above basic linear classifier on the training data?

[Figure: the training data with the positive mean p = (1, 3), the negative mean n = (2, 1), the vector w = p - n, and the midpoint (p + n)/2 marked; ● positive, ○ negative]

Page 12: Mncs 16-10-1주-변승규-introduction to the machine learning #2


How is the test data (-1, 1) classified by 1-NN (k nearest neighbor with k=1)?

Is there any way to have (-1, 1) classified differently than in (c)?

[Figure: the test point (-1, 1) plotted together with the training data]

Page 13: Mncs 16-10-1주-변승규-introduction to the machine learning #2


Draw the decision boundary of 1-NN in the figure.

[Figure: the training data, on which the 1-NN decision boundary is to be drawn]

Page 14: Mncs 16-10-1주-변승규-introduction to the machine learning #2

Suppose we observe 'Viagra' four times more often in spam than in ham on average

The likelihood ratio associated with 'Viagra' is $P(\text{Viagra} \mid \text{Spam}) / P(\text{Viagra} \mid \text{Ham}) = 4/1 = 4$

One spam is received for every six hams on average, so the prior odds are $P(\text{Spam}) / P(\text{Ham}) = 1/6$

By Bayes' rule the posterior odds become

$\frac{P(\text{Spam} \mid \text{Viagra})}{P(\text{Ham} \mid \text{Viagra})} = \frac{P(\text{Viagra} \mid \text{Spam})\, P(\text{Spam}) / P(\text{Viagra})}{P(\text{Viagra} \mid \text{Ham})\, P(\text{Ham}) / P(\text{Viagra})} = \frac{P(\text{Viagra} \mid \text{Spam})}{P(\text{Viagra} \mid \text{Ham})} \cdot \frac{P(\text{Spam})}{P(\text{Ham})} = 4 \cdot \frac{1}{6} = \frac{4}{6}$

'Viagra' makes the probability of ham drop from $6/7 = 0.86$ to $6/10 = 0.6$
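A small sketch (not on the slide) that reproduces this posterior-odds calculation:

```python
from fractions import Fraction

likelihood_ratio = Fraction(4, 1)        # P(Viagra|Spam) / P(Viagra|Ham)
prior_odds = Fraction(1, 6)              # P(Spam) / P(Ham)

posterior_odds = likelihood_ratio * prior_odds   # Bayes' rule in odds form

p_ham_before = 1 / (1 + prior_odds)      # 6/7, about 0.86
p_ham_after = 1 / (1 + posterior_odds)   # 6/10 = 0.6

print(posterior_odds)                              # 2/3
print(float(p_ham_before), float(p_ham_after))     # 0.857... 0.6
```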

Page 15: Mncs 16-10-1주-변승규-introduction to the machine learning #2

Feature Tree

Each internal node is labeled with a feature

Each edge is labeled with a feature value

Each leaf corresponds to a unique hyper-rectangle

Each leaf is indicated with the class distribution derived from the training set

The majority of ham lives in the lower left-hand corner


Page 16: Mncs 16-10-1주-변승규-introduction to the machine learning #2

Decision Tree

Majority Class

Real Values or Linear Functions

[Figure: a feature tree with leaves labeled 'spam' or 'ham', annotated with the class proportions in each leaf (e.g., 4/5, 2/3, 1/3)]

Page 17: Mncs 16-10-1주-변승규-introduction to the machine learning #2

Feature List

A binary feature tree in which every internal node has a leaf child

Can be written as a nested if-then-else statement

Feature lists whose leaves are labeled with classes are called decision lists
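To make the nested if-then-else reading concrete, here is a minimal sketch of a decision list; the word features are assumptions for illustration.

```python
# A feature list written as a nested if-then-else statement; because the
# leaves are labeled with classes, this is a decision list. The specific
# word features are made-up examples.

def classify(words):
    if "viagra" in words:      # first internal node, with a leaf child ("spam")
        return "spam"
    elif "lottery" in words:   # second internal node, with a leaf child ("spam")
        return "spam"
    else:
        return "ham"           # final leaf

print(classify({"viagra", "sale"}))    # spam
print(classify({"meeting", "noon"}))   # ham
```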


Page 18: Mncs 16-10-1주-변승규-introduction to the machine learning #2


Discussion:

The rightmost node of the tree might better be pruned

Logical models are said to be declarative as they can provide explanations for their predictions


Page 19: Mncs 16-10-1주-변승규-introduction to the machine learning #2


Page 20: Mncs 16-10-1주-변승규-introduction to the machine learning #2


A model is only as good as its features

Garbage In Garbage Out (GIGO)

[Figure: the machine learning workflow again; domain objects are turned into data via features, training data feeds the learning algorithm, and the resulting model maps data to output]

Page 21: Mncs 16-10-1주-변승규-introduction to the machine learning #2

As splits: to zoom in on a particular area of the instance space, in grouping or logical models

$f$: the number of times the word 'Viagra' occurs in a mail

$x$: a mail

$f(x) = 0$: selects the mails that do not contain 'Viagra'

$f(x) \neq 0$ (or $f(x) > 0$): selects the mails that contain 'Viagra'

The conditions above split the instance space into two groups (a binary split)

As predictors: to contribute to the final prediction

Each numeric feature $x_i$ in the decision rule $\sum_{i=1}^{n} w_i x_i > t$ makes an independent contribution to the score of an instance, depending on its weight $w_i$
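A small sketch (with made-up weights and threshold) of the same feature used first as a split and then as a predictor:

```python
# Feature: how many times 'Viagra' occurs in a mail.
def f(mail):
    return mail.lower().split().count("viagra")

mails = ["buy viagra now", "meeting at noon", "viagra viagra discount"]

# As a split: partition the instances on f(x) == 0 versus f(x) > 0.
without_viagra = [m for m in mails if f(m) == 0]
with_viagra = [m for m in mails if f(m) > 0]

# As a predictor: the feature contributes w * f(x) to the weighted-sum score.
w, t = 2.0, 1.5                       # assumed weight and threshold
labels = ["spam" if w * f(m) > t else "ham" for m in mails]
print(labels)                          # ['spam', 'ham', 'spam']
```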


Page 22: Mncs 16-10-1주-변승규-introduction to the machine learning #2

A regression tree combining a one-split feature tree with linear regression models in the leaves; $x$ is used both as a splitting feature and as a regression variable

The function $y = \cos \pi x$ on $-1 \leq x \leq 1$, and the piecewise linear approximation achieved by the regression tree
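A minimal sketch of such a regression tree, assuming a single split at $x = 0$ and an ordinary least-squares line in each leaf:

```python
import numpy as np

x = np.linspace(-1, 1, 200)
y = np.cos(np.pi * x)

left, right = x < 0, x >= 0                     # the single split on x
fit_left = np.polyfit(x[left], y[left], 1)      # linear model for the left leaf
fit_right = np.polyfit(x[right], y[right], 1)   # linear model for the right leaf

def regression_tree(xs):
    """Route each point through the split, then apply that leaf's linear model."""
    return np.where(xs < 0, np.polyval(fit_left, xs), np.polyval(fit_right, xs))

print(np.abs(regression_tree(x) - y).max())     # worst-case approximation error
```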


Page 23: Mncs 16-10-1주-변승규-introduction to the machine learning #2

The feature construction process depends on the ML task: a good set of features should amplify the 'signal' and attenuate the 'noise' in that task

A bag-of-words representation (e.g., indexing an e-mail by the words that occur in it) is not suitable for distinguishing grammatical from ungrammatical sentences

Feature transformation further improves the signal-to-noise ratio of a feature; e.g., discretization can make a feature more useful for making predictions

An SVM transforms the entire instance space to achieve linear separability


Page 24: Mncs 16-10-1주-변승규-introduction to the machine learning #2

A linear classifier would perform poorly on this data

By transforming the original $(x, y)$ data into $(x', y') = (x^2, y^2)$, a linear decision boundary $x' + y' = 3$ separates the data

In the original space this boundary corresponds to a circle of radius $\sqrt{3}$ around the origin
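A small sketch (with made-up points) of this transformation: points are mapped to squared coordinates and tested against the linear boundary $x' + y' = 3$.

```python
import math

points = [(0.5, 0.5), (1.0, -1.0), (2.0, 0.0), (-1.5, 1.5)]   # assumed data

for x, y in points:
    x2, y2 = x ** 2, y ** 2        # transformed coordinates (x', y')
    # x' + y' < 3 is the same as x**2 + y**2 < 3, i.e. inside the circle
    # of radius sqrt(3) in the original space.
    side = "inside" if x2 + y2 < 3 else "outside"
    print(f"{(x, y)} -> {(x2, y2)}: {side} the circle of radius sqrt(3)")
```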


Page 25: Mncs 16-10-1주-변승규-introduction to the machine learning #2

Consider the mapping $\phi$ to a 3-dimensional feature space:

$\mathbf{x}_1 = (a, b) \mapsto \phi(\mathbf{x}_1) = (a^2, b^2, \sqrt{2}\,ab)$

$\mathbf{x}_2 = (c, d) \mapsto \phi(\mathbf{x}_2) = (c^2, d^2, \sqrt{2}\,cd)$

$\phi(\mathbf{x}_1) \cdot \phi(\mathbf{x}_2) = a^2 c^2 + b^2 d^2 + 2abcd = (ac + bd)^2 = (\mathbf{x}_1 \cdot \mathbf{x}_2)^2$

By squaring the dot product in the original space we obtain the dot product in the new, higher-dimensional space without actually constructing the feature vectors in that space

A function that calculates the dot product in the transformed space directly from the vectors in the original space is called a kernel

Here the kernel is $K(\mathbf{x}_1, \mathbf{x}_2) = (\mathbf{x}_1 \cdot \mathbf{x}_2)^2 = \phi(\mathbf{x}_1) \cdot \phi(\mathbf{x}_2)$
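A quick numerical check (not on the slide) that the kernel computed in the original space matches the dot product in the transformed space:

```python
import math

def phi(v):
    a, b = v
    return (a * a, b * b, math.sqrt(2) * a * b)   # the 3-dimensional mapping above

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

x1, x2 = (1.0, 2.0), (3.0, -1.0)                  # arbitrary example vectors
print(round(dot(phi(x1), phi(x2)), 6))            # dot product in the new space: 1.0
print(round(dot(x1, x2) ** 2, 6))                 # kernel K(x1, x2) = (x1 . x2)^2: 1.0
```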


Page 26: Mncs 16-10-1주-변승규-introduction to the machine learning #2

$\phi(\mathbf{x}_j) \cdot \phi(\mathbf{x}_k) = (\mathbf{x}_j \cdot \mathbf{x}_k)^2$


Page 27: Mncs 16-10-1주-변승규-introduction to the machine learning #2

Why Machine Learning?

Problems that were unsolved, or only incompletely solved, in the past may be solved with machine learning

I just don't go against the general trend (?)

I'm paging a research partner: not an academic slave who is willing to put all his research fruits on my plate, and not a head researcher and beaker washer who wants to troubleshoot all the problems in the lab on his own

Very High Entry Barrier

What is this? The introduction of intelligence into a machine, with a learning algorithm; take care not to produce garbage

How can we apply it? With machine learning tools? A perceptron?

To where? Handover?

Future Work: to find such issues


Page 28: Mncs 16-10-1주-변승규-introduction to the machine learning #2
