CS229 Lecture 2
Machine Learning
-
Outline
• Machine Learning: Overview
• Linear Regression
• Classification
• Logistic Regression
• Regularization
• Perceptron
-
1 Machine Learning: Overview
-
1 Machine Learning: Overview
-
2 Linear Regression: 1 Dimensional Input
-
2 Linear Regression: 2 Dimensional Input
-
2 Linear Regression: An Example
• Suppose we have a dataset giving the living areas, numbers of bedrooms, and prices of 200 houses from a specific region:
• Given data like this, how can we learn to predict the prices of other houses as a function of the size of their living areas and the number of bedrooms?
-
2 Linear Regression: Terms and Concepts
Sample (Example)
Feature
Target
Hypothesis
Training Data (Training Set)
Test Data (Test Set)
Training Error & Test Error
-
2 Linear Regression: Hypothesis
$h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2$
$x_1^{(i)}$: the living area of the i-th house in the training set
$x_2^{(i)}$: the number of bedrooms of the i-th house in the training set
$\theta$: parameters (weights)
$y^{(i)}$: the price of the i-th house in the training set
-
2 Linear Regression: Cost Function
Now, given a training set, how do we learn the parameters θ? One reasonable method is to make h(x) close to y.
Cost Function (Loss Function):
$J(\theta) = \frac{1}{2}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$
-
2 Linear Regression: Cost Function
Hypothesis: $h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2$
Parameters: $\theta_0, \theta_1, \theta_2$
Cost Function: $J(\theta) = \frac{1}{2}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$
Goal: Minimize $J(\theta)$
-
2 Linear Regression: Gradient Descent
Gradient Descent Algorithm
Update Rule: $\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)$ (for every j)
In our case: $\frac{\partial}{\partial \theta_j} J(\theta) = \sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right) x_j^{(i)}$
-
2 Linear Regression: Gradient Descent
-
2 Linear Regression: Gradient Descent
Let's assume we have only one training example (x, y):
$\frac{\partial}{\partial \theta_j} J(\theta) = \left(h_\theta(x) - y\right) x_j$
For a single training example, this gives the update rule:
$\theta_j := \theta_j + \alpha\left(y - h_\theta(x)\right) x_j$
-
2 Linear Regression: Gradient Descent
• Batch Gradient Descent: looks at every example in the entire training set on every step.
$\theta_j := \theta_j + \alpha\sum_{i=1}^{m}\left(y^{(i)} - h_\theta(x^{(i)})\right) x_j^{(i)}$ (for every j)
(Figure: "Old Version" vs. "New Version")
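A minimal batch gradient descent sketch under these definitions; `alpha` and `n_steps` are illustrative choices, and the tiny learning rate assumes unscaled features like the raw living areas above:

```python
import numpy as np

def batch_gradient_descent(X, y, alpha=1e-8, n_steps=1000):
    """Batch GD: each step sums the LMS update over all m examples."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_steps):
        errors = y - X @ theta            # y^(i) - h_theta(x^(i)) for all i
        theta += alpha * (X.T @ errors)   # theta_j += alpha * sum_i error_i * x_j^(i)
    return theta
```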
-
2 Linear Regression: Gradient Descent
• Stochastic Gradient Descent: update the parameters according to the gradient of the error with respect to a single training example only.
$\theta_j := \theta_j + \alpha\left(y^{(i)} - h_\theta(x^{(i)})\right) x_j^{(i)}$
Often, stochastic gradient descent gets θ "close" to the minimum much faster than batch gradient descent.
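A matching stochastic gradient descent sketch; `alpha`, `n_epochs`, and the shuffling are illustrative choices:

```python
import numpy as np

def stochastic_gradient_descent(X, y, alpha=1e-8, n_epochs=10, seed=0):
    """SGD: each parameter update uses a single training example."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        for i in rng.permutation(len(y)):     # visit examples in random order
            error = y[i] - X[i] @ theta       # y^(i) - h_theta(x^(i))
            theta += alpha * error * X[i]     # update from this example only
    return theta
```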
-
2 Linear Regression: Gradient Descent
-
2 Linear Regression: Gradient Descent
-
2 Linear Regression: Normal Equations
• Represent J in matrix-vector notation.
-
2 Linear Regression: Normal Equations
• Represent J in matrix-vector notation:
$J(\theta) = \frac{1}{2}\left(X\theta - \vec{y}\right)^T\left(X\theta - \vec{y}\right)$
X: the design matrix, m-by-(n + 1) (m examples; n features plus the intercept column)
$\vec{y}$: the m-dimensional vector of target values
-
2 Linear Regression: Normal Equations
• For example:
$\vec{y} = \begin{bmatrix} 400 \\ 330 \\ 369 \end{bmatrix}$, $X = \begin{bmatrix} 1 & 2104 & 3 \\ 1 & 1600 & 3 \\ 1 & 2400 & 3 \end{bmatrix}$, $\theta = \begin{bmatrix} \theta_0 \\ \theta_1 \\ \theta_2 \end{bmatrix}$
Note: more precisely, the features should be scaled.
-
2 Linear Regression: Feature Scaling
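The slide's figure is not reproduced in the transcript; below is a minimal z-score standardization sketch, one common feature-scaling choice (the helper name is illustrative):

```python
import numpy as np

def standardize(X):
    """Scale each column to zero mean and unit variance (z-score).

    Apply this to the raw feature columns before adding the intercept
    column of 1s (whose standard deviation would be zero).
    """
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    return (X - mu) / sigma, mu, sigma
```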
-
2 Linear Regression: Normal Equations
$\vec{y}$: m × 1
$X$: m × (n + 1)
$\theta$: (n + 1) × 1
$J(\theta)$: a real number
-
2 Linear Regression: Normal Equations
• Normal Equations: setting $\nabla_\theta J(\theta) = 0$ gives $X^T X \theta = X^T \vec{y}$, so $\theta = \left(X^T X\right)^{-1} X^T \vec{y}$
What if $X^T X$ is non-invertible?
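A minimal sketch of solving the normal equations in NumPy; using the pseudo-inverse is one common way to handle the non-invertible case:

```python
import numpy as np

def normal_equation(X, y):
    """Solve theta = (X^T X)^{-1} X^T y in closed form.

    np.linalg.pinv computes the Moore-Penrose pseudo-inverse, so this
    still returns a least-squares solution when X^T X is non-invertible
    (e.g., redundant features or more features than examples).
    """
    return np.linalg.pinv(X.T @ X) @ X.T @ y
```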
-
2 Linear Regression: Normal Equations
-
2 Gradient Descent vs. Normal Equations
-
3 Classification: Overview
• Classification: classifies examples into a given set of categories.
-
3 Classification: Application
• Image/Document Classification
• Spam Email Filtering
• Optical Character Recognition
• Machine Vision (e.g., face detection)
• Natural Language Processing (e.g., spoken language understanding)
• Fraud Detection (e.g., whether a transaction is fraudulent)
• Market Segmentation (e.g., predict whether a customer will respond to a promotion)
• Bioinformatics (e.g., classify proteins according to their function)
-
3 Classification: Algorithms
• Logistic Regression
• Support Vector Machine (SVM)
• Naïve Bayes Classifier
• K-Nearest Neighbor Classifier
• Decision Tree, Random Forest
• (Deep) Neural Network
-
4 Logistic Regression: Overview
Logistic regression was used in the biological sciences in the early twentieth century. It is used when the dependent variable (target) is categorical. For example:
• To predict whether an email is spam (1) or not (0)
• Whether a tumor is malignant (1) or not (0)
-
4 Logistic Regression: Hypothesis
$h_\theta(x) = g(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}}$, where $g(z) = \frac{1}{1 + e^{-z}}$ is called the logistic function or sigmoid function.
-
4 Logistic Regression: Hypothesis
-
4 Logistic Regression: Sigmoid Function
A useful property of the derivative of the sigmoid function:
$g'(z) = g(z)\left(1 - g(z)\right)$
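A minimal sketch of the sigmoid and this derivative identity, with a numerical check (the helper names are illustrative):

```python
import numpy as np

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    """Derivative via the identity g'(z) = g(z) * (1 - g(z))."""
    g = sigmoid(z)
    return g * (1.0 - g)

# Check the identity against central finite differences.
z = np.array([-2.0, 0.0, 3.0])
eps = 1e-6
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2.0 * eps)
print(np.allclose(sigmoid_grad(z), numeric))  # True
```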
-
4 Logistic Regression: How to fit θ?
Let's assume that:
$P(y = 1 \mid x; \theta) = h_\theta(x)$ and $P(y = 0 \mid x; \theta) = 1 - h_\theta(x)$
This can be written as:
$p(y \mid x; \theta) = \left(h_\theta(x)\right)^y \left(1 - h_\theta(x)\right)^{1-y}$
-
4 Logistic Regression: Maximize the Likelihood
• Assuming that the m training examples were generated independently, we can then write down the likelihood of the parameters as:
$L(\theta) = \prod_{i=1}^{m} p\left(y^{(i)} \mid x^{(i)}; \theta\right)$
It will be easier to maximize the log likelihood:
$\ell(\theta) = \log L(\theta) = \sum_{i=1}^{m} y^{(i)} \log h_\theta\left(x^{(i)}\right) + \left(1 - y^{(i)}\right) \log\left(1 - h_\theta\left(x^{(i)}\right)\right)$
-
4 Logistic Regression: Maximize the Likelihood
• Maximize the likelihood
• The stochastic gradient ascent rule:
$\theta_j := \theta_j + \alpha\left(y^{(i)} - h_\theta\left(x^{(i)}\right)\right) x_j^{(i)}$
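A minimal sketch of logistic regression fit by this stochastic gradient ascent rule; `alpha`, `n_epochs`, and the shuffling are illustrative choices:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_sgd(X, y, alpha=0.1, n_epochs=100, seed=0):
    """Per-example update: theta_j += alpha * (y^(i) - h_theta(x^(i))) * x_j^(i).

    y holds labels in {0, 1}; X has a leading column of 1s.
    """
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        for i in rng.permutation(len(y)):
            error = y[i] - sigmoid(X[i] @ theta)
            theta += alpha * error * X[i]
    return theta
```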
-
4 Logistic Regression: Minimize Cross-Entropy Loss
$\mathrm{Cost}\left(h_\theta(x^{(i)}), y^{(i)}\right) = \begin{cases} -\log h_\theta(x^{(i)}) & \text{if } y^{(i)} = 1 \\ -\log\left(1 - h_\theta(x^{(i)})\right) & \text{if } y^{(i)} = 0 \end{cases}$
-
4 Logistic Regression: Minimize Cross-Entropy Loss
• Cross-Entropy Loss
$\mathrm{Cost}\left(h_\theta(x^{(i)}), y^{(i)}\right) = \begin{cases} -\log h_\theta(x^{(i)}) & \text{if } y^{(i)} = 1 \\ -\log\left(1 - h_\theta(x^{(i)})\right) & \text{if } y^{(i)} = 0 \end{cases}$
$J(\theta) = \frac{1}{m}\sum_{i=1}^{m} \mathrm{Cost}\left(h_\theta(x^{(i)}), y^{(i)}\right) = -\frac{1}{m}\sum_{i=1}^{m}\left[ y^{(i)} \log h_\theta(x^{(i)}) + \left(1 - y^{(i)}\right) \log\left(1 - h_\theta(x^{(i)})\right) \right]$
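A minimal sketch of this loss; the `eps` clipping guards against log(0) and is a practical addition, not part of the slide's formula:

```python
import numpy as np

def cross_entropy_loss(h, y, eps=1e-12):
    """J = -(1/m) * sum[y * log(h) + (1 - y) * log(1 - h)].

    h holds predicted probabilities h_theta(x^(i)); y holds labels in {0, 1}.
    """
    h = np.clip(h, eps, 1.0 - eps)
    return -np.mean(y * np.log(h) + (1.0 - y) * np.log(1.0 - h))
```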
-
4 Logistic Regression: Decision Boundary
-
4 Logistic Regression: Decision Boundary
-
4 Logistic Regression vs. Linear Regression
-
4 Logistic Regression: Multiclass Classification
-
4 Logistic Regression: Multiclass Classification
-
4 Logistic Regression: Multiclass Classification
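The figures on these multiclass slides are not reproduced in the transcript; below is a minimal one-vs-rest sketch, one standard way to extend binary logistic regression to many classes. It assumes the `logistic_sgd` helper sketched earlier:

```python
import numpy as np

def one_vs_rest_fit(X, y, n_classes, **sgd_kwargs):
    """Train one binary classifier per class: class k vs. all the rest."""
    return np.stack([
        logistic_sgd(X, (y == k).astype(float), **sgd_kwargs)
        for k in range(n_classes)
    ])

def one_vs_rest_predict(X, thetas):
    """Predict the class whose classifier gives the largest score."""
    return np.argmax(X @ thetas.T, axis=1)   # (m, n_classes) scores
```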
-
5 Regularization: Overfitting
-
5 Regularization: Overfitting
-
5 Regularization: Overfitting
-
5 Regularization
-
5 Regularization: Linear Regression
-
5 Regularization: Linear Regression
-
5 Regularization: Linear Regression
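The equations on these slides were not captured in the transcript; for reference, the usual L2-regularized linear regression cost (assuming the common convention, with λ the regularization strength and $\theta_0$ left unregularized) is:

$$J(\theta) = \frac{1}{2m}\left[\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 + \lambda\sum_{j=1}^{n}\theta_j^2\right]$$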
-
5 Regularization: Logistic Regression
-
5 Regularization: Logistic Regression
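Likewise, since this slide's equations were not captured, the usual L2-regularized logistic regression cost (again assuming the common convention) is:

$$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[ y^{(i)} \log h_\theta(x^{(i)}) + \left(1 - y^{(i)}\right) \log\left(1 - h_\theta(x^{(i)})\right) \right] + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2$$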
-
6 Perceptron: Artificial Neuron
(Figure: the basic structure of a biological neuron, with synapse, dendrite, and axon labeled, alongside the structural model of an artificial neuron.)
-
6 Perceptron: Model
• Perceptron:
• First compute a weighted sum of the inputs from other neurons.
• Then output a 1 if the weighted sum exceeds the threshold.
-
6 Perceptron: Minimizing Squared Errors
The squared error for a single training example with input x and true output y is:
$E = \frac{1}{2}\left(y^{(i)} - h_\theta(x^{(i)})\right)^2$
Gradient Descent:
$\frac{\partial E}{\partial \theta_j} = \left(y^{(i)} - h_\theta(x^{(i)})\right) \times (-1) \times g'(\mathrm{in}) \times x_j^{(i)} = -\left(y^{(i)} - h_\theta(x^{(i)})\right) x_j^{(i)}$
(the derivative $g'$ of the threshold function is undefined at the step, so it is omitted)
Update Rule:
$\theta_j := \theta_j + \alpha\left(y^{(i)} - h_\theta(x^{(i)})\right) x_j^{(i)}$
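A minimal perceptron training sketch implementing this update rule; `alpha`, `n_epochs`, and the AND-function demo are illustrative:

```python
import numpy as np

def perceptron_train(X, y, alpha=1.0, n_epochs=10):
    """Perceptron rule: theta += alpha * (y^(i) - h_theta(x^(i))) * x^(i).

    X includes a leading column of 1s (bias); y holds labels in {0, 1}.
    """
    theta = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        for xi, yi in zip(X, y):
            h = 1.0 if xi @ theta > 0 else 0.0   # threshold activation
            theta += alpha * (yi - h) * xi       # no change when h == yi
    return theta

# Toy usage: learn the (linearly separable) AND function.
X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]], dtype=float)
y = np.array([0, 0, 0, 1], dtype=float)
theta = perceptron_train(X, y)
print([1.0 if x @ theta > 0 else 0.0 for x in X])  # [0.0, 0.0, 0.0, 1.0]
```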
-
Homework (1)
Task: Regression
Dataset: Housing (https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression.html#housing)
Algorithm: Linear Regression
-
Homework (2)
Task: Classification
Dataset: MNIST (http://yann.lecun.com/exdb/mnist/). The MNIST database of handwritten digits has a training set of 60,000 examples and a test set of 10,000 examples.
Algorithm: Logistic Regression
-
Homework (3)
Task: Classification
Dataset: Breast Cancer Wisconsin (Diagnostic) Data Set (https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29)
Algorithm: Perceptron
-
Tips for Homework
• Evaluation Metrics:
Regression: Root Mean Squared Error
Classification: Accuracy
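Minimal sketches of both metrics (the helper names are illustrative):

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root Mean Squared Error, for the regression homework."""
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def accuracy(y_true, y_pred):
    """Fraction of correct predictions, for the classification homework."""
    return np.mean(y_true == y_pred)
```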
-
Tips for Homework
• Plot the loss function during training.
(Figure panels: Normal, Overfitting, Learning Rate)
-
Tips for Homework
• K-Fold Cross-Validation
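A minimal k-fold skeleton using scikit-learn's KFold; the toy arrays and the choice of 5 splits are illustrative:

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20.0).reshape(10, 2)   # toy features
y = np.arange(10.0)                  # toy targets
kf = KFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in kf.split(X):
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]
    # fit the model on (X_train, y_train), evaluate on (X_test, y_test)
```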
-
More Materials
• 1. Deep Learning, deeplearning.ai (Andrew Ng et al.) @ Coursera / NetEase Cloud Classroom
• 2. Andrew Ng, Machine Learning, Coursera / NetEase Open Courses
• 3. Machine Learning Foundations, Hsuan-Tien Lin, National Taiwan University @ Coursera
• 4. Machine Learning Techniques, Hsuan-Tien Lin, National Taiwan University @ Coursera
• 5. CS231n: Convolutional Neural Networks for Visual Recognition, Fei-Fei Li et al., Stanford University
• 6. Ian Goodfellow, Yoshua Bengio, and Aaron Courville, Deep Learning, MIT Press, 2016, http://www.deeplearningbook.org/
• 7. Hung-yi Lee, National Taiwan University, http://speech.ee.ntu.edu.tw/~tlkagk/courses_ML16.html
• 8. MIT OpenCourseWare, Gilbert Strang, Linear Algebra
-
Programming Skills
• Python
• Ubuntu
• Keras
• PyTorch
• TensorFlow
-
END