1. Introduction
◮ Motivation and definition
◮ Data mining tasks
◮ Methods and algorithms
◮ Examples and applications
1
Motivation: why data mining?
◮ Data explosion problem
– Automated data collection tools lead to tremendous amounts of
data stored in databases.
– Processing capacity of computers grows rapidly: CPU, memory,...
– Rapidly growing gap between our ability to generate data, and our
ability to make use of it.
We are drowning in data, but starving for knowledge!
2
What is data mining?
◮ Data Mining is the process of discovering new patterns from large
data sets involving methods from statistics and artificial intelligence
but also database management. The term is a buzzword, and is
frequently misused to mean any form of large scale data or
information processing. (Wikipedia)
◮ The term Data Mining is a misnomer. Mining of gold from rocks or
sand is called gold mining, rather than rock mining or sand mining.
A more appropriate term is Knowledge Mining from Data.
◮ Analysis of data is a process of inspecting, cleaning, transforming,
and modeling data with the goal of highlighting useful information,
suggesting conclusions, and supporting decision making.
3
What data mining is not
◮ Data mining differs from traditional database queries
– the query might not be precisely stated
– the data accessed is usually a different version from that of the
operational database
– output of data mining is not a subset of the database.
4
Example: wage data
◮ The goal: relate wage to education, age, year when the wage is
earned.
[Figure: scatterplots of Wage versus Age, Year, and Education Level]
◮ Predict wage on the basis of age, year, and education.
5
Example: stock market data
◮ The goal: to predict an increase or decrease in the S&P 500 stock index on a
given day using the previous days' percentage changes in the index (data covering the past 5 years).
[Figure: boxplots of the percentage change in the S&P 500 one, two, and three days previous, grouped by Today's Direction (Down/Up)]
◮ Boxplots of the previous day's (2, 3 days) percentage change in the index
for the 648/602 days when it increased/decreased on the subsequent day.
6
Example: gene expression data
◮ Data: 6380 measurements of gene expression for 64 cancer cell lines.
◮ The goal: determine if there are groups (clusters) among the cell lines.
[Figure: two scatterplots of the cell lines in the space of the first two principal components (Z1, Z2)]
◮ Data set in two-dimensional space (principal components): each
point is a cell line; in the right panel, points are labeled by the 14 types of cancer.
7
Steps of the data mining process
◮ Learning the application domain
◮ Creating the target data set – data selection
◮ Data cleaning and preprocessing (may take 60% of the effort).
◮ Data reduction and transformation
◮ Choosing functions of data mining
– summarization, classification, regression, association
◮ Data mining – search for patterns
◮ Pattern evaluation and knowledge representation
8
Example
A credit card company must determine whether to authorize credit card
purchases. Based on historical information about purchases, each
purchase is placed in one of 4 classes:
– authorize
– ask for further identification before authorization
– do not authorize
– do not authorize and contact police.
Data mining tasks: (1) determine how the data fit into the classes; (2)
apply the model for each new purchase.
9
1. Objectives of data mining
◮ Exploratory data analysis (data visualization)
Summary statistics, various plots, etc. Starting point of any data
mining process.
◮ Descriptive modeling – describe the data generating mechanism.
Examples include
* estimating probability distribution of the data
* finding groups in data (clustering)
* relationships between variables (dependency modeling: regression,
time series...).
10
2. Objectives of data mining
◮ Predictive modeling – make a prediction about values of data using
known results from the database.
* Regression
* Classification
* Time series models
A key distinction between predictive and descriptive modeling is that
prediction problems focus on a single variable or a set of variables,
while in description problems no single variable is central to the
model.
11
Example applications
◮ Fraud detection
– AT&T uses a data mining system to detect fraudulent
international calls
– The Financial Crimes Enforcement Network AI Systems (FAIS)
uses data mining technologies to identify possible money
laundering activity within large cash transactions.
◮ Risk management
– Risk management applications use data mining to determine
insurance rates, manage investment portfolios, and differentiate
between companies and/or individuals who are good and poor
credit risks.
12
– US West Communications uses data mining to determine customer
trends and needs based on characteristics such as family size,
median family age and location.
◮ Text mining and Web analysis
– Personalize the products and pages displayed to a particular user
or set of users.
13
Specific applications
◮ Predict whether a patient, hospitalized due to a heart attack, will
have a second heart attack. Prediction can use demographic, diet and
clinical measurements.
◮ Predict the price of a stock six months from now on the basis of
company performance measures and economic data.
◮ Identify the number in a handwritten ZIP code from a digitized
image.
◮ Identify the risk factors for prostate cancer based on clinical and
demographic variables.
◮ Distinguish pornographic and non–pornographic web pages.
14
I. Data mining tasks
◮ Regression
Regression is used to map a data item into a real valued prediction
variable. Regression involves the learning of the function that does
this mapping.
◮ Classification
Classification maps data into predefined groups (classes). It is often
referred to as supervised learning because the classes are determined
before examining the data.
◮ Time series analysis
Modeling variables evolving over time.
15
II. Data mining tasks
◮ Clustering
Clustering is similar to classification except that the groups are not
predefined, but are determined from the data.
◮ Association rules
Association rules (affinity analysis) refers to the data mining task of
uncovering relationships in data.
◮ Summarization
Summarization maps data into subsets with associated simple
descriptions. For example, U.S. News & World Report uses the average
SAT and ACT scores to compare US universities.
16
Data mining methods and algorithms
◮ Regression methods
Linear regression, kernels, splines, nearest–neighbors, neural
networks, regression trees,...
◮ Classification methods
Discriminant analysis, logistic regression, nearest–neighbors,
classification trees, ensemble methods,...
◮ Clustering methods
Hierarchical clustering, K–means, K–medoids, mixtures,...
17
2. Data types and distance measures
18
Data format
◮ Observations: for subject i ∈ {1, . . . , N} we observe k different
features (e.g., age, cholesterol level, weight, marital status, etc.), i.e.
we have a vector xi = (xi,1, . . . , xi,k).
◮ The database can be identified with an N × k matrix: row i is the
feature vector x_i of subject i, and the columns correspond to the k features,
  X = \begin{pmatrix} x_{1,1} & x_{1,2} & \cdots & x_{1,k} \\ x_{2,1} & x_{2,2} & \cdots & x_{2,k} \\ \vdots & \vdots & & \vdots \\ x_{N,1} & x_{N,2} & \cdots & x_{N,k} \end{pmatrix}
19
Distance and data types
◮ A distance measure d(x,y) between observations x and y should
satisfy the following axioms:
(i) d(x,y) ≥ 0 for all x, y, and d(x,y) = 0 iff x = y.
(ii) d(x,y) = d(y,x) (symmetry)
(iii) d(x,y) ≤ d(x, z) + d(z,y) (triangle inequality)
◮ Features can be
– Numerical – discrete or continuous (age, weight,...)
– Binary – encoded by 0− 1 (gender, success–failure,...)
– Categorical – extension of binary to more categories
– Ordinal – ordered categorical: {worst, bad, good, best}
20
1. Distance measures
◮ Numerical data
– Euclidean (L2) distance:
  d(x, y) = \sqrt{(x − y)^T (x − y)} = \sqrt{\sum_{i=1}^{k} (x_i − y_i)^2}
– Manhattan (L1) distance:
  d(x, y) = \sum_{i=1}^{k} |x_i − y_i|
– Lp distance:
  d(x, y) = \{ \sum_{i=1}^{k} |x_i − y_i|^p \}^{1/p}, p ∈ [1, ∞].
– Mahalanobis distance: if Σ is the covariance matrix of the features then
  d(x, y) = \sqrt{(x − y)^T Σ^{-1} (x − y)}.
21
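A minimal sketch of these distance measures in R (the vectors x and y and the data matrix used to estimate Σ below are illustrative, not from the course data):
> # two feature vectors, e.g. (age, height, BMI)
> x <- c(63, 1.80, 24); y <- c(70, 1.65, 31)
> sqrt(sum((x - y)^2))                             # Euclidean (L2) distance
> sum(abs(x - y))                                  # Manhattan (L1) distance
> dist(rbind(x, y), method = "minkowski", p = 3)   # Lp distance via dist()
> Z <- matrix(rnorm(300), ncol = 3)                # illustrative data to estimate Sigma
> Sigma <- cov(Z)
> sqrt(t(x - y) %*% solve(Sigma) %*% (x - y))      # Mahalanobis distance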
2. Distance measures
◮ Categorical data
– Matching (Hamming) distance:
  d(x, y) = \sum_{i=1}^{k} I(x_i ≠ y_i)
– Weighted matching distance:
  d(x, y) = \sum_{i=1}^{k} w_i I(x_i ≠ y_i), w_i > 0, \sum_{i=1}^{k} w_i = 1.
◮ Ordinal data: an ordinal variable x_i with m_i levels is substituted by its
rank r(x_i) ∈ {1, . . . , m_i}. To normalize the rank to [0, 1] define
z(x_i) = (r(x_i) − 1)/(m_i − 1), and set
  d(x, y) = \sum_{i=1}^{k} |z(x_i) − z(y_i)|.
◮ Mixed data: use a mixture of normalized distances.
22
3. Data summary and visualization
23
Summary statistics
# The UScereal data frame has 65 rows and 11 columns.
# The data come from the 1993 ASA Statistical Graphics Exposition,
# and are taken from the mandatory F&DA food label.
# The data have been normalized here to a portion of one American cup.
>library(MASS)
>data(UScereal)
>summary(UScereal)
mfr calories protein fat sodium
G:22 Min. : 50.0 Min. : 0.7519 Min. :0.000 Min. : 0.0
K:21 1st Qu.:110.0 1st Qu.: 2.0000 1st Qu.:0.000 1st Qu.:180.0
N: 3 Median :134.3 Median : 3.0000 Median :1.000 Median :232.0
P: 9 Mean :149.4 Mean : 3.6837 Mean :1.423 Mean :237.8
Q: 5 3rd Qu.:179.1 3rd Qu.: 4.4776 3rd Qu.:2.000 3rd Qu.:290.0
R: 5 Max. :440.0 Max. :12.1212 Max. :9.091 Max. :787.9
fibre carbo sugars shelf
Min. : 0.000 Min. :10.53 Min. : 0.00 Min. :1.000
1st Qu.: 0.000 1st Qu.:15.00 1st Qu.: 4.00 1st Qu.:1.000
Median : 2.000 Median :18.67 Median :12.00 Median :2.000
Mean : 3.871 Mean :19.97 Mean :10.05 Mean :2.169
24
3rd Qu.: 4.478 3rd Qu.:22.39 3rd Qu.:14.00 3rd Qu.:3.000
Max. :30.303 Max. :68.00 Max. :20.90 Max. :3.000
potassium vitamins
Min. : 15.0 100% : 5
1st Qu.: 45.0 enriched:57
Median : 96.6 none : 3
Mean :159.1
3rd Qu.:220.0
Max. :969.7
># correlation matrix between some variables
>cor(UScereal[c("calories","protein","fat","fibre","sugars")])
calories protein fat fibre sugars
calories 1.0000000 0.7060105 0.5901757 0.3882179 0.4952942
protein 0.7060105 1.0000000 0.4112661 0.8096397 0.1848484
fat 0.5901757 0.4112661 1.0000000 0.2260715 0.4156740
fibre 0.3882179 0.8096397 0.2260715 1.0000000 0.1489158
sugars 0.4952942 0.1848484 0.4156740 0.1489158 1.0000000
25
1. Density visualization
Histogram
>hist(UScereal[,"protein"], main="UScereal data", xlab="protein")
[Figure: histogram of protein for the UScereal data]
26
2. Density visualization
Kernel smoothing
>plot(density(UScereal[,"protein"],kernel="gaussian"), main="UScereal data",
+ xlab="protein")
[Figure: Gaussian kernel density estimate of protein for the UScereal data]
27
Boxplot
>mfr=UScereal[,"mfr"]
>boxplot(UScereal[mfr=="K","protein"], UScereal[mfr=="G","protein"],
+ names=c("Kellogs", "General Mills"), xlab="Manufacturer", ylab="protein")
[Figure: boxplots of protein by manufacturer (Kellogs, General Mills)]
28
Quantile plot
A normal QQ plot displays the pairs (z_{k/(n+1)}, x_{(k)}), where z_q is the qth quantile
of N(0, 1), i.e., Φ(z_q) = q, 0 < q < 1, and x_{(k)} is the kth order statistic.
>qqnorm(UScereal$calories)
[Figure: Normal Q-Q plot of calories — Theoretical Quantiles vs. Sample Quantiles]
29
Relations between two variables
Scatterplot
>plot(UScereal$fat, UScereal$calories, xlab="Fat", ylab="Calories")
[Figure: scatterplot of Calories versus Fat]
30
Relations between more than two variables
Scatterplot matrix
>plot(UScereal[c("calories", "fat", "protein", "sugars","fibre", "sodium")])
[Figure: scatterplot matrix of calories, fat, protein, sugars, fibre, sodium]
31
Parallel plot
>library(lattice)   # parallel() is provided by the lattice package (parallelplot() in newer versions)
>parallel(UScereal[, c("calories","protein", "fat", "fibre")])
[Figure: parallel coordinates plot of calories, protein, fat, fibre (each axis scaled from Min to Max)]
32
4. Association rules
(Market basket analysis)
33
Market basket analysis
◮ Association rules show the relationships between data items.
◮ Typical example
A grocery store keeps a record of weekly transactions. Each
represents the items bought during one cash register
transaction. The objective of the market basket analysis is to
determine the items likely to be purchased together by a
customer.
34
Example
◮ Items: {Beer, Bread, Jelly, Milk, PeanutButter}
Transaction Items
t1 Bread, Jelly, PeanutButter
t2 Bread, PeanutButter
t3 Bread, Milk, PeanutButter
t4 Beer, Bread
t5 Beer, Milk
◮ 100% of the time that PeanutButter is purchased, so is Bread.
◮ 33.3% of the time PeanutButter is purchased, Jelly is also
purchased.
◮ PeanutButter exists in 60% of the overall transactions.
35
Definitions
◮ Given:
– a set of items I = {I_1, . . . , I_m}
– a database of transactions D = {t_1, . . . , t_n} where t_i = {I_{i1}, . . . , I_{ik}} and I_{ij} ∈ I
◮ Association rule
Let X and Y be two disjoint subsets (itemsets) of I. We say that
Y is associated with X (and write X ⇒ Y) if the appearance of
X in a transaction "usually" implies that Y occurs in that
transaction too. We identify
X ⇔ {X is purchased}
36
Support and confidence
◮ The support s of an association rule X ⇒ Y is the percentage of
transactions in the database that contain X ∩ Y:
  s(X ⇒ Y) = P(X ∩ Y) = \frac{1}{n} \sum_{i=1}^{n} 1\{t_i ⊇ (X ∩ Y)\}.
◮ The confidence or strength α of an association rule X ⇒ Y is the ratio of
the number of transactions that contain X ∩ Y to the number of
transactions that contain X:
  α(X ⇒ Y) = P(Y | X) = \frac{P(X ∩ Y)}{P(X)} = \frac{\sum_{i=1}^{n} 1\{t_i ⊇ (X ∩ Y)\}}{\sum_{i=1}^{n} 1\{t_i ⊇ X\}}
◮ Problem: find all rules with support ≥ s0 and confidence ≥ α0.
37
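A minimal sketch in R computing support and confidence for the grocery example above, using a 0/1 incidence matrix (rows = transactions, columns = items); it reproduces the figures on the next slide:
> items <- c("Beer", "Bread", "Jelly", "Milk", "PeanutButter")
> D <- rbind(t1 = c(0, 1, 1, 0, 1),    # Bread, Jelly, PeanutButter
+            t2 = c(0, 1, 0, 0, 1),    # Bread, PeanutButter
+            t3 = c(0, 1, 0, 1, 1),    # Bread, Milk, PeanutButter
+            t4 = c(1, 1, 0, 0, 0),    # Beer, Bread
+            t5 = c(1, 0, 0, 1, 0))    # Beer, Milk
> colnames(D) <- items
> support <- function(X) mean(apply(D[, X, drop = FALSE] == 1, 1, all))
> confidence <- function(X, Y) support(c(X, Y)) / support(X)
> support(c("Bread", "PeanutButter"))   # 0.6
> confidence("PeanutButter", "Bread")   # 1
> confidence("Bread", "PeanutButter")   # 0.75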
Support and confidence of some rules
X ⇒ Y s α
Bread ⇒ PeanutButter 60% 75%
PeanutButter ⇒ Bread 60% 100%
Beer ⇒ Bread 20% 50%
PeanutButter ⇒ Jelly 20% 33.3%
Jelly ⇒ PeanutButter 20% 100%
Jelly ⇒ Milk 0% 0%
38
Other measures of rule quality
Rules with high support and confidence may be obvious or not interesting.
◮ Example: 100 baskets, purchases of tea and coffee

             coffee   coffee^c   Σ_row
  tea           20         5        25
  tea^c         70         5        75
  Σ_col         90        10       100

Rule tea ⇒ coffee: s = 0.2, α = (20/100)/(25/100) = 0.8 ⇒ a strong rule!
However, s(coffee) = 90/100 = 0.9; thus, there is a negative
"association" between buying tea and buying coffee.
◮ Additional measures of the rule quality are needed.
39
Lift and conviction
◮ Lift (interest)
  lift(X ⇒ Y) = \frac{P(X ∩ Y)}{P(X) P(Y)} = \frac{\frac{1}{n}\sum_{i=1}^{n} 1(t_i ⊇ X ∩ Y)}{\frac{1}{n}\sum_{i=1}^{n} 1(t_i ⊇ X) \cdot \frac{1}{n}\sum_{i=1}^{n} 1(t_i ⊇ Y)}
Rules with lift greater than 1 are interesting (lift = 1 corresponds to independence). In the previous example
  lift(tea ⇒ coffee) = \frac{0.2}{0.25 × 0.9} = 0.89.
◮ Conviction
  conviction(X ⇒ Y) = \frac{P(X) P(Y^c)}{P(X ∩ Y^c)} = \frac{\frac{1}{n}\sum_{i=1}^{n} 1\{t_i ⊇ X\} \cdot \frac{1}{n}\sum_{i=1}^{n} 1\{t_i ⊇ Y^c\}}{\frac{1}{n}\sum_{i=1}^{n} 1\{t_i ⊇ X ∩ Y^c\}}
Rules that always hold have conviction = ∞. In the previous example
  conviction(tea ⇒ coffee) = \frac{(25/100) · (10/100)}{5/100} = 0.5
40
Lift and conviction of some rules
X ⇒ Y                        Lift   Conviction
Bread ⇒ PeanutButter         5/4      8/5
PeanutButter ⇒ Bread         5/4       ∞
Beer ⇒ Bread                 5/8      2/5
PeanutButter ⇒ Jelly         5/3      6/5
Jelly ⇒ PeanutButter         5/3       ∞
Jelly ⇒ Milk                  0       3/5
41
Mining rules from frequent itemsets
1. Find frequent itemsets (itemset whose number of occurrences is above
a threshold s0).
2. Generate rules from frequent itemsets.
Input: D - database, I - collection of all items,
L-collection of all frequent itemsets, s0, α0.
Output: R - association rules satisfying s0 and α0.
R = ∅
for each ℓ ∈ L do
  for each x ⊂ ℓ such that x ≠ ∅ do
    if support(ℓ)/support(x) ≥ α0 then R = R ∪ {x ⇒ (ℓ − x)}
42
Example
Assume s0 = 30% and α0 = 50%.
◮ Frequent itemset L
{{Beer},{Bread},{Milk},{PeanutButter},{Bread,PeanutButter}}
◮ For ℓ = {Bread, PeanutButter} we have two subsets:
  support({Bread, PeanutButter}) / support({Bread}) = 60/80 = 0.75 > 0.5
  support({Bread, PeanutButter}) / support({PeanutButter}) = 60/60 = 1 > 0.5
◮ Conclusion:
PeanutButter ⇒ Bread and Bread ⇒ PeanutButter are valid
association rules.
43
1. Finding frequent itemsets: apriori algorithm
◮ Frequent itemset property
Any subset of a frequent itemset must be frequent.
◮ Basic idea:
– Look at candidate sets of size i
– Choose the frequent itemsets of size i
– Generate candidate itemsets of size i + 1 by joining (taking unions
of) frequent itemsets found at pass i.
44
2. Finding frequent itemsets: apriori algorithm
At kth pass of the apriori algorithm we form a set of candidate itemsets
Ck of size k and a set of frequent itemsets Lk of size k.
1. Start with all singleton itemsets C1. Count the support of all items in C1
and form the set L1 of all items from C1 with support ≥ s0.
2. Let C2 be the set of all pairs of items from L1. Count the support of all members
of C2 and form the set L2 of pairs with support ≥ s0.
3. Let C3 be the set of triples each of whose two-element subsets is in L2.
Calculate the support of each triple in C3 and form the set L3 of triples
with support ≥ s0.
4. Continue...
45
Example: apriori algorithm
s0 = 30%, α0 = 50%
Pass  Candidates                           Frequent itemsets
1     {Beer},{Bread},{Jelly},              {Beer},{Bread},
      {Milk},{PeanutButter}                {Milk},{PeanutButter}
2     {Beer,Bread},{Beer,Milk},            {Bread,PeanutButter}
      {Beer,PeanutButter},{Bread,Milk},
      {Bread,PeanutButter},
      {Milk,PeanutButter}
46
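For real data this search is usually done with a dedicated package. A sketch of the same toy analysis with the arules package (assuming it is installed), using the thresholds s0 = 30% and α0 = 50% from above:
> library(arules)
> trans <- as(list(t1 = c("Bread", "Jelly", "PeanutButter"),
+                  t2 = c("Bread", "PeanutButter"),
+                  t3 = c("Bread", "Milk", "PeanutButter"),
+                  t4 = c("Beer", "Bread"),
+                  t5 = c("Beer", "Milk")), "transactions")
> rules <- apriori(trans, parameter = list(supp = 0.3, conf = 0.5, minlen = 2))
> inspect(rules)    # each rule with its support, confidence and lift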
Summary
◮ Efficiently finding frequent itemsets
Finding frequent itemsets is costly: if there are m items, there are potentially
2^m − 1 frequent itemsets.
◮ Once all frequent itemsets are found, generating the association rules
is easy and straightforward.
47
Other applications
Applications of association rules are not limited to basket analysis.
◮ Finding related concepts: items=words, baskets=documents (web
pages, tweets,...). We look for sets of words appearing together in
many documents. Expect that {Brad, Angelina} appears with
surprising frequency.
◮ Plagiarism: items=documents, baskets=sentences; an item is in the
basket if the sentence appears in the document.
◮ Biomarkers: items are of two types – biomarkers (genes, proteins,...)
and diseases; each basket is the set of data about the patient (genome
and blood analysis, medical history...). A frequent itemset that
contains one disease and one or more biomarkers suggests a test for a
disease.
48
Example: DVD movies purchases
◮ Data:
> data<-read.table("DVDdata.txt",header=T)
> data
Braveheart Gladiator Green.Mile Harry.Potter1 Harry.Potter2 LOTR1 LOTR2
1 0 0 1 1 0 1 1
2 1 1 0 0 0 0 0
3 0 0 0 0 0 1 1
4 0 1 0 0 0 0 0
5 0 1 0 0 0 0 0
6 0 1 0 0 0 0 0
7 0 0 0 1 1 0 0
8 0 1 0 0 0 0 0
9 0 1 0 0 0 0 0
10 0 1 1 0 0 1 0
49
Patriot Sixth.Sense
1 0 1
2 1 0
3 0 0
4 1 1
5 1 1
6 1 1
7 0 0
8 1 0
9 1 1
10 0 1
>
◮ Preparations
> nobs<-dim(data)[1]
> n<-dim(data)[2]
> namesvec<-colnames(data)
> namesvec
[1] "Braveheart" "Gladiator" "Green.Mile" "Harry.Potter1"
[5] "Harry.Potter2" "LOTR1" "LOTR2" "Patriot"
[9] "Sixth.Sense"
>
50
> # thresholds for rules
> supthresh<-0.2
> conftresh<-0.5
> lifttresh<-2
>
> sup1<-array(0,n)
> sup2<-matrix(0,ncol=n,nrow=n,dimnames=list(namesvec,namesvec))
◮ Calculating the chance of appearance P (X) for each movie
> for (i in 1:n){
+ sup1[i]<-sum(data[,i])/nobs}
> sup1
[1] 0.1 0.7 0.2 0.2 0.1 0.3 0.2 0.6 0.6
◮ Calculating the chance of appearance P (X, Y ) for each pair of movies
> for (j in 1:n){
+ if(sup1[j]>=supthresh){
+ for (k in j:n){
+ if (sup1[k]>=supthresh){
+ sup2[j,k]<-data[,j]%*%data[,k]
+ sup2[k,j]<-sup2[j,k] } } } }
> sup2<-sup2/nobs
> sup2
51
Braveheart Gladiator Green.Mile Harry.Potter1 Harry.Potter2 LOTR1
Braveheart 0 0.0 0.0 0.0 0 0.0
Gladiator 0 0.7 0.1 0.0 0 0.1
Green.Mile 0 0.1 0.2 0.1 0 0.2
Harry.Potter1 0 0.0 0.1 0.2 0 0.1
Harry.Potter2 0 0.0 0.0 0.0 0 0.0
LOTR1 0 0.1 0.2 0.1 0 0.3
LOTR2 0 0.0 0.1 0.1 0 0.2
Patriot 0 0.6 0.0 0.0 0 0.0
Sixth.Sense 0 0.5 0.2 0.1 0 0.2
LOTR2 Patriot Sixth.Sense
Braveheart 0.0 0.0 0.0
Gladiator 0.0 0.6 0.5
Green.Mile 0.1 0.0 0.2
Harry.Potter1 0.1 0.0 0.1
Harry.Potter2 0.0 0.0 0.0
LOTR1 0.2 0.0 0.2
LOTR2 0.2 0.0 0.1
Patriot 0.0 0.6 0.4
Sixth.Sense 0.1 0.4 0.6
◮ Calculating the confidence matrix P (column|row)
52
> conf2<-sup2/c(sup1)
> conf2
Braveheart Gladiator Green.Mile Harry.Potter1 Harry.Potter2
Braveheart 0 0.0000000 0.0000000 0.0000000 0
Gladiator 0 1.0000000 0.1428571 0.0000000 0
Green.Mile 0 0.5000000 1.0000000 0.5000000 0
Harry.Potter1 0 0.0000000 0.5000000 1.0000000 0
Harry.Potter2 0 0.0000000 0.0000000 0.0000000 0
LOTR1 0 0.3333333 0.6666667 0.3333333 0
LOTR2 0 0.0000000 0.5000000 0.5000000 0
Patriot 0 1.0000000 0.0000000 0.0000000 0
Sixth.Sense 0 0.8333333 0.3333333 0.1666667 0
LOTR1 LOTR2 Patriot Sixth.Sense
Braveheart 0.0000000 0.0000000 0.0000000 0.0000000
Gladiator 0.1428571 0.0000000 0.8571429 0.7142857
Green.Mile 1.0000000 0.5000000 0.0000000 1.0000000
Harry.Potter1 0.5000000 0.5000000 0.0000000 0.5000000
Harry.Potter2 0.0000000 0.0000000 0.0000000 0.0000000
LOTR1 1.0000000 0.6666667 0.0000000 0.6666667
LOTR2 1.0000000 1.0000000 0.0000000 0.5000000
Patriot 0.0000000 0.0000000 1.0000000 0.6666667
Sixth.Sense 0.3333333 0.1666667 0.6666667 1.0000000
53
◮ Calculating the lift matrix
> tmp<-matrix(c(sup1),nrow=n,ncol=n,byrow=TRUE)
> tmp
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
[1,] 0.1 0.7 0.2 0.2 0.1 0.3 0.2 0.6 0.6
[2,] 0.1 0.7 0.2 0.2 0.1 0.3 0.2 0.6 0.6
[3,] 0.1 0.7 0.2 0.2 0.1 0.3 0.2 0.6 0.6
[4,] 0.1 0.7 0.2 0.2 0.1 0.3 0.2 0.6 0.6
[5,] 0.1 0.7 0.2 0.2 0.1 0.3 0.2 0.6 0.6
[6,] 0.1 0.7 0.2 0.2 0.1 0.3 0.2 0.6 0.6
[7,] 0.1 0.7 0.2 0.2 0.1 0.3 0.2 0.6 0.6
[8,] 0.1 0.7 0.2 0.2 0.1 0.3 0.2 0.6 0.6
[9,] 0.1 0.7 0.2 0.2 0.1 0.3 0.2 0.6 0.6
>
> lift2<-conf2/tmp
> lift2
Braveheart Gladiator Green.Mile Harry.Potter1 Harry.Potter2
Braveheart 0 0.0000000 0.0000000 0.0000000 0
Gladiator 0 1.4285714 0.7142857 0.0000000 0
Green.Mile 0 0.7142857 5.0000000 2.5000000 0
Harry.Potter1 0 0.0000000 2.5000000 5.0000000 0
54
Harry.Potter2 0 0.0000000 0.0000000 0.0000000 0
LOTR1 0 0.4761905 3.3333333 1.6666667 0
LOTR2 0 0.0000000 2.5000000 2.5000000 0
Patriot 0 1.4285714 0.0000000 0.0000000 0
Sixth.Sense 0 1.1904762 1.6666667 0.8333333 0
LOTR1 LOTR2 Patriot Sixth.Sense
Braveheart 0.0000000 0.0000000 0.000000 0.0000000
Gladiator 0.4761905 0.0000000 1.428571 1.1904762
Green.Mile 3.3333333 2.5000000 0.000000 1.6666667
Harry.Potter1 1.6666667 2.5000000 0.000000 0.8333333
Harry.Potter2 0.0000000 0.0000000 0.000000 0.0000000
LOTR1 3.3333333 3.3333333 0.000000 1.1111111
LOTR2 3.3333333 5.0000000 0.000000 0.8333333
Patriot 0.0000000 0.0000000 1.666667 1.1111111
Sixth.Sense 1.1111111 0.8333333 1.111111 1.6666667
◮ Extracting and printing rules
> rulesmat<-(sup2>=supthresh)*(conf2>=conftresh)*(lift2>=lifttresh)
55
> rulesmat
Braveheart Gladiator Green.Mile Harry.Potter1 Harry.Potter2 LOTR1
Braveheart 0 0 0 0 0 0
Gladiator 0 0 0 0 0 0
Green.Mile 0 0 0 0 0 1
Harry.Potter1 0 0 0 0 0 0
Harry.Potter2 0 0 0 0 0 0
LOTR1 0 0 1 0 0 0
LOTR2 0 0 0 0 0 1
Patriot 0 0 0 0 0 0
Sixth.Sense 0 0 0 0 0 0
LOTR2 Patriot Sixth.Sense
Braveheart 0 0 0
Gladiator 0 0 0
Green.Mile 0 0 0
Harry.Potter1 0 0 0
Harry.Potter2 0 0 0
LOTR1 1 0 0
LOTR2 0 0 0
Patriot 0 0 0
Sixth.Sense 0 0 0
56
> diag(rulesmat)<-0
> rules<-NULL
> for (j in 1:n){
+ if (sum(rulesmat[j,])>0){
+ rules<-c(rules,paste(namesvec[j],"->",namesvec[rulesmat[j,]==1],sep=""))
+ }
+ }
> rules
[1] "Green.Mile->LOTR1" "LOTR1->Green.Mile" "LOTR1->LOTR2"
[4] "LOTR2->LOTR1"
◮ If we set supthresh<-0.1 then we find 12 rules
> rules
[1] "Green.Mile->Harry.Potter1" "Green.Mile->LOTR1"
[3] "Green.Mile->LOTR2" "Harry.Potter1->Green.Mile"
[5] "Harry.Potter1->Harry.Potter2" "Harry.Potter1->LOTR2"
[7] "Harry.Potter2->Harry.Potter1" "LOTR1->Green.Mile"
[9] "LOTR1->LOTR2" "LOTR2->Green.Mile"
[11] "LOTR2->Harry.Potter1" "LOTR2->LOTR1"
57
5. Predictive data mining: general issues
◮ Regression problem
◮ Classification problem
◮ Assessing goodness of a predictive model
58
Regression problem
◮ Regression problem: to model a response variable Y ∈ R1 as a
function of predictor variables X = (X1, . . . , Xp) ∈ Rp. If Y is a
continuous variable then a plausible model is
Y = f(X) + ǫ, ǫ is a random noise, Eǫ = 0
f is the regression function, f(X) = E{Y |X}.
◮ Fitting the model: based on the data
Dn = {(Yi,Xi), i = 1, . . . , n}, Xi = (Xi1, . . . , Xip)
the goal is to construct an estimate \hat f(·) = \hat f(·; D_n) of f(·).
◮ The model can be parametric or non-parametric.
59
Approaches to regression modeling
◮ Parametric approach: a parametric form for unknown regression
function is assumed
f(x) = f(x, θ), θ ∈ Θ ⊂ Rm.
E.g., f(x, θ) = θTx where θ is unknown vector (linear regression).
◮ Nonparametric approach:
no specific parametric form for f is assumed.
60
Classification problem
◮ Classification problem: the objective is to model a binary
(categorical) response variable Y as a function of predictor variables
X = (X1, . . . , Xp).
◮ Each subject belongs to one of the two populations Π0 or Π1. For the
i-th subject Xi is observed and the corresponding ”population label”
Yi ∈ {0, 1}. Based on the data
Dn = {(Yi,Xi), i = 1, . . . , n}, Xi = (X1i, . . . , Xpi)
the goal is to construct a classifier (prediction rule) \hat f(·) = \hat f(·; D_n)
that for each x predicts the label Y.
◮ Approaches: parametric and non-parametric.
61
Assessing goodness of a predictive model
◮ A "naive" (resubstitution) approach: \hat f(X_i) should be close to Y_i for
each i. One can look, e.g., at the following performance indices:
  S_{L_2} = \frac{1}{n} \sum_{i=1}^{n} [Y_i − \hat f(X_i)]^2,   S_{L_1} = \frac{1}{n} \sum_{i=1}^{n} |Y_i − \hat f(X_i)|.
If Y is a binary variable (as in the classification problem) then the
appropriate index is
  S_{0/1} = \frac{1}{n} \sum_{i=1}^{n} I\{\hat f(X_i) ≠ Y_i\}
The smaller S_{L_2} (S_{L_1}, S_{0/1}) is, the better the fit.
Is that a good approach?
62
Training and validation data sets
◮ Overfitting: “fitting the noise”.
The model is evaluated on the same data that were used to fit it.
◮ Training and validation data sets:
Divide the data D_n into two subsamples of sizes n_1 and n_2,
respectively: a training sample D'_n and a validation sample D''_n.
Construct the estimate on the basis of D'_n only: \hat f(x) = \hat f(x; D'_n).
Assess goodness of fit on the basis of the validation sample D''_n, e.g.,
  S_{L_2} = \frac{1}{n_2} \sum_{i: (Y_i, X_i) ∈ D''_n} [Y_i − \hat f(X_i; D'_n)]^2.
◮ Drawback: D''_n is used only for validation purposes.
63
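A minimal sketch of such a split in R, assuming an illustrative data frame dat with response column y (both names are hypothetical) and a linear model as the predictive rule:
> set.seed(1)
> n <- nrow(dat); n1 <- floor(0.7 * n)
> idx <- sample(n, n1)                    # indices of the training sample D'_n
> train <- dat[idx, ]; valid <- dat[-idx, ]
> fit <- lm(y ~ ., data = train)          # estimate f on the training sample only
> pred <- predict(fit, newdata = valid)
> mean((valid$y - pred)^2)                # validation estimate of S_L2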
Leave–one–out cross–validation
◮ Cross-validation (leave–one–out): original data set
Dn = {(X1, Y1), . . . , (Xn, Yn)}.
The data set D_n^{(−i)} is D_n without the ith observation:
  D_n^{(−i)} = {(X_1, Y_1), . . . , (X_{i−1}, Y_{i−1}), (X_{i+1}, Y_{i+1}), . . . , (X_n, Y_n)}.
◮ Cross-validation accuracy estimate
  S_{L_2}^{CV} = \frac{1}{n} \sum_{i=1}^{n} [Y_i − \hat f(X_i; D_n^{(−i)})]^2.
◮ In general computationally expensive; cheap for linear regression:
  S_{L_2}^{CV} = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{Y_i − \hat Y_i}{1 − h_i} \right)^2,
where h_i is the ith diagonal element of the hat matrix (the leverage statistic).
64
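A minimal sketch of the linear-regression shortcut in R, assuming a fitted lm object fit and its response vector y (illustrative names):
> h <- hatvalues(fit)                          # leverages h_i
> mean(((y - fitted(fit)) / (1 - h))^2)        # leave-one-out CV estimate without refitting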
k–fold cross-validation
◮ Randomly divide the set of observations D_n = {(X_i, Y_i), i = 1, . . . , n}
into k groups D_n^{(j)}, j = 1, . . . , k. Let D_n^{(−j)} be the dataset without the jth
group of observations, j = 1, . . . , k.
◮ Cross-validation accuracy estimate:
  S_{L_2}^{CV(k)} = \frac{1}{k} \sum_{j=1}^{k} MSE^{(j)},   MSE^{(j)} = \frac{1}{|D_n^{(j)}|} \sum_{i ∈ D_n^{(j)}} [Y_i − \hat f(X_i; D_n^{(−j)})]^2
◮ Computationally faster than leave-one-out CV, but more biased. In
practice, 10-fold cross-validation is commonly used.
65
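A minimal sketch of k-fold cross-validation in R, again assuming an illustrative data frame dat with response y and a linear model:
> set.seed(1)
> k <- 10; n <- nrow(dat)
> folds <- sample(rep(1:k, length.out = n))              # random fold labels
> mse <- numeric(k)
> for (j in 1:k){
+   fit <- lm(y ~ ., data = dat[folds != j, ])           # fit without fold j
+   pred <- predict(fit, newdata = dat[folds == j, ])
+   mse[j] <- mean((dat$y[folds == j] - pred)^2) }       # MSE on fold j
> mean(mse)                                              # k-fold CV estimate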
Bootstrap methods
◮ Bootstrapping refers to a self-sustained process that proceeds without
external help. The term is sometimes attributed to Rudolf Erich
Raspe’s story “The Surprising Adventures of Baron Munchausen”,
where the main character pulls himself (and his horse) out of a
swamp by his hair (specifically, his pigtail).
◮ In Statistics, the bootstrap is a method for assessing the accuracy of
statistical procedures by resampling from empirical distributions.
Introduced by Bradley Efron in 1979.
66
Bootstrap idea
◮ Training set: Dn = {(X1, Y1), . . . , (Xn, Yn)}.
◮ The idea is
– randomly draw with replacement B (bootstrap) datasets
{D*_{n,b}, b = 1, . . . , B}
from D_n, each sample D*_{n,b} of size n;
– refit the model on each of the bootstrap datasets and compute an
accuracy measure;
– average the accuracy measures over the B replications.
67
[Figure: bootstrap resampling diagram — the original data Z (here n = 3 observations with columns Obs, X, Y) is resampled with replacement into bootstrap datasets Z*1, . . . , Z*B, each of which yields an estimate α*1, . . . , α*B]
67-1
Bootstrap accuracy estimate
◮ The final bootstrap accuracy estimate:
  S_{L_2}^{B} = \frac{1}{B} \sum_{b=1}^{B} \frac{1}{n} \sum_{i=1}^{n} [Y_i − \hat f(X_i; D*_{n,b})]^2
◮ There are modifications like leave-one-out bootstrap accuracy
estimates...
68
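A minimal sketch of this estimate in R, once more assuming an illustrative data frame dat with response y and a linear model as the rule:
> set.seed(1)
> B <- 200; n <- nrow(dat)
> acc <- numeric(B)
> for (b in 1:B){
+   idx <- sample(n, n, replace = TRUE)                       # bootstrap dataset D*_{n,b}
+   fit <- lm(y ~ ., data = dat[idx, ])                       # refit on the bootstrap sample
+   acc[b] <- mean((dat$y - predict(fit, newdata = dat))^2) } # error on the original data
> mean(acc)                                                   # bootstrap estimate S^B_L2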
6. Review of Linear Algebra
69
Matrix and vector
◮ Matrix: a rectangular array of numbers, e.g., A ∈ Rn×p
  A = \begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1p} \\ a_{21} & a_{22} & \cdots & a_{2p} \\ \vdots & \vdots & & \vdots \\ a_{n1} & a_{n2} & \cdots & a_{np} \end{pmatrix},   A = \{a_{ij}\}, i = 1, . . . , n; j = 1, . . . , p.
◮ Vector: a matrix containing one column, x = [x_1; · · · ; x_n] ∈ R^n.
◮ Think of a matrix as a linear operation on vectors: when an n × p
matrix A is applied to (multiplies) a vector x ∈ R^p, it returns a
vector y = Ax ∈ R^n.
70
1. Matrix multiplication
◮ If A ∈ R^{n×p} and B ∈ R^{p×m} then C = AB ∈ R^{n×m} with
  c_{ij} = \sum_{k=1}^{p} a_{ik} b_{kj}, i = 1, . . . , n; j = 1, . . . , m.
◮ Special cases
– inner product of vectors: if x ∈ R^n, y ∈ R^n then x^T y = \sum_{i=1}^{n} x_i y_i.
– matrix–vector multiplication: if A ∈ R^{n×p} and x ∈ R^p then
  Ax = [a_1, a_2, . . . , a_p] x = \sum_{j=1}^{p} a_j x_j.
The product Ax is a linear combination of the matrix columns {a_j}
with weights x_1, . . . , x_p.
71
2. Matrix multiplication
◮ Properties
– Associative: (AB)C = A(BC)
– Distributive: (A + B)C = AC + BC
– Non-commutative: in general AB ≠ BA
◮ Block multiplication: if A = [A_{ik}], B = [B_{kj}], where the A_{ik}'s and B_{kj}'s
are matrix blocks, and the number of columns in A_{ik} equals the
number of rows in B_{kj}, then
  C = AB = [C_{ij}], C_{ij} = \sum_{k} A_{ik} B_{kj}.
72
Special types of square matrices
◮ Diagonal: A = diag{a_{11}, a_{22}, . . . , a_{nn}}
◮ Identity: I = I_n = diag{1, . . . , 1} (n ones)
◮ Symmetric: A = A^T
◮ Orthogonal: A^T A = I_n = A A^T
◮ Idempotent (projection): A^2 = A · A = A.
73
Linear independence and rank
◮ A set of vectors x_1, . . . , x_n is linearly independent if there do not
exist constants c_1, . . . , c_n, not all zero, such that
c_1 x_1 + · · · + c_n x_n = 0.
◮ The rank of A ∈ R^{n×p} is the maximal number of linearly independent
columns (or, equivalently, rows). If rank(A) = min(n, p) then it is
said that A is of full rank.
◮ Properties: rank(A) ≤ min(p, n), rank(A) = rank(AT ),
rank(AB) ≤ min{rank(A), rank(B)}, rank(A+B) ≤ rank(A)+rank(B).
74
Trace of matrix
◮ The trace of A ∈ R^{n×n} is the sum of the diagonal elements of A:
  tr(A) = \sum_{i=1}^{n} a_{ii}.
◮ Properties
– tr(A) = tr(AT )
– tr(A+B) = tr(A) + tr(B)
– tr(α · A) = α · tr(A) for all α ∈ R
– tr(AB) = tr(BA)
– if x ∈ Rn, y ∈ Rn then tr(xyT ) = xT y.
75
1. Determinant
◮ 2 × 2 matrix: the determinant of the matrix is
  det\begin{pmatrix} a & b \\ c & d \end{pmatrix} = ad − bc.
The absolute value of the determinant is the area of the parallelogram
formed by the vectors (a, b) and (c, d).
◮ In general, if τ = (τ_1, . . . , τ_n) is a permutation of {1, . . . , n} then
  det(A) = \sum_{τ} (−1)^{|τ|} a_{1τ_1} a_{2τ_2} · · · a_{nτ_n}
where |τ| = 0 if τ is a permutation with an even number of transpositions of
{1, . . . , n}, and |τ| = 1 otherwise.
76
2. Determinant
◮ Properties
– det(A) = det(A^T), det(αA) = α^n det(A), ∀α ∈ R;
– determinant changes its sign if two columns are interchanged;
– determinant vanishes if and only if there is a linear dependence
between its columns;
– det(AB) = det(A)det(B).
77
Inverse matrix
◮ If A ∈ R^{n×n} and rank(A) = n then the inverse of A, denoted A^{-1}, is
the matrix such that A A^{-1} = A^{-1} A = I_n.
◮ Properties:
(A−1)−1 = A; (AB)−1 = B−1A−1; (A−1)T = (AT )−1.
◮ If det(A) = 0 then matrix A is singular (the inverse matrix does not
exist).
78
Range and null space (kernel) of a matrix
◮ Span: for x_i ∈ R^p, i = 1, . . . , n,
  span(x_1, . . . , x_n) = \{ \sum_{i=1}^{n} α_i x_i : α_i ∈ R, i = 1, . . . , n \}.
◮ Range: if A ∈ R^{n×p} then
  Range(A) = \{Ax : x ∈ R^p\}.
Range(A) is the span of the columns of A.
◮ Null space or kernel of a matrix:
  Ker(A) = \{x ∈ R^p : Ax = 0\}.
79
Eigenvalues and eigenvectors
◮ Characteristic polynomial: Let A ∈ Rp×p; then
q(λ) = det(A− λI), λ ∈ R
is the characteristic polynomial of matrix A. Roots of this polynomial
λ1, . . . , λp are eigenvalues of matrix A: det(A− λjI) = 0, j = 1, . . . , p.
◮ A − λ_j I is a singular matrix; therefore there exists a non-zero vector
γ ∈ R^p such that Aγ = λ_j γ. This vector is called the eigenvector of A
corresponding to the eigenvalue λ_j. We can normalize eigenvectors so
that γ^T γ = 1.
◮ q(λ) = (−1)^p \prod_{j=1}^{p} (λ − λ_j) = det(A − λI); hence
det(A) = q(0) = \prod_{j=1}^{p} λ_j. In addition, tr(A) = \sum_{j=1}^{p} λ_j.
80
Symmetric matrices, spectral decomposition
◮ All eigenvalues of a symmetric matrix are real.
◮ Orthogonal matrix: if c_1, . . . , c_p are orthonormal vectors (a basis), i.e.,
c_i^T c_j = 0 for i ≠ j and c_i^T c_i = 1 for all i, then the matrix C = [c_1, c_2, . . . , c_p] is
orthogonal: C C^T = C^T C = I ⇒ C^{-1} = C^T.
◮ Spectral decomposition: any symmetric p × p matrix A can be
represented as
  A = Γ Λ Γ^T = \sum_{j=1}^{p} λ_j γ_j γ_j^T,
where Λ = diag{λ_1, . . . , λ_p}, the λ_j's are eigenvalues, Γ = [γ_1, . . . , γ_p], and
the γ_j's are eigenvectors.
◮ If A is non-singular and symmetric then A^n = Γ Λ^n Γ^T. In particular, if
λ_j ≥ 0 for all j then \sqrt{A} = Γ Λ^{1/2} Γ^T.
81
Eigenvalues characterization for symmetric matrices
◮ Let A be an n × n symmetric matrix with eigenvalues
λ_min = λ_1 ≤ λ_2 ≤ · · · ≤ λ_{n−1} ≤ λ_n = λ_max.
Then
  λ_min x^T x ≤ x^T A x ≤ λ_max x^T x, ∀x ∈ R^n,
  λ_max = \max_{x ≠ 0} \frac{x^T A x}{x^T x} = \max_{x^T x = 1} x^T A x,
  λ_min = \min_{x ≠ 0} \frac{x^T A x}{x^T x} = \min_{x^T x = 1} x^T A x.
In addition, if γ_1, . . . , γ_n are the eigenvectors corresponding to λ_1, . . . , λ_n,
then
  \max_{x ≠ 0, x ⊥ γ_n, . . . , γ_{n−k+1}} \frac{x^T A x}{x^T x} = λ_{n−k}, k = 1, . . . , n − 1.
82
Quadratic forms, projection matrices, etc.
◮ Quadratic form: Q(x) = xTAx.
◮ For symmetric A: if Q(x) = xTAx > 0 for all x ∈ Rp then A is
positive definite. Alternatively, λj(A) > 0 for all j = 1, . . . , p.
◮ Projection (idempotent) matrix: P = P 2. Typical example is the hat
matrix in regression. If X ∈ Rn×p then
P = X(XTX)−1XT
is idempotent. It is projection on the column space (range) of
matrix X .
83
7. Multiple Linear Regression
84
Linear regression model
◮ Model: response Y is modeled as a linear function of predictors
X1, . . . , Xp plus some errors ǫ:
Y = β0 + β1X1 + · · ·+ βpXp + ǫ.
The data is {Yi, Xi1, . . . , Xip, i = 1, . . . , n}. Then
  Y = X β + ǫ,
where Y is n × 1, X is n × (p + 1), β is (p + 1) × 1, ǫ is n × 1, with
  Y = \begin{pmatrix} Y_1 \\ \vdots \\ Y_n \end{pmatrix},   X = \begin{pmatrix} 1 & X_{11} & X_{12} & \cdots & X_{1p} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & X_{n1} & X_{n2} & \cdots & X_{np} \end{pmatrix},   ǫ = \begin{pmatrix} ǫ_1 \\ \vdots \\ ǫ_n \end{pmatrix}.
85
Model fitting
◮ Basic assumptions: zero-mean, uncorrelated errors:
  E ǫ = 0, cov(ǫ) = E ǫǫ^T = σ^2 I_n, where I_n is the n × n identity matrix.
◮ Least squares estimator \hat β
The idea is to minimize the sum of squared errors, \min_β S(β), where
  S(β) = (Y − Xβ)^T (Y − Xβ) = Y^T Y − 2 β^T X^T Y + β^T X^T X β.
Differentiate S(β) w.r.t. β and set to zero:
  ∇_β S(β) = −2 X^T Y + 2 X^T X β = 0 ⇒ \hat β = (X^T X)^{-1} X^T Y,
provided that X^T X is non-singular.
86
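A minimal numerical sketch of this formula in R, on simulated data (the design matrix and coefficients below are illustrative):
> set.seed(1)
> n <- 50
> X <- cbind(1, rnorm(n), rnorm(n))          # design matrix with an intercept column
> Y <- drop(X %*% c(2, 1, -0.5) + rnorm(n))  # simulated response
> beta.hat <- solve(t(X) %*% X, t(X) %*% Y)  # (X'X)^{-1} X'Y
> beta.hat
> coef(lm(Y ~ X[, -1]))                      # the same estimates via lm()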
Predicted values and residuals
◮ Predicted (fitted) values:
  \hat Y = X \hat β = X (X^T X)^{-1} X^T Y = H Y,
where H is the n × n hat matrix; H = H^T, H = H^2, a projection matrix that projects
onto the column space of X.
◮ Residuals:
  e = Y − \hat Y = (I_n − H) Y, where I_n is the n × n identity matrix.
87
Sums of squares
◮ Residual sum of squares:
  SSE = \sum_{i=1}^{n} e_i^2 = e^T e = (Y − \hat Y)^T (Y − \hat Y).
◮ Total sum of squares: for \bar Y = \frac{1}{n} \sum_{i=1}^{n} Y_i,
  SST = \sum_{i=1}^{n} (Y_i − \bar Y)^2 = (Y − 1_n \bar Y)^T (Y − 1_n \bar Y), 1_n = (1, . . . , 1)^T.
Characterizes the variability of the response Y around its average.
◮ Regression sum of squares:
  SSreg = SST − SSE.
Characterizes the variability in the data explained by the regression model.
88
Sums of squares
◮ The R^2 value characterizes the proportion of variability in the data explained
by the regression model:
  R^2 = \frac{SSreg}{SST}
The closer R^2 is to one, the more variability is explained. R^2 grows as more
predictor variables are added to the model.
89
1. Inference in the linear regression model
◮ Basic assumption: ǫ ∼ N_n(0, σ^2 I_n). Under this assumption, if there is
no relationship between X_1, . . . , X_p and Y, i.e., if
β_1 = β_2 = · · · = β_p = 0, then
  SSreg / σ^2 ∼ χ^2(p),   SSE / σ^2 ∼ χ^2(n − p − 1),
with MSreg = SSreg / p and MSE = SSE / (n − p − 1).
◮ The F-test for the hypothesis H_0: β_1 = · · · = β_p = 0 is based on the fact that
  F* = \frac{SSreg / p}{SSE / (n − p − 1)} ∼ F(p, n − p − 1).
Then H_0 is rejected when F* > F_{1−α}(p, n − p − 1).
90
2. Inference in the linear regression model
◮ Inference on individual coefficients: \hat β ∼ N_{p+1}(β, σ^2 (X^T X)^{-1}) and
  \frac{\hat β_k − β_k}{s.d.(\hat β_k)} ∼ t(n − p − 1), k = 0, 1, . . . , p,
where s.d.(\hat β_k) is the square root of the corresponding diagonal element of
\hat σ^2 (X^T X)^{-1}, \hat σ^2 = MSE. Hence H_0: β_k = 0 is rejected if
  |t*| > t_{1−α/2}(n − p − 1), t* = \frac{\hat β_k}{s.d.(\hat β_k)}.
91
1. LS regression diagnostics
◮ Non–linearity of the response–predictor relationship
[Figure: residuals vs. fitted values — "Residual Plot for Linear Fit" and "Residual Plot for Quadratic Fit"]
92
2. LS regression diagnostics
◮ Correlations of errors
Standard errors are computed under the assumption of independent errors.
If the errors are correlated, confidence intervals can
have lower coverage probability than expected.
◮ Tests for serial correlation: run test, sign changes tests, etc.
◮ Time series models
93
3. LS regression diagnostics
◮ Non–constant variance of errors
[Figure: residuals vs. fitted values for the response Y and for the response log(Y)]
◮ Remedy: transformations
94
4. LS regression diagnostics
◮ Outliers: points where the response variable is unusually large (or small)
given the predictors. Outliers can be detected in residual plots.
◮ High leverage points have unusual X values.
◮ Collinearity refers to a situation where two or more predictors are
closely related to each other. The matrix X^T X is then close to singular.
95
Example: ozone data
># airquality data set
>ozone.lm <- lm (Ozone~Solar.R+Wind+Temp, data=airquality)
>summary(ozone.lm)
Call:
lm(formula = Ozone ~ Solar.R + Wind + Temp, data = airquality)
Residuals:
Min 1Q Median 3Q Max
-40.485 -14.219 -3.551 10.097 95.619
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -64.34208 23.05472 -2.791 0.00623 **
Solar.R 0.05982 0.02319 2.580 0.01124 *
Wind -3.33359 0.65441 -5.094 1.52e-06 ***
Temp 1.65209 0.25353 6.516 2.42e-09 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 21.18 on 107 degrees of freedom
Multiple R-Squared: 0.6059, Adjusted R-squared: 0.5948
F-statistic: 54.83 on 3 and 107 DF, p-value: < 2.2e-16
96
1. Plots of residuals
◮ Plot of residuals vs fitted values
>plot(ozone.lm$fitted.values, ozone.lm$residuals, xlab="Fitted values",
+ ylab="Residuals")
>abline(0,0)
[Figure: residuals vs. fitted values for the ozone model]
97
2. Plots of residuals
◮ QQ–plot of residuals
>qqnorm(ozone.lm$residuals)
>qqline(ozone.lm$residuals)
[Figure: normal Q-Q plot of the ozone model residuals]
98
1. Example: Boston housing data
◮ Response variable: median value of homes (medv)
◮ Predictor variables:
– crime rate (crim); % land zoned for lots (zn)
– % nonretail business (indus); 1/0 on Charles river (chas)
– nitrogen oxide concentration (nox); average number of rooms (rm)
– % built before 1940 (age); tax rate (tax)
– weighted distance to employment centers (dis)
– % lower-status population (lstat); % black (black)
– accessibility to radial highways (rad)
– pupil/teacher ratio (ptratio)
99
2. Example: Boston housing data
> library(MASS)
> Boston.lm<- lm(medv~., data=Boston)
> summary(Boston.lm)
Call:
lm(formula = medv ~ ., data = Boston)
Residuals:
Min 1Q Median 3Q Max
-15.594 -2.730 -0.518 1.777 26.199
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.646e+01 5.103e+00 7.144 3.28e-12 ***
crim -1.080e-01 3.286e-02 -3.287 0.001087 **
zn 4.642e-02 1.373e-02 3.382 0.000778 ***
indus 2.056e-02 6.150e-02 0.334 0.738288
chas 2.687e+00 8.616e-01 3.118 0.001925 **
nox -1.777e+01 3.820e+00 -4.651 4.25e-06 ***
rm 3.810e+00 4.179e-01 9.116 < 2e-16 ***
100
age 6.922e-04 1.321e-02 0.052 0.958229
dis -1.476e+00 1.995e-01 -7.398 6.01e-13 ***
rad 3.060e-01 6.635e-02 4.613 5.07e-06 ***
tax -1.233e-02 3.760e-03 -3.280 0.001112 **
ptratio -9.527e-01 1.308e-01 -7.283 1.31e-12 ***
black 9.312e-03 2.686e-03 3.467 0.000573 ***
lstat -5.248e-01 5.072e-02 -10.347 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.745 on 492 degrees of freedom
Multiple R-Squared: 0.7406, Adjusted R-squared: 0.7338
F-statistic: 108.1 on 13 and 492 DF, p-value: < 2.2e-16
◮ indus and age are not significant at the 0.05 level.
>fmBoston=as.formula("medv~crim+zn+chas+nox+rm+dis+rad+tax+ptratio+black+lstat")
> Boston1.lm <- lm(fmBoston, data=Boston)
> summary(Boston1.lm)
Call:
lm(formula = fmBoston, data = Boston)
Residuals:
101
Min 1Q Median 3Q Max
-15.5984 -2.7386 -0.5046 1.7273 26.2373
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 36.341145 5.067492 7.171 2.73e-12 ***
crim -0.108413 0.032779 -3.307 0.001010 **
zn 0.045845 0.013523 3.390 0.000754 ***
chas 2.718716 0.854240 3.183 0.001551 **
nox -17.376023 3.535243 -4.915 1.21e-06 ***
rm 3.801579 0.406316 9.356 < 2e-16 ***
dis -1.492711 0.185731 -8.037 6.84e-15 ***
rad 0.299608 0.063402 4.726 3.00e-06 ***
tax -0.011778 0.003372 -3.493 0.000521 ***
ptratio -0.946525 0.129066 -7.334 9.24e-13 ***
black 0.009291 0.002674 3.475 0.000557 ***
lstat -0.522553 0.047424 -11.019 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.736 on 494 degrees of freedom
Multiple R-Squared: 0.7406, Adjusted R-squared: 0.7348
F-statistic: 128.2 on 11 and 494 DF, p-value: < 2.2e-16
102
1. Boston housing data: residual plots
> plot(residuals(Boston1.lm))
> abline(0,0)
> qqnorm(residuals(Boston1.lm))
> qqline(residuals(Boston1.lm))
103
2. Boston housing data: residual plots
[Figure: residuals of Boston1.lm plotted against observation index, and normal Q-Q plot of the residuals]
104
8. Linear Model Selection and Regularization
105
The need for model selection
◮ We have many different potential predictors. Why not base the
model on all of them?
◮ Two sides of one coin: bias and variance
– The model with more predictors can describe the phenomenon
better – less bias.
– When we estimate more parameters, the variance of estimators
grows – we “fit the noise”, overfitting!
◮ A clever model selection strategy should resolve
the bias–variance trade–off.
106
Subset selection and coefficient shrinkage
◮ Why is least squares not always satisfactory?
∗ Prediction accuracy: the LS estimates often have large variance
(collinearity problems, large number of predictors etc.)
∗ Interpretation: with a large number of predictors, we often would
like to determine a smaller subset that exhibits the strongest effects.
◮ Two approaches:
∗ subset selection (identify a subset of predictors having strongest
effect on the response variable)
∗ coefficient shrinkage; this has an effect of reducing variance of
estimates (but increasing bias...)
107
1. Criteria for subset selection
◮ How to judge if a subset of predictors is good? The R2 index is
useless as it increases as new variables are added to the model.
◮ Criteria for subset selection. The idea is to adjust or penalize the
residual sum of squares SSE for the model complexity.
– Mallows' C_p: C_p = \frac{1}{n} [SSE + 2 p \hat σ^2]
– AIC (Akaike information criterion): penalization of the likelihood
function; for linear regression it is equivalent to C_p
– BIC (Bayesian information criterion): BIC = \frac{1}{n} [SSE + p \log(n) \hat σ^2].
– Adjusted R^2 = 1 − \frac{SSE / (n − p − 1)}{SST / (n − 1)}
108
2. Criteria for subset selection
◮ Typical use of the criteria
[Figure: C_p, BIC, and Adjusted R^2 plotted against the number of predictors]
◮ Procedures: best subset selection, stepwise (forward, backward)
selection.
109
Best subset selection
1. Let M_0 be the model without predictors.
2. For k = 1, . . . , p:
– Fit all \binom{p}{k} models that contain k predictors;
– Pick the best among these \binom{p}{k} models, the one with largest R^2; call it M_k.
3. Select between M_0, . . . , M_p using C_p, AIC, BIC, etc.
110
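A sketch of best subset selection on the Boston data with the leaps package (assuming it is installed); the criteria from the previous slide are read off summary():
> library(leaps)
> library(MASS)
> best <- regsubsets(medv ~ ., data = Boston, nvmax = 13)
> bs <- summary(best)
> which.min(bs$cp); which.min(bs$bic); which.max(bs$adjr2)   # model size chosen by each criterion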
1. Forward selection procedure
◮ Fact: assume that we have two models:
– the first contains p variables;
– the second contains the same p variables plus q more variables.
We want to test H_0: the q additional variables are not significant. If ǫ ∼ N(0, σ^2)
then under H_0
  \frac{[SSreg(p + q) − SSreg(p)] / q}{SSE(p + q) / (n − p − q − 1)} ∼ F(q, n − p − q − 1).
◮ Idea of the forward selection procedure: at each step to add a variable
which maximizes the F -statistic (provided it is significant, greater
than the corresponding (1− α)-quantile).
◮ Instead of F–test at each step one can use AIC.
111
2. Forward selection procedure
1. Fit a simple regression model for each variable x_k, k = 1, . . . , p, and
compute
  F_{x_k} = \frac{SSreg(x_k) / 1}{SSE(x_k) / (n − 2)}, k = 1, . . . , p.
Select the variable x_{k_1} with k_1 = argmax F_{x_k}; if F_{x_{k_1}} > F_{1−α}(1, n − 2),
add x_{k_1} to the model.
2. Fit the models with predictors (x_k, x_{k_1}) and compute
  F_{x_k | x_{k_1}} = \frac{[SSreg(x_k, x_{k_1}) − SSreg(x_{k_1})] / 1}{SSE(x_k, x_{k_1}) / (n − 3)}, k = 1, . . . , p, k ≠ k_1.
Select k_2 = argmax F_{x_k | x_{k_1}}, compare with F_{1−α}(1, n − 3), and if
significant add x_{k_2} to the model. Proceed in the same way...
112
Boston housing data
> library(MASS)
> maxfmla<-as.formula(paste("medv~", paste(names(Boston[,-14]), collapse="+")))
> maxfmla
medv ~ crim + zn + indus + chas + nox + rm + age + dis + rad +
tax + ptratio + black + lstat
> Boston.lm <-lm(medv~1, data=Boston)
> Boston.fwd<-step(Boston.lm,direction="forward", scope=list(upper=maxfmla),test="F")
Start: AIC=2246.51
medv ~ 1
Df Sum of Sq RSS AIC F value Pr(>F)
+ lstat 1 23243.9 19472 1851.0 601.618 < 2.2e-16 ***
+ rm 1 20654.4 22062 1914.2 471.847 < 2.2e-16 ***
+ ptratio 1 11014.3 31702 2097.6 175.106 < 2.2e-16 ***
+ indus 1 9995.2 32721 2113.6 153.955 < 2.2e-16 ***
+ tax 1 9377.3 33339 2123.1 141.761 < 2.2e-16 ***
+ nox 1 7800.1 34916 2146.5 112.591 < 2.2e-16 ***
+ crim 1 6440.8 36276 2165.8 89.486 < 2.2e-16 ***
+ rad 1 6221.1 36495 2168.9 85.914 < 2.2e-16 ***
113
+ age 1 6069.8 36647 2171.0 83.478 < 2.2e-16 ***
+ zn 1 5549.7 37167 2178.1 75.258 < 2.2e-16 ***
+ black 1 4749.9 37966 2188.9 63.054 1.318e-14 ***
+ dis 1 2668.2 40048 2215.9 33.580 1.207e-08 ***
+ chas 1 1312.1 41404 2232.7 15.972 7.391e-05 ***
<none> 42716 2246.5
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Step: AIC=1851.01
medv ~ lstat
Df Sum of Sq RSS AIC F value Pr(>F)
+ rm 1 4033.1 15439 1735.6 131.3942 < 2.2e-16 ***
+ ptratio 1 2670.1 16802 1778.4 79.9340 < 2.2e-16 ***
+ chas 1 786.3 18686 1832.2 21.1665 5.336e-06 ***
+ dis 1 772.4 18700 1832.5 20.7764 6.488e-06 ***
+ age 1 304.3 19168 1845.0 7.9840 0.004907 **
+ tax 1 274.4 19198 1845.8 7.1896 0.007574 **
+ black 1 198.3 19274 1847.8 5.1764 0.023316 *
+ zn 1 160.3 19312 1848.8 4.1758 0.041527 *
+ crim 1 146.9 19325 1849.2 3.8246 0.051059 .
114
+ indus 1 98.7 19374 1850.4 2.5635 0.109981
<none> 19472 1851.0
+ rad 1 25.1 19447 1852.4 0.6491 0.420799
+ nox 1 4.8 19468 1852.9 0.1239 0.724966
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Step: AIC=1735.58
medv ~ lstat + rm
Df Sum of Sq RSS AIC F value Pr(>F)
+ ptratio 1 1711.32 13728 1678.1 62.5791 1.645e-14 ***
+ chas 1 548.53 14891 1719.3 18.4921 2.051e-05 ***
+ black 1 512.31 14927 1720.5 17.2290 3.892e-05 ***
+ tax 1 425.16 15014 1723.5 14.2154 0.0001824 ***
+ dis 1 351.15 15088 1725.9 11.6832 0.0006819 ***
+ crim 1 311.42 15128 1727.3 10.3341 0.0013900 **
+ rad 1 180.45 15259 1731.6 5.9367 0.0151752 *
+ indus 1 61.09 15378 1735.6 1.9942 0.1585263
<none> 15439 1735.6
+ zn 1 56.56 15383 1735.7 1.8457 0.1748999
+ age 1 20.18 15419 1736.9 0.6571 0.4179577
115
+ nox 1 14.90 15424 1737.1 0.4849 0.4865454
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Step: AIC=1678.13
medv ~ lstat + rm + ptratio
Df Sum of Sq RSS AIC F value Pr(>F)
+ dis 1 499.08 13229 1661.4 18.9009 1.668e-05 ***
+ black 1 389.68 13338 1665.6 14.6369 0.0001468 ***
+ chas 1 377.96 13350 1666.0 14.1841 0.0001854 ***
+ crim 1 122.52 13606 1675.6 4.5115 0.0341560 *
+ age 1 66.24 13662 1677.7 2.4291 0.1197340
<none> 13728 1678.1
+ tax 1 44.36 13684 1678.5 1.6242 0.2031029
+ nox 1 24.81 13703 1679.2 0.9072 0.3413103
+ zn 1 14.96 13713 1679.6 0.5467 0.4600162
+ rad 1 6.07 13722 1679.9 0.2218 0.6378931
+ indus 1 0.83 13727 1680.1 0.0301 0.8622688
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
116
..............................................................
Step: AIC=1596.1
medv ~ lstat + rm + ptratio + dis + nox + chas + black + zn +
crim + rad
Df Sum of Sq RSS AIC F value Pr(>F)
+ tax 1 273.619 11081 1585.8 12.1978 0.0005214 ***
<none> 11355 1596.1
+ indus 1 33.894 11321 1596.6 1.4790 0.2245162
+ age 1 0.096 11355 1598.1 0.0042 0.9485270
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Step: AIC=1585.76
medv ~ lstat + rm + ptratio + dis + nox + chas + black + zn +
crim + rad + tax
Df Sum of Sq RSS AIC F value Pr(>F)
<none> 11081 1585.8
+ indus 1 2.51754 11079 1587.7 0.1120 0.7380
+ age 1 0.06271 11081 1587.8 0.0028 0.9579
117
Shrinkage methods
◮ The idea of regularization: contrary to subset selection the idea is to
fit a model keeping all coefficients, but to impose constraints on the
size of the coefficients. For example, to shrink the coefficients to zero.
◮ In general, regularization refers to a process of introducing additional
information in order to solve an ill-posed problem or to prevent
overfitting.
118
Ridge regression
◮ Ridge regression shrinks the regression coefficients by imposing a penalty
on their size:
  \hat β_λ = argmin_β \{ \sum_{i=1}^{n} (Y_i − β_0 − \sum_{j=1}^{p} X_{ij} β_j)^2 + λ \sum_{j=1}^{p} β_j^2 \}
           = argmin_β \{ (Y − Xβ)^T (Y − Xβ) + λ β^T β \},
so that
  \hat β_λ = (X^T X + λ I)^{-1} X^T Y.
◮ Ridge parameter λ should be chosen: larger λ’s result in smaller
variance but bigger bias.
◮ βλ is a linear estimator.
◮ Usually βλ is computed for a range of λ’s.
119
2. Ridge regression
◮ The matrix X is usually centered and rescaled: if x_j is the jth
(j = 1, . . . , p) column of X, then define Z = [z_1, . . . , z_p] by
  z_j = \frac{1}{S_j} (x_j − \bar x_j),   S_j^2 = \frac{1}{n} (x_j − \bar x_j)^T (x_j − \bar x_j).
Y is also centered. Then consider the model without intercept:
  Y = Z θ + ǫ.
◮ Ridge trace: graphs of the coefficient estimates as a function of λ.
◮ Generalized Cross-Validation (GCV): select λ that minimizes
  V(λ) = \frac{\frac{1}{n} Y^T [I − A(λ)]^2 Y}{(\frac{1}{n} tr[I − A(λ)])^2},   A(λ) = X (X^T X + λ I)^{-1} X^T.
120
2. Ridge regression in R
> Boston.ridge<-lm.ridge(medv~., Boston, lambda=seq(0, 100, 0.1))
◮ Output values
* scales - scalings used on the X matrix.
* Inter - was intercept included?
* lambda - vector of lambda values
* ym - mean of y
* xm - column means of x matrix
* GCV - vector of GCV values
* kHKB - HKB estimate of the ridge constant.
* kLW - L-W estimate of the ridge constant.
> plot(Boston.ridge) # produces ridge trace
121
Ridge trace: Boston housing data
[Figure: ridge trace for the Boston housing data — coefficient estimates plotted against λ]
122
Ridge regression in R
> select(Boston.ridge)
modified HKB estimator is 4.594163
modified L-W estimator is 3.961575
smallest value of GCV at 4.3
>
> Boston.ridge.cv<-lm.ridge(medv~.,Boston,lambda=4.3)
> Boston.ridge.cv$coef
crim zn indus chas nox rm
-0.895001937 1.020966996 0.049465334 0.694878367 -1.943248437 2.707866705
age dis rad tax ptratio black
-0.005646034 -2.992453378 2.384190136 -1.819613735 -2.026897293 0.847413719
lstat
-3.689619529
123
LASSO
◮ LASSO estimator
  \hat β^{lasso} = argmin_β \sum_{i=1}^{n} (Y_i − β_0 − \sum_{j=1}^{p} X_{ij} β_j)^2 subject to \sum_{j=1}^{p} |β_j| ≤ t.
◮ Comparison to ridge: \sum_{j=1}^{p} β_j^2 is replaced by \sum_{j=1}^{p} |β_j|.
◮ Properties:
* the estimator \hat β^{lasso} is non-linear;
* a small t causes some of the coefficients to be exactly zero; if t is
larger than t_0 = \sum_{j=1}^{p} |\hat β_j^{LS}| then \hat β^{lasso} = \hat β^{LS};
* a kind of continuous subset selection.
124
LASSO and ridge
125
LASSO, ridge and best subset selection
◮ Ridge regression
  \hat β^{ridge} = argmin_β \{ \sum_{i=1}^{n} (Y_i − β_0 − \sum_{j=1}^{p} X_{ij} β_j)^2 : \sum_{j=1}^{p} β_j^2 ≤ s \}
◮ Lasso
  \hat β^{lasso} = argmin_β \{ \sum_{i=1}^{n} (Y_i − β_0 − \sum_{j=1}^{p} X_{ij} β_j)^2 : \sum_{j=1}^{p} |β_j| ≤ t \}
◮ Best subset selection
  \hat β^{sparse} = argmin_β \{ \sum_{i=1}^{n} (Y_i − β_0 − \sum_{j=1}^{p} X_{ij} β_j)^2 : \sum_{j=1}^{p} I\{β_j ≠ 0\} ≤ s \}
◮ LASSO and ridge are computationally feasible alternatives to best
subset selection.
126
1. LASSO and ridge in a special case
Assume that n = p and X is the identity matrix.
◮ Least squares estimator: \hat β^{ls} = argmin_β \sum_{i=1}^{n} (Y_i − β_i)^2, so
  \hat β_i^{ls} = Y_i, i = 1, . . . , n.
◮ Ridge regression: \hat β^{ridge} = argmin_β \{ \sum_{i=1}^{n} (Y_i − β_i)^2 + λ \sum_{i=1}^{n} β_i^2 \}, so
  \hat β_i^{ridge} = Y_i / (1 + λ), i = 1, . . . , n.
◮ LASSO: \hat β^{lasso} = argmin_β \{ \sum_{i=1}^{n} (Y_i − β_i)^2 + λ \sum_{i=1}^{n} |β_i| \}, so
  \hat β_i^{lasso} = Y_i − λ/2 if Y_i > λ/2;  = Y_i + λ/2 if Y_i < −λ/2;  = 0 if |Y_i| ≤ λ/2;  i = 1, . . . , n.
127
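A minimal sketch in R illustrating these three solutions in the orthonormal case (the grid of least squares values and the λ below are illustrative):
> soft <- function(y, lambda) sign(y) * pmax(abs(y) - lambda/2, 0)   # soft thresholding
> y <- seq(-3, 3, by = 0.1)        # least squares estimates
> lambda <- 1
> plot(y, y, type = "l", lty = 2, xlab = "least squares estimate", ylab = "estimate")
> lines(y, y/(1 + lambda), col = "blue")    # ridge: proportional shrinkage
> lines(y, soft(y, lambda), col = "red")    # lasso: exact zeros for |y| <= lambda/2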
2. LASSO and ridge in a special case
◮ \hat β_i^{lasso} is a soft thresholding of Y_i.
[Figure: coefficient estimate as a function of the least squares estimate y_j — Ridge vs. Least Squares (left) and Lasso vs. Least Squares (right)]
128
LASSO in R
> library(lars)
> library(MASS)
> x<-as.matrix(Boston[,1:13])
> y<-as.vector(Boston[,14])
> Boston.lasso <- lars(x,y,type="lasso")
> summary(Boston.lasso)
LARS/LASSO
Call: lars(x = x, y = y, type = "lasso")
Df Rss Cp
0 1 42716 1392.997
1 2 36326 1111.195
2 3 21335 447.485
3 4 14960 166.356
4 5 14402 143.588
5 6 13667 112.931
6 7 13449 105.281
7 8 13117 92.515
8 9 12423 63.717
9 10 11950 44.700
129
10 11 11899 44.446
11 12 11730 38.934
12 13 11317 22.590
13 12 11086 10.341
14 13 11080 12.032
15 14 11079 14.000
◮ Print and plot of complete coefficient path
> print(Boston.lasso)
Call:
lars(x = x, y = y, type = "lasso")
R-squared: 0.741
Sequence of LASSO moves:
lstat rm ptratio black chas crim dis nox zn indus rad tax indus indus age
Var 13 6 11 12 4 1 8 5 2 3 9 10 -3 3 7
Step 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
> plot(Boston.lasso)
Returns a plot: coefficient values against s = t / \sum_{j=1}^{p} |\hat β_j^{LS}|, 0 ≤ s ≤ 1.
130
LASSO coefficients path
[Figure: LASSO coefficient paths — standardized coefficients plotted against |beta|/max|beta|]
131
Cross-validated choice of s
> cv.lars(x,y, K=10)
Returns the K-fold CV mean squared prediction error plotted against s (the fraction).
[Figure: 10-fold cross-validated prediction error as a function of the fraction s]
132
Prediction and extraction of coefficients
◮ Extraction of LASSO coefficients for given s
> Boston.coef.03<-coef(Boston.lasso, s=0.3, mode="fraction")
> Boston.coef.03
crim zn indus chas nox rm age
0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 3.3707730 0.0000000
dis rad tax ptratio black lstat
0.0000000 0.0000000 0.0000000 -0.4299806 0.0000000 -0.4664715
◮ Prediction
> Boston.lasso.pr<-predict(Boston.lasso, x, s=0.3, mode="fraction", type="fit")
133
9. Logistic Regression
134
Example
◮ Age and coronary heart disease (CHD) status data: 100 subjects;
response (Y): absence or presence (0/1) of CHD; predictor (X):
age.
[Figure: scatterplot of CHD (0/1) against AGE]
135
Logistic regression model
◮ Linear regression is not appropriate:
  E(Y | X = x) = P(Y = 1 | X = x) = β_0 + β_1 x
should be in [0, 1] for all x.
◮ The idea is to model the relationship between p(x) = P(Y = 1 | X = x) and x
using the logistic response function:
  p(x) = \frac{e^{β_0 + β_1 x}}{1 + e^{β_0 + β_1 x}} ⇔ logit\{p(x)\} := \log \frac{p(x)}{1 − p(x)} = β_0 + β_1 x.
136
Logistic response function
[Figure: the logistic response function p(x) plotted against x]
◮ Why logit? For fixed x the odds p(x) / (1 − p(x)) are naturally on a log scale:
usually one has odds like '10 to 1', or '2 to 1'.
◮ A specific case of the generalized linear model (GLM) with logit link
function: g(E(Y | X = x)) = β_0 + β_1 x, g(z) = \log \frac{z}{1 − z}, 0 < z < 1.
137
Interpretation of the logistic regression model
◮ If p(x) = 0.75 then the odds of getting CHD at age x are 3 to 1.
◮ If x = 0 then
  \log \frac{p(0)}{1 − p(0)} = β_0 ⇔ \frac{p(0)}{1 − p(0)} = e^{β_0}.
Thus e^{β_0} can be interpreted as the baseline odds, especially if zero is
within the range of the data for the predictor variable X.
◮ If we increase x by one unit, we multiply the odds by e^{β_1}. If β_1 > 0
then e^{β_1} > 1 and the odds increase; if β_1 < 0 then the odds decrease.
138
1. Likelihood function
◮ Data and model: D_n = {(Y_i, X_i), i = 1, . . . , n}, Y_i ∈ {0, 1}, i.i.d.,
  π_i = π(X_i) = P(Y_i = 1 | X_i) = E(Y_i | X_i) = \frac{e^{β_0 + β_1 X_i}}{1 + e^{β_0 + β_1 X_i}}, i = 1, . . . , n.
◮ Likelihood and log-likelihood (to be maximized w.r.t. β_0 and β_1):
  L(β_0, β_1; D_n) = \prod_{i=1}^{n} π_i^{Y_i} (1 − π_i)^{1 − Y_i}
                   = \prod_{i=1}^{n} \left( \frac{e^{β_0 + β_1 X_i}}{1 + e^{β_0 + β_1 X_i}} \right)^{Y_i} \left( \frac{1}{1 + e^{β_0 + β_1 X_i}} \right)^{1 − Y_i}
                   = \prod_{i=1}^{n} \frac{e^{(β_0 + β_1 X_i) Y_i}}{1 + e^{β_0 + β_1 X_i}},
  \log L(β_0, β_1; D_n) = \sum_{i=1}^{n} Y_i (β_0 + β_1 X_i) − \sum_{i=1}^{n} \log\{1 + e^{β_0 + β_1 X_i}\}.
139
2. Likelihood function
  S_1(β_0, β_1) = \frac{∂ \log L(β_0, β_1)}{∂ β_0} = \sum_{i=1}^{n} Y_i − \sum_{i=1}^{n} \frac{e^{β_0 + β_1 X_i}}{1 + e^{β_0 + β_1 X_i}},
  S_2(β_0, β_1) = \frac{∂ \log L(β_0, β_1)}{∂ β_1} = \sum_{i=1}^{n} X_i Y_i − \sum_{i=1}^{n} \frac{X_i e^{β_0 + β_1 X_i}}{1 + e^{β_0 + β_1 X_i}}.
◮ Solve the system S_1(β_0, β_1) = 0, S_2(β_0, β_1) = 0 for β_0, β_1.
◮ No closed-form solution is available; solve by an iterative procedure.
140
1. Fitting the model: Newton–Raphson algorithm
◮ Idea of the algorithm:
– Assume we want to solve the equation g(x) = 0.
– Let x* be the solution; if x is close to x* then
  0 = g(x*) ≈ g(x) + g'(x)(x* − x) ⇒ x* ≈ x − \frac{g(x)}{g'(x)}.
◮ Iterative procedure:
– Let x_k be the current approximation to x* (at the kth stage); define the
next approximation x_{k+1} by
  x_{k+1} = x_k − \frac{g(x_k)}{g'(x_k)}, k = 1, 2, . . .
– Stop when g(x_k) is small, e.g., |g(x_k)| ≤ ǫ.
141
2. Fitting the model: Newton–Raphson algorithm
1. Let β0^(j) and β1^(j) be the current parameter approximations after the j-th
step of the algorithm.
2. Let
  J(β0, β1) = − [ ∂S1/∂β0  ∂S1/∂β1 ; ∂S2/∂β0  ∂S2/∂β1 ]   (minus the matrix of derivatives of the score).
3. Compute
  (β0^(j+1), β1^(j+1))^T = (β0^(j), β1^(j))^T + J^{−1}(β0^(j), β1^(j)) · (S1(β0^(j), β1^(j)), S2(β0^(j), β1^(j)))^T.
4. Continue until a convergence criterion is met.
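A minimal R sketch of this iteration for simple logistic regression on simulated data; the helper nr_logistic and the simulated x, y are illustrative, and in practice one simply calls glm(..., family = binomial), which performs the equivalent Fisher-scoring iteration.
nr_logistic <- function(x, y, maxit = 25, tol = 1e-8) {
  beta <- c(0, 0)                            # starting values (beta0, beta1)
  X <- cbind(1, x)                           # design matrix
  for (j in 1:maxit) {
    p <- 1 / (1 + exp(-drop(X %*% beta)))    # pi_i
    S <- drop(t(X) %*% (y - p))              # score vector (S1, S2)
    J <- t(X) %*% (X * (p * (1 - p)))        # minus the derivative of the score
    step <- solve(J, S)                      # J^{-1} S
    beta <- beta + step
    if (sum(abs(step)) < tol) break
  }
  beta
}
set.seed(1)
x <- rnorm(100); y <- rbinom(100, 1, plogis(-1 + 0.5 * x))
rbind(newton = nr_logistic(x, y),
      glm    = coef(glm(y ~ x, family = binomial)))   # the two agree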
142
Extension to multiple predictors
◮ Model: Xi = (1, Xi1, . . . , Xip)^T, i = 1, . . . , n, β = (β0, β1, . . . , βp)^T,
πi = π(Xi) = P(Yi = 1|Xi) = E(Yi|Xi) = exp{β^T Xi} / (1 + exp{β^T Xi}).
◮ Likelihood and log–likelihood:
L(β; Dn) = ∏_{i=1}^n [π(Xi)]^{Yi} [1 − π(Xi)]^{1−Yi}
         = ∏_{i=1}^n [e^{β^T Xi}/(1 + e^{β^T Xi})]^{Yi} [1/(1 + e^{β^T Xi})]^{1−Yi}
         = ∏_{i=1}^n e^{β^T Xi Yi} / (1 + e^{β^T Xi}),
log L(β; Dn) = ∑_{i=1}^n β^T Xi Yi − ∑_{i=1}^n log{1 + e^{β^T Xi}}.
This should be maximized with respect to β.
143
1. Fitting the model and assessing the fit
◮ No closed-form solution is available; the model is fitted by an iterative procedure.
◮ If β is the ML estimate of β then the fitted values are
  Yi = π(Xi) = exp{β^T Xi} / (1 + exp{β^T Xi}).
◮ Deviance is twice the difference between the log–likelihoods evaluated
at (a) the saturated values π(Xi) = Yi and (b) the MLE π(Xi):
  G² = 2 ∑_{i=1}^n { Yi log[Yi / π(Xi)] + (1 − Yi) log[(1 − Yi) / (1 − π(Xi))] } =: ∑_{i=1}^n dev(Yi, π(Xi)).
◮ Deviance residuals: ri = sign{Yi − π(Xi)} · √dev(Yi, π(Xi)).
144
2. Fitting the model and assessing the fit
◮ The degrees of freedom (df) associated with the deviance G² equal
n − (p + 1); (p + 1) is the dimension of the vector β.
◮ Pearson's X² is an approximation to the deviance:
  X² = ∑_{i=1}^n [Yi − π(Xi)]² / {π(Xi)[1 − π(Xi)]}.
◮ Comparing models: let x = (x1, x2), and consider testing
  H0: log[π(x)/(1 − π(x))] = β^T x1   versus   H1: log[π(x)/(1 − π(x))] = β^T x1 + η^T x2.
Obtain the deviance G²0 with df0 under H0, and G²1 with df1 under H1.
Under the null, G²0 − G²1 ≈ χ²(df0 − df1).
145
1. Example
> agchd <-read.table("Age-CHD.dat", header=T)
> agchd.glm<-glm(CHD~Age, data=agchd, family=binomial)
> summary(agchd.glm)
Call:
glm(formula = CHD ~ Age, family = binomial, data = agchd)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.9718 -0.8456 -0.4576 0.8253 2.2859
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -5.30945 1.13365 -4.683 2.82e-06 ***
Age 0.11092 0.02406 4.610 4.02e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 136.66 on 99 degrees of freedom
Residual deviance: 107.35 on 98 degrees of freedom
AIC: 111.35
Number of Fisher Scoring iterations: 4
146
2. Example
◮ The fitted model:
  log[ P(CHD|Age) / (1 − P(CHD|Age)) ] = −5.309 + 0.111 × Age.
As Age grows by one unit, the odds of having CHD are multiplied by
e^{0.111} ≈ 1.117.
◮ Null deviance is the G2 statistic for the null model (without slope)
◮ Residual deviance is the G2 statistic for the fitted model
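As a small check (not in the original slides), the residual deviance and the deviance residuals can be recomputed from their definitions using the agchd.glm fit above, with the convention 0·log 0 = 0.
pi.hat <- fitted(agchd.glm)
y <- agchd$CHD
dev.i <- 2 * (ifelse(y == 1, log(1 / pi.hat), 0) + ifelse(y == 0, log(1 / (1 - pi.hat)), 0))
all.equal(sum(dev.i), deviance(agchd.glm))          # residual deviance, 107.35
all.equal(sign(y - pi.hat) * sqrt(dev.i),
          residuals(agchd.glm, type = "deviance"), check.attributes = FALSE)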
147
Classification using logistic regression
◮ With any value x of the predictor variable the fitted logistic
regression model associates the probability π(x).
◮ Classification rule: for some threshold τ ∈ (0, 1) (e.g., τ = 1/2) let
  Y(x) = 1 if π(x) > τ, and Y(x) = 0 if π(x) ≤ τ.
Varying τ gives a feel for the efficacy of the model.
148
Sensitivity and specificity
◮ Classification results can be represented in the form of the table
True 0 True 1
Predicted 0 a b
Predicted 1 c d
◮ Sensitivity is the proportion of correctly predicted 1's (true positives):
  Sensitivity = d / (b + d).
◮ Specificity is the proportion of correctly predicted 0's (true negatives):
  Specificity = a / (a + c).
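A small helper (illustrative, not from the slides) that computes both quantities from a vector of 0/1 responses and fitted probabilities for a given threshold τ.
sens_spec <- function(y, prob, tau) {
  pred <- as.numeric(prob > tau)
  c(sensitivity = sum(pred == 1 & y == 1) / sum(y == 1),   # d/(b+d)
    specificity = sum(pred == 0 & y == 0) / sum(y == 0))   # a/(a+c)
}
# e.g. sens_spec(agchd$CHD, fitted(agchd.glm), tau = 0.5)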
149
ROC curve
◮ Receiver Operating Characteristic (ROC) curve: Sensitivity versus
1 - Specificity as the threshold τ varies from 0 to 1.
[Figure: ROC curve for the CHD data — Sensitivity against 1−Specificity; the curve runs from the corner τ = 1 to the corner τ = 0.]
150
Interpretation of ROC curve
◮ If τ = 1 then we never classify an observation as positive; here
Sensitivity = 0, Specificity = 1.
◮ If τ = 0 then everything will be classified as positive; here
Sensitivity = 1 and Specificity = 0.
◮ As τ varies between 0 and 1 there is a trade–off between
Sensitivity and Specificity; one looks for the value of τ which
gives ”large” sensitivity and specificity.
◮ The closer ROC curve comes to the 45 degrees straight line, the less
useful the model is.
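A rough sketch (not from the slides) of how the ROC curve above can be traced by sweeping τ over a grid for the CHD fit agchd.glm; dedicated packages such as pROC or ROCR do this more carefully.
taus <- seq(0, 1, by = 0.01)
y <- agchd$CHD
p <- fitted(agchd.glm)
sens <- sapply(taus, function(t) sum(p >  t & y == 1) / sum(y == 1))
spec <- sapply(taus, function(t) sum(p <= t & y == 0) / sum(y == 0))
plot(1 - spec, sens, type = "l", xlab = "1 - Specificity", ylab = "Sensitivity",
     main = "ROC curve, CHD data")
abline(0, 1, lty = 2)   # 45-degree line: a model no better than random guessing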
151
3. Example
> tau<-0.5 # threshold
> agch1<-as.numeric(fitted(agchd.glm)>=tau)
> table(agchd$CHD, agch1)
agch1
0 1
0 45 12
1 14 29
> 29/(29+14)      # sensitivity
[1] 0.6744186
> 45/(45+12)      # specificity
[1] 0.7894737
> tau<-0.6
> agch1<-as.numeric(fitted(agchd.glm)>=tau)
> table(agchd$CHD, agch1)
agch1
0 1
0 50 7
1 18 25
> 25/(25+18)      # sensitivity
[1] 0.5813953
> 50/(50+7)       # specificity
[1] 0.877193
152
1. Another example: iris data
◮ The Independent variables - Sepal.Length, Sepal.Width,
Petal.Length, Petal.Width.
◮ Response variable - flower species: setosa, versicolor and
virginica.
◮ The ‘iris’ dataset consists of 150 observations, 50 from each species.
Consider a logistic regression on a single species type, versicolor.
> data(iris)
> tmpdata <- iris
> Versicolor <- as.numeric(tmpdata[,"Species"]=="versicolor")
> tmpdata[,"Species"] <- Versicolor
> fmla <- as.formula(paste("Species ~ ",paste(names(tmpdata)[1:4],
collapse="+")))
> ilr <- glm(fmla, data=tmpdata, family=binomial(logit))
> summary(ilr)
Call:
153
glm(formula = fmla, family = binomial(logit), data = tmpdata)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.1281 -0.7668 -0.3818 0.7866 2.1202
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 7.3785 2.4993 2.952 0.003155 **
Sepal.Length -0.2454 0.6496 -0.378 0.705634
Sepal.Width -2.7966 0.7835 -3.569 0.000358 ***
Petal.Length 1.3136 0.6838 1.921 0.054713 .
Petal.Width -2.7783 1.1731 -2.368 0.017868 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 190.95 on 149 degrees of freedom
Residual deviance: 145.07 on 145 degrees of freedom
AIC: 155.07
Number of Fisher Scoring iterations: 5
154
2. Another example: Iris data
◮ Model without Sepal.Length
> ilr1<-glm(Species ~ Sepal.Width + Petal.Length + Petal.Width,
+ data=tmpdata, family=binomial(logit))
> summary(ilr1)
Call:
glm(formula = Species ~ Sepal.Width + Petal.Length + Petal.Width,
family = binomial(logit), data = tmpdata)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.1262 -0.7731 -0.3984 0.8063 2.1562
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 6.9506 2.2261 3.122 0.00179 **
Sepal.Width -2.9565 0.6668 -4.434 9.26e-06 ***
Petal.Length 1.1252 0.4619 2.436 0.01484 *
Petal.Width -2.6148 1.0815 -2.418 0.01562 *
155
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 190.95 on 149 degrees of freedom
Residual deviance: 145.21 on 146 degrees of freedom
AIC: 153.21
Number of Fisher Scoring iterations: 5
156
10. Review of multivariate normal distribution.
157
Standard multivariate normal distribution
◮ Definition:
Let Z1, . . . , Zn be independent N(0, 1) random variables. The vector
Z = (Z1, . . . , Zn) has the standard multivariate normal distribution:
  fZ(z) = fZ1,...,Zn(z1, . . . , zn) = ∏_{j=1}^n (1/√(2π)) e^{−zj²/2} = (1/√(2π))^n exp{−(1/2) z^T z}.
We write Z ∼ Nn(0, I), where 0 is the expectation, EZ = 0, and I is the
covariance matrix, cov(Z) = E[(Z − EZ)(Z − EZ)^T] = E ZZ^T = I.
158
Multivariate normal distribution
◮ Transformation: Let A ∈ Rn×n and µ ∈ Rn. Define random vector
Y = AZ + µ
◮ Then Y ∼ Nn(µ, AA^T):
  EY = E(AZ + µ) = A·EZ + µ = µ,
  cov(Y) = E(Y − EY)(Y − EY)^T = A E(ZZ^T) A^T = AA^T.
◮ Definition: the distribution of a random vector Y ∈ Rn is multivariate
normal with expectation µ ∈ Rn and covariance matrix Σ ∈ Rn×n if
  fY(y) = (2π)^{−n/2} |det(Σ)|^{−1/2} exp{−(1/2) (y − µ)^T Σ^{−1} (y − µ)},
where Σ > 0 is the covariance matrix; we write Y ∼ Nn(µ, Σ).
159
Properties
◮ If a ∈ Rn and Y ∼ Nn(µ, Σ) then
  X = a^T Y = ∑_{i=1}^n ai Yi ∼ N(a^T µ, a^T Σ a).
◮ In general, if B ∈ R^{q×n} and Y ∼ Nn(µ, Σ) then
  X = BY ∼ Nq(Bµ, B Σ B^T).
◮ Any sub–vector of a multivariate normal vector is multivariate normal.
◮ If the elements of a multivariate normal vector are uncorrelated then they
are independent.
◮ If X1, . . . , Xn are i.i.d. N(µ, σ²) then the sample mean X̄n and the sample variance s² are
independent (in fact, the sample is normal if and only if X̄n and s² are independent).
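A small illustrative check (not from the slides) of the transformation Y = AZ + µ with AA^T = Σ, taking A as the lower-triangular Cholesky factor; the chosen µ and Σ are arbitrary.
set.seed(1)
mu    <- c(1, -2)
Sigma <- matrix(c(2, 0.8, 0.8, 1), 2, 2)
A     <- t(chol(Sigma))                       # lower-triangular, A %*% t(A) = Sigma
Z     <- matrix(rnorm(2 * 10000), nrow = 2)   # columns are standard normal vectors
Y     <- mu + A %*% Z                         # each column ~ N_2(mu, Sigma)
rowMeans(Y)      # close to mu
cov(t(Y))        # close to Sigma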
160
11. Discriminant Analysis
161
Cushing syndrome data
Types of the disease: (a), (b) and (c).
Variables: concentrations of Tetrahydrocortisone and Pregnanetriol.
[Figure: Cushing's syndrome data — Pregnanetriol plotted against Tetrahydrocortisone (log scales); points labelled a, b, c by disease type and u for unknown.]
162
1. Classification problem
◮ Problem: we are given a pair of variables (X, Y )
– X ∈ X ⊆ Rp is the vector of predictor variables (features), belongs
to one of K populations;
– Y is the label of the population;
◮ Assume
– πk is the prior probability that X belongs to the kth population:
  πk = P(Y = k) > 0,  ∑_{k=1}^K πk = 1;
– if observation X belongs to the kth population then X ∼ pk(x).
◮ Problem: we observe X = x. How to predict the label Y ?
163
2. Classification problem
◮ Classification rule: any function g : X → {1, . . . , K}. It defines a
partition of the feature space X into K disjoint sets:
  X = ∪_{k=1}^K Ak,   Ak = {x : g(x) = k}.
◮ Accuracy of a classifier: L(g) = P{g(X) ≠ Y}.
◮ Optimal rule:
  g∗ = arg min_{g:X→{1,...,K}} P{g(X) ≠ Y}.
164
Bayes rule
◮ Bayes formula:
  p(k|x) = P(Y = k|X = x) = P(X = x|Y = k) P(Y = k) / ∑_{j=1}^K P(X = x|Y = j) P(Y = j)
         = πk pk(x) / ∑_{j=1}^K πj pj(x).
◮ Bayes classification rule:
  g∗(x) = arg max_{k=1,...,K} p(k|x) = arg max_{k=1,...,K} [πk pk(x)].
◮ Theorem: the Bayes rule g∗ minimizes the probability of error, i.e.,
  L(g∗) = P{g∗(X) ≠ Y} ≤ L(g) = P{g(X) ≠ Y} for every g.
The Bayes error is
  L(g∗) = 1 − ∫_X max_{k=1,...,K} [πk pk(x)] dx.
165
Proof
First, note L(g) = P{g(X) ≠ Y} = E[P{g(X) ≠ Y | X = x}];
P{g(X) ≠ Y | X = x} = 1 − P{g(X) = Y | X = x}
  = 1 − ∑_{k=1}^K P{g(X) = k, Y = k | X = x}
  = 1 − ∑_{k=1}^K I{g(x) = k} P(Y = k | X = x)
  = 1 − ∑_{k=1}^K I{g(x) = k} p(k|x) ≥ 1 − max_k p(k|x).
The proof is completed by taking expectation and noting that, by definition,
  P{g∗(X) ≠ Y | X = x} = 1 − max_k p(k|x).
166
The Bayes rule for normal populations
◮ Assume K = 2 groups with prior probabilities πk, k = 1, 2, and that the
distribution pk(x) of the kth population is Np(µk, Σk):
  πk pk(x) = πk (2π)^{−p/2} |Σk|^{−1/2} exp{−(1/2)(x − µk)^T Σk^{−1} (x − µk)}, k = 1, 2.
◮ The Bayes rule decides 1 if π1 p1(x) ≥ π2 p2(x), and 2 otherwise.
Equivalently, one can compare h1(x) with h2(x), where
  hk(x) = log{πk pk(x)} = −(1/2) Mk² − (p/2) log(2π) − (1/2) log|Σk| + log πk,
with Mk² := (x − µk)^T Σk^{−1} (x − µk) the squared Mahalanobis distance.
167
Case I: equal covariance matrices
◮ If Σ1 = Σ2 = Σ then the Bayes rule decides 1 when h1(x) − h2(x) ≥ 0, i.e.,
  h1(x) − h2(x) = −(1/2) M1² + log π1 + (1/2) M2² − log π2
                = (µ1 − µ2)^T Σ^{−1} (x − (µ1 + µ2)/2) + log(π1/π2) ≥ 0.
This results in the linear decision surface
  (µ1 − µ2)^T Σ^{−1} (x − (µ1 + µ2)/2) = log(π2/π1).
168
Case II: non–equal covariance matrices
◮ If Σ1 ≠ Σ2 then the Bayes rule decides 1 if
  −(1/2)(x − µ1)^T Σ1^{−1} (x − µ1) − (1/2) log|Σ1| + log π1
    ≥ −(1/2)(x − µ2)^T Σ2^{−1} (x − µ2) − (1/2) log|Σ2| + log π2.
This results in a quadratic decision surface in Rp.
◮ The Bayes rule cannot be implemented because neither pk(x) nor πk
is known. The idea is to estimate the unknown parameters from the data...
169
1. Linear discriminant analysis
◮ Data: two samples of sizes n1 and n2 from the normal populations
  Xk1, . . . , Xk,nk ∼ Np(µk, Σ), k = 1, 2.
◮ Estimates of the means µk:
  µk = X̄k· = (1/nk) ∑_{j=1}^{nk} Xkj.
◮ Pooled estimator of Σ:
  Spooled = [(n1 − 1)S1 + (n2 − 1)S2] / (n1 + n2 − 2),
  Sk = (1/(nk − 1)) ∑_{j=1}^{nk} (Xkj − µk)(Xkj − µk)^T.
170
2. Linear discriminant analysis
◮ Classification rule (LDA): decide the first group if
  (µ1 − µ2)^T Spooled^{−1} (x − (µ1 + µ2)/2) ≥ log(π2/π1),
where πk = nk/n, k = 1, 2 (here µk, Spooled and πk denote the estimates above).
◮ Although LDA is the "plug–in Bayes classifier" for normal
populations, it can be applied for any distribution of the data.
171
1. Another interpretation of the LDA
◮ The idea is to find a linear transformation of X such that the separation
between the groups is maximal. Let β ∈ Rp and Z = β^T X.
– If X is from the first group, then µ1,Z = EZ = E β^T X = β^T µ1.
– If X is from the second group, then µ2,Z = EZ = E β^T X = β^T µ2.
– var(Z) does not depend on the group: σZ² = var(β^T X) = β^T Σ β.
◮ Choose β so that
  (µ1,Z − µ2,Z)² / σZ² = [β^T (µ1 − µ2)]² / (β^T Σ β) → max.
The solution of this problem is β∗ = c Σ^{−1}(µ1 − µ2), c ≠ 0.
172
2. Another interpretation of the LDA
◮ Estimate of β∗: β∗ = Spooled^{−1}(µ1 − µ2).
◮ Estimates of µ1,Z = β∗^T µ1 and µ2,Z = β∗^T µ2:
  µ1,Z = (µ1 − µ2)^T Spooled^{−1} µ1,   µ2,Z = (µ1 − µ2)^T Spooled^{−1} µ2.
◮ LDA classification rule: for given x decide the first group if
  (µ1 − µ2)^T Spooled^{−1} x ≥ (1/2)(µ1,Z + µ2,Z)
  ⇔ (µ1 − µ2)^T Spooled^{−1} x ≥ (1/2)(µ1 − µ2)^T Spooled^{−1} (µ1 + µ2).
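A minimal sketch (on simulated data, not from the slides) of the estimated direction β∗ = Spooled^{−1}(µ1 − µ2); since the direction is defined only up to sign and scale, it is compared with the normalized LD1 vector returned by MASS::lda.
library(MASS)
set.seed(2)
n1 <- 60; n2 <- 40
X1 <- mvrnorm(n1, mu = c(0, 0), Sigma = matrix(c(1, .5, .5, 1), 2))
X2 <- mvrnorm(n2, mu = c(2, 1), Sigma = matrix(c(1, .5, .5, 1), 2))
mu1 <- colMeans(X1); mu2 <- colMeans(X2)
Sp  <- ((n1 - 1) * cov(X1) + (n2 - 1) * cov(X2)) / (n1 + n2 - 2)   # pooled covariance
beta <- solve(Sp, mu1 - mu2)              # discriminant direction
beta / sqrt(sum(beta^2))                  # normalized (up to sign and scale)
fit <- lda(x = rbind(X1, X2), grouping = rep(1:2, c(n1, n2)))
fit$scaling / sqrt(sum(fit$scaling^2))    # proportional to the manual direction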
173
1. Example: Leptograpsus Crabs data
Two color forms (blue and orange), 50 of each form of each sex.
◮ sp species - ”B” or ”O” for blue or orange
◮ sex
◮ index index 1:50 within each of the four groups
◮ FL frontal lobe size (mm)
◮ RW rear width (mm)
◮ CL carapace length (mm)
◮ CW carapace width (mm)
◮ BD body depth (mm)
174
2. Example: Leptograpsus Crabs data
> library(MASS)
> attach(crabs)
> lcrabs<-cbind(sp, sex, log(crabs[,4:8]))
> lcrabs.lda<-lda(sex~FL+RW+CL+CW, lcrabs) # Linear discriminant analysis
> lcrabs.lda
Call:
lda.formula(sex ~ FL + RW + CL + CW, data = lcrabs)
Prior probabilities of groups:
F M
0.5 0.5
Group means:
FL RW CL CW
F 2.708720 2.579503 3.421028 3.555941
M 2.730305 2.466848 3.464200 3.583751
Coefficients of linear discriminants:
LD1
175
FL -2.889616
RW -25.517644
CL 36.316854
CW -11.827981
LD1 is the vector Spooled^{−1}(µ1 − µ2) (up to a scalar normalization).
> lcrabs.pred<-predict(lcrabs.lda)
> table(lcrabs$sex, lcrabs.pred$class)
F M
F 97 3
M 3 97
> (3+3)/(97+97+3+3) # Resubstitution (naive) estimate
[1] 0.03
> # Cross-validation estimate
> lcrabs.cv.lda<-lda(sex~FL+RW+CL+CW, lcrabs, CV=T)
> table(lcrabs$sex, lcrabs.cv.lda$class)
F M
F 96 4
M 3 97
176
2. Example: Leptograpsus Crabs data
> plot(lcrabs.lda) # group histograms in the discriminant direction
[Figure: histograms of the discriminant scores (LD1) for group F and group M.]
177
1. Quadratic discriminant analysis
◮ The QDA can be viewed as the plug–in Bayes rule for normal
populations with non–equal covariance matrices. Now Σ1 and Σ2 are
replaced by
  Sk = (1/(nk − 1)) ∑_{j=1}^{nk} (Xkj − µk)(Xkj − µk)^T, k = 1, 2.
◮ The QDA rule: decide the first group if
  −(1/2)(x − µ1)^T S1^{−1} (x − µ1) − (1/2) log|S1| + log π1
    ≥ −(1/2)(x − µ2)^T S2^{−1} (x − µ2) − (1/2) log|S2| + log π2.
178
2. Quadratic discriminant analysis
> # Quadratic discriminant analysis
>
> lcrabs.qda<-qda(sex~FL+RW+CL+CW, lcrabs)
> lcrabs.qda.pred<-predict(lcrabs.qda)
> table(lcrabs$sex, lcrabs.qda.pred$class)
F M
F 97 3
M 4 96
> lcrabs.cv.qda<-qda(sex~FL+RW+CL+CW, lcrabs, CV=T)
> table(lcrabs$sex, lcrabs.cv.qda$class)
F M
F 96 4
M 5 95
179
12. Classification: k–Nearest Neighbors
180
1. k–nearest neighbors classifier
◮ Data: (Xi, Yi), i = 1, . . . , n, i.i.d. random pairs, Yi ∈ {0, 1},Xi ∈ Rd.
◮ Let d(·, ·) be a distance measure, and for a given x ∈ Rd consider the
numbers di(x) = d(Xi, x). Let d(i)(x) be the ith order statistic, i.e.,
  d(1)(x) ≤ d(2)(x) ≤ · · · ≤ d(n)(x).
◮ The set of k–nearest neighbors of x:
  Ak(x) = {Xi : d(Xi, x) ≤ d(k)(x)}.
◮ Classifier:
  gn(x) = 1 if ∑_{i=1}^n wn,i(x) I(Yi = 1) > ∑_{i=1}^n wn,i(x) I(Yi = 0), and 0 otherwise,
where wn,i(x) = 1/k if Xi ∈ Ak(x), and zero otherwise.
181
2. k–nearest neighbors classifier
◮ Choice of k: often k = 1 is chosen, which gives the 1-NN classifier. Large k
results in more averaging; small k leads to more variability.
Asymptotic theory suggests
– k → ∞ as n → ∞,
– k/n → 0 as n → ∞.
◮ Choice of the distance: the most common choice is Euclidean.
182
Spam data
Data: 4601 instances, 57 attributes
◮ Most of the attributes indicate whether a particular word or character
occurs frequently in the e-mail. In particular, 48 continuous real [0,100]
attributes give the percentage of words in the e-mail that match WORD, i.e.,
  100 × (number of times the WORD appears in the e-mail) / (total number of words in the e-mail).
◮ WORD is any string of alphanumeric characters bounded by
non-alphanumeric characters or end-of-string.
◮ The run-length attributes (55-57) measure the length of sequences of
consecutive capital letters.
183
k–nearest neighbors: spam data
> library(class)
> spam.d<-spam[, 1:57] # training set
> spam.cl <- spam[,58] # true classifications
> spam.1nn <- knn.cv(spam.d, spam.cl, k=1) # 1-NN with cross-validation
> table(spam.1nn, spam.cl)
spam.cl
spam.1nn 0 1
0 2398 390
1 390 1423
> (390+390)/(2398+1423+390+390) # cross-validation misclassification rate
[1] 0.1695284
184
Spam data: k-nn misclassification rate
[Figure: "Spam data: CV-misclassification rate of K-NN" — the cross-validated misclassification rate plotted against K = 1, . . . , 20, ranging roughly from 0.17 to 0.22.]
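A sketch of the loop behind the figure above, using spam.d and spam.cl from the previous slide.
library(class)
ks <- 1:20
rate <- sapply(ks, function(k) mean(knn.cv(spam.d, spam.cl, k = k) != spam.cl))
plot(ks, rate, type = "b", xlab = "K", ylab = "Misclassification rate",
     main = "Spam data: CV-misclassification rate of K-NN")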
185
Spam data: LDA versus k-nn
> library(MASS)
> spam.cv.lda<-lda(V58~.,data=spam, CV=T)
> spam.cv.lda
Call:
lda(V58 ~ ., data = spam)
Prior probabilities of groups:
0 1
0.6059552 0.3940448
Group means:
V1 V2 V3 V4 V5 V6 V7
0 0.0734792 0.2444656 0.2005811 0.0008859397 0.1810402 0.04454448 0.00938307
1 0.1523387 0.1646498 0.4037948 0.1646718147 0.5139548 0.17487590 0.27540541
V8 V9 V10 V11 V12 V13 V14
0 0.03841463 0.03804878 0.1671700 0.02171090 0.5363235 0.06166428 0.04240316
1 0.20814120 0.17006067 0.3505074 0.11843354 0.5499724 0.14354661 0.08357419
...........................................................................
186
> table(spam.cl, spam.cv.lda$class)
spam.cl 0 1
0 2652 136
1 390 1423
> (136+390)/(136+390+2652+1423)
[1] 0.1143230 # cross-validation misclassification rate
187
13. Classification: decision trees (CART)
188
1. Example
◮ In a hospital, when a heart attack patient is admitted, 19 variables
are measured during the first 24 hours. Among the variables: blood
pressure, age and 17 other ordered or binary variables summarizing
different symptoms.
◮ Based on the 24–hours data, the objective of the study is to identify
high risk patients (those who will not survive at least 30 days).
189
2. Example
Is the minimum systolic blood pressure over the initial 24 hours > 91?
  – No: classify as G (high risk).
  – Yes: Is age > 62.5?
      – No: classify as F (not high risk).
      – Yes: Is sinus tachycardia present?
          – Yes: classify as G (high risk).
          – No: classify as F (not high risk).
G – high risk, F – not high risk.
190
1. Binary trees: basic notions
◮ Binary tree is constructed by repeated splits of subsets of X into two
descendant subsets:
– a single variable is found which ”best” splits the data into two
groups
– the data is separated and the process is applied to each sub–group
– stop when the subgroups reach minimal size, or no improvement
can be made.
◮ Terminology
– the root node = X ; a node = a subset of X (circles).
– terminal nodes = subsets which are not split (rectangular box);
each terminal node is designated by the class label.
191
2. Binary trees: basic notions
◮ Construction of a tree requires:
– The selection of splits
– The decisions when to declare a node terminal or to continue
splitting it
– The assignment of each terminal node to a class
192
Notation
◮ n – number of observations; K - number of classes
◮ N(t) – total number of observations at node t
◮ Nk(t) – number of observations from class k at node t
◮ p(k|t) – proportion of observations X at the node t belonging to kth
class, k = 1, . . . , K
  p(k|t) = Nk(t)/N(t) = #{observations from class k at node t} / #{observations at node t},
  p(t) = [p(1|t), . . . , p(K|t)].
◮ Y(t) – class assigned to the node t: Y(t) = arg max_{k=1,...,K} p(k|t).
193
1. Impurity of the node
◮ Node is pure if it contains data only from one class.
◮ Impurity measure Imp(t) of node t for classification into K classes:
Imp(t) = φ(p(t)), p(t) = [p(1|t), . . . , p(K|t)]
where φ is a non–negative function of p(t) satisfying the following conditions:
(a) φ has a unique maximum at (1/K, . . . , 1/K);
(b) φ achieves its minimum at (1, 0, . . . , 0), (0, 1, . . . , 0), . . . , (0, . . . , 1);
(c) φ is a symmetric function of p1, . . . , pK.
◮ Imp(t) is largest when all classes are equally mixed, and smallest when
the node contains a single class.
194
2. Impurity of the node
◮ Examples of the impurity measure:
  Imp(t) = −∑_{k=1}^K p(k|t) log{p(k|t)}   [entropy],
  Imp(t) = 1 − ∑_{k=1}^K p²(k|t)           [Gini index],
where
  p(k|t) = Nk(t)/N(t) = #{observations from class k at node t} / #{observations at node t}.
195
1. To split or not to split?
◮ Split S: node t is split into two "sons", tL and tR:
– πL – the proportion of observations at t going to tL;
– πR – the proportion of observations at t going to tR.
[Diagram: node t with left son tL (proportion πL) and right son tR (proportion πR).]
196
2. To split or not to split?
◮ Goodness of split S is defined by the decrease in impurity measure
Φ(S, t) = ∆Imp(t) = Imp(t)− πLImp(tL)− πRImp(tR);
◮ Idea: choose the split S that maximizes Φ(S, t).
◮ Impurity of the tree T:
  Imp(T) = ∑_{t∈T̃} π(t) Imp(t),
where T̃ is the set of terminal nodes and π(t) is the proportion of the
whole population at node t.
197
Numerical example
[Diagram: root node 0 (100 observations: 60 from class 1, 40 from class 2) is split into
node 1 (70 observations: 50 and 20), proportion π(1) = 0.7, and
node 2 (30 observations: 10 and 20), proportion π(2) = 0.3.]
Imp(t0) = −(60/100) log(60/100) − (40/100) log(40/100) = 0.673
Imp(t1) = −(50/70) log(50/70) − (20/70) log(20/70) = 0.598
Imp(t2) = −(20/30) log(20/30) − (10/30) log(10/30) = 0.637
∆Imp(t0) = 0.673 − 0.7 × 0.598 − 0.3 × 0.637 = 0.063
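A quick verification of these numbers in R; the helper name entropy is just for this sketch.
entropy <- function(p) -sum(p * log(p))
imp0 <- entropy(c(60, 40) / 100)   # 0.673
imp1 <- entropy(c(50, 20) / 70)    # 0.598
imp2 <- entropy(c(10, 20) / 30)    # 0.637
imp0 - 0.7 * imp1 - 0.3 * imp2     # decrease in impurity, about 0.063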
198
Splitting rules
X = (X1, . . . , Xp) is the vector of features.
◮ Splits are determined by a standardized set of questions Q:
– Each split depends only on a single variable.
– For ordered Xi the questions in Q are of the form
  {Is Xi ≤ c?}, c ∈ (−∞, ∞).
– If Xi is categorical, taking values in B = {b1, . . . , bm}, then the
  questions in Q are {Is Xi ∈ A?}, where A is any subset of B.
◮ At each node CART looks at variables Xi one by one, finds the best
split for each Xi, and chooses the best of the best.
199
Stop–splitting and assignment rules
◮ Stop splitting if
– decrease in impurity measure of a node is less than a prespecified
threshold
– impurity of the whole tree is less than a threshold
– depth of tree is greater than some parameter
– ...
◮ CART does not employ stopping rule; pruning is used instead.
◮ To a terminal node t assign the class Y(t) = arg max_{k=1,...,K} p(k|t).
200
1. Example: Stage C prostate cancer data
◮ Data: on 146 stage C prostate cancer patients, 7 variables, pgstat –
response
– pgtime – time to progression
– pgstat – status at last follow-up (1=progressed, 0=censored)
– age – age at diagnosis
– eet – early endocrine therapy (1=no, 0=yes)
– ploidy – diploid/tetraploid/aneuploid DNA pattern
– grade – tumor grade (1–4)
– gleason – Gleason grade
201
2. Example: Stage C prostate cancer data
> library(rpart)
> stagec<-read.table("stagec.dat", header=T, sep=",")
> cfit<-rpart(pgstat~age+eet+grade+gleason+ploidy, data=stagec, method="class")
> print(cfit)
n= 146
node), split, n, loss, yval, (yprob)
* denotes terminal node
1) root 146 54 0 (0.6301370 0.3698630)
2) grade< 2.5 61 9 0 (0.8524590 0.1475410) *
3) grade>=2.5 85 40 1 (0.4705882 0.5294118)
6) ploidy< 1.5 29 11 0 (0.6206897 0.3793103)
12) gleason< 7.5 22 7 0 (0.6818182 0.3181818) *
13) gleason>=7.5 7 3 1 (0.4285714 0.5714286) *
7) ploidy>=1.5 56 22 1 (0.3928571 0.6071429) # child nodes of node t
14) age>=61.5 34 17 0 (0.5000000 0.5000000) # are numbered as
28) age< 64.5 12 4 0 (0.6666667 0.3333333) * # 2t (left) and 2t+1
29) age>=64.5 22 9 1 (0.4090909 0.5909091) * # (right)
15) age< 61.5 22 5 1 (0.2272727 0.7727273) *
202
3. Example: Stage C prostate cancer data (default tree)
> plot(cfit)
> text(cfit)
◮ Some default parameters
* minsplit=20: minimal number of observations in a node for which
split is computed
* minbucket=minsplit/3: minimal number of observation in a
terminal node
* cp=0.01: complexity parameter
203
[Figure: the default classification tree; splits on grade, ploidy, gleason and age, with terminal nodes labelled 0 or 1.]
204
> summary(cfit)
Call:
rpart(formula = pgstat ~ age + eet + grade + gleason + ploidy,
data = stagec, method = "class")
...............................................................
Node number 1: 146 observations, complexity param=0.1111111
predicted class=0 expected loss=0.369863
class counts: 92 54
probabilities: 0.630 0.370
left son=2 (61 obs) right son=3 (85 obs)
Primary splits:
grade < 2.5 to the left, improve=10.35759000, (0 missing)
gleason < 5.5 to the left, improve= 8.39957400, (3 missing)
ploidy < 1.5 to the left, improve= 7.65653300, (0 missing)
age < 58.5 to the right, improve= 1.38812800, (0 missing)
eet < 1.5 to the right, improve= 0.07407407, (2 missing)
Surrogate splits:
gleason < 5.5 to the left, agree=0.863, adj=0.672, (0 split)
ploidy < 1.5 to the left, agree=0.644, adj=0.148, (0 split)
age < 66.5 to the right, agree=0.589, adj=0.016, (0 split)
205
Node number 2: 61 observations
predicted class=0 expected loss=0.147541
class counts: 52 9
probabilities: 0.852 0.148
Node number 3: 85 observations, complexity param=0.1111111
predicted class=1 expected loss=0.4705882
class counts: 40 45
probabilities: 0.471 0.529
left son=6 (29 obs) right son=7 (56 obs)
Primary splits:
ploidy < 1.5 to the left, improve=1.9834830, (0 missing)
age < 56.5 to the right, improve=1.6596080, (0 missing)
gleason < 8.5 to the left, improve=1.6386550, (0 missing)
eet < 1.5 to the right, improve=0.1086108, (1 missing)
Surrogate splits:
age < 72.5 to the right, agree=0.682, adj=0.069, (0 split)
gleason < 9.5 to the right, agree=0.682, adj=0.069, (0 split)
....................................................................
206
◮ grade 1 and 2 go to the left, grade 3 and 4 go to the right.
◮ The improvement is n times the change in the impurity index. The
largest improvement is for grade, 10.36. The actual values are not so
important; their relative size gives an indication of the utility of the variables.
◮ Once a splitting variable and split point have been decided, what is to
be done with observations missing the variable? CART defines
surrogate variables by re–applying the partitioning algorithm to
predict the two categories using other independent variables.
207
1. Cost–complexity pruning
◮ Misclassification rate of a node t:
  R(t) = ∑_{k ≠ Y(t)} p(k|t),   where Y(t) = arg max_{k=1,...,K} p(k|t) is the label assigned to node t.
◮ Let T̃ = {t1, . . . , tm} be the terminal nodes; the misclassification rate of the tree T is
  R(T) = ∑_{i=1}^m [N(ti)/n] R(ti) = (1/n) ∑_{t∈T̃} N(t) R(t).
◮ Let size(T) = #{t : t ∈ T̃}; for a complexity parameter (CP) α > 0,
  Rα(T) = R(T) + α·size(T) = ∑_{t∈T̃} [N(t)R(t)/n + α].
◮ α > 0 imposes a penalty for large trees.
208
2. Cost–complexity pruning
[Diagram: a tree with root t(0), internal nodes t(1) and t(2), and terminal nodes t(3), . . . , t(6); T(t(2)) denotes the branch rooted at t(2).]
◮ T(t) is the sub–tree rooted at t.
◮ Error of the sub-tree T(t2) and error of the node t2:
  Rα(T(t2)) = R(T(t2)) + α·size{T(t2)} = R(T(t2)) + 2α,
  Rα(t2) = R(t2) + α,   where t2 is treated as terminal.
209
3. Cost–complexity pruning
◮ Pruning at t2 is worthwhile if
  Rα(t2) ≤ Rα(T(t2))  ⇔  g(t2, T) := [R(t2) − R(T(t2))] / [size{T(t2)} − 1] ≤ α.
The function g(t, T) can be computed for any internal node of the tree.
◮ Weakest–link cutting algorithm:
1. Start with the full tree T1. For each non–terminal node t ∈ T1 compute
   g(t, T1), and find t1 = arg min_{t∈T1} g(t, T1). Set α2 = g(t1, T1).
2. Define the new tree T2 by pruning away the branch rooted at t1. Find
   the weakest link in T2 and proceed as in step 1.
◮ Result: a decreasing sequence of sub–trees with corresponding α's.
The final selection is by cross-validation or by a validation sample.
210
Example: Stage C prostate cancer data (cont.)
> printcp(cfit)
Classification tree:
rpart(formula = pgstat ~ age + eet + grade + gleason + ploidy,
data = stagec, method = "class")
Variables actually used in tree construction:
[1] age gleason grade ploidy
Root node error: 54/146 = 0.36986
n= 146
CP nsplit rel error xerror xstd
1 0.111111 0 1.00000 1.0000 0.10802
2 0.037037 2 0.77778 1.0741 0.10949
3 0.018519 4 0.70370 1.0556 0.10916
4 0.010000 5 0.68519 1.0556 0.10916
211
Complexity parameter (CP) table
◮ The CP table is printed from the smallest tree (0 splits) to the largest
(5 splits for cancer data)
◮ rel error – relative error on the training set (resubstitution), the
first node has an error of 1.
◮ xerror – the cross–validation estimate of the error
◮ xstd – the standard deviation of the risk
◮ 1-SE rule of thumb: all trees with
xerror ≤ minimal xerror + xstd
are equivalent. Choose the simplest one.
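A sketch of how the 1-SE rule can be applied programmatically to the CP table of an rpart fit (here cfit from the stage C example); passing the listed CP value to prune() returns the corresponding tree.
cp.tab <- cfit$cptable
best   <- which.min(cp.tab[, "xerror"])
thresh <- cp.tab[best, "xerror"] + cp.tab[best, "xstd"]
cp.1se <- cp.tab[min(which(cp.tab[, "xerror"] <= thresh)), "CP"]
pruned <- prune(cfit, cp = cp.1se)   # simplest tree within one SE of the minimum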
212
Example: data on spam
◮ Data: 4601 instances, 57 attributes
– Most of the attributes indicate whether a particular word or
character occurs frequently in the e-mail. The run-length
attributes (55-57) measure the length of sequences of consecutive
capital letters.
◮ Default tree building
> spam.tr<-rpart(V58~., data=spam, method="class") # default tree
> plot(spam.tr, compress=T, branch=.3)
> text(spam.tr)
213
1. Spam data: default tree
[Figure: the default classification tree for the spam data; splits on V53, V7, V52, V57, V16 and V25, with terminal nodes labelled 0 or 1.]
214
2. Spam data: default tree
> printcp(spam.tr)
Classification tree:
rpart(formula = V58 ~ ., data = spam, method = "class")
Variables actually used in tree construction:
[1] V16 V25 V52 V53 V57 V7
Root node error: 1813/4601 = 0.39404
n= 4601
CP nsplit rel error xerror xstd
1 0.476558 0 1.00000 1.00000 0.018282
2 0.148924 1 0.52344 0.54716 0.015386
3 0.043023 2 0.37452 0.44457 0.014222
4 0.030888 4 0.28847 0.33867 0.012723
5 0.010480 5 0.25758 0.28847 0.011875
6 0.010000 6 0.24710 0.27799 0.011685 # classification error
Absolute cross–validation error: 0.27799×0.39404= 0.1095392
215
1. Spam data: unpruned tree
> sctrl <- rpart.control(minbucket=1, minsplit=2, cp=0)
> spam1.tr <- rpart(V58~., data=spam, method="class", control=sctrl)
> printcp(spam1.tr)
Classification tree:
rpart(formula = V58 ~ ., data = spam, method = "class", control = sctrl)
Variables actually used in tree construction:
[1] V1 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V2 V20 V21 V22 V23 V24 V25 V26
[20] V27 V28 V29 V3 V30 V33 V35 V36 V37 V39 V4 V40 V42 V43 V44 V45 V46 V48 V49
[39] V5 V50 V51 V52 V53 V54 V55 V56 V57 V6 V7 V8 V9
Root node error: 1813/4601 = 0.39404
n= 4601
CP nsplit rel error xerror xstd
1 0.47655819 0 1.0000000 1.00000 0.0182819
2 0.14892443 1 0.5234418 0.55268 0.0154419
3 0.04302261 2 0.3745174 0.46663 0.0144933
4 0.03088803 4 0.2884721 0.32598 0.0125182
5 0.01047987 5 0.2575841 0.29454 0.0119835
6 0.00827358 6 0.2471042 0.27248 0.0115825
7 0.00717044 7 0.2388307 0.26751 0.0114891
216
8 0.00529509 8 0.2316602 0.26089 0.0113626
9 0.00441258 14 0.1958081 0.23828 0.0109127
10 0.00358522 15 0.1913955 0.23497 0.0108445
11 0.00330943 19 0.1770546 0.23276 0.0107986
12 0.00275786 20 0.1737452 0.22945 0.0107293
13 0.00220629 24 0.1627137 0.22725 0.0106827
14 0.00193050 28 0.1538886 0.22173 0.0105648
15 0.00183857 31 0.1478213 0.22063 0.0105410
16 0.00165472 34 0.1423056 0.20684 0.0102366
17 0.00137893 42 0.1290678 0.19801 0.0100348
18 0.00110314 46 0.1235521 0.19470 0.0099576
19 0.00082736 62 0.1059018 0.18809 0.0098007 # minimal xerror
20 0.00070916 82 0.0871484 0.18919 0.0098271
21 0.00066189 90 0.0810811 0.18864 0.0098139
22 0.00055157 95 0.0777716 0.19250 0.0099057
23 0.00041368 167 0.0380585 0.19360 0.0099317
24 0.00036771 175 0.0347490 0.19746 0.0100220
25 0.00033094 181 0.0325427 0.19967 0.0100731
26 0.00029700 186 0.0308880 0.20132 0.0101111
27 0.00027579 205 0.0242692 0.21346 0.0103843
28 0.00018386 277 0.0044126 0.21511 0.0104208
29 0.00000000 286 0.0027579 0.21566 0.0104329
217
2. Spam data: unpruned tree
> plot(spam1.tr)
218
[Figure: the unpruned tree — far too large for its labels to be readable.]
219
1. Spam data: 1-SE rule pruned tree
> spam2.tr <- prune(spam1.tr, cp=0.00137893)
> printcp(spam2.tr)
Classification tree:
rpart(formula = V58 ~ ., data = spam, method = "class", control = sctrl)
Variables actually used in tree construction:
[1] V16 V17 V19 V21 V22 V24 V25 V27 V28 V37 V4 V46 V49 V5 V50 V52 V53 V55 V56
[20] V57 V6 V7 V8
Root node error: 1813/4601 = 0.39404
n= 4601
CP nsplit rel error xerror xstd
1 0.4765582 0 1.00000 1.00000 0.018282
2 0.1489244 1 0.52344 0.55268 0.015442
3 0.0430226 2 0.37452 0.46663 0.014493
4 0.0308880 4 0.28847 0.32598 0.012518
5 0.0104799 5 0.25758 0.29454 0.011984
220
6 0.0082736 6 0.24710 0.27248 0.011582
7 0.0071704 7 0.23883 0.26751 0.011489
8 0.0052951 8 0.23166 0.26089 0.011363
9 0.0044126 14 0.19581 0.23828 0.010913
10 0.0035852 15 0.19140 0.23497 0.010844
11 0.0033094 19 0.17705 0.23276 0.010799
12 0.0027579 20 0.17375 0.22945 0.010729
13 0.0022063 24 0.16271 0.22725 0.010683
14 0.0019305 28 0.15389 0.22173 0.010565
15 0.0018386 31 0.14782 0.22063 0.010541
16 0.0016547 34 0.14231 0.20684 0.010237
17 0.0013789 42 0.12907 0.19801 0.010035
◮ Absolute cross-validation error
0.19801× 1813/4601 = 0.07802386
221
2. Spam data: 1-SE rule pruned tree
> plot(spam2.tr)
> text(spam2.tr, cex=.5)
222
[Figure: the 1-SE rule pruned tree (42 splits); splits on V53, V7, V52, V16 and other word- and run-length variables, with terminal nodes labelled 0 or 1.]
223
1. Spam data: another tree
> spam3.tr <- prune(spam1.tr, cp=0.0018)
> printcp(spam3.tr)
Classification tree:
rpart(formula = V58 ~ ., data = spam, method = "class", control = sctrl)
Variables actually used in tree construction:
[1] V16 V17 V19 V21 V22 V24 V25 V27 V37 V46 V49 V5 V52 V53 V55 V56 V57 V6 V7
[20] V8
Root node error: 1813/4601 = 0.39404
n= 4601
CP nsplit rel error xerror xstd
1 0.4765582 0 1.00000 1.00000 0.018282
2 0.1489244 1 0.52344 0.54661 0.015380
3 0.0430226 2 0.37452 0.43354 0.014081
4 0.0308880 4 0.28847 0.33370 0.012643
5 0.0104799 5 0.25758 0.28792 0.011866
224
6 0.0082736 6 0.24710 0.27689 0.011665
7 0.0071704 7 0.23883 0.26531 0.011447
8 0.0052951 8 0.23166 0.25648 0.011277
9 0.0044126 14 0.19581 0.23718 0.010890
10 0.0035852 15 0.19140 0.22835 0.010706
11 0.0033094 19 0.17705 0.22449 0.010624
12 0.0027579 20 0.17375 0.22614 0.010659
13 0.0022063 24 0.16271 0.22559 0.010648
14 0.0019305 28 0.15389 0.21732 0.010469
15 0.0018386 31 0.14782 0.21732 0.010469
16 0.0018000 34 0.14231 0.20684 0.010237
◮ Absolute cross-validation error
0.20684× 1813/4601 = 0.08150323
> plot(spam3.tr, branch=.4,uniform=T)
> text(spam3.tr, cex=.5)
225
[Figure: the tree pruned with cp = 0.0018 (34 splits); terminal nodes labelled 0 or 1.]
226
14. Nonparametric smoothing: basic ideas,
kernel and local polynomial estimators
227
Density estimation problem
◮ Old Faithful Geyser data on 272 geyser eruptions, eruption duration
and waiting times between successive eruptions were recorded.
> summary(faithful)
eruptions waiting
Min. :1.600 Min. :43.0
1st Qu.:2.163 1st Qu.:58.0
Median :4.000 Median :76.0
Mean :3.488 Mean :70.9
3rd Qu.:4.454 3rd Qu.:82.0
Max. :5.100 Max. :96.0
◮ We want to estimate density of the waiting times between successive
eruptions.
228
Histogram
◮ Let X1, . . . , Xn be i.i.d. with density f. We want to estimate the density f.
◮ By definition, f(x) = lim_{h→0} (1/2h) P{x − h < X ≤ x + h}.
◮ Idea: fix h, estimate P{x − h < X ≤ x + h} by (1/n) ∑_{i=1}^n I{x − h < Xi ≤ x + h}, and let
  f(x) = (1/2hn) ∑_{i=1}^n I{x − h < Xi ≤ x + h}.
◮ Histogram: consider bins (bj, bj+1], j = 0, 1, . . ., with bj+1 − bj = 2h for all j, and set
  f(x) = (1/2hn) ∑_{i=1}^n I{bj < Xi ≤ bj+1},   x ∈ (bj, bj+1].
229
Histogram of the Old Faithful data
> hist(faithful$waiting)
[Figure: histogram of faithful$waiting — frequency against waiting time (40 to 100).]
230
1. How to choose binwidth h?
◮ Bias of f(x):
  |E f(x) − f(x)| = | (1/2h) ∫_{x−h}^{x+h} f(t) dt − f(x) |
                  ≤ sup_{t∈(x−h,x+h]} |f(t) − f(x)| ≤ 2Lh,
provided that |f(x) − f(x′)| ≤ L|x − x′| for all x, x′.
◮ Variance of f(x): because the I{x − h < Xi ≤ x + h} are i.i.d. Bernoulli
random variables with parameter p = P{x − h < X1 ≤ x + h} = ∫_{x−h}^{x+h} f(t) dt,
  var{f(x)} ≤ (1/2nh) · (1/2h) ∫_{x−h}^{x+h} f(t) dt ≤ M/(2nh),
provided that f(x) ≤ M for all x.
231
2. How to choose binwidth h?
◮ Mean Squared Error of f(x):
  min_{h>0} {4L²h² + M(2nh)^{−1}}  ⇒  h∗ ≍ n^{−1/3}...
◮ The smoother the density, the larger h should be. For instance, if we
assume that |f′(x) − f′(x′)| ≤ L|x − x′| for all x, x′, then h∗ ≍ n^{−1/5}...
◮ The rule of thumb in R:
  h = 1.144 · σ n^{−1/5}.
232
Kernel estimators
◮ Another representation of the histogram estimator:
  f(x) = (1/nh) ∑_{i=1}^n K((Xi − x)/h),   K(t) = 1/2 for |t| ≤ 1 and 0 otherwise.
The function K is called the kernel, and h is called the bandwidth.
◮ General kernel estimator: take a function K satisfying ∫_{−∞}^{∞} K(t) dt = 1 and define
  f(x) = (1/hn) ∑_{i=1}^n K((Xi − x)/h).
◮ Commonly used kernels:
– rectangular: K(t) = (1/2) I{|t| ≤ 1};
– triangular: K(t) = (1 − |t|) I{|t| ≤ 1};
– Gaussian: K(t) = (1/√(2π)) e^{−t²/2} (the default in R).
233
1. Kernel density estimation in R
> layout(matrix(1:3, ncol = 3))
> plot(density(faithful$waiting))
> plot(density(faithful$waiting,bw=8))
> plot(density(faithful$waiting,bw=0.8))
234
2. Kernel density estimation in R
[Figure: three kernel density estimates of faithful$waiting — the default bandwidth (3.988), bw = 8 (oversmoothed) and bw = 0.8 (undersmoothed).]
235
Example: motorcycle data
◮ Measurements of head acceleration in a simulated motorcycle accident, used to test crash helmets.
> library(MASS)
> plot(mcycle)
236
[Figure: scatterplot of accel against times for the mcycle data.]
237
Regression problem
◮ Model: data {(Xi, Yi), i = 1, . . . , n}, f an unknown "smooth"
function, ε the error with Eε = 0:
  Yi = f(Xi) + εi, i = 1, . . . , n  ⇔  f(x) = E(Y|X = x).
◮ Parametric models
* Simple linear regression: f(x) = β0 + β1x.
* Polynomial regression:
  f(x) = β0 + β1x + · · · + βp x^p,
  which reduces to multiple linear regression: f(x) = β^T x, x = (1, x, x², . . . , x^p)^T.
238
Polynomial regression
> attach(mcycle)
> fit3<-lm(accel~times+I(times^2)+I(times^3))
> fit5<-lm(accel~times+I(times^2)+I(times^3)+I(times^4)+I(times^5))
> fit7<-lm(accel~times+I(times^2)+I(times^3)+I(times^4)+I(times^5)+I(times^6)+
+ I(times^7))
> plot(times, accel)
> lines(times, fit3$fitted, lty=1)
> lines(times, fit5$fitted, lty=2)
> lines(times, fit7$fitted, lty=3)
> legend(40, -70, c("fit3", "fit5", "fit7"), lty=c(1,2,3))
239
[Figure: the mcycle data with fitted polynomials of degree 3, 5 and 7 (fit3, fit5, fit7).]
240
Nonparametric regression: basic ideas
◮ Local modeling: fit a parametric model in a "local" neighborhood.
* Local average:
  fh(x) = (1/#{Nh(x)}) ∑_{i∈Nh(x)} Yi,   Nh(x) = {i : |x − Xi| ≤ h}.
* Local linear (polynomial) regression.
* k-NN estimator.
◮ How to define the local neighborhood?
241
Kernel estimators
◮ Regression function:
  f(x) = E(Y|X = x) = ∫ y p(x, y) dy / ∫ p(x, y) dy = ∫ y p(x, y) dy / p(x).
◮ Histogram estimate of p(x):
  ph(x) = (1/2nh) ∑_{i=1}^n I{Xi ∈ (bj, bj+1]} = nj/(2nh),   x ∈ (bj, bj+1].
◮ Estimator of f(x):
  fh(x) = ∑_{i=1}^n Yi I{Xi ∈ [x − h, x + h]} / ∑_{i=1}^n I{Xi ∈ [x − h, x + h]} =: ∑_{i=1}^n wn,i(x) Yi.
242
Nadaraya–Watson kernel estimator
◮ More generally, consider a kernel K such that ∫ K(x) dx = 1 and define
  fh(x) = ∑_{i=1}^n Yi K((x − Xi)/h) / ∑_{i=1}^n K((x − Xi)/h).
◮ Kernels:
– Box kernel: K1(x) = I_[−1/2,1/2](x).
– Quadratic kernel: K2(x) = (3/4)(1 − x²) I_[−1,1](x).
– Gaussian kernel: K3(x) = (1/√(2π)) exp{−x²/2}.
◮ Selection of the bandwidth h is important:
  fh(Xi) → Yi as h → 0;   fh(x) → (1/n) ∑_{i=1}^n Yi as h → ∞.
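A minimal sketch of the Nadaraya–Watson estimator with a Gaussian kernel on the mcycle data (illustrative; ksmooth() on the next slide is the built-in equivalent, up to its own bandwidth scaling).
library(MASS)                      # for the mcycle data
nw <- function(x0, x, y, h) {
  w <- dnorm((x0 - x) / h)         # kernel weights K((x0 - Xi)/h)
  sum(w * y) / sum(w)
}
grid <- seq(min(mcycle$times), max(mcycle$times), length.out = 200)
fit  <- sapply(grid, nw, x = mcycle$times, y = mcycle$accel, h = 2)
plot(mcycle$times, mcycle$accel)
lines(grid, fit)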
243
1. Example: motorcycle data
> plot(times, accel)
> lines(ksmooth(times, accel, "normal", bandwidth=1), lty=1)
> lines(ksmooth(times, accel, "normal", bandwidth=2), lty=2)
> lines(ksmooth(times, accel, "normal", bandwidth=3), lty=3)
> legend(40, -100, legend=c("bandwidth=1", "bandwidth=2", "bandwidth=3"),
+ lty=c(1,2,3))
The kernels are scaled so that their quartiles (viewed as probability
densities) are at +/- 0.25*bandwidth, i.e. for the normal kernel
h× z0.75 = 0.25× bandwidth.
244
2. Example: motorcycle data
[Figure: the mcycle data with Nadaraya–Watson fits for bandwidth = 1, 2 and 3.]
245
1. Local polynomial smoothing
◮ Local model at point x:
  Yi = a(x) + b(x) Xi + εi,   Xi ∈ [x − h, x + h].
◮ Local linear regression estimator:
  ∑_{i=1}^n [Yi − a(x) − b(x) Xi]² I{|Xi − x|/h ≤ 1} → min over a(x), b(x);
  f(x) = a(x) + b(x) x.
246
2. Local polynomial smoothing
◮ General local polynomial estimator: for z near x,
  f(z) ≈ ∑_{j=0}^p [f^(j)(x)/j!] (z − x)^j = ∑_{j=0}^p βj (z − x)^j,
  ∑_{i=1}^n [Yi − ∑_{j=0}^p βj (Xi − x)^j]² K(|Xi − x|/h) → min over β;
  f(x) = β0,   f^(j)(x) = j! βj, j = 1, . . . , p.
◮ Both kernel and local polynomial estimators are linear smoothers:
  f(x) = ∑_{i=1}^n wn,i(x) Yi,   with ∑_{i=1}^n wn,i(x) = 1.
247
1. Local polynomial smoothing (LOESS)
◮ Idea: fit a polynomial locally to the data.
◮ LOESS smoothing: define the weights
  wi(x) = ( [1 − |x − Xi|³ / τ³(x, α)]_+ )³,   i = 1, . . . , n.
Let r(x, β) be a polynomial of degree p with coefficients β = (β0, . . . , βp). Define
  β(x) = arg min_β ∑_{i=1}^n wi(x) [Yi − r(Xi, β)]²,   f(x) = r(x, β(x)).
248
2. Local polynomial smoothing (LOESS)
◮ The bandwidth τ(x, α) is chosen as follows:
– denote ∆i(x) = |x − Xi| and order these values:
  ∆(1)(x) ≤ ∆(2)(x) ≤ · · · ≤ ∆(n)(x);
– if 0 < α ≤ 1 then τ(x, α) = ∆(q)(x), where q = [αn];
– if α > 1 then τ(x, α) = α ∆(n)(x).
249
1. LOESS: motorcycle data
> attach(mcycle)
> mcycle.1 <- loess(accel~times, span=0.1)
> mcycle.2 <- loess(accel~times, span=0.5)
> mcycle.3 <- loess(accel~times, span=1)
> prtimes<- matrix((0:1000)*((max(times)-min(times))/1000)+min(times), ncol=1)
> praccel.1 <-predict(mcycle.1, prtimes)
> praccel.2 <-predict(mcycle.2, prtimes)
> praccel.3 <-predict(mcycle.3, prtimes)
> plot(mcycle, pch="+")
> lines(prtimes, praccel.1, lty=1)
> lines(prtimes, praccel.2, lty=2)
> lines(prtimes, praccel.3, lty=3)
> legend(40, -90, legend=c("span=0.1", "span=0.5", "span=1"), lty=c(1,2,3))
span = α controls the proportion of points used in the local neighborhood (see
the definition of τ(x, α) above);
the degree of the local polynomial is 2.
250
2. LOESS: motorcycle data
[Figure: the mcycle data (plotted with '+') and LOESS fits with span = 0.1, 0.5 and 1.]
251
15. Multivariate nonparametric regression: regression trees (CART)
252
Multivariate nonparametric regression
◮ Curse of dimensionality
Data are very sparse in high dimensional space
If we have 1000 uniformly distributed points on [0, 1]d then the
average number Nd of points in [0, 0.3]d is as follows
d 1 2 3 4 5
Nd 300 90 27 8.1 2.4
◮ Remedy: nonparametric "structural" models
– additive structure: f(X) = f1(X1) + . . . + fp(Xp)
– single–index: f(X) = f0(θ^T X)
– projection pursuit: f(X) = ∑_{i=1}^k fi(θi^T X)
253
Regression trees: basic idea
◮ Data: (Xi, Yi), i = 1, . . . , n, X ∈ X ⊂ Rp.
◮ Modeling assumption: there is a partition of X into M regions
D1, . . . , DM, and f is approximated by a constant on each region:
  f(x) = ∑_{m=1}^M cm I(x ∈ Dm).
◮ Splitting rules: binary splitting.
Choose a variable Xm and split according to Xm ≤ tm or Xm > tm.
◮ How to partition the domain X?
254
A regression tree: an example
[Diagram: a binary tree. The root splits on X1 < t1; its left child splits on X2 < t2 into regions D1 and D2; its right child splits on X1 < t3, whose right branch splits again on X2 < t4, giving regions D3, D4 and D5.]
255
Corresponding partition of the feature space
[Diagram: the (X1, X2) plane cut at t1 and t3 on the X1 axis and at t2 and t4 on the X2 axis, giving the rectangles D1, . . . , D5.]
256
1. Growing the regression tree
◮ Predictor variable: X = (X1, . . . , Xp) ∈ Rp;
Data: (Xi, Yi), Xi = (Xi1, . . . , Xip), i = 1, . . . , n.
◮ Goodness of a split S is defined by the decrease in the total sum of squares:
  Φ(S, t) = SS(t) − [SS(tL) + SS(tR)],
where SS(t) is the total sum of squares ∑ (Yi − Ȳ)² of the observations at node t.
◮ Choose the split S that maximizes Φ(S, t).
257
2. Growing the regression tree
◮ Consider a splitting variable Xj and split point s, and define
  D1(j, s) = {X : Xj ≤ s},   D2(j, s) = {X : Xj > s}.
Then we look for j and s that solve
  min_{j,s} [ ∑_{i=1}^n (Yi − c1)² I{Xi ∈ D1(j, s)} + ∑_{i=1}^n (Yi − c2)² I{Xi ∈ D2(j, s)} ],
  ck = ave{Yi : Xi ∈ Dk(j, s)} = ∑_{i=1}^n Yi I{Xi ∈ Dk(j, s)} / ∑_{i=1}^n I{Xi ∈ Dk(j, s)},   k = 1, 2.
◮ For each splitting variable the determination of s is done very quickly
(there is only a finite number of distinct splits). The pair (j, s) is found by scanning
over all of the inputs.
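A sketch of the split search for a single ordered variable (the helper best_split is illustrative, not from the slides): each midpoint between consecutive data values is tried and the cut minimizing the sum of the two within-region sums of squares is kept.
best_split <- function(x, y) {
  ss   <- function(v) sum((v - mean(v))^2)
  vals <- sort(unique(x))
  cuts <- (head(vals, -1) + tail(vals, -1)) / 2            # candidate cut points
  crit <- sapply(cuts, function(s) ss(y[x <= s]) + ss(y[x > s]))
  c(cut = cuts[which.min(crit)], ss = min(crit))
}
# e.g. best_split(mcycle$times, mcycle$accel)   # mcycle from library(MASS);
#                                               # compare with the root split of the tree below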
258
Pruning the regression tree
◮ Let T̃ = {t1, . . . , tM} be the terminal nodes of the tree T, with regions
D1, . . . , DM and numbers of observations N1, . . . , NM.
◮ Notation: for each terminal node tm define
  cm = (1/Nm) ∑_{i=1}^n Yi I{Xi ∈ Dm},   Q(tm) = (1/Nm) ∑_{i=1}^n (Yi − cm)² I{Xi ∈ Dm}.
◮ Cost–complexity criterion: for a complexity parameter (CP) α > 0 let
  Rα(T) = ∑_{m=1}^M Nm Q(tm) + α·size(T),   size(T) = #{t : t ∈ T̃}.
◮ The weakest–link cutting algorithm produces a decreasing sequence of
trees with the corresponding CP's; cross–validation is used to select from
this collection. [See the transparencies for classification trees.]
259
1. Example: regression tree for motorcycle data
> mcycle.tree<-rpart(accel~times, data=mcycle,method="anova")
> plot(mcycle.tree)
> text(mcycle.tree)
> plot(times, accel)
> lines(times, predict(mcycle.tree))
260
[Figure: the fitted regression tree; splits on times at 27.4, 16.5, 24.4, 19.5, 15.1 and 35, with terminal-node means ranging from −114.7 to 29.29.]
261
2. Example: regression tree for motorcycle data
[Figure: the mcycle data with the piecewise-constant regression-tree fit overlaid.]
262
1. Example: Boston housing data, regression tree
> library(MASS)
> nobs <- dim(Boston)[1]
> trainx<-sample(1:nobs, 2*nobs/3, replace=F)
> testx<-(1:nobs)[-trainx]
> Boston.tree<-rpart(medv~., data=Boston[trainx,], method="anova")
> print(Boston.tree)
n= 337
node), split, n, deviance, yval
* denotes terminal node
1) root 337 26701.3200 22.51810
2) rm< 6.825 277 10100.7600 19.68123
4) lstat>=15 99 1586.7700 14.28990
8) crim>=7.036505 41 409.5088 11.53171 *
9) crim< 7.036505 58 644.8588 16.23966 *
5) lstat< 15 178 4035.9670 22.67978
10) rm< 6.543 145 2251.4780 21.62069
263
20) lstat>=9.66 80 573.8339 20.32125 *
21) lstat< 9.66 65 1376.3040 23.22000 *
11) rm>=6.543 33 907.2133 27.33333 *
3) rm>=6.825 60 4079.5770 35.61500
6) rm< 7.435 42 1248.8060 31.77143
12) lstat>=5.415 21 417.2467 28.66667 *
13) lstat< 5.415 21 426.6981 34.87619 *
7) rm>=7.435 18 762.5450 44.58333 *
#
> plot(Boston.tree)
> text(Boston.tree)
#
#
> Boston.pred <- predict(Boston.tree, Boston[testx,])
> sum((Boston.pred-Boston[testx,"medv"])^2)
[1] 4045.943 # prediction error
# on the test set
264
2. Example: Boston housing data, regression tree
[Figure: the Boston housing regression tree; splits on rm, lstat and crim, with terminal-node means from 11.53 to 44.58.]
265
MARS as extension of CART
◮ CART decision trees are based on approximating f by
  f(x) = ∑_{m=1}^M cm Bm(x),   Bm(x) = I{x ∈ Dm}   (the basis functions).
◮ Idea: replace I{x ∈ Dm} with a continuous function.
If the basis functions Bm(·) are given, the coefficients cm are
estimated by least squares.
266
Region representation
◮ Step function:
  H[η] = 1 if η ≥ 0, and 0 otherwise.
Each region Dm is obtained by, say, Km splits; the k-th split,
k = 1, . . . , Km, is performed on the variable x_v(k,m) using the
threshold t_km. Therefore
  Bm(x) = ∏_{k=1}^{Km} H[ s_km (x_v(k,m) − t_km) ],   s_km = ±1,
  f(x) = ∑_{m=1}^M cm ∏_{k=1}^{Km} H[ s_km (x_v(k,m) − t_km) ].
◮ Minimize a lack–of–fit (LOF) measure w.r.t. cm, s_km, v(k, m) and t_km.
267
MARS basis functions
◮ The MARS algorithm, instead of the step functions H[xv − t] and H[−xv + t],
uses the hinge functions [xv − t]+ and [−xv + t]+.
[Diagram: the two hinge functions (x − t)+ and (t − x)+, each zero on one side of t and linear on the other.]
◮ The collection of basis functions:
  C = { (xj − t)+, (t − xj)+ : t ∈ {X1j, X2j, . . . , Xnj}, j = 1, . . . , p }.
268
MARS: Model building strategy
◮ Forward selection of basis functions: at each iteration we add a pair
of new basis functions. They are obtained by multiplication of the
previously chosen basis functions with [xj − t]+ and [t− xj ]+.
◮ First step: we consider adding to the model a function
  β1(xj − t)+ + β2(t − xj)+,   t ∈ {X1j, X2j, . . . , Xnj}.
Suppose the best choice is β1(x2 − X72)+ + β2(X72 − x2)+. Then the
set of basis functions at this step is
  C1 = {B0(x) = 1, B1(x) = (x2 − X72)+, B2(x) = (X72 − x2)+}.
269
◮ Second step: consider including a pair of products
  (xj − t)+ Bm(x) and (t − xj)+ Bm(x),   Bm ∈ C1.
◮ Step M: CM = {Bm(x), m = 1, . . . , 2M + 1}. At step M + 1 the
algorithm adds the terms
  c_{2M+2} Bl(x) [xj − t]+ + c_{2M+3} Bl(x) [t − xj]+,   Bl ∈ CM, j = 1, . . . , p,
where Bl and j produce the maximal decrease in the training error. Stop
when the model contains a preset maximum number of terms Mmax.
◮ Final selection: choose the model fM based on M basis functions that minimizes
  LOF(fM) = [n / (n − M − 1)²] ∑_{i=1}^n [yi − fM(xi)]².
270
Recursive partitioning algorithm
B1(x) = 1
for M = 2 to Mmax do:  lof∗ = ∞
  for m = 1 to M − 1 do:
    for v = 1 to p do:                      # loop over the predictor variables
      for t ∈ {xvj : Bm(xj) > 0} do:
        g = ∑_{i≠m} ci Bi(x) + cm Bm(x) H[xv − t] + cM Bm(x) H[−xv + t]
        lof = min_{c1,...,cM} LOF(g)
        if lof < lof∗ then lof∗ = lof; m∗ = m; v∗ = v; t∗ = t; endif
      endfor
    endfor
  endfor
  BM(x) = Bm∗(x) H[−xv∗ + t∗]
  Bm∗(x) = Bm∗(x) H[xv∗ − t∗]
endfor; end of algorithm
271
1. Example: Boston housing data – MARS
# Number of basis functions<=10, no interactions
> library(mda)
> Boston1.mars<-mars(Boston[,1:13], Boston$medv, degree=1, nk=10)
# ij-th element equal to 1 if term i has a factor of the form x_j>c,
# equal to -1 if term i has a factor of the form x_j <= c,
# and to 0 if x_j is not in term i.
> Boston1.mars$factor
crim zn indus chas nox rm age dis rad tax ptratio black lstat
[1,] 0 0 0 0 0 0 0 0 0 0 0 0 0
[2,] 0 0 0 0 0 0 0 0 0 0 0 0 1
[3,] 0 0 0 0 0 0 0 0 0 0 0 0 -1
[4,] 0 0 0 0 0 1 0 0 0 0 0 0 0
[5,] 0 0 0 0 0 -1 0 0 0 0 0 0 0
[6,] 0 0 0 0 0 0 0 0 0 0 1 0 0
[7,] 0 0 0 0 0 0 0 0 0 0 -1 0 0
[8,] 0 0 0 0 1 0 0 0 0 0 0 0 0
[9,] 0 0 0 0 -1 0 0 0 0 0 0 0 0
> Boston1.mars$cuts
272
# ij-th element equal to the cut point c for variable j in term i.
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13]
[1,] 0 0 0 0 0.000 0.000 0 0 0 0 0.0 0 0.00
[2,] 0 0 0 0 0.000 0.000 0 0 0 0 0.0 0 6.07
[3,] 0 0 0 0 0.000 0.000 0 0 0 0 0.0 0 6.07
[4,] 0 0 0 0 0.000 6.425 0 0 0 0 0.0 0 0.00
[5,] 0 0 0 0 0.000 6.425 0 0 0 0 0.0 0 0.00
[6,] 0 0 0 0 0.000 0.000 0 0 0 0 17.8 0 0.00
[7,] 0 0 0 0 0.000 0.000 0 0 0 0 17.8 0 0.00
[8,] 0 0 0 0 0.472 0.000 0 0 0 0 0.0 0 0.00
[9,] 0 0 0 0 0.472 0.000 0 0 0 0 0.0 0 0.00
> Boston1.mars$selected.terms
[1] 1 2 3 4 5 6 7 8 9
> Boston1.mars$coefficients
[,1]
[1,] 25.3093268
[2,] -0.5529556
[3,] 2.9450633
[4,] 8.0947256
[5,] 1.4712255
273
[6,] -0.6005808
[7,] 0.8962524
[8,] -12.0724248
[9,] -52.6998783
> Boston1.mars$gcv
[1] 17.73768
> plot(Boston1.mars$residuals)
> abline(0,0)
> qqnorm(Boston1.mars$residuals)
> qqline(Boston1.mars$residuals)
◮ Fitted model
f(x) = 25.30− 0.55× (lstat− 6.07)+ + 2.95× (6.07− lstat)+
+ 8.09× (rm− 6.43)+ + 1.47× (6.43− rm)+
−0.6× (ptratio − 17.8)+ + 0.90× (17.8− ptratio)+
−12.07× (nox− 0.47)+ − 52.7× (0.47− nox)+
274
2. Example: Boston housing data – MARS
[Figure: left panel — MARS residuals against observation index with a horizontal zero line; right panel — normal Q–Q plot of the residuals.]
275
3. Example: Boston housing data – MARS
# Number of basis function <= 40, degree=2
> Boston2.mars<-mars(Boston[,1:13], Boston$medv, degree=2, nk=40)
> Boston2.mars$selected.terms
[1] 1 2 4 5 6 7 8 9 11 13 14 16 19 20 23 25 26 27 28 29 31 32 34 36 38
[26] 39
> Boston2.factor
crim zn indus chas nox rm age dis rad tax ptratio black lstat
[1,] 0 0 0 0 0 0 0 0 0 0 0 0 0
[2,] 0 0 0 0 0 0 0 0 0 0 0 0 1
[3,] 0 0 0 0 0 0 0 0 0 0 0 0 -1
[4,] 0 0 0 0 0 1 0 0 0 0 0 0 0
[5,] 0 0 0 0 0 -1 0 0 0 0 0 0 0
[6,] 0 0 0 0 0 1 0 0 0 0 1 0 0
[7,] 0 0 0 0 0 1 0 0 0 0 -1 0 0
[8,] 0 0 0 0 0 0 0 0 0 1 0 0 -1
[9,] 0 0 0 0 0 0 0 0 0 -1 0 0 -1
[10,] 0 0 0 0 1 0 0 0 0 0 0 0 1
[11,] 0 0 0 0 -1 0 0 0 0 0 0 0 1
276
[12,] 0 0 0 0 0 -1 0 1 0 0 0 0 0
[13,] 0 0 0 0 0 -1 0 -1 0 0 0 0 0
[14,] 1 0 0 0 0 0 0 0 0 0 0 0 0
[15,] -1 0 0 0 0 0 0 0 0 0 0 0 0
[16,] 1 0 0 1 0 0 0 0 0 0 0 0 0
[17,] 1 0 0 -1 0 0 0 0 0 0 0 0 0
[18,] -1 0 0 0 0 0 0 0 0 1 0 0 0
[19,] -1 0 0 0 0 0 0 0 0 -1 0 0 0
[20,] 0 0 0 0 0 0 0 0 0 0 0 0 1
[21,] 0 0 0 0 0 0 0 0 0 0 0 0 -1
[22,] 0 0 0 0 0 1 1 0 0 0 0 0 0
[23,] 0 0 0 0 0 1 -1 0 0 0 0 0 0
[24,] 0 0 0 0 0 0 0 1 0 0 0 0 0
[25,] 0 0 0 0 0 0 0 -1 0 0 0 0 0
[26,] 0 0 0 0 0 0 0 -1 0 0 0 1 0
[27,] 0 0 0 0 0 0 0 -1 0 0 0 -1 0
[28,] 0 0 0 0 0 0 0 0 0 0 0 1 1
[29,] 0 0 0 0 0 0 0 0 0 0 0 -1 1
[30,] 0 0 0 0 0 0 1 -1 0 0 0 0 0
[31,] 0 0 0 0 0 0 -1 -1 0 0 0 0 0
[32,] 0 0 0 0 1 1 0 0 0 0 0 0 0
[33,] 0 0 0 0 -1 1 0 0 0 0 0 0 0
277
[34,] 0 0 0 0 0 0 0 0 0 0 1 0 1
[35,] 0 0 0 0 0 0 0 0 0 0 -1 0 1
[36,] 0 0 0 0 1 0 0 -1 0 0 0 0 0
[37,] 0 0 0 0 -1 0 0 -1 0 0 0 0 0
[38,] 0 0 0 0 0 0 0 1 0 0 0 0 1
[39,] 0 0 0 0 0 0 0 -1 0 0 0 0 1
>
> Boston2.mars$cuts
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13]
[1,] 0.00000 0 0 0 0.000 0.000 0.0 0.0000 0 0 0.0 0.00 0.00
[2,] 0.00000 0 0 0 0.000 0.000 0.0 0.0000 0 0 0.0 0.00 6.07
[3,] 0.00000 0 0 0 0.000 0.000 0.0 0.0000 0 0 0.0 0.00 6.07
[4,] 0.00000 0 0 0 0.000 6.425 0.0 0.0000 0 0 0.0 0.00 0.00
[5,] 0.00000 0 0 0 0.000 6.425 0.0 0.0000 0 0 0.0 0.00 0.00
[6,] 0.00000 0 0 0 0.000 6.425 0.0 0.0000 0 0 17.8 0.00 0.00
[7,] 0.00000 0 0 0 0.000 6.425 0.0 0.0000 0 0 17.8 0.00 0.00
[8,] 0.00000 0 0 0 0.000 0.000 0.0 0.0000 0 335 0.0 0.00 6.07
[9,] 0.00000 0 0 0 0.000 0.000 0.0 0.0000 0 335 0.0 0.00 6.07
[10,] 0.00000 0 0 0 0.718 0.000 0.0 0.0000 0 0 0.0 0.00 6.07
[11,] 0.00000 0 0 0 0.718 0.000 0.0 0.0000 0 0 0.0 0.00 6.07
[12,] 0.00000 0 0 0 0.000 6.425 0.0 1.8195 0 0 0.0 0.00 0.00
278
[13,] 0.00000 0 0 0 0.000 6.425 0.0 1.8195 0 0 0.0 0.00 0.00
[14,] 4.54192 0 0 0 0.000 0.000 0.0 0.0000 0 0 0.0 0.00 0.00
[15,] 4.54192 0 0 0 0.000 0.000 0.0 0.0000 0 0 0.0 0.00 0.00
[16,] 4.54192 0 0 0 0.000 0.000 0.0 0.0000 0 0 0.0 0.00 0.00
[17,] 4.54192 0 0 0 0.000 0.000 0.0 0.0000 0 0 0.0 0.00 0.00
[18,] 4.54192 0 0 0 0.000 0.000 0.0 0.0000 0 242 0.0 0.00 0.00
[19,] 4.54192 0 0 0 0.000 0.000 0.0 0.0000 0 242 0.0 0.00 0.00
[20,] 0.00000 0 0 0 0.000 0.000 0.0 0.0000 0 0 0.0 0.00 23.97
[21,] 0.00000 0 0 0 0.000 0.000 0.0 0.0000 0 0 0.0 0.00 23.97
[22,] 0.00000 0 0 0 0.000 6.425 84.7 0.0000 0 0 0.0 0.00 0.00
[23,] 0.00000 0 0 0 0.000 6.425 84.7 0.0000 0 0 0.0 0.00 0.00
[24,] 0.00000 0 0 0 0.000 0.000 0.0 4.7075 0 0 0.0 0.00 0.00
[25,] 0.00000 0 0 0 0.000 0.000 0.0 4.7075 0 0 0.0 0.00 0.00
[26,] 0.00000 0 0 0 0.000 0.000 0.0 4.7075 0 0 0.0 373.66 0.00
[27,] 0.00000 0 0 0 0.000 0.000 0.0 4.7075 0 0 0.0 373.66 0.00
[28,] 0.00000 0 0 0 0.000 0.000 0.0 0.0000 0 0 0.0 376.73 6.07
[29,] 0.00000 0 0 0 0.000 0.000 0.0 0.0000 0 0 0.0 376.73 6.07
[30,] 0.00000 0 0 0 0.000 0.000 77.8 4.7075 0 0 0.0 0.00 0.00
[31,] 0.00000 0 0 0 0.000 0.000 77.8 4.7075 0 0 0.0 0.00 0.00
[32,] 0.00000 0 0 0 0.624 6.425 0.0 0.0000 0 0 0.0 0.00 0.00
[33,] 0.00000 0 0 0 0.624 6.425 0.0 0.0000 0 0 0.0 0.00 0.00
[34,] 0.00000 0 0 0 0.000 0.000 0.0 0.0000 0 0 12.6 0.00 6.07
279
[35,] 0.00000 0 0 0 0.000 0.000 0.0 0.0000 0 0 12.6 0.00 6.07
[36,] 0.00000 0 0 0 0.718 0.000 0.0 4.7075 0 0 0.0 0.00 0.00
[37,] 0.00000 0 0 0 0.718 0.000 0.0 4.7075 0 0 0.0 0.00 0.00
[38,] 0.00000 0 0 0 0.000 0.000 0.0 2.9879 0 0 0.0 0.00 6.07
[39,] 0.00000 0 0 0 0.000 0.000 0.0 2.9879 0 0 0.0 0.00 6.07
>
> Boston2.mars$coefficients
[,1]
[1,] 23.054932863
[2,] -0.475918184
[3,] 9.219231990
[4,] -2.823958066
[5,] -1.785616880
[6,] 0.742716889
[7,] 0.018297135
[8,] 0.015258475
[9,] 1.687538405
[10,] 9.636024180
[11,] -0.062409499
[12,] 2.555488850
[13,] 0.013238063
280
[14,] 0.593931139
[15,] 0.047603018
[16,] 2.588471925
[17,] -0.107836555
[18,] -0.012484569
[19,] 0.017930583
[20,] 0.001807397
[21,] 0.045138047
[22,] -98.745400075
[23,] -0.058675469
[24,] -6.453215924
[25,] -0.071061935
[26,] -0.140013447
>
> Boston2.mars$gcv
[1] 9.199464
281
4. Example: Boston housing data – MARS
◮ Fitted model
f(x) = 23.05− 0.48× (lstat − 6.07)+ + 9.21× (6.07− lstat)+
− 2.82× (rm− 6.43)+ − 1.78× (6.43− rm)+
+0.74× (rm− 6.43)+ × (ptratio − 17.8)+
+ 0.002× (rm− 6.43)+ × (17.8− ptratio)+
+ · · ·
282
16. Dimensionality reduction: principal components analysis (PCA)
283
1. PCA: basic idea
◮ The idea is to describe/approximate the variability (distribution) of
X = (X1, . . . , Xp) by a distribution in a space of smaller dimension.
◮ Approximation by a one-dimensional space: let
  δ^T X = ∑_{i=1}^p δi Xi,   ‖δ‖² = ∑_{i=1}^p δi² = 1.
Which projection (normalized linear combination) is the "best
representer" of the distribution of the vector? In other words, how should δ be chosen?
◮ First optimization problem:
  (Opt1)   max_{δ:‖δ‖=1} var(δ^T X) = max_{δ:‖δ‖=1} δ^T Σ δ,   Σ = cov(X).
284
2. PCA: basic idea
◮ The solution to (Opt1) is the eigenvector γ1 of Σ corresponding to the
maximal eigenvalue λ1. The first principal component is Y1 = γ1^T X.
◮ Second optimization problem: max{ δ^T Σ δ : ‖δ‖ = 1, δ^T γ1 = 0 }.
The solution is the eigenvector γ2 of Σ corresponding to the second largest
eigenvalue λ2. The second principal component is Y2 = γ2^T X, and so on...
◮ In general, if Σ = ΓΛΓ^T, where Λ is the diagonal matrix of the
eigenvalues and Γ is the orthogonal matrix of the eigenvectors, then the
PCA transformation is
  Y = Γ^T (X − µ).
285
PCA: theory
◮ Theorem: Let X ∼ Np(µ, Σ), Σ = ΓΛΓ^T, and let Y = Γ^T(X − µ) be the principal components. Then
(i) EYj = 0, var(Yj) = λj, j = 1, . . . , p, and cov{Yi, Yj} = 0 for all i ≠ j;
(ii) var(Y1) ≥ var(Y2) ≥ · · · ≥ var(Yp);
(iii) ∑_{i=1}^p var(Yi) = tr(Σ) and ∏_{i=1}^p var(Yi) = det(Σ).
◮ Proportion of the variability explained by the first q components:
  ψq = ∑_{i=1}^q λi / ∑_{i=1}^p λi = ∑_{i=1}^q var(Yi) / ∑_{i=1}^p var(Yi).
286
PCA: empirical version
◮ Idea: given Xi ∈ Rp, i = 1, . . . , n, estimate Γ and µ, apply the PCA
transformation to the Xi's, and keep the first q variables that explain the variability "well".
◮ Estimates:
  µ = (1/n) ∑_{i=1}^n Xi,   Σ = (1/(n − 1)) ∑_{i=1}^n (Xi − µ)(Xi − µ)^T.
◮ Spectral decomposition and PCA transformation:
  Σ = ΓΛΓ^T,   Yi = Γ^T (Xi − µ), i = 1, . . . , n.
If the variables are given on different scales, the data can be standardized
before applying PCA:
  X̃i = D^{−1/2}(Xi − µ),   D = diag{Σ}.
287
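As a quick illustration of the empirical version, the following is a minimal R sketch (simulated data, base R only) that eigen-decomposes the sample covariance of the standardized data and checks the eigenvalues against prcomp(); the data matrix X here is purely illustrative.

# A minimal sketch of empirical PCA, assuming nothing beyond base R.
set.seed(1)
X  <- matrix(rnorm(100 * 4), 100, 4) %*% matrix(runif(16), 4, 4)  # toy data, p = 4
Xs <- scale(X)                       # standardize (variables on different scales)
eg <- eigen(cov(Xs))                 # Gamma = eg$vectors, Lambda = eg$values
Y  <- Xs %*% eg$vectors              # principal components Y_i = Gamma^T (X_i - mu)
eg$values                            # lambda_1 >= ... >= lambda_p
prcomp(X, scale = TRUE)$sdev^2       # should agree with eg$values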
1. Example: heptathlon data
◮ Data on 25 competitors: seven events and total score
>heptathlon
hurdles highjump shot run200m longjump javelin run800m
Joyner-Kersee (USA) 12.69 1.86 15.80 22.56 7.27 45.66 128.51
John (GDR) 12.85 1.80 16.23 23.65 6.71 42.56 126.12
Behmer (GDR) 13.20 1.83 14.20 23.10 6.68 44.54 124.20
Sablovskaite (URS) 13.61 1.80 15.23 23.92 6.25 42.78 132.24
...........................................................................
To recode all events in the same direction ("large" is "good") we transform the running events:
>heptathlon$hurdles<-max(heptathlon$hurdles)-heptathlon$hurdles
>heptathlon$run200m<-max(heptathlon$run200m)-heptathlon$run200m
>heptathlon$run800m<-max(heptathlon$run800m)-heptathlon$run800m
288
2. Example: heptathlon data
◮ Correlations
>hept<-heptathlon[,-8] #without total score
>round(cor(hept),2)
hurdles highjump shot run200m longjump javelin run800m
hurdles 1.00 0.81 0.65 0.77 0.91 0.01 0.78
highjump 0.81 1.00 0.44 0.49 0.78 0.00 0.59
shot 0.65 0.44 1.00 0.68 0.74 0.27 0.42
run200m 0.77 0.49 0.68 1.00 0.82 0.33 0.62
longjump 0.91 0.78 0.74 0.82 1.00 0.07 0.70
javelin 0.01 0.00 0.27 0.33 0.07 1.00 -0.02
run800m 0.78 0.59 0.42 0.62 0.70 -0.02 1.00
javelin is weakly correlated with other variables...
289
Scatterplot of the data
>plot(hept)
[Scatterplot matrix (pairs plot) of the seven heptathlon event variables]
290
1. Principal components in R
> hept_pca<-prcomp(hept,scale=TRUE)
> print(hept_pca)
Standard deviations:
[1] 2.1119364 1.0928497 0.7218131 0.6761411 0.4952441 0.2701029 0.2213617
Rotation:
PC1 PC2 PC3 PC4 PC5 PC6
hurdles -0.4528710 0.15792058 -0.04514996 0.02653873 -0.09494792 -0.78334101
highjump -0.3771992 0.24807386 -0.36777902 0.67999172 0.01879888 0.09939981
shot -0.3630725 -0.28940743 0.67618919 0.12431725 0.51165201 -0.05085983
run200m -0.4078950 -0.26038545 0.08359211 -0.36106580 -0.64983404 0.02495639
longjump -0.4562318 0.05587394 0.13931653 0.11129249 -0.18429810 0.59020972
javelin -0.0754090 -0.84169212 -0.47156016 0.12079924 0.13510669 -0.02724076
run800m -0.3749594 0.22448984 -0.39585671 -0.60341130 0.50432116 0.15555520
PC7
hurdles 0.38024707
highjump -0.43393114
shot -0.21762491
run200m -0.45338483
longjump 0.61206388
javelin 0.17294667
run800m -0.09830963
> summary(hept_pca)
Importance of components:
PC1 PC2 PC3 PC4 PC5 PC6 PC7
Standard deviation 2.112 1.093 0.7218 0.6761 0.4952 0.2701 0.221
Proportion of Variance 0.637 0.171 0.0744 0.0653 0.0350 0.0104 0.007
Cumulative Proportion 0.637 0.808 0.8822 0.9475 0.9826 0.9930 1.000
> a1<-hept_pca$rotation[,1] # linear combination for the 1st principal
# component
> a1
hurdles highjump shot run200m longjump javelin run800m
-0.4528710 -0.3771992 -0.3630725 -0.4078950 -0.4562318 -0.0754090 -0.3749594
>
292
2. Principal components in R
◮ First principal component:
> predict(hept_pca)[,1] # or just hept_pca$x[,1]
Joyner-Kersee (USA) John (GDR) Behmer (GDR) Sablovskaite (URS)
-4.121447626 -2.882185935 -2.649633766 -1.343351210
Choubenkova (URS) Schulz (GDR) Fleming (AUS) Greiner (USA)
-1.359025696 -1.043847471 -1.100385639 -0.923173639
Lajbnerova (CZE) Bouraga (URS) Wijnsma (HOL) Dimitrova (BUL)
-0.530250689 -0.759819024 -0.556268302 -1.186453832
Scheider (SWI) Braun (FRG) Ruotsalainen (FIN) Yuping (CHN)
0.015461226 0.003774223 0.090747709 -0.137225440
Hagger (GB) Brown (USA) Mulliner (GB) Hautenauve (BEL)
0.171128651 0.519252646 1.125481833 1.085697646
Kytola (FIN) Geremias (BRA) Hui-Ing (TAI) Jeong-Mi (KOR)
1.447055499 2.014029620 2.880298635 2.970118607
Launa (PNG)
6.270021972
◮ The first two components account for 81% of the variance.
293
>plot(hept_pca)
[Scree plot of the principal component variances of hept_pca]
> cor(heptathlon$score,hept_pca$x[,1])
[1] -0.9910978
294
>plot(heptathlon$score, hept_pca$x[,1])
[Scatterplot of hept_pca$x[,1] against heptathlon$score]
295
17. Clustering: model–based clustering, K–means, K–medoids
296
1. Clustering problem
◮ Clustering ⇔ unsupervised learning:
Grouping or segmenting objects (observations) into subsets or
”clusters”, such that those within each cluster are more similar
to each other than they are to the members of other groups.
◮ Example – Iris data
Given the measurements in centimeters of the four variables
sepal length/width and petal length/width
for 50 flowers from each of three Iris species, the goal is to group the
observations in accordance with the species. The species are Iris setosa,
versicolor, and virginica.
297
2. Clustering problem
◮ Data: {X1, . . . , Xn}, Xi ∈ Rp
◮ Dissimilarity (proximity) measure: if d(·, ·) is a distance,
D = {dij}i,j=1,...,n, dij = d(Xi, Xj),
e.g., di,j = ‖Xi −Xj‖2.
◮ Clustering algorithm maps each observation Xi to one of the K
groups,
C : {X1, . . . , Xn} → {1, . . . , K}.
298
Parametric approach
◮ Formulation: Let X_1, . . . , X_n be independent vectors from K
populations G_1, . . . , G_K, with
X_i ∼ f(x, θ_k) when X_i is sampled from G_k.
◮ Observation labels: For i = 1, . . . , n let γi = k if Xi is sampled from
Gk. The vector γ = (γ1, . . . , γn) is unknown.
◮ Clusters
C_k = ∪_{i: γ_i = k} {X_i},   k = 1, . . . , K.
The goal is to find Ck, k = 1, . . . , K.
299
1. Maximum likelihood clustering
◮ Likelihood function to be maximized w.r.t. γ and θ = (θ1, . . . , θK)
L(γ; θ) = ∏_{i: γ_i = 1} f(X_i, θ_1) ∏_{i: γ_i = 2} f(X_i, θ_2) · · · ∏_{i: γ_i = K} f(X_i, θ_K).
◮ Specific case: f(x, θk) = Np(µk,Σk), k = 1, . . . , K
ln L(γ; θ) = const − (1/2) ∑_{k=1}^K ∑_{i: X_i∈C_k} (X_i − µ_k)^T Σ_k^{−1} (X_i − µ_k) − (1/2) ∑_{k=1}^K n_k ln |Σ_k|,
where n_k = ∑_{i=1}^n I{X_i ∈ C_k}, k = 1, . . . , K.
◮ When γ (the partition) is fixed and ln L(γ; θ) is maximized w.r.t. θ,
µ_k(γ) = (1/n_k) ∑_{X_i∈C_k} X_i,   Σ_k(γ) = (1/n_k) ∑_{X_i∈C_k} [X_i − µ_k(γ)][X_i − µ_k(γ)]^T
300
2. Maximum likelihood clustering
◮ Substituting µ_k(γ) and Σ_k(γ) we obtain
ln L(γ; θ) = constant − (1/2) ∑_{k=1}^K n_k ln |Σ_k(γ)|.
Thus the optimization problem is to minimize
∏_{k=1}^K |Σ_k(γ)|^{n_k}
over all partitions of the set of observations into K groups.
◮ Computationally infeasible problem
301
How to choose K?
In many cases we don’t know how big K should be.
◮ Use some distance measure (”within–groups variance” and
”between–groups variance”)
◮ Use a visual plot if possible (can plot points and distances)
◮ If we can define a cluster ”purity measure” (as in CART), then one
can optimize over this and penalize for complexity.
302
K–means clustering algorithm
◮ The basic idea
1. Given a preliminary partition {C_1^old, . . . , C_K^old} of the observations,
compute the group means
m_k = (1/#(C_k)) ∑_{i: X_i∈C_k} X_i,   k = 1, . . . , K.
2. Associate each observation with the cluster whose mean is closest,
C_k^new = { X_i : ‖X_i − m_k‖ ≤ ‖X_i − m_j‖ for all j = 1, . . . , K, j ≠ k }.
3. Go to 1.
◮ Solves the problem: ∑_{i=1}^n min_{k=1,...,K} ‖X_i − m_k‖² → min_{m_1,...,m_K}.
303
K–means: a simple numerical example
◮ Data: {2, 4, 10, 12, 3, 20, 30, 11, 25}, K = 2.
1. Let m1 = 2, m2 = 4 ⇒ C1 = {2, 3}, C2 = {4, 10, 12, 20, 30, 11, 25}
2. New centers m1 = 2.5, m2 = 16 ⇒
C1 = {2, 3, 4}, C2 = {10, 12, 20, 30, 11, 25}.
m1      m2      C1                         C2
3       18      {2, 3, 4, 10}              {12, 20, 30, 11, 25}
4.75    19.6    {2, 3, 4, 10, 11, 12}      {20, 30, 25}
7       25      {2, 3, 4, 10, 11, 12}      {20, 30, 25}
and the algorithm stops.
304
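The same toy example can be checked with R's kmeans(); this is a small sketch assuming Lloyd's algorithm and the starting centers 2 and 4 used above. It should end in the same two groups found by hand.

# Hedged sketch: the toy K-means example, verified with kmeans() (Lloyd's algorithm).
x  <- c(2, 4, 10, 12, 3, 20, 30, 11, 25)
km <- kmeans(matrix(x, ncol = 1), centers = matrix(c(2, 4), ncol = 1),
             algorithm = "Lloyd")
split(x, km$cluster)    # the two clusters: {2,3,4,10,11,12} and {20,25,30}
km$centers              # final centers (should be 7 and 25)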
1. K-means: simulated example
# Data generation (mvrnorm is in library(MASS))
> m1=c(0,1)
> m2=c(4.5,0)
> S1<-cbind(c(2, 0.5), c(0.5, 3))
> S2<-cbind(c(2, -1.5), c(-1.5, 3))
> X1<-mvrnorm(100, m1, S1)
> X2<-mvrnorm(100, m2, S2)
> Y=rbind(X1, X2)
> plot(Y)
> points(X1, col="red")
> points(X2, col="blue")
#
# K-means algorithm
#
>Y.km<-kmeans(Y, 2, iter.max=20)
>Y.km
K-means clustering with 2 clusters of sizes 97, 103
Cluster means:
[,1] [,2]
1 4.55303155 -0.0987873
2 0.08029854 1.1484915
Clustering vector:
[1] 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2
[38] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[75] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1
[112] 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[149] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1
[186] 1 1 1 1 1 1 2 1 1 1 2 1 1 1 1
Within cluster sum of squares by cluster:
[1] 389.4226 426.2129
Available components:
[1] "cluster" "centers" "withinss" "size"
306
2. K–means: simulated example
[Scatterplot of the simulated data, Y[,1] against Y[,2], showing the two clusters]
307
1. K-means: old swiss banknote data
308
2. K-means: old swiss banknote data
◮ Data: 200 observations on the variables
– X1 = length of the bill
– X2 = height of the bill (right)
– X3 = height of the bill (left)
– X4 = distance of the inner frame to the lower border
– X5 = distance of the inner frame to the upper border
– X6 = length of the diagonal of the central picture
◮ The first 100 bills are genuine, the remaining 100 are counterfeit.
309
3. K-means: old swiss banknote data
> bank2<- read.table(file="bank2.dat")
> bank2.km<-kmeans(bank2, 2, 20)
> bank2.km
K-means clustering with 2 clusters of sizes 100, 100
Cluster means:
V1 V2 V3 V4 V5 V6
1 214.823 130.300 130.193 10.530 11.133 139.450
2 214.969 129.943 129.720 8.305 10.168 141.517
Clustering vector:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
Within cluster sum of squares by cluster:
[1] 225.2233 142.8852
Available components:
[1] "cluster" "centers" "withinss" "size"
311
K–medoids clustering algorithm
◮ Extends K–means to non–Euclidean dissimilarity measures d(·, ·).
◮ Algorithm
1. Given an initial partition to clusters C = {C1, . . . , CK} find an
observation in each cluster that minimizes the sum of distances to
other observations
i*_k = arg min_{i: X_i∈C_k} ∑_{j: X_j∈C_k} d(X_i, X_j)   ⇒   m_k = X_{i*_k},  k = 1, . . . , K.
2. Associate each observation to a cluster according to its distance to the
found centroids (medoids) {m_1, . . . , m_K}. Proceed until convergence.
◮ Objective function
min_{C, {i_k}_{k=1}^K} ∑_{k=1}^K ∑_{j: X_j∈C_k} d(X_{i_k}, X_j).
312
K–medoids example: countries dissimilarities
◮ Data set: the average dissimilarity scores matrix between 12 countries
(Belgium, Brazil, Chile, Cuba, Egypt, France, India, Israel, USA,
USSR, Yugoslavia and Zaire).
> library(cluster)
> x <-read.table("countries.data")
> val<-c("BEL","BRA", "CHI","CUB", "EGY", "FRA", "IND", "ISR", "USA", "USS",
+ "YUG", "ZAI")
> rownames(x) <- val
> colnames(x) <- val
> x
313
BEL BRA CHI CUB EGY FRA IND ISR USA USS YUG ZAI
BEL 0.00 5.58 7.00 7.08 4.83 2.17 6.42 3.42 2.50 6.08 5.25 4.75
BRA 5.58 0.00 6.50 7.00 5.08 5.75 5.00 5.50 4.92 6.67 6.83 3.00
CHI 7.00 6.50 0.00 3.83 8.17 6.67 5.58 6.42 6.25 4.25 4.50 6.08
CUB 7.08 7.00 3.83 0.00 5.83 6.92 6.00 6.42 7.33 2.67 3.75 6.67
EGY 4.83 5.08 8.17 5.83 0.00 4.92 4.67 5.00 4.50 6.00 5.75 5.00
FRA 2.17 5.75 6.67 6.92 4.92 0.00 6.42 3.92 2.25 6.17 5.42 5.58
IND 6.42 5.00 5.58 6.00 4.67 6.42 0.00 6.17 6.33 6.17 6.08 4.83
ISR 3.42 5.50 6.42 6.42 5.00 3.92 6.17 0.00 2.75 6.92 5.83 6.17
USA 2.50 4.92 6.25 7.33 4.50 2.25 6.33 2.75 0.00 6.17 6.67 5.67
USS 6.08 6.67 4.25 2.67 6.00 6.17 6.17 6.92 6.17 0.00 3.67 6.50
YUG 5.25 6.83 4.50 3.75 5.75 5.42 6.08 5.83 6.67 3.67 0.00 6.92
ZAI 4.75 3.00 6.08 6.67 5.00 5.58 4.83 6.17 5.67 6.50 6.92 0.00
> x.pam2<-pam(x,2, diss=T)
> summary(x.pam2)
Medoids:
ID
[1,] "9" "USA"
[2,] "4" "CUB"
Clustering vector:
BEL BRA CHI CUB EGY FRA IND ISR USA USS YUG ZAI
1 1 2 2 1 1 2 1 1 2 2 1
Objective function:
build swap
3.291667 3.236667
Numerical information per cluster:
size max_diss av_diss diameter separation
[1,] 7 5.67 3.227143 6.17 4.67
[2,] 5 6.00 3.250000 6.17 4.67
# diameter - maximal dissimilarity between observations in the cluster
# separation - minimal dissimilarity between an observation of the cluster
# and an observation of another cluster.
Isolated clusters:
L-clusters: character(0) #
L*-clusters: character(0) # diameter < separation
# L-cluster: for each observation i the maximal dissimilarity between
# i and any other observation of the cluster is smaller than the minimal
# dissimilarity between i and any observation of another cluster.
315
Silhouette plot information:
cluster neighbor sil_width
USA 1 2 0.42519084
BEL 1 2 0.39129752
FRA 1 2 0.35152954
ISR 1 2 0.29785894
BRA 1 2 0.22317708
EGY 1 2 0.19652641
ZAI 1 2 0.18897849
CUB 2 1 0.39814815
USS 2 1 0.34104696
CHI 2 1 0.32512211
YUG 2 1 0.26177642
IND 2 1 -0.04466159
Average silhouette width per cluster:
[1] 0.2963655 0.2562864
Average silhouette width of total data set:
[1] 0.2796659
Available components:
[1] "medoids" "id.med" "clustering" "objective" "isolation"
[6] "clusinfo" "silinfo" "diss" "call"
316
1. Silhouette plot
> plot(x.pam2)
[Silhouette plot of pam(x = x, k = 2, diss = T): n = 12, average silhouette width 0.28;
cluster 1 (n = 7) average width 0.30, cluster 2 (n = 5) average width 0.26]
317
2. Silhouette plot
◮ For each observation Xi from cluster Ck define
* in-cluster average dissimilarity
a(i) = ∑_{j≠i: X_j∈C_k} d(X_i, X_j) / #{j ≠ i : X_j ∈ C_k},   X_i ∈ C_k.
* between-clusters average dissimilarity
d(i, C_m) = ∑_{j: X_j∈C_m} d(X_i, X_j) / #{j : X_j ∈ C_m},   m ≠ k, m ∈ {1, . . . , K},
b(i) = min_{m ≠ k, m=1,...,K} d(i, C_m).
◮ Silhouette width:
s(i) = (b(i) − a(i)) / max{a(i), b(i)},   −1 ≤ s(i) ≤ 1.
s(i) = 0 if X_i is the only observation in its cluster. The silhouette plot
shows s(i) in decreasing order for each cluster.
318
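For illustration, here is a hedged sketch that recomputes the silhouette widths directly from the definitions above; it assumes the countries dissimilarity matrix x and the 2-medoid fit x.pam2 from the previous slides are in the workspace.

# Hedged sketch: silhouette widths from the definitions, assuming x and x.pam2 exist.
d  <- as.matrix(x)                       # dissimilarities d(Xi, Xj)
cl <- x.pam2$clustering                  # cluster labels
sil <- sapply(seq_len(nrow(d)), function(i) {
  a <- mean(d[i, cl == cl[i] & seq_len(nrow(d)) != i])        # in-cluster average a(i)
  b <- min(tapply(d[i, cl != cl[i]], cl[cl != cl[i]], mean))  # nearest other cluster b(i)
  (b - a) / max(a, b)
})
# Should agree with x.pam2$silinfo$widths[, "sil_width"] up to the ordering of rows.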
3. Silhouette plot
◮ Interpretation of silhouette width
– large s(i) (almost 1) – observation is very well clustered;
– small s(i) (around 0) – observation lies between two clusters;
– negative s(i) – observation is badly clustered.
◮ Silhouette coefficient SC = (1/n) ∑_{i=1}^n s(i): choose K with maximal SC.
– SC ≥ 0.7: strong cluster structure
– 0.5 ≤ SC ≤ 0.7: reasonable structure
– 0.25 ≤ SC ≤ 0.5: weak structure
– SC ≤ 0.25: no structure
319
Countries dissimilarities: 3-medoids
> x.pam3<-pam(x,3, diss=T)
> summary(x.pam3)
Medoids:
ID
[1,] "9" "USA"
[2,] "12" "ZAI"
[3,] "4" "CUB"
Clustering vector:
BEL BRA CHI CUB EGY FRA IND ISR USA USS YUG ZAI
1 2 3 3 1 1 2 1 1 3 3 2
Objective function:
build swap
2.583333 2.506667
Numerical information per cluster:
size max_diss av_diss diameter separation
[1,] 5 4.50 2.4000 5.0 4.67
[2,] 3 4.83 2.6100 5.0 4.67
[3,] 4 3.83 2.5625 4.5 5.25
Isolated clusters:
L-clusters: character(0)
L*-clusters: [1] 3
Silhouette plot information:
cluster neighbor sil_width
USA 1 2 0.46808511
FRA 1 2 0.43971831
BEL 1 2 0.42149254
ISR 1 2 0.36561099
EGY 1 2 0.02118644
ZAI 2 1 0.27953625
BRA 2 1 0.25456578
IND 2 3 0.17498951
CUB 3 2 0.47890188
USS 3 1 0.43682195
YUG 3 1 0.31304749
CHI 3 2 0.30726872
Average silhouette width per cluster:
[1] 0.3432187 0.2363638 0.3840100
Average silhouette width of total data set:
[1] 0.3301021
Available components:
[1] "medoids" "id.med" "clustering" "objective" "isolation"
[6] "clusinfo" "silinfo" "diss" "call"
>
> plot(x.pam3)
322
[Silhouette plot of pam(x = x, k = 3, diss = T): n = 12, average silhouette width 0.33;
cluster 1 (n = 5) 0.34, cluster 2 (n = 3) 0.24, cluster 3 (n = 4) 0.38]
323
18. Clustering: hierarchical methods
324
Hierarchical methods
◮ Two main types of hierarchical clustering
– Agglomerative:
* Start with the points as individual clusters
* At each step, merge the closest pair of clusters until only one
cluster (or K clusters) left
– Divisive:
* Start with one, all–inclusive cluster
* At each step, split a cluster until each cluster contains a point
(or there are K clusters).
◮ Algorithms use the dissimilarity/distance matrix and merge or split
one cluster at a time.
325
Dissimilarity measures between clusters
Dissimilarity between clusters R and Q
◮ Average dissimilarity
D(R, Q) = (1/(N_R N_Q)) ∑_{X_i∈R, X_j∈Q} d(X_i, X_j)
Best when clusters are ball–shaped, fairly well–separated.
◮ Nearest neighbor (single linkage)
D(R, Q) = min_{X_i∈R, X_j∈Q} d(X_i, X_j)
Can lead to ”chaining” effect, elongated clusters
◮ Furthest neighbor (complete linkage)
D(R, Q) = max_{X_i∈R, X_j∈Q} d(X_i, X_j)
Tends to produce small compact clusters
326
Agglomerative algorithm
Assume that we have a measure of dissimilarity between clusters.
1. Construct the finest partition (one observation in each cluster)
repeat:
2. Compute the dissimilarity matrix between clusters.
3. Find the two clusters with the smallest dissimilarity, and merge them
into one cluster.
4. Compute the dissimilarity between the new groups
until: only one cluster remains.
327
1. Example: clustering using single linkage
◮ Initial dissimilarity matrix (5 objects)
1 2 3 4 5
1 0
2 9 0
3 3 7 0
4 6 5 9 0
5 11 10 2 8 0
⇒ because mini,j di,j = d53 = 2,
merge 3 and 5 to cluster (35)
◮ Nearest neighbor distances:
d(35)1 = min{d31, d51} = min{3, 11} = 3
d(35)2 = min{d32, d52} = min{7, 10} = 7
d(35)4 = min{d34, d54} = min{9, 8} = 8.
328
2. Example: clustering using single linkage
◮ Dissimilarity matrix after merger of 3 and 5
(35) 1 2 4
(35) 0
1 3 0
2 7 9 0
4 8 6 5 0
⇒ d(35)1 = 3; merge (35) and 1 to (135)
◮ Distances
d(135)2 = min{d(35)2, d12} = min{7, 9} = 7
d(135)4 = min{d(35)4, d14} = min{8, 6} = 6.
329
3. Example: clustering using single linkage
◮ Dissimilarity matrix after merger of (35) and 1
(135) 2 4
(135) 0
2 7 0
4 6 5 0
⇒ d42 = 5; merge 4 and 2 to (42)
d(135)(24) = min{d(135)2, d(135)4} = min{7, 6} = 6.
◮ The final dissimilarity matrix
(135) (24)
(135) 0
(24) 6 0
⇒ (135) and (24) are merged on the level 6.
330
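The same single-linkage example can be reproduced in R with hclust(); a small sketch follows, where the matrix D below is the 5-object dissimilarity matrix from the example above.

# Hedged sketch: the single-linkage example, verified with hclust().
D <- matrix(c(0, 9, 3, 6, 11,
              9, 0, 7, 5, 10,
              3, 7, 0, 9,  2,
              6, 5, 9, 0,  8,
             11,10, 2, 8,  0), nrow = 5, byrow = TRUE)
hc <- hclust(as.dist(D), method = "single")
hc$height    # merge heights: 2 3 5 6, as computed by hand
plot(hc)     # dendrogram as on the next slide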
4. Example: resulting dendrogram
[Dendrogram: objects 3 and 5 merge at height 2, object 1 joins them at 3,
objects 2 and 4 merge at 5, and the clusters (135) and (24) merge at 6]
331
Agglomerative coefficient
◮ Measures of how much ”clustering structure” exists in the data
◮ Ck(j) is the first cluster Xj is merged with, j = 1, . . . , n; R, Q are two
last clusters that are merged at the final step of the algorithm. For
each observation Xj define
α(j) = D(X_j, C_{k(j)}) / D(R, Q),   j = 1, . . . , n.
◮ Agglomerative coefficient: AC = (1/n) ∑_{j=1}^n [1 − α(j)]
– large AC (close to 1): observations are merged with clusters close
to them in the beginning as compared to the final merger;
– small AC: evenly distributed data, poor clustering structure.
332
1. Example: Swiss Provinces Data (1888)
◮ Data: 47 observations on 5 measures of socio–economic indicators on
Swiss provinces:
– Agriculture: % of males involved in agriculture as occupation
– Examination: % ”draftees” receiving highest mark on army
examination
– Education: % education beyond primary school for ”draftees”
– Catholic: % catholic (as opposed to "protestant")
– Infant.Mortality: % live births who live less than 1 year
333
2. Example: Swiss Provinces Data (1888)
> library(cluster)
> swiss.x<-swiss[,-1]
> sagg<-agnes(swiss.x)
> pltree(sagg)
> print(sagg$ac)
[1] 0.8774795
◮ By default agnes uses average dissimilarity (complete and single
linkage can be chosen).
◮ There are two main groups with one point V. De Geneve
well–separated from them.
◮ Fairly large AC = 0.878 suggests good clustering structure in the
data.
334
[Dendrogram of agnes(x = swiss.x), agnes (*, "average"): the 47 provinces on the
horizontal axis, height (0–80) on the vertical axis]
335
Divisive clustering algorithm
◮ Splitting cluster C to A and B
1. Initialization: A = C, B = ∅
2. For each observation Xj compute
� average dissimilarity a(j) from all other observations in A
� average dissimilarity d(j, B) to all observations in B; d(j, ∅) = 0.
3. Select observation Xk such that S(k) = a(k)− d(k,B) is maximal.
4. If S(k) ≥ 0, add Xk to cluster B: B = B ∪ {Xk}, A = A\{Xk},and go to 2. If S(k) < 0, or A contains only one observation, stop.
◮ Apply the same procedure to clusters A and B (at each step split
cluster with maximal diameter).
336
1. Example: divisive clustering algorithm
◮ Dissimilarity matrix (5 objects)
1 2 3 4 5
1 0
2 9 0
3 3 7 0
4 6 5 9 0
5 11 10 2 8 0
◮ Initial clusters:
A = {1, 2, 3, 4, 5}, B = ∅, diam(A) = d51 = 11.
337
2. Example: divisive clustering algorithm
◮ Step 1: average dissimilarities of observations
a(1) = (9 + 3 + 6 + 11)/4 = 7.25,  a(2) = (9 + 7 + 5 + 10)/4 = 7.75,
a(3) = (3 + 7 + 9 + 2)/4 = 5.25,  a(4) = (6 + 5 + 9 + 8)/4 = 7,
a(5) = (11 + 10 + 2 + 8)/4 = 7.75
⇒ A = {1, 2, 3, 4}, B = {5}, diam(A) = 9.
◮ Step 2: Computing average dissimilarities and S(·)
a(1) = (9 + 3 + 6)/3 = 6,      S(1) = a(1) − d_15 = 6 − 11 = −5
a(2) = (9 + 7 + 5)/3 = 7,      S(2) = a(2) − d_25 = 7 − 10 = −3
a(3) = (3 + 7 + 9)/3 = 6.333,  S(3) = a(3) − d_35 = 6.333 − 2 = 4.333
a(4) = (6 + 5 + 9)/3 = 6.667,  S(4) = a(4) − d_45 = 6.667 − 8 = −1.333
⇒ A = {1, 2, 4}, B = {5, 3}, diam(A) = 9, diam(B) = 2.
338
3. Example: divisive clustering algorithm
◮ Step 3:
a(1) = (9 + 6)/2 = 7.5,  d(1, B) = (3 + 11)/2 = 7,    S(1) = 0.5
a(2) = (9 + 5)/2 = 7,    d(2, B) = (7 + 10)/2 = 8.5,  S(2) = −1.5
a(4) = (6 + 5)/2 = 5.5,  d(4, B) = (9 + 8)/2 = 8.5,   S(4) = −3
⇒ A = {2, 4}, B = {1, 3, 5}, diam(A) = 5, diam(B) = 11.
◮ Step 4:
a(2) = 5,  d(2, B) = (9 + 7 + 10)/3 = 8.667,  S(2) = −3.667
a(4) = 5,  d(4, B) = (6 + 9 + 8)/3 = 7.667,   S(4) = −2.667.
Since S(2) and S(4) are negative the procedure stops after step 4 with
two clusters A = {2, 4} and B = {1, 3, 5}. Next we start to divide the
cluster B (it has larger diameter).
339
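As a check, cluster::diana() implements this splitting procedure; the sketch below (using the same 5-object dissimilarity matrix as in the single-linkage example) should reproduce the first split A = {2, 4}, B = {1, 3, 5} obtained by hand.

# Hedged sketch: the divisive example, verified with cluster::diana().
library(cluster)
D <- matrix(c(0, 9, 3, 6, 11,
              9, 0, 7, 5, 10,
              3, 7, 0, 9,  2,
              6, 5, 9, 0,  8,
             11,10, 2, 8,  0), nrow = 5, byrow = TRUE)
dv <- diana(as.dist(D), diss = TRUE)
cutree(as.hclust(dv), k = 2)   # two-cluster labels; should separate {2,4} from {1,3,5}
dv$dc                          # divisive coefficient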
Divisive coefficient
◮ Cluster diameter
diam(C) = max_{X_i∈C, X_j∈C} d(X_i, X_j)
◮ For observation Xj , let δ(j) be the diameter of the last cluster to
which Xj belongs (before it is split off as a single observation),
divided by the diameter of the whole dataset.
◮ Divisive coefficient: DC = (1/n) ∑_{j=1}^n [1 − δ(j)]
– large DC (close to 1): on average observations are in small
compact clusters (relative to the size of the whole dataset) before
being split off; evidence of a good clustering structure.
340
Example: Swiss Provinces Data (1888)
> library(cluster)
> swiss.x<-swiss[,-1]
> sdiv<-diana(swiss.x)
> pltree(sdiv)
> print(sdiv$dc)
[1] 0.903375
◮ diana uses average dissimilarity and the Euclidean distance.
◮ There are three well–separated clusters.
◮ Fairly large DC = 0.903 suggests good clustering structure in the
data.
341
[Dendrogram of diana(x = swiss.x): the 47 provinces on the horizontal axis,
height (0–120) on the vertical axis]
342
Hierarchical procedures – comments
◮ The procedures may be sensitive to outliers, ”noise points”
◮ Try different methods, and within a given method, different ways of
assigning distances (complete, average, single linkage). Roughly
consistent outcomes indicate validity of the clustering structure.
◮ Stability of the hierarchical solution can be checked by applying the
algorithm to slightly perturbed data set.
343
Pottery data
◮ Data: chemical composition of 45 specimens of Romano–British
pottery for 9 oxides. A kiln site at which the pottery is found is also
known (5 different sites).
◮ Question: whether the pots can be divided into distinct groups and
how these groups relate to the kiln site?
> pottery
Al2O3 Fe2O3 MgO CaO Na2O K2O TiO2 MnO BaO
1 18.8 9.52 2.00 0.79 0.40 3.20 1.01 0.077 0.015
2 16.9 7.33 1.65 0.84 0.40 3.05 0.99 0.067 0.018
...................................................
...................................................
44 14.8 2.74 0.67 0.03 0.05 2.15 1.34 0.003 0.015
45 19.1 1.64 0.60 0.10 0.03 1.75 1.04 0.007 0.018
344
K-means algorithm
>wss<-rep(0,10)
>wss[1]<-44*sum(apply(pottery,2,var))
>for (i in 2:10) wss[i]<-mean(kmeans(pottery,i)$withinss)
>plot(2:10,wss[2:10],type="b",xlab="Number of groups",ylab="Mean-within-group SS")
>kmeans(pottery,5)$clust
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
5 5 5 5 5 5 5 5 3 3 3 3 3 5 5 3 5 5 5 5 5 1 1 1 2 1
27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45
2 2 2 2 1 1 1 2 2 4 4 4 4 4 4 4 4 4 4
> kmeans(pottery,4)$clust
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
4 4 4 4 4 4 4 4 2 2 2 2 2 4 4 2 4 4 4 4 4 2 2 2 1 1
27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45
1 1 1 1 1 1 1 1 1 3 3 3 3 3 3 3 3 3 3
345
[Plot of the mean within-group SS against the number of groups (2 to 10)]
346
K-medoids
> sil.w<-rep(0,10)
> for (i in 2:10) sil.w[i]<-pam(pottery,i,diss=F)$silinfo$avg.width
> sil.w[2:10]
[1] 0.5018253 0.6031487 0.5038343 0.4991434 0.5061460 0.4781251 0.4644091
[8] 0.4537507 0.4391603
> p3.med<-pam(pottery,3,diss=F)
> p3.med
Medoids:
ID Al2O3 Fe2O3 MgO CaO Na2O K2O TiO2 MnO BaO
2 2 16.9 7.33 1.65 0.84 0.40 3.05 0.99 0.067 0.018
32 32 12.4 6.13 5.69 0.22 0.54 4.65 0.70 0.159 0.015
38 38 18.0 1.50 0.67 0.01 0.06 2.11 0.92 0.001 0.016
Clustering vector:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2
27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45
2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3
> plot(p3.med)
347
[Left: clusplot of pam(x = pottery, k = 3, diss = F); the first two components explain
74.75% of the point variability. Right: silhouette plot, n = 45, average silhouette
width 0.6; cluster 1 (n = 21) 0.62, cluster 2 (n = 14) 0.53, cluster 3 (n = 10) 0.67]
348
Agglomerative algorithm
> pot.ag<-agnes(pottery,diss=F)
> pot.ag
Call: agnes(x = pottery, diss = F)
Agglomerative coefficient: 0.8878186
Order of objects:
[1] 1 2 4 14 15 18 7 9 16 3 20 5 21 8 6 19 17 10 12 13 11 22 24 23 26
[26] 32 33 31 25 29 27 30 34 35 28 36 42 41 38 39 45 43 40 37 44
Height (summary):
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.1273 0.5273 1.0020 1.4720 1.8430 7.4430
Available components:
[1] "order" "height" "ac" "merge" "diss" "call"
[7] "method" "order.lab" "data"
> pltree(pot.ag)
349
[Dendrogram of agnes(x = pottery, diss = F, method = "average"): specimen numbers on
the horizontal axis, height on the vertical axis]
350
Divisive algorithm
> pot.dv<-diana(pottery,diss=F)
> pot.dv
......................................................................
Height:
[1] 3.5098771 0.1273460 0.8565378 0.3898923 0.5561591 1.3314567
[7] 2.6377172 0.4109903 0.6023363 0.2934962 0.4666101 1.2262288
[13] 0.2095328 0.5655882 6.3982699 0.7580923 1.8114795 0.7463518
[19] 0.3744663 3.1056614 10.4430689 0.5786156 1.1191769 3.8495285
[25] 1.0225287 1.9532691 5.4350004 0.6852452 1.3666854 2.9777214
[31] 1.8154022 1.1270909 0.3005478 3.5697204 11.7033446 0.3178836
[37] 0.6550580 0.8893852 0.4395009 1.5469845 2.5117392 4.2065322
[43] 6.1297881 1.0822223
Divisive coefficient:
[1] 0.9161985
> pltree(pot.dv)
351
[Dendrogram of diana(x = pottery, diss = F): specimen numbers on the horizontal axis,
height (0–12) on the vertical axis]
352
19. Multidimensional scaling
353
MDS: problem and examples
◮ Problem: based on the dissimilarity (proximity) matrix between
objects, find a spatial representation of these objects in a
low-dimensional space.
◮ MDS deals with "fitting" the data into a low-dimensional space with
minimal distortion of the distances between the original points.
◮ Examples:
– Given a matrix of inter–city distances in USA, produce the map.
– Find a geometric representation for dissimilarities between cars.
– Given data on enrollment and graduation in 25 US universities,
produce a two–dimensional representation of the universities.
354
Metric multidimensional scaling
◮ Problem formulation: let X be an n × p data matrix, with X_i = (x_{i1}, . . . , x_{ip})
its ith row (observation). We are given the matrix of Euclidean
distances between observations,
D = {d_{ij}}_{i,j=1,...,n},   d_{ij}² = ∑_{k=1}^p (x_{ik} − x_{jk})² = ‖X_i − X_j‖².
Based on D, we want to reconstruct the data matrix X .
◮ No unique solution exists: the distances do not change if the data
points are rotated or reflected. Usually the restriction that the mean
vector of the configuration is zero is added.
355
1. How to reconstruct X from D?
◮ Inner product matrix: define the n × n matrix B = XX^T,
b_{ij} = ∑_{k=1}^p x_{ik} x_{jk},   i, j = 1, . . . , n.
Matrix B is related to matrix D:
d_{ij}² = ∑_{k=1}^p (x_{ik} − x_{jk})² = ∑_{k=1}^p x_{ik}² + ∑_{k=1}^p x_{jk}² − 2 ∑_{k=1}^p x_{ik} x_{jk}
        = b_{ii} + b_{jj} − 2 b_{ij}.   (1)
◮ The idea is first to reconstruct the matrix B and then to factorize it.
◮ Centering: assume that all variables are centered, i.e.
X̄_{·k} = (1/n) ∑_{i=1}^n x_{ik} = 0,   ∀k = 1, . . . , p.
This implies that ∑_{i=1}^n b_{ij} = 0, ∀j.
356
2. How to reconstruct X from D?
◮ Summing (1) over i, over j, and over both i and j we have
∑_{i=1}^n d_{ij}² = tr(B) + n b_{jj},  ∀j,
∑_{j=1}^n d_{ij}² = tr(B) + n b_{ii},  ∀i,
∑_{i=1}^n ∑_{j=1}^n d_{ij}² = 2n tr(B).
Solving this system for b_{ii}, b_{jj} and substituting into (1) we get
b_{ij} = −(1/2) ( d_{ij}² − (1/n) ∑_{j=1}^n d_{ij}² − (1/n) ∑_{i=1}^n d_{ij}² + (1/n²) ∑_{i=1}^n ∑_{j=1}^n d_{ij}² ).   (2)
357
357
3. How to reconstruct X from D?
◮ The last step is the factorization of the matrix B: the spectral (singular value)
decomposition of B is
B = V Λ V^T,   Λ = diag{λ_1, . . . , λ_n},   V orthogonal, V V^T = I;
here for definiteness λ_1 ≥ λ_2 ≥ · · · ≥ λ_n. Then
X = V Λ^{1/2},   Λ^{1/2} = diag(√λ_1, . . . , √λ_n).
◮ If the n × p matrix X is of full rank, i.e., rank(X) = p, then n − p
eigenvalues of B will be zero, i.e., Λ = diag{λ_1, . . . , λ_p, 0, . . . , 0}.
◮ The best q-dimensional representation, q ≤ p, retains the first q
eigenvalues. The adequacy is judged by S_q = ∑_{i=1}^q λ_i / ∑_{i=1}^p λ_i.
358
MDS algorithm summary
1. For given matrix of distances D compute matrix B using (2).
2. Perform the singular value decomposition of B, B = V Λ V^T; for
definiteness, let λ_1 ≥ λ_2 ≥ · · · ≥ λ_n.
3. Retain the q largest eigenvalues, q ≤ p, and set Λ_1 = diag{λ_1, . . . , λ_q, 0, . . . , 0}.
4. The new q-dimensional data matrix representation is Y = V Λ_1^{1/2}.
The rows of matrix Y are called the principal coordinates of X in
q–dimensions.
5. Judge the adequacy of the representation using index Sq.
A reasonable fit corresponds to Sq ∼ 0.8.
359
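A minimal R sketch of steps 1–4, assuming Euclidean distances on simulated data; it builds B by double centering as in (2) and compares the resulting configuration with cmdscale().

# Hedged sketch: classical (metric) MDS from a Euclidean distance matrix.
set.seed(1)
X  <- matrix(rnorm(20 * 3), 20, 3)         # toy data, n = 20, p = 3
D  <- as.matrix(dist(X))                   # Euclidean distances
n  <- nrow(D)
J  <- diag(n) - matrix(1, n, n) / n        # centering matrix
B  <- -0.5 * J %*% (D^2) %*% J             # matrix form of equation (2)
e  <- eigen(B)
Y  <- e$vectors[, 1:2] %*% diag(sqrt(e$values[1:2]))   # principal coordinates, q = 2
# The interpoint distances of Y should agree with those of cmdscale(D, k = 2):
max(abs(dist(Y) - dist(cmdscale(D, k = 2))))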
MDS algorithm – comments
◮ Although D is the matrix of Euclidean distances, the algorithm can
be applied for other distances. In this case B in (2) is symmetric but
not necessarily non-negative definite. Then assess the adequacy of the
solution using
∑_{i=1}^q |λ_i| / ∑_{i=1}^p |λ_i|,   or   ∑_{i=1}^q λ_i² / ∑_{i=1}^p λ_i².
◮ Other criteria:
– trace: choose the number of coordinates so that the sum of
positive eigenvalues is approximately the sum of all eigenvalues
– magnitude: accept only eigenvalues which substantially exceed the
largest negative eigenvalue.
360
Duality of principal coordinates and PCA
◮ PCA is based on the singular value (spectral) decomposition of the sample
covariance matrix
Σ = (1/(n−1)) ∑_{i=1}^n (X_i − µ)(X_i − µ)^T,   µ = (1/n) ∑_{i=1}^n X_i.
◮ If the data matrix X is centered, i.e., µ = 0, then (n − 1)Σ = X^T X, which
has the same non-zero eigenvalues as B = XX^T. Moreover, if D is the matrix of
Euclidean distances, then B given by (2) coincides with XX^T. Thus, in this
specific case PCA and principal coordinates are equivalent.
361
Some remarks
◮ Distance and dissimilarity: MDS is applied to a matrix D = {d_{ij}} of distances.
A similarity matrix is a symmetric matrix C = {c_{ij}} satisfying c_{ij} ≤ c_{ii}, ∀i, j.
One can transform a similarity matrix into a distance matrix by setting
d_{ij} = (c_{ii} − 2c_{ij} + c_{jj})^{1/2}, ∀i, j.
◮ Given a distance matrix D, the object of MDS is to find a data
matrix X̂ in R^q with interpoint distances d̂_{ij} "close" to d_{ij}, ∀i, j.
◮ Among all projections of X on q-dimensional subspaces of R^p, the
principal coordinates in R^q minimize the expression
φ = ∑_{i=1}^n ∑_{j=1}^n [d_{ij}² − d̂_{ij}²].
362
1. Example: air distance between US cities
◮ Air-distance data: (1) Atlanta, (2) Boston, (3) Cincinnati, (4) Columbus,
(5) Dallas, (6) Indianapolis, (7) Little Rock, (8) Los Angeles, (9) Memphis,
(10) St. Louis, (11) Spokane, (12) Tampa
> D
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12]
[1,] 0 1068 461 549 805 508 505 2197 366 558 2467 467
[2,] 1068 0 867 769 1819 941 1494 3052 1355 1178 2747 1379
[3,] 461 867 0 107 943 108 618 2186 502 338 2067 928
[4,] 549 769 107 0 1050 172 725 2245 586 409 2131 985
[5,] 805 1819 943 1050 0 882 325 1403 464 645 1891 1077
[6,] 508 941 108 172 882 0 562 2080 436 234 1959 975
[7,] 505 1494 618 725 325 562 0 1701 137 353 1988 912
[8,] 2197 3052 2186 2245 1403 2080 1701 0 1831 1848 1227 2480
[9,] 366 1355 502 586 464 436 137 1831 0 294 2042 779
[10,] 558 1178 338 409 645 234 353 1848 294 0 1820 1016
[11,] 2467 2747 2067 2131 1891 1959 1988 1227 2042 1820 0 2821
[12,] 467 1379 928 985 1077 975 912 2480 779 1016 2821 0
363
2. Example: air distance between US cities
> names<-c("Atlanta", "Boston",
+ "Cincinnati","Columbus","Dallas","Indianapolis","Little Rock","Los Angeles",
+ "Memphis","St. Louis","Spokane","Tampa")
>
> air.d<-cmdscale(D,k=2,eig=TRUE)
> air.d$eig
[1] 8234381 2450757
> x<-air.d$points[,1]
> y<-air.d$points[,2]
> plot(x,y,xlab="Coordinate 1",ylab="Coordinate 2", xlim=range(x)*1.2, type="n")
> text(x,y,labels=names)
364
[Plot of the 12 cities in the plane of the first two MDS coordinates
(Coordinate 1 vs Coordinate 2)]
365
1. Example: data on US universities
◮ Data: 6 variables on 25 US universities
� X1 = average SAT score for new freshmen;
� X2 = percentage of new freshmen in top 10% of high school class;
� X3 = percentage of applicants accepted;
� X4 = student–faculty ratio;
� X5 = estimated annual expenses;
� X6 = graduation rate (%).
366
2. Example: data on US universities
> X<-read.table("US-univ.txt")
> names<-as.character(X[,1])
> names
[1] "Harvard" "Princeton" "Yale" "Stanford"
[5] "MIT" "Duke" "Cal_Tech" "Dartmouth"
[9] "Brown" "Johns_Hopkins" "U_Chicago" "U_Penn"
[13] "Cornell" "Northwestern" "Columbia" "NotreDame"
[17] "U_Virginia" "Georgetown" "Carnegie_Mellon" "U_Michigan"
[21] "UC_Berkeley" "U_Wisconsin" "Penn_State" "Purdue"
[25] "Texas_A&M"
> X<-X[,-1]
> D<-dist(X,method="euclidean") # matrix of distances
> univ<-cmdscale(D,k=6,eig=TRUE)
> univ$eig
[1] 2.080000e+04 3.042786e+03 1.287520e+03 5.430015e+02 1.180444e+02
[6] 9.986056e-01
367
3. Example: data on US universities
> sum(univ$eig[1:2])/sum(univ$eig[1:6])
[1] 0.924413
> x<-univ$points[,1]
> y<-univ$points[,2]
> plot(x,y,xlab="Coordinate 1", ylab="Coordinate 2", xlim=range(x)*1.2,type="n")
> text(x,y,names)
368
[Plot of the 25 universities in the plane of the first two principal coordinates
(Coordinate 1 vs Coordinate 2)]
369
Non–metric multidimensional scaling
◮ Another look at the MDS problem: we are given a dissimilarity matrix
D = {d_{ij}} for points X in R^p. We want to find points X̂ in R^q such
that the corresponding distances D̂ = {d̂_{ij}} match the matrix D as
closely as possible.
◮ In general, an exact match is not possible, so we require
monotonicity: if
d_{i_1 j_1} < d_{i_2 j_2} < · · · < d_{i_m j_m},   m = n(n − 1)/2,
then it should hold that
d̂_{i_1 j_1} < d̂_{i_2 j_2} < · · · < d̂_{i_m j_m}.
370
1. Shepard–Kruskal algorithm
(a) Given the dissimilarity matrix D, order its off-diagonal elements: for i_l < j_l,
d_{i_1 j_1} ≤ · · · ≤ d_{i_m j_m},   i_l < j_l,  l = 1, 2, . . . , m = n(n − 1)/2.
Say that numbers d*_{ij} are monotonically related to the d_{ij} (written d* ~mon~ d) if
d_{ij} < d_{kl} ⇒ d*_{ij} < d*_{kl},   ∀i < j, k < l.
(b) Let X̂ (n × q) be a configuration in R^q with interpoint distances d̂_{ij}. Define
Stress(X̂) = min_{d*: d* ~mon~ d}  ∑_j ∑_{i<j} (d*_{ij} − d̂_{ij})² / ∑_j ∑_{i<j} d̂_{ij}².
The rank order of d* coincides with the rank order of d. Stress(X̂) is zero
if the rank order of d̂ coincides with the rank order of d.
371
2. Shepard–Kruskal algorithm
(c) For each dimension q the configuration which has smallest stress is
called the best fitting configuration in q dimensions. Let
S_q = min_{X̂ (n×q)} Stress(X̂).
(d) Calculate S1, S2,... until the value becomes low. The rule of thumb:
Sq ≥ 20%–poor; Sq = 10%–fair; Sq ≤ 5%–good; Sq = 0%–perfect.
◮ Remark: the solution is obtained by a numerical procedure.
372
The Shepard–Kruskal algorithm: computation
◮ Computation
1. find a random configuration of points (e.g., sampling from a
normal distribution);
2. calculate distances between the points dij ;
3. find optimal monotone transformation d∗ij of the dissimilarities dij ;
4. minimize the stress between the optimally scaled data by finding
new configuration of points; if the stress is small enough stop, if
not go to the step 2.
373
Remarks
◮ The procedure uses steepest descent method. There is no way to
distinguish between local and global minima.
◮ The Shepard–Kruskal solution is non–metric since it uses only the
rank orders.
◮ The method is invariant under rotation, translation, and uniform
expansion or contraction of the best fitting configuration.
◮ The method works for distances and dissimilarities (similarities).
374
1. Example: New Jersey voting data
◮ Data is the matrix containing the number of times 15 New Jersey
congressmen voted differently in the House of Representatives on
19 environmental bills (abstentions are not recorded).
> voting
Hunt(R) Sandman(R) Howard(D) Thompson(D) Freylinghuysen(R)
Hunt(R) 0 8 15 15 10 ...
Sandman(R) 8 0 17 12 13 ...
Howard(D) 15 17 0 9 16 ...
Thompson(D) 15 12 9 0 14 ...
Freylinghuysen(R) 10 13 16 14 0 ...
Forsythe(R) 9 13 12 12 8 ...
Widnall(R) 7 12 15 13 9 ...
Roe(D) 15 16 5 10 13 ...
Heltoski(D) 16 17 5 8 14 ...
Rodino(D) 14 15 6 8 12 ...
Minish(D) 15 16 5 8 12 ...
Rinaldo(R) 16 17 4 6 12 ...
Maraziti(R) 7 13 11 15 10 ...
Daniels(D) 11 12 10 10 11 ...
Patten(D) 13 16 7 7 11 ...
375
2. Example: New Jersey voting data
# isoMDS() is in library(MASS)
> voting.stress<-rep(0,6)
> for (i in 1:6) voting.stress[i]<-isoMDS(voting, k=i)$stress
> voting.stress
[1] 21.1967696 9.8790470 5.4522891 3.6672495 1.4064205 0.8899916
> voting.MDS<-isoMDS(voting,k=2)
initial value 15.268246
iter 5 value 10.264075
final value 9.879047
converged
> x<-voting.MDS$points[,1]
> y<-voting.MDS$points[,2]
> plot(x,y,xlab="Coordinate 1",ylab="Coordinate 2", xlim=range(x)*1.2, type="n")
> text(x,y,colnames(voting))
376
[Non-metric MDS configuration of the 15 congressmen
(Coordinate 1 vs Coordinate 2)]
377
Example: voting data, metric MDS
> voting.metrMDS <- cmdscale(voting, k=6, eig=TRUE)
> voting.metrMDS$eig
[1] 497.76083 146.17622 102.91314 76.87756 55.11540 24.74374
> sum(voting.metrMDS$eig[1:2])/sum(voting.metrMDS$eig)
[1] 0.7126454
> x<-voting.metrMDS$points[,1]
> y<-voting.metrMDS$points[,2]
> plot(x,y,xlab="Coordinate 1",ylab="Coordinate 2", xlim=range(x)*1.2, type="n")
> text(x,y,colnames(voting))
378
[Metric MDS configuration of the 15 congressmen
(Coordinate 1 vs Coordinate 2)]
379
20. Neural networks
380
1. Neuron model
◮ Neurological origins: each elementary nerve cell (neuron) is connected
to many others, can be activated by inputs from elsewhere, and can
stimulate other neurons.
◮ Neuron model (perceptron):
y = ϕ( ∑_{j=1}^p w_j x_j + w_0 ),   v = ∑_{j=1}^p w_j x_j + w_0,
– x = (x1, . . . , xp) are inputs
– y is output
– (w1, . . . , wp) are connection weights, w0 is a bias term
– ϕ is a (non–linear) function, called the activation function
381
2. Neuron model
[Diagram: inputs x1, . . . , xp with weights w1, . . . , wp are summed (Σ) to give v,
which is passed through ϕ(·) to produce the output y]
382
Activation function
◮ Monotone increasing, bounded function
◮ Examples
* hard limiter: produces ±1 output,
ϕ_HL(v) = sgn(v).
* sigmoidal (logistic): produces output in (0, 1),
ϕ_S(v) = 1 / (1 + e^{−av}).
* hyperbolic tangent: produces output between −1 and 1,
ϕ_HT(v) = (1 − e^{−av}) / (1 + e^{−av})  (= tanh(av/2)).
* linear: ϕ_L(v) = av.
383
Single–unit perceptron
◮ Variables:
* x is a p–vector of features
* z is a prediction target (binary)
* y is the perceptron output used to predict z
y = sgn( ∑_{j=1}^p w_j x_j + w_0 ) = ϕ_HL(w^T x) = { +1 if w^T x ≥ 0;  −1 if w^T x < 0 }.
◮ Training data: D = {(x^(t), z^(t)), t = 1, . . . , n} – n "examples";
x^(t) = (x^(t)_1, . . . , x^(t)_p) is the t-th observation of the feature vector,
z^(t) is the class indicator (±1) – a binary variable.
384
Perceptron learning rule
1. Initialization: set w = w(0) – starting point
2. At every step t = 1, 2, . . .
� select ”example” x(t) from the training set
� compute the perceptron output with current weights w(t−1)
yt = ϕHL([w(t−1)]Tx(t))
� update the weights
w(t) = w(t−1) + η(z(t) − y(t))x(t).
3. Cycle through all the observations in the training set.
◮ η > 0 is the learning rate parameter
◮ z(t) − y(t) is the error on t–th example. If it is zero, the weights are
not updated. Usual stepsize η ≈ 0.1.
385
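A small R sketch of the learning rule on simulated, linearly separable data (η = 0.1; the data-generating hyperplane is an arbitrary choice made for this illustration). By the convergence theorem on the next slide, the training error should reach zero after enough passes.

# Hedged sketch: the perceptron learning rule on simulated separable data.
set.seed(1)
n <- 100
x <- matrix(rnorm(2 * n), n, 2)
z <- ifelse(x[, 1] + x[, 2] > 0, 1, -1)       # targets from a true separating hyperplane
X <- cbind(1, x)                              # prepend the bias input
w <- c(0, 0, 0); eta <- 0.1
for (epoch in 1:50) {
  for (t in 1:n) {
    y <- ifelse(sum(w * X[t, ]) >= 0, 1, -1)  # hard-limiter output
    w <- w + eta * (z[t] - y) * X[t, ]        # update only when an error is made
  }
}
mean(sign(X %*% w) == z)                      # training accuracy (should reach 1)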
Perceptron convergence theorem
◮ Theorem
For any data set that is linearly separable, the learning rule is
guaranteed to find the separating hyperplane in a finite number
of steps.
386
Criterion cost function for regression problem
◮ Data:
D ={x(t), z(t), t = 1, . . . , n
},
z(t) is a continuous/discrete variable.
◮ Squared error cost function:
E(w) = (1/2) ∑_{t=1}^n [z^(t) − ϕ(w^T x^(t))]² = (1/2) ∑_{t=1}^n e_t²(w),
e_t(w) := z^(t) − ϕ(w^T x^(t)).
◮ Assumption: activation function ϕ is differentiable.
387
Gradient descent learning rule (batch version)
◮ Initialization: starting vector of weights w(0)
◮ For t = 0, 1, 2, . . . compute w ← w − η_t ∇_w E(w), i.e.
w^(t+1) = w^(t) − η_t ∇_w E(w^(t)),   ∇_w E(w) = {∂E(w)/∂w_j}_{j=0,...,p}.
◮ Cycle through all the observations in the training set.
◮ Since
∇_w E(w) = − ∑_{k=1}^n e_k(w) ∇_w ϕ(w^T x^(k)) = − ∑_{k=1}^n ϕ'(w^T x^(k)) e_k(w) x^(k),
the update becomes
w^(t+1) = w^(t) + η_t ∑_{k=1}^n ϕ'([w^(t)]^T x^(k)) e_k(w^(t)) x^(k),
i.e.,  w ← w + η_t ∑_{k=1}^n ϕ'(w^T x^(k)) e_k(w) x^(k).
◮ E(w) is non–convex, convergence to local minima.
388
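For concreteness, a hedged sketch of the batch rule for a single logistic unit on simulated data; the 1/n scaling of the step and the {0,1} targets are choices made for this illustration, not part of the slides.

# Hedged sketch: batch gradient descent for a single sigmoid unit.
set.seed(1)
n <- 200; p <- 2
X <- cbind(1, matrix(rnorm(n * p), n, p))                       # first column = bias input
z <- as.numeric(X %*% c(-0.5, 1, 2) + rnorm(n, sd = 0.1) > 0)   # targets in {0,1}
phi  <- function(v) 1 / (1 + exp(-v))                           # logistic activation (a = 1)
dphi <- function(v) phi(v) * (1 - phi(v))                       # its derivative
w <- rep(0, p + 1); eta <- 0.1
for (t in 1:500) {
  v <- as.vector(X %*% w)
  e <- z - phi(v)                                               # errors e_k(w)
  w <- w + eta * as.vector(t(X) %*% (dphi(v) * e)) / n          # gradient step
}
w   # estimated weights; direction should roughly align with c(-0.5, 1, 2) up to scale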
Gradient learning rule (sequential version)
◮ At step t = 1, 2, . . . the t-th example x(t) is selected and
w^(t+1) = w^(t) − η_t ∇_w [ (1/2) e_t²(w) ] |_{w=w^(t)},  i.e.
w^(t+1) = w^(t) + η_t ϕ'([w^(t)]^T x^(t)) e_t(w^(t)) x^(t) = w^(t) + η_t δ_t x^(t),
δ_t := ϕ'([w^(t)]^T x^(t)) e_t(w^(t)).
– Examples are selected sequentially, each example is selected many
times.
– Easy implementation; processes examples in real time.
– Convergence to local minima
389
1. Neural network
◮ Neural network – multilayer perceptron
◮ A feed-forward network is a network in which the units can be numbered so
that all connections go from a vertex to one with a higher number. The
vertices are arranged in layers, with connections only to higher layers.
◮ The hidden layer contains M neurons fed by the inputs (x_1, . . . , x_p).
The output of the m-th neuron in the hidden layer is
v_m = ϕ_m( w_{m0} + ∑_{i=1}^p w_{mi} x_i ) = ϕ_m(w_m^T x),
where w_m is the vector of weights of perceptron m.
390
2. Neural network
◮ Output layer (unit) is fed by the neuron outputs in the hidden layer:
y = f( w_0 + ∑_{m=1}^M w_m v_m ) = f(w^T v),
where f is the activation function in the output unit, and w is the vector
of weights of the output unit.
◮ The network with one hidden layer implements the function
y = f(w^T v) = f( w_0 + ∑_{m=1}^M w_m ϕ_m(w_m^T x) )
391
3. Neural network
[Diagram: input layer x1, . . . , xp; hidden layer of M units producing v1, . . . , vM
(weights w11, . . . , wMp); output unit producing y]
392
Specific cases
◮ Projection pursuit model: if f(w^T v) = v^T 1 + w_0 and v_m = ϕ_m(w_m^T x), then
y = w_0 + ∑_{m=1}^M ϕ_m(w_m^T x).
◮ Generalized additive model: if f is as before and M = p, w_{m0} = 0,
w_{mi} = I{i = m}, then
y = w_0 + ∑_{m=1}^p ϕ_m(x_m).
393
Representation power of neural networks
Let f_0 : [0, 1]^p → R be a continuous function. Assume that ϕ is
not a polynomial; then for any ǫ > 0 there exist constants M and
(w_{mi}, w_m) such that for
f(x) = ∑_{m=1}^M w_m ϕ( ∑_{i=1}^p w_{mi} x_i + w_{m0} )
one has
|f(x) − f_0(x)| ≤ ǫ,   ∀x ∈ [0, 1]^p.
◮ Neural network with one hidden layer can approximate any
continuous function.
394
Training
◮ Prediction error criterion
E(W) = (1/2) ∑_{t=1}^n [y^(t)(W) − z^(t)]² = ∑_{t=1}^n e_t(y^(t)(W), z^(t)),
where W denotes all the weights, and
y^(t) = y^(t)(W) = f( w_0 + ∑_{m=1}^M w_m ϕ_m(w_m^T x^(t)) ).
◮ Backpropagation algorithm
W_j ← W_j − η ∂E(W)/∂W_j = W_j − η ∑_{t=1}^n ∂e_t(W)/∂W_j.
◮ The algorithm uses the chain rule for differentiation and requires
differentiable activation functions.
395
Training neural networks
◮ Number of hidden units: usually varies in the range 5− 100.
◮ Overfitting: networks with too many units will overfit. The remedy is
to regularize, e.g., to minimize the criterion
E(W) + λ ∑_i W_i²,   λ ∼ 10^{−4} – 10^{−2}.
This leads to the so-called weight decay algorithm.
◮ Starting values are usually taken to be random values near zero. The model
starts out nearly linear and becomes non-linear as the weights grow.
◮ Stopping rule: different ad hoc rules, e.g., a maximal number of iterations.
◮ Multiple minima: E(W ) is non–convex, has many local minima.
396
Projection pursuit regression
◮ Model: let ωm, m = 1, . . . ,M be the unknown unit p–vectors, and let
f(x) = ∑_{m=1}^M f_m(ω_m^T x),   where the f_m(·) are unknown functions.
Each f_m varies only in the direction defined by the vector ω_m.
◮ The ”effective” dimensionality is 1, not p.
◮ The error function (to be minimized w.r.t. fm and ωm, m = 1, . . . ,M)
∑_{i=1}^n [ Y_i − ∑_{m=1}^M f_m(ω_m^T X_i) ]²
397
Projection pursuit regression: fitting the model
◮ If ω is given, set v_i = ω^T X_i and apply a one-dimensional smoother
to get an estimate of g.
◮ Given g and a previous guess ω_old for ω, write
g(ω^T X_i) ≈ g(ω_old^T X_i) + g'(ω_old^T X_i)(ω − ω_old)^T X_i
and
∑_{i=1}^n [Y_i − g(ω^T X_i)]² ≈ ∑_{i=1}^n [g'(ω_old^T X_i)]² [ ( ω_old^T X_i + (Y_i − g(ω_old^T X_i)) / g'(ω_old^T X_i) ) − ω^T X_i ]².
Minimize the last expression w.r.t. ω to get ω_new.
◮ Continue until convergence...
398
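In R this model can be fitted with stats::ppr(); a short hedged sketch on the Boston housing data used elsewhere in these notes (choosing 3 ridge terms is arbitrary).

# Hedged sketch: projection pursuit regression with ppr() on the Boston data.
library(MASS)                      # for the Boston data frame
fit <- ppr(medv ~ ., data = Boston, nterms = 3)
summary(fit)                       # fitted directions omega_m and goodness of fit
plot(fit)                          # the estimated ridge functions f_m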
1. Neural network
◮ Extension of the idea of perceptron
◮ A feed-forward network is a network in which the vertices (units) can be
numbered so that all connections go from a vertex to one with a higher
number. The vertices are arranged in layers, with connections only to
higher layers.
◮ Each unit j sums its inputs, adds a constant to form the total input
x_j, and applies a function f_j to x_j to give the output y_j.
◮ The links have weights w_{ij} which multiply the signals travelling
along them by that factor.
399
Fitting (training) neural networks
◮ Starting values: usually taken to be random values near zero. The model
starts out nearly linear and becomes non-linear as the weights grow.
◮ Stopping rule: different ad hoc rules, e.g., a maximal number of iterations.
◮ Overfitting: networks with too many units will overfit. The remedy is
to regularize, e.g., to minimize the criterion
E(w) + λ ∑_{i,j} w_{ij}²,   λ ∼ 10^{−4} – 10^{−2}.
This leads to the weight decay algorithm.
◮ Number of hidden units: varies in the range 3 − 100.
◮ Multiple minima: E(w) is non-convex and has many local minima.
400
3. Neural network
◮ The activation functions f_j are taken to be
– linear,
– logistic: f(x) = ℓ(x) = e^x/(1 + e^x),
– threshold: f(x) = I(x > 0).
◮ Neural networks with linear output units and a single hidden layer can
approximate any continuous function (as the size of the hidden layer grows):
y_k = α_k + ∑_{j→k} w_{jk} f_j( α_j + ∑_{i→j} w_{ij} x_i )
◮ Projection pursuit regression:
f(x) = α + ∑_j f_j( α_j + ∑_i β_{ji} x_i ) = α + ∑_j f_j(α_j + β_j^T x)
401
Example: network for the Boston housing data
◮ Structure: 13 predictors, 3 units in the hidden layer, linear output unit
[Diagram: the 13 inputs X1, . . . , X13 feed 3 hidden units, which feed a single
output unit producing Y]
◮ Represents the function (3× 14 + 4 = 46 weights):
y_0 = α + ∑_{j=15}^{17} w_{j0} ϕ_j( α_j + ∑_{i=1}^{13} w_{ij} x_i )
402
Notation
◮ All units are numbered sequentially. Every unit j has input xj and
output yj
y_j = f_j(x_j),   x_j = ∑_{i→j} w_{ij} y_i.
Non-existent links are characterized by w_{ij} = 0; w_{ij} = 0 unless i < j.
◮ The output vector y* of the network is modeled as y* = f(x*; w),
where x* is the input vector and w is the vector of weights.
◮ Data: {(x*_m, t_m), m = 1, . . . , n} – observed examples;
t_m is the observation for y*_m = f(x*_m; w).
403
Back–propagation algorithm
◮ Discrepancy function to be minimized w.r.t. w
E(w) = ∑_{m=1}^n E_m(w) = ∑_{m=1}^n ‖t_m − f(x*_m; w)‖² = ∑_{m=1}^n ‖t_m − y*_m‖².
◮ Update rule (gradient descent): for η > 0 define
w_{ij} ← w_{ij} − η ∂E(w)/∂w_{ij},   ∂E(w)/∂w_{ij} = ∑_{m=1}^n ∂E_m(w)/∂w_{ij}.
◮ Derivative: because f(x*_m; w) depends on w_{ij} only via x_j,
∂E_m(w)/∂w_{ij} = (∂E_m/∂x_j)(∂x_j/∂w_{ij}) = y_i ∂E_m/∂x_j = y_i δ_j,
δ_j = ∂E_m/∂x_j = (∂E_m/∂y_j)(∂y_j/∂x_j) = f'_j(x_j) ∂E_m/∂y_j.
404
1. Example: Boston housing data - neural network
> nobs <- dim(Boston)[1]
> trainx<-sample(1:nobs, 2*nobs/3, replace=F)
> testx<-(1:nobs)[-trainx]
#
# requires library(nnet); the Boston data frame is in library(MASS)
# size - number of units in the hidden layer
# decay - parameter for weight decay; default 0
# linout - switch for linear output units; default logistic output units
# starting values - uniformly distributed [-0.7, 0.7]
#
> Boston10.1.nn<-nnet(formula=medv~., data=Boston[trainx,], size=10, decay=1.0e-03,
+ linout=TRUE, maxit=500)
# weights: 151
initial value 211829.925033
final value 31162.391756
converged
> pred <-predict(Boston10.1.nn, Boston[testx,])
> sum((pred-Boston[testx,"medv"])^2)
[1] 11590.46
405
2. Example: Boston housing data - neural network
> # another run with the same parameters
> Boston10.2.nn<-nnet(formula=medv~., data=Boston[trainx,], size=10, decay=1.0e-03,
+ linout=TRUE, maxit=500)
# weights: 151
initial value 223291.589605
iter 10 value 24530.299991
iter 20 value 21653.959154
...........................
iter 490 value 3061.645220
iter 500 value 3043.807915
final value 3043.807915
stopped after 500 iterations
> pred <-predict(Boston10.2.nn, Boston[testx,]) # prediction error on
> sum((pred-Boston[testx,"medv"])^2) # the test set
[1] 2859.247 # CART error was 4045.943
406
3. Example: Boston housing data – neural network
[Left: Boston10.2.nn$residuals plotted against observation index;
right: normal Q–Q plot of the residuals]
407
4. Example: Is the model good?
> cbind(pred, Boston[testx,"medv"])
[,1] [,2]
4 31.464151 33.4
12 19.442958 18.9
15 19.847892 18.2
18 18.361424 17.5
.......................
397 14.315629 12.5
403 16.850060 12.1
404 12.685557 8.3
405 -1.021680 8.5
406 -1.686176 5.0
413 11.566423 17.9
414 15.509046 16.3
419 -16.737063 8.8
421 20.722749 16.7
423 23.060568 20.8
408
21. Support Vector Machines (SVM)
409
Hyperplane
◮ Hyperplane in Rp is an affine subspace of dimension p− 1: its
equation is
f(x) = β0 + β1x1 + . . .+ βpxp = 0.
◮ Consider binary classification problem with data
Dn = {(X1, Y1), . . . , (Xn, Yn)}, where Xi = [xi,1, . . . , xi,p] ∈ Rp and
Yi ∈ {−1, 1}.
◮ We say that Dn admits linear separation if there exists a separating
hyperplane f(x) such that
f(Xi) = β0 + β1xi,1 + · · ·+ βpxi,p > 0 if Yi = 1,
f(Xi) = β0 + β1xi,1 + · · ·+ βpxi,p < 0 if Yi = −1.
Separating hyperplane satisfies Yi(β0 + β1xi,1 + · · ·+ βpxi,p) > 0.
410
Separating hyperplanes
[Figure: two-class data in the (X1, X2) plane with separating hyperplanes (two panels)]
If f(x) is a separating hyperplane then a natural classifier is sign{f(x)}.
411
1. Maximal margin classifier
◮ If the data set admits linear separation, it is natural to choose the
separating hyperplane with maximal margin, i.e., the separating
hyperplane farthest from the observations.
◮ Optimization problem:
max_{β_0,...,β_p} M
s.t. ∑_{j=1}^p β_j² = 1,
Y_i(β_0 + β_1 x_{i,1} + · · · + β_p x_{i,p}) ≥ M,   ∀i = 1, . . . , n.
M is the margin of the hyperplane. The optimization problem is convex and
can be solved efficiently on a computer.
412
2. Maximal margin classifier
[Figure: the maximal margin hyperplane and its margin in the (X1, X2) plane]
◮ Maximal margin classifier
413
Drawbacks of maximal margin classifier
◮ What can be done if there is no separating hyperplane?
◮ The maximal margin classifier is very sensitive to a single observation.
[Figure: two panels in the (X1, X2) plane showing how adding a single observation
changes the maximal margin hyperplane]
414
1. Support vector classifier
◮ The idea: allow for misclassification (impose soft margin).
◮ Optimization problem
max_{β_0,...,β_p, ǫ_1,...,ǫ_n} M
s.t. ∑_{j=1}^p β_j² = 1,
Y_i(β_0 + β_1 x_{i,1} + · · · + β_p x_{i,p}) ≥ M(1 − ǫ_i),   ∀i = 1, . . . , n,
ǫ_i ≥ 0,   ∑_{i=1}^n ǫ_i ≤ C,
where ǫ1, . . . , ǫn are slack variables, C ≥ 0 is a tuning parameter.
415
2. Support vector classifier
◮ Slack variable ǫi tells us where ith observation is located:
– ǫi = 0: ith observation on the correct side of the margin;
– ǫi > 0: ith observation violates the margin;
– ǫi > 1: ith observation on the wrong side of the hyperplane.
◮ Tuning parameter C establishes budget for margin violation
– C = 0: no budget for margin violation (maximal margin classifier);
– C > 0: no more than C observations can be misclassified (can lie
on the wrong side of the hyperplane);
– C is usually chosen by cross-validation.
416
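A support vector classifier can be fitted in R with svm() from the e1071 package (not used elsewhere in these notes); a hedged sketch on simulated data follows. Note that svm()'s cost argument penalizes margin violations, so it acts inversely to the budget C above: a large cost allows few violations and a narrow margin.

# Hedged sketch: a linear support vector classifier with e1071::svm().
library(e1071)
set.seed(1)
x <- matrix(rnorm(40), 20, 2)
y <- factor(rep(c(-1, 1), each = 10))
x[y == 1, ] <- x[y == 1, ] + 1.5                 # shift one class to make it nearly separable
dat <- data.frame(x1 = x[, 1], x2 = x[, 2], y = y)
fit <- svm(y ~ ., data = dat, kernel = "linear", cost = 10, scale = FALSE)
plot(fit, dat)        # decision boundary with support vectors marked
fit$index             # indices of the support vectors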