Transcript of: cdm.unimo.it/home/matematica/maioli.marco/mcssa_esercizi.pdf (2015-01-15)
Computational Methods and Advanced Statistics Tools
VI - EXERCISES
1. Elementary tools
EX. VI.1A - Two means from paired data
For checking the reliability of a certain electrical method for measuring temperatures, two samples were obtained. Assuming normality and homogeneous variances of the two populations, test the hypothesis that the population means are equal, at level α = 5%. The samples are:
106.9 106.3 107.0 106.0 104.9
106.2 106.7 106.8 106.1 105.6
[IV.4D - Ans.: H◦ not rejected ]
> x<- c(106.9, 106.3, 107, 106, 104.9 )
> y<- c(106.2, 106.7, 106.8, 106.1, 105.6)
> z<-x-y
> mean(z)
[1] -0.06
> var(z)
[1] 0.293
> T<- -0.06/(sd(z)/sqrt(5))
> T
[1] -0.2478577
> qt(0.025,df=4)
[1] -2.776445
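The same paired test is produced in a single call by t.test with paired=TRUE; a minimal sketch (same x, y as above):

```r
> # paired t-test on the differences x - y
> t.test(x, y, paired=TRUE)   # t = -0.2479, df = 4: |t| < 2.776, H◦ not rejected
```

The reported statistic coincides with the T computed by hand, and the p-value is far above 0.05.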
EX. VI.1B - Two means from independent samples
Three specimens of high-quality concrete had compressive strength 357, 359, 413 [kg/cm²]. For three specimens of ordinary concrete the values were 346, 358, 302. Test for equality of population means, µ1 = µ2, against the alternative µ1 > µ2. (Assume normality and equality of variances. Choose α = 5%.)
[IV.4A - Ans.: H◦ not rejected ]
> t.test(c(357,359,413), c(346,358,302), alternative= "greater")
Welch Two Sample t-test
data: c(357, 359, 413) and c(346, 358, 302)
t = 1.6384, df = 3.978, p-value = 0.08854
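Note that t.test without options performs Welch's test, while the exercise assumes equal variances; the classical pooled test is requested with var.equal=TRUE. A minimal sketch:

```r
> # pooled two-sample t-test, df = n1 + n2 - 2 = 4
> t.test(c(357,359,413), c(346,358,302), alternative="greater", var.equal=TRUE)
```

Here the two samples have equal size, so the statistic is again t = 1.6384, now on df = 4; the one-sided p-value stays above 0.05 and H◦ is again not rejected.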
EX. VI.1C - Comparison of means and variances from two independent samples.
Using samples of size 10 and 5 with means x̄ = 390, ȳ = 450, with variances s1² = 50 and s2² = 20, and assuming normality of the corresponding populations, (1) test the hypothesis µ1 = µ2 against the alternative µ1 ≠ µ2; (2) test the hypothesis H◦ : σ1² = σ2² against the alternative σ1² > σ2². Choose α = 5%.
[IV.4A, IV.4B - Ans.: (1) H◦ rejected; (2) H◦ not rejected ]
> varpooled<- (9*50+4*20)/13
> T<-(390-450)/(sqrt(varpooled)*sqrt(1/10+1/5) )
> T
[1] -17.15633
> qt(0.025,df=13)
[1] -2.160369
> F<-50/20
> F
[1] 2.5
> qf(0.95,df1=9,df2=4)
[1] 5.998779
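The same two decisions can be read off as p-values computed from the statistics above; a minimal sketch:

```r
> 2*pt(-17.15633, df=13)                    # two-sided p-value of the t-test: reject H◦
> pf(2.5, df1=9, df2=4, lower.tail=FALSE)   # p-value of the F-test: above 0.05, do not reject
```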
EX. VI.1D - Minimum sample size for estimating mean dissolved oxygen (DO) concentration.
Monitoring of pollution levels of similar streams in a region indicates that the standard deviation of DO is 1.95 mg/L over a long period of time. What is the minimum number of observations required to estimate the mean DO within ±0.5 mg/L with 95% confidence?
[IV.2B - Ans.: 59 ]
> (1.95*qnorm(0.975)/0.5)^2
[1] 58.42859
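Since the sample size must be an integer, the result is rounded up; a minimal sketch:

```r
> ceiling((1.95*qnorm(0.975)/0.5)^2)
[1] 59
```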
EX. VI.1E - Significance in change of temperature
A water supply engineer is concerned that possible climatic change with respect to temperature may have an effect on forecasts for future demands for water in a city. The long-period mean and standard deviation of the annual average temperature measured at midday are 33 and 0.75 °C. The alarm is caused by the mean temperature of 34.3 °C observed for the previous year. Does this suggest that there is an increase in the mean annual temperature at a 5% level of significance?
[IV.3D - Ans.: Reject H◦ ]
> (34.3-33)/0.75
[1] 1.733333
> pnorm(1.73333)
[1] 0.9584815
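The one-sided p-value follows directly from the standard normal distribution; a minimal sketch:

```r
> 1 - pnorm((34.3-33)/0.75)
[1] 0.0415185
```

Since 0.0415 < 0.05, H◦ is rejected.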
EX. VI.1F - Confidence limits on proportions of wet days.
We want to know the proportion of wet days in March, by observing daily rainfalls in March. Determine the total number of years of data necessary before one can be 95% confident of estimating the true proportion of wet days to within 0.05.
[IV.2D, IV.2E - Ans.: 13 years ]
> W<- sqrt((1/2)*(1/2))*qnorm(0.975)/0.05
> W
[1] 19.59964
> W^2
[1] 384.1459
> 385/31
[1] 12.41935
EX. VI.1G - Irrigation and rains.
Irrigation usually commences on 15 April in the Po river basin, Italy. An engineer is interested in the probability of rain during the 7 days from April 15 to 21. From rainfall data of the past 100 years in a particular area, the following distribution of rainy days is obtained for the period:

Rainy days   0   1   2   3   4   5   6   7
Years       57  30   9   3   1   0   0   0

The Binomial model Bin(n = 7; p = 0.1) is postulated. Can this be justified at the 5% level of significance on the basis of a chi-squared test?
[IV.5A - Ans.: X² = 5.93; do not reject H◦ ]
> # propteor = theoretical proportions of the binomial
> # merging the small frequencies of 5, 6, 7
> # ft = theoretical frequencies over 100 years
> # fs = observed frequencies over 100 years
> dbinom(0:7, size=7, prob=0.1)
[1] 0.4782969 0.3720087 0.1240029 0.0229635 0.0025515 0.0001701 0.0000063
[8] 0.0000001
> p567<- sum(dbinom(5:7,size=7,prob=0.1))
> propteor<- c(dbinom(0:4,7,0.1),p567)
> propteor
[1] 0.4782969 0.3720087 0.1240029 0.0229635 0.0025515 0.0001765
> ft <- propteor*100
> ft
[1] 47.82969 37.20087 12.40029 2.29635 0.25515 0.01765
> fs <- c(57,30,9,3,1,0)
> qchisq(0.95,df=5)
[1] 11.0705
> chi2sperim <- sum ((ft-fs)^2/ft )
> chi2sperim
[1] 6.492133
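The same statistic is produced by the built-in chisq.test, passing the theoretical proportions (same fs and propteor as above); a minimal sketch:

```r
> chisq.test(fs, p=propteor)   # X-squared = 6.4921, df = 5
```

R also warns that some expected frequencies are small, which is precisely why the classes 5, 6, 7 were merged.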
2. Examples on simple regression
EX. VI.2A - Confidence intervals in linear regression
In this example, we compute the confidence intervals of the coefficients using the residual standard error σ̂, the square root of the error variance

σ̂² = SSE/(n − p)

where SSE is nothing else than the sum of squares of the residuals:

SSE = Σ_{i=1}^{n} (yi − Xiᵀθ)².
> x<-seq(0,5,length=20)
> y<-2+0.5*x+rnorm(20,0,sd=1)
> modello<-lm(y~x)
> modello
Call:
lm(formula = y ~ x)
Coefficients:
(Intercept) x
1.6556 0.6388
> plot(x,y)
> abline(1.6556,0.6388)
> summary(modello)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.6556 0.4073 4.064 0.000728 ***
x 0.6388 0.1393 4.586 0.000229 ***
Residual standard error: 0.9452 on 18 degrees of freedom
Multiple R-Squared: 0.5389, Adjusted R-squared: 0.5132
F-statistic: 21.03 on 1 and 18 DF, p-value: 0.0002290
> # computation of the confidence limits of the coefficients
> # using the residual standard error given by R,
> # which is 0.9452 = sqrt(SSE/(n-2))
> sigma <- 0.9452
> M<- matrix(c(rep(1,20),x),nrow=20,ncol=2)
> SIGMATHETA <- sigma^2* solve(t(M)%*%M)
>
> # the confidence limits of beta:
> b1 <- 0.6388-qt(0.975,df=18)*sqrt(SIGMATHETA[2,2])
> b2<- 0.6388+qt(0.975,df=18)*sqrt(SIGMATHETA[2,2])
> c(b1,b2)
[1] 0.3461784 0.9314216
>
> # the confidence limits of alfa:
> a1 <- 1.6556-qt(0.975,df=18)*sqrt(SIGMATHETA[1,1])
> a2 <- 1.6556+qt(0.975,df=18)*sqrt(SIGMATHETA[1,1])
> c(a1, a2)
[1] 0.7998315 2.511369
>
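The intervals computed by hand are returned directly by confint applied to the fitted model; a minimal sketch:

```r
> confint(modello, level=0.95)   # rows (Intercept) and x: same limits, up to rounding
```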
EX. VI.2B - Confidence intervals in linear regression.
In this example we compute the confidence intervals of the coefficients using the standard errors:

sqrt(Var(β̂)) = σ̂ sqrt( [(XᵀX)⁻¹]₂,₂ ),   sqrt(Var(α̂)) = σ̂ sqrt( [(XᵀX)⁻¹]₁,₁ )

where X is the n × 2 matrix whose i-th row is (1, xi).
> # Computation of the confidence limits of the coefficients
> # using the individual standard errors given by R
> x <- seq(2,9,length=12)
> y <- 1+0.3*x+rnorm(12)
> modello<- lm(y~x)
> anova(modello)
Analysis of Variance Table
Response: y
Df Sum Sq Mean Sq F value Pr(>F)
x 1 8.1945 8.1945 13.643 0.004152 **
Residuals 10 6.0066 0.6007
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
> summary(modello)
Call:
lm(formula = y ~ x)
Residuals:
Min 1Q Median 3Q Max
-0.8238 -0.5399 -0.2359 0.4433 1.3851
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.7050 0.6032 1.169 0.26960
x 0.3762 0.1018 3.694 0.00415 **
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
Residual standard error: 0.775 on 10 degrees of freedom
Multiple R-squared: 0.577, Adjusted R-squared: 0.5347
F-statistic: 13.64 on 1 and 10 DF, p-value: 0.004152
> # computation of the confidence intervals of beta and alfa
> # using the two standard errors given by R,
> # i.e. 0.1018 , 0.6032
> b1 <- 0.3762-qt(0.975,df=10)*0.1018
> b2 <- 0.3762+qt(0.975,df=10)*0.1018
> c(b1,b2)
[1] 0.1493755 0.6030245
>
> a1 <- 0.7050 - qt(0.975,df=10)*0.6032
> a2 <- 0.7050 + qt(0.975,df=10)*0.6032
> c(a1,a2)
[1] -0.6390134 2.0490134
> # we recover the confidence intervals using the theory,
> # starting from the matrix M of the input data
> M<-matrix(c(rep(1,12),x),nrow=12,ncol=2)
> M
[,1] [,2]
[1,] 1 2.000000
[2,] 1 2.636364
[3,] 1 3.272727
[4,] 1 3.909091
[5,] 1 4.545455
[6,] 1 5.181818
[7,] 1 5.818182
[8,] 1 6.454545
[9,] 1 7.090909
[10,] 1 7.727273
[11,] 1 8.363636
[12,] 1 9.000000
> res<-residuals(modello)
> res
1 2 3 4 5 6
-0.54955856 0.28634886 -0.78314741 0.91422068 1.38507619 0.08798026
7 8 9 10 11 12
-0.82381596 -0.53671878 -0.47637085 -0.28992366 0.96771666 -0.18180744
> # compute the error variance,
> # i.e. the sum of squared residuals divided by the number of d.f.
> s2<-sum((res)^2)/10
> s2
[1] 0.600656
> # compute the covariance matrix of the coefficients
> varianza <- s2*solve(t(M)%*%M)
> varianza
[,1] [,2]
[1,] 0.36381965 -0.05704818
[2,] -0.05704818 0.01037240
> # in particular, the sample variance of b, the slope
> varianza[2,2]
[1] 0.0103724
> # standard error of b
> sqrt(varianza[2,2])
[1] 0.101845
> # confidence interval of beta
> b1<-0.3762-qt(0.975,df=10)*0.1018
> b2<-0.3762+qt(0.975,df=10)*0.1018
> c(b1,b2)
[1] 0.1493755 0.6030245
> # confidence interval of alfa
> errore_stand_alfa <- sqrt(varianza[1,1])
> errore_stand_alfa
[1] 0.6031746
> a1 <- 0.7050 - qt(0.975,df=10)*errore_stand_alfa
> a2 <- 0.7050 + qt(0.975,df=10)*errore_stand_alfa
> c(a1,a2)
[1] -0.6389569 2.0489569
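The covariance matrix built by hand as s2*solve(t(M)%*%M) is also available directly from the fitted model; a minimal sketch:

```r
> vcov(modello)               # same matrix as "varianza" above
> sqrt(diag(vcov(modello)))   # the two standard errors, 0.6032 and 0.1018
```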
EX. VI.2C - Simple regression with change of scale
Many real-world phenomena are non-linear. The linear regression method can be applied after changes of scale either in the predictor (ξ = ξ(x)) or in the response (ψ = ψ(y)), or in both, e.g.: y → √y, x → log(x), y → e^y, etc.
For example, suppose that 7 observations (xi, yi) have been treated by the scale change Y → √Y and a linear model √Yi = α + βxi + εi is fitted. In order to test linearity, the sums of squares of differences

SST = 4308,  SSF = 3708,  SSE = 600

are considered. The analysis of variance of this regression gives:

Source of var.   Sum of squares   Degrees of f.   Variance   (SSF/1)/(SSE/5)   p-value
Factor x         SSF = 3708       1               3708       30.9              0.0025
Error            SSE = 600        5               120
Total            SST = 4308       6

The p-value is less than 0.05 and even less than 0.01. The variance SSF/1 is so large, with respect to SSE/5, that the hypothesis H◦ : β = 0 is rejected.
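The tabulated F and its p-value can be recovered from the Fisher distribution with 1 and 5 degrees of freedom; a minimal sketch:

```r
> Fval <- (3708/1)/(600/5)
> Fval
[1] 30.9
> pf(Fval, df1=1, df2=5, lower.tail=FALSE)   # p-value < 0.01: reject H◦
```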
3. Residuals
An analysis of the RESIDUALS ei = yi − ŷi should be performed, since an assumption of the model is normality and independence of the errors εi, with the same variance σ². In the quantile-quantile graph, the sample quantiles qi are compared with the theoretical quantiles q̂i of a known distribution, here the normal one. The points (q̂i, qi) are plotted.
> # generate 20 numbers distributed as a chi-squared, sort them
> # and compare them with the normal quantiles
> x<-rchisq(20,df=5)
> qsperim<-sort(x)
> qteorici<- qnorm(0.5:19.5/20,mean(x),sd(x))
> plot(qteorici,qsperim)
> # too far from the bisecting line: not a normal sample
If the sample quantiles were exactly normal, then qi would be equal to q̂i and the points would lie on the principal bisecting line. Too large deviations from the principal bisecting line lead one to suspect a non-normal distribution of the sample. In R the useful functions are: qqnorm, qqline, qqplot. In order to perform a quantitative (not only graphical) analysis, the goodness of fit of the data with respect to a theoretical distribution is tested by the Kolmogorov-Smirnov test. In R it is performed by the function ks.test.
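A minimal sketch of the built-in versions of these checks, applied to the chi-squared sample x generated above:

```r
> qqnorm(x)                             # sample quantiles vs standard normal quantiles
> qqline(x)                             # reference line through the quartiles
> ks.test(x, "pnorm", mean(x), sd(x))   # quantitative check of normality
```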
PROP. VI.3A - The Kolmogorov-Smirnov test
The null hypothesis, that the data are extracted from a particular distribution, against the alternative that they are not,

H◦ : F = F◦,   H1 : F ≠ F◦

is tested by means of the Kolmogorov-Smirnov statistic

D = sup_x |Fn(x) − F◦(x)|.

If the critical values of D are exceeded, the (one-sample) Kolmogorov-Smirnov test rejects H◦. The null hypothesis, that two vectors x, y of data are extracted from the same distribution, against the alternative that they are not,

H◦ : F1 = F2,   H1 : F1 ≠ F2

is tested by the analogous two-sample Kolmogorov-Smirnov test.
In R, the first argument of the function ks.test is always the sample vector. The second argument can be a distribution, or a second vector of data. To implement the one-sample K-S test, the distribution can be written with suitable parameters, e.g. matching the mean and variance of the sample.
EX. VI.3B - Test on residuals for a least squares parabola.
Let us recover and analyze, by R, a least squares parabola

Y = θ◦ + θ1 x + θ2 x²

on the basis of data (xi, Yi), i = 1, ..., 8. It is enough to perform a multiple linear regression with p = 3 factors and N = 8 units. The p − 1 = 2 explicative variables will be x1 = x, x2 = x². The useful R functions are lm and summary. By means of lm, R directly writes the equation

y = 2.588 + 2.065 x − 0.211 x²

By means of summary we see the results of the t-tests applied to both coefficients, by which the null hypotheses H◦ : θ1 = 0 and H◦ : θ2 = 0 are rejected, since the p-values are much smaller than α = 0.05.
Moreover the residual standard error is given, that is sqrt(SSE/(N − p)) with N − p = 5; its square is the estimator of σ² in the linear model. Also the determination coefficient R², the fraction of total variation explained by the regression, is given. Finally, the analysis of variance of the regression is done by the F test, concerning the ratio

(SSF/(p − 1)) / (SSE/(N − p))

which has a Fisher distribution with 2, 5 degrees of freedom.
> x<-c(1.2,1.8,3.1,4.9,5.7,7.1,8.6,9.8)
> y<-c(4.5,5.9,7.0,7.8,7.2,6.8,4.5,2.7)
> x1<-x
> x2<-x^2
> mod<-lm(y~x1+x2)
> summary(mod)
Call:
lm(formula = y ~ x1 + x2)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.58779 0.34377 7.528 0.000655 ***
x1 2.06492 0.15094 13.680 3.74e-05 ***
x2 -0.21100 0.01366 -15.443 2.07e-05 ***
Residual standard error: 0.2749 on 5 degrees of freedom
Multiple R-Squared: 0.9823, Adjusted R-squared: 0.9753
F-statistic: 139.1 on 2 and 5 DF, p-value: 4.144e-05
If the data are plotted by plot, in the same frame the parabola is drawn by curve with the option add=TRUE. Residuals can be recovered by residuals. Then by qqnorm the quantiles of the residuals are compared with the theoretical normal quantiles: if the points lie approximately on a straight line, the errors are normally distributed, confirming the model. Finally a Kolmogorov-Smirnov test on the residuals, with the null hypothesis that the residuals are normal, gives a considerably large p-value: a further confirmation of normality of the residuals.
> plot(x,y,main="Least Squares parabola")
> curve(2.58+2.06*x-0.21*x^2,add=TRUE)
> res<-residuals(mod)
> qqnorm(res)
> # a q-q plot can also be made by hand: just sort
> # the vector res with sort and put the normal quantiles on the x-axis:
> plot(qnorm(0.5:7.5/8,mean=0,sd=sd(res)),sort(res))
> # K-S test: the large value p = 0.73 is compatible with normality of the residuals
> ks.test(res,"pnorm",mean(res),sd(res))
One-sample Kolmogorov-Smirnov test
data: residuals(mod)
D = 0.2266, p-value = 0.7273
alternative hypothesis: two.sided
Figure 1: (a) The parabola y = 2.58 + 2.06x − 0.21x², compared with the data (xi, Yi). (b) The quantiles of the residuals ei = Yi − Ŷi, versus the theoretical normal quantiles. If the points are not too far from a straight line, normality is not rejected.
4. Correlation confidence interval
If the R function cor.test is applied to two samples x, y of the same length, both Pearson's test on the correlation and a 0.95 confidence interval are returned.
> x<-rnorm(12)
> y<- -0.7*x+rnorm(12,sd= 2)
> x
[1] -0.27959053 -0.06535758 -1.52057272 0.92043094 1.51153470 -0.76791609
[7] -1.53624017 -0.06771280 1.77028728 0.87329624 -0.05032564 -0.55630347
> y
[1] -1.2968540 5.4361186 1.7098608 -2.2413984 -0.6963806 2.5311615
[7] 1.7840563 3.5788787 -2.3485398 -1.3989689 -0.1303807 1.7089726
> cor.test(x,y)
Pearson’s product-moment correlation
data: x and y
t = -2.315, df = 10, p-value = 0.04314
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.86975489 -0.02542412
sample estimates:
cor
-0.5907068
> # we recover the confidence interval
> # passing through the z transformation
>
> # first take the estimated correlation r
> r <- -0.5907
> # the z transformation
> zeta <- function(r){ 0.5*log((1+r)/(1-r)) }
> # inverse transformation
> erre <- function(z){ (exp(2*z)-1)/(exp(2*z)+1) }
>
> # confidence limits of z, knowing that Z ~ N[zeta(rho); 1/(n-3)]
> z1 <- zeta(r) - 1.96/sqrt(12-3)
> z2 <- zeta(r) + 1.96/sqrt(12-3)
> # confidence limits of the linear correlation
> r1 <- erre(z1)
> r2 <- erre(z2)
> c(r1,r2)
[1] -0.86975528 -0.02540173
>
5. Analysis of Variance
Recall the model. A variable y is observed in k samples (k "groups") with sizes n1, ..., nk. In the ANOVA model, for i = 1, ..., k and j = 1, ..., ni, the response yij is equal to a general mean, plus some differential effect of the i-th group, plus independent normal errors:

yij = µ + αi + εij,   εij ~ N(0; σ²),  εij independent

H◦ : αi = 0 for all i;   H1 : some αi ≠ 0
We set

SSB = Σ_{i=1}^{k} ni (ȳi − ȳ)²,
SSW = Σ_{i=1}^{k} Σ_{j=1}^{ni} (yij − ȳi)²,
SST = Σ_{i=1}^{k} Σ_{j=1}^{ni} (yij − ȳ)².
Under the hypothesis H◦,

F ≡ (SSB/(k − 1)) / (SSW/(N − k)) ~ F(k − 1, N − k).

At the significance level γ, H◦ is rejected if

F ≥ F_{1−γ}(k − 1, N − k).
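The decomposition can be checked by hand on a small hypothetical dataset (two groups of three observations, invented here for illustration); a minimal sketch:

```r
> # hypothetical toy data: k = 2 groups, N = 6
> y <- c(5.1, 4.9, 5.3, 6.2, 6.0, 6.4)
> x <- factor(rep(1:2, each=3))
> SSB <- sum(tapply(y, x, length)*(tapply(y, x, mean) - mean(y))^2)
> SSW <- sum((y - ave(y, x))^2)
> Fobs <- (SSB/(2-1))/(SSW/(6-2))
> c(SSB, SSW, Fobs)   # SSB = 1.815, SSW = 0.16, F = 45.375
```

The same SSB, SSW and F appear in the table produced by anova(lm(y~x)).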
EX. VI.5A - Let the response y be the mileage achieved by cars produced by different companies, but of comparable price, size, and so on. Cars are selected from five companies. For each company a number of new cars are chosen and their mileage (in miles per gallon) is recorded. The data are read first from the table:
y1 y2 y3 y4 y5
1 25.1 27.1 29.9 25.4 29.2
2 26.2 26.4 21.4 28.2 29.3
3 24.9 26.8 22.2 27.1 30.4
4 25.3 27.2 22.5 26.3 28.5
5 23.9 26.5 20.8 26.6 28.9
6 24.1 26.3 23.9 28.0 28.4
Then the data are collected into a single response vector y. Then the factor x is supplied: 5 levels, each level with 6 observations.
Please notice that x is written numerically first, then it is converted by the function factor. Such a conversion is very important, otherwise R would interpret x as a numerical vector, so that a meaningless linear regression would be performed! The statistical analysis is made in two steps: first lm(y~x), then anova(lm(y~x)):
> z<-read.table("makes.txt")
> y<-c(z$y1,z$y2,z$y3,z$y4,z$y5)
> x<-c(rep(1,6),rep(2,6),rep(3,6),rep(4,6),rep(5,6))
> x<-factor(x)
> mod<-lm(y~x)
> mod
Call:
lm(formula = y ~ x)
Coefficients:
(Intercept) x2 x3 x4 x5
24.917 1.800 -1.467 2.017 4.200
> anova(mod)
Analysis of Variance Table
Response: y
Df Sum Sq Mean Sq F value Pr(>F)
x 4 111.105 27.776 10.213 4.835e-05 ***
Residuals 25 67.993 2.720
---
Notice that the value 24.917 is the mean of the first group, which R uses as baseline: it estimates µ + α1 in the ANOVA model

yij = µ + αi + εij

The coefficients labelled x2, ..., x5 represent the differential effects of the groups i = 2, ..., 5 with respect to the (missing) first group. The interest is in the signs more than in the numerical values, but what is most important is the F ratio test. In this example k − 1 = 4 and N − k = 30 − 5 = 25 are the degrees of freedom, and the experimental value of F is larger than the critical value. The hypothesis of equal effects αi is rejected.
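The recoding of the coefficients can be verified from the group means; a minimal sketch (same y and x as above):

```r
> tapply(y, x, mean)   # 24.917  26.717  23.450  26.933  29.117
> # e.g. 26.717 - 24.917 = 1.800, the coefficient labelled x2
```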
Finally, an example of an unbalanced experiment, where the number of observations ni can differ between groups:
y1 y2 y3 y4 y5
1 25.1 27.1 29.9 25.4 29.2
2 26.2 26.4 21.4 28.2 29.3
3 24.9 26.8 22.2 27.1 30.4
4 25.3 27.2 22.5 28.5
5 23.9 20.8 28.9
6 28.4
>
> z<-read.table("makes.txt")
> # the empty cells are read as NA and must be dropped column by column
> y1<-na.omit(z$y1)
> y2<-na.omit(z$y2)
> y3<-na.omit(z$y3)
> y4<-na.omit(z$y4)
> y5<-na.omit(z$y5)
> y<-c(y1,y2,y3,y4,y5)
> x<-c(rep(1,5),rep(2,4),rep(3,5),rep(4,3),rep(5,6))
> x<-factor(x)
> mod<-lm(y~x)
> anova(mod)
Analysis of Variance Table
Response: y
Df Sum Sq Mean Sq F value Pr(>F)
x 4 100.662 25.166 6.9715 0.001423 **
Residuals 18 64.976 3.610
VI.5B - COMPARISONS
Supposing that the F-ratio test has rejected µ1 = µ2 = ... = µk, the question arises which pairs of levels have really different means. Bonferroni's multiple-t procedure suggests m comparisons of pairs of groups by simple t-tests, but with individual significance level α/m; m can be the total number of pairs, C(k, 2) = k(k − 1)/2. This way the global significance level α is preserved, but some loss of power is expected.
In the preceding example, we have m = C(5, 2) = 10 comparisons of two means. If α = 0.05, we have α/10 = 0.005. Thus only the pairs where the p-value is less than 0.05/10 = 0.005 have significantly different means.
> y1<- c(25.1, 26.2, 24.9, 25.3, 23.9)
> y2<- c(27.1, 26.4, 26.8, 27.2)
> y3<- c(29.9, 21.4,22.2, 22.5,20.8)
> y4<- c(25.4, 28.2, 27.1)
> y5<- c(29.2, 29.3,30.4,28.5, 28.9,28.4)
>
> t.test(y1,y2)
data: y1 and y2
t = -4.3704, df = 5.693, p-value = 0.005338
alternative hypothesis: true difference in means is not equal to 0
> t.test(y1,y3)
data: y1 and y3
t = 1.0102, df = 4.394, p-value = 0.3648
alternative hypothesis: true difference in means is not equal to 0
> t.test(y1,y4)
data: y1 and y4
t = -2.0352, df = 2.847, p-value = 0.1396
alternative hypothesis: true difference in means is not equal to 0
> t.test(y1,y5)
data: y1 and y5
t = -8.5288, df = 8.112, p-value = 2.525e-05
alternative hypothesis: true difference in means is not equal to 0
> t.test(y2,y3)
data: y2 and y3
t = 2.1025, df = 4.093, p-value = 0.1018
alternative hypothesis: true difference in means is not equal to 0
> t.test(y2,y4)
data: y2 and y4
t = -0.03, df = 2.196, p-value = 0.9786
alternative hypothesis: true difference in means is not equal to 0
> t.test(y2,y5)
data: y2 and y5
t = -6.4738, df = 7.636, p-value = 0.0002366
alternative hypothesis: true difference in means is not equal to 0
> t.test(y3,y4)
data: y3 and y4
t = -1.9126, df = 5.516, p-value = 0.1086
alternative hypothesis: true difference in means is not equal to 0
> t.test(y3,y5)
data: y3 and y5
t = -3.4098, df = 4.254, p-value = 0.02454
alternative hypothesis: true difference in means is not equal to 0
> t.test(y4,y5)
data: y4 and y5
t = -2.558, df = 2.545, p-value = 0.09823
alternative hypothesis: true difference in means is not equal to 0
By Bonferroni's comparisons, at the level 0.05 we conclude that the pairs y1, y5 and y2, y5 have different means.
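The ten comparisons can also be produced in one call by pairwise.t.test; a minimal sketch, rebuilding the pooled response and the group factor from y1, ..., y5:

```r
> y <- c(y1, y2, y3, y4, y5)
> g <- factor(rep(1:5, times=c(5,4,5,3,6)))
> pairwise.t.test(y, g, p.adjust.method="bonferroni", pool.sd=FALSE)
```

With p.adjust.method="bonferroni" each raw p-value is multiplied by m = 10, so the decision rule becomes simply: adjusted p-value < 0.05.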
6. Principal Components Analysis
EXAMPLE VI.6A - Principal component analysis of water pollution data
A file named "Bw_river.txt", containing the data frame, is placed in the directory used by R. It contains three columns: biochemical oxygen demand, nitrates and ammonia at 38 stations along the Blackwater river, in units of milligrams per liter.
Oxygen nitrates ammonia
1 2.27 1.97 0.11
2 4.41 12.83 0.61
3 4.03 11.11 0.53
4 3.75 9.86 0.47
5 3.37 9.54 0.62
6 3.23 8.85 0.56
7 3.18 8.02 0.64
8 4.08 8.94 1.14
9 4 8.76 1.11
10 3.92 8.59 1.07
11 3.83 8.43 1.04
12 3.74 8.27 1
13 3.66 8.13 0.97
14 3.58 7.99 0.94
15 3.16 6.72 0.83
16 3.43 9.23 0.94
17 3.36 9.1 0.93
18 3.3 8.97 0.91
19 3.24 8.85 0.89
20 3.19 8.74 0.88
21 3.22 9.8 0.95
22 3.17 9.64 0.93
23 3.13 9.49 0.9
24 3.08 9.34 0.88
25 3.04 9.2 0.86
26 3 9.06 0.84
27 2.96 8.03 0.82
28 2.93 8.81 0.8
29 2.89 8.69 0.78
30 2.86 8.57 0.76
31 2.82 8.45 0.74
32 2.79 8.35 0.73
33 2.76 8.24 0.71
34 2.73 8.14 0.7
35 2.7 8.04 0.68
36 2.51 6.54 0.48
37 2.49 6.51 0.47
38 2.46 6.46 0.46
After loading Bw_river.txt, principal component analysis is performed in the R workspace. For PC analysis we can use either the covariance matrix of the data, or the correlation matrix of the data, in which case the R function princomp needs the option cor=TRUE. In this example let us work by means of the correlation matrix.
> # the data matrix M has n rows and p columns,
> # in this example n=38 observations of p=3 variables
> M <- read.table("Bw_river.txt")
> # correlation matrix: the p^2 correlations between the p columns of M
> Sigma <- cor(M)
> Sigma
Oxygen nitrates ammonia
Oxygen 1.0000000 0.6495442 0.5153323
nitrates 0.6495442 1.0000000 0.4153815
ammonia 0.5153323 0.4153815 1.0000000
> # eigenvalues and eigenvectors of Sigma
> lambda <- eigen(Sigma)$values
> G <- eigen(Sigma)$vectors
> lambda
[1] 2.0594756 0.6049861 0.3355384
> G
[,1] [,2] [,3]
[1,] 0.6155078 -0.2052230 0.7609426
[2,] 0.5845941 -0.5286604 -0.6154413
[3,] 0.5285829 0.8236515 -0.2054225
> # standard deviations
> sqrt(lambda)
[1] 1.4350873 0.7778085 0.5792567
> # fraction of the variation of the data explained
> # by the first PC, by the first and second PC, etc.
> cumsum(lambda)/sum(lambda)
[1] 0.6864919 0.8881539 1.0000000
> # hence about 89% of the variation of the data
> # is explained by the first two principal components
>
> # first principal component of the i-th observation
> # (Z1)_i = (i-th row of M) x (first eigenvector)
> # i.e. Z1 = (rows of M) x (first column of G)
> # (but only after standardizing all the data of M with "scale")
> Z1<- scale(M)%*%G[,1]
> # second principal component
> Z2 <- scale(M)%*%G[,2]
> # the first two principal components Z1, Z2
> Z12<- matrix(c(Z1,Z2),38,2)
> Z12
[,1] [,2]
[1,] -5.20122624 -0.008114176
[2,] 2.61028148 -2.558734237
[3,] 1.32080591 -2.141088519
[4,] 0.37442794 -1.843102862
[5,] 0.15962963 -1.009548029
[6,] -0.41079658 -0.953633996
[7,] -0.57905692 -0.353740552
[8,] 2.08443078 0.883903100
[9,] 1.84720797 0.861467984
[10,] 1.58913403 0.797551511
[11,] 1.34696309 0.772597578
[12,] 1.08029008 0.709463896
[13,] 0.85767099 0.673822348
[14,] 0.63505189 0.638180801
[15,] -0.61263813 0.809052509
[16,] 0.90401695 0.290047393
[17,] 0.74630290 0.323379337
[18,] 0.57633678 0.314447134
[19,] 0.41002159 0.302213322
[20,] 0.28410937 0.320773253
[21,] 0.87937204 0.225807858
[22,] 0.71070313 0.222696080
[23,] 0.53343307 0.174018546
[24,] 0.36841508 0.167605160
[25,] 0.21929801 0.153805767
[26,] 0.07018095 0.140006375
[27,] -0.40386860 0.420050089
[28,] -0.20485042 0.098418367
[29,] -0.34666564 0.078015758
[30,] -0.47623086 0.053528751
[31,] -0.61804607 0.033126143
[32,] -0.71580738 0.040215669
[33,] -0.84172168 0.012427054
[34,] -0.93948298 0.019516580
[35,] -1.06174635 -0.011573643
[36,] -2.33217650 -0.202323859
[37,] -2.39213133 -0.222429987
[38,] -2.47163800 -0.231848501
>
> # this work is done by the command "princomp"
> # with the option "cor=TRUE" to use correlations instead of covariances
>
> princomp(M,cor=TRUE)
Call:
princomp(x = M, cor = TRUE)
Standard deviations:
Comp.1 Comp.2 Comp.3
1.4350873 0.7778085 0.5792567
3 variables and 38 observations.
> # we recover the eigenvalues = the variances of the principal components,
> # i.e. the squares of the standard deviations:
> lambda<- c(1.4350873,0.7778085,0.5792567)^2
> lambda
[1] 2.0594756 0.6049861 0.3355383
> plot(princomp(M,cor=TRUE))
> # thus we obtain a histogram of the variances/eigenvalues
>
> # we recover G, i.e. the p=3 eigenvectors, called "loadings"
> # e.g. the first column gives the coefficients of Z1
> # with respect to the p original variables; etc.
> princomp(M,cor=TRUE)$loadings
Loadings:
Comp.1 Comp.2 Comp.3
Oxygen 0.616 -0.205 0.761
nitrates 0.585 -0.529 -0.615
ammonia 0.529 0.824 -0.205
> # we recover the principal components, called "scores",
> # of the n=38 observations (princomp standardizes with divisor n,
> # hence the scores differ from Z1, Z2 by the factor sqrt(n/(n-1)))
> princomp(M,cor=TRUE)$scores
Comp.1 Comp.2 Comp.3
1 -5.27104448 -0.008223096 1.74962627
2 2.64532039 -2.593081200 0.32139209
3 1.33853565 -2.169829248 0.48534649
4 0.37945404 -1.867843604 0.60040549
5 0.16177241 -1.023099616 -0.00291438
6 -0.41631087 -0.966435025 0.10888380
7 -0.58682985 -0.358488960 0.27824380
8 2.11241097 0.895768102 0.81868525
9 1.87200382 0.873031830 0.79496630
10 1.61046564 0.808257380 0.77700220
11 1.36504394 0.782968480 0.73014515
12 1.09479127 0.718987327 0.69293812
13 0.86918387 0.682867348 0.65363849
14 0.64357646 0.646747369 0.61433886
15 -0.62086183 0.819912760 0.57056970
16 0.91615195 0.293940821 -0.09887846
17 0.75632084 0.327720194 -0.14602556
18 0.58407319 0.318668089 -0.16817487
19 0.41552548 0.306270058 -0.19421934
20 0.28792309 0.325079127 -0.21846126
21 0.89117622 0.228838972 -0.65285606
22 0.72024319 0.225685423 -0.64797211
23 0.54059356 0.176354470 -0.62198553
24 0.37336047 0.169854994 -0.62099674
25 0.22224174 0.155870367 -0.60855536
26 0.07112301 0.141885739 -0.59611398
27 -0.40928990 0.425688598 -0.23700257
28 -0.20760021 0.099739478 -0.56756897
29 -0.35131907 0.079062997 -0.56291792
30 -0.48262351 0.054247290 -0.54291912
31 -0.62634237 0.033570809 -0.53826808
32 -0.72541596 0.040755500 -0.53570964
33 -0.85302046 0.012593867 -0.51960601
34 -0.95209406 0.019778559 -0.51704757
35 -1.07599862 -0.011729001 -0.50483911
36 -2.36348229 -0.205039737 -0.01917067
37 -2.42424191 -0.225415758 -0.02853065
38 -2.50481583 -0.234960701 -0.04544806
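The standard deviations and the explained fractions of variation recovered above are summarized in one call; a minimal sketch:

```r
> summary(princomp(M, cor=TRUE))   # sqrt(lambda) and cumulative proportions 0.686, 0.888, 1
```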
VI.6B - Use of PC to improve linear regression
> M<-read.table("Bw_river.txt")
> M
Oxygen nitrates ammonia
1 2.27 1.97 0.11
2 4.41 12.83 0.61
3 4.03 11.11 0.53
4 3.75 9.86 0.47
5 3.37 9.54 0.62
6 3.23 8.85 0.56
7 3.18 8.02 0.64
8 4.08 8.94 1.14
9 4.00 8.76 1.11
10 3.92 8.59 1.07
11 3.83 8.43 1.04
12 3.74 8.27 1.00
13 3.66 8.13 0.97
14 3.58 7.99 0.94
15 3.16 6.72 0.83
16 3.43 9.23 0.94
17 3.36 9.10 0.93
18 3.30 8.97 0.91
19 3.24 8.85 0.89
20 3.19 8.74 0.88
21 3.22 9.80 0.95
22 3.17 9.64 0.93
23 3.13 9.49 0.90
24 3.08 9.34 0.88
25 3.04 9.20 0.86
26 3.00 9.06 0.84
27 2.96 8.03 0.82
28 2.93 8.81 0.80
29 2.89 8.69 0.78
30 2.86 8.57 0.76
31 2.82 8.45 0.74
32 2.79 8.35 0.73
33 2.76 8.24 0.71
34 2.73 8.14 0.70
35 2.70 8.04 0.68
36 2.51 6.54 0.48
37 2.49 6.51 0.47
38 2.46 6.46 0.46
> x1<- M[,1]
> x2<- M[,2]
> x3<- M[,3]
>
> # simulation of response data from the three variables x1, x2, x3
> y<-0.4*x1-1.3*x2+0.3*x3+rnorm(38)
> # multiple linear regression of y on x1, x2, x3
> modello<- lm(y~x1+x2+x3)
> summary(modello)
Call:
lm(formula = y ~ x1 + x2 + x3)
Residuals:
Min 1Q Median 3Q Max
-2.5913 -0.6550 0.1072 0.5389 2.5407
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.7140 1.2578 0.568 0.574
x1 0.4754 0.5286 0.899 0.375
x2 -1.2743 0.1563 -8.154 1.64e-09 ***
x3 -1.0922 1.0291 -1.061 0.296
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
Residual standard error: 1.148 on 34 degrees of freedom
Multiple R-squared: 0.7668, Adjusted R-squared: 0.7462
F-statistic: 37.26 on 3 and 34 DF, p-value: 7.506e-11
> # hence the regression is not significant on x1, x3
> # nor on the intercept
> # let us look for the first two principal components
> # to see whether the significance improves
> princomp(M)
Call:
princomp(x = M)
Standard deviations:
Comp.1 Comp.2 Comp.3
1.6169914 0.3761413 0.1772761
3 variables and 38 observations.
> princomp(M,cor=TRUE)
Call:
princomp(x = M, cor = TRUE)
Standard deviations:
Comp.1 Comp.2 Comp.3
1.4350873 0.7778085 0.5792567
3 variables and 38 observations.
> z <- princomp(M,cor=TRUE)$scores
> z1<-z[,1]
> z2<-z[,2]
>
> mo<- lm(y~z1+z2)
> summary(mo)
Call:
lm(formula = y ~ z1 + z2)
Residuals:
Min 1Q Median 3Q Max
-3.3805 -0.8967 0.1238 1.0409 2.9487
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -9.4832 0.2331 -40.687 < 2e-16 ***
z1 1.1548 0.1624 7.111 2.75e-08 ***
z2 0.8245 0.2997 2.752 0.00933 **
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
Residual standard error: 1.437 on 35 degrees of freedom
Multiple R-squared: 0.6242, Adjusted R-squared: 0.6027
F-statistic: 29.07 on 2 and 35 DF, p-value: 3.648e-08
> # now the regression is highly significant on z1, z2
> # and on the intercept