Transcript of: cdm.unimo.it/home/matematica/maioli.marco/mcssa_esercizi.pdf (2015-01-15)
Computational Methods and Advanced Statistics Tools
VI - EXERCISES
1. Elementary tools
EX. VI.1A - Two means from paired data
For checking the reliability of a certain electrical method for measuring temperatures, two samples were obtained. Assuming normality and homogeneous variances of the two populations, test the hypothesis that the population means are equal, at level α = 5%. The samples are:
106.9 106.3 107.0 106.0 104.9
106.2 106.7 106.8 106.1 105.6
[IV.4D - Ans.: H◦ not rejected ]
> x<- c(106.9, 106.3, 107, 106, 104.9 )
> y<- c(106.2, 106.7, 106.8, 106.1, 105.6)
> z<-x-y
> mean(z)
[1] -0.06
> var(z)
[1] 0.293
> T<- -0.06/(sd(z)/sqrt(5))
> T
[1] -0.2478577
> qt(0.025,df=4)
[1] -2.776445
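The same paired test is produced in a single call by t.test with paired=TRUE; a minimal sketch (same x, y as above):

```r
> # paired t-test on the differences x - y
> t.test(x, y, paired=TRUE)   # t = -0.2479, df = 4: |t| < 2.776, H◦ not rejected
```

The reported statistic coincides with the T computed by hand, and the p-value is far above 0.05.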
EX. VI.1B - Two means from independent samples
Three specimens of high-quality concrete had compressive strength 357, 359, 413 [kg/cm²]. For three specimens of ordinary concrete the values were 346, 358, 302. Test for equality of population means, µ1 = µ2, against the alternative µ1 > µ2. (Assume normality and equality of variances. Choose α = 5%.)
[IV.4A - Ans.: H◦ not rejected ]
> t.test(c(357,359,413), c(346,358,302), alternative= "greater")
Welch Two Sample t-test
data: c(357, 359, 413) and c(346, 358, 302)
t = 1.6384, df = 3.978, p-value = 0.08854
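Note that t.test without options performs Welch's test, while the exercise assumes equal variances; the classical pooled test is requested with var.equal=TRUE. A minimal sketch:

```r
> # pooled two-sample t-test, df = n1 + n2 - 2 = 4
> t.test(c(357,359,413), c(346,358,302), alternative="greater", var.equal=TRUE)
```

Here the two samples have equal size, so the statistic is again t = 1.6384, now on df = 4; the one-sided p-value stays above 0.05 and H◦ is again not rejected.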
EX. VI.1C - Comparison of means and variances from two independent samples.
Using samples of size 10 and 5 with means x̄ = 390, ȳ = 450, with variances s1² = 50 and s2² = 20, and assuming normality of the corresponding populations, (1) test the hypothesis µ1 = µ2 against the alternative µ1 ≠ µ2; (2) test the hypothesis H◦ : σ1² = σ2² against the alternative σ1² > σ2². Choose α = 5%.
[IV.4A, IV.4B - Ans.: (1) H◦ rejected; (2) H◦ not rejected ]
> varpooled<- (9*50+4*20)/13
> T<-(390-450)/(sqrt(varpooled)*sqrt(1/10+1/5) )
> T
[1] -17.15633
> qt(0.025,df=13)
[1] -2.160369
> F<-50/20
> F
[1] 2.5
> qf(0.95,df1=9,df2=4)
[1] 5.998779
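The same two decisions can be read off as p-values computed from the statistics above; a minimal sketch:

```r
> 2*pt(-17.15633, df=13)                    # two-sided p-value of the t-test: reject H◦
> pf(2.5, df1=9, df2=4, lower.tail=FALSE)   # p-value of the F-test: above 0.05, do not reject
```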
EX. VI.1D - Minimum sample size for estimating mean dissolved oxygen (DO) concentration.
Monitoring of pollution levels of similar streams in a region indicates that the standard deviation of DO is 1.95 mg/L over a long period of time. What is the minimum number of observations required to estimate the mean DO within ±0.5 mg/L with 95% confidence?
[IV.2B - Ans.: 59 ]
> (1.95*qnorm(0.975)/0.5)^2
[1] 58.42859
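Since the sample size must be an integer, the result is rounded up; a minimal sketch:

```r
> ceiling((1.95*qnorm(0.975)/0.5)^2)
[1] 59
```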
EX. VI.1E - Significance in change of temperature
A water supply engineer is concerned that possible climatic change with respect to temperature may have an effect on forecasts for future demands for water in a city. The long-period mean and standard deviation of the annual average temperature measured at midday are 33 and 0.75 °C. The alarm is caused by the mean temperature of 34.3 °C observed for the previous year. Does this suggest that there is an increase in the mean annual temperature at a 5% level of significance?
[IV.3D - Ans.: Reject H◦ ]
> (34.3-33)/0.75
[1] 1.733333
> pnorm(1.73333)
[1] 0.9584815
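The one-sided p-value follows directly from the standard normal distribution; a minimal sketch:

```r
> 1 - pnorm((34.3-33)/0.75)
[1] 0.0415185
```

Since 0.0415 < 0.05, H◦ is rejected.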
EX. VI.1F - Confidence limits on proportions of wet days.
We want to know the proportion of wet days in March, by observing daily rainfalls in March. Determine the total number of years of data necessary before one can be 95% confident of estimating the true proportion of wet days to within 0.05.
[IV.2D, IV.2E - Ans.: 13 years ]
> W<- sqrt((1/2)*(1/2))*qnorm(0.975)/0.05
> W
[1] 19.59964
> W^2
[1] 384.1459
> 385/31
[1] 12.41935
EX. VI.1G - Irrigation and rains.
Irrigation usually commences on 15 April in the Po river basin, Italy. An engineer is interested in the probability of rain during the 7 days from April 15 to 21. From rainfall data of the past 100 years in a particular area, the following distribution of rainy days is obtained for the period:

Rainy days   0   1   2   3   4   5   6   7
Years       57  30   9   3   1   0   0   0

The Binomial model Bin(n = 7; p = 0.1) is postulated. Can this be justified at the 5% level of significance on the basis of a chi-squared test?
[IV.5A - Ans.: X² = 5.93; do not reject H◦ ]
> # propteor = theoretical proportions of the binomial
> # merging the small frequencies of 5, 6, 7
> # ft = theoretical frequencies over 100 years
> # fs = observed frequencies over 100 years
> dbinom(0:7, size=7, prob=0.1)
[1] 0.4782969 0.3720087 0.1240029 0.0229635 0.0025515 0.0001701 0.0000063
[8] 0.0000001
> p567<- sum(dbinom(5:7,size=7,prob=0.1))
> propteor<- c(dbinom(0:4,7,0.1),p567)
> propteor
[1] 0.4782969 0.3720087 0.1240029 0.0229635 0.0025515 0.0001765
> ft <- propteor*100
> ft
[1] 47.82969 37.20087 12.40029 2.29635 0.25515 0.01765
> fs <- c(57,30,9,3,1,0)
> qchisq(0.95,df=5)
[1] 11.0705
> chi2sperim <- sum ((ft-fs)^2/ft )
> chi2sperim
[1] 6.492133
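The same statistic is produced by the built-in chisq.test, passing the theoretical proportions (same fs and propteor as above); a minimal sketch:

```r
> chisq.test(fs, p=propteor)   # X-squared = 6.4921, df = 5
```

R also warns that some expected frequencies are small, which is precisely why the classes 5, 6, 7 were merged.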
2. Examples on simple regression
EX. VI.2A - Confidence intervals in linear regression
In this example, we compute the confidence intervals of the coefficients using the residual standard error σ̂, the square root of the error variance

σ̂² = SSE/(n − p)

where SSE is nothing else than the sum of squares of the residuals:

SSE = Σ_{i=1}^{n} (yi − Xiᵀθ)².
> x<-seq(0,5,length=20)
> y<-2+0.5*x+rnorm(20,0,sd=1)
> modello<-lm(y~x)
> modello
Call:
lm(formula = y ~ x)
Coefficients:
(Intercept) x
1.6556 0.6388
> plot(x,y)
> abline(1.6556,0.6388)
> summary(modello)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.6556 0.4073 4.064 0.000728 ***
x 0.6388 0.1393 4.586 0.000229 ***
Residual standard error: 0.9452 on 18 degrees of freedom
Multiple R-Squared: 0.5389, Adjusted R-squared: 0.5132
F-statistic: 21.03 on 1 and 18 DF, p-value: 0.0002290
> # computation of the confidence limits of the coefficients
> # using the residual standard error given by R,
> # which is 0.9452 = sqrt(SSE/(n-2))
> sigma <- 0.9452
> M<- matrix(c(rep(1,20),x),nrow=20,ncol=2)
> SIGMATHETA <- sigma^2* solve(t(M)%*%M)
>
> # the confidence limits of beta:
> b1 <- 0.6388-qt(0.975,df=18)*sqrt(SIGMATHETA[2,2])
> b2<- 0.6388+qt(0.975,df=18)*sqrt(SIGMATHETA[2,2])
> c(b1,b2)
[1] 0.3461784 0.9314216
>
> # the confidence limits of alfa:
> a1 <- 1.6556-qt(0.975,df=18)*sqrt(SIGMATHETA[1,1])
> a2 <- 1.6556+qt(0.975,df=18)*sqrt(SIGMATHETA[1,1])
> c(a1, a2)
[1] 0.7998315 2.511369
>
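The intervals computed by hand are returned directly by confint applied to the fitted model; a minimal sketch:

```r
> confint(modello, level=0.95)   # rows (Intercept) and x: same limits, up to rounding
```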
EX. VI.2B - Confidence intervals in linear regression.
In this example we compute the confidence intervals of the coefficients using the standard errors:

sqrt(Var(β̂)) = σ̂ sqrt( [(XᵀX)⁻¹]₂,₂ ),   sqrt(Var(α̂)) = σ̂ sqrt( [(XᵀX)⁻¹]₁,₁ )

where X is the n × 2 matrix whose i-th row is (1, xi).
> # Computation of the confidence limits of the coefficients
> # using the individual standard errors given by R
> x <- seq(2,9,length=12)
> y <- 1+0.3*x+rnorm(12)
> modello<- lm(y~x)
> anova(modello)
Analysis of Variance Table
Response: y
Df Sum Sq Mean Sq F value Pr(>F)
x 1 8.1945 8.1945 13.643 0.004152 **
Residuals 10 6.0066 0.6007
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
> summary(modello)
Call:
lm(formula = y ~ x)
Residuals:
Min 1Q Median 3Q Max
-0.8238 -0.5399 -0.2359 0.4433 1.3851
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.7050 0.6032 1.169 0.26960
x 0.3762 0.1018 3.694 0.00415 **
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
Residual standard error: 0.775 on 10 degrees of freedom
Multiple R-squared: 0.577, Adjusted R-squared: 0.5347
F-statistic: 13.64 on 1 and 10 DF, p-value: 0.004152
> # computation of the confidence intervals of beta and alfa
> # using the two standard errors given by R,
> # i.e. 0.1018 , 0.6032
> b1 <- 0.3762-qt(0.975,df=10)*0.1018
> b2 <- 0.3762+qt(0.975,df=10)*0.1018
> c(b1,b2)
[1] 0.1493755 0.6030245
>
> a1 <- 0.7050 - qt(0.975,df=10)*0.6032
> a2 <- 0.7050 + qt(0.975,df=10)*0.6032
> c(a1,a2)
[1] -0.6390134 2.0490134
> # we recover the confidence intervals using the theory,
> # starting from the matrix M of the input data
> M<-matrix(c(rep(1,12),x),nrow=12,ncol=2)
> M
[,1] [,2]
[1,] 1 2.000000
[2,] 1 2.636364
[3,] 1 3.272727
[4,] 1 3.909091
[5,] 1 4.545455
[6,] 1 5.181818
[7,] 1 5.818182
[8,] 1 6.454545
[9,] 1 7.090909
[10,] 1 7.727273
[11,] 1 8.363636
[12,] 1 9.000000
> res<-residuals(modello)
> res
1 2 3 4 5 6
-0.54955856 0.28634886 -0.78314741 0.91422068 1.38507619 0.08798026
7 8 9 10 11 12
-0.82381596 -0.53671878 -0.47637085 -0.28992366 0.96771666 -0.18180744
> # compute the error variance,
> # i.e. the sum of squared residuals divided by the number of d.f.
> s2<-sum((res)^2)/10
> s2
[1] 0.600656
> # compute the covariance matrix of the coefficients
> varianza <- s2*solve(t(M)%*%M)
> varianza
[,1] [,2]
[1,] 0.36381965 -0.05704818
[2,] -0.05704818 0.01037240
> # in particular, the sample variance of b, the slope
> varianza[2,2]
[1] 0.0103724
> # standard error of b
> sqrt(varianza[2,2])
[1] 0.101845
> # confidence interval of beta
> b1<-0.3762-qt(0.975,df=10)*0.1018
> b2<-0.3762+qt(0.975,df=10)*0.1018
> c(b1,b2)
[1] 0.1493755 0.6030245
> # confidence interval of alfa
> errore_stand_alfa <- sqrt(varianza[1,1])
> errore_stand_alfa
[1] 0.6031746
> a1 <- 0.7050 - qt(0.975,df=10)*errore_stand_alfa
> a2 <- 0.7050 + qt(0.975,df=10)*errore_stand_alfa
> c(a1,a2)
[1] -0.6389569 2.0489569
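The covariance matrix built by hand as s2*solve(t(M)%*%M) is also available directly from the fitted model; a minimal sketch:

```r
> vcov(modello)               # same matrix as "varianza" above
> sqrt(diag(vcov(modello)))   # the two standard errors, 0.6032 and 0.1018
```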
EX. VI.2C - Simple regression with change of scale
Many real-world phenomena are non-linear. The linear regression method can be applied after changes of scale either in the predictor (ξ = ξ(x)) or in the response (ψ = ψ(y)), or in both, e.g.: y → √y, x → log(x), y → e^y, etc.
For example, suppose that 7 observations (xi, yi) have been treated by the scale change Y → √Y and a linear model √Yi = α + βxi + εi is fitted. In order to test linearity, the sums of squares of differences

SST = 4308,  SSF = 3708,  SSE = 600

are considered. The analysis of variance of this regression gives:

Source of var.   Sum of squares   Degrees of f.   Variance   (SSF/1)/(SSE/5)   p-value
Factor x         SSF = 3708       1               3708       30.9              0.0025
Error            SSE = 600        5               120
Total            SST = 4308       6

The p-value is less than 0.05 and even less than 0.01. The variance SSF/1 is so large, with respect to SSE/5, that the hypothesis H◦ : β = 0 is rejected.
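The tabulated F and its p-value can be recovered from the Fisher distribution with 1 and 5 degrees of freedom; a minimal sketch:

```r
> Fval <- (3708/1)/(600/5)
> Fval
[1] 30.9
> pf(Fval, df1=1, df2=5, lower.tail=FALSE)   # p-value < 0.01: reject H◦
```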
3. Residuals
An analysis of the RESIDUALS ei = yi − ŷi should be performed, since an assumption of the model is normality and independence of the errors εi, with the same variance σ². In the quantile-quantile graph, the sample quantiles qi are compared with the theoretical quantiles q̂i of a known distribution, here the normal one. The points (q̂i, qi) are plotted.
> # generate 20 numbers distributed as a chi-squared, sort them
> # and compare them with the normal quantiles
> x<-rchisq(20,df=5)
> qsperim<-sort(x)
> qteorici<- qnorm(0.5:19.5/20,mean(x),sd(x))
> plot(qteorici,qsperim)
> # too far from the bisecting line: not a normal sample
If the sample quantiles were exactly normal, then qi would be equal to q̂i and the points would lie on the principal bisecting line. Too large deviations from the principal bisecting line lead one to suspect a non-normal distribution of the sample. In R the useful functions are: qqnorm, qqline, qqplot. In order to perform a quantitative (not only graphical) analysis, the goodness of fit of the data with respect to a theoretical distribution is tested by the Kolmogorov-Smirnov test. In R it is performed by the function ks.test.
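A minimal sketch of the built-in versions of these checks, applied to the chi-squared sample x generated above:

```r
> qqnorm(x)                             # sample quantiles vs standard normal quantiles
> qqline(x)                             # reference line through the quartiles
> ks.test(x, "pnorm", mean(x), sd(x))   # quantitative check of normality
```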
PROP. VI.3A - The Kolmogorov-Smirnov test
The null hypothesis, that the data are extracted from a particular distribution, against the alternative that they are not,

H◦ : F = F◦,   H1 : F ≠ F◦

is tested by means of the Kolmogorov-Smirnov statistic

D = sup_x |Fn(x) − F◦(x)|.

If the critical values of D are exceeded, the (one-sample) Kolmogorov-Smirnov test rejects H◦. The null hypothesis, that two vectors x, y of data are extracted from the same distribution, against the alternative that they are not,

H◦ : F1 = F2,   H1 : F1 ≠ F2

is tested by the analogous two-sample Kolmogorov-Smirnov test.
In R, the first argument of the function ks.test is always the sample vector. The second argument can be a distribution, or a second vector of data. To implement the one-sample K-S test, the distribution can be written with suitable parameters, e.g. matching the mean and variance of the sample.
EX. VI.3B - Test on residuals for a least squares parabola.
Let us recover and analyze, by R, a least squares parabola

Y = θ◦ + θ1 x + θ2 x²

on the basis of data (xi, Yi), i = 1, ..., 8. It is enough to perform a multiple linear regression with p = 3 factors and N = 8 units. The p − 1 = 2 explicative variables will be x1 = x, x2 = x². The useful R functions are lm and summary. By means of lm, R directly writes the equation

y = 2.588 + 2.065 x − 0.211 x²

By means of summary we see the results of the t-tests applied to both coefficients, by which the null hypotheses H◦ : θ1 = 0 and H◦ : θ2 = 0 are rejected, since the p-values are much smaller than α = 0.05.
Moreover the residual standard error is given, that is sqrt(SSE/(N − p)) with N − p = 5; its square is the estimator of σ² in the linear model. Also the determination coefficient R², the fraction of total variation explained by the regression, is given. Finally, the analysis of variance of the regression is done by the F test, concerning the ratio

(SSF/(p − 1)) / (SSE/(N − p))

which has a Fisher distribution with 2, 5 degrees of freedom.
> x<-c(1.2,1.8,3.1,4.9,5.7,7.1,8.6,9.8)
> y<-c(4.5,5.9,7.0,7.8,7.2,6.8,4.5,2.7)
> x1<-x
> x2<-x^2
> mod<-lm(y~x1+x2)
> summary(mod)
Call:
lm(formula = y ~ x1 + x2)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.58779 0.34377 7.528 0.000655 ***
x1 2.06492 0.15094 13.680 3.74e-05 ***
x2 -0.21100 0.01366 -15.443 2.07e-05 ***
Residual standard error: 0.2749 on 5 degrees of freedom
Multiple R-Squared: 0.9823, Adjusted R-squared: 0.9753
F-statistic: 139.1 on 2 and 5 DF, p-value: 4.144e-05
If the data are plotted by plot, in the same frame the parabola is drawn by curve with the option add=TRUE. Residuals can be recovered by residuals. Then by qqnorm the quantiles of the residuals are compared with the theoretical normal quantiles: if the points lie approximately on a straight line, the errors are normally distributed, confirming the model. Finally a Kolmogorov-Smirnov test on the residuals, with the null hypothesis that the residuals are normal, gives a considerably large p-value: a further confirmation of normality of the residuals.
> plot(x,y,main="Least Squares parabola")
> curve(2.58+2.06*x-0.21*x^2,add=TRUE)
> res<-residuals(mod)
> qqnorm(res)
> # a q-q plot can also be made by hand: just sort
> # the vector res with sort and put the normal quantiles on the x-axis:
> plot(qnorm(0.5:7.5/8,mean=0,sd=sd(res)),sort(res))
> # K-S test: the large value p = 0.73 is compatible with normality of the residuals
> ks.test(res,"pnorm",mean(res),sd(res))
One-sample Kolmogorov-Smirnov test
data: residuals(mod)
D = 0.2266, p-value = 0.7273
alternative hypothesis: two.sided
Figure 1: (a) The parabola y = 2.58 + 2.06x − 0.21x², compared with the data (xi, Yi). (b) The quantiles of the residuals ei = Yi − Ŷi, versus the theoretical normal quantiles. If the points are not too far from a straight line, normality is not rejected.
4. Correlation confidence interval
If the R function cor.test is applied to two samples x, y of the same length, both Pearson's test on the correlation and a 0.95 confidence interval are returned.
> x<-rnorm(12)
> y<- -0.7*x+rnorm(12,sd= 2)
> x
[1] -0.27959053 -0.06535758 -1.52057272 0.92043094 1.51153470 -0.76791609
[7] -1.53624017 -0.06771280 1.77028728 0.87329624 -0.05032564 -0.55630347
> y
[1] -1.2968540 5.4361186 1.7098608 -2.2413984 -0.6963806 2.5311615
[7] 1.7840563 3.5788787 -2.3485398 -1.3989689 -0.1303807 1.7089726
> cor.test(x,y)
Pearson’s product-moment correlation
data: x and y
t = -2.315, df = 10, p-value = 0.04314
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.86975489 -0.02542412
sample estimates:
cor
-0.5907068
> # we recover the confidence interval
> # passing through the z transformation
>
> # first take the estimated correlation r
> r <- -0.5907
> # the z transformation
> zeta <- function(r){ 0.5*log((1+r)/(1-r)) }
> # inverse transformation
> erre <- function(z){ (exp(2*z)-1)/(exp(2*z)+1) }
>
> # confidence limits of z, knowing that Z ~ N[zeta(rho); 1/(n-3)]
> z1 <- zeta(r) - 1.96/sqrt(12-3)
> z2 <- zeta(r) + 1.96/sqrt(12-3)
> # confidence limits of the linear correlation
> r1 <- erre(z1)
> r2 <- erre(z2)
> c(r1,r2)
[1] -0.86975528 -0.02540173
>
5. Analysis of Variance
Recall the model. A variable y is observed in k samples (k "groups") with sizes n1, ..., nk. In the ANOVA model, for i = 1, ..., k and j = 1, ..., ni, the response yij is equal to a general mean, plus some differential effect of the i-th group, plus independent normal errors:

yij = µ + αi + εij,   εij ~ N(0; σ²),  εij independent

H◦ : αi = 0 for all i;   H1 : some αi ≠ 0
We set

SSB = Σ_{i=1}^{k} ni (ȳi − ȳ)²,
SSW = Σ_{i=1}^{k} Σ_{j=1}^{ni} (yij − ȳi)²,
SST = Σ_{i=1}^{k} Σ_{j=1}^{ni} (yij − ȳ)².
Under the hypothesis H◦,

F ≡ (SSB/(k − 1)) / (SSW/(N − k)) ~ F(k − 1, N − k).

At the significance level γ, H◦ is rejected if

F ≥ F_{1−γ}(k − 1, N − k).
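The decomposition can be checked by hand on a small hypothetical dataset (two groups of three observations, invented here for illustration); a minimal sketch:

```r
> # hypothetical toy data: k = 2 groups, N = 6
> y <- c(5.1, 4.9, 5.3, 6.2, 6.0, 6.4)
> x <- factor(rep(1:2, each=3))
> SSB <- sum(tapply(y, x, length)*(tapply(y, x, mean) - mean(y))^2)
> SSW <- sum((y - ave(y, x))^2)
> Fobs <- (SSB/(2-1))/(SSW/(6-2))
> c(SSB, SSW, Fobs)   # SSB = 1.815, SSW = 0.16, F = 45.375
```

The same SSB, SSW and F appear in the table produced by anova(lm(y~x)).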
EX. VI.5A - Let the response y be the mileage achieved by cars produced by different companies, but of comparable price, size, and so on. Cars are selected from five companies. For each company a number of new cars are chosen and their mileage (in miles per gallon) is recorded. The data are read first from the table:
y1 y2 y3 y4 y5
1 25.1 27.1 29.9 25.4 29.2
2 26.2 26.4 21.4 28.2 29.3
3 24.9 26.8 22.2 27.1 30.4
4 25.3 27.2 22.5 26.3 28.5
5 23.9 26.5 20.8 26.6 28.9
6 24.1 26.3 23.9 28.0 28.4
Then the data are collected into a single response vector y. Then the factor x is supplied: 5 levels, each level with 6 observations.
Please notice that x is written numerically first, then it is converted by the function factor. Such a conversion is very important, otherwise R would interpret x as a numerical vector, so that a meaningless linear regression would be performed! The statistical analysis is made in two steps: first lm(y~x), then anova(lm(y~x)):
> z<-read.table("makes.txt")
> y<-c(z$y1,z$y2,z$y3,z$y4,z$y5)
> x<-c(rep(1,6),rep(2,6),rep(3,6),rep(4,6),rep(5,6))
> x<-factor(x)
> mod<-lm(y~x)
> mod
Call:
lm(formula = y ~ x)
Coefficients:
(Intercept) x2 x3 x4 x5
24.917 1.800 -1.467 2.017 4.200
> anova(mod)
Analysis of Variance Table
Response: y
Df Sum Sq Mean Sq F value Pr(>F)
x 4 111.105 27.776 10.213 4.835e-05 ***
Residuals 25 67.993 2.720
---
Notice that the value 24.917 is the mean of the first group, which R uses as baseline: it estimates µ + α1 in the ANOVA model

yij = µ + αi + εij

The coefficients labelled x2, ..., x5 represent the differential effects of the groups i = 2, ..., 5 with respect to the (missing) first group. The interest is in the signs more than in the numerical values, but what is most important is the F ratio test. In this example k − 1 = 4 and N − k = 30 − 5 = 25 are the degrees of freedom, and the experimental value of F is larger than the critical value. The hypothesis of equal effects αi is rejected.
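The recoding of the coefficients can be verified from the group means; a minimal sketch (same y and x as above):

```r
> tapply(y, x, mean)   # 24.917  26.717  23.450  26.933  29.117
> # e.g. 26.717 - 24.917 = 1.800, the coefficient labelled x2
```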
Finally, an example of an unbalanced experiment, where the number of observations ni can differ between groups:
y1 y2 y3 y4 y5
1 25.1 27.1 29.9 25.4 29.2
2 26.2 26.4 21.4 28.2 29.3
3 24.9 26.8 22.2 27.1 30.4
4 25.3 27.2 22.5 28.5
5 23.9 20.8 28.9
6 28.4
>
> z<-read.table("makes.txt")
> # the empty cells are read as NA and must be dropped column by column
> y1<-na.omit(z$y1)
> y2<-na.omit(z$y2)
> y3<-na.omit(z$y3)
> y4<-na.omit(z$y4)
> y5<-na.omit(z$y5)
> y<-c(y1,y2,y3,y4,y5)
> x<-c(rep(1,5),rep(2,4),rep(3,5),rep(4,3),rep(5,6))
> x<-factor(x)
> mod<-lm(y~x)
> anova(mod)
Analysis of Variance Table
Response: y
Df Sum Sq Mean Sq F value Pr(>F)
x 4 100.662 25.166 6.9715 0.001423 **
Residuals 18 64.976 3.610
VI.5B - COMPARISONS
Supposing that the F-ratio test has rejected µ1 = µ2 = ... = µk, the question arises which pairs of levels have really different means. Bonferroni's multiple-t procedure suggests m comparisons of pairs of groups by simple t-tests, but with individual significance level α/m; m can be the total number of pairs, C(k, 2) = k(k − 1)/2. This way the global significance level α is preserved, but some loss of power is expected.
In the preceding example, we have m = C(5, 2) = 10 comparisons of two means. If α = 0.05, we have α/10 = 0.005. Thus only the pairs where the p-value is less than 0.05/10 = 0.005 have significantly different means.
> y1<- c(25.1, 26.2, 24.9, 25.3, 23.9)
> y2<- c(27.1, 26.4, 26.8, 27.2)
> y3<- c(29.9, 21.4,22.2, 22.5,20.8)
> y4<- c(25.4, 28.2, 27.1)
> y5<- c(29.2, 29.3,30.4,28.5, 28.9,28.4)
>
> t.test(y1,y2)
data: y1 and y2
t = -4.3704, df = 5.693, p-value = 0.005338
alternative hypothesis: true difference in means is not equal to 0
> t.test(y1,y3)
data: y1 and y3
t = 1.0102, df = 4.394, p-value = 0.3648
alternative hypothesis: true difference in means is not equal to 0
> t.test(y1,y4)
data: y1 and y4
t = -2.0352, df = 2.847, p-value = 0.1396
alternative hypothesis: true difference in means is not equal to 0
> t.test(y1,y5)
data: y1 and y5
t = -8.5288, df = 8.112, p-value = 2.525e-05
alternative hypothesis: true difference in means is not equal to 0
> t.test(y2,y3)
data: y2 and y3
t = 2.1025, df = 4.093, p-value = 0.1018
alternative hypothesis: true difference in means is not equal to 0
> t.test(y2,y4)
data: y2 and y4
t = -0.03, df = 2.196, p-value = 0.9786
alternative hypothesis: true difference in means is not equal to 0
> t.test(y2,y5)
data: y2 and y5
t = -6.4738, df = 7.636, p-value = 0.0002366
alternative hypothesis: true difference in means is not equal to 0
> t.test(y3,y4)
data: y3 and y4
t = -1.9126, df = 5.516, p-value = 0.1086
alternative hypothesis: true difference in means is not equal to 0
> t.test(y3,y5)
data: y3 and y5
t = -3.4098, df = 4.254, p-value = 0.02454
alternative hypothesis: true difference in means is not equal to 0
> t.test(y4,y5)
data: y4 and y5
t = -2.558, df = 2.545, p-value = 0.09823
alternative hypothesis: true difference in means is not equal to 0
By Bonferroni's comparisons, at the level 0.05 we conclude that the pairs y1, y5 and y2, y5 have different means.
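The ten comparisons can also be produced in one call by pairwise.t.test; a minimal sketch, rebuilding the pooled response and the group factor from y1, ..., y5:

```r
> y <- c(y1, y2, y3, y4, y5)
> g <- factor(rep(1:5, times=c(5,4,5,3,6)))
> pairwise.t.test(y, g, p.adjust.method="bonferroni", pool.sd=FALSE)
```

With p.adjust.method="bonferroni" each raw p-value is multiplied by m = 10, so the decision rule becomes simply: adjusted p-value < 0.05.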
6. Principal Components Analysis
EXAMPLE VI.6A - Principal component analysis of water pollution data
A file named "Bw_river.txt", containing the data frame, is placed in the directory used by R. It contains three columns: biochemical oxygen demand, nitrates and ammonia at 38 stations along the Blackwater river, in units of milligrams per liter.
Oxygen nitrates ammonia
1 2.27 1.97 0.11
2 4.41 12.83 0.61
3 4.03 11.11 0.53
4 3.75 9.86 0.47
5 3.37 9.54 0.62
6 3.23 8.85 0.56
7 3.18 8.02 0.64
8 4.08 8.94 1.14
9 4 8.76 1.11
10 3.92 8.59 1.07
11 3.83 8.43 1.04
12 3.74 8.27 1
13 3.66 8.13 0.97
14 3.58 7.99 0.94
15 3.16 6.72 0.83
16 3.43 9.23 0.94
17 3.36 9.1 0.93
18 3.3 8.97 0.91
19 3.24 8.85 0.89
20 3.19 8.74 0.88
21 3.22 9.8 0.95
22 3.17 9.64 0.93
23 3.13 9.49 0.9
24 3.08 9.34 0.88
25 3.04 9.2 0.86
26 3 9.06 0.84
27 2.96 8.03 0.82
28 2.93 8.81 0.8
29 2.89 8.69 0.78
30 2.86 8.57 0.76
31 2.82 8.45 0.74
32 2.79 8.35 0.73
33 2.76 8.24 0.71
34 2.73 8.14 0.7
35 2.7 8.04 0.68
36 2.51 6.54 0.48
37 2.49 6.51 0.47
38 2.46 6.46 0.46
After loading Bw_river.txt, principal component analysis is performed in the R workspace. For PC analysis we can use either the covariance matrix of the data, or the correlation matrix of the data, in which case the R function princomp needs the option cor=TRUE. In this example let us work by means of the correlation matrix.
> # the data matrix M has n rows and p columns,
> # in this example n=38 observations of p=3 variables
> M <- read.table("Bw_river.txt")
> # correlation matrix: the p^2 correlations between the p columns of M
> Sigma <- cor(M)
> Sigma
Oxygen nitrates ammonia
Oxygen 1.0000000 0.6495442 0.5153323
nitrates 0.6495442 1.0000000 0.4153815
ammonia 0.5153323 0.4153815 1.0000000
> # eigenvalues and eigenvectors of Sigma
> lambda <- eigen(Sigma)$values
> G <- eigen(Sigma)$vectors
> lambda
[1] 2.0594756 0.6049861 0.3355384
> G
[,1] [,2] [,3]
[1,] 0.6155078 -0.2052230 0.7609426
[2,] 0.5845941 -0.5286604 -0.6154413
[3,] 0.5285829 0.8236515 -0.2054225
> # standard deviations
> sqrt(lambda)
[1] 1.4350873 0.7778085 0.5792567
> # fraction of the variation of the data explained
> # by the first PC, by the first and second PC, etc.
> cumsum(lambda)/sum(lambda)
[1] 0.6864919 0.8881539 1.0000000
> # hence about 89% of the variation of the data
> # is explained by the first two principal components
>
> # first principal component of the i-th observation
> # (Z1)_i = (i-th row of M) x (first eigenvector)
> # i.e. Z1 = (rows of M) x (first column of G)
> # (but only after standardizing all the data of M with "scale")
> Z1<- scale(M)%*%G[,1]
> # second principal component
> Z2 <- scale(M)%*%G[,2]
> # the first two principal components Z1, Z2
> Z12<- matrix(c(Z1,Z2),38,2)
> Z12
[,1] [,2]
[1,] -5.20122624 -0.008114176
[2,] 2.61028148 -2.558734237
[3,] 1.32080591 -2.141088519
[4,] 0.37442794 -1.843102862
[5,] 0.15962963 -1.009548029
[6,] -0.41079658 -0.953633996
[7,] -0.57905692 -0.353740552
[8,] 2.08443078 0.883903100
[9,] 1.84720797 0.861467984
[10,] 1.58913403 0.797551511
[11,] 1.34696309 0.772597578
[12,] 1.08029008 0.709463896
[13,] 0.85767099 0.673822348
[14,] 0.63505189 0.638180801
[15,] -0.61263813 0.809052509
[16,] 0.90401695 0.290047393
[17,] 0.74630290 0.323379337
[18,] 0.57633678 0.314447134
[19,] 0.41002159 0.302213322
[20,] 0.28410937 0.320773253
[21,] 0.87937204 0.225807858
[22,] 0.71070313 0.222696080
[23,] 0.53343307 0.174018546
[24,] 0.36841508 0.167605160
[25,] 0.21929801 0.153805767
[26,] 0.07018095 0.140006375
[27,] -0.40386860 0.420050089
[28,] -0.20485042 0.098418367
[29,] -0.34666564 0.078015758
[30,] -0.47623086 0.053528751
[31,] -0.61804607 0.033126143
[32,] -0.71580738 0.040215669
[33,] -0.84172168 0.012427054
[34,] -0.93948298 0.019516580
[35,] -1.06174635 -0.011573643
[36,] -2.33217650 -0.202323859
[37,] -2.39213133 -0.222429987
[38,] -2.47163800 -0.231848501
>
> # this work is done by the command "princomp"
> # with the option "cor=TRUE" to use correlations instead of covariances
>
> princomp(M,cor=TRUE)
Call:
princomp(x = M, cor = TRUE)
Standard deviations:
Comp.1 Comp.2 Comp.3
1.4350873 0.7778085 0.5792567
3 variables and 38 observations.
> # we recover the eigenvalues = the variances of the principal components,
> # i.e. the squares of the standard deviations:
> lambda<- c(1.4350873,0.7778085,0.5792567)^2
> lambda
[1] 2.0594756 0.6049861 0.3355383
> plot(princomp(M,cor=TRUE))
> # thus we obtain a histogram of the variances/eigenvalues
>
> # we recover G, i.e. the p=3 eigenvectors, called "loadings"
> # e.g. the first column gives the coefficients of Z1
> # with respect to the p original variables; etc.
> princomp(M,cor=TRUE)$loadings
Loadings:
Comp.1 Comp.2 Comp.3
Oxygen 0.616 -0.205 0.761
nitrates 0.585 -0.529 -0.615
ammonia 0.529 0.824 -0.205
> # we recover the principal components, called "scores",
> # of the n=38 observations (princomp standardizes with divisor n,
> # hence the scores differ from Z1, Z2 by the factor sqrt(n/(n-1)))
> princomp(M,cor=TRUE)$scores
Comp.1 Comp.2 Comp.3
1 -5.27104448 -0.008223096 1.74962627
2 2.64532039 -2.593081200 0.32139209
3 1.33853565 -2.169829248 0.48534649
4 0.37945404 -1.867843604 0.60040549
5 0.16177241 -1.023099616 -0.00291438
6 -0.41631087 -0.966435025 0.10888380
7 -0.58682985 -0.358488960 0.27824380
8 2.11241097 0.895768102 0.81868525
9 1.87200382 0.873031830 0.79496630
10 1.61046564 0.808257380 0.77700220
11 1.36504394 0.782968480 0.73014515
12 1.09479127 0.718987327 0.69293812
13 0.86918387 0.682867348 0.65363849
14 0.64357646 0.646747369 0.61433886
15 -0.62086183 0.819912760 0.57056970
16 0.91615195 0.293940821 -0.09887846
17 0.75632084 0.327720194 -0.14602556
18 0.58407319 0.318668089 -0.16817487
19 0.41552548 0.306270058 -0.19421934
20 0.28792309 0.325079127 -0.21846126
21 0.89117622 0.228838972 -0.65285606
22 0.72024319 0.225685423 -0.64797211
23 0.54059356 0.176354470 -0.62198553
24 0.37336047 0.169854994 -0.62099674
25 0.22224174 0.155870367 -0.60855536
26 0.07112301 0.141885739 -0.59611398
27 -0.40928990 0.425688598 -0.23700257
28 -0.20760021 0.099739478 -0.56756897
29 -0.35131907 0.079062997 -0.56291792
30 -0.48262351 0.054247290 -0.54291912
31 -0.62634237 0.033570809 -0.53826808
32 -0.72541596 0.040755500 -0.53570964
33 -0.85302046 0.012593867 -0.51960601
34 -0.95209406 0.019778559 -0.51704757
35 -1.07599862 -0.011729001 -0.50483911
36 -2.36348229 -0.205039737 -0.01917067
37 -2.42424191 -0.225415758 -0.02853065
38 -2.50481583 -0.234960701 -0.04544806
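The standard deviations and the explained fractions of variation recovered above are summarized in one call; a minimal sketch:

```r
> summary(princomp(M, cor=TRUE))   # sqrt(lambda) and cumulative proportions 0.686, 0.888, 1
```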
VI.6B - Use of PC to improve linear regression
> M<-read.table("Bw_river.txt")
> M
Oxygen nitrates ammonia
1 2.27 1.97 0.11
2 4.41 12.83 0.61
3 4.03 11.11 0.53
4 3.75 9.86 0.47
5 3.37 9.54 0.62
6 3.23 8.85 0.56
7 3.18 8.02 0.64
8 4.08 8.94 1.14
9 4.00 8.76 1.11
10 3.92 8.59 1.07
11 3.83 8.43 1.04
12 3.74 8.27 1.00
13 3.66 8.13 0.97
14 3.58 7.99 0.94
15 3.16 6.72 0.83
16 3.43 9.23 0.94
17 3.36 9.10 0.93
18 3.30 8.97 0.91
19 3.24 8.85 0.89
20 3.19 8.74 0.88
21 3.22 9.80 0.95
22 3.17 9.64 0.93
23 3.13 9.49 0.90
24 3.08 9.34 0.88
25 3.04 9.20 0.86
26 3.00 9.06 0.84
27 2.96 8.03 0.82
28 2.93 8.81 0.80
29 2.89 8.69 0.78
30 2.86 8.57 0.76
31 2.82 8.45 0.74
32 2.79 8.35 0.73
33 2.76 8.24 0.71
34 2.73 8.14 0.70
35 2.70 8.04 0.68
36 2.51 6.54 0.48
37 2.49 6.51 0.47
38 2.46 6.46 0.46
> x1<- M[,1]
> x2<- M[,2]
> x3<- M[,3]
>
> # simulation of response data from the three variables x1, x2, x3
> y<-0.4*x1-1.3*x2+0.3*x3+rnorm(38)
> # multiple linear regression of y on x1, x2, x3
> modello<- lm(y~x1+x2+x3)
> summary(modello)
Call:
lm(formula = y ~ x1 + x2 + x3)
Residuals:
Min 1Q Median 3Q Max
-2.5913 -0.6550 0.1072 0.5389 2.5407
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.7140 1.2578 0.568 0.574
x1 0.4754 0.5286 0.899 0.375
x2 -1.2743 0.1563 -8.154 1.64e-09 ***
x3 -1.0922 1.0291 -1.061 0.296
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
Residual standard error: 1.148 on 34 degrees of freedom
Multiple R-squared: 0.7668, Adjusted R-squared: 0.7462
F-statistic: 37.26 on 3 and 34 DF, p-value: 7.506e-11
> # hence the regression is not significant on x1, x3
> # nor on the intercept
> # let us look for the first two principal components
> # to see whether the significance improves
> princomp(M)
Call:
princomp(x = M)
Standard deviations:
Comp.1 Comp.2 Comp.3
1.6169914 0.3761413 0.1772761
3 variables and 38 observations.
> princomp(M,cor=TRUE)
Call:
princomp(x = M, cor = TRUE)
Standard deviations:
Comp.1 Comp.2 Comp.3
1.4350873 0.7778085 0.5792567
3 variables and 38 observations.
> z <- princomp(M,cor=TRUE)$scores
> z1<-z[,1]
> z2<-z[,2]
>
> mo<- lm(y~z1+z2)
> summary(mo)
Call:
lm(formula = y ~ z1 + z2)
Residuals:
Min 1Q Median 3Q Max
-3.3805 -0.8967 0.1238 1.0409 2.9487
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -9.4832 0.2331 -40.687 < 2e-16 ***
z1 1.1548 0.1624 7.111 2.75e-08 ***
z2 0.8245 0.2997 2.752 0.00933 **
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
Residual standard error: 1.437 on 35 degrees of freedom
Multiple R-squared: 0.6242, Adjusted R-squared: 0.6027
F-statistic: 29.07 on 2 and 35 DF, p-value: 3.648e-08
> # now the regression is highly significant on z1, z2
> # and on the intercept