chapter 10hosting03.snu.ac.kr/~hokim/int/2018/chap_10.pdf · · 2 days ago그림10.1.1...

Chapter 10

상관 분석

Correlation analysis

상관모형 (Correlation)

• 전형적인 회귀모형에서 X는 고정(fixed non-random)

Y는 확률적 (random)

• X, Y가 동시에 확률적 상관모형

(If both X & Y are random -> bivariate normal dist’n)

이변량 정규분포2 2

2

| |

2

| |

( , ) ~ ( , , , , )

| ~ ( , )

| ~ ( , )

x y x y

x y x y

y x y x

X Y BN

X Y y N

Y X x N

• 는 모집단 상관계수로 다음과 같이 정의 된다.

(population correlation coefficeint is defined by)

( , ) [( )( )]X Y

X Y X Y

Cov X Y E X Y

X

Y

…..

..

X

Y

그림 10.1.1 이변량 정규분포bivariate normal dist’n

(a)는 이변량정규분포의 결합밀도함수를 보여준다.(b)는 이변량 정규분포에서 X가 주어졌을 때 Y의 부분모집단의 주변분포함수를 보여주며, 이때 부분모집단은 정규분포를 따른다.(c)는 Y가 주어졌을 때 X의 부분모집단의 분포를 보여주며, (b)와 마찬가지로 이 경우에도 정규분포를 따른다.

(a) (b) (c)

10.2 상관계수(Correlation Coefficient)

births<-read.csv('E:\\kim\\yes\\myweb\\int\\2018\\data\\birth.csv',header=T)

births$bweight #birth weight of baby

births$gestwks #gestation period

model<-lm(births$bweight~births$gestwks)

summary(model)

cor.test(births$bweight,births$gestwks)

•Sample correlation (r2) and the coefficient of determination (r)

2

1

2 2 2 2

2 2 2 2

/ ( 1)

( ) / / ( 1) ( ) / / ( 1)

( ) ( )

i i i i

i i i i

i i i i

i i i i

r r

x y n x y nr

x x n n y y n n

n x y x y

n x x n y y

표본상관계수 : 관계수모집단 상 의 추정치

• If

2

2~ ( 2)

1

nt r t n

r

0

•When is not 0

① Fisher’s method ( n>25)1 1

ln2 1

~ (0 , 1)1 3

r

r

rz

r

z zz N

n

r=0.872, n=203 H0 :ρ=0.80 vs HA :ρ≠0.80

𝑟 = 0.872 -> 𝑧𝑟=1.34137 by (10.2.4) 𝜌𝑜 = 0.8, 𝑧𝜌𝑜 = 1.09861

𝑍 =1.34137 − 1.09861

1/ 203 − 3= 3.43

-> Reject Ho

*

*

* ** * *

3

4

1

1

test stat : ( ) 11 1

(3 )where :

4

rr

z

z rz z

n

n

zz z n

n

z

n

=z

Hotelling’s method (10≤n≤25)

의 신뢰구간 (CI)

𝑧𝑟 ± 𝑧1−𝛼/2 ∙1

𝑛−3

𝑧𝜌에 대한 95% 신뢰구간 (CI)

1.34137 ± 1.96 ∙1

203−3⟹ (1.20277, 1.47996)

𝒛𝒓 𝒓

1.20277 0.834

1.47996 0.901

> tanh(1.20277)[1] 0.8344976> tanh(1.47996)[1] 0.9014605> library(psych)> fisherz(0.834)[1] 1.201133> fisherz(0.8344976)[1] 1.20277> fisherz(0.9014605)[1] 1.47996

10.3 Some precautions

1. Checking assumptions(Normality, linear relations, homogeneity, etc.)

2. Two variables from a same subject

3. Strong association does not necessarily mean causal effect.

– X ≈Y : X->Y or Y->X or

A->X & A->Y or by chance

4. Be cautious on extrapolation (외연법)

•외연법 (extrapolation)

# regg.Rregg<-read.table(header=TRUE, text="xx yy 50 6155 6160 5965 7170 8075 7680 9085 10690 9895 100100 114")attach(regg)result<-lm(yy~xx) summary(result)plot(xx,yy,xlim=c(0,100),ylim=c(0,120))abline(result)p<-as.data.frame(predict(result, level=0.95, interval="confidence"))lines(cbind(xx,p$lwr), col="blue", lty="dashed")lines(cbind(xx,p$upr), col="blue", lty="dashed")

plot(xx,yy,xlim=c(0,100),ylim=c(0,120))abline(result)pp<-as.data.frame(predict(result, level=0.95, interval="prediction"))lines(cbind(xx,pp$lwr), col="red", lty="dashed")lines(cbind(xx,pp$upr), col="red", lty="dashed")

SAS example corr.sas

/* file corr.sas

SAS example for Chapter 8 */

data corr;

input method1 method2;

x2=method1**2;

y2=method2**2;

xy=method1*method2 ;

cards;

132 130

138 134

144 132

146 140

148 150

152 144

158 150

130 122

162 160

168 150

172 160

174 178

180 168

180 174

188 186

194 172

194 182

200 178

200 196

204 188

210 180

210 196

216 210

220 190

220 202

;run;

* To compare sums and squared sum

on Table 8.7.2 ;

proc means sum;

var method1 method2 x2 y2 xy;

run;

proc plot;

plot method2*method1;

run;

proc corr;

var method1 method2 ;

run;

proc corr fisher(rho0=0.98);var method1 method2 ;run;

R examplecorr<-

read.table(header=TRUE,

text="

method1 method2

132 130

138 134

144 132

146 140

148 150

152 144

158 150

130 122

162 160

168 150

172 160

174 178

180 168

180 174

188 186

194 172

194 182

200 178

200 196

204 188

210 180

210 196

216 210

220 190

220 202

")

attach(corr)

cor.test(method1,method2)

10.4 중상관모형multiple correlation model

Both x and y are random variablesx and y ~ multivariate normalMultiple correlation can be used to see the correlation between them.

0 1 1i j k kj jy x x

x와 y가 모수확률변수일 때

x와 y가 다변량 정규분포일 때

Y와 Xi와의 상관정도 중상관계수

보기 10.4.1> data <-read.csv('E:\\kim\\yes\\myweb\\int\\2018\\data\\example_10_4_1.csv',header=T)

> summary(lm(y~x1+x2,data=data))

Call:

lm(formula = y ~ x1 + x2, data = data)

Residuals:

Min 1Q Median 3Q Max

-62.98 -31.45 -9.12 15.25 258.87

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 34.81450 12.35254 2.818 0.0077 **

x1 34.53919 22.26469 1.551 0.1293

x2 0.04143 0.01259 3.291 0.0022 **

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 55.15 on 37 degrees of freedom

Multiple R-squared: 0.3269, Adjusted R-squared: 0.2905

F-statistic: 8.984 on 2 and 37 DF, p-value: 0.0006602

𝑅𝑦.122 = 0.3269 -> R𝑦.12 = R𝑦.12

2 = 0.572

Test: 𝐹 =𝑅𝑦.12⋯𝑘2

1−𝑅𝑦.12⋯𝑘2 ∙

𝑛−𝑘−1

𝑘

𝐹 =0.327

1−0.327∙40−2−1

2= 8.9848 > qf(0.05,2,37,lower.tail=F)=3.252

유의수준 0.05에서 영가설 𝐻0 ∶ 𝛽1 = 𝛽2 = 0을 기각하고

모집단에서 𝑌, 𝑋1, 𝑋2 사이에 선형성이 존재한다고 결론 내림

/* file mcorr.sas

SAS example for Table 9.7.1

*/

data mcorr;

input chol weight sbp;

cards;

162.2 51.0 108

158.0 52.9 111

157.0 56.0 115

155.0 56.5 116

156.0 58.0 117

154.1 60.1 120

169.1 58.0 124

181.0 61.0 127

174.9 59.4 122

180.2 56.1 121

174.0 61.2 125

;run;

proc plot;

plot chol*weight chol*sbp

sbp*weight;

run;

proc corr;

var chol weight sbp ;

run;

proc corr;

var chol weight ;

partial sbp ;

run;

proc corr;

var chol sbp ;

partial weight ;

run;

proc corr;

var weight sbp ;

partial chol ;

run;

부분상관계수 (Partial Corr. Coef)

• 다른 변수의 효과를 제어한 상태에서의 관계조사

• Linear association after controlling for other covariates

1.2 2

1

. . :y

e g r X

Y X

를 상수로 하고,

와 과의 상관성을 측정하는 부분상관계수

1.2 2

1

. . : when y

e g r X

Y X

is a constant (is fixed to a same value),

partial corr. coef. bewteen and

0 1.2...

21.2...1.2...

: 0

1

1

y k

y ky k

H

n kt r

r

𝑋2를 통제한 후 𝑌와 𝑋1 사이의 편상관 𝑟𝑦1.2 =(𝑟𝑦1−𝑟𝑦2𝑟12)

(1−𝑟𝑦22 )(1−𝑟12

2 )

𝑋1를 통제한 후 𝑌와 𝑋2 사이의 편상관

𝑟𝑦2.1 =(𝑟𝑦2−𝑟𝑦1𝑟12)

(1−𝑟𝑦12 )(1−𝑟12

2 )

𝑌를 통제한 후 𝑋1과 𝑋2 사이의 편상관 𝑋𝑟12.𝑦 =(𝑟12−𝑟𝑦1𝑟𝑦2)

(1−𝑟𝑦12 )(1−𝑟𝑦2

2 )

> data<-read.csv("E:\\kim\\yes\\myweb\\int\\2018\\data\\example_10_4_1.csv",header=T)> cor(data)

y x1 x2y 1.0000000 0.3604168 0.5320713x1 0.3604168 1.0000000 0.3025645x2 0.5320713 0.3025645 1.0000000> library(ppcor)> pcor.test(data$y,data$x1,data$x2)

estimate p.value statistic n gp Method1 0.2471221 0.1293425 1.551299 40 1 pearson> pcor.test(data$x1,data$x2,data$y)

estimate p.value statistic n gp Method1 0.1402861 0.3943218 0.8618496 40 1 pearson> pcor.test(data$y,data$x2,data$x1)

estimate p.value statistic n gp Method1 0.4758025 0.002202508 3.290531 40 1 pearson>

pair-wise plot by R ## pair.r

## put histograms on the diagonal

panel.hist <- function(x, ...)

{

usr <- par("usr"); on.exit(par(usr))

par(usr = c(usr[1:2], 0, 1.5) )

h <- hist(x, plot = FALSE)

breaks <- h$breaks; nB <- length(breaks)

y <- h$counts; y <- y/max(y)

rect(breaks[-nB], 0, breaks[-1], y, col="cyan", ...)

}

## put (absolute) correlations on the upper panels,

## with size proportional to the correlations.

panel.cor <- function(x, y, digits=2, prefix="", cex.cor, ...)

{

usr <- par("usr"); on.exit(par(usr))

par(usr = c(0, 1, 0, 1))

r <- abs(cor(x, y))

txt <- format(c(r, 0.123456789), digits=digits)[1]

txt <- paste(prefix, txt, sep="")

if(missing(cex.cor)) cex.cor <- 0.8/strwidth(txt)

text(0.5, 0.5, txt, cex = cex.cor * r)

}

?swiss

summary(swiss)

cor(swiss)

pairs(swiss)

pairs(swiss, lower.panel=panel.smooth, upper.panel=panel.cor)

pairs(swiss, lower.panel=panel.smooth, upper.panel=panel.cor, diag.panel=panel.hist)

chapter 10hosting03.snu.ac.kr/~hokim/int/2018/chap_10.pdf · · 2 days ago그림10.1.1...

Documents