hosting03.snu.ac.kr/~hokim/int/2014/chap9.pdf · 2014-05-14


Chapter 9

Multiple Regression and Correlation

2014/5/15

9.1 Introduction

• One dependent variable $Y$ and $k$ independent variables $x_1, \ldots, x_k$

• $Y$: dependent variable (response variable)

• $x_1, \ldots, x_k$: independent variables (explanatory variables, predictor variables)

9.2 Multiple Regression Model

• Model

$$y_j = \beta_0 + \beta_1 x_{1j} + \beta_2 x_{2j} + \cdots + \beta_k x_{kj} + e_j, \quad j = 1, \ldots, n, \qquad e_j \overset{iid}{\sim} N(0, \sigma^2)$$

(iid: independently and identically distributed)

• Interpreting the coefficients

e.g. two independent variables ($Y$: length of hospital stay, $x_1$: number of previous admissions, $x_2$: age)

$$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + e$$

• $\beta_0 = E(Y \mid x_1 = x_2 = 0)$: the expected value of $Y$ when $x_1$ and $x_2$ are both 0. Centering is needed for this to be meaningful.

• $\beta_1$: the increment of $E(Y)$ corresponding to a unit increase of $x_1$ when $x_2$ is held fixed:

$$E(Y \mid x_1 = a + 1, x_2 = b) - E(Y \mid x_1 = a, x_2 = b) = [\beta_0 + \beta_1 (a + 1) + \beta_2 b] - [\beta_0 + \beta_1 a + \beta_2 b] = \beta_1$$

That is, $\beta_1$ is the effect of $x_1$ on $Y$ after controlling (adjusting) for the effect of $x_2$.
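The interpretation of the coefficients can be checked numerically. The following is a minimal sketch; the coefficient values and the covariate values `a` and `b` are hypothetical and chosen only for illustration.

```python
def expected_y(x1, x2, b0, b1, b2):
    """E(Y) under the two-variable model b0 + b1*x1 + b2*x2."""
    return b0 + b1 * x1 + b2 * x2

# Hypothetical coefficients and covariate values, for illustration only.
b0, b1, b2 = 1.0, 0.5, 0.25
a, b = 3, 40

# Unit increase in x1 with x2 held fixed at b: the difference equals b1.
increment = expected_y(a + 1, b, b0, b1, b2) - expected_y(a, b, b0, b1, b2)

# Expected value at x1 = x2 = 0: equals the intercept b0.
at_origin = expected_y(0, 0, b0, b1, b2)

print(increment, at_origin)
```

Whatever values are chosen for `a` and `b`, the increment stays equal to `b1`, which is what "holding $x_2$ fixed" means in this model.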

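9.3 Estimating the Regression Coefficients

• Find the estimates of $\beta_0, \beta_1, \beta_2$ that minimize $L$:

$$L = \sum_j e_j^2 = \sum_j (y_j - \beta_0 - \beta_1 x_{1j} - \beta_2 x_{2j})^2, \qquad \frac{dL}{d\beta_0} = \frac{dL}{d\beta_1} = \frac{dL}{d\beta_2} = 0$$

• Normal equations

$$n b_0 + b_1 \sum_j x_{1j} + b_2 \sum_j x_{2j} = \sum_j y_j$$

$$b_0 \sum_j x_{1j} + b_1 \sum_j x_{1j}^2 + b_2 \sum_j x_{1j} x_{2j} = \sum_j x_{1j} y_j$$

$$b_0 \sum_j x_{2j} + b_1 \sum_j x_{1j} x_{2j} + b_2 \sum_j x_{2j}^2 = \sum_j x_{2j} y_j$$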
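As a rough illustration, the three normal equations for two independent variables can be assembled from sums and solved directly. This is only a sketch: the data are hypothetical and generated exactly from $y = 1 + 2x_1 + 3x_2$ with no error term, and the helper `solve` is an assumed name, not something from the slides.

```python
def solve(a, rhs):
    """Solve a small linear system by Gauss-Jordan elimination with partial pivoting."""
    n = len(rhs)
    m = [row[:] + [r] for row, r in zip(a, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [u - f * v for u, v in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

# Hypothetical data generated exactly from y = 1 + 2*x1 + 3*x2 (no error term).
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [2.0, 1.0, 4.0, 3.0, 6.0]
y = [1.0 + 2.0 * u + 3.0 * v for u, v in zip(x1, x2)]
n = len(y)

# Sums that appear in the normal equations.
s1, s2, sy = sum(x1), sum(x2), sum(y)
s11 = sum(u * u for u in x1)
s22 = sum(v * v for v in x2)
s12 = sum(u * v for u, v in zip(x1, x2))
s1y = sum(u * w for u, w in zip(x1, y))
s2y = sum(v * w for v, w in zip(x2, y))

# One row per normal equation, in the order b0, b1, b2.
b0, b1, b2 = solve(
    [[n, s1, s2], [s1, s11, s12], [s2, s12, s22]],
    [sy, s1y, s2y],
)
print(b0, b1, b2)
```

Because the data contain no error, the solution recovers the generating coefficients up to rounding.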

Homework

• Exercise 9.3.4

9.4 Evaluating the Regression Model

• Multiple coefficient of determination

Total sum of squares = explained sum of squares + unexplained sum of squares:

$$SST = SSR + SSE$$

$$R^2_{y.12\ldots k} = \frac{\sum_j (\hat{y}_j - \bar{y})^2}{\sum_j (y_j - \bar{y})^2} = \frac{SSR}{SST}$$

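ANOVA Table

Each $b_i \sim N(\beta_i, c_{ii}\sigma^2)$, where $c_{ii}$ is the $i$-th diagonal element of $(X'X)^{-1}$.

$$H_0: \beta_1 = \beta_2 = \cdots = \beta_k = 0 \qquad H_A: \text{not } H_0$$

If V.R. (variance ratio) $> F(k, n-k-1, \alpha)$, then reject $H_0$.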
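For reference, the standard ANOVA layout for a regression with $k$ independent variables and $n$ observations can be sketched as follows. The labels $MSR$ and $MSE$ are the conventional mean-square names and are assumed here rather than taken from the slide.

```latex
\begin{array}{l|c|c|c|c}
\text{Source} & SS & df & MS & V.R. \\ \hline
\text{Regression} & SSR & k & MSR = SSR/k & MSR/MSE \\
\text{Residual} & SSE & n-k-1 & MSE = SSE/(n-k-1) & \\ \hline
\text{Total} & SST & n-1 & &
\end{array}
```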

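Matrix form

$$\underset{n \times 1}{y} = \underset{n \times (k+1)}{X}\ \underset{(k+1) \times 1}{\beta} + \underset{n \times 1}{e}$$

$$\begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix} = \begin{pmatrix} 1 & x_{11} & x_{21} & \cdots & x_{k1} \\ 1 & x_{12} & x_{22} & \cdots & x_{k2} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x_{1n} & x_{2n} & \cdots & x_{kn} \end{pmatrix} \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_k \end{pmatrix} + \begin{pmatrix} e_1 \\ e_2 \\ \vdots \\ e_n \end{pmatrix}$$

$$L = (y - X\beta)'(y - X\beta) = y'y - 2\beta'X'y + \beta'X'X\beta$$

$$\frac{\partial L}{\partial \beta} = -2X'y + 2X'X\beta = 0$$

$$(X'X)\beta = X'y \quad \Rightarrow \quad \hat{\beta} = (X'X)^{-1}X'y$$

Least squares estimator (LSE): $\hat{\beta} = (X'X)^{-1}X'Y$, where

$$X'X = \begin{pmatrix} 1 & 1 & \cdots & 1 \\ x_{11} & x_{12} & \cdots & x_{1n} \\ \vdots & \vdots & & \vdots \\ x_{k1} & x_{k2} & \cdots & x_{kn} \end{pmatrix} \begin{pmatrix} 1 & x_{11} & \cdots & x_{k1} \\ 1 & x_{12} & \cdots & x_{k2} \\ \vdots & \vdots & & \vdots \\ 1 & x_{1n} & \cdots & x_{kn} \end{pmatrix} = \begin{pmatrix} n & \sum x_{1j} & \cdots & \sum x_{kj} \\ \sum x_{1j} & \sum x_{1j}^2 & \cdots & \sum x_{1j}x_{kj} \\ \vdots & \vdots & & \vdots \\ \sum x_{kj} & \sum x_{kj}x_{1j} & \cdots & \sum x_{kj}^2 \end{pmatrix}$$

$$X'Y = \begin{pmatrix} \sum y_j \\ \sum x_{1j}y_j \\ \vdots \\ \sum x_{kj}y_j \end{pmatrix}$$

$$Var(\hat{\beta}) = \sigma^2 (X'X)^{-1}$$

When $k = 2$:

$$Var(\hat{\beta}) = \begin{pmatrix} var(b_0) & cov(b_0, b_1) & cov(b_0, b_2) \\ cov(b_0, b_1) & var(b_1) & cov(b_1, b_2) \\ cov(b_0, b_2) & cov(b_1, b_2) & var(b_2) \end{pmatrix} = \sigma^2 \begin{pmatrix} n & \sum x_{1j} & \sum x_{2j} \\ \sum x_{1j} & \sum x_{1j}^2 & \sum x_{1j}x_{2j} \\ \sum x_{2j} & \sum x_{1j}x_{2j} & \sum x_{2j}^2 \end{pmatrix}^{-1} = \sigma^2 \begin{pmatrix} c_{00} & c_{01} & c_{02} \\ c_{01} & c_{11} & c_{12} \\ c_{02} & c_{12} & c_{22} \end{pmatrix}$$

The $c_{ij}$ are the elements of $(X'X)^{-1}$, so that

$$\begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix} = (X'X) \begin{pmatrix} c_{00} & c_{01} & c_{02} \\ c_{01} & c_{11} & c_{12} \\ c_{02} & c_{12} & c_{22} \end{pmatrix}$$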
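The matrix formulas above can be traced with a few lines of plain Python. This is a sketch under stated assumptions: the data are hypothetical (generated exactly from $y = 1 + 2x_1 + 3x_2$), and `inverse`, `matmul` and `transpose` are small helper names introduced here for illustration.

```python
def inverse(a):
    """Invert a small square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    m = [row[:] + [1.0 if i == j else 0.0 for j in range(n)] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        d = m[col][col]
        m[col] = [v / d for v in m[col]]
        for r in range(n):
            if r != col:
                f = m[r][col]
                m[r] = [u - f * v for u, v in zip(m[r], m[col])]
    return [row[n:] for row in m]

def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def transpose(a):
    return [list(col) for col in zip(*a)]

# Hypothetical data generated exactly from y = 1 + 2*x1 + 3*x2 (no error term).
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [2.0, 1.0, 4.0, 3.0, 6.0]
y = [[1.0 + 2.0 * u + 3.0 * v] for u, v in zip(x1, x2)]   # n x 1 column vector
X = [[1.0, u, v] for u, v in zip(x1, x2)]                 # n x (k+1), first column all ones

Xt = transpose(X)
XtX = matmul(Xt, X)
C = inverse(XtX)                                          # elements c_ij of (X'X)^{-1}
beta_hat = [row[0] for row in matmul(C, matmul(Xt, y))]   # (X'X)^{-1} X'y
identity = matmul(XtX, C)                                 # should reproduce I

print(beta_hat)
```

Multiplying `C` by an estimate of $\sigma^2$ would give the estimated variances and covariances of the coefficients.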

Testing

Hypothesis: $H_0: \beta_i = 0$ vs. $H_A: \beta_i \neq 0$

Test statistic: $t = \dfrac{b_i - 0}{s_{b_i}}$

Standard error: $s_{b_i} = s_{y.12\ldots k}\sqrt{c_{ii}}$

If $|t| > t_{1-\alpha/2}(n-k-1)$, then reject $H_0$.

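9.5 Using the Multiple Regression Equation

• Confidence interval for the subpopulation mean of $Y$ at given values of $X_i$ (estimating the mean of $Y$ for a given $X$)

• Prediction interval for a value of $Y$ at given values of $X_i$ (predicting $Y$ for a given $X$)

For $k = 2$, the confidence interval for the mean is

$$\hat{y}_j \pm t_{1-\alpha/2,\, n-k-1}\; s_{y.12}\sqrt{\frac{1}{n} + c_{11}x_{1j}^2 + c_{22}x_{2j}^2 + 2c_{12}x_{1j}x_{2j}}$$

The $1/n$ term implies that $x_{1j}$ and $x_{2j}$ are presumably measured here as deviations from their sample means, with $c_{11}, c_{22}, c_{12}$ taken from the inverse of the corresponding centered sums-of-squares matrix.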
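A minimal sketch of the confidence interval for the mean with two independent variables follows. All inputs are hypothetical; `d1` and `d2` are assumed to be deviations of $x_1$ and $x_2$ from their sample means, and the critical value is passed in by the caller because the Python standard library has no $t$ quantile function.

```python
import math

def mean_ci(y_hat, t_crit, s, n, c11, c22, c12, d1, d2):
    """CI for the mean of Y at deviations (d1, d2) of x1, x2 from their sample means."""
    radicand = 1.0 / n + c11 * d1 ** 2 + c22 * d2 ** 2 + 2.0 * c12 * d1 * d2
    half = t_crit * s * math.sqrt(radicand)
    return y_hat - half, y_hat + half

# Hypothetical inputs, for illustration only.
common = dict(y_hat=50.0, t_crit=2.306, s=4.0, n=11, c11=0.02, c22=0.01, c12=-0.005)

at_mean = mean_ci(d1=0.0, d2=0.0, **common)    # at the sample means of x1 and x2
away = mean_ci(d1=3.0, d2=-2.0, **common)      # away from the sample means

print(at_mean, away)
```

At the sample means the half-width reduces to $t \cdot s / \sqrt{n}$, and it grows as the point of interest moves away from the means.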

9.6 Qualitative Independent Variables

• Variables

Quantitative (continuous): score, age

Qualitative (categorical): sex, race, job

• A qualitative variable enters the model through dummy variables, which take only the values 0 and 1. A qualitative variable with $k$ categories requires $k - 1$ dummy variables.

Examples of dummy variables

* Sex (male, female)

$$x_1 = \begin{cases} 1 & \text{male} \\ 0 & \text{otherwise} \end{cases}$$

* Residential area (urban, rural, suburban)

$$x_2 = \begin{cases} 1 & \text{urban} \\ 0 & \text{otherwise} \end{cases} \qquad x_3 = \begin{cases} 1 & \text{rural} \\ 0 & \text{otherwise} \end{cases}$$

* Smoking status (current smoker, ex-smoker (<= 5 years), ex-smoker (> 5 years), nonsmoker)

$$x_4 = \begin{cases} 1 & \text{smoker} \\ 0 & \text{otherwise} \end{cases} \qquad x_5 = \begin{cases} 1 & \text{ex-smoker (<= 5 years)} \\ 0 & \text{otherwise} \end{cases} \qquad x_6 = \begin{cases} 1 & \text{ex-smoker (> 5 years)} \\ 0 & \text{otherwise} \end{cases}$$
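The rule of $k$ categories giving $k - 1$ dummy variables can be sketched as a small helper. The function name `dummy_code` and the level labels are assumptions made for this illustration; the omitted level acts as the reference category.

```python
def dummy_code(value, levels, reference):
    """Return k-1 indicators (0/1) for `value`, omitting the reference level."""
    if value not in levels:
        raise ValueError(f"unknown level: {value!r}")
    return [1 if value == level else 0 for level in levels if level != reference]

# Smoking status with four categories, nonsmoker as the reference.
smoking_levels = ["smoker", "ex-smoker<=5", "ex-smoker>5", "nonsmoker"]

for status in smoking_levels:
    print(status, dummy_code(status, smoking_levels, reference="nonsmoker"))
```

The reference category is coded as all zeros, so its expected value is absorbed into the intercept and every dummy coefficient is a contrast against it.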

Example 9.6.1

Data columns: case number, birth weight, gestation (weeks), smoking status of the mother

$Y$: birth weight (grams)

$x_1$: gestation (weeks)

$x_2$: smoking status of the mother, $x_2 = 1$ for a smoker (S) and $x_2 = 0$ for a nonsmoker (N)

Model 1: $E(Y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2$

for nonsmokers: $E(Y) = \beta_0 + \beta_1 x_1$

for smokers: $E(Y) = (\beta_0 + \beta_2) + \beta_1 x_1$

Same slope, different intercepts.

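$\hat{\beta}_2$: the expected difference in birth weight between babies of smokers and babies of nonsmokers at the same gestation, i.e. for any given value $x_1$:

$$\beta_2 = E(Y \mid \text{smoker}, X_1 = x_1) - E(Y \mid \text{nonsmoker}, X_1 = x_1)$$

Test of $H_0: \beta_2 = 0$:

$$T = \frac{\hat{\beta}_2 - 0}{se(\hat{\beta}_2)} = -5.83, \qquad \hat{\beta}_2 \approx -245 \text{ grams}$$

Since $|T| = 5.83 > 2.0452$, reject $H_0$: the two groups are significantly different.

95% confidence interval (CI) for $\beta_2$: $b_2 \pm t\, s_{b_2} = (-330.3975, -158.6825)$ grams

[Figure: Model 1, two parallel regression lines with common slope $\beta_1$; intercept $\beta_0$ for nonsmokers and $\beta_0 + \beta_2$ for smokers]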
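The test and the confidence interval for $\beta_2$ are tied together arithmetically. The sketch below works backwards from the reported 95% interval and critical value; it assumes that both interval endpoints are negative (smokers' babies being lighter), and the standard error it recovers is derived here rather than quoted from the slide.

```python
# Reported 95% CI endpoints (grams) and critical t value from Example 9.6.1.
# Assumption: both endpoints carry a minus sign.
lower, upper = -330.3975, -158.6825
t_crit = 2.0452

b2 = (lower + upper) / 2.0              # point estimate is the interval midpoint
se_b2 = (upper - lower) / 2.0 / t_crit  # half-width divided by the critical value
t_stat = (b2 - 0.0) / se_b2             # test statistic for H0: beta_2 = 0
reject_h0 = abs(t_stat) > t_crit

print(round(b2, 2), round(se_b2, 2), round(t_stat, 2), reject_h0)
```

The recovered statistic matches the reported value of about 5.83 in absolute size, so the interval excluding 0 and the rejection of $H_0$ are two views of the same result.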

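Model 2: $E(Y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2$

for nonsmokers: $E(Y) = \beta_0 + \beta_1 x_1$

for smokers: $E(Y) = (\beta_0 + \beta_2) + (\beta_1 + \beta_3) x_1$

Different slopes, different intercepts.

• If $\beta_3$ is significant, the slopes differ between smokers and nonsmokers.

• If $\beta_2$ is significant, the intercepts differ. This is not important without centering.

[Figure: Model 2. Horizontal axis: gestation; vertical axis: birth weight. Nonsmoker line with intercept $\beta_0$ and slope $\beta_1$; smoker line with intercept $\beta_0 + \beta_2$ and slope $\beta_1 + \beta_3$; 38 weeks is marked on the gestation axis.]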

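• Centering

Let $x_1^* = x_1 - 38$ (gestation centered at 38 weeks). Then

$$E(Y) = \beta_0 + \beta_1 x_1^* + \beta_2 x_2 + \beta_3 x_1^* x_2$$

for nonsmokers: $E(Y) = \beta_0 + \beta_1 x_1^*$, so $\beta_0 = E(Y \mid x_1 = 38)$ for nonsmokers, which is a meaningful parameter of interest.

for smokers: $E(Y) = (\beta_0 + \beta_2) + (\beta_1 + \beta_3) x_1^*$, so $\beta_2$ is not the difference in expected values at $x_1 = 0$ but the difference between smokers and nonsmokers at $x_1 = 38$:

$$\beta_2 = E(Y \mid x_1 = 38, \text{smoker}) - E(Y \mid x_1 = 38, \text{nonsmoker})$$

* Lesson: after centering a continuous variable, the intercept is the expected value at $x_1$ = the centering value rather than at $x_1 = 0$, which makes it more meaningful.

* Another effect of centering: it weakens multicollinearity among the $x$ variables.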
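The effect of centering on the coefficients of Model 2 can be sketched algebraically: the fitted lines stay the same, only the intercept-type terms are re-expressed at 38 weeks. The coefficient values below are hypothetical and are not estimates from the example.

```python
def predict(x1, smoker, b0, b1, b2, b3, center=0.0):
    """E(Y) under Model 2, with gestation entered as (x1 - center)."""
    xc = x1 - center
    return b0 + b1 * xc + b2 * smoker + b3 * xc * smoker

# Hypothetical coefficients of the uncentered Model 2 (not estimates from the example).
b0, b1, b2, b3 = -2000.0, 130.0, 150.0, -10.0
c = 38.0

# Equivalent coefficients after centering x1 at c: slopes are unchanged,
# while the intercept and the smoker offset are evaluated at x1 = c.
b0_c = b0 + b1 * c
b2_c = b2 + b3 * c

same_fit = predict(40.0, 1, b0, b1, b2, b3) == predict(40.0, 1, b0_c, b1, b2_c, b3, center=c)
gap_at_38 = predict(c, 1, b0, b1, b2, b3) - predict(c, 0, b0, b1, b2, b3)

print(b0_c, b2_c, same_fit, gap_at_38)
```

After centering, the intercept is the nonsmoker mean at 38 weeks and the smoker coefficient is the smoker minus nonsmoker gap at 38 weeks, instead of both being extrapolations to a gestation of 0 weeks.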

• Example 9.6.2

$Y$: treatment effect

$X_1$: age (quantitative)

Treatment method (qualitative): A, B, C

$X_2 = 1$ if trt = A, 0 otherwise

$X_3 = 1$ if trt = B, 0 otherwise

Model:

$$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_1 x_2 + \beta_5 x_1 x_3 + e$$

for trt = A: $E(Y) = (\beta_0 + \beta_2) + (\beta_1 + \beta_4) x_1$

for trt = B: $E(Y) = (\beta_0 + \beta_3) + (\beta_1 + \beta_5) x_1$

for trt = C: $E(Y) = \beta_0 + \beta_1 x_1$

$\beta_0, \beta_1$: intercept and slope for the reference cell C

$\beta_2$: difference of intercepts (A - C), = 0 ?

$\beta_3$: difference of intercepts (B - C), = 0 ?

$\beta_4$: difference of slopes (A - C), = 0 ?

$\beta_5$: difference of slopes (B - C), = 0 ?
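The way the six coefficients collapse into one line per treatment can be sketched as follows. The regressors are built the same way as in mreg.sas (treatment C as the reference); the coefficient values are hypothetical and are not the fitted estimates.

```python
def design_row(age, method):
    """Regressors (1, x1, x2, x3, x12, x13), with treatment C as the reference."""
    x1 = float(age)
    x2 = 1.0 if method == "A" else 0.0
    x3 = 1.0 if method == "B" else 0.0
    return [1.0, x1, x2, x3, x1 * x2, x1 * x3]

def group_line(beta, method):
    """Intercept and slope of E(Y) on age for one treatment."""
    b0, b1, b2, b3, b4, b5 = beta
    if method == "A":
        return b0 + b2, b1 + b4
    if method == "B":
        return b0 + b3, b1 + b5
    return b0, b1  # reference cell C

# Hypothetical coefficients, for illustration only.
beta = [6.0, 1.0, 41.0, 22.0, -0.7, -0.5]

for method in ("A", "B", "C"):
    intercept, slope = group_line(beta, method)
    fitted = sum(b * x for b, x in zip(beta, design_row(30, method)))
    print(method, intercept, slope, fitted)
```

For each treatment, the full model evaluated on the design row agrees with the group-specific intercept plus slope times age.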

Example 9.6.2: mreg.sas

/* File mreg.sas
   multiple regression for table 9.6.3 */
data reg;
input effect age method $;
x1=age; x2=(method='A'); x3=(method='B');   /* dummy variables, C is the reference */
x12=x1*x2; x13=x1*x3;                       /* age-by-treatment interaction terms */
cards;
56 21 A

41 23 B

40 30 B

28 19 C

55 28 A

25 23 C

46 33 B

71 67 C

48 42 B

63 33 A

52 33 A

62 56 C

50 45 C

45 43 B

58 38 A

46 37 C

58 43 B

34 27 C

65 43 A

55 45 B

57 48 B

59 47 C

64 48 A

61 53 A

62 58 B

36 29 C

69 53 A

47 29 B

73 58 A

64 66 B

60 67 B

62 63 A

71 59 C

62 51 C

70 67 A

71 63 C

;

run;

proc reg;
model effect=x1 x2 x3 x12 x13;
output out=d p=pred;
id age method;
run;
proc sort; by method;
proc gplot;
plot effect*age=method / legend;
symbol1 v='A' i=r c=c2 l=1;
symbol2 v='B' i=r c=c2 l=2;
symbol3 v='C' i=r c=c2 l=3;
run;

homework

• Exercise 9.6.3

9.7 Multiple Correlation Model

$$y_j = \beta_0 + \beta_1 x_{1j} + \cdots + \beta_k x_{kj} + e_j$$

• The $x_i$ and $y$ are all random variables.

• $x$ and $y$ follow a multivariate normal distribution.

• The degree of correlation between $Y$ and the $X$ variables is then measured by the multiple correlation coefficient.

Example 9.7.1

$Y$: serum cholesterol, $X_1$: weight, $X_2$: SBP (systolic blood pressure)

* Multiple correlation coefficient

$SSR = 817.876$, $SST = 1{,}099.669$

$$R^2_{y.12} = \frac{SSR}{SST} = \frac{817.876}{1{,}099.669} = .7437, \qquad R_{y.12} = \sqrt{.7437} = .86$$

$$H_0: \rho_{y.12\ldots k} = 0$$

$$F = \frac{R^2_{y.12\ldots k}}{1 - R^2_{y.12\ldots k}} \cdot \frac{n-k-1}{k} \sim F(k, n-k-1)$$

$F = 11.61 > 11.04$, so $p < 0.005$.

⇒ There is a significant linear association between serum cholesterol and (SBP and weight).
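The numbers in this example can be reproduced from the two sums of squares. The sketch below takes $n = 11$ from the number of data rows in mcorr.sas and $k = 2$ from the two independent variables; the variable names are chosen here for illustration.

```python
import math

# Sums of squares reported in Example 9.7.1.
ssr, sst = 817.876, 1099.669
n, k = 11, 2   # 11 observations, 2 independent variables (weight, SBP)

r2 = ssr / sst                  # multiple coefficient of determination
r = math.sqrt(r2)               # multiple correlation coefficient

# F statistic written in terms of R^2.
f_from_r2 = (r2 / (1.0 - r2)) * ((n - k - 1) / k)

# The same statistic as the ANOVA variance ratio MSR / MSE.
sse = sst - ssr
variance_ratio = (ssr / k) / (sse / (n - k - 1))

print(round(r2, 4), round(r, 2), round(f_from_r2, 2), round(variance_ratio, 2))
```

Both routes give the same $F$, which exceeds the critical value 11.04 quoted in the example.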

Example 9.7.1: mcorr.sas

/* file mcorr.sas

SAS example for Table 9.7.1

*/

data mcorr;

input chol weight sbp;

cards;

162.2 51.0 108

158.0 52.9 111

157.0 56.0 115

155.0 56.5 116

156.0 58.0 117

154.1 60.1 120

169.1 58.0 124

181.0 61.0 127

174.9 59.4 122

180.2 56.1 121

174.0 61.2 125

;run;

proc plot;

plot chol*weight chol*sbp

sbp*weight;

run;

proc corr;

var chol weight sbp ;

run;

proc corr;

var chol weight ;

partial sbp ;

run;

proc corr;

var chol sbp ;

partial weight ;

run;

proc corr;

var weight sbp ;

partial chol ;

run;

Partial Correlation Coefficient

• Measures the linear association between two variables after controlling for the effects of other covariates

e.g. $r_{y1.2}$: the partial correlation coefficient between $Y$ and $X_1$ when $X_2$ is held constant (fixed at the same value)

$$H_0: \rho_{y1.2\ldots k} = 0, \qquad t = r_{y1.2\ldots k}\sqrt{\frac{n-k-1}{1 - r^2_{y1.2\ldots k}}}$$

For $H_0: \rho_{12.y} = 0$:

$$t = .948\sqrt{\frac{11-2-1}{1 - .948^2}} = 8.425$$

∴ We may conclude that there is a significant linear association between SBP and weight when serum cholesterol is held constant (i.e. after adjusting for the effect of cholesterol).
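The test statistic for a partial correlation coefficient is a one-line computation. A minimal sketch, using the values from the example ($r = .948$, $n = 11$, $k = 2$); the function name `partial_corr_t` is an assumption made here.

```python
import math

def partial_corr_t(r, n, k):
    """t statistic for H0: partial correlation = 0, with n - k - 1 degrees of freedom."""
    return r * math.sqrt((n - k - 1) / (1.0 - r ** 2))

t_stat = partial_corr_t(0.948, n=11, k=2)
df = 11 - 2 - 1

print(round(t_stat, 3), df)
```

The statistic is compared with a $t$ distribution on $n - k - 1 = 8$ degrees of freedom.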

homework

• Exercise 9.7.1

Pairwise plot in R

## pair.r

## put histograms on the diagonal

panel.hist <- function(x, ...)

{

usr <- par("usr"); on.exit(par(usr))

par(usr = c(usr[1:2], 0, 1.5) )

h <- hist(x, plot = FALSE)

breaks <- h$breaks; nB <- length(breaks)

y <- h$counts; y <- y/max(y)

rect(breaks[-nB], 0, breaks[-1], y, col="cyan", ...)

}

## put (absolute) correlations on the upper panels,

## with size proportional to the correlations.

panel.cor <- function(x, y, digits=2, prefix="", cex.cor, ...)

{

usr <- par("usr"); on.exit(par(usr))

par(usr = c(0, 1, 0, 1))

r <- abs(cor(x, y))

txt <- format(c(r, 0.123456789), digits=digits)[1]

txt <- paste(prefix, txt, sep="")

if(missing(cex.cor)) cex.cor <- 0.8/strwidth(txt)

text(0.5, 0.5, txt, cex = cex.cor * r)

}

?swiss

summary(swiss)

cor(swiss)

pairs(swiss)

pairs(swiss, lower.panel=panel.smooth, upper.panel=panel.cor)

pairs(swiss, lower.panel=panel.smooth, upper.panel=panel.cor, diag.panel=panel.hist)

9.8 variable(model) selection

• Forward selection

• Backward elimination

• Stepwise selection
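To make the idea concrete, forward selection can be sketched in a few lines: start with the intercept only, and at each step add the candidate that reduces the residual sum of squares the most, as long as its partial $F$ statistic passes a threshold. This is only an illustration under stated assumptions: the data are hypothetical, the fixed F-to-enter value of 4.0 is an arbitrary choice (SAS works with significance levels instead), and no removal step is included, so it is not full stepwise selection.

```python
def solve(a, rhs):
    """Solve a small linear system by Gauss-Jordan elimination with partial pivoting."""
    n = len(rhs)
    m = [row[:] + [r] for row, r in zip(a, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [u - f * v for u, v in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def sse(columns, y):
    """Residual sum of squares of the least-squares fit of y on an intercept plus columns."""
    rows = [[1.0] + [c[j] for c in columns] for j in range(len(y))]
    p = len(rows[0])
    xtx = [[sum(r[a] * r[b] for r in rows) for b in range(p)] for a in range(p)]
    xty = [sum(r[a] * yj for r, yj in zip(rows, y)) for a in range(p)]
    beta = solve(xtx, xty)
    return sum((yj - sum(bi * xi for bi, xi in zip(beta, r))) ** 2 for r, yj in zip(rows, y))

def forward_select(candidates, y, f_to_enter=4.0):
    """Add variables one at a time while the best candidate's partial F passes the threshold."""
    selected = []
    current = sse([], y)
    while len(selected) < len(candidates):
        remaining = [name for name in candidates if name not in selected]
        trial = {name: sse([candidates[s] for s in selected + [name]], y) for name in remaining}
        best = min(trial, key=trial.get)
        df_error = len(y) - (len(selected) + 1) - 1
        f_stat = (current - trial[best]) / (trial[best] / df_error)
        if f_stat < f_to_enter:
            break
        selected.append(best)
        current = trial[best]
    return selected

# Hypothetical data: y depends on xa only, plus a small disturbance.
xa = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
xb = [5.0, 3.0, 6.0, 2.0, 7.0, 1.0, 8.0, 4.0]
noise = [0.1, -0.1, 0.1, -0.1, -0.1, 0.1, -0.1, 0.1]
y = [2.0 + 3.0 * u + e for u, e in zip(xa, noise)]

selected = forward_select({"xa": xa, "xb": xb}, y)
print(selected)
```

Backward elimination runs the same comparison in the opposite direction, starting from the full model and dropping the weakest variable, and stepwise selection alternates between the two.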

mod18.sas

/* file : mod18.sas
   Multiple Regression Model with stepwise selection */
filename electric 'd:\myweb\int\electric.dat';
data peak;
infile electric;
input housize 1-3 income 6-11 aircapac 14-16 applindx 19-23
      family 26-28 peak 31-35;
label housize = 'House Size'
      income = 'Family Income'
      aircapac = 'Air Conditioning Capacity'
      applindx = 'Appliance Index'
      family = 'Number of Family Members'
      peak = 'Peak Hour Electric Load';
run;
proc reg data=peak;
model peak = housize income aircapac applindx family
      / selection=stepwise;
title 'Multiple Regression Model with stepwise selection';
run;
/* compare subsets: best 2 models of each size by R-square,
   with Mallows Cp, adjusted R-square and MSE */
proc reg data=peak outest=est;
model peak = housize income aircapac applindx family
      / selection=rsquare cp adjrsq mse best=2;
title 'Multiple Regression Model with R-square selection';
run;
proc print;
title 'Actual Coefficients, etc.';
proc plot;
plot _cp_*_in_='C' _p_*_in_='*' / overlay
     vaxis=0 to 25 by 5 haxis=1 to 5
     hpos=40 vpos=30;
title;
run;

homework

• Review Problem 10, using SAS
