
Categorical Data Analysis

Week 2

Binary Response Models

binary and binomial responses

binary: y assumes values of 0 or 1

binomial: y is the number of “successes” in n “trials”

distributions

Bernoulli: $\Pr(y \mid p) = p^{y}(1-p)^{1-y}$

Binomial: $\Pr(y \mid n, p) = \binom{n}{y} p^{y}(1-p)^{n-y}$

Transformational Approach

linear probability model

use grouped data (events/trials): $p_i = y_i / n_i$

“identity” link: $\eta_i = I(p_i) = \mathbf{x}_i'\boldsymbol{\beta}$

linear predictor: $\eta_i = \mathbf{x}_i'\boldsymbol{\beta}$

problem: predictions can fall outside [0,1]

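The Logit Model

logit transformation: $\operatorname{logit}(p_i) = \log\left(\frac{p_i}{1-p_i}\right) = \eta_i = \mathbf{x}_i'\boldsymbol{\beta}$

inverse logit: $p_i = \Lambda(\eta_i) = \frac{\exp(\eta_i)}{1+\exp(\eta_i)}$

ensures that p is in [0,1] for all values of x and β.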

The Logit Model

odds and odds ratios are the key to understanding and interpreting this model

the log odds transformation is a “stretching” transformation to map probabilities to the real line

Odds and Probabilities

[Figure: probability (0 to 1) plotted against odds (0 to 30).]

Probabilities and Log Odds

[Figure: probability (0 to 1) plotted against log(odds) (-6 to 6).]

The Logit Transformation

properties of logit

[Figure: p (0 to 1) plotted against logit (-6 to 6); part of the curve is annotated “linear”.]

Odds, Odds Ratios, and Relative Risk

odds of “success” is the ratio: $\omega = \frac{p}{1-p}$

consider two groups with success probabilities $p_1$ and $p_2$

odds ratio (OR) is a measure of the odds of success in group 1 relative to group 2: $\theta = \frac{\omega_1}{\omega_2} = \frac{p_1/(1-p_1)}{p_2/(1-p_2)}$

Odds Ratio

2 X 2 table:

           Y = 0   Y = 1
  X = 0      50      15
  X = 1      15      20

OR is the cross-product ratio (compare x = 1 group to x = 0 group): $\hat\theta = \frac{50 \times 20}{15 \times 15} = 4.44$

odds of y = 1 are about 4.4 times as high when x = 1 as when x = 0

Odds Ratio

equivalent interpretation: $1/\hat\theta = \frac{15 \times 15}{50 \times 20} = 0.225$

odds of y = 1 are 0.225 times as high when x = 0 as when x = 1

odds of y = 1 are lower by a proportion of 1 - 0.225 = 0.775 when x = 0 than when x = 1

odds of y = 1 are 77.5% lower when x = 0 than when x = 1
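The cross-product ratio and its reciprocal can be checked by hand. The following is a minimal sketch that only uses the four cell counts from the 2 X 2 table; the scalar names are illustrative and not part of the original slides.

```stata
* odds of y = 1 within each level of x, from the 2 X 2 table
scalar odds_x1 = 20/15
scalar odds_x0 = 15/50
* odds ratio comparing x = 1 to x = 0; equals (50*20)/(15*15)
scalar or10 = odds_x1/odds_x0
* reciprocal: compares x = 0 to x = 1
scalar or01 = 1/or10
* percent by which the odds are lower when x = 0
scalar pct_lower = 100*(1 - or01)
display "OR (x=1 vs x=0) = " or10
display "OR (x=0 vs x=1) = " or01
display "percent lower   = " pct_lower
```

The three displayed values should agree with the 4.44, 0.225, and 77.5% quoted on the slides.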

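Log Odds Ratios

Consider the model: $\operatorname{logit}(p_i) = \beta_0 + \delta D_i$

D is a dummy variable coded 1 if group 1 and 0 otherwise.

group 1: $\operatorname{logit}(p_i) = \beta_0 + \delta$

group 2: $\operatorname{logit}(p_i) = \beta_0$

LOR: $\delta$

OR: $\exp(\delta)$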

Relative Risk

similar to OR, but works with rates: $r = \frac{D}{R} = \frac{\#\text{Events}}{\text{Exposure}}$

relative risk or rate ratio (RR) is the rate in group 1 relative to group 2: $\mathrm{RR} = \frac{r_1}{r_2}$

$\mathrm{OR} \to \mathrm{RR}$ as $p \to 0$.

Tutorial: odds and odds ratios

consider the following data

read table:

clear
input educ psex f
0 0 873
0 1 1190
1 0 533
1 1 1208
end
label define edlev 0 "HS or less" 1 "Col or more"
label val educ edlev
label var educ education

Tutorial: odds and odds ratios

compute odds:

verify by hand

tabodds psex educ [fw=f]

        educ |      cases     controls       odds      [95% Conf. Interval]
   HS or l~s |       1190          873    1.36312       1.24911     1.48753
   Col or ~e |       1208          533    2.26642       2.04681     2.50959

Test of homogeneity (equal odds): chi2(1) = 55.48   Pr>chi2 = 0.0000
Score test for trend of odds:     chi2(1) = 55.48   Pr>chi2 = 0.0000
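One way to carry out the "verify by hand" step is to divide cases by controls within each education group. This sketch uses only the frequencies from the input data; the scalar names are illustrative.

```stata
* odds = cases / controls within each education level
scalar odds_hs  = 1190/873
scalar odds_col = 1208/533
* ratio of the two odds (Col or more relative to HS or less)
scalar or_educ = odds_col/odds_hs
display "odds, HS or less  = " odds_hs
display "odds, Col or more = " odds_col
display "odds ratio        = " or_educ
```

The two odds should match the tabodds column (1.36312 and 2.26642), and their ratio should match the 1.662674 that the odds-ratio and model output report.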

Tutorial: odds and odds ratios

compute odds ratios:

verify by hand

tabodds psex educ [fw=f], or

        educ |  Odds Ratio       chi2       P>chi2     [95% Conf. Interval]
   HS or l~s |    1.000000          .            .             .           .
   Col or ~e |    1.662674      55.48       0.0000      1.452370    1.903429

Test of homogeneity (equal odds): chi2(1) = 55.48   Pr>chi2 = 0.0000
Score test for trend of odds:     chi2(1) = 55.48   Pr>chi2 = 0.0000

Tutorial: odds and odds ratios

stat facts:

variances of functions are used in statistical significance tests and in forming confidence intervals

basic rule for variances of linear transformations: if g(x) = a + bx is a linear function of x, then $\operatorname{var}[a + bx] = b^2 \operatorname{var}(x)$

this is a trivial case of the delta method applied to a single variable

the delta method for the variance of a nonlinear function g(x) of a single variable is $\operatorname{var}[g(x)] \approx \left[\frac{\partial g(x)}{\partial x}\right]^2 \operatorname{var}(x)$


variances of odds and odds ratios

we can use the delta method to find the variance in the odds and the odds ratios

from the asymptotic (large sample theory) perspective it is best to work with log odds and log odds ratios

the log odds ratio converges to normality at a faster rate than the odds ratio, so statistical tests may be more appropriate on log odds ratios (nonlinear functions of p)

$\operatorname{var}(\log\hat\omega) \approx \left[\frac{1}{\hat p(1-\hat p)}\right]^2 \operatorname{var}(\hat p)$


the log odds ratio is the difference in the log odds for two groups

groups are independent

variance of a difference is the sum of the variances

$\operatorname{var}(\log\hat\theta) = \operatorname{var}(\log\hat\omega_1) + \operatorname{var}(\log\hat\omega_2)$
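As a worked check of the delta method, the sketch below applies it to the tutorial frequencies. It assumes the usual binomial variance var(p̂) = p̂(1 - p̂)/n, under which the variance of a log odds reduces to 1/cases + 1/controls; the scalar names are illustrative.

```stata
* log odds ratio from the tutorial frequencies
scalar lor = ln((1208/533)/(1190/873))
* var(log odds) = 1/cases + 1/controls in each group; sum over the two groups
scalar se_lor = sqrt(1/1208 + 1/533 + 1/1190 + 1/873)
* Wald interval on the log scale, then exponentiate
scalar z975 = invnormal(0.975)
scalar or_lo = exp(lor - z975*se_lor)
scalar or_hi = exp(lor + z975*se_lor)
* delta-method standard error of the odds ratio itself
scalar se_or = exp(lor)*se_lor
display "log OR = " lor "   SE(log OR) = " se_lor
display "95% CI for OR: " or_lo " to " or_hi
display "SE(OR) = " se_or
```

The interval and standard error should be close to the values in the glm output of this tutorial (1.4538 to 1.9015, standard error 0.1139).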


data structures: grouped or individual level

note: use frequency weights to handle grouped data

or we could “expand” this data by the frequency weights, resulting in individual-level data

model results from either data structure are the same

expand the data and verify the following results

expand f


statistical modeling

logit model (glm):

glm psex educ [fw=f], f(b) eform

logit model (logit):

logit psex educ [fw=f], or

Tutorial: odds and odds ratios

statistical modeling (#1)

logit model (glm):

Generalized linear models                    No. of obs      =      3804
Optimization     : ML                        Residual df     =      3802
                                             Scale parameter =         1
Deviance         =  4955.871349              (1/df) Deviance =  1.303491
Pearson          =         3804              (1/df) Pearson  =  1.000526

Variance function: V(u) = u*(1-u)            [Bernoulli]
Link function    : g(u) = ln(u/(1-u))        [Logit]

                                             AIC             =  1.303857
Log likelihood   = -2477.935675              BIC             = -26387.09

             |                 OIM
        psex | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
        educ |   1.662674   .1138634     7.42   0.000     1.453834    1.901512


statistical modeling (#2)

some ideas from alternative normalizations

what parameters will this model produce? what is the interpretation of the “constant”?

gen cons = 1
glm psex cons educ [fw=f], nocons f(b) eform


statistical modeling (#2)

             |                 OIM
        psex | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
        cons |   1.363116   .0607438     6.95   0.000     1.249111    1.487525
        educ |   1.662674   .1138634     7.42   0.000     1.453834    1.901512






what parameters does this model produce? how do you interpret them?

gen lowed = educ == 0
gen hied = educ == 1
glm psex lowed hied [fw=f], nocons f(b) eform



             |                 OIM
        psex | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
       lowed |   1.363116   .0607438     6.95   0.000     1.249111    1.487525
        hied |   2.266417   .1178534    15.73   0.000     2.046809    2.509586




are these odds ratios?

Tutorial: prediction

fitted probabilities (after most recent model)

predict p, mu
tab educ [fw=f], sum(p) nostandard nofreq

            | Summary of predicted mean psex
  education |        Mean        Obs.
  HS or les |   .57682985        2063
  Col or mo |   .69385409        1741
      Total |   .63038905        3804
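The fitted probabilities can also be recovered from the estimated odds, since p = odds/(1 + odds). A small sketch using the odds reported for lowed and hied; the scalar names are illustrative.

```stata
* inverse logit of the log odds gives the fitted probability
scalar p_hs  = invlogit(ln(1.363116))
scalar p_col = invlogit(ln(2.266417))
* the same quantities as observed proportions from the frequencies
scalar prop_hs  = 1190/(1190 + 873)
scalar prop_col = 1208/(1208 + 533)
display "HS or less : fitted " p_hs  "   observed " prop_hs
display "Col or more: fitted " p_col "   observed " prop_col
```

Because this model has one parameter per education group, fitted and observed proportions should coincide (about 0.5768 and 0.6939).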

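Probit Model

inverse probit is the CDF for a standard normal variable: $p_i = \Phi(\eta_i) = \int_{-\infty}^{\eta_i} \frac{1}{\sqrt{2\pi}}\, e^{-u^2/2}\, du$

link function: $\operatorname{probit}(p_i) = \Phi^{-1}(p_i) = \eta_i$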

Probit Transformation

[Figure: p (0 to 1) plotted against probit (-3 to 3).]

Interpretation

probit coefficients

interpreted in units of a standard normal variable (no log odds-ratio interpretation)

“scaled” versions of logit coefficients: $\beta_{\text{probit}} \approx \frac{\sqrt{3}}{\pi}\,\beta_{\text{logit}}$

probit models

more common in certain disciplines (economics)

analogy with linear regression (normal latent variable)

more easily extended to multivariate distributions

Example: Grouped Data

Swedish mortality data revisited

logit model

             |                 OIM
           y |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
          A2 |   .1147916     .21511     0.53   0.594    -.3068163    .5363995
          A3 |  -.8384579   .2006439    -4.18   0.000    -1.231713    -.445203
          P2 |   .5271214    .120775     4.36   0.000     .2904068     .763836
       _cons |  -4.017514   .1922715   -20.90   0.000    -4.394359   -3.640669

probit model

             |                 OIM
           y |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
          A2 |   .0497241    .087904     0.57   0.572    -.1225646    .2220128
          A3 |  -.3247921   .0807731    -4.02   0.000    -.4831045   -.1664797
          P2 |   .2098432   .0472825     4.44   0.000     .1171712    .3025151
       _cons |  -2.101865   .0778879   -26.99   0.000    -2.254522   -1.949207

Swedish Historical Mortality Data

predictions

              Logit                  Probit
   A      P = 1    P = 2         P = 1    P = 2
   1       19.0     10.0          19.1      9.9
   2       61.0     32.0          61.9     31.6
   3      143.0     60.0         141.1     61.4
   sum    325                    325.1

Programming

Stata: generalized linear model (glm)

glm y A2 A3 P2, family(b n) link(probit)
glm y A2 A3 P2, family(b n) link(logit)

idea of glm is to make the model linear in the link

old days: Iteratively Reweighted Least Squares

now: Fisher scoring, Newton-Raphson

both approaches yield MLEs

Generalized Linear Models

applies to a broad class of models

iterative fitting (repeated updating), except for the linear model

update parameters β, weights W, and predicted values m: $\boldsymbol{\beta}^{(t+1)} = \boldsymbol{\beta}^{(t)} + \left(\mathbf{X}'\mathbf{W}^{(t)}\mathbf{X}\right)^{-1}\mathbf{X}'\left(\mathbf{y} - \mathbf{m}^{(t)}\right)$

models differ in terms of W and m and assumptions about the distribution of y

common distributions for y include: normal, binomial, and Poisson

common links include: identity, logit, probit, and log
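To make the updating step concrete, here is a rough Mata sketch of the iteration for a grouped logit model, applied to the education tutorial data (two groups, intercept plus educ). It is an illustration under stated assumptions rather than what glm does internally: the weights are taken as W = diag(n p (1 - p)), the predicted counts as m = n p, and the loop runs a fixed 25 iterations instead of testing convergence. All object names are illustrative.

```stata
mata:
// design matrix: intercept and educ dummy, one row per education group
X = (1, 0 \ 1, 1)
// trials and events per group (HS or less, Col or more)
n = (2063 \ 1741)
y = (1190 \ 1208)
// starting values
b = (0 \ 0)
for (t = 1; t <= 25; t++) {
    p = invlogit(X*b)              // fitted probabilities
    m = n :* p                     // predicted counts
    W = diag(n :* p :* (1 :- p))   // binomial weights for the logit link
    b = b + invsym(X'*W*X)*X'*(y - m)
}
// exponentiated coefficients: baseline odds and odds ratio
exp(b)
end
```

With these data the exponentiated coefficients should approach the baseline odds (about 1.363) and the odds ratio (about 1.663) reported in the tutorial output.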

Latent Variable Approach

example: insect mortality

suppose a researcher exposes insects to dosage levels (u) of an insecticide and observes whether the “subject” lives or dies at that dosage.

the response is expected to depend on the insect’s tolerance (c) to that dosage level.

the insect dies if u > c and survives if u < c

tolerance is not observed (survival is observed)

$\Pr(y_i = 1) = \Pr(u_i > c_i)$

Latent Variables

u and c are continuous latent variables

examples:

women’s employment: u is the market wage and c is the reservation wage

migration: u is the benefit of moving and c is the cost of moving

observed outcome y = 1 or y = 0 reveals the individual’s preference, which is assumed to maximize a rational individual’s utility function.

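Latent Variables

Assume linear utility and criterion functions:

$u = \mathbf{x}'\boldsymbol{\beta}_u + \varepsilon_u$

$c = \mathbf{x}'\boldsymbol{\beta}_c + \varepsilon_c$

$\Pr(y = 1) = \Pr(u > c) = \Pr\left(\varepsilon_c - \varepsilon_u < \mathbf{x}'(\boldsymbol{\beta}_u - \boldsymbol{\beta}_c)\right)$

over-parameterization = identification problem

we can identify differences in components but not the separate components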

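Latent Variables

constraints: $\boldsymbol{\beta} = \boldsymbol{\beta}_u - \boldsymbol{\beta}_c$ and $\varepsilon = \varepsilon_c - \varepsilon_u$

Then: $\Pr(y = 1) = \Pr(\varepsilon < \mathbf{x}'\boldsymbol{\beta}) = F(\mathbf{x}'\boldsymbol{\beta})$

where F(.) is the CDF of ε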

Latent Variables and Standardization

Need to standardize the mean and variance of ε

binary dependent variables lack inherent scales

magnitude of β is only in reference to the mean and variance of ε, which are unknown

redefine ε to a common standard: $\varepsilon^* = \frac{\varepsilon - a}{b}$

where a and b are two chosen constants.

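Standardization for Logit and Probit Models

standardization implies $\Pr(y = 1) = F^*\left(\frac{\mathbf{x}'\boldsymbol{\beta} - a}{b}\right)$

F*() is the cdf of ε*

location a and scale b need to be fixed

setting a and b so that ε* is standard normal gives $F^*(\cdot) = \Phi(\cdot)$, the probit model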

Standardization for Logit and Probit Models

distribution of ε is standardized

standard normal: probit

standard logistic: logit

both distributions have a mean of 0

variances differ: $\sigma^2_{\varepsilon^*,\,\text{probit}} = 1$ and $\sigma^2_{\varepsilon^*,\,\text{logit}} = \frac{\pi^2}{3}$

Extending the Latent Variable Approach

observed y is a dichotomous (binary) 0/1 variable

continuous latent variable: linear predictor + residual: $y_i^* = \mathbf{x}_i'\boldsymbol{\beta} + \varepsilon_i$

observed outcome: $y_i = 1$ if $y_i^* > 0$, and $y_i = 0$ otherwise

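Notation

conditional means of latent variables obtained from index function: $E(y_i^* \mid \mathbf{x}_i) = \mathbf{x}_i'\boldsymbol{\beta}$

obtain probabilities from inverse link functions

logit model: $p_i = \Lambda(\mathbf{x}_i'\boldsymbol{\beta})$

probit model: $p_i = \Phi(\mathbf{x}_i'\boldsymbol{\beta})$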

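ML

likelihood function: $L = \prod_{i=1}^{n} F(\mathbf{x}_i'\boldsymbol{\beta})^{y_i}\left[1 - F(\mathbf{x}_i'\boldsymbol{\beta})\right]^{n_i - y_i}$

where $n_i = 1$ if data are binary

log-likelihood function: $\log L = \sum_{i=1}^{n}\left\{ y_i \log F(\mathbf{x}_i'\boldsymbol{\beta}) + (n_i - y_i)\log\left[1 - F(\mathbf{x}_i'\boldsymbol{\beta})\right]\right\}$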
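The log-likelihood can be evaluated by hand for the education tutorial model, where the fitted probabilities equal the observed proportions in each group. The sketch below plugs the grouped counts into the log-likelihood formula; the scalar names are illustrative.

```stata
* fitted probabilities for HS or less and Col or more
scalar p_hs  = 1190/2063
scalar p_col = 1208/1741
* sum of y*log(F) + (n - y)*log(1 - F) over the two groups
scalar ll = 1190*ln(p_hs) + 873*ln(1 - p_hs) + 1208*ln(p_col) + 533*ln(1 - p_col)
* deviance for individual-level data is -2 log L
scalar dev = -2*ll
display "log L    = " ll
display "deviance = " dev
```

The results should agree with the log likelihood (-2477.94) and deviance (4955.87) in the glm output of the tutorial.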

Assessing Models

definitions:

L null model (intercept only): $L_0$

L saturated model (a parameter for each cell): $L_f$

L current model: $L_c$

grouped data (events/trials)

deviance (likelihood ratio statistic): $G^2 = -2\log\frac{L_c}{L_f} = -2\left(\log L_c - \log L_f\right)$

Deviance

grouped data: if cell sizes are reasonably large, deviance is distributed as chi-square

individual-level data: $L_f = 1$ and $\log L_f = 0$, so $G^2 = -2\log L_c$

deviance is not a “fit” statistic

Deviance

deviance is like a residual sum of squares

larger values indicate poorer models

larger models have smaller deviance

deviance for the more constrained model (Model 1): $G_1^2$

deviance for the less constrained model (Model 2): $G_2^2$

assume that Model 1 is a constrained version of Model 2.

Difference in Deviance

evaluate competing “nested” models using a likelihood ratio statistic: $\Delta G^2 = G_1^2 - G_2^2 \sim \chi^2_{df_2 - df_1}$

model chi-square is a special case: $\text{Model } \chi^2 = G_0^2 - G_c^2 = -2\log L_0 - (-2\log L_c)$

SAS, Stata, R, etc. report different statistics

Other Fit Statistics

BIC & AIC (useful for non-nested models)

basic idea of IC: penalize log L for the number of parameters (AIC/BIC) and/or the size of the sample (BIC)

$\mathrm{IC} = -2\log L + 2s\,(df_m)$

AIC: s = 1

BIC: s = ½ log n (n is the sample size)

$df_m$ is the number of model parameters

Hypothesis Tests/Inference

single parameter: $H_0\colon \beta = 0$

MLEs are asymptotically normal: Z-test

multi-parameter: $H_0\colon \beta_1 = \beta_2 = 0$

likelihood ratio tests (after fitting)

Wald tests (test constraints from current model)

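Hypothesis Tests/Inference

Wald test (tests a vector of restrictions)

a set of r parameters are all equal to 0: $H_0\colon \boldsymbol{\beta}_r = \mathbf{0}$

a set of r parameters are linearly restricted: $H_0\colon \mathbf{R}\boldsymbol{\beta}_r = \mathbf{q}$

R is the restriction matrix, q is the constraint vector, and $\boldsymbol{\beta}_r$ is the parameter subset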

Interpreting Parameters

odds ratios: consider the model $y_i^* = \beta_0 + \beta_1 x_i + \beta_2 d_i + \varepsilon_i$, where x is a continuous predictor and d is a dummy variable

suppose that d denotes sex and x denotes income and the problem concerns voting, where y* is the propensity to vote

results: $\operatorname{logit}(p_i) = -1.92 + 0.012x_i + 0.67d_i$

Interpreting Parameters

for d (dummy variable coded 1 for female) the odds ratio is straightforward: $\frac{p_f/(1-p_f)}{p_m/(1-p_m)} = \exp(\hat\beta_2) = \exp(0.67) = 1.95$

holding income constant, women’s odds of voting are nearly twice those of men

Interpreting Parameters

for x (continuous variable for income in thousands of dollars) the odds ratio is a multiplicative effect

suppose we increase income by 1 unit (1,000 dollars):

$\frac{\exp[\hat\beta_1(x+1)]}{\exp(\hat\beta_1 x)} = \exp[\hat\beta_1(1)] = \exp(0.012) = 1.01$

suppose we increase income by c units (c × 1,000 dollars):

$\frac{\exp[\hat\beta_1(x+c)]}{\exp(\hat\beta_1 x)} = \exp[\hat\beta_1(c)]$

Interpreting Parameters

if income is increased by 10,000 dollars, this increases the odds of voting by about 13%:

$(e^{10 \times 0.012} - 1) \times 100\% = 12.75\%$

a note on percent change in odds:

if estimate of β > 0 then percent increase in odds for a unit change in x is

$(e^{\hat\beta} - 1) \times 100\%$

if estimate of β < 0 then percent decrease in odds for a unit change in x is

$(1 - e^{\hat\beta}) \times 100\%$
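The odds-ratio arithmetic for the voting example can be reproduced directly from the reported coefficients (0.67 for d and 0.012 for x). A minimal sketch with illustrative scalar names:

```stata
* odds ratio for the dummy variable (women vs. men)
scalar or_d = exp(0.67)
* odds ratio for a 1-unit (1,000 dollar) increase in income
scalar or_x1 = exp(0.012)
* odds ratio and percent change for a 10-unit (10,000 dollar) increase
scalar or_x10 = exp(10*0.012)
scalar pct_x10 = 100*(or_x10 - 1)
display "OR for d          = " or_d
display "OR for x + 1      = " or_x1
display "OR for x + 10     = " or_x10
display "percent change    = " pct_x10
```

These should come out near 1.95, 1.01, 1.13, and 12.75%, matching the slides.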

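Marginal Effects

marginal effect: effect of change in x on change in probability

$\frac{\partial \Pr(y_i = 1 \mid \mathbf{x}_i)}{\partial x_{ik}} = \frac{\partial F(\mathbf{x}_i'\boldsymbol{\beta})}{\partial x_{ik}} = f(\mathbf{x}_i'\boldsymbol{\beta})\,\beta_k$

f(·) is the pdf and F(·) is the cdf

often we evaluate f(.) at the mean of x.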

Marginal Effect for a Change in a Continuous Variable

Marginal Effect of a Change in a Dummy Variable

if x is a continuous variable and z is a dummy variable: $\Pr(y_i = 1) = F(\beta_0 + \beta_1 x_i + \beta_2 z_i)$

marginal effect of change in z from 0 to 1 is the difference: $F(\beta_0 + \beta_1 x_i + \beta_2) - F(\beta_0 + \beta_1 x_i)$
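Both kinds of marginal effect can be illustrated with the voting coefficients from the earlier slides (-1.92, 0.012, 0.67). For the logit model the pdf is f = F(1 - F). The income value x0 = 50 (50,000 dollars) is a hypothetical evaluation point chosen only for this sketch, and the scalar names are illustrative.

```stata
* coefficients from the voting example
scalar b0 = -1.92
scalar b1 = 0.012
scalar b2 = 0.67
* hypothetical evaluation point: income of 50 (thousand dollars)
scalar x0 = 50
* fitted probabilities for men (d = 0) and women (d = 1)
scalar p_m = invlogit(b0 + b1*x0)
scalar p_f = invlogit(b0 + b1*x0 + b2)
* continuous variable: f(xb)*b1, with logistic pdf f = p(1 - p)
scalar me_x_m = p_m*(1 - p_m)*b1
scalar me_x_f = p_f*(1 - p_f)*b1
* dummy variable: difference in probabilities as d goes from 0 to 1
scalar me_d = p_f - p_m
display "marginal effect of x (men)   = " me_x_m
display "marginal effect of x (women) = " me_x_f
display "effect of d from 0 to 1      = " me_d
```

The marginal effect of income differs between men and women even though the coefficient is the same, because f is evaluated at different points.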

Example

logit models for high school graduation

odds ratios (constant is baseline odds)

LR Test

Model 3 vs. 2: $\chi^2_{(1)} = -2(\log L_2 - \log L_3) = -2(-1240.70 - (-1038.39)) = 2(1240.70 - 1038.39) \approx 404.64$
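The likelihood ratio statistic and its p-value can be computed from the two log likelihoods quoted on the slide. This sketch uses the rounded values (-1240.70 and -1038.39), so the statistic may differ from lrtest's 404.64 in the second decimal; the scalar names are illustrative.

```stata
* LR statistic: -2 times (log L of constrained minus log L of unconstrained)
scalar lr = -2*(-1240.70 - (-1038.39))
* upper-tail chi-square probability with 1 degree of freedom
scalar p_lr = chi2tail(1, lr)
display "LR chi2(1) = " lr
display "p-value    = " p_lr
```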

Wald Test

Test equality of parental education effects

$H_0\colon \beta_{mhs} = \beta_{fhs}$

$H_0\colon \beta_{mcol} = \beta_{fcol}$

logit hsg blk hsp female nonint inc nsibs mhs mcol fhs fcol wtest
test mhs=fhs
test mcol=fcol

. test mhs=fhs

 ( 1)  mhs - fhs = 0

           chi2(  1) =    0.01
         Prob > chi2 =    0.9177

. test mcol=fcol

 ( 1)  mcol - fcol = 0

           chi2(  1) =    1.18
         Prob > chi2 =    0.2770

cannot reject H0 of equal parental education effects on HS graduation

Basic Estimation Commands (Stata)

estimation commands and model tests

* model 0 - null model
qui logit hsg
est store m0
* model 1 - race, sex, family structure
qui logit hsg blk hsp female nonint
est store m1
* model 1a - race X family structure interactions
qui xi: logit hsg blk hsp female nonint i.nonint*i.blk i.nonint*i.hsp
est store m1a
lrtest m1 m1a
* model 2 - SES
qui xi: logit hsg blk hsp female nonint inc nsibs mhs mcol fhs fcol
est store m2
* model 3 - Indiv
qui xi: logit hsg blk hsp female nonint inc nsibs mhs mcol fhs fcol wtest
est store m3
lrtest m2 m3

Fit Statistics etc.

* some 'hand' calculations with saved results
scalar ll = e(ll)
scalar npar = e(df_m)+1
scalar nobs = e(N)
scalar AIC = -2*ll + 2*npar
scalar BIC = -2*ll + log(nobs)*npar
scalar list AIC
scalar list BIC

* or use automated fitstat routine
fitstat

* output as a table
estout1 m0 m1 m2 m3 using modF07, replace star stfmt(%9.2f %9.0f %9.0f) ///
    stats(ll N df_m) eform

Analysis of Deviance

. lrtest m0 m1

. lrtest m1 m2

. lrtest m2 m3

Likelihood-ratio test                         LR chi2(1)  =    404.64
(Assumption: m2 nested in m3)                 Prob > chi2 =    0.0000

BIC and AIC (using fitstat)

Measures of Fit for logit of hsg

Log-Lik Intercept Only:    -1441.781   Log-Lik Full Model:          -1038.377
D(3293):                    2076.754   LR(11):                        806.807
                                       Prob > LR:                       0.000
McFadden's R2:                 0.280   McFadden's Adj R2:               0.271
ML (Cox-Snell) R2:             0.217   Cragg-Uhler(Nagelkerke) R2:      0.372
McKelvey & Zavoina's R2:       0.473   Efron's R2:                      0.252
Variance of y*:                6.240   Variance of error:               3.290
Count R2:                      0.857   Adj Count R2:                    0.096
AIC:                           0.636   AIC*n:                        2100.754
BIC:                      -24607.056   BIC':                         -717.672
BIC used by Stata:          2173.993   AIC used by Stata:            2100.754

Marginal Effects

[Figure: Marginal Effect of Test Score on High School Graduation, Income Quartile 1. Pr(y=1) (0 to 1) plotted against Test Score (-4 to 4); lines: white/intact, white/nonintact, black/intact, black/nonintact.]

Marginal Effects

[Figure: Marginal Effect of Test Score on High School Graduation, Income Quartile 4. Pr(y=1) (0 to 1) plotted against Test Score (-4 to 4); lines: white/intact, white/nonintact, black/intact, black/nonintact.]

Generate Income Quartiles

qui sum adjinc, det
* quartiles for income distribution
gen incQ1 = adjinc < r(p25)
gen incQ2 = adjinc >= r(p25) & adjinc < r(p50)
gen incQ3 = adjinc >= r(p50) & adjinc < r(p75)
gen incQ4 = adjinc >= r(p75)
gen incQ = 1 if incQ1==1
replace incQ = 2 if incQ2==1
replace incQ = 3 if incQ3==1
replace incQ = 4 if incQ4==1
tab incQ

Fit Model for Each Quartile

calculate predictions

* look at marginal effects of test score on graduation by selected groups
* (1) model (income quartiles)
local i = 1
while `i' < 5 {
logit hsg blk female mhs nonint nsibs urban so wtest if incQ == `i'
margeff

cap drop wm*
cap drop bm*
prgen wtest, x(blk=0 female=0 mhs=1 nonint=0) gen(wmi) from(-3) to(3)
prgen wtest, x(blk=0 female=0 mhs=1 nonint=1) gen(wmn) from(-3) to(3)
label var wmip1 "white/intact"
label var wmnp1 "white/nonintact"
prgen wtest, x(blk=1 female=0 mhs=1 nonint=0) gen(bmi) from(-3) to(3)
prgen wtest, x(blk=1 female=0 mhs=1 nonint=1) gen(bmn) from(-3) to(3)
label var bmip1 "black/intact"
label var bmnp1 "black/nonintact"

Graph

set scheme s2mono
twoway (line wmip1 wmix, sort xtitle("Test Score") ytitle("Pr(y=1)")) ///
    (line wmnp1 wmix, sort) (line bmip1 wmix, sort) (line bmnp1 wmix, sort), ///
    subtitle("Marginal Effect of Test Score on High School Graduation" ///
    "Income Quartile `i'") saving(wtgrph`i', replace)
graph export wtgrph`i'.eps, as(eps) replace
local i = `i' + 1
}

Fitted Probabilities

logit hsg blk female mhs nonint inc nsibs urban so wtest
prtab nonint blk female

logit: Predicted probabilities of positive outcome for hsg

          |          female and blk
          |   female = 0        female = 1
   nonint | blk = 0  blk = 1  blk = 0  blk = 1
        0 |  0.9111   0.9740   0.9258   0.9786
        1 |  0.8329   0.9480   0.8585   0.9569

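Fitted Probabilities

predicted values

evaluate fitted probabilities at the sample mean values of x (or other fixed quantities): $\hat p(\bar{\mathbf{x}}) = \frac{\exp(\bar{\mathbf{x}}'\hat{\boldsymbol{\beta}})}{1 + \exp(\bar{\mathbf{x}}'\hat{\boldsymbol{\beta}})}$

averaging fitted probabilities over subgroup-specific models will produce marginal probabilities: $\bar y_j = \frac{1}{n_j}\sum_{i=1}^{n_j} \hat p_{ij}(\mathbf{x}_{ij}'\hat{\boldsymbol{\beta}}_j)$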

Observed & Fitted Probabilities

                        sex
                  male            female
                  race             race
family type    white  black    white  black
intact
  observed      0.90   0.86     0.91   0.89
  fitted        0.91   0.97     0.93   0.98
  n              776    224      749    234
nonintact
  observed      0.71   0.74     0.81   0.82
  fitted        0.83   0.95     0.86   0.96
  n              220    207      196    231
Total            996    431      945    465

Alternative Probability Model

complementary log-log (cloglog or CLL)

standard extreme-value distribution for u:

$f(u) = \exp(u)\exp[-\exp(u)]$

$F(u) = 1 - \exp[-\exp(u)]$

cloglog model: $\Pr(y_i = 1) = 1 - \exp[-\exp(\mathbf{x}_i'\boldsymbol{\beta})]$

cloglog link function: $\log\{-\log[1 - \Pr(y_i = 1)]\} = \mathbf{x}_i'\boldsymbol{\beta}$
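To see how the cloglog inverse link behaves relative to the logit, the sketch below evaluates both at a few values of the linear predictor. The values -3, -1, 0, and 1 are arbitrary illustration points, and the scalar names are illustrative; invcloglog() and invlogit() are built-in Stata functions.

```stata
foreach xb in -3 -1 0 1 {
    * cloglog probability from the formula and from the built-in function
    scalar p_hand = 1 - exp(-exp(`xb'))
    scalar p_cll  = invcloglog(`xb')
    * logit probability at the same linear predictor
    scalar p_lgt  = invlogit(`xb')
    display "xb = `xb'   cloglog (by hand) = " p_hand "   cloglog = " p_cll "   logit = " p_lgt
}
```

At strongly negative values of the linear predictor the two probabilities are close, and they diverge as the linear predictor grows, since the cloglog curve is asymmetric.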

Extreme-Value Distribution

properties

mean of u (Euler’s constant): $E(u) = \Gamma'(1) = -0.5772$

variance of u: $\operatorname{var}(u) = \frac{\pi^2}{6}$

difference in two independent extreme value variables yields a logistic variable: $u_1 - u_2 \sim \text{logistic}\left(0, \frac{\pi^2}{3}\right)$

CLL Transformation

[Figure: p (0 to 1) plotted against CLL (-6 to 2).]

CLL Model

no “practical” differences from logit and probit models

often suited for survival data and other applications

interpretation of coefficients: exp(β) is a relative risk or hazard ratio, not an OR

glm: binomial distribution for y with a cloglog link

cloglog: use the cloglog command directly

CLL and Logit Model Compared

             logit       cloglog
blk          3.658***    1.987***
female       1.218       1.128*
mhs          1.438**     1.161*
nonint       0.487***    0.710***
inc          1.635**     1.236**
nsibs        0.938**     0.965**
urban        0.887       0.942
so           1.269       1.115
wtest        5.151***    2.171***
_cons        6.851***    1.891***
log L        -838.92     -833.96
N            2837        2837
df           9           9

Cloglog and Logit Model Compared

more agreement when modeling rare events

logit

             |                 OIM
           d | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
          A2 |    1.12164   .2412759     0.53   0.594     .7357857    1.709839
          A3 |   .4323768   .0867538    -4.18   0.000     .2917924    .6406942
          P2 |   1.694049   .2045987     4.36   0.000     1.336971    2.146494

cloglog

             |                 OIM
           d |     exp(b)   Std. Err.      z    P>|z|     [95% Conf. Interval]
          A2 |   1.119414   .2380893     0.53   0.596     .7378156    1.698375
          A3 |   .4350801   .0864137    -4.19   0.000     .2947864     .642142
          P2 |   1.684947   .2016957     4.36   0.000     1.332581    2.130487

Extensions: Multilevel Data

what is multilevel data?

individuals are “nested” in a larger context: children in families, kids in schools, etc.

[Figure: individuals grouped into context 1, context 2, and context 3.]

Multilevel Data

i.i.d. assumptions?

the outcomes for units in a given context could be associated

standard model would treat all outcomes (regardless of context) as independent

multilevel methods account for the within-cluster dependence

a general problem with binomial responses

we assume that trials are independent

this might not be realistic

non-independence will inflate the variance (overdispersion)

Multilevel Data

example (in book):

40 universities as units of analysis

for each university we observe the number of graduates (n) and the number receiving post-doctoral fellowships (y)

we could compute proportions (MLEs)

some proportions would be “better” estimates as they would have higher precision or lower variance

example: the data y1/n1 = 2/5 and y2/n2 = 20/50 give identical estimates of p but variances of 0.048 and 0.0048, respectively

the 2nd estimate is more precise than the 1st
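The two variances follow from the binomial formula var(p̂) = p̂(1 - p̂)/n. A quick check with illustrative scalar names:

```stata
* same estimate of p (0.4), different numbers of trials
scalar v1 = (2/5)*(1 - 2/5)/5
scalar v2 = (20/50)*(1 - 20/50)/50
display "var for 2/5   = " v1
display "var for 20/50 = " v2
display "ratio         = " v1/v2
```

The tenfold difference in n gives a tenfold difference in variance.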

Multilevel Data

multilevel models allow for improved predictions of individual probabilities

the MLE is unaltered if it is precise

the MLE is moved toward the average if it is imprecise (shrinkage)

multilevel estimate of p would be a weighted average of the MLE and the average over all MLEs (weight (w) is based on the variance of each MLE and the variance over all the MLEs): $\tilde p_i = w_i\hat p_i + (1 - w_i)\bar p$

we are generally less interested in the p’s and more interested in the model parameters and variance components

Shrinkage Estimation

primitive approach

assume we have a set of estimates (MLEs) $\hat p_i$

our best estimate of the variance of each MLE is $\operatorname{var}(\hat p_i) = \frac{\hat p_i(1-\hat p_i)}{n_i}$

this is the within variance (no pooling)

if this is large, then the MLE is a poor estimate

a better estimate might be the average of the MLEs in this case (pooling the estimates)

we can average the MLEs and estimate the between variance as $\operatorname{var}(\bar p) = \frac{1}{N-1}\sum_{i}(\hat p_i - \bar p)^2$

Shrinkage Estimation

primitive approach

we can then estimate a weight $w_i$: $w_i = \frac{\operatorname{var}(\bar p)}{\operatorname{var}(\bar p) + \operatorname{var}(\hat p_i)} = \frac{\text{between-group variance}}{\text{total variance}}$

a revised estimate of $p_i$ would take account of the precision to form a precision-weighted average: $\tilde p_i = w_i\hat p_i + (1 - w_i)\bar p$

precision is a function of $n_i$

more weight is given to more precise MLEs
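The weighting can be illustrated numerically. In this sketch the overall average (0.25) and the between-group variance (0.01) are hypothetical values chosen only for illustration, as are the scalar names; the MLE of 0.40 with n = 5 or n = 50 echoes the 2/5 and 20/50 example.

```stata
* hypothetical overall average and between-group variance
scalar pbar  = 0.25
scalar v_btw = 0.01
* MLE for one unit, evaluated at two sample sizes
scalar p_hat = 0.40
foreach n in 5 50 {
    * within variance of the MLE
    scalar v_within = p_hat*(1 - p_hat)/`n'
    * weight: between-group variance over total variance
    scalar w = v_btw/(v_btw + v_within)
    * precision-weighted (shrunken) estimate
    scalar p_shrunk = w*p_hat + (1 - w)*pbar
    display "n = `n'   w = " w "   shrunken p = " p_shrunk
}
```

With n = 5 the weight is small and the estimate is pulled most of the way toward the average; with n = 50 the estimate stays much closer to the MLE.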

Shrinkage: a primitive approach

[Figure: observed and shrunken probabilities (0.2 to 0.8) plotted by university (0 to 40); legend: Observed, Shrunken.]

Shrinkage

[Figure: observed and EB probabilities (0.2 to 0.8) plotted by university (0 to 40); legend: Observed, EB Estimate.]

results from full Bayesian (multilevel) analysis

Extension: Multilevel Models

assumptions

within-context and between-context variation in outcomes

individuals within the same context share the same “random error” specific to that context

models are hierarchical: individuals (level-1) within contexts (level-2)

Multilevel Models: Background

linear mixed model for continuous y (multilevel, random coefficients, etc.)

level-1 model and level-2 sub-models (hierarchical)

$y_{ij} = \beta_{0i} + \beta_{1i} z_{ij} + \varepsilon_{ij}$

$\beta_{0i} = \gamma_{00} + \gamma_{01} x_i + u_{0i}$

$\beta_{1i} = \gamma_{10} + \gamma_{11} x_i + u_{1i}$

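Multilevel Models: Background

linear mixed model assumptions

level-1 and level-2 residuals

$\varepsilon_{ij} \sim \text{Normal}(0, \sigma^2_{\varepsilon})$

$\begin{pmatrix} u_{0i} \\ u_{1i} \end{pmatrix} \sim \text{MVN}\left(\begin{pmatrix} 0 \\ 0 \end{pmatrix}, \boldsymbol{\Sigma}_u\right)$, where $\boldsymbol{\Sigma}_u = \begin{pmatrix} \sigma^2_{u_0} & \sigma_{u_{01}} \\ \sigma_{u_{01}} & \sigma^2_{u_1} \end{pmatrix}$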

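Multilevel Models: Background

composite form

$y_{ij} = \gamma_{00} + \gamma_{01}x_i + \gamma_{10}z_{ij} + \gamma_{11}x_i z_{ij} + u_{0i} + u_{1i}z_{ij} + \varepsilon_{ij}$

fixed effects: $\gamma_{00} + \gamma_{01}x_i + \gamma_{10}z_{ij} + \gamma_{11}x_i z_{ij}$

cross-level interaction: $\gamma_{11}x_i z_{ij}$

random effects (level-2): $u_{0i} + u_{1i}z_{ij}$

composite residual: $u_{0i} + u_{1i}z_{ij} + \varepsilon_{ij}$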

Multilevel Models: Background

variance components

within group: $\operatorname{var}(\varepsilon_{ij})$

between group: $\operatorname{var}(u_{0i} + u_{1i}z_{ij})$

total: $\operatorname{var}(u_{0i} + u_{1i}z_{ij} + \varepsilon_{ij})$

Multilevel Models: Background

general form (linear mixed model): $y_{ij} = \mathbf{x}_{ij}'\boldsymbol{\beta} + \mathbf{z}_{ij}'\mathbf{u}_i + \varepsilon_{ij}$

x: variables associated with fixed coefficients

z: variables associated with random coefficients

Multilevel Models: Logit Models

binomial model (random effect): $\operatorname{logit}(p_{ij}) = \mathbf{x}_{ij}'\boldsymbol{\beta} + u_i$

assumptions: $u_i \sim \text{Normal}(0, \sigma^2_u)$

u increases or decreases the expected response for individual j in context i independently of x

all individuals in context i share the same value of u

also called a random intercept model: $\beta_{0i} = \beta_0 + u_i$

Multilevel Models

a hierarchical model: $\operatorname{logit}(p_{ij}) = \beta_{0i} + \beta_1 z_{ij}$ and $\beta_{0i} = \gamma_{00} + \gamma_{01}x_i + u_i$

z is a level-1 variable; x is a level-2 variable

random intercept varies among level-2 units

note: level-1 residual variance is fixed (why?)

Multilevel Models

a general expression: $\operatorname{logit}(p_{ij}) = \mathbf{x}_{ij}'\boldsymbol{\beta} + \mathbf{z}_{ij}'\mathbf{u}_i$

x are variables associated with “fixed” coefficients

z are variables associated with “random” coefficients

u is a multivariate normal vector of level-2 residuals

mean of u is 0; covariance of u is $\boldsymbol{\Sigma}_u$

Multilevel Models

random effects vs. random coefficients

random effects: u

random coefficients: β + u

variance components: interested in level-2 variation in u

prediction

E(y) is not equal to E(y|u)

model-based predictions need to consider random effects: $E(y_{ij} \mid u_i, \mathbf{x}_{ij}) = \Lambda(\mathbf{x}_{ij}'\boldsymbol{\beta} + u_i)$

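Multilevel Models: Generalized Linear Mixed Models (GLMM)

Conditional Expectation: $E(y_{ij} \mid u_i, \mathbf{x}_{ij}) = \Lambda(\mathbf{x}_{ij}'\boldsymbol{\beta} + u_i)$

Marginal Expectation: $E(y_{ij} \mid \mathbf{x}_{ij}) = E[E(y_{ij} \mid u_i, \mathbf{x}_{ij})] = \int_u \Lambda(\mathbf{x}_{ij}'\boldsymbol{\beta} + u)\, g(u)\, du$, where g(u) is the density of u

requires numerical integration or simulation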
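The difference between the conditional and marginal expectations can be seen with a small simulation. This sketch borrows the intercept (-2.888895) and random-intercept standard deviation (1.71864) from the xtlogit null model reported later in these slides; the seed, the number of draws, and the variable names are arbitrary choices made for illustration.

```stata
clear
set seed 12345
set obs 100000
* draw random intercepts u ~ Normal(0, sigma_u)
generate u = rnormal(0, 1.71864)
* conditional probability for each draw
generate p_cond = invlogit(-2.888895 + u)
* the mean over draws approximates the marginal probability
summarize p_cond
scalar p_marg = r(mean)
* conditional probability at u = 0
scalar p_u0 = invlogit(-2.888895)
display "marginal Pr(y=1)       = " p_marg
display "conditional at u = 0   = " p_u0
```

Averaging over the distribution of u gives a larger probability than plugging in u = 0, which is why model-based predictions need to account for the random effects.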

Data Structure

multilevel data structure

requires a “context” id to identify individuals belonging to the same context

NLSY sibling data contains a “family id” (constructed by researcher)

data are unbalanced (we do not require clusters to be the same size)

small clusters will contribute less information to the estimation of variance components than larger clusters

it is OK to have clusters of size 1 (i.e., an individual is a context unto themselves)

clusters of size 1 contribute to the estimation of fixed effects but not to the estimation of variance components

Example: clustered data

siblings nested in families

y is 1st premarital birth for NLSY women

select sib-ships of size > 2

null model (random intercept):

xtlogit fpmbir, i(famid)

or

xtmelogit fpmbir || famid:

Example: clustered data

random intercept: xtlogit

Log likelihood  = -228.59345                    Prob > chi2        =         .

      fpmbir |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
       _cons |  -2.888895   .3318566    -8.71   0.000    -3.539322   -2.238468
    /lnsig2u |   1.083066   .3992351                        .30058    1.865553
     sigma_u |    1.71864   .3430707                      1.162171    2.541556
         rho |   .4730808   .0995195                      .2910546     .662556

Likelihood-ratio test of rho=0: chibar2(01) = 20.58   Prob >= chibar2 = 0.000
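The rho reported by xtlogit can be reproduced from sigma_u, using the fact that the level-1 logistic residual variance is fixed at π²/3. A short sketch with illustrative scalar names:

```stata
* sigma_u recovered from the estimated log variance
scalar sig_u = sqrt(exp(1.083066))
* share of total latent variance that lies between families
scalar rho = sig_u^2/(sig_u^2 + _pi^2/3)
display "sigma_u = " sig_u
display "rho     = " rho
```

These should agree with the sigma_u (1.71864) and rho (.4730808) rows of the xtlogit output.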

Example: clustered data

random intercept: xtmelogit

Integration points =   7                        Wald chi2(0)       =         .
Log likelihood = -228.51781                     Prob > chi2        =         .

      fpmbir |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
       _cons |  -2.917541   .3479598    -8.38   0.000     -3.59953   -2.235552

  Random-effects Parameters  |   Estimate   Std. Err.     [95% Conf. Interval]
famid: Identity              |
                   sd(_cons) |   1.752456   .3601534      1.171423    2.621685

LR test vs. logistic regression: chibar2(01) = 20.73   Prob>=chibar2 = 0.0000

Variance Component

add predictors (mostly level-2)

Integration points =   7                        Wald chi2(6)       =     22.48
Log likelihood = -215.39646                     Prob > chi2        =    0.0010

      fpmbir | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
      nonint |   3.356608   1.435222     2.83   0.005     1.451921    7.759938
       nsibs |   1.112501   .1032876     1.15   0.251     .9274119     1.33453
        medu |   .8050785    .060073    -2.91   0.004     .6955425    .9318647
         inc |   .8848917   .2858459    -0.38   0.705     .4698153    1.666683
    consprot |   1.614657   .6110603     1.27   0.206     .7690355    3.390111
      weekly |    .885648    .296273    -0.36   0.717     .4597391    1.706125

  Random-effects Parameters  |   Estimate   Std. Err.     [95% Conf. Interval]
famid: Identity              |
                   sd(_cons) |   1.451511   .3515003      .9030084    2.333182

Variance Component

conditional variance in u is 2.107

proportionate reduction in error (PRE): $\mathrm{PRE} = \frac{\sigma^2_{u_r} - \sigma^2_{u_c}}{\sigma^2_{u_r}} = \frac{3.062 - 2.107}{3.062} = 0.312$

a 31% reduction in level-2 variance when level-2 predictors are accounted for

Random Effects

we can examine the distribution of random effects

[Figure: density (0 to 3) of random effects for famid: _cons (-1 to 3).]

Random Effects

we can examine the distribution of random effects

. sum u, detail

              random effects for famid: _cons
      Percentiles      Smallest
 1%    -.7111417      -.9210778
 5%    -.5100672      -.9210778
10%     -.388522      -.8339383       Obs                 653
25%    -.2422871      -.8339383       Sum of Wgt.         653

50%    -.1484184                      Mean           .1132598
                        Largest       Std. Dev.      .7006756
75%    -.0689377       2.431446
90%     1.337971       2.431446       Variance       .4909462
95%     1.523062       2.583755       Skewness       1.688026
99%     2.405483       2.583755       Kurtosis       4.818971

Random Effects Distribution

90th percentile: u90 = 1.338

10th percentile: u10 = -0.388

the risk (in odds terms) for a family at the 90th percentile is

exp(1.338 - (-0.388)) = exp(1.726) ≈ 5.62

times that of a family at the 10th percentile

even if families are compositionally identical on covariates, we can assess the hypothetical differential in risks

Growth Curve Models

growth models

individuals are level-2 units

repeated measures over time on individuals (level-1)

models imply that logits vary across individuals

intercept (conditional average logit) varies

slope (conditional average effect of time) varies

change is usually assumed to be linear

use GLMM

complications due to dimensionality

intercept and slope may co-vary (necessitating a more complex model), and more

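Growth Curve Models

multilevel logit model for change over time: $\operatorname{logit}(p_{ij}) = \beta_{0i} + \beta_{1i}T_{ij}$

T is time (strictly increasing)

fixed and random coefficients (with covariates):

$\operatorname{logit}(p_{ij}) = \beta_{0i} + \beta_{1i}T_{ij}$

$\beta_{0i} = \gamma_{00} + \gamma_{01}X_i + u_{0i}$

$\beta_{1i} = \gamma_{10} + \gamma_{11}X_i + u_{1i}$

assume that u0 and u1 are bivariate normal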

Multilevel Logit Models for Change

Example: Log odds of employment of black men in the U.S. 1982-1988 (NLSY) (consider 5 years in this period)

time is coded 0, 1, 3, 4, 6

dependent variable is: not-working, not-in-school

unconditional growth (no covariates except T)

conditional growth (add covariates)

note: cross-level interactions implied by composite model

$\operatorname{logit}(p_{ij}) = \gamma_{00} + \gamma_{01}X_i + \gamma_{10}T_{ij} + \gamma_{11}X_i T_{ij} + u_{0i} + u_{1i}T_{ij}$

Fitting Multilevel Model for Change

programming

Stata (unconditional growth)

xtmelogit y year || id: year, var cov(un)

Stata (conditional growth)

xtmelogit y year south unem unemyr inc hs || id: year, var cov(un)

Fitting Multilevel Model for Change

Mixed-effects logistic regression               Number of obs      =      3430
Group variable: id                              Number of groups   =       686

                                                Obs per group: min =         5
                                                               avg =       5.0
                                                               max =         5

           y |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
        year |  -.1467877   .0293921    -4.99   0.000    -.2043952   -.0891801
       _cons |  -.8742502   .0972809    -8.99   0.000    -1.064917   -.6835831

  Random-effects Parameters  |   Estimate   Std. Err.     [95% Conf. Interval]
id: Unstructured             |
                   var(year) |   .0552714   .0241599      .0234654    .1301886
                  var(_cons) |   1.796561   .4330881      1.120075    2.881622
             cov(year,_cons) |  -.0517392   .0789636      -.206505    .1030266

LR test vs. logistic regression:     chi2(3) =   250.61   Prob > chi2 = 0.0000

Fitting Multilevel Logit Model for Change

Mixed-effects logistic regression               Number of obs      =      3430
Group variable: id                              Number of groups   =       686

                                                Obs per group: min =         5
                                                               avg =       5.0
                                                               max =         5

           y |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
        year |  -.0921512   .0281795    -3.27   0.001    -.1473819   -.0369205
       south |  -.6523682   .1283314    -5.08   0.000    -.9038931   -.4008434
        unem |   1.014915   .2408795     4.21   0.000     .5428002     1.48703
      unemyr |  -.1120936   .0641975    -1.75   0.081    -.2379184    .0137313
         inc |  -.5732738   .1872211    -3.06   0.002    -.9402205   -.2063271
          hs |   -.785545   .1242026    -6.32   0.000    -1.028978   -.5421124
       _cons |  -.0612559   .1285939    -0.48   0.634    -.3132954    .1907836

  Random-effects Parameters  |   Estimate   Std. Err.     [95% Conf. Interval]
id: Unstructured             |
                   var(year) |   .0433477   .0219905       .016038    .1171612
                  var(_cons) |   1.304833   .3648705      .7542816    2.257233
             cov(year,_cons) |  -.0622441   .0708861     -.2011783      .07669

LR test vs. logistic regression:     chi2(3) =   140.20   Prob > chi2 = 0.0000

Logits: Observed, Conditional, and Marginal

the log odds of idleness decrease with time and show variation in both level and rate of change

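Composite Residuals in a Growth Model

composite residual

$$r_{ij} = u_{0i} + u_{1i} T_{ij} + \varepsilon_{ij}$$

composite residual variance

$$\operatorname{var}(r_{ij}) = \sigma_0^2 + T_j^2 \sigma_1^2 + 2 T_j \sigma_{01} + \frac{\pi^2}{3}$$

covariance of composite residual

$$\operatorname{cov}(r_{ij}, r_{ij'}) = \sigma_0^2 + T_j T_{j'} \sigma_1^2 + (T_j + T_{j'}) \sigma_{01}$$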
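To make the composite residual formulas concrete, here is a minimal Python sketch that plugs in the variance components from the unconditional growth output (var(_cons), var(year), cov(year,_cons)). It assumes the usual latent-variable convention that the level-1 residual of a logit model has variance π²/3; the function names are illustrative, not part of any package.

```python
import math

# Variance components copied from the unconditional growth output
var_u0 = 1.796561      # var(_cons): variance of random intercepts
var_u1 = 0.0552714     # var(year): variance of random slopes
cov_u01 = -0.0517392   # cov(year,_cons)
var_eps = math.pi ** 2 / 3  # assumed level-1 variance (standard logistic)

def composite_var(t):
    """Variance of the composite residual at time t."""
    return var_u0 + var_u1 * t ** 2 + 2 * cov_u01 * t + var_eps

def composite_cov(t, s):
    """Covariance of composite residuals at two different times t and s."""
    return var_u0 + var_u1 * t * s + cov_u01 * (t + s)

for t in [0, 1, 3, 4, 6]:
    print(t, round(composite_var(t), 3))
```

The variance depends on time through both the slope variance and the intercept-slope covariance, so the composite residual is heteroscedastic unless both are zero.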

Model

covariance term is not significantly different from 0 (in either model the 95% interval for cov(year,_cons) includes 0)

results in simplified interpretation

easier estimation via variance components (default option)

significant variation in slopes and initial levels

other results:

log odds of idleness decrease over time (negative slope)

other covariates except county unemployment have significant effects on the odds of idleness

the main effects are interpreted as effects on initial logits (at time 1, i.e., t = 0, the 1982 baseline)

interaction of time and unemployment rate captures the effect of county unemployment rate in 1982 on the change in the log odds of idleness

the positive effect implies that higher county unemployment tends to dampen change in odds

IRT Models

IRT models

Item Response Theory

models account for an individual-level random effect on a set of items (i.e., ability)

items are assumed to tap a single latent construct (aptitude on a specific subject)

item difficulty

test items are assumed to be ordered on a difficulty scale: easier → harder

expected patterns emerge whereby if a more difficult item is answered correctly the easier items are likely to have been answered correctly

IRT Models

1-parameter logistic (Rasch) model

$$\operatorname{logit}(p_{ij}) = \theta_i - b_j$$

$p_{ij}$: individual i’s probability of a correct response on the jth item

$\theta_i$: individual i’s ability

$b_j$: item j’s difficulty

properties

an individual’s ability parameter is invariant with respect to the item

the difficulty parameter is invariant with respect to individual’s ability

higher ability or lower item difficulty lead to a higher probability of a correct response

both ability and difficulty are measured on the same scale

ICC

item characteristic curve (item response curve)

depicts the probability of a correct response as a function of an examinee’s ability or trait level

curves are shifted rightward with increasing item difficulty

assume that item 3 is more difficult than item 2 and item 2 is more difficult than item 1

at a given ability level, the probability of a correct response decreases from item 1 to item 3; each curve crosses probability 0.5 at its threshold θ = b_j, so the thresholds reflect increasing item difficulty

IRT Models: ICC (3 Items)

slopes of item characteristic curves are equal when ability = item difficulty (θ = b_j)
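The item characteristic curves of the 1PL model can be traced numerically. The following Python sketch evaluates logit(p) = θ − b for three items with made-up difficulties (the values −1, 0, 1 are hypothetical, chosen only so that item 3 is harder than item 2, which is harder than item 1) and also computes the slope of each curve, which equals p(1 − p).

```python
import math

def rasch_prob(theta, b):
    """P(correct) under the 1PL model: logit(p) = theta - b."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def icc_slope(theta, b):
    """Derivative of the ICC with respect to ability: p * (1 - p)."""
    p = rasch_prob(theta, b)
    return p * (1.0 - p)

difficulties = [-1.0, 0.0, 1.0]       # hypothetical items 1, 2, 3 (easy to hard)
abilities = [-3, -2, -1, 0, 1, 2, 3]  # grid of ability values

for theta in abilities:
    probs = [rasch_prob(theta, b) for b in difficulties]
    print(theta, [round(p, 3) for p in probs])
```

At θ = b every item has probability 0.5 and slope 0.25, which is why the 1PL curves are parallel shifts of one another and never cross.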

Estimation as GLMM

specification:

$$\operatorname{logit}(p_{ij}) = \beta_j + u_i = \mathbf{x}_{ij}'\boldsymbol{\beta} + u_i$$

set up a person-item data structure

define x as a set of dummy variables

change signs on β to reflect “difficulty”

fit model without intercept to estimate all item difficulties

normalization is common

$$\sum_{j=1}^{J} \beta_j = 0 \quad\text{and}\quad \sigma_u^2 = 1.0$$

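PL1 Estimation

Stata (data set up)

clear
set memory 128m
infile junk y1-y5 f using LSAT.dat
drop if junk==11 | junk==13
* one record per examinee: replicate each response pattern f times
expand f
drop f junk
gen cons = 1
* collapse back to unique response patterns; wt2 = pattern frequency (level-2 weight)
collapse (sum) wt2=cons, by(y1-y5)
gen id = _n
sort id
* person-item (long) form: one row per pattern and item
reshape long y, i(id) j(item)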

PL1 Estimation

Stata (model set up)

gen i1 = 0
gen i2 = 0
gen i3 = 0
gen i4 = 0
gen i5 = 0
replace i1 = 1 if item == 1
replace i2 = 1 if item == 2
replace i3 = 1 if item == 3
replace i4 = 1 if item == 4
replace i5 = 1 if item == 5
** 1PL
* constrain sd=1
cons 1 [id1]_cons = 1
gllamm y i1-i5, i(id) weight(wt) nocons family(binom) cons(1) link(logit) adapt

PL1 Estimation

Stata (output)

number of level 1 units = 5000
number of level 2 units = 1000

Condition Number = 1.8420141

gllamm model with constraints:
 ( 1)  [id1]_cons = 1

log likelihood = -2473.054321704064

------------------------------------------------------------------------------
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          i1 |   2.871972   .1287498    22.31   0.000     2.619627    3.124317
          i2 |   1.063026   .0821146    12.95   0.000      .902084    1.223967
          i3 |   .2576052   .0765907     3.36   0.001     .1074903    .4077202
          i4 |   1.388057    .086496    16.05   0.000     1.218528    1.557586
          i5 |   2.218779    .104828    21.17   0.000      2.01332    2.424238
------------------------------------------------------------------------------

Variances and covariances of random effects
------------------------------------------------------------------------------

***level 2 (id)

    var(1): 1 (0)
------------------------------------------------------------------------------

PL1 Estimation

Stata (parameter normalization)

* normalized solution
*[1 -- standard 1PL]
*[2 -- coefs sum to 0] [var = 1]
mata
bALL = st_matrix("e(b)")
b = -bALL[1,1..5]
mb = mean(b')
bs = b:-mb
("MML Estimates", "IRT parameters", "B-A Normalization")
(-b', b', bs')
end

PL1 Estimation

Stata (normalized solution)

param   MML Estimates   IRT      Normalized
1        2.87           -2.87    -1.31
2        1.06           -1.06     0.50
3        0.26           -0.26     1.30
4        1.39           -1.39     0.17
5        2.22           -2.22    -0.66
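The normalized column can be reproduced by hand from the gllamm coefficients: change the sign to obtain difficulties, then center them so they sum to zero. The short Python sketch below mirrors the Mata steps; the coefficient values are copied from the 1PL output, and the variable names are illustrative.

```python
# Item coefficients (i1..i5) copied from the 1PL gllamm output
coefs = [2.871972, 1.063026, 0.2576052, 1.388057, 2.218779]

# Sign change: IRT difficulty b_j = -beta_j
difficulty = [-c for c in coefs]

# Center the difficulties so that they sum to zero
mean_b = sum(difficulty) / len(difficulty)
normalized = [b - mean_b for b in difficulty]

for j, (c, b, bs) in enumerate(zip(coefs, difficulty, normalized), start=1):
    print(f"{j}  {c:6.2f}  {b:6.2f}  {bs:6.2f}")
```

Centering leaves the differences between item difficulties unchanged; it only fixes the origin of the scale.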

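IRT: Extensions

2-parameter logistic (2PL) model

$$\operatorname{logit}(p_{ij}) = a_j(\theta_i - b_j)$$

$$\operatorname{logit}(p_{ij}) = \beta_j + \lambda_j u_i = \mathbf{x}_{ij}'\boldsymbol{\beta} + \mathbf{x}_{ij}'\boldsymbol{\lambda}\, u_i$$

$$b_j = -\frac{\beta_j}{\lambda_j}$$

$\lambda_j$ is a factor loading on the random effect

item discrimination parameters

$$\sum_j b_j = 0 \quad\text{and}\quad \prod_j a_j = 1 \quad\text{(normalization)}$$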


item discrimination parameters

reveal differences in item’s utility to distinguish different ability levels among examinees

high values denote items that are more useful in terms of separating examinees into different ability levels

low values denote items that are less useful in distinguishing examinees in terms of ability

ICCs corresponding to this model can intersect as they differ in location and slope

steeper slope of the ICC is associated with a better discriminating item


Stata (estimation)

eq id: i1 i2 i3 i4 i5
cons 1 [id1_1]i1 = 1
gllamm y i1-i5, i(id) weight(wt) nocons family(binom) link(logit) frload(1) eqs(id) cons(1) adapt
matrix list e(b)
*normalized solutions
*1 standard 2PL)
mata
bALL = st_matrix("e(b)")
b = bALL[1,1..5]
c = bALL[1,6..10]
a = -b:/c
("MML Estimates-Dif", "IRT Parameters")
(b', a')
("MML Discrimination Parameters")
(c')
end


Stata (estimation)

* Bock and Aitkin Normalization (p. 164 corrected)
mata
bALL = st_matrix("e(b)")
b = -bALL[1,1..5]
c = bALL[1,6..10]
lc = ln(c)
mb = mean(b')
mc = mean(lc')
bs = b:-mb
cs = exp(lc:-mc)
("B-A Normalization DIFFICULTY", "B-A Normalization DISCRIMINATION")
(bs', cs')
end

IRT: 2PL (1)

log likelihood = -2466.653343760672

------------------------------------------------------------------------------
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          i1 |   2.773234    .205743    13.48   0.000     2.369985    3.176483
          i2 |   .9901996   .0900182    11.00   0.000     .8137672    1.166632
          i3 |     .24915   .0762746     3.27   0.001     .0996546    .3986454
          i4 |   1.284755   .0990363    12.97   0.000     1.090647    1.478862
          i5 |   2.053265   .1353574    15.17   0.000      1.78797    2.318561
------------------------------------------------------------------------------

Variances and covariances of random effects
------------------------------------------------------------------------------

***level 2 (id)

    var(1): 1 (0)

    loadings for random effect 1
    i1: .82565942 (.25811315)
    i2: .72273928 (.18667773)
    i3: .890914 (.2328178)
    i4: .68836241 (.18513868)
    i5: .65684452 (.20990788)

IRT: 2PL (2)

Bock-Aitkin Normalization

item    Item Difficulty Parameter   Discrimination Parameter
1       -1.30                       1.10
2        0.48                       0.96
3        1.22                       1.18
4        0.19                       0.92
5       -0.58                       0.87
check    0                          1

item 3 has highest difficulty and greatest discrimination
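The Bock-Aitkin normalization for the 2PL model centers the sign-reversed item intercepts to sum to zero and rescales the loadings so that their product is one (equivalently, their logs sum to zero). The Python sketch below follows the same steps as the Mata code, using the intercepts and loadings copied from the 2PL output; the variable names are illustrative.

```python
import math

# Item intercepts (i1..i5) and factor loadings copied from the 2PL output
intercepts = [2.773234, 0.9901996, 0.24915, 1.284755, 2.053265]
loadings = [0.82565942, 0.72273928, 0.890914, 0.68836241, 0.65684452]

# Difficulty: reverse the sign, then center so the values sum to 0
b = [-x for x in intercepts]
mean_b = sum(b) / len(b)
b_norm = [x - mean_b for x in b]

# Discrimination: center on the log scale so the product equals 1
log_a = [math.log(x) for x in loadings]
mean_log_a = sum(log_a) / len(log_a)
a_norm = [math.exp(x - mean_log_a) for x in log_a]

for j, (bs, cs) in enumerate(zip(b_norm, a_norm), start=1):
    print(f"{j}  {bs:6.2f}  {cs:5.2f}")
print("check", round(sum(b_norm), 6), round(math.prod(a_norm), 6))
```

The final line corresponds to the "check" row of the table: the normalized difficulties sum to 0 and the normalized discriminations multiply to 1.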

1PL and 2PL

Binary Response Models for Event Occurrence

discrete-time event-history models

purpose: model the probability of an event occurring at some point in time

Pr(event at t | event has not yet occurred by t)

life table

events & trials

observe the number of events occurring to those who remain at risk as time passes

takes account of the changing composition of the sample as time passes

Life Table

Life Table

observe

$R_j$: number at risk in time interval j ($R_0 = n$), where the number at risk in interval j is adjusted over time

$D_j$: events in time interval j ($D_0 = 0$)

$W_j$: removed from risk (censored) in time interval j ($W_0 = 0$) (removed from risk due to other unrelated causes)

$$R_j = R_{j-1} - D_{j-1} - W_{j-1}$$

Life Table

other key quantities

discrete-time hazard (event probability in interval j)

$$\hat{p}_j = \frac{D_j}{R_j}$$

surviving fraction (survivor function in interval j)

$$\hat{S}_j = \prod_{k=1}^{j} (1 - \hat{p}_k)$$
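A life table is simple bookkeeping, and a small worked example may help. The Python sketch below starts from n individuals, updates the risk set by subtracting the previous interval's events and censored cases, and computes the hazard (events divided by number at risk) and the running product of (1 − hazard) as the surviving fraction. The counts are invented for illustration, and `life_table` is a hypothetical helper, not a library function.

```python
def life_table(n, events, censored):
    """Return one row per interval: (j, R_j, D_j, W_j, hazard_j, survival_j)."""
    rows = []
    at_risk = n      # R_1 = R_0 - D_0 - W_0 = n, since D_0 = W_0 = 0
    surv = 1.0
    for j, (d, w) in enumerate(zip(events, censored), start=1):
        hazard = d / at_risk          # p_j = D_j / R_j
        surv *= 1.0 - hazard          # S_j = product of (1 - p_k) for k <= j
        rows.append((j, at_risk, d, w, hazard, surv))
        at_risk -= d + w              # R_{j+1} = R_j - D_j - W_j
    return rows

# Hypothetical data: 100 individuals followed over three intervals
rows = life_table(n=100, events=[10, 18, 12], censored=[0, 9, 6])
for j, r, d, w, p, s in rows:
    print(f"j={j}  R={r}  D={d}  W={w}  hazard={p:.3f}  S={s:.3f}")
```

Censored cases leave the risk set without counting as events, so they lower the denominator of later hazards but never the numerator.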

Discrete-Time Hazard Models

statistical concepts

discrete random variable $T_i$ (individual’s event or censoring time)

pdf of T (probability that individual i experiences event in period j)

$$f(t_{ij}) = \Pr(T_i = j)$$

cdf of T (probability that individual i experiences event in period j or earlier)

$$F(t_{ij}) = \Pr(T_i \le j) = \sum_{k=1}^{j} f(t_{ik})$$

survivor function (probability that individual i survives past period j)

$$S(t_{ij}) = \Pr(T_i > j) = 1 - F(t_{ij})$$

Discrete-Time Hazard Models

statistical concepts

discrete hazard

$$p_{ij} = \Pr(T_i = j \mid T_i \ge j)$$

the conditional probability of event occurrence in interval j for individual i given that an event has not already occurred to that individual by interval j
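The pdf, cdf, and survivor function are all determined by the sequence of discrete hazards: an event in period j requires surviving periods 1 to j − 1 and then experiencing the event. A minimal Python sketch of these relations follows; the hazard values are hypothetical and `distribution_from_hazards` is an illustrative helper.

```python
def distribution_from_hazards(hazards):
    """Return lists (f, F, S) for periods 1..J given discrete hazards p_1..p_J."""
    f, F, S = [], [], []
    surv = 1.0   # probability of surviving all periods so far
    cum = 0.0    # cumulative probability of an event so far
    for p in hazards:
        f_j = p * surv      # survive periods 1..j-1, then event in period j
        surv *= 1.0 - p     # survive period j as well
        cum += f_j
        f.append(f_j)
        F.append(cum)
        S.append(surv)
    return f, F, S

hazards = [0.10, 0.20, 0.25]   # hypothetical hazards for three periods
f, F, S = distribution_from_hazards(hazards)
print(f, F, S)
```

Dividing f in period j by the survivor function at period j − 1 recovers the hazard, which is the conditional-probability definition written the other way around.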

Discrete-Time Hazard Models

equivalent expression using binary data

binary data: $d_{ij} = 1$ if individual i experiences an event in interval j, 0 otherwise

use the sequence of binary values at each interval to form a history of the process for individual i up to the time the event occurs

discrete hazard

$$p_{ij} = \Pr(d_{ij} = 1 \mid d_{i,j-1} = 0, d_{i,j-2} = 0, \ldots, d_{i1} = 0)$$

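Discrete-Time Hazard Models

modeling (logit link)

$$p_{ij} = \frac{\exp(\alpha_j + \mathbf{x}_{ij}'\boldsymbol{\beta})}{1 + \exp(\alpha_j + \mathbf{x}_{ij}'\boldsymbol{\beta})}$$

modeling (complementary log-log link)

$$p_{ij} = 1 - \exp\left[-\exp(\alpha_j + \mathbf{x}_{ij}'\boldsymbol{\beta})\right]$$

non-proportional effects

$$\operatorname{logit}(p_{ij}) = \alpha_j + \mathbf{x}_{ij}'\boldsymbol{\beta}_j$$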
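The two links map the same linear predictor (a period-specific intercept plus covariate effects) to slightly different hazards. The Python sketch below compares them for a few hypothetical intercepts and a single binary covariate; all parameter values are invented for illustration.

```python
import math

def hazard_logit(eta):
    """Inverse logit link: p = exp(eta) / (1 + exp(eta))."""
    return 1.0 / (1.0 + math.exp(-eta))

def hazard_cloglog(eta):
    """Inverse complementary log-log link: p = 1 - exp(-exp(eta))."""
    return 1.0 - math.exp(-math.exp(eta))

# Hypothetical values, not estimates from the lecture
alpha = [-3.0, -2.5, -2.0]   # period-specific baseline intercepts
beta = 0.5                   # effect of one binary covariate x

for j, a in enumerate(alpha, start=1):
    for x in (0, 1):
        eta = a + beta * x
        print(j, x, round(hazard_logit(eta), 4), round(hazard_cloglog(eta), 4))
```

When hazards are small the two links give nearly identical probabilities. Under the logit link exp(β) is an odds ratio for the hazard; under the complementary log-log link it is usually read as a hazard ratio from an underlying continuous-time proportional hazards model.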

Data Structure

person-level data

person-period form

Data Structure

binary sequences
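Converting person-level records to person-period form means writing one row per individual per interval at risk, with the binary indicator equal to 1 only in the interval where the event occurs. A minimal Python sketch under these assumptions is shown below; the record layout (id, last observed period, event flag) and the helper name `person_period` are hypothetical.

```python
def person_period(records):
    """Expand (id, time, event) records into one row per person-period.

    time  : last period in which the person is observed (1-based)
    event : 1 if the event occurred in that period, 0 if censored
    """
    rows = []
    for pid, time, event in records:
        for j in range(1, time + 1):
            d = 1 if (event == 1 and j == time) else 0
            rows.append({"id": pid, "period": j, "d": d})
    return rows

# Hypothetical person-level data: person 1 has the event in period 3,
# person 2 is censored after period 2
records = [(1, 3, 1), (2, 2, 0)]
long_rows = person_period(records)
for row in long_rows:
    print(row)
```

A person with the event in period 3 contributes the binary sequence 0, 0, 1, while a person censored after period 2 contributes 0, 0.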

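Estimation

contributions to likelihood

$$L_i = \begin{cases} \Pr(T_i = j) = f(t_{ij}) & \text{if } d_{ij} = 1, \\ \Pr(T_i > j) = S(t_{ij}) & \text{if } d_{ij} = 0. \end{cases}$$

contribution to log L for individual with event in period j

$$\log L_i = d_{ij} \log p_{ij} + \sum_{k=1}^{j-1} \log(1 - p_{ik})$$

contribution to log L for individual censored in period j

$$\log L_i = \sum_{k=1}^{j} \log(1 - p_{ik})$$

combine

$$\log L = \sum_{i=1}^{n} \sum_{k=1}^{j} \left[ d_{ik} \log p_{ik} + (1 - d_{ik}) \log(1 - p_{ik}) \right]$$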
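The combined log likelihood has the form of a Bernoulli (binary response) log likelihood over person-period records, which is why standard logit software can fit these models. The Python sketch below checks this numerically for one individual by computing the contribution two ways: from the event-history definition and from the sum over binary person-period records. The hazards are hypothetical and both function names are illustrative.

```python
import math

def loglik_event_history(hazards, time, event):
    """log f(t) if the event occurs in period `time`, log S(t) if censored."""
    ll = sum(math.log(1.0 - p) for p in hazards[:time - 1])
    last = hazards[time - 1]
    return ll + (math.log(last) if event else math.log(1.0 - last))

def loglik_bernoulli(hazards, time, event):
    """Sum of Bernoulli log likelihood terms over person-period records."""
    ll = 0.0
    for k in range(1, time + 1):
        d = 1 if (event and k == time) else 0
        p = hazards[k - 1]
        ll += d * math.log(p) + (1 - d) * math.log(1.0 - p)
    return ll

hazards = [0.10, 0.20, 0.25]   # hypothetical hazards for periods 1..3
for time, event in [(3, 1), (2, 0)]:
    print(time, event,
          loglik_event_history(hazards, time, event),
          loglik_bernoulli(hazards, time, event))
```

The two calculations agree for both the event case and the censored case, so no special survival-analysis routine is needed once the data are in person-period form.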

Example: dropping out of Ph.D. programs (large US university)

data: 6,964 individual histories spanning 20 years

dropout cannot be distinguished from other types of leaving (transfer to other program etc.)

model the logit hazard of leaving the originally-entered program as a function of the following:

time in program (the time-dependent baseline hazard)

female and percent female in program

race/ethnicity (black, Hispanic, Asian)

marital status

GRE score

also add a program-specific random effect (multilevel)

Example:

clear
set memory 512m
infile CID devnt I1-I5 female pctfem black hisp asian married gre using DT28432.dat
* baseline hazard only: time dummies I1-I5, no constant, odds ratios reported
logit devnt I1-I5, nocons or
est store m1
* add covariates in blocks, storing each model
logit devnt I1-I5 female pctfem, nocons or
est store m2
logit devnt I1-I5 female pctfem black hisp asian, nocons or
est store m3
logit devnt I1-I5 female pctfem black hisp asian married, nocons or
est store m4
logit devnt I1-I5 female pctfem black hisp asian married gre, nocons or
