University of Ottawa - Bio 4158 – Applied Biostatistics © Antoine Morin and Scott Findlay 22-07-02 23:52


Goodness of fit, contingency tables and log-linear models

Appropriate questions

The null hypothesis

Tests of independence

Subdividing tables

Multiway tables and log-linear models

Power analysis in goodness of fit and contingency tables

Concepts map

Fitting distributions: goodness of fit

Data binned into classes

Chi-square statistic

G statistic

Requires n>30

Requires many classes, n>>30 for normal distributions

Requires expected frequencies >5. Solution: combine categories

Special corrections when 2 categories

Continuity correction

Williams' correction

Binomial test

Multinomial test

Tests of normality

Chi-square and G: not recommended

Visual test: often enough

Kolmogorov-Smirnov

Lilliefors

Goodness of fit

• Measures the extent to which some empirical distribution “fits” the distribution expected under the null hypothesis.

[Figure: observed and expected frequencies plotted against fork length (20 to 60).]

Goodness of fit: the underlying principle

• If the match between observed and expected is poorer than would be expected on the basis of measurement precision, then we should reject the null hypothesis.

[Figure: two panels of observed and expected frequencies against fork length, one labelled “Reject H0” and one labelled “Accept H0”.]

Testing goodness of fit: the Chi-square statistic (χ²)

• Used for frequency data, i.e. the number of observations/results in each of n categories compared to the number expected under the null hypothesis.

$$\chi^2 = \sum_{i=1}^{n} \frac{(f_i - \hat{f}_i)^2}{\hat{f}_i}$$

where $f_i$ is the observed and $\hat{f}_i$ the expected frequency in category $i$.

[Figure: observed and expected frequencies by category/class.]

How to translate χ² into p?

• Compare to the χ² distribution with n - 1 degrees of freedom.

• If p is less than the desired level, reject the null hypothesis.
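The conversion from χ² to p can be sketched in a few lines of Python. This is a minimal illustration, not the course's own software: the function name `chi2_sf` is hypothetical, and it uses the closed-form upper-tail expressions that exist for integer degrees of freedom, so it needs only the standard library.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability P(X >= x) for a chi-square variable with integer df >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{k=0}^{df/2 - 1} (x/2)^k / k!
        term, total = 1.0, 1.0
        for k in range(1, df // 2):
            term *= half / k
            total += term
        return math.exp(-half) * total
    # Odd df: erfc(sqrt(x/2)) + sqrt(2x/pi) * exp(-x/2) * sum_k x^k / (1*3*...*(2k+1))
    term, total = 1.0, 0.0
    for k in range((df - 1) // 2):
        if k > 0:
            term *= x / (2 * k + 1)
        total += term
    tail = math.sqrt(2.0 * x / math.pi) * math.exp(-half) * total
    return math.erfc(math.sqrt(half)) + tail


# A statistic of 8.5 on 5 degrees of freedom.
p = chi2_sf(8.5, 5)
print(f"chi-square = 8.5, df = 5, p = {p:.3f}")  # about 0.13
```

With α = 0.05, a p of this size would not lead to rejection of the null hypothesis.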

[Figure: probability density of χ² for df = 5 (horizontal axis 0 to 20). The observed value χ² = 8.5 gives p = 0.13, which lies outside the rejection region marked by p = α = 0.05, so H0 is accepted.]

Testing goodness of fit: the log likelihood-ratio Chi-square statistic (G)

• Similar to χ², and usually gives similar results.

• In some cases, G is more conservative (i.e. will give higher p values).

$$G = 2 \sum_{i=1}^{n} f_i \ln\left(\frac{f_i}{\hat{f}_i}\right)$$

[Figure: observed and expected frequencies by category/class.]
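To see how close the two statistics usually are, both can be computed on the same counts. The sketch below is illustrative only: the function names are hypothetical, and the counts are the three pooled age classes that appear in a later slide of this deck, used here purely as sample input.

```python
import math


def chi_square_stat(observed, expected):
    """Pearson chi-square: sum of (observed - expected)^2 / expected."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def g_stat(observed, expected):
    """Log likelihood-ratio statistic: 2 * sum of observed * ln(observed / expected)."""
    # Cells with an observed count of zero contribute nothing to G.
    return 2.0 * sum(o * math.log(o / e) for o, e in zip(observed, expected) if o > 0)


observed = [33, 14, 9]
expected = [37, 12, 7]

x2 = chi_square_stat(observed, expected)
g = g_stat(observed, expected)
print(f"chi-square = {x2:.3f}, G = {g:.3f}")
```

For these counts the two values differ only in the second decimal place, and G is the smaller of the two.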

The χ² distribution versus the distribution of the χ² or G statistic

• For both χ² and G, p values are calculated assuming a χ² distribution...

• ...but as n decreases, both deviate more and more from χ².

[Figure: probability density (df = 5, horizontal axis 0 to 20) of the χ² distribution compared with the distributions of χ²/G for small n and for very small n.]

Assumptions (χ² and G)

• n is larger than 30.

• Expected frequencies are all larger than 5.

• Test is quite robust except when there are only 2 categories (df = 1).

• For 2 categories, both X² and G overestimate χ², leading to rejection of the null hypothesis with probability greater than α, i.e. the test is liberal.

What if n is too small, there are only 2 categories, etc.?

• Collect more data, thereby increasing n.

• If there are more than 2 categories, combine categories.

• Use a correction factor.

• Use another test.

Original data:

Age (yrs)    1    2    3    4
Observed    33   14    8    1
Expected    37   12    5    2

More data:

Age (yrs)    1    2    3    4
Observed    57   24   12    5
Expected    55   24   13    6

Classes combined:

Age (yrs)    1    2   3+
Observed    33   14    9
Expected    37   12    7
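Combining classes can be automated for ordered categories such as age. The following is a rough sketch under one assumption that the slides do not state explicitly: only the last classes are pooled, working inward from the oldest age class. The helper name `pool_small_tail` is hypothetical.

```python
def pool_small_tail(observed, expected, minimum=5):
    """Pool the last ordered classes until the final expected frequency exceeds `minimum`."""
    obs, exp = list(observed), list(expected)
    while len(exp) > 2 and exp[-1] <= minimum:
        # Remove the last class, then add its counts to the new last class.
        last_obs, last_exp = obs.pop(), exp.pop()
        obs[-1] += last_obs
        exp[-1] += last_exp
    return obs, exp


pooled_obs, pooled_exp = pool_small_tail([33, 14, 8, 1], [37, 12, 5, 2])
print(pooled_obs, pooled_exp)  # [33, 14, 9] [37, 12, 7]
```

Pooling costs one degree of freedom per merged class, so it is a trade between meeting the expected-frequency requirement and keeping resolution.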

Corrections for 2 categories

• For 2 categories, both X² and G overestimate χ², leading to rejection of the null hypothesis with probability greater than α, i.e. the test is liberal.

• Continuity correction: add 0.5 to observed frequencies.

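• Williams’ correction: divide the test statistic (G or χ²) by:

$$q = 1 + \frac{k^2 - 1}{6n(k - 1)}$$

where $k$ is the number of categories and $n$ the total number of observations.

Original data:

Age (yrs)    1    2
Observed    17    8
Expected    20    5

With 0.5 added to each observed frequency:

Age (yrs)      1       2
Observed    17.5     8.5
Expected   20.67    5.33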
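Williams' correction is simple enough to apply by hand, but a short sketch makes the arithmetic explicit. The function name `williams_q` is hypothetical; the counts are the two age classes tabulated above, and G is used as the statistic being corrected.

```python
import math


def williams_q(k, n):
    """Williams' correction factor q = 1 + (k^2 - 1) / (6 n (k - 1))."""
    return 1.0 + (k ** 2 - 1) / (6.0 * n * (k - 1))


observed = [17, 8]
expected = [20, 5]
n = sum(observed)

# Uncorrected log likelihood-ratio statistic.
g = 2.0 * sum(o * math.log(o / e) for o, e in zip(observed, expected))
q = williams_q(len(observed), n)
g_adj = g / q
print(f"G = {g:.3f}, q = {q:.3f}, adjusted G = {g_adj:.3f}")
```

Because q is always greater than 1, the adjusted statistic is smaller than the raw one, which counteracts the liberal behaviour described in the first bullet.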

Contingency analysis: types of questions

• Involves two (or more) categorical variables, each with 2 or more categories.

• Considers the number of observations (observed frequencies) in each category of the variables.

• Test is for lack of independence.

Results of tests on the efficacy of two sprays (1, 2) in reducing apple blight infection in orchards:

Spray   Infected   Not infected
1       17         63
2       59         21

Contingency analysis: types of questions

• Does the species composition of bird communities differ among habitats?

• 2 categorical variables: species, habitat type

• H0: the proportion of individuals of each species is independent of (i.e. more or less the same in each) habitat.

Species   Habitat 1   Habitat 2
1         27          63
2         49          91
3         12          19
4         6           3

Components of the test

• Null hypothesis

• Observations (observed frequencies)

• Statistic (Chi-square or G)

• Assumptions

Null hypothesis

• In contingency analysis, the null hypothesis is that the distribution of observed frequencies among categories of one variable (e.g. A) is independent of the category of the other variables (B, C, ...), i.e. that there is no interaction.

• The null hypothesis is always intrinsic!

Testing H0: goodness-of-fit

• In contingency analysis (as in all statistical procedures) we fit a model to the data.

• H0 specifies particular values for particular terms (coefficients) in the model…

• …and is evaluated by assessing how well the fitted model, with parameter values as specified by H0, fits the data, i.e. by evaluating goodness-of-fit.

Reminder: goodness of fit

• Measures the extent to which some empirical distribution “fits” the distribution expected under the null hypothesis.

[Figure: observed and expected frequencies plotted against fork length (20 to 60).]

Testing goodness of fit: the Chi-square statistic (χ²)

• Used for frequency data, i.e. the number of observations/results in each of n categories compared to the number expected under the null hypothesis.

$$\chi^2 = \sum_{i=1}^{n} \frac{(f_i - \hat{f}_i)^2}{\hat{f}_i}$$

[Figure: observed and expected frequencies by category/class.]

Two-way tables: H0 accepted

• H0: the proportion of infected versus non-infected trees is the same for both sprays.

• In this case, we accept H0.

Spray   Infected   Not infected
1       17         63
2       19         61

[Figure: proportion infected for Spray 1 and Spray 2.]

Two-way tables: H0 rejected

• H0: proportion of infected versus non-infected trees is the same for both sprays.

• In this case, we reject H0.

Spray   Infected   Not infected
1       17         63
2       59         21

[Figure: proportion infected for Spray 1 and Spray 2.]
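The two spray tables can be compared numerically. The sketch below is an illustration rather than the course's procedure: function names are hypothetical, no continuity or Williams' correction is applied, and the p value uses the fact that a χ² variable with 1 degree of freedom has upper tail erfc(√(χ²/2)).

```python
import math


def chi_square_independence(table):
    """Pearson chi-square and df for a two-way table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, f in enumerate(row):
            f_hat = row_totals[i] * col_totals[j] / n  # expected under independence
            stat += (f - f_hat) ** 2 / f_hat
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df


def p_value_df1(stat):
    """Upper-tail probability for chi-square with exactly 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2.0))


similar = [[17, 63], [19, 61]]    # table where H0 is accepted
different = [[17, 63], [59, 21]]  # table where H0 is rejected

stat_similar, df_similar = chi_square_independence(similar)
stat_different, df_different = chi_square_independence(different)
print(f"similar:   chi-square = {stat_similar:.3f}, p = {p_value_df1(stat_similar):.3f}")
print(f"different: chi-square = {stat_different:.3f}, p = {p_value_df1(stat_different):.2e}")
```

The first table gives a statistic well below any usual critical value, the second one far above it, in line with the two slides.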

Two-way tables: the general model-fitting procedure

• Fit 2 models: one in which the interaction is included, the other with it removed.

• Evaluate GOF for each model.

• Evaluate the reduction in GOF associated with dropping the interaction, i.e. under H0 that the interaction is zero.

[Diagram: the goodness of fit (e.g. χ²) of Model 1 (interaction in) is compared with that of Model 2 (interaction out). Accept H0 if the difference is small; reject H0 if the difference is large.]

Two-way tables: H0 and model fit

• For two way tables, the general model includes a constant, two main effects, and an interaction.

• Thus, independence implies that the goodness-of-fit of a model with the interaction deleted is not significantly different from a model with the interaction included.

$$\ln \hat{f}_{ij} = \ln\mu + \ln\alpha_i + \ln\beta_j + \ln(\alpha\beta)_{ij}$$

[Figure: goodness of fit (e.g. G) for the models with the interaction in and with the interaction out; a small change in GOF leads to accepting H0, a large change to rejecting H0.]

Two-way tables: what does the general model mean anyway?

• The model attempts to predict the observed frequencies in each category.

• So, if all frequencies are equal, then the appropriate model is:

Spray   Infected   Not infected
1       20         20
2       20         20

N = 80, μ = 80/4 = 20

$$\ln \hat{f}_{ij} = \ln f_{ij} = \ln\mu$$

Two-way tables: what does the general model mean anyway?

• If N varies between the two sprays, then there will be a “main effect” due to spray.

• So the appropriate model includes a main effect due to spray (row, i).

Spray   Infected   Not infected
1       30         30
2       10         10

N = 80, μ = 80/4 = 20
f1·/2 = 30 = 1.5μ
f2·/2 = 10 = 0.5μ
α1 = 1.5, α2 = 0.5

$$\ln \hat{f}_{ij} = \ln f_{ij} = \ln\mu + \ln\alpha_i$$

Two-way tables: what does the general model mean anyway?

• If the total number of trees infected is different from the number not infected, then there will be a “main effect” due to “infection level”.

• So the appropriate model includes a main effect due to both spray type and infection level.

Spray   Infected   Not infected
1       24         36
2       8          12

$$\ln \hat{f}_{ij} = \ln f_{ij} = \ln\mu + \ln\alpha_i + \ln\beta_j$$

N = 80, μ = 80/4 = 20
f1·/2 = 30 = 1.5μ
f2·/2 = 10 = 0.5μ
α1 = 1.5, α2 = 0.5
f·1/2 = 16 = 0.8μ
f·2/2 = 24 = 1.2μ
β1 = 0.8, β2 = 1.2

Two-way tables: what does the general model mean anyway?

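• Since the expected frequency in cell (i,j) under H0 is:

$$\hat{f}_{ij} = \frac{R_i C_j}{N}$$

where $R_i$ and $C_j$ are the row and column totals,

• ... we can calculate the interaction by:

$$\ln(\alpha\beta)_{ij} = \ln f_{ij} - (\ln\mu + \ln\alpha_i + \ln\beta_j)$$

Spray   Infected   Not infected
1       20         40
2       10         20

N = 90, μ = 90/4 = 22.5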
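The parameter values worked out on these slides can be reproduced with a short script. This is a sketch under the parametrisation the slides appear to use (μ as the mean cell frequency, α and β as row and column means divided by μ); the function name `loglinear_terms` is hypothetical.

```python
import math


def loglinear_terms(table):
    """Return mu, row effects, column effects and multiplicative interaction terms."""
    rows, cols = len(table), len(table[0])
    n = sum(sum(row) for row in table)
    mu = n / (rows * cols)                                # mean cell frequency
    alpha = [sum(row) / cols / mu for row in table]       # row mean relative to mu
    beta = [sum(col) / rows / mu for col in zip(*table)]  # column mean relative to mu
    # Ratio of each observed cell to the prediction mu * alpha_i * beta_j.
    interaction = [
        [table[i][j] / (mu * alpha[i] * beta[j]) for j in range(cols)]
        for i in range(rows)
    ]
    return mu, alpha, beta, interaction


mu, alpha, beta, interaction = loglinear_terms([[20, 40], [10, 20]])
log_interaction = [[math.log(x) for x in row] for row in interaction]
print(mu, alpha, beta)
print(log_interaction)  # every entry is 0 (up to rounding): no interaction
```

Note that μ·α_i·β_j simplifies to R_i·C_j/N, so the main-effects prediction is exactly the expected frequency under independence. For the table above every interaction term is 1 (log 0), whereas for the spray table with counts 17, 63, 59, 21 the terms depart clearly from 1.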

Tests of independence: the Chi-square statistic (χ²)

• Calculate expected frequency for each cell in the table.

• Calculate the squared difference between observed and expected frequencies, divide by the expected frequency, and sum over all cells.

$$\chi^2 = \sum_{i=1}^{n}\sum_{j=1}^{m} \frac{(f_{ij} - \hat{f}_{ij})^2}{\hat{f}_{ij}}$$

[Figure: observed and expected frequencies by cell.]

Testing independence: the log likelihood-ratio Chi-square statistic (G)

• Similar to χ², and usually gives very similar results.

• In some cases, G is more conservative (i.e. will give higher p values).

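$$G = 2\left[\sum_{i=1}^{n}\sum_{j=1}^{m} f_{ij}\ln f_{ij} - \sum_{i=1}^{n} R_i \ln R_i - \sum_{j=1}^{m} C_j \ln C_j + N \ln N\right]$$

where $R_i$ and $C_j$ are the row and column totals and $N$ is the grand total.

[Figure: observed and expected frequencies by cell.]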
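The totals-based formula for G is algebraically the same as 2·Σ f·ln(f/f̂) with f̂ = R_i·C_j/N. A quick numerical check, sketched here with hypothetical function names and the spray table from the earlier slides as input, confirms that the two routes agree.

```python
import math


def g_from_expected(table):
    """G computed as 2 * sum f * ln(f / f_hat), with f_hat = R_i * C_j / N."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    total = 0.0
    for i, row in enumerate(table):
        for j, f in enumerate(row):
            if f > 0:  # empty cells contribute nothing
                total += f * math.log(f * n / (row_totals[i] * col_totals[j]))
    return 2.0 * total


def g_from_totals(table):
    """G computed from cell counts, marginal totals and the grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    cells = sum(f * math.log(f) for row in table for f in row if f > 0)
    rows = sum(r * math.log(r) for r in row_totals)
    cols = sum(c * math.log(c) for c in col_totals)
    return 2.0 * (cells - rows - cols + n * math.log(n))


spray = [[17, 63], [59, 21]]
g_a = g_from_expected(spray)
g_b = g_from_totals(spray)
print(f"G (via expected) = {g_a:.3f}, G (via totals) = {g_b:.3f}")
```

The totals form avoids computing expected frequencies at all, which was convenient for hand calculation.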

An example: sex-ratios of eider ducks in different habitats in Hudson Bay

• Cell counts are observed numbers (raw frequencies) of males and females in different habitats.

Habitat   Males   Females   Total
A         30      34        64
B         55      25        80
C         12      4         16
Total     97      63        160

Computing expected frequencies

• Use intrinsic null hypothesis and compute the probability of an observation falling into a cell in the table under this hypothesis.

• Partition the total number of observations according to these probabilities.

Habitat   Males   Females   Total
A         30      34        64
B         55      25        80
C         12      4         16
Total     97      63        160

p(A) = 64/160 = .40; p(male) = 97/160 = .6063

p(A, male) under H0 = p(A) × p(male) = .2425

f̂(A, male) = p(A, male) × 160 = 38.8
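The same calculation extends to every cell of the eider table. The sketch below uses a hypothetical helper, `expected_frequencies`, and then forms the Pearson statistic; the p value relies on the fact that for 2 degrees of freedom the χ² upper tail is exactly exp(−χ²/2).

```python
import math


def expected_frequencies(table):
    """Expected cell counts under independence: row total * column total / N."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


# Rows are habitats A, B, C; columns are males, females.
eiders = [[30, 34], [55, 25], [12, 4]]
expected = expected_frequencies(eiders)

chi2 = sum(
    (f - f_hat) ** 2 / f_hat
    for row, exp_row in zip(eiders, expected)
    for f, f_hat in zip(row, exp_row)
)
df = (len(eiders) - 1) * (len(eiders[0]) - 1)  # (3 - 1)(2 - 1) = 2
p = math.exp(-chi2 / 2.0)                      # exact upper tail for df = 2

for habitat, exp_row in zip("ABC", expected):
    print(habitat, [round(x, 1) for x in exp_row])
print(f"chi-square = {chi2:.3f}, df = {df}, p = {p:.4f}")
```

The expected count for males in habitat A comes out at 38.8, matching the hand calculation, and all six expected frequencies exceed 5.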

Assumptions (χ² and G)

• n is larger than 30.

• Expected frequencies are all larger than 5.

• Test is quite robust except when there are only 2 categories (df = 1).

• For 2 categories, both X² and G overestimate χ², leading to rejection of the null hypothesis with probability greater than α, i.e. the test is liberal.

What if n is too small, there are only 2 categories, etc.?

• Increase n.

• If there are more than 2 categories, combine categories.

• Use a correction factor.

• Use another test.

An example: combining categories

• With three habitat categories, expected frequencies are too small in 2 cells.

Habitat   Males   Females   Total
A         30      34        64
B         55      25        80
C         3       1         4
Total     88      60        148

• Therefore, combine habitats B and C.

Habitat   Males   Females   Total
A         30      34        64
B/C       58      26        84
Total     88      60        148

Corrections for 2 categories

• For 2 categories, both X² and G overestimate χ², leading to rejection of the null hypothesis with probability greater than α, i.e. the test is liberal.

• Continuity correction: add 0.5 to observed frequencies.

• Williams’ correction: divide the test statistic (G or χ²) by:

$$q = 1 + \frac{k^2 - 1}{6n(k - 1)}$$

Subdividing tables

• When null hypothesis is rejected, you may wish to determine which categories are contributing substantially to the overall significant test statistic.

• General procedure: find set of largest homogeneous subtables.

• Start with smallest homogeneous table, then add rows or columns until the null hypothesis is rejected.

Subdividing tables

Habitat   Male   Female   Total
A         30     34       64
B         55     25       80
C         12     4        16
Total     97     63       160

[Slides 36 to 39 show this table four times, each with a different subtable highlighted; the highlighting is not recoverable here. The successive captions are “Significant interaction”, “Significant interaction” and “No significant interaction”.]

Conclusion: B and C are homogeneous, with both differing significantly from A.
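One way to explore which habitats differ is to test each pair of rows as its own 2 × 2 table. This is a simplified sketch, not the full "largest homogeneous subtables" procedure described earlier: names are hypothetical, no correction for 2 categories or for multiple comparisons is applied, and the p value uses the 1-degree-of-freedom identity p = erfc(√(χ²/2)).

```python
import math
from itertools import combinations


def chi_square(table):
    """Pearson chi-square for a two-way table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return sum(
        (f - row_totals[i] * col_totals[j] / n) ** 2 / (row_totals[i] * col_totals[j] / n)
        for i, row in enumerate(table)
        for j, f in enumerate(row)
    )


habitats = {"A": (30, 34), "B": (55, 25), "C": (12, 4)}  # (males, females)

results = {}
for (name_a, row_a), (name_b, row_b) in combinations(habitats.items(), 2):
    stat = chi_square([row_a, row_b])
    p = math.erfc(math.sqrt(stat / 2.0))  # upper tail for df = 1
    results[name_a + name_b] = (stat, p)
    print(f"{name_a} vs {name_b}: chi-square = {stat:.3f}, p = {p:.4f}")
```

On these counts only the B versus C comparison fails to reach significance at α = 0.05, which agrees with the conclusion that B and C are homogeneous and both differ from A.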

Conclusion

• Contingency tables are one of the most common methods of analyzing biological data.

• They provide robust tests (chi-square or G) of independence for categorical data...

• ...if sample sizes are adequate and expected frequencies are not too small.

Multiway tables and log-linear models

• Notion of interaction extended to consideration of the effects of several different variables (factors) simultaneously…

• … exactly as in multiple-classification ANOVA.

Two-way tables: H0 and model fit

• For two way tables, the general model includes a constant, two main effects, and an interaction.

• Thus, independence implies that the goodness-of-fit of a model with the interaction deleted is not significantly different from a model with the interaction included.

$$\ln \hat{f}_{ij} = \ln\mu + \ln\alpha_i + \ln\beta_j + \ln(\alpha\beta)_{ij}$$

[Figure: goodness of fit (e.g. G) for the models with the interaction in and with the interaction out; a small change in GOF leads to accepting H0, a large change to rejecting H0.]

Multi-way tables and log-linear models

• For 3-way tables, the general model includes a constant, 3 main effects, 3 two-way interactions, and 1 three-way interaction.

• Thus, independence implies that the goodness-of-fit of a model with the interaction deleted is not significantly different from a model with the interaction included.

$$\ln \hat{f}_{ijk} = \ln\mu + \ln\alpha_i + \ln\beta_j + \ln\gamma_k + \ln(\alpha\beta)_{ij} + \ln(\alpha\gamma)_{ik} + \ln(\beta\gamma)_{jk} + \ln(\alpha\beta\gamma)_{ijk}$$

[Figure: goodness of fit (e.g. G) for the models with the interaction in and with the interaction out; a small change in GOF leads to accepting H0, a large change to rejecting H0.]

Multi-way tables and log-linear models

• Effects of temperature (H,L) and humidity (H, L) on plant yield (H, L)

• No 3-way interaction, as interaction between yield and temperature does not depend on humidity.

[Figure: frequency of low-yield and high-yield plants for each combination of temperature (H, L) and humidity (H, L).]

Multi-way tables and log-linear models

• Effects of temperature (H,L) and humidity (H, L) on plant yield (H, L)

• 3-way interaction, since effect of temperature on yield depends on humidity.

[Figure: frequency of low-yield and high-yield plants for each combination of temperature (H, L) and humidity (H, L).]

The procedure

• Test highest order interaction by comparing goodness of fit of full model and model with interaction removed.

• If non-significant, test next-lowest interactions individually (i.e. with the others included).

• Where interactions are significant, do separate tests within each category of the factor(s) involved.

An example: sex-ratio of sturgeon in the lower Saskatchewan River

• What is the “best” model that can be fitted to these data?

• Does sex-ratio depend on location? On year? On location*year?

Location           Year   Males   Females
Cumberland House   1978   10      14
Cumberland House   1979   30      14
Cumberland House   1980   11      6
The Pas            1978   5       16
The Pas            1979   12      12
The Pas            1980   34      18

Questions/null hypotheses

• Does the sex ratio vary among years?

• H0: $(\alpha\beta)_{ij} = 0$

• Does the sex ratio vary between locations?

• H0: $(\alpha\gamma)_{ik} = 0$

• Does the sex ratio vary among (year, location) combinations?

• H0: $(\alpha\beta\gamma)_{ijk} = 0$
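The three-way hypothesis can be examined without a statistics package by fitting the model that contains all two-way interactions but no three-way term, then measuring its lack of fit with G. The sketch below uses iterative proportional fitting, which is a swapped-in technique chosen for illustration and not necessarily what the course software does; the function names and the array layout (location, year, sex) are assumptions.

```python
import math

# counts[location][year][sex]: years 1978, 1979, 1980; sex = (males, females)
counts = [
    [[10, 14], [30, 14], [11, 6]],   # Cumberland House
    [[5, 16], [12, 12], [34, 18]],   # The Pas
]


def fit_no_three_way(obs, iterations=200):
    """Fit the all-two-way-interactions model by iterative proportional fitting."""
    n_loc, n_yr, n_sex = len(obs), len(obs[0]), len(obs[0][0])
    fit = [[[1.0] * n_sex for _ in range(n_yr)] for _ in range(n_loc)]
    for _ in range(iterations):
        # Match the location x year margin.
        for loc in range(n_loc):
            for yr in range(n_yr):
                scale = sum(obs[loc][yr]) / sum(fit[loc][yr])
                for sx in range(n_sex):
                    fit[loc][yr][sx] *= scale
        # Match the location x sex margin.
        for loc in range(n_loc):
            for sx in range(n_sex):
                target = sum(obs[loc][yr][sx] for yr in range(n_yr))
                current = sum(fit[loc][yr][sx] for yr in range(n_yr))
                for yr in range(n_yr):
                    fit[loc][yr][sx] *= target / current
        # Match the year x sex margin.
        for yr in range(n_yr):
            for sx in range(n_sex):
                target = sum(obs[loc][yr][sx] for loc in range(n_loc))
                current = sum(fit[loc][yr][sx] for loc in range(n_loc))
                for loc in range(n_loc):
                    fit[loc][yr][sx] *= target / current
    return fit


def g_statistic(obs, fit):
    """G = 2 * sum of observed * ln(observed / fitted) over all cells."""
    total = 0.0
    for plane_o, plane_f in zip(obs, fit):
        for row_o, row_f in zip(plane_o, plane_f):
            for o, f in zip(row_o, row_f):
                if o > 0:
                    total += o * math.log(o / f)
    return 2.0 * total


fitted = fit_no_three_way(counts)
g = g_statistic(counts, fitted)
df = (2 - 1) * (3 - 1) * (2 - 1)   # locations, years, sexes
p = math.exp(-g / 2.0)             # exact chi-square upper tail when df = 2
print(f"three-way interaction: G = {g:.3f}, df = {df}, p = {p:.4f}")
```

A small G (large p) would mean the three-way term can be dropped, after which each two-way interaction involving sex can be tested in the same way by removing its margin from the fitting loop.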

Fitting log-linear models with SYSTAT

• Test 3-way interaction by specifying model with 7 terms.

• Conclusion: accept H0

Analysis of Deviance Table
Poisson model
Response: FREQUENC
Terms added sequentially (first to last)

                      Df   Deviance   Resid. Df   Resid. Dev     Pr(Chi)
NULL                                          7     109.0931
TEMP                   1    0.00000           6     109.0931   0.9999998
LIGHT                  1    0.00000           5     109.0931   0.9999999
INFECTED               1    6.26638           4     102.8268   0.0123050
TEMP:LIGHT             1    0.00000           3     102.8268   0.9999987
TEMP:INFECTED          1   76.00717           2      26.8196   0.0000000
LIGHT:INFECTED         1   25.73563           1       1.0840   0.0000004
TEMP:LIGHT:INFECTED    1    1.08396           0       0.0000   0.2978126

Residuals (in contingency tables and log-linear models)

• The difference between observed and expected cell frequencies.

• There is one residual for each cell in the table.

• If the fitted model is “good”, all residuals should be relatively small and there should be no obvious pattern in the table.

Power and sample size in goodness of fit

• An external null hypothesis is specified, which specifies a set of expected frequencies, or, alternatively, a set of expected proportions:

$$H_0: p_{01}, p_{02}, \ldots, p_{0N}$$

• The effect size is given by:

$$w = \sqrt{\sum_{i=1}^{N} \frac{(p_{Oi} - p_{0i})^2}{p_{0i}}}$$

where $p_{Oi}$ is the observed proportion and $p_{0i}$ the proportion expected under H0 in category $i$.

[Figure: observed and expected frequencies by cell.]

Calculating power given w

• Given α, w and N, we can read 1 − β from suitable tables or curves (e.g. Cohen (1988), Tables 7-3).

[Figure: power (1 − β) plotted against effect size w (.1 to .4) for α = .05 and α = .01; power falls with decreasing N.]

Power in goodness of fit: an example

• Biological hypothesis: plumage colour in snow geese controlled by a single autosomal locus with 2 alleles, aa = white, Aa, AA = blue.

• So Aa X Aa cross should yield segregation ratios: 1 (AA): 2(Aa): 1(aa).

Gosling genotype       AA     Aa     aa
Observed frequency     25     69     38
Observed proportion    .190   .522   .288
Expected frequency     33     66     33
Expected proportion    .25    .50    .25

Power in goodness of fit: an example (cont’d)

• H0 accepted, and effect size given by:

N = 132, χ² = 1.52, df = 2, p = .472

$$w = \sqrt{\sum_{i=1}^{N} \frac{(p_{Oi} - p_{0i})^2}{p_{0i}}} = .076$$

• From table, 1 − β = .16:

Power (1 − β) by sample size N and effect size w:

N      w = .10   w = .20   w = .30
100    .13       .42       .77
120    .15       .49       .85
140    .17       .55       .90

• So, > 84% chance of Type II error, i.e. the probability of detecting a true effect size of .076 is very small.

Gosling genotype       AA     Aa     aa
Observed frequency     25     69     38
Observed proportion    .190   .522   .288
Expected frequency     33     66     33
Expected proportion    .25    .50    .25

Power and sample size in contingency tables

• Calculate expected cell proportions $p_{0,ij}$ under H0 of independence given by marginal proportions:

$$p_{0,ij} = \hat{f}_{ij} / N$$

• The effect size is given by:

$$w = \sqrt{\sum_{i=1,\,j=1}^{R,\,C} \frac{(p_{O,ij} - p_{0,ij})^2}{p_{0,ij}}}$$

Location           Year   Males   Females
Cumberland House   1978   10      14
Cumberland House   1979   30      14
Cumberland House   1980   11      6

Df = (R-1)(C-1)=

Power and sample size in contingency tables: an example

• Age structure of two different field mice populations

Cell proportions, N = 120:

Age      Pop. 1   Pop. 2   Total
1 yr     .22      .23      .45
2 yr     .35      .10      .45
3+ yr    .03      .07      .10
Total    .60      .40      1.00

$$w = \sqrt{\sum_{i=1,\,j=1}^{R,\,C} \frac{(p_{O,ij} - p_{0,ij})^2}{p_{0,ij}}} = .120$$

α = .05, df = (3 − 1)(2 − 1) = 2, N = 120, 1 − β = .25

• So, about 75% chance of Type II error.
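The effect-size calculation for a table of cell proportions can be sketched as follows. The function name `cohen_w` is hypothetical, and taking the square root of the sum follows Cohen's (1988) definition of w, which is an assumption here because the layout of the formula on the slides is not fully recoverable.

```python
import math


def cohen_w(proportions):
    """Effect size w for a two-way table of cell proportions that sum to 1."""
    row_totals = [sum(row) for row in proportions]
    col_totals = [sum(col) for col in zip(*proportions)]
    total = 0.0
    for i, row in enumerate(proportions):
        for j, p_obs in enumerate(row):
            p_null = row_totals[i] * col_totals[j]  # expected under independence
            total += (p_obs - p_null) ** 2 / p_null
    return math.sqrt(total)


# Rows are age classes 1 yr, 2 yr, 3+ yr; columns are populations 1 and 2.
mice = [[0.22, 0.23], [0.35, 0.10], [0.03, 0.07]]
w = cohen_w(mice)
print(f"sum under the root = {w ** 2:.3f}, w = {w:.3f}")
```

For the field-mice proportions the sum under the square root is about 0.120 and its square root about 0.346, so it matters which of the two is carried into the power tables.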