categorical data analysis - university college dublin · categorical data analysis ... method of...


Upload: lehuong

Post on 13-Apr-2018


Page 1: Categorical Data Analysis - University College Dublin · Categorical Data Analysis ... Method of analysis is the same. 9 ... This is a snapshot of opinion at a moment in time hence

Statistics in Science Σ

Categorical Data Analysis

PGRM 14


What is categorical data?

The measurement scale for the response consists of a number of categories

Variable        Measurement Scale
Farm system     Dairy, Beef, Tillage etc.
Mortality       Dead, Alive
Food texture    Very soft, Soft, Hard, Very hard
Litter size     0, 1, 2, 3 and >3


Data Analysis considered:

• Response variable(s) are categorical

• Explanatory variable(s) may be categorical or continuous

Example: Does Post-operative survival (categorical response) depend on the explanatory variables?

Sex (categorical)

Age (continuous)

Example: In a random sample of Irish farmers, is there a relationship between attitudes to the EU and farm system?

Farm system (categorical)

Attitude to EU (categorical/ordinal)?

(Two response variables - no explanatory variables)

Could one of these be regarded as explanatory?


Measurement scales for categorical data

Nominal - no underlying order

Variable        Measurement Scale
Farm system     Dairy, Beef, Tillage etc.
Weed Species    Stellaria media, Poa annua, etc.

Ordinal - underlying order in the scale

Variable             Measurement Scale
Food texture         Very soft, Soft, Hard, Very hard
Disease diagnosis    Very likely, Likely, Unlikely
Education            Primary, Secondary, Tertiary

Interval - underlying numerical distance between scale points

Variable       Measurement Scale
Litter size    0, 1, 2, 3 and >3
Age class      <1, 1-2, 2-3.5, 3.5-5, >5
Education      years in education


Tables reporting categorical data

1-, 2- & 3-way


Tables reporting count data: single level

Example:

A geneticist carries out a crossing experiment between F1 hybrids of a wild type and a mutant genotype and obtains an F2 progeny of 90 offspring with the following characteristics.

Wild Type    Mutant    Total
80           10        90

Evidence that a wild type is dominant, giving on average 8:1 offspring phenotype in its favour?


Tables for count data: two-way

Example 1:

A sample of 124 mice was divided into two groups, 84 receiving a standard dose of pathogenic bacteria followed by an antiserum and a control group of 40 not receiving the antiserum. After 3 weeks the numbers dead and alive in each group were counted.

             Outcome
             Dead    Alive    Total    % dead
antiserum    19      65       84       23
control      18      22       40       45
Total        37      87       124

Association between mortality and treatment?


Tables for count data: two-way

Example 2:

Categorical response and categorical explanatory variable: The opinion poll after the Good Friday Agreement with respondents classified by religion (R - Catholic or Protestant)

• Evidence that a majority of decided voters (all voters) support the agreement?

• Support pattern the same for Protestants and Catholics?

              Favour    Oppose    Undecided    Total    %Favour
Catholic      258       32        62           352      73
Protestant    149       91        208          448      33
Total         407       123       270          800      51
%Catholic     63        26        23


Tables for count data: two-way

Example (Snedecor & Cochran):

The table below shows the number of aphids alive and dead after spraying with four concentrations of solutions of sodium oleate.

• Has the higher concentration given a significantly different percentage kill?

• Is there a relationship between concentration and mortality?

          Concentration of sodium oleate (%)
          0.65    1.10    1.6     2.1     Total
Dead      55      62      100     72      289
Alive     22      13      12      5       52
Total     77      75      112     77      341
% Dead    71.4    82.7    89.3    93.5    84.8


Is this the relationship?

Note: categorical response, interval categorical explanatory variable


Tables for count data: two-way

Example (Cornfield 1962)

Blood pressure (BP) was measured on a sample of males aged 40-59, who were also classified by whether they developed coronary heart disease (CHD) in a 6-year follow-up period.

BP: interval categorical variable in 8 classes

CHD: CHD or No-CHD

BP           CHD    No CHD    Total    % CHD
<117         3      153       156      1.9
117 - 126    17     235       252      6.7
127 - 136    12     272       284      4.2
137 - 146    16     255       271      5.9
147 - 156    12     127       139      8.6
157 - 166    8      77        85       9.4
167 - 186    16     83        99       16.2
>186         8      35        43       18.6
Total        92     1237      1329

1. Is the incidence of CHD independent of BP?

2. Is there a simple relationship between the probability of CHD and the level of BP?


CHD v BP relationship


3-way table

Example: Grouped binomial (response has 2 categories) data - patterns of psychotropic drug consumption in a sample from West London (Murray et al 1981, Psy Med 11, 551-60)

Sex    Age Group    Psych. case    On drugs    Total
M      1            No             9           531
M      2            No             16          500
M      3            No             38          644
M      4            No             26          275
M      5            No             9           90
M      1            Yes            12          171
M      2            Yes            16          125
M      3            Yes            31          121
M      4            Yes            16          56
M      5            Yes            10          26
F      1            No             12          588
F      2            No             42          596
F      3            No             96          765
F      4            No             52          327
F      5            No             30          179
F      1            Yes            33          210
F      2            Yes            47          189
F      3            Yes            71          242
F      4            Yes            45          98
F      5            Yes            21          60


Non-tabulated data

Example: Individual Legousia plants were monitored in an experiment to see whether they survived after 3 months. Survival was scored 1 (yes) or 0 (no).

Also recorded were:

CO2 treatment – 2 levels, low and high

Density of Legousia

Density of companion species

Height of the plant (mm) two weeks after planting.

Most individuals will have a unique profile in these three additional variables and so tabulation of the data by them is not feasible. The individual data are presented instead.


Non-tabulated data

1. Is survival related to the explanatory variables: CO2, Height, density-self, density-companions?

2. Can the probability of survival be predicted from the subject’s profile?

           Response                 Density
Subject    Surv        CO2    Ht    Leg.    Comp
1          0           L      35    20      30
2          1           L      68    22      27
3          1           H      43    16      33
4          0           L      27    4       16
…


Fixed and non-fixed margins

• One margin fixed: Samples of fixed size are selected for one or more categories and individuals are classified by the other category or categories.

• No margin fixed: Individuals in a single sample are simultaneously classified by several categorical variables.

Which of these applies depends on the experimental design and on how it specified that the data should be collected.

Method of analysis is the same.


One margin fixed

Example 1 (Clinical trial - a prospective study):

Of 400 HIV positive pregnant women 200 are assigned at random to each of Breast feeding (BF) or Formula feeding (FF). Two years after birth the child’s HIV status is determined.

      Child’s status
      HIV +    HIV -    Total
BF    62       138      200
FF    45       155      200


No margin fixed (Single sample)

Used in Cross-sectional studies.

Example: A simple random sample of 200 students was classified by gender and attitude to EU integration.

This is a snapshot of opinion at a moment in time hence Cross-sectional.

          EU integration
          Favour    Oppose    Total
Male      43        53        96
Female    61        33        104
Total     104       86


Asking the right question

• Data summarized by counts

• Questions usually relate to %s (equivalently proportions)


Hypotheses for Categorical Data

• Categorical data is summarised by counting individuals falling into the various combinations of categories

• Hypotheses relate to: the probability of an individual being in a particular category

• These probabilities are estimated by the observed proportions in the data

• Using a sample proportion, p, from a sample of size n, to estimate a population proportion the standard error is

SE = √(p(1 – p)/n)

e.g. with p = 0.5, n = 1100, 2×SE = 0.03: the often mentioned 3% margin of error
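The margin-of-error arithmetic can be checked with a short Python sketch. The function name `standard_error` is an illustrative choice, not something from the course notes; the formula is the one stated above.

```python
import math

def standard_error(p: float, n: int) -> float:
    """Standard error of a sample proportion p from a sample of size n."""
    return math.sqrt(p * (1 - p) / n)

se = standard_error(0.5, 1100)
margin = 2 * se  # the "2 x SE" margin of error quoted on the slide
print(f"SE = {se:.4f}, 2 x SE = {margin:.3f}")  # SE = 0.0151, 2 x SE = 0.030
```

Because p(1 – p) is largest at p = 0.5, this is the widest margin a poll of 1100 can have.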


Example

             Outcome
             Dead    Alive    Total    % dead
antiserum    19      65       84       23
control      18      22       40       45
Total        37      87       124

Does % dead depend on antiserum?

Equivalently:

1. Is there an association between mortality and antiserum?

2. Is mortality independent of antiserum?


Example

• As usual we set up a null hypothesis and measure the extent to which the data conflicts with this

• Here H0: prob of death for anti = prob of death for control

• equivalently H0:

– no association between mortality and antiserum

– Mortality and antiserum are independent

             Outcome
             Dead    Alive    Total    % dead
antiserum    19      65       84       23
control      18      22       40       45
Total        37      87       124


Example

Expected counts when H0 is true:

The overall % dead (37/124) would apply to antiserum & control

For the 84 antiserum this would give (84×37)/124 dead and (84×87)/124 alive

For the 40 control this would give (40×37)/124 dead and (40×87)/124 alive

             Outcome
             Dead    Alive    Total    % dead
antiserum    19      65       84       23
control      18      22       40       45
Total        37      87       124

E = (row total)×(column total)/(table total)
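As a check on the rule E = (row total)×(column total)/(table total), here is a minimal Python sketch applied to the mice table. The helper name `expected_counts` and the list-of-rows layout are assumptions made for illustration.

```python
def expected_counts(table):
    """Expected counts under H0 (independence) for a two-way table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

# Rows: antiserum, control. Columns: dead, alive.
observed = [[19, 65], [18, 22]]
expected = expected_counts(observed)
for label, row in zip(["antiserum", "control"], expected):
    print(label, [round(e, 3) for e in row])
# antiserum [25.065, 58.935]
# control [11.935, 28.065]
```

The expected counts keep the same row and column totals as the observed table; only the split within each row changes.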


Observed and expected counts

Observed

             Outcome
             Dead    Alive    Total    % dead
antiserum    19      65       84       23
control      18      22       40       45
Total        37      87       124

Expected

             Outcome
             Dead    Alive    Total    % dead
antiserum    25.1    58.9     84       29.9
control      11.9    28.1     40       29.8
Total        37      87       124

Note: some rounding error


Chi-squared statistic: X²

• X² measures the difference between observed counts, O, and expected (when H0 holds) counts, E

• X² = ∑(O – E)²/E

• If X² is LARGE this provides evidence against H0, i.e. evidence for an association between mortality and antiserum (dependence of mortality on antiserum).

• Here SAS/FREQ gives: X² = 6.48, p = Prob(X² > 6.48 when H0 is true) = 0.0109

• Conclusion: there is evidence (p < 0.05) that mortality depends on antiserum
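The SAS/FREQ figures can be reproduced by hand. The sketch below is plain Python rather than SAS and its function names are illustrative. With 1 DF the upper-tail probability can be written with the complementary error function, since a χ² variable with 1 DF is the square of a standard normal.

```python
import math

def pearson_chi2(table):
    """Pearson X2 = sum of (O - E)^2 / E over all cells of a two-way table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    x2 = 0.0
    for i, row in enumerate(table):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / n
            x2 += (o - e) ** 2 / e
    return x2

def p_value_1df(x2):
    """Prob(chi-squared with 1 DF > x2) = erfc(sqrt(x2 / 2))."""
    return math.erfc(math.sqrt(x2 / 2))

mice = [[19, 65], [18, 22]]  # rows: antiserum, control; columns: dead, alive
x2 = pearson_chi2(mice)
p = p_value_1df(x2)
print(f"X2 = {x2:.2f}, p = {p:.4f}")  # X2 = 6.48, p = 0.0109
```

The closed form for the p-value only holds for 1 DF, i.e. for 2×2 tables.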


SAS/FREQ OUTPUT

Table of antiserum by dead

Description of cell contents: Frequency, Expected, Row Pct

antiserum    dead = 0    dead = 1    Total
antiserum    65          19          84
             58.935      25.065
             77.38       22.62
control      22          18          40
             28.065      11.935
             55.00       45.00
Total        87          37          124

X² = ∑(O – E)²/E

O = Frequency

E = Expected

Row Percents make most sense here (% alive/dead in each antiserum group)


SAS/FREQ OUTPUT

Statistic                      DF    Value     Prob
Chi-Square                     1     6.4833    0.0109
Likelihood Ratio Chi-Square    1     6.2846    0.0122
Continuity Adj. Chi-Square     1     5.4583    0.0195
Mantel-Haenszel Chi-Square     1     6.4310    0.0112
Phi Coefficient                      0.2287
Contingency Coefficient              0.2229
Cramer's V                           0.2287

X² = ∑(O – E)²/E

DF = (r–1)×(c–1)

Ignore!


P = 0.01 with X² = 6.48

[Figure: χ² density with 1 DF. The area to the right of 6.48 is 0.01; a tail area of 0.05 is also marked; 68% of values are < 1 (not shown).]


SAS – data format for FREQ procedure

Mortality    Antiserum    Count
Dead         Antiserum    19
Alive        Antiserum    65
Dead         Control      18
Alive        Control      22

2 cols identify the cell

Final column is the ‘response’ – the frequency count for the cell

             Mortality
             Dead    Alive    Total    % dead
antiserum    19      65       84       23
control      18      22       40       45
Total        37      87       124


Code: SAS/FREQ

    proc freq data = conc;
      weight count;                 /* count holds the cell frequencies */
      tables antiserum*mortality
        / chisq expected nocol nopercent;
    run;

Option             To Do
chisq              Test statistics (chi-squared etc)
expected           Expected values for each cell
nocol nopercent    Omit column/overall percentages


Aphid Example

Example (Snedecor & Cochran):

The table below shows the number of aphids alive and dead after spraying with four concentrations of solutions of sodium oleate.

• Has the higher concentration given a significantly different percentage kill?

• Is there a relationship between concentration and mortality?

          Concentration of sodium oleate (%)
          0.65    1.10    1.6     2.1     Total
Dead      55      62      100     72      289
Alive     22      13      12      5       52
Total     77      75      112     77      341
% Dead    71.4    82.7    89.3    93.5    84.8


Aphid example (SAS/FREQ OUTPUT)

Table of status by conc

status (Outcome), conc (Sodium oleate concentration (%))

Cell contents: Frequency, Expected, Cell Chi-Square, Col Pct

status    0.65      1.1       1.6       2.1       Total
Alive     22        13        12        5         52
          11.742    11.437    17.079    11.742
          8.9617    0.2136    1.5105    3.8711
          28.57     17.33     10.71     6.49
Dead      55        62        100       72        289
          65.258    63.563    94.921    65.258
          1.6125    0.0384    0.2718    0.6965
          71.43     82.67     89.29     93.51
Total     77        75        112       77        341

X² = 17.18, p = 0.0007 (3 df)

Note the largest contributions (O – E)²/E to X² (8.96 & 3.87) are in the top corners
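The cell contributions in the output above can be recomputed directly. This is a Python sketch, not the SAS procedure. The dictionary layout is an illustrative choice, and the closed-form tail probability used here is specific to 3 DF.

```python
import math

# Aphid counts by concentration 0.65, 1.1, 1.6, 2.1 (from the table above)
observed = {"Alive": [22, 13, 12, 5], "Dead": [55, 62, 100, 72]}

col_totals = [a + d for a, d in zip(observed["Alive"], observed["Dead"])]
n = sum(col_totals)

contributions = {}
x2 = 0.0
for status, counts in observed.items():
    row_total = sum(counts)
    cells = []
    for o, c in zip(counts, col_totals):
        e = row_total * c / n            # expected count under H0
        cells.append((o - e) ** 2 / e)   # this cell's contribution to X2
    contributions[status] = cells
    x2 += sum(cells)

def chi2_upper_tail_3df(x):
    """Prob(chi-squared with 3 DF > x); this closed form holds for 3 DF only."""
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)

p = chi2_upper_tail_3df(x2)
for status, cells in contributions.items():
    print(status, [round(c, 4) for c in cells])
print(f"X2 = {x2:.2f} on 3 df")
```

The two largest contributions come from the Alive cells at the lowest and highest concentrations, which is what the note on the slide points out.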


Locating the concentration effect

Table of Outcome by Sodium oleate (%)

Cell contents: Frequency, Col Pct

Outcome    0.65     1.1      Total
Alive      22       13       35
           28.57    17.33
Dead       55       62       117
           71.43    82.67
Total      77       75       152

X² = 2.71, p = 0.10

Table of Outcome by Sodium oleate (%)

Cell contents: Frequency, Col Pct

Outcome    1.6      2.1      Total
Alive      12       5        17
           10.71    6.49
Dead       100      72       172
           89.29    93.51
Total      112      77       189

X² = 0.99, p = 0.32


Locating the concentration effect

X² = 12.83, p = 0.0003

Table of Outcome by Sodium oleate (%)

Cell contents: Frequency, Col Pct

Outcome    <1.5%    >1.5%    Total
Alive      35       17       52
           23.03    8.99
Dead       117      172      289
           76.97    91.01
Total      152      189      341
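The three X² values used to locate the concentration effect can be reproduced with the shortcut formula for a 2×2 table, X² = N(ad – bc)²/(product of the four margins). This is a Python sketch with an illustrative function name; it is algebraically the same as ∑(O – E)²/E without a continuity correction.

```python
def chi2_2x2(a, b, c, d):
    """Pearson X2 for the 2x2 table [[a, b], [c, d]], no continuity correction."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

# Rows: Alive, Dead. Columns: the two concentration groups being compared.
low = chi2_2x2(22, 13, 55, 62)       # 0.65% v 1.1%
high = chi2_2x2(12, 5, 100, 72)      # 1.6% v 2.1%
pooled = chi2_2x2(35, 17, 117, 172)  # below 1.5% v above 1.5%
print(f"{low:.2f} {high:.2f} {pooled:.2f}")  # 2.71 0.99 12.83
```

Most of the overall X² of 17.18 is carried by the low v high comparison, with little evidence of differences within either pair of concentrations.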


SAS – data format for FREQ procedure

          Concentration of sodium oleate (%)
          0.65    1.10    1.6     2.1     Total
Dead      55      62      100     72      289
Alive     22      13      12      5       52
Total     77      75      112     77      341
% Dead    71.4    82.7    89.3    93.5    84.8

Conc    status    number
0.65    d         55
0.65    a         22
1.10    d         62
1.10    a         13
1.60    d         100
1.60    a         12
2.10    d         72
2.10    a         5

2 cols identify the cell

Final column is the ‘response’ – the frequency count for the cell


Code: SAS/FREQ

    proc freq data = insert;
      weight number;                /* number holds the cell frequencies */
      tables status*conc
        / chisq cellchi2 expected
          norow nopercent nocum;
    run;

Option             To Do
chisq              Test statistics (chi-squared etc)
cellchi2           Contribution to X² from each cell
expected           Expected values for each cell
norow nopercent    Omit row/overall percentages
nocum              Omit cumulative frequencies


Validity of chi-squared (χ2) test

• Test is based on an approximation leading to use of the χ² distribution to calculate p-values

• With several DF and E ≥ 5 the approximation is OK

• If E < 1 in any cell the approximation may be bad

• With a number of cells in the table perhaps a third or a quarter can have E between 1 & 5 without serious departures from χ² based p-values. (PGRM pg 14-11)

• In cases where good approximation is in doubt use Fisher’s exact test (SAS/FREQ tables option exact)
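For a 2×2 table, Fisher's exact test can be sketched in a few lines of Python. This is an illustration, not the SAS implementation. It assumes the common two-sided convention of summing the probabilities of all tables (with the same margins) that are no more likely than the observed one, and the function name is hypothetical.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    r1, r2, c1 = a + b, c + d, a + c
    n = r1 + r2

    def prob(x):
        # Hypergeometric probability that the top-left cell equals x
        return comb(r1, x) * comb(r2, c1 - x) / comb(n, c1)

    p_obs = prob(a)
    lowest, highest = max(0, c1 - r2), min(r1, c1)
    # Small relative tolerance so ties with the observed table are included
    return sum(prob(x) for x in range(lowest, highest + 1)
               if prob(x) <= p_obs * (1 + 1e-9))

# Mice example: rows antiserum, control; columns dead, alive
p_mice = fisher_exact_two_sided(19, 65, 18, 22)
print(f"Fisher exact p = {p_mice:.4f}")
```

With expected counts as large as in the mice table the exact and χ² p-values lead to the same conclusion; the exact test matters when some E are small.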


Logistic Regression (SAS GENMOD)


Is this the relationship?

Note: categorical response, interval categorical explanatory variable


Why logistic and not just χ²?

• For sparse data (e.g. where individuals will have unique profiles)

• With many categorical explanatory variables

• With quantitative explanatory variables

In the case of a continuous response we have looked to see if the mean, µ, can be expressed as

µ = a + bx

With categorical data we want an expression for p (the probability of the response in one of the 2 response categories) but

p = a + bx

may give values outside the range 0 to 1!

e.g. p = 0.1 + 0.2x gives p = 1.1 for x = 5


A solution: TRANSFORM

• Use the transformation:

p = exp(a + bx)/(1 + exp(a + bx))

• i.e. log(p/(1 – p)) = a + bx, or log(Odds) = a + bx

where Odds = p/(1 – p)

LOGIT: logit(p) = log(p/(1 – p))

Note: exp(x) = e^x

[Figure: plot of p against x for a = 0, b = 1]
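A small Python sketch of the transformation and its inverse, using the slide's plotting values a = 0, b = 1. The function names are illustrative; `inv_logit` is written as 1/(1 + exp(–η)), which is algebraically the same as exp(η)/(1 + exp(η)).

```python
import math

def inv_logit(eta):
    """p = exp(eta) / (1 + exp(eta)), written in an equivalent form."""
    return 1 / (1 + math.exp(-eta))

def logit(p):
    """log(Odds) = log(p / (1 - p))."""
    return math.log(p / (1 - p))

a, b = 0.0, 1.0  # the values used for the plot on the slide
for x in (-5, -1, 0, 1, 5):
    p = inv_logit(a + b * x)
    linear = 0.1 + 0.2 * x  # the straight-line example from the previous slide
    print(f"x = {x:2d}: logistic p = {p:.4f}, linear p = {linear:.1f}")
```

Whatever the value of a + bx, the logistic p stays strictly between 0 and 1, while the straight line leaves that range (1.1 at x = 5).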


SAS/GPLOT

logit(p) = −0.119 + 1.25 conc

[Figure: Logistic Estimate of Death Probability. Vertical axis: p, from 0.6 to 1.0. Horizontal axis: Sodium oleate (%), from 0.6 to 2.2.]


SAS OUTPUT

OBS    CONC    DEAD    TOTAL    PROP
1      0.65    53      77       0.68831
2      1.10    57      75       0.76000
3      1.60    95      112      0.84821
4      2.10    73      77       0.94805

Analysis Of Parameter Estimates

Parameter    DF    Estimate    Std Err    ChiSquare    Pr>Chi
INTERCEPT    1     -0.1185     0.3749     0.0999       0.7519
CONC         1     1.2480      0.2921     18.2554      0.0001
SCALE        0     1.0000      0.0000     .            .

NOTE: The scale parameter was held fixed.


LD50 – lethal dose for 50%

p = 0.5

⇒ p/(1 – p) = 1

⇒ logit(p) = 0 (since log(1) = 0)

⇒ 0 = −0.119 + 1.25 conc

⇒ conc = 0.119/1.25 = 0.095

Odds Ratio (OR)

Increasing conc by 1% increases logit(p) by 1.25

log(Odds2) – log(Odds1) = 1.25

log(OR) = 1.25 (using log(a) – log(b) = log(a/b))

OR = exp(1.25) = 3.49
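The same two quantities computed in Python from the unrounded SAS estimates (−0.1185 and 1.2480). This is a sketch; the variable names are illustrative.

```python
import math

intercept, slope = -0.1185, 1.2480  # estimates from the SAS output

# LD50: the concentration at which logit(p) = 0, i.e. p = 0.5
ld50 = -intercept / slope

# Odds ratio for a one-unit (1%) increase in concentration
odds_ratio = math.exp(slope)

print(f"LD50 = {ld50:.3f}%, OR = {odds_ratio:.2f}")  # LD50 = 0.095%, OR = 3.48
```

The OR is 3.48 here rather than 3.49 only because the slide rounds the slope to 1.25 first. The estimated LD50 lies below the smallest concentration tested (0.65%), so it is an extrapolation of the fitted curve.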


SAS/GENMOD

conc    dead    total
0.65    53      77
1.10    57      75
1.60    95      112
2.10    73      77

    proc genmod data = log;
      model dead/total = conc /
        pred
        link = logit
        dist = binomial;
      output
        out = p
        predicted = p;
    run;

Term               Function
dead/total         the proportion to be estimated
conc               the explanatory variable
dist = binomial    the data consists of counts out of a total
link = logit       for modelling log(p/(1-p)), the log(ODDS)
pred               include predicted p’s in OUTPUT
out = p            output will also go to a data set work.p
predicted = p      in work.p a column named p will contain predicted values
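To see what the fit involves, the binomial logistic model can be fitted to the four data rows above by maximum likelihood in plain Python. This is a minimal Newton-Raphson sketch written for this one-variable case, not a description of how GENMOD works internally; the function name and the fixed iteration count are assumptions.

```python
import math

conc = [0.65, 1.10, 1.60, 2.10]
dead = [53, 57, 95, 73]
total = [77, 75, 112, 77]

def fit_logistic(x, y, n, iterations=25):
    """Maximum likelihood fit of logit(p) = a + b*x to grouped binomial data."""
    a, b = 0.0, 0.0
    for _ in range(iterations):
        g0 = g1 = 0.0          # score vector (first derivatives)
        h00 = h01 = h11 = 0.0  # information matrix
        for xi, yi, ni in zip(x, y, n):
            p = 1 / (1 + math.exp(-(a + b * xi)))
            resid = yi - ni * p
            w = ni * p * (1 - p)
            g0 += resid
            g1 += resid * xi
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system (information) * step = score
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    return a, b

a_hat, b_hat = fit_logistic(conc, dead, total)
print(f"intercept = {a_hat:.4f}, slope = {b_hat:.4f}")
```

The estimates should agree with the INTERCEPT and CONC rows of the SAS output (−0.1185 and 1.2480) to about three decimal places.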


Modelling needs biological insight!


Stability analysis (Ex 2 pg 14-15)

Heights, diameters and whether they fell over were recorded for 545 plants.

Aim: model the probability of stability (not falling over) as a function of height and diameter.

diameter    height    stable    n
.0016       0.057     1         1
.0018       0.084     0         1
.0018       0.221     0         1
.0018       0.038     1         1
.0019       0.058     1         1
.0019       0.067     1         1
…

Explanatory terms

Model 1: h, d, h², d², hd

hopefully high order terms will not be needed!

Model 2: h/d²

biologist suggests this!


SAS GENMOD

    /* Model 1 */
    proc genmod data=htdiam;
      model stable/n = h d h*h d*d h*d / link=logit dist=binomial;
    run;

    /* Model 2 */
    data htdiam1;            /* Note: h = height, d = diameter */
      set htdiam;
      h_d2 = h/(d*d);        /* derived explanatory variable h/d² */
    run;

    proc genmod data=htdiam1;   /* use the data set that contains h_d2 */
      model stable/n = h_d2 / link=logit dist=binomial;
    run;


Model 1: h, d, h², d², hd

Analysis Of Parameter Estimates

Parameter    DF    Estimate    Standard Error    Wald 95% Confidence Limits    Chi-Square    Pr > ChiSq
Intercept    1     -5.3801     0.9402            -7.2228      -3.5374          32.75         <.0001
height       1     -39.1639    4.1510            -47.2998     -31.0280         89.01         <.0001
diameter     1     4958.358    654.0395          3676.464     6240.252         57.47         <.0001
h2           1     10.0396     5.0747            0.0934       19.9859          3.91          0.0479
d2           1     -560913     120280.4          -796659      -325168          21.75         <.0001
hd           1     4206.787    1502.453          1262.033     7151.540         7.84          0.0051
Scale        0     1.0000      0.0000            1.0000       1.0000

How can I describe this!


Model 2: h/d²

Analysis Of Parameter Estimates

Parameter    DF    Estimate    Standard Error    Wald 95% Confidence Limits    Chi-Square    Pr > ChiSq
Intercept    1     3.3235      0.3212            2.6940       3.9529           107.09        <.0001
h_d2         1     -1.7884     0.1583            -2.0987      -1.4780          127.56        <.0001
Scale        0     1.0000      0.0000            1.0000       1.0000

Can understand & even plot this!


SAS/GRAPH

But!


Linear v Quadratic in x = h/d2

?


Finally! Modelling counts


Poisson Regression (SAS GENMOD)

For count data, where e.g. we count all occurrences, not a subset out of a total

To estimate the mean, µ, and its relationship with an explanatory variable x use a log link (usually):

log(µ) = a + bx

i.e. µ = exp(a + bx) = e^a × e^(bx) (which will be > 0)

SAS/GENMOD:

    model count = x /
      link = log
      dist = poisson;
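A short Python sketch of what the log link implies. The coefficient values a = 0.5 and b = 0.2 are hypothetical, chosen only for illustration, and the function name is an assumption.

```python
import math

def poisson_mean(a, b, x):
    """Mean count under the log link: mu = exp(a + b*x), always positive."""
    return math.exp(a + b * x)

a, b = 0.5, 0.2  # hypothetical coefficients, not estimates from any data
mu_at_1 = poisson_mean(a, b, 1.0)
mu_at_2 = poisson_mean(a, b, 2.0)

# A one-unit increase in x multiplies the mean by exp(b)
ratio = mu_at_2 / mu_at_1
print(f"mu(1) = {mu_at_1:.3f}, mu(2) = {mu_at_2:.3f}, ratio = {ratio:.3f}")
```

So with a log link the effect of x is multiplicative on the mean count, whereas with an identity link (µ = a + bx) it is additive and the fitted mean can become negative.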


Example: Horseshoe crabs & satellites

Each female crab had an attached male (in her nest) & other males (satellites) residing nearby.

• Data recorded

– Number of satellites (response)

– Color (light medium, medium, dark medium, dark)

– Spine condition (both good, one worn/broken, both worn/broken)

– Carapace width (cm)

– Weight (kg)

• Poisson Models:

– Log link: log(µ) = a + bx

– Identity link: µ = a + bx


Effect of width and colour


Grouping weight & number values


Variation in no. satellites