categorical data analysis - university college dublin · categorical data analysis ... method of...
TRANSCRIPT
Statistics
in
ScienceΣΣΣΣ
Categorical Data Analysis
PGRM 14
What is categorical data?
The measurement scale for the response consists of a number of categories.

Variable        Measurement Scale
Farm system     Dairy, Beef, Tillage etc.
Mortality       Dead, alive
Food texture    Very soft, Soft, Hard, Very hard
Litter size     0, 1, 2, 3 and >3
Data Analysis considered:
• Response variable(s) is categorical
• Explanatory variable(s) may be categorical or continuous
Example: Does Post-operative survival (categorical response) depend on the explanatory variables?
Sex (categorical)
Age (continuous)
Example: In a random sample of Irish farmers, is there a relationship between attitudes to the EU and farm system?
Farm system (categorical)
Attitude to EU (categorical/ordinal)?
(Two response variables - no explanatory variables)
Could one of these be regarded as explanatory?
Measurement scales for categorical data

Nominal - no underlying order
Variable            Measurement Scale
Farm system         Dairy, Beef, Tillage etc.
Weed Species        Stellaria media, Poa annua, etc.

Ordinal - underlying order in the scale
Variable            Measurement Scale
Food texture        Very soft, Soft, Hard, Very hard
Disease diagnosis   Very likely, Likely, Unlikely
Education           Primary, Secondary, Tertiary

Interval - underlying numerical distance between scale points
Variable            Measurement Scale
Litter size         0, 1, 2, 3 and >3
Age class           <1, 1-2, 2-3.5, 3.5-5, >5
Education           years in education
Tables reporting categorical data: 1-, 2- & 3-way
Tables reporting count data: single level
Example:
A geneticist carries out a crossing experiment between F1 hybrids of a wild type and a mutant genotype and obtains an F2 progeny of 90 offspring with the following characteristics.
Wild Type   Mutant   Total
       80       10      90

Is there evidence that the wild type is dominant, giving on average an 8:1 offspring phenotype ratio in its favour?
Tables for count data: two-way
Example 1:
A sample of 124 mice was divided into two groups: 84 received a standard dose of pathogenic bacteria followed by an antiserum, and a control group of 40 did not receive the antiserum. After 3 weeks the numbers dead and alive in each group were counted.

             Outcome
             Dead   Alive   Total   % dead
antiserum      19      65      84       23
control        18      22      40       45
Total          37      87     124

Association between mortality and treatment?
Tables for count data: two-way
Example 2:
Categorical response and categorical explanatory variable: the opinion poll after the Good Friday Agreement, with respondents classified by religion (R - Catholic or Protestant)
• Evidence that a majority of decided voters (all voters) support the agreement?
• Support pattern the same for Protestants and Catholics?

              Favour   Oppose   Undecided   Total   % Favour
Catholic         258       32          62     352         73
Protestant       149       91         208     448         33
Total            407      123         270     800         51
% Catholic        63       26          23
Tables for count data: two-way
Example (Snedecor & Cochran):
The table below shows the number of aphids alive and dead after spraying with four concentrations of solutions of sodium oleate.
• Has the higher concentration given a significantly different percentage kill?
• Is there a relationship between concentration and mortality?

          Concentration of sodium oleate (%)
          0.65    1.10    1.6     2.1     Total
Dead        55      62    100      72       289
Alive       22      13     12       5        52
Total       77      75    112      77       341
% Dead    71.4    82.7   89.3    93.5      84.8
Is this the relationship?
[Figure not reproduced in the transcript]
Note: categorical response; interval categorical explanatory variable
Tables for count data: two-way
Example (Cornfield 1962)
Blood pressure (BP) was measured on a sample of males aged 40-59, who were also classified by whether they developed coronary heart disease (CHD) in a 6-year follow-up period.
BP: interval categorical variable in 8 classes
CHD: CHD or No-CHD

BP           CHD   No CHD   Total   % CHD
<117           3      153     156     1.9
117 - 126     17      235     252     6.7
127 - 136     12      272     284     4.2
137 - 146     16      255     271     5.9
147 - 156     12      127     139     8.6
157 - 166      8       77      85     9.4
167 - 186     16       83      99    16.2
>186           8       35      43    18.6
Total         92     1237    1329

1. Is the incidence of CHD independent of BP?
2. Is there a simple relationship between the probability of CHD and the level of BP?
CHD v BP relationship
3-way table
Example: Grouped binomial (response has 2 categories) data - patterns of psychotropic drug consumption in a sample from West London (Murray et al 1981, Psy Med 11, 551-60)

Sex   Age Group   Psych. case   On drugs   Total
M     1           No                   9     531
M     2           No                  16     500
M     3           No                  38     644
M     4           No                  26     275
M     5           No                   9      90
M     1           Yes                 12     171
M     2           Yes                 16     125
M     3           Yes                 31     121
M     4           Yes                 16      56
M     5           Yes                 10      26
F     1           No                  12     588
F     2           No                  42     596
F     3           No                  96     765
F     4           No                  52     327
F     5           No                  30     179
F     1           Yes                 33     210
F     2           Yes                 47     189
F     3           Yes                 71     242
F     4           Yes                 45      98
F     5           Yes                 21      60
Non-tabulated data
Example: Individual Legousia plants were monitored in an experiment to see whether they survived after 3 months.
Survived - yes is scored 1
Survived - no is scored 0
Also recorded were:
CO2 treatment – 2 levels, low and high
Density of Legousia
Density of companion species
Height of the plant (mm) two weeks after planting
Most individuals will have a unique profile in these three additional variables and so tabulation of the data by them is not feasible. The individual data is presented.
Non-tabulated data
1. Is survival related to the explanatory variables: CO2, Height, density-self, density-companions?
2. Can the probability of survival be predicted from the subject’s profile?

           Response                  Density
Subject    Surv       CO2    Ht      Leg.    Comp
1          0          L      35      20      30
2          1          L      68      22      27
3          1          H      43      16      33
4          0          L      27       4      16
…          …          …      …       …       …
Fixed and non-fixed margins
• One margin fixed: Samples of fixed size are selected for one or more categories and individuals are classified by the other category(s).
• No margin fixed: Individuals in a single sample are simultaneously classified by several categorical variables.
The difference between these depends on the experimental design and how it specified that the data should be collected.
Method of analysis is the same.
One margin fixed
Example 1 (Clinical trial - a prospective study):
Of 400 HIV-positive pregnant women, 200 are assigned at random to each of Breast feeding (BF) or Formula feeding (FF). Two years after birth the child’s HIV status is determined.

        Child’s status
        HIV +   HIV -   Total
BF         62     138     200
FF         45     155     200
No margin fixed (Single sample)
Used in Cross-sectional studies.
Example: A simple random sample of 200 students was classified by gender and attitude to EU integration.
This is a snapshot of opinion at a moment in time hence Cross-sectional.
          EU integration
          Favour   Oppose   Total
Male          43       53      96
Female        61       43     104
Total        104       86
Asking the right question
• Data summarized by counts
• Questions usually relate to %s (equivalently, proportions)
Hypotheses for Categorical Data
• Categorical data is summarised by counting individuals falling into the various combinations of categories
• Hypotheses relate to: the probability of an individual being in a particular category
• These probabilities are estimated by the observed proportions in the data
• Using a sample proportion, p, from a sample of size n, to estimate a population proportion, the standard error is
  SE = √(p(1 – p)/n)
  e.g. with p = 0.5, n = 1100: 2×SE = 0.03, the often mentioned 3% margin of error
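The margin-of-error arithmetic can be checked with a few lines of Python. This is only an illustrative sketch; the function name `standard_error` is an assumption of this example and does not come from the course material.

```python
import math

def standard_error(p, n):
    """Standard error of a sample proportion p from a sample of size n."""
    return math.sqrt(p * (1 - p) / n)

# Values quoted on the slide
se = standard_error(0.5, 1100)
margin = 2 * se
print(f"SE = {se:.4f}, 2 x SE = {margin:.3f}")
```

Because p(1 – p) is largest at p = 0.5, this is the widest margin a sample of 1100 can give.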
Example
             Outcome
             Dead   Alive   Total   % dead
antiserum      19      65      84       23
control        18      22      40       45
Total          37      87     124

Does % dead depend on antiserum?
Equivalently:
1. Is there an association between mortality and antiserum?
2. Is mortality independent of antiserum?
Example
• As usual we set up a null hypothesis and measure the extent to which the data conflicts with this
• Here H0: prob of death for anti = prob of death for control
• equivalently H0:
– no association between mortality and antiserum
– Mortality and antiserum are independent
Example
Expected counts when H0 is true:
The overall % dead (37/124) would apply to both antiserum & control.
For the 84 antiserum this would give (84×37)/124 dead and (84×87)/124 alive.
For the 40 control this would give (40×37)/124 dead and (40×87)/124 alive.
E = (row total)×(column total)/(table total)
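The rule E = (row total)×(column total)/(table total) can be sketched in Python as follows. The helper name `expected_counts` is hypothetical, and the observed counts are the mouse data quoted in the example.

```python
def expected_counts(observed):
    """Expected counts under H0 (independence) for a two-way table of counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    return [[r * c / total for c in col_totals] for r in row_totals]

# Rows: antiserum, control. Columns: dead, alive.
observed = [[19, 65], [18, 22]]
expected = expected_counts(observed)
for label, row in zip(["antiserum", "control"], expected):
    print(label, [round(e, 3) for e in row])
```

The expected counts keep the same row and column totals as the observed table; only the split within each row changes.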
Observed and expected counts
Observed
             Outcome
             Dead   Alive   Total   % dead
antiserum      19      65      84       23
control        18      22      40       45
Total          37      87     124

Expected
             Outcome
             Dead   Alive   Total   % dead
antiserum    25.1    58.9      84     29.9
control      11.9    28.1      40     29.8
Total          37      87     124

Note: some rounding error
Chi-squared statistic: X²
• X² measures the difference between the observed counts, O, and the expected (when H0 holds) counts, E
• If X² is LARGE it provides evidence against H0, i.e. evidence for an association (dependence) of mortality on antiserum.
• X² = ∑(O – E)²/E
• Here SAS/FREQ gives: X² = 6.48, p = Prob(X² > 6.48 when H0 is true) = 0.0109
• Conclusion: there is evidence (p < 0.05) that mortality depends on antiserum
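As a cross-check on the SAS/FREQ figures, here is a minimal Python sketch of the same calculation. The function names are assumptions of this example. The p-value uses the fact that, with 1 DF, the chi-squared upper tail equals erfc(√(x/2)), so no statistics library is needed.

```python
import math

def pearson_chi2(observed):
    """Return (X2, DF) for a two-way table of counts given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    x2 = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / n  # expected count under H0
            x2 += (o - e) ** 2 / e
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return x2, df

def chi2_upper_tail_1df(x):
    """Prob(chi-squared with 1 DF > x); valid for 1 DF only."""
    return math.erfc(math.sqrt(x / 2))

# Rows: antiserum, control. Columns: dead, alive.
mice = [[19, 65], [18, 22]]
x2, df = pearson_chi2(mice)
p_value = chi2_upper_tail_1df(x2)
print(f"X2 = {x2:.4f}, DF = {df}, p = {p_value:.4f}")
```

The printed values should agree with the X² = 6.48 and p = 0.0109 reported above.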
SAS/FREQ OUTPUT

Table of antiserum by dead

Description of cell contents: Frequency / Expected / Row Pct

antiserum      dead = 0     dead = 1     Total
antiserum        65           19            84
                 58.935       25.065
                 77.38        22.62
control          22           18            40
                 28.065       11.935
                 55.00        45.00
Total            87           37           124

X² = ∑(O – E)²/E, where O = Frequency and E = Expected
Row percents make most sense here (% alive/dead in each antiserum group)
SAS/FREQ OUTPUT

Statistic                      DF    Value     Prob
Chi-Square                      1    6.4833    0.0109
Likelihood Ratio Chi-Square     1    6.2846    0.0122
Continuity Adj. Chi-Square      1    5.4583    0.0195
Mantel-Haenszel Chi-Square      1    6.4310    0.0112
Phi Coefficient                      0.2287
Contingency Coefficient              0.2229
Cramer's V                           0.2287

X² = ∑(O – E)²/E, DF = (r–1)×(c–1)
Ignore!
P = 0.01 with X² = 6.48
[Figure: χ² distribution with 1 DF, marking X² = 6.48 and the tail areas 0.05 and 0.01; 68% of values are < 1 (not shown)]
SAS – data format for FREQ procedure
Mortality   Antiserum   Count
Dead        Antiserum   19
Alive       Antiserum   65
Dead        Control     18
Alive       Control     22

2 cols identify the cell
Final column is the ‘response’ – the frequency count for the cell
Code: SAS/FREQ
proc freq data = conc;
  weight count;
  tables antiserum*mortality
    / chisq expected nocol nopercent;
quit;

Option             To Do
chisq              Test statistics (chi-squared etc)
expected           Expected values for each cell
nocol nopercent    Omit column/overall percentages
Aphid Example (Snedecor & Cochran):
Aphid example (SAS/FREQ OUTPUT)

Table of status by conc
status (Outcome) by conc (Sodium oleate concentration (%))
Cell contents: Frequency / Expected / Cell Chi-Square / Col Pct

status     0.65      1.1       1.6       2.1       Total
Alive      22        13        12        5         52
           11.742    11.437    17.079    11.742
           8.9617    0.2136    1.5105    3.8711
           28.57     17.33     10.71     6.49
Dead       55        62        100       72        289
           65.258    63.563    94.921    65.258
           1.6125    0.0384    0.2718    0.6965
           71.43     82.67     89.29     93.51
Total      77        75        112       77        341

X² = 17.18, p = 0.0007 (3 df)
Note the largest contributions (O – E)²/E to X² (8.96 & 3.87) are in the top corners
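The cell contributions can be reproduced with a short Python sketch. The function names here are assumptions of this example. The p-value uses a closed form for the chi-squared upper tail that holds for 3 DF only; a statistics library would normally be used for other DF.

```python
import math

def cell_contributions(observed):
    """(O - E)^2 / E for every cell, with E = row total x column total / n."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    return [
        [(o - r * c / n) ** 2 / (r * c / n) for o, c in zip(row, col_totals)]
        for row, r in zip(observed, row_totals)
    ]

def chi2_upper_tail_3df(x):
    """Prob(chi-squared with 3 DF > x); closed form valid for 3 DF only."""
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)

# Rows: Alive, Dead. Columns: 0.65, 1.10, 1.60, 2.10 (% sodium oleate).
aphids = [[22, 13, 12, 5], [55, 62, 100, 72]]
contributions = cell_contributions(aphids)
x2 = sum(sum(row) for row in contributions)
p_value = chi2_upper_tail_3df(x2)

for label, row in zip(["Alive", "Dead"], contributions):
    print(label, [round(v, 4) for v in row])
print(f"X2 = {x2:.2f}, p = {p_value:.2e}")
```

Listing the contributions cell by cell makes it easy to see that the lowest and highest concentrations in the Alive row drive most of X².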
Locating the concentration effect
Table of Outcome by Sodium oleate (%): the two lower concentrations
Cell contents: Frequency / Col Pct

Outcome    0.65      1.1       Total
Alive      22        13        35
           28.57     17.33
Dead       55        62        117
           71.43     82.67
Total      77        75        152

X² = 2.71, p = 0.10

Table of Outcome by Sodium oleate (%): the two higher concentrations
Cell contents: Frequency / Col Pct

Outcome    1.6       2.1       Total
Alive      12        5         17
           10.71     6.49
Dead       100       72        172
           89.29     93.51
Total      112       77        189

X² = 0.99, p = 0.32
Locating the concentration effect
Table of Outcome by Sodium oleate (%): lower v higher concentrations
Cell contents: Frequency / Col Pct

Outcome    <1.5%     >1.5%     Total
Alive      35        17        52
           23.03     8.99
Dead       117       172       289
           76.97     91.01
Total      152       189       341

X² = 12.83, p = 0.0003
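The three 2×2 comparisons used to locate the effect can be checked with the standard shortcut X² = n(ad – bc)²/(r1·r2·c1·c2). This is a sketch; the function name `chi2_2x2` is an assumption of this example.

```python
def chi2_2x2(a, b, c, d):
    """Pearson X2 for the 2x2 table [[a, b], [c, d]], without continuity correction."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

# Rows: Alive, Dead. Columns: the two concentration groups being compared.
low = chi2_2x2(22, 13, 55, 62)       # 0.65% v 1.1%
high = chi2_2x2(12, 5, 100, 72)      # 1.6% v 2.1%
pooled = chi2_2x2(35, 17, 117, 172)  # <1.5% v >1.5%

print(f"low: {low:.2f}, high: {high:.2f}, pooled: {pooled:.2f}")
```

Only the pooled comparison is large, which places the concentration effect between the two lower and the two higher doses rather than within either pair.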
SAS – data format for FREQ procedure
Conc status number
0.65 d 55
0.65 a 22
1.10 d 62
1.10 a 13
1.60 d 100
1.60 a 12
2.10 d 72
2.10 a 5
2 cols identify the cell
Final column is the ‘response’
– the frequency count for the cell
Code: SAS/FREQ
proc freq data = insert;
  weight number;
  tables status*conc
    / chisq cellchi2 expected
      norow nopercent nocum;
quit;

Option             To Do
chisq              Test statistics (chi-squared etc)
cellchi2           Contribution to X² from each cell
expected           Expected values for each cell
norow nopercent    Omit row/overall percentages
nocum              Omit cumulative frequencies
Validity of chi-squared (χ²) test
• Test is based on an approximation leading to use of the χ² distribution to calculate p-values
• With several DF and E ≥ 5 the approximation is ok
• If E < 1 in any cell the approximation may be bad
• With a number of cells in the table, perhaps a third or a quarter can have E between 1 & 5 without serious departures from χ² based p-values. (PGRM pg 14-11)
• In cases where good approximation is in doubt use Fisher’s exact test (SAS/FREQ tables option exact)
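For a 2×2 table, Fisher's exact test can be sketched directly from the hypergeometric distribution. This is an illustrative implementation under one common two-sided convention (sum the probabilities of all tables no more likely than the observed one). It is not claimed to be what SAS does internally, and the function name is an assumption of this example.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    r1, r2, c1 = a + b, c + d, a + c
    n = r1 + r2

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed
        return comb(r1, x) * comb(r2, c1 - x) / comb(n, c1)

    p_obs = prob(a)
    lo, hi = max(0, c1 - r2), min(r1, c1)
    # Small relative tolerance so ties with p_obs are not lost to rounding
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))

# Mouse data. Rows: antiserum, control. Columns: dead, alive.
p_mice = fisher_exact_two_sided(19, 65, 18, 22)
print(f"Fisher exact p (mice) = {p_mice:.4f}")
```

The exact test conditions on both margins, so it makes no large-sample approximation and stays valid when expected counts are small.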
Logistic Regression (SAS GENMOD)
Why logistic and not just χ²?
• For sparse data (e.g. where individuals will have unique profiles)
• With many categorical explanatory variables
• With quantitative explanatory variables
In the case of a continuous response we have looked to see if the mean, µ, can be expressed as
  µ = a + bx
With categorical data we want an expression for p (the probability of the response in one of the 2 response categories), but
  p = a + bx
may give values outside the range 0 to 1!
e.g. p = 0.1 + 0.2x gives p = 1.1 for x = 5
A solution: TRANSFORM
• Use the transformation:
  p = exp(a + bx)/(1 + exp(a + bx))
• i.e. log(p/(1 – p)) = a + bx, or log(Odds) = a + bx
  where Odds = p/(1 – p)
Note: exp(x) = e^x
[Plot of p against x for a = 0, b = 1 not reproduced]
LOGIT: logit(p) = log(p/(1 – p))
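The transformation and its inverse are easy to try out numerically. The sketch below uses a = 0, b = 1, the values quoted for the plot; the function names `inv_logit` and `logit` are assumptions of this example.

```python
import math

def inv_logit(eta):
    """p = exp(eta) / (1 + exp(eta)); always strictly between 0 and 1."""
    return math.exp(eta) / (1 + math.exp(eta))

def logit(p):
    """log(p / (1 - p)), the log odds."""
    return math.log(p / (1 - p))

a, b = 0.0, 1.0  # values used for the plot
for x in (-5, -1, 0, 1, 5):
    p = inv_logit(a + b * x)
    print(f"x = {x:>2}: p = {p:.4f}, logit(p) = {logit(p):.4f}")
```

However large or small a + bx becomes, p stays inside (0, 1), which is exactly what the straight-line model p = a + bx cannot guarantee.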
SAS/GPLOT
logit(p) = −0.119 + 1.25 conc
[Figure: Logistic Estimate of Death Probability; vertical axis p (0.6 to 1.0), horizontal axis Sodium oleate (%) (0.6 to 2.2)]
SAS OUTPUT

OBS   CONC   DEAD   TOTAL   PROP
1     0.65     53      77   0.68831
2     1.10     57      75   0.76000
3     1.60     95     112   0.84821
4     2.10     73      77   0.94805

Analysis Of Parameter Estimates
Parameter    DF   Estimate   Std Err   ChiSquare   Pr>Chi
INTERCEPT     1    -0.1185    0.3749      0.0999   0.7519
CONC          1     1.2480    0.2921     18.2554   0.0001
SCALE         0     1.0000    0.0000       .        .
NOTE: The scale parameter was held fixed.
LD50 – lethal dose for 50%
p = 0.5
⇒ p/(1 – p) = 1
⇒ logit(p) = 0 (since log(1) = 0)
⇒ 0 = −0.119 + 1.25 conc
⇒ conc = 0.119/1.25 = 0.095

Odds Ratio (OR)
Increasing conc by 1% increases logit(p) by 1.25
log(Odds2) – log(Odds1) = 1.25
log(OR) = 1.25 (since log(a) – log(b) = log(a/b))
OR = exp(1.25) = 3.49
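Both derived quantities follow directly from the fitted coefficients, as this short sketch shows. The variable names are assumptions of this example; the coefficients are the rounded values quoted above.

```python
import math

a, b = -0.119, 1.25  # fitted intercept and slope (rounded)

# LD50: the concentration at which logit(p) = 0, i.e. p = 0.5
ld50 = -a / b

# Odds ratio for a one-unit (1% sodium oleate) increase in concentration
odds_ratio = math.exp(b)

# Check: the fitted probability at the LD50 is 0.5
p_at_ld50 = 1 / (1 + math.exp(-(a + b * ld50)))

print(f"LD50 = {ld50:.3f}, OR = {odds_ratio:.2f}, p at LD50 = {p_at_ld50:.2f}")
```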
SAS/GENMOD

conc   dead   total
0.65     53      77
1.10     57      75
1.60     95     112
2.10     73      77

proc genmod data = log;
  model dead/total = conc /
    pred
    link = logit
    dist = binomial;
  output
    out = p
    predicted = p;
run;

Term               Function
dead/total         the proportion to be estimated
conc               the explanatory variable
dist = binomial    the data consists of counts out of a total
link = logit       for modelling log(p/(1-p)), the log(ODDS)
pred               include predicted p’s in OUTPUT
out = p            output will also go to a data set work.p
predicted = p      in work.p a column named p will contain predicted values
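To show what the binomial/logit fit is doing, here is a minimal Newton-Raphson (iteratively reweighted least squares) sketch in plain Python for one explanatory variable. It is not the GENMOD algorithm as implemented by SAS; the function name, starting values and iteration count are assumptions of this example.

```python
import math

def fit_logistic(x, dead, total, iterations=25):
    """Maximum-likelihood fit of logit(p) = a + b*x to grouped binomial data.

    Returns (a, b, se_a, se_b), with standard errors taken from the inverse
    information matrix evaluated at the last iteration.
    """
    a, b = 0.0, 0.0
    for _ in range(iterations):
        s0 = s1 = i00 = i01 = i11 = 0.0
        for xi, yi, ni in zip(x, dead, total):
            p = 1 / (1 + math.exp(-(a + b * xi)))
            resid = yi - ni * p        # observed minus fitted count
            w = ni * p * (1 - p)       # binomial variance weight
            s0 += resid
            s1 += resid * xi
            i00 += w
            i01 += w * xi
            i11 += w * xi * xi
        det = i00 * i11 - i01 * i01
        # Newton step: solve (information matrix) x (step) = (score vector)
        a += (i11 * s0 - i01 * s1) / det
        b += (i00 * s1 - i01 * s0) / det
    return a, b, math.sqrt(i11 / det), math.sqrt(i00 / det)

conc = [0.65, 1.10, 1.60, 2.10]
dead = [53, 57, 95, 73]
total = [77, 75, 112, 77]

a_hat, b_hat, se_a, se_b = fit_logistic(conc, dead, total)
print(f"intercept = {a_hat:.4f} (SE {se_a:.4f}), conc = {b_hat:.4f} (SE {se_b:.4f})")
```

The estimates should come out close to the INTERCEPT and CONC rows of the SAS output, since both are maximum-likelihood fits of the same model to the same four rows of data.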
Modelling needs biological insight!
Stability analysis (Ex 2 pg 14-15)
Heights, diameters and whether they fell over were recorded for 545 plants.
Aim: model the probability of stability (not falling over) as a function of height and diameter.

diameter   height   stable   n
0.0016     0.057    1        1
0.0018     0.084    0        1
0.0018     0.221    0        1
0.0018     0.038    1        1
0.0019     0.058    1        1
0.0019     0.067    1        1
…          …        …        …

Explanatory terms
• Model 1: h, d, h², d², hd
  hopefully high order terms will not be needed!
• Model 2: h/d²
  biologist suggests this!
SAS GENMOD
/* Model 1 */
PROC Genmod data=htdiam;
  model stable/n = h d h*h d*d h*d / link=logit dist=binomial;
RUN;

/* Model 2 */
Data htdiam1;          /* Note h = height   */
  set htdiam;          /*      d = diameter */
  h_d2 = h/(d*d);      /* the ratio h/d² used as the single explanatory term */
Run;
PROC Genmod data=htdiam1;
  model stable/n = h_d2 / link=logit dist=binomial;
RUN;
Model 1: h, d, h², d², hd

Analysis Of Parameter Estimates
                                  Standard    Wald 95%
Parameter    DF    Estimate       Error       Confidence Limits        Chi-Square   Pr > ChiSq
Intercept     1    -5.3801        0.9402      -7.2228     -3.5374         32.75      <.0001
height        1    -39.1639       4.1510      -47.2998    -31.0280        89.01      <.0001
diameter      1    4958.358       654.0395    3676.464    6240.252        57.47      <.0001
h2            1    10.0396        5.0747      0.0934      19.9859          3.91      0.0479
d2            1    -560913        120280.4    -796659     -325168         21.75      <.0001
hd            1    4206.787       1502.453    1262.033    7151.540         7.84      0.0051
Scale         0    1.0000         0.0000      1.0000      1.0000
How can I describe this!
Model 2: h/d²

Analysis Of Parameter Estimates
                                Standard    Wald 95%
Parameter    DF    Estimate     Error       Confidence Limits      Chi-Square   Pr > ChiSq
Intercept     1    3.3235       0.3212      2.6940     3.9529        107.09      <.0001
h_d2          1    -1.7884      0.1583      -2.0987    -1.4780       127.56      <.0001
Scale         0    1.0000       0.0000      1.0000     1.0000
Can understand & even plot this!
SAS/GRAPH
But!
Linear v Quadratic in x = h/d²
?
Finally! Modelling counts
Poisson Regression (SAS GENMOD)
For count data - where e.g. we count all, not a subset out of a total.
To estimate the mean, µ, and its relationship with an explanatory variable x use a log link (usually):
  log(µ) = a + bx
  i.e. µ = exp(a + bx) (which will be > 0)
         = e^a × e^(bx)
SAS/GENMOD:
  model count = x /
    link = log
    dist = poisson;
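A small numerical sketch of what the log link implies. The coefficients a and b below are hypothetical values chosen only for illustration; they are not estimates from any data set in these notes.

```python
import math

def poisson_mean(a, b, x):
    """Mean count under the log link: mu = exp(a + b*x)."""
    return math.exp(a + b * x)

a, b = 0.5, 0.2  # hypothetical coefficients, for illustration only

means = [poisson_mean(a, b, x) for x in range(5)]
ratios = [means[i + 1] / means[i] for i in range(4)]

print([round(m, 3) for m in means])
print([round(r, 3) for r in ratios])
```

Every fitted mean is positive, and each unit increase in x multiplies the mean by the same factor e^b, so effects under the log link are multiplicative rather than additive.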
Example: Horseshoe crabs & satellites
Each female crab had an attached male (in her nest) & other males (satellites) residing nearby.
• Data recorded
  – Number of satellites (response)
  – Color (light medium, medium, dark medium, dark)
  – Spine condition (both good, one worn/broken, both worn/broken)
  – Carapace width (cm)
  – Weight (kg)
• Poisson Models:
  – Log link: log(µ) = a + bx
  – Identity link: µ = a + bx
Effect of width and colour
Grouping weight & number values
Variation in no. satellites