9/13/2010
Categorical Models
Presented by: Jeff Skinner, M.S.
Biostatistics Specialist
Bioinformatics and Computational Biosciences Branch
National Institute of Allergy and Infectious Diseases
Office of Cyber Infrastructure and Computational Biology
Introduction
Many biological experiments include categorical response variables, which need to be analyzed with tests that may be unfamiliar
• Simple contingency table methods
– Pearson vs. Fisher tests, odds ratios & relative risks, sensitivity & specificity
– M x N tables, McNemar's test for paired data, MHC tests for confounding
• Logistic regression methods
– Odds ratios, estimating LD50, Wald and Likelihood Ratio Tests, ...
• Generalized linear model (GLIM) methods
– Choosing distribution and link functions, overdispersion statistics, ...
Contingency Tables
• Used to display relationships among categorical variables
– Responses in the columns
– Predictors in the rows
• Statistical significance tested using Pearson chi-square or Fisher's exact tests
• Results interpreted using an odds ratio or relative risk

                        Pregnant?
Pregnancy Test?       Yes      No     Row Totals
Positive               27       3         30
Negative                4      26         30
Column Totals          31      29         60
Pearson's Chi-Squared Test
• Pearson's chi-square test assumes that columns and rows are independent
– Computation of expected values (Exp_ij) assumes independence
• Chi-square tests require large sample sizes with no empty cells & few small cell counts
• P-values computed from the chi-square distribution

                        Pregnant?
Pregnancy Test?       Yes      No     Row Totals
Positive             Obs11   Obs12       R1.
Negative             Obs21   Obs22       R2.
Column Totals         C.1     C.2        N..
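The expected-value and chi-square computation can be sketched as follows, using only the Python standard library. This is a minimal illustration, not the software used for the slides: the function names are hypothetical, and the formulas assumed are the standard ones, Exp_ij = R_i. x C_.j / N.. and X2 = sum over cells of (Obs_ij - Exp_ij)^2 / Exp_ij. With Yates' continuity correction, the pregnancy-test table gives the X2 = 32.3026 reported in the results review later in this deck.

```python
import math

# Observed counts from the pregnancy-test table
# (rows: test positive / negative, columns: pregnant yes / no)
observed = [[27, 3],
            [4, 26]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected counts under independence: Exp_ij = R_i. * C_.j / N..
expected = [[r * c / n for c in col_totals] for r in row_totals]


def chi_square(obs, exp, yates=False):
    """Pearson chi-square statistic, optionally with Yates' continuity correction."""
    total = 0.0
    for obs_row, exp_row in zip(obs, exp):
        for o, e in zip(obs_row, exp_row):
            diff = abs(o - e)
            if yates:
                diff = max(diff - 0.5, 0.0)
            total += diff ** 2 / e
    return total


def chi2_sf_df1(x):
    """Upper-tail p-value of the chi-square distribution with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))


x2_plain = chi_square(observed, expected)
x2_yates = chi_square(observed, expected, yates=True)
p_yates = chi2_sf_df1(x2_yates)

print("Expected counts:", expected)
print("X2 (uncorrected):", round(x2_plain, 4))
print("X2 (Yates):", round(x2_yates, 4), "p =", p_yates)
```

The closed-form p-value used here only holds for 1 degree of freedom, which is the case for any 2 x 2 table.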
Fisher’s Exact Test
• Also tests the independence of columns and rows
• Fisher's test is valid for all sample sizes and cell counts
• Fisher's test assumes column and row totals are fixed
– Fisher's exact test may be inappropriate for some tables
• P-values computed using the hypergeometric distribution:
  P(table) = [(a+b)! (c+d)! (a+c)! (b+d)!] / [a! b! c! d! n!]
• P-value represents the probability of finding this specific table, or a more extreme one, among all possible tables with the same row and column totals and sample size n = a + b + c + d

                        Pregnant?
Pregnancy Test?       Yes      No     Row Totals
Positive               a        b        a+b
Negative               c        d        c+d
Column Totals         a+c      b+d        n
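A minimal sketch of Fisher's exact test for a 2 x 2 table, assuming the common two-sided convention of summing the hypergeometric probabilities of every table that is no more likely than the observed one. The function names are hypothetical, and statistical packages may use a different two-sided rule.

```python
from math import comb


def table_probability(a, b, c, d):
    """Hypergeometric probability of a 2x2 table with fixed row and column totals."""
    return comb(a + b, a) * comb(c + d, c) / comb(a + b + c + d, a + c)


def fisher_exact_two_sided(a, b, c, d):
    """Sum the probabilities of all tables no more likely than the observed table."""
    row1, row2, col1 = a + b, c + d, a + c
    p_observed = table_probability(a, b, c, d)
    p_value = 0.0
    # With fixed margins, the top-left cell determines the whole table.
    for x in range(max(0, col1 - row2), min(row1, col1) + 1):
        p_x = table_probability(x, row1 - x, col1 - x, row2 - (col1 - x))
        if p_x <= p_observed * (1 + 1e-7):  # small tolerance for floating-point ties
            p_value += p_x
    return p_value


# Pregnancy-test table: a = 27, b = 3, c = 4, d = 26
p = fisher_exact_two_sided(27, 3, 4, 26)
print("Fisher's exact test p-value:", p)
```

For the pregnancy-test table this agrees with the p = 1.975e-09 reported in the results review.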
Odds Ratios and Relative Risk
• Pearson's chi-square and Fisher's exact tests indicate whether a relationship is statistically significant
– Did the results occur by chance?
• Odds ratios and relative risk indicate the magnitude of a relationship or its effect size
– Was there a large difference in the odds or risks among rows?
– Odds ratio: OR = (a / b) / (c / d) = ad / bc
– Relative risk: RR = [a / (a+b)] / [c / (c+d)]

                        Pregnant?
Pregnancy Test?       Yes      No     Row Totals
Positive               a        b        a+b
Negative               c        d        c+d
Column Totals         a+c      b+d        n
Interpreting OR and RR
• The odds of pregnancy are OR = 58.5 times higher for women who tested positive than the odds of pregnancy for women who tested negative
• The risk of pregnancy is RR = 6.75 times higher for women who tested positive than the risk of pregnancy for women who tested negative
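The two effect sizes can be reproduced directly from the pregnancy-test counts. This is a short sketch assuming the usual definitions OR = (a/b) / (c/d) and RR = [a/(a+b)] / [c/(c+d)]; the variable names are illustrative.

```python
# Pregnancy-test table
a, b = 27, 3    # positive test: pregnant, not pregnant
c, d = 4, 26    # negative test: pregnant, not pregnant

# Odds of pregnancy in each row, then their ratio
odds_ratio = (a / b) / (c / d)

# Risk (proportion pregnant) in each row, then their ratio
relative_risk = (a / (a + b)) / (c / (c + d))

print("OR =", round(odds_ratio, 2))
print("RR =", round(relative_risk, 2))
```

The odds ratio is much larger than the relative risk here because pregnancy is a common outcome in the positive-test row; the two measures are close only when the outcome is rare.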
Sensitivity and Specificity
• Sensitivity and specificity represent the performance of diagnostic tests
• Sensitivity is the proportion of actual positives correctly identified by the diagnostic
– Sensitivity = TP / (TP + FN)
• Specificity is the proportion of actual negatives correctly identified by the diagnostic
– Specificity = TN / (TN + FP)

                        Pregnant?
Pregnancy Test?       Yes      No
Positive               TP       FP
Negative               FN       TN
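As a quick check of these definitions, the sketch below applies them to the pregnancy-test counts (TP = 27, FP = 3, FN = 4, TN = 26). The variable names are illustrative only.

```python
# Cell counts from the pregnancy-test table
tp, fp = 27, 3    # positive test: truly pregnant, not pregnant
fn, tn = 4, 26    # negative test: truly pregnant, not pregnant

# Sensitivity: share of actual positives that the test flags as positive
sensitivity = tp / (tp + fn)

# Specificity: share of actual negatives that the test flags as negative
specificity = tn / (tn + fp)

print(f"Sensitivity = {sensitivity:.2%}")
print(f"Specificity = {specificity:.2%}")
```

Note that both measures are computed within columns (actual status), whereas the odds ratio and relative risk compare rows (test result).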
Table Formats
Contingency Table format

                        Pregnant?
Pregnancy Test?       Yes      No     Row Totals
Positive               27       3         30
Negative                4      26         30
Column Totals          31      29         60

Summarized Table format

Pregnancy Test    Pregnant?    Count
Positive          Yes            27
Positive          No              3
Negative          Yes             4
Negative          No             26

• You may need to reformat your data table for some software
– Contingency table format for analysis in GraphPad Prism
– Summarized table format for analysis in JMP
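Converting between the two layouts is mechanical. The sketch below uses plain Python dictionaries and lists as stand-ins for a data table; it is not the import format of GraphPad Prism or JMP, and the variable names are hypothetical.

```python
# Contingency table format: one row per test result, one column per outcome
contingency = {
    "Positive": {"Yes": 27, "No": 3},
    "Negative": {"Yes": 4, "No": 26},
}

# Contingency -> summarized: one (test, pregnant, count) row per cell
summarized = [
    (test, pregnant, count)
    for test, cells in contingency.items()
    for pregnant, count in cells.items()
]

# Summarized -> contingency: rebuild the nested table from the rows
rebuilt = {}
for test, pregnant, count in summarized:
    rebuilt.setdefault(test, {})[pregnant] = count

for row in summarized:
    print(row)
```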
Review Contingency Table Results
                        Pregnant?
Pregnancy Test?       Yes      No     Row Totals
Positive               27       3         30
Negative                4      26         30
Column Totals          31      29         60

Pearson Chi-Square (with Yates' continuity correction): X2 = 32.3026, p = 1.319e-08
Fisher's Exact Test: p = 1.975e-09
Odds of pregnancy are OR = 58.5 times higher after positive pregnancy test
Risk of pregnancy is RR = 6.75 times higher after positive pregnancy test
Pregnancy test has 87.1% sensitivity and 89.66% specificity
More Complicated Models
• What if your contingency table is larger than 2 x 2?
– Pearson chi-square and Fisher's exact test for M x N tables
• What if your table contains paired data?
– McNemar's Test for paired data
• What if your table has three variables?
– Mantel-Haenszel-Cochran (MHC) test
• What if you have a continuous predictor variable?
– Logistic regression models
• What about really complicated models?
– Generalized Linear Models (GLIM)
M x N Contingency Tables
                       Blood Types
Ethnicity          A      B     AB      O     Row Totals
Bambara            7      8      5     20         40
Peul              12      3      3     12         30
Tuareg            11     13      2      4         30
Column Totals     30     24     10     36        100
• Pearson chi‐square tests work the same for larger M x N tables, but researchers need to remember the assumptions about cell counts
• Fisher’s exact test is difficult to compute for M x N tables, but it can be computed using simulations in R or other software
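The same expected-value logic extends to the 3 x 4 blood-type table. The sketch below also counts cells with expected values under 5, a common rule of thumb for the "small cell counts" assumption (the threshold of 5 is an assumption here, not something stated on the slide). The p-value helper is hypothetical and uses a closed form that is valid only for even degrees of freedom, which happens to apply here (df = 6).

```python
import math

# Blood type counts (A, B, AB, O) by ethnicity
observed = {
    "Bambara": [7, 8, 5, 20],
    "Peul": [12, 3, 3, 12],
    "Tuareg": [11, 13, 2, 4],
}

rows = list(observed.values())
row_totals = [sum(r) for r in rows]
col_totals = [sum(c) for c in zip(*rows)]
n = sum(row_totals)

# Expected counts under independence of rows and columns
expected = [[r * c / n for c in col_totals] for r in row_totals]

chi2 = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(rows, expected)
    for o, e in zip(obs_row, exp_row)
)
df = (len(rows) - 1) * (len(col_totals) - 1)


def chi2_sf_even_df(x, df):
    """Upper-tail chi-square p-value; this closed form is valid only for even df."""
    if df % 2 != 0:
        raise ValueError("closed form requires even degrees of freedom")
    half = x / 2.0
    return math.exp(-half) * sum(half ** i / math.factorial(i) for i in range(df // 2))


p_value = chi2_sf_even_df(chi2, df)
small_cells = sum(e < 5 for exp_row in expected for e in exp_row)

print("X2 =", round(chi2, 4), "df =", df, "p =", round(p_value, 4))
print("Cells with expected count < 5:", small_cells)
```

Three of the twelve expected counts fall below 5 (all in the AB column), which is the kind of situation where a simulation-based Fisher's exact test may be preferred.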
Ordinal vs. Nominal Variables
• Ordinal variables have outcomes that are ordered
– Drug Dosages: 0 mg, 5 mg, 10 mg and 15 mg
– Symptom Severity: Mild, Moderate and Severe
• Nominal variables have outcomes that are unordered
– Blood Types: A, B, AB and O
– Ethnicity: Bambara, Peul and Tuareg
• Most tests assume nominal variables by default
– Ordinal variables require fewer odds ratio estimates
– Ordinal variables may allow for a simpler model
– E.g. compute odds ratios to compare Mild vs. Moderate and Moderate vs. Severe, but do not compare Mild vs. Severe
McNemar’s Test
• McNemar's test should be used if the table represents a matched pairs design experiment
– E.g. Some matched pairs designs arise from repeated sampling of patients pre- and post-treatment
– E.g. Case-control experiments may use McNemar's test because case and control patients have been "matched" using key demographic variables like age, gender, race, ...

                   Test 2
Test 1            Pos      Neg     Row Totals
Positive           a        b        a+b
Negative           c        d        c+d
Column Totals     a+c      b+d        n
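A minimal sketch of McNemar's test, assuming the standard form of the statistic, X2 = (b - c)^2 / (b + c) with 1 degree of freedom, which uses only the discordant pairs b and c from the table above. The counts below are hypothetical and chosen purely for illustration; the exact binomial version is included because b + c is often small.

```python
import math
from math import comb


def mcnemar(b, c):
    """McNemar's test on the discordant pair counts b and c.

    Returns (chi-square statistic, chi-square p-value, exact binomial p-value).
    """
    chi2 = (b - c) ** 2 / (b + c)
    p_chi2 = math.erfc(math.sqrt(chi2 / 2.0))  # chi-square upper tail, 1 df
    # Exact two-sided test: under H0 each discordant pair is a fair coin flip
    n, k = b + c, min(b, c)
    p_exact = min(1.0, 2 * sum(comb(n, i) for i in range(k + 1)) / 2 ** n)
    return chi2, p_chi2, p_exact


# Hypothetical paired results: a = 20, b = 12, c = 3, d = 25
chi2, p_chi2, p_exact = mcnemar(b=12, c=3)
print("X2 =", chi2, "p =", round(p_chi2, 4), "exact p =", round(p_exact, 4))
```

The concordant cells a and d do not enter the statistic at all, which is what distinguishes this test from a Pearson chi-square on the same table.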
Mantel-Haenszel-Cochran Test
• Mantel-Haenszel-Cochran test determines if the relationship between two table variables remains the same if the table is "paneled" or split by a third table variable
• Often used to investigate Simpson's Paradox

                    All Ages          Age < 40          Age > 40
                  Heart Attack?     Heart Attack?     Heart Attack?
Birth Control?     Yes     No        Yes     No        Yes     No
Yes                 16     34          8     32          8      2
No                  34     16          2      8         32      8
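The heart attack tables illustrate Simpson's Paradox numerically. The sketch below compares the pooled odds ratio with the odds ratio inside each age panel, and then computes the Mantel-Haenszel common odds ratio, assuming its usual form: sum(a d / n) / sum(b c / n) across panels. Only the effect-size part of the method is shown; the test statistic itself is omitted.

```python
# Each panel is (a, b, c, d):
#   a = birth control & heart attack,    b = birth control & no heart attack
#   c = no birth control & heart attack, d = no birth control & no heart attack
strata = {
    "Age < 40": (8, 32, 2, 8),
    "Age > 40": (8, 2, 32, 8),
}


def odds_ratio(a, b, c, d):
    return (a * d) / (b * c)


# Pooling the panels reproduces the "All Ages" table
pooled = tuple(sum(cells) for cells in zip(*strata.values()))
pooled_or = odds_ratio(*pooled)

# Odds ratio within each age panel
stratum_ors = {name: odds_ratio(*cells) for name, cells in strata.items()}

# Mantel-Haenszel common odds ratio, adjusting for the panel variable
numerator = sum(a * d / (a + b + c + d) for a, b, c, d in strata.values())
denominator = sum(b * c / (a + b + c + d) for a, b, c, d in strata.values())
mh_or = numerator / denominator

print("Pooled table:", pooled, "OR =", round(pooled_or, 3))
print("Within-panel ORs:", stratum_ors)
print("Mantel-Haenszel OR:", mh_or)
```

The pooled table suggests a strong association (OR of about 0.22), yet the odds ratio is exactly 1 within each age group, so the apparent effect is entirely due to age.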
Logistic Regression
• Logistic regression fits the relationship between a continuous predictor and a categorical response variable
– E.g. predict the gender of an unknown person based on their height
– E.g. predict whether an animal will live or die based on the dose of a drug
• The logistic regression curve represents the change in log odds for each one unit increase in the predictor variable
– E.g. if an unknown person is 61 inches tall, their odds of being male are near zero
– E.g. if an unknown person is 68 inches tall, their odds of being male are about 50-50
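To make the height example concrete, the sketch below evaluates a logistic curve with hypothetical coefficients. The intercept and slope are not fitted to any data; they are chosen only so that the predicted probability is 0.5 at 68 inches, matching the example above.

```python
import math

# Hypothetical coefficients (assumed for illustration, not estimated from data)
B0 = -68.0   # intercept on the log-odds scale
B1 = 1.0     # change in log odds per one-inch increase in height


def prob_male(height_in):
    """Predicted probability from the logistic model log(p / (1 - p)) = B0 + B1 * height."""
    log_odds = B0 + B1 * height_in
    return 1.0 / (1.0 + math.exp(-log_odds))


for height in (61, 65, 68, 71):
    print(height, "inches ->", round(prob_male(height), 4))

# Each one-inch increase multiplies the odds by exp(B1)
print("Odds ratio per inch:", round(math.exp(B1), 3))
```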
“Long” Data Format
• Each row of data represents one patient, animal or subject
• Raw data format is useful when continuous covariates are unique to each subject or patient
– E.g. Exact weight of each patient
– E.g. Exact blood pressure, ...
“Wide” Data Format
• If each value of the continuous variable has been replicated, the data can be formatted as a summarized table
• Summarized tables require less space and can be used in multiple models
– Logistic regression models
– Log‐linear models
– Probit analysis
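When the predictor values are replicated, long-format records can be collapsed into a summarized table by counting outcomes at each level. The sketch below uses hypothetical records (one per animal) and illustrative column meanings; it is not tied to any particular software's layout.

```python
from collections import Counter

# Hypothetical long-format data: one (log10 dose, outcome) record per animal
long_rows = [
    (1, "died"), (1, "survived"), (1, "survived"),
    (2, "died"), (2, "died"), (2, "survived"),
    (3, "died"), (3, "died"), (3, "died"),
]

# Count how many animals share each (dose, outcome) combination
counts = Counter(long_rows)
doses = sorted({dose for dose, _ in long_rows})

# Summarized table: one (dose, number died, number survived) row per dose level
wide_rows = [
    (dose, counts[(dose, "died")], counts[(dose, "survived")])
    for dose in doses
]

for row in wide_rows:
    print(row)
```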
Results from Logistic Regression
• Whole model results
– Likelihood Ratio Test (LRT)
– Model fit diagnostics
• Parameter estimates
– Regression coefficients
– Wald tests
• Odds ratios
– The odds of survival are 1.107 times higher after every one unit increase in log(dose)
– Odds of survival are 12.794 times higher after every one unit increase in dose
Why Use Both Wald and LRT?
• Likelihood Ratio tests compare the fit of two statistical models
– Most statistical models can be described with a likelihood function
– A likelihood ratio test (LRT) compares the log-likelihood under a full model (dose and intercept) with the log-likelihood under a reduced model (intercept only) to test model fit
• Wald tests evaluate the statistical significance of model parameters
– Wald test statistics are constructed much like Student's t-tests: an estimate divided by its standard error
– Results from Wald tests should be consistent with LRT results
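The two tests can be compared on a small example. The sketch below fits a simple logistic regression by Newton-Raphson using only the standard library, then computes the LRT (full vs. intercept-only model) and the Wald test for the slope. The dose-response data are hypothetical, the function names are illustrative, and real analyses would normally rely on a statistics package rather than hand-written fitting code.

```python
import math

# Hypothetical grouped dose-response data (not taken from the slides)
log_dose = [0.0, 1.0, 2.0, 3.0, 4.0]
n_animals = [10, 10, 10, 10, 10]
n_dead = [1, 3, 5, 7, 9]


def log_likelihood(b0, b1):
    """Binomial log-likelihood (up to a constant) for log odds = b0 + b1 * x."""
    ll = 0.0
    for x, n, y in zip(log_dose, n_animals, n_dead):
        p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
        ll += y * math.log(p) + (n - y) * math.log(1.0 - p)
    return ll


def score_and_information(b0, b1):
    """Gradient (g0, g1) and information matrix entries (h00, h01, h11)."""
    g0 = g1 = h00 = h01 = h11 = 0.0
    for x, n, y in zip(log_dose, n_animals, n_dead):
        p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
        w = n * p * (1.0 - p)
        g0 += y - n * p
        g1 += (y - n * p) * x
        h00 += w
        h01 += w * x
        h11 += w * x * x
    return g0, g1, h00, h01, h11


def fit_logistic(iterations=25):
    """Newton-Raphson updates for the intercept and slope, starting from zero."""
    b0 = b1 = 0.0
    for _ in range(iterations):
        g0, g1, h00, h01, h11 = score_and_information(b0, b1)
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


b0, b1 = fit_logistic()

# Wald test: slope estimate divided by its standard error
_, _, h00, h01, h11 = score_and_information(b0, b1)
se_b1 = math.sqrt(h00 / (h00 * h11 - h01 * h01))
wald_z = b1 / se_b1
p_wald = math.erfc(abs(wald_z) / math.sqrt(2.0))   # two-sided normal p-value

# Likelihood ratio test: full model vs. intercept-only model
p_null = sum(n_dead) / sum(n_animals)
ll_null = sum(
    y * math.log(p_null) + (n - y) * math.log(1.0 - p_null)
    for n, y in zip(n_animals, n_dead)
)
lrt = 2.0 * (log_likelihood(b0, b1) - ll_null)
p_lrt = math.erfc(math.sqrt(lrt / 2.0))            # chi-square upper tail, 1 df

print("b0 =", round(b0, 4), "b1 =", round(b1, 4))
print("Wald z =", round(wald_z, 3), "p =", p_wald)
print("LRT =", round(lrt, 3), "p =", p_lrt)
```

On well-behaved data like this the two p-values point to the same conclusion, although they are not numerically identical; large disagreements are usually a sign of small samples or nearly separated data.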
Estimate LD50 from Logistic Regression
• You can use interpolated values or inverse prediction to estimate LD50 from a logistic regression
• Open the Inverse Prediction menu and enter Prob = 0.500 to estimate LD50 by finding X at Y = 0.500
– Enter Prob = 0.90 for LD90, ...
• You may need to antilog your LD50 estimate if your predictor is on the log scale (e.g. log10(dose))
Compute LD50 from Parameter Estimates
• Simple logistic regression is defined by the equation ln(p / (1 - p)) = B0 + B1 X
• At the LD50, p = 0.5, so ln(p / (1 - p)) = 0 and B0 + B1 X = 0
• Therefore, by simple algebra, we find LD50 = -B0 / B1
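Inverse prediction generalizes this algebra to any target probability by solving the logistic equation for X. The sketch below uses hypothetical parameter estimates and assumes the predictor is log10(dose), so the final step takes the antilog; the function name is illustrative and is not a JMP or SAS routine.

```python
import math


def inverse_prediction(prob, b0, b1):
    """Solve ln(p / (1 - p)) = b0 + b1 * x for x at a target probability."""
    return (math.log(prob / (1.0 - prob)) - b0) / b1


# Hypothetical estimates (assumed for illustration); predictor is log10(dose)
b0, b1 = -4.0, 2.0

log_ld50 = inverse_prediction(0.5, b0, b1)   # reduces to -b0 / b1
ld50 = 10 ** log_ld50                        # antilog back to the dose scale
log_ld90 = inverse_prediction(0.9, b0, b1)

print("log10(LD50) =", log_ld50, "-> LD50 =", ld50)
print("log10(LD90) =", round(log_ld90, 4), "-> LD90 =", round(10 ** log_ld90, 1))
```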
Reed‐Muench Method
• Graphical estimate of LD50 from survival data
• Plot total number of survivors and total number dead against dilution or concentration
• Intersection represents best estimate of LD50
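The graphical procedure can be approximated numerically by accumulating the totals and interpolating where the two curves cross. This is a sketch of the intersection method as described above, using hypothetical counts; it assumes deaths are accumulated from the lowest dose upward and survivors from the highest dose downward. The classical Reed-Muench formula interpolates on percent mortality instead, so its estimate can differ slightly.

```python
from itertools import accumulate

# Hypothetical survival data, ordered from lowest to highest dose
log_dose = [1, 2, 3, 4, 5]     # log10 dose
dead = [0, 1, 2, 5, 6]         # deaths in each group
survived = [6, 5, 4, 1, 0]     # survivors in each group

# An animal that dies at a low dose is assumed to die at all higher doses,
# so deaths accumulate upward; survivors accumulate downward for the same reason.
cum_dead = list(accumulate(dead))
cum_surv = list(accumulate(reversed(survived)))[::-1]


def crossing_point(x, dead_curve, surv_curve):
    """Linear interpolation of the x value where the two cumulative curves intersect."""
    for i in range(len(x) - 1):
        gap_lo = dead_curve[i] - surv_curve[i]
        gap_hi = dead_curve[i + 1] - surv_curve[i + 1]
        if gap_lo <= 0 <= gap_hi and gap_hi > gap_lo:
            fraction = -gap_lo / (gap_hi - gap_lo)
            return x[i] + fraction * (x[i + 1] - x[i])
    return None  # the curves never cross within the tested doses


log_ld50 = crossing_point(log_dose, cum_dead, cum_surv)
print("Cumulative dead:     ", cum_dead)
print("Cumulative survivors:", cum_surv)
print("log10(LD50) =", round(log_ld50, 4), "-> LD50 =", round(10 ** log_ld50, 1))
```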
Generalized Linear Models
• Logistic regression, extensions of Pearson chi-square tests and other models can be defined as generalized linear models (GLIM)
• Each GLIM model is coerced into the form of a linear equation by choosing the correct statistical distribution and link function
• Excluding logistic regression, most multifactor categorical models must be specified using the GLIM procedures in your software
• GLIM procedures typically allow analysts to test for overdispersion, where real data has more variance than expected from the model
Distribution Choices
• Modeling categorical responses directly
– Binomial and multinomial distributions
– Negative binomial distribution
• Modeling contingency table cell counts
– Poisson distribution models all cell counts as rare events
– Normal distribution models cell counts as common events
Link Functions
• Link functions are mathematical transformations used to coerce models into linear equations
– The identity link function g(y) = y for linear models
– The log link function g(y) = log(y) for log-linear models
– The logit link function g(y) = log(y / (1 - y)) for logistic regression models
– The probit link function g(y) = Φ^-1(y), the inverse of the standard normal CDF, for probit analysis models
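The four link functions and their inverses can be written in a few lines. This is a sketch using the Python standard library (`statistics.NormalDist` supplies the normal CDF and its inverse); the dictionary layout is an illustrative choice, not the interface of any GLIM procedure.

```python
import math
from statistics import NormalDist

_std_normal = NormalDist()  # standard normal: mean 0, standard deviation 1

# Each entry maps a link name to (link function g, inverse link g^-1)
links = {
    "identity": (lambda y: y, lambda eta: eta),
    "log": (math.log, math.exp),
    "logit": (lambda y: math.log(y / (1 - y)), lambda eta: 1 / (1 + math.exp(-eta))),
    "probit": (_std_normal.inv_cdf, _std_normal.cdf),
}

# Apply each link to a mean of 0.25, then map back with the inverse link
for name, (link, inverse) in links.items():
    eta = link(0.25)
    print(f"{name:8s} g(0.25) = {eta: .4f}   inverse = {inverse(eta):.4f}")
```

The logit and probit links both map probabilities in (0, 1) onto the whole real line and both equal 0 at a probability of 0.5, which is why the two models usually give similar fits.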
Historic Models as GLIM
• Logistic regression
– Binomial distribution with logit link function
• Probit analysis
– Binomial distribution with probit link function
• Log-linear models
– Poisson distribution with log link function
• Negative Binomial regression
– Negative binomial distribution with log link function
Overdispersion Parameters
• Traditional linear models, like linear regression, use independent parameters to estimate the variance of the response data
– E.g. linear regression has independent mean μ = Xβ and variance σ2
• Many GLIM models, like logistic regression, have fixed relationships between the variance and other model parameters
– E.g. logistic regression has mean μ = np and variance σ2 = np(1 – p)
– E.g. log-linear models have μ = σ2 = λ = np for rare events with small p
• Overdispersion parameters are used to account for extra variability in the responses, which cannot be explained by the model
– E.g. logistic regression modeled with variance σ2 = φnp(1 – p), where φ = 1 means no extra variability
– Want to know if multiplier φ > 2 to determine significance or importance
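One common way to estimate φ is the Pearson chi-square statistic divided by its residual degrees of freedom. The sketch below assumes that estimator and applies it to hypothetical grouped binomial data under an intercept-only model, so that no model fitting is needed; the data and variable names are illustrative.

```python
# Hypothetical grouped binomial data (assumed for illustration)
n_trials = [10, 10, 10, 10, 10]   # group sizes
successes = [2, 8, 5, 1, 9]       # successes observed in each group

# Intercept-only binomial model: one common probability for every group
p_hat = sum(successes) / sum(n_trials)
n_params = 1

# Pearson chi-square: squared residuals scaled by the binomial variance np(1 - p)
pearson_x2 = sum(
    (y - n * p_hat) ** 2 / (n * p_hat * (1 - p_hat))
    for y, n in zip(successes, n_trials)
)

# Dispersion estimate: Pearson chi-square over residual degrees of freedom
phi_hat = pearson_x2 / (len(n_trials) - n_params)

print("p_hat =", p_hat)
print("Pearson X2 =", pearson_x2)
print("Estimated phi =", phi_hat)
```

Here the groups vary far more than a single binomial probability would allow, so the estimate is well above 1.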
Generalized Linear Mixed Models
• Generalized linear models can be advanced further by including random effect variables
– These models are called generalized linear mixed models (GLMM)
– Random effect variables are included to account for paired designs, repeated measures designs, split-plot designs and other effects
– GLMM are typically fit using generalized estimating equations (GEE), often using linearization techniques (e.g. SAS PROC GLIMMIX)
• Sometimes complicated GLM and GLMM must be fit using nonlinear modeling procedures in your software
– Probit model with binomial errors or Poisson loss function models in JMP
– Probit-Normal models and Poisson-Normal models in SAS PROC NLMIXED
Random vs. Fixed Effects
• Subject effects are random because the subjects in an experiment are a sample from the population of all possible subjects
• Gender effects are fixed because there are only two genders
Split-plot Design
• Split-plot design experiments model experiments where whole plots and subplots represent different experimental units (EUs)
– Whole plots are often locations, subjects, objects or factors that are difficult to change (e.g. temperature in an incubator)
– Subplot effects are typically the effects of highest interest
– Subplot effects are tested with higher power than whole plot effects
• Example: 12 mice: 6 infected, 6 uninfected
– 3 infected males, 3 infected females, ...
– 4 samples taken from each mouse
– Each sample treated with one of 2 different drugs
– Whole plot (mouse) EUs: infection, gender
– Subplot (sample) EUs: drug treatment
References
• Agresti A. 2002. Categorical Data Analysis. Second Ed. Wiley-Interscience.
• Reed LJ and H Muench. 1938. A Simple Method of Estimating Fifty Percent Endpoints. The American Journal of Hygiene. 27(3):493-497.
• SAS Institute Inc. 2007. SAS 9.1.3 Documentation. Cary, NC. SAS Institute Inc.
• SAS Institute Inc. 2010. JMP Statistics and Graphics Guide. Cary, NC. SAS Institute Inc.