research lessons from survival analysis

Research Lessons from Survival Analysis

Joshua Naranjo

Statistical Consulting CenterDepartment of Statistics

Western Michigan University

Joshua Naranjo Research Lessons from Survival Analysis

A modeling framework

Common elements

1 Data

2 Model

3 Fit criterion

Data Model↖ ↗Fit Criterion


Student’s t

Ex. The two-sample problem

1 Data: {xi}, {yj}2 Model: X ∼ N(µ1, σ

2), Y ∼ N(µ2, σ2)

3 Fit criterion: Likelihood


Model : X ∼ N(x , s2p), Y ∼ N(y , s2

p)

Test : y−xsp√

1/n1+1/n2> tα

Estimate : y − x(

SE = sp√

1/n1 + 1/n2

)


Welch-t

1 Data: {xi}, {yj}2 Model: X ∼ N(µ1, σ

21), Y ∼ N(µ2, σ

22)

3 Fit criterion: Likelihood


Model: X ∼ N(µ1, s21 ) , Y ∼ N(µ2, s

22 )

Test : y−x√s2

1/n1+s22/n2

> tα

Estimate : y − x

(SE =

√s2

1/n1 + s22/n2

)


Wilcoxon Rank-Sum

1 Data: {xi}, {yj}2 Model: X ∼ F (t), Y ∼ F (t −∆)

3 Fit criterion: Rank dispersion


Model: X ∼ F (t), Y ∼ F (t − ∆), where ∆ = med {yj − xi}

Test :∑

Ri−E0√V0

> zα

Estimate : ∆ = med {yj − xi}


Evaluation


Pros and cons1 Efficiency

Null (size, etc.)Alternative (power under various configurations)

2 Robustness

Broadness of assumptionsSensitivity to violations

3 Simplicity and interpretability

4 Computing complexity


Evaluation

Example: The two-sample problem

Student’s t Welch t Rank Sum1. Efficiency X X X2. Robustness x X X3. Interpret X X x4. Computing X X X


Beyond two-sample

More data structures

The k-sample problem (Y1,Y2,Y3)

Regression (X1,X2,X3,Y )

Correlation (X ,Y )

Categorical data (fx , fy )

Survival data (Tx ,Ty )

Longitudinal data (Σy )

Time series data (Yt)

Mixture data (pY1 + (1− p)Y2)

· · ·


Two-sample vs Two-sample

Age at death (from all causes)

Study 1 (Estrogen) Study 2 (Vitamin E)

Treatment Control Treatment Control(n = 1680) (n = 1722) (n = 1819) (n = 1994)

66 75 64 5568 58 90 8972 79 83 57...

......

...... 51

......

44 63...

60


Two-sample vs Two-sample

Crucial difference

1 Study 1 is randomized2 Study 2 is not randomized

Treatment group has higher educationTreatment group more female

Need to adjust for education and gender⇒((((((hhhhhhTwo-sample Regression


Survival data

The two-sample problem

Response is time-to-event (e.g. survival time)

Patients enter the study serially (e.g. patients are referred tostudy upon diagnosis)

Two groups of patients (e.g. randomized treatment arms)


Survival data

subj days status group1 8 1 22 23 1 13 23 1 24 52 0 25 60 1 1:

74 828 1 175 976 0 2

days=time from randomization to death or censoring

status=

{1 if death0 if censored (e.g. subject 4 enrolled 52 days ago)

group=

{1 if treatment 12 if treatment 2


Survival data

subj days group1 8 22 23 13 23 24 >52 25 60 1:

74 828 175 >976 2

days=time from randomization to death or censoring

status=

{1 if death0 if censored (e.g. subject 4 enrolled 52 days ago)

group=

{1 if treatment 12 if treatment 2


Kaplan-Meier Curve

Q: How do we accommodate censoring? (L1)

subj days group1 8 22 23 13 23 24 >52 25 60 1:

74 828 175 >976 2

The KM estimate of S(t) = P[T > t]:S(0) = 1S(8) = 74/75 = .9867S(23) = (74/75)(72/74) = .9600S(60) = (74/75)(72/74)(70/71) = .9465


Kaplan-Meier Curve


Test for equality of groups

H0 : S1(·) = S2(·) vs H1 : Not

Two early approaches

1 Gehan (1965)Adapt rank tests to accommodate censoring

2 Mantel (1968)Adapt frequency tables to accommodate censoring (L2)



1 Adapt rank tests to accommodate censoring

The Wilcoxon Rank-Sum Test (1945)




Idea: Add up the ranks of the Group 1 subjects

subj days group Rank1 8 22 23 1 2.53 23 24 >52 25 60 1 ?:

74 828 175 >976 2

?




Idea: Add up the ranks of the Group 1 subjects

group 1 group 2days rank days rank

23 2.5 8 160 ? 23 2.5

: >52: :: :: :: >976 75

828 74

?



The Mann-Whitney version of Wilcoxon Rank-Sum test

W =∑i

∑j

Uij ,

where Uij =

−1 xi < yj

0 xi = yj1 xi > yj

Let

U∗ij =

−1 xi < yj or xi < y∗j

0 xi = yj or x∗i < yj or xi > y∗j1 xi > yj or x∗i > yj

”Gehan (1965) generalization of Wilcoxon”



”Efron (1967) modification of Gehan-Wilcoxon”

subj days status group Rank1 8 1 22 23 1 1 2.53 23 1 24 52 0 25 60 1 1 ?:

74 828 1 175 976 0 2

?

While it is unlikely that 60 > 52∗, it is more likely that 828 > 52∗,but Gehan treats them as equally uninformative (Uij∗ = 0).

U∗ij = 1− 2S(xi )

S(yj)when xi > y∗j



2 Adapt frequency tables to accommodate censoring

Recall the χ2 test

Yes NoGrp 1 a b n1

Grp 2 c d n2

m1 m2 n

Then

χ2 =(a− E )2

V

whereE =

n1m1

nand V =

n1n2m1m2

n2(n − 1)



”Cochran-Mantel-Haentszel test (1959)”

When data are stratified according to control variables (e.g. raceor gender)

Yes NoGrp 1 a1 b1

Grp 2 c1 d1

Yes Noa2 b2

c2 d2

· · ·Yes NoaK bKcK dK

χ2CMH =

(∑

ai −∑

Ei )2∑

Vi

Q: Why not∑(

ai−EiVi

)2

Be cynical. (L3)



“Logrank test” (Mantel, 1966)

Day 8: 1 death, and it came from Group 2

Grp 1 0 40 40Grp 2 1 34 35

⇒ a1 = 0, E1 = 4075

Day 23: 2 deaths, one from each group

Grp 1 1 39 40Grp 2 1 33 34

⇒ a2 = 1, E2 = 2(

4074

)Day 60: 1 death, and it came from Group 1

Grp 1 1 38 39Grp 2 0 32 32

⇒ a3 = 1, E3 = 3971

Z =∑

(aj−Ej )√∑Vj

, sum over distinct event times



Pros and cons? Not obvious.

While the Mantel logrank test is

ZL =

∑(aj − Ej)√∑

Vj

the Gehan-Wilcoxon can be written as

ZW =

∑nj(aj − Ej)√∑

n2j Vj

where nj is the total number of subjects at risk at event time j .

⇒ The Wilcoxon gives more weight to early differences


Regression

Example: Survival times for heart transplant patients

(T) (X1) (X2) (X3)

Subj Survive Trans Prior Age

1 5 0 0 52

2 15 1 0 55

3 37 1 0 41

4 48 0 1 31

:

:

44 541 1 0 46

45 670 1 0 48


Suppose that survival time T has probability density function f (t).The cumulative density function is

F (t) = P[T ≤ t] =

∫ t

0f (u) du

The survivor function is

S(t) = P[T > t] = 1− F (t)

Define the hazard function as

h(t) =f (t)

S(t)

It can be shown that

f (t) = h(t)e−∫ t

0 h(u) du


Some hazard models:

h(t) =

λ , Exponentialλγt , Gompertzλtα , Weibull

Incorporating covariates like transplant, prior surgery, and age?

h(t) =

λ eβ1x1+···+βkxk , Exponentialλγt eβ1x1+···+βkxk , Gompertzλtα eβ1x1+···+βkxk , Weibull

= λ0(t) eβ′x

Proportional hazards: covariates are multiplicatively related to thehazard


Estimate parameters by maximum likelihood

L =n∏

i=1

fi (ti )

wherefi (t) = hi (t)e−

∫ t0 hi (u) du

hi (t) = λ0(t) eβ′xi

and

λ0(t) =

λ , Exponentialλγt , Gompertzλtα , Weibull

Problem: Specify λ0(t)?


Partial Likelihood

Cox (1972) proposed an estimation method for the βs withoutneeding to specify λ0(t). Maximize the partial likelihood

PL =n∏

i=1

Li

where Li is the conditional probability of failure at time ti giventhe number of cases at risk at time ti .



(T) (X1) (X2) (X3)

Subj Survive Trans Prior Age

1 5 0 0 52

2 15 1 0 55

3 37 1 0 41

4 48 0 1 31

:

:

44 541 1 0 46

45 670 1 0 48


A death occurred 5 days after enrollment. What is the probabilitythat it happens to patient 1 instead of to one of the others at-risk?

L1 =h1(5)

h1(5) + · · ·+ h45(5)

L2 =h2(15)

h2(15) + · · ·+ h45(15)

:

:

L44 =h44(541)

h44(541) + h45(541)

L45 = 1

Similar to

P(Ai |B) =P(Ai ∩ B)

P(A1 ∩ B) + · · ·+ P(An ∩ B)


L1 =h1(5)

h1(5) + · · ·+ h45(5)

=λ0(5) eβ

′x1

λ0(5) eβ′x1 + · · ·+ λ0(5) eβ′x45

=eβ

′x1

eβ′x1 + · · ·+ eβ′x45

L2 =eβ

′x2

eβ′x2 + · · ·+ eβ′x45

The combination of PH assumption and partial likelihood

PL =n∏

i=1

Li

allows estimation of β without specifying the baseline hazard!


Efficiency

2 Efron (1977) “The Efficiency of Cox’s Likelihood Function forCensored Data”.


SAS output

The PHREG Procedure

Dependent variable: Survive

Maximum Likelihood Estimates

Standard Wald Pr> Risk

Var DF Estimate Error Chi-sq Chi-sq Ratio

Trans 1 -1.708140 0.27860 37.590 0.0001 0.181

Prior 1 -0.421402 0.37100 1.290 0.2560 0.656

Age 1 0.058609 0.01505 15.163 0.0001 1.060


The hazard ratio for age is

h(t;Age = x + 1)

h(t;Age = x)=λ0(t)e−1.708x1−0.421x2+0.0586(x3+1)

λ0(t)e−1.708x1−0.421x2+0.0586x3

= e .0586

= 1.0603

so every additional year of age increases hazard of failure by 6%same for all t.

(Implications and consequences of PH)



Standard Risk

Var DF Estimate Error Ratio

Trans 1 -1.708140 0.27860 0.181

Prior 1 -0.421402 0.37100 0.656

Age 1 0.058609 0.01505 1.060

“Age increases hazard of death by 6%, and getting a transplantreduces hazard of death by 82%.”

While the age effect may be real, the magnitude of transplanteffect is likely false.

Transplant⇒ Late death

No Transplant⇒ Early death



Standard Risk

Var DF Estimate Error Ratio

Trans 1 -1.708140 0.27860 0.181

Prior 1 -0.421402 0.37100 0.656

Age 1 0.058609 0.01505 1.060

“Age increases hazard of death by 6%, and getting a transplantreduces hazard of death by 82%.”

While the age effect may be real, the magnitude of transplanteffect is likely false.

Transplant ⇒ Late deathNo Transplant ⇐ Early death


PH assumption


Characterizations of Proportional Hazards

Q: How do we detect PH? (L4: Unobservables)

Two sample:

h(t) = h0(t)eβxi =

{h0(t) Group 1 (x = 0)h0(t)ψ Group 2 (x = 1)

S(t) =

exp{∫ t

0 h0(u)du}

Group 1

exp{∫ t

0 h0(u)ψdu}

Group 2

implies

S2(t) = [S1(t)]ψ



S2(t) = [S1(t)]ψ

Early separation

Late separation



S2(t) = [S1(t)]ψ


The logrank test and Cox Regression

Quotes:Assume that xi is a binary variable and the onlycovariate. Then the score test for Cox-regression isthe same as the log-rank if there are no ties.LR test is aymptotically efficient for the two-sampleproblem when there is noninformative censoring.


The logrank test and Cox Regression


Extensions 1 (Applications)

What about (L5: Extensions)

Tied event times

Repeated measures

Fixed covariatesTime-dependent covariates (e.g. co-meds, hypertension)

Time-dependent coefficients (e.g. treatments that are lesseffective in later stages)

Large p


Extensions 2 (Tools)

Extend to Cox PH regression

R2

Extra SS

Partial plots

Outlier diagnostics

Mallows Cp

AIC

LASSO...


Non-Proportional Hazards

“Cancer immunotherapy activates the immune system to combatcancer. This poses the possibility of a delayed onset of clinicaleffect and violation of the PH assumption.” (Z. Xu, FDA)



JSM 2019 (Denver CO)



The NPH Working Group

Collaboration initiated by FDA in 2016

Methodical evaluation of available methods is needed

Appropriate analysis for various patterns of non-PH

How to optimally test for treatment differences?How to summarize nonconstant treatment effect?

Organizations representedAZ, BMS, Merck, Boehringer Ingelheim, Novartis, Eli Lilly,Abbvie, Genentech, Roche, Bayer, Janssen, Takeda, Amgen,Pfizer, GSK, Celgene, Sanofi, Johnson & Johnson, FDA



Available methods

Weighted logrank (Fleming and Harrington 1987)

Piecewise logrank (Xu et al. 2016)

Combination tests

Karrison 2016 (Max-Combo)Gares et al. 2017Yang and Prentice 2010Logan 2008

Restricted mean survival time (Uno et al. 2014)

Generalized pairwise comparisons Uij

Peron et al. 2018Buyse 2010

Weighted KM test (Pepe and Fleming 1989, 1991)


Weighted logrank tests

Logrank test (Mantel, 1968)

L =

∑(aj − Ej)√∑

Vj

Wilcoxon (Gehan, 1965)

W =

∑nj(aj − Ej)√∑

n2j Vj

.=

∑S(tj)(aj − Ej)√∑

S2(tj)Vj

Weighted Logrank (Fleming and Harrington, 1987)

FH =

∑W (tj)(aj − Ej)√∑

W 2(tj)Vj

where W (t) =[S(t)

]ρ [1− S(t)

]γJoshua Naranjo Research Lessons from Survival Analysis

Weighted logrank tests

FH(ρ, γ) =

Logrank (ρ, γ) = (0, 0)Wilcoxon (1, 0)Late difference (0, 1)

Pros and cons

Easy to understand and implement

With correct choice of weight, can be much more efficientthan either logrank or Wilcoxon

Low efficiency with incorrect choice of weight


Max-Combo

Karrison (2016): Reject H0 : S1 = S2 if

Zm = max(|FH(0, 0)|, |FH(1, 0)|, |FH(0, 1))| > gα

gα is the upper α percentile of the null distribution ofmax(|Z1|, |Z2|, |Z3|)Calculating gα

α = P[Zm > gα]= 1− P[Zm ≤ gα]= 1− P[|Z1| ≤ gα, |Z2| ≤ gα, |Z3| ≤ gα]

(Z1,Z2,Z3) ∼ MN(0,Σ),Gill (1980), Kosorok and Lin (1999), Karrison (2016)


Max-Combo

FH(0, 0) : p = .022FH(1, 0) : p = .062FH(0, 1) : p = .015

Zm : p = .029


Max-Combo

FH(0, 0) : p = .190FH(1, 0) : p = .045FH(0, 1) : p = .938

Zm : p = .082


Lessons:

1 Accommodate (certain data features)

2 Adapt (existing methods)

3 Be cynical

4 Unobservable assumptions (diagnostics)

5 Extensions (broader applications)

6 Try combining methods (balanced properties)


Recent dissertation topics (Statistics, WMU)

”Nonparametric Tests for Ordered Alternatives in aTwo-Stage Nested Design” (Bang Le, 2019)

”A Two-Stage Bayesian Variable Selection for Geo-ReferencedCount Data” (Yuqian Shen, 2019)

”Testing for Association in Stratified 2× 2 ContingencyTables” (Eunice Ampah, 2018)

”Statistical Models for Correlated Data” (Xiaomeng Niu,2017)

”Diagnostics for Choosing Stratified Logrank or WilcoxonTests” (Jhoanne Gatpatan, 2017)

”Subgroup Analysis and Growth Curve Models forLongitudinal Data” (Nichole Andrews, 2017)


research lessons from survival analysis

Documents