Bootstrap Methods and the Accuracy of Large-Scale Estimators
Bradley Efron, Stanford University
Correlation and Accuracy
• Modern Scientific Studies: $N$ cases (genes, SNPs, pixels, ...), each with its own summary statistic "$z_i$", $i = 1, 2, \dots, N$
• $N \sim 10{,}000$
• Estimate of interest: $\theta = s(z)$ [e.g., $\theta = \#\{z_i > 3\}/N$]
• Question: How accurate is $\theta$?
• Easy answer if zi’s independent (but usually not!)
• Troubles for the bootstrap
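As a concrete instance of θ = s(z), the sketch below computes the tail proportion #{z_i > 3}/N together with the binomial standard error that would apply if the z_i were independent. The sample values and the function name `tail_proportion` are illustrative assumptions, not data or code from the talk.

```python
import math

def tail_proportion(z, cutoff=3.0):
    """Return theta = #{z_i > cutoff}/N and its standard error under independence."""
    n_cases = len(z)
    theta = sum(1 for zi in z if zi > cutoff) / n_cases
    # Binomial standard error: only valid if the z_i are independent.
    se_indep = math.sqrt(theta * (1.0 - theta) / n_cases)
    return theta, se_indep

# Hypothetical z-values, for illustration only.
z_demo = [0.2, -1.1, 3.4, 0.7, 4.1, -0.3, 2.9, 1.5, -2.2, 0.0]
theta, se_indep = tail_proportion(z_demo)
print(theta, se_indep)
```

With correlated z_i this standard error is generally too small, which is the problem the rest of the talk addresses.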
Bootstrap Methods 1
Leukemia Microarray Study
(Golub et al., 1999)
• 72 leukemia patients: n1 = 47 “ALL”, n2 = 25 “AML”
• $N = 7128$ genes
• Data matrix $X$: $7128 \times 72$
• $X$ has independent columns but correlated rows; rms correlation $\alpha = .11$
• $t_i$ = two-sample t-statistic, AML vs. ALL, for gene $i$
• $z_i = \Phi^{-1}(F_{70}(t_i))$ [$\Phi$, $F_{70}$: cdfs of $N(0, 1)$ and $t_{70}$]
  $H_0 : z_i \sim N(0, 1)$, the "theoretical null"
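The transformation from a t-statistic to a z-value can be sketched with the standard library alone. The t cdf below is obtained by Simpson integration of the t density, which is a convenience choice for this illustration and not necessarily how the original analysis computed it; the function names are assumptions.

```python
import math
from statistics import NormalDist

def t_pdf(t, df):
    """Density of the central Student-t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(t * t / df))

def t_cdf(t, df, steps=2000):
    """Central t cdf via composite Simpson integration on [0, |t|] (steps even)."""
    a = abs(t)
    if a == 0.0:
        return 0.5
    h = a / steps
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * t_pdf(k * h, df)
    half = total * h / 3
    return 0.5 + half if t > 0 else 0.5 - half

def z_value(t, df=70):
    """Map a t-statistic to the z scale: z = Phi^{-1}(F_df(t))."""
    return NormalDist().inv_cdf(t_cdf(t, df))

for t in (-2.0, 0.0, 2.0, 4.0):
    print(t, round(z_value(t), 3))
```

Because the t distribution has heavier tails than the normal, the z-value is slightly closer to zero than the t-statistic it comes from.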
[Figure: Histogram of the leukemia data, N = 7128 z-values, 47 ALL versus 25 AML patients; RMS correlation = .11; empirical null ~ N(.10, 1.68^2). x-axis: z values; y-axis: frequency. Curve: Poisson spline fit (df = 5).]
[Figure: Leukemia z-value histogram and the average of 100 bootstrap z* histograms (two-sample nonparametric bootstrap: resample columns of X). x-axis: z values; y-axis: frequency. Curves: boot average; Poisson spline fit.]
Bootstrap Dilation
• $x_i$ = $i$th row of $X$ ($n = 72 = 47 + 25$)
• $x_i \to z_i$
• $x_i^* \to z_i^* \sim z_i + N(0, \sigma_i^2)$
• Bootstrap histogram has an extra component of variance:
$$E^*\Big\{\sum_{i=1}^N z_i^{*2}/N\Big\} = \sum_{i=1}^N z_i^2/N + \sum_{i=1}^N \sigma_i^2/N$$
• Next: bootstrap standard deviation estimates for $F(x) = \#\{z_i \ge x\}/N$
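The dilation identity can be checked by simulation. This is a minimal sketch in which the "observed" z-values and the per-case noise levels σ_i are made-up inputs, and the bootstrap step is replaced by adding independent N(0, σ_i²) noise as in the model above.

```python
import random

rng = random.Random(0)
N, B = 5000, 10
z = [rng.gauss(0.0, 1.5) for _ in range(N)]          # hypothetical z-values
sigma = [rng.uniform(0.5, 1.5) for _ in range(N)]    # hypothetical noise sd's

def mean_square(values):
    """Average of the squared entries."""
    return sum(v * v for v in values) / len(values)

# Average the mean square of z* over B replications to approximate E*.
boot_mean_sq = 0.0
for _ in range(B):
    z_star = [zi + rng.gauss(0.0, si) for zi, si in zip(z, sigma)]
    boot_mean_sq += mean_square(z_star) / B

predicted = mean_square(z) + mean_square(sigma)
print(round(boot_mean_sq, 3), round(predicted, 3))
```

The two printed numbers should nearly agree, and both exceed the mean square of the original z-values: the replicated histogram is wider than the one it was resampled from.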
[Figure: Bootstrap standard deviation for the empirical cdf of the leukemia z-values, compared with Formula X. x-axis: x value; y-axis: sd estimates. Curves: Formula X; Bootstrap.]
[Figure: Permutation and jackknife estimates of sd{empirical cdf}, compared with Formula X. x-axis: x value; y-axis: sd estimates. Series: perm; Formula X; jackknife.]
Formula X
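$$\mathrm{Var}\{F(x)\} \doteq \underbrace{\left\{\frac{F(x)(1-F(x))}{N}\right\}}_{\text{independence}} + \underbrace{\left\{\frac{\sigma_0^2\,\alpha\,f^{(1)}(x)}{\sqrt{2}}\right\}^2}_{\text{correlation penalty}}$$
• $\sigma_0 = 1.68$ from empirical null
• $\alpha = .11$ estimated RMS correlation
• $f^{(1)}(x)$: first derivative of the estimate $f(x)$
• Depends on normality: $z_i \sim N(\mu_i, \sigma_i^2)$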
Formula X for Leukemia Data
x:      1      2      3      4      5
F(x)   .29    .13    .057   .025   .010
sd     .017   .022   .010   .004   .002
sd0    .005   .004   .003   .002   .001
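Formula X is simple to evaluate once F(x) and the density derivative f^(1)(x) are available. The sketch below implements it and checks the one row of the table that can be recomputed from the slides alone: reading sd0 as the independence term sqrt(F(1−F)/N) is an assumption, though the numbers agree to three decimals. The full sd row needs the fitted f^(1)(x), which the slides do not list, so it is not reproduced here.

```python
import math

def formula_x_sd(F, f_prime, N, sigma0, alpha):
    """Standard deviation of the tail proportion F(x) according to Formula X."""
    independence = F * (1.0 - F) / N
    penalty = (sigma0 ** 2 * alpha * f_prime / math.sqrt(2.0)) ** 2
    return math.sqrt(independence + penalty)

N = 7128
F_table = {1: .29, 2: .13, 3: .057, 4: .025, 5: .010}      # F(x) row
sd0_table = {1: .005, 2: .004, 3: .003, 4: .002, 5: .001}  # sd0 row

# With alpha = 0 the correlation penalty vanishes, leaving the binomial sd.
sd0 = {x: formula_x_sd(F, 0.0, N, 1.68, 0.0) for x, F in F_table.items()}
for x in F_table:
    print(x, round(sd0[x], 3), sd0_table[x])
```

Comparing the sd and sd0 rows shows how much of the total standard deviation comes from the correlation penalty, especially near the center of the distribution.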
[Figure: Simulation of sd{Fhat(x)} from Formula X; N = 6000, n = 20 + 20, alpha = .10. Solid curve and bars are the mean and standard deviation of the sdhat values over 100 simulations; dashed curve is the actual sd. y-axis: standard deviation estimates.]
Multi-Class Normal Model
• Suppose the $z_i$'s are in "classes" $C_1, C_2, \dots, C_C$, with $z_i \sim N(\mu_c, \sigma_c^2)$ for $z_i \in C_c$
• $N_c = \#\{C_c\}$, $p_c = N_c/N$ [so $\sum_c N_c = N$, $\sum_c p_c = 1$]
• Correlation distribution $g_{cd}(\rho)$ = empirical density of the $N_c \cdot N_d$ correlations between members of $C_c$ and $C_d$
• Assume the $g_{cd}$ all equal $g(\rho)$, where $g(\rho)$ = empirical density of all $\binom{N}{2}$ correlations
Digression: The Non-Null Distribution of z-Values
• z-value is a test statistic ∼ N(0, 1) under H0
• Theorem: Under reasonable conditions the non-null distribution of $z$ is
$$z \sim N(\mu, \sigma^2) + O_p(1/n), \quad \text{where } \sigma^2 = 1 + O(1/n^{1/2})$$
• Normality degrades more slowly than unit standard deviation
• Helps justify the model $z_i \sim N(\mu_i, \sigma_i^2)$
Student-t z-Values
• $t \sim t_\nu(\delta)$ [noncentral t, noncentrality $\delta$, df = $\nu$]
• $H_0 : \delta = 0$
• $z = \Phi^{-1}(F_\nu(t))$ [$F_\nu$: central t cdf, df = $\nu$], so under $H_0$, $z \sim N(0, 1)$
• What if $\delta \ne 0$?
[Figure: Densities for z = Phiinv(Fnu(t)), t ~ t(del, nu = 20), for del = 0, 1, 2, 3, 4, 5; dotted/dashed lines are the matching N(M, SD) densities. x-axis: z value; y-axis: density.]
The Count Vector y
• Partition the range $\mathcal{Z}$ of $z$ into $K$ bins: $\mathcal{Z} = \bigcup_{k=1}^K \mathcal{Z}_k$
• Each bin of width "$\Delta$"
• Bin centers "$x_k$", $k = 1, 2, \dots, K$
  (Leukemia histogram: $\mathcal{Z} = [-7.9, 7.9]$, $\Delta = .2$, $K = 79$)
• Counts $y_k = \#\{z_i \in \mathcal{Z}_k\}$
• $y = (y_1, y_2, \dots, y_K)'$
• Count vector $y$ is a discretized order statistic of $z$
  (most statistics of interest are of the form $\theta = m(y)$)
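The binning step might look as follows. The defaults mirror the leukemia histogram settings quoted above (range [−7.9, 7.9], Δ = .2, K = 79); the simulated z-values and the function name `count_vector` are stand-ins for illustration, not the real data or the author's code.

```python
import math
import random

def count_vector(z, lo=-7.9, hi=7.9, width=0.2):
    """Bin z-values into K equal-width bins; return (bin centers x, counts y)."""
    K = round((hi - lo) / width)
    centers = [lo + (k + 0.5) * width for k in range(K)]
    counts = [0] * K
    for zi in z:
        k = math.floor((zi - lo) / width)
        if 0 <= k < K:            # values outside [lo, hi) are dropped
            counts[k] += 1
    return centers, counts

rng = random.Random(1)
z_demo = [rng.gauss(0.1, 1.68) for _ in range(7128)]   # hypothetical z-values
x, y = count_vector(z_demo)
print(len(y), round(x[0], 1), round(x[-1], 1), sum(y))
```

Any statistic that depends on z only through the histogram, such as a tail proportion evaluated at a bin edge, can then be written as a function m(y) of these counts.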
Mehler’s Identity (Lancaster, 1958)
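• $\varphi_\rho(u, v)$ = standard normal bivariate density
• Mehler: $\lambda_\rho(u, v) = \dfrac{\varphi_\rho(u, v)}{\varphi(u)\varphi(v)} - 1 = \sum_{j \ge 1} \dfrac{\rho^j}{j!} h_j(u) h_j(v)$, where $h_j$ is the $j$th Hermite polynomial
• Crucial quantity: $\Lambda(u, v) = \int_{-1}^{1} \lambda_\rho(u, v)\, g(\rho)\, d\rho = \sum_{j \ge 1} \dfrac{\alpha_j}{j!} h_j(u) h_j(v)$, where $\alpha_j = \int_{-1}^{1} \rho^j g(\rho)\, d\rho$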
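Mehler's identity is easy to verify numerically. The sketch below compares the closed-form ratio with a truncated series, taking h_j to be the probabilists' Hermite polynomials (h_1(x) = x, h_2(x) = x² − 1); the truncation at 60 terms and the test points are arbitrary choices for this illustration.

```python
import math

def hermite(j_max, x):
    """Probabilists' Hermite polynomials h_0..h_j_max at x (h_{j+1} = x h_j - j h_{j-1})."""
    h = [1.0, x]
    for j in range(1, j_max):
        h.append(x * h[j] - j * h[j - 1])
    return h[: j_max + 1]

def lambda_closed(rho, u, v):
    """phi_rho(u, v) / (phi(u) phi(v)) - 1 for the standard bivariate normal."""
    q = (rho * rho * (u * u + v * v) - 2.0 * rho * u * v) / (2.0 * (1.0 - rho * rho))
    return math.exp(-q) / math.sqrt(1.0 - rho * rho) - 1.0

def lambda_series(rho, u, v, terms=60):
    """Truncated Mehler expansion: sum over j >= 1 of rho^j / j! * h_j(u) h_j(v)."""
    hu, hv = hermite(terms, u), hermite(terms, v)
    return sum(rho ** j / math.factorial(j) * hu[j] * hv[j]
               for j in range(1, terms + 1))

print(lambda_closed(0.3, 1.0, -0.5), lambda_series(0.3, 1.0, -0.5))
```

For small ρ the series is dominated by its first terms, which is what makes the low-order moments α_j of the correlation distribution the quantities that matter.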
Exact Covariance of y
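• $z_i \sim N(\mu_c, \sigma_c^2)$ for $z_i \in C_c$
• $N_c = \#C_c$, $p_c = N_c/N$
Theorem: $\mathrm{cov}(y) = \mathrm{cov}_0 + \mathrm{cov}_1$,
$$\mathrm{cov}_0 = N \sum_c p_c \{\mathrm{diag}(\pi_c) - \pi_c \pi_c'\} \quad [\text{independence}]$$
where $\pi_{ck} = \Pr_c\{z_i \in \text{bin}_k\}$, $\pi_c = (\cdots \pi_{ck} \cdots)'$,
$$\mathrm{cov}_1 = N^2 \sum_c \sum_d p_c p_d B_{cd} - N \sum_c p_c B_{cc} \quad [\text{corr penalty}]$$
and $B_{cd}(k, l) = \pi_{ck}\, \pi_{dl}\, \Lambda\!\left(\dfrac{x_k - \mu_c}{\sigma_c}, \dfrac{x_l - \mu_d}{\sigma_d}\right)$.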
Four Simplifications of cov1
• Drop N term
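• Microarray standardization methods make $\alpha_1 \doteq 0$
• Mehler expansion: $\alpha_2 = \int_{-1}^{1} \rho^2 g(\rho)\, d\rho$ is the lead term
• Higher terms ignorable if $\alpha_2$ small
Simplified Formula (almost Formula X):
Letting $\alpha = \alpha_2^{1/2}$ and $\phi^{(2)}_k = \sum_c p_c\, \varphi^{(2)}\!\left(\dfrac{x_k - \mu_c}{\sigma_c}\right) / \sigma_c$,
$$\mathrm{cov}_1 \doteq (N \Delta \alpha)^2\, \phi^{(2)} \phi^{(2)\prime} / 2 \quad [\text{rms approximation}]$$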
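The diagonal of the rms approximation gives a quick standard deviation for each bin count. The sketch below evaluates it for a two-class normal mixture with (p, μ, σ) = (.95, 0, 1) and (.05, 2.5, 1), N = 6000 and α = .1; the bin width Δ = .2 is borrowed from the leukemia histogram and is an assumption here, as are the function names.

```python
import math
from statistics import NormalDist

PHI = NormalDist()

def phi2(x):
    """Second derivative of the standard normal density: (x^2 - 1) * phi(x)."""
    return (x * x - 1.0) * PHI.pdf(x)

def sd_counts(x, N, alpha, classes, width):
    """Return (sd without penalty, sd with rms penalty) for the bin centered at x."""
    # Independence term: diagonal of cov0 = N * sum_c p_c * pi_ck * (1 - pi_ck).
    var0 = 0.0
    for p, mu, sigma in classes:
        pi_ck = (PHI.cdf((x + width / 2 - mu) / sigma)
                 - PHI.cdf((x - width / 2 - mu) / sigma))
        var0 += N * p * pi_ck * (1.0 - pi_ck)
    # Correlation penalty: diagonal of (N * width * alpha)^2 * phi2 phi2' / 2.
    phi2_k = sum(p * phi2((x - mu) / sigma) / sigma for p, mu, sigma in classes)
    var1 = (N * width * alpha * phi2_k) ** 2 / 2.0
    return math.sqrt(var0), math.sqrt(var0 + var1)

classes = [(0.95, 0.0, 1.0), (0.05, 2.5, 1.0)]
for x in (0.0, 1.0, 2.0):
    sd_indep, sd_rms = sd_counts(x, N=6000, alpha=0.1, classes=classes, width=0.2)
    print(x, round(sd_indep, 2), round(sd_rms, 2))
```

The penalty is largest at the center of the null distribution and almost vanishes near x = ±1, where the second derivative of the null density crosses zero.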
Numerical Comparison
• N = 6000, α = .1
• Two classes: $(p_c, \mu_c, \sigma_c) = (.95, 0, 1)$ and $(.05, 2.5, 1)$
• Next figure compares the standard deviations (square roots of the diagonal elements) of the exact cov(y) and the rms approximation
[Figure: sd{y[k]} from the exact formula (solid) compared with the rms approximation (dashed); N = 6000, alpha = .1, (p0, mu0, sig0) = (.95, 0, 1) and (.05, 2.5, 1). Dashes show bin centers x[k]. x-axis: z value; y-axis: standard deviation. Curves: sd{y[k]} exact; without corr penalty; rms approximation.]
[Figure: Same numerical example, now sd{Fhat[k]}, where Fhat[k] = sum(y[l] for l >= k)/N. x-axis: z value; y-axis: sd{Fhat}. Curves: exact; without corr penalty; rms approx.]
Estimation of RMS Correlation α
• $\rho_{ii'}$ = empirical correlation between rows $i$ and $i'$ of $X$, the $N \times n$ expression matrix
• $\{\rho_{ii'}\}$ has mean and variance $(m, v)$ [leukemia: $(.00, .19^2)$]
$$\alpha^2 = \frac{n}{n-1}\left(v - \frac{1}{n-1}\right)$$
      ALL    AML    Both
α:    .121   .109   .114
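The estimate of α is a one-line bias correction: the 1/(n − 1) term is roughly the variance that sample correlations show from noise alone when the true correlation is zero. The sketch below applies the formula to made-up inputs; truncating a negative corrected value at zero is an added assumption, not something stated on the slide.

```python
import math

def rms_correlation(v, n):
    """Bias-corrected rms correlation: alpha^2 = n/(n-1) * (v - 1/(n-1))."""
    alpha_sq = n / (n - 1) * (v - 1.0 / (n - 1))
    # Sampling noise can push the corrected value below zero; truncate at 0 (assumption).
    return math.sqrt(max(alpha_sq, 0.0))

# Hypothetical inputs: variance of the observed correlations v, number of columns n.
print(rms_correlation(v=0.05, n=40))
print(rms_correlation(v=1.0 / 39, n=40))   # pure-noise level gives alpha = 0
```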
More General Accuracy Estimates
• "$Q$": $q$-dimensional statistic of interest, $Q = Q(y)$
• Influence function: $dQ = D\, dy$ [$D_{jk} = \partial Q_j / \partial y_k$]
$$\mathrm{cov}(Q) = D\, \mathrm{cov}(y)\, D'$$
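For a statistic that is linear in y the influence-function formula is exact, which makes it a convenient check. The sketch below takes Q to be the vector of tail proportions F[k] = sum(y[l] for l >= k)/N, uses the multinomial covariance for independent cases as cov(y), and recovers the familiar F(1 − F)/N variances. The bin probabilities are hypothetical.

```python
def matmul(A, B):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    """Transpose a list-of-rows matrix."""
    return [list(col) for col in zip(*A)]

N = 1000
pi = [0.5, 0.3, 0.15, 0.05]            # hypothetical bin probabilities
K = len(pi)

# cov(y) for independent cases (multinomial): N * (diag(pi) - pi pi').
cov_y = [[N * ((pi[k] if k == l else 0.0) - pi[k] * pi[l]) for l in range(K)]
         for k in range(K)]

# Q_k = F[k] is linear in y, so D[k][l] = 1/N for l >= k and 0 otherwise.
D = [[1.0 / N if l >= k else 0.0 for l in range(K)] for k in range(K)]

cov_Q = matmul(matmul(D, cov_y), transpose(D))
tail = [sum(pi[k:]) for k in range(K)]
for k in range(K):
    print(k, cov_Q[k][k], tail[k] * (1.0 - tail[k]) / N)
```

Replacing `cov_y` with a covariance that includes a correlation penalty would propagate that penalty to the tail proportions through the same matrix D.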
Example: Accuracy of f
• $z \to y \to f$ by Poisson GLM of counts $y_k$ on a polynomial in $x_k$
• $Q = \log(f) = (\dots \log f(x_k) \dots)'$
• $D = M\,[M' \mathrm{diag}(f)\, M]^{-1} M' / (N\Delta)$, with $M$ the GLM structure matrix
Local False Discovery Rate
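• $p_0$ = prior Pr{null}, with $z \sim f_0(z)$; $p_1$ = prior Pr{non-null}, with $z \sim f_1(z)$
• Mixture: $f(z) = p_0 f_0(z) + p_1 f_1(z)$
• Estimated local false discovery rate: $\mathrm{fdr}(z) = \Pr\{\text{null} \mid z\} = p_0 f_0(z) / f(z)$
• $\mathrm{cov}\{\log \mathrm{fdr}\} \doteq \mathrm{cov}\{\log f\}$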
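The local false discovery rate is straightforward to evaluate when the mixture components are known. The sketch below uses the two-class normal model (.95, 0, 1) and (.05, 2.5, 1) with the true densities; in practice f(z) is estimated from the data, so estimated values will differ from these.

```python
from statistics import NormalDist

def local_fdr(z, p0=0.95, null=(0.0, 1.0), alt=(2.5, 1.0)):
    """fdr(z) = p0 f0(z) / f(z) for a two-component normal mixture."""
    f0 = NormalDist(*null).pdf(z)
    f1 = NormalDist(*alt).pdf(z)
    f = p0 * f0 + (1.0 - p0) * f1
    return p0 * f0 / f

for z in (2.0, 2.5, 3.0, 3.5):
    print(z, round(local_fdr(z), 3))
```

Since p0 and f0 are treated as fixed, log fdr(z) differs from −log f(z) only by a known term, which is why the covariance of log fdr is approximately that of log f.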
[Figure: sd{log fdrhat(z)}; N = 6000; alpha = 0, .1, and .2; (p0, mu, sig) = (.95, 0, 1) and (.05, 2.5, 1). Stars are sd's for N = 1500, alpha = .1; numbers are fdrhat[z]: 0.69, 0.58, 0.44, 0.25, 0.09, 0.03. x-axis: z value; y-axis: sd.]
[Figure: sd's for log{fdrhat} and log{Fdrhat} compared, alpha = .1. Numbers are Fdr[z]: 0.34, 0.26, 0.18, 0.1, 0.04, 0.01. x-axis: z value; y-axis: sd. Curves: sdlogfdr; sdlogFdr; sdlogFdrnon.]
Poisson Bootstrap
• Null case: all $z_i \sim N(0, 1)$
• Let $A \sim N(0, \alpha^2)$ and $w = N \pi_0 h_2 / \sqrt{2}$
• Hierarchical Poisson resampling: $u = N\pi_0 + A w$ and $y_k \overset{\text{ind}}{\sim} \mathrm{Poi}(u_k)$, $k = 1, 2, \dots, K$
• i.i.d. case if α = 0
• like rms formula if α = 0
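A minimal sketch of the hierarchical Poisson resampling, under several assumptions: π0 is read as the vector of null bin probabilities, h2 as the second Hermite polynomial x² − 1 evaluated at the bin centers, the product in w as elementwise, and negative Poisson means as truncated at zero. The values N = 2000, Δ = .2 and the bin range are arbitrary illustration choices.

```python
import math
import random
from statistics import NormalDist, pvariance

def poisson(mean, rng):
    """Draw a Poisson variate by counting unit-rate exponential arrivals before `mean`."""
    count, t = 0, rng.expovariate(1.0)
    while t < mean:
        count += 1
        t += rng.expovariate(1.0)
    return count

def poisson_bootstrap(N, alpha, centers, width, rng):
    """One hierarchical draw: A ~ N(0, alpha^2), u = N pi0 + A w, y_k ~ Poi(u_k)."""
    phi = NormalDist()
    A = rng.gauss(0.0, alpha)
    y = []
    for x in centers:
        pi0 = phi.cdf(x + width / 2) - phi.cdf(x - width / 2)
        w = N * pi0 * (x * x - 1.0) / math.sqrt(2.0)   # h_2(x) = x^2 - 1
        u = max(N * pi0 + A * w, 0.0)                  # truncation at 0 is an assumption
        y.append(poisson(u, rng))
    return y

rng = random.Random(0)
N, alpha, width = 2000, 0.1, 0.2
centers = [0.2 * k for k in range(-10, 11)]
mid = centers.index(0.0)
draws = [poisson_bootstrap(N, alpha, centers, width, rng)[mid] for _ in range(600)]

phi = NormalDist()
indep_var = N * (phi.cdf(width / 2) - phi.cdf(-width / 2))   # Poisson variance N * pi0
print(round(pvariance(draws), 1), round(indep_var, 1))
```

Under this reading the marginal variance of y_k is N π0_k + α² w_k², so with α = .1 the simulated variance of the central bin count comes out well above the pure Poisson value.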
References
Efron, B. (2007a). Correlation and large-scale simultaneous
significance testing. J. Amer. Statist. Assoc. 102: 93–103.
Efron, B. (2007b). Size, power and false discovery rates. Ann.
Statist. 35: 1351–1377.
Efron, B. (2009). Correlated z-values and the accuracy of large-scale
statistical estimates. J. Amer. Statist. Assoc. To appear
(http://stat.stanford.edu/~brad/papers).
Golub, T. R., Slonim, D. K., Tamayo, P. et al. (1999).
Molecular classification of cancer: Class discovery and class
prediction by gene expression monitoring. Science 286:
531–537 [the leukemia data].
Lancaster, H. O. (1958). The structure of bivariate distributions.
Ann. Math. Statist. 29: 719–736.
Owen, A. B. (2005). Variance of the number of false discoveries.
J. R. Stat. Soc. Ser. B Stat. Methodol. 67: 411–426.