Regression Analysis
Ching-Kang Ing (銀慶剛)
Institute of Statistics, National Tsing Hua University, Hsinchu, Taiwan
Outline I
1 Finite Sample Theory
  Regression Models
  Analysis of Variance (ANOVA)
  Projection Matrices
  Estimation
  Multivariate Normal Distributions
  Gaussian Regression
  Interval Estimation
  Another look at β̂
  Model Selection
  Prediction
2 Large Sample Theory
  Motivation
  Toward Large Sample Theory I
  Toward Large Sample Theory II
  Toward Large Sample Theory III
Outline II
3 Appendix
  Statistical View of Spectral Decomposition
  Limit Theorems
    Continuous Mapping Theorem
    Slutsky's Theorem
    Central Limit Theorem
    Convergence in the rth Mean
    Some Inequalities
    Weak Law of Large Numbers
  Delta Method
  Two-Sample t-Test
  Pearson's Chi-Squared Test
Finite Sample Theory
Regression Models
Consider the following linear regression model:
$$y_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_k x_{ik} + \varepsilon_i, \quad i = 1, \ldots, n,$$
where the εᵢ are i.i.d. r.v.s with E(ε₁) = 0 and E(ε₁²) = Var(ε₁) = σ² > 0. Define f(β) = ‖y − Xβ‖², where
$$X = \begin{pmatrix} 1 & x_{11} & \cdots & x_{1k} \\ \vdots & \vdots & & \vdots \\ 1 & x_{n1} & \cdots & x_{nk} \end{pmatrix} \quad \text{and} \quad y = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}.$$
By solving the equation
$$\frac{\partial f(\beta)}{\partial \beta} = 0,$$
we obtain X⊤Xβ = X⊤y, and hence
$$(\hat\beta_0, \ldots, \hat\beta_k)^\top \equiv \hat\beta = (X^\top X)^{-1}X^\top y.$$
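As a quick illustration (not part of the original slides), here is a minimal numpy sketch of the least-squares solution just derived; the simulated data and names such as `beta_true` are purely hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 100, 2
beta_true = np.array([1.0, 2.0, -0.5])        # (beta_0, beta_1, beta_2)

X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])  # design matrix
y = X @ beta_true + rng.normal(size=n)

# Solve the normal equations X'X beta = X'y; lstsq is numerically
# safer than forming (X'X)^{-1} explicitly.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)                                # close to beta_true
```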
Analysis of Variance (ANOVA)
Define
$$\mathrm{SST} = \sum_{i=1}^n (y_i - \bar y)^2,$$
$$\mathrm{SSRes} = \sum_{i=1}^n (y_i - \hat\beta_0 - \hat\beta_1 x_{i1} - \cdots - \hat\beta_k x_{ik})^2 = \sum_{i=1}^n (y_i - \hat\beta^\top x_i)^2,$$
$$\mathrm{SSReg} = \sum_{i=1}^n (\hat\beta_0 + \hat\beta_1 x_{i1} + \cdots + \hat\beta_k x_{ik} - \bar y)^2 = \sum_{i=1}^n (\hat\beta^\top x_i - \bar y)^2,$$
where $\bar y = n^{-1}\sum_{i=1}^n y_i$ and xᵢ = (1, x_{i1}, …, x_{ik})⊤. Then we have
$$\mathrm{SST} = \mathrm{SSReg} + \mathrm{SSRes}.$$
It is not difficult to see (why?) that
$$\mathrm{SST} = y^\top(I - M_0)y, \quad \text{where } M_0 = \frac{E}{n} = \frac{\mathbf{1}\mathbf{1}^\top}{n} \text{ with } \mathbf{1} = (1, \ldots, 1)^\top,$$
$$\mathrm{SSRes} = y^\top(I - M_k)y, \quad \text{where } M_k = X(X^\top X)^{-1}X^\top,$$
and
$$\mathrm{SSReg} = y^\top(M_k - M_0)y.$$
[Note that
$$\begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix} - \begin{pmatrix} x_1^\top\hat\beta \\ \vdots \\ x_n^\top\hat\beta \end{pmatrix} = y - X\hat\beta = y - X(X^\top X)^{-1}X^\top y = (I - M_k)y,$$
and Mₖ = Mₖ², (I − Mₖ)² = I − Mₖ.]
Therefore, ANOVA is nothing but
$$y^\top(I - M_0)y = y^\top(M_k - M_0)y + y^\top(I - M_k)y.$$
Actually, ANOVA is a Pythagorean equality (figure omitted), in which C(X) = {Xa : a ∈ R^{k+1}} is called the column space of X.
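A short numerical check of the Pythagorean equality above (an added sketch, not from the slides; the simulated data are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = X @ rng.normal(size=k + 1) + rng.normal(size=n)

M0 = np.ones((n, n)) / n                        # projection onto C(1)
Mk = X @ np.linalg.solve(X.T @ X, X.T)          # projection onto C(X)
I = np.eye(n)

SST = y @ (I - M0) @ y
SSReg = y @ (Mk - M0) @ y
SSRes = y @ (I - Mk) @ y
print(np.isclose(SST, SSReg + SSRes))           # True: SST = SSReg + SSRes
```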
Another look at SST = SSReg + SSRes
Assume
$$y_i = x_i^\top\beta + \varepsilon_i, \quad i = 1, \ldots, n,$$
where E(εᵢ) = 0, Var(εᵢ) = σ², the (xᵢ, εᵢ) are i.i.d., and E(εᵢ | xᵢ) = 0 for all i. Note that we consider the case of "random regressors" instead of fixed ones. Here are some observations:
(i) E(yᵢ) = E(xᵢ⊤β) are the same for all i.
(ii) Var(yᵢ) are the same for all i.
(iii) E(yᵢ | xᵢ) = E(xᵢ⊤β + εᵢ | xᵢ) = xᵢ⊤β.
(iv)
$$\begin{aligned} \mathrm{Var}(y_i) &= \mathrm{Var}(E(y_i \mid x_i)) + E(\mathrm{Var}(y_i \mid x_i)) \\ &= \mathrm{Var}(x_i^\top\beta) + E\{E[(y_i - x_i^\top\beta)^2 \mid x_i]\} \\ &= \mathrm{Var}(x_i^\top\beta) + \mathrm{Var}(\varepsilon_i). \end{aligned}$$
(v) Var(yᵢ) = Var(y₁) can be estimated by
$$\frac{1}{n}\sum_{i=1}^n (y_i - \bar y)^2 =: \widehat{\mathrm{Var}}(y_i).$$
Var(xᵢ⊤β) = E(xᵢ⊤β − E(xᵢ⊤β))² = E(xᵢ⊤β − E(yᵢ))² can be estimated by
$$\frac{1}{n}\sum_{i=1}^n (x_i^\top\hat\beta - \bar y)^2 =: \widehat{\mathrm{Var}}(x_i^\top\beta).$$
Var(εᵢ) can be estimated by
$$\frac{1}{n}\sum_{i=1}^n (y_i - x_i^\top\hat\beta)^2 =: \widehat{\mathrm{Var}}(\varepsilon_i).$$
(vi) Therefore, SST = SSReg + SSRes is nothing but
$$\widehat{\mathrm{Var}}(y_i) = \widehat{\mathrm{Var}}(x_i^\top\beta) + \widehat{\mathrm{Var}}(\varepsilon_i).$$
Projection Matrices
Let
$$X = \begin{pmatrix} x_{11} & \cdots & x_{1r} \\ \vdots & & \vdots \\ x_{n1} & \cdots & x_{nr} \end{pmatrix} = [X_1, \ldots, X_r]$$
be an n × r matrix. The column space of X, C(X), is defined as
$$C(X) = \{Xa : a = (a_1, \ldots, a_r)^\top \in \mathbb{R}^r\},$$
noting that Xa = a₁X₁ + ⋯ + a_rX_r.
Definition
An n × n matrix M is called an orthogonal projection matrix onto C(X) if and only if
1 for v ∈ C(X), Mv = v,
2 for w ∈ C⊥(X), Mw = 0, where C⊥(X) = {s : v⊤s = 0 for all v ∈ C(X)}.
Fact 1
C(M) = C(X).
Proof of Fact 1
Let v ∈ C(X). Then
v = Xb = MXb ∈ C(M), (why?)
for some b.
Let v ∈ C(M). Then
v = Ma = M(a1 + a2) = a1 ∈ C(X),
for some a, and some a1 ∈ C(X),a2 ∈ C⊥(X). This completes the proof.
Fact 2
M⊤ = M (symmetric) and M² = M (idempotent) if and only if M is an orthogonal projection matrix onto C(M).
Proof of Fact 2
(⇒) For v ∈ C(M), Mv = MMb = Mb = v for some b, by idempotency.
For w ∈ C⊥(M), Mw = M⊤w = 0 by symmetry. (why?)
(⇐) Define eᵢ = (0, …, 0, 1, 0, …, 0)⊤, whose i-th component is 1 and whose other components are 0. It suffices to show that for any eᵢ, eⱼ,
$$e_i^\top M^\top(I - M)e_j = 0. \quad \text{(why?)}$$
Since we can decompose eᵢ and eⱼ as $e_i = e_i^{(1)} + e_i^{(2)}$ and $e_j = e_j^{(1)} + e_j^{(2)}$, where $e_i^{(1)}, e_j^{(1)} \in C(M)$ and $e_i^{(2)}, e_j^{(2)} \in C^\perp(M)$,
$$e_i^\top M^\top(I - M)e_j = e_i^\top M^\top(I - M)(e_j^{(1)} + e_j^{(2)}) \overset{\text{why?}}{=} e_i^\top M^\top e_j^{(2)} \overset{\text{why?}}{=} e_i^{(1)\top} e_j^{(2)} = 0.$$
This completes the proof.
Fact 3
Orthogonal projection matrices are unique.
Proof of Fact 3
Let M and P be orthogonal projection matrices onto some space S ⊆ Rⁿ.
Then, for any v ∈ Rⁿ, v = v₁ + v₂, where v₁ ∈ S and v₂ ∈ S⊥.
The desired conclusion follows from
$$(M - P)v = (M - P)(v_1 + v_2) = (M - P)v_1 = 0.$$
Fact 4
Let o₁, …, o_r be an orthonormal basis of C(X), i.e.,
$$o_i^\top o_j = \begin{cases} 0, & \text{if } i \ne j, \\ 1, & \text{if } i = j, \end{cases}$$
and for any v ∈ C(X), v = Ob for some b ∈ R^r, where O = [o₁, …, o_r]. Then OO⊤ = Σ_{i=1}^r oᵢoᵢ⊤ is the orthogonal projection matrix onto C(X).
Proof of Fact 4
Since OO⊤ is symmetric and OO⊤OO⊤ = OO⊤, where O⊤O = I_r, the r-dimensional identity matrix, by Fact 2, OO⊤ is the orthogonal projection matrix onto C(OO⊤).
Moreover, for v ∈ C(X), we have
v = Ob = OO⊤Ob ∈ C(OO⊤)
for some b ∈ R^r.
In addition, C(OO⊤) ⊆ C(O) = C(X). The desired conclusion follows.
Remark
One can also prove the result by showing
(i) for v ∈ C(X), OO⊤v = OO⊤Ob = Ob = v, and
(ii) for w ∈ C⊥(X), OO⊤w = 0 (the n-dimensional vector of zeros).
The two proofs differ as follows: the first invokes Fact 2 to conclude that OO⊤ is the orthogonal projection matrix onto C(OO⊤), and then infers from the structure of C(OO⊤) that it coincides with C(X); the second directly guesses that OO⊤ is the orthogonal projection matrix onto C(X). The former argument is more roundabout but involves less "guessing"; the latter is the opposite.
Given a matrix X, how do we construct the orthogonal projection matrix for C(X)?
Gram-Schmidt process
Let X = [x₁, …, x_q] for some q ≥ 1. Define y₁ = x₁/‖x₁‖, where ‖x₁‖² = x₁⊤x₁, and then
$$w_2 = x_2 - (x_2^\top y_1)y_1, \qquad y_2 = w_2/\|w_2\|,$$
$$\vdots$$
$$w_s = x_s - \sum_{i=1}^{s-1}(x_s^\top y_i)y_i, \qquad y_s = w_s/\|w_s\|, \quad 2 \le s \le q.$$
If the rank of C(X) is r, 1 ≤ r ≤ q, then there are r non-zero yᵢ, denoted by y_{s₁}, …, y_{s_r}, and Y = (y_{s₁}, …, y_{s_r}) is an orthonormal basis of C(X).
YY⊤ is the orthogonal projection matrix onto C(X) (by Fact 4).
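The Gram-Schmidt construction translates directly into code. Below is a minimal sketch (not from the slides; the function name and the toy rank-deficient matrix are illustrative): it orthonormalizes the columns, skips the directions where w_s = 0, and returns the projection matrix YY⊤.

```python
import numpy as np

def gram_schmidt_projection(X, tol=1e-10):
    """Return (Y, Y @ Y.T): an orthonormal basis of C(X) and the
    orthogonal projection matrix onto C(X)."""
    basis = []
    for x in X.T:
        w = x.astype(float)
        for y in basis:                 # subtract projections on earlier y_i
            w = w - (x @ y) * y
        norm = np.linalg.norm(w)
        if norm > tol:                  # keep only the r non-zero directions
            basis.append(w / norm)
    Y = np.column_stack(basis)
    return Y, Y @ Y.T

# Rank-deficient example: the third column is the sum of the first two.
X = np.array([[1., 0.], [0., 1.], [1., 1.], [2., -1.]])
X = np.column_stack([X, X.sum(axis=1)])
Y, P = gram_schmidt_projection(X)
print(Y.shape[1])                       # rank r = 2
print(np.allclose(P @ X, X))            # P fixes every column of X
```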
Explanation of Rank
Explaining "the rank of C(X)":
Let J be a subset of {1, …, q} satisfying
(i) {xᵢ, i ∈ J} is linearly independent, i.e., Σ_{i∈J} aᵢxᵢ = 0 if and only if aᵢ = 0 for all i ∈ J;
(ii) for any J₁ ⊇ J with J₁ − J ≠ ∅, {xᵢ, i ∈ J₁} is not linearly independent.
The "rank of C(X)" is defined as #(J), the number of elements in J.
Moreover, if r(X) = q (i.e., the rank of C(X) is q), then X(X⊤X)⁻¹X⊤ is the orthogonal projection matrix onto C(X).
Proof
(i) X(X⊤X)⁻¹X⊤ is symmetric and idempotent.
(ii) C(X(X⊤X)⁻¹X⊤) = C(X). (why?)
If 1 ≤ r(X) < q, then X(X⊤X)⁻X⊤ is the orthogonal projection matrix onto C(X), where A⁻ denotes a generalized inverse (g-inverse) of A, defined as any matrix G such that AGA = A.
Note that
(X⊤X)⁻ = (X⊤X)⁻¹ if r(X) = q,
and there are infinitely many (X⊤X)⁻ if r(X) < q.
But in either case, X(X⊤X)⁻X⊤ is unique, according to Fact 3.
We now go back to regression problems and summarize the key features of M₀ = n⁻¹11⊤, Mₖ = X(X⊤X)⁻¹X⊤, (I − M₀), (I − Mₖ), and Mₖ − M₀, where
$$X = \begin{pmatrix} 1 & x_{11} & \cdots & x_{1k} \\ \vdots & \vdots & & \vdots \\ 1 & x_{n1} & \cdots & x_{nk} \end{pmatrix}.$$
(i) M0 is the orthogonal projection matrix onto C(1).
(ii) Mk is the orthogonal projection matrix onto C(X).
(iii) (I −M0) is the orthogonal projection matrix onto C⊥(1).
(iv) (I −Mk) is the orthogonal projection matrix onto C⊥(X).
(v) Mₖ − M₀ is the orthogonal projection matrix onto C((I − M₀)X), where
$$C((I - M_0)X) \overset{\text{why?}}{=} C\left(\begin{pmatrix} x_{11} - \bar x_1 \\ \vdots \\ x_{n1} - \bar x_1 \end{pmatrix}, \ldots, \begin{pmatrix} x_{1k} - \bar x_k \\ \vdots \\ x_{nk} - \bar x_k \end{pmatrix}\right),$$
with $\bar x_i = n^{-1}\sum_{j=1}^n x_{ji}$.
(vi)
$$M_0M_k = M_0 = M_kM_0, \quad (I - M_0)M_0 = 0, \quad (I - M_k)M_k = 0, \quad (I - M_k)M_0 = 0,$$
where 0 is the n × n matrix of zeros.
Estimation
Does β̂ possess any optimal properties?
E(β̂) = β since
$$\begin{aligned} E(\hat\beta) &= E((X^\top X)^{-1}X^\top y) = E\{(X^\top X)^{-1}X^\top(X\beta + \varepsilon)\} \\ &= \beta + E((X^\top X)^{-1}X^\top\varepsilon) = \beta + (X^\top X)^{-1}X^\top E(\varepsilon) \\ &= \beta + (X^\top X)^{-1}X^\top 0 = \beta. \end{aligned}$$
Var(β̂) = (X⊤X)⁻¹σ² because
$$\begin{aligned} \mathrm{Var}(\hat\beta) &= E((\hat\beta - \beta)(\hat\beta - \beta)^\top) = E\{(X^\top X)^{-1}X^\top\varepsilon\varepsilon^\top X(X^\top X)^{-1}\} \\ &= (X^\top X)^{-1}X^\top E(\varepsilon\varepsilon^\top)X(X^\top X)^{-1} = \sigma^2(X^\top X)^{-1}, \end{aligned}$$
noting that we have used E(εε⊤) = σ²I.
Gauss-Markov Theorem
For any β̃ = Ay satisfying
$$\beta = E(\tilde\beta) = E(Ay) = E(A(X\beta + \varepsilon)) = AX\beta \quad \text{for all } \beta,$$
we have Var(β̂) ≤ Var(β̃) in the sense that Var(β̃) − Var(β̂) is non-negative definite, i.e., for any ‖a‖ = 1,
$$a^\top\{\mathrm{Var}(\tilde\beta) - \mathrm{Var}(\hat\beta)\}a \ge 0. \quad (*)$$
Remark
(i) Ay is called a linear estimator of β.
(ii) β̃ is unbiased (since we assume E(β̃) = β for all β).
(iii) This theorem says that β̂ is the best linear unbiased estimator (BLUE) of β.
(iv) (∗) is equivalent to Var(a⊤β̃) ≥ Var(a⊤β̂) (why?), meaning that the variance of a⊤β̃ is never smaller than that of a⊤β̂, regardless of the direction vector a onto which β̃ and β̂ project.
Proof of Gauss-Markov Theorem
Let a ∈ R^{k+1} be arbitrarily chosen. Then,
$$\begin{aligned} \mathrm{Var}(a^\top\tilde\beta) &= E[a^\top(\tilde\beta - \beta)]^2 \quad (\text{since } \tilde\beta \text{ is unbiased}) \\ &= E(a^\top(\tilde\beta - \hat\beta) + a^\top(\hat\beta - \beta))^2 \\ &\ge \mathrm{Var}(a^\top\hat\beta) + 2E\{a^\top(\tilde\beta - \hat\beta)(\hat\beta - \beta)^\top a\} \quad (\text{since } \hat\beta \text{ is unbiased}) \\ &\overset{\text{why?}}{=} \mathrm{Var}(a^\top\hat\beta) + 2a^\top E\big((A - (X^\top X)^{-1}X^\top)\varepsilon\varepsilon^\top X(X^\top X)^{-1}\big)a \\ &\overset{\text{why?}}{=} \mathrm{Var}(a^\top\hat\beta) + 2\sigma^2a^\top(A - (X^\top X)^{-1}X^\top)X(X^\top X)^{-1}a \\ &\overset{\text{why?}}{=} \mathrm{Var}(a^\top\hat\beta) + 2\sigma^2a^\top[(X^\top X)^{-1}a - (X^\top X)^{-1}a] \\ &= \mathrm{Var}(a^\top\hat\beta). \end{aligned}$$
How do we estimate σ²?
$$\hat\sigma^2 = \frac{1}{n-(k+1)}\sum_{i=1}^n (y_i - \hat\beta_0 - \hat\beta_1 x_{i1} - \cdots - \hat\beta_k x_{ik})^2 = \frac{1}{n-(k+1)}\sum_{i=1}^n (y_i - x_i^\top\hat\beta)^2 = \frac{1}{n-(k+1)}y^\top(I - M_k)y.$$
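A one-function sketch of this unbiased estimator (added for illustration; `sigma2_hat` is a hypothetical name, and the residuals are computed via least squares):

```python
import numpy as np

def sigma2_hat(X, y):
    """Unbiased variance estimate y'(I - M_k) y / (n - (k + 1))."""
    n, p = X.shape                      # p = k + 1 columns (with intercept)
    resid = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
    return resid @ resid / (n - p)
```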
Why "k + 1"? Because "k + 1" makes σ̂² unbiased, namely E(σ̂²) = σ².
To see this, we have
$$E(\hat\sigma^2) = \frac{1}{n-(k+1)}E(y^\top(I - M_k)y) \overset{\text{why?}}{=} \frac{1}{n-(k+1)}E(\varepsilon^\top(I - M_k)\varepsilon) \overset{\text{why?}}{=} \frac{\sigma^2}{n-(k+1)}\mathrm{tr}(I - M_k) \overset{\text{why?}}{=} \sigma^2,$$
where ε = (ε₁, …, ε_n)⊤.
Reasons for the second "why": Define μ = E(z) and V = Cov(z) = E[(z − μ)(z − μ)⊤]. Then
$$E(z^\top Az) = \mu^\top A\mu + \mathrm{tr}(AV).$$
Since ε⊤(I − Mₖ)ε is a scalar,
$$E(\varepsilon^\top(I - M_k)\varepsilon) = E(\mathrm{tr}(\varepsilon^\top(I - M_k)\varepsilon)) = \mathrm{tr}[E\{(I - M_k)\varepsilon\varepsilon^\top\}] = \mathrm{tr}(I - M_k)\sigma^2.$$
Some facts about the trace operator
1. tr(A) := Σ_{i=1}^n A_{ii}, where A = [A_{ij}]_{1≤i,j≤n}.
2. tr(AB) = tr(BA) and tr(Σ_{i=1}^k Aᵢ) = Σ_{i=1}^k tr(Aᵢ).
3. tr(Mₖ) = tr(X(X⊤X)⁻¹X⊤) = tr((X⊤X)⁻¹X⊤X) = tr(I_{k+1}) = k + 1, where I_{k+1} is the (k+1)-dimensional identity matrix.
4. tr(Mₖ) = tr(Σ_{i=1}^{k+1} oᵢoᵢ⊤) = Σ_{i=1}^{k+1} tr(oᵢoᵢ⊤) = Σ_{i=1}^{k+1} tr(oᵢ⊤oᵢ) = k + 1, where {o₁, …, o_{k+1}} is an orthonormal basis for C(X).
5. Similarly, we have tr(I − Mₖ) = n − k − 1 and tr(I − M₀) = n − 1.
Multivariate Normal Distributions
Definition
We say z has an r-dimensional multivariate normal distribution with mean E(z) = μ and variance E((z − μ)(z − μ)⊤) = Σ > 0 (i.e., a⊤Σa > 0 for all a ∈ R^r with ‖a‖ = 1), denoted by N(μ, Σ), if there exist a k-dimensional standard normal vector ε = (ε₁, …, ε_k)⊤, k ≥ r (i.e., ε₁, …, ε_k are i.i.d. N(0, 1) random variables), and an r × k nonrandom matrix A of full row rank satisfying AA⊤ = Σ such that
z ∼ Aε + μ,
where ∼ means both sides of the notation have the same distribution.
If there exists a ∈ R^r such that a⊤Σa = 0, then E(a⊤(z − μ))² = 0 (why?).
This yields P(a⊤(z − μ) = 0) = 1 because E(a⊤(z − μ))² = 0 implies E(a⊤(z − μ)) = 0 and Var(a⊤z) = 0.
Therefore, with probability 1, one zᵢ is a linear combination of the other zⱼ's.
Why does E(X) = 0 imply P(X = 0) = 1 for non-negative X?
Fact
Let X be a non-negative r.v., i.e., P(X ≥ 0) = 1. Then E(X) = 0 implies P(X = 0) = 1.
Proof of Fact
Suppose P(X = 0) < 1. Then P(X > 0) > 0, and hence there exists some δ > 0 such that P(X > 0) > δ (why?).
Since P(X > 0) = P(⋃_{n=1}^∞ {X > n⁻¹}) = lim_{n→∞} P(X > n⁻¹) (why?), it follows that
P(X > M⁻¹) > δ/2 for some large integer M. (why?) (∗)
Now, (∗) yields
$$E(X) \overset{\text{why?}}{\ge} E(XI_{\{X > M^{-1}\}}) \overset{\text{why?}}{\ge} M^{-1}P(X > M^{-1}) \ge \delta/(2M) > 0,$$
which gives a contradiction. Thus, the proof is complete.
Remark
1. A with rows a₁⊤, …, a_r⊤ is said to have full row rank if a₁, …, a_r are linearly independent.
2. A is not unique, since for any P with P⊤P = PP⊤ = I_k, we have AA⊤ = APP⊤A⊤ = Σ.
3. If z ∼ N(μ, Σ), then for any B of full row rank, Bz ∼ N(Bμ, BΣB⊤).
4. If r = 2, then z is said to be bivariate normal.
5. Let z = (z₁, z₂)⊤ be a two-dimensional random vector that fulfills
z₁ ∼ N(0, 1), z₂ ∼ N(0, 1), and E(z₁z₂) = 0.
It is possible that z is not bivariate normal.
Fact 1
If z ∼ N(μ, Σ), then the joint probability density function (pdf) of z, f(z), is given by
$$f(z) = (2\pi)^{-r/2}(\det(\Sigma))^{-1/2}\exp\left\{-\frac{(z-\mu)^\top\Sigma^{-1}(z-\mu)}{2}\right\}.$$
Proof of Fact 1
By definition, z ∼ Aε + μ, where ε ∼ N(0, I_k), k ≥ r, and A is an r × k matrix of full row rank.
Let b₁, …, b_{k−r} satisfy
$$b_i^\top b_j = \begin{cases} 1, & i = j; \\ 0, & i \ne j, \end{cases}$$
and bᵢ⊤aⱼ = 0 for all 1 ≤ i ≤ k − r, 1 ≤ j ≤ r.
Regression Analysis
Finite Sample Theory
Multivariate Normal Distributions
Proof of Fact 1 (cont.)
Define
$$A^* = \begin{pmatrix} A \\ B \end{pmatrix} \equiv \begin{pmatrix} A \\ b_1^\top \\ \vdots \\ b_{k-r}^\top \end{pmatrix} \quad \text{and} \quad z^* = \begin{pmatrix} z \\ w \end{pmatrix} = A^*\varepsilon + \mu^*,$$
where μ* = (μ⊤, 0, …, 0)⊤.
Then, the joint pdf of z* is given by
$$f^*(z^*) = (2\pi)^{-k/2}\exp\left\{-\frac{(z^*-\mu^*)^\top(A^{*\top})^{-1}(A^*)^{-1}(z^*-\mu^*)}{2}\right\}\left|\det\big(A^{*-1}\big)\right|.$$
Proof of Fact 1 (cont.)
Note that here we have used the following facts:
(i) The joint pdf of ε is
$$(2\pi)^{-k/2}\exp\left\{-\frac{\varepsilon^\top\varepsilon}{2}\right\} = \prod_{i=1}^k (2\pi)^{-1/2}\exp\left(-\frac{\varepsilon_i^2}{2}\right);$$
since the εᵢ are independent, the joint pdf of (ε₁, …, ε_k) is the product of the marginal pdfs.
(ii) Let the joint pdf of v = (v₁, …, v_k)⊤ be denoted by f(v), v ∈ D ⊆ R^k; let g(v) = (g₁(v), …, g_k(v))⊤ be a "smooth" one-to-one transformation of D onto E ⊆ R^k; and let g⁻¹(s) = (g₁⁻¹(s), …, g_k⁻¹(s))⊤, s ∈ E, denote the inverse transformation of g, which satisfies g⁻¹(g(v)) = v.
Proof of Fact 1 (cont.)
Define
$$J = \frac{\partial g^{-1}(y)}{\partial y} = \begin{pmatrix} \frac{\partial g_1^{-1}(y)}{\partial y_1} & \cdots & \frac{\partial g_1^{-1}(y)}{\partial y_k} \\ \vdots & & \vdots \\ \frac{\partial g_k^{-1}(y)}{\partial y_1} & \cdots & \frac{\partial g_k^{-1}(y)}{\partial y_k} \end{pmatrix}.$$
Then, the joint pdf of y = g(v) is given by f(g⁻¹(y))|det(J)|. Now, since
$$(A^{*\top})^{-1}(A^*)^{-1} = (A^*A^{*\top})^{-1} = \left(\begin{pmatrix} A \\ B \end{pmatrix}\begin{pmatrix} A^\top & B^\top \end{pmatrix}\right)^{-1} = \begin{pmatrix} (AA^\top)^{-1} & 0 \\ 0 & I_{k-r} \end{pmatrix} = \begin{pmatrix} \Sigma^{-1} & 0 \\ 0 & I_{k-r} \end{pmatrix}$$
and
$$|\det((A^*)^{-1})| = |\det(A^*)|^{-1} = (\det(A^*)\det(A^*))^{-1/2} = \big(\det(A^*)\det(A^{*\top})\big)^{-1/2} = \big(\det(A^*A^{*\top})\big)^{-1/2} = (\det(\Sigma))^{-1/2},$$
Proof of Fact 1 (cont.)
we have
$$f^*(z^*) \overset{\text{why?}}{=} (2\pi)^{-r/2}\exp\left\{-\frac{(z-\mu)^\top\Sigma^{-1}(z-\mu)}{2}\right\}(\det(\Sigma))^{-1/2} \times (2\pi)^{-(k-r)/2}\exp\{-(w^\top w)/2\},$$
and hence
$$f(z) = \int_{-\infty}^{\infty}\!\cdots\!\int_{-\infty}^{\infty} f^*(z^*)\,dw = (2\pi)^{-r/2}\exp\left\{-\frac{(z-\mu)^\top\Sigma^{-1}(z-\mu)}{2}\right\}(\det(\Sigma))^{-1/2},$$
where $\int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty}(2\pi)^{-(k-r)/2}\exp\{-(w^\top w)/2\}\,dw = 1$. (why?)
Fact 2
Assume z ∼ N(μ, Σ) and z = (z₁⊤, z₂⊤)⊤, where z₁ and z₂ are r₁- and r₂-dimensional, respectively. Then Cov(z₁, z₂) = E((z₁ − μ₁)(z₂ − μ₂)⊤) = 0, where 0 is a zero matrix, if and only if z₁ and z₂ are independent.
Proof of Fact 2
(⇐) It is easy and hence skipped.
(⇒) Since Cov(z₁, z₂) = 0, we have by Fact 1,
$$f(z) = f(z_1, z_2) = \prod_{i=1}^2 (2\pi)^{-r_i/2}\exp\left\{-\frac{(z_i-\mu_i)^\top\Sigma_{ii}^{-1}(z_i-\mu_i)}{2}\right\}|\det(\Sigma_{ii})|^{-1/2} = f(z_1)f(z_2),$$
where (μ₁⊤, μ₂⊤)⊤ = μ and
$$\begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix} = \Sigma = \begin{pmatrix} \Sigma_{11} & 0 \\ 0 & \Sigma_{22} \end{pmatrix}, \quad \text{by hypothesis.}$$
Proof of Fact 2 (cont.)
Since f(z₁) is the joint pdf of z₁ and f(z₂) is the joint pdf of z₂, the above identity implies that z₁ and z₂ are independent. (why?)
Here we have used that X and Y are independent iff f(x, y) = f_X(x)f_Y(y).
Fact 3
Let z ∼ N(μ, σ²I_r) and let C = (B₁⊤, B₂⊤)⊤ be a q × r matrix, q ≤ r, of full row rank. Then B₁z and B₂z are independent if B₁B₂⊤ = 0.
Proof of Fact 3
Since
$$\mathrm{Cov}(B_1z, B_2z) = E(B_1(z-\mu)(z-\mu)^\top B_2^\top) = \sigma^2B_1B_2^\top = 0,$$
by Fact 2, the desired conclusion follows.
Definition
Let z be an r-dimensional random vector and let A be an r × r symmetric matrix. Then z⊤Az is called a quadratic form.
Fact 4
Let E(z) = μ and Var(z) = Σ. Then
E(z⊤Az) = μ⊤Aμ + tr(AΣ).
Proof of Fact 4
For μ = 0, we have
E(z⊤Az) = E(tr(Azz⊤)) = tr(A E(zz⊤)) = tr(AΣ).
For μ ≠ 0, we have
$$\mathrm{tr}(A\Sigma) \overset{\text{why?}}{=} E((z-\mu)^\top A(z-\mu)) \overset{\text{why?}}{=} E(z^\top Az) - 2\mu^\top A\mu + \mu^\top A\mu,$$
and hence the desired conclusion holds.
Fact 5
If z ∼ N(0, I_r) and M is an r × r orthogonal projection matrix, then
z⊤Mz ∼ χ²(r(M)),
where r(M) denotes the rank of M and χ²(k) denotes the chi-square distribution with k degrees of freedom.
Proof of Fact 5
Denote r(M) by q. Let {o₁, …, o_q} be an orthonormal basis for C(M).
We have shown that M = OO⊤ = Σ_{i=1}^q oᵢoᵢ⊤, where O = [o₁, …, o_q]; note that O⊤O = I_q.
Since O⊤ has full row rank, O⊤z ∼ N(0, O⊤O) = N(0, I_q), yielding that oᵢ⊤z, i = 1, …, q, are i.i.d. N(0, 1) distributed. In addition, we have
$$z^\top OO^\top z = \sum_{i=1}^q (o_i^\top z)^2 \sim \chi^2(q),$$
which completes the proof.
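A quick simulation check of Fact 5 (an added sketch; the random projection construction and sample sizes are arbitrary choices):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
r, q, B = 6, 3, 20000
O = np.linalg.qr(rng.normal(size=(r, q)))[0]    # orthonormal columns
M = O @ O.T                                     # projection matrix, rank q

z = rng.normal(size=(B, r))
quad = np.einsum('bi,ij,bj->b', z, M, z)        # z' M z for each draw
# Compare the empirical distribution of z'Mz with chi-square(q):
print(stats.kstest(quad, 'chi2', args=(q,)).pvalue)   # typically large
```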
Fact 6
Let z ∼ N(0, Σ) be r-dimensional. Then z⊤Σ⁻¹z ∼ χ²(r).
Proof of Fact 6
Since z ∼ N(0, Σ), we have z ∼ Aε, in which AA⊤ = Σ and ε ∼ N(0, I_k) for some k ≥ r. Here, A is an r × k matrix of full row rank. This implies
$$z^\top\Sigma^{-1}z \overset{d}{=} \varepsilon^\top A^\top(AA^\top)^{-1}A\varepsilon.$$
Here, =ᵈ means "is equal in distribution to".
Note that A⊤(AA⊤)⁻¹A is symmetric and idempotent. Therefore, it is an orthogonal projection matrix with rank r (why?). By Fact 5,
ε⊤A⊤(AA⊤)⁻¹Aε ∼ χ²(r),
which gives the desired conclusion.
Gaussian Regression
Assume ε in y = Xβ + ε obeys ε ∼ N(0, σ²I_n).
D1: β̂ = (X⊤X)⁻¹X⊤ε + β ∼ N(β, (X⊤X)⁻¹σ²). Please convince yourself of this result!!
D2:
$$\hat\sigma^2 = \frac{1}{n-k-1}\varepsilon^\top(I - M_k)\varepsilon = \frac{\sigma^2}{n-k-1}\cdot\frac{\varepsilon^\top(I - M_k)\varepsilon}{\sigma^2} \sim \sigma^2\frac{\chi^2(n-k-1)}{n-k-1},$$
recalling that Mₖ = X(X⊤X)⁻¹X⊤ and
$$X = \begin{pmatrix} 1 & x_{11} & \cdots & x_{1k} \\ \vdots & \vdots & & \vdots \\ 1 & x_{n1} & \cdots & x_{nk} \end{pmatrix}.$$
Here I is I_n, but I sometimes drop the subscript "n" when no confusion is possible.
Hypothesis testing
(a) F test
Consider the null hypothesis
H₀ : β₁ = β₂ = ⋯ = βₖ = 0 (meaning the regression is unimportant)
against H_A : H₀ is wrong (the alternative hypothesis).
Test statistic:
$$T_1 = \frac{\mathrm{SSReg}/k}{\mathrm{SSRes}/(n-k-1)} = \frac{\text{per-unit contribution of the "regression"}}{\text{per-unit contribution of the "model residuals"}}.$$
T₁ is thus a comparison of these two kinds of "contributions". When T₁ is "large", we tend to "reject" H₀, since the contribution of the regression is then non-negligible. But what counts as "large"? This must be decided by the distribution of T₁, in particular the distribution of T₁ under H₀.
More precisely, when H₀ holds, T₁ should not be too large. If we can obtain the distribution of T₁ under H₀, we can find the value "c" for which
P_{H₀}(0 ≤ T₁ ≤ c) = 95% (this percentage can be adjusted to individual needs).
That is, T₁ ∈ (0, c) with probability as high as 95%, and when T₁ ≥ c we should strongly "suspect" that H₀ may be wrong (because something that is very unlikely under H₀ has happened).
Hence we can take T₁ ≥ c (or T₁ < c) as a "testing rule", i.e., reject H₀ if T₁ ≥ c and do not reject H₀ if T₁ < c. Using this rule, the probability of committing a Type I error is 5%. [5% is called the "significance level" of this test, and such a test is called an α-level test with α = 5%.]

Action \ Truth     | H₀           | H_A
Do not reject H₀   | O.K.         | Type II error
Reject H₀          | Type I error | O.K.

For more on statistical testing, see the article "統計顯著性" ("Statistical Significance") by Professor 黃文璋.
How do we derive the distribution of T₁ under H₀?
(i)
$$\frac{\mathrm{SSReg}}{k} \overset{\text{under } H_0}{=} \frac{\varepsilon^\top(M_k - M_0)\varepsilon}{k} \overset{\text{by Fact 5}}{\sim} \sigma^2\frac{\chi^2(k)}{k}.$$
(ii)
$$\frac{\mathrm{SSRes}}{n-k-1} = \hat\sigma^2 \overset{\text{by D2}}{\sim} \sigma^2\frac{\chi^2(n-k-1)}{n-k-1}.$$
(iii) SSReg and SSRes are independent. This is because
$$\mathrm{SSReg} \overset{\text{under } H_0}{=} \varepsilon^\top O_{\mathrm{Reg}}O_{\mathrm{Reg}}^\top\varepsilon,$$
where O_Reg consists of the orthonormal basis of C((I − M₀)X), and
$$\mathrm{SSRes} = \varepsilon^\top O_{\mathrm{Res}}O_{\mathrm{Res}}^\top\varepsilon,$$
where O_Res consists of the orthonormal basis of C⊥((I − M₀)X). Moreover, since
O_Reg⊤O_Res = 0 (0: zero matrix),
by Fact 3, O_Reg⊤ε and O_Res⊤ε are independent, and hence SSReg and SSRes are independent (why?).
(iv) Combining (i) ∼ (iii), we obtain
$$T_1 \overset{H_0}{\sim} F(k, n-k-1),$$
where F(k, n−k−1) is the F-distribution with k and n−k−1 degrees of freedom. [Why? Because T₁ (under H₀) is a ratio of two independent chi-square random variables, each divided by its corresponding degrees of freedom.]
(v) (α-level) Testing rule: Reject H₀ if
T₁ ≥ f_{1−α}(k, n−k−1),
where P(F(k, n−k−1) > f_{1−α}(k, n−k−1)) = α.
f_{1−α}(k, n−k−1) is called the upper critical value of the F(k, n−k−1) distribution.
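The F test above is easy to code. Here is a minimal sketch (added for illustration; `overall_f_test` is a hypothetical name, and the first column of X is assumed to be the intercept):

```python
import numpy as np
from scipy import stats

def overall_f_test(X, y):
    """F test of H0: beta_1 = ... = beta_k = 0 (intercept excluded)."""
    n, p = X.shape                       # p = k + 1, first column is ones
    k = p - 1
    fit = X @ np.linalg.lstsq(X, y, rcond=None)[0]
    ss_res = np.sum((y - fit) ** 2)
    ss_reg = np.sum((fit - y.mean()) ** 2)
    T1 = (ss_reg / k) / (ss_res / (n - k - 1))
    p_value = stats.f.sf(T1, k, n - k - 1)   # P(F(k, n-k-1) >= T1)
    return T1, p_value
```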
(b) Wald test
Consider the linear parametric hypothesis
H₀ : Dβ = γ versus H_A : H₀ is wrong,
where D and γ are known, D is a q × (k+1) matrix with 1 ≤ q ≤ k+1, and γ is a q × 1 vector.
Example
If β = (β₁, …, β₄)⊤,
$$D = \begin{pmatrix} 1 & 0 & -1 & 0 \\ 0 & 1 & 0 & -1 \end{pmatrix}, \quad \gamma = \begin{pmatrix} 0 \\ 0 \end{pmatrix},$$
then H₀ : β₁ = β₃ and β₂ = β₄, and H_A : β₁ ≠ β₃ or β₂ ≠ β₄.
By suitably imposing D and γ, Wald tests are much more flexible than F tests.
Test statistic:
$$W_1 = \frac{(D\hat\beta - \gamma)^\top E^{-1}(D\hat\beta - \gamma)}{\hat\sigma^2 q}, \quad \text{where } E = D(X^\top X)^{-1}D^\top.$$
What is the distribution of W₁ under H₀?
(i) Dβ̂ − γ ∼ N(0, σ²E) under H₀. (Why? Dβ̂ − γ = D(β̂ − β) under H₀.)
(ii) (Dβ̂ − γ)⊤E⁻¹(Dβ̂ − γ)/σ² ∼ χ²(q). (by Fact 6)
(iii) β̂ and σ̂² are independent. (Why? We've argued this previously!!)
(iv) σ̂²/σ² ∼ χ²(n−k−1)/(n−k−1). (We've already shown this!!)
(v) By (i) ∼ (iv), W₁ ∼ F(q, n−k−1) under H₀.
(vi) Now you can set an α, find the critical value from the F table, and establish your α-level test!!
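A matching sketch of the Wald statistic W₁ and its F(q, n−k−1) p-value (added code; `wald_test` is a hypothetical name):

```python
import numpy as np
from scipy import stats

def wald_test(X, y, D, gamma):
    """W1 = (D b - gamma)' E^{-1} (D b - gamma) / (sigma2_hat * q),
    compared against F(q, n - k - 1)."""
    n, p = X.shape                       # p = k + 1
    q = D.shape[0]
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y
    sigma2 = np.sum((y - X @ beta_hat) ** 2) / (n - p)
    diff = D @ beta_hat - gamma
    E = D @ XtX_inv @ D.T
    W1 = diff @ np.linalg.solve(E, diff) / (sigma2 * q)
    return W1, stats.f.sf(W1, q, n - p)
```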
(c) t-test
Consider the following hypothesis,
H₀ : βⱼ = b, where 1 ≤ j ≤ k and b is known,
against the alternative, H_A : βⱼ ≠ b.
We have:
(i) β̂ − β ∼ N(0, (X⊤X)⁻¹σ²) [see D1], and hence under H₀,
$$\hat\beta_j - b = e_j^\top(\hat\beta - \beta) \sim N(0, e_j^\top(X^\top X)^{-1}e_j\,\sigma^2),$$
where eⱼ = (0, …, 0, 1, 0, …, 0)⊤, whose j-th component is 1 and whose other components are zeros.
(ii) σ̂²/σ² ∼ χ²(n−k−1)/(n−k−1). [see D2]
(iii) σ̂² and β̂ⱼ are independent. (why?)
(iv) By (i) ∼ (iii),
$$T \equiv \frac{(\hat\beta_j - b)\big/\sqrt{e_j^\top(X^\top X)^{-1}e_j\,\sigma^2}}{\sqrt{\hat\sigma^2/\sigma^2}} = \frac{\hat\beta_j - b}{\sqrt{e_j^\top(X^\top X)^{-1}e_j\,\hat\sigma^2}} \overset{H_0}{\sim} t(n-k-1),$$
where t(n−k−1) is the t-distribution with n−k−1 degrees of freedom.
(v) Testing rule: Reject H₀ if |T| > t_{α/2}(n−k−1).
We have P_{H₀}(|T| > t_{α/2}(n−k−1)) = α, and hence this is a level-α test.
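And the corresponding t statistic in code (added sketch; `t_test_coef` is a hypothetical name, and `j` indexes the column of X whose coefficient is tested):

```python
import numpy as np
from scipy import stats

def t_test_coef(X, y, j, b=0.0):
    """T = (beta_hat_j - b) / sqrt(e_j'(X'X)^{-1} e_j * sigma2_hat),
    with a two-sided p-value from t(n - k - 1)."""
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y
    sigma2 = np.sum((y - X @ beta_hat) ** 2) / (n - p)
    T = (beta_hat[j] - b) / np.sqrt(XtX_inv[j, j] * sigma2)
    return T, 2 * stats.t.sf(abs(T), n - p)
```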
Interval Estimation
We first recall some results on point estimation:
(i) E(β̂) = β and E(σ̂²) = σ² (unbiasedness).
(ii) Var(β̂) = (X⊤X)⁻¹σ².
(iii) β̂ is BLUE!!
(iv) Var(σ̂²) = 2σ⁴/(n−k−1) → 0 as n → ∞ (under the normal assumption) [which is a desired result, because it shows that the estimation quality gets better and better as the sample size gets larger and larger!!]
To see this, note first that
(a) σ̂²/σ² ∼ χ²(n−k−1)/(n−k−1);
(b) E(χ²(n−k−1)) = n−k−1;
(c) Var(χ²(n−k−1)) = 2(n−k−1).
By (a)–(c), Var(σ̂²) = 2σ⁴/(n−k−1) follows.
However, if the normal assumption fails to hold, how should we calculate Var(σ̂²)?
Some ideas:
$$\hat\sigma^2 = \frac{1}{n-k-1}y^\top(I - M_k)y = \frac{1}{n-k-1}\varepsilon^\top(I - M_k)\varepsilon = \frac{1}{n-k-1}\sum_{i=1}^n\sum_{j=1}^n A_{ij}\varepsilon_i\varepsilon_j,$$
where [A_{ij}]_{1≤i,j≤n} ≡ A = I − Mₖ. It is clear that
$$E(\hat\sigma^2) = \frac{1}{n-k-1}\sum_{i=1}^n\sum_{j=1}^n A_{ij}E(\varepsilon_i\varepsilon_j) \overset{\text{why?}}{=} \frac{1}{n-k-1}\sum_{i=1}^n A_{ii}\sigma^2 = \frac{\sigma^2}{n-k-1}\mathrm{tr}(I - M_k) = \sigma^2.$$
Moreover, we have
$$\begin{aligned} E(\hat\sigma^4) &= \Big(\frac{1}{n-k-1}\Big)^2\sum_{i=1}^n\sum_{j=1}^n\sum_{k=1}^n\sum_{l=1}^n A_{ij}A_{kl}E(\varepsilon_i\varepsilon_j\varepsilon_k\varepsilon_l) \\ &= \Big(\frac{1}{n-k-1}\Big)^2\sum_{i=1}^n A_{ii}^2E(\varepsilon_i^4) \quad (i=j=k=l) \\ &\quad+ \Big(\frac{1}{n-k-1}\Big)^2\sum_{\substack{1\le i,k\le n\\ i\ne k}} A_{ii}A_{kk}E(\varepsilon_i^2)E(\varepsilon_k^2) \quad (i=j\ne k=l) \\ &\quad+ \Big(\frac{1}{n-k-1}\Big)^2\sum_{\substack{1\le i,j\le n\\ i\ne j}} A_{ij}^2E(\varepsilon_i^2)E(\varepsilon_j^2) \quad (i=k\ne j=l) \\ &\quad+ \Big(\frac{1}{n-k-1}\Big)^2\sum_{\substack{1\le i,j\le n\\ i\ne j}} A_{ij}A_{ji}E(\varepsilon_i^2)E(\varepsilon_j^2) \quad (i=l\ne j=k), \end{aligned}$$
where Σ_{i≠j} A_{ij}A_{ji}E(εᵢ²)E(εⱼ²) = Σ_{i≠j} A_{ij}²E(εᵢ²)E(εⱼ²) (since A is symmetric).
Simple algebra shows that
$$\begin{aligned} E(\hat\sigma^4) &= \Big(\frac{1}{n-k-1}\Big)^2(E(\varepsilon_1^4) - 3\sigma^4)\sum_{i=1}^n A_{ii}^2 + \Big(\frac{1}{n-k-1}\Big)^2\sigma^4\left(\sum_{i=1}^n\sum_{k=1}^n A_{ii}A_{kk} + 2\sum_{i=1}^n\sum_{j=1}^n A_{ij}^2\right) \\ &= \frac{1}{(n-k-1)^2}(E(\varepsilon_1^4) - 3\sigma^4)\sum_{i=1}^n A_{ii}^2 + \sigma^4 + \frac{2\sigma^4}{n-k-1}. \end{aligned}$$
Note
(i) E(ε₁⁴) − 3σ⁴ = 0 if ε is normal.
(ii) Σ_{i=1}^n Σ_{k=1}^n A_{ii}A_{kk} = (tr(A))² = (tr(I − Mₖ))² = (n−k−1)².
(iii) Σ_{i=1}^n Σ_{j=1}^n A_{ij}² = tr(A²) = tr((I − Mₖ)²) = tr(I − Mₖ) = n−k−1.
Hence
$$\mathrm{Var}(\hat\sigma^2) = \frac{1}{(n-k-1)^2}(E(\varepsilon_1^4) - 3\sigma^4)\sum_{i=1}^n A_{ii}^2 + \frac{2\sigma^4}{n-k-1}.$$
Will
$$\frac{1}{(n-k-1)^2}(E(\varepsilon_1^4) - 3\sigma^4)\sum_{i=1}^n A_{ii}^2$$
converge to zero as n → ∞? Yes, because
$$\sum_{i=1}^n A_{ii}^2 \le \sum_{i=1}^n A_{ii} = \mathrm{tr}(A) = \mathrm{tr}(I - M_k) = n - k - 1.$$
To see this, we note that the idempotent property of A yields
$$A_{ii} = \sum_{j=1}^n A_{ij}^2 \ge A_{ii}^2 \quad (\text{which also yields } 0 \le A_{ii} \le 1).$$
We now get back to interval estimation.
(i) The first goal is to find an interval I_α such that
P(βᵢ ∈ I_α) = 1 − α,
where α is small and is decided by the user; 1 − α is called a "confidence level".
How do we construct I_α?
(a)
$$\frac{\hat\beta_i - \beta_i}{\sqrt{e_i^\top(X^\top X)^{-1}e_i\,\hat\sigma^2}} \sim t(n-k-1).$$
(b) P(βᵢ ∈ (Lᵢ, Rᵢ)) = 1 − α, with
$$L_i = \hat\beta_i - t_{1-\alpha/2}(n-k-1)\sqrt{e_i^\top(X^\top X)^{-1}e_i\,\hat\sigma^2}, \qquad R_i = \hat\beta_i + t_{1-\alpha/2}(n-k-1)\sqrt{e_i^\top(X^\top X)^{-1}e_i\,\hat\sigma^2}.$$
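The interval (Lᵢ, Rᵢ) in (b) in code (added sketch; `coef_confidence_interval` is a hypothetical name):

```python
import numpy as np
from scipy import stats

def coef_confidence_interval(X, y, i, alpha=0.05):
    """(L_i, R_i) = beta_hat_i -/+ t_{1-alpha/2}(n-k-1) * se_i."""
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y
    sigma2 = np.sum((y - X @ beta_hat) ** 2) / (n - p)
    se = np.sqrt(XtX_inv[i, i] * sigma2)
    tq = stats.t.ppf(1 - alpha / 2, n - p)
    return beta_hat[i] - tq * se, beta_hat[i] + tq * se
```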
Does the interval described in (b) have the shortest length?
To answer this question, we need to solve the following problem:
minimize b − a subject to F(b) − F(a) = 1 − α,
where F(·) denotes the distribution function of the t(n−k−1) distribution, and
$$P\left(a < \frac{\hat\beta_i - \beta_i}{\sqrt{e_i^\top(X^\top X)^{-1}e_i\,\hat\sigma^2}} \le b\right) = F(b) - F(a) = 1 - \alpha.$$
By the Lagrange method, define
g(a, b, λ) = b − a − λ(F(b) − F(a) − (1 − α))
and set ∇g(a, b, λ) = 0, where ∇g = (∂g/∂a, ∂g/∂b, ∂g/∂λ)⊤. The last identity yields
$$f(b) = f(a) = \frac{1}{\lambda}, \qquad F(b) - F(a) = 1 - \alpha, \quad (*)$$
where f(·) is the pdf of the t(n−k−1) distribution.
Since the pdf of t(n−k−1) is symmetric and strictly decreasing (increasing) when x ≥ 0 (when x ≤ 0), (∗) implies b = −a and b > 0.
As a result, the unique solution to (∗) is (−b, b) with 2F(b) = 2 − α, i.e.,
(−t_{1−α/2}(n−k−1), t_{1−α/2}(n−k−1)).
To check whether 2t_{1−α/2}(n−k−1) minimizes b − a, we still need to consider the so-called "bordered" Hessian matrix evaluated at
$$s^* = \begin{pmatrix} a^* \\ b^* \\ \lambda^* \end{pmatrix} = \begin{pmatrix} -t_{1-\alpha/2}(n-k-1) \\ t_{1-\alpha/2}(n-k-1) \\ 1/f(t_{1-\alpha/2}(n-k-1)) \end{pmatrix}.$$
Note that the bordered Hessian matrix is defined by
$$\nabla^2 g = \begin{pmatrix} \frac{\partial^2 g}{\partial a\,\partial a} & \frac{\partial^2 g}{\partial a\,\partial b} & \frac{\partial^2 g}{\partial a\,\partial\lambda} \\ \cdot & \frac{\partial^2 g}{\partial b\,\partial b} & \frac{\partial^2 g}{\partial b\,\partial\lambda} \\ \cdot & \cdot & \frac{\partial^2 g}{\partial\lambda\,\partial\lambda} \end{pmatrix},$$
where ∂²g/∂λ∂λ = 0, and it is straightforward to show that
$$\nabla^2 g(s^*) = \begin{pmatrix} \frac{f'(-t_{1-\alpha/2}(n-k-1))}{f(t_{1-\alpha/2}(n-k-1))} & 0 & f(-t_{1-\alpha/2}(n-k-1)) \\ 0 & \frac{-f'(t_{1-\alpha/2}(n-k-1))}{f(t_{1-\alpha/2}(n-k-1))} & -f(t_{1-\alpha/2}(n-k-1)) \\ f(-t_{1-\alpha/2}(n-k-1)) & -f(t_{1-\alpha/2}(n-k-1)) & 0 \end{pmatrix}.$$
Since the principal submatrix
$$\begin{pmatrix} \frac{f'(-t_{1-\alpha/2}(n-k-1))}{f(t_{1-\alpha/2}(n-k-1))} & 0 \\ 0 & \frac{-f'(t_{1-\alpha/2}(n-k-1))}{f(t_{1-\alpha/2}(n-k-1))} \end{pmatrix}$$
is positive definite, it follows that 2t_{1−α/2}(n−k−1) minimizes b − a subject to F(b) − F(a) = 1 − α.
(ii) The second goal is to find a (k+1)-dimensional set V_α such that
P(β ∈ V_α) = 1 − α.
How do we construct V_α?
(a)
$$\frac{(\hat\beta - \beta)^\top X^\top X(\hat\beta - \beta)}{\sigma^2} \sim \chi^2(k+1). \quad \text{(by Fact 6)}$$
(b)
$$\frac{(\hat\beta - \beta)^\top X^\top X(\hat\beta - \beta)}{(k+1)\hat\sigma^2} \sim F(k+1, n-k-1).$$
(c) V_α is the set of β satisfying
$$a \le \frac{(\hat\beta - \beta)^\top X^\top X(\hat\beta - \beta)}{(k+1)\hat\sigma^2} \le b,$$
where F*(b) − F*(a) = 1 − α and F*(·) is the distribution function of F(k+1, n−k−1).
(d) It can be shown that the volume of the larger ellipsoid is
$$\frac{\pi^{\frac{k+1}{2}}}{\Gamma\big(\frac{k+1}{2}+1\big)}\big((k+1)\hat\sigma^2 b\big)^{\frac{k+1}{2}}(\det(X^\top X))^{-1/2},$$
and that of the smaller one is
$$\frac{\pi^{\frac{k+1}{2}}}{\Gamma\big(\frac{k+1}{2}+1\big)}\big((k+1)\hat\sigma^2 a\big)^{\frac{k+1}{2}}(\det(X^\top X))^{-1/2}.$$
Hence the volume of V_α is minimized by
minimizing b^{(k+1)/2} − a^{(k+1)/2} subject to F*(b) − F*(a) = 1 − α.
However, in general, this minimization problem does not have a closed-form solution, but it can be shown that when k = 1,
a* = 0 and b* = F_{1−α}(k+1, n−k−1),
and when both n and k are large with n ≫ k,
a* ∼ 0 and b* ∼ F_{1−α}(k+1, n−k−1).
Note also that unlike the t-distributions, when d₁ > 1 the pdfs of F distributions have very small values near the origin.
Another look at β̂
Let Xₖ = (X_{k−1}, xₖ). Because C(Xₖ) and C(X_{k−1}, (I − M_{k−1})xₖ) are the same, we have
$$M_ky = (X_{k-1}, x_k)\begin{pmatrix} \hat\beta_{k-1} \\ \hat\beta_k \end{pmatrix} \overset{\text{why?}}{=} (X_{k-1}, (I - M_{k-1})x_k)\begin{pmatrix} (X_{k-1}^\top X_{k-1})^{-1} & 0 \\ 0^\top & \frac{1}{x_k^\top(I - M_{k-1})x_k} \end{pmatrix}\begin{pmatrix} X_{k-1}^\top y \\ x_k^\top(I - M_{k-1})y \end{pmatrix},$$
yielding
$$X_{k-1}\hat\beta_{k-1} + x_k\hat\beta_k = X_{k-1}[(X_{k-1}^\top X_{k-1})^{-1}X_{k-1}^\top y - (X_{k-1}^\top X_{k-1})^{-1}X_{k-1}^\top x_k\beta_k^*] + x_k\beta_k^*,$$
where
$$\beta_k^* = \frac{x_k^\top(I - M_{k-1})y}{x_k^\top(I - M_{k-1})x_k}.$$
In addition, since Xₖ is of full rank, we obtain
$$\hat\beta_k = \beta_k^* = \frac{x_k^\top(I - M_{k-1})y}{x_k^\top(I - M_{k-1})x_k}.$$
This shows that β̂ₖ is equivalent to the LSE of the simple regression of (I − M_{k−1})y on (I − M_{k−1})xₖ.
As a result, β̂ₖ can only be viewed as the marginal contribution of xₖ to y when the effects of the other variables are removed in advance.
Model Selection
Mallows’ Cp:
Let Xp be a submodel of Xk.
Can we construct a measure to describe its prediction performance?
Let M_p be the orthogonal projection matrix of X_p. Then M_p y can be used to predict new observations
y_New = Xₖβ + ε_New,
where ε_New and ε are independent but have the same distribution.
The performance of M_p y can be measured by
$$E\|y_{\mathrm{New}} - M_py\|^2 = E\|X_k\beta + \varepsilon_{\mathrm{New}} - M_py\|^2 \overset{(*)}{=} n\sigma^2 + E\|X_k\beta - M_py\|^2.$$
(∗): since ε_New and y are independent.
Let Xₖ = (X_p, X_{−p}) and β = (β_p⊤, β_{−p}⊤)⊤.
Moreover, we have
$$\begin{aligned} E\|X_k\beta - M_py\|^2 &= E\|X_p\beta_p + X_{-p}\beta_{-p} - M_p(X_{-p}\beta_{-p} + X_p\beta_p + \varepsilon)\|^2 \\ &= E\|(I - M_p)X_{-p}\beta_{-p} - M_p\varepsilon\|^2 \\ &\overset{\text{why?}}{=} p\sigma^2 + \beta_{-p}^\top X_{-p}^\top(I - M_p)X_{-p}\beta_{-p}. \end{aligned}$$
Hence,
$$E\|y_{\mathrm{New}} - M_py\|^2 = (n+p)\sigma^2 + \beta_{-p}^\top X_{-p}^\top(I - M_p)X_{-p}\beta_{-p}.$$
To estimate this expectation, we start by considering
SSRes(p) = y⊤(I − M_p)y.
Note first that
$$\begin{aligned} E(\mathrm{SSRes}(p)) &= E(X_{-p}\beta_{-p} + \varepsilon)^\top(I - M_p)(X_{-p}\beta_{-p} + \varepsilon) \\ &= \beta_{-p}^\top X_{-p}^\top(I - M_p)X_{-p}\beta_{-p} + E(\varepsilon^\top(I - M_p)\varepsilon) \\ &= \beta_{-p}^\top X_{-p}^\top(I - M_p)X_{-p}\beta_{-p} + (n-p)\sigma^2. \end{aligned}$$
Therefore,
$$E(\mathrm{SSRes}(p) + 2p\sigma^2) = \beta_{-p}^\top X_{-p}^\top(I - M_p)X_{-p}\beta_{-p} + (n+p)\sigma^2 = E\|y_{\mathrm{New}} - M_py\|^2.$$
Now, Mallows' C_p is defined by
SSRes(p) + 2pσ̂²,
which (with σ̂² an unbiased estimate of σ²) is an unbiased estimate of E‖y_New − M_p y‖².
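A compact sketch of C_p for a candidate submodel (added code; `mallows_cp` is a hypothetical name, and, as a common convention I am assuming here, σ̂² is taken from the full model):

```python
import numpy as np

def mallows_cp(X_full, y, cols):
    """SSRes(p) + 2 p sigma2_hat for the submodel using columns `cols`,
    with sigma2_hat estimated from the full model."""
    n, k1 = X_full.shape                 # k1 = k + 1
    resid_full = y - X_full @ np.linalg.lstsq(X_full, y, rcond=None)[0]
    sigma2 = resid_full @ resid_full / (n - k1)
    Xp = X_full[:, cols]
    p = Xp.shape[1]
    resid_p = y - Xp @ np.linalg.lstsq(Xp, y, rcond=None)[0]
    return resid_p @ resid_p + 2 * p * sigma2
```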
Prediction
(a) How do we predict E(y_{n+1}) = x_{n+1}⊤β when x_{n+1} = (1, x_{n+1,1}, …, x_{n+1,k})⊤ is available?
Point prediction: x_{n+1}⊤β̂.
Prediction interval (under normality):
(i) x_{n+1}⊤(β̂ − β) ∼ N(0, x_{n+1}⊤(X⊤X)⁻¹x_{n+1}σ²).
Sometimes I use Xₖ in place of X, in particular when the model selection issue is taken into account.
(ii)
$$\frac{x_{n+1}^\top(\hat\beta - \beta)}{\sqrt{x_{n+1}^\top(X^\top X)^{-1}x_{n+1}\,\hat\sigma^2}} \sim t(n-k-1).$$
(iii) Please construct a (1−α)-level prediction interval by yourself.
(iv) What if the normal assumption is violated?
(b) How do we predict y_{n+1}?
Point prediction: x_{n+1}⊤β̂. (Still, we have this guy.)
Prediction interval (under normality):
(i) y_{n+1} − x_{n+1}⊤β̂ = ε_{n+1} − x_{n+1}⊤(β̂ − β) ∼ N(0, (1 + x_{n+1}⊤(X⊤X)⁻¹x_{n+1})σ²). (y_{n+1} − x_{n+1}⊤β̂ is called the prediction error.)
(ii)
$$\frac{y_{n+1} - x_{n+1}^\top\hat\beta}{\sqrt{(1 + x_{n+1}^\top(X^\top X)^{-1}x_{n+1})\hat\sigma^2}} \sim t(n-k-1).$$
(iii) Please construct your own (1−α)-level prediction interval for y_{n+1}.
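One possible solution to (iii), as a hedged sketch (added code; `prediction_interval` is a hypothetical name):

```python
import numpy as np
from scipy import stats

def prediction_interval(X, y, x_new, alpha=0.05):
    """(1 - alpha)-level interval for y_{n+1} at covariate vector x_new."""
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y
    sigma2 = np.sum((y - X @ beta_hat) ** 2) / (n - p)
    center = x_new @ beta_hat
    half = stats.t.ppf(1 - alpha / 2, n - p) * np.sqrt(
        (1.0 + x_new @ XtX_inv @ x_new) * sigma2)
    return center - half, center + half
```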
Large Sample Theory
Motivation
Consider again
y = Xβ + ε.
If the εₜ are not normally distributed, how do we make inference for β and σ²? How do we perform prediction?
Q1: Does β̂ = (X⊤X)⁻¹X⊤y → β in probability?
Q2: Does σ̂² = (1/(n−(k+1)))y⊤(I − Mₖ)y → σ² in probability?
Q3: If the answer to Q1 is "yes", what is the limiting distribution of β̂?
Q4: How do we construct confidence intervals for β based on the answer to Q3?
Q5: How do we test linear or even nonlinear hypotheses without normality?
Q6: How do we do prediction without normality?
Toward Large Sample Theory I
Question 1
We first answer Q1 in the special case where
$$X = \begin{pmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix}.$$
Definition
A sequence of r.v.s {Zₙ} is said to converge in probability to a r.v. Z (which can be a non-random constant) if for any ε > 0,
$$\lim_{n\to\infty} P(|Z_n - Z| > \varepsilon) = 0,$$
which is denoted by Zₙ →ᵖʳ Z.
Remark
A sequence of random vectors {Zₙ = (Z_{1n}, …, Z_{kn})⊤} is said to converge in probability to a random vector Z = (Z₁, …, Z_k)⊤ if Z_{in} →ᵖʳ Zᵢ, i = 1, …, k, which is denoted by Zₙ →ᵖʳ Z.
An answer to Q1:
Since
$$\mathrm{Var}(\hat\beta) = (X^\top X)^{-1}\sigma^2 = \sigma^2\begin{pmatrix} \dfrac{S_{xx} + n\bar x^2}{nS_{xx}} & \dfrac{-n\bar x}{nS_{xx}} \\ \dfrac{-n\bar x}{nS_{xx}} & \dfrac{1}{S_{xx}} \end{pmatrix},$$
we have
$$P(|\hat\beta_0 - \beta_0| > \varepsilon) \overset{(*)}{\le} \frac{\sigma^2}{\varepsilon^2}\cdot\frac{S_{xx} + n\bar x^2}{nS_{xx}} \to 0 \quad \text{if } \frac{\bar x^2}{S_{xx}} \to 0$$
((∗): Chebyshev's inequality, which says that if E(X) = μ and Var(X) = σ², then P(|X − μ| > ε) ≤ σ²/ε²), and
$$P(|\hat\beta_1 - \beta_1| > \varepsilon) \le \frac{\sigma^2}{\varepsilon^2}\cdot\frac{1}{S_{xx}} \to 0 \quad \text{if } \frac{1}{S_{xx}} \to 0,$$
noting that $S_{xx} = \sum_{i=1}^n (x_i - \bar x)^2$ and $\bar x = n^{-1}\sum_{i=1}^n x_i$.
As a result, to ensure β̂ →ᵖʳ β, we need
x̄²/S_{xx} → 0 and 1/S_{xx} → 0 as n → ∞.
Remark
(i) Please give a heuristic explanation of why x̄²/S_{xx} → 0 is needed for β̂₀ to converge to β₀ in probability.
(ii) Please explain why Cov(β̂₀, β̂₁) is positive (negative) when x̄ < 0 (x̄ > 0).
(iii) What are the sufficient conditions for β̂ →ᵖʳ β in general cases?
Question 2
An answer to Q2:
We have shown previously that the variance of σ̂² converges to 0 as n → ∞. Therefore, by Chebyshev's inequality,
σ̂² →ᵖʳ σ².
Before answering Q3, let us consider the so-called spectral decomposition for symmetric matrices.
Let A be a k × k symmetric matrix. Then there exist real numbers λ₁, …, λ_k and a k-dimensional orthogonal matrix P = (p₁, …, p_k) satisfying P⊤P = PP⊤ = I and Apᵢ = λᵢpᵢ such that
A = PDP⊤,
where D = diag(λ₁, …, λ_k).
Remark
(1) λᵢ is called an eigenvalue of A and pᵢ is the eigenvector corresponding to λᵢ.
(2) Let A be positive definite. Then λᵢ > 0 for i = 1, …, k.
Proof. 0 < pᵢ⊤Apᵢ = pᵢ⊤PDP⊤pᵢ = λᵢ, where the first inequality holds by positive definiteness and the last equality by the spectral decomposition.
(3) Let A be positive definite. Define
A^{1/2} = PD^{1/2}P⊤, where D^{1/2} = diag(λ₁^{1/2}, …, λ_k^{1/2}).
Then, we have (A^{1/2})² = A.
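The symmetric square root in remark (3) is easy to compute via the spectral decomposition; here is a small added sketch (`sym_sqrt` is a hypothetical name):

```python
import numpy as np

def sym_sqrt(A):
    """A^{1/2} = P D^{1/2} P' for a symmetric positive definite A."""
    lam, P = np.linalg.eigh(A)          # spectral decomposition A = P D P'
    return P @ np.diag(np.sqrt(lam)) @ P.T

A = np.array([[4.0, 1.0], [1.0, 3.0]])
S = sym_sqrt(A)
print(np.allclose(S @ S, A))            # (A^{1/2})^2 = A
```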
Remark (cont.)
(4) Define λ_max(A) = max{λ₁, …, λ_k} and λ_min(A) = min{λ₁, …, λ_k}. Then,
$$\lambda_{\max}(A) = \sup_{\|a\|=1} a^\top Aa \quad \text{and} \quad \lambda_{\min}(A) = \inf_{\|a\|=1} a^\top Aa.$$
Proof. As shown before,
λᵢ = pᵢ⊤Apᵢ ≤ sup_{‖a‖=1} a⊤Aa.
Moreover, for any a ∈ R^k with ‖a‖ = 1, we have a = Pb, where ‖b‖ = 1. Thus,
$$a^\top Aa = b^\top P^\top PDP^\top Pb = b^\top Db = \sum_{i=1}^k \lambda_ib_i^2 \le \lambda_{\max}(A),$$
where b = (b₁, …, b_k)⊤. This yields λ_max(A) = sup_{‖a‖=1} a⊤Aa. The second statement can be proven similarly.
Remark (cont.)
(5) Let A be positive definite. Then λ_max(A⁻¹) = 1/λ_min(A).
(6) Let B be any real matrix. Define the "spectral norm" of B as
$$\|B\| = \left(\sup_{\|a\|=1} a^\top B^\top Ba\right)^{1/2} = \big(\lambda_{\max}(B^\top B)\big)^{1/2}.$$
We have:
(i) If B is symmetric with eigenvalues λ₁, …, λ_k, then ‖B‖ = max{|λ₁|, …, |λ_k|}.
(ii) ‖AB‖ ≤ ‖A‖‖B‖, where A is another real matrix whose number of columns equals the number of rows of B.
(iii) ‖A + B‖ ≤ ‖A‖ + ‖B‖, where A and B have the same numbers of rows and columns.
(iv) If B is positive definite, then ‖B‖ ≤ tr(B) = Σ_{i=1}^k λᵢ, where λᵢ, i = 1, …, k, are the eigenvalues of B.
Remark (cont.)
(7) Let X be the design matrix of a regression model, i.e.,
$$X = \begin{pmatrix} 1 & x_{11} & \cdots & x_{1k} \\ \vdots & \vdots & & \vdots \\ 1 & x_{n1} & \cdots & x_{nk} \end{pmatrix}.$$
Then,
$$\lambda_{\max}(X^\top X) = \sup_{\|a\|=1}\sum_{i=1}^n (a^\top x_i)^2 \quad \text{and} \quad \lambda_{\min}(X^\top X) = \inf_{\|a\|=1}\sum_{i=1}^n (a^\top x_i)^2.$$
(8) Let x ∼ N(0, Σ) be a p-dimensional multivariate normal vector. Then Σ^{−1/2}x ∼ N(0, I), and hence x⊤Σ⁻¹x ∼ χ²(p), which has been shown previously in a different way.
We now revisit the question of what makes β̂ →ᵖʳ β in general cases.
The answer to this question is simple. Since
Var(β̂) = (X⊤X)⁻¹σ²,
we only need to show that "each diagonal element of (X⊤X)⁻¹ converges to 0". (∗)
To show (∗), note first that
X⊤X = T⊤(T⊤)⁻¹X⊤XT⁻¹T,
where
$$T = \begin{pmatrix} 1 & \bar x_1 & \cdots & \bar x_k \\ 0 & 1 & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & \cdots & 0 & 1 \end{pmatrix} \quad \text{and} \quad T^{-1} = \begin{pmatrix} 1 & -\bar x_1 & \cdots & -\bar x_k \\ 0 & 1 & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & \cdots & 0 & 1 \end{pmatrix}.$$
Moreover, we have
$$(T^\top)^{-1}X^\top XT^{-1} = \begin{pmatrix} n & 0^\top \\ 0 & \mathring X^\top\big(I - \frac{E}{n}\big)\mathring X \end{pmatrix},$$
where
$$\mathring X = \begin{pmatrix} x_{11} & \cdots & x_{1k} \\ \vdots & & \vdots \\ x_{n1} & \cdots & x_{nk} \end{pmatrix} \quad \text{and} \quad E = \begin{pmatrix} 1 & \cdots & 1 \\ \vdots & & \vdots \\ 1 & \cdots & 1 \end{pmatrix},$$
noting that
$$\Big(I - \frac{E}{n}\Big)\mathring X = \begin{pmatrix} x_{11} - \bar x_1 & \cdots & x_{1k} - \bar x_k \\ \vdots & & \vdots \\ x_{n1} - \bar x_1 & \cdots & x_{nk} - \bar x_k \end{pmatrix},$$
whose rows (x_{11} − x̄₁, …, x_{1k} − x̄_k), …, (x_{n1} − x̄₁, …, x_{nk} − x̄_k) are the centered data vectors.
Hence
$$(X^\top X)^{-1} = T^{-1}\begin{pmatrix} n^{-1} & 0^\top \\ 0 & \big(\mathring X^\top(I - \frac{E}{n})\mathring X\big)^{-1} \end{pmatrix}(T^{-1})^\top,$$
yielding
$$(X^\top X)^{-1} = \begin{pmatrix} \frac{1}{n} + \bar x^\top D^{-1}\bar x & -\bar x^\top D^{-1} \\ -D^{-1}\bar x & D^{-1} \end{pmatrix},$$
where (T⁻¹)⊤ = (T⊤)⁻¹, x̄ = (x̄₁, …, x̄_k)⊤, and D = X̊⊤(I − E/n)X̊.
This implies that for each 1 ≤ i ≤ k+1,
$$(X^\top X)^{-1}_{ii} \le \max\left\{\frac{1}{n} + \bar x^\top D^{-1}\bar x,\ \lambda_{\max}(D^{-1})\right\} \le \max\left\{\frac{1}{n} + \frac{\|\bar x\|^2}{\lambda_{\min}(D)},\ \frac{1}{\lambda_{\min}(D)}\right\},$$
where λ_max(D⁻¹) = 1/λ_min(D). This bound converges to 0 if
(i) λ_min(X̊⊤(I − E/n)X̊) → ∞ as n → ∞,
(ii) Σ_{i=1}^k x̄ᵢ² / λ_min(X̊⊤(I − E/n)X̊) → 0 as n → ∞.
Please compare these two conditions with the answer given in Q1.
[Figure illustrating the two conditions omitted.] The above conditions require:
(i) even in the direction in which the data are most narrowly spread (viewed from the location of (x̄₁, …, x̄_k)), the data must still have a sufficiently large sum of squares (information);
(ii) the squared distance from the center of the data to the origin is negligible compared with λ_min(X̊⊤(I − E/n)X̊).
Toward Large Sample Theory II
Question 3
Note first that
$$\hat\beta - \beta = (X^\top X)^{-1}X^\top(y - X\beta) = (X^\top X)^{-1}X^\top\varepsilon = \left(\sum_{i=1}^n x_ix_i^\top\right)^{-1}\begin{pmatrix} \sum_{i=1}^n \varepsilon_i \\ \sum_{i=1}^n x_i\varepsilon_i \end{pmatrix},$$
noting that we first consider
$$X = \begin{pmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix}.$$
Since for ε ∼ N(0, σ²I) we have
β̂ − β ∼ N(0, σ²(X⊤X)⁻¹),
it is natural to conjecture that when ε is not normally distributed,
$$\frac{(X^\top X)^{1/2}}{\sigma}(\hat\beta - \beta) \overset{d}{\to} N(0, I). \quad (*)$$
Definition
A sequence of random vectors {xₙ} is said to converge to a random vector x in distribution if
P(xₙ ≤ c) → P(x ≤ c) ≡ F(c) as n → ∞
for all continuity points c of F(·), the distribution function of x; this is denoted by xₙ →ᵈ x.
Remark
Cramér-Wold device:
xₙ →ᵈ x ⇔ a⊤xₙ →ᵈ a⊤x for any ‖a‖ = 1.
Therefore, (∗) holds iff
$$a^\top\frac{(X^\top X)^{1/2}}{\sigma}(\hat\beta - \beta) = a^\top\left(\sum_{i=1}^n x_ix_i^\top\right)^{-1/2}\begin{pmatrix} \sum_{i=1}^n \varepsilon_i/\sigma \\ \sum_{i=1}^n x_i\varepsilon_i/\sigma \end{pmatrix} = \sum_{i=1}^n\left(\frac{w_{1n} + w_{2n}x_i}{\sigma}\right)\varepsilon_i \overset{d}{\to} N(0, 1),$$
where $(w_{1n}, w_{2n}) = a^\top\big(\sum_{i=1}^n x_ix_i^\top\big)^{-1/2}$.
Lindeberg's Central Limit Theorem (for the sum of independent r.v.s)
Let Z_{1n}, …, Z_{nn} be a sequence of independent r.v.s with E(Z_{in}) = 0 and Σ_{i=1}^n E(Z_{in}²) = Σ_{i=1}^n σ_{in}² = 1 for all n. If for any δ > 0,
$$\sum_{i=1}^n E\big(Z_{in}^2I_{|Z_{in}|>\delta}\big) \to 0 \quad \text{as } n \to \infty, \quad \text{(Lindeberg's condition)}$$
then Σ_{i=1}^n Z_{in} →ᵈ N(0, 1).
(No single summand dominates; hence after this "uniform" mixing, the features of the original distributions vanish and the limit is the normal distribution.)
Remark
(1) Lindeberg's condition implies
max_{1≤i≤n} σ_{in}² → 0 as n → ∞.
To see this, we note that for "any" δ > 0,
$$\max_{1\le i\le n}\sigma_{in}^2 = \max_{1\le i\le n}E(Z_{in}^2) \le \max_{1\le i\le n}E\big(Z_{in}^2I_{|Z_{in}|>\delta}\big) + \delta^2.$$
Since the first term converges to 0 by Lindeberg's condition and since δ can be arbitrarily small, the desired conclusion follows.
(2) Lindeberg's condition ⇔ CLT + max_{1≤i≤n} σ_{in}² → 0 as n → ∞.
Now we are in a position to check Lindeberg's condition for
$$Z_{in} = \left(\frac{w_{1n} + w_{2n}x_i}{\sigma}\right)\varepsilon_i \equiv v_{in}\varepsilon_i.$$
(i) E(v_{in}εᵢ) = 0. (easy)
(ii) Σ_{i=1}^n E(v_{in}²εᵢ²) = 1. (easy, but why?)
(iii) Assume E^{1/2}(ε₁⁴) < C₁ < ∞. Then, for some constants C₂, C₃,
$$\begin{aligned} \sum_{i=1}^n E\big[v_{in}^2\varepsilon_i^2I_{\{v_{in}^2\varepsilon_i^2>\delta^2\}}\big] &= \sum_{i=1}^n v_{in}^2E\big[\varepsilon_i^2I_{\{v_{in}^2\varepsilon_i^2>\delta^2\}}\big] \overset{\text{why?}}{\le} \sum_{i=1}^n v_{in}^2E^{1/2}(\varepsilon_i^4)P^{1/2}(v_{in}^2\varepsilon_i^2>\delta^2) \\ &\le C_1\sum_{i=1}^n v_{in}^2\frac{E^{1/2}(v_{in}^2\varepsilon_i^2)}{\delta} \le C_2\left(\sum_{i=1}^n v_{in}^2\right)\max_{1\le i\le n}|v_{in}| \le C_3\max_{1\le i\le n}|v_{in}|. \end{aligned}$$
Therefore, Lindeberg's condition holds for v_{in}εᵢ if
$$\begin{aligned} \max_{1\le i\le n}(v_{in}^2) &= \sigma^{-2}\max_{1\le i\le n}\left(a^\top\Big(\sum_{i=1}^n x_ix_i^\top\Big)^{-1/2}\binom{1}{x_i}\right)^2 \\ &\le \sigma^{-2}a^\top\Big(\sum_{i=1}^n x_ix_i^\top\Big)^{-1}a\,\big(1 + \max_{1\le i\le n}x_i^2\big) \\ &\overset{(*)}{\le} \sigma^{-2}\lambda_{\max}\left(\Big(\sum_{i=1}^n x_ix_i^\top\Big)^{-1}\right)\big(1 + \max_{1\le i\le n}x_i^2\big) \\ &= \sigma^{-2}\frac{1 + \max_{1\le i\le n}x_i^2}{\lambda_{\min}\Big(\sum_{i=1}^n x_ix_i^\top\Big)} \to 0, \quad \text{as } n \to \infty. \end{aligned}$$
(∗) To see this, we have by the spectral decomposition of A, A = PDP⊤, where D = diag(λ₁, …, λ_k) with 0 < λ₁ ≤ λ₂ ≤ ⋯ ≤ λ_k, and P = (p₁, …, p_k) satisfies
Apᵢ = λᵢpᵢ, pᵢ⊤pᵢ = 1, and pᵢ⊤pⱼ = 0 for i ≠ j.
Hence,
$$p_k^\top Ap_k = p_k^\top PDP^\top p_k = (0, \ldots, 0, 1)\,\mathrm{diag}(\lambda_1, \ldots, \lambda_k)\,(0, \ldots, 0, 1)^\top = \lambda_k \le \sup_{\|a\|=1}a^\top Aa.$$
On the other hand, for any a ∈ R^k with ‖a‖ = 1, we can express it as a = Pb with ‖b‖ = 1. Thus,
$$a^\top Aa = b^\top P^\top PDP^\top Pb = b^\top Db = \sum_{i=1}^k \lambda_ib_i^2 \le \lambda_k,$$
where b = (b₁, …, b_k)⊤. As a result,
λ_k = sup_{‖a‖=1} a⊤Aa.
Similarly, it can be shown that λ₁ = inf_{‖a‖=1} a⊤Aa.
To give a more comprehensive sufficient condition, we note that
$$\begin{aligned} \lambda_{\min}\left(\sum_{i=1}^n x_ix_i^\top\right) &= \lambda_{\min}\left(\begin{pmatrix} n & \sum x_i \\ \sum x_i & \sum x_i^2 \end{pmatrix}\right) \\ &= \lambda_{\min}\left(\begin{pmatrix} 1 & 0 \\ \bar x & 1 \end{pmatrix}\begin{pmatrix} 1 & 0 \\ -\bar x & 1 \end{pmatrix}\begin{pmatrix} n & \sum x_i \\ \sum x_i & \sum x_i^2 \end{pmatrix}\begin{pmatrix} 1 & -\bar x \\ 0 & 1 \end{pmatrix}\begin{pmatrix} 1 & \bar x \\ 0 & 1 \end{pmatrix}\right) \\ &= \lambda_{\min}\left(\begin{pmatrix} 1 & 0 \\ \bar x & 1 \end{pmatrix}\begin{pmatrix} n & 0 \\ 0 & S_{xx} \end{pmatrix}\begin{pmatrix} 1 & \bar x \\ 0 & 1 \end{pmatrix}\right) \\ &\overset{\text{why?}}{\ge} \min\{n, S_{xx}\}\,\lambda_{\min}\left(\begin{pmatrix} 1 & 0 \\ \bar x & 1 \end{pmatrix}\begin{pmatrix} 1 & \bar x \\ 0 & 1 \end{pmatrix}\right) \overset{(*)}{\ge} C\min\{n, S_{xx}\}, \end{aligned}$$
provided x̄ < ∞, where S_{xx} = Σ(xᵢ − x̄)².
Explanation:
"why?": λ_min(B⊤AB) ≥ λ_min(B⊤B)λ_min(A).
(∗): if x̄ < ∞, the smallest eigenvalue of the last product is "bounded away" from 0. (We will show this later.)
In view of this, a set of more transparent sufficient conditions for Lindeberg's condition is
(i) max_{1≤i≤n} xᵢ²/n → 0,
(ii) S_{xx} → ∞ [this one is also needed for β̂ →ᵖʳ β],
(iii) max_{1≤i≤n} xᵢ²/S_{xx} → 0.
Can you answer Q3 under general multiple regression models?
In fact, for general multiple regression (k ≥ 1), it is not difficult to show that Lindeberg's condition holds when
$$\frac{1 + \max_{1\le i\le n}\sum_{j=1}^k x_{ij}^2}{\lambda_{\min}(X^\top X)} \to 0 \quad \text{as } n \to \infty. \quad (\triangleright)$$
(Compare with the k = 1 case.)
A further question is whether we can obtain conditions analogous to (i), (ii), (iii) of the k = 1 case under which (▷) holds. To answer this question, we need a little linear algebra.
(1) Let
$$T = \begin{pmatrix} 1 & c_1 & \cdots & c_k \\ 0 & 1 & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & \cdots & 0 & 1 \end{pmatrix} = \begin{pmatrix} 1 & c^\top \\ 0 & I_k \end{pmatrix},$$
where c = (c₁, …, c_k)⊤ and I_k is the k-dimensional identity matrix. Then we have
$$\lambda_{\min}(T^\top T) \ge \frac{1}{2 + c^\top c}. \quad (*)$$
Proof of (∗)
Since (∗) holds trivially when c = 0, we only consider the case c ≠ 0.
Note first that
$$E^* = T^\top T = \begin{pmatrix} 1 & c^\top \\ c & cc^\top + I_k \end{pmatrix},$$
and the eigenvalues of E* are those λ satisfying
det(E* − λI_{k+1}) = 0, (∗∗)
where I_{k+1} is the (k+1)-dimensional identity matrix.
In addition,
$$\det(E^* - \lambda I_{k+1}) = \det\begin{pmatrix} 1-\lambda & c^\top \\ c & cc^\top + (1-\lambda)I_k \end{pmatrix} = \begin{cases} \det\begin{pmatrix} 0 & c^\top \\ c & cc^\top \end{pmatrix}, & \text{if } \lambda = 1; \\[1ex] \det\begin{pmatrix} 1-\lambda & 0^\top \\ c & \big(1 - \frac{1}{1-\lambda}\big)cc^\top + (1-\lambda)I_k \end{pmatrix}, & \text{if } \lambda \ne 1. \end{cases}$$
Proof of (∗) (cont.)
For λ = 1,
$$\det\begin{pmatrix} 0 & c^\top \\ c & cc^\top \end{pmatrix} = \begin{cases} -c_1^2 \ne 0, & \text{if } k = 1; \\ 0, & \text{if } k > 1. \end{cases}$$
For λ ≠ 1,
$$\begin{aligned} \det\begin{pmatrix} 1-\lambda & 0^\top \\ c & \big(1 - \frac{1}{1-\lambda}\big)cc^\top + (1-\lambda)I_k \end{pmatrix} &= (1-\lambda)\det\left(\Big(1 - \frac{1}{1-\lambda}\Big)cc^\top + (1-\lambda)I_k\right) \\ &\qquad \text{(because this is a block triangular matrix)} \\ &= (1-\lambda)^{k+1}\det\left(I_k + \Big(\frac{1}{1-\lambda} - \frac{1}{(1-\lambda)^2}\Big)cc^\top\right) \\ &\qquad (\det(aA_k) = a^k\det(A_k)) \\ &= (1-\lambda)^{k+1}\det(I_k)\left(1 + \Big(\frac{1}{1-\lambda} - \frac{1}{(1-\lambda)^2}\Big)c^\top c\right) \\ &\qquad (\text{please try to prove } \det(A + abb^\top) = \det(A)(1 + ab^\top A^{-1}b)) \\ &= (1-\lambda)^{k-1}(\lambda^2 - (2 + c^\top c)\lambda + 1). \end{aligned}$$
Proof of (∗) (cont.)
Therefore, the roots of (∗∗) are
$$\lambda = 1 \quad \text{or} \quad \lambda = \frac{(2 + c^\top c)\left(1 \pm \sqrt{1 - \frac{4}{(2+c^\top c)^2}}\right)}{2},$$
yielding
$$\lambda_{\min}(T^\top T) \ge \min\left\{1,\ \frac{(2+c^\top c)\left(1 - \sqrt{1 - \frac{4}{(2+c^\top c)^2}}\right)}{2}\right\} \ge \min\left\{1, \frac{1}{2+c^\top c}\right\} = \frac{1}{2+c^\top c},$$
since √(1−x) ≤ 1 − x/2. Thus the proof of (∗) is complete.
(2) We have shown previously that
$$X^\top X = T^\top\begin{pmatrix} n & 0^\top \\ 0 & D \end{pmatrix}T,$$
where
$$T = \begin{pmatrix} 1 & \bar x_1 & \cdots & \bar x_k \\ 0 & 1 & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & \cdots & 0 & 1 \end{pmatrix} \quad \text{and} \quad D = \mathring X^\top\Big(I - \frac{E}{n}\Big)\mathring X.$$
By λ_min(B⊤AB) ≥ λ_min(B⊤B)λ_min(A) and (∗), we obtain
$$\lambda_{\min}(X^\top X) \ge \lambda_{\min}(T^\top T)\lambda_{\min}\begin{pmatrix} n & 0^\top \\ 0 & D \end{pmatrix} \ge \frac{1}{2 + \sum_{i=1}^k \bar x_i^2}\min\{n, \lambda_{\min}(D)\} \ge \frac{1}{2+V}\min\{n, \lambda_{\min}(D)\}.$$
Here we assume Σ_{i=1}^k x̄ᵢ² < V < ∞ (to keep the discussion focused).
(3) Finally, for (▷) to hold, we give the following sufficient conditions:
(i′) max_{1≤i≤n} Σ_{j=1}^k x_{ij}² / n → 0,
(ii′) λ_min(D) → ∞ (we have already explained its meaning),
(iii′) max_{1≤i≤n} Σ_{j=1}^k x_{ij}² / λ_min(D) → 0.
Clearly, (i), (ii), (iii) and (i′), (ii′), (iii′) correspond to each other.
Toward Large Sample Theory III
Questions 4 and 5
Q4 and Q5: How does one construct confidence intervals (CIs) and testing rules when ε is not normal?
Some basic probabilistic tools
(A) Slutsky's theorem.
If Xₙ →ᵈ X, Yₙ →ᵖʳ a, and Zₙ →ᵖʳ b, where a is a vector of real numbers and b is a real number, then
Yₙ⊤Xₙ + Zₙ →ᵈ a⊤X + b.
Corollary. If Xₙ →ᵈ X and Yₙ − Xₙ →ᵖʳ 0, then Yₙ →ᵈ X.
Proof. Since Yₙ = Xₙ − (Xₙ − Yₙ), the conclusion follows immediately from Slutsky's theorem.
(B) Big-O and small-o notation for a sequence of random vectors.
Let aₙ be a sequence of positive numbers. We say
Xₙ = O_p(aₙ),
where Xₙ is a sequence of random vectors, if for any ε > 0 there exist 0 < M_ε < ∞ and a positive integer N such that for all n ≥ N,
P(‖Xₙ/aₙ‖ > M_ε) < ε,
and Xₙ = o_p(aₙ) if Xₙ/aₙ →ᵖʳ 0.
(C) Big-O and small-o notation for a sequence of vectors of real numbers.
Let {wₙ} be a sequence of vectors of real numbers and {aₙ} a sequence of positive numbers. We say wₙ = O(aₙ) if there exist 0 < M < ∞ and a positive integer N such that for all n ≥ N, ‖wₙ/aₙ‖ < M, and wₙ = o(aₙ) if wₙ/aₙ → 0.
(D) Some rules.
Let Xₙ = o_p(1), O_p(1), o(1), or O(1), and Yₙ = o_p(1), O_p(1), o(1), or O(1).
For "+":
      | o_p  O_p  o    O
o_p   | o_p  O_p  o_p  O_p
O_p   | −    O_p  O_p  O_p
o     | −    −    o    O
O     | −    −    −    O
For "×" (product):
      | o_p  O_p  o    O
o_p   | o_p  o_p  o_p  o_p
O_p   | −    O_p  o_p  O_p
o     | −    −    o    o
O     | −    −    −    O
(E) If Xₙ = O_p(aₙ), then Xₙ/aₙ = O_p(1); if Xₙ = o_p(aₙ), then Xₙ/aₙ = o_p(1).
(F) If Xₙ →ᵈ X, then Xₙ = O_p(1), and if E‖Xₙ‖^q < K < ∞ for some q > 0 and for all n, then Xₙ = O_p(1).
(G) If Xₙ →ᵖʳ X and Yₙ →ᵖʳ Y, then (Xₙ⊤, Yₙ⊤)⊤ →ᵖʳ (X⊤, Y⊤)⊤. If Xₙ →ᵈ X and Yₙ →ᵈ Y, then (Xₙ⊤, Yₙ⊤)⊤ →ᵈ (X⊤, Y⊤)⊤, provided {Xₙ} and {Yₙ} are independent.
(H) Continuous mapping theorem.
If Xₙ → X in probability or in distribution and f(·) is a continuous function, then f(Xₙ) → f(X) in the same mode.
(I) Delta method.
If √n(Zₙ − u) →ᵈ N(0_{k×1}, V_{k×k}) and f(·) = (f₁(·), …, f_m(·))⊤ : R^k → R^m is a "sufficiently smooth" function, then
$$\sqrt n(f(Z_n) - f(u)) \overset{d}{\to} N(0_{m\times 1}, (\nabla f(u))^\top V(\nabla f(u))), \quad (*)$$
where
$$\nabla f(\cdot) = \begin{pmatrix} \frac{\partial f_1(\cdot)}{\partial x_1} & \cdots & \frac{\partial f_m(\cdot)}{\partial x_1} \\ \vdots & & \vdots \\ \frac{\partial f_1(\cdot)}{\partial x_k} & \cdots & \frac{\partial f_m(\cdot)}{\partial x_k} \end{pmatrix}$$
is a k × m matrix.
Sketch of the proof. By Taylor's theorem, f(Zₙ) ≈ f(u) + (∇f(u))⊤(Zₙ − u), which yields
√n(f(Zₙ) − f(u)) ≈ (∇f(u))⊤√n(Zₙ − u).
This and the CLT for Zₙ (given as an assumption) lead to the desired conclusion.
We are now ready to answer Q4 & Q5.
(1) An alternative version of the CLT for β̂.
Recall that
$$\frac{(X^\top X)^{1/2}}{\sigma}(\hat\beta - \beta) \overset{d}{\to} N(0, I)$$
under suitable conditions. (What are they?)
Assume
$$R_n = \frac{1}{n}X^\top X = \frac{1}{n}\sum_{i=1}^n x_ix_i^\top \overset{n\to\infty}{\longrightarrow} R,$$
where R is a positive definite matrix.
Then it can be shown that
$$\frac{1}{\sigma}R^{1/2}\sqrt n(\hat\beta - \beta) \overset{d}{\to} N(0, I). \quad (*)$$
By (∗), we have
$$\sqrt n(\hat\beta - \beta) \overset{d}{\to} N(0, R^{-1}\sigma^2). \quad (**)$$
Additional materials
(i) ‖σ⁻¹(Rₙ^{1/2} − R^{1/2})√n(β̂ − β)‖ ≤ σ⁻¹‖Rₙ^{1/2} − R^{1/2}‖‖√n(β̂ − β)‖ (since ‖Ax‖² = x⊤A⊤Ax ≤ ‖A‖²‖x‖²).
(ii) ‖Rₙ^{1/2} − R^{1/2}‖ = o(1). (it's obvious)
(iii) E‖√n(β̂ − β)‖² = tr((X⊤X/n)⁻¹)σ² = tr(Rₙ⁻¹)σ² → tr(R⁻¹)σ² < ∞ as n → ∞ (why? R is p.d.).
(iv) By (i)–(iii), we have
$$\left\|\sigma^{-1}(R_n^{1/2} - R^{1/2})\sqrt n(\hat\beta - \beta)\right\| = o(1)O_p(1) = o_p(1),$$
yielding that σ⁻¹R^{1/2}√n(β̂ − β) and σ⁻¹Rₙ^{1/2}√n(β̂ − β) have the same limiting distribution (by Slutsky's theorem), which is N(0, I).
(2) Consider the problem of testing the nonlinear null hypothesis
H₀ : β₀ + β₁² = d
for some known d, against the alternative hypothesis
H_A : β₀ + β₁² ≠ d.
To simplify the discussion, we again assume that
$$X = \begin{pmatrix} 1 & X_1 \\ \vdots & \vdots \\ 1 & X_n \end{pmatrix}, \quad \text{hence } \beta = \binom{\beta_0}{\beta_1} \text{ and } \hat\beta = \binom{\hat\beta_0}{\hat\beta_1}.$$
Set f(β) = β₀ + β₁². Then ∇f(β) = (1, 2β₁)⊤.
By the δ-method and (∗∗), we obtain
$$\sqrt n(f(\hat\beta) - f(\beta)) \overset{H_0}{=} \sqrt n(f(\hat\beta) - d) \overset{d}{\to} N\left(0,\ (1, 2\beta_1)R^{-1}\binom{1}{2\beta_1}\sigma^2\right),$$
which implies
$$\frac{\sqrt n(f(\hat\beta) - d)}{\sigma\sqrt{(1, 2\beta_1)R^{-1}\binom{1}{2\beta_1}}} \overset{d}{\to} N(0, 1). \quad ({*}{*}{*})$$
Moreover, it holds that
$$\hat\sigma\sqrt{(1, 2\hat\beta_1)R_n^{-1}\binom{1}{2\hat\beta_1}} \overset{pr.}{\to} \sigma\sqrt{(1, 2\beta_1)R^{-1}\binom{1}{2\beta_1}}$$
(because β̂₁ →ᵖʳ β₁ and Rₙ → R).
This, (∗∗∗) and Slutsky's theorem together imply
$$\frac{\sqrt n(f(\hat\beta) - d)}{\hat\sigma\sqrt{(1, 2\hat\beta_1)R_n^{-1}\binom{1}{2\hat\beta_1}}} \overset{d}{\to} N(0, 1).$$
This result enables us to construct the following testing rule: reject H₀ if
$$f(\hat\beta) = \hat\beta_0 + \hat\beta_1^2 > d + \frac{1.96\,\hat\sigma\sqrt{(1, 2\hat\beta_1)R_n^{-1}\binom{1}{2\hat\beta_1}}}{\sqrt n} \quad \text{or} \quad f(\hat\beta) = \hat\beta_0 + \hat\beta_1^2 < d - \frac{1.96\,\hat\sigma\sqrt{(1, 2\hat\beta_1)R_n^{-1}\binom{1}{2\hat\beta_1}}}{\sqrt n},$$
which is an "asymptotic" level-5% test, i.e., P_{H₀}(reject H₀) → 5% as n → ∞.
(3) Consider the problem of testing the linear hypothesis
H₀ : D_{q×k}β_{k×1} = γ_{q×1} against H_A : ∼H₀,
where D_{q×k} and γ_{q×1} are known.
Set f(β) = Dβ. By the δ-method and the CLT for β̂, we have under H₀,
$$\sqrt n(f(\hat\beta) - \gamma) \overset{d}{\to} N(0, DR^{-1}D^\top\sigma^2),$$
and hence by the continuous mapping theorem,
$$\frac{n(f(\hat\beta) - \gamma)^\top(DR^{-1}D^\top)^{-1}(f(\hat\beta) - \gamma)}{\sigma^2} \overset{d}{\to} \chi^2(q).$$
This, σ̂² →ᵖʳ σ², Rₙ → R, and Slutsky's theorem further give (some algebraic manipulations are needed!!)
$$w_1 = \frac{n(f(\hat\beta) - \gamma)^\top(DR_n^{-1}D^\top)^{-1}(f(\hat\beta) - \gamma)}{\hat\sigma^2} \overset{d}{\to} \chi^2(q).$$
Therefore, the following testing rule:
reject H₀ if w₁ > χ²_{1−α}(q),
is an asymptotic level-α test.
Please compare this asymptotic test with its counterpart derived from the finite-sample theory under normal assumptions.
Appendix
Statistical View of Spectral Decomposition
1. Without loss of generality, we can assume Γ = E(xx⊤), where x is a p-dimensional random vector with E(x) = 0.
2. Define
$$a_1 = \operatorname*{argmax}_{c\in\{s\in\mathbb{R}^p:\|s\|=1\}} E((c^\top x)^2) \quad \text{and} \quad \lambda_1^* = E((a_1^\top x)^2).$$
By the Lagrange multiplier method, Γa₁ = λ₁*a₁. Define
$$v_1 = a_1^\top x \quad \text{and} \quad u_1 = \operatorname*{argmin}_{c\in\mathbb{R}^p} E((x - cv_1)^\top(x - cv_1)).$$
Then,
$$u_1 = \frac{E(xv_1)}{\lambda_1^*} = \frac{\Gamma a_1}{\lambda_1^*} = a_1,$$
$$R_1 := x - u_1v_1 = x - a_1v_1 = x - a_1a_1^\top x = (I_p - a_1a_1^\top)x,$$
and
$$\Gamma_1 := \mathrm{Var}(R_1) = E((I_p - a_1a_1^\top)xx^\top(I_p - a_1a_1^\top)) = \Gamma - \lambda_1^*a_1a_1^\top.$$
3. Define
$$a_2 = \operatorname*{argmax}_{c\in\{s\in\mathbb{R}^p:\|s\|=1,\ s^\top a_1=0\}} E((c^\top R_1)^2) \quad \text{and} \quad \lambda_2^* = E((a_2^\top R_1)^2).$$
By the Lagrange multiplier method, we set
$$\frac{\partial}{\partial c}\big(c^\top(\Gamma - \lambda_1^*a_1a_1^\top)c - h_1c^\top a_1 - h_2(c^\top c - 1)\big) = 0,$$
together with the corresponding derivatives with respect to h₁ and h₂ set to 0, and obtain h₁ = 0 and h₂ = c⊤Γc. Therefore,
$$\Gamma_1a_2 = (\Gamma - \lambda_1^*a_1a_1^\top)a_2 = \Gamma a_2 = \lambda_2^*a_2.$$
3. (cont.) Define
$$v_2 = a_2^\top R_1 = a_2^\top x \quad \text{and} \quad u_2 = \operatorname*{argmin}_{c\in\mathbb{R}^p} E((R_1 - cv_2)^\top(R_1 - cv_2)).$$
Then,
$$u_2 = \frac{E(R_1v_2)}{\lambda_2^*} = a_2, \quad \text{and} \quad R_2 := R_1 - u_2v_2 = (I_p - (a_1a_1^\top + a_2a_2^\top))x.$$
4. By a similar argument as above, we have R_p := (I_p − Σ_{i=1}^p aᵢaᵢ⊤)x = 0, and hence
$$\begin{aligned} O = \mathrm{Var}(R_p) &= \left(I_p - \sum_{i=1}^p a_ia_i^\top\right)\Gamma\left(I_p - \sum_{i=1}^p a_ia_i^\top\right) \\ &= \left(\Gamma - 2\sum_{i=1}^p \lambda_i^*a_ia_i^\top\right) + \sum_{i=1}^p\sum_{j=1}^p \lambda_i^*a_ia_i^\top a_ja_j^\top \\ &= \Gamma - \sum_{i=1}^p \lambda_i^*a_ia_i^\top, \end{aligned}$$
where O is a p × p zero matrix.
5. Define P = (a₁, …, a_p) and D = diag(λ₁*, …, λ_p*). Then,
$$\Gamma = \sum_{i=1}^p \lambda_i^*a_ia_i^\top = PDP^\top.$$
Limit Theorems
Continuous Mapping Theorem
Fact 1
Let Xₙ →ᵖʳ X and let g be a continuous function on R. Then g(Xₙ) →ᵖʳ g(X).
Proof of Fact 1
For any ε > 0, there exists a large k such that P(|X| > k) ≤ ε/2.
Moreover, we have for any δ > 0 and n ≥ N_{δ,ε}, P(|Xₙ − X| > δ) ≤ ε/2.
Since g(x) is uniformly continuous on [−k, k], there exists a δ* > 0 such that
|g(x) − g(y)| ≤ ε for all |x − y| ≤ δ* and |x| ≤ k.
Now, |g(X) − g(Xₙ)| > ε implies |X − Xₙ| > δ* or |X| > k, and hence
P(|g(X) − g(Xₙ)| > ε) ≤ P(|X − Xₙ| > δ*) + P(|X| > k) ≤ ε for n ≥ N_{δ*,ε}.
Therefore, P(|g(Xₙ) − g(X)| > ε) → 0 as n → ∞.
Remark
If Xₙ →ᵈ X and g is a continuous function on R, then g(Xₙ) →ᵈ g(X).
Fact 2
If Xₙ →ᵖʳ X, then Xₙ →ᵈ X.
Definition
Let {aₙ} be a sequence of real numbers. We denote aₙ → 0 by aₙ = o(1).
Proof of Fact 2
Goal: Fₙ(x) → F(x) for all x ∈ C(F), where F(x) = P(X ≤ x) and Fₙ(x) = P(Xₙ ≤ x).
Let x ∈ C(F) with x′ < x < x″. By Xₙ →ᵖʳ X,
$$P(X \le x') = P(X \le x', X_n \le x) + P(X \le x', X_n > x) = P(X \le x', X_n \le x) + o(1) \le F_n(x) + o(1),$$
and hence F(x′) ≤ lim inf_{n→∞} Fₙ(x).
Similarly, we obtain lim sup_{n→∞} Fₙ(x) ≤ F(x″), and thus
$$F(x') \le \liminf_{n\to\infty}F_n(x) \le \limsup_{n\to\infty}F_n(x) \le F(x'').$$
The proof is completed by letting x′ ↑ x, x″ ↓ x and
$$F(x) = \lim_{x'\uparrow x}F(x') \le \liminf_{n\to\infty}F_n(x) \le \limsup_{n\to\infty}F_n(x) \le \lim_{x''\downarrow x}F(x'') = F(x).$$
Slutsky's Theorem
If Xₙ − Yₙ →ᵖʳ 0 and Yₙ →ᵈ Y, then Xₙ →ᵈ Y.
Proof of Slutsky's Theorem
Let x be any continuity point of the c.d.f. of Y, F_Y. Given δ > 0, there exists a small ε > 0 such that x − ε and x + ε are continuity points of F_Y and F_Y(x+ε) − F_Y(x−ε) < δ.
Define Fₙ(x) = P(Xₙ ≤ x). Our goal is to show that
$$F_Y(x-\varepsilon) \le \liminf_{n\to\infty}F_n(x) \le \limsup_{n\to\infty}F_n(x) \le F_Y(x+\varepsilon),$$
which implies Fₙ(x) → F_Y(x).
Since Xₙ − Yₙ →ᵖʳ 0 and Yₙ →ᵈ Y, we have
$$F_n(x) \le P(Y_n \le x + Y_n - X_n,\ Y_n - X_n \le \varepsilon) + P(Y_n - X_n > \varepsilon) \le P(Y_n \le x+\varepsilon) + o(1),$$
$$\begin{aligned} F_n(x) &= P(Y_n \le x + Y_n - X_n,\ Y_n - X_n \ge -\varepsilon) + o(1) \\ &\ge P(Y_n \le x-\varepsilon,\ Y_n - X_n \ge -\varepsilon) + o(1) \\ &\ge P(Y_n \le x-\varepsilon) - P(Y_n - X_n < -\varepsilon) + o(1) = P(Y_n \le x-\varepsilon) + o(1), \end{aligned}$$
and hence lim sup_{n→∞} Fₙ(x) ≤ F_Y(x+ε) and lim inf_{n→∞} Fₙ(x) ≥ F_Y(x−ε).
Fact 3
If Xₙ →ᵈ X and Yₙ →ᵖʳ c, where c is a constant, then
(a) Xₙ + Yₙ →ᵈ X + c,
(b) XₙYₙ →ᵈ cX.
Proof of Fact 3
For (a), it suffices to show that Xₙ + c →ᵈ X + c, which is obvious.
For (b), it suffices to show that XₙYₙ − cXₙ →ᵖʳ 0. It is equivalent to show that if Xₙ →ᵈ X and Yₙ →ᵖʳ 0, then XₙYₙ →ᵖʳ 0.
Let δ > 0 be an arbitrarily small constant. Then, there exists a large M such that P(|X| > M) ≤ δ. Now for any ε > 0,
$$\begin{aligned} P(|X_nY_n| > \varepsilon) &\le P\Big(|X_nY_n| > \varepsilon,\ |Y_n| \le \frac{\varepsilon}{M}\Big) + P\Big(|Y_n| > \frac{\varepsilon}{M}\Big) \\ &\le P(|X_n| > M) + o(1) \\ &= P(|X| > M) + P(|X_n| > M) - P(|X| > M) + o(1) \\ &= P(|X| > M) + o(1), \end{aligned}$$
which implies 0 ≤ lim infₙ P(|XₙYₙ| > ε) ≤ lim supₙ P(|XₙYₙ| > ε) ≤ δ, and hence XₙYₙ →ᵖʳ 0.
Application
Let the Xᵢ be i.i.d. with mean 0 and variance 1, and E(X₁⁴) < ∞. Then it follows from
$$\frac{X_1^2 + \cdots + X_n^2}{n} \overset{pr.}{\to} 1, \qquad \frac{X_1 + \cdots + X_n}{\sqrt n} \overset{d}{\to} N(0, 1),$$
and Fact 3 that
$$\frac{\sqrt n(X_1 + \cdots + X_n)}{X_1^2 + \cdots + X_n^2} \overset{d}{\to} N(0, 1).$$
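A small simulation of this application (added sketch; the uniform distribution on (−√3, √3), which has mean 0 and variance 1, is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(4)
n, B = 5000, 10000
X = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(B, n))   # mean 0, variance 1
stat = np.sqrt(n) * X.sum(axis=1) / (X ** 2).sum(axis=1)
print(stat.mean(), stat.var())          # close to 0 and 1, as N(0,1) predicts
```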
Some Remarks on Slutsky's Theorem
(1) If Xₙ →ᵖʳ X and Yₙ →ᵖʳ Y, then Xₙ + Yₙ →ᵖʳ X + Y and XₙYₙ →ᵖʳ XY.
Proof.
$$\begin{aligned} P(|X_n + Y_n - (X+Y)| > \varepsilon) &\le P(|X_n - X| + |Y_n - Y| > \varepsilon) \\ &\le P(|X_n - X| > \varepsilon/2 \text{ or } |Y_n - Y| > \varepsilon/2) \\ &\le P(|X_n - X| > \varepsilon/2) + P(|Y_n - Y| > \varepsilon/2) \to 0, \end{aligned}$$
as n → ∞. Show by yourself that XₙYₙ →ᵖʳ XY.
(2) If Xₙ →ᵈ X and Yₙ →ᵈ c, where c is a constant, then Xₙ + Yₙ →ᵈ X + c and XₙYₙ →ᵈ cX, because Yₙ →ᵈ c ⇔ Yₙ →ᵖʳ c. (Show by yourself that Yₙ →ᵈ c ⇔ Yₙ →ᵖʳ c.)
(3) Assume Xₙ →ᵈ X and Yₙ →ᵈ Y. Does Xₙ + Yₙ →ᵈ X + Y? No. (The distribution of X + Y is undefined if only the marginal distributions of X and Y are available.)
(4) If (Xₙ, Yₙ)⊤ →ᵈ (X, Y)⊤, then by the continuous mapping theorem,
$$X_n + Y_n = (1\ 1)\binom{X_n}{Y_n} \overset{d}{\to} (1\ 1)\binom{X}{Y} = X + Y.$$
130 / 162
Regression Analysis
Appendix
Limit Theorems
Central Limit Theorem
Lindeberg Central Limit Theorem
Let $X_1, \ldots, X_n$ be independent random variables with $E(X_i) = 0$ and $E(X_i^2) = \sigma_i^2$ for $i = 1, \ldots, n$. Define $s_n^2 = \sum_{i=1}^n \sigma_i^2$ and $S_n = \sum_{i=1}^n X_i$. Then
$$\frac{S_n}{s_n} \xrightarrow{d} N(0, 1), \qquad (1)$$
provided for any $\varepsilon > 0$,
$$\frac{1}{s_n^2}\sum_{i=1}^n E\left(X_i^2 I_{\{|X_i| > \varepsilon s_n\}}\right) \to 0 \quad \text{(Lindeberg's condition)} \qquad (2)$$
as $n \to \infty$.
Lyapunov Central Limit Theorem
Let $X_1, \ldots, X_n$ be independent random variables with $E(X_i) = 0$ and $E(X_i^2) = \sigma_i^2$ for $i = 1, \ldots, n$. Define $s_n^2 = \sum_{i=1}^n \sigma_i^2$ and $S_n = \sum_{i=1}^n X_i$. If
$$\frac{1}{s_n^{2+\alpha}}\sum_{i=1}^n E(|X_i|^{2+\alpha}) \to 0 \quad \text{for some } \alpha > 0, \quad \text{(Lyapunov's condition)}$$
then $S_n/s_n \xrightarrow{d} N(0, 1)$.
131 / 162
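A simulation sketch (an illustrative addition) of the CLT with independent but non-identically distributed summands: $X_i = \sigma_i U_i$ with $U_i$ i.i.d. Uniform$(-\sqrt{3}, \sqrt{3})$ (mean 0, variance 1) and $\sigma_i = i^{1/4}$, a hypothetical choice for which Lyapunov's condition holds.

# Lyapunov CLT check: independent, non-identically distributed summands.
# X_i = sigma_i * U_i, U_i i.i.d. Uniform(-sqrt(3), sqrt(3)), sigma_i = i^{1/4}.
import numpy as np

rng = np.random.default_rng(2)
n, reps = 2000, 10_000
sigma = np.arange(1.0, n + 1.0) ** 0.25
sn = np.sqrt(np.sum(sigma**2))                        # s_n = (sum sigma_i^2)^{1/2}
U = rng.uniform(-np.sqrt(3.0), np.sqrt(3.0), (reps, n))
T = (U * sigma).sum(axis=1) / sn                      # S_n / s_n
print("mean ~", T.mean().round(3), " var ~", T.var().round(3))
print("P(T <= 1.96) ~", np.mean(T <= 1.96).round(4), "(Phi(1.96) ~ 0.975)")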
Regression Analysis
Appendix
Limit Theorems
Proof of Lindeberg Central Limit Theorem
To prove (1), we need two facts:
(F1) Lévy continuity theorem
Let $\{X_n\}$ be a sequence of random variables and define $\varphi_n(t) = E(\exp\{itX_n\})$. Then
$$X_n \xrightarrow{d} X \iff \varphi_n(t) \to \varphi(t) \text{ for every } t,$$
where $\varphi(t) = E(\exp\{itX\})$.
(F2) Lemma 8.4.1 of Chow and Teicher (1997)
$$\left|\exp\{it\} - \sum_{j=0}^n \frac{(it)^j}{j!}\right| \leq \frac{2^{1-\delta}\,|t|^{n+\delta}}{(1+\delta)(2+\delta)\cdots(n+\delta)},$$
where $\delta$ is any constant in $[0, 1]$.
132 / 162
Regression Analysis
Appendix
Limit Theorems
Proof of Lindeberg Central Limit Theorem (cont.)
Now, with $S_j = \sum_{i=1}^j X_i$ ($S_0 = 0$), write the difference as a telescoping sum of $n$ "stairs," in which the $X_i$'s are replaced one at a time by independent normal variables:
$$E\left(\exp\left\{\frac{itS_n}{s_n}\right\}\right) - \exp\left\{\frac{-t^2}{2}\right\} = \sum_{j=1}^n\left[E\left(\exp\left\{\frac{it(S_j + \sum_{i=j+1}^n Z_i)}{s_n}\right\}\right) - E\left(\exp\left\{\frac{it(S_{j-1} + \sum_{i=j}^n Z_i)}{s_n}\right\}\right)\right], \qquad (3)$$
where $Z_i \overset{\text{indep.}}{\sim} N(0, \sigma_i^2)$ and are independent of $\{X_n\}$; the $j = 1$ term uses $E(\exp\{it\sum_{i=1}^n Z_i/s_n\}) = \exp\{-t^2/2\}$.
133 / 162
Regression Analysis
Appendix
Limit Theorems
Proof of Lindeberg Central Limit Theorem (cont.)
It holds that for "Stair $j$," with $s_j^2 = \sum_{i=1}^j \sigma_i^2$,
$$\left|E\left(\exp\left\{\frac{it(S_j + \sum_{i=j+1}^n Z_i)}{s_n}\right\}\right) - E\left(\exp\left\{\frac{it(S_{j-1} + \sum_{i=j}^n Z_i)}{s_n}\right\}\right)\right|$$
$$\leq \left|\exp\left\{\frac{-t^2}{2}\right\}\left[E\left(\exp\left\{\frac{itS_j}{s_n}\right\}\right)\exp\left\{\frac{t^2 s_j^2}{2s_n^2}\right\} - E\left(\exp\left\{\frac{itS_{j-1}}{s_n}\right\}\right)\exp\left\{\frac{t^2 s_{j-1}^2}{2s_n^2}\right\}\right]\right|$$
$$\leq \exp\left\{\frac{-t^2}{2}\right\}\left|E\left(\exp\left\{\frac{itS_{j-1}}{s_n}\right\}\right)\exp\left\{\frac{t^2 s_j^2}{2s_n^2}\right\}\left[E\left(\exp\left\{\frac{itX_j}{s_n}\right\}\right) - \exp\left\{\frac{-t^2\sigma_j^2}{2s_n^2}\right\}\right]\right|$$
$$\leq \left|E\left(\exp\left\{\frac{itX_j}{s_n}\right\}\right) - \exp\left\{\frac{-t^2\sigma_j^2}{2s_n^2}\right\}\right| \quad \text{(since $s_j^2 \leq s_n^2$ and $|E(\exp\{itS_{j-1}/s_n\})| \leq 1$)}$$
$$\leq \left|E\left(\exp\left\{\frac{itX_j}{s_n}\right\} - 1 - \frac{itX_j}{s_n} + \frac{t^2X_j^2}{2s_n^2}\right) - \left(\exp\left\{\frac{-t^2\sigma_j^2}{2s_n^2}\right\} - 1 + \frac{t^2\sigma_j^2}{2s_n^2}\right)\right|, \qquad (4)$$
where the first inequality is by
$$E\left(\exp\left\{\frac{it\sum_{i=j+1}^n Z_i}{s_n}\right\}\right) = \exp\left\{\frac{-t^2(s_n^2 - s_j^2)}{2s_n^2}\right\},$$
and the last step uses $E(X_j) = 0$ and $E(X_j^2) = \sigma_j^2$.
134 / 162
Regression Analysis
Appendix
Limit Theorems
Proof of Lindeberg Central Limit Theorem (cont.)
By (F2) (taking $\delta = 1$ and $n = 1, 2$), we have
$$n = 1: \quad \left|\exp\left\{\frac{itX_j}{s_n}\right\} - 1 - \frac{itX_j}{s_n} + \frac{t^2X_j^2}{2s_n^2}\right| \leq \left|\exp\left\{\frac{itX_j}{s_n}\right\} - 1 - \frac{itX_j}{s_n}\right| + \frac{t^2X_j^2}{2s_n^2} \leq \frac{t^2X_j^2}{2s_n^2} + \frac{t^2X_j^2}{2s_n^2} = \frac{t^2X_j^2}{s_n^2},$$
$$n = 2: \quad \left|\exp\left\{\frac{itX_j}{s_n}\right\} - 1 - \frac{itX_j}{s_n} + \frac{t^2X_j^2}{2s_n^2}\right| \leq \frac{1}{6}|t|^3\left|\frac{X_j}{s_n}\right|^3,$$
and hence
$$\left|E\left(\exp\left\{\frac{itX_j}{s_n}\right\} - 1 - \frac{itX_j}{s_n} + \frac{t^2X_j^2}{2s_n^2}\right)\right| \leq E\left(\min\left(\frac{t^2X_j^2}{s_n^2},\ \frac{1}{6}|t|^3\left|\frac{X_j}{s_n}\right|^3\right)\right)$$
$$\overset{\text{why?}}{\leq} E\left(\frac{t^2X_j^2}{s_n^2}\, I_{\{|X_j/s_n| > \varepsilon\}}\right) + E\left(\frac{1}{6}|t|^3\left|\frac{X_j}{s_n}\right|^3 I_{\{|X_j/s_n| \leq \varepsilon\}}\right) \equiv I_j + II_j. \qquad (5)$$
135 / 162
Regression Analysis
Appendix
Limit Theorems
Proof of Lindeberg Central Limit Theorem (cont.)
In addition, we have
$$0 \leq \exp\left\{\frac{-t^2\sigma_j^2}{2s_n^2}\right\} - 1 + \frac{t^2\sigma_j^2}{2s_n^2} \leq \frac{t^4\sigma_j^4}{8s_n^4}, \qquad (6)$$
noting that for $x > 0$, $0 \leq \exp\{-x\} - 1 + x \leq x^2/2$. Moreover, $\sum_{j=1}^n \sigma_j^2/s_n^2 = 1$,
$$E\left(\left|\frac{X_j}{s_n}\right|^3 I_{\{|X_j/s_n| \leq \varepsilon\}}\right) \leq \varepsilon\,\frac{\sigma_j^2}{s_n^2},$$
and (2) implies
$$\frac{\max_{1 \leq j \leq n} \sigma_j^2}{s_n^2} \to 0.$$
By (2)-(6), it follows that
$$\left|E\left(\exp\left\{\frac{itS_n}{s_n}\right\}\right) - \exp\left\{\frac{-t^2}{2}\right\}\right| \leq \sum_{j=1}^n\left(I_j + II_j + \frac{t^4\sigma_j^4}{8s_n^4}\right)$$
$$\leq t^2\sum_{j=1}^n \frac{E\left(X_j^2 I_{\{|X_j| > \varepsilon s_n\}}\right)}{s_n^2} + \frac{|t|^3}{6}\,\varepsilon\sum_{j=1}^n \frac{\sigma_j^2}{s_n^2} + \frac{t^4}{8}\cdot\frac{\max_{1 \leq j \leq n}\sigma_j^2}{s_n^2}\sum_{j=1}^n \frac{\sigma_j^2}{s_n^2} = \frac{|t|^3}{6}\,\varepsilon + o(1).$$
136 / 162
Regression Analysis
Appendix
Limit Theorems
Proof of Lindeberg Central Limit Theorem (cont.)
Since $\varepsilon$ can be arbitrarily small, one gets
$$\left|E\left(\exp\left\{\frac{itS_n}{s_n}\right\}\right) - \exp\left\{\frac{-t^2}{2}\right\}\right| \underset{n\to\infty}{\longrightarrow} 0,$$
which, together with (F1), yields the desired conclusion (1).
Proof of Lyapunov Central Limit Theorem
$$\frac{1}{s_n^2}\sum_{i=1}^n E\left(X_i^2 I_{\{|X_i| > \delta s_n\}}\right) = \frac{1}{s_n^2}\sum_{i=1}^n E\left(\frac{|X_i|^{2+\alpha}}{|X_i|^\alpha}\, I_{\{|X_i| > \delta s_n\}}\right) \leq \frac{1}{s_n^{2+\alpha}\delta^\alpha}\sum_{i=1}^n E(|X_i|^{2+\alpha}) \to 0,$$
if Lyapunov's condition holds.
137 / 162
Regression Analysis
Appendix
Limit Theorems
Example 1
If $X_1, \ldots, X_n$ are independent random variables with $E(X_i) = 0$ for $i = 1, \ldots, n$, $\sup_{i \geq 1} E|X_i|^{2+\alpha} < M$, $\liminf_{n\to\infty} s_n^2/a_n > 0$, and $na_n^{-1-\alpha/2} = o(1)$, then $S_n/s_n \xrightarrow{d} N(0, 1)$.
Proof of Example 1
$$\limsup_{n\to\infty} \frac{1}{s_n^{2+\alpha}}\sum_{i=1}^n E(|X_i|^{2+\alpha}) \leq \limsup_{n\to\infty} \frac{nM}{a_n^{1+\alpha/2}(s_n^2/a_n)^{1+\alpha/2}} \leq \limsup_{n\to\infty} na_n^{-1-\alpha/2} \times \limsup_{n\to\infty} \frac{M}{(s_n^2/a_n)^{1+\alpha/2}}$$
$$= \limsup_{n\to\infty} na_n^{-1-\alpha/2} \times \frac{M}{\left(\liminf_{n\to\infty} s_n^2/a_n\right)^{1+\alpha/2}} = 0,$$
so Lyapunov's condition holds and the conclusion follows from the Lyapunov central limit theorem.
138 / 162
Regression Analysis
Appendix
Limit Theorems
Example 2
Let $P_J$ be the orthogonal projection matrix onto the space spanned by $\{X_j : j \in J\}$. Consider $\varepsilon^\top(P_{J_2} - P_{J_1})\varepsilon$, where $\varepsilon = (\varepsilon_1, \ldots, \varepsilon_n)^\top$ with $\varepsilon_i \overset{\text{indep.}}{\sim} (0, 1)$, $J_2 \supset J_1$, and $\sharp(J_2) - \sharp(J_1) = 1$. Then
$$\varepsilon^\top(P_{J_2} - P_{J_1})\varepsilon \xrightarrow{d} \chi^2(1),$$
provided $\sup_{i \geq 1} E|\varepsilon_i|^{2+\alpha} < M < \infty$ for some $\alpha > 0$ and $\max_{1 \leq i \leq n}(P_{J_2})_{ii} \to 0$ as $n \to \infty$.
Remark
If $\varepsilon \sim N(0, I)$, then $\varepsilon^\top(P_{J_2} - P_{J_1})\varepsilon \sim \chi^2(1)$.
If $(X^\top X)/a_n \to R$ (p.d.), then
$$(P_{J_2})_{ii} = e_i^\top X_{J_2}(X_{J_2}^\top X_{J_2})^{-1}X_{J_2}^\top e_i = \left(\frac{x_i(J_2)}{\sqrt{a_n}}\right)^\top\left(\frac{X_{J_2}^\top X_{J_2}}{a_n}\right)^{-1}\left(\frac{x_i(J_2)}{\sqrt{a_n}}\right)$$
$$\leq \lambda_{\max}\left(\left(\frac{X_{J_2}^\top X_{J_2}}{a_n}\right)^{-1}\right) \times \frac{\sum_{j \in J_2} x_{ij}^2}{a_n} \to 0,$$
provided $a_n^{-1}\sum_{j \in J_2} x_{ij}^2 \to 0$, where $x_i(J_2) = (x_{ij},\ j \in J_2)^\top$ and $X_{J_2} = (X_j,\ j \in J_2)$.
139 / 162
Regression Analysis
Appendix
Limit Theorems
Proof of Example 2
Let $\sharp(J_1) = r$. Then $P_{J_1} = \sum_{i=1}^r o_i o_i^\top$ and $P_{J_2} = \sum_{i=1}^{r+1} o_i o_i^\top$, where $o_i^\top o_i = 1$ and $o_i^\top o_j = 0$ for $1 \leq i < j \leq r + 1$. Hence, $P_{J_2} - P_{J_1} = o_{r+1}o_{r+1}^\top$. Without loss of generality, set $o_{r+1} = (a_{1n}, \ldots, a_{nn})^\top$ with $\sum_{i=1}^n a_{in}^2 = 1$.
Now $\varepsilon^\top(P_{J_2} - P_{J_1})\varepsilon = \left(\sum_{i=1}^n a_{in}\varepsilon_i\right)^2$. Note that $\sum_{i=1}^n a_{in}\varepsilon_i$ can be viewed as
$$\sum_{i=1}^n \frac{v_i\varepsilon_i}{\sqrt{\sum_{j=1}^n v_j^2}}, \quad \text{where } v_i > 0 \text{ for } i = 1, \ldots, n \text{ and } \sum_{j=1}^n v_j^2 \to \infty.$$
Lyapunov's condition $\sum_{i=1}^n E|a_{in}\varepsilon_i|^{2+\alpha} \to 0$ follows from
$$\sum_{i=1}^n E|a_{in}\varepsilon_i|^{2+\alpha} \leq M\sum_{i=1}^n |a_{in}|^{2+\alpha} \leq M\left(\sum_{i=1}^n a_{in}^2\right)\max_{1 \leq i \leq n}|a_{in}|^\alpha = M\max_{1 \leq i \leq n}|a_{in}|^\alpha,$$
and $\max_{1 \leq i \leq n}|a_{in}| = (\max_{1 \leq i \leq n} a_{in}^2)^{1/2} \leq (\max_{1 \leq i \leq n}(P_{J_2})_{ii})^{1/2} \to 0$.
By the Lyapunov central limit theorem, we have
$$\sum_{i=1}^n a_{in}\varepsilon_i \xrightarrow{d} N(0, 1),$$
and hence $\varepsilon^\top(P_{J_2} - P_{J_1})\varepsilon \xrightarrow{d} \chi^2(1)$ is obtained using the continuous mapping theorem.
140 / 162
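A simulation sketch of Example 2 (an added illustration; the design below is hypothetical): a polynomial design with $J_1 = \{1, 2\}$, $J_2 = \{1, 2, 3\}$ and uniform (non-normal) standardized errors, for which $\max_i (P_{J_2})_{ii} \to 0$ holds.

# Example 2 sketch: eps'(P_J2 - P_J1)eps should be close to chi^2(1).
# Design: x_i = (1, t_i, t_i^2); errors uniform with mean 0, variance 1.
import numpy as np

rng = np.random.default_rng(3)
n, reps = 300, 10_000
t = np.linspace(-1.0, 1.0, n)
X2 = np.column_stack([np.ones(n), t, t**2])         # columns indexed by J2
X1 = X2[:, :2]                                      # columns indexed by J1
P2 = X2 @ np.linalg.solve(X2.T @ X2, X2.T)          # projection onto C(X_J2)
P1 = X1 @ np.linalg.solve(X1.T @ X1, X1.T)          # projection onto C(X_J1)
eps = rng.uniform(-np.sqrt(3.0), np.sqrt(3.0), (reps, n))
Q = np.sum((eps @ (P2 - P1)) * eps, axis=1)         # eps'(P_J2 - P_J1) eps
print("P(Q <= 3.841) ~", np.mean(Q <= 3.841).round(4), "(chi^2(1) 95th pct: 0.95)")
print("max diag of P_J2:", P2.diagonal().max().round(4))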
Regression Analysis
Appendix
Limit Theorems
Convergence in the rth Mean
Definition
If $E|X_n - X|^r \to 0$ and $E|X|^r < \infty$, then we say that $X_n$ converges in the $r$th mean to $X$, and we write $X_n \xrightarrow{L_r} X$.
Definition
The $r$-norm of the random variable $Z$ is defined by $\|Z\|_r = (E(|Z|^r))^{1/r}$.
141 / 162
Regression Analysis
Appendix
Limit Theorems
Some Inequalities
Jensen's inequality
If g is a convex function, then E(g(X)) ≥ g(E(X)).
Proof of Jensen’s inequality
Note that the graph of a convex (differentiable) function lies above its tangent line at every point, and thus
$$g(x) \geq g(\mu) + g'(\mu)(x - \mu)$$
for any $x$ and $\mu$ in the domain of the function $g$.
Choosing $\mu = E(X)$ and replacing $x$ with the random variable $X$, we have
$$g(X) \geq g(E(X)) + g'(E(X))(X - E(X)).$$
The proof is completed by taking expectations on both sides of the above inequality.
Application
Let $q > 1$ and $g(x) = x^q$ for $x > 0$. Then $g(x)$ is a convex function.
Assume $0 < s < r$. By Jensen's inequality (with $q = r/s$), we have
$$E(|X|^r) = E\left((|X|^s)^{r/s}\right) \geq \left(E(|X|^s)\right)^{r/s},$$
and hence $(E(|X|^r))^{1/r} \geq (E(|X|^s))^{1/s}$.
142 / 162
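A numeric check of this moment-norm monotonicity (an added sketch; the Exponential(1) choice is arbitrary), for which the exact values are $\|X\|_r = (r!)^{1/r}$:

# Check ||X||_s <= ||X||_r for s < r with X ~ Exponential(1).
import numpy as np

rng = np.random.default_rng(4)
X = rng.exponential(1.0, 1_000_000)
for s, r in [(1, 2), (2, 4), (1, 3)]:
    ns = np.mean(np.abs(X)**s) ** (1.0 / s)   # (E|X|^s)^{1/s}
    nr = np.mean(np.abs(X)**r) ** (1.0 / r)   # (E|X|^r)^{1/r}
    print(f"s={s}, r={r}:  ||X||_s ~ {ns:.3f} <= ||X||_r ~ {nr:.3f}")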
Regression Analysis
Appendix
Limit Theorems
Young’s inequality
Let $f$ be a strictly increasing continuous function on $[0, \infty)$ with $f(0) = 0$. Then
$$ab \leq \int_0^a f(x)\,dx + \int_0^b f^{-1}(x)\,dx.$$
143 / 162
Regression Analysis
Appendix
Limit Theorems
Hölder's inequality
$$E|XY| \leq (E(|X|^p))^{1/p}(E(|Y|^q))^{1/q}, \quad \text{where } \frac{1}{p} + \frac{1}{q} = 1 \text{ and } p, q \in (1, \infty).$$
Proof of Hölder's inequality
Let $f(x) = x^{p-1}$. Then by Young's inequality,
$$ab \leq \int_0^a x^{p-1}\,dx + \int_0^b x^{1/(p-1)}\,dx = \frac{a^p}{p} + \frac{1}{1 + 1/(p-1)}\,b^{1+1/(p-1)} = \frac{a^p}{p} + \frac{b^q}{q}. \qquad (*)$$
Now, let $a = |X|/\|X\|_p$ and $b = |Y|/\|Y\|_q$. By $(*)$,
$$\frac{|X|}{\|X\|_p} \times \frac{|Y|}{\|Y\|_q} \leq \frac{1}{p}\times\left(\frac{|X|}{\|X\|_p}\right)^p + \frac{1}{q}\times\left(\frac{|Y|}{\|Y\|_q}\right)^q,$$
which, after taking expectations on both sides, implies
$$\frac{E|XY|}{\|X\|_p\|Y\|_q} \leq \frac{1}{p} + \frac{1}{q} = 1,$$
and thus the proof is complete.
144 / 162
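An empirical check of Hölder's inequality (an added sketch; the correlated normal pair below is an arbitrary choice):

# Empirical check of Holder's inequality with a correlated normal pair.
import numpy as np

rng = np.random.default_rng(5)
Z = rng.standard_normal((1_000_000, 2))
X, Y = Z[:, 0], 0.8 * Z[:, 0] + 0.6 * Z[:, 1]     # Corr(X, Y) = 0.8
lhs = np.mean(np.abs(X * Y))
for p in [1.5, 2.0, 3.0]:
    q = p / (p - 1.0)                             # conjugate exponent: 1/p + 1/q = 1
    rhs = np.mean(np.abs(X)**p)**(1/p) * np.mean(np.abs(Y)**q)**(1/q)
    print(f"p={p}:  E|XY| ~ {lhs:.3f} <= ||X||_p ||Y||_q ~ {rhs:.3f}")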
Regression Analysis
Appendix
Limit Theorems
Minkowski’s inequality
$\|X + Y\|_p \leq \|X\|_p + \|Y\|_p$, where $1 \leq p < \infty$.
Proof of Minkowski’s inequality
By Hölder's inequality (applied with exponents $q = p/(p-1)$ and $p$),
$$E(|X + Y|^p) = E(|X + Y|^{p-1}|X + Y|) \leq E(|X + Y|^{p-1}|X|) + E(|X + Y|^{p-1}|Y|)$$
$$\leq (E(|X + Y|^p))^{(p-1)/p}(E(|X|^p))^{1/p} + (E(|X + Y|^p))^{(p-1)/p}(E(|Y|^p))^{1/p};$$
dividing both sides by $(E(|X + Y|^p))^{(p-1)/p}$ completes the proof.
145 / 162
Regression Analysis
Appendix
Limit Theorems
Some Facts
(1) $X_n \xrightarrow{L_r} X \Rightarrow X_n \xrightarrow{pr.} X \Rightarrow X_n \xrightarrow{d} X$.
If $X$ is a constant, then $X_n \xrightarrow{d} X \Rightarrow X_n \xrightarrow{pr.} X$.
If $\sup_{n \geq 1} E(|X_n|^p) < \infty$ with $p > r$, then $X_n \xrightarrow{pr.} X \Rightarrow X_n \xrightarrow{L_r} X$.
(2) $X_n \xrightarrow{pr.} X$ does not necessarily imply $X_n \xrightarrow{L_r} X$.
Example. Let $P(X_n = n^2) = 1/n$ and $P(X_n = 0) = 1 - 1/n$. Then for any $\varepsilon > 0$,
$$P(|X_n| > \varepsilon) = P(X_n > \varepsilon) = P(X_n = n^2) \to 0,$$
and hence $X_n \xrightarrow{pr.} 0$. However,
$$E|X_n - 0| = E(X_n) = 0 \times P(X_n = 0) + n^2 \times P(X_n = n^2) = n \to \infty.$$
(3) If $X_n \xrightarrow{L_2} X$, then $E(X_n) \to E(X)$ and $E(X_n^2) \to E(X^2)$.
Proof. $|E(X_n - X)| \leq E|X_n - X| \leq (E(X_n - X)^2)^{1/2} \to 0$ and
$$|E(X_n^2 - X^2)| = |E[(X_n - X)(X_n - X + 2X)]| \leq E[(X_n - X)^2] + 2E|X(X_n - X)| \leq E[(X_n - X)^2] + 2\sqrt{E[(X_n - X)^2]}\sqrt{E(X^2)} \to 0.$$
146 / 162
Regression Analysis
Appendix
Limit Theorems
Some Facts (Cont.)
(4) If $X_n \xrightarrow{L_r} X$, then $E(|X_n|^r) \to E(|X|^r)$.
Proof. For $r \geq 1$, by Minkowski's inequality, we have
$$\|X_n\|_r \leq \|X_n - X\|_r + \|X\|_r \quad \text{and} \quad \|X\|_r \leq \|X_n - X\|_r + \|X_n\|_r,$$
and hence
$$\|X\|_r - \|X_n - X\|_r \leq \|X_n\|_r \leq \|X\|_r + \|X_n - X\|_r,$$
which, in conjunction with $X_n \xrightarrow{L_r} X$, yields the desired result. On the other hand, note that $(a + b)^r \leq a^r + b^r$ for $a, b \geq 0$ and $0 < r < 1$. Hence, for $r < 1$,
$$\|X_n\|_r^r \leq \|X_n - X + X\|_r^r \leq \|X_n - X\|_r^r + \|X\|_r^r \quad \text{and} \quad \|X\|_r^r \leq \|X - X_n + X_n\|_r^r \leq \|X_n - X\|_r^r + \|X_n\|_r^r.$$
By an argument similar to that used for the case of $r \geq 1$, we have
$$E(|X_n|^r) \to E(|X|^r) \quad \text{for } r \leq 1,$$
and thus the proof is complete.
147 / 162
Regression Analysis
Appendix
Limit Theorems
Weak Law of Large Numbers
Fact 4
Let $X_1, \ldots, X_n$ be i.i.d. random variables with $E(X_1) = \mu < \infty$. Then
$$\frac{S_n}{n} \xrightarrow{pr.} \mu,$$
where $S_n = \sum_{i=1}^n X_i$.
Remark
If $X_1, \ldots, X_n$ are independent random variables with $E(X_1) < \infty$, then the weak law of large numbers does not necessarily hold for $\{X_i\}$. Consider the following example:
Let $X_1, \ldots, X_n$ be a sequence of independent random variables with
$$P(X_i = \sqrt{i}) = P(X_i = -\sqrt{i}) = \frac{1}{2}.$$
Note that $E(X_i) = 0$ and $\mathrm{Var}(X_i) = i$ for $i = 1, \ldots, n$. Moreover,
$$s_n^2 = \sum_{i=1}^n \mathrm{Var}(X_i) = \sum_{i=1}^n i = \frac{n(n+1)}{2}.$$
148 / 162
Regression Analysis
Appendix
Limit Theorems
Remark (Cont.)
Since for some $\alpha > 0$,
$$\frac{\sum_{i=1}^n E(|X_i|^{2+\alpha})}{s_n^{2+\alpha}} = \frac{\sum_{i=1}^n i^{1+\alpha/2}}{(n(n+1)/2)^{1+\alpha/2}} = O\left(\frac{n^{2+\alpha/2}}{n^{2+\alpha}}\right) \to 0,$$
by the Lyapunov central limit theorem, we have
$$\frac{\sqrt{2}\sum_{i=1}^n X_i}{n} \xrightarrow{d} N(0, 1).$$
Hence, $S_n/n$ converges in distribution to $N(0, 1/2)$ rather than in probability to a constant, so the weak law of large numbers does not hold for $\{X_i\}$.
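A short simulation (an added illustration) of this counterexample: $\mathrm{Var}(S_n/n) = (n+1)/(2n) \to 1/2$, so $S_n/n$ keeps fluctuating on a constant scale instead of concentrating at $E(X_i) = 0$.

# Simulation of the counterexample X_i = +/- sqrt(i) with equal probability.
import numpy as np

rng = np.random.default_rng(6)
reps = 1000
for n in [100, 1000, 10_000]:
    signs = rng.choice([-1.0, 1.0], size=(reps, n))
    Sn = (signs * np.sqrt(np.arange(1.0, n + 1.0))).sum(axis=1)
    print(f"n={n:6d}  Var(S_n/n) ~ {np.var(Sn / n):.3f}  (theory: (n+1)/(2n) -> 1/2)")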
Proof of Fact 4
Consider
$$\frac{S_n}{n} - \mu = \frac{S_n - m_n}{n} + \frac{m_n - n\mu}{n} = \frac{S_n - m_n}{n} - E(X_1 I_{\{|X_1| > n\}}), \qquad (5\text{-}1)$$
where $m_n = \sum_{i=1}^n E(X_i^{(n)})$ with $X_i^{(n)} = X_i I_{\{|X_i| \leq n\}}$, $i = 1, \ldots, n$.
It suffices to show that
$$\frac{S_n - m_n}{n} \xrightarrow{pr.} 0, \qquad (5\text{-}2)$$
and
$$E(|X_1| I_{\{|X_1| > n\}}) \to 0. \qquad (5\text{-}3)$$
149 / 162
Regression Analysis
Appendix
Limit Theorems
Proof of Fact 4 (Cont.)
Since $E(|X_1|) < \infty$, $E(|X_1^{(n)}|) \to E(|X_1|)$ and
$$E(|X_1|) = E(|X_1^{(n)}|) + E(|X_1| I_{\{|X_1| > n\}}),$$
we obtain (5-3).
We next show (5-2). Define $S_n^{(n)} = \sum_{i=1}^n X_i^{(n)}$. Note first that for any $\varepsilon > 0$,
$$P\left(\frac{|S_n - m_n|}{n} > \varepsilon\right) \leq P\left(\frac{|S_n - m_n|}{n} > \varepsilon,\ \bigcap_{i=1}^n\{|X_i| \leq n\}\right) + P\left(\bigcup_{i=1}^n\{|X_i| > n\}\right)$$
$$\leq P\left(\frac{|S_n^{(n)} - m_n|}{n} > \varepsilon\right) + nP(|X_1| > n) \leq P\left(\frac{|S_n^{(n)} - m_n|}{n} > \varepsilon\right) + E(|X_1| I_{\{|X_1| > n\}})$$
$$= P\left(\frac{|S_n^{(n)} - m_n|}{n} > \varepsilon\right) + o(1), \qquad (5\text{-}4)$$
where the second inequality is by the i.i.d. assumption, the third by $nP(|X_1| > n) \leq E(|X_1| I_{\{|X_1| > n\}})$, and the final equality by (5-3).
150 / 162
Regression Analysis
Appendix
Limit Theorems
Proof of Fact 4 (Cont.)
By Chebyshev's inequality, we have
$$P\left(\frac{|S_n^{(n)} - m_n|}{n} > \varepsilon\right) \leq \frac{E((X_1^{(n)})^2)}{n\varepsilon^2}. \qquad (5\text{-}5)$$
Moreover,
$$E((X_1^{(n)})^2) = \int_0^\infty P(X_1^2 I_{\{|X_1| \leq n\}} > x)\,dx = 2\int_0^\infty u P(X_1^2 I_{\{|X_1| \leq n\}} > u^2)\,du = 2\int_0^\infty u P(|X_1| I_{\{|X_1| \leq n\}} > u)\,du$$
$$= 2\int_0^\infty u P(|X_1| I_{\{|X_1| \leq n\}} > u,\ |X_1| \leq n)\,du + 2\int_0^\infty u P(|X_1| I_{\{|X_1| \leq n\}} > u,\ |X_1| > n)\,du$$
$$= 2\int_0^\infty u P(u < |X_1| \leq n)\,du = 2\int_0^n u P(u < |X_1| \leq n)\,du \leq 2\int_0^n u P(|X_1| > u)\,du. \qquad (5\text{-}6)$$
151 / 162
Regression Analysis
Appendix
Limit Theorems
Proof of Fact 4 (Cont.)
Since $nP(|X_1| > n) \leq E(|X_1| I_{\{|X_1| > n\}}) = o(1)$, so that $uP(|X_1| > u) \to 0$ as $u \to \infty$, we have
$$2\int_0^n u P(|X_1| > u)\,du = 2\int_0^A u P(|X_1| > u)\,du + 2\int_A^n u P(|X_1| > u)\,du \leq A^2 + 2\varepsilon^3(n - A), \qquad (5\text{-}7)$$
where $A$ is large enough such that
$$uP(|X_1| > u) \leq \varepsilon^3, \quad \forall u \geq A.$$
Hence, (5-2) follows from (5-4)-(5-7), and the proof is complete.
152 / 162
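The proof truncates at level $n$ precisely because only $E|X_1| < \infty$ is assumed; a plain Chebyshev bound would need a variance. The sketch below (an added illustration with a hypothetical distribution) checks Fact 4 on i.i.d. Pareto variables with tail index $1.5$, which have mean $3$ but infinite variance.

# WLLN check for i.i.d. Pareto(1.5) on [1, inf): E(X_1) = 1.5/0.5 = 3, Var = inf.
import numpy as np

rng = np.random.default_rng(7)
alpha, mu, reps, eps = 1.5, 3.0, 1000, 0.2
for n in [100, 1000, 10_000]:
    X = rng.pareto(alpha, (reps, n)) + 1.0     # numpy's pareto is the Lomax form
    prob = np.mean(np.abs(X.mean(axis=1) - mu) > eps)
    print(f"n={n:6d}  P(|S_n/n - mu| > {eps}) ~ {prob:.3f}")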
Regression Analysis
Appendix
Delta Method
Delta Method
Assume $a_n(Z_n - \mu) \xrightarrow{d} Z$, where $Z_n, \mu, Z$ are $k$-dimensional and $a_n \to \infty$ as $n \to \infty$. Let $f(\cdot) = (f_1(\cdot), \ldots, f_m(\cdot))^\top$ be a smooth function from $\mathbb{R}^k$ into $\mathbb{R}^m$ with $1 \leq m \leq k$. Define
$$\nabla f(\cdot) = \begin{pmatrix} \frac{\partial f_1(\cdot)}{\partial x_1} & \cdots & \frac{\partial f_m(\cdot)}{\partial x_1} \\ \vdots & & \vdots \\ \frac{\partial f_1(\cdot)}{\partial x_k} & \cdots & \frac{\partial f_m(\cdot)}{\partial x_k} \end{pmatrix}.$$
Suppose that there exists $\varepsilon > 0$ such that for some $0 < G < \infty$,
$$\max_{1 \leq i \leq m}\ \sup_{\|x - \mu\| \leq \varepsilon}\left\|\left(\frac{\partial^2 f_i(x)}{\partial x_j \partial x_l}\right)_{1 \leq j, l \leq k}\right\| \leq G. \qquad (*)$$
Then
$$a_n(f(Z_n) - f(\mu)) \xrightarrow{d} (\nabla f(\mu))^\top Z. \qquad (**)$$
153 / 162
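A one-dimensional sketch of the delta method (an added illustration; the exponential data and $f(x) = x^2$ are arbitrary choices): with $Z_n = \bar{X}_n$ from Exp(1) data, $\mu = 1$, $a_n = \sqrt{n}$, and $Z \sim N(0, 1)$, the theorem predicts $\sqrt{n}(\bar{X}_n^2 - 1) \xrightarrow{d} N(0, (f'(1))^2) = N(0, 4)$.

# Delta-method sketch: sqrt(n)(Xbar_n^2 - 1) should be approximately N(0, 4).
import numpy as np

rng = np.random.default_rng(8)
n, reps = 1000, 20_000
Xbar = rng.exponential(1.0, (reps, n)).mean(axis=1)   # Z_n = sample mean, mu = 1
T = np.sqrt(n) * (Xbar**2 - 1.0)
print("mean ~", T.mean().round(3), " var ~", T.var().round(2), " (prediction: 0 and 4)")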
Regression Analysis
Appendix
Delta Method
Proof of Delta Method
Since $a_n(Z_n - \mu) \xrightarrow{d} Z$, it holds that
$$a_n(Z_n - \mu) = O_p(1), \qquad (0)$$
and hence
$$Z_n - \mu = O_p(a_n^{-1}) = O_p(o(1)) = o_p(1),$$
yielding
$$Z_n \xrightarrow{pr.} \mu. \qquad (1)$$
Define $A_n = \{\|Z_n - \mu\| \leq \varepsilon\}$, where $\varepsilon$ is defined in $(*)$. Then, by (1),
$$P(A_n) \to 1 \quad \text{as } n \to \infty. \qquad (2)$$
154 / 162
Regression Analysis
Appendix
Delta Method
Proof of Delta Method (cont.)
In the following, we shall prove $(**)$ for the case of $m = 1$. The proof of the case of $m > 1$ is similar.
By Taylor's theorem,
$$f_1(Z_n) - f_1(\mu) = (\nabla f_1(\mu))^\top(Z_n - \mu) + w_n, \qquad (3)$$
where $w_n = \frac{1}{2}(Z_n - \mu)^\top\left(\frac{\partial^2 f_1(\xi)}{\partial x_j \partial x_l}\right)_{1 \leq j, l \leq k}(Z_n - \mu)$ and $\|\xi - \mu\| \leq \|Z_n - \mu\|$.
Let $x \in \mathbb{R}$ be a continuity point of the distribution function of $(\nabla f_1(\mu))^\top Z$. Then
$$P(a_n(f_1(Z_n) - f_1(\mu)) \leq x) \overset{\text{why?}}{=} P(a_n(f_1(Z_n) - f_1(\mu)) \leq x,\ A_n) + o(1)$$
$$\overset{\text{by (3)}}{=} P((\nabla f_1(\mu))^\top a_n(Z_n - \mu) + a_n w_n \leq x,\ A_n) + o(1)$$
$$\overset{\text{why?}}{=} P((\nabla f_1(\mu))^\top a_n(Z_n - \mu)I_{A_n} + a_n w_n I_{A_n} \leq x) + o(1). \qquad (4)$$
155 / 162
Regression Analysis
Appendix
Delta Method
Proof of Delta Method (cont.)
Note that
$$|a_n w_n I_{A_n}| \overset{\text{why?}}{\leq} a_n\|Z_n - \mu\|^2\left\|\left(\frac{\partial^2 f_1(\xi)}{\partial x_j \partial x_l}\right)_{1 \leq j, l \leq k}\right\| I_{A_n} \leq a_n\|Z_n - \mu\|^2 \sup_{\|x - \mu\| \leq \varepsilon}\left\|\left(\frac{\partial^2 f_1(x)}{\partial x_j \partial x_l}\right)_{1 \leq j, l \leq k}\right\| I_{A_n}$$
$$\leq a_n\|Z_n - \mu\|^2 G \overset{\text{by (0) and (1)}}{=} o_p(1). \qquad (5)$$
Moreover, since
$$(\nabla f_1(\mu))^\top a_n(Z_n - \mu) \xrightarrow{d} (\nabla f_1(\mu))^\top Z \quad \text{(by the continuous mapping theorem)}$$
and $I_{A_n} \xrightarrow{pr.} 1$ (by (2)), it follows from Slutsky's theorem that
$$(\nabla f_1(\mu))^\top a_n(Z_n - \mu)I_{A_n} \xrightarrow{d} (\nabla f_1(\mu))^\top Z. \qquad (6)$$
156 / 162
Regression Analysis
Appendix
Delta Method
Proof of Delta Method (cont.)
By (5) and (6), and Slutsky's theorem, we obtain
$$(\nabla f_1(\mu))^\top a_n(Z_n - \mu)I_{A_n} + a_n w_n I_{A_n} \xrightarrow{d} (\nabla f_1(\mu))^\top Z. \qquad (7)$$
By (4) and (7),
$$P(a_n(f_1(Z_n) - f_1(\mu)) \leq x) \longrightarrow P((\nabla f_1(\mu))^\top Z \leq x),$$
and hence the desired conclusion follows.
157 / 162
Regression Analysis
Appendix
Two-Sample t-Test
Two-Sample t-Test
Consider the model
$$z = X\mu + \varepsilon,$$
where $z = (x_1, \ldots, x_m, y_1, \ldots, y_n)^\top$, $X = (s_{ij})$ is an $(m+n) \times 2$ matrix satisfying
$$s_{ij} = \begin{cases} 1, & \text{if } \{1 \leq i \leq m,\ j = 1\} \text{ or } \{m + 1 \leq i \leq m + n,\ j = 2\}; \\ 0, & \text{otherwise}, \end{cases}$$
$\mu = (\mu_x, \mu_y)^\top$, $\varepsilon = (\varepsilon_1, \ldots, \varepsilon_{m+n})^\top$, and the $\varepsilon_i$'s are i.i.d. $N(0, \sigma^2)$.
The least squares estimator of $\mu$ is
$$\hat{\mu} = (\hat{\mu}_x, \hat{\mu}_y)^\top = (X^\top X)^{-1}X^\top z = (\bar{x}, \bar{y})^\top \sim N\left(\begin{pmatrix} \mu_x \\ \mu_y \end{pmatrix}, \begin{pmatrix} \frac{\sigma^2}{m} & 0 \\ 0 & \frac{\sigma^2}{n} \end{pmatrix}\right).$$
Consider $H_0: \mu_x = \mu_y$. Under $H_0$,
$$T = \frac{\bar{x} - \bar{y}}{\sqrt{\sigma^2\left(\frac{1}{m} + \frac{1}{n}\right)}} \sim N(0, 1).$$
158 / 162
Regression Analysis
Appendix
Two-Sample t-Test
In practice, $\sigma^2$ is unknown and we can use
$$\hat{\sigma}^2 = \frac{1}{m + n - 2}\,z^\top(I - M)z$$
in place of $\sigma^2$, where $M = X(X^\top X)^{-1}X^\top$.
Define $S_x = (m-1)^{-1}\sum_{i=1}^m (x_i - \bar{x})^2$ and $S_y = (n-1)^{-1}\sum_{i=1}^n (y_i - \bar{y})^2$. Then some elementary calculations yield
$$(I - M)z = (x_1 - \bar{x}, \ldots, x_m - \bar{x},\ y_1 - \bar{y}, \ldots, y_n - \bar{y})^\top,$$
and hence
$$\hat{\sigma}^2 = \frac{1}{m + n - 2}\left(\sum_{i=1}^m (x_i - \bar{x})^2 + \sum_{j=1}^n (y_j - \bar{y})^2\right) = \frac{(m-1)S_x + (n-1)S_y}{m + n - 2},$$
which is the pooled variance.
159 / 162
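The projection identity above can be verified numerically; the sketch below (an added illustration with simulated data) checks that $z^\top(I - M)z/(m+n-2)$ equals the pooled variance and then forms the $t$ statistic.

# Verify sigma2_hat = z'(I - M)z/(m+n-2) equals the pooled variance.
import numpy as np

rng = np.random.default_rng(9)
m, n = 8, 12
x = 1.0 + rng.standard_normal(m)
y = 1.5 + rng.standard_normal(n)
z = np.concatenate([x, y])
X = np.zeros((m + n, 2))
X[:m, 0] = 1.0                                      # group-x indicator column
X[m:, 1] = 1.0                                      # group-y indicator column
M = X @ np.linalg.solve(X.T @ X, X.T)
s2_proj = z @ (np.eye(m + n) - M) @ z / (m + n - 2)
s2_pool = ((m - 1) * x.var(ddof=1) + (n - 1) * y.var(ddof=1)) / (m + n - 2)
print(np.isclose(s2_proj, s2_pool))                 # True: the two formulas agree
T = (x.mean() - y.mean()) / np.sqrt(s2_pool * (1/m + 1/n))  # ~ t(m+n-2) under H0
print("pooled t statistic:", T.round(3))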
Regression Analysis
Appendix
Two-Sample t-Test
Since $T \sim N(0, 1)$, $(m + n - 2)\hat{\sigma}^2/\sigma^2 \sim \chi^2(m + n - 2)$, and $T \perp \hat{\sigma}^2$, we have, under $H_0$,
$$\frac{\bar{x} - \bar{y}}{\sqrt{\hat{\sigma}^2\left(\frac{1}{m} + \frac{1}{n}\right)}} = \frac{(\bar{x} - \bar{y})\big/\sqrt{\sigma^2\left(\frac{1}{m} + \frac{1}{n}\right)}}{\sqrt{\hat{\sigma}^2/\sigma^2}} \sim t(m + n - 2).$$
Assume $m/(m+n) \to \gamma_x > 0$ and $n/(m+n) \to \gamma_y > 0$ as $m \to \infty$ and $n \to \infty$. If the $\varepsilon_i$'s are i.i.d. $(0, \sigma^2)$ (without assuming normality), then one can show that $\hat{\sigma}^2 \xrightarrow{pr.} \sigma^2$,
$$\sqrt{m + n}\begin{pmatrix} \bar{x} - \mu_x \\ \bar{y} - \mu_y \end{pmatrix} \xrightarrow{d} N\left(0, \begin{pmatrix} \frac{1}{\gamma_x} & 0 \\ 0 & \frac{1}{\gamma_y} \end{pmatrix}\sigma^2\right),$$
and
$$\frac{\sqrt{m + n - 2}\,(\bar{x} - \bar{y})}{\sqrt{\sigma^2\left(\frac{1}{\gamma_x} + \frac{1}{\gamma_y}\right)}} \xrightarrow[H_0]{d} N(0, 1).$$
This, in conjunction with $\hat{\sigma}^2 \xrightarrow{pr.} \sigma^2$, $\frac{m+n-2}{m} \to \frac{1}{\gamma_x}$, $\frac{m+n-2}{n} \to \frac{1}{\gamma_y}$, the continuous mapping theorem, and Slutsky's theorem, yields
$$\frac{\sqrt{m + n - 2}\,(\bar{x} - \bar{y})}{\sqrt{\hat{\sigma}^2\left(\frac{m+n-2}{m} + \frac{m+n-2}{n}\right)}} \xrightarrow[H_0]{d} N(0, 1);$$
that is, the $t$ statistic above is asymptotically standard normal under $H_0$ even without normality.
160 / 162
Regression Analysis
Appendix
Pearson’s Chi-Squared Test
Pearson’s Chi-Squared Test
Suppose that $X_1, \ldots, X_n$ is a random sample of size $n$ from a population, and the $n$ observations are classified into $k$ classes $A_1, \ldots, A_k$.
Let $p_i$ denote the probability that an observation falls into the class $A_i$, where $\sum_{i=1}^k p_i = 1$.
Note first that
$$Z_t = \begin{pmatrix} I_{\{X_t \in A_1\}} \\ \vdots \\ I_{\{X_t \in A_{k-1}\}} \end{pmatrix} \sim (p,\ D - pp^\top), \qquad \frac{1}{\sqrt{n}}\sum_{t=1}^n (Z_t - p) \xrightarrow{d} N(0,\ D - pp^\top),$$
and
$$\left(\frac{1}{\sqrt{n}}\sum_{t=1}^n (Z_t - p)\right)^\top(D - pp^\top)^{-1}\left(\frac{1}{\sqrt{n}}\sum_{t=1}^n (Z_t - p)\right) \xrightarrow{d} \chi^2(k - 1), \qquad (1)$$
where $p = (p_1, \ldots, p_{k-1})^\top$ and $D = \mathrm{diag}(p_1, \ldots, p_{k-1})$.
161 / 162
Regression Analysis
Appendix
Pearson’s Chi-Squared Test
Let $\mathbf{1}$ be a $(k-1)$-dimensional vector with all entries one and $O_i = \sum_{t=1}^n I_{\{X_t \in A_i\}}$ for $i = 1, \ldots, k$. Define $O = (O_1, \ldots, O_{k-1})^\top$, so that $\sum_{t=1}^n (Z_t - p) = O - np$.
Since $D^{-1}p = \mathbf{1}$ and $1 - p^\top D^{-1}p = p_k$,
$$(D - pp^\top)^{-1} = D^{-1} + \frac{D^{-1}pp^\top D^{-1}}{1 - p^\top D^{-1}p} = D^{-1} + \frac{\mathbf{1}\mathbf{1}^\top}{p_k},$$
we have
$$\left(\frac{1}{\sqrt{n}}\sum_{t=1}^n (Z_t - p)\right)^\top(D - pp^\top)^{-1}\left(\frac{1}{\sqrt{n}}\sum_{t=1}^n (Z_t - p)\right)$$
$$= \frac{1}{n}(O - np)^\top D^{-1}(O - np) + \frac{1}{np_k}(O - np)^\top\mathbf{1}\mathbf{1}^\top(O - np)$$
$$= \sum_{i=1}^{k-1}\frac{(O_i - np_i)^2}{np_i} + \frac{1}{np_k}\left(\sum_{i=1}^{k-1}(O_i - np_i)\right)^2$$
$$= \sum_{i=1}^{k-1}\frac{(O_i - np_i)^2}{np_i} + \frac{1}{np_k}\left((n - O_k) - n(1 - p_k)\right)^2 = \sum_{i=1}^k \frac{(O_i - np_i)^2}{np_i}. \qquad (2)$$
Hence, by (1) and (2),
$$\sum_{i=1}^k \frac{(O_i - np_i)^2}{np_i} \xrightarrow{d} \chi^2(k - 1).$$
162 / 162