Regression Analysis
Ching-Kang Ing (銀慶剛)
Institute of Statistics, National Tsing Hua University, Hsinchu, Taiwan
Outline I
1 Finite Sample Theory
  Regression Models
  Analysis of Variance (ANOVA)
  Projection Matrices
  Estimation
  Multivariate Normal Distributions
  Gaussian Regression
  Interval Estimation
  Another look at β̂
  Model Selection
  Prediction
2 Large Sample Theory
  Motivation
  Toward Large Sample Theory I
  Toward Large Sample Theory II
  Toward Large Sample Theory III
Outline II
3 Appendix
  Statistical View of Spectral Decomposition
  Limit Theorems
    Continuous Mapping Theorem
    Slutsky's Theorem
    Central Limit Theorem
    Convergence in the rth Mean
    Some Inequalities
    Weak Law of Large Numbers
  Delta Method
  Two-Sample t-Test
  Pearson's Chi-Squared Test
Finite Sample Theory
Regression Models
Consider the following linear regression model:
$$y_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_k x_{ik} + \varepsilon_i, \quad i = 1, \ldots, n,$$
where the εᵢ are i.i.d. r.v.s with E(ε₁) = 0 and E(ε₁²) = Var(ε₁) = σ² > 0. Define f(β) = ‖y − Xβ‖², where
$$X = \begin{pmatrix} 1 & x_{11} & \cdots & x_{1k} \\ \vdots & \vdots & & \vdots \\ 1 & x_{n1} & \cdots & x_{nk} \end{pmatrix} \quad \text{and} \quad y = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}.$$
By solving the equation
$$\frac{\partial f(\beta)}{\partial \beta} = 0,$$
we obtain X⊤Xβ = X⊤y, and hence
$$(\hat\beta_0, \ldots, \hat\beta_k)^\top \equiv \hat\beta = (X^\top X)^{-1}X^\top y.$$
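As a quick illustration (not part of the original slides), here is a minimal numpy sketch of the least-squares solution just derived; the simulated data and names such as `beta_true` are purely hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 100, 2
beta_true = np.array([1.0, 2.0, -0.5])        # (beta_0, beta_1, beta_2)

X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])  # design matrix
y = X @ beta_true + rng.normal(size=n)

# Solve the normal equations X'X beta = X'y; lstsq is numerically
# safer than forming (X'X)^{-1} explicitly.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)                                # close to beta_true
```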
Analysis of Variance (ANOVA)
Define
$$\mathrm{SST} = \sum_{i=1}^n (y_i - \bar y)^2,$$
$$\mathrm{SSRes} = \sum_{i=1}^n (y_i - \hat\beta_0 - \hat\beta_1 x_{i1} - \cdots - \hat\beta_k x_{ik})^2 = \sum_{i=1}^n (y_i - \hat\beta^\top x_i)^2,$$
$$\mathrm{SSReg} = \sum_{i=1}^n (\hat\beta_0 + \hat\beta_1 x_{i1} + \cdots + \hat\beta_k x_{ik} - \bar y)^2 = \sum_{i=1}^n (\hat\beta^\top x_i - \bar y)^2,$$
where $\bar y = n^{-1}\sum_{i=1}^n y_i$ and xᵢ = (1, x_{i1}, …, x_{ik})⊤. Then we have
$$\mathrm{SST} = \mathrm{SSReg} + \mathrm{SSRes}.$$
It is not difficult to see (why?) that
$$\mathrm{SST} = y^\top(I - M_0)y, \quad \text{where } M_0 = \frac{E}{n} = \frac{\mathbf{1}\mathbf{1}^\top}{n} \text{ with } \mathbf{1} = (1, \ldots, 1)^\top,$$
$$\mathrm{SSRes} = y^\top(I - M_k)y, \quad \text{where } M_k = X(X^\top X)^{-1}X^\top,$$
and
$$\mathrm{SSReg} = y^\top(M_k - M_0)y.$$
[Note that
$$\begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix} - \begin{pmatrix} x_1^\top\hat\beta \\ \vdots \\ x_n^\top\hat\beta \end{pmatrix} = y - X\hat\beta = y - X(X^\top X)^{-1}X^\top y = (I - M_k)y,$$
and Mₖ = Mₖ², (I − Mₖ)² = I − Mₖ.]
Therefore, ANOVA is nothing but
$$y^\top(I - M_0)y = y^\top(M_k - M_0)y + y^\top(I - M_k)y.$$
Actually, ANOVA is a Pythagorean equality (figure omitted), in which C(X) = {Xa : a ∈ R^{k+1}} is called the column space of X.
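A short numerical check of the Pythagorean equality above (an added sketch, not from the slides; the simulated data are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = X @ rng.normal(size=k + 1) + rng.normal(size=n)

M0 = np.ones((n, n)) / n                        # projection onto C(1)
Mk = X @ np.linalg.solve(X.T @ X, X.T)          # projection onto C(X)
I = np.eye(n)

SST = y @ (I - M0) @ y
SSReg = y @ (Mk - M0) @ y
SSRes = y @ (I - Mk) @ y
print(np.isclose(SST, SSReg + SSRes))           # True: SST = SSReg + SSRes
```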
Another look at SST = SSReg + SSRes
Assume
$$y_i = x_i^\top\beta + \varepsilon_i, \quad i = 1, \ldots, n,$$
where E(εᵢ) = 0, Var(εᵢ) = σ², the (xᵢ, εᵢ) are i.i.d., and E(εᵢ | xᵢ) = 0 for all i. Note that we consider the case of "random regressors" instead of fixed ones. Here are some observations:
(i) E(yᵢ) = E(xᵢ⊤β) are the same for all i.
(ii) Var(yᵢ) are the same for all i.
(iii) E(yᵢ | xᵢ) = E(xᵢ⊤β + εᵢ | xᵢ) = xᵢ⊤β.
(iv)
$$\begin{aligned} \mathrm{Var}(y_i) &= \mathrm{Var}(E(y_i \mid x_i)) + E(\mathrm{Var}(y_i \mid x_i)) \\ &= \mathrm{Var}(x_i^\top\beta) + E\{E[(y_i - x_i^\top\beta)^2 \mid x_i]\} \\ &= \mathrm{Var}(x_i^\top\beta) + \mathrm{Var}(\varepsilon_i). \end{aligned}$$
(v) Var(yᵢ) = Var(y₁) can be estimated by
$$\frac{1}{n}\sum_{i=1}^n (y_i - \bar y)^2 =: \widehat{\mathrm{Var}}(y_i).$$
Var(xᵢ⊤β) = E(xᵢ⊤β − E(xᵢ⊤β))² = E(xᵢ⊤β − E(yᵢ))² can be estimated by
$$\frac{1}{n}\sum_{i=1}^n (x_i^\top\hat\beta - \bar y)^2 =: \widehat{\mathrm{Var}}(x_i^\top\beta).$$
Var(εᵢ) can be estimated by
$$\frac{1}{n}\sum_{i=1}^n (y_i - x_i^\top\hat\beta)^2 =: \widehat{\mathrm{Var}}(\varepsilon_i).$$
(vi) Therefore, SST = SSReg + SSRes is nothing but
$$\widehat{\mathrm{Var}}(y_i) = \widehat{\mathrm{Var}}(x_i^\top\beta) + \widehat{\mathrm{Var}}(\varepsilon_i).$$
Projection Matrices
Let
$$X = \begin{pmatrix} x_{11} & \cdots & x_{1r} \\ \vdots & & \vdots \\ x_{n1} & \cdots & x_{nr} \end{pmatrix} = [X_1, \ldots, X_r]$$
be an n × r matrix. The column space of X, C(X), is defined as
$$C(X) = \{Xa : a = (a_1, \ldots, a_r)^\top \in \mathbb{R}^r\},$$
noting that Xa = a₁X₁ + ⋯ + a_rX_r.
Definition
An n × n matrix M is called an orthogonal projection matrix onto C(X) if and only if
1 for v ∈ C(X), Mv = v,
2 for w ∈ C⊥(X), Mw = 0, where C⊥(X) = {s : v⊤s = 0 for all v ∈ C(X)}.
Fact 1
C(M) = C(X).
Proof of Fact 1
Let v ∈ C(X). Then
v = Xb = MXb ∈ C(M), (why?)
for some b.
Let v ∈ C(M). Then
v = Ma = M(a1 + a2) = a1 ∈ C(X),
for some a, and some a1 ∈ C(X),a2 ∈ C⊥(X). This completes the proof.
Fact 2
M⊤ = M (symmetric) and M² = M (idempotent) if and only if M is an orthogonal projection matrix onto C(M).
Proof of Fact 2
(⇒) For v ∈ C(M), Mv = MMb = Mb = v for some b, by idempotency.
For w ∈ C⊥(M), Mw = M⊤w = 0 by symmetry. (why?)
(⇐) Define eᵢ = (0, …, 0, 1, 0, …, 0)⊤, whose i-th component is 1 and whose other components are 0. It suffices to show that for any eᵢ, eⱼ,
$$e_i^\top M^\top(I - M)e_j = 0. \quad \text{(why?)}$$
Since we can decompose eᵢ and eⱼ as $e_i = e_i^{(1)} + e_i^{(2)}$ and $e_j = e_j^{(1)} + e_j^{(2)}$, where $e_i^{(1)}, e_j^{(1)} \in C(M)$ and $e_i^{(2)}, e_j^{(2)} \in C^\perp(M)$,
$$e_i^\top M^\top(I - M)e_j = e_i^\top M^\top(I - M)(e_j^{(1)} + e_j^{(2)}) \overset{\text{why?}}{=} e_i^\top M^\top e_j^{(2)} \overset{\text{why?}}{=} e_i^{(1)\top} e_j^{(2)} = 0.$$
This completes the proof.
Fact 3
Orthogonal projection matrices are unique.
Proof of Fact 3
Let M and P be orthogonal projection matrices onto some space S ⊆ Rⁿ.
Then, for any v ∈ Rⁿ, v = v₁ + v₂, where v₁ ∈ S and v₂ ∈ S⊥.
The desired conclusion follows from
$$(M - P)v = (M - P)(v_1 + v_2) = (M - P)v_1 = 0.$$
Fact 4
Let o₁, …, o_r be an orthonormal basis of C(X), i.e.,
$$o_i^\top o_j = \begin{cases} 0, & \text{if } i \ne j, \\ 1, & \text{if } i = j, \end{cases}$$
and for any v ∈ C(X), v = Ob for some b ∈ R^r, where O = [o₁, …, o_r]. Then OO⊤ = Σ_{i=1}^r oᵢoᵢ⊤ is the orthogonal projection matrix onto C(X).
Proof of Fact 4
Since OO⊤ is symmetric and OO⊤OO⊤ = OO⊤, where O⊤O = I_r, the r-dimensional identity matrix, by Fact 2, OO⊤ is the orthogonal projection matrix onto C(OO⊤).
Moreover, for v ∈ C(X), we have
v = Ob = OO⊤Ob ∈ C(OO⊤)
for some b ∈ R^r.
In addition, C(OO⊤) ⊆ C(O) = C(X). The desired conclusion follows.
Remark
One can also prove the result by showing
(i) for v ∈ C(X), OO⊤v = OO⊤Ob = Ob = v, and
(ii) for w ∈ C⊥(X), OO⊤w = 0 (the n-dimensional vector of zeros).
The two proofs differ as follows: the first invokes Fact 2 to conclude that OO⊤ is the orthogonal projection matrix onto C(OO⊤), and then infers from the structure of C(OO⊤) that it coincides with C(X); the second directly guesses that OO⊤ is the orthogonal projection matrix onto C(X). The former argument is more roundabout but involves less "guessing"; the latter is the opposite.
Given a matrix X, how do we construct the orthogonal projection matrix for C(X)?
Gram-Schmidt process
Let X = [x₁, …, x_q] for some q ≥ 1. Define y₁ = x₁/‖x₁‖, where ‖x₁‖² = x₁⊤x₁, and then
$$w_2 = x_2 - (x_2^\top y_1)y_1, \qquad y_2 = w_2/\|w_2\|,$$
$$\vdots$$
$$w_s = x_s - \sum_{i=1}^{s-1}(x_s^\top y_i)y_i, \qquad y_s = w_s/\|w_s\|, \quad 2 \le s \le q.$$
If the rank of C(X) is r, 1 ≤ r ≤ q, then there are r non-zero yᵢ, denoted by y_{s₁}, …, y_{s_r}, and Y = (y_{s₁}, …, y_{s_r}) is an orthonormal basis of C(X).
YY⊤ is the orthogonal projection matrix onto C(X) (by Fact 4).
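The Gram-Schmidt construction translates directly into code. Below is a minimal sketch (not from the slides; the function name and the toy rank-deficient matrix are illustrative): it orthonormalizes the columns, skips the directions where w_s = 0, and returns the projection matrix YY⊤.

```python
import numpy as np

def gram_schmidt_projection(X, tol=1e-10):
    """Return (Y, Y @ Y.T): an orthonormal basis of C(X) and the
    orthogonal projection matrix onto C(X)."""
    basis = []
    for x in X.T:
        w = x.astype(float)
        for y in basis:                 # subtract projections on earlier y_i
            w = w - (x @ y) * y
        norm = np.linalg.norm(w)
        if norm > tol:                  # keep only the r non-zero directions
            basis.append(w / norm)
    Y = np.column_stack(basis)
    return Y, Y @ Y.T

# Rank-deficient example: the third column is the sum of the first two.
X = np.array([[1., 0.], [0., 1.], [1., 1.], [2., -1.]])
X = np.column_stack([X, X.sum(axis=1)])
Y, P = gram_schmidt_projection(X)
print(Y.shape[1])                       # rank r = 2
print(np.allclose(P @ X, X))            # P fixes every column of X
```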
Explanation of Rank
Explaining "the rank of C(X)":
Let J be a subset of {1, …, q} satisfying
(i) {xᵢ, i ∈ J} is linearly independent, i.e., Σ_{i∈J} aᵢxᵢ = 0 if and only if aᵢ = 0 for all i ∈ J;
(ii) for any J₁ ⊇ J with J₁ − J ≠ ∅, {xᵢ, i ∈ J₁} is not linearly independent.
The "rank of C(X)" is defined as #(J), the number of elements in J.
Moreover, if r(X) = q (i.e., the rank of C(X) is q), then X(X⊤X)⁻¹X⊤ is the orthogonal projection matrix onto C(X).
Proof
(i) X(X⊤X)⁻¹X⊤ is symmetric and idempotent.
(ii) C(X(X⊤X)⁻¹X⊤) = C(X). (why?)
If 1 ≤ r(X) < q, then X(X⊤X)⁻X⊤ is the orthogonal projection matrix onto C(X), where A⁻ denotes a generalized inverse (g-inverse) of A, defined as any matrix G such that AGA = A.
Note that
(X⊤X)⁻ = (X⊤X)⁻¹ if r(X) = q,
and there are infinitely many (X⊤X)⁻ if r(X) < q.
But in either case, X(X⊤X)⁻X⊤ is unique, according to Fact 3.
We now go back to regression problems and summarize the key features of M₀ = n⁻¹11⊤, Mₖ = X(X⊤X)⁻¹X⊤, (I − M₀), (I − Mₖ), and Mₖ − M₀, where
$$X = \begin{pmatrix} 1 & x_{11} & \cdots & x_{1k} \\ \vdots & \vdots & & \vdots \\ 1 & x_{n1} & \cdots & x_{nk} \end{pmatrix}.$$
(i) M0 is the orthogonal projection matrix onto C(1).
(ii) Mk is the orthogonal projection matrix onto C(X).
(iii) (I −M0) is the orthogonal projection matrix onto C⊥(1).
(iv) (I −Mk) is the orthogonal projection matrix onto C⊥(X).
(v) Mₖ − M₀ is the orthogonal projection matrix onto C((I − M₀)X), where
$$C((I - M_0)X) \overset{\text{why?}}{=} C\left(\begin{pmatrix} x_{11} - \bar x_1 \\ \vdots \\ x_{n1} - \bar x_1 \end{pmatrix}, \ldots, \begin{pmatrix} x_{1k} - \bar x_k \\ \vdots \\ x_{nk} - \bar x_k \end{pmatrix}\right),$$
with $\bar x_i = n^{-1}\sum_{j=1}^n x_{ji}$.
(vi)
$$M_0M_k = M_0 = M_kM_0, \quad (I - M_0)M_0 = 0, \quad (I - M_k)M_k = 0, \quad (I - M_k)M_0 = 0,$$
where 0 is the n × n matrix of zeros.
Estimation
Does β̂ possess any optimal properties?
E(β̂) = β since
$$\begin{aligned} E(\hat\beta) &= E((X^\top X)^{-1}X^\top y) = E\{(X^\top X)^{-1}X^\top(X\beta + \varepsilon)\} \\ &= \beta + E((X^\top X)^{-1}X^\top\varepsilon) = \beta + (X^\top X)^{-1}X^\top E(\varepsilon) \\ &= \beta + (X^\top X)^{-1}X^\top 0 = \beta. \end{aligned}$$
Var(β̂) = (X⊤X)⁻¹σ² because
$$\begin{aligned} \mathrm{Var}(\hat\beta) &= E((\hat\beta - \beta)(\hat\beta - \beta)^\top) = E\{(X^\top X)^{-1}X^\top\varepsilon\varepsilon^\top X(X^\top X)^{-1}\} \\ &= (X^\top X)^{-1}X^\top E(\varepsilon\varepsilon^\top)X(X^\top X)^{-1} = \sigma^2(X^\top X)^{-1}, \end{aligned}$$
noting that we have used E(εε⊤) = σ²I.
Gauss-Markov Theorem
For any β̃ = Ay satisfying
$$\beta = E(\tilde\beta) = E(Ay) = E(A(X\beta + \varepsilon)) = AX\beta \quad \text{for all } \beta,$$
we have Var(β̂) ≤ Var(β̃) in the sense that Var(β̃) − Var(β̂) is non-negative definite, i.e., for any ‖a‖ = 1,
$$a^\top\{\mathrm{Var}(\tilde\beta) - \mathrm{Var}(\hat\beta)\}a \ge 0. \quad (*)$$
Remark
(i) Ay is called a linear estimator of β.
(ii) β̃ is unbiased (since we assume E(β̃) = β for all β).
(iii) This theorem says that β̂ is the best linear unbiased estimator (BLUE) of β.
(iv) (∗) is equivalent to Var(a⊤β̃) ≥ Var(a⊤β̂) (why?), meaning that the variance of a⊤β̃ is never smaller than that of a⊤β̂, regardless of the direction vector a onto which β̃ and β̂ project.
Proof of Gauss-Markov Theorem
Let a ∈ R^{k+1} be arbitrarily chosen. Then,
$$\begin{aligned} \mathrm{Var}(a^\top\tilde\beta) &= E[a^\top(\tilde\beta - \beta)]^2 \quad (\text{since } \tilde\beta \text{ is unbiased}) \\ &= E(a^\top(\tilde\beta - \hat\beta) + a^\top(\hat\beta - \beta))^2 \\ &\ge \mathrm{Var}(a^\top\hat\beta) + 2E\{a^\top(\tilde\beta - \hat\beta)(\hat\beta - \beta)^\top a\} \quad (\text{since } \hat\beta \text{ is unbiased}) \\ &\overset{\text{why?}}{=} \mathrm{Var}(a^\top\hat\beta) + 2a^\top E\big((A - (X^\top X)^{-1}X^\top)\varepsilon\varepsilon^\top X(X^\top X)^{-1}\big)a \\ &\overset{\text{why?}}{=} \mathrm{Var}(a^\top\hat\beta) + 2\sigma^2a^\top(A - (X^\top X)^{-1}X^\top)X(X^\top X)^{-1}a \\ &\overset{\text{why?}}{=} \mathrm{Var}(a^\top\hat\beta) + 2\sigma^2a^\top[(X^\top X)^{-1}a - (X^\top X)^{-1}a] \\ &= \mathrm{Var}(a^\top\hat\beta). \end{aligned}$$
How do we estimate σ²?
$$\hat\sigma^2 = \frac{1}{n-(k+1)}\sum_{i=1}^n (y_i - \hat\beta_0 - \hat\beta_1 x_{i1} - \cdots - \hat\beta_k x_{ik})^2 = \frac{1}{n-(k+1)}\sum_{i=1}^n (y_i - x_i^\top\hat\beta)^2 = \frac{1}{n-(k+1)}y^\top(I - M_k)y.$$
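A one-function sketch of this unbiased estimator (added for illustration; `sigma2_hat` is a hypothetical name, and the residuals are computed via least squares):

```python
import numpy as np

def sigma2_hat(X, y):
    """Unbiased variance estimate y'(I - M_k) y / (n - (k + 1))."""
    n, p = X.shape                      # p = k + 1 columns (with intercept)
    resid = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
    return resid @ resid / (n - p)
```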
Why "k + 1"? Because "k + 1" makes σ̂² unbiased, namely E(σ̂²) = σ².
To see this, we have
$$E(\hat\sigma^2) = \frac{1}{n-(k+1)}E(y^\top(I - M_k)y) \overset{\text{why?}}{=} \frac{1}{n-(k+1)}E(\varepsilon^\top(I - M_k)\varepsilon) \overset{\text{why?}}{=} \frac{\sigma^2}{n-(k+1)}\mathrm{tr}(I - M_k) \overset{\text{why?}}{=} \sigma^2,$$
where ε = (ε₁, …, ε_n)⊤.
Reasons for the second "why": Define μ = E(z) and V = Cov(z) = E[(z − μ)(z − μ)⊤]. Then
$$E(z^\top Az) = \mu^\top A\mu + \mathrm{tr}(AV).$$
Since ε⊤(I − Mₖ)ε is a scalar,
$$E(\varepsilon^\top(I - M_k)\varepsilon) = E(\mathrm{tr}(\varepsilon^\top(I - M_k)\varepsilon)) = \mathrm{tr}[E\{(I - M_k)\varepsilon\varepsilon^\top\}] = \mathrm{tr}(I - M_k)\sigma^2.$$
Some facts about the trace operator
1. tr(A) := Σ_{i=1}^n A_{ii}, where A = [A_{ij}]_{1≤i,j≤n}.
2. tr(AB) = tr(BA) and tr(Σ_{i=1}^k Aᵢ) = Σ_{i=1}^k tr(Aᵢ).
3. tr(Mₖ) = tr(X(X⊤X)⁻¹X⊤) = tr((X⊤X)⁻¹X⊤X) = tr(I_{k+1}) = k + 1, where I_{k+1} is the (k+1)-dimensional identity matrix.
4. tr(Mₖ) = tr(Σ_{i=1}^{k+1} oᵢoᵢ⊤) = Σ_{i=1}^{k+1} tr(oᵢoᵢ⊤) = Σ_{i=1}^{k+1} tr(oᵢ⊤oᵢ) = k + 1, where {o₁, …, o_{k+1}} is an orthonormal basis for C(X).
5. Similarly, we have tr(I − Mₖ) = n − k − 1 and tr(I − M₀) = n − 1.
Multivariate Normal Distributions
Definition
We say z has an r-dimensional multivariate normal distribution with mean E(z) = μ and variance E((z − μ)(z − μ)⊤) = Σ > 0 (i.e., a⊤Σa > 0 for all a ∈ R^r with ‖a‖ = 1), denoted by N(μ, Σ), if there exist a k-dimensional standard normal vector ε = (ε₁, …, ε_k)⊤, k ≥ r (i.e., ε₁, …, ε_k are i.i.d. N(0, 1) random variables), and an r × k nonrandom matrix A of full row rank satisfying AA⊤ = Σ such that
z ∼ Aε + μ,
where ∼ means both sides of the notation have the same distribution.
If there exists a ∈ R^r such that a⊤Σa = 0, then E(a⊤(z − μ))² = 0 (why?).
This yields P(a⊤(z − μ) = 0) = 1 because E(a⊤(z − μ))² = 0 implies E(a⊤(z − μ)) = 0 and Var(a⊤z) = 0.
Therefore, with probability 1, one zᵢ is a linear combination of the other zⱼ's.
Why does E(X) = 0 imply P(X = 0) = 1 for non-negative X?
Fact
Let X be a non-negative r.v., i.e., P(X ≥ 0) = 1. Then E(X) = 0 implies P(X = 0) = 1.
Proof of Fact
Suppose P(X = 0) < 1. Then P(X > 0) > 0, and hence there exists some δ > 0 such that P(X > 0) > δ (why?).
Since P(X > 0) = P(⋃_{n=1}^∞ {X > n⁻¹}) = lim_{n→∞} P(X > n⁻¹) (why?), it follows that
P(X > M⁻¹) > δ/2 for some large integer M. (why?) (∗)
Now, (∗) yields
$$E(X) \overset{\text{why?}}{\ge} E(XI_{\{X > M^{-1}\}}) \overset{\text{why?}}{\ge} M^{-1}P(X > M^{-1}) \ge \delta/(2M) > 0,$$
which gives a contradiction. Thus, the proof is complete.
Remark
1. A with rows a₁⊤, …, a_r⊤ is said to have full row rank if a₁, …, a_r are linearly independent.
2. A is not unique, since for any P with P⊤P = PP⊤ = I_k, we have AA⊤ = APP⊤A⊤ = Σ.
3. If z ∼ N(μ, Σ), then for any B of full row rank, Bz ∼ N(Bμ, BΣB⊤).
4. If r = 2, then z is said to be bivariate normal.
5. Let z = (z₁, z₂)⊤ be a two-dimensional random vector that fulfills
z₁ ∼ N(0, 1), z₂ ∼ N(0, 1), and E(z₁z₂) = 0.
It is possible that z is not bivariate normal.
Fact 1
If z ∼ N(μ, Σ), then the joint probability density function (pdf) of z, f(z), is given by
$$f(z) = (2\pi)^{-r/2}(\det(\Sigma))^{-1/2}\exp\left\{-\frac{(z-\mu)^\top\Sigma^{-1}(z-\mu)}{2}\right\}.$$
Proof of Fact 1
By definition, z ∼ Aε + μ, where ε ∼ N(0, I_k), k ≥ r, and A is an r × k matrix of full row rank.
Let b₁, …, b_{k−r} satisfy
$$b_i^\top b_j = \begin{cases} 1, & i = j; \\ 0, & i \ne j, \end{cases}$$
and bᵢ⊤aⱼ = 0 for all 1 ≤ i ≤ k − r, 1 ≤ j ≤ r.
Regression Analysis
Finite Sample Theory
Multivariate Normal Distributions
Proof of Fact 1 (cont.)
Define
$$A^* = \begin{pmatrix} A \\ B \end{pmatrix} \equiv \begin{pmatrix} A \\ b_1^\top \\ \vdots \\ b_{k-r}^\top \end{pmatrix} \quad \text{and} \quad z^* = \begin{pmatrix} z \\ w \end{pmatrix} = A^*\varepsilon + \mu^*,$$
where μ* = (μ⊤, 0, …, 0)⊤.
Then, the joint pdf of z* is given by
$$f^*(z^*) = (2\pi)^{-k/2}\exp\left\{-\frac{(z^*-\mu^*)^\top(A^{*\top})^{-1}(A^*)^{-1}(z^*-\mu^*)}{2}\right\}\left|\det\big(A^{*-1}\big)\right|.$$
Proof of Fact 1 (cont.)
Note that here we have used the following facts:
(i) The joint pdf of ε is
$$(2\pi)^{-k/2}\exp\left\{-\frac{\varepsilon^\top\varepsilon}{2}\right\} = \prod_{i=1}^k (2\pi)^{-1/2}\exp\left(-\frac{\varepsilon_i^2}{2}\right);$$
since the εᵢ are independent, the joint pdf of (ε₁, …, ε_k) is the product of the marginal pdfs.
(ii) Let the joint pdf of v = (v₁, …, v_k)⊤ be denoted by f(v), v ∈ D ⊆ R^k; let g(v) = (g₁(v), …, g_k(v))⊤ be a "smooth" one-to-one transformation of D onto E ⊆ R^k; and let g⁻¹(s) = (g₁⁻¹(s), …, g_k⁻¹(s))⊤, s ∈ E, denote the inverse transformation of g, which satisfies g⁻¹(g(v)) = v.
Proof of Fact 1 (cont.)
Define
$$J = \frac{\partial g^{-1}(y)}{\partial y} = \begin{pmatrix} \frac{\partial g_1^{-1}(y)}{\partial y_1} & \cdots & \frac{\partial g_1^{-1}(y)}{\partial y_k} \\ \vdots & & \vdots \\ \frac{\partial g_k^{-1}(y)}{\partial y_1} & \cdots & \frac{\partial g_k^{-1}(y)}{\partial y_k} \end{pmatrix}.$$
Then, the joint pdf of y = g(v) is given by f(g⁻¹(y))|det(J)|. Now, since
$$(A^{*\top})^{-1}(A^*)^{-1} = (A^*A^{*\top})^{-1} = \left(\begin{pmatrix} A \\ B \end{pmatrix}\begin{pmatrix} A^\top & B^\top \end{pmatrix}\right)^{-1} = \begin{pmatrix} (AA^\top)^{-1} & 0 \\ 0 & I_{k-r} \end{pmatrix} = \begin{pmatrix} \Sigma^{-1} & 0 \\ 0 & I_{k-r} \end{pmatrix}$$
and
$$|\det((A^*)^{-1})| = |\det(A^*)|^{-1} = (\det(A^*)\det(A^*))^{-1/2} = \big(\det(A^*)\det(A^{*\top})\big)^{-1/2} = \big(\det(A^*A^{*\top})\big)^{-1/2} = (\det(\Sigma))^{-1/2},$$
Proof of Fact 1 (cont.)
we have
$$f^*(z^*) \overset{\text{why?}}{=} (2\pi)^{-r/2}\exp\left\{-\frac{(z-\mu)^\top\Sigma^{-1}(z-\mu)}{2}\right\}(\det(\Sigma))^{-1/2} \times (2\pi)^{-(k-r)/2}\exp\{-(w^\top w)/2\},$$
and hence
$$f(z) = \int_{-\infty}^{\infty}\!\cdots\!\int_{-\infty}^{\infty} f^*(z^*)\,dw = (2\pi)^{-r/2}\exp\left\{-\frac{(z-\mu)^\top\Sigma^{-1}(z-\mu)}{2}\right\}(\det(\Sigma))^{-1/2},$$
where $\int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty}(2\pi)^{-(k-r)/2}\exp\{-(w^\top w)/2\}\,dw = 1$. (why?)
Fact 2
Assume z ∼ N(μ, Σ) and z = (z₁⊤, z₂⊤)⊤, where z₁ and z₂ are r₁- and r₂-dimensional, respectively. Then Cov(z₁, z₂) = E((z₁ − μ₁)(z₂ − μ₂)⊤) = 0, where 0 is a zero matrix, if and only if z₁ and z₂ are independent.
Proof of Fact 2
(⇐) It is easy and hence skipped.
(⇒) Since Cov(z₁, z₂) = 0, we have by Fact 1,
$$f(z) = f(z_1, z_2) = \prod_{i=1}^2 (2\pi)^{-r_i/2}\exp\left\{-\frac{(z_i-\mu_i)^\top\Sigma_{ii}^{-1}(z_i-\mu_i)}{2}\right\}|\det(\Sigma_{ii})|^{-1/2} = f(z_1)f(z_2),$$
where (μ₁⊤, μ₂⊤)⊤ = μ and
$$\begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix} = \Sigma = \begin{pmatrix} \Sigma_{11} & 0 \\ 0 & \Sigma_{22} \end{pmatrix}, \quad \text{by hypothesis.}$$
Proof of Fact 2 (cont.)
Since f(z₁) is the joint pdf of z₁ and f(z₂) is the joint pdf of z₂, the above identity implies that z₁ and z₂ are independent. (why?)
Here we have used that X and Y are independent iff f(x, y) = f_X(x)f_Y(y).
Fact 3
Let z ∼ N(μ, σ²I_r) and let C = (B₁⊤, B₂⊤)⊤ be a q × r matrix, q ≤ r, of full row rank. Then B₁z and B₂z are independent if B₁B₂⊤ = 0.
Proof of Fact 3
Since
$$\mathrm{Cov}(B_1z, B_2z) = E(B_1(z-\mu)(z-\mu)^\top B_2^\top) = \sigma^2B_1B_2^\top = 0,$$
by Fact 2, the desired conclusion follows.
Definition
Let z be an r-dimensional random vector and let A be an r × r symmetric matrix. Then z⊤Az is called a quadratic form.
Fact 4
Let E(z) = μ and Var(z) = Σ. Then
E(z⊤Az) = μ⊤Aμ + tr(AΣ).
Proof of Fact 4
For μ = 0, we have
E(z⊤Az) = E(tr(Azz⊤)) = tr(A E(zz⊤)) = tr(AΣ).
For μ ≠ 0, we have
$$\mathrm{tr}(A\Sigma) \overset{\text{why?}}{=} E((z-\mu)^\top A(z-\mu)) \overset{\text{why?}}{=} E(z^\top Az) - 2\mu^\top A\mu + \mu^\top A\mu,$$
and hence the desired conclusion holds.
Fact 5
If z ∼ N(0, I_r) and M is an r × r orthogonal projection matrix, then
z⊤Mz ∼ χ²(r(M)),
where r(M) denotes the rank of M and χ²(k) denotes the chi-square distribution with k degrees of freedom.
Proof of Fact 5
Denote r(M) by q. Let {o₁, …, o_q} be an orthonormal basis for C(M).
We have shown that M = OO⊤ = Σ_{i=1}^q oᵢoᵢ⊤, where O = [o₁, …, o_q]; note that O⊤O = I_q.
Since O⊤ has full row rank, O⊤z ∼ N(0, O⊤O) = N(0, I_q), yielding that oᵢ⊤z, i = 1, …, q, are i.i.d. N(0, 1) distributed. In addition, we have
$$z^\top OO^\top z = \sum_{i=1}^q (o_i^\top z)^2 \sim \chi^2(q),$$
which completes the proof.
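A quick simulation check of Fact 5 (an added sketch; the random projection construction and sample sizes are arbitrary choices):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
r, q, B = 6, 3, 20000
O = np.linalg.qr(rng.normal(size=(r, q)))[0]    # orthonormal columns
M = O @ O.T                                     # projection matrix, rank q

z = rng.normal(size=(B, r))
quad = np.einsum('bi,ij,bj->b', z, M, z)        # z' M z for each draw
# Compare the empirical distribution of z'Mz with chi-square(q):
print(stats.kstest(quad, 'chi2', args=(q,)).pvalue)   # typically large
```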
Fact 6
Let z ∼ N(0, Σ) be r-dimensional. Then z⊤Σ⁻¹z ∼ χ²(r).
Proof of Fact 6
Since z ∼ N(0, Σ), we have z ∼ Aε, in which AA⊤ = Σ and ε ∼ N(0, I_k) for some k ≥ r. Here, A is an r × k matrix of full row rank. This implies
$$z^\top\Sigma^{-1}z \overset{d}{=} \varepsilon^\top A^\top(AA^\top)^{-1}A\varepsilon.$$
Here, =ᵈ means "is equal in distribution to".
Note that A⊤(AA⊤)⁻¹A is symmetric and idempotent. Therefore, it is an orthogonal projection matrix with rank r (why?). By Fact 5,
ε⊤A⊤(AA⊤)⁻¹Aε ∼ χ²(r),
which gives the desired conclusion.
Gaussian Regression
Assume ε in y = Xβ + ε obeys ε ∼ N(0, σ²I_n).
D1: β̂ = (X⊤X)⁻¹X⊤ε + β ∼ N(β, (X⊤X)⁻¹σ²). Please convince yourself of this result!!
D2:
$$\hat\sigma^2 = \frac{1}{n-k-1}\varepsilon^\top(I - M_k)\varepsilon = \frac{\sigma^2}{n-k-1}\cdot\frac{\varepsilon^\top(I - M_k)\varepsilon}{\sigma^2} \sim \sigma^2\frac{\chi^2(n-k-1)}{n-k-1},$$
recalling that Mₖ = X(X⊤X)⁻¹X⊤ and
$$X = \begin{pmatrix} 1 & x_{11} & \cdots & x_{1k} \\ \vdots & \vdots & & \vdots \\ 1 & x_{n1} & \cdots & x_{nk} \end{pmatrix}.$$
Here I is I_n, but I sometimes drop the subscript "n" when no confusion is possible.
Hypothesis testing
(a) F test
Consider the null hypothesis
H₀ : β₁ = β₂ = ⋯ = βₖ = 0 (meaning the regression is unimportant)
against H_A : H₀ is wrong (the alternative hypothesis).
Test statistic:
$$T_1 = \frac{\mathrm{SSReg}/k}{\mathrm{SSRes}/(n-k-1)} = \frac{\text{per-unit contribution of the "regression"}}{\text{per-unit contribution of the "model residuals"}}.$$
T₁ is thus a comparison of these two kinds of "contributions". When T₁ is "large", we tend to "reject" H₀, since the contribution of the regression is then non-negligible. But what counts as "large"? This must be decided by the distribution of T₁, in particular the distribution of T₁ under H₀.
More precisely, when H₀ holds, T₁ should not be too large. If we can obtain the distribution of T₁ under H₀, we can find the value "c" for which
P_{H₀}(0 ≤ T₁ ≤ c) = 95% (this percentage can be adjusted to individual needs).
That is, T₁ ∈ (0, c) with probability as high as 95%, and when T₁ ≥ c we should strongly "suspect" that H₀ may be wrong (because something that is very unlikely under H₀ has happened).
Hence we can take T₁ ≥ c (or T₁ < c) as a "testing rule", i.e., reject H₀ if T₁ ≥ c and do not reject H₀ if T₁ < c. Using this rule, the probability of committing a Type I error is 5%. [5% is called the "significance level" of this test, and such a test is called an α-level test with α = 5%.]

Action \ Truth     | H₀           | H_A
Do not reject H₀   | O.K.         | Type II error
Reject H₀          | Type I error | O.K.

For more on statistical testing, see the article "統計顯著性" ("Statistical Significance") by Professor 黃文璋.
How do we derive the distribution of T₁ under H₀?
(i)
$$\frac{\mathrm{SSReg}}{k} \overset{\text{under } H_0}{=} \frac{\varepsilon^\top(M_k - M_0)\varepsilon}{k} \overset{\text{by Fact 5}}{\sim} \sigma^2\frac{\chi^2(k)}{k}.$$
(ii)
$$\frac{\mathrm{SSRes}}{n-k-1} = \hat\sigma^2 \overset{\text{by D2}}{\sim} \sigma^2\frac{\chi^2(n-k-1)}{n-k-1}.$$
(iii) SSReg and SSRes are independent. This is because
$$\mathrm{SSReg} \overset{\text{under } H_0}{=} \varepsilon^\top O_{\mathrm{Reg}}O_{\mathrm{Reg}}^\top\varepsilon,$$
where O_Reg consists of the orthonormal basis of C((I − M₀)X), and
$$\mathrm{SSRes} = \varepsilon^\top O_{\mathrm{Res}}O_{\mathrm{Res}}^\top\varepsilon,$$
where O_Res consists of the orthonormal basis of C⊥((I − M₀)X). Moreover, since
O_Reg⊤O_Res = 0 (0: zero matrix),
by Fact 3, O_Reg⊤ε and O_Res⊤ε are independent, and hence SSReg and SSRes are independent (why?).
(iv) Combining (i) ∼ (iii), we obtain
$$T_1 \overset{H_0}{\sim} F(k, n-k-1),$$
where F(k, n−k−1) is the F-distribution with k and n−k−1 degrees of freedom. [Why? Because T₁ (under H₀) is a ratio of two independent chi-square random variables, each divided by its corresponding degrees of freedom.]
(v) (α-level) Testing rule: Reject H₀ if
T₁ ≥ f_{1−α}(k, n−k−1),
where P(F(k, n−k−1) > f_{1−α}(k, n−k−1)) = α.
f_{1−α}(k, n−k−1) is called the upper critical value of the F(k, n−k−1) distribution.
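The F test above is easy to code. Here is a minimal sketch (added for illustration; `overall_f_test` is a hypothetical name, and the first column of X is assumed to be the intercept):

```python
import numpy as np
from scipy import stats

def overall_f_test(X, y):
    """F test of H0: beta_1 = ... = beta_k = 0 (intercept excluded)."""
    n, p = X.shape                       # p = k + 1, first column is ones
    k = p - 1
    fit = X @ np.linalg.lstsq(X, y, rcond=None)[0]
    ss_res = np.sum((y - fit) ** 2)
    ss_reg = np.sum((fit - y.mean()) ** 2)
    T1 = (ss_reg / k) / (ss_res / (n - k - 1))
    p_value = stats.f.sf(T1, k, n - k - 1)   # P(F(k, n-k-1) >= T1)
    return T1, p_value
```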
(b) Wald test
Consider the linear parametric hypothesis
H₀ : Dβ = γ versus H_A : H₀ is wrong,
where D and γ are known, D is a q × (k+1) matrix with 1 ≤ q ≤ k+1, and γ is a q × 1 vector.
Example
If β = (β₁, …, β₄)⊤,
$$D = \begin{pmatrix} 1 & 0 & -1 & 0 \\ 0 & 1 & 0 & -1 \end{pmatrix}, \quad \gamma = \begin{pmatrix} 0 \\ 0 \end{pmatrix},$$
then H₀ : β₁ = β₃ and β₂ = β₄, and H_A : β₁ ≠ β₃ or β₂ ≠ β₄.
By suitably imposing D and γ, Wald tests are much more flexible than F tests.
Test statistic:
$$W_1 = \frac{(D\hat\beta - \gamma)^\top E^{-1}(D\hat\beta - \gamma)}{\hat\sigma^2 q}, \quad \text{where } E = D(X^\top X)^{-1}D^\top.$$
What is the distribution of W₁ under H₀?
(i) Dβ̂ − γ ∼ N(0, σ²E) under H₀. (Why? Dβ̂ − γ = D(β̂ − β) under H₀.)
(ii) (Dβ̂ − γ)⊤E⁻¹(Dβ̂ − γ)/σ² ∼ χ²(q). (by Fact 6)
(iii) β̂ and σ̂² are independent. (Why? We've argued this previously!!)
(iv) σ̂²/σ² ∼ χ²(n−k−1)/(n−k−1). (We've already shown this!!)
(v) By (i) ∼ (iv), W₁ ∼ F(q, n−k−1) under H₀.
(vi) Now you can set an α, find the critical value from the F table, and establish your α-level test!!
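A matching sketch of the Wald statistic W₁ and its F(q, n−k−1) p-value (added code; `wald_test` is a hypothetical name):

```python
import numpy as np
from scipy import stats

def wald_test(X, y, D, gamma):
    """W1 = (D b - gamma)' E^{-1} (D b - gamma) / (sigma2_hat * q),
    compared against F(q, n - k - 1)."""
    n, p = X.shape                       # p = k + 1
    q = D.shape[0]
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y
    sigma2 = np.sum((y - X @ beta_hat) ** 2) / (n - p)
    diff = D @ beta_hat - gamma
    E = D @ XtX_inv @ D.T
    W1 = diff @ np.linalg.solve(E, diff) / (sigma2 * q)
    return W1, stats.f.sf(W1, q, n - p)
```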
(c) t-test
Consider the following hypothesis,
H₀ : βⱼ = b, where 1 ≤ j ≤ k and b is known,
against the alternative, H_A : βⱼ ≠ b.
We have:
(i) β̂ − β ∼ N(0, (X⊤X)⁻¹σ²) [see D1], and hence under H₀,
$$\hat\beta_j - b = e_j^\top(\hat\beta - \beta) \sim N(0, e_j^\top(X^\top X)^{-1}e_j\,\sigma^2),$$
where eⱼ = (0, …, 0, 1, 0, …, 0)⊤, whose j-th component is 1 and whose other components are zeros.
(ii) σ̂²/σ² ∼ χ²(n−k−1)/(n−k−1). [see D2]
(iii) σ̂² and β̂ⱼ are independent. (why?)
(iv) By (i) ∼ (iii),
$$T \equiv \frac{(\hat\beta_j - b)\big/\sqrt{e_j^\top(X^\top X)^{-1}e_j\,\sigma^2}}{\sqrt{\hat\sigma^2/\sigma^2}} = \frac{\hat\beta_j - b}{\sqrt{e_j^\top(X^\top X)^{-1}e_j\,\hat\sigma^2}} \overset{H_0}{\sim} t(n-k-1),$$
where t(n−k−1) is the t-distribution with n−k−1 degrees of freedom.
(v) Testing rule: Reject H₀ if |T| > t_{α/2}(n−k−1).
We have P_{H₀}(|T| > t_{α/2}(n−k−1)) = α, and hence this is a level-α test.
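And the corresponding t statistic in code (added sketch; `t_test_coef` is a hypothetical name, and `j` indexes the column of X whose coefficient is tested):

```python
import numpy as np
from scipy import stats

def t_test_coef(X, y, j, b=0.0):
    """T = (beta_hat_j - b) / sqrt(e_j'(X'X)^{-1} e_j * sigma2_hat),
    with a two-sided p-value from t(n - k - 1)."""
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y
    sigma2 = np.sum((y - X @ beta_hat) ** 2) / (n - p)
    T = (beta_hat[j] - b) / np.sqrt(XtX_inv[j, j] * sigma2)
    return T, 2 * stats.t.sf(abs(T), n - p)
```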
Interval Estimation
We first recall some results on point estimation:
(i) E(β̂) = β and E(σ̂²) = σ² (unbiasedness).
(ii) Var(β̂) = (X⊤X)⁻¹σ².
(iii) β̂ is BLUE!!
(iv) Var(σ̂²) = 2σ⁴/(n−k−1) → 0 as n → ∞ (under the normal assumption) [which is a desired result, because it shows that the estimation quality gets better and better as the sample size gets larger and larger!!]
To see this, note first that
(a) σ̂²/σ² ∼ χ²(n−k−1)/(n−k−1);
(b) E(χ²(n−k−1)) = n−k−1;
(c) Var(χ²(n−k−1)) = 2(n−k−1).
By (a)–(c), Var(σ̂²) = 2σ⁴/(n−k−1) follows.
However, if the normal assumption fails to hold, how should we calculate Var(σ̂²)?
Some ideas:
$$\hat\sigma^2 = \frac{1}{n-k-1}y^\top(I - M_k)y = \frac{1}{n-k-1}\varepsilon^\top(I - M_k)\varepsilon = \frac{1}{n-k-1}\sum_{i=1}^n\sum_{j=1}^n A_{ij}\varepsilon_i\varepsilon_j,$$
where [A_{ij}]_{1≤i,j≤n} ≡ A = I − Mₖ. It is clear that
$$E(\hat\sigma^2) = \frac{1}{n-k-1}\sum_{i=1}^n\sum_{j=1}^n A_{ij}E(\varepsilon_i\varepsilon_j) \overset{\text{why?}}{=} \frac{1}{n-k-1}\sum_{i=1}^n A_{ii}\sigma^2 = \frac{\sigma^2}{n-k-1}\mathrm{tr}(I - M_k) = \sigma^2.$$
Moreover, we have
$$\begin{aligned} E(\hat\sigma^4) &= \Big(\frac{1}{n-k-1}\Big)^2\sum_{i=1}^n\sum_{j=1}^n\sum_{k=1}^n\sum_{l=1}^n A_{ij}A_{kl}E(\varepsilon_i\varepsilon_j\varepsilon_k\varepsilon_l) \\ &= \Big(\frac{1}{n-k-1}\Big)^2\sum_{i=1}^n A_{ii}^2E(\varepsilon_i^4) \quad (i=j=k=l) \\ &\quad+ \Big(\frac{1}{n-k-1}\Big)^2\sum_{\substack{1\le i,k\le n\\ i\ne k}} A_{ii}A_{kk}E(\varepsilon_i^2)E(\varepsilon_k^2) \quad (i=j\ne k=l) \\ &\quad+ \Big(\frac{1}{n-k-1}\Big)^2\sum_{\substack{1\le i,j\le n\\ i\ne j}} A_{ij}^2E(\varepsilon_i^2)E(\varepsilon_j^2) \quad (i=k\ne j=l) \\ &\quad+ \Big(\frac{1}{n-k-1}\Big)^2\sum_{\substack{1\le i,j\le n\\ i\ne j}} A_{ij}A_{ji}E(\varepsilon_i^2)E(\varepsilon_j^2) \quad (i=l\ne j=k), \end{aligned}$$
where Σ_{i≠j} A_{ij}A_{ji}E(εᵢ²)E(εⱼ²) = Σ_{i≠j} A_{ij}²E(εᵢ²)E(εⱼ²) (since A is symmetric).
Simple algebra shows that
$$\begin{aligned} E(\hat\sigma^4) &= \Big(\frac{1}{n-k-1}\Big)^2(E(\varepsilon_1^4) - 3\sigma^4)\sum_{i=1}^n A_{ii}^2 + \Big(\frac{1}{n-k-1}\Big)^2\sigma^4\left(\sum_{i=1}^n\sum_{k=1}^n A_{ii}A_{kk} + 2\sum_{i=1}^n\sum_{j=1}^n A_{ij}^2\right) \\ &= \frac{1}{(n-k-1)^2}(E(\varepsilon_1^4) - 3\sigma^4)\sum_{i=1}^n A_{ii}^2 + \sigma^4 + \frac{2\sigma^4}{n-k-1}. \end{aligned}$$
Note
(i) E(ε₁⁴) − 3σ⁴ = 0 if ε is normal.
(ii) Σ_{i=1}^n Σ_{k=1}^n A_{ii}A_{kk} = (tr(A))² = (tr(I − Mₖ))² = (n−k−1)².
(iii) Σ_{i=1}^n Σ_{j=1}^n A_{ij}² = tr(A²) = tr((I − Mₖ)²) = tr(I − Mₖ) = n−k−1.
Hence
$$\mathrm{Var}(\hat\sigma^2) = \frac{1}{(n-k-1)^2}(E(\varepsilon_1^4) - 3\sigma^4)\sum_{i=1}^n A_{ii}^2 + \frac{2\sigma^4}{n-k-1}.$$
Will
$$\frac{1}{(n-k-1)^2}(E(\varepsilon_1^4) - 3\sigma^4)\sum_{i=1}^n A_{ii}^2$$
converge to zero as n → ∞? Yes, because
$$\sum_{i=1}^n A_{ii}^2 \le \sum_{i=1}^n A_{ii} = \mathrm{tr}(A) = \mathrm{tr}(I - M_k) = n - k - 1.$$
To see this, we note that the idempotent property of A yields
$$A_{ii} = \sum_{j=1}^n A_{ij}^2 \ge A_{ii}^2 \quad (\text{which also yields } 0 \le A_{ii} \le 1).$$
We now get back to interval estimation.
(i) The first goal is to find an interval I_α such that
P(βᵢ ∈ I_α) = 1 − α,
where α is small and is decided by the user; 1 − α is called a "confidence level".
How do we construct I_α?
(a)
$$\frac{\hat\beta_i - \beta_i}{\sqrt{e_i^\top(X^\top X)^{-1}e_i\,\hat\sigma^2}} \sim t(n-k-1).$$
(b) P(βᵢ ∈ (Lᵢ, Rᵢ)) = 1 − α, with
$$L_i = \hat\beta_i - t_{1-\alpha/2}(n-k-1)\sqrt{e_i^\top(X^\top X)^{-1}e_i\,\hat\sigma^2}, \qquad R_i = \hat\beta_i + t_{1-\alpha/2}(n-k-1)\sqrt{e_i^\top(X^\top X)^{-1}e_i\,\hat\sigma^2}.$$
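The interval (Lᵢ, Rᵢ) in (b) in code (added sketch; `coef_confidence_interval` is a hypothetical name):

```python
import numpy as np
from scipy import stats

def coef_confidence_interval(X, y, i, alpha=0.05):
    """(L_i, R_i) = beta_hat_i -/+ t_{1-alpha/2}(n-k-1) * se_i."""
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y
    sigma2 = np.sum((y - X @ beta_hat) ** 2) / (n - p)
    se = np.sqrt(XtX_inv[i, i] * sigma2)
    tq = stats.t.ppf(1 - alpha / 2, n - p)
    return beta_hat[i] - tq * se, beta_hat[i] + tq * se
```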
Does the interval described in (b) have the shortest length?
To answer this question, we need to solve the following problem:
minimize b − a subject to F(b) − F(a) = 1 − α,
where F(·) denotes the distribution function of the t(n−k−1) distribution, and
$$P\left(a < \frac{\hat\beta_i - \beta_i}{\sqrt{e_i^\top(X^\top X)^{-1}e_i\,\hat\sigma^2}} \le b\right) = F(b) - F(a) = 1 - \alpha.$$
By the Lagrange method, define
g(a, b, λ) = b − a − λ(F(b) − F(a) − (1 − α))
and set ∇g(a, b, λ) = 0, where ∇g = (∂g/∂a, ∂g/∂b, ∂g/∂λ)⊤. The last identity yields
$$f(b) = f(a) = \frac{1}{\lambda}, \qquad F(b) - F(a) = 1 - \alpha, \quad (*)$$
where f(·) is the pdf of the t(n−k−1) distribution.
Since the pdf of t(n−k−1) is symmetric and strictly decreasing (increasing) when x ≥ 0 (when x ≤ 0), (∗) implies b = −a and b > 0.
As a result, the unique solution to (∗) is (−b, b) with 2F(b) = 2 − α, i.e.,
(−t_{1−α/2}(n−k−1), t_{1−α/2}(n−k−1)).
To check whether 2t_{1−α/2}(n−k−1) minimizes b − a, we still need to consider the so-called "bordered" Hessian matrix evaluated at
$$s^* = \begin{pmatrix} a^* \\ b^* \\ \lambda^* \end{pmatrix} = \begin{pmatrix} -t_{1-\alpha/2}(n-k-1) \\ t_{1-\alpha/2}(n-k-1) \\ 1/f(t_{1-\alpha/2}(n-k-1)) \end{pmatrix}.$$
Note that the bordered Hessian matrix is defined by
$$\nabla^2 g = \begin{pmatrix} \frac{\partial^2 g}{\partial a\,\partial a} & \frac{\partial^2 g}{\partial a\,\partial b} & \frac{\partial^2 g}{\partial a\,\partial\lambda} \\ \cdot & \frac{\partial^2 g}{\partial b\,\partial b} & \frac{\partial^2 g}{\partial b\,\partial\lambda} \\ \cdot & \cdot & \frac{\partial^2 g}{\partial\lambda\,\partial\lambda} \end{pmatrix},$$
where ∂²g/∂λ∂λ = 0, and it is straightforward to show that
$$\nabla^2 g(s^*) = \begin{pmatrix} \frac{f'(-t_{1-\alpha/2}(n-k-1))}{f(t_{1-\alpha/2}(n-k-1))} & 0 & f(-t_{1-\alpha/2}(n-k-1)) \\ 0 & \frac{-f'(t_{1-\alpha/2}(n-k-1))}{f(t_{1-\alpha/2}(n-k-1))} & -f(t_{1-\alpha/2}(n-k-1)) \\ f(-t_{1-\alpha/2}(n-k-1)) & -f(t_{1-\alpha/2}(n-k-1)) & 0 \end{pmatrix}.$$
Since the principal submatrix
$$\begin{pmatrix} \frac{f'(-t_{1-\alpha/2}(n-k-1))}{f(t_{1-\alpha/2}(n-k-1))} & 0 \\ 0 & \frac{-f'(t_{1-\alpha/2}(n-k-1))}{f(t_{1-\alpha/2}(n-k-1))} \end{pmatrix}$$
is positive definite, it follows that 2t_{1−α/2}(n−k−1) minimizes b − a subject to F(b) − F(a) = 1 − α.
(ii) The second goal is to find a (k+1)-dimensional set V_α such that
P(β ∈ V_α) = 1 − α.
How do we construct V_α?
(a)
$$\frac{(\hat\beta - \beta)^\top X^\top X(\hat\beta - \beta)}{\sigma^2} \sim \chi^2(k+1). \quad \text{(by Fact 6)}$$
(b)
$$\frac{(\hat\beta - \beta)^\top X^\top X(\hat\beta - \beta)}{(k+1)\hat\sigma^2} \sim F(k+1, n-k-1).$$
(c) V_α is the set of β satisfying
$$a \le \frac{(\hat\beta - \beta)^\top X^\top X(\hat\beta - \beta)}{(k+1)\hat\sigma^2} \le b,$$
where F*(b) − F*(a) = 1 − α and F*(·) is the distribution function of F(k+1, n−k−1).
(d) It can be shown that the volume of the larger ellipsoid is
$$\frac{\pi^{\frac{k+1}{2}}}{\Gamma\big(\frac{k+1}{2}+1\big)}\big((k+1)\hat\sigma^2 b\big)^{\frac{k+1}{2}}(\det(X^\top X))^{-1/2},$$
and that of the smaller one is
$$\frac{\pi^{\frac{k+1}{2}}}{\Gamma\big(\frac{k+1}{2}+1\big)}\big((k+1)\hat\sigma^2 a\big)^{\frac{k+1}{2}}(\det(X^\top X))^{-1/2}.$$
Hence the volume of V_α is minimized by
minimizing b^{(k+1)/2} − a^{(k+1)/2} subject to F*(b) − F*(a) = 1 − α.
However, in general, this minimization problem does not have a closed-form solution, but it can be shown that when k = 1,
a* = 0 and b* = F_{1−α}(k+1, n−k−1),
and when both n and k are large with n ≫ k,
a* ∼ 0 and b* ∼ F_{1−α}(k+1, n−k−1).
Note also that unlike the t-distributions, when d₁ > 1 the pdfs of F distributions have very small values near the origin.
Another look at β̂
Let Xₖ = (X_{k−1}, xₖ). Because C(Xₖ) and C(X_{k−1}, (I − M_{k−1})xₖ) are the same, we have
$$M_ky = (X_{k-1}, x_k)\begin{pmatrix} \hat\beta_{k-1} \\ \hat\beta_k \end{pmatrix} \overset{\text{why?}}{=} (X_{k-1}, (I - M_{k-1})x_k)\begin{pmatrix} (X_{k-1}^\top X_{k-1})^{-1} & 0 \\ 0^\top & \frac{1}{x_k^\top(I - M_{k-1})x_k} \end{pmatrix}\begin{pmatrix} X_{k-1}^\top y \\ x_k^\top(I - M_{k-1})y \end{pmatrix},$$
yielding
$$X_{k-1}\hat\beta_{k-1} + x_k\hat\beta_k = X_{k-1}[(X_{k-1}^\top X_{k-1})^{-1}X_{k-1}^\top y - (X_{k-1}^\top X_{k-1})^{-1}X_{k-1}^\top x_k\beta_k^*] + x_k\beta_k^*,$$
where
$$\beta_k^* = \frac{x_k^\top(I - M_{k-1})y}{x_k^\top(I - M_{k-1})x_k}.$$
In addition, since Xₖ is of full rank, we obtain
$$\hat\beta_k = \beta_k^* = \frac{x_k^\top(I - M_{k-1})y}{x_k^\top(I - M_{k-1})x_k}.$$
This shows that β̂ₖ is equivalent to the LSE of the simple regression of (I − M_{k−1})y on (I − M_{k−1})xₖ.
As a result, β̂ₖ can only be viewed as the marginal contribution of xₖ to y when the effects of the other variables are removed in advance.
Model Selection
Mallows’ Cp:
Let Xp be a submodel of Xk.
Can we construct a measure to describe its prediction performance?
Let M_p be the orthogonal projection matrix of X_p. Then M_p y can be used to predict new observations
y_New = Xₖβ + ε_New,
where ε_New and ε are independent but have the same distribution.
The performance of M_p y can be measured by
$$E\|y_{\mathrm{New}} - M_py\|^2 = E\|X_k\beta + \varepsilon_{\mathrm{New}} - M_py\|^2 \overset{(*)}{=} n\sigma^2 + E\|X_k\beta - M_py\|^2.$$
(∗): since ε_New and y are independent.
Let Xₖ = (X_p, X_{−p}) and β = (β_p⊤, β_{−p}⊤)⊤.
Moreover, we have
$$\begin{aligned} E\|X_k\beta - M_py\|^2 &= E\|X_p\beta_p + X_{-p}\beta_{-p} - M_p(X_{-p}\beta_{-p} + X_p\beta_p + \varepsilon)\|^2 \\ &= E\|(I - M_p)X_{-p}\beta_{-p} - M_p\varepsilon\|^2 \\ &\overset{\text{why?}}{=} p\sigma^2 + \beta_{-p}^\top X_{-p}^\top(I - M_p)X_{-p}\beta_{-p}. \end{aligned}$$
Hence,
$$E\|y_{\mathrm{New}} - M_py\|^2 = (n+p)\sigma^2 + \beta_{-p}^\top X_{-p}^\top(I - M_p)X_{-p}\beta_{-p}.$$
To estimate this expectation, we start by considering
SSRes(p) = y⊤(I − M_p)y.
Note first that
$$\begin{aligned} E(\mathrm{SSRes}(p)) &= E(X_{-p}\beta_{-p} + \varepsilon)^\top(I - M_p)(X_{-p}\beta_{-p} + \varepsilon) \\ &= \beta_{-p}^\top X_{-p}^\top(I - M_p)X_{-p}\beta_{-p} + E(\varepsilon^\top(I - M_p)\varepsilon) \\ &= \beta_{-p}^\top X_{-p}^\top(I - M_p)X_{-p}\beta_{-p} + (n-p)\sigma^2. \end{aligned}$$
Therefore,
$$E(\mathrm{SSRes}(p) + 2p\sigma^2) = \beta_{-p}^\top X_{-p}^\top(I - M_p)X_{-p}\beta_{-p} + (n+p)\sigma^2 = E\|y_{\mathrm{New}} - M_py\|^2.$$
Now, Mallows' C_p is defined by
SSRes(p) + 2pσ̂²,
which (with σ̂² an unbiased estimate of σ²) is an unbiased estimate of E‖y_New − M_p y‖².
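A compact sketch of C_p for a candidate submodel (added code; `mallows_cp` is a hypothetical name, and, as a common convention I am assuming here, σ̂² is taken from the full model):

```python
import numpy as np

def mallows_cp(X_full, y, cols):
    """SSRes(p) + 2 p sigma2_hat for the submodel using columns `cols`,
    with sigma2_hat estimated from the full model."""
    n, k1 = X_full.shape                 # k1 = k + 1
    resid_full = y - X_full @ np.linalg.lstsq(X_full, y, rcond=None)[0]
    sigma2 = resid_full @ resid_full / (n - k1)
    Xp = X_full[:, cols]
    p = Xp.shape[1]
    resid_p = y - Xp @ np.linalg.lstsq(Xp, y, rcond=None)[0]
    return resid_p @ resid_p + 2 * p * sigma2
```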
Prediction
(a) How do we predict E(y_{n+1}) = x_{n+1}⊤β when x_{n+1} = (1, x_{n+1,1}, …, x_{n+1,k})⊤ is available?
Point prediction: x_{n+1}⊤β̂.
Prediction interval (under normality):
(i) x_{n+1}⊤(β̂ − β) ∼ N(0, x_{n+1}⊤(X⊤X)⁻¹x_{n+1}σ²).
Sometimes I use Xₖ in place of X, in particular when the model selection issue is taken into account.
(ii)
$$\frac{x_{n+1}^\top(\hat\beta - \beta)}{\sqrt{x_{n+1}^\top(X^\top X)^{-1}x_{n+1}\,\hat\sigma^2}} \sim t(n-k-1).$$
(iii) Please construct a (1−α)-level prediction interval by yourself.
(iv) What if the normal assumption is violated?
(b) How do we predict y_{n+1}?
Point prediction: x_{n+1}⊤β̂. (Still, we have this guy.)
Prediction interval (under normality):
(i) y_{n+1} − x_{n+1}⊤β̂ = ε_{n+1} − x_{n+1}⊤(β̂ − β) ∼ N(0, (1 + x_{n+1}⊤(X⊤X)⁻¹x_{n+1})σ²). (y_{n+1} − x_{n+1}⊤β̂ is called the prediction error.)
(ii)
$$\frac{y_{n+1} - x_{n+1}^\top\hat\beta}{\sqrt{(1 + x_{n+1}^\top(X^\top X)^{-1}x_{n+1})\hat\sigma^2}} \sim t(n-k-1).$$
(iii) Please construct your own (1−α)-level prediction interval for y_{n+1}.
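One possible solution to (iii), as a hedged sketch (added code; `prediction_interval` is a hypothetical name):

```python
import numpy as np
from scipy import stats

def prediction_interval(X, y, x_new, alpha=0.05):
    """(1 - alpha)-level interval for y_{n+1} at covariate vector x_new."""
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y
    sigma2 = np.sum((y - X @ beta_hat) ** 2) / (n - p)
    center = x_new @ beta_hat
    half = stats.t.ppf(1 - alpha / 2, n - p) * np.sqrt(
        (1.0 + x_new @ XtX_inv @ x_new) * sigma2)
    return center - half, center + half
```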
Large Sample Theory
Motivation
Consider again
y = Xβ + ε.
If the εₜ are not normally distributed, how do we make inference for β and σ²? How do we perform prediction?
Q1: Does β̂ = (X⊤X)⁻¹X⊤y → β in probability?
Q2: Does σ̂² = (1/(n−(k+1)))y⊤(I − Mₖ)y → σ² in probability?
Q3: If the answer to Q1 is "yes", what is the limiting distribution of β̂?
Q4: How do we construct confidence intervals for β based on the answer to Q3?
Q5: How do we test linear or even nonlinear hypotheses without normality?
Q6: How do we do prediction without normality?
Toward Large Sample Theory I
Question 1
We first answer Q1 in the special case where
$$X = \begin{pmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix}.$$
Definition
A sequence of r.v.s {Zₙ} is said to converge in probability to a r.v. Z (which can be a non-random constant) if for any ε > 0,
$$\lim_{n\to\infty} P(|Z_n - Z| > \varepsilon) = 0,$$
which is denoted by Zₙ →ᵖʳ Z.
Remark
A sequence of random vectors {Zₙ = (Z_{1n}, …, Z_{kn})⊤} is said to converge in probability to a random vector Z = (Z₁, …, Z_k)⊤ if Z_{in} →ᵖʳ Zᵢ, i = 1, …, k, which is denoted by Zₙ →ᵖʳ Z.
An answer to Q1:
Since
$$\mathrm{Var}(\hat\beta) = (X^\top X)^{-1}\sigma^2 = \sigma^2\begin{pmatrix} \dfrac{S_{xx} + n\bar x^2}{nS_{xx}} & \dfrac{-n\bar x}{nS_{xx}} \\ \dfrac{-n\bar x}{nS_{xx}} & \dfrac{1}{S_{xx}} \end{pmatrix},$$
we have
$$P(|\hat\beta_0 - \beta_0| > \varepsilon) \overset{(*)}{\le} \frac{\sigma^2}{\varepsilon^2}\cdot\frac{S_{xx} + n\bar x^2}{nS_{xx}} \to 0 \quad \text{if } \frac{\bar x^2}{S_{xx}} \to 0$$
((∗): Chebyshev's inequality, which says that if E(X) = μ and Var(X) = σ², then P(|X − μ| > ε) ≤ σ²/ε²), and
$$P(|\hat\beta_1 - \beta_1| > \varepsilon) \le \frac{\sigma^2}{\varepsilon^2}\cdot\frac{1}{S_{xx}} \to 0 \quad \text{if } \frac{1}{S_{xx}} \to 0,$$
noting that $S_{xx} = \sum_{i=1}^n (x_i - \bar x)^2$ and $\bar x = n^{-1}\sum_{i=1}^n x_i$.
As a result, to ensure β̂ →ᵖʳ β, we need
x̄²/S_{xx} → 0 and 1/S_{xx} → 0 as n → ∞.
Remark
(i) Please give a heuristic explanation of why x̄²/S_{xx} → 0 is needed for β̂₀ to converge to β₀ in probability.
(ii) Please explain why Cov(β̂₀, β̂₁) is positive (negative) when x̄ < 0 (x̄ > 0).
(iii) What are the sufficient conditions for β̂ →ᵖʳ β in general cases?
Question 2
An answer to Q2:
We have shown previously that the variance of σ̂² converges to 0 as n → ∞. Therefore, by Chebyshev's inequality,
σ̂² →ᵖʳ σ².
Before answering Q3, let us consider the so-called spectral decomposition for symmetric matrices.
Let A be a k × k symmetric matrix. Then there exist real numbers λ₁, …, λ_k and a k-dimensional orthogonal matrix P = (p₁, …, p_k) satisfying P⊤P = PP⊤ = I and Apᵢ = λᵢpᵢ such that
A = PDP⊤,
where D = diag(λ₁, …, λ_k).
Remark
(1) λᵢ is called an eigenvalue of A and pᵢ is the eigenvector corresponding to λᵢ.
(2) Let A be positive definite. Then λᵢ > 0 for i = 1, …, k.
Proof. 0 < pᵢ⊤Apᵢ = pᵢ⊤PDP⊤pᵢ = λᵢ, where the first inequality holds by positive definiteness and the last equality by the spectral decomposition.
(3) Let A be positive definite. Define
A^{1/2} = PD^{1/2}P⊤, where D^{1/2} = diag(λ₁^{1/2}, …, λ_k^{1/2}).
Then, we have (A^{1/2})² = A.
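The symmetric square root in remark (3) is easy to compute via the spectral decomposition; here is a small added sketch (`sym_sqrt` is a hypothetical name):

```python
import numpy as np

def sym_sqrt(A):
    """A^{1/2} = P D^{1/2} P' for a symmetric positive definite A."""
    lam, P = np.linalg.eigh(A)          # spectral decomposition A = P D P'
    return P @ np.diag(np.sqrt(lam)) @ P.T

A = np.array([[4.0, 1.0], [1.0, 3.0]])
S = sym_sqrt(A)
print(np.allclose(S @ S, A))            # (A^{1/2})^2 = A
```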
Remark (cont.)
(4) Define λ_max(A) = max{λ₁, …, λ_k} and λ_min(A) = min{λ₁, …, λ_k}. Then,
$$\lambda_{\max}(A) = \sup_{\|a\|=1} a^\top Aa \quad \text{and} \quad \lambda_{\min}(A) = \inf_{\|a\|=1} a^\top Aa.$$
Proof. As shown before,
λᵢ = pᵢ⊤Apᵢ ≤ sup_{‖a‖=1} a⊤Aa.
Moreover, for any a ∈ R^k with ‖a‖ = 1, we have a = Pb, where ‖b‖ = 1. Thus,
$$a^\top Aa = b^\top P^\top PDP^\top Pb = b^\top Db = \sum_{i=1}^k \lambda_ib_i^2 \le \lambda_{\max}(A),$$
where b = (b₁, …, b_k)⊤. This yields λ_max(A) = sup_{‖a‖=1} a⊤Aa. The second statement can be proven similarly.
Remark (cont.)
(5) Let A be positive definite. Then λ_max(A⁻¹) = 1/λ_min(A).
(6) Let B be any real matrix. Define the "spectral norm" of B as
$$\|B\| = \left(\sup_{\|a\|=1} a^\top B^\top Ba\right)^{1/2} = \big(\lambda_{\max}(B^\top B)\big)^{1/2}.$$
We have:
(i) If B is symmetric with eigenvalues λ₁, …, λ_k, then ‖B‖ = max{|λ₁|, …, |λ_k|}.
(ii) ‖AB‖ ≤ ‖A‖‖B‖, where A is another real matrix whose number of columns equals the number of rows of B.
(iii) ‖A + B‖ ≤ ‖A‖ + ‖B‖, where A and B have the same numbers of rows and columns.
(iv) If B is positive definite, then ‖B‖ ≤ tr(B) = Σ_{i=1}^k λᵢ, where λᵢ, i = 1, …, k, are the eigenvalues of B.
Remark (cont.)
(7) Let X be the design matrix of a regression model, i.e.,
$$X = \begin{pmatrix} 1 & x_{11} & \cdots & x_{1k} \\ \vdots & \vdots & & \vdots \\ 1 & x_{n1} & \cdots & x_{nk} \end{pmatrix}.$$
Then,
$$\lambda_{\max}(X^\top X) = \sup_{\|a\|=1}\sum_{i=1}^n (a^\top x_i)^2 \quad \text{and} \quad \lambda_{\min}(X^\top X) = \inf_{\|a\|=1}\sum_{i=1}^n (a^\top x_i)^2.$$
(8) Let x ∼ N(0, Σ) be a p-dimensional multivariate normal vector. Then Σ^{−1/2}x ∼ N(0, I), and hence x⊤Σ⁻¹x ∼ χ²(p), which has been shown previously in a different way.
We now revisit the question of what makes β̂ →ᵖʳ β in general cases.
The answer to this question is simple. Since
Var(β̂) = (X⊤X)⁻¹σ²,
we only need to show that "each diagonal element of (X⊤X)⁻¹ converges to 0". (∗)
To show (∗), note first that
X⊤X = T⊤(T⊤)⁻¹X⊤XT⁻¹T,
where
$$T = \begin{pmatrix} 1 & \bar x_1 & \cdots & \bar x_k \\ 0 & 1 & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & \cdots & 0 & 1 \end{pmatrix} \quad \text{and} \quad T^{-1} = \begin{pmatrix} 1 & -\bar x_1 & \cdots & -\bar x_k \\ 0 & 1 & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & \cdots & 0 & 1 \end{pmatrix}.$$
Moreover, we have
$$(T^\top)^{-1}X^\top XT^{-1} = \begin{pmatrix} n & 0^\top \\ 0 & \mathring X^\top\big(I - \frac{E}{n}\big)\mathring X \end{pmatrix},$$
where
$$\mathring X = \begin{pmatrix} x_{11} & \cdots & x_{1k} \\ \vdots & & \vdots \\ x_{n1} & \cdots & x_{nk} \end{pmatrix} \quad \text{and} \quad E = \begin{pmatrix} 1 & \cdots & 1 \\ \vdots & & \vdots \\ 1 & \cdots & 1 \end{pmatrix},$$
noting that
$$\Big(I - \frac{E}{n}\Big)\mathring X = \begin{pmatrix} x_{11} - \bar x_1 & \cdots & x_{1k} - \bar x_k \\ \vdots & & \vdots \\ x_{n1} - \bar x_1 & \cdots & x_{nk} - \bar x_k \end{pmatrix},$$
whose rows (x_{11} − x̄₁, …, x_{1k} − x̄_k), …, (x_{n1} − x̄₁, …, x_{nk} − x̄_k) are the centered data vectors.
Hence
$$(X^\top X)^{-1} = T^{-1}\begin{pmatrix} n^{-1} & 0^\top \\ 0 & \big(\mathring X^\top(I - \frac{E}{n})\mathring X\big)^{-1} \end{pmatrix}(T^{-1})^\top,$$
yielding
$$(X^\top X)^{-1} = \begin{pmatrix} \frac{1}{n} + \bar x^\top D^{-1}\bar x & -\bar x^\top D^{-1} \\ -D^{-1}\bar x & D^{-1} \end{pmatrix},$$
where (T⁻¹)⊤ = (T⊤)⁻¹, x̄ = (x̄₁, …, x̄_k)⊤, and D = X̊⊤(I − E/n)X̊.
This implies that for each 1 ≤ i ≤ k+1,
$$(X^\top X)^{-1}_{ii} \le \max\left\{\frac{1}{n} + \bar x^\top D^{-1}\bar x,\ \lambda_{\max}(D^{-1})\right\} \le \max\left\{\frac{1}{n} + \frac{\|\bar x\|^2}{\lambda_{\min}(D)},\ \frac{1}{\lambda_{\min}(D)}\right\},$$
where λ_max(D⁻¹) = 1/λ_min(D). This bound converges to 0 if
(i) λ_min(X̊⊤(I − E/n)X̊) → ∞ as n → ∞,
(ii) Σ_{i=1}^k x̄ᵢ² / λ_min(X̊⊤(I − E/n)X̊) → 0 as n → ∞.
Please compare these two conditions with the answer given in Q1.
[Figure illustrating the two conditions omitted.] The above conditions require:
(i) even in the direction in which the data are most narrowly spread (viewed from the location of (x̄₁, …, x̄_k)), the data must still have a sufficiently large sum of squares (information);
(ii) the squared distance from the center of the data to the origin is negligible compared with λ_min(X̊⊤(I − E/n)X̊).
Toward Large Sample Theory II
Question 3
Note first that
$$\hat\beta - \beta = (X^\top X)^{-1}X^\top(y - X\beta) = (X^\top X)^{-1}X^\top\varepsilon = \left(\sum_{i=1}^n x_ix_i^\top\right)^{-1}\begin{pmatrix} \sum_{i=1}^n \varepsilon_i \\ \sum_{i=1}^n x_i\varepsilon_i \end{pmatrix},$$
noting that we first consider
$$X = \begin{pmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix}.$$
Since for ε ∼ N(0, σ²I) we have
β̂ − β ∼ N(0, σ²(X⊤X)⁻¹),
it is natural to conjecture that when ε is not normally distributed,
$$\frac{(X^\top X)^{1/2}}{\sigma}(\hat\beta - \beta) \overset{d}{\to} N(0, I). \quad (*)$$
Definition
A sequence of random vectors {xₙ} is said to converge to a random vector x in distribution if
P(xₙ ≤ c) → P(x ≤ c) ≡ F(c) as n → ∞
for all continuity points c of F(·), the distribution function of x; this is denoted by xₙ →ᵈ x.
Remark
Cramér-Wold device:
xₙ →ᵈ x ⇔ a⊤xₙ →ᵈ a⊤x for any ‖a‖ = 1.
Therefore, (∗) holds iff
$$a^\top\frac{(X^\top X)^{1/2}}{\sigma}(\hat\beta - \beta) = a^\top\left(\sum_{i=1}^n x_ix_i^\top\right)^{-1/2}\begin{pmatrix} \sum_{i=1}^n \varepsilon_i/\sigma \\ \sum_{i=1}^n x_i\varepsilon_i/\sigma \end{pmatrix} = \sum_{i=1}^n\left(\frac{w_{1n} + w_{2n}x_i}{\sigma}\right)\varepsilon_i \overset{d}{\to} N(0, 1),$$
where $(w_{1n}, w_{2n}) = a^\top\big(\sum_{i=1}^n x_ix_i^\top\big)^{-1/2}$.
Lindeberg's Central Limit Theorem (for the sum of independent r.v.s)
Let Z_{1n}, …, Z_{nn} be a sequence of independent r.v.s with E(Z_{in}) = 0 and Σ_{i=1}^n E(Z_{in}²) = Σ_{i=1}^n σ_{in}² = 1 for all n. If for any δ > 0,
$$\sum_{i=1}^n E\big(Z_{in}^2I_{|Z_{in}|>\delta}\big) \to 0 \quad \text{as } n \to \infty, \quad \text{(Lindeberg's condition)}$$
then Σ_{i=1}^n Z_{in} →ᵈ N(0, 1).
(No single summand dominates; hence after this "uniform" mixing, the features of the original distributions vanish and the limit is the normal distribution.)
Remark
(1) Lindeberg's condition implies
max_{1≤i≤n} σ_{in}² → 0 as n → ∞.
To see this, we note that for "any" δ > 0,
$$\max_{1\le i\le n}\sigma_{in}^2 = \max_{1\le i\le n}E(Z_{in}^2) \le \max_{1\le i\le n}E\big(Z_{in}^2I_{|Z_{in}|>\delta}\big) + \delta^2.$$
Since the first term converges to 0 by Lindeberg's condition and since δ can be arbitrarily small, the desired conclusion follows.
(2) Lindeberg's condition ⇔ CLT + max_{1≤i≤n} σ_{in}² → 0 as n → ∞.
Now we are in a position to check Lindeberg's condition for
$$Z_{in} = \left(\frac{w_{1n} + w_{2n}x_i}{\sigma}\right)\varepsilon_i \equiv v_{in}\varepsilon_i.$$
(i) E(v_{in}εᵢ) = 0. (easy)
(ii) Σ_{i=1}^n E(v_{in}²εᵢ²) = 1. (easy, but why?)
(iii) Assume E^{1/2}(ε₁⁴) < C₁ < ∞. Then, for some constants C₂, C₃,
$$\begin{aligned} \sum_{i=1}^n E\big[v_{in}^2\varepsilon_i^2I_{\{v_{in}^2\varepsilon_i^2>\delta^2\}}\big] &= \sum_{i=1}^n v_{in}^2E\big[\varepsilon_i^2I_{\{v_{in}^2\varepsilon_i^2>\delta^2\}}\big] \overset{\text{why?}}{\le} \sum_{i=1}^n v_{in}^2E^{1/2}(\varepsilon_i^4)P^{1/2}(v_{in}^2\varepsilon_i^2>\delta^2) \\ &\le C_1\sum_{i=1}^n v_{in}^2\frac{E^{1/2}(v_{in}^2\varepsilon_i^2)}{\delta} \le C_2\left(\sum_{i=1}^n v_{in}^2\right)\max_{1\le i\le n}|v_{in}| \le C_3\max_{1\le i\le n}|v_{in}|. \end{aligned}$$
Therefore, Lindeberg's condition holds for v_{in}εᵢ if
$$\begin{aligned} \max_{1\le i\le n}(v_{in}^2) &= \sigma^{-2}\max_{1\le i\le n}\left(a^\top\Big(\sum_{i=1}^n x_ix_i^\top\Big)^{-1/2}\binom{1}{x_i}\right)^2 \\ &\le \sigma^{-2}a^\top\Big(\sum_{i=1}^n x_ix_i^\top\Big)^{-1}a\,\big(1 + \max_{1\le i\le n}x_i^2\big) \\ &\overset{(*)}{\le} \sigma^{-2}\lambda_{\max}\left(\Big(\sum_{i=1}^n x_ix_i^\top\Big)^{-1}\right)\big(1 + \max_{1\le i\le n}x_i^2\big) \\ &= \sigma^{-2}\frac{1 + \max_{1\le i\le n}x_i^2}{\lambda_{\min}\Big(\sum_{i=1}^n x_ix_i^\top\Big)} \to 0, \quad \text{as } n \to \infty. \end{aligned}$$
(∗) To see this, we have by the spectral decomposition of A, A = PDP⊤, where D = diag(λ₁, …, λ_k) with 0 < λ₁ ≤ λ₂ ≤ ⋯ ≤ λ_k, and P = (p₁, …, p_k) satisfies
Apᵢ = λᵢpᵢ, pᵢ⊤pᵢ = 1, and pᵢ⊤pⱼ = 0 for i ≠ j.
Hence,
$$p_k^\top Ap_k = p_k^\top PDP^\top p_k = (0, \ldots, 0, 1)\,\mathrm{diag}(\lambda_1, \ldots, \lambda_k)\,(0, \ldots, 0, 1)^\top = \lambda_k \le \sup_{\|a\|=1}a^\top Aa.$$
On the other hand, for any a ∈ R^k with ‖a‖ = 1, we can express it as a = Pb with ‖b‖ = 1. Thus,
$$a^\top Aa = b^\top P^\top PDP^\top Pb = b^\top Db = \sum_{i=1}^k \lambda_ib_i^2 \le \lambda_k,$$
where b = (b₁, …, b_k)⊤. As a result,
λ_k = sup_{‖a‖=1} a⊤Aa.
Similarly, it can be shown that λ₁ = inf_{‖a‖=1} a⊤Aa.
To give a more comprehensive sufficient condition, we note that
$$\begin{aligned} \lambda_{\min}\left(\sum_{i=1}^n x_ix_i^\top\right) &= \lambda_{\min}\left(\begin{pmatrix} n & \sum x_i \\ \sum x_i & \sum x_i^2 \end{pmatrix}\right) \\ &= \lambda_{\min}\left(\begin{pmatrix} 1 & 0 \\ \bar x & 1 \end{pmatrix}\begin{pmatrix} 1 & 0 \\ -\bar x & 1 \end{pmatrix}\begin{pmatrix} n & \sum x_i \\ \sum x_i & \sum x_i^2 \end{pmatrix}\begin{pmatrix} 1 & -\bar x \\ 0 & 1 \end{pmatrix}\begin{pmatrix} 1 & \bar x \\ 0 & 1 \end{pmatrix}\right) \\ &= \lambda_{\min}\left(\begin{pmatrix} 1 & 0 \\ \bar x & 1 \end{pmatrix}\begin{pmatrix} n & 0 \\ 0 & S_{xx} \end{pmatrix}\begin{pmatrix} 1 & \bar x \\ 0 & 1 \end{pmatrix}\right) \\ &\overset{\text{why?}}{\ge} \min\{n, S_{xx}\}\,\lambda_{\min}\left(\begin{pmatrix} 1 & 0 \\ \bar x & 1 \end{pmatrix}\begin{pmatrix} 1 & \bar x \\ 0 & 1 \end{pmatrix}\right) \overset{(*)}{\ge} C\min\{n, S_{xx}\}, \end{aligned}$$
provided x̄ < ∞, where S_{xx} = Σ(xᵢ − x̄)².
Explanation:
"why?": λ_min(B⊤AB) ≥ λ_min(B⊤B)λ_min(A).
(∗): if x̄ < ∞, the smallest eigenvalue of the last product is "bounded away" from 0. (We will show this later.)
In view of this, a set of more transparent sufficient conditions for Lindeberg's condition is
(i) max_{1≤i≤n} xᵢ²/n → 0,
(ii) S_{xx} → ∞ [this one is also needed for β̂ →ᵖʳ β],
(iii) max_{1≤i≤n} xᵢ²/S_{xx} → 0.
Can you answer Q3 under general multiple regression models?
In fact, for general multiple regression (k ≥ 1), it is not difficult to show that Lindeberg's condition holds when
$$\frac{1 + \max_{1\le i\le n}\sum_{j=1}^k x_{ij}^2}{\lambda_{\min}(X^\top X)} \to 0 \quad \text{as } n \to \infty. \quad (\triangleright)$$
(Compare with the k = 1 case.)
A further question is whether we can obtain conditions analogous to (i), (ii), (iii) of the k = 1 case under which (▷) holds. To answer this question, we need a little linear algebra.
(1) Let
$$T = \begin{pmatrix} 1 & c_1 & \cdots & c_k \\ 0 & 1 & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & \cdots & 0 & 1 \end{pmatrix} = \begin{pmatrix} 1 & c^\top \\ 0 & I_k \end{pmatrix},$$
where c = (c₁, …, c_k)⊤ and I_k is the k-dimensional identity matrix. Then we have
$$\lambda_{\min}(T^\top T) \ge \frac{1}{2 + c^\top c}. \quad (*)$$
Proof of (∗)
Since (∗) holds trivially when c = 0, we only consider the case c ≠ 0.
Note first that
$$E^* = T^\top T = \begin{pmatrix} 1 & c^\top \\ c & cc^\top + I_k \end{pmatrix},$$
and the eigenvalues of E* are those λ satisfying
det(E* − λI_{k+1}) = 0, (∗∗)
where I_{k+1} is the (k+1)-dimensional identity matrix.
In addition,
$$\det(E^* - \lambda I_{k+1}) = \det\begin{pmatrix} 1-\lambda & c^\top \\ c & cc^\top + (1-\lambda)I_k \end{pmatrix} = \begin{cases} \det\begin{pmatrix} 0 & c^\top \\ c & cc^\top \end{pmatrix}, & \text{if } \lambda = 1; \\[1ex] \det\begin{pmatrix} 1-\lambda & 0^\top \\ c & \big(1 - \frac{1}{1-\lambda}\big)cc^\top + (1-\lambda)I_k \end{pmatrix}, & \text{if } \lambda \ne 1. \end{cases}$$
Proof of (∗) (cont.)
For λ = 1,
$$\det\begin{pmatrix} 0 & c^\top \\ c & cc^\top \end{pmatrix} = \begin{cases} -c_1^2 \ne 0, & \text{if } k = 1; \\ 0, & \text{if } k > 1. \end{cases}$$
For λ ≠ 1,
$$\begin{aligned} \det\begin{pmatrix} 1-\lambda & 0^\top \\ c & \big(1 - \frac{1}{1-\lambda}\big)cc^\top + (1-\lambda)I_k \end{pmatrix} &= (1-\lambda)\det\left(\Big(1 - \frac{1}{1-\lambda}\Big)cc^\top + (1-\lambda)I_k\right) \\ &\qquad \text{(because this is a block triangular matrix)} \\ &= (1-\lambda)^{k+1}\det\left(I_k + \Big(\frac{1}{1-\lambda} - \frac{1}{(1-\lambda)^2}\Big)cc^\top\right) \\ &\qquad (\det(aA_k) = a^k\det(A_k)) \\ &= (1-\lambda)^{k+1}\det(I_k)\left(1 + \Big(\frac{1}{1-\lambda} - \frac{1}{(1-\lambda)^2}\Big)c^\top c\right) \\ &\qquad (\text{please try to prove } \det(A + abb^\top) = \det(A)(1 + ab^\top A^{-1}b)) \\ &= (1-\lambda)^{k-1}(\lambda^2 - (2 + c^\top c)\lambda + 1). \end{aligned}$$
Proof of (∗) (cont.)
Therefore, the roots of (∗∗) are
$$\lambda = 1 \quad \text{or} \quad \lambda = \frac{(2 + c^\top c)\left(1 \pm \sqrt{1 - \frac{4}{(2+c^\top c)^2}}\right)}{2},$$
yielding
$$\lambda_{\min}(T^\top T) \ge \min\left\{1,\ \frac{(2+c^\top c)\left(1 - \sqrt{1 - \frac{4}{(2+c^\top c)^2}}\right)}{2}\right\} \ge \min\left\{1, \frac{1}{2+c^\top c}\right\} = \frac{1}{2+c^\top c},$$
since √(1−x) ≤ 1 − x/2. Thus the proof of (∗) is complete.
(2) We have shown previously that
$$X^\top X = T^\top\begin{pmatrix} n & 0^\top \\ 0 & D \end{pmatrix}T,$$
where
$$T = \begin{pmatrix} 1 & \bar x_1 & \cdots & \bar x_k \\ 0 & 1 & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & \cdots & 0 & 1 \end{pmatrix} \quad \text{and} \quad D = \mathring X^\top\Big(I - \frac{E}{n}\Big)\mathring X.$$
By λ_min(B⊤AB) ≥ λ_min(B⊤B)λ_min(A) and (∗), we obtain
$$\lambda_{\min}(X^\top X) \ge \lambda_{\min}(T^\top T)\lambda_{\min}\begin{pmatrix} n & 0^\top \\ 0 & D \end{pmatrix} \ge \frac{1}{2 + \sum_{i=1}^k \bar x_i^2}\min\{n, \lambda_{\min}(D)\} \ge \frac{1}{2+V}\min\{n, \lambda_{\min}(D)\}.$$
Here we assume Σ_{i=1}^k x̄ᵢ² < V < ∞ (to keep the discussion focused).
(3) Finally, for (▷) to hold, we give the following sufficient conditions:
(i′) max_{1≤i≤n} Σ_{j=1}^k x_{ij}² / n → 0,
(ii′) λ_min(D) → ∞ (we have already explained its meaning),
(iii′) max_{1≤i≤n} Σ_{j=1}^k x_{ij}² / λ_min(D) → 0.
Clearly, (i), (ii), (iii) and (i′), (ii′), (iii′) correspond to each other.
Toward Large Sample Theory III
Questions 4 and 5
Q4 and Q5: How does one construct confidence intervals (CIs) and testing rules when ε is not normal?
Some basic probabilistic tools
(A) Slutsky's theorem.
If Xₙ →ᵈ X, Yₙ →ᵖʳ a, and Zₙ →ᵖʳ b, where a is a vector of real numbers and b is a real number, then
Yₙ⊤Xₙ + Zₙ →ᵈ a⊤X + b.
Corollary. If Xₙ →ᵈ X and Yₙ − Xₙ →ᵖʳ 0, then Yₙ →ᵈ X.
Proof. Since Yₙ = Xₙ − (Xₙ − Yₙ), the conclusion follows immediately from Slutsky's theorem.
(B) Big-O and small-o notation for a sequence of random vectors.
Let aₙ be a sequence of positive numbers. We say
Xₙ = O_p(aₙ),
where Xₙ is a sequence of random vectors, if for any ε > 0 there exist 0 < M_ε < ∞ and a positive integer N such that for all n ≥ N,
P(‖Xₙ/aₙ‖ > M_ε) < ε,
and Xₙ = o_p(aₙ) if Xₙ/aₙ →ᵖʳ 0.
(C) Big-O and small-o notation for a sequence of vectors of real numbers.
Let {wₙ} be a sequence of vectors of real numbers and {aₙ} a sequence of positive numbers. We say wₙ = O(aₙ) if there exist 0 < M < ∞ and a positive integer N such that for all n ≥ N, ‖wₙ/aₙ‖ < M, and wₙ = o(aₙ) if wₙ/aₙ → 0.
(D) Some rules.
Let Xₙ = o_p(1), O_p(1), o(1), or O(1), and Yₙ = o_p(1), O_p(1), o(1), or O(1).
For "+":
      | o_p  O_p  o    O
o_p   | o_p  O_p  o_p  O_p
O_p   | −    O_p  O_p  O_p
o     | −    −    o    O
O     | −    −    −    O
For "×" (product):
      | o_p  O_p  o    O
o_p   | o_p  o_p  o_p  o_p
O_p   | −    O_p  o_p  O_p
o     | −    −    o    o
O     | −    −    −    O
(E) If Xₙ = O_p(aₙ), then Xₙ/aₙ = O_p(1); if Xₙ = o_p(aₙ), then Xₙ/aₙ = o_p(1).
(F) If Xₙ →ᵈ X, then Xₙ = O_p(1), and if E‖Xₙ‖^q < K < ∞ for some q > 0 and for all n, then Xₙ = O_p(1).
(G) If Xₙ →ᵖʳ X and Yₙ →ᵖʳ Y, then (Xₙ⊤, Yₙ⊤)⊤ →ᵖʳ (X⊤, Y⊤)⊤. If Xₙ →ᵈ X and Yₙ →ᵈ Y, then (Xₙ⊤, Yₙ⊤)⊤ →ᵈ (X⊤, Y⊤)⊤, provided {Xₙ} and {Yₙ} are independent.
(H) Continuous mapping theorem.
If Xₙ → X in probability or in distribution and f(·) is a continuous function, then f(Xₙ) → f(X) in the same mode.
(I) Delta method.
If √n(Zₙ − u) →ᵈ N(0_{k×1}, V_{k×k}) and f(·) = (f₁(·), …, f_m(·))⊤ : R^k → R^m is a "sufficiently smooth" function, then
$$\sqrt n(f(Z_n) - f(u)) \overset{d}{\to} N(0_{m\times 1}, (\nabla f(u))^\top V(\nabla f(u))), \quad (*)$$
where
$$\nabla f(\cdot) = \begin{pmatrix} \frac{\partial f_1(\cdot)}{\partial x_1} & \cdots & \frac{\partial f_m(\cdot)}{\partial x_1} \\ \vdots & & \vdots \\ \frac{\partial f_1(\cdot)}{\partial x_k} & \cdots & \frac{\partial f_m(\cdot)}{\partial x_k} \end{pmatrix}$$
is a k × m matrix.
Sketch of the proof. By Taylor's theorem, f(Zₙ) ≈ f(u) + (∇f(u))⊤(Zₙ − u), which yields
√n(f(Zₙ) − f(u)) ≈ (∇f(u))⊤√n(Zₙ − u).
This and the CLT for Zₙ (given as an assumption) lead to the desired conclusion.
We are now ready to answer Q4 & Q5.
(1) An alternative version of the CLT for β̂.
Recall that
$$\frac{(X^\top X)^{1/2}}{\sigma}(\hat\beta - \beta) \overset{d}{\to} N(0, I)$$
under suitable conditions. (What are they?)
Assume
$$R_n = \frac{1}{n}X^\top X = \frac{1}{n}\sum_{i=1}^n x_ix_i^\top \overset{n\to\infty}{\longrightarrow} R,$$
where R is a positive definite matrix.
Then it can be shown that
$$\frac{1}{\sigma}R^{1/2}\sqrt n(\hat\beta - \beta) \overset{d}{\to} N(0, I). \quad (*)$$
By (∗), we have
$$\sqrt n(\hat\beta - \beta) \overset{d}{\to} N(0, R^{-1}\sigma^2). \quad (**)$$
Additional materials
(i) ‖σ⁻¹(Rₙ^{1/2} − R^{1/2})√n(β̂ − β)‖ ≤ σ⁻¹‖Rₙ^{1/2} − R^{1/2}‖‖√n(β̂ − β)‖ (since ‖Ax‖² = x⊤A⊤Ax ≤ ‖A‖²‖x‖²).
(ii) ‖Rₙ^{1/2} − R^{1/2}‖ = o(1). (it's obvious)
(iii) E‖√n(β̂ − β)‖² = tr((X⊤X/n)⁻¹)σ² = tr(Rₙ⁻¹)σ² → tr(R⁻¹)σ² < ∞ as n → ∞ (why? R is p.d.).
(iv) By (i)–(iii), we have
$$\left\|\sigma^{-1}(R_n^{1/2} - R^{1/2})\sqrt n(\hat\beta - \beta)\right\| = o(1)O_p(1) = o_p(1),$$
yielding that σ⁻¹R^{1/2}√n(β̂ − β) and σ⁻¹Rₙ^{1/2}√n(β̂ − β) have the same limiting distribution (by Slutsky's theorem), which is N(0, I).
(2) Consider the problem of testing the nonlinear null hypothesis
H₀ : β₀ + β₁² = d
for some known d, against the alternative hypothesis
H_A : β₀ + β₁² ≠ d.
To simplify the discussion, we again assume that
$$X = \begin{pmatrix} 1 & X_1 \\ \vdots & \vdots \\ 1 & X_n \end{pmatrix}, \quad \text{hence } \beta = \binom{\beta_0}{\beta_1} \text{ and } \hat\beta = \binom{\hat\beta_0}{\hat\beta_1}.$$
Set f(β) = β₀ + β₁². Then ∇f(β) = (1, 2β₁)⊤.
By the δ-method and (∗∗), we obtain
$$\sqrt n(f(\hat\beta) - f(\beta)) \overset{H_0}{=} \sqrt n(f(\hat\beta) - d) \overset{d}{\to} N\left(0,\ (1, 2\beta_1)R^{-1}\binom{1}{2\beta_1}\sigma^2\right),$$
which implies
$$\frac{\sqrt n(f(\hat\beta) - d)}{\sigma\sqrt{(1, 2\beta_1)R^{-1}\binom{1}{2\beta_1}}} \overset{d}{\to} N(0, 1). \quad ({*}{*}{*})$$
Moreover, it holds that
$$\hat\sigma\sqrt{(1, 2\hat\beta_1)R_n^{-1}\binom{1}{2\hat\beta_1}} \overset{pr.}{\to} \sigma\sqrt{(1, 2\beta_1)R^{-1}\binom{1}{2\beta_1}}$$
(because β̂₁ →ᵖʳ β₁ and Rₙ → R).
This, (∗∗∗) and Slutsky's theorem together imply
$$\frac{\sqrt n(f(\hat\beta) - d)}{\hat\sigma\sqrt{(1, 2\hat\beta_1)R_n^{-1}\binom{1}{2\hat\beta_1}}} \overset{d}{\to} N(0, 1).$$
This result enables us to construct the following testing rule: reject H₀ if
$$f(\hat\beta) = \hat\beta_0 + \hat\beta_1^2 > d + \frac{1.96\,\hat\sigma\sqrt{(1, 2\hat\beta_1)R_n^{-1}\binom{1}{2\hat\beta_1}}}{\sqrt n} \quad \text{or} \quad f(\hat\beta) = \hat\beta_0 + \hat\beta_1^2 < d - \frac{1.96\,\hat\sigma\sqrt{(1, 2\hat\beta_1)R_n^{-1}\binom{1}{2\hat\beta_1}}}{\sqrt n},$$
which is an "asymptotic" level-5% test, i.e., P_{H₀}(reject H₀) → 5% as n → ∞.
(3) Consider the problem of testing the linear hypothesis
H₀ : D_{q×k}β_{k×1} = γ_{q×1} against H_A : ∼H₀,
where D_{q×k} and γ_{q×1} are known.
Set f(β) = Dβ. By the δ-method and the CLT for β̂, we have under H₀,
$$\sqrt n(f(\hat\beta) - \gamma) \overset{d}{\to} N(0, DR^{-1}D^\top\sigma^2),$$
and hence by the continuous mapping theorem,
$$\frac{n(f(\hat\beta) - \gamma)^\top(DR^{-1}D^\top)^{-1}(f(\hat\beta) - \gamma)}{\sigma^2} \overset{d}{\to} \chi^2(q).$$
This, σ̂² →ᵖʳ σ², Rₙ → R, and Slutsky's theorem further give (some algebraic manipulations are needed!!)
$$w_1 = \frac{n(f(\hat\beta) - \gamma)^\top(DR_n^{-1}D^\top)^{-1}(f(\hat\beta) - \gamma)}{\hat\sigma^2} \overset{d}{\to} \chi^2(q).$$
Therefore, the following testing rule:
reject H₀ if w₁ > χ²_{1−α}(q),
is an asymptotic level-α test.
Please compare this asymptotic test with its counterpart derived from the finite-sample theory under normal assumptions.
Appendix
Statistical View of Spectral Decomposition
1. Without loss of generality, we can assume Γ = E(xx⊤), where x is a p-dimensional random vector with E(x) = 0.
2. Define
$$a_1 = \operatorname*{argmax}_{c\in\{s\in\mathbb{R}^p:\|s\|=1\}} E((c^\top x)^2) \quad \text{and} \quad \lambda_1^* = E((a_1^\top x)^2).$$
By the Lagrange multiplier method, Γa₁ = λ₁*a₁. Define
$$v_1 = a_1^\top x \quad \text{and} \quad u_1 = \operatorname*{argmin}_{c\in\mathbb{R}^p} E((x - cv_1)^\top(x - cv_1)).$$
Then,
$$u_1 = \frac{E(xv_1)}{\lambda_1^*} = \frac{\Gamma a_1}{\lambda_1^*} = a_1,$$
$$R_1 := x - u_1v_1 = x - a_1v_1 = x - a_1a_1^\top x = (I_p - a_1a_1^\top)x,$$
and
$$\Gamma_1 := \mathrm{Var}(R_1) = E((I_p - a_1a_1^\top)xx^\top(I_p - a_1a_1^\top)) = \Gamma - \lambda_1^*a_1a_1^\top.$$
3. Define
$$a_2 = \operatorname*{argmax}_{c\in\{s\in\mathbb{R}^p:\|s\|=1,\ s^\top a_1=0\}} E((c^\top R_1)^2) \quad \text{and} \quad \lambda_2^* = E((a_2^\top R_1)^2).$$
By the Lagrange multiplier method, we set
$$\frac{\partial}{\partial c}\big(c^\top(\Gamma - \lambda_1^*a_1a_1^\top)c - h_1c^\top a_1 - h_2(c^\top c - 1)\big) = 0,$$
together with the corresponding derivatives with respect to h₁ and h₂ set to 0, and obtain h₁ = 0 and h₂ = c⊤Γc. Therefore,
$$\Gamma_1a_2 = (\Gamma - \lambda_1^*a_1a_1^\top)a_2 = \Gamma a_2 = \lambda_2^*a_2.$$
3. (cont.) Define
$$v_2 = a_2^\top R_1 = a_2^\top x \quad \text{and} \quad u_2 = \operatorname*{argmin}_{c\in\mathbb{R}^p} E((R_1 - cv_2)^\top(R_1 - cv_2)).$$
Then,
$$u_2 = \frac{E(R_1v_2)}{\lambda_2^*} = a_2, \quad \text{and} \quad R_2 := R_1 - u_2v_2 = (I_p - (a_1a_1^\top + a_2a_2^\top))x.$$
4. By a similar argument as above, we have R_p := (I_p − Σ_{i=1}^p aᵢaᵢ⊤)x = 0, and hence
$$\begin{aligned} O = \mathrm{Var}(R_p) &= \left(I_p - \sum_{i=1}^p a_ia_i^\top\right)\Gamma\left(I_p - \sum_{i=1}^p a_ia_i^\top\right) \\ &= \left(\Gamma - 2\sum_{i=1}^p \lambda_i^*a_ia_i^\top\right) + \sum_{i=1}^p\sum_{j=1}^p \lambda_i^*a_ia_i^\top a_ja_j^\top \\ &= \Gamma - \sum_{i=1}^p \lambda_i^*a_ia_i^\top, \end{aligned}$$
where O is a p × p zero matrix.
5. Define P = (a₁, …, a_p) and D = diag(λ₁*, …, λ_p*). Then,
$$\Gamma = \sum_{i=1}^p \lambda_i^*a_ia_i^\top = PDP^\top.$$
Limit Theorems
Continuous Mapping Theorem
Fact 1
Let Xₙ →ᵖʳ X and let g be a continuous function on R. Then g(Xₙ) →ᵖʳ g(X).
Proof of Fact 1
For any ε > 0, there exists a large k such that P(|X| > k) ≤ ε/2.
Moreover, we have for any δ > 0 and n ≥ N_{δ,ε}, P(|Xₙ − X| > δ) ≤ ε/2.
Since g(x) is uniformly continuous on [−k, k], there exists a δ* > 0 such that
|g(x) − g(y)| ≤ ε for all |x − y| ≤ δ* and |x| ≤ k.
Now, |g(X) − g(Xₙ)| > ε implies |X − Xₙ| > δ* or |X| > k, and hence
P(|g(X) − g(Xₙ)| > ε) ≤ P(|X − Xₙ| > δ*) + P(|X| > k) ≤ ε for n ≥ N_{δ*,ε}.
Therefore, P(|g(Xₙ) − g(X)| > ε) → 0 as n → ∞.
Remark
If Xₙ →ᵈ X and g is a continuous function on R, then g(Xₙ) →ᵈ g(X).
Fact 2
If Xₙ →ᵖʳ X, then Xₙ →ᵈ X.
Definition
Let {aₙ} be a sequence of real numbers. We denote aₙ → 0 by aₙ = o(1).
Proof of Fact 2
Goal: Fₙ(x) → F(x) for all x ∈ C(F), where F(x) = P(X ≤ x) and Fₙ(x) = P(Xₙ ≤ x).
Let x ∈ C(F) with x′ < x < x″. By Xₙ →ᵖʳ X,
$$P(X \le x') = P(X \le x', X_n \le x) + P(X \le x', X_n > x) = P(X \le x', X_n \le x) + o(1) \le F_n(x) + o(1),$$
and hence F(x′) ≤ lim inf_{n→∞} Fₙ(x).
Similarly, we obtain lim sup_{n→∞} Fₙ(x) ≤ F(x″), and thus
$$F(x') \le \liminf_{n\to\infty}F_n(x) \le \limsup_{n\to\infty}F_n(x) \le F(x'').$$
The proof is completed by letting x′ ↑ x, x″ ↓ x and
$$F(x) = \lim_{x'\uparrow x}F(x') \le \liminf_{n\to\infty}F_n(x) \le \limsup_{n\to\infty}F_n(x) \le \lim_{x''\downarrow x}F(x'') = F(x).$$
Slutsky's Theorem
If Xₙ − Yₙ →ᵖʳ 0 and Yₙ →ᵈ Y, then Xₙ →ᵈ Y.
Proof of Slutsky's Theorem
Let x be any continuity point of the c.d.f. of Y, F_Y. Given δ > 0, there exists a small ε > 0 such that x − ε and x + ε are continuity points of F_Y and F_Y(x+ε) − F_Y(x−ε) < δ.
Define Fₙ(x) = P(Xₙ ≤ x). Our goal is to show that
$$F_Y(x-\varepsilon) \le \liminf_{n\to\infty}F_n(x) \le \limsup_{n\to\infty}F_n(x) \le F_Y(x+\varepsilon),$$
which implies Fₙ(x) → F_Y(x).
Since Xₙ − Yₙ →ᵖʳ 0 and Yₙ →ᵈ Y, we have
$$F_n(x) \le P(Y_n \le x + Y_n - X_n,\ Y_n - X_n \le \varepsilon) + P(Y_n - X_n > \varepsilon) \le P(Y_n \le x+\varepsilon) + o(1),$$
$$\begin{aligned} F_n(x) &= P(Y_n \le x + Y_n - X_n,\ Y_n - X_n \ge -\varepsilon) + o(1) \\ &\ge P(Y_n \le x-\varepsilon,\ Y_n - X_n \ge -\varepsilon) + o(1) \\ &\ge P(Y_n \le x-\varepsilon) - P(Y_n - X_n < -\varepsilon) + o(1) = P(Y_n \le x-\varepsilon) + o(1), \end{aligned}$$
and hence lim sup_{n→∞} Fₙ(x) ≤ F_Y(x+ε) and lim inf_{n→∞} Fₙ(x) ≥ F_Y(x−ε).
Fact 3
If Xₙ →ᵈ X and Yₙ →ᵖʳ c, where c is a constant, then
(a) Xₙ + Yₙ →ᵈ X + c,
(b) XₙYₙ →ᵈ cX.
Proof of Fact 3
For (a), it suffices to show that Xₙ + c →ᵈ X + c, which is obvious.
For (b), it suffices to show that XₙYₙ − cXₙ →ᵖʳ 0. It is equivalent to show that if Xₙ →ᵈ X and Yₙ →ᵖʳ 0, then XₙYₙ →ᵖʳ 0.
Let δ > 0 be an arbitrarily small constant. Then, there exists a large M such that P(|X| > M) ≤ δ. Now for any ε > 0,
$$\begin{aligned} P(|X_nY_n| > \varepsilon) &\le P\Big(|X_nY_n| > \varepsilon,\ |Y_n| \le \frac{\varepsilon}{M}\Big) + P\Big(|Y_n| > \frac{\varepsilon}{M}\Big) \\ &\le P(|X_n| > M) + o(1) \\ &= P(|X| > M) + P(|X_n| > M) - P(|X| > M) + o(1) \\ &= P(|X| > M) + o(1), \end{aligned}$$
which implies 0 ≤ lim infₙ P(|XₙYₙ| > ε) ≤ lim supₙ P(|XₙYₙ| > ε) ≤ δ, and hence XₙYₙ →ᵖʳ 0.
Application
Let the Xᵢ be i.i.d. with mean 0 and variance 1, and E(X₁⁴) < ∞. Then it follows from
$$\frac{X_1^2 + \cdots + X_n^2}{n} \overset{pr.}{\to} 1, \qquad \frac{X_1 + \cdots + X_n}{\sqrt n} \overset{d}{\to} N(0, 1),$$
and Fact 3 that
$$\frac{\sqrt n(X_1 + \cdots + X_n)}{X_1^2 + \cdots + X_n^2} \overset{d}{\to} N(0, 1).$$
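A small simulation of this application (added sketch; the uniform distribution on (−√3, √3), which has mean 0 and variance 1, is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(4)
n, B = 5000, 10000
X = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(B, n))   # mean 0, variance 1
stat = np.sqrt(n) * X.sum(axis=1) / (X ** 2).sum(axis=1)
print(stat.mean(), stat.var())          # close to 0 and 1, as N(0,1) predicts
```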
Some Remarks on Slutsky's Theorem
(1) If Xₙ →ᵖʳ X and Yₙ →ᵖʳ Y, then Xₙ + Yₙ →ᵖʳ X + Y and XₙYₙ →ᵖʳ XY.
Proof.
$$\begin{aligned} P(|X_n + Y_n - (X+Y)| > \varepsilon) &\le P(|X_n - X| + |Y_n - Y| > \varepsilon) \\ &\le P(|X_n - X| > \varepsilon/2 \text{ or } |Y_n - Y| > \varepsilon/2) \\ &\le P(|X_n - X| > \varepsilon/2) + P(|Y_n - Y| > \varepsilon/2) \to 0, \end{aligned}$$
as n → ∞. Show by yourself that XₙYₙ →ᵖʳ XY.
(2) If Xₙ →ᵈ X and Yₙ →ᵈ c, where c is a constant, then Xₙ + Yₙ →ᵈ X + c and XₙYₙ →ᵈ cX, because Yₙ →ᵈ c ⇔ Yₙ →ᵖʳ c. (Show by yourself that Yₙ →ᵈ c ⇔ Yₙ →ᵖʳ c.)
(3) Assume Xₙ →ᵈ X and Yₙ →ᵈ Y. Does Xₙ + Yₙ →ᵈ X + Y? No. (The distribution of X + Y is undefined if only the marginal distributions of X and Y are available.)
(4) If (Xₙ, Yₙ)⊤ →ᵈ (X, Y)⊤, then by the continuous mapping theorem,
$$X_n + Y_n = (1\ 1)\binom{X_n}{Y_n} \overset{d}{\to} (1\ 1)\binom{X}{Y} = X + Y.$$
130 / 162
Regression Analysis
Appendix
Limit Theorems
Central Limit Theorem
Lindeberg Central Limit Theorem
Let $X_1, \ldots, X_n$ be independent random variables with $E(X_i) = 0$ and $E(X_i^2) = \sigma_i^2$ for $i = 1, \ldots, n$. Define $s_n^2 = \sum_{i=1}^n \sigma_i^2$ and $S_n = \sum_{i=1}^n X_i$. Then
$$\frac{S_n}{s_n} \xrightarrow{d} N(0, 1), \qquad (1)$$
provided for any $\varepsilon > 0$,
$$\frac{1}{s_n^2}\sum_{i=1}^n E\left(X_i^2 I_{\{|X_i| > \varepsilon s_n\}}\right) \to 0 \quad \text{(Lindeberg's condition)} \qquad (2)$$
as $n \to \infty$.
Lyapunov Central Limit Theorem
Let $X_1, \ldots, X_n$ be independent random variables with $E(X_i) = 0$ and $E(X_i^2) = \sigma_i^2$ for $i = 1, \ldots, n$. Define $s_n^2 = \sum_{i=1}^n \sigma_i^2$ and $S_n = \sum_{i=1}^n X_i$. If
$$\frac{1}{s_n^{2+\alpha}}\sum_{i=1}^n E(|X_i|^{2+\alpha}) \to 0 \quad \text{for some } \alpha > 0, \quad \text{(Lyapunov's condition)}$$
then $S_n/s_n \xrightarrow{d} N(0, 1)$.
131 / 162
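A simulation sketch (an illustrative addition) of the CLT with independent but non-identically distributed summands: $X_i = \sigma_i U_i$ with $U_i$ i.i.d. Uniform$(-\sqrt{3}, \sqrt{3})$ (mean 0, variance 1) and $\sigma_i = i^{1/4}$, a hypothetical choice for which Lyapunov's condition holds.

# Lyapunov CLT check: independent, non-identically distributed summands.
# X_i = sigma_i * U_i, U_i i.i.d. Uniform(-sqrt(3), sqrt(3)), sigma_i = i^{1/4}.
import numpy as np

rng = np.random.default_rng(2)
n, reps = 2000, 10_000
sigma = np.arange(1.0, n + 1.0) ** 0.25
sn = np.sqrt(np.sum(sigma**2))                        # s_n = (sum sigma_i^2)^{1/2}
U = rng.uniform(-np.sqrt(3.0), np.sqrt(3.0), (reps, n))
T = (U * sigma).sum(axis=1) / sn                      # S_n / s_n
print("mean ~", T.mean().round(3), " var ~", T.var().round(3))
print("P(T <= 1.96) ~", np.mean(T <= 1.96).round(4), "(Phi(1.96) ~ 0.975)")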
Regression Analysis
Appendix
Limit Theorems
Proof of Lindeberg Central Limit Theorem
To prove (1), we need two facts:
(F1) Lévy continuity theorem
Let $\{X_n\}$ be a sequence of random variables and define $\varphi_n(t) = E(\exp\{itX_n\})$. Then
$$X_n \xrightarrow{d} X \iff \varphi_n(t) \to \varphi(t) \text{ for every } t,$$
where $\varphi(t) = E(\exp\{itX\})$.
(F2) Lemma 8.4.1 of Chow and Teicher (1997)
$$\left|\exp\{it\} - \sum_{j=0}^n \frac{(it)^j}{j!}\right| \leq \frac{2^{1-\delta}\,|t|^{n+\delta}}{(1+\delta)(2+\delta)\cdots(n+\delta)},$$
where $\delta$ is any constant in $[0, 1]$.
132 / 162
Regression Analysis
Appendix
Limit Theorems
Proof of Lindeberg Central Limit Theorem (cont.)
Now, with $S_j = \sum_{i=1}^j X_i$ ($S_0 = 0$), write the difference as a telescoping sum of $n$ "stairs," in which the $X_i$'s are replaced one at a time by independent normal variables:
$$E\left(\exp\left\{\frac{itS_n}{s_n}\right\}\right) - \exp\left\{\frac{-t^2}{2}\right\} = \sum_{j=1}^n\left[E\left(\exp\left\{\frac{it(S_j + \sum_{i=j+1}^n Z_i)}{s_n}\right\}\right) - E\left(\exp\left\{\frac{it(S_{j-1} + \sum_{i=j}^n Z_i)}{s_n}\right\}\right)\right], \qquad (3)$$
where $Z_i \overset{\text{indep.}}{\sim} N(0, \sigma_i^2)$ and are independent of $\{X_n\}$; the $j = 1$ term uses $E(\exp\{it\sum_{i=1}^n Z_i/s_n\}) = \exp\{-t^2/2\}$.
133 / 162
Regression Analysis
Appendix
Limit Theorems
Proof of Lindeberg Central Limit Theorem (cont.)
It holds that for "Stair $j$," with $s_j^2 = \sum_{i=1}^j \sigma_i^2$,
$$\left|E\left(\exp\left\{\frac{it(S_j + \sum_{i=j+1}^n Z_i)}{s_n}\right\}\right) - E\left(\exp\left\{\frac{it(S_{j-1} + \sum_{i=j}^n Z_i)}{s_n}\right\}\right)\right|$$
$$\leq \left|\exp\left\{\frac{-t^2}{2}\right\}\left[E\left(\exp\left\{\frac{itS_j}{s_n}\right\}\right)\exp\left\{\frac{t^2 s_j^2}{2s_n^2}\right\} - E\left(\exp\left\{\frac{itS_{j-1}}{s_n}\right\}\right)\exp\left\{\frac{t^2 s_{j-1}^2}{2s_n^2}\right\}\right]\right|$$
$$\leq \exp\left\{\frac{-t^2}{2}\right\}\left|E\left(\exp\left\{\frac{itS_{j-1}}{s_n}\right\}\right)\exp\left\{\frac{t^2 s_j^2}{2s_n^2}\right\}\left[E\left(\exp\left\{\frac{itX_j}{s_n}\right\}\right) - \exp\left\{\frac{-t^2\sigma_j^2}{2s_n^2}\right\}\right]\right|$$
$$\leq \left|E\left(\exp\left\{\frac{itX_j}{s_n}\right\}\right) - \exp\left\{\frac{-t^2\sigma_j^2}{2s_n^2}\right\}\right| \quad \text{(since $s_j^2 \leq s_n^2$ and $|E(\exp\{itS_{j-1}/s_n\})| \leq 1$)}$$
$$\leq \left|E\left(\exp\left\{\frac{itX_j}{s_n}\right\} - 1 - \frac{itX_j}{s_n} + \frac{t^2X_j^2}{2s_n^2}\right) - \left(\exp\left\{\frac{-t^2\sigma_j^2}{2s_n^2}\right\} - 1 + \frac{t^2\sigma_j^2}{2s_n^2}\right)\right|, \qquad (4)$$
where the first inequality is by
$$E\left(\exp\left\{\frac{it\sum_{i=j+1}^n Z_i}{s_n}\right\}\right) = \exp\left\{\frac{-t^2(s_n^2 - s_j^2)}{2s_n^2}\right\},$$
and the last step uses $E(X_j) = 0$ and $E(X_j^2) = \sigma_j^2$.
134 / 162
Regression Analysis
Appendix
Limit Theorems
Proof of Lindeberg Central Limit Theorem (cont.)
By (F2) (taking $\delta = 1$ and $n = 1, 2$), we have
$$n = 1: \quad \left|\exp\left\{\frac{itX_j}{s_n}\right\} - 1 - \frac{itX_j}{s_n} + \frac{t^2X_j^2}{2s_n^2}\right| \leq \left|\exp\left\{\frac{itX_j}{s_n}\right\} - 1 - \frac{itX_j}{s_n}\right| + \frac{t^2X_j^2}{2s_n^2} \leq \frac{t^2X_j^2}{2s_n^2} + \frac{t^2X_j^2}{2s_n^2} = \frac{t^2X_j^2}{s_n^2},$$
$$n = 2: \quad \left|\exp\left\{\frac{itX_j}{s_n}\right\} - 1 - \frac{itX_j}{s_n} + \frac{t^2X_j^2}{2s_n^2}\right| \leq \frac{1}{6}|t|^3\left|\frac{X_j}{s_n}\right|^3,$$
and hence
$$\left|E\left(\exp\left\{\frac{itX_j}{s_n}\right\} - 1 - \frac{itX_j}{s_n} + \frac{t^2X_j^2}{2s_n^2}\right)\right| \leq E\left(\min\left(\frac{t^2X_j^2}{s_n^2},\ \frac{1}{6}|t|^3\left|\frac{X_j}{s_n}\right|^3\right)\right)$$
$$\overset{\text{why?}}{\leq} E\left(\frac{t^2X_j^2}{s_n^2}\, I_{\{|X_j/s_n| > \varepsilon\}}\right) + E\left(\frac{1}{6}|t|^3\left|\frac{X_j}{s_n}\right|^3 I_{\{|X_j/s_n| \leq \varepsilon\}}\right) \equiv I_j + II_j. \qquad (5)$$
135 / 162
Regression Analysis
Appendix
Limit Theorems
Proof of Lindeberg Central Limit Theorem (cont.)
In addition, we have
$$0 \leq \exp\left\{\frac{-t^2\sigma_j^2}{2s_n^2}\right\} - 1 + \frac{t^2\sigma_j^2}{2s_n^2} \leq \frac{t^4\sigma_j^4}{8s_n^4}, \qquad (6)$$
noting that for $x > 0$, $0 \leq \exp\{-x\} - 1 + x \leq x^2/2$. Moreover, $\sum_{j=1}^n \sigma_j^2/s_n^2 = 1$,
$$E\left(\left|\frac{X_j}{s_n}\right|^3 I_{\{|X_j/s_n| \leq \varepsilon\}}\right) \leq \varepsilon\,\frac{\sigma_j^2}{s_n^2},$$
and (2) implies
$$\frac{\max_{1 \leq j \leq n} \sigma_j^2}{s_n^2} \to 0.$$
By (2)-(6), it follows that
$$\left|E\left(\exp\left\{\frac{itS_n}{s_n}\right\}\right) - \exp\left\{\frac{-t^2}{2}\right\}\right| \leq \sum_{j=1}^n\left(I_j + II_j + \frac{t^4\sigma_j^4}{8s_n^4}\right)$$
$$\leq t^2\sum_{j=1}^n \frac{E\left(X_j^2 I_{\{|X_j| > \varepsilon s_n\}}\right)}{s_n^2} + \frac{|t|^3}{6}\,\varepsilon\sum_{j=1}^n \frac{\sigma_j^2}{s_n^2} + \frac{t^4}{8}\cdot\frac{\max_{1 \leq j \leq n}\sigma_j^2}{s_n^2}\sum_{j=1}^n \frac{\sigma_j^2}{s_n^2} = \frac{|t|^3}{6}\,\varepsilon + o(1).$$
136 / 162
Regression Analysis
Appendix
Limit Theorems
Proof of Lindeberg Central Limit Theorem (cont.)
Since $\varepsilon$ can be arbitrarily small, one gets
$$\left|E\left(\exp\left\{\frac{itS_n}{s_n}\right\}\right) - \exp\left\{\frac{-t^2}{2}\right\}\right| \underset{n\to\infty}{\longrightarrow} 0,$$
which, together with (F1), yields the desired conclusion (1).
Proof of Lyapunov Central Limit Theorem
$$\frac{1}{s_n^2}\sum_{i=1}^n E\left(X_i^2 I_{\{|X_i| > \delta s_n\}}\right) = \frac{1}{s_n^2}\sum_{i=1}^n E\left(\frac{|X_i|^{2+\alpha}}{|X_i|^\alpha}\, I_{\{|X_i| > \delta s_n\}}\right) \leq \frac{1}{s_n^{2+\alpha}\delta^\alpha}\sum_{i=1}^n E(|X_i|^{2+\alpha}) \to 0,$$
if Lyapunov's condition holds.
137 / 162
Regression Analysis
Appendix
Limit Theorems
Example 1
If $X_1, \ldots, X_n$ are independent random variables with $E(X_i) = 0$ for $i = 1, \ldots, n$, $\sup_{i \geq 1} E|X_i|^{2+\alpha} < M$, $\liminf_{n\to\infty} s_n^2/a_n > 0$, and $na_n^{-1-\alpha/2} = o(1)$, then $S_n/s_n \xrightarrow{d} N(0, 1)$.
Proof of Example 1
$$\limsup_{n\to\infty} \frac{1}{s_n^{2+\alpha}}\sum_{i=1}^n E(|X_i|^{2+\alpha}) \leq \limsup_{n\to\infty} \frac{nM}{a_n^{1+\alpha/2}(s_n^2/a_n)^{1+\alpha/2}} \leq \limsup_{n\to\infty} na_n^{-1-\alpha/2} \times \limsup_{n\to\infty} \frac{M}{(s_n^2/a_n)^{1+\alpha/2}}$$
$$= \limsup_{n\to\infty} na_n^{-1-\alpha/2} \times \frac{M}{\left(\liminf_{n\to\infty} s_n^2/a_n\right)^{1+\alpha/2}} = 0,$$
so Lyapunov's condition holds and the conclusion follows from the Lyapunov central limit theorem.
138 / 162
Regression Analysis
Appendix
Limit Theorems
Example 2
Let $P_J$ be the orthogonal projection matrix onto the space spanned by $\{X_j : j \in J\}$. Consider $\varepsilon^\top(P_{J_2} - P_{J_1})\varepsilon$, where $\varepsilon = (\varepsilon_1, \ldots, \varepsilon_n)^\top$ with $\varepsilon_i \overset{\text{indep.}}{\sim} (0, 1)$, $J_2 \supset J_1$, and $\sharp(J_2) - \sharp(J_1) = 1$. Then
$$\varepsilon^\top(P_{J_2} - P_{J_1})\varepsilon \xrightarrow{d} \chi^2(1),$$
provided $\sup_{i \geq 1} E|\varepsilon_i|^{2+\alpha} < M < \infty$ for some $\alpha > 0$ and $\max_{1 \leq i \leq n}(P_{J_2})_{ii} \to 0$ as $n \to \infty$.
Remark
If $\varepsilon \sim N(0, I)$, then $\varepsilon^\top(P_{J_2} - P_{J_1})\varepsilon \sim \chi^2(1)$.
If $(X^\top X)/a_n \to R$ (p.d.), then
$$(P_{J_2})_{ii} = e_i^\top X_{J_2}(X_{J_2}^\top X_{J_2})^{-1}X_{J_2}^\top e_i = \left(\frac{x_i(J_2)}{\sqrt{a_n}}\right)^\top\left(\frac{X_{J_2}^\top X_{J_2}}{a_n}\right)^{-1}\left(\frac{x_i(J_2)}{\sqrt{a_n}}\right)$$
$$\leq \lambda_{\max}\left(\left(\frac{X_{J_2}^\top X_{J_2}}{a_n}\right)^{-1}\right) \times \frac{\sum_{j \in J_2} x_{ij}^2}{a_n} \to 0,$$
provided $a_n^{-1}\sum_{j \in J_2} x_{ij}^2 \to 0$, where $x_i(J_2) = (x_{ij},\ j \in J_2)^\top$ and $X_{J_2} = (X_j,\ j \in J_2)$.
139 / 162
Regression Analysis
Appendix
Limit Theorems
Proof of Example 2
Let $\sharp(J_1) = r$. Then $P_{J_1} = \sum_{i=1}^r o_i o_i^\top$ and $P_{J_2} = \sum_{i=1}^{r+1} o_i o_i^\top$, where $o_i^\top o_i = 1$ and $o_i^\top o_j = 0$ for $1 \leq i < j \leq r + 1$. Hence, $P_{J_2} - P_{J_1} = o_{r+1}o_{r+1}^\top$. Without loss of generality, set $o_{r+1} = (a_{1n}, \ldots, a_{nn})^\top$ with $\sum_{i=1}^n a_{in}^2 = 1$.
Now $\varepsilon^\top(P_{J_2} - P_{J_1})\varepsilon = \left(\sum_{i=1}^n a_{in}\varepsilon_i\right)^2$. Note that $\sum_{i=1}^n a_{in}\varepsilon_i$ can be viewed as
$$\sum_{i=1}^n \frac{v_i\varepsilon_i}{\sqrt{\sum_{j=1}^n v_j^2}}, \quad \text{where } v_i > 0 \text{ for } i = 1, \ldots, n \text{ and } \sum_{j=1}^n v_j^2 \to \infty.$$
Lyapunov's condition $\sum_{i=1}^n E|a_{in}\varepsilon_i|^{2+\alpha} \to 0$ follows from
$$\sum_{i=1}^n E|a_{in}\varepsilon_i|^{2+\alpha} \leq M\sum_{i=1}^n |a_{in}|^{2+\alpha} \leq M\left(\sum_{i=1}^n a_{in}^2\right)\max_{1 \leq i \leq n}|a_{in}|^\alpha = M\max_{1 \leq i \leq n}|a_{in}|^\alpha,$$
and $\max_{1 \leq i \leq n}|a_{in}| = (\max_{1 \leq i \leq n} a_{in}^2)^{1/2} \leq (\max_{1 \leq i \leq n}(P_{J_2})_{ii})^{1/2} \to 0$.
By the Lyapunov central limit theorem, we have
$$\sum_{i=1}^n a_{in}\varepsilon_i \xrightarrow{d} N(0, 1),$$
and hence $\varepsilon^\top(P_{J_2} - P_{J_1})\varepsilon \xrightarrow{d} \chi^2(1)$ is obtained using the continuous mapping theorem.
140 / 162
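A simulation sketch of Example 2 (an added illustration; the design below is hypothetical): a polynomial design with $J_1 = \{1, 2\}$, $J_2 = \{1, 2, 3\}$ and uniform (non-normal) standardized errors, for which $\max_i (P_{J_2})_{ii} \to 0$ holds.

# Example 2 sketch: eps'(P_J2 - P_J1)eps should be close to chi^2(1).
# Design: x_i = (1, t_i, t_i^2); errors uniform with mean 0, variance 1.
import numpy as np

rng = np.random.default_rng(3)
n, reps = 300, 10_000
t = np.linspace(-1.0, 1.0, n)
X2 = np.column_stack([np.ones(n), t, t**2])         # columns indexed by J2
X1 = X2[:, :2]                                      # columns indexed by J1
P2 = X2 @ np.linalg.solve(X2.T @ X2, X2.T)          # projection onto C(X_J2)
P1 = X1 @ np.linalg.solve(X1.T @ X1, X1.T)          # projection onto C(X_J1)
eps = rng.uniform(-np.sqrt(3.0), np.sqrt(3.0), (reps, n))
Q = np.sum((eps @ (P2 - P1)) * eps, axis=1)         # eps'(P_J2 - P_J1) eps
print("P(Q <= 3.841) ~", np.mean(Q <= 3.841).round(4), "(chi^2(1) 95th pct: 0.95)")
print("max diag of P_J2:", P2.diagonal().max().round(4))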
Regression Analysis
Appendix
Limit Theorems
Convergence in the rth Mean
Definition
If $E|X_n - X|^r \to 0$ and $E|X|^r < \infty$, then we say that $X_n$ converges in the $r$th mean to $X$, and we write $X_n \xrightarrow{L_r} X$.
Definition
The $r$-norm of the random variable $Z$ is defined by $\|Z\|_r = (E(|Z|^r))^{1/r}$.
141 / 162
Regression Analysis
Appendix
Limit Theorems
Some Inequalities
Jensen's inequality
If g is a convex function, then E(g(X)) ≥ g(E(X)).
Proof of Jensen’s inequality
Note that the graph of a convex (differentiable) function lies above its tangent line at every point, and thus
$$g(x) \geq g(\mu) + g'(\mu)(x - \mu)$$
for any $x$ and $\mu$ in the domain of the function $g$.
Choosing $\mu = E(X)$ and replacing $x$ with the random variable $X$, we have
$$g(X) \geq g(E(X)) + g'(E(X))(X - E(X)).$$
The proof is completed by taking expectations on both sides of the above inequality.
Application
Let $q > 1$ and $g(x) = x^q$ for $x > 0$. Then $g(x)$ is a convex function.
Assume $0 < s < r$. By Jensen's inequality (with $q = r/s$), we have
$$E(|X|^r) = E\left((|X|^s)^{r/s}\right) \geq \left(E(|X|^s)\right)^{r/s},$$
and hence $(E(|X|^r))^{1/r} \geq (E(|X|^s))^{1/s}$.
142 / 162
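A numeric check of this moment-norm monotonicity (an added sketch; the Exponential(1) choice is arbitrary), for which the exact values are $\|X\|_r = (r!)^{1/r}$:

# Check ||X||_s <= ||X||_r for s < r with X ~ Exponential(1).
import numpy as np

rng = np.random.default_rng(4)
X = rng.exponential(1.0, 1_000_000)
for s, r in [(1, 2), (2, 4), (1, 3)]:
    ns = np.mean(np.abs(X)**s) ** (1.0 / s)   # (E|X|^s)^{1/s}
    nr = np.mean(np.abs(X)**r) ** (1.0 / r)   # (E|X|^r)^{1/r}
    print(f"s={s}, r={r}:  ||X||_s ~ {ns:.3f} <= ||X||_r ~ {nr:.3f}")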
Regression Analysis
Appendix
Limit Theorems
Young’s inequality
Let $f$ be a strictly increasing continuous function on $[0, \infty)$ with $f(0) = 0$. Then
$$ab \leq \int_0^a f(x)\,dx + \int_0^b f^{-1}(x)\,dx.$$
143 / 162
Regression Analysis
Appendix
Limit Theorems
Hölder's inequality
$$E|XY| \leq (E(|X|^p))^{1/p}(E(|Y|^q))^{1/q}, \quad \text{where } \frac{1}{p} + \frac{1}{q} = 1 \text{ and } p, q \in (1, \infty).$$
Proof of Hölder's inequality
Let $f(x) = x^{p-1}$. Then by Young's inequality,
$$ab \leq \int_0^a x^{p-1}\,dx + \int_0^b x^{1/(p-1)}\,dx = \frac{a^p}{p} + \frac{1}{1 + 1/(p-1)}\,b^{1+1/(p-1)} = \frac{a^p}{p} + \frac{b^q}{q}. \qquad (*)$$
Now, let $a = |X|/\|X\|_p$ and $b = |Y|/\|Y\|_q$. By $(*)$,
$$\frac{|X|}{\|X\|_p} \times \frac{|Y|}{\|Y\|_q} \leq \frac{1}{p}\times\left(\frac{|X|}{\|X\|_p}\right)^p + \frac{1}{q}\times\left(\frac{|Y|}{\|Y\|_q}\right)^q,$$
which, after taking expectations on both sides, implies
$$\frac{E|XY|}{\|X\|_p\|Y\|_q} \leq \frac{1}{p} + \frac{1}{q} = 1,$$
and thus the proof is complete.
144 / 162
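An empirical check of Hölder's inequality (an added sketch; the correlated normal pair below is an arbitrary choice):

# Empirical check of Holder's inequality with a correlated normal pair.
import numpy as np

rng = np.random.default_rng(5)
Z = rng.standard_normal((1_000_000, 2))
X, Y = Z[:, 0], 0.8 * Z[:, 0] + 0.6 * Z[:, 1]     # Corr(X, Y) = 0.8
lhs = np.mean(np.abs(X * Y))
for p in [1.5, 2.0, 3.0]:
    q = p / (p - 1.0)                             # conjugate exponent: 1/p + 1/q = 1
    rhs = np.mean(np.abs(X)**p)**(1/p) * np.mean(np.abs(Y)**q)**(1/q)
    print(f"p={p}:  E|XY| ~ {lhs:.3f} <= ||X||_p ||Y||_q ~ {rhs:.3f}")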
Regression Analysis
Appendix
Limit Theorems
Minkowski’s inequality
$\|X + Y\|_p \leq \|X\|_p + \|Y\|_p$, where $1 \leq p < \infty$.
Proof of Minkowski’s inequality
By Hölder's inequality (applied with exponents $q = p/(p-1)$ and $p$),
$$E(|X + Y|^p) = E(|X + Y|^{p-1}|X + Y|) \leq E(|X + Y|^{p-1}|X|) + E(|X + Y|^{p-1}|Y|)$$
$$\leq (E(|X + Y|^p))^{(p-1)/p}(E(|X|^p))^{1/p} + (E(|X + Y|^p))^{(p-1)/p}(E(|Y|^p))^{1/p};$$
dividing both sides by $(E(|X + Y|^p))^{(p-1)/p}$ completes the proof.
145 / 162
Regression Analysis
Appendix
Limit Theorems
Some Facts
(1) $X_n \xrightarrow{L_r} X \Rightarrow X_n \xrightarrow{pr.} X \Rightarrow X_n \xrightarrow{d} X$.
If $X$ is a constant, then $X_n \xrightarrow{d} X \Rightarrow X_n \xrightarrow{pr.} X$.
If $\sup_{n \geq 1} E(|X_n|^p) < \infty$ with $p > r$, then $X_n \xrightarrow{pr.} X \Rightarrow X_n \xrightarrow{L_r} X$.
(2) $X_n \xrightarrow{pr.} X$ does not necessarily imply $X_n \xrightarrow{L_r} X$.
Example. Let $P(X_n = n^2) = 1/n$ and $P(X_n = 0) = 1 - 1/n$. Then for any $\varepsilon > 0$,
$$P(|X_n| > \varepsilon) = P(X_n > \varepsilon) = P(X_n = n^2) \to 0,$$
and hence $X_n \xrightarrow{pr.} 0$. However,
$$E|X_n - 0| = E(X_n) = 0 \times P(X_n = 0) + n^2 \times P(X_n = n^2) = n \to \infty.$$
(3) If $X_n \xrightarrow{L_2} X$, then $E(X_n) \to E(X)$ and $E(X_n^2) \to E(X^2)$.
Proof. $|E(X_n - X)| \leq E|X_n - X| \leq (E(X_n - X)^2)^{1/2} \to 0$ and
$$|E(X_n^2 - X^2)| = |E[(X_n - X)(X_n - X + 2X)]| \leq E[(X_n - X)^2] + 2E|X(X_n - X)| \leq E[(X_n - X)^2] + 2\sqrt{E[(X_n - X)^2]}\sqrt{E(X^2)} \to 0.$$
146 / 162
Regression Analysis
Appendix
Limit Theorems
Some Facts (Cont.)
(4) If $X_n \xrightarrow{L_r} X$, then $E(|X_n|^r) \to E(|X|^r)$.
Proof. For $r \geq 1$, by Minkowski's inequality, we have
$$\|X_n\|_r \leq \|X_n - X\|_r + \|X\|_r \quad \text{and} \quad \|X\|_r \leq \|X_n - X\|_r + \|X_n\|_r,$$
and hence
$$\|X\|_r - \|X_n - X\|_r \leq \|X_n\|_r \leq \|X\|_r + \|X_n - X\|_r,$$
which, in conjunction with $X_n \xrightarrow{L_r} X$, yields the desired result. On the other hand, note that $(a + b)^r \leq a^r + b^r$ for $a, b \geq 0$ and $0 < r < 1$. Hence, for $r < 1$,
$$\|X_n\|_r^r \leq \|X_n - X + X\|_r^r \leq \|X_n - X\|_r^r + \|X\|_r^r \quad \text{and} \quad \|X\|_r^r \leq \|X - X_n + X_n\|_r^r \leq \|X_n - X\|_r^r + \|X_n\|_r^r.$$
By an argument similar to that used for the case of $r \geq 1$, we have
$$E(|X_n|^r) \to E(|X|^r) \quad \text{for } r \leq 1,$$
and thus the proof is complete.
147 / 162
Regression Analysis
Appendix
Limit Theorems
Weak Law of Large Numbers
Fact 4
Let $X_1, \ldots, X_n$ be i.i.d. random variables with $E(X_1) = \mu < \infty$. Then
$$\frac{S_n}{n} \xrightarrow{pr.} \mu,$$
where $S_n = \sum_{i=1}^n X_i$.
Remark
If $X_1, \ldots, X_n$ are independent random variables with $E(X_1) < \infty$, then the weak law of large numbers does not necessarily hold for $\{X_i\}$. Consider the following example:
Let $X_1, \ldots, X_n$ be a sequence of independent random variables with
$$P(X_i = \sqrt{i}) = P(X_i = -\sqrt{i}) = \frac{1}{2}.$$
Note that $E(X_i) = 0$ and $\mathrm{Var}(X_i) = i$ for $i = 1, \ldots, n$. Moreover,
$$s_n^2 = \sum_{i=1}^n \mathrm{Var}(X_i) = \sum_{i=1}^n i = \frac{n(n+1)}{2}.$$
148 / 162
Regression Analysis
Appendix
Limit Theorems
Remark (Cont.)
Since for some $\alpha > 0$,
$$\frac{\sum_{i=1}^n E(|X_i|^{2+\alpha})}{s_n^{2+\alpha}} = \frac{\sum_{i=1}^n i^{1+\alpha/2}}{(n(n+1)/2)^{1+\alpha/2}} = O\left(\frac{n^{2+\alpha/2}}{n^{2+\alpha}}\right) \to 0,$$
by the Lyapunov central limit theorem, we have
$$\frac{\sqrt{2}\sum_{i=1}^n X_i}{n} \xrightarrow{d} N(0, 1).$$
Hence, $S_n/n$ converges in distribution to $N(0, 1/2)$ rather than in probability to a constant, so the weak law of large numbers does not hold for $\{X_i\}$.
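A short simulation (an added illustration) of this counterexample: $\mathrm{Var}(S_n/n) = (n+1)/(2n) \to 1/2$, so $S_n/n$ keeps fluctuating on a constant scale instead of concentrating at $E(X_i) = 0$.

# Simulation of the counterexample X_i = +/- sqrt(i) with equal probability.
import numpy as np

rng = np.random.default_rng(6)
reps = 1000
for n in [100, 1000, 10_000]:
    signs = rng.choice([-1.0, 1.0], size=(reps, n))
    Sn = (signs * np.sqrt(np.arange(1.0, n + 1.0))).sum(axis=1)
    print(f"n={n:6d}  Var(S_n/n) ~ {np.var(Sn / n):.3f}  (theory: (n+1)/(2n) -> 1/2)")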
Proof of Fact 4
Consider
$$\frac{S_n}{n} - \mu = \frac{S_n - m_n}{n} + \frac{m_n - n\mu}{n} = \frac{S_n - m_n}{n} - E(X_1 I_{\{|X_1| > n\}}), \qquad (5\text{-}1)$$
where $m_n = \sum_{i=1}^n E(X_i^{(n)})$ with $X_i^{(n)} = X_i I_{\{|X_i| \leq n\}}$, $i = 1, \ldots, n$.
It suffices to show that
$$\frac{S_n - m_n}{n} \xrightarrow{pr.} 0, \qquad (5\text{-}2)$$
and
$$E(|X_1| I_{\{|X_1| > n\}}) \to 0. \qquad (5\text{-}3)$$
149 / 162
Regression Analysis
Appendix
Limit Theorems
Proof of Fact 4 (Cont.)
Since $E(|X_1|) < \infty$, $E(|X_1^{(n)}|) \to E(|X_1|)$ and
$$E(|X_1|) = E(|X_1^{(n)}|) + E(|X_1| I_{\{|X_1| > n\}}),$$
we obtain (5-3).
We next show (5-2). Define $S_n^{(n)} = \sum_{i=1}^n X_i^{(n)}$. Note first that for any $\varepsilon > 0$,
$$P\left(\frac{|S_n - m_n|}{n} > \varepsilon\right) \leq P\left(\frac{|S_n - m_n|}{n} > \varepsilon,\ \bigcap_{i=1}^n\{|X_i| \leq n\}\right) + P\left(\bigcup_{i=1}^n\{|X_i| > n\}\right)$$
$$\leq P\left(\frac{|S_n^{(n)} - m_n|}{n} > \varepsilon\right) + nP(|X_1| > n) \leq P\left(\frac{|S_n^{(n)} - m_n|}{n} > \varepsilon\right) + E(|X_1| I_{\{|X_1| > n\}})$$
$$= P\left(\frac{|S_n^{(n)} - m_n|}{n} > \varepsilon\right) + o(1), \qquad (5\text{-}4)$$
where the second inequality is by the i.i.d. assumption, the third by $nP(|X_1| > n) \leq E(|X_1| I_{\{|X_1| > n\}})$, and the final equality by (5-3).
150 / 162
Regression Analysis
Appendix
Limit Theorems
Proof of Fact 4 (Cont.)
By Chebyshev's inequality, we have
$$P\left(\frac{|S_n^{(n)} - m_n|}{n} > \varepsilon\right) \leq \frac{E((X_1^{(n)})^2)}{n\varepsilon^2}. \qquad (5\text{-}5)$$
Moreover,
$$E((X_1^{(n)})^2) = \int_0^\infty P(X_1^2 I_{\{|X_1| \leq n\}} > x)\,dx = 2\int_0^\infty u P(X_1^2 I_{\{|X_1| \leq n\}} > u^2)\,du = 2\int_0^\infty u P(|X_1| I_{\{|X_1| \leq n\}} > u)\,du$$
$$= 2\int_0^\infty u P(|X_1| I_{\{|X_1| \leq n\}} > u,\ |X_1| \leq n)\,du + 2\int_0^\infty u P(|X_1| I_{\{|X_1| \leq n\}} > u,\ |X_1| > n)\,du$$
$$= 2\int_0^\infty u P(u < |X_1| \leq n)\,du = 2\int_0^n u P(u < |X_1| \leq n)\,du \leq 2\int_0^n u P(|X_1| > u)\,du. \qquad (5\text{-}6)$$
151 / 162
Regression Analysis
Appendix
Limit Theorems
Proof of Fact 4 (Cont.)
Since $nP(|X_1| > n) \leq E(|X_1| I_{\{|X_1| > n\}}) = o(1)$, so that $uP(|X_1| > u) \to 0$ as $u \to \infty$, we have
$$2\int_0^n u P(|X_1| > u)\,du = 2\int_0^A u P(|X_1| > u)\,du + 2\int_A^n u P(|X_1| > u)\,du \leq A^2 + 2\varepsilon^3(n - A), \qquad (5\text{-}7)$$
where $A$ is large enough such that
$$uP(|X_1| > u) \leq \varepsilon^3, \quad \forall u \geq A.$$
Hence, (5-2) follows from (5-4)-(5-7), and the proof is complete.
152 / 162
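The proof truncates at level $n$ precisely because only $E|X_1| < \infty$ is assumed; a plain Chebyshev bound would need a variance. The sketch below (an added illustration with a hypothetical distribution) checks Fact 4 on i.i.d. Pareto variables with tail index $1.5$, which have mean $3$ but infinite variance.

# WLLN check for i.i.d. Pareto(1.5) on [1, inf): E(X_1) = 1.5/0.5 = 3, Var = inf.
import numpy as np

rng = np.random.default_rng(7)
alpha, mu, reps, eps = 1.5, 3.0, 1000, 0.2
for n in [100, 1000, 10_000]:
    X = rng.pareto(alpha, (reps, n)) + 1.0     # numpy's pareto is the Lomax form
    prob = np.mean(np.abs(X.mean(axis=1) - mu) > eps)
    print(f"n={n:6d}  P(|S_n/n - mu| > {eps}) ~ {prob:.3f}")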
Regression Analysis
Appendix
Delta Method
Delta Method
Assume $a_n(Z_n - \mu) \xrightarrow{d} Z$, where $Z_n, \mu, Z$ are $k$-dimensional and $a_n \to \infty$ as $n \to \infty$. Let $f(\cdot) = (f_1(\cdot), \ldots, f_m(\cdot))^\top$ be a smooth function from $\mathbb{R}^k$ into $\mathbb{R}^m$ with $1 \leq m \leq k$. Define
$$\nabla f(\cdot) = \begin{pmatrix} \frac{\partial f_1(\cdot)}{\partial x_1} & \cdots & \frac{\partial f_m(\cdot)}{\partial x_1} \\ \vdots & & \vdots \\ \frac{\partial f_1(\cdot)}{\partial x_k} & \cdots & \frac{\partial f_m(\cdot)}{\partial x_k} \end{pmatrix}.$$
Suppose that there exists $\varepsilon > 0$ such that for some $0 < G < \infty$,
$$\max_{1 \leq i \leq m}\ \sup_{\|x - \mu\| \leq \varepsilon}\left\|\left(\frac{\partial^2 f_i(x)}{\partial x_j \partial x_l}\right)_{1 \leq j, l \leq k}\right\| \leq G. \qquad (*)$$
Then
$$a_n(f(Z_n) - f(\mu)) \xrightarrow{d} (\nabla f(\mu))^\top Z. \qquad (**)$$
153 / 162
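A one-dimensional sketch of the delta method (an added illustration; the exponential data and $f(x) = x^2$ are arbitrary choices): with $Z_n = \bar{X}_n$ from Exp(1) data, $\mu = 1$, $a_n = \sqrt{n}$, and $Z \sim N(0, 1)$, the theorem predicts $\sqrt{n}(\bar{X}_n^2 - 1) \xrightarrow{d} N(0, (f'(1))^2) = N(0, 4)$.

# Delta-method sketch: sqrt(n)(Xbar_n^2 - 1) should be approximately N(0, 4).
import numpy as np

rng = np.random.default_rng(8)
n, reps = 1000, 20_000
Xbar = rng.exponential(1.0, (reps, n)).mean(axis=1)   # Z_n = sample mean, mu = 1
T = np.sqrt(n) * (Xbar**2 - 1.0)
print("mean ~", T.mean().round(3), " var ~", T.var().round(2), " (prediction: 0 and 4)")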
Regression Analysis
Appendix
Delta Method
Proof of Delta Method
Since $a_n(Z_n - \mu) \xrightarrow{d} Z$, it holds that
$$a_n(Z_n - \mu) = O_p(1), \qquad (0)$$
and hence
$$Z_n - \mu = O_p(a_n^{-1}) = O_p(o(1)) = o_p(1),$$
yielding
$$Z_n \xrightarrow{pr.} \mu. \qquad (1)$$
Define $A_n = \{\|Z_n - \mu\| \leq \varepsilon\}$, where $\varepsilon$ is defined in $(*)$. Then, by (1),
$$P(A_n) \to 1 \quad \text{as } n \to \infty. \qquad (2)$$
154 / 162
Regression Analysis
Appendix
Delta Method
Proof of Delta Method (cont.)
In the following, we shall prove $(**)$ for the case of $m = 1$. The proof of the case of $m > 1$ is similar.
By Taylor's theorem,
$$f_1(Z_n) - f_1(\mu) = (\nabla f_1(\mu))^\top(Z_n - \mu) + w_n, \qquad (3)$$
where $w_n = \frac{1}{2}(Z_n - \mu)^\top\left(\frac{\partial^2 f_1(\xi)}{\partial x_j \partial x_l}\right)_{1 \leq j, l \leq k}(Z_n - \mu)$ and $\|\xi - \mu\| \leq \|Z_n - \mu\|$.
Let $x \in \mathbb{R}$ be a continuity point of the distribution function of $(\nabla f_1(\mu))^\top Z$. Then
$$P(a_n(f_1(Z_n) - f_1(\mu)) \leq x) \overset{\text{why?}}{=} P(a_n(f_1(Z_n) - f_1(\mu)) \leq x,\ A_n) + o(1)$$
$$\overset{\text{by (3)}}{=} P((\nabla f_1(\mu))^\top a_n(Z_n - \mu) + a_n w_n \leq x,\ A_n) + o(1)$$
$$\overset{\text{why?}}{=} P((\nabla f_1(\mu))^\top a_n(Z_n - \mu)I_{A_n} + a_n w_n I_{A_n} \leq x) + o(1). \qquad (4)$$
155 / 162
Regression Analysis
Appendix
Delta Method
Proof of Delta Method (cont.)
Note that
$$|a_n w_n I_{A_n}| \overset{\text{why?}}{\leq} a_n\|Z_n - \mu\|^2\left\|\left(\frac{\partial^2 f_1(\xi)}{\partial x_j \partial x_l}\right)_{1 \leq j, l \leq k}\right\| I_{A_n} \leq a_n\|Z_n - \mu\|^2 \sup_{\|x - \mu\| \leq \varepsilon}\left\|\left(\frac{\partial^2 f_1(x)}{\partial x_j \partial x_l}\right)_{1 \leq j, l \leq k}\right\| I_{A_n}$$
$$\leq a_n\|Z_n - \mu\|^2 G \overset{\text{by (0) and (1)}}{=} o_p(1). \qquad (5)$$
Moreover, since
$$(\nabla f_1(\mu))^\top a_n(Z_n - \mu) \xrightarrow{d} (\nabla f_1(\mu))^\top Z \quad \text{(by the continuous mapping theorem)}$$
and $I_{A_n} \xrightarrow{pr.} 1$ (by (2)), it follows from Slutsky's theorem that
$$(\nabla f_1(\mu))^\top a_n(Z_n - \mu)I_{A_n} \xrightarrow{d} (\nabla f_1(\mu))^\top Z. \qquad (6)$$
156 / 162
Regression Analysis
Appendix
Delta Method
Proof of Delta Method (cont.)
By (5) and (6), and Slutsky's theorem, we obtain
$$(\nabla f_1(\mu))^\top a_n(Z_n - \mu)I_{A_n} + a_n w_n I_{A_n} \xrightarrow{d} (\nabla f_1(\mu))^\top Z. \qquad (7)$$
By (4) and (7),
$$P(a_n(f_1(Z_n) - f_1(\mu)) \leq x) \longrightarrow P((\nabla f_1(\mu))^\top Z \leq x),$$
and hence the desired conclusion follows.
157 / 162
Regression Analysis
Appendix
Two-Sample t-Test
Two-Sample t-Test
Consider the model
$$z = X\mu + \varepsilon,$$
where $z = (x_1, \ldots, x_m, y_1, \ldots, y_n)^\top$, $X = (s_{ij})$ is an $(m+n) \times 2$ matrix satisfying
$$s_{ij} = \begin{cases} 1, & \text{if } \{1 \leq i \leq m,\ j = 1\} \text{ or } \{m + 1 \leq i \leq m + n,\ j = 2\}; \\ 0, & \text{otherwise}, \end{cases}$$
$\mu = (\mu_x, \mu_y)^\top$, $\varepsilon = (\varepsilon_1, \ldots, \varepsilon_{m+n})^\top$, and the $\varepsilon_i$'s are i.i.d. $N(0, \sigma^2)$.
The least squares estimator of $\mu$ is
$$\hat{\mu} = (\hat{\mu}_x, \hat{\mu}_y)^\top = (X^\top X)^{-1}X^\top z = (\bar{x}, \bar{y})^\top \sim N\left(\begin{pmatrix} \mu_x \\ \mu_y \end{pmatrix}, \begin{pmatrix} \frac{\sigma^2}{m} & 0 \\ 0 & \frac{\sigma^2}{n} \end{pmatrix}\right).$$
Consider $H_0: \mu_x = \mu_y$. Under $H_0$,
$$T = \frac{\bar{x} - \bar{y}}{\sqrt{\sigma^2\left(\frac{1}{m} + \frac{1}{n}\right)}} \sim N(0, 1).$$
158 / 162
Regression Analysis
Appendix
Two-Sample t-Test
In practice, $\sigma^2$ is unknown and we can use
$$\hat{\sigma}^2 = \frac{1}{m + n - 2}\,z^\top(I - M)z$$
in place of $\sigma^2$, where $M = X(X^\top X)^{-1}X^\top$.
Define $S_x = (m-1)^{-1}\sum_{i=1}^m (x_i - \bar{x})^2$ and $S_y = (n-1)^{-1}\sum_{i=1}^n (y_i - \bar{y})^2$. Then some elementary calculations yield
$$(I - M)z = (x_1 - \bar{x}, \ldots, x_m - \bar{x},\ y_1 - \bar{y}, \ldots, y_n - \bar{y})^\top,$$
and hence
$$\hat{\sigma}^2 = \frac{1}{m + n - 2}\left(\sum_{i=1}^m (x_i - \bar{x})^2 + \sum_{j=1}^n (y_j - \bar{y})^2\right) = \frac{(m-1)S_x + (n-1)S_y}{m + n - 2},$$
which is the pooled variance.
159 / 162
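The projection identity above can be verified numerically; the sketch below (an added illustration with simulated data) checks that $z^\top(I - M)z/(m+n-2)$ equals the pooled variance and then forms the $t$ statistic.

# Verify sigma2_hat = z'(I - M)z/(m+n-2) equals the pooled variance.
import numpy as np

rng = np.random.default_rng(9)
m, n = 8, 12
x = 1.0 + rng.standard_normal(m)
y = 1.5 + rng.standard_normal(n)
z = np.concatenate([x, y])
X = np.zeros((m + n, 2))
X[:m, 0] = 1.0                                      # group-x indicator column
X[m:, 1] = 1.0                                      # group-y indicator column
M = X @ np.linalg.solve(X.T @ X, X.T)
s2_proj = z @ (np.eye(m + n) - M) @ z / (m + n - 2)
s2_pool = ((m - 1) * x.var(ddof=1) + (n - 1) * y.var(ddof=1)) / (m + n - 2)
print(np.isclose(s2_proj, s2_pool))                 # True: the two formulas agree
T = (x.mean() - y.mean()) / np.sqrt(s2_pool * (1/m + 1/n))  # ~ t(m+n-2) under H0
print("pooled t statistic:", T.round(3))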
Regression Analysis
Appendix
Two-Sample t-Test
Since $T \sim N(0, 1)$, $(m + n - 2)\hat{\sigma}^2/\sigma^2 \sim \chi^2(m + n - 2)$, and $T \perp \hat{\sigma}^2$, we have, under $H_0$,
$$\frac{\bar{x} - \bar{y}}{\sqrt{\hat{\sigma}^2\left(\frac{1}{m} + \frac{1}{n}\right)}} = \frac{(\bar{x} - \bar{y})\big/\sqrt{\sigma^2\left(\frac{1}{m} + \frac{1}{n}\right)}}{\sqrt{\hat{\sigma}^2/\sigma^2}} \sim t(m + n - 2).$$
Assume $m/(m+n) \to \gamma_x > 0$ and $n/(m+n) \to \gamma_y > 0$ as $m \to \infty$ and $n \to \infty$. If the $\varepsilon_i$'s are i.i.d. $(0, \sigma^2)$ (without assuming normality), then one can show that $\hat{\sigma}^2 \xrightarrow{pr.} \sigma^2$,
$$\sqrt{m + n}\begin{pmatrix} \bar{x} - \mu_x \\ \bar{y} - \mu_y \end{pmatrix} \xrightarrow{d} N\left(0, \begin{pmatrix} \frac{1}{\gamma_x} & 0 \\ 0 & \frac{1}{\gamma_y} \end{pmatrix}\sigma^2\right),$$
and
$$\frac{\sqrt{m + n - 2}\,(\bar{x} - \bar{y})}{\sqrt{\sigma^2\left(\frac{1}{\gamma_x} + \frac{1}{\gamma_y}\right)}} \xrightarrow[H_0]{d} N(0, 1).$$
This, in conjunction with $\hat{\sigma}^2 \xrightarrow{pr.} \sigma^2$, $\frac{m+n-2}{m} \to \frac{1}{\gamma_x}$, $\frac{m+n-2}{n} \to \frac{1}{\gamma_y}$, the continuous mapping theorem, and Slutsky's theorem, yields
$$\frac{\sqrt{m + n - 2}\,(\bar{x} - \bar{y})}{\sqrt{\hat{\sigma}^2\left(\frac{m+n-2}{m} + \frac{m+n-2}{n}\right)}} \xrightarrow[H_0]{d} N(0, 1);$$
that is, the $t$ statistic above is asymptotically standard normal under $H_0$ even without normality.
160 / 162
Regression Analysis
Appendix
Pearson’s Chi-Squared Test
Pearson’s Chi-Squared Test
Suppose that $X_1, \ldots, X_n$ is a random sample of size $n$ from a population, and the $n$ observations are classified into $k$ classes $A_1, \ldots, A_k$.
Let $p_i$ denote the probability that an observation falls into the class $A_i$, where $\sum_{i=1}^k p_i = 1$.
Note first that
$$Z_t = \begin{pmatrix} I_{\{X_t \in A_1\}} \\ \vdots \\ I_{\{X_t \in A_{k-1}\}} \end{pmatrix} \sim (p,\ D - pp^\top), \qquad \frac{1}{\sqrt{n}}\sum_{t=1}^n (Z_t - p) \xrightarrow{d} N(0,\ D - pp^\top),$$
and
$$\left(\frac{1}{\sqrt{n}}\sum_{t=1}^n (Z_t - p)\right)^\top(D - pp^\top)^{-1}\left(\frac{1}{\sqrt{n}}\sum_{t=1}^n (Z_t - p)\right) \xrightarrow{d} \chi^2(k - 1), \qquad (1)$$
where $p = (p_1, \ldots, p_{k-1})^\top$ and $D = \mathrm{diag}(p_1, \ldots, p_{k-1})$.
161 / 162
Regression Analysis
Appendix
Pearson’s Chi-Squared Test
Let $\mathbf{1}$ be a $(k-1)$-dimensional vector with all entries one and $O_i = \sum_{t=1}^n I_{\{X_t \in A_i\}}$ for $i = 1, \ldots, k$. Define $O = (O_1, \ldots, O_{k-1})^\top$, so that $\sum_{t=1}^n (Z_t - p) = O - np$.
Since $D^{-1}p = \mathbf{1}$ and $1 - p^\top D^{-1}p = p_k$,
$$(D - pp^\top)^{-1} = D^{-1} + \frac{D^{-1}pp^\top D^{-1}}{1 - p^\top D^{-1}p} = D^{-1} + \frac{\mathbf{1}\mathbf{1}^\top}{p_k},$$
we have
$$\left(\frac{1}{\sqrt{n}}\sum_{t=1}^n (Z_t - p)\right)^\top(D - pp^\top)^{-1}\left(\frac{1}{\sqrt{n}}\sum_{t=1}^n (Z_t - p)\right)$$
$$= \frac{1}{n}(O - np)^\top D^{-1}(O - np) + \frac{1}{np_k}(O - np)^\top\mathbf{1}\mathbf{1}^\top(O - np)$$
$$= \sum_{i=1}^{k-1}\frac{(O_i - np_i)^2}{np_i} + \frac{1}{np_k}\left(\sum_{i=1}^{k-1}(O_i - np_i)\right)^2$$
$$= \sum_{i=1}^{k-1}\frac{(O_i - np_i)^2}{np_i} + \frac{1}{np_k}\left((n - O_k) - n(1 - p_k)\right)^2 = \sum_{i=1}^k \frac{(O_i - np_i)^2}{np_i}. \qquad (2)$$
Hence, by (1) and (2),
$$\sum_{i=1}^k \frac{(O_i - np_i)^2}{np_i} \xrightarrow{d} \chi^2(k - 1).$$
162 / 162