Studies in Economic Statistics
Jae-Young Kim

1 Introduction to Probability

1.1 Introduction

Definition 1.1 (Probability Space). A probability space is a triple (Ω, F, P) where,

1. Ω (Sample Space): the set of all possible outcomes of a random experiment.

2. F (σ-field or σ-algebra): a collection of subsets of Ω.

3. P (Probability Measure): a real-valued function defined on F .

Example 1.1 (Tossing a Coin).

• Ω = {H, T}

• F = {∅, {H}, {T}, {H, T}}

• P(∅) = 0

• P({H}) = P({T}) = 1/2

• P({H, T}) = 1
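Since the coin-toss space is finite, the assignments above can be checked mechanically. The following Python sketch (not part of the original notes; all names are illustrative) builds Ω, takes F to be the power set of Ω, defines P by counting outcomes, and verifies the stated probabilities together with additivity over the disjoint events {H} and {T}.

from itertools import chain, combinations

omega = frozenset({"H", "T"})                          # sample space
# F: the power set of omega, which is a sigma-field
sigma_field = [frozenset(s) for s in chain.from_iterable(
    combinations(omega, r) for r in range(len(omega) + 1))]

def P(event):
    # uniform probability measure: P(A) = |A| / |Omega|
    return len(event) / len(omega)

assert P(frozenset()) == 0
assert P(frozenset({"H"})) == P(frozenset({"T"})) == 1 / 2
assert P(omega) == 1
# additivity on the disjoint events {H} and {T}
assert P(frozenset({"H"}) | frozenset({"T"})) == P(frozenset({"H"})) + P(frozenset({"T"}))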

Definition 1.2 (σ-field (σ-algebra)). A class F of subsets of Ω is called a σ-field or σ-algebra if it satisfies:

1. Ω ∈ F

2. For A ∈ F , Ac ∈ F

3. For Ai ∈ F , i = 1, 2, · · ·, ∪iAi ∈ F

Remarks

• A σ-field is always a field, but not vice versa.

• An element A ∈ F is called an event.

• An element ω ∈ Ω is called an outcome.


Definition 1.3 (The smallest σ-field generated by A, σ(A)). Let A be a class of subsets of Ω. Consider the class that is the intersection of all the σ-fields containing A; it is called the σ-field generated by A and is denoted by σ(A). σ(A) satisfies

1. A ⊂ σ(A).

2. σ(A) is a σ-field.

3. If A ⊂ G, and G is a σ-field, then σ(A) ⊂ G.

Example 1.2 (σ(A)).

• Ω = {1, 2, 3, 4, 5, 6}

• A = {1, 3, 5}

• A = {A}

⇒ σ(A) = {A, Ac, ∅, Ω}
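On a finite sample space, σ(A) can be generated by brute force: start from the sets in A together with ∅ and Ω, and close the collection under complementation and union until nothing new appears. The Python sketch below (illustrative only, not from the notes) reproduces the example above.

omega = frozenset(range(1, 7))            # Omega = {1, ..., 6}
A = frozenset({1, 3, 5})

def generated_sigma_field(omega, generators):
    # close {emptyset, Omega} plus the generators under complement and pairwise union
    sets = {frozenset(), omega} | {frozenset(g) for g in generators}
    while True:
        new = {omega - s for s in sets} | {s | t for s in sets for t in sets}
        if new <= sets:                   # nothing new: the collection is a sigma-field
            return sets
        sets |= new

sigma_A = generated_sigma_field(omega, [A])
print(sorted(map(sorted, sigma_A)))       # the four sets: emptyset, A, Ac, Omega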

Definition 1.4 (Probability Measure). A real-valued set function defined on a σ-field is a probability measure if it satisfies

1. P(A) ≥ 0, ∀A ∈ F

2. P(Ω) = 1

3. For Ai ∩ Aj = ∅, i ≠ j, P(∪_i Ai) = ∑_i P(Ai)

Remarks

• The three properties given above are often referred to as the axioms of probability.

• A probability (measure) takes values in [0, 1], while a general measure takes values in [0, ∞].

Definition 1.5 (Lebesgue Measure). First we define µ on an open interval in the natural way. Note that any open set in R can be represented as a countable union of disjoint open intervals.

• Outer measure of A: µ*(A) = inf ∑_k µ(Ck), the infimum taken over open coverings A ⊂ ∪_k Ck


• Inner measure of A: µ_*(A) = 1 − µ*(Ac)

• Lebesgue measure: when the outer and inner measures agree, µ(A) = µ*(A) = µ_*(A)

Theorem 1.1 (Unique Extension). A probability measure on a field F0 has a unique extension to the σ-field generated by F0.

1. Let P be a probability measure on F0 and let F = σ(F0). Then, there exists a probability measure Q on F such that Q(A) = P(A) for A ∈ F0.

2. Let Q′ be another probability measure on F such that Q′(A) = P(A) for A ∈ F0. Then Q′(A) = Q(A) for A ∈ F.

3. For Ai ∈ F with Ai ∩ Aj = ∅ (i ≠ j) and ∪_{i=1}^{∞} Ai ∈ F, Q is countably additive.

Theorem 1.2 (Properties of Probability Measure).

1. For A ⊂ B, P(A) ≤ P(B). (Proof hint: P(B − A) = P(B) − P(A).)

2. P(A ∪ B) = P(A) + P(B) − P(A ∩ B). (Proof hint: A ∪ B = A ∪ (B ∩ Ac).)

3. P(A ∪ B) ≤ P(A) + P(B)

• Extension (inclusion-exclusion):

P(∪_{k=1}^{n} Ak) = ∑_{k=1}^{n} P(Ak) − ∑_{i<j} P(Ai ∩ Aj) + · · · + (−1)^{n+1} P(A1 ∩ A2 ∩ · · · ∩ An)

• Boole's inequality:

P(∪_{i=1}^{∞} Ai) ≤ ∑_{i=1}^{∞} P(Ai)
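Both formulas are easy to confirm numerically on a finite sample space with the counting measure. The Python sketch below (an illustrative check, not part of the notes) draws a few random events and compares the probability of their union with the inclusion-exclusion expansion and with Boole's bound.

import random
from functools import reduce
from itertools import combinations

random.seed(0)
omega = set(range(100))
P = lambda A: len(A) / len(omega)                       # uniform (counting) measure
events = [set(random.sample(sorted(omega), 30)) for _ in range(4)]   # A1, ..., A4

union = set().union(*events)
n = len(events)
incl_excl = sum(
    (-1) ** (k + 1)
    * sum(P(reduce(set.intersection, combo)) for combo in combinations(events, k))
    for k in range(1, n + 1)
)
assert abs(P(union) - incl_excl) < 1e-12                # inclusion-exclusion holds exactly
assert P(union) <= sum(P(A) for A in events)            # Boole's inequality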


1.2 Some Limit Concepts of Probability

Definition 1.6 (Limit of Events for Monotone Sequences). Let En be a sequence of events. En is monotone when E1 ⊂ E2 ⊂ · · · or E1 ⊃ E2 ⊃ · · · .

1. Monotone increasing sequence of events:

E1 ⊂ E2 ⊂ · · · ⇒ lim En = ∪_{n=1}^{∞} En

2. Monotone decreasing sequence of events:

E1 ⊃ E2 ⊃ · · · ⇒ lim En = ∩_{n=1}^{∞} En

Theorem 1.3 (A monotone sequence of events En).

P(lim En) = lim P(En)

Proof. (For En monotone increasing; the decreasing case follows by complementation.)

• Let E0 = ∅ and Fn = En − En−1, so the Fn are disjoint and P(Fi) = P(Ei) − P(Ei−1).

• P(∪_{i=1}^{n} Fi) = ∑_{i=1}^{n} P(Fi) = P(En) = P(∪_{i=1}^{n} Ei)

• By countable additivity, P(lim En) = P(∪_{i=1}^{∞} Fi) = ∑_{i=1}^{∞} P(Fi) = lim P(En).

Definition 1.7 (Limit Supremum and Limit Infimum of Events). For a sequence of events En, define

lim sup_n En = ∩_{n=1}^{∞} ∪_{k=n}^{∞} Ek (∀n ≥ 1, ∃k ≥ n such that ω ∈ Ek; "En infinitely often")

lim inf_n En = ∪_{n=1}^{∞} ∩_{k=n}^{∞} Ek (∃n ≥ 1 such that ∀k ≥ n, ω ∈ Ek; "En eventually")

When lim sup En = lim inf En, the common event is denoted lim En.

Lemma 1.1 (Borel-Cantelli). Let En be a sequence of events.

If ∑_{i=1}^{∞} P(Ei) < ∞, then P(lim sup En) = 0


Proof.

P(lim sup En) = P(∩_{n=1}^{∞} ∪_{k=n}^{∞} Ek) ≤ P(∪_{k=n}^{∞} Ek) ≤ ∑_{k=n}^{∞} P(Ek) → 0 as n → ∞,

since ∑_{i=1}^{∞} P(Ei) < ∞ (the tail of a convergent series vanishes).

Remarks. Note that if P(En) → 0, then P(lim inf En) = 0.

Lemma 1.2 (Second Borel-Cantelli Lemma). Let En be an independent sequence of events.

If ∑_{i=1}^{∞} P(Ei) = ∞, then P(lim sup En) = 1

1.3 Conditional Probability and Independence

Definition 1.8 (Conditional Probability). For events A, B with P(B) > 0, the conditional probability of A given B is defined as

P(A | B) = P(A ∩ B) / P(B)

Definition 1.9 (Independence: A ⊥ B). Let A, B ∈ F, B ≠ ∅.

• If A ⊥ B, then P(A ∩ B) = P(A)P(B).

• If A ⊥ B, then P(A | B) = P(A).

• P(A | B) = P(A ∩ B)/P(B) = P(A)P(B)/P(B) = P(A)

Remarks. If A or B is empty, then A and B are independent.
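A concrete check of Definitions 1.8 and 1.9: on the 36-point sample space of two fair dice, the events "the first die is even" and "the two dice sum to 7" are independent. The Python sketch below (illustrative, not from the notes; exact rational arithmetic) verifies P(A ∩ B) = P(A)P(B) and P(A | B) = P(A).

from fractions import Fraction
from itertools import product

omega = list(product(range(1, 7), repeat=2))     # 36 equally likely outcomes
P = lambda A: Fraction(len(A), len(omega))

A = {w for w in omega if w[0] % 2 == 0}          # first die shows an even number
B = {w for w in omega if sum(w) == 7}            # the two dice sum to 7

print(P(A & B) == P(A) * P(B))                   # True: A and B are independent
print(P(A & B) / P(B) == P(A))                   # equivalently, P(A | B) = P(A)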

Definition 1.10 (Pairwise Independence).

• Let Γ be a class of subsets of Ω.

• For any pair A, B ∈ Γ, if P(A ∩ B) = P(A)P(B), then the events in Γ are pairwise independent.

Definition 1.11 (Mutual Independence).

• Let Γ be a class of subsets of Ω.

• For any finite collection of events Ai1, . . . , Aik in Γ, if P(Ai1 ∩ Ai2 ∩ · · · ∩ Aik) = ∏_{j=1}^{k} P(Aij), then the events in Γ are mutually independent or completely independent.


1.4 Bayes Theorem

Theorem 1.4 (Bayes Theorem). For A, B ∈ F , P(A) > 0, P(B) > 0,

• P(B | A) = P(A ∩ B)/P(A) = P(A | B)P(B) / [P(A | B)P(B) + P(A | Bc)P(Bc)]

• P(A | B) = P(A ∩ B)/P(B) = P(B | A)P(A) / [P(B | A)P(A) + P(B | Ac)P(Ac)]

Remarks: A Partition {Ai} of Ω

• Ai, i = 1, 2, . . . , n

• {Ai} is a partition of Ω if it satisfies

(i) ∪_{i=1}^{n} Ai = Ω

(ii) Ai ∩ Aj = ∅, i ≠ j

• If {Ai}, i = 1, 2, . . . , n, is a partition of Ω with P(Ai) > 0, then for every B ∈ F with P(B) > 0,

P(Ai | B) = P(B | Ai)P(Ai) / ∑_{j=1}^{n} P(B | Aj)P(Aj)

Remarks: Bayesian Approach

• On a probability space (Ω, F, P)

• For an event H ∈ F, write P(· | H) = PH

• Let {Hi} be a partition of Ω consisting of unobservable events.

• Let B ⊂ Ω be observable.

• P(Hi | B) = P(Hi)P(B | Hi) / ∑_{j=1}^{n} P(Hj)P(B | Hj)
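The partition form of Bayes' theorem is a one-line computation once the prior P(Hi) and the likelihoods P(B | Hi) are given. The Python sketch below (illustrative, not from the notes; the numbers are made up) computes the posterior P(Hi | B) for a two-event partition and checks that the posterior probabilities sum to one.

prior = {"H1": 0.3, "H2": 0.7}                   # P(Hi), a partition of Omega
likelihood = {"H1": 0.9, "H2": 0.2}              # P(B | Hi) for an observable event B

evidence = sum(prior[h] * likelihood[h] for h in prior)              # P(B)
posterior = {h: prior[h] * likelihood[h] / evidence for h in prior}  # P(Hi | B)

print(posterior)                                 # {'H1': 0.658..., 'H2': 0.341...}
print(abs(sum(posterior.values()) - 1.0) < 1e-12)    # posteriors sum to one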

Remarks: Classical vs. Bayesian Approach

Y = Xβ + ε

• Classical (Frequentist) Approach

(a) X, Y are random variables.

(b) Parameters (β) are fixed.

• Bayesian Approach

(a) Unknowns (Unobservable) are regarded as random variables.

(b) β, ε are random variables.


2 Random Variables, Distribution Functions, and Expectation

2.1 Random Variables

Definition 2.1 (Random Variable).

• A finite function X : Ω → R is a random variable (r.v.) if for each B ∈ B, X−1(B) = {ω : X(ω) ∈ B} ∈ F, where B is the Borel σ-algebra on R.

Remarks

• A random variable is a real measurable function.

• A random variable X : Ω → R defined on (Ω, F, P) is called an F/B-measurable function.

Definition 2.2 (Measurable Mapping).

• Measurable mapping: Generalization of measurable function

• Let (Ω,F ), (Ω′,F ′) be two measurable spaces.

• A mapping T : Ω → Ω′ is said to be F/F′-measurable if for any B ∈ F′, T−1(B) = {ω ∈ Ω : T(ω) ∈ B} ∈ F.

Theorem 2.1.

• Let (Ω,F , P) be a probability space.

• Let X be a random variable defined on Ω.

• Then, the random variable X induces a new probability space (R, B, PX), where X : Ω → R.

Proof.

For B ∈ B, let PX(B) = P[X−1(B)] = P[ω : X(ω) ∈ B].

It is sufficient to show that

1. PX(R) = 1

2. PX(B) ≥ 0 for any B ∈ B

3. For Bi ∈ B, i = 1, 2, . . . , with Bi ∩ Bj = ∅ (i ≠ j),

PX(∪_i Bi) = ∑_i PX(Bi)


2.2 Probability Distribution Function

Definition 2.3 (Distribution Function). Let X be a random variable. Given x, a real-valued function FX(·) defined as FX(x) = P[ω : X(ω) ≤ x] is called the distribution function (DF) of the random variable X.

Definition 2.4 (Cumulative distribution function (cdf)).

FX(x) = P[ω : X(ω) ≤ x] = P(X ≤ x) = PX(−∞, x] = PX[r : −∞ < r ≤ x]

FX(x2) − FX(x1) = PX(x1, x2]

Theorem 2.2 (Properties of Distribution Function).

1. limx→−∞ FX(x) = 0, limx→+∞ FX(x) = 1

2. For x1 ≤ x2, FX(x1) ≤ FX(x2) (Monotone and Non-decreasing)

3. lim0<h→0 FX(x + h) = FX(x) (Right Continuity)

Remarks. A distribution function is not necessarily left continuous.

Definition 2.5 (Discrete Random Variable). A random variable X is said to be discrete if the range of X is countable or if there exists E, a countable set, such that P(X ∈ E) = 1.

Definition 2.6 (Continuous Random Variable). A random variable X is said to be continuous if there exists a function fX(·) such that FX(x) = ∫_{−∞}^{x} fX(t)dt for every real number x.

Remarks: Another Characterization of a Continuous Random Variable

• Let FX(·) be the distribution function (DF) of a random variable X.

(a) The distribution function FX(·) is absolutely continuous if and only if there exists a non-negative function f such that

FX(x) = ∫_{−∞}^{x} f(t)dt ∀x ∈ R

(b) That is, a random variable X is a continuous random variable if and only if FX(·) is absolutely continuous.


Definition 2.7 (Continuity).

• A function f : X → Y is continuous at a point x0 ∈ X if, at x0, for any given ϵ > 0, ∃δ > 0 such that

ρ(x0, x) < δ ⇒ ρ′[ f (x0), f (x)] < ϵ

where ρ and ρ′ are metrics on X and Y.

• A function f is said to be continuous if it is continuous at each x ∈ X.

Definition 2.8 (Uniform Continuity).

• Let f : X → Y be a mapping from a metric space < X, ρ > to < Y, ρ′ >.

• We say that f is uniformly continuous if for any given ϵ > 0, ∃δ > 0 such that, for any x1, x2 ∈ X,

ρ(x1, x2) < δ ⇒ ρ′( f (x1), f (x2)) < ϵ.

Remarks

Uniformly continuous ⇒ Continuous

When f is defined on a compact set (a closed and bounded set in Rn), Continuous ⇒ Uniformly continuous.

Definition 2.9 (Absolute Continuity of a Function on Real Line).

• A real-valued function f defined on [a, b] is said to be absolutely continuous on [a, b] if, for any given ϵ > 0, ∃δ > 0 such that

∑_{i=1}^{k} (bi − ai) < δ ⇒ ∑_{i=1}^{k} | f(bi) − f(ai)| < ϵ

for pairwise disjoint intervals (ai, bi), i = 1, · · · , k, k being arbitrary.

Remarks

• Absolutely continuous ⇒ Uniformly continuous

• Uniformly continuous ⇏ Absolutely continuous


Definition 2.10 (Absolute Continuity of a Measure: P ≪ Q).

• Let P, Q be two σ-finite measures on F.

- For any given ϵ > 0, ∃δ > 0 s.t. Q(A) < δ ⇒ P(A) < ϵ.

- If Q(A) = 0 ⇒ P(A) = 0, ∀A ∈ F

⇒ P is absolutely continuous with respect to Q, denoted P ≪ Q.

Example 2.1.

• P(A) = ∫_A f dQ, A ∈ F

• FX(x) = ∫_{−∞}^{x} f(t)dt

Theorem 2.3 (Radon-Nikodym Theorem). Let P, Q be two σ-finite measures on F. If P ≪ Q, then there exists f ≥ 0 such that P(A) = ∫_A f dQ for any A ∈ F. We write f = dP/dQ and call it the Radon-Nikodym derivative.

Definition 2.11 (Probability Mass Function). If X is a discrete random variable with distinct values x1, x2, . . . , xk, then the function, denoted by fX(xi) = P[X = xi] = P[ω : X(ω) = xi], such that

• fX(xi) > 0 for x = xi, i = 1, . . . , k

• fX(x) = 0 for x ≠ xi

• ∑_i fX(xi) = 1

is said to be the probability mass function (pmf) of X.

Remarks

• Some other names of the p.m.f. are discrete density function, discrete frequency function, and probability function.

• Note that fX(xi) = FX(xi) − FX(xi−1) (for ordered values x1 < x2 < · · · ).

Definition 2.12 (Probability Density Function). If X is a continuous random variable, then the function fX(·) such that FX(x) = ∫_{−∞}^{x} fX(t)dt is called the probability density function of X.

• fX(x) ≥ 0, ∀x

• ∫_{−∞}^{∞} fX(x)dx = 1


Remarks

• Some other names of the p.d.f. are density function, continuous density function, and integrating density function.

• P[X = x] = 0 for any single value x

• fX(x) = dFX(x)/dx

• P(a < X ≤ b) = F(b) − F(a) = ∫_{a}^{b} f(x)dx
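The relation P(a < X ≤ b) = F(b) − F(a) = ∫_{a}^{b} f(x)dx can be checked numerically for any density with a known cdf. The Python sketch below (illustrative, not from the notes; it assumes an Exponential density with rate lam) compares the cdf difference with a crude Riemann sum of the pdf.

import math

lam, a, b = 2.0, 0.5, 1.5
F = lambda x: 1.0 - math.exp(-lam * x)           # cdf of Exponential(lam)
f = lambda x: lam * math.exp(-lam * x)           # pdf of Exponential(lam)

# midpoint Riemann sum for the integral of f over (a, b]
n = 100_000
dx = (b - a) / n
integral = sum(f(a + (i + 0.5) * dx) for i in range(n)) * dx

print(F(b) - F(a))                               # about 0.31809
print(integral)                                  # agrees to several decimal places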

Remarks: Decomposition of a Distribution Function

• Any cdf F(x) may be represented in the form of a mixed distribution:

FX(x) = p1 FX^D(x) + p2 FX^C(x), where pi ≥ 0, i = 1, 2, p1 + p2 = 1, D: discrete, C: continuous.

Theorem 2.4 (Function of a Random Variable). Let X be a random variable and g be a Borel measurable function. Then, Y = g(X) is also a random variable.

Proof. It suffices to show that {Y ≤ y} ∈ F to see that Y = g(X) is a random variable. That is, {Y ≤ y} = {g(X) ≤ y} = {ω : X(ω) ∈ g−1((−∞, y])} ∈ F.

2.3 Expectation and Moments

Definition 2.13 (Expected Value). Let X be a random variable. Then, we define E(X) as the expected value, (mathematical) expectation, or mean of X.

1. Continuous random variable ⇒ E(X) = ∫ x f(x)dx

2. Discrete random variable ⇒ E(X) = ∑_i xi fi

Definition 2.14 (Expectation of a Function of a Random Variable). Let Y = g(X) be a random variable. Suppose that ∫ |g(x)| f(x)dx < ∞. Then, we define E[Y] = E[g(X)] = ∫ g(x) f(x)dx = ∫ y f(y)dy.

Theorem 2.5 (Preservation of Monotonicity). Let E[gi(X)] be the expectation of a real-valued function gi of X. Suppose that E(|gi(X)|) = ∫ |gi(x)| f(x)dx < ∞. If g1(x) ≤ g2(x) for all x, then E[g1(X)] ≤ E[g2(X)].

Proof.

Suppose that g1(x) ≤ g2(x) for all x.


Then, E[g1(X)] − E[g2(X)] = ∫ g1(x) f(x)dx − ∫ g2(x) f(x)dx = ∫ [g1(x) − g2(x)] f(x)dx ≤ 0.

Remarks

• Suppose that g1(x) ≤ g2(x) for almost every x and | g1 |< ∞ and| g2 |< ∞. Then, P[ω : g1(X(ω)) ≤ g2(X(ω)) = 1.

• That is, A = ω : g1(x) ≤ g2(x) with P(A) = 1 and Ac = ω :g1(x) > g2(x) with P(Ac) = 0

• Finally, E[g1(X) − g2(X)] =∫

A[g1(x) − g2(x)] f (x)dx +∫

Ac [g1(x) −g2(x)] f (x)dx ≤ 0.

Theorem 2.6 (Properties of Expectation).

1. When c is constant, E(c) = c

2. E(cX) = cE(X) (cf. E(XY | X) = XE(Y | X))

3. Linear Operator: E(X + Y) = E(X) + E(Y)

4. If X ⊥ Y, then E(XY) = E(X)E(Y)

Proof.

1. ∫ c f(x)dx = c ∫ f(x)dx = c · 1 = c

2. Trivial.

3. E(X + Y) = ∫∫ (x + y) f(x, y)dxdy = ∫∫ x f(x, y)dxdy + ∫∫ y f(x, y)dxdy

= ∫ x [∫ f(x, y)dy]dx + ∫ y [∫ f(x, y)dx]dy = ∫ x f(x)dx + ∫ y f(y)dy = E(X) + E(Y)

4. It is trivial when we use f(x, y) = f(x) f(y).

Definition 2.15 (Moments).

• rth moment of X ⇒ mr = µ′r = E(X^r) = ∫ x^r f(x)dx

• rth central moment of X ⇒ µr = E[(X − E(X))^r] = ∫ (x − E(X))^r f(x)dx


Example 2.2.

1. E(X) = ∑_i xi fi (cf. the sample mean X̄ = (1/n) ∑_i xi)

2. Var(X) = E[(X − E(X))2]

3. Skewness = E[(X − E(X))3]

4. Kurtosis = E[(X − E(X))4]

Definition 2.16 (Moment Generating Function). For a continuous random variable X,

• MX(t) = E[e^{tX}] = ∫ e^{tx} f(x)dx for −h < t < h, for some small h > 0

• dMX(t)/dt = ∫ x e^{tx} f(x)dx

• d^r MX(t)/dt^r = ∫ x^r e^{tx} f(x)dx

• µ′r = E[X^r] = d^r MX(t)/dt^r |_{t=0}

For a discrete random variable X,

• MX(t) = E[e^{tX}] = ∑_i e^{t xi} f(xi), where e^x = ∑_{i=0}^{∞} x^i / i!

• µ′r = E[X^r] = d^r MX(t)/dt^r |_{t=0}
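Differentiating an mgf at t = 0 reproduces the raw moments µ′r. The sympy sketch below (illustrative, not from the notes; it assumes the Exponential(lam) mgf MX(t) = lam/(lam − t) for t < lam) recovers the first three moments symbolically.

import sympy as sp

t, lam = sp.symbols("t lam", positive=True)
M = lam / (lam - t)                              # assumed mgf of an Exponential(lam) r.v.

for r in (1, 2, 3):
    moment = sp.simplify(sp.diff(M, t, r).subs(t, 0))   # mu'_r = d^r M / dt^r at t = 0
    print(r, moment)                             # 1/lam, 2/lam**2, 6/lam**3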

Theorem 2.7. For 0 < s < r, if E[| X |r] exists, then E[| X |s] < ∞.

Remarks

• There must exist h > 0 such that MX(t) = E[e^{tX}] = ∫ e^{tx} f(x)dx for −h < t < h.

• The moment generating function (mgf) does not always exist for a random variable X.

Example 2.3.

• Consider the r.v X having pdf f (x) = x−2 I[1,∞)(x).

⇒ If the mgf of X exists, then it is given by ∫_{1}^{∞} x^{−2} e^{tx} dx by the definition of the mgf. However, the integral diverges for any t > 0. In fact, E[X] = ∞.


• Cauchy distribution: t(1)

⇒ E[|X|] = ∞, so E[X] does not exist, and none of the moments exist.

Definition 2.17 (Characteristic Function).

• ϕX(t) = E[e^{itX}] = ∫ e^{itx} f(x)dx, where i = √−1

(cf. e^{iy} = cos(y) + i sin(y))

Remarks

• ϕX(t) ⇔ FX: the characteristic function exists for any random variable X.

• |e^{itx}| = |cos(tx) + i sin(tx)| = √(cos²(tx) + sin²(tx)) = 1

• d^r ϕX(t)/dt^r |_{t=0} = E[(iX)^r] = i^r µ′r

• MX(t) → mr

• FX(x) ⇔ mr for all r (if mr exists for every r)

2.4 Characteristics of Distribution

Location (Representative Value)

1. Expectation: µ = µ1 = E(X) = ∫ x f(x)dx

(a) E(c) = c

(b) E(cX) = cE(X)

(c) E(X + Y) = E(X) + E(Y)

(d) If X⊥Y, then E(XY) = E(X)E(Y).

2. αth-Quantile ξα: the smallest ξ such that FX(ξ) ≥ α

3. Median: 0.5th quantile

(a) m or Xmed such that P(X < m) ≤ 1/2 and P(X > m) ≤ 1/2

(b) In a symmetric distribution, E(X) = m.

4. Mode: Xmod

(a) A mode of a distribution of one random variable X is a valueof x that maximizes the pdf or pmf.

(b) There may be more than one mode. Also, there may be nomode at all.


Measures of Dispersion

1. Variance: µ2 = Var(X) = E[(X − µ)2]

(a) Var(c) = 0

(b) Var(cX) = c2Var(X)

(c) Var(a + bX) = b2Var(X)

2. Standard Deviation: SD(X) = √Var(X) (cf. SD(a + bX) = |b| SD(X))

3. Interquartile Range: ξ0.75 − ξ0.25

– This is useful for an asymmetric distribution.

Skewness

1. Skewness: µ3 = E[(X − µ)3]

(a) µ3 > 0: skewed to the right

(b) µ3 = 0: symmetric

(c) µ3 < 0: skewed to the left

2. Skewness Coefficient: unit-free measure

µ3/σ^3 = E[(X − µ)3] / (E[(X − µ)2])^{3/2}

Kurtosis

1. Kurtosis: µ4 = E[(X − µ)4] (the benchmarks below are stated for the kurtosis coefficient µ4/σ^4)

(a) µ4/σ^4 > 3: long tail (leptokurtic)

(b) µ4/σ^4 = 3: normal (mesokurtic)

(c) µ4/σ^4 < 3: short tail (platykurtic)

2. Kurtosis Coefficient: unit-free measure

µ4/σ^4 = E[(X − µ)4] / (E[(X − µ)2])^{4/2}


2.5 Inequalities

Theorem 2.8 (Markov Inequality). Let X be a random variable and g(·) a non-negative Borel measurable function. Then, for every k > 0,

P[g(X) ≥ k] ≤ E[g(X)] / k

Proof.

E[g(X)] = ∫ g(x) f(x)dx = ∫_{x:g(x)≥k} g(x) f(x)dx + ∫_{x:g(x)<k} g(x) f(x)dx

≥ ∫_{x:g(x)≥k} g(x) f(x)dx ≥ ∫_{x:g(x)≥k} k f(x)dx

= k ∫_{x:g(x)≥k} f(x)dx = k P[g(X) ≥ k]

Example 2.4.

• Apply the Markov inequality to g(X) = (X − µ)^2 and k = r^2 σX^2:

⇒ Chebyshev's inequality: P[(X − µ)^2 ≥ r^2 σX^2] ≤ 1/r^2, i.e. P[|X − µ| ≥ r σX] ≤ 1/r^2

• Other useful choices: g(X) = |X|, g(X) = |X|^α
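Chebyshev's inequality can be checked by simulation. The Python sketch below (illustrative, not from the notes; standard normal draws) compares the empirical frequency of |X − µ| ≥ rσ with the bound 1/r^2.

import random
import statistics

random.seed(1)
n, r = 200_000, 2.0
xs = [random.gauss(0.0, 1.0) for _ in range(n)]
mu, sigma = statistics.fmean(xs), statistics.pstdev(xs)

freq = sum(abs(x - mu) >= r * sigma for x in xs) / n
print(freq, "<=", 1 / r**2)      # roughly 0.046 <= 0.25: the bound holds (loosely)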

Theorem 2.9 (Jensen's Inequality). Let X be a random variable with mean E[X], and let g(·) be a convex function. Then E[g(X)] ≥ g(E[X]).

Proof. Since g(x) is continuous and convex, there exists a line l satisfying l(x) ≤ g(x) and l(E[X]) = g(E[X]). By definition, l(x) goes through the point (E[X], g(E[X])) and we can let l(x) = a + bx. That is,

E[l(X)] = E[a + bX] = a + bE[X] = l(E[X])

⇒ g(E[X]) = l(E[X]) = E[l(X)] ≤ E[g(X)]

Theorem 2.10 (Hölder's Inequality). Let X, Y be two random variables and p, q be numbers such that p > 1, q > 1, 1/p + 1/q = 1. Then,

E[XY] ≤ E[|X|^p]^{1/p} E[|Y|^q]^{1/q}


Example 2.5.

Apply Hölder's inequality with p = q = 2:

E[XY] ≤ E[X^2]^{1/2} E[Y^2]^{1/2} (Cauchy-Schwarz inequality)

⇒ Cov(X, Y) ≤ √Var(X) √Var(Y) (cf. Cov(X, Y) = E[(X − µX)(Y − µY)])

∴ −1 ≤ ρXY = Cov(X, Y) / (√Var(X) √Var(Y)) ≤ 1

3 Joint and Conditional Distributions, Stochastic Independence and More Expectations

3.1 Joint Distribution

Definition 3.1 (n-dimensional Random Variable).

• Let X(ω) = (X1(ω), X2(ω), · · · , Xn(ω)) for ω ∈ Ω be an n-dimensional function defined on (Ω, F, P) into Rn.

• X(ω) is called an n-dimensional random variable if the inverse image of every n-dimensional interval I = {(x1, x2, · · · , xn) : −∞ < xi ≤ ai, ai ∈ R, i = 1, 2, · · · , n} in Rn is in F.

• i.e. X−1(I) = {ω : X1(ω) ≤ a1, · · · , Xn(ω) ≤ an} ∈ F.

Theorem 3.1 (Construction of an n-dimensional Random Variable). Let Xi, i = 1, · · · , n, each be a one-dimensional random variable. Then, X = (X1, · · · , Xn) is an n-dimensional random variable.

Definition 3.2 (Joint Cumulative Distribution Function). Let X be an n-dimensional random variable; X = (X1, · · · , Xn). Then, the joint cumulative distribution function of X is defined as

FX(x1, · · · , xn) = FX1,··· ,Xn(x1, · · · , xn) = P[ω : X1(ω) ≤ x1; · · · ; Xn(ω) ≤ xn]

for each (x1, · · · , xn) ∈ Rn

Theorem 3.2 (Properties of Joint Cumulative Distribution Function).

1. Non-decreasing with respect to all arguments x1, · · · , xn


2. Right continuous with respect to all arguments x1, · · · , xn

cf. lim_{0<h→0} F(x + h, y) = lim_{0<h→0} F(x, y + h) = F(x, y)

3. F(+∞, +∞) = 1, FXY(−∞, y) = FXY(x, −∞) = 0 for all x, y

4. F(x2, y2) − F(x2, y1) − F(x1, y2) + F(x1, y1) ≥ 0 (∵ P[x1 < X ≤ x2, y1 < Y ≤ y2] ≥ 0)

Definition 3.3 (Joint Probability Mass Function). Let X = (X1, X2, . . . , Xn) be a discrete random vector with distinct values a1, a2, . . . , ak ∈ Rn. Then the function, denoted by fX(ai) = P[X = ai], such that

• fX(x) > 0 for x = ai, i = 1, . . . , k

• fX(x) = 0 for x ≠ ai

• ∑_i fX(ai) = 1

is called the joint probability mass function of X.

Definition 3.4 (Joint Probability Density Function). Let X = (X1, X2, . . . , Xn) be a continuous random vector and FX1,...,Xn be its cumulative distribution function. Then, if there exists a function fX1,...,Xn such that

FX1,...,Xn(x1, x2, . . . , xn) = ∫_{−∞}^{x1} · · · ∫_{−∞}^{xn} f(t1, t2, . . . , tn)dt1 · · · dtn,

that function is called the joint probability density function of X.

Remarks

• f(x1, . . . , xn) ≥ 0, ∀(x1, . . . , xn)

• f(x1, . . . , xn) = ∂^n F(x1, . . . , xn) / ∂x1 · · · ∂xn

• ∫_{−∞}^{∞} · · · ∫_{−∞}^{∞} f(t1, t2, . . . , tn)dt1 · · · dtn = 1

3.2 Marginal Distribution

Definition 3.5 (Marginal Distribution). Let X, Y be two random variables. Then the marginal distributions of X and Y are:

FX(x) = FXY(x, +∞) = P[X ≤ x, Y < +∞]

FY(y) = FXY(+∞, y) = P[X < +∞, Y ≤ y]


Definition 3.6 (Marginal Probability Density Function). Let X, Y be two random variables and let fX,Y(x, y) be the joint pdf of X, Y. Then the marginal probability density functions of X and Y are:

• (Discrete case)

fX(xi) = ∑_j f(xi, yj)

fY(yj) = ∑_i f(xi, yj)

• (Continuous case)

fX(x) = ∫ f(x, y)dy

fY(y) = ∫ f(x, y)dx

3.3 Conditional Distribution

Definition 3.7 (Conditional Probability Distribution Function). Let X, Y be two random variables. Then the conditional distribution of X given Y is:

FX|Y(x | y) = P(X ≤ x | Y = y)

and the conditional density of X given Y is:

fX|Y(x | y) = ∂FX|Y(x | y) / ∂x (Continuous)

fX|Y(x | y) = P(X = x | Y = y) (Discrete)

FX|Y(x | y) = ∫_{−∞}^{x} f(u | y)du

Remarks

• FX|Y(x | y) = ∫_{−∞}^{x} [fX,Y(u, y) / fY(y)] du

• ∂FX|Y(x | y)/∂x = fX,Y(x, y) / fY(y)

Theorem 3.3 (Alternative Derivation of Conditional Density).

fX|Y(x | y) = fX,Y(x, y) / fY(y), if fY(y) > 0


Proof. First, consider discrete random variables X, Y. Let Ax = {ω : X(ω) = x}, By = {ω : Y(ω) = y}. Then we have,

fX|Y(x | y) = P(X = x | Y = y) = P(Ax | By) = P(Ax ∩ By) / P(By)

= P({ω : X(ω) = x, Y(ω) = y}) / P({ω : Y(ω) = y}) = fX,Y(x, y) / fY(y)

Next, consider continuous random variables X, Y. Let Ax = {ω : X(ω) ≤ x} and Bε = {ω : y − ε ≤ Y(ω) ≤ y + ε}, and define By = limε→0 Bε. Then we have,

FX|Y(x | y) = P(Ax | By) = limε→0 P({ω : X(ω) ≤ x, y − ε ≤ Y(ω) ≤ y + ε}) / P({ω : y − ε ≤ Y(ω) ≤ y + ε})

= [limε→0 (1/2ε) ∫_{y−ε}^{y+ε} ∫_{−∞}^{x} fX,Y(u, v)du dv] / [limε→0 (1/2ε) ∫_{y−ε}^{y+ε} fY(v)dv]

= ∫_{−∞}^{x} fX,Y(u, y)du / fY(y) = ∫_{−∞}^{x} [fX,Y(u, y) / fY(y)] du

Differentiating with respect to x gives fX|Y(x | y) = fX,Y(x, y) / fY(y).

3.4 Independence of Random Variables

Definition 3.8 (Independence of Random Variables). The random variables X and Y are said to be independent if

fX,Y(x, y) = fX(x) fY(y) (P(Ax ∩ By) = P(Ax)P(By))

Random variables that are not independent are said to be dependent.

Theorem 3.4. X and Y are independent if and only if

FX,Y(x, y) = FX(x)FY(y) ∀(x, y) ∈ R2

Proof.

⇐) By partial differentiation.

⇒) FX,Y(x, y) = P({ω : X(ω) ≤ x, Y(ω) ≤ y}) = P({ω : X(ω) ≤ x} ∩ {ω : Y(ω) ≤ y})

= P({ω : X(ω) ≤ x}) P({ω : Y(ω) ≤ y}) = FX(x)FY(y)


Definition 3.9 (Pairwise and Mutual Independence). Let X1, X2, · · · , Xn be random variables.

• X1, . . . , Xn are pairwise independent if Xi ⊥ Xj for all i, j = 1, 2, · · · , n, i ≠ j

• X1, . . . , Xn are mutually independent if for any collection of k of them, (Xi1, Xi2, . . . , Xik) ⊂ (X1, X2, . . . , Xn), k = 2, 3, . . . , n,

FXi1,··· ,Xik(xi1, · · · , xik) = ∏_{j=1}^{k} FXij(xij)

Theorem 3.5 (Preservation of Independence). Let X, Y be random variables and g1, g2 be Borel-measurable functions. If X ⊥ Y, then g1(X) ⊥ g2(Y).

Proof.

P(g1(X) ≤ x, g2(Y) ≤ y) = P(g1(X) ∈ (−∞, x], g2(Y) ∈ (−∞, y])

= P(X ∈ g1^{−1}((−∞, x]), Y ∈ g2^{−1}((−∞, y]))

= P(X ∈ g1^{−1}((−∞, x])) P(Y ∈ g2^{−1}((−∞, y]))

= P(g1(X) ∈ (−∞, x]) P(g2(Y) ∈ (−∞, y]) = P(g1(X) ≤ x) P(g2(Y) ≤ y)

Definition 3.10 (Identically Distributed Random Variables). Let X, Y be random variables. X and Y are identically distributed if FX(a) = FY(a) ∀a ∈ R, and we write X =d Y (equality in distribution).

Theorem 3.6. If Xi (i = 1, 2, · · · , n) are independent identically distributed,

FX1,··· ,Xn(x1, · · · , xn) = ∏_{i=1}^{n} FX(xi)

Definition 3.11 (Moment Generating Function of Joint Distribution). For a random vector X = (X1, X2, · · · , Xn)′, the moment generating function is

MX(t) = E[e^{t′X}] = E[e^{t1X1 + t2X2 + ··· + tnXn}] < ∞ for −hi < ti < hi (i = 1, 2, . . . , n; hi > 0)


Definition 3.12 (Cross Moments).

µ′_{r1,r2} = E[X1^{r1} X2^{r2}]: (r1, r2)th cross moment

µ_{r1,r2} = E[(X1 − µ1)^{r1} (X2 − µ2)^{r2}]: (r1, r2)th cross central moment

Remarks

µ′_{r1,r2} = ∂^{r1+r2} MX,Y(t1, t2) / ∂t1^{r1} ∂t2^{r2} |_{t1=t2=0}

i^{r1+r2} µ′_{r1,r2} = ∂^{r1+r2} ϕX,Y(t1, t2) / ∂t1^{r1} ∂t2^{r2} |_{t1=t2=0} (ϕX,Y: characteristic function)

Theorem 3.7. X1, X2, . . . , Xn are mutually independent if and only if

MX1,X2,··· ,Xn(t1, t2, · · · , tn) = MX1(t1)MX2(t2) · · · MXn(tn)

Theorem 3.8. Let X⊥Y and g1, g2 be Borel-measurable functions. Then,

E[g1(X)g2(Y)] = E[g1(X)]E[g2(Y)]

Remarks

• A trivial corollary of the theorem is that X⊥Y ⇒ Cov(X, Y) = 0

Theorem 3.9. Let X1, X2, . . . , Xn be random variables. Let S = ∑_{i=1}^{n} aiXi. Then,

Var(S) = ∑_{i=1}^{n} ai^2 Var(Xi) + ∑_{i≠j} ai aj Cov(Xi, Xj)

If X1, X2, . . . , Xn are independent,

Var(S) = ∑_{i=1}^{n} ai^2 Var(Xi)
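The variance formula can be verified by simulation even for dependent variables. The Python sketch below (illustrative, not from the notes; it builds X1 = Z1 and X2 = Z1 + Z2 from independent standard normals, and uses statistics.covariance, which requires Python 3.10+) compares Var(a1X1 + a2X2) with the right-hand side of Theorem 3.9; the exact value here is 2.

import random
import statistics

random.seed(2)
n = 200_000
z1 = [random.gauss(0, 1) for _ in range(n)]
z2 = [random.gauss(0, 1) for _ in range(n)]
x1, x2 = z1, [u + v for u, v in zip(z1, z2)]      # Cov(X1, X2) = 1, Var(X2) = 2
a1, a2 = 2.0, -1.0

s = [a1 * u + a2 * v for u, v in zip(x1, x2)]
lhs = statistics.variance(s)
rhs = (a1**2 * statistics.variance(x1) + a2**2 * statistics.variance(x2)
       + 2 * a1 * a2 * statistics.covariance(x1, x2))
print(lhs, rhs)                                   # both close to 2.0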


3.5 Conditional Expectation

Definition 3.13 (Conditional Expectation). Let X be an integrable random variable on (Ω, F, P) and suppose that G is a sub σ-field of F (G ⊂ F). Then there exists a random variable E[X|G], called the conditional expected value of X given G, with the following properties:

(1) E[X|G] is G-measurable and integrable.

(2) E[X|G] satisfies the functional equation

∫_G E[X|G] dP = ∫_G X dP, G ∈ G

Definition 3.14 (Conditional Mean). Let X, Y be random variables and h(·) be a Borel-measurable function. Then,

E[h(X)|Y = y] = ∑_i h(xi) f(xi|y) (Discrete)

E[h(X)|Y = y] = ∫ h(x) f(x|y)dx (Continuous)

Remarks

E[h(X)|Y] is also a random variable.

Theorem 3.10 (Properties of Conditional Expectation).

1. E[c|Y] = c, c: constant

2. For h1(·), h2(·), Borel-measurable functions,

E[c1h1(X) + c2h2(X)|Y] = c1E[h1(X)|Y] + c2E[h2(X)|Y]

3. P[X ≥ 0] = 1 ⇒ E[X|Y] ≥ 0

4. P[X1 ≥ X2] = 1 ⇒ E[X1|Y] ≥ E[X2|Y]

5. ϕ(·): A function of X, Y ⇒ E[ϕ(X, Y)|Y = y] = E[ϕ(X, y)|Y = y]

6. Ψ(·): A Borel-measurable function ⇒ E[Ψ(X)ϕ(X, Y)|X] = Ψ(X)E[ϕ(X, Y)|X]


Theorem 3.11 (Law of Iterated Expectations). Let X, Y be random variables and let E[h(X)] exist. Then,

E[E[h(X)|Y]] = E[h(X)]

Proof.

E[h(X)] = ∫_{−∞}^{∞} ∫_{−∞}^{∞} h(x) fX,Y(x, y)dx dy

= ∫_{−∞}^{∞} [∫_{−∞}^{∞} h(x) (fX,Y(x, y)/fY(y)) dx] fY(y)dy

= ∫_{−∞}^{∞} E[h(X)|y] fY(y)dy = E[E[h(X)|Y]]

Definition 3.15 (Conditional Variance). Let X, Y be random variables and E[X|Y] be the conditional expectation of X given Y. Then,

Var(X|Y) = E[(X − E[X|Y])2|Y]

Theorem 3.12. Let X, Y be random variables with finite variances. Then,

1. Var(X|Y) = E[X2|Y] − (E[X|Y])2

2. Var(X) = E[Var(X|Y)] + Var(E[X|Y])

Proof.

1. E[(X − E[X|Y])2|Y] = E[X2 − 2XE[X|Y] + (E[X|Y])2 | Y]
= E[X2|Y] − 2E[XE[X|Y] | Y] + E[(E[X|Y])2 | Y]
= E[X2|Y] − 2(E[X|Y])2 + (E[X|Y])2 = E[X2|Y] − (E[X|Y])2

2. E[Var(X|Y)] = E[E[X2|Y] − (E[X|Y])2]
= E[X2] − (E[X])2 − (E[(E[X|Y])2] − (E[X])2)
= Var(X) − Var(E[X|Y])
∴ Var(X) = E[Var(X|Y)] + Var(E[X|Y])
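The decomposition Var(X) = E[Var(X|Y)] + Var(E[X|Y]) can also be checked by simulation. In the Python sketch below (illustrative, not from the notes; it assumes Y ~ Uniform(0, 3) and X | Y ~ Normal(Y, 1)), E[X|Y] = Y and Var(X|Y) = 1, so both sides should be close to 1 + Var(Y) = 1.75.

import random
import statistics

random.seed(3)
n = 200_000
ys = [random.uniform(0, 3) for _ in range(n)]
xs = [random.gauss(y, 1.0) for y in ys]

var_x = statistics.variance(xs)                   # Var(X)
rhs = 1.0 + statistics.variance(ys)               # E[Var(X|Y)] + Var(E[X|Y])
print(var_x, rhs)                                 # both approximately 1.75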
