Stat 340 Course Notes Spring 2010

Generously Funded by MEF

Contributions: Riley Metzger, Michaelangelo Finistauri. Special Thanks: Without the following people and groups, these course notes would never have been completed: MEF, Don McLeish.

Table of Contents

Chapter 1: Probability (exercises and solutions)
Chapter 2: Statistics (exercises and solutions)
Chapter 3: Validation (exercises and solutions)
Chapter 4: Queuing Systems (exercises and solutions)
Chapter 5: Generating Random Variables (exercises and solutions)
Chapter 6: Variance Reduction (exercises and solutions)
Statistical Tables

Chapter 1

Probability

Three approaches to defining probability are:

1. The classical definition: Let the sample space (denoted by S) be the set of all possible distinct outcomes of an experiment. The probability of some event is

   (number of ways the event can occur) / (number of outcomes in S),

   provided all points in S are equally likely. For example, when a die is rolled, the probability of getting a 2 is 1/6, because one of the six faces is a 2.

2. The relative frequency definition: The probability of an event is the proportion (or fraction) of times the event occurs over a very long (theoretically infinite) series of repetitions of an experiment or process. For example, this definition could be used to argue that the probability of getting a 2 from a rolled die is 1/6. On the other hand, if we roll the die 100 times but get a 2 on 30 of those rolls, we may suspect that the probability of getting a 2 is 1/3.

3. The subjective probability definition: The probability of an event is a measure of how sure the person making the statement is that the event will happen. For example, after considering all available data, a weather forecaster might say that the probability of rain today is 30% or 0.3.

Unfortunately, all three of these definitions have serious limitations.


1.1 Sample Spaces and Probability

Consider a phenomenon or process which is repeatable, at least in theory, and suppose that certain events (outcomes) A1, A2, A3, ... are defined. We will often refer to this phenomenon or process as an "experiment," and refer to a single repetition of the experiment as a "trial." The probability of an event A, denoted by P(A), is a number between 0 and 1.

Definition 1. A sample space S is a set of distinct outcomes for an experiment or process, with the property that in a single trial one and only one of these outcomes occurs. The outcomes that make up the sample space are called sample points.

Definition 2. Let S = {A1, A2, A3, ...} be a discrete sample space. Then probabilities P(Ai) are numbers attached to the Ai's (i = 1, 2, 3, ...) such that the following two conditions hold:

(1) 0 ≤ P(Ai) ≤ 1

(2) Σ_i P(Ai) = 1

The set of values {P(Ai), i = 1, 2, ...} is called a probability distribution on S.

Definition 3. An event in a discrete sample space is a subset A ⊆ S. If the event contains only one point, e.g. A1 = {A1}, we call it a simple event. An event A made up of two or more simple events, such as A = {A1, A2}, is called a compound event.

De�nition 4. The probability P(A) of an event A is the sum of probabilitiesfor all the simple events that make up A.

Example: Suppose a 6-sided die is rolled, and let the sample space be S = {1, 2, 3, 4, 5, 6}, where 1 means the number 1 occurs, and so on. If the die is an ordinary one, we would find it useful to define probabilities as

P(i) = 1/6 for i = 1, 2, 3, 4, 5, 6,

because if the die were tossed repeatedly (as in some games or gambling situations), then each number would occur close to 1/6 of the time. However, if the die were weighted so that one face was favoured over the others, these numerical values would not be so useful.

Note that if we wish to consider a compound event, the probability is easily obtained. For example, if A = "even number", then because A = {2, 4, 6} we get P(A) = P(2) + P(4) + P(6) = 1/2.
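To make Definition 4 concrete, here is a minimal Python sketch (Python, and the names die_pf and A, are our own additions and not part of the original notes) that stores the die's probability function and sums it over a compound event.

```python
# Sketch: the die's probability function as a dict, and P(A) for A = "even number".
die_pf = {i: 1/6 for i in range(1, 7)}          # P(i) = 1/6 for i = 1, ..., 6

# sanity checks from Definition 2: 0 <= f(x) <= 1 and the probabilities sum to 1
assert all(0 <= p <= 1 for p in die_pf.values())
assert abs(sum(die_pf.values()) - 1) < 1e-12

A = {2, 4, 6}                                   # compound event "even number"
P_A = sum(die_pf[x] for x in A)                 # P(A) = P(2) + P(4) + P(6)
print(P_A)                                      # 0.5
```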


1.2 Conditional Probability

It is often important to know how the outcome of one event will affect the outcome of another. Consider flipping a coin: if we know that the first flip is a head, how does this affect the outcome of a second flip? Logically this should have no effect. Now, suppose we are interested in the values of two rolled dice (assuming each die is a standard six-faced die with faces numbered 1 through 6, hereafter referred to as a D6). If the total sum of the dice is 5 and we know the value of one of the dice, then how does this affect the outcome of the second die? This is the basic concept behind conditional probability.

1.2.1 Definition

Given two events A and B, and given that B occurs, the conditional probability of A given B (written A|B) is defined as:

P(A|B) = P(A ∩ B) / P(B)

Recall that A ∩ B denotes the intersection of events A and B (hereafter, this will be shortened to AB).

Note: If A and B are independent, then P(AB) = P(A)P(B), so

P(A|B) = P(A)P(B) / P(B) = P(A).

1.2.2 Example

Consider the first example: the coin. Assume that the coin is fair, which is to say that the probability of either outcome is exactly 1/2. Let A be the event that the first toss lands heads, and let B be the event that the second toss lands tails. Then the conditional probability of B given A is:

P(B|A) = P(B ∩ A) / P(A) = (1/4) / (1/2) = 1/2

Notice that the conditional probability of B given A is the same as the probability of B. This is not always the case.
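As a quick sanity check, here is a small simulation sketch (standard-library Python only; not part of the original notes) that estimates P(B|A) for two fair coin tosses and should come out close to P(B) = 1/2.

```python
# Estimate P(B|A) where A = "first toss is heads", B = "second toss is tails".
import random

random.seed(1)
n = 100_000
count_A, count_AB = 0, 0
for _ in range(n):
    first = random.choice("HT")
    second = random.choice("HT")
    if first == "H":
        count_A += 1
        if second == "T":
            count_AB += 1

print(count_AB / count_A)   # estimate of P(AB)/P(A); roughly 0.5
```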


1.3 Random Variables

There is a far more intuitive and useful representation of probabilistic events: random variables (r.v.). A random variable can be thought of as an unknown numeric value which is determined by chance.

Definition 5. A random variable is a function that assigns a real number to each point in a sample space S.

Example 1. From the previous examples, we can let X be the random variable representing the outcome of a coin toss. We could assign X the value 1 if the coin comes up heads, and 0 otherwise. We can even have a sequence of n random variables for a series of n coin tosses. In such a case, X might be the number of heads in the n coin tosses.

Random variables come in three main types: discrete, continuous, and mixed (a combination of discrete and continuous). Which of these categories a random variable falls into depends on its support. Both the coin and dice examples have a discrete support. Time, height, and temperature are examples where the support may indicate a continuous random variable.

1.4 Discrete Random Variables

A discrete random variable is one whose sample space is finite or countably infinite. Common sample spaces are (proper or improper) subsets of the integers.

Definition 6. The probability function (p.f.) of a random variable X is the function

f(x) = P(X = x), defined for all x ∈ A,

where A is the sample space (support) of the random variable. The set of pairs {(x, f(x)) : x ∈ A} is called the probability distribution of X. All probability functions must have two properties:

1. f(x) ≥ 0 for all values of x (i.e. for x ∈ A)

2. Σ_{all x ∈ A} f(x) = 1

By implication, these properties ensure that f(x) ≤ 1 for all x. We consider a few "toy" examples before dealing with more complicated problems.


1.5 Expectation, Averages, Variability

Definition 7. The expected value (also called the mean or the expectation) of a discrete random variable X with probability function f(x) is

E(X) = Σ_{all x} x f(x).

The expected value of X is also often denoted by the Greek letter μ. The expected value of X can be thought of as the average of the X-values that would occur in an infinite series of repetitions of the process where X is defined.

Notes:

(1) You can interpret E[g(X)] as the average value of g(X) in an infinite series of repetitions of the process where X is defined.

(2) E[g(X)] is also known as the expected value of g(X). This name is somewhat misleading, since the average value of g(X) may be a value which g(X) never takes (hence unexpected!).

(3) The case where g(X) = X reduces to our earlier definition of E(X).

Theorem 1. Suppose the random variable X has probability function f(x). Then the expected value of some function g(X) of X is given by

E[g(X)] = Σ_{all x} g(x) f(x)

Properties of Expectation:

If your linear algebra is good, it may help to think of E as a linear operator. Otherwise, you'll have to remember these and subsequent properties.

1. For constants a and b,

   E[a g(X) + b] = a E[g(X)] + b

   Proof:
   E[a g(X) + b] = Σ_{all x} [a g(x) + b] f(x)
                 = Σ_{all x} [a g(x) f(x) + b f(x)]
                 = a Σ_{all x} g(x) f(x) + b Σ_{all x} f(x)
                 = a E[g(X)] + b     (since Σ_{all x} f(x) = 1)

2. For constants a and b and functions g1 and g2, it is also easy to show that

   E[a g1(X) + b g2(X)] = a E[g1(X)] + b E[g2(X)]

Variability:


While an average is a useful summary of a set of observations, or of a probability distribution, it omits another important piece of information, namely the amount of variability. For example, it would be possible for car doors to be the right width on average and still have no doors fit properly; in the case of fitting car doors, we would also want the door widths to all be close to this correct average. We give a way of measuring the amount of variability next. You might think we could use the average difference between X and μ to indicate the amount of variation. In terms of expectation, this would be E(X − μ). However, E(X − μ) = E(X) − μ (since μ is a constant) = 0. We soon realize that to measure variability we need a function that has the same sign for X > μ as for X < μ. We now define:

Definition 8. The variance of a r.v. X is E[(X − μ)²], and is denoted by σ² or by Var(X).

In words, the variance is the average squared distance from the mean. This turns out to be a very useful measure of the variability of X.

Example: Let X be the number of heads when a fair coin is tossed 4 times. Then X ~ Binomial(4, 1/2), and so μ = np = (4)(1/2) = 2. Without doing any calculations, we know σ² ≤ 4, because X is always between 0 and 4 and hence can never be further away from μ than 2; this makes the average squared distance from μ at most 4. Since f(x) = C(4, x)(1/2)^x (1/2)^{4−x} = C(4, x)(1/2)^4, where C(n, x) denotes the binomial coefficient "n choose x", the values of f(x) are

x      0     1     2     3     4
f(x)  1/16  4/16  6/16  4/16  1/16

The value of Var(X) (i.e. σ²) is easily found here:

σ² = E[(X − μ)²] = Σ_{x=0}^{4} (x − μ)² f(x)
   = (0 − 2)²(1/16) + (1 − 2)²(4/16) + (2 − 2)²(6/16) + (3 − 2)²(4/16) + (4 − 2)²(1/16)
   = 1

Definition 9. The standard deviation of a random variable X is σ = √(E[(X − μ)²]).

Both variance and standard deviation are commonly used to measure variability. The basic definition of variance is often awkward to use for mathematical calculation of σ², whereas the following two results are often useful:

(1) σ² = E(X²) − μ²

(2) σ² = E[X(X − 1)] + μ − μ²
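The Binomial(4, 1/2) example above is easy to check numerically. The following Python sketch (our own addition; only the standard library is used) computes σ² both from the definition and from shortcut (1).

```python
# Check sigma^2 = 1 for X ~ Binomial(4, 1/2) two ways.
from math import comb

n, p = 4, 0.5
f = {x: comb(n, x) * p**x * (1 - p)**(n - x) for x in range(n + 1)}

mu = sum(x * f[x] for x in f)                      # E(X) = np = 2
var_def = sum((x - mu) ** 2 * f[x] for x in f)     # E[(X - mu)^2]
var_short = sum(x**2 * f[x] for x in f) - mu**2    # E(X^2) - mu^2

print(mu, var_def, var_short)                      # 2.0 1.0 1.0
```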


Properties of Mean and Variance

If a and b are constants and Y = aX + b, then

μ_Y = a μ_X + b   and   σ²_Y = a² σ²_X

(where μ_X and σ²_X are the mean and variance of X, and μ_Y and σ²_Y are the mean and variance of Y). The proof of this is left to the reader as an exercise.

1.6 Moment Generating Functions

We have now seen two functions which characterize a distribution: the probability function and the cumulative distribution function. There is a third type of function, the moment generating function, which uniquely determines a distribution. The moment generating function is closely related to other transforms used in mathematics: the Laplace and Fourier transforms.

Definition 10. Consider a discrete random variable X with probability function f(x). The moment generating function (m.g.f.) of X is defined as

M(t) = E(e^{tX}) = Σ_x e^{tx} f(x).

We will assume that the moment generating function is defined and finite for values of t in an interval around 0 (i.e. for some a > 0, Σ_x e^{tx} f(x) < ∞ for all t ∈ [−a, a]).

The moments of a random variable X are the expectations of the functions X^r for r = 1, 2, .... The expected value E(X^r) is called the r-th moment of X. The mean μ = E(X) is therefore the first moment, E(X²) is the second, and so on. It is often easy to find the moments of a probability distribution mathematically by using the moment generating function; this often gives easier derivations of means and variances than the direct summation methods in the preceding section. The following theorem gives a useful property of m.g.f.'s.

Theorem 2. Let the random variable X have m.g.f. M(t). Then

E(X^r) = M^{(r)}(0),   r = 1, 2, ...

where M^{(r)}(0) stands for d^r M(t)/dt^r evaluated at t = 0.

Proof: M(t) = Σ_x e^{tx} f(x), and if the sum converges, then

M^{(r)}(t) = d^r/dt^r Σ_x e^{tx} f(x)
           = Σ_x d^r/dt^r (e^{tx}) f(x)
           = Σ_x x^r e^{tx} f(x)


Therefore M^{(r)}(0) = Σ_x x^r f(x) = E(X^r), as stated.

This sometimes gives a simple way to find the moments of a distribution.

Example 1. Suppose X has a Binomial(n, p) distribution. Then its moment generating function is

M(t) = Σ_{x=0}^{n} e^{tx} C(n, x) p^x (1 − p)^{n−x}
     = Σ_{x=0}^{n} C(n, x) (pe^t)^x (1 − p)^{n−x}
     = (pe^t + 1 − p)^n

Therefore

M'(t)  = n pe^t (pe^t + 1 − p)^{n−1}
M''(t) = n pe^t (pe^t + 1 − p)^{n−1} + n(n − 1) p² e^{2t} (pe^t + 1 − p)^{n−2}

and so

E[X]  = M'(0)  = np
E[X²] = M''(0) = np + n(n − 1)p²
Var(X) = E(X²) − E(X)² = np(1 − p)
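The differentiation in this example can be verified symbolically. Below is a sketch using sympy (an assumption on our part; the notes do not use any computer algebra system) that recovers E[X] and Var(X) from the m.g.f.

```python
# Symbolic check of the binomial m.g.f. moments.
import sympy as sp

t, p, n = sp.symbols('t p n', positive=True)
M = (p * sp.exp(t) + 1 - p) ** n               # (p e^t + 1 - p)^n

EX = sp.simplify(sp.diff(M, t).subs(t, 0))     # first moment, should be n*p
EX2 = sp.simplify(sp.diff(M, t, 2).subs(t, 0)) # second moment
var = sp.simplify(EX2 - EX**2)                 # should equal n*p*(1 - p)

print(EX, var)   # n*p and n*p*(1 - p), possibly in an equivalent algebraic form
```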

1.7 Discrete Distributions

1.7.1 Discrete Uniform

The discrete uniform is used when every outcome is equiprobable, such as with fair dice, coins, and simple random sampling (a surveying method). This is the simplest discrete random variable. Let X ~ U(a, b), where the parameters a and b are the integer minimum and maximum of the support respectively (the support is a set of integers). For coin tosses a = 0 and b = 1, and for dice a = 1 and b = 6. For the discrete uniform, the sample space is the set of all integers between a and b inclusive. The discrete uniform has the following properties:

f(x) = 1 / (b − a + 1),   x ∈ S
F(x) = (x − a + 1) / (b − a + 1),   x ∈ S
E[X] = (a + b) / 2
Var(X) = ((b − a + 1)² − 1) / 12


1.7.2 Bernoulli

Given a trial which results in either a success or a failure (like flipping a coin; we can consider a head a success and a tail a failure), we can use a random variable to represent the outcome. This situation is modelled by a Bernoulli random variable, named after the scientist Jacob Bernoulli. This is also known as a Bernoulli trial with probability of success p and probability of failure q = 1 − p. This random variable has only one parameter, p, the probability of a success. The support is simply 0 and 1 (i.e. failure or success respectively).

f(x) = q if x = 0;  p if x = 1
F(x) = q if x = 0;  1 if x = 1
E[X] = p
Var(X) = p(1 − p)

1.7.3 Binomial

When there are n independent and identically distributed Bernoulli random variables and the value of interest is the number of successes (e.g. the number of heads obtained after flipping a coin a certain number of times), this can be described by a binomial random variable. The binomial random variable has two parameters: the number of trials, n, and the probability of success, p. The sample space is the number of possible successes in n trials, x ∈ {0, 1, ..., n}.

f(x) = C(n, x) p^x (1 − p)^{n−x}
F(x) = Σ_{i=0}^{x} f(i)
E[X] = np
Var(X) = np(1 − p)

1.7.4 Geometric

Given a series of independent Bernoulli trials, if we are interested in the number of failures before the first success then we use a geometric random variable. This has one parameter, the probability of success, p. The sample space is the non-negative integers, which is intuitive: if the first trial is a success then 0 failures occurred; however, we can have up to infinitely many trials


until our first success. Especially if the coin rolls into a nearby storm drain.

f(x) = p(1 − p)^x
F(x) = 1 − (1 − p)^{x+1}
E[X] = (1 − p) / p
Var(X) = (1 − p) / p²

1.7.5 Negative Binomial

Consider the previous random variable, except we are now interested in the number of failures before the k-th success. This is the negative binomial distribution. It has two parameters: the number of desired successes, k, and the probability of success on each trial, p. The sample space for this random variable is again the set of non-negative integers.

f(x) = C(x + k − 1, k − 1) p^k (1 − p)^x
F(x) = Σ_{i=0}^{x} f(i)
E[X] = k(1 − p)/p
Var(X) = k(1 − p)/p²

1.7.6 Hypergeometric

Now assume that we have a population of size N which can be split into two subgroups, arbitrarily called A and B (a common example is an urn full of coloured balls). The subgroups have sizes M and N − M respectively. If we take a sample of size n from the population without replacement, the number of elements taken from subgroup A is the random variable of interest. Note that the two subgroups can usually be swapped in terms of notation without consequence. Consider the example of an urn full of N balls, of which M are blue and N − M are red. If we take a sample of n balls out of the urn, how many blue balls are there? This assumes that each ball is equally likely to be selected at each draw.

The parameters and support for this random variable have very specific restrictions. N, the population size, is a positive integer (preferably greater than 2). M, the size of a subpopulation (i.e. the number of blue balls), can be any integer in the set [0, N]. n, the number of items selected, is an integer in the set [0, N]. Now the sample space is a little tricky: the support of the sample


space is x ∈ {max(0, n + M − N), ..., min(M, n)}.

f(x) = C(M, x) C(N − M, n − x) / C(N, n)
F(x) = Σ_{i ≤ x} f(i)
E[X] = nM / N
Var(X) = n (M/N)(1 − M/N)(N − n)/(N − 1)

1.7.7 Poisson

The number of independent events that occur at a common rate λ over a fixed period of time t has a Poisson distribution. This random variable has the parameter λt, where λ is the rate of arrivals (a strictly positive real number) and t is the length of the time period of interest. The support for this random variable is the non-negative integers.

f(x) = (λt)^x e^{−λt} / x!
F(x) = e^{−λt} Σ_{i=0}^{x} (λt)^i / i!
E[X] = λt
Var(X) = λt
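For readers who want to evaluate these probability functions numerically, the sketch below uses scipy.stats (scipy is our assumption; the notes themselves use tables and hand calculation). Note that scipy's geometric distribution counts the trial on which the first success occurs (support 1, 2, ...), so it is shifted by one relative to the "failures before the first success" convention used here; nbinom and the others match the conventions above directly.

```python
# Evaluating the named discrete distributions with scipy.stats.
from scipy import stats

print(stats.binom.pmf(3, n=10, p=0.4))      # Binomial(10, 0.4): f(3)
print(stats.geom.pmf(4, p=0.2))             # P(first success on trial 4) = p(1-p)^3, i.e. f(3) above
print(stats.nbinom.pmf(5, 3, 0.2))          # Negative Binomial: 5 failures before the 3rd success
print(stats.hypergeom.pmf(2, 20, 7, 5))     # Hypergeometric with (notes' notation) N=20, M=7, n=5, x=2
print(stats.poisson.pmf(2, mu=1.5))         # Poisson with lambda*t = 1.5: f(2)
print(stats.poisson.cdf(2, mu=1.5))         # F(2)
```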

1.8 Discrete Multivariate Distributions

1.8.1 Joint Probability Functions:

First, suppose there are two random variables X and Y, and define the function

f(x, y) = P(X = x and Y = y) = P(X = x, Y = y).

We call f(x, y) the joint probability function of (X, Y). In general,

f(x1, x2, ..., xn) = P(X1 = x1 and X2 = x2 and ... and Xn = xn)

if there are n random variables X1, ..., Xn. The properties of a joint probability function are similar to those for a single variable; for two random variables we have f(x, y) ≥ 0 for all (x, y) and

Σ_{all (x,y)} f(x, y) = 1.


Example: Consider the following numerical example, where we show f(x, y) in a table.

f(x, y)     x = 0   x = 1   x = 2
  y = 1      .1      .2      .3
  y = 2      .2      .1      .1

For example, f(0, 2) = P(X = 0 and Y = 2) = 0.2. We can check that f(x, y) is a proper joint probability function, since f(x, y) ≥ 0 for all 6 combinations of (x, y) and the sum of these 6 probabilities is 1. When there are only a few values for X and Y, it is often easier to tabulate f(x, y) than to find a formula for it. We'll use this example below to illustrate other definitions for multivariate distributions, but first we give a short example where we need to find f(x, y).

Example: Suppose the range for (X, Y), which is the set of possible values (x, y), is the following: X can be 0, 1, 2, or 3, and Y can be 0 or 1. We'll see that not all 8 combinations (x, y) are possible in the following table of f(x, y) = P(X = x, Y = y).

f(x, y)     x = 0   x = 1   x = 2   x = 3
  y = 0      1/8     2/8     1/8      0
  y = 1       0      1/8     2/8     1/8

Note that the range or joint p.f. for (X, Y) is a little awkward to write down here in formulas, so we just use the table.

Marginal Distributions: We may be given a joint probability function involving more variables than we're interested in using. How can we eliminate variables that are not of interest? Look at the first example above: if we're only interested in X, and don't care what value Y takes, we can see that

P(X = 0) = P(X = 0, Y = 1) + P(X = 0, Y = 2),

so P(X = 0) = f(0, 1) + f(0, 2) = 0.3. Similarly,

P(X = 1) = f(1, 1) + f(1, 2) = .3   and
P(X = 2) = f(2, 1) + f(2, 2) = .4

The distribution of X obtained in this way from the joint distribution is called the marginal probability function of X:


x      0    1    2
f(x)  .3   .3   .4

In the same way, if we were only interested in Y, we obtain

P(Y = 1) = f(0, 1) + f(1, 1) + f(2, 1) = .6

since X can be 0, 1, or 2 when Y = 1. The marginal probability function of Y would be:

y      1    2
f(y)  .6   .4

We generally put a subscript on the f to indicate whether it is the marginal probability function for the first or second variable. So f1(1) would be P(X = 1) = .3, while f2(1) would be P(Y = 1) = .6. An alternative notation that you may see is fX(x) and fY(y). In general, to find f1(x) we add over all values of y with X = x, and to find f2(y) we add over all values of x with Y = y. Then

f1(x) = Σ_{all y} f(x, y)   and   f2(y) = Σ_{all x} f(x, y).

This reasoning can be extended beyond two variables. For example, with three variables (X1, X2, X3),

f1(x1) = Σ_{all (x2, x3)} f(x1, x2, x3)   and
f1,3(x1, x3) = Σ_{all x2} f(x1, x2, x3) = P(X1 = x1, X3 = x3),

where f1,3(x1, x3) is the marginal joint distribution of (X1, X3).

1.8.2 Conditional Probability Functions (described using Random Variables):

Again, we can extend a definition from events to random variables. For events A and B, recall that P(A|B) = P(AB)/P(B). Since P(X = x | Y = y) = P(X = x, Y = y)/P(Y = y), we make the following definition.


Definition 11. The conditional probability function of X given Y = y is

f(x|y) = f(x, y) / f2(y).

Similarly, f(y|x) = f(x, y) / f1(x) (provided, of course, the denominator is not zero).

In our first example, let us find f(x | Y = 1):

f(x | Y = 1) = f(x, 1) / f2(1).

This gives:

x               0             1             2
f(x | Y = 1)   .1/.6 = 1/6   .2/.6 = 1/3   .3/.6 = 1/2

As you would expect, marginal and conditional probability functions are probability functions in that they are always ≥ 0 and their sum is 1.
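The marginalizing and conditioning operations above are simple sums, so they translate directly into code. Here is a short Python sketch (our own addition; the dictionary name joint is illustrative) using the joint table from the first example.

```python
# Marginal and conditional probability functions from the joint table f(x, y).
joint = {(0, 1): .1, (1, 1): .2, (2, 1): .3,
         (0, 2): .2, (1, 2): .1, (2, 2): .1}

f1, f2 = {}, {}                       # marginals of X and Y
for (x, y), p in joint.items():
    f1[x] = f1.get(x, 0) + p
    f2[y] = f2.get(y, 0) + p

print(f1)                             # {0: 0.3, 1: 0.3, 2: 0.4}  (up to float rounding)
print(f2)                             # {1: 0.6, 2: 0.4}

# conditional probability function f(x | Y = 1) = f(x, 1) / f2(1)
cond = {x: joint[(x, 1)] / f2[1] for x in f1}
print(cond)                           # approximately {0: 1/6, 1: 1/3, 2: 1/2}
```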

1.9 Multinomial Distribution

There is only one multivariate model distribution introduced in this course, though other multivariate distributions exist. The multinomial distribution defined below is very important; it is a generalization of the binomial model to the case where each trial has k possible outcomes.

Physical Setup: This distribution is the same as the binomial except there are k outcomes rather than two. An experiment is repeated independently n times, with k distinct outcomes possible each time. Let the probabilities of these k outcomes be p1, p2, ..., pk on each repetition. Let X1 be the number of times the 1st outcome occurs, X2 the number of times the 2nd outcome occurs, ..., and Xk the number of times the k-th outcome occurs. Then (X1, X2, ..., Xk) has a multinomial distribution.

Notes:

(1) p1 + p2 + ... + pk = 1

(2) X1 + X2 + ... + Xk = n

If we wish, we can drop one of the variables (say the last), and just note that Xk equals n − X1 − X2 − ... − X_{k−1}.

Illustration:

1. Suppose student marks are given in letter grades: A, B, C, D, or F. Ina class of 80 students, the number of students getting A, B, ..., F mighthave a multinomial distribution with n = 80 and k = 5.


Joint Probability Function: The joint probability function of X1, ..., Xk is obtained by extending the binomial counting argument to general k. There are n! / (x1! x2! ... xk!) different outcomes of the n trials in which x1 are of the 1st type, x2 are of the 2nd type, etc. Each of these arrangements has probability p1^{x1} p2^{x2} ... pk^{xk}, since p1 is multiplied in x1 times in some order, etc.

Therefore f(x1, x2, ..., xk) = [n! / (x1! x2! ... xk!)] p1^{x1} p2^{x2} ... pk^{xk}

The restrictions on the xi's are xi = 0, 1, ..., n and Σ_{i=1}^{k} xi = n.

As a check that Σ f(x1, x2, ..., xk) = 1, we use the multinomial theorem to get

Σ [n! / (x1! x2! ... xk!)] p1^{x1} ... pk^{xk} = (p1 + p2 + ... + pk)^n = 1.

Here is another simple example.

Example: Every person has one of four blood types: A, B, AB, and O. (This is important in determining, for example, who may give a blood transfusion to a person.) In a large population, let the fractions that have type A, B, AB, and O, respectively, be p1, p2, p3, p4. Then, if n persons are randomly selected from the population, the numbers X1, X2, X3, X4 of types A, B, AB, O have a multinomial distribution with k = 4 (for Caucasian people, the values of the pi's are approximately p1 = .45, p2 = .08, p3 = .03, p4 = .44).

Note: We sometimes use the notation (X1, ..., Xk) ~ Mult(n; p1, ..., pk) to indicate that (X1, ..., Xk) has a multinomial distribution.
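The multinomial probability function is easy to evaluate in software. The sketch below (scipy assumed available; not part of the original notes) computes one such probability for the blood-type example with n = 10; the specific counts are just an illustration.

```python
# P(5 type A, 1 type B, 0 type AB, 4 type O) in a sample of n = 10 people.
from scipy import stats

p = [0.45, 0.08, 0.03, 0.44]
rv = stats.multinomial(n=10, p=p)
print(rv.pmf([5, 1, 0, 4]))   # = 10!/(5!1!0!4!) * 0.45^5 * 0.08^1 * 0.03^0 * 0.44^4
```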

1.10 Expectation for Multivariate Distributions: Covariance and Correlation

It is easy to extend the definition of expected value to multiple variables. Generalizing E[g(X)] = Σ_{all x} g(x) f(x) leads to the definition of expected value in the multivariate case.

Definition 12.

E[g(X, Y)] = Σ_{all (x,y)} g(x, y) f(x, y)

and

E[g(X1, X2, ..., Xn)] = Σ_{all (x1, x2, ..., xn)} g(x1, x2, ..., xn) f(x1, ..., xn)


As before, these represent the average values of g(X, Y) and g(X1, ..., Xn). E[g(X, Y)] could also be determined by finding the probability function fZ(z) of Z = g(X, Y) and then using the definition of expected value E(Z) = Σ_{all z} z fZ(z).

Example: Let the joint probability function f(x, y) be given by

f(x, y)     x = 0   x = 1   x = 2
  y = 1      .1      .2      .3
  y = 2      .2      .1      .1

Find E(XY) and E(X).

Solution:

E(XY) = Σ_{all (x,y)} x y f(x, y)
      = (0 · 1 · .1) + (1 · 1 · .2) + (2 · 1 · .3) + (0 · 2 · .2) + (1 · 2 · .1) + (2 · 2 · .1)
      = 1.4

To find E(X) we have a choice of methods. First, taking g(x, y) = x we get

E(X) = Σ_{all (x,y)} x f(x, y)
     = (0 · .1) + (1 · .2) + (2 · .3) + (0 · .2) + (1 · .1) + (2 · .1)
     = 1.1

Alternatively, since E(X) only involves X, we could find f1(x) and use

E(X) = Σ_{x=0}^{2} x f1(x) = (0 · .3) + (1 · .3) + (2 · .4) = 1.1

Property of Multivariate Expectation: It is easily proved (make sure you can do this) that

E[a g1(X, Y) + b g2(X, Y)] = a E[g1(X, Y)] + b E[g2(X, Y)]

This can be extended beyond 2 functions g1 and g2, and beyond 2 variables X and Y.


1.10.1 Relationships between Variables:

Definition 13. The covariance of X and Y, denoted Cov(X, Y) or σ_{XY}, is

Cov(X, Y) = E[(X − μ_X)(Y − μ_Y)]

For calculation purposes, it is easier to express the formula in the following form:

Cov(X, Y) = E[(X − μ_X)(Y − μ_Y)] = E(XY − μ_X Y − X μ_Y + μ_X μ_Y)
          = E(XY) − μ_X E(Y) − μ_Y E(X) + μ_X μ_Y
          = E(XY) − E(X)E(Y) − E(Y)E(X) + E(X)E(Y)

Therefore Cov(X, Y) = E(XY) − E(X)E(Y)

Example: In the example with joint probability function

f(x, y)     x = 0   x = 1   x = 2
  y = 1      .1      .2      .3
  y = 2      .2      .1      .1

find Cov(X, Y).

Solution: We previously calculated E(XY) = 1.4 and E(X) = 1.1. Similarly, E(Y) = (1 · .6) + (2 · .4) = 1.4.

Therefore Cov(X, Y) = 1.4 − (1.1)(1.4) = −.14
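The covariance shortcut Cov(X, Y) = E(XY) − E(X)E(Y) is a one-line computation from the joint table. Here is a Python sketch (illustrative names only; not from the notes) reproducing −0.14.

```python
# E(X), E(Y), E(XY) and Cov(X, Y) directly from the joint probability table.
joint = {(0, 1): .1, (1, 1): .2, (2, 1): .3,
         (0, 2): .2, (1, 2): .1, (2, 2): .1}

EX  = sum(x * p for (x, y), p in joint.items())
EY  = sum(y * p for (x, y), p in joint.items())
EXY = sum(x * y * p for (x, y), p in joint.items())
cov = EXY - EX * EY

print(EX, EY, EXY, cov)    # 1.1 1.4 1.4 -0.14 (up to floating-point rounding)
```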

Theorem 3. If X and Y are independent then Cov(X, Y) = 0.

Proof: Recall E(X − μ_X) = E(X) − μ_X = 0. Let X and Y be independent. Then f(x, y) = f1(x) f2(y), and

Cov(X, Y) = E[(X − μ_X)(Y − μ_Y)] = Σ_{all y} [ Σ_{all x} (x − μ_X)(y − μ_Y) f1(x) f2(y) ]
          = Σ_{all y} [ (y − μ_Y) f2(y) Σ_{all x} (x − μ_X) f1(x) ]
          = Σ_{all y} [ (y − μ_Y) f2(y) E(X − μ_X) ]
          = Σ_{all y} 0 = 0

The following theorem gives a more direct proof of the result above, and is useful in many other situations.

Theorem 4. Suppose random variables X and Y are independent. Then, if g1(X) and g2(Y) are any two functions,

E[g1(X) g2(Y)] = E[g1(X)] E[g2(Y)].


Using Theorem 4, if X and Y are independent then

Cov(X, Y) = E[(X − μ_X)(Y − μ_Y)] = E(X − μ_X) E(Y − μ_Y) = 0.

Caution: This result is not reversible. If Cov(X, Y) = 0 we cannot conclude that X and Y are independent.

Example: Let (X, Y) have the joint probability function f(0, 0) = 0.2, f(1, 1) = 0.6, f(2, 0) = 0.2; i.e. (X, Y) takes on only 3 values. The marginal probability functions are

x      0    1    2
f1(x)  .2   .6   .2

and

y      0    1
f2(y)  .4   .6

Since f1(x) f2(y) ≠ f(x, y), X and Y are not independent. However,

E(XY) = (0 · 0 · .2) + (1 · 1 · .6) + (2 · 0 · .2) = .6
E(X) = (0 · .2) + (1 · .6) + (2 · .2) = 1   and   E(Y) = (0 · .4) + (1 · .6) = .6

Therefore, Cov(X, Y) = E(XY) − E(X)E(Y) = .6 − (1)(.6) = 0.

So X and Y have covariance 0 but are not independent. If Cov(X, Y) = 0 we say that X and Y are uncorrelated, because of the definition of correlation given below.

Definition 14. The correlation coefficient of X and Y is ρ = Cov(X, Y) / (σ_X σ_Y).

The correlation coefficient measures the strength of the linear relationship between X and Y and is simply a rescaled version of the covariance, scaled to lie in the interval [−1, 1].

Properties of ρ:

1) Since σ_X and σ_Y, the standard deviations of X and Y, are both positive, ρ has the same sign as Cov(X, Y). Hence the interpretation of the sign of ρ is the same as for Cov(X, Y), and ρ = 0 if X and Y are independent. When ρ = 0 we say that X and Y are uncorrelated.

2) −1 ≤ ρ ≤ 1, and as ρ → ±1 the relationship between X and Y becomes one-to-one and linear.


1.11 Mean and Variance of a Linear Combination of Random Variables

Many problems require us to consider linear combinations of random variables; examples are given below. Although writing down the formulas is somewhat tedious, we give here some important results about their means and variances.

Results for Means:

1. E(aX + bY) = aE(X) + bE(Y) = aμ_X + bμ_Y, when a and b are constants. (This follows from the definition of expected value.) In particular, E(X + Y) = μ_X + μ_Y and E(X − Y) = μ_X − μ_Y.

2. Let the ai be constants (real numbers) and E(Xi) = μ_i. Then E(Σ ai Xi) = Σ ai μ_i. In particular, E(Σ Xi) = Σ E(Xi).

3. Let X1, X2, ..., Xn be random variables which have mean μ. (You can imagine these being some sample results from an experiment such as recording the number of occupants in cars travelling over a toll bridge.) The sample mean is X̄ = (Σ_{i=1}^{n} Xi) / n. Then E(X̄) = μ.

Results for Covariance:

1. Cov(X, X) = E[(X − μ_X)(X − μ_X)] = E[(X − μ)²] = Var(X)

2. Cov(aX + bY, cU + dV) = ac Cov(X, U) + ad Cov(X, V) + bc Cov(Y, U) + bd Cov(Y, V), where a, b, c, and d are constants.

This type of result can be generalized, but gets messy to write out.

Results for Variance:

1. Variance of a linear combination of random variables:

   Var(aX + bY) = a² Var(X) + b² Var(Y) + 2ab Cov(X, Y)

2. Variance of a sum of independent random variables: Let X and Y be independent. Since Cov(X, Y) = 0, result 1 gives

   Var(X + Y) = σ²_X + σ²_Y,

   i.e., for independent variables, the variance of a sum is the sum of the variances. Also note

   Var(X − Y) = σ²_X + (−1)² σ²_Y = σ²_X + σ²_Y,

   i.e., for independent variables, the variance of a difference is the sum of the variances.


3. Variance of a general linear combination of random variables: Let the ai be constants and Var(Xi) = σ²_i. Then

   Var(Σ ai Xi) = Σ a²_i σ²_i + 2 Σ_{i<j} ai aj Cov(Xi, Xj).

   This is a generalization of result 1 and can be proved using either of the methods used for 1.

4. Variance of a linear combination of independent random variables: Special cases of result 3 are:

   a) If X1, X2, ..., Xn are independent then Cov(Xi, Xj) = 0, so that

      Var(Σ ai Xi) = Σ a²_i σ²_i.

   b) If X1, X2, ..., Xn are independent and all have the same variance σ², then

      Var(X̄) = σ²/n.

Remark: This result is very important in probability and statistics. To recap, it says that if X1, ..., Xn are independent random variables with the same mean μ and variance σ², then the sample mean X̄ = (1/n) Σ_{i=1}^{n} Xi has

E(X̄) = μ
Var(X̄) = σ²/n
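Result 1 above is easy to check empirically. The following simulation sketch (numpy assumed; the variables X and Y and their distributions are our own illustrative choices) compares Var(aX + bY) with a²Var(X) + b²Var(Y) + 2ab Cov(X, Y) on simulated correlated data.

```python
# Empirical check of Var(aX + bY) = a^2 Var(X) + b^2 Var(Y) + 2ab Cov(X, Y).
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
X = rng.normal(0, 2, size=n)             # Var(X) = 4
Y = X + rng.normal(0, 1, size=n)         # Var(Y) = 5, Cov(X, Y) = 4
a, b = 3.0, -1.0

lhs = np.var(a * X + b * Y)
cov_xy = np.mean(X * Y) - X.mean() * Y.mean()
rhs = a**2 * X.var() + b**2 * Y.var() + 2 * a * b * cov_xy
print(lhs, rhs)                          # both close to 9*4 + 1*5 - 24 = 17
```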

1.11.1 Indicator Variables

The results for linear combinations of random variables provide a way of breaking up more complicated problems involving means and variances into simpler pieces, using indicator variables. An indicator variable is just a binary variable (0 or 1) that indicates whether or not some event occurs. We'll illustrate this important method with an example.

Example: We have N letters to N different people, and N envelopes addressed to those N people. One letter is put in each envelope at random. Find the mean and variance of the number of letters placed in the right envelope.

Solution:

Let Xi = 0 if letter i is not in envelope i, and Xi = 1 if letter i is in envelope i.

Then Σ_{i=1}^{N} Xi is the number of correctly placed letters. Once again, the Xi's are dependent (Why?).


First, E(Xi) = Σ_{xi=0}^{1} xi f(xi) = f(1) = 1/N = E(X²_i) (since there is 1 chance in N that letter i will be put in envelope i), and then

Var(Xi) = E(X²_i) − [E(Xi)]² = 1/N − 1/N² = (1/N)(1 − 1/N)

Exercise: Before calculating Cov(Xi, Xj), what sign do you expect it to have? (If letter i is correctly placed, does that make it more or less likely that letter j will be placed correctly?)

Next, E(Xi Xj) = f(1, 1) (this is the only non-zero term in the sum, since Xi Xj = 0 unless both indicators equal 1). Now, f(1, 1) = (1/N) · 1/(N − 1), since once letter i is correctly placed there is 1 chance in N − 1 of letter j going in envelope j.

Therefore E(Xi Xj) = 1 / [N(N − 1)].

For the covariance,

Cov(Xi, Xj) = E(Xi Xj) − E(Xi)E(Xj) = 1/[N(N − 1)] − (1/N)(1/N)
            = (1/N)[1/(N − 1) − 1/N]
            = 1 / [N²(N − 1)]

E(Σ_{i=1}^{N} Xi) = Σ_{i=1}^{N} E(Xi) = Σ_{i=1}^{N} 1/N = (1/N) N = 1

Var(Σ_{i=1}^{N} Xi) = Σ_{i=1}^{N} Var(Xi) + 2 Σ_{i<j} Cov(Xi, Xj)
                    = Σ_{i=1}^{N} (1/N)(1 − 1/N) + 2 C(N, 2) · 1/[N²(N − 1)]
                    = N (1/N)(1 − 1/N) + 2 · [N(N − 1)/2] · 1/[N²(N − 1)]
                    = 1 − 1/N + 1/N = 1

(Common sense often helps in this course, but we have found no way of beingable to say this result is obvious. On average 1 letter will be correctly placedand the variance will be 1, regardless of how many letters there are.)
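Since the result is not obvious, it is reassuring to check it by simulation. The sketch below (numpy assumed; our own addition) estimates the mean and variance of the number of correctly placed letters for N = 8 and should give values close to 1 and 1.

```python
# Simulation of the letters-and-envelopes (matching) example.
import numpy as np

rng = np.random.default_rng(42)
N, trials = 8, 100_000
matches = np.empty(trials)
for t in range(trials):
    perm = rng.permutation(N)                  # random envelope for each letter
    matches[t] = np.sum(perm == np.arange(N))  # letters landing in their own envelope

print(matches.mean(), matches.var())           # both close to 1, regardless of N
```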

1.12 Continuous Distributions

As expected, a continuous distribution is one such that the support of the random variable is continuous. Again, time is a very common example of such a support. Even the surface area of the face of a dart board can be written as the Cartesian product of two continuous random variables.


1.12.1 Notation

The interpretation of the probability functions for continuous random variables is different from that of the discrete variety. The probability of any single point in a continuous distribution is 0; with continuous distributions, we are usually more interested in the probability of a particular range of values. For continuous distributions, it is easier to consider the cumulative distribution function first. This is again a non-decreasing function taking values from 0 to 1 (it is also monotonic when restricted to the support set) with the following properties:

1. lim_{x→−∞} F(x) = 0 and lim_{x→∞} F(x) = 1

2. F(x) is a non-decreasing function of x.

3. P(a < X ≤ b) = F(b) − F(a). Notice that mathematically there is no difference between strictly less than and less than or equal to when considering continuous random variables. By the second part of the Fundamental Theorem of Calculus, this probability also equals ∫_a^b f(x) dx, where f is the density defined next.

The probability density function is the analogue of the discrete probability mass function. This function is more representative of the relative likelihood of different outcomes and is defined through

f(x) Δx ≈ P(x < X < x + Δx) = F(x + Δx) − F(x).

As Δx becomes smaller, we divide the probability by Δx so that probabilities over intervals of length Δx can be compared; this limit is the definition of a derivative, so we see that f(x) = dF(x)/dx for CDF F(x).

1.12.2 Uniform

This is very similar to the discrete version: every outcome is equiprobable over the support. This random variable also has two parameters, a and b, which are the minimum and maximum. The support is all real numbers in the interval [a, b].

f(x) = 1 / (b − a)
F(x) = (x − a) / (b − a)
E[X] = (b + a) / 2
Var(X) = (b − a)² / 12


1.12.3 Normal

This distribution models data which cluster around an average. It has two parameters: the mean μ and the variance σ². (The same distribution is sometimes parameterized by the mean and the standard deviation σ instead of the variance.) The mean can take on any real value, whereas the variance must be a positive real number.

f(x) = (1 / (√(2π) σ)) e^{−(x−μ)² / (2σ²)}
E[X] = μ
Var(X) = σ²

It is important to note that there is no closed-form expression for the cumulative distribution function of the normal distribution; we use a table of values to determine approximate probabilities and quantiles. There is a very important special case of the normal distribution: the standard normal, which has mean 0 and variance 1. Given any random variable X ~ Normal(μ, σ²), we can standardize it via (X − μ)/σ = Z ~ Normal(0, 1).

1.12.4 Exponential

The exponential distribution is used to model the times between events in a Poisson process. It has one parameter, the rate λ, and its support is the non-negative real numbers.

f(x) = λ e^{−λx}
F(x) = 1 − e^{−λx}
E[X] = 1/λ
Var(X) = 1/λ²

1.12.5 Gamma

The gamma distribution is often used to model the waiting time until a certain number of events have occurred. It can also be expressed as the sum of finitely many independent and identically distributed exponential random variables. This distribution has two parameters: the number of exponential terms n, and the scale parameter θ (the mean of each exponential term, i.e. the reciprocal of the rate). The definition involves the gamma function Γ, which has some very useful properties. The gamma function has one parameter, a real number, and is defined as:

Γ(z) = ∫_0^∞ x^{z−1} e^{−x} dx


This function has three major, important properties:

1. Γ(z) = (z − 1)Γ(z − 1) (so if z is a positive integer, Γ(z) = (z − 1)!)

2. Γ(1) = 1

3. Γ(1/2) = √π

The distribution itself has the following properties and definitions:

f(x) = x^{n−1} e^{−x/θ} / (Γ(n) θ^n)
F(x) = (1/Γ(n)) ∫_0^{x/θ} t^{n−1} e^{−t} dt
E[X] = nθ
Var(X) = nθ²

1.12.6 χ²

This distribution is formed by taking the sum of n independent squared standard normal random variables. It has one parameter n, called the degrees of freedom; in many instances this refers to the number of independent terms in a statistic. Similar to the gamma, the χ² distribution has many important properties, most of which follow directly from the fact that it is derived from a sum of squared normally distributed random variables.

f(x) = (1/2)^{n/2} / Γ(n/2) · x^{n/2−1} e^{−x/2}
F(x) = intractable
E[X] = n
Var(X) = 2n

Consider two independent χ² random variables with m and n degrees of freedom (i.e. M ~ χ²_m and N ~ χ²_n); then their sum is also a χ² random variable, with m + n degrees of freedom (i.e. M + N ~ χ²_{m+n}). This is intuitive, as it follows from the fact that χ² random variables are derived from sums of squared normals. Since the cumulative distribution function cannot be written explicitly, we must use tables to determine probabilities. Unlike the normal, this distribution lacks symmetry, which would otherwise simplify the table. The table is of the form:

Degrees of Freedom     0.001       ...     0.999
1                      1.57e-6     ...     10.8
...                    ...         ...     ...
100                    61.9        ...     149


Each row shows the quantiles for the given degrees of freedom. The quantiles match up with their respective probabilities along the columns. If the test sample has more than 100 degrees of freedom, then use the normal distribution table instead.

1.13 Order Statistics

Given a series of n random variables, we may be interested in the distribution of the maximum or minimum value. First, assume that we have a series of n independent random variables Xi with CDFs Fi(x). Then let T = max{Xi}. We can find the CDF of this random variable:

P(T < t) = P(max{Xi} < t)   (if the largest is less than t, then they all are)
         = P(X1 < t, X2 < t, ..., Xn < t)
         = P(X1 < t) · P(X2 < t) · ... · P(Xn < t)   (by independence)
         = Π_{i=1}^{n} Fi(t)

Similarly, let R = min{Xi}.

P(R < r) = P(min{Xi} < r)
         = 1 − P(min{Xi} ≥ r)
         = 1 − P(X1 ≥ r, X2 ≥ r, ..., Xn ≥ r)
         = 1 − P(X1 ≥ r) P(X2 ≥ r) · ... · P(Xn ≥ r)
         = 1 − (1 − P(X1 < r))(1 − P(X2 < r)) · ... · (1 − P(Xn < r))
         = 1 − Π_{i=1}^{n} (1 − Fi(r))

As an exercise, determine the distribution of the minimum of a series of n independent and identically distributed exponential random variables.
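The exercise can be checked numerically once you have a candidate answer. The sketch below (numpy assumed; our own addition) compares the empirical mean of the minimum of n iid Exponential(λ) variables with 1/(nλ), which is what you should find if your derived distribution is exponential with rate nλ.

```python
# Empirical check for the minimum of n iid exponential random variables.
import numpy as np

rng = np.random.default_rng(7)
lam, n, trials = 2.0, 5, 200_000
samples = rng.exponential(scale=1/lam, size=(trials, n))  # numpy parameterizes by the scale 1/lambda
mins = samples.min(axis=1)

print(mins.mean(), 1 / (n * lam))    # both close to 0.1
```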


1.14 Summary of Distributions

Discrete Uniform (parameters a, b):
  f(x) = 1/(b − a + 1),  F(x) = (x − a + 1)/(b − a + 1),
  E[X] = (a + b)/2,  Var(X) = ((b − a + 1)² − 1)/12,  support x ∈ Z[a, b]

Bernoulli (parameter p, with q = 1 − p):
  f(x) = p^x q^{1−x},  F(x) = (1 − p)^{1−x},
  E[X] = p,  Var(X) = p(1 − p),  support x ∈ Z[0, 1]

Binomial (parameters n, p):
  f(x) = C(n, x) p^x q^{n−x},  F(x) = Σ_{i=0}^{x} f(i),
  E[X] = np,  Var(X) = np(1 − p),  support x ∈ Z[0, n]

Geometric (parameter p):
  f(x) = p q^x,  F(x) = 1 − (1 − p)^{x+1},
  E[X] = (1 − p)/p,  Var(X) = (1 − p)/p²,  support x ∈ Z[0, ∞)

Negative Binomial (parameters k, p):
  f(x) = C(x + k − 1, k − 1) p^k q^x,  F(x) = Σ_{i=0}^{x} f(i),
  E[X] = k(1 − p)/p,  Var(X) = k(1 − p)/p²,  support x ∈ Z[0, ∞)

Hypergeometric (parameters M, n, N):
  f(x) = C(M, x) C(N − M, n − x) / C(N, n),  F(x) = Σ_{i ≤ x} f(i),
  E[X] = nM/N,  Var(X) = n(M/N)(1 − M/N)(N − n)/(N − 1),
  support x ∈ Z[max(0, n + M − N), min(M, n)]

Poisson (parameter λt):
  f(x) = (λt)^x e^{−λt}/x!,  F(x) = e^{−λt} Σ_{i=0}^{x} (λt)^i/i!,
  E[X] = λt,  Var(X) = λt,  support x ∈ Z[0, ∞)

Continuous Uniform (parameters a, b):
  f(x) = 1/(b − a),  F(x) = (x − a)/(b − a),
  E[X] = (b + a)/2,  Var(X) = (b − a)²/12,  support x ∈ R[a, b]

Normal (parameters μ, σ²):
  f(x) = (1/(√(2π) σ)) e^{−(x−μ)²/(2σ²)},  F(x) intractable,
  E[X] = μ,  Var(X) = σ²,  support x ∈ R(−∞, ∞)

Exponential (parameter λ):
  f(x) = λe^{−λx},  F(x) = 1 − e^{−λx},
  E[X] = 1/λ,  Var(X) = 1/λ²,  support x ∈ R[0, ∞)

Gamma (parameters n, θ):
  f(x) = x^{n−1} e^{−x/θ}/(Γ(n) θ^n),  F(x) = (1/Γ(n)) ∫_0^{x/θ} t^{n−1} e^{−t} dt,
  E[X] = nθ,  Var(X) = nθ²,  support x ∈ R[0, ∞)

Chi-squared, χ² (parameter n):
  f(x) = (1/2)^{n/2}/Γ(n/2) · x^{n/2−1} e^{−x/2},  F(x) intractable,
  E[X] = n,  Var(X) = 2n,  support x ∈ R[0, ∞)

1.15 Appendix - Maximum Likelihood Estimation

There are two major schools of thought on how to estimate parameters: frequentist and Bayesian. We will concentrate only on the frequentist methodology. Frequentists use maximum likelihood estimation. It is a method which maximizes the probability of the observed data, viewed as a function of the parameters; that is, it finds the parameter values under which the observed data are most likely. To find these, we create a function called the likelihood function, defined as follows for n observations:

L(θ) = P(X1 = x1, X2 = x2, ..., Xn = xn)

For our data set we usually impose the independent and identically distributed assumptions. The likelihood function can be simplified under these assumptions:

L(θ) = P(X1 = x1, X2 = x2, ..., Xn = xn)
     = P(X1 = x1) P(X2 = x2) · ... · P(Xn = xn)
     = Π_{i=1}^{n} f(xi; θ)

Since we intend to maximize this function, it is often more convenient to use the log-likelihood function, which is the natural logarithm of the likelihood function:

ℓ(θ) = ln(L(θ))

Let's look at some common examples.

Example 1

Suppose we have a set of n independent and identically distributed data points from an exponential distribution, {x1, x2, ..., xn}. Find the likelihood function and the log-likelihood function, and find the value which maximizes them (called the maximum likelihood estimator).

Finding the likelihood function: since we have independent and identically distributed data points from a known distribution, the likelihood takes the form

L(λ) = Π_{i=1}^{n} f(xi; λ)
     = Π_{i=1}^{n} λ e^{−λ xi}
     = λ^n e^{−λ Σ_{i=1}^{n} xi}




Finding the log-likelihood function: taking the natural log of the likelihood yields

ℓ(λ) = ln(L(λ))
     = ln(λ^n e^{−λ Σ_{i=1}^{n} xi})
     = n ln λ − λ Σ_{i=1}^{n} xi

Finding the maximum likelihood estimator (MLE): by differentiating either the likelihood or the log-likelihood, we can find a critical point. For many of the common distributions (i.e. mostly the ones with names) there is a single global maximum, but not for all distributions. The advantage of the log-likelihood over the likelihood is that constants separate into their own terms and then disappear when we take the derivative. For this example we will use the log-likelihood.

dℓ(λ̂)/dλ̂ = d/dλ̂ (n ln λ̂ − λ̂ Σ_{i=1}^{n} xi)
          = n/λ̂ − Σ_{i=1}^{n} xi

Setting this expression equal to 0 and solving for λ̂ yields

0 = n/λ̂ − Σ_{i=1}^{n} xi
λ̂ = n / Σ_{i=1}^{n} xi = 1/x̄
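The closed-form answer λ̂ = 1/x̄ can be confirmed by maximizing the log-likelihood numerically. The sketch below (numpy and scipy assumed; the simulated data and the true rate 3 are our own illustration) minimizes the negative log-likelihood and compares the optimizer's answer with 1/x̄.

```python
# Numerical check of the exponential MLE against the closed form 1/xbar.
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
x = rng.exponential(scale=1/3.0, size=500)        # data with true rate lambda = 3

def neg_loglik(lam):
    # -(n ln lambda - lambda * sum(x)), valid for lambda > 0
    return -(len(x) * np.log(lam) - lam * x.sum())

res = minimize_scalar(neg_loglik, bounds=(1e-6, 50), method="bounded")
print(res.x, 1 / x.mean())                        # essentially equal
```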

Example 2

Find the MLE for a Binomial(n, θ). Note that in many cases n is already known, so the real parameter of interest is θ.

Finding the likelihood function for k iid random variables:

L(θ; x) = Π_{i=1}^{k} f(xi)
        = Π_{i=1}^{k} C(n, xi) θ^{xi} (1 − θ)^{n−xi}
        = θ^{Σ_{i=1}^{k} xi} (1 − θ)^{nk − Σ_{i=1}^{k} xi} Π_{i=1}^{k} C(n, xi)


Now find the log-likelihood function:

ℓ(θ) = ln( θ^{Σ xi} (1 − θ)^{nk − Σ xi} Π_{i=1}^{k} C(n, xi) )
     = Σ_{i=1}^{k} xi ln θ + (nk − Σ_{i=1}^{k} xi) ln(1 − θ) + C

Taking the derivative with respect to the parameter θ:

ℓ'(θ̂) = Σ_{i=1}^{k} xi / θ̂ − (nk − Σ_{i=1}^{k} xi) / (1 − θ̂)

Setting this equal to 0 and solving for θ̂:

0 = Σ xi / θ̂ − (nk − Σ xi) / (1 − θ̂)
0 = (1 − θ̂) Σ xi − θ̂ (nk − Σ xi)
0 = Σ xi − nk θ̂

nk θ̂ = Σ_{i=1}^{k} xi
θ̂ = x̄ / n

i.e. the MLE of θ is the observed proportion of successes.

As an exercise, find the MLE for the normal distribution.

1.16 Chapter Problems

1.16.1a List a sample space for tossing a fair coin 3 times.

1.16.1b What is the probability of 2 consecutive tails (but not 3)?

1.16.2 Machine Recognition of Handwritten Digits. Suppose that you have an optical scanner and associated software for determining which of the digits 0, 1, ..., 9 an individual has written in a square box. The system may of course be wrong sometimes, depending on the legibility of the handwritten number.

1.16.2a List a sample space S that includes points (x, y), where x stands for the number actually written, and y stands for the number that the machine identifies.


1.16.2b Suppose that the machine is asked to identify very large numbers of digits, of which 0, 1, ..., 9 occur equally often, and suppose that the following probabilities apply to the points in your sample space:

p(0, 6) = p(6, 0) = .004;  p(0, 0) = p(6, 6) = .096
p(5, 9) = p(9, 5) = .005;  p(5, 5) = p(9, 9) = .095
p(4, 7) = p(7, 4) = .002;  p(4, 4) = p(7, 7) = .098
p(y, y) = .100 for y = 1, 2, 3, 8

Give a table with probabilities for each point (x, y) in S. What fraction of numbers is correctly identified?

1.16.3 Prove the following Expected Value properties

1.16.3a E[ag(X) + b] = aE[g(X)] + b, for constants a and b.

1.16.3b E[ag1(X) + bg2(X)] = aE[g1(X)] + bE[g2(X)], for constants a and b.

1.16.4 Expected Winnings in a Lottery

A small lottery sells 1000 tickets numbered 000, 001, ..., 999; the tickets cost $10 each, and each one has a unique number. When all the tickets have been sold the draw takes place: this draw consists of a single ticket from 000 to 999 being chosen at random. For ticket holders, the prize structure is as follows:

- Your ticket is drawn - win $5000.
- Your ticket has the same first two numbers as the winning ticket, but the third is different - win $100.
- Your ticket has the same first number as the winning ticket, but the second number is different - win $10.
- All other cases - win nothing.

Let the random variable X represent the winnings from a given ticket. Find E(X).

1.16.5 Example: Diagnostic Medical Tests: along with expensive, highly accurate tests used for diagnosing the presence of some conditions in a person, there are also cheap, less accurate ones. Suppose we have two cheap tests and one expensive test, with the following characteristics: all three tests are positive if a person has the condition (there are no "false negatives"), but the cheap tests can give "false positives."

Let a person be chosen at random, and let D = {person has the condition}. The three tests are


Test 1: P(positive test | D̄) = .05; test costs $5.00
Test 2: P(positive test | D̄) = .03; test costs $8.00
Test 3: P(positive test | D̄) = 0; test costs $40.00

We want to check a large number of people for the condition, and have to chooseamong three testing strategies:

1.16.5a Use Test 1, followed by Test 3 if Test 1 is positive.

1.16.5b Use Test 2, followed by Test 3 if Test 2 is positive.

1.16.5c Use Test 3.

Determine the expected cost per person under each of strategies (a), (b), and (c). We will then choose the strategy with the lowest expected cost. It is known that about .001 of the population have the condition (i.e. P(D) = .001, P(D̄) = .999).

1.16.6 Prove the following: let the random variable X have m.g.f. M(t). Then

E(X^r) = M^{(r)}(0),   r = 1, 2, ...

where M^{(r)}(0) stands for d^r M(t)/dt^r evaluated at t = 0.

1.16.7 Show that the Poisson distribution with probability function

f(x) = e^{−λ} λ^x / x!,   x = 0, 1, 2, ...

has m.g.f. M(t) = e^{−λ + λe^t}. Then show that E(X) = λ and Var(X) = λ.

1.16.8 A potter is producing teapots one at a time. Assume that they are produced independently of each other and that with probability p the pot produced will be "satisfactory"; the rest are sold at a lower price. Let X be the number of rejects before a satisfactory teapot is produced and recorded. When 12 satisfactory teapots are produced, there should be 12 values of X indicating the number of rejects before each satisfactory teapot. What is the probability that the 12 values of X consist of six 0's, three 1's, two 2's, and one value which is ≥ 3?

1.16.9 Consider a race with competitors A, B, and C. They compete in 10 matches and have probabilities of winning 0.5, 0.4, and 0.1 respectively. You can see that we only need to use X1 and X2 in our formula:

f(x1, x2) = [10! / (x1! x2! (10 − x1 − x2)!)] (.5)^{x1} (.4)^{x2} (.1)^{10−x1−x2}

where A wins x1 times and B wins x2 times in the 10 races. Find E(X1 X2).


1.16.10 The joint probability function of (X, Y) is:

f(x, y)     x = 0   x = 1   x = 2
  y = 0      .06     .15     .09
  y = 1      .14     .35     .21

Calculate the correlation coefficient ρ. What does it indicate about the relationship between X and Y?

1.16.11 Prove the covariance results stated in Section 1.11.

1.16.12 Suppose that n people take a blood test for a disease, where each person has probability p of having the disease, independently of other persons. To save time and money, blood samples from k people are pooled and analyzed together. If none of the k persons has the disease then the test will be negative, but otherwise it will be positive. If the pooled test is positive then each of the k persons is tested separately (so k + 1 tests are done in that case).

1.16.12a Let X be the number of tests required for a group of k people. Show that

E(X) = k + 1 − k(1 − p)^k.

1.16.12b What is the expected number of tests required for n/k groups of k people each? If p = .01, evaluate this for the cases k = 1, 5, 10.

1.16.12c Show that if p is small, the expected number of tests in part (b) is approximately n(kp + k^{−1}), and is minimized for k ≈ p^{−1/2}.

1.16.13 A manufacturer of car radios ships them to retailers in cartons of n radios. The profit per radio is $59.50, less a shipping cost of $25 per carton, so the profit is $(59.5n − 25) per carton. To promote sales by assuring high quality, the manufacturer promises to pay the retailer $200X² if X radios in the carton are defective. (The retailer is then responsible for repairing any defective radios.) Suppose radios are produced independently and that 5% of radios are defective. How many radios should be packed per carton to maximize expected net profit per carton?

1.16.14 Let X have a geometric distribution with probability function

f(x) = p(1 − p)^x,   x = 0, 1, 2, ...

1.16.14a Calculate the m.g.f. M(t) = E(e^{tX}), where t is a parameter.

1.16.14b Find the mean and variance of X.

1.16.14c Use your result in (b) to show that if p is the probability of "success" (S) in a sequence of Bernoulli trials, then the expected number of trials until the first S occurs is 1/p. Explain why this is "obvious".


1.16.15 Find the distributions that correspond to the following moment generating functions:

1.16.15a M(t) = 1 / (3e^{−t} − 2), for t < ln(3/2)

1.16.15b M(t) = e^{2(e^t − 1)}, for t < ∞

1.16.16 Find the moment generating function of the discrete uniform distribution X on {a, a + 1, ..., b}:

P(X = x) = 1/(b − a + 1),   for x = a, a + 1, ..., b.

What do you get in the special case a = b, and in the case b = a + 1? Use the moment generating function in these two cases to confirm the expected value and the variance of X.

1.16.17 Let X be a random variable taking values in the set {0, 1, 2} with moments E(X) = 1, E(X²) = 3/2.

1.16.17a Find the moment generating function of X.

1.16.17b Find the first six moments of X.

1.16.17c Find P(X = i), i = 0, 1, 2.

1.16.17d Show that any probability distribution on {0, 1, 2} is completely determined by its first two moments.

1.16.18 Assume that each week a stock either increases in value by $1 with probability 1/2 or decreases by $1, these moves being independent of the past. The current price of the stock is $50. I wish to purchase a call option which gives me the option of buying the stock 13 weeks from now at a "strike price" of $55. Of course, if the stock price at that time is $55 or less there is no benefit to the option and it is not exercised. Assume that the return from the option is

g(S13) = max(S13 − 55, 0)

where S13 is the price of the stock in 13 weeks. What is the fair price of the option today, assuming no transaction costs and 0% interest; i.e. what is E[g(S13)]?


Solutions to Chapter Problems

1.15.1a List a sample space for tossing a fair coin 3 times.

The sample space consists of all possible outcomes. Let H and T denote that the result of a toss is heads or tails respectively, and let a sequence of outcomes be represented as a string. Then the sample space for 3 tosses of a fair coin is

S = {HHH, HHT, HTH, THH, HTT, THT, TTH, TTT}

1.15.1b What is the probability of 2 consecutive tails (but not 3)?

This event is the subset of the sample space

SA = {HTT, TTH}

Since all outcomes are equiprobable (because the coin is fair), the probability of exactly 2 consecutive tails is 2/8 = 1/4.

1.15.2 Machine Recognition of Handwritten Digits. Suppose that you have an optical scanner and associated software for determining which of the digits 0, 1, ..., 9 an individual has written in a square box. The system may of course be wrong sometimes, depending on the legibility of the handwritten number.

1.15.2a Describe a sample space S that includes points (x, y), where x stands for the number actually written, and y stands for the number that the machine identifies.

The sample space for this model is every pair of x and y, so it is of the form

S = {(0, 0), (0, 1), ..., (9, 9)}

1.15.2b Suppose that the machine is asked to identify very large numbers of digits, of which 0, 1, ..., 9 occur equally often, and suppose that the following probabilities apply to the points in your sample space:

p(0, 6) = p(6, 0) = .004;  p(0, 0) = p(6, 6) = .096
p(5, 9) = p(9, 5) = .005;  p(5, 5) = p(9, 9) = .095
p(4, 7) = p(7, 4) = .002;  p(4, 4) = p(7, 7) = .098
p(y, y) = .100 for y = 1, 2, 3, 8

Give a table with probabilities for each point (x, y) in S. What fraction of numbers is correctly identified?


X\Y     0      1      2      3      4      5      6      7      8      9
0     0.096  0.000  0.000  0.000  0.000  0.000  0.004  0.000  0.000  0.000
1     0.000  0.100  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000
2     0.000  0.000  0.100  0.000  0.000  0.000  0.000  0.000  0.000  0.000
3     0.000  0.000  0.000  0.100  0.000  0.000  0.000  0.000  0.000  0.000
4     0.000  0.000  0.000  0.000  0.098  0.000  0.000  0.002  0.000  0.000
5     0.000  0.000  0.000  0.000  0.000  0.095  0.000  0.000  0.000  0.005
6     0.004  0.000  0.000  0.000  0.000  0.000  0.096  0.000  0.000  0.000
7     0.000  0.000  0.000  0.000  0.002  0.000  0.000  0.098  0.000  0.000
8     0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.100  0.000
9     0.000  0.000  0.000  0.000  0.000  0.005  0.000  0.000  0.000  0.095

The fraction of correctly identified numbers is also 1 minus the fraction of incorrectly identified numbers:

= 1 − (p(0, 6) + p(6, 0) + p(7, 4) + p(4, 7) + p(9, 5) + p(5, 9))
= 1 − (0.004 + 0.004 + 0.002 + 0.002 + 0.005 + 0.005)
= 1 − 0.022
= 0.978

1.15.3 Prove Theorem 1: Suppose the random variable X has probability function f(x). Then the expected value of some function g(X) of X is given by

E[g(X)] = Σ_{all x} g(x) f(x)

Proof:

Let Y be a random variable representing g(X). Then Y has probability function of the form

h(y) = Σ_{x: g(x)=y} f(x)

Y is just another random variable, so by the definition of expectation,

E[Y] = Σ_{all y} y h(y) = Σ_{all y} y Σ_{x: g(x)=y} f(x) = Σ_{all x} g(x) f(x)

1.15.4 Expected Winnings in a Lottery

A small lottery sells 1000 tickets numbered 000, 001, ..., 999; the tickets cost $10 each. When all the tickets have been sold the draw takes place: this consists of a single ticket from 000 to 999 being chosen at random. For ticket holders the prize structure is as follows:

• Your ticket is drawn - win $5000.

• Your ticket has the same first two numbers as the winning ticket, but the third is different - win $100.

• Your ticket has the same first number as the winning ticket, but the second number is different - win $10.

• All other cases - win nothing.

Let the random variable X represent the winnings from a given ticket. FindE(X).

The expected value of the purchase of a single ticket is

E(X) = 5000 · (1/1000) + 100 · (9/1000) + 10 · (90/1000) − 10
     = 5 + 0.9 + 0.9 − 10
     = −3.2
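A quick numerical check of this expectation can be done in R. This is only a sketch: the winning number is arbitrarily taken to be 123, and every ticket's prize is enumerated directly from the prize structure above.

tickets <- sprintf("%03d", 0:999)
win <- "123"
prize <- ifelse(tickets == win, 5000,
         ifelse(substr(tickets, 1, 2) == substr(win, 1, 2), 100,
         ifelse(substr(tickets, 1, 1) == substr(win, 1, 1), 10, 0)))
mean(prize) - 10   # expected net winnings per $10 ticket, about -3.2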

1.15.5 Example: Diagnostic Medical Tests: Often there are cheaper, less accurate tests for diagnosing the presence of some condition in a person, along with more expensive, accurate tests. Suppose we have two cheap tests and one expensive test, with the following characteristics. All three tests are positive if a person has the condition (there are no "false negatives"), but the cheap tests give "false positives".

Let a person be chosen at random, and let D = {person has the condition}. The three tests are

Test 1: P(positive test | D̄) = 0.05; test costs $5.00
Test 2: P(positive test | D̄) = 0.03; test costs $8.00
Test 3: P(positive test | D̄) = 0; test costs $40.00

We want to check a large number of people for the condition, and have to choose among three testing strategies:

1.15.5a Use Test 1, followed by Test 3 if Test 1 is positive.

1.15.5b Use Test 2, followed by Test 3 if Test 2 is positive.

1.15.5c Use Test 3.

Determine the expected cost per person under each of strategies (i), (ii) and (iii). We will then choose the strategy with the lowest expected cost. It is known that about 0.001 of the population have the condition (i.e. P(D) = 0.001, P(D̄) = 0.999).


For strategy 1:

Expected Cost 1 = 5 + 40 · (0.001 + 0.05) = 7.04

For strategy 2:

Expected Cost 2 = 8 + 40 · (0.001 + 0.03) = 9.24

For strategy 3:

Expected Cost 3 = 40

Therefore, the least expensive option is to use the first strategy.
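The same figures can be reproduced in R. This sketch uses the exact probability of a positive cheap test, P(D) + P(positive | D̄)·P(D̄), so it matches the rounded calculation above to two decimals.

pD <- 0.001                        # P(person has the condition)
pos1 <- pD + 0.05 * (1 - pD)       # P(Test 1 positive)
pos2 <- pD + 0.03 * (1 - pD)       # P(Test 2 positive)
c(strategy1 = 5 + 40 * pos1,       # about 7.04
  strategy2 = 8 + 40 * pos2,       # about 9.24
  strategy3 = 40)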

1.15.6 Prove the following: let the random variable X have m.g.f. M(t). Then

E(X^r) = M^(r)(0),  r = 1, 2, ...

where M^(r)(0) stands for d^r M(t)/dt^r evaluated at t = 0.

Proof:

By definition, the moment generating function of X is

M(t) = E[e^{tX}] = E[1 + tX + t²X²/2! + t³X³/3! + ···] = 1 + t E[X] + (t²/2!) E[X²] + (t³/3!) E[X³] + ···

Differentiate this series r times with respect to t. Every term with a power of t less than r becomes a constant before the rth differentiation and so contributes 0; the t^r/r! E[X^r] term becomes E[X^r]; and every term with a power greater than r still carries at least one factor of t, which evaluates to 0 at t = 0. Hence

M^(r)(0) = E[X^r].

1.15.7 Show that the Poisson distribution with probability function

f(x) = e^{−λ} λ^x / x!,  x = 0, 1, 2, ...

has m.g.f. M(t) = e^{−λ + λe^t}. Then show that E(X) = λ and Var(X) = λ.

By definition, a Poisson distributed random variable has MGF

M(t) = E[e^{tX}] = Σ_{x=0}^{∞} e^{tx} e^{−λ} λ^x / x! = e^{−λ} Σ_{x=0}^{∞} (λe^t)^x / x! = e^{−λ} e^{λe^t} = e^{λ(e^t − 1)}

Finding the expected value:

M′(t) = λ e^t e^{λ(e^t − 1)}, so E(X) = M′(0) = λ e^0 e^{λ(e^0 − 1)} = λ

Finding the variance:

M″(t) = λ e^t e^{λ(e^t − 1)} + λ² e^{2t} e^{λ(e^t − 1)}, so M″(0) = λ + λ²

Var(X) = M″(0) − [M′(0)]² = λ + λ² − λ² = λ
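As a quick numerical sanity check of these results, we can simulate Poisson data and compare the sample mean and variance (a sketch; λ = 3 is chosen arbitrarily):

set.seed(1)
x <- rpois(100000, lambda = 3)
c(mean = mean(x), variance = var(x))   # both close to 3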

1.15.8 A potter is producing teapots one at a time. Assume that they are pro-duced independently of each other and with probability p the pot producedwill be �satisfactory�; the rest are sold at a lower price. The number, X,of rejects before producing a satisfactory teapot is recorded. When 12satisfactory teapots are produced, what is the probability the 12 values ofX will consist of six 0�s, three 1�s, two 2�s and one value which is � 3?

1.15.9 Consider a race with competitors A, B, and C. They compete in 10 matches and have probabilities of winning 0.5, 0.4, and 0.1 respectively. You can see that we only need to use X1 and X2 in our formulae.

f(x1, x2) = [10!/(x1! x2! (10 − x1 − x2)!)] (0.5)^{x1} (0.4)^{x2} (0.1)^{10 − x1 − x2}


where A wins x1 times and B wins x2 times in 10 races. Find E (X1X2).

1.15.10 The joint probability function of (X, Y) is:

f(x, y)      x = 0   x = 1   x = 2
y = 0        .06     .15     .09
y = 1        .14     .35     .21

1. Calculate the correlation coefficient, ρ. What does it indicate about the relationship between X and Y?

The correlation can be found by the following calculation:

Corr(X, Y) = Cov(X, Y)/√(Var(X)Var(Y)) = (E(XY) − E(X)E(Y))/√((E(X²) − E(X)²)(E(Y²) − E(Y)²))

E(XY) = 0 · (0.06 + 0.15 + 0.09 + 0.14) + 1 · 1 · 0.35 + 1 · 2 · 0.21 = 0.35 + 0.42 = 0.77
E(X) = 0 · 0.20 + 1 · 0.50 + 2 · 0.30 = 1.1
E(X²) = 0² · 0.20 + 1² · 0.50 + 2² · 0.30 = 1.9
E(Y) = 0 · 0.30 + 1 · 0.70 = 0.70
E(Y²) = 0² · 0.30 + 1² · 0.70 = 0.70

Corr(X, Y) = (0.77 − 1.1 · 0.7)/√((1.9 − 1.1²)(0.70 − 0.70²)) = 0/√((1.9 − 1.1²)(0.70 − 0.70²)) = 0

From this we can determine that X and Y are uncorrelated: a correlation of 0 indicates there is no linear relationship between X and Y. (In general, zero correlation does not imply independence, although for this particular table X and Y are in fact independent.)


1.15.11 Prove the covariance results on page 19.

1.15.12 Suppose that n people take a blood test for a disease, where each personhas probability p of having the disease, independent of other persons. Tosave time and money, blood samples from k people are pooled and analyzedtogether. If none of the k persons has the disease then the test will benegative, but otherwise it will be positive. If the pooled test is positivethen each of the k persons is tested separately (so k + 1 tests are done inthat case).

1.15.12a Let X be the number of tests required for a group of k people. Show that

E(X) = k + 1 − k(1 − p)^k.

1.15.12b What is the expected number of tests required for n/k groups of k people each? If p = 0.01, evaluate this for the cases k = 1, 5, 10.

1.15.12c Show that if p is small, the expected number of tests in part (b) is approximately n(kp + k^{−1}), and is minimized for k ≈ p^{−1/2}.

1.15.13 A manufacturer of car radios ships them to retailers in cartons of nradios. The pro�t per radio is $59.50, less shipping cost of $25 per carton,so the pro�t is $ (59:5n� 25) per carton. To promote sales by assuringhigh quality, the manufacturer promises to pay the retailer $200X2 if Xradios in the carton are defective. (The retailer is then responsible for re-pairing any defective radios.) Suppose radios are produced independentlyand that 5% of radios are defective. How many radios should be packedper carton to maximize expected net pro�t per carton?

1.15.14 Let X have a geometric distribution with probability function f(x) = p(1 − p)^x, x = 0, 1, 2, ...

1.15.14a Calculate the m.g.f. M(t) = E(e^{tX}), where t is a parameter.

1.15.14b Find the mean and variance of X.

1.15.14c Use your result in (b) to show that if p is the probability of "success" (S) in a sequence of Bernoulli trials, then the expected number of trials until the first S occurs is 1/p. Explain why this is "obvious".

1.15.15 Find the distributions that correspond to the following moment-generatingfunctions:

1.15.15a M(t) = 1/(3e^{−t} − 2), for t < ln(3/2)


1.15.15b M(t) = e^{2(e^t − 1)}, for t < ∞

Since a Poisson random variable has MGF e^{λ(e^t − 1)}, this MGF defines a Poisson(2) random variable.

1.15.16 Find the moment generating function of the discrete uniform distribution X on {a, a+1, ..., b}:

P(X = x) = 1/(b − a + 1),  for x = a, a+1, ..., b.

What do you get in the special case a = b and in the case b = a + 1? Use the moment generating function in these two cases to confirm the expected value and the variance of X.

1.15.17 Let X be a random variable taking values in the set f0; 1; 2g withmoments E(X) = 1, E(X2) = 3=2:

1.15.17a Find the moment generating function of X

1.15.17b Find the �rst six moments of X

1.15.17c Find P (X = i); i = 0; 1; 2:

1.15.17d Show that any probability distribution on f0; 1; 2g is completely de-termined by its �rst two moments.

1.15.18 Assume that each week a stock either increases in value by $1 with probability 1/2 or decreases by $1, with these moves independent of the past. The current price of the stock is $50. I wish to purchase a call option which allows me (if I wish to do so) the option of buying the stock 13 weeks from now at a "strike price" of $55. Of course, if the stock price at that time is $55 or less there is no benefit to the option and it is not exercised. Assume that the return from the option is

g(S13) = max(S13 − 55, 0)

where S13 is the price of the stock in 13 weeks. What is the fair price of the option today assuming no transaction costs and 0% interest; i.e., what is E[g(S13)]?


Chapter 2

Statistics

2.1 Parameters, Estimates & Estimators

In general, we draw a sample from a population or as output from an experiment. The goal of statistical analysis is to determine the value of a population parameter, θ, by using an estimate, θ̂, from the sample.
Population parameters give characteristics of the population, such as the mean and the variance. Because each sample drawn is merely one instance of the samples that could occur, we use an estimator θ̃ to denote the possible values of θ̂ we may get.

Group        Parameter(s)
Population   μ, σ²
Sample       μ̂ = x̄, σ̂² = s²

2.2 Estimates

Our estimates are obtained using the Maximum Likelihood Estimation method. This method determines the most probable estimate using the sample data. Some common ones are listed below.

Distribution and Parameter(s)   Estimate
Normal(μ, σ²)                   μ̂ = x̄
Continuous Uniform(0, θ)        θ̂ = max(x1, ..., xn)
Exponential(λ)                  λ̂ = 1/x̄

2.3 Sampling Distributions

Each estimator is a random variable representing all possible values that ourestimate may take with each sample. Hence our estimator has a distributioncalled a sampling distribution.


Example 2. Let X ~ N(μ, σ²). Our estimate of μ is μ̂ = x̄. This is the realized estimate we get from the data we have. Before the data are observed, this estimate is also a random variable, known as an estimator and denoted by μ̃ = X̄. Since X is normally distributed, X̄ is normally distributed with mean E[X̄] = μ and variance Var(X̄) = σ²/n. These results follow from the Central Limit Theorem.

2.4 Confidence Interval

Although we'd like to use our estimates, there is a good chance they will not be equal to the population parameter. As a result, we build an interval of values in which we are confident that the real value will lie. If our estimator, θ̃, has a sampling distribution with mean θ and variance Var(θ̃) (and hence standard error SE(θ̃)), then our confidence interval will be

θ̂ ± c · SE(θ̃)

where c is a table value and the standard error is the standard deviation of the estimator. For example, if θ̃ ~ N(θ, σ²/n), then the confidence interval (CI) for μ from a Normal distribution with σ² known is

Estimate ± c · (Standard Error)

μ̂ ± c σ/√n = x̄ ± z σ/√n

If σ² is unknown, the CI for μ is

x̄ ± t_{n−1} s/√n

Here z is a value selected from the Normal table and t_{n−1} is a value selected from the Student's t table with n − 1 degrees of freedom. We make the selection based on the confidence level (CL), which is a measure of the certainty we have that our unknown parameter is in the interval. For example, if the CL is 95%, then we are 95% confident that our unknown parameter is in the interval.
A confidence interval is not a probability statement about the parameter, since the interval changes with each sample. In the case of a proportion, we state X ~ Binomial(n, π) with X approximately N(nπ, nπ(1 − π)). Hence, for a proportion:

(X − nπ)/√(nπ(1 − π)) ≈ N(0, 1)

Let's review with some examples.


Example 1

A bank robber decides to measure his returns obtained by looting a series of safety deposit boxes from a randomly selected bank. He samples 45 safety deposit boxes and finds an average return of $57.80 after pawning the contents of each box. The standard deviation of the earnings per box is $9.60. Find a 99% confidence interval on the average earning from a randomly selected safety deposit box from the sampled bank.

μ̂ ± z_{α/2} σ/√45
μ̂ ± z_{0.005} σ/√45
57.8 ± 2.576 · 9.6/√45
57.8 ± 3.7

Therefore, the robber can be 99% confident that the mean return on his safety deposit boxes is between $54.10 and $61.50.
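The same interval can be computed directly in R (a sketch of the normal-based interval, with σ treated as known):

xbar <- 57.8; sigma <- 9.6; n <- 45
z <- qnorm(0.995)                      # 2.576 for a 99% interval
xbar + c(-1, 1) * z * sigma / sqrt(n)  # about (54.1, 61.5)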

Example 2

A group of recent high school graduates was speeding down a desolate highway, laughing with the hubris of youth. They gleefully cried out their speeds at various points during their adventure. The speeds cried out were

234.3  233.0  232.9  235.2  234.0  232.8  233.8  233.0

Determine a 90% confidence interval for the average speed of the carefree graduates' driving. For this, we need to calculate the sample mean as well as the sample standard error. Now to estimate the variance:

s² = Σ_{i=1}^n (xi − x̄)²/(n − 1) = (Σ_{i=1}^n xi² − n x̄²)/(n − 1) = (436650.2 − 8 · 233.6²)/7 ≈ 0.7279

Note that we have 7 degrees of freedom (since σ² was estimated); therefore, the critical value is 1.895. Then, we can solve for the CI:

x̄ ± t_{0.05,7} s/√n = 233.6 ± 0.57 = (233.03, 234.17)


Thus, we are 90% confident that the true value of the average speed of the young drivers is between 233.03 and 234.17 kilometres per hour.
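The same interval can be obtained with t.test applied to the raw speeds. Note that t.test uses the unrounded sample mean (233.625), so it differs from the hand calculation above only slightly:

speeds <- c(234.3, 233.0, 232.9, 235.2, 234.0, 232.8, 233.8, 233.0)
t.test(speeds, conf.level = 0.90)$conf.int   # roughly (233.05, 234.20)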

Example 3

A manufacturing company's production line takes a random sample of their leading product to determine the number of units that are defective. The company samples 880 units, 308 of which proved to be defective. Create a 95% confidence interval for the proportion of defective units on the production line. This can be modelled using a binomial distribution where p represents the proportion of defective items. The general form for the CI of a binomial response model is

p̂ ± z √(p̂(1 − p̂)/n)

0.35 ± 1.96 √(0.35 · 0.65/880)

0.35 ± 0.032

Thus, we have a 95% confidence interval of (0.318, 0.382) for the proportion of defective items in the production line.
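In R, the same normal-approximation interval is a one-liner:

phat <- 308 / 880
n <- 880
phat + c(-1, 1) * qnorm(0.975) * sqrt(phat * (1 - phat) / n)   # about (0.318, 0.382)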

Example 4

We are interested in a CI for the parameter σ². The formula is

(n − 1)s²/χ²_{α/2} < σ² < (n − 1)s²/χ²_{1−α/2}

A textbook wants to question students on their ability to determine CIs for the estimation of a standard error. The sampled variance is 26.70 from a total of 10 sampled units. Find a 95% CI for these numbers.

9 · 26.7/19.02 < σ² < 9 · 26.7/2.70
12.63 < σ² < 89.00

Thus, we are 95% confident that the actual value of the variance is between 12.63 and 89.
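The same chi-square interval in R (a quick sketch using qchisq for the two critical values):

n <- 10
s2 <- 26.7
(n - 1) * s2 / qchisq(c(0.975, 0.025), df = n - 1)   # about (12.63, 89.0)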

2.4.1 Comparison between 2 parameters

Often we will want to compare 2 parameters. If the samples are independent, then the comparison can be made with the following interval:

x̄1 − x̄2 ± t_{n1+n2−2} √(s1²/n1 + s2²/n2)

If the two samples are dependent, then we work with the differences between the paired data values in the two groups. The average difference is x̄d, and the interval for the difference between the parameters of interest has the form

x̄d ± t_{nd−1} sd/√n

In the case of 2 proportions, we use

p̂1 − p̂2 ± z √(p̂1(1 − p̂1)/n1 + p̂2(1 − p̂2)/n2)

Example 5

A professor administers a test out of 40 marks. She notices a difference in the averages between students who took both STAT 230 and 231 (prepared students) versus her students who have only taken STAT 230 (unprepared students). There are 7 prepared students; they had an average of 32 marks and a variance of 4.47. There are 6 unprepared students, who had an average of 30.2 marks and a variance of 0.652. Find a 90% confidence interval on the difference of their marks.

x̄1 − x̄2 ± t_{n1+n2−2} √(s1²/n1 + s2²/n2)

32 − 30.2 ± 1.796 √(4.47/7 + 0.652/6)

1.8 ± 1.55

This gives a 90% confidence interval of (0.25, 3.35) for the difference in the two means.
In addition to a CI, a hypothesis test (HT) could also be performed. For that, we refer you to your STAT 231 course notes for further details.
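The same interval can be reproduced from the summary statistics in R (a sketch; qt supplies the t critical value on n1 + n2 − 2 = 11 degrees of freedom):

xbar1 <- 32;   s2_1 <- 4.47;  n1 <- 7
xbar2 <- 30.2; s2_2 <- 0.652; n2 <- 6
tcrit <- qt(0.95, df = n1 + n2 - 2)
(xbar1 - xbar2) + c(-1, 1) * tcrit * sqrt(s2_1 / n1 + s2_2 / n2)   # about (0.25, 3.35)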

2.5 Chapter Problems

2.5.1 A machine produces a particular type of computer chip with an averagelength of 0.8 centimetres. The production machine had a sample error of0.2 centimeters when 10 randomly selected chips were measured. Deter-mine the 95% con�dence interval on the average length of the computerchip.

2.5.2 A study of 500 participants was conducted to determine the proportionof �t people who could leap across small buildings with a running startand favourable winds. Of the participants, only 60 managed to achievethis feat. Determine the 90% con�dence interval on the true proportionof people who successfully lept over a small building.


2.4.3 The O¢ ce of the Registrar General of Canada reported that a sample of200,000 males born in 1990 had 12% of them named "Optimus Prime."Determine the 95% con�dence interval associated with the true proportionof newborn males given the awesome name of Optimus Prime.

2.5.4 Given that a the sample error of hacking coughs from a sample of 20 ex-cessive smokers is 1.6 milliseconds, determine the 95% con�dence intervalfor the variance of annoying, hacking coughs of excessive smokers.

2.5.5 Gas prices have increased signi�cantly over the past decade. The aver-age price of one litre of gasoline was recorded at a nearby station everyweek. Determine the 90% con�dence interval for both the variance andthe standard deviation of the price in dollars of the average gas price overthe following two-month period:

59 54 53 52 51
39 49 46 49 48

2.5.6 A survey conducted compared student spending habits between two com-parable universities. 50 students were polled on their monthly spendingson entertainment. The average spendings were $175.53 and $171.31 eachwith standard deviations $9.52 and $10.89 respectively. Determine the95% con�dence interval associated with the di¤erence in the means ofentertainment spending between the two student bodies.

2.5.7 A competitive sports radio station has asked you to discuss the di¤erencebetween scores in American and Canadian Whack-a-Mole teams. Youpicked the champions from each nation and recorded their data over a40-year period for the highest number of moles whacked in a single bout.The data is as follows:

Canadian Leaders
47 49 73 50 65 70 49 47 40 43
46 35 38 40 47 39 49 37 37 36
40 37 31 48 48 45 52 38 38 36
44 40 48 45 45 36 39 44 52 47

American Leaders
47 57 52 47 48 56 56 52 50 40
46 43 44 51 36 42 49 49 40 43
39 39 22 41 45 46 39 32 36 32
32 32 37 33 44 49 44 44 49 32

2.5.7a De�ne the population for this set of data.

2.5.7b What kind of sample was used?

2.5.7c How do you feel this sample representing the target population?

2.5.7d State the hypotheses you will use to compare the data sets.


2.5.7e Determine a reasonable signi�cance level. For what purpose did youselect this level?

2.5.7f What statistical test is reasonable to use for your hypotheses? Justifyyour answer.

2.5.7g Perform statistical analysis as described in your answers to the above.What are the results?

2.5.7h State your conclusions

2.5.7i Consider how this data set describes the original question. What areyour thoughts on the �t?

2.5.7j What alternative data set would you consider using to answer the ques-tions above?

2.5.8 Eight athletes have acquired a perfectly legal, under the counter perfor-mance enhancer that they plan to test its e¤ectiveness. As part of theirtraining regimen, they test their performance levels before using the en-hancer as well as afterwards. To end their rigorous training program, theyperform statistical analysis on the data. Using the data that they acquired,perform a test to determine if the enhancer is e¤ective with signi�cancelevel of �=0.05. Assume normality on the data.

Athlete            1    2    3    4    5    6    7    8
Without Enhancer   95  104   83   92  119  115   99   98
With Enhancer      99  107   81   93  122  113  101   98

2.5.9 An advertising �rm released a new controversial ad campaign for a par-ticular brand of clothing. The campaign was featured in six randomlyselected retailers of the product. Average weekly sales for both a monthbefore and after the unveiling of the new ad campaign are shown in thetable below:

Location    1    2    3    4    5    6
Before Ad   210  235  208  190  172  244
After Ad    190  170  210  188  173  228

Determine whether or not the ad campaign caused a signi�cant change insales with a signi�cance level of �=0.10.

2.5.10 Early educational facilities began standardized testing to determinewhether or not pre-school children could recognize a ninja. 12 out of34 privately run pre-schools had a ninja identi�cation success rate of lessthan 80%, while 17 out of 24 publicly run pre-schools had a success rateless than 80%. Determine a 95% con�dence interval for the di¤erence ofproportions in successful indenti�cation of ninjas in publicly and privatelyrun pre-schools.


2.5.11 Consider the model Y = μ + E, where E ~ N(0, σ²). Use maximum likelihood estimation to

2.5.11a Show μ̂ = ȳ.

2.5.11b Show σ̂² = Σ_{i=1}^n (yi − μ̂)²/n.


Solutions to Chapter 2 Problems

2.4.1 A machine produces a particular type of computer chip with averagelength of 0.8 centimetres. The production machine had a sample error of0.2 centimeters when 10 randomly selected chips were measured. Deter-mine the 95% con�dence interval on the average length of the computerchip.

2.4.1 Solution Since σ is unknown and s must replace it, the t distribution must be used for a 95% confidence interval. Hence, with 9 degrees of freedom, t_{α/2} = 2.262. The 95% confidence interval for the population mean is found by substituting into the formula

X̄ − t_{α/2}(s/√n) < μ < X̄ + t_{α/2}(s/√n)

Hence,

0.8 − 2.262(0.2/√10) < μ < 0.8 + 2.262(0.2/√10)
0.8 − 0.143 < μ < 0.8 + 0.143
0.657 < μ < 0.943

Therefore, one can be 95% confident that the population mean computer chip length is between 0.657 and 0.943 centimetres, based on a sample of 10 chips.

2.4.2 A study of 500 participants was conducted to determine the proportionof �t humans that could leap small buildings with a running start andfavourable winds. Of the participants only 60 managed to match this featof strength. Determine the 90% con�dence interval on the true proportionof humans who successul lept over a small building under these conditions.

2.4.2 Solution Since α = 1 − 0.90 = 0.10 and z_{α/2} = 1.645, substitute into the formula

p̂ − z_{α/2}√(p̂q̂/n) < p < p̂ + z_{α/2}√(p̂q̂/n)

With p̂ = 60/500 = 0.12 and q̂ = 0.88, one gets

0.12 − 1.645√(0.12 · 0.88/500) < p < 0.12 + 1.645√(0.12 · 0.88/500)
0.12 − 0.024 < p < 0.12 + 0.024
0.096 < p < 0.144, or 9.6% < p < 14.4%

Hence, one can be 90% confident that the percentage of participants who can leap over a small building is between 9.6% and 14.4%.

2.4.3 The Canadian O¢ ce of the Registrar General reported that a sample of200,000 male births in 1990 had 12% of them named "Optimus Prime".Determine the 95% con�dence interval associated with the true proportionof newborn males given the awesome name of Optimus Prime.

2.4.3 Solution From the data, p̂ = 0.12 (i.e. 12%) and n = 200,000. Since z_{α/2} = 1.96, substituting into the formula

p̂ − z_{α/2}√(p̂q̂/n) < p < p̂ + z_{α/2}√(p̂q̂/n)

yields

0.12 − 1.96√(0.12 · 0.88/200,000) < p < 0.12 + 1.96√(0.12 · 0.88/200,000)
0.119 < p < 0.121

Hence, one can say with 95% confidence that the true percentage of males named Optimus Prime is between 11.9% and 12.1%.

2.4.4 Given that a the sample error hacking coughs from a sample of 20 exces-sive smokers is 1.6 milliseconds determine a 95% con�dence interval forthe variance of annoying, hacking coughs of excessive smokers.

2.4.4 Solution Since α = 0.05, the two critical values for 19 degrees of freedom (at the 0.025 and 0.975 levels) are 32.852 and 8.907. The 95% confidence interval for the variance is found by substituting into the formula

(n − 1)s²/χ²_right < σ² < (n − 1)s²/χ²_low

(20 − 1)(1.6)²/32.852 < σ² < (20 − 1)(1.6)²/8.907
1.5 < σ² < 5.5


Hence, one can be 95% con�dent that the true variance for the hackingcoughs is between 1.5 and 5.5 milliseconds. (For the standard deviation,the con�dence interval is (1.2, 2.3).)

2.4.5 Prices of gas have increased signi�cantly over the past decade. The av-erage price of one litre of gasoline was recorded at a nearby station everyweek. Determine a 90% con�dence interval for both the variance and thestandard deviation of the price in dollars of the average price of gasolineover the following two-month period:

59 54 53 52 5139 49 46 49 48

2.4.5 Solution First we find the variance of the data: the estimated variance is s² = 28.2.

Then find χ²_right and χ²_low from a χ²_9 distribution; with α = 0.10, the critical values are 16.919 and 3.325 (the 0.95 and 0.05 quantiles).

Substitute and solve:

(n − 1)s²/χ²_right < σ² < (n − 1)s²/χ²_low

(10 − 1)(28.2)/16.919 < σ² < (10 − 1)(28.2)/3.325
15.0 < σ² < 76.3

Hence one can be 90% confident that the variance of the price of gasoline per litre is between 15.0 and 76.3. For the standard deviation, the interval is (3.87, 8.73).

2.4.6 A survey conducted compared student spending habits between two com-parable universities. 50 students were polled on their monthly spendinghabits on entertainment. The averages were $175.53 and $171.31 eachwith standard deviations $9.52 and $10.89 respectively. Determine the95% con�dence associated with the di¤erence in the means of entertain-ment spending between the two undergraduate bodies.

2.4.6 Solution

(X̄1 − X̄2) − z_{α/2}√(σ1²/n1 + σ2²/n2) < μ1 − μ2 < (X̄1 − X̄2) + z_{α/2}√(σ1²/n1 + σ2²/n2)

(175.53 − 171.31) ± 1.96√((9.52² + 10.89²)/50)

4.22 − 4.01 < μ1 − μ2 < 4.22 + 4.01
0.21 < μ1 − μ2 < 8.23


Since the con�dence interval does not contain zero, the decision is to rejectthe null hypothesis.

2.4.7 A competitive sports radio station has asked you to discuss the difference between scores in American and Canadian Whack-a-Mole teams. You pick the champions from each nation and record their data over a 40-year period for the highest number of moles whacked in a single bout. The data are as follows:

Canadian Leaders
47 49 73 50 65 70 49 47 40 43
46 35 38 40 47 39 49 37 37 36
40 37 31 48 48 45 52 38 38 36
44 40 48 45 45 36 39 44 52 47

American Leaders
47 57 52 47 48 56 56 52 50 40
46 43 44 51 36 42 49 49 40 43
39 39 22 41 45 46 39 32 36 32
32 32 37 33 44 49 44 44 49 32

2.4.7a De�ne the population for this set of data.

2.4.7a Solution The population is all Moles whacked by Canadian and Amer-ican Whack-a-Mole Champions.

2.4.7b What kind of sample was used?

2.4.7b Solution A cluster sample was used.

2.4.7c How do you feel this sample represents the target population?

2.4.7c Solution Answers will vary. While this sample is not representative ofall professional Whack-a-Molers, per se, it does allow us to compare theleaders in each league.

2.4.7d State the hypotheses you will use to compare the data sets.

2.4.7d Solution H0 : �1 = �2 and HA : �1 6= �2.

2.4.7e Determine a reasonable signi�cance level. For what purpose did youselect this level?

2.4.7e Solution Answers will vary. Possible answers include the 0.05 and the0.01 signi�cance levels.

2.4.7f What statistical test is reasonable to use for your hypotheses? Justifyyour answer.

2.4.7f Solution We will use the z test for the di¤erence in means.


2.4.7g Perform statistical analysis as described in your answers to the above.What are the results?

2.4.7g Solution Our test statistic is z = (44.75 − 42.88)/√((8.88² + 7.82²)/40) = 1.00, and our p-value is 0.3173.

2.4.7h State your conclusions

2.4.7h Solution We fail to reject the null hypothesis. There is not enoughevidence to conclude that there is a di¤erence in the number of moleswhacked by Canadian versus American champions.

2.4.7i Consider how this data set describes the original question. What areyour thoughts on the �t?

2.4.7i Solution Answers will vary. One possible answer is that since we donot have a random sample of data from each nationa, we cannot answerthe original question asked.

2.4.7j What alternative data set would you consider using to answer the ques-tions above?

2.4.7j Solution Answers will vary. One possible answer is that we could get arandom sample of data from each nation from a recent season.

2.4.8 Eight athletes have acquired a perfectly legal, under the counter perfor-mance enhancer that they plan to test its a¤ectiveness. As part of theirtraining regime they test their performance levels before using the en-hancer as well as afterwards. To end their rigorous training programmethey perform statistical analysis on the data. Using the data that theyacquired, perform a test to determine if the enchancer is e¤ective withsigni�cance level of �=0.05. Assume normality on the data.

Athlete 1 2 3 4 5 6 7 8Without Enhancer 95 104 83 92 119 115 99 98With Enhancer 99 107 81 93 122 113 101 98

2.4.8 Solution State the hypotheses and identify the claim. For the enhancer to be effective, the "before" scores must be significantly less than the "after" scores; hence the mean of the differences must be less than zero: H0: μD = 0 and HA: μD < 0 (claim).

Find the critical value. The degrees of freedom are n − 1; in this case, df = 8 − 1 = 7. The critical value for a left-tailed test with α = 0.05 is −1.895.

Compute the differences.


Before (X1)   After (X2)   Difference D = X1 − X2   Squared Diff. D²
 95            99          −4                        16
 104           107         −3                         9
 83            81           2                         4
 92            93          −1                         1
 119           122         −3                         9
 115           113          2                         4
 99            101         −2                         4
 98            98           0                         0

ΣD = −9, ΣD² = 47

with mean difference −9/8 = −1.125. With this chart, one can calculate the standard deviation of the differences:

sD = √((ΣD² − (ΣD)²/n)/(n − 1)) = √((47 − (−9)²/8)/(8 − 1)) ≈ 2.295

Calculating the test statistic:

t = (D̄ − μ0)/(sD/√n) = (−1.125 − 0)/(2.295/√8) ≈ −1.386

Comparing against the critical value, the decision is to not reject the null hypothesis at α = 0.05, since −1.386 > −1.895.

2.4.9 An advertising firm releases a new controversial ad campaign for a particular brand of clothing. The campaign was featured in six randomly selected retailers of the product. Average weekly sales for a month before and a month after the unveiling of the new ad campaign are shown below. Determine whether or not the ad campaign caused a significant change in sales, with a significance level of α = 0.10.

Location    1    2    3    4    5    6
Before Ad   210  235  208  190  172  244
After Ad    190  170  210  188  173  228

2.4.9 Solution State the hypotheses and identify the claim. We want to detect a change in sales in either direction, so the test is two-sided: H0: μD = 0 and HA: μD ≠ 0 (claim).

Find the critical value. The degrees of freedom are 5, and the two-tailed critical values at α = 0.10 are ±2.015.

Compute the differences.


Before (X1)   After (X2)   Difference D = X1 − X2   Squared Diff. D²
 210           190          20                        400
 235           170          65                       4225
 208           210          −2                          4
 190           188           2                          4
 172           173          −1                          1
 244           228          16                        256

ΣD = 100, ΣD² = 4890

with mean difference 100/6 = 16.7. With this chart, one can calculate the standard deviation of the differences:

sD = √((ΣD² − (ΣD)²/n)/(n − 1)) = √((4890 − 100²/6)/(6 − 1)) ≈ 25.4

Calculating the test statistic:

t = (D̄ − μ0)/(sD/√n) = (16.7 − 0)/(25.4/√6) ≈ 1.610

The decision is to not reject the null hypothesis, since the test statistic 1.610 is in the non-critical region.

2.4.10 Early educational facilities began standardized testing to determinewhether or not pre-school children could recognize a ninja. 12 out of34 privately run pre-schools had a ninja identi�cation success rate of lessthan 80%, while 17 out of 24 publically run pre-schools had a successrate less than 80%. Determine a 95% con�dence interval for the di¤er-ence of proportions in successful indenti�cation of ninjas in publically andprivately run pre-schools.

2.4.10 Solution

p̂1 = 12/34 = 0.35, q̂1 = 0.65
p̂2 = 17/24 = 0.71, q̂2 = 0.29

Substitute into the formula

(p̂1 − p̂2) ± z_{α/2} √(p̂1q̂1/n1 + p̂2q̂2/n2)

(0.35 − 0.71) ± 1.96 √(0.35 · 0.65/34 + 0.71 · 0.29/24)

−0.36 − 0.242 < p1 − p2 < −0.36 + 0.242
−0.602 < p1 − p2 < −0.118

Since 0 is not contained in the interval, the decision is to reject the nullhypothesis H0 : p1 = p2.


2.4.11 Consider the model X = μ + E, where E ~ N(0, σ²). Use maximum likelihood estimation (MLE) to

2.4.11a Show μ̂ = x̄.

2.4.11b Show σ̂² = Σ_{i=1}^n (xi − μ̂)²/n.

2.4.11a&b Solution Find the log-likelihood of the Normal distribution:

L(μ, σ) = ∏_{i=1}^n (1/(√(2π)σ)) e^{−(1/2)((xi − μ)/σ)²}

ℓ(μ, σ) = C − (n/2) ln σ² − (1/2) Σ_{i=1}^n ((xi − μ)/σ)²

To solve for the estimate of μ, take the partial derivative with respect to μ:

∂ℓ/∂μ = (1/σ²) Σ_{i=1}^n (xi − μ)

Set this equal to zero and solve; note that σ² is a positive constant with respect to μ:

0 = Σ_{i=1}^n (xi − μ̂)
0 = Σ_{i=1}^n xi − nμ̂
nμ̂ = Σ_{i=1}^n xi
μ̂ = x̄

Now for σ². Take the partial derivative with respect to σ:

∂ℓ/∂σ = −n/σ + (1/σ³) Σ_{i=1}^n (xi − μ)²

Set this equation equal to 0, multiply through by σ³, and solve for σ̂²:

0 = −nσ̂² + Σ_{i=1}^n (xi − μ̂)²

σ̂² = Σ_{i=1}^n (xi − μ̂)²/n


Chapter 3

Validation

In many textbooks, the author will say something like: "The waiting times between buses arriving at a bus stop follow an exponential distribution." How do we know if this is true? It is all good and well to say that a set of data follows a distribution, but how can we argue that it does?
It is important to remember that hypothesis tests are not conclusive, in that the results do not lead to concrete solutions. They test whether a particular parameter value is plausible given both the data and a margin of error. So how can we determine which distribution to use? Consider the following set of completely random data.

Order of Occurrence   Data Value
1                     1/4
2                     1/2
3                     3/4
4                     1

How can we examine the data in such a way that we can pick a distribution that seems to fit? The best solution is to graph it and compare it to known distributions.


3.1 χ² Test for Goodness of Fit

For discrete models, we can use the χ² Test for Goodness of Fit (sometimes referred to as Pearson's χ² Test). If we have data which can be categorized (into groups called bins), then we can use this test to compare a theoretical distribution to the distribution of a set of data. Consider the following example:

3.1.1 Example

A study compares two sub-groups of a population (males and females), and looks at whether or not the members of that population have cancer. The surveying team found 351 participants and recorded the following data:

            Males   Females   Total
Cancer       104      73       137
No Cancer     89      85       174
Total        193     158       351

The interest of this study is to determine whether or not having cancer is independent of one's sex. For this experiment, we assume that the two are independent. Let the null hypothesis be that these two are independent, and let the alternative hypothesis be that sex and having cancer are dependent.
Let X represent the sex of the unit, and Y represent whether or not the unit has cancer. Statistically speaking, we are interested in whether or not these two random variables are independent of each other. More specifically, we want to see if P(X = x, Y = y) = P(X = x)P(Y = y).


The underlying concept here is the multinomial distribution. Our data values are going to fall into 1 of 4 bins:

• Male with cancer
• Male without cancer
• Female with cancer
• Female without cancer

Based on the data, we can estimate the probability that a particular person of a particular gender has or does not have cancer. Expressed more generally, we have:

         X = 0    X = 1    Total
Y = 0     a        b       a + b
Y = 1     c        d       c + d
Total    a + c    b + d    a + b + c + d = n

Similarly, we can argue that each marginal distribution is binomial. Hence the expected number of males with cancer can be calculated as E[X] = np. And so, if we are interested in the expected number of males with cancer in a sample of size n, then we would be interested in n · P(X = 0) · P(Y = 0), assuming that we have independence. This becomes

n × (a + b)/n × (a + c)/n = (a + b)(a + c)/n

One can apply the same concept to the remaining three boxes, which will yield a chart of the expected values for each bin, as shown below:

            Male                       Female                     Total
Cancer      (a+b)(a+c)/n ≈ 75.33       (b+a)(b+d)/n ≈ 61.67       137
No Cancer   (c+d)(c+a)/n ≈ 95.68       (d+b)(d+c)/n ≈ 78.32       174
Total       193                        158                        351

We now have tables detailing the expected and the observed counts. The test statistic for the general Pearson test with k bins is

D = Σ_{i=1}^k (oi − ei)²/ei

where, for the ith bin, oi is the observed frequency and ei is the expected (theoretical) frequency. D is not exactly χ²; however, it is asymptotically χ² distributed, with one degree of freedom in this case. We need to determine the degrees of freedom for this statistic.


Using a calculator, we get an approximate test statistic of d = 14.03 on 1 degree of freedom. We calculate the probability of D > d as follows:

P(D > d) = 1 − P(D ≤ d) = 1 − P(D ≤ 14.03) < 1 − 0.999 = 0.001

Therefore, we have very strong evidence against our hypothesis and we reject it in favour of the alternative hypothesis: cancer and sex are dependent on each other.
The intricacies of using the χ² Test in another situation may not be apparent from the previous example. So consider the following set of data:

 47  199  120   59  204
128  408   47   48   79
 66   61   25   91  217

These data come from some unknown discrete distribution.

Figure 3.1: Empirical CDF of Observed Data Points

We begin by asking if we can model our data using a geometric distribution. Hence, we need to determine the parameter of our model. From the review of ML estimation for the geometric distribution, we obtain an estimate of the parameter p from our data:

p̂ = 1/x̄ = n/Σ_{i=1}^n xi = ((1/15) Σ_{i=1}^{15} xi)^{−1} = (1799/15)^{−1} ≈ 0.00834
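As a quick check, the estimate can be computed directly in R from the 15 observed values above:

x <- c(47, 199, 120, 59, 204, 128, 408, 47, 48, 79, 66, 61, 25, 91, 217)
1/mean(x)   # about 0.00834, the ML estimate of p used below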

Therefore, we hypothesize that the data follow a Geo(0.00833) distribution. The first order of business is to determine what intervals (bins) we should use for our data. The following rules should be imposed:

1. Let Xi be the number of data values landing in Bin i, let pi be the probability of a data value being in Bin i, and let n be the number of data values. Then E[Xi] = npi ≥ 5 for all i. This condition is necessary because the approximation to the χ² distribution requires the expected frequencies to be at least 5; if this is not the case, the approximation breaks down, in which case the G-test would be more appropriate.

2. Let B be the number of bins, and p be the number of parameters required to be estimated. Then the degrees of freedom (df) are B − p − 1. There must be at least 1 df.

3. The data must be discrete.

4. The bins may vary in size.

In situations requiring calculation by hand (such as in practice examples or tests), we may relax assumption 1 to make the question doable in the allotted time.
In our sample, at least three bins are needed in order to satisfy the degrees of freedom condition. Accordingly, we can divide our bins into the intervals [0,47], (47,120] and (120,∞). So, we get the following bins:

Bin 1: 25, 47, 47
Bin 2: 48, 59, 61, 66, 79, 91, 120
Bin 3: 128, 199, 204, 217, 408

Now, we can state the null and alternative hypotheses:
H0: the data collected are governed by a Geometric distribution with parameter 0.00833.
HA: the data are not governed by a Geometric distribution with parameter 0.00833.


The test statistic is D = Σ_{i=1}^k (oi − ei)²/ei. For the geometric, we have

P(Bin 1) = P(0 ≤ X ≤ 47) ≈ 0.3307
P(Bin 2) = P(48 ≤ X ≤ 124) ≈ 0.3178
P(Bin 3) = P(125 ≤ X < ∞) ≈ 0.3515

The R code used to get these results is as follows:

> pgeom(47, 0.00833)
[1] 0.3306945
> pgeom(124, 0.00833) - pgeom(47, 0.00833)
[1] 0.3178285
> 1 - pgeom(124, 0.00833)
[1] 0.351477

From there, we can use the formula to get the test statistic:

D = Σ_{i=1}^3 (oi − ei)²/ei
  = (3 − 15 · 0.3307)²/(15 · 0.3307) + (7 − 15 · 0.3178)²/(15 · 0.3178) + (5 − 15 · 0.3515)²/(15 · 0.3515)
  ≈ 1.8349

Now we have enough information to finish this example. We have a test statistic of 1.8349 with 1 degree of freedom, so going back to the original question:

P(D > d) = 1 − P(D ≤ d) = 1 − P(D ≤ 1.835)

We end up with two possible solutions depending on what we do from here. We can either use a simple R routine to get a more accurate answer, or we can refer to the χ² table for a wider approximation. Let's compare both methods. Using the R command:

> 1 - pchisq(1.8349, 1)

This yields the value 0.179. Now, looking at the χ² table with 1 degree of freedom, we have 0.5 < P(D ≤ 1.8349) < 0.9, which implies 1 − 0.5 > 1 − P(D ≤ 1.8349) > 1 − 0.9, so the p-value lies in the range (0.10, 0.50). The latter is a wide interval; however, it shows that our null hypothesis will not be rejected at a 10% significance level. The value 0.179 is nicer since it gives a better indication of the exact magnitude.
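The whole calculation can be collected into a few lines of R. This sketch simply repeats the steps above: observed bin counts, bin probabilities from the fitted Geometric(0.00833), test statistic and p-value.

obs <- c(3, 7, 5)
probs <- c(pgeom(47, 0.00833),
           pgeom(124, 0.00833) - pgeom(47, 0.00833),
           1 - pgeom(124, 0.00833))
expected <- 15 * probs
D <- sum((obs - expected)^2 / expected)   # about 1.83
1 - pchisq(D, df = 1)                     # about 0.18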


3.2 Kolmogorov-Smirnov Test

This test compares a theoretical distribution F to the empirical cumulative distribution function (ECDF) from the data, F̂. If F and F̂ appear close, then we will assume that our data are F distributed.

Step 1 Build an ECDF from our data. We determine how far F̂ and F are from each other. We call this distance, d, the K-S distance.

Step 2 We ask if d is "large." We compare d to the distances we would get by comparing F to ECDFs of data generated from F. If d is relatively small, then we would expect the data to be F distributed (i.e., if our p-value is small, we reject F as the distribution).

3.2.1 Building an Empirical CDF

First of all, we assume that every observation has a weighted probability 1/n, where n is the number of observations. The empirical CDF is defined as follows:

F̂(x) = (1/n) Σ_{i=1}^n I(xi ≤ x)

This is the proportion of observations with values less than or equal to x. Take the example of a random set of data: {1, 1, 2, 5}. This has the following ECDF.

Figure 3.2: ECDF of Observations
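The same ECDF can be produced with R's built-in ecdf() function (a quick illustration using the data set {1, 1, 2, 5}):

Fhat <- ecdf(c(1, 1, 2, 5))
Fhat(c(0, 1, 2, 5))            # 0.00 0.50 0.75 1.00
plot(Fhat, main = "Empirical CDF")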

F̂ can be built for discrete and continuous data. We assume for the contin-uous data that P (xi = a; xj = a)=0, for i 6= j.We create an Empirical CDF as follows:


1. Order the data.

2. Assume each point is equiprobable.

3. Make a step function Ix/n, where Ix is the number of data points less than or equal to x.

3.2.2 The K-S Distance

If our ordered data was a1; a2; : : : ; an at every step of our ECDF, we willcalculate two distances: di;U = F (ai) � F̂ (ai) and di;D = F (ai) � F̂ (ai�1).We�ll call these the �up� and �down� distances. The K-S distance, d; is themaxfdi;U ; di;Dg.

Example:

We have received the data set {1,4,9,16}, and we suspect that it follows anexponential distribution.

[Plot: empirical CDF of the data {1, 4, 9, 16}]

Step 1 Build the empirical cdf

Since we believe that this follows an exponential distribution, we can estimate its parameter:

λ̂ = 1/x̄ = 2/15

Now, we can look at the comparison between the theoretical and empirical CDF probabilities:


Figure 3.3: Theoretical vs. Empirical cdfs

F(1) = 1 − e^{−2/15}    F(4) = 1 − e^{−8/15}    F(9) = 1 − e^{−18/15}    F(16) = 1 − e^{−32/15}

F̂(1) = 1/4    F̂(4) = 1/2    F̂(9) = 3/4    F̂(16) = 1

Let d_{i,U} represent the "up" distance and d_{i,D} the "down" distance for observation i. Then, for this set of data, we have the following:

d_{1,U} = 0.1252   d_{2,U} = 0.0866   d_{3,U} = 0.0512   d_{4,U} = 0.1184
d_{1,D} = 0.1248   d_{2,D} = 0.1634   d_{3,D} = 0.1980   d_{4,D} = 0.1316

Thus, our K-S distance is d = 0.1980.
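These distances are easy to reproduce in R (a sketch of the calculation above, with the exponential fitted by λ̂ = 1/x̄):

x <- sort(c(1, 4, 9, 16)); n <- length(x)
Ft <- 1 - exp(-x / mean(x))        # fitted exponential CDF at the data points
dU <- abs(Ft - (1:n) / n)          # "up" distances
dD <- abs(Ft - (0:(n - 1)) / n)    # "down" distances
max(c(dU, dD))                     # about 0.198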

Step 2 Create n data sets derived from F:

At this point, we generate n data sets from F, build their ECDFs, and denote them F̂i (i = 1, ..., n). Then, we calculate the K-S distance Di between F and each F̂i. Finally, we consider the frequency with which Di > d. The more i's that satisfy this, the more acceptable our fit will be.

The following are examples of how to perform a KS test using R. The first method uses R code to calculate a p-value.

# set up for the data of interest
# sorted random exponential with rate 1
x <- sort(rexp(10, rate = 1))


Figure 3.4: Theoretical vs. Empirical cdfs with K-S Distances

plot(ecdf(x))                         # Plots the empirical CDF
lines(x, pexp(x, rate = 1/mean(x)))   # Adds the fitted exponential CDF
ks.test(x, "pexp", 1/mean(x))         # Calculates the KS test

The following is the code for the KS test. As an exercise, create your owncode for the KS test.

#****************************************************
# KSEXP will perform a KS test and return the p-value
KSEXP <- function(data, N, d){
  data <- sort(data)
  rate <- 1/mean(data)
  # ks is the set of all ks distances
  ks <- NULL
  for (i in 1:N){
    # generate data from EXP(rate); rdata is the random data
    set.seed(i)
    rdata <- sort(rexp(length(data), rate = rate))
    rrate <- 1/mean(rdata)
    D <- max(abs(1 - exp(-rrate*rdata) - seq(1/length(data), 1, by = 1/length(data))),
             abs(1 - exp(-rrate*rdata) - seq(0, 1 - 1/length(data), by = 1/length(data))))
    ks <- cbind(D, ks)
  } # end for
  Pvalue <- mean(ks > d)
  return(Pvalue)
} # end fn KSEXP
#*****************************************************
# Generate an exponential random variable
EXPGEN <- function(rate, time, seed){
  # set.seed(seed)
  newtime <- rexp(1, rate = rate) + time
  return(newtime)
}

To calculate the p-value for P(D ≥ d), we want the percentage of times Di is bigger than d (that is, the percentage of the time that the generated data are a worse fit than our observed data):

P(D ≥ d) ≈ (proportion of times Di ≥ d) = (1/n) Σ_{i=1}^n I(Di ≥ d)

If our p-value is small, then d is too large and we reject our hypothesis. We use our usual 5% rejection rule. Because the p-value is computed from randomly generated data sets, it will vary every time the test is done. To reduce this variation, we generate a large number of sets; we will still see some variation in the p-value, but the larger the number of generated sets, the smaller the variation will be.


Solutions to Chapter 3 Problems

3.3.1 Given the following data set, perform a χ² test for Goodness of Fit using a Binomial distribution with n = 20 and p̂ = 0.5.

13  8  8 12 11
 9  9 10 15 13
11  7  8 10 10
 9  8 11 13  9

3.3.1 Solution We will need at least two bins for this test. Construct the bins as follows:

Bin 1: x ∈ [0, 9], with expected count E[Bin 1] = 20 · P(Bin 1) = 20 · 0.4119015 = 8.23803
Bin 2: x ∈ [10, 20], with expected count E[Bin 2] = 20 · P(Bin 2) = 20 · 0.5880985 = 11.76197

The observed counts for the bins are 9 and 11. Then the test statistic becomes

t = (9 − 8.23803)²/8.23803 + (11 − 11.76197)²/11.76197 ≈ 0.1198

We are testing the hypothesis H0: p = 0.50 versus HA: p ≠ 0.50.

Then, on 1 degree of freedom,

P(T > t) = 1 − P(T ≤ t) = 1 − P(T ≤ 0.1198) ≈ 73% from R (between 0.10 and 0.90 from the tables)

From this we have no evidence against the hypothesis that these data follow a Binomial distribution with probability of success 0.50. Note that the bins used for this test are one of many valid choices.

3.3.2 Given the following data determine an appropriate distribution and per-form a �2 Test using MLE.

10  5  3  5  5
10  1  3  8  5
 4  9  7  5  4


3.3.2 Solution Solutions will vary, but should take the following form.

Look at the Empirical CDF (note, data was modi�ed using the jitter functionso that no points coincide with each other)

Graph NeededThis looks like it could follow a Poisson Distribution or Binomial. The MLE

for the rate parameter for the Poisson Distribution is �x. At least two bins willbe required. Follow procedure as above.

3.3.3 Given the following set of Geometric data:

1 3 0 0 5
0 0 0 8 0
2 4 3 0 1
0 0 0 1 3

3.3.3a Perform a �2 test under the assumption that p̂ = 0:60.

3.3.3a Solution We require at least two bins for this test to satisfy the degrees of freedom. We will construct the bins as follows:

Bin 1: x = 0, with expected count E[Bin 1] = 20 · P(Bin 1) = 20 · 0.60 = 12
Bin 2: x ∈ [1, ∞), with expected count E[Bin 2] = 20 · P(Bin 2) = 20 · 0.40 = 8

The observed counts for the bins are both 10. Then the test statistic becomes

t = (10 − 12)²/12 + (10 − 8)²/8 = 0.8333

We are testing the hypothesis H0: p = 0.60 versus HA: p ≠ 0.60.

Then, on 1 degree of freedom,

P(T > t) = 1 − P(T ≤ t) = 1 − P(T ≤ 0.8333) ≈ 36.1% from R

From this we have no evidence against our hypothesis that p = 0.60.

3.3.3b Use the following ML estimator for p and perform the same test:

p̂ = (1 + (1/n) Σ_{i=1}^n xi)^{−1}


3.3.3b Solution Find an estimate for p̂:

p̂ = (1 + (1/n) Σ_{i=1}^n xi)^{−1} = (1 + 31/20)^{−1} ≈ 0.3922

Now create bins in a similar manner to above.

Bin 1: x = 0, with expected count E[Bin 1] = 20 · P(Bin 1) = 20 · 0.3922 = 7.8431
Bin 2: x ∈ [1, ∞), with expected count E[Bin 2] = 20 · P(Bin 2) = 20 · 0.6078 = 12.1569

The observed counts for the bins are both 10. Then the test statistic becomes

t = (10 − 7.8431)²/7.8431 + (10 − 12.1569)²/12.1569 ≈ 0.9742

We are testing the hypothesis H0: p = 0.3922 versus HA: p ≠ 0.3922.

Then, on 1 degree of freedom,

P(T > t) = 1 − P(T ≤ t) = 1 − P(T ≤ 0.9742) ≈ 32.4% from R

From this we have no evidence against our hypothesis that p = 0.3922.

3.3.4 Using R, create your own

3.3.4a �2 Test for Goodness of Fit

3.3.4b KS Test

3.3.4c Using the following data set and the KS Test that you made in 3.3.4b,perform a KS Test for the distribution Normal(-1,2)

-3.4749  1.4955 -1.7907 -1.2529 -5.4987
-2.6497 -3.4814  0.6968 -1.1721  3.2448
-3.9419 -3.9650  4.2603  1.9561 -1.9241
-1.2756  4.9261 -5.3436 -3.2091  1.7410


3.3.5 Using the following data, determine an appropriate distribution, estimatetheir parameters and perform a KS Test.

0.1356 0.3759 0.0158 0.1043 0.0406
0.5534 0.7184 0.2038 0.1298 0.0265
1.2225 0.7828 0.3368 0.0490 0.5329

Look at the Empirical CDF

Graph Needed

Compare this to theoretical CDFs. Many distributions could work; however, we will only show the steps for the Exponential distribution.

This distribution has only one parameter, θ, with MLE θ̂ = (1/n) Σ_{i=1}^n xi = 5.2281/15 ≈ 0.3485. Then, using the following R code

ks.test(data, pexp, rate = 0.3485^(-1))

we obtain a p-value of 0.8706.

3.3.6 Explain the variance in the p-value of the KS Test. How can this variancebe reduced? Is this variance present in other tests like the �2?

3.3.6 Solution The variance in the KS Test p-value is a result of the generationof theoretical data points to be used to compare against the ECDF. Thisvariance can be reduced by increasing the number of generated functions.This is not present in most other test statistics since they do not involvethe generation of random values, but rely on the analysis of realizations.

3.3.7 Download the data �le ks1.txt from the course website. Determine anappropriate model, use Maximum Likelihood to estimate its parametersand then perform a KS Test on your results.

3.3.7 Solution The ECDF looks like that of a Normal random variable. Meanis 1.94 with SD 1.05.


Chapter 4

Queuing Systems

4.1 Introduction

Waiting in queues (British word for �lines�) is a phenomenon that occurs every-where. For example, we wait in checkout lines in grocery stores, or we wait tobe served in a single line in a fast food restaurant. A queueing system usuallyconsists of multiple servers, which can be in groups or in sequence. For simplic-ity, suppose we have a queueing system with two distinct servers, named Server1 and Server 2. There are three ways in which these servers can be arranged.We mention only two of them here:First, as illustrated below, we could have Server 1 serving a person from the

queue and then send him/her to Server 2.

Figure 4.1: Queue with Two Servers in Series

A second type of arrangement is shown below in Figure 4.2. Server 1 andServer 2 share one queue and can each serve no more than one unit (note thatit�s possible that both servers are busy). This is the grocery store model (serversare cashiers), or a mode of transportation model (servers could be bus routesto the same location). We call these parallel servers.


Figure 4.2: Queue with Two Servers in Parallel

This example raises an important topic with queues: their types. This could happen in a grocery store: if Server 2 is the cashier for the express lane (8 items to purchase or less) and Server 1 is a regular service lane, then they would have different serving times. The following is a list of some of the different types of service discipline:

• FIFO (First In, First Out): In computer science terms, this is a "queue." For example, in a grocery store, the shopkeepers attempt to sell their oldest stock first to avoid spoilage.

• LIFO (Last In, First Out): In computer science terms, this is a "stack." Having dishes in a stack is a good example of this.

• Priority Queue: Units in a queue are given different priority ranks, making the order of service dependent on their priority ranking. An excellent example of this is an emergency room, where patients are attended to according to the severity of their injuries/condition.

• SIRO (Service In Random Order): Units in the queue are served at random. For example, collision resolution protocols in cable access networks operate in a manner quite similar to SIRO.

• SPT (Shortest Processing Time): Units are ranked by which takes the least amount of time to serve. When I do my assignments, I tackle the easiest questions first, leaving the hardest questions to the very end.


4.1.1 Notation

Now that we have a general understanding of what a queue is, we need a common method of describing what we are talking about. Any queue can be written in the following form:

A/B/C/D/E

• A - Arrival distribution. M represents Markovian, which means the arrival process is Poisson; D represents a constant (deterministic) distribution; G represents a General distribution.

• B - Departure (service) distribution. The same letters as in A are used in this case.

• C - Number of servers. Note that this does not provide information on the arrangement of the servers.

• D - System capacity (if omitted, it is assumed to be infinite). This is the maximum number of units that can be in the queue at any moment.

• E - Calling population (if omitted, it is assumed to be infinite). This is the maximum number of units that can be put into the queue.

4.2 One Server Systems

Perhaps the simplest version of the server is a single server system: a queue that has only one server. This type of system is nice and simple, and builds the basics for us to use. We will impose the assumption that arrivals and departures follow a Homogeneous Poisson Process, which means:

1. N(t) is a non-decreasing function of time t, with the initial condition N(0) = 0.

2. The process has independent increments. Specifically, no interval is dependent on a prior non-overlapping interval. Mathematically:

   N(s1 + t1) − N(s1) is independent of N(s2 + t2) − N(s2)

   for s1 + t1 < s2, with s1, s2, t1, t2 > 0.

3. The number of events occurring over a period (0, t] follows a Poisson distribution with parameter λt. If λ were not constant, then we would have a Non-Homogeneous Poisson Process.

First we are going to need some tools to help us understand the distribution of the events in a queue. Assume that we have a one-server queue in which elements enter the system but never leave the line without being served (one example is standing in line at the post office). Recall that we will only be using Markovian arrivals and departures, since that simplifies the mathematics (we will be dealing with Poisson and exponential random variables). Begin with the first arrival: what is the distribution of the time until the first event? More specifically, what is the probability that at least one event occurs in the interval (0, t], given that no events have occurred at present? This is

P(N(t) − N(0) > 0 | N(0) = 0) = P(N(t) > 0)/P(N(0) = 0) = (1 − e^{−λt})/1 = 1 − e^{−λt}

This is exponential! This is interesting because:

• It's a distribution with which we are familiar.

• It has a simple pdf and cdf.

• It also has the memoryless property.


We can handle arrivals. What if we also have departures? Assume that we are in a post office and people leave the line after realizing they filled out the wrong form for their package. This is a model where arrivals happen at a rate λ_a and departures happen at a rate λ_d. Recall that this is called an M/M/1 server. Assuming that the queue isn't empty, what is the probability that an arrival will occur before a departure? Let A denote the time to the next arrival and D the time to the next departure. We get:

P(A < D) = ∫_{-∞}^{∞} P(A < D | D = t) f_D(t) dt,   by conditioning on D
         = ∫_{-∞}^{∞} P(A < t) f_D(t) dt,           by independence
         = ∫_{-∞}^{∞} P(A ≤ t) f_D(t) dt,           since A is continuous, this is a cdf
         = ∫_{-∞}^{∞} F_A(t) f_D(t) dt
         = ∫_{0}^{∞} (1 - e^{-λ_a t}) λ_d e^{-λ_d t} dt
         = λ_d ( ∫_{0}^{∞} e^{-λ_d t} dt - ∫_{0}^{∞} e^{-(λ_a + λ_d) t} dt )
         = 1 - λ_d/(λ_a + λ_d)
         = λ_a/(λ_a + λ_d)

Example The Campus Tech store will receive Apple Macbooks 2 times a week and can sell them to a student at a rate of 1 per week. If all events are Poisson processes, what is the probability that the store will receive a Macbook before selling it?

Solution: Let λ_a denote the arrival rate of Macbooks at the store, and λ_d the selling rate of Macbooks to students. Thus λ_a = 2 and λ_d = 1, and so

P(Arrive < Sell) = λ_a/(λ_a + λ_d) = 2/3.
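As an illustration, here is a minimal R sketch that checks this probability by simulating competing exponential clocks; the rates and sample size are the ones from the example and are only illustrative.

# Minimal sketch: estimate P(arrival before departure) by simulation.
lambda_a <- 2      # arrival rate (Macbooks received per week)
lambda_d <- 1      # selling rate (Macbooks sold per week)
A <- rexp(100000, rate = lambda_a)   # time to next arrival
D <- rexp(100000, rate = lambda_d)   # time to next departure
mean(A < D)                          # should be close to 2/3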

4.2.1 Algorithm

The following is an algorithm to generate a single server system. For this to work, we use the following variables:

• Time Variables
  - t represents time.

• Counter Variables
  - N_A and N_D represent the number of arrivals and departures by time t, respectively.

• System State Variables
  - n represents the number of customers in the system.

• Event List
  - t_A, t_D represent the times of the next arrival and departure, respectively.

For this algorithm, the termination of the queue is at a particular time T. This value is predetermined.

• Set t = N_A = N_D = 0.
• Set the system state to zero (n = 0).
• Generate t_A and set t_D = t_A + 1.

Then we proceed with the cases to update the system.

Case 1: t_A ≤ t_D, t_A ≤ T

• Reset t = t_A. This sets the system time to the most recent event time (the arrival).
• Reset n = n + 1. This increases the current number of units in the queue by one.
• Reset N_A = N_A + 1. This increases the number of arrivals by one.
• If n = 1, generate a service time, Y, and set t_D = t + Y. If this is the first unit in the system, then its departure time must be generated.

Case 2: t_D < t_A, t_D ≤ T

• Reset t = t_D. This sets the system time to the most recent event time (the departure).
• Reset n = n - 1. This decreases the current number of units in the queue by one.
• Reset N_D = N_D + 1. This increases the total number of departures by one.
• If n = 0, reset t_D = t_A + 1. Otherwise, generate a service time Y and reset t_D = t + Y.
• Notice that this algorithm puts arrivals before departures (even if they occur simultaneously).

Case 3: min(t_A, t_D) > T, n > 0

• Reset t = t_D.
• Reset n = n - 1.
• Reset N_D = N_D + 1.
• If n = 0, reset t_D = t_A + 1. Otherwise, generate a service time Y and reset t_D = t + Y.
• This is the instance where no more units are permitted to enter the queue; however, there are still units in the queue which require service.

Case 4: min(t_A, t_D) > T, n = 0

• Output desired information.
• No items remain in the queue and no arrivals can occur. This is the terminating case.

#*****************************************************#
# Modified: MM1 Code
REPAIRSTAT340 <- function(birth, death, events, machines){
  # Define the EL = (next birth = Ta, next death = Td)
  TimeList <- rep(0, machines + 1)   # a list of times for each state 0 to m
  n <- 0
  t <- 0
  prevtime <- 0
  infinit <- 1000000
  Td <- infinit
  Ta <- EXPGEN(birth, t, 10)
  for (event in 1:events){
    # Arrival before departure
    if (Ta <= Td){
      prevtime <- t
      t <- Ta
      TimeList[n+1] <- TimeList[n+1] + t - prevtime
      n <- n + 1
      Td <- EXPGEN(death, t, 10)
      if (n == machines) Ta <- infinit   # no arrival allowed
      else Ta <- EXPGEN((machines - n)*birth, t, 10)
    } #endif
    # Departure before arrival
    else if (Td <= Ta){
      prevtime <- t
      t <- Td
      TimeList[n+1] <- TimeList[n+1] + t - prevtime
      n <- n - 1
      Ta <- EXPGEN((machines - n)*birth, t, 10)
      Td <- EXPGEN(death, t, 10)
      if (n == 0) Td <- infinit
    } #endelse
  } #endfor
  return(TimeList/(sum(TimeList)))
}

#*****************************************************#
# Testing the code:
#------------------
birthr <- 0.1
deathr <- 0.1
m <- 6
dist <- .25

# Acceptable:
ACCEPT5 <- function(pvalue){
  # Gives a range of p-values to accept
  lower <- 0.3642
  upper <- 0.4924
  if (pvalue < upper){
    if (pvalue > lower) ans <- "pass"
    else ans <- "notpass"
  }
  else ans <- "notpass"
  return(ans)
}

4.3 Two Server System with Servers in Series

Modelling single-server queues is definitely a good starting point; however, it has limited applications, since the real world tends to be more interesting and complicated. There are two major differences that we will begin to address in multi-server systems: server formation and the mathematics. On the plus side: one affects the other. On the down side: one of them can get really ugly with many servers. So now let us begin with a two-server system!

If you recall the picture from the preamble of this chapter with Server 1 and Server 2, we can see that systems of servers can be classified by how units move through them. Systems where each unit in the queue goes to a unique server and is then finished are called parallel systems, whereas systems where each unit is served first by one server and then by the other are called series systems.

Examples of a series system can be found in many service-oriented industries. One example is the Canadian health system: each unit is a patient in the system; they enter the hospital to be served and then, five days later, they see a doctor. Another service example: every time the author of this text goes to a certain store, he gets served by a cashier. The cashier always makes a mistake and then the author has to go to the customer service desk to fix the problem.

Other examples include a sequence of interviews where the interviewee goes from one interviewer to the next.

4.3.1 Algorithm

Generating these servers using computers can be very confusing; here is an example which is fairly intuitive, but still takes a bit of time to get accustomed to. It is important to understand the notation of the algorithm before you begin reading it.

• Time Variables
  - t represents time.

• System State Variables (shorthand SS)
  - (n_1, n_2) represents the number of units at each server (including the unit in service). The subscript represents the i-th queue; in this case i = 1, 2.

• Counter Variables
  - N_A represents the number of arrivals by time t.
  - N_D represents the number of departures by time t.

• Output Variables
  - A_1(n): the arrival time of unit n into the system, and hence into server #1's queue.
  - A_2(n): the arrival time of unit n into server #2's queue.
  - D(n): the departure time of unit n from the queuing system.

• Event List
  - t_A, t_1, t_2: each represents the time of the next event for each area of the queue. t_A represents the next arrival time to the queue, t_1 represents the completion/departure time at Server 1, and t_2 represents the completion/departure time at Server 2. We are assuming that arrivals happen indefinitely; however, if there are no units at Server 1 or at Server 2, then we must set the corresponding time to a sufficiently large number so that the algorithm works (see below for illustration). For simplicity, we will use ∞ to symbolize this number.

It is important to note that this algorithm doesn't have a finish time and is theoretically infinite. To address this, the programmer must decide what the limit of the algorithm is, whether it be that the code will run for a certain amount of "time" or that only a certain number of people can be served. It is important to have some aptitude in programming for simulations and to be certain that you can investigate stopping conditions.

• Initialization:
  - Set t = N_A = N_D = 0.
  - Set SS = (0, 0).
  - Generate an arrival time, t_A, and then set t_1 = t_2 = ∞.

Now that everything is set up, the following algorithm will collect data.

Case 1: t_A = min(t_A, t_1, t_2)

• Reset t = t_A.
• Reset N_A = N_A + 1.
• Reset n_1 = n_1 + 1.
• Generate a new arrival time and set t_A equal to it.
• If n_1 = 1, generate a departure time and set t_1 equal to it.

Case 2: t_1 = min(t_A, t_1, t_2)

• Reset t = t_1.
• Reset n_1 = n_1 - 1, n_2 = n_2 + 1.
• If n_1 = 0, then set t_1 = ∞; otherwise, generate a new completion time for server one and set t_1 equal to it.
• If n_2 = 1, generate a new completion time for server two and set t_2 equal to it.

Case 3: t_2 = min(t_A, t_1, t_2)

• Reset t = t_2.
• Reset N_D = N_D + 1.
• Reset n_2 = n_2 - 1.
• If n_2 = 0, then set t_2 = ∞.
• If n_2 > 0, then generate a new completion time for server two and set t_2 equal to it.

Remember, when generating a new service time, to add the current time to the generated value so that the event occurs in the future. A sketch of this algorithm in R is given below.
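The following is a minimal R sketch of the series algorithm above, assuming exponential inter-arrival and service times; the function name seriesQueue and the rates in the test call are illustrative choices, not part of the course's own code.

# Minimal sketch: two servers in series with Poisson arrivals (rate lambda)
# and exponential service times (rates mu1, mu2), run for maxEvents events.
seriesQueue <- function(lambda, mu1, mu2, maxEvents){
  t <- 0; nArr <- 0; nDep <- 0
  n1 <- 0; n2 <- 0
  tA <- rexp(1, lambda); t1 <- Inf; t2 <- Inf
  for (k in 1:maxEvents){
    if (tA == min(tA, t1, t2)){            # Case 1: an arrival occurs
      t <- tA; nArr <- nArr + 1; n1 <- n1 + 1
      tA <- t + rexp(1, lambda)
      if (n1 == 1) t1 <- t + rexp(1, mu1)
    } else if (t1 == min(tA, t1, t2)){     # Case 2: server 1 finishes a unit
      t <- t1; n1 <- n1 - 1; n2 <- n2 + 1
      t1 <- if (n1 == 0) Inf else t + rexp(1, mu1)
      if (n2 == 1) t2 <- t + rexp(1, mu2)
    } else {                               # Case 3: server 2 finishes a unit
      t <- t2; nDep <- nDep + 1; n2 <- n2 - 1
      t2 <- if (n2 == 0) Inf else t + rexp(1, mu2)
    }
  }
  c(time = t, arrivals = nArr, departures = nDep, n1 = n1, n2 = n2)
}
seriesQueue(2, 3, 3, 10000)   # illustrative rates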

4.4 Two-server System with Servers in Parallel

Even in the two-server case, the orientation of the servers can make for a complicated system. Even though it is beyond the scope of this text, a three-server queue has 4 unique ways to be set up under the assumption that the servers are distinct. Parallel systems add to the dynamics of the system, since multiple servers can work on units in the queue simultaneously, as in the famous grocery store model. Fortunately, given our assumption of a Poisson process, one will find that the math becomes more of an exercise in managing the particulars of the model than a computational nightmare. However, it can still be computationally challenging.

As an exercise, alter the algorithm given in the previous section to simulate a two-server system in which the two servers work in parallel. How will this algorithm change if units are placed into separate lines for each server (a more accurate representation of the grocery store model)?

4.5 Repair Problem

The repair problem is a variant on a common statistical problem known as a Random Walk. A random walk is a model where we map a series of outcomes or events to the integers (normally; the reals can also be used). Travelling from one state to another has its own probability, and each state can be reached from any other state in a finite number of state transitions. Under this model, we have births and deaths. A birth increases the number of units in the queue, whereas a death decreases the number of units in the queue. It is important to realize that these do not always represent biological births and deaths.

Example Consider a computer lab on campus: the lab has a certain number of computers, say 30, and technicians, say 2. If a computer breaks down, then a technician will begin working on the computer to fix it. A computer breaking down constitutes a birth, as it adds a unit to the serving system, whereas a technician finishing fixing a computer constitutes a death, since it takes a unit out of the server. We generally assume that only one server can work on a unit at a time.

Now, assume that we have a time-homogeneous Poisson process representing the computer repair scenario described above, except with fewer computers: 4 computers and 2 technicians (servers). The time until a working machine breaks down follows an exponential distribution with rate 1; the repair times, R_i, are exponential with rate 2. Again, a computer breaking is a birth and a computer being successfully fixed is a death. This is an M/M/2/4/4 queue. We have the following variables:

• n := number of units in the queue.

• λ_n := rate of births in state n.

• μ_n := rate of deaths in state n.

• π_n := proportion of time spent in state n.

Theorem 5. The rate leaving state n is equal to the rate entering state n. Which is to say:

π_n(λ_n + μ_n) = π_{n-1} λ_{n-1} + π_{n+1} μ_{n+1},  with π_0 λ_0 = π_1 μ_1;

therefore,

π_n = π_0 · ∏_{i=0}^{n-1} (λ_i / μ_{i+1}).

For the births: suppose we are in state 0; then to get to state 1, one of the four computers must break down, so the time to the next birth is Exp(4). From state 1 to 2, the time to the next breakdown is Exp(3), and so forth. For the deaths: moving from states 2, 3 or 4 into the next lowest state is modelled by an Exp(4) distribution (two technicians each repairing at rate 2), whereas moving from state 1 to 0 follows an Exp(2) distribution.

We are interested in knowing how often we are in state n, and in which state we spend the most time on average. Recall π_n(λ_n + μ_n) = π_{n-1} λ_{n-1} + π_{n+1} μ_{n+1}:

π_0(0 + 4) = 2 π_1
π_1(3 + 2) = 4 π_0 + 4 π_2
π_2(2 + 4) = 3 π_1 + 4 π_3
π_3(1 + 4) = 2 π_2 + 4 π_4
π_4(0 + 4) = 1 π_3

This is a linear system of equations, which is solvable. We end up with the solution

π = (16/87) · (1, 2, 3/2, 3/4, 3/16).

Be sure to check that these values sum to 1. The state in which we spend the largest proportion of time is the one with the largest π_n, which is state 1 (π_1 = 32/87). The expected number of machines in the repair system is

E[N(t)] = Σ_n n·π_n = 0·π_0 + 1·π_1 + 2·π_2 + 3·π_3 + 4·π_4 = 128/87 ≈ 1.47.

The following graphic is called a System State Diagram.

Figure 4.3: System State Diagram for Computer Problem
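As a quick numerical check, here is a minimal R sketch that computes these proportions from the product formula in Theorem 5; the vector names are illustrative.

# Minimal sketch: stationary distribution of the 5-state birth-death chain
# (4 machines, 2 technicians) via pi_n = pi_0 * prod(lambda_i / mu_{i+1}).
lambda <- c(4, 3, 2, 1, 0)   # birth rates in states 0..4
mu     <- c(0, 2, 4, 4, 4)   # death rates in states 0..4
p <- numeric(5); p[1] <- 1
for (n in 2:5) p[n] <- p[n-1] * lambda[n-1] / mu[n]
p <- p / sum(p)
p                    # equals (16/87) * c(1, 2, 3/2, 3/4, 3/16)
sum(0:4 * p)         # expected number in the system, about 1.47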

4.6 Chapter Problems

4.6.1 Describe the following queues in terms of course terminology. Identify whether these are series or parallel queuing systems.

4.6.1i A gas station with 8 pumps.

4.6.1i Solution This is an M/M/8 queue (similarly M/M/8/n/t if there is a specified maximum capacity for waiting cars in the station, n, and a maximum number of vehicles registered in a particular area, t). This is a FIFO server in parallel.

4.6.1ii A penalty box in a game of hockey.

4.6.1ii Solution This is an M/M/3/5/5 queue (per team). It is a FIFO server in parallel.

4.6.1iii A stop light on a four-way street.

4.6.1iii Solution This is an M/M/4 queue (can vary if a limit on vehicles in wait is in place). This is a FIFO server in parallel.

4.6.1iv A stack of dishes.

4.6.1iv Solution This is an M/M/1 queue (can vary if there is a finite number of dishes available). This is a LIFO server in series.

4.6.1v A buffet table with three stations.

4.6.1v Solution This is an M/M/3 queue (can vary with limitations imposed on the population available to the buffet). This is a FIFO server in series.

4.6.2 A pawn shop will receive musical items to sell 3 times a week, and sell them to a customer at a rate of 2 per week. Treat all actions involving the trading of stock as Poisson processes, and assume that there will never be insufficient stock to sell to customers. Answer the following:

4.6.2a What is the probability that the store will receive a musical item before selling one? What is the probability of selling before receiving?

4.6.2a Solution Let A denote an arrival and D a departure of a musical item at the pawn shop. Then A and D follow Poisson processes with rates λ_A = 3 and λ_D = 2 respectively. Then we have

P(A < D) = λ_A/(λ_A + λ_D) = 3/(3 + 2) = 3/5

and similarly

P(A > D) = λ_D/(λ_A + λ_D) = 2/(3 + 2) = 2/5.

Thus the probability that a musical item will arrive before one leaves is 3/5, and the complementary event has probability 2/5.

4.6.2b What is the probability that the store will neither receive nor sell any stock in one week, given that at the beginning of the week it has sold nothing?

4.6.2b Solution Let E denote any event happening with respect to the musical stock in the pawn shop. Then the combined process is Poisson with rate λ_E = λ_A + λ_D = 3 + 2 = 5. Then we want

P(N(1) = 0 | N(0) = 0) = P(N(1) = 0)/P(N(0) = 0) = (5^0 e^{-5}/0!)/(0^0 e^0/0!) = e^{-5} ≈ 0.00674

4.6.3 The surviving cast of the original Star Trek is at a convention and is approached by various nerds. People dressed up as Klingons arrive at a rate of 6 per minute and leave at a rate of 2 per minute. People dressed up as Vulcans arrive at a rate of 4 per minute and leave at a rate of 3 per minute. People dressed up as gaseous energy beings appear at a rate of 1 per minute and leave at a rate of 1 per minute.

4.6.3a What is the probability that a fan will arrive before a fan will leave?

4.6.3a Solution We can summarize this by using two different events: a fan arrives, with rate λ_A = 6 + 4 + 1 = 11, and a fan leaves, with rate λ_D = 2 + 3 + 1 = 6. Then this becomes

P(A < D) = λ_A/(λ_A + λ_D) = 11/(11 + 6) = 11/17

4.6.3b What is the probability that a Klingon will leave before a Vulcan?

4.6.3b Solution Similar to the previous part, we now compare a Klingon departure (rate λ_{K_D} = 2) with a Vulcan departure (rate λ_{V_D} = 3):

P(K_D < V_D) = λ_{K_D}/(λ_{K_D} + λ_{V_D}) = 2/(2 + 3) = 2/5

4.6.3c What is the probability that a gaseous energy being will arrive before a Klingon arrives and a Vulcan leaves?

4.6.3c Solution This is a slight modification of the previous question. We want the first event to be a gaseous arrival G_A rather than a Klingon arrival K_A or a Vulcan departure V_D:

P(min(G_A, K_A, V_D) = G_A) = λ_{G_A}/(λ_{G_A} + λ_{K_A} + λ_{V_D}) = 1/(1 + 6 + 3) = 0.10

4.6.3d What is the expected number of fans to arrive at the event in a four-hour period?

4.6.3d Solution Fan arrivals form a Poisson process with rate λ_A = 11 per minute, so N(t) ~ Poisson(λ_A t) with t = 240 minutes. The expected number of arrivals in 4 hours (240 minutes) follows from the properties of the Poisson distribution:

E[N(t)] = λ_A t = 11 · 240 = 2640

Therefore, over a four-hour period, the Star Trek cast can expect to be accosted by 2640 nerds.

4.6.3e What is the probability that a gaseous energy being will leave after ten minutes, given that it hasn't left after 5 minutes?

4.6.3e Solution We want to solve the following:

P(G_D > 10 | G_D > 5) = P(G_D > 5), by the memoryless property
                      = 1 - P(G_D ≤ 5)
                      = 1 - (1 - e^{-5 λ_{G_D}})
                      = e^{-5}
                      ≈ 0.0067

Chapter 5

Generating Random Variables

In simulations, we need to create enough data to be able to make useful observations regarding any patterns in the data. Three realizations may not be enough to see the behaviour of a stock price over time, or to determine the growth in populations based on birth, death and immigration models.

We want to have sufficiently random data, but what does this really mean? Let's say we are given the following sequence of randomly generated numbers:

9 9 9 9 9 9 9 9 9 9

We would guess the next number in the sequence to be "9" (and we would be wrong if the next number turned out to be 12). The numbers do not appear to be random: someone could theoretically figure out the next number in the sequence using math. Such a sequence is referred to as deterministic. So, we need a method of generating random variables or numbers which appears to be random and, at the same time, is computationally simple. Computers in this day and age are fast at mathematical computations, but simplicity means an increase in efficiency when we're dealing with millions of data points or more.

5.1 Pseudo Random Number Generators

It is important to note that humans are terrible random number generators. Computers, although entirely predictable, have the advantage of appearing to be more random than humans. Who'd have thought that would be the case? Even though we know for a fact that the computing process for generating random numbers is deterministic, these numbers still seem quite random. How do they do that?

Let's say we want to generate a series of numbers which appear to be Uniform(0,1) distributed. This can be done using modular arithmetic. Recall from classical algebra that a is congruent to b mod m if a = c·m + b, where c is an integer. For example:

31 = 4·7 + 3 ≡ 3 (mod 7)

Assume we are working in arithmetic mod 8 with x_n = 5x_{n-1} and x_0 = 3 (x_0 is called the seed). Use math to show that we end up with the following sequence of numbers: {3, 7, 3, 7, ...}. Notice that the pattern begins to repeat after 2 iterations. We say that this algorithm has a period of 2, or simply period 2.
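For readers following along in R, here is a minimal sketch of such a multiplicative generator; the helper name lcg is an illustrative assumption.

# Minimal sketch: x_n = a * x_{n-1} mod m, returning the first k values.
lcg <- function(a, m, seed, k){
  x <- numeric(k)
  x[1] <- seed
  for (i in 2:k) x[i] <- (a * x[i-1]) %% m
  x
}
lcg(5, 8, 3, 8)   # 3 7 3 7 ... : period 2
lcg(5, 8, 4, 8)   # 4 4 4 4 ... : period 1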

Definition 15. The period is the largest number of unique values generated before the first repetition.

Now, assume our seed is 4. Use math to show that we end up with the following sequence of numbers: {4, 4, 4, 4, ...}. Notice that it has period 1. This problem can be fixed using Linear Congruential Generators.

5.1.1 Linear Congruential Generators

Suppose we were to edit our previous example. Before, we had a generator of the form

x_n ≡ a·x_{n-1} mod m.

Now, let us add a constant term to help circumvent the situation where our generator repeats 0's infinitely, or something similar:

x_n ≡ a·x_{n-1} + b mod m.

This helps out a bit, but we can still have the situation where a·x_{n-1} + b ≡ 0 mod m. We need one more condition.

5.1.2 Theorem

Statement - If m is a prime, then the maximum period of a Linear Congruential Generator

x_n = a·x_{n-1} mod m,  a ≠ 0,

is m - 1, if and only if

a^{m-1} ≡ 1 mod m and a^i ≢ 1 mod m for all 0 < i < m - 1.

5.2 Inverse Transform Theorem for Continuous Random Variables

One of the most powerful tools available to us is the uniform distribution: specifically, the Uniform(0,1) distribution. For those using R, you will readily have access to functions like runif, rexp, rgeom, and so forth. Assuming that you do not have access to any such computer software, you'll need tools to mimic these functions on your own.

Theorem - Assume the following properties hold:

1. Let X be a r.v. with cdf F such that
   • lim_{x→-∞} F(x) = 0
   • lim_{x→∞} F(x) = 1
   • F(x) < F(y) → x < y

2. g(x) ≤ g(y) for x ≤ y (monotonicity; this applies to F)

3. F^{-1}(F(x)) = x, if F is invertible

4. Let U ~ Uniform(0,1); then P(U ≤ t) = ∫_0^t du = t.

Then, let U = F(X).

Statement of Theorem: Let U be a Uniform(0,1) r.v. Then, for any continuous cdf F, the r.v. defined as X = F^{-1}(U) has distribution function F. This means that if we are given U = 0.2, we know F(x) = 0.2 and we can find the value of X. In summary, given a U(0,1) r.v., we can obtain an X from any continuous CDF simply by inverting the function.

Proof.

F_X(x) = P(X ≤ x)
       = P(F^{-1}(U) ≤ x), by Property 1
       = P(F(F^{-1}(U)) ≤ F(x)), by Property 2
       = P(U ≤ F(x)), by Property 3
       = F(x), by Property 4

5.2.1 Examples

Example Assume we needed a r.v. to have the distribution function G(x) = x³ (on 0 ≤ x ≤ 1). The previous theorem says that by letting u = x³, where U ~ U(0,1), we have x = u^{1/3}. And so, to obtain a value of x from G(x), we obtain a u, u ~ U(0,1), and then set x = u^{1/3}.

Exponential Example Assume that we have exponential data (with some unreported rate λ) and we want to convert our data to Uniform(0,1), either because we're paranoid that someone will review our results or we just like messing with people's minds. How can we convert our data from exponential to uniform? Let x_i be a data realization with respective uniform representation u_i.

F(x_i) = 1 - e^{-λ x_i}
u_i = 1 - e^{-λ x_i}
1 - u_i = e^{-λ x_i}
ln(1 - u_i) = -λ x_i
x_i = -(1/λ) ln(1 - u_i)
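As a quick illustration, this inverse transform can be applied directly to uniform realizations in R; the rate chosen here is an arbitrary example value.

# Minimal sketch: exponential realizations from uniforms via
# x = -(1/lambda) * log(1 - u).
lambda <- 2
u <- runif(10000)
x <- -(1/lambda) * log(1 - u)
mean(x)   # should be close to 1/lambda = 0.5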

5.3 Acceptance-Rejection Method

Assume that we want to generate a random variable from a probability function f(x). However, f(x) is not invertible, or does not have a "nice" method we can use to generate a random variable. On the other hand, we know of a function g(x) such that kg(x) is a probability function for some constant k, and kg(x) is arbitrarily "close" to f(x). Furthermore, there is a constant c for which c·kg(x) ≥ f(x) for all x, so that 1 ≥ f(x)/(c·kg(x)) ≥ 0.

What do you get when you have all of these things? The Acceptance-Rejection Method! The full algorithm for the method is as follows:

1. Generate a value x from the distribution kg(x).

2. Generate u, a Uniform(0,1) realization.

3. Check whether u < f(x)/(c·kg(x)). If yes, then we have generated (accepted) a value. If not, then repeat the process.

Let's look at an example.

Example We wish to generate a random variable from the following discrete function:

x      0    1    2
f(x)   0.3  0.4  0.3

We decide to use a Binomial(2, 1/2) random variable as our g(x) function. Let's look at the probabilities:

x      0     1    2
f(x)   0.3   0.4  0.3
g(x)   0.25  0.5  0.25

We need c·g(x) ≥ f(x) for all x. Since 0.5 > 0.4, we do not need to concern ourselves with x = 1. However, 0.3 is larger than 0.25 by a factor of 6/5; therefore, this is the constant needed to satisfy the conditions of the acceptance-rejection method: c = 6/5.

x       0    1    2
f(x)    0.3  0.4  0.3
cg(x)   0.3  0.6  0.3

This completes our requirements. So, we generate a value for x; suppose it is x = 1. Then, with R, we can generate a random Uniform(0,1) value, say 0.2817782. We have f(1)/((6/5) g(1)) = 0.4/0.6 ≈ 0.67 > 0.2817782; therefore, we have generated a value! In this example, x = 0 and x = 2 will always be accepted, since f(x)/(c·g(x)) = 1 for those values and the R function runif never generates values at the extremities.
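A minimal R sketch of this example follows; the helper name ar_draw is an illustrative assumption.

# Minimal sketch: acceptance-rejection for the discrete target f = (0.3, 0.4, 0.3)
# on {0, 1, 2}, using g = Binomial(2, 0.5) and c = 6/5.
f <- c(0.3, 0.4, 0.3)
g <- dbinom(0:2, size = 2, prob = 0.5)
cc <- max(f / g)                         # 6/5
ar_draw <- function(){
  repeat{
    x <- rbinom(1, size = 2, prob = 0.5) # proposal from g
    u <- runif(1)
    if (u < f[x + 1] / (cc * g[x + 1])) return(x)
  }
}
table(replicate(10000, ar_draw())) / 10000   # close to (0.3, 0.4, 0.3)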

5.4 Compositions

5.5 Special Cases

In mathematics, there are always simplifications under certain constraints. Since lengthy computations can be taxing on resources, simplifying them can make our work easier.

5.5.1 Poisson

Suppose we are simulating a Poisson distribution and we are looking for a computationally simple method to calculate values of its probability function. Consider the simple case: a Poisson distribution with parameter λ. First, we look at the probability function for the distribution:

f(x) = e^{-λ} λ^x / x!

Let p_i be the probability that X = i under this distribution. Evaluating p_{i+1}/p_i, we get the following:

p_{i+1}/p_i = [λ^{i+1} e^{-λ}/(i+1)!] / [λ^i e^{-λ}/i!]
            = (λ^{i+1}/λ^i) · (i!/(i+1)!)
            = λ/(i+1)

p_{i+1} = p_i · λ/(i+1)

This is computationally simpler than the standard method of determining sequences of probabilities. Its algorithm is as follows:

1. Generate a Uniform(0,1) random variable, u.

2. Set i = 0 and p = e^{-λ} (this is p_0). Set F = p, the running cumulative probability.

3. If u < F, then output x = i and stop.

4. Else set p = p·λ/(i + 1), F = F + p, i = i + 1, and go to 3.

Repeat (starting from step 1 with a new u) until you have enough points. This is a computational blessing in that we no longer need to concern ourselves with factorial evaluations, which take longer to compute as i increases.
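A minimal R sketch of this recursion follows; the function name rpois_inv is an illustrative assumption.

# Minimal sketch: inverse-transform Poisson(lambda) using
# p_{i+1} = p_i * lambda / (i + 1).
rpois_inv <- function(lambda){
  u <- runif(1)
  i <- 0
  p <- exp(-lambda)   # P(X = 0)
  F <- p              # running cumulative probability
  while (u >= F){
    p <- p * lambda / (i + 1)
    F <- F + p
    i <- i + 1
  }
  i
}
mean(replicate(10000, rpois_inv(3)))   # should be close to 3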

5.5.2 Non-Homogeneous Poisson Process

A non-homogeneous Poisson process is similar to a regular Poisson process except that the rate, λ, is not constant. In fact, the rate becomes either a function of time, λ(t), or a function of the number of events by time t (denoted N(t)), λ(N(t)). Remember that the inter-arrival times of a Poisson process are exponential. In this instance, we will build a Poisson r.v. from exponentials: we will find the number of events up to a time T by accumulating exponential inter-arrival times until we surpass T; the number of intervals that fit is the number of events. First, we look at the homogeneous Poisson process.

Algorithm

1. Let t = 0 and i = 0, for the current time and the number of events respectively.

2. Generate a Uniform(0,1) random variable, u.

3. Set t := t - (1/λ) ln(u). If t > T, stop and output i.

4. Otherwise, increase i by 1 and go to 2.

This generates arrivals at a constant rate λ. For a non-constant rate, we will use a process called thinning. Again, the rate usually depends on time, λ(t), and we choose a constant rate λ which is larger than λ(t) for all t. We generate exponential inter-arrival times with rate λ, but decide whether or not to accept each point by comparing λ(t)/λ with a uniform random variable. In other words, we don't accept every point.

Algorithm

1. Let t := 0 and i := 0, for the time and the number of events respectively, as above.

2. Generate a Uniform(0,1) random variable, u, and set t := t - (1/λ) ln(u). If t > T, stop and output i.

3. Generate a Uniform(0,1) random variable, v. Is v < λ(t)/λ?

   (a) If yes, accept the point: set i := i + 1 and go to 2.

   (b) If no, reject the point and go to 2.
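A minimal R sketch of thinning follows; the intensity function λ(t) = 2 + sin(t) and its constant bound 3 are assumptions chosen purely for illustration.

# Minimal sketch: thinning for a non-homogeneous Poisson process on (0, Tend].
nhpp_count <- function(Tend){
  lambda_t <- function(t) 2 + sin(t)   # assumed intensity function
  lambda_max <- 3                      # constant bound on lambda_t
  t <- 0; i <- 0
  repeat{
    t <- t - log(runif(1)) / lambda_max                  # candidate event time
    if (t > Tend) return(i)
    if (runif(1) < lambda_t(t) / lambda_max) i <- i + 1   # accept the point
  }
}
nhpp_count(10)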

5.5.3 Binomial

Similar to the Poisson distribution, there is an easy way to calculate incremental increases in X ~ Binomial(n, θ). Given a similar setup: p_i is the probability that X = i for a binomial with probability function f(x) = C(n, x) θ^x (1 - θ)^{n-x}. Again, compare p_{i+1} to p_i:

p_{i+1}/p_i = [C(n, i+1) θ^{i+1} (1 - θ)^{n-i-1}] / [C(n, i) θ^i (1 - θ)^{n-i}]
            = [i!/(i+1)!] · [(n-i)!/(n-i-1)!] · θ/(1 - θ)
            = [(n - i)/(i + 1)] · θ/(1 - θ)

p_{i+1} = p_i (n - i) θ / [(i + 1)(1 - θ)]

Granted, this is less aesthetically pleasing than the Poisson simplification, but it is much nicer than constantly evaluating binomial coefficients.
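A minimal R sketch of the binomial recursion follows; the function name rbinom_inv is an illustrative assumption.

# Minimal sketch: inverse-transform Binomial(n, theta) using
# p_{i+1} = p_i * (n - i) * theta / ((i + 1) * (1 - theta)).
rbinom_inv <- function(n, theta){
  u <- runif(1)
  i <- 0
  p <- (1 - theta)^n   # P(X = 0)
  F <- p
  while (u >= F && i < n){
    p <- p * (n - i) * theta / ((i + 1) * (1 - theta))
    F <- F + p
    i <- i + 1
  }
  i
}
mean(replicate(10000, rbinom_inv(10, 0.3)))   # should be close to n * theta = 3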

5.5.4 Normal/Gaussian

One Gaussian Generator

1. Generate two Uniform[-1, 1] random variables, Z_1, Z_2.

2. If Z_1² + Z_2² > 1, go to 1.

3. Else, let S = Z_1² + Z_2² and R = sqrt(-2 ln S); then X = R·Z_1/sqrt(S) and Y = R·Z_2/sqrt(S) are two independent N(0,1) values.

We can't use the inverse transform theorem here, due to the fact that F(x) does not have a closed-form solution. Instead, consider the joint probability function of two independent, standard normal random variables:

f(x, y) = f_X(x) f_Y(y) = (1/(2π)) e^{-(x² + y²)/2}

Instead of Cartesian coordinates, we will use polar-type coordinates. This is a change of variables from (x, y) → (r, θ) with r = x² + y² and θ = arctan(y/x). For this to work, we will need the Jacobian of the change of variables (see a calculus textbook for more details):

J = | ∂r/∂x   ∂r/∂y |   =   | 2x            2y           |   = 2   (just believe it).
    | ∂θ/∂x   ∂θ/∂y |       | -y/(x²+y²)    x/(x²+y²)    |

Now to apply the change of variables:

f(x, y) = (1/(2π)) e^{-(x² + y²)/2}
f(r, θ) = (1/(2π)) e^{-r/2} |1/2|,  using the inverse of the Jacobian
        = (1/(2π)) · (1/2) e^{-r/2}

Upon inspection, this is the product of the densities of two independent random variables: one is U(0, 2π) and the other is Exponential with mean 2, both of which are invertible! Therefore, we can make a new algorithm:

1. Generate two Uniform(0,1) random variables, u_1, u_2.

2. Set r = -2 ln(u_1) and θ = 2π u_2.

3. Then x = sqrt(r) cos θ and y = sqrt(r) sin θ, which becomes

x = sqrt(-2 ln u_1) cos(2π u_2),   y = sqrt(-2 ln u_1) sin(2π u_2).

4. To avoid evaluating sine and cosine directly, set v_1 = 2u_1 - 1 and v_2 = 2u_2 - 1; these points must lie in the unit circle, so if v_1² + v_2² > 1, go to 1. Otherwise, with s = v_1² + v_2², we may use cos θ = v_1/sqrt(s) and sin θ = v_2/sqrt(s) in the formulas above.

This algorithm can be cleaned up to give the first algorithm mentioned at the start of this section.
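A minimal R sketch of the Box-Muller form from step 3 follows; the function name boxmuller is an illustrative assumption.

# Minimal sketch: Box-Muller transform producing pairs of independent N(0,1) values.
boxmuller <- function(npairs){
  u1 <- runif(npairs)
  u2 <- runif(npairs)
  r     <- sqrt(-2 * log(u1))   # radius, from the Exponential(mean 2) part
  theta <- 2 * pi * u2          # angle, from the Uniform(0, 2*pi) part
  cbind(x = r * cos(theta), y = r * sin(theta))
}
z <- boxmuller(5000)
mean(z); var(as.vector(z))      # should be close to 0 and 1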

5.6 Chapter Problems

5.6.1 Find the periodicity for the following random number generator. Is this optimal?

x_n = 5x_{n-1} mod 19

with seed x_0 = 5.

5.6.1 Solution

n     0  1  2   3   4  5  6   7  8  9
x_n   5  6  11  17  9  7  16  4  1  5

Therefore, this has period 9. This is not optimal because there are values between 1 and 18 that do not appear.

5.6.2 Find the periodicity for the following random number generator. Is this optimal?

x_n = 3x_{n-1} mod 19

with seed 2.

5.6.2 Solution

n     0  1  2   3   4   5   6   7  8   9   10  11  12  13  14  15  16  17  18
x_n   2  6  18  16  10  11  14  4  12  17  13  1   3   9   8   5   15  7   2

Therefore, this has period 18. It is an optimal LCG since the period is equal to one less than the modulus.

5.6.3 Find the periodicity of the following Linear Congruential Generator. Is this optimal?

x_n = 5x_{n-1} + 2 mod 19

with seed x_0 = 5.

5.6.3 Solution

n     0  1  2  3  4   5   6  7  8   9
x_n   5  8  4  3  17  11  0  2  12  5

This is not an optimal LCG, given that it has periodicity 9.

5.6.4 For each of the following LCGs, determine whether the periodicity is m - 1, where m is the modulus. Use the seed x_0 = 1 for each.

5.6.4a x_n = 5x_{n-1} mod 19.

5.6.4a Solution From 5.6.1 we know that 5^9 ≡ 1 mod 19. Therefore, this does not have a period of m - 1 = 18.

5.6.4b x_n = 8x_{n-1} mod 11.

5.6.4b Solution

n     1  2  3  4  5   6  7  8  9  10
x_n   8  9  6  4  10  3  2  5  7  1

Therefore, this LCG has period 10 = m - 1, which is maximal.

5.6.4c x_n = 3x_{n-1} mod 14.

5.6.4c Solution Since this LCG does not have a prime modulus, we must evaluate each term explicitly.

n     1  2  3   4   5  6  7
x_n   3  9  13  11  5  1  3

The period is 6, which is not m - 1 = 13.

5.6.5 Given the following Uniform(0,1) data points, calculate the corresponding data points for the following distributions.

u_1 = 0.25, u_2 = 0.50, u_3 = 0.75

5.6.5a Exponential with parameter λ = 5.

5.6.5a Solution From the course text we have that, given a Uniform(0,1) realization, we can generate an exponential realization using the following formula:

x_i = -(1/λ) ln(1 - u_i)

Hence for each of our points we have

x_1 = -(1/5) ln(1 - 0.25) ≈ 0.0575
x_2 = -(1/5) ln(1 - 0.50) ≈ 0.1386
x_3 = -(1/5) ln(1 - 0.75) ≈ 0.2773

5.6.5b Pareto with parameters β = 2, α = 4. (The Pareto distribution has the following CDF: F(x) = 1 - (β/x)^α, x ≥ β.)

5.6.5b Solution Since we have the CDF of the Pareto distribution, we can determine the form of the transformation:

F(x_i) = 1 - (β/x_i)^α
u_i = 1 - (β/x_i)^α
1 - u_i = (β/x_i)^α
(1 - u_i)^{1/α} = β/x_i
x_i = β(1 - u_i)^{-1/α}

From here, we can plug in the values and get:

x_1 = 2(1 - 0.25)^{-1/4} ≈ 2.1491
x_2 = 2(1 - 0.50)^{-1/4} ≈ 2.3784
x_3 = 2(1 - 0.75)^{-1/4} ≈ 2.8284

5.6.6 Given the following LCG, generate three realizations from an exponential distribution with parameter λ = 0.50.

x_n = 7x_{n-1} + 2 mod 23

with x_0 = 14.

5.6.6 Solution Determining the first three terms using the LCG:

x_0 = 14
x_1 = 7 · 14 + 2 mod 23 = 8
x_2 = 7 · 8 + 2 mod 23 = 12

Hence our uniform values are 14/23, 8/23, and 12/23. Using these in the inverse transform for the exponential distribution (with 1/λ = 2), we can generate the terms that we desire:

x_1 = -2 ln(1 - 14/23) ≈ 1.8766
x_2 = -2 ln(1 - 8/23) ≈ 0.8549
x_3 = -2 ln(1 - 12/23) ≈ 1.4752

Therefore, we have generated three realizations of an Exp(0.50) distribution: x_1 = 1.8766, x_2 = 0.8549, and x_3 = 1.4752.

5.6.7 Given the following values, construct a valid Acceptance-Rejection algorithm for this distribution using a Binomial(4, 0.70) proposal.

x      0     1     2     3     4
f(x)   0.01  0.10  0.25  0.40  0.24

From this setup, determine whether x = 2 will pass the Acceptance-Rejection algorithm when u = 0.50.

5.6.7 Solution First we must determine an appropriate value of c to satisfy the condition 1 ≥ f(x)/(c·g(x)).

x          0       1       2       3       4
f(x)       0.01    0.10    0.25    0.40    0.24
g(x)       0.0081  0.0756  0.2646  0.4116  0.2401
f(x)/g(x)  1.2346  1.3228  0.9448  0.9718  0.9996

With this information, the minimum value of c that satisfies the inequality 1 ≥ f(x)/(c·g(x)) for all x is c = 1.3228. Algorithms may vary; see Section 5.3 for an example of one such algorithm. Secondly, given u = 0.50 and x = 2, we have:

u < f(2)/(c·g(2))
0.50 < 0.25/(1.3228 · 0.2646)
0.50 < 0.7143

From this result, x = 2 passes the AR check in this instance with u = 0.50; therefore, we have generated a random variable.

5.6.8 Create an algorithm to generate random variables from the following mixture distribution:

f(x) = 1/4,             0 < x < 1
       (3/4) e^{-x+1},  x ≥ 1

5.6.8 Solution Generate two Uniform(0,1) realizations u_1, u_2. Then:

1. If u_1 < 1/4, set x = u_2 (a draw from the uniform piece on (0,1)).

2. Otherwise, set x = 1 - ln(u_2) (a draw from the exponential piece, shifted to start at 1).


Chapter 6

Variance Reduction

6.1 Introduction.

Imagine that you are employed by a brokerage firm and you are handling a series of portfolios of both high-risk and low-risk stocks to generate the optimal yield for your clients. The current prices of all stocks in the portfolio are known, as well as the amounts owned by the client. However, the future stock prices are not known. How do you manage the portfolio (i.e. which securities do you buy and which do you sell)? Note that the total value of the investor's portfolio is the sum of the values of the individual stocks and, similarly, the overall return for the portfolio is a weighted average of the individual returns.

Consider a significant simplification of this problem: what is the return on a single European call option? A European call option is a security with a strike price, K, and a maturity date, T. On the date of maturity, you can exercise the option and buy the stock at the strike price (which you would only do if the strike price is below the current market price). Assuming we have a model that reflects market conditions and gives a reasonable approximation of the distribution of payouts, how much should we invest in a single stock? Other possible questions of interest: what if we were...

1. inventory control managers at a retail store: how many goods should we order and at what time?

2. in charge of regulating traffic lights at a high-traffic city intersection?

3. trying to optimize code by finding the most efficient order of operations?

4. looking for possible fraudulent cases in a series of health insurance claims?

5. trying to model the frequency of accidents in an amusement park?

6. standing in a grocery store check-out line, watching as the new cashier struggles to count change, and we are interested in evaluating how much we've died on the inside during this unmoving procession?

As you have seen in earlier chapters, we use simulation to solve these problems. A simple example is integration. In math, we often find that there are many expressions which we cannot evaluate explicitly. The most common example of this in statistics is the normal distribution's cdf:

F(x) = ∫_{-∞}^{x} (1/(√(2π) σ)) exp(-(t - μ)²/(2σ²)) dt.

In fact, it can be shown that any integral of the form ∫ e^{-x²} dx has no closed-form antiderivative. What if we want to know the area under the curve of e^{-x²/2} from 0 to 1 (i.e. evaluate ∫_0^1 e^{-x²/2} dx)?

This area can be estimated using Crude Monte Carlo Integration (CMC).

Example The area of a unit circle (i.e. radius equal to 1) is πr² = π ≈ 3.14159... Suppose we didn't know or have the formula πr²; how might we determine the area of the circle? One suggestion involves firing points at random at a square (with equal side lengths of size 2) centered at the origin (0, 0). The circle lies in the center of the square, and the area of the square is 2 × 2 = 4 square units. If points are fired at random, then the proportion of these points that lie in the circle relative to the square is indicative of the area of the circle.

Algorithmically:

1. Set the count to zero.
2. Generate an x coordinate on -1 to 1.
3. Generate a y coordinate on -1 to 1.
4. If the point (x, y) lies in the unit circle (x² + y² ≤ r² = 1), then increase the count by 1.
5. Repeat steps 2 to 4, n times.
6. The area of the unit circle is approximately

(count/n) × (Area of the Square) = 4 × (count/n).

R Code:

#Calculate the area of a circle in a square
Area <- function(n){
  count <- 0
  for (i in 1:dim(n)[1]){
    if (n[i,1]^(2) + n[i,2]^(2) <= 1){
      count <- count + 1
    }
  }
  4*count/dim(n)[1]
}
n <- cbind(runif(10,-1,1), runif(10,-1,1))
Area(n)

The following data were generated using R, and the specifics of how these points were created will be discussed later in this chapter. As you can see, the larger the number of pairs (n), the closer we get to the actual answer. Theoretically, we obtain the exact value of π after collecting infinitely many points, but who can wait for infinitely many iterations?

Number of Pairs (n)   Estimate of π
10                    3.6
50                    3.2
100                   3.32
500                   3.224
1,000                 3.1764
5,000                 3.1648
10,000                3.14848
50,000                3.12512
100,000               3.141516

This is an example of how Crude Monte Carlo integration works. We want to solve a problem where we take samples and evaluate them according to the model in order to determine an expectation or other properties.

6.2 Crude Monte Carlo Integration with Bounds of Zero and One

Introduction:

This method involves repeatedly sampling heights from a function, and using the average height as a measure of the function's area.

For crude Monte Carlo integration (hereafter referred to as CMC) to work, we begin by considering a random variable derived from the uniform distribution, U(0,1).

Recall:

1. Given a random variable X ~ U(a, b), its probability density function is f(x) = 1/(b - a), for a < x < b.

2. Given a continuous function g of our random variable X, the expected value of g(X) is:

E[g(X)] = ∫_X g(x) f(x) dx

3. An expectation is a long-run average; so, in practice, E(X) ≈ (1/n) Σ_{i=1}^{n} x_i.

Mathematically:

Returning to the original problem, we wish to determine the area beneath g(x) from x = 0 to x = 1; in other words, we wish to evaluate θ = ∫_0^1 g(x) dx.

To begin, we consider a random variable X that is U(0,1). Next, we find E[g(X)]. First, note that E[g(X)] ≈ (1/n) Σ_{i=1}^{n} g(x_i) because:

E[g(X)] = ∫_0^1 g(x) f(x) dx   [we can think of this as an average height, by 2 above]
        = ∫_0^1 g(x) · (1) dx  [by 1 above]
        = ∫_0^1 g(x) dx        [which is the area of interest!]
        ≈ (1/n) Σ_{i=1}^{n} g(x_i), where each x_i is a realization of a U(0,1) random variable, by 3 above.

The last line is where the approximation is used. It is not too difficult to visualize how this works, since we have already estimated averages using the mean of a set of data. Estimating a definite integral using a finite number of data points is frequently referred to as the crude Monte Carlo method.

There are restrictions on what can be used for this estimation. The following conditions must hold:

1. Each x_i is a realization of X ~ U(0,1).

2. For n sufficiently large, the Law of Large Numbers gives (1/n) Σ_i g(x_i) ≈ ∫_0^1 g(x) dx.

3. g(x_i) represents the height of the function at x_i.

It may be difficult to imagine now, but a long time ago, computers were significantly slower than they are today. Making algorithms or calculations efficient was very important. Even today, efficiency can help reduce the time a simulation takes, since a single simulation could be run over one hundred thousand times before termination. From the π calculation example earlier in this chapter, that relatively simple calculation for 100,000 terms took about 4 seconds, and the code was 6 lines long. More complicated algorithms might take much longer.

Example We wish to evaluate θ = ∫_0^1 e^{-x²/2} dx.

a) Write a CMC algorithm to evaluate θ.

Algorithm:

1. Generate n values of x_i, each one being a realization of X ~ U(0,1).
2. For each x_i, find g(x_i) = e^{-x_i²/2}.
3. The estimated area is θ̂ = (1/n) Σ_i g(x_i).

b) Use a normal table or some other means to calculate θ.

The N(0,1) cdf is denoted by F(x) = (1/√(2π)) ∫_{-∞}^{x} e^{-t²/2} dt. Then

θ = √(2π) (F(1) - F(0)) = (0.8413 - 0.5) √(2π) ≈ 0.855.

c) Write R code to estimate θ. Using n = 10, 100 and 1000, provide estimates for θ.

x = runif(10)
g = exp(-x^2/2)
theta = mean(g)
theta

Note that estimates will change. With n = 10, θ̂ = 0.848762.

Variability of our CMC Estimator:

We will need numerical quantities that reflect the variability of the crude Monte Carlo estimator. For simplicity:

1. Let E[g(X)] = θ.
2. Let Var(g(X)) = σ².
3. θ is estimated by θ̂ = (1/n) Σ_{i=1}^{n} g(x_i).
4. The estimator of θ is θ̃ = (1/n) Σ_{i=1}^{n} g(X_i).
5. n is the sample size, the number of heights we sample.
6. Each x_i is a realization of X ~ U(0,1).

Var(θ̃) = Var((1/n) Σ_{i=1}^{n} g(X_i))
        = (1/n²) Var(Σ_{i=1}^{n} g(X_i))
        = (1/n) Var(g(X))
        = σ²/n

To quantify the variability of our estimator, we use what we call the standard error of θ̃, which is of the form √(σ²/n). The variance σ² is estimated by the sample variance of the heights, σ̂² = (1/(n-1)) Σ_{i=1}^{n} (g(x_i) - ḡ)², where ḡ is the mean of the g(x_i).

Now that we have a measure for the variance, we can begin to explore other types of estimation and compare them to the crude Monte Carlo method. This is important because we will be comparing every other technique against this method.

6.3 Population Calculation

Let's say we want to determine the value of a parameter θ (e.g. "the average height of the population of Canada") to within a fixed margin of error (denoted E). How do we determine the number of units to include in the study?

Theory: Consider the general form of the confidence interval for a parameter θ:

CI ≈ θ̂ ± c √(σ̂²/n)

Here n is the sample size, σ̂ is the estimated standard deviation, θ̂ is the estimate of the mean, and c is the corresponding table value. Since θ̂ is generally an average and n is relatively large, we assume normality and obtain c from the normal table.

The length of the interval depends on the second term; therefore, it is the point of interest in the calculation of sample size. Examining the term on its own, we can see that the margin of error satisfies

E = c √(σ²/n)
E² = c²σ²/n
n = (cσ/E)²

E is prespecified in the planning stage of the PPDAC process. σ², however, is unknown. To estimate σ², we either:

1. Use historical data: if another related survey or experiment was conducted, then it is acceptable to use the estimate of the variance from that data.

2. Run a small pilot survey: take a very small sample to derive an estimate for the variance using the model selected in the planning step.

Example The Fake Bureau of Random Statistics would like to determine the average shoe size for Canadian-born citizens between the ages of 13 and 19. The desired margin of error is E = 0.01 at approximately 95% confidence (so c ≈ 2). A cursory pilot survey was conducted which showed a variance of 0.104. However, a study done fifty years ago on the sales of shoes by size for teenagers had an estimate of the variance of 40/333.

1. Determine n using the cursory survey. Plugging in the values, we get:

n_1 = ⌈4σ²/E²⌉ = ⌈4 · 0.104 / 0.01²⌉ = ⌈4160⌉ = 4160

2. Determine n using the historical survey. Plugging in the values, we get:

n_2 = ⌈4σ²/E²⌉ = ⌈4 · (40/333) / 0.01²⌉ = ⌈4804.804805⌉ = 4805

3. What are the advantages and disadvantages of both methods? Compare the two results.

The first method has the advantage of using current data; however, it requires running a small survey at the beginning of the process. The second method saves both time and money; however, the data may no longer be relevant due to their age, or the historical values may not properly represent the current population. In the case of this example, the required sample sizes differ greatly. For the purpose of running a simulation, an extra 645 iterations isn't much of a strain on resources, but this doesn't necessarily hold true in the real world: experiments and surveys are costly with respect to both time and money. From this, we can gather that by reducing the variance of the model, we can effectively save two very important resources: time and money.

In both cases above, we round up. If we do not round up, we will not have sufficiently many units in our experiment to get the desired results with respect to interval length.

Example We wish to evaluate θ = ∫_0^1 e^{-x²/2} dx.

Using the CMC algorithm, how large should your sample size be to get an answer to within 0.0001 units at an approximately 95% confidence level?

#CMC
u = runif(1000)
est = exp(-u^2/2)
variance = var(est)
variance

Code Output: 0.014

Note that 1000 was arbitrarily selected; it is a relatively small number (from a computing point of view).

n = ⌈4σ²/E²⌉ = ⌈4 · 0.014 / 0.0001²⌉ = 5,600,000

Hence, we need to run our code 5,600,000 times to get an answer within 0.0001 units. Imagine what would happen if this weren't a computer simulation, and the amount of substantial resources that would be required to implement a survey of this magnitude.

6.4 Crude Monte Carlo Integration with Bounds from A to B

The issue with the above methodology is that it only allows you to integrate functions from 0 to 1 (i.e. θ = ∫_0^1 g(x) dx). This has probably bothered you, so let's expand to integrals of the form θ = ∫_a^b g(x) dx.

To evaluate these integrals, the secret is to perform a change of variable. The goal of the change of variable is not to make the function nicer (as it was in Calculus 1) but to make the bounds nicer instead.

Example In Calculus 1, you may have integrated ∫_6^{12} exp(-4x + 2) dx by making the substitution u = -4x + 2. Here, we aren't interested in the function (in fact, since the computer performs the evaluation, it can get as ugly as we'd like); instead, we will try to change the bounds to be from 0 to 1, so a good choice in this case might be u = (x - 6)/6.

In general, for finite a and b, a good choice is u = (x - a)/(b - a).

In the case of infinite a and b, we want to do the following:

1. We only want one bound to be infinite, so we use the integral rule: ∫_{-∞}^{∞} f(x) dx = ∫_{-∞}^{a} f(x) dx + ∫_{a}^{∞} f(x) dx (a is finite).

2. We make an appropriate change of variable. Ensure that whatever you select as your change of variable exists on the integral's domain. For example, one possibility is u = 1/(1 + x). However, this would not be a good choice if it were possible for x = -1.

Example We wish to evaluate θ = ∫_{-5}^{0} e^{-x²/2} dx.

1. Write a CMC algorithm to estimate θ.

2. We make the change of variable u = -x/5, which gives -5 du = dx and -5u = x.

Based on this change of variable, we want: θ = ∫_0^1 e^{-(-5u)²/2} · 5 du.

Algorithm:

1. Generate n values of x_i, each of which is a realization of X ~ U(0,1).
2. For each x_i, find g(x_i) = e^{-(5x_i)²/2} · 5.
3. The estimated area is θ̂ = (1/n) Σ_i g(x_i).

Example We wish to evaluate θ = ∫_{-5}^{∞} e^{-x²/2} dx.

a) Write a CMC algorithm to estimate θ.

b) We make the change of variable

u = 1/(6 + x),

which gives

dx = -(1/u²) du  and  x = 1/u - 6.

Based on this change of variable (the swap of integration limits absorbs the minus sign), we want:

θ = ∫_0^1 exp(-(1/u - 6)²/2) · (1/u²) du

Algorithm:

1. Generate n values of x_i, each of which is a realization of X ~ U(0,1).

2. For each x_i, find g(x_i) = exp(-(1/x_i - 6)²/2) · (1/x_i²).

3. The estimated area is θ̂ = (1/n) Σ_i g(x_i).
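As an illustration, the following R lines estimate this integral with the change of variable above; the sample size is arbitrary, and the true value is about √(2π)·Φ(5) ≈ 2.5066.

# Minimal sketch: CMC with u = 1/(6 + x) for theta = integral from -5 to
# infinity of exp(-x^2/2) dx.
u <- runif(100000)
g <- exp(-(1/u - 6)^2 / 2) / u^2
mean(g)                   # estimate of theta (about 2.5066)
sd(g) / sqrt(length(u))   # standard error of the estimate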

Example We wish to evaluate θ = ∫_1^{∞} 1/(e^x - 1) dx.

1. Write a CMC algorithm to evaluate θ.

2. We want to map the domain [1, ∞) to [0, 1]. A simple method for this is to use x = 1/u. This technically maps [1, ∞) → (0, 1], but since we are using continuous uniform random variables, this is not a concern, as P(U = 0) = 0.

Then we have u = 1/x and -(1/u²) du = dx.

Making use of the change of variables formula, we have

θ = ∫_0^1 [1/(exp(1/u) - 1)] · (1/u²) du.

See above for more detail about the algorithm.

6.5 Crude Simulation of a Call Option Under the Black-Scholes Model

Consider the integral used to price a European call option. Recall that a call option gives the owner the right, but not the obligation, to buy the stock at the strike price K.

The price of a stock at any particular time t is denoted by S_t. The stock price when you purchase your call option is S_0, and the price when you decide to make your call is S_T. Hence, t ∈ {0, 1, 2, ..., T}.

Now consider an integral used to price a call option. If a European option has a payoff V(S_T), where S_T is the value of the stock at maturity (which occurs at time T), then the price of the option can be valued using the discounted future payoff from the option, assuming a risk-neutral measure:

e^{-rT} E[V(S_T)] = e^{-rT} E[V(S_0 e^X)]

Under the Black-Scholes model, the random variable X = ln(S_T/S_0) is normally distributed with a mean of rT - σ²T/2 and a variance of σ²T. We can generate such a normal using the inverse transform theorem: X = Φ^{-1}(U; rT - σ²T/2, σ²T), where U follows a U(0,1) distribution and Φ^{-1}(·; μ, σ²) denotes the inverse cdf of a normally distributed random variable with mean μ and variance σ². As shown previously, the value of the option can be written as an expectation of a r.v. governed by a U(0,1) distribution:

E{f(U)} = ∫_0^1 f(u) du,

where f(u) = e^{-rT} V(S_0 exp{Φ^{-1}(u; rT - σ²T/2, σ²T)}).

If K < S_T, it means that you can buy the stock at a price that is less than the market value. If K > S_T, you would not exercise the call option, since it would cost less to purchase the stock on the open market.

To model a stock option, we will use the Black-Scholes-Merton model (hereafter referred to as BSM), which is a popular financial model. From the above, we can see that our option has a payoff of

P(S_T) = max(0, S_T - K).

The BSM assumes risk neutrality and measures the future payoff P(S_T) by discounting to the present, t = 0:

e^{-rT} E[V(S_T)] = e^{-rT} E[V(S_0 e^X)]

Recall that CMC estimators are generated by evaluating a function many times at randomly selected values (generated from a U(0,1)), and then these values are averaged. The following code creates a function in R that estimates the expected value of a stock option:

fn <- function(u, S0, strike, r, sigma, T){
  # u is a vector of Uniform[0,1] randomly generated values
  # S0 is the price of the stock at time 0
  # strike is the strike price of the call option
  # sigma is the variability/volatility of the stock
  # r is the interest rate associated with the stock
  # T is the time from t=0 until maturity, also the number of compounding periods
  x <- S0*exp(qnorm(u, mean = r*T - sigma^2*T/2, sd = sigma*sqrt(T)))
  y <- exp(-r*T)*pmax(x - strike, 0)
  y
}

# Test values
fn(runif(10000), 45, 60, 0.06, 0.25, 5)
fn(runif(100), 10, 10, 0.04, 0.3, 0.5)

One run of this code resulted in an approximate value of the option of θ̂_CR = 0.4620 with an estimated standard error of √(8.7 × 10⁻⁷). A put option is an option to sell a stock for a pre-specified price (also called the strike) at a given date of maturity, and has the same pricing structure.

An advantage of the Monte Carlo method over numerical techniques is that we get a simple, accurate estimator, due to the fact that we use a sample mean in our calculations. For a series of n simulations, we can measure the accuracy using the standard error of our sample mean, since

Var(θ̂_CR) = Var(f(U_1))/n.

We can also obtain the standard error of the sample mean using the standard deviation,

SE(θ̂_CR) = σ̂_f/√n,

where σ_f² = Var(f(U)). This is estimated using the sample standard deviation. Since our function generates a vector of estimates, we can obtain a sample estimate of σ_f by

sqrt(var(fn(...)))

The standard error can be obtained with

sf = sqrt(var(fn(...)))
sf/sqrt(length(u))

One run of the code gave an estimate of 0.6603 for the standard deviation σ_f, or a standard error of σ_f/√500000, about 0.0009.

6.6 Antithetic Random Variables

Introduction:

1. Recall that sample size is calculated by n = c²σ²/E². Hence, if we can reduce σ², we can reduce our sample size. Consider the variance of a sum of two random variables:

Var(X + Y) = Var(X) + Var(Y) + 2Cov(X, Y)

2. We can reduce this sum by making the covariance term negative.

Mathematically:

To reduce the variance of our CMC estimator, we will build a new estimator called θ̃_ANTI, which is comprised of two random variables, f(U) and f(V).

To ensure the covariance term is less than zero, we can use the following conditions:

1. For two continuous U(0,1) random variables, U and V, we want their covariance to be less than 0. We can set V = 1 - U; this is one example of a choice for V that will ensure the covariance is less than 0.

2. f(U) should be a monotonic function of U.

Then Cov(f(U), f(V)) < 0.

Our antithetic estimator is

θ̃_ANTI = (1/2n) Σ_{i=1}^{n} (f(U_i) + f(1 - U_i)).

Note that in the antithetic case, we end up doubling the number of observations, which is why the divisor is 2n instead of n.

How does the antithetic method improve the variability? Is the antithetic estimator unbiased?

Unbiasedness:

Again, using U_i ~ U(0,1) random variables and setting V_i = 1 - U_i, we get the following:

E[θ̃_ANTI] = E[(1/2n) Σ_{i=1}^{n} (f(U_i) + f(1 - U_i))]
           = (1/2n) ( E[Σ_{i=1}^{n} f(U_i)] + E[Σ_{i=1}^{n} f(1 - U_i)] )
           = (1/2n) (nθ + nθ)
           = (2n/2n) θ
           = θ

Thus, the antithetic estimator is unbiased.

Variability:

By taking the variance of the antithetic estimator, we get the following:

Var(θ̃_ANTI) = Var( (1/2n) Σ_{i=1}^{n} (f(U_i) + f(V_i)) )
            = (1/4n²) Var( Σ_{i=1}^{n} (f(U_i) + f(V_i)) )
            = (1/4n²) Σ_{i=1}^{n} [ Var(f(U_i)) + Var(f(V_i)) + 2Cov(f(U_i), f(V_i)) ]
            = (1/4n²) [ nσ² + nσ² + 2n·Cov(f(U_i), f(V_i)) ]
            = (1/2n) [ σ² + Cov(f(U_i), f(V_i)) ]
            = (1/2n) (σ² + C)

In the last line, we let C denote the covariance.

Now that we have the theoretical variance of the antithetic estimator, we can compare it to the CMC estimator.

Variance Issue:

Ignoring the covariance for a moment, we see that Var(θ̃_ANTI) = σ²/(2n) < σ²/n = Var(θ̃_CMC). This is not an accurate comparison, since it does not take into consideration that the antithetic estimator uses twice as many values (heights). To balance this, we need to do one of several things:

1. Compare the antithetic estimator on n/2 sampled heights to that of the CMC on n.
2. Compare 2·Var(θ̃_ANTI) to Var(θ̃_CMC).
3. Compare Var(θ̃_ANTI) to (1/2)·Var(θ̃_CMC).

Function Evaluations: A function evaluation occurs every time you call a function. For example, in CMC, where the estimator is θ̃ = (1/n) Σ_{i=1}^{n} g(X_i), the number of times we call g(x) is n. Thus, n is the number of function evaluations.

Efficiency: For an equal number of function evaluations, the efficiency of a new estimator, θ̂_NEW, is calculated as

Efficiency = Var(θ̂_CMC) / Var(θ̂_NEW).

When we compare the two variances, we use their ratio. A number larger than one implies that the variance of the CMC estimator is worse than that of the new estimator. To interpret this number, we use an example: suppose that the efficiency is 5. This means that we need 5 times the number of function evaluations using the CMC estimator to obtain the same precision as the new estimator.

Example: Antithetic. When comparing the antithetic estimator against the crude Monte Carlo estimator, the efficiency can be thought of as how much faster the antithetic method is:
\[ \text{Efficiency} = \frac{\mathrm{Var}(\hat\theta_{CMC})}{2\,\mathrm{Var}(\hat\theta_{ANTI})} = \frac{\sigma^2/n}{\frac{2}{2n}\left(\sigma^2 + C\right)} = \frac{\sigma^2}{\sigma^2 + C}. \]
Hence, we will only see an improvement in efficiency when $C < 0$.
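In practice $\sigma^2$ and $C$ are unknown, but both can be estimated from a small pilot sample. A minimal sketch (not from the notes; $f(x) = e^x$ is again just an arbitrary monotone integrand):

u <- runif(10000)
f <- function(x) exp(x)
sigma2 <- var(f(u))                  # estimate of sigma^2
C      <- cov(f(u), f(1 - u))        # estimate of the covariance C
sigma2 / (sigma2 + C)                # estimated efficiency of ANTI relative to CMC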

Example: We wish to evaluate $\theta = \int_0^1 e^{-x^2/2}\,dx$.

a) Write an algorithm to estimate $\theta$ using an antithetic estimator.

On this domain, $g(x) = e^{-x^2/2}$ is a monotonic function; hence an antithetic estimator is a good choice here.

Algorithm:

1. Generate $n$ values $u_i$, realizations of $U \sim U(0,1)$.
2. For each $u_i$, set $v_i = 1 - u_i$ and find $h(u_i) = e^{-u_i^2/2}$ and $h(v_i) = e^{-v_i^2/2}$.
3. The estimated area is $\hat\theta = \frac{1}{2n}\sum_{i=1}^{n}\left(h(u_i) + h(v_i)\right)$.

b) Write R code to estimate $\theta$. Using $n$ = 10, 100 and 1000, provide estimates for $\theta$.

u=runif(10)
v=1-u            # antithetic pair: reuse the same uniforms
g1=exp(-u^2/2)
g2=exp(-v^2/2)
est=(g1+g2)/2
theta=mean(est)
theta

Note that estimates will change from run to run. With $n = 10$, one run gave $\hat\theta_{ANTI} = 0.8157554$.

As a result, to make the CMC estimator's variance worse than that of the ANTI estimator, we want to ensure the covariance $C$ is negative.

6.7 Stratified Sampling

In survey sampling, stratifying is the act of partitioning the population into subpopulations called strata. For this method to be effective, the attributes within each stratum must be similar, while the attributes of different strata are different. A couple of examples will illustrate this concept:

Example Problem: We want to determine the average size of a Canadian citizen's English vocabulary.

Plan: Canada has two principal language groups: French and English. French is primarily spoken in Quebec, while English is mainly spoken in the other 9 provinces and 3 territories. The author of the study suspects that those in Quebec will, on average, have a smaller English vocabulary than those in the rest of Canada. Hence, the author breaks Canada into two strata: Quebec and the rest of Canada.


Example Problem: We want to determine the average income of Canadians.

Plan: We break the country into 13 strata (one for each province and territory) and conduct sampling in each one separately.

Question: What benefit(s) do you suspect stratifying gives?

It is very important to note that different populations have different variances. The same is true of many mathematical functions. Although it is conceivable to split a finite population into finitely many strata, we will only look at the case of two strata.

Consider each of these functions (A, B, and C in the accompanying figure). Finding the point through which one can draw a dividing line is quite straightforward for A. However, what about B and C? B is an oscillating function, increasing over time, and C looks like a quadratic function. The choice of a dividing point is arbitrary, and there are infinitely many points from which we can choose; we have no simple way to determine the one which best isolates the variance, short of lengthy empirical calculations.

Given a function on the interval (0,1), we wish to select a point $a$ at which the variance of the function begins to differ from the rest of the function. This point is arbitrary, but should be reasonable. Thus, we can split up the integral we wish to estimate:


\[ \int_0^1 f(x)\,dx = \int_0^a f(x)\,dx + \int_a^1 f(x)\,dx = a\int_0^1 f(au)\,du + (1-a)\int_0^1 f\big((1-a)u + a\big)\,du, \]
using the substitution $x = au$ (so $dx = a\,du$) in the first integral and $x = (1-a)u + a$ in the second.

This leads to the estimate
\[ \hat\theta_{STRAT} = \frac{a}{n_1}\sum_{i=1}^{n_1} f(y_i) + \frac{1-a}{n_2}\sum_{j=1}^{n_2} f(y_j), \]
where $y_i = a u_i$ and $y_j = (1-a)u_j + a$. (Check that this is an unbiased estimator.)

Recall: the variance of the CMC estimator $\tilde\theta_{CMC}$ was $\mathrm{Var}(\tilde\theta_{CMC}) = \frac{\sigma^2}{n}$. Hence, the larger the sample size $n$, the smaller the variance of our estimator.

In terms of the stratified estimator, we have two sample sizes, $n_1$ and $n_2$. In most situations, we have a maximum total sample size $n$; in this case, $n = n_1 + n_2$.

We want to minimize the variance of our stratified estimator to improve efficiency. Stated mathematically, the problem is: find $n_1, n_2$ minimizing $\mathrm{Var}(\tilde\theta_{STRAT})$ subject to the constraint $n = n_1 + n_2$.

Let $V_i$ be the variance of stratum $i$. Without going into detail, the solution is to make each stratum's sample size proportional to the length of the stratum's interval times the stratum's variance; in notation, $n_1 \propto a V_1$ and $n_2 \propto (1-a) V_2$. Since we often have a limit on how many units can be sampled (due to budget or time constraints, for example), the sample size for stratum $i$ is
\[ n_i = n\left(\frac{c\,V_i}{a V_1 + (1-a) V_2}\right), \]
where $c = a$ or $c = 1-a$ according to whether $i = 1$ or $i = 2$. In practice the numbers will have to be rounded so that $n = n_1 + n_2$ still holds.

Now we derive the variance of this estimator:


\[
\begin{aligned}
\mathrm{Var}(\tilde\theta_{STRAT}) &= \mathrm{Var}\left(a\tilde\theta_1 + (1-a)\tilde\theta_2\right) \\
&= a^2\,\mathrm{Var}(\tilde\theta_1) + (1-a)^2\,\mathrm{Var}(\tilde\theta_2) \\
&= a^2\frac{\sigma_1^2}{n_1} + (1-a)^2\frac{\sigma_2^2}{n_2},
\end{aligned}
\]
where $n_i$ is the sample size and $\sigma_i^2$ the variance of stratum $i$.

The following algorithm gives an efficient means to obtain and evaluate the estimator:

1. Make a graph and select an $a$.
2. Do a pilot study with a small $n$ to get an idea of the variances, using $n_1 = n_2$.
3. Plug this into a program (e.g. R) to estimate the variance of each stratum.
4. Determine the optimal $n_i$, $i \in \{1, 2\}$, with $n_i \propto a_i\,\mathrm{Var}(\text{stratum } i)$; that is, $n_i = n\left(\frac{a_i V_i}{a_1 V_1 + a_2 V_2}\right)$, where $a_1 = a$ and $a_2 = 1-a$.
5. Perform your study using the new sample sizes.

Example: We wish to evaluate $\theta = \int_0^1 e^{-x^2/2}\,dx$.

a) Write a STRAT algorithm using the value $a = \frac{1}{3}$ to evaluate $\theta$:
\[ \hat\theta_{STRAT} = \frac{1/3}{n_1}\sum_{i=1}^{n_1} f(y_i) + \frac{2/3}{n_2}\sum_{j=1}^{n_2} f(y_j) = \frac{1/3}{n_1}\sum_{i=1}^{n_1} e^{-(u_i/3)^2/2} + \frac{2/3}{n_2}\sum_{j=1}^{n_2} e^{-(2u_j/3 + 1/3)^2/2}. \]

Algorithm:

1. Generate $n_1$ values $u_i$ and $n_2$ values $v_j$, realizations of $U \sim U(0,1)$.
2. The estimated area is $\hat\theta = \frac{1/3}{n_1}\sum_{i=1}^{n_1} e^{-(u_i/3)^2/2} + \frac{2/3}{n_2}\sum_{j=1}^{n_2} e^{-(2v_j/3 + 1/3)^2/2}$.

b) Assuming that we can sample $n = 1000$ heights, what values of $n_1$ and $n_2$ should we use?

u=runif(1000)
v=runif(1000)
g1=exp(-(1/3*u)^2/2)
g2=exp(-(2/3*v+1/3)^2/2)
n1=ceiling(1000*(1/3*var(g1))/(1/3*var(g1)+2/3*var(g2)))
n2=1000-n1

During this run of the program, I obtained $n_1 = 14$ and $n_2 = 986$.

c) Write R code using the algorithm in a) and the values in b) to estimate $\theta$. Using $n$ = 10, 100 and 1000, provide estimates for $\theta$.

u=runif(n1)
v=runif(n2)
g1=exp(-(1/3*u)^2/2)
g2=exp(-(2/3*v+1/3)^2/2)
est=1/3*mean(g1)+2/3*mean(g2)
est

Note that the estimates will change from run to run. For $n = 10$, one run gave $\tilde\theta_{STRAT} = 0.8832353$.

In fact, we can have $m$ disjoint strata that cover the entire domain. If we define the sizes of the strata to be $a_1, a_2, \ldots, a_m$ with variances $V_1, V_2, \ldots, V_m$, then the optimal sample size for stratum $j$ is
\[ n_j = n\left(\frac{a_j V_j}{\sum_{i=1}^{m} a_i V_i}\right). \]
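The allocation above is easy to automate once pilot estimates of the stratum variances are available. The following helper is an illustrative sketch (the function name strat_alloc and the example numbers are not from the notes):

# Proportional allocation n_j = n * a_j V_j / sum_i a_i V_i, rounded to sum to n.
strat_alloc <- function(n, a, V) {
  w  <- a * V / sum(a * V)
  nj <- round(n * w)
  nj[1] <- nj[1] + (n - sum(nj))   # patch any rounding error into stratum 1
  nj
}
# Example: two strata of lengths 1/3 and 2/3 with pilot variances 0.001 and 0.03.
strat_alloc(1000, c(1/3, 2/3), c(0.001, 0.03))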

Example: Call Options. Define $f$ by the BSM model; in R, this is the function fn of section 6.5. The following code stratifies our function at a prespecified value of $a$:

x=runif(50000)
y=runif(50000)
strata1=a*fn(a*x)
strata2=(1-a)*fn((1-a)*y+a)
F=c(strata1,strata2)
var(F)


The variance of the sample mean of the components of the vector F is var(F)/length(F), or around $9.2 \times 10^{-8}$. Since each component of the vector above corresponds to two function evaluations, we should compare this with a crude Monte Carlo estimator with $n = 1{,}000{,}000$, which has variance $\sigma_f^2 \times 10^{-6} = 4.36 \times 10^{-7}$. This corresponds to an efficiency gain of $43.6/9.2$, or around 5. We can afford to use one fifth of the sample size by simply stratifying the sample into two strata. The improvement is somewhat limited by the fact that we are still sampling in a region in which the function is 0 (although now slightly less often).

6.8 Control Variates

The crude Monte Carlo integration method is a means for us to numerically approximate the value of a definite integral that would otherwise be impossible to compute. Consider the function $e^{-x^2}$ over the interval (0,1). Although we cannot integrate this function exactly, we can find a function which is very close to $e^{-x^2}$ but has a known integral, such as $e^{-x}$. We may be able to use this fact to help us integrate $e^{-x^2}$. This is the logic behind the control variate technique.

For a given definite integral which still satisfies the crude Monte Carlo integration criteria, we have
\[ \theta = \int_0^1 f(x)\,dx = \int_0^1 \left(f(x) - g(x) + g(x)\right)dx = \int_0^1 \left(f(x) - g(x)\right)dx + \int_0^1 g(x)\,dx. \]

This method will work given the following criteria:

1. $f(x)$ and $g(x)$ must be "close" (i.e. $f(x) \approx g(x)$).
2. $g(x)$ must be integrable in closed form, since $f(x)$ usually isn't.

The method works because the variability of $h(x) = f(x) - g(x)$ is smaller than the variability of $f(x)$ on its own. Note that "closeness" is not precisely defined here.

We define the control variate estimator to be
\[ \hat\theta_{CV} = \frac{1}{n}\sum_{i=1}^{n} h(X_i) + \int_0^1 g(x)\,dx. \]
As in the prior cases, we investigate unbiasedness and variability.

Unbiasedness:


\[
\begin{aligned}
E[\tilde\theta_{CV}] &= E\left[\frac{1}{n}\sum_{i=1}^{n} h(X_i) + \int_0^1 g(x)\,dx\right] \\
&= \frac{1}{n}E\left[\sum_{i=1}^{n}\left(f(X_i) - g(X_i)\right)\right] + \int_0^1 g(x)\,dx \\
&= \theta - \frac{1}{n}E\left[\sum_{i=1}^{n} g(X_i)\right] + \int_0^1 g(x)\,dx \\
&= \theta.
\end{aligned}
\]
Thus, the control variate estimator is unbiased. (This result holds by the law of large numbers.)

Variability:

\[
\begin{aligned}
\mathrm{Var}(\tilde\theta_{CV}) &= \mathrm{Var}\left[\frac{1}{n}\sum_{i=1}^{n} h(X_i) + \int_0^1 g(x)\,dx\right] \\
&= \frac{1}{n^2}\,\mathrm{Var}\left[\sum_{i=1}^{n} h(X_i)\right] \\
&= \frac{\sigma_h^2}{n}.
\end{aligned}
\]
Here $\sigma_h^2$ depends on the function $h$, and provided $f(x) \approx g(x)$ we have $\frac{\sigma_{CV}^2}{n} < \frac{\sigma_{CMC}^2}{n}$.

6.9 Optimal Control Variates

Now, if we want to improve the efficiency of our simulation method, consider a more general form of the control variate estimator, which we will call the optimal control variate (OCV for short):
\[ \tilde\theta_{OCV} = \frac{1}{n}\sum_{i=1}^{n}\left(f(X_i) - d\,g(X_i)\right) + d\int_0^1 g(x)\,dx. \]
Note that if $d = 1$, we recover the estimator of the previous section. For a general $d$, the variance of the estimator is as follows:

\[
\begin{aligned}
\mathrm{Var}(\tilde\theta_{OCV}) &= \frac{1}{n^2}\sum_{i=1}^{n}\mathrm{Var}\left(f(X_i) - d\,g(X_i)\right) \\
&= \frac{1}{n^2}\sum_{i=1}^{n}\left(\mathrm{Var}(f(X)) + d^2\,\mathrm{Var}(g(X)) - 2d\,\mathrm{Cov}(f(X), g(X))\right).
\end{aligned}
\]


We now want to find the value of $d$ that minimizes this variance. Let $\mathrm{Var}(f(X)) = \sigma_f^2$, $\mathrm{Var}(g(X)) = \sigma_g^2$ and $\mathrm{Cov}(f(X), g(X)) = \rho$. Then
\[ \mathrm{Var}(\tilde\theta_{OCV}) = \frac{1}{n^2}\sum_{i=1}^{n}\left(\sigma_f^2 + d^2\sigma_g^2 - 2d\rho\right) = \frac{1}{n}\left(\sigma_f^2 + d^2\sigma_g^2 - 2d\rho\right). \]
Differentiating with respect to $d$ and setting the derivative equal to zero,
\[ \frac{\partial}{\partial d}\,\mathrm{Var}(\tilde\theta_{OCV}) = \frac{1}{n}\left(2d\sigma_g^2 - 2\rho\right) = 0, \]
so
\[ d = \frac{\rho}{\sigma_g^2} = \frac{\mathrm{Cov}(f(X), g(X))}{\mathrm{Var}(g(X))}. \]
Thus, the variance reduction, and hence the efficiency improvement, relies on the fact that the variance of a definite integral is 0 (since it is a constant) and that the two functions are relatively close. Again, "closeness" is not precisely defined.

Example: We wish to evaluate $\theta = \int_0^1 e^{-x^2/2}\,dx$.

a) Write a CV algorithm using the control variate $g(x) = e^{-x}$ to evaluate $\theta$:
\[ \tilde\theta_{CV} = \frac{1}{n}\sum_{i=1}^{n} h(X_i) + \int_0^1 g(x)\,dx = \frac{1}{n}\sum_{i=1}^{n}\left(e^{-x_i^2/2} - e^{-x_i}\right) + \int_0^1 e^{-x}\,dx = \frac{1}{n}\sum_{i=1}^{n}\left(e^{-x_i^2/2} - e^{-x_i}\right) + \left(1 - e^{-1}\right). \]

Algorithm:

1. Generate $n$ values $u_i$, realizations of $U \sim U(0,1)$.
2. For each $u_i$, find $h(u_i) = e^{-u_i^2/2} - e^{-u_i}$.
3. The estimated area is $\hat\theta = \frac{1}{n}\sum_{i=1}^{n}\left(e^{-u_i^2/2} - e^{-u_i}\right) + \left(1 - e^{-1}\right)$.

b) Write R code using the algorithm in a) to estimate $\theta$. Using $n$ = 10, 100 and 1000, provide estimates for $\theta$.

u=runif(10)
g1=exp(-u^2/2)
g2=exp(-u)
est=g1-g2
theta=mean(est)+(1-exp(-1))
sigmasq=var(est)
theta

Note that estimates will change from run to run. For $n = 10$, one run gave $\hat\theta_{CV} = 0.811051$.

c) Write an OCV algorithm using the control variate $g(x) = e^{-x}$ to evaluate $\theta$:

\[ \tilde\theta_{OCV} = \frac{1}{n}\sum_{i=1}^{n}\left(e^{-x_i^2/2} - d\,e^{-x_i}\right) + d\left(1 - e^{-1}\right), \quad\text{where } d = \frac{\mathrm{Cov}(f(X), g(X))}{\mathrm{Var}(g(X))}. \]

Algorithm:

1. Generate $n$ values $u_i$, realizations of $U \sim U(0,1)$.
2. For each $u_i$, find $h(u_i) = e^{-u_i^2/2} - d\,e^{-u_i}$, where $d = \dfrac{\mathrm{Cov}\left(e^{-U^2/2},\, e^{-U}\right)}{\mathrm{Var}\left(e^{-U}\right)}$ is estimated from the sample.
3. The estimated area is $\hat\theta = \frac{1}{n}\sum_{i=1}^{n}\left(e^{-u_i^2/2} - d\,e^{-u_i}\right) + d\left(1 - e^{-1}\right)$.

d) Write R code using the algorithm in c) to estimate $\theta$. Using $n$ = 10, 100 and 1000, provide estimates for $\theta$.

u=runif(10)
g1=exp(-u^2/2)
g2=exp(-u)
d=cov(g1,g2)/var(g2)
est=g1-d*g2
theta=mean(est)+d*(1-exp(-1))
theta

Note: estimates will change from run to run. For $n = 10$, one run gave $\hat\theta_{OCV} = 0.8880204$.

e) Compare the efficiencies of the OCV, CV, STRAT and ANTI estimators with $n = 1000$.

Combining the code from above:

#CMC
u=runif(1000)
est=exp(-u^2/2)
CMC=var(est)/1000

#ANTI
u=runif(1000)
v=1-runif(1000)
g1=exp(-u^2/2)
g2=exp(-v^2/2)
est=(g1+g2)/2
theta=mean(est)
ANTI=var(est)/1000

#CV
u=runif(1000)
g1=exp(-u^2/2)
g2=exp(-u)
est=g1-g2
theta=mean(est)+(1-exp(-1))
CV=var(est)/1000

#OCV
u=runif(1000)
g1=exp(-u^2/2)
g2=exp(-u)
d=cov(g1,g2)/var(g2)
est=g1-d*g2
theta=mean(est)+d*(1-exp(-1))
OCV=var(est)/1000

#STRAT
u=runif(1000)
v=runif(1000)
g1=exp(-(1/3*u)^2/2)
g2=exp(-(2/3*u+1/3)^2/2)
n1=ceiling(1000*(1/3*var(g1))/(1/3*var(g1)+2/3*var(g2)))
n2=1000-n1
u=runif(n1)
v=runif(n2)
g1=exp(-(1/3*u)^2/2)
g2=exp(-(2/3*u+1/3)^2/2)
est=1/3*mean(g1)+2/3*mean(g2)
STRAT=1/9*var(g1)/n1+4/9*var(g2)/n2

output=as.matrix(t(CMC/c(STRAT,OCV,CV,ANTI)))
colnames(output)=c('STRAT','OCV','CV','ANTI')
output

    STRAT      OCV        CV         ANTI
    1.306962   10.03115   2.496909   2.08554

Hence, the most efficient method in this case is OCV: it is about 10 times more efficient than CMC. This means the CMC code requires 10 times as many heights to reach the same precision as OCV. It may not seem like much, but it implies that running the OCV code 1000 times gives the same accuracy as someone using CMC 10,000 times.


6.10 Importance Sampling

Recall the definition of crude Monte Carlo integration:
\[ E[g(X)] = \int f(x)\,g(x)\,dx. \]
If $X \sim U(0,1)$, and hence $f(x) = 1$, then we have the basis of the other variance reduction techniques. Now we consider what happens if $X$ is not uniformly distributed.

In the control variate case, we changed the formula by adding and subtracting a known function $g(x)$: essentially, by adding zero to the integral, retaining its unbiasedness while making it easier to estimate. In importance sampling, we will instead multiply by 1. The known function in this case will be $h(x)$, which is selected under the following assumptions:

1. $h(x)$, or at least a constant times $h(x)$, is a pdf.
2. We have a method to generate a random variable from $h(x)$. For the moment, assume this allows us to use any distribution R can generate, including normal, exponential, chi-squared, gamma, etc.
3. $\dfrac{f(x)\,g(x)}{h(x)} \approx k$, a constant; hence it will have little variability.

Note: often, for simplicity, we ignore $g(x)$, i.e. we assume $g(x) = 1$.

Theory:

Mathematically, we will have done "nothing"; however, this accomplishes a lot more than nothing. Most notably, it allows us to express the integral as an expectation with respect to a proper pdf.
\[
\begin{aligned}
\theta &= \int f(x)\,g(x)\,dx \\
&= \int f(x)\,g(x)\left(\frac{h(x)}{h(x)}\right)dx, \quad\text{for } h(x) \neq 0 \ \forall x \\
&= \int \left(\frac{f(x)\,g(x)}{h(x)}\right)h(x)\,dx \\
&= E\left[\frac{f(X)\,g(X)}{h(X)}\right].
\end{aligned}
\]

Under our simplifying assumption, $\theta = E\left[\dfrac{f(X)}{h(X)}\right]$.

And, since an expectation is a weighted average, we obtain our estimator:
\[ \hat\theta_{IMP} = \frac{1}{n}\sum_{i=1}^{n}\frac{f(x_i)\,g(x_i)}{h(x_i)}. \]
Using our simplifying assumption, we have
\[ \hat\theta_{IMP} = \frac{1}{n}\sum_{i=1}^{n}\frac{f(x_i)}{h(x_i)}. \]
In all of these cases, the $x_i$ are generated from the distribution with cdf $H(x)$.

The variance of the importance sampling technique involves Taylor's approximation; for our purposes,
\[ \mathrm{Var}\left(\tilde\theta_{IMP}\right) = \frac{1}{n}\,\mathrm{Var}\left(\frac{f(X)}{h(X)}\right). \]
The following assumptions on $h(x)$ should hold:

1. $h(x)$ should be close to a constant times $f(x)$, i.e. $\dfrac{f(x)}{h(x)} \approx k$.
2. $h(x)$ should not equal zero wherever $f(x)$ is nonzero. Where $h(x) = 0$, we can ignore the data point.

Example: We wish to evaluate $\theta = \int_{-\infty}^{\infty} e^{-|x|}\,dx$.

a) Write an IMP algorithm using the $N(0,1)$ pdf to evaluate $\theta$:
\[
\begin{aligned}
\theta &= \int_{-\infty}^{\infty} e^{-|x|}\,dx \\
&= \int_{-\infty}^{\infty} e^{-|x|}\left(\frac{\frac{1}{\sqrt{2\pi}}e^{-x^2/2}}{\frac{1}{\sqrt{2\pi}}e^{-x^2/2}}\right)dx \\
&= \int_{-\infty}^{\infty}\left(\frac{e^{-|x|}}{\frac{1}{\sqrt{2\pi}}e^{-x^2/2}}\right)\frac{1}{\sqrt{2\pi}}e^{-x^2/2}\,dx \\
&= E\left[\frac{e^{-|X|}}{\frac{1}{\sqrt{2\pi}}e^{-X^2/2}}\right], \quad\text{where } X \sim N(0,1).
\end{aligned}
\]

Algorithm:

1. Generate $n$ values $x_i$, realizations of $X \sim N(0,1)$.
2. The estimated area is $\hat\theta = \dfrac{1}{n}\displaystyle\sum_{i=1}^{n}\frac{e^{-|x_i|}}{\frac{1}{\sqrt{2\pi}}e^{-x_i^2/2}}$.

b) Write R code using the algorithm in a) to estimate $\theta$. Using $n$ = 10, 100 and 1000, provide estimates for $\theta$.

#IMP
u=rnorm(1000,0,1)
g1=exp(-abs(u))
g2=1/sqrt(2*pi)*exp(-u^2/2)
est=mean(g1/g2)
est

Note: estimates will change from run to run. For $n = 10$, one run gave $\hat\theta_{IMP} = 1.958993$.

c) Determine the true value of $\theta = \int_{-\infty}^{\infty} e^{-|x|}\,dx$.

Answer: 2


6.11 Combining Monte Carlo Estimators

We have now seen several different variance reduction techniques. The main challenge is to determine the "best" one for a given function. The variance formula may be used as a basis for choosing a "best" method; however, these variances (and efficiencies) must themselves be estimated from the simulation, and it is rarely clear a priori which sampling procedure and estimator is best. For example, if a function $f$ is monotone on $[0,1]$ then an antithetic variate can be introduced with an estimator of the form
\[ \hat\theta_{a1} = \tfrac{1}{2}\left[f(U) + f(1-U)\right], \quad U \sim U[0,1]. \tag{6.1} \]
If, however, the function increases to a maximum somewhere around $\tfrac{1}{2}$ and then decreases, we might prefer
\[ \hat\theta_{a2} = \tfrac{1}{4}\left[f(U/2) + f((1-U)/2) + f((1+U)/2) + f(1-U/2)\right]. \tag{6.2} \]

Notice that any weighted average of these two unbiased estimators of $\theta$ would also provide an unbiased estimator of $\theta$. The large number of potential variance reduction techniques is an embarrassment of riches. Thus the question is: which variance reduction methods should we use, and how will we know whether one is better than its competitors? The answer is to use "all of the methods" (within reason, of course); in fact, choosing a single method is often neither necessary nor desirable. Rather, it is preferable to use a weighted average of the available estimators, with the optimal choice of weights provided by regression.

Suppose in general that we have $k$ estimators or statistics $\hat\theta_i$, $i = 1, \ldots, k$, all unbiased estimators of the same parameter $\theta$, so that $E(\hat\theta_i) = \theta$ for all $i$. In vector notation, letting $\Theta' = (\hat\theta_1, \ldots, \hat\theta_k)$, we write $E(\Theta) = \mathbf{1}\theta$, where $\mathbf{1}$ is the $k$-dimensional column vector of ones, $\mathbf{1}' = (1, 1, \ldots, 1)$. Let us suppose for the moment that we know the variance-covariance matrix $V$ of the vector $\Theta$, defined by $V_{ij} = \mathrm{cov}(\hat\theta_i, \hat\theta_j)$.

Theorem 6. (best linear combination of estimators)
The linear combination of the $\hat\theta_i$ which provides an unbiased estimator of $\theta$ and has minimum variance among all linear unbiased estimators is
\[ \hat\theta_{blc} = \sum_i b_i \hat\theta_i, \tag{6.3} \]
where the vector $b = (b_1, \ldots, b_k)'$ is given by
\[ b = \left(\mathbf{1}^t V^{-1}\mathbf{1}\right)^{-1} V^{-1}\mathbf{1}. \]
The variance of the resulting estimator is
\[ \mathrm{var}(\hat\theta_{blc}) = b^t V b = 1/\left(\mathbf{1}^t V^{-1}\mathbf{1}\right). \]

Proof. It is easy to see that for any linear combination (6.3), the variance of the estimator is $b^t V b$, and we wish to minimize this quadratic form as a function of $b$, subject to the constraint that the coefficients add to one, i.e. $b'\mathbf{1} = 1$. Introducing a Lagrange multiplier, we set the derivatives with respect to the components $b_i$ equal to zero:
\[ \frac{\partial}{\partial b}\left\{b^t V b + \lambda\left(b'\mathbf{1} - 1\right)\right\} = 0, \quad\text{or}\quad 2Vb + \lambda\mathbf{1} = 0, \]
so $b = \text{constant} \times V^{-1}\mathbf{1}$. Upon requiring that the coefficients add to one, we find that the constant is $\left(\mathbf{1}^t V^{-1}\mathbf{1}\right)^{-1}$.

This theorem indicates that the ideal linear combination of estimators has coefficients proportional to the row sums of the inverse covariance matrix. Notably, the variance of a particular estimator $\hat\theta_i$ is only one of many ingredients in that sum. In practice, we almost never know the variance-covariance matrix $V$ of a vector of estimators $\Theta$. However, in a simulation we evaluate these estimators using the same uniform input for each and obtain independent replicated values of $\Theta$. This permits us to estimate the covariance matrix $V$, and since we typically conduct many simulations, this estimate can be very accurate. Suppose we have $n$ simulated values of the vector $\Theta$, call them $\Theta_1, \ldots, \Theta_n$. As usual, we estimate the covariance matrix $V$ by the sample covariance matrix
\[ \hat V = \frac{1}{n-1}\sum_{i=1}^{n}\left(\Theta_i - \bar\Theta\right)\left(\Theta_i - \bar\Theta\right)', \quad\text{where}\quad \bar\Theta = \frac{1}{n}\sum_{i=1}^{n}\Theta_i. \]
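The computation in Theorem 6, with $V$ replaced by $\hat V$, takes only a few lines in R. The following sketch is illustrative rather than code from the notes; it assumes THETA is an $n \times k$ matrix whose $i$th row holds the $k$ estimators computed from the $i$th simulation run (all from the same uniforms), and the function name blc_weights is hypothetical:

blc_weights <- function(THETA) {
  V   <- cov(THETA)                                    # sample covariance matrix
  J   <- solve(V)                                      # its inverse
  one <- rep(1, ncol(THETA))
  b   <- J %*% one / as.numeric(t(one) %*% J %*% one)  # optimal weights
  list(estimate = mean(THETA %*% b),                   # combined estimator
       weights  = as.numeric(b),
       variance = 1 / as.numeric(t(one) %*% J %*% one))# variance per replication
}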

Let us return to the example and attempt to find the best combination of the many estimators we have considered so far. To this end, let
\[
\begin{aligned}
\hat\theta_1 &= \frac{0.53}{2}\left[f(.47 + .53u) + f(1 - .53u)\right], \quad\text{an antithetic estimator,} \\
\hat\theta_2 &= \frac{0.37}{2}\left[f(.47 + .37u) + f(.84 - .37u)\right] + \frac{0.16}{2}\left[f(.84 + .16u) + f(1 - .16u)\right], \quad\text{a stratified-antithetic estimator,} \\
\hat\theta_3 &= 0.37\,f(.47 + .37u) + 0.16\,f(1 - .16u), \quad\text{a stratified estimator,} \\
\hat\theta_4 &= \int g(x)\,dx + \left[f(u) - g(u)\right], \quad\text{a control variate estimator,} \\
\hat\theta_5 &= \hat\theta_{IMP}, \quad\text{an importance sampling estimator.}
\end{aligned}
\]

All these estimators are obtained from a single input uniform random variate $U$. In order to determine the optimal linear combination, we generate simulated values of all 5 estimators using the same uniform random numbers as inputs. We determine the best linear combination of these estimators using the following code (MATLAB syntax; fn is the function of section 6.5, and GG and importance are assumed to be defined elsewhere in the notes):

function [o,v,b,V]=optimal(U)
% generates optimal linear combination of five estimators and outputs
% average estimator, variance and weights
% input U: a row vector of U[0,1] random numbers
T1=(.53/2)*(fn(.47+.53*U)+fn(1-.53*U));
T2=.37*.5*(fn(.47+.37*U)+fn(.84-.37*U))+.16*.5*(fn(.84+.16*U)+fn(1-.16*U));
T3=.37*fn(.47+.37*U)+.16*fn(1-.16*U);
intg=2*(.53)^3+.53^2/2;
T4=intg+fn(U)-GG(U);
T5=importance('fn',U);
X=[T1' T2' T3' T4' T5'];
% matrix whose columns are replications of the same estimator,
% each row = 5 estimators using the same U
mean(X)
V=cov(X);                % this estimates the covariance matrix V
on=ones(5,1);
V1=inv(V);               % the inverse of the covariance matrix
b=V1*on/(on'*V1*on);     % coefficients of the optimal linear combination
o=mean(X*b);             % value of the optimal linear combination
v=1/(on'*V1*on);         % variance of the optimal combination based on a single U

One run of this estimator, called with [o,v,b,V]=optimal(unifrnd(0,1,1,1000000)), yields
\[ o = 0.4615, \qquad b = [-0.5499,\ 1.4478,\ 0.1011,\ 0.0491,\ -0.0481]. \]
The estimate 0.4615 is accurate to at least four decimals, which is not surprising since the variance per uniform random number input is $v = 1.13 \times 10^{-5}$. In other words, the variance of the mean based on 1,000,000 uniform inputs is $1.13 \times 10^{-10}$, and the standard error is around 0.00001, so we can expect accuracy to at least 4 decimal places. Note that some of the weights are negative and others are greater than one.

Do these negative weights indicate estimators that are worse than useless? Not necessarily: subtracting some of the estimators may render the remaining function more nearly linear, and hence more easily estimated. We note that negative coefficients are quite common in regression. The efficiency gain over crude Monte Carlo is an extraordinary 40,000. However, since there are 10 function evaluations for each uniform variate input, the efficiency when adjusting for the number of function evaluations is 4,000.

This simulation, using 1,000,000 uniform random numbers and taking about 63 seconds on a Pentium IV (2.4 GHz) (including the time required to generate all five estimators), is equivalent to forty billion simulations by crude Monte Carlo: a major task on a supercomputer!

If we intended to use this simulation method repeatedly, we might wish to see whether some of the estimators can be omitted without too much loss of information. Since the variance of the optimal estimator is $1/(\mathbf{1}^t V^{-1}\mathbf{1})$, we can use this quantity to select an estimator for deletion. Notice that it is not so much the covariance matrix $V$ of the estimators that enters into Theorem 6 but its inverse $J = V^{-1}$, which we can regard as a type of information matrix, by analogy with maximum likelihood theory. For example, we could delete the $i$th estimator, i.e. delete the $i$th row and column of $V$, where $i$ is chosen to have the smallest effect on $1/(\mathbf{1}^t V^{-1}\mathbf{1})$, or equivalently on its reciprocal $\mathbf{1}^t J\mathbf{1} = \sum_i\sum_j J_{ij}$.

In particular, if we let $V_{(i)}$ be the matrix $V$ with the $i$th row and column deleted and $J_{(i)} = V_{(i)}^{-1}$, then we can identify $\mathbf{1}^t J\mathbf{1} - \mathbf{1}^t J_{(i)}\mathbf{1}$ as the loss of information when the $i$th estimator is deleted. Since not all estimators require the same number of function evaluations, we should adjust this information by $FE(i)$, the number of function evaluations required by the $i$th estimator. In other words, if an estimator is to be deleted, it should be the one corresponding to
\[ \min_i \left\{\frac{\mathbf{1}^t J\mathbf{1} - \mathbf{1}^t J_{(i)}\mathbf{1}}{FE(i)}\right\}, \]
and we should drop this $i$th estimator if the minimum is less than the information per function evaluation in the combined estimator. In other words, we will increase the information available in our simulation per function evaluation.

In the above example, with all five estimators included, $\mathbf{1}^t J\mathbf{1} = 88{,}757$ (with 10 function evaluations per uniform variate), so the information per function evaluation is 8,876.

i   1'J1 - 1'J(i)1   FE(i)   (1'J1 - 1'J(i)1)/FE(i)
1   88,048           2       44,024
2   87,989           4       21,997
3   28,017           2       14,008
4   55,725           1       55,725
5   32,323           1       32,323

In this case, if we were to eliminate one of the estimators, our choice would likely be number 3, since it contributes the least information per function evaluation. However, since all contribute more than 8,876 per function evaluation, we should likely retain all five estimators.
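A small sketch of how these quantities might be computed from an estimated covariance matrix (illustrative only; the function name info_drop and its arguments are not from the notes):

# Information lost per function evaluation when estimator i is dropped.
# V: k x k covariance matrix of the estimators; FE: function evaluations per estimator.
info_drop <- function(V, FE) {
  total <- sum(solve(V))                         # 1' J 1, with J = V^{-1}
  sapply(seq_len(nrow(V)), function(i) {
    reduced <- sum(solve(V[-i, -i, drop = FALSE]))
    (total - reduced) / FE[i]
  })
}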


6.12 Variance Reduction Using Common Random Numbers

We now discuss another variance reduction technique, closely related to antithetic variates, called common random numbers. It is used, for example, whenever we wish to estimate the difference in performance between two systems, or any other quantity involving a difference, such as the slope of a function.

Example 3. For a simple example, suppose we have two estimators $\hat\theta_1, \hat\theta_2$ of the "center" of a symmetric distribution. We would like to know which of these estimators is better, in the sense that it has smaller variance when applied to a sample from a specific distribution symmetric about its median. If both estimators are unbiased estimators of the median, then the first estimator is better if
\[ \mathrm{var}(\hat\theta_1) < \mathrm{var}(\hat\theta_2), \]
and so we are interested in estimating a quantity like
\[ E\,h_1(X) - E\,h_2(X), \]
where $X$ is a vector representing a sample from the distribution and $h_1(X) = \hat\theta_1^2$, $h_2(X) = \hat\theta_2^2$. There are at least two ways of estimating this difference:

1. Generate values of $h_1(X_i)$, $i = 1, \ldots, n$ and $h_2(X_j)$, $j = 1, \ldots, m$ from independently generated samples, and use the estimator

\[ \frac{1}{n}\sum_{i=1}^{n} h_1(X_i) - \frac{1}{m}\sum_{j=1}^{m} h_2(X_j). \]

2. Generate independent samples $X_i$, $i = 1, \ldots, n$, evaluate both $h_1(X_i)$ and $h_2(X_i)$ on each, and use the estimator
\[ \frac{1}{n}\sum_{i=1}^{n}\left(h_1(X_i) - h_2(X_i)\right). \]
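A small simulation makes the contrast concrete. The sketch below is not part of the original notes; the sample sizes and the particular choices of $h_1$ (squared sample median) and $h_2$ (squared 10%-trimmed mean) are arbitrary:

set.seed(42)
m <- 5000; n <- 25
h1 <- function(x) median(x)^2
h2 <- function(x) mean(x, trim = 0.1)^2
d_indep <- replicate(m, h1(rnorm(n)) - h2(rnorm(n)))       # method 1: independent samples
d_crn   <- replicate(m, { x <- rnorm(n); h1(x) - h2(x) })  # method 2: same sample for both
c(var(d_indep), var(d_crn))   # the second variance is far smaller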

It seems intuitive that the second method is preferable, since it removes the variability due to the particular sample from the comparison. This is a common type of problem in which we want to estimate the difference between two expected values. For example, we may be considering investing in a new piece of equipment that will speed up processing at one node of a network, and we wish to estimate the expected improvement in performance between the new system and the old. In general, suppose that we wish to estimate the difference between two expectations, say
\[ E\,h_1(X) - E\,h_2(Y), \tag{6.4} \]


where the random variable or vector $X$ has cumulative distribution function $F_X$ and $Y$ has cumulative distribution function $F_Y$. Notice that the variance of a Monte Carlo estimator of this difference,
\[ \mathrm{var}[h_1(X) - h_2(Y)] = \mathrm{var}[h_1(X)] + \mathrm{var}[h_2(Y)] - 2\,\mathrm{cov}\{h_1(X), h_2(Y)\}, \tag{6.5} \]
is small if we can induce a high degree of positive correlation between the generated random variables $X$ and $Y$. This is precisely the opposite of the problem that led to antithetic random numbers, where we wished to induce a high degree of negative correlation. The following lemma is due to Hoeffding (1940) and provides a useful bound on the joint cumulative distribution function of two random variables $X$ and $Y$. Suppose $X$ and $Y$ have cumulative distribution functions $F_X(x)$ and $F_Y(y)$ respectively and a joint cumulative distribution function $G(x,y) = P[X \le x, Y \le y]$.

Lemma 7. (a) The joint cumulative distribution function $G$ of $(X,Y)$ always satisfies
\[ \left(F_X(x) + F_Y(y) - 1\right)^+ \le G(x,y) \le \min\left(F_X(x), F_Y(y)\right) \tag{6.6} \]
for all $x, y$.
(b) Assume that $F_X$ and $F_Y$ are continuous functions. In the case $X = F_X^{-1}(U)$ and $Y = F_Y^{-1}(U)$ for $U$ uniform on $[0,1]$, equality is achieved on the right: $G(x,y) = \min(F_X(x), F_Y(y))$. In the case $X = F_X^{-1}(U)$ and $Y = F_Y^{-1}(1-U)$, there is equality on the left: $\left(F_X(x) + F_Y(y) - 1\right)^+ = G(x,y)$.

Proof. (a) Note that
\[ P[X \le x, Y \le y] \le P[X \le x], \quad\text{and similarly} \quad P[X \le x, Y \le y] \le P[Y \le y]. \]
This shows that $G(x,y) \le \min(F_X(x), F_Y(y))$, verifying the right side of (6.6). Similarly, for the left side,
\[
\begin{aligned}
P[X \le x, Y \le y] &= P[X \le x] - P[X \le x, Y > y] \\
&\ge P[X \le x] - P[Y > y] \\
&= F_X(x) - (1 - F_Y(y)) = F_X(x) + F_Y(y) - 1.
\end{aligned}
\]
Since $G(x,y)$ is also non-negative, the left side follows.

For (b), suppose $X = F_X^{-1}(U)$ and $Y = F_Y^{-1}(U)$; then
\[ P[X \le x, Y \le y] = P\left[F_X^{-1}(U) \le x, F_Y^{-1}(U) \le y\right] = P\left[U \le F_X(x), U \le F_Y(y)\right], \]
since $P[X = x] = 0$ and $P[Y = y] = 0$.


But
\[ P\left[U \le F_X(x), U \le F_Y(y)\right] = \min\left(F_X(x), F_Y(y)\right), \]
verifying the equality on the right of (6.6) for common random numbers. By a similar argument,
\[
\begin{aligned}
P\left[F_X^{-1}(U) \le x, F_Y^{-1}(1-U) \le y\right] &= P\left[U \le F_X(x), 1-U \le F_Y(y)\right] \\
&= P\left[U \le F_X(x), U \ge 1 - F_Y(y)\right] \\
&= \left(F_X(x) - (1 - F_Y(y))\right)^+,
\end{aligned}
\]
verifying the equality on the left.

The following theorem supports the use of common random numbers to maximize covariance and antithetic random numbers to minimize covariance.

Theorem 8. (maximum/minimum covariance)
Suppose $h_1$ and $h_2$ are both non-decreasing (or both non-increasing) functions. Subject to the constraint that $X, Y$ have cumulative distribution functions $F_X, F_Y$ respectively, the covariance
\[ \mathrm{cov}[h_1(X), h_2(Y)] \]
is maximized when $Y = F_Y^{-1}(U)$ and $X = F_X^{-1}(U)$ (i.e. for common $U(0,1)$ random numbers) and is minimized when $Y = F_Y^{-1}(U)$ and $X = F_X^{-1}(1-U)$ (i.e. for antithetic random numbers).

Proof. We sketch a proof of the theorem in the case that the distributions are all continuous and $h_1, h_2$ are differentiable. Define $G(x,y) = P[X \le x, Y \le y]$. The following representation of covariance is useful: define
\[ H(x,y) = P(X > x, Y > y) - P(X > x)P(Y > y) = G(x,y) - F_X(x)F_Y(y). \tag{6.7} \]

Notice that, using integration by parts,
\[
\begin{aligned}
\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} H(x,y)\,h_1'(x)\,h_2'(y)\,dx\,dy
&= -\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} \frac{\partial}{\partial x}H(x,y)\,h_1(x)\,h_2'(y)\,dx\,dy \\
&= \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} \frac{\partial^2}{\partial x\,\partial y}H(x,y)\,h_1(x)\,h_2(y)\,dx\,dy \\
&= \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} h_1(x)\,h_2(y)\,g(x,y)\,dx\,dy - \int_{-\infty}^{\infty} h_1(x)f_X(x)\,dx \int_{-\infty}^{\infty} h_2(y)f_Y(y)\,dy \\
&= \mathrm{cov}\left(h_1(X), h_2(Y)\right), \tag{6.8}
\end{aligned}
\]
where $g(x,y)$, $f_X(x)$, $f_Y(y)$ denote the joint probability density function and the marginal probability density functions of $X$ and $Y$ respectively. In fact, this result holds in general, even without the assumption that the distributions are continuous. The covariance between $h_1(X)$ and $h_2(Y)$, for differentiable functions $h_1$ and $h_2$, is

\[ \mathrm{cov}\left(h_1(X), h_2(Y)\right) = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} H(x,y)\,h_1'(x)\,h_2'(y)\,dx\,dy. \]

The formula shows that in order to maximize the covariance when $h_1, h_2$ are both increasing (or both decreasing) functions, it is sufficient to maximize $H(x,y)$ for each $x$ and $y$, since $h_1'(x)h_2'(y)$ is non-negative. Since we are constraining the marginal cumulative distribution functions to be $F_X$ and $F_Y$, this is equivalent to maximizing $G(x,y)$ subject to the constraints
\[ \lim_{y\to\infty} G(x,y) = F_X(x), \qquad \lim_{x\to\infty} G(x,y) = F_Y(y). \]
Lemma 7 shows that the maximum is achieved when common random numbers are used, and the minimum when we use antithetic random numbers.

We can argue intuitively for the use of common random numbers in the case of a discrete distribution with probability on the points indicated in the accompanying figure. The figure corresponds to a joint distribution with the following probabilities:

x               0     0.25   0.25   0.75   0.75   1
y               0     0.25   0.75   0.25   0.75   1
P[X=x, Y=y]     .1    .2     .2     .1     .2     .2

Suppose we wish to maximize $P[X > x, Y > y]$, subject to the constraint that the probabilities $P[X > x]$ and $P[Y > y]$ are fixed, for the arbitrary fixed values of $(x,y)$ indicated in the figure. Note that if there is any weight attached to the point in the lower right quadrant (labelled $P_2$), some or all of this weight can be reassigned to the point $P_3$ in the lower left quadrant, provided there is an equal movement of weight from the upper left $P_4$ to the upper right $P_1$. Such a movement of weight increases the value of $G(x,y)$ without affecting $P[X \le x]$ or $P[Y \le y]$. The weight that we are able to transfer in this example is 0.1, the minimum of the weights on $P_4$ and $P_2$. In general, this continues until there is no weight in one of the off-diagonal quadrants for every choice of $(x,y)$. The resulting distribution in this example is given by

x               0     0.25   0.25   0.75   0.75   1
y               0     0.25   0.75   0.25   0.75   1
P[X=x, Y=y]     .1    .3     0      .1     .3     .2

It is easy to see that such a joint distribution can be generated from common random numbers: $X = F_X^{-1}(U)$, $Y = F_Y^{-1}(U)$.
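To see how strongly common random numbers couple two different distributions, one can compare $X = F_X^{-1}(U)$, $Y = F_Y^{-1}(U)$ with independently generated pairs. The following lines are an illustrative sketch (the exponential/normal pairing is an arbitrary choice, not from the notes):

u <- runif(100000)
x <- qexp(u)        # X = F_X^{-1}(U)
y <- qnorm(u)       # Y = F_Y^{-1}(U), using the same U
cor(x, y)                                        # strongly positive
cor(qexp(runif(100000)), qnorm(runif(100000)))   # independent draws: near zero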


6.13 Variance Reduction Using Conditioning

We now consider a simple but powerful generalization of control variates. Suppose that we can decompose a random variable $T$ into two components $T_1$ and $\varepsilon$,
\[ T = T_1 + \varepsilon, \]
where $T_1$ and $\varepsilon$ are uncorrelated, $\mathrm{Cov}(T_1, \varepsilon) = 0$, and $E(\varepsilon) = 0$. Regression is one method for determining such a decomposition, and the error term $\varepsilon$ in regression satisfies these conditions. As a result, $T_1$ has the same mean as $T$, and it is easy to see that
\[ \mathrm{Var}(T) = \mathrm{Var}(T_1) + \mathrm{Var}(\varepsilon). \]

So $T_1$ has smaller variance than $T$ (unless $\varepsilon = 0$ with probability 1). This means that if we wish to estimate the common mean of $T$ and $T_1$, the estimator $T_1$ is preferable, since it has the same mean but a smaller variance.

One special case is variance reduction by conditioning. For the standard definition and properties of conditional expectation, see the Appendix. One common definition of $E[X\,|\,Y]$ is the unique (with probability one) function $g(Y)$ of $Y$ which minimizes $E[X - g(Y)]^2$. This definition only applies to random variables $X$ with finite variance, and so requires some modification when $E(X^2) = \infty$. For simplicity, we assume here that all random variables, say $X, Y, Z$, have finite variances. We can define conditional covariance using conditional expectation as

\[ \mathrm{Cov}(X, Y\,|\,Z) = E[XY\,|\,Z] - E[X\,|\,Z]\,E[Y\,|\,Z], \]
and conditional variance as
\[ \mathrm{Var}(X\,|\,Z) = E(X^2\,|\,Z) - \left(E[X\,|\,Z]\right)^2. \]

Variance reduction through conditioning is justified by the following well-known result:

Theorem 9.
(a) $E(X) = E\{E[X\,|\,Y]\}$
(b) $\mathrm{cov}(X,Y) = E\{\mathrm{cov}(X,Y\,|\,Z)\} + \mathrm{cov}\{E[X\,|\,Z], E[Y\,|\,Z]\}$
(c) $\mathrm{var}(X) = E\{\mathrm{var}(X\,|\,Z)\} + \mathrm{var}\{E[X\,|\,Z]\}$

This theorem is used as follows. Suppose we are considering a candidate estimator $\hat\theta$, an unbiased estimator of $\theta$, and suppose we have some random variable $Z$ related to $\hat\theta$, chosen carefully so that we are able to calculate the conditional expectation $T_1 = E[\hat\theta\,|\,Z]$. Then by part (a) of the theorem, $T_1$ is also an unbiased estimator of $\theta$. Define
\[ \varepsilon = \hat\theta - T_1. \]


By part (c),
\[ \mathrm{Var}(\hat\theta) = \mathrm{Var}(T_1) + \mathrm{Var}(\varepsilon), \]
so $\mathrm{var}(T_1) = \mathrm{var}(\hat\theta) - \mathrm{var}(\varepsilon) < \mathrm{var}(\hat\theta)$. In other words, for any variable $Z$, $E[\hat\theta\,|\,Z]$ has the same expectation as $\hat\theta$ but a smaller variance. Actually, the decrease in variance is largest if $Z$ and $\hat\theta$ are nearly independent, because in this case $E[\hat\theta\,|\,Z]$ is close to a constant and its variance close to zero. In general, reducing the variance of an estimator by conditioning requires searching for a random variable $Z$ such that:

1. The conditional expectation $E[\hat\theta\,|\,Z]$ of the original estimator is computable.
2. $\mathrm{Var}(E[\hat\theta\,|\,Z])$ is substantially smaller than $\mathrm{Var}(\hat\theta)$.

Example 4. (hit or miss)

Suppose we wish to estimate the area under the graph of a function $f(x)$ by the hit-and-miss method. A crude method would involve determining a multiple $c$ of a probability density function $g(x)$ which dominates $f(x)$, so that $c\,g(x) \ge f(x)$ for all $x$. We can generate points $(X,Y)$ uniformly distributed under the graph of $c\,g(x)$ by generating $X$ by inverse transform, $X = G^{-1}(U_1)$, where $G(x)$ is the cumulative distribution function corresponding to the density $g$, and then generating $Y$ from the $U(0, c\,g(X))$ distribution, say $Y = c\,g(X)U_2$. An example, with $g(x) = 2x$, $0 < x < 1$, and $c = 1/4$, is shown in the accompanying figure.

The hit-and-miss estimator of the area under the graph of $f$ is obtained by generating such random points $(X,Y)$ and counting the proportion that fall under the graph of $f$, i.e. for which $Y \le f(X)$. This proportion estimates the probability
\[ P[Y \le f(X)] = \frac{\text{area under } f(x)}{\text{area under } c\,g(x)} = \frac{\text{area under } f(x)}{c}. \]

This holds since $g(x)$ is a probability density function. Notice that if we define
\[ W = \begin{cases} c & \text{if } Y \le f(X) \\ 0 & \text{if } Y > f(X), \end{cases} \]
then
\[ E(W) = c \times \frac{\text{area under } f(x)}{\text{area under } c\,g(x)} = \text{area under } f(x). \]


So $W$ is an unbiased estimator of the parameter we wish to estimate. We might therefore estimate the area under $f(x)$ using the Monte Carlo estimator $\hat\theta_{HM} = \frac{1}{n}\sum_{i=1}^{n} W_i$ based on independent values of $W_i$. This is the "hit-or-miss" estimator. However, in this case it is easy to find a random variable $Z$ such that the conditional expectation $E(W\,|\,Z)$ can be determined in closed form. In fact, we can choose $Z = X$; we obtain
\[ E[W\,|\,X] = \frac{f(X)}{g(X)}. \]
This is therefore an unbiased estimator of the same parameter, and it has smaller variance than $W$. For a sample of size $n$, we should replace the hit-or-miss estimator $\hat\theta_{HM}$ by the estimator
\[ \hat\theta_{Cond} = \frac{1}{n}\sum_{i=1}^{n}\frac{f(X_i)}{g(X_i)} = \frac{1}{n}\sum_{i=1}^{n}\frac{f(X_i)}{2X_i}, \]
with $X_i$ generated from $X = G^{-1}(U_i) = \sqrt{U_i}$, $i = 1, 2, \ldots, n$, and $U_i \sim U(0,1)$. In this case, the conditional expectation results in a familiar form for the estimator $\hat\theta_{Cond}$: although this is simply an importance sampling estimator with $g(x)$ as the importance distribution, this derivation shows that the estimator $\hat\theta_{Cond}$ has a smaller variance than $\hat\theta_{HM}$.
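The gain from conditioning can be checked numerically. The sketch below is illustrative and not part of the original notes: the integrand $f(x) = x^2/4$ is an arbitrary choice that satisfies $f(x) \le c\,g(x)$ for $g(x) = 2x$ and $c = 1/4$ on $(0,1)$, with true area $1/12$:

set.seed(7)
n  <- 100000
f  <- function(x) x^2 / 4
g  <- function(x) 2 * x
cc <- 1/4
X  <- sqrt(runif(n))              # X = G^{-1}(U) for G(x) = x^2
Y  <- cc * g(X) * runif(n)        # Y | X ~ U(0, c g(X))
W  <- cc * (Y <= f(X))            # hit-or-miss variable
c(mean(W), mean(f(X) / g(X)))     # both estimates are near 1/12
c(var(W), var(f(X) / g(X)))       # conditioning gives the much smaller variance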


6.14 Problems

1. Let the antithetic estimator be denoted by
\[ \tilde\theta = \sum_{i=1}^{n}\frac{f(U_i) + f(1-U_i)}{2n}, \]
where $U_i \sim U(0,1)$. Show that the estimator is unbiased. Note: $\theta = \int_0^1 f(x)\,dx$.

2. RANDI's main problem was that it was based on RANDU, which explains why RANDI was not a good random number generator. Because of concerns over the quality of RANDI, RANDII was used. Professor Derry Vative loves integrating things using only a $U(0,1)$ "derived" from RANDII. In particular, he wishes to determine $\theta = \int_{-\infty}^{\infty} 2^{-|x|}\,dx$.

(a) Find $\theta$.

(b) Write an algorithm using Monte Carlo integration, and an appropriate substitution that gives the integral(s) 0, 1 bounds, to find $\hat\theta$.

3. We wish to estimate the area beneath $f(x) = \ln(x+1)$ for $0 \le x \le 1$. A "close" function to $f(x)$, if needed, is $g(x) = \sqrt{x}$ (note: $g(x)$ is not a p.d.f.). The following is known about $f(x)$ and $g(x)$:
\[
\begin{aligned}
&E(f(X)) \approx 0.4, \quad E\big((f(X))^2\big) \approx 0.2, \quad \mathrm{Var}(f(X)) \approx 0.2, \\
&E(g(X)) = \tfrac{2}{3}, \quad E\big((g(X))^2\big) = \tfrac{1}{2}, \quad \mathrm{Var}(g(X)) = \tfrac{1}{18}, \\
&E(f(X)g(X)) \approx 0.3, \quad \mathrm{Cov}(f(X), g(X)) \approx 0.03, \\
&E(f(X)f(1-X)) \approx 0.11, \quad \mathrm{Cov}(f(X), f(1-X)) \approx -0.05.
\end{aligned}
\]

(a) Determine, or state, $\theta$.

(b) Using $g(x)$ as a control variate, state the estimate $\hat\theta_{CV}$ in terms of $f(x_i)$ and $g(x_i)$. Be sure to evaluate any expressions that can be evaluated.

(c) Determine the efficiency of our control variate estimator.

(d) Explain what the efficiency you calculated in (c) means to the general layperson.

(e) Using $g(x)$, state the importance sampling estimate $\hat\theta_{IMP}$ in terms of $f(x_i)$ and $g(x_i)$. Clearly explain where all parts of the expression come from.

(f) Would you suggest using antithetic random variables on $f(x)$? Why or why not?

(g) Determine the efficiency of our antithetic estimator.


4. Dery Vate wants to integrate $f(x) = \exp(x-1)$ from 0 to 1. She wants an algorithm that will obtain an estimate for this integral, $\theta$, with as small a variance as possible.

(a) Give a crude Monte Carlo estimate using the uniform random variables U = [0.1, 0.5, 0.8].

(b) Obtain the expected value and variance of the crude Monte Carlo estimator.

(c) She thinks she may be able to improve the process using antithetic random variables. Explain in words why this will or will not work.

(d) Obtain the expected value and variance of the antithetic estimator.

(e) Still not content with the results, she decides to use a control variate. Find the expected value and variance using the control variate g(x) = 3(x-1).

5. We are interested in estimating $\theta = \int_0^1 \exp(x)\,dx$.

(a) Determine the variance for the crude Monte Carlo method.

(b) Determine the variance for the antithetic method.

(c) Determine the variance for the control variate $g(U) = U$, $U \sim U(0,1)$.

6. Use CMC to write an algorithm that can be used to estimate $\theta$ for each of the integrals below.

(a) $\theta = \int_0^1 \exp(x)\,dx$
(b) $\theta = \int_{12}^{-8} \exp(x)\,dx$
(c) $\theta = \int_{-\infty}^{-1} \frac{1}{x}\,dx$
(d) $\theta = \int_{-\infty}^{\infty} \frac{1}{x}\,dx$
(e) $\theta = \int_{-\infty}^{\infty} \exp(-x^2)\,dx$

7. In question 6, write algorithms to estimate $\theta$ using:

(a) Importance Sampling.
(b) Control Variates.
(c) Optimal Control Variates.
(d) Stratified Sampling.
(e) Antithetic Random Variables.


Solutions

6.14.1 Let the antithetic estimator be denoted by $\tilde\theta = \sum_{i=1}^{n}\frac{f(U_i) + f(1-U_i)}{2n}$, where $U_i \sim U(0,1)$. Show the estimator is unbiased. Note: $\theta = \int_0^1 f(x)\,dx$.

6.14.1 Solution
\[
\begin{aligned}
E\big(\tilde\theta\big) &= E\left(\sum_{i=1}^{n}\frac{f(U_i) + f(1-U_i)}{2n}\right) \\
&= \sum_{i=1}^{n}\frac{E(f(U_i)) + E(f(1-U_i))}{2n} \\
&= \sum_{i=1}^{n}\frac{\theta + \theta}{2n} = \frac{1}{2n}\cdot 2n\theta = \theta.
\end{aligned}
\]

6.14.2 RANDI's main problem was that it was based on RANDU, which explains why RANDI was not a good random number generator. Because of concerns over the quality of RANDI, RANDII was used. Professor Derry Vative loves integrating things using only a $U(0,1)$ "derived" from RANDII. In particular he wishes to determine $\theta = \int_{-\infty}^{\infty} 2^{-|x|}\,dx$.

6.14.2a Find $\theta$.

6.14.2a Solution
\[ \theta = \int_{-\infty}^{\infty} 2^{-|x|}\,dx = 2\int_0^{\infty} 2^{-x}\,dx \ \text{(by symmetry)} = 2\lim_{a\to\infty}\left[-\frac{1}{\ln 2}\,2^{-x}\right]_0^a = \frac{2}{\ln 2} = 2.8854. \]

6.14.2b Write an algorithm, using Monte Carlo integration and an appropriate substitution that gives the integral(s) 0, 1 bounds, to find $\hat\theta$.

6.14.2b Solution
Let $t = \frac{1}{1+x}$, so $x = \frac{1}{t} - 1$ and $dx = -t^{-2}\,dt$. Then
\[ \theta = 2\int_1^0 2^{-|1/t - 1|}\left(-t^{-2}\right)dt = 2\int_0^1 2^{-|1/t-1|}\,t^{-2}\,dt. \]
The Algorithm:

1. Generate $t_i \sim U(0,1)$; repeat $n$ times.
2. $\hat\theta = \dfrac{2\sum_{i=1}^{n} 2^{-|1/t_i - 1|}\,t_i^{-2}}{n}$.
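A short R sketch of this algorithm (illustrative, not part of the original solution; the true value is $2/\ln 2 \approx 2.885$):

n <- 100000
t <- runif(n)
2 * mean(2^(-abs(1/t - 1)) / t^2)   # Monte Carlo estimate of theta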

6.14.3 We wish to estimate the area beneath $f(x) = \ln(x+1)$ for $0 \le x \le 1$. A "close" function to $f(x)$, if needed, is $g(x) = \sqrt{x}$ (note: $g(x)$ is not a p.d.f.). The following is known about $f(x)$ and $g(x)$:
\[
\begin{aligned}
&E(f(X)) \approx 0.4, \quad E\big((f(X))^2\big) \approx 0.2, \quad \mathrm{Var}(f(X)) \approx 0.2, \\
&E(g(X)) = \tfrac{2}{3}, \quad E\big((g(X))^2\big) = \tfrac{1}{2}, \quad \mathrm{Var}(g(X)) = \tfrac{1}{18}, \\
&E(f(X)g(X)) \approx 0.3, \quad \mathrm{Cov}(f(X), g(X)) \approx 0.03, \\
&E(f(X)f(1-X)) \approx 0.11, \quad \mathrm{Cov}(f(X), f(1-X)) \approx -0.05.
\end{aligned}
\]

6.14.3a Determine, or state, $\theta$.

6.14.3a Solution $\theta = E(f(X)) \approx 0.4$.

6.14.3b Using $g(x)$ as a control variate, state the estimate $\hat\theta_{CV}$ in terms of $f(x_i)$ and $g(x_i)$. Be sure to evaluate any expressions that can be evaluated.

6.14.3b Solution
\[ \hat\theta_{CV} = \sum_{i=1}^{n}\left(\frac{f(x_i) - g(x_i)}{n}\right) + E(g(X)) = \sum_{i=1}^{n}\left(\frac{f(x_i) - g(x_i)}{n}\right) + \frac{2}{3}. \]

6.14.3c Determine the efficiency of our control variate estimator.

6.14.3c Solution
\[ \text{Eff} = \frac{\mathrm{Var}(\tilde\theta_{CMC})}{\mathrm{Var}(\tilde\theta_{CV})} = \frac{\mathrm{Var}(f(X))/n}{\mathrm{Var}(f(X) - g(X))/n} = \frac{\mathrm{Var}(f(X))}{\mathrm{Var}(f(X)) + \mathrm{Var}(g(X)) - 2\,\mathrm{Cov}(f(X), g(X))} = \frac{0.2}{0.2 + \frac{1}{18} - 2(0.03)} = 1.02. \]

6.14.3d Explain what the efficiency you calculated in (c) means to the general layperson.

6.14.3d Solution It means that running the CMC code 102 times would be about as accurate as running the CV code 100 times.

6.14.3e Using $g(x)$, state the importance sampling estimate $\hat\theta_{IMP}$ in terms of $f(x_i)$ and $g(x_i)$. Clearly explain where all parts of the expression come from.

6.14.3e Solution
To turn $g$ into a density, find $a$ such that $\int_0^1 a\sqrt{x}\,dx = a\,\frac{2}{3}x^{3/2}\Big|_0^1 = 1$, so $a = \frac{3}{2}$. Therefore let $h(x) = \frac{3}{2}\sqrt{x}$, which means $H(x) = x^{3/2}$. Then
\[ \hat\theta_{IMP} = \sum_{i=1}^{n}\left(\frac{f(x_i)/h(x_i)}{n}\right), \]
where each $x_i = u_i^{2/3}$ (the inverse transform of $H$ applied to $u_i \sim U(0,1)$).
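For reference, this estimator is easy to run; the following lines are an illustrative sketch, not part of the original solution (the true value is $\int_0^1 \ln(x+1)\,dx = 2\ln 2 - 1 \approx 0.386$):

u <- runif(10000)
x <- u^(2/3)                           # inverse transform of H(x) = x^{3/2}
mean(log(x + 1) / (1.5 * sqrt(x)))     # importance sampling estimate of theta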

6.14.3f Would you suggest using antithetic random variables on $f(x)$? Why or why not?

6.14.3f Solution They should work well, as $f(x)$ is a monotonic, increasing function on the interval.

6.14.3g Determine the efficiency of our antithetic estimator.

6.14.3g Solution
\[ \text{Eff} = \frac{\mathrm{Var}(\tilde\theta_{CMC})}{2\,\mathrm{Var}(\tilde\theta_{ANTI})} = \frac{\mathrm{Var}(f(X))/n}{2\,\mathrm{Var}\!\left(\frac{f(X)+f(1-X)}{2}\right)/n} = \frac{2\,\mathrm{Var}(f(X))}{2\,\mathrm{Var}(f(X)) + 2\,\mathrm{Cov}(f(X), f(1-X))} = \frac{\mathrm{Var}(f(X))}{\mathrm{Var}(f(X)) + \mathrm{Cov}(f(X), f(1-X))} = \frac{0.2}{0.2 - 0.05} = 1.33. \]

6.14.4 Dery Vate wants to integrate $f(x) = \exp(x-1)$ from 0 to 1. She wants an algorithm that will obtain an estimate for this integral, $\theta$, with as small a variance as possible.

6.14.4a Give a crude Monte Carlo estimate using the uniform random variables U = [0.1, 0.5, 0.8].

6.14.4a Solution From CMC we have an estimate for the integral:
\[ \tilde\theta_{CMC} = \frac{1}{n}\sum_{i=1}^{n} f(x_i) = \frac{1}{n}\sum_{i=1}^{n} e^{x_i - 1} = \frac{0.4066 + 0.6065 + 0.8187}{3} = 0.6106. \]
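The same calculation in R (a one-line check, not part of the original solution):

u <- c(0.1, 0.5, 0.8)
mean(exp(u - 1))    # 0.6106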

6.14.4b Obtain the expected value and variance of the crude Monte Carlo estimator.

6.14.4b Solution
\[
\begin{aligned}
E[\tilde\theta_{CMC}] &= E\left[\frac{1}{n}\sum_{i=1}^{n} e^{X_i-1}\right] = \frac{1}{n}\sum_{i=1}^{n} E\left[e^{X_i-1}\right] = \frac{n\theta}{n} = \theta, \\
\mathrm{Var}[\tilde\theta_{CMC}] &= \mathrm{Var}\left[\frac{1}{n}\sum_{i=1}^{n} e^{X_i-1}\right] = \frac{1}{n^2}\sum_{i=1}^{n}\mathrm{Var}\left(e^{X_i-1}\right) = \frac{n\,\sigma^2_{CMC}}{n^2} = \frac{\sigma^2_{CMC}}{n}.
\end{aligned}
\]

6.14.4c She thinks she may be able to improve the process using antithetic random variables. Explain in words why this will or will not work.

6.14.4c Solution Dery Vate will find that the antithetic estimator is more efficient than crude Monte Carlo, because the function is monotonic over the interval.

6.14.4d Obtain the expected value and variance of the antithetic estimator.

6.14.4d Solution
\[
\begin{aligned}
E\big[\tilde\theta_{ANTI}\big] &= E\left[\frac{1}{2n}\sum_{i=1}^{n}\left(f(X_i) + f(1-X_i)\right)\right] \\
&= \frac{1}{2n}\left(\sum_{i=1}^{n} E[f(X_i)] + \sum_{i=1}^{n} E[f(1-X_i)]\right) \\
&= \frac{1}{2n}\sum_{i=1}^{n}(\theta + \theta) = \frac{2\theta n}{2n} = \theta, \\
\mathrm{Var}\big[\tilde\theta_{ANTI}\big] &= \mathrm{Var}\left[\frac{1}{2n}\sum_{i=1}^{n}\left(f(X_i) + f(1-X_i)\right)\right] \\
&= \frac{1}{4n^2}\sum_{i=1}^{n}\left(\mathrm{Var}(f(X_i)) + \mathrm{Var}(f(1-X_i)) + 2\,\mathrm{Cov}(f(X_i), f(1-X_i))\right) \\
&= \frac{1}{4n^2}\left(n\sigma^2_{ANTI} + n\sigma^2_{ANTI} + 2nC\right) = \frac{\sigma^2_{ANTI} + C}{2n}.
\end{aligned}
\]
Note that $C < 0$, since we are guaranteed to have a negative covariance.

6.14.4e Still not content with the results, she decides to use a control variate. Find the expected value and variance using the control variate g(x) = 3(x-1).

6.14.4e Solution
\[
\begin{aligned}
E\big[\tilde\theta_{CV}\big] &= E\left[\frac{1}{n}\sum_{i=1}^{n}\left(f(x_i) - g(x_i)\right) + \int_0^1 g(x)\,dx\right] \\
&= \frac{1}{n}E\left[\sum_{i=1}^{n}\left(f(x_i) - g(x_i)\right)\right] + \int_0^1 g(x)\,dx \\
&= \theta - \frac{1}{n}\sum_{i=1}^{n}E[g(x_i)] + \int_0^1 g(x)\,dx = \theta, \\
\mathrm{Var}\big[\tilde\theta_{CV}\big] &= \mathrm{Var}\left[\frac{1}{n}\sum_{i=1}^{n}\left(f(x_i) - g(x_i)\right)\right] \\
&= \frac{1}{n^2}\sum_{i=1}^{n}\mathrm{Var}\left(f(x_i) - g(x_i)\right) = \frac{\sigma_h^2}{n}, \quad\text{where } h(x) = f(x) - g(x),
\end{aligned}
\]
which is small when $g(x) \approx f(x)$.

6.14.5 We are interested in estimating $\theta = \int_0^1 \exp(x)\,dx$.

6.14.5a Determine the variance for the crude Monte Carlo method.

6.14.5a Solution
\[ \mathrm{Var}\big[\tilde\theta_{CMC}\big] = \mathrm{Var}\left[\frac{1}{n}\sum_{i=1}^{n} f(x_i)\right] = \frac{1}{n^2}\sum_{i=1}^{n}\mathrm{Var}[f(x_i)], \]
which is estimated by $\frac{1}{n^2}\sum_{i=1}^{n}\left(f(x_i) - \hat\theta_{CMC}\right)^2$.

6.14.5b Determine the variance for the antithetic method.

6.14.5b Solution
\[
\begin{aligned}
\mathrm{Var}\big[\tilde\theta_{ANTI}\big] &= \mathrm{Var}\left[\frac{1}{2n}\sum_{i=1}^{n}\left(f(x_i) + f(1-x_i)\right)\right] \\
&= \frac{1}{4n^2}\sum_{i=1}^{n}\left(\mathrm{Var}(f(x_i)) + \mathrm{Var}(f(1-x_i)) + 2\,\mathrm{Cov}(f(x_i), f(1-x_i))\right),
\end{aligned}
\]
which is estimated by
\[ \frac{1}{4n^2}\left(\sum_{i=1}^{n}\left(f(x_i) - \hat\theta_{ANTI}\right)^2 + \sum_{i=1}^{n}\left(f(1-x_i) - \hat\theta_{ANTI}\right)^2 + 2\sum_{i=1}^{n}\left(f(x_i) - \hat\theta_{ANTI}\right)\left(f(1-x_i) - \hat\theta_{ANTI}\right)\right). \]

6.14.5c Determine the variance for the control variate $g(U) = U$, $U \sim U(0,1)$.

6.14.5c Solution With $h(x) = f(x) - g(x)$,
\[ \mathrm{Var}\big[\tilde\theta_{CV}\big] = \mathrm{Var}\left[\frac{1}{n}\sum_{i=1}^{n} h(x_i) + \int_0^1 g(x)\,dx\right] = \frac{1}{n^2}\,\mathrm{Var}\left[\sum_{i=1}^{n} h(x_i)\right], \]
which is estimated by $\frac{1}{n^2}\sum_{i=1}^{n}\left(h(x_i) - \bar h\right)^2$.

6.14.6 Use CMC to write an algorithm that can be used to estimate $\theta$ for each of the integrals below.

6.14.6a $\theta = \int_0^1 \exp(x)\,dx$

6.14.6b $\theta = \int_{12}^{-8} \exp(x)\,dx$

6.14.6c $\theta = \int_{-\infty}^{-1} \frac{1}{x}\,dx$

6.14.6d $\theta = \int_{-\infty}^{\infty} \frac{1}{x}\,dx$

6.14.6d Solution Note: all other solutions are either simplifications or extensions of this one.

Define a limit point $a = 0$. This solution uses the change of variable $u = -\frac{1}{x}$ when $x < 0$ and $u = \frac{1}{x}$ when $x > 0$. Note that the integrand does not exist at $x = 0$; since we are using a continuous uniform on the open bounds $(0,1)$, this is not a concern. Then we have
\[ \int_{-\infty}^{\infty}\frac{1}{x}\,dx = \int_{-\infty}^{0}\frac{1}{x}\,dx + \int_{0}^{\infty}\frac{1}{x}\,dx. \]
Now apply the change of variable $x = -\frac{1}{u}$, with $dx = \frac{1}{u^2}\,du$, to the first term, and $x = \frac{1}{u}$, with $dx = -\frac{1}{u^2}\,du$, to the second term:
\[ \int_{-\infty}^{\infty}\frac{1}{x}\,dx = \int_{0}^{1}\frac{1}{-1/u}\cdot\frac{1}{u^2}\,du + \int_{1}^{0}\frac{1}{1/u}\cdot\frac{-1}{u^2}\,du = -\int_{0}^{1}\frac{u}{u^2}\,du - \int_{0}^{1}\frac{-u}{u^2}\,du = \int_{0}^{1}\frac{1}{u}\,du - \int_{0}^{1}\frac{1}{u}\,du. \]
Now to state the algorithm:

1. Generate $u_1$ and $u_2$, two $U(0,1)$ random variables.
2. If $u_1 < 0.50$ then $x_1 = -\frac{1}{u_2}$.
3. Else $x_1 = \frac{1}{u_2}$.
4. Repeat the process $n$ times. Take the average.

6.14.6e $\theta = \int_{-\infty}^{\infty}\exp(-x^2)\,dx$

6.14.7 In question 6.14.6, write algorithms to estimate $\theta$ using:

6.14.7a Importance Sampling

6.14.7a Solution For 6.14.6d.

6.14.7b Control Variates

6.14.7c Optimal Control Variates

6.14.7d Stratified Sampling

6.14.7d Solution For 6.14.6d.

6.14.7e Antithetic Random Variables

6.14.8 Find the CMC and ANTI estimators for the following function:
\[ \int_0^1 \frac{e^x - 1}{e^2 - 1}\,dx. \]
Determine the efficiency of the ANTI estimator. Would you expect this to be greater or less than 1? Explain your answer.

6.14.8 Solution The CMC estimator is
\[ \tilde\theta_{CMC} = \frac{1}{n}\sum_{i=1}^{n}\frac{e^{x_i-1}}{e^2-1}. \]
The ANTI estimator is (using $U = 1 - V$)
\[ \tilde\theta_{ANTI} = \frac{1}{2n}\sum_{i=1}^{n}\left(\frac{e^{x_i-1}}{e^2-1} + \frac{e^{-x_i}}{e^2-1}\right), \]
with efficiency measure
\[ \text{Efficiency} = \frac{\mathrm{Var}(\tilde\theta_{CMC})}{2\,\mathrm{Var}(\tilde\theta_{ANTI})} = \frac{\sum_{i=1}^{n}\left(\frac{e^{x_i-1}}{e^2-1} - \hat\theta_{CMC}\right)^2}{2\sum_{i=1}^{n}\left(\left(\frac{e^{x_i-1}}{e^2-1} - \hat\theta_{ANTI}\right)^2 + \left(\frac{e^{-x_i}}{e^2-1} - \hat\theta_{ANTI}\right)^2 + \frac{2}{(e^2-1)^2}\,\widehat{\mathrm{Cov}}\left(e^{x_i-1},\, e^{-x_i}\right)\right)}. \]
This value is expected to be greater than 1, because the function is monotonic over the interval, which is the optimal setting for the antithetic estimator.


1. How large a sample size would I need, using antithetic and crude Monte Carlo, in order to estimate the above integral, correct to four decimal places, with probability at least 95%?

2. Under what conditions on $f$ does the use of antithetic random numbers completely correct for the variability of the Monte Carlo estimator, i.e. when is $\mathrm{var}(f(U) + f(1-U)) = 0$?

3. Show that if we use antithetic random numbers to generate two normal random variables $X_1, X_2$, having mean $rT - \sigma^2 T/2$ and variance $\sigma^2 T$, this is equivalent to setting $X_2 = 2(rT - \sigma^2 T/2) - X_1$. In other words, it is not necessary to use the inverse transform method to generate normal random variables in order to permit the use of antithetic random numbers.

4. Show that the variance of a weighted average
\[ \mathrm{var}\left(\alpha X + (1-\alpha)W\right) \]
is minimized over $\alpha$ when
\[ \alpha = \frac{\mathrm{var}(W) - \mathrm{cov}(X,W)}{\mathrm{var}(W) + \mathrm{var}(X) - 2\,\mathrm{cov}(X,W)}. \]
Determine the resulting minimum variance. What if the random variables $X, W$ are independent?

5. Use a stratified random sample to integrate the function
\[ \int_0^1 \frac{e^u - 1}{e - 1}\,du. \]
What do you recommend for intervals (two or three) and sample sizes? What is the efficiency gain?

6. Use a combination of stratified random sampling and an antithetic random number in the form
\[ \frac{1}{2}\left[f(U/2) + f(1 - U/2)\right] \]
to integrate the function
\[ \int_0^1 \frac{e^u - 1}{e - 1}\,du. \]
What is the efficiency gain?

7. In the case $f(x) = \frac{e^x - 1}{e - 1}$, use $g(x) = x$ as a control variate to integrate over [0,1]. Show that the variance is reduced by a factor of approximately 60. Is there much additional improvement if we use a more general quadratic function of $x$?


8. In the case $f(x) = \frac{e^x - 1}{e - 1}$, consider using $g(x) = x$ as a control variate to integrate over [0,1]. Note that regression of $f(U)$ on $g(U)$ yields $f(U) - E(f(U)) = \beta[g(U) - E\,g(U)] + \varepsilon$, where the error term $\varepsilon$ has mean 0 and is uncorrelated with $g(U)$, and $\beta = \mathrm{cov}(f(U), g(U))/\mathrm{var}(g(U))$. Therefore, taking expectations on both sides and reorganising the terms, $f(U) - \beta[g(U) - E(g(U))]$ is an unbiased estimator of $E(f(U))$. The Monte Carlo estimator
\[ \frac{1}{n}\sum_{i=1}^{n}\left\{f(U_i) - \beta\left[g(U_i) - E(g(U_i))\right]\right\} \]
is an improved control variate estimator, equivalent to the one discussed above in the case $\beta = 1$. Determine, by performing simulations, how much better this estimator is than the basic control variate case $\beta = 1$.

9. A call option pays an amount $V(S) = 1/(1 + \exp(S(T) - k))$ at time $T$ for some predetermined price $k$. Discuss what you would use for a control variate, and conduct a simulation to determine how it performs, assuming geometric Brownian motion for the stock price, interest rate 5%, annual volatility 20%, and various initial stock prices, values of $k$ and $T$.

10. It has been suggested that stocks are not log-normally distributed, but that the distribution can be well approximated by replacing the normal distribution with a Student $t$ distribution. Suppose that the daily returns $X_i$ are independent with probability density function $f(x) = c(1 + (x/b)^2)^{-2}$ (the re-scaled Student distribution with 3 degrees of freedom). We wish to estimate a weekly $VaR_{.95}$, a value $v$ such that $P[\sum_{i=1}^{5} X_i < v] = 0.95$. If we wish to do this by simulation, suggest an appropriate method involving importance sampling. Implement it and estimate the variance reduction.

11. Suppose, for example, I have three different simulation estimators $Y_1, Y_2, Y_3$ whose means depend on two unknown parameters $\theta_1, \theta_2$; in particular, suppose $Y_1, Y_2, Y_3$ are unbiased estimators of $\theta_1$, $\theta_1 + \theta_2$, $\theta_2$ respectively. Let us assume for the moment that $\mathrm{var}(Y_i) = 1$ and $\mathrm{cov}(Y_i, Y_j) = -1/2$. I want to estimate the parameter $\theta_1$. Should I use only the estimator $Y_1$, which is the unbiased estimator of $\theta_1$, or some linear combination of $Y_1, Y_2, Y_3$? Compare the number of simulations necessary for a given degree of accuracy.

12. Consider the systematic sample estimator based on the trapezoidal rule:
\[ \hat\theta = \frac{1}{n}\sum_{i=0}^{n-1} f(V + i/n), \quad V \sim U\left[0, \tfrac{1}{n}\right]. \]
Discuss the bias and variance of this estimator. In the case $f(x) = x^2$, how does it compare with other estimators, such as crude Monte Carlo and antithetic random numbers, requiring $n$ function evaluations? Are there any disadvantages to its use?

13. In the case $f(x) = \frac{e^x - 1}{e - 1}$, use $g(x) = x$ as a control variate to integrate over [0,1]. Find the optimal linear combination using estimators (6.1) and (6.2), an importance sampling estimator, and the control variate estimator above. What is the efficiency gain over crude Monte Carlo?

14. The rho of an option is the derivative of the option price with respect to the interest rate parameter $r$. What is the value of $\rho$ for a call option with $S_0 = 10$, strike $= 10$, $r = 0.05$, $T = 0.25$ and $\sigma = 0.2$? Use a simulation to estimate this slope and determine the variance of your estimator. Try using (i) independent simulations at two points and (ii) common random numbers. What can you say about the variances of your estimators?

15. For any random variables $X, Y$, prove that
\[ P(X \le x, Y \le y) - P(X \le x)P(Y \le y) = P(X > x, Y > y) - P(X > x)P(Y > y) \quad\text{for all } x, y. \]

 

 

 

 

Statistical Tables
