

ST597A: Statistical Computing, Part II (2011) © John Buonaccorsi, Univ. of Massachusetts

Not to be reused without permission

Contents

1 Basic use of macros.
1.1 Running the same analysis on many data sets.
1.2 Running the same analysis on different variables.
1.3 %let
1.4 A macro within a macro.
2 Arrays
3 Using SAS GRAPH to create high resolution plots.
4 More on programming features.
5 Probability Functions and their application.
5.1 Specific functions getting cumulative probabilities.
5.2 The CDF function.
5.3 The PDF function
5.4 Plotting the normal density.
5.5 Exact inferences for a proportion based on the Binomial.
6 Inverse probability functions and their application
7 Power and sample size applications.
7.1 The noncentral t-distribution and power functions.
7.2 Estimating a single mean. Sample size via a CI.
7.3 Sample size and power for estimating a proportion.
7.4 Comparison of two means; independent samples. Power and sample size.
8 Generating random quantities.
8.1 Some random generating functions.
8.2 Generating random integers between 1 and M.
8.3 Sampling with proc surveyselect.
8.4 Generating Normal random variables
8.5 Simulating the behavior of inferences with normal samples.
8.6 Simulating sampling from the exponential distribution.
8.7 Simulating the number of arrivals in a fixed period of time.
8.8 Simulating survival rates and time to extinction.
8.9 More on random number generation in SAS
9 R: Part II
9.1 Graphics
9.1.1 Directing the graphics output
9.2 Miscellaneous
9.3 Looping
9.4 Functions
9.5 Probability Functions with examples
9.6 Some more graphics
9.6.1 Using legends
9.6.2 Writing text in margins or in graphs
10 SAS-IML.
10.1 Basic matrix operations
10.2 Naming rows and columns, formats, reset
10.3 Reading from a SAS data set into matrices.
10.4 Example. Computing descriptive statistics etc. in a one-sample problem
10.5 Missing values
10.6 Writing to external files in IML.
10.7 Example: Statistics for multiple groups.
10.8 Example: Multivariate observations.
10.9 Creating SAS files from within IML.
10.9.1 Creating a SAS file directly from a matrix
10.9.2 Creating a SAS file using variables; with application to simulating random samples
10.10 User-defined modules/functions
10.11 Simulating simple random sampling.
10.12 Least squares/regression via IML.
10.13 Non-linear least squares and calling subroutines.
10.14 Bootstrapping.
11 R: Part 3. Working with matrices
11.1 Least squares in R
11.2 Some additional comments on using functions in R
12 Bootstrap Add-on.
12.1 An enhanced one-sample IML program
12.2 R resources


Begin with more advanced (but still non-IML) features of SAS.

1 Basic use of macros.

Macros are a convenient way to run programs (perhaps multiple times), allowing variable names or file names to be input variables. In essence, the macros here define a function with input variables or filenames, and the macros are then executed with specified input values. These examples show just some of the basic features.

1.1 Running the same analysis on many data sets.

options ls=80 nodate;

title ’ Macro processing of the four quabbin blocks’;

%macro loopdat(dname,type); /* loopdat is the internal macro name */

data &dname;

infile "g:/s597/data/&dname&type";

input block line plot type BA hunt dist herb wp1 wp45 hk1

hk45 bir1 bir45 map1 map45 oak1 oak45 ash1 ash45 oth1 oth45

total1 total45 total;

title "&dname";

proc means n mean stdev;

class block line;

var total;

run;

%mend loopdat;

%loopdat(pl95,.dat);

%loopdat(peter95,.dat);

%loopdat(ns95,.dat);

%loopdat(pres95,.dat);

pl95

Analysis Variable : TOTAL

BLOCK LINE N Obs N Mean Std Dev

------------------------------------------------------------------

1 105 22 22 0.5909091 1.6230216

127 42 42 0.0476190 0.3086067

138 42 42 0.7142857 3.4590688

154 28 28 1.7500000 3.9592835

------------------------------------------------------------------

peter95

BLOCK LINE N Obs N Mean Std Dev

------------------------------------------------------------------

5 1 51 51 2.1372549 7.5631200

2 47 47 1.9148936 4.0153498

3 42 42 1.1666667 1.6217525

4 56 56 3.0714286 7.7432836

5 11 11 9.4545455 16.0833059

------------------------------------------------------------------

etc. for other blocks.


1.2 Running the same analysis on different variables.

option ls=80 ps=60 nodate;

data b;

infile ’brain.dat’;

input Gender $ FSIQ VIQ PIQ Weight Height mriCount;

run;

%macro loopvar(vname);

title "&vname";

proc means;

var &vname;

proc corr;

var &vname mricount;

run;

%mend loopvar;

%loopvar(fsiq);

%loopvar(viq);

%loopvar(piq);

HEAVILY EDITED OUTPUT

fsiq

Analysis Variable : FSIQ

N Mean Std Dev Minimum Maximum

----------------------------------------------------------

40 113.4500000 24.0820712 77.0000000 144.0000000

----------------------------------------------------------

Correlation Analysis

Pearson Correlation Coefficients / Prob > |R| under Ho: Rho=0 / N = 40

FSIQ MRICOUNT

FSIQ 1.00000 0.35764

0.0 0.0235

viq

Analysis Variable : VIQ

N Mean Std Dev Minimum Maximum

----------------------------------------------------------

40 112.3500000 23.6161071 71.0000000 150.0000000

----------------------------------------------------------

etc. for rest of viq and then for PIQ

Correlation Analysis

2 ’VAR’ Variables: PIQ MRICOUNT

...


1.3 %let

The %let statement lets you define a macro variable or a list of variables. One advantage of this is that if this variable or list of variables occurs in a number of places in the program, you can save some programming with an initial identification via the %let. Also, if you then want to change the value of the variable for another run, you only need to change it once, where it is initially defined.

options ps=60 ls=80;

title ’illustrate macro variable list’;

%let vlist = vo35 vo4 vo45 vo5 vo55 vo6;

data a;

infile ’speed2’;

input id sex &vlist;

run;

proc means n mean min max;

class sex;

var &vlist;

run;

illustrate macro variable list 1

SEX N Obs Variable N Mean Minimum Maximum

---------------------------------------------------------------------------

1 15 VO35 15 15.4333333 12.2000000 19.4000000

VO4 15 18.4533333 15.6000000 21.5000000

VO45 15 22.9133333 19.0000000 26.5000000

VO5 15 29.8533333 25.1000000 35.2000000

VO55 15 38.9866667 32.7000000 48.9000000

VO6 15 45.2133333 39.2000000 51.3000000

2 9 VO35 9 14.6666667 12.6000000 16.5000000

VO4 9 17.8777778 15.4000000 22.2000000

VO45 9 23.3444444 18.7000000 27.1000000

VO5 9 31.7888889 24.3000000 38.1000000

VO55 9 37.1888889 31.9000000 42.9000000

VO6 9 41.1555556 36.2000000 45.0000000

---------------------------------------------------------------------------


1.4 A macro within a macro.

You can have a macro within a macro, as demonstrated by the following, which loops through data sets and then through variables within each data set.

options ls=80 nodate;

title ’ A macro in a macro’;

%macro loopdat(dname,type);

data &dname;

infile "&dname&type";

input block line plot type BA hunt dist herb wp1 wp45 hk1

hk45 bir1 bir45 map1 map45 oak1 oak45 ash1 ash45 oth1 oth45

total1 total45 total;

title "&dname";

%vloop(total45);

%vloop(total);

run;

%mend loopdat;

%macro vloop(vname);

title "&dname-&vname";

proc means mean std;

class line;

var &vname;

run;

%mend vloop;

%loopdat(pl95,.dat);

%loopdat(peter95,.dat);

Heavily edited output

pl95-total45

LINE N Obs Mean Std Dev

-----------------------------------------------

105 22 0.1363636 0.6396021

...

pl95-total

LINE N Obs Mean Std Dev

-----------------------------------------------

105 22 0.5909091 1.6230216

....

peter95-total45

LINE N Obs Mean Std Dev

-----------------------------------------------

...

peter95-total

LINE N Obs Mean Std Dev

....


2 Arrays

Arrays provide a convenient and efficient way to do various tasks involving data manipulations and calculations. Arrays must be used within a data step.

array name[n] varlist; where n is an integer and varlist is a list of variables, creates an array of dimension n. name[i] will equate to the ith variable in the varlist.

array name[*] varlist; does the same thing as above, but SAS will determine n, the dimension of the array. You can use dim(name) to get the dimension of the array.

• The elements of an array must be all character or all numeric. If the array is a character array, then put a $ in front of the varlist, as in array name[n] $ varlist;

array name[*] _numeric_; will put all of the numeric variables in the array name (you can use an integer n instead of * if it matches the number of numeric variables).

array name[*] _character_; will put all of the character variables in the array name (again, an integer n can be used instead of *).

If you have variables name1, name2, . . ., namen in the data then you can just use array name[n]; by itself and this will put name1 through namen into the array name.

array name[n]; will create an n-dimensional array called name, but if there are no variables called name1, etc., then it creates variables name1 to namen (which will have missing values until something is assigned to them).
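As a quick sketch of these forms (the data set and variable names here are made up, not from the course data):

data demo;
input name1 $ name2 $ x1 x2 x3;
array cvars[2] $ name1 name2; /* character array: note the $ */
array allnum[*] _numeric_; /* all numeric variables read so far; dim(allnum) is 3 here */
array newv[3]; /* no varlist: creates numeric variables newv1-newv3 */
do i = 1 to 3;
newv[i] = 10*allnum[i]; /* assign something to the new variables */
end;
drop i;
datalines;
abc def 1 2 3
ghi jkl 4 5 6
;
proc print;
run;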

Example. This example demonstrates the use of arrays with the speed/vo data, where we want to convert the vo values to 0 (if vo is le 25) and 1 (if vo gt 25). The recoded values are in newvo1 to newvo6. Both programs produce the same result.

options ps=60 ls=80;

data a;

infile ’speed2’;

input id sex vo35 vo4 vo45 vo5 vo55 vo6;

array vo[6] vo35 vo4 vo45 vo5 vo55 vo6;

array newvo[6];

do i = 1 to 6;

if vo[i] gt 25 then newvo[i]=1;

if vo[i] le 25 then newvo[i]=0;

end;

drop i;

proc print;

run;

data a;

infile ’speed2’;

input id sex vo1-vo6;

array vo[6];

array newvo[6];

do i = 1 to 6;

if vo[i] gt 25 then newvo[i]=1;

if vo[i] le 25 then newvo[i]=0;


end;

drop i;

proc print;

run;

OBS  ID  SEX  VO35  VO4  VO45  VO5  VO55  VO6  NEWVO1 NEWVO2 NEWVO3 NEWVO4 NEWVO5 NEWVO6

1 1 1 15.7 18.4 22.0 34.8 45.3 51.1 0 0 0 1 1 1

2 2 1 14.8 18.0 25.1 29.6 44.8 51.3 0 0 1 1 1 1

3 3 1 16.6 21.0 25.5 35.2 48.9 49.1 0 0 1 1 1 1

4 4 1 14.6 19.2 24.2 30.9 36.2 47.6 0 0 0 1 1 1

....

24 24 2 14.3 16.9 23.9 36.8 37.4 36.2 0 0 0 1 1 1

Example: recoding missing values. This example uses data which was introduced briefly earlier. This is from StatLib.

Data from which conclusions were drawn in the article "Sleep in

Mammals: Ecological and Constitutional Correlates" by Allison, T. and

Cicchetti, D. (1976), _Science_, November 12, vol. 194, pp. 732-734.

Includes brain and body weight, life span, gestation time, time

sleeping, and predation and danger indices for 62 mammals.

Variables below (from left to right) for Mammals Data Set:

species of animal, body weight in kg, brain weight in g,

slow wave ("nondreaming") sleep (hrs/day), paradoxical ("dreaming") sleep (hrs/day)

total sleep (hrs/day) (sum of slow wave and paradoxical sleep),

maximum life span (years), gestation time (days), predation index (1-5)

sleep exposure index (1-5), overall danger index (1-5),

Note: Missing values denoted by -999.0

option ls=80 nodate;

data a;

infile ’sleep.dat’;

input species $25. bodywt brainwt slsleep parsleep

totsleep lifespan gestate prindex slindex danindex;

proc print;

var species brainwt slsleep parsleep;

run;

data b;

set a;

array numval[*] _numeric_;

do i = 1 to dim(numval);

if numval[i]=-999.0 then numval[i]=.;

end;


proc print;

var species brainwt slsleep parsleep;

run;

OBS SPECIES BRAINWT SLSLEEP PARSLEEP

1 African elephant 5712.00 -999.0 -999.0

2 African giant pouched rat 6.60 6.3 2.0

3 Arctic Fox 44.50 -999.0 -999.0

....

62 Yellow-bellied marmot 17.0 -999.0 -999.0

OBS SPECIES BRAINWT SLSLEEP PARSLEEP

1 African elephant 5712.00 . .

2 African giant pouched rat 6.60 6.3 2.0

3 Arctic Fox 44.50 . .

...

62 Yellow-bellied marmot 17.0 . .

Example: Illustrating the use of arrays to create many cases out of one.

options ps=60 ls=80;

/* using arrays with the speed data to create 6 lines for

each subject */

data a;

infile ’speed2’;

input id sex voval1-voval6;

array speedar[6] _temporary_ (3.5 4 4.5 5 5.5 6);

array voval[6];

do i = 1 to 6;

speed=speedar[i];

vo=voval[i];

output;

end;

keep id speed vo;

proc print;

run;

OBS ID SPEED VO

1 1 3.5 15.7

2 1 4.0 18.4

3 1 4.5 22.0

4 1 5.0 34.8

5 1 5.5 45.3

6 1 6.0 51.1

7 2 3.5 14.8

8 2 4.0 18.0

Note the use of a temporary array and the assignment of values to the array. As a temporary array, it does not create variables speedar1 to speedar6. Note also that we could not use vo for both the array name and the new variable; we need to use two different names.


Example. This example is based on data from J. Grand. In the original file there are abundance values for 17 species of birds, contained in the variables acco to toru. We want one line for each species, with a species variable and an abundance value.

data a;

infile ’birdsjg.prn’;

input plotid 7-8 type $ MW PPSOT PPSOF SO DL G2 HW PL

acco amsa capu cavo chpe coer covi1 dedi eral icga mnva pier piol pivi pogr1

scmi toru;

drop MW PPSOT PPSOF SO DL G2 HW PL;

proc print;

run;

data b;

set a;

array name[17] $ _temporary_

(’acco’ ’amsa’ ’capu’ ’cavo’ ’chpe’ ’coer’ ’covi1’ ’dedi’ ’eral’

’icga’ ’mnva’ ’pier’ ’piol’ ’pivi’ ’pogr1’ ’scmi’ ’toru’);

array values[17]

acco amsa capu cavo chpe coer covi1 dedi eral icga mnva pier piol pivi pogr1

scmi toru;

do i = 1 to 17;

species = name[i];

abund=values[i];

output;

end;

drop acco amsa capu cavo chpe coer covi1 dedi eral icga mnva pier piol pivi pogr1

scmi toru;

proc print;

run;

LISTING OF ORIGINAL DATA

Obs plotid type acco amsa capu cavo chpe coer covi1 dedi

1 1 MW 0.00 0.0 0.000 0.0000 0.0000 0.0000 0.0000 0.0000

2 2 PPSOT 0.00 0.0 0.000 0.0000 0.0000 0.0000 0.3333 0.0000

...

Obs eral icga mnva pier piol pivi pogr1 scmi toru

1 0.0 0.1667 0.0000 9.5000 0.0000 0.0000 0.0000 0.0000 0.0000

2 0.0 3.5000 0.3333 12.0000 0.0000 0.1667 0.0000 0.0000 0.1667

...

PARTIAL LISTING OF NEW SAS FILE

Obs plotid type i species abund

1 1 MW 1 acco 0.0000

2 1 MW 2 amsa 0.0000

9 1 MW 9 eral 0.0000

10 1 MW 10 icga 0.1667

18 2 PPSOT 1 acco 0.0000


24 2 PPSOT 7 covi1 0.3333

27 2 PPSOT 10 icga 3.5000

28 2 PPSOT 11 mnva 0.3333

29 2 PPSOT 12 pier 12.0000

3 Using SAS GRAPH to create high resolution plots.

Note: Many SAS procedures, including those discussed below, create high resolution graphics. For some of these, with no direction of the output, the graph will come up in the graph window and can then be saved. In some cases the ODS graphics command must be used (as in plots associated with proc corr). More generally, the plots can often be directed via ODS or with a use of gsasfile together with specification of a device and gaccess. The details are not documented here but will be discussed a bit further in class; a small sketch is given below.
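Purely as a hedged illustration (the file names and the data set below are made up, and the available options vary with the SAS version), two common ways of directing graphics output are:

/* route output from a graphics procedure to a pdf file via ODS */
ods pdf file="c:\s597\plots\myplots.pdf";
proc gplot data=mydata;
plot y*x;
run;
ods pdf close;

/* or the gsasfile/device route used in some examples later in these notes */
filename gsasfile "myplot.ps";
goptions device=ps gaccess=gsasfile gsfmode=replace noprompt;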

SAS GRAPH refers to a collection of procedures that create high resolution plots. Here is an introductory look at SAS GRAPH with a few examples, enough to do the basics. The documentation is contained in the SAS/GRAPH User's Guide and in the online documentation on the web under SAS/GRAPH.

The following occur outside of any proc. There are other such statements; these are the most important ones.

• The GOPTIONS statement temporarily sets default values for many graphics attributes and device parameters used by SAS/GRAPH procedures. This includes setting the size of the graph through hsize and vsize. goptions reset=all; will set all of the graphic options to their default values.

• The TITLEn statement controls title information.

• The AXISn statement controls labeling of an axis.

• The SYMBOLn statement controls information associated with a specific plot command.

The information from the axis and symbol statements will be used in a plot command.

Example. Using the defoliation variable in the gypsy moth data. For each subplot, plot the mean as well as mean + 1.96SE and mean − 1.96SE.

title ’create defol means’;

data dd;

infile ’e:\s597\data\87md.dat’;

input plot $ subplot egg def;

run;

proc means maxdec=2 data =dd noprint nway;

class plot;

var egg def;

output out=result2 mean=meanegg meandef stderr=seegg sedef;

run;

data c;

set result2;

up=meanegg+(1.96*seegg);


low = meanegg-(1.96*seegg);

run;

goptions reset=all;

goptions hsize=6 vsize=4;

title1 ’Mean egg mass and two standard errors’;

title2 ’ Maryland 1987’;

axis2 label= none; /* without this the y-axis label would be up since

that is used first in the plot */

axis1 offset = (.2in,.2in);

symbol1 v=u color=black;

symbol2 v=l color=black; /* that is the letter l, not the number 1 */

symbol3 v=m color=black;

proc gplot data=c;

plot up*plot=1 low*plot=2 meanegg*plot=3/overlay vaxis=axis2

haxis=axis1;

run;

Figure 1: Gypsy Moth Plot

Example: employment data. Shows use of the i= option to connect points.

option ls=80 nodate;

data a;

infile ’employ.dat’;

input Findex unemploy year; /*Unemploy= Unemployment rate.*/


newyear=year+1950; /* year: Year of the 1950 decade (0 = 1950)*/

axis1 label= (f=swiss ’Year’);

axis2 label= (f=swiss ’Unemployment’);

symbol1 c=black i=join v=star;

proc gplot;

plot unemploy*newyear/haxis=axis1 vaxis=axis2;

run;

Figure 2: Employment Plot

Example: esterase assay data (see Figure 3). Shows use of the i=spline option to connect points, overlaying two sets of fitted values and their interval bounds.

data new;

infile ’pred.out’;

input x0 yhat low up yhatw loww upw;

run;

symbol1 i=spline l=1 color=black;

symbol2 i=spline l=2 color=black;

proc gplot data=new;

plot yhat*x0=1 low*x0=1 up*x0=1

yhatw*x0=2 loww*x0=2 upw*x0=2 /overlay;

run;

Figure 3: Esterase Assay plot

Example of a 3d scatter plot using proc g3d. The scatter statement creates a scatter plot. The pyramid is the default symbol; other symbols are possible with the shape= option. The noneedle option will eliminate the vertical lines leading to the points.

data values;

infile ’e:\s597\data\house.dat’;

input PRICE SQFT AGE FEATS NE CUST COR TAX;

run;

title1 ’3d plot for House data ’;

proc g3d;

scatter sqft*age=price;

run;

Figure 4: House price plot

4 More on programming features.

A quick look at a few programming features. These will get used more later and some can also be used in IML.

DO LOOPS:

Do loops have the general form of :


do ;

statements

end;

There is often something after the do which creates an iterative do loop, while the form beginning with just do; usually occurs within an if-then statement.

The simplest iterative do loop is:

do index = start to finish by increment;

- If there is no increment an increment of 1 is assumed.

- The start and finish can be expressions.

- You can use multiple “start to finish by increment” specifications separated by commas, as in do i = 1 to 10 by 1, 11 to 100 by 2;

Another form specifies a list of specifications explicitly, as in

do i = 4, 23, 56; or do i = ’saturday’, ’sunday’, ’wednesday’;

You can loop through dates in an iterative way via: do i = ’date1’d to ’date2’d by n;

date1 and date2 are date constants of the form ddMONyyyy (for example ’01JAN2011’d), designating day, month and year. The d trailing the quoted date indicates it is a date.
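For instance, a small sketch (the dates are arbitrary) that outputs one observation per day for the first week of January 2011:

data dates;
do day = '01JAN2011'd to '07JAN2011'd by 1;
output;
end;
format day date9.;
proc print;
run;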


/* this creates and plots a quadratic function of x */

option ls=80 nodate;

data values;

a = 2; b=4; c=8;

do x = 1 to 10 by .1;

y= a + (b*x) + (c*(x**2));

output;

end;

keep x y;

proc print;

run;

proc plot hpercent=50 vpercent=50;

plot y*x;

run;

OBS X Y

1 1.0 14.00

91 10.0 842.00

Note: Just giving the low resolution proc plot output here since it is easier to embed.

Plot of y*x. Legend: A = 1 obs, B = 2 obs, etc.

y |

1000 +

|

|

| AB

| BA

| BB

| BB

| BB

| BB

500 + BB

| ABB

| BBA

| ABB

| ABBA

| BBBA

| BBBB

| ABBBB

| ABBBBBBA

0 + ABBA

---+---------+---------+---------+---------+---------+--

1 3 5 7 9 11

x

You can also use “until” and “while” within the do statement, either by themselves or in addition to an iterative loop.

do ... while(statement); do ... until(statement);

For example, if in the program above we used do x = 1 to 10 by .1 while (y lt 500); the result would only include cases where y is less than 500.
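A minimal sketch of both forms (the numbers are arbitrary):

data c;
/* while: the condition is checked at the top of each pass */
y = 0;
do x = 1 to 10 by .1 while (y lt 500);
y = 2 + (4*x) + (8*(x**2));
output;
end;
run;

data d;
/* until: the condition is checked at the bottom, so the loop runs at least once */
total = 0;
do until (total ge 100);
total + 10;
output;
end;
run;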

IF-THEN-ELSE. The general form is

if expression then statement; else statement;

The else is optional. The statement after the else can be a do loop rather than a single statement.

/* example where if choice =1 homework counts if choice=2 it doesn’t */

data a;

infile ’grades.dat’;

input h1-h4 exam1 exam2 choice;

array wt[6] _temporary_ (.125 .125 .125 .125 .25 .25);

array points[6] h1-h4 exam1 exam2;

score=0;

if choice=2 then

score = (exam1+exam2)/2;

else

do i = 1 to 6;

score =score + wt[i]*points[i];

end;

run;

goto statements. You can label a statement, where : follows the label, and the program is directed to that statement with a goto statement.

if i gt 6 then goto check;

...

check: statement;

The program below will create the same output as the initial plot of a quadratic.

/* this creates and plots a quadratic function of x */

option ls=80 nodate;

data values;

a = 2; b=4; c=8;

x=0;

loop:x=x+.1;

y= a + (b*x) + (c*(x**2));

output;

if x < 9.9 then goto loop;

keep x y;

proc print;

run;

proc plot vpercent=50;

plot y*x;

run;


5 Probability Functions and their application.

5.1 Specific functions getting cumulative probabilities.

First there are a set of functions that return cumulative probabilities for specific distributions. So, if X is a random variable, these functions return P (X ≤ x). A small sketch using a few of them appears after the list.

• poisson(m,x) returns P (X ≤ x) when X is Poisson with mean m.

• probbeta(x,a,b) returns P (X ≤ x) when X ∼ Beta(a, b).

• probbnml(p,n,x) returns P (X ≤ x) when X ∼ Binomial with n = number of trials and p = probability of success.

• probchi(x,df,nc) returns P (X ≤ x) when X ∼ chi-square with df degrees of freedom and non-centrality parameter nc. The nc can be omitted, which leads to the more familiar central chi-square distribution.

• probf(x,n1,n2,nc) returns P (X ≤ x) when X ∼ F with n1 and n2 degrees of freedom respectively and non-centrality parameter nc. The nc can be omitted, which leads to the more familiar F distribution.

• probgam(x,a) returns P (X ≤ x) when X ∼ Gamma(a).

• probhypr(N,k,m,x) returns P (X ≤ x) when X ∼ Hypergeometric based on a population of size N, k of the N items having a characteristic of interest, and m is the sample size.

• probnegb(p,n,x) returns P (X ≤ x) when X ∼ Negative Binomial with p = probability of success and n = an integer (e.g., the number of desired successes in sequential sampling).

• probnorm(x) returns P (X ≤ x) when X ∼ N(0, 1) (a standard normal).

• probt(x,df,nc) returns P (X ≤ x) when X ∼ t with df degrees of freedom and non-centrality parameter nc. The nc can be omitted, which leads to the more familiar central t distribution.
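A small sketch using a few of these functions (the argument values are arbitrary):

data probs;
p_norm = probnorm(1.96); /* P(Z <= 1.96), Z ~ N(0,1), about .975 */
p_pois = poisson(4,6); /* P(X <= 6), X Poisson with mean 4 */
p_bin = probbnml(.5,20,15); /* P(X <= 15), X ~ Binomial(20,.5) */
p_chi = probchi(24.8701,8); /* P(X <= 24.8701), central chi-square, 8 df */
proc print;
run;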

5.2 The CDF function.

CDF computes cumulative probabilities for a large number of distributions. The general form is

cdf(’dist’,x,parameters)

where dist is a name from the list of possible distributions available. This includes more than those used above. For example, cdf('chisquare',x,df,nc) returns P (X ≤ x) when X ∼ chi-square with df degrees of freedom and non-centrality parameter nc (which can be left off); this does exactly the same thing as probchi(x,df,nc). (Note that the documentation refers to x as a quantile, but this makes no sense as x is a point at which we want a cumulative probability, not a quantile.)
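A quick sketch showing the CDF function reproducing the specific functions above (the values are arbitrary):

data cdfcheck;
a1 = cdf('normal',1.96); /* same as probnorm(1.96) */
a2 = cdf('chisquare',24.8701,8); /* same as probchi(24.8701,8) */
a3 = cdf('binomial',15,.5,20); /* same as probbnml(.5,20,15) */
proc print;
run;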

5.3 The PDF function

For a discrete random variable X, pdf('dist',x,parameters) will return P (X = x). For a continuous random variable with density function f(x), it will return f(x).
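A small sketch (parameter values are arbitrary):

data pdfcheck;
d1 = pdf('poisson',3,4); /* P(X = 3) when X is Poisson with mean 4 */
d2 = pdf('binomial',2,.1,10); /* P(X = 2) when X ~ Binomial(10,.1) */
f1 = pdf('normal',4,4,1); /* N(4,1) density evaluated at x = 4 */
proc print;
run;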

Example using the Hypergeometric. Based on a consulting problem with housing services. Some sprinklers similar to those in dorms had recently failed in a hotel fire. The recommendation was to sample 1 or 2% of the sprinklers and, if any were found defective, replace all of them. We want to find the probabilities of getting different numbers of defectives in the sample for different sample sizes and different defective rates in the population. In particular the interest is in the probability of getting 0, as this is the case where no action is taken. The example below is for buildings with 698 units in them. The calculations are done using the cumulative probabilities (which is how it had to be done before the pdf function became available) and then directly through pdf.

options ls=80;

/* this is lucey.sas - calculates probability of

getting 0, 1, 2, 3 defective for each building with number of defective

going from 0 to 100 and changing sample sizes.*/

title ’building type a’;

data a;

popsizea = 698;

do ssa=7 to 70 by 7; /*ssa = number of sprinklers sampled*/

ssrate=ssa/popsizea; /* sampling rate*/

do def = 1 to 101 by 5; /* number of defective sprinklers in pop.*/

defrate=def/popsizea;

proba0 = probhypr(popsizea,def,ssa,0);

p1 = probhypr(popsizea,def,ssa,1);

p2 = probhypr(popsizea,def,ssa,2);

p3 = probhypr(popsizea,def,ssa,3);

proba1=p1-proba0;

proba2=p2-p1;

proba3=p3-p2;

pa1=pdf(’hyper’,1,popsizea,def,ssa);

output;

end;

end;

run;

proc print noobs;

var def defrate ssa ssrate proba0 proba1 proba2 proba3 pa1;

run;

def defrate ssa ssrate proba0 proba1 proba2 proba3 pa1

1 0.00143 7 0.010029 0.98997 0.01003 . . 0.01003

6 0.00860 7 0.010029 0.94111 0.05762 0.00126 0.00001 0.05762

11 0.01576 7 0.010029 0.89433 0.10112 0.00445 0.00010 0.10112

36 0.05158 56 0.080229 0.04540 0.15077 0.23868 0.23985 0.15077

41 0.05874 56 0.080229 0.02914 0.11114 0.20273 0.23563 0.11114

96 0.13754 70 0.10029 0.00002 0.00022 0.00136 0.00540 0.00022

101 0.14470 70 0.10029 0.00001 0.00013 0.00082 0.00348 0.00013

This program determines the needed sample size so the detection rate (the probability of getting one or more defectives in the sample) is at least .95, with defective rates going from .01 to .5 by .01 and a population size of 698.

options ls=80;

title ’sample size for buildings with 698 units’;

data a;


popsizea = 698;

do defrate = .01 to .5 by .01;

defa=int(defrate*popsizea);

ssizea=0;

loop1: ssizea = ssizea+1;

proba = probhypr(popsizea,defa,ssizea,0);

detecta=1-proba;

if detecta < .95 then go to loop1;

output;

end;

run;

proc print;

run;

sample size for buildings with 698 units 1

OBS POPSIZEA DEFRATE DEFA SSIZEA PROBA DETECTA

1 698 0.01 6 274 0.049544 0.95046

2 698 0.02 13 143 0.049319 0.95068

3 698 0.03 20 96 0.049611 0.95039

4 698 0.04 27 72 0.049842 0.95016

47 698 0.47 328 5 0.041322 0.95868

48 698 0.48 335 5 0.037539 0.96246

49 698 0.49 342 5 0.034037 0.96596

50 698 0.50 349 5 0.030803 0.96920

Computing Binomial distributions. There are n independent trials. On each trial the probability of success is p. Then P (X = x) = nCx p^x (1 − p)^(n−x) for x = 0, 1, . . . , n, where nCx is the combinatoric “n choose x”.
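As a side note (a sketch, not how the macro below does it), the point probabilities can now be obtained directly with the pdf function rather than by differencing cumulative probabilities:

data direct;
n = 10; p = .1;
do x = 0 to n;
prob = pdf('binomial',x,p,n); /* P(X = x) directly */
output;
end;
proc print;
run;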

options ls=80 nodate;

/* computing and plotting the binomial distribution with

a sample of size 10 and probability .1 of success*/

%macro bincalc(n,p);

data a;

do x = 0 to &n;

cum=probbnml(&p,&n,x);

output;

end;

run;

title "binomial with p = &p and n = &n";

data b;

set a;

prob=cum-lag(cum);

if x=0 then prob=cum;

proc print;

run;

filename gsasfile "bin&p&n";


goption device=ps

gaccess=gsasfile

gsfmode=replace noprompt;

symbol i=needle;

proc gplot;

plot prob*x;

run;

%mend bincalc;

%bincalc(10,.1);

%bincalc(100,.5);

binomial with p = .1 and n = 10 1

OBS X CUM PROB

1 0 0.34868 0.34868

2 1 0.73610 0.38742

3 2 0.92981 0.19371

4 3 0.98720 0.05740

5 4 0.99837 0.01116

6 5 0.99985 0.00149

7 6 0.99999 0.00014

8 7 1.00000 0.00001

9 8 1.00000 0.00000

10 9 1.00000 0.00000

11 10 1.00000 0.00000

binomial with p = .5 and n = 100 2

OBS X CUM PROB

... omitted entries with all 0’s

28 27 0.00000 0.000002

29 28 0.00001 0.000004

30 29 0.00002 0.000010

31 30 0.00004 0.000023

58 57 0.93339 0.030069

59 58 0.95569 0.022292

60 59 0.97156 0.015869

101 100 1.00000 0.000000

Computing Normal probabilities. X ∼ N(µ, σ2) is shorthand for X is normally distributed with mean µ and variance σ2, or equivalently standard deviation σ. Then P (X ≤ x) = P (Z ≤ (x − µ)/σ), where Z ∼ N(0, 1). Recall also that if X1, . . . , Xn are independent, each ∼ N(µ, σ2), then X̄ ∼ N(µ, σ2/n), where X̄ is the sample mean.

Example: A manufacturing process produces boxes of cereal (with an advertised content of 24 oz.) with mean 23.8 oz and standard deviation .3 oz. Let X be the weight of a random box and X̄ the mean of 10 random boxes (which, say, make up a carton). Find a) P (X ≤ 24), b) P (X > 24), c) P (23.5 ≤ X ≤ 23.9), d) P (X̄ ≤ 24), e) P (X̄ > 24), f) P (23.5 ≤ X̄ ≤ 23.9).

option ls=80 nodate;

data a;


mu = 23.8;

sigma=.3;

n=10;

sdm=sigma/sqrt(n);

ansa = probnorm((24-mu)/sigma);

ansb = 1 - ansa;

ansc = probnorm((23.9-mu)/sigma) - probnorm((23.5-mu)/sigma);

ansd = probnorm((24-mu)/sdm);

anse = 1 - ansd;

ansg = probnorm((23.9-mu)/sdm) - probnorm((23.5-mu)/sdm);

proc print;

run;

OBS MU SIGMA N SDM ANSA ANSB ANSC ANSD ANSE ANSG

1 23.8 0.3 10 0.094868 0.74751 0.25249 0.47190 0.98249 0.017507 0.85330

These probability functions also allow you to compute P-values associated with tests of hypotheses for tests based on many well known distributions. Some papers will report a test statistic but give just a bound on a p-value (for example, simply indicating whether it is greater than .05 or not). Suppose a paper says a test is based on a chi-square test with 8 degrees of freedom, with a value of 24.8701 for the test statistic, but doesn’t report the p-value. You can calculate the P-value via p = 1-probchi(24.8701,8); returning a value of p of .001635325. (See page 60 of notes-Part.) Similarly you can get P-values for tests based on the standard normal, t and F distributions, etc., but be careful as to whether you have a two-sided or one-sided alternative.
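For instance, a short data step (a sketch) reproduces the chi-square P-value just described, along with a two-sided P-value for an arbitrary standard normal test statistic of 2.3:

data pvals;
p_chi = 1 - probchi(24.8701,8); /* chi-square test statistic, 8 df */
p_z = 2*(1 - probnorm(abs(2.3))); /* two-sided P-value for z = 2.3 */
proc print;
run;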

5.4 Plotting the normal density.

option ls=80 nodate;

/* plotting the normal density */

data a;

mu = 4;

sigma=1;

low=mu-5*sigma;

up = mu+5*sigma;

inc=(up-low)/100;

do x = low to up by inc;

f=pdf(’normal’,x,mu,sigma);

output;

end;

keep x f;

proc print;

run;

symbol1 i=spline l=1 color=blue;

proc gplot;

plot f*x;

run;

Obs x f

1 -1.0 0.00000

2 -0.9 0.00000


51 4.0 0.39894

52 4.1 0.39695

100 8.9 0.00000

101 9.0 0.00000

Figure 5: Plot of the normal density with mean 4 and variance 1.

5.5 Exact inferences for a proportion based on the Binomial.

For small samples, tests of hypotheses and confidence intervals are based on the Binomial, rather than the normal approximation to the Binomial. Consider H0 : π = π0 versus HA : π ≠ π0. We would like a test with probability of a type 1 error equal to α.

Setting up the rejection region

If X is the number of successes in the sample of size n, the test will be of the form

Reject H0 if X ≤ R1 or X ≥ R2.

With R1 and R2 as fixed constants, we cannot necessarily get a test of exact size α since X is discrete with point probabilities at 0, 1, . . . , n. We keep the probability of a type I error ≤ α. With a two-sided test, split the α up into two pieces.

R1 = largest integer such that when π = π0, P (X ≤ R1) ≤ α/2.

R2 = smallest integer such that when π = π0, P (X ≥ R2) ≤ α/2.

The rejection region defined by R1 and R2 obviously depends on π0.

option ls=80 nodate;

/*getting the rejection region for an exact binomial test of level 1 -a */


%macro binrej(n,pi0,a);

data a;

title "sample size &n, null value &pi0, size &a";

r1=-1;

loop1:r1=r1+1;

pler1=probbnml(&pi0,&n,r1);

if pler1 le &a/2 then go to loop1;

r1=r1-1;

pler1=probbnml(&pi0,&n,r1);

r2=&n+1;

loop2:r2=r2-1;

pger2= 1- probbnml(&pi0,&n,r2-1);

if pger2 le &a/2 then go to loop2;

r2=r2+1;

pger2= 1- probbnml(&pi0,&n,r2-1);

probrej=pler1+pger2;

proc print;

run;

%mend;

%binrej(20,.5,.05);

sample size 20, null value .5, size .05

OBS R1 PLER1 R2 PGER2 PROBREJ

1 5 0.020695 15 0.020695 0.041389

So, with n = 20 and a null hypothesis of H0 : π = .5 and a desired size of .05, we reject H0 if X ≤ 5 or X ≥ 15. The actual size achieved is .0414.

Getting a confidence interval.

From the family of tests above, one way of getting an exact confidence interval (in the sense that the probability that the interval contains the true π is ≥ 1 − α) is to use the set of π0 for which H0 : π = π0 is not rejected.
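A sketch of one way to carry this out numerically (this is not from the notes; it simply searches a grid of π0 values and keeps those not rejected by the equal-tailed level-α test above, assuming 0 < Xobs < n):

%macro binci(n,xobs,a);
title "approximate exact CI with n=&n, x=&xobs, alpha=&a";
data accept;
do pi0 = .001 to .999 by .001;
plow = probbnml(pi0,&n,&xobs); /* P(X <= xobs) under pi0 */
pup = 1 - probbnml(pi0,&n,&xobs-1); /* P(X >= xobs) under pi0 */
if plow gt &a/2 and pup gt &a/2 then output; /* pi0 is not rejected */
end;
run;
proc means data=accept min max; /* endpoints of the retained pi0 approximate the CI */
var pi0;
run;
%mend;
%binci(20,15,.05);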

Getting exact P-value for a test based on the Binomial.

Consider H0 : π = π0. Suppose the number of observed successes is Xobs and X is the random variable equal to the number of successes in the sample. Under H0, X ∼ Binomial(n, π0) with an expected value of nπ0. Xobs is some distance from nπ0, the expected value under the null. Let D = |Xobs − nπ0|. The p-value is

P-value = P (X ≤ nπ0 − D) + P (X ≥ nπ0 + D).

For a one-sided test, the P-value would look at just one tail using Xobs. If the alternative is HA : π > π0, then the P-value = P (X ≥ Xobs), while if the alternative is HA : π < π0, then the P-value = P (X ≤ Xobs).

Example: Consider the sign test for testing whether there is a difference between two “groups” with paired data. For observation i, there are two variables Xi and Yi and we can define π = P (Yi > Xi). The null hypothesis H0 : π = .5 is one way to define no difference; what it says is that the difference has median 0. A “success” is if Y > X, a failure if Y < X, and cases with X = Y are discarded.

For example, for the LA heart data consider testing the null hypothesis of no difference between weight in 1950 and weight in 1962 in the sense that the distribution of the difference has median 0. There were six ties. Of the remaining 194 observations, 103 were “successes” (wt62 bigger than wt50). The expected number of successes under H0 is 97, so D = 6. The P-value is P = P (X ≤ 91) + P (X ≥ 103) when X ∼ Binomial(194, .5).

diffwt=wt62-wt50;

x=diffwt;

if diffwt lt 0 then x=0;

if diffwt gt 0 then x=1;

if diffwt=0 then x=2;

run;

proc univariate;

var diffwt; run;

proc freq;

tables x; run;

Tests for Location: Mu0=0

Test -Statistic- -----p Value------

Student’s t t -0.60276 Pr > |t| 0.5474

Sign M 6 Pr >= |M| 0.4297

Signed Rank S 283.5 Pr >= |S| 0.7182

Cumulative Cumulative

X Frequency Percent Frequency Percent

-----------------------------------------------

0 91 45.5 91 45.5

1 103 51.5 194 97.0

2 6 3.0 200 100.0

options ls=70 ps=60 nodate;

data a;

pleft = probbnml(.5,194,91);

pright= 1- probbnml(.5,194,102);

pvalue = pleft + pright;

proc print;

run;

OBS PLEFT PRIGHT PVALUE

1 0.21487 0.21487 0.42975

6 Inverse probability functions and their application

These functions determine percentiles for various continuous distributions. For a random variable X, the 100pth quantile or percentile, denoted qp, is the value for which P (X ≤ qp) = p. All the functions below return the 100pth quantile/percentile for the specified distribution.

• betainv(p,a,b): X ∼ Beta(a, b)

• cinv(p,df,nc): X ∼ chi-square with df degrees of freedom and non-centrality parameter nc.

• finv(p,n1,n2,nc): X ∼ F with n1 and n2 degrees of freedom respectively and non-centrality parameter nc.


• gaminv(p,a): X ∼ Gamma(a).

• probit(p): X ∼ N(0, 1)

• tinv(p,df,nc): X ∼ t with df degrees of freedom and non-centrality parameter nc.

The nc can be omitted for the cinv, finv and tinv functions.

Normal percentiles

For the standard normal, denote the 100pth percentile by zp. (Be careful with notation: some books will write zp for the value with probability p to the right, rather than to the left. Here we use it as the value with probability p to the left, to be consistent with the SAS arguments.)

Getting standard normal percentiles.

options ls=80 nodate;

title ’finding standard normal percentile’;

data a;

array v[7] _temporary_ (.01 .025 .05 .5 .95 .975 .99);

array zval[7];

do i = 1 to 7;

zval[i]=probit(v[i]);

end;

drop i;

proc print;

run;

finding standard normal percentile 1

OBS ZVAL1 ZVAL2 ZVAL3 ZVAL4 ZVAL5 ZVAL6 ZVAL7

1 -2.32635 -1.95996 -1.64485 0 1.64485 1.95996 2.32635

Computing normal percentiles. If X ∼ N(µ, σ2) then the pth percentile for X is µ + zpσ, where zp is the 100pth percentile of the standard normal. Notice that zp is negative when p < .5.

Example: Related to a grant proposal, we needed to find cumulative probabilities at 2, 4 and 6 and the 10th, 50th and 90th percentiles for a normal with mean 4 and standard deviation 1. This was modelling individual dietary intakes of beta carotene.

option ls=80;

data a;

array yvec[3] _temporary_ (2 4 6);

array pvec[3] _temporary_ (.1 .5 .9);

array g[3];

array per[3];

sigma2= 1;

mu = 4;

do i = 1 to 3;

g[i] = probnorm((yvec[i]-mu)/sqrt(sigma2));

per[i] = mu + sqrt(sigma2)*probit(pvec[i]);

end;

proc print;


run;

OBS G1 G2 G3 PER1 PER2 PER3 SIGMA2 MU I

1 0.022750 0.5 0.97725 2.71845 4 5.28155 1 4 4

Finding percentiles of the t-distribution.

options ls=80 nodate;

title ’finding t percentile’;

%macro tpercent(df);

title "percentiles of t with &df degrees of freedom"; /* note double quotes*/

data a;

array v[7] _temporary_ (.01 .025 .05 .5 .95 .975 .99);

array tval[7];

do i = 1 to 7;

tval[i]=tinv(v[i],&df);

end;

drop i;

proc print;

run;

%mend;

%tpercent(5);

%tpercent(10);

percentiles of t with 5 degrees of freedom 1

OBS TVAL1 TVAL2 TVAL3 TVAL4 TVAL5 TVAL6 TVAL7

1 -3.36493 -2.57058 -2.01505 0 2.01505 2.57058 3.36493

percentiles of t with 10 degrees of freedom 2

OBS TVAL1 TVAL2 TVAL3 TVAL4 TVAL5 TVAL6 TVAL7

1 -2.76377 -2.22814 -1.81246 0 1.81246 2.22814 2.76377

Computing a confidence interval for a variance under normality. Using the data for wt50 from the LA heart data (see 597B notes). The program is in the form of a macro where the arguments are the sample variance (s2), the number of observations (n) and a, where the confidence coefficient is 1-a.

options ls=80 nodate;

title ’finding confidence interval for a variance and s.d. under normality’;

%macro civar(s2,n,a);

/* s2 is sample variance, n is the sample size, 1- a = conf. coefficient*/

data b;

lval = cinv((&a/2),&n-1);

uval = cinv(1-&a/2,&n-1);

lowvar = (&n-1)*&s2/uval;

upvar = (&n-1)*&s2/lval;

lowsd=sqrt(lowvar);

upsd=sqrt(upvar);

proc print;

run;

%mend;

%civar(709.66771,200,.05);


finding confidence interval for a variance and s.d. under normality 1

OBS LVAL UVAL LOWVAR UPVAR LOWSD UPSD

1 161.826 239.960 588.532 872.689 24.2597 29.5413

Finding a confidence interval for the mean with changing confidence level, illustrating how it gives the p-value for a two-sided test, and plotting intervals versus confidence level. Use the difference in serum cholesterol from 1950 to 1962 for the LA heart data. From pages 35-36 of Part I of the notes, the mean difference is -4.035 (the standard error 3.78), the P-value for testing the null hypothesis that the mean is 0 is .2872 and a 90% confidence interval is (-10.28325, 2.21325).

options ls=70 ps=60 nodate;

data values;

infile ’c:\s597\data\ladata’;

input id age md50 sp50 dp50 ht50 wt50 sc50

soec cs md62 sp62 dp62 sc62 wt62 ihdx yrdth;

diffsc=sc62-sc50;

proc means n mean stderr;

output out=result n=n mean=mean stderr=se;

var diffsc;

run;

data b;

set result;

do a = .01 to .99 by .01;

clevel=1-a;

tval=tinv(1-a/2,n-1);

upper = mean + tval*se;

lower = mean - tval*se;

output;

end;

keep clevel tval lower upper;

proc print;

run;

symbol1 i=spline l=1 color=black;

symbol2 i=spline l=2 color=black;

proc gplot data=b;

plot upper*clevel=1 lower*clevel=2/overlay;

run;

Obs clevel tval upper lower

1 0.99 2.60076 5.79844 -13.8684

2 0.98 2.34523 4.83229 -12.9023

3 0.97 2.18576 4.22934 -12.2993

4 0.96 2.06730 3.78142 -11.8514

5 0.95 1.97196 3.42094 -11.4909

28 0.72 1.08327 0.06082 -8.1308

29 0.71 1.06095 -0.02358 -8.0464


Figure 6: Confidence interval for diffsc as a function of confidence level

7 Power and sample size applications.

7.1 The noncentral t-distribution and power functions.

Consider a hypothesis test for a mean where the null hypothesis involves µ0 (H0 could be µ ≤ µ0, µ ≥ µ0 or µ = µ0, with HA being the complement of H0). Assume the data arise from a random sample from a population with mean µ and variance σ2. The t-test uses the test statistic T = (X̄ − µ0)/(s/√n). If the true value is actually µ, then the distribution of T is a noncentral t-distribution with n − 1 degrees of freedom and noncentrality parameter nc = (µ − µ0)/(σ/√n). The power function of the test is π(µ) = P(reject H0) when the true mean is µ. For a value of µ in H0, this is the probability of a type 1 error. For a value of µ in HA, this is the power = probability of a correct decision = 1 − P(type 2 error). The power function will depend on µ, σ (note that we need σ to obtain the power function) and the sample size. The rejection region will depend on whether the test is one-sided or two-sided.

Example: Suppose water is deemed unsafe for drinking if the mean nitrate concentration exceeds 30 ppm. The approach is to assume the water is okay (H0 : µ ≤ 30) unless evidence shows otherwise (HA : µ > 30), using a test of size .05 (a 5% chance of declaring the water unsafe when in fact it is safe) based on n water samples. Suppose the standard deviation for an individual observation is 2 ppm. Find the power function for a sample of size 5 and plot it. Determine the sample size needed to get the power to be .8 when µ = 31.

title ’power example’;

options ls=80 ps=60 nodate;

%macro powerg(mu0,sigma,n,alpha);

title "power for one sided > mu0 = &mu0, sigma=&sigma,n=&n,alpha=&alpha";

data b;

do mu = 25 to 35 by .5;

df=&n-1;

nc = sqrt(&n)*(mu - &mu0)/&sigma;

tval= tinv(1-&alpha,df); /*this gives the cutoff point for rejection region*/

power = 1 - probt(tval,df,nc);


output;

end;

drop sigma df;

run;

proc print;

run;

proc plot vpercent=25;

plot power*mu;

run;

%mend;

%powerg(30,2,5,.05);

power for one sided > mu0 = 30, sigma=2,n=5,alpha=.05 1

OBS MU NC TVAL POWER

1 25.0 -5.59017 2.13185 0.00000

2 25.5 -5.03115 2.13185 0.00000

10 29.5 -0.55902 2.13185 0.01693

11 30.0 0.00000 2.13185 0.05000

12 30.5 0.55902 2.13185 0.12018

13 31.0 1.11803 2.13185 0.23900

21 35.0 5.59017 2.13185 0.99751

Plot of POWER*MU. Legend: A = 1 obs, B = 2 obs, etc.

POWER |

1.0 + A A A A

| A

| A

0.5 + A

| A

| A A

0.0 + A A A A A A A A A A A

-+--------+--------+--------+--------+--------+--------+--------+--------+

24.0 25.5 27.0 28.5 30.0 31.5 33.0 34.5 36.0

MU

This next program determines the sample size needed to get the probability of correctly rejecting at µ = 31 to be .8.

options ls=80 ps=60 nodate;

%macro ssg(mu0,sigma,mu,alpha,target);

/* mu0 = null value, mu = where power is to be controlled,

sigma = s.d., alpha = size of test, target = desired power */

title "mu0= &mu0, mu = &mu sigma=&sigma,alpha=&alpha target=&target";

data b;

n=0;

loop: n= n+1;

df=n-1;

nc = sqrt(n)*(&mu - &mu0)/&sigma;


tval= tinv(1-&alpha,df);

/*this gives the cutoff point for rejection region*/

power = 1 - probt(tval,df,nc);

output;

if power lt &target then go to loop;

run;

proc print;

run;

%mend;

%ssg(30,2,31,.05,.8);

mu0= 30, mu = 31 sigma=2,alpha=.05 target=.8 1

OBS N DF NC TVAL POWER

1 1 0 0.50000 . .

2 2 1 0.70711 6.31375 0.10582

16 16 15 2.00000 1.75305 0.60403

17 17 16 2.06155 1.74588 0.62869

27 27 26 2.59808 1.70562 0.81183

The needed sample size is 27. Note that the power function is increasing in µ, so the power (probability of correctly rejecting H0) for any µ bigger than 31 will be at least the .81183 obtained at µ = 31.

This program takes the same setting but varies the desired power and gets the needed sample size.

options ls=80 ps=60 nodate;

%macro ssg(mu0,sigma,mu,alpha);

/* mu0 = null value, mu = where power is to be controlled,

sigma = s.d., alpha = size of test, target = desired power */

title "mu0= &mu0, mu = &mu sigma=&sigma,alpha=&alpha";

data b;

do target = .5 to .98 by .02; /*target is desired power*/

n=0;

loop: n= n+1;

df=n-1;

nc = sqrt(n)*(&mu - &mu0)/&sigma;

tval= tinv(1-&alpha,df);

/*this gives the cutoff point for rejection region*/

power = 1 - probt(tval,df,nc);

if power lt target then go to loop;

output;

end;

keep target n power;

run;

proc print;

run;

proc plot vpercent=25;

plot n*target;

%mend;

%ssg(30,2,31,.05);


mu0= 30, mu = 31 sigma=2,alpha=.05 1

OBS TARGET N POWER

1 0.50 13 0.52201

16 0.80 27 0.81183

18 0.84 30 0.84825

19 0.86 32 0.86885

25 0.98 57 0.98142

Plot of N*TARGET. Legend: A = 1 obs, B = 2 obs, etc.

N

56 + A

49 + A

42 + A A

35 + A A A

28 + A A A A

21 + A A A A A A A

14 + A A A A A A A

7 +

--+--------+--------+--------+--------+--------+--------+--------+--------+--

0.50 0.56 0.62 0.68 0.74 0.80 0.86 0.92 0.98

TARGET

In the Analyst, for specified hypotheses with a given α and σ, you can get the power for a specified range of n and get a plot of the power versus n. Or you can specify a particular alternative at which to control power and a range of power values, get the needed sample size, and plot the needed sample size versus the power. These will be demonstrated in class.

7.2 Estimating a single mean. Sample size via a CI.

The sample size required so that the sample mean X̄ is within d units of µ with probability 1 − α, that is,

P(|X̄ − µ| ≤ d) = 1 − α,

is

n = σ²z²_{1−α/2}/d².

This is an exact result if the population is normal; otherwise it is approximate. This can also be viewed as approximately giving a confidence interval of length 2d. The actual half width of a confidence interval based on the t distribution is t_{1−α/2,n−1}S/n^{1/2}, which is random since S is random; the expected half width is approximately (since the expected value of S is not exactly σ) t_{1−α/2,n−1}σ/n^{1/2}. If you set this equal to d and treat t_{1−α/2,n−1} as z_{1−α/2}, which is reasonable for even moderate n, you get n as given above.
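As a quick numerical check of the formula (using one of the settings that also appears in the program output below): with σ = 2, d = 0.4 and α = .05, so z_{.975} ≈ 1.96,

n = (1.96 × 2/0.4)² ≈ 96.04,

which matches the 96.036 printed by the program (round up to 97 in practice).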

Note that, as with determining sample size via a power requirement, this requires knowledge of σ.

Example: For the water sample situation with σ = 2, find n so the probability that the sample mean is within d units of the true mean is 1 − α.

options ls=80 ps=60 nodate;

title "confidence intervals for varying d and 1 - alpha";

data b;


sigma=2;

do alpha = .05 to .2 by .05;

do d = .2 to 2 by .2;

confid=1-alpha;

zval = probit(1 - alpha/2);

n = (zval*sigma)**2/(d**2);

output;

end;

end;

keep d confid n;

proc print noobs;

run;

D CONFID N

0.2 0.95 384.146

0.4 0.95 96.036

2.0 0.95 3.841

0.2 0.90 270.554

2.0 0.90 2.706

0.2 0.80 164.237

2.0 0.80 1.642

7.3 Sample size and power for estimating a proportion.

There are many examples where the object is to estimate the proportion of individuals with a particular characteristic (e.g., having a certain disease or other health condition, holding a particular opinion, etc.). Assuming the population is big enough that the Binomial rather than the Hypergeometric can be used, and that n is big enough that the normal approximation to the Binomial will work, then wanting P(|p − π| ≤ d) = 1 − α requires

n = π(1 − π)z²_{1−α/2}/d².

Note that d is expressed as a proportion and is an absolute difference. So d = .03 means being within 3 percent (absolute) of the true π (or having the half width of the confidence interval approximately .03). Note also that, in this case, the sample size needed depends on the quantity π that we are trying to estimate. Over 0 < π < 1, the function π(1 − π) increases up to π = .5, where it achieves its maximum, and then decreases. In the absence of any knowledge the safe thing to do is use π = .5; if you are comfortable saying that you know beforehand that π is in some interval [a, b], then use the value in [a, b] that is closest to .5 (which of course would be .5 itself if .5 is in [a, b]).
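As a quick worked example using the conservative choice π = .5: to be within d = .03 of π with 1 − α = .95 (z_{.975} ≈ 1.96),

n = .5(1 − .5)(1.96)²/(.03)² ≈ 1067.1,

so about 1068 individuals.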

options ls=80 ps=60 nodate;

%macro ssprop(alpha);

title "sample size for estimating a proportion with alpha = &alpha";

data b;

do d = .02 to .10 by .02;

do pi = .02 to .98 by .04;

zval = probit(1 - &alpha/2);

n = (pi*(1-pi)*(zval**2))/(d**2);


output;

end; end;

keep d pi n;

proc print; run;

%mend; %ssprop(.05);

sample size for estimating a proportion with alpha = .05 1

OBS D PI N

1 0.02 0.02 188.23

2 0.02 0.06 541.65

12 0.02 0.46 2385.55

13 0.02 0.50 2400.91

14 0.02 0.54 2385.55

24 0.02 0.94 541.65

25 0.02 0.98 188.23

rest omitted.

REMARKS:

1. If you want relative precision, P(|p − π| ≤ dπ) = 1 − α, then n = (1 − π)z²_{1−α/2}/(d²π). (A quick numerical check follows these remarks.)

2. You can also figure an exact sample size using either the Hypergeometric or the Binomial.
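As a quick numerical check of Remark 1 (the values here are just illustrative): to be within 10% of π (d = .10) with 95% confidence when π is thought to be around .2,

n = (1 − .2)(1.96)²/((.10)²(.2)) ≈ 1537.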

Power and sample size for test for a single proportion.

π = true proportion of interest. Have a sample of size n with X = number of successes and p = X/n = proportion of successes in the sample.

H0 : π = π0 , HA : π ≠ π0 : Reject H0 if X ≤ R1 or X ≥ R2.

H0 : π ≥ π0 , HA : π < π0 : Reject H0 if X ≤ R1.

H0 : π ≤ π0 , HA : π > π0 : Reject H0 if X ≥ R2.

In each case the rejection region is determined so that the test has size α. The cutoffs can be determined exactly using the Hypergeometric or Binomial (see page 24 for an example) or approximately if the normal approximation to the Binomial is used. In the latter case the test statistic is based on

Z = (p − π0) / [π0(1 − π0)/n]^{1/2}.

For example, suppose we are considering a one-sided test with H0 : π ≥ π0 and HA : π < π0. Then using the normal approximation we reject H0 if Z < −z_{1−α}, or equivalently

X < n(π0 − z_{1−α}[π0(1 − π0)/n]^{1/2}) = R1.

The power function, which we now write as f(π), is

f(π) = P(reject H0 | π is the true value).

So the exact power of the given test is found by computing the probability for X based on X ∼ Binomial(n, π). The approximate power is found from the normal distribution. For the latter, if the true proportion is π, then Z is approximately normal with mean (π − π0)/[π0(1 − π0)/n]^{1/2} and variance [π(1 − π)]/[π0(1 − π0)]. So in the one-sided example above the power is P(Z < −z_{1−α}) where Z is normal with mean and variance as above. One convenient feature of this formula is that it isolates the sample size n explicitly.

Example: "Statistical Evidence of Discrimination." Based on an example in Devore and Peck, "Exploration and Analysis of Data," drawing from an article in J.A.S.A. 1982. 25% of individuals eligible for grand jury duty in a particular area are African American. Based on n selections, from which p = proportion of A.A.'s in the n selections, we test H0 : π ≥ .25 versus HA : π < .25, where π is interpreted as the long-term probability that a juror selected for the grand jury is A.A. (There is certainly a question regarding the assumption of these being n independent observations that needs to be addressed, but we'll use that assumption for illustration here.) The program below uses a rejection region based on the normal approximation and then gets the exact and approximate power of this test.

options ls=80 ps=60 nodate;

%macro powerp(pi0,n1,n2,inc,alpha);

title "power for one-sided test of a proportion, HA <";

data b;

do n = &n1 to &n2 by &inc;

div=sqrt(&pi0*(1-&pi0)/n);

do pi = .05 to .4 by .05;

meanz= (pi-&pi0)/div;

varz= (pi*(1-pi))/(&pi0*(1-&pi0));

zval= -probit(1-&alpha); /*this gives the cutoff point for rejection region*/

stand=(zval-meanz)/sqrt(varz);

power = probnorm(stand); /* approximate power */

/*get exact power of test based on Z */

q =n*(&pi0 + (sqrt(&pi0*(1-&pi0)/n)*zval));

c = int(q);

if c-q = 0 then c =c-1;

powerex=probbnml(pi,n,c);

output;

end;

end;

/*keep n pi power powerex;*/

run;

proc print;

run;

proc sort;

by n;

run;

proc plot vpercent=30;

plot power*pi powerex*pi/overlay;

by n;

run;

%mend;

%powerp(.25,10,100,10,.05);

LIMITED OUTPUT

power for one-sided test of a proportion, HA < 1

OBS N PI POWER POWEREX

1 10 0.05 0.35715 0.59874

2 10 0.10 0.21389 0.34868


3 10 0.15 0.13370 0.19687

4 10 0.20 0.08298 0.10737

5 10 0.25 0.05000 0.05631

6 10 0.30 0.02876 0.02825

7 10 0.35 0.01553 0.01346

8 10 0.40 0.00772 0.00605

73 100 0.05 1.00000 1.00000

74 100 0.10 0.99568 0.98999

75 100 0.15 0.78984 0.76328

76 100 0.20 0.29785 0.27119

77 100 0.25 0.05000 0.03763

78 100 0.30 0.00408 0.00216

79 100 0.35 0.00017 0.00005

80 100 0.40 0.00000 0.00000

7.4 Comparison of two means; independent samples. Power and sample size.

For comparing two samples, a similar approach to power and sample size can be used. Suppose we have independent samples from two groups, with population mean µ_j, population variance σ²_j, sample mean X̄_j, sample variance S²_j and sample size n_j for group j. If the null hypothesis involves µ1 − µ2 = d0, the t-statistic is

T = (X̄1 − X̄2 − d0) / (S_p [1/n1 + 1/n2]^{1/2}),

under the assumption of equal variances σ²_1 = σ²_2 = σ². The rejection region will depend on the alternative. In this case the power function and sample size calculations are based on the fact that if the true difference is µ1 − µ2 = ∆0, then T is distributed as a non-central t-distribution with df = n1 + n2 − 2 degrees of freedom and noncentrality parameter

nc = (∆0 − d0) / (σ [1/n1 + 1/n2]^{1/2}).

If the variances are unequal, then the test statistic used is

Tu = (X̄1 − X̄2 − d0) / [S²_1/n1 + S²_2/n2]^{1/2}.

This does not have exactly a t-distribution, but for n1 and n2 "large" (often taken to mean bigger than 30) we can use a normal approximation. If the true difference is ∆0, then Tu is approximately normal with mean

(∆0 − d0) / [σ²_1/n1 + σ²_2/n2]^{1/2}

and variance 1.

Using this distribution will, in general, provide a reasonable approximation to the power as long as the sample sizes are not very small. With large sample sizes this will also give essentially the right answer even if we assume σ²_1 = σ²_2 = σ² and exact normality of the populations, in which case the exact answer is based on the non-central t.

For sample size determination, one usually assumes n1 = n2 and determines the common sample size. You could also specify that n1 = cn2 for some specified constant c.

Example: Comparing blood levels for people living near freeways to those not. (Based on an example in Devore, "Probability and Statistics for Engineering and the Sciences, 3rd edition," p. 332, from a paper in the Archives of Environmental Health. The means in their samples were about 10 for the no-freeway group and 17 for the near-freeway group.) We will approach this from the perspective of testing H0 : µ1 = µ2 (µ1 − µ2 = 0) versus HA : µ1 ≠ µ2. (If we were trying to "prove" that the mean for those living near freeways (µ1) is higher than for those not, then we would use the alternative HA : µ1 − µ2 > 0.) Assume that we will take an equal number n1 = n2 for each group. Based on previous data we will use a common standard deviation of σ = 6 and examine the power of the test as a function of the true difference ∆. Note that the power depends just on the difference in the means and not on the actual values.

title ’power two-sample two-sided, equal variance’;

options ls=80 ps=60 nodate;

%macro powerg(d0,sigma,n1,alpha);

title "power for two sample, two sided d0 = &d0, sigma=&sigma,n1=&n1,alpha=&alpha";

data b;

do delta = -10 to 10 by .5;

df=2*&n1-2;

nc = (delta - &d0)/(&sigma*sqrt(2/&n1));

tval= tinv(1-(&alpha/2),df); /*this gives the cutoff point for rejection region*/

power = probt(-tval,df,nc) + (1 - probt(tval,df,nc));

output;

end;

drop sigma df;

run;

proc print;

run;

proc plot vpercent=25;

plot power*delta;

run;

%mend;

%powerg(0,6,10,.05);

power for two sample, two sided d0 = 0, sigma=6,n1=10,alpha=.05 1

OBS DELTA NC TVAL POWER

1 -10.0 -3.72678 2.10092 0.94081

2 -9.5 -3.54044 2.10092 0.91714

19 -1.0 -0.37268 2.10092 0.06441

20 -0.5 -0.18634 2.10092 0.05358

21 0.0 0.00000 2.10092 0.05000

22 0.5 0.18634 2.10092 0.05358

23 1.0 0.37268 2.10092 0.06441

40 9.5 3.54044 2.10092 0.91714

41 10.0 3.72678 2.10092 0.94081

Plot of POWER*DELTA. Legend: A = 1 obs, B = 2 obs, etc.

POWER |


1.0 + AA AA

| AAAA AAAA

| AA AA

0.5 + AAA AAA

| AA AA

| AAAA AAAA

0.0 + AAAAAAA

---+---------+---------+---------+---------+--

-10 -5 0 5 10

DELTA

The next program determines the sample size needed to achieve a certain power (which varies in the program as the value target) at a specified distance from the null, delta.

title ’power two-sample two-sided, equal variance’;

options ls=80 ps=60 nodate;

%macro sst(d0,sigma,alpha,delta);

title "sample size: two sided d0 = &d0, sigma=&sigma,alpha=&alpha, delta=&delta" ;

data b;

do target = .8 to .98 by .02;

n1=1;

loop: n1=n1+1;

df=2*n1-2;

nc = (&delta - &d0)/(&sigma*sqrt(2/n1));

tval= tinv(1-(&alpha/2),df); /*this gives the cutoff point for rejection region*/

power = probt(-tval,df,nc) + (1 - probt(tval,df,nc));

if (power lt target) then goto loop;

output;

end;

drop df nc tval;

run;

proc print;

run;

proc plot vpercent=25;

plot n1*delta;

run;

%mend;

%sst(0,6,.05,10);

sample size: two sided d0 = 0, sigma=6,alpha=.05, delta=10 1

OBS TARGET N1 POWER

1 0.80 7 0.81627

2 0.82 8 0.87236

3 0.84 8 0.87236

4 0.86 8 0.87236

5 0.88 9 0.91255

6 0.90 9 0.91255

7 0.92 10 0.94081

8 0.94 10 0.94081


9 0.96 11 0.96037

10 0.98 13 0.98273

USING THE ANALYST. Similar to the one-sample problem, the Analyst requires that you fix a particular alternative and it will then find the power over a range of sample sizes. It will also determine the sample size to achieve a certain power, which varies, at a particular alternative (the same as our program above). Strangely, the Analyst requires that you enter values for each of the means. The power only depends on the difference in means, not on the actual means, so you will get the same answer if you use µ1 = 10 and µ2 = 20 as you would with µ1 = 1500 and µ2 = 1510 (and any other values with a difference of 10).

Two-Sample t-Test Group 1 Mean = 100 Group 2 Mean = 110

Standard Deviation = 6 Alpha = 0.05 2-Sided Test

N per Group Power

2 0.170

4 0.507

6 0.739

8 0.872

10 0.940

12 0.973

14 0.988

16 >.99

18 >.99

Note the agreement at n = 10 with our answer.

Using the analyst to determine sample size.

Group 1 Mean = 10 Group 2 Mean = 20

Standard Deviation = 6 Alpha = 0.05 2-Sided Test

N per

Power Group

0.800 7

0.820 8

0.840 8

0.860 8

0.880 9

0.900 9

0.920 10

0.940 11

0.960 12

0.980 13

Again, note the agreement with our answer, although there is a disagreement (10 rather than 11) at power=.94.

Power and sample size for comparing two proportions. Note that, as with testing for a single proportion, this is not done in the Analyst. We will not treat this problem in detail here. One thing to note is that, unlike comparing two means, the power, standard error and sample size can depend on the two true proportions and not just their difference. This is because of the manner in which the proportions enter the variances.


8 Generating random quantities.

8.1 Some random generating functions.

The following SAS functions generate random variables.

RANBIN(seed,n,p): Binomial(n,p)

RANCAU(seed): Cauchy

RANEXP(seed): Exponential

RANGAM(seed,a): Gamma with shape parameter a.

RANNOR(seed): Standard normal

RANPOI(seed,m): Poisson with mean m

RANTBL(seed,p1, p2, . . . pk): Random from a distribution on integers 1, . . . k, with P (X = j) = pj .

RANTRI(seed,h): Triangular distribution

RANUNI(seed): Uniform over the interval [0,1]

From the SAS documentation:

"Seed Values: Random-number functions and CALL routines generate streams of random numbers from an initial starting point, called a seed, that either the user or the computer clock supplies. A seed must be a nonnegative integer with a value less than 2³¹−1 (or 2,147,483,647). If you use a positive seed, you can always replicate the stream of random numbers by using the same DATA step. If you use zero as the seed, the computer clock initializes the stream, and the stream of random numbers is not replicable."

When you go to the individual documentation, it says you can use negative integers, in which case it will use the computer clock to initialize the stream. Might as well use 0 for the seed in this case.

8.2 Generating random integers from between 1 and M .

This can be done directly in the following way. Suppose U is distributed uniform over the interval [0, 1]. Then W = U*M is distributed uniform over [0, M] (there is probability 0 of getting exactly 0 or M). If W falls in the interval [j, j + 1) then we set the random integer, call it K, to j + 1. This can be done by defining K = int(U*M + 1). Notice that the possible values of K are 1, . . . , M and each possible value has probability 1/M.

Example. Sampling transects for the Quabbin regeneration study. For each of the five blocks in the Quabbin, they wanted to add some new transects to a regeneration study. The transects would run east-west with some specified width (corresponding to the diameter of a circle with area = 1/millionth of an acre). This meant that within each block there is some number of potential transects. For example, in the case of the Pelham block there were 9344 possible transects to sample from and they wanted to sample 5 new transects. The first program below generates 50 random integers for each block (I'm only showing you the first two blocks) using uniform random variables as just described. To take a simple random sample of n transects they would just use the first n distinct values in the list. The random numbers below do not correspond to the actual transects used since I didn't keep the original numbers; this is a new run which will yield different random numbers.


title ’generate random numbers for regeneration study’;

title ’pelham’;

data a; pelham

do i=1 to 50; OBS U K

u=ranuni(0); 1 0.37306 3486

k=int((u*9344)+1); 2 0.88181 8240

output; 3 0.96527 9020

end; 4 0.78550 7340

proc print; 5 0.69525 6497

var u k;

run; 48 0.29311 2739

title ’newsalem’; 49 0.67373 6296

data b; 50 0.44657 4173

do i=1 to 50;

u=ranuni(0); newsalem

k=int((u*3769)+1); OBS U K

output; 1 0.39562 1492

end; 2 0.93758 3534

proc print; 3 0.95385 3596

var u k; 4 0.64641 2437

run; 50 0.60487 2280

8.3 Sampling with proc surveyselect.

SAS has a surveyselect procedure that can be used to select a simple random sample without replacement. It requires a data set to sample from. Usually we are trying to create a random sample of n units out of M for which we will collect data, in which case we need to create an artificial data set with values from 1 to M. There may also be some cases where there is a database with a listing of individual units and we want to sample from it. To get an SRS without replacement of n integers from 1, . . . , M using surveyselect do the following:

%macro srs(m,ss); /* m is population size, ss is sample size */

title "Generating a SRS: populations size &m, sample size &ss";

data a;

do j = 1 to &m;

output;

end;

run;

proc surveyselect data=a method=srs n=&ss out=sample;

run;

proc print data=sample;

run;

%mend;

%srs(9344,10);


Generating a SRS: populations size 9344, sample size 10, The SURVEYSELECT Procedure

Selection Method Simple Random Sampling

Input Data Set A Obs j

Random Number Seed 59344 1 1622

Sample Size 10 2 3041

Selection Probability 0.00107 3 3909

Sampling Weight 934.4 4 5195

Output Data Set SAMPLE 5 5449

6 6667

7 7034

8 7426

9 8821

10 9126

8.4 Generating Normal random variables

If Z has a standard normal distribution then µ + σZ is normal with mean µ and variance σ² (standard deviation σ).

option ls=80 nodate;

%macro simn(n,mu,sigma);

title "simulating a sample of size &n from N(&mu, &sigma **2)";

data a;

do i = 1 to &n;

z = rannor(0);

x = &mu + &sigma*z;

output;

end;

keep x;

run;

proc univariate;

histogram x/kernel (color=red);

inset n mean (5.2) std (5.2);

run;

%mend;

%simn(1000,4,6);

From univariate output

Mean 4.14397679 Variance 34.5488964 Std Deviation 5.87783093

8.5 Simulating the behavior of inferences with normal samples.

Simulating the behavior of the sample mean X̄ and the confidence interval for the population mean, X̄ ± t_{1−α/2,n−1}S/n^{1/2}, for a sample of size n from a N(µ, σ²). We know that the distribution of X̄ is normal with expected value/mean µ and standard deviation σ/n^{1/2}, and the probability the confidence interval is correct is 1 − α.

option ls=80 nodate;


Figure 7: Data from 1000 simulations

%macro simn(nsim,n,mu,sigma,alpha);

title "simulate dist. of sample mean and CI, n= &n from N(&mu, &sigma **2)";

data a;

tval=tinv(1- &alpha/2,&n-1);

do j =1 to &nsim;

sumx=0;

sumx2=0;

true=&mu;

do i = 1 to &n;

z = rannor(0);

x = &mu + &sigma*z;

sumx=sumx+x;

sumx2=sumx2+(x**2);

end;

meanx=sumx/&n;

varx = (sumx2 - (&n *(meanx**2)))/(&n-1);

sdx=sqrt(varx);

low=meanx - tval*(sdx/sqrt(&n));

up=meanx + tval*(sdx/sqrt(&n));

check=0;

if low le &mu le up then check=1;

output;

end;


keep j true meanx sdx low up check;

run;

proc print;

run;

proc means;

var meanx check;

run;

goptions reset=all;

title1 "Distribution of X-bar: Sample from Normal";

title2 "n= &n from N(&mu, &sigma **2)";

proc univariate;

histogram meanx/kernel (color=red);

inset n mean (5.2) std (5.2);

run;

goptions reset=all;

title ’Confidence Interval Performance’;

symbol1 v=L c=blue;

symbol2 v=U c=purple;

symbol3 l=1 i=spline c=red;

proc gplot;

plot low*j=1 up*j=2 true*j/overlay;

run;

%mend;

%simn(500,10,4,6,.05);

OBS J TRUE MEANX SDX LOW UP CHECK

1 1 4 4.99161 4.26667 1.93942 8.0438 1

2 2 4 3.71135 2.54504 1.89074 5.5320 1

12 12 4 1.35386 3.53194 -1.17274 3.8805 0

13 13 4 2.82975 5.37792 -1.01738 6.6769 1

500 500 4 8.45663 2.90202 6.38065 10.5326 0

Variable N Mean Std Dev Minimum Maximum

---------------------------------------------------------------------

MEANX 500 3.8976617 1.9392731 -2.6808968 9.7251622

CHECK 500 0.9460000 0.2262441 0 1.0000000

---------------------------------------------------------------------

The theoretical mean and st. deviation of the distribution of X̄ are 4 and 1.897, and the confidence level is .95.

8.6 Simulating sampling from the exponential distribution.

A random variable X follows an exponential distribution with mean µ > 0 if it has density function f(x) = (1/µ)e^{−x/µ} for x ≥ 0 and f(x) = 0 for x < 0. This is sometimes written as f(x) = λe^{−λx}, where λ = 1/µ. The exponential is often used to model lifetimes or waiting times between events (e.g., arrival at a bank window to get service). If X has this distribution then the variance of X is µ² and the standard deviation is µ.

If you were buying a package of n items (e.g., batteries or light bulbs) produced by a process which has an exponential distribution for the lifetime, then the total lifetime of the n items is T = Σᵢ Xᵢ = nX̄. Theoretically we know that X̄ has expected value µ and standard deviation µ/n^{1/2} (since the population standard deviation equals µ here). Equivalently, T has expected value nµ and standard deviation n^{1/2}µ. Actually, more is known, namely that Σᵢ Xᵢ follows a Gamma distribution. It is also known that S² has expected value µ² (which is the population variance).
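As a check to keep in mind for the simulations below (which use µ = 4): the theoretical standard deviation of X̄ is µ/n^{1/2}, e.g., 4/√5 ≈ 1.79 for n = 5 and 4/√500 ≈ 0.18 for n = 500; these can be compared with the Std Dev column for meanx in the simulation output.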

This example simulates the behavior of the sample mean, sample variance and a confidence interval for the mean, X̄ ± t_{1−α/2,n−1}S/n^{1/2}, with a random sample of size n from an exponential distribution. Recall that for small n this interval is motivated by assuming a sample from the normal, which doesn't apply here. For large n this interval is supported by the Central Limit Theorem. So, the performance of the confidence interval should improve as n increases.

The RANEXP function generates a random variable, call it Y, which is exponential with mean 1. It can be shown that X = µY will be distributed exponential with mean µ.

option ls=80 nodate;

%macro sime(nsim,n,mu,alpha);

title "simulate sample mean and CI: sample from exp(mean=&mu), n= &n";

data a;

tval=tinv(1- &alpha/2,&n-1);

do j =1 to &nsim;

sumx=0;

sumx2=0;

true=&mu;

do i = 1 to &n;

x = &mu*ranexp(0);

sumx=sumx+x;

sumx2=sumx2+(x**2);

end;

meanx=sumx/&n;

varx = (sumx2 - (&n *(meanx**2)))/(&n-1);

sdx=sqrt(varx);

low=meanx - tval*(sdx/sqrt(&n));

up=meanx + tval*(sdx/sqrt(&n));

check=0;

if low le &mu le up then check=1;

output;

end;

keep j true meanx sdx low up check;

run;

proc means;

var meanx check;

run;

goptions reset=all;

title1 "Distribution of X-bar: Sample from Exponential";

title2 "n= &n with mean &mu";

proc univariate plots;

var meanx;

histogram meanx/kernel (color=red);

inset n mean (5.2) std (5.2);

run;

goptions reset=all;

title "Confidence Interval Performance. Exponential mean = &mu";


symbol1 v=L c=blue;

symbol2 v=U c=purple;

symbol3 l=1 i=spline c=red;

proc gplot;

plot low*j=1 up*j=2 true*j/overlay;

run;

%mend;

%sime(1000,5,4,.05);

%sime(1000,10,4,.05);

%sime(1000,30,4,.05);

%sime(1000,50,4,.05);

%sime(1000,100,4,.05);

%sime(1000,500,4,.05);

simulate sample mean and CI: sample from exp(mean=4), n= 5 1

Variable N Mean Std Dev Minimum Maximum

--------------------------------------------------------------------------------

meanx 1000 3.9260 1.6824 0.5167 11.7516

sdx 1000 3.4373 2.0270 0.2101 17.4600

varx 1000 15.9196 22.1207 0.0441 304.8515

min 1000 0.7688 0.7364 0.0004 5.3666

max 1000 9.0259 4.8317 0.8201 42.2134

check 1000 0.8820 0.3228 0.0000 1.0000

--------------------------------------------------------------------------------

simulate sample mean and CI: sample from exp(mean=4), n= 30 3

Variable N Mean Std Dev Minimum Maximum

--------------------------------------------------------------------------------

meanx 1000 4.0156 0.7401 2.0240 6.4855

sdx 1000 3.8908 0.9894 1.7441 9.0797

varx 1000 16.1162 8.6491 3.0418 82.4407

min 1000 0.1399 0.1408 0.0001 1.1879

max 1000 16.0941 5.3004 6.2835 43.1996

check 1000 0.9220 0.2683 0.0000 1.0000

--------------------------------------------------------------------------------

simulate sample mean and CI: sample from exp(mean=4), n= 50 4

Variable N Mean Std Dev Minimum Maximum

--------------------------------------------------------------------------------

meanx 1000 3.9934 0.5754 2.3901 6.1291

sdx 1000 3.9551 0.7887 2.1090 8.5870

varx 1000 16.2642 6.8073 4.4479 73.7368

min 1000 0.0790 0.0777 0.0000 0.6411

max 1000 18.1060 5.3386 8.4551 60.4502

check 1000 0.9320 0.2519 0.0000 1.0000

--------------------------------------------------------------------------------


simulate sample mean and CI: sample from exp(mean=4), n= 100 5

Variable N Mean Std Dev Minimum Maximum

--------------------------------------------------------------------------------

meanx 1000 3.9869 0.4056 2.8004 5.3098

sdx 1000 3.9400 0.5509 2.7147 6.3016

varx 1000 15.8265 4.5172 7.3695 39.7105

min 1000 0.0403 0.0402 0.0000 0.2837

max 1000 20.4791 5.1413 10.4529 48.5757

check 1000 0.9330 0.2501 0.0000 1.0000

--------------------------------------------------------------------------------

simulate sample mean and CI: sample from exp(mean=4), n= 500 6

Variable N Mean Std Dev Minimum Maximum

--------------------------------------------------------------------------------

meanx 1000 3.9979 0.1741 3.4631 4.5791

sdx 1000 3.9859 0.2514 3.3646 5.2186

varx 1000 15.9504 2.0351 11.3207 27.2340

min 1000 0.0079 0.0079 0.0000 0.0578

max 1000 27.2129 5.1157 17.8174 55.5868

check 1000 0.9550 0.2074 0.0000 1.0000

--------------------------------------------------------------------------------

Mean of x with n=5

Histogram # Boxplot

11.75+** 4 0

.* 1 0

.** 4 0

.* 1 0

.

.** 6 0

.**** 10 0

.** 6 0

.** 4 |

.********* 26 |

.*********** 32 |

.*********** 33 |

.******************* 57 |

.********************* 63 |

.********************************* 99 +-----+

.******************************* 93 | + |

.****************************************** 126 *-----*

.******************************************* 127 | |

.************************************ 107 +-----+

.****************************** 89 |

.*********************** 67 |

.********** 29 |

.***** 15 |

0.25+* 1 |

----+----+----+----+----+----+----+----+---

* may represent up to 3 counts


meanx for n=500

Histogram # Boxplot

4.625+* 1 0

.

.* 3 0

.* 3 0

.** 4 |

.**** 11 |

.****** 18 |

.********** 29 |

.************ 34 |

.************************ 71 |

.****************************** 90 +-----+

.****************************** 89 | |

.************************************** 114 | |

.*********************************** 103 *--+--*

.************************************** 113 | |

.*********************************** 105 +-----+

.************************** 77 |

.*************** 43 |

.**************** 46 |

.********* 27 |

.*** 9 |

.** 6 |

.* 3 |

3.475+* 1 0

----+----+----+----+----+----+----+---

* may represent up to 3 counts

More on sampling distributions will be demonstrated in class.

8.7 Simulating the number of arrivals in a fixed period of time.

Suppose that the time between arrivals (of people, phone calls, etc.) is exponentially distributed with mean µ. Consider the time interval [0, t] and define Xt to be the number of arrivals in that interval. Theoretically it can be shown that Xt has a Poisson distribution with parameter t/µ (which is also the mean of the distribution). So, with θ = t/µ, P(Xt = j) = θʲe^{−θ}/j! for j = 0, 1, . . .. We will simulate this situation and compare the empirical results to what is expected.


option ls=80 nodate;

%macro sime3(nsim,mu,t);

/* arrivals are exponential with mean mu*/

title "simulate number of arrivals in time 0 to t";

data a;

do k =1 to &nsim;

xt=-1;

total=0;

i = 0;

loop: i = i +1;

xt=xt+1;

time = &mu*ranexp(0);

total = total +time;

if total lt &t then goto loop;

output;

end;

keep k xt; proc print; run;

proc freq;

tables xt;

run;

proc means; var xt; run;

/* get the theoretical poisson distribution to compare */

data poiss;

lambda = (1/&mu)*&t;

upper = int(lambda + 5*sqrt(lambda)); /* going 5 standard deviations above the mean*/

j=0;

prob=poisson(lambda,j);

do j = 1 to upper;

prob = poisson(lambda,j) - poisson(lambda,j-1);

output;

end;

keep j prob;

proc print data=poiss;

run;

%mend;

%sime3(500,4,60); /* mean = 4 minutes, t = 1 hour = 60 minutes*/

simulate number of arrivals in time 0 to t 1

OBS K XT

1 1 19

2 2 22

3 3 12

498 498 13

499 499 9

500 500 14


XT Frequency Percent OBS J PROB

------------------------- 1 1 0.00000

2 2 0.00003

3 3 0.00017

4 2 0.4 4 4 0.00065

5 1 0.2 5 5 0.00194

6 3 0.6 6 6 0.00484

7 3 0.6 7 7 0.01037

8 11 2.2 8 8 0.01944

9 18 3.6 9 9 0.03241

10 26 5.2 10 10 0.04861

11 35 7.0 11 11 0.06629

12 36 7.2 12 12 0.08286

13 44 8.8 13 13 0.09561

14 49 9.8 14 14 0.10244

15 42 8.4 15 15 0.10244

16 45 9.0 16 16 0.09603

17 52 10.4 17 17 0.08474

18 39 7.8 18 18 0.07061

19 33 6.6 19 19 0.05575

20 25 5.0 20 20 0.04181

21 15 3.0 21 21 0.02986

22 11 2.2 22 22 0.02036

23 3 0.6 23 23 0.01328

24 4 0.8 24 24 0.00830

25 2 0.4 25 25 0.00498

27 1 0.2 26 26 0.00287

27 27 0.00160

28 28 0.00086

29 29 0.00044

30 30 0.00022

cut off the rest

Analysis Variable : XT

N Mean Std Dev Minimum Maximum

-----------------------------------------------------------

500 14.9980000 3.8558697 4.0000000 27.0000000

-----------------------------------------------------------
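As a check on the simulation: here θ = t/µ = 60/4 = 15, so the number of arrivals should have mean 15 and standard deviation √15 ≈ 3.87; the simulated mean (about 15.0) and standard deviation (about 3.86) above agree closely.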


8.8 Simulating survival rates and time to extinction.

Simulating the survival of a cohort over time.

These simulations assume that the daily survival rate is constant over time and is the same for each individual. That is, the probability of surviving over the next unit of time is the same regardless of what unit of time you have survived to. Both the assumption of common survival rates for each individual and the assumption that the survival rate is constant over time can be modified.

options ls=80 nodate;

/* this simulates the survival of n0 initial individuals up to time t,

with daily survival rate pi. based on nsim simulations */

options linesize=80 nodate;

%macro simpop(n0,pi,t,nsim);

title "N0 = &n0 pi = &pi MaxTime = &t";

data a;

do sim = 1 to &nsim;

oldsurv=&n0;

do time = 1 to &t;

if oldsurv=0 then goto check;

survive= ranbin(0,oldsurv,&pi);

oldsurv=survive;

check: output;

end;

end;

run;

proc print data = a noobs;

var sim time survive;

run;

/*proc gplot;

plot survive*time;

by sim; run; */

proc means;

class time;

var survive;

run;

%mend;

%simpop(100,.8,5,100);

N0 = 100 pi = .8 MaxTime = 5

sim time survive

1 1 86

1 2 68

1 3 58

1 4 45

1 5 34

2 1 80

2 2 67

2 3 54

2 4 40

2 5 36


------------------------------------ sim=1 -------------------------------------

Plot of survive*time. Legend: A = 1 obs, B = 2 obs, etc.

survive

100 +

|

|

75 + A

|

| A

50 +

| A

| A

25 + A

---+-------------+-------------+-------------+-------------+--

1 2 3 4 5

time

Analysis Variable : survive

N

time Obs N Mean Std Dev Minimum Maximum

1 100 100 80.5900000 3.7928561 69.0000000 88.0000000

2 100 100 64.4600000 4.7020735 53.0000000 76.0000000

3 100 100 51.7100000 4.7360534 38.0000000 65.0000000

4 100 100 41.4700000 4.9593397 26.0000000 54.0000000

5 100 100 32.8200000 4.7829242 19.0000000 45.0000000
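As a rough check, under these assumptions the expected number of survivors at time t is N0·π^t; with N0 = 100 and π = .8 this gives 80, 64, 51.2, 41.0 and 32.8 for t = 1 to 5, in close agreement with the simulated means above.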

Simulating Time until extinction

options ls=80 nodate;

/* getting time until extinction*/

%macro simpop2(n0,pi,nsim);

data a;

do sim = 1 to &nsim;

oldsurv=&n0;

time=0;

loop: time = time+1;

survive= ranbin(0,oldsurv,&pi);

oldsurv=survive;

if survive gt 0 then goto loop;

output;

end;

keep sim time;

run;

proc print;

run;

proc univariate plot;

var time;

%mend;

%simpop2(100,.8,500);


OBS SIM TIME

1 1 20

2 2 19

499 499 19

500 500 18

Mean 23.754 Std Dev 5.574912

Histogram # Boxplot

47+* 1 *

.

.*** 5 0

.** 3 0

.*** 5 0

.****** 11 0

.*** 6 |

.******** 15 |

31+************* 25 |

.*************** 30 |

.********************* 42 +-----+

.*************************************** 78 | |

.******************************************* 85 *--+--*

.********************************************** 92 +-----+

.************************** 51 |

.******************* 38 |

15+******* 13 |

----+----+----+----+----+----+----+----+----+-

* may represent up to 2 counts


Simulating the performance of a crude estimate of survival rate.

options ls=80 nodate;

/* simulating an estimate of the survival rate */

%macro simpop(n0,pi,t,nsim);

data a;

do sim = 1 to &nsim;

oldsurv=&n0;

k=0;

sum=0;

do time = 1 to &t;

if oldsurv=0 then goto check;

survive= ranbin(0,oldsurv,&pi);

k=k+1;

sum=sum+(survive/oldsurv);

/*(survive/oldsurv) estimates the survival rate at current time*/

oldsurv=survive;

check: end;

pihat = sum/k; /* estimates survival by averaging individual estimates*/

output;

end;

keep sim pihat;

run;

proc print;

run;

proc means;

var pihat;

%mend;

%simpop(100,.8,5,500);

OBS SIM PIHAT

1 1 0.79197

2 2 0.82014

499 499 0.82939

500 500 0.78743

Analysis Variable : PIHAT

N Mean Std Dev Minimum Maximum

-----------------------------------------------------------

500 0.7993498 0.0231622 0.7180423 0.8727904

-----------------------------------------------------------


8.9 More on random number generation in SAS

SAS now has a general rand function of the form

rand('DISTNAME', parameter list)

where the names of distributions and their parameters are as in the handout showing the general form for pdf, cdf and quantile.

So, for example, if you want to generate X from a normal with mean 5 and standard deviation 4 you can use x = rand('NORM', 5, 4); while to generate an observation from an exponential distribution with mean 5 you can use x = rand('EXPO', 5);.

There is also a proc simnormal that will generate multivariate normal random variables, but if doing this it is often better to do so in IML. More on the multivariate normal later.

Multinomial. The multinomial distribution generalizes the binomial to the case where there are k categories with πj = P(result is in category j) and Σⱼ πj = 1 (j = 1, . . . , k). Multinomial observations can be generated using rantbl( ) or rand('TABLE', ...). See the SAS documentation for details.


9 R: Part II

9.1 Graphics

The graphics in R are one of its strong points. This section describes the basic features of plotting in R. There is much, much more. A good reference is "R Graphics" by Murrell (Chapman & Hall).

In creating a figure the initial plot is created using a plot command. The plot command creates the axes and the frame for the plot and (almost always) does an initial plot. It has many options. Subsequent overlays to the plot, if desired, are done using the points command or the lines command. These have fewer options since they aren't involved with the axes set-up, titles, etc.

The plot command:

The simplest form of the plot command is plot(x,y) where x and y are vectors of equal length. What is in x is plotted on the x-axis and y on the y-axis. Only the x and y are required but there are many options that can be included. Below is a general form of the plot function with many of the commonly used options in it; a small illustrative example follows the list of options. Information on a number of the options can be found via help(par) and help(plot). Some of the text below is extracted directly from the help.

plot(x, y, type = , lty = , pch = , main = " ", sub = " ", xlab = " ", ylab = " ", xlim = c(,), ylim = c(,), col = , cex = , font = )

• type = specifies the type of plot, here are some of the options (points is the default)

"p" for points,

"l" for lines,

"b" for both,

"h" for ’histogram’ like (or ’high-density’) vertical lines,

• lty = a number, specifies the line type if type = "l". The default is 1.

The line type. Line types can either be specified as an integer (0=blank, 1=solid (default), 2=dashed, 3=dotted, 4=dotdash, 5=longdash, 6=twodash) or as one of the character strings "blank", "solid", "dashed", "dotted", "dotdash", "longdash", or "twodash", where "blank" uses invisible lines (i.e., does not draw them).

• pch = specifies the symbol for a point, if included. The default is a circle.

Either an integer specifying a symbol or a single character to be used as the default in plotting points. See points for possible values and their interpretation. Note that only integers and single-character strings can be set as a graphics parameter (and not NA nor NULL).

• main = gives the main plot title

• sub = gives a subtitle

• xlim and ylim specify the limits of the x and y axis; so xlim = c(4,10) gives a plot where the x-axis ranges from 4 to 10;

• col = sets a color; e.g., col= “red”

• cex = number controls the size of symbols. From the help


cex= A numerical value giving the amount by which plotting

text and symbols should be magnified relative to the default.

Note that some graphics functions such as plot.default have

an argument of this name which multiplies this graphical parameter,

and some functions such as points accept a vector of

values which are recycled. Other uses will take just

the first value if a vector of length greater than one is supplied.

There is also cex.axis, cex.lab, cex.main, cex.sub, which let you control individual parts.

• font = controls the font type. Below is from the help. It is a little obtuse, so experiment.

An integer which specifies which font to use for text. If possible, device drivers

arrange so that 1 corresponds to plain text (the default), 2 to bold face, 3 to

italic and 4 to bold italic. Also, font 5 is expected to be the symbol font, in

Adobe symbol encoding. On some devices font families can be selected by family

to choose different sets of 5 fonts.
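Here is a small illustrative example putting several of these options together (the data are made up just for the plot):

x<-1:10
y<-x^2
plot(x,y,type="b",pch=16,lty=2,col="blue",main="Example plot",
  sub="made-up data",xlab="x value",ylab="y value",
  xlim=c(0,12),ylim=c(0,120),cex=1.2)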

The points statement.

Add points to the plot. This is of the form

points(x,y, pch=, cex = , col = )

where only the x and y are required. There are a few other options.

The lines statement.

Adds a line (or line combined with points) to the plot. This is of the form

lines(x,y, options)

The options can include type = , lty = , and some of the other options in the plot statement but, obviously, not those concerned with the plot layout (e.g., main = , xlab = , etc.)

The abline statement.

abline(a,b) will plot a straight line with intercept a and slope b. (Sometimes the argument for the line might come from the output of a function; as when doing simple linear regression we used abline(reg); see page 96 of part I of notes.)
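For example (a minimal sketch with made-up data):

x<-1:20
y<-3 + 0.5*x + rnorm(20)   # simulated points scattered around the line y = 3 + 0.5x
plot(x,y)
abline(3,0.5)              # add the line with intercept 3 and slope 0.5
abline(lm(y~x),lty=2)      # add the fitted least squares line, dashed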

Multiple plots on a page

The command

par(mfrow=c(a,b))

lays out a plotting page which allows a*b plots arranged in a rows and b columns. This is put before the first plot statement. As each plot statement is encountered it puts the plot in the next available spot, where the plots are filled in a row at a time and, within a row, from left to right.

Plotting only some points.


In the plot, lines or points commands you can select out values by putting a condition after the variable. For example, plot(x[expression], y[expression], ...) will plot just y versus x for observations satisfying the expression; this might, for example, be selecting based on values of another variable or on the x and y themselves.

Unemployment example. The following does the unemployment plot (see page 15). Here it does it three different ways: two with points connected by a line and one with a line and no points. The par statement lays out the page to have three plots top to bottom.

edata<-read.table('g:/s597/data/employ.dat',header=F)

par(mfrow=c(3,1)) # lines up three plots vertically

Findex<-edata$V1

unemploy<-edata$V2

year<-edata$V3

newyear<-year+1950;

plot(newyear,unemploy,type="b",main="Unemployment over Time",

xlab="Year", ylab = "Unemployment")

#below uses a * rather than the default circle

plot(newyear,unemploy,type="b",pch="*",main="Unemployment over Time",

xlab="Year", ylab = "Unemployment")

#below uses no points, and chooses line type 3

plot(newyear,unemploy,type="l",lty=3,main="Unemployment over Time",

xlab="Year", ylab = "Unemployment")

[Figure: three panels, each titled "Unemployment over Time", with Year (1952-1960) on the x-axis and Unemployment on the y-axis.]

Figure 8: Unemployment plots using R

Esterase assay example. This does the plot on page 16. Note: We could have read and labeled variables as above, but here I've illustrated the use of matrix(scan( )) as an alternative to a read. The result is a matrix called data, rather than a dataframe. We create the variables, by name, by equating them to columns of the matrix called data. It also shows overlaying using lines and points and the need to control the range of the y-axis. Finally, it also demonstrates how you can read another data set and overlay a graph from it on the original plot. In this case it gets the original points that went into getting the fitted lines and prediction intervals.

data<-matrix(scan('g:/s597/data/pred.out'),ncol=7,byrow=T)

par(mfrow=c(1,1)) # single plot per page; not needed if in new session

# but resets if had a different layout earlier

x0<-data[,1]

yhat<-data[,2]

low <-data[,3]

up <- data[,4]

yhatw<-data[,5]

loww<-data[,6]

upw<-data[,7]

plot(x0,yhat,type="l",lty=1,xlab = "concentration", ylab = "count",

main = "Prediction intervals for Assay Data", ylim = c(-200,1400))

lines(x0,low,lty=1)

lines(x0,up, lty=1)

lines(x0,yhatw,lty=2)

lines(x0,loww,lty=2)

lines(x0,upw,lty=2)

edata<-read.table('g:/s597/data/ester.dat')

conc<-edata$V1 # true esterase concentration

count<-edata$V2 # radioactive binding count

points(conc,count,pch="*")

This plots verbal IQ versus brain size, for males and females, using the brain data

brain<-read.table("g:/s597/data/Brain_h.dat",na.string=".",header=T)

head(brain)

attach(brain)

par(mfrow=c(2,1))

plot(mriCount[Gender=='Male'],VIQ[Gender=='Male'],

xlab="MRI COUNT", ylab = "Verbal IQ", main = "Verbal IQ versus

Brain Size: Males")

plot(mriCount[Gender=='Female'],VIQ[Gender=='Female'],

xlab="MRI COUNT", ylab = "Verbal IQ", main = "Verbal IQ versus

Brain Size: Females")

9.1.1 Directing the graphics output

Graphics go to a graphics device. Without doing anything, when you create a plot it opens a graphics window as the active device. (This is equivalent to using the device function windows() in a windows environment.) A plot in the graphics window can be saved using "save as" as a pdf, eps, JPEG, and some others.

You can also explicitly open another device/file and the graphics output will be sent to that device/file. For example, the following would create a postscript file. Note some of the options used. In place of postscript, other options are pdf( ) and jpeg( ) as well as some others.


[Figure: fitted lines and prediction intervals (solid and dashed) with the original data points overlaid as *; concentration on the x-axis, count on the y-axis.]

Figure 9: Esterase Assay intervals using unweighted and weighted least squares.

postscript("h:/mecourse/phplot.ps",horizontal=F,height=10,width=8,font=3)

Use help(postscript) or help(jpeg) etc. to see the options. Notice that the default is horizontal = T, which creates a landscape plot.

Once you have opened another device (e.g., via ps, pdf or jpeg) output will be routed there. If you open another device it becomes active.

dev.off() will shut off the active device and return to using the interactive graph window when you next plot something.

9.2 Miscellaneous

• Sequencing See Section 15.4.3 for definition and use of the seq command.

• Creating a vector. The rep command can be used to create a vector. rep(values,n) creates n repeats of values. If values is a single item, this is a vector of length n, but if values has p components, this is a vector of length n*p. If the n is replaced by each=n, then it repeats each quantity in values n times.

> rep(4,10)

[1] 4 4 4 4 4 4 4 4 4 4

> rep(NA,13)

[1] NA NA NA NA NA NA NA NA NA NA NA NA NA


[Figure: two panels, "Verbal IQ versus Brain Size: Males" and "Verbal IQ versus Brain Size: Females"; MRI COUNT on the x-axis, Verbal IQ on the y-axis.]

Figure 10: Verbal IQ versus brain size for each gender

> rep(1:p,10)

Error: object "p" not found

> rep(1:5,10)

[1] 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3

[39] 4 5 1 2 3 4 5 1 2 3 4 5

> rep(1:5,each=10)

[1] 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 4 4 4 4 4 4 4 4

[39] 4 4 5 5 5 5 5 5 5 5 5 5

• The cat statement. The cat() statement lets you write text or variables to a file or to the console. This lets you add text to your output and it also is a way to output variables to a file. (A short example appears after this list.)

cat(....,file = , append = T)

if file = is omitted, then the output goes to the console.

Append = T is only an option if writing to a file; it appends to the file. If append = T is omitted then append = F is assumed, which means the file is overwritten.

The ... is what is output. It can be a collection of "items" separated by commas. An item in quotes is written as is, except for some special quantities using \: "\n" is a carriage return (new line) and "\t" produces a tab.

• multiple commands on a line: You can put multiple commands on a line if all but the last end in a semicolon.


• If statements: If statements are executed as below. Notice that the first piece is in parentheses, ( ), and the second in braces, { }.

if (expression1) {expression2}

or you can have an if then else

if (expression1) {expression2} else

{expression 3}

The expressions can be multiple lines.
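Here is a small example (made-up values) pulling a few of these items together: two commands on one line, an if/else, and a cat to the console.

x <- -3; y <- 10                              # two commands on one line
if (x >= 0) {r <- sqrt(x)} else {r <- NA}     # only take the square root when it is defined
cat("x =", x, " root =", r, " y =", y, "\n")  # writes one line to the console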

9.3 looping

Do loops are done in R using the for command or the while command.

The following loops through values of varname and carries out expression2 using the for command.

for (varname in seq) {expression2 involving varname}

where seq defines a range of possible values for varname.

The while command is of the form

while (condition) {statements}
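For example, a minimal while loop that finds the smallest n with 2^n greater than 1000 (in the same spirit as the sample size searches done earlier in SAS):

n <- 0
while (2^n <= 1000) {n <- n + 1}
n     # n is now 10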

Example: Here is an example involving both the for and the if statement. In homework 5, this would convert the missing values for CRIMTYPE (treated as numerical):

for (k in 1:length(CRIMTYPE)){if (CRIMTYPE[k]==9){CRIMTYPE[k]=NA}}

Other applications appear in later examples.

9.4 Functions

The general form for defining a function is

fname<-function(argument1,argument2, ...)

{statements

return(objects)}

At the end of the expression the return(objects) specifies what values are returned to the main program from the function. The following function gets summary statistics for a numerical variable where the missing code is contained in mcode.

statc<-function(x,mcode){

for (k in 1:length(x)){if (x[k]==mcode){x[k]=NA}}

mean<-mean(x,na.rm=T)


sd <-sd(x,na.rm=T )

nmiss<-sum(is.na(x))

n<-length(x)- nmiss

med<-median(x, na.rm=T)

sum<-summary(x)

min<-sum[1]

max<-sum[6]

statt<-cbind(n=n,nmiss=nmiss,mean=mean,SD=sd,median=med,min=min,max=max)

return(statt)

}

If we knew there were no missing values we could use just statc<-function(x) and skip some of the rest.

You can save the text defining your function to a file, say the above is saved to g:/s597/statcom.R. You can then run this using the source command and execute it as in the following example, where AGE (from the SYC data) was already in the workspace from previous commands.

> source("g:/s597/stats.R")

> stats(AGE,99)

n nmiss mean SD median min max

Min. 2621 0 16.80923 1.911258 17 11 24

NOTE: The source command executes what is in the file. (Also note that when using source it doesn't automatically list things. Suppose somewhere in the file it said just AGE. If we typed AGE in the console it would list AGE, but it doesn't do so when it is in a file being "sourced". Instead you need to use print(AGE).)

9.5 Probability Functions with examples

There are four general functions, with first letters d, p, q and r, that are used in working with probability distributions and generating samples from them. In each case the arguments involve the parameters of the distribution and are distribution specific.

• dName(x, arg1, ....) returns the PDF (density or mass function) evaluated at x of a random variable with distribution Name

• pName(x, arg1, ....) returns P(X ≤ x), the CDF (cumulative distribution function) evaluated at x of a random variable with distribution Name.

• qName(p, arg1, ....) returns the quantile, the value q_p such that P(X ≤ q_p) = p.

• rName(n, arg1, ...) generates n observations from the distribution.

Both the q and the p functions have options that come after the arguments that reverse the tail being worked with:

pName(x, arg1, ...., lower.tail=F) returns P (X > x).

qName(p, arg1, ... , lower.tail = F) returns q1−p; i.e. the value with probability p to the right of it.

There are also options that convert to log scales that we won’t discuss here. See the help( ) results.
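For example, for the standard normal distribution (these are standard values):

pnorm(1.96) # P(X <= 1.96), approximately 0.975
pnorm(1.96, lower.tail=F) # P(X > 1.96), approximately 0.025
qnorm(.975) # the .975 quantile, approximately 1.96
qnorm(.025, lower.tail=F) # the value with probability .025 to its right, also approximately 1.96
rnorm(3) # three standard normal observations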

Some specific distributions, illustrated with the d function.


• Uniform: dunif(x,minv,maxv) or dunif(x, min = minv, max = maxv), minv = lower bound, maxv = upper bound

If minv and maxv are omitted, then minv = 0 and maxv=1 (standard uniform)

• Exponential: dexp(x,ratev) or dexp(x,rate=ratev).

rate is 1/mean, not the mean. If rate is omitted, then rate = 1 is assumed.

• Normal; dnorm(x, meanv, sdv) or dnorm(x, mean = meanv, sd = sdv)

• t-distribution dt(x, df, ncp)

ncp = 0 if third argument omitted.

• Chi-square: dchisq(x, df, ncp)

ncp = 0 if third argument omitted.

• F: df(x, df1, df2,ncp )

ncp = 0 if fourth argument omitted.
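A few quick evaluations of these (the two numerical values given are standard and easy to check):

dunif(.3) # standard uniform density at .3, equals 1
dnorm(0) # standard normal density at 0, about 0.3989
dexp(3, rate=.5) # exponential density with mean 2 (rate 1/2) evaluated at 3
dt(0, 10) # t density with 10 df at 0
dchisq(1, 3) # chi-square density with 3 df at 1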

# plotting the binomial

n<-10

pi<-.2

pval<-rep(NA,n)

pval

for (k in 1:n)

{pval[k] = dbinom(k,n,pi)}

kvec<-seq(1:n)

kvec

pval

plot(kvec,pval, type = "h")

#PLOTTING THE BINOMIALS AS A FUNCTION

bplot<-function(n,pi){

pval<-rep(NA,n)

pval

for (k in 1:n)

{pval[k] = dbinom(k,n,pi)}

kvec<-seq(1:n)

kvec

pval

plot(kvec,pval,type="h")}

bplot(10,.2)

# This does four plots to a page.

par(mfrow=c(2,2))

bplot<-function(pi){

for (n in c(5,10,20,50)){


pval<-rep(NA,n)

maint<-paste("n = ",n)

for (k in 1:n)

{pval[k] = dbinom(k,n,pi)}

kvec<-seq(1:n)

plot(kvec,pval,type="h",xlab = "k", ylab= "probability",main = maint)}}

bplot(.2)

# HERE IS A LONGER WAY WITH EACH GRAPH LABELED BY THE SAMPLE SIZE. NOT NECESSARY.

par(mfrow=c(2,2))

pi<-.2

n<-5

pval<-rep(NA,n)

pval

for (k in 1:n)

{pval[k] = dbinom(k,n,pi)}

kvec<-seq(1:n)

plot(kvec,pval,type="h",xlab = "k", ylab= "probability", main = "n = 5")

n<-10

pval<-rep(NA,n)

for (k in 1:n)

{pval[k] = dbinom(k,n,pi)}

kvec<-seq(1:n)

plot(kvec,pval,type="h",xlab = "k", ylab= "probability", main = "n = 10")

n<-20

pval<-rep(NA,n)

for (k in 1:n)

{pval[k] = dbinom(k,n,pi)}

kvec<-seq(1:n)

plot(kvec,pval,type="h",xlab = "k", ylab= "probability",main = "n = 20")

n<-50

pval<-rep(NA,n)

for (k in 1:n)

{pval[k] = dbinom(k,n,pi)}

kvec<-seq(1:n)

plot(kvec,pval,type="h",xlab = "k", ylab= "probability",main = "n = 50")

Hypergeometric Example on page 21. We will do this problem four different ways to show some of the features of R.

METHOD 1

# hypergeometric example on page 21.

# Here we write to the console using the cat

# function and a print.

popsizea <- 698

upper <-4

prob <-rep(NA,upper)

value <- rep(NA,upper)

ssavalues<-seq(7,70,by=7)

numberssa <-length(ssavalues)

dvalues<-seq(1,101,by=5)

[Figure 11: Binomial probability density functions with π = .2; panels show n = 5, 10, 20, 50]

numberdef<-length(dvalues)

for (j in 1:numberssa)

{ssa= ssavalues[j]

ssrate=ssa/popsizea

for (def in seq(1,101, by = 5))

{defrate=def/popsizea;

nondef = popsizea-def;

for (k in 0:upper)

{prob[k+1] = dhyper(k,def,nondef,ssa)

value[k]=k}

cat("probabilities with N = 698,sample size= ", ssa,

"number defective = ", def, "\n")

prob0<-prob[1]; prob1<-prob[2]; prob2<-prob[3];prob3<-prob[4]

values<-cbind(prob0,prob1,prob2,prob3)

print(values)}}

probabilities with N = 698,sample size= 7 number defective = 1

prob0 prob1 prob2 prob3

[1,] 0.9899713 0.01002865 0 0

probabilities with N = 698,sample size= 7 number defective = 6

prob0 prob1 prob2 prob3

[1,] 0.9411107 0.05761902 0.001258057 1.219048e-05

...


METHOD 2

# Now we’ll create a "file" with a separate line

# for each combination of sample size and number defective

# Here we do it showing also how you can write to a file

# using the cat command

myfile<-"g:/s597/houtput"

popsizea <- 698

upper <-4

prob <- rep(NA,upper)

value <- rep(NA,upper)

ssavalues<-seq(7,70,by=7)

numberssa <-length(ssavalues)

dvalues<-seq(1,101,by=5)

numberdef<-length(dvalues)

cat("ssa","\t", "def","\t", "p0","\t","p1","\t","p2","\t","p3\n", file=myfile)

for (j in 1:numberssa)

{ssa= ssavalues[j]

ssrate=ssa/popsizea

for (m in 1:numberdef)

{def = dvalues[m]

defrate=def/popsizea

nondef = popsizea-def

for (k in 0:upper)

{prob[k+1] = dhyper(k,def,nondef,ssa)

value[k]=k}

prob0<-prob[1]; prob1<-prob[2]; prob2<-prob[3];prob3<-prob[4]

cat(ssa, "\t", def, "\t", prob0, "\t", prob1, "\t", prob2,

"\t", prob3,"\n",file=myfile,append=T)

}}

hdata<-read.delim("g:/s597/houtput")

head(hdata)

ssa def p0 p1 p2 p3

1 7 1 0.9899713 0.01002865 0.000000000 0.000000e+00

2 7 6 0.9411107 0.05761902 0.001258057 1.219048e-05

3 7 11 0.8943318 0.10112120 0.004448147 9.768991e-05

4 7 16 0.8495603 0.14075550 0.009355982 3.219856e-04

5 7 21 0.8067238 0.17673380 0.015779810 7.424871e-04

6 7 26 0.7657521 0.20925960 0.023529940 1.408978e-03

METHOD 3

# Here we do it showing how you can create a matrix and write to

# the matrix.

popsizea <- 698

upper <-4


prob <- rep(NA,upper)

value <- rep(NA,upper)

ssavalues<-seq(7,70,by=7)

numberssa <-length(ssavalues)

dvalues<-seq(1,101,by=5)

numberdef<-length(dvalues)

total = numberssa*numberdef

total

data<-matrix(NA,total,6) # create a total x 6 matrix with NA’s for entries

index=0

for (j in 1:numberssa)

{ssa= ssavalues[j]

ssrate=ssa/popsizea

for (m in 1:numberdef)

{def = dvalues[m]

defrate=def/popsizea

nondef = popsizea-def

for (k in 0:upper)

{prob[k+1] = dhyper(k,def,nondef,ssa)

value[k]=k}

index=index+1

data[index,1]<-ssa

data[index,2]<-def

data[index,3]<-prob[1]

data[index,4]<-prob[2]

data[index,5]<-prob[3]

data[index,6]<-prob[4]

}}

data

[,1] [,2] [,3] [,4] [,5] [,6]

[1,] 7 1 9.899713e-01 0.0100286533 0.000000000 0.000000e+00

[2,] 7 6 9.411107e-01 0.0576190211 0.001258057 1.219048e-05

[3,] 7 11 8.943318e-01 0.1011212164 0.004448147 9.768991e-05

...

[208,] 70 91 3.240206e-05 0.0003836452 0.002210053 8.256321e-03

[209,] 70 96 1.752117e-05 0.0002209048 0.001355834 5.399683e-03

[210,] 70 101 9.422897e-06 0.0001261740 0.000822874 3.484018e-03

METHOD 4

# Here we do it showing how you can create vectors

# write to them and then bind them to a dataframe

popsizea <- 698

upper <-4

prob <- rep(NA,upper)

ssavalues<-seq(7,70,by=7)

numberssa <-length(ssavalues)


dvalues<-seq(1,101,by=5)

numberdef<-length(dvalues)

total = numberssa*numberdef

ssav<-rep(NA,total)

defv<-rep(NA,total)

p0v<-rep(NA,total)

p1v<-rep(NA,total)

p2v<-rep(NA,total)

p3v<-rep(NA,total)

index=0

for (j in 1:numberssa)

{ssa= ssavalues[j]

ssrate=ssa/popsizea

for (m in 1:numberdef)

{def = dvalues[m]

defrate=def/popsizea

nondef = popsizea-def

for (k in 0:upper)

{prob[k+1] = dhyper(k,def,nondef,ssa)

}

index=index+1

ssav[index]<-ssa

defv[index]<-def

p0v[index]<-prob[1]

p1v[index]<-prob[2]

p2v[index]<-prob[3]

p3v[index]<-prob[4]

}}

hdata<-cbind(ssav,defv,p0v,p1v,p2v,p3v)

head(hdata)

ssav defv p0v p1v p2v p3v

[1,] 7 1 0.9899713 0.01002865 0.000000000 0.000000e+00

[2,] 7 6 0.9411107 0.05761902 0.001258057 1.219048e-05

[3,] 7 11 0.8943318 0.10112122 0.004448147 9.768991e-05

[4,] 7 16 0.8495603 0.14075555 0.009355982 3.219856e-04

[5,] 7 21 0.8067238 0.17673382 0.015779805 7.424871e-04

[6,] 7 26 0.7657521 0.20925958 0.023529938 1.408978e-03

Power example: This gets the power function and plots it for a one-sided test for the mean. See page 31. Shows the use of the cat command to write out header information. Note the need for a , to separate something in quotes from a variable. These ,'s are not printed, as you'll see.

powerone<-function(mu0,sigma,n,alpha){

cat("Power example with sample size ", n," null = ", mu0,

" sigma = ", sigma, " alpha = ", alpha, "\n")

muval<- seq(25,35,by=.5)

#muval

nmu <-length(muval)

power<-rep(NA,nmu)

for(k in 1:nmu)


{mu = muval[k]

df= n-1

nc = sqrt(n)*(mu - mu0)/sigma

tval= qt(1-alpha,df)

power[k] = 1 - pt(tval,df,nc)}

plot (muval,power,type = "l",ylab="power",xlab="Null value",

main= "power function")

values<-data.frame(muval,power)

return(values)}

powerone(30,2,5,.05)

Power example with sample size 5 null = 30 sigma = 2 alpha = 0.05

muval power

1 25.0 1.554168e-11

2 25.5 4.549612e-10

3 26.0 1.016218e-08

4 26.5 1.720899e-07

5 27.0 2.219886e-06

6 27.5 2.193684e-05

7 28.0 1.671735e-04

...

19 34.0 9.748306e-01

20 34.5 9.914523e-01

21 35.0 9.975115e-01

Sample size determination. This gets the sample size needed for the one-sample t-test over different target values at a specified alternative, µ. See page 33.

ssizeone<-function(mu0,sigma,mu,alpha)

{targets <- seq(.5,.98, by=.02)

ntarget<- length(targets)

nv<-rep(NA,ntarget)

pv<-rep(NA,ntarget)

m=0

for (j in 1:ntarget)

{target=targets[j]

target

n<-2

power<-0

while(power < target)

{df=n-1

nc = sqrt(n)*(mu - mu0)/sigma

tval= qt(1-alpha,df)

power = 1 - pt(tval,df,nc)

n= n+1}

m<-m+1

nv[m]=n-1;

pv[m]=power;

} #end j/target loop

data<-cbind(targets,nv,pv)

plot(targets,nv,xlab="target",ylab="sample size",

type = "b", main= " H0:mu <= 30, alt = 31, sigma=2, alpha=.05")

[Figure 12: One-sided power function using R]

return(data)

} # end function.

ssizeone(30,2,31,.05)

targets nv pv

[1,] 0.50 13 0.5220115

[2,] 0.52 13 0.5220115

[3,] 0.54 14 0.5507256

....

[24,] 0.96 48 0.9615312

[25,] 0.98 57 0.9814151

Simulating and plotting the sample mean from an exponential

simc<-function(mu)

{par(mfrow=c(3,2))

nsim<-1000

means<-rep(0,nsim)

for (n in c(1,5,10,30,50,100))

{for (j in 1:nsim)

{values<-rexp(n,1/mu)

means[j]<- sum(values)/n}

[Figure 13: Sample size needed to have power = target at µ = 31 for the test of H0: µ ≤ 30 versus HA: µ > 30 with α = .05]

hist(means,main="mean",freq=FALSE)

lines(density(means))}}

simc(4)

[Figure 14: Distribution of the sample mean with samples from the exponential; n = 1, 5, 10, 30, 50, 100]

9.6 Some More graphics

9.6.1 Using legends

This comes from my book (“Measurement Error: Models, Methods and Applications”). It is from an example in Montgomery and Peck (Regression Analysis: 1992) for which a quadratic model was used to model the tensile strength of Kraft paper as a function of the hardwood concentration in the batch of pulp used. This program plots five different quadratic fits: one is just a regular fit to the data (called naive); the other four are fits based on various methods that correct for the fact that the hardwood concentration can't be observed exactly but is estimated with some uncertainty. (This measurement error in the predictors causes bias in the fitted coefficients.)

In this code the position of the legend is given by the first two arguments in the legend statement, which positions the upper left of the box containing the legend at x = 0 and y = 60.

#postscript("g:/mecourse/paper.ps",horizontal=F,height=6,width=4.5)

# Plot of five different quadratic functions with a legend.

x<-seq(0,20,.1)

fitn<- 1.11 + 8.99*x -.44*(x**2)

fitc<- -11.03 + 13.36*x -.73*(x**2)

fitrc<- -2.95 + 10.11*x -.52*(x**2)

fitrci<- -3.99 + 11.03*x -.60*(x**2)

fits<- -1.42 + 9.77*x -.499*(x**2)

plot(x,fitn,xlab="Hardwood Concentration",ylab = "Tensile strength",


type = "l", lty=1, cex=0.8,ylim=c(0,60))

lines(x,fitc,type = "l", lty=2, cex=0.8)

lines(x,fitrc,type = "l", lty=3, cex=0.8)

lines(x,fitrci,type = "l", lty=4, cex=0.8)

lines(x,fits,type = "l", lty=5, cex=0.8)

legend(0,60,c("Naive","MOM", "RC", "RC-I", "SIMEX"),lty=1:5,cex=0.8)

[Figure 15: Naive and other fits (accounting for measurement error) of tensile strength versus hardwood concentration]

9.6.2 Writing text in margins or in graphs

You can write in the margins of the plot using mtext and in the graph itself using text.

#showing how to write in margins with mtext and that

#default is side = 3 (top)

par(mfrow=c(3,2))

plot(1:10, (-4:5)^2, main="Parabola Points", xlab="xlab")

mtext("10 of them ",side=1)

plot(1:10, (-4:5)^2, main="Parabola Points", xlab="xlab")

mtext("10 of them",side=2)

plot(1:10, (-4:5)^2, main="Parabola Points", xlab="xlab")

mtext("10 of them",side=3)

plot(1:10, (-4:5)^2, main="Parabola Points", xlab="xlab")


mtext("10 of them",side=4)

plot(1:10, (-4:5)^2, main="Parabola Points", xlab="xlab")

mtext("10 of them")

plot(1:10, (-4:5)^2, main="Parabola Points", xlab="xlab")

text(5,10,"THIS SHOWS HOW YOU CAN WRITE IN TEXT")

# The text is centered at x = 5 and y = 10

[Figure 16: Examples of writing text in the margins with mtext and inside the plot with text]


10 SAS-IML.

IML is a powerful programming language which is part of SAS (in fact, hidden under the procs in SAS is IML). It is not a procedure in the way that other procedures are in SAS, even though it is begun by using proc iml;. This simply invokes the IML programming language. quit; ends IML. You can execute pieces interactively without having to run the whole procedure. Among its advantages are (not meant to be exhaustive by any means):

• Handles matrices easily with many built in matrix operations.

• You can print items at any time using a combination of text and variables.

• You can define functions and modules.

• Besides the matrix functions, there are many, many other built-in functions and subroutines, including those doing nonlinear optimization and a general randgen routine for generating random variables from a number of distributions.

10.1 Basic matrix operations

The following demonstrates some of the basic matrix functions, including defining matrices directly.

options ls=80 nodate;

proc iml; /* this starts iml */

/* Create a 3 by 3 matrix called a. Spaces separate entries within a

row, a , separates rows. */

a = {1 5 8, 5 3 6, 8 6 4}; /* note a is symmetric */

/* create a 3 by 3 matrix b */

b = { 5 2 4, 1 3 2, -5 6 7};

/* I(k) create a k by k identity matrix */

imat = I(5);

/* J(nrow,ncol,value) creates a matrix with nrow rows, ncol columns

and value for each entry */

c = J(3,4,2); /* creates a 3 by 4 matrix of 2’s */

d = J(2,3); /* no value given. The default is 1 */

e = {1 1 5 6 7 9 11 11}; /* creates a 1 by 8 row vector */

/* row vectors can be created using the index operator :

r = i:j will create a row vector going from i up to j by steps of 1 if

j is bigger than i or going from i down to j if i is bigger than j.

r = do(l,u,inc) will create a row vector going from l to u by steps of

inc */

r1 = 1:5;

r2 = do(10,18,2);


C = A||B adjoins the columns of A and B, placing B to the right of A, while C2 = A//B adjoins the rows, stacking B below A, as illustrated below.

proc iml;

A = {1 5 8, 5 3 6, 8 6 4};

B = { 5 2 4, 1 3 2, -5 6 7};

C = A||B;

C2 = A//B;

print A B C C2;

A B

1 5 8 5 2 4

5 3 6 1 3 2

8 6 4 -5 6 7

C C2

1 5 8 5 2 4 1 5 8

5 3 6 1 3 2 5 3 6

8 6 4 -5 6 7 8 6 4

5 2 4

1 3 2

-5 6 7

10.2 Naming rows and columns, formats, reset

• You can specify row and column names as well as a format right in the print command.

• The mattrib command can be used to attach certain attributes, including row and column names, to matrices. When the matrix is printed the row and column names will be printed.

• Matrices that do not have row and column names attached by mattrib, or specified in the print statement, will be printed without row and column names. This is the default. If reset autoname; is used, then in printing matrices, those that don't have row or column names will be printed with row1, etc. and col1, etc. as the headings. reset noautoname; will return to using no row and column names for such matrices.

• By default the name of the matrix will be printed. reset noname; will leave the matrix name out in printing. reset name; will return to printing the matrix name.

• You can specify the format for the entries in the matrix by putting format w.d either right in the print statement or in the mattrib. In either case this goes within the brackets, similar to rowname and colname in the examples below.

options ls=80 nodate;

proc iml; /* this starts iml */

a = {1 5 8, 5 3 6, 8 6 4};

b = { 5 2 4, 1 3 2, -5 6 7};


rnames = {mon tue wed};

cnames = {group1 group2 group3};

print a;

print a[rowname=rnames colname=cnames];

mattrib a rowname = (rnames) colname=(cnames);

reset autoname spaces=3; /*spaces = n puts n spaces between matrices*/

print a;

print b;

reset noautoname;

print a;

print b;

reset autoname;

print a b; /* ?? this leads to strange column headings for b

since a has names and b does not */

quit;

A

1 5 8

5 3 6

8 6 4

A

GROUP1 GROUP2 GROUP3

MON 1 5 8

TUE 5 3 6

WED 8 6 4

B

COL1 COL2 COL3

ROW1 5 2 4

ROW2 1 3 2

ROW3 -5 6 7

A

GROUP1 GROUP2 GROUP3

MON 1 5 8

TUE 5 3 6

WED 8 6 4

B

5 2 4

1 3 2

-5 6 7

B

A GROUP1 GROUP2 GROUP3 Col5 Col6 Col7

MON 1 5 8 5 2 4

TUE 5 3 6 1 3 2

WED 8 6 4 -5 6 7


10.3 Reading from a SAS data set into matrices.

Within IML you can work with a SAS data set, say one called b, that has already been created in a data step. In order to use the data set the command is use b;. When done with the data set the command close b; is used.

The general form of the read statement is: read range var operand where(expression) into matname;

where the boldface terms (read, var, where, into) indicate specified SAS language and the other pieces are specified by you. See the documentation for full detail.

range specifies the range of cases to consider. Here are some common uses

• all = all observations.

• current = current observation.

• next n = next n observations where n is an integer.

• after = all observations after the current one.

operand specifies a list of variables. It can be a list of variable names in braces, { }, from the SAS data set, all = all variables, num = all numeric variables, or char = all character variables. With no var clause the default is all numerical variables.

where is optional. It only uses cases that satisfy the expression.

into specifies the matrix, called matname, to put things in. If there is no into clause, then each variable will be read into a column vector (a matrix with one column) where the name of the matrix is the same as the name of the original sas variable. Note that the same name can be used in the sas data set and in iml.

options ps=60 ls=80;

data a;

infile ’speed2’;

input id sex vo35 vo4 vo45 vo5 vo55 vo6;

proc print;

run;

proc iml;

use a;

read all into bigmat;

read all var {id};

read all var {id sex} into b;

read all var{vo35 vo4 vo45 vo5 vo55 vo6} where(sex=1) into vmat1;

read all var{vo35 vo4 vo45 vo5 vo55 vo6} where(sex=2) into vmat2;

read next 3 var {id} into id1;

read next 5 var {id} into id2;

close a;

print bigmat; print id; print b; print vmat1;

print vmat2; print id1; print id2; quit; run;

OBS ID SEX VO35 VO4 VO45 VO5 VO55 VO6

1 1 1 15.7 18.4 22.0 34.8 45.3 51.1

2 2 1 14.8 18.0 25.1 29.6 44.8 51.3

3 3 1 16.6 21.0 25.5 35.2 48.9 49.1

4 4 1 14.6 19.2 24.2 30.9 36.2 47.6


5 5 1 13.0 18.2 20.4 25.1 33.5 48.9

6 6 1 18.4 21.5 25.7 34.1 42.6 48.5

7 7 2 14.9 18.9 23.9 31.6 42.9 42.0

8 8 1 16.3 19.7 26.5 31.1 39.5 42.6

...

23 23 1 15.4 17.5 22.8 31.6 43.0 51.2

24 24 2 14.3 16.9 23.9 36.8 37.4 36.2

BIGMAT

1 1 15.7 18.4 22 34.8 45.3 51.1

2 1 14.8 18 25.1 29.6 44.8 51.3

...

23 1 15.4 17.5 22.8 31.6 43 51.2

24 2 14.3 16.9 23.9 36.8 37.4 36.2

ID

1

2

3

.

.

23

24

B

1 1

2 1

3 1

.........

22 2

23 1

24 2

VMAT1

15.7 18.4 22 34.8 45.3 51.1

14.8 18 25.1 29.6 44.8 51.3

16.6 21 25.5 35.2 48.9 49.1

14.6 19.2 24.2 30.9 36.2 47.6

13 18.2 20.4 25.1 33.5 48.9

18.4 21.5 25.7 34.1 42.6 48.5

16.3 19.7 26.5 31.1 39.5 42.6

15.6 16.9 22.1 27.9 38.2 39.5

19.4 21.3 24.6 30.6 40.1 42.2

14.1 15.6 20.4 26 33.7 43.1

12.2 16.2 19 25.4 36.6 41.8

14.8 16.8 20.4 26.8 33.1 41.3

16.1 18.8 23 26.3 32.7 39.2

14.5 17.7 22 32.4 36.6 40.8

15.4 17.5 22.8 31.6 43 51.2

VMAT2


14.9 18.9 23.9 31.6 42.9 42

14.7 19.2 22.2 32.1 33.1 39.5

15.3 17.9 25 35.3 38 40.8

14.5 16.7 27.1 38.1 41.3 45

14 15.9 21 26.3 34.1 39.1

15.2 17.8 22.4 26.5 36.5 44.3

16.5 22.2 25.9 35.1 39.5 43.6

12.6 15.4 18.7 24.3 31.9 39.9

14.3 16.9 23.9 36.8 37.4 36.2

ID1 ID2

1 4

2 5

3 6

7

8

10.4 Example. Computing descriptive statistics etc. in a one-sample problem

Illustrating the use of Proc IML to run descriptive statistics and carry out statistical inference for a one-sample setting. This shows

1. Calculation of the mean, sd, variance and standard error of the mean.

2. Calculation of confidence intervals for the population mean, variance, and standard deviation under the assumption of normality.

3. Shows how to sort a vector using the rank function and then use this to get quantiles.

title ’Illustrating descriptive statistics with the LA heart data’;

options ls=80 ps=60 nodate;

data values;

infile ’ladata’;

input id age md50 sp50 dp50 ht50 wt50 sc50

soec cs md62 sp62 dp62 sc62 wt62 ihdx yrdth;

run;

proc iml;

use values;

read all var{wt50} into x;

close values;

print x;

n = nrow(x);

/* get mean, variance and standard deviation */

mean = sum(x)/n;

var = (ssq(x) - (n*(mean**2)))/(n-1);

sd=sqrt(var);

reset noname; /* name of matrices will not be printed*/

print "n = " n "mean: " mean "st. deviation: " sd "variance:" var;


/* get 95% confidence interval for the mean and variance */

a=.05;

se = sd/sqrt(n);

tval=tinv(1-a/2,n-1);

upper = mean + tval*se;

lower = mean - tval*se;

print "standard error = " se ;

print "95% CI for mean = " lower " to " upper;

lval = cinv((a/2),n-1);

uval = cinv(1-a/2,n-1);

lowvar = (n-1)*var/uval;

upvar = (n-1)*var/lval;

lowsd=sqrt(lowvar);

upsd=sqrt(upvar);

print "95% CI for variance = " lowvar " to " upvar;

print "95% CI for standard dev. = " lowsd " to " upsd;

/* Organizing results in a matrix for better looking output*/

print "Estimates and 95% confidence intervals";

cimat=J(3,3);

cimat[1,1]=mean;

cimat[1,2]=lower;

cimat[1,3]=upper;

cimat[2,1]=var;

cimat[2,2]=lowvar;

cimat[2,3]=upvar;

cimat[3,1]=sd;

cimat[3,2]=lowsd;

cimat[3,3]=upsd;

rname={Mean Variance StDeviation};

cname={Estimate LowerCI UpperCI};

reset noname;

print cimat [rowname=rname colname=cname format=6.2];

/* First create a ranked version of x */

rx=J(n,1); /* initializes rx */

rx[rank(x)]=x; /* rx has the ranked values */

reset name;

print rx x;

/* get quantiles using definition 5 in SAS proc univariate*/

per={1,5,25,50,75,95,99}; /* specify percentiles of interest */

q=per;

k=nrow(per);

do i = 1 to k;

p=per[i]/100;

j = int(n*p);

g = n*p - j;

if g=0 then q[i]=.5*(rx[j]+rx[j+1]);

else q[i] = rx[j+1];


end;

print per q;

quit;

run;

Illustrating descriptive statistics with the LA heart data 1

X

147

167

... middle 196 cases omitted

138

157

n = 200 mean: 168.075 st. deviation: 26.639589 variance: 709.66771

standard error = 1.8837034

95% CI for mean = 164.36042 to 171.78958

95% CI for variance = 588.53168 to 872.68866

95% CI for standard dev. = 24.259672 to 29.541304

Estimates and 95% confidence intervals

ESTIMATE LOWERCI UPPERCI

MEAN 168.08 164.36 171.79

VARIANCE 709.67 588.53 872.69

STDEVIATION 26.64 24.26 29.54

RX X

109 147

120 167

... middle 196 cases omitted

235 138

245 157

PER Q

1 120

5 129

25 147

50 165

75 189

95 213

99 234


10.5 Missing values

Missing values must be handled carefully in IML. For example, missing values get read into a matrix or vector unless excluded. Certain functions, such as sum and ssq, will ignore missing values. A function like nrow or ncol will not account for missing values in any way. If we ran the program above on a data set with missing values, then the analysis would be wrong. The following illustrates how to handle this. Not accounting for the missing values gives the wrong answer in meany.

options ls=80 ps=60 nodate;

data values;

infile ’test.dat’;

input v;

proc print;

run;

proc iml;

use values;

read all var{v} where(v ^= .) into x; /* ^= represents not equal */

read all var{v} into y;

close values;

print x y;
n = nrow(x);
mean=sum(x)/n;
ny=nrow(y);
meany=sum(y)/ny;
print n mean ny meany;
quit;
run;

The proc print listing and the printed IML results:

OBS  V      X  Y      N  MEAN  NY  MEANY
 1   4      4  4      4  5.75  6   3.8333333
 2   .      5  .
 3   5      6  5
 4   6      8  6
 5   .         .
 6   8         8

10.6 Writing to external files in IML.

This is done via a put statement with the same usage as in the data step. It can include text, in quotes, or variable names. If you want to put an element from a matrix, it needs to be put in parentheses.

As seen in a later section you can also save values to a temporary SAS file and then save it permanently, as a SAS file or a text file; see Sections 4 and 14.2 of Part I of the notes.

Going back to the example creating a goodness-of-fit test statistic: if we added the file and put statements to the do loop near the end

do i = 1 to 10;

obs[i]=.1*n;

chisq=chisq + ((obs[i] - exp[i])**2)/exp[i];

file ’goodfit.dat’;

put (d[i]) +1 (obs[i]) +1 (exp[i]) 5.2; /* need to put elements in parentheses*/

end;

it creates the text file goodfit.dat which looks like:

137 20 24.34


144 20 12.27

150 20 13.13

158.5 20 22.18

165 20 18.88

171.5 20 19.42

183.5 20 33.51

192 20 19.34

203 20 17.93

10.7 Example: Statistics for multiple groups.

Suppose there is a variable, say group, which defines different groups or categories, and we want to get various statistics for another variable for each of these groups. With two groups, this would be the starting point for the various inferences we make on differences in means, ratios of variances, etc.

One way to do this is to use the where (group = ) option to select those with group equal to whatever value and put this into a vector, with a different vector for each group. In this case you need to have one read for each group and know the values for group ahead of time. This is not very elegant.

A better option is to use the design function, which creates a matrix of 0's and 1's indicating group membership. This has the advantage of easily generalizing to multiple groups and not needing to know how many groups or what the group indicators are beforehand.

Using the nutrition example with method as the grouping variable. Run first on two groups.

option linesize=80 pagesize=60 nodate;

data a;

infile ’nut2.dat’;

input id method lnrsty k1-k3 a1-a3 b1-b2;

changeb=b2-b1;

if method=3 or method=2;

run;

proc iml;

use a;

read all var{changeb} where(method=2 & changeb ^=.) into met2;

read all var{changeb} where(method=3 & changeb ^=.) into met3;

read all var{changeb} where (changeb ^= .) into changeb;

read all var{method} where (changeb ^=.) into method;

/* or could create sas data set beforehand eliminating cases

where changeb is missing */

designm=design(method);

close a;

print met2 met3;

/* getting statistics for each group using separate vectors */

print "group1 = method 2, group2 = method 3 ";

n1=nrow(met2);

x1bar=sum(met2)/n1;

var1 = (ssq(met2) - (n1*(x1bar**2)))/(n1-1);

print n1 x1bar var1;

n2=nrow(met3);

x2bar=sum(met3)/n2;


var2 = (ssq(met3) - (n2*(x2bar**2)))/(n2-1);

print n2 x2bar var2;

print method designm;

ngroup=ncol(designm);

n=J(ngroup,1);

mean=J(ngroup,1);

var=J(ngroup,1);

do j = 1 to ngroup;

n[j]=sum(designm[,j]);

obs=designm[,j]#changeb;

print j obs;

mean[j]=sum(obs)/n[j];

var[j] = (ssq(obs) - (n[j]*(mean[j]**2)))/(n[j]-1);

end;

print n mean var;

quit;

run;

N1 X1BAR VAR1

66 1.5606061 11.32704

N2 X2BAR VAR2

57 0.8245614 14.754386

METHOD DESIGNM

3 0 1

3 0 1

2 1 0

..............

3 0 1

2 1 0

N MEAN VAR

66 1.5606061 11.32704

57 0.8245614 14.754386

The following makes this into a macro to handle an arbitrary number of groups and uses the design function.

option linesize=80 pagesize=60 nodate;

%macro gstat(group,y);

data a;

infile ’g:/s597/data/nut2.dat’;

input id method lnrsty k1-k3 a1-a3 b1-b2;

changeb=b2-b1;

y = &y;

group =&group;

/*y = changeb;

group=method;*/

run;

proc iml;

use a;

read all var{y} where (y ^= .) into y;


read all var{group} where (y ^=.) into group;

designm=design(group);

close a;

/* getting statistics for each group using separate vectors*/

print group designm;

ngroup=ncol(designm);

n=J(ngroup,1);

mean=J(ngroup,1);

var=J(ngroup,1);

do j = 1 to ngroup;

n[j]=sum(designm[,j]);

obs=designm[,j]#y;

mean[j]=sum(obs)/n[j];

var[j] = (ssq(obs) - (n[j]*(mean[j]**2)))/(n[j]-1);

end;

print n mean var;

run;

quit;

%mend;

%gstat(method,changeb)

OUTPUT SHOWS JUST A PORTION OF THE DESIGN MATRIX.

group designm

.....

1 1 0 0

3 0 0 1

1 1 0 0

2 0 1 0

3 0 0 1

1 1 0 0

1 1 0 0

3 0 0 1

3 0 0 1

2 0 1 0

1 1 0 0

2 0 1 0

2 0 1 0

.....

n mean var

78 2.4615385 14.017982

66 1.5606061 11.32704

57 0.8245614 14.754386

10.8 Example: Multivariate observations.

Suppose the data consist of m variables on each of n “individuals/units”. Typically the data will be initially arranged in the usual way with a row in the data set being an “individual” and the m variables being the columns in the data set (but the role of individuals and variables could switch). Write xij for the jth variable for the ith individual (i = 1 to n, j = 1 to m).

The sample mean for variable j is x̄.j = ∑i xij/n (all sums over i are from 1 to n).

The sample mean vector, x̄, is the m × 1 matrix with jth component equal to x̄.j. As done here, it is typical to define the mean vector as a column vector even though the original data layout may have the variables in a row.

The sample variance for variable j is s²j = ∑i (xij − x̄.j)²/(n − 1).

The sample covariance between two variables j and k is sjk = ∑i (xij − x̄.j)(xik − x̄.k)/(n − 1). Notice that in this notation we can also write sjj for s²j.

The sample covariance matrix, denoted S, is the m × m matrix with sjk for the (j, k)th element.

The sample correlation between two variables j and k is rjk = sjk/(sj sk). (rjj = 1 = correlation of a variable with itself.)

The sample correlation matrix, denoted R, is the m × m matrix with rjk for the (j, k)th element.

Define the m × 1 vector xi = (xi1, xi2, . . . , xim)′.

In matrix form x̄ = ∑i xi/n,

S = ∑i (xi − x̄)(xi − x̄)′/(n − 1) = (∑i xi xi′ − n x̄ x̄′)/(n − 1),

and R = D⁻¹SD⁻¹ where D is an m × m diagonal matrix with (j, j) element sj.

The sample mean, covariance and correlation matrix are at the heart of most multivariate statistical techniques.

Example. This data, taken from Johnson and Wichern's Applied Multivariate Statistical Analysis, consists of 100 weeks of rates of returns on five stocks (Allied Chemical, duPont, Union Carbide, Exxon, and Texaco). This illustrates how to get the means, variances, covariances and correlations directly via matrix multiplication and then computes some measures of how “synchronized” the different stock series are over time.

title ’Illustrating multivariate statistics with stock data’;

options ls=80 ps=60 nodate;

data values;

infile ’stock.dat’;

input allied dupont unionc exxon texaco;

*proc print;

run;

proc corr cov;

run;

proc iml;

use values;

read all into xval; /* xval is 100 by 5 */

close values;


x=t(xval);

n = nrow(xval);

m=ncol(xval);

print n m;

do i = 1 to n;

x[,i]=t(xval[i,]);

end;

/* x is 5 by 100 with ith column containing the 5 stocks

at time i*/

mean = x[,+]/n; /* the function x[,+] creates sums over columns*/

print "mean vector " mean;

sumxtx=J(m,m,0);

do i = 1 to n;

sumxtx = sumxtx + x[,i]*t(x[,i]);

end;

s = (sumxtx - n*mean*t(mean))/(n-1);

print "covariance matrix " s;

d=I(m);

do j = 1 to m;

d[j,j] = sqrt(s[j,j]);

end;

r = inv(d)*s*inv(d);

print "correlation matrix" r;

/* getting the correlation matrix using the corr module */

rm=corr(xval);

print "correlation matrix from the module" rm;

/* Getting the average pairwise correlation among all pairs of
stocks. This is one way to measure synchrony over time. */

np = m*(m-1)/2; /* = number of distinct pairs */

total=0;

do j = 1 to m-1;

do k = j+1 to m;

total =total+r[j,k];

end;

end;

aver=total/np;

print "average correlation" aver;

/* For each series create a matrix where for stock j,

ith entry is 1 if have an increase from time i to time i +1

is -1 if have a decrease from time i to time i +1

is 0 if no change. */

change=J(n-1,m);

do j = 1 to m;

do i = 1 to n-1;

change[i,j]=xval[i+1,j]-xval[i,j];


if change[i,j] > 0 then change[i,j]=1;

if change[i,j] < 0 then change[i,j]=-1;

if change[i,j] = 0 then change[i,j]=0;

end;

end;

total2=0;

count=0;

agree=J(np,3);

do j =1 to m-1;

do k = j+1 to m;

count=count+1;

sum2=0;

do i = 1 to n-1;

if change[i,j]=change[i,k] then sum2=sum2+1;

/* keeps count of where change is in the same direction */

end;

pagree = sum2/(n-1); /* prop. agreeing for series j and k */

agree[count,1]=j;

agree[count,2]=k;

agree[count,3]=pagree;

total2=total2+pagree;

end;

end;

reset noname spaces=4;

names = {series1 series2 propAgree};

print agree [colname=names];

averp = total2/np;

print "average proportion agreement" averp;

quit;

run;

Illustrating multivariate statistics with stock data 1

OBS ALLIED DUPONT UNIONC EXXON TEXACO

1 0.00000 0.00000 0.00000 0.039473 0.000000

2 0.02703 -0.04486 -0.00303 -0.014466 0.043478

. . .

98 0.045454 0.046375 0.074561 0.014563 0.018779

99 0.050167 0.036380 0.004082 -0.011961 0.009216

100 0.019108 -0.033303 0.008362 0.033898 0.004566

Covariance Matrix DF = 99

ALLIED DUPONT UNIONC EXXON TEXACO

ALLIED 0.0016299269 0.0008166676 0.0008100713 0.0004422405 0.0005139715

DUPONT 0.0008166676 0.0012293759 0.0008276330 0.0003868550 0.0003109431

UNIONC 0.0008100713 0.0008276330 0.0015560763 0.0004872816 0.0004624767

EXXON 0.0004422405 0.0003868550 0.0004872816 0.0008023323 0.0004084734

TEXACO 0.0005139715 0.0003109431 0.0004624767 0.0004084734 0.0007587370

Simple Statistics

Variable N Mean Std Dev Sum Minimum Maximum

ALLIED 100 0.00543 0.04037 0.54337 -0.09665 0.12281


DUPONT 100 0.00483 0.03506 0.48271 -0.07576 0.11881

UNIONC 100 0.00565 0.03945 0.56542 -0.09146 0.10262

EXXON 100 0.00629 0.02833 0.62914 -0.05313 0.08855

TEXACO 100 0.00371 0.02755 0.37085 -0.05051 0.08247

Pearson Correlation Coefficients / Prob > |R| under Ho: Rho=0 / N = 100

ALLIED DUPONT UNIONC EXXON TEXACO

ALLIED 1.00000 0.57692 0.50866 0.38672 0.46218

0.0 0.0001 0.0001 0.0001 0.0001

etc. rest eliminated.

N M

100 5

mean vector MEAN

0.0054337

0.0048271

0.0056542

0.0062914

0.0037085

covariance matrix S

0.0016299 0.0008167 0.0008101 0.0004422 0.000514

0.0008167 0.0012294 0.0008276 0.0003869 0.0003109

0.0008101 0.0008276 0.0015561 0.0004873 0.0004625

0.0004422 0.0003869 0.0004873 0.0008023 0.0004085

0.000514 0.0003109 0.0004625 0.0004085 0.0007587

R correlation matrix

1 0.5769244 0.5086555 0.3867206 0.4621781

0.5769244 1 0.5983841 0.3895191 0.3219534

0.5086555 0.5983841 1 0.4361014 0.4256266

0.3867206 0.3895191 0.4361014 1 0.5235293

0.4621781 0.3219534 0.4256266 0.5235293 1

correlation matrix from the module

RM

1 0.5769244 0.5086555 0.3867206 0.4621781

0.5769244 1 0.5983841 0.3895191 0.3219534

0.5086555 0.5983841 1 0.4361014 0.4256266

0.3867206 0.3895191 0.4361014 1 0.5235293

0.4621781 0.3219534 0.4256266 0.5235293 1

AVER

average correlation 0.4629592

SERIES1 SERIES2 PROPAGREE

1 2 0.6767677

1 3 0.7070707

1 4 0.6666667

1 5 0.6767677

2 3 0.7272727

2 4 0.5858586

2 5 0.6161616


3 4 0.6363636

3 5 0.6262626

4 5 0.5858586

average proportion agreement 0.6505051

One should not automatically carry out the usual inferences on the individual series mean or other parameters for the series, nor should one carry out inferences on the correlation between two series in the usual way. This is because the observations in a series will typically have some type of correlation over time, while the usual inferences assume the observations are uncorrelated (or, more strongly, independent). There are ways to examine this through time series techniques (which we won't do here).

10.9 Creating SAS files from within IML.

Temporary SAS files can be created using the create and append commands.

10.9.1 Creating a SAS file directly from a matrix

The example below uses the stock example and illustrates creating SAS data sets using the change values and ranks over time for each company.

options ls=80 ps=60 nodate;

data values;

infile ’stock.dat’;

input allied dupont unionc exxon texaco;

run;

proc iml;

use values;

read all into xval; /* xval is 100 by 5 */

close values;

x=t(xval);

n = nrow(xval);

m=ncol(xval);

print n m;

/* Create the matrix change */

change=J(n-1,m);

do j = 1 to m;

do i = 1 to n-1;

change[i,j]=xval[i+1,j]-xval[i,j];

if change[i,j] > 0 then change[i,j]=1;

if change[i,j] < 0 then change[i,j]=-1;

if change[i,j] = 0 then change[i,j]=0;

end;

end;

/* create a sas data set called changed, with no colname option the

default is to name the variables as col1, col2, ... */

create changed from change;

append from change;


/* create a matrix called rankm, where column j has the ranks of

the 100 observations for the company in column j and then

create a sas data set called rankm. NOTE that we can use the sas

data name rankm even though it also a matrix name in IML. The variables

are named using the colname feature*/

rankm=j(n,m); /*initialize the matrix*/

do j = 1 to m;

rankm[,j]=ranktie(xval[,j]);

end;

names={’r1’ ’r2’ ’r3’ ’r4’ ’r5’};

create rankm from rankm[colname= names];

append from rankm;

quit;

run;

title ’print the sas data set changed’;

proc print data=changed;

run;

title ’print the sas data set rankm’;

proc print data=rankm;

run;

print the sas data set changed 2

OBS COL1 COL2 COL3 COL4 COL5

1 1 -1 -1 -1 1

2 1 1 1 1 1

3 -1 -1 -1 -1 -1

4 1 -1 -1 -1 -1

99 -1 -1 1 1 -1

print the sas data set rankm 4

OBS R1 R2 R3 R4 R5

1 48.0 53 49.5 89.0 47.0

2 73.0 5 45.0 23.0 93.0

3 100.0 95 98.0 99.0 99.0

99 84.0 83 55.0 29 62.0

100 67.0 11 59.0 84 50.0


10.9.2 Creating a SAS file using variables; with application to simulating random samples

The following illustrates how to create the temporary SAS file “simout” by listing a set of variables in the create statement. Anytime an append is encountered these variables are written to simout. This is a rewrite of the simulation program on page 47 of the 597C notes but now in IML rather than in the data step. Notice that the if-then in checking the confidence interval has to be done differently in IML than in the data step.

option ls=80 nodate;

%macro simn(nsim,n,mu,sigma,alpha);

title "simulate dist. of sample mean and CI, n= &n from N(&mu, &sigma **2)";

proc iml;

create simout var{j true mean var sd low up check};

tval=tinv(1- &alpha/2,&n-1);

true=&mu;

n = &n;

x=J(n,1);

do j =1 to &nsim;

do i = 1 to n;

x[i] = &mu + &sigma*rannor(0);

end;

mean = sum(x)/n;

var = (ssq(x) - (n*(mean**2)))/(n-1);

sd=sqrt(var);

se = sd/sqrt(n);

low=mean - tval*se;

up=mean + tval*se;

check=0;

if (low <= &mu) & (&mu <= up) then check=1;

/* NOTE if (low <= &mu <= up) then check=1; DOES NOT WORK. CHECK IS ALWAYS 1*/

append; /* writes variables in varlist of create to simout*/

end;

quit;

proc means data=simout;

var mean check;

run;

goptions reset=all;

title1 "Distribution of X-bar: Sample from Normal";

title2 "n= &n from N(&mu, &sigma **2)";

proc univariate;

histogram mean/kernel (color=red);

inset n mean (5.2) std (5.2);

run;

%mend;

%simn(1000,10,4,6,.05);

simulate dist. of sample mean and CI, n= 10 from N(4, 6 **2) 82

The MEANS Procedure

Variable N Mean Std Dev Minimum Maximum

MEAN 1000 4.0648820 1.9396007 -2.0542079 11.2478573


CHECK 1000 0.9450000 0.2280943 0 1.0000000

10.10 User defined modules/functions

When a particular set of statements will be used multiple times, you can create user defined modules which can then be called. There will be some arguments to the module and some variables that you will want to pass from the module to the main IML program. The latter are specified through the global statement. Any variables that are used in the module but are not in the global statement will not have their values passed to the main IML program. Here is a simple example which has a module that computes the sample size, mean, variance and standard deviation for a specified vector.

option linesize=80 pagesize=60 nodate;

data a;

infile ’c:\s597\data\nut2.dat’;

input id method lnrsty k1-k3 a1-a3 b1-b2;

changeb=b2-b1;

if method=3 or method=2;

run;

proc iml;

start meansd(xvec) global(n,mean,var,sd);

n=nrow(xvec);

mean=sum(xvec)/n;

var = (ssq(xvec) - (n*(mean**2)))/(n-1);

sd=sqrt(var);

finish meansd;

use a;

read all var{changeb} where(method=2 & changeb ^=.) into met2;

read all var{changeb} where(method=3 & changeb ^=.) into met3;

close a;

print met2 met3;

/* getting statistics for each group using separate vectors*/

call meansd(met2);

print "group1 = method2", n mean var sd;

call meansd(met3);

print "group2 = method3", n mean var sd;

quit;

run;

group1 = method2

N MEAN VAR SD

66 1.5606061 11.32704 3.3655668

group2 = method3

N MEAN VAR SD

57 0.8245614 14.754386 3.8411438


10.11 Simulating simple random sampling.

This first example simulates the performance of an estimate and confidence interval for the mean using a simple random sample of size n from a finite population of size N containing an outlier. Here the population consists of 137 days of a fishing season and the variable we work with is the number of fish caught. This is the 1975 data from Parvin Lake, CO, collected to help decide how to use sampling in future years. The estimate of the mean is the sample mean x̄, which has variance (1 − n/N)σ²/n, where σ² is the population variance (defined here with an N − 1 rather than N, as is often done in the sampling literature). The (1 − n/N) is a finite population correction factor. The estimated standard error of x̄ is SE = [(1 − n/N)s²/n]^(1/2), where s² is the sample variance, and a (supposed) approximate confidence interval is x̄ ± z1−α/2 SE.

/* THIS PROGRAM SIMULATES THE PERFORMANCE OF NORMAL BASED CONFIDENCE

INTERVALS FOR THE MEAN AS WELL AS THE SAMPLING DISTRIBUTION OF THE

SAMPLE MEAN WHEN SAMPLING FROM A FINITE POPULATION. */

title ’SRS from Parvin data y = fish caught’;

option ls=80 nodate;

data a;

infile ’e:\s640\lake.csv’;

input month date day $ trips hours kept caught;

y = caught;

run;

title ’Population Values’;

proc univariate data=a noprint;

hist y/kernel (color=blue);

var y;

run;

%macro popsim(ss);

proc iml;

alpha = .05;

use a;

read all var{y} into y;

close a;

n=&ss;

ys=j(n,1); /* initialize vector to hold sample values */

create simout var{ybar samples2 low up count};

/* the population values are in the N by 1 vector y */

bign = nrow(y);

yubar = sum(y)/bign; /* population mean */

popS2 = (ssq(y) - (bign*(yubar**2)))/(bigN-1); /* Population "variance"*/

truese = sqrt((bign - n)*pops2/(bign*n));

title "Sample size &ss";

print "population size = " bign " sample size =" n;

print "true mean = " yubar "population s2 = " pops2;

print "true SE of the mean =" truese;

do j=1 to 1000; /* looping through simulations */

check=J(n,1,0);

**************************************************************************;

* This takes a sample of size n without replacement;

* There are other ways to do this. The method here is not;

* very efficient since it checks through the whole vector each time;


do i=1 to n; /* i indexes position in sample */

label:uv=uniform(0);

k=int(uv*bign+1);

kk=0;

label2:kk=kk+1;

if k=check[kk] then goto label; /* will go to label if the selected value k

had previously been chosen*/

if kk=i then goto skip;

goto label2;

skip: check[i]=k;

ys[i]=y[k];

end;

**************************************************************************;

/* get sample statistics */

ybar = sum(ys)/n;

samples2 = (ssq(ys) - (n*(ybar**2)))/(n-1);

se = sqrt((bign - n)*samples2/(bign*n)); /* SE using F.P.C.F.*/

z = probit(1 - (alpha/2));

low = ybar -z*se;

up = ybar + z*se;

count = 0;

if (low <= yubar) & (yubar <= up) then count=1;

append;

end;

quit;

run;

title "sample size &ss" ;

proc means;

var ybar samples2 count;

run;

/* THE MEAN OF COUNT GIVES THE SUCCESS RATE

OF THE NORMAL INTERVAL */

title "sample size &ss" ;

proc univariate noprint;

hist ybar/kernel (color=black);

var ybar;

run;

%mend;

%popsim(10); %popsim(15); %popsim(20); %popsim(50); %popsim(100);

Population Values

BIGN N

population size = 137 sample size = 10

YUBAR POPS2

true mean = 55.554745 population s2 = 4597.5429

TRUESE

true SE of the mean = 20.644505

sample size 10

Variable N Mean Std Dev Minimum Maximum


YBAR 1000 54.6671000 19.3267475 21.7000000 141.2000000

SAMPLES2 1000 4024.75 10251.81 110.6222222 47397.83

COUNT 1000 0.8570000 0.3502480 0 1.0000000

sample size 100

TRUESE

true SE of the mean = 3.5237369

Variable N Mean Std Dev Minimum Maximum

YBAR 1000 55.6671600 3.4915400 45.9300000 62.4500000

SAMPLES2 1000 4667.13 1873.68 1149.80 6031.92

COUNT 1000 0.8370000 0.3695505 0 1.0000000

10.12 Least squares/regression via IML.

Consider a setting where there is an outcome variable, Yi, for the ith observation (i = 1 to n) and a set of k predictor variables, xi1, . . . , xik, for observation i. We write xi for the k × 1 column vector containing xi1, . . . , xik. (REMARK: The x's may contain different functions of a common variable, or one of the x's may be a constant. When models have an intercept term in them there is an xij which is always 1.)

The objective is to fit some specified function f(xi, β) that relates Yi to xi. This can be viewed as a general numerical curve-fitting problem without any statistical context. In a statistical context, the function f(xi, β) is taken as a model for E(Yi) (the expected value of Yi given xi), which can also be expressed as Yi = f(xi, β) + εi, where εi is random with mean 0.

Least Squares: Find β which minimizes Q(β) = ∑i (Yi − f(xi, β))².

Linear model (linear in the parameters): f(xi, β) = xi1β1 + · · · + xipβp. Define the n × 1 vector Y = (Y1, Y2, . . . , Yn)′ and the n × p matrix X whose (i, j) element is xij, so that the ith row of X is xi′.

Then Q(β) = (Y − Xβ)′(Y − Xβ). Differentiating and setting equal to zero yields the so-called normal equations X′Xβ = X′Y. If the matrix X has rank p, meaning the columns of X are linearly independent, then X′X is of rank p (i.e., is nonsingular), and the unique solution to the normal equations, one that can be shown to minimize Q(β), is

β̂ = (X′X)⁻¹X′Y

where the hat is used to denote the fact that the solution is viewed as an estimate in a statistical context.

One necessary (but certainly not sufficient) condition for X to be of rank p is that n ≥ p. If you try to compute β̂ and the rank condition is not met, you will get an error message from the inverse function.

Checking the rank. There is no function in IML that automatically checks the rank of a matrix. Without explaining why it works, here is one way to get the rank of the matrix a.

ranka=sum(eigval( a*ginv(a)));

The ith residual is ri = Yi − Ŷi, where Ŷi = f(xi, β̂) is the ith fitted value.


The minimum value of Q which is obtained is SSE = ∑i (Yi − Ŷi)² = ∑i ri², with the sums running over i = 1 to n.

In a statistical context, suppose the errors (the εi) are uncorrelated, each with mean 0 and variance σ². Then the covariance matrix of β̂ is Cov(β̂) = σ²(X′X)⁻¹, which is estimated by covb = σ̂²(X′X)⁻¹ with σ̂² = SSE/(n − p) an unbiased estimate of σ². The square root of the jth diagonal element of covb gives the estimated standard error of β̂j, the jth estimated coefficient.

Example. Using SMSA data (from homework 4). Regress Y = mortal on rain and so2pot.

title ’Regression with two variables- proc reg and IML’;

options pagesize=60 linesize=80;

data a;

infile ’g:/s597/data/smsa.dat’;

input city $ jantemp julytemp relhum rain mortal educ popden pnwhite pwc

pop perhouse income hcpot noxpot so2pot nox;

con = 1;

run;

proc reg;

model mortal = rain so2pot;

run;

/* doing linear least squares in IML */

proc iml;

use a;

read all var{mortal} into y;

read all var {con rain so2pot} into x;

* read all var{rain so2pot} into x2;

close a;

/* with defining the variable con = 1 as above

can create x as follows

jvec = j(nrow(x),1);

x = jvec||x2; */

p=ncol(x);

n=nrow(x);

betahat = inv(t(x)*x)*t(x)*y;

residual=y - x*betahat; /*observed minus fitted values */

sse = t(residual)*residual; /*sum of squared residuals */

mse = sse/(n-p); /* estimate of variance */

covb= mse*inv(t(x)*x); /* estimate of variance covariance

matrix of coefficients */

sevec = J(p,1);

do j = 1 to 3;

sevec[j] = sqrt(covb[j,j]);

end;

print y x residual;

print "coefficients " betahat "standard errors" sevec;

rankx=sum(eigval( x*ginv(x))); /* This gives you the rank of

the matrix X */

print p rankx;

reset noname;

quit;

run;


Regression with two variables- proc reg and IML 9

The REG Procedure

Dependent Variable: mortal

Number of Observations Read 60

Number of Observations Used 60

Sum of Mean

Source DF Squares Square F Value Pr > F

Model 2 96848 48424 20.98 <.0001

Error 57 131550 2307.89804

Corrected Total 59 228398

Root MSE 48.04059 R-Square 0.4240

Dependent Mean 940.34867 Adj R-Sq 0.4038

Coeff Var 5.10881

Parameter Estimates

Parameter Standard

Variable DF Estimate Error t Value Pr > |t|

Intercept 1 811.77692 23.13109 35.09 <.0001

rain 1 2.68223 0.54710 4.90 <.0001

so2pot 1 0.47648 0.09939 4.79 <.0001

PROC IML OUTPUT

Regression with two variables- proc reg and IML 10

Y X RESIDUAL

921.87 1 36 59 -14.57959

997.87 1 35 39 73.632241

962.35 1 44 33 16.831034

1003.5 1 65 42 -2.634157

895.7 1 65 8 -94.23383

911.82 1 62 49 -89.60282

954.44 1 38 39 22.155545

BETAHAT SEVEC

coefficients 811.77692 standard errors 23.131091

2.6822319 0.547099

0.4764801 0.0993883

P RANKX

3 3


10.13 Non-linear least squares and calling subroutines.

IML also has a large number of built-in subroutines. A subroutine is invoked through a call statement, with various arguments provided for the call. A number of these subroutines deal with optimization and least squares. We illustrate with one of the least squares subroutines, called NLPLM, using only the first five arguments for the subroutine. There are other arguments, but these can be omitted since they are at the end. If an optional argument is being skipped earlier in the list of arguments in a subroutine, a place must be allowed for it; that is, there would be successive commas. The general form of the call to this particular subroutine, where there are p parameters to be fit, is:

call nlplm(rc,bres,"FUN",b,opt);

where rc contains return codes, bres is a p × 1 vector containing the result, FUN is a user defined function that defines the least squares problem, b is a p × 1 vector of starting values and opt is a vector containing some options. All of these names are the choice of the user (besides nlplm).

Example: Consider the cigarette smoking and cancer data. Suppose we want to fit a model of the form β1x^β2 for how the incidence of bladder cancer Y relates to cigarette smoking, x. To get starting values we can fit log(Y) = γ1 + γ2 log(x), where if the relationship were deterministic the coefficients would relate via β2 = γ2 and γ1 = log(β1), so β1 = e^γ1. We will do a linear fit on the log-transformed variables to get starting values and then run nonlinear least squares.

option ls=80 nodate;

data a;

infile ’cig.dat’;

input state $ cig blad lung kid leuk;

logc=log(cig);

logb=log(blad);

con=1;

run;

proc iml;

use a;

read all var{logb} into ly;

read all var {con logc} into lx;

read all var{blad} into y;

read all var{cig} into x;

close a;

/* doing linear least squares of log(blad) versus log(cig)

to get starting values */

blog= inv(t(lx)*lx)*t(lx)*ly;

print ’linear least squares solution of log(y) on log(x)’ blog;

p=2;

n=nrow(x);

bres=J(2,1,1);

b=bres;

b[1]=exp(blog[1]);

b[2]=blog[2];

opt=j(2,1,0);

opt[1]=n; /* for nlplm the first component of opt must be the number

of observations. The second component controls the printing.

Setting it equal to 0 as here, leads to no automatic output */


/* define the function going into least squares */

start FUN(b) global(n,x,y);

f=j(n,1,0);

do i = 1 to n;

f[i] = y[i] - b[1]*(x[i]**b[2]);

end;

return(f);

finish FUN;

call nlplm(rc,bres,"FUN",b,opt);

print rc bres b;

quit;

run;

BLOG

linear least squares solution of log(y) on log(x) -0.966781

0.7380888

RC BRES B

6 0.3744682 0.7472444 0.3803052

0.7380888

The return code of 6 indicates successful convergence. The least squares solution is β̂1 = .3745 and β̂2 = .7472. (In general, multiple starting points should often be tried as nonlinear least squares routines can sometimes be sensitive to starting values.)
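For comparison, here is a minimal R sketch of the same power-curve fit using nls(). The data frame name and the use of read.table are assumptions; the file and column layout follow the SAS input statement above.

cigdat <- read.table('cig.dat')
names(cigdat) <- c('state','cig','blad','lung','kid','leuk')
# linear fit on the log scale to get starting values, as above
lfit <- lm(log(blad) ~ log(cig), data=cigdat)
b1start <- exp(coef(lfit)[1])    # beta1 = exp(gamma1)
b2start <- coef(lfit)[2]         # beta2 = gamma2
# nonlinear least squares fit of blad = beta1*cig^beta2
nlsfit <- nls(blad ~ b1*cig^b2, data=cigdat,
   start=list(b1=unname(b1start), b2=unname(b2start)))
summary(nlsfit)

As with NLPLM, it is worth rerunning nls() from a few different starting values as a check on convergence.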

10.14 Bootstrapping.

Bootstrapping refers to the use of resampling techniques in order to get estimates of the bias and standard error of estimates, or to get confidence intervals or tests of hypotheses for parameters. The resampling involves some type of sampling "from the original data", where the resampling is done in a way which mimics the way the original data were generated. The bootstrap has become very popular in recent years; see Efron and Tibshirani, "An Introduction to the Bootstrap" (1993) or Davison and Hinkley, "Bootstrap Methods and Their Application" (1997) for introductions.

Here, we consider a single sample problem with one parameter of interest, where the bootstrap is easiest to define. I am not going to try and describe here why the bootstrap works or even many aspects of how it works, but rather describe computationally how it works and how it can be easily implemented in IML. Before collecting the data the random sample can be viewed as a collection of n random variables W1, . . . , Wn, which are independent and identically distributed (which is the meaning of "random sample"). The observed values are denoted w1, . . . , wn. The Wi could be multivariate.

Suppose there is a single parameter of interest, denoted by θ. For the univariate case, this might be the population mean, variance, standard deviation, 90th percentile, etc., while examples for the bivariate case (2 variables making up each W) include the population correlation ρ or the ratio of the means of the two variables.

We estimate θ with some statistic T, which by definition is a function of the data. Assessing how good T is as an estimator of θ is often done through

• Bias = E(T )− θ

and


• SE = standard error of T = standard deviation of T

while confidence intervals and tests are based on the sampling distribution of T. Sometimes these quantities can be determined exactly (under some assumptions).

The bootstrap provides an estimate of the bias, standard error and, more generally, the sampling distribution of T by simulating sampling from a population, where the "population" that is used is the original data w1, . . . , wn.

The bootstrap algorithm (for a single random sample)

Take B large (usually at least 500)

For b = 1 to B

1. Take a random sample of size n WITH REPLACEMENT from the values w1, . . . , wn. Denote this bootstrap sample by wb1, . . . , wbn. Notice that some of the original values in the sample will typically occur more than once in the bootstrap sample.

2. Compute your estimate in the same way as you did for the original data, now using the bootstrap sample wb1, . . . , wbn. For the bth bootstrap sample denote the estimate by Tb.

There are now B estimates T1, . . . , TB .

The bootstrap estimate of the standard error of T is:

SEboot = [ ∑b (Tb − T̄)² / (B − 1) ]^(1/2),

where T̄ = ∑b Tb/B and the sums run over b = 1 to B.

Technically the bootstrap estimate of standard error is the limit of the above as B → ∞, but it is common to use the term in this way.

If T is a plug-in estimator (as, for example, will be the case for estimators based on sample means, variances, correlations and percentiles, which covers much of what we want to do) then the bootstrap estimate of Bias is T̄ − T.

The Bootstrap estimate of the distribution of T is simply the distribution of the B values T1, . . . , TB, usually represented by a stem-and-leaf plot, histogram or smoothed histogram. This is referred to as the Empirical Bootstrap Distribution.
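As an aside, here is a minimal R sketch of the algorithm above for the case where T is the sample mean; the data vector w is simulated purely for illustration and any numeric vector of observations could be used in its place.

set.seed(1)
w <- rnorm(25, mean=50, sd=10)      # illustrative sample w1,...,wn
B <- 1000                           # number of bootstrap samples
Tboot <- rep(NA, B)                 # will hold T1,...,TB
for (b in 1:B)
  {wb <- sample(w, replace=TRUE)    # sample of size n WITH replacement
   Tboot[b] <- mean(wb)}            # estimate computed on the bootstrap sample
seboot <- sd(Tboot)                 # bootstrap estimate of standard error (divisor B-1)
biasboot <- mean(Tboot) - mean(w)   # bootstrap estimate of bias (T-bar minus original T)
hist(Tboot)                         # empirical bootstrap distribution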

Bootstrap Confidence Intervals

• normal-based bootstrap confidence interval:

T ± z(1− α/2)SEboot,

• The Percentile Method:

Consider α1 and α2 with α1 + α2 = α (usually α1 = α2 = α/2). The percentile interval is [L, U], where

L = 100(α1)th percentile from the bootstrap data T1, . . . TB .


U = 100(1 − α2)th percentile from the bootstrap data T1, . . . , TB.

That is, approximately 100α1% of the bootstrap values are less than or equal to L and 100(1 − α2)% of the bootstrap values are less than or equal to U. Note that if the distribution of the Tb values is approximately normal with a mean near the original T, the bootstrap percentile interval will be in close agreement with the normal-based interval.

Notice also that, with the percentile method, if [L, U] is the percentile interval for θ then the percentile interval for φ = g(θ) is simply [g(L), g(U)]. This is a nice property that does not always hold for other methods.

Nonparametric confidence intervals can be found using the bootstrap in other ways besides the percentile method. In practice an improved bootstrap method called the BCa method (bias corrected accelerated method) is often used.
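Continuing the R sketch above, both intervals can be computed from the Tboot values. Note that quantile() uses R's default percentile definition, which can differ slightly from the definition 5 computation used in the IML examples below.

alpha <- 0.05
T0 <- mean(w)                                       # the original estimate T
normci <- T0 + c(-1,1)*qnorm(1-alpha/2)*seboot      # normal-based bootstrap interval
pctci <- quantile(Tboot, c(alpha/2, 1-alpha/2))     # percentile interval
normci; pctci
# transformation property: a percentile interval for g(theta), e.g. exp(theta),
# is obtained by applying g to the endpoints, here exp(pctci)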

Example. Using the bootstrap to analyze one of the stock series. We will treat this as a random sample of size 100 (the serial correlation is actually weak) and estimate various parameters in the underlying model.

title ’Bootstrap with a single sample’;

options ps=60 ls=80;

/* THIS IS A PROGRAM TO BOOTSTRAP WITH A SINGLE

RANDOM SAMPLE USING THE MEAN, VARIANCE, STANDARD

DEVIATION, COEFFICIENT OF VARIATION, MEDIAN, 90TH PERCENTILE */

filename bb ’boot.out’;

/* READ IN ORIGINAL DATA INTO INTERNAL SAS FILE values */

data values;

infile ’stock.dat’;

input allied dupont unionc exxon texaco;

run;

/* USING JUST THE ALLIED RETURN RATES */

title ’descriptive statistics on original sample’;

proc univariate cibasic cipctldf plots;

var allied;

run;

/* START INTO IML WHERE BOOTSTRAPPING WILL BE DONE */

proc iml;

/* put data into vector x */

use values;

read all var{allied} into x;

close values;

n=nrow(x); /* = sample size */

xb=x; /* initializes xb to be same size as x */

rx=x; /* initializes rx*/

nboot = 1000; /* specify number of bootstrap replicates */

do sim=1 to nboot;

/* get the n samples with replacement. i indicates

the sampling within bootstrap replicate sim. The generated

k is a discrete uniform over 1 to n; the function int

takes the integer part */

do i= 1 to n;


uv=uniform(0);

k=int(uv*n+1);

xb[i]=x[k];

end;

/* xb contains the n values in bootstrap sample */

smean = sum(xb)/n; /* get sample mean */

svar=(ssq(xb) - (n*(smean**2)))/(n-1); /* sample variance */

sd = sqrt(svar); /* sample s.dev. */

cv = sd/smean; /* coefficient of variation */

rx[rank(rx)] = xb; /* rx has ranked values */

/* get median */

p=.5;

j = int(n*p);

g = n*p - j;

if g=0 then median =.5*(rx[j]+rx[j+1]);

else median = rx[j+1];

p=.9;

j = int(n*p);

g = n*p - j;

if g=0 then p90 =.5*(rx[j]+rx[j+1]);

else p90 = rx[j+1];

file bb;

put smean +1 svar +1 sd +1 cv +1 median +1 p90;

end;

quit;

run;

/* Get descriptive statistics on bootstrap values*/

data new;

infile ’boot.out’;

input smean svar sd cv median p90;

run;

proc means;

run;

proc univariate;

var smean svar sd cv median p90;

run;

descriptive statistics on original sample 56

Variable: allied

Location Variability

Mean 0.005434 Std Deviation 0.04037

Median 0.000000 Variance 0.00163

Mode 0.000000 Range 0.21946

Interquartile Range 0.05283

Quantile Estimate

100% Max 0.1228070

90% 0.0590165

50% Median 0.0000000

0% Min -0.0966540

Basic Confidence Limits Assuming Normality


Parameter Estimate 90% Confidence Limits

Mean 0.00543 -0.00127 0.01214

Std Deviation 0.04037 0.03619 0.04576

Variance 0.00163 0.00131 0.00209

90% Confidence Limits -------Order Statistics-------

Quantile Distribution Free LCL Rank UCL Rank Coverage

90% 0.051282 0.068965 85 95 90.25

50% Median -0.003195 0.010274 42 59 91.14

*** ANALYSIS OF BOOTSTRAP SAMPLES. OUTPUT HEAVILY EDITED.

Variable N Mean Std Dev Minimum Maximum

smean 1000 0.0053754 0.0041517 -0.0084860 0.0194430

svar 1000 0.0016002 0.000227336 0.000978800 0.0024138

sd 1000 0.0399021 0.0028418 0.0312857 0.0491309

cv 1000 -31.9993740 1122.93 -33824.04 5756.43

median 1000 0.0049183 0.0285557 -0.0788150 0.0958860

p90 1000 0.0058421 0.0281108 -0.0869120 0.0958860

Variable: smean

Quantile Estimate

95% 0.01212490

90% 0.01069245

75% Q3 0.00817140

50% Median 0.00538515

25% Q1 0.00267920

10% -0.00003850

5% -0.00155400

Variable: sd

Quantile Estimate

95% 0.0445262

5% 0.0351526

Variable: cv

Value Obs Value Obs

-4099.906 767 5756.429 459

Variable: cv Variable: median Variable: p90

Quantile Estimate Quantile Estimate Quantile Estimate

100% Max 5756.42880 95% 0.0524363 95% 0.0519173

95% 33.92074 5% -0.0402220 5% -0.0388860

5% -29.27325

0% Min -33824.04000

Follow up on the bootstrap example:

• The CV for the original sample is 7.495.


• The bootstrap estimates of standard error are given by the standard deviation over the bootstrap samples. For the sample mean the bootstrap estimate of standard error is .0041517 (actually we can estimate the SE of the mean analytically using S/√n, so we don't need the bootstrap. It can be shown that the bootstrap estimate of the SE of the mean, as B increases, converges to ((n − 1)/n)^(1/2)S/√n, which is close to the formula based estimate.) As a second example, for the standard deviation the bootstrap estimate of standard error is .00284. The standard error of the CV (which is 1122.93) is heavily influenced by some outliers.

• The bootstrap estimates of bias are found by taking the mean of the bootstrap samples minus the original estimate. For the mean (which we know theoretically is unbiased) this is .0053754 − .00543, while for the CV it is −31.999 − 7.495 = −39.5, and for the 90th percentile it is .0058 − .05902 = −.05 (which might seem small but is of the same magnitude as the original estimate). The mean is sensitive to outliers so a better sense of bias might be to get the median of the bootstrap samples and compare it to the original values. The medians for the bootstrap samples are

variable median

smean 0.00539

svar 0.001585

sd 0.039817

cv 6.0655

median 0.004832

p90 0.004395

In most cases these are similar to the mean, but for the CV the median of the bootstrap samples is much closer to the original estimate than the mean.

• Confidence intervals:

If one treats the estimate as approximately normal with no bias and uses "estimate ± 1.645 se" for an approximate confidence interval, the results are N-low to N-up below. The percentile intervals are denoted P-low and P-up. Note that the normal based interval for the CV is heavily influenced by the large SE, due to outliers, and should not be used. The percentile method is not sensitive to the outliers. Neither of these intervals should be used if bias is an issue, as looks to be the case for the 90th percentile.

N-low N-up P-low P-up

smean -0.0015 0.0122 -0.0015 0.0121

svar 0.0012 0.0020 0.0012 0.0020

sd 0.0352 0.0446 0.0352 0.0445

cv -1879.2 1815.22 -29.2732 33.9207

median -0.0421 0.0519 -0.0402 0.0524

p90 -0.0404 0.0521 -0.0389 0.0519

Getting the percentile automatically in IML.

You can get the bootstrap percentile intervals by saving the B bootstrap values in a vector, sorting it and then getting the required percentiles. We illustrate with the same example as above to get confidence intervals for the variance. Much of the code is the same, with the following changes:

Right after the line with nboot = ...; add bvalue=J(nboot,1); . Then right after the line with svar = ...; add bvalue[sim]=svar; . After the end statement that follows the put ...; statement, and before the line quit;, add


/* THE VECTOR BVALUE HAS THE BOOTSTRAP VALUES OF THE ESTIMATE OF

INTEREST. RANKB HAS THE RANKED BOOTSTRAP VALUES */

rankb=bvalue;

rankb[rank(rankb)] = bvalue; /* rankb has ranked values */

alpha={.1,.05,.02};

do kk = 1 to 3;

p=alpha[kk]/2;

j = int(nboot*p);

g = nboot*p - j;

if g=0 then low =.5*(rankb[j]+rankb[j+1]);

else low = rankb[j+1];

p=1- (alpha[kk]/2);

j = int(nboot*p);

g = nboot*p - j;

if g=0 then up =.5*(rankb[j]+rankb[j+1]);

else up = rankb[j+1];

clevel = 100*(1 - alpha[kk]);

print "confidence intervals for variance";

print "confidence level" clevel "ci for variance " low " to " up;

end;

confidence intervals for variance

CLEVEL LOW UP

confidence level 90 ci for variance 0.0012383 to 0.0020242

confidence level 95 ci for variance 0.0011852 to 0.0020923

confidence level 98 ci for variance 0.0011257 to 0.0021846

Variable=SVAR Quantiles(Def=5)

100% Max 0.002419 99% 0.002185

75% Q3 0.001773 95% 0.002024

50% Med 0.001615 90% 0.001932

25% Q1 0.001465 10% 0.001311

0% Min 0.000908 5% 0.001238

1%

Rerunning the analysis we get the above results. Notice that the results on the variance over the 1000 bootstraps will not be exactly the same as those from the first run. You can see the match between the 98% CI, which uses the 1% and 99% percentiles from proc univariate, and similarly for the 90% CI, which uses the 5% and 95% percentiles.

11 R: Part 3. Working with matrices

While the way SAS IML operates can be different than how we worked in the data step (which uses the so-called base language), there is no such distinction in R. As noted when we first introduced R, it basically applies functions to objects and those objects can be matrices. So, there is no real fundamental difference in handling matrices in R, compared to what we've done earlier. In fact some of our earlier examples already wrote to vectors using an index. In addition, how we referred to elements (or rows or columns) in a dataframe carries over in working with a matrix.

A matrix is a two-dimensional array with r rows and c columns. If c is equal to 1 then this is a "column" vector, or in R just referred to as a vector.

As seen earlier, if we bind together vectors of numbers using rbind or cbind then the result is a matrix.


We also saw (see p. 61 part II) that we can read data right into a matrix using matrix(scan( ..., and we used the rep and seq commands to create vectors.

Creating matrices via direct assignment:

This can be done with the matrix function; the basic syntax is

matrix(data, nrow, ncol, byrow = F(default) or T)

data: a vector of values (with total elements equal to nrow*ncol), or a single value which will be assigned to all positions.
nrow: the desired number of rows.
ncol: the desired number of columns.
byrow: if FALSE (the default) the matrix is filled by columns, otherwise the matrix is filled by rows.

Here is an R script for creating matrices and showing basic matrix operations. See http://www.statmethods.net/advstats/matrix.html for more details. There are numerous other webpages that can be found with details on matrix operations.

# a is symmetric so the fact that it reads by column

# doesn’t matter

a<-matrix(c(1,5,8,5,3,6,8,6,4),3,3)

a

#showing the need for byrow = T to read a row at a time

b<-matrix(c(5,2,4,1,3,2,-5,6,7),3,3)

b

b<-matrix(c(5,2,4,1,3,2,-5,6,7),3,3,byrow=T)

b

M<-matrix(NA,5,6)

M

M<-matrix(1,5,6)

M

# create an identity matrix

C <-diag(5)

C

# BASIC MATRIX OPERATIONS

sumab<-a + b # summation

sumab

diffab<-a-b # difference

diffab

prodab<- a %*% b # product

prodab

trana <-t(a) # transpose

trana

ainv<-solve(a) # inverse

ainv

y<-eigen(a) # y$val has eigenvalues, y$vec has eigenvectors

y$val

y$vec

deta<-det(a) # determinant

deta


ranka<-rank(a) #ranks the elements in a

ranka

> # a is symmetric so the fact that it reads by column

> # doesn’t matter

> a<-matrix(c(1,5,8,5,3,6,8,6,4),3,3)

> a

[,1] [,2] [,3]

[1,] 1 5 8

[2,] 5 3 6

[3,] 8 6 4

> #showing the need for byrow = T to read a row at a time

> b<-matrix(c(5,2,4,1,3,2,-5,6,7),3,3)

> b

[,1] [,2] [,3]

[1,] 5 1 -5

[2,] 2 3 6

[3,] 4 2 7

> b<-matrix(c(5,2,4,1,3,2,-5,6,7),3,3,byrow=T)

> b

[,1] [,2] [,3]

[1,] 5 2 4

[2,] 1 3 2

[3,] -5 6 7

> M<-matrix(NA,5,6)

> M

[,1] [,2] [,3] [,4] [,5] [,6]

[1,] NA NA NA NA NA NA

[2,] NA NA NA NA NA NA

[3,] NA NA NA NA NA NA

[4,] NA NA NA NA NA NA

[5,] NA NA NA NA NA NA

> M<-matrix(1,5,6)

> M

[,1] [,2] [,3] [,4] [,5] [,6]

[1,] 1 1 1 1 1 1

[2,] 1 1 1 1 1 1

[3,] 1 1 1 1 1 1

[4,] 1 1 1 1 1 1

[5,] 1 1 1 1 1 1

>

> # create an identity matrix

> C <-diag(5)

> C

[,1] [,2] [,3] [,4] [,5]

[1,] 1 0 0 0 0

[2,] 0 1 0 0 0

[3,] 0 0 1 0 0

[4,] 0 0 0 1 0

[5,] 0 0 0 0 1


>

> # BASIC MATRIX OPERATIONS

> sumab<-a + b # summation

> sumab

[,1] [,2] [,3]

[1,] 6 7 12

[2,] 6 6 8

[3,] 3 12 11

> diffab<-a-b # difference

> diffab

[,1] [,2] [,3]

[1,] -4 3 4

[2,] 4 0 4

[3,] 13 0 -3

> prodab<- a %*% b # product

> prodab

[,1] [,2] [,3]

[1,] -30 65 70

[2,] -2 55 68

[3,] 26 58 72

> trana <-t(a) # transpose

> trana

[,1] [,2] [,3]

[1,] 1 5 8

[2,] 5 3 6

[3,] 8 6 4

> ainv<-solve(a) # inverse

> ainv

[,1] [,2] [,3]

[1,] -0.14634146 0.1707317 0.03658537

[2,] 0.17073171 -0.3658537 0.20731707

[3,] 0.03658537 0.2073171 -0.13414634

> y<-eigen(a) # y$val has eigenvalues, y$vec has eigenvectors

> y$val

[1] 15.513954 -1.874493 -5.639461

> y$vec

[,1] [,2] [,3]

[1,] -0.5420786 0.3354056 0.77048941

[2,] -0.5294635 -0.8483266 -0.00321535

[3,] -0.6525482 0.4096890 -0.63744469

> deta<-det(a)

> deta

[1] 164

> ranka<-rank(a) #ranks the elements in a

> ranka

[1] 1.0 4.5 8.5 4.5 2.0 6.5 8.5 6.5 3.0


11.1 Least squares in R

This does linear regression with two predictors, first using glm and then illustrating matrix calculations.

data<-read.table(’g:/s597/data/smsa.dat’)

rain<-data$V5

mortal<-data$V6

so2pot<-data$V16

con <-rep(1,length(mortal))

regmodel<-glm(mortal ~ rain+ so2pot)

summary(regmodel)

# doing least squares explicitly

y<-mortal

x<-cbind(con,rain,so2pot)

dimx<-dim(x)

p=dimx[2]

n=dimx[1]

n; p

xpxinv<-solve(t(x)%*%x) #inverse of X’X

betahat<- xpxinv%*%t(x)%*%y #estimated coefficients

residual <- y - x%*%betahat

sse <- t(residual)%*%residual; #sum of squared residuals

mse <- sse/(n-p) # estimate of variance

covb<- mse[1]*solve(t(x)%*%x) #estimate of variance covariance of betahat

sevec <- rep(0,p);

for (j in 1 :3)

{sevec[j] = sqrt(covb[j,j])}

info<-cbind(y,x,residual)

info

betahat

mse

covb

estimates<-cbind(betahat,sevec)

estimates

> data<-read.table(’g:/s597/data/smsa.dat’)

> rain<-data$V5

> mortal<-data$V6

> so2pot<-data$V16

> con <-rep(1,length(mortal))

> regmodel<-glm(mortal ~ rain+ so2pot)

> summary(regmodel)

Call:

glm(formula = mortal ~ rain + so2pot)

Deviance Residuals:

Min 1Q Median 3Q Max

-111.747 -29.239 -3.163 27.045 156.066

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 811.77692 23.13109 35.095 < 2e-16 ***

rain 2.68223 0.54710 4.903 8.22e-06 ***


so2pot 0.47648 0.09939 4.794 1.21e-05 ***

---

(Dispersion parameter for gaussian family taken to be 2307.898)

> # doing least squares explicitly

> y<-mortal

> x<-cbind(con,rain,so2pot)

> dimx<-dim(x)

> p=dimx[2]

> n=dimx[1]

> n; p

[1] 60

[1] 3

> xpxinv<-solve(t(x)%*%x) #inverse of X’X

> betahat<- xpxinv%*%t(x)%*%y #estimated coefficients

> residual <- y - x%*%betahat

> sse <- t(residual)%*%residual; #sum of squared residuals

> mse <- sse/(n-p) # estimate of variance

> covb<- mse[1]*solve(t(x)%*%x) #estimate of variance covariance

> # matrix of coefficients

> sevec <- rep(0,p);

> for (j in 1 :3)

+ {sevec[j] = sqrt(covb[j,j])}

> info<-cbind(y,x,residual)

> info

y con rain so2pot

[1,] 921.87 1 36 59 -14.579593

[2,] 997.87 1 35 39 73.632241

[3,] 962.35 1 44 33 16.831034

...........

[58,] 895.70 1 65 8 -94.233833

[59,] 911.82 1 62 49 -89.602822

[60,] 954.44 1 38 39 22.155545

> betahat

[,1]

con 811.7769190

rain 2.6822319

so2pot 0.4764801

> mse

[,1]

[1,] 2307.898

> covb

con rain so2pot

con 535.0473586 -11.841137326 -0.782642347

rain -11.8411373 0.299317271 0.006553182

so2pot -0.7826423 0.006553182 0.009878042

> estimates<-cbind(betahat,sevec)

> estimates

sevec

con 811.7769190 23.13109073

rain 2.6822319 0.54709896

so2pot 0.4764801 0.09938834


11.2 Some additional comments on using functions in R

Printing:

As noted elsewhere, when working within a function just listing an object will not result in it being printed to the console (as would be done if working interactively, i.e. not in a function). Within a function you can print to the console using print( ) or the cat command.

Return. return(object) will end the function and print what is in the object to the console. There doesn't have to be a return in a function (see earlier examples). You CANNOT return multiple quantities by using return(obj1,obj2,..). This does not work. Those individual quantities have to be combined into one object. To illustrate, below is the function we used to find power values for a one-sided test for a mean; see page 72 for the output.

powerone<-function(mu0,sigma,n,alpha){

cat("Power example with sample size ", n, " null = ", mu0,

    " sigma = ", sigma, " alpha = ", alpha, "\n")

muval<- seq(25,35,by=.5)

muval

nmu <-length(muval)

power<-rep(NA,nmu)

for(k in 1:nmu)

{mu = muval[k]

df= n-1

nc = sqrt(n)*(mu - mu0)/sigma

tval= qt(1-alpha,df)

power[k] = 1 - pt(tval,df,nc)}

plot(muval, power, type = "l", ylab="power", xlab="Null value",

    main= "power function")

values<-data.frame(muval,power)

return(values)}

powerone(30,2,5,.05)

Below we see what we get if we use a list (with no names) and then a list with names to return muval, power and some other quantities.

values<-list(mu0,sigma,n,alpha,muval,power)

return(values)

Power example with sample size 5 null = 30 sigma = 2 alpha = 0.05

[[1]]

[1] 30

[[2]]

[1] 2

[[3]]

[1] 5

[[4]]

[1] 0.05

[[5]]


[1] 25.0 25.5 26.0 26.5 27.0 27.5 28.0 28.5 29.0 29.5 30.0 30.5 31.0 31.5 32.0

[16] 32.5 33.0 33.5 34.0 34.5 35.0

[[6]]

[1] 1.554168e-11 4.549612e-10 1.016218e-08 1.720899e-07 2.219886e-06

[6] 2.193684e-05 1.671735e-04 9.901027e-04 4.598786e-03 1.692908e-02

[11] 5.000000e-02 1.201795e-01 2.389952e-01 4.008470e-01 5.797374e-01

[16] 7.414620e-01 8.619466e-01 9.364164e-01 9.748306e-01 9.914523e-01

[21] 9.975115e-01

values<-list(null = mu0,sigma = sigma,n = n, alpha = alpha,

muval = mu,power = power)

return(values)

Power example with sample size 5 null = 30 sigma = 2 alpha = 0.05

$null

[1] 30

$sigma

[1] 2

$n

[1] 5

$alpha

[1] 0.05

$muval

[1] 35

$power

[1] 1.554168e-11 4.549612e-10 1.016218e-08 1.720899e-07 2.219886e-06

[6] 2.193684e-05 1.671735e-04 9.901027e-04 4.598786e-03 1.692908e-02

[11] 5.000000e-02 1.201795e-01 2.389952e-01 4.008470e-01 5.797374e-01

[16] 7.414620e-01 8.619466e-01 9.364164e-01 9.748306e-01 9.914523e-01

[21] 9.975115e-01

Using the object outside of the function.

Once you run a function, quantities computed in the function are not available, even those that are part of the object specified in the return() statement. If you want access to quantities after you have run the function, then you can save the result of the function. For example if you used

presults<-powerone(30,2,5,.05)

then presults will have whatever is in the return (values in the example above). Note that you could use the name values again. That is, you could use

values<-powerone(30,2,5,.05)

and this will return what is in values in the function to the object values outside the function. In general, you cannot refer to the things inside values individually. If it is a dataframe then you can use attach, or


if it is a dataframe or a list you can refer to items using presults$name. So, in the first case above, where values was just a dataframe that came from binding muval and power, you could use presults$muval or presults$power to refer to the individual vectors. In the case of using the list with names, you can refer to presults$sigma, etc.
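A brief sketch, using the original version of powerone above (which returns the dataframe values):

presults <- powerone(30,2,5,.05)   # save the returned dataframe
presults$muval                     # the vector of mean values
presults$power                     # the corresponding power values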

12 Bootstrap Add on.

First, note that in SAS you could generate many bootstrap samples using proc surveyselect with the rep option. These can then be fed into subsequent analyses/computations. But, if working in IML there is really no need to do this (or advantage to doing this). It is easier to generate the samples explicitly.

12.1 An enhanced one sample IML program

Below is a more compact and somewhat more comprehensive bootstrap program that does what was done on pages 109-113 of the notes. The explanation of the bootstrap is given on those pages. Here the program is just a little different in that i) rather than write to an external file, it creates an internal SAS file, and ii) it then uses a macro to loop through and run the bootstrap analysis on the various statistics that have been saved over the bootstrap samples.

A few other notes:

1. There is a typo in line 10 on page 110. It should read

rx[rank(xb)] = xb; /* rx has ranked values */

This will have some effect on the analyses of the median and 90th percentile.

2. The analysis below is run on a different data set using a sample of households where the amount expended on food was measured.

3. Using the stock data on page 109 and following was not the best example to start with, from an applied perspective. The bootstrap (and other) analyses there would only apply if you can view the observations from different weeks as a random sample (independent and identically distributed). This may not be too bad there, but in general you shouldn't do this with time series without a check on assumptions.

title ’Bootstrap with a single sample’;

options pagesize=60 linesize=80 nodate;

/* THIS IS A PROGRAM TO BOOTSTRAP WITH A SINGLE

RANDOM SAMPLE ESTIMATING THE MEAN, VARIANCE, STANDARD

DEVIATION, COEFFICIENT OF VARIATION, MEDIAN AND 90TH PERCENTILE.

FOR ILLUSTRATION. IN PRACTICE THE BOOTSTRAP AND OTHER PROCEDURES

CAN HAVE TROUBLE ESTIMATING PERCENTILES WITH SMALL SAMPLE SIZE.

THE DEFAULT HERE IN USING PROC UNIVARIATE AND FORMING BOOTSTRAP

PERCENTILE CIS IS CONFIDENCE LEVEL=.95 (ALPHA = .05)*/

***************************************************************;


* READ IN ORIGINAL DATA INTO INTERNAL SAS FILE values ;

* SET x = VARIABLE OF INTEREST HERE ;

* THIS EXAMPLE IS ANALYZING HOUSEHOLD FOOD EXPENDITURES. ;

***************************************************************;

data values;

infile ’g:/s597/data/food.dat’;

input expend;

x=expend;

run;

title ’descriptive statistics on original sample’;

proc univariate cibasic cipctlnormal cipctldf plot;

var x;

run;

proc iml;

/* put data into vector x */

use values;

read all var{x} into x;

close values;

n=nrow(x); /* = sample size */

xb=x; /* initializes xb to be same size as x */

rx = x; /* initializes vector that will have ranked bootstrap sample */

***************************************************************;

* SPECIFY THE NUMBER OF BOOTSTRAP SAMPLES ;

nboot = 10000;

*******************************************************************;

* SPECIFY WHICH STATISTICS/ESTIMATES WILL BE SAVED TO A SAS FILE ;

create bootout var{smean svar sd cv median p90};

********************************************************************;

do b =1 to nboot; /* b indexes bootstrap number */

/* get the n samples with replacement. i indicates

the sampling within bootstrap replicate b. The generated

k is a discrete uniform over 1 to n; the function int

takes the integer part */

do i= 1 to n;

uv=uniform(0);

k=int(uv*n+1);

xb[i]=x[k];

end;

/* xb contains the n values in bootstrap sample */

/*compute statistics of interest.*/


smean = sum(xb)/n; /* get sample mean */

svar=(ssq(xb) - (n*(smean**2)))/(n-1); /* sample variance */

sd = sqrt(svar); /* sample s.dev. */

cv = sd/smean; /* coefficient of variation */

/*compute median and 90th percentile for the bootstrap sample.

using definition 5 in SAS. */

/*compute median and 90th percentile */

rx[rank(xb)] = xb; /* rx has ranked values */

/* get median */

p=.5;

j = int(n*p);

g = n*p - j;

if g=0 then median =.5*(rx[j]+rx[j+1]);

else median = rx[j+1];

p=.9;

j = int(n*p);

g = n*p - j;

if g=0 then p90 =.5*(rx[j]+rx[j+1]);

else p90 = rx[j+1];

append; /* appends designated variables to sas file bootout "created" earlier*/

end; /* ends the b (bootstrap) loop */

quit;

title ’summary measures over bootstrap samples via proc mean’;

proc means data=bootout n mean std; run;

/* THE MACRO BELOW WILL TAKE THE BOOTSTRAP VALUES OF

variable, OBTAINED FOR EACH OF THE NSIM BOOTSTRAP SAMPLES

AND GET THE BOOTSTRAP MEAN, STANDARD DEVIATION/ERROR, AS WELL

AS THE BOOTSTRAP PERCENTILE INTERVALS */

%macro banalyze(variable);

title "BOOTSTRAP ANALYSIS OF &variable";

proc iml;

use bootout;

read all var{&variable} into bootv;

close bootout;

alpha = .05; /* set confidence level at 1 - alpha*/

nboot = nrow(bootv);

bootmean = sum(bootv)/nboot;

bootvar = (ssq(bootv) - (nboot*(bootmean**2)))/(nboot-1);

if abs(bootvar) < 1e-8 then bootvar=0;

bootse = sqrt(bootvar);

/* now get percentile interval for pi */

rankb=bootv;


rankb[rank(rankb)] = bootv; /* rankb has ranked values */

p=alpha/2;

j = int(nboot*p);

g = nboot*p - j;

if g=0 then lowboot =.5*(rankb[j]+rankb[j+1]);

else lowboot = rankb[j+1];

p=1- (alpha/2);

j = int(nboot*p);

g = nboot*p - j;

if g=0 then upboot =.5*(rankb[j]+rankb[j+1]);

else upboot = rankb[j+1];

p=.50;

j = int(nboot*p);

g = nboot*p - j;

if g=0 then bootmed =.5*(rankb[j]+rankb[j+1]);

else bootmed = rankb[j+1];

min=rankb[1];

max=rankb[nboot];

clevel = 100*(1-alpha);

print "Bootstrap sample size = " nboot "Confidence level = " clevel;

output = J(1,5,1);

cnames = {"B-Mean" "B-Median" "Boot-SE" "Boot-L" "Boot-U"};

reset noname;

output[1,1] = bootmean;

output[1,2] = bootmed;

output[1,3] = bootse;

output[1,4] = lowboot;

output[1,5] = upboot;

print output[colname=cnames];

%mend;

%banalyze(smean);

%banalyze(svar);

%banalyze(sd);

%banalyze(median);

%banalyze(p90);

descriptive statistics on original sample 2230

Variable: x

N 40 Sum Weights 40

Mean 3609.6 Sum Observations 144384

Std Deviation 1509.29978 Variance 2277985.84

Coeff Variation 41.8134913 Std Error Mean 238.641249

Basic Confidence Limits Assuming Normality

Parameter Estimate 90% Confidence Limits

Mean 3610 3208 4012

Std Deviation 1509 1276 1859

Variance 2277986 1627961 3457486


Quantiles (Definition 5)

90% Confidence Limits

Quantile Estimate Assuming Normality

99% 8147.0 6471.642 8048.363

95% 7225.0 5566.512 6817.607

90% 5683.5 5073.978 6171.152

75% Q3 4110.0 4222.654 5117.472

50% Median 3165.0 3207.519 4011.681

90% Confidence Limits -------Order Statistics-------

Quantile Distribution Free LCL Rank UCL Rank Coverage

95% 5587 8147 36 40 82.35

90% 4670 8147 33 40 94.33

75% Q3 3679 5176 26 35 90.23

50% Median 2830 3679 15 26 91.93

Variable: x

Stem Leaf # Boxplot

8 1 1 0

7 6 1 0

7

6 9 1 0

6

5 68 2 |

5 02 2 |

4 7 1 |

4 024 3 +-----+

3 67789 5 | + |

3 122234 6 *-----*

2 556667888889 12 +-----+

2 01244 5 |

1 |

1 2 1 |

----+----+----+----+

Multiply Stem.Leaf by 10**+3

summary measures over bootstrap samples via proc mean

Variable N Mean Std Dev

SMEAN 10000 3608.69 231.9764974

SVAR 10000 2211014.47 650970.30

SD 10000 1470.12 223.0754178

CV 10000 0.4062640 0.0480721

MEDIAN 10000 3172.28 241.9560555

P90 10000 5568.51 810.9068116

BOOTSTRAP ANALYSIS OF smean

NBOOT CLEVEL

Bootstrap sample size = 10000 Confidence level = 95

B-Mean B-Median Boot-SE Boot-L Boot-U

3608.6867 3601.8 231.9765 3174.5875 4088.2625


BOOTSTRAP ANALYSIS OF svar

NBOOT CLEVEL

Bootstrap sample size = 10000 Confidence level = 95

B-Mean B-Median Boot-SE Boot-L Boot-U

2211014.5 2183823.1 650970.3 1013311.7 3537060.4

BOOTSTRAP ANALYSIS OF sd

NBOOT CLEVEL

Bootstrap sample size = 10000 Confidence level = 95

B-Mean B-Median Boot-SE Boot-L Boot-U

1470.1214 1477.7764 223.07542 1006.6338 1880.7074

BOOTSTRAP ANALYSIS OF median

NBOOT CLEVEL

Bootstrap sample size = 10000 Confidence level = 95

B-Mean B-Median Boot-SE Boot-L Boot-U

3172.2757 3165 241.95606 2807 3711

BOOTSTRAP ANALYSIS OF p90

NBOOT CLEVEL

Bootstrap sample size = 10000 Confidence level = 95

B-Mean B-Median Boot-SE Boot-L Boot-U

5568.508 5587 810.90681 4250 7580

12.2 R resources

You can find many webpages that have an introduction to using the bootstrap in R. One useful one is

http://www.ats.ucla.edu/stat/r/library/bootstrap.htm

Note the convenient sample function in R. Be careful in that most of the automatic bootstrapping just resamples with replacement from the observations. This is only appropriate if the original sample is a random sample (independent and identically distributed)! Otherwise the bootstrap resampling must be tailored to the design and programmed explicitly.
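For instance, a compact version of the one-sample bootstrap from Section 10.14 can be written with sample() and replicate(); the statistic (the median) and the illustrative data vector x here are assumptions, not taken from the notes.

set.seed(1)
x <- rexp(40, rate=1/3600)              # illustrative data, e.g. expenditures
Tboot <- replicate(10000, median(sample(x, replace=TRUE)))
sd(Tboot)                               # bootstrap SE of the sample median
quantile(Tboot, c(.025,.975))           # 95% bootstrap percentile interval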