ST597A: 2011, Part II (© J.B.)

ST597A: Statistical Computing, Part II (2011), © John Buonaccorsi, Univ. of Massachusetts
Not to be reused without permission
Contents
1 Basic use of macros.
1.1 Running the same analysis on many data sets.
1.2 Running the same analysis on different variables.
1.3 %let
1.4 A macro within a macro.
2 Arrays
3 Using SAS GRAPH to create high resolution plots.
4 More on programming features.
5 Probability Functions and their application.
5.1 Specific functions getting cumulative probabilities.
5.2 The CDF function.
5.3 The PDF function
5.4 Plotting the normal density.
5.5 Exact inferences for a proportion based on the Binomial.
6 Inverse probability functions and their application
7 Power and sample size applications.
7.1 The noncentral t-distribution and power functions.
7.2 Estimating a single mean: sample size via a CI.
7.3 Sample size and power for estimating a proportion.
7.4 Comparison of two means, independent samples: power and sample size.
8 Generating random quantities.
8.1 Some random generating functions.
8.2 Generating random integers between 1 and M.
8.3 Sampling with proc surveyselect.
8.4 Generating Normal random variables
8.5 Simulating the behavior of inferences with normal samples.
8.6 Simulating sampling from the exponential distribution.
8.7 Simulating the number of arrivals in a fixed period of time.
8.8 Simulating survival rates and time to extinction.
8.9 More on random number generation in SAS
9 R: Part II
9.1 Graphics
9.1.1 Directing the graphics output
9.2 Miscellaneous
9.3 Looping
9.4 Functions
9.5 Probability Functions with examples
9.6 Some more graphics
9.6.1 Using legends
9.6.2 Writing text in margins or in graphs
10 SAS-IML.
10.1 Basic matrix operations
10.2 Naming rows and columns, formats, reset
10.3 Reading from a SAS data set into matrices.
10.4 Example: computing descriptive statistics etc. in a one-sample problem
10.5 Missing values
10.6 Writing to external files in IML.
10.7 Example: statistics for multiple groups.
10.8 Example: multivariate observations.
10.9 Creating SAS files from within IML.
10.9.1 Creating a SAS file directly from a matrix
10.9.2 Creating a SAS file using variables, with application to simulating random samples
10.10 User-defined modules/functions
10.11 Simulating simple random sampling.
10.12 Least squares/regression via IML.
10.13 Non-linear least squares and calling subroutines.
10.14 Bootstrapping.
11 R: Part 3. Working with matrices
11.1 Least squares in R
11.2 Some additional comments on using functions in R
12 Bootstrap add-on.
12.1 An enhanced one-sample IML program
12.2 R resources
We begin with more advanced (but still non-IML) features of SAS.
1 Basic use of macros.
Macros are a convenient way to run programs (perhaps multiple times) while allowing variable names or file names to be input variables. In essence, each macro here defines a function with input variables or file names, and the macro is then executed with specified input values. These examples show just some of the basic features.
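For readers who think in terms of functions, the macro pattern amounts to the following (a hypothetical Python sketch with made-up data, not part of the SAS examples: the function name and the toy data sets are ours):

```python
from statistics import mean, stdev

def loopdat(dname, datasets):
    """Analogue of a SAS macro parameterized by data set name:
    run the same summary on whichever data set is named."""
    total = datasets[dname]  # the 'total' column of that data set
    return {"name": dname, "n": len(total),
            "mean": mean(total), "stdev": stdev(total)}

# the repeated macro calls correspond to repeated function calls:
datasets = {"pl95": [0.0, 1.0, 2.0, 1.0], "peter95": [2.0, 4.0, 0.0]}
results = [loopdat(d, datasets) for d in ("pl95", "peter95")]
```

Each call reruns the identical analysis with a different input, which is exactly what the macro invocations below do.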
1.1 Running the same analysis on many data sets.
options ls=80 nodate;
title ' Macro processing of the four quabbin blocks';
%macro loopdat(dname,type); /* loopdat is the internal macro name */
data &dname;
infile "g:/s597/data/&dname&type";
input block line plot type BA hunt dist herb wp1 wp45 hk1
hk45 bir1 bir45 map1 map45 oak1 oak45 ash1 ash45 oth1 oth45
total1 total45 total;
title "&dname";
proc means n mean stdev;
class block line;
var total;
run;
%mend loopdat;
%loopdat(pl95,.dat);
%loopdat(peter95,.dat);
%loopdat(ns95,.dat);
%loopdat(pres95,.dat);
pl95
Analysis Variable : TOTAL
BLOCK LINE N Obs N Mean Std Dev
------------------------------------------------------------------
1 105 22 22 0.5909091 1.6230216
127 42 42 0.0476190 0.3086067
138 42 42 0.7142857 3.4590688
154 28 28 1.7500000 3.9592835
------------------------------------------------------------------
peter95
BLOCK LINE N Obs N Mean Std Dev
------------------------------------------------------------------
5 1 51 51 2.1372549 7.5631200
2 47 47 1.9148936 4.0153498
3 42 42 1.1666667 1.6217525
4 56 56 3.0714286 7.7432836
5 11 11 9.4545455 16.0833059
------------------------------------------------------------------
etc. for other blocks.
1.2 Running the same analysis on different variables.
option ls=80 ps=60 nodate;
data b;
infile 'brain.dat';
input Gender $ FSIQ VIQ PIQ Weight Height mriCount;
run;
%macro loopvar(vname);
title "&vname";
proc means;
var &vname;
proc corr;
var &vname mricount;
run;
%mend loopvar;
%loopvar(fsiq);
%loopvar(viq);
%loopvar(piq);
HEAVILY EDITED OUTPUT
fsiq
Analysis Variable : FSIQ
N Mean Std Dev Minimum Maximum
----------------------------------------------------------
40 113.4500000 24.0820712 77.0000000 144.0000000
----------------------------------------------------------
Correlation Analysis
Pearson Correlation Coefficients / Prob > |R| under Ho: Rho=0 / N = 40
FSIQ MRICOUNT
FSIQ 1.00000 0.35764
0.0 0.0235
viq
Analysis Variable : VIQ
N Mean Std Dev Minimum Maximum
----------------------------------------------------------
40 112.3500000 23.6161071 71.0000000 150.0000000
----------------------------------------------------------
etc. for rest of viq and then for PIQ
Correlation Analysis
2 ’VAR’ Variables: PIQ MRICOUNT
...
1.3 % let
The %let statement lets you define a macro variable or a list of variables. One advantage is that if the variable or list occurs in a number of places in the program, you can save some programming by defining it once via %let. Also, if you then want to change the value of the variable for another run, you only need to change it in the one place where it is initially defined.
options ps=60 ls=80;
title 'illustrate macro variable list';
%let vlist = vo35 vo4 vo45 vo5 vo55 vo6;
data a;
infile 'speed2';
input id sex &vlist;
run;
proc means n mean min max;
class sex;
var &vlist;
run;
illustrate macro variable list 1
SEX N Obs Variable N Mean Minimum Maximum
---------------------------------------------------------------------------
1 15 VO35 15 15.4333333 12.2000000 19.4000000
VO4 15 18.4533333 15.6000000 21.5000000
VO45 15 22.9133333 19.0000000 26.5000000
VO5 15 29.8533333 25.1000000 35.2000000
VO55 15 38.9866667 32.7000000 48.9000000
VO6 15 45.2133333 39.2000000 51.3000000
2 9 VO35 9 14.6666667 12.6000000 16.5000000
VO4 9 17.8777778 15.4000000 22.2000000
VO45 9 23.3444444 18.7000000 27.1000000
VO5 9 31.7888889 24.3000000 38.1000000
VO55 9 37.1888889 31.9000000 42.9000000
VO6 9 41.1555556 36.2000000 45.0000000
---------------------------------------------------------------------------
1.4 A macro within a macro.
You can have a macro within a macro, as demonstrated by the following, which loops through data sets and then through variables within each data set.
options ls=80 nodate;
title ' A macro in a macro';
%macro loopdat(dname,type);
data &dname;
infile "&dname&type";
input block line plot type BA hunt dist herb wp1 wp45 hk1
hk45 bir1 bir45 map1 map45 oak1 oak45 ash1 ash45 oth1 oth45
total1 total45 total;
title "&dname";
%vloop(total45);
%vloop(total);
run;
%mend loopdat;
%macro vloop(vname);
title "&dname-&vname";
proc means mean std;
class line;
var &vname;
run;
%mend vloop;
%loopdat(pl95,.dat);
%loopdat(peter95,.dat);
Heavily edited output
pl95-total45
LINE N Obs Mean Std Dev
-----------------------------------------------
105 22 0.1363636 0.6396021
...
pl95-total
LINE N Obs Mean Std Dev
-----------------------------------------------
105 22 0.5909091 1.6230216
....
peter95-total45
LINE N Obs Mean Std Dev
-----------------------------------------------
...
peter95-total
LINE N Obs Mean Std Dev
....
2 Arrays
Arrays provide a convenient and efficient way to do various tasks involving data manipulations and calculations. Arrays must be used within a data step.
array name[n] varlist; (where n is an integer and varlist is a list of variables) creates an array of dimension n. name[i] will refer to the ith variable in varlist.

array name[*] varlist; does the same thing, but SAS determines n, the dimension of the array. You can use dim(name) to get the dimension of the array.

• The elements of an array must be all character or all numeric. If the array is a character array, then put a $ in front of the varlist, as in array name[n] $ varlist;

array name[n] _numeric_; will put all of the numeric variables into the array name (you can use * instead of an integer n).

array name[n] _character_; will put all of the character variables into the array name (you can use * instead of an integer n).

If you have variables name1, name2, ..., namen in the data, then you can use array name[n]; by itself and this will put name1 through namen into the array name.

array name[n]; will create an n-dimensional array called name, but if there are no variables called name1, etc., then it creates variables name1 to namen (which will have missing values until something is assigned to them).
Example. This example demonstrates the use of arrays with the speed/vo data, where we want to convert the vo values to 0 (if vo is le 25) or 1 (if vo gt 25). The recoded values are in newvo1 to newvo6. Both programs produce the same result.
options ps=60 ls=80;
data a;
infile 'speed2';
input id sex vo35 vo4 vo45 vo5 vo55 vo6;
array vo[6] vo35 vo4 vo45 vo5 vo55 vo6;
array newvo[6];
do i = 1 to 6;
if vo[i] gt 25 then newvo[i]=1;
if vo[i] le 25 then newvo[i]=0;
end;
drop i;
proc print;
run;
data a;
infile 'speed2';
input id sex vo1-vo6;
array vo[6];
array newvo[6];
do i = 1 to 6;
if vo[i] gt 25 then newvo[i]=1;
if vo[i] le 25 then newvo[i]=0;
end;
drop i;
proc print;
run;
N N N N N N
E E E E E E
V V V W W W W W W
O S O V O V O V V V V V V V
B I E 3 O 4 O 5 O O O O O O O
S D X 5 4 5 5 5 6 1 2 3 4 5 6
1 1 1 15.7 18.4 22.0 34.8 45.3 51.1 0 0 0 1 1 1
2 2 1 14.8 18.0 25.1 29.6 44.8 51.3 0 0 1 1 1 1
3 3 1 16.6 21.0 25.5 35.2 48.9 49.1 0 0 1 1 1 1
4 4 1 14.6 19.2 24.2 30.9 36.2 47.6 0 0 0 1 1 1
....
24 24 2 14.3 16.9 23.9 36.8 37.4 36.2 0 0 0 1 1 1
Example: recoding missing values. This example uses data which was introduced briefly earlier; it is from StatLib.
Data from which conclusions were drawn in the article "Sleep in
Mammals: Ecological and Constitutional Correlates" by Allison, T. and
Cicchetti, D. (1976), _Science_, November 12, vol. 194, pp. 732-734.
Includes brain and body weight, life span, gestation time, time
sleeping, and predation and danger indices for 62 mammals.
Variables below (from left to right) for Mammals Data Set:
species of animal, body weight in kg, brain weight in g,
slow wave ("nondreaming") sleep (hrs/day), paradoxical ("dreaming") sleep (hrs/day)
total sleep (hrs/day) (sum of slow wave and paradoxical sleep),
maximum life span (years), gestation time (days), predation index (1-5)
sleep exposure index (1-5), overall danger index (1-5),
Note: Missing values denoted by -999.0
option ls=80 nodate;
data a;
infile 'sleep.dat';
input species $25. bodywt brainwt slsleep parsleep
totsleep lifespan gestate prindex slindex danindex;
proc print;
var species brainwt slsleep parsleep;
run;
data b;
set a;
array numval[*] _numeric_;
do i = 1 to dim(numval);
if numval[i]=-999.0 then numval[i]=.;
end;
proc print;
var species brainwt slsleep parsleep;
run;
OBS SPECIES BRAINWT SLSLEEP PARSLEEP
1 African elephant 5712.00 -999.0 -999.0
2 African giant pouched rat 6.60 6.3 2.0
3 Arctic Fox 44.50 -999.0 -999.0
....
62 Yellow-bellied marmot 17.0 -999.0 -999.0
OBS SPECIES BRAINWT SLSLEEP PARSLEEP
1 African elephant 5712.00 . .
2 African giant pouched rat 6.60 6.3 2.0
3 Arctic Fox 44.50 . .
...
62 Yellow-bellied marmot 17.0 . .
Example: Illustrating the use of arrays to create many cases out of one.
options ps=60 ls=80;
/* using arrays with the speed data to create 6 lines for
each subject */
data a;
infile 'speed2';
input id sex voval1-voval6;
array speedar[6] _temporary_ (3.5 4 4.5 5 5.5 6);
array voval[6];
do i = 1 to 6;
speed=speedar[i];
vo=voval[i];
output;
end;
keep id speed vo;
proc print;
run;
OBS ID SPEED VO
1 1 3.5 15.7
2 1 4.0 18.4
3 1 4.5 22.0
4 1 5.0 34.8
5 1 5.5 45.3
6 1 6.0 51.1
7 2 3.5 14.8
8 2 4.0 18.0
Note the use of a temporary array and the assignment of values to the array. As a temporary array, it does not create variables speedar1 to speedar6. Note also that we could not use vo for both the array name and the new variable; two different names are needed.
Example. This example is based on data from J. Grand. In the original file there are abundance values for 17 species of birds, contained in the variables acco to toru. We want one line for each species, with a species variable and an abundance value.
data a;
infile 'birdsjg.prn';
input plotid 7-8 type $ MW PPSOT PPSOF SO DL G2 HW PL
acco amsa capu cavo chpe coer covi1 dedi eral icga mnva pier piol pivi pogr1
scmi toru;
drop MW PPSOT PPSOF SO DL G2 HW PL;
proc print;
run;
data b;
set a;
array name[17] $ _temporary_
('acco' 'amsa' 'capu' 'cavo' 'chpe' 'coer' 'covi1' 'dedi' 'eral'
'icga' 'mnva' 'pier' 'piol' 'pivi' 'pogr1' 'scmi' 'toru');
array values[17]
acco amsa capu cavo chpe coer covi1 dedi eral icga mnva pier piol pivi pogr1
scmi toru;
do i = 1 to 17;
species = name[i];
abund=values[i];
output;
end;
drop acco amsa capu cavo chpe coer covi1 dedi eral icga mnva pier piol pivi pogr1
scmi toru;
proc print;
run;
LISTING OF ORIGINAL DATA
Obs plotid type acco amsa capu cavo chpe coer covi1 dedi
1 1 MW 0.00 0.0 0.000 0.0000 0.0000 0.0000 0.0000 0.0000
2 2 PPSOT 0.00 0.0 0.000 0.0000 0.0000 0.0000 0.3333 0.0000
...
Obs eral icga mnva pier piol pivi pogr1 scmi toru
1 0.0 0.1667 0.0000 9.5000 0.0000 0.0000 0.0000 0.0000 0.0000
2 0.0 3.5000 0.3333 12.0000 0.0000 0.1667 0.0000 0.0000 0.1667
...
PARTIAL LISTING OF NEW SAS FILE
Obs plotid type i species abund
1 1 MW 1 acco 0.0000
2 1 MW 2 amsa 0.0000
9 1 MW 9 eral 0.0000
10 1 MW 10 icga 0.1667
18 2 PPSOT 1 acco 0.0000
24 2 PPSOT 7 covi1 0.3333
27 2 PPSOT 10 icga 3.5000
28 2 PPSOT 11 mnva 0.3333
29 2 PPSOT 12 pier 12.0000
3 Using SAS GRAPH to create high resolution plots.
Note: Many SAS procedures, including those discussed below, create high resolution graphics. For some of these, if the output is not directed anywhere, the graph comes up in the graph window and can then be saved. In some cases the ODS graphics command must be used (as in the plots associated with proc corr). More generally, the plots can often be directed via ODS, or with gsasfile together with a device and gaccess specification. The details are not documented here but will be discussed a bit further in class.

SAS GRAPH refers to a collection of procedures that create high resolution plots. Here is an introductory look at SAS GRAPH with a few examples, enough to do the basics. The documentation is contained in the SAS/GRAPH User's Guide and in the online documentation on the web under SAS/GRAPH.

The following statements occur outside of any proc. There are other such statements; these are the most important ones.
• The GOPTIONS statement temporarily sets default values for many graphics attributes and device parameters used by SAS/GRAPH procedures. This includes setting the size of the graph through hsize and vsize. goptions reset=all; will set all of the graphics options to their default values.
• The TITLEn statement controls title information.
• The AXISn statement controls labeling of an axis.
• The SYMBOLn statement controls information associated with a specific plot command.
The information from the axis and symbol statements will be used in a plot command.
Example. Using the defoliation variable in the gypsy moth data. For each subplot, plot the mean as well as mean + 1.96 SE and mean - 1.96 SE.
title 'create defol means';
data dd;
infile 'e:\s597\data\87md.dat';
input plot $ subplot egg def;
run;
proc means maxdec=2 data =dd noprint nway;
class plot;
var egg def;
output out=result2 mean=meanegg meandef stderr=seegg sedef;
run;
data c;
set result2;
up=meanegg+(1.96*seegg);
low = meanegg-(1.96*seegg);
run;
goptions reset=all;
goptions hsize=6 vsize=4;
title1 'Mean egg mass and two standard errors';
title2 'Maryland 1987';
axis2 label=none; /* without this the y-axis label would be "up" since
that is used first in the plot */
axis1 offset=(.2in,.2in);
symbol1 v=u color=black;
symbol2 v=l color=black; /* that is the letter l (el), not the number 1 */
symbol3 v=m color=black;
proc gplot data=c;
plot up*plot=1 low*plot=2 meanegg*plot=3/overlay vaxis=axis2
haxis=axis1;
run;
Figure 1: Gypsy Moth Plot
Example: employment data. Shows use of the i= option to connect points.
option ls=80 nodate;
data a;
infile 'employ.dat';
input Findex unemploy year; /*Unemploy= Unemployment rate.*/
newyear=year+1950; /* year: Year of the 1950 decade (0 = 1950)*/
axis1 label=(f=swiss 'Year');
axis2 label=(f=swiss 'Unemployment');
symbol1 c=black i=join v=star;
proc gplot;
plot unemploy*newyear/haxis=axis1 vaxis=axis2;
run;
Figure 2: Employment Plot
Example: esterase assay data. Shows use of the i=spline option to connect points with smooth curves and the l= option to set line types.
data new;
infile 'pred.out';
input x0 yhat low up yhatw loww upw;
run;
symbol1 i=spline l=1 color=black;
symbol2 i=spline l=2 color=black;
proc gplot data=new;
plot yhat*x0=1 low*x0=1 up*x0=1
yhatw*x0=2 loww*x0=2 upw*x0=2 /overlay;
run;
Figure 3: Esterase Assay plot

Example of a 3d scatter plot using proc g3d. The scatter command creates a scatter plot. The pyramid is the default symbol; other symbols are possible with the shape= option. The noneedle option will eliminate the vertical lines leading to the points.
data values;
infile 'e:\s597\data\house.dat';
input PRICE SQFT AGE FEATS NE CUST COR TAX;
run;
title1 '3d plot for House data';
proc g3d;
scatter sqft*age=price;
run;
Figure 4: House price plot

4 More on programming features.

A quick look at a few programming features. These will get used more later and some can also be used in IML.

DO LOOPS:

Do loops have the general form:
do ;
statements
end;
There is often something after the do which creates an iterative do loop, while the form beginning with just do; usually occurs within an if-then statement.
The simplest iterative do loop is:
do index = start to finish by increment;
- If there is no increment, an increment of 1 is assumed.
- The start and finish can be expressions.
- You can use multiple "start to finish by increment" specifications, separated by commas: do i = 1 to 10 by 1, 11 to 100 by 2;
Another form specifies a list of values explicitly, as in
do i = 4, 23, 56; or do i = 'saturday', 'sunday', 'wednesday';
You can loop through dates in an iterative way via: do i = 'date1'd to 'date2'd by n;
Here date1 and date2 are date constants of the form ddMONyy (or ddMONyyyy), giving the day, a three-letter month abbreviation, and the year, e.g. '04jul11'd. The d trailing the quoted date indicates it is a date.
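The multiple-range and date forms have no SAS example here; for comparison, the same iterations can be sketched in Python (the specific values and dates are made up for illustration):

```python
from itertools import chain
from datetime import date, timedelta

# do i = 1 to 10 by 1, 11 to 100 by 2;  (two ranges in one loop)
vals = list(chain(range(1, 11, 1), range(11, 101, 2)))

# do i = '01jan11'd to '09jan11'd by 2;  (SAS stores dates as day counts,
# so "by 2" steps two days at a time)
d, stop, out = date(2011, 1, 1), date(2011, 1, 9), []
while d <= stop:
    out.append(d)
    d += timedelta(days=2)
```

The first loop visits 1 through 10, then the odd numbers 11 through 99; the second visits every other day in the range.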
/* this creates and plots a quadratic function of x */
option ls=80 nodate;
data values;
a = 2; b=4; c=8;
do x = 1 to 10 by .1;
y= a + (b*x) + (c*(x**2));
output;
end;
keep x y;
proc print;
run;
proc plot hpercent=50 vpercent=50;
plot y*x;
run;
OBS X Y
1 1.0 14.00
91 10.0 842.00
Note: just giving the low resolution proc plot version since it is easier to embed.
Plot of y*x. Legend: A = 1 obs, B = 2 obs, etc.
y |
1000 +
|
|
| AB
| BA
| BB
| BB
| BB
| BB
500 + BB
| ABB
| BBA
| ABB
| ABBA
| BBBA
| BBBB
| ABBBB
| ABBBBBBA
0 + ABBA
---+---------+---------+---------+---------+---------+--
1 3 5 7 9 11
x
You can also use "until" and "while" within the do statement, either by themselves or in addition to an iterative loop.
do ... while(statement); do ... until(statement);
For example, if in the program above we used do x = 1 to 10 by .1 while (y lt 500); the result would only include cases where y is less than 500.
IF-THEN-ELSE. The general form is
if expression then statement; else statement;
The else is optional. The statement after the else can be a do loop rather than a single statement.
/* example where if choice =1 homework counts if choice=2 it doesn’t */
data a;
infile 'grades.dat';
input h1-h4 exam1 exam2 choice;
array wt[6] _temporary_ (.125 .125 .125 .125 .25 .25);
array points[6] h1-h4 exam1 exam2;
score=0;
if choice=2 then
score = (exam1+exam2)/2;
else
do i = 1 to 6;
score =score + wt[i]*points[i];
end;
run;
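As a cross-check of the weighting logic above (with made-up grades, in Python rather than SAS): the four homework weights of .125 and the two exam weights of .25 sum to 1, so homework counts for half the grade when choice is not 2.

```python
def score(h, exams, choice):
    """Mirror of the data step: choice 2 averages the exams only;
    otherwise homework counts, with the weights from the temporary array."""
    wt = [.125, .125, .125, .125, .25, .25]
    if choice == 2:
        return sum(exams) / 2
    points = h + exams  # h1-h4 followed by exam1, exam2
    return sum(w * p for w, p in zip(wt, points))

s1 = score([80, 90, 70, 100], [85, 95], choice=1)  # weighted: 87.5
s2 = score([80, 90, 70, 100], [85, 95], choice=2)  # exam average: 90.0
```

A student with the same mark on every component gets that mark back under either choice, confirming the weights are a proper weighted average.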
goto statements. You can label a statement (the label is followed by :) and the program is directed to that statement with a goto statement.
if i gt 6 then goto check;
...
check: statement;
The program below creates essentially the same output as the initial plot of a quadratic (note that here x starts at 0.1 rather than 1).
/* this creates and plots a quadratic function of x */
option ls=80 nodate;
data values;
a = 2; b=4; c=8;
x=0;
loop:x=x+.1;
y= a + (b*x) + (c*(x**2));
output;
if x < 9.9 then goto loop;
keep x y;
proc print;
run;
proc plot vpercent=50;
plot y*x;
run;
5 Probability Functions and their application.
5.1 Specific functions getting cumulative probabilities.
First there are a set of functions that return cumulative probabilities for specific distributions. So, if X is a random variable, these functions return P(X ≤ x).
• poisson(m,x) returns P (X ≤ x) when X is Poisson with mean m.
• probbeta(x,a,b) returns P (X ≤ x) when X ∼ Beta(a, b)
• probbnml(p,n,x) returns P (X ≤ x) when X ∼ Binomial with n = number of trials and p = probabilityof success.
• probchi(x,df,nc) returns P (X ≤ x) when X ∼ chi-square with df degrees of freedom and non-centralityparameter nc. The nc can be omitted, which leads to the more familiar central chi-square distribution.
• probf(x,n1,n2,nc) returns P (X ≤ x) when X ∼ F with n1 and n2 degrees of freedom respectively andnon-centrality parameter nc. The nc can be omitted, which leads to the more familiar F distribution.
• probgam(x,a) returns P (X ≤ x) when X ∼ Gamma(a).
• probhypr(N,k,m,x) returns P (X ≤ x) when X ∼ Hypergeometric based on a population of size N , kof the N items having a characteristic of interest, and m is the sample size.
• probnegb(p,n,x) returns P (X ≤ x) when X ∼ Negative Binomial with p = probability of success andn = an integer (e.g., number of desired success in sequential sampling).
• probnorm(x) returns P (X ≤ x) when X ∼ N(0, 1) (a standard normal).
• probt(x,df,nc) returns P (X ≤ x) when X ∼ t with df degrees of freedom and non-centrality parameternc. The nc can be omitted, which leads to the more familiar central t distribution.
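These functions are easy to sanity-check outside SAS. Below is a rough Python analogue of two of them, using only the standard library (the function names are mine, not SAS's):

```python
from math import erf, exp, factorial, sqrt

def probnorm(x):
    # P(X <= x) for X ~ N(0, 1), via the error function
    return 0.5 * (1 + erf(x / sqrt(2)))

def poisson_cdf(m, x):
    # P(X <= x) for X ~ Poisson(mean m): sum the point probabilities
    return sum(exp(-m) * m**k / factorial(k) for k in range(int(x) + 1))

probnorm(1.96)       # ~0.975, like probnorm(1.96) in SAS
poisson_cdf(2, 2)    # ~0.6767, like poisson(2, 2) in SAS
```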
5.2 The CDF function.
CDF computes cumulative probabilities for a large number of distributions. The general form is
cdf(’dist’,x,parameters)
where dist is a name from the list of available distributions. This includes more than those used above. For example, cdf('chisquared',x,df,nc) returns P(X ≤ x) when X ∼ chi-square with df degrees of freedom and non-centrality parameter nc (which can be left off); this does exactly the same thing as probchi(x,df,nc). (Note that the documentation refers to x as a quantile, but this makes no sense, as x is a point at which we want a cumulative probability, not a quantile.)
5.3 The PDF function
For a discrete random variable X pdf(’dist’,x,parameters) will return P (X = x). For a continuousrandom variable with density function f(x), it will return f(x).
Example Using Hypergeometric. Based on a consulting problem with housing services. Some sprinklers similar to those in dorms had recently failed in a hotel fire. The recommendation was to sample 1 or 2% of
the sprinklers and, if any were found defective, replace all of them. We want to find the probability of getting different numbers of defectives in the sample for different sample sizes and different defective rates in the population. In particular, the interest is in the probability of getting 0, as this is the case where no action is taken. The example below is for buildings with 698 units in them. The calculations are done using the cumulative probabilities (which is how it had to be done before the pdf function became available) and then directly through pdf.
options ls=80;
/* this is lucey.sas - calculates probability of
getting 0, 1, 2, 3 defective for each building with number of defective
going from 0 to 100 and changing sample sizes.*/
title ’building type a’;
data a;
popsizea = 698;
do ssa=7 to 70 by 7; /*ssa = number of sprinklers sampled*/
ssrate=ssa/popsizea; /* sampling rate*/
do def = 1 to 101 by 5; /* number of defective sprinklers in pop.*/
defrate=def/popsizea;
proba0 = probhypr(popsizea,def,ssa,0);
p1 = probhypr(popsizea,def,ssa,1);
p2 = probhypr(popsizea,def,ssa,2);
p3 = probhypr(popsizea,def,ssa,3);
proba1=p1-proba0;
proba2=p2-p1;
proba3=p3-p2;
pa1=pdf(’hyper’,1,popsizea,def,ssa);
output;
end;
end;
run;
proc print noobs;
var def defrate ssa ssrate proba0 proba1 proba2 proba3 pa1;
run;
def defrate ssa ssrate proba0 proba1 proba2 proba3 pa1
1 0.00143 7 0.010029 0.98997 0.01003 . . 0.01003
6 0.00860 7 0.010029 0.94111 0.05762 0.00126 0.00001 0.05762
11 0.01576 7 0.010029 0.89433 0.10112 0.00445 0.00010 0.10112
36 0.05158 56 0.080229 0.04540 0.15077 0.23868 0.23985 0.15077
41 0.05874 56 0.080229 0.02914 0.11114 0.20273 0.23563 0.11114
96 0.13754 70 0.10029 0.00002 0.00022 0.00136 0.00540 0.00022
101 0.14470 70 0.10029 0.00001 0.00013 0.00082 0.00348 0.00013
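As a cross-check on the first row of this output (def = 1, ssa = 7), the hypergeometric point probability can be computed directly from binomial coefficients. This Python sketch (not part of the SAS notes; the function name is mine) mirrors pdf('hyper',x,N,k,m):

```python
from math import comb

def hyper_pmf(N, k, m, x):
    # P(X = x): x special items in a sample of m drawn without
    # replacement from N items, k of which are special
    return comb(k, x) * comb(N - k, m - x) / comb(N, m)

hyper_pmf(698, 1, 7, 0)   # ~0.98997 (proba0 in the first row)
hyper_pmf(698, 1, 7, 1)   # ~0.01003 (pa1 in the first row)
```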
This program determines the needed sample size so that the detection rate (the probability of getting one or more defectives in the sample) is at least .95, with defective rates going from .01 to .5 by .01 and a population size of 698.
options ls=80;
title ’sample size for buildings with 698 units’;
data a;
popsizea = 698;
do defrate = .01 to .5 by .01;
defa=int(defrate*popsizea);
ssizea=0;
loop1: ssizea = ssizea+1;
proba = probhypr(popsizea,defa,ssizea,0);
detecta=1-proba;
if detecta < .95 then go to loop1;
output;
end;
run;
proc print;
run;
sample size for buildings with 698 units 1
OBS POPSIZEA DEFRATE DEFA SSIZEA PROBA DETECTA
1 698 0.01 6 274 0.049544 0.95046
2 698 0.02 13 143 0.049319 0.95068
3 698 0.03 20 96 0.049611 0.95039
4 698 0.04 27 72 0.049842 0.95016
47 698 0.47 328 5 0.041322 0.95868
48 698 0.48 335 5 0.037539 0.96246
49 698 0.49 342 5 0.034037 0.96596
50 698 0.50 349 5 0.030803 0.96920
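The search loop above translates almost line for line into Python (again only a cross-check; the names are mine). It reproduces the sample sizes in the first and last rows of the output:

```python
from math import comb

def detect_rate(N, k, m):
    # probability a sample of m from N items (k defective)
    # catches at least one defective: 1 - P(0 defectives)
    return 1 - comb(N - k, m) / comb(N, m)

def needed_sample(N, defrate, target=0.95):
    # smallest sample size whose detection rate reaches the target
    k = int(defrate * N)
    m = 0
    while detect_rate(N, k, m) < target:
        m += 1
    return m

needed_sample(698, 0.01)   # 274, matching the first row
needed_sample(698, 0.50)   # 5, matching the last row
```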
Computing Binomial Distributions. There are n independent trials. On each trial the probability of success is p. Then P(X = x) = nCx p^x (1 − p)^(n−x) for x = 0, 1, . . . , n, where nCx is the combinatoric "n choose x".
options ls=80 nodate;
/* computing and plotting the binomial distribution with
a sample of size 10 and probability .1 of success*/
%macro bincalc(n,p);
data a;
do x = 0 to &n;
cum=probbnml(&p,&n,x);
output;
end;
run;
title "binomial with p = &p and n = &n";
data b;
set a;
prob=cum-lag(cum);
if x=0 then prob=cum;
proc print;
run;
filename gsasfile "bin&p&n";
goption device=ps
gaccess=gsasfile
gsfmode=replace noprompt;
symbol i=needle;
proc gplot;
plot prob*x;
run;
%mend bincalc;
%bincalc(10,.1);
%bincalc(100,.5);
binomial with p = .1 and n = 10 1
OBS X CUM PROB
1 0 0.34868 0.34868
2 1 0.73610 0.38742
3 2 0.92981 0.19371
4 3 0.98720 0.05740
5 4 0.99837 0.01116
6 5 0.99985 0.00149
7 6 0.99999 0.00014
8 7 1.00000 0.00001
9 8 1.00000 0.00000
10 9 1.00000 0.00000
11 10 1.00000 0.00000
binomial with p = .5 and n = 100 2
OBS X CUM PROB
... omitted entries with all 0’s
28 27 0.00000 0.000002
29 28 0.00001 0.000004
30 29 0.00002 0.000010
31 30 0.00004 0.000023
58 57 0.93339 0.030069
59 58 0.95569 0.022292
60 59 0.97156 0.015869
101 100 1.00000 0.000000
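The prob = cum − lag(cum) trick above recovers P(X = x) from successive cumulative values. A direct check of the n = 10, p = .1 column in Python (my own sketch, not SAS):

```python
from math import comb

def binom_pmf(n, p, x):
    # P(X = x) for X ~ Binomial(n, p)
    return comb(n, x) * p**x * (1 - p)**(n - x)

binom_pmf(10, 0.1, 0)   # ~0.34868, matching the first PROB entry
binom_pmf(10, 0.1, 1)   # ~0.38742, matching the second
```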
Computing Normal probabilities. X ∼ N(µ, σ²) is shorthand for "X is normally distributed with mean µ and variance σ², or equivalently standard deviation σ." Then P(X ≤ x) = P(Z ≤ (x − µ)/σ), where Z ∼ N(0, 1). Recall also that if X₁, . . . , Xₙ are independent, each ∼ N(µ, σ²), then X̄ ∼ N(µ, σ²/n), where X̄ is the sample mean.
Example: A manufacturing process produces boxes of cereal (with an advertised content of 24 oz.) with mean 23.8 oz and standard deviation .3 oz. Let X be the weight of a random box and X̄ the mean of 10 random boxes (which, say, make up a carton). Find a) P(X ≤ 24), b) P(X > 24), c) P(23.5 ≤ X ≤ 23.9), d) P(X̄ ≤ 24), e) P(X̄ > 24), f) P(23.5 ≤ X̄ ≤ 23.9).
option ls=80 nodate;
data a;
mu = 23.8;
sigma=.3;
n=10;
sdm=sigma/sqrt(n);
ansa = probnorm((24-mu)/sigma);
ansb = 1 - ansa;
ansc = probnorm((23.9-mu)/sigma) - probnorm((23.5-mu)/sigma);
ansd = probnorm((24-mu)/sdm);
anse = 1 - ansd;
ansg = probnorm((23.9-mu)/sdm) - probnorm((23.5-mu)/sdm);
proc print;
run;
OBS MU SIGMA N SDM ANSA ANSB ANSC ANSD ANSE ANSG
1 23.8 0.3 10 0.094868 0.74751 0.25249 0.47190 0.98249 0.017507 0.85330
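The standardizing step is the whole calculation here, so it is worth seeing once more outside SAS. A Python sketch of parts (a) and (d), with probnorm rebuilt from the error function (the names are mine):

```python
from math import erf, sqrt

def probnorm(z):
    # standard normal CDF
    return 0.5 * (1 + erf(z / sqrt(2)))

mu, sigma, n = 23.8, 0.3, 10
sdm = sigma / sqrt(n)                 # s.d. of the mean of n boxes
ansa = probnorm((24 - mu) / sigma)    # part a, ~0.74751
ansd = probnorm((24 - mu) / sdm)      # part d, ~0.98249
```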
These probability functions also allow you to compute P-values for tests based on many well-known distributions. Some papers will report a test statistic but give only a bound on the p-value (for example, simply indicating whether it is greater than .05 or not). Suppose a paper says a test is based on a chi-square test with 8 degrees of freedom, with a value of 24.8701 for the test statistic, but doesn't report the p-value. You can calculate the P-value via p = 1-probchi(24.8701,8); returning a value for p of .001635325. (See page 60 of notes-Part.) Similarly, you can get P-values for tests based on the standard normal, t and F distributions, etc., but be careful as to whether you have a two-sided or one-sided alternative.
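For even degrees of freedom the chi-square upper-tail probability has a closed form, so the p-value quoted above can be verified without SAS. A Python sketch (even df only; this is my own check, not a general replacement for probchi):

```python
from math import exp

def chisq_upper_tail_even(x, df):
    # P(X > x) for chi-square with EVEN df = 2k:
    # exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    assert df % 2 == 0
    term, total = 1.0, 0.0
    for i in range(df // 2):
        total += term
        term *= (x / 2) / (i + 1)
    return exp(-x / 2) * total

chisq_upper_tail_even(24.8701, 8)   # ~0.0016353, the p-value above
```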
5.4 Plotting the normal density.
option ls=80 nodate;
/* plotting the normal density */
data a;
mu = 4;
sigma=1;
low=mu-5*sigma;
up = mu+5*sigma;
inc=(up-low)/100;
do x = low to up by inc;
f=pdf(’normal’,x,mu,sigma);
output;
end;
keep x f;
proc print;
run;
symbol1 i=spline l=1 color=blue;
proc gplot;
plot f*x;
run;
Obs x f
1 -1.0 0.00000
2 -0.9 0.00000
51 4.0 0.39894
52 4.1 0.39695
100 8.9 0.00000
101 9.0 0.00000
Figure 5: Plot of the normal density with mean 4 and variance 1.
5.5 Exact inferences for a proportion based on the Binomial.
For small samples, tests of hypotheses and confidence intervals are based on the binomial, rather than the normal approximation to the binomial. Consider H0 : π = π0 versus HA : π ≠ π0. We would like a test with probability of a type 1 error equal to α.
Setting up the rejection region
If X is the number of successes in the sample of size n, the test will be of the form
Reject H0 if X ≤ R1 or X ≥ R2.
With R1 and R2 as fixed constants, we cannot necessarily get a test of exact size α, since X is discrete with point probabilities at 0, 1, . . . , n. Instead we keep the probability of a type I error ≤ α. With a two-sided test, split the α up into two pieces.
R1 = largest integer such that when π = π0, P (X ≤ R1) ≤ α/2.
R2 = smallest integer such that when π = π0, P (X ≥ R2) ≤ α/2.
The rejection region defined by R1 and R2 obviously depends on π0.
option ls=80 nodate;
/*getting the rejection region for an exact binomial test of level 1 -a */
%macro binrej(n,pi0,a);
data a;
title "sample size &n, null value &pi0, size &a";
r1=-1;
loop1:r1=r1+1;
pler1=probbnml(&pi0,&n,r1);
if pler1 le &a/2 then go to loop1;
r1=r1-1;
pler1=probbnml(&pi0,&n,r1);
r2=&n+1;
loop2:r2=r2-1;
pger2= 1- probbnml(&pi0,&n,r2-1);
if pger2 le &a/2 then go to loop2;
r2=r2+1;
pger2= 1- probbnml(&pi0,&n,r2-1);
probrej=pler1+pger2;
proc print;
run;
%mend;
%binrej(20,.5,.05);
sample size 20, null value .5, size .05
OBS R1 PLER1 R2 PGER2 PROBREJ
1 5 0.020695 15 0.020695 0.041389
So, with n = 20, a null hypothesis of H0 : π = .5 and a desired size of .05, we reject H0 if X ≤ 5 or X ≥ 15. The actual size achieved is .0414.
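The two search loops in binrej carry over directly. Here is a Python version of the same logic (a cross-check; the names are mine), which reproduces R1 = 5, R2 = 15 for n = 20, π0 = .5, α = .05:

```python
from math import comb

def binom_cdf(p, n, x):
    # P(X <= x) for X ~ Binomial(n, p)
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(x + 1))

def binrej(n, pi0, a):
    # largest R1 with P(X <= R1) <= a/2
    r1 = -1
    while binom_cdf(pi0, n, r1 + 1) <= a / 2:
        r1 += 1
    # smallest R2 with P(X >= R2) <= a/2
    r2 = n + 1
    while 1 - binom_cdf(pi0, n, r2 - 2) <= a / 2:
        r2 -= 1
    return r1, r2

binrej(20, 0.5, 0.05)   # (5, 15)
```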
Getting a confidence interval.
From the family of tests above, one way of getting an exact confidence interval (in the sense that the probability that the interval contains the true π is ≥ 1 − α) is to use the set of π0 for which H0 : π = π0 is not rejected.
Getting exact P-value for a test based on the Binomial.
Consider H0 : π = π0. Suppose the number of observed successes is Xobs and X is the random variable equal to the number of successes in the sample. Under H0, X ∼ Binomial(n, π0) with an expected value of nπ0. Xobs is some distance from nπ0, the expected value under the null. Let D = |Xobs − nπ0|. The p-value is
P-value = P(X ≤ nπ0 − D) + P(X ≥ nπ0 + D).
For a one-sided test, the P-value uses just one tail based on Xobs. If the alternative is HA : π > π0, then the P-value = P(X ≥ Xobs), while if the alternative is HA : π < π0, then the P-value = P(X ≤ Xobs).
Example: Consider the sign test for testing whether there is a difference between two "groups" with paired data. For observation i, there are two variables Xi and Yi, and we can define π = P(Yi > Xi). The null hypothesis H0 : π = .5 is one way to define no difference; what it says is that the difference is equally likely to be positive or negative (i.e., has median 0). A "success" occurs if Y > X, a failure if Y < X, and cases with X = Y are discarded.
For example, for the LA heart data, consider testing the null hypothesis of no difference between weight in 1950 and weight in 1962, in the sense that the distribution of the difference has median 0. There were six ties. Of the remaining 194 observations, 103 were "successes" (wt62 bigger than wt50). The expected
number of successes under H0 is 97, so D = 6. The P-value is P = P(X ≤ 91) + P(X ≥ 103) when X ∼ Binomial(194, .5).
diffwt=wt62-wt50;
x=diffwt;
if diffwt lt 0 then x=0;
if diffwt gt 0 then x=1;
if diffwt=0 then x=2;
run;
proc univariate;
var diffwt; run;
proc freq;
tables x; run;
Tests for Location: Mu0=0
Test -Statistic- -----p Value------
Student’s t t -0.60276 Pr > |t| 0.5474
Sign M 6 Pr >= |M| 0.4297
Signed Rank S 283.5 Pr >= |S| 0.7182
Cumulative Cumulative
X Frequency Percent Frequency Percent
-----------------------------------------------
0 91 45.5 91 45.5
1 103 51.5 194 97.0
2 6 3.0 200 100.0
options ls=70 ps=60 nodate;
data a;
pleft = probbnml(.5,194,91);
pright= 1- probbnml(.5,194,102);
pvalue = pleft + pright;
proc print;
run;
OBS PLEFT PRIGHT PVALUE
1 0.21487 0.21487 0.42975
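The same two binomial tail sums can be done in a few lines of Python (purely a cross-check on the SAS output; the function name is mine):

```python
from math import comb

def binom_cdf(p, n, x):
    # P(X <= x) for X ~ Binomial(n, p)
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(x + 1))

pleft = binom_cdf(0.5, 194, 91)        # P(X <= 91)
pright = 1 - binom_cdf(0.5, 194, 102)  # P(X >= 103)
pvalue = pleft + pright                # ~0.42975
```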
6 Inverse probability functions and their application
These functions determine percentiles for various continuous distributions. For a random variable X, the 100pth quantile (or percentile), denoted qp, is the value for which P(X ≤ qp) = p. All the functions below return the 100pth quantile/percentile for the specified distribution.
• betainv(p,a,b): X ∼ Beta(a, b)
• cinv(p,df,nc): X ∼ chi-square with df degrees of freedom and non-centrality parameter nc.
• finv(p,n1,n2,nc): X ∼ F with n1 and n2 degrees of freedom respectively and non-centrality parameternc.
• gaminv(p,a): X ∼ Gamma(a).
• probit(p): X ∼ N(0, 1)
• tinv(p,df,nc): X ∼ t with df degrees of freedom and non-centrality parameter nc.
The nc can be omitted for the cinv,finv and tinv functions.
Normal percentiles
For the standard normal, denote the 100pth percentile by zp. (Be careful with notation: some books write zp for the value with probability p to the right, rather than to the left. Here we use it as the value with probability p to the left, to be consistent with the SAS arguments.)
Getting standard normal percentiles.
options ls=80 nodate;
title ’finding standard normal percentile’;
data a;
array v[7] _temporary_ (.01 .025 .05 .5 .95 .975 .99);
array zval[7];
do i = 1 to 7;
zval[i]=probit(v[i]);
end;
drop i;
proc print;
run;
finding standard normal percentile 1
OBS ZVAL1 ZVAL2 ZVAL3 ZVAL4 ZVAL5 ZVAL6 ZVAL7
1 -2.32635 -1.95996 -1.64485 0 1.64485 1.95996 2.32635
Computing normal percentiles. If X ∼ N(µ, σ²) then the 100pth percentile for X is µ + zpσ, where zp is the 100pth percentile of the standard normal. Notice that zp is negative when p < .5.
Example: Related to a grant proposal, we needed to find cumulative probabilities at 2, 4 and 6 and the 10th, 50th and 90th percentiles for a normal with mean 4 and standard deviation 1. This was modelling individual dietary intakes of beta carotene.
option ls=80;
data a;
array yvec[3] _temporary_ (2 4 6);
array pvec[3] _temporary_ (.1 .5 .9);
array g[3];
array per[3];
sigma2= 1;
mu = 4;
do i = 1 to 3;
g[i] = probnorm((yvec[i]-mu)/sqrt(sigma2));
per[i] = mu + sqrt(sigma2)*probit(pvec[i]);
end;
proc print;
run;
OBS G1 G2 G3 PER1 PER2 PER3 SIGMA2 MU I
1 0.022750 0.5 0.97725 2.71845 4 5.28155 1 4 4
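The inverse normal CDF is not in the Python standard library, but a crude bisection on the CDF is enough to reproduce both probit and the µ + zpσ percentiles above (my own sketch, for checking only):

```python
from math import erf, sqrt

def probit(p, lo=-10.0, hi=10.0):
    # inverse standard normal CDF by bisection on the CDF
    for _ in range(80):
        mid = (lo + hi) / 2
        if 0.5 * (1 + erf(mid / sqrt(2))) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

mu, sigma = 4.0, 1.0
per1 = mu + sigma * probit(0.10)   # ~2.71845, matching PER1 above
```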
Finding percentiles of the t-distribution.
options ls=80 nodate;
title ’finding t percentile’;
%macro tpercent(df);
title "percentiles of t with &df degrees of freedom"; /* note double quotes*/
data a;
array v[7] _temporary_ (.01 .025 .05 .5 .95 .975 .99);
array tval[7];
do i = 1 to 7;
tval[i]=tinv(v[i],&df);
end;
drop i;
proc print;
run;
%mend;
%tpercent(5);
%tpercent(10);
percentiles of t with 5 degrees of freedom 1
OBS TVAL1 TVAL2 TVAL3 TVAL4 TVAL5 TVAL6 TVAL7
1 -3.36493 -2.57058 -2.01505 0 2.01505 2.57058 3.36493
percentiles of t with 10 degrees of freedom 2
OBS TVAL1 TVAL2 TVAL3 TVAL4 TVAL5 TVAL6 TVAL7
1 -2.76377 -2.22814 -1.81246 0 1.81246 2.22814 2.76377
Computing a confidence interval for a variance under normality. Using the data for wt50 from the LA heart data (see 597B notes). The program is in the form of a macro where the arguments are the sample variance (s2), the number of observations (n) and a, where the confidence coefficient is 1 − a.
options ls=80 nodate;
title ’finding confidence interval for a variance and s.d. under normality’;
%macro civar(s2,n,a);
/* s2 is sample variance, n is the sample size, 1- a = conf. coefficient*/
data b;
lval = cinv((&a/2),&n-1);
uval = cinv(1-&a/2,&n-1);
lowvar = (&n-1)*&s2/uval;
upvar = (&n-1)*&s2/lval;
lowsd=sqrt(lowvar);
upsd=sqrt(upvar);
proc print;
run;
%mend;
%civar(709.66771,200,.05);
finding confidence interval for a variance and s.d. under normality 1
OBS LVAL UVAL LOWVAR UPVAR LOWSD UPSD
1 161.826 239.960 588.532 872.689 24.2597 29.5413
Finding a confidence interval for the mean with changing confidence level, illustrating how it gives the p-value for a two-sided test, and plotting intervals versus confidence level. We use the difference in serum cholesterol from 1952 to 1960 for the LA heart data. From pages 35-36 of Part I of the notes, the mean difference is -4.035 (standard error 3.78), the P-value for testing the null hypothesis that the mean is 0 is .2872, and a 90% confidence interval is (-10.28325, 2.21325).
options ls=70 ps=60 nodate;
data values;
infile ’c:\s597\data\ladata’;
input id age md50 sp50 dp50 ht50 wt50 sc50
soec cs md62 sp62 dp62 sc62 wt62 ihdx yrdth;
diffsc=sc62-sc50;
proc means n mean stderr;
output out=result n=n mean=mean stderr=se;
var diffsc;
run;
data b;
set result;
do a = .01 to .99 by .01;
clevel=1-a;
tval=tinv(1-a/2,n-1);
upper = mean + tval*se;
lower = mean - tval*se;
output;
end;
keep clevel tval lower upper;
proc print;
run;
symbol1 i=spline l=1 color=black;
symbol2 i=spline l=2 color=black;
proc gplot data=b;
plot upper*clevel=1 lower*clevel=2/overlay;
run;
Obs clevel tval upper lower
1 0.99 2.60076 5.79844 -13.8684
2 0.98 2.34523 4.83229 -12.9023
3 0.97 2.18576 4.22934 -12.2993
4 0.96 2.06730 3.78142 -11.8514
5 0.95 1.97196 3.42094 -11.4909
28 0.72 1.08327 0.06082 -8.1308
29 0.71 1.06095 -0.02358 -8.0464
Figure 6: Confidence interval for diffsc as a function of confidence level
7 Power and sample size applications.
7.1 The noncentral t-distribution and power functions.
Consider a hypothesis test for a mean where the null hypothesis involves µ0 (H0 could be µ ≤ µ0, µ ≥ µ0, or µ = µ0, with HA being the complement of H0). Assume the data arise from a random sample from a population with mean µ and variance σ². The t-test uses the test statistic T = (X̄ − µ0)/(s/√n). If the true value is actually µ, then the distribution of T is a noncentral t-distribution with n − 1 degrees of freedom and noncentrality parameter nc = (µ − µ0)/(σ/√n). The power function of the test is π(µ) = P(reject H0) when the true mean is µ. For a value of µ in H0, this is the probability of a type 1 error. For a value of µ in HA, this is the power = probability of a correct decision = 1 − P(type 2 error). The power function will depend on µ, σ (note that we need σ to obtain the power function) and the sample size. The rejection region will depend on whether the test is one-sided or two-sided.
Example: Suppose water is deemed unsafe for drinking if the mean nitrate concentration exceeds 30 ppm. The approach is to assume the water is okay (H0 : µ ≤ 30) unless evidence shows otherwise (HA : µ > 30), using a test of size .05 (a 5% chance of declaring the water unsafe when in fact it is okay) based on n water samples. Suppose the standard deviation for an individual observation is 2 ppm. Find the power function for a sample of size 5 and plot it. Determine the sample size needed to get the power to be .8 when µ = 31.
title ’power example’;
options ls=80 ps=60 nodate;
%macro powerg(mu0,sigma,n,alpha);
title "power for one sided > mu0 = &mu0, sigma=&sigma,n=&n,alpha=&alpha";
data b;
do mu = 25 to 35 by .5;
df=&n-1;
nc = sqrt(&n)*(mu - &mu0)/&sigma;
tval= tinv(1-&alpha,df); /*this gives the cutoff point for rejection region*/
power = 1 - probt(tval,df,nc);
output;
end;
drop sigma df;
run;
proc print;
run;
proc plot vpercent=25;
plot power*mu;
run;
%mend;
%powerg(30,2,5,.05);
power for one sided > mu0 = 30, sigma=2,n=5,alpha=.05 1
OBS MU NC TVAL POWER
1 25.0 -5.59017 2.13185 0.00000
2 25.5 -5.03115 2.13185 0.00000
10 29.5 -0.55902 2.13185 0.01693
11 30.0 0.00000 2.13185 0.05000
12 30.5 0.55902 2.13185 0.12018
13 31.0 1.11803 2.13185 0.23900
21 35.0 5.59017 2.13185 0.99751
Plot of POWER*MU. Legend: A = 1 obs, B = 2 obs, etc.
POWER |
1.0 + A A A A
| A
| A
0.5 + A
| A
| A A
0.0 + A A A A A A A A A A A
-+--------+--------+--------+--------+--------+--------+--------+--------+
24.0 25.5 27.0 28.5 30.0 31.5 33.0 34.5 36.0
MU
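The noncentral t CDF is not available in the Python standard library, but if σ were treated as known the test becomes a z-test and the power has a simple closed form. That rough version (my own sketch, with z.95 hardcoded, not the macro's exact calculation) lands near the t-based values above:

```python
from math import erf, sqrt

def phi(z):
    # standard normal CDF
    return 0.5 * (1 + erf(z / sqrt(2)))

def approx_power(mu0, sigma, n, z_alpha, mu):
    # power of the one-sided z-test (sigma known): a rough stand-in
    # for the exact noncentral-t power computed by the SAS macro
    nc = sqrt(n) * (mu - mu0) / sigma
    return 1 - phi(z_alpha - nc)

p = approx_power(30, 2, 5, 1.6449, 31)   # ~0.30; exact t answer is 0.239
```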
This next program determines the sample size needed to get the probability of correctly rejecting at µ = 31 to be .8.
options ls=80 ps=60 nodate;
%macro ssg(mu0,sigma,mu,alpha,target);
/* mu = null value, mu = where power is to be controlled,
sigma = s.d., alpha = size of test, target = desired power */
title "mu0= &mu0, mu = &mu sigma=&sigma,alpha=&alpha target=&target";
data b;
n=0;
loop: n= n+1;
df=n-1;
nc = sqrt(n)*(&mu - &mu0)/&sigma;
tval= tinv(1-&alpha,df);
/*this gives the cutoff point for rejection region*/
power = 1 - probt(tval,df,nc);
output;
if power lt &target then go to loop;
run;
proc print;
run;
%mend;
%ssg(30,2,31,.05,.8);
mu0= 30, mu = 31 sigma=2,alpha=.05 target=.8 1
OBS N DF NC TVAL POWER
1 1 0 0.50000 . .
2 2 1 0.70711 6.31375 0.10582
16 16 15 2.00000 1.75305 0.60403
17 17 16 2.06155 1.74588 0.62869
27 27 26 2.59808 1.70562 0.81183
The needed sample size is 27. Note that the power function is increasing in µ, so the power (the probability of correctly rejecting H0) for any µ bigger than 31 will be at least the .81183 obtained at µ = 31.
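A standard normal-theory shortcut for the same sample-size question is n = ((z1−α + zpower)σ/δ)², where δ = µ − µ0; it runs a bit below the exact t-based answer of 27 (my own sketch, z values hardcoded):

```python
from math import ceil

def approx_n(mu0, sigma, mu, z_alpha=1.6449, z_power=0.8416):
    # normal-theory sample size for one-sided power at mu;
    # understates the exact t-based answer slightly
    delta = abs(mu - mu0)
    return ceil(((z_alpha + z_power) * sigma / delta) ** 2)

approx_n(30, 2, 31)   # 25, versus 27 from the exact search above
```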
This program takes the same setting but varies the desired power and gets the needed sample size.
options ls=80 ps=60 nodate;
%macro ssg(mu0,sigma,mu,alpha);
/* mu = null value, mu = where power is to be controlled,
sigma = s.d., alpha = size of test, target = desired power */
title "mu0= &mu0, mu = &mu sigma=&sigma,alpha=&alpha";
data b;
do target = .5 to .98 by .02; /*target is desired power*/
n=0;
loop: n= n+1;
df=n-1;
nc = sqrt(n)*(&mu - &mu0)/&sigma;
tval= tinv(1-&alpha,df);
/*this gives the cutoff point for rejection region*/
power = 1 - probt(tval,df,nc);
if power lt target then go to loop;
output;
end;
keep target n power;
run;
proc print;
run;
proc plot vpercent=25;
plot n*target;
run;
%mend;
%ssg(30,2,31,.05);
mu0= 30, mu = 31 sigma=2,alpha=.05 1
OBS TARGET N POWER
1 0.50 13 0.52201
16 0.80 27 0.81183
18 0.84 30 0.84825
19 0.86 32 0.86885
25 0.98 57 0.98142
Plot of N*TARGET. Legend: A = 1 obs, B = 2 obs, etc.
N
56 + A
49 + A
42 + A A
35 + A A A
28 + A A A A
21 + A A A A A A A
14 + A A A A A A A
7 +
--+--------+--------+--------+--------+--------+--------+--------+--------+--
0.50 0.56 0.62 0.68 0.74 0.80 0.86 0.92 0.98
TARGET
In the Analyst application, for specified hypotheses with a given α and σ, you can get the power for a specified range of n and get a plot of the power versus n. Or you can specify a particular alternative at which to control power and a range of power values, get the needed sample size, and plot the needed sample size versus the power. These will be demonstrated in class.
7.2 Estimating a single mean. Sample size via a CI.
The sample size required so that X̄ is within d units of µ with probability 1 − α; that is,
P(|X̄ − µ| ≤ d) = 1 − α,
is n = σ² z_{1−α/2}² / d².
This is an exact result if the population is normal; otherwise it is approximate. This can also be viewed as approximately giving you a confidence interval of length 2d. The actual half width of a confidence interval based on the t distribution is t_{1−α/2,n−1} S/√n, which is random since S is random; the expected length is approximately (since the expected value of S is not exactly σ) t_{1−α/2,n−1} σ/√n. If you set this equal to d and treat t_{1−α/2,n−1} as z_{1−α/2}, which is reasonable for even moderate n, you get n as given above.
Note, as with determining sample size with power this requires knowledge of σ.
Example: For the water sample situation with σ = 2, find n so the probability that the sample mean iswithin d units of the true mean is 1− α.
options ls=80 ps=60 nodate;
title "confidence intervals for varying d and 1 - alpha";
data b;
sigma=2;
do alpha = .05 to .2 by .05;
do d = .2 to 2 by .2;
confid=1-alpha;
zval = probit(1 - alpha/2);
n = (zval*sigma)**2/(d**2);
output;
end;
end;
keep d confid n;
proc print noobs;
run;
D CONFID N
0.2 0.95 384.146
0.4 0.95 96.036
2.0 0.95 3.841
0.2 0.90 270.554
2.0 0.90 2.706
0.2 0.80 164.237
2.0 0.80 1.642
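The first row of this output (d = .2, confidence .95, n = 384.146) can be checked directly from the formula n = (z_{1−α/2} σ / d)². A Python sketch, with probit rebuilt by bisection since it is not in the standard library (names are mine):

```python
from math import erf, sqrt

def probit(p, lo=-10.0, hi=10.0):
    # inverse standard normal CDF by bisection
    for _ in range(80):
        mid = (lo + hi) / 2
        if 0.5 * (1 + erf(mid / sqrt(2))) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

sigma, alpha, d = 2.0, 0.05, 0.2
n = (probit(1 - alpha / 2) * sigma / d) ** 2   # ~384.146
```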
7.3 Sample size and power for estimating a proportion.
There are many examples where the object is to estimate the proportion of individuals with a particular characteristic (e.g., having a certain disease or other health condition, holding a particular opinion, etc.). Assuming the population is big enough that the Binomial rather than the Hypergeometric can be used, and that n is big enough that the normal approximation to the Binomial will work, then wanting P(|p − π| ≤ d) = 1 − α requires
n = π(1 − π) z_{1−α/2}² / d².
Note that d is expressed as a proportion and is an absolute difference. So d = .03 means being within 3 percent (absolute) of the true π (or having the half width of the confidence interval approximately .03). Note also that in this case the sample size needed depends on the quantity π that we are trying to estimate. Over 0 < π < 1, the function π(1 − π) increases up to π = .5, at which point it achieves a maximum, and then decreases. In the absence of any knowledge, the safe thing to do is use π = .5; if you are comfortable saying that you know beforehand that π is in some interval [a, b], then use the value in [a, b] that is closest to .5 (which of course would be .5 itself if .5 is in [a, b]).
options ls=80 ps=60 nodate;
%macro ssprop(alpha);
title "sample size for estimation a proportion with alpha = &alpha";
data b;
do d = .02 to .10 by .02;
do pi = .02 to .98 by .04;
zval = probit(1 - &alpha/2);
n = (pi*(1-pi)*(zval**2))/(d**2);
output;
end; end;
keep d pi n;
proc print; run;
%mend; %ssprop(.05);
sample size for estimation a proportion with alpha = .05 1
OBS D PI N
1 0.02 0.02 188.23
2 0.02 0.06 541.65
12 0.02 0.46 2385.55
13 0.02 0.50 2400.91
14 0.02 0.54 2385.55
24 0.02 0.94 541.65
25 0.02 0.98 188.23
rest omitted.
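A one-line check of the worst case in this table (π = .5, d = .02) from the formula n = π(1 − π)z²/d², with z hardcoded at 1.959964 (my own sketch):

```python
def n_for_proportion(pi, d, z=1.959964):
    # n = pi*(1-pi)*z^2 / d^2, the sample size for half-width d
    return pi * (1 - pi) * z**2 / d**2

n_for_proportion(0.5, 0.02)   # ~2400.9, the maximum in the table
```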
REMARKS:
1. If you want to have P(|p − π| ≤ dπ) = 1 − α then n = (1 − π) z_{1−α/2}² / (d²π).
2. You can also figure an exact sample size using either the Hypergeometric or the Binomial.
Power and sample size for a test of a single proportion.
Let π = the true proportion of interest. We have a sample of size n with X = number of successes and p = X/n = proportion of successes in the sample.
H0 : π = π0 , HA : π 6= π0 : Reject H0 if X ≤ R1 or X ≥ R2.
H0 : π ≥ π0 , HA : π < π0 : Reject H0 if X ≤ R1.
H0 : π ≤ π0 , HA : π > π0 : Reject H0 if X ≥ R2.
In each case the test is set up to have size α. The cutoffs can be determined exactly using the Hypergeometric or Binomial (see page 24 for an example) or approximately if the normal approximation to the Binomial is used. In the latter case the test is based on
Z = (p − π0)/[π0(1 − π0)/n]^{1/2}.
For example, suppose we are considering a one-sided test with H0 : π ≥ π0 and HA : π < π0. Then using the normal approximation we reject H0 if Z < −z_{1−α}, or equivalently
X < n(π0 − [π0(1 − π0)/n]^{1/2} z_{1−α}) = R1.
The power function, which we now write as f(π), is
f(π) = P(reject H0 | π is the true value).
So the exact power of the given test is found by computing the probability about X based on X ∼ Binomial(n, π). The approximate power is found using the standard normal. For the latter, if the true proportion is π, then Z is approximately normal with mean (π − π0)/[π0(1 − π0)/n]^{1/2} and variance [π(1 − π)]/[π0(1 − π0)]. So in the one-sided example above the power is P(Z < −z_{1−α}), where Z is normal
with mean and variance as above. One convenient use of this formula is that it isolates the sample size n explicitly.
Example: Statistical Evidence of Discrimination. Based on an example in Devore and Peck, “Exploration and Analysis of Data”, drawing from an article in J.A.S.A. 1982. 25% of individuals eligible for grand jury duty in a particular area are African American. Based on n selections, from which p = proportion of A.A.s in the n selections, we test H0 : π ≥ .25 versus HA : π < .25, where π is interpreted as the long-term probability that a person selected for the grand jury is A.A. (There is certainly a question regarding the assumption that these are n independent observations that needs to be addressed, but we'll use that assumption for illustration here.) The program below uses a rejection region based on the normal approximation and then gets the exact and approximate power of this test.
options ls=80 ps=60 nodate;
%macro powerp(pi0,n1,n2,inc,alpha);
title "power for one-sided test of a proportion, HA <";
data b;
do n = &n1 to &n2 by &inc;
div=sqrt(&pi0*(1-&pi0)/n);
do pi = .05 to .4 by .05;
meanz= (pi-&pi0)/div;
varz= (pi*(1-pi))/(&pi0*(1-&pi0));
zval= -probit(1-&alpha); /*this gives the cutoff point for rejection region*/
stand=(zval-meanz)/sqrt(varz);
power = probnorm(stand); /* approximate power */
/* get exact power: the Binomial probability of landing in the rejection region */
q =n*(&pi0 + (sqrt(&pi0*(1-&pi0)/n)*zval));
c = int(q);
if c-q = 0 then c =c-1;
powerex=probbnml(pi,n,c);
output;
end;
end;
/*keep n pi power powerex;*/
run;
proc print;
run;
proc sort;
by n;
run;
proc plot vpercent=30;
plot power*pi powerex*pi/overlay;
by n;
run;
%mend;
%powerp(.25,10,100,10,.05);
LIMITED OUTPUT
power for one-sided test of a proportion, HA < 1
OBS N PI POWER POWEREX
1 10 0.05 0.35715 0.59874
2 10 0.10 0.21389 0.34868
3 10 0.15 0.13370 0.19687
4 10 0.20 0.08298 0.10737
5 10 0.25 0.05000 0.05631
6 10 0.30 0.02876 0.02825
7 10 0.35 0.01553 0.01346
8 10 0.40 0.00772 0.00605
73 100 0.05 1.00000 1.00000
74 100 0.10 0.99568 0.98999
75 100 0.15 0.78984 0.76328
76 100 0.20 0.29785 0.27119
77 100 0.25 0.05000 0.03763
78 100 0.30 0.00408 0.00216
79 100 0.35 0.00017 0.00005
80 100 0.40 0.00000 0.00000
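The same power calculations can be reproduced outside SAS. The sketch below is an illustrative Python translation of the macro's logic (standard library only; function names are ours): the cutoff c comes from the normal approximation exactly as in the DATA step, the exact power uses the Binomial CDF in place of PROBBNML, and the approximate power uses the standard normal.

```python
from math import comb, sqrt
from statistics import NormalDist

def binom_cdf(c, n, p):
    """P(X <= c) for X ~ Binomial(n, p) (plays the role of SAS PROBBNML)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(c + 1))

def power_onesided(pi, pi0, n, alpha):
    """Exact and approximate power of the test of H0: pi >= pi0 that rejects
    when Z < -z_{1-alpha}, i.e. when X < n*(pi0 - z_{1-alpha}*sqrt(pi0*(1-pi0)/n))."""
    z = NormalDist().inv_cdf(1 - alpha)
    q = n * (pi0 - z * sqrt(pi0 * (1 - pi0) / n))  # cutoff R1 on the count scale
    c = int(q)
    if c == q:          # rejection is X < q (strict), so step down at integers
        c -= 1
    exact = binom_cdf(c, n, pi)                    # P(X <= c) under Binomial(n, pi)
    meanz = (pi - pi0) / sqrt(pi0 * (1 - pi0) / n)
    sdz = sqrt(pi * (1 - pi) / (pi0 * (1 - pi0)))
    approx = NormalDist().cdf((-z - meanz) / sdz)  # normal-approximation power
    return exact, approx
```

For n = 10, pi = .05 this reproduces (to output rounding) the 0.59874 exact and 0.35715 approximate values in the first row above.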
7.4 Comparison of two means; independent samples. Power and sample size.
For comparing two samples, a similar approach to power and sample size can be used. Suppose we have independent samples from two groups with population mean µj, population variance σj², sample mean X̄j, sample variance Sj² and sample size nj for group j. If the null hypothesis involves µ1 − µ2 = d0, the t-statistic is

T = (X̄1 − X̄2 − d0)/(Sp[1/n1 + 1/n2]^{1/2}),

where Sp is the pooled sample standard deviation,
under the assumption of equal variances σ1² = σ2² = σ². The rejection region will depend on the alternative. In this case the power function and sample size calculations are based on the fact that if the true difference is µ1 − µ2 = ∆0, then T is distributed as a non-central t-distribution with df = n1 + n2 − 2 degrees of freedom and noncentrality parameter

nc = (∆0 − d0)/(σ[1/n1 + 1/n2]^{1/2}).
If the variances are unequal, then the test statistic used is

Tu = (X̄1 − X̄2 − d0)/[S1²/n1 + S2²/n2]^{1/2}.
This does not have exactly a t-distribution, but for n1 and n2 “large” (often taken to mean bigger than 30) we can use a normal approximation. If the true difference is ∆0, then Tu is approximately normal with mean

(∆0 − d0)/[σ1²/n1 + σ2²/n2]^{1/2}

and variance 1.
Using this distribution will, in general, provide a reasonable approximation to the power as long as the sample sizes are not very small. With large sample sizes it will also give essentially the right answer even if we assume σ1² = σ2² = σ² and exact normality of the populations, in which case the exact answer is based on the non-central t.
For sample size determination, one usually assumes n1 = n2 and determines the common sample size. You could also specify that n1 = cn2 for some specified constant c.
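The normal-approximation power described above is straightforward to compute directly. Here is an illustrative Python sketch (standard library only; the function name is ours). It treats Tu as normal with the mean given above and variance 1, so it slightly overstates the power of the t-based test at small n because it ignores the degrees-of-freedom penalty.

```python
from math import sqrt
from statistics import NormalDist

def approx_power_two_sided(delta0, d0, sig1, sig2, n1, n2, alpha=0.05):
    """Approximate power of the two-sided level-alpha test of H0: mu1 - mu2 = d0,
    treating Tu as normal with mean (delta0-d0)/sqrt(sig1^2/n1 + sig2^2/n2), var 1."""
    nd = NormalDist()
    z = nd.inv_cdf(1 - alpha / 2)
    m = (delta0 - d0) / sqrt(sig1**2 / n1 + sig2**2 / n2)
    return nd.cdf(-z - m) + 1 - nd.cdf(z - m)
```

At the null (delta0 = d0) this returns exactly alpha, and the power increases as the true difference moves away from d0.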
Example: Comparing blood levels for people living near freeways to those not. (Based on an example in Devore, “Probability and Statistics for Engineering and the Sciences”, 3rd edition, p. 332, from a paper in the Archives of Environmental Health. The means in their samples were about 10 for the no-freeway group and 17 for the near-freeway group.) We will approach this from the perspective of testing H0 : µ1 = µ2 (µ1 − µ2 = 0) versus HA : µ1 ≠ µ2. (If we were trying to “prove” that the mean for those living near freeways (µ1) is higher than for those not, then we would use the alternative HA : µ1 − µ2 > 0.) Assume that we will take an equal number n1 = n2 for each group. Based on previous data we will use a common standard deviation of σ = 6 and examine the power of the test as a function of the true difference ∆. Note that the power depends just on the difference in the means and not on their actual values.
title ’power two-sample two-sided, equal variance’;
options ls=80 ps=60 nodate;
%macro powerg(d0,sigma,n1,alpha);
title "power for two sample, two sided d0 = &d0, sigma=&sigma,n1=&n1,alpha=&alpha";
data b;
do delta = -10 to 10 by .5;
df=2*&n1-2;
nc = (delta - &d0)/(&sigma*sqrt(2/&n1));
tval= tinv(1-(&alpha/2),df); /*this gives the cutoff point for rejection region*/
power = probt(-tval,df,nc) + (1 - probt(tval,df,nc));
output;
end;
drop df; /* sigma is a macro parameter, not a data set variable */
run;
proc print;
run;
proc plot vpercent=25;
plot power*delta;
run;
%mend;
%powerg(0,6,10,.05);
power for two sample, two sided d0 = 0, sigma=6,n1=10,alpha=.05 1
OBS DELTA NC TVAL POWER
1 -10.0 -3.72678 2.10092 0.94081
2 -9.5 -3.54044 2.10092 0.91714
19 -1.0 -0.37268 2.10092 0.06441
20 -0.5 -0.18634 2.10092 0.05358
21 0.0 0.00000 2.10092 0.05000
22 0.5 0.18634 2.10092 0.05358
23 1.0 0.37268 2.10092 0.06441
40 9.5 3.54044 2.10092 0.91714
41 10.0 3.72678 2.10092 0.94081
Plot of POWER*DELTA. Legend: A = 1 obs, B = 2 obs, etc.
POWER |
1.0 + AA AA
| AAAA AAAA
| AA AA
0.5 + AAA AAA
| AA AA
| AAAA AAAA
0.0 + AAAAAAA
---+---------+---------+---------+---------+--
-10 -5 0 5 10
DELTA
The next program determines the sample size needed to achieve a certain power (which varies in the program as the value target) at a specified distance delta from the null.
title ’power two-sample two-sided, equal variance’;
options ls=80 ps=60 nodate;
%macro sst(d0,sigma,alpha,delta);
title "sample size: two sided d0 = &d0, sigma=&sigma,alpha=&alpha, delta=&delta" ;
data b;
do target = .8 to .98 by .02;
n1=1;
loop: n1=n1+1;
df=2*n1-2;
nc = (&delta - &d0)/(&sigma*sqrt(2/n1));
tval= tinv(1-(&alpha/2),df); /*this gives the cutoff point for rejection region*/
power = probt(-tval,df,nc) + (1 - probt(tval,df,nc));
if (power lt target) then goto loop;
output;
end;
drop df nc tval;
run;
proc print;
run;
proc plot vpercent=25;
plot n1*target;
run;
%mend;
%sst(0,6,.05,10);
sample size: two sided d0 = 0, sigma=6,alpha=.05, delta=10 1
OBS TARGET N1 POWER
1 0.80 7 0.81627
2 0.82 8 0.87236
3 0.84 8 0.87236
4 0.86 8 0.87236
5 0.88 9 0.91255
6 0.90 9 0.91255
7 0.92 10 0.94081
8 0.94 10 0.94081
9 0.96 11 0.96037
10 0.98 13 0.98273
USING THE ANALYST. As in the one-sample problem, the Analyst requires that you fix a particular alternative and it will then find the power over a range of sample sizes. It will also determine the sample size needed to achieve a certain power, which varies, at a particular alternative (the same as our program above). Strangely, the Analyst requires that you enter values for each of the means. The power only depends on the difference in the means, not on the actual means, so you will get the same answer if you use µ1 = 10 and µ2 = 20 as you would with µ1 = 1500 and µ2 = 1510 (or any other values with a difference of 10).
Two-Sample t-Test Group 1 Mean = 100 Group 2 Mean = 110
Standard Deviation = 6 Alpha = 0.05 2-Sided Test
N per Group Power
2 0.170
4 0.507
6 0.739
8 0.872
10 0.940
12 0.973
14 0.988
16 >.99
18 >.99
Note the agreement at n = 10 with our answer.
Using the analyst to determine sample size.
Group 1 Mean = 10 Group 2 Mean = 20
Standard Deviation = 6 Alpha = 0.05 2-Sided Test
N per
Power Group
0.800 7
0.820 8
0.840 8
0.860 8
0.880 9
0.900 9
0.920 10
0.940 11
0.960 12
0.980 13
Again, note the agreement with our answer, although there is a disagreement (10 rather than 11) at power = .94.
Power and sample size for comparing two proportions. Note that, as with testing a single proportion, this is not done in the Analyst. We will not treat this problem in detail here. One thing to note is that, unlike comparing two means, the power, standard error and sample size can depend on the two true proportions and not just their difference. This is because of the manner in which the proportions enter the variances.
8 Generating random quantities.
8.1 Some random generating functions.
The following SAS functions generate random variables.
RANBIN(seed,n,p): Binomial(n,p)
RANCAU(seed): Cauchy
RANEXP(seed): Exponential
RANGAM(seed,a): Gamma with scale parameter a.
RANNOR(seed): Standard normal
RANPOI(seed,m): Poisson with mean m
RANTBL(seed,p1, p2, . . . , pk): Random draw from a distribution on the integers 1, . . . , k, with P(X = j) = pj.
RANTRI(seed,h): Triangular distribution
RANUNI(seed): Uniform over the interval [0,1]
From the SAS documentation:
“Seed Values: Random-number functions and CALL routines generate streams of random numbers from an initial starting point, called a seed, that either the user or the computer clock supplies. A seed must be a nonnegative integer with a value less than 2³¹ − 1 (or 2,147,483,647). If you use a positive seed, you can always replicate the stream of random numbers by using the same DATA step. If you use zero as the seed, the computer clock initializes the stream, and the stream of random numbers is not replicable.”
When you go to the documentation for the individual functions, it says you can also use negative integers, in which case the computer clock is used to initialize the stream. You might as well use 0 for the seed in that case.
8.2 Generating random integers between 1 and M.
This can be done directly in the following way. Suppose U is distributed uniform over the interval [0, 1]. Then W = U ∗ M is distributed uniform over [0, M] (there is probability 0 of getting exactly 0 or M). If W falls in the interval [j, j + 1) then we set the random integer, call it K, to j + 1. This can be done by defining K = int(U ∗ M + 1). Notice that the possible values of K are 1, . . . , M and each possible value has probability 1/M.
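The same construction carries over directly to Python, where random.random() plays the role of RANUNI (an illustrative sketch; the function name is ours):

```python
import random

def rand_int(M, rng=random):
    """K = int(U*M) + 1 with U ~ Uniform(0,1): uniform on the integers 1..M."""
    return int(rng.random() * M) + 1

random.seed(1)
draws = [rand_int(9344) for _ in range(5)]  # analogue of k = int((u*9344)+1)
```

Since random.random() returns values in [0, 1), K never exceeds M, and each of 1, . . . , M has probability 1/M.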
Example. Sampling transects for the Quabbin regeneration study. For each of the five blocks in the Quabbin, they wanted to add some new transects to a regeneration study. The transects would run east–west with some specified width (corresponding to the diameter of a circle with area = 1/millionth of an acre). This meant that within each block there is some number of potential transects. For example, in the case of the Pelham block there were 9344 possible transects to sample from and they wanted to sample 5 new transects. The first program below generates 50 random integers for each block (I'm only showing you the first two blocks) using uniform random variables as just described. To take a simple random sample of n transects they would just use the first n distinct values in the list. The random numbers below do not correspond to the actual transects used since I didn't keep the original numbers. This is a new run which will yield different random numbers.
title 'generate random numbers for regeneration study';
title 'pelham';
data a;
do i=1 to 50;
u=ranuni(0);
k=int((u*9344)+1);
output;
end;
proc print;
var u k;
run;
title 'newsalem';
data b;
do i=1 to 50;
u=ranuni(0);
k=int((u*3769)+1);
output;
end;
proc print;
var u k;
run;

pelham
OBS U K
1 0.37306 3486
2 0.88181 8240
3 0.96527 9020
4 0.78550 7340
5 0.69525 6497
(rest omitted)
48 0.29311 2739
49 0.67373 6296
50 0.44657 4173

newsalem
OBS U K
1 0.39562 1492
2 0.93758 3534
3 0.95385 3596
4 0.64641 2437
(rest omitted)
50 0.60487 2280
8.3 Sampling with proc surveyselect.
SAS has a surveyselect procedure that can be used to select a simple random sample without replacement. It requires a data set to sample from. Usually we are trying to create a random sample of n units out of M for which we will collect data, in which case we need to create an artificial data set with values from 1 to M. There may also be some cases where there is a database with a listing of individual units and we want to sample from it. To get a SRS without replacement of n integers from 1, . . . , M using surveyselect do the following:
%macro srs(m,ss); /* m is population size, ss is sample size */
title "Generating a SRS: population size &m, sample size &ss";
data a;
do j = 1 to &m;
output;
end;
run;
proc surveyselect data=a method=srs n=&ss out=sample;
run;
proc print data=sample;
run;
%mend;
%srs(9344,10);
Generating a SRS: population size 9344, sample size 10

The SURVEYSELECT Procedure
Selection Method        Simple Random Sampling
Input Data Set          A
Random Number Seed      59344
Sample Size             10
Selection Probability   0.00107
Sampling Weight         934.4
Output Data Set         SAMPLE

Obs    j
  1    1622
  2    3041
  3    3909
  4    5195
  5    5449
  6    6667
  7    7034
  8    7426
  9    8821
 10    9126
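For comparison, an SRS without replacement of n integers from 1, . . . , M takes one line in Python with random.sample (an illustrative sketch, not part of the SAS code):

```python
import random

random.seed(3)
M, n = 9344, 10
# SRS without replacement: n distinct integers from 1..M, sorted for display
sample = sorted(random.sample(range(1, M + 1), n))
```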
8.4 Generating Normal random variables
If Z has a standard normal distribution then µ + σZ is normal with mean µ and variance σ² (standard deviation σ).
option ls=80 nodate;
%macro simn(n,mu,sigma);
title "simulating a sample of size &n from N(&mu, &sigma **2)";
data a;
do i = 1 to &n;
z = rannor(0);
x = &mu + &sigma*z;
output;
end;
keep x;
run;
proc univariate;
histogram x/kernel (color=red);
inset n mean (5.2) std (5.2);
run;
%mend;
%simn(1000,4,6);
From univariate output
Mean 4.14397679 Variance 34.5488964 Std Deviation 5.87783093
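The µ + σZ transformation is easy to sanity-check outside SAS as well. A Python sketch (illustrative; random.gauss plays the role of RANNOR):

```python
import random
from statistics import fmean, stdev

random.seed(4)
mu, sigma, n = 4, 6, 100_000
xs = [mu + sigma * random.gauss(0, 1) for _ in range(n)]  # x = mu + sigma*z
m, s = fmean(xs), stdev(xs)  # should land near 4 and 6, as in the SAS run
```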
8.5 Simulating the behavior of inferences with normal samples.
Simulating the behavior of the sample mean X̄ and the confidence interval for the population mean, X̄ ± t1−α/2,n−1 S/n^{1/2}, for a sample of size n from a N(µ, σ²). We know that the distribution of X̄ is normal with expected value/mean µ and standard deviation σ/n^{1/2}, and that the probability the confidence interval is correct is 1 − α.
option ls=80 nodate;
Figure 7: Data from 1000 simulations
%macro simn(nsim,n,mu,sigma,alpha);
title "simulate dist. of sample mean and CI, n= &n from N(&mu, &sigma **2)";
data a;
tval=tinv(1- &alpha/2,&n-1);
do j = 1 to &nsim;
sumx=0;
sumx2=0;
true=&mu;
do i = 1 to &n;
z = rannor(0);
x = &mu + &sigma*z;
sumx=sumx+x;
sumx2=sumx2+(x**2);
end;
meanx=sumx/&n;
varx = (sumx2 - (&n *(meanx**2)))/(&n-1);
sdx=sqrt(varx);
low=meanx - tval*(sdx/sqrt(&n));
up=meanx + tval*(sdx/sqrt(&n));
check=0;
if low le &mu le up then check=1;
output;
end;
keep j true meanx sdx low up check;
run;
proc print;
run;
proc means;
var meanx check;
run;
goptions reset=all;
title1 "Distribution of X-bar: Sample from Normal";
title2 "n= &n from N(&mu, &sigma **2)";
proc univariate;
histogram meanx/kernel (color=red);
inset n mean (5.2) std (5.2);
run;
goptions reset=all;
title ’Confidence Interval Performance’;
symbol1 v=L c=blue;
symbol2 v=U c=purple;
symbol3 l=1 i=spline c=red;
proc gplot;
plot low*j=1 up*j=2 true*j/overlay;
run;
%mend;
%simn(500,10,4,6,.05);
OBS J TRUE MEANX SDX LOW UP CHECK
1 1 4 4.99161 4.26667 1.93942 8.0438 1
2 2 4 3.71135 2.54504 1.89074 5.5320 1
12 12 4 1.35386 3.53194 -1.17274 3.8805 0
13 13 4 2.82975 5.37792 -1.01738 6.6769 1
500 500 4 8.45663 2.90202 6.38065 10.5326 0
Variable N Mean Std Dev Minimum Maximum
---------------------------------------------------------------------
MEANX 500 3.8976617 1.9392731 -2.6808968 9.7251622
CHECK 500 0.9460000 0.2262441 0 1.0000000
---------------------------------------------------------------------
The theoretical mean and standard deviation of the distribution of X̄ are 4 and 1.897, and the confidence level is .95.
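The coverage claim can be cross-checked with a small Python simulation (an illustrative sketch; since the standard library has no t quantile function, the cutoff t.975,9 = 2.262 is taken from tables, i.e. what SAS's tinv(.975, 9) returns):

```python
import random
from math import sqrt
from statistics import fmean, stdev

random.seed(5)
mu, sigma, n, nsim = 4, 6, 10, 4000
tval = 2.262157                      # t_{0.975, 9}, tabulated value
cover = 0
for _ in range(nsim):
    xs = [mu + sigma * random.gauss(0, 1) for _ in range(n)]
    m, half = fmean(xs), tval * stdev(xs) / sqrt(n)
    cover += (m - half <= mu <= m + half)
coverage = cover / nsim              # should be near 0.95
```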
8.6 Simulating sampling from the exponential distribution.
A random variable X follows an exponential distribution with mean µ > 0 if it has density function f(x) = (1/µ)e^{−x/µ} for x ≥ 0 and 0 for x < 0. This is sometimes written as f(x) = λe^{−λx}, where λ = 1/µ. The exponential is often used to model lifetimes or waiting times between events (e.g., arrivals at a bank window for service). If X has this distribution then the variance of X is µ² and the standard deviation is µ.
If you were buying a package of n items (e.g., batteries or light bulbs) produced by a process which has an exponential distribution for the lifetime, then the total lifetime of the n items is T = ∑i Xi = nX̄.
Theoretically we know that X̄ has expected value µ and standard deviation µ/n^{1/2} (since the population standard deviation is µ). Equivalently, T has expected value nµ and standard deviation n^{1/2}µ. Actually, more is known, namely that ∑i Xi follows a Gamma distribution. It is also known that S² has expected value µ² (which is the population variance).
This example simulates the behavior of the sample mean, the sample variance and a confidence interval for the mean, X̄ ± t1−α/2,n−1 S/n^{1/2}, with a random sample of size n from an exponential distribution. Recall that for small n this interval is motivated by assuming a sample from the normal, which doesn't apply here. For large n the interval is supported by the Central Limit Theorem. So, the performance of the confidence interval should improve as n increases.
The RANEXP function generates a random variable, call it Y , which is exponential with mean 1. It can beshown that X = µY will be distributed exponential with mean µ.
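This scaling fact has a direct Python analogue, where random.expovariate(1.0) plays the role of RANEXP (an illustrative check, not part of the SAS program):

```python
import random
from statistics import fmean, stdev

random.seed(6)
mu, n = 4, 200_000
xs = [mu * random.expovariate(1.0) for _ in range(n)]  # exponential with mean mu
m, s = fmean(xs), stdev(xs)  # both should be near mu = 4
```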
option ls=80 nodate;
%macro sime(nsim,n,mu,alpha);
title "simulate sample mean and CI: sample from exp(mean=&mu), n= &n";
data a;
tval=tinv(1- &alpha/2,&n-1);
do j = 1 to &nsim;
sumx=0;
sumx2=0;
true=&mu;
do i = 1 to &n;
x = &mu*ranexp(0);
sumx=sumx+x;
sumx2=sumx2+(x**2);
end;
meanx=sumx/&n;
varx = (sumx2 - (&n *(meanx**2)))/(&n-1);
sdx=sqrt(varx);
low=meanx - tval*(sdx/sqrt(&n));
up=meanx + tval*(sdx/sqrt(&n));
check=0;
if low le &mu le up then check=1;
output;
end;
keep j true meanx sdx low up check;
run;
proc means;
var meanx check;
run;
goptions reset=all;
title1 "Distribution of X-bar: Sample from Exponential";
title2 "n= &n with mean &mu";
proc univariate plots;
var meanx;
histogram meanx/kernel (color=red);
inset n mean (5.2) std (5.2);
run;
goptions reset=all;
title "Confidence Interval Performance. Exponential mean = &mu";
symbol1 v=L c=blue;
symbol2 v=U c=purple;
symbol3 l=1 i=spline c=red;
proc gplot;
plot low*j=1 up*j=2 true*j/overlay;
run;
%mend;
%sime(1000,5,4,.05);
%sime(1000,10,4,.05);
%sime(1000,30,4,.05);
%sime(1000,50,4,.05);
%sime(1000,100,4,.05);
%sime(1000,500,4,.05);
simulate sample mean and CI: sample from exp(mean=4), n= 5 1
Variable N Mean Std Dev Minimum Maximum
--------------------------------------------------------------------------------
meanx 1000 3.9260 1.6824 0.5167 11.7516
sdx 1000 3.4373 2.0270 0.2101 17.4600
varx 1000 15.9196 22.1207 0.0441 304.8515
min 1000 0.7688 0.7364 0.0004 5.3666
max 1000 9.0259 4.8317 0.8201 42.2134
check 1000 0.8820 0.3228 0.0000 1.0000
--------------------------------------------------------------------------------
simulate sample mean and CI: sample from exp(mean=4), n= 30 3
Variable N Mean Std Dev Minimum Maximum
--------------------------------------------------------------------------------
meanx 1000 4.0156 0.7401 2.0240 6.4855
sdx 1000 3.8908 0.9894 1.7441 9.0797
varx 1000 16.1162 8.6491 3.0418 82.4407
min 1000 0.1399 0.1408 0.0001 1.1879
max 1000 16.0941 5.3004 6.2835 43.1996
check 1000 0.9220 0.2683 0.0000 1.0000
--------------------------------------------------------------------------------
simulate sample mean and CI: sample from exp(mean=4), n= 50 4
Variable N Mean Std Dev Minimum Maximum
--------------------------------------------------------------------------------
meanx 1000 3.9934 0.5754 2.3901 6.1291
sdx 1000 3.9551 0.7887 2.1090 8.5870
varx 1000 16.2642 6.8073 4.4479 73.7368
min 1000 0.0790 0.0777 0.0000 0.6411
max 1000 18.1060 5.3386 8.4551 60.4502
check 1000 0.9320 0.2519 0.0000 1.0000
--------------------------------------------------------------------------------
simulate sample mean and CI: sample from exp(mean=4), n= 100 5
Variable N Mean Std Dev Minimum Maximum
--------------------------------------------------------------------------------
meanx 1000 3.9869 0.4056 2.8004 5.3098
sdx 1000 3.9400 0.5509 2.7147 6.3016
varx 1000 15.8265 4.5172 7.3695 39.7105
min 1000 0.0403 0.0402 0.0000 0.2837
max 1000 20.4791 5.1413 10.4529 48.5757
check 1000 0.9330 0.2501 0.0000 1.0000
--------------------------------------------------------------------------------
simulate sample mean and CI: sample from exp(mean=4), n= 500 6
Variable N Mean Std Dev Minimum Maximum
--------------------------------------------------------------------------------
meanx 1000 3.9979 0.1741 3.4631 4.5791
sdx 1000 3.9859 0.2514 3.3646 5.2186
varx 1000 15.9504 2.0351 11.3207 27.2340
min 1000 0.0079 0.0079 0.0000 0.0578
max 1000 27.2129 5.1157 17.8174 55.5868
check 1000 0.9550 0.2074 0.0000 1.0000
--------------------------------------------------------------------------------
Mean of x with n=5
Histogram # Boxplot
11.75+** 4 0
.* 1 0
.** 4 0
.* 1 0
.
.** 6 0
.**** 10 0
.** 6 0
.** 4 |
.********* 26 |
.*********** 32 |
.*********** 33 |
.******************* 57 |
.********************* 63 |
.********************************* 99 +-----+
.******************************* 93 | + |
.****************************************** 126 *-----*
.******************************************* 127 | |
.************************************ 107 +-----+
.****************************** 89 |
.*********************** 67 |
.********** 29 |
.***** 15 |
0.25+* 1 |
----+----+----+----+----+----+----+----+---
* may represent up to 3 counts
meanx for n=500
Histogram # Boxplot
4.625+* 1 0
.
.* 3 0
.* 3 0
.** 4 |
.**** 11 |
.****** 18 |
.********** 29 |
.************ 34 |
.************************ 71 |
.****************************** 90 +-----+
.****************************** 89 | |
.************************************** 114 | |
.*********************************** 103 *--+--*
.************************************** 113 | |
.*********************************** 105 +-----+
.************************** 77 |
.*************** 43 |
.**************** 46 |
.********* 27 |
.*** 9 |
.** 6 |
.* 3 |
3.475+* 1 0
----+----+----+----+----+----+----+---
* may represent up to 3 counts
More on sampling distributions will be demonstrated in class.
8.7 Simulating the number of arrivals in a fixed period of time.
Suppose that the time between arrivals (of people, phone calls, etc.) is exponentially distributed with mean µ. Consider the time interval [0, t] and define Xt to be the number of arrivals in that interval. Theoretically it can be shown that Xt has a Poisson distribution with parameter t/µ (which is also the mean of the distribution). So, with θ = t/µ, P(Xt = j) = θ^j e^{−θ}/j! for j = 0, 1, . . .. We will simulate this situation and compare the empirical results to what is expected.
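This can also be checked directly by simulation; below is an illustrative Python sketch mirroring the logic of the SAS macro that follows (function names are ours). It accumulates exponential interarrival times until time t is exceeded and counts the arrivals; with µ = 4 and t = 60 the counts should have mean and variance near t/µ = 15.

```python
import random
from statistics import fmean, pvariance

def arrivals(mu, t, rng):
    """Number of arrivals in [0, t] when interarrival times are exponential, mean mu."""
    total, count = 0.0, 0
    while True:
        total += rng.expovariate(1.0 / mu)  # next interarrival time, mean mu
        if total > t:
            return count
        count += 1

random.seed(7)
counts = [arrivals(4, 60, random) for _ in range(20000)]
m, v = fmean(counts), pvariance(counts)  # theory: both near t/mu = 15
```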
option ls=80 nodate;
%macro sime3(nsim,mu,t);
/* arrivals are exponential with mean mu*/
title "simulate number of arrivals in time 0 to t";
data a;
do k = 1 to &nsim;
xt=-1;
total=0;
i = 0;
loop: i = i +1;
xt=xt+1;
time = &mu*ranexp(0);
total = total +time;
if total lt &t then goto loop;
output;
end;
keep k xt; proc print; run;
proc freq;
tables xt;
run;
proc means; var xt; run;
/* get the theoretical poisson distribution to compare */
data poiss;
lambda = (1/&mu)*&t;
upper = int(lambda + 5*sqrt(lambda)); /* going 5 standard deviations above the mean*/
j=0;
prob=poisson(lambda,j);
do j = 1 to upper;
prob = poisson(lambda,j) - poisson(lambda,j-1);
output;
end;
keep j prob;
proc print data=poiss;
run;
%mend;
%sime3(500,4,60); /* mean = 4 minutes, t = 1 hour = 60 minutes*/
simulate number of arrivals in time 0 to t 1
OBS K XT
1 1 19
2 2 22
3 3 12
498 498 13
499 499 9
500 500 14
XT Frequency Percent
-------------------------
4 2 0.4
5 1 0.2
6 3 0.6
7 3 0.6
8 11 2.2
9 18 3.6
10 26 5.2
11 35 7.0
12 36 7.2
13 44 8.8
14 49 9.8
15 42 8.4
16 45 9.0
17 52 10.4
18 39 7.8
19 33 6.6
20 25 5.0
21 15 3.0
22 11 2.2
23 3 0.6
24 4 0.8
25 2 0.4
27 1 0.2

OBS J PROB
1 1 0.00000
2 2 0.00003
3 3 0.00017
4 4 0.00065
5 5 0.00194
6 6 0.00484
7 7 0.01037
8 8 0.01944
9 9 0.03241
10 10 0.04861
11 11 0.06629
12 12 0.08286
13 13 0.09561
14 14 0.10244
15 15 0.10244
16 16 0.09603
17 17 0.08474
18 18 0.07061
19 19 0.05575
20 20 0.04181
21 21 0.02986
22 22 0.02036
23 23 0.01328
24 24 0.00830
25 25 0.00498
26 26 0.00287
27 27 0.00160
28 28 0.00086
29 29 0.00044
30 30 0.00022
cut off the rest
Analysis Variable : XT
N Mean Std Dev Minimum Maximum
-----------------------------------------------------------
500 14.9980000 3.8558697 4.0000000 27.0000000
-----------------------------------------------------------
8.8 Simulating survival rates and time to extinction.
Simulating the survival of a cohort over time.
These simulations assume that the daily survival rate is constant over time and is the same for each individual. That is, the probability of surviving over the next unit of time is the same regardless of what unit of time you have survived to. Both the assumption of a common survival rate for each individual and the assumption that the survival rate is constant over time can be modified.
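The binomial-thinning scheme used below (survivors at time t + 1 are a Binomial draw from the survivors at time t) can be sketched in Python as follows (illustrative; our small Bernoulli-sum sampler stands in for RANBIN, and the function names are ours):

```python
import random
from statistics import fmean

def binomial(n, p, rng):
    """Binomial(n, p) draw via n Bernoulli trials (plays the role of RANBIN)."""
    return sum(rng.random() < p for _ in range(n))

def cohort(n0, pi, t, rng):
    """Survivors at times 1..t, with constant per-period survival rate pi."""
    path, alive = [], n0
    for _ in range(t):
        alive = binomial(alive, pi, rng)
        path.append(alive)
    return path

random.seed(8)
paths = [cohort(100, 0.8, 5, random) for _ in range(2000)]
mean_t5 = fmean([p[4] for p in paths])  # theory: 100 * 0.8**5 = 32.768
```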
options ls=80 nodate;
/* this simulates the survival of n0 initial individuals up to time t,
with daily survival rate pi. based on nsim simulations */
%macro simpop(n0,pi,t,nsim);
title "N0 = &n0 pi = &pi MaxTime = &t";
data a;
do sim = 1 to &nsim;
oldsurv=&n0;
do time = 1 to &t;
if oldsurv=0 then goto check;
survive= ranbin(0,oldsurv,&pi);
oldsurv=survive;
check: output;
end;
end;
run;
proc print data = a noobs;
var sim time survive;
run;
/*proc gplot;
plot survive*time;
by sim; run; */
proc means;
class time;
var survive;
run;
%mend;
%simpop(100,.8,5,100);
N0 = 100 pi = .8 MaxTime = 5
sim time survive
1 1 86
1 2 68
1 3 58
1 4 45
1 5 34
2 1 80
2 2 67
2 3 54
2 4 40
2 5 36
------------------------------------ sim=1 -------------------------------------
Plot of survive*time. Legend: A = 1 obs, B = 2 obs, etc.
survive
100 +
|
|
75 + A
|
| A
50 +
| A
| A
25 + A
---+-------------+-------------+-------------+-------------+--
1 2 3 4 5
time
Analysis Variable : survive
N
time Obs N Mean Std Dev Minimum Maximum
1 100 100 80.5900000 3.7928561 69.0000000 88.0000000
2 100 100 64.4600000 4.7020735 53.0000000 76.0000000
3 100 100 51.7100000 4.7360534 38.0000000 65.0000000
4 100 100 41.4700000 4.9593397 26.0000000 54.0000000
5 100 100 32.8200000 4.7829242 19.0000000 45.0000000
Simulating Time until extinction
options ls=80 nodate;
/* getting time until extinction*/
%macro simpop2(n0,pi,nsim);
data a;
do sim = 1 to &nsim;
oldsurv=&n0;
time=0;
loop: time = time+1;
survive= ranbin(0,oldsurv,&pi);
oldsurv=survive;
if survive gt 0 then goto loop;
output;
end;
keep sim time;
run;
proc print;
run;
proc univariate plot;
var time;
%mend;
%simpop2(100,.8,500);
OBS SIM TIME
1 1 20
2 2 19
499 499 19
500 500 18
Mean 23.754 Std Dev 5.574912
Histogram # Boxplot
47+* 1 *
.
.*** 5 0
.** 3 0
.*** 5 0
.****** 11 0
.*** 6 |
.******** 15 |
31+************* 25 |
.*************** 30 |
.********************* 42 +-----+
.*************************************** 78 | |
.******************************************* 85 *--+--*
.********************************************** 92 +-----+
.************************** 51 |
.******************* 38 |
15+******* 13 |
----+----+----+----+----+----+----+----+----+-
* may represent up to 2 counts
Simulating the performance of a crude estimate of survival rate.
options ls=80 nodate;
/* simulating an estimate of the survival rate */
%macro simpop(n0,pi,t,nsim);
data a;
do sim = 1 to &nsim;
oldsurv=&n0;
k=0;
sum=0;
do time = 1 to &t;
if oldsurv=0 then goto check;
survive= ranbin(0,oldsurv,&pi);
k=k+1;
sum=sum+(survive/oldsurv);
/*(survive/oldsurv) estimates the survival rate at current time*/
oldsurv=survive;
check: end;
pihat = sum/k; /* estimates survival by averaging individual estimates*/
output;
end;
keep sim pihat;
run;
proc print;
run;
proc means;
var pihat;
%mend;
%simpop(100,.8,5,500);
OBS SIM PIHAT
1 1 0.79197
2 2 0.82014
499 499 0.82939
500 500 0.78743
Analysis Variable : PIHAT
N Mean Std Dev Minimum Maximum
-----------------------------------------------------------
500 0.7993498 0.0231622 0.7180423 0.8727904
-----------------------------------------------------------
8.9 More on random number generation in SAS
SAS now has a general rand function of the form:
rand('DISTNAME', parameter list)
where the names of the distributions and their parameters are as in the handout showing the general form for pdf, cdf and quantile.
So, for example, if you want to generate X from a normal with mean 5 and standard deviation 4 you can use x = rand('NORM', 5, 4); while to generate an observation from an exponential distribution with mean 5 you can use x = rand('EXPO', 5);.
There is also a proc simnormal that will generate multivariate normal random variables, but it is often better to do this in IML. More on the multivariate normal later.
Multinomial. The multinomial distribution generalizes the binomial to the case of k categories, with πj = P(result is in category j) and π1 + · · · + πk = 1. Multinomial observations can be generated using rantbl( ) or rand('TABLE', ...). See the SAS documentation for details.
9 R: Part II
9.1 Graphics
The graphics in R are one of its strong points. This section describes the basic features of plotting in R. There is much, much more. A good reference is "R Graphics" by Murrell (Chapman & Hall).
In creating a figure, the initial plot is created using a plot command. The plot command creates the axes and the frame for the plot and (almost always) does an initial plot. It has many options. Subsequent overlays to the plot, if desired, are done using the points command or the lines command. These have fewer options since they aren't involved with the axes set-up, titles, etc.
The plot command:
The simplest form of the plot command is plot(x,y) where x and y are vectors of equal length. What is in x is plotted on the x-axis and y on the y-axis. Only the x and y are required, but there are many options that can be included. Below is a general form of the plot function with many of the commonly used options in it. Information on a number of the options can be found via help(par) and help(plot). Some of the text below is extracted directly from the help.
plot(x, y, type = , lty = , pch = , main = " ", sub = " ", xlab = " ", ylab = " ", xlim = c(,), ylim = c(,), col = , cex = , font = )
• type = specifies the type of plot, here are some of the options (points is the default)
"p" for points,
"l" for lines,
"b" for both,
"h" for ’histogram’ like (or ’high-density’) vertical lines,
• lty = a number, specifies the line type if type = "l". The default is 1.
Line types can either be specified as an integer (0=blank, 1=solid (default), 2=dashed, 3=dotted, 4=dotdash, 5=longdash, 6=twodash) or as one of the character strings "blank", "solid", "dashed", "dotted", "dotdash", "longdash", or "twodash", where "blank" uses invisible lines (i.e., does not draw them).
• pch = specifies the symbol for a point, if included. The default is a circle.
Either an integer specifying a symbol or a single character to be used as the default in plotting points.See points for possible values and their interpretation. Note that only integers and single-characterstrings can be set as a graphics parameter (and not NA nor NULL).
• main = gives the main plot title
• sub = gives a subtitle
• xlim and ylim specify the limits of the x and y axes; so xlim = c(4,10) gives a plot where the x-axis ranges from 4 to 10;
• col = sets a color; e.g., col = "red"
• cex = number controls the size of symbols. From the help
cex= A numerical value giving the amount by which plotting
text and symbols should be magnified relative to the default.
Note that some graphics functions such as plot.default have
an argument of this name which multiplies this graphical parameter,
and some functions such as points accept a vector of
values which are recycled. Other uses will take just
the first value if a vector of length greater than one is supplied.
There is also cex.axis, cex.lab, cex.main and cex.sub, which let you control individual parts.
• font = controls the font type. Below is from the help. It is a little obtuse, so experiment.
An integer which specifies which font to use for text. If possible, device drivers
arrange so that 1 corresponds to plain text (the default), 2 to bold face, 3 to
italic and 4 to bold italic. Also, font 5 is expected to be the symbol font, in
Adobe symbol encoding. On some devices font families can be selected by family
to choose different sets of 5 fonts.
The points statement.
Add points to the plot. This is of the form
points(x,y, pch=, cex = , col = )
where only the x and y are required. There are a few other options.
The lines statement.
Adds a line (or line combined with points) to the plot. This is of the form
lines(x,y, options)
The options can include type = , lty = , and some of the other options in the plot statement but, obviously, not those concerned with the plot layout (e.g., main = , xlab = , etc.)
The abline statement.
abline(a,b) will plot a straight line with intercept a and slope b. (Sometimes the arguments for the line might come from the output of a function; for example, when doing simple linear regression we used abline(reg); see page 96 of Part I of the notes.)
Multiple plots on a page
The command
par(mfrow=c(a,b))
lays out a plotting page which allows a*b plots arranged in a rows and b columns. This is put before the first plot statement. As each plot statement is encountered it puts the plot in the next available spot, where the plots are filled in a row at a time and, within a row, from left to right.
Plotting only some points.
In the plot, lines or points commands you can select out values by putting a condition after the variable. For example, plot(x[expression], y[expression]) will plot y versus x only for observations satisfying the expression; this might, for example, select based on values of another variable or on the x and y values themselves.
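As a small illustration of this kind of subsetting (the vectors here are made up for illustration):

```r
# select with a logical condition; x and y are invented values
x <- c(1, 4, 2, 8, 5)
y <- c(10, 40, 20, 80, 50)
xs <- x[x > 3]   # elements of x exceeding 3
ys <- y[x > 3]   # corresponding elements of y
xs   # 4 8 5
ys   # 40 80 50
```

The same conditions can go directly inside plot( ), points( ) or lines( ), e.g. plot(x[x > 3], y[x > 3]).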
Unemployment example. The following does the unemployment plot (see page 15). Here it does it three different ways: two with points connected by a line and one with a line and no points. The par statement lays out the page to have three plots top to bottom.
edata<-read.table(’g:/s597/data/employ.dat’,header=F)
par(mfrow=c(3,1)) # lines up three plots vertically
Findex<-edata$V1
unemploy<-edata$V2
year<-edata$V3
newyear<-year+1950;
plot(newyear,unemploy,type="b",main="Unemployment over Time",
xlab="Year", ylab = "Unemployment")
#below uses a * rather than the default circle
plot(newyear,unemploy,type="b",pch="*",main="Unemployment over Time",
xlab="Year", ylab = "Unemployment")
#below uses no points, and chooses line type 3
plot(newyear,unemploy,type="l",lty=3,main="Unemployment over Time",
xlab="Year", ylab = "Unemployment")
Figure 8: Unemployment plots using R
Esterase assay example. This does the plot on page 16. Note: We could have read in and labeled variables as above, but here I've illustrated the use of matrix(scan( )) as an alternative to a read. The result is a matrix called data, rather than a dataframe. We create the variables, by name, by equating them to columns of the matrix called data. It also shows overlaying using lines and points and the need to control the range of the y-axis. Finally, it also demonstrates how you can read another data set and overlay a graph from it on the original plot. In this case it gets the original points that went into getting the fitted lines and prediction intervals.
data<-matrix(scan(’g:/s597/data/pred.out’),ncol=7,byrow=T)
par(mfrow=c(1,1)) # single plot per page; not needed if in new session
# but resets if had a different layout earlier
x0<-data[,1]
yhat<-data[,2]
low <-data[,3]
up <- data[,4]
yhatw<-data[,5]
loww<-data[,6]
upw<-data[,7]
plot(x0,yhat,type="l",lty=1,xlab = "concentration", ylab = "count",
main = "Prediction intervals for Assay Data", ylim = c(-200,1400))
lines(x0,low,lty=1)
lines(x0,up, lty=1)
lines(x0,yhatw,lty=2)
lines(x0,loww,lty=2)
lines(x0,upw,lty=2)
edata<-read.table(’g:/s597/data/ester.dat’)
conc<-edata$V1 # true esterase concentration
count<-edata$V2 # radioactive binding count
points(conc,count,pch="*")
This plots verbal IQ versus brain size, for males and females, using the brain data
brain<-read.table("g:/s597/data/Brain_h.dat",na.string=".",header=T)
head(brain)
attach(brain)
par(mfrow=c(2,1))
plot(mriCount[Gender==’Male’],VIQ[Gender==’Male’],
xlab="MRI COUNT", ylab = "Verbal IQ", main = "Verbal IQ versus
Brain Size: Males")
plot(mriCount[Gender==’Female’],VIQ[Gender==’Female’],
xlab="MRI COUNT", ylab = "Verbal IQ", main = "Verbal IQ versus
Brain Size: Females")
9.1.1 Directing the graphics output
Graphics go to a graphics device. Without doing anything, when you create a plot it opens a graphics window as the active device. (This is equivalent to using the device function windows() in a Windows environment.) A plot in the graphics window can be saved using "save as", as a pdf, eps, JPEG, and some others.
You can also explicitly open another device/file and the graphics output will be sent to that device/file. For example, the following would create a postscript file. Note some of the options used. In place of postscript, other options are pdf( ) and jpeg( ) as well as some others.
Figure 9: Esterase Assay intervals using unweighted and weighted least squares.
postscript("h:/mecourse/phplot.ps",horizontal=F,height=10,width=8,font=3)
Use help(postscript) or help(jpeg), etc., to see the options. Notice that the default is horizontal = T, which creates a landscape plot.
Once you have opened another device (e.g., via postscript, pdf or jpeg), output will be routed there. If you open another device it becomes active.
dev.off() will shut off the active device and return to using the interactive graph window when you next plot something.
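A minimal sketch of the device mechanics, using pdf( ) and a temporary file name just for illustration:

```r
# open a pdf device, draw one plot, and close the device
outfile <- tempfile(fileext = ".pdf")   # stand-in for a real file name
pdf(outfile, height = 10, width = 8)
plot(1:10, (1:10)^2, type = "l", main = "demo plot")
dev.off()                 # close the pdf device
file.exists(outfile)      # TRUE: the plot was written to the file
```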
9.2 Miscellaneous
• Sequencing See Section 15.4.3 for definition and use of the seq command.
• Creating a vector. The rep command can be used to create a vector. rep(values,n) creates n repeats of values. If values is a single item, this is a vector of length n, but if values has p components, this is a vector of length n*p. If the n is replaced by each=n, then it repeats each quantity in values n times.
> rep(4,10)
[1] 4 4 4 4 4 4 4 4 4 4
> rep(NA,13)
[1] NA NA NA NA NA NA NA NA NA NA NA NA NA
Figure 10: Verbal IQ versus brain size for each gender
> rep(1:p,10)
Error: object "p" not found
> rep(1:5,10)
[1] 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3
[39] 4 5 1 2 3 4 5 1 2 3 4 5
> rep(1:5,each=10)
[1] 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 4 4 4 4 4 4 4 4
[39] 4 4 5 5 5 5 5 5 5 5 5 5
• The cat statement. The cat() statement lets you write text or variables to a file or to the console. This lets you add text to your output and it is also a way to output variables to a file.
cat(...., file = , append = T)
If file = is omitted, then the output goes to the console.
append = T is only an option when writing to a file; it appends to the file. If append = T is omitted then append = F is assumed, which means the file is overwritten.
The .... is what is output. It can be a collection of "items" separated by commas. An item in quotes is written as is, except for some escape sequences using \: "\n" produces a new line (carriage return) and "\t" produces a tab.
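A small sketch of cat( ) writing and appending, using a temporary file just for illustration:

```r
# write a header line (overwriting any existing file), then append a data line
f <- tempfile()
cat("ssa", "\t", "def", "\n", file = f)
cat(7, "\t", 1, "\n", file = f, append = TRUE)
readLines(f)    # two lines: the header and the data line
```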
• multiple commands on a line: You can put multiple commands on a line if all but the last end in a semicolon.
• If statements: If statements are executed as below. Notice that the first piece is in parentheses, ( ), and the second in braces, { }.
if (expression1) {expression2}
or you can have an if/else (note that R does not use the word then):
if (expression1) {expression2} else
{expression3}
The expressions can be multiple lines.
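For example (values invented for illustration):

```r
# if alone, and if with else
x <- 9
if (x == 9) {
  x <- NA            # recode the missing value code to NA
}
if (is.na(x)) {
  status <- "missing"
} else {
  status <- "observed"
}
status   # "missing"
```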
9.3 Looping
Do loops are done in R using the for command or the while command.
The following loops through values of varname and carries out expression2 using the for command.
for (varname in seq) {expression2 involving varname}
where seq defines a range of possible values for varname.
The while command is of the form
while (condition) {statements}
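Two small examples (the sums are chosen just for illustration):

```r
# for loop: sum of the squares of 1 through 5
total <- 0
for (k in 1:5) {
  total <- total + k^2
}
total   # 55

# while loop: smallest n with 2^n > 100
n <- 0
while (2^n <= 100) {
  n <- n + 1
}
n       # 7
```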
Example: Here is an example involving both the for and the if statement. In homework 5, this would convert the missing values for CRIMTYPE (treated as numerical):
for (k in 1:length(CRIMTYPE)){if (CRIMTYPE[k]==9){CRIMTYPE[k]=NA}}
Other applications appear in later examples.
9.4 Functions
The general form for defining a function is
fname<-function(argument1,argument2, ...)
{statements
return(objects)}
At the end of the function, return(objects) specifies what values are returned to the main program from the function. The following function gets summary statistics for a numerical variable where the missing code is contained in mcode.
statc<-function(x,mcode){
for (k in 1:length(x)){if (x[k]==mcode){x[k]=NA}}
mean<-mean(x,na.rm=T)
sd <-sd(x,na.rm=T )
nmiss<-sum(is.na(x))
n<-length(x)- nmiss
med<-median(x, na.rm=T)
sum<-summary(x)
min<-sum[1]
max<-sum[6]
statt<-cbind(n=n,nmiss=nmiss,mean=mean,SD=sd,median=med,min=min,max=max)
return(statt)
}
If we knew there were no missing values we could use just statc<-function(x) and skip some of the rest.
You can save the text defining your function to a file; say the above is saved to g:/s597/statcom.R. You can then run this using the source command and execute it as in the following example, where AGE (from the SYC data) was already in the workspace from previous commands.
> source("g:/s597/stats.R")
> stats(AGE,99)
n nmiss mean SD median min max
Min. 2621 0 16.80923 1.911258 17 11 24
NOTE: The source command executes what is in the file. (Also note that when using source it doesn't automatically list things. Suppose somewhere in the file it said just AGE. If we typed AGE in the console it would list AGE, but it doesn't do so when it is in a file being "sourced". Instead you need to use print(AGE).)
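A self-contained sketch of the source mechanics, writing a small made-up function (csq, a hypothetical name) to a temporary file and then sourcing it:

```r
# save a one-line function definition to a file, then source it
f <- tempfile(fileext = ".R")
cat("csq <- function(x) x^2\n", file = f)   # csq is an invented example
source(f)          # executes the file, so csq() is now defined
csq(4)             # 16; inside the sourced file you would need print( )
```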
9.5 Probability Functions with examples
There are four general functions, with first letters d, p, q and r, that are used in working with probability distributions and generating samples from them. In each case the arguments involve the parameters of the distribution and are distribution specific.
• dName(x, arg1, ....) returns the PDF (density or mass function) evaluated at x of a random variable with distribution Name.
• pName(x, arg1, ....) returns P(X ≤ x), the CDF (cumulative distribution function) evaluated at x of a random variable with distribution Name.
• qName(p, arg1, ....) returns the quantile, the value qp such that P(X ≤ qp) = p.
• rName(n, arg1, ...) generates n observations from the distribution.
Both the q and the p functions have options that come after the arguments that reverse the tail being worked with:
pName(x, arg1, ...., lower.tail=F) returns P (X > x).
qName(p, arg1, ... , lower.tail = F) returns q1−p; i.e. the value with probability p to the right of it.
There are also options that convert to log scales that we won’t discuss here. See the help( ) results.
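The four functions, and the lower.tail option, can be checked against each other for the normal distribution:

```r
# p and q are inverses; lower.tail = F gives the complementary probability
p  <- pnorm(1.96)                       # P(Z <= 1.96), about 0.975
q  <- qnorm(p)                          # recovers 1.96
up <- pnorm(1.96, lower.tail = FALSE)   # P(Z > 1.96) = 1 - p
d  <- dnorm(0)                          # standard normal density at 0
x  <- rnorm(5)                          # five standard normal draws
```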
Some specific distributions, illustrated with the d function.
• Uniform: dunif(x,minv,maxv) or dunif(x, min = minv, max = maxv), minv = lower bound, maxv = upper bound
If minv and maxv are omitted, then minv = 0 and maxv = 1 (standard uniform)
• Exponential: dexp(x,ratev) or dexp(x,rate=ratev).
rate is 1/mean, not the mean. If rate is omitted, then rate = 1 is assumed.
• Normal; dnorm(x, meanv, sdv) or dnorm(x, mean = meanv, sd = sdv)
• t-distribution dt(x, df, ncp)
ncp = 0 if third argument omitted.
• Chi-square: dchisq(x, df, ncp)
ncp = 0 if third argument omitted.
• F: df(x, df1, df2,ncp )
ncp = 0 if fourth argument omitted.
# plotting the binomial
n<-10
pi<-.2
pval<-rep(NA,n)
pval
for (k in 1:n)
{pval[k] = dbinom(k,n,pi)}
kvec<-seq(1:n)
kvec
pval
plot(kvec,pval, type = "h")
#PLOTTING THE BINOMIALS AS A FUNCTION
bplot<-function(n,pi){
pval<-rep(NA,n)
pval
for (k in 1:n)
{pval[k] = dbinom(k,n,pi)}
kvec<-seq(1:n)
kvec
pval
plot(kvec,pval,type="h")}
bplot(10,.2)
# This does four plots to a page.
par(mfrow=c(2,2))
bplot<-function(pi){
for (n in c(5,10,20,50)){
pval<-rep(NA,n)
maint<-paste("n = ",n)
for (k in 1:n)
{pval[k] = dbinom(k,n,pi)}
kvec<-seq(1:n)
plot(kvec,pval,type="h",xlab = "k", ylab= "probability",main = maint)}}
bplot(.2)
# HERE IS A LONGER WAY WITH EACH GRAPH LABELED BY THE SAMPLE SIZE. NOT NECESSARY.
par(mfrow=c(2,2))
pi<-.2
n<-5
pval<-rep(NA,n)
pval
for (k in 1:n)
{pval[k] = dbinom(k,n,pi)}
kvec<-seq(1:n)
plot(kvec,pval,type="h",xlab = "k", ylab= "probability", main = "n = 5")
n<-10
pval<-rep(NA,n)
for (k in 1:n)
{pval[k] = dbinom(k,n,pi)}
kvec<-seq(1:n)
plot(kvec,pval,type="h",xlab = "k", ylab= "probability", main = "n = 10")
n<-20
pval<-rep(NA,n)
for (k in 1:n)
{pval[k] = dbinom(k,n,pi)}
kvec<-seq(1:n)
plot(kvec,pval,type="h",xlab = "k", ylab= "probability",main = "n = 20")
n<-50
pval<-rep(NA,n)
for (k in 1:n)
{pval[k] = dbinom(k,n,pi)}
kvec<-seq(1:n)
plot(kvec,pval,type="h",xlab = "k", ylab= "probability",main = "n = 50")
Hypergeometric Example on page 21. We will do this problem four different ways to show some of the features of R.
METHOD 1
# hypergeometric example on page 21.
# Here we write to the console using the cat
# function and a print.
popsizea <- 698
upper <-4
prob <-rep(NA,upper)
value <- rep(NA,upper)
ssavalues<-seq(7,70,by=7)
numberssa <-length(ssavalues)
dvalues<-seq(1,101,by=5)
Figure 11: Binomial probability density functions with π = .2
numberdef<-length(dvalues)
for (j in 1:numberssa)
{ssa= ssavalues[j]
ssrate=ssa/popsizea
for (def in seq(1,101, by = 5))
{defrate=def/popsizea;
nondef = popsizea-def;
for (k in 0:upper)
{prob[k+1] = dhyper(k,def,nondef,ssa)
value[k+1]=k}
cat("probabilities with N = 698,sample size= ", ssa,
"number defective = ", def, "\n")
prob0<-prob[1]; prob1<-prob[2]; prob2<-prob[3];prob3<-prob[4]
values<-cbind(prob0,prob1,prob2,prob3)
print(values)}}
probabilities with N = 698,sample size= 7 number defective = 1
prob0 prob1 prob2 prob3
[1,] 0.9899713 0.01002865 0 0
probabilities with N = 698,sample size= 7 number defective = 6
prob0 prob1 prob2 prob3
[1,] 0.9411107 0.05761902 0.001258057 1.219048e-05
...
METHOD 2
# Now we’ll create a "file" with a separate line
# for each combination of sample size and number defective
# Here we do it showing also how you can write to a file
# using the cat command
myfile<-"g:/s597/houtput"
popsizea <- 698
upper <-4
prob <- rep(NA,upper)
value <- rep(NA,upper)
ssavalues<-seq(7,70,by=7)
numberssa <-length(ssavalues)
dvalues<-seq(1,101,by=5)
numberdef<-length(dvalues)
cat("ssa","\t", "def","\t", "p0","\t","p1","\t","p2","\t","p3\n", file=myfile)
for (j in 1:numberssa)
{ssa= ssavalues[j]
ssrate=ssa/popsizea
for (m in 1:numberdef)
{def = dvalues[m]
defrate=def/popsizea
nondef = popsizea-def
for (k in 0:upper)
{prob[k+1] = dhyper(k,def,nondef,ssa)
value[k+1]=k}
prob0<-prob[1]; prob1<-prob[2]; prob2<-prob[3];prob3<-prob[4]
cat(ssa, "\t", def, "\t", prob0, "\t", prob1, "\t", prob2,
"\t", prob3,"\n",file=myfile,append=T)
}}
hdata<-read.delim("g:/s597/houtput")
head(hdata)
ssa def p0 p1 p2 p3
1 7 1 0.9899713 0.01002865 0.000000000 0.000000e+00
2 7 6 0.9411107 0.05761902 0.001258057 1.219048e-05
3 7 11 0.8943318 0.10112120 0.004448147 9.768991e-05
4 7 16 0.8495603 0.14075550 0.009355982 3.219856e-04
5 7 21 0.8067238 0.17673380 0.015779810 7.424871e-04
6 7 26 0.7657521 0.20925960 0.023529940 1.408978e-03
METHOD 3
# Here we do it showing how you can create a matrix and write to
# the matrix.
popsizea <- 698
upper <-4
prob <- rep(NA,upper)
value <- rep(NA,upper)
ssavalues<-seq(7,70,by=7)
numberssa <-length(ssavalues)
dvalues<-seq(1,101,by=5)
numberdef<-length(dvalues)
total = numberssa*numberdef
total
data<-matrix(NA,total,6) # create a total x 6 matrix with NA’s for entries
index=0
for (j in 1:numberssa)
{ssa= ssavalues[j]
ssrate=ssa/popsizea
for (m in 1:numberdef)
{def = dvalues[m]
defrate=def/popsizea
nondef = popsizea-def
for (k in 0:upper)
{prob[k+1] = dhyper(k,def,nondef,ssa)
value[k+1]=k}
index=index+1
data[index,1]<-ssa
data[index,2]<-def
data[index,3]<-prob[1]
data[index,4]<-prob[2]
data[index,5]<-prob[3]
data[index,6]<-prob[4]
}}
data
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] 7 1 9.899713e-01 0.0100286533 0.000000000 0.000000e+00
[2,] 7 6 9.411107e-01 0.0576190211 0.001258057 1.219048e-05
[3,] 7 11 8.943318e-01 0.1011212164 0.004448147 9.768991e-05
...
[208,] 70 91 3.240206e-05 0.0003836452 0.002210053 8.256321e-03
[209,] 70 96 1.752117e-05 0.0002209048 0.001355834 5.399683e-03
[210,] 70 101 9.422897e-06 0.0001261740 0.000822874 3.484018e-03
METHOD 4
# Here we do it showing how you can create vectors
# write to them and then bind them to a dataframe
popsizea <- 698
upper <-4
prob <- rep(NA,upper)
ssavalues<-seq(7,70,by=7)
numberssa <-length(ssavalues)
dvalues<-seq(1,101,by=5)
numberdef<-length(dvalues)
total = numberssa*numberdef
ssav<-rep(NA,total)
defv<-rep(NA,total)
p0v<-rep(NA,total)
p1v<-rep(NA,total)
p2v<-rep(NA,total)
p3v<-rep(NA,total)
index=0
for (j in 1:numberssa)
{ssa= ssavalues[j]
ssrate=ssa/popsizea
for (m in 1:numberdef)
{def = dvalues[m]
defrate=def/popsizea
nondef = popsizea-def
for (k in 0:upper)
{prob[k+1] = dhyper(k,def,nondef,ssa)
}
index=index+1
ssav[index]<-ssa
defv[index]<-def
p0v[index]<-prob[1]
p1v[index]<-prob[2]
p2v[index]<-prob[3]
p3v[index]<-prob[4]
}}
hdata<-cbind(ssav,defv,p0v,p1v,p2v,p3v)
head(hdata)
ssav defv p0v p1v p2v p3v
[1,] 7 1 0.9899713 0.01002865 0.000000000 0.000000e+00
[2,] 7 6 0.9411107 0.05761902 0.001258057 1.219048e-05
[3,] 7 11 0.8943318 0.10112122 0.004448147 9.768991e-05
[4,] 7 16 0.8495603 0.14075555 0.009355982 3.219856e-04
[5,] 7 21 0.8067238 0.17673382 0.015779805 7.424871e-04
[6,] 7 26 0.7657521 0.20925958 0.023529938 1.408978e-03
Power example: This gets the power function and plots it for a one-sided test for the mean. See page 31. It shows the use of the cat command to write out header information. Note the need for a comma to separate something in quotes from a variable. These commas are not printed, as you'll see.
powerone<-function(mu0,sigma,n,alpha){
cat("Power example with sample size ", n," null = ", mu0,
" sigma = ", sigma, " alpha = ", alpha, "\n")
muval<- seq(25,35,by=.5)
#muval
nmu <-length(muval)
power<-rep(NA,nmu)
for(k in 1:nmu)
{mu = muval[k]
df= n-1
nc = sqrt(n)*(mu - mu0)/sigma
tval= qt(1-alpha,df)
power[k] = 1 - pt(tval,df,nc)}
plot (muval,power,type = "l",ylab="power",xlab="Null value",
main= "power function")
values<-data.frame(muval,power)
return(values)}
powerone(30,2,5,.05)
Power example with sample size 5 null = 30 sigma = 2 alpha = 0.05
muval power
1 25.0 1.554168e-11
2 25.5 4.549612e-10
3 26.0 1.016218e-08
4 26.5 1.720899e-07
5 27.0 2.219886e-06
6 27.5 2.193684e-05
7 28.0 1.671735e-04
19 34.0 9.748306e-01
20 34.5 9.914523e-01
21 35.0 9.975115e-01
Sample size determination. This gets the sample size needed for the one-sample t-test over different target values at a specified alternative, µ. See page 33.
ssizeone<-function(mu0,sigma,mu,alpha)
{targets <- seq(.5,.98, by=.02)
ntarget<- length(targets)
nv<-rep(NA,ntarget)
pv<-rep(NA,ntarget)
m=0
for (j in 1:ntarget)
{target=targets[j]
target
n<-2
power<-0
while(power < target)
{df=n-1
nc = sqrt(n)*(mu - mu0)/sigma
tval= qt(1-alpha,df)
power = 1 - pt(tval,df,nc)
n= n+1}
m<-m+1
nv[m]=n-1;
pv[m]=power;
} #end j/target loop
data<-cbind(targets,nv,pv)
plot(targets,nv,xlab="target",ylab="sample size",
26 28 30 32 34
0.0
0.2
0.4
0.6
0.8
1.0
power function
Null value
powe
r
Figure 12: One-sided power using R
type = "b", main= " H0:mu <= 30, alt = 31, sigma=2, alpha=.05")
return(data)
} # end function.
ssizeone(30,2,31,.05)
targets nv pv
[1,] 0.50 13 0.5220115
[2,] 0.52 13 0.5220115
[3,] 0.54 14 0.5507256
....
[24,] 0.96 48 0.9615312
[25,] 0.98 57 0.9814151
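The same search can be sketched in Python with a normal approximation standing in for qt/pt (this is my sketch, not from the notes); because the z critical value is smaller than the t one, it returns slightly smaller n than the exact t-based table above:

```python
from math import sqrt
from statistics import NormalDist

def ssize_normal(mu0, sigma, mu, alpha, target):
    """Smallest n with power >= target for the one-sided z-test of H0: mu <= mu0,
    mirroring the while loop in ssizeone but with normal tail probabilities."""
    nd = NormalDist()
    z_alpha = nd.inv_cdf(1 - alpha)
    n = 2
    while True:
        ncp = sqrt(n) * (mu - mu0) / sigma   # noncentrality parameter
        power = 1 - nd.cdf(z_alpha - ncp)
        if power >= target:
            return n, power
        n += 1
```

For target 0.80 with µ0 = 30, µ = 31, σ = 2, α = .05 this gives n = 25.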
Simulating and plotting the sample mean from an exponential distribution.
simc<-function(mu)
{par(mfrow=c(3,2))
nsim<-1000
means<-rep(0,nsim)
for (n in c(1,5,10,30,50,100))
{for (j in 1:nsim)
{values<-rexp(n,1/mu)
means[j]<- sum(values)/n}
Figure 13: Sample size needed to have power = target at µ = 31 for test of H0 : µ ≤ 30 versus HA : µ > 30 with α = .05.
hist(means,main="mean",freq=FALSE)
lines(density(means))}}
simc(4)
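The same simulation can be run in stdlib Python to see the CLT effect numerically rather than graphically (the seed and function name here are mine, not from the notes):

```python
import random
random.seed(1)

def sim_means(mu, n, nsim=1000):
    """nsim sample means of n draws from an exponential with mean mu,
    the analogue of the inner loop of simc()."""
    return [sum(random.expovariate(1 / mu) for _ in range(n)) / n
            for _ in range(nsim)]

means = sim_means(4, 30)
# the simulated means should concentrate near mu = 4, with spread ~ mu/sqrt(n)
```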
Figure 14: Distribution of sample mean with samples from the exponential; n = 1, 5, 10, 30, 50, 100
9.6 Some more graphics
9.6.1 Using legends
This comes from my book (“Measurement Error: Models, Methods and Applications”). It is from an example in Montgomery and Peck (Regression Analysis: 1992) for which a quadratic model was used to model the tensile strength of Kraft paper as a function of the hardwood concentration in the batch of pulp used. This program plots five different quadratic fits: one is just a regular fit to the data (called naive); the other four are fits based on various methods that correct for the fact that the hardwood concentration can’t be observed exactly but is estimated with some uncertainty. (This measurement error in the predictor causes bias in the fitted coefficients.)
In this code the position of the legend is given by the first two arguments in the legend statement, which position the upper left corner of the box containing the legend at x = 0 and y = 60.
#postscript("g:/mecourse/paper.ps",horizontal=F,height=6,width=4.5)
# Plot of five different quadratic functions with a legend.
x<-seq(0,20,.1)
fitn<- 1.11 + 8.99*x -.44*(x**2)
fitc<- -11.03 + 13.36*x -.73*(x**2)
fitrc<- -2.95 + 10.11*x -.52*(x**2)
fitrci<- -3.99 + 11.03*x -.60*(x**2)
fits<- -1.42 + 9.77*x -.499*(x**2)
plot(x,fitn,xlab="Hardwood Concentration",ylab = "Tensile strength",
type = "l", lty=1, cex=0.8,ylim=c(0,60))
lines(x,fitc,type = "l", lty=2, cex=0.8)
lines(x,fitrc,type = "l", lty=3, cex=0.8)
lines(x,fitrci,type = "l", lty=4, cex=0.8)
lines(x,fits,type = "l", lty=5, cex=0.8)
legend(0,60,c("Naive","MOM", "RC", "RC-I", "SIMEX"),lty=1:5,cex=0.8)
Figure 15: Naive and other fits (accounting for measurement error) of strength versus hardwood concentration
9.6.2 Writing text in margins or in graphs
You can write in the margins of the plot using mtext and in the graph itself using text.
#showing how to write in margins with mtext and that
#default is side = 3 (top)
par(mfrow=c(3,2))
plot(1:10, (-4:5)^2, main="Parabola Points", xlab="xlab")
mtext("10 of them ",side=1)
plot(1:10, (-4:5)^2, main="Parabola Points", xlab="xlab")
mtext("10 of them",side=2)
plot(1:10, (-4:5)^2, main="Parabola Points", xlab="xlab")
mtext("10 of them",side=3)
plot(1:10, (-4:5)^2, main="Parabola Points", xlab="xlab")
mtext("10 of them",side=4)
plot(1:10, (-4:5)^2, main="Parabola Points", xlab="xlab")
mtext("10 of them")
plot(1:10, (-4:5)^2, main="Parabola Points", xlab="xlab")
text(5,10,"THIS SHOWS HOW YOU CAN WRITE IN TEXT")
# The text is centered at x = 5 and y = 10
Figure 16: Writing in the margins with mtext and inside the plot with text
10 SAS-IML.
IML is a powerful programming language which is part of SAS (in fact, hidden under the procs in SAS is IML). It is not a procedure in the way that other procedures are in SAS, even though it is begun by using proc iml;. This simply invokes the IML programming language; quit; ends IML. You can execute pieces interactively without having to run the whole procedure. Among its advantages are (not meant to be exhaustive by any means):
• Handles matrices easily with many built in matrix operations.
• You can print items at any time using a combination of text and variables.
• You can define functions and modules.
• Besides the matrix functions, there are many, many other built-in functions and subroutines, including those doing nonlinear optimization and a general randgen subroutine for generating multiple random variables from certain distributions.
10.1 Basic matrix operations
The following demonstrates some of the basic matrix functions, including defining matrices directly.
options ls=80 nodate;
proc iml; /* this starts iml */
/* Create a 3 by 3 matrix called a. Spaces separate entries within a
row; a comma separates rows. */
a = {1 5 8, 5 3 6, 8 6 4}; /* note a is symmetric */
/* create a 3 by 3 matrix b */
b = { 5 2 4, 1 3 2, -5 6 7};
/* I(k) create a k by k identity matrix */
imat = I(5);
/* J(nrow,ncol,value) creates a matrix with nrow rows, ncol columns
and value for each entry */
c = J(3,4,2); /* creates a 3 by 4 matrix of 2’s */
d = J(2,3); /* no value given. The default is 1 */
e = {1 1 5 6 7 9 11 11}; /* creates a 1 by 8 row vector */
/* row vectors can be created using the index operator :
r = i:j will create a row vector going from i up to j by steps of 1 if
j is bigger than i or going from i down to j if i is bigger than j.
r = do(l,u,inc) will create a row vector going from l to u by steps of
inc */
r1 = 1:5;
r2 = do(10,18,2);
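For comparison, here are stdlib-Python stand-ins for J() and do() (the helper names mimic IML but the code is my sketch, not SAS):

```python
def J(nrow, ncol, value=1):
    """Like IML's J(nrow, ncol, value): an nrow x ncol matrix of value (default 1)."""
    return [[value] * ncol for _ in range(nrow)]

def do(l, u, inc):
    """Like IML's do(l, u, inc): a row vector from l to u in steps of inc."""
    out, v = [], l
    while v <= u:
        out.append(v)
        v += inc
    return out

# do(10, 18, 2) -> [10, 12, 14, 16, 18]; J(3, 4, 2) is a 3 by 4 matrix of 2's
```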
C = A//B adjoins the rows of A and B, creating the matrix

    C = [ A ]
        [ B ]
proc iml;
A = {1 5 8, 5 3 6, 8 6 4};
B = { 5 2 4, 1 3 2, -5 6 7};
C = A||B;
C2 = A//B;
print A B C C2;
A B
1 5 8 5 2 4
5 3 6 1 3 2
8 6 4 -5 6 7
C C2
1 5 8 5 2 4 1 5 8
5 3 6 1 3 2 5 3 6
8 6 4 -5 6 7 8 6 4
5 2 4
1 3 2
-5 6 7
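The two concatenation operators are easy to mimic with Python lists (a sketch, using the same A and B as in the IML example above):

```python
A = [[1, 5, 8], [5, 3, 6], [8, 6, 4]]
B = [[5, 2, 4], [1, 3, 2], [-5, 6, 7]]

C = [arow + brow for arow, brow in zip(A, B)]   # A||B: columns adjoined
C2 = A + B                                      # A//B: rows adjoined (A over B)
# C has 3 rows of 6 entries; C2 has 6 rows of 3 entries, as in the output above
```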
10.2 Naming rows and columns, formats, reset
• You can specify row and column names as well as a format right in the print command.
• The mattrib command can be used to attach certain attributes, including row and column names, to matrices. When the matrix is printed the row and column names will be printed.

• Matrices that do not have row and column names from a mattrib, or with them specified in the print statement, will not have row and column names. This is the default. If reset autoname; is used, then matrices that don't have row or column names will be printed with row1, etc. and col1, etc. as the headings. reset noautoname; will return to using no row and column names for such matrices.

• By default the name of the matrix will be printed. reset noname; will leave the matrix name out in printing. reset name; will return to printing the matrix name.

• You can specify the format for the entries in the matrix by putting format w.d either right in the print statement or in the mattrib. In either case this goes within the brackets, similar to rowname and colname in the examples below.
options ls=80 nodate;
proc iml; /* this starts iml */
a = {1 5 8, 5 3 6, 8 6 4};
b = { 5 2 4, 1 3 2, -5 6 7};
rnames = {mon tue wed};
cnames = {group1 group2 group3};
print a;
print a[rowname=rnames colname=cnames];
mattrib a rowname = (rnames) colname=(cnames);
reset autoname spaces=3; /*spaces = n puts n spaces between matrices*/
print a;
print b;
reset noautoname;
print a;
print b;
reset autoname;
print a b; /* ?? this leads to strange column headings for b
since a has names and b does not */
quit;
A
1 5 8
5 3 6
8 6 4
A
GROUP1 GROUP2 GROUP3
MON 1 5 8
TUE 5 3 6
WED 8 6 4
B
COL1 COL2 COL3
ROW1 5 2 4
ROW2 1 3 2
ROW3 -5 6 7
A
GROUP1 GROUP2 GROUP3
MON 1 5 8
TUE 5 3 6
WED 8 6 4
B
5 2 4
1 3 2
-5 6 7
B
A GROUP1 GROUP2 GROUP3 Col5 Col6 Col7
MON 1 5 8 5 2 4
TUE 5 3 6 1 3 2
WED 8 6 4 -5 6 7
10.3 Reading from a SAS data set into matrices.
Within IML, suppose an internal SAS data set, called b, has already been created from a data step. To open the data set the command is use b;. When done with the data set, the command close b; is used.

The general form of the read statement is: read range var operand where(expression) into matname;

where the boldface indicates fixed SAS language and the other pieces are specified by you. See the documentation for full detail.
range specifies the range of cases to consider. Here are some common uses
• all = all observations.
• current = current observation.
• next n = next n observations where n is an integer.
• after = all observations after the current one.
operand specifies a list of variables. It can be a list of variable names in brackets (from the SAS data set), all = all variables, num = all numeric variables, or char = all character variables. With no var clause the default is all numeric variables.
where is optional. It only uses cases that satisfy the expression.
into specifies the matrix, called matname, to put things in. If there is no into clause, then each variable will be read into a column vector (a matrix with one column) where the name of the matrix is the same as the name of the original sas variable. Note that the same name can be used in the sas data set and in iml.
options ps=60 ls=80;
data a;
infile ’speed2’;
input id sex vo35 vo4 vo45 vo5 vo55 vo6;
proc print;
run;
proc iml;
use a;
read all into bigmat;
read all var {id};
read all var {id sex} into b;
read all var{vo35 vo4 vo45 vo5 vo55 vo6} where(sex=1) into vmat1;
read all var{vo35 vo4 vo45 vo5 vo55 vo6} where(sex=2) into vmat2;
read next 3 var {id} into id1;
read next 5 var {id} into id2;
close a;
print bigmat; print id; print b; print vmat1;
print vmat2; print id1; print id2; quit; run;
OBS ID SEX VO35 VO4 VO45 VO5 VO55 VO6
1 1 1 15.7 18.4 22.0 34.8 45.3 51.1
2 2 1 14.8 18.0 25.1 29.6 44.8 51.3
3 3 1 16.6 21.0 25.5 35.2 48.9 49.1
4 4 1 14.6 19.2 24.2 30.9 36.2 47.6
5 5 1 13.0 18.2 20.4 25.1 33.5 48.9
6 6 1 18.4 21.5 25.7 34.1 42.6 48.5
7 7 2 14.9 18.9 23.9 31.6 42.9 42.0
8 8 1 16.3 19.7 26.5 31.1 39.5 42.6
...
23 23 1 15.4 17.5 22.8 31.6 43.0 51.2
24 24 2 14.3 16.9 23.9 36.8 37.4 36.2
BIGMAT
1 1 15.7 18.4 22 34.8 45.3 51.1
2 1 14.8 18 25.1 29.6 44.8 51.3
...
23 1 15.4 17.5 22.8 31.6 43 51.2
24 2 14.3 16.9 23.9 36.8 37.4 36.2
ID
1
2
3
.
.
23
24
B
1 1
2 1
3 1
.........
22 2
23 1
24 2
VMAT1
15.7 18.4 22 34.8 45.3 51.1
14.8 18 25.1 29.6 44.8 51.3
16.6 21 25.5 35.2 48.9 49.1
14.6 19.2 24.2 30.9 36.2 47.6
13 18.2 20.4 25.1 33.5 48.9
18.4 21.5 25.7 34.1 42.6 48.5
16.3 19.7 26.5 31.1 39.5 42.6
15.6 16.9 22.1 27.9 38.2 39.5
19.4 21.3 24.6 30.6 40.1 42.2
14.1 15.6 20.4 26 33.7 43.1
12.2 16.2 19 25.4 36.6 41.8
14.8 16.8 20.4 26.8 33.1 41.3
16.1 18.8 23 26.3 32.7 39.2
14.5 17.7 22 32.4 36.6 40.8
15.4 17.5 22.8 31.6 43 51.2
VMAT2
14.9 18.9 23.9 31.6 42.9 42
14.7 19.2 22.2 32.1 33.1 39.5
15.3 17.9 25 35.3 38 40.8
14.5 16.7 27.1 38.1 41.3 45
14 15.9 21 26.3 34.1 39.1
15.2 17.8 22.4 26.5 36.5 44.3
16.5 22.2 25.9 35.1 39.5 43.6
12.6 15.4 18.7 24.3 31.9 39.9
14.3 16.9 23.9 36.8 37.4 36.2
ID1 ID2
1 4
2 5
3 6
7
8
10.4 Example. Computing descriptive statistics etc. in a one-sample problem
Illustrating the use of Proc IML to run descriptive statistics and carry out statistical inference in a one-sample setting. This shows:

1. Calculation of the mean, sd, variance and standard error of the mean.

2. Calculation of confidence intervals for the population mean, variance, and standard deviation under the assumption of normality.

3. How to sort a vector using the rank function and then use this to get quantiles.
title ’Illustrating descriptive statistics with the LA heart data’;
options ls=80 ps=60 nodate;
data values;
infile ’ladata’;
input id age md50 sp50 dp50 ht50 wt50 sc50
soec cs md62 sp62 dp62 sc62 wt62 ihdx yrdth;
run;
proc iml;
use values;
read all var{wt50} into x;
close values;
print x;
n = nrow(x);
/* get mean, variance and standard deviation */
mean = sum(x)/n;
var = (ssq(x) - (n*(mean**2)))/(n-1);
sd=sqrt(var);
reset noname; /* name of matrices will not be printed*/
print "n = " n "mean: " mean "st. deviation: " sd "variance:" var;
/* get 95% confidence interval for the mean and variance */
a=.05;
se = sd/sqrt(n);
tval=tinv(1-a/2,n-1);
upper = mean + tval*se;
lower = mean - tval*se;
print "standard error = " se ;
print "95% CI for mean = " lower " to " upper;
lval = cinv((a/2),n-1);
uval = cinv(1-a/2,n-1);
lowvar = (n-1)*var/uval;
upvar = (n-1)*var/lval;
lowsd=sqrt(lowvar);
upsd=sqrt(upvar);
print "95% CI for variance = " lowvar " to " upvar;
print "95% CI for standard dev. = " lowsd " to " upsd;
/* Organizing results in a matrix for better looking output*/
print "Estimates and 95% confidence intervals";
cimat=J(3,3);
cimat[1,1]=mean;
cimat[1,2]=lower;
cimat[1,3]=upper;
cimat[2,1]=var;
cimat[2,2]=lowvar;
cimat[2,3]=upvar;
cimat[3,1]=sd;
cimat[3,2]=lowsd;
cimat[3,3]=upsd;
rname={Mean Variance StDeviation};
cname={Estimate LowerCI UpperCI};
reset noname;
print cimat [rowname=rname colname=cname format=6.2];
/* First create a ranked version of x */
rx=J(n,1); /* initializes rx */
rx[rank(x)]=x; /* rx has the ranked values */
reset name;
print rx x;
/* get quantiles using definition 5 in SAS proc univariate*/
per={1,5,25,50,75,95,99}; /* specify percentiles of interest */
q=per;
k=nrow(per);
do i = 1 to k;
p=per[i]/100;
j = int(n*p);
g = n*p - j;
if g=0 then q[i]=.5*(rx[j]+rx[j+1]);
else q[i] = rx[j+1];
end;
print per q;
quit;
run;
Illustrating descriptive statistics with the LA heart data 1
X
147
167
... middle 196 cases omitted
138
157
n = 200 mean: 168.075 st. deviation: 26.639589 variance: 709.66771
standard error = 1.8837034
95% CI for mean = 164.36042 to 171.78958
95% CI for variance = 588.53168 to 872.68866
95% CI for standard dev. = 24.259672 to 29.541304
Estimates and 95% confidence intervals
ESTIMATE LOWERCI UPPERCI
MEAN 168.08 164.36 171.79
VARIANCE 709.67 588.53 872.69
STDEVIATION 26.64 24.26 29.54
RX X
109 147
120 167
... middle 196 cases omitted
235 138
245 157
PER Q
1 120
5 129
25 147
50 165
75 189
95 213
99 234
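The percentile rule used above (definition 5 in PROC UNIVARIATE) translates directly to Python; note the shift to 0-based indexing (this sketch and its function name are mine):

```python
def quantile_def5(x, p):
    """Percentile of x at proportion p (0 < p < 1) per SAS definition 5:
    average adjacent order statistics when n*p is an integer, otherwise
    take the next order statistic. Mirrors the IML loop, 0-based."""
    rx = sorted(x)
    n = len(rx)
    j = int(n * p)       # integer part of n*p
    g = n * p - j        # fractional part
    if g == 0:
        return 0.5 * (rx[j - 1] + rx[j])   # rx[j], rx[j+1] in 1-based IML
    return rx[j]                           # rx[j+1] in 1-based IML

data = list(range(1, 11))
# quantile_def5(data, 0.25) -> 3; quantile_def5(data, 0.5) -> 5.5
```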
10.5 Missing values
Missing values must be handled carefully in IML. For example, missing values get read into a matrix or vector unless excluded. Certain functions, such as sum and ssq, will ignore missing values. A function like nrow or ncol will not account for missing values in any way. If we ran the program above on a data set with missing values, then the analysis would be wrong. The following illustrates how to handle this. Not accounting for the missing values gives the wrong answer in meany.
options ls=80 ps=60 nodate;
data values;
infile ’test.dat’;
input v;
proc print;
run;
proc iml;
use values;
read all var{v} where(v ^= .) into x; /* ^= represents not equal */
read all var{v} into y;
close values;
print x y;
n = nrow(x);
mean=sum(x)/n;
ny=nrow(y);
meany=sum(y)/ny;
print n mean ny meany;
quit;
run;

OBS  V        X  Y
 1   4        4  4
 2   .        5  .
 3   5        6  5
 4   6        8  6
 5   .           .
 6   8           8

N  MEAN  NY  MEANY
4  5.75   6  3.8333333
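The same pitfall can be shown in Python, with None standing in for SAS's missing value "." (data taken from the example above):

```python
v = [4, None, 5, 6, None, 8]

x = [val for val in v if val is not None]   # like where(v ^= .): drop missings
mean = sum(x) / len(x)                      # correct: 23/4 = 5.75

# ignoring missings in the sum but not in the count reproduces the wrong meany
ny = len(v)
meany = sum(val for val in v if val is not None) / ny   # 23/6 = 3.8333...
```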
10.6 Writing to external files in IML.
This is done via a put statement with the same usage as in the data step. It can include text, in quotes, or variable names. If you want to put an element from a matrix, it needs to be put in parentheses.

As seen in a later section, you can also save values to a temporary SAS file and then save it permanently, as a SAS file or a text file; see Sections 4 and 14.2 of Part I of the notes.

Going back to the example creating a goodness-of-fit test statistic: if we add the file and put statements to the do loop near the end
do i = 1 to 10;
obs[i]=.1*n;
chisq=chisq + ((obs[i] - exp[i])**2)/exp[i];
file ’goodfit.dat’;
put (d[i]) +1 (obs[i]) +1 (exp[i]) 5.2; /* need to put elements in parentheses*/
end;
it creates the text file goodfit.dat which looks like:
137 20 24.34
144 20 12.27
150 20 13.13
158.5 20 22.18
165 20 18.88
171.5 20 19.42
183.5 20 33.51
192 20 19.34
203 20 17.93
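A stdlib-Python sketch of the same formatted write (the put 5.2 format maps to a 5.2f format spec; the two data rows are copied from the output above):

```python
rows = [(137, 20, 24.34), (144, 20, 12.27)]   # d, obs, exp from the example
with open("goodfit.dat", "w") as f:
    for d, obs, exp in rows:
        f.write(f"{d} {obs} {exp:5.2f}\n")    # width 5, 2 decimals, like put ... 5.2
```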
10.7 Example: Statistics for multiple groups.
Suppose there is a variable, say group, which defines different groups or categories, and we want to get various statistics for another variable for each of these groups. With two groups, this would be the starting point for the various inferences we make on differences in means, ratios of variances, etc.

One way to do this is to use the where(group = ...) option to select those with group equal to a particular value and put this into a vector, with a different vector for each group. In this case you need to have one read for each group and know the values of group ahead of time. This is not very elegant.

A better option is to use the design function, which creates a matrix of 0's and 1's indicating group membership. This has the advantage of easily generalizing to multiple groups and not needing to know how many groups or what the group indicators are beforehand.

Using the nutrition example with method as the grouping variable. Run first on two groups.
option linesize=80 pagesize=60 nodate;
data a;
infile ’nut2.dat’;
input id method lnrsty k1-k3 a1-a3 b1-b2;
changeb=b2-b1;
if method=3 or method=2;
run;
proc iml;
use a;
read all var{changeb} where(method=2 & changeb ^=.) into met2;
read all var{changeb} where(method=3 & changeb ^=.) into met3;
read all var{changeb} where (changeb ^= .) into changeb;
read all var{method} where (changeb ^=.) into method;
/* or could create sas data set beforehand eliminating cases
where changeb is missing */
designm=design(method);
close a;
print met2 met3;
/* getting statistics for each group using separate vectors */
print "group1 = method 2, group2 = method 3 ";
n1=nrow(met2);
x1bar=sum(met2)/n1;
var1 = (ssq(met2) - (n1*(x1bar**2)))/(n1-1);
print n1 x1bar var1;
n2=nrow(met3);
x2bar=sum(met3)/n2;
var2 = (ssq(met3) - (n2*(x2bar**2)))/(n2-1);
print n2 x2bar var2;
print method designm;
ngroup=ncol(designm);
n=J(ngroup,1);
mean=J(ngroup,1);
var=J(ngroup,1);
do j = 1 to ngroup;
n[j]=sum(designm[,j]);
obs=designm[,j]#changeb;
print j obs;
mean[j]=sum(obs)/n[j];
var[j] = (ssq(obs) - (n[j]*(mean[j]**2)))/(n[j]-1);
end;
print n mean var;
quit;
run;
N1 X1BAR VAR1
66 1.5606061 11.32704
N2 X2BAR VAR2
57 0.8245614 14.754386
METHOD DESIGNM
3 0 1
3 0 1
2 1 0
..............
3 0 1
2 1 0
N MEAN VAR
66 1.5606061 11.32704
57 0.8245614 14.754386
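The design-matrix trick carries over directly to other languages; here is a Python sketch on tiny made-up data (the function name and the data are mine, not the nutrition example):

```python
def group_stats(group, y):
    """n, mean and sample variance per group via a 0/1 indicator (design)
    matrix, mirroring IML's design() and the do-loop above; group levels
    are taken in sorted order."""
    levels = sorted(set(group))
    design = [[1 if g == lev else 0 for lev in levels] for g in group]
    stats = {}
    for j, lev in enumerate(levels):
        obs = [row[j] * val for row, val in zip(design, y)]  # zeros elsewhere
        n = sum(row[j] for row in design)
        mean = sum(obs) / n
        var = (sum(o * o for o in obs) - n * mean ** 2) / (n - 1)  # ssq - n*mean^2
        stats[lev] = (n, mean, var)
    return stats
```

The zeros contributed by the other groups drop out of both the sum and the sum of squares, which is exactly why the IML version can multiply the design column into the whole data vector.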
The following makes this into a macro to handle an arbitrary number of groups and uses the design function.
option linesize=80 pagesize=60 nodate;
%macro gstat(group,y);
data a;
infile ’g:/s597/data/nut2.dat’;
input id method lnrsty k1-k3 a1-a3 b1-b2;
changeb=b2-b1;
y = &y;
group =&group;
/*y = changeb;
group=method;*/
run;
proc iml;
use a;
read all var{y} where (y ^= .) into y;
read all var{group} where (y ^=.) into group;
designm=design(group);
close a;
/* getting statistics for each group using separate vectors*/
print group designm;
ngroup=ncol(designm);
n=J(ngroup,1);
mean=J(ngroup,1);
var=J(ngroup,1);
do j = 1 to ngroup;
n[j]=sum(designm[,j]);
obs=designm[,j]#y;
mean[j]=sum(obs)/n[j];
var[j] = (ssq(obs) - (n[j]*(mean[j]**2)))/(n[j]-1);
end;
print n mean var;
run;
quit;
%mend;
%gstat(method,changeb)
OUTPUT SHOWS JUST A PORTION OF THE DESIGN MATRIX.
group designm
.....
1 1 0 0
3 0 0 1
1 1 0 0
2 0 1 0
3 0 0 1
1 1 0 0
1 1 0 0
3 0 0 1
3 0 0 1
2 0 1 0
1 1 0 0
2 0 1 0
2 0 1 0
.....
n mean var
78 2.4615385 14.017982
66 1.5606061 11.32704
57 0.8245614 14.754386
10.8 Example: Multivariate observations.
Suppose the data consist of m variables on each of n “individuals/units”. Typically the data will be initially arranged in the usual way, with a row in the data set being an “individual” and the m variables being the columns in the data set (but the roles of individuals and variables could switch). Write x_ij for the jth variable for the ith individual (i = 1 to n, j = 1 to m).

The sample mean for variable j is x̄_.j = Σ_i x_ij / n (all sums over i are from 1 to n).

The sample mean vector, x̄, is the m × 1 matrix with jth component equal to x̄_.j. As done here, it is typical to define the mean vector as a column vector even though the original data layout may have the variables in a row.

The sample variance for variable j is s_j² = Σ_i (x_ij − x̄_.j)² / (n − 1).

The sample covariance between two variables j and k is s_jk = Σ_i (x_ij − x̄_.j)(x_ik − x̄_.k) / (n − 1). Notice that in this notation we can also write s_jj for s_j².

The sample covariance matrix, denoted S, is the m × m matrix with s_jk for the (j, k)th element.

The sample correlation between two variables j and k is r_jk = s_jk / (s_j s_k). (r_jj = 1 = correlation of a variable with itself.)

The sample correlation matrix, denoted R, is the m × m matrix with r_jk for the (j, k)th element.

Define the m × 1 vector x_i = (x_i1, x_i2, ..., x_im)′.

In matrix form x̄ = Σ_i x_i / n,

S = Σ_i (x_i − x̄)(x_i − x̄)′ / (n − 1) = (Σ_i x_i x_i′ − n x̄ x̄′) / (n − 1),

and R = D⁻¹ S D⁻¹, where D is an m × m diagonal matrix with (j, j) element s_j.

The sample mean, covariance and correlation matrix are at the heart of most multivariate statistical techniques.
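These matrix formulas can be checked with a small pure-Python sketch (the toy 3 × 2 data and the function name are mine, not the stock data used next):

```python
def cov_corr(rows):
    """Mean vector, covariance matrix S and correlation matrix R for an
    n x m data matrix, using S = (sum_i x_i x_i' - n xbar xbar')/(n-1)."""
    n, m = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(m)]
    S = [[(sum(r[j] * r[k] for r in rows) - n * mean[j] * mean[k]) / (n - 1)
          for k in range(m)] for j in range(m)]
    sd = [S[j][j] ** 0.5 for j in range(m)]           # the diagonal of D
    R = [[S[j][k] / (sd[j] * sd[k]) for k in range(m)] for j in range(m)]
    return mean, S, R

# two perfectly linearly related variables should have correlation 1
mean, S, R = cov_corr([[1, 2], [2, 4], [3, 6]])
```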
Example. This data, taken from Johnson and Wichern’s Applied Multivariate Statistical Analysis, consists of 100 weeks of rates of return on five stocks (Allied Chemical, duPont, Union Carbide, Exxon, and Texaco). This illustrates how to get the means, variances, covariances and correlations directly via matrix multiplication, and then computes some measures of how “synchronized” the different stock series are over time.
title ’Illustrating multivariate statistics with stock data’;
options ls=80 ps=60 nodate;
data values;
infile ’stock.dat’;
input allied dupont unionc exxon texaco;
*proc print;
run;
proc corr cov;
run;
proc iml;
use values;
read all into xval; /* xval is 100 by 5 */
close values;
x=t(xval);
n = nrow(xval);
m=ncol(xval);
print n m;
do i = 1 to n;
x[,i]=t(xval[i,]);
end;
/* x is 5 by 100 with ith column containing the 5 stocks
at time i*/
mean = x[,+]/n; /* x[,+] sums each row across its columns (row sums) */
print "mean vector " mean;
sumxtx=J(m,m,0);
do i = 1 to n;
sumxtx = sumxtx + x[,i]*t(x[,i]);
end;
s = (sumxtx - n*mean*t(mean))/(n-1);
print "covariance matrix " s;
d=I(m);
do j = 1 to m;
d[j,j] = sqrt(s[j,j]);
end;
r = inv(d)*s*inv(d);
print "correlation matrix" r;
/* getting the correlation matrix using the corr module */
rm=corr(xval);
print "correlation matrix from the module" rm;
/* Getting the average pairwise correlation among all pairs of
stocks. This is one way to measure synchrony over time. */
np = m*(m-1)/2; /* = number of distinct pairs */
total=0;
do j = 1 to m-1;
do k = j+1 to m;
total =total+r[j,k];
end;
end;
aver=total/np;
print "average correlation" aver;
/* For each stock j, create a matrix whose ith entry is
1 if there is an increase from time i to time i+1,
-1 if there is a decrease from time i to time i+1,
0 if there is no change. */
change=J(n-1,m);
do j = 1 to m;
do i = 1 to n-1;
change[i,j]=xval[i+1,j]-xval[i,j];
if change[i,j] > 0 then change[i,j]=1;
if change[i,j] < 0 then change[i,j]=-1;
if change[i,j] = 0 then change[i,j]=0;
end;
end;
total2=0;
count=0;
agree=J(np,3);
do j =1 to m-1;
do k = j+1 to m;
count=count+1;
sum2=0;
do i = 1 to n-1;
if change[i,j]=change[i,k] then sum2=sum2+1;
/* keeps count of where change is in the same direction */
end;
pagree = sum2/(n-1); /* prop. agreeing for series j and k */
agree[count,1]=j;
agree[count,2]=k;
agree[count,3]=pagree;
total2=total2+pagree;
end;
end;
reset noname spaces=4;
names = {series1 series2 propAgree};
print agree [colname=names];
averp = total2/np;
print "average proportion agreement" averp;
quit;
run;
Illustrating multivariate statistics with stock data 1
OBS ALLIED DUPONT UNIONC EXXON TEXACO
1 0.00000 0.00000 0.00000 0.039473 0.000000
2 0.02703 -0.04486 -0.00303 -0.014466 0.043478
. . .
98 0.045454 0.046375 0.074561 0.014563 0.018779
99 0.050167 0.036380 0.004082 -0.011961 0.009216
100 0.019108 -0.033303 0.008362 0.033898 0.004566
Covariance Matrix DF = 99
ALLIED DUPONT UNIONC EXXON TEXACO
ALLIED 0.0016299269 0.0008166676 0.0008100713 0.0004422405 0.0005139715
DUPONT 0.0008166676 0.0012293759 0.0008276330 0.0003868550 0.0003109431
UNIONC 0.0008100713 0.0008276330 0.0015560763 0.0004872816 0.0004624767
EXXON 0.0004422405 0.0003868550 0.0004872816 0.0008023323 0.0004084734
TEXACO 0.0005139715 0.0003109431 0.0004624767 0.0004084734 0.0007587370
Simple Statistics
Variable N Mean Std Dev Sum Minimum Maximum
ALLIED 100 0.00543 0.04037 0.54337 -0.09665 0.12281
DUPONT 100 0.00483 0.03506 0.48271 -0.07576 0.11881
UNIONC 100 0.00565 0.03945 0.56542 -0.09146 0.10262
EXXON 100 0.00629 0.02833 0.62914 -0.05313 0.08855
TEXACO 100 0.00371 0.02755 0.37085 -0.05051 0.08247
Pearson Correlation Coefficients / Prob > |R| under Ho: Rho=0 / N = 100
ALLIED DUPONT UNIONC EXXON TEXACO
ALLIED 1.00000 0.57692 0.50866 0.38672 0.46218
0.0 0.0001 0.0001 0.0001 0.0001
etc. rest eliminated.
N M
100 5
mean vector MEAN
0.0054337
0.0048271
0.0056542
0.0062914
0.0037085
covariance matrix S
0.0016299 0.0008167 0.0008101 0.0004422 0.000514
0.0008167 0.0012294 0.0008276 0.0003869 0.0003109
0.0008101 0.0008276 0.0015561 0.0004873 0.0004625
0.0004422 0.0003869 0.0004873 0.0008023 0.0004085
0.000514 0.0003109 0.0004625 0.0004085 0.0007587
R correlation matrix
1 0.5769244 0.5086555 0.3867206 0.4621781
0.5769244 1 0.5983841 0.3895191 0.3219534
0.5086555 0.5983841 1 0.4361014 0.4256266
0.3867206 0.3895191 0.4361014 1 0.5235293
0.4621781 0.3219534 0.4256266 0.5235293 1
correlation matrix from the module
RM
1 0.5769244 0.5086555 0.3867206 0.4621781
0.5769244 1 0.5983841 0.3895191 0.3219534
0.5086555 0.5983841 1 0.4361014 0.4256266
0.3867206 0.3895191 0.4361014 1 0.5235293
0.4621781 0.3219534 0.4256266 0.5235293 1
AVER
average correlation 0.4629592
SERIES1 SERIES2 PROPAGREE
1 2 0.6767677
1 3 0.7070707
1 4 0.6666667
1 5 0.6767677
2 3 0.7272727
2 4 0.5858586
2 5 0.6161616
3 4 0.6363636
3 5 0.6262626
4 5 0.5858586
average proportion agreement 0.6505051
One should not automatically carry out the usual inferences on an individual series mean or other parameters for a series, nor should one carry out inferences on the correlation between two series in the usual way. This is because the observations in a series will typically have some type of correlation over time, while the usual inferences assume the observations are uncorrelated (or, more strongly, independent). There are ways to examine this through time series techniques (which we won't do here).
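The direction-of-change agreement computed above is easy to cross-check outside SAS. Here is a minimal Python/NumPy sketch of the same calculation; the toy data are hypothetical, not the stock series:

```python
import numpy as np

def prop_agreement(x):
    """For each pair of series (columns of x), the proportion of periods
    in which both move in the same direction (sign of the change)."""
    change = np.sign(np.diff(x, axis=0))  # +1 up, -1 down, 0 no change
    m = change.shape[1]
    out = []
    for j in range(m - 1):
        for k in range(j + 1, m):
            out.append((j + 1, k + 1,
                        float(np.mean(change[:, j] == change[:, k]))))
    return out

# hypothetical toy data: 5 time points, 3 series (not the stock data)
x = np.array([[1.0, 2.0, 3.0],
              [2.0, 1.0, 4.0],
              [3.0, 2.0, 2.0],
              [2.0, 3.0, 3.0],
              [4.0, 4.0, 4.0]])
for j, k, p in prop_agreement(x):
    print(j, k, p)
```

The average of the third column over all pairs corresponds to the "average proportion agreement" printed by the IML code.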
10.9 Creating SAS files from within IML.
Temporary SAS files can be created from within IML using the create and append commands.
10.9.1 Creating a SAS file directly from a matrix
The example below uses the stock example and illustrates creating SAS data sets from the change values and ranks over time for each company.
options ls=80 ps=60 nodate;
data values;
infile ’stock.dat’;
input allied dupont unionc exxon texaco;
run;
proc iml;
use values;
read all into xval; /* xval is 100 by 5 */
close values;
x=t(xval);
n = nrow(xval);
m=ncol(xval);
print n m;
/* Create the matrix change */
change=J(n-1,m);
do j = 1 to m;
do i = 1 to n-1;
change[i,j]=xval[i+1,j]-xval[i,j];
if change[i,j] > 0 then change[i,j]=1;
if change[i,j] < 0 then change[i,j]=-1;
if change[i,j] = 0 then change[i,j]=0;
end;
end;
/* create a sas data set called changed, with no colname option the
default is to name the variables as col1, col2, ... */
create changed from change;
append from change;
/* create a matrix called rankm, where column j has the ranks of
the 100 observations for the company in column j and then
create a sas data set called rankm. NOTE that we can use the sas
data name rankm even though it is also a matrix name in IML. The variables
are named using the colname feature*/
rankm=j(n,m); /*initialize the matrix*/
do j = 1 to m;
rankm[,j]=ranktie(xval[,j]);
end;
names={’r1’ ’r2’ ’r3’ ’r4’ ’r5’};
create rankm from rankm[colname= names];
append from rankm;
quit;
run;
title ’print the sas data set changed’;
proc print data=changed;
run;
title ’print the sas data set rankm’;
proc print data=rankm;
run;
print the sas data set changed 2
OBS COL1 COL2 COL3 COL4 COL5
1 1 -1 -1 -1 1
2 1 1 1 1 1
3 -1 -1 -1 -1 -1
4 1 -1 -1 -1 -1
99 -1 -1 1 1 -1
print the sas data set rankm 4
OBS R1 R2 R3 R4 R5
1 48.0 53 49.5 89.0 47.0
2 73.0 5 45.0 23.0 93.0
3 100.0 95 98.0 99.0 99.0
99 84.0 83 55.0 29 62.0
100 67.0 11 59.0 84 50.0
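IML's RANKTIE assigns tied values their average (mid) rank, which is why some of the printed ranks end in .5. A small Python sketch of the same convention (illustrative, not SAS code):

```python
import numpy as np

def ranktie(v):
    """Ranks 1..n, with tied values given their average (mid) rank --
    the convention used by IML's RANKTIE."""
    v = np.asarray(v, dtype=float)
    order = np.argsort(v, kind="stable")
    ranks = np.empty(len(v))
    i = 0
    while i < len(v):
        j = i
        while j + 1 < len(v) and v[order[j + 1]] == v[order[i]]:
            j += 1  # extend over the block of tied values
        ranks[order[i:j + 1]] = (i + j + 2) / 2.0  # average of ranks i+1..j+1
        i = j + 1
    return ranks

print(ranktie([10, 20, 20, 5]).tolist())  # → [2.0, 3.5, 3.5, 1.0]
```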
10.9.2 Creating a SAS file using variables; with application to simulating random samples
The following illustrates how to create the temporary SAS file "simout" by listing a set of variables in the create statement. Anytime an append is encountered these variables are written to simout. This is a rewrite of the simulation program on page 47 of the 597C notes, but now in IML rather than in the data step. Notice that the if-then check of the confidence interval has to be done differently in IML than in the data step.
option ls=80 nodate;
%macro simn(nsim,n,mu,sigma,alpha);
title "simulate dist. of sample mean and CI, n= &n from N(&mu, &sigma **2)";
proc iml;
create simout var{j true mean var sd low up check};
tval=tinv(1- &alpha/2,&n-1);
true=&mu;
n = &n;
x=J(n,1);
do j = 1 to &nsim;
do i = 1 to n;
x[i] = &mu + &sigma*rannor(0);
end;
mean = sum(x)/n;
var = (ssq(x) - (n*(mean**2)))/(n-1);
sd=sqrt(var);
se = sd/sqrt(n);
low=mean - tval*se;
up=mean + tval*se;
check=0;
if (low <= &mu) & (&mu <= up) then check=1;
/* NOTE if (low <= &mu <= up) then check=1; DOES NOT WORK. CHECK IS ALWAYS 1*/
append; /* writes variables in varlist of create to simout*/
end;
quit;
proc means data=simout;
var mean check;
run;
goptions reset=all;
title1 "Distribution of X-bar: Sample from Normal";
title2 "n= &n from N(&mu, &sigma **2)";
proc univariate;
histogram mean/kernel (color=red);
inset n mean (5.2) std (5.2);
run;
%mend;
%simn(1000,10,4,6,.05);
simulate dist. of sample mean and CI, n= 10 from N(4, 6 **2) 82
The MEANS Procedure
Variable N Mean Std Dev Minimum Maximum
MEAN 1000 4.0648820 1.9396007 -2.0542079 11.2478573
CHECK 1000 0.9450000 0.2280943 0 1.0000000
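For readers who want to sanity-check the simulation outside SAS, here is a Python sketch of the same experiment: coverage of the nominal 95% t-interval for a normal mean. The t critical value for n = 10 is hardcoded, an assumption made to avoid extra dependencies:

```python
import numpy as np

def simulate_coverage(nsim=2000, n=10, mu=4.0, sigma=6.0, seed=1):
    """Monte Carlo coverage of the nominal 95% t-interval for a normal
    mean, mirroring the %simn macro (n = 10 only: tval is hardcoded)."""
    rng = np.random.default_rng(seed)
    tval = 2.262  # t_{0.975, 9}, hardcoded to avoid a scipy dependency
    hits = 0
    for _ in range(nsim):
        x = rng.normal(mu, sigma, size=n)
        se = x.std(ddof=1) / np.sqrt(n)
        if x.mean() - tval * se <= mu <= x.mean() + tval * se:
            hits += 1
    return hits / nsim

print(simulate_coverage())  # should be close to 0.95
```

The simulated coverage should land near .95, matching the mean of CHECK in the SAS output above.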
10.10 User defined modules/functions
When a particular set of statements will be used multiple times, you can create user-defined modules which can then be called. There will be some arguments to the module and some variables that you will want to pass from the module to the main IML program. The latter are specified through the global statement. Any variables that are used in the module but are not in the global statement will not have their values passed to the main IML program. Here is a simple example with a module which computes the sample size, mean, variance and standard deviation for a specified vector.
option linesize=80 pagesize=60 nodate;
data a;
infile ’c:\s597\data\nut2.dat’;
input id method lnrsty k1-k3 a1-a3 b1-b2;
changeb=b2-b1;
if method=3 or method=2;
run;
proc iml;
start meansd(xvec) global(n,mean,var,sd);
n=nrow(xvec);
mean=sum(xvec)/n;
var = (ssq(xvec) - (n*(mean**2)))/(n-1);
sd=sqrt(var);
finish meansd;
use a;
read all var{changeb} where(method=2 & changeb ^=.) into met2;
read all var{changeb} where(method=3 & changeb ^=.) into met3;
close a;
print met2 met3;
/* getting statistics for each group using separate vectors*/
call meansd(met2);
print "group1 = method2", n mean var sd;
call meansd(met3);
print "group2 = method3", n mean var sd;
quit;
run;
group1 = method2
N MEAN VAR SD
66 1.5606061 11.32704 3.3655668
group2 = method3
N MEAN VAR SD
57 0.8245614 14.754386 3.8411438
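A note on design: the GLOBAL statement passes results out of the module by side effect. In most languages one would instead return the values directly; a Python sketch of the same computation, for comparison only:

```python
import math

def meansd(xvec):
    """Analogue of the meansd module: returns (n, mean, var, sd) as a
    tuple instead of writing to globals."""
    n = len(xvec)
    mean = sum(xvec) / n
    # same computational formula as the module: (ssq - n*mean^2)/(n-1)
    var = (sum(v * v for v in xvec) - n * mean ** 2) / (n - 1)
    return n, mean, var, math.sqrt(var)

n, mean, var, sd = meansd([1.0, 2.0, 3.0, 4.0])
print(n, mean, var, sd)
```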
10.11 Simulating simple random sampling.
This first example simulates the performance of an estimate and confidence interval for the mean using a simple random sample of size n from a finite population of size N containing an outlier. Here the population consists of 137 days of a fishing season and the variable we work with is the number of fish caught. This is the 1975 data from Parvin Lake, CO, collected to help decide how to use sampling in future years. The estimate of the mean is the sample mean x̄, which has variance (1 − n/N)σ²/n, where σ² is the population variance (defined here with an N − 1 rather than N, as is often done in the sampling literature). The (1 − n/N) is a finite population correction factor. The estimated standard error of x̄ is SE = [(1 − n/N)s²/n]^{1/2}, where s² is the sample variance, and a (supposed) approximate confidence interval is x̄ ± z_{1−α/2} SE.
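The formulas above can be sketched in a few lines of Python; the numbers below are made up for illustration, not the Parvin Lake data:

```python
import numpy as np

def srs_mean_ci(sample, N, z=1.645):
    """Sample mean, SE with the finite population correction, and a
    normal-based CI (90% by default) for an SRS from a population of size N."""
    y = np.asarray(sample, dtype=float)
    n = len(y)
    ybar = y.mean()
    s2 = y.var(ddof=1)                  # sample variance
    se = np.sqrt((1 - n / N) * s2 / n)  # SE = [(1 - n/N) s^2 / n]^{1/2}
    return ybar, se, (ybar - z * se, ybar + z * se)

# made-up catch counts, not the Parvin Lake data
ybar, se, (lo, hi) = srs_mean_ci([12, 40, 55, 7, 120], N=137)
```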
/* THIS PROGRAM SIMULATES THE PERFORMANCE OF NORMAL BASED CONFIDENCE
INTERVALS FOR THE MEAN AS WELL AS THE SAMPLING DISTRIBUTION OF THE
SAMPLE MEAN WHEN SAMPLING FROM A FINITE POPULATION. */
title ’SRS from Parvin data y = fish caught’;
option ls=80 nodate;
data a;
infile ’e:\s640\lake.csv’;
input month date day $ trips hours kept caught;
y = caught;
run;
title ’Population Values’;
proc univariate data=a noprint;
hist y/kernel (color=blue);
var y;
run;
%macro popsim(ss);
proc iml;
alpha = .05;
use a;
read all var{y} into y;
close a;
n=&ss;
ys=j(n,1); /* initialize vector to hold sample values */
create simout var{ybar samples2 low up count};
/* the population values are in the N by 1 vector y */
bign = nrow(y);
yubar = sum(y)/bign; /* population mean */
popS2 = (ssq(y) - (bign*(yubar**2)))/(bigN-1); /* Population "variance"*/
truese = sqrt((bign - n)*pops2/(bign*n));
title "Sample size &ss";
print "population size = " bign " sample size =" n;
print "true mean = " yubar "population s2 = " pops2;
print "true SE of the mean =" truese;
do j=1 to 1000; /* looping through simulations */
check=J(n,1,0);
**************************************************************************;
* This takes a sample of size n without replacement;
* There are other ways to do this. The method here is not;
* very efficient since it checks through the whole vector each time;
do i=1 to n; /* i indexes position in sample */
label:uv=uniform(0);
k=int(uv*bign+1);
kk=0;
label2:kk=kk+1;
if k=check[kk] then goto label; /* will go to label if the selected value k
had previously been chosen*/
if kk=i then goto skip;
goto label2;
skip: check[i]=k;
ys[i]=y[k];
end;
**************************************************************************;
/* get sample statistics */
ybar = sum(ys)/n;
samples2 = (ssq(ys) - (n*(ybar**2)))/(n-1);
se = sqrt((bign - n)*samples2/(bign*n)); /* SE using F.P.C.F.*/
z = probit(1 - (alpha/2));
low = ybar -z*se;
up = ybar + z*se;
count = 0;
if (low <= yubar) & (yubar <= up) then count=1;
append;
end;
quit;
run;
title "sample size &ss" ;
proc means;
var ybar samples2 count;
run;
/* THE MEAN OF COVER GIVES THE SUCCESS RATE
OF THE NORMAL INTERVAL */
title "sample size &ss" ;
proc univariate noprint;
hist ybar/kernel (color=black);
var ybar;
run;
%mend;
%popsim(10); %popsim(15); %popsim(20); %popsim(50); %popsim(100);
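The GOTO-based rejection loop above works but, as its comments note, is inefficient. An equivalent and simpler approach is a partial Fisher-Yates shuffle of the indices; a Python sketch (illustrative, not SAS):

```python
import random

def srswor(y, n, rng=None):
    """Simple random sample of size n without replacement via a
    partial Fisher-Yates shuffle (no rejection or goto needed)."""
    rng = rng or random.Random()
    idx = list(range(len(y)))
    for i in range(n):
        j = rng.randrange(i, len(idx))   # uniform over the not-yet-chosen
        idx[i], idx[j] = idx[j], idx[i]  # move the chosen index into place
    return [y[k] for k in idx[:n]]

sample = srswor(list(range(137)), 10, random.Random(0))
```

Each call does exactly n swaps, so the cost does not grow with the number of previously chosen values.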
Population Values
BIGN N
population size = 137 sample size = 10
YUBAR POPS2
true mean = 55.554745 population s2 = 4597.5429
TRUESE
true SE of the mean = 20.644505
sample size 10
Variable N Mean Std Dev Minimum Maximum
YBAR 1000 54.6671000 19.3267475 21.7000000 141.2000000
SAMPLES2 1000 4024.75 10251.81 110.6222222 47397.83
COUNT 1000 0.8570000 0.3502480 0 1.0000000
sample size 100
TRUESE
true SE of the mean = 3.5237369
Variable N Mean Std Dev Minimum Maximum
YBAR 1000 55.6671600 3.4915400 45.9300000 62.4500000
SAMPLES2 1000 4667.13 1873.68 1149.80 6031.92
COUNT 1000 0.8370000 0.3695505 0 1.0000000
10.12 Least squares/regression via IML.
Consider a setting where there is an outcome variable Yi for the ith observation (i = 1 to n) and a set of k predictor variables xi1, . . . , xik for observation i. We write xi for the k × 1 column vector containing xi1, . . . , xik. (REMARK: The x's may contain different functions of a common variable, or one of the x's may be a constant. When a model has an intercept term in it, there is an xij which is always 1.)
The objective is to fit some specified function f(xi, β) that relates Yi to xi. This can be viewed as a general numerical curve-fitting problem without any statistical context. In a statistical context, the function f(xi, β) is taken as a model for E(Yi) (the expected value of Yi given xi), which can also be expressed as Yi = f(xi, β) + εi, where εi is random with mean 0.
Least Squares: Find the β which minimizes Q(β) = Σ_{i} (Yi − f(xi, β))².
Linear model (linear in the parameters): f(xi, β) = xi1β1 + · · · + xipβp (with p = k predictors). Define the n × 1 vector Y = (Y1, Y2, . . . , Yn)′ and the n × p matrix X whose (i, j)th element is xij, so that the ith row of X is x′i = (xi1, . . . , xip).
Then Q(β) = (Y − Xβ)′(Y − Xβ). Differentiating and setting equal to zero yields the so-called normal equations X′Xβ = X′Y. If the matrix X has rank p, meaning the columns of X are linearly independent, then X′X is of rank p (i.e., is nonsingular), and the unique solution to the normal equations, one that can be shown to minimize Q(β), is given by

β̂ = (X′X)⁻¹X′Y,

where the hat is used to denote the fact that the solution is viewed as an estimate in a statistical context.

One necessary (but certainly not sufficient) condition for X to be of rank p is that n ≥ p. If you try to compute β̂ and the rank condition is not met, you will get an error message from the inverse function.
Checking the rank. There is no function in IML that automatically returns the rank of a matrix. Without explaining why it works, here is one way to get the rank of the matrix a.
ranka=sum(eigval( a*ginv(a)));
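The same identity can be checked with NumPy: a·ginv(a) is a projection matrix, so its eigenvalues are all 0 or 1 and they sum to the rank. A sketch on a deliberately rank-deficient matrix (made up for illustration):

```python
import numpy as np

# a * ginv(a) is the projection onto the column space of a; its
# eigenvalues are 0 or 1, and they sum to rank(a).
a = np.array([[1.0, 2.0],
              [2.0, 4.0],
              [3.0, 6.0]])          # rank 1: column 2 = 2 * column 1
proj = a @ np.linalg.pinv(a)
ranka = int(round(np.sum(np.linalg.eigvals(proj)).real))
print(ranka)  # → 1
```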
The ith residual is ri = Yi − Ŷi, where Ŷi = f(xi, β̂) is the ith fitted value.
The minimum value of Q which is obtained is SSE = Σ_{i=1}^{n} (Yi − Ŷi)² = Σ_{i=1}^{n} ri².
In a statistical context, suppose the errors (the εi) are uncorrelated, each with mean 0 and variance σ². Then the covariance matrix of β̂ is Cov(β̂) = σ²(X′X)⁻¹, which is estimated by covb = σ̂²(X′X)⁻¹ with σ̂² = SSE/(n − p), an unbiased estimate of σ². The square root of the jth diagonal element of covb gives the estimated standard error of β̂j, the jth estimated coefficient.
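As a cross-check on the IML computation that follows, here is the same normal-equations arithmetic in Python/NumPy on simulated data (the data and coefficients here are made up for illustration):

```python
import numpy as np

# Simulated data (made up for illustration): intercept plus two predictors.
rng = np.random.default_rng(0)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([2.0, 1.0, -0.5]) + rng.normal(scale=0.3, size=n)

xtx_inv = np.linalg.inv(X.T @ X)
betahat = xtx_inv @ X.T @ y          # (X'X)^{-1} X'Y
resid = y - X @ betahat              # observed minus fitted
sse = float(resid @ resid)           # sum of squared residuals
mse = sse / (n - p)                  # sigma-hat squared
covb = mse * xtx_inv                 # estimated Cov(betahat)
se = np.sqrt(np.diag(covb))          # standard errors of the coefficients
```

(In practice one would use a QR-based solver rather than forming (X′X)⁻¹ explicitly; the inverse is shown only to mirror the formulas above.)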
Example. Using SMSA data (from homework 4). Regress Y = mortal on rain and so2pot.
title ’Regression with two variables- proc reg and IML’;
options pagesize=60 linesize=80;
data a;
infile ’g:/s597/data/smsa.dat’;
input city $ jantemp julytemp relhum rain mortal educ popden pnwhite pwc
pop perhouse income hcpot noxpot so2pot nox;
con = 1;
run;
proc reg;
model mortal = rain so2pot;
run;
/* doing linear least squares in IML */
proc iml;
use a;
read all var{mortal} into y;
read all var {con rain so2pot} into x;
* read all var{rain so2pot} into x2;
close a;
/* instead of defining the variable con = 1 in the data step as above,
we can create x from x2 as follows:
jvec = j(nrow(x),1);
x = jvec||x2; */
p=ncol(x);
n=nrow(x);
betahat = inv(t(x)*x)*t(x)*y;
residual=y - x*betahat; /*observed minus fitted values */
sse = t(residual)*residual; /*sum of squared residuals */
mse = sse/(n-p); /* estimate of variance */
covb= mse*inv(t(x)*x); /* estimate of variance covariance
matrix of coefficients */
sevec = J(p,1);
do j = 1 to 3;
sevec[j] = sqrt(covb[j,j]);
end;
print y x residual;
print "coefficients " betahat "standard errors" sevec;
rankx=sum(eigval( x*ginv(x))); /* This gives you the rank of
the matrix X */
print p rankx;
reset noname;
quit;
run;
Regression with two variables- proc reg and IML 9
The REG Procedure
Dependent Variable: mortal
Number of Observations Read 60
Number of Observations Used 60
Sum of Mean
Source DF Squares Square F Value Pr > F
Model 2 96848 48424 20.98 <.0001
Error 57 131550 2307.89804
Corrected Total 59 228398
Root MSE 48.04059 R-Square 0.4240
Dependent Mean 940.34867 Adj R-Sq 0.4038
Coeff Var 5.10881
Parameter Estimates
Parameter Standard
Variable DF Estimate Error t Value Pr > |t|
Intercept 1 811.77692 23.13109 35.09 <.0001
rain 1 2.68223 0.54710 4.90 <.0001
so2pot 1 0.47648 0.09939 4.79 <.0001
PROC IML OUTPUT
Regression with two variables- proc reg and IML 10
Y X RESIDUAL
921.87 1 36 59 -14.57959
997.87 1 35 39 73.632241
962.35 1 44 33 16.831034
1003.5 1 65 42 -2.634157
895.7 1 65 8 -94.23383
911.82 1 62 49 -89.60282
954.44 1 38 39 22.155545
BETAHAT SEVEC
coefficients 811.77692 standard errors 23.131091
2.6822319 0.547099
0.4764801 0.0993883
P RANKX
3 3
10.13 Non-linear least squares and calling subroutines.
IML also has a large number of built-in subroutines. A subroutine is invoked through a call statement, with various arguments provided for the call. A number of these subroutines deal with optimization and least squares. We illustrate with one of the least squares subroutines, NLPLM, using only the first five arguments for the subroutine. There are other arguments, but these can be omitted since they are at the end. If an optional argument is skipped earlier in the list of arguments of a subroutine, a place must be allowed for it; that is, there would be successive commas. The general form of the call to this particular subroutine, where there are p parameters to be fit, is:
call nlplm(rc,bres,"FUN",b,opt);
where rc contains return codes, bres is a p × 1 vector containing the result, FUN is a user-defined function that defines the least squares problem, b is a p × 1 vector of starting values and opt is a vector containing some options. All of these names are the choice of the user (besides nlplm).
Example: Consider the cigarette smoking and cancer data. Suppose we want to fit a model of the form β1x^β2 for how the incidence of bladder cancer Y relates to cigarette smoking, x. To get starting values we can fit log(Y) = γ1 + γ2 log(x); if the relationship were deterministic the coefficients would relate via β2 = γ2 and γ1 = log(β1), so β1 = e^{γ1}. We will do a linear fit on the log-transformed variables to get starting values and then run nonlinear least squares.
option ls=80 nodate;
data a;
infile ’cig.dat’;
input state $ cig blad lung kid leuk;
logc=log(cig);
logb=log(blad);
con=1;
run;
proc iml;
use a;
read all var{logb} into ly;
read all var {con logc} into lx;
read all var{blad} into y;
read all var{cig} into x;
close a;
/* doing linear least squares of log(blad) versus log(cig)
to get starting values */
blog= inv(t(lx)*lx)*t(lx)*ly;
print ’linear least squares solution of log(y) on log(x)’ blog;
p=2;
n=nrow(x);
bres=J(2,1,1);
b=bres;
b[1]=exp(blog[1]);
b[2]=blog[2];
opt=j(2,1,0);
opt[1]=n; /* for nlplm the first component of opt must be the number
of observations. The second component controls the printing.
Setting it equal to 0 as here, leads to no automatic output */
/* define the function going into least squares */
start FUN(b) global(n,x,y);
f=j(n,1,0);
do i = 1 to n;
f[i] = y[i] - b[1]*(x[i]**b[2]);
end;
return(f);
finish FUN;
call nlplm(rc,bres,"FUN",b,opt);
print rc bres b;
quit;
run;
BLOG
linear least squares solution of log(y) on log(x) -0.966781
0.7380888
RC BRES B
6 0.3744682 0.7472444 0.3803052
0.7380888
The return code of 6 indicates successful convergence. The least squares solution is β̂1 = .3745 and β̂2 = .7472. (In general, multiple starting points should be tried, as nonlinear least squares routines can sometimes be sensitive to starting values.)
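The same kind of fit can be reproduced outside IML. Below is a hand-rolled Gauss-Newton sketch in Python for the model β1 x^β2, a stand-in for NLPLM (which uses Levenberg-Marquardt, not this exact algorithm), run on simulated data rather than the cigarette data:

```python
import numpy as np

def fit_power(x, y, b0, iters=50):
    """Gauss-Newton least squares for the model y ~ b1 * x**b2
    (a hand-rolled stand-in for IML's NLPLM, not the same algorithm)."""
    b = np.array(b0, dtype=float)
    for _ in range(iters):
        f = y - b[0] * x ** b[1]                          # residual vector
        J = np.column_stack([-x ** b[1],                  # d f / d b1
                             -b[0] * x ** b[1] * np.log(x)])  # d f / d b2
        step = np.linalg.lstsq(J, -f, rcond=None)[0]
        b += step
        if np.max(np.abs(step)) < 1e-10:
            break
    return b

# simulated data from a power model (made up, not the cigarette data)
rng = np.random.default_rng(0)
x = rng.uniform(10.0, 45.0, size=40)
y = 0.4 * x ** 0.75 + rng.normal(scale=0.1, size=40)
slope, intercept = np.polyfit(np.log(x), np.log(y), 1)
b = fit_power(x, y, [np.exp(intercept), slope])  # start from the log-log fit
```

As in the text, the starting values come from the linear fit on the log scale.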
10.14 Bootstrapping.
Bootstrapping refers to the use of resampling techniques to get estimates of the bias and standard error of estimates, or to get confidence intervals or tests of hypotheses for parameters. The resampling involves some type of sampling "from the original data", where the resampling is done in a way which mimics the way the original data was generated. The bootstrap has become very popular in recent years; see Efron and Tibshirani, "An Introduction to the Bootstrap" (1993) or Davison and Hinkley, "Bootstrap Methods and Their Application" (1997) for introductions.
Here, we consider a single-sample problem with one parameter of interest, where the bootstrap is easiest to define. I am not going to try to describe here why the bootstrap works, or even many aspects of how it works, but rather describe computationally how it works and how it can be easily implemented in IML. Before collecting the data, the random sample can be viewed as a collection of n random variables W1, . . . , Wn, which are independent and identically distributed (which is the meaning of "random sample"). The observed values are denoted w1, . . . , wn. The Wi could be multivariate.
Suppose there is a single parameter of interest, denoted by θ. For the univariate case, this might be the population mean, variance, standard deviation, 90th percentile, etc., while examples for the bivariate case (2 variables making up each W) include the population correlation ρ or the ratio of the means of the two variables.
We estimate θ with some statistic T, which by definition is a function of the data. Assessing how good T is as an estimator of θ is often done through
• Bias = E(T )− θ
and
• SE = standard error of T = standard deviation of T
while confidence intervals and tests are based on the sampling distribution of T. Sometimes these quantities can be determined exactly (under some assumptions).
The bootstrap provides an estimate of the bias, standard error and, more generally, the sampling distribution of T by simulating sampling from a population, where the "population" that is used is the original data w1, . . . , wn.
The bootstrap algorithm (for a single random sample)
Take B large (usually at least 500)
For b = 1 to B
1. Take a random sample of size n WITH REPLACEMENT from the values w1, . . . , wn. Denote this bootstrap sample by wb1, . . . , wbn. Notice that some of the original values in the sample will typically occur more than once in the bootstrap sample.
2. Compute your estimate in the same way as you did for the original data, now using the bootstrap sample wb1, . . . , wbn. For the bth bootstrap sample denote the estimate by Tb.
There are now B estimates T1, . . . , TB .
The bootstrap estimate of the standard error of T is:

SEboot = [ Σ_{b=1}^{B} (Tb − T̄)² / (B − 1) ]^{1/2},

where T̄ = Σ_{b=1}^{B} Tb / B.
Technically the bootstrap estimate of standard error is the limit of the above as B → ∞, but it is common to use the term in this way.
If T is a plug-in estimator (as will be the case, for example, for estimators based on sample means, variances, correlations and percentiles, which covers much of what we want to do) then the bootstrap estimate of Bias is T̄ − T.
The bootstrap estimate of the distribution of T is simply the distribution of the B values T1, . . . , TB, usually represented by a stem-and-leaf plot, histogram or smoothed histogram. This is referred to as the Empirical Bootstrap Distribution.
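The resampling loop described above is compact in any language. A Python sketch of the bootstrap standard error, on a hypothetical sample rather than the stock series:

```python
import numpy as np

def boot_se(w, stat, B=1000, seed=0):
    """Bootstrap standard error: resample w with replacement B times,
    recompute the statistic each time, and take the SD of the B values."""
    rng = np.random.default_rng(seed)
    n = len(w)
    t = np.array([stat(w[rng.integers(0, n, size=n)]) for _ in range(B)])
    return t.std(ddof=1), t

w = np.random.default_rng(1).normal(size=100)  # stand-in sample, not real data
se_mean, tvals = boot_se(w, np.mean)
# for the mean this should be close to the formula-based s/sqrt(n)
```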
Bootstrap Confidence Intervals
• normal-based bootstrap confidence interval:
T ± z_{1−α/2} SEboot,
• The Percentile Method:
Consider α1 and α2 with α1 + α2 = α (usually α1 = α2 = α/2). The percentile interval is [L, U], where
L = 100(α1)th percentile from the bootstrap data T1, . . . TB .
U = 100(1 − α2)th percentile from the bootstrap data T1, . . . , TB.
That is, approximately 100α1% of the bootstrap values are less than or equal to L and 100(1 − α2)% of the bootstrap values are less than or equal to U. Note that if the distribution of the Tb values is approximately normal with a mean near the original T, the bootstrap percentile interval will be in close agreement with the normal-based interval.
Notice also that with the percentile method, if [L, U] is the percentile interval for θ, then the percentile interval for φ = g(θ), for an increasing transformation g, is simply [g(L), g(U)]. This is a nice property that does not always hold for other methods.
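In code, the percentile method is just two sample percentiles of the bootstrap estimates; a Python sketch (the Tb values here are simulated stand-ins, not real bootstrap output):

```python
import numpy as np

def percentile_ci(tboot, alpha=0.05):
    """Bootstrap percentile interval: the 100*(alpha/2) and
    100*(1 - alpha/2) percentiles of the B bootstrap estimates."""
    return (np.percentile(tboot, 100 * alpha / 2),
            np.percentile(tboot, 100 * (1 - alpha / 2)))

tb = np.random.default_rng(0).normal(5.0, 2.0, size=2000)  # stand-in T_b's
lo, hi = percentile_ci(tb)
```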
Nonparametric confidence intervals can be found using the bootstrap in ways other than the percentile method. In practice an improved bootstrap method, the BCa (bias-corrected and accelerated) method, is often used.
Example. Using the bootstrap to analyze one of the stock series. We will treat this as a random sample of size 100 (the serial correlation is actually weak) and estimate various parameters in the underlying model.
title ’Bootstrap with a single sample’;
options ps=60 ls=80;
/* THIS IS A PROGRAM TO BOOTSTRAP WITH A SINGLE
RANDOM SAMPLE USING THE MEAN, VARIANCE, STANDARD
DEVIATION, COEFFICIENT OF VARIATION, MEDIAN, 90TH PERCENTILE */
filename bb ’boot.out’;
/* READ IN ORIGINAL DATA INTO INTERNAL SAS FILE values */
data values;
infile ’stock.dat’;
input allied dupont unionc exxon texaco;
run;
/* USING JUST THE ALLIED RETURN RATES */
title ’descriptive statistics on original sample’;
proc univariate cibasic cipctldf plots;
var allied;
run;
/* START INTO IML WHERE BOOTSTRAPPING WILL BE DONE */
proc iml;
/* put data into vector x */
use values;
read all var{allied} into x;
close values;
n=nrow(x); /* = sample size */
xb=x; /* initializes xb to be same size as x */
rx=x; /* initializes rx*/
nboot = 1000; /* specify number of bootstrap replicates */
do sim=1 to nboot;
/* get the n samples with replacement. i indicates
the sampling within bootstrap replicate sim. The generated
k is a discrete uniform over 1 to n; the function int
takes the integer part */
do i= 1 to n;
uv=uniform(0);
k=int(uv*n+1);
xb[i]=x[k];
end;
/* xb contains the n values in bootstrap sample */
smean = sum(xb)/n; /* get sample mean */
svar=(ssq(xb) - (n*(smean**2)))/(n-1); /* sample variance */
sd = sqrt(svar); /* sample s.dev. */
cv = sd/smean; /* coefficient of variation */
rx[rank(xb)] = xb; /* rx has the xb values in sorted (ranked) order */
/* get median */
p=.5;
j = int(n*p);
g = n*p - j;
if g=0 then median =.5*(rx[j]+rx[j+1]);
else median = rx[j+1];
p=.9;
j = int(n*p);
g = n*p - j;
if g=0 then p90 =.5*(rx[j]+rx[j+1]);
else p90 = rx[j+1];
file bb;
put smean +1 svar +1 sd +1 cv +1 median +1 p90;
end;
quit;
run;
/* Get descriptive statistics on bootstrap values*/
data new;
infile ’boot.out’;
input smean svar sd cv median p90;
run;
proc means;
run;
proc univariate;
var smean svar sd cv median p90;
run;
descriptive statistics on original sample 56
Variable: allied
Location Variability
Mean 0.005434 Std Deviation 0.04037
Median 0.000000 Variance 0.00163
Mode 0.000000 Range 0.21946
Interquartile Range 0.05283
Quantile Estimate
100% Max 0.1228070
90% 0.0590165
50% Median 0.0000000
0% Min -0.0966540
Basic Confidence Limits Assuming Normality
Parameter Estimate 90% Confidence Limits
Mean 0.00543 -0.00127 0.01214
Std Deviation 0.04037 0.03619 0.04576
Variance 0.00163 0.00131 0.00209
90% Confidence Limits -------Order Statistics-------
Quantile Distribution Free LCL Rank UCL Rank Coverage
90% 0.051282 0.068965 85 95 90.25
50% Median -0.003195 0.010274 42 59 91.14
*** ANALYSIS OF BOOTSTRAP SAMPLES. OUTPUT HEAVILY EDITED.
Variable N Mean Std Dev Minimum Maximum
smean 1000 0.0053754 0.0041517 -0.0084860 0.0194430
svar 1000 0.0016002 0.000227336 0.000978800 0.0024138
sd 1000 0.0399021 0.0028418 0.0312857 0.0491309
cv 1000 -31.9993740 1122.93 -33824.04 5756.43
median 1000 0.0049183 0.0285557 -0.0788150 0.0958860
p90 1000 0.0058421 0.0281108 -0.0869120 0.0958860
Variable: smean
Quantile Estimate
95% 0.01212490
90% 0.01069245
75% Q3 0.00817140
50% Median 0.00538515
25% Q1 0.00267920
10% -0.00003850
5% -0.00155400
Variable: sd
Quantile Estimate
95% 0.0445262
5% 0.0351526
Variable: cv
Value Obs Value Obs
-4099.906 767 5756.429 459
Variable: cv Variable: median Variable: p90
Quantile Estimate Quantile Estimate Quantile Estimate
100% Max 5756.42880 95% 0.0524363 95% 0.0519173
95% 33.92074 5% -0.0402220 5% -0.0388860
5% -29.27325
0% Min -33824.04000
Follow up on the bootstrap example:
• The CV for the original sample is 7.495.
• The bootstrap estimates of standard error are given by the standard deviations over the bootstrap samples. For the sample mean the bootstrap estimate of standard error is .0041517. (Actually we can estimate the SE of the mean analytically using S/n^{1/2}, so we don't need the bootstrap. It can be shown that, as B increases, the bootstrap estimate of the SE of the mean converges to ((n-1)/n)^{1/2} S/n^{1/2}, which is close to the formula-based estimate.) As a second example, for the standard deviation the bootstrap estimate of standard error is .00284. The standard error of the CV (which is 1122.93) is heavily influenced by some outliers.
• The bootstrap estimates of bias are found by taking (mean over the bootstrap samples) - (original estimate). For the mean (which we know theoretically is unbiased) this is .0053754 - .00543; for the CV it is -31.999 - 7.495 = -39.5; and for the 90th percentile it is .0058 - .05902 = -.05 (which might seem small but is of the same magnitude as the original estimate). The mean is sensitive to outliers, so a better sense of bias might come from taking the median of the bootstrap samples and comparing it to the original value. The medians over the bootstrap samples are
variable median
smean 0.00539
svar 0.001585
sd 0.039817
cv 6.0655
median 0.004832
p90 0.004395
In most cases these are similar to the mean, but for the CV the median of the bootstrap samples is much closer to the original estimate than the mean is.
• Confidence intervals:
If one treats the estimate as approximately normal with no bias and uses “estimate ± 1.645 se” for an approximate confidence interval, the results are N-low to N-up below. The percentile intervals are denoted P-low and P-up. Note that the normal-based interval for the CV is heavily influenced by the large SE, due to outliers, and should not be used; the percentile method is not sensitive to the outliers. Neither of these intervals should be used if bias is an issue, as appears to be the case for the 90th percentile.
N-low N-up P-low P-up
smean -0.0015 0.0122 -0.0015 0.0121
svar 0.0012 0.0020 0.0012 0.0020
sd 0.0352 0.0446 0.0352 0.0445
cv -1879.2 1815.22 -29.2732 33.9207
median -0.0421 0.0519 -0.0402 0.0524
p90 -0.0404 0.0521 -0.0389 0.0519
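The calculations in these bullets (bootstrap SE, bootstrap bias, and the normal-based versus percentile intervals) can be collected in one short routine. The sketch below is in Python rather than the SAS/IML of these notes; the 1.645 multiplier for a 90% interval follows the text, while the function name bootstrap_summary and the simple percentile indexing are assumptions of this illustration, not part of the course code.

```python
import random
import statistics

def bootstrap_summary(data, estimator, B=1000, alpha=0.10, seed=1):
    """Resample with replacement B times and summarize one estimator:
    bootstrap SE, bootstrap bias, a normal-based interval, and a
    simple percentile interval at level 1 - alpha."""
    rng = random.Random(seed)
    n = len(data)
    original = estimator(data)
    boots = sorted(
        estimator([data[rng.randrange(n)] for _ in range(n)])
        for _ in range(B)
    )
    se = statistics.stdev(boots)                 # SD over bootstrap samples
    bias = statistics.mean(boots) - original     # bootstrap mean - original
    z = 1.645                                    # 90% normal-based interval
    normal_ci = (original - z * se, original + z * se)
    lo = boots[int(B * (alpha / 2))]             # crude percentile interval
    hi = boots[int(B * (1 - alpha / 2)) - 1]
    return {"est": original, "se": se, "bias": bias,
            "normal": normal_ci, "pctile": (lo, hi)}
```

Applied to a statistic such as the CV, the normal-based interval inherits the outlier-inflated SE while the percentile interval does not, matching the table above.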
Getting the percentile automatically in IML.
You can get the bootstrap percentile intervals by saving the B bootstrap values in a vector, sorting it and then getting the required percentiles. We illustrate with the same example as above to get confidence intervals for the variance. Much of the code is the same, with the following changes:
Right after the line with nboot = ;, add bvalue=J(nboot,1);. Then, right after svar = ;, add bvalue[sim]=svar;. After the end statement that follows the put .. ; and before the line quit;, add
/* THE VECTOR BVALUE HAS THE BOOTSTRAP VALUES OF THE ESTIMATE OF
INTEREST. RANKB HAS THE RANKED BOOTSTRAP VALUES */
rankb=bvalue;
rankb[rank(rankb)] = bvalue; /* rankb has ranked values */
alpha={.1,.05,.02};
do kk = 1 to 3;
p=alpha[kk]/2;
j = int(nboot*p);
g = nboot*p - j;
if g=0 then low =.5*(rankb[j]+rankb[j+1]);
else low = rankb[j+1];
p=1- (alpha[kk]/2);
j = int(nboot*p);
g = nboot*p - j;
if g=0 then up =.5*(rankb[j]+rankb[j+1]);
else up = rankb[j+1];
clevel = 100*(1 - alpha[kk]);
print "confidence intervals for variance";
print "confidence level" clevel "ci for variance " low " to " up;
end;
confidence intervals for variance
CLEVEL LOW UP
confidence level 90 ci for variance 0.0012383 to 0.0020242
confidence level 95 ci for variance 0.0011852 to 0.0020923
confidence level 98 ci for variance 0.0011257 to 0.0021846
Variable=SVAR Quantiles(Def=5)
100% Max 0.002419 99% 0.002185
75% Q3 0.001773 95% 0.002024
50% Med 0.001615 90% 0.001932
25% Q1 0.001465 10% 0.001311
0% Min 0.000908 5% 0.001238
1%
Rerunning the analysis we get the above results. Notice that the results on the variance over the 1000 bootstraps will not be exactly the same as those from the first run. You can see the match between the 98% CI, which uses the 1% and 99% quantiles from proc univariate, and similarly for the 90% CI, which uses the 5% and 95% quantiles.
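The j/g rule in the IML code above (SAS quantile "definition 5") can be written compactly outside IML as well. Here is a Python translation of that rule; quantile_def5 and percentile_ci are hypothetical names for this sketch, and the exact g == 0 test inherits the same floating-point fragility as the IML original.

```python
def quantile_def5(sorted_vals, p):
    """SAS 'definition 5' quantile, as in the IML code: with j = int(n*p)
    and g = n*p - j, average the j-th and (j+1)-th order statistics
    (1-based) when g = 0, otherwise take the (j+1)-th."""
    n = len(sorted_vals)
    j = int(n * p)
    g = n * p - j          # caution: exact comparison of a float to 0
    if g == 0:
        return 0.5 * (sorted_vals[j - 1] + sorted_vals[j])
    return sorted_vals[j]

def percentile_ci(boot_vals, alpha):
    """Bootstrap percentile interval at confidence level 1 - alpha."""
    ranked = sorted(boot_vals)
    return quantile_def5(ranked, alpha / 2), quantile_def5(ranked, 1 - alpha / 2)
```

For example, with 1000 sorted bootstrap variances, percentile_ci at alpha = .02 uses the 1% and 99% quantiles, matching the 98% interval printed above.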
11 R: Part 3. Working with matrices
While the way SAS IML operates can be different from how we worked in the data step (which uses the so-called base language), there is no such distinction in R. As noted when we first introduced R, it basically applies functions to objects, and those objects can be matrices. So there is no real fundamental difference in handling matrices in R compared to what we've done earlier. In fact some of our earlier examples already wrote to vectors using an index. In addition, how we referred to elements (or rows or columns) in a dataframe carries over to working with a matrix.
A matrix is a two-dimensional array with r rows and c columns. If c is equal to 1 then this is a “column” vector, or in R just referred to as a vector.
As seen earlier, if we bind together vectors of numbers using rbind or cbind then the result is a matrix.
We also saw (see p. 61 part II) that we can read data right into a matrix using matrix(scan( ..., and we used the rep and seq commands to create vectors.
Creating matrices via direct assignment:
This can be done with the matrix function; the basic syntax is

matrix(data, nrow, ncol, byrow = FALSE (default) or TRUE)

data: a vector of values (with total number of elements equal to nrow*ncol), or a single value which will be assigned to all positions.
nrow: the desired number of rows.
ncol: the desired number of columns.
byrow: if FALSE (the default) the matrix is filled by columns; otherwise the matrix is filled by rows.
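For readers who also use Python, the byrow behavior can be mimicked with a small pure-Python sketch; matrix_fill is a hypothetical helper written for this illustration, not an R or SAS function.

```python
def matrix_fill(vals, nrow, ncol, byrow=False):
    """Mimic R's matrix(): fill by columns unless byrow is True."""
    m = [[None] * ncol for _ in range(nrow)]
    for idx, v in enumerate(vals):
        if byrow:
            m[idx // ncol][idx % ncol] = v   # fill a row at a time
        else:
            m[idx % nrow][idx // nrow] = v   # fill a column at a time
    return m
```

With the vector c(5,2,4,1,3,2,-5,6,7) used below, column filling puts 5, 2, 4 down the first column, while byrow filling puts 5, 2, 4 across the first row, exactly as the R output shows.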
Here is an R script for creating matrices and showing basic matrix operations. See http://www.statmethods.net/advstats/matrix.html for more details. There are numerous other webpages with details on matrix operations.
# a is symmetric so the fact that it reads by column
# doesn’t matter
a<-matrix(c(1,5,8,5,3,6,8,6,4),3,3)
a
#showing the need for byrow = T to read a row at a time
b<-matrix(c(5,2,4,1,3,2,-5,6,7),3,3)
b
b<-matrix(c(5,2,4,1,3,2,-5,6,7),3,3,byrow=T)
b
M<-matrix(NA,5,6)
M
M<-matrix(1,5,6)
M
# create an identity matrix
C <-diag(5)
C
# BASIC MATRIX OPERATIONS
sumab<-a + b # summation
sumab
diffab<-a-b # difference
diffab
prodab<- a %*% b # product
prodab
trana <-t(a) # transpose
trana
ainv<-solve(a) # inverse
ainv
y<-eigen(a) # y$val has eigenvalues, y$vec has eigenvectors
y$val
y$vec
deta<-det(a) # determinant
deta
ranka<-rank(a) #ranks the elements in a
ranka
> # a is symmetric so the fact that it reads by column
> # doesn’t matter
> a<-matrix(c(1,5,8,5,3,6,8,6,4),3,3)
> a
[,1] [,2] [,3]
[1,] 1 5 8
[2,] 5 3 6
[3,] 8 6 4
> #showing the need for byrow = T to read a row at a time
> b<-matrix(c(5,2,4,1,3,2,-5,6,7),3,3)
> b
[,1] [,2] [,3]
[1,] 5 1 -5
[2,] 2 3 6
[3,] 4 2 7
> b<-matrix(c(5,2,4,1,3,2,-5,6,7),3,3,byrow=T)
> b
[,1] [,2] [,3]
[1,] 5 2 4
[2,] 1 3 2
[3,] -5 6 7
> M<-matrix(NA,5,6)
> M
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] NA NA NA NA NA NA
[2,] NA NA NA NA NA NA
[3,] NA NA NA NA NA NA
[4,] NA NA NA NA NA NA
[5,] NA NA NA NA NA NA
> M<-matrix(1,5,6)
> M
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] 1 1 1 1 1 1
[2,] 1 1 1 1 1 1
[3,] 1 1 1 1 1 1
[4,] 1 1 1 1 1 1
[5,] 1 1 1 1 1 1
>
> # create an identity matrix
> C <-diag(5)
> C
[,1] [,2] [,3] [,4] [,5]
[1,] 1 0 0 0 0
[2,] 0 1 0 0 0
[3,] 0 0 1 0 0
[4,] 0 0 0 1 0
[5,] 0 0 0 0 1
>
> # BASIC MATRIX OPERATIONS
> sumab<-a + b # summation
> sumab
[,1] [,2] [,3]
[1,] 6 7 12
[2,] 6 6 8
[3,] 3 12 11
> diffab<-a-b # difference
> diffab
[,1] [,2] [,3]
[1,] -4 3 4
[2,] 4 0 4
[3,] 13 0 -3
> prodab<- a %*% b # product
> prodab
[,1] [,2] [,3]
[1,] -30 65 70
[2,] -2 55 68
[3,] 26 58 72
> trana <-t(a) # transpose
> trana
[,1] [,2] [,3]
[1,] 1 5 8
[2,] 5 3 6
[3,] 8 6 4
> ainv<-solve(a) # inverse
> ainv
[,1] [,2] [,3]
[1,] -0.14634146 0.1707317 0.03658537
[2,] 0.17073171 -0.3658537 0.20731707
[3,] 0.03658537 0.2073171 -0.13414634
> y<-eigen(a) # y$val has eigenvalues, y$vec has eigenvectors
> y$val
[1] 15.513954 -1.874493 -5.639461
> y$vec
[,1] [,2] [,3]
[1,] -0.5420786 0.3354056 0.77048941
[2,] -0.5294635 -0.8483266 -0.00321535
[3,] -0.6525482 0.4096890 -0.63744469
> deta<-det(a)
> deta
[1] 164
> ranka<-rank(a) #ranks the elements in a
> ranka
[1] 1.0 4.5 8.5 4.5 2.0 6.5 8.5 6.5 3.0
11.1 Least squares in R
This does linear regression with two predictors, first using glm and then illustrating matrix calculations.
data<-read.table(’g:/s597/data/smsa.dat’)
rain<-data$V5
mortal<-data$V6
so2pot<-data$V16
con <-rep(1,length(mortal))
regmodel<-glm(mortal ~ rain+ so2pot)
summary(regmodel)
# doing least squares explicitly
y<-mortal
x<-cbind(con,rain,so2pot)
dimx<-dim(x)
p=dimx[2]
n=dimx[1]
n; p
xpxinv<-solve(t(x)%*%x) #inverse of X’X
betahat<- xpxinv%*%t(x)%*%y #estimated coefficients
residual <- y - x%*%betahat
sse <- t(residual)%*%residual; #sum of squared residuals
mse <- sse/(n-p) # estimate of variance
covb<- mse[1]*solve(t(x)%*%x) #estimate of variance covariance of betahat
sevec <- rep(0,p);
for (j in 1 :3)
{sevec[j] = sqrt(covb[j,j])}
info<-cbind(y,x,residual)
info
betahat
mse
covb
estimates<-cbind(betahat,sevec)
estimates
> data<-read.table(’g:/s597/data/smsa.dat’)
> rain<-data$V5
> mortal<-data$V6
> so2pot<-data$V16
> con <-rep(1,length(mortal))
> regmodel<-glm(mortal ~ rain+ so2pot)
> summary(regmodel)
Call:
glm(formula = mortal ~ rain + so2pot)
Deviance Residuals:
Min 1Q Median 3Q Max
-111.747 -29.239 -3.163 27.045 156.066
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 811.77692 23.13109 35.095 < 2e-16 ***
rain 2.68223 0.54710 4.903 8.22e-06 ***
so2pot 0.47648 0.09939 4.794 1.21e-05 ***
---
(Dispersion parameter for gaussian family taken to be 2307.898)
> # doing least squares explicitly
> y<-mortal
> x<-cbind(con,rain,so2pot)
> dimx<-dim(x)
> p=dimx[2]
> n=dimx[1]
> n; p
[1] 60
[1] 3
> xpxinv<-solve(t(x)%*%x) #inverse of X’X
> betahat<- xpxinv%*%t(x)%*%y #estimated coefficients
> residual <- y - x%*%betahat
> sse <- t(residual)%*%residual; #sum of squared residuals
> mse <- sse/(n-p) # estimate of variance
> covb<- mse[1]*solve(t(x)%*%x) #estimate of variance covariance
> # matrix of coefficients
> sevec <- rep(0,p);
> for (j in 1 :3)
+ {sevec[j] = sqrt(covb[j,j])}
> info<-cbind(y,x,residual)
> info
y con rain so2pot
[1,] 921.87 1 36 59 -14.579593
[2,] 997.87 1 35 39 73.632241
[3,] 962.35 1 44 33 16.831034
...........
[58,] 895.70 1 65 8 -94.233833
[59,] 911.82 1 62 49 -89.602822
[60,] 954.44 1 38 39 22.155545
> betahat
[,1]
con 811.7769190
rain 2.6822319
so2pot 0.4764801
> mse
[,1]
[1,] 2307.898
> covb
con rain so2pot
con 535.0473586 -11.841137326 -0.782642347
rain -11.8411373 0.299317271 0.006553182
so2pot -0.7826423 0.006553182 0.009878042
> estimates<-cbind(betahat,sevec)
> estimates
sevec
con 811.7769190 23.13109073
rain 2.6822319 0.54709896
so2pot 0.4764801 0.09938834
11.2 Some additional comments on using functions in R
Printing:
As noted elsewhere, when working within a function, just listing an object will not result in it being printed to the console (as would happen if working interactively, i.e., not in a function). Within a function you can print to the console using print( ) or the cat command.
Return. return(object) will end the function and print what is in the object to the console. There doesn't have to be a return in a function (see earlier examples). You CANNOT return multiple quantities by using return(obj1,obj2,..); this does not work. The individual quantities have to be combined into one object. To illustrate, below is the function we used to find power values for a one-sided test for a mean; see page 72 for the output.
powerone<-function(mu0,sigma,n,alpha){
cat("Power example with sample size ", n, " null = ", mu0,
" sigma = ", sigma, " alpha = ", alpha, "\n")
muval<- seq(25,35,by=.5)
muval
nmu <-length(muval)
power<-rep(NA,nmu)
for(k in 1:nmu)
{mu = muval[k]
df= n-1
nc = sqrt(n)*(mu - mu0)/sigma
tval= qt(1-alpha,df)
power[k] = 1 - pt(tval,df,nc)}
plot(muval, power, type = "l", ylab = "power", xlab = "Null value",
main = "power function")
values<-data.frame(muval,power)
return(values)}
powerone(30,2,5,.05)
Below we see what we get if we use a list (with no names) and then a list with names to return muval, power and some other quantities.
values<-list(mu0,sigma,n,alpha,muval,power)
return(values)
Power example with sample size 5 null = 30 sigma = 2 alpha = 0.05
[[1]]
[1] 30
[[2]]
[1] 2
[[3]]
[1] 5
[[4]]
[1] 0.05
[[5]]
[1] 25.0 25.5 26.0 26.5 27.0 27.5 28.0 28.5 29.0 29.5 30.0 30.5 31.0 31.5 32.0
[16] 32.5 33.0 33.5 34.0 34.5 35.0
[[6]]
[1] 1.554168e-11 4.549612e-10 1.016218e-08 1.720899e-07 2.219886e-06
[6] 2.193684e-05 1.671735e-04 9.901027e-04 4.598786e-03 1.692908e-02
[11] 5.000000e-02 1.201795e-01 2.389952e-01 4.008470e-01 5.797374e-01
[16] 7.414620e-01 8.619466e-01 9.364164e-01 9.748306e-01 9.914523e-01
[21] 9.975115e-01
values<-list(null = mu0,sigma = sigma,n = n, alpha = alpha,
muval = mu,power = power)
return(values)
Power example with sample size 5 null = 30 sigma = 2 alpha = 0.05
$null
[1] 30
$sigma
[1] 2
$n
[1] 5
$alpha
[1] 0.05
$muval
[1] 35
$power
[1] 1.554168e-11 4.549612e-10 1.016218e-08 1.720899e-07 2.219886e-06
[6] 2.193684e-05 1.671735e-04 9.901027e-04 4.598786e-03 1.692908e-02
[11] 5.000000e-02 1.201795e-01 2.389952e-01 4.008470e-01 5.797374e-01
[16] 7.414620e-01 8.619466e-01 9.364164e-01 9.748306e-01 9.914523e-01
[21] 9.975115e-01
Using the object outside of the function.
Once you run a function, quantities computed in the function are not available, even those that are part of the object specified in the return() statement. If you want access to quantities after you have run the function, then you can save the result of the function. For example, if you used
presults<-powerone(30,2,5,.05)
then presults will have whatever is in the return (values in the example above). Note that you could use the name values again. That is, you could use
values<-powerone(30,2,5,.05)
and this will return what is in values in the function to the object values outside the function. As before, in general you cannot refer to the things inside values individually. If it is a dataframe then you can use attach, or
if it is a dataframe or a list you can refer to items using presults$name. So, in the first case above, where values was just a dataframe that came from binding muval and power, you could use presults$muval or presults$power to refer to the individual vectors. In the case of using the list with names, you can refer to presults$sigma, etc.
12 Bootstrap Add on.
First, note that in SAS you could generate many bootstrap samples using proc surveyselect with the rep option. These can then be fed into subsequent analyses/computations. But if working in IML there is really no need for (or advantage to) doing this. It is easier to generate the samples explicitly.
12.1 An enhanced one sample IML program
Below is a more compact and somewhat more comprehensive bootstrap program that does what was done on pages 109-113 of the notes. The explanation of the bootstrap is given on those pages. Here the program is just a little different in that
i) rather than write to an external file, it creates an internal SAS file, and
ii) it then uses a macro to loop through and run the bootstrap analysis on the various statistics that have been saved over the bootstrap samples.
A few other notes:
1. There is a typo in line 10 on page 110. It should read
rx[rank(xb)] = xb; /* rx has ranked values */
This will have some effect on some of the analyses of the median and 90th percentile.
2. The analysis below is run on a different data set, using a sample of households where the amount expended on food was measured.
3. Using the stock data on page 109 and following was not the best example to start with, from an applied perspective. The bootstrap (and other) analyses there would only apply if you can view the observations from different weeks as a random sample (independent and identically distributed). This may not be too bad there, but in general you shouldn't do this with time series without a check on assumptions.
title ’Bootstrap with a single sample’;
options pagesize=60 linesize=80 nodate;
/* THIS IS A PROGRAM TO BOOTSTRAP WITH A SINGLE
RANDOM SAMPLE ESTIMATING THE MEAN, VARIANCE, STANDARD
DEVIATION, COEFFICIENT OF VARIATION, MEDIAN AND 90TH PERCENTILE.
FOR ILLUSTRATION. IN PRACTICE THE BOOTSTRAP AND OTHER PROCEDURES
CAN HAVE TROUBLE ESTIMATING PERCENTILES WITH SMALL SAMPLE SIZE.
THE DEFAULT HERE IN USING PROC UNIVARIATE AND FORMING BOOTSTRAP
PERCENTILE CIS IS CONFIDENCE LEVEL=.95 (ALPHA = .05)*/
***************************************************************;
* READ IN ORIGINAL DATA INTO INTERNAL SAS FILE values ;
* SET x = VARIABLE OF INTEREST HERE ;
* THIS EXAMPLE IS ANALYZING HOUSEHOLD FOOD EXPENDITURES. ;
***************************************************************;
data values;
infile ’g:/s597/data/food.dat’;
input expend;
x=expend;
run;
title ’descriptive statistics on original sample’;
proc univariate cibasic cipctlnormal cipctldf plot;
var x;
run;
proc iml;
/* put data into vector x */
use values;
read all var{x} into x;
close values;
n=nrow(x); /* = sample size */
xb=x; /* initializes xb to be same size as x */
rx = x; /* initializes vector that will have ranked bootstrap sample */
***************************************************************;
* SPECIFY THE NUMBER OF BOOTSTRAP SAMPLES ;
nboot = 10000;
*******************************************************************;
* SPECIFY WHICH STATISTICS/ESTIMATES WILL BE SAVED TO A SAS FILE ;
create bootout var{smean svar sd cv median p90};
********************************************************************;
do b =1 to nboot; /* b indexes bootstrap number */
/* get the n samples with replacement. i indexes
the sampling within bootstrap replicate b. The generated
k is a discrete uniform over 1 to n; the function int
takes the integer part */
do i= 1 to n;
uv=uniform(0);
k=int(uv*n+1);
xb[i]=x[k];
end;
/* xb contains the n values in bootstrap sample */
/*compute statistics of interest.*/
smean = sum(xb)/n; /* get sample mean */
svar=(ssq(xb) - (n*(smean**2)))/(n-1); /* sample variance */
sd = sqrt(svar); /* sample s.dev. */
cv = sd/smean; /* coefficient of variation */
/*compute median and 90th percentile for the bootstrap sample.
using definition 5 in SAS. */
/*compute median and 90th percentile */
rx[rank(xb)] = xb; /* rx has ranked values */
/* get median */
p=.5;
j = int(n*p);
g = n*p - j;
if g=0 then median =.5*(rx[j]+rx[j+1]);
else median = rx[j+1];
p=.9;
j = int(n*p);
g = n*p - j;
if g=0 then p90 =.5*(rx[j]+rx[j+1]);
else p90 = rx[j+1];
append; /* appends designated variables to sas file bootout "created" earlier*/
end; /* ends the b (bootstrap) loop */
quit;
title ’summary measures over bootstrap samples via proc mean’;
proc means data=bootout n mean std; run;
/* THE MACRO BELOW WILL TAKE THE BOOTSTRAP VALUES OF
variable, OBTAINED FOR EACH OF THE NSIM BOOTSTRAP SAMPLES
AND GET THE BOOTSTRAP MEAN, STANDARD DEVIATION/ERROR, AS WELL
AS THE BOOTSTRAP PERCENTILE INTERVALS */
%macro banalyze(variable);
title "BOOTSTRAP ANALYSIS OF &variable";
proc iml;
use bootout;
read all var{&variable} into bootv;
close bootout;
alpha = .05; /* set confidence level at 1 - alpha*/
nboot = nrow(bootv);
bootmean = sum(bootv)/nboot;
bootvar = (ssq(bootv) - (nboot*(bootmean**2)))/(nboot-1);
if abs(bootvar) < 1e-8 then bootvar=0;
bootse = sqrt(bootvar);
/* now get percentile interval for pi */
rankb=bootv;
rankb[rank(rankb)] = bootv; /* rankb has ranked values */
p=alpha/2;
j = int(nboot*p);
g = nboot*p - j;
if g=0 then lowboot =.5*(rankb[j]+rankb[j+1]);
else lowboot = rankb[j+1];
p=1- (alpha/2);
j = int(nboot*p);
g = nboot*p - j;
if g=0 then upboot =.5*(rankb[j]+rankb[j+1]);
else upboot = rankb[j+1];
p=.50;
j = int(nboot*p);
g = nboot*p - j;
if g=0 then bootmed =.5*(rankb[j]+rankb[j+1]);
else bootmed = rankb[j+1];
min=rankb[1];
max=rankb[nboot];
clevel = 100*(1-alpha);
print "Bootstrap sample size = " nboot "Confidence level = " clevel;
output = J(1,5,1);
cnames = {"B-Mean" "B-Median" "Boot-SE" "Boot-L" "Boot-U"};
reset noname;
output[1,1] = bootmean;
output[1,2] = bootmed;
output[1,3] = bootse;
output[1,4] = lowboot;
output[1,5] = upboot;
print output[colname=cnames];
%mend;
%banalyze(smean);
%banalyze(svar);
%banalyze(sd);
%banalyze(median);
%banalyze(p90);
descriptive statistics on original sample 2230
Variable: x
N 40 Sum Weights 40
Mean 3609.6 Sum Observations 144384
Std Deviation 1509.29978 Variance 2277985.84
Coeff Variation 41.8134913 Std Error Mean 238.641249
Basic Confidence Limits Assuming Normality
Parameter Estimate 90% Confidence Limits
Mean 3610 3208 4012
Std Deviation 1509 1276 1859
Variance 2277986 1627961 3457486
Quantiles (Definition 5)
90% Confidence Limits
Quantile Estimate Assuming Normality
99% 8147.0 6471.642 8048.363
95% 7225.0 5566.512 6817.607
90% 5683.5 5073.978 6171.152
75% Q3 4110.0 4222.654 5117.472
50% Median 3165.0 3207.519 4011.681
90% Confidence Limits -------Order Statistics-------
Quantile Distribution Free LCL Rank UCL Rank Coverage
95% 5587 8147 36 40 82.35
90% 4670 8147 33 40 94.33
75% Q3 3679 5176 26 35 90.23
50% Median 2830 3679 15 26 91.93
Variable: x
Stem Leaf # Boxplot
8 1 1 0
7 6 1 0
7
6 9 1 0
6
5 68 2 |
5 02 2 |
4 7 1 |
4 024 3 +-----+
3 67789 5 | + |
3 122234 6 *-----*
2 556667888889 12 +-----+
2 01244 5 |
1 |
1 2 1 |
----+----+----+----+
Multiply Stem.Leaf by 10**+3
summary measures over bootstrap samples via proc mean
Variable N Mean Std Dev
SMEAN 10000 3608.69 231.9764974
SVAR 10000 2211014.47 650970.30
SD 10000 1470.12 223.0754178
CV 10000 0.4062640 0.0480721
MEDIAN 10000 3172.28 241.9560555
P90 10000 5568.51 810.9068116
BOOTSTRAP ANALYSIS OF smean
NBOOT CLEVEL
Bootstrap sample size = 10000 Confidence level = 95
B-Mean B-Median Boot-SE Boot-L Boot-U
3608.6867 3601.8 231.9765 3174.5875 4088.2625
BOOTSTRAP ANALYSIS OF svar
NBOOT CLEVEL
Bootstrap sample size = 10000 Confidence level = 95
B-Mean B-Median Boot-SE Boot-L Boot-U
2211014.5 2183823.1 650970.3 1013311.7 3537060.4
BOOTSTRAP ANALYSIS OF sd
NBOOT CLEVEL
Bootstrap sample size = 10000 Confidence level = 95
B-Mean B-Median Boot-SE Boot-L Boot-U
1470.1214 1477.7764 223.07542 1006.6338 1880.7074
BOOTSTRAP ANALYSIS OF median
NBOOT CLEVEL
Bootstrap sample size = 10000 Confidence level = 95
B-Mean B-Median Boot-SE Boot-L Boot-U
3172.2757 3165 241.95606 2807 3711
BOOTSTRAP ANALYSIS OF p90
NBOOT CLEVEL
Bootstrap sample size = 10000 Confidence level = 95
B-Mean B-Median Boot-SE Boot-L Boot-U
5568.508 5587 810.90681 4250 7580
12.2 R resources
You can find many webpages that have an introduction to using the bootstrap in R. One useful one is
http://www.ats.ucla.edu/stat/r/library/bootstrap.htm
Note the convenient sample function in R. Be careful in that most of the automatic bootstrapping just resamples with replacement from the observations. This is only appropriate if the original sample is a random sample (independent and identically distributed)! Otherwise the bootstrap resampling must be tailored to the design and programmed explicitly.
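To make the warning concrete, here is a small Python sketch: iid_resample mirrors R's sample(x, replace = TRUE) and is only valid for iid data, while block_resample shows one common way (a moving-block scheme) of tailoring resampling to serially dependent data such as a time series. Both function names are assumptions of this sketch, not library routines.

```python
import random

def iid_resample(x, rng=random):
    """Like R's sample(x, length(x), replace = TRUE):
    one iid bootstrap sample. Valid only for iid data."""
    return [rng.choice(x) for _ in x]

def block_resample(x, block_len, rng=random):
    """Moving-block resampling for dependent data: glue together
    randomly chosen contiguous blocks so that short-range serial
    dependence within blocks is preserved."""
    n = len(x)
    out = []
    while len(out) < n:
        start = rng.randrange(n - block_len + 1)
        out.extend(x[start:start + block_len])
    return out[:n]
```

For the weekly stock returns mentioned earlier, something like block_resample would be the safer starting point unless the iid assumption has been checked.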