518 SAS TEMPLATES

Stat 518 - Fall 2000 - SAS Templates

Class Notes from 9/11/00
Homework 3 Notes
Homework 4 Notes
Homework 5 Notes
Homework 6 Notes
Homework 7 Notes
Homework 8 Notes

The Basics of SAS:

Hitting the [F3] key will run the program currently in the Program Editor window.

This will however erase whatever was written in the Program Editor window. To recall whatever was there, make sure you are in that window, and hit the [F4] key.

If you keep running more programs, SAS keeps appending their results to the Output window. To clear the Output window, make sure you are in that window, and choose Clear All under the Edit menu.

The following is the SAS code for analyzing the SAT data presented in the August 31st, 1999 New York Times:


OPTIONS pagesize=60 linesize=80;



DATA sat1;

INPUT state $  verbal math pct;

LABEL state = "State"

   verbal = "Average Verbal Subtest Score"

   math = "Average Math Subtest Score"

   pct = "Percent of Students Taking Test"; 

CARDS;

        Ala.    561  555    9

        Alaska  516  514   50

        Ariz.   524  525   34

        Ark.    563  556    6

        Calif.  497  514   49

        Colo.   536  540   32

        Conn.   510  509   80

        Dela.   503  497   67

        D.C.    494  478   77

        Fla.    499  498   53

        Ga.     487  482   63

        Hawaii  482  513   52

        Idaho   542  540   16

        Ill.    569  585   12

        Ind.    496  498   60

        Iowa    594  598    5

        Kan.    578  576    9

        Ky.     547  547   12

        La.     561  558    8

        Maine   507  507   68

        Md.     507  511   65

        Mass.   511  511   78

        Mich.   557  565   11

        Minn.   586  598    9

        Miss.   563  548    4

        Mo.     572  572    8

        Mont.   545  546   21

        Neb.    568  571    8

        Nev.    512  517   34

        N.H.    520  518   72

        N.J.    498  510   80

        N.M.    549  542   12

        N.Y.    495  502   76

        N.C.    493  493   61

        N.D.    594  605    5

        Ohio    534  538   25

        Okla.   576  560    8

        Ore.    525  525   53

        Pa.     498  495   70

        R.I.    504  499   70

        S.C.    479  475   61

        S.D.    585  588    4

        Tenn.   559  553   13

        Texas   494  499   50

        Utah    570  565    5

        Vt.     514  506   70

        Va.     508  499   65

        Wash.   525  526   52

        W.Va.   527  512   18

        Wis.    584  595    7

        Wyo.    546  551   10

;

Note that _most_ lines end with a semicolon, but not all (the lines of data do not). SAS will report an error if you miss one, but usually the Log window will tell you where the problem is.

The OPTIONS line only needs to be used once during a session. It sets the length of the page and the length of the lines for viewing on the screen and printing. The font can be set by using the Options sub-menu of the Tools menu along the top of the screen. When you cut and paste from SAS to a word processor, the font Courier New works well.

The DATA line defines what the name of the data set is. The name must be eight characters or less, with no spaces, and only letters, numbers, and underscores. It must start with a letter. The INPUT line gives the names of the variables, and they must be in the order that the data will be entered. The $ after state on the INPUT line means that the variable state is qualitative instead of quantitative.

If we hit F3 at this point to enter what we put above, nothing new will appear on the output screen. This is no big surprise however once we realize that we haven't told SAS to return any output! The code below simply tells SAS to print back out the data we entered.




PROC PRINT data=sat1;

TITLE "September 1, 1999 - SAT Report";

RUN;

The basic method for getting a summary of the data is to use PROC UNIVARIATE.


PROC UNIVARIATE DATA=sat1 PLOT FREQ ;

VAR pct ;

TITLE 'Summary of the Percent of Students Taking the SAT';

RUN;

The VAR line says which of the variables you want a summary of. Note that there are many different definitions of percentile, so the exact values SAS reports may not match the ones we calculated in class.
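If the percentiles in the output do not match a hand calculation, one thing to try is the PCTLDEF= option on the PROC UNIVARIATE line, which chooses among the five percentile definitions SAS offers (the default is 5). The sketch below reruns the summary with definition 4. Which definition, if any, matches the method from class is an open question here, so check it against a small example first.

```sas
* Rerun the summary of pct with percentile definition 4 instead of the default 5;
PROC UNIVARIATE DATA=sat1 PCTLDEF=4;
VAR pct;
TITLE 'Percentiles of pct using percentile definition 4';
RUN;
```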

PROC INSIGHT allows many of these analyses, as well as many more advanced analyses and nicer graphs.


PROC INSIGHT; 

OPEN sat1;

DIST pct;

RUN;

You can cut and paste the graphs from PROC INSIGHT right into Microsoft Word. Simply click on the border of the box you want to copy with the right mouse button to select it. You can then cut and paste like normal. Clicking on the arrow in the bottom corner of each of the boxes gives you options for adjusting the graph's format. To quit PROC INSIGHT, click on the X in the upper right portion of the spreadsheet. Note that INSIGHT can also be started using the choice Solutions-Analysis-Interactive Data Analysis in SAS version 8.

One very useful ability in PROC INSIGHT is the ability to make new variables from the old ones. This is done by going to the Edit menu, and selecting Variables, and then Other.... This can be used for example to make the z-scores.
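For those who prefer code to menus, the z-scores can also be made outside of INSIGHT. The following is a minimal sketch using PROC STANDARD, which is not otherwise used in these notes. The output data set name zsat is only an illustration. Note that in zsat the variable pct is replaced by its standardized version; no separate column is added.

```sas
* Standardize pct to mean 0 and standard deviation 1 (z-scores);
PROC STANDARD DATA=sat1 MEAN=0 STD=1 OUT=zsat;
VAR pct;
RUN;

* Print the states alongside their z-scores;
PROC PRINT DATA=zsat;
VAR state pct;
TITLE 'z-scores for the Percent of Students Taking the SAT';
RUN;
```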

There are a variety of seemingly commonsense procedures that you would think SAS would be good at. Unfortunately it either hides them well or doesn't do them for some reason. Luckily we can program SAS to do some of these. The following example finds a confidence interval for the variance. The function CINV looks up the value on the chi-square table that goes with the probability and the degrees of freedom you give it. The 0.05 in the problem is the alpha from (1-alpha)*100%.


PROC MEANS NOPRINT DATA=sat1;

VAR verbal;

OUTPUT OUT=temp STD=sd N=n;

RUN;


DATA temp2;

SET temp;

KEEP var n alpha cilow cihigh;

INPUT alpha;

var = sd*sd;

df = n - 1;

cilow = (n-1)*(var)/CINV(1-(alpha/2),df);

cihigh = (n-1)*(var)/CINV(alpha/2,df);

CARDS;

.05

;


PROC PRINT data=temp2;

RUN;

The mean could also have been kept on the OUTPUT line using MEAN=xbar for example. To make a confidence interval using a t or normal distribution, the functions to "look up the values in the table" would have been TINV or PROBIT respectively. To get the p-value to form a test of hypothesis, we could use the functions: PROBCHI for chi-square, PROBT for t, and PROBNORM for normal. (It should be noted that the newest version of INSIGHT does do the confidence interval for the variance under the Tables menu.)
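As a sketch of the t-based interval just described, the code below keeps the mean with MEAN=xbar and uses TINV in place of CINV. It sets alpha with an assignment statement instead of the INPUT and CARDS lines used above, purely to keep the example short. The data set name tci and the variable name tcrit are illustrative choices.

```sas
PROC MEANS NOPRINT DATA=sat1;
VAR verbal;
OUTPUT OUT=temp MEAN=xbar STD=sd N=n;
RUN;

DATA tci;
SET temp;
KEEP xbar sd n alpha cilow cihigh;
alpha = 0.05;
df = n - 1;
* TINV gives the t table value with area 1-alpha/2 to its left;
tcrit = TINV(1-(alpha/2),df);
cilow = xbar - tcrit*sd/SQRT(n);
cihigh = xbar + tcrit*sd/SQRT(n);
RUN;

PROC PRINT DATA=tci;
RUN;
```

For a normal-based interval, PROBIT(1-(alpha/2)) would take the place of the TINV call.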

To quit, simply choose Exit in the File menu for each program, and use CTRL+ALT+DEL to log off the machine.

Homework 3 Notes

Problem C: The first step is to enter the data into SAS as we did in the lab in class. The Q-Q plot and t-test can be accessed by starting up PROC INSIGHT with the name of the dataset on the OPEN line, and the name of the variable on the DIST line. To add the Q-Q plot to the output, go under the Graphs menu and choose QQ Plot... and hit OK in the box that pops up. To add the line to the plot, go to Curves and choose QQ Ref Line and hit OK in the box that pops up. To add the t-test to the output, go to the Tables menu and select Tests for Location.... In the box that pops up, enter the mu value, and hit OK. The new version will conduct several tests; we simply want the Student's t one.

One thing to note is that SAS didn't ask whether you wanted to test greater than, less than, or not equal to. By default it does the two-sided (not equal to) test, and you have to adjust the p-value manually if you want a one-sided test. If you want to make SAS perform the one-sided tests, the following code will test the hypothesis mu_pct>30 for the SAT data set above.


PROC MEANS NOPRINT DATA=sat1;

VAR pct;

OUTPUT OUT=temp MEAN=xbar STD=sd N=n;

RUN;

DATA temp2;

SET temp;

KEEP xbar mu sd n t pgreater pless ptwoside;

INPUT mu;

t  = (xbar-mu)/(sd/sqrt(n));

df = n - 1;

pgreater = 1 - probt(t,df);

pless = probt(t,df);

ptwoside = 2*MIN(1-ABS(probt(t,df)),ABS(probt(t,df)));

cards;

30

;

PROC PRINT;

RUN;
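The manual adjustment of a two-sided p-value can be checked against the code above. The sketch below assumes temp2 is the data set just created. It halves ptwoside when t falls in the upper tail and takes one minus half otherwise, so the new variable fromtwo (an illustrative name, as is the data set name check) should agree with pgreater.

```sas
DATA check;
SET temp2;
KEEP t ptwoside pgreater fromtwo;
* For the alternative mu > mu0 the "correct" tail is t > 0;
IF t > 0 THEN fromtwo = ptwoside/2;
ELSE fromtwo = 1 - ptwoside/2;
RUN;

PROC PRINT DATA=check;
RUN;
```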

When you are cutting and pasting the Q-Q plot into Microsoft Word, one of the difficulties is that it will not print out correctly right away. Once it has been pasted into MS Word, click on the image twice with the left mouse button. This opens it into a separate window where you could edit it if you wanted. Simply go under File and select Close and Return to Document. This will return you to the rest of the document (where you should have put the code you ran in the program editor window and the p-value you found) with the graph formatted to print out correctly.

Homework 4 Notes

We've seen in Homework 3 how to conduct the t-test and generate a Q-Q plot. The t-test based confidence interval for the mean can be generated in PROC INSIGHT under the Tables menu, by choosing the Basic Confidence Intervals option. One could also program SAS to construct the confidence interval in the same way we constructed the CI for the variance, or the t-test for the mean. Unfortunately, the built-in sign test in SAS is two-sided and does not report the usual test statistic. It does not have a built-in method of forming the confidence interval at all. Before continuing, it is useful to enter the data from class:

DATA beakclap;

INPUT x y @@;

z=y-x;

CARDS;

5.8	5	13.5	21	26.1	73	7.4	25

7.6	3	23.0	77	10.7	59	9.1	13

19.3	36	26.3	46	17.5	9	17.9	25

18.3	59	14.2	38	55.2	70	15.4	36

30.0	55	21.3	46	26.8	25	8.1	30

24.3	29	21.3	46	18.2	71	22.5	31

31.1	33

;

The sign test can be conducted in PROC INSIGHT in SAS version 8.0. Once the distribution window for z is open, you can choose Tests for Location... in the Tables Menu. The statistic it returns is one-half the number of observations greater than the null hypothesis median minus one-half the number of observations smaller than the null hypothesis median. The p-value it returns is for the two-sided test. PROC UNIVARIATE can also be used, and it returns the same statistic as part of its regular output. (Look for M(Sign) and Prob > |M| in version 6.) Remember to be careful when converting a two-sided p-value to a one-sided one.
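A minimal sketch of the PROC UNIVARIATE route is below. It relies on the null hypothesis value being zero, which is what PROC UNIVARIATE tests by default. For any other null value you could first subtract that value from z in a DATA step.

```sas
* The sign test statistic and two-sided p-value appear in the regular output;
PROC UNIVARIATE DATA=beakclap;
VAR z;
TITLE 'Sign Test for the Differences z = y - x';
RUN;
```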

With a little work we could program SAS to determine the correct statistic and all three p-values. The data set setup will have two columns in addition to the z column. The first column has a one for every z greater than zero (the null hypothesis value in this case), and the second column has a one for every z not equal to zero. You can put in a PROC PRINT DATA=setup; line to see exactly what this dataset looks like. The PROC UNIVARIATE portion simply sums up these two extra columns, calls the sums T and n and saves them as a data set called Tandn. Finally, the data set pvals contains the work to calculate the p-values. The one disadvantage of SAS, compared to S-Plus, is that we have to enter this entire set of code every time, manually changing the name of the data set, the name of the variable (z in this case), and the null hypothesis value.

DATA setup;

SET beakclap;

KEEP z geq tot;

geq = ((z-0)> 1E-7);

tot = ((z-0)>1E-7 or (z-0)<-1E-7);

RUN;

 

PROC UNIVARIATE DATA=setup NOPRINT;

VAR geq tot;

OUTPUT OUT=Tandn SUM= T n;

RUN;

 

DATA pvals;

SET Tandn;

KEEP T n greater less twoside;

greater = 1 - PROBBNML(0.5,n,T-1);

less = PROBBNML(0.5,n,T);

twoside = MIN(1,2*MIN(greater,less));

RUN;

PROC PRINT;

RUN;

As SAS does not have a function to look up the quantiles of the binomial, you would have to go through a bit of work to make sure that the confidence interval was done correctly. As such, you don't need to do part e with SAS.
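If you do want to attempt it anyway, one low-tech approach is to print the cumulative binomial probabilities with PROBBNML and read the needed quantile off the table by eye. This sketch assumes n = 25, the number of pairs in beakclap. The data set name bintable and the variable names are illustrative. Turning the quantile into the endpoints of the interval still has to follow the rule from class.

```sas
DATA bintable;
n = 25;
* cdf is P(X <= k) for a binomial with p = 0.5 and n trials;
DO k = 0 TO n;
cdf = PROBBNML(0.5,n,k);
OUTPUT;
END;
RUN;

PROC PRINT DATA=bintable;
RUN;
```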

Homework 5 Notes

1) Luckily SAS does have built in functions to conduct the t-test, sign test, and signed-rank test. They can all be run in PROC INSIGHT. Say we wanted to do example 1 on page 355.

DATA examp;

INPUT first second;

CARDS;

86	88

71	77

77	76

68	64

91	96

72	72

77	65

91	90

70	65

71	80	

88	81	

87	72

;

PROC INSIGHT;

OPEN examp;

RUN;

After entering the data, we need to make the variable with the differences. Click on the spreadsheet, and then go to the "Edit" menu, and then choose "Variables" and in that menu take "Other...". We want to take the difference, so first click on Y-X in the Transformation list. Then pick which variable you want to be X, click on that, and then click on the X box. Do the same for Y, and then hit OK. A new variable should have appeared in the spreadsheet.

Once we have the difference in the spreadsheet, go up to the "Analyze" menu, select "Distribution (Y)", and choose the difference variable for the Y and hit "OK". We could now get the Q-Q plot for the differences, or make the confidence intervals using the selections in the various menus. To do the hypothesis tests, choose "Location Tests..." under the "Tables" menu. Here simply check all the appropriate boxes, pick which value you are testing the median/mean equals, and click ok. This will add the three tests to the INSIGHT graphics window.

Note that the p-values given here are for the two-sided test. To get the one-sided p-value, decide if the observed statistic is in the "correct" tail or not, and then take either half of the two-sided p-value or one minus half of it, as appropriate. Also, recall from Homework 4 that the statistic given for the sign test is of a somewhat different form, even though the two-sided p-value is correct. Similarly, note that the statistic for the signed rank test is 1/2 the value of the one on page 355. Again, however, it calculates the correct two-sided p-value.

4) PROC GLM can conduct the ANOVA for this type of data. The dollar sign is used to indicate that the group is a name and not necessarily a number. The following code does the work for example 1 on page 291.

DATA examp;

INPUT group $ value @@;

CARDS;

1   83      2       91      3       101     4       78

1   91      2       90      3       100     4       82

1   94      2       81      3       91      4       81

1   89      2       83      3       93      4       77

1   89      2       84      3       96      4       79

1   96      2       83      3       95      4       81

1   91      2       88      3       94      4       80

1   92      2       91      4       81      1       90

2   89      2       84

;

PROC GLM DATA=examp;

CLASS group;

MODEL value=group;

RUN;

PROC NPAR1WAY will conduct the Kruskal-Wallis test (as well as the Mann-Whitney-Wilcoxon rank-sum test).

PROC NPAR1WAY WILCOXON DATA=examp;

CLASS group;

VAR value;

RUN;

To check the assumption of normality and equal variances, we can output the residuals of the ANOVA model, and simply do a q-q plot of the residuals, and a residual versus predicted plot. One way of doing that is to add the line:

OUTPUT OUT=resids P=pred R=resid;

after the MODEL line. Then you can start up PROC INSIGHT on the new data set resids. You could also use the same residual versus predicted plots to see if the distributions for the various groups appear the same for the Kruskal-Wallis test.
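Putting the pieces together, a code-only version of the residual check might look like the sketch below. PROC UNIVARIATE with the PLOT and NORMAL options and PROC PLOT are used here as stand-ins for the INSIGHT graphs. That choice is an assumption of this example, not the method given in the notes.

```sas
PROC GLM DATA=examp;
CLASS group;
MODEL value=group;
* Save the predicted values and residuals in the data set resids;
OUTPUT OUT=resids P=pred R=resid;
RUN;
QUIT;

* Normal probability plot and tests of normality for the residuals;
PROC UNIVARIATE DATA=resids PLOT NORMAL;
VAR resid;
RUN;

* Residual versus predicted plot;
PROC PLOT DATA=resids;
PLOT resid*pred;
RUN;
```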

PROC INSIGHT can also be used instead of PROC GLM to both fit the ANOVA and get the residual plots. Start up PROC INSIGHT with the data set examp. Under Analyze, choose Fit(YX), selecting group for x and value for y. The residual versus predicted plot appears automatically at the bottom, and the q-q plot can be selected under the Graphs menu.

Homework 6 Notes

SAS has a variety of ways of performing the standard linear regression. In PROC INSIGHT, you can simply select the option Fit (YX) under the Analyze menu. Choose the correct Y and X variables and click OK. The resulting window contains the least squares fit equation, a scatter plot, a slide bar to try various polynomial regressions, a box with the sqrt(MSE) and R-square, the ANOVA table, and a residual vs. predicted plot. The option for adding the q-q plot is under the Graphs menu.

Other options for performing this regression include PROC GLM and PROC REG. In both cases you will want to include an OUTPUT line in order to check the residuals.
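As a sketch of the PROC REG version, the code below uses the sat1 data set from the class notes as a stand-in, since no regression data are given here. The choice of verbal as Y and pct as X, and the data set name regout, are only for illustration.

```sas
PROC REG DATA=sat1;
MODEL verbal = pct;
* Save the predicted values and residuals for checking the assumptions;
OUTPUT OUT=regout P=pred R=resid;
RUN;
QUIT;

* Residual versus predicted plot;
PROC PLOT DATA=regout;
PLOT resid*pred;
RUN;
```

The data set regout could instead be opened in PROC INSIGHT to get the q-q plot of resid, as in the earlier homeworks.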

Homework 7 Notes

Fisher's Exact Test and the Mantel-Haenszel Test: PROC FREQ will perform both Fisher's exact test and the Mantel-Haenszel test. The following code will analyze the data in example 3 in section 4.1.

DATA examp3;

INPUT gender $ posit $ count;

CARDS;

male	accrep	1

male	teller	9

fem  	accrep	3

fem	teller  1

;

PROC FREQ DATA=examp3;

TABLES gender*posit / EXACT;

WEIGHT count;

RUN;

Notice that it performs a large number of tests, and that the one you are looking for is near the bottom of the output. The same code works on larger tables as well. The code for example 4 is below. Make sure to check that the contingency tables it forms are the ones you want! Notice that SAS reports the chi-squared form of the statistic, without the continuity correction, and not the normal one, so its value is the square of the normal statistic: 2.0515 = 1.4323².

DATA examp4;

INPUT treat $ sorf $ group $ count;

CARDS;

treat	s	1	10

treat	s	2	9

treat	s	3	8

treat	f	1	1

treat	f	2	0

treat	f	3	0

cont	s	1	12

cont	s	2	11

cont	s	3	7

cont	f	1	1

cont	f	2	1

cont	f	3	3

;

PROC FREQ DATA=examp4 ORDER=DATA;

TABLES group*treat*sorf / CMH1;

WEIGHT count;

RUN;

Chi-squared Goodness of Fit Test: PROC FREQ will also perform the chi-squared goodness of fit test. The following code will analyze the data in Example 1 on page 242.

DATA examp1;

INPUT digit $ count @@;

CARDS;

0	22	1	28	2	41	3	35

4	19	5	25	6	25	7	40

8	30	9	35

;

PROC FREQ DATA=examp1;

TABLES digit / TESTF=(30,30,30,30,30,30,30,30,30,30);

WEIGHT count;

RUN;

Instead of specifying the expected frequencies, we could have set the expected proportions using TESTP=(0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1).
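To see where the goodness of fit statistic comes from, it can also be built by hand in a DATA step. This is a minimal sketch assuming the examp1 data set above with an expected count of 30 in every cell. The names gof, expected, chisq, and pvalue are illustrative. For these data the statistic works out to 510/30 = 17 on 9 degrees of freedom.

```sas
DATA gof;
SET examp1 END=last;
KEEP chisq df pvalue;
expected = 30;
* The sum statement accumulates (observed-expected)^2/expected over the cells;
chisq + (count-expected)**2/expected;
IF last THEN DO;
df = _N_ - 1;
pvalue = 1 - PROBCHI(chisq,df);
OUTPUT;
END;
RUN;

PROC PRINT DATA=gof;
RUN;
```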

Homework 8 Notes

Under Construction!

Below is the data concerning 326 defendants in homicide indictments in 20 Florida counties during 1976-1977. It can be found in Radelet's 1981 article in American Sociological Review.

Defendant's Race   Victim's Race   Death Penalty   No Death Penalty

White              White           19              132
White              Black           0               9
Black              White           11              52
Black              Black           6               97

The following code would fit the family of hierarchical models to the Berkeley admissions data we used in class:


DATA berk;

INPUT dept $ gender $ accepted $ count;

CARDS;

A	M	Y	512

A	M	N	313

A	F	Y	89	

A	F	N	19

B	M	Y	353

B	M	N	207

B	F	Y	17

B	F	N	8

C	M	Y	120

C	M	N	205

C	F	Y	202

C	F	N	391

D	M	Y	138

D	M	N	279

D	F	Y	131

D	F	N	244

E	M	Y	53

E	M	N	138

E	F	Y	94

E	F	N	299

F	M	Y	22

F	M	N	351

F	F	Y	24

F	F	N	317

;



PROC CATMOD DATA=berk ORDER=DATA;

WEIGHT count;

MODEL  dept*gender*accepted =_response_ / NOPROFILE NOPREDVAR NORESPONSE NOPARM NOITER NODESIGN;

LOGLIN dept gender accepted;

RUN;

The above code will analyze the complete independence model. In turn, each of the following could be used to analyze the other models.


LOGLIN dept gender accepted dept*gender;

LOGLIN dept gender accepted dept*accepted; 

LOGLIN dept gender accepted gender*accepted; 

LOGLIN dept gender accepted dept*gender dept*accepted; 

LOGLIN dept gender accepted dept*gender gender*accepted; 

LOGLIN dept gender accepted dept*accepted gender*accepted; 

LOGLIN dept gender accepted dept*gender dept*accepted gender*accepted; 

LOGLIN dept gender accepted dept*gender dept*accepted gender*accepted dept*gender*accepted;

Notice that it says that PROC CATMOD is still running. Because of this you don't need to rerun everything. For example, to get the next model you could just enter:


LOGLIN dept gender accepted dept*gender;

RUN;

and it will add that to the output. You have to do this one model at a time unfortunately. When you are finished using the procedure, you can close it using QUIT; and hitting [F3].

For the death penalty example, remember that you can't enter an observed value of 0. Instead use something like 0.00001. This will make SAS recognize the cell, but won't otherwise change the results noticeably.
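A sketch of how the death penalty data might be entered, with the zero cell replaced as described, is given below. The data set name death and the variable names defrace, vicrace, and penalty are made up for this illustration. The PROC CATMOD lines simply mirror the ones used for the Berkeley data.

```sas
DATA death;
INPUT defrace $ vicrace $ penalty $ count;
CARDS;
white white yes 19
white white no 132
white black yes 0.00001
white black no 9
black white yes 11
black white no 52
black black yes 6
black black no 97
;

PROC CATMOD DATA=death ORDER=DATA;
WEIGHT count;
MODEL defrace*vicrace*penalty =_response_ / NOPROFILE NOPREDVAR NORESPONSE NOPARM NOITER NODESIGN;
LOGLIN defrace vicrace penalty;
RUN;
```

The other hierarchical models would then be fit by changing the LOGLIN line, one model at a time, as with the Berkeley data.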

For the data above, after running all the models you would find that even the complicated model with everything except the three-way interaction (the second to last one) doesn't fit at all. The p-value is only 0.0011 for testing H0: this model fits vs. HA: this model doesn't fit. For the saturated model (the last one, with the three-way interaction too) no p-value is given. This is because that model will always fit exactly. Since only the saturated model fits, we would interpret this as meaning that there is a three-way interaction, i.e., the dependence between gender and accepted changes with the department. This means we cannot find a simple way of describing the data. If one of the other models had fit, then we would have been able to draw some conclusion that simplified the interpretation. No such luck in this case.

Because the saturated model fits exactly, the observed values and predicted values are the same, and we can just use the observed odds ratios in each department to describe what is happening. To illustrate what we could do otherwise, assume for the moment that the model with dept, gender, accepted, and gender*accepted fit the data. This would mean that there was an association between gender and being accepted, but that it was the same for every department. To get the fitted values, we would rerun the MODEL line (after removing the NOPREDVAR option and adding PRED=FREQ) and the appropriate LOGLIN line.


MODEL  dept*gender*accepted =_response_ / PRED=FREQ NOPROFILE NORESPONSE NOPARM NOITER NODESIGN;

LOGLIN dept gender accepted accepted*gender;

RUN;

If you look at the Maximum Likelihood Predicted Values for Frequencies table in the output, you could make the observed or expected 2x2 contingency tables for each department. If you make the expected (a.k.a. predicted) tables, they all have the same odds ratio, 1.8410.
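The observed odds ratios for each department can also be calculated directly from the berk data set. The following is a minimal sketch that assumes the rows are in the order entered above (sorted by dept). The data set name obsodds and the variable names my, mn, fy, fn, and oddsrat are illustrative. For department A it gives (512*19)/(313*89), or about 0.35.

```sas
DATA obsodds;
SET berk;
BY dept;
KEEP dept my mn fy fn oddsrat;
RETAIN my mn fy fn;
* Collect the four cell counts for the current department;
IF gender='M' AND accepted='Y' THEN my=count;
IF gender='M' AND accepted='N' THEN mn=count;
IF gender='F' AND accepted='Y' THEN fy=count;
IF gender='F' AND accepted='N' THEN fn=count;
* Odds of acceptance for males divided by the odds for females;
IF LAST.dept THEN DO;
oddsrat = (my*fn)/(mn*fy);
OUTPUT;
END;
RUN;

PROC PRINT DATA=obsodds;
RUN;
```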
