This page contains sample code for using SAS in the course STAT 516 with the text Statistical Methods, 2nd Edition by Freund and Wilson. It is designed so that you can read along with the material below, and copy and paste the SAS code (usually appearing in typewriter font) directly from the web page into the SAS program editor window. The introduction section is written assuming you have no previous experience using SAS. In the later sections the examples usually correspond to portions of the text book, and it would be helpful to have the book with you. In general, you will be able to use these templates to do your homework assignments after modifying them by entering your own data and making sure that all of the names match up.
Originally created by B. Habing - Last updated: 4/20/04
- Chapter 7
- Section 7.5
- Section 7.8
- Chapter 8
- Section 8.8
- Section 8.9
- Chapter 6
- Chapter 9
- Sections 10.2 - 10.4
- Section 11.5
- Section 11.7
To begin with you should open up the program SAS. If there is a SAS icon on the screen already, you can just double click that to start the program. If there is not a SAS icon on the screen already, you need to use the start menu. Click the Start rectangle in the bottom left of the screen with your left mouse button. This should open up a list of options above the Start rectangle, and you should slide the mouse arrow up to the one labeled Programs. This should open up a list of more options to the right and you should keep sliding along until you get to the one labeled The SAS System for Windows. Once you get there click on it. This should start up SAS.
There are three main windows that are used in SAS: the Log window, the Program Editor window, and the Output window. The Log and Program Editor windows are the two on the screen when you start up the program. The Output window isn't visible yet because you haven't created anything to output. If you happen to lose one of these windows, it usually has a bar at the bottom of the SAS window. You can also find them under the View menu.
The Program Editor is where you tell SAS what you want done. The Output window is where it puts the results, and the Log window is where it tells you what it did and if there are any errors. It is important to note that the Output window often gets very long! You usually want to copy the parts you want to print into MS-Word and print from there. It is also important to note that you should check the Log window every time you run anything. The errors will appear in maroon. Successful runs appear in blue.
Hitting the [F3] key at the top of your keyboard will run the program currently in the Program Editor window (or the part of it you have highlighted). You can also run programs by clicking on the little image of the runner in the list of symbols near the top of the SAS program screen.
In older editions of SAS, running the program will erase whatever was written in the Program Editor window. To recall whatever was there, make sure you are in that window, and hit the [F4] key.
If you keep running more programs, SAS will keep adding the results to the Output window. To clear the Output window, make sure you are in that window, and choose Clear text under the Edit menu.
If you happen to lose a window, try looking under the View menu.
The following is the SAS code for examining the housing price data that is discussed throughout chapter 7. The full data set can be found in Table 1.2 on page 9. The subset of the data that we will eventually use is in Table 7.2 on page 293.
The data in the full data set is formatted in eleven columns:
OPTIONS pagesize=50 linesize=64;
DATA Housing_Data;
INPUT Obs zip age bed bath size lot exter $ garage fp price;
CARDS;
1 3 21 3 3.0 0.951 64.904 Other 0 0 30.000
2 3 21 3 2.0 1.036 217.800 Frame 0 0 39.900
3 4 7 1 1.0 0.676 54.450 Other 2 0 46.500
4 3 6 3 2.0 1.456 51.836 Other 0 1 48.600
5 1 51 3 1.0 1.186 10.857 Other 1 0 51.500
6 2 19 3 2.0 1.456 40.075 Frame 0 0 56.990
7 3 8 3 2.0 1.368 . Frame 0 0 59.900
8 4 27 3 1.0 0.994 11.016 Frame 1 0 62.500
9 1 51 2 1.0 1.176 6.259 Frame 1 1 65.500
10 3 1 3 2.0 1.216 11.348 Other 0 0 69.000
11 4 32 3 2.0 1.410 25.450 Brick 0 0 76.900
12 3 2 3 2.0 1.344 . Other 0 1 79.000
13 3 25 2 2.0 1.064 218.671 Other 0 0 79.900
14 1 31 3 1.5 1.770 19.602 Brick 0 1 79.950
15 4 29 3 2.0 1.524 12.720 Brick 2 1 82.900
16 3 16 3 2.0 1.750 130.680 Frame 0 0 84.900
17 3 20 3 2.0 1.152 104.544 Other 2 0 85.000
18 3 18 4 2.0 1.770 10.640 Other 0 0 87.900
19 4 28 3 2.0 1.624 12.700 Brick 2 1 89.900
20 2 27 3 2.0 1.540 5.679 Brick 2 1 89.900
21 1 8 3 2.0 1.532 6.900 Brick 2 1 93.500
22 4 19 3 2.0 1.647 6.900 Brick 2 0 94.900
23 2 3 3 2.0 1.344 43.560 Other 1 0 95.800
24 4 5 3 2.0 1.550 6.575 Brick 2 1 98.500
25 4 5 4 2.0 1.752 8.193 Brick 2 0 99.500
26 4 27 3 1.5 1.450 11.300 Brick 1 1 99.900
27 4 33 2 2.0 1.312 7.150 Brick 0 1 102.000
28 1 4 3 2.0 1.636 6.097 Brick 1 0 106.000
29 4 0 3 2.0 1.500 . Brick 2 0 108.900
30 2 36 3 2.5 1.800 83.635 Brick 2 1 109.900
31 3 5 4 2.5 1.972 7.667 Brick 2 0 110.000
32 3 0 3 2.0 1.387 . Brick 2 0 112.290
33 4 27 4 2.0 2.082 13.500 Brick 3 1 114.900
34 3 15 3 2.0 . 269.549 Frame 0 0 119.500
35 4 23 4 2.5 2.463 10.747 Brick 2 1 119.900
36 4 25 3 2.0 2.572 7.090 Brick 2 1 119.900
37 4 24 4 2.0 2.113 7.200 Brick 2 1 122.900
38 4 1 3 2.5 2.016 9.000 Brick 2 1 123.938
39 1 34 3 2.0 1.852 13.500 Brick 2 0 124.900
40 4 26 4 2.0 2.670 9.158 Brick 2 1 126.900
41 2 26 3 2.0 2.336 5.408 Brick 0 1 129.900
42 4 31 3 2.0 1.980 8.325 Brick 2 1 132.900
43 2 24 4 2.5 2.483 10.295 Brick 2 1 134.900
44 2 29 5 2.5 2.809 15.927 Brick 2 1 135.900
45 4 21 3 2.0 2.036 16.910 Brick 2 1 139.500
46 3 10 3 2.0 2.298 10.950 Brick 2 1 139.990
47 4 3 3 2.0 2.038 7.000 Brick 2 0 144.900
48 2 9 3 2.5 2.370 10.796 Brick 2 1 147.600
49 2 29 5 3.5 2.921 11.992 Brick 2 1 149.990
50 2 8 3 2.0 2.262 . Brick 2 1 152.550
51 4 7 3 3.0 2.456 . Brick 2 1 156.900
52 4 1 4 2.0 2.436 52.000 Brick 2 1 164.000
53 3 27 3 2.0 1.920 226.512 Frame 4 1 167.500
54 4 5 3 2.5 2.949 11.950 Brick 2 1 169.900
55 2 32 4 3.5 3.310 10.500 Brick 2 1 175.000
56 4 29 3 3.0 2.805 16.500 Brick 2 1 179.000
57 4 1 3 3.0 2.553 8.610 Brick 2 1 179.900
58 4 1 3 2.0 2.510 . Other 2 1 189.500
59 4 33 3 4.0 3.627 17.760 Brick 3 1 199.000
60 2 25 4 2.5 3.056 10.400 Other 2 1 216.000
61 3 16 3 2.5 3.045 168.576 Brick 3 1 229.900
62 4 2 4 4.5 3.253 54.362 Brick 3 2 285.000
63 2 2 4 3.5 4.106 44.737 Brick 3 1 328.900
64 4 0 3 2.5 2.993 . Brick 2 1 313.685
65 4 0 3 2.5 2.992 14.500 Other 3 1 327.300
66 4 20 4 3.0 3.055 250.034 Brick 3 0 349.900
67 4 18 5 4.0 3.846 23.086 Brick 4 3 370.000
68 4 3 4 4.5 3.314 43.734 Brick 3 1 380.000
69 4 5 4 3.5 3.472 130.723 Brick 2 2 395.000
;
Note that _most_ lines end with a semi-colon, but not all. SAS will crash if you miss one, but usually the Log window will tell you where the problem is. An extra/missing semi-colon is probably the single largest reason for a SAS program crashing.
The OPTIONS line only needs to be used once during a session. It sets the length of the page and the length of the lines for viewing on the screen and printing. The font can be set by using the Options choice under the Tools menu along the top of the screen. When you cut and paste from SAS to a word processor, the font 10 point Courier New works well.
The DATA line defines what the name of the data set is. The name should start with a letter, have no spaces, and contain only letters, numbers, and underscores. The INPUT line gives the names of the variables, and they must be in the order that the data will be entered. The $ after exter on the INPUT line means that the variable exter is qualitative instead of quantitative. (Sometimes you will see an @@ at the end of the INPUT line. This means that the variables will be entered right after each other on the same line with no returns, instead of needing one row for each observation.)
Note that in several places a . appears instead of a number or a word. This signifies a missing value, and SAS will ignore that observation if it was going to use the missing variable in the calculation. In this case, if you tried to calculate the mean of each of the variables, different numbers of observations would be used in each case.
If we hit F3 at this point to enter what we put above, nothing new will appear on the output screen. This is no big surprise however once we realize that we haven't told SAS to return any output! The code below simply tells SAS to print back out the data we entered.
PROC PRINT DATA=Housing_Data;
TITLE "Table 1.2: Housing Data";
RUN;
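If you want to see the effect of the missing values mentioned above for yourself, PROC MEANS reports the number of non-missing observations (N) used for each variable. This is a sketch we added for illustration, not from the text; it assumes the Housing_Data step above has already been run:

```sas
* Count the non-missing observations used for each variable;
* (size and lot have some missing values, so their N will be smaller);
PROC MEANS DATA=Housing_Data N MEAN;
VAR size lot price;
RUN;
```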
For the examples in Chapter 7 we are concerned only with the size and price of houses priced at less than $200,000. The following lines will construct this data subset and then print it out to verify that it has worked correctly.
DATA Size_and_Price;
SET Housing_Data;
KEEP size price;
WHERE price < 200;
RUN;
PROC PRINT DATA=Size_and_Price;
TITLE "Table 7.2: Data on Size and Price";
RUN;
Whenever you have a DATA line, that means you are creating a new dataset with that name. The SET line tells it that we are making this new data set from an old one. The KEEP line says the only variables we want in this new data set are the ones on that line. The lines after that give any special commands that go into the making of the new data set. In this case the WHERE command is used to keep only the houses priced under 200 (the prices are recorded in thousands of dollars). Later we will see examples of making datasets that involve using mathematical functions. In any case, it should be pretty straightforward when you just stop and read through what the lines say.
The most basic procedure to give out some actual graphs and statistics is PROC UNIVARIATE:
PROC UNIVARIATE DATA=Size_and_Price PLOT FREQ ;
VAR size;
TITLE 'Sizes of Houses under $200,000';
RUN;
The VAR line says which of the variables you want a summary of. Also note that the graphs here are pretty awful. The INSIGHT procedure will do most of the things that the UNIVARIATE procedure will, and a lot more. The one thing it won't do is let you program it to do new things. Later in the semester we'll see how some of the other procedures in SAS can be used to do things that aren't already programmed in.
PROC INSIGHT;
OPEN Size_and_Price;
DIST size;
RUN;
You can cut and paste the graphs from PROC INSIGHT right into Microsoft Word. Simply click on the border of the box you want to copy with the left mouse button to select it. You can then cut and paste like normal. (Before printing the graphs, you might want to adjust them so that they print at the correct size.)
While in PROC INSIGHT, clicking on the arrow in the bottom corner of each of the boxes gives you options for adjusting the graph's format. To quit PROC INSIGHT, click on the X in the upper right portion of the spreadsheet.
DATA Size_and_Price;
INPUT size price @@;
CARDS;
0.951 30 1.036 39.9 0.676 46.5 1.456 48.6 1.186 51.5
1.456 56.99 1.368 59.9 0.994 62.5 1.176 65.5 1.216 69
1.41 76.9 1.344 79 1.064 79.9 1.77 79.95 1.524 82.9
1.75 84.9 1.152 85 1.77 87.9 1.624 89.9 1.54 89.9
1.532 93.5 1.647 94.9 1.344 95.8 1.55 98.5 1.752 99.5
1.45 99.9 1.312 102 1.636 106 1.5 108.9 1.8 109.9
1.972 110 1.387 112.29 2.082 114.9 . 119.5 2.463 119.9
2.572 119.9 2.113 122.9 2.016 123.938 1.852 124.9 2.67 126.9
2.336 129.9 1.98 132.9 2.483 134.9 2.809 135.9 2.036 139.5
2.298 139.99 2.038 144.9 2.37 147.6 2.921 149.99 2.262 152.55
2.456 156.9 2.436 164 1.92 167.5 2.949 169.9 3.31 175
2.805 179 2.553 179.9 2.51 189.5 3.627 199
;
Once we have the data set entered we could start PROC INSIGHT like we did above:
PROC INSIGHT;
OPEN Size_and_Price;
RUN;
Or, we could start it just using the menus, by going to the Solutions menu, the Analysis submenu, and then Interactive Data Analysis. Select Work and then Size_and_Price and then hit Open.
Once the spreadsheet has appeared, we can get the basic regression output by choosing Fit (YX) under the Analyze menu. To predict price from size we choose size for X and price for Y and hit OK.
The output screen now has the regression equation at the top (compare to the equation at the bottom of page 296), followed by the scatter plot with the regression line shown in red (compare to the graph on page 294, and notice that the labels are off by a factor of 1,000).
The Summary of Fit box then contains the mean of y (111.034) and the root-MSE (as shown near the bottom of page 298). R-square is discussed in Section 7.7 and adjusted R-square will be discussed in Section 8.8. Following this is the Analysis of Variance table shown at the top of page 302.
The box labeled Type III tests is another topic that will be discussed in Chapter 8 when we get to multiple regression.
The Parameter Estimates box contains the t-test discussed on pages 302-303. Note that the value 4.1276 (the size row and Std Error column) is the root-MSE/root-Sxx quantity that appears in the bottom of the t-test and that is used for calculating the confidence interval on page 303 as well.
At the bottom left of the output screen is one of the two plots that we will use to check the regression assumptions; it is called the Residual versus Predicted plot.
Finally, notice that two columns have been added to the spreadsheet. The R_price column contains the residuals for each observation and the P_price column contains the predicted values for each observation.
Once the basic regression has been performed we can add a confidence interval for the slope to the output simply by choosing C.I. / C.I. (Wald) for Parameters under the Tables menu, and then choosing the percentage that we want. Choosing 95% will give the result on the bottom of page 303 through top of page 304.
PROC INSIGHT will also produce graphs like Figure 7.5 on page 306 showing the Confidence Interval for the Regression Line and the Prediction Interval for a New Observation. These options are found by selecting Confidence Curves under the Curves menu. The Mean option corresponds to the CI for the regression line, and the Prediction option corresponds to the PI.
Unfortunately PROC INSIGHT won't calculate the values for these curves at particular values of X (like you can see in Table 7.10 on page 314). To do this we need to use another procedure called PROC GLM. After quitting PROC INSIGHT by clicking on the X on the spreadsheet, we can use the following:
PROC GLM DATA=Size_and_Price;
MODEL price = size / ALPHA=0.05 CLI;
RUN;
Notice that we put the y-variable price to the left of the equal sign and the x-variable size to the right. Also note that CLI is for the individual prediction interval and CLM is for the mean (or regression line).
PROC GLM DATA=Size_and_Price;
MODEL price = size / ALPHA=0.05 CLM;
RUN;
Say we wanted to predict the price of a house that had size 1.500. One way of doing this would be to see if there was already an observation with that size. If so, we can simply look for the confidence/prediction interval corresponding to that observation. In this case, observation 29 has size 1.500 so we can just look at that interval.
If there was no house with size 1.5 we would first have to add an observation with that size... but with no price (since we don't actually have the observation). The following lines would make a new data set called SandP2 that will have the size 1.500 house first, will print out the new data set, and will calculate the prediction interval.
It is important that you know what order the variables were in in the Size_and_Price data set, and that we use the same order in the newobs data set here.
DATA newobs;
INPUT size price;
CARDS;
1.500 .
;
DATA SandP2;
SET newobs Size_and_Price;
PROC PRINT DATA=SandP2;
RUN;
PROC GLM DATA=SandP2;
MODEL price = size / ALPHA=0.05 CLI;
RUN;
The 95% prediction interval for a 1500 square foot house ends up being (49.660, 129.453) = ($49,660, $129,453).
First we will use the Size_and_Price data set in PROC INSIGHT. The basic regression output window already contains the residual versus predicted plot shown in Figure 7.9 on page 320.
Notice for the most part that the data is spread evenly above and below the zero-line for each range of predicted values. This is a sign that the errors have mean zero (or equivalently that a straight line is appropriate). Compare this to Figure 7.11 on page 321.
Second, notice that the plot is pretty close to having the amount of spread in each range of predicted values be the same as in each other range of predicted values (we need only move one of the higher points to, say, (200,40) for it to look perfect). This is a sign that the variances of the errors are constant in each range of x-values. See Figure 7.12 on page 322 for a plot of non-constant variances.
In order to check the assumption that the errors are normally distributed we need to construct a q-q plot of the residuals. This can be done by selecting Residual Normal QQ under the Graphs menu. The points are very close to the line, so we don't have evidence against the assumption of normality.
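If you would rather use code than the PROC INSIGHT menus, a rough equivalent (a sketch we added, not from the text) is to save the residuals with PROC GLM's OUTPUT statement and then hand them to PROC UNIVARIATE; the names resids and res here are just ones we chose:

```sas
* Save the residuals from the regression of price on size;
PROC GLM DATA=Size_and_Price;
MODEL price = size;
OUTPUT OUT=resids R=res;
RUN;
* Normal q-q plot of the residuals, with an estimated reference line;
PROC UNIVARIATE DATA=resids;
QQPLOT res / NORMAL(MU=EST SIGMA=EST);
RUN;
```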
In general the best way to check the assumption of independence is to know how the data is gathered. It is also possible to plot the residuals against some other variables to get a good idea. We could have, for example, chosen Scatter Plot (YX) under the Analyze menu, and chosen R_price (the residuals) for X and the other variables in the full data set (like number of fireplaces) for Y. Ideally there would be no pattern in the residual plots. If there was a pattern, that would mean there was some connection between the errors that was not captured by our regression model.
Now, repeat the above regression... but use the entire Housing_Data set. (You will have to close and re-open PROC INSIGHT).
Notice that in this case the variances do not seem equal... they increase as the predicted value increases. One way of fixing this problem is to take the logarithm of the Y variable.
Select Edit, Variables, log(Y). Choose price for Y and click OK. This will add another column to the spreadsheet called L_price, and this column contains the logarithm of price. Now choose Fit (YX) under the Analyze menu and use L_price for Y and size for X.
Now the plot looks much better! You need to be careful though... because now you aren't predicting price... you are predicting the logarithm of price.
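If you wanted the transformed variable in a regular SAS data set, instead of only inside PROC INSIGHT, a DATA step will do the same job. This is a sketch we added; Housing_Log and L_price are just names we chose:

```sas
DATA Housing_Log;
SET Housing_Data;
* log() is the natural logarithm in SAS;
L_price = log(price);
RUN;
PROC GLM DATA=Housing_Log;
MODEL L_price = size;
RUN;
```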
DATA fw08x02;
INPUT Obs age bed bath size lot price;
CARDS;
1 21 3 3.0 0.951 64.904 30.000
2 21 3 2.0 1.036 217.800 39.900
3 7 1 1.0 0.676 54.450 46.500
4 6 3 2.0 1.456 51.836 48.600
5 51 3 1.0 1.186 10.857 51.500
6 19 3 2.0 1.456 40.075 56.990
7 8 3 2.0 1.368 . 59.900
8 27 3 1.0 0.994 11.016 62.500
9 51 2 1.0 1.176 6.259 65.500
10 1 3 2.0 1.216 11.348 69.000
11 32 3 2.0 1.410 25.450 76.900
12 2 3 2.0 1.344 . 79.000
13 25 2 2.0 1.064 218.671 79.900
14 31 3 1.5 1.770 19.602 79.950
15 29 3 2.0 1.524 12.720 82.900
16 16 3 2.0 1.750 130.680 84.900
17 20 3 2.0 1.152 104.544 85.000
18 18 4 2.0 1.770 10.640 87.900
19 28 3 2.0 1.624 12.700 89.900
20 27 3 2.0 1.540 5.679 89.900
21 8 3 2.0 1.532 6.900 93.500
22 19 3 2.0 1.647 6.900 94.900
23 3 3 2.0 1.344 43.560 95.800
24 5 3 2.0 1.550 6.575 98.500
25 5 4 2.0 1.752 8.193 99.500
26 27 3 1.5 1.450 11.300 99.900
27 33 2 2.0 1.312 7.150 102.000
28 4 3 2.0 1.636 6.097 106.000
29 0 3 2.0 1.500 . 108.900
30 36 3 2.5 1.800 83.635 109.900
31 5 4 2.5 1.972 7.667 110.000
32 0 3 2.0 1.387 . 112.290
33 27 4 2.0 2.082 13.500 114.900
34 15 3 2.0 . 269.549 119.500
35 23 4 2.5 2.463 10.747 119.900
36 25 3 2.0 2.572 7.090 119.900
37 24 4 2.0 2.113 7.200 122.900
38 1 3 2.5 2.016 9.000 123.938
39 34 3 2.0 1.852 13.500 124.900
40 26 4 2.0 2.670 9.158 126.900
41 26 3 2.0 2.336 5.408 129.900
42 31 3 2.0 1.980 8.325 132.900
43 24 4 2.5 2.483 10.295 134.900
44 29 5 2.5 2.809 15.927 135.900
45 21 3 2.0 2.036 16.910 139.500
46 10 3 2.0 2.298 10.950 139.990
47 3 3 2.0 2.038 7.000 144.900
48 9 3 2.5 2.370 10.796 147.600
49 29 5 3.5 2.921 11.992 149.990
50 8 3 2.0 2.262 . 152.550
51 7 3 3.0 2.456 . 156.900
52 1 4 2.0 2.436 52.000 164.000
53 27 3 2.0 1.920 226.512 167.500
54 5 3 2.5 2.949 11.950 169.900
55 32 4 3.5 3.310 10.500 175.000
56 29 3 3.0 2.805 16.500 179.000
57 1 3 3.0 2.553 8.610 179.900
58 1 3 2.0 2.510 . 189.500
59 33 3 4.0 3.627 17.760 199.000
;
PROC INSIGHT;
OPEN fw08x02;
RUN;
To construct the scatterplot matrix in Figure 8.2 you would select Scatterplot (YX) under the Analyze menu. Choose all of age, bed, bath, size, lot, and price for Y and for X and hit OK.
To get the estimated regression equation in the middle of page 349 we could use Fit (YX) in the Analyze menu just like we did for simple linear regression. Choose price for Y and age, bed, bath, size, and lot for X and hit OK.
We could even add the XX' matrix in Table 8.3 to the output by choosing the XX' Matrix option under Tables (although we won't really be doing anything with it).
The option for adding the Type I Sum of Squares to the output (as discussed in the Supplement to 8.3) is found under the Tables menu.
Note that the output already includes the Variance Inflation factor discussed in section 8.7.
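If you want the same diagnostic without the menus, PROC REG will print the variance inflation factors when the VIF option is added to the MODEL statement. This is a sketch we added using the data set entered above:

```sas
PROC REG DATA=fw08x02;
* VIF adds a Variance Inflation column to the parameter estimates;
MODEL price = age bed bath size lot / VIF;
RUN;
```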
The PROC GLM code from the supplement is:
PROC GLM DATA=fw08x02;
MODEL price = age bed bath size lot;
RUN;
All of this output was already seen in the PROC INSIGHT output too.
DATA squid;
INPUT Obs RL WL RNL NWL W WT;
CARDS;
1 1.31 1.07 0.44 0.75 0.35 1.95
2 1.55 1.49 0.53 0.90 0.47 2.90
3 0.99 0.84 0.34 0.57 0.32 0.72
4 0.99 0.83 0.34 0.54 0.27 0.81
5 1.05 0.90 0.36 0.64 0.30 1.09
6 1.09 0.93 0.42 0.61 0.31 1.22
7 1.08 0.90 0.40 0.51 0.31 1.02
8 1.27 1.08 0.44 0.77 0.34 1.93
9 0.99 0.85 0.36 0.56 0.29 0.64
10 1.34 1.13 0.45 0.77 0.37 2.08
11 1.30 1.10 0.45 0.76 0.38 1.98
12 1.33 1.10 0.48 0.77 0.38 1.90
13 1.86 1.47 0.60 1.01 0.65 8.56
14 1.58 1.34 0.52 0.95 0.50 4.49
15 1.97 1.59 0.67 1.20 0.59 8.49
16 1.80 1.56 0.66 1.02 0.59 6.17
17 1.75 1.58 0.63 1.09 0.59 7.54
18 1.72 1.43 0.64 1.02 0.63 6.36
19 1.68 1.57 0.72 0.96 0.68 7.63
20 1.75 1.59 0.68 1.08 0.62 7.78
21 2.19 1.86 0.75 1.24 0.72 10.15
22 1.73 1.67 0.64 1.14 0.55 6.88
;
DATA logsquid;
SET squid;
L_wt=log(WT);
L_rl=log(RL);
L_wl=log(WL);
L_rnl=log(RNL);
L_nwl=log(NWL);
L_w=log(W);
KEEP L_wt L_rl L_wl L_rnl L_nwl L_w;
RUN;
PROC REG DATA=logsquid;
MODEL L_wt = L_rl L_wl L_rnl L_nwl L_w / SELECTION = RSQUARE ADJRSQ CP;
RUN;
To analyze the data in Table 8.17 on page 392 you would use the following:
DATA FW08X06;
INPUT Y X1 X2 X3;
CARDS;
763 19.8 128 86
650 20.9 110 72
554 15.1 95 62
742 19.8 123 82
470 21.4 77 52
651 19.5 107 72
756 25.2 123 84
563 26.2 95 83
681 26.8 116 76
579 28.8 100 64
716 22.0 110 80
650 24.2 107 71
761 24.9 125 81
549 25.6 89 61
641 24.7 103 71
606 26.2 103 67
696 21.0 110 77
795 29.4 133 83
582 21.6 96 65
559 20.0 91 62
;
PROC INSIGHT;
OPEN FW08X06;
FIT Y = X1 X2 X3;
RUN;
Now, under the Vars menu, select Hat Diag, Dffits, and Studentized Residuals. You should notice that the F_Y column contains the same values as the DFFITS column in the output on page 393. You can also see that observation 8 has the largest hat diagonal value, and that observation 8 also has the largest studentized residual value. Clicking on the 8 on the left edge of the spreadsheet will highlight the point on the residual vs. predicted plot. Notice that some other points seem just as extreme (like point 11 in the top middle).
Now choose Analyze and then Scatterplot (YX), and choose all of Y, X1, X2, and X3 for X and for Y. Notice the one point that is very different in several of the plots? That is point 8. Point 11 is in with all of the others.
The following is the code needed for performing the one-way ANOVA in the supplement to Section 6.5.
Entering the data: The data and description of the problem can be found in Example 6.6 on page 264. Notice that each observation has two values, the group (the variable with the $ on the input line) and the response variable (the yij). The @@ simply tells it that more than one observation will be entered per line.
DATA shrimp_weights;
INPUT diet $ weight @@;
CARDS;
cafo_1 47.0 cafo_1 50.9 cafo_1 45.2 cafo_1 48.9 cafo_1 48.2
calo_2 38.1 calo_2 39.6 calo_2 39.1 calo_2 33.1 calo_2 40.3
faso_3 57.4 faso_3 55.1 faso_3 54.2 faso_3 56.8 faso_3 52.5
falo_4 54.2 falo_4 57.7 falo_4 57.1 falo_4 47.9 falo_4 53.4
bc_5 38.5 bc_5 42.0 bc_5 38.7 bc_5 38.9 bc_5 44.6
lma_6 48.9 lma_6 47.0 lma_6 47.0 lma_6 44.4 lma_6 46.9
lmaa_7 87.8 lmaa_7 81.7 lmaa_7 73.3 lmaa_7 82.7 lmaa_7 74.8
;
Checking the assumptions and getting the ANOVA table: PROC INSIGHT can be used to get the two residual plots, while PROC GLM will conduct the modified Levene test. Both will return the ANOVA table as well as some extra output. As in regression, the y variable goes on the left-hand side of the equals sign on the FIT and MODEL lines. The group variables are the CLASS variables.
PROC INSIGHT;
OPEN shrimp_weights;
FIT weight=diet;
RUN;
PROC GLM DATA=shrimp_weights ORDER=DATA;
CLASS diet;
MODEL weight=diet;
MEANS diet / HOVTEST=BF;
RUN;
Estimating contrasts: The following code will produce the output in Table 6.23 on page 266. Notice that it is NOT adjusted for the Holm procedure, BUT it does return both the estimate (L-hat) and the estimated standard deviation of L-hat that could be used to make a confidence interval. (You simply take the estimated value +/- t(a/2) times the standard error, where the degrees of freedom come from the error line in the ANOVA table.) It is important to note, though, that if you make more than one confidence interval you will have multiple comparison problems and will want to use some procedure to correct for that. The Holm procedure doesn't work for confidence intervals, so you could use the Bonferroni method (or Tukey if you are doing only pairwise comparisons).
PROC GLM DATA=shrimp_weights ORDER=DATA;
CLASS diet;
MODEL weight=diet;
ESTIMATE 'newold' diet 3 3 3 3 -4 -4 -4 / divisor=12;
ESTIMATE 'corn' diet 5 5 -2 -2 -2 -2 -2 / divisor=10;
ESTIMATE 'fish' diet 4 -3 4 4 -3 -3 -3 / divisor=12;
ESTIMATE 'lin' diet -2 5 -2 5 -2 -2 -2 / divisor=10;
ESTIMATE 'sun' diet -1 -1 6 -1 -1 -1 -1 / divisor=6;
ESTIMATE 'mic' diet -2 -2 -2 -2 -2 5 5 / divisor=10;
ESTIMATE 'art' diet -1 -1 -1 -1 -1 -1 6 / divisor=6;
RUN;
All pairwise comparisons: PROC MULTTEST can also be used for estimating contrasts. While it doesn't give the estimates of the contrasts, it does automatically adjust the p-values.
PROC MULTTEST DATA=shrimp_weights ORDER=DATA HOLM;
CLASS diet;
CONTRAST '1 vs. 2' 1 -1 0 0 0 0 0;
CONTRAST '1 vs. 3' 1 0 -1 0 0 0 0;
CONTRAST '1 vs. 4' 1 0 0 -1 0 0 0;
CONTRAST '1 vs. 5' 1 0 0 0 -1 0 0;
CONTRAST '1 vs. 6' 1 0 0 0 0 -1 0;
CONTRAST '1 vs. 7' 1 0 0 0 0 0 -1;
CONTRAST '2 vs. 3' 0 1 -1 0 0 0 0;
CONTRAST '2 vs. 4' 0 1 0 -1 0 0 0;
CONTRAST '2 vs. 5' 0 1 0 0 -1 0 0;
CONTRAST '2 vs. 6' 0 1 0 0 0 -1 0;
CONTRAST '2 vs. 7' 0 1 0 0 0 0 -1;
CONTRAST '3 vs. 4' 0 0 1 -1 0 0 0;
CONTRAST '3 vs. 5' 0 0 1 0 -1 0 0;
CONTRAST '3 vs. 6' 0 0 1 0 0 -1 0;
CONTRAST '3 vs. 7' 0 0 1 0 0 0 -1;
CONTRAST '4 vs. 5' 0 0 0 1 -1 0 0;
CONTRAST '4 vs. 6' 0 0 0 1 0 -1 0;
CONTRAST '4 vs. 7' 0 0 0 1 0 0 -1;
CONTRAST '5 vs. 6' 0 0 0 0 1 -1 0;
CONTRAST '5 vs. 7' 0 0 0 0 1 0 -1;
CONTRAST '6 vs. 7' 0 0 0 0 0 1 -1;
TEST mean(weight);
RUN;
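If you would rather have SAS compute a contrast confidence interval than do the t(a/2) calculation by hand, the CLPARM option on the MODEL statement of PROC GLM prints confidence limits for every ESTIMATE line. This is a sketch we added showing just the first contrast; the multiple comparison caution above still applies:

```sas
PROC GLM DATA=shrimp_weights ORDER=DATA;
CLASS diet;
* CLPARM requests confidence limits for the ESTIMATE results;
MODEL weight=diet / ALPHA=0.05 CLPARM;
ESTIMATE 'newold' diet 3 3 3 3 -4 -4 -4 / divisor=12;
RUN;
```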
Other multiple comparisons procedures: The following code will produce the output in Table 6.24 using Tukey's procedure:
PROC GLM DATA=shrimp_weights ORDER=DATA;
CLASS diet;
MODEL weight=diet;
MEANS diet / ALPHA=0.05 TUKEY;
RUN;
Adding CLDIFF on the right side of the / would produce confidence intervals instead of the nice display. Besides TUKEY it will also do BON (for Bonferroni), DUNCAN, and SCHEFFE, among others.
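For example, a Bonferroni version with confidence intervals would look like the following sketch; only the MEANS line changes:

```sas
PROC GLM DATA=shrimp_weights ORDER=DATA;
CLASS diet;
MODEL weight=diet;
* BON = Bonferroni adjustment, CLDIFF = report confidence intervals;
MEANS diet / ALPHA=0.05 BON CLDIFF;
RUN;
```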
Chapter 9 - Basic Two-way ANOVA
This section has sample code for analyzing a two-way ANOVA that is factorial, balanced, with replications, and fixed effects.
Entering the data: The data used is the oil example described in example 9.2 on page 421 and shown in table 9.5 on page 428.
DATA oil;
INPUT cyl oil $ mpg;
CARDS;
4 STANDARD 22.6
4 STANDARD 24.5
4 STANDARD 23.1
4 STANDARD 25.3
4 STANDARD 22.1
4 MULTI 23.7
4 MULTI 24.6
4 MULTI 25.0
4 MULTI 24.0
4 MULTI 23.1
4 GASMISER 26.0
4 GASMISER 25.0
4 GASMISER 26.9
4 GASMISER 26.0
4 GASMISER 25.4
6 STANDARD 23.6
6 STANDARD 21.7
6 STANDARD 20.3
6 STANDARD 21.0
6 STANDARD 22.0
6 MULTI 23.5
6 MULTI 22.8
6 MULTI 24.6
6 MULTI 24.6
6 MULTI 22.5
6 GASMISER 21.4
6 GASMISER 20.7
6 GASMISER 20.5
6 GASMISER 23.2
6 GASMISER 21.3
;
Checking the assumptions and getting the ANOVA table: The ANOVA Table 9.6 (page 430) can be generated using either PROC INSIGHT or PROC GLM. PROC INSIGHT will also give the residual plots. PROC GLM, on the other hand, can be made to give the means that appear in Figure 9.1 on page 431. (The text appears to give the wrong mean for 4/GASMISER.)
PROC INSIGHT;
OPEN oil;
FIT mpg = cyl oil cyl*oil;
RUN;
PROC GLM DATA=oil ORDER=DATA;
CLASS cyl oil;
MODEL mpg = cyl oil cyl*oil;
MEANS cyl*oil;
RUN;
In order to use the modified Levene's test, though, we need to trick SAS into thinking this is a one-way ANOVA. We do this by making 6 separate treatment levels (the combinations of cylinder and oil types).
DATA oil2;
SET oil;
KEEP block mpg;
block = trim(cyl)||trim(oil);
PROC PRINT DATA=oil2;
RUN;
PROC GLM DATA=oil2 ORDER=DATA;
CLASS block;
MODEL mpg = block;
MEANS block / HOVTEST=BF;
RUN;
Estimating Contrasts: The code to perform all of the various multiple comparison procedures can be somewhat complicated. But it is fairly straightforward to enter the contrasts described on pages 433-435 and to get the output in Table 9.8 on page 436. (Actually the output on page 436 is incorrect, as the estimates are all a factor of 2, 3, or 4 smaller than they should be. For example, if you look at the data on page 428, the difference between 4 cylinder and 6 cylinder should be the 2.24 we get with the code below, not the 1.12 that the book reports. The t-test statistics are correct though.)
To see how we entered the coefficients for the L1L2 contrast, just look at the table on page 434... or look at table 9.7 and the values that went with the different cylinder/oil combinations. Remember we need to make sure we entered the data in the right order!
To get F statistics you would simply use CONTRAST instead of ESTIMATE for each of the lines.
PROC GLM DATA=oil ORDER=DATA;
CLASS cyl oil;
MODEL mpg = cyl oil cyl*oil;
ESTIMATE 'L1=cyl' cyl -1 1;
ESTIMATE 'L2=std. vs. m&g' oil 1 -.5 -.5;
ESTIMATE 'L3=m vs. g.' oil 0 1 -1;
ESTIMATE 'L1L2' cyl*oil -1 .5 .5 1 -.5 -.5;
ESTIMATE 'L1L3' cyl*oil 0 -1 1 0 1 -1;
RUN;
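Following the note above about getting F statistics, the cylinder contrast would look like this sketch (the other lines would change the same way):

```sas
PROC GLM DATA=oil ORDER=DATA;
CLASS cyl oil;
MODEL mpg = cyl oil cyl*oil;
* CONTRAST gives an F test instead of the estimate and t test;
CONTRAST 'L1=cyl' cyl -1 1;
RUN;
```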
Sections 10.2-10.4 - Random Effects Models
This section has the code for analyzing the randomized block design discussed in Section 10.2. If the data being analyzed is fixed effects, then simply leave out the RANDOM line. If there are replications (as in Section 10.3), then the interaction term can be included. This can get complicated, however, and I would suggest having a statistical consultant assist you with it.
To analyze the data in Table 10.2 on page 465 we would use the following:
DATA wheat;
INPUT variety $ block $ yield @@;
CARDS;
A 1 31.0 A 2 39.5 A 3 30.5 A 4 35.5 A 5 37.0
B 1 28.0 B 2 34.0 B 3 24.5 B 4 31.5 B 5 31.5
C 1 25.5 C 2 31.0 C 3 25.0 C 4 33.0 C 5 29.5
;
PROC GLM DATA=wheat ORDER=DATA;
CLASS variety block;
MODEL yield = variety block;
RANDOM block;
RUN;
Note that we had to leave out the interaction term because there were no replications. Also note that the ESTIMATE or CONTRAST lines for fixed effects could be used just like they were for a factorial design.
This will produce the information in ANOVA Table 10.4 (page 469) in the Type III table and the ANOVA table. It also gives the formulas for the expected mean squares for the treatments (variety) and blocks as seen in Table 10.3 (page 467); notice that in SAS the term involving the sum of the tau-squareds is summarized by a Q.
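As an illustration of the note above about ESTIMATE lines for fixed effects, a contrast comparing varieties A and B could be added like this. This is a sketch we added; it relies on the varieties being entered in the order A, B, C, which the ORDER=DATA option preserves:

```sas
PROC GLM DATA=wheat ORDER=DATA;
CLASS variety block;
MODEL yield = variety block;
RANDOM block;
* Coefficients follow the order the varieties were entered: A, B, C;
ESTIMATE 'A vs. B' variety 1 -1 0;
RUN;
```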
To analyze the Latin squares in Section 10.4, the code would be almost the same as above. The difference is that a third class variable would have to be entered for each observation and so would appear on the INPUT, CLASS, MODEL, and possibly RANDOM lines.
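A sketch of what that might look like; the data set name (latin) and the variable names (treatment, row, column, response) are made up for illustration:

```sas
PROC GLM DATA=latin ORDER=DATA;
* row and column are the two blocking variables of the Latin square;
CLASS treatment row column;
MODEL response = treatment row column;
RANDOM row column;
RUN;
```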
Section 11.5 - Analysis of Covariance
The following code works through Example 11.3 on pages 519-522.

DATA cmclass;
INPUT obs class pre post IQ;
CARDS;
1 1 3 10 122
2 2 24 34 129
3 3 10 21 114
4 1 5 10 121
5 2 18 27 114
6 3 3 18 114
7 1 6 14 101
8 2 11 20 116
9 3 10 20 110
10 1 11 29 131
11 2 10 13 126
12 3 3 9 94
13 1 11 17 129
14 2 11 19 110
15 3 6 13 102
16 1 13 21 115
17 2 2 28 138
18 3 9 24 128
19 1 7 5 122
20 2 10 13 119
21 3 13 19 111
22 1 12 17 112
23 2 14 21 123
24 3 7 25 119
25 1 13 17 123
26 2 11 14 115
27 3 10 24 120
28 1 8 22 119
29 2 12 17 116
30 3 9 21 112
31 1 9 22 122
32 2 14 16 125
33 3 7 21 105
34 1 10 18 111
35 2 7 10 122
36 3 4 17 120
37 1 6 11 117
38 2 8 18 120
39 3 7 24 120
40 1 13 20 112
41 2 10 13 111
42 3 12 25 118
43 1 7 8 122
44 2 11 17 127
45 3 6 23 110
46 1 11 20 124
47 2 12 13 122
48 3 7 22 127
49 1 5 15 118
50 2 6 13 127
51 1 9 25 113
52 2 3 13 115
53 1 8 25 126
54 2 4 13 112
55 1 2 14 132
56 1 11 17 93
;
PROC GLM DATA=cmclass;
CLASS class;
MODEL post = class pre;
ESTIMATE 'slope' pre 1;
LSMEANS class / stderr pdiff;
RUN;

Here the pretest score is to be used like a variable in regression, so we do not put it on the CLASS line; we do put class there. An interaction generally doesn't make any sense for an ANCOVA, so we leave it out. The ESTIMATE line gives us the slope. The LSMEANS line gives us the estimates, along with the standard errors, the tests of whether each class effect = 0, and the p-values of the tests of whether each class is the same as each of the other classes. These last tests come from the matrix-like part of the output, and they give the same results as on page 522: namely that one and two are not significantly different, but one and three are, and two and three are. (Remember that the experimentwise error rate might need to be adjusted if using these.)
Because we are using the pretest score (the covariate) similarly to a blocking variable here, we would use the Type III sums of squares to test whether the class they are in makes a difference.
The assumptions are the same as before. However, because we are actually doing something like a regression, we can't use the methods we used to test the assumptions for an ANOVA. Adding the line:
OUTPUT out=cmresids P=FIT R=RESIDUAL;

after the MODEL line will make a new data set called cmresids. Running PROC INSIGHT with this new data set will then let us plot the residuals. We could, for example, plot the CLASS vs. the residuals, the fitted values vs. the residuals, and the PRE vs. the residuals. We can also use the Dist(Y) option under the Analyze menu to do a histogram and q-q plot of the residuals. Similarly, we could just have used Fit (Y X).
In order to check that the model is appropriate, we need to make sure that the slope doesn't change with the different classes. To get the same answer as the bottom of page 526, all we need to do is fit the model again, this time with the interaction CLASS*PRE included. The null hypothesis that this line of the ANOVA table tests is that the slope is the same for each class, so if we reject it, it means the assumptions of the ANCOVA are not met.
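A minimal sketch of that check, reusing the cmclass data set from above:

```
PROC GLM DATA=cmclass;
CLASS class;
MODEL post = class pre class*pre;
RUN;
```

The class*pre line of the Type III table is the test that the slope is the same for every class.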
Section 11.7 - Logistic Regression
The following code will analyze the data in Table 11.8 on page 532 using logistic regression (it is probably a better analysis than the one the book uses there, so don't pay attention to what the output on page 533 looks like). This analysis is shown on page 541. (Remember that if we had entered the data in two columns, we would have needed to put @@ before the semicolon on the INPUT line.)
DATA fw11x04;
INPUT y $ income;
CARDS;
0 9.2
0 12.9
0 9.2
1 9.6
0 9.3
1 10.1
0 9.4
1 10.3
0 9.5
1 10.9
0 9.5
1 10.9
0 9.5
1 11.1
0 9.6
1 11.1
0 9.7
1 11.1
0 9.7
1 11.5
0 9.8
1 11.8
0 9.8
1 11.9
0 9.9
1 12.1
0 10.5
1 12.2
0 10.5
1 12.5
0 10.9
1 12.6
0 11
1 12.6
0 11.2
1 12.6
0 11.2
1 12.9
0 11.5
1 12.9
0 11.7
1 12.9
0 11.8
1 12.9
0 12.1
1 13.1
0 12.3
1 13.2
0 12.5
1 13.5
;
PROC LOGISTIC DATA=fw11x04 DESCENDING;
MODEL y = income / LACKFIT;
RUN;

The important part of the output is:
Analysis of Maximum Likelihood Estimates

                              Standard
Parameter   DF    Estimate       Error    Chi-Square    Pr > ChiSq
Intercept    1    -11.3472      3.3511       11.4660        0.0007
income       1      1.0018      0.2954       11.5013        0.0007

Testing Global Null Hypothesis: BETA=0

Test                 Chi-Square    DF    Pr > ChiSq
Likelihood Ratio        15.5687     1        <.0001
Score                   14.1810     1        0.0002
Wald                    11.5013     1        0.0007

Hosmer and Lemeshow Goodness-of-Fit Test

Chi-Square    DF    Pr > ChiSq
    6.7043     7        0.4603

Looking at the equation on page 536, -11.3472 is the estimate of beta0 and 1.0018 is the estimate of beta1. The likelihood ratio test p-value of <.0001 tests the null hypothesis that beta1 is zero, and the p-value of 0.4603 from the Hosmer and Lemeshow test is for the null hypothesis that the logistic form is appropriate.
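To see what the fitted equation says, we can plug the estimates into the logistic function by hand. A sketch in a SAS DATA step (the income value of 11 is just an illustration, not from the text):

```
DATA pred;
income = 11;
xb = -11.3472 + 1.0018*income;    * the fitted linear predictor;
p = 1/(1 + EXP(-xb));             * predicted probability that y = 1;
PUT p=;
RUN;
```

This gives a predicted probability of roughly 0.42 that y = 1 when income is 11.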
In most cases, help with the computers (NOT the programming) can be obtained by e-mailing help@stat.sc.edu.
For the printers on the first and second floor, printer paper is available in the Stat Department office. For printers on the third floor, paper is available in the Math Department office.
If you are using a PC, restarting the machine will fix many problems, but obviously don't try that if you have a file that won't save or the like.
If SAS won't start, one of the things to check is that your computer has loaded the X drive correctly (whatever that means). Go to My Computer and see if the apps on 'lc-nt' (X:) is listed as one of the drives. If it isn't, go to the Tools menu and select Map Network Drive.... Select X for the drive, and enter \\lc-nt\apps for the Folder. Then click Finish. This should connect your computer to the X drive and allow SAS to run. If you already had the X-drive connected, then you will need to e-mail help@stat.sc.edu.
If your graphs print out extremely small after you copy them to Word, you might be able to fix the problem by "opening and closing" the image. In Word, right click once on the image, and select Edit Picture or Open Picture Object under the Edit menu. A separate window will open with the image in it. Simply choose Close Picture. It should now print out ok. This will also make the spacing between the characters in the labels look right if they were somewhat off.
If the problem is an emergency requiring immediate attention, see the statistics department computer person in room 209D. If they are not available and it is an emergency, see Minna Moore in room 417. Flagrantly non-emergency cases may result in suspension of computer privileges.