STAT 516 - SAS Templates

This page contains sample code for using SAS in the course STAT 516 with the text Statistical Methods, 2nd Edition by Freund and Wilson. It is designed so that you can read along with the material below and copy and paste the SAS code (usually looking like typewriter font) directly from the web page into the SAS program editor window. The introduction section is written assuming you have no previous experience using SAS. In the later sections the examples usually correspond to portions of the text book, and it would be helpful to have the book with you. In general, you will be able to use these templates to do your homework assignments after modifying them by entering your own data and making sure that all of the names match up.

Originally created by B. Habing - Last updated: 4/20/04


Introduction to SAS

  • Simple Linear Regression - Chapter 7
  • Confidence Intervals for Regression - Section 7.5
  • Plots for Assumption Checking - Section 7.8
  • Multiple Regression - Chapter 8
  • Variable Selection - Section 8.8
  • Outlier Diagnostics - Section 8.9
  • One-way ANOVA - Chapter 6
  • Basic Two-way ANOVA - Chapter 9
  • Random Effects Models - Sections 10.2 - 10.4
  • Analysis of Covariance - Section 11.5
  • Logistic Regression - Section 11.7

    Text Book Data Sets
    Computer Trouble?
    SAS Won't Start?
    Graphs Printing Small in Word?


    Introduction to SAS:

    To begin with you should open up the program SAS. If there is a SAS icon on the screen already, you can just double click that to start the program. If there is not a SAS icon on the screen already, you need to use the start menu. Click the Start rectangle in the bottom left of the screen with your left mouse button. This should open up a list of options above the Start rectangle, and you should slide the mouse arrow up to the one labeled Programs. This should open up a list of more options to the right and you should keep sliding along until you get to the one labeled The SAS System for Windows. Once you get there click on it. This should start up SAS.

    There are three main windows that are used in SAS: the Log window, the Program Editor window, and the Output window. The Log and Program Editor windows are the two on the screen when you start up the program. The Output window isn't visible yet because you haven't created anything to output. If you happen to lose one of these windows, they usually have a bar at the bottom of the SAS window; you can also find them under the View menu.

    The Program Editor is where you tell SAS what you want done. The Output window is where it puts the results, and the Log window is where it tells you what it did and if there are any errors. It is important to note that the Output window often gets very long! You usually want to copy the parts you want to print into MS-Word and print from there. It is also important to note that you should check the Log window every time you run anything. Errors will appear in maroon; successful runs appear in blue.

    Hitting the [F3] key at the top of your keyboard will run the program currently in the Program Editor window (or the part of it you have highlighted). You can also run programs by clicking on the little image of the runner in the list of symbols near the top of the SAS program screen.

    In older editions of SAS, running the program will erase whatever was written in the Program Editor window. To recall whatever was there, make sure you are in that window, and hit the [F4] key.

    If you keep running more programs, SAS will keep adding all of the results to the Output window. To clear the Output window, make sure you are in that window, and choose Clear text under the Edit menu.


    The following is the SAS code for examining the housing price data that is discussed throughout chapter 7. The full data set can be found in Table 1.2 on page 9. The subset of the data that we will eventually use is in Table 7.2 on page 293.

    The full data set is formatted in eleven columns:

    1. Obs is the observation number
    2. zip is the last digit of the zip code
    3. age is the age of the house in years
    4. bed is the number of bedrooms
    5. bath is the number of bathrooms
    6. size is the interior area in square feet
    7. lot is the size of the lot in square feet
    8. exter is the external siding material
    9. garage is the number of cars the garage will hold
    10. fp is the number of fireplaces
    11. price is the price of the home in dollars
    (You can just cut and paste the code below starting with OPTIONS and ending with the ; after all the numbers into the Program Editor window.)
     

    OPTIONS pagesize=50 linesize=64;
    
    DATA Housing_Data;   
    INPUT Obs   zip   age   bed   bath size lot  exter $  garage   fp    price; 
    CARDS;
      1    3     21    3     3.0   0.951    64.904   Other      0      0    30.000
      2    3     21    3     2.0   1.036   217.800   Frame      0      0    39.900
      3    4      7    1     1.0   0.676    54.450   Other      2      0    46.500
      4    3      6    3     2.0   1.456    51.836   Other      0      1    48.600
      5    1     51    3     1.0   1.186    10.857   Other      1      0    51.500
      6    2     19    3     2.0   1.456    40.075   Frame      0      0    56.990
      7    3      8    3     2.0   1.368      .      Frame      0      0    59.900
      8    4     27    3     1.0   0.994    11.016   Frame      1      0    62.500  
      9    1     51    2     1.0   1.176     6.259   Frame      1      1    65.500  
     10    3      1    3     2.0   1.216    11.348   Other      0      0    69.000
     11    4     32    3     2.0   1.410    25.450   Brick      0      0    76.900
     12    3      2    3     2.0   1.344      .      Other      0      1    79.000  
     13    3     25    2     2.0   1.064   218.671   Other      0      0    79.900 
     14    1     31    3     1.5   1.770    19.602   Brick      0      1    79.950 
     15    4     29    3     2.0   1.524    12.720   Brick      2      1    82.900 
     16    3     16    3     2.0   1.750   130.680   Frame      0      0    84.900
     17    3     20    3     2.0   1.152   104.544   Other      2      0    85.000 
     18    3     18    4     2.0   1.770    10.640   Other      0      0    87.900
     19    4     28    3     2.0   1.624    12.700   Brick      2      1    89.900
     20    2     27    3     2.0   1.540     5.679   Brick      2      1    89.900
     21    1      8    3     2.0   1.532     6.900   Brick      2      1    93.500 
     22    4     19    3     2.0   1.647     6.900   Brick      2      0    94.900 
     23    2      3    3     2.0   1.344    43.560   Other      1      0    95.800
     24    4      5    3     2.0   1.550     6.575   Brick      2      1    98.500
     25    4      5    4     2.0   1.752     8.193   Brick      2      0    99.500
     26    4     27    3     1.5   1.450    11.300   Brick      1      1    99.900
     27    4     33    2     2.0   1.312     7.150   Brick      0      1   102.000
     28    1      4    3     2.0   1.636     6.097   Brick      1      0   106.000
     29    4      0    3     2.0   1.500      .      Brick      2      0   108.900
     30    2     36    3     2.5   1.800    83.635   Brick      2      1   109.900
     31    3      5    4     2.5   1.972     7.667   Brick      2      0   110.000
     32    3      0    3     2.0   1.387      .      Brick      2      0   112.290
     33    4     27    4     2.0   2.082    13.500   Brick      3      1   114.900
     34    3     15    3     2.0    .      269.549   Frame      0      0   119.500
     35    4     23    4     2.5   2.463    10.747   Brick      2      1   119.900
     36    4     25    3     2.0   2.572     7.090   Brick      2      1   119.900
     37    4     24    4     2.0   2.113     7.200   Brick      2      1   122.900
     38    4      1    3     2.5   2.016     9.000   Brick      2      1   123.938
     39    1     34    3     2.0   1.852    13.500   Brick      2      0   124.900
     40    4     26    4     2.0   2.670     9.158   Brick      2      1   126.900
     41    2     26    3     2.0   2.336     5.408   Brick      0      1   129.900
     42    4     31    3     2.0   1.980     8.325   Brick      2      1   132.900
     43    2     24    4     2.5   2.483    10.295   Brick      2      1   134.900
     44    2     29    5     2.5   2.809    15.927   Brick      2      1   135.900
     45    4     21    3     2.0   2.036    16.910   Brick      2      1   139.500
     46    3     10    3     2.0   2.298    10.950   Brick      2      1   139.990
     47    4      3    3     2.0   2.038     7.000   Brick      2      0   144.900  
     48    2      9    3     2.5   2.370    10.796   Brick      2      1   147.600
     49    2     29    5     3.5   2.921    11.992   Brick      2      1   149.990
     50    2      8    3     2.0   2.262      .      Brick      2      1   152.550
     51    4      7    3     3.0   2.456      .      Brick      2      1   156.900
     52    4      1    4     2.0   2.436    52.000   Brick      2      1   164.000
     53    3     27    3     2.0   1.920   226.512   Frame      4      1   167.500
     54    4      5    3     2.5   2.949    11.950   Brick      2      1   169.900
     55    2     32    4     3.5   3.310    10.500   Brick      2      1   175.000 
     56    4     29    3     3.0   2.805    16.500   Brick      2      1   179.000
     57    4      1    3     3.0   2.553     8.610   Brick      2      1   179.900
     58    4      1    3     2.0   2.510      .      Other      2      1   189.500 
     59    4     33    3     4.0   3.627    17.760   Brick      3      1   199.000
     60    2     25    4     2.5   3.056    10.400   Other      2      1   216.000
     61    3     16    3     2.5   3.045   168.576   Brick      3      1   229.900
     62    4      2    4     4.5   3.253    54.362   Brick      3      2   285.000
     63    2      2    4     3.5   4.106    44.737   Brick      3      1   328.900 
     64    4      0    3     2.5   2.993      .      Brick      2      1   313.685 
     65    4      0    3     2.5   2.992    14.500   Other      3      1   327.300 
     66    4     20    4     3.0   3.055   250.034   Brick      3      0   349.900  
     67    4     18    5     4.0   3.846    23.086   Brick      4      3   370.000
     68    4      3    4     4.5   3.314    43.734   Brick      3      1   380.000 
     69    4      5    4     3.5   3.472   130.723   Brick      2      2   395.000
    ;
    
    Note that _most_ lines end with a semi-colon, but not all. SAS will fail if you leave one out, but usually the Log window will tell you where the problem is. An extra or missing semi-colon is probably the single most common reason for a SAS program failing to run.

    The OPTIONS line only needs to be used once during a session. It sets the length of the page and the length of the lines for viewing on the screen and printing. The font can be set by using the Options choice under the Tools menu along the top of the screen. When you cut and paste from SAS to a word processor, the font 10 point Courier New works well.

    The DATA line defines what the name of the data set is. The name should start with a letter, contain no spaces, and use only letters, numbers, and underscores. The INPUT line gives the names of the variables, and they must be in the order that the data will be entered. The $ after exter on the INPUT line means that the variable exter is qualitative instead of quantitative. (Sometimes you will see an @@ at the end of the INPUT line. This means that several observations will be entered on the same line, one right after another, instead of needing one line for each observation.)

    Note that in several places a . appears instead of a number or a word. This signifies a missing value, and SAS will ignore that observation in any calculation that would have used the missing value. In this case, if you tried to calculate the mean of each of the variables, different numbers of observations would be used in each case (the PROC MEANS sketch below shows one way to see this).

    If we hit F3 at this point to enter what we put above, nothing new will appear on the output screen. This is no big surprise however once we realize that we haven't told SAS to return any output! The code below simply tells SAS to print back out the data we entered.


    PROC PRINT DATA=Housing_Data;
    TITLE "Table 1.2: Housing Data";
    RUN;
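
    To see the effect of the missing values mentioned above, we could also ask PROC MEANS for the number of non-missing observations (N) of each variable. This is a minimal sketch (PROC MEANS isn't used elsewhere on this page; the N and MEAN keywords simply request those two statistics). Notice that size and lot will report a smaller N than price:

    PROC MEANS DATA=Housing_Data N MEAN;
    VAR size lot price;
    RUN;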

    For the examples in Chapter 7 we are concerned only with the size and price of houses priced at less than $200,000. The following lines will construct this data subset and then print it out to verify that it has worked correctly.


    DATA Size_and_Price;
    SET Housing_Data;
    KEEP size price;
    WHERE price < 200;
    RUN;

    PROC PRINT DATA=Size_and_Price;
    TITLE "Table 7.2: Data on Size and Price";
    RUN;

    Whenever you have a DATA line, that means you are creating a new dataset with that name. The SET line tells it that we are making this new data set from an old one. The KEEP line says the only variables we want in this new data set are the ones on that line. The lines after that give any special commands that go into the making of the new data set. In this case the WHERE command is used to make sure we only keep the houses priced under $200,000 (remember that price is recorded in thousands). Later we will see examples of making datasets that involve using mathematical functions (a small sketch of that appears below). In any case, it should be pretty straightforward when you just stop and read through what the lines say.
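
    As an example of using a mathematical function when making a data set, the following sketch computes each house's price per square foot (the data set name Price_Ratio and the variable name ratio are just made up for illustration):

    DATA Price_Ratio;
    SET Size_and_Price;
    ratio = price / size;  * thousands of $ over thousands of sq ft = dollars per square foot;
    RUN;

    PROC PRINT DATA=Price_Ratio;
    RUN;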

    The most basic procedure to give out some actual graphs and statistics is PROC UNIVARIATE:


    PROC UNIVARIATE DATA=Size_and_Price PLOT FREQ ;
    VAR size;
    TITLE 'Sizes of Houses under $200,000';
    RUN;

    The VAR line says which of the variables you want a summary of. Also note that the graphs here are pretty awful. The INSIGHT procedure will do most of the things that the UNIVARIATE procedure will, and a lot more. The one thing it won't do is let you program it to do new things. Later in the semester we'll see how some of the other procedures in SAS can be used to do things that aren't already programmed in.


    PROC INSIGHT;
    OPEN Size_and_Price;
    DIST size;
    RUN;

    You can cut and paste the graphs from PROC INSIGHT right into Microsoft Word. Simply click on the border of the box you want to copy with the left mouse button to select it. You can then cut and paste like normal. (Before printing the graphs, you might want to adjust them so that they print at the correct size.)

    While in PROC INSIGHT, clicking on the arrow in the bottom corner of each of the boxes gives you options for adjusting the graph's format. To quit PROC INSIGHT, click on the X in the upper right portion of the spreadsheet.


    Chapter 7 - Simple Linear Regression

    PROC INSIGHT can also be used for performing most of the regression analyses we will be encountering. The following example uses the data set Size_and_Price we created above. If you just created that data set then you do not need to enter it again. Otherwise it can be created using the following code:

    DATA Size_and_Price;
    INPUT size price @@;
    CARDS;
    0.951	30	1.036	39.9	0.676	46.5
    1.456	48.6	1.186	51.5	1.456	56.99
    1.368	59.9	0.994	62.5	1.176	65.5
    1.216	69	1.41	76.9	1.344	79
    1.064	79.9	1.77	79.95	1.524	82.9
    1.75	84.9	1.152	85	1.77	87.9
    1.624	89.9	1.54	89.9	1.532	93.5
    1.647	94.9	1.344	95.8	1.55	98.5
    1.752	99.5	1.45	99.9	1.312	102
    1.636	106	1.5	108.9	1.8	109.9
    1.972	110	1.387	112.29	2.082	114.9
    .	119.5	2.463	119.9	2.572	119.9
    2.113	122.9	2.016	123.938	1.852	124.9
    2.67	126.9	2.336	129.9	1.98	132.9
    2.483	134.9	2.809	135.9	2.036	139.5
    2.298	139.99	2.038	144.9	2.37	147.6
    2.921	149.99	2.262	152.55	2.456	156.9
    2.436	164	1.92	167.5	2.949	169.9
    3.31	175	2.805	179	2.553	179.9
    2.51	189.5	3.627	199		
    ;
    

    Once we have the data set entered we could start PROC INSIGHT like we did above:

    PROC INSIGHT;
    OPEN Size_and_Price;
    RUN;
    
    Or, we could start it just using the menus, by going to the Solutions menu, the Analysis submenu, and then Interactive Data Analysis. Select Work and then Size_and_Price and then hit Open.

    Once the spreadsheet has appeared, we can get the basic regression output by choosing Fit (YX) under the Analyze menu. To predict price from size we choose size for X and price for Y and hit OK.

    The output screen now has the regression equation at the top (compare to the equation at the bottom of page 296), followed by the scatter plot with the regression line shown in red (compare to the graph on page 294, and notice that the labels are off by a factor of 1,000).

    The Summary of Fit box then contains the mean of y (111.034) and the root-MSE (as shown near the bottom of page 298). R-square is discussed in Section 7.7 and adjusted R-square will be discussed in Section 8.8. Following this is the Analysis of Variance table shown at the top of page 302.

    The box labeled Type III tests is another topic that will be discussed in Chapter 8 when we get to multiple regression.

    The Parameter Estimates box contains the t-test discussed on pages 302-303. Note that the value 4.1276 (the size row and Std Error column) is the root-MSE/root-Sxx quantity that appears in the bottom of the t-test and that is used for calculating the confidence interval on page 303 as well.

    At the bottom left of the output screen is one of the two plots that we will use to check the regression assumptions; it is called the Residual versus Predicted plot.

    Finally, notice that two columns have been added to the spreadsheet. The R_price column contains the residuals for each observation and the P_price column contains the predicted values for each observation.
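
    If you would rather type code than use the PROC INSIGHT menus, the same estimates, ANOVA table, and t-tests can be produced with PROC REG (a procedure we will meet again in Section 8.8). A minimal sketch:

    PROC REG DATA=Size_and_Price;
    MODEL price = size;
    RUN;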


    Section 7.5 - Confidence Intervals for Regression

    This section continues the Size_and_Price example started above.

    Once the basic regression has been performed we can add a confidence interval for the slope to the output simply by choosing C.I. / C.I. (Wald) for Parameters under the Tables menu, and then choosing the percentage that we want. Choosing 95% will give the result on the bottom of page 303 through top of page 304.
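
    (If you prefer code for this too, adding the CLB option to the PROC REG sketch above prints confidence limits for the intercept and slope:)

    PROC REG DATA=Size_and_Price;
    MODEL price = size / CLB ALPHA=0.05;
    RUN;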

    PROC INSIGHT will also produce graphs like Figure 7.5 on page 306 showing the Confidence Interval for the Regression Line and the Prediction Interval for a New Observation. These options are found by selecting Confidence Curves under the Curves menu. The Mean option corresponds to the CI for the regression line, and the Prediction option corresponds to the PI.

    Unfortunately PROC INSIGHT won't calculate the values for these curves at particular values of X (like you can see in Table 7.10 on page 314). To do this we need to use another procedure called PROC GLM. After quitting PROC INSIGHT by clicking on the X on the spreadsheet, we can use the following:

    PROC GLM DATA=Size_and_Price;
    MODEL price = size / ALPHA=0.05 CLI;
    RUN;
    

    PROC GLM DATA=Size_and_Price;
    MODEL price = size / ALPHA=0.05 CLM;
    RUN;

    Notice that we put the y-variable price to the left of the equal sign and the x-variable size to the right. Also note that CLI is for the individual prediction interval and CLM is for the mean (or regression line).

    Say we wanted to predict the price of a house that had size 1.500. One way of doing this would be to see if there was already an observation with that size. If so, we can simply look for the confidence/prediction interval corresponding to that observation. In this case, observation 29 has size 1.500 so we can just look at that interval.

    If there was no house with size 1.500 we would first have to add an observation with that size... but with no price (since we don't actually have the observation). The following lines make a new data set called SandP2 that has the size 1.500 house first, print out the new data set, and calculate the prediction intervals.

    It is important to know the order of the variables in the Size_and_Price data set, and to use the same order in the newobs data set here.

    DATA newobs;
    INPUT size price;
    CARDS;
    1.500 .
    ;
    DATA SandP2;
    SET newobs Size_and_Price;
    
    PROC PRINT DATA=SandP2;
    RUN;
    
    PROC GLM DATA=SandP2;
    MODEL price = size / ALPHA=0.05 CLI;
    RUN;
    
    The 95% prediction interval for a 1500 square foot house ends up being (49.660, 129.453) = ($49,660, $129,453).


    Section 7.8 - Plots for Assumption Checking

    This section continues the housing price example started above.

    First we will use the Size_and_Price data set in PROC INSIGHT. The basic regression output window already contains the residual versus predicted plot shown in Figure 7.9 on page 320.

    Notice for the most part that the data is spread evenly above and below the zero-line for each range of predicted values. This is a sign that the errors have mean zero (or equivalently that a straight line is appropriate). Compare this to Figure 7.11 on page 321.

    Second, notice that the plot is pretty close to having the same amount of spread in each range of predicted values as in every other range (we need only move one of the higher points to, say, (200,40) for it to look perfect). This is a sign that the variances of the errors are constant across the range of x-values. See Figure 7.12 on page 322 for a plot of non-constant variances.

    In order to check the assumption that the errors are normally distributed we need to construct a q-q plot of the residuals. This can be done by selecting Residual Normal QQ under the Graphs menu. The points are very close to the line, so we don't have evidence against the assumption of normality.

    In general the best way to check the assumption of independence is to know how the data was gathered. It is also possible to plot the residuals against some other variables to get a good idea. We could have, for example, chosen Scatter Plot (YX) under the Analyze menu, and chosen R_price (the residuals) for X and the other variables in the full data set (like number of fireplaces) for Y. Ideally there would be no pattern in the residual plots. If there was a pattern, that would mean there was some connection between the errors that was not captured by our regression model.

    Now, repeat the above regression... but use the entire Housing_Data set. (You will have to close and re-open PROC INSIGHT).

    Notice that in this case the variances do not seem equal... they increase as the predicted value increases. One way of fixing this problem is to take the logarithm of the Y variable.

    Select Edit, Variables, log(Y). Choose price for Y and click OK. This will add another column to the spreadsheet called L_price, and this column contains the logarithm of price. Now choose Fit (YX) under the Analyze menu and use L_price for Y and size for X.

    Now the plot looks much better! You need to be careful though... because now you aren't predicting price... you are predicting the logarithm of price.
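
    (The same transformation can be done in code rather than through the menus. A minimal sketch, where the data set name Log_Housing is just illustrative; SAS's log() function is the natural logarithm, which should match the L_price column PROC INSIGHT created:)

    DATA Log_Housing;
    SET Housing_Data;
    L_price = log(price);  * natural logarithm of price;
    RUN;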


    Chapter 8 - Multiple Linear Regression

    The following are the code and instructions for getting the output related to Table 8.2 on pages 346-347 and discussed in the Supplement to Section 8.3.

    DATA fw08x02;
    INPUT Obs    age    bed    bath     size      lot       price;
    CARDS;
      1     21     3      3.0    0.951     64.904     30.000
      2     21     3      2.0    1.036    217.800     39.900
      3      7     1      1.0    0.676     54.450     46.500
      4      6     3      2.0    1.456     51.836     48.600
      5     51     3      1.0    1.186     10.857     51.500
      6     19     3      2.0    1.456     40.075     56.990
      7      8     3      2.0    1.368       .        59.900
      8     27     3      1.0    0.994     11.016     62.500
      9     51     2      1.0    1.176      6.259     65.500
     10      1     3      2.0    1.216     11.348     69.000
     11     32     3      2.0    1.410     25.450     76.900
     12      2     3      2.0    1.344       .        79.000
     13     25     2      2.0    1.064    218.671     79.900
     14     31     3      1.5    1.770     19.602     79.950
     15     29     3      2.0    1.524     12.720     82.900
     16     16     3      2.0    1.750    130.680     84.900
     17     20     3      2.0    1.152    104.544     85.000
     18     18     4      2.0    1.770     10.640     87.900
     19     28     3      2.0    1.624     12.700     89.900
     20     27     3      2.0    1.540      5.679     89.900
     21      8     3      2.0    1.532      6.900     93.500
     22     19     3      2.0    1.647      6.900     94.900
     23      3     3      2.0    1.344     43.560     95.800
     24      5     3      2.0    1.550      6.575     98.500
     25      5     4      2.0    1.752      8.193     99.500
     26     27     3      1.5    1.450     11.300     99.900
     27     33     2      2.0    1.312      7.150    102.000
     28      4     3      2.0    1.636      6.097    106.000
     29      0     3      2.0    1.500       .       108.900
     30     36     3      2.5    1.800     83.635    109.900
     31      5     4      2.5    1.972      7.667    110.000
     32      0     3      2.0    1.387       .       112.290
     33     27     4      2.0    2.082     13.500    114.900
     34     15     3      2.0     .       269.549    119.500
     35     23     4      2.5    2.463     10.747    119.900
     36     25     3      2.0    2.572      7.090    119.900
     37     24     4      2.0    2.113      7.200    122.900
     38      1     3      2.5    2.016      9.000    123.938
     39     34     3      2.0    1.852     13.500    124.900
     40     26     4      2.0    2.670      9.158    126.900
     41     26     3      2.0    2.336      5.408    129.900
     42     31     3      2.0    1.980      8.325    132.900
     43     24     4      2.5    2.483     10.295    134.900
     44     29     5      2.5    2.809     15.927    135.900
     45     21     3      2.0    2.036     16.910    139.500
     46     10     3      2.0    2.298     10.950    139.990
     47      3     3      2.0    2.038      7.000    144.900
     48      9     3      2.5    2.370     10.796    147.600
     49     29     5      3.5    2.921     11.992    149.990
     50      8     3      2.0    2.262       .       152.550
     51      7     3      3.0    2.456       .       156.900
     52      1     4      2.0    2.436     52.000    164.000
     53     27     3      2.0    1.920    226.512    167.500
     54      5     3      2.5    2.949     11.950    169.900
     55     32     4      3.5    3.310     10.500    175.000
     56     29     3      3.0    2.805     16.500    179.000
     57      1     3      3.0    2.553      8.610    179.900
     58      1     3      2.0    2.510       .       189.500
     59     33     3      4.0    3.627     17.760    199.000
    ;
        
    PROC INSIGHT;
    OPEN fw08x02;
    RUN;
    
    To construct the scatterplot matrix in Figure 8.2 you would select Scatter Plot (YX) under the Analyze menu. Choose all of age, bed, bath, size, lot, and price for Y and for X and hit OK.

    To get the estimated regression equation in the middle of page 349 we could use Fit (YX) in the Analyze menu just like we did for simple linear regression. Choose price for Y and age, bed, bath, size, and lot for X and hit OK.

    We could even add the XX' matrix in Table 8.3 to the output by choosing the XX' Matrix option under Tables (although we won't really be doing anything with it).

    The option for adding the Type I Sum of Squares to the output (as discussed in the Supplement to 8.3) is found under the Tables menu.

    Note that the output already includes the Variance Inflation factor discussed in section 8.7.
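
    (If you want the variance inflation factors from code instead, PROC REG will print them when the VIF option is added to the MODEL statement. A sketch:)

    PROC REG DATA=fw08x02;
    MODEL price = age bed bath size lot / VIF;
    RUN;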

    The PROC GLM code from the supplement is:

    PROC GLM DATA=fw08x02;
    MODEL price = age bed bath size lot;
    RUN;
    
    All of this output was already seen in the PROC INSIGHT output too.


    Section 8.8 - Variable Selection

    PROC REG is the procedure that will perform the variable selection using adjusted R-square and C(p). The following enters the data in Table 8.10 on page 376, takes the logarithms as discussed in Example 8.5 on pages 375-379, and produces the output in Table 8.15 on page 385. Note: The only reason we are taking the logs here is because the variances had a fan-shaped structure. You don't always need to take logarithms to do variable selection!

    DATA squid;
    INPUT Obs     RL      WL      RNL     NWL      W       WT;
    CARDS;
      1    1.31    1.07    0.44    0.75    0.35     1.95
      2    1.55    1.49    0.53    0.90    0.47     2.90
      3    0.99    0.84    0.34    0.57    0.32     0.72
      4    0.99    0.83    0.34    0.54    0.27     0.81
      5    1.05    0.90    0.36    0.64    0.30     1.09
      6    1.09    0.93    0.42    0.61    0.31     1.22
      7    1.08    0.90    0.40    0.51    0.31     1.02
      8    1.27    1.08    0.44    0.77    0.34     1.93
      9    0.99    0.85    0.36    0.56    0.29     0.64
     10    1.34    1.13    0.45    0.77    0.37     2.08
     11    1.30    1.10    0.45    0.76    0.38     1.98
     12    1.33    1.10    0.48    0.77    0.38     1.90
     13    1.86    1.47    0.60    1.01    0.65     8.56
     14    1.58    1.34    0.52    0.95    0.50     4.49
     15    1.97    1.59    0.67    1.20    0.59     8.49
     16    1.80    1.56    0.66    1.02    0.59     6.17
     17    1.75    1.58    0.63    1.09    0.59     7.54
     18    1.72    1.43    0.64    1.02    0.63     6.36
     19    1.68    1.57    0.72    0.96    0.68     7.63
     20    1.75    1.59    0.68    1.08    0.62     7.78
     21    2.19    1.86    0.75    1.24    0.72    10.15
     22    1.73    1.67    0.64    1.14    0.55     6.88
    ;
    
    DATA logsquid;
    SET squid;
    L_wt=log(WT);
    L_rl=log(RL);
    L_wl=log(WL);
    L_rnl=log(RNL);
    L_nwl=log(NWL);
    L_w=log(W);
    KEEP L_wt L_rl L_wl L_rnl L_nwl L_w;
    RUN;
        
    
    PROC REG DATA=logsquid;
    MODEL   L_wt = L_rl  L_wl  L_rnl L_nwl L_w /
            SELECTION = RSQUARE ADJRSQ CP;
    RUN;
    


    Section 8.9 - Outlier Diagnostics

    PROC INSIGHT can be used to calculate the hat diagonal values (the hii) for measuring leverage, the externally studentized residuals for detecting extreme y-values, and the DFFITS for measuring influence. It puts these on the spreadsheet, labeled H_, RT_, and F_ respectively. You can either write them down from the spreadsheet... or if you're good with Word, you can copy and paste them out.

    To analyze the data in Table 8.17 on page 392 you would use the following:

    DATA FW08X06;                                                        
    INPUT Y X1 X2 X3;                                                     
    CARDS;                                                                
    763  19.8  128  86                                                   
    650  20.9  110  72                                                     
    554  15.1   95  62                                                    
    742  19.8  123  82                                                    
    470  21.4   77  52                                                    
    651  19.5  107  72                                                    
    756  25.2  123  84                                                   
    563  26.2   95  83                                                    
    681  26.8  116  76                                                    
    579  28.8  100 64                                                     
    716  22.0  110  80                                                     
    650  24.2  107  71                                                 
    761  24.9  125  81                                                      
    549  25.6   89  61                                                       
    641  24.7  103  71                                                      
    606  26.2  103  67                                                      
    696  21.0  110  77                                                       
    795  29.4  133  83                                                           
    582  21.6   96  65                                                      
    559  20.0   91  62
    ;
    
    PROC INSIGHT;
    OPEN FW08X06;
    FIT Y = X1 X2 X3;
    RUN;
    
    Now, under the Vars menu, select Hat Diag, Dffits, and Studentized Residuals. You should notice that the F_Y column contains the same values as the DFFITS column in the output on page 393. You can also see that observation 8 has both the largest hat diagonal value and the largest studentized residual value. Clicking on the 8 on the left edge of the spreadsheet will highlight the point on the residual vs. predicted plot. Notice that some other points seem just as extreme (like point 11 in the top middle).

    Now choose Analyze and then Scatter Plot (YX), and choose all of Y, X1, X2, and X3 for X and for Y. Notice the one point that is very different on a variety of the plots? That is point 8. Point 11 is in with all of the others.
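
    If you prefer a non-interactive route, PROC REG can print the same diagnostics. The INFLUENCE option gives the hat diagonals, externally studentized residuals, and DFFITS, and the R option adds a residual analysis. A minimal sketch using the FW08X06 data set entered above:

    PROC REG DATA=FW08X06;
    MODEL Y = X1 X2 X3 / INFLUENCE R;
    RUN;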


    Chapter 6 - One-way ANOVA

    The following is the code needed for performing the one-way ANOVA in the supplement to Section 6.5.

    Entering the data: The data and description of the problem can be found in Example 6.6 on page 264. Notice that each observation has two values, the group (the variable with the $ on the input line) and the response variable (the yij). The @@ simply tells it that more than one observation will be entered per line.

    DATA shrimp_weights;
    INPUT diet $ weight @@;
    CARDS;
    cafo_1 47.0	cafo_1 50.9	cafo_1 45.2	cafo_1 48.9	cafo_1 48.2
    calo_2 38.1 	calo_2 39.6	calo_2 39.1	calo_2 33.1    	calo_2 40.3
    faso_3 57.4	faso_3 55.1	faso_3 54.2 	faso_3 56.8     faso_3 52.5
    falo_4 54.2	falo_4 57.7	falo_4 57.1	falo_4 47.9	falo_4 53.4
    bc_5   38.5	bc_5   42.0	bc_5   38.7	bc_5   38.9	bc_5   44.6
    lma_6  48.9	lma_6  47.0	lma_6  47.0	lma_6  44.4	lma_6  46.9
    lmaa_7 87.8	lmaa_7 81.7	lmaa_7 73.3	lmaa_7 82.7	lmaa_7 74.8 
    ;
    
    Checking the assumptions and getting the ANOVA table: PROC INSIGHT can be used to get the two residual plots, while PROC GLM will conduct the modified Levene test. Both will return the ANOVA table as well as some additional output. As in regression, the y variable goes on the left-hand side of the equals sign on the FIT and MODEL lines. The group variables are the CLASS variables.
     
    PROC INSIGHT;
    OPEN shrimp_weights;
    FIT weight=diet;
    RUN;
    
    PROC GLM DATA=shrimp_weights ORDER=DATA;
    CLASS diet;
    MODEL weight=diet;
    MEANS diet / HOVTEST=BF;
    RUN;
    
    Estimating contrasts: The following code will produce the output in Table 6.23 on page 266. Notice that it is NOT adjusted for the Holm procedure BUT it does return both the estimate (L-hat) and the estimated standard deviation of L-hat that could be used to make a confidence interval. (You simply take the estimated value +/- t(alpha/2) times the standard error, where the degrees of freedom for the t come from the error line in the ANOVA table; a sketch of this calculation appears after the code.) It is important to note that if you make more than one confidence interval you will have multiple comparison problems and will want to use some procedure to correct for that. The Holm procedure doesn't work for confidence intervals, so you could use the Bonferroni method (or Tukey if you are doing only pairwise comparisons).

    PROC GLM DATA=shrimp_weights ORDER=DATA;
    CLASS diet;
    MODEL weight=diet;
    ESTIMATE 'newold'  diet  3  3  3  3 -4 -4 -4 / divisor=12;
    ESTIMATE 'corn'	 diet  5  5 -2 -2 -2 -2 -2 / divisor=10;
    ESTIMATE 'fish'	 diet  4 -3  4  4 -3 -3 -3 / divisor=12;
    ESTIMATE 'lin'	 diet -2  5 -2  5 -2 -2 -2 / divisor=10;
    ESTIMATE 'sun'	 diet -1 -1  6 -1 -1 -1 -1 / divisor=6;
    ESTIMATE 'mic'	 diet -2 -2 -2 -2 -2  5  5 / divisor=10;
    ESTIMATE 'art'	 diet -1 -1 -1 -1 -1 -1  6 / divisor=6;
    RUN;
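
    To turn one of these estimates into a 95% confidence interval, a short DATA step will do the arithmetic. This is just a sketch: the est and se values below are placeholders to be replaced by the L-hat and standard error read off the ESTIMATE output, while the 28 error degrees of freedom comes from the ANOVA table (35 observations minus 7 diets):

    DATA contrast_ci;
    est = 12.3;   * placeholder: L-hat from the ESTIMATE output;
    se = 1.5;     * placeholder: its standard error;
    df = 28;      * error df from the ANOVA table;
    lower = est - TINV(0.975, df)*se;
    upper = est + TINV(0.975, df)*se;
    RUN;

    PROC PRINT DATA=contrast_ci;
    RUN;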
    
    All pairwise comparisons: PROC MULTTEST can also be used for estimating contrasts. While it doesn't give the estimates of the contrasts, it does automatically adjust the p-values.

    PROC MULTTEST DATA=shrimp_weights ORDER=DATA HOLM;
    CLASS diet;
    CONTRAST '1 vs. 2'  1 -1  0  0  0  0  0;
    CONTRAST '1 vs. 3'  1  0 -1  0  0  0  0;
    CONTRAST '1 vs. 4'  1  0  0 -1  0  0  0; 
    CONTRAST '1 vs. 5'  1  0  0  0 -1  0  0;
    CONTRAST '1 vs. 6'  1  0  0  0  0 -1  0;
    CONTRAST '1 vs. 7'  1  0  0  0  0  0 -1;
    CONTRAST '2 vs. 3'  0  1 -1  0  0  0  0;
    CONTRAST '2 vs. 4'  0  1  0 -1  0  0  0;
    CONTRAST '2 vs. 5'  0  1  0  0 -1  0  0;
    CONTRAST '2 vs. 6'  0  1  0  0  0 -1  0;
    CONTRAST '2 vs. 7'  0  1  0  0  0  0 -1;
    CONTRAST '3 vs. 4'  0  0  1 -1  0  0  0;
    CONTRAST '3 vs. 5'  0  0  1  0 -1  0  0;
    CONTRAST '3 vs. 6'  0  0  1  0  0 -1  0;
    CONTRAST '3 vs. 7'  0  0  1  0  0  0 -1;
    CONTRAST '4 vs. 5'  0  0  0  1 -1  0  0;
    CONTRAST '4 vs. 6'  0  0  0  1  0 -1  0;
    CONTRAST '4 vs. 7'  0  0  0  1  0  0 -1;
    CONTRAST '5 vs. 6'  0  0  0  0  1 -1  0;
    CONTRAST '5 vs. 7'  0  0  0  0  1  0 -1;
    CONTRAST '6 vs. 7'  0  0  0  0  0  1 -1;
    TEST mean(weight);
    RUN;
    

    Other multiple comparisons procedures: The following code will produce the output in Table 6.24 using Tukey's procedure:

    PROC GLM DATA=shrimp_weights ORDER=DATA;
    CLASS diet;
    MODEL weight=diet;
    MEANS diet / ALPHA=0.05 TUKEY;
    RUN;
    
    Adding CLDIFF on the right side of the / would have produced confidence intervals instead of the grouped-means display. Besides TUKEY it will also do BON (for Bonferroni), DUNCAN, and SCHEFFE among others.
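
    For example, a sketch of the Bonferroni version with confidence intervals:

    PROC GLM DATA=shrimp_weights ORDER=DATA;
    CLASS diet;
    MODEL weight=diet;
    MEANS diet / ALPHA=0.05 BON CLDIFF;
    RUN;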


    Chapter 9 - Basic Two-way ANOVA

    This section has sample code for analyzing a two-way ANOVA that is factorial, balanced, with replications, and fixed effects.

    Entering the data: The data used is the oil example described in example 9.2 on page 421 and shown in table 9.5 on page 428.

    DATA oil; 
    INPUT  cyl  oil $ mpg;
    CARDS;
    4     STANDARD    22.6
    4     STANDARD    24.5
    4     STANDARD    23.1
    4     STANDARD    25.3
    4     STANDARD    22.1
    4     MULTI       23.7
    4     MULTI       24.6
    4     MULTI       25.0
    4     MULTI       24.0
    4     MULTI       23.1
    4     GASMISER    26.0
    4     GASMISER    25.0
    4     GASMISER    26.9
    4     GASMISER    26.0
    4     GASMISER    25.4
    6     STANDARD    23.6
    6     STANDARD    21.7
    6     STANDARD    20.3
    6     STANDARD    21.0
    6     STANDARD    22.0
    6     MULTI       23.5
    6     MULTI       22.8
    6     MULTI       24.6
    6     MULTI       24.6
    6     MULTI       22.5
    6     GASMISER    21.4
    6     GASMISER    20.7
    6     GASMISER    20.5
    6     GASMISER    23.2
    6     GASMISER    21.3
    ;
    
    Checking the assumptions and getting the ANOVA table: The ANOVA Table 9.6 (page 430) can be generated using either PROC INSIGHT or PROC GLM. PROC INSIGHT will also give the residual plots. PROC GLM on the other hand can be made to give the means that appear in Figure 9.1 on page 431. (The text appears to give the wrong mean for 4/GASMISER.)

    PROC INSIGHT;
    OPEN oil;
    FIT mpg = cyl oil cyl*oil;
    RUN;
    
    PROC GLM DATA=oil ORDER=DATA;
    CLASS cyl oil;
    MODEL mpg = cyl oil cyl*oil;
    MEANS cyl*oil;
    RUN;
    
    In order to use the modified Levene's test though we need to trick SAS into thinking this is a one-way ANOVA. We do this by making 6 separate treatment levels (the combinations of cylinder and oil types).

    DATA oil2;
    SET oil;
    KEEP  block mpg;
    block = trim(cyl)||trim(oil);
    
    PROC PRINT data=oil2;
    RUN;
    
    PROC GLM DATA=oil2 ORDER=DATA;
    CLASS block;
    MODEL mpg = block;
    MEANS block / HOVTEST=BF;
    RUN;
    
    Estimating Contrasts: The code to perform all of the various multiple comparison procedures can be somewhat complicated. But it is fairly straightforward to enter in the contrasts described on pages 433-435 and to get the output in Table 9.8 on page 436. (Actually the output on page 436 is incorrect, as the estimates are all a factor of 2, 3, or 4 smaller than they should be. For example, if you look at the data on page 428 the difference between 4 cylinder and 6 cylinder should be the 2.24 we get with the code below, not the 1.12 that the book reports. The t-test statistics are correct though.)

    To see how we entered the coefficients for the L1L2 contrast, just look at the table on page 434... or look at table 9.7 and the values that went with the different cylinder/oil combinations. Remember we need to make sure we entered the data in the right order!

    To get F statistics you would simply use CONTRAST instead of ESTIMATE for each of the lines.

    PROC GLM DATA=oil ORDER=DATA;
    CLASS cyl oil;
    MODEL mpg = cyl oil cyl*oil;
    ESTIMATE 'L1=cyl'          cyl     -1   1;
    ESTIMATE 'L2=std. vs. m&g' oil      1 -.5 -.5;
    ESTIMATE 'L3=m vs. g.'     oil      0   1  -1;
    ESTIMATE 'L1L2'            cyl*oil -1  .5  .5  1 -.5 -.5;
    ESTIMATE 'L1L3'            cyl*oil  0  -1   1  0   1  -1; 
    RUN;
    


    Sections 10.2-10.4 - Random Effects Models

    This section has the code for analyzing the randomized block design discussed in Section 10.2. If the data being analyzed is fixed effect then simply leave out the RANDOM line. If there are replications (as in Section 10.3) then the interaction term can be included. This can get complicated however, and I would suggest having a statistical consultant assist you with it.

    To analyze the data in Table 10.2 on page 465 we would use the following:

    DATA wheat;
    INPUT variety $ block $ yield @@;
    CARDS;
    A	1	31.0	A	2	39.5	A	3	30.5
    A	4	35.5	A	5	37.0
    B	1	28.0	B	2	34.0	B	3	24.5
    B	4	31.5	B	5	31.5
    C	1	25.5	C	2	31.0	C	3	25.0
    C	4	33.0	C	5	29.5
    ;
    
    PROC GLM DATA=wheat ORDER=DATA;
    CLASS variety block;
    MODEL yield = variety block;
    RANDOM block;
    RUN;
    

    Note that we had to leave out the interaction term because there were no replications. Also note that the ESTIMATE or CONTRAST lines for fixed effects could be used just like they were for a factorial design.

    This will produce the information in ANOVA Table 10.4 (page 469); look at both the Type III table and the ANOVA table. It also gives the formulas for the expected mean squares for the Treatments (variety) and Blocks as seen in Table 10.3 (page 467). Notice that in SAS the term involving the sum of the tau-squareds is summarized by a Q.

    To analyze the Latin square designs in Section 10.4 the code would be almost the same as above. The difference is that a third class variable would have to be entered for each observation, and so would appear on the INPUT, CLASS, MODEL, and possibly RANDOM lines.
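
    Here is a sketch of what that would look like. All of the names (row, col, trt, y) and the data values are made up purely to show the layout of a 3x3 Latin square:

    DATA latin;
    INPUT row col trt $ y @@;
    CARDS;
    1 1 A 10.1   1 2 B 11.3   1 3 C 12.0
    2 1 B 11.8   2 2 C 12.5   2 3 A 10.9
    3 1 C 12.7   3 2 A 10.4   3 3 B 11.6
    ;

    PROC GLM DATA=latin ORDER=DATA;
    CLASS row col trt;
    MODEL y = row col trt;
    RANDOM row col;
    RUN;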


    Section 11.5 - Analysis of Covariance

    The following code works through Example 11.3 on pages 519-522.

    DATA cmclass;                                                
    INPUT obs class pre post IQ;
    CARDS;                                                       
                            1      1        3     10     122    
                            2      2       24     34     129     
                            3      3       10     21     114     
                            4      1        5     10     121     
                            5      2       18     27     114     
                            6      3        3     18     114     
                            7      1        6     14     101     
                            8      2       11     20     116     
                            9      3       10     20     110    
                           10      1       11     29     131     
                           11      2       10     13     126     
                           12      3        3      9      94     
                           13      1       11     17     129     
                           14      2       11     19     110     
                           15      3        6     13     102    
                           16      1       13     21     115     
                           17      2        2     28     138     
                           18      3        9     24     128    
                           19      1        7      5     122    
                           20      2       10     13     119     
                           21      3       13     19     111     
                           22      1       12     17     112     
                           23      2       14     21     123     
                           24      3        7     25     119     
                           25      1       13     17     123     
                           26      2       11     14     115     
                           27      3       10     24     120     
                           28      1        8     22     119    
                           29      2       12     17     116     
                           30      3        9     21     112     
                           31      1        9     22     122     
                           32      2       14     16     125    
                           33      3        7     21     105     
                           34      1       10     18     111     
                           35      2        7     10     122    
                           36      3        4     17     120     
                           37      1        6     11     117     
                           38      2        8     18     120     
                           39      3        7     24     120     
                           40      1       13     20     112     
                           41      2       10     13     111     
                           42      3       12     25     118     
                           43      1        7      8     122     
                           44      2       11     17     127     
                           45      3        6     23     110     
                           46      1       11     20     124     
                           47      2       12     13     122     
                           48      3        7     22     127     
                           49      1        5     15     118    
                           50      2        6     13     127     
                           51      1        9     25     113    
                           52      2        3     13     115    
                           53      1        8     25     126     
                           54      2        4     13     112    
                           55      1        2     14     132    
                           56      1       11     17      93
    ;
    
    PROC GLM DATA=cmclass;
    CLASS class;
    MODEL post = class pre;
    ESTIMATE 'slope' pre 1;
    LSMEANS class / stderr pdiff;
    RUN;

    Here the pretest score is to be used like a variable in regression, so we don't put it on the CLASS line. We do put class there. An interaction generally doesn't make any sense for an ANCOVA, so we leave it out. The ESTIMATE line gives us the slope. The LSMEANS line gives us the estimates, along with their standard errors, the tests of whether each class effect = 0, and the p-values of the tests of whether each class's effect is the same as each of the other classes' effects. These last tests, in the matrix-like part of the output, give the same results as on page 522: namely that one and two are not significantly different, but one and three are, and two and three are. (Remember that the experimentwise error rate might need to be adjusted if using this.)

    Because we are using the pretest score (the covariate) similarly to a blocking variable here, we would use the Type III sum of squares to test if what class they are in makes a difference.

    The assumptions are the same as before. However, because we are actually doing something like a regression, we can't use the methods we used to test the assumptions for an ANOVA. Adding the line:

    
    OUTPUT out=cmresids P=FIT R=RESIDUAL;
    
    
    after the MODEL line will make a new data set called cmresids. Running PROC INSIGHT with this new data set will then let us plot the residuals. We could, for example, plot CLASS vs. the residuals, the fitted values vs. the residuals, and PRE vs. the residuals. We can also use the Dist (Y) option under the Analyze menu to do a histogram and q-q plot of the residuals. Similarly, we could just have used Fit (YX).

    In order to check that the model is appropriate, we need to make sure that the slope doesn't change with the different classes. To get the same answer as the bottom of page 526, all we need to do is fit the model again, this time with the interaction CLASS*PRE included. The null hypothesis that this line of the ANOVA table tests is that the slope is the same for each class, so if we reject it, it means the assumptions of the ANCOVA are not met.
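
    A minimal sketch of that check:

    PROC GLM DATA=cmclass;
    CLASS class;
    MODEL post = class pre class*pre;
    RUN;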


    Section 11.7 - Logistic Regression

    The following code will analyze the data in Table 11.8 on page 532 using logistic regression (it is probably a better analysis than the one the book has, so don't pay attention to what the output on page 533 looks like). This analysis is shown on page 541. (Remember that if we had entered the data with more than one observation per line, we would have needed to put @@ before the semi-colon on the INPUT line.)

    DATA fw11x04;                                        
    INPUT y $ income;                                      
    CARDS;                                               
    0 9.2                                               
    0 12.9                                               
    0 9.2                                               
    1 9.6                                                
    0 9.3                                                
    1 10.1                                               
    0 9.4                                                
    1 10.3                                               
    0 9.5                                                
    1 10.9                                                
    0 9.5                                               
    1 10.9                                               
    0 9.5                                                
    1 11.1                                               
    0 9.6                                               
    1 11.1                                               
    0 9.7                                                
    1 11.1                                               
    0 9.7                                                
    1 11.5                                               
    0 9.8                                                
    1 11.8                                               
    0 9.8                                                
    1 11.9                                               
    0 9.9                                                
    1 12.1                                               
    0 10.5                                              
    1 12.2                                               
    0 10.5                                               
    1 12.5                                               
    0 10.9                                               
    1 12.6                                               
    0 11                                                
    1 12.6                                               
    0 11.2                                               
    1 12.6                                               
    0 11.2                                               
    1 12.9                                              
    0 11.5                                              
    1 12.9                                              
    0 11.7                                               
    1 12.9                                               
    0 11.8                                               
    1 12.9                                               
    0 12.1                                               
    1 13.1                                              
    0 12.3                                              
    1 13.2                                               
    0 12.5                                           
    1 13.5
    ;
    
    PROC LOGISTIC DATA=fw11x04 DESCENDING;
    MODEL y=income /LACKFIT;
    RUN;
    
    The important part of the output is:
              Analysis of Maximum Likelihood Estimates
    
                                   Standard
    Parameter    DF    Estimate       Error    Chi-Square    Pr > ChiSq
    Intercept     1   -11.3472      3.3511       11.4660        0.0007
    income        1     1.0018      0.2954       11.5013        0.0007
    
    
    
                  Testing Global Null Hypothesis: BETA=0
    
    Test                 Chi-Square       DF     Pr > ChiSq
    Likelihood Ratio        15.5687        1         <.0001
    Score                   14.1810        1         0.0002
    Wald                    11.5013        1         0.0007
    
    
                Hosmer and Lemeshow Goodness-of-Fit Test
    
    Chi-Square       DF     Pr > ChiSq
    6.7043        7         0.4603
    
    Looking at the equation on page 536, -11.3472 is the estimate of beta0 and 1.0018 is the estimate of beta1. The likelihood ratio test p-value of < 0.0001 tests the null hypothesis that beta1 is zero, and the p-value of 0.4603 from the Hosmer and Lemeshow test tests the null hypothesis that the logistic form is appropriate.
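
    To use the fitted equation for prediction, a short DATA step can plug an income into the logistic function. This is just a sketch using the estimates from the output above (the income value of 12 is an arbitrary example):

    DATA predprob;
    income = 12;                    * income (in thousands) to predict at -- example value;
    xb = -11.3472 + 1.0018*income;  * estimated beta0 + beta1*income;
    p = exp(xb) / (1 + exp(xb));    * the estimated probability that y = 1;
    RUN;

    PROC PRINT DATA=predprob;
    RUN;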


    The Text Book Data

    The various data sets used in the text book can be found on the Companion Web Site for Statistical Methods.


    Computer Trouble?

    In most cases, help with the computers (NOT the programming) can be gained by e-mailing help@stat.sc.edu

    For the printers on the first and second floor, printer paper is available in the Stat Department office. For printers on the third floor, paper is available in the Math Department office.

    If you are using a PC, restarting the machine will fix many problems, but obviously don't try that if you have a file that won't save or the like.

    If SAS won't start, one of the things to check is that your computer has loaded the X drive correctly (whatever that means). Go to My Computer and see if the apps on 'lc-nt' (X:) is listed as one of the drives. If it isn't, go to the Tools menu and select Map Network Drive.... Select X for the drive, and enter \\lc-nt\apps for the Folder. Then click Finish. This should connect your computer to the X drive and allow SAS to run. If you already had the X-drive connected, then you will need to e-mail help@stat.sc.edu.

    If your graphs print out extremely small after you copy them to Word, you might be able to fix the problem by "opening and closing" the image. In Word, right click once on the image, and select Edit Picture or Open Picture Object under the Edit menu. A separate window will open with the image in it. Simply choose Close Picture. It should now print out OK. This will also make the spacing between the characters in the labels look right if they were somewhat off.

    If the problem is an emergency requiring immediate attention see the statistics department computer person in room 209D.
    If they are not available and it is an emergency see Minna Moore in room 417.
    Flagrantly non-emergency cases may result in suspension of computer privileges.