This will, however, erase whatever was written in the Program Editor window. To recall what was there, make sure you are in that window and hit the [F4] key.
Each program you run appends its results to the Output window. To clear the Output window, make sure you are in that window and choose Clear All under the Edit menu.
If you happen to lose a window, try looking under the View menu.
The code below analyzes data concerning the effect of different levels of nitrogen fertilizer on lettuce production. (The data was gathered by Dr. B. Gardner at the University of Arizona, and can be found in Statistical Principles of Research Design and Analysis by Kuehl.)
OPTIONS pagesize=60 linesize=80;
DATA lettuce;
INPUT lbsperacre headsperplot;
CARDS;
0 104
0 114
0 90
0 140
200 131
200 148
200 154
200 163
;
Note that _most_ lines end with a semicolon, but not all. SAS will fail to run the program correctly if you miss one, but usually the Log window will tell you where the problem is.
The OPTIONS line only needs to be used once during a session. It sets the length of the page and the length of the lines for viewing on the screen and printing. The font can be set by using the Options choice under the Tools menu along the top of the screen. When you cut and paste from SAS to a word processor, the font Courier New works well.
The DATA line defines the name of the data set. The name may not contain spaces; only letters, numbers, and underscores are allowed, and it must start with a letter. In older versions of SAS the name must be eight characters or less. The INPUT line gives the names of the variables, and they must be in the order that the data will be entered. If we had put a $ after either of the two variable names, SAS would have treated those values as names and not as numbers (see the sketch below).
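For example, here is a minimal sketch of what that would look like; the data set name lettuce2 and the variable plotlabel are made up for illustration.

DATA lettuce2;
INPUT plotlabel $ headsperplot;
CARDS;
A1 104
A2 114
;

Here the values A1 and A2 are stored as names rather than as numbers, so SAS would not, for example, let you take their average.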
If we hit F3 at this point to run what we put above, nothing new will appear on the output screen. This is no big surprise, however, once we realize that we haven't told SAS to return any output! The code below simply tells SAS to print back out the data we entered.
PROC PRINT DATA=lettuce;
TITLE "The Lettuce Data";
RUN;
The most basic method for getting a summary of the data is to use PROC UNIVARIATE.
PROC UNIVARIATE DATA=lettuce PLOT FREQ ;
VAR headsperplot;
TITLE 'Summary of the Number of Heads Per Plot';
RUN;
The VAR line says which of the variables you want a summary of. Also note that the graphs here are pretty awful. We'll see in a few minutes how to make the graphics look better.
SAS has a built-in procedure for doing the two-sample t-test, cleverly called PROC TTEST.
PROC TTEST DATA=lettuce;
CLASS lbsperacre;
VAR headsperplot;
RUN;
Now in order to trust the above p-values, one of the things we need to do is to check if the data set appears normal. This is one of the many graphs that PROC INSIGHT will produce. (It will also do many of the other procedures we need to do this semester).
PROC INSIGHT;
OPEN lettuce;
RUN;

Another way to open PROC INSIGHT is to go to the Solutions menu, then to the Analysis menu, and then finally to the Interactive Data Analysis option. Once there you will need to go to the WORK library and choose the lettuce data set.
Once you start PROC INSIGHT a spreadsheet containing the data should appear on the screen. To analyze this data, go to the Analyze menu, and choose Distribution (Y). In the white box under LETTUCE select headsperplot and click the Y button. Then select lbsperacre and click the Group button. Finally, click OK.
This causes an output window with two sets of graphs side-by-side to appear. You can cut and paste the graphs from these windows right into Microsoft Word. Simply click on the border of the box you want to copy with the left mouse button to select it. You can then cut and paste as normal. Later, to quit PROC INSIGHT, you will simply click on the X in the upper right portion of the spreadsheet.
The plot to check for normality is found under the Graphs menu; choose QQ Plot... and then just click OK in the box that pops up. You can then add a line to this plot by using the QQ Ref Line... option in the Curves menu. Again, just click OK in the box that pops up. If the data is approximately normally distributed then it should fall near the line. With so few data points, however, it is hard to tell if the data looks close or not. We can conduct a hypothesis test of normality by selecting Tests for Normality under the Tables menu. The Anderson-Darling test is perhaps the best one to use.
Throughout PROC INSIGHT are little boxes that give you additional options. The arrow box at the bottom of the histogram window contains a Ticks... option that lets you control what bars and endpoints you use in the histogram. Try changing the "Tick Increment". In the spreadsheet, if you click on the black box at the end of any row, it gives you the option to take any given point out of the calculations or graphs. If you do that, the output window automatically updates. Similarly, if you click on any number, that point is highlighted in all the graphs.
Besides Distribution (Y) in the Analyze menu, we could also have chosen Fit (YX). Choose lbsperacre for X and headsperplot for Y. If the X variable is numerical, this will conduct a linear regression. If it is a categorical (name) variable, it will conduct an Analysis of Variance. Now, click OK and we will see what the regression output looks like.
We could go to the spreadsheet and add another set of data points from the experiment as well.
100 146
100 142
100 152
100 156
There are a variety of additional choices that can be added to the output window. Perhaps the most useful is the Residual Normal QQ plot under the Graphs menu.
Instead of entering the data with each observation on a separate line, you could enter all the data back to back by putting @@ at the end of the INPUT line. For example, the data from last time could have been entered as follows:
DATA lettuce;
INPUT lbsperacre headsperplot @@;
CARDS;
0 104 0 114 0 90
0 140 200 131 200 148
200 154 200 163
;
Once you've entered the data, you can use the Distribution (Y) menu to do the test of hypothesis and make the confidence interval. Once the window with the histograms pops up, the menu for confidence intervals can be found under the Tables menu. You simply need to choose what percent confidence you want.
The hypothesis tests can also be found in the Tables menu under Tests for Location... You simply need to choose what mean you are using for the null hypothesis. The result you want is on the line labeled Student's t. The p-value there is for the two-sided alternate hypothesis though, so you may need to adjust it for the particular alternate hypothesis you are testing. (For example, if the two-sided p-value is 0.04 and the sample mean lies on the side the alternate hypothesis specifies, the one-sided p-value would be 0.02. This is sort of the opposite of problem 1d.)
Problem 3) The basic commands for running a simple linear regression were gone over in class on January 22nd. The only thing that wasn't was how to construct the QQ plot for the residuals. Once you have the output window for the regression open, simply choose Residual Normal QQ under the Graphs menu.
The following code would generate both of these for the lettuce data above, as well as for the case where the lbsperacre was 100. The . indicates that there is no headsperplot value for that observation.
DATA lettuce;
INPUT lbsperacre headsperplot;
CARDS;
0 104
0 114
0 90
0 140
200 131
200 148
200 154
200 163
100 .
;
PROC GLM DATA=lettuce;
MODEL headsperplot = lbsperacre / ALPHA=0.05 CLI;
RUN;
PROC GLM DATA=lettuce;
MODEL headsperplot = lbsperacre / ALPHA=0.05 CLM;
RUN;
QUIT;

The dependent variable goes on the left side of the equals sign in the MODEL line, and the independent variables go on the right. CLI tells it to make the confidence intervals for individual observations (hence the I), and CLM tells it to make the confidence intervals for the means. The QUIT; at the end tells SAS that you are done running PROC GLM.
Problem 3) Here is the raw data for this problem, with the first line being the names of the variables. Remember in the INPUT line to put a $ after state since they are the names of the states and not numerical values.
state sat takers income years public expend rank
Iowa 1088 3 326 16.79 87.8 25.60 89.7
SouthDakota 1075 2 264 16.07 86.2 19.95 90.6
NorthDakota 1068 3 317 16.57 88.3 20.62 89.8
Kansas 1045 5 338 16.30 83.9 27.14 86.3
Nebraska 1045 5 293 17.25 83.6 21.05 88.5
Montana 1033 8 263 15.91 93.7 29.48 86.4
Minnesota 1028 7 343 17.41 78.3 24.84 83.4
Utah 1022 4 333 16.57 75.2 17.42 85.9
Wyoming 1017 5 328 16.01 97.0 25.96 87.5
Wisconsin 1011 10 304 16.85 77.3 27.69 84.2
Oklahoma 1001 5 358 15.95 74.2 20.07 85.6
Arkansas 999 4 295 15.49 86.4 15.71 89.2
Tennessee 999 9 330 15.72 61.2 14.58 83.4
NewMexico 997 8 316 15.92 79.5 22.19 83.7
Idaho 995 7 285 16.18 92.1 17.80 85.9
Mississippi 988 3 315 16.76 67.9 15.36 90.1
Kentucky 985 6 330 16.61 71.4 15.69 86.4
Colorado 983 16 333 16.83 88.3 26.56 81.8
Washington 982 19 309 16.23 87.5 26.53 83.2
Arizona 981 11 314 15.98 80.9 19.14 84.3
Illinois 977 14 347 15.80 74.6 24.41 78.7
Louisiana 975 5 394 16.85 44.8 19.72 82.9
Missouri 975 10 322 16.42 67.7 20.79 80.6
Michigan 973 10 335 16.50 80.7 24.61 81.8
WestVirginia 968 7 292 17.08 90.6 18.16 86.2
Alabama 964 6 313 16.37 69.6 13.84 83.9
Ohio 958 16 306 16.52 71.5 21.43 79.5
NewHampshire 925 56 248 16.35 78.1 20.33 73.6
Alaska 923 31 401 15.32 96.5 50.10 79.6
Nevada 917 18 288 14.73 89.1 21.79 81.1
Oregon 908 40 261 14.48 92.1 30.49 79.3
Vermont 904 54 225 16.50 84.2 20.17 75.8
California 899 36 293 15.52 83.0 25.94 77.5
Delaware 897 42 277 16.95 67.9 27.81 71.4
Connecticut 896 69 287 16.75 76.8 26.97 69.8
NewYork 896 59 236 16.86 80.4 33.58 70.5
Maine 890 46 208 16.05 85.7 20.55 74.6
Florida 889 39 255 15.91 80.5 22.62 74.6
Maryland 889 50 312 16.90 80.4 25.41 71.5
Virginia 888 52 295 16.08 88.8 22.23 72.4
Massachusetts 888 65 246 16.79 80.7 31.74 69.9
Pennsylvania 885 50 241 17.27 78.6 27.98 73.4
RhodeIsland 877 59 228 16.67 79.7 25.59 71.4
NewJersey 869 64 269 16.37 80.6 27.91 69.8
Texas 868 32 303 14.95 91.7 19.55 76.4
Indiana 860 48 258 14.39 90.2 17.93 74.1
Hawaii 857 47 277 16.40 67.6 21.21 69.9
NorthCarolina 827 47 224 15.31 92.8 19.92 75.3
Georgia 823 51 250 15.55 86.5 16.52 74.0
SouthCarolina 790 48 214 15.42 88.1 15.60 74.0

To perform the multiple regression in PROC INSIGHT, you simply choose Fit (YX) as you would for simple linear regression. Just select all of the X-variables before you hit OK.
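Before you can use PROC INSIGHT you need to read the data in with a DATA step. A minimal sketch, assuming we call the data set sat (the variable names come from the header line above):

DATA sat;
INPUT state $ sat takers income years public expend rank;
CARDS;
Iowa 1088 3 326 16.79 87.8 25.60 89.7
SouthDakota 1075 2 264 16.07 86.2 19.95 90.6
etc...
;
PROC INSIGHT;
OPEN sat;
RUN;

(One caution: with a plain $, SAS may keep only the first eight characters of the longer state names; that does not affect the regression itself.)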
Note that you don't need to do any new SAS work for parts b and d. For part b, I give you the result you would get if you just used takers to predict sat, and you got the Type III Tests box automatically in part a. For part c, you can add the studentized residuals, leverage (hat matrix diagonal), and Cook's D to the spreadsheet by going to the Vars menu in PROC INSIGHT.
PROC REG DATA=brains;
MODEL brain = body litter gest / SELECTION=RSQUARE ADJRSQ CP;
RUN;
Problem 2) The following gives the populations of the states in 1990 and 1999, and their ranks. (We don't need the ranks for this problem. The data is from www.fedstats.gov, with the District of Columbia removed.)
State pop1990 rank1990 pop1999 rank1999
Alabama 4040 22 4370 23
Alaska 550 49 620 48
Arizona 3665 24 4778 20
Arkansas 2351 33 2551 33
California 29811 1 33145 1
Colorado 3294 26 4056 24
Connecticut 3287 27 3282 29
Delaware 666 46 754 45
Florida 12938 4 15111 4
Georgia 6478 11 7788 10
Hawaii 1108 41 1185 42
Idaho 1007 42 1252 40
Illinois 11431 6 12128 5
Indiana 5544 14 5943 14
Iowa 2777 30 2869 30
Kansas 2478 32 2654 32
Kentucky 3687 23 3961 25
Louisiana 4222 21 4372 22
Maine 1228 38 1253 39
Maryland 4781 19 5172 19
Massachusetts 6016 13 6175 13
Michigan 9295 8 9864 8
Minnesota 4376 20 4776 21
Mississippi 2575 31 2769 31
Missouri 5117 15 5468 17
Montana 799 44 883 44
Nebraska 1578 36 1666 38
Nevada 1202 39 1809 35
NewHampshire 1109 40 1201 41
NewJersey 7748 9 8143 9
NewMexico 1515 37 1740 37
NewYork 17991 2 18197 3
NorthCarolina 6632 10 7651 11
NorthDakota 639 47 634 47
Ohio 10847 7 11257 7
Oklahoma 3146 28 3358 27
Oregon 2842 29 3316 28
Pennsylvania 11883 5 11994 6
RhodeIsland 1003 43 991 43
SouthCarolina 3486 25 3886 26
SouthDakota 696 45 733 46
Tennessee 4877 17 5484 16
Texas 16986 3 20044 2
Utah 1723 35 2130 34
Vermont 563 48 594 49
Virginia 6189 12 6873 12
Washington 4867 18 5756 15
WestVirginia 1793 34 1807 36
Wisconsin 4892 16 5250 18
Wyoming 454 50 480 50

The tricky part on this problem is how you transform the variables. While the spreadsheet is on top in PROC INSIGHT, go to the Variables option under the Edit menu. If you just wanted to (for example) take the exponent of one of your variables, you would choose exp(Y), then simply select which variable Y is, and hit OK. This then adds the new variable to the spreadsheet. You can do some more complicated transformations by choosing Other... under the Variables option menu.
Once the new variable has been added to the sheet (make sure you checked what it would be called before you hit OK!) you can then use it in either Distribution (Y) or Fit (YX) just like you would any other variable.
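If you prefer, the same sort of transformation can also be done with a DATA step before you ever open PROC INSIGHT. A minimal sketch, where the data set names pops and pops2 and the new variable lpop90 are made up for illustration:

DATA pops2;
SET pops;
lpop90 = log(pop1990);
RUN;

Note that log() in SAS is the natural logarithm; use log10() if you want the base-10 logarithm.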
Data for problem 3: The following data is from a master's thesis by B. Nichols, in the Department of Animal Science at the University of Arizona (as found in Kuehl, 1994). Steaks of approximately the same size (75g) were randomly assigned to the different packaging conditions. After storing the meat for nine days at 4 degrees Celsius, the amount of psychrotrophic bacteria was measured. The units are the logarithm of the count per square centimeter.
commercial 7.66 commercial 6.98 commercial 7.80
vacuum 5.26 vacuum 5.44 vacuum 5.80
mixed 7.41 mixed 7.33 mixed 7.04
CO2 3.51 CO2 2.91 CO2 3.66
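To read this data in you can use the @@ trick from before. A minimal sketch, where the data set name steaks and the variable names package and logcount are made up for illustration:

DATA steaks;
INPUT package $ logcount @@;
CARDS;
commercial 7.66 commercial 6.98 commercial 7.80
vacuum 5.26 vacuum 5.44 vacuum 5.80
mixed 7.41 mixed 7.33 mixed 7.04
CO2 3.51 CO2 2.91 CO2 3.66
;
PROC GLM DATA=steaks ORDER=DATA;
CLASS package;
MODEL logcount = package;
RUN;

The PROC GLM portion would give the one-way ANOVA table comparing the four packaging conditions.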
Example: The following discusses the analysis of the data in Table 7-4 on page 297. The data consists of the cortisol levels in three groups of people: healthy individuals and two levels of depressed individuals. This data could be analyzed using PROC INSIGHT to get the ANOVA table and residual plots. Using PROC GLM it is also possible to test contrasts and perform Levene's test in addition to getting the ANOVA table.
The code below uses PROC GLM to produce the ANOVA table 7-5 on page 298. It also uses two contrasts (one comparing non-melancholic to healthy, and one comparing melancholic to healthy) to get the same results as the text does in figure 7-5 on page 301. The book used dummy variables, however, while the code below uses contrasts; thus the variable Dn in the regression is the same as the contrast 'nonm minus healthy' and Dm is the same as the contrast 'm minus healthy'. The lines using ESTIMATE do the same thing as the lines with CONTRAST, but they also return the estimated value (the same as the slopes in figure 7-5) and the t instead of F (recall F = t^2).
DATA tab7p4;
INPUT group $ cort @@;
CARDS;
h 2.5 n 5.4 m 8.1
h 7.2 n 7.8 m 9.5
h 8.9 n 8.0 m 9.8
h 9.3 n 9.3 m 12.2
h 9.9 n 9.7 m 12.3
h 10.3 n 11.1 m 12.5
h 11.6 n 11.6 m 13.3
h 14.8 n 12.0 m 17.5
h 4.5 n 12.8 m 24.3
h 7.0 n 13.1 m 10.1
h 8.5 n 15.8 m 11.8
h 9.3 n 7.5 m 9.8
h 9.8 n 7.9 m 12.1
h 10.3 n 7.6 m 12.5
h 11.6 n 9.4 m 12.5
h 11.7 n 9.6 m 13.4
n 11.3 m 16.1
n 11.6 m 25.2
n 11.8
n 12.6
n 13.2
n 16.3
;
PROC GLM DATA=tab7p4 ORDER=DATA;
CLASS group;
MODEL cort=group;
CONTRAST 'nonm minus healthy' group -1 1 0;
CONTRAST 'm minus healthy' group -1 0 1;
ESTIMATE 'nonm minus healthy' group -1 1 0;
ESTIMATE 'm minus healthy' group -1 0 1;
RUN;

SAS won't automatically construct the confidence intervals for the contrasts, but we could do them by hand from the output (or program SAS to do it), because the output from the ESTIMATE line gives both the estimate and the standard error for the estimate. (The degrees of freedom are the same as the degrees of freedom for the SSwit.) The 95% CI for the difference of the means for the non-melancholic and healthy groups would thus be: 1.5 +/- t(0.025,df) * (1.15947) = 1.5 +/- 2.335 = (-0.835, 3.835).
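If you would rather have SAS construct the intervals for you, I believe the CLPARM option on the MODEL statement will print confidence limits for the results of the ESTIMATE lines; a sketch:

PROC GLM DATA=tab7p4 ORDER=DATA;
CLASS group;
MODEL cort=group / CLPARM;
ESTIMATE 'nonm minus healthy' group -1 1 0;
ESTIMATE 'm minus healthy' group -1 0 1;
RUN;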
To conduct a test of hypotheses that the variances of the three groups are equal, we could add an extra line after the MODEL statement above. The following would perform the Levene median test (see page 325).
MEANS group / HOVTEST=BF;

The BF stands for Brown and Forsythe, the name that SAS uses for this particular test. For this example, the p-value for the test of the null hypothesis that the variances of the three groups are equal is 0.6234. We would thus fail to reject the null hypothesis. (Remember that we still need to use PROC INSIGHT to check the normality assumption!)
To generate the output for the Holm test (as on page 315), we need to use PROC MULTTEST. The following code would perform the Holm test on all of the different pairs of groups. (SAS calls this the "Stepdown Bonferroni" method.)
PROC MULTTEST DATA=tab7p4 ORDER=DATA HOLM;
CLASS group;
CONTRAST 'healthy vs. nonm' 1 -1 0;
CONTRAST 'healthy vs. m' 1 0 -1;
CONTRAST 'nonm vs. m' 0 1 -1;
TEST mean(cort);
RUN;

The p-values in the column labeled Stepdown Bonferroni have already been adjusted so that you simply need to compare them to the family-wise alpha level. The logic here is that the smallest p-value was 0.0008, so you could either compare 0.0008 to alpha/3 or compare 0.0008*3 to alpha. You could then either compare 0.0157 to alpha/2 or compare 0.0157*2 to alpha. Then finally, you would compare 0.2014 to alpha either way. For this example we could write up the output for alpha = 0.001, 0.02, 0.05, and 0.25 as follows.
Group       Mean    a=0.001   a=0.02   a=0.05   a=0.25
----------------------------------------------------------------
m          13.500      A         B        B        C
                       A         B
nonm       10.700      A        A B       A        B
                       A         A        A
healthy     9.200      A         A        A        A
----------------------------------------------------------------
(Groups sharing a letter within a column are not significantly different at that alpha level.) For high alpha levels, we reject the null hypothesis more often (and are more likely to make Type I errors), and are thus more likely to reject that the groups are the same. For a small alpha level it is hard to have enough evidence to reject that the groups are the same (we are more likely to make Type II errors), and so we are likely to conclude they are all the same.
Data for problem 2:
reg norm 989 reg norm 1025 reg norm 1030
reg nplus 1191 reg nplus 1233 reg nplus 1221
reg double 1226 reg double 1202 reg double 1180
red norm 1211 red norm 1215 red norm 1182
red nplus 1860 red nplus 1910 red nplus 1926
red double 1516 red double 1501 red double 1498
cost norm 1577 cost norm 1559 cost norm 1598
cost nplus 2492 cost nplus 2527 cost nplus 2511
cost double 1801 cost double 1833 cost double 1852
Example as ANOVA: The following discusses the analysis of the kidney data on pages 364-373 by performing the standard two-way ANOVA (including using Holm's test on the main effect and interactions).
DATA ksh;
INPUT strain $ site $ activity @@;
CARDS;
norm dct 62 norm dct 73 norm dct 58 norm dct 66
hyp dct 44 hyp dct 49 hyp dct 46 hyp dct 37
norm ccd 15 norm ccd 31 norm ccd 19 norm ccd 35
hyp ccd 8 hyp ccd 36 hyp ccd 11 hyp ccd 18
norm omcd 7 norm omcd 7 norm omcd 9 norm omcd 17
hyp omcd 19 hyp omcd 7 hyp omcd 15 hyp omcd 4
;
PROC GLM DATA=ksh ORDER=DATA;
CLASS strain site;
MODEL activity = strain site strain*site;
RUN;

After running this portion, note that the ANOVA table in the output is identical to the one given at the top of page 369, even though that one was gotten using dummy variable coding. Also note that if you add up the various Type I SS in the page 369 output, you will get the same values as the Type III SS in the SAS output from the above code.
Now, to see if the main effects (strain and site) and the interaction (strain*site) are significant, we actually need to do three tests. Because this model is balanced and factorial, though, the three tests will be independent. Thus, we could use the Sidak adjustment to the alpha levels to get the desired family-wise alpha. We can do even better than this, though. Just like you can use the Holm test to improve on Bonferroni, we can use what the text calls the Holm-Sidak procedure instead of just Sidak. (See the note at the bottom of page 316.) To do this we use PROC MULTTEST, but instead of giving it contrasts, we give it the p-values from the Type III tests, and instead of HOLM we use STEPSID. The data set pvals below contains the p-values from above, and the p-value variable must be called raw_p. It is important to note the order in which you entered the p-values!
DATA pvals;
INPUT whichone $ raw_p;
CARDS;
strain 0.0156
site 0.0001
strain*site 0.0404
;
PROC MULTTEST PDATA=pvals STEPSID;
RUN;
In this case, all three adjusted p-values are still less than 0.05, and so we would still say that both the main effects and the interaction were significant with a family-wise alpha of 0.05. If you had used STEPBON instead, the first adjusted p-value would have been slightly larger, and so we would have been less likely to reject. This is because Bonferroni is conservative (i.e. assumes the worst possible case) whereas Sidak's is exact in the case of independent tests.
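For comparison, the stepdown Bonferroni (Holm) version of the above would simply swap the option:

PROC MULTTEST PDATA=pvals STEPBON;
RUN;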
To check the assumptions for this analysis, we need to get the residual plots and use Levene's test. The residual plots can be gotten from PROC INSIGHT simply by using Fit(YX). Choose activity as the Y variable, and both strain and site as the X variables. This is not enough, however. We still need to put the interaction term in. Highlight both strain and site at the same time, and then hit Cross. This will add strain*site to the list of independent variables. Now, hit OK and you can proceed as usual.
Unfortunately, SAS will only run Levene's test for a one-way ANOVA. Because of this, we need to trick SAS into thinking this is a one-way ANOVA by actually making it into one! What we will do is change it into a one-way ANOVA with six different treatments: normdct, hypdct, normccd, hypccd, normomcd, and hypomcd. The code below does this using the || operator for concatenating the names of the factor levels together after trimming off any extra spaces they may have. It makes the new data set ksh2 from the old data set ksh, prints it out so that you can see what it's done, and then runs the modified Levene's test. (Recall that SAS calls this test the Brown and Forsythe test.)
DATA ksh2;
SET ksh;
KEEP block activity;
block = trim(strain)||trim(site);
PROC PRINT DATA=ksh2;
RUN;
PROC GLM DATA=ksh2 ORDER=DATA;
CLASS block;
MODEL activity = block;
MEANS block / HOVTEST=BF;
RUN;
Example as Regression: The following discusses the analysis of the kidney data on pages 364-373 by using regression with dummy variables (including making SAS recode the data for you!).
Looking at what the degrees of freedom for the ANOVA would be [(a-1), (b-1), (a-1)*(b-1)], we will need 1 dummy variable for strain, 2 for site, and 1*2=2 for the interactions. Mostly following the text (page 366), we will call these H1, N1, N2, H1N1, and H1N2. We'll call the new data set kshreg and make it from the old data set ksh.
DATA kshreg;
SET ksh;
KEEP activity H1 N1 N2 H1N1 H1N2;
IF strain='norm' THEN H1=1; ELSE H1=-1;
IF site='dct' THEN N1=1; ELSE IF site='ccd' THEN N1=0; ELSE N1=-1;
IF site='dct' THEN N2=0; ELSE IF site='ccd' THEN N2=1; ELSE N2=-1;
H1N1=H1*N1;
H1N2=H1*N2;
PROC PRINT DATA=kshreg;
RUN;

Note that this is the same as table 8-13 on page 368, except that the table did all of the hypertensive observations first, while we are doing the normal dct, then the hypertensive dct, etc. The lines are all there though.
Of course we could have just entered the data all over again from the table on page 368...
DATA kshreg2;
INPUT activity H1 N1 N2 H1N1 H1N2;
CARDS;
62 1 1 0 1 0
etc...

... but that would have been a waste since we already had the data entered from running the ANOVA earlier.
Once the data is entered, we can get the output using either PROC GLM or PROC INSIGHT to do a regression.
PROC INSIGHT;
OPEN kshreg;
FIT activity = H1 N1 N2 H1N1 H1N2;
RUN;

Note that the Type III SS don't match what is in the output on pg. 369. This is because the VIFs are not all 1 (the dummy variables are not independent!). Because of this, we need to use the Type I SS instead. This can be done by choosing Type I under the Tables menu. Once you have the correct SS, you can construct your own ANOVA F statistics by hand. For example, the SS for the site term in the ANOVA table is the same as the SS for N1 plus the SS for N2 (8287 = 7656.25 + 630.75). The degrees of freedom would be 2, 1 for each of the two dummy variables. The work required to do this is discussed on pages 368 to 370. Before PROC GLM was created there were some cases in which all of this extra work would be required. Luckily, PROC GLM knows how to use the dummy variables correctly on its own, behind the scenes, and we rarely need to go through all of this work with the dummy variables (although many textbooks seem to like doing it).
For this problem, simply show that you get the same SS for each of the main effects and the interactions. You don't need to calculate all the MS and F statistics.
For checking the assumptions, we can use the residual vs. predicted and Q-Q plots just like in regression. There is, however, no easy way to get SAS to do Levene's test without going back to the standard ANOVA method.
Problem 1: This problem uses the data discussed on pages 382-388 of the text. Most of the code you need for running it is provided below.
DATA rats;
INPUT nephron $ strain $ activity;
CARDS;
DCT Norm 58
DCT Norm 66
DCT Hyp 44
DCT Hyp 49
DCT Hyp 46
CCD Norm 15
CCD Norm 31
CCD Norm 19
CCD Norm 35
CCD Hyp 8
CCD Hyp 11
CCD Hyp 18
OMCD Norm 7
OMCD Norm 7
OMCD Norm 9
OMCD Norm 17
OMCD Hyp 19
OMCD Hyp 7
OMCD Hyp 15
OMCD Hyp 4
;
PROC GLM DATA=rats ORDER=DATA;
CLASS nephron strain;
MODEL activity = nephron strain nephron*strain / SOLUTION;
LSMEANS nephron strain nephron*strain;
RUN;

The option SOLUTION tells SAS to solve for the model parameters. It doesn't do it quite the way we'd like, however. For example, it says the effect of DCT is to add 35.08, the effect of CCD is to add 1.0833, and the effect of OMCD is to add 0. (Make sure you can find these values on the output.) These three don't average out to zero! Question 1a on the homework just wants to know what combination of nephron and strain is the baseline. That is, which of the combinations of nephron and strain has group mean equal to mu (what SAS calls the intercept). For example, we know it isn't the DCT-Norm group, because its average comes out to mu+35.0833-1.250+16.916, which isn't equal to just mu.
In order to get the estimates where the sums of the effects are equal to zero, we need to do a little bit more work using the ESTIMATE line. The * notation below means to take the average over all of the values of that variable. Thus, y*** is our estimate of what the overall average would be if the design were balanced. It would include mu (the intercept), 1/3 of each of the three nephron sites, 1/2 of each of the strains, and 1/6 of each of the six interactions. This is because it is the average of everything. Notice that since we can't enter the exact value 1/3, I put a 2 there and divided the whole line by six. Similarly, y*2* is what the average of all the hypertensive rats would be (the 2nd level of the 2nd variable, strain) if the design were balanced. This would include mu, the average of the three nephron sites, the value for hypertensive, and the average of all the interactions involving hypertensive. Notice that if you run the code below you get exactly the values you got from the LSMEANS above.
PROC GLM DATA=rats ORDER=DATA;
CLASS nephron strain;
MODEL activity = nephron strain nephron*strain;
ESTIMATE 'y*** = mu             ' intercept 6 nephron 2 2 2 strain 3 3 nephron*strain 1 1 1 1 1 1 / divisor=6;
ESTIMATE 'y1** = lsmean all DCT ' intercept 6 nephron 6 0 0 strain 3 3 nephron*strain 3 3 0 0 0 0 / divisor=6;
ESTIMATE 'y2** = lsmean all CCD ' intercept 6 nephron 0 6 0 strain 3 3 nephron*strain 0 0 3 3 0 0 / divisor=6;
ESTIMATE 'y3** = lsmean all OMCD' intercept 6 nephron 0 0 6 strain 3 3 nephron*strain 0 0 0 0 3 3 / divisor=6;
ESTIMATE 'y*1* = lsmean all Norm' intercept 6 nephron 2 2 2 strain 6 0 nephron*strain 2 0 2 0 2 0 / divisor=6;
ESTIMATE 'y*2* = lsmean all Hyp ' intercept 6 nephron 2 2 2 strain 0 6 nephron*strain 0 2 0 2 0 2 / divisor=6;
ESTIMATE 'y11* = lsmean DCTNorm ' intercept 6 nephron 6 0 0 strain 6 0 nephron*strain 6 0 0 0 0 0 / divisor=6;
ESTIMATE 'y12* = lsmean DCTHyp  ' intercept 6 nephron 6 0 0 strain 0 6 nephron*strain 0 6 0 0 0 0 / divisor=6;
ESTIMATE 'y21* = lsmean CCDNorm ' intercept 6 nephron 0 6 0 strain 6 0 nephron*strain 0 0 6 0 0 0 / divisor=6;
ESTIMATE 'y22* = lsmean CCDHyp  ' intercept 6 nephron 0 6 0 strain 0 6 nephron*strain 0 0 0 6 0 0 / divisor=6;
ESTIMATE 'y31* = lsmean OMCDNorm' intercept 6 nephron 0 0 6 strain 6 0 nephron*strain 0 0 0 0 6 0 / divisor=6;
ESTIMATE 'y32* = lsmean OMCDHyp ' intercept 6 nephron 0 0 6 strain 0 6 nephron*strain 0 0 0 0 0 6 / divisor=6;
RUN;

The reason for bothering to do this is so that we can now estimate the various parameters individually. For example, we simply have mu = y***, because both mu and y*** are the overall averages. To get n1 we would say n1 = y1** - y***, because the effect of being in nephron level 1 should be the difference between the average of being in nephron level one and the overall average. Similarly, to get s1 we would say s1 = y*1* - y***. Finally, to get (ns)11 we would say
(ns)11 = y11* - n1 - s1 - mu
       = y11* - (y1** - y***) - (y*1* - y***) - y***
       = y11* - y1** - y*1* + y***
All of these equations are similar to those from the ANOVA table for the two main effects and the interactions. We could have SAS calculate these parameter estimates by combining the estimate lines for the y***, yi**, y*j*, and yij* that we created above.
PROC GLM DATA=rats ORDER=DATA;
CLASS nephron strain;
MODEL activity = nephron strain nephron*strain;
ESTIMATE 'mu = y***             ' intercept 6 nephron 2 2 2 strain 3 3 nephron*strain 1 1 1 1 1 1 / divisor=6;
ESTIMATE 'n1 = y1** - y***      ' intercept 0 nephron 4 -2 -2 strain 0 0 nephron*strain 2 2 -1 -1 -1 -1 / divisor=6;
ESTIMATE 's1 = y*1* - y***      ' intercept 0 nephron 0 0 0 strain 3 -3 nephron*strain 1 -1 1 -1 1 -1 / divisor=6;
ESTIMATE 'ns11=y11*-y1*-y*1+y***' intercept 0 nephron 0 0 0 strain 0 0 nephron*strain 2 -2 -1 1 -1 1 / divisor=6;
RUN;

Notice that mu+n1 now gives us the LSMeans value for DCT, mu+s1 now gives us the LSMeans value for Norm, and mu+n1+s1+(ns)11 gives us the LSMeans value for DCT*Norm. Not only that, but we have a separate p-value for testing whether each of the terms in the model is equal to zero individually! We could thus go in and pick out some specific terms that we want to test are equal to zero. For this example, if all we wanted to test was the null hypothesis that n1=0, we would get a p-value of < 0.001 and reject that null hypothesis.
Because this gives us the standard error as well as the estimate, we could also construct confidence intervals for the parameters. To make a confidence interval for the n1 effect, we would simply take 26.3472 +/- t(alpha/2) * 2.5852, where the degrees of freedom are the df for the SSE (14 in this case). For a 95% interval, t(0.025,14) = 2.145, giving 26.3472 +/- 5.545. (For question 1c on the homework, notice that the OMCD-hypertensive group is the third level of n, and the second level of s, according to how we entered the data.)
We could also use this to test about particular differences. For example, say we wanted to test if the combination DCT-Norm (the main effects plus interaction) was different than CCD-Hyp (the main effects plus interaction). This would be y11* - y22* so:
PROC GLM DATA=rats ORDER=DATA;
CLASS nephron strain;
MODEL activity = nephron strain nephron*strain;
ESTIMATE 'y22*-y11*' intercept 0 nephron -6 6 0 strain -6 6 nephron*strain -6 0 0 6 0 0 / divisor=6;
RUN;

Clearly DCT-Normal results in significantly more activity than CCD-Hyper. If you check back on Table 8-16 this makes sense though... one group has values 58 and 66 and the other group has values 8, 11, and 18.
Problem 2: You can run a model with random effects just like one with fixed effects. After the MODEL line, simply add a line
RANDOM name_of_random_effects ;
where you plug in the name of the factor that is random (instead of leaving in "name_of_random_effects"). SAS will then tell you what the E(MS) are at the bottom of the output window.
What to do with the E(MS) is discussed on page 460-461, and in the example below.
Example: Say we were analyzing the data from problem one, but had added the line RANDOM strain because the strain of rat was randomly chosen. The hypotheses we might want to test are: are all the nephron effects equal, is the variance of the strain effects equal to 0, and are all the interactions of nephron*strain equal. In SAS's notation, we would be interested in testing that Q(nephron), Var(strain), and Q(nephron*strain) are equal to zero.
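For concreteness, the code this example assumes is just the model from Problem 1 with the RANDOM line added:

PROC GLM DATA=rats ORDER=DATA;
CLASS nephron strain;
MODEL activity = nephron strain nephron*strain;
RANDOM strain;
RUN;

Running it produces output like the following.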
The GLM Procedure

                                  Sum of
Source            DF         Squares     Mean Square    F Value    Pr > F
Model              5     6229.666667     1245.933333      29.97    <.0001
Error             14      582.083333       41.577381
Corrected Total   19     6811.750000

Source            DF     Type III SS     Mean Square    F Value    Pr > F
nephron            2     5994.229167     2997.114583      72.09    <.0001
strain             1      382.699275      382.699275       9.20    0.0089
nephron*strain     2      278.647771      139.323886       3.35    0.0647

Source            Type III Expected Mean Square
nephron           Var(Error) + Q(nephron,nephron*strain)
strain            Var(Error) + 9.3913 Var(strain) + Q(nephron*strain)
nephron*strain    Var(Error) + Q(nephron*strain)

Note: our estimate of Var(Error) is always the MSE in the ANOVA table.
Looking at the Type III Expected Mean Squares, we can see that in order to get a term with Var(strain) by itself, we would need to cancel out both the Var(Error) and the Q(nephron*strain) from the line Source=strain. The only way to get both Var(Error) and Q(nephron*strain) is to use the line Source=nephron*strain. This tells us that the F-statistic we want has MSstrain=382.699275 as the numerator, MSnephron*strain=139.323886 as the denominator, and degrees of freedom = 1, 2. Thus our F value for testing Var(strain)=0 would be F=382.70/139.32=2.747. Table E-2 says we would compare this to 18.51 and would fail to reject.
This is different from what the p-value of 0.0089 for strain seems to be telling us in the Type III tests! This is because the Type III tests always just use the MSE as the denominator. So with a random effect it would actually be testing the null hypothesis 9.3913 Var(strain) + Q(nephron*strain) = 0. That is, it would be testing that the variance due to the strain of rat is zero AND that there is no interaction! The only way to tell that this is what is happening is to look at the expected mean squares.
The p-value 0.0647 for testing if there is an interaction is correct though. Can you tell why?
Finally, note that there is no obvious way to test whether Q(nephron)=0 (that is, that there is no effect due to the nephron site). This is because none of the expected mean squares have Q(nephron) by itself; they only have it combined with the interaction in the term Q(nephron,nephron*strain). This is because SAS doesn't always handle the interactions correctly for models with random effects if there is an interaction term. Because of this, SAS GLM recommends that you always make the interaction random if at least one of the effects involved is random.
Luckily on the homework, question 2a points out that there is not an interaction term in the model you will be looking at.
An ANCOVA model can be fit using PROC GLM or PROC INSIGHT just like an ANOVA or a regression model can.
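For example, here is a minimal sketch of an ANCOVA in PROC GLM; the data set name ancova1, the factor group, the covariate x, and the response y are all made up for illustration:

PROC GLM DATA=ancova1;
CLASS group;
MODEL y = group x;
RUN;

The CLASS line makes group a factor just as in ANOVA, while x is left off the CLASS line so that it enters the model as a numerical covariate, just as in regression.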
Below is the data for Homework Assignment 8. The variables are temperature in Fahrenheit and whether or not there was an O-ring failure.
53 1 56 1 57 1 63 0 66 0 67 0
67 0 67 0 68 0 69 0 70 0 70 1
70 1 70 1 72 0 73 0 75 0 75 1
76 0 76 0 78 0 79 0 80 0 81 0

All of this problem (except the Hosmer-Lemeshow statistic) can be conducted using PROC INSIGHT. In the Fit (YX) menu you have to choose several options after you select Y and X. Select Method using the button at the bottom. In the menu that pops up, choose the response distribution Binomial and the link function Logit.
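To read this data into SAS (whether for PROC INSIGHT or for PROC LOGISTIC below), the @@ trick works again. A sketch, where the data set name oring and the variable names temp and fail are made up for illustration:

DATA oring;
INPUT temp fail @@;
CARDS;
53 1 56 1 57 1 63 0 66 0 67 0
67 0 67 0 68 0 69 0 70 0 70 1
70 1 70 1 72 0 73 0 75 0 75 1
76 0 76 0 78 0 79 0 80 0 81 0
;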
PROC LOGISTIC will also calculate the Hosmer-Lemeshow Statistic. The following code would analyze the data in Table 12-1 on page 609, and it reproduces the output in figure 12-2 on page 611.
DATA grad;
INPUT int success failure;
total=success+failure;
CARDS;
1 0 2
2 0 2
3 0 5
4 0 3
5 1 6
6 1 3
7 1 2
8 4 7
9 5 4
10 7 4
11 7 2
12 7 1
13 11 1
14 7 0
15 11 0
16 5 0
17 7 0
18 2 0
19 2 0
;
PROC LOGISTIC DATA=grad;
MODEL success/total = int / LACKFIT;
RUN;

The data here was input in a slightly different format than the way the data for the homework assignment would be entered. For this data we had several observations recorded at the same intelligence level, all reported on a single line of input (e.g. there were two observations at z=1: 0 successes and 2 failures). Your input would have these on separate lines. For example, the data set grad would start like:
DATA grad;
INPUT int grad @@;
CARDS;
1 0 1 0 2 0 2 0 3 0 3 0 3 0 3 0 3 0
4 0 4 0 4 0 5 1 5 0 5 0 5 0 5 0 5 0 5 0
etc...
;

In this case the code used to run PROC LOGISTIC would be:
PROC LOGISTIC DATA=grad DESCENDING;
MODEL grad = int / LACKFIT;
RUN;

The log window will tell you that the DESCENDING option means you are predicting the probability of getting a 1. Without it, you would be predicting the probability of getting a zero.
It should now allow you to start SAS. Unfortunately you'll have to do this every time you want to start SAS until they get your account moved over to a newer machine.
Now, if you were not listed as being hooked up to sum-nt, then it could be any number of things. First you should try starting it on another computer. If that fails, you should see if Jamie Winterton (7-5346) is available in Room 209D. If she is not there, you should next see if Jay Dew (7-5413) is in 415. If that fails, you should see if Minna Moore is available in Room 417. Note: If Minna is the only one in, she will probably not be able to help you right away, but she will pass your message on to the next available person.