Stat 516 - Spring 2001 - SAS Templates

Class Notes from 1/22/2001
Notes on Homework One
Notes on Homework Two
Notes on Homework Three
Notes on Homework Four
Notes on Homework Five
Notes on Homework Six
Notes on Homework Seven
Notes on Homework Eight
What if SAS won't start?


The Basics of SAS:

Hitting the [F3] key will run the program currently in the Program Editor window.

This will, however, erase whatever was written in the Program Editor window. To recall whatever was there, make sure you are in that window and hit the [F4] key.

If you keep running programs, SAS keeps appending the results to the Output window. To clear the Output window, make sure you are in that window and choose Clear All under the Edit menu.

If you happen to lose a window, try looking under the View menu.

The code below analyzes data concerning the effect of different levels of nitrogen fertilizer on lettuce production. (The data was gathered by Dr. B. Gardner at the University of Arizona, and can be found in Statistical Principles of Research Design and Analysis by Kuehl.)

OPTIONS pagesize=60 linesize=80;


DATA lettuce;
INPUT lbsperacre headsperplot;
CARDS;
0 104
0 114
0 90
0 140
200 131
200 148
200 154
200 163
;

Note that _most_ lines end with a semi-colon, but not all. SAS will fail to run the program correctly if you miss one, but the Log window will usually tell you where the problem is.

The OPTIONS line only needs to be used once during a session. It sets the length of the page and the width of the lines for viewing on the screen and printing. The font can be set by using the Options choice under the Tools menu along the top of the screen. When you cut and paste from SAS to a word processor, the font Courier New works well.

The DATA line defines the name of the data set. The name may not have any spaces; only letters, numbers, and underscores are allowed, and it must start with a letter. In older versions of SAS the name must be eight characters long or less. The INPUT line gives the names of the variables, and they must be in the order that the data will be entered. If we had put a $ after either of the two variable names, SAS would have treated those values as text (names) and not as numbers.
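For example, here is a minimal sketch of reading a character variable (the variable plotlabel and the values are made up for illustration):

DATA lettuce2;
INPUT plotlabel $ headsperplot;
CARDS;
A1 104
A2 114
;

The $ after plotlabel tells SAS to read values like A1 as text rather than trying to read them as numbers.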

If we hit F3 at this point to run what we put above, nothing new will appear on the output screen. This is no big surprise, however, once we realize that we haven't told SAS to return any output! The code below simply tells SAS to print back out the data we entered.


PROC PRINT DATA=lettuce;
TITLE "The Lettuce Data";
RUN;

The most basic method for getting a summary of the data is to use PROC UNIVARIATE.

PROC UNIVARIATE DATA=lettuce PLOT FREQ;
VAR headsperplot;
TITLE 'Summary of the Number of Heads Per Plot';
RUN;

The VAR line says which of the variables you want a summary of. Also note that the graphs here are pretty awful. We'll see in a few minutes how to make the graphics look better.

SAS has a built in procedure for doing the two sample t-test, cleverly called PROC TTEST.

PROC TTEST DATA=lettuce;
CLASS lbsperacre;
VAR headsperplot;
RUN;

Now in order to trust the above p-values, one of the things we need to do is to check if the data set appears normal. This is one of the many graphs that PROC INSIGHT will produce. (It will also do many of the other procedures we need to do this semester).

PROC INSIGHT;
OPEN lettuce;
RUN;
Another way to open PROC INSIGHT is to go to the Solutions menu, then to the Analysis menu, and then finally to the Interactive Data Analysis option. Once there you will need to go to the WORK library, and choose the lettuce data set.

Once you start PROC INSIGHT a spreadsheet containing the data should appear on the screen. To analyze this data, go to the Analyze menu, and choose Distribution (Y). In the white box under LETTUCE select headsperplot and click the Y button. Then select lbsperacre and click the Group button. Finally, click OK.

This causes an output window with two sets of graphs side-by-side to appear. You can cut and paste the graphs from these windows right into Microsoft Word. Simply click on the border of the box you want to copy with the left mouse button to select it. You can then cut and paste like normal. Later, to quit PROC INSIGHT, you will simply click on the X in the upper right portion of the spreadsheet.

The plot to check for normality is found under the Graphs menu; choose QQ Plot... and then just click OK in the box that pops up. You can then add a line to this plot by using the QQ Ref Line... option in the Curves menu. Again just click OK in the box that pops up. If the data is approximately normally distributed then it should fall near the line. With so few data points, however, it is hard to tell whether the data looks close or not. We can conduct a hypothesis test of normality by selecting Tests for Normality under the Tables menu. The Anderson-Darling test is perhaps the best one to use.

Throughout PROC INSIGHT are little boxes that give you additional options. The arrow box at the bottom of the histogram window contains a Ticks... option that lets you control what boxes and endpoints you use in the histogram. Try changing the "Tick Increment". In the spreadsheet, if you click on the black box at the end of any row, it gives the option to take any given point out of the calculations or graphs. If you do that, the output window automatically updates. Similarly, if you click on any number, that point is highlighted in all the graphs.

Besides Distribution (Y) in the Analyze menu, we could also have chosen Fit (YX). Choose lbsperacre for X and headsperplot for Y. If the X variable is numerical, this will conduct a linear regression. If it is a categorical (name) variable, it will conduct an Analysis of Variance. Now, click OK and we will see what the regression output looks like.

We could go to the spreadsheet and add another set of data points from the experiment as well.


100 146
100 142
100 152
100 156

There are a variety of additional choices that can be added to the output window. Perhaps the most useful is the Residual Normal QQ plot under the Graphs menu.


Notes on Homework 1:

Problem 2) The data set in this problem only has one variable, so there will only be one name on your INPUT line.

Instead of entering the data with each observation on a separate line, you could enter all the data back to back by putting @@ at the end of the INPUT line. For example, the data from last time could have been entered as follows:


DATA lettuce;
INPUT lbsperacre headsperplot @@;
CARDS;
0 104 0 114 0 90
0 140 200 131 200 148
200 154 200 163
;

Once you've entered the data, you can use the Distribution (Y) menu to do the test of hypothesis and make the confidence interval. Once the window with the histograms pops up, the menu for confidence intervals can be found under the Tables menu. You simply need to choose what percent confidence you want.

The hypothesis tests can also be found in the Tables menu under Tests for Location... You simply need to choose what mean you are using for the null hypothesis. The result you want is on the line labeled Student's t. The p-value there is for the two-sided alternative hypothesis though, so you may need to adjust it for the particular alternative hypothesis you are testing. (Sort of the opposite of problem 1d.) For example, if the reported two-sided p-value were 0.08 and the sample mean were on the side the alternative predicts, the one-sided p-value would be 0.04.

Problem 3) The basic commands for running a simple linear regression were gone over in class on January 22nd. The only thing that wasn't was how to construct the QQ plot for the residuals. Once you have the output window for the regression open, simply choose Residual Normal QQ under the Graphs menu.


Notes on Homework 2

Problem 1b) In order to get the confidence interval for the regression line (for the mean predicted value) and the confidence intervals for a new observation (individual predicted value) you need to use PROC GLM. GLM stands for "General Linear Model" and includes both regression and analysis of variance.

The following code would generate both of these for the lettuce data above, as well as for the case where the lbsperacre was 100. The . indicates that there is no headsperplot value for that observation.

DATA lettuce;
INPUT lbsperacre headsperplot;
CARDS;
0 104
0 114
0 90
0 140
200 131
200 148
200 154
200 163
100 .
;

PROC GLM DATA=lettuce;
MODEL headsperplot = lbsperacre / ALPHA=0.05 CLI;
RUN;

PROC GLM DATA=lettuce;
MODEL headsperplot = lbsperacre / ALPHA=0.05 CLM;
RUN;
QUIT;
The dependent variable goes on the left side of the equals sign in the MODEL line, and the independent variables go on the right. CLI tells SAS to make the confidence intervals for individual observations (hence the I), and CLM tells it to make the confidence intervals for the means (hence the M). SAS will not produce both sets of intervals from a single MODEL statement, which is why PROC GLM is run twice. The QUIT; at the end tells SAS that you are done running PROC GLM.

Problem 3) Here is the raw data for this problem, with the first line being the names of the variables. Remember in the INPUT line to put a $ after state since they are the names of the states and not numerical values.

state 		sat 	takers 	income 	years 	public 	expend 	rank
Iowa		1088	3	326	16.79	87.8	25.60	89.7
SouthDakota	1075	2	264	16.07	86.2	19.95	90.6
NorthDakota	1068	3	317	16.57	88.3	20.62	89.8
Kansas		1045	5	338	16.30	83.9	27.14	86.3
Nebraska	1045	5	293	17.25	83.6	21.05	88.5
Montana		1033	8	263	15.91	93.7	29.48	86.4
Minnesota	1028	7	343	17.41	78.3	24.84	83.4
Utah		1022	4	333	16.57	75.2	17.42	85.9
Wyoming		1017	5	328	16.01	97.0	25.96	87.5
Wisconsin	1011  	10	304	16.85	77.3	27.69	84.2
Oklahoma	1001	5	358	15.95	74.2	20.07	85.6
Arkansas	999	4	295	15.49	86.4	15.71	89.2
Tennessee	999	9	330	15.72	61.2	14.58	83.4
NewMexico	997	8	316	15.92	79.5	22.19	83.7
Idaho		995	7	285	16.18	92.1	17.80	85.9
Mississippi	988	3	315	16.76	67.9	15.36	90.1
Kentucky	985	6	330	16.61	71.4	15.69	86.4
Colorado	983  	16	333	16.83	88.3	26.56	81.8
Washington	982  	19	309	16.23	87.5	26.53	83.2
Arizona		981  	11	314	15.98	80.9	19.14	84.3
Illinois	977  	14	347	15.80	74.6	24.41	78.7
Louisiana	975   	5	394	16.85	44.8	19.72	82.9
Missouri	975  	10	322	16.42	67.7	20.79	80.6
Michigan	973  	10	335	16.50	80.7	24.61	81.8
WestVirginia	968   	7	292	17.08	90.6	18.16	86.2
Alabama		964   	6	313	16.37	69.6	13.84	83.9
Ohio		958  	16	306	16.52	71.5	21.43	79.5
NewHampshire	925  	56	248	16.35	78.1	20.33	73.6
Alaska		923  	31	401	15.32	96.5	50.10	79.6
Nevada		917  	18	288	14.73	89.1	21.79	81.1
Oregon		908  	40	261	14.48	92.1	30.49	79.3
Vermont		904  	54	225	16.50	84.2	20.17	75.8
California	899  	36	293	15.52	83.0	25.94	77.5
Delaware	897  	42	277	16.95	67.9	27.81	71.4
Connecticut	896  	69	287	16.75	76.8	26.97	69.8
NewYork		896  	59	236	16.86	80.4	33.58	70.5
Maine		890  	46	208	16.05	85.7	20.55	74.6
Florida		889  	39	255	15.91	80.5	22.62	74.6
Maryland	889  	50	312	16.90	80.4	25.41	71.5
Virginia	888  	52	295	16.08	88.8	22.23	72.4
Massachusetts	888  	65	246	16.79	80.7	31.74	69.9
Pennsylvania	885  	50	241	17.27	78.6	27.98	73.4
RhodeIsland	877  	59	228	16.67	79.7	25.59	71.4
NewJersey	869  	64	269	16.37	80.6	27.91	69.8
Texas		868  	32	303	14.95	91.7	19.55	76.4
Indiana		860  	48	258	14.39	90.2	17.93	74.1
Hawaii		857  	47	277	16.40	67.6	21.21	69.9
NorthCarolina	827  	47	224	15.31	92.8	19.92	75.3
Georgia		823  	51	250	15.55	86.5	16.52	74.0
SouthCarolina	790  	48	214	15.42	88.1	15.60	74.0
To perform the multiple regression in PROC INSIGHT, you simply choose Fit(YX) as you would for simple linear regression. Just select all of the X-variables before you hit OK.

Note that you don't need to do any new SAS work for parts b and d. For part b, I give you the result you would get if you just used takers to predict sat, and you got the Type III Tests box automatically in part a. For part c, you can add the studentized residuals, leverage (hat matrix diagonal), and Cook's D to the spreadsheet by going under the Vars menu in PROC INSIGHT.
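If you would rather use code than the point-and-click interface, the same multiple regression can be run with PROC REG. A minimal sketch, assuming you named the data set satdata and used the variable names from the table above:

PROC REG DATA=satdata;
MODEL sat = takers income years public expend rank;
RUN;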


Notes on Homework 3

Problem 1c) To perform the model selection, you need to use the procedure PROC REG. The following code would perform the model selection for the data set in class on brain weights.
PROC REG DATA=brains;
MODEL   brain = body litter gest /
        SELECTION = RSQUARE ADJRSQ CP;
RUN;

Problem 2) The following gives the populations of the states (in thousands) in 1990 and 1999, and the ranks. (We don't need the ranks for this problem. The data is from www.fedstats.gov, with the District of Columbia removed.)

State			      pop1990 rank1990  pop1999	 rank1999	      
Alabama                          4040       22     4370        23
Alaska                            550       49      620        48
Arizona                          3665       24     4778        20
Arkansas                         2351       33     2551        33
California                      29811        1    33145         1
Colorado                         3294       26     4056        24
Connecticut                      3287       27     3282        29
Delaware                          666       46      754        45
Florida                         12938        4    15111         4
Georgia                          6478       11     7788        10
Hawaii                           1108       41     1185        42
Idaho                            1007       42     1252        40
Illinois                        11431        6    12128         5
Indiana                          5544       14     5943        14
Iowa                             2777       30     2869        30
Kansas                           2478       32     2654        32
Kentucky                         3687       23     3961        25
Louisiana                        4222       21     4372        22
Maine                            1228       38     1253        39
Maryland                         4781       19     5172        19
Massachusetts                    6016       13     6175        13
Michigan                         9295        8     9864         8
Minnesota                        4376       20     4776        21
Mississippi                      2575       31     2769        31
Missouri                         5117       15     5468        17
Montana                           799       44      883        44
Nebraska                         1578       36     1666        38
Nevada                           1202       39     1809        35
NewHampshire                     1109       40     1201        41
NewJersey                        7748        9     8143         9
NewMexico                        1515       37     1740        37
NewYork                         17991        2    18197         3
NorthCarolina                    6632       10     7651        11
NorthDakota                       639       47      634        47
Ohio                            10847        7    11257         7
Oklahoma                         3146       28     3358        27
Oregon                           2842       29     3316        28
Pennsylvania                    11883        5    11994         6
RhodeIsland                      1003       43      991        43
SouthCarolina                    3486       25     3886        26
SouthDakota                       696       45      733        46
Tennessee                        4877       17     5484        16
Texas                           16986        3    20044         2
Utah                             1723       35     2130        34
Vermont                           563       48      594        49
Virginia                         6189       12     6873        12
Washington                       4867       18     5756        15
WestVirginia                     1793       34     1807        36
Wisconsin                        4892       16     5250        18
Wyoming                           454       50      480        50
The tricky part of this problem is how you transform the variables. While the spreadsheet is on top in PROC INSIGHT, go to the Variables option under the Edit menu. If you just wanted to (for example) exponentiate one of your variables, you would choose exp(Y), select which variable Y is, and hit OK. This adds the new variable to the spreadsheet. You can do more complicated transformations by choosing Other... under the Variables option menu.

Once the new variable has been added to the sheet (make sure you checked what it would be called before you hit OK!) you can then use it in either Distribution (Y) or Fit (YX) just like you would any other variable.
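You can also build transformed variables in a DATA step before ever starting PROC INSIGHT. A minimal sketch, where the data set names states and states2 and the new variable logratio are made up for illustration:

DATA states2;
SET states;
logratio = log(pop1999/pop1990);
RUN;

Opening states2 in PROC INSIGHT would then show logratio in the spreadsheet along with the original variables.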


Notes on Homework 4

Data for problem 3: The following data is from a master's thesis by B. Nichols, in the Department of Animal Science at the University of Arizona (as found in Kuehl, 1994). Steaks of approximately the same size (75g) were randomly assigned to the different packaging conditions. After storing the meat for nine days at 4 degrees Celsius, the amount of psychrotrophic bacteria was measured. The units are the logarithm of the count per square centimeter.

commercial	7.66	commercial	6.98	commercial	7.80
vacuum		5.26	vacuum		5.44	vacuum		5.80
mixed		7.41	mixed		7.33	mixed		7.04
CO2		3.51	CO2		2.91	CO2		3.66

Example: The following discusses the analysis of the data in Table 7-4 on page 297. The data consists of the cortisol levels in three groups of people: healthy individuals, and two levels of depressed individuals. This data could be analyzed using PROC INSIGHT to get the ANOVA table and residual plots. Using PROC GLM it is also possible to test contrasts and perform Levene's test in addition to getting the ANOVA table.

The code below uses PROC GLM to produce the ANOVA table 7-5 on page 298. It also uses two contrasts (one comparing non-melancholic to healthy, and one comparing melancholic to healthy) to get the same results as the text does in figure 7-5 on page 301. The book used dummy variables, however, while the code below uses contrasts; thus the variable Dn in the regression is the same as the contrast 'nonm minus healthy' and Dm is the same as the contrast 'm minus healthy'. The lines using ESTIMATE do the same thing as the lines with CONTRAST, but they also return the estimated value (the same as the slopes in figure 7-5) and the t statistic instead of the F (recall F = t^2).

DATA tab7p4;
INPUT group $ cort @@;
CARDS;
h	2.5	n	5.4	m	8.1
h	7.2	n	7.8	m	9.5
h	8.9	n	8.0	m	9.8
h	9.3	n	9.3	m	12.2
h	9.9	n	9.7	m	12.3
h	10.3	n	11.1	m	12.5
h	11.6	n	11.6	m	13.3
h	14.8	n	12.0	m	17.5
h	4.5	n	12.8	m	24.3
h	7.0	n	13.1	m	10.1
h	8.5	n	15.8	m	11.8
h	9.3	n	7.5	m	9.8
h	9.8	n	7.9	m	12.1
h	10.3	n	7.6	m	12.5
h	11.6	n	9.4	m	12.5
h	11.7	n	9.6	m	13.4
		n	11.3	m	16.1
		n	11.6	m	25.2
		n	11.8
		n	12.6
		n	13.2
		n	16.3
;

PROC GLM DATA=tab7p4 ORDER=DATA;
CLASS group;
MODEL cort=group;
CONTRAST 'nonm minus healthy' group -1 1 0;
CONTRAST 'm minus healthy'    group -1 0 1;
ESTIMATE 'nonm minus healthy' group -1 1 0;
ESTIMATE 'm minus healthy'    group -1 0 1;
RUN;
SAS won't automatically construct the confidence intervals for the contrasts, but we can do it by hand from the output (or program SAS to do it), since the output from the ESTIMATE line gives both the estimate and the standard error of the estimate. The multiplier is the t critical value whose degrees of freedom are the same as the degrees of freedom for the SSwit (here about 2.01 for a 95% interval, not the normal-based 1.96). The 95% CI for the difference of the means for the non-melancholic and healthy groups would thus be approximately: 1.5 +/- 2.01 (1.15947) = 1.5 +/- 2.33 = (-0.83, 3.83).
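Alternatively, the CLPARM option on the MODEL statement asks PROC GLM to print confidence limits for the ESTIMATE results itself. A sketch (check the output in your release, since very old versions may not honor the option):

PROC GLM DATA=tab7p4 ORDER=DATA;
CLASS group;
MODEL cort=group / CLPARM;
ESTIMATE 'nonm minus healthy' group -1 1 0;
ESTIMATE 'm minus healthy'    group -1 0 1;
RUN;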

To conduct a test of hypotheses that the variances of the three groups are equal, we could add an extra line after the MODEL statement above. The following would perform the Levene median test (see page 325).

MEANS group / HOVTEST=BF;
The BF stands for Brown and Forsythe, the name that SAS uses for this particular test. For this example, the p-value for the test of the null hypothesis that the variances of the three groups are equal is 0.6234. We would thus fail to reject the null hypothesis. (Remember that we still need to use PROC INSIGHT to check the normality assumption!)
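In context, the full procedure call would be:

PROC GLM DATA=tab7p4 ORDER=DATA;
CLASS group;
MODEL cort=group;
MEANS group / HOVTEST=BF;
RUN;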

To generate the output for the Holm test (as on page 315), we need to use PROC MULTTEST. The following code would perform the Holm test on all of the different pairs of groups. (SAS calls this the "Stepdown Bonferroni" method.)

PROC MULTTEST DATA=tab7p4 ORDER=DATA HOLM;
CLASS group;
CONTRAST 'healthy vs. nonm'  1 -1 0;
CONTRAST 'healthy vs. m'     1 0 -1;
CONTRAST 'nonm vs. m'        0 1 -1;
TEST mean(cort);
RUN;
The p-values in the column labeled Stepdown Bonferroni have already been adjusted so that you simply need to compare them to the family-wise alpha-level. The logic here is that the smallest raw p-value was 0.0008, so you could either compare 0.0008 to alpha/3 or compare 0.0008*3 to alpha. You could then either compare 0.0157 to alpha/2 or compare 0.0157*2 to alpha. Then finally, you would compare 0.2014 to alpha either way. For this example we could write up the output for alpha = 0.001, 0.02, 0.05 and 0.25 as follows.
Group        Mean       a=0.001     a=0.02     a=0.05    a=0.25
----------------------------------------------------------------
m           13.500         A             B          B          C 
                           A             B
nonm        10.700         A          A  B        A          B
                           A          A           A
healthy      9.200         A          A           A        A 
----------------------------------------------------------------
For high alpha levels, we reject the null hypothesis more often (we are more likely to make type I errors) and are thus more likely to conclude the groups differ. For a small alpha level it is hard to have enough evidence to reject that the groups are the same (we are more likely to make type II errors), and so we are likely to conclude they are all the same.


Notes on Homework 5

Data for problem 2:

reg	norm	989	reg	norm	1025	reg	norm	1030
reg	nplus	1191	reg	nplus	1233	reg	nplus	1221
reg	double	1226	reg	double	1202	reg	double	1180
red	norm	1211	red	norm	1215	red	norm	1182
red	nplus	1860	red	nplus	1910	red 	nplus	1926
red	double	1516	red	double	1501	red	double	1498
cost	norm	1577	cost	norm	1559	cost	norm	1598
cost	nplus	2492	cost	nplus	2527	cost	nplus	2511
cost 	double	1801	cost	double	1833	cost	double	1852

Example as ANOVA: The following discusses the analysis of the kidney data on pages 364-373 by performing the standard two-way ANOVA (including using Holm's test on the main effect and interactions).

DATA ksh;
INPUT strain $ site $ activity @@;
CARDS;
norm dct  62	norm dct  73	norm dct  58	norm dct  66
hyp  dct  44	hyp  dct  49 	hyp  dct  46	hyp  dct  37
norm ccd  15	norm ccd  31	norm ccd  19	norm ccd  35
hyp  ccd   8	hyp  ccd  36	hyp  ccd  11	hyp  ccd  18
norm omcd  7	norm omcd  7	norm omcd  9 	norm omcd 17
hyp  omcd 19	hyp  omcd  7	hyp  omcd 15	hyp  omcd  4
;

PROC GLM DATA=ksh ORDER=DATA;
CLASS strain site;
MODEL activity = strain site strain*site;
RUN;
After running this portion, note that the ANOVA table in the output is identical to the one given on the top of page 369, even though that one was gotten using dummy variable coding. Also note that if you add up the various Type I SS on the page 369 output, you will get the same values as the Type III SS in the SAS output from the above code.

Now, to see if the main effects (strain and site) and the interaction (strain*site) are significant, we actually need to do three tests. Because this model is balanced and factorial, though, the three tests will be independent. Thus, we could use the Sidak adjustment to the alpha levels to get the desired family-wise alpha. We can do even better than this, though. Just like you can use the Holm test to improve on Bonferroni, we can use what the text calls the Holm-Sidak procedure instead of just Sidak. (See the note at the bottom of page 316.) To do this we use PROC MULTTEST, but instead of giving it contrasts, we give it the p-values from the Type III tests, and instead of HOLM we use STEPSID. The data set pvals below contains the p-values from above, and the p-value variable must be called raw_p. It is important to note what order you entered the p-values in!

DATA pvals;
INPUT whichone $ raw_p;
CARDS;
strain	0.0156
site	0.0001
strain*site 0.0404
;

PROC MULTTEST PDATA=pvals STEPSID;
RUN;

In this case, all three adjusted p-values are still less than 0.05, and so we would still say that both the main effects and the interaction were significant with a family-wise alpha of 0.05. If you had used STEPBON instead, the first adjusted p-value would have been slightly larger, and so we would have been less likely to reject. This is because Bonferroni is conservative (i.e. the worst possible case) whereas Sidak's is exact in the case of independent tests.
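For comparison, the Bonferroni-based Holm version mentioned above is the same call with STEPBON in place of STEPSID:

PROC MULTTEST PDATA=pvals STEPBON;
RUN;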

To check the assumptions for this analysis, we need to get the residual plots and use Levene's test. The residual plots can be gotten from PROC INSIGHT simply by using Fit(YX). Choose activity as the Y variable, and both strain and site as the X variables. This is not enough, however. We still need to put the interaction term in. Highlight both strain and site at the same time, and then hit Cross. This will add strain*site to the list of independent variables. Now, hit OK and you can proceed as usual.

Unfortunately SAS will only run Levene's test for a one-way ANOVA. Because of this, we need to trick SAS into thinking this is a one-way ANOVA by actually making it into one! What we will do is change it into a one-way ANOVA with six different treatments: normdct, hypdct, normccd, hypccd, normomcd, and hypomcd. The following code does this using the || operator, which concatenates the names of the factor levels together after trimming off any extra spaces they may have. The code makes the new data set ksh2 from the old data set ksh, prints it out so that you can see what it has done, and then runs the modified Levene's test. (Recall that SAS calls this test the Brown and Forsythe test.)

DATA ksh2;
SET ksh;
KEEP block activity;
block = trim(strain)||trim(site);
RUN;

PROC PRINT data=ksh2;
RUN;

PROC GLM DATA=ksh2 ORDER=DATA;
CLASS block;
MODEL activity = block;
MEANS block / HOVTEST=BF;
RUN;

Example as Regression: The following discusses the analysis of the kidney data on pages 364-373 by using regression with dummy variables (including making SAS recode the data for you!).

Looking at what the degrees of freedom for the ANOVA would be [(a-1), (b-1), (a-1)*(b-1)], we will need 1 dummy variable for strain, 2 for site, and 1*2=2 for the interactions. Mostly following the text (page 366), we will call these H1, N1, N2, H1N1, and H1N2. We'll call the new data set kshreg and make it from the old data set ksh.

DATA kshreg;
SET ksh;
KEEP  activity H1 N1 N2 H1N1 H1N2;

IF strain='norm' then H1=1;
ELSE H1=-1;  

IF site='dct' then N1=1;
ELSE IF site='ccd' then N1=0;
ELSE N1=-1;

IF site='dct' then N2=0;
ELSE IF site='ccd' then N2=1;
ELSE N2=-1;

H1N1=H1*N1;
H1N2=H1*N2;


PROC PRINT DATA=kshreg;
RUN;
Note that this is the same as table 8-13 on page 368, except that the book lists all the hypertensive observations first, while we get the normal dct rows and then the hypertensive dct rows, etc... The lines are all there though.

Of course we could have just entered the data all over again from the table on page 368...

DATA kshreg2;
INPUT activity H1 N1 N2 H1N1 H1N2;
CARDS;
62	1	1	0	1	0
etc...
... but that would have been a waste since we already had the data entered from running the ANOVA earlier.

Once the data is entered, we can get the output using either PROC GLM or PROC INSIGHT to do a regression.

PROC INSIGHT;
OPEN kshreg;
FIT activity = H1 N1 N2 H1N1 H1N2;
RUN;
Note that the Type III SS don't match what is in the output on pg. 369. This is because the VIFs are not all 1 (the dummy variables are not independent!). Because of this, we need to use the Type I SS. This can be done by choosing Type I under the Tables menu. Once you have the correct SS, you can construct your own ANOVA F statistics by hand. For example, the SS for dct in the ANOVA table is the same as the SS for N1 plus the SS for N2 (8287 = 7656.25 + 630.75). The degrees of freedom would be 2, 1 for each of the two dummy variables. The work required to do this is discussed on pages 368 to 370.

Before PROC GLM was created there were some cases in which all of this extra work would be required. Luckily, PROC GLM knows how to use the dummy variables correctly on its own, behind the scenes, and we rarely need to go through all of this work with the dummy variables (although many textbooks seem to like doing it).

For this problem, simply show that you get the same SS for each of the main effects and the interactions. You don't need to calculate all the MS and F statistics.

For checking the assumptions, we can use the residual vs. predicted and q-q plots just like in regression. There is no easy way to get SAS to do Levene's test, however, without going back to the standard ANOVA method.


Notes on Homework 6

Problem 1: This problem uses the data discussed on pages 382-388 of the text. Most of the code you need for running it is provided below.

DATA rats;
INPUT nephron $ strain $ activity;
CARDS;
DCT	Norm	58
DCT 	Norm	66
DCT	Hyp	44
DCT 	Hyp	49
DCT	Hyp	46
CCD	Norm	15
CCD	Norm	31
CCD	Norm	19
CCD	Norm	35
CCD	Hyp	8
CCD	Hyp	11
CCD	Hyp	18
OMCD	Norm	7
OMCD	Norm	7
OMCD	Norm	9
OMCD	Norm	17
OMCD 	Hyp	19
OMCD	Hyp	7
OMCD	Hyp	15
OMCD	Hyp	4
;
PROC GLM DATA=rats ORDER=DATA;
CLASS nephron strain;
MODEL activity = nephron strain nephron*strain / SOLUTION;
LSMEANS nephron strain nephron*strain;
RUN;
The SOLUTION option tells SAS to solve for the model parameters. It doesn't do it quite the way we'd like, however. For example, it says the effect of DCT is to add 35.08, the effect of CCD is to add 1.0833, and the effect of OMCD is to add 0. (Make sure you can find these values on the output.) These three don't average out to zero! Question 1a on the homework just wants to know what combination of nephron and strain is the baseline. That is, which of the combinations of nephron and strain has group mean equal to mu (what SAS calls the intercept)? For example, we know it isn't the DCT-norm group because its average is mu+35.0833-1.250+16.916, which isn't equal to just mu.

In order to get the estimates where the sum of the effects are equal to zero, we need to do a little more work using the ESTIMATE line. The * notation below means to take the average over all of the values of that variable. Thus, y*** is our estimate of what the overall average would be if the design were balanced. It would include mu (the intercept), 1/3 of each of the three nephron sites, 1/2 of each of the strains, and 1/6 of each of the six interactions. This is because it is the average of everything. Notice that since we can't enter the exact value 1/3, I put a 2 there and divided the whole line by 6. Similarly, y*2* is what the average of all the hypertensive rats would be (the 2nd level of the 2nd variable, strain) if the design were balanced. This would include mu, the average of the three nephron sites, the value for hypertensive, and the average of all the interactions involving hypertensive. Notice that if you run the code below you get exactly the values you got from the LSMEANS above.

PROC GLM DATA=rats ORDER=DATA;
CLASS nephron strain;  
MODEL activity = nephron strain nephron*strain;
ESTIMATE 'y*** = mu             ' intercept 6 nephron 2 2 2 strain 3 3 nephron*strain  1 1 1 1 1 1 /divisor=6;
ESTIMATE 'y1** = lsmean all DCT ' intercept 6 nephron 6 0 0 strain 3 3 nephron*strain  3 3 0 0 0 0 /divisor=6;
ESTIMATE 'y2** = lsmean all CCD ' intercept 6 nephron 0 6 0 strain 3 3 nephron*strain  0 0 3 3 0 0 /divisor=6;
ESTIMATE 'y3** = lsmean all OMCD' intercept 6 nephron 0 0 6 strain 3 3 nephron*strain  0 0 0 0 3 3 /divisor=6;
ESTIMATE 'y*1* = lsmean all Norm' intercept 6 nephron 2 2 2 strain 6 0 nephron*strain  2 0 2 0 2 0 /divisor=6; 
ESTIMATE 'y*2* = lsmean all Hyp ' intercept 6 nephron 2 2 2 strain 0 6 nephron*strain  0 2 0 2 0 2 /divisor=6;
ESTIMATE 'y11* = lsmean DCTNorm ' intercept 6 nephron 6 0 0 strain 6 0 nephron*strain  6 0 0 0 0 0 /divisor=6; 
ESTIMATE 'y12* = lsmean DCTHyp  ' intercept 6 nephron 6 0 0 strain 0 6 nephron*strain  0 6 0 0 0 0 /divisor=6;
ESTIMATE 'y21* = lsmean CCDNorm ' intercept 6 nephron 0 6 0 strain 6 0 nephron*strain  0 0 6 0 0 0 /divisor=6;
ESTIMATE 'y22* = lsmean CCDHyp  ' intercept 6 nephron 0 6 0 strain 0 6 nephron*strain  0 0 0 6 0 0 /divisor=6;
ESTIMATE 'y31* = lsmean OMCDNorm' intercept 6 nephron 0 0 6 strain 6 0 nephron*strain  0 0 0 0 6 0 /divisor=6;
ESTIMATE 'y32* = lsmean OMCDHyp ' intercept 6 nephron 0 0 6 strain 0 6 nephron*strain  0 0 0 0 0 6 /divisor=6;
RUN;
The reason for bothering to do this is so that we can estimate the various parameters individually now. For example, we simply have mu = y***, because both mu and y*** are the overall averages. To get n1 we would say n1 = y1** - y***, because the effect of being in nephron level 1 should be the difference between the average of being in nephron level one minus the overall average. Similarly, to get s1 we would say s1 = y*1* - y***. Finally, to get (ns)11 we would say

(ns)11 = y11* - n1 - s1 - mu
       = y11* - (y1** - y***) - (y*1* - y***) - y***
       = y11* - y1** - y*1* + y***

All of these equations are similar to those from the ANOVA table for the two main effects and the interactions. We could have SAS calculate these parameter estimates by combining the estimate lines for the y***, yi**, y*j*, and yij* that we created above.

PROC GLM DATA=rats ORDER=DATA;
CLASS nephron strain;
MODEL activity = nephron strain nephron*strain;
ESTIMATE 'mu = y***             ' intercept 6 nephron 2  2  2 strain 3  3 nephron*strain 1  1  1  1  1  1 /divisor=6;
ESTIMATE 'n1 = y1** - y***      ' intercept 0 nephron 4 -2 -2 strain 0  0 nephron*strain 2  2 -1 -1 -1 -1 /divisor=6;
ESTIMATE 's1 = y*1* - y***      ' intercept 0 nephron 0  0  0 strain 3 -3 nephron*strain 1 -1  1 -1  1 -1 /divisor=6;
ESTIMATE 'ns11=y11*-y1*-y*1+y***' intercept 0 nephron 0  0  0 strain 0  0 nephron*strain 2 -2 -1  1 -1  1 /divisor=6;
RUN; 
Notice that mu+n1 now gives us the LSMeans value for DCT, mu+s1 now gives us the LSMeans value for Norm, and mu+n1+s1+(ns)11 gives us the LSMeans value for DCT*Norm. Not only that, but we have a separate p-value for testing whether each of the terms in the model is equal to zero individually! We could thus go in and pick out some specific terms that we want to test are equal to zero. For this example, if all we wanted to test was the null hypothesis that n1=0, we would get a p-value of < 0.001 and reject that null hypothesis.

Because this gives us the standard error as well as the estimate, we could also construct confidence intervals for the parameters. To make a confidence interval for the n1 effect, we would simply take 26.3472 +/- t_alpha/2 (2.5852), where the degrees of freedom are the df for the SSE (14 in this case). For a 95% interval, t = 2.145, giving 26.3472 +/- 5.545 = (20.80, 31.89). (For question 1c on the homework, notice that the OMCD-hypertensive group is the third level of n, and the second level of s, according to how we entered the data.)

We could also use this to test about particular differences. For example, say we wanted to test if the combination DCT-Norm (the main effects plus interaction) was different than CCD-Hyp (the main effects plus interaction). This would be y11* - y22* so:

PROC GLM DATA=rats ORDER=DATA;
CLASS nephron strain;
MODEL activity = nephron strain nephron*strain;
ESTIMATE 'y22*-y11*' intercept 0 nephron -6 6 0 strain -6 6 nephron*strain -6 0 0 6 0 0 /divisor=6;
RUN;
Clearly DCT-Normal results in significantly more activity than CCD-Hyper. If you check back on Table 8-16 this makes sense though... one group has values 58 and 66 and the other group has values 8, 11, and 18.

Problem 2: You can run a model with random effects just like one with fixed effects. After the MODEL line, simply add a line

RANDOM name_of_random_effects ;

where you plug in the name of the factor that is random (instead of leaving in "name_of_random_effects"). SAS will then tell you what the E(MS) are at the bottom of the output window.

What to do with the E(MS) is discussed on page 460-461, and in the example below.

Example: Say we were analyzing the data from problem one, but had added the line RANDOM strain because the strain of rat was randomly chosen. The hypotheses we might want to test are: are all the nephron effects equal, is the variance of the strain effects equal to 0, and are all the interactions of nephron*strain equal? In SAS's notation, we would be interested in testing that Q(nephron), Var(strain), and Q(nephron*strain) are equal to zero.
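In code, the only change from problem 1 is the RANDOM line, and this produces the output below:

PROC GLM DATA=rats ORDER=DATA;
CLASS nephron strain;
MODEL activity = nephron strain nephron*strain;
RANDOM strain;
RUN;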

                                        The GLM Procedure

                                               Sum of
       Source                      DF         Squares     Mean Square    F Value    Pr > F

       Model                        5     6229.666667     1245.933333      29.97    <.0001

       Error                       14      582.083333       41.577381

       Corrected Total             19     6811.750000


       Source                      DF     Type III SS     Mean Square    F Value    Pr > F

       nephron                      2     5994.229167     2997.114583      72.09    <.0001
       strain                       1      382.699275      382.699275       9.20    0.0089
       nephron*strain               2      278.647771      139.323886       3.35    0.0647



           Source                  Type III Expected Mean Square

           nephron                 Var(Error) + Q(nephron,nephron*strain)

           strain                  Var(Error) + 9.3913 Var(strain) + Q(nephron*strain)

           nephron*strain          Var(Error) + Q(nephron*strain)

Note: our estimate of the Var(Error) is always the MSE in the ANOVA table.

Looking at the Type III Expected Mean Square table, we can see that in order to get a term with Var(strain) by itself, we would need to cancel out both the Var(Error) and the Q(nephron*strain) from the line Source=strain. The only way to get both Var(Error) and Q(nephron*strain) is to use the line Source=nephron*strain. This tells us that the F-statistic we want has MSstrain=382.699275 as the numerator, MSnephron*strain=139.323886 as the denominator, and degrees of freedom = 1, 2. Thus our F value for testing Var(strain)=0 would be F=382.70/139.32=2.747. Table E-2 says we would compare this to 18.51, and we would fail to reject.

This is different from what the p-value of 0.0089 for strain seems to be telling us in the Type III tests! This is because the Type III tests always just use the MSE as the denominator. So with a random effect it would actually be testing the null hypothesis 9.3913 Var(strain) + Q(nephron*strain) = 0. That is, it would be testing that the variance of the strain of rat was zero AND that there was no interaction! The only way to tell that this is what is happening is to look at the expected mean squares.

The p-value 0.0647 for testing if there is an interaction is correct though. Can you tell why?

Finally, note that there is no obvious way to test whether Q(nephron)=0 (that is, that there is no effect due to the nephron). This is because none of the expected mean squares have Q(nephron) by itself; they only have it combined with the interaction in the term Q(nephron,nephron*strain). SAS does not always construct these tests the way we would like for models with random effects if there is an interaction term. Because of this, SAS GLM recommends that you always make the interaction random if at least one of the effects involved is random.
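A sketch of what that recommendation looks like here; the TEST option on the RANDOM line asks PROC GLM to construct the F tests using the appropriate error terms itself (you can verify its output against the hand calculations above):

PROC GLM DATA=rats ORDER=DATA;
CLASS nephron strain;
MODEL activity = nephron strain nephron*strain;
RANDOM strain nephron*strain / TEST;
RUN;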

Luckily on the homework, question 2a points out that there is not an interaction term in the model you will be looking at.


Notes on Homework 7

An ANCOVA model can be fit using PROC GLM or PROC INSIGHT just like an ANOVA or a regression model can.
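For example, here is a minimal PROC GLM sketch. The data set name ancdata, the factor group, the response y, and the covariate x are all made up for illustration; putting a CLASS variable and a continuous variable on the right side of the same MODEL statement is what makes it an ANCOVA.

PROC GLM DATA=ancdata;
CLASS group;
MODEL y = group x / SOLUTION;
RUN;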


Notes on Homework 8

Below is the data for Homework Assignment 8. The variables are the temperature in Fahrenheit and whether or not there was an O-ring failure (1 = failure, 0 = no failure).

53      1	56	1	57	1
63      0	66	0	67	0
67      0	67	0	68	0
69      0	70	0	70	1
70      1	70	1	72	0
73      0	75	0	75	1
76      0	76	0	78	0
79      0	80	0 	81	0	
All of this problem (except the Hosmer-Lemeshow Statistic) can be conducted using PROC INSIGHT. In the FIT (YX) menu you have to choose several options after you select Y and X. Select METHOD using the button at the bottom. In the menu that pops up choose the response distribution Binomial, and the link function Logit.

PROC LOGISTIC will also calculate the Hosmer-Lemeshow Statistic. The following code would analyze the data in Table 12-1 on page 609, and it reproduces the output in figure 12-2 on page 611.

DATA grad;
INPUT int  success  failure;
total=success+failure;
CARDS;
1	0	2
2	0	2
3	0	5
4	0	3
5	1	6
6	1	3
7	1	2
8	4	7
9	5	4
10	7	4
11	7	2
12	7	1
13	11	1
14	7	0
15	11	0
16	5	0
17	7	0
18	2	0
19	2	0
;

PROC LOGISTIC DATA=grad;
MODEL success/total = int / LACKFIT;
RUN;
The data here was input in a slightly different format than the way the data for the homework assignment would be entered. For this data we had several observations recorded at the same intelligence level, all reported on a single line of input (e.g. there were two observations at int=1, with 0 successes and 2 failures). Your input would have these on separate lines. For example, the data set grad would start like:
DATA grad;
INPUT int grad @@;
CARDS;
1	0	1	0	2	0
2	0	3	0	3	0
3	0	3	0	3	0	
4	0	4	0	4	0
5 	1	5	0	5	0
5	0	5	0	5	0
5	0	etc...
;
In this case the code used to run PROC LOGISTIC would be:
PROC LOGISTIC DATA=grad DESCENDING;
MODEL grad = int /LACKFIT;
RUN;
The log window will tell you that the DESCENDING means you are predicting the probability of getting a 1. Without it, you would be predicting the probability of getting a zero.
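Putting this together for the homework data: a sketch, assuming you name the variables temp and fail (use whatever names you chose):

DATA oring;
INPUT temp fail @@;
CARDS;
53 1  56 1  57 1
63 0  66 0  67 0
67 0  67 0  68 0
69 0  70 0  70 1
70 1  70 1  72 0
73 0  75 0  75 1
76 0  76 0  78 0
79 0  80 0  81 0
;

PROC LOGISTIC DATA=oring DESCENDING;
MODEL fail = temp / LACKFIT;
RUN;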


What if SAS won't start?

If you attempt to start SAS and it gives you an error instead of starting the program, there are several possible difficulties. One of them could be that you have an old account that was not correctly updated. To see if this is your problem, go to the Start box and call up Windows NT Explorer in the Programs menu. If you scroll down the left part of the window that pops up and can find a drive labeled apps$ on 'sum-nt' (X:) then the commands below will help you fix this.

1) Close Windows NT Explorer
2) Right click on the My Computer icon on the left side of the screen.
3) Select Disconnect Network Drive
4) Choose X: \\sum-nt\apps and click OK
5) Again, Right click on the My Computer icon
6) Select Map Network Drive...
7) Select X in the Drive box
8) Type \\lc-nt\apps in the path box and click OK

It should now allow you to start SAS. Unfortunately you'll have to do this every time you want to start SAS until they get your account moved over to a newer machine.

Now, if you were not listed as being hooked up to sum-nt, then it could be any number of things. First you should try starting it on another computer. If that fails, you should see if Jamie Winterton (7-5346) is available in Room 209D. If she is not there, you should next see if Jay Dew (7-5413) is in 415. If that fails, you should see if Minna Moore is available in Room 417. Note: If Minna is the only one in, she will probably not be able to help you right away, but she will pass your message on to the next available person.