Stat 516 - Spring 2000 - SAS Templates

Class Notes from 1/12/2000
Notes on Homework One
Notes on Homework Two
Notes on Homework Three
Notes on Homework Four
Notes on Homework Five
Notes on Homework Six
Notes on Homework Seven
Notes on Homework Eight
Downloading the Data from the Web


The Basics of SAS:

Hitting the [F3] key will run the program currently in the Program Editor window.

This will however erase whatever was written in the Program Editor window. To recall whatever was there, make sure you are in that window, and hit the [F4] key.

If you keep running more programs it will keep adding it all to the Output window. To clear the Output window, make sure you are in that window, and choose Clear text under the Edit menu.

If you happen to lose a window, try looking under the Globals menu.

The following is the SAS code for analyzing the twins data we discussed in class. The data is reported in the book The Statistical Sleuth by Ramsey and Schafer, and is originally from: Suddath, R.L. et al. (1990) "Anatomical abnormalities in the brains of monozygotic twins discordant for schizophrenia," New England Journal of Medicine, 322(12), 789-93.
 


OPTIONS pagesize=60 linesize=80;

DATA twins;
INPUT twinid $ unnaf affect;
LABEL twinid = "Twin pair identification letter"
   unnaf = "Vol. of left hipp. in unaffected twin"
   affect = "Vol. of left hipp. in schizophrenic twin";
CARDS;
A 1.94 1.27
B 1.44 1.63
C 1.56 1.47
D 1.58 1.39
E 2.06 1.93
F 1.66 1.26
G 1.75 1.71
H 1.77 1.67
I 1.78 1.28
J 1.92 1.85
K 1.25 1.02
L 1.93 1.34
M 2.04 2.02
N 1.62 1.59
O 2.08 1.97
;

Note that _most_ lines end with a semi-colon, but not all. SAS will fail to run if you miss one, but the Log window will usually tell you where the problem is.

The OPTIONS line only needs to be used once during a session. It sets the length of the page and the length of the lines for viewing on the screen and printing. The font can be set by using the Options menu along the top of the screen. When you cut and paste from SAS to a word processor, the font Courier New works well.

The DATA line defines what the name of the data set is. The name must be eight characters or less, with no spaces, and only letters, numbers, and underscores. It must start with a letter. The "INPUT" line gives the names of the variables, and they must be in the order that the data will be entered. The $ after twinid on the INPUT line means that the variable twinid is qualitative instead of quantitative.

If we hit F3 at this point to enter what we put above, nothing new will appear on the output screen. This is no big surprise however once we realize that we haven't told SAS to return any output! The code below simply tells SAS to print back out the data we entered.


PROC PRINT DATA=twins;
TITLE "Schizophrenia Twin Study";
RUN;
The basic method for getting a summary of the data is to use PROC UNIVARIATE.


PROC UNIVARIATE DATA=twins PLOT FREQ ;

VAR unnaf ;
TITLE 'Summary of the Unaffected Twin';
RUN;
The VAR line says which of the variables you want a summary of. Also note that the graphs here are pretty awful. We'll see in a few minutes how to make the graphics look better. But first it would be nice if the data set included the differences in the volumes of the left hippocampi. The following code makes a new data set called twin2 that will contain unnaf, affect, and a new variable called diff. It will not contain twinid because we don't really need anything for it.
DATA twin2;

SET twins;
KEEP unnaf affect diff;
diff = unnaf - affect;
RUN;

PROC PRINT DATA=twin2;
TITLE "Schizophrenia Twin Differences";
RUN;
Now, SAS has a variety of ways of doing a t-test, but it won't necessarily give us the p-value we are looking for. The following code will do that. It should be fairly easy to follow through. What we are doing below are the steps needed to test that the mean diff is 0:

1) Calculate the mean, standard deviation, and number of values for the variable diff in the data set twin2.
2) Output these values into a data set called temp, because we'll only use it temporarily.
3) Using the data set temp, and the mean entered after the CARDS statement, calculate t = (xbar-mu)/(sd/sqrt(n)).
4) Calculate the p-value for the three different alternative hypotheses. The function probt(t,df) calculates the area less than t in a t-distribution with df degrees of freedom.
5) Put this information in another temporary data set called temp2, and print it out.

PROC MEANS NOPRINT DATA=twin2;

VAR diff;
OUTPUT OUT=temp MEAN=xbar STD=sd N=n;
RUN;
DATA temp2;
SET temp;
KEEP xbar mu sd n df t pgreater pless ptwoside;
INPUT mu;
t = (xbar-mu)/(sd/sqrt(n));
df = n - 1;
pgreater = 1 - probt(t,df);
pless = probt(t,df);
ptwoside = 2*MIN(1-probt(t,df),probt(t,df));
CARDS;
0
;
PROC PRINT DATA=temp2;
RUN;
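As a cross-check on the hand calculation above (this is an aside, not part of the SAS code), here is a minimal Python sketch that reproduces the t statistic from the fifteen paired differences unnaf - affect in the twins data. SAS's probt(t,df) corresponds to the CDF of the t-distribution.

```python
from math import sqrt

# The fifteen differences unnaf - affect, computed from the twins data above.
diff = [0.67, -0.19, 0.09, 0.19, 0.13, 0.40, 0.04, 0.10,
        0.50, 0.07, 0.23, 0.59, 0.02, 0.03, 0.11]

mu = 0                                   # hypothesized mean, as entered after CARDS
n = len(diff)
xbar = sum(diff) / n
sd = sqrt(sum((d - xbar) ** 2 for d in diff) / (n - 1))
t = (xbar - mu) / (sd / sqrt(n))         # same formula as in the DATA step
df = n - 1
# probt(t, df) in SAS is the t CDF; in Python you could use scipy.stats.t.cdf(t, df).
```

The t statistic comes out near 3.2 with 14 degrees of freedom, matching what the SAS code prints.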

Now in order to trust the above p-values, one of the things we need to do is to check if the data set appears normal. This is one of the many graphs that PROC INSIGHT will produce. (It will also do many of the other procedures we need to do this semester).

PROC INSIGHT;

OPEN twin2;
DIST diff;
RUN;
You can cut and paste the graphs from PROC INSIGHT right into Microsoft Word. Simply click on the border of the box you want to copy with the right mouse button to select it. You can then cut and paste like normal. Clicking on the arrow in the bottom corner of each of the boxes gives you options for adjusting the graph's format. To quit PROC INSIGHT, click on the X in the upper right portion of the spreadsheet.

The plot to check for normality is found under the Graphs menu, choose QQ Plot... and then just click ok in the box that pops up. You can then add a line to this plot by using the QQ Ref Line... option in the Curves menu. Again just click ok in the box that pops up. If the data is approximately normally distributed then it should fall near the line.

PROC INSIGHT can also do a t-test or make confidence intervals. These options are in the Tables menu under either C.I. for Mean or Location Tests....

Throughout PROC INSIGHT are little boxes that give you additional options. The arrow box at the bottom of the histogram window contains a Ticks... option that lets you control what boxes and endpoints you use in the histogram. Try changing the "Tick Increment". In the spreadsheet, if you click on the black box at the end of any row, it gives the options to take any given point out of the calculations or graphs. If you do that the output window automatically updates. Similarly, if you click on any number, that point is highlighted in all the graphs.


Notes on Homework 1:

Pg.178 #9 and 10: The code for doing this problem can be found above in what we did in the lab. First you need to enter the data. Note that there will only be one variable name on the INPUT line, because there is only one sample here (i.e. no twins, and no id). Once you've entered the data, for the t-test you can use the code we entered for doing that (it begins with PROC MEANS). You need to change the name of the data set, the name of the variable, and make sure you have the right mean for the null hypothesis. You then need to tell which of the three p-values you want to use. For the confidence interval, you can just use PROC INSIGHT.

Pg.320 #1: PROC INSIGHT will calculate the regression line, the variance of the residuals, and the estimated thickness at each temperature. You need to enter the data (as always), and start insight using the correct data set name on the OPEN line. You don't need to use the DIST line though, since we don't care about either of the individual variables for this problem.

Once PROC INSIGHT is open, under the Analyze menu, choose Fit (YX). Click on the variable you want to use for Y, and then click on the 'Y' box. Click on the variable you want to use for X and then click on the 'X' box. Then click 'OK'.

The model equation is the regression line. The 'Root MSE' in the 'Summary of Fit' box is the standard deviation of the residuals, and you can find the variance of the residuals (the MSE) in the ANOVA table. The estimated values will appear in the spreadsheet in the column labeled P_name of Y variable.


Notes on Homework 2:

Pg. 320-326, #1c-d, 4a, 6: PROC INSIGHT will calculate all of the things you need to answer these questions. Just give it the correct data, and make sure you choose the correct variables to be X and Y. The residuals can be found in the spreadsheet in the column labeled R_name of Y variable. The residual plot is the one at the very bottom of the output window. To add a Q-Q plot of the residuals, go under the Graphs menu and select Residual Normal QQ. Remember if you click on a dot in a plot it shows you which value in the spreadsheet it goes to. Also note that the data set for problem 6 can be found on the web-site at the address listed below.

Pg. 320-326, #4b: Pages 184-191 discuss testing the hypothesis that the means of two separate groups are equal. The SAS procedure that conducts this test is PROC TTEST. The following code will work through example 5.3 on pages 191-192.


DATA ex5p3;
INPUT score type $ @@;
CARDS;
0 sub 0 sub 0 sub 1 sub
3 sub 3 sub 5 sub 7 sub
9 sub 9 sub 0 rail 0 rail
0 rail 0 rail 0 rail 1 rail
1 rail 1 rail 1 rail 2 rail
2 rail 2 rail 2 rail 2 rail
3 rail 4 rail 5 rail
;

PROC TTEST DATA=ex5p3;
CLASS type;
VAR score;
RUN;

The @@ on the INPUT means we can enter multiple observations on the same line. (Without it, it would read in the first 0 sub and then jump down to the next line for 3 sub). The CLASS line in the t-test procedure is where you tell the procedure which of the two groups the observation is in. The VAR line is where you tell it what you are taking the means of.

If you run the above code and check the output, you can find the p-values for the tests by looking for Prob in the output. If you look, you will find three of these. Two are in the table at the top: one on the Equal line and one on the Unequal line. The one on the Unequal line is for testing if the two means are equal when the variances are not equal (pages 190-192). The one on the Equal line is for testing if the two means are equal when the variances are also equal (pages 186-189). The p-value at the bottom is for testing the null hypothesis that the variances are equal (just like it says!). If the Prob>F is small, then we have evidence the variances are unequal; if it is large we would say that the variances are equal.
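To see what the Equal and Unequal lines are actually computing, here is a short Python sketch (an aside, not SAS) that works out both two-sample t statistics for the example 5.3 data by hand: the pooled (equal-variance) form with n1+n2-2 degrees of freedom, and the Welch (unequal-variance) form with its approximate Satterthwaite degrees of freedom.

```python
from math import sqrt

# Example 5.3 data, as entered on the CARDS lines above.
sub  = [0, 0, 0, 1, 3, 3, 5, 7, 9, 9]
rail = [0]*5 + [1]*4 + [2]*5 + [3, 4, 5]

def mean_var(x):
    m = sum(x) / len(x)
    v = sum((xi - m) ** 2 for xi in x) / (len(x) - 1)
    return m, v

m1, v1 = mean_var(sub)
m2, v2 = mean_var(rail)
n1, n2 = len(sub), len(rail)

# Equal line: pooled variance estimate, df = n1 + n2 - 2
sp2 = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
t_pooled = (m1 - m2) / sqrt(sp2 * (1/n1 + 1/n2))
df_pooled = n1 + n2 - 2

# Unequal line: Welch statistic with Satterthwaite approximate df
se2 = v1/n1 + v2/n2
t_welch = (m1 - m2) / sqrt(se2)
df_welch = se2**2 / ((v1/n1)**2 / (n1 - 1) + (v2/n2)**2 / (n2 - 1))
```

Note that the approximate degrees of freedom on the Unequal line come out smaller than the pooled degrees of freedom, which is typical when one group's variance is much larger.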


Notes on Homework 3:

2) Performing multiple regression using PROC INSIGHT is the same as doing standard regression, EXCEPT that after you select Fit(YX) under the Analyze menu you select more than one X. The VIF is contained on the basic regression output in PROC INSIGHT output.

The influence, potential, and standardized residuals can be added to the spreadsheet part of PROC INSIGHT by having selected the screen with the regression output, and then by choosing Dffits, Hat Diag, or Standardized Residual respectively under the Vars menu. R_ is the residual, P_ is the predicted value, F_ is the influence, H_ is the potential, and RS_ is the standardized residuals. Once the variables appear on the spreadsheet, you can include them in plots by using the Scatter Plot option under the Analyze menu. For example you could make a separate plot for each of the X variables against the residuals, or could make plots of the X variables against each other.

Using the data set for example 8.4 (fw08x04), the variable selection can be done using PROC REG. The left side of the equal sign is the Y, the right side is the list of all of the X variables. The "/ Selection =" means to report those statistics for each possible set of variables.

PROC REG DATA=fw08x04;
MODEL   WT = RL WL RNL NWL W /
        SELECTION = RSQUARE ADJRSQ CP;
RUN;

4b) The text book's web-site data set already includes the commands to change all of the variables to logarithms. The logarithm variable names are the same as the others except that they start with an "l". They are part of the data set. (The data set is all there, you don't need to add anything to that part, except for the ; on the line after it.) You could also use PROC INSIGHT to change the variables into logarithms. In the Edit menu choose Variables and log(Y). When the box comes up, simply select the variable you want to transform, click the Y button and hit OK.


Notes on Homework 4:

Pg.209-212,#1: PROC TTEST was also used on homework two.

Pg.275,#4a: PROC INSIGHT can fit a basic ANOVA model, however it is unable to do many of the other analyses we will be doing later. Instead it is best to use PROC ANOVA. Here is how we would work out the data in Tables 6.3 and 6.5.

DATA ricexamp;

INPUT variety yield;
CARDS;
1 934
1 1041
1 1028
1 935
2 880
2 963
2 924
2 946
3 987
3 951
3 976
3 840
4 992
4 1143
4 1140
4 1191
;
PROC ANOVA DATA=ricexamp;
CLASS variety;
MODEL yield=variety;
RUN;


Notes on Homework 5:

The following is one way to enter the lunchmeat data we used in class. The $ after wrap says that wrap consists of names and not numbers.

DATA lunchmt;

INPUT wrap $ bacteria;
CARDS;
comm 7.66
comm 6.98
comm 7.80
vac 5.26
vac 5.44
vac 5.80
mixgas 7.41
mixgas 7.33
mixgas 7.04
allCO2 3.51
allCO2 2.91
allCO2 3.66
;
To check the assumptions you can run PROC INSIGHT just like it was a regression, with wrap as the Y and bacteria as the X. If you are using PROC INSIGHT to do this though it is important to either: include the $ after the name of the categorical variable, or once you're in PROC INSIGHT make sure that the column that has the factor names in it has Nom at the top - if not, then just click on the Int and change it.

To do a basic ANOVA, testing only that some of the means are not equal, the following code would work.

PROC ANOVA DATA=lunchmt;

CLASS wrap;
MODEL bacteria=wrap;
RUN;
To fit the contrasts we talked about in class, it is easiest to use PROC GLM, where GLM stands for General Linear Models. (GLM and ANOVA will give the same ANOVA table, so it doesn't matter which one you use.) Note that the sum of the coefficients has to add up to 0 in order for SAS to fit the model. This will output the 'p-values' for the three contrasts. Remember to adjust the alpha-level according to the formula we worked out for orthogonal contrasts (if they are indeed independent), or according to Bonferroni's formula if they are not orthogonal.
PROC GLM DATA=lunchmt ORDER=DATA;

CLASS wrap;
MODEL bacteria=wrap;
CONTRAST 'C1' wrap 1 -.3333333 -.3333333 -.3333334;
CONTRAST 'C2' wrap 0 1 -0.5 -0.5;
CONTRAST 'C3' wrap 0 0 1 -1;
ESTIMATE 'C1' wrap 1 -.3333333 -.3333333 -.3333334;
ESTIMATE 'C2' wrap 0 1 -0.5 -0.5;
ESTIMATE 'C3' wrap 0 0 1 -1;
RUN;
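To illustrate the two alpha adjustments mentioned above, here is a quick Python check (an aside, not SAS) for k = 3 contrasts at an experimentwise level of 0.05. It assumes the usual forms: alpha/k per contrast for Bonferroni, and 1-(1-alpha)^(1/k) per contrast when the k contrasts are independent.

```python
k = 3                                         # number of contrasts
alpha_e = 0.05                                # desired experimentwise alpha

alpha_bonf = alpha_e / k                      # Bonferroni per-contrast level
alpha_orth = 1 - (1 - alpha_e) ** (1 / k)     # per-contrast level for independent contrasts

# Bonferroni is always at least as conservative (a slightly smaller
# per-contrast alpha), which is the price of not assuming independence.
```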
To perform the Bonferroni, Duncan, Tukey, and Fisher comparisons between the various means, we could use the following code. Remember to set your alpha level in advance, and you should only run the multiple comparison procedure you intend to use. Instead of putting lines under the names of the factor levels that are the same (see page 249), it puts letters of the alphabet.
PROC GLM DATA=lunchmt ORDER=DATA;

CLASS wrap;
MODEL bacteria=wrap;
MEANS wrap / ALPHA=0.01 BON DUNCAN TUKEY LSD;
RUN;
Adding CLDIFF to the MEANS line after the / will give the confidence intervals as output if you're interested in that.

Adding HOVTEST=LEVENE to the MEANS line after the / will perform Levene's test to see if the variances are equal, in case you can't tell from the residual plot. Remember however that this test requires the data to be approximately normal. The Brown and Forsythe test HOVTEST=BF is a bit more robust.
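For intuition about what HOVTEST is doing, here is a rough Python sketch (an aside, not what SAS literally runs) of the Brown and Forsythe idea applied to the lunchmeat data: replace each observation by its absolute deviation from the group median, then run an ordinary one-way ANOVA F test on those deviations. (Levene's original version uses deviations from the group mean instead.)

```python
# Lunchmeat bacteria data by wrap type, from the DATA step above.
groups = {
    "comm":   [7.66, 6.98, 7.80],
    "vac":    [5.26, 5.44, 5.80],
    "mixgas": [7.41, 7.33, 7.04],
    "allCO2": [3.51, 2.91, 3.66],
}

def median(x):
    s = sorted(x)
    m = len(s) // 2
    return s[m] if len(s) % 2 else (s[m - 1] + s[m]) / 2

# Brown-Forsythe: absolute deviations from each group's median.
z = {g: [abs(v - median(vals)) for v in vals] for g, vals in groups.items()}

allz = [v for vals in z.values() for v in vals]
grand = sum(allz) / len(allz)

# One-way ANOVA on the deviations.
ssb = sum(len(vals) * ((sum(vals) / len(vals)) - grand) ** 2 for vals in z.values())
ssw = sum((v - sum(vals) / len(vals)) ** 2 for vals in z.values() for v in vals)
df_b = len(z) - 1
df_w = len(allz) - len(z)
F = (ssb / df_b) / (ssw / df_w)
```

A large F (small p-value) would be evidence that the group variances differ.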


Notes on Homework 6:

The homework problems can be found at: homework assignment 6

The following code would give an analysis of the data in Table 9.4 on pages 424-432.


DATA cars;

INPUT cyl $ oil $ rep $ mpg;
cards;
4 STANDARD 1 22.6
4 STANDARD 2 24.5
4 STANDARD 3 23.1
4 STANDARD 4 25.3
4 STANDARD 5 22.1
4 MULTI 1 23.7
4 MULTI 2 24.6
4 MULTI 3 25.0
4 MULTI 4 24.0
4 MULTI 5 23.1
4 GASMISER 1 26.0
4 GASMISER 2 25.0
4 GASMISER 3 26.9
4 GASMISER 4 26.0
4 GASMISER 5 25.4
6 STANDARD 1 23.6
6 STANDARD 2 21.7
6 STANDARD 3 20.3
6 STANDARD 4 21.0
6 STANDARD 5 22.0
6 MULTI 1 23.5
6 MULTI 2 22.8
6 MULTI 3 24.6
6 MULTI 4 24.6
6 MULTI 5 22.5
6 GASMISER 1 21.4
6 GASMISER 2 20.7
6 GASMISER 3 20.5
6 GASMISER 4 23.2
6 GASMISER 5 21.3
;
PROC GLM DATA=cars ORDER=DATA;
CLASS cyl oil;
MODEL mpg = cyl oil cyl*oil;
CONTRAST 'L1 = 4 vs. 6' cyl -1 1;
CONTRAST 'L2 = standard vs. others' oil 1 -.5 -.5;
CONTRAST 'L3 = multi vs. gasmiser' oil 0 1 -1;
CONTRAST 'L1*L2' cyl*oil -1 0.5 0.5 1 -0.5 -0.5;
CONTRAST 'L1*L3' cyl*oil 0 -1 1 0 1 -1;
RUN;
The model line simply fits the basic ANOVA with interaction. The next three contrast lines are just the standard contrasts as discussed beginning on the bottom of page 429 and over onto the top of page 430. The next two lines are the interaction contrasts which are a little trickier to deal with.

In order to figure out the values for these interaction contrasts, we need to lay out a matrix like is shown on the bottom of page 430. Since 'cyl' is the first variable (as listed in the class line, model statement, and cyl*oil) we will use it to label each row. And we will label each column by 'oil' since it is the second variable. If we were to write the coefficients by the labels for the rows and columns, and multiply them, we would get the values inside the table. The values in the contrast statement can then be read off by going across the rows one at a time. (Follow along in your book and you can see what I mean.)
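The row-by-column multiplication described above is just an outer product. Here is a small Python sketch for the 'L1*L2' contrast: take the cyl coefficients (-1, 1) and the oil coefficients (1, -.5, -.5), multiply every row label by every column label, and read the table off row by row.

```python
L1 = [-1, 1]            # cyl contrast: 4 vs. 6 cylinders
L2 = [1, -0.5, -0.5]    # oil contrast: standard vs. others

# Outer product: one row per cyl level, one column per oil level.
table = [[r * c for c in L2] for r in L1]

# Reading across the rows gives the CONTRAST 'L1*L2' coefficients.
coeffs = [v for row in table for v in row]
```

These are exactly the six numbers on the CONTRAST 'L1*L2' line, and like any contrast they sum to zero.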

Now, assume that "rep" refers to five different drivers, instead of just replications. In this case we have a three-way factorial model with no replication. To fit the additive model to these data, the following code could be used. (By additive model, we mean that there are no interaction terms at all.)


PROC GLM DATA=cars ORDER=DATA;

CLASS cyl oil rep;
MODEL mpg = cyl oil rep;
RUN;
Note that since we have both two way interactions (cyl*oil, cyl*rep, oil*rep) and three way interactions (cyl*oil*rep), it would be possible to fit the model assuming there was no three way interaction, but that there were still the possibility of two way interactions.

PROC GLM DATA=cars ORDER=DATA;

CLASS cyl oil rep;
MODEL mpg = cyl oil rep cyl*oil cyl*rep oil*rep;
RUN;
To use PROC INSIGHT easily, it is often convenient to make a second data set containing just the residuals. To do this for the original model (not using rep), we would use the following code:
PROC GLM DATA=cars ORDER=DATA NOPRINT;

CLASS cyl oil;
MODEL mpg = cyl oil cyl*oil;
OUTPUT OUT=cars2 P=pred R=resid;
RUN;
The new data set cars2 will contain the predicted values and the residuals, and we could then use PROC INSIGHT to get the various residual plots. (Do a scatter plot of pred vs. resid, and use the Distribution (Y) on resid to get to the menus for doing the Q-Q plot.)

We also need to use this new data set if we want to do Levene's test or the Brown and Forsythe test, because these tests are only programmed to work with one-way ANOVAs. To do this we rephrase the problem as a one-way ANOVA predicting the residuals from all of the possible blocks (in this case the 6 combinations of oil and cyl), and then request the test from this ANOVA using the MEANS line.

DATA cars3;

SET cars2;
KEEP block pred resid;
block = trim(cyl)||trim(oil);

PROC GLM DATA=cars3 ORDER=DATA;
CLASS block;
MODEL resid = block;
MEANS block /HOVTEST=LEVENE;
RUN;
This new data set can also be used to make the various residual plots using PROC INSIGHT on 'cars3'. You can simply do a scatterplot of 'resid' and 'pred' for example. A scatterplot of 'resid' vs. 'block' should also be insightful. You could also do a similar plot using the data set 'cars2', and doing a scatterplot of 'resid' vs. 'cyl' or vs. 'oil'.


Notes on Homework 7:

The homework problems can be found at: homework assignment 7

The following code would analyze the data in Table 10.2 on pages 473-476 (using the data downloaded from the web-page, but not the code it gives).


PROC GLM DATA=fw10x02 ORDER=DATA;
CLASS LAB MATERIAL;
MODEL STRESS = LAB MATERIAL LAB*MATERIAL;
RANDOM LAB LAB*MATERIAL / TEST;
MEANS MATERIAL / DUNCAN E=LAB*MATERIAL;
OUTPUT OUT=fwresids P=pred R=resid;
RUN;

Note that the E=LAB*MATERIAL on the MEANS line tells it that you want to use a different denominator for conducting the test. (see for example bottom page 473). You can tell it should be E=LAB*MATERIAL by checking the EMS for testing MATERIAL.

The analysis of latin squares works just like the analysis of a factorial design (using PROC GLM), just remember that you cannot include the interaction terms.


Notes on Homework 8:

The homework problems can be found at: homework assignment 8

Unbalanced Data and Getting the Parameter Estimates:

Luckily PROC GLM (but not PROC ANOVA) will correctly fit unbalanced models. The following code will analyze the data in Table 11.1 on page 516.

DATA examp;

INPUT factA $ factC $ value @@;
CARDS;
1 1 4 1 1 5 1 1 6
1 2 8 2 1 5 2 2 7
2 2 9
;

PROC GLM DATA=examp ORDER=DATA;
CLASS factA factC;
MODEL value = factA factC factA*factC / solution;
LSMEANS factA factC factA*factC;
RUN;

Unfortunately, when the data is unbalanced you can't use the MEANS statement (and thus can't use Duncan's method either). Instead you can use the LSMEANS statement. This won't perform any of the tests, but it does correctly give you the average for each level of factA, each level of factC, or each combination of levels. These aren't the parameters though; these are the yi**, y*j*, and yij*.

In order to estimate the actual parameters, you could use the SOLUTION option (the / solution on the MODEL line). The results are found on the output in the section labeled Parameter Estimates. In this example, note that it seems to say that mu=8, a1=0, a2=0, c1=-3, c2=0, ac11=0, ac12=0, ac21=0, and ac22=0. Note however that the sum of the ci isn't zero! SAS uses a different set of constraints that work by setting the last level of each factor to be zero, instead of setting the sum across the factor to 0. (This actually makes sense if, say, the last level is the placebo and you want to compare the treatments to it, but it isn't the way we want to approach the problem.)
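To see the difference between the two sets of constraints concretely, here is a Python sketch of the arithmetic (an aside, not SAS), using the unweighted cell means from the Table 11.1 example, which is also what LSMEANS averages. Under SAS's last-level-zero constraints c1 comes out as -3, while under the sum-to-zero constraints we use in class the ci are +/-1.5 and do sum to zero.

```python
# Cell means m[i][j] for factA level i, factC level j (Table 11.1 data):
m = [[5.0, 8.0],    # A1C1 = mean(4,5,6) = 5, A1C2 = 8
     [5.0, 8.0]]    # A2C1 = 5,               A2C2 = mean(7,9) = 8

# SAS SOLUTION parameterization (last level of each factor set to zero):
mu_sas = m[1][1]
a1_sas = m[0][1] - m[1][1]
c1_sas = m[1][0] - m[1][1]

# Sum-to-zero parameterization (the one from class), via unweighted means:
ybar = sum(sum(row) for row in m) / 4      # y***
y1   = sum(m[0]) / 2                       # y1**
yc1  = (m[0][0] + m[1][0]) / 2             # y*1*
yc2  = (m[0][1] + m[1][1]) / 2             # y*2*
a1   = y1 - ybar
c1   = yc1 - ybar
c2   = yc2 - ybar
ac11 = m[0][0] - y1 - yc1 + ybar
```

Both parameterizations describe the same fitted cell means; they just anchor the parameters differently.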

To get the parameter estimates we want takes a bit more work. It turns out that SAS calculates the parameters and uses them in the estimate and contrast lines we've used before. We can use the estimate line to calculate the y***, yi**, y*j*, and yij*. So, for example:

ESTIMATE 'y***' intercept 1 factA 0.5 0.5 factC 0.5 0.5 factA*factC 0.25 0.25 0.25 0.25;

ESTIMATE 'y1**' intercept 1 factA 1.0 0.0 factC 0.5 0.5 factA*factC 0.5 0.5 0.0 0.0;
ESTIMATE 'y*1*' intercept 1 factA 0.5 0.5 factC 1.0 0.0 factA*factC 0.5 0.0 0.5 0.0;
ESTIMATE 'y11*' intercept 1 factA 1.0 0.0 factC 1.0 0.0 factA*factC 1.0 0.0 0.0 0.0;
The overall average has the intercept (the mu term), both levels of factA (so average them), both levels of factC (so average them), and all four levels of the interactions (so average them). The average for level one of factA includes the intercept, only level one of factA, both levels of factC (so average them), and the two interactions that are on factA (so average those two). etc... You can check that these lines give the right answers by comparing what they give with what the LSMEANS statement gave.

To get the estimates of the various parameters, we only need to take the differences of these. Like we saw in class, a1 = y1** - y***, and ac11 = y11* - y1** - y*1* + y***, etc... So, for example:

ESTIMATE 'a1'   intercept 0 factA 0.5 -0.5 factC 0.0  0.0 factA*factC 0.25  0.25 -0.25 -0.25;

ESTIMATE 'c1' intercept 0 factA 0.0 0.0 factC 0.5 -0.5 factA*factC 0.25 -0.25 0.25 -0.25;
ESTIMATE 'ac11' intercept 0 factA 0.0 0.0 factC 0.0 0.0 factA*factC 0.25 -0.25 -0.25 0.25;
(Note that 'mu' is the same thing as 'y***'.) This particular homework problem has three levels of factor 1 and four levels of factor 2, so you would need to estimate three a's, four c's, and twelve ac's. Note that there will be a lot of 1/3rds because of this, and typing in all the .3333's will get annoying. The DIVISOR option can make this a bit easier. The following two lines for the above sample data set are exactly the same.
ESTIMATE 'a1' intercept 0 factA 0.5 -0.5 factC 0 0 factA*factC 0.25 0.25 -0.25 -0.25;

ESTIMATE '#2' intercept 0 factA 2 -2 factC 0 0 factA*factC 1 1 -1 -1 / DIVISOR=4;
So for the homework problem data, we could get the various averages by:
ESTIMATE 'y***' intercept 12 med 4 4 4 time 3 3 3 3 med*time 1 1 1 1 1 1 1 1 1 1 1 1  / DIVISOR=12; 

ESTIMATE 'y1**' intercept 12 med 12 0 0 time 3 3 3 3 med*time 3 3 3 3 0 0 0 0 0 0 0 0 / DIVISOR=12;
ESTIMATE 'y2**' intercept 12 med 0 12 0 time 3 3 3 3 med*time 0 0 0 0 3 3 3 3 0 0 0 0 / DIVISOR=12;
ESTIMATE 'y3**' intercept 12 med 0 0 12 time 3 3 3 3 med*time 0 0 0 0 0 0 0 0 3 3 3 3 / DIVISOR=12;
ESTIMATE 'y*1*' intercept 12 med 4 4 4 time 12 0 0 0 med*time 4 0 0 0 4 0 0 0 4 0 0 0 / DIVISOR=12;
ESTIMATE 'y*2*' intercept 12 med 4 4 4 time 0 12 0 0 med*time 0 4 0 0 0 4 0 0 0 4 0 0 / DIVISOR=12;
ESTIMATE 'y*3*' intercept 12 med 4 4 4 time 0 0 12 0 med*time 0 0 4 0 0 0 4 0 0 0 4 0 / DIVISOR=12;
ESTIMATE 'y*4*' intercept 12 med 4 4 4 time 0 0 0 12 med*time 0 0 0 4 0 0 0 4 0 0 0 4 / DIVISOR=12;
ESTIMATE 'y11*' intercept 12 med 12 0 0 time 12 0 0 0 med*time 12 0 0 0 0 0 0 0 0 0 0 0 / DIVISOR=12;
ESTIMATE 'y12*' intercept 12 med 12 0 0 time 0 12 0 0 med*time 0 12 0 0 0 0 0 0 0 0 0 0 / DIVISOR=12;
ESTIMATE 'y13*' intercept 12 med 12 0 0 time 0 0 12 0 med*time 0 0 12 0 0 0 0 0 0 0 0 0 / DIVISOR=12;
ESTIMATE 'y14*' intercept 12 med 12 0 0 time 0 0 0 12 med*time 0 0 0 12 0 0 0 0 0 0 0 0 / DIVISOR=12;
ESTIMATE 'y21*' intercept 12 med 0 12 0 time 12 0 0 0 med*time 0 0 0 0 12 0 0 0 0 0 0 0 / DIVISOR=12;
ESTIMATE 'y22*' intercept 12 med 0 12 0 time 0 12 0 0 med*time 0 0 0 0 0 12 0 0 0 0 0 0 / DIVISOR=12;
ESTIMATE 'y23*' intercept 12 med 0 12 0 time 0 0 12 0 med*time 0 0 0 0 0 0 12 0 0 0 0 0 / DIVISOR=12;
ESTIMATE 'y24*' intercept 12 med 0 12 0 time 0 0 0 12 med*time 0 0 0 0 0 0 0 12 0 0 0 0 / DIVISOR=12;
ESTIMATE 'y31*' intercept 12 med 0 0 12 time 12 0 0 0 med*time 0 0 0 0 0 0 0 0 12 0 0 0 / DIVISOR=12;
ESTIMATE 'y32*' intercept 12 med 0 0 12 time 0 12 0 0 med*time 0 0 0 0 0 0 0 0 0 12 0 0 / DIVISOR=12;
ESTIMATE 'y33*' intercept 12 med 0 0 12 time 0 0 12 0 med*time 0 0 0 0 0 0 0 0 0 0 12 0 / DIVISOR=12;
ESTIMATE 'y34*' intercept 12 med 0 0 12 time 0 0 0 12 med*time 0 0 0 0 0 0 0 0 0 0 0 12/ DIVISOR=12;
The pattern is pretty easy to see once you start looking at it.

We could now use the differences between these to estimate the parameters instead, for example...

ESTIMATE 'mu=y***'            intercept 12 med 4  4  4 time 3  3  3  3 med*time 1  1  1  1  1  1  1  1  1  1  1  1 / DIVISOR=12;

ESTIMATE 'a1=y1**-y***' intercept 0 med 8 -4 -4 time 0 0 0 0 med*time 2 2 2 2 -1 -1 -1 -1 -1 -1 -1 -1 / DIVISOR=12;
ESTIMATE 'c1=y*1*-y***' intercept 0 med 0 0 0 time 9 -3 -3 -3 med*time 3 -1 -1 -1 3 -1 -1 -1 3 -1 -1 -1 / DIVISOR=12;
ESTIMATE 'ac11=y11*-a1-c1-mu' intercept 0 med 0 0 0 time 0 0 0 0 med*time 6 -2 -2 -2 -3 1 1 1 -3 1 1 1 / DIVISOR=12;
The rest of the parameters you are trying to estimate can be found in the same way.

 

Analysis of Covariance: The following code works through example 11.2 on page 521.

Table 11.3 data


PROC GLM DATA=cmclass;

CLASS class;
MODEL post = class pre;
ESTIMATE 'slope' pre 1;
LSMEANS class / stderr pdiff;
RUN;
Here the pretest score is to be used like in regression, and so we don't put it in the CLASS line. We do put class there. An interaction generally doesn't make any sense for an ANCOVA, so we leave it out. The ESTIMATE line gives us the slope. The LSMEANS line gives us the estimates, along with the standard errors, the test of whether each class effect = 0, and the p-values of the tests of whether each class is the same as the other classes. These last tests come from the matrix-like part of the output, and give the same results as on page 524, namely that one and two are not significantly different, but one and three, and two and three, are. (Remember that the experimentwise error might need to be adjusted if using this.)

Because we are using the pretest score (the covariate) similarly to a blocking variable here, we would use the Type III sum of squares to test if what class they are in makes a difference.

The assumptions are the same as before. However, because we are actually doing something like a regression, we can't use the methods we used to test the assumptions for an ANOVA. Adding the line:


OUTPUT out=cmresids P=FIT R=RESIDUAL;

after the MODEL line will make a new data set called cmresids. Running PROC INSIGHT with this new data set will then let us plot the residuals. We could for example plot the CLASS vs. the residuals, the fitted values vs. the residuals, and the PRE vs. the residuals for example. We can also use the Dist(Y) option under the Analyze menu to do a histogram and q-q plot of the residuals.

In order to check that the model is appropriate, we need to make sure that the slope doesn't change with the different classes. To get the same answer as the bottom of page 527, all we need to do is fit the model again, this time with the interaction CLASS*PRE included. The null hypothesis that this line of the ANOVA table tests is that the slope is the same for each class, so if we reject it, it means the assumptions of the ANCOVA are not met.

Logistic Regression: To analyze data as a logistic regression, you would start PROC INSIGHT just as if it were a standard regression. Under the Analyze menu you would choose Fit(YX) and choose the correct variables to be Y and X. The difference comes in the method used for fitting. Instead of simply choosing OK after selecting Y and X, you want to choose a different method (by using the Method button). Set the response distribution to binomial, the link function to logit, and the scale to MLE. Then click OK. To get a graph of the fitted curve, choose Line Plot (YX) in the Analyze menu and plot the predicted value (as the y variable) against the x-variable (as the x variable).
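The logit link chosen above relates the fitted probability p to the linear predictor by logit(p) = log(p/(1-p)); the S-shaped fitted curve you plot is its inverse. A quick Python sketch of the pair (the function names here are just illustrative):

```python
from math import log, exp

def logit(p):
    # Link function: maps a probability in (0, 1) to the linear predictor scale.
    return log(p / (1 - p))

def inv_logit(x):
    # Inverse link: maps the linear predictor back to a probability
    # (this is the S-shaped fitted curve in the Line Plot).
    return 1 / (1 + exp(-x))
```

So a linear predictor of 0 corresponds to a fitted probability of exactly 0.5, and large positive or negative predictors are squeezed toward 1 or 0.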

SAS has a variety of other procedures to analyze logistic regression, including PROC LOGISTIC and PROC CATMOD.


Downloading the Data from the Web

The various data sets used in the text book can be found on the web, so that you don't need to type them in. The web address for this is:

ftp://ftp.harcourtbrace.com/pub/academic_press/saved/textbook/freund.data/ .

For example fw01p05 would be the data for problem 5 of chapter 1.

Note that if you are using Internet Explorer, it may work better to use "Select All" under the Edit menu, instead of highlighting the text manually.