Notes on Homework One
Notes on Homework Two
Notes on Homework Three
Notes on Homework Four
Notes on Homework Five
Notes on Homework Six
Notes on Homework Seven
Notes on Homework Eight
Downloading the Data from the Web

From class on January 14th


Making SAS perform the procedures from last semester....
--------------------------------------------------------

F3 = run
F4 = recall previous program editor statement
    
clear the output window if you are done with what is there


The following data is from Table 5.13 on page 211.

----------begin pasting with next line-----------
DATA devMPG;
INPUT  without with;
LABEL without = "MPG without device"
	with = "MPG with device";
CARDS;
21.0	20.6
30.0	29.9
29.8	30.7
27.3	26.5
27.7	26.7
33.1	32.8
18.8	21.7
26.2	28.2
28.0	28.9
18.9	19.9
29.3	32.4
21.0	22.0
;
PROC PRINT;
RUN;
-----------stop pasting with previous line--------



To get a summary of the data without the device  use....


--------------begin------------------------------
PROC UNIVARIATE DATA=devMPG PLOT PCTLDEF=4;
VAR without;
TITLE 'Summary of MPG without the device';
RUN;
-------------end---------------------------------



To get a confidence interval for the mean without the device
use...

------------begin---------------------------------
PROC MEANS DATA=devMPG N MEAN STD CLM ALPHA=0.05  MAXDEC=3;
VAR without;
RUN;
-------------end----------------------------------



To test for a difference in means in two independent
populations we use PROC TTEST which requires each entry
to be of the form...

group  value
group  value

so first we need to reformat the data into a new data set.

(NOTE: We only need to reformat the data here because it 
wasn't in the right format already.  That is, the first variable 
was the values for the first group, and the second variable was
the values for the second group.   If it is already set up so 
that the first variable is which group, and the second variable
is the value, then there is no need to do all these DATA  
statements.						-1/25) 

-------------------begin-------------------------
DATA temp1;
SET devMPG;
KEEP mpg device;
device = 'with '; 
mpg = with;

DATA temp2;
SET devMPG;
KEEP mpg device;
device = 'wout';
mpg = without;

DATA dev2MPG;
SET temp1 temp2;

PROC PRINT;
RUN;     

PROC TTEST DATA = dev2MPG;
CLASS device;
VAR mpg;
TITLE 'Effect of Device on MPG';
RUN;
-----------------end-------------------------------    


The PROC TTEST output also includes a test that the two variances are equal!


Note, though, that this data is paired, so it may be better
to perform the test on the differences within each pair
instead.

------------------begin---------------------------
DATA pairMPG;
SET devMPG;
KEEP diff;
diff = with - without;

PROC UNIVARIATE DATA=pairMPG  PCTLDEF=4;
VAR diff;
TITLE 'Summary of MPG difference (with - without)';
RUN;
------------------end-----------------------------
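
The same paired comparison can probably also be run directly in PROC TTEST, without building the difference data set by hand. This is only a sketch, and it assumes a SAS release in which PROC TTEST supports the PAIRED statement. It tests that the mean of with - without is zero.

```sas
/* Paired t test on devMPG: tests that the mean of (with - without) is 0. */
/* Assumes the devMPG data set created above is still available.          */
PROC TTEST DATA=devMPG;
PAIRED with*without;
TITLE 'Paired t test of MPG with vs. without the device';
RUN;
```

The t statistic and p-value should agree with the test that the mean of diff is zero in the PROC UNIVARIATE output.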



PROC INSIGHT allows many of these analyses and is also
one of the procedures which performs simple linear
regression.


------------------begin---------------------------
PROC INSIGHT; 
OPEN devMPG;
DIST with without;
RUN;
-------------------end----------------------------


Note:  the percentiles reported here will disagree with those
given by PROC UNIVARIATE.


Selecting under graphs gives the option to plot the Q-Q plots.

Selecting under tables gives the option of giving frequency tables,
confidence intervals for the means, and tests that the means are
equal to particular values.

Selecting FIT(YX) under analyze will perform linear regression.

P_  indicates the predicted values, and R_ indicates the residuals.

Notes on Homework 1

1d) To check whether the F-value is significant, table A.4A in the back of the text can be used. The numerator df is the df for the regression. The denominator df is the df for the error. We reject the null hypothesis at the alpha=0.05 level if the observed F-value is greater than the value in the table.
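
If the table is not handy, SAS can look up the same critical value. The following is a small sketch using the FINV and PROBF functions in a DATA step. The degrees of freedom and the observed F-value are made-up example values (assumptions), so substitute the ones from your own ANOVA table. The results are written to the log.

```sas
/* Critical value and p-value for an F test at alpha = 0.05.          */
/* ndf, ddf and fobs are example values only; use your own.           */
DATA _NULL_;
ndf = 1;      /* df for the regression (numerator)  */
ddf = 10;     /* df for the error (denominator)     */
fobs = 6.2;   /* hypothetical observed F-value      */
fcrit = FINV(0.95, ndf, ddf);          /* value that would be in the table */
pvalue = 1 - PROBF(fobs, ndf, ddf);    /* area to the right of fobs        */
PUT fcrit= pvalue=;
RUN;
```

We would reject the null hypothesis if fobs is greater than fcrit, or equivalently if pvalue is less than 0.05.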

4a) Performing a regression means: "determine the form of the least squares line." The PROC INSIGHT output after using FIT(YX) includes the test that the slope is equal to 0. It can be found in the Analysis of Variance table in the output. This is discussed on pages 295 and 296 of the text.

4b) The trick here is figuring out which test to use. It is one of those that we did on the 14th in class.

9b) Residual plots are discussed on pages 311 - 313, with Figures 7.7 to 7.9 being examples of bad residual plots. The residual plot will be at the bottom of the PROC INSIGHT output.

Notes on Homework 2

5a) "the nature of the relationship" can be summarized by the slope Beta₁ and by the correlation coefficient.

5b) To get the confidence interval for mu_y|x and y_y|x it is easiest to use PROC REG. The following code would perform the regression that predicts the performance "with" the device based on the observed performance "without" the device:

PROC REG data=devMPG;
  MODEL with=without;
  PRINT CLI;
RUN;

The line "PRINT CLI" gives the intervals for y_y|x (an individual value). To get the confidence intervals for mu_y|x (the mean) you would use the line "PRINT CLM".

Running the above lines with the dataset devMPG we would see for example that: "If a car gets 21 mpg without the device (like observation 1), then we would expect it to get 22.1510 mpg with the device (even though we only observed it getting 20.6000)." If we believed the model fit the data, then we would say "We expect that for a car that got 21 mpg without the device, that there is a 95% chance it would get between 18.8202 and 25.4819 miles per gallon with it."

The complication in problem 5 is that we don't actually have any players who are 60 inches tall in the data set! As an example of how to fix this, say we wanted to get the confidence interval for a car that got 35 mpg without the device. To do this we would need to make a new data set where the car has 35 mpg without it, but NO VALUE with it. No value is signified by using a period.

DATA devMPG35;
INPUT  without with;
LABEL without = "MPG without device"
        with = "MPG with device";
CARDS;
21.0    20.6
30.0    29.9
29.8    30.7
27.3    26.5
27.7    26.7
33.1    32.8
18.8    21.7
26.2    28.2
28.0    28.9
18.9    19.9
29.3    32.4
21.0    22.0
35.0    .
;
PROC PRINT;
RUN;
    
PROC REG data=devMPG35;
  MODEL with=without;
  PRINT CLI;
RUN;

So observation 13 (since 35 goes with the 13th data point) has the answer.

7b) "your findings" should include the equation of the regression line, the estimated sigma, and the result of testing if Beta₁ = 0 or not.

4a) Performing multiple regression using PROC INSIGHT is the same as doing standard regression EXCEPT that after you select FIT(YX) under 'Analyze' you pick more than one X.

Note: If you are entering a string of characters (like ATL for the city Atlanta) in the CARDS section, you need to follow the name of that variable with a $ on the INPUT line. See for example the INPUT line for data set fw08p04 on the textbook's web site. (There are some extra lines in there that you don't need to worry about for this part of the problem; just leave them out.)

Notes on Homework 3

2) The VIF is included in the basic regression output of PROC INSIGHT.

The influence, potential, and standardized residuals can be added to the spreadsheet part of PROC INSIGHT by selecting the screen with the regression output, and then choosing "Dffits", "Hat Diag", or "Standardized Residual" respectively under the "Vars" menu. R_ is the residual, P_ is the predicted value, F_ is the influence, H_ is the potential, and RS_ is the standardized residual. Once the variables appear on the spreadsheet, you can include them in plots using the "Scatter Plot" option under the "Analyze" menu.

Using the data set we have been working with in class, the variable selection can be done using PROC REG. The left side of the equal sign is the Y, the right side is the list of all of the X variables. The "/ Selection =" means to report those statistics for each possible set of variables.

PROC REG;
MODEL WEIGHT = DBH HEIGHT AGE GRAV /
        SELECTION = RSQUARE ADJRSQ CP;
RUN;

4b) The text's web-site data set already includes the commands to change all of the variables to logarithms. The logarithm variable names are the same as the others except that they start with an "l". They are part of the data set. (The data set is all there, you don't need to add anything to that part, except for the ; on the line after it.)

7b) In order to do the regression on both TIME and the square of TIME, you need to add the square of TIME to the list of variables. This can be done as follows.

DATA FW08P07; TITLE 'Ch 8, Exercise 7, DISTANCE COVERED BY IRRIGATION WATER';
INPUT  DISTANCE     TIME ;
TIMESQ=TIME*TIME;
CARDS;
                                  85       0.15
etc...

When you do the regression, just use both TIME and TIMESQ for X.
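
For those using PROC REG rather than PROC INSIGHT, the fit might look something like the sketch below. It assumes the data set FW08P07 has been created with the TIMESQ line added as shown above.

```sas
/* Quadratic regression of DISTANCE on TIME.                      */
/* Assumes FW08P07 already contains TIMESQ = TIME*TIME.           */
PROC REG DATA=FW08P07;
MODEL DISTANCE = TIME TIMESQ;
RUN;
```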

Notes on Homework 4

1) To do PROC TTEST you need to make sure the data is in the correct format as described above. Sometimes it is already in the correct format.

4a) PROC INSIGHT can fit a basic ANOVA model; however, it is unable to do many of the other analyses we will be doing later, so it is best to use PROC ANOVA instead. Here is how we would work out the data we used in class.


DATA examp;
INPUT group value;
CARDS;
1       1
1       2
1       3
2       1
2       4
2       7
3       5
3       6
3       7
;
PROC ANOVA DATA=examp;
 CLASS group;
 MODEL value=group;
RUN;

If the name of the group is actually a name, instead of a number, you need to put a $ after it. The dollar sign in the input line means to treat the variable before it as just a name, and not a particular value. If you check back, any of the data sets that included the name of a city had a dollar sign after city.

Notes on Homework 5

The following is one way to enter the data from the example that we have been using in class. The $ after wrap says that wrap consists of names and not numbers.

DATA lunchmt;
INPUT wrap $ bacteria;
CARDS;
comm	7.66
comm	6.98
comm	7.80
vac	5.26
vac	5.44
vac	5.80
mixgas	7.41
mixgas	7.33
mixgas	7.04
allC02	3.51
allC02	2.91
allC02	3.66
;

To do a basic ANOVA, testing only that some of the means are not equal, the following code would work.

PROC ANOVA DATA=lunchmt;
CLASS wrap;
MODEL bacteria=wrap;
RUN;

To fit the contrasts we talked about in class, it is easiest to use PROC GLM, where GLM stands for General Linear Model. Note that the sum of the coefficients in each contrast has to be 0 in order for SAS to fit it (which is why the last coefficient in C1 is -.3333334). This will output the 'p-values' for the three contrasts. Remember to adjust the alpha-level according to the formula we worked out for orthogonal contrasts (if they are indeed independent), or according to Bonferroni's formula if they are not orthogonal.

PROC GLM DATA=lunchmt ORDER=DATA;
CLASS wrap;
MODEL bacteria=wrap;
CONTRAST 'C1' wrap 1 -.3333333 -.3333333 -.3333334;
CONTRAST 'C2' wrap 0  1 -0.5 -0.5;
CONTRAST 'C3' wrap 0  0  1 -1;
ESTIMATE 'C1' wrap 1 -.3333333 -.3333333 -.3333334;
ESTIMATE 'C2' wrap 0  1 -0.5 -0.5;
ESTIMATE 'C3' wrap 0  0  1 -1;
RUN;
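
As a rough sketch of the alpha adjustment, the per-contrast levels can be worked out in a DATA step. Bonferroni's formula is alpha/k. For the orthogonal case the sketch ASSUMES the class formula was 1 - (1 - alpha)^(1/k); check your notes, since that form is a guess on my part. The overall alpha of 0.05 is also just an example value.

```sas
/* Per-contrast alpha levels for k = 3 contrasts.                   */
/* alpha = 0.05 is an example; aorth uses an ASSUMED formula.       */
DATA _NULL_;
alpha = 0.05;                       /* overall level, set in advance     */
k = 3;                              /* number of contrasts               */
abon  = alpha / k;                  /* Bonferroni                        */
aorth = 1 - (1 - alpha)**(1/k);     /* independent (orthogonal) contrasts */
PUT abon= aorth=;
RUN;
```

Each contrast's p-value from PROC GLM would then be compared to the adjusted level instead of to 0.05.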

To perform the Bonferroni, Duncan, Tukey, and Fisher (LSD) comparisons between the various means, we could use the following code. Remember to set your alpha level in advance, and only run the multiple comparison procedure you intend to use. Instead of putting lines under the names of the factor levels that are the same (see page 249), SAS labels them with letters of the alphabet.

PROC GLM DATA=lunchmt ORDER=DATA;
CLASS wrap;
MODEL bacteria=wrap;
MEANS wrap / ALPHA=0.01 BON DUNCAN TUKEY LSD;
RUN;

Adding CLDIFF to the means line after the / will give the confidence intervals as output.
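
For example, the following sketch asks for only the Tukey comparisons and reports them as confidence intervals for the differences. The choice of Tukey and of ALPHA=0.01 is just for illustration.

```sas
/* Tukey comparisons reported as confidence intervals (CLDIFF).   */
/* Assumes the lunchmt data set from above.                       */
PROC GLM DATA=lunchmt ORDER=DATA;
CLASS wrap;
MODEL bacteria=wrap;
MEANS wrap / ALPHA=0.01 TUKEY CLDIFF;
RUN;
```

With CLDIFF, two levels are significantly different when the interval for their difference does not contain 0.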

Notes on Homework 6


The following code would give an analysis of the data in Table 9.4 on pages 424-432.


DATA cars; 
INPUT cyl $  oil $ rep $ mpg;
cards;
                       4     STANDARD    1    22.6
                       4     STANDARD    2    24.5
                       4     STANDARD    3    23.1
                       4     STANDARD    4    25.3
                       4     STANDARD    5    22.1
                       4     MULTI       1    23.7
                       4     MULTI       2    24.6
                       4     MULTI       3    25.0
                       4     MULTI       4    24.0
                       4     MULTI       5    23.1
                       4     GASMISER    1    26.0
                       4     GASMISER    2    25.0
                       4     GASMISER    3    26.9
                       4     GASMISER    4    26.0
                       4     GASMISER    5    25.4
                       6     STANDARD    1    23.6
                       6     STANDARD    2    21.7
                       6     STANDARD    3    20.3
                       6     STANDARD    4    21.0
                       6     STANDARD    5    22.0
                       6     MULTI       1    23.5
                       6     MULTI       2    22.8
                       6     MULTI       3    24.6
                       6     MULTI       4    24.6
                       6     MULTI       5    22.5
                       6     GASMISER    1    21.4
                       6     GASMISER    2    20.7
                       6     GASMISER    3    20.5
                       6     GASMISER    4    23.2
                       6     GASMISER    5    21.3
;
PROC GLM DATA=cars ORDER=DATA;
CLASS cyl oil;
MODEL mpg = cyl oil cyl*oil;
CONTRAST 'L1 = 4 vs. 6' cyl -1 1;
CONTRAST 'L2 = standard vs. others' oil 1 -.5 -.5;
CONTRAST 'L3 = multi vs. gasmiser' oil 0 1 -1;
CONTRAST 'L1*L2' cyl*oil -1 0.5 0.5 1 -0.5 -0.5;
CONTRAST 'L1*L3' cyl*oil 0 -1 1 0 1 -1;
RUN;

The model line simply fits the basic ANOVA with interaction. The next three contrast lines are just the standard contrasts as discussed beginning on the bottom of page 429 and over onto the top of page 430. The next two lines are the interaction contrasts which are a little trickier to deal with.

In order to figure out the values for these interaction contrasts, we need to lay out a matrix like the one shown on the bottom of page 430. Since 'cyl' is the first variable (as listed in the CLASS line, MODEL statement, and cyl*oil) we will use it to label each row, and we will label each column by 'oil' since it is the second variable. If we write the contrast coefficients next to the labels for the rows and columns, and multiply them, we get the values inside the table. The values in the CONTRAST statement can then be read off by going across the rows one at a time. (Follow along in your book and you can see what I mean.)
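
If you want to check the multiplication, a small DATA step can do it. This is only a sketch for the L1*L2 contrast; the arrays simply hold the L1 and L2 coefficients from the CONTRAST lines above, and the products are written to the log row by row.

```sas
/* Interaction contrast coefficients = (row coefficient) * (column coefficient). */
/* Rows are the levels of cyl, columns are the levels of oil.                    */
DATA _NULL_;
ARRAY c{2} _TEMPORARY_ (-1 1);         /* L1: 4 vs. 6 cylinders   */
ARRAY o{3} _TEMPORARY_ (1 -0.5 -0.5);  /* L2: standard vs. others */
DO i = 1 TO 2;                         /* go across one row at a time */
  DO j = 1 TO 3;
    coef = c{i} * o{j};
    PUT coef @;                        /* stay on the same log line   */
  END;
END;
PUT;
RUN;
```

The six values written to the log should match the coefficients on the 'L1*L2' CONTRAST line. Replacing the second array with 0 1 -1 gives the values for 'L1*L3'.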

Now, assume that "rep" refers to five different drivers instead of just replications. In this case we have a three-way factorial model with no replications. To fit the additive model to this data, the following code could be used.


PROC GLM DATA=cars ORDER=DATA;
CLASS cyl oil rep;
MODEL mpg = cyl oil rep;
RUN;

Note that since we have both two-way interactions (cyl*oil, cyl*rep, oil*rep) and a three-way interaction (cyl*oil*rep), it would be possible to fit the model assuming there is no three-way interaction while still allowing for the two-way interactions.


PROC GLM DATA=cars ORDER=DATA;
CLASS cyl oil rep;
MODEL mpg = cyl oil rep cyl*oil cyl*rep oil*rep;
RUN;

SAS has a test that is more robust than Hartley's Fmax test for testing that all the variances are equal. This test is Levene's F test. Unfortunately Levene's test is set up to work only for one-way ANOVAs. In order to use this test on a multi-way ANOVA, we have to: a) fit the model and get the residuals, b) fit a one-way model of the residuals based on all the blocks (in this case the 6 combinations of oil and cyl), and c) get Levene's test from this ANOVA using the MEANS line.

The first five lines fit the model we want, and make a new data set called 'cars2' that also has the predicted values and residuals. The next four lines make a data set called 'cars3' that has three variables: the predicted values, the residuals, and a variable called block that combines 'cyl' and 'oil'. The '||' between the two means to concatenate the names of the levels of those two variables into one new variable. The final five lines then run Levene's test. The test (and its p-value) will be near the middle of the output. NOTE: the rest of the output for this last part is not of any use.


PROC GLM DATA=cars ORDER=DATA NOPRINT;
CLASS cyl oil;
MODEL mpg = cyl oil cyl*oil;
OUTPUT OUT=cars2 P=pred R=resid;
RUN;

DATA cars3;
SET cars2;
KEEP  block pred resid;
block = cyl||oil;

PROC GLM DATA=cars3 ORDER=DATA;
CLASS block;
MODEL resid = block;
MEANS block /HOVTEST=LEVENE;
RUN;

This new data set can also be used to make the various residual plots using PROC INSIGHT on 'cars3'. You can simply do a scatterplot of 'resid' and 'pred' for example. A scatterplot of 'resid' vs. 'block' should also be insightful. You could also do a similar plot using the data set 'cars2' (which has the residuals and still contains 'cyl' and 'oil'), doing a scatterplot of 'resid' vs. 'cyl' or vs. 'oil'.

Notes on Homework 7

3) The following code analyzes the data for Example 6.6 on page 262.


DATA teach;
INPUT  teacher $ score @@;
CARDS;
A	84	A	90	A	76
A	62	A	72	A	81	A	70
B	75	B	85	B	91
B	98	B	82	B	75	B	74
C	72	C	76	C	74
C	85	C	77	C	60	C	62
D	88	D	98	D	70
D	95	D	86	D	80	D	75
;
PROC GLM  DATA=teach  ORDER=DATA;
CLASS teacher;
MODEL  score = teacher;
RANDOM teacher / TEST;
RUN;

The line RANDOM tells SAS that the variable teacher is a random effect and the / TEST tells it that we want to have the output testing whether or not the variance from the teachers is zero.

Notice that the output says that the expected mean square on the line for teachers is equal to the MSE (the variance of the error) + 7 times the variance from the teachers. You could use this equation to solve for the variance components: the variance from the teachers is estimated by (MS for teacher - MSE)/7. (Reading along with the example in the book will make this clear.)

PROC VARCOMP could also be used to calculate the variance components for you.


PROC VARCOMP  DATA=teach  METHOD=TYPE1;
CLASS teacher;
MODEL score = teacher;
RUN;

Note that VARCOMP assumes an effect is random until you tell it otherwise, and that it will not give you the test of hypotheses.

8) The following code works through example 10.2 on page 473.

Table 10.4 data


PROC GLM  DATA=fw10x02  ORDER=DATA;
CLASS lab material;
MODEL  stress = lab material lab*material;
RANDOM lab lab*material  / TEST;
RUN;

Note that here, lab is the random effect, and material is the fixed effect. We need to tell SAS that lab is random, and because it is random, that lab*material is random too. On the part of the output where it tells you what the expected mean squares are, Q(material) is the term with the sum of the tau-squareds. (Comparing the output to the expected mean squares on page 471 will let you see what SAS calls the different things.)

Note that the ANOVA table at the top of the output uses the MSE in the denominator. By checking the E(MS) you can see that that isn't what you want. It gives the correct test near the bottom of the output. In the output the line "Source: MATERIAL" tells you which MS is in the numerator, and the line "Error: MS(LAB*MATERIAL)" tells you which MS is in the denominator.

How to calculate the gain in efficiency is discussed on the bottom of page 475.

9a) Say we wanted to analyze the car data (from the homework six notes above), except with several observations missing. There will be one dummy variable for the cylinders, two for the oils, and 2*1 = 2 for the interactions. It is easiest to make the data set with just the main effects, and then make a new data set where SAS calculates the interactions.


DATA cars2;
INPUT cyl $ oil $ mpg  	mu  	cyl4 	stand 	multi;
CARDS;
4     STANDARD    22.6		1	1	1	0
4     STANDARD    24.5		1	1	1	0
4     STANDARD    23.1		1	1	1	0
4     MULTI       23.7		1	1	0	1	
4     GASMISER    26.0		1	1	0	0
4     GASMISER    25.0		1	1	0	0
4     GASMISER    26.9		1	1	0	0
4     GASMISER    26.0		1	1	0	0
4     GASMISER    25.4		1	1	0	0
6     STANDARD    22.0		1	0	1	0
6     MULTI       23.5		1	0	0	1
6     MULTI       22.8		1	0	0	1
6     MULTI       24.6		1	0	0	1
6     MULTI       24.6		1	0	0	1
6     MULTI       22.5		1	0	0	1
6     GASMISER    21.4		1	0	0	0
6     GASMISER    20.7		1	0	0	0
6     GASMISER    20.5		1	0	0	0	
6     GASMISER    23.2		1	0	0	0
;
DATA cars3;
SET cars2;
KEEP mpg mu cyl4 stand multi stand4 multi4;
stand4 = cyl4 * stand;
multi4 = cyl4 * multi;
;
PROC REG DATA=cars3;
MODEL mpg = cyl4 stand multi stand4 multi4;
RUN;

Unfortunately, PROC REG used on the dummy variables doesn't tell us how to combine the two oil types (for example) to make one oil type. It does give us the overall p-value for everything. It turns out though that PROC GLM (but not PROC ANOVA) automatically knows what to do if you give it unbalanced data!


PROC GLM DATA=cars2;
CLASS cyl oil;
MODEL mpg = cyl oil cyl*oil;
RUN;

Notice that the overall sums of squares agree now for the model as a whole.

NOTE: As said in class, this means that to do a test for a dummy variable model we do not need to actually figure out what the variables are. PROC GLM does it for us. Also note that to see the estimated values we can use the line: LSMEANS cyl oil cyl*oil . PROC ANOVA, or the MEANS line, will not give the correct values, however.
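
Put together with the PROC GLM code above, the LSMEANS line would look something like this (a sketch using the unbalanced cars2 data set from this problem):

```sas
/* Unbalanced two-way ANOVA with least squares means.             */
/* Assumes the unbalanced cars2 data set created above.           */
PROC GLM DATA=cars2;
CLASS cyl oil;
MODEL mpg = cyl oil cyl*oil;
LSMEANS cyl oil cyl*oil;
RUN;
```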

Notes on Homework 8

1) The following code works through example 11.2 on page 521.

Table 11.3 data


PROC GLM DATA=cmclass;
CLASS class;
MODEL post = class pre;
ESTIMATE 'slope' pre 1;
LSMEANS class / stderr pdiff;
RUN;

Here the pretest score is to be used like in regression, and so we don't put it in the CLASS line. We do put the class there. An interaction generally doesn't make any sense for an ANCOVA, so we leave it out. The ESTIMATE line gives us the slope. The LSMEANS line gives us the estimates, along with the standard errors, the test of whether each class's adjusted mean = 0, and the p-values of the tests of whether each class is the same as each of the other classes. These last tests appear in the matrix-like part of the output, and give the same results as on page 524, namely that one and two are not significantly different, but one and three, and two and three, are. (Remember that the experimentwise error might need to be adjusted if using this.)

Because we are using the pretest score (the covariate) similarly to a blocking variable here, we would use the Type III sum of squares to test if what class they are in makes a difference.

The assumptions are the same as before. However, because we are actually doing something like a regression, we can't use the methods we used to test the assumptions for an ANOVA. Adding the line:


OUTPUT out=cmresids P=FIT R=RESIDUAL;

after the MODEL line will make a new data set called cmresids. Running PROC INSIGHT with this new data set will then let us plot the residuals. We could for example plot the CLASS vs. the residuals, the fitted values vs. the residuals, and the PRE vs. the residuals. We can also use the Dist(Y) option under the Analyze menu to do a histogram and q-q plot of the residuals.
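
With the OUTPUT line in place, the whole step would look something like the sketch below. It assumes the cmclass data set from Table 11.3 has already been read in; the ESTIMATE and LSMEANS lines are left out here only to keep it short.

```sas
/* ANCOVA fit that also saves fitted values and residuals.        */
/* Assumes the cmclass data set (Table 11.3) already exists.      */
PROC GLM DATA=cmclass;
CLASS class;
MODEL post = class pre;
OUTPUT OUT=cmresids P=FIT R=RESIDUAL;
RUN;
```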

In order to check that the model is appropriate, we need to make sure that the slope doesn't change with the different classes. To get the same answer as the bottom of page 527, all we need to do is fit the model again, this time with the interaction CLASS*PRE included. The null hypothesis that this line of the ANOVA table tests is that the slope is the same for each class, so if we reject it, it means the assumptions of the ANCOVA are not met.
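
A sketch of that second fit, again assuming the cmclass data set from Table 11.3 has already been read in:

```sas
/* Check of the equal slopes assumption: add the CLASS*PRE interaction. */
/* Assumes the cmclass data set (Table 11.3) already exists.            */
PROC GLM DATA=cmclass;
CLASS class;
MODEL post = class pre class*pre;
RUN;
```

The line for class*pre in the ANOVA table is the one to look at. A small p-value there says the slopes differ between classes.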

5a) This is just the standard linear regression like we did at the beginning of the semester.

5c) One way to try to fit logistic regression is to apply the logit transformation to the data, and then use PROC REG. The following code does this for the data in example 11.4, and follows along to the bottom of page 541. NOTE: This will only work if we have repeated observations with the variable "number" never equal to "n" or to 0.


DATA tumors;
INPUT conc n number;
CARDS;
0.0	50	2
2.1	54	5
5.4	46	5
8.0	51	10
15.0	50	40
19.5	52	42
;

DATA tumors2;
SET tumors;
KEEP conc n number phat tumlogit tumwght;
phat = number/n;
tumlogit = log((number/n)/(1-(number/n)));
tumwght = n*(number/n)*(1-(number/n));
PROC PRINT;
RUN;

PROC REG DATA=tumors2;
MODEL tumlogit = conc;
WEIGHT tumwght;
OUTPUT OUT=tumors3 P=pred R=resid;
RUN;

DATA tumors4;
SET tumors3;
KEEP conc phat estprop residprp;
estprop = exp(pred)/(1+exp(pred));
residprp = phat - estprop;
PROC PRINT;
RUN;

The "DATA tumors2" section makes a new data set called "tumors2" which is all of the data in Table 11.11. The "PROC REG" section produces the results in Table 11.12. Finally, the "DATA tumors4" section makes a new data set called "tumors4" which is the data that would be in Figure 11.5. "estprop" would be the estimated values on the curve, "phat" are the dots, and "residprp" are the residuals.

It is probably better just to use PROC LOGISTIC. Using the data set tumors above, this could be done as follows.


PROC LOGISTIC DATA=tumors;
MODEL number/n = conc / LINK = LOGIT;
OUTPUT OUT=tumors5 P=pred;
RUN;

DATA tumors6;
SET tumors5;
KEEP conc phat estprop residprp;
phat = number/n;
estprop = pred;
residprp = phat - pred;
PROC PRINT;
RUN;

Note there is no need for any transformations or weights. The option LINK = LOGIT tells it we want to do logistic regression. Other options are NORMIT, which would give something called a PROBIT regression, and CLOGLOG. The data set "tumors6" is formatted the same as "tumors4". You could then use PROC INSIGHT on tumors6 to do a residuals plot of estprop vs. residprp.

Downloading the Data from the Web

The various data sets used in the textbook can be found on the web, so that you don't need to type them in. The web address for this is:

ftp://ftp.harcourtbrace.com/pub/academic_press/saved/textbook/freund.data/ .

For example fw01p05 would be the data for problem 5 of chapter 1.