Stat 515 - Spring 2000 - SAS Templates

Class Notes from 1/20/2000
Notes on Homework Three
Notes on Homework Nine
Notes on Homework Ten
Notes on Homework Eleven
Notes on Homework Twelve
Notes on Homework Fourteen & Fifteen
Notes on Homework Sixteen
Notes on Homework Eighteen
Downloading the Data from the Web


The Basics of SAS:

Hitting the [F3] key will run the program currently in the Program Editor window.

This will however erase whatever was written in the Program Editor window. To recall whatever was there, make sure you are in that window, and hit the [F4] key.

If you keep running more programs it will keep adding it all to the Output window. To clear the Output window, make sure you are in that window, and choose Clear text under the Edit menu.

If you happen to lose a window, try looking under the Globals menu.

The following is the SAS code for entering data about the starting salaries of a group of bank employees. The data consists of the beginning salaries of all 32 male and 61 female entry level clerical workers hired between 1969 and 1977 by a bank. The data is reported in the book The Statistical Sleuth by Ramsey and Schafer, and is originally from: H.V. Roberts, "Harris Trust and Savings Bank: An Analysis of Employee Compensation" (1979), Report 7946, Center for Mathematical Studies in Business and Economics, University of Chicago Graduate School of Business.

The data is formatted in two columns: the first is the starting salary, and the second is an id code, m for male and f for female.
 


OPTIONS pagesize=60 linesize=80;

DATA bankdata;
INPUT salary gender $ @@;
LABEL salary = "Starting Salary"
   gender = "m=male, f=female";
CARDS;
3900 f 4020 f 4290 f 4380 f 4380 f 4380 f
4380 f 4380 f 4440 f 4500 f 4500 f 4620 f
4800 f 4800 f 4800 f 4800 f 4800 f 4800 f
4800 f 4800 f 4800 f 4800 f 4980 f 5100 f
5100 f 5100 f 5100 f 5100 f 5100 f 5160 f
5220 f 5220 f 5280 f 5280 f 5280 f 5400 f
5400 f 5400 f 5400 f 5400 f 5400 f 5400 f
5400 f 5400 f 5400 f 5400 f 5400 f 5520 f
5520 f 5580 f 5640 f 5700 f 5700 f 5700 f
5700 f 5700 f 6000 f 6000 f 6120 f 6300 f
6300 f 4620 m 5040 m 5100 m 5100 m 5220 m
5400 m 5400 m 5400 m 5400 m 5400 m 5700 m
6000 m 6000 m 6000 m 6000 m 6000 m 6000 m
6000 m 6000 m 6000 m 6000 m 6000 m 6000 m
6000 m 6300 m 6600 m 6600 m 6600 m 6840 m
6900 m 6900 m 8100 m
;

Note that _most_ lines end with a semicolon, but not all. SAS will stop with an error if you miss one, but usually the log window will tell you where the problem is.

The OPTIONS line only needs to be used once during a session. It sets the length of the page and the length of the lines for viewing on the screen and printing. The font can be set by using the Options menu along the top of the screen. When you cut and paste from SAS to a word processor, the font Courier New works well.

The DATA line defines what the name of the data set is. The name must be eight characters or less, with no spaces, and only letters, numbers, and underscores. It must start with a letter. The INPUT line gives the names of the variables, and they must be in the order that the data will be entered. The $ after gender on the INPUT line means that the variable gender is qualitative instead of quantitative. The @@ at the end of the INPUT line means that the variables will be entered right after each other on the same line with no returns. (Instead of needing one row for each person.)
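
As a rough illustration of what the @@ saves us from, here is a sketch of the same kind of INPUT line without it, so that each person gets their own row. The data set name demo is made up for this example, and only three of the bank employees are typed in.

```sas
/* Sketch only: demo is a made-up data set name, holding three of the salaries */
DATA demo;
INPUT salary gender $;
CARDS;
3900 f
4620 m
5040 m
;

PROC PRINT DATA=demo;
RUN;
```

With all 93 employees this layout would take 93 data lines instead of 16, which is why the @@ version is used above.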

If we hit F3 at this point to enter what we put above, nothing new will appear on the output screen. This is no big surprise however once we realize that we haven't told SAS to return any output! The code below simply tells SAS to print back out the data we entered.


PROC PRINT DATA=bankdata;
TITLE "Gender Equity in Salaries";
RUN;

The only difficulty we have now is that it would be nice to look at the men and women separately, so we need to be able to split the data up based on what's in the second column. The following lines will make two separate data sets, male and female, and then print out the second one to make sure it is working right:


DATA male;
SET bankdata;
KEEP salary;
WHERE gender='m';
RUN;

DATA female;
SET bankdata;
KEEP salary;
WHERE gender='f';
RUN;

PROC PRINT DATA=female;
TITLE "Female Salaries";
RUN;

Whenever you have a DATA line, that means you are creating a new dataset with that name. The SET line tells it that we are making this new data set from an old one. The KEEP line says the only variables we want in this new data set are the ones on that line. The lines after that say any special commands that go into the making of the new data set. In this case the WHERE command is used to make sure we only keep one gender or the other. Later we will see examples of making datasets that involve using mathematical functions. In any case, it should be pretty straight-forward when you just stop and read through what the lines say.
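
To give a feel for what a data set built with a mathematical function might look like, here is a small sketch. It is not part of any assignment: the names logbank and logsal are invented for illustration, and it assumes the bankdata data set from above has already been entered.

```sas
/* Sketch only: logbank and logsal are made-up names for illustration */
DATA logbank;
SET bankdata;
logsal = LOG(salary);   /* LOG is the natural logarithm in SAS */
RUN;

PROC PRINT DATA=logbank;
TITLE "Salaries with Natural Logs";
RUN;
```

Since there is no KEEP line here, the new data set holds salary and gender along with the new variable logsal.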

The most basic procedure to give out some actual graphs and statistics is PROC UNIVARIATE:


PROC UNIVARIATE DATA=female PLOT FREQ ;
VAR salary;
TITLE 'Summary of the Female Salaries';
RUN;

The VAR line says which of the variables you want a summary of. Also note that the graphs here are pretty awful. The INSIGHT procedure will do most of the things that the UNIVARIATE procedure will, and a lot more. The one thing it won't do is be open to programming it to do new things. Later in the semester we'll see how some of the other procedures in SAS can be used to do things that aren't already programmed in.


PROC INSIGHT;
OPEN female;
DIST salary;
RUN;

You can cut and paste the graphs from PROC INSIGHT right into Microsoft Word. Simply click on the border of the box you want to copy with the right mouse button to select it. You can then cut and paste like normal. Clicking on the arrow in the bottom corner of each of the boxes gives you options for adjusting the graph's format. To quit PROC INSIGHT, click on the X in the upper right portion of the spreadsheet.

One thing that we can notice from the histogram and box plot is that the data does not look very symmetric, instead it looks slightly skewed to the left. We can add a curve over the histogram to make it easier to compare to a bell-shaped (or normal) curve. Under the Curves menu, choose Parametric Density.... Just hit ok on the box that pops up.

One of the problems with the histogram is that the way it looks can be affected a lot by how the width of the bars is selected, and where they start and end. The box with the arrow in it, at the lower left side of the histogram, lets you control that. Click on that box, and then select Ticks.... Change the 3800 to 3600, and the 6600 to 6400, and then click ok. Now try 3700 and 6700, and set the Tick Increment to 600.

Unlike the histogram, the Q-Q plot is not subject to options chosen by the user. Under the Graphs menu select QQ Plot... and then click ok in the window that comes up. Then under Curves select QQ Ref Line.... The idea of the Q-Q plot is that it plots the actual data along the y-axis against the values the data would have if they fell exactly at the percentiles of a normal curve along the x-axis. So if the data is approximately like that of a bell curve, the points should fall fairly close to a straight line. If not, they will stray from it. Notice that this looks very close to a straight line. Because the data is roughly bell-shaped, the empirical rule should hold, and it does fairly well for this data. Notice that the mean and median differ by less than 100. Also notice that 2*IQR = 1200 and 3*SD = 1619 (for a normal curve the IQR is about 1.35 standard deviations, so these two quantities should be roughly comparable).
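
If you would rather have SAS hand you just the numbers used in that comparison, something like the following should work. It is only a sketch: it assumes the female data set from above still exists, and that your version of SAS accepts the MEDIAN and QRANGE keywords in PROC MEANS (QRANGE is the interquartile range).

```sas
/* Sketch only: prints the pieces needed to compare 2*IQR with 3*SD */
PROC MEANS DATA=female N MEAN MEDIAN STD QRANGE;
VAR salary;
TITLE 'Checking the Empirical Rule for the Female Salaries';
RUN;
```

You would then double the QRANGE value and triple the STD value by hand to make the comparison.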

Let's change a few of the values though, so that the data appears less normal. Click on the spreadsheet, and change the 3900 to 8900, the 4020 to 8020, the 4290 to 8290, and the first 4380 to 8380. Now note that those four points are rather extreme, as indicated on the box plot. You can also see the change in the Q-Q plot. Now, the empirical rule does not work nearly as well. Notice that the mean is over 100 larger than the median (we only changed 4 out of 61 values, and not by _that_ much), which isn't too bad. However, notice that 2*IQR = 1560, but 3*SD = 2780!


Notes on Homework 3:

This assignment involves describing the data in three different ways, and then discussing what conclusions you can draw. The first way is to graphically represent the data when it is in the form of the three qualitative (or discrete) categories: being within 5 degrees, being 5 or more degrees too high, or being 5 or more degrees too low. The second way is to graphically represent the continuous quantitative data in the form of the actual degrees off. The third way is to numerically summarize the quantitative data. In all three cases you will be comparing two groups, one male and one female. You need to decide if you will be comparing group 1 to group 2, or if you will be comparing the group of all of the men to the group of all of the women. (And you need to say why you made the choice you did! Either one could be a good answer, but you need to justify it.)

The first thing you need to do is to put the data into SAS. The data can be found at the address below (so that you don't have to type it in yourself). Notice that there are four columns (just as described on page 87), and so your INPUT line will have to have a name for each of the four variables. Also note that you will have to label three of the four as strings using the $.

Then just like we separated the data-set bankdata into two data sets male and female, you will have to separate this data-set into two groups according to what you decided for the first part. For splitting based just on gender you will use WHERE gender=. To just use the students, you need to notice that they are just called Student whether they are M or F. In this case you would use (for example) WHERE gender="M" AND group="Student". In either case you will want to KEEP both the deviation value, AND the judgement category.

To get the graphs for the qualitative/discrete data, just run PROC INSIGHT, opening each group, with the DIST being set to whatever name you called JUDGE. One of the two graphs this will produce is a bar graph. The other one is called a mosaic plot. To get the graphs and statistics for the quantitative data, just go up under the Analyze menu, select Distribution (Y), click on the name that you gave to the deviation measurement, click on the Y box, and then click ok. Remember you need to do this once for the male group that you decided on, and once for the female group.


Notes on Homework 9:

PROC INSIGHT is capable of making confidence intervals for means. Once you enter the data (using the DATA, INPUT, and CARDS lines as shown above) just start up PROC INSIGHT using that data set name on the OPEN line, and the variable name on the DIST line.

Once you have gotten all of the histograms and the like from PROC INSIGHT, you can make the confidence interval by going to the Tables menu. Select C.I. for Mean, and the percentage you want. It will add the confidence interval to the bottom of the window. (You could also go into the spreadsheet and change individual data points to see how that affects things if you wanted.)


Notes on Homework 10:

We've seen (above) how PROC INSIGHT can be used to make both the confidence interval for the mean, and the Q-Q plot.

To determine the confidence interval for the variance, we have to write a simple program. PROC MEANS is used to calculate the sd and the sample size that are needed in the formula for the confidence interval. The OUTPUT line says that the data set called temp will hold the sd and n. Once we have that, we need to get the values from the Chi-square table and make the confidence intervals. The function CINV (chi-square inverse) looks up what value from the Chi-square table goes with the particular alpha. The 0.10 down near the bottom is the alpha (it is the value read in by the INPUT alpha line). (Recall that the percent for the confidence interval is 1-alpha.)

The following code constructs a 90% CI for the variance of the data in SOUND.DAT on page 243.


DATA sound;

INPUT spl @@;
CARDS;
73.0 80.1 82.8 76.8 73.5 74.3 76.0 68.1
;

PROC MEANS NOPRINT DATA=sound;
VAR spl;
OUTPUT OUT=temp STD=sd N=n;
RUN;
DATA temp2;
SET temp;
KEEP var n alpha cilow cihigh;
INPUT alpha;
var = sd*sd;
df = n - 1;
cilow = (n-1)*(var)/CINV(1-(alpha/2),df);
cihigh = (n-1)*(var)/CINV(alpha/2,df);
CARDS;
.10
;
PROC PRINT data=temp2;
RUN;


Notes on Homework 11:

PROC INSIGHT is capable of performing the t-test for a population mean. To test whether the mean of the population that sound (above) was sampled from is 75, you would use:


PROC INSIGHT;

OPEN sound;
DIST spl;
RUN;

You would then go to the Tables menu, and the Location Tests... sub-menu. Check the "Student's T Test" box, and on the parameter line put the population mean from the null hypothesis. Then just click "OK".

Note that this gives the test for the two-sided alternative hypothesis. If you want a one-sided test you need to figure out how to change it (a picture would help.)

We could also program SAS to give us the three possible p-values.


PROC MEANS NOPRINT DATA=sound;

VAR spl;
OUTPUT OUT=temp MEAN=xbar STD=sd N=n;
RUN;
DATA temp2;
SET temp;
KEEP xbar mu sd n t pgreater pless ptwoside;
INPUT mu;
t = (xbar-mu)/(sd/sqrt(n));
df = n - 1;
pgreater = 1 - probt(t,df);
pless = probt(t,df);
ptwoside = 2*MIN(1-probt(t,df),probt(t,df));
cards;
75
;
PROC PRINT;
RUN;


Notes on Homework 12:

There are three t-tests that come up when you have two sets of data. If the two sets are independent, the next question is whether or not you can assume the variances of the two populations are equal. In either case, PROC TTEST can provide the answer. The third situation is where the data is paired (like twins). This is a little trickier and involves taking the difference of each pair first.

Page 300: 7.8a

t-test
The following code would be used to test if the two means were equal for the data in Table 7.2 (page 320). For PROC TTEST we have to tell SAS which group each observation belongs to. The @@ tells SAS that we will be entering more than one observation per line. The $ after group tells SAS that the group name is a name, and not a number.


DATA readtest;

INPUT group $ score @@;
CARDS;
N 80 N 80 N 79 N 81
S 79 S 62 S 70 S 68
N 76 N 66 N 71 N 76
S 73 S 76 S 86 S 73
N 70 N 85
S 72 S 68 S 75 S 66
;
PROC TTEST DATA=readtest;
CLASS group;
VAR score;
RUN;

The output for PROC TTEST consists of three different pieces. The first column is the description of the two samples, including the mean, sample size, and sd. The second column gives the T, the DF and the p-value for the two hypothesis tests: the one where we assume the variances are equal, and the one where we do not. Notice that the DF in the row for unequal is a decimal! This is because SAS uses a complicated method of estimating what DF to use. Recall that the unequal method is not totally accurate unless the sample size is large. In both cases this is the p-value for the two-sided hypothesis test. If you want to test either greater than or less than, you need to do a bit more work.

Looking at the output, the top row is the first group, and the second row is the second group, and so the difference = mean1 - mean2. So for the example above, the T goes with testing the difference = mean of group N - mean of group S. In the problem you are doing, it wants you to test the alternative hypothesis that mean2 > mean1. Just subtract mean2 from both sides, and then switch it around to see that this is the same as testing the alternative hypothesis that: diff = mean1 - mean2 < 0.

Now to use the two sided hypothesis p-value for a one sided test, you would do the following: If the hypothesis is one-sided, then you need to look at the sign of the T. If you are testing that the difference is > 0, then: if T is positive you need to cut the p-value in half, and if it is negative you need to take one minus p-value/two. On the other hand, if you are testing that the difference is < 0, then: if T is positive you need to take one minus p-value/two, and if it is negative you need to take p-value/two.
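
The rules in the paragraph above can also be written out as a short DATA step. This is only a sketch: the data set name onesided is made up, and the two numbers on the data line (a T of -1.50 and a two-sided p-value of 0.15) are placeholders, not output from the readtest data. You would type in the T and the two-sided p-value from your own PROC TTEST output instead.

```sas
/* Sketch only: t and p2 are placeholders to be replaced with your own output */
DATA onesided;
INPUT t p2;
IF t > 0 THEN DO;
   pgreater = p2/2;       /* alternative: difference > 0 */
   pless = 1 - p2/2;      /* alternative: difference < 0 */
END;
ELSE DO;
   pgreater = 1 - p2/2;
   pless = p2/2;
END;
CARDS;
-1.50 0.15
;

PROC PRINT DATA=onesided;
RUN;
```

Here pgreater is the p-value for testing that the difference is > 0, and pless is the p-value for testing that the difference is < 0, matching the names used in the Homework 11 program.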

The final thing the output gives is in the last row. It is the test of the null hypothesis that the variances are equal. So a small p-value means we reject that they are equal, and should use the unequal method. For this sample data the p-value is 0.4891, so we would fail to reject the null hypothesis, and use the equal row.

But what about testing that the difference is some other value?
PROC TTEST tests the null hypothesis that the two populations have the same mean. This problem though wants us to test that the difference (x2bar-x1bar)=10. If we look at the numerator of the equation on page 320, (x2bar-x1bar) - 10 is the same as x2bar - (x1bar+10). Because of this, all we need to do is to change the data by adding 10 to each value in the first sample. To do this to the readtest data we would:


DATA readtst2;

SET readtest;
KEEP group score;
IF group='N' THEN score=score+10;
RUN;
PROC TTEST DATA=readtst2;
CLASS group;
VAR score;
RUN;

CI and s2p
Unfortunately, while PROC TTEST has to calculate everything that you would need to make the confidence interval or find s2p, it doesn't actually let you use those values. It turns out that another procedure called an ANOVA (which is taught in STAT 516) actually calculates the same thing! (Remember that for a 90% confidence interval, alpha is 0.10 and alpha/2 = 0.05.)



PROC ANOVA DATA=readtest;

CLASS group;
MODEL score=group;
MEANS group / LSD CLDIFF ALPHA=0.10;
RUN;

The output for the above PROC ANOVA calculates the confidence interval for the case where the variances are equal and also calculates the pooled variance. All of this can be found on the last of the three pages of output this procedure generates. The pooled variance is the value labeled MSE (33.75231 for this example). The confidence intervals for both first group minus second group and second group minus first group are then given.

Checking Assumptions
In checking the assumption that the data is normal, we need for BOTH samples to be normal. We can still use PROC INSIGHT for this.


PROC INSIGHT;

OPEN readtest;
DIST score;
RUN;

You can now go up to the Graphs and Curves menus to add the Q-Q plot to the output. Unfortunately this is the Q-Q plot for all the data together. On the spreadsheet, notice that each row has a little box in the far left. Click on that box with the left button and the menu that appears has three choices: label in plots, show in graphs, and include in calculations. If we wanted to make the Q-Q plot just for the N group, then you would uncheck the "Show in Graphs" and "Include in Calculations" options for every single one of the group S observations. To get the Q-Q plot for group S, you would recheck all the group S observations, and then uncheck the group N ones.

Needless to say, this would be rather tedious! We also could have made two separate data sets, and run PROC INSIGHT on each of them separately. This means taking the data set readtest and making two separate data sets called groupN and groupS. The code for doing this kind of split is at the top of the page (we did it in the computer lab the first time on the dataset bankdata and again in Homework 3).
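
Following the same pattern as the male and female data sets at the top of the page, the split might look like the sketch below. It assumes readtest has already been entered as shown above, with the group codes typed as capital N and S.

```sas
/* Sketch only: same KEEP and WHERE pattern used on bankdata */
DATA groupN;
SET readtest;
KEEP score;
WHERE group='N';
RUN;

DATA groupS;
SET readtest;
KEEP score;
WHERE group='S';
RUN;

PROC INSIGHT;
OPEN groupN;
DIST score;
RUN;
```

You would then add the Q-Q plot as before, and repeat the PROC INSIGHT step with OPEN groupS for the other sample.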

Page 317: 7.28 a,b

To perform the t-test on paired data, it is necessary to first enter the data, and then to combine the two separate variables into one. The following code does this for the data in Table 7.3 (page 329).


DATA pairread;

INPUT new stand;
CARDS;
77 72
74 68
82 76
73 68
87 84
69 68
66 61
80 76
;

DATA pair2;
SET pairread;
KEEP diff;
diff = new - stand;
RUN;

All you need to do now is do the usual t-test (like on homework 11) using the dataset pair2, the variable diff, and the correct null hypothesis. (Remember to make sure you're using the right p-value.) To make the Q-Q plot, just use PROC INSIGHT on the dataset pair2 and variable diff.
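
For a quick check when the null hypothesis is that the mean difference is zero, PROC MEANS can print the t statistic and its two-sided p-value directly. This is only a sketch of a shortcut, not the method from homework 11: the T and PRT keywords always test against zero, so for any other null value you would still need the homework 11 program with pair2 and diff in place of sound and spl.

```sas
/* Sketch only: T and PRT test the null hypothesis that the mean of diff is 0 */
PROC MEANS DATA=pair2 N MEAN STD T PRT;
VAR diff;
RUN;
```

The PRT column is the two-sided p-value, so the same halving rules described for PROC TTEST apply if you want a one-sided test.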


Notes on Homework 14 & 15:

PROC INSIGHT will conduct the regression analysis for you, and output most of the results. The following code would be used to analyze the data in Table 9.1 on page 428.


DATA react;

INPUT amt time;
CARDS;
1 1
2 1
3 2
4 2
5 4
;
PROC INSIGHT;
OPEN react;
FIT time = amt;
RUN;

On the fit line, put the y variable on the left hand side of the equals sign, and the x variable on the right hand side. So in this example, we are predicting time from amount. If you start PROC INSIGHT without the fit line, you can still get to regression by choosing Fit(YX) under the Analyze menu. In the window that opens up, click on the name of the x variable and then on the x button, and on the name of the y variable and then on the y button. Since problem 9.31 wants you to do two regressions, you will have to enter the data with three columns (speed, A, B) and you will have two different FIT lines in the code.

Near the top of the output window is the regression equation, in this case the intercept is -0.100 and the slope is 0.700. Below that is the scatterplot of the points with the regression line drawn in.

Looking through the output, the box labeled "Summary of Fit" gives you the mean of the y's (called mean of response) and the square root of the MSE (that is, s, the estimate of sigma). In the box labeled Analysis of Variance you can find the SSE in the column labeled "Sum of Squares" and the row labeled "Error". The p-value in the ANOVA table is for testing the null hypothesis that the slope is zero. The MSE or s-squared is in the column "Mean Squares" right next to the SSE.
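
As a cross-check on what PROC INSIGHT reports, the same regression can be run with PROC REG. That is a different procedure than the one these notes rely on, so treat this as a sketch; it assumes the react data set from above has been entered.

```sas
/* Sketch only: PROC REG fits the same line, time = intercept + slope*amt */
PROC REG DATA=react;
MODEL time = amt;
RUN;
QUIT;
```

The numbers can also be checked by hand. With intercept -0.100 and slope 0.700, the fitted values at amt = 1, 2, 3, 4, 5 are 0.6, 1.3, 2.0, 2.7 and 3.4, so the residuals are 0.4, -0.3, 0.0, -0.7 and 0.6. Squaring and adding these gives SSE = 1.10, and dividing by n - 2 = 3 gives the MSE.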

At the very bottom is the residual plot of the predicted values P_... against the residuals R_.... In this plot you are looking to see if the points make a fan, a > or < shape, or hour-glass shape; if so, it means that the variance of the errors is not constant. You also want to check the residual plot to see if it looks like a curve fits the points better than the horizontal line drawn there does; if so, it means that the mean of the errors isn't zero for all of the values. In the example here, with only five points, it is nearly impossible to tell in either case. To check if the errors are normal, you need to add the Q-Q plot: go under the Graphs menu and select Residual Normal QQ. As usual the Q-Q plot should look like a straight line if normality holds.


Notes on Homework 16:

PROC INSIGHT will do several of the procedures discussed in sections 9.5 and 9.8.

To construct a confidence interval for the true slope, select C.I. (Wald) for Parameters under the Tables menu.

To construct a plot like Figure 9.24 on page 470, you would choose the option Confidence Curves in the Curves menu. One choice will add the confidence intervals for the mean values, and the other will add the prediction intervals.

To get the confidence and prediction intervals for a particular x value, however, we need to use PROC GLM. The following code will produce the output for the data in Figures 9.19 and 9.20, as well as the 95% prediction (CLI) and confidence (CLM) intervals for x=2.5. (The dot in the y column means there is no y value to go with that x value.)


DATA example;

INPUT x y;
CARDS;
1 1
2 1
3 2
4 2
5 4
2.5 .
;
PROC GLM DATA=example;
MODEL y = x / ALPHA=0.05 CLI;
RUN;

PROC GLM DATA=example;
MODEL y = x / ALPHA=0.05 CLM;
RUN;

Note that 95% = 1 - alpha, so that alpha=0.05.
(PROC REG also has a CLI and CLM option, but inexplicably it only lets you make 95% confidence intervals.)


Notes on Homework 18:

The following code will work example 8.5 on page 405.


DATA ex8p4;

INPUT rel $ marit $ count;
CARDS;
A D 39
B D 19
C D 12
D D 28
None D 18
A Never 172
B Never 61
C Never 44
D Never 70
None Never 37
;

PROC FREQ DATA=ex8p4;
WEIGHT count;
TABLES rel*marit / chisq expected nopercent;
RUN;


Downloading the Data from the Web

The various data sets used in the text book can be found on the web, so that you don't need to type them in. The web address for this directory is: www.stat.sc.edu/~habing/courses/MS7thEd/ . The key to the various names in this directory can be found on pages vii - ix in the beginning of the text, and in Appendix B on pages 524 - 529.

Note that if you are using Internet Explorer, it may work better to use "select all" under the edit menu, instead of highlighting the text manually. Also note that you don't want to leave the little box at the end of the data set in when you try to run SAS.