Stat 515 - Fall 2000 - SAS Templates

Class Notes from 9/1/2000
Homework 3 Notes
Homework 6 Notes
Homework 7 Notes
Homework 8 Notes
Homework 10 Notes
Homework 11 Notes
Homework 12 Notes
Homework 13 Notes
Homework 14 Notes
Homework 17 Notes
Homework 18 Notes
Downloading the Data from the Web


The Basics of SAS:

Hitting the [F3] key will run the program currently in the Program Editor window.

This will however erase whatever was written in the Program Editor window. To recall whatever was there, make sure you are in that window, and hit the [F4] key.

If you keep running more programs it will keep adding it all to the Output window. To clear the Output window, make sure you are in that window, and choose Clear All under the Edit menu.

The following is the SAS code for entering data about the starting salaries of a group of bank employees. The data consists of the beginning salaries of all 32 male and 61 female entry level clerical workers hired between 1969 and 1977 by a bank. The data is reported in the book The Statistical Sleuth by Ramsey and Schafer, and is originally from: H.V. Roberts, "Harris Trust and Savings Bank: An Analysis of Employee Compensation" (1979), Report 7946, Center for Mathematical Studies in Business and Economics, University of Chicago Graduate School of Business.

The data is formatted in two columns: the first is the starting salary, and the second is an id code, m for male and f for female.
 


OPTIONS pagesize=60 linesize=80;

DATA bankdata;
INPUT salary gender $ @@;
LABEL salary = "Starting Salary"
   gender = "m=male, f=female";
CARDS;
3900 f 4020 f 4290 f 4380 f 4380 f 4380 f
4380 f 4380 f 4440 f 4500 f 4500 f 4620 f
4800 f 4800 f 4800 f 4800 f 4800 f 4800 f
4800 f 4800 f 4800 f 4800 f 4980 f 5100 f
5100 f 5100 f 5100 f 5100 f 5100 f 5160 f
5220 f 5220 f 5280 f 5280 f 5280 f 5400 f
5400 f 5400 f 5400 f 5400 f 5400 f 5400 f
5400 f 5400 f 5400 f 5400 f 5400 f 5520 f
5520 f 5580 f 5640 f 5700 f 5700 f 5700 f
5700 f 5700 f 6000 f 6000 f 6120 f 6300 f
6300 f 4620 m 5040 m 5100 m 5100 m 5220 m
5400 m 5400 m 5400 m 5400 m 5400 m 5700 m
6000 m 6000 m 6000 m 6000 m 6000 m 6000 m
6000 m 6000 m 6000 m 6000 m 6000 m 6000 m
6000 m 6300 m 6600 m 6600 m 6600 m 6840 m
6900 m 6900 m 8100 m
;
Note that _most_ lines end with a semi-colon, but not all. SAS will stop with an error if you miss one, but the Log window will usually tell you where the problem is.

The OPTIONS line only needs to be used once during a session. It sets the length of the page and the length of the lines for viewing on the screen and printing. The font can be set by using the Options menu along the top of the screen. When you cut and paste from SAS to a word processor, the font Courier New works well.

The DATA line defines what the name of the data set is. The name must be eight characters or fewer, with no spaces, and may contain only letters, numbers, and underscores. It must start with a letter. The INPUT line gives the names of the variables, and they must be in the order that the data will be entered. The $ after gender on the INPUT line means that the variable gender is qualitative (made up of characters) instead of quantitative. The @@ at the end of the INPUT line means that the variables will be entered right after each other on the same line with no returns, instead of needing one row for each person.
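For comparison, here is a minimal sketch of what the entry would look like without the @@, where each person gets a separate row. The data set name bankmini is made up for this illustration, and only three of the values from the bank data are used.

```sas
* Sketch only: bankmini is a hypothetical name, not part of the class code;
DATA bankmini;
INPUT salary gender $;
CARDS;
3900 f
4020 f
4620 m
;
RUN;
```

With 93 people this layout would take 93 rows, which is why the @@ form is used for the bank data.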

If we hit F3 at this point to enter what we put above, nothing new will appear on the output screen. This is no big surprise however once we realize that we haven't told SAS to return any output! The code below simply tells SAS to print back out the data we entered.


PROC PRINT DATA=bankdata;
TITLE "Gender Equity in Salaries";
RUN;

The only difficulty we have now is that it would be nice to look at both the men and women separately, so we need to be able to split the data up based on what's in the second column. The following lines will make two separate data sets male and female, and then print out the second one to make sure it is working right:


DATA male;
SET bankdata;
KEEP salary;
WHERE gender='m';
RUN;

DATA female;
SET bankdata;
KEEP salary;
WHERE gender='f';
RUN;

PROC PRINT DATA=female;
TITLE "Female Salaries";
RUN;

Whenever you have a DATA line, that means you are creating a new data set with that name. The SET line tells SAS that we are making this new data set from an old one. The KEEP line says the only variables we want in this new data set are the ones on that line. The lines after that give any special commands that go into the making of the new data set. In this case the WHERE command is used to make sure we only keep one gender or the other. Later we will see examples of making data sets that involve using mathematical functions. In any case, it should be pretty straightforward when you just stop and read through what the lines say.
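As a small preview of using a mathematical function in a DATA step, the sketch below adds the natural log of the salary to a copy of the female data set. The names femlog and logsal are made up for this illustration. It assumes the female data set created above already exists.

```sas
* Sketch only: femlog and logsal are hypothetical names;
DATA femlog;
SET female;
logsal = LOG(salary);   * natural log of the starting salary;
RUN;

PROC PRINT DATA=femlog;
TITLE "Female Salaries with Log Salary";
RUN;
```

There is no KEEP line here, so the new data set contains both salary and logsal.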

The most basic procedure to give out some actual graphs and statistics is PROC UNIVARIATE:


PROC UNIVARIATE DATA=female PLOT FREQ ;
VAR salary;
TITLE 'Summary of the Female Salaries';
RUN;

The VAR line says which of the variables you want a summary of. Also note that the graphs here are pretty awful. The INSIGHT procedure will do most of the things that the UNIVARIATE procedure will, and a lot more. INSIGHT however cannot be programmed to perform new tasks that are not already built in. Later in the semester we'll see how some of the other procedures in SAS can be used to do things that aren't already programmed in.


PROC INSIGHT;
OPEN female;
DIST salary;
RUN;

Another way to open PROC INSIGHT is to go to the Solutions menu, then to the Analysis menu, and then finally to the Interactive Data Analysis option. Once there you will need to go to the WORK library, and choose the FEMALE data set. If you go this route instead, you will also need to make a selection to get the information about the distribution of female salaries. Go to the Analyze menu, and choose Distribution(Y). Select salary, click the Y button, and then click OK.

Once PROC INSIGHT opens, you can cut and paste the graphs from PROC INSIGHT right into Microsoft Word. Simply click on the border of the box you want to copy with the right mouse button to select it. You can then cut and paste like normal. Clicking on the arrow in the bottom corner of each of the boxes gives you options for adjusting the graph's format. To quit PROC INSIGHT, click on the X in the upper right portion of the spreadsheet.

In addition to the graphs, there is a box labeled Moments that contains the mean, variance, and standard deviation, as well as a host of other values. The box below that is labeled Quantiles. These are the percentiles. For example Q3 is the 75th percentile, the value at which 75% of the data points are smaller and 25% are larger. The median is the 50th percentile. While the mean and standard deviation are useful when the data is normal or bell-shaped or mound-shaped, the percentiles are useful in other cases (you just have to use more than two numbers then!).

One thing that we can notice from the histogram and box plot is that the data does not look very symmetric in this case, instead it looks slightly skewed to the left. We can add a curve over the histogram to make it easier to compare to a bell-shaped (or normal) curve. Under the Curves menu, choose Parametric Density.... Just hit ok on the box that pops up.

One of the problems with the histogram, is that the way it looks can be affected a lot by how the width of the bars is selected, and where the bars start and end. The box with the arrow in it, at the lower left side of the histogram lets you control that. Click on that box, and then select Ticks.... Change the 3800 to 3600, and the 6600 to 6400, and then click ok. Now try 3700 and 6700, and set the Tick Increment to 600.

The box plot (or box-and-whisker plot) at the top of the graphics window isn't affected by any choice of class intervals. Instead it is based on the percentiles. The main box has three lines: the 25th percentile, the median, and the 75th percentile. The length of the box (75th percentile minus the 25th percentile) is called the IQR or interquartile range. It can be found in the Quantiles window, where it is called Q3-Q1. It is a measure of variability like the standard deviation. The length of the whiskers extending from the box can be up to 1.5 IQRs long. Any points beyond that are given by dots and are possible outliers. This data set doesn't have any.
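If you want to see the numbers behind the box plot, the following sketch asks PROC UNIVARIATE to save the quartiles and the IQR, and then works out how far the whiskers are allowed to reach. The names quart, fences, lowfence, and upfence are made up for this illustration. It assumes the female data set from above, and the quartiles it reports could differ slightly from the ones PROC INSIGHT shows if the two use different percentile definitions.

```sas
* Sketch only: quart, fences, lowfence and upfence are hypothetical names;
PROC UNIVARIATE DATA=female NOPRINT;
VAR salary;
OUTPUT OUT=quart Q1=q1 MEDIAN=med Q3=q3 QRANGE=iqr;
RUN;

DATA fences;
SET quart;
lowfence = q1 - 1.5*iqr;   * whiskers cannot reach below this value;
upfence = q3 + 1.5*iqr;    * whiskers cannot reach above this value;
RUN;

PROC PRINT DATA=fences;
TITLE "Quartiles and Whisker Limits for the Female Salaries";
RUN;
```

Any salary smaller than lowfence or larger than upfence would be drawn as a dot instead of being covered by a whisker.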

Let's change some of the values though, so that the data appears less normal. Click on the spreadsheet, and change the 3900 to 8900, the 4020 to 8020, the 4290 to 8290, and the first 4380 to 8380. Note the changes this causes to the various graphs and statistics.

When finished running SAS, remember to close the program and logout.


Homework 3 Notes

The data for this assignment is contained both on the disk that is included with your text and on the web. The link to the data on the web is at the bottom of this page. The file you want is SLI.DAT, just as in the book. You can simply cut and paste the data into the program editor window in SAS, but it may be more convenient to put it into Microsoft Word first (so that you will have all of the code saved for later).

Similar to what we did in class, you will need a DATA line, and need to select a name of eight characters or fewer for the data set. For the INPUT line, there are five columns in the data set, and so you will need five names on that line. The second and third need to be followed by dollar signs as they are words, and not numbers. The LABEL line is optional, so you can then skip right to the CARDS; line. The data comes next, and then the semi-colon on the line after the last line of data. (If the little box marking the end of the data set on the web is there, delete it.)

Hitting [F3] now would enter the data, but return no output. You could use the PROC PRINT lines to make sure it was entered correctly, but don't need to. In order to calculate the statistics for each group you need to create three new data sets, one for each of the three groups. (Note: Do NOT put $ signs on the KEEP line.) You do this just as we did in class for the bank data, except that you need to make sure you are using the correct names. Once the three data sets have been created, you can then start PROC INSIGHT individually for each one, just as we did in class.

There is a somewhat easier way to do this however, a way that doesn't require you to split up the data set at all. Start up PROC INSIGHT using the original, unsplit data set. First you choose Analyze and Distribution (Y), and select the DIQ variable as the Y variable. Before you hit Ok, select the group variable (YND,SLI,OND), and then hit the Group button. The output window will now have the three different variables in separate graphs side by side. (Use the right/left scroll bar to see them.)

To print out the graphs and boxes you need from PROC INSIGHT, you can simply click on the edge of the box with the left mouse button to select it. It can then be cut and pasted into MS Word. The fonts are sometimes distorted when this is done. To fix this, in MS Word, click twice on the image with the left mouse button and then choose Close and Return to Document in the File menu. Make sure that the graphs and statistics are labeled with which group they correspond to.


Homework 6 Notes

The first step is to enter the data into SAS as we did in the lab in class, and as in homework 3. The data only has one variable in it (only one name goes on the input line), so you don't need to worry about separating out the various variables using KEEP or WHERE.

The stem and leaf plot for part a can be gotten using PROC UNIVARIATE with the PLOT option. The answers for parts b-d can be gotten by starting up PROC INSIGHT with the name of the dataset on the OPEN line, and the name of the variable on the DIST line. The standard deviation and the upper and lower quartiles are already part of the output. To add the Q-Q plot (a.k.a. the normal probability plot) to the output, go under the Graphs menu and choose QQ Plot... and hit OK in the box that pops up. To add the line to the plot, go to Curves and choose QQ Ref Line and hit OK in the box that pops up.

When you are cutting and pasting the Q-Q plot into Microsoft Word, one of the difficulties is that it will not print out correctly right away. Once it has been pasted into MS Word, click on the image twice with the left mouse button. This opens it into a separate window where you could edit it if you wanted. Simply go under File and select Close and Return to Document. This will return you to the rest of the document (where you should have put the code you ran in the program editor window, the stem-and-leaf plot, and the standard deviation and quartiles) with the graph formatted to print out correctly.


Homework 7 Notes

SAS has built in functions that can calculate the values you find in the tables. Each distribution has one function that solves P(X < x0)=?, and is called the probability function. Each distribution also has another function that solves P(X < ?)=p, and is called the quantile function.

Distribution       Quantile             Probability
Standard Normal    PROBIT(pct)          PROBNORM(val)
chi2               CINV(pct,df)         PROBCHI(val,df)
t                  TINV(pct,df)         PROBT(val,df)
F                  FINV(pct,dfx,dfy)    PROBF(val,dfx,dfy)

In each case, the pct is the probability (percent of the area) that is less than the val, and df are the degrees of freedom. Notice that this is the opposite of what the chi2, t, and F tables in the book report; they each give the probability greater than the value.

Say the sample size was 10 and we were asked to find: P(chi2 > 4.16816), P(2 < chi2 < 4), and the x0 such that P(chi2 > x0) = 0.005. The code to give these three answers would be as follows.

DATA answers;

a1 = 1 - PROBCHI(4.16816,9);
a2 = PROBCHI(4,9) - PROBCHI(2,9);
a3 = CINV((1-0.005),9);
RUN;
PROC PRINT DATA=answers;
RUN;
You can compare the results of this code to what you would get using Table XI in the text. (Drawing the pictures should help.) While the table can give us the answers for the first and third questions, for the second the best it can do is to say that P(chi2 < 4) is between .050 and .100 and that P(chi2 < 2) is between .005 and .010, so that the final answer is somewhere between .040 and .095.
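The other functions in the table work the same way. The sketch below shows one example each for the standard normal, t, and F distributions, converting to "less than" probabilities wherever the question is asked as "greater than". The data set name more, the variable names b1 through b4, and the particular values and degrees of freedom are all made up for this illustration.

```sas
* Sketch only: more and b1-b4 are hypothetical names;
DATA more;
b1 = 1 - PROBNORM(1.96);     * P(Z > 1.96);
b2 = PROBIT(1-0.025);        * z0 with P(Z > z0) = 0.025;
b3 = 1 - PROBT(2.262,9);     * P(t > 2.262) with 9 df;
b4 = FINV(1-0.05,3,10);      * F0 with P(F > F0) = 0.05, 3 and 10 df;
RUN;

PROC PRINT DATA=more;
RUN;
```

As a check, b1 and b3 should both come out close to 0.025, and b2 should come out close to 1.96.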


Homework 8 Notes

We already saw how to form the Q-Q plot in Homework 6. In SAS version 8, PROC INSIGHT will also automatically construct the confidence intervals for not only the mean, but also for the variance. Simply select Basic Confidence Intervals under the Tables menu. In version 6 it will only calculate the CI for the mean. Simply choose that option in the Tables menu.
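If you would like to check the intervals against the formulas, the following sketch computes 95% confidence intervals for the mean (using t) and for the variance (using chi2) from the summary statistics. It uses the female data set from the class notes only as a stand-in. The names stats, ci, mlow, mhigh, vlow, and vhigh are made up for this illustration, and you would substitute your own data set and variable names.

```sas
* Sketch only: stats, ci and the new variable names are hypothetical;
PROC MEANS NOPRINT DATA=female;
VAR salary;
OUTPUT OUT=stats MEAN=xbar STD=sd N=n;
RUN;

DATA ci;
SET stats;
KEEP xbar sd n mlow mhigh vlow vhigh;
alpha = 0.05;
df = n - 1;
* t based interval for the mean;
mlow = xbar - TINV(1-alpha/2,df)*sd/SQRT(n);
mhigh = xbar + TINV(1-alpha/2,df)*sd/SQRT(n);
* chi2 based interval for the variance;
vlow = df*sd**2/CINV(1-alpha/2,df);
vhigh = df*sd**2/CINV(alpha/2,df);
RUN;

PROC PRINT DATA=ci;
RUN;
```

Notice that the larger chi2 quantile goes with the lower limit for the variance, because it is in the denominator.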


Homework 10 Notes

PROC INSIGHT will automatically give the p-value for the two-sided hypothesis test. Simply select Tests for Location... under the Tables menu. In the box that comes up, put in the correct value for the mean. The output that comes up (in version 8) includes three different tests; we only want the Student's t result. The statistic should be the value of t, and the p-value is for testing "not equals to". If you are testing a one-sided alternate hypothesis then you will need to draw the picture to see if this two-sided p-value needs to be cut in half, or if you need to take 1 minus half of the value, to get the correct one-sided p-value.

Another way to get the one-sided p-values is to use the following code. (After entering the data, of course). The code below would analyze the data in Example 6.4 on page 283, if you entered the data and called the data set hosppat and the variable stay. It returns all three p-values (one for each possible alternate hypothesis) and you have to choose the correct one. The only portions you need to change are the name of the data set, the name of the variable, and the value after the cards line. The value after the cards line should be the value for the null hypothesis.

If you read through the code, you should be able to make out several of the formulas.

PROC MEANS NOPRINT DATA=hosppat;

VAR stay;
OUTPUT OUT=temp MEAN=xbar STD=sd N=n;
RUN;
DATA temp2;
SET temp;
KEEP xbar mu sd n t pgreater pless ptwoside;
INPUT mu;
t = (xbar-mu)/(sd/sqrt(n));
df = n - 1;
pgreater = 1 - probt(t,df);
pless = probt(t,df);
ptwoside = 2*MIN(pless,pgreater);
cards;
5
;
PROC PRINT;
RUN;


Homework 11 Notes

The following code would analyze the data in Example 7.4 on page 320.

DATA learner;

INPUT group $ value @@;
CARDS;
new 80 new 80 new 79 new 81
stand 79 stand 62 stand 70 stand 68
new 76 new 66 new 71 new 76
stand 73 stand 76 stand 86 stand 73
new 70 new 85
stand 72 stand 68 stand 75 stand 66
;


PROC TTEST DATA=learner;
CLASS group;
VAR value;
RUN;

PROC TTEST changed quite a bit between SAS version 6 and version 8, and version 8 should probably be used. (All of the computers in 303A now have that version working on them.)

The order you enter the groups determines which hypothesis is being tested. Because the new group was entered first, the procedure will look at the difference new-stand. By default the procedure tests the hypothesis that this difference is equal to zero. To test that the difference is equal to some other value, say 5, you would add H0=5 to the first line between learner and the semi-colon.
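For example, the null hypothesis that the difference new-stand is equal to 5 could be tested with the sketch below. It assumes the learner data set entered above and SAS version 8.

```sas
* Sketch: same analysis as above, but testing a difference of 5;
PROC TTEST DATA=learner H0=5;
CLASS group;
VAR value;
RUN;
```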

The first three rows of the output contain the means, confidence intervals for the means, standard deviations, and confidence intervals for the standard deviations for the two groups individually, and for the difference of the two groups.

The next two rows of the output are for the two t-tests. The Pooled line is the case where the variances are equal, and the Satterthwaite line is for unequal variances. In the unequal variance case, SAS uses a complicated formula to estimate the degrees of freedom. Notice that they are not an integer! Since we only use the unequal case for large sample sizes (where the z and t are nearly identical) this shouldn't be a problem.

The final line is a test of whether the variances of the two groups are equal. The Pr>F column is the p-value for this test, so if it is a small value (less than alpha) you reject the null hypothesis that the variances are equal.

In checking the assumption that the data is normal, we need for BOTH samples to be normal. We can still use PROC INSIGHT for this.

PROC INSIGHT;

OPEN learner;
RUN;
When you choose Distribution (Y) under the Analyze menu, choose value for Y and choose group for group. This will put the information for both groups in the same window (use the scrollbar at the bottom to switch between them). If you add the q-q plot and q-q line, it will add them for both groups. This is also the place you can get the values for the means and variances to do the test by hand.


Homework 12 Notes

The following code will conduct the paired t-test for the data in Table 7.4 on page 330. Remember that the p-value is for the two-sided alternate hypothesis and would need to be adjusted for testing either > or <. (Note that computer output is given for part c already, so you may do part b by hand instead of using SAS if you wish.)

DATA learner;

INPUT new standard;
CARDS;
77 72
74 68
82 76
73 68
87 84
69 68
66 61
80 76
;

PROC TTEST DATA=learner;
PAIRED new:standard;
RUN;
You could also use PROC INSIGHT. After starting PROC INSIGHT, choose Variables under the Edit menu. Select the Other... option. We can now make a new variable that is equal to new-standard. Click on Y-X in the transformation box, then select new for the Y value, standard for the X value, and click on OK. You can now choose Distribution (Y) under the Analyze menu and simply do a one sample t-test on the new variable you created.
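The same idea can be sketched in code: make the difference in a DATA step and then ask PROC MEANS for the one sample t statistic and its two-sided p-value. The names learndif and diff are made up for this illustration. It assumes the learner data set with the variables new and standard entered above.

```sas
* Sketch only: learndif and diff are hypothetical names;
DATA learndif;
SET learner;
diff = new - standard;   * paired difference for each row;
RUN;

PROC MEANS DATA=learndif N MEAN STD T PRT;
VAR diff;
RUN;
```

The T and PRT columns test that the mean difference is zero, and should agree with the PAIRED results from PROC TTEST.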

Homework 13 Notes

The data for this problem is on the disk that comes with the book and also on the web-site (see the link at the bottom of the page). Remember that a $ must go after the name of the second variable on the input line, because the second variable is the letter standing for an emotion and not a number. Once the data is entered, all of the needed results can be gotten from PROC INSIGHT. Under the Analyze menu choose Fit(YX). Select the observed value for the Y variable and the group for the X variable, and then hit OK. The q-q plot can be added by choosing the Residual Normal QQ option under the Graphs menu.

Notice that choosing these options adds three columns to the spreadsheet. The column beginning with R_ is called the residual. The residual is the estimated error from the model equation. The column beginning with P_ is called the predicted value. It is the estimated mean for that group, so muA is 0.8533 for example. Note that the difference between this value and the actual observation is equal to the residual. The column beginning with RN_ is used to make the q-q plot.

The graph on the bottom left is a plot of the residual values versus the predicted values. It thus has six columns (one for each group as they have different predicted values) and gives the estimated errors for each of those groups. This is the plot you can use to check if the errors in each group seem to have mean zero (is zero close to the middle of the values in each column?) and see if the groups seem to have the same variance (is each column as spread out as the others?). The graph on the bottom right is the q-q plot for the residuals (the estimated errors) and you can use it to check if the errors come from a distribution that is approximately normal.


Homework 14 Notes

The simple linear regression can be run in SAS in exactly the same way as the one-way ANOVA in Homework 13! In this case you would NOT put a dollar sign after either the risk or the credit rating though, just after the country name. For the residual vs. predicted plot (on the bottom left) the data will not appear in columns. You can still look at each range of x-values individually to see if it looks like the mean is near zero and that the variances are approximately the same.


Homework 17 Notes

The following code will analyze the data in example 8.4 on page 396.
DATA ex8p4; 

INPUT opinion $ count;
CARDS;
legal 39
decrim 99
exist 336
noopin 26
;
PROC FREQ DATA=ex8p4 ORDER=data;
TABLES opinion / TESTP=(.07,.18,.65,.10);
WEIGHT count;
RUN;
Instead of using TESTP (test proportion), you also could use TESTF (test frequency). In this case you would put the expected values in instead of the proportions. One further complication with PROC FREQ in SAS is that it doesn't handle observed values of zero well. If there is a cell that was 0, use the value 0.00001 instead. This way SAS will actually recognize that it is a cell, and it won't throw the test statistic off by very much.
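As a sketch of the TESTF version for this same example: the counts add up to 500, so the expected values are 500 times each proportion, or 35, 90, 325, and 50. This assumes the ex8p4 data set entered above.

```sas
* Sketch: expected counts are 500 times (.07, .18, .65, .10);
PROC FREQ DATA=ex8p4 ORDER=data;
TABLES opinion / TESTF=(35,90,325,50);
WEIGHT count;
RUN;
```

This should give the same chi-square statistic as the TESTP version.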


Homework 18 Notes

The following code will work example 8.5 on page 405.
DATA ex8p5;

INPUT rel $ marit $ count;
CARDS;
A D 39
B D 19
C D 12
D D 28
None D 18
A Never 172
B Never 61
C Never 44
D Never 70
None Never 37
;

PROC FREQ DATA=ex8p5;
WEIGHT count;
TABLES rel*marit / chisq expected nopercent;
RUN;


Downloading the Data from the Web

The various data sets used in the text book can be found on the web, so that you don't need to type them in. The web address for this directory is: www.stat.sc.edu/~habing/courses/MS7thEd/ . The key to the various names in this directory can be found on pages vii - ix in the beginning of the text, and in Appendix B on pages 524 - 529.

Note that if you are using Internet Explorer, it may work better to use "select all" under the edit menu, instead of highlighting the text manually. Also note that you don't want to leave the little box at the end of the data set in when you try to run SAS. Finally, in Internet Explorer, a box may pop up asking you if you wish to download or open the data. Just select open, and it should come up in some sort of document editor that you can cut and paste from.