Stat 515 - Spring 2001 - SAS Templates

Class Notes from 1/24/2001
Homework 3 Notes
Homework 8 Notes
Homework 11 Notes
Homework 12 Notes
Homework 14 Notes
Homework 15 and 16 Notes
Homework 17 and 18 Notes
Downloading the Data from the Web
What if SAS won't start?


The Basics of SAS:

Hitting the [F3] key will run the program currently in the Program Editor window.

This will however erase whatever was written in the Program Editor window. To recall whatever was there, make sure you are in that window, and hit the [F4] key.

If you happen to lose a window, check under the View menu at the top.

If you keep running more programs, SAS will keep adding all of the results to the Output window. To clear the Output window, make sure you are in that window, and choose Clear All under the Edit menu.

The following is the SAS code for entering data about the starting salaries of a group of bank employees. The data consists of the beginning salaries of all 32 male and 61 female entry level clerical workers hired between 1969 and 1977 by a bank. The data is reported in the book The Statistical Sleuth by Ramsey and Schafer, and is originally from: H.V. Roberts, "Harris Trust and Savings Bank: An Analysis of Employee Compensation" (1979), Report 7946, Center for Mathematical Studies in Business and Economics, University of Chicago Graduate School of Business.

The data is formatted in two columns, the first is the starting salary, the second is an id code, m for male, and f for female.
 


OPTIONS pagesize=60 linesize=80;

DATA bankdata;
INPUT salary gender $ @@;
LABEL salary = "Starting Salary"
   gender = "m=male, f=female";
CARDS;
3900 f 4020 f 4290 f 4380 f 4380 f 4380 f
4380 f 4380 f 4440 f 4500 f 4500 f 4620 f
4800 f 4800 f 4800 f 4800 f 4800 f 4800 f
4800 f 4800 f 4800 f 4800 f 4980 f 5100 f
5100 f 5100 f 5100 f 5100 f 5100 f 5160 f
5220 f 5220 f 5280 f 5280 f 5280 f 5400 f
5400 f 5400 f 5400 f 5400 f 5400 f 5400 f
5400 f 5400 f 5400 f 5400 f 5400 f 5520 f
5520 f 5580 f 5640 f 5700 f 5700 f 5700 f
5700 f 5700 f 6000 f 6000 f 6120 f 6300 f
6300 f 4620 m 5040 m 5100 m 5100 m 5220 m
5400 m 5400 m 5400 m 5400 m 5400 m 5700 m
6000 m 6000 m 6000 m 6000 m 6000 m 6000 m
6000 m 6000 m 6000 m 6000 m 6000 m 6000 m
6000 m 6300 m 6600 m 6600 m 6600 m 6840 m
6900 m 6900 m 8100 m
;

Note that _most_ lines end with a semicolon, but not all (the data lines after CARDS do not). SAS will not run the program correctly if you miss one, but usually the Log window will tell you where the problem is.

The OPTIONS line only needs to be used once during a session. It sets the length of the page and the length of the lines for viewing on the screen and printing. The font can be set by using the Options menu along the top of the screen. When you cut and paste from SAS to a word processor, the font Courier New works well.

The DATA line defines what the name of the data set is. In older versions of SAS the name must be eight characters or less (version 8 allows up to 32), with no spaces, and only letters, numbers, and underscores. It must start with a letter or an underscore. The INPUT line gives the names of the variables, and they must be in the order that the data will be entered. The $ after gender on the INPUT line means that the variable gender is qualitative instead of quantitative (SAS stores it as characters rather than as a number). The @@ at the end of the INPUT line means that the variables will be entered right after each other on the same line with no returns, instead of needing one row for each person.
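As a small illustration of what the @@ changes, the same kind of data could be entered one person per line without it. This is only a sketch: the data set name bankshort is made up for this example, and only the first two women and the first man from the data above are shown.

```sas
/* Sketch: without @@ on the INPUT line, each person needs a separate line. */
/* The data set name bankshort is hypothetical, used only for this example. */
DATA bankshort;
INPUT salary gender $;
CARDS;
3900 f
4020 f
4620 m
;
```

With all 93 employees this would take 93 lines of data, which is why the @@ form is used above.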

If we hit F3 at this point to enter what we put above, nothing new will appear on the output screen. This is no big surprise however once we realize that we haven't told SAS to return any output! The code below simply tells SAS to print back out the data we entered.


PROC PRINT DATA=bankdata;
TITLE "Gender Equity in Salaries";
RUN;

The only difficulty we have now is that it would be nice to look at both the men and women separately, so we need to be able to split the data up based on what's in the second column. The following lines will make two separate data sets male and female, and then print out the second one to make sure it is working right:


DATA male;
SET bankdata;
KEEP salary;
WHERE gender='m';
RUN;

DATA female;
SET bankdata;
KEEP salary;
WHERE gender='f';
RUN;

PROC PRINT DATA=female;
TITLE "Female Salaries";
RUN;

Whenever you have a DATA line, that means you are creating a new data set with that name. The SET line tells it that we are making this new data set from an old one. The KEEP line says the only variables we want in this new data set are the ones on that line. The lines after that give any special commands that go into the making of the new data set. In this case the WHERE command is used to make sure we only keep one gender or the other. Later we will see examples of making data sets that involve using mathematical functions. In any case, it should be pretty straightforward when you just stop and read through what the lines say.
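As a preview of using a mathematical function in a DATA step, the following sketch makes a new data set that also contains the natural logarithm of each salary. The names logbank and logsal are made up for this illustration and are not used elsewhere in these notes.

```sas
/* Sketch: build a new data set that adds a transformed variable. */
/* logbank and logsal are hypothetical names.                     */
DATA logbank;
SET bankdata;
logsal = LOG(salary);   /* natural logarithm of the starting salary */
RUN;

PROC PRINT DATA=logbank;
TITLE "Salaries and Log Salaries";
RUN;
```

Because there is no KEEP line here, all of the variables from bankdata are carried along with the new one.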

The most basic procedure to give out some actual graphs and statistics is PROC UNIVARIATE:


PROC UNIVARIATE DATA=female PLOT FREQ ;
VAR salary;
TITLE 'Summary of the Female Salaries';
RUN;

The VAR line says which of the variables you want a summary of. Also note that the graphs here are pretty awful. The INSIGHT procedure will do most of the things that the UNIVARIATE procedure will, and a lot more. INSIGHT however cannot be programmed to perform new tasks that are not already built in. Later in the semester we'll see how some of the other procedures in SAS can be used to do things that aren't already programmed in.


PROC INSIGHT;
OPEN female;
DIST salary;
RUN;

Another way to open PROC INSIGHT is to go to the Solutions menu, then to the Analysis menu, and then finally to the Interactive Data Analysis option. Once there you will need to go to the WORK library, and choose the FEMALE data set. If you go this route instead, you will need to also make a selection to get the information about the distribution of female salaries. Go to the Analyze menu, and choose Distribution(Y). Select salary, click the Y button, and then click OK.

Once PROC INSIGHT opens, you can cut and paste the graphs from PROC INSIGHT right into Microsoft Word. Simply click on the border of the box you want to copy with the right mouse button to select it. You can then cut and paste like normal. Clicking on the arrow in the bottom corner of each of the boxes gives you options for adjusting the graph's format. To quit PROC INSIGHT, click on the X in the upper right portion of the spreadsheet.

In addition to the graphs, there is a box labeled Moments that contains the mean, variance, and standard deviation, as well as a host of other values. The box below that is labeled Quantiles. These are the percentiles. For example Q3 is the 75th percentile, the value at which 75% of the data points are smaller and 25% are larger. The median is the 50th percentile. While the mean and standard deviation are useful when the data is normal or bell-shaped or mound-shaped, the percentiles are useful in other cases (you just have to use more than two numbers then!).

One thing that we can notice from the histogram and box plot is that the data does not look very symmetric in this case; instead it looks slightly skewed to the left. We can add a curve over the histogram to make it easier to compare to a bell-shaped (or normal) curve. Under the Curves menu, choose Parametric Density.... Just hit OK on the box that pops up.

One of the problems with the histogram is that the way it looks can be affected a lot by how the width of the bars is selected, and where the bars start and end. The box with the arrow in it, at the lower left side of the histogram, lets you control that. Click on that box, and then select Ticks.... Change the 3800 to 3600, and the 6600 to 6400, and then click OK. Now try 3700 and 6700, and set the Tick Increment to 600.

The box plot (or box-and-whisker plot) at the top of the graphics window isn't affected by any choice of class intervals. Instead it is based on the percentiles. The main box has three lines: the 25th percentile, the median, and the 75th percentile. The length of the box (75th percentile minus the 25th percentile) is called the IQR or interquartile range. It can be found in the Quantiles window, where it is called Q3-Q1. It is a measure of variability like the standard deviation. The whiskers extending from the box can be up to 1.5 IQRs long. Any points beyond that are given by dots and are possible outliers. This data set doesn't have any.
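If you would like the numbers behind the box plot without opening PROC INSIGHT, one possible way (a sketch, not the only way) is to ask PROC MEANS for the quartiles and the IQR of the female data set made earlier. The title text is made up for this example.

```sas
/* Sketch: the quartiles and IQR that the box plot is built from. */
/* Q1 = 25th percentile, Q3 = 75th percentile, QRANGE = Q3 - Q1.  */
PROC MEANS DATA=female N MIN Q1 MEDIAN Q3 MAX QRANGE;
VAR salary;
TITLE 'Quartiles of the Female Salaries';
RUN;
```

A point would be flagged as a possible outlier if it fell below Q1 - 1.5*QRANGE or above Q3 + 1.5*QRANGE, so you can check the box plot by hand from this output.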

Let's change a few of the values though, so that the data appears less normal. Click on the spreadsheet, and change the 3900 to 8900, the 4020 to 8020, the 4290 to 8290, and the first 4380 to 8380. Note the changes this causes to the various graphs and statistics.

When finished running SAS, remember to close the program and log out.


Homework 3 Notes:

To check if the data is approximately normally distributed (a.k.a. follows a bell-curve, or is mound-shaped), it is best to use a Q-Q plot. This can be done using PROC INSIGHT. After you have used Distribution (Y) to call up the window with the graphs and summary statistics in it (including the values to answer part A) you can add the Q-Q plot to that output window. First, select QQ Plot... under the Graphs menu (hit OK in the box that comes up). Then select QQ Ref Line... under the Curves menu (again hit OK). Data that comes from a population that is normally distributed should lie close to the line.
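A Q-Q plot can also be requested in code instead of through the menus. The following is a sketch that assumes your copy of SAS has the QQPLOT statement in PROC UNIVARIATE (it draws a graphics window, so it may need SAS/GRAPH); it uses the female data set from the class notes as a stand-in for the homework data.

```sas
/* Sketch: normal Q-Q plot with a reference line, done in code.       */
/* NORMAL(MU=EST SIGMA=EST) adds a line based on the sample mean and  */
/* standard deviation. NOPRINT suppresses the usual summary tables.   */
PROC UNIVARIATE DATA=female NOPRINT;
VAR salary;
QQPLOT salary / NORMAL(MU=EST SIGMA=EST);
TITLE 'Normal Q-Q Plot of the Female Salaries';
RUN;
```

As with the PROC INSIGHT version, data from a normally distributed population should lie close to the line.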

One thing to notice in the Q-Q plot is that the "x-axis" is labeled with N_ followed by the name of the variable. This same variable has also been added to the spreadsheet. This can be thought of as "the value that this data point would have had if it came from a normal distribution." Technically, it is gotten by finding what percentile this data point is for the data set, and then by finding what value would be that percentile for a normal distribution. How this could be done will be discussed in section 4.5.
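To see roughly how such a column could be produced, the sketch below uses PROC RANK to turn each observation's rank into a standard normal score. The Blom formula chosen here is an assumption made for illustration, and PROC INSIGHT may use a slightly different one, so the values need not match the N_ column exactly. The names nscores and n_salary are made up.

```sas
/* Sketch: normal scores from the ranks of the data.                */
/* NORMAL=BLOM is one common formula; it is an assumption here, not */
/* necessarily what PROC INSIGHT uses for its N_ variable.          */
PROC RANK DATA=female NORMAL=BLOM OUT=nscores;
VAR salary;
RANKS n_salary;   /* new variable holding the normal scores */
RUN;

PROC PRINT DATA=nscores;
RUN;
```

Plotting salary against n_salary gives a picture of the same kind as the Q-Q plot.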

To print out the Q-Q plot (or any of the other boxes from PROC INSIGHT), you simply need to "right-click" on the border of the box containing the graph you want and copy it. Open up a copy of Microsoft Word, and hit return enough so that you have a blank area on the page large enough for the image, and then simply use the paste option in the edit menu in Word. If you simply print out the document now, though, it will most likely have fonts that don't fit properly. To correct this, "left-click" twice on the center of the image in Word. This will open up the graphic in a separate Word document where you could edit it. It also automatically adjusts the fonts. Now, simply select Close & Return... under the File menu.


Homework 8 Notes:

Recall that the first step is always to enter the data. There will only be one variable name on the INPUT line for this problem as there is only one variable. Remember to use @@ at the end of the INPUT line if you plan on putting more than one observation on a line. Once the data is in, you can start up PROC INSIGHT and get the Q-Q Plot as before. In SAS version 8, PROC INSIGHT will automatically construct the confidence intervals not only for the mean, but also for the variance. Simply select Basic Confidence Intervals under the Tables menu.
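The same kind of confidence intervals can also be requested in code. The sketch below assumes the CIBASIC option of PROC UNIVARIATE is available in your version, and it uses the female data set from the class notes only as a stand-in, since the homework data is not reproduced here.

```sas
/* Sketch: confidence intervals for the mean, standard deviation, and */
/* variance, assuming the data is normal. ALPHA=0.05 gives 95%        */
/* intervals. The female data set is only a stand-in for your data.   */
PROC UNIVARIATE DATA=female CIBASIC ALPHA=0.05;
VAR salary;
RUN;
```

Replace female and salary with the names of your own data set and variable.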


Homework 11 Notes:

The following code would analyze the data in Example 7.4 on page 320.

DATA learner;
INPUT group $ value @@;
CARDS;
new 80 new 80 new 79 new 81
stand 79 stand 62 stand 70 stand 68
new 76 new 66 new 71 new 76
stand 73 stand 76 stand 86 stand 73
new 70 new 85
stand 72 stand 68 stand 75 stand 66
;


PROC TTEST DATA=learner;
CLASS group;
VAR value;
RUN;

The order of the groups determines which hypothesis is being tested. PROC TTEST puts the two values of the CLASS variable in sorted (alphabetical) order, and new comes before stand, so the procedure will look at the difference new-stand. By default the procedure tests the hypothesis that this difference is equal to zero. To test that the difference is equal to some other value, say 5, you would add H0=5 to the first line between learner and the semicolon (in PROC TTEST).
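Written out, the version of the code that tests whether the difference new-stand is equal to 5 would look like the following. The value 5 is just the example value from the paragraph above.

```sas
/* Same analysis as before, but testing H0: (new - stand) = 5 */
/* instead of the default H0: (new - stand) = 0.              */
PROC TTEST DATA=learner H0=5;
CLASS group;
VAR value;
RUN;
```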

The first three rows of the output contain the means, confidence intervals for the means, standard deviations, and confidence intervals for the standard deviations for the two groups individually, and for the difference of the two groups.

The next two rows of the output are for the two t-tests. The Pooled line is the case where the variances are equal, and the Satterthwaite line is for unequal variances. For unequal variances, SAS uses a complicated formula to estimate the degrees of freedom. Notice that they are not an integer! Since we only use the unequal case for large sample sizes (where the z and t are nearly identical) this shouldn't be a problem. One important thing to note here is that SAS is doing the two-sided test for the alternate hypothesis "not equal to". If you are testing either < or >, then you will need to draw the picture and adjust the p-value by hand.
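If you want to check your by-hand adjustment, a short DATA step can compute the one-sided p-values from a t statistic and its degrees of freedom using the PROBT function. This is only a sketch: the values 1.50 and 20 are made up, so put in the t Value and DF from your own output, and the data set name onesided is hypothetical.

```sas
/* Sketch: one-sided and two-sided p-values from a t statistic. */
/* The values of t and df below are hypothetical.               */
DATA onesided;
t  = 1.50;                            /* t statistic from the output      */
df = 20;                              /* degrees of freedom (may be       */
                                      /* a non-integer for Satterthwaite) */
p_two   = 2*(1 - PROBT(ABS(t), df));  /* two-sided, as SAS reports it     */
p_upper = 1 - PROBT(t, df);           /* for the alternate hypothesis >   */
p_lower = PROBT(t, df);               /* for the alternate hypothesis <   */
RUN;

PROC PRINT DATA=onesided;
RUN;
```

PROBT(t, df) is the area to the left of t under the t curve, which is the same area you would shade in the picture.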

The final line is a test of the null hypothesis that the variances of the two groups are equal. The Pr>F column is the p-value for this test, so if it is a small value (less than alpha) you reject the null hypothesis that the variances are equal.

In checking the assumption that the data is normal, we need for BOTH samples to be normal. We can still use PROC INSIGHT for this.

PROC INSIGHT;
OPEN learner;
RUN;

When you choose Distribution (Y) under the Analyze menu, choose value for Y and choose group for group. This will put the information for both groups in the same window (use the scroll bar at the bottom to switch between them). If you add the Q-Q plot and Q-Q line, it will add them for both groups.

In addition to just looking at the Q-Q plots to see if the data is approximately normal, we can actually test the null hypothesis H0: the population is normally distributed against the alternate hypothesis HA: the population is not normally distributed. To do this, select Tests for Normality under the Tables menu. This will add a box containing several tests for normality to the bottom of the PROC INSIGHT output window. The one we want to use is the Anderson-Darling test. For this example, the p-value is greater than 0.2500 for both the new and standard groups. Because the p-value is large we do not reject the null hypothesis; that is, we cannot rule out that the data comes from a population that is normally distributed.
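The tests for normality can also be requested in code with the NORMAL option of PROC UNIVARIATE. The sketch below sorts the data by group first so that each group is tested separately; the data set name learner_sorted is made up for this example.

```sas
/* Sketch: tests for normality, done separately for each group. */
/* BY processing needs the data sorted by the BY variable.      */
PROC SORT DATA=learner OUT=learner_sorted;
BY group;
RUN;

PROC UNIVARIATE DATA=learner_sorted NORMAL;
BY group;
VAR value;
RUN;
```

Look for the Anderson-Darling line in the Tests for Normality table for each group.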


Homework 12 Notes:

The following code will conduct the paired t-test for the data in Table 7.4 on page 330. Remember that the p-value is for the two-sided alternate hypothesis and would need to be adjusted for testing either > or <.

DATA learner;
INPUT new standard;
CARDS;
77 72
74 68
82 76
73 68
87 84
69 68
66 61
80 76
;

PROC TTEST DATA=learner;
PAIRED new:standard;
RUN;

You could also use PROC INSIGHT. After starting PROC INSIGHT, choose Variables under the Edit menu. Select the Other... option. We can now make a new variable that is equal to new-standard. Click on Y-X in the transformation box, then select new for the Y value, standard for the X value, and click on OK. You can now choose Distribution (Y) under the Analyze menu and simply do a one sample t-test on the new variable you created.
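The same idea of making the difference yourself can be written in code. In this sketch the names learner2 and diff are made up; the one-sample t-test on diff should agree with the PAIRED analysis above.

```sas
/* Sketch: make the difference new - standard in a DATA step, */
/* then do a one-sample t-test of H0: mean difference = 0.    */
DATA learner2;
SET learner;
diff = new - standard;
RUN;

PROC TTEST DATA=learner2 H0=0;
VAR diff;
RUN;
```

This is also a handy data set to open in PROC INSIGHT if you want a Q-Q plot of the differences.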


Homework 14 Notes:

The data for this problem is on the disk that comes with the book and also on the web-site (see the link at the bottom of the page). Remember that a $ must go after the name of the second variable on the input line, because the second variable is the letter standing for an emotion and not a number. Once the data is entered, all of the needed results can be gotten from PROC INSIGHT. Under the Analyze menu choose Fit(YX). Select the observed value for the Y variable and the group for the X variable, and then hit OK. The q-q plot can be added by choosing the Residual Normal QQ option under the Graphs menu.

Notice that choosing these options adds three columns to the spreadsheet. The column beginning with R_ is called the residual. The residual is the estimated error from the model equation. The column beginning with P_ is called the predicted value. It is the estimated mean for that group, so muA is 0.8533 for example. Note that the difference between this value and the actual observation is equal to the residual. The column beginning with RN_ is used to make the q-q plot.

The graph on the bottom left is a plot of the residual values versus the predicted values. It thus has six columns (one for each group as they have different predicted values) and gives the estimated errors for each of those groups. This is the plot you can use to check if the errors in each group seem to have mean zero (is zero close to the middle of the values in each column?) and see if the groups seem to have the same variance (is each column as spread out as the others?). The graph on the bottom right is the Q-Q plot for the residuals (the estimated errors) and you can use it to check if the errors come from a distribution that is approximately normal.
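If you would rather get the predicted values and residuals in code, one possible approach is PROC GLM. This is only a sketch and is not required for the homework: the data set name hw14 and the variable names y and group are hypothetical, so substitute whatever names you used on your own DATA and INPUT lines.

```sas
/* Sketch: one-way model with predicted values and residuals saved. */
/* hw14, y, group, hw14fit, p_y, and r_y are all hypothetical names. */
PROC GLM DATA=hw14;
CLASS group;
MODEL y = group;
OUTPUT OUT=hw14fit P=p_y R=r_y;   /* P= predicted values, R= residuals */
RUN;
QUIT;

/* Residuals versus predicted values, like the bottom left graph */
PROC PLOT DATA=hw14fit;
PLOT r_y*p_y;
RUN;
```

The new data set hw14fit holds the original variables plus p_y and r_y, which play the same roles as the P_ and R_ columns in PROC INSIGHT.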



Homework 15 and 16 Notes:

Note, the assignment has been slightly modified!

The SAS output for this problem is produced in exactly the same way as the SAS output for Homework 14. Simply choose the dependent variable for Y and the independent variable for X in PROC INSIGHT.


Homework 17 and 18 Notes:

The following code will analyze the data in Example 8.4 on page 396.

DATA ex8p4;
INPUT opinion $ count;
CARDS;
legal 39
decrim 99
exist 336
noopin 26
;

PROC FREQ DATA=ex8p4 ORDER=data;
TABLES opinion / TESTP=(.07,.18,.65,.10);
WEIGHT count;
RUN;

Instead of using TESTP (test proportion), you also could use TESTF (test frequency). In this case you would put the expected values in instead of the proportions. One further complication with PROC FREQ in SAS is that it doesn't handle observed values of zero well. If there is a cell with an observed count of 0, use the value 0.00001 instead. This way SAS will actually recognize that it is a cell, and it won't throw the test statistic off by very much.
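For this example the TESTF version would look like the sketch below. The expected values were worked out by hand for this illustration: the counts add up to 500, so the expected values are 500 times each proportion, giving 35, 90, 325, and 50.

```sas
/* Sketch: the same test using expected frequencies instead of       */
/* proportions. With n = 39+99+336+26 = 500, the expected values are */
/* 500*.07 = 35, 500*.18 = 90, 500*.65 = 325, and 500*.10 = 50.      */
PROC FREQ DATA=ex8p4 ORDER=data;
TABLES opinion / TESTF=(35,90,325,50);
WEIGHT count;
RUN;
```

The chi-square statistic and p-value should be the same as with TESTP, since the two forms describe the same null hypothesis.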

The following code will work example 8.5 on page 405.

DATA ex8p5;

INPUT rel $ marit $ count;
CARDS;
A D 39
B D 19
C D 12
D D 28
None D 18
A Never 172
B Never 61
C Never 44
D Never 70
None Never 37
;

PROC FREQ DATA=ex8p5;
WEIGHT count;
TABLES rel*marit / chisq expected nopercent;
RUN;


Downloading the Data from the Web

The various data sets used in the text book can be found on the web, so that you don't need to type them in. The web address for this directory is: www.stat.sc.edu/~habing/courses/MS7thEd/ . The key to the various names in this directory can be found on pages vii - ix in the beginning of the text, and in Appendix B on pages 524 - 529.

Note that if you are using Internet Explorer, it may work better to use "select all" under the edit menu, instead of highlighting the text manually. Also note that you don't want to leave the little box at the end of the data set in when you try to run SAS. Finally, in Internet Explorer, a box may pop up asking you if you wish to download or open the data. Just select open, and it should come up in some sort of document editor that you can cut and paste from.


What if SAS won't start?

If you attempt to start SAS and it gives you an error instead of starting the program, there are several possible difficulties. One of them could be that you have an old account that was not correctly updated. To see if this is your problem, go to the Start box and call up Windows NT Explorer in the Programs menu. If you scroll down the left part of the window that pops up and can find a drive labeled apps$ on 'sum-nt' (X:) then the commands below will help you fix this.

1) Close Windows NT Explorer
2) Right click on the My Computer icon on the left side of the screen.
3) Select Disconnect Network Drive
4) Choose X: \\sum-nt\apps and click OK
5) Again, Right click on the My Computer icon
6) Select Map Network Drive...
7) Select X in the Drive box
8) Type \\lc-nt\apps in the path box and hit ok

It should now allow you to start SAS. Unfortunately you'll have to do this every time you want to start SAS until they get your account moved over to a newer machine.

Now, if you were not listed as being hooked up to sum-nt, then it could be any number of things. First you should try starting it on another computer. If that fails, you should see if Jamie Winterton (7-5346) is available in Room 209D. If she is not there, you should next see if Jay Dew (7-5413) is in 415. If that fails, you should see if Minna Moore is available in Room 417. Note: If Minna is the only one in, she will probably not be able to help you right away, but she will pass your message on to the next available person.