Stat 515 - Fall 1999 - SAS Templates

Projects Due Thursday, December 2nd at 4:30pm

Class Notes from 8/31/99
Notes on Homework 3
Notes on Homework 7
Notes on Homework 9
Notes on Homework 10
Notes on Homework 11
Notes on Homework 12
Notes on Homework 13
Notes on Homework 16
Notes on Homework 18
Text Book Data Sets (Note: if you are using Internet Explorer, it may work better to use "select all" under the edit menu, instead of highlighting the text manually)


Using SAS to get descriptive statistics and plots:

Hitting the [F3] key will run the program currently in the Program Editor window.

This will however erase whatever was written in the Program Editor window. To recall whatever was there, make sure you are in that window, and hit the [F4] key.

If you keep running more programs it will keep adding it all to the Output window. To clear the Output window, make sure you are in that window, and choose "Clear text" under the Edit menu.

The following is the SAS code for analyzing the student loan default rate as given in Table 2.5 on page 33.


OPTIONS pagesize=60 linesize=80;

DATA table2p5;
INPUT state $ pct @@;
LABEL state = "State"
   pct = "Percent of Loans in Default";
CARDS;
Ala 12.0 Ill 9.3 Mont 6.4 RI 8.8
Alaska 19.17 Ind 6.7 Nebr 4.9 SC 14.1
Ariz 12.1 Iowa 6.2 Nev 10.1 SDak 5.5
Ark 12.9 Kans 5.7 NH 7.9 Tenn 12.3
Calif 11.4 Ky 10.3 NJ 12.0 Tex 15.2
Colo 9.5 La 13.5 NMex 7.5 Utah 6.0
Conn 8.8 Maine 9.7 NY 11.3 Vt 8.3
Del 10.9 Md 16.6 NC 15.5 Va 14.4
DC 14.7 Mass 8.3 NDak 4.8 Wash 8.4
Fla 11.8 Mich 11.4 Ohio 10.4 WVa 9.5
Ga 14.8 Minn 6.6 Okla 11.2 Wis 9.0
Hawaii 12.8 Miss 15.6 Oreg 7.9 Wyo 2.7
Idaho 7.1 Mo 8.8 Pa 8.7
;

PROC PRINT data=table2p5;
TITLE "Table 2.5: Percentage of Student Loans in Default";
RUN;
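Note that _most_ lines end with a semicolon, but not all: a statement can run across several lines (as LABEL does above), and the data lines after CARDS have none. The program will fail if you miss one, but usually the Log window will tell you where the problem is.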

The "OPTIONS" line only needs to be used once during a session. It sets the length of the page and the length of the lines for viewing on the screen and printing. The font can be set by using the "Options" menu along the top of the screen. When you cut and paste from SAS to a word processor, the font Courier New works well.

The "DATA" line defines what the name of the data set is. The name must be eight characters or less, with no spaces, and only letters, numbers, and underscores. It must start with a letter. The "INPUT" line gives the names of the variables, and they must be in the order that the data will be entered. The $ after "state" on the "INPUT" line means that the variable "state" is qualitative instead of quantitative. The @@ means that there will not be a return between each pair of variables. Note that they must still go in order: state, pct, state, pct, etc...

PROC PRINT is the command that simply prints out the data. If we left that off here, nothing would appear on the "OUTPUT" screen. The "TITLE" command here, and the "LABEL" command above are optional.

The basic method for getting a summary of the data is to use PROC UNIVARIATE.


PROC UNIVARIATE DATA=table2p5 PLOT FREQ ;

VAR pct ;
TITLE 'Summary of the Loan Default Data';
RUN;
The "VAR" line says which of the variables you want a summary of. Note that there are many different definitions of percentile, so the exact values may not match the ones you get from the method we saw in class.
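
As a hedged illustration of the point about percentile definitions, the sketch below asks PROC UNIVARIATE for a different definition through its PCTLDEF= option, which takes a value from 1 to 5 (5 is the default). The choice of 4 here is only an example; which definition, if any, matches the method from class is not something this page states.

```sas
* PCTLDEF= selects one of five percentile definitions (5 is the default);
PROC UNIVARIATE DATA=table2p5 PCTLDEF=4;
VAR pct;
TITLE 'Loan Default Data, Percentile Definition 4';
RUN;
```

Running it once with each value and comparing the quartiles is a quick way to see how much the definitions differ for a small data set.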

PROC INSIGHT allows many of these analyses, as well as many more advanced analyses and nicer graphs. While it is possible to change the definition of the percentiles in PROC UNIVARIATE, you cannot do so in the current edition of PROC INSIGHT.


PROC INSIGHT; 

OPEN table2p5;
DIST pct;
RUN;

You can cut and paste the graphs from PROC INSIGHT right into Microsoft Word. Simply click on the border of the box you want to copy with the right mouse button to select it. You can then cut and paste as usual. Clicking on the arrow in the bottom corner of each of the boxes gives you options for adjusting the graph's format. To quit PROC INSIGHT, click on the X in the upper right portion of the spreadsheet.

One very useful ability in PROC INSIGHT is the ability to make new variables from the old ones. This is done by going to the "Edit" menu, and selecting "Variables", and then "Other...". This can be used for example to make the z-scores.
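
If you would rather make the z-scores with code than with the menus, one possible sketch uses PROC STANDARD. The output data set name zscores is made up for this example. Note that in the new data set the variable pct holds the z-score rather than the original percentage; table2p5 itself is left unchanged.

```sas
* Copy table2p5 to zscores, rescaling pct to mean 0 and sd 1;
PROC STANDARD DATA=table2p5 MEAN=0 STD=1 OUT=zscores;
VAR pct;
RUN;

PROC PRINT DATA=zscores;
TITLE 'z-scores for the Loan Default Data';
RUN;
```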


Notes on Homework 3:

For the second problem you need to somehow get the difference between the two values, and then make a stem-and-leaf plot of that difference. One way would be to use the Edit, Variables, Other menu choice for Y-X and then to use the save option. Another way would be to make a new data set. The example below takes the baseball attendance in thousands for 1960 and 1961 and makes a new dataset "base2" that only contains the difference.

 

DATA baseball;

INPUT y1960 y1961;
CARDS;
809 673
663 1123
2253 1813
1497 1100
862 584
1705 1199
1096 855
1795 1391
1187 951
1129 850
1644 1151
950 735
1167 1606
774 683
1627 1747
743 597
;
DATA base2;
SET baseball;
KEEP diff;
diff = y1961 - y1960;
RUN;
The line "SET" says that we are going to use the dataset "baseball" to make the new dataset. The line "KEEP" says the new dataset will only have the difference in it, and it will be called "diff". Finally the next line gives the formula to make the difference. This same method could also be used for more complicated formulas.

Note that when you run just the code above, beginning with the DATA lines, it won't add anything to the output screen, because all you've told SAS to do is remember this new data set. You would need to run PROC UNIVARIATE again using "DATA=base2" and "VAR diff". Also note that once you have entered a data set during a session at the computer, you don't need to re-enter it. SAS will remember it until you quit the program.
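
Putting that together, the PROC UNIVARIATE call for the differences would look something like the sketch below. The title text is just a suggestion; the PLOT option is what produces the stem-and-leaf plot.

```sas
PROC UNIVARIATE DATA=base2 PLOT;
VAR diff;
TITLE 'Change in Attendance, 1961 minus 1960';
RUN;
```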


Notes on Homework 7:

The easiest way to get a confidence interval for the mean, or a q-q plot, is to use PROC INSIGHT. Say we wanted to get a 90% CI for pct in the loan default data at the top of this page, and a q-q plot to check that the data are approximately normal. First we would have to enter the data (copy in the first batch of code at the top of this web page; you don't need the PROC PRINT part). Second, we would have to start up PROC INSIGHT (the third set of code at the top of this web page). Once PROC INSIGHT is started, go to the "Tables" menu, go down to the "C.I for Mean" choice, and select the percentage you want. The interval should appear at the bottom of the INSIGHT window, and you can cut the box out and paste it into Microsoft Word. To get the q-q plot, go to the "Graphs" menu and select "QQ Plot...". The default setting is normal, so just hit OK. To add the line to the plot, go to the "Curves" menu and select "QQ Ref Line...". Again the default setting of "Least Squares" turns out to be what we want, so just hit OK again. You can also put this graph in Microsoft Word simply by cutting and pasting it over.

Of course for this problem, you need to enter the 16 data points yourself. Simply follow the format above for the other DATA steps. Remember we don't need to use a $ sign here because they are numbers not names. Also remember that the names you give things have to be eight characters or less.
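
If you want to check the PROC INSIGHT results with code, the following is one possible sketch for the loan default example (not the homework data). It assumes your release of PROC MEANS accepts the CLM and ALPHA= options. PROC UNIVARIATE with NORMAL and PLOT gives a text normal probability plot, which plays the same role as the q-q plot but will not look like the PROC INSIGHT graph.

```sas
* N, mean, sd, and a 90 percent confidence interval for the mean of pct;
PROC MEANS DATA=table2p5 N MEAN STD CLM ALPHA=0.10;
VAR pct;
RUN;

* NORMAL adds tests of normality and PLOT adds a normal probability plot;
PROC UNIVARIATE DATA=table2p5 NORMAL PLOT;
VAR pct;
RUN;
```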


Notes on Homework 9:

Page 268, 6.45: SAS has several ways of calculating a t-test for the two-sided alternative hypothesis "not equal to". Its p-value is denoted by Prob > |T|; the absolute value bars around the T are the giveaway. To get the one-sided tests, you have to do a little extra coding. The program below gives the three possible p-values for testing whether pct in table2p5 has mean=10. It does this by using PROC MEANS to calculate the pieces that go into the test statistic, and the function probt, which emulates looking up a value on the t table. The three things you need to change are: the data set name, the variable of interest, and the mean from the null hypothesis.


PROC MEANS NOPRINT DATA=table2p5;

VAR pct;
OUTPUT OUT=temp MEAN=xbar STD=sd N=n;
RUN;
DATA temp2;
SET temp;
KEEP xbar mu sd n t pgreater pless ptwoside;
INPUT mu;
t = (xbar-mu)/(sd/sqrt(n));
df = n - 1;
pgreater = 1 - probt(t,df);
pless = probt(t,df);
ptwoside = 2*MIN(1-ABS(probt(t,df)),ABS(probt(t,df)));
cards;
10
;
PROC PRINT;
RUN;
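
For comparison only: newer releases of SAS let PROC TTEST do the one-sample, one-sided test directly. The sketch below assumes your release supports the H0= and SIDES= options (SIDES=U for "greater than", SIDES=L for "less than", SIDES=2 for two-sided). If it does not, the program above is the way to go.

```sas
* H0= sets the null mean and SIDES=U requests the upper-tailed test;
PROC TTEST DATA=table2p5 H0=10 SIDES=U;
VAR pct;
RUN;
```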


Notes on Homework 10:

To calculate the p-value for testing about the mean for this problem, you can simply use the code from homework 9. The data for this problem is already typed in on the text book data set page at http://www.stat.sc.edu/~habing/courses/msdata/Ex06.017. Remember to use the select all option if you are on internet explorer to avoid any difficulties when cutting and pasting.

To determine the confidence interval for the variance, we again have to write some simple code. PROC MEANS is again used to calculate the sd and add up the sample size. This time, instead of probt to get the p-value, we use CINV (chi-square inverse) to look up what value from the Chi-square table goes with the particular alpha. (Recall that the percent for the confidence interval is 1-alpha.) The following code constructs a 90% CI for the variance of pct in table2p5.


PROC MEANS NOPRINT DATA=table2p5;

VAR pct;
OUTPUT OUT=temp STD=sd N=n;
RUN;
DATA temp2;
SET temp;
KEEP var n alpha cilow cihigh;
INPUT alpha;
var = sd*sd;
df = n - 1;
cilow = (n-1)*(var)/CINV(1-(alpha/2),df);
cihigh = (n-1)*(var)/CINV(alpha/2,df);
CARDS;
.10
;
PROC PRINT data=temp2;
RUN;
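
As a possible cross-check on the hand-coded interval, newer releases of PROC UNIVARIATE can print confidence limits for the mean, standard deviation, and variance in one table. This sketch assumes your release supports the CIBASIC option; the limits assume the data are normal, just as the chi-square formula above does.

```sas
PROC UNIVARIATE DATA=table2p5 CIBASIC(ALPHA=0.10);
VAR pct;
RUN;
```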


Notes on Homework 11:

Because problem 1 has a sample size of 36, you can just use the normal table to figure out what the values are. In the example problem on the handout however you can't do that because the sample size is small and you would need to use a t-distribution. The code below will give you back the three possible p-values that go with the t-values you enter as data (the four there are the example values from the handout). Note that you also need to change the degrees of freedom (currently set at 11), to what they are in this particular case.


DATA pvals;

INPUT t;
pgreater = 1- probt(t,11);
pless = probt(t,11);
ptwoside = 2*MIN(1-ABS(probt(t,11)),ABS(probt(t,11)));
CARDS;
-2.718
-1.337
0.7413
2.8196
;
PROC PRINT;
RUN;


Notes on Homework 12:

There are three t-tests that come up when you have two sets of data. If the two sets are independent, the next question is whether or not you can assume the variances of the two populations are equal. In either case, PROC TTEST can provide the answer. The third situation is where the data are paired (like twins). This is a little trickier and involves taking the difference of each pair first.

Page 300: 7.8 The following code would be used to test if the two means were equal for the data in Table 7.2 (page 294). For PROC TTEST we have to tell SAS which group each observation belongs to. The @@ tells SAS that we will be entering more than one observation per line. The $ after group tells SAS that the group name is a name, and not a number.


DATA readtest;

INPUT group $ score @@;
CARDS;
N 80 N 80 N 79 N 81
S 79 S 62 S 70 S 68
N 76 N 66 N 79 N 76
S 73 S 76 S 86 S 73
S 72 S 68 S 75 S 66
;
PROC TTEST DATA=readtest;
CLASS group;
VAR score;
RUN;
The output for PROC TTEST consists of three different pieces. The first column is the description of the two samples, including the mean, sample size, and sd. The second column gives the T, the DF and the p-value for the two hypothesis tests: the one where we assume the variances are equal, and the one where we do not. Notice that the DF in the row for unequal is a decimal! This is because SAS uses a complicated method of estimating what DF to use. Recall that the unequal method is not totally accurate unless the sample size is large. In both cases this is the p-value for the two-sided hypothesis test. If you want to test either greater than or less than, you need to do a bit more work.

Looking at the output, the top row is the first group, and the second row is the second group, and so the difference = mean1 - mean2. So for the example above, the T goes with testing the difference = mean of group N - mean of group S. In the problem you are doing, it wants you to test the alternate hypothesis that mean2 > mean1. Just subtract the mean2 from both sides, and then switch it around to see that this is the same as testing the alternate hypothesis that: diff = mean1 - mean2 < 0.

To turn the two-sided p-value that SAS gives you into a one-sided p-value, you need to look at the sign of the T. If you are testing that the difference is > 0, then: if T is positive you need to cut the p-value in half, and if it is negative you need to take one minus p-value/two. On the other hand, if you are testing that the difference is < 0, then: if T is positive you need to take one minus p-value/two, and if it is negative you need to take p-value/two.
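
The rule for converting a two-sided p-value can be sketched as a short DATA step. The data set and variable names (onesided, t, p2, pgreater, pless) are made up for this example, and the two numbers on the data line are placeholders; replace them with the T and two-sided p-value from your own PROC TTEST output.

```sas
* The t and p2 values on the data line are placeholders, not real output;
DATA onesided;
INPUT t p2;
* Halve the two-sided p-value on the side that T points to;
IF t > 0 THEN DO;
   pgreater = p2/2;
   pless = 1 - p2/2;
END;
ELSE DO;
   pgreater = 1 - p2/2;
   pless = p2/2;
END;
CARDS;
1.81 0.087
;
PROC PRINT DATA=onesided;
RUN;
```

Here pgreater is the p-value for testing that the difference is > 0, and pless is the p-value for testing that the difference is < 0.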

The final thing the output gives is in the last row. It is the test of the null hypothesis that the variances are equal. So a small p-value means we reject that they are equal, and should use the unequal method. For this sample data the p-value is 0.4891, so we would fail to reject the null hypothesis, and use the equal row.

Unfortunately, while PROC TTEST has to calculate everything that you would need to make the confidence interval or find s2p, it doesn't actually let you use those values. It turns out that another procedure called an ANOVA (which is taught in STAT 516) actually calculates the same thing! (Remember that for a 90% confidence interval, alpha is 0.10 and alpha/2 = 0.05.)


PROC ANOVA DATA=readtest;

CLASS group;
MODEL score=group;
MEANS group / LSD CLDIFF ALPHA=0.10;
RUN;
The output for the above PROC ANOVA calculates the confidence interval for the case where the variances are equal and also calculates the pooled variance. All of this can be found on the last of the three pages of output this procedure generates. The pooled variance is the value labeled MSE (33.75231 for this example). The confidence intervals for both the first group - second group, and second group minus first are then given.

Page 317: 7.38 a,b To perform the t-test on paired data, it is necessary to first enter the data, and then to combine the two separate variables into a single variable of differences. The following code does this for the data in Table 7.4 (page 308).


DATA pairread; 

INPUT new stand;
CARDS;
77 72
74 68
82 76
73 68
87 84
69 68
66 61
80 76
;

DATA pair2;
SET pairread;
KEEP diff;
diff = new - stand;
RUN;
All you need to do now is the usual t-test (like on homework 9) using the data set pair2, the variable diff, and the correct null hypothesis, instead of table2p5, pct, and 10. (Note that in part b you are supposed to be doing the test on the ages... the table at the bottom of page 317 gives the values for height and weight, so yours will be different!)
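
For the Table 7.4 example, the homework 9 program adapted to the differences might look like the sketch below. It assumes the null hypothesis is that the mean difference is 0, and it sets mu in an assignment statement instead of reading it from a CARDS line; change mu if your null hypothesis is different.

```sas
PROC MEANS NOPRINT DATA=pair2;
VAR diff;
OUTPUT OUT=temp MEAN=xbar STD=sd N=n;
RUN;

DATA temp2;
SET temp;
KEEP xbar mu sd n t pgreater pless ptwoside;
* Assumed null value for the mean difference, change it if needed;
mu = 0;
t = (xbar-mu)/(sd/SQRT(n));
df = n - 1;
pgreater = 1 - PROBT(t,df);
pless = PROBT(t,df);
ptwoside = 2*MIN(pgreater,pless);
RUN;

PROC PRINT DATA=temp2;
RUN;
```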


Notes on Homework 13:

The easiest way to do regression is to use PROC INSIGHT. There is another procedure, PROC REG, that is useful in more complicated situations. The following would open PROC INSIGHT for the baseball data example from homework 3 above.


PROC INSIGHT;

OPEN baseball;
RUN;
Once PROC INSIGHT starts, under the Analyze menu, choose Fit(YX). Click on the variable you want to use for Y and then click on the 'Y' box. Click on the variable you want to use for X and then click on the 'X' box. Then click 'OK'.

The model equation at the top is the regression line. Under that is the graph of the data with the regression line drawn in. The 'Root MSE' in the 'Summary of Fit' box is the standard deviation of the residuals. You can make a q-q plot for the residuals by choosing Residual Normal QQ on the Graphs menu.
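
For comparison, a minimal PROC REG version of the same kind of fit is sketched below. Treating the 1961 attendance as Y and the 1960 attendance as X is an assumption made only for illustration; swap in whichever variables your problem calls for. The output includes the parameter estimates for the regression line and the Root MSE.

```sas
* Assumes 1961 attendance is the response and 1960 attendance the predictor;
PROC REG DATA=baseball;
MODEL y1961 = y1960;
RUN;
QUIT;
```

The QUIT statement is there because PROC REG keeps running after RUN until it is told to stop or another step begins.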


Notes on Homework 16:

Parts c and d of problem 9.68 want you to make the confidence and prediction intervals for a certain value of the x variable (4 years old). The output we want to get should look like the two outputs at the bottom of page 438. The one on the bottom is the prediction interval, and the one on the top is the confidence interval. In order to get this we need to use PROC REG. The following code will make both types of intervals and put them on the same output for data in Table 9.1 on page 404. (This is the same data used on the bottom of page 438.) Here, we also add an extra x value of 1.5 so we can get the intervals for that value. Since this is not an actual data point we just put a period in for the y.


DATA example;

INPUT amount_x time_y;
CARDS;
1 1
2 1
3 2
4 2
5 4
1.5 .
;

PROC REG data=example;
MODEL time_y = amount_x;
PRINT CLI CLM;
RUN;
Here, CLI asks for the prediction interval, and CLM asks for the confidence interval. You don't need to put both in unless you want them both. In the output you can tell which set of numbers goes with x=1.5: it's the one without a y-value or residual.

To get the curves like in Figure 9.24, we return to PROC INSIGHT. Fit the appropriate regression model using Fit(YX) under the Analyze menu. To add the additional curves to the plot, go to the Curves menu, and the Confidence Curves choice in that menu. You now just need to pick which type of interval you want and what percent. If you put both on the same graph, the wider one is the prediction interval, and the narrower one is the confidence interval. You can tell which should be wider just by looking at the formulas on page 437.


Notes on Homework 18:

The following code will work example 8.4a on pages 383 and 384.


DATA ex8p4;

INPUT rel $ marit $ count;
CARDS;
A D 39
B D 19
C D 12
D D 28
None D 18
A Never 172
B Never 61
C Never 44
D Never 70
None Never 37
;

PROC FREQ DATA=ex8p4;
WEIGHT count;
TABLES rel*marit / chisq expected nopercent;
RUN;
For part e, how to form a confidence interval for the difference of two percentages is discussed at the bottom of page 363. SAS doesn't have a built-in way of easily calculating this confidence interval, so it is probably easiest to just do that part by hand.
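
If you want SAS to do the arithmetic for you, a DATA step can serve as a calculator. The sketch below uses the usual large-sample formula, (p1 - p2) plus or minus z times sqrt(p1(1-p1)/n1 + p2(1-p2)/n2); check it against the formula on page 363 before relying on it. The choice of groups is only an illustration (the D counts for rel groups A and B out of their row totals, 39 of 211 and 19 of 80), the 95% level is assumed, and the data set and variable names are made up for this example.

```sas
* Counts and alpha on the data line are illustrative choices only;
DATA propci;
INPUT x1 n1 x2 n2 alpha;
p1 = x1/n1;
p2 = x2/n2;
* PROBIT looks up the z value, as on the normal table;
z = PROBIT(1 - alpha/2);
se = SQRT(p1*(1-p1)/n1 + p2*(1-p2)/n2);
cilow = (p1 - p2) - z*se;
cihigh = (p1 - p2) + z*se;
CARDS;
39 211 19 80 0.05
;
PROC PRINT DATA=propci;
RUN;
```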