Notes on Homework Two
Notes on Homework Three
Notes on Homework Ten
Notes on Homework Eleven
Notes on Homework Twelve
Notes on Homework Thirteen
Notes on Homework Fourteen
Notes on Homework Fifteen
Notes on Homework 16 & 17
Notes on Homework Eighteen
Downloading the Data from the Web

From class on January 19th


Using SAS to get descriptive statistics and plots
--------------------------------------------------------

Hitting the [F3] key will run the program currently in the 
Program Editor window.

This will however erase whatever was written in the Program
Editor window.  To recall whatever was there, make sure you
are in that window, and hit the [F4] key.
    
If you keep running more programs it will keep adding it all
to the Output window.  To clear the Output window, make sure
you are in that window, and choose "Clear text" under the Edit
menu.

The following data is for the first 14 trees in Table 1.6
on page 13.  (It goes with Example 1.2.)
 

----------begin pasting with next line-----------
DATA trees;
INPUT  dfoot hcrn ht;
LABEL dfoot = "diameter at one foot" 
      hcrn = "height to the base of crown"
      ht = "total height";
CARDS;
4.1	1.5	24.5
3.4	4.7	25.0
4.4	2.8	29.0
3.6	5.1	27.0
4.4	1.6	26.5 
3.9	1.9	27.0
3.6	5.3	27.0
4.3	7.6	28.0
4.8	1.1	28.5
3.5	1.2	26.0
4.3	2.3	28.0
4.8	1.7	28.5
2.9	1.1	20.5
5.6	2.2	31.5
;
PROC PRINT;
RUN;
-----------stop pasting with previous line--------



To get a summary of the diameter at one foot, use:


--------------begin------------------------------
PROC UNIVARIATE DATA=trees PLOT FREQ;
VAR dfoot;
TITLE 'Summary of the Trees Diameters';
RUN;
-------------end---------------------------------


Note that there are many different definitions of    
percentile, and another program may not use the 
same one.
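
If your answers need to match another program or the textbook, the
percentile definition can be chosen in PROC UNIVARIATE.  The sketch
below assumes the PCTLDEF= option, which takes a value from 1 to 5
(5 is the default); the choice of 4 here is only for illustration.

```sas
PROC UNIVARIATE DATA=trees PCTLDEF=4;
VAR dfoot;
TITLE 'Diameter percentiles using definition 4';
RUN;
```

Running it once with each definition and comparing the quartiles shows
how much the choice matters for a sample of only 14 trees.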



What if we want the percentage of the tree that 
is the crown?

------------------begin---------------------------
DATA tree2;
SET trees;
KEEP percrn;
percrn = (ht-hcrn)/ht;

PROC UNIVARIATE DATA=tree2;
VAR percrn;
TITLE 'Summary of percent of tree that is crown';
RUN;
------------------end-----------------------------



PROC INSIGHT allows many of these analyses and also
gives the nice graphs.   While it is possible to 
change the definitions of the percentiles in 
PROC UNIVARIATE, you cannot do so in the current
editions of PROC INSIGHT.


------------------begin---------------------------
PROC INSIGHT; 
OPEN trees;
DIST dfoot hcrn;
RUN;
-------------------end----------------------------

Notes on Homework 2

2a) The stem and leaf plot is described on pages 30 and 31. It is part of the output of PROC UNIVARIATE.

2c) To make a scatterplot, use PROC INSIGHT, go under the analyze menu at the top, and choose the "Scatter Plot (YX)" option. There will be a white box with the names of the variables in it. Click on one of the variables and then click on either the x or the y box. Click on the other variable name and then click on the x or y box you didn't already choose. Finally, click on the ok box.
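
The same plot can probably be requested in code instead of through the menus. This is a minimal sketch using the tree data from the top of the page (not the homework data), and it assumes your release of PROC INSIGHT accepts the SCATTER statement, with the y variable before the * and the x variable after it.

```sas
PROC INSIGHT;
OPEN trees;
SCATTER ht * dfoot;
RUN;
```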

Notes on Homework 3

2b) Using the above example of the tree data, say we wanted to be able to look at 'ht', 'dfoot' AND 'ht/dfoot'. To do this, we would need to make a data set that had those first two variables and also had the new one. The code we could use would be:


DATA tree3;
SET trees;
KEEP ht dfoot hoverd;
hoverd = ht/dfoot;
RUN;

We could then run PROC UNIVARIATE or PROC INSIGHT using the new data set tree3.
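
For instance, a sketch of the PROC UNIVARIATE call on the new data set might look like this (the PLOT option is included only to get the stem and leaf and box plots, as in the first example on this page):

```sas
PROC UNIVARIATE DATA=tree3 PLOT;
VAR ht dfoot hoverd;
RUN;
```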

2c) Note that they are asking for 2 scatterplots here. One of them is population and expenditures, the other is population and expenditures/populations.

Notes on Homework 10

Unfortunately, while SAS does many complicated procedures very well, it does a very poor job of some of the basic things. PROC MEANS and PROC INSIGHT can make confidence intervals with no trouble, and they can do the two-sided hypothesis test of the mean. To do a one-sided test, though, we have to write some code of our own.

The following will calculate the t-test for the null hypothesis that the mean of the variable dfoot from the data set trees is equal to five.

What we are doing below are the following steps (which you can follow through):
1) Calculate the MEAN, SD, and get the number of values for the variable dfoot in the data set trees.
2) Output these values into a data set called temp, because we'll only use it temporarily
3) Using the data set temp, and the mean entered after the cards statement, calculate t = (xbar-mu)/(sd/sqrt(n))
4) Calculate the p-value for the three different alternative hypotheses. The function probt(t,df) calculates the area less than t in a t-distribution with df degrees of freedom
5) Put this information in another temporary data set called temp2, and print it out

PROC MEANS NOPRINT DATA=trees;
VAR dfoot;
OUTPUT OUT=temp MEAN=xbar STD=sd N=n;
RUN;
DATA temp2;
SET temp;
KEEP xbar mu sd n t pgreater pless ptwoside;
INPUT mu;
t  = (xbar-mu)/(sd/sqrt(n));
df = n - 1;
pgreater = 1 - probt(t,df);
pless = probt(t,df);
ptwoside = 2*MIN(1-probt(t,df),probt(t,df));
cards;
5
;
PROC PRINT;
RUN;

To try this out, copy the commands from the top of this web page that enter the data (cutting and pasting them into the Program Editor window in SAS), and hit F3 to run them. Now that SAS has the data, cut and paste the above commands in and run those. You will get the resulting p-values. You need to decide in advance which of the three alternative hypotheses you are interested in, though, so that you know which of the three p-values to use.

To do this with a different data set, you first need to enter that data. Then to calculate the p-value, you need to change three things above. Change trees to the name of the data set, change dfoot to the name of the variable, and change 5 to the mu you are using.
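
As a cross-check on the hand-built test, newer SAS releases may be able to run the one-sample test directly in PROC TTEST. The sketch below assumes your installation supports the H0= and SIDES= options (older versions do not have SIDES=, so treat this as an assumption about your copy of SAS). SIDES=L asks for the alternative that the mean is less than 5, SIDES=U would ask for greater than 5, and leaving SIDES= off gives the two-sided test.

```sas
PROC TTEST DATA=trees H0=5 SIDES=L;
VAR dfoot;
RUN;
```

If the option is available, the p-value reported should agree with the pless value printed by the code above.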

Notes on Homework 11

The easiest way to form a confidence interval for the mean, and to check for normality is to use PROC INSIGHT. After entering the data, code similar to the following could be used to start PROC INSIGHT.

PROC INSIGHT;
OPEN trees;
DIST dfoot;
RUN;

This code will start PROC INSIGHT, bringing up a window of graphics and descriptive statistics, and also a spreadsheet. Any values you change in the spreadsheet portion will cause the appropriate changes in the other window. To quit PROC INSIGHT, click on the x in the upper-right corner of the spreadsheet window.

To add a normal curve to the histogram, select the 'Parametric Density' option under the 'Curves' menu. (Distribution = Normal, Method = Sample Estimates). To plot the q-q plot, select the 'QQ Plot' option under the 'Graphs' menu. (Distribution = Normal). To add the straight line to the q-q plot, select the 'QQ Ref Line' option under the 'Curves' menu. (Method = Least Squares). To construct the confidence interval for the mean, select the 'C.I. for Mean' option under the 'Tables' menu, and choose the appropriate percentage.
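
If you would rather get roughly the same output from code, something like the following may work. It is only a sketch: it assumes your version of PROC UNIVARIATE has the HISTOGRAM and QQPLOT statements (older releases may not, or may need SAS/GRAPH for them), and the reference line it draws uses the sample mean and sd rather than least squares. The PROC MEANS step assumes the CLM and ALPHA= options for the confidence interval of the mean.

```sas
PROC UNIVARIATE DATA=trees NORMAL;
VAR dfoot;
HISTOGRAM dfoot / NORMAL;
QQPLOT dfoot / NORMAL(MU=EST SIGMA=EST);
RUN;

PROC MEANS DATA=trees N MEAN STD CLM ALPHA=0.05;
VAR dfoot;
RUN;
```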

You can print the output from PROC INSIGHT either by cutting and pasting from the PROC INSIGHT window into a Microsoft Word document, or by selecting and deleting the parts you don't want and printing directly from SAS. When printing from SAS, it will print every graph and table that is at least partially showing on the screen.

Much like the one-sided t-test from homework 10, it takes a bit of work to get the confidence interval for the variance or sd. The following code works much the same as before, but makes the confidence interval for the variance. The function CINV looks up the value on the chi-square table that goes with the percentage and the degrees of freedom you give it. The 0.05 after the CARDS statement is the alpha in the (1-alpha)*100% confidence level.

PROC MEANS NOPRINT DATA=trees;
VAR dfoot;
OUTPUT OUT=temp STD=sd N=n;
RUN;
DATA temp2;
SET temp;
KEEP var n alpha cilow cihigh;
INPUT alpha;
var = sd*sd;
df = n - 1;
cilow = (n-1)*(sd*sd)/CINV(1-(alpha/2),df);
cihigh = (n-1)*(sd*sd)/CINV(alpha/2,df);
cards;
.05
;
PROC PRINT;
RUN;
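
To get the interval for the sd instead of the variance, take the square root of each end point. The following is a sketch of that change; the names temp3, sdlow and sdhigh are made up for this example, and alpha is set inside the data step here instead of being read from the CARDS.

```sas
PROC MEANS NOPRINT DATA=trees;
VAR dfoot;
OUTPUT OUT=temp STD=sd N=n;
RUN;
DATA temp3;
SET temp;
KEEP sd n alpha sdlow sdhigh;
alpha = 0.05;
df = n - 1;
/* square roots of the end points of the variance interval */
sdlow  = SQRT(df*sd*sd/CINV(1-(alpha/2),df));
sdhigh = SQRT(df*sd*sd/CINV(alpha/2,df));
RUN;
PROC PRINT DATA=temp3;
RUN;
```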

Notes on Homework 12

There are three t-tests that come up when you have two sets of data. If the two sets are independent, the next question is whether you can assume the variances of the two populations are equal. In either case, PROC TTEST can provide the answer. The third situation is where the data is paired (like twins). This is a little trickier and involves taking the difference of each pair first.

The following code would be used to test if the two means were equal for the data in Table 5.4 (page 194). For PROC TTEST we have to tell SAS which group each observation belongs to. The @@ tells SAS that we will be entering more than one observation per line. The $ after group tells SAS that the group name is a name, and not a number.

DATA peanutj;
INPUT  group $ ounces @@;
CARDS;
N 8.06	N 8.39	C 7.99	C 8.03
N 8.64	N 8.46	C 8.12	C 8.14
N 7.97	N 8.28	C 8.34	C 8.14
N 7.81	N 8.02	C 8.17	C 7.87
N 7.93  N 8.39  C 8.11
N 8.57
;
PROC TTEST DATA=peanutj;
CLASS group;
VAR ounces;
RUN;

The output for PROC TTEST consists of three different pieces. The first piece is the description of the two samples, including the mean, sample size, and sd. The second piece gives the T, the DF and the p-value for the two hypothesis tests: the one where we assume the variances are equal, and the one where we do not. Notice that the DF in the row for unequal is a decimal! This is because SAS uses a complicated method of estimating what DF to use. Recall that the unequal method is not totally accurate unless the sample size is large. In both cases this is the p-value for the two-sided hypothesis test. The final piece, in the last row, is the test of the null hypothesis that the variances are equal. So a small p-value means we reject that they are equal, and should use the unequal method.

To perform the t-test on paired data, it is necessary to first enter the data, and then to combine the two separate variables together. The following code does this for the data in Table 5.5 (page 197).

DATA baseball;
INPUT y1960 y1961;  
CARDS;
809	673
663	1123
2253	1813
1497	1100
862	584
1705	1199
1096	855
1795	1391
1187	951
1129	850
1644	1151
950	735
1167	1606
774	683
1627	1747
743	597
;

DATA base2;
SET baseball;
KEEP diff;
diff = y1961 - y1960;
RUN;

All you need to do now is do the usual t-test (like on homework 10) using the dataset base2 and the variable diff.
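
As a sketch, here is the homework 10 code with those changes made. It assumes the null hypothesis is that the mean difference is zero (mu = 0), which is set inside the data step here instead of after a CARDS statement; change it if the problem asks about a different value.

```sas
PROC MEANS NOPRINT DATA=base2;
VAR diff;
OUTPUT OUT=temp MEAN=xbar STD=sd N=n;
RUN;
DATA temp2;
SET temp;
KEEP xbar mu sd n t pgreater pless ptwoside;
mu = 0;   /* assumed null value: no change from 1960 to 1961 */
t  = (xbar-mu)/(sd/sqrt(n));
df = n - 1;
pgreater = 1 - probt(t,df);
pless = probt(t,df);
ptwoside = 2*MIN(pless,pgreater);
RUN;
PROC PRINT DATA=temp2;
RUN;
```

As before, decide which alternative hypothesis you want before looking at the three p-values.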

Notes on Homework 13

Example 7.1 is concerned with predicting the sales prices of homes from their square feet. The data would be entered as follows.


DATA homes;
INPUT space price;
LABEL space = "space in 1,000 square feet"
	price = "sales price in $1,000";
CARDS;
1.326	27.9
1.391	33.5
1.000	19.0 
1.542	31.0
0.735	18.9
1.444	38.3
1.796	45.0
1.770	43.5
1.708	41.5
1.529	35.5
2.234	65.5
1.607	48.55
1.648	41.5
1.608	46.5
1.020	18.0
;

The easiest way to do regression is to use PROC INSIGHT. There is another procedure, PROC REG, that we will use later too.


PROC INSIGHT;
OPEN homes;
RUN;

Once PROC INSIGHT starts, under the ANALYZE menu, choose FIT(YX). Click on the variable you want to use for Y and then click on the 'Y' box. Click on the variable you want to use for X and then click on the 'X' box. Then click 'OK'.

The model equation at the top is the regression line. Under that is the graph of the data with the regression line drawn in. The 'Root MSE' in the 'Summary of Fit' box is the standard deviation of the residuals. You can either just print the page out and highlight the answers, or click on the boundary of a given box and cut and paste it into a Microsoft Word file.

The spreadsheet for PROC INSIGHT includes the residuals (R_PRICE), and (P_PRICE) is the predicted value. To print it out, just make sure that the screen shows all the parts you want to print out.
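
If you want the same fit, predicted values and residuals as printed output rather than in the PROC INSIGHT windows, a sketch using PROC REG would be the following. The P and R options on the MODEL statement ask for the predicted values and the residuals for each observation.

```sas
PROC REG DATA=homes;
  MODEL price = space / P R;
RUN;
```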

Notes on Homework 14

Use PROC INSIGHT and the ANALYZE menu just like on homework 13. The ANOVA table is in the box called 'Analysis of Variance', and the t-test is in the bottom row of the box called 'Parameter Estimates.'

Notes on Homework 15

4b) If it doesn't hit you after a few minutes what to do here, think back to the first day we talked about regression. The code you need was from a previous homework.

5) The spreadsheet in PROC INSIGHT gives the predicted values for the observations _in_ the data set. So for example we could use PROC INSIGHT to estimate the number of goals for someone who is 71 inches tall (because person one was that tall), or 69 inches tall (because observation five was that tall), but no one was 60 inches tall.

PROC INSIGHT can also add the confidence intervals for either the predicted average of everyone at a given x, or for the predicted value of an individual observation at a given x. To do this, use the 'Curves Menu' and select 'Confidence Curves'. You can then choose to either make the curve for the 'Mean' or 'Prediction' and pick the percent you want.

Unfortunately that doesn't give you the answer the problem wants. To get the confidence interval for a particular value you need to use PROC REG. The code here will give both the confidence interval for the mean price (CLM) and the confidence interval for the predicted price (CLI) for a home with 2,000 square feet from the data set above.


DATA homes2;
INPUT space price;
LABEL space = "space in 1,000 square feet"
        price = "sales price in $1,000";
CARDS;
1.326   27.9
1.391   33.5
1.000   19.0
1.542   31.0
0.735   18.9
1.444   38.3
1.796   45.0
1.770   43.5
1.708   41.5
1.529   35.5
2.234   65.5
1.607   48.55
1.648   41.5
1.608   46.5
1.020   18.0
2.000	.
;


PROC REG DATA=homes2;
  MODEL price = space;
  PRINT CLM;
  PRINT CLI;
RUN;

The last data entry "2.000 ." tells SAS that we know we want a space of 2.000, but that we don't know what the price was. The period is used to signify missing data. This observation won't be used in figuring out the degrees of freedom or the regression line, but it tells SAS to use that value if it does any confidence intervals or predicting.

The 'MODEL' line in PROC REG tells SAS what regression line we want. The variables on the left side are predicted from the variables on the right side. Finally the 'PRINT CLM' and 'PRINT CLI' tell SAS to give both the 95% confidence intervals for the average at each space, and for the individuals at each space. The output from the above code contains the ANOVA table, a t-test, and the confidence intervals we asked for.
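
If a problem asks for something other than 95%, the level can probably be changed on the MODEL statement. The sketch below assumes your version of PROC REG accepts the ALPHA= option there, and it asks for CLM and CLI as MODEL options instead of with PRINT; the 0.10 (for 90% intervals) is just an example value.

```sas
PROC REG DATA=homes2;
  MODEL price = space / CLM CLI ALPHA=0.10;
RUN;
```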

Notes on Homework 16 & 17

It is probably easier to copy the data set from the link below, than it is to type it in yourself. PROC INSIGHT automatically gives you the residual plot at the bottom of the screen when you use 'Fit(YX)'. The x-axis with the P_ on it is for the predicted values, and the y-axis with the R_ on it is for the residuals.

Note that the spreadsheet also has the residuals and predicted values on it. You could do a Q-Q plot and histogram for the residuals by going under the 'Analyze' menu and choosing 'Distribution(Y)'. Choose the 'R_' variable (the residuals) as your Y. (Doing this on the home prices data above seems to show that the residuals are skewed to the right, and the q-q plot doesn't look very straight either.)
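
A rough equivalent in code, for the home prices data, is to save the residuals from PROC REG and hand them to PROC UNIVARIATE. This is only a sketch; the data set name homefit and the variable names p_price and r_price are made up here to mirror the P_ and R_ names PROC INSIGHT uses.

```sas
PROC REG DATA=homes;
  MODEL price = space;
  OUTPUT OUT=homefit P=p_price R=r_price;
RUN;

PROC UNIVARIATE DATA=homefit PLOT NORMAL;
  VAR r_price;
RUN;
```

The PLOT option gives the stem and leaf, box and normal probability plots for the residuals, and NORMAL adds a test of normality.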

If you want to find out which residual goes with which point, you can simply click on the point on the residual plot, and it will highlight the corresponding entry on the spreadsheet.

Notes on Homework 18

1b, 8) The following code will work through example 12.9 on page 576.


DATA twelve9;
INPUT opinion $ party $ count; 
CARDS;
favor 	dem	16
favor	rep	21
favor	none	11
nofavor	dem	24
nofavor	rep	17
nofavor	none	13
;

PROC FREQ DATA=twelve9;
TABLES opinion*party / CHISQ;
WEIGHT count;
RUN;

The option CHISQ works for both testing independence and homogeneity. The p-value for the test is given on the first row, labeled simply "Chi-Square". The next line has the likelihood ratio test statistic mentioned on page 579. To perform Fisher's exact test if the n's were small we would have used EXACT instead.
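
As a sketch of how those options fit together, the following asks for the chi-square tests, the expected count for each cell (handy for checking whether the n's are large enough), and Fisher's exact test all at once. It assumes the EXPECTED and EXACT options on the TABLES statement.

```sas
PROC FREQ DATA=twelve9;
TABLES opinion*party / CHISQ EXPECTED EXACT;
WEIGHT count;
RUN;
```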

7) The following code performs the goodness of fit test for example 12.1. Note that the proportions in the TESTP option must add up to 1. If we had expected counts instead, we would have used the TESTF option, and the counts there would have had to add up to the total number of observations.


DATA genet;
INPUT type $ obs;
CARDS;
A 	82 		
B	35      
C	29
D	14	
;
PROC FREQ DATA=genet;
TABLES type / TESTP=(0.5625, 0.1875, 0.1875, 0.0625);
WEIGHT obs;
RUN;
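
For comparison, here is a sketch of the same test written with TESTF. The expected counts are the TESTP proportions multiplied by the 160 observations in the data (82+35+29+14), which gives 90, 30, 30 and 10.

```sas
PROC FREQ DATA=genet;
TABLES type / TESTF=(90, 30, 30, 10);
WEIGHT obs;
RUN;
```

Both versions should give the same chi-square statistic and p-value.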

Downloading the Data from the Web

The various data sets used in the text book can be found on the web, so that you don't need to type them in. The web address for this is:

ftp://ftp.harcourtbrace.com/pub/academic_press/saved/textbook/freund.data/.

For example fw01p05 would be the data for problem 5 of chapter 1.