Stat 530 - Fall 2001 - SAS Templates

The Basics of SAS (in class 8/30/01)
Star Plots
Chapter 4 Code (in class 9/20/01)
MANOVA (for use on Homework 4)
Principal Components (in class 10/4/01)
Factor Analysis (in class 10/11/01)
Discriminant Analysis (in class 10/24/01)
The Jackknife and Logistic Regression
Cluster Analysis
Multivariate Multiple Regression and Canonical Correlation Analysis
Computer Trouble?
SAS Won't Start?
Graphs Printing Small in Word?


The Basics of SAS:

SAS's strong points are that it is perhaps the most widely used statistical package and that it also serves as a database management program. Its biggest weakness is that it is fairly hard to program or customize.

When you start SAS there are three windows that are used: the Log window, the Program Editor window, and the Output window. If you happen to lose one of these windows, there is usually a bar for it at the bottom of the SAS window. You can also find them under the View menu.

The Program Editor is where you tell SAS what you want done. The Output window is where it puts the results, and the Log window is where it tells you what it did and whether there were any errors. It is important to note that the Output window often gets very long! You usually want to copy the parts you want to print into MS-Word and print from there. It is also important to note that you should check the Log window every time you run anything. (The error "SAS Syntax Editor control is not installed" is ok though.) Errors will appear in maroon; successful runs appear in blue.

Hitting the [F3] key will run the program currently in the Program Editor window.

This will, however, erase whatever was written in the Program Editor window. To recall whatever was there, make sure you are in that window and hit the [F4] key.


If you keep running programs, SAS will keep appending the results to the Output window. To clear the Output window, make sure you are in that window, and choose Clear All under the Edit menu.

In what follows, we will replicate as much of what we did with R as we can easily do.

We would enter the vector (5, 4, 3, 2) using code something like the following:


OPTIONS pagesize=60 linesize=60;

DATA sampvect;
INPUT values @@;
LABEL values = "Just some numbers";
CARDS;
5 4 3
2
;

Note that _most_ lines end with a semicolon, but not all. SAS will generally fail with an error if you miss one, but usually the Log window will tell you where the problem is.

The OPTIONS line only needs to be used once during a session. It sets the length of the page and the length of the lines for viewing on the screen and printing. The font can be set by using the Options submenu of the Tools menu. When you cut and paste from SAS to a word processor, the font Courier New works well.

The DATA line defines what the name of the data set is. The name should be eight characters or less, with no spaces, and only letters, numbers, and underscores. It must start with a letter. The INPUT line gives the names of the variables, and they must be in the order that the data will be entered. The @@ at the end of the INPUT line means that the variables will be entered right after each other on the same line with no returns. (Instead of needing one row for each observation.)
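For comparison, the same vector could be entered one observation per line by dropping the @@. (The data set name sampvect2 here is just for illustration; this is simply an alternate form of the code above.)

DATA sampvect2;
INPUT values;
LABEL values = "Just some numbers";
CARDS;
5
4
3
2
;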

If we hit F3 at this point to enter what we put above, nothing new will appear on the output screen. This is no big surprise however once we realize that we haven't told SAS to return any output! The code below simply tells SAS to print back out the data we entered.


PROC PRINT DATA=sampvect;
TITLE "Just the values";
RUN;

The most basic procedure to give out some actual graphs and statistics is PROC UNIVARIATE:


PROC UNIVARIATE DATA=sampvect PLOT FREQ ;
VAR values;
TITLE 'Summary of the Values';
RUN;

The VAR line says which of the variables you want a summary of. Also note that the graphs here are pretty awful. The INSIGHT procedure will do most of the things that the UNIVARIATE procedure will, and a lot more. INSIGHT, however, cannot be programmed to perform new tasks that are not already built in. Later in the semester we'll see how some of the other procedures in SAS can be used to do things that aren't already programmed in.


PROC INSIGHT;
OPEN sampvect;
DIST values;
RUN;

Another way to open PROC INSIGHT is to go to the Solutions menu, then to the Analysis menu, and then finally to the Interactive Data Analysis option. Once there you will need to go to the WORK library, and choose the sampvect data set. If you go this route instead, you will need to also make a selection to get the information about the distribution of values. Go to the Analyze menu, and choose Distribution(Y). Select values, click the Y button, and then click OK.

Once PROC INSIGHT opens, you can cut and paste the graphs from PROC INSIGHT right into Microsoft Word. Simply click on the border of the box you want to copy with the right mouse button to select it. You can then cut and paste like normal. Clicking on the arrow in the bottom corner of each of the boxes gives you options for adjusting the graph's format. The various menus along the top also give other choices such as adding Q-Q plots or conducting tests of hypotheses. To quit PROC INSIGHT, click on the X in the upper right portion of the spreadsheet.

There are various ways to enter "more exciting" data sets. The Import Data... option in the File menu will allow you to read in text files and spreadsheets. It is also possible to simply cut and paste data into SAS. Open up the web page http://www.stat.sc.edu/~habing/courses/data/rivers.txt, select the entire page, and paste it into the Program Editor window.

We must now add the lines around it: DATA, INPUT, CARDS, and the final ;. We will use the first line of headings as the INPUT line. We will not need the @@ at the end, but we will need to put a $ after the River, Country, and Continent. This is to indicate that these are character strings and not numeric values. The first three lines will thus need to look something like:

DATA nitro;
INPUT River $     Country  $  Cont $   Discharge      Runoff      Area      Density      NO3      Export      Dep      Nprec      Prec;
CARDS;
(You might need to remove some blank spaces when you add the $ signs, because the line will be too long otherwise.)

You can now use PROC PRINT to make sure the data was accepted correctly. Note that SAS truncated the variable names that were extremely long in the original file.

If we wanted to look at the various continents separately, and only for some of the variables, we could form new data sets with just the portions we want. The following would keep the Discharge, Density, and N03 for Europe only.


DATA Eunitro;
SET nitro;
KEEP Discharge Density N03;
WHERE Cont='Eu';
RUN;

PROC PRINT DATA=EuNitro;
TITLE "So, did it work?";
RUN;

Whenever you have a DATA line, that means you are creating a new data set with that name. The SET line tells it that we are making this new data set from an old one. The KEEP line says the only variables we want in this new data set are the ones on that line. The lines after that give any special commands that go into the making of the new data set. In this case the WHERE command is used to make sure we keep only observations that have specific values for some variables. We could also make data sets that involve using mathematical functions of the variables already there. In any case, it should be pretty straightforward when you just stop and read through what the lines say.
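For example, a new data set containing a log-transformed discharge could be made as follows. (The data set name lognitro and variable name logdis are just for illustration; LOG in SAS is the natural logarithm.)

DATA lognitro;
SET nitro;
logdis = LOG(Discharge);
RUN;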

Two things to notice about the above. First, it was not case sensitive (EuNitro vs. Eunitro). Second, what happened to the NO3??

If we plan on using PROC INSIGHT, there was really no need to do the subsetting here. We can choose to ignore certain values once we are in INSIGHT. So, now, start INSIGHT up with the entire nitro data set.

Under the Analyze menu, choose Scatter Plot and select NO3 for Y and Density for X. Also, choose Fit (YX) in that same menu.

Now choose Edit, then Observations; this will allow you to include or exclude observations from the graphs and plots. Select Exclude from Calculations for Cont ^= Eu. Notice how all of the numbers have been recalculated, and the non-European rivers are shown with X's. We could also remove those observations from the plots entirely. Also notice the change on the spreadsheet that occurred from each of these. By right-clicking on an observation number, you can make the choice for that observation individually.

Now, take a moment and reinclude all of the observations.

If you construct a scatterplot for X=Density, Y=NO3, and Group=Cont, you will get a scatter plot for each continent separately.

Try Box Plot/Mosaic Plot (Y) for Y=NO3, with Group=Cont. Now try it with Y=NO3 and X=Density.

Remember, the arrows in the lower left give you various options with the graphs you have constructed. You can also change a value in the spreadsheet to see how that affects the display. Clicking on an observation in the spreadsheet will highlight that observation in the graphs, and vice-versa, as well.

Try a Rotating Plot (ZYX) using three quantitative variables. Similarly for Contour Plot (ZYX). Finally, choose Scatter Plot but select ALL of the quantitative variables for X and also for Y.

As you can see, PROC INSIGHT has lots of nice built-in graphing procedures. Unfortunately it cannot be customized beyond what it has programmed in to start. The other graphical routines in SAS are often not as easy to use. PROC GPLOT and GCHART do allow for some of the same control as in S-Plus however. A list of these, and other graphical functions, can be found in the SAS help for SAS/GRAPH. The basic statistical procedures are listed under SAS/STAT, and the help for INSIGHT is listed under SAS/INSIGHT. When you call up the help, it will generally take over whatever web-browser window was on top, and use that to display the help files.

There are also several other graphical procedures that are added in each new version of SAS... and they have progressively become more user friendly with time.

PROC G3D DATA=nitro;
SCATTER Density*Discharge=NO3;
TITLE1 'Better Still!';
RUN;


Star Plots

The following code will create some very rudimentary star plots for the bears data set from Homework 2. (Only the first line of the raw data is shown below.)

DATA BEARS;
INPUT Age      Sex      Head_L      Head_W      Neck_G      Length      Chest_G      Weight      Name $ ;
CARDS;
70      1      15.0      6.5      28.0      78.0      45.0      334      Adam
Insert Rest of Data Here
;

DATA stars;
SET bears;
VALUE = Age; VARIABLE='Age'; OUTPUT;
VALUE = Sex; VARIABLE='Sex'; OUTPUT;
VALUE = Head_L; VARIABLE='Head_L'; OUTPUT;
VALUE = Head_W; VARIABLE='Head_W'; OUTPUT;
VALUE = Neck_G; VARIABLE='Neck_G'; OUTPUT;
VALUE = Length; VARIABLE='Length'; OUTPUT;
VALUE = Chest_G; VARIABLE='Chest_G'; OUTPUT;
VALUE = Weight; VARIABLE='Weight'; OUTPUT;
KEEP Name VALUE VARIABLE;


PROC GCHART DATA=stars;
STAR variable / NOHEADING VALUE=NONE SUMVAR=value GROUP=Name ACROSS=3 DOWN=3;
RUN;
It appears that SAS has difficulty fitting many rows on one screen (the DOWN command). It also appears that it scales all of the variables by dividing by the largest of all of the maximums. Thus many of the variables barely show up at all because they are all being divided by 436 (Robert's weight). The easiest way to fix this second difficulty might be to use a spreadsheet program to find the maximum value of each variable and scale them accordingly.
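One possible way to do that scaling within SAS itself (rather than a spreadsheet) is sketched below; the data set names maxes and sbears are just illustrative. PROC MEANS saves the maximum of each variable, and the DATA step divides each observation by the appropriate maximum. (The one-time SET of maxes at _N_ = 1 makes its values available on every iteration, since variables read with SET are retained.)

PROC MEANS NOPRINT DATA=bears;
VAR Age Sex Head_L Head_W Neck_G Length Chest_G Weight;
OUTPUT OUT=maxes MAX=mAge mSex mHead_L mHead_W mNeck_G mLength mChest_G mWeight;
RUN;

DATA sbears;
IF _N_ = 1 THEN SET maxes;
SET bears;
Age = Age/mAge; Sex = Sex/mSex;
Head_L = Head_L/mHead_L; Head_W = Head_W/mHead_W;
Neck_G = Neck_G/mNeck_G; Length = Length/mLength;
Chest_G = Chest_G/mChest_G; Weight = Weight/mWeight;
KEEP Name Age Sex Head_L Head_W Neck_G Length Chest_G Weight;
RUN;

The stars DATA step and PROC GCHART above could then be run on sbears instead of bears, so every variable ranges up to 1.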


Chapter 4

The following uses the Bumpus sparrow data as found in Table 1.1 on pages 2-3. The column of survived or not survived has been added.

DATA Bumpus;
INPUT Surv $ Total_Length      Alar_Extent      Beak_Head      Humerus      Sternum ;
CARDS;
Y 156      245      31.6      18.5      20.5
Y 154      240      30.4      17.9      19.6
Y 153      240      31.0      18.4      20.6
Y 153      236      30.9      17.7      20.2
Y 155      243      31.5      18.6      20.3
Y 163      247      32.0      19.0      20.9
Y 157      238      30.9      18.4      20.2
Y 155      239      32.8      18.6      21.2
Y 164      248      32.7      19.1      21.1
Y 158      238      31.0      18.8      22.0
Y 158      240      31.3      18.6      22.0
Y 160      244      31.1      18.6      20.5
Y 161      246      32.3      19.3      21.8
Y 157      245      32.0      19.1      20.0
Y 157      235      31.5      18.1      19.8
Y 156      237      30.9      18.0      20.3
Y 158      244      31.4      18.5      21.6
Y 153      238      30.5      18.2      20.9
Y 155      236      30.3      18.5      20.1
Y 163      246      32.5      18.6      21.9
Y 159      236      31.5      18.0      21.5
N 155      240      31.4      18.0      20.7
N 156      240      31.5      18.2      20.6
N 160      242      32.6      18.8      21.7
N 152      232      30.3      17.2      19.8
N 160      250      31.7      18.8      22.5
N 155      237      31.0      18.5      20.0
N 157      245      32.2      19.5      21.4
N 165      245      33.1      19.8      22.7
N 153      231      30.1      17.3      19.8
N 162      239      30.3      18.0      23.1
N 162      243      31.6      18.8      21.3
N 159      245      31.8      18.5      21.7
N 159      247      30.9      18.1      19.0
N 155      243      30.9      18.5      21.3
N 162      252      31.9      19.1      22.2
N 152      230      30.4      17.3      18.6
N 159      242      30.8      18.2      20.5
N 155      238      31.2      17.9      19.3
N 163      249      33.4      19.5      22.8
N 163      242      31.0      18.1      20.7
N 156      237      31.7      18.2      20.3
N 159      238      31.5      18.4      20.3
N 161      245      32.1      19.1      20.8
N 155      235      30.7      17.7      19.6
N 162      247      31.9      19.1      20.4
N 153      237      30.6      18.6      20.4
N 162      245      32.5      18.5      21.1
N 164      248      32.3      18.8      20.9
;
It is possible to program SAS to do some manual calculations. The following for example would conduct a one-sample T-test that the mean Total_Length of all the sparrows was 156.
PROC MEANS NOPRINT DATA=Bumpus;
VAR Total_Length;
OUTPUT OUT=temp MEAN=xbar STD=sd N=n;
RUN;
DATA temp2;
SET temp;
KEEP xbar mu sd n t pgreater pless ptwoside;
INPUT mu;
t  = (xbar-mu)/(sd/sqrt(n));
df = n - 1;
pgreater = 1 - probt(t,df);
pless = probt(t,df);
ptwoside = 2*MIN(1-ABS(probt(t,df)),ABS(probt(t,df)));
cards;
156
;
PROC PRINT;
RUN;

SAS also has functions to look up the values corresponding to various other distributions. Each distribution has one function that solves P(X < x0)=?, and is called the probability function. Each distribution also has another function that solves P(X < ?)=p, and is called the quantile function.

Distribution       Quantile             Probability
Standard Normal    PROBIT(pct)          PROBNORM(val)
Chi-square         CINV(pct,df)         PROBCHI(val,df)
t                  TINV(pct,df)         PROBT(val,df)
F                  FINV(pct,dfx,dfy)    PROBF(val,dfx,dfy)

In each case, the pct is the probability (percent of the area) that is less than the val, and df are the degrees of freedom.
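For example, the following DATA step looks up a few familiar values (the data set name lookup is arbitrary):

DATA lookup;
z975 = PROBIT(0.975);    * should be about 1.96;
pz   = PROBNORM(1.96);   * should be about 0.975;
t95  = TINV(0.95, 10);   * 95th percentile of a t with 10 df;
RUN;

PROC PRINT DATA=lookup;
RUN;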

Of course this is all rather silly in this case since PROC INSIGHT and several other PROCs will already do this for us. The two-sample t-test is also built in as well.

PROC TTEST DATA=Bumpus;
CLASS Surv;
VAR Total_Length Alar_Extent Beak_Head Humerus Sternum ;
RUN;

We could modify the resulting p-values (the rows for variances equal) using the Bonferroni correction, or another alternative called the Holm test using PROC MULTTEST.

DATA pvals;
INPUT whichone $ raw_p;
CARDS;
Total_Length 0.3258  
Alar_Extent  0.7004
Beak_Head    0.8461
Humerus      0.7460
Sternum      0.9185
;

PROC MULTTEST PDATA=pvals BON STEPBON;
RUN;

Why are all the p-values 1? (Try replacing the original ones by smaller values, say 0.01, 0.02, 0.03, 0.04, and 0.05.)
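For comparison, the Bonferroni adjustment could also be computed by hand in a DATA step: each raw p-value is multiplied by the number of tests (five here), capping the result at 1. (The data set name bonpvals and variable name bon_p are just for illustration.)

DATA bonpvals;
SET pvals;
bon_p = MIN(1, 5*raw_p);   * Bonferroni adjustment for 5 tests;
RUN;

PROC PRINT DATA=bonpvals;
RUN;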

Notice PROC TTEST also provided an F test that the variances are equal. (The "Folded" means that it always puts the larger of the two variances in the numerator.)

To get another of the tests of equal variance, we could treat this problem as a two-group 1-way ANOVA and use PROC GLM.

PROC GLM DATA=Bumpus ORDER=DATA;
CLASS Surv;
MODEL Total_Length = Surv;
MEANS Surv / HOVTEST=BF;
RUN;
The HOVTEST means "homogeneity of variance test" and the BF means "Brown and Forsythe test". The Brown and Forsythe test (a.k.a. Modified Levene's test) uses the absolute deviations from the sample median as discussed on page 44 in Manly.

Hotelling's T2 is a special case of MANOVA where we have only two samples. While SAS doesn't calculate the T2 directly, it does calculate the same F statistic to evaluate another type of multivariate test called Wilks' Lambda. (It can be shown, with some effort, that Wilks' Lambda and the T2 will always give the same p-value.)

PROC GLM DATA=Bumpus;
CLASS Surv;
MODEL Total_Length Alar_Extent Beak_Head Humerus Sternum = Surv;
MANOVA h=Surv;          
RUN;


MANOVA

The example worked out below uses the data found in Table 1.2 on pages 4-5 of Manly. The data is from a 1905 work by Thomson and Randall-Maciver. For each of five epochs in Egypt, four measurements were made on samples of 30 male skulls. The measurements are the first four variables: MB=maximum breadth, BH=basibregmatic height, BL=basialveolar length, and NH=nasal height. The data can be found online at: http://www.stat.sc.edu/~habing/courses/data/skulls.txt.

MANOVA (Multivariate ANalysis Of VAriance) is the situation described on page 49 of Manly where you wish to compare several mean vectors instead of just two. (The case with just two is Hotelling's T2.) The Wilks' Lambda statistic not only agrees with Hotelling's T2 when there are just two samples, but also works in the case where there are more than two samples.

The following code will analyze the Egyptian skulls data, testing the null hypothesis that all five dynasties have equal mean vectors against the alternate hypothesis that at least one mean vector differs.

DATA skulls;
INPUT MB	BH	BL	NH	Epoch $;
CARDS;
131	138	89	49	Earlypre
(REST OF DATA GOES HERE)
;
 
PROC GLM DATA=skulls;
CLASS Epoch;
MODEL MB BH BL NH = Epoch;
MANOVA h=Epoch;
RUN;

At the top of the produced output are four one-way ANOVA tables, one for each variable. Each one tests whether the mean of that variable is the same for all five samples (a large F, and hence a small p-value, signals that we should reject the null hypothesis). These four tests are the ones discussed at the beginning of Example 4.3 on pages 51 and 52.

Following those four tables are a list of the eigenvalues and eigenvectors, and then the p-values for four tests of the MANOVA hypothesis. The four tests are all fairly similar, and will often agree. It is important to remember that you have to choose which test you are using _before_ you look at the p-values though. The Wilks' Lambda test (based on likelihood ratios) is probably a good choice. For this example, the p-value is very small (less than .0001) and so we reject the null hypothesis: not all of the populations have the same underlying mean vectors.

In order to tell how they differ, we will need to get a bit more in depth and use some information about the distances between the observed mean vectors (see the R templates).


Principal Components

The following example uses the data set http://www.stat.sc.edu/~habing/courses/data/testdata.txt that can be found in Mardia, Kent, and Bibby's Multivariate Analysis.

DATA testdata;
INPUT Mechanics_C	Vectors_C	Algebra_O	Analysis_O	Statistics_O;
CARDS;
77	82	67	67	81
(REST OF DATA GOES HERE)
;

One of the SAS procedures that conducts principal components analysis is PROC PRINCOMP.

PROC PRINCOMP COV OUT=prin DATA=testdata;
VAR Mechanics_C       Vectors_C       Algebra_O       Analysis_O      Statistics_O;
RUN;

PROC PLOT DATA=prin;
PLOT PRIN2*PRIN1;
RUN;

You can also perform principal components analysis in PROC INSIGHT. Once you have started up PROC INSIGHT with the appropriate data set, choose Multivariate (YX) under the Analyze window. You then need to select all of the variables for Y. Then hit OK. Various pieces of the principal components output can then be found under the Tables, Graphs, and Curves menus. One of the options includes a three dimensional (controllable) plot of the principal components.

Using PROC PRINT on prin will reveal, among other things, the values of the transformed observations.

Principal Components Plot Options: Using PROC PLOT to plot the other principal components, all you need to do is replace PRIN2 and PRIN1 with the appropriate other choices.

By default, the plot uses A to represent each point, and then uses B to represent two overlapping points, etc. If we had had another column in the data set, say name, we then could have used: PLOT PRIN2*PRIN1 = name to plot them by the first letter of the name. Using PLOT PRIN2*PRIN1 = Vectors_C will plot by the ten's digit of the vectors test score.

This would obviously be a problem in some cases where you had group names that began with the same letter. One way of taking care of this would be to create a new variable in the data set that consisted of the single letter ID you wanted to use for each point. For example:

DATA td2;
SET testdata;
IF _N_ <= 46 THEN idval='A';
ELSE IF _N_ >= 47 THEN idval='B';
KEEP Mechanics_C Vectors_C Algebra_O Analysis_O Statistics_O idval;
PROC PRINT;
RUN;

PROC PRINCOMP COV OUT=prin2 DATA=td2;
VAR Mechanics_C       Vectors_C       Algebra_O       Analysis_O      Statistics_O;
RUN;

PROC PLOT DATA=prin2;
PLOT PRIN2*PRIN1 = idval;
RUN;

The reason this looks odd in this case is that the test scores were already basically sorted by total score in the original data set.

Scaling the Variables First: If the data doesn't come on the same units or on a natural scale, it can be advantageous to standardize it first (as mentioned on page 80 of Manly). Removing the COV from the PROC PRINCOMP line will cause the correlation matrix to be used instead of the covariance matrix. It is also possible to standardize the variables first.

PROC STANDARD DATA=testdata MEAN=0 STD=1  OUT=td3;
VAR Mechanics_C      Vectors_C      Algebra_O      Analysis_O      Statistics_O;
RUN;

PROC PRINT DATA=td3;
RUN;

You could now perform the principal components analysis on td3.
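A sketch of that analysis (the output data set name prin3 is arbitrary); since td3 is already standardized, using COV here gives the same answer as running the analysis on the correlation matrix of the original data:

PROC PRINCOMP COV OUT=prin3 DATA=td3;
VAR Mechanics_C Vectors_C Algebra_O Analysis_O Statistics_O;
RUN;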


Factor Analysis

(BONUS: Click here for a pdf file that has the SAS code and SAS output for example 7.1 on pages 99-104.)

The examples in this section again use the testdata from the section above on principal components analysis.

First, the principal components analysis on the raw data.

PROC PRINCOMP COV OUT=prin DATA=testdata;
VAR Mechanics_C	Vectors_C	Algebra_O	Analysis_O	Statistics_O;
RUN;
Now on the standardized data.
PROC PRINCOMP OUT=prin DATA=testdata;
VAR Mechanics_C	Vectors_C	Algebra_O	Analysis_O	Statistics_O;
RUN;
Setting priors to one and rotate to none will give the factor solution that corresponds to the principal components analysis. (This is like what is described on page 101 for the European employment data... except we limited it to 2 factors for this example.)
PROC FACTOR DATA=testdata 
	SIMPLE 
	METHOD=PRIN 
	PRIORS=ONE 
	NFACT=2 
	SCREE 
	ROTATE=NONE;
VAR Mechanics_C	Vectors_C	Algebra_O	Analysis_O	Statistics_O;
RUN;
The following will perform the varimax rotation for the exploratory factor analysis. (This is like what is described on page 102 for the European employment data... except we limited it to 2 factors for this example.)
PROC FACTOR DATA=testdata 
	SIMPLE 
	METHOD=PRIN 
	PRIORS=SMC
	NFACT=2
	SCREE 
	ROTATE=VARIMAX; 
VAR Mechanics_C	Vectors_C	Algebra_O	Analysis_O	Statistics_O;
RUN;

Note that in the principal components factor solution all of the variables load heavily on the first factor. Using the varimax solution, they are separated more.

               Factor1   Factor2
Mechanics_C    0.30063   0.59976
Vectors_C      0.36943   0.62307
Algebra_O      0.68909   0.53697
Analysis_O     0.68841   0.36556
Statistics_O   0.65900   0.32784

We could say it appears as if there are two common factors... one which influences the Mechanics, Vectors, and Algebra tests and another which influences the Algebra, Analysis, and Statistics tests. (The overlap in Algebra doesn't fit with the idea of a closed book factor and an open book factor... but still, this solution is not far from that idea.)

If we wanted to, we could decide that some of these factor loadings should be zero and perform a confirmatory factor analysis. In this case we might try using one common factor for the Mechanics and Vectors tests (a closed book factor) and another common factor for the Algebra, Analysis, and Statistics tests (an open book factor). We would do this because it is based on what seems plausible from the set up of the tests and because the exploratory analysis gives a result that is fairly close to the one we would expect. Of course we couldn't actually trust any test statistics resulting from this confirmatory analysis on this same data set. This is because we would be data snooping. Instead, we would want to run the confirmatory analysis on a second group of examinees.


Discriminant Analysis

This section uses the data set skulls that we used in the MANOVA example above. http://www.stat.sc.edu/~habing/courses/data/skulls.txt.

DATA skulls;
INPUT MB        BH      BL      NH      Epoch $;
CARDS;
131     138     89      49      Earlypre
(REST OF DATA GOES HERE)
;

PROC CANDISC DATA=skulls OUT=outskull DISTANCE ANOVA PSSCP BSSCP;
  CLASS Epoch;
  VAR MB BH BL NH;
RUN;

PROC PLOT DATA=outskull;
PLOT can2*can1=Epoch;
RUN;


The Jackknife and Logistic Regression

Continuing the above example...

The Jackknife for Estimating Discriminant Analysis Error Rates: In order to return the jackknife corrected estimates of the classification accuracy, we need to use PROC DISCRIM. It will contain both the classification rate using the whole data set and the one using the jackknife, so you need to make sure you read the headings for the output tables!

PROC DISCRIM DATA=skulls CANONICAL CROSSVALIDATE;
CLASS Epoch;
VAR MB BH BL NH;
RUN;

Logistic Regression: Logistic regression can be performed in SAS using either PROC LOGISTIC or PROC INSIGHT. PROC LOGISTIC has the benefit of including the Hosmer-Lemeshow Goodness of Fit Test, while PROC INSIGHT has the advantage of allowing for the easy plotting of the predicted values and the residuals.

To use the following code you will first need to remove all the skulls except for those in the groups Earlypre and Latepre... and if you plan to use PROC INSIGHT you should change these values to be 0 or 1. I call the modified data set skulls2.
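One way to construct skulls2 might look like the following sketch. (The 0/1 variable y is just an illustrative name for use with PROC INSIGHT; the PROC LOGISTIC code below can use Epoch itself.)

DATA skulls2;
SET skulls;
WHERE Epoch IN ('Earlypre', 'Latepre');
IF Epoch = 'Latepre' THEN y = 1;   * 0/1 coding, handy for PROC INSIGHT;
ELSE y = 0;
RUN;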

PROC LOGISTIC DATA=skulls2 DESCENDING;
MODEL Epoch = MB BH BL NH / LACKFIT;
RUN;

The Hosmer and Lemeshow Goodness of Fit test tests the null hypothesis that a logistic regression model is appropriate. The test that the independent variables are able to predict the dependent variable is the Likelihood Ratio test found earlier in the output. It tests the null hypothesis that all of the coefficients are zero.

To perform logistic regression using PROC INSIGHT, choose FIT(YX) in the Analyze Menu. Once there, you have to choose several options after you select Y and X. Select METHOD using the button at the bottom. In the menu that pops up choose the response distribution Binomial, and the link function Logit. The p-value for the likelihood ratio test appears in the Analysis of Deviance box on the output and the estimated probabilities are added to the spread sheet.

Note that the estimates and p-values match those found using R, and that the coefficients are of the form found in formula (8.3) on page 119. You need to make sure if you are reporting the coefficients that you know what form they come in! Also note that the chi-square statistics in SAS are the square of the z-values found in R.


Cluster Analysis

The following code will perform the cluster analysis for the example on pages 129-131 of the text. (A special thanks to Brooke for deciphering the SAS manual....)

data distance (type=distance);
  input (var1 var2 var3 var4 var5) (4.) @22 var $;
  cards;
  0                 var1
  2   0             var2
  6   5   0         var3
 10   9   4   0     var4
  9   8   5   3   0 var5
  ;
run;

proc print data=distance;
run;

proc cluster data=distance method=average;
run;

proc tree horizontal spaces=2;
run;

proc cluster data=distance method=single;
run;

proc tree horizontal spaces=2;
run;

proc cluster data=distance method=complete;
run;

proc tree horizontal spaces=2;
run;


Multivariate Multiple Regression and Canonical Correlation Analysis

The data set we have been working with can be found at: http://www.stat.sc.edu/~habing/courses/data/sascity.txt.

As we saw briefly, PROC REG will perform multivariate multiple regression.

PROC REG DATA=city;
MODEL MORTAL OVER65 HOUSE EDUC SOUND DENSITY NONWHITE WHITECOL POOR = PRECIP JANTEMP JULYTEMP HC NOX SO2 HUMIDITY;
MTEST PRECIP, JANTEMP, JULYTEMP, HC, NOX, SO2, HUMIDITY;
RUN;

The MTEST line above is testing that all of the slope parameters for all of the X variables listed are zero for all of the y variables.

Any list of X and Y variables can be given, and the absence of such a list means to assume that all should be included. Thus, simply using MTEST; would have produced the same output as above. Using MTEST OVER65, NONWHITE, POOR, JANTEMP, JULYTEMP; would test that the slopes corresponding to those x variables were zero for the given y variables.

The canonical correlation analysis can either be carried out using PROC CANCORR or PROC INSIGHT. The following code will give the output seen in class:

PROC CANCORR DATA=city VPREFIX=envir WPREFIX=people ALL;
VAR PRECIP JANTEMP JULYTEMP  HC  NOX SO2 HUMIDITY;
WITH MORTAL OVER65 HOUSE EDUC SOUND DENSITY NONWHITE WHITECOL POOR;
RUN;

Example: Consider the baseball statistics at http://www.stat.sc.edu/~habing/courses/data/bballtest.txt. Begin by standardizing all of the career statistics to be yearly averages instead of totals.

1) What is the best predictor of HR86 and RBI86 from among the career numbers and years? What is the p-value for testing that all of the coefficients involved are 0?

2) Conduct a canonical correlation analysis for predicting the 86 batting statistics and 87 salary from the standardized career batting statistics and number of years.

As promised... Baseball Canonical Correlation Example


Computer Trouble?

In most cases, help with the computers (NOT the programming) can be gained by e-mailing help@stat.sc.edu.

For the printers on the first and second floor, printer paper is available in the Stat Department office. For printers on the third floor, paper is available in the Math Department office.

If you are using a PC restarting the machine will fix many problems, but obviously don't try that if you have a file that won't save or the like. DO NOT TURN OFF THE UNIX MACHINES.

If SAS won't start, one of the things to check is that your computer has loaded the X drive correctly (whatever that means). Go to My Computer and see if the apps on 'lc-nt' (X:) is listed as one of the drives. If it isn't, go to the Tools menu and select Map Network Drive.... Select X for the drive, and enter \\lc-nt\apps for the Folder. Then click Finish. This should connect your computer to the X drive and allow SAS to run. If you already had the X-drive connected, then you will need to e-mail help@stat.sc.edu.

If your graphs print out extremely small after you copy them to Word, you might be able to fix the problem by "opening and closing" the image. In Word, left click once on the image, and select Edit Picture or Open Picture Object under the Edit menu. A separate window will open with the image in it. Simply choose Close Picture. It should now print out ok. This will also make the spacing between the characters in the labels look right if they were somewhat off.

If the problem is an emergency requiring immediate attention see Wei Pan in room 209D.
If Wei Pan is not in, and it is an emergency see Jason Dew in room 415.
If neither Jason nor Wei is available and it is an emergency see Minna Moore in room 417.
Flagrantly non-emergency cases may result in suspension of computer privileges.