Stat 530 - Fall 2003 - SAS Templates

The Basics of SAS
Chapter 3:
    Face Plots
Chapter 4:
    Multiple t-tests and Hotelling's T2
Chapter 6:
    Principal Components Analysis
Chapter 7:
    Factor Analysis
Chapter 8:
    Fisher's Linear Discriminant Analysis
    Logistic Regression
Chapter 10:
    Canonical Correlation Analysis
Computer Trouble?
SAS Won't Start?
Graphs Printing Small in Word?


The Basics of SAS:

SAS's strong points are that it is perhaps the most widely used statistical package and that it also serves as a database management program. Its biggest weakness is that it is fairly hard to program or customize.

When you start SAS, three windows are used: the Log window, the Program Editor window, and the Output window. If you happen to lose one of these windows, each usually has a bar at the bottom of the SAS window. You can also find them under the View menu.

The Program Editor is where you tell SAS what you want done. The Output window is where it puts the results, and the Log window is where it tells you what it did and whether there were any errors. Note that the Output window often gets very long! You usually want to copy the parts you want to print into MS-Word and print from there. You should also check the Log window every time you run anything. (The error "SAS Syntax Editor control is not installed" is OK, though.) Errors appear in maroon. Successful runs appear in blue.

Hitting the [F3] key will run the program currently in the Program Editor window.

This will, however, erase whatever was written in the Program Editor window. To recall it, make sure you are in that window and hit the [F4] key.

If you happen to lose a window, check under the View menu at the top.

If you keep running more programs, SAS will keep appending the results to the Output window. To clear the Output window, make sure you are in that window, and choose Clear All under the Edit menu.

In what follows, we will replicate as much of what we did with R as we can easily do.

We would enter the vector (5, 4, 3, 2) using code something like the following:


OPTIONS pagesize=60 linesize=60;

DATA sampvect;
INPUT values @@;
LABEL values = "Just some numbers";
CARDS;
5 4 3
2
;

Note that _most_ lines end with a semi-colon, but not all (the lines of data after CARDS do not). SAS will report errors if you miss one, but usually the Log window will tell you where the problem is.

The OPTIONS line only needs to be used once during a session. It sets the length of the page and the length of the lines for viewing on the screen and printing. The font can be set by using the Options submenu of the Tools menu. When you cut and paste from SAS to a word processor, the font Courier New works well.

The DATA line defines the name of the data set. The name can contain only letters, numbers, and underscores, with no spaces, and it must start with a letter. Older versions of SAS limited names to eight characters; current versions allow up to 32 (some of the names used later on this page, such as scaledbumpus, are longer than eight). The INPUT line gives the names of the variables, and they must be in the order that the data will be entered. The @@ at the end of the INPUT line means that the values will be entered right after each other on the same line with no returns (instead of needing one row for each observation).
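
As a small sketch of what the @@ does when there is more than one variable, the following reads three observations of two variables from two lines of data. The data set name samppair and the variable names x and y are made up for this illustration.

DATA samppair;
INPUT x y @@;   /* keep reading values along each line */
CARDS;
1 2 3 4
5 6
;

SAS fills x and then y, and keeps reading along the line (and on to the next one) until the values run out, so the observations are (1,2), (3,4), and (5,6). Without the @@, SAS would move to a new line for each observation and would only pick up (1,2) and (5,6).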

If we hit F3 at this point to enter what we put above, nothing new will appear on the output screen. This is no big surprise however once we realize that we haven't told SAS to return any output! The code below simply tells SAS to print back out the data we entered.


PROC PRINT DATA=sampvect;
TITLE "Just the values";
RUN;

The most basic procedure to give out some actual graphs and statistics is PROC UNIVARIATE:


PROC UNIVARIATE DATA=sampvect PLOT FREQ ;
VAR values;
TITLE 'Summary of the Values';
RUN;

The VAR line says which of the variables you want a summary of. Also note that the graphs here are pretty awful. The INSIGHT procedure will do most of the things that the UNIVARIATE procedure will, and a lot more. INSIGHT, however, cannot be programmed to perform new tasks that are not already built in. Later in the semester we'll see how some of the other procedures in SAS can be used to do things that aren't already programmed in.


PROC INSIGHT;
OPEN sampvect;
DIST values;
RUN;

Another way to open PROC INSIGHT is to go to the Solutions menu, then to the Analysis menu, and then finally to the Interactive Data Analysis option. Once there you will need to go to the WORK library, and choose the sampvect data set. If you go this route instead, you will also need to make a selection to get the information about the distribution of the values. Go to the Analyze menu, and choose Distribution (Y). Select values, click the Y button, and then click OK.

Once PROC INSIGHT opens, you can cut and paste the graphs from PROC INSIGHT right into Microsoft Word. Simply click on the border of the box you want to copy with the right mouse button to select it. You can then cut and paste like normal. Clicking on the arrow in the bottom corner of each of the boxes gives you options for adjusting the graph's format. The various menus along the top also give other choices such as adding Q-Q plots or conducting tests of hypotheses. To quit PROC INSIGHT, click on the X in the upper right portion of the spreadsheet.

There are various ways to enter "more exciting" data sets. The Import Data... option in the File menu will allow you to read in text files and spreadsheets. It is also possible to simply cut and paste data into SAS. Open up the web page http://www.stat.sc.edu/~habing/courses/data/census.txt, select the entire page, and paste it into the Program Editor window.

We must now add the lines around it: DATA, INPUT, CARDS, and the final ;. We will use the first line of headings as the INPUT line. We will not need the @@ at the end, but we will need to put a $ after the City and State. This is to indicate that these are character strings and not numeric values. The first three lines will thus need to look something like:

DATA census;
INPUT City $     State  $    Population      PopChange      PopDens      Under5      Over65      Asian      Black      Hispanic      Birth      InfantD      CarD      HeartD      Income      Poverty      Unemploy      Grants      BankpP      HousepP;
CARDS;

(For some data sets you might need to remove some blank spaces when you add the $ signs, because the line will be too long otherwise.)

You can now use PROC PRINT to make sure the data was accepted correctly. Note that it truncated the names of the various variables that were extremely long before.
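
A minimal sketch of such a check, assuming the census data set was read in as above. The (OBS=10) data set option limits the printout to the first ten rows, and PROC CONTENTS lists each variable with its type and length, which is a quick way to confirm that City and State were read as character variables.

PROC PRINT DATA=census (OBS=10);
TITLE "First ten rows of the census data";
RUN;

PROC CONTENTS DATA=census;
RUN;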

If we wanted to look at the various states separately, and only for some of the variables, we could form new data sets with just the portions we want. The following would keep City, PopDens, Income, and InfantD, and for South Carolina only.


DATA SCcensus;
SET census;
KEEP city popdens income infantd;
WHERE State='SC';
RUN;

PROC PRINT DATA=sccensus;
TITLE "So, did it work?";
RUN;

Whenever you have a DATA line, that means you are creating a new data set with that name. The SET line tells it that we are making this new data set from an old one. The KEEP line says the only variables we want in this new data set are the ones on that line. The lines after that give any special commands that go into the making of the new data set. In this case the WHERE command is used to make sure we only keep observations that have specific values for some variables. We could also make data sets that involve mathematical functions of the variables already there. In any case, it should be pretty straightforward when you just stop and read through what the lines say.
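
As a sketch of the last point, the following makes a new data set that adds the natural log of the population density to the census data. The names census2 and LogDens are made up for this illustration, and the code assumes PopDens is positive for every city (SAS sets the result to missing and writes a note in the Log window otherwise).

DATA census2;
SET census;
LogDens = LOG(PopDens);   /* LOG is the natural logarithm in SAS */
RUN;

PROC PRINT DATA=census2;
VAR City State PopDens LogDens;
TITLE "Population density and its log";
RUN;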

Note that SAS is not case sensitive for the name of the data set or the variable names, but it is case sensitive for whatever is in the quotation marks.

If we plan on using PROC INSIGHT, there was really no need to do the subsetting here. We can choose to ignore certain values once we are in INSIGHT. So, now, start INSIGHT up with the entire census data set.

Under the Analyze menu, choose Scatter Plot and select InfantD for Y and PopDens for X. Also, choose Fit (YX) in that same menu.

Now, choosing Edit and then Observations will allow you to include or exclude observations from the graphs and plots. Select Exclude from Calculations for State ^= SC. Notice how all of the numbers have been recalculated, and the non-South Carolina cities are shown with X's. We could also remove those observations from the plots entirely. Also notice the change on the spreadsheet that occurred from each of these. By right-clicking on an observation number, you can make the choice for that observation individually.

Now, take a moment and reinclude all of the observations.

If you construct a scatter plot for X=PopDens, Y=InfantD, and Group=State, you will get a scatter plot for each state separately.

Try Box Plot/Mosaic Plot (Y) for Y=InfantD, with Group=State.

Remember, the arrows in the lower left give you various options with the graphs you have constructed. You can also change a value in the spreadsheet to see how that affects the display. Clicking on an observation in the spreadsheet will highlight that observation in the graphs, and vice-versa, as well.

Try a Rotating Plot (ZYX) using three quantitative variables. Similarly for Contour Plot (ZYX). Finally, choose Scatter Plot but select ALL of the quantitative variables for X and also for Y.

As you can see, PROC INSIGHT has lots of nice built in graphing procedures. Unfortunately it cannot be customized beyond what it has programmed in to start. The other graphical routines in SAS are often not as easy to use. PROC GPLOT and GCHART do allow for some of the same control as in S-Plus however. A list of these, and other graphical functions, can be found in the SAS help for SAS/GRAPH. The basic statistical procedures are listed under SAS/STAT, and the help for INSIGHT is listed under SAS/INSIGHT. When you call up the help, it will generally take over whatever web-browser window was on top, and use that to display the help files.
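
As a rough sketch of what PROC GPLOT looks like, the following draws the same InfantD by PopDens scatter plot that we made in INSIGHT. The SYMBOL1 settings and the title are just one possible choice, not something required by the procedure.

SYMBOL1 VALUE=dot COLOR=black;   /* plot each city as a black dot */

PROC GPLOT DATA=census;
PLOT InfantD*PopDens;            /* vertical*horizontal */
TITLE 'InfantD against PopDens';
RUN;
QUIT;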

There are also several other graphical procedures that are added in each new version of SAS... and they have progressively become more user friendly with time.

PROC G3D DATA=census;
SCATTER PopDens*Income=InfantD;
TITLE1 'Better Still!';
RUN;


Chernoff Faces

While SAS does not have a built-in procedure for constructing Chernoff Faces, a macro for doing so can be found at: http://www.math.yorku.ca/SCS/sasmac/faces.html. This version allows you to control 18 different aspects of the face, on either the right or left side of the face, for a total of 36 different options! What each variable controls is listed on the yorku.ca page given above.

A second macro that is needed to convert all the variables to be values from 0 to 1 can be found at: http://www.math.yorku.ca/SCS/sasmac/scale.html. To begin with, select the get faces.sas option on that first page and save it onto your computer (the Z-drive if you are on the CSM domain). Do the same for the get scale.sas on the second page. In SAS then, choose File and Open and select the faces.sas file you just saved. This should cause the code to appear in a window called faces.sas. Hit F3 in that window. Do the same for the scale.sas. Now go to View and open up the Program Editor window.

The following code uses this macro to make Chernoff faces for the variables in columns 4-7 of the bumpus sparrow data at: http://www.stat.sc.edu/~habing/courses/data/bumpus3.txt. Column 4 is assigned to all of the eye measurements, column 5 is assigned to all of the eyebrow measurements, column 6 is assigned to all of the hair measurements, and column 7 is assigned to all of the mouth measurements. The face line and nose line are not used. Note that the faces macro requires all of the data to be rescaled to be between 0 and 1, and so we first use the scale macro to make a new data set containing only the four variables we want.

DATA bumpus;
INPUT Bird $ Survive Total_Length	Alar_Extent Beak_Head Humerus Sternum;
CARDS;
1	1	156	245	31.6	18.5	20.5
(Insert rest of data set here)
;

%scale(data=bumpus,
       out=scaledbumpus,
       outstat=range,
       var =Alar_Extent Beak_Head Humerus  Sternum,
       id=Bird);

%faces(data=scaledbumpus,
       id=id,    
       res=3,
       blks=1, rows=4, cols=4,
       l1 =Alar_Extent, r1 =Alar_Extent,
       l2 =Alar_Extent, r2 =Alar_Extent,
       l3 =Alar_Extent, r3 =Alar_Extent,
       l4 =Alar_Extent, r4 =Alar_Extent,
       l5 =Alar_Extent, r5 =Alar_Extent,
       l6 =Alar_Extent, r6 =Alar_Extent,
       l7 =Beak_Head, r7 =Beak_Head,
       l8 =Beak_Head, r8 =Beak_Head,
       l9 =Beak_Head, r9 =Beak_Head,
       l10=Beak_Head, r10=Beak_Head,
       l11=Humerus, r11=Humerus,
       l12=Humerus, r12=Humerus,
       l13=,   r13=,
       l14=Humerus, r14=Humerus,
       l15=Humerus, r15=Humerus,
       l16=,   r16=,
       l17=Sternum,  r17=Sternum,
       l18=Sternum,  r18=Sternum);


Multiple t-tests and Hotelling's T2

It is actually pretty annoying to get SAS to calculate Hotelling's T-square (its matrix manipulation isn't as nice as R's, and it doesn't have it built in). However, Hotelling's T-square is a special case of Multivariate Analysis of Variance (MANOVA). In particular, when there are just two groups the p-value for Hotelling's T-square is identical to the p-value of a statistic called Wilks' Lambda. This statistic is calculated by SAS in PROC GLM.

The following code will calculate Wilks' Lambda for the bumpus sparrow data (pages 2-3) that is analyzed on pages 40-43. (The data can be found at http://www.stat.sc.edu/~habing/courses/data/bumpus3.txt.)

DATA bumpus;
INPUT Bird Survive Total_Length Alar_Extent Beak_Head Humerus Sternum;
CARDS;
1	1	156	245	31.6	18.5	20.5
(Insert Rest of Data Here!)
;

PROC GLM DATA=bumpus;
CLASS Survive;
MODEL Total_Length Alar_Extent Beak_Head Humerus Sternum = Survive;
MANOVA h=Survive;
MEANS Survive / HOVTEST=BF;
RUN;

This SAS code produces several pages of output. The first page just says what the group (class variable) is, and how many levels it has.

The next several pages do ANOVAs to test, for each of the variables, whether the groups have the same mean. For just two groups these p-values are identical to the two sample t-test p-values! (You can compare them to the results you got from R.)

In this case they are: Total_Length 0.3258, Alar_Extent 0.7004, Beak_Head 0.8461, Humerus 0.7460, and Sternum 0.9185.

The page after that gives something labeled "Characteristic Roots" at the top... we can ignore this top half of the page. The bottom half of this page has several test statistics, one of which is labeled Wilks' Lambda. Notice that the F statistic and degrees of freedom match both the text on page 43 and the results from R for Hotelling's T-square. The T-square value itself will NOT, however, equal the Wilks' Lambda value (there is a messy formula which connects them).
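
For the curious, the connection in the two group case is T-square = (n1 + n2 - 2)(1 - Lambda)/Lambda, and F = (n1 + n2 - p - 1) T-square / (p (n1 + n2 - 2)) on p and n1 + n2 - p - 1 degrees of freedom. The following DATA step is a sketch of that arithmetic. The value of lambda is a placeholder (type in the one from your own output), and the group sizes of 21 and 28 are assumed for the sparrow data, so check them against the first page of the PROC GLM output.

DATA convert;
lambda = 0.90;   /* placeholder: use the Wilks' Lambda from your output */
n1 = 21;         /* assumed size of the first group */
n2 = 28;         /* assumed size of the second group */
p = 5;           /* number of response variables in the MODEL line */
t2 = (n1 + n2 - 2) * (1 - lambda) / lambda;
f = (n1 + n2 - p - 1) * t2 / (p * (n1 + n2 - 2));
RUN;

PROC PRINT DATA=convert;
TITLE "Hotelling's T-square from Wilks' Lambda";
RUN;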

Assumptions: The second to last page of this PROC GLM output gives something called the Brown and Forsythe test for each of the variables. The Brown and Forsythe test is SAS's name for the modified Levene test. It gives the same p-values as the R function mlevene.test on the R templates page.

The easiest way to check the assumption of multivariate normality is to open the data set in PROC INSIGHT. As we saw in the lab, the Distribution (Y) option will let you do Q-Q plots (under the Curves menu) and the Scatter Plot (YX) option will let you plot all of the values against each other.

Multiple t-tests: PROC MULTTEST can adjust the p-values from the separate ANOVAs using the Bonferroni correction (BON) or an alternative called the Holm test (STEPBON).

DATA pvals;
INPUT whichone $ raw_p;
CARDS;
Total_Length 0.3258 
Alar_Extent  0.7004
Beak_Head    0.8461
Humerus      0.7460
Sternum      0.9185
;

PROC MULTTEST PDATA=pvals BON STEPBON;
RUN;

To see why all of the p-values in this example of the Holm procedure have been replaced by 1, try replacing the p-values in the above code by smaller values 0.01, 0.02, 0.03, 0.04, and 0.05. Notice that it is multiplying each p-value by how many tests there are remaining when it gets to that one... except that this adjusted p-value can never be smaller than a previously adjusted one. You simply compare the adjusted p-values to your desired alpha-levels.
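
Here is a sketch of that experiment. The data set name pvals2 and the labels Test1 through Test5 are made up for this illustration.

DATA pvals2;
INPUT whichone $ raw_p;
CARDS;
Test1 0.01
Test2 0.02
Test3 0.03
Test4 0.04
Test5 0.05
;

PROC MULTTEST PDATA=pvals2 BON STEPBON;
RUN;

Working it by hand, the Bonferroni values should be 5 times each p-value (0.05, 0.10, 0.15, 0.20, 0.25). For Holm, the multipliers are 5, 4, 3, 2, and 1, giving 0.05, 0.08, 0.09, 0.08, and 0.05, and then the last two are raised to 0.09 so that they are not smaller than the one before them.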


Principal Components

The following code will replicate what we did in class with the bears data set using R. The bears data set can be found at http://www.stat.sc.edu/~habing/courses/data/bears.txt. Note that in the DATA step below I removed the periods that appeared in some variable names. Also notice that Name required a $ because it is a character string and not quantitative.

DATA bears;
INPUT Age Sex HeadL HeadW NeckG Length ChestG Weight Name $;
CARDS;
70	1	15.0	6.5	28.0	78.0	45.0	334	Adam
(Insert Rest of Data Here!)
;

PROC CORR DATA=bears COV;
  VAR HeadL HeadW NeckG Length ChestG;
RUN;

Notice that it also gives the correlation matrix, and even tests of hypotheses that the separate correlations equal zero.

One of the SAS procedures that conducts principal components analysis is PROC PRINCOMP. The following code will perform it using the covariance matrix.

PROC PRINCOMP COV OUT=prin DATA=bears;
  VAR HeadL HeadW NeckG Length ChestG;
RUN;

PROC PLOT DATA=prin;
  PLOT PRIN2*PRIN1;
RUN;

You can also perform principal components analysis in PROC INSIGHT. Once you have started up PROC INSIGHT with the appropriate data set, choose Multivariate (YX) under the Analyze menu. You then need to select all of the appropriate variables for Y. Then hit OK. Various pieces of the principal components output can then be found under the Tables, Graphs, and Curves menus. One of the options includes a three dimensional (controllable) plot of the principal components.

Using PROC PRINT on prin will reveal, among other things, the values of the transformed observations.

PROC PRINT DATA=prin;
RUN;

To check the covariances and correlations of the Z's, we could again use PROC CORR, this time on the data set prin. (Notice that it also calculates the means!)

PROC CORR DATA=prin COV;
  VAR Prin1 Prin2 Prin3 Prin4 Prin5;
RUN;

To perform principal components analysis using the correlation matrix instead of the covariance matrix, simply remove the COV from the PROC PRINCOMP line.
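
For example, a sketch of the correlation matrix version is below. The output data set name princorr is made up here so that the earlier prin data set is not overwritten.

PROC PRINCOMP OUT=princorr DATA=bears;   /* no COV, so the correlation matrix is used */
  VAR HeadL HeadW NeckG Length ChestG;
RUN;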

Principal Components Plot Options: Using PROC PLOT to plot the other principal components, all you need to do is replace PRIN2 and PRIN1 with the appropriate other choices.

By default, the plot uses A to represent each point, and then uses B to represent two overlapping points, etc. To label the points using another column in the data set, say Name, we could use PLOT PRIN2*PRIN1 = Name to plot them by the first letter of the name. Using PLOT PRIN2*PRIN1 = Sex will plot a 1 for male and a 2 for female.
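
Putting those options together, a sketch would look something like the following. It relies on the prin data set made by the OUT= option above, which carries along the original variables (including Sex and Name) together with the principal components.

PROC PLOT DATA=prin;
  PLOT PRIN2*PRIN1 = Sex;    /* plotting symbol is 1 or 2 */
  PLOT PRIN3*PRIN1 = Name;   /* plotting symbol is the first letter of Name */
RUN;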


Factor Analysis

Note: The page http://www.stat.sc.edu/~habing/courses/530e7p1F01.pdf contains an annotated pdf copy of the SAS code and SAS output for much of what is shown below. It corresponds with example 7.1 on pages 99-104.

The following code can be used to carry out the various factor analyses on the data set in figure 1.5 that is used in example 7.1 in Chapter 7. The data can be found in the file: http://www.stat.sc.edu/~habing/courses/data/eurojobs.txt.

DATA eurojobs;
INPUT Country $	Agr	Min	Man	PS	Con	SI	Fin	SPS	TC;
CARDS;
Belgium		3.3	0.9	27.6	0.9	8.2	19.1	6.2	26.6	7.2
(Insert rest of data here.)
;

PROC FACTOR DATA=eurojobs
	SIMPLE 
	METHOD=PRIN 
	PRIORS=MAX
	NFACT=4
	SCREE 
	ROTATE=VARIMAX; 
VAR 	Agr	Min	Man	PS	Con	SI	Fin	SPS	TC;
RUN;	

Some (of the MANY) options are as follows:

Adding the line RESIDUAL after SIMPLE will add the residual covariance matrix to the output (what is produced by testfit in R.) It is labeled Residual Correlations With Uniqueness on the Diagonal.

Adding the line OUT=dsetname will make a new data set called whatever you put in for dsetname that will have the estimated factor scores.
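
As a sketch, the same PROC FACTOR call with both of these options added might look like the following. The data set name eurofact is made up for this illustration, and the residual option is written RESIDUALS here, which is the spelling given in the SAS documentation (RES is the short form). The estimated factor scores are stored in variables that SAS names Factor1, Factor2, and so on.

PROC FACTOR DATA=eurojobs
	SIMPLE
	RESIDUALS
	METHOD=PRIN
	PRIORS=MAX
	NFACT=4
	SCREE
	ROTATE=VARIMAX
	OUT=eurofact;
VAR 	Agr	Min	Man	PS	Con	SI	Fin	SPS	TC;
RUN;

PROC PRINT DATA=eurofact;
VAR Country Factor1-Factor4;
TITLE "Estimated factor scores";
RUN;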


Fisher's Linear Discriminant Analysis

The code here will use the skulls data to work through Example 8.1 on page 112. It is important to note that it will not exactly match the output in the book since the W matrix on page 112 has an error (compare it to the W matrix on page 52... there is a digit off). Further, the values in (8.2) on page 113 have been standardized differently than what R or SAS do. The output in table 8.3 will match up, however (after rearranging the columns and rows appropriately).

The data can be found at http://www.stat.sc.edu/~habing/courses/data/skulls.txt.

DATA skulls;
INPUT MB        BH      BL      NH      Epoch $;
CARDS;
131     138     89      49      Earlypre
(REST OF DATA GOES HERE)
;

PROC CANDISC DATA=skulls OUT=outskull DISTANCE ANOVA PSSCP BSSCP;
  CLASS Epoch;
  VAR MB BH BL NH;
RUN;

PROC PLOT DATA=outskull;
PLOT can2*can1=Epoch;
RUN;

You can add the option NCAN=2, for example, to the CANDISC line if you only want it to produce the first 2 linear discriminant functions.

The Jackknife for Estimating Discriminant Analysis Error Rates: In order to return the jackknife corrected estimates of the classification accuracy, we need to use PROC DISCRIM. It will contain both the classification rate using the whole data set and the one using the jackknife, so you need to make sure you read the headings for the output tables!

PROC DISCRIM DATA=skulls CANONICAL CROSSVALIDATE;
CLASS Epoch;
VAR MB BH BL NH;
RUN;


Logistic Regression

Logistic regression can be performed in SAS using either PROC LOGISTIC or PROC INSIGHT. PROC LOGISTIC has the benefit of including the Hosmer-Lemeshow Goodness of Fit Test, while PROC INSIGHT has the advantage of allowing for the easy plotting of the predicted values and the residuals.

To use the following code you will first need to remove all the skulls except for those in the groups Earlypre and Roman... and if you plan to use PROC INSIGHT you should change these values to be 0 or 1. I call the modified data set skulls2.
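
One way to build skulls2 is sketched below. It assumes the two groups are spelled exactly Earlypre and Roman in the Epoch column (remember that whatever is in the quotation marks is case sensitive), and the 0/1 variable name IsRoman is made up for this illustration. The PROC FREQ step is just a check that the recoding did what we wanted.

DATA skulls2;
SET skulls;
WHERE Epoch IN ('Earlypre', 'Roman');   /* keep only these two groups */
IsRoman = (Epoch = 'Roman');            /* 1 for Roman, 0 for Earlypre */
RUN;

PROC FREQ DATA=skulls2;
TABLES Epoch*IsRoman;
RUN;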

PROC LOGISTIC DATA=skulls2 DESCENDING;
MODEL Epoch = MB BH BL NH / LACKFIT;
RUN;

The Hosmer and Lemeshow Goodness of Fit test tests the null hypothesis that a logistic regression model is appropriate. The test that the independent variables are able to predict the dependent variable is the Likelihood Ratio test found earlier in the output. It tests the null hypothesis that all of the coefficients are zero.

To perform logistic regression using PROC INSIGHT, choose Fit (YX) in the Analyze menu. Once there, you have to choose several options after you select Y and X. Select METHOD using the button at the bottom. In the menu that pops up choose the response distribution Binomial, and the link function Logit. The p-value for the likelihood ratio test appears in the Analysis of Deviance box on the output and the estimated probabilities are added to the spreadsheet.

Note that the estimates and p-values match those found using R, and that the coefficients are of the form found in formula (8.3) on page 119. You need to make sure if you are reporting the coefficients that you know what form they come in! Also note that the chi-square statistics in SAS are the square of the z-values found in R.


Canonical Correlation Analysis

The data set we have been working with can be found at: http://www.stat.sc.edu/~habing/courses/data/sascity.txt.

As we saw briefly, PROC REG will perform multivariate multiple regression.

PROC REG DATA=city;
MODEL MORTAL OVER65 HOUSE EDUC SOUND DENSITY NONWHITE WHITECOL POOR = PRECIP JANTEMP JULYTEMP HC NOX SO2 HUMIDITY;
MTEST PRECIP, JANTEMP, JULYTEMP, HC, NOX, SO2, HUMIDITY;
RUN;

The MTEST line above tests the hypothesis that the slope parameters for all of the X variables listed are zero for all of the Y variables.

Any list of X and Y variables can be given, and their absence means that all should be included. Thus, simply using MTEST; would have produced the same output as above. Using MTEST OVER65, NONWHITE, POOR, JANTEMP, JULYTEMP; would test that the slopes corresponding to those X variables were zero for the given Y variables.

The canonical correlation analysis can either be carried out using PROC CANCORR or PROC INSIGHT. The following code will give the output seen in class:

PROC CANCORR DATA=city VPREFIX=envir WPREFIX=people ALL;
VAR PRECIP JANTEMP JULYTEMP  HC  NOX SO2 HUMIDITY;
WITH MORTAL OVER65 HOUSE EDUC SOUND DENSITY NONWHITE WHITECOL POOR;
RUN;

To use PROC INSIGHT, choose the Multivariate option under the Analyze menu, and then use the buttons at the bottom of the window that comes up to select canonical correlation and the options you want (you need to choose the Y and X variables too!).


Computer Trouble?

In most cases, help with the computers (NOT the programming) can be obtained by e-mailing help@stat.sc.edu.

For the printers on the first and second floor, printer paper is available in the Stat Department office. For printers on the third floor, paper is available in the Math Department office.

If you are using a PC, restarting the machine will fix many problems, but obviously don't try that if you have a file that won't save or the like.

If SAS won't start, one of the things to check is that your computer has loaded the X drive correctly (whatever that means). Go to My Computer and see if the apps on 'lc-nt' (X:) is listed as one of the drives. If it isn't, go to the Tools menu and select Map Network Drive.... Select X for the drive, and enter \\lc-nt\apps for the Folder. Then click Finish. This should connect your computer to the X drive and allow SAS to run. If you already had the X-drive connected, then you will need to e-mail help@stat.sc.edu.

If your graphs print out extremely small after you copy them to Word, you might be able to fix the problem by "opening and closing" the image. In Word, left click once on the image, and select Edit Picture or Open Picture Object under the Edit menu. A separate window will open with the image in it. Simply choose Close Picture. It should now print out OK. This will also make the spacing between the characters in the labels look right if they were somewhat off.

If the problem is an emergency requiring immediate attention see Jason Dew in room 209D.
If Jason is not available and it is an emergency, see Minna Moore in room 417.
Flagrantly non-emergency cases may result in suspension of computer privileges.