STAT 530 - Fall 2008 - SAS Templates

The Basics of SAS
The oildata Example
Finding and Displaying Correlation Matrices
Multivariate Normality (Section 1.4)
Graphics
Principal Components Analysis (Chapter 3)
Factor Analysis (Chapter 4)
Multidimensional Scaling and Distances (Chapter 5)
MANOVA and Discriminant Analysis (Chapter 7)
Graphs Printing Small in Word?


The Basics of SAS:

SAS is one of the most widely used statistical and database management programs. A one-year license can be obtained from University Technology Services (1244 Blossom Street).

There are three main windows used in SAS: the Log window, the Program Editor window, and the Output window. The Log and Program Editor windows are the two on the screen when you start up the program. The Output window isn't visible yet because you haven't created anything to output. If you happen to lose one of these windows, there is usually a bar for it at the bottom of the SAS window; you can also find them under the View menu.

The Program Editor is where you tell SAS what you want done. The Output window is where it puts the results, and the Log window is where it tells you what it did and whether there were any errors. It is important to note that the Output window often gets very long! You usually want to copy the parts you want to print into MS-Word and print from there. It is also important to check the Log window every time you run anything. Errors appear in maroon; successful runs appear in blue.

Hitting the [F3] key at the top of your keyboard will run the program currently in the Program Editor window. You can also run programs by clicking on the little image of the runner in the list of symbols near the top of the SAS program screen.

In older editions of SAS, running the program will erase whatever was written in the Program Editor window. To recall whatever was there, make sure you are in that window, and hit the [F4] key.

Each program you run appends its results to the Output window. To clear the Output window, make sure you are in that window and choose Clear text under the Edit menu.


One way of entering data into SAS is with a DATA step. The following code enters the four data points 5, 4, 3, and 2, along with labels for them going from A to D (so that you can see how two variables can be entered at once). You can just cut and paste the code below, starting with DATA and ending with the ; after all the numbers, into the Program Editor window.

DATA sample;
INPUT letter $ x @@;
LABEL x = "The variable X"
letter = "Label for the data point";
CARDS;
A 5 B 4
C 3 D 2
;

Note that _most_ lines end with a semi-colon, but not all. SAS will throw an error if you miss one, but usually the Log window will tell you where the problem is. An extra or missing semi-colon is probably the single most common reason for a SAS program failing.

The DATA line defines the name of the data set. The name should start with a letter, have no spaces, and contain only letters, numbers, and underscores. The INPUT line gives the names of the variables, in the order the data will be entered. The $ after letter on the INPUT line means that the variable letter is qualitative instead of quantitative. The @@ at the end of the INPUT line means that the observations will be entered one right after another on the same line, instead of needing one row for each observation.
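
For comparison, if you leave off the @@ the same values would be entered with one observation per line. This hypothetical alternate version (the data set name sample2 is made up) is equivalent:

DATA sample2;
* hypothetical alternate entry of the sample data, one observation per line;
INPUT letter $ x;
CARDS;
A 5
B 4
C 3
D 2
;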

If we hit [F3] at this point to run what we put above, nothing new will appear in the Output window. This is no big surprise, however, once we realize that we haven't told SAS to return any output! The code below simply tells SAS to print back out the data we entered.

PROC PRINT DATA=sample;
TITLE "Just printing out the sample data";
RUN;

The DATA step can also be used to manipulate already existing data.

DATA weirdsample;
SET sample;
weird = sin(x)+x**2;
KEEP x weird;
RUN;

PROC PRINT DATA=weirdsample;
RUN;

Whenever you have a DATA line, you are creating a new data set with that name. The SET line tells SAS that we are making this new data set from an old one. The KEEP line says the only variables we want in this new data set are the ones on that line. The remaining lines give any special commands that go into the making of the new data set (here, creating the new variable weird).

The most basic procedure to give out some actual graphs and statistics is PROC UNIVARIATE:

PROC UNIVARIATE DATA=sample PLOT FREQ ;
VAR x;
TITLE 'Summary of the sample data';
RUN;

Various other procedures could be used to get output such as a t-test or linear regression. More refined graphics along with a spreadsheet for entering the data and menus for choosing analyses can be found in PROC INSIGHT.

PROC INSIGHT;
OPEN sample;
DIST x;
RUN;

You can cut and paste the graphs from PROC INSIGHT right into Microsoft Word: simply left-click on the border of the box you want to copy to select it, and then copy and paste as usual. One of the menu options, for example, is the t-test that can be found under Tables > Tests for Location....

While in PROC INSIGHT, left-clicking on the arrow in the bottom corner of each of the display boxes gives you options for adjusting the graph's format. Right-clicking on the squares at the start of each data line in the spreadsheet gives you the option of removing that data point from the calculations or displays. Clicking on a number in the spreadsheet highlights that value in all of the graphics windows. Finally, to quit PROC INSIGHT, click on the X in the upper right portion of the spreadsheet.


The oildata Example:

SAS is able to load in large data sets using the Import Data... option under the File menu. This data set is standard tab-delimited text that we can save in the WORK library under the name oildata. The import wizard can also create code that you could put straight into the editor window next time.
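
The generated code looks roughly like the sketch below (the file path is hypothetical; point it at wherever you saved the file):

PROC IMPORT DATAFILE='C:\oildata.txt'  /* hypothetical path to the saved text file */
    OUT=WORK.oildata
    DBMS=TAB REPLACE;
    GETNAMES=YES;  /* read the variable names from the first row of the file */
RUN;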

To view this data set in PROC INSIGHT you could use Solutions > Analysis > Interactive Data Analysis, select the WORK library, and choose the OILDATA data set.

Back in the editor window we could use a DATA step to select just certain variables and observations.

DATA justlikeinr;
SET oildata;
KEEP Idnum Marital Age Econ Conv Q1;
WHERE GENDER=1;
RUN;

PROC PRINT DATA=justlikeinr;
RUN;


Finding and Displaying Correlation Matrices:

The following code will find the correlation matrix for section 2 of the oildata set. It assumes that you successfully read in the data like in the previous section. The first set of lines creates the data set with just the variables we want.

DATA sect2;
SET oildata;
KEEP Econ Conv Flex Safe Low Dep;
RUN;

We could then analyze the data using either PROC CORR or by using PROC INSIGHT.

Using PROC CORR: This next set of lines uses PROC CORR to print out the correlation matrix. (The NOSIMPLE makes SAS leave out some of the basic summary statistics. The NOPROB option asks it not to do a test of whether each correlation is significantly different from zero. If we left off the VAR line it would do all of the variables with each other... which is the same thing in this case.)

PROC CORR DATA=sect2 NOSIMPLE NOPROB;
VAR Econ Conv Flex Safe Low Dep;
RUN;

If you added BEST=3 to the first line, it would only give the three correlations that were largest in absolute value, in order. This can make it easier to find the most positive and most negative relationships without having to see the entire matrix at once.
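
For example, adding it to the call above:

PROC CORR DATA=sect2 NOSIMPLE NOPROB BEST=3;
VAR Econ Conv Flex Safe Low Dep;
RUN;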

Beginning with the first column we can see that: Economy goes strongly with Safety and Low Energy Use; weakly with Convenience and Dependability; and is opposite of Flexibility. This suggests a group of questions with:

Group 1: Economy, Safety, and Low Energy Use (and maybe Convenience and Dependability)

Going to the next column we see that: Convenience goes strongly with Flexibility and Dependability; a bit more weakly with Economy and Safety; and negatively with Low Energy Use. This suggests a group of questions with:

Group 2: Convenience, Flexibility, Dependability (and maybe Safety and Economy)

If you had to break them into only two groups, this should already give you a good idea of how to do it... and you can check the remaining columns to see if we get any contradictions. In fact we could check out the correlations of the groups separately if we wanted to.

PROC CORR DATA=sect2 NOSIMPLE NOPROB;
VAR Econ Safe Low;
RUN;

PROC CORR DATA=sect2 NOSIMPLE NOPROB;
VAR Conv Flex Dep;
RUN;

PROC CORR DATA=sect2 NOSIMPLE NOPROB;
VAR Econ Safe Low;
WITH Conv Flex Dep;
RUN;

Can you see why group 1 all hangs together? What about group 2? Is there a question that looks like it would go well with either group? (Be careful reading that third correlation!) Also notice that none of the questions really strongly go in the "opposite direction" from the other questions (what would it look like if that were the case?)

If we wanted to, we could now use a DATA step to get each person's total score (add up their 1 to 25 values on the questions) for each group of items.

DATA twoscores;
SET sect2;
score1=Econ+Safe+Low;
score2=Conv+Flex+Dep;
KEEP score1 score2;
RUN;
PROC PRINT;
RUN;

We can see that the first person should get a 12+1+8=21 for the group 1 score and a 16+25+16=57 for the group 2 score. (The homework doesn't ask you to do this though!)

Using PROC INSIGHT: Once the data is loaded in, we can call PROC INSIGHT.

PROC INSIGHT;
OPEN sect2;
RUN;

Once the spreadsheet opens, you can get the correlation matrix by choosing Multivariate(YX) under the Analyze menu. Select all of the variables in the left hand box and then click the Y box to put them there. Click on the Method box and make sure Correlation Matrix is checked off. In the output window make sure that only the CORR line has a check by it (you might try Pairwise CORR to see what it does too). Clicking OK will then produce the correlation matrix. You can get fewer decimal places by clicking on the gray arrow in the upper right of the "Correlation Matrix" box, selecting format, and choosing 10.2 for example.

The interpretation of the matrix is exactly the same as in the PROC CORR discussion above: Group 1 is Economy, Safety, and Low Energy Use (and maybe Convenience and Dependability), and Group 2 is Convenience, Flexibility, and Dependability (and maybe Safety and Economy). Again you can check the remaining columns to see if there are any contradictions, and check out the correlations of the groups separately if you want to.

Try just choosing Econ, Safe, and Low for Y. Then try just choosing Conv, Flex, and Dep for Y. Finally choose Econ, Safe, and Low for Y and Conv, Flex, and Dep for X.


If we wanted to, we could now exit PROC INSIGHT and use the same DATA step shown above to compute each person's two total scores.


Multivariate Normality:

Data that follows a multivariate normal distribution comes from a very particular probability density function (whose formula we saw in class). This makes it relatively easy to generate random samples that should be multivariate normal. The following code generates a sample of size 1000 from a population where the first variable has mean 5, the second has mean 0, and the third has mean -1. The variances are 1, 1, and 4 respectively, with the covariance between 1 and 2 being 0.5, 1 and 3 being -0.2, and 2 and 3 being 0.

Note that we have to write our own program to do this in SAS (unlike R), and it would be a bit unpleasant to do from scratch. Of course you can just cut and paste it in and run it! The only things you might ever need to change are the covariance matrix, the mean vector, the sample size n, and the seed. The seed is similar to your starting position on a random number table... and if you use the same seed you will always get the same "random numbers"!

In any case the resulting data set is called mvnormdata (you should see where you need to go to change that), and the variables are simply called COL1, COL2, and COL3.


PROC IML;
* edit the covariance matrix, mean vector, sample size, and seed as needed;
sigma = {1 0.5 -0.2,
     0.5 1 0,
    -0.2 0 4};
mu = {5 0 -1};
n = 1000;
seed = 91305;
q=NROW(sigma);               * number of variables;
MUMAT=REPEAT(mu,n,1);        * n copies of the mean vector;
SROOT=ROOT(sigma);           * Cholesky root of the covariance matrix;
Z=NORMAL(REPEAT(seed,n,q));  * n x q matrix of standard normal values;
x=Z*SROOT+MUMAT;             * transform to the desired means and covariances;
CREATE mvnormdata FROM x;    * save the result as a SAS data set;
APPEND FROM x;
QUIT;

Unfortunately it is much more difficult to tell if a data set you have actually comes from a multivariate normal distribution. Generally we "cheat" and just check a few conditions that must be true if the data set is multivariate normal.

  1. Each of the variables is normal separately - Check using Q-Q plot
  2. Each pair of variables makes an ellipse - Check using a scatterplot (maybe with a bivariate box-plot on top)
  3. The set of generalized distances from each point to the center of the points is chi-square - Check using a chi-square plot.

Making the Q-Q plots: One way to make the Q-Q plots is to use PROC INSIGHT and choose QQ Ref Line... under the Curves menu. Recall you can open PROC INSIGHT either from the Solutions menu (choose Interactive Data Analysis) or by running something like:

PROC INSIGHT;
OPEN mvnormdata;
RUN;

Another way is to use PROC UNIVARIATE with a QQPLOT statement on a separate line (adding the option NORMAL(MU=EST SIGMA=EST) to the QQPLOT statement draws in the reference line).

PROC UNIVARIATE DATA=mvnormdata NOPRINT;
QQPLOT / NORMAL(MU=EST SIGMA=EST);
* the NORMAL option adds the reference line based on the estimated mean and sd;
RUN;

This will open up a GRAPH window that contains the three q-q plots.

As long as you used the same seed, you should all have the same (not-really-random) results, and they should look pretty close to normal. Most of the points are on the line, except for a few at the ends (remember this generated data has 1,000 observations... so just a few being off isn't bad).

Making the scatterplots: If you have many variables, you would probably want to make each scatterplot on a separate screen rather than put them all up at once (they would be too small to see!). In this case you could just use the Scatter Plot (YX) option under the Analyze menu in PROC INSIGHT. You can then choose one of the variables for X and another for Y to get each plot. If you want them all at once, simply choose all of them for X and all of them for Y.

If the data is really from a multivariate normal population then scatter plots should form ellipses with most of the points in the middle. All three of these look pretty good.

You can also get something with an idea similar to the bivariate boxplot we saw in the text. In fact the formula used to get the ellipse is based on the Hotelling's T-squared statistic that we'll see later, and it is probably actually better for checking multivariate normality. To see these, choose Multivariate (YX) under the Analyze menu. Choose all of the variables for X and all of them for Y, and under the Output menu make sure that 80% Pred. Conf. Ellipse is the only thing checked. Each resulting ellipse should contain about 80% of the values. Make sure to scroll down and to the right in order to see all of them. Also note that the slider at the bottom of the window lets you change the percentage, and you can also get a different percentage added on top under the Curves menu. (Choose Prediction, not Mean.)

Making the chi-square plot: The formula for the distances used to make the chi-square plot is in the middle of page 10 in the text. The following code produces the plot in a manner similar to the R code, only taking the diagonal of a big multiplied matrix instead of using a loop. The only thing you should ever need to change is the name of the data set mvnormdata on the second line (make it whatever data set you are using).

PROC IML;
USE mvnormdata;              * change this to whatever data set you are using;
READ ALL INTO X;
N=NROW(X);
SUMX=X[+,];                  * vector of column sums;
S=(T(X)*X-T(SUMX)*SUMX/N)/(N-1);   * sample covariance matrix;
XBAR=SUMX/N;
OBSVALS=VECDIAG((X-(J(N,1)*XBAR))*INV(S)*T(X-(J(N,1)*XBAR)));   * generalized distances;
EXPVALS=CINV((RANK(OBSVALS)/(N+1)),NCOL(X));   * corresponding chi-square quantiles;
CREATE CHIPLOT VAR {OBSVALS EXPVALS};
APPEND;
QUIT;

PROC GPLOT DATA=CHIPLOT;
PLOT EXPVALS*OBSVALS;
RUN;

If the data is really from a multivariate normal population, then the distances should be chi-squared, and this plot should look like a straight line. (Remember that small sample sizes aren't nearly as straight as big ones because of random fluctuation.)

In Conclusion: If the three checks above all look fairly good, then that is strong evidence that the population is (at least very approximately) multivariate normal, and we would be comfortable using any methods that made that assumption. If the checks failed a little then we would want to know how robust the method was. If the checks failed badly then we should probably not trust any method that needed multivariate normality!


Graphics

The following code provides some basic descriptive plots for the census data set. The first step is to load in the census data set at: http://www.stat.sc.edu/~habing/courses/data/census.txt by first saving it and then importing it, or by pasting it into the editor window and adding the needed lines.

Many of the graphics can be gotten using PROC GPLOT. For example, a basic plot of x (Birth) vs. y (HeartD).

PROC GPLOT DATA=census;
PLOT HeartD * Birth;
RUN;

To restrict the axes, the option is HAXIS for the horizontal (x) axis and VAXIS for the vertical (y) axis.

PROC GPLOT DATA=census;
PLOT HeartD * Birth / HAXIS=14 to 15;
RUN;
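
If you also wanted to restrict the vertical axis it would look something like the following (the VAXIS endpoints here are made up; pick values that suit the data):

PROC GPLOT DATA=census;
* the VAXIS endpoints below are made-up values for illustration;
PLOT HeartD * Birth / HAXIS=14 TO 15 VAXIS=100 TO 300;
RUN;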

This code adds random jitter to Birth (just like is done in R) and then produces the graph. Idiotically, there is no easy way to have a SAS DATA step give you the maximum and minimum of the Birth variable, so they get put in manually.
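
One way to look up the maximum and minimum you need to type in is a quick PROC MEANS:

PROC MEANS DATA=census MIN MAX;
VAR Birth;
RUN;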

DATA jitter;
SET census;
SEED=092005;
JitterB=Birth+RANUNI(SEED)*(29.9-8)/50;   * 8 and 29.9 are the min and max of Birth;
KEEP JitterB Birth HeartD;
RUN;

PROC GPLOT DATA=jitter;
PLOT HeartD*JitterB / HAXIS=14 to 15;
RUN;


Principal Components

The following instructions tell how to analyze the data in section 3.3 of the text book, first using PROC INSIGHT and then using some of the other PROCs. In either case the first step is to get the data from: http://www.stat.sc.edu/~habing/courses/data/usair.txt

Using PROC INSIGHT: To produce the output on page 53 using PROC INSIGHT you first need to open the data set (recall you could select Solutions > Analysis > Interactive Data Analysis... and choose the correct data set from the Work library).

The output screen can be gotten by selecting Analyze > Multivariate (YX). Choose the desired variables (all but City and SO2) as Y. Then click the Method button. If you want to use the correlation matrix, make sure it is checked in the upper left; if you want to use the covariance matrix, check that instead. Click OK to return to the previous window. Then click the Output button. In the upper left make sure that Corr is the only one checked. Then check the box in the upper right labeled Principal Components Analysis and click the button right underneath it.

In the Principal Components Options window choose Eigenvalues for the component table, choose All in both places that allow it, and finally check Std Scoring Coefs if you want to use the correlation matrix. Clicking OK several times will then produce the desired output.

Note that you should have automatically produced the output at the top of page 55, with dots (you can click on them to get the names). (Did you see the option if you wanted to get the graph for the first three principal components instead of just the first two?)

To get the output on the top of page 60, first note that the various principal components were automatically added to the spreadsheet. You can proceed by selecting Analyze > Fit (YX). Choose SO2 for Y and PCR1, PCR2, and PCR3 for X and click OK.

You can add the q-q plot by selecting Residual Normal QQ from the Graphs menu. If you want the quartiles for the residuals, choose Analyze > Distribution (Y) with R_SO2 as Y.

Using the Editor Window: One way to get the output from the top of page 53 is to use PROC PRINCOMP. The following code will perform the analysis using the correlation matrix and save the estimated component values in a data set called prin. (Add COV before the OUT on the first line to make it use the covariance matrix instead, as shown after this block.)

PROC PRINCOMP OUT=prin DATA=usair;
    VAR NegTemp Manuf Pop Wind Precip Days;
RUN;
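
The covariance version mentioned in the parenthetical would be:

PROC PRINCOMP COV OUT=prin DATA=usair;
    VAR NegTemp Manuf Pop Wind Precip Days;
RUN;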

We could then plot it like on page 55 using:

PROC GPLOT DATA=prin;
PLOT Prin2*Prin1=City;
RUN;

The regression analysis on the top of page 60 can be done using:

PROC GLM DATA=prin;
MODEL SO2=Prin1 Prin2 Prin3;
RUN;


Factor Analysis

The examples below use the life data set from page 78 of the text book (ch4life.txt) and the druguse.cor correlation matrix from page 83 (with code at: ch4druguse.txt).

Once we have loaded in the life data set, we could try to replicate the output on page 79 using the code below. Note that the example in the text uses the multivariate normal based maximum likelihood method (not the iterated principal factor or principal components methods hmwk 5 asks for) and uses a varimax rotation (instead of leaving it unrotated like hmwk 5 wants).

PROC FACTOR DATA=life
HEYWOOD
METHOD=ML
PRIORS=MAX
ROTATE=VARIMAX
NFACT=3
RESIDUALS
OUTSTAT=factout;
VAR m0 m25 m50 m75 w0 w25 w50 w75;
RUN;

Like many of the procedures in SAS, PROC FACTOR returns a lot of output we don't particularly care about. Some parts of particular note are the significance tests for the number of factors (from METHOD=ML), the factor pattern (the loadings), the final communality estimates, and the residual correlations requested by the RESIDUALS option.

To have SAS use the iterated principal factor method we would simply replace METHOD=ML with METHOD=PRINIT. To get the principal components method use METHOD=PRIN and PRIORS=ONE. To have only the unrotated solution, use ROTATE=NONE. These and many of the other options can be found using the built-in SAS help.
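
For example, an unrotated iterated principal factor run on the life data (the kind of combination hmwk 5 asks about) would look something like:

PROC FACTOR DATA=life
HEYWOOD
METHOD=PRINIT
PRIORS=MAX
ROTATE=NONE
NFACT=3;
VAR m0 m25 m50 m75 w0 w25 w50 w75;
RUN;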

The residual plots: It takes a bit of work to figure out how to make SAS give the various residual plots that we saw in R. Assuming you included the OUTSTAT line above, you can set up the data set you need using the code shown below.

While it LOOKS ugly, all you need to do is replace the m0 to w75 on the 2 KEEP and 2 VAR lines with the names of the variables you are actually using (e.g. X1 to X13 for the drug use example). NOTE: the names you put in must be in alphabetical order, so you might have to reorder them manually! Don't delete the vtemp in the first two though!

DATA tempc;
SET factout ;
vtemp=_NAME_;
KEEP vtemp m0 m25 m50 m75 w0 w25 w50 w75;
WHERE _TYPE_="CORR";
RUN;

DATA tempr;
SET factout ;
vtemp=_NAME_;
KEEP vtemp m0 m25 m50 m75 w0 w25 w50 w75;
WHERE _TYPE_="RESIDUAL";
RUN;

PROC TRANSPOSE DATA=tempc
OUT=tempc2;
VAR m0 m25 m50 m75 w0 w25 w50 w75;
BY vtemp;
RUN;

PROC TRANSPOSE DATA=tempr
OUT=tempr2;
VAR m0 m25 m50 m75 w0 w25 w50 w75;
BY vtemp;
RUN;

DATA tempc3;
SET tempc2;
pair=trim(vtemp)||trim(_NAME_);
original=COL1;
KEEP pair original;
WHERE vtemp>_NAME_;
RUN;

DATA tempr3;
SET tempr2;
pair=trim(vtemp)||trim(_NAME_);
residual=COL1;
KEEP pair residual;
WHERE vtemp>_NAME_;
RUN;

DATA fitdata;
MERGE tempc3 tempr3;
BY pair;
predicted=original-residual;
RUN;

All this code does is to make the data set fitdata that has everything you want in order to check the residuals.

Start up PROC INSIGHT using the data set fitdata.
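
For example, following the same pattern as before:

PROC INSIGHT;
OPEN fitdata;
RUN;

You can then, for example, choose Analyze > Scatter Plot (YX) with residual for Y and original or predicted for X to examine the residuals.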

To get the output from pages 85-87 you first need to copy the data commands from ch4druguse.txt into the editor window. You can then get the first part of the output using:

PROC FACTOR DATA=druguse
HEYWOOD
METHOD=ML
PRIORS=MAX
ROTATE=VARIMAX
NFACT=6
RESIDUALS
OUTSTAT=factout;
VAR X1-X13;
RUN;

Notice that it automatically recognizes that it is using a correlation matrix and not the raw data. To get tables 4.11 and 4.12 simply change the NFACT line.


Multidimensional Scaling and Distances

Chapter 5 contains quite a few different example data sets, and so while it's still fairly easy to read, it isn't as easy to see all that can be done with one data set. Instead, the data in Table 6.3 on page 124 is used both for most of the methods in this chapter and for the cluster analysis in Chapter 6. The data set contains the chemical composition of 45 specimens of Romano-British pottery (for some reason the text says 48, and the table skips number 40 and only goes to 46).

The data can be found at: http://www.stat.sc.edu/~habing/courses/data/pottery.txt.

The small data set in Table 5.2 on page 92 is also used to show the calculation of distances. The code for entering this data in the editor window is:

DATA ddat;
INPUT country $ birthrate deathrate;
CARDS;
Algeria 36.4 14.6
France 18.2 11.7
Hungary 13.1 9.9
Poland 19.0 7.5
NewZealand 25.5 8.8
;

Distances: The SAS procedures for multidimensional scaling and cluster analysis are both designed to work off of the raw distance matrices. PROC DISTANCE can calculate many of the distance matrices we saw in class.

Euclidean distance for the ddat data set would be gotten using:

PROC DISTANCE DATA=ddat OUT=ddist METHOD=EUCLID;
  VAR INTERVAL (birthrate deathrate);
  ID country;
RUN;

PROC PRINT DATA=ddist;
RUN;

The OUT=ddist specifies the name of the data set we want to save the distances in, the METHOD=EUCLID asks for ordinary Euclidean distance, and the INTERVAL signifies that the variables are numeric. Finally, the PROC PRINT is just so we can see what the distances look like on the screen. The help file contains a list of many of the other distances. For example, the maximum (Chebychev) distance is found using the option METHOD=CHEBYCHEV.

PROC DISTANCE DATA=ddat OUT=ddist METHOD=CHEBYCHEV;
  VAR INTERVAL (birthrate deathrate);
  ID country;
RUN;

PROC PRINT DATA=ddist;
RUN;

So the Chebychev distance from Algeria to France is the larger of (36.4-18.2)=18.2 and (14.6-11.7)=2.9.

The standardized (Karl Pearson) distance is calculated by adding STD=STD to the variable line as indicated below.

PROC DISTANCE DATA=ddat OUT=ddist METHOD=EUCLID;
  VAR INTERVAL (birthrate deathrate / STD=STD);
  ID country;
RUN;

PROC PRINT DATA=ddist;
RUN;

SAS doesn't easily calculate Mahalanobis distance between each pair of observations, and would probably require the use of PROC IML. (If anyone would like to convert the corresponding R code to something reasonable looking in SAS I would certainly post it here.)

Classical Multidimensional Scaling: SAS's built-in scaling PROC doesn't automatically perform classical multidimensional scaling, but it can be tricked into doing so. The following code will perform the multidimensional scaling on the pottery data with the standardized distances and produce a data set with both the estimated coordinates and the two data-labeling variables. It assumes you have the pottery data loaded in as pottery and that all of the variables have the same names as in the data set.

PROC DISTANCE DATA=pottery OUT=potdist METHOD=EUCLID;
  VAR INTERVAL (AL2O3 FE2O3 MGO CAO NA2O K2O TIO2 MNO BAO / STD=STD);
RUN;

PROC MDS DATA=potdist LEVEL=ABSOLUTE FIT=D FORMULA=1 DIM=9 OUT=COORDS OCONFIG;
RUN;

DATA potcoord;
MERGE pottery COORDS;
KEEP No Kiln Dim1-Dim9;
RUN;

PROC PRINT DATA=potcoord;
RUN;

In order to get these results you must set DIM equal to the number of variables, not the number of components you want. It isn't a problem to keep them all, since the later ones don't affect the first few in the classical method. No and Kiln are included here because they are the two identifying variables in this data set.

The plot of the first two components, the plot with the points identified by No, and the plot with the points identified by Kiln can be made as follows:

PROC PLOT DATA=potcoord;
PLOT Dim1*Dim2;
PLOT Dim1*Dim2='*' $ No;
PLOT Dim1*Dim2=Kiln;
RUN;

To see how many dimensions are needed when using classical multidimensional scaling we should use some measure of stress. The formula in class had a numerator that was the difference of the sums of squares; this is the value that classical multidimensional scaling minimizes for Euclidean distance. It is more common to use a value called Kruskal's Stress1, and this is probably what you should use; it uses the sum of squared differences as the numerator instead. The goal is to aim for a value of 0.1 or lower.
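
For reference, writing d_ij for the observed distances and d-hat_ij for the fitted distances, Kruskal's Stress1 is:

\[ \text{Stress}_1 = \sqrt{\frac{\sum_{i<j}\left(d_{ij}-\hat{d}_{ij}\right)^2}{\sum_{i<j} d_{ij}^2}} \]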

We can trick SAS into calculating this for us for various numbers of dimensions using the coordinate set we created above.

PROC MDS DATA=potdist LEVEL=ABSOLUTE MAXITER=0 FORMULA=1 DIM=2 INITIAL=potcoord;
PROC MDS DATA=potdist LEVEL=ABSOLUTE MAXITER=0 FORMULA=1 DIM=3 INITIAL=potcoord;
PROC MDS DATA=potdist LEVEL=ABSOLUTE MAXITER=0 FORMULA=1 DIM=4 INITIAL=potcoord;
PROC MDS DATA=potdist LEVEL=ABSOLUTE MAXITER=0 FORMULA=1 DIM=5 INITIAL=potcoord;
PROC MDS DATA=potdist LEVEL=ABSOLUTE MAXITER=0 FORMULA=1 DIM=6 INITIAL=potcoord;
PROC MDS DATA=potdist LEVEL=ABSOLUTE MAXITER=0 FORMULA=1 DIM=7 INITIAL=potcoord;
PROC MDS DATA=potdist LEVEL=ABSOLUTE MAXITER=0 FORMULA=1 DIM=8 INITIAL=potcoord;
RUN;

The Stress1 value is the "Badness-of-Fit Criterion" that appears on each screen. By the 0.1 criterion we would probably go with 4 dimensions, although 3 isn't much worse.

Isometric Multidimensional Scaling: The following code will perform 2-dimensional isometric (nonmetric) multidimensional scaling on the standardized distances and make the various plots in a manner similar to what R does.

PROC DISTANCE DATA=pottery OUT=potdist METHOD=EUCLID;
VAR INTERVAL (AL2O3 FE2O3 MGO CAO NA2O K2O TIO2 MNO BAO / STD=STD);
RUN;

PROC MDS DATA=potdist LEVEL=ORDINAL FIT=D FORMULA=1 DIM=2 OUT=COORDS OCONFIG;
RUN;

DATA potcoord;
MERGE pottery COORDS;
KEEP No Kiln Dim1-Dim2;
RUN;

PROC PLOT DATA=potcoord;
PLOT Dim1*Dim2;
PLOT Dim1*Dim2='*' $ No;
PLOT Dim1*Dim2=Kiln;
RUN;

It is important to notice that while this answer will have apparently different values than what R gives, they are actually approximately the same: if you rotate the picture and blow it up they will match. Since it is an isometric mapping, a multiple (a blown-up or shrunk copy) of a map is the same.

Also note that the output automatically gives the stress on the last line as the "Badness-of-Fit Criterion". This is the appropriate nonmetric measure of stress, and one recommendation is that 0.10 is fair, 0.05 is good, and 0.02 is excellent. If you changed the number of dimensions above to 3, 4, and 5, we would find that 2 dimensions is fair (0.090), 3 is almost good (0.053), 4 is good (0.034), and 5 is excellent (0.020). Depending on how you wanted to balance accuracy and simplicity you could probably argue for any of 3, 4, or 5 dimensions, and would certainly not want more than 5.
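
For example, the 3-dimensional fit just changes the DIM value (everything else stays the same):

PROC MDS DATA=potdist LEVEL=ORDINAL FIT=D FORMULA=1 DIM=3 OUT=COORDS OCONFIG;
RUN;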


MANOVA and Discriminant Analysis

The two population example in the text uses the data in table 7.1 as described on pages 137-139 of the text. This data set can be found at: tibet.txt.

The more-than-two-group example in the text uses the data found in table 5.8 on pages 100-102. This data set can be found at: skull.txt.

Checking Assumptions: The instructions for checking multivariate normality are discussed above. For the Tibet example you need to make 2*5=10 Q-Q plots, 2*(5 choose 2)=20 bivariate plots with ellipses, 2 chi-square plots, and 2 covariance matrices. All of these except the chi-square plots can be done fairly easily in PROC INSIGHT by selecting all of the numeric variables as the variables and the group variable as the Group. For the chi-square plots you will need to run the PROC IML code from above once for each group; a sketch follows.
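
A minimal sketch, assuming Type is coded 1 and 2 as in the text: split off each group with a WHERE statement, keeping only the numeric measurements, and then point the PROC IML chi-square code from the multivariate normality section at each piece (USE Tibet1 instead of USE mvnormdata).

DATA Tibet1;
SET Tibet;
* assuming Type takes the values 1 and 2 -- repeat with Type=2;
WHERE Type=1;
KEEP Length Breadth Height Fheight Fbreadth;
RUN;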

MANOVA: The following commands will perform the MANOVA on the Tibet data.

PROC GLM DATA=Tibet;
CLASS Type;
MODEL Length Breadth Height Fheight Fbreadth = Type;
MANOVA H=Type;
RUN;

This procedure will create a large number of extra pages of output. The material we want is at the bottom of the last page, where you see that all four of the test statistics have F-values of 4.84 and p-values of 0.0029. (The text book on page 140 says the F is 15.17; that is in error, see the MANOVA R templates page for why.)

The code for the skull data is similar.

PROC GLM DATA=skull;
CLASS EPOCH;
MODEL MB BH BL NH = EPOCH;
MANOVA H=EPOCH;
RUN;

Again, the desired results are on the bottom of the last page of output and should match table 7.3 on page 149.

Discriminant Analysis: The following code will conduct the discriminant analysis for the Tibet data as seen in part on pages 144 and 145.

PROC DISCRIM DATA=Tibet CANONICAL CROSSVALIDATE OUT=TibOut;
CLASS Type;
VAR Length Breadth Height Fheight Fbreadth;
RUN;

PROC PRINT DATA=TibOut;
RUN;

The first page of output indicates the count for each group (17 and 15) and the default priors (0.5 and 0.5). The third page gives the following:

                                                       Test of H0: The canonical correlations in
                   Eigenvalues of Inv(E)*H           the current row and all that follow are zero
                     = CanRsq/(1-CanRsq)
                                                     Likelihood Approximate
         Eigenvalue Difference Proportion Cumulative      Ratio     F Value Num DF Den DF Pr > F

       1     0.9301                1.0000     1.0000 0.51811582        4.84      5     26 0.0029

                                 NOTE: The F statistic is exact.

The p-value is for the test that at least one canonical discriminant function is significant, and it matches what was found in the MANOVA for Wilks' lambda. If there were more possible discriminant functions we would get tests of whether at least that many discriminant functions were needed.

The middle of page 5 gives the coefficients for the discriminant function (a multiple of the vector a' that is about one-quarter of the way down page 144, and the same vector that R would give).

                                   Raw Canonical Coefficients

                                   Variable              Can1

                                   Length        0.0477265915
                                   Breadth       -.0832479294
                                   Height        -.0027958415
                                   Fheight       0.0946949999
                                   Fbreadth      0.0948094010

The sixth page gives the linear coefficients found at the bottom of table 7.2 on page 145.

Page 7 gives the estimated accuracy ignoring the need for cross-validation, and page 8 gives the estimated accuracy if you take cross-validation into account.

Finally, the last set of pages gives the estimated probabilities of being in each group.

To get the estimated values for the two new skulls (like on pages 144 and 145) we could create a brief data set with those two values and put it in for the TESTDATA argument. Also, the output on page 145 from S-PLUS apparently doesn't use the right priors and uses the values from the data instead. The PRIORS PROPORTIONAL line uses the data proportions (the default is equal percentages for all groups).

DATA testTibet;
INPUT Length Breadth Height Fheight Fbreadth;
CARDS;
171.0 140.5 127.0 69.5 137.0
179.0 132.0 140.0 72.0 138.5
;

PROC DISCRIM DATA=Tibet TESTDATA=testTibet CANONICAL NOPRINT TESTOUT=TibTest;
CLASS Type;
VAR Length Breadth Height Fheight Fbreadth;
PRIORS PROPORTIONAL;
RUN;

PROC PRINT DATA=TibTest;
RUN;

The code for producing the output similar to that on pages 152 and 153 for the skulls data is:

PROC DISCRIM DATA=skull CANONICAL CROSSVALIDATE OUT=skullout;
CLASS EPOCH;
VAR MB BH BL NH;
RUN;

PROC PLOT DATA=skullout;
PLOT can2*can1='*' $ EPOCH;
RUN;

The main differences between the SAS output and the output on page 152 are that: SAS gives tests of how many discriminant functions are needed (one in this case); some of the raw canonical vectors have their signs reversed; the order of the variables in the classification tables is "alphabetical"; and the plot contains all of the individual observations.


Graphs Printing Small in Word?

If your graphs print out extremely small after you copy them to Word, you might be able to fix the problem by "opening and closing" the image. In Word, left-click once on the image, and select Edit Picture or Open Picture Object under the Edit menu. A separate window will open with the image in it. Simply choose Close Picture. It should now print out OK. This will also make the spacing between the characters in the labels look right if they were somewhat off.