STAT 530 - Fall 2008 - R Templates

The Basics of R
The oildata Example
Finding and Displaying Correlation Matrices
Multivariate Normality (Section 1.4)
Graphics (Chapter 2)
Principal Components Analysis (Chapter 3)
Factor Analysis (Chapter 4)
Multidimensional Scaling and Distances (Chapter 5)
Cluster Analysis (Chapter 6)
MANOVA and Discriminant Analysis (Chapter 7)
Canonical Correlation Analysis (Chapter 8)


The Basics of R:

R, like the commercial S-Plus, is based on the programming language S. It can be downloaded for free from www.r-project.org [At home, after reading the instructions, choose: CRAN, then a mirror site in the US, then Windows, then base, then R-2.7.2-win32.exe]. This page also has links to FAQs and other information about R.

One of the nice things about R (besides the fact that it's powerful and free) is that you can save everything you've done each time you finish. If you are working on a machine you have administrator access to, it will ask you if you want to save the workspace when you quit the program (choose yes, and all the entered data and functions will be there the next time you start it up). If you are on a networked machine where you don't have administrator access, or if you want to be able to access the saved data from several machines, then you can load and save the .Rdata file on a disk or pen-drive (use the Load Workspace and Save Workspace options under the File menu).
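
The same thing can also be done from within R itself; for example (the file name stat530.RData is just an illustration, use whatever name you like):

save.image("stat530.RData")
load("stat530.RData")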

At the heart of R are the various objects that you enter. At any time (when you are in R) you can type in the function objects() to see which ones you have entered previously. Presumably there aren't any there right now if you've never used R before. R also has many built-in objects, most of which are functions. Because these objects are always there, it will not list them when you type objects().

A list of the various functions for R can be found in the manuals that are installed with the program. The various manuals can be found under the Manuals submenu of the Help menu. In particular, the R Reference Manual lists not only all the functions in the base package of R, but also some of those in the supplementary packages as well. Because this manual is nearly 1000 pages long, you probably shouldn't print out a copy!!!!

The various supplementary packages are simply groups of functions that others wrote and decided to make available to the public. A list of those that you downloaded can be found under the Packages menu by choosing Load Package. The one we will use most often is the MASS library that is designed to accompany Venables and Ripley's Modern Applied Statistics with S. It can also be loaded in by typing library(MASS).

Once you know the name of a function, you can get help about it simply by typing help(functionname). In R, parentheses indicate either mathematical grouping, like they usually do, or that you are applying a function. Square brackets [ ] indicate that you are selecting an element from a string of values (a vector).
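
For example, to see the help page for the t-test function used further below you could type:

help(t.test)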

The following demonstrates some of the basic ideas of how R syntax works.

x<-c(5,4,3,2)

The <- is the command that assigns whatever is on the right to the name that is on the left. The c indicates that an array or vector is going to be entered. Simply typing x now will return those four values. Typing

x[3]

will return the third value,

mean(x)

will return the mean, and

objects()

will show you that you now have an object called x.

R also has a variety of graphical functions and statistical functions built in.

hist(x)

will construct a histogram, and

t.test(x,alt="greater",mu=5)

will test the null hypothesis mu=5 vs. the alternative that mu > 5. (If you happen to mistype a command and get an error, just hit the up arrow to bring it up again for editing).

Besides just getting the result of a function, we can store the results somewhere. For example

x.t<-t.test(x,alt="greater",mu=5)

will store the result of the t-test in x.t.

attributes(x.t)

will then show you the various properties of x.t. Thus,

x.t$p.value

will return the p-value.

You can also write your own functions in R. For example, if you wanted a function that returned sin(x)+x^2 we could write:

weird<-function(x){
y<-sin(x)+x^2
y}

Try weird(5) and weird(x). What is it doing? Try making the first line have (x=7) instead of just (x) and just running weird().
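
For instance, the version with the default value would look something like this (so that weird() with no argument uses x=7):

weird<-function(x=7){
y<-sin(x)+x^2
y}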


The oildata Example:

R is able to read data in both from files and from appropriately formatted web-pages. The following command will read in the data we saw in class on the 18th:

oildata<-read.table("http://www.stat.sc.edu/~habing/courses/data/oildata.txt",header=TRUE)

Typing oildata will show you the entire data set, and attributes(oildata) will give you a list of the various names associated with this data set. Neither of these is particularly helpful though since they more than fill up the entire screen.
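
A few more compact ways of getting an overview (one possibility) are:

dim(oildata)
names(oildata)
head(oildata)

These give the number of rows and columns, the variable names, and just the first few rows, respectively.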

Try the following to see if you can tell what they are returning:

oildata[,2]

oildata[2,]

oildata$Gender

We could also make subsets of the data. For example, try:

oildata[oildata[,2]=='1',c("Marital","Age","Econ","Conv","Q1")]

What does this seem to return? Why?

The apply function and its relatives make it easy to use a function on certain parts of a data set. For example

apply(oildata[,12:31],1,sum)

applies the sum to each row (the 1) of columns 12 through 31 of oildata.
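
Using a 2 instead of the 1 applies the function to each column instead; for example, something like the following should give the mean of each of those questions:

apply(oildata[,12:31],2,mean)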

The tapply function is related: tapply(oildata$Low,oildata$Gender,mean)

The first argument is the list being worked with, the second defines groups, and the third is the function to apply to each group.

R also has a great deal of graphical flexibility. For example, compare:

plot(oildata$Income,oildata$Low)

and

plot(as.factor(oildata$Income),oildata$Low)

or

plot(as.factor(oildata$Income),oildata$Q1)

Notice how it adjusted the type of plot produced based on the type of data (quantitative vs. qualitative/factor).


Finding and Displaying Correlation Matrices:

The following code will find the correlation matrix for section 2 of the oildata set. The first line reads in the entire data set and the second picks out only the desired variables. Lines three through six then display the correlation matrix with: the default number of decimal places, two decimal places, rounded to one decimal place, and as just positive or negative. Finally, the last line displays the values without duplication, leaving off the diagonal.

oildata<-read.table("http://www.stat.sc.edu/~habing/courses/data/oildata.txt",header=TRUE)
sect2<-oildata[,6:11]
cor(sect2)
round(cor(sect2),2)
round(cor(sect2),1)
sign(cor(sect2))
as.dist(round(cor(sect2),2))

Beginning with the first column we can see that: Economy goes strongly with Safety and Low Energy Use; weakly with Convenience and Dependability; and is opposite of Flexibility. This suggests a group of questions with:

Group 1: Economy, Safety, and Low Energy Use (and maybe Convenience and Dependability)

Going to the next column we see that: Convenience goes strongly with Flexibility and Dependability; a bit more weakly with Economy and Safety; and negatively with Low Energy Use. This suggests a group of questions with:

Group 2: Convenience, Flexibility, Dependability (and maybe Safety and Economy)

If you had to break them into only two groups, this should already give you a good idea of how to do it... and you can check the remaining columns to see if we get any contradictions. In fact we could check out the correlations of the groups separately if we wanted to.

group1<-sect2[,c("Econ","Safe","Low")]
group2<-sect2[,c("Conv","Flex","Dep")]
cor(group1)
cor(group2)
cor(group1,group2)

Can you see why group 1 all hangs together? What about group 2? Is there a question that looks like it would go well with either group? (Be careful reading that third correlation!) Also notice that none of the questions really strongly go in the "opposite direction" from the other questions (what would it look like if that were the case?)

If we wanted to, we could now use apply to get each person's total score (add up their 1 to 25 values on the questions) for each group of items. For example, by typing sect2[1,] we can see that the first person should get a 12+1+8=21 for the group 1 score and a 16+25+16=57 for the group 2 score. (The homework doesn't ask you to do this though!)
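
A sketch of one way to do this for everyone at once (using the group1 and group2 subsets defined above) would be:

apply(group1,1,sum)
apply(group2,1,sum)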


Multivariate Normality:

Data that follows a multivariate normal distribution comes from a very particular probability density function (whose formula we saw in class). This makes it rather easy to generate random samples that should be multivariate normal. The following code generates a sample of size 1000 from a population where the first variable has mean 5, the second has mean 0, and the third has mean -1. The variances are 1, 1, and 4 respectively, with the covariance between 1 and 2 being 0.5, 1 and 3 being -0.2, and 2 and 3 being 0. The first line loads in the library MASS that has some functions that we need; the second defines the mean vector; and the third defines the covariance matrix. The last line generates the data and stores it as x.

library(MASS)
mu<-c(5,0,-1)
sigma<-matrix(c(1,0.5,-0.2,
      0.5,1,0,
      -.2,0,4),
      ncol=3,byrow=T)
x<-mvrnorm(n=1000,mu,sigma)

Unfortunately it is much more difficult to tell if a data set you have actually comes from a multivariate normal distribution. Generally we "cheat" and just check a few conditions that must be true if the data set is multivariate normal.

  1. Each of the variables is normal separately - Check using Q-Q plot
  2. Each pair of variables makes an ellipse - Check using a scatterplot (maybe with a bivariate box-plot on top)
  3. The set of generalized distances from each point to the center of the points is chi-square - Check using a chi-square plot.
The functions to make the bivariate box-plot and chi-square plot can be loaded from the text book site into R using the source command shown below. If you are not connected to the internet, you can save that web-page as a text file on your computer, and then just source whatever you saved it as.

Loading the text book functions:

source("http://biostatistics.iop.kcl.ac.uk/publications/everitt/RSPCMA/functions.txt")

Making the q-q plots: The following code makes the Q-Q plots and uses the par(mfrow) command to print them all on the screen at once in a 2 row by 2 column matrix.

par(mfrow=c(2,2))
qqnorm(x[,1])
qqline(x[,1])
qqnorm(x[,2])
qqline(x[,2])
qqnorm(x[,3])
qqline(x[,3])
par(mfrow=c(1,1))

Each one of you will get somewhat different graphs with this example because you have different random data sets. In my example most of the points were on the line, except for a few at the ends (remember this generated data has 1,000 observations... so just a few being off isn't bad). Note that the last par(mfrow) command just makes sure that the next set of graphs you print out show up one at a time.

Making the scatterplots: If you have many variables, you would probably want to make each scatterplot on a separate screen and not put them all up at once (they would be too small to see!). In this case you could just use plot(x[,1],x[,2]) and then use 1,3 and 2,3. If you wanted them all at once the command pairs(x) makes a scatterplot matrix.

To get the bivariate boxplots we could just use bvbox(cbind(x[,1],x[,2]),xlab="x1",ylab="x2") in the same way. The only difference between this and plot is that we need to combine the columns we want together using cbind (that is, "bind the vectors together in a matrix").

Printing out the three bivariate boxplots on the same screen could be done by:

par(mfrow=c(2,2))
bvbox(cbind(x[,1],x[,2]),xlab="x1",ylab="x2")
bvbox(cbind(x[,1],x[,3]),xlab="x1",ylab="x3")
bvbox(cbind(x[,2],x[,3]),xlab="x2",ylab="x3")
par(mfrow=c(1,1))

If the data is really from a multivariate normal population then most of the data should be in the middle ellipse (which you probably can't see the boundary of in the small plots!), there should be a pretty even scatter between the inner ellipse and the outer one, and there should be very few outliers. All three of my plots looked pretty good (at the most there were 12 or so outliers out of the 1,000... which is about the 99% accuracy the book mentions on page 26).

Making the Chi-square plot: The formula for the distances used to make the chi-square plot is in the middle of page 10 in the text. We could program these into R using the following code (note that it assumes your data set is called x). The first line makes the mean vector xbar and the second makes the covariance matrix S. We then tell it to set up an empty vector called di2 that has 1000 zeros in it. The loop then calculates the distance formula from the text. Remember that t() is the transpose function, %*% is matrix multiplication, and solve() will find the inverse. We then sort the data, find the quantiles of the chi-square with 3 degrees of freedom (because there are three x variables), and plot it.

xbar<-apply(x,2,mean)
S<-var(x)
di2<-rep(0,1000)
for (i in 1:1000){
    di2[i]<-t(x[i,]-xbar)%*%
        solve(S)%*%(x[i,]-xbar)}
obsvals<-sort(di2)
expvals<-qchisq((1:1000)/1001,3)
plot(expvals,obsvals)
lines(c(0,20),c(0,20))

If the above code says something like non-conformable arguments try using the line x<-as.matrix(x) and running it again. It might be unhappy because we are trying to use a "table" as a "matrix"... even though we know it should be the same thing.

Of course, the textbook has a built in function to make this plot (check out page 12). So if you downloaded the textbook functions you could just use:

chisplot(x)

It gives basically the same plot, just without the line. (Can you figure out how to add the line?)
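
One way to add it (just reusing the lines command from the code above) would be:

chisplot(x)
lines(c(0,20),c(0,20))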

In any case, if the data is really from a multivariate normal population, then the distances should be chi-squared, and this plot should look like a straight line. (Remember that small sample sizes aren't nearly as straight as big ones because of random fluctuation.)

In Conclusion: If the three checks above all look fairly good, then that is strong evidence that the population is (at least very approximately) multivariate normal, and we would be comfortable using any methods that made that assumption. If the checks failed a little then we would want to know how robust the method was. If the checks failed badly then we should probably not trust any method that needed multivariate normality!


Graphics

The following code provides some basic descriptive plots for the census data set. The first three lines load the data set in and produce one variable with the birth rate and another with the heart disease rate.

census<-read.table("http://www.stat.sc.edu/~habing/courses/data/census.txt",header=TRUE)
birth<-census[,"Birth"]
heartd<-census[,"HeartD"]

Variations on the Scatter Plot

#plot of x (birth) vs. y (heartd)#
plot(birth,heartd)

#plot with x going only from 14 to 15#
plot(birth,heartd,xlim=c(14,15))

#jittering the values along the x (birth) axis#
jbirth<-jitter(birth)
plot(jbirth,heartd,xlim=c(14,15))

#plotting (the jittered values) with the 2nd column of census#
#as the list of labels (at 90% magnification)#
#the type="n" tells it to plot nothing at first, just set it up#
plot(jbirth,heartd,xlim=c(14,15),type="n")
text(jbirth,heartd,census[,2],cex=0.9)

#plot, with the standard regression line added#
#notice the order the variables are used in #
#the lwd=2 makes the line twice as thick#
plot(birth,heartd)
abline(lm(heartd~birth),lwd=2)

#the plot again, but with a nonparametric regression line#
plot(birth,heartd)
lines(lowess(birth,heartd),lwd=2)

#the plot with both regression lines#
#lty can make it dashed or dotted depending on the value#
plot(birth,heartd)
abline(lm(heartd~birth),lwd=2)
lines(lowess(birth,heartd),lty=2,lwd=2)

Adding a Third Variable

#for these examples we also use the percent over65#
over65<-census[,"Over65"]

#a plot with different sized circles whose#
#radii depend on the over65 variable#
#try help(symbols) for some other options#
symbols(birth,heartd,circles=over65,inches=0.1)

#a coplot where the population is divided#
#based on the over65 variable- the bottom left#
#plot corresponds to the first (leftmost group)#
coplot(heartd~birth|over65)

#the scatterplot matrix of all three variables#
#where cbind puts the three vectors together in a matrix#
pairs(cbind(birth,heartd,over65))

Density Plots (using bivden from the textbook)

#Begin by loading in the textbook functions and#
#estimating the density for the pair of variables#
source("http://biostatistics.iop.kcl.ac.uk/publications/everitt/RSPCMA/functions.txt")
den<-bivden(birth,heartd)

#The 3 dimensional surface plot#
persp(den$seqx,den$seqy,den$den,xlab="birth",ylab="heartd",zlab="den")

#theta controls the rotation in the x-y plane and phi tilts the whole thing#
persp(den$seqx,den$seqy,den$den,xlab="birth",ylab="heartd",zlab="den",theta=45,phi=20)

#a contour plot (topographic map)#
contour(den$seqx,den$seqy,den$den,nlevels=10)

3D Plot (using cloud from the lattice library)

#Begin by loading in the lattice library#
library(lattice)

#make the plot with over65 as Z and the others as X and Y#
cloud(over65~birth*heartd)


Principal Components

The following code analyzes the data in section 3.3 of the text book. To start with we can download the data set from the textbook web-site:

source("http://www.york.ac.uk/depts/maths/data/everitt/chap3usair.dat")

This stores the data as usair.dat just like in the text. The output on page 53 is gotten by using the commands found on page 52.

#Give the correlation matrix of all the variables but the first#
# Note that the minus notation means all of the variables but#
# the one(s) being subtracted.#
cor(usair.dat[,-1])

#Do the PCA analysis using the correlation matrix and store#
# it as usair.pc #
# Set cor=FALSE to use the covariance matrix instead#
# (Note from help(princomp) that FALSE (the cov) is the default)#
usair.pc<-princomp(usair.dat[,-1],cor=TRUE)

#Give the sd of each component and the percent of variance explained#
summary(usair.pc)

#Give the loadings (the coefficients for finding the principal#
# components in terms of the original data (standardized if cor=T#
# and centered if cor=F))#
loadings(usair.pc)

#One way to see all of the loadings could be done by using#
print(loadings(usair.pc),cutoff=0)

Note that we can get much of the same information using the eigen() function. In this case since we are using the correlation matrix the command would be:

eigen(cor(usair.dat[,-1]))

Notice that these values should match the output you got previously. (Recall that the eigenvalues are the variances, not the standard deviations.)
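
For example, one quick comparison (the name usair.eig is just for this illustration) would be:

usair.eig<-eigen(cor(usair.dat[,-1]))$values
sqrt(usair.eig)
usair.eig/sum(usair.eig)

The square roots should match the component standard deviations from summary(usair.pc), and the last line gives the proportion of variance explained by each component.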

To get the plot shown at the top of page 55 we could simply use the code on page 54. (Recall that sometimes the signs get flipped, so don't be surprised if you get a "mirror image" of the graphic in the book).

plot(usair.pc$scores[,1],usair.pc$scores[,2],xlab="PC1",ylab="PC2",type="n")
text(usair.pc$scores[,1],usair.pc$scores[,2],labels=abbreviate(row.names(usair.dat)),cex=0.7)

It is also ok to save the scores somewhere with a slightly shorter name:

usair.pred<-predict(usair.pc)
plot(usair.pred[,1],usair.pred[,2],xlab="PC1",ylab="PC2",type="n")
text(usair.pred[,1],usair.pred[,2],labels=abbreviate(row.names(usair.dat)),cex=0.7)

The output at the top of page 60 can be gotten using code similar to that on page 56.

#Make the y-variable SO2.#
SO2<-usair.dat[,1]

#Conduct a multiple regression to predict SO2 from the first#
# three principal components and save it in SO2.reg.#
# then write out the output and residual plots.#
SO2.reg<-lm(SO2~usair.pred[,1]+usair.pred[,2]+usair.pred[,3])
summary(SO2.reg)
par(mfrow=c(2,2))
plot(SO2.reg)
par(mfrow=c(1,1))

Of course we could also have put some of the other original variables on the right hand side of the ~ in the lm statement if we wanted to use those as the x-variables instead.
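
A sketch of that, arbitrarily using the second and third columns of usair.dat as the x-variables (whichever variables those happen to be), might look like:

SO2.reg2<-lm(SO2~usair.dat[,2]+usair.dat[,3])
summary(SO2.reg2)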


Factor Analysis

The functions we are using for factor analysis can be found at: www.stat.sc.edu/~habing/courses/530/fact.txt.
Special thanks to Jun Ni for finding some errors.

It can be loaded into R using:

source("http://www.stat.sc.edu/~habing/courses/530/fact.txt")

The examples below use the life data set from page 78 of the text book. The data can be loaded into R using:

source("http://www.york.ac.uk/depts/maths/data/everitt/chap4lifeexp.dat")

All of the factor analysis commands can simply be run using the function fact. This function can either be run on a raw data set using something like fact(datasetnamehere) or on a correlation matrix using something like fact(r=cormatrixnamehere).

The additional options are illustrated in the examples below: method= chooses the estimation method (e.g. "pca", "iter", or "norm"), rotation= requests a rotation (e.g. "varimax" or "quartimax"), maxfactors= sets the number of factors, and full=TRUE returns more complete output.

To get a 3 factor solution using the iterated factor method we could either use the raw data:

fact(life,method="iter",maxfactors=3)

or run it off of a correlation matrix:

lifecor<-cor(life)
fact(r=lifecor,method="iter",maxfactors=3)

Page 79 in the text differs in two ways. It uses the maximum likelihood estimation assuming the data is normal (method="norm") and it also uses a rotation (rotation="varimax"). The following command will essentially give the output in the text. (Note that you are not supposed to use the varimax rotation on homework 5).

fact(r=lifecor,method="norm",rotation="varimax",maxfactors=3)

The output should be fairly easy to follow.

In addition to looking at the variance explained and eyeballing the actual values of the residuals, we can also plot them. The fact function produces four graphical displays.

Looking at these plots, the three factor solution using the multivariate normal method looks pretty good.

If you hit the up arrow and replace the norm with iter it will repeat the analysis with the iterated principal factor method. The residual plot still looks fairly good... in fact the standard deviation of the residuals dropped from 0.015 to 0.006! There is a problem though (check the communalities). If this happens in real life you have a few choices: start with a smaller number of iterations (the warning tells you the last legal number) or just use the principal factor method itself, use the normal method, add more factors, remove some factors, or adjust the code to stop the communalities from exceeding 1. (SAS has an option that does this last one for all of its methods; our R code only has it for the multivariate normal method.)

To see some really troubling residual plots, hit the up arrow and replace the iter with pca and maxfactors with 1.

With just a little practice you should be able to quickly try several different numbers of factors and see how they compare in terms of communalities, variance explained, and residual plots.

If you have loaded in the GPArotation package you can also get the output in Table 4.5 on page 80:

fact(r=lifecor,method="norm",rotation="quartimax",maxfactors=3)

The answer will differ just slightly because S-PLUS (used to generate the table in the book) has a slightly different bound on the communalities/uniquenesses than R does. (And if you don't have GPArotation installed it will warn you and use varimax instead).
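
If you have never installed GPArotation, something like the following (run while connected to the internet) should install and load it:

install.packages("GPArotation")
library(GPArotation)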

To get output similar to that on page 85 through 87 you could use:

source("http://www.york.ac.uk/depts/maths/data/everitt/chap4druguse.dat")
druguse.cor
druguse.cor[11,12]<-.304
fact(r=druguse.cor,method="norm",rotation="varimax",maxfactors=6,full=TRUE)
fact(r=druguse.cor,method="norm",rotation="varimax",maxfactors=3,full=TRUE)$residuals
fact(r=druguse.cor,method="norm",rotation="varimax",maxfactors=4,full=TRUE)$residuals

Note that the line druguse.cor[11,12]<-.304 is needed because there appears to be an error on the author's web-site for this data (the matrix isn't symmetric).

The values differ very slightly from those in the text. It appears this is because the text's run might have been in S-Plus since it was not using a maximum communality of 0.995 like the default (check out the last uniqueness).


Multidimensional Scaling and Distances

Chapter 5 contains quite a few different example data sets, and so while it's still fairly easy to read it isn't as easy to see all that can be done with one data set. Instead the data in Table 6.3 on page 124 is used both for most of the methods for this chapter and for the cluster analysis in Chapter 6. The data set contains the chemical composition of 45 specimens of Romano-British pottery (for some reason the text says 48, and the table skips number 40 and only goes to 46).

The data can be loaded in using:

source("http://www.york.ac.uk/depts/maths/data/everitt/chap6pottery.data")
No<-c(1:39,41:46)
Kiln<-c(rep(1,21),rep(2,12),rep(3,2),rep(4,5),rep(5,5))

The small data set in Table 5.2 on page 92 is also used to show the calculation of distances. The code for entering this data is:

ddat<-matrix(c(36.4,14.6,18.2,11.7,13.1,9.9,19.0,7.5,25.5,8.8),byrow=T,ncol=2)
rownames(ddat)<-c("Algeria","France","Hungary","Poland","New Zealand")
colnames(ddat)<-c("Birth rate","Death rate")

Distances: When doing either multidimensional scaling or cluster analysis it is necessary to convert the data to a distance matrix. The function dist calculates Euclidean distance by default. So dist(ddat) returns the Euclidean distances at the bottom of page 92. Notice that it doesn't return the entire matrix or have the values rounded off. Try round(dist(ddat),2) and as.matrix(dist(ddat)). Also remember that we could assign the distance matrix to another variable name if we plan on having to retype it a lot: ddist<-dist(ddat).

Using help(dist) will show you several of the other distances it can be made to calculate automatically. For example dist(ddat,method="max") will set the distance to be the difference on one variable that is largest. So the distance between Algeria and France is the largest of (36.4-18.2)=18.2 and (14.6-11.7)=2.9.
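
As another example, something like the following should give the Manhattan (city-block) distances, which sum the absolute differences on each variable:

dist(ddat,method="manhattan")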

The following functions can be copied in to find the standardized (Karl Pearson) distance and (unsquared) Mahalanobis distance.

standardize<-function(x){
  (x-mean(x))/sd(x)}

KPdist<-function(x){
  x2<-apply(x,2,standardize)
  dist(x2)}

Mdist<-function(x){
  n<-nrow(x)
  d<-matrix(0,ncol=n,nrow=n)
  S<-solve(cov(x))
  for (i in 1:(n-1)){
  for (j in (i+1):n){
    d[i,j]<-sqrt(t(x[i,]-x[j,])%*%S%*%(x[i,]-x[j,]))
  }}
  d<-d+t(d)
  as.dist(d)
}

They can be run using KPdist(ddat) and Mdist(ddat) respectively... but of course won't work if you don't copy the functions in first!

Classical Multidimensional Scaling: The following code will perform classical multidimensional scaling on the above pottery data. (Make sure you have copied the above standardize and KPdist functions in first.)

#Perform a two-dimensional classical multidimensional scaling on the
# standardized distances
pot.cmd<-cmdscale(KPdist(pottery.data),k=2)

#Plot the scaling result on the first two coordinates
plot(pot.cmd[,1],pot.cmd[,2],xlab="Coordinate 1",ylab="Coordinate 2")

#Same plot, but using the observation number for the label
plot(pot.cmd[,1],pot.cmd[,2],xlab="Coordinate 1",ylab="Coordinate 2",type="n")
text(pot.cmd[,1],pot.cmd[,2],as.character(No),cex=0.8)

#Same plot, but using the kiln number for the label
plot(pot.cmd[,1],pot.cmd[,2],xlab="Coordinate 1",ylab="Coordinate 2",type="n")
text(pot.cmd[,1],pot.cmd[,2],as.character(Kiln),cex=0.8)

To see how many dimensions are needed when using classical multidimensional scaling we should use some measure of stress. The formula in class had a numerator that was the difference of the sum of squares. This is the value that classical multidimensional scaling minimizes for Euclidean distance. It is more common to use a value called Kruskal's Stress1, and this is probably what you should use. It uses the square of the difference as the numerator instead. The goal is to aim for a value of 0.1 or lower.

stress1<-function(datadist,fitteddist){
  sqrt(sum((datadist-fitteddist)^2)/sum(datadist^2))}

To use this function we need to put in the original data distance we are using as the first argument and the euclidean distance of the estimated values as the second.

#Kruskal's Stress1 for 2 to 8 dimensions (9 will be perfect).
# This code fits a nine dimensional scaling
# and then just selects off the lower number of dimensions.
# Typing these manually would be easy using the up arrow.
#
pot.cmd9<-cmdscale(KPdist(pottery.data),k=9)
stress1(KPdist(pottery.data),dist(pot.cmd9[,1:2]))
stress1(KPdist(pottery.data),dist(pot.cmd9[,1:3]))
stress1(KPdist(pottery.data),dist(pot.cmd9[,1:4]))
stress1(KPdist(pottery.data),dist(pot.cmd9[,1:5]))
stress1(KPdist(pottery.data),dist(pot.cmd9[,1:6]))
stress1(KPdist(pottery.data),dist(pot.cmd9[,1:7]))
stress1(KPdist(pottery.data),dist(pot.cmd9[,1:8]))

By this criterion we would probably go with 4 dimensions, although 3 isn't much worse.

An alternative measure of fit for classical multidimensional scaling is the value given at the bottom of page 95. R automatically calculates this value using the eig=T option. For example:

cmdscale(KPdist(pottery.data),k=2,eig=T)$GOF[2]
cmdscale(KPdist(pottery.data),k=3,eig=T)$GOF[2]
cmdscale(KPdist(pottery.data),k=4,eig=T)$GOF[2]
cmdscale(KPdist(pottery.data),k=5,eig=T)$GOF[2]
cmdscale(KPdist(pottery.data),k=6,eig=T)$GOF[2]
cmdscale(KPdist(pottery.data),k=7,eig=T)$GOF[2]
cmdscale(KPdist(pottery.data),k=8,eig=T)$GOF[2]
cmdscale(KPdist(pottery.data),k=9,eig=T)$GOF[2]

By the book's rule of thumb that 0.8 or larger is good we could stop with 3 dimensions.

Finally, note that the classical multidimensional scaling with Euclidean distance gives the same result as principal components with covariances, and classical multidimensional scaling with standardized distance gives the same result as principal components with the correlations...

cmdsol<-cmdscale(dist(pottery.data),k=2)
pcasol<-predict(princomp(pottery.data,cor=F))[,1:2]
cbind(cmdsol,pcasol)

... except maybe for the signs.

Isometric Multidimensional Scaling: The following code will perform isometric multidimensional scaling on the above pottery data. (Make sure you have copied the above standardize and KPdist functions in first... and you will need to load in the MASS library.)

#Load in the MASS library#
library(MASS)

#Perform a two-dimensional isometric multidimensional scaling on the
# standardized distances
pot.iso<-isoMDS(KPdist(pottery.data),k=2)

#Plot the scaling result on the first two coordinates
plot(pot.iso$points[,1],pot.iso$points[,2],xlab="Coordinate 1",ylab="Coordinate 2")

#Same plot, but using the observation number for the label
plot(pot.iso$points[,1],pot.iso$points[,2],xlab="Coordinate 1",ylab="Coordinate 2",type="n")
text(pot.iso$points[,1],pot.iso$points[,2],as.character(No),cex=0.8)

#Same plot, but using the kiln number for the label
plot(pot.iso$points[,1],pot.iso$points[,2],xlab="Coordinate 1",ylab="Coordinate 2",type="n")
text(pot.iso$points[,1],pot.iso$points[,2],as.character(Kiln),cex=0.8)

#Unlike the classical method a different number of dimensions will change
# the result on the lower ones. Compare the last plot to the one with 3 dimensions
pot.iso3<-isoMDS(KPdist(pottery.data),k=3)
text(pot.iso3$points[,1],pot.iso3$points[,2],as.character(Kiln),font=4)

Note that the isoMDS function automatically calculates an appropriate non-metric measure of stress multiplied by 100. For example:

initial value 7.761548
iter 5 value 5.420941
iter 10 value 5.264936
final value 5.252784
converged

This says that the initial stress was 0.07761548, after 5 iterations of moving the points around it was 0.05420941, after 10 it was 0.05264936, and the final solution has 0.05252784. The scale recommended for evaluating this stress statistic is that a value of 0.10 is fair, a value of 0.05 is good, and a value of 0.02 is excellent.

Using code like:

isoMDS(KPdist(pottery.data),k=2)$stress

and replacing the k's with progressively more dimensions shows that 2 dimensions is fair (.090), 3 is almost good (0.053), 4 is good (0.034), and 5 is excellent (0.019). Depending on how you wanted to balance accuracy and simplicity you could probably argue for any of 3, 4, or 5 dimensions and would certainly not want more than 5.


Cluster Analysis

The examples in this chapter again use the data in Table 6.3 on page 124: the chemical composition of 45 specimens of Romano-British pottery.

Both this data set and the information on the distance measures can be found above in the section on multidimensional scaling and distances.

Hierarchical Cluster Analysis: The following code will conduct the hierarchical cluster analysis using the standardized (Karl Pearson) distance and the furthest-neighbor (complete linkage) method, and plot it using both the observation numbers and the kiln numbers.

#Perform the hca using the "complete" linkage
pottery.hca<-hclust(KPdist(pottery.data),method="complete")

#Construct the dendrogram labeled with the observation number
plot(pottery.hca,labels=No,cex=0.75)

#Construct the dendrogram labeled with the kiln number
plot(pottery.hca,labels=Kiln,cex=0.75)

Other linkages include: "single", "average", "centroid", and "ward".

The function cutree can be used to see which items would be grouped together by the hierarchical cluster analysis for different numbers of clusters. For example: cutree(pottery.hca,3) will list how the different samples would be divided into three clusters.
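
One way to see how a given number of clusters lines up with the kilns is a quick cross-tabulation (this assumes the Kiln vector defined in the previous section):

table(cutree(pottery.hca,3),Kiln)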

K-means Clustering: The K-means clustering function kmeans works off of the data matrix and not a distance matrix, and must be given the number of clusters that are desired. It may also be given an initial set of cluster centers or be told to use a certain number of random centers.

The following code would find 5 clusters for the standardized pottery data, using 10 random starting points to look for the best cluster from.

#Standardize the variables (column by column, as in the KPdist function above)
potteryz<-apply(pottery.data,2,standardize)

#Do the clustering
pottery.k<-kmeans(potteryz,centers=5,nstart=10)

#See the cluster assignments and the within group sum of squares
pottery.k$cluster
pottery.k$withinss

#Make the two-dimensional isometric scaling plot with the
# cluster assignments as the labels
library(MASS)
pot.iso<-isoMDS(KPdist(pottery.data),k=2)
plot(pot.iso$points[,1],pot.iso$points[,2],xlab="Coord 1",ylab="Coord 2",type="n")
text(pot.iso$points[,1],pot.iso$points[,2],labels=as.character(pottery.k$cluster))

It is important to note that these five groups we found might not match up with the five Kilns!

par(mfrow=c(1,2))
plot(pot.iso$points[,1],pot.iso$points[,2],xlab="Coord 1",ylab="Coord 2",type="n",main="by Kiln")
text(pot.iso$points[,1],pot.iso$points[,2],labels=as.character(Kiln))
plot(pot.iso$points[,1],pot.iso$points[,2],xlab="Coord 1",ylab="Coord 2",type="n",main="by Cluster Solution")
text(pot.iso$points[,1],pot.iso$points[,2],labels=as.character(pottery.k$cluster))
par(mfrow=c(1,1))


MANOVA and Discriminant Analysis

The two population example in the text uses the data in table 7.1 as described on pages 137-139 of the text. This data set Tibet can be loaded in using: source("http://www.york.ac.uk/depts/maths/data/everitt/chap7tibetskull.dat")   . The individual columns can be referred to as variables on their own by using attach(Tibet). Once that is done you can use Length, Breadth, Height, Fheight, Fbreadth and Type as variables on their own.

The more than two group example in the text uses the data found in table 5.8 on pages 100-102. This data set skulls consists of the measurements of Egyptian skulls from five different epochs. It can be loaded in using: source("http://www.york.ac.uk/depts/maths/data/everitt/chap5skulls.dat") . Again, using attach(skulls) will let you refer to the columns by their names EPOCH, MB, BH, BL, and NH.

Checking Assumptions: The commands for checking multivariate normality are discussed above. The cov command can be used to get the covariance matrix. In both cases remember to only use the columns of the data set that you want and not the group variable too. We could make the separate groups for the Tibet data using:

tib1<-Tibet[Tibet[,6]=="1",1:5]
tib2<-Tibet[Tibet[,6]=="2",1:5]

You then need to make 2*5=10 qq-plots, 2*(5 choose 2)=20 bivariate box plots, 2 chi-square plots, and 2 covariance matrices.
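
As a sketch of one way to put all 10 qq-plots on a single screen (using the tib1 and tib2 subsets above), you could try:

par(mfrow=c(2,5))
for (j in 1:5){
    qqnorm(tib1[,j])
    qqline(tib1[,j])}
for (j in 1:5){
    qqnorm(tib2[,j])
    qqline(tib2[,j])}
par(mfrow=c(1,1))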

MANOVA: For the Tibet data, page 140 gives the code for calculating Hotelling's T-squared and its p-value directly... except for errors in the second and third lines from the bottom. The correct formulas for these lines can be found by looking at display 7.1:

T2<-(l1*l2/(l1+l2))*t(m1-m2)%*%solve(S123)%*%(m1-m2)
Fstat<-(l1+l2-5-1)*T2/((l1+l2-2)*5)

This results in a T2 value of 27.90211, an F statistic of 4.836366, and a p-value of 0.002936228.

This can also be replicated using the manova function (notice that all of the p-values are the same).

tib.manova<-manova(cbind(Length,Breadth,Height,Fheight,Fbreadth)~Type)
summary(tib.manova,test="Wilks")
summary(tib.manova,test="Roy")
summary(tib.manova,test="Hotelling-Lawley")
summary(tib.manova,test="Pillai")

The similar code for analyzing the skulls data can be found at the bottom of page 148 with the results shown on 149. Note that you don't need to use the attach and cbind if you don't want to, and instead could just refer to the columns of the data set and use an as.matrix on the left hand side of the ~:

skulls.manova<-manova(as.matrix(skulls[,2:5])~skulls[,1])
summary(skulls.manova,test="Pillai")

Discriminant Analysis: The function for Fisher's linear discriminant analysis is in the MASS library and so library(MASS) needs to be run this session if it hasn't been already. The following code will then perform a discriminant analysis on the Tibet data similar to what is being described on pages 144-146.

# Perform the discriminant analysis with both groups being predicted
#   to be equally likely. Leaving out the prior command makes the prior
#   be the percentages in the original data.
tib.lda<-lda(Type~Length+Breadth+Height+Fheight+Fbreadth,prior=c(0.5,0.5))
tib.lda

# Notice that the values for the coefficients of linear discriminants
#   don't match those given for a' about 1/4 of the way down page 144.
# They are multiples of each other though and will give the
#   same classifications.
#
# Also note that it makes no sense to do a plot for this one because there
#   are only two groups, and so can only be one function!

# Following through the example starting with the last four lines
#   of page 144 where we try to estimate where two new skulls would go.
#   the new data must be in the form of a data frame... which is kind
#   of annoying, but not that hard to set up.
skull1<-c(171,140.5,127,69.5,137)
skull2<-c(179,132,140,72,138.5)
newdata<-rbind(skull1,skull2)
colnames(newdata)<-colnames(Tibet[,1:5])
newdata<-as.data.frame(newdata)
predict(tib.lda,newdata)

# We see that we are 75.45% certain skull 1 belongs to group 1
#   and 82.59% certain that skull 2 belongs to group 2 assuming normality
#   and equal covariances.
# The reason this differs from the book is that they again
#   apparently used S-Plus and it seems to forget what you set the prior
#   to be. If you ran predict(tib.lda,newdata,prior=c(17/32,15/32))
#   you would get the table at the bottom of page 145.

# The following code will predict which group all of the skulls
#   seem to belong and give the percentages assuming normality.
predict(tib.lda)

# Table of the percentage correct (without using cross-validation)
#   as on the bottom of page 146.
table(Original=Type,Predicted=predict(tib.lda)$class)

# Calculating the jack-knifed (leave-one-out) error rate, again
#   with the 50/50 prior and assuming normality and equal covariances
tib.cv<-lda(Type~Length+Breadth+Height+Fheight+Fbreadth,prior=c(0.5,0.5),CV=T)
table(Original=Type,Predicted=tib.cv$class)

# Notice that it did not perform as well!

# Finally, relating to the homework we could see which of the
#   variables is most related to the first discriminant function
#   by knowing that the values on the discriminant function are
#   gotten by using predict(name of the lda output)$x
cor(predict(tib.lda)$x,Tibet[,1:4])

# In this case Fheight predicts the most (closest to +/-1)

The following code will construct output similar to that on pages 152 and 153 for the skulls data.

# Conduct the lda and get the group means and canonical coefficients
#   from table 7.4.
skull.lda<-lda(EPOCH~MB+BH+BL+NH)
skull.lda

# Get the "plug-in" classification table and the cross validation
#   tables from the bottom of table 7.4.
table(Original=EPOCH,Predicted=predict(skull.lda)$class)
skull.cv<-lda(EPOCH~MB+BH+BL+NH,CV=T)
table(Original=EPOCH,Predicted=skull.cv$class)

# Figure 7.2 on page 153 plots the MEANS of the five epochs
#   on the first two discriminant functions using the code
#   earlier on that page. The following code plots all of
#   the observations.
plot(predict(skull.lda)$x[,1:2],xlab="Disc 1",ylab="Disc 2",type="n")
text(predict(skull.lda)$x[,1:2],labels=skulls[,1],cex=0.5)


Canonical Correlation Analysis

The function given below will produce canonical correlation output similar to that given in SAS. To do this you need to cut and paste in the entire function as given below. To produce output corresponding to that on Homework 10 you would then read in the data set, set up the two groups of variables, and run the function.

cities<-read.table("http://www.stat.sc.edu/~habing/courses/data/cities.txt",head=T)
attach(cities)
x<-cbind(PRECIP,JANTEMP,JULYTEMP,HC,NOX,SO2,HUMIDITY)
y<-cbind(MORTAL,OVER65,HOUSE,EDUC,SOUND,DENSITY,NONWHITE,WHITECOL,POOR)
cancor2(x,y)

Of course that last line of code won't work if you didn't copy in the following function first!

cancor2<-function(x,y,dec=4){
#Canonical Correlation Analysis to mimic SAS PROC CANCOR output.
#Basic formulas can be found in Chapter 10 of Mardia, Kent, and Bibby (1979).
# The approximate F statistic is exercise 3.7.6b.
    x<-as.matrix(x)
    y<-as.matrix(y)
    n<-dim(x)[1]
    q1<-dim(x)[2]
    q2<-dim(y)[2]
    q<-min(q1,q2)
    S11<-cov(x)
    S12<-cov(x,y)
    S21<-t(S12)
    S22<-cov(y)
    E1<-eigen(solve(S11)%*%S12%*%solve(S22)%*%S21)
    E2<-eigen(solve(S22)%*%S21%*%solve(S11)%*%S12)
    rsquared<-as.real(E1$values[1:q])
    LR<-NULL
    pp<-NULL
    qq<-NULL
    tt<-NULL
    for (i in 1:q){
        LR<-c(LR,prod(1-rsquared[i:q]))
        pp<-c(pp,q1-i+1)
        qq<-c(qq,q2-i+1)
        tt<-c(tt,n-1-i+1)}
    m<-tt-0.5*(pp+qq+1)
    lambda<-(1/4)*(pp*qq-2)
    s<-sqrt((pp^2*qq^2-4)/(pp^2+qq^2-5))
    F<-((m*s-2*lambda)/(pp*qq))*((1-LR^(1/s))/LR^(1/s))
    df1<-pp*qq
    df2<-(m*s-2*lambda)
    pval<-1-pf(F,df1,df2)
    outmat<-round(cbind(sqrt(rsquared),rsquared,LR,F,df1,df2,pval),dec)
    colnames(outmat)=list("R","RSquared","LR","ApproxF","NumDF","DenDF","pvalue")
    rownames(outmat)=as.character(1:q)
    xrels<-round(cor(x,x%*%E1$vectors)[,1:q],dec)
    colnames(xrels)<-apply(cbind(rep("U",q),as.character(1:q)),1,paste,collapse="")
    yrels<-round(cor(y,y%*%E2$vectors)[,1:q],dec)
    colnames(yrels)<-apply(cbind(rep("V",q),as.character(1:q)),1,paste,collapse="")
    list(Summary=outmat,XUCorrelations=xrels,YVCorrelations=yrels)
   }

R has its own cancor function (hence the number 2 at the end of the above function name) but it doesn't produce the test statistics or the correlations between the original variables and the canonical variates.
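
As a quick check, the first piece of output from R's own function should match the R column of the summary produced by cancor2:

cancor(x,y)$cor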