Stat 530 - Fall 2001 - R Templates

The Basics of R (in class 8/28/01)
Plotting Symbols, Multiple Plots, Regression, Box Plots
Analyzing the Draft Data (in class 9/4/01)
Star Plots and Multiple Regression (for use on Homework 2)
Chernoff-like Faces (as designed in class)
Why is it Drawing Borders on My Faces?
A Few Matrix Commands (for use on Homework 3)
Finding Distances (for use on Homework 4)
Principal Components (in class 10/4/01)
Discriminant Analysis (in class 10/24/01)
The Jackknife and Logistic Regression (as seen in class)
Cluster Analysis (in class 11/6/01)
Distances Revisited
Multidimensional Scaling (in class 11/27/01)
More MDS (Plotting 3rd dimension and Sammon)
Trees and the Tree Library
Computer Trouble?


The Basics of R:

R, like the commercial S-Plus, is based on the programming language S. It can be downloaded for free from www.r-project.org [at home, after reading the instructions, choose: CRAN, then Windows (95 and later), then base, then SetupR.exe]. This page also has links to FAQs and other information about R.

One of the nice things about R (besides the fact that it's powerful and free) is that you can save everything you've done after you've finished each time. Since you don't have access to the C drive on your accounts, you will need to manually load the .Rdata file from your Z drive each time you start the program and save it back there each time you finish. You would, of course, have to save the .Rdata file on a disk if you wanted to use it both at home and at school.

At the heart of R are the various objects that you enter. At any time (when you are in R) you can type in the function objects() to see which ones you have entered previously. Presumably there aren't any there right now. R also has many built in objects, most of which are functions. Because these objects are always there, it will not list them when you type objects().

A list of the various functions for R can be found in the manuals that are installed with the program. The various manuals can be found under the Manuals submenu of the Help menu. In particular, the R Reference Manual lists not only all the functions in the base package of R, but also some of those in the supplementary packages as well. Because this manual is nearly 1000 pages long, you probably shouldn't print out a copy!!!!

The various supplementary packages are simply groups of functions that others wrote and decided to make available to the public. A list of those that you downloaded can be found under the Packages menu by choosing Load Package. The two we will use most often are mva and MASS. They can also be loaded in by typing library(mva) and library(MASS). Additional packages can also be found on the R home site and downloaded. Because you don't have access to the C drive though, you have to download them to the correct place and give the correct command to open them.

Once you know the name of a function, you can get help about it simply by typing help(functionname). In R, parentheses indicate either mathematical grouping, like they usually do, or that you are applying a function. Square brackets [ ] indicate that you are selecting an element of a vector or other object.

This brief example shows some of R's flexibility.

x<-c(5,4,3,2)

The <- is the command that assigns whatever is on the right to the name that is on the left. The c indicates that an array or vector is going to be entered. Simply typing x now will return those four values. Typing x[3] will return the third value, mean(x) will return the mean, and objects() will show you that you now have an object called x.

R also has a variety of graphical functions and statistical functions built in. hist(x) will construct a histogram, and t.test(x,alternative="greater",mu=5) will test the null hypothesis mu=5 vs. the alternative that mu > 5. (If you happen to mistype a command and get an error, just hit the up arrow to bring it up again for editing.)

Besides just getting the result of a function, we can store the results somewhere. For example x.t<-t.test(x,alternative="greater",mu=5) will store the result of the t-test in x.t. attributes(x.t) will then show you the various properties of x.t. Thus, x.t$p.value will return the p-value.

You can also write your own functions in R. For example, if we wanted a function that returned sin(x)+x^2 we could write:

weird<-function(x){
y<-sin(x)+x^2
return(y)}

Try weird(5) and weird(x). What is it doing?

R is able to read data in both from files and from appropriately formatted web-pages. The following command will read in the data we saw in class the other day:

rivers<-read.table("http://www.stat.sc.edu/~habing/courses/data/rivers.txt",header=TRUE)

Typing attributes(rivers) will give you a list of the various names associated with this data set. Try the following to see if you can tell what they are returning: rivers[,4], rivers[4,], rivers$Discharge.

We could also make subsets of the data. For example, try:

rivers[rivers[,3]=='Eu',4]

What does this seem to return? Why?

We could use this to return the mean Discharge for each continent by simply putting the above line inside of mean( ).
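
For example, the mean Discharge for Europe would be

mean(rivers[rivers[,3]=='Eu',4])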

Trying

plot(rivers$Cont,rivers$Discharge)

will give a graph of the Discharge for each continent. (It gives box-and-whisker plots because the x-variable is qualitative.)

We could also use this to help in graphing two of the continuous variables against each other. Say we want to plot the Density (column 7) against the NO3 (column 8). We could type:

plot(rivers$Density,rivers$NO3)

There is no way to tell if there is a difference based on each continent though.

plot(rivers[rivers[,3]=='Eu',7],rivers[rivers[,3]=='Eu',8],pch='E')

Now, why will there be trouble if we do the same thing, but replace 'Eu' with 'As' and the 'E' with 'A'? Use the up arrow and try.

Try putting par(new=T) between the two commands. Now what's the problem?

Try lines like the following:

plot(rivers[rivers[,3]=='Eu',7],rivers[rivers[,3]=='Eu',8],pch='E', xlim=c(min(rivers[,7]),max(rivers[,7])),xlab="Density", ylim=c(min(rivers[,8]),max(rivers[,8])),ylab="NO3")

Can you now plot all the graphs overlaid on each other?
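
One possible approach (the plotting letters here are just my choice, and the continent codes are the ones used later on this page) is to loop over the continents:

conts<-c("SA","Af","Au","As","Eu","NAm")
letts<-c("S","f","u","A","E","N")
for (i in 1:6){
  plot(rivers[rivers[,3]==conts[i],7],rivers[rivers[,3]==conts[i],8],pch=letts[i],
    xlim=c(min(rivers[,7]),max(rivers[,7])),xlab="Density",
    ylim=c(min(rivers[,8]),max(rivers[,8])),ylab="NO3")
  par(new=T)
}
par(new=F)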

If you wanted to, you could add various lines to the graph. To add a horizontal line for the mean N03 in Europe you would use the command

euromean<-mean(rivers[rivers[,3]=='Eu',8])
lines(c(min(rivers[,7]),max(rivers[,7])),c(euromean,euromean))

Two types of graphs talked about in Jacoby's book are given by:

stars(rivers[,4:12])

and

pairs(rivers[,4:12])


Some More Plotting Commands

In class on the 28th we saw the following commands:

plot(x,y) makes a scatterplot (Jacoby page 1-2)
par(new=T) tells R not to erase the previous plot when it draws the new one.
lines(c(x1,x2,...,xn),c(y1,y2,...,yn)) If you already have a plot open, this will draw the line segments to connect (x1,y1) to (x2,y2) to ... to (xn,yn). You don't need to use par(new=T).
stars(datamatrix) makes a star plot (Jacoby page 22-25)
pairs(datamatrix) makes a scatterplot matrix (Jacoby page 38-42)

A variety of other plotting commands are discussed below. Many use the data set rivers that we used on the 28th.

Plotting Symbols: As we saw on the 28th, the option pch="a" would plot the character a for each point, instead of simply a dot. If instead we used a number (not in quotes) we would get one of several types of symbols. Try the following to see which symbols are available. (The type="n" means that nothing should be plotted except the axes, labels, and the title. The command points just adds points to a previously drawn graph.)

plot(5,5,xlim=c(0,18),ylim=c(0,18),xlab="",ylab="",main="Plot Symbols",type="n")

par(new=T)
plot(1,1,xlim=c(0,18),ylim=c(0,18),xlab="",ylab="",pch=1)
points(2,2,pch=2)
for (i in 3:18){
points(i,i,pch=i)
}

Multiple Plots on the Screen: par(mfrow=c(2,2)) will cause the graphs to be displayed as 2 rows and 2 columns. To reset the entire window (so that you can do all four again), you could change it back to a 1 by 1 layout using par(mfrow=c(1,1)) and then go back to the 2 by 2 layout.
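
For example, using the rivers data from above, something like the following should put four different plots in one window:

par(mfrow=c(2,2))
hist(rivers$Discharge)
hist(rivers$NO3)
plot(rivers$Density,rivers$NO3)
plot(rivers$Cont,rivers$Discharge)
par(mfrow=c(1,1))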

Regression Line: The following code will perform the regression to predict NO3 from Density, and plot the data and regression line.


riv.reg <- lm(rivers$NO3~rivers$Density)
summary(riv.reg)
plot(rivers$Density,rivers$NO3,xlab="Density",ylab="NO3")
lines(rivers$Density,fitted(riv.reg))

You also could have plotted each of the continents as a separate letter as we did above, and then plotted the regression line over the plot.

Box Plots: We saw before that plot would make side by side box plots if the x variable consisted of categories and the y variable was quantitative. One difficulty is that it alphabetizes the figures by category names. Further, if the category names are numbers, plot won't even give you the boxplots. If your category names were numbers, you could try:

boxplot(values~categorynumbers)

(of course you want to put in the name of the vector with the values and the name of the vector with the category numbers.)

We could also trick the function boxplot into giving us the values in the order we wanted. Say we wanted them in the order SA, Af, Au, As, Eu, NAm. What we would need to do is make a vector of numbers so that each observation that went with SA had a 1, each that went with AF had a 2, and so on. The following function will do that for us:

vectormaker<-function(categoryvector,ordervector){
newcatvector<-rep(0,length(categoryvector))
for (i in 1:length(ordervector)){
  newcatvector[categoryvector==ordervector[i]]<-i
  }
return(newcatvector)
}
If you run

wewant<-c("SA","Af","Au","As","Eu","NAm")
vc<-vectormaker(rivers$Cont,wewant)

It will make a new vector called vc that will be like the string of continents, but with all the SAs replaced by ones, all the Afs replaced by twos, etc... (The vector wewant is simply the order in which we want the continents to appear.) Type in both vc and rivers$Cont to check that it happened correctly.

Typing

boxplot(rivers$NO3~vc,names=wewant)

will now produce a boxplot with the variables in the order we want.

We could have combined this all into one function too:

orderedboxes<-function(datavector,categoryvector,ordervector){
newcatvector<-rep(0,length(categoryvector))
for (i in 1:length(ordervector)){
  newcatvector[categoryvector==ordervector[i]]<-i
  }
boxplot(datavector~newcatvector,names=ordervector)
}
And just run

orderedboxes(rivers$NO3,rivers$Cont,c("SA","Af","Au","As","Eu","NAm"))


Analyzing the Draft Data

The following is the code we looked at in class on the 4th for analyzing the 1970 draft data on homework assignment 1. First, we read in the data, see what the column names are, and scatter plot the day versus the draft (using two different ways to name the columns).

draft<-read.table("http://www.stat.sc.edu/~habing/courses/data/draft70.txt",header=TRUE) 

attributes(draft)

plot(draft$Day.of.Year,draft$Draft.Order)

plot(draft[,1],draft[,2])

Next we could see that using different plot symbols doesn't seem to help very much. (The points command and the pch option are discussed in the section on plotting symbols above.)

plot(5,5,xlim=c(0,366),ylim=c(0,366),xlab="",ylab="",main="With Plot Symbols",type="n")
for (i in 1:366){
  points(draft[i,1],draft[i,2],pch=draft[i,3])
}

The following code performs a linear regression and graphs the line over the scatter plot.

draft.reg <- lm(draft$Draft.Order~draft$Day.of.Year)
summary(draft.reg)
plot(draft$Day.of.Year,draft$Draft.Order,xlab="Day of Year",ylab="Draft Order")
lines(draft$Day.of.Year,fitted(draft.reg))

To view all four residual plots at one time, we need to change the display to a matrix with four spaces (say a 2x2 one).

par(mfrow=c(2,2))
plot(draft.reg)

par(mfrow=c(1,1))

Instead of a regression line, we could have drawn a mean value line for each month.

par(mfrow=c(1,1))
plot(draft$Day.of.Year,draft$Draft.Order,xlab="Day of Year",ylab="Draft Order")
monthavg<-rep(0,12)
for (i in 1:12){
 monthavg[i]<-mean(draft[draft[,3]==i,2])
}
for (i in 1:366){
 lines(c(i-0.5,i+0.5),c(monthavg[draft[i,3]],monthavg[draft[i,3]]))
}

We also could have used a "moving weighted average" such as kernel smoothing. To see the kernel smoothing function ksmooth we first need to call the modern regression library that is downloaded as part of R, but is not automatically integrated with it when you start the program.

library(modreg)
draft.smooth<-ksmooth(draft$Day.of.Year,draft$Draft.Order,kernel="normal",bandwidth=30)
plot(5,5,xlim=c(0,366),ylim=c(0,366),xlab="",ylab="",main="With A Smoother",type="n")
points(draft[,1],draft[,2])
lines(draft.smooth)

The bandwidth value changes the number of points that are averaged over; a larger bandwidth gives a smoother curve. The following two lines add a smoother curve to the plot. Changing the bandwidth to 0.5 would have simply connected the dots.

draft2.smooth<-ksmooth(draft$Day.of.Year,draft$Draft.Order,kernel="normal",bandwidth=60)
lines(draft2.smooth)

We also could have gotten the "connect the dots" effect using the option type="l" when we plotted the original data set.

plot(draft$Day.of.Year,draft$Draft.Order,type="l")

Finally, a side by side boxplot based on the month provides similar information to the scatterplot with the mean lines for each month.

boxplot(draft$Draft.Order~draft$Month.Num)


Details on Star Plots and Some Multiple Regression

Stars: The examples in this section are based on the data set http://www.stat.sc.edu/~habing/courses/data/cities.txt that is reported in Chapter 11 of Ramsey and Schaefer's The Statistical Sleuth, and that originally appeared in Graphical Representation of Multivariate Data published by Academic Press in 1978. It concerns 60 metropolitan standard areas from 1959-1961 (rea is Reading, Pennsylvania for example).

cities<-read.table("http://www.stat.sc.edu/~habing/courses/data/cities.txt",header=TRUE) 
cities[38,]

We can see that this data set has 17 variables: precipitation, temperature in January, temperature in July, population over 65, population per household, education level, percent of housing that is sound (safe), population density, nonwhite population, percent of white collar workers, percent of poor, hydrocarbon level, nitrogen oxide level, sulfur dioxide level, humidity, mortality rate, and the city name abbreviation. (Line 38 contains these values for the New York City MSA.)

stars(cities[,1:16])

Just doing a star plot on the raw data clearly isn't very helpful. We can't see the observation numbers and don't know which line is which. Looking at the options under help(stars) does give us some extra things to try...

stars(cities[,1:16],key.loc=c(16,1),cex=0.5,len=0.8)

cex shrinks the letters to 50% their normal size, len shrinks the spikes on the stars to 80% their normal size, and the key.loc puts on the key for what each spike is.

By default, the star plot uses the row names (the observation names) to label each star. Unfortunately the default row names are just the observation number. For this data set we could change that so that each row had the name of the city it was associated with.

row.names(cities)<-cities[,17]
attributes(cities)

If you repeat the above stars command it will label the stars with these city names.

We might decide that we want to change the order in which the variables appear around the star. For example, maybe we want to group the "weather related things", the "socio-economic things", and the "pollution things" together, and remove mortality. The order we would want then (looking at the $names attribute of cities) could be:

colorder<-c(1:3,15,4:11,12:14)
attributes(cities[colorder])
stars(cities[,colorder],key.loc=c(16,1),cex=0.5,len=0.8)

Notice that we moved variable 15 and left out variable 16. We also could have ordered the variables from "most beneficial" to "neutral" to "most harmful" for example, or removed more of the variables if we had reason to believe they were highly correlated and provided no new information. (It's easier to read the plot if there are fewer spikes.) We also could have just kept the variables we thought should be negative.

Just like we can change the order of the variables, we could change the order that the observations are plotted. We could go through and pick out groups of observations that have similar stars, make a vector called roworder and do stars(cities[roworder,colorder]). We might want to order them based on the spikes in the bottom-right for example: roworder=c(29, 40, 12, etc... 4).

We could also use the ordering of the observations to re-include mortality in the plot. We could order the observations from lowest mortality to highest:

roworder<-order(cities[,16])
cities[roworder,16:17]
stars(cities[roworder,colorder],key.loc=c(16,1),cex=0.5,len=0.8)

Leaving all of the spikes in still makes this very hard to read though. If you didn't want to remove any of the variables, you could also try adding color. To see what the currently available colors are, type palette(). We could color all of the positive variables green, the neutral ones yellow, and the negative ones red for example.

To decide which of the variables are positive and which are negative, we can look at the order in which we are using them (in this case based on using colorder). We will then give each one a value of 3 if positive (green), 2 if neutral (yellow), and 1 if negative (red). (Of course, I'm just guessing that rain shouldn't matter, that a high winter temperature is good, and a high July temperature is bad... usually you'd want to consult a specialist in the area!)

attributes(cities[colorder])
colorvector<-c(2,3,1,1,1,3,3,3,1,2,2,1,1,1,1)
stars(cities[roworder,colorder],key.loc=c(16,1),cex=0.5,len=0.8,
   draw.segments=T,colors=colorvector)

(You need to use the draw.segments=T to plot it in color. If you leave off the colors= it will use the default colors.)

It's perhaps somewhat disheartening that the colors still don't really bring out any strong patterns. If we look at what is most highly correlated with mortality though:

cor(cities[,1:16])[,16]
We see that it is actually precipitation and non-white! Two things we had coded in yellow instead of red. If we recolored and reordered the variables based on this correlation, the picture would probably show us something new. Of course, if we do this we aren't really using the graphic to discover anything new... we're just trying to reinforce what we already know from the numerical statistics.
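
Here is one possible sketch of that recoloring and reordering (the 0.3 cutoff for "clearly" correlated is just an arbitrary choice on my part):

mortcor<-cor(cities[,1:16])[1:15,16]         # correlation of each variable with mortality
colorder2<-order(mortcor)                    # most negative to most positive correlation
colorvector2<-rep(2,15)                      # start everything out as yellow (neutral)
colorvector2[mortcor[colorder2] < -0.3]<-3   # green if clearly negatively correlated with mortality
colorvector2[mortcor[colorder2] > 0.3]<-1    # red if clearly positively correlated with mortality
stars(cities[roworder,colorder2],key.loc=c(16,1),cex=0.5,len=0.8,
   draw.segments=T,colors=colorvector2)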

Two final things to try are illustrated in the following plot:

par(mfrow=c(2,1))

palette(rainbow(19))
stars(cities[1,1:16],cex=0.5,len=0.8,full=F,draw.segments=T)

palette(rainbow(16))
stars(cities[1,1:16],cex=0.5,len=0.8,draw.segments=T)

Note that because we only used one city all of the variables were scaled to be the same length! If you read the help file for stars, it scales them so that 0 is 0, but the largest in each direction is 1.

These were all just examples of what you could do with this data set, and there is no single "right answer". Just remember that your main goal is to lower the amount of "noise" in the display so that you can perhaps uncover some more patterns in the data set.

Multiple Regression: Multiple regression is one of the things that SAS does best, and so it probably makes more sense to do that part of the problem with PROC INSIGHT. You could however fit a regression model in R using lm just like you could a standard regression. For example, the following will fit the model to predict the mortality rate from all of the other variables:

 
citymort.reg<-lm(cities[,16]~cities[,1]+cities[,2]+cities[,3]+cities[,4]+
  cities[,5]+cities[,6]+cities[,7]+cities[,8]+cities[,9]+cities[,10]+
  cities[,11]+cities[,12]+cities[,13]+cities[,14]+cities[,15])
citymort.reg
summary(citymort.reg)
par(mfrow=c(2,2))
plot(citymort.reg)
We can see from the residual plot that the most extreme cities are number 37 = no = New Orleans, and number 28 = lan = Lancaster, PA. We could also have gotten a list of the residuals for each city in order using:
resorder<-order(citymort.reg$residuals)
data.frame(cities[resorder,17],citymort.reg$residuals[resorder])


Chernoff-like Faces

The following code will implement the faces for up to five variables that we designed in class. It is made up of seven functions: ellipse, rotoellipse, egghead, bigear, hairy, liney, and faces. The first two are plotting functions for a window with x from -1 to 1 and y from -1 to 1. The next four functions plot the four different kinds of faces, and the faces function is the main function that calls the others.

ellipse<-function(x0=0,y0=0,a=0.5,b=0.5,new=T){
if (new==T) {par(new=F)}
if (new==F) {par(new=T)}
plot(c(0,0),xlim=c(-1,1),ylim=c(-1,1),xlab="",ylab="",type="n",axes=F)
par(new=T)
curve((y0+sqrt(abs(b^2-((x-x0)*b/a)^2))),x0-a,x0+a,n=200,xlim=c(-1,1),ylim=c(-1,1),
xlab="",ylab="",axes=F)
par(new=T)
curve((y0-sqrt(abs(b^2-((x-x0)*b/a)^2))),x0-a,x0+a,n=200,xlim=c(-1,1),ylim=c(-1,1),
xlab="",ylab="",axes=F)
}


rotoellipse<-function(x0=0,y0=0,a=0.5,b=0.5,rangle=0,new=T){
if (new==T) {par(new=F)}
if (new==F) {par(new=T)}
plot(c(0,0),xlim=c(-1,1),ylim=c(-1,1),xlab="",ylab="",type="n",axes=F)
xvec<-((-100:100)/100)*a
yvec<-sqrt(abs(b^2-(xvec*b/a)^2))
rvec<-sqrt(xvec^2+yvec^2)
angle<-pi*((sign(xvec)-1)/(-2))+atan(yvec/xvec)
xvec<-x0+rvec*cos(angle+rangle)
yvec<-y0+rvec*sin(angle+rangle)
par(new=T)
plot(xvec[1:100],yvec[1:100],xlim=c(-1,1),ylim=c(-1,1),xlab="",ylab="",axes=F,type="l")
par(new=T)
plot(xvec[102:201],yvec[102:201],xlim=c(-1,1),ylim=c(-1,1),xlab="",ylab="",axes=F,type="l")
xvec<-((-100:100)/100)*a
yvec<--1*sqrt(abs(b^2-(xvec*b/a)^2))
rvec<-sqrt(xvec^2+yvec^2)
angle<-pi*((sign(xvec)-1)/(-2))+atan(yvec/xvec)
xvec<-x0+rvec*cos(angle+rangle)
yvec<-y0+rvec*sin(angle+rangle)
par(new=T)
plot(xvec[1:100],yvec[1:100],xlim=c(-1,1),ylim=c(-1,1),xlab="",ylab="",axes=F,type="l")
par(new=T)
plot(xvec[102:201],yvec[102:201],xlim=c(-1,1),ylim=c(-1,1),xlab="",ylab="",axes=F,type="l")
}



egghead<-function(x=c(0.5,0.5,0.5,0.5,0.5),label=""){
plot(c(0,0),xlim=c(-1,1),ylim=c(-1,1),xlab="",ylab="",type="n",axes=F,main=label)
a<-0.6+0.4*x[1]
b<-0.6+0.4*x[2]
halfmouth<-0.9*a*x[3]
eyeradius<-a*x[4]*2/6
halfnose<-0.4*b*x[5]
ellipse(x0=0,y0=0,a=a,b=b,new=F)
lines(c(-1*halfmouth,halfmouth),c((-1*b/2),(-1*b/2)))
ellipse(x0=(-1*a/2),y0=b/3,a=eyeradius,b=eyeradius,new=F)
ellipse(x0=a/2,y0=b/3,a=eyeradius,b=eyeradius,new=F)
lines(c(0,0),c(-halfnose,halfnose))
}



bigear<-function(x=c(0.5,0.5,0.5,0.5,0.5),label=""){
plot(c(0,0),xlim=c(-1,1),ylim=c(-1,1),xlab="",ylab="",type="n",axes=F,main=label)
lftearb<-0.125+.875*x[1]
rgtearb<-0.125+.875*x[2]
moutha<-0.125+.375*x[3]
lfteyea<-x[4]/(4*sqrt(2))
rgteyea<-x[5]/(4*sqrt(2))
ellipse(x0=0,y0=0,a=0.5,b=0.5,new=F)
ellipse(x0=-0.625,y0=0,a=0.125,b=lftearb,new=F)
ellipse(x0=0.625,y0=0,a=0.125,b=rgtearb,new=F)
par(new=T)
curve((-1*sqrt(abs(0.25^2-((x-0)*0.25/moutha)^2))),-1*moutha,moutha,n=200,
xlim=c(-1,1),ylim=c(-1,1),xlab="",ylab="",axes=F)
rotoellipse(x0=-0.125,y0=0.25,a=lfteyea,b=1/(6*sqrt(2)),rangle=pi/8,new=F)
rotoellipse(x0=0.125,y0=0.25,a=rgteyea,b=1/(6*sqrt(2)),rangle=-1*pi/8,new=F)
}



hairy<-function(x=c(0.5,0.5,0.5,0.5,0.5),label=""){
plot(c(0,0),xlim=c(-1,1),ylim=c(-1,1),xlab="",ylab="",type="n",axes=F,main=label)
hair<-0.01+0.14*x[1]
halfmouth<-0.05+0.1*x[2]
halfnose<-0.05+0.1*x[3]
eyeradius<-0.1+0.1*x[4]
halfbrow<-0.05+0.1*x[5]
ellipse(x0=0,y0=0,a=0.85,b=0.85,new=F)
angle<-rep(0,9)
for (i in 1:9){
   angle[i]<-(3*pi/8)+(i-1)*(pi/32)
   lines(x=c(0.85*cos(angle[i]),(0.85+hair)*cos(angle[i])),
         y=c(0.85*sin(angle[i]),(0.85+hair)*sin(angle[i])))
}
lines(x=c(-1*halfmouth,halfmouth),y=c(-0.3,-0.3))
lines(x=c(-1*halfnose,halfnose,0,-1*halfnose),y=c(0,0,.15,0))
ellipse(x0=-0.2,y0=0.2,a=eyeradius,b=eyeradius,new=F)
ellipse(x0=0.2,y0=0.2,a=eyeradius,b=eyeradius,new=F)
}


liney<-function(x=c(0.5,0.5,0.5,0.5,0.5),label=""){
plot(c(0,0),xlim=c(-1,1),ylim=c(-1,1),xlab="",ylab="",type="n",axes=F,main=label)
a<-0.5+0.5*x[1]
eyeheight<-0.2+0.1*x[2]
eyehoriz<-0.1+0.3*x[3]
halfmouth<-0.25*x[5]
ellipse(x0=0,y0=0,a=a,b=a,new=F)
lines(x=c(-1*eyehoriz,-1*eyehoriz),y=c(.2,.2+eyeheight))
lines(x=c(eyehoriz,eyehoriz),y=c(.2,.2+eyeheight))
lines(x=c(-1*halfmouth,halfmouth),y=c(-.25,-.25))
}



faces<-function(x,label=rep("",length(x[,1])),facetype="egghead"){
par(pty="s")
par(mai=c(0.01,0.01,0.1,0.01))
par(cex.main=0.8)
par(mfrow=c(1,1))
par(mfrow=c(ceiling(sqrt(length(x[,1]))),ceiling(sqrt(length(x[,1])))))
for (i in 1:min(length(x[1,]),5)){
x[,i]<-(x[,i]-min(x[,i]))/(max(x[,i])-min(x[,i]))
}
if (length(x[1,]) < 5) {
for (i in (length(x[1,])+1):5){
x[,i]<-0.5
}
}
for (i in 1:length(x[,1])){
if(facetype=="egghead"){egghead(c(x[i,1],x[i,2],x[i,3],x[i,4],x[i,5]),
    label=as.character(label[i]))}
if (facetype=="bigear"){bigear(c(x[i,1],x[i,2],x[i,3],x[i,4],x[i,5]),
    label=as.character(label[i]))}
if (facetype=="hairy"){hairy(c(x[i,1],x[i,2],x[i,3],x[i,4],x[i,5]),
    label=as.character(label[i]))}
if (facetype=="liney"){liney(c(x[i,1],x[i,2],x[i,3],x[i,4],x[i,5]),
    label=as.character(label[i]))}
}
}

To see the differences try the following:

bears<-read.table("http://www.stat.sc.edu/~habing/courses/data/bears.txt",header=TRUE) 
faces(bears[,1:5],bears[,9])
faces(bears[,1:5],bears[,9],"bigear")
faces(bears[,1:5],bears[,9],"hairy")
faces(bears[,1:5],bears[,9],"liney")


Why is it Drawing Borders on My Faces?

Some of you may find that, for some reason, when you plot the Chernoff-like faces above they are plotted with a border. The reason for this is that many of the functions contain the option axes=F, where the F stands for FALSE.

The problem is that you might have defined F to be something else! Try just typing F and enter. If you get a value back (instead of FALSE) then you need to type rm(F) to remove the value and restore it to its original meaning.
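
For example:

F                # prints FALSE if F still has its usual meaning
F<-3             # now F has the value 3, so axes=F no longer means axes=FALSE
F
rm(F)            # remove the new value...
F                # ...and F is back to meaning FALSE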

Similar bad things can happen if you use the name t for transpose or c for concatenate (make a vector) or any of the other built in function names.


A Few Matrix Commands

The following code evaluates equations 4.4 to 4.6 (page 39 and 40) for the sparrow data (page 2-3) to replicate the output shown in Example 4.1. Note that the first 21 sparrows are the group that survived, and the remaining 28 did not. As you enter each line, you should type the name of the variable you just created so that you can see what the command you entered actually calculated. The commands of particular importance are:

bumpus <- read.table("http://www.stat.sc.edu/~habing/courses/data/bumpus.txt",header=TRUE) 

group1 <- bumpus[1:21,]
group2 <- bumpus[22:49,]

n1 <- length(group1[,1])
xbar1 <- apply(group1,2,mean)
C1 <- cov(group1)

n2 <- length(group2[,1])
xbar2 <- apply(group2,2,mean)
C2 <- cov(group2)

C <- ((n1-1)*C1+(n2-1)*C2)/(n1+n2-2)

Cinv <- solve(C)  

Tsquare <- (n1*n2)*t(xbar1-xbar2)%*%Cinv%*%(xbar1-xbar2)/(n1+n2)

p <- length(group1[1,])

F <- (n1+n2-p-1)*Tsquare/((n1+n2-2)*p)

p.value <- 1 - pf(F,df1=p,df2=n1+n2-p-1)

To get the t-test for total_length (column 1) we could have used

t.test(group1[,1],group2[,1],alternative="two.sided",var.equal=T)

If we didn't have the raw data, we could have entered the matrices directly. The following code would enter xbar1 and C1.

xbar1 <- c(157.381,241.000,31.433,18.500,20.810)
C1 <- matrix(c(11.048,9.10,1.557,0.870,1.286,
  9.100, 17.500, 1.910, 1.310, 0.880,
  1.557, 1.910, 0.531, 0.189, 0.240,
  0.870, 1.310, 0.189, 0.176, 0.133,
  1.286, 0.880, 0.240, 0.133, 0.575),nrow=5,ncol=5,byrow=T)

The commands following the entry of these variables would then work just the same for Hotelling's T-squared. For the regular two-sample t-test we would have to calculate the t-statistic manually, and could then use either a t-table or the command pt to get the p-value. The following would conduct the t-test for total_length using just the two means 157.38 and 158.43 and the pooled variance 13.358.

n1<-21
xb1<-157.38
n2<-28
xb2<-158.43
s2p<-13.358
tstat<-(xb1-xb2)/sqrt(s2p*(1/n1+1/n2))
pvalue<-2*min(pt(tstat,df=n1+n2-2),1-pt(tstat,df=n1+n2-2))

Note that the statistic and p-value are slightly off from the t.test line above because of rounding. Can you tell why we used that formula for the p-value?

Eigenvalues: The eigenvalues and eigenvectors of the matrix C1 could be found using the command

eigen(C1)

The following shows how the matrix C1 can be recovered from the eigenvalues and the eigenvectors. This result is known as the Spectral Decomposition Theorem. By comparing C1 to recoveredC1 you can see that all of the information in C1 is contained in the eigenvectors and eigenvalues.

lambda <- diag(eigen(C1)$values)
gamma <- eigen(C1)$vectors
recoveredC1 <- gamma%*%lambda%*%t(gamma)


Finding Distances

The examples in this section use the data set found in Table 1.2 on pages 4-5. The data is from a 1905 work by Thomson and Randall-Maciver. For each of five epochs in Egypt, four measurements were made on samples of 30 male skulls. The measurements are the first four variables: MB=maximum breadth, BH=basibregmatic height, BL=basialveolar length, and NH=nasal height. In order to use this example it is necessary to load the data in, and to attach the mva library.

skulls <- read.table("http://www.stat.sc.edu/~habing/courses/data/skulls.txt",header=TRUE) 
library(mva)
You will need to use the command library(mva) every time you start up R when you want to use the functions it contains (like dist).

Distances Between Observations: The function dist will calculate several of the distances, namely: euclidean, maximum, manhattan, and canberra. As written here we calculate the distances between the first two skulls using both the dist command and directly. (You can double check these by hand using Table 1.2 on page 4-5).

x1<-as.numeric(skulls[1,1:4])
x2<-as.numeric(skulls[2,1:4])

dist(rbind(x1,x2),"euc")
sqrt(t(x1-x2)%*%(x1-x2))

dist(rbind(x1,x2),"max")
max(abs(x1-x2))

dist(rbind(x1,x2),"manh")
sum(abs(x1-x2))

dist(rbind(x1,x2),"can")
sum(abs(x1-x2)/abs(x1+x2))

The command as.numeric is needed in order for the absolute value function to work later; R doesn't know what you mean by the absolute value of a table, but it does for a vector. The command rbind is needed inside of dist because dist finds the distance between the rows of a matrix. Because we only have two rows here, there is only one value to return. We could also have had it return all of the distances (all 11,175 of them) for every pair of observations. Because dist only gives you the lower triangle of the matrix, it will look very weird on the screen! You might get a better feel for it using 1:6 instead of all 1:150. (The 1:4 is there because column 5 contains the epoch names, not numerical values.)

dist(skulls[1:150,1:4],"euc")

We can also change a distance output into a full matrix and vice-versa. (Some R procedures require a distance object, some require a matrix.)

d1<-dist(skulls[1:6,1:4],"euc")
d1
dmat1<-as.matrix(d1)
dmat1
d1take2<-as.dist(dmat1)
d1take2

To find the standardized euclidean distance, we can standardize each column before taking the distance. (NOTE: In this code we are ignoring that they are from different samples, otherwise you would need to use the pooled variances like we do in the section below on the distances between samples.)

standskull<-skulls
for (i in 1:4){
standskull[,i]<-(skulls[,i]-mean(skulls[,i]))/sqrt(var(skulls[,i]))
}
The distance matrix for the first six observations is then just
dist(standskull[1:6,1:4],"euc")

Notice that we standardized using ALL of the observations, even though we only wanted a subset of the distances. Also notice that we would get exactly the same result if we had used the Euclidean distance formula (5.1 on page 59) and replaced (x_ik - x_jk)^2 with (x_ik - x_jk)^2 / s_k^2.

We could write this directly (and inefficiently) as:

distmat<-matrix(0,ncol=6,nrow=6)
for (i in 1:6){
  for (j in 1:6){
     for (k in 1:4){
        distmat[i,j]<-distmat[i,j]+(skulls[i,k]-skulls[j,k])^2/var(skulls[,k])
     }
     distmat[i,j]<-sqrt(distmat[i,j])
  }
}
as.dist(distmat)

Because of the nested loops, this could take quite a while to run if we used 1 to 150 instead of 1 to 6!

To find Mahalanobis distance between two observations (again ignoring that we have different samples), we first need the covariance matrix between the variables. We can then simply find the inverse of the covariance matrix and take:
D_ij^2 = (x_i - x_j)^T C^(-1) (x_i - x_j)

We could either do this manually, or use the built in function mahalanobis. In order to do it manually, we again require that skulls be a matrix in order for the matrix multiplication to work. We could do this by first making a matrix called skullmat.

skullmat<-as.matrix(skulls[1:150,1:4])

C<-var(skullmat)
Cinv<-solve(C)
Dsquare12 <- t(skullmat[1,]-skullmat[2,])%*%Cinv%*%(skullmat[1,]-skullmat[2,])
Dsquare12
  
mahalanobis(skullmat[1,],skullmat[2,],C)

Distances Between Samples (e.g. Distances Between Mean Vectors): One way of finding the distances between samples is to simply repeat what we did with observation vectors, but use the mean vectors for the samples instead. (We will see some alternative methods in Chapter 9.)

The following code will replicate the work shown in Example 5.2 on pages 64-67. Remember that at any point you can type the name of an object to see if we made it correctly.

x1<-as.matrix(skulls[skulls[,5]=="Earlypre",1:4])
x2<-as.matrix(skulls[skulls[,5]=="Latepre",1:4])
x3<-as.matrix(skulls[skulls[,5]=="Twelveand13",1:4])
x4<-as.matrix(skulls[skulls[,5]=="Ptolemaic",1:4])
x5<-as.matrix(skulls[skulls[,5]=="Roman",1:4])

n1<-length(x1[,1])
n2<-length(x2[,1])
n3<-length(x3[,1])
n4<-length(x4[,1])
n5<-length(x5[,1])

p<-length(x1[1,])

xbar1<-apply(x1,2,mean)
xbar2<-apply(x2,2,mean)
xbar3<-apply(x3,2,mean)
xbar4<-apply(x4,2,mean)
xbar5<-apply(x5,2,mean)

C1<-var(x1)
C2<-var(x2)
C3<-var(x3)
C4<-var(x4)
C5<-var(x5)

C<-((n1-1)*C1+(n2-1)*C2+(n3-1)*C3+(n4-1)*C4+(n5-1)*C5)/(n1+n2+n3+n4+n5-5)

Now to calculate Penrose's distance (equation 5.2 on page 62) we are basically calculating the square of the standardized distance between the mean vectors. The only difference is that we are dividing this by p. (Without the p it is what some books call Pearson's distance.)

P12<-0
for (k in 1:p){
  P12<-P12+((xbar1[k]-xbar2[k])^2)/(p*C[k,k])
}
P12

A little matrix algebra can also give a shortcut to producing the matrix of Penrose distances. The trick is to realize that the Penrose distance is very similar to the Mahalanobis distance except that we pretend all of the covariances (the off-diagonal elements of C) are zero (and divide by p). The following gives the distance object in the middle of page 66.

variancediagonalmatrix<-diag(diag(C))
xbarmatrix<-rbind(xbar1,xbar2,xbar3,xbar4,xbar5)
PijMat<-matrix(0,nrow=5,ncol=5)
for (i in 1:5){
  for (j in 1:5){
     PijMat[i,j]<-mahalanobis(xbarmatrix[i,],xbarmatrix[j,],variancediagonalmatrix)/p
  }
}
as.dist(PijMat) 

To get the matrix of Mahalanobis distances, we just use the above, but use C instead of variancediagonalmatrix and don't divide by p. The following will produce the distance object in the middle of page 67.

MahMat<-matrix(0,nrow=5,ncol=5)
for (i in 1:5){
  for (j in 1:5){
     MahMat[i,j]<-mahalanobis(xbarmatrix[i,],xbarmatrix[j,],C)
  } 
}
as.dist(MahMat)

How Close is an Observation to the Mean?: To tell how close an observation is to a mean, we need to again come up with a set of distances, but now we need the Mahalanobis distance between each observation and each mean vector. In this case there are 5 mean vectors and 30 observations in each group (150 total). We could thus make a 150 by 5 matrix showing the distance between each observation and each group.

BigMat<-matrix(0,nrow=150,ncol=5)
for (i in 1:150){
  for (j in 1:5){
    BigMat[i,j]<-mahalanobis(skulls[i,1:4],xbarmatrix[j,],C)
  }
}
BigMat
Now we need to remember that the first 30 are "Early Predynastic" and are in the same group as Column 1, the next 30 are "Late Predynastic" and are in the same group as Column 2, etc...

Thus, for the first 30 observations, if any of the values in columns 2-5 are smaller than in column 1, that observation is closer to some other group mean than it is to its own. For the second 30 observations, if any of the values in columns 1 or 3-5 are smaller than in column 2, that observation is closer to some other group mean than it is to its own. etc...

If we go to observation 1 of BigMat, we find that it is 6.5536 away from its own group mean, but only 4.3263 away from the mean for group 3. It is thus closer to the mean of another group than it is to its own. If we look at observation 149 of BigMat, we find that the group mean it is closest to is its own (group 5).

One way to simplify the process would be to look at the following:

BigMat2<-BigMat
BigMat2[1:30,]<-BigMat[1:30,]-BigMat[1:30,1]
BigMat2[31:60,]<-BigMat[31:60,]-BigMat[31:60,2]
BigMat2[61:90,]<-BigMat[61:90,]-BigMat[61:90,3]
BigMat2[91:120,]<-BigMat[91:120,]-BigMat[91:120,4]
BigMat2[121:150,]<-BigMat[121:150,]-BigMat[121:150,5]
BigMat2

Any negative number in the matrix now signifies an observation that is closer to some other mean than it is to its own. We can now go through and pull out all of the rows where this happens (and make sure to keep the original observation number!)

minimumofeachrow<-apply(BigMat2,1,min)
cbind(BigMat2,1:150)[minimumofeachrow<0,]

It is perhaps somewhat surprising that so many of the observations (99 out of 150) are actually more similar to the average of some other group than they are to their own, and yet the MANOVA (see the SAS templates) could still tell there was a difference.

On the other hand, this seems to show that simply using the mean vector isn't necessarily a good way to summarize a sample. We also need some way of seeing how well those mean vectors capture the behavior of the members of each sample.


Principal Components

The following example uses the data set http://www.stat.sc.edu/~habing/courses/data/testdata.txt that can be found in Mardia, Kent, and Bibby's Multivariate Analysis. Several of the functions we will use are found in the libraries MASS and mva and so we will need to start those as well.

td <- read.table("http://www.stat.sc.edu/~habing/courses/data/testdata.txt",header=TRUE) 
library(mva)
library(MASS)

As we saw in class we could begin by simply getting some descriptive statistics and basic plots for the data sets.

pairs(td)
apply(td,2,mean)
var(td)

The function that conducts the principal components analysis is called prcomp and the function that plots the results is eqscplot.

prtd<-prcomp(td)      
prtd
summary(prtd)
eqscplot(prtd$x[,1],prtd$x[,2],type="n")
text(prtd$x[,1],prtd$x[,2],1:88)

Notice that the first principal component is essentially an "overall" ability, and that the second seems to be open book vs. closed book. To verify this we can check the plot of the 88 examinees on the first two components. How do the "total scores" of those on the far right compare to those on the far left? How did those at the top do on the open book tests as opposed to the closed book? Those on the bottom?
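
One quick numerical check of the "overall ability" interpretation (the sign of the correlation depends on the arbitrary direction of the component):

totals<-apply(td,1,sum)      # each examinee's total score over the five tests
cor(totals,prtd$x[,1])       # a large positive or negative value supports the interpretation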

To see the relationship between the principal components analysis and the eigenvectors and eigenvalues, we can simply calculate them.

eigen(var(td))

We can also verify some of the properties of the principal components

tdmat<-as.matrix(td)
transformer<-as.matrix(prtd$rotation)
newtd<-tdmat%*%transformer

round(var(newtd)*1000000)/1000000
round(var(td)*1000000)/1000000

round(cor(td,newtd)^2*1000000)/1000000
cor(td,td)^2

Principal Components Plot Options: It is possible to plot any component against any other component using eqscplot; simply change the column indices 1 and 2 to the numbers of the components you want.
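
For example, to plot the first component against the third:

eqscplot(prtd$x[,1],prtd$x[,3],type="n")
text(prtd$x[,1],prtd$x[,3],1:88)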

If the type="n" is left out then dots will be plotted at each point instead of names. The text command is what labels the plots. Say the data had come from two sections of students, A with the first 46 and B with the next 42. In that case we could have used the command:

tdnames<-c(rep('A',46),rep('B',42))
text(prtd$x[,1],prtd$x[,2],tdnames)
(Notice that this is kind of odd looking here because the original data is basically in order of the examinees' scores.)

Scaling the Variables First: If the data doesn't come on the same units or on a natural scale, it can be advantageous to standardize it first (as mentioned on page 80 of Manly). This can be done by using the option scale.=TRUE as in the following:

prtd2<-prcomp(td,scale.=TRUE)
prtd2
Notice that the weights are now distributed somewhat more evenly because we have factored out the variances, and that the principal components now match eigen(cor(td)).

Another way of doing this would have been to standardize the data to start with. This is probably safer because you are then not tempted to apply the principal components based on standardizing to the original, unstandardized data.

standardtd<-td
for (i in 1:5){
 standardtd[,i]<-(td[,i]-mean(td[,i]))/sqrt(var(td[,i]))
} 
prtd3<-prcomp(standardtd)
prtd3


Discriminant Analysis

The material at the beginning of Chapter 8 using Mahalanobis distance is discussed above in the section on distances.

Fisher's Canonical Linear Discriminant Functions provide another method for performing Discriminant Analysis. The code here will use the skulls data used above.

skulls <- read.table("http://www.stat.sc.edu/~habing/courses/data/skulls.txt",header=TRUE) 

We will also want to have the MASS and mva libraries installed.

library(mva)
library(MASS)

The following code will then carry out the analyses discussed on pages 112-114.

snames<-c(rep('e',30),rep('l',30),rep('t',30),rep('p',30),rep('r',30))

skull.disc <- lda(skulls[,1:4],snames)

skull.disc$scaling

At first these coefficients don't seem to match those given on page 113! We can replicate the formulas on pages 112-113 and see where the difference comes in, though. We use a shortcut formula from Mardia, Kent, and Bibby (page 139) to calculate formulas 4.11 and 4.12...

X<-as.matrix(skulls[,1:4])
one1<-c(rep(1,30),rep(0,120))
one2<-c(rep(0,30),rep(1,30),rep(0,90))
one3<-c(rep(0,60),rep(1,30),rep(0,60))
one4<-c(rep(0,90),rep(1,30),rep(0,30))
one5<-c(rep(0,120),rep(1,30))
bigone<-one1+one2+one3+one4+one5 
I1<-diag(one1)
I2<-diag(one2)
I3<-diag(one3)
I4<-diag(one4)
I5<-diag(one5)
H1<-I1-(1/30)*one1%*%t(one1)
H2<-I2-(1/30)*one2%*%t(one2)
H3<-I3-(1/30)*one3%*%t(one3)
H4<-I4-(1/30)*one4%*%t(one4)
H5<-I5-(1/30)*one5%*%t(one5)
C1<-H1+H2+H3+H4+H5
C2<--1*(C1-I1-I2-I3-I4-I5)-bigone%*%t(bigone)/150
W<-t(X)%*%C1%*%X 
B<-t(X)%*%C2%*%X
W
B

Notice that the W matrix is different from what the book gives because the book has an error on page 112... compare to page 52. And now finding the eigenvectors and eigenvalues...

skull.disc2<-eigen(solve(W)%*%B)
skull.disc2

Notice that these values don't agree with either the ones in the book or the ones from the discriminant analysis above!!!! What's going on?

The code below will transform the original data by the three different discrimination transformations.

bookdisc<-matrix(c(-0.0107,0.0040,0.0119,-0.0068,
0.0031,0.0168,-0.0046,-0.0022,
-0.0068,0.0010,0.0000,0.0247,
0.0126,-0.0001,0.0112,0.0054),byrow=F,ncol=4)

newX1<-as.matrix(X[,1:4])%*%as.matrix(skull.disc$scaling)
newX2<-as.matrix(X[,1:4])%*%as.matrix(skull.disc2$vectors)
newX3<-as.matrix(X[,1:4])%*%bookdisc
bigX<-cbind(newX1,newX2,newX3)
round(10*cor(bigX))/10

From the correlation we can see that the only difference appears to be in the scale that each of the transformed variables has (and rounding error for the one in the book).

In any case, the output from lda can be plotted:

skull.x<-predict(skull.disc,X,dimen=2)$x
eqscplot(skull.x,type="n",xlab="first discriminant",ylab="second discriminant")
text(skull.x,snames)

We can also get an estimate of what "probability" we think each observation has of belonging in each of the different groups, and what group is closest.

predict(skull.disc,X,dimen=4)$class
predict(skull.disc,X,dimen=4)$posterior

Finally, we can enter a function from Venables and Ripley's book to count up the number of correct and incorrect classifications.

confusion<-function(true,predict)
{
jt<-table(true,predict)
jn<-dimnames(jt)[[2]]
jt1<-jt[jn, ]
structure(jt,error=(1-sum(diag(jt1))/length(true)))
}

confusion(snames,predict(skull.disc,X,dimen=2)$class)
confusion(snames,predict(skull.disc,X,dimen=4)$class)


The Jackknife and Logistic Regression

Continuing the above example...

The Jackknife for Estimating Discriminant Analysis Error Rates: The following will perform the jackknife procedure to estimate the misclassification rate. Note that we need to set the priors to be equal for the five groups in order to get the same results as the built in procedure (below) or the one that SAS will do. If you have some prior information that the groups are not of the same size then of course you should consider incorporating that information!

skull.pred<-rep("0",150)
skull.true<-rep("0",150)
for (i in 1:150){
skull.disc<-lda(skulls[,1:4],snames,subset=c(1:150)[-i],prior=c(0.2,0.2,0.2,0.2,0.2))
skull.temp<-predict(skull.disc,as.matrix(skulls[,1:4])[i,],dimen=4)$class
skull.pred[i]<-as.character(skull.temp)
skull.true[i]<-snames[i]
}
skull.t<-table(skull.true,skull.pred)
skull.e<-1-(skull.t[1,1]+skull.t[2,2]+skull.t[3,3]+skull.t[4,4]+skull.t[5,5])/150
skull.t
skull.e

Notice the use of the "-i" index to remove the i'th observation each time through.
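
If you haven't seen negative indices before, a tiny made-up example may help:

x<-c(10,20,30,40)
x[-2]           # returns 10 30 40; the second element has been left out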

The following uses the built in functions in R to perform the jackknife procedure and is much easier to use! (Since all five groups here are the same size, the default prior is equivalent to equal weighting.)

skull.disc <- lda(skulls[,1:4],snames,CV=TRUE)
confusion(snames,skull.disc$class)

Logistic Regression: Performing logistic regression in R is done by using the glm (generalized linear model) function. The following code will conduct the logistic regression to distinguish between Earlypre and Latepre skulls (the | means "or"). We begin by reducing the data set to just those two epochs:

sk2<-skulls[skulls[,5]=="Earlypre" | skulls[,5]=="Latepre",]
sk2.lr<-glm(formula = sk2[,5] ~ sk2[,1] + sk2[,2] + sk2[,3] +sk2[,4], family = binomial(logit))
summary(sk2.lr)

Note that the estimates and p-values match those found using SAS, and that the coefficients are of the form found in formula (8.3) on page 119. You need to make sure if you are reporting the coefficients that you know what form they come in! Also note that the chi-square statistics in SAS are the square of the z-values shown here.
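
For example, squaring the z values (the third column of the coefficient table) should reproduce the Wald chi-square statistics that SAS reports, up to rounding:

summary(sk2.lr)$coefficients[,3]^2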

To get the p-value for the likelihood ratio test of the null hypothesis that all of the slopes are zero we use the difference between the null and residual deviances reported at the bottom of the summary:

df<-summary(sk2.lr)$df.null-summary(sk2.lr)$df.residual
lrchisq<-summary(sk2.lr)$null.deviance-summary(sk2.lr)$deviance
df
lrchisq
1-pchisq(lrchisq,df)

To see how I knew names of the df and chi-squares (that is, what to put after the $ signs) just type attributes(summary(sk2.lr)).

The predicted probabilities can be found under the name:

sk2.lr$fitted.values
Of course, you need to make sure you know if they are the probability of being in the Earlypre group or of being in the Latepre group!
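
If, after checking, the fitted values turn out to be the probability of being in the Latepre group, one rough way to classify the skulls and compare to the truth would be:

sk2.pred<-ifelse(sk2.lr$fitted.values>0.5,"Latepre","Earlypre")
table(sk2[,5],sk2.pred)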

For the Hosmer-Lemeshow Goodness of Fit test it may be easier just to use SAS since it is built in there, although you could program it into R.
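
If you do want to try it in R, here is a rough sketch of one common version of the statistic (it groups by deciles of the fitted probabilities, so it may not agree exactly with SAS's grouping, and the 0/1 coding of the response assumes Latepre is the "success" group):

hoslem<-function(y,phat,g=10){
  grp<-cut(phat,breaks=unique(quantile(phat,probs=seq(0,1,1/g))),include.lowest=TRUE)
  obs<-tapply(y,grp,sum)          # observed number of successes in each group
  expect<-tapply(phat,grp,sum)    # expected number of successes in each group
  n<-tapply(phat,grp,length)      # number of observations in each group
  chisq<-sum((obs-expect)^2/(expect*(1-expect/n)))
  df<-length(levels(grp))-2
  list(chisq=chisq,df=df,p.value=1-pchisq(chisq,df))
}

hoslem(as.numeric(sk2[,5]=="Latepre"),sk2.lr$fitted.values)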


Cluster Analysis

The following code corresponds to the example on pages 128-132.

library(mva)
library(MASS)
d1<-matrix(0,nrow=5,ncol=5)
d1[lower.tri(d1)]<-c(2,6,10,9,5,9,8,4,5,3)
d1<-d1+t(d1)
d1

This sets up the distance matrix shown on pg. 130. The function hclust will perform the cluster analysis on a distance matrix, and plot.hclust (also called plclust) will plot the resulting dendrogram.

par(mfrow=c(1,3))

h1<-hclust(as.dist(d1),method="single")
plot.hclust(h1)

h2<-hclust(as.dist(d1),method="complete")
plot.hclust(h2)

h3<-hclust(as.dist(d1),method="average")
plot.hclust(h3)
We could also do this with the skulls data from before.
par(mfrow=c(1,1))
skulls <- read.table("http://www.stat.sc.edu/~habing/courses/data/skulls.txt",header=TRUE) 
snames<-c(rep('e',30),rep('l',30),rep('t',30),rep('p',30),rep('r',30))
scl<-hclust(dist(skulls[,1:4]),method="complete")
plot.hclust(scl,labels=snames)

For this data set, why is it no surprise that the cluster didn't seem to return a well organized pattern?

Which type of distance was used above between observations?

Which method was used between clusters?

The following will say which observation belonged to each cluster if we forced it to choose 5 clusters.

scl2<-cutree(scl,k=5)
scl2
reals<-c(rep(1,30),rep(2,30),rep(3,30),rep(4,30),rep(5,30))
confusion(reals,scl2)

What is wrong with the above confusion matrix?

What is the code below doing?

scl3<-as.character(scl2)
scl3[scl3=="5"]<-"r"
scl3[scl3=="4"]<-"e"
scl3[scl3=="2"]<-"l"
scl3[scl3=="3"]<-"t"
scl3[scl3=="1"]<-"p"
confusion(snames,scl3)


Distances Revisited

The following code revisits the Finding Distances section above and finds the standardized distance matrix and Mahalanobis distance matrix for the skulls data....

skullmat<-as.matrix(skulls[1:150,1:4])
C<-var(skullmat)

standskull<-skullmat
for (i in 1:4){
standskull[,i]<-(skulls[,i]-mean(skulls[,i]))/sqrt(var(skulls[,i]))
}
standskulldist<-dist(standskull,"euc")


mahskulldist<-matrix(0,nrow=150,ncol=150)
for (i in 1:(150-1)){
  for (j in i:150){
    mahskulldist[i,j]<-mahalanobis(skullmat[i,],skullmat[j,],C)
    mahskulldist[j,i]<-mahskulldist[i,j]
}}
mahskulldist<-as.dist(mahskulldist)

These are now distance matrices that can be used in the cluster analysis. (Why do they work? That's already explained in the first distance section!)
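
For example, assuming you still have snames from the cluster analysis section, you could compare the dendrograms based on the two distance measures:

sclstand<-hclust(standskulldist,method="complete")
plot.hclust(sclstand,labels=snames)

sclmah<-hclust(mahskulldist,method="complete")
plot.hclust(sclmah,labels=snames)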


Multidimensional Scaling

SETTING IT UP AND ENTERING THE DATA
------------------------------------

library(MASS)
library(mva)

scnames<-c("Gr","Sp","Aug","Col","Ctt","Sav","Cls")
scdist<-matrix(c(0,1.6,6.2,6.2,5.8,12.6,12.4,
1.6,0,6.9,5.7,4.2,12.9,12.1,
6.2,6.9,0,4.9,9,6.7,8.3,
6.2,5.7,4.9,0,5.4,8.3,6.5,
5.8,4.2,9.0,5.4,0,13.3,11.0,
12.6,12.9,6.7,8.3,13.3,0,5.4,
12.4,12.1,8.3,6.5,11.0,5.4,0),byrow=T,ncol=7)


THE BASIC TWO DIMENSIONAL SCALING
----------------------------------

loc <- cmdscale(scdist)
eqscplot(loc,type="n")
text(loc,scnames)


ROTATING THE COORDINATES
------------------------

rotate<-function(x,angle){
angle<--1*angle*pi/180
newx<-x
newx[,1]<-x[,1]*cos(angle)-x[,2]*sin(angle)
newx[,2]<-x[,1]*sin(angle)+x[,2]*cos(angle)
return(newx)
}

par(mfrow=c(2,2))
par(pty="s")
plot(loc,type="n",xlim=c(-8,8),ylim=c(-8,8),xlab="",ylab="")
text(loc,scnames)
plot(rotate(loc,45),type="n",xlim=c(-8,8),ylim=c(-8,8),xlab="",ylab="")
text(rotate(loc,45),scnames)
plot(rotate(loc,90),type="n",xlim=c(-8,8),ylim=c(-8,8),xlab="",ylab="")
text(rotate(loc,90),scnames)
plot(rotate(loc,180),type="n",xlim=c(-8,8),ylim=c(-8,8),xlab="",ylab="")
text(rotate(loc,180),scnames)


rotate<-function(x,angle,flip=F){
angle<--1*angle*pi/180
if (flip) {x[,1]<--1*x[,1]}
newx<-x
newx[,1]<-x[,1]*cos(angle)-x[,2]*sin(angle)
newx[,2]<-x[,1]*sin(angle)+x[,2]*cos(angle)
return(newx)
}

 
par(mfrow=c(2,2))
par(pty="s")
plot(loc,type="n",xlim=c(-8,8),ylim=c(-8,8),xlab="",ylab="")
text(loc,scnames)
plot(rotate(loc,0,flip=T),type="n",xlim=c(-8,8),ylim=c(-8,8),xlab="",ylab="")
text(rotate(loc,0,flip=T),scnames)
plot(rotate(loc,90,flip=T),type="n",xlim=c(-8,8),ylim=c(-8,8),xlab="",ylab="")
text(rotate(loc,90,flip=T),scnames)
plot(rotate(loc,180,flip=T),type="n",xlim=c(-8,8),ylim=c(-8,8),xlab="",ylab="")
text(rotate(loc,180,flip=T),scnames)

par(mfrow=c(1,1))
plot(rotate(loc,-115,flip=T),type="n",xlim=c(-8,8),ylim=c(-8,8),xlab="",ylab="")
text(rotate(loc,-115,flip=T),scnames)


HOW CLOSE IS IT?
----------------
sum((as.dist(scdist)-dist(loc))^2)/sum(dist(loc)^2)


SHOWING THAT PRINCIPAL COMPONENTS WORKS THE SAME
------------------------------------------------

prsc<-prcomp(loc)      
prsc
summary(prsc)
eqscplot(prsc$x[,1],prsc$x[,2],type="n")
text(prsc$x[,1],prsc$x[,2],scnames)
dist(prsc$x)


CHECKING FOR THE OTHER DIMENSIONS
---------------------------------

loc1 <- cmdscale(scdist,k=1)
loc2 <- cmdscale(scdist,k=2)
loc3 <- cmdscale(scdist,k=3)
loc4 <- cmdscale(scdist,k=4)
er<-rep(0,4)
er[1]<-sum((as.dist(scdist)-dist(loc1))^2)/sum(dist(loc1)^2)
er[2]<-sum((as.dist(scdist)-dist(loc2))^2)/sum(dist(loc2)^2)
er[3]<-sum((as.dist(scdist)-dist(loc3))^2)/sum(dist(loc3)^2)
er[4]<-sum((as.dist(scdist)-dist(loc4))^2)/sum(dist(loc4)^2)
plot(er)
plot(er,ylim=c(0,0.05))

For Sammon's non-linear mapping, you can use the function sammon (see the next section).


More MDS (Plotting 3rd dimension and Sammon)

The code below will perform multidimensional scaling on a portion of the skulls data after standardizing the data and using Euclidean distance. It will use skulls 1,2,3,31,32,33,61,62,63,91,92,93,121,122,123, and it will use the first three letters of the Epoch name to label the points with.

library(mva)
library(MASS)
skulls <- read.table("http://www.stat.sc.edu/~habing/courses/data/skulls.txt",header=TRUE) 

skulls2<-skulls[c(1,2,3,31,32,33,61,62,63,91,92,93,121,122,123),]
zskulls2<-skulls2
for (i in 1:4){
   zskulls2[,i]<-(skulls2[,i]-mean(skulls2[,i]))/sqrt(var(skulls2[,i]))
}
zskulls2[,5]<-substring(as.character(skulls2[,5]),1,3)

zsdist<-dist(zskulls2[,1:4],method="euclidean")

loc2 <- cmdscale(zsdist,k=2)
plot(loc2,type="n",xlim=c(-3,3),ylim=c(-3,3))
text(loc2,as.character(zskulls2[,5]))

loc3 <- cmdscale(zsdist,k=3)
plot(loc3[,2:3],type="n",xlim=c(-3,3),ylim=c(-3,3))
text(loc3[,2:3],as.character(zskulls2[,5]))

I chose the limits of -3 and 3 because those are large enough that the entire labels would appear on the screen. If you try:

eqscplot(loc2,type="n")
text(loc2,as.character(zskulls2[,5]))

You may find that part of the names are cut off if you have your graphics window set fairly small.

Repeating the above using Sammon scaling could be done with

sam2 <- sammon(zsdist,k=2)
plot(sam2$points,type="n",xlim=c(-3,3),ylim=c(-3,3))
text(sam2$points,as.character(zskulls2[,5]))

sam3 <- sammon(zsdist,k=3)
plot(sam3$points[,2:3],type="n",xlim=c(-3,3),ylim=c(-3,3))
text(sam3$points[,2:3],as.character(zskulls2[,5]))

To see why you need the $points simply type loc3 and sam3 and compare the output.


Trees and the Tree Library

Where is the tree library?: To begin with, the tree library is not one of those that usually downloads automatically with R. Because of this, if you try library(tree) before installing it you will usually get an error.

If you are at home or on another PC that you have administrator access to, then all you need to do is: be connected to the internet, start R, go to the Packages menu, select Install Package from CRAN, and select tree. You are now ready to go.

If you are on a CSM domain PC then the first time you need to put a copy of the Tree library onto your Z drive.

----------------------------------------------------------------------

Start up internet explorer and go to the page
http://cran.r-project.org/bin/windows/contrib/tree.zip

When it asks you what you want to do with the file...
* select "Save this file to disk"
* choose your Z drive and save it there

On the next box that comes up, select "Open Folder"
and select "tree.zip"

If it asks you what program you want to open it with...
* Choose "Other"
* Go to the X drive
* Select "WINZIP",
* Select "WINZIP32.EXE"
* Hit "ok"

You will then need to do something like the following
(although each computer is somewhat different...
the goal is to unzip "tree.zip" into a folder called
tree on your z drive...)

Next, Yes, "Start With WinZip Wizard", Next, Next, Next,
Next, Next,
(tree.zip should be selected and so should the z drive)
Next,
Change the selected folder to z:\tree\
Unzip Now
Next
Close

If it automatically opens WinZip...
* Choose "Extract"
* Highlight the Z drive so it shows in the window in the upper left
* Choose "New Folder"
* Type the word tree into the box
* Ok
* Extract
* Choose Exit in the File Menu

-----------------------------------------------------------------------

Following the above commands should put a copy of all the needed code into a directory on your Z drive.
From now on, all you need to do to get the tree functions is to use the following two lines

library(tree,lib.loc="z:/tree/")
library(help=tree,lib.loc="z:/tree")

The tree commands used in class are...

bumpus2<-read.table("http://www.stat.sc.edu/~habing/courses/data/bumpus2.txt",header=TRUE) 
library(mva)
library(MASS)
bumpus2[,1]<-as.character(bumpus2[,1])
b.tr<-tree(survive~length+alar+weight+lbh+lhum+lfem+ltibio+wskull+lkeel,bumpus2)
summary(b.tr)
plot(b.tr)
text(b.tr)
predict.tree(b.tr)
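
One possible follow-up (this assumes predict for tree objects accepts type="class", and it reuses the confusion function defined in the Discriminant Analysis section above):

b.class<-predict(b.tr,type="class")
confusion(bumpus2[,1],b.class)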


Computer Trouble?

In most cases, help with the computers (NOT the programming) can be gained by e-mailing help@stat.sc.edu

For the printers on the first and second floor, printer paper is available in the Stat Department office. For printers on the third floor, paper is available in the Math Department office.

If you are using a PC restarting the machine will fix many problems, but obviously don't try that if you have a file that won't save or the like. DO NOT TURN OFF THE UNIX MACHINES.

If the problem is an emergency requiring immediate attention see Wei Pan in room 209D.
If Wei Pan is not in, and it is an emergency see Jason Dew in room 415.
If neither Jason nor Wei is available and it is an emergency see Minna Moore in room 417.
Flagrantly non-emergency cases may result in suspension of computer privileges.