Stat 530 - Fall 2003 - R Templates

The Basics of R
Chapter 3:
    Plotting Symbols, Multiple Plots, Regression, Box Plots
    Face Plots
Chapter 4:
    Multiple t-tests
    Hotelling's T2
Chapter 5:
    Distances
Chapter 6:
    Principal Components
Chapter 7:
    Factor Analysis
Chapter 8:
    Fisher's Linear Discriminant Analysis
    Logistic Regression
Chapter 9:
    Cluster Analysis
Chapter 11:
    Multidimensional Scaling
Computer Trouble?
Graphs Printing Small in Word?


The Basics of R:

R, like the commercial S-Plus, is based on the programming language S. It can be downloaded for free from www.r-project.org [At home, after reading the instructions, choose: CRAN, then a mirror site in the US, then Windows (95 and later), then base, then rw1071.exe]. This page also has links to FAQs and other information about R.

One of the nice things about R (besides the fact that it's powerful and free) is that you can save everything you've done after you've finished each time. Since you don't have access to the C drive on your accounts, you will need to manually load and save the .Rdata file in your Z drive each time you want to start and end the program, respectively. You would, of course, have to save the .Rdata file on disk if you wanted to use it both at home and at school.

At the heart of R are the various objects that you enter. At any time (when you are in R) you can type in the function objects() to see which ones you have entered previously. Presumably there aren't any there right now. R also has many built in objects, most of which are functions. Because these objects are always there, it will not list them when you type objects().

A list of the various functions for R can be found in the manuals that are installed with the program. The various manuals can be found under the Manuals submenu of the Help menu. In particular, the R Reference Manual lists not only all the functions in the base package of R, but also some of those in the supplementary packages as well. Because this manual is nearly 1000 pages long, you probably shouldn't print out a copy!!!!

The various supplementary packages are simply groups of functions that others wrote and decided to make available to the public. A list of those that you downloaded can be found under the Packages menu by choosing Load Package. The two we will use most often are mva and MASS. They can also be loaded in by typing library(mva) and library(MASS). Additional packages can also be found on the R home site and downloaded. Because you don't have access to the C drive though, you have to download them to the correct place and give the correct command to open them.
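
For example, if the only place you can write files is your Z drive, something along these lines should work. (This is just a sketch: the package name somepackage is a placeholder, and the folder Z:/Rlibs is an assumption; substitute whatever package and folder you actually want.)

#install a package into a folder you can write to, then load it from there#
install.packages("somepackage",lib="Z:/Rlibs")
library(somepackage,lib.loc="Z:/Rlibs")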

Once you know the name of a function, you can get help about it simply by typing help(functionname). In R, parentheses indicate either mathematical grouping, like they usually do, or that you are applying a function. Square brackets [ ] indicate that you are selecting elements of a vector or matrix.

This brief example shows some of R's flexibility.

x<-c(5,4,3,2)

The <- is the assignment operator; it assigns whatever is on the right to the name on the left. The c indicates that an array or vector is going to be entered. Simply typing x now will return those four values. Typing

x[3]

will return the third value,

mean(x)

will return the mean, and

objects()

will show you that you now have an object called x.

R also has a variety of graphical functions and statistical functions built in.

hist(x)

will construct a histogram, and

t.test(x,alternative="greater",mu=5)

will test the null hypothesis mu=5 vs. the alternative that mu > 5. (If you happen to mistype a command and get an error, just hit the up arrow to bring it up again for editing.)

Besides just getting the result of a function, we can store the results somewhere. For example

x.t<-t.test(x,alternative="greater",mu=5)

will store the result of the t-test in x.t.

attributes(x.t)

will then show you the various properties of x.t. Thus,

x.t$p.value

will return the p-value.

You can also write your own functions in R. For example, if you wanted a function that returned sin(x)+x^2 you could write:

weird<-function(x){
y<-sin(x)+x^2
return(y)}

Try weird(5) and weird(x). What is it doing?

R is able to read data in both from files and from appropriately formatted web-pages. The following command will read in the data we saw in class the other day:

census<-read.table("http://www.stat.sc.edu/~habing/courses/data/census.txt",header=TRUE)

Typing attributes(census) will give you a list of the various names associated with this data set. Try the following to see if you can tell what they are returning:

census[,4]

census[4,]

census$PopDens

We could also make subsets of the data. For example, try:

census[census[,2]=='SC',c("City","PopDens","Income","InfantD")]

What does this seem to return? Why?

It is also easy to apply a function to subsets of data, try:

tapply(census$InfantD,census$State,mean)

The first argument is the list being worked with, the second defines groups, and the third is the function to apply to each group. There are some other functions like this that we will see later on.
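
A related function (not shown in class, but built into R) is apply, which applies a function to every row or column of a matrix rather than to groups. A small sketch, using columns we already know are in the census data:

#the 2 means "apply over columns"; a 1 would mean "over rows"#
apply(census[,c("PopDens","Income","InfantD")],2,mean)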

Trying

plot(census$State,census$PopDens)

will give a graph of the population density for each of the cities. (It gives box-and-whisker plots because the x-variable is qualitative.)

We could also use this to help graphing two of the continuous variables against each other. Say we want to plot the infant death rate (column 12) against the population density (column 5). We could use either of the following:

plot(census$PopDens,census$InfantD)
plot(census[,5],census[,12])

There is no way to tell which state each city belongs to, though. We could plot just the SC cities, using the letter S as the plotting symbol:

plot(census[census$State=='SC',5],census[census$State=='SC',12],pch='S')

Now, why will there be trouble if we do the same thing, but replace 'SC' with 'NC' and the 'S' with 'N'? Use the up arrow and try.

Try putting par(new=T) between the two commands. Now what's the problem?

Try lines like the following:

plot(census[census$State=='SC',5],census[census$State=='SC',12],pch='S', xlim=c(min(census[,5]),max(census[,5])),xlab="Population Density", ylim=c(min(census[,12]),max(census[,12])),ylab="Infant Mortality")

Can you now plot the graphs overlayed on each other?
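
If you get stuck, one possible answer is sketched below (it only uses commands we have already seen); the key is giving both plots the same xlim and ylim so the axes line up.

plot(census[census$State=='SC',5],census[census$State=='SC',12],pch='S', xlim=c(min(census[,5]),max(census[,5])),xlab="Population Density", ylim=c(min(census[,12]),max(census[,12])),ylab="Infant Mortality")
par(new=T)
plot(census[census$State=='NC',5],census[census$State=='NC',12],pch='N', xlim=c(min(census[,5]),max(census[,5])),xlab="Population Density", ylim=c(min(census[,12]),max(census[,12])),ylab="Infant Mortality")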

If you wanted to, you could add various lines to the graph. To add a horizontal line for the mean infant mortality rate in SC you could use the commands

scMInfD<-mean(census[census$State=='SC',12])
lines(c(min(census[,5]),max(census[,5])),c(scMInfD,scMInfD))
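
A shortcut for the same horizontal line (not used above, but built into R) is abline with the h= option:

abline(h=scMInfD)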

And one final graph that we saw in PROC INSIGHT in SAS:

pairs(census[,3:20])


Some More Plotting Commands

In class on the 27th we saw the following commands:

plot(x,y) makes a scatterplot
par(new=T) tells R not to erase the previous plot when it draws the new one.
lines(c(x1,x2,...,xn),c(y1,y2,...,yn)) If you already have a plot open, this will draw the line segments connecting (x1,y1) to (x2,y2) to ... to (xn,yn). You don't need to use par(new=T)
pairs(datamatrix) makes a scatterplot matrix

A variety of other plotting commands are discussed below. Many use the data set census that we used on the 27th. You will thus need to load in that data set again (if you didn't save your workspace) before running some of them.

Plotting Symbols: The option pch="a", for example, plots the character a for each point, instead of simply a dot. If instead we use a number (not in quotes) we get one of several types of symbols. Try the following to see which symbols are available. (The type="n" means that nothing should be plotted except the axes, labels, and the title. The command points just adds points to a previously drawn graph.)

plot(5,5,xlim=c(0,18),ylim=c(0,18),xlab="",ylab="",main="Plot Symbols",type="n")

par(new=T)
plot(1,1,xlim=c(0,18),ylim=c(0,18),xlab="",ylab="",pch=1)
points(2,2,pch=2)
for (i in 3:18){
points(i,i,pch=i)
}

Multiple Plots on the Screen: par(mfrow=c(2,2)) will cause the graphs to be displayed as 2 rows and 2 columns. This can cause trouble sometimes, however. Say you are plotting graphs in sets of three. The first three will fit nicely on the screen... but the first of the next set of three will only appear very briefly in the last spot, before the next two appear at the top of a newly cleared off screen. To reset the entire window before plotting, you could change it back to 1x1 using par(mfrow=c(1,1)) and then go back to 2x2.
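
As a quick sketch (assuming the census data from above is still loaded), the following draws four histograms in a 2x2 layout and then resets the window:

par(mfrow=c(2,2))
hist(census$Population)
hist(census$PopDens)
hist(census$Income)
hist(census$InfantD)
par(mfrow=c(1,1))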

Regression Line: The following code will perform the regression to predict Population Density (PopDens) from the total population (Population), and plot the data and regression line.


cens.reg <- lm(census$PopDens~census$Population)
summary(cens.reg)
plot(census$Population,census$PopDens,xlab="Population",ylab="Population Density")
lines(census$Population,fitted(cens.reg))

You also could have plotted each of the states with a separate plotting symbol using pch if you wanted more detail.
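
Since pch only plots a single character per point, one rough alternative (a sketch, not the only option) is to draw the axes with type="n" and then use text to write each state's abbreviation where its point would go:

plot(census$Population,census$PopDens,xlab="Population",ylab="Population Density",type="n")
text(census$Population,census$PopDens,labels=as.character(census$State),cex=0.75)
lines(census$Population,fitted(cens.reg))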

Box Plots: We saw before that plot would make side by side box plots if the x variable consisted of categories and the y variable was quantitative. One difficulty is that it alphabetizes the figures by category names. Further, if the category names are numbers, plot won't even give you the boxplots. If your category names were numbers, you could try:

boxplot(values~categorynumbers)

(of course you want to put in the name of the vector with the values and the name of the vector with the category numbers.)

We could also trick the function boxplot into giving us the values in the order we wanted. Say we wanted them in the order FL, GA, SC, NC, VA. What we would need to do is make a vector of numbers so that each observation that went with FL had a 1, each that went with GA had a 2, and so on. The following function will do that for us:

vectormaker<-function(categoryvector,ordervector){
newcatvector<-rep(0,length(categoryvector))
for (i in 1:length(ordervector)){
  newcatvector[categoryvector==ordervector[i]]<-i
  }
return(newcatvector)
}
If you run

wewant<-c("FL","GA","SC","NC","VA")
vc<-vectormaker(census$State,wewant)

It will make a new vector called vc that will be like the string of states, but with all the FLs replaced by 1, all the GAs replaced by 2, and so on. (The vector wewant is simply the order in which we want the states to appear.) Type in

data.frame(vc,census$State)

to see if it worked correctly (it should put 0s in for the states we didn't select).

Typing

boxplot(census$PopDens[vc!=0]~vc[vc!=0],names=wewant)

will now produce a boxplot with the variables in the order we want. The != is "not equals", so it says to only plot the ones with vc not equal to zero.

We could have combined this all into one function too:

orderedboxes<-function(datavector,categoryvector,ordervector){
newcatvector<-rep(0,length(categoryvector))
for (i in 1:length(ordervector)){
  newcatvector[categoryvector==ordervector[i]]<-i
  }
boxplot(datavector[newcatvector!=0]~newcatvector[newcatvector!=0],names=ordervector)
}
And just run

orderedboxes(census$PopDens,census$State,c("FL","GA","SC","NC","VA"))


Face Plots

The code required for creating the face plots we designed in class can be found at: http://www.stat.sc.edu/~habing/courses/530faceF03.txt . You could open that page and select all of the text and copy and paste it into R in order to create the faces. An easier way would be to simply run the following command:

source("http://www.stat.sc.edu/~habing/courses/530faceF03.txt")
There are eleven functions on that page. The functions ellipse and cellipse draw ellipses (or parts thereof) and filled in ellipses respectively. The functions mbbjw, bpt, wywmpd, bhsm, dlbj, bbwb, tiks, and ypa draw the eight different types of faces if they are fed five variables valued from 0 to 1. Finally, faces will draw the faces for a matrix of data that is up to five columns wide.

To construct the faces, you run:

faces(matrixname,labellist,facetype)
where matrixname is the name of the five column matrix you are trying to analyze, labellist is the list of names you want to appear above the faces, and facetype is one of the eight face type names in quotation marks (e.g. "mbbjw").

To see what each of the five variables does in one of the face types, you can look at the code for that face simply by typing in its name without quotes and without parentheses (e.g. mbbjw). If you can't tell which extreme goes with a high or low value from the code, you can compare the actual data set to the faces it produces to help decide.

The following code analyzes the data set shown in Table 1.1 on page 2.

#load in the data#

bumpus<-read.table("http://www.stat.sc.edu/~habing/courses/data/bumpus3.txt",head=T)
#make the bhsm faces for the first nine sparrows and columns 4-7,#
#with the first column being used as the label#
faces(bumpus[1:9,4:7],bumpus[1:9,1],"bhsm")

#make the bbwb faces for all the sparrows and columns 4-7,#
#with the first column being used as the label#
#BUT put them in the order of column 3!#
col3order<-order(bumpus[,3])
faces(bumpus[col3order,4:7],bumpus[col3order,1],"bbwb")

#Notice that they all have the same hair! Checking the code for#
#bbwb we see that the hair is the fifth variable... How many#
#different variables did we ask it to include for this data set though?#


Multiple t-tests

The function below will conduct a set of t-tests between two groups and will return the p-values for the individual tests. You should probably be able to follow through all the code for this function.

multi.t.test<-function(group1,group2){
#This function will conduct t-tests between two groups on multiple variables#
##
#1) Make sure they are both matrices#
##
group1<-as.matrix(group1)
group2<-as.matrix(group2)
##
#2) Set up the output vector#
##
nvars<-length(group1[1,])
pvals<-rep(0,length=nvars)
##
#3) Conduct all the t-tests and store their p-values#
##
for (i in 1:nvars){
 pvals[i]<-t.test(group1[,i],group2[,i],var.equal=T)$p.value
  }
##
#Finally, return the output
##
return(pvals)
}

Using the sparrow data set above, we could test to see if there is a difference between the survivors and non-survivors. We could begin by making the two groups. (You need to have loaded in the bumpus data set from the faces section above.)
bsurv<-bumpus[bumpus[,2]==1,3:7]
bdied<-bumpus[bumpus[,2]==0,3:7]
multi.t.test(bsurv,bdied)

It appears that there is no significant difference between the two groups on any of these five variables, so we have no need to apply an adjustment for dealing with several tests simultaneously. That, of course, assumes we can trust the assumptions!
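
If an adjustment had been needed, one option (not part of the original function, but built into R) is p.adjust; for the Bonferroni method this just multiplies each p-value by the number of tests:

pvals<-multi.t.test(bsurv,bdied)
p.adjust(pvals,method="bonferroni")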

To check the assumptions we need to make 10 different q-q plots and perform five different modified Levene tests. The following code will construct the ten q-q plots we need here and lay them out in a 3x4 graphical display (three rows by four columns). (After doing this you will want to use par(mfrow=c(1,1)) before doing anything else graphical.)

par(mfrow=c(3,4))
for (surv in 0:1){
  for (variable in 3:7){
    qqnorm(bumpus[bumpus[,2]==surv,variable])
    qqline(bumpus[bumpus[,2]==surv,variable])
}}
None of the Q-Q plots look particularly bad (the procedure is fairly robust)... it does appear that variable 5 (graph 5) for the dead group and variables 1 and 4 (graphs 6 and 9) for the survivor group have some outliers.

The following function will conduct a modified Levene test for all the variables in two groups. Its format is very similar to that of the multi.t.test function above. (See Neter, Kutner, Nachtsheim, and Wasserman for a reference.)


mlevene.test<-function(group1,group2){
#This function will conduct the modified Levene test for two groups#
#Updated 9/30/03 to use the Median instead of the Mean#
##
#1) Make sure they are both matrices#
##
group1<-as.matrix(group1)
group2<-as.matrix(group2)
##
#2) Set up the output vector#
##
nvars<-length(group1[1,])
pvals<-rep(0,length=nvars)
##
#3) Conduct all the modified Levene tests and store their p-values#
##
for (i in 1:nvars){
 g1temp<-abs(group1[,i]-median(group1[,i]))
 g2temp<-abs(group2[,i]-median(group2[,i]))

 pvals[i]<-t.test(g1temp,g2temp,var.equal=T)$p.value
  }
##
#Finally, return the output
##
return(pvals)
}

Running this on the sparrow data

mlevene.test(bsurv,bdied)

returns p-values ranging from 0.06197897 to 0.41933977. The 0.06197897 might cause us some concern since we know the modified Levene test lacks some power. There are two reasons to not be overly concerned however: 1) We are checking five variances here and so have inflated Type I error rates, and 2) We know the t-test is fairly robust.


Hotelling's T2

This section is divided into two parts. First we work out the values the long way using matrices. Then there is a function that will calculate it for you from the raw data. Both examples use the sparrow data (pages 2-3) seen above:

bumpus<-read.table("http://www.stat.sc.edu/~habing/courses/data/bumpus3.txt",head=T)
bsurv<-bumpus[bumpus[,2]==1,3:7]
bdied<-bumpus[bumpus[,2]==0,3:7]

Calculating T2 using Matrices The following code evaluates equations 4.4 to 4.6 (pages 39 and 40) for the sparrow data (pages 2-3) to replicate the output shown in Example 4.1. Note that the first 21 sparrows are the group that survived, and the remaining 28 did not. As you enter each line, you should type the name of the variable you just created so that you can see what the command you entered actually calculated. The commands of particular importance are solve (matrix inverse), t (transpose), and %*% (matrix multiplication).

First we need to get the mean vectors and pooled covariance matrix. We could calculate it from the raw data as follows:

n1 <- length(bsurv[,1])
xbar1 <- apply(bsurv,2,mean)
C1 <- cov(bsurv)

n2 <- length(bdied[,1])
xbar2 <- apply(bdied,2,mean)
C2 <- cov(bdied)

C <- ((n1-1)*C1+(n2-1)*C2)/(n1+n2-2)

Or, if we already knew the values we could enter them manually. When entering the matrix, if you enter it by going across rows, you need to use the option byrow=T. You also need to tell it how many columns the matrix has, using ncol=.

n1<-21
n2<-28
xbar1<-c(157.381,241.000,31.433,18.500,20.810)
xbar2<-c(158.429,241.571,31.479,18.446,20.839)
C<-matrix(c(13.358,13.748,1.951,1.373,2.231,
  13.748,26.146,2.765,2.252,2.710,
  1.951,2.765,0.645,0.350,0.423,
  1.373,2.252,0.350,0.324,0.347,
  2.231,2.710,0.423,0.347,1.004),byrow=T,ncol=5)

Once we have the values entered, we can simply use the matrix commands for the formulas 4.5 and 4.6 on pages 39 and 40. The numbers should match those you find on pages 42 and 43. (Well... they'll be off a little if you took the second option above and entered the numbers manually. That's because of rounding error.)

Cinv <- solve(C)  

Tsquare <- (n1*n2)*t(xbar1-xbar2)%*%Cinv%*%(xbar1-xbar2)/(n1+n2)

p <- length(C[1,])

F <- (n1+n2-p-1)*Tsquare/((n1+n2-2)*p)

p.value <- 1 - pf(F,df1=p,df2=n1+n2-p-1)

A function for calculating T2

The following code is a function for calculating T2 using the same commands you saw above.

tsquare<-function(group1,group2){
#This function calculates Hotelling's T-square (pg. 39-40)#
##
#1) Make sure they are both matrices
##
group1<-as.matrix(group1)
group2<-as.matrix(group2)
##
#2) Find the mean vector and covariance matrix for the first group#
##
n1 <- length(group1[,1])
xbar1 <- apply(group1,2,mean)
C1 <- cov(group1)
##
#3) find the mean vector and covariance matrix for the second group#
##
n2 <- length(group2[,1])
xbar2 <- apply(group2,2,mean)
C2 <- cov(group2)
##
#4) find the pooled covariance matrix and its inverse#
##
C <- ((n1-1)*C1+(n2-1)*C2)/(n1+n2-2)
Cinv <- solve(C)  
##
#5) multiply to find the T-square stat and convert it to F and p-value#
##
Tsquare <- (n1*n2)*t(xbar1-xbar2)%*%Cinv%*%(xbar1-xbar2)/(n1+n2)
p <- length(group1[1,])
F <- (n1+n2-p-1)*Tsquare/((n1+n2-2)*p)
p.value <- 1 - pf(F,df1=p,df2=n1+n2-p-1)
##
#Finally, output all of the results#
##
return(list(n1=n1,xbar1=xbar1,C1=C1,xbar2=xbar2,C2=C2,n2=n2,C=C,p=p,Tsquare=Tsquare,F=F,p.value=p.value))}

For example, try tsquare(bsurv,bdied).

As far as checking the assumptions, we checked all of the individual q-q plots in the above section on t-tests. To get all of the pair-wise scatter plots we could simply use the pairs command on each of the two sets of data:

pairs(bsurv)
pairs(bdied)

It's awfully hard to tell with so few data points whether they make ellipses or not, but they don't look too bad (except maybe for Total.Length and Alar.Extent for the survived group).

To visually compare the two covariance matrices, we could simply output them.

C1   
C2  

It looks in this case like C2 is much more variable than C1! Because the sample sizes are different (21 for the 1st group and 28 for the second group) this could be troublesome. Finding the generalized variances (the determinant of the covariance matrices)

det(C1)
det(C2)

We see that the larger group has the larger variance, and so the T2 statistic is likely to be conservative (not reject as often as it should).


Distances

Distances Between Individual Observations: The following code will replicate Example 5.1 on page 60. (See help(dist) for more information.)

dogs<-read.table("http://www.stat.sc.edu/~habing/courses/data/dogs.txt",head=T)
 
dogs2<-dogs[,2:7]
rownames(dogs2)<-dogs[,1]   
for (i in 1:6){
  dogs2[,i]<-(dogs2[,i]-mean(dogs2[,i]))/sd(dogs2[,i])
}

dist(dogs2)
Notice that this doesn't seem to match the table on page 61! Try dist(dogs2)*sqrt(7/6) though. What happened is that when they standardized they used the standard deviation formula where they divided by n instead of dividing by n-1.
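
One way to check that explanation (a quick sketch, not from the text) is to redo the standardization with the divide-by-n formula; dist on the result should then match the table on page 61 directly:

#standard deviation dividing by n instead of n-1#
sdn<-function(x){sqrt(mean((x-mean(x))^2))}
dogs3<-dogs[,2:7]
rownames(dogs3)<-dogs[,1]
for (i in 1:6){
  dogs3[,i]<-(dogs3[,i]-mean(dogs3[,i]))/sdn(dogs3[,i])
}
dist(dogs3)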

Distances Between Samples: The following code will replicate Example 5.2 on pages 64-67 using Mahalanobis distance. Here we do this with the built-in function mahalanobis, but we could also have used formulas very similar to those used for calculating Hotelling's T2, except generalized to more than two groups.

The data set is that found in Table 1.2 on pages 4-5 from a 1905 work by Thomson and Randall-Maciver. For each of five epochs in Egypt, four measurements were made on samples of 30 male skulls. The measurements are the first four variables: MB=maximum breadth, BH=basibregmatic height, BL=basialveolar length, and NH=nasal height. In order to use this example it is necessary to load the data in, and to attach the mva library.

skulls <- read.table("http://www.stat.sc.edu/~habing/courses/data/skulls.txt",header=TRUE) 

We also use the following function to calculate the C matrix and a matrix of the x-bars. I probably wouldn't try to follow what this function does unless you are trying to get better at your R programming.

mahsetup<-function(datacolumns,labelcolumn){
datacolumns<-as.matrix(datacolumns)
labelcolumn<-as.factor(labelcolumn)
nvars<-length(datacolumns[1,])
groupnames<-levels(labelcolumn)
ngroups<-length(groupnames)
nvector<-rep(0,ngroups)
xbarmatrix<-matrix(0,ncol=ngroups,nrow=nvars)
covmatrix<-matrix(0,ncol=nvars,nrow=nvars)
for (i in 1:ngroups){
  tempdata<-datacolumns[labelcolumn==groupnames[i],]
  nvector[i]<-length(tempdata[,1])
  xbarmatrix[,i]<-apply(tempdata,2,mean)
  covmatrix<-covmatrix+var(tempdata)*(nvector[i]-1) 
}
covmatrix<-covmatrix/(sum(nvector)-ngroups)
colnames(xbarmatrix)<-groupnames
rownames(xbarmatrix)<-rownames(covmatrix)
return(list(xbarmatrix=xbarmatrix,covmatrix=covmatrix))}

So, this code should give you the distance matrix on page 67:

xbarandC<-mahsetup(skulls[,1:4],skulls[,5])
xbar<-xbarandC$xbarmatrix
C<-xbarandC$covmatrix
mahmat<-matrix(0,nrow=5,ncol=5)
for (i in 1:5){
  for (j in 1:5){
    mahmat[i,j]<-mahalanobis(xbar[,i],xbar[,j],C)
}}
rownames(mahmat)<-colnames(mahmat)<-colnames(xbar)
xbar
C
mahmat
Compare the xbar to the bottom of page 64 (note the order of the dynasties has changed a little), the C to the bottom of page 65, and the mahmat to the middle of page 67.

Did we really need that mahsetup function? If you really don't like the mahsetup program above, here's what we would have had to go through to calculate the xbar matrix and C the long way:

x1<-as.matrix(skulls[skulls[,5]=="Earlypre",1:4])
x2<-as.matrix(skulls[skulls[,5]=="Latepre",1:4])
x3<-as.matrix(skulls[skulls[,5]=="Twelveand13",1:4])
x4<-as.matrix(skulls[skulls[,5]=="Ptolemaic",1:4])
x5<-as.matrix(skulls[skulls[,5]=="Roman",1:4])

n1<-length(x1[,1])
n2<-length(x2[,1])
n3<-length(x3[,1])
n4<-length(x4[,1])
n5<-length(x5[,1])

xbar1<-apply(x1,2,mean)
xbar2<-apply(x2,2,mean)
xbar3<-apply(x3,2,mean)
xbar4<-apply(x4,2,mean)
xbar5<-apply(x5,2,mean)
xbarmatrix<-rbind(xbar1,xbar2,xbar3,xbar4,xbar5)

C1<-var(x1)
C2<-var(x2)
C3<-var(x3)
C4<-var(x4)
C5<-var(x5)
C<-((n1-1)*C1+(n2-1)*C2+(n3-1)*C3+(n4-1)*C4+(n5-1)*C5)/(n1+n2+n3+n4+n5-5)

xbarmatrix
C


Principal Components

The following is the analysis of the bear data, basically as we saw it in class. I have also used the built-in function for making a scree plot.

The Bear Data:

library(MASS)

bears<-read.table("http://www.stat.sc.edu/~habing/courses/data/bears.txt",header=T)

beardat<-as.matrix(bears[,3:7])
var(beardat)

#Do the principal components on the raw data#

bears.pca<-princomp(beardat)
summary(bears.pca)

loadings(bears.pca)

#To get the zero'd out small values#
#(use cor not cov if you want the rescaled variables)#

eigen(cov(beardat))

#Create the Z's#

bears.pred<-predict(bears.pca)

#Verify that the Z's have the properties they are supposed to#
#(You normally wouldn't need to do this!)

round(var(bears.pred),2)
round(apply(bears.pred,2,mean),2)

#Do the principal components on the standardized data# 

b2.pca<-princomp(beardat,cor=T)
summary(b2.pca)

loadings(b2.pca)

b2.pred<-predict(b2.pca)

#Use the correlation and r^2 matrices to see how the#
#Z's and X's relate# 

cor(beardat,b2.pred)

cor(beardat,b2.pred)^2

#We could also round this to two decimal places#

round(cor(beardat,b2.pred)^2,2)


#Note that the r^2 add to 1 for each X variable#

apply(cor(beardat,b2.pred)^2,1,sum)

#Plot the first two principal components#

eqscplot(b2.pred[,1:2],type="n",xlab="first pc",ylab="second pc")
text(b2.pred[,1:2],labels=as.character(bears[,9]),cex=0.5)

#Plot for making a scree plot#

plot(b2.pca)

 
Other Data Sets: The other two data sets we used in class can be found at:

testdata<-read.table("http://www.stat.sc.edu/~habing/courses/data/testdata.txt",header=T)

bballtest<-read.table("http://www.stat.sc.edu/~habing/courses/data/bballtest.txt",header=T)


Factor Analysis

(The functions at 530factF03.txt were updated 4:22pm on 10/14/03 !)
Several functions for conducting factor analysis can be found at: http://www.stat.sc.edu/~habing/courses/530factF03.txt. You could open that page and select all of the text and copy and paste it into R or you can simply run the following command:

source("http://www.stat.sc.edu/~habing/courses/530factF03.txt")

The functions you will use below are factpca, factpf, factiter, testfit, and fitsummary; what each one does is described in the comments of the code that follows.

The following code will carry out various factor analyses on the data set in figure 1.5 that is used in example 7.1 in Chapter 7. The data can be found in the file: http://www.stat.sc.edu/~habing/courses/data/eurojobs.txt.

eurojob<-read.table("http://www.stat.sc.edu/~habing/courses/data/eurojobs.txt",head=T)

eurodat<-eurojob[,2:10]

#Three methods to pick the number of factors# 

eigen(cor(eurodat))$value
plot(princomp(eurodat,cor=T))
summary(princomp(eurodat,cor=T)) 

#Unrotated Principal Component Factor Analysis#
#Returning only 4 Factors, and rounding to 2 digits#
#This gives you the values (up to sign) found on pg. 101#

fs1<-factpca(eurodat)
fs1<-fs1[,1:4]
round(fs1,2)

#Here are the varimax values on page 102#

fs2<-varimax(fs1)$loadings
round(fs2,2)

#To see if this fits, though, we could use either#
#of the following#

testfit(eurodat,fs2)
fitsummary(eurodat,fs2)

#Notice that the covariances of the errors have absolute#
#values as large as 0.1480 which is larger than many of#
#the error variances! (although it is still smaller than the#
#communalities)#

#The following code will use the iterated principal factor#
#method with 100 iterations and a maximum of 4 factors# 

fs3<-factiter(eurodat,niter=100,maxfactor=4)
round(fs3$f.loadings,2)

#Notice that some of the loadings are greater than 1.#

fitsummary(eurodat,fs3$f.loadings)

#So it is not suprising that we get a negative covariance estimate.#

#We could still try it without iteration and see if we get something#
#that fits a bit better though.

fs4<-factpf(eurodat)
fs4<-fs4[,1:4]
round(fs4,2)
fitsummary(eurodat,fs4)

#While this fits a lot better than the iterated solution, it still doesn't#
#fit all that well.  Maybe this isn't such a good example in the text...#
#except it shows what can happen quite a bit with real data.#


Fisher's Linear Discriminant Analysis

The code here uses the skulls data to work through Example 8.1 on page 112. It is important to note that it will not exactly match the output in the book, since the W matrix on page 112 has an error (compare it to the W matrix on page 52... there is a digit off). Further, the coefficients in (8.2) on page 113 have been standardized differently than what R or SAS does. The output in Table 8.3 will match up, however (after rearranging the columns and rows appropriately).

The function lda is contained in the MASS library.

library(MASS)
skulls <- read.table("http://www.stat.sc.edu/~habing/courses/data/skulls.txt",header=TRUE) 
snames<-c(rep('e',30),rep('l',30),rep('t',30),rep('p',30),rep('r',30))

skull.disc <- lda(skulls[,1:4],snames)

skull.disc$scaling

In any case, the output from lda can be plotted:

skull.x<-predict(skull.disc,skulls[,1:4],dimen=2)$x
eqscplot(skull.x,type="n",xlab="first discriminant",ylab="second discriminant")
text(skull.x,snames)

We can also get an estimate of what "probability" we think each observation has of belonging in each of the different groups, and what group is closest.

predict(skull.disc,skulls[,1:4],dimen=4)$class
predict(skull.disc,skulls[,1:4],dimen=4)$posterior

Finally, we can enter a function from an earlier version of Venables and Ripley's book to count up the number of correct and incorrect classifications

accuracy<-function(true,predict)
{
jt<-table(true,predict)
jn<-dimnames(jt)[[2]]
jt1<-jt[jn, ]
return(list(jt=jt,error=(1-sum(diag(jt1))/length(true))))
}

accuracy(snames,predict(skull.disc,skulls[,1:4],dimen=2)$class)
accuracy(snames,predict(skull.disc,skulls[,1:4],dimen=4)$class)

The Jackknife for Estimating Discriminant Analysis Error Rates: The following uses the built in functions in R to perform the jackknife

skull.disc <- lda(skulls[,1:4],snames,CV=TRUE)
accuracy(snames,skull.disc$class)


Logistic Regression

Performing logistic regression in R is done by using the glm (generalized linear model) function. The following code will conduct the logistic regression to distinguish between Earlypre and Roman skulls (the | means "or"). We begin by reducing the data set to just those two cases:

sk2<-skulls[skulls[,5]=="Earlypre" | skulls[,5]=="Roman",]
sk2.lr<-glm(formula = sk2[,5] ~ sk2[,1] + sk2[,2] + sk2[,3] +sk2[,4], family = binomial(logit))
summary(sk2.lr)

Note that the estimates and p-values match those found using SAS, and that the coefficients are of the form found in formula (8.3) on page 119. You need to make sure if you are reporting the coefficients that you know what form they come in! Also note that the chi-square statistics in SAS are the square of the z-values shown here.
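
To check that claim you could square the z column of the coefficient table yourself (this sketch assumes the z values are in the third column of the summary's coefficient matrix, which is the usual layout):

summary(sk2.lr)$coefficients[,3]^2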

To get the p-value for the likelihood ratio test of the null hypothesis that all of the slopes are zero we use the difference between the null and residual deviances reported at the bottom of the summary:

df<-summary(sk2.lr)$df.null-summary(sk2.lr)$df.residual
lrchisq<-summary(sk2.lr)$null.deviance-summary(sk2.lr)$deviance
df
lrchisq
1-pchisq(lrchisq,df)

To see how I knew names of the df and chi-squares (that is, what to put after the $ signs) just type attributes(summary(sk2.lr)).

The predicted probabilities can be found under the name:

sk2.lr$fitted.values
of course you need to make sure you know if they are the probability of being in the Earlypre group or of being in the Roman group!

For the Hosmer-Lemeshow Goodness of Fit test it may be easier just to use SAS since it is built in there, although you could program it into R.
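
If you did want to try it in R, a rough sketch of such a function is below (my own sketch, not from the course files, so treat it as a starting point rather than a definitive implementation). It groups the observations by their fitted probabilities and compares observed and expected counts in each group.

hosmerlem<-function(y,yhat,g=10){
#y should be 0/1 outcomes, yhat the fitted probabilities, g the number of groups#
cutyhat<-cut(yhat,breaks=quantile(yhat,probs=seq(0,1,1/g)),include.lowest=TRUE)
obs<-tapply(y,cutyhat,sum)
n<-tapply(y,cutyhat,length)
expect<-tapply(yhat,cutyhat,sum)
chisq<-sum((obs-expect)^2/(expect*(1-expect/n)))
p.value<-1-pchisq(chisq,g-2)
return(list(chisq=chisq,df=g-2,p.value=p.value))
}

#for the skulls example the 0/1 outcome might be coded as "is it Roman?"#
hosmerlem(as.numeric(sk2[,5]=="Roman"),sk2.lr$fitted.values)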


Cluster Analysis

The following code corresponds to the example on pages 128-132. Note that it involves actually entering the distance matrix and calling it d1! If we had the raw data set it would be easier and we would just use the command dist to get the distances.

library(mva)
library(MASS)
d1<-matrix(0,nrow=5,ncol=5)
d1[lower.tri(d1)]<-c(2,6,10,9,5,9,8,4,5,3)
d1<-d1+t(d1)
d1

This sets up the distance matrix shown on pg. 130. The functions hclust and plot will perform the cluster analysis and the plotting on a distance matrix. (Check the axes to make sure they really are actually different!)

par(mfrow=c(1,3))

h1<-hclust(as.dist(d1),method="single")
plot(h1)

h2<-hclust(as.dist(d1),method="complete")
plot(h2)

h3<-hclust(as.dist(d1),method="average")
plot(h3)
Note that "single" is nearest neighbor linkage and "complete" is the farthest neighbor linkage.

We could also do this with the skulls data from before. (Although it doesn't work well at all here!) Notice that we are clustering the rows of the data set skulls. If we had wanted to cluster the columns instead, we would have had to use dist(t(skulls[,1:4])).

par(mfrow=c(1,1))
skulls <- read.table("http://www.stat.sc.edu/~habing/courses/data/skulls.txt",header=TRUE) 
snames<-c(rep('e',30),rep('l',30),rep('t',30),rep('p',30),rep('r',30))
scl<-hclust(dist(skulls[,1:4]),method="complete")
plot(scl,labels=snames,cex=0.75)
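
One quick way to see just how badly it does (a small check using the built-in cutree function, not something we did in class) is to cut the tree into five clusters and cross-tabulate the cluster labels against the true epochs:

table(snames,cutree(scl,k=5))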


Multidimensional Scaling

The following code is for the cube example we used in class. Note that we need to take the distance between the points. It is important to notice that the variables we want to plot are the different rows. If we had wanted to plot the different columns we would have had to have used dist(t(cube)) to take the transpose of the data set.

library(MASS)

cube<-matrix(c(
0,0,0,
0,1,0,
1,1,0,
1,0,0,
1,0,1,
1,1,1,
0,1,1,
0,0,1),byrow=T,ncol=3)

names<-c("sw","nw","ne","se","SE","NE","NW","SW")

c.cmd2<-cmdscale(dist(cube),k=2)
par(mfrow=c(1,1))
plot(c.cmd2,type="n")
text(c.cmd2,labels=names)

c.cmd3<-cmdscale(dist(cube),k=3)
par(mfrow=c(2,2))
plot(c.cmd3[,1:2],type="n")
text(c.cmd3[,1:2],labels=names)
plot(c.cmd3[,c(1,3)],type="n")
text(c.cmd3[,c(1,3)],labels=names)
plot(c.cmd3[,2:3],type="n")
text(c.cmd3[,2:3],labels=names)

c.sam2<-sammon(dist(cube),k=2)
par(mfrow=c(1,1))
plot(c.sam2$points,type="n") 
text(c.sam2$points,labels=names)

c.iso<-isoMDS(dist(cube),k=2)
plot(c.iso$points,type="n")
text(c.iso$points,labels=names)


Computer Trouble?

In most cases, help with the computers (NOT the programming) can be gained by e-mailing help@stat.sc.edu

For the printers on the first and second floor, printer paper is available in the Stat Department office. For printers on the third floor, paper is available in the Math Department office.

If you are using a PC restarting the machine will fix many problems, but obviously don't try that if you have a file that won't save or the like.

If your graphs print out extremely small after you copy them into Word, you might be able to fix the problem by "opening and closing" the image. In Word, left click once on the image, and select Edit Picture or Open Picture Object under the Edit menu. A separate window will open with the image in it. Simply choose Close Picture. It should now print out OK. This will also make the spacing between the characters in the labels look right if it was somewhat off.

If the problem is an emergency requiring immediate attention see Jason Dew in room 209D.
If Jason is not available and it is an emergency see Minna Moore in room 417.
Flagrantly non-emergency cases may result in suspension of computer privileges.