Stat 778 - Spring 2002 - R Templates

The Basics of R
Loading in Data
Classical Test Theory (Corrections made 1/31/02)
Classical Item Analysis
Subsetting Data (for Homework 2)
Some Basic Statistical Analyses
Simulating IRT Data (for Homework 3)
Saving Simulated Data
Splitting an Exam by Items
Difference Between IRFs
Item Information Functions
Dimensionality Procedures
Computer Trouble?


The Basics of R:

R, like the commercial S-Plus, is based on the programming language S. It can be downloaded for free from www.r-project.org [at home, after reading the instructions, choose: CRAN, then Windows (95 and later), then base, then SetupR.exe]. This page also has links to FAQs and other information about R.

One of the nice things about R (besides the fact that it's powerful and free) is that you can save everything you've done each time you finish. If you are using R on a lab PC in LeConte, you will need to manually load and save the .RData file in your Z drive each time you start and end the program, respectively. (This is because you don't have access to the C drive.) This isn't a problem if you are using it on your own PC.
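
For example, assuming your workspace file is kept at Z:/.RData (the exact path is just an illustration), you could start a session with

load("Z:/.RData")

and end it with

save.image("Z:/.RData")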

At the heart of R are the various objects that you enter. At any time (when you are in R) you can type in the function objects() to see which ones you have entered previously. Presumably there aren't any there right now. R also has many built in objects, most of which are functions. Because these objects are always there, it will not list them when you type objects().

A list of the various functions for R can be found in the manuals that are installed with the program. The various manuals can be found under the Manuals submenu of the Help menu. In particular, the R Reference Manual lists not only all the functions in the base package of R, but also some of those in the supplementary packages as well. Because this manual is nearly 1000 pages long, you probably shouldn't print out a copy!!!!

Once you know the name of a function, you can get help about it simply by typing help(functionname). In R, parentheses indicate either mathematical grouping, like they usually do, or that you are applying a function. Square brackets [ ] indicate that you are selecting an element of a vector, matrix, or array.
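
For example, either of the following will bring up the help page for the function mean:

help(mean)
?mean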

This brief example shows some of R's flexibility.

x<-c(5,4,3,2)

The <- is the command that assigns whatever is on the right to the name on the left. The c indicates that an array or vector is going to be entered. Simply typing x now will return those four values. Typing x[3] will return the third value, mean(x) will return the mean, and objects() will show you that you now have an object called x.

R also has a variety of graphical and statistical functions built in. hist(x) will construct a histogram, and t.test(x,alternative="greater",mu=5) will test the null hypothesis mu=5 vs. the alternative that mu > 5. (If you happen to mistype a command and get an error, just hit the up arrow to bring it back up for editing.)

Besides just getting the result of a function, we can store the results somewhere. For example x.t<-t.test(x,alternative="greater",mu=5) will store the result of the t-test in x.t. attributes(x.t) will then show you the various properties of x.t. Thus, x.t$p.value will return the p-value.
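
Entered all together, that sequence of commands is:

x.t<-t.test(x,alternative="greater",mu=5)
attributes(x.t)
x.t$p.value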

You can also write your own functions in R. For example, if you wanted a function that returned sin(x)+x^2, you could write:

weird<-function(x){
y<-sin(x)+x^2
return(y)}

Try weird(5) and weird(x). What is it doing?

Loading in Data:

Two common formats for testing data are simply big matrices, either with or without spaces. Two examples can be found at: http://www.stat.sc.edu/~habing/courses/data/a1291nospace.txt and http://www.stat.sc.edu/~habing/courses/data/a1291space.txt. To load these into R as an object called a1291 we could do:

a1291<-read.fwf("http://www.stat.sc.edu/~habing/courses/data/a1291nospace.txt",widths=rep(1,24))

or

a1291<-read.fwf("http://www.stat.sc.edu/~habing/courses/data/a1291space.txt",widths=rep(2,24))

The widths=rep(x,24) tells R that there will be 24 columns, each of width x. Notice that just typing a1291 won't be very interesting. However, if you type

dim(a1291)

you will see that a1291 has 6000 rows and 24 columns. Using

apply(a1291,2,mean)

will give you the average of each column (the 2 is for column, 1 would be for row). Finally, we could get a histogram of the sums of the rows:

hist(apply(a1291,1,sum),breaks=c((0:25)-0.5))

The breaks run from -0.5 to 24.5 so that each bar is centered on one of the possible whole-number scores (0 through 24).


Classical Test Theory Statistics

The three functions below give some of the basic Classical Test Theory statistics that we talked about in class. It is probably best to copy all three of the functions over at one time, since sem uses alpha and xandt uses sem. Copy them over just as they are; you don't need to make any changes.

Once you have copied the three functions over, you can run them by simply using the form function(arguments). Thus, to get the measures of reliability for the a1291 data set we could use alpha(a1291). To get the 60% confidence intervals for true score using alpha (instead of lambda2) you would type
xandt(a1291,rtype="alpha",conf.level=0.6)

Brief Descriptions of the Three Functions

alpha - Calculates both the estimated Cronbach's alpha and Guttman's lambda2 statistics.

sem - Calculates the estimated standard error of measurement (SEM) using both alpha and lambda2 to estimate the reliability. sem1 uses alpha, and sem2 uses lambda2.

xandt - Gives the frequency distribution of observed scores, the regression estimate of true score, the SEM-based confidence interval for T, and the regression-based confidence interval for T. It gives you the option of using either alpha or lambda2 (lambda2 is the default) and your choice of confidence level (0.95 is the default). At the top of the output it gives you the sample mean, variance, and standard deviation of the observed scores (X). Below that it gives a table. The columns of the table are:

  • x = observed score,
  • treg = the t-hat estimated using the regression method for that X,
  • count = the number of examinees with that observed score,
  • losem and hisem = the confidence interval for T using the form X ± z*SEM,
  • loreg and hireg = the confidence interval for T using the form T-hat ± z*SEE.
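
Since xandt (as given below) returns its results in a named list, you can also store the output and pull out individual pieces; for example (the table of scores is stored under the name table):

out<-xandt(a1291,rtype="alpha",conf.level=0.6)
out$avgx
out$table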

    The Functions Themselves (Updated 4:44pm 1/31/02... thanks F.Vera)

    alpha<-function(testdata){
    # computes Cronbach's alpha and Guttman's lambda2 for an item response matrix
    n<-ncol(testdata)
    nexmn<-nrow(testdata)
    x<-apply(testdata,1,sum)
    # item and total-score variances, using the divide-by-N (population) form
    s2y<-diag(var(testdata))*(nexmn-1)/nexmn
    s2x<-var(x)*(nexmn-1)/nexmn
    alpha<-(n/(n-1))*(1-sum(s2y)/s2x)
    # squared inter-item covariances (the diagonal is zeroed out)
    s2yy<-(((var(testdata)-diag(diag(var(testdata))))*(nexmn-1)/nexmn))^2
    lambda2<-1-sum(s2y)/s2x+sqrt((n/(n-1))*sum(s2yy))/s2x
    # returned as a named list so that $alpha and $lambda2 work in sem and xandt
    return(list(alpha=alpha,lambda2=lambda2))}
    
    sem<-function(testdata){
    nexmn<-nrow(testdata)
    x<-apply(testdata,1,sum)
    rhat<-alpha(testdata)
    sx<-sqrt(var(x)*(nexmn-1)/nexmn)
    sem1<-sx*sqrt(1-rhat$alpha)
    sem2<-sx*sqrt(1-rhat$lambda2)
    return(list(sem1=sem1,sem2=sem2))}
    
    xandt<-function(testdata,rtype="lambda2",conf.level=0.95){
    nexmn<-nrow(testdata)
    x<-apply(testdata,1,sum)
    avgx<-mean(x)
    varx<-var(x)*(nexmn-1)/nexmn
    sdx<-sqrt(varx)
    maxx<-max(x)
    count<-rep(0,maxx+1)
    for (i in 0:maxx){
      count[i+1]<-sum(x==i)}
    x<-c(0:maxx)
    z<--1*qnorm((1-conf.level)/2)
    if (rtype=="alpha") {
    rhat<-alpha(testdata)$alpha
    sem.est<-sem(testdata)$sem1
    }
    else {
    rhat<-alpha(testdata)$lambda2
    sem.est<-sem(testdata)$sem2
    }
    losem<-x-z*sem.est
    hisem<-x+z*sem.est
    treg<-rhat*x+(1-rhat)*avgx
    see.est<-sem.est*sqrt(rhat)
    loreg<-treg-z*see.est
    hireg<-treg+z*see.est
    return(list(avgx=avgx,varx=varx,sdx=sdx,table=cbind(x,treg,count,losem,hisem,loreg,hireg)))}
    


    Classical Item Analysis

    The function items will calculate the p-value, the variance, the D based on a user-specified percent of the examinees (default=27%), the point biserial, and the biserial. If correct=T (the default) it will remove the item in question from the sum score; otherwise it will include that item. dpct tells it what percent of the examinees to use at each end for calculating D. digits tells it how many digits to report after the decimal (the default is four).

    Thus, items(a1291) will analyze a1291 at the default settings, while
    items(a1291,correct=F,dpct=0.5,digits=3) will analyze it without removing the item in question, using 50% instead of 27%, and using only 3 digits after the decimal place.

     

    items<-function(testdata,correct=T,dpct=0.27,digits=4){
    n<-ncol(testdata)
    nexmn<-nrow(testdata)
    x<-apply(testdata,1,sum)
    medx<-median(x)
    lowx<-sort(x)[floor((nexmn+1)*dpct)]
    highx<-sort(x)[ceiling((nexmn+1)*(1-dpct))]
    pvalue<-as.numeric(apply(testdata,2,mean))
    var<-as.numeric(apply(testdata,2,var)*(nexmn-1)/nexmn)
    D<-rep(0,n)
    PBis<-rep(0,n)
    Bis<-rep(0,n)
    for (i in 1:n){
      if (correct==T){
        # remove item i from the score before finding the upper and lower groups
        xx<-x-testdata[,i]
        lowx<-sort(xx)[floor((nexmn+1)*dpct)]
        highx<-sort(xx)[ceiling((nexmn+1)*(1-dpct))]
      }
      else {xx<-x}
      pu<-mean(testdata[xx >= highx,i])
      pl<-mean(testdata[xx <= lowx,i])
      D[i]<-pu-pl
      PBis[i]<-cor(xx,testdata[,i])
      Bis[i]<-sqrt(pvalue[i]*(1-pvalue[i]))*PBis[i]/dnorm(qnorm(pvalue[i]))
    }
    return(round(cbind(pvalue,var,D,PBis,Bis),digits))}
    


    Subsetting Data

    In order to use the following code, you must already have the data set a1291 loaded into R. Simply copying and pasting the following code into R will then divide the dataset into three smaller sets of 2,000 examinees each.

    It does this by going through the following steps: 1) generate a list of the numbers 1-6,000 in a random order, 2) put the observations corresponding to those first 2,000 random numbers into arand, 3) put the remaining 4,000 observations into not.arand, 4) calculate the observed score X for these 4,000 observations, 5) order the 4,000 observations by their observed score, 6) put the first 2,000 in alo, and 7) put the rest in ahi.

    random.order<-sample(1:6000,replace=F)
    arand<-a1291[random.order[1:2000],]
    not.arand<-a1291[random.order[2001:6000],]
    x.for.not.arand<-apply(not.arand,1,sum)
    not.arand<-not.arand[order(x.for.not.arand),]
    alo<-not.arand[1:2000,]
    ahi<-not.arand[2001:4000,]
    
    Having entered the above code means you have added three new data sets to R: arand, alo, and ahi. You could now run sem(alo), for example, if you wanted.

    The following code will combine the biserial correlation calculated from alo with all of the item statistics calculated from arand into a single piece of output. The first line makes something called q.1.data that has the fifth column of output from items(alo) and all of the columns of output from items(arand). The second line gives new names to the columns to make it clear what we've calculated.

    q.1.data<-cbind(items(alo)[,5],items(arand)) 
    colnames(q.1.data)<-c("lo.Bis","rand.p","rand.v","rand.D","rand.PBis","rand.Bis")
    

    After you enter the above two lines of code, you may simply type q.1.data to see the data on the screen. You could cut and paste it from there to another application, or analyze it in R if you are comfortable with that. Keep in mind that your numbers will be slightly different from everyone else's because we each took our own random subsets.
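
    If you would rather send the table straight to a file, something like the following should work (the file name is just an example):

    write.table(q.1.data,file="q1data.txt",quote=FALSE)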


    Some Basic Statistical Analyses

    The goal of this section is to show how R can be used for conducting multiple regression and for producing a few kinds of plots. The data set used in this example gives the weight, age, and several body measurements (head length and width, neck girth, body length, and chest girth) for a sample of bears. It can be found at: http://www.stat.sc.edu/~habing/courses/data/bears.txt. The data would be read into R using:

    bears<-read.table("http://www.stat.sc.edu/~habing/courses/data/bears.txt",header=TRUE)

    The command read.table automatically reads the data in as a data-frame. This allows us to refer to the columns by name, and not just by number. If we had another data set, we might want to use the command as.data.frame on it. For example, q.1.frame<-as.data.frame(q.1.data)
    makes a data-frame called q.1.frame out of the dataset q.1.data.

    One nice plot command is simply all of the pairwise scatter plots: pairs(bears)

    If we wanted to make this plot without variables 2 and 9 we could use: pairs(bears[c(1,3:8)])
    This simply tells it to make the pairs plots for columns 1 and 3 to 8 of the bears data.

    The following code will perform a multiple regression to predict the weight of the bears from the other variables. To use this on a different dataset, you would simply replace the name bears with the name of your data set (and of course replace the names of the variables too!)

    regresults<-lm(Weight~Age+Head.L+Head.W+Neck.G+Length+Chest.G,data=bears)
    summary(regresults)
    

    This shows that Age, Neck.G, and Chest.G are all statistically significant at the 0.05 level. It also gives the R-Squared and Adjusted R-Squared.
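
    If you want those numbers for further use, they are stored in the summary object. For example:

    regsum<-summary(regresults)
    regsum$r.squared      # the R-Squared
    regsum$adj.r.squared  # the Adjusted R-Squared
    regsum$coefficients   # the table of estimates, standard errors, t's, and p-values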

    *** WARNING! It is important to remember that there could be other variables that are significant! Each of the tests generated in most regression analyses is conditional on including all of the other variables... if we had two measures of roughly the same quantity, then neither one would appear significant, even if either of them alone would be a better predictor than all of the other variables combined. One way around this is to use stepwise regression to choose the best model. The commands for doing this are given at the end of this section. ***

    The basic residual plots could be generated using:

    par(mfrow=c(2,2))
    plot(regresults)
    

    The par(mfrow=c(2,2)) just says we want to put 4 plots on the screen in the form of a 2 x 2 matrix.

    Plots of the independent variables against the observed weights could be generated as follows:

    par(mfrow=c(3,2))
    plot(bears$Weight,bears$Age)
    plot(bears$Weight,bears$Head.L)
    plot(bears$Weight,bears$Head.W)
    plot(bears$Weight,bears$Neck.G)
    plot(bears$Weight,bears$Length)
    plot(bears$Weight,bears$Chest.G)
    

    Recall that because bears is a data-frame, we can refer to the variables by name (with the $), instead of having to use a column number.

    We could also conduct a stepwise regression. The following will report the results of a stepwise regression allowing both forward and backwards steps. It uses the regresults from above as the initial model. The criterion it tries to optimize is the AIC.

    summary(step(regresults,direction="both",trace=0))

    In this case it arrives at the same conclusion as simply looking at what values are significant. In other cases it can be a valuable tool.


    Simulating IRT Data

    The three functions below are:

    irf - given theta, a, b, c, and the type of model ("3pl" or "norm"), this function returns the probability the examinee will get the item correct. By default the guessing parameter cc is set to 0. The following would give the values to plot for item 1 in problem 1 on page 28 of the text (answer on page 30).

    irf(theta=c(-3,-2,-1,0,1,2,3),a=1.8,b=1.0,cc=0.0,type="3pl")
    round(irf(theta=c(-3,-2,-1,0,1,2,3),a=1.8,b=1.0,cc=0.0,type="3pl"),3)
    
    Notice that you are able to enter multiple theta values as a vector if you like.

    irfplot - will give a plot of the specified item response function for abilities ranging from -3.5 to +3.5. It allows you to label the graph and make the curve dashed if you would like. The following will construct a graph of the same parameters for both the normal ogive (dashed) and logistic (not dashed) models. 3PL and not dashed are both the default settings.

    irfplot(a=1.8,b=1.0,cc=0.0,label="item 1 comparison")
    par(new=T)
    irfplot(a=1.8,b=1.0,cc=0.0,type="norm",dashed="T")
    
    The command par(new=T) allows you to overlay the first and second plot (otherwise it would have erased the first one).

    irtgen - this function will simulate data from an exam that follows either the normal-ogive or logistic model. It requires the number of examinees (default=10); the vectors of a, b, and c parameters; the type of item ("3pl" or "norm"); and the mean and standard deviation of the examinee abilities (which are simulated to follow the normal distribution). The following sample code simulates an exam consisting of 10 items and 100 examinees, where each item had a=1, b=0, and c=0; where the items followed the normal-ogive model; and where the distribution of examinee abilities was standard normal. It calls the simulated data set sampdat and runs items on it.

    a<-rep(1,10)
    b<-rep(0,10)
    cc<-rep(0,10)
    sampdat<-irtgen(nexmn=100,avec=a,bvec=b,cvec=cc,type="norm",mthet=0,sdthet=1)
    sampdat
    items(sampdat)
    
    The rep(1,10) is telling R to make a vector that repeats 1 ten times. Type a to see that that's what it did. Also, note that we had to use cc instead of c for the name of that vector. This is because R reserves c to be the command for making a vector (e.g. c(1,2,3)). Finally, make sure to enter all three functions, because both irfplot and irtgen use irf.

     

    The Functions

    irf<-function(theta=0,a=1,b=0,cc=0,type="3pl"){
    if (type=="3pl"){
      prob<-cc+(1-cc)/(1+exp(-1.7*a*(theta-b)))
    }
    if (type=="norm"){
      prob<-cc+(1-cc)*pnorm(a*(theta-b))
    }
    return(prob)
    }
    
    
    irfplot<-function(a=1,b=0,cc=0,type="3pl",label="",dashed="F"){
    prob<-rep(0,71)
    theta<-rep(0,71)
    for (i in 1:71){
    theta[i]<-(i-36)/10
    prob[i]<-irf(theta[i],a,b,cc,type)
    }
    if (dashed=="F"){
    plot(theta,prob,xlab="theta",ylab="prob",type="l",
    xlim=c(-3.5,3.5),ylim=c(0,1),main=label)
    }
    if (dashed=="T"){
    plot(theta,prob,xlab="theta",ylab="prob",type="l",
    xlim=c(-3.5,3.5),ylim=c(0,1),main=label,lty=2)
    }
    }
    
    
    irtgen<-function(nexmn=10,avec=c(1),bvec=c(0),cvec=c(0),type="3pl",
    mthet=0,sdthet=1){
    n<-length(avec)
    data<-matrix(0,nrow=nexmn,ncol=n)
    ability<-rnorm(nexmn,mean=mthet,sd=sdthet)
    for (j in 1:nexmn){
      for (i in 1:n){
        try<-runif(1,0,1)
        if (try < irf(ability[j],avec[i],bvec[i],cvec[i],type)){
          data[j,i]<-1
         }
       }
    }
    return(data)
    }
    


    Saving Simulated Data

    irtsave - will save an IRT data matrix with several different options. It may include a leading examinee id number of up to 99999 (the default), and it may or may not (the default is may not) have spaces between the item responses following the id. The following are four examples of what the outputted files could look like (each shown below the command used to generate it). It uses the sampdat matrix created above.

    ----------------------------------------------
    irtsave(sampdat,"d:/778/simul.dat")
    
    00001 0000100000
    00002 0010001110
    00003 0011000000
    etc...
    ----------------------------------------------
    irtsave(sampdat,"d:/778/simul.dat",sep=T)
    
    00001 0 0 0 0 1 0 0 0 0 0
    00002 0 0 1 0 0 0 1 1 1 0
    00003 0 0 1 1 0 0 0 0 0 0
    etc...
    ----------------------------------------------
    irtsave(sampdat,"d:/778/simul.dat",id=F)
    
    0000100000
    0010001110
    0011000000
    etc...
    ----------------------------------------------
    irtsave(sampdat,"d:/778/simul.dat",sep=T,id=F)
    
     0 0 0 0 1 0 0 0 0 0
     0 0 1 0 0 0 1 1 1 0
     0 0 1 1 0 0 0 0 0 0
    etc...
    ----------------------------------------------
    

    The Function itself is:

    irtsave<-function(datamatrix,flnm,id=T,sep=F){
    nitem<-length(datamatrix[1,])
    nexmn<-length(datamatrix[,1])
    if (sep==F) {
      s1<-"" 
      s2<-" " 
    }
    else if (sep==T) {
      s1<-" " 
      s2<-""
    }
    ndigits<-5
    rowput<-rep("",nexmn)
    for (j in 1:nexmn){
      for (i in 1:nitem){
        rowput[j]<-paste(rowput[j],as.character(datamatrix[j,i]),sep=s1)
      }
      if (id==T){
        rowput[j]<-paste(
          paste(rep("0",(ndigits-ceiling(log10(j+1)))),sep="",collapse=""),
          as.character(j),s2,rowput[j],sep="")
      } 
    }
    # write one examinee's string of responses per line
    write(rowput,file=flnm,ncolumns=1)
    }
    


    Splitting an Exam by Items

    Say we want to split the exam a1291 into two separate exams, one containing items 1,8-15, and 23, and the other containing 2-7,16-22, and 24. This could be done using the following code that calls the new exams asplit1 and asplit2.

    asplit1<-a1291[,c(1,8:15,23)]
    asplit2<-a1291[,c(2:7,16:22,24)]
    

    If needed, you could save these new data sets to disk using the irtsave function above.
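
    For example (the d:/778 directory is just an illustration; use whatever location you like):

    irtsave(asplit1,"d:/778/asplit1.dat")
    irtsave(asplit2,"d:/778/asplit2.dat")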


    Difference Between IRFs

    irfdiff - This is the function that appears in the answer key to Homework set 5, problem #4. In order to use this function you must have already entered the function irf above. The function approximates the area between two item response functions between theta=-10 and theta=10 for an ability distribution with mean=mthet and standard deviation=sdthet. It works for either the normal ogive model or the 3pl model.

    The code for the function is:

    irfdiff<-function(a1,b1,c1,a2,b2,c2,type="3pl",mthet=0,sdthet=1){
    thet<-(-500:500)/50
    weight<-dnorm(thet,mean=mthet,sd=sdthet)
    diff<-irf(thet,a1,b1,c1,type=type)-irf(thet,a2,b2,c2,type=type)
    probdiff<-(sum(weight*abs(diff))/sum(weight))
    return(probdiff)
    }
    

    For two 3PL items with parameters a=0.914, b=-1.516, c=0.184 and a=0.79, b=-1.822, and c=0, where the population is standard normal, we would use the following code:

    irfdiff(a1=0.914,b1=-1.516,c1=0.184,a2=0.79,b2=-1.822,c2=0,type="3pl",mthet=0,sdthet=1)


    Item Information Functions

    The following two functions concern the information functions of items and tests. As the second function relies on the first, it is probably best to enter them as a pair. In each a single item can be analyzed by entering its particular a, b, and c parameters. If vectors of parameters are entered then the analysis will be done for a test consisting of those items. For example a=c(1,1.5,1), b=c(0.5,1,1), cc=c(0,0,0.2) would be used to get the information for a three item test with parameters: a1=1, b1=0.5, c1=0, a2=1.5, b2=1, c2=0, a3=1, b3=1, and c3=0.2.

    irfinfo - returns the information for the specified item or test at the theta values requested. It will function for either a logistic or ogive form using the options type="3pl" or type="norm".

    The following commands will generate part of the answers to question 3 on page 97 of the text.

    irfinfo(theta=c(-2,-1,0,1,2),a=c(1.25,1.5,1.25),b=c(-0.5,0,1),
    cc=c(0,0,0),type="3pl")
    

    irfinfoplot - returns the plotted information function, with the same options as irfplot above. The only difference in the options is that the user can specify the maximum value of the y-axis for the plots using maxy=. The default is 1.1 times the maximum information for that item. Setting maxy yourself is important when overlaying the plots of several information functions.

    The following commands will generate the information curves for the two items (and the two item test) where the parameters are a1=1.5, b1=-0.5, c1=0.2, a2=1, b2=1, c2=0.25.

    par(mfrow=c(2,1))
    
    irfinfoplot(a=1.5,b=-0.5,cc=0.2,type="3pl",label="two items information",
    dashed="T",maxy=1.25)
    par(new=T)
    irfinfoplot(a=1,b=1,cc=0.25,type="3pl",label="",dashed="F",maxy=1.25)
    
    irfinfoplot(a=1.5,b=-0.5,cc=0.2,label="test info for two items",
    dashed="T",maxy=1.25)
    par(new=T)
    irfinfoplot(a=1,b=1,cc=0.25,dashed="T",maxy=1.25)
    par(new=T)
    irfinfoplot(a=c(1.5,1),b=c(-0.5,1),cc=c(0.2,0.25),maxy=1.25)
    

    Recall that the par(mfrow=c(2,1)) tells R to plot the graphics so that there are two rows and one column in the plot window, and that par(new=T) means to plot over the current figure. Also notice that the default settings are: type="3pl", dashed="F", and label="", so we don't need to include those.

    The code for the two functions is:

    #-----------------------------------------------------------
    
    irfinfo<-function(theta=0,a=1,b=0,cc=0,type="3pl"){
    n<-length(a)
    info<-0
    if (type=="3pl"){
      # 2.89 is 1.7^2, the square of the logistic scaling constant
      for (i in 1:n){
        info<-info+2.89*a[i]^2*(1-cc[i])/(
          (cc[i]+exp(1.7*a[i]*(theta-b[i])))*
          ((1+exp(-1.7*a[i]*(theta-b[i])))^2)
          )
      }
    }
    else{
      for (i in 1:n){
        info<-info+
           a[i]^2*(1-cc[i])^2*dnorm(a[i]*(theta-b[i]))^2/
          (irf(theta=theta,a=a[i],b=b[i],cc=cc[i],type="norm")* 
          (1-irf(theta=theta,a=a[i],b=b[i],cc=cc[i],type="norm")))
      }
    }
    return(info)
    }
    
    
    irfinfoplot<-function(a=1,b=0,cc=0,type="3pl",label="",dashed="F",maxy=0){
    info<-rep(0,71)
    theta<-rep(0,71)
    for (i in 1:71){
    theta[i]<-(i-36)/10
    info[i]<-irfinfo(theta=theta[i],a=a,b=b,cc=cc,type=type)
    }
    if (maxy==0){maxy<-1.1*max(info)}
    if (dashed=="F"){
    plot(theta,info,xlab="theta",ylab="information",type="l",
    xlim=c(-3.5,3.5),ylim=c(0,maxy),main=label)
    }
    if (dashed=="T"){
    plot(theta,info,xlab="theta",ylab="information",type="l",
    xlim=c(-3.5,3.5),ylim=c(0,maxy),main=label,lty=2)
    }
    }
    
    #-----------------------------------------------------------
    


    Dimensionality Procedures

    The following four functions are related to assessing the dimensionality of dichotomous standardized test data. Two of the functions (stratify and accov) are called by the others and generally wouldn't be called by themselves. The other two functions, irtmh and hcaccprox, are described below.

    WARNING! These functions may take quite a while to run, especially for large exams: up to 10 minutes for the hcaccprox function with 250 examinees, depending on the speed of your computer.

    irtmh - calculates Rosenbaum's Mantel-Haenszel statistic for IRT data. It requires the name of the data matrix and the numbers of the two items. It also has an option (include=T) for including the two items in question in the score. The following would calculate the statistic between items 1 and 2, and between items 1 and 15, in the a1291 data set. It also calculates the statistic between items 1 and 15 with those items included in the score.

    irtmh(a1291,1,2)
    irtmh(a1291,1,15)
    irtmh(a1291,1,15,include=T)
    

    The function returns the z statistic, the p-value for testing that the covariance (conditioned on the chosen observed score) is zero vs. not zero, and the p-value for testing that the covariance is zero vs. less than zero. Notice that in this case, even with the multidimensionality in this data, the second p.less value is still very large! Even when the two items in question are included in the score (which should give a negative bias), the third p.less is barely significant.
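
    Because irtmh (as given below) returns its three values in a named list, you can also pull out a single piece; for example:

    irtmh(a1291,1,15)$p.less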

    hcaccprox - conducts Roussos's cluster analysis procedure on the given test data matrix. While the commercial version is very fast, the current R version is SLOW. It is not recommended for use on more than a few hundred examinees at a time. It is perhaps best to randomly take a subset of the examinees and run it on that. The following would run the procedure on a random sample of 250 examinees from a1291.

    randorder<-sample(1:6000,250,replace=F)
    hcaccprox(a1291[randorder,])
    

    The code for the four functions is:

    #-----------------------------------------------------------
    
    stratify<-function(respmat,it1,it2,include=F){
      nex<-nrow(respmat)
      nit<-ncol(respmat)
      outarray<-array(0,dim=c(2,2,nit+1))
      numatscore<-rep(0,nit+1)
      if (include==T){
        # total scores 0..nit become array indices 1..nit+1
        score<-apply(respmat,1,sum)+1
      }
      else{
        # rest scores 0..nit-2 (items it1 and it2 removed) are shifted by 3,
        # which keeps them within the indices 1..nit+1
        score<-apply(respmat,1,sum)-respmat[,it1]-respmat[,it2]+3
      }
      for (j in 1:nex){
        outarray[respmat[j,it1]+1,respmat[j,it2]+1,score[j]]<-
            outarray[respmat[j,it1]+1,respmat[j,it2]+1,score[j]]+1
        numatscore[score[j]]<-numatscore[score[j]]+1
      }
      return(outarray[,,(1:(nit+1))[(numatscore>1)]])
    }    
      
    
    irtmh<-function(respmat,it1,it2,include=F){
      # use the matrix that was passed in (not a fixed data set)
      temp<-mantelhaen.test(stratify(respmat,it1,it2,include))
      zmh<-as.numeric(sqrt(temp$statistic)*sign(temp$estimate-1))
      return(list(zmh=zmh,p.equal=2*min(pnorm(zmh),1-pnorm(zmh)),p.less=pnorm(zmh)))
    }
    
    
    accov<-function(respmat,include=F){
     # estimated conditional covariance for each pair of items, conditioning
     # on the rest score (the total score with the two items removed)
     nit<-ncol(respmat)
     nex<-nrow(respmat)
     ccov<-matrix(0,nrow=nit,ncol=nit)
     tscore<-apply(respmat,1,sum)
     for (i in 1:(nit-1)) {
       for (l in (i+1):nit) {
        irfi <- rep(0,(nit-1))
        irfl <- rep(0,(nit-1))
        irfil <- rep(0,(nit-1))
        numatscore <- rep(0,(nit-1))
        sil<-tscore-respmat[,i]-respmat[,l]+1
        for (j in 1:nex){
          numatscore[sil[j]]<-numatscore[sil[j]]+1
          irfi[sil[j]]<-irfi[sil[j]]+respmat[j,i]
          irfl[sil[j]]<-irfl[sil[j]]+respmat[j,l]
          irfil[sil[j]]<-irfil[sil[j]]+respmat[j,i]*respmat[j,l]
        }
        numexcounted<-0
        for (k in 1:(nit-1)){
          if (numatscore[k] > 1) {
            numexcounted<-numexcounted+numatscore[k]
            irfil[k]<-irfil[k]/numatscore[k]
            irfi[k]<-irfi[k]/numatscore[k]
            irfl[k]<-irfl[k]/numatscore[k]
            ccov[i,l]<-ccov[i,l] + (irfil[k]-irfi[k]*irfl[k])*numatscore[k] 
          }
        }
        ccov[i,l]<-ccov[i,l]/numexcounted
        ccov[l,i]<-ccov[i,l]
       }
     }
      return(ccov)
    }
    
    hcaccprox<-function(respmat){
      # turn the conditional covariances into distances (high covariance = close together)
      dmat<-as.dist(-1*accov(respmat)+1)
      if (!exists("hclust")) library(mva)  # older versions of R keep hclust in the mva package
      plot(hclust(dmat,method="average"))
    }
    
    #-----------------------------------------------------------
    


    Computer Trouble?

    In most cases, help with the computers in LeConte (NOT the programming) can be gained by e-mailing help@stat.sc.edu. If you are having programming trouble, just e-mail me. If you are having trouble with a computer outside LeConte... sorry...

    For the printers on the first and second floor, printer paper is available in the Stat Department office. For printers on the third floor, paper is available in the Math Department office.

    If you are using a PC, restarting the machine will fix many problems, but obviously don't try that if you have a file that won't save or the like. DO NOT TURN OFF THE UNIX MACHINES.

    In LeConte, if the problem is an emergency requiring immediate attention see Jason Dew in room 415.
    If Jason is not available and it is an emergency see Minna Moore in room 417.
    Flagrantly non-emergency cases may result in suspension of computer privileges.