Stat 518 - Fall 2007 - R Templates

Class Notes from 9/30/07
Section 3.1: Binomial Test
Section 3.2: Quantile Test
Section 3.5: McNemar Test
Section 5.7: Wilcoxon Signed Ranks Test
Section 5.1: Mann-Whitney-Wilcoxon Rank Sum Test [& 2-sample t-test]
Section 5.2: Kruskal-Wallis Test [& 1-way ANOVA]
Section 5.8: Friedman Test [& block-design ANOVA]


The Basics of R:

R, like the commercial S-Plus, is based on the programming language S. It can be downloaded for free from www.r-project.org [At home, after reading the instructions, choose: CRAN, then a mirror site in the US, then Windows (95 and later), then base, then R-2.5.1-win32.exe]. This page also has links to FAQs and other information about R.

One of the nice things about R (besides the fact that it's powerful and free) is that you can save everything you've done each time you finish. If you are using a network machine where you don't have access to the C drive, you can manually load and save the .Rdata file on another drive each time you start and end the program. You would, of course, have to save the .Rdata file on disk if you wanted to use it both at home and at school.

At the heart of R are the various objects that you enter. At any time (when you are in R) you can type the function objects() to see which ones you have entered previously. Presumably there aren't any there right now. R also has many built-in objects, most of which are functions. Because these objects are always there, it will not list them when you type objects().

A list of the various functions for R can be found in the manuals that are installed with the program. The various manuals can be found under the Manuals submenu of the Help menu. In particular, the R Reference Manual lists not only all the functions in the base package of R, but also some of those in the supplementary packages. Because this manual is nearly 1000 pages long, you probably shouldn't print out a copy!!!!

The various supplementary packages are simply groups of functions that others wrote and decided to make available to the public. A list of those that you downloaded can be found under the Packages menu by choosing Load Package. The one that we will use most often is the MASS package. It can be loaded in by typing library(MASS). Note that some of the non-standard libraries need to be downloaded and unzipped somewhere first if you don't have access to your C drive.
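
For example, a minimal check that the package loaded (fractions is just one MASS function, chosen here only for illustration):

library(MASS)       # load the MASS package
fractions(0.25)     # a MASS function; writes 0.25 as the rational 1/4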

Once you know the name of a function, you can get help about it simply by typing help(functionname). In R, parentheses indicate either mathematical grouping, like they usually do, or that you are applying a function. Square brackets [ ] indicate that you are selecting an element of a vector.
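
For example:

help(t.test)       # documentation for the built-in t.test function
?t.test            # shorthand for the same thing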

This brief example shows some of R's flexibility.

x<-c(5,4,3,2)

The <- is the command that assigns whatever is on the right to the name that is on the left. The c indicates that an array or vector is going to be entered. Simply typing x now will return those four values. Typing

x[3]

will return the third value,

mean(x)

will return the mean, and

objects()

will show you that you now have an object called x.

R also has a variety of graphical functions and statistical functions built in.

hist(x)

will construct a histogram, and

t.test(x,alternative="greater",mu=5)

will test the null hypothesis mu=5 vs. the alternative that mu > 5. (If you happen to mistype a command and get an error, just hit the up arrow to bring it up again for editing).

Besides just getting the result of a function, we can store the results somewhere. For example

x.t<-t.test(x,alternative="greater",mu=5)

will store the result of the t-test in x.t.

attributes(x.t)

will then show you the various properties of x.t. Thus,

x.t$p.value

will return the p-value.

One of the things we will be doing on some homework assignments and for the project is using R to generate random numbers following various discrete and continuous distributions, and seeing how they behave. Try:

ndat<-rnorm(200,5,10)    # 200 normal values with mean 5 and standard deviation 10
chidat<-rchisq(200,4)    # 200 chi-squared values with 4 degrees of freedom
par(mfrow=c(2,1))        # put the next two plots in a 2 by 1 layout
hist(ndat)
hist(chidat)
par(mfrow=c(1,1))        # return to one plot per window

In order to replicate the work we did with SAS, we would need to load in the SAT data (from the August 31st, 1999 New York Times).

        Ala.    561  555    9
        Alaska  516  514   50
        Ariz.   524  525   34
        Ark.    563  556    6
        Calif.  497  514   49
        Colo.   536  540   32
        Conn.   510  509   80
        Dela.   503  497   67
        D.C.    494  478   77
        Fla.    499  498   53
        Ga.     487  482   63
        Hawaii  482  513   52
        Idaho   542  540   16
        Ill.    569  585   12
        Ind.    496  498   60
        Iowa    594  598    5
        Kan.    578  576    9
        Ky.     547  547   12
        La.     561  558    8
        Maine   507  507   68
        Md.     507  511   65
        Mass.   511  511   78
        Mich.   557  565   11
        Minn.   586  598    9
        Miss.   563  548    4
        Mo.     572  572    8
        Mont.   545  546   21
        Neb.    568  571    8
        Nev.    512  517   34
        N.H.    520  518   72
        N.J.    498  510   80
        N.M.    549  542   12
        N.Y.    495  502   76
        N.C.    493  493   61
        N.D.    594  605    5
        Ohio    534  538   25
        Okla.   576  560    8
        Ore.    525  525   53
        Pa.     498  495   70
        R.I.    504  499   70
        S.C.    479  475   61
        S.D.    585  588    4
        Tenn.   559  553   13
        Texas   494  499   50
        Utah    570  565    5
        Vt.     514  506   70
        Va.     508  499   65
        Wash.   525  526   52
        W.Va.   527  512   18
        Wis.    584  595    7
        Wyo.    546  551   10

One way of doing this would be to read it in straight from the keyboard.

satdata<-read.table(stdin())

You can then cut and paste the above data set in and hit return after the last line. Typing satdata will show you what has been read in. Unfortunately, it is hard to guarantee that everything has been read in with the correct format, and it would be nice to have variable names. The following would have read it in with the column names and the correct formats:

satdata<-read.table(stdin(),col.names=c("state","verbal","math","pct"),colClasses=c('character','numeric','numeric','numeric'))

It is often easier to read the data in from the web or another file. For example, look at the file http://www.stat.sc.edu/~habing/courses/data/satdata.txt. This could be read in using:

satdata<-read.table("http://www.stat.sc.edu/~habing/courses/data/satdata.txt",head=T)

We have now made a "data frame" where each of the four columns goes with a different variable. We can refer to the four columns by using satdata$state, satdata$verbal, satdata$math, or satdata$pct. We can shorten the typing a bit by assigning the columns to different variables.

v<-satdata$verbal
m<-satdata$math
p<-satdata$pct

The code below performs a linear regression to predict v from p. It is often best to store the regression output in an object, though, instead of simply running it.

vmregress<-lm(v~p)

vmregress

attributes(vmregress)

vmregress$residuals

qqnorm(vmregress$residuals)
qqline(vmregress$residuals)

One of the great benefits of R is the ability to write new functions in a very simple manner:

CIvar<-function(x,alpha=0.05){
df<-length(x)-1                           # degrees of freedom, n-1
clow <- df*var(x)/qchisq(1-alpha/2,df)    # lower limit uses the upper chi-squared quantile
chi <- df*var(x)/qchisq(alpha/2,df)       # upper limit uses the lower chi-squared quantile
list(var=var(x),n=length(x),alpha=alpha,cilow=clow,cihigh=chi)}

Now simply try typing CIvar(v). Note that it uses alpha=0.05 by default. We could have used a different value by simply telling it to: CIvar(v,alpha=0.10).


Binomial Test:

The exact binomial test in Section 3.1 can be carried out using binom.test in R. The output from example 1 on page 127 can be gotten from:

binom.test(x=3,n=19,p=0.5,alternative="less")

This command also gives you the exact confidence interval that we talked about in class (the one that you get by reversing the test, and where the confidence level is always greater than 1-alpha). Notice that the "less" option means that it isn't giving you a two-sided confidence interval like we are used to seeing; instead you are getting an upper confidence bound (we are 95% confident that the true population percentage is less than 0.3594256). If you change "less" to "greater", you get something very different. "two.sided" (or just leaving it out altogether) would give you the regular confidence interval.
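
For instance, rerunning example 1 with the default alternative gives the familiar two-sided interval:

binom.test(x=3,n=19,p=0.5)     # default alternative is "two.sided"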

The large sample approximation can be gotten using prop.test. To get the output in example 2 on page 128 you would use:

prop.test(x=682,n=925,p=.75,alternative="two.sided")

Note that rounding error makes the p-value 0.393 instead of the 0.392 reported in the text. (Of course, why use the approximation if R will give you the exact p-value?) Note that it also gives you the large sample confidence interval. To get the corrected confidence interval from the paper in class, you could just make x two larger and n four larger.
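
As a sketch of that adjustment (two more successes and four more trials, exactly as described above):

prop.test(x=682+2,n=925+4,p=.75,alternative="two.sided")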


Quantile Test:

R doesn't have a built-in function to perform the quantile test as it is described in section 3.2. After entering the following function, though, you can get all of the output.

quantile.test<-function(x,xstar=0,quantile=.5,alternative="two.sided"){
  n<-length(x)
  p<-quantile
  T1<-sum(x<=xstar)     # number of observations at or below xstar
  T2<-sum(x< xstar)     # number of observations strictly below xstar
  if (alternative=="less") {
    p.value<-1-pbinom(T2-1,n,p)}
  if (alternative=="greater"){
    p.value<-pbinom(T1,n,p)}
  if (alternative=="two.sided"){
    p.value<-2*min(1-pbinom(T2-1,n,p),pbinom(T1,n,p))}
  list(xstar=xstar,alternative=alternative,T1=T1,T2=T2,p.value=p.value)}

This can be used to perform the analysis in example 1 on pages 139-140 using:

x<-c(189,233,195,160,212,176,231,185,199,213,202,193,174,166,248)
quantile.test(x,xstar=193,quantile=.75,alternative="two.sided")

or simply quantile.test(x,193,.75) .

A confidence interval for a quantile can be gotten using the following function.

quantile.interval<-function(x,quantile=.5,conf.level=.95){
  n<-length(x)
  p<-quantile
  alpha<-1-conf.level
  rmin1<-qbinom(alpha/2,n,p)-1
  r<-rmin1+1                      # index of the lower order statistic
  alpha1<-pbinom(r-1,n,p)         # actual error probability on the low side
  smin1<-qbinom(1-alpha/2,n,p)
  s<-smin1+1                      # index of the upper order statistic
  alpha2<-1-pbinom(s-1,n,p)       # actual error probability on the high side
  clo<-sort(x)[r]                 # r-th order statistic
  chi<-sort(x)[s]                 # s-th order statistic
  conf.level<-1-alpha1-alpha2     # achieved confidence level
  list(quantile=quantile,conf.level=conf.level,r=r,s=s,interval=c(clo,chi))}

Note that it doesn't give you exactly the same result as in example 3 on page 145.

x<-c(46.9,47.2,49.1,56.5,56.8,59.2,59.9,63.2,63.3,
  63.4,63.7,64.1,67.1,67.7,73.3,78.5)
quantile.interval(x,quantile=0.75,conf.level=.9)

This is because the function in the book picks a value "close" to alpha/2, while the function above makes sure that neither alpha1 nor alpha2 exceeds alpha/2.


McNemar Test:

The McNemar test in section 3.5 can be carried out using the function mcnemar.test. It is probably easiest to enter the data as a matrix, and the following code shows this while working out example 1 on page 168.

datamatrix<-matrix(c(63,21,4,12),ncol=2)
mcnemar.test(datamatrix)

Note that R uses the large sample approximation to get the p-value. Also note that the value of T1 that it shows doesn't exactly match what the book has. This is because R uses a continuity correction by default in the large sample approximation. To get the T1 value in the text, use mcnemar.test(datamatrix,correct=F).
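
In block form, for comparison:

mcnemar.test(datamatrix)               # default: continuity corrected
mcnemar.test(datamatrix,correct=F)     # no correction; T1 matches the text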

You could also use the binomial test with p=0.5 for this example, where the number of "heads" is 21 and the number of "tails" is 4.

binom.test(21,21+4,.5)

Note that the p-value here (0.0009105) is fairly close to the continuity corrected value (0.001374), and is of course exactly correct.


Wilcoxon Signed Ranks Test:

The Wilcoxon Signed Ranks test in section 5.7 can be carried out using the wilcox.test command. The analysis in example 1 on page 355 could be carried out using either of the following:

x<-c(86,71,77,68,91,72,77,91,70,71,88,87)
y<-c(88,77,76,64,96,72,65,90,65,80,81,72)
wilcox.test(y,x,paired=T,alternative="less")

Note that it gives the T+ value of 24.5 (see the last paragraph of the example) and says it can't give the exact p-value because of the ties. The p-value it does give is very close to the 0.238 that the example reports using equation 7. The same answer could have been gotten using the differences instead of the paired option.

d<-y-x
wilcox.test(d,alternative="less")

To get the answer to example 2, you would use the options mu=30 and alternative="greater".
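
A sketch of that call, assuming the example 2 differences have been entered into a vector (d2 is a hypothetical name; the data are not reproduced here):

wilcox.test(d2,mu=30,alternative="greater")    # d2 holds example 2's differences (hypothetical)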

When trying to get the confidence interval, the conf.int option in wilcox.test doesn't return the same value as the text when there are ties (it uses the large sample approximation). To get the results on page 361, you could first enter the following function:

WSRci<-function(d,conf.level=.95){
  d<-sort(d)
  n<-length(d)
  alpha<-1-conf.level
  thematrix<-matrix(d,ncol=n,nrow=n,byrow=T)+matrix(d,ncol=n,nrow=n,byrow=F)  # all pairwise sums d_i + d_j
  thematrix<-thematrix[!upper.tri(thematrix)]/2     # keep each pair once; the Walsh averages (d_i+d_j)/2
  tablevalue<-qsignrank(alpha/2,n)                  # alpha/2 quantile of the signed rank distribution
  estimator<-median(thematrix)                      # median of the Walsh averages (Hodges-Lehmann estimate)
  clow<-sort(thematrix)[tablevalue]
  chi<-sort(thematrix)[n*(n+1)/2+1-tablevalue]
  list(matrix.values=thematrix,table.value=tablevalue,estimator=estimator,interval=c(clow,chi))}

and then simply use WSRci(y-x,.95) .


Mann-Whitney-Wilcoxon Rank Sum Test [& 2-sample t-test]:

The Mann-Whitney-Wilcoxon rank sum test in section 5.1 can be carried out using the function wilcox.test. The results in example 1 on pages 276-278 could be gotten using:

x<-c(14.8,7.3,5.6,6.3,9.0,4.2,10.6,12.5,12.9,16.1,11.4,2.7)
y<-c(12.7,14.2,12.6,2.1,17.7,11.8,16.9,7.9,16.0,10.6,5.6,5.6,
  7.6,11.3,8.3,6.7,3.6,1.0,2.4,6.4,9.1,6.7,18.6,3.2,
  6.2,6.1,15.3,10.6,1.8,5.9,9.9,10.6,14.8,5.0,2.6,4.0)
wilcox.test(x,y,alternative="greater")

Note that while the p-value matches what is on page 277, wilcox.test reports a W statistic and not a T. To get the T value on page 277, take the W and add n*(n+1)/2 where n is the number of observations in x (in this case 12). 243+(12*13)/2 = 321.
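
As a quick arithmetic check in R (using the x vector entered above):

W<-243           # the W statistic reported by wilcox.test
n<-length(x)     # 12 observations in x
W+n*(n+1)/2      # returns 321, the T value on page 277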

The two-sample t-test can be run using:

t.test(x,y,alternative="greater")

Note that this gives you a "lower confidence bound" and not a "confidence interval" like you are used to. For that, you need to remove the alternative="greater" option.
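
For instance, dropping the alternative option gives the usual two-sided interval:

t.test(x,y)    # two-sided by default, so the familiar confidence interval is reported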

When trying to get the confidence interval, the conf.int option in wilcox.test doesn't return the same value as the text when there are ties (it uses the large sample approximation). To get the results on pages 282-283, you could first enter the following function:

MWWRSci<-function(x,y,conf.level=.95){
  x<-sort(x)
  n<-length(x)
  y<-sort(y)
  m<-length(y)
  alpha<-1-conf.level
  thematrix<-matrix(x,ncol=m,nrow=n,byrow=F)-matrix(y,ncol=m,nrow=n,byrow=T)  # all pairwise differences x_i - y_j
  k<-qwilcox(alpha/2,n,m)          # alpha/2 quantile of the Mann-Whitney distribution
  estimator<-median(thematrix)     # median of the differences (Hodges-Lehmann estimate of the shift)
  clow<-sort(thematrix)[k]
  chi<-sort(thematrix)[n*m+1-k]
  list(matrix.values=thematrix,table.k=k,estimator=estimator,interval=c(clow,chi))}

and then enter the data and use the function:

a<-c(7.3,6.9,7.2,7.8,7.2)
b<-c(7.4,6.8,6.9,6.7,7.1)
MWWRSci(a,b,.95)


Kruskal-Wallis Test [& 1-way ANOVA]:

The Kruskal-Wallis test in section 5.2 can be run using kruskal.test. The following code analyzes the data in example 1 on pages 290-291:

m1<-c(83,91,94,89,89,96,91,92,90)
m2<-c(91,90,81,83,84,83,88,91,89,84)
m3<-c(101,100,91,93,96,95,94)
m4<-c(78,82,81,77,79,81,80,81)
kruskal.test(list(m1,m2,m3,m4))

Note that the test statistic is very slightly different from the one shown in the text. Another way of performing the test on the same data set is:

values<-c(83,91,94,89,89,96,91,92,90,91,90,81,83,84,83,88,91,89,84,101,100,91,93,
   96,95,94,78,82,81,77,79,81,80,81)
groups<-as.factor(c(rep(1,9),rep(2,10),rep(3,7),rep(4,8)))
kruskal.test(values~groups)

This second way is in the same format as what is needed for performing a 1-way ANOVA:

anova(lm(values~groups))

The Mean Sq column has the "variances": groups is the between-groups one, and Residuals is the within-groups one. The F statistic and p-value for testing the null hypothesis that all of the means are equal appear on the right side of the table. The residual plot for the ANOVA can be gotten by using plot(lm(values~groups)).
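
One convenient way to keep the table and the plots together is to store the fit first (a minimal sketch; fit is just an illustrative name):

fit<-lm(values~groups)
anova(fit)     # the ANOVA table, with the F statistic and p-value at the right
plot(fit)      # diagnostic plots, including the residual plot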


Friedman Test [& block-design ANOVA]:

The Friedman Test in section 5.8 can be carried out using friedman.test. For example, the following code analyzes the data in example 1 on pages 371-372 using the T1 statistic.

values<-c(4,4,3,3,4,2,1,2,3.5,4,4,3.5,
 3,2,1.5,1,2,2,3,4,1,1,2,1,
 2,3,1.5,2,1,2,2,1,2,3,3,2,
 1,1,4,4,3,4,4,3,3.5,2,1,3.5)
homeowners<-as.factor(rep(1:12,4))
grass<-as.factor(c(rep(1,12),rep(2,12),rep(3,12),rep(4,12)))
friedman.test(values~grass|homeowners)

If you type cbind(homeowners,grass,values) you can see that we entered the data in a way that looks different than the table in the example, but that it has all the same information.

When we enter it in this fashion, we can also do the block design ANOVA:

anova(lm(values~grass+homeowners))

The p-value we want is on the grass line, since grass is our treatment for this example. The residual plot can be gotten by using plot(lm(values~grass+homeowners)).
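
As in the one-way case, the fit can be stored first (fit2 is just an illustrative name):

fit2<-lm(values~grass+homeowners)
anova(fit2)    # the grass line carries the treatment test
plot(fit2)     # residual diagnostics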