R, like the commercial S-Plus, is based on the programming language S. It can be downloaded for free from www.r-project.org [at home, after reading the instructions, choose: CRAN, then a mirror site in the US, then Windows (95 and later), then base, then R-2.5.1-win32.exe]. This page also has links to FAQs and other information about R.
One of the nice things about R (besides the fact that it's powerful and free) is that you can save everything you've done each time you finish. If you are using a network machine where you don't have access to the C drive, you can manually load and save the .Rdata file on another drive each time you start and end the program. You would, of course, have to save the .Rdata file on disk if you wanted to use it both at home and at school.
At the heart of R are the various objects that you enter. At any time (when you are in R) you can type objects() to see which ones you have entered previously. Presumably there aren't any there right now. R also has many built-in objects, most of which are functions. Because these objects are always there, objects() will not list them.
A list of the various functions for R can be found in the manuals that are installed with the program. The various manuals can be found under the Manuals submenu of the Help menu. In particular, the R Reference Manual lists not only all the functions in the base package of R, but also some of those in the supplementary packages as well. Because this manual is nearly 1000 pages long, you probably shouldn't print out a copy!
The various supplementary packages are simply groups of functions that others wrote and decided to make available to the public. A list of those that you downloaded can be found under the Packages menu by choosing Load Package. The one that we will use most often is the MASS package. It can be loaded in by typing library(MASS). Note that some of the non-standard libraries need to be downloaded and unzipped somewhere first if you don't have access to your C drive.
Once you know the name of a function, you can get help about it simply by typing help(functionname). In R, parentheses indicate either mathematical grouping, like they usually do, or that you are applying a function. Square brackets [ ] indicate that you are extracting an element from a vector.
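For instance, the following would bring up the help page for the built-in mean function (mean is just a convenient example to look up; any function name works):

```r
help(mean)   # opens the help page for mean
?mean        # the question mark is a shorthand for help()
```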
This brief example shows some of R's flexibility.
x<-c(5,4,3,2)
The <- is the command that assigns whatever is on the right to the name that is on the left. The c indicates that an array or vector is going to be entered. Simply typing x now will return those four values. Typing
x[3]
will return the third value,
mean(x)
will return the mean, and
objects()
will show you that you now have an object called x.
R also has a variety of graphical functions and statistical functions built in.
hist(x)
will construct a histogram, and
t.test(x,alternative="greater",mu=5)
will test the null hypothesis mu=5 vs. the alternate that mu > 5. (If you happen to mistype a command and get an error, just hit the up arrow to bring it up again for editing).
Besides just getting the result of a function, we can store the results somewhere. For example
x.t<-t.test(x,alternative="greater",mu=5)
will store the result of the t-test in x.t.
attributes(x.t)
will then show you the various properties of x.t. Thus,
x.t$p.value
will return the p-value.
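The other components listed by attributes() can be pulled out the same way. A quick sketch, using the x and x.t defined above:

```r
x <- c(5, 4, 3, 2)
x.t <- t.test(x, alternative = "greater", mu = 5)
x.t$statistic   # the t statistic
x.t$conf.int    # the confidence bound that goes with the one-sided test
```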
One of the things we will be doing on some homework assignments and for the project is using R to generate random numbers following various discrete and continuous distributions, and seeing how they behave. Try:
ndat<-rnorm(200,5,10)
chidat<-rchisq(200,4)
par(mfrow=c(2,1))
hist(ndat)
hist(chidat)
par(mfrow=c(1,1))
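Since these are (pseudo-)random draws, everyone's histograms will look a little different. If you want to be able to reproduce the exact same draws later, you can set the seed of the random number generator first:

```r
set.seed(123)             # any integer works; this makes the draws reproducible
ndat <- rnorm(200, 5, 10) # the same 200 values every time this seed is used
```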
In order to replicate the work we did with SAS, we would need to load in the SAT data (from the August 31st, 1999 New York Times).
Ala. 561 555 9
Alaska 516 514 50
Ariz. 524 525 34
Ark. 563 556 6
Calif. 497 514 49
Colo. 536 540 32
Conn. 510 509 80
Dela. 503 497 67
D.C. 494 478 77
Fla. 499 498 53
Ga. 487 482 63
Hawaii 482 513 52
Idaho 542 540 16
Ill. 569 585 12
Ind. 496 498 60
Iowa 594 598 5
Kan. 578 576 9
Ky. 547 547 12
La. 561 558 8
Maine 507 507 68
Md. 507 511 65
Mass. 511 511 78
Mich. 557 565 11
Minn. 586 598 9
Miss. 563 548 4
Mo. 572 572 8
Mont. 545 546 21
Neb. 568 571 8
Nev. 512 517 34
N.H. 520 518 72
N.J. 498 510 80
N.M. 549 542 12
N.Y. 495 502 76
N.C. 493 493 61
N.D. 594 605 5
Ohio 534 538 25
Okla. 576 560 8
Ore. 525 525 53
Pa. 498 495 70
R.I. 504 499 70
S.C. 479 475 61
S.D. 585 588 4
Tenn. 559 553 13
Texas 494 499 50
Utah 570 565 5
Vt. 514 506 70
Va. 508 499 65
Wash. 525 526 52
W.Va. 527 512 18
Wis. 584 595 7
Wyo. 546 551 10
One way of doing this would be to read it in straight from the keyboard.
satdata<-read.table(stdin())
You can then cut and paste the above data set in and hit return at the last line. Typing satdata will show you what has been read in. Unfortunately it is hard to guarantee that everything has been read in correctly, and it would be nice to have variable names. The following would have read it in with the column names and the correct formats:
satdata<-read.table(stdin(),col.names=c("state","verbal","math","pct"),colClasses=c('character','numeric','numeric','numeric'))
It is often easier to read the data in from the web or another file. For example, look at the file http://www.stat.sc.edu/~habing/courses/data/satdata.txt. This could be read in using:
satdata<-read.table("http://www.stat.sc.edu/~habing/courses/data/satdata.txt",head=T)
We have now made a "data-frame" where each of the four columns goes with a different variable. We can refer to the four columns by using satdata$state, satdata$verbal, satdata$math, or satdata$pct. We can shorten the typing a bit, by assigning the columns to different variables.
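Another way to avoid the repeated satdata$ typing is with(), which temporarily makes the columns of a data frame visible by name (a sketch; it assumes satdata has been read in as above, and the assignments below work just as well):

```r
# with() evaluates an expression using the columns of the data frame directly
with(satdata, mean(verbal))
with(satdata, cor(verbal, math))
```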
v<-satdata$verbal
m<-satdata$math
p<-satdata$pct
The code below performs a linear regression to predict v from p. It is often best to put the regression output into an object instead of simply running it.
vmregress<-lm(v~p)
vmregress
attributes(vmregress)
vmregress$residuals
qqnorm(vmregress$residuals)
qqline(vmregress$residuals)
One of the great benefits of R is the ability to write new functions in a very simple manner:
CIvar<-function(x,alpha=0.05){
  df<-length(x)-1
  clow<-df*var(x)/qchisq(1-alpha/2,df)
  chi<-df*var(x)/qchisq(alpha/2,df)
  list(var=var(x),n=length(x),alpha=alpha,cilow=clow,cihigh=chi)}
Now simply try typing CIvar(v) . Note that it uses alpha=0.05 by default. We could have used a different value by simply telling it to, CIvar(v,alpha=0.10).
The exact binomial test in Section 3.1 can be carried out using binom.test in R. The output from example 1 on page 127 can be gotten from:
binom.test(x=3,n=19,p=0.5,alternative="less")
This command also gives you the exact confidence interval that we talked about in class (the one that you get by reversing the test, and where the confidence level is always at least 1-alpha). Notice that the "less" option means that it isn't giving you a two-sided confidence interval like we are used to seeing; instead you are getting an upper confidence bound (we are 95% confident that the true population percentage is less than 0.3594256). If you change "less" to "greater", you get something very different. Using "two.sided" (or leaving the option out altogether) would give you the regular confidence interval.
The large sample approximation can be gotten using prop.test. To get the output in example 2 on page 128 you would use:
prop.test(x=682,n=925,p=.75,alternative="two.sided")
Note that rounding error makes the p-value 0.393 instead of the 0.392 reported in the text. (Of course, why use the approximation if R will give you the exact test?) Note that it also gives you the large sample confidence interval. To get the corrected confidence interval from the paper in class, you could just make x two larger and n four larger.
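That adjustment could be typed directly (a sketch; the 682 and 925 are from example 2 above):

```r
# "add two successes and four trials" adjusted confidence interval
prop.test(x = 682 + 2, n = 925 + 4, p = .75)$conf.int
```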
R doesn't have a built in function to perform the quantile test like it is described in section 3.2. After entering the following function, you can get all of the output though.
quantile.test<-function(x,xstar=0,quantile=.5,alternative="two.sided"){
  n<-length(x)
  p<-quantile
  T1<-sum(x<=xstar)
  T2<-sum(x< xstar)
  if (alternative=="less"){
    p.value<-1-pbinom(T2-1,n,p)}
  if (alternative=="greater"){
    p.value<-pbinom(T1,n,p)}
  if (alternative=="two.sided"){
    p.value<-2*min(1-pbinom(T2-1,n,p),pbinom(T1,n,p))}
  list(xstar=xstar,alternative=alternative,T1=T1,T2=T2,p.value=p.value)}
This can be used to perform the analysis in example 1 on pages 139-140 using:
x<-c(189,233,195,160,212,176,231,185,199,213,202,193,174,166,248)
quantile.test(x,xstar=193,quantile=.75,alternative="two.sided")
or simply quantile.test(x,193,.75) .
A confidence interval for a quantile can be gotten using the following function.
quantile.interval<-function(x,quantile=.5,conf.level=.95){
  n<-length(x)
  p<-quantile
  alpha<-1-conf.level
  rmin1<-qbinom(alpha/2,n,p)-1
  r<-rmin1+1
  alpha1<-pbinom(r-1,n,p)
  smin1<-qbinom(1-alpha/2,n,p)
  s<-smin1+1
  alpha2<-1-pbinom(s-1,n,p)
  clo<-sort(x)[r]
  chi<-sort(x)[s]
  conf.level<-1-alpha1-alpha2
  list(quantile=quantile,conf.level=conf.level,r=r,s=s,interval=c(clo,chi))}
Note that it doesn't give you exactly the same result as in example 3 on page 145.
x<-c(46.9,47.2,49.1,56.5,56.8,59.2,59.9,63.2,63.3,63.4,63.7,64.1,67.1,67.7,73.3,78.5)
quantile.interval(x,quantile=0.75,conf.level=.9)
This is because the function in the book picks a value "close" to alpha/2, while the function above makes sure that neither alpha1 nor alpha2 exceeds alpha/2.
The McNemar test in section 3.5 can be carried out using the function mcnemar.test. It is probably easiest to enter the data in as a matrix, and the following code shows this in working out example 1 on page 168.
datamatrix<-matrix(c(63,21,4,12),ncol=2)
mcnemar.test(datamatrix)
Note that R uses the large sample approximation to get the p-value. Also note that the value of T1 that it shows doesn't exactly match what the book has. This is because R uses a continuity correction by default in the large sample approximation. To get the T1 value in the text, use mcnemar.test(datamatrix,correct=F) .
You could also use the binomial test of 0.5 for this example, where the number of heads is 21 and number of tails is 4.
binom.test(21,21+4,.5)
Note that the p-value here (0.0009105) is fairly close to the continuity corrected value (0.001374), and is of course exactly correct.
The Wilcoxon Signed Ranks test in section 5.7 can be carried out using the wilcox.test command. The output in example 1 on page 355 could be gotten using either of the following:
x<-c(86,71,77,68,91,72,77,91,70,71,88,87)
y<-c(88,77,76,64,96,72,65,90,65,80,81,72)
wilcox.test(y,x,paired=T,alternative="less")
Note that it gives the T+ value of 24.5 (see the last paragraph of the example) and says it can't give the exact p-value because of the ties. The p-value it does give is very close to the 0.238 that the example reports using equation 7. The same answer could have been gotten using the differences instead of the paired option.
d<-y-x
wilcox.test(d,alternative="less")
To get the answer to example 2, you would use the options mu=30 and alternative="greater".
When trying to get the confidence interval, the option in wilcox.test doesn't return the same value as the text when there are ties (it uses the large sample approximation). To get the results on page 361, you could first enter the following function:
WSRci<-function(d,conf.level=.95){
  d<-sort(d)
  n<-length(d)
  alpha<-1-conf.level
  thematrix<-matrix(d,ncol=n,nrow=n,byrow=T)+matrix(d,ncol=n,nrow=n,byrow=F)
  thematrix<-thematrix[!upper.tri(thematrix)]/2
  tablevalue<-qsignrank(alpha/2,n)
  estimator<-median(thematrix)
  clow<-sort(thematrix)[tablevalue]
  chi<-sort(thematrix)[n*(n+1)/2+1-tablevalue]
  list(matrix.values=thematrix,table.value=tablevalue,estimator=estimator,interval=c(clow,chi))}
and then simply use WSRci(y-x,.95) .
The Mann-Whitney-Wilcoxon rank sum test in section 5.1 can be carried out using the function wilcox.test. The results in example 1 on page 276-278 could be gotten using:
x<-c(14.8,7.3,5.6,6.3,9.0,4.2,10.6,12.5,12.9,16.1,11.4,2.7)
y<-c(12.7,14.2,12.6,2.1,17.7,11.8,16.9,7.9,16.0,10.6,5.6,5.6,7.6,11.3,8.3,6.7,3.6,1.0,2.4,6.4,9.1,6.7,18.6,3.2,6.2,6.1,15.3,10.6,1.8,5.9,9.9,10.6,14.8,5.0,2.6,4.0)
wilcox.test(x,y,alternative="greater")
Note that while the p-value matches what is on page 277, wilcox.test reports a W statistic and not a T. To get the T value on page 277, take the W and add n*(n+1)/2 where n is the number of observations in x (in this case 12). 243+(12*13)/2 = 321.
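That arithmetic can of course be typed straight into R:

```r
n <- 12             # number of observations in x
W <- 243            # the W statistic reported by wilcox.test
W + n * (n + 1)/2   # gives T = 321, the value on page 277
```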
The two-sample t-test can be run using:
t.test(x,y,alternative="greater")
Note that this gives you a "lower confidence bound" and not a "confidence interval" like you are used to. For that, you need to remove the alternative="greater" option.
When trying to get the confidence interval, the option in wilcox.test doesn't return the same value as the text when there are ties (it uses the large sample approximation). To get the results on page 282-283, you could first enter the following function:
MWWRSci<-function(x,y,conf.level=.95){
  x<-sort(x)
  n<-length(x)
  y<-sort(y)
  m<-length(y)
  alpha<-1-conf.level
  thematrix<-matrix(x,ncol=m,nrow=n,byrow=F)-matrix(y,ncol=m,nrow=n,byrow=T)
  k<-qwilcox(alpha/2,n,m)
  estimator<-median(thematrix)
  clow<-sort(thematrix)[k]
  chi<-sort(thematrix)[n*m+1-k]
  list(matrix.values=thematrix,table.k=k,estimator=estimator,interval=c(clow,chi))}
and then enter the data and use the function:
a<-c(7.3,6.9,7.2,7.8,7.2)
b<-c(7.4,6.8,6.9,6.7,7.1)
MWWRSci(a,b,.95)
The Kruskal-Wallis test in section 5.2 can be run using kruskal.test. The following code analyzes the data in example 1 on pages 290-291:
m1<-c(83,91,94,89,89,96,91,92,90)
m2<-c(91,90,81,83,84,83,88,91,89,84)
m3<-c(101,100,91,93,96,95,94)
m4<-c(78,82,81,77,79,81,80,81)
kruskal.test(list(m1,m2,m3,m4))
Note that the test statistic is very slightly different from the one shown in the text. Another way of performing the test on the same data set is:
values<-c(83,91,94,89,89,96,91,92,90,91,90,81,83,84,83,88,91,89,84,101,100,91,93,96,95,94,78,82,81,77,79,81,80,81)
groups<-as.factor(c(rep(1,9),rep(2,10),rep(3,7),rep(4,8)))
kruskal.test(values~groups)
This second way is in the same format as what is needed for performing a 1-way ANOVA:
anova(lm(values~groups))
The Mean Sq column has the "variances": groups is the between one, and Residuals is the within one. The F statistic and p-value for testing the null hypothesis that all of the means are equal are on the right side of the table. The residual plot for the ANOVA can be gotten by using plot(lm(values~groups)) .
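Just as with t.test, the ANOVA table can be stored and its pieces extracted (a sketch, assuming values and groups have been entered as above):

```r
aov.table <- anova(lm(values ~ groups))
aov.table$"F value"[1]   # the F statistic for groups
aov.table$"Pr(>F)"[1]    # its p-value
```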
The Friedman Test in section 5.8 can be carried out using friedman.test. For example, the following code analyzes the data in example 1 on pages 371-372 using the T1 statistic.
values<-c(4,4,3,3,4,2,1,2,3.5,4,4,3.5,3,2,1.5,1,2,2,3,4,1,1,2,1,2,3,1.5,2,1,2,2,1,2,3,3,2,1,1,4,4,3,4,4,3,3.5,2,1,3.5)
homeowners<-as.factor(rep(1:12,4))
grass<-as.factor(c(rep(1,12),rep(2,12),rep(3,12),rep(4,12)))
friedman.test(values~grass|homeowners)
If you type cbind(homeowners,grass,values) you can see that we entered the data in a way that looks different than the table in the example, but that it has all the same information.
When we enter it in this fashion, we can also do the block design ANOVA:
anova(lm(values~grass+homeowners))
The p-value we want is on the grass line, our treatment for this example. The residual plot can be gotten by using: plot(lm(values~grass+homeowners)) .