Stat 513 - Fall 2008 - R Templates

An Introduction to R
Getting the CDF, inverse CDF, and PDF
Plotting a Power Curve for a Z-test for means
Simulation Code to Check Robustness
Regression
Football Rankings


An Introduction to R:

R, like the commercial S-Plus, is based on the programming language S. It can be downloaded for from www.r-project.org [At home, after reading the instructions!, choose: CRAN, then a mirror site in the US, then Windows, then base, then R-2.7.2-win32.exe]. This page also has links to FAQs and other information about R.

One of the nice things about R (besides the fact that it's powerful and free) is that you can save everything you've done after you've finished each time. If you are using a network machine where you don't have access to the C drive, you can manually load and save the .Rdata file on another drive each time you want to start and end the program. You would, of course, have to save the .Rdata file on disk if you wanted to use it both at home and at school.

At the heart of R are the various objects that you enter. At any time (when you are in R) you can type in the function objects() to see which ones you have entered previously. Presumably there aren't any there right now. R also has many built in objects, most of which are functions. Because these objects are always there, it will not list them when you type objects().

A list of the various functions for R can be found in the manuals that are installed with the program. The various manuals can be found under the Manuals submenu of the Help menu. In particular, the R Reference Manual lists not only all the functions in the base package of R, but also some of those in the supplementary packages as well. Because this manual is nearly 1000 pages long, you probably shouldn't print out a copy!!!!

The various supplementary packages are simply groups of functions that others wrote and decided to make available to the public. A list of those that you downloaded can be found under the Packages menu by choosing Load Package. One that is used most often is the MASS package. It can be loaded in by typing library(MASS). Note that some of the non-standard libraries need to be downloaded and unzipped somewhere first if you don't have access to your C drive.

Once you know the name of a function, you can get help about it simply by typing help(functionname). In R, parentheses indicate either mathematical grouping, like they usually do, or that you are applying a function. Braces [ ] indicate that you are finding an element in a string.

This brief example shows some of R's flexibility.

x<-c(5,4,3,2)

The <- is the command that assigns what ever is on the left to the name that is on the right. The c indicates that an array or vector is going to be entered. Simply typing x now will return those four values. Typing

x[2]

will return the second value,

mean(x)

will return the mean, and

objects()

will show you that you now have an object called x.

R also has a variety of graphical functions and statistical functions built in.

hist(x)

will construct a histogram. Besides just getting the result of a function, we can store the results somewhere. For example

mean.x<-mean(x)

will store the mean of x in a new object called

mean.x


Getting the CDF, inverse CDF, and PDF

R can calculate the basic information for most of the distributions seen in the course. A d at the beginning of the distribution name means the function is giving the value of the p.d.f. or p.m.f., a p means it is giving the probability value as determined by the CDF, and a q indicates it is giving the quantile from the inverse of the CDF.

For example, the 97.5th percentile of a standard normal should be 1.96

qnorm(0.975)

and 50% of a standard normal distribution should be below 0

pnorm(0)

To calculate values for a different normal distribution you can use the options mean= and sd=, for example:

qnorm(0.975,mean=10,sd=10)

should be 10(1.96)+10, and is.

Other common distributions include: binom, f, t, and chisq. To get help on them you must give the whole function name though (e.g. help(qbinom) instead of help(binom)).

As other examples, the following code would give you the first line of Table 6:

round(qchisq(1-c(0.995,0.990,0.975,0.950,0.900),df=1),7)

First, the function round( ,7) is saying to report whatever is inside to 7 decimal places. Secondly, the function qchisq( ,df=1) is saying to give the inverse CDF value (the quantile) for whatever probabilities you put in for a chi-squared distribution for df=1. The c(0.995,0.990,0.975,0.950,0.900) is saying what probabilities you are working with. Finally, the 1- is there because R uses the CDF, but check the picture at the top of Table 6... it is calculating probabilities in the opposite direction!

The start of the first row, and all of the first column of Table 1 (the binomials) would be gotten using:

pbinom(q=0,size=5,prob=c(0.01,0.05,0.10,0.20,0.30))
pbinom(q=c(0,1,2,3,4),size=5,prob=0.01)

You could put them inside round( ,3) to get them to the same number of digits


Plotting a Power Curve for a Z-test for Means

Consider the example of finding the power of a z-test for the mean (where we assume the true variance is known). For n=100, sigma=10, alpha=0.05, Ho: mu=0, Ha: mu>0, (after some algebra) we get

power at mu_a
= P[reject Ho|Ho is false]
= P[(x-bar - mu_o)/(sigma/sqrt(n)) >= 1.645]
= P[(x-bar - mu_a)/(sigma/sqrt(n)) >= 1.645 + (mu_o-mu_a)/(sigma/sqrt(n)]
= P[Z >= 1.645+(0-mu_a)/(10/sqrt(100)]
= P[Z >= 1.645-mu_a]

To calculate the power at the single value of mu_a=0.5 we could use:

1-pnorm(1.645-0.5)

and get 0.1261046.

To plot a power curve we want to calculate this for a range of values, say mu_a goes from -1 to 5. The following code first makes the list of mu_a's, then gets the corresponding powers, and then plots them.

mu.a<-(-10:50)/10
powers<-1-pnorm(1.645-mu.a)
plot(mu.a,powers,xlim=c(-1,5),ylim=c(0,1),type="l")
lines(c(0,0),c(0,1),lty=2)
lines(c(-1,5),c(0.05,0.05),lty=2)

The first line of code makes a list of all of the numbers from -10 to 50 divided by 10 (so -1 to 5). The second line just plugs that list of values into the formula we use to find an individual power. The third line plots those. The xlim gives the range of the x-axis, and the ylim gives the range of the y-axis. The fourth and fifth lines plot where the null hypothesis is (x-value equals 0) and the alpha level (0.05 from -1 to 5) respectively.


Simulation Code to Check Robustness

The following code generates two sets of random samples from a t distribution with 5 degrees of freedom. The first matrix x has 1,000 rows of data sets of size 20. y is similar. The two-sample variance test is then run, and the percentage of times there is a rejection for a 0.05 alpha level is determined.

x<-matrix(rt(20*1000,df=5),ncol=20)
y<-matrix(rt(20*1000,df=5),ncol=20)
pvals<-rep(0,1000)
for (i in 1:1000){
pvals[i]<-var.test(x[i,],y[i,])$p.value
}
sum(pvals<0.05)/1000

For the homework, you will only need one sample, and will want to use the t.test function.


Regression:

The following code analyzes the data in Examples 11.1 to 11.5 in the text. The data are entered as the vectors x and y, and y is predicted from x. The output is stored in regoutput.

x<-c(-2,-1,0,1,2)
y<-c(0,0,1,1,3)
regoutput<-lm(y~x)

If you typeregoutput you will see the beta-hat-0 and beta-hat-1 values. You can get the plot of the points and the line using

plot(x,y,xlab="X",ylab="Y")
abline(regoutput)

To get the S, look for the "Residual standard error" in the following output. Square it to get S2.

summary(regoutput)

This summary also gives the p-values for testing the two sided hypotheses about beta-0 and beta-1, in a format similar to how we saw it in SAS.

You can find the SSE, by using

anova(regoutput)

and looking in the Sum Sq (e.g. SS) column of the Residuals (e.g. error) line. Also, note that the value in that row for the Mean Sq column is the S2.

To see the residual plots, look at the top two graphs in:

par(mfrow=c(2,2))
plot(regoutput)
par(mfrow=c(1,1))