STAT 702 - Fall 2006 & STAT 703 - Spring 2007
R Templates

STAT 702
    The Basics of R
    Ch.1 #17: Plotting Acceptance Probabilities
    Ch.2 #67: Plotting PDFs
    Lecture 23: Expected Number of Tests Needed
    Lecture 23: Benefit of Stratifying
STAT 703
    Lecture 3: MoM, QQ Plots, and Parametric Bootstrap
    Lecture 4: Maximum Likelihood Estimation
    Homework 2: MLEs for Logistic-type Regression


The Basics of R:

R, like the commercial S-Plus, is based on the programming language S. It can be downloaded for free from www.r-project.org [at home, after reading the instructions, choose: CRAN, then a mirror site in the US, then Windows (95 and later), then base, then R-2.3.1-win32.exe]. This page also has links to FAQs and other information about R.

One of the nice things about R (besides the fact that it's powerful and free) is that you can save everything you've done each time you finish. If you are using a network machine where you don't have access to the C drive, you can manually load and save the .Rdata file on another drive each time you start and end the program. You would, of course, have to save the .Rdata file on disk if you wanted to use it both at home and at school.

At the heart of R are the various objects that you enter. At any time (when you are in R) you can type the function objects() to see which ones you have entered previously. Presumably there aren't any there right now. R also has many built-in objects, most of which are functions. Because these objects are always there, R will not list them when you type objects().

A list of the various functions for R can be found in the manuals that are installed with the program. The various manuals can be found under the Manuals submenu of the Help menu. In particular, the R Reference Manual lists not only all the functions in the base package of R, but also some of those in the supplementary packages as well. Because this manual is nearly 1000 pages long, you probably shouldn't print out a copy!!!!

The various supplementary packages are simply groups of functions that others wrote and decided to make available to the public. A list of those that you downloaded can be found under the Packages menu by choosing Load Package. The one that we will use most often is the MASS package. It can be loaded in by typing library(MASS). Note that some of the non-standard libraries need to be downloaded and unzipped somewhere first if you don't have access to your C drive.

Once you know the name of a function, you can get help about it simply by typing help(functionname). In R, parentheses indicate either mathematical grouping, like they usually do, or that you are applying a function. Square brackets [ ] indicate that you are selecting an element from a vector.

This brief example shows some of R's flexibility.

x<-c(5,4,3,2)

The <- is the command that assigns whatever is on the right to the name on the left. The c indicates that an array or vector is going to be entered. Simply typing x now will return those four values. Typing

x[3]

will return the third value,

mean(x)

will return the mean, and

objects()

will show you that you now have an object called x.

R also has a variety of graphical functions and statistical functions built in.

hist(x)

will construct a histogram, and

t.test(x,alternative="greater",mu=5)

will test the null hypothesis mu=5 vs. the alternative that mu > 5. (If you happen to mistype a command and get an error, just hit the up arrow to bring it up again for editing).

Besides just getting the result of a function, we can store the results somewhere. For example

x.t<-t.test(x,alternative="greater",mu=5)

will store the result of the t-test in x.t.

attributes(x.t)

will then show you the various properties of x.t. Thus,

x.t$p.value

will return the p-value.
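
Other quantities are stored under similar names (assuming a standard t.test result); for example, the test statistic and the one-sided confidence interval can be pulled out with:

x.t$statistic
x.t$conf.int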

R can also be used to write your own functions. For example, if we wanted a function that returned sin(x)+x^2 we could write:

weird<-function(x){
y<-sin(x)+x^2
return(y)}

Try weird(5) and weird(x). What is it doing?

One of the things we will be doing on some homework assignments is using R to generate random numbers following various discrete and continuous distributions, and seeing how they behave. Try:

ndat<-rnorm(200,5,10)
chidat<-rchisq(200,4)
par(mfrow=c(2,1))
hist(ndat)
hist(chidat)
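
The same idea works for discrete distributions. As a quick sketch (the binomial values 20 and 0.3 below are arbitrary, chosen only to show the syntax), try:

bdat<-rbinom(200,20,0.3)
par(mfrow=c(1,1))
hist(bdat)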


Ch. 1 #17: Plotting Acceptance Probabilities

In R it is easy to construct a scatterplot of two sets of points against each other. Consider:

X   12   -12   18
Y   16    6   21
The vectors would be entered using the code:
x<-c(12,-12,18)
y<-c(16,6,21)
A basic plot would then be made using
plot(x,y)
A few options are setting the limits of the x and y axes (xlim and ylim), changing the names of the x and y axes (xlab and ylab), and connecting the points (type="l") assuming they are in order. The following command shows what can happen if they are not in order.
plot(x,y,xlim=c(-20,20),ylim=c(0,25),xlab="variable 1",ylab="variable 2",type="l")
To fix this we can use the order and sort commands to sort the x-variable and to put the y-variable in that same order.
x1<-sort(x)
y1<-y[order(x)]
plot(x1,y1,xlim=c(-20,20),ylim=c(0,25),xlab="variable 1",ylab="variable 2",type="l")
Now the homework assignment gives that the x values (the p's) are: 0, 0.05, 0.10, 0.15, 0.2, and 0.25. But now we need to find the y values!

Treating it as a binomial: The following code will calculate the probability we examined in class, where we wanted to know the chance of seeing exactly 10 defectives out of 20 when the probability of a defective is 0.02.

choose(20,10)*(0.02)^10*(1-0.02)^(20-10)
This is the same as the built in function dbinom. dbinom must be given three values: the number of "successes", the number of trials, and the probability of a "success". So
dbinom(10,20,.02)
should return the same value. If we wanted to get this value for a bunch of different probabilities we could try something like:
dbinom(10,20,c(0,0.05,0.10,0.15,0.2,0.25))
or
ps<-c(0,0.05,0.10,0.15,0.2,0.25)
ps
dbinom(10,20,ps)
We could also get the probability of exactly 10 or 11 by adding two of these vectors together:
psfor10<-dbinom(10,20,c(0,0.05,0.10,0.15,0.2,0.25))
psfor11<-dbinom(11,20,c(0,0.05,0.10,0.15,0.2,0.25))
psfor10or11<-psfor10+psfor11
psfor10or11
or to get the probability that NOT exactly 10 or 11:
1-psfor10or11
And we could plot it...
plot(ps,1-psfor10or11,xlim=c(0,0.25),ylim=c(0,1),xlab="p",ylab="prob",type="l")
Not a very exciting plot for this example! Making ylim have a different range might make it look better.
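For example, one possible choice (the range c(0.95,1) here is just a guess at what might look reasonable) is:
plot(ps,1-psfor10or11,xlim=c(0,0.25),ylim=c(0.95,1),xlab="p",ylab="prob",type="l")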

Treating it as a Hypergeometric: It is also possible (and closer to the problem's actual phrasing) to treat this problem as a hypergeometric instead of a binomial. As an example, consider the problem with the 0.02 defective rate. Imagine that there were originally 2000 parts of which 40 were defective (40/2000=0.02). The box of 20 we are looking at had 20 randomly chosen from that 2,000.

This would then be a hypergeometric with population size 2000, number of "successes" (defective in this case) 40, and number chosen 20. The probability of getting exactly 10 defectives out of the twenty would then be:

choose(40,10)*choose(2000-40,20-10)/choose(2000,20)
from the formula at the top of page 13 or the middle of page 40.

This is the same as the built in function dhyper. dhyper must be given four values: the number of "successes" chosen, the number of possible "successes" overall, the number of possible "failures" overall, and the total sample size we chose. So

dhyper(10,40,1960,20)
should return the same value. If we wanted to get this value for a bunch of different probabilities we just need to adjust them to be counts out of 2000 instead of proportions.

ps<-c(0,0.05,0.10,0.15,0.2,0.25)
numbad<-2000*ps
numgood<-2000-numbad
dhyper(10,numbad,numgood,20)
We could also get the probability of exactly 10 or 11 by adding two of these vectors together:
psfor10<-dhyper(10,numbad,numgood,20)
psfor11<-dhyper(11,numbad,numgood,20)
psfor10or11<-psfor10+psfor11
psfor10or11
or to get the probability that NOT exactly 10 or 11:
1-psfor10or11
And we could plot it...
plot(ps,1-psfor10or11,xlim=c(0,0.25),ylim=c(0,1),xlab="p",ylab="prob",type="l")
Not a very exciting plot for this example! Making ylim have a different range might make it look better.


Ch. 2 #67: Plotting PDFs

The code below is from class on September 23rd, where we plotted the pdf for a variety of gamma distributions.

par(mfrow=c(2,2))
t<-(0:2500)/100
ga_t<-dgamma(t,shape=1,rate=1)
gb_t<-dgamma(t,shape=2.5,rate=1)
gc_t<-dgamma(t,shape=5,rate=1)
gd_t<-dgamma(t,shape=10,rate=1)
plot(t,ga_t,xlim=c(0,25),type="l")
plot(t,gb_t,xlim=c(0,25),type="l")
plot(t,gc_t,xlim=c(0,25),type="l")
plot(t,gd_t,xlim=c(0,25),type="l")

t<-(0:2500)/100
ge_t<-dgamma(t,shape=2.5,rate=0.25)
gf_t<-dgamma(t,shape=2.5,rate=0.5)
gg_t<-dgamma(t,shape=2.5,rate=2)
gh_t<-dgamma(t,shape=2.5,rate=4)
plot(t,ge_t,xlim=c(0,25),type="l")
plot(t,gf_t,xlim=c(0,25),type="l")
plot(t,gg_t,xlim=c(0,25),type="l")
plot(t,gh_t,xlim=c(0,25),type="l")
The par(mfrow=c(2,2)) command tells it to make a 2x2 matrix of plots. We then set up the values to plot at (in this case 0 to 25 at increments of 0.01). The dgamma function is the function for the pdf of the gamma ( help(dgamma) tells you about the parameters), and then we plot them using lines ( type="l" ).

See help(dweibull) for information on the Weibull distribution... and make sure to match up the parameters!
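
As a sketch of what that might look like (the shape and scale values below are arbitrary, chosen only to show the syntax -- check help(dweibull) to see how R's shape and scale relate to the parameters in the assignment):

par(mfrow=c(2,2))
t<-(0:2500)/100
plot(t,dweibull(t,shape=1,scale=5),xlim=c(0,25),type="l")
plot(t,dweibull(t,shape=1.5,scale=5),xlim=c(0,25),type="l")
plot(t,dweibull(t,shape=2,scale=5),xlim=c(0,25),type="l")
plot(t,dweibull(t,shape=5,scale=5),xlim=c(0,25),type="l")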


Lecture 23: Expected Number of Tests Needed

The following code is from Lecture 23 and is designed to calculate the expected number of tests for various numbers of groups (see Example C on page 121). n is the overall number of samples, m is the number of groups it is divided into, and p is the probability of any individual being "infected".

expntests<-function(n,m,p){
m*((1-p)^(n/m)+(n/m+1)*(1-(1-p)^(n/m)))
}

n<-5000
p<-.001
ms<-120:160

cbind(120:160,expntests(n=n,m=ms,p=p))
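
If you wanted to pick out the value of m that gives the smallest expected number of tests (this line is an addition, not part of the lecture code), which.min can be used:

ms[which.min(expntests(n=n,m=ms,p=p))]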


Lecture 23: Benefit of Stratifying

The following code is from Lecture 23 and is designed to show the benefit of stratified sampling over simple random sampling. Consider sampling from a population of size N with strata of size Ni. The population overall has proportion p of the desired trait, while each stratum has proportion pi. Compare the variance of an overall sample of size n to stratified samples of size ni from each stratum.

For example consider a population of size 150 with prevalence 0.5 that could be broken into strata of size 50 with prevalences 0.1, 0.5, and 0.9. Compare taking a sample of size 30 from the overall population to taking samples of size 10 from each strata.

N<-150
Ni<-c(50,50,50)
p<-0.5
pi<-c(0.1,0.5,0.9)
n<-30
ni<-c(10,10,10)

v<-(p*(1-p)*(N-n)/(N-1))/n
vstrat<-sum((Ni/N)^2*pi*(1-pi)*(Ni-ni)/(Ni-1)/ni)

v
vstrat
In this case we see that the variance using the stratified sampling (0.0039) is less than the variance using a simple random sample (0.0067). Changing the various values can show that the benefit disappears if the strata levels aren't actually different in terms of their prevalences.
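
As a quick check of that last claim (simply re-running the stratified formula with equal prevalences in each stratum), setting all of the pi to 0.5 gives a stratified variance of roughly the same size as the simple random sampling one:

pi<-c(0.5,0.5,0.5)
sum((Ni/N)^2*pi*(1-pi)*(Ni-ni)/(Ni-1)/ni)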


Lecture 3: MoM, QQ Plots, and Parametric Bootstrap

The following code is from class on January 18th. The first step was to load the data set into R, and to then pick off the first column. I called it pars as short for parameters, and saved it as the vector a. I sorted as well to make it easier to look at. c is the name of a built-in function, so you shouldn't call a vector that (maybe cc?).

pars<-read.table("http://www.stat.sc.edu/~habing/courses/703/itests.txt",head=TRUE)
a<-sort(pars[,1])

This code calculates the mean and estimated variance from the data. Notice that the formulas we are using divide by n and not n-1, so we need to adjust what R is calculating. I then calculated the method of moments estimators (I also had it find the number of observations since I would need that later).

n<-length(a)
xbar<-mean(a)
sigma2hat<-(n-1)/n*var(a)
lhat<-xbar/sigma2hat
ahat<-xbar^2/sigma2hat

The q-q plot is made using the qgamma (quantiles of the gamma) function. Other distributions have similarly named functions. Recall from last semester that our lambda is the rate parameter (=1/scale) in R. The line on the q-q plot is drawn using the quantiles of the gamma.

plot(sort(a),qgamma((1:n)/(n+1),shape=ahat,rate=lhat))
lines(qgamma((1:n)/(n+1),shape=ahat,rate=lhat),qgamma((1:n)/(n+1),shape=ahat,rate=lhat))

Finally, here is the code for generating 100,000 bootstrap samples for this example. Try summary(lhat.dist) or sd(lhat.dist) or hist(lhat.dist) to see the properties of lambda-hat's sampling distribution.

nsamples<-100000
x<-rgamma(n*nsamples,shape=ahat,rate=lhat)
x<-matrix(x,ncol=n)
xbar.dist<-apply(x,1,mean)
s2h.dist<-(n-1)/n*apply(x,1,var)
lhat.dist<-xbar.dist/s2h.dist
ahat.dist<-xbar.dist^2/s2h.dist


Lecture 4: Maximum Likelihood Estimation

The code below uses R to find the maximum likelihood estimates for the example above. The idea is to use the built-in function for finding extrema, optim, which finds minima, and to enter the log-likelihood function manually. Of course we want the maximum, not the minimum, so we need to have ngamloglike be the negative of the log-likelihood.

The function we give to optim is set up so that its first argument pars is the vector of values we want estimated, and its second argument data is what we know. You should be able to check that the following code contains the function for the negative log-likelihood from page 256, and then asks R to use optim, starting with 6 for the initial a guess and 6 for the initial l guess.

ngamloglike<-function(pars,data){
a<-pars[1]
l<-pars[2]
n<-length(data)
-1*
(n*a*log(l)+(a-1)*sum(log(data))-
l*sum(data)-n*lgamma(a))
}

optim(c(6,6),ngamloglike,data=a)
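
The output can also be stored so that the estimates can be pulled out with $par instead of retyping them (the name gamma.mle below is just a convenient choice):

gamma.mle<-optim(c(6,6),ngamloglike,data=a)
gamma.mle$par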

Notice that the estimates are alpha-hat=7.338857 and lambda-hat=6.854916.

We can also apply the bootstrap to these estimates to see if it appears we really have recovered something near the maximum of the sampling distribution.


amle<-7.338857
lmle<-6.854916
nsamples<-100000
x<-rgamma(n*nsamples,shape=amle,rate=lmle)
x<-matrix(x,ncol=n)
xbar.dist<-apply(x,1,mean)
s2h.dist<-(n-1)/n*apply(x,1,var)
lmle.dist<-xbar.dist/s2h.dist
amle.dist<-xbar.dist^2/s2h.dist

hist(lmle.dist,nclass=50,xlim=c(0,25))
lines(c(mean(lmle.dist),mean(lmle.dist)),c(-1000,30000),lwd=4,lty=5)
text(14,8000,"Mean of Estimates")
lines(c(lmle,lmle),c(-1000,30000),lwd=4)
text(3,8000,"True Value")


Homework 2: MLEs for Logistic-type Regression

The following code reads in the data concerning the temperature and the failure of the space shuttle o-rings before the Challenger disaster. The first column contains the temperature, and the second whether an O-ring failed or not -- 1 indicates it worked successfully and a 0 indicates a failure.

oring<-read.table("http://www.stat.sc.edu/~habing/courses/703/oring.txt",head=TRUE)
oring<-as.matrix(oring)
The function lr.nloglik contains the negative log-likelihood for standard logistic regression and uses the built in CDF function for the logistic distribution: plogis. The optim command can then be used with initial values a=0 and b=0.1. The result will be found under $par in the order we entered them (a and then b).

lr.nloglik<-function(pars,data){
a<-pars[1]
b<-pars[2]
s<-1/b
m<--a*s
x<-data[,1]
y<-data[,2]
p<-plogis(x,m,s)
-sum(y*log(p)+(1-y)*log(1-p))
}

optim(c(0,.1),lr.nloglik,data=oring)
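
As with the gamma example, the output can be stored and the estimates of a and b pulled out with $par (oring.mle is just a name I chose):

oring.mle<-optim(c(0,.1),lr.nloglik,data=oring)
oring.mle$par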

This can be compared to the results obtained using the built-in glm function. The general format is glm(y~x,modeltype) where we are using x to predict y and modeltype is appropriate for the kind of model we are fitting (binomial in this case).


glm(oring[,2]~oring[,1],binomial)