Stat 518 - Fall 2000 - S-Plus Templates

Class Notes from 9/13/00
Class Notes for 9/15/00
Homework 3 Notes
Homework 4 Notes
Homework 5 Notes
Homework 6 Notes
Homework 7 Notes
Other Nonparametric Methods
Copy, Paste, or Print


The Basics of S-plus:

Unlike SAS, where you have to hit a key to run the program, S-Plus runs what you tell it to immediately. It is therefore important to keep a copy of the code you will be running in another file, generally in a word processor when using a PC.

Another difference between S-Plus and SAS is that S-Plus is run from the various Sun Sparcs in the Department of Mathematics and Statistics. To use them from a PC you must therefore log in remotely using the program X-Win32. When you start up X-Win32, nothing will happen at first except for a bar appearing at the bottom of the screen. Selecting one of the sessions there will call up an X-window in which you will be prompted for your login name and password. These will generally NOT be the same as the login and password for the PC.

At the heart of S-Plus are the various objects that you enter. At any time (when you are in S-Plus) you can type the function objects() to see which ones you have entered previously. Presumably there aren't any there right now. S-Plus also has many built-in objects, most of which are functions. It saves these objects permanently (until you erase them). For this to work, though, you need to create a .Data directory in UNIX. This can be done (before starting S-Plus) by typing:

mkdir .Data

Once we have done that, we can start S-Plus. Simply type Splus at the UNIX prompt. You can now type objects() to see whether anything is already there. One of the simplest types of objects to enter is a vector. Try typing:

x<-c(5,4,3,2)

The <- is the assignment operator; it assigns whatever is on the right to the name on the left. The c (combine) function builds an array or vector out of the values listed. Simply typing x now will return those four values. Typing x[3] will return the third value, mean(x) will return the mean, and objects() will show you that you now have an object called x.
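For example:

x          # prints the four values: 5 4 3 2
x[3]       # the third value: 3
mean(x)    # the mean of the four values: 3.5
objects()  # the list of saved objects, now including "x"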

Say we wished to enter the data set we used last time into S-Plus. We would use the line:


satdata<-scan("",what=list(state="",verbal=0,math=0,pct=0))

S-Plus then comes back with a prompt to enter the data. We can now just cut and paste the block into it. Because UNIX tends to be unhappy if you try to cut and paste too much at one time, it is probably better to do it in two chunks. Paste the first chunk in by clicking with both mouse buttons, make sure there is a prompt showing, then paste the second half in and hit return until it gets out of scan mode. Now, typing satdata will show you that the data set is in there.

The SAT data presented in the August 31st, 1999 New York Times:


Ala. 561 555 9
Alaska 516 514 50
Ariz. 524 525 34
Ark. 563 556 6
Calif. 497 514 49
Colo. 536 540 32
Conn. 510 509 80
Dela. 503 497 67
D.C. 494 478 77
Fla. 499 498 53
Ga. 487 482 63
Hawaii 482 513 52
Idaho 542 540 16
Ill. 569 585 12
Ind. 496 498 60
Iowa 594 598 5
Kan. 578 576 9
Ky. 547 547 12
La. 561 558 8
Maine 507 507 68
Md. 507 511 65
Mass. 511 511 78
Mich. 557 565 11
Minn. 586 598 9
Miss. 563 548 4
Mo. 572 572 8
Mont. 545 546 21
Neb. 568 571 8
Nev. 512 517 34
N.H. 520 518 72
N.J. 498 510 80
N.M. 549 542 12
N.Y. 495 502 76
N.C. 493 493 61
N.D. 594 605 5
Ohio 534 538 25
Okla. 576 560 8
Ore. 525 525 53
Pa. 498 495 70
R.I. 504 499 70
S.C. 479 475 61
S.D. 585 588 4
Tenn. 559 553 13
Texas 494 499 50
Utah 570 565 5
Vt. 514 506 70
Va. 508 499 65
Wash. 525 526 52
W.Va. 527 512 18
Wis. 584 595 7
Wyo. 546 551 10

We have now made a "data-frame" where each of the four columns goes with a different variable. We can refer to the four columns by using satdata$state, satdata$verbal, satdata$math, or satdata$pct. The hist command will give a histogram for the data, and mean will give the mean. We can shorten the typing a bit by assigning satdata$pct to the simpler name p.


p<-satdata$pct


hist(p)
mean(p)
Unfortunately S-Plus removed the interactive help menus, but you can still find out more about changing the histogram for example by typing help(hist).

This is the code that would be used for making the confidence interval for the variance of the percent of students taking the test in a state.


alpha<-.05

df<-length(p)-1
clow <- df*var(p)/qchisq(1-alpha/2,df)
chi <- df*var(p)/qchisq(alpha/2,df)
c(clow,chi)


The Basics of S-plus Continued:

Last time we tried to read in the 1999 SAT data by cutting and pasting into S-Plus... and found that that isn't the easiest way to go about it. S-Plus is much better at reading out of files. What we can do is to create a file in your UNIX account, copy the SAT data into that file, and then read that file into S-Plus.

In UNIX, type: vi satdata

This command opens up an editing window with the vi editor. This is probably one of the worst editors for UNIX, and you might look into emacs, joe, or pico if you plan on doing lots of editing on UNIX. Now, hit i once to put the window into insert mode. You can now cut and paste the entire data set into the window. If it doesn't paste in correctly, you can hit Esc and then u to undo the change; then hit i and try again. When the data looks right, hit Esc, type :wq, and press return to save the file and exit.

Now start up S-Plus, and enter the following:

satdata<-scan("satdata",what=list(state="",verbal=0,math=0,pct=0))

Typing satdata will now show you the entire data set. Note that we have replaced the satdata object you entered Wednesday, because we have entered something else with the same name.


v<-satdata$verbal

m<-satdata$math
p<-satdata$pct
The code below performs a linear regression to predict v from m. It is often best to put the regression output into an object, though, instead of simply running it.



vmregress<-lm(v~m)

vmregress

attributes(vmregress)

vmregress$residuals

qqnorm(vmregress$residuals)
qqline(vmregress$residuals)

Perhaps the greatest strength of S-Plus is that the user can add to the list of built-in functions. The code below is for plotting the empirical distribution function of any data set:


edf<-function(y){

x<-sort(y)
plot(c(min(x)-1,max(x)+1),c(0,1),type="n",xlab="x",ylab="p")
lines(c(x[1]-1,x[1]),c(0,0),lty=1)
lines(c(x[1],x[1]),c(0,1/length(x)),lty=2)
for (i in 1:(length(x)-1)){
lines(c(x[i],x[i+1]),c(i/length(x),i/length(x)),lty=1)
lines(c(x[i+1],x[i+1]),c(i/length(x),(i+1)/length(x)),lty=2)}
lines(c(x[length(x)],x[length(x)]+1),c(1,1),lty=1)
}
Typing edf(p) will plot the edf for the percent of students taking the SAT in each state.


Homework 3 Notes

Page 133 #2,4: The command for the exact test to see if p is a certain value for data from a binomial distribution is:
binom.test(x, n, p=0.5, alternative="two.sided")
Instead of x use the number of 'successes'; instead of n use the number of 'trials'; 0.5 is the default value for the null hypothesis, which you should replace with the appropriate value; and "two.sided" is the default value for the alternative hypothesis, which you may replace with either "less" or "greater". The command for the large sample approximation would be:
prop.test(x, n, p=0.5, alternative="two.sided", correct=T)
The correct=T says to use the continuity correction, correct=F would be for not using the correction. Recall that you can type help(binom.test) or help(prop.test) to see some of the options available. You may wish to try replicating the results of Example 1 on page 127. Notice that the confidence intervals will not be correct unless "two.sided" is chosen. It should also be noted that the confidence interval for prop.test is calculated differently than in the book.
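For example, with a hypothetical 7 successes out of 20 trials, testing p=0.5 against p<0.5, the exact test and the large sample approximation would be:

binom.test(7, 20, p=0.5, alternative="less")
prop.test(7, 20, p=0.5, alternative="less", correct=T)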

S-Plus does not have a built in command for finding the confidence intervals given by table A4. The book "Nonparametric Statistical Methods" by Hollander and Wolfe gives the formula for finding them though.

L = Y / (Y+ (n - Y + 1) * f [ 1 - alpha/2, 2(n - Y + 1), 2 Y ] )

and
U = 1 - (n-Y) / (n - Y + (Y + 1) * f [ 1 - alpha/2, 2(Y+1), 2(n - Y)])

where f[a, df1, df2] is the a quantile (100a-th percentile) of the F distribution with the specified degrees of freedom. A function to generate this confidence interval could be written as:

binom.conf <- function(Y,n,alpha){

L <- Y / (Y+ (n - Y + 1) * qf(1 - alpha/2,2*(n - Y + 1),2*Y))
U <- 1 - ((n-Y) / (n - Y + (Y + 1) * qf( 1 - alpha/2,2*(Y+1),2*(n - Y))))
return(c(L,U))}
A function to give the large sample confidence interval could be written as:
binom.lsconf<- function(Y,n,alpha){

L <- (Y/n) - qnorm(1-alpha/2)*sqrt(Y*(n-Y)/n^3)
U <- (Y/n) + qnorm(1-alpha/2)*sqrt(Y*(n-Y)/n^3)
return(c(L,U))}
Simply use the observed value for Y, and the sample size for n. You can try to replicate the results in example 3 on page 130 to make sure that you're using the functions correctly.
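For instance, with hypothetical values Y=7, n=20, and alpha=.05 (replace these with the values from the problem), the two intervals would be:

binom.conf(7,20,.05)
binom.lsconf(7,20,.05)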

 
Problem C: Just as in SAS, the first step is to enter the data. For example, if I were entering the data for problem 3 on page 93, I might type:

pg93data<-c(12.6,13.0,12.1,11.8,12.1)
For a longer data set, you might wish to type the command out in a word processor first, and cut and paste it into the S-Plus window. Similarly, you could use vi and scan. We saw in class how to get the Q-Q plot using qqnorm and qqline, and the instructions for printing out graphics from S-Plus are given below. The command for conducting a t-test in S-Plus is simply t.test. The command for testing mu=12 vs. mu>12 for the above data set would be:
t.test(pg93data, alternative="greater", mu=12)


Homework 4 Notes

1) Homework 3 involved performing the t-test and the q-q plot. The t.test command will also construct the confidence interval. Instead of mu=12 and alternative="greater", use conf.level=.95 and alternative="two.sided", for example, to get a 95% confidence interval. The confidence interval won't be right if you use "less" or "greater"!
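For example, a 95% confidence interval for the mean of the pg93data values entered in the Homework 3 notes would come from:

t.test(pg93data, alternative="two.sided", conf.level=.95)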

The following function, once entered, will perform the sign test on one data set:

sign.test<-function(dset,med=0,alternative="two.sided"){

T1<-length(dset[dset - med > 1e-7])
n<-T1+length(dset[dset - med < -1*1e-7])
if (alternative=="less"){
pval<-pbinom(T1,n,.5)}
else if (alternative=="greater"){
pval<-1-pbinom(T1-1,n,.5)}
else if(alternative=="two.sided"){
pval<-2*min(1-pbinom(T1-1,n,.5),pbinom(T1,n,.5))
pval<-min(pval,1)}
note<-'May have errors if values have accuracy greater than 2e-07.'
return(note,T1,n,med,alternative,pval)}
dset is the vector of data, med is the value given by the null hypothesis, and alternative is the alternative hypothesis. Reading through the code, it should be pretty clear that this function does exactly what we would do if doing the test by hand. length(dset[dset-med > 1e-7]) simply counts the number of observations in dset that are greater than the null hypothesis value med. The 1e-7 is to deal with the problem of the extreme digits of the numbers not being stored as exactly zero by the computer. pbinom returns the same value that you would get using the table in the back of the text. The equals signs in the first line tell us that the default value of med is 0 and the default value of alternative is "two.sided". Finally, the line pval<-min(pval,1) is for the case where n is even and the standard method of calculating a two-sided p-value can sometimes result in a value greater than one.

To run this for example 1 on page 161, we would notice that the data set is a little different than what we might think at first. It consists of 8 +'s, 1 -, and 1 tie. We could run this by using a +1 for the plusses, a -1 for the minuses, and a 0 for the ties, with the null hypothesis value being 0. The rep (repeat) command will make this a little easier.

examp1<-c(rep(1,8),-1,0)


sign.test(examp1,alternative="greater")
Similarly, we can make a function to construct a confidence interval for the median. The confidence interval is constructed simply by finding all of the values for which the sign test would not reject the null hypothesis. One important function needed for constructing the confidence interval is sort, which simply puts the data in order. A second is qbinom(alpha,n,p), which gives the "y" value corresponding to the smallest probability for that binomial that is greater than or equal to alpha. Note that if qbinom returns 0, orderstat[y] below would not be a valid subscript; the line y<-max(y,1) in the function guards against that case.

sign.interval<-function(dset,conf=.95){

alpha <- 1 - conf
orderstat <- sort(dset)
n <- length(dset)
y <- qbinom((alpha/2),n,0.5)
y <- max(y,1)   # guard against y=0, which would give an invalid subscript below
low<-orderstat[y]
hi<-orderstat[n-y+1]
realconf<-1-2*pbinom(y-1,n,.5)
return(low,hi,realconf)}
Note that this will give a pretty boring answer for the data examp1! Trying it with the beak data from class may be more informative.
x<-c(5.8,13.5,26.1,7.4,7.6,23.0,10.7,9.1,19.3,26.3,17.5,17.9,18.3,14.2,

55.2,15.4,30.0,21.3,26.8,8.1,24.3,21.3,18.2,22.5,31.1)
y<-c(5,21,73,25,3,77,59,13,36,46,9,25,59,38,70,36,55,46,25,30,29,46,71,
31,33)
z<-c(y-x)

sign.interval(z,.95)

2) In this particular case, because the tests are fairly straightforward and the original data is from a normal population, it seems like it should be fairly easy to calculate the exact power. If the distribution the data comes from is exactly normal (which it is here) with a given mean and standard deviation, then the sample mean is also normal and has the same mean and 1/sqrt(n) times the original standard deviation. (See Theorem 1 on page 84 for example.) For the t-test this means we only have to find the rejection region, and then find the probability that the normal distribution (for the sample mean) under the alternate hypothesis falls into it. Unfortunately, the rejection region depends on the sample standard deviation... and we don't have the sample standard deviation. The method to calculate the power exactly for the t-test depends on what is called the non-central t-distribution. We can fudge this a bit by using the population standard deviation in place of the sample standard deviation. Because of this, the power we will calculate is actually a biased estimate. The power we estimate from the simulation study will be unbiased. The sign test does not have this difficulty. Once we know the probability that each observed xi is greater than the hypothesized median, it is just a standard binomial problem.

The following code is for the power example we saw in class. The test is of mean/median=2400 vs. mean/median > 2400, where the population is normal with mean=2425 and standard deviation=200, the sample size=50, and alpha=.05. The code calculates the biased estimate of the power for the t-test, and the exact power for the sign test. (The value for the t-test will disagree slightly with the value we got in the example, as we are using the t-distribution instead of the z. After running, type t.rejection.region to see the value we used here. The value before was 2,446.5.) It should be noted that it could all have been put in just one line, instead of defining the variables separately. This way makes what is going on a bit more intuitive, though.

null.mean <- 2400

alt.mean <-2425
sd<-200
n<-50
alt.xbar.sd <- sd/sqrt(n)
alpha<-0.05
df<-n-1

t.rejection.region <- null.mean + qt(1-alpha,df)*sd/sqrt(n)

powert<-1-pnorm(t.rejection.region,mean=alt.mean,sd=alt.xbar.sd)
It's a bit trickier to see exactly what is going on to calculate the power for the sign test. The first line calculates the rejection region for performing the sign test at the correct alpha level: the 1-alpha there is because we are looking at the upper tail, and the last +1 because we want the value at which the hypothesis will be rejected. (This is the opposite of the "greater" than line in the sign.test program above.) The second line calculates the probability that an xi generated under the alternate hypothesis will be greater than the median (= mean) from the null hypothesis. The final line then asks what the probability is that we will have enough of these xi that are larger than the median under the null hypothesis so that we can reject the null hypothesis.
sign.rejection.region <- qbinom(1-alpha,n,.5)+1


prob.greater <- 1-pnorm(null.mean,mean=alt.mean,sd=sd)

powersign <- 1-pbinom(sign.rejection.region-1,n,prob.greater)
In contrast to the above code, the code to run the simulation study is very straightforward. To compare the powers we need to repeatedly generate a data set and see whether either or both of the tests reject the null hypothesis. At the end we simply take the percentage of times each test rejected, and that is our estimate of the power for each. How accurate are these estimates? Well, they are just percentages, and we have the formulas for finding confidence intervals for percentages. The following code would do the simulation with 100 different data sets. To see what the function rnorm does, simply use help(rnorm).

ntimes<-100

simpowert<-0
simpowersign<-0
for (i in 1:ntimes){
x<-rnorm(50,mean=2425,sd=200)
simpowert<-simpowert+(t.test(x,alternative="greater",mu=2400)$p.value<=0.05)
simpowersign<-simpowersign+((sign.test(x,med=2400,alternative="greater")$pval)<=0.05)
}
simpowert <- simpowert/ntimes
simpowersign <- simpowersign/ntimes
simpowert
simpowersign
It may take a minute or two to run (hit return once to make sure it is all in). The values of simpowert and simpowersign should be near powert and powersign, but will change every time you run the simulation study. The values can change a _lot_ though. For the homework, run the simulation three times, reporting the three values for each test.


Homework 5 Notes

1) The function t.test will not only do the one sample t-test, but also the paired t-test. We saw on homework four how to conduct the sign test, and there is a built-in function for the signed rank test, wilcox.test. You can use help(t.test) or help(wilcox.test) to get the details on these two functions. (Note that they both use x-y and not y-x.) The code below runs all three of these tests for the data in example 1 on page 355.

first<-c(86,71,77,68,91,72,77,91,70,71,88,87)

second<-c(88,77,76,64,96,72,65,90,65,80,81,72)

t.test(second,first,alternative="less",mu=0,paired=T)

sign.test(second-first,alternative="less")

wilcox.test(second,first,alternative="less",mu=0,paired=T,exact=T,correct=F)
The setting exact=T tells it to do the exact test (the table) instead of the large sample approximation. If there are ties or zero values it will use a rather complicated formula to do a large sample approximation. (It gives a reference in the help listing if you ever need to look it up.) While this value will not agree exactly with the one using other formulas, it should be close.

2) To construct the Hodges-Lehmann confidence interval, we simply tell S-Plus to do exactly what we would do by hand. The only complication is that the S-Plus function that duplicates Table A12 only goes in one direction (give it the value and it returns the probability p). So a function is needed to "get the table value" correctly. The function below simply starts with 0 and goes up to the maximum value of the statistic, and sees which one corresponds most closely to p. You can try it with a few values and it should return the values in Table A12. Note that this function even works if p>0.5.

qsignrank<-function(p,n){

if (p<0.5){
val<-0
for (i in 0:(n*(n+1)/4)){
if (psignrank(i,n)<=p){
val<-i
}}
p1<-psignrank(val,n)
p2<-psignrank(val+1,n)
val<-ceiling(val+max(0,(p-p1)/(p2-p1)))}
else if (p==0.5){val<-(n*(n+1)/4)}
else {val<-(n*(n+1)/2)-qsignrank((1-p),n)}
return(val)
}

The following function will construct the Hodges-Lehmann CI for the signed rank test as described on pages 360-362. It is very similar to the one we discussed in class that corresponded to the rank sum test. The line theavgs<-matrix... simply tells S-Plus that we are going to want theavgs to be a matrix of the correct size. The command lower.tri(theavgs)==F simply tells S-Plus that we only want to use the values in the upper part of the matrix, like on page 361.
HL.interval<-function(x,y,conflevel){

D <- sort(y-x)
alpha <- 1-conflevel
theavgs <- matrix(0,nrow=length(D),ncol=length(D))
wL<- qsignrank(alpha/2,length(D))
wU<- length(D)*(length(D)+1)/2 - wL +1
for (i in 1:length(D)){
for (j in 1:length(D)){
theavgs[i,j] <- (D[i]+D[j])/2 }}
L<-sort(theavgs[lower.tri(theavgs)==F])[wL]
U<-sort(theavgs[lower.tri(theavgs)==F])[wU]
return(D,theavgs,L,U)}

To replicate the example on page 361, simply use:

HL.interval(first,second,conflevel=.95)

Notice from page 361 that this confidence interval is constructed so that the closed interval has probability greater than or equal to 1-alpha of containing the true median. The reason for this is due to the ambiguity at the end points that come from possible ties. For the confidence interval for the sign test, we excluded the end point and so the true confidence level may have been somewhat less than 1-alpha.

3) The setup for this simulation study will be very similar to the simulation study in problem 2 on homework 4. This is because the alpha level is a special case of the power function. We simply need to get the percentage of times the null hypothesis is rejected where the null hypothesis is true. Besides the changes for n, alpha, and the number of simulation runs, we also need to change how the data is generated, and use the correct tests. To generate the data, simply replace:

x<-rnorm(50,mean=2425,sd=200)

with

x<-runif(20,0,1)

y<-runif(20,0,1)

The commands for the paired t-test and signed rank test are the same as used in problem 1. Just remember to give them all the correct options.
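Putting those pieces together, one possible sketch of the simulation is given below. It assumes alpha=.05, n=20, and 100 simulation runs, and simalphat and simalphasign are just names made up here for the two rejection counts; use whatever values and number of runs the problem actually specifies.

ntimes<-100
simalphat<-0
simalphasign<-0
for (i in 1:ntimes){
x<-runif(20,0,1)
y<-runif(20,0,1)
simalphat<-simalphat+(t.test(y,x,alternative="two.sided",mu=0,paired=T)$p.value<=0.05)
simalphasign<-simalphasign+(wilcox.test(y,x,alternative="two.sided",mu=0,paired=T,exact=T,correct=F)$p.value<=0.05)
}
simalphat/ntimes
simalphasign/ntimes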

4) The trickiest part in using the functions for ANOVA and its nonparametric counterparts is entering the data correctly. To enter the data from example 1 on page 291 and run the ANOVA and Kruskal-Wallis test we would use the following code.


meth<-factor(c(rep("1",9),rep("2",10),rep("3",7),rep("4",8)))


obs<-c(83,91,94,89,89,96,91,92,90,91,90,81,83,84,83,88,91,89,84,
101,100,91,93,96,95,94,78,82,81,77,79,81,80,81)

summary(aov(obs~meth))

kruskal.test(obs,meth)
Try typing in meth to see what the first line of the above code is doing. Again, as S-Plus uses a slightly different method for dealing with ties, the value for the Kruskal-Wallis test may differ slightly from what you get by doing it by hand.

To check the assumptions of the ANOVA we need to get a residual versus predicted plot and q-q plot of the residuals, just like we would for a regression. The residual versus predicted plot can also be used to look at the assumptions of the Kruskal-Wallis test. To do this, we need to save all of the output from running the ANOVA.

aovout<-aov(obs~meth)
We could then enter attributes(aovout) to see what values we stored. The following code would set the output screen up to put three graphs up at once, and give a plot of the original data, a q-q plot and the residual versus predicted plot.
par(mfrow=c(3,1))

plot(meth,obs)
qqnorm(aovout$residuals)
qqline(aovout$residuals)
plot(aovout$fitted.values,aovout$residuals)
To plot one graph at a time, first type par(mfrow=c(1,1)). This command sets the graphics window up as a matrix with the specified number of rows and columns.


Homework 6 Notes

1) The following data is from Leger, Cochrane, and Moore (1979). Consumption of alcohol is measured in liters per person per year. The number of heart disease deaths is per 1,000 men aged 55 to 64.

Country         Consumption   Mortality
Norway              2.8          6.2
Scotland            3.2          9.0
England             3.2          7.1
Ireland             3.4          6.8
Finland             4.3         10.2
Canada              4.9          7.8
United States       5.1          9.3
Netherlands         5.2          5.9
New Zealand         5.9          8.9
Denmark             5.9          5.5
Sweden              6.6          7.1
Australia           8.3          9.1
Belgium            12.6          5.1
Germany            15.1          4.7
Austria            25.1          4.7
Switzerland        33.1          3.1
Italy              75.9          3.2
France             75.9          2.1

a) As a side-note, the command that enables you to calculate the various correlation coefficients is cor.test. It will also perform the test of hypotheses to check if the correlation is significantly different from zero.
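For example, after entering the two columns of the table above as vectors, Spearman's rank correlation could be tested with something along the lines of the code below (the method argument is what selects the coefficient; check help(cor.test) for the exact option names in your version):

consumption<-c(2.8,3.2,3.2,3.4,4.3,4.9,5.1,5.2,5.9,5.9,6.6,8.3,12.6,15.1,25.1,33.1,75.9,75.9)
mortality<-c(6.2,9.0,7.1,6.8,10.2,7.8,9.3,5.9,8.9,5.5,7.1,9.1,5.1,4.7,4.7,3.1,3.2,2.1)
cor.test(consumption,mortality,alternative="two.sided",method="spearman")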

c) The following code will perform simple linear regression (using the standard least squares) for the data in example 1 on page 336.

gmat<-c(710,610,640,580,545,560,610,530,560,540,570,560)

gpa<-c(4.0,4.0,3.9,3.8,3.7,3.6,3.5,3.5,3.5,3.3,3.2,3.2)
gpa.reg<-lm(gpa~gmat)

gpa.reg
summary(gpa.reg)
attributes(gpa.reg)
The plot in Figure 1 can be obtained by simply using
plot(gmat,gpa,xlim=c(500,750),ylim=c(3.0,4.25))
lines(sort(gmat),gpa.reg$fitted.values[order(gmat)])
The command sort simply sorts the vector from smallest to largest. The command order gives the order that the vector is in, thus gpa.reg$fitted.values[order(gmat)] returns the predicted values in the order from the smallest to largest gmat score.

Various other plots can be constructed as well. Of particular note plot(gpa.reg) will construct 6 different plots. Unfortunately it won't wait long enough for you to see them all on the screen. Typing the command par(mfrow=c(3,2)) first will lay out the graphics window in a 3 by 2 matrix, and let you see all of the graphs. Simply use par(mfrow=c(1,1)) to return to one graph at a time.
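For example:

par(mfrow=c(3,2))
plot(gpa.reg)
par(mfrow=c(1,1))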

d) The following commands will perform the nonparametric linear regression described on pages 335-336; you can use it to verify the results given at the bottom of page 338 and the middle of page 339.

nplinreg<-function(y,x){

n<-length(x)
maxN<-n*(n-1)/2
S<-rep(0,maxN)
N<-0
for (j in 2:n){
for (i in 1:(j-1)){
if (x[i]!=x[j]){
N<-N+1
S[N]<-(y[j]-y[i])/(x[j]-x[i])}
}}
S<-sort(S[1:N])
b1<-median(S)
a1<-median(y)-b1*median(x)
fitted.values<-(a1+b1*x)
residuals<-y-fitted.values
return(a1,b1,fitted.values,residuals)}
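To check the function against the book's numbers it can be run on the gmat and gpa vectors entered above (gpa.npreg is just a name chosen here for the output object); the intercept a1, slope b1, fitted values, and residuals come back as components of the returned list:

gpa.npreg<-nplinreg(gpa,gmat)
gpa.npreg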

e) The function in S-Plus that performs kernel smoothing is ksmooth. Other functions that perform smoothing include loess, supsmu, and smooth.spline.

gpa.ks1<-ksmooth(gmat,gpa,"normal",ban=.5)

gpa.ks2<-ksmooth(gmat,gpa,"normal",ban=75)
gpa.ks3<-ksmooth(gmat,gpa,"normal",ban=500)
will use kernel smoothing to estimate the curve for predicting gpa from gmat using the normal kernel and bandwidths of 0.5, 75, and 500. It could be plotted using:
plot(gmat,gpa)

lines(gpa.ks1$x,gpa.ks1$y)
lines(gpa.ks2$x,gpa.ks2$y,lty=2)
lines(gpa.ks3$x,gpa.ks3$y,lty=1,lwd=3)

 
2) The function to perform the Kolmogorov-Smirnov test is simply ks.gof. You also need to tell it which distribution you are testing. help(ks.gof) will give you the list of the ones that will work. The following code would analyze the data in example 1 on page 433.

vals<-c(0.621,0.503,0.203,0.477,0.710,0.581,0.329,0.480,0.554,0.382)

ks.gof(vals,distribution="uniform",min=0,max=1)
We know that uniform is a valid distribution name to use because the help file for ks.gof says so. We know that we need to provide the min and max by looking at help(runif). The parameters needed for a normal can be found under help(rnorm).

The simulation to estimate the power curve can be performed just like the previous simulations, using a separate run for each sd. To plot the estimated power curve you just need to make two vectors, one for the x-axis and one for the y-axis. Say we call the x-axis vector mu and the y-axis vector estimated.power. The command to plot the curve would be:

plot(sort(mu),estimated.power[order(mu)],ylim=c(0,1),type='l')
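As a rough sketch of how those two vectors might be filled in, the loop below assumes, purely for illustration, that the data are generated as normal with varying mean mu and sample size 20, that the null distribution tested is the standard normal, and that ks.gof returns a p.value component like the other test functions; substitute whatever distributions, sample size, and parameter values the problem actually specifies.

mu<-c(0,0.25,0.5,0.75,1)
estimated.power<-rep(0,length(mu))
ntimes<-100
for (j in 1:length(mu)){
for (i in 1:ntimes){
x<-rnorm(20,mean=mu[j],sd=1)
estimated.power[j]<-estimated.power[j]+(ks.gof(x,distribution="normal",mean=0,sd=1)$p.value<=0.05)
}}
estimated.power<-estimated.power/ntimes

The plot command above would then draw the estimated power curve.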


Homework 7 Notes

Fisher's Exact Test and the Mantel-Haenszel Test: The function to conduct Fisher's exact test is fisher.test and the command to conduct the Mantel-Haenszel test is mantelhaen.test.

The following code would analyze the data in example 3 and example 4 in section 4.1. In example 3 the data is entered as a matrix, reading the elements in one column at a time. The order of the data is thus: (row 1, column 1), (row 2, column 1), (row 1, column 2), (row 2, column 2). In example 4 the data is entered as an array (like a matrix but with more than two dimensions). In this case the data is again entered one column at a time. Note that the p-value in this case is two-sided, and the test statistic is the continuity corrected chi-squared one and not the normal one (1.0114 = 1.0057^2).

examp13<-matrix(c(1,3,9,1),nrow=2,ncol=2)

examp13
fisher.test(examp13)
examp14<-array(c(10,12,1,1,9,11,0,1,8,7,0,3),dim=c(2,2,3))
examp14
mantelhaen.test(examp14)

Chi-squared Goodness of Fit Test: The function for performing the Chi-squared goodness of fit test is chisq.gof. You need to tell it what distribution you want to use. You also need to tell it which groups you are going to put the data in, unless you want to use the automatic setting, and also how many parameters you are having estimated. help(chisq.gof) provides many of the details.

The following code first generates 50 random variables that are normally distributed with mean 2 and standard deviation 1. It then does the chisq.gof test that the data is from a normal distribution with mean and variance estimated from the data. Because we are estimating two parameters in this example we need to set n.param.est=2. If you are not estimating any parameters you can leave this part of the command out. The cells chosen in this example are eight equally spaced groups from -4 to 4, and then one group with all the values less than -4, and one with all the values greater than 4. The borders of these 10 cells are set using the vector cut.points.

x<-rnorm(50,mean=2,sd=1)


chisq.gof(x,cut.points=c(-Inf,-4,-3,-2,-1,0,1,2,3,
4,Inf),distribution="normal",mean=mean(x),
sd=sqrt(var(x)),n.param.est=2)
This function seems to give the sample size error for expected values of 5 as well as for those less than 5. To see the expected values we could use:

output<-chisq.gof(x,cut.points=c(-Inf,-4,-3,-2,-1,0,1,2,3,
4,Inf),distribution="normal",mean=mean(x),
sd=sqrt(var(x)),n.param.est=2)

output$expected
These values are hard to see because of the scientific notation, so round(output$expected*1000)/1000 might be easier to read. The observed counts can be obtained by using $counts instead.

It is a little trickier to use some of the discrete distributions because some of them are not included in the list in help(chisq.gof). Say there are four possible outcomes (1, 2, 3, 4) and we wish to test that each has probability 1/4 of occurring. The observed counts in the four cells are 43, 56, 59, and 47, respectively. This could be done using cut.points=c(0.5,1.5,2.5,3.5,4.5) and distribution="uniform",min=0.5,max=4.5. The code for this problem would thus be:


x<-c(rep(1,43),rep(2,56),rep(3,59),rep(4,47))
chisq.gof(x,cut.points=c(0.5,1.5,2.5,3.5,4.5),
distribution="uniform",min=0.5,max=4.5)


Other Nonparametric Methods

MWW Rank Sum Test: The Mann-Whitney-Wilcoxon Rank Sum Test can be performed using wilcox.test. Make sure to set paired=F or it will do the Wilcoxon Signed Rank Test instead.
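For example, with two small made-up samples:

x<-c(14,12,17,11,16)
y<-c(18,21,15,20,19,22)
wilcox.test(x,y,alternative="two.sided",paired=F,exact=T,correct=F)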

Friedman's Test and Two-way ANOVA: The following code will analyze the data in Example 1 on page 371-372.

homeowner<-factor(c(rep(1:12,4)))

grass<-factor(c(rep(1,12),rep(2,12),rep(3,12),rep(4,12)))
obs<-c(4,4,3,3,4,2,1,2,3.5,4,4,3.5,3,2,1.5,1,2,2,3,4,1,1,2,1,
2,3,1.5,2,1,2,2,1,2,3,3,2,1,1,4,4,3,4,4,3,3.5,2,1,3.5)

friedman.test(obs,grass,homeowner)
summary(aov(obs~grass+homeowner))
Note that for Friedman's test, the first variable must be the observations, the second must be the treatments, and the third must be the blocks. For the ANOVA you need to look at the line that is for the treatment variable to find the correct p-value.


Notes on Copying, Pasting, and Printing Graphics:

In general, you will want to type your programs in MS Word or notepad, and then copy and paste them into the X-window. To copy from the X-window, simply highlight the text you want with the left mouse button. To paste into the X-window, simply hit both mouse buttons simultaneously.

To print a graphic, you need to point it at the right printer. Otherwise the default is the printer in the room where the Sun machine actually sits (probably 310 LeConte). To print to the printer LEX-3 in 303A, you would need to go to the "Options" menu and choose "Printing..." In the command box, make it say lp -dlex3. Now simply click the "Print" button, and note that it should say at the bottom of the graphics window that it has printed. Simply hit close to make that window go away. Note that you can't view or print the graphics if you are using telnet instead of X-windows.