Stat 518 - Fall 1999 - S-Plus Templates

Class Notes from 9/7/99

Homework Three Notes
Homework Four Notes
Homework Five Notes
Homework Six Notes
Homework Seven Notes
Homework Eight Notes

Project Notes
Note on Printing Graphs and Functions
H-L interval for MWW rank sum test


The Basics of S-plus:

Unlike SAS, where you have to hit a key to run the program, S-Plus runs whatever you type immediately. It is therefore important to keep a copy of the code you will be running in some other file, generally a word processor document when working from a PC.

Another difference between S-Plus and SAS is that S-Plus is run from the various Sun Sparcs in the department of mathematics and statistics. To use them from a PC you must therefore log in remotely using the program X-Win32. When you start up X-Win32, nothing will happen at first except for a bar appearing at the bottom of the screen. Selecting one of the sessions there calls up an X-window in which you answer the prompts for your login name and password. These will generally NOT be the same as the login and password for the PC.

Once you are logged in, simply type Splus at the prompt to continue.

At the heart of S-Plus are the various objects that you enter. At any time you can type the function objects() to see which ones you have entered previously; presumably there aren't any right now. S-Plus also has many built-in objects, most of which are functions.

The following code enters a variable called ex3 made up of the data points 0, 2, 1, 2, 3, 4 from examples 3 and 4 on page 82. It then calculates the mean and the variance and draws the empirical distribution function. It might be best to enter the code one group at a time so you can see what's happening at each stage. To paste into S-Plus, simply click with the left mouse button where you want to paste, and then click with both buttons simultaneously.


ex3<-c(0,2,1,2,3,4)


mean(ex3)

var(ex3)

motif()

edf<-function(y){
x<-sort(y)
plot(c(min(x)-1,max(x)+1),c(0,1),type="n",xlab="x",ylab="p")
lines(c(x[1]-1,x[1]),c(0,0),lty=1)
lines(c(x[1],x[1]),c(0,1/length(x)),lty=2)
for (i in 1:(length(x)-1)){
lines(c(x[i],x[i+1]),c(i/length(x),i/length(x)),lty=1)
lines(c(x[i+1],x[i+1]),c(i/length(x),(i+1)/length(x)),lty=2)}
lines(c(x[length(x)],x[length(x)]+1),c(1,1),lty=1)
}
edf(ex3)
Note that the var function returns the variance where the denominator is n-1; example 4 on page 83, however, uses n. Typing help(var) will tell you how to get it to produce that result.

Say we wished to enter the data set we used last time into S-Plus. We would use the line:

satdata<-scan("",what=list(state="",verbal=0,math=0,pct=0))

S-Plus then comes back with a prompt to enter the data. We can now just cut and paste the block into it. Because UNIX tends to be unhappy if you try to cut and paste too much at one time, it is probably better to do it in two chunks. Paste the first chunk in by clicking with both mouse buttons, make sure there is a prompt showing, and then paste the second half in and hit return until it gets out of scan mode. Now, typing satdata will show you that the data set is in there.

The SAT data presented in the August 31st, 1999 New York Times:


Ala. 561 555 9
Alaska 516 514 50
Ariz. 524 525 34
Ark. 563 556 6
Calif. 497 514 49
Colo. 536 540 32
Conn. 510 509 80
Dela. 503 497 67
D.C. 494 478 77
Fla. 499 498 53
Ga. 487 482 63
Hawaii 482 513 52
Idaho 542 540 16
Ill. 569 585 12
Ind. 496 498 60
Iowa 594 598 5
Kan. 578 576 9
Ky. 547 547 12
La. 561 558 8
Maine 507 507 68
Md. 507 511 65
Mass. 511 511 78
Mich. 557 565 11
Minn. 586 598 9
Miss. 563 548 4
Mo. 572 572 8
Mont. 545 546 21
Neb. 568 571 8
Nev. 512 517 34
N.H. 520 518 72
N.J. 498 510 80
N.M. 549 542 12
N.Y. 495 502 76
N.C. 493 493 61
N.D. 594 605 5
Ohio 534 538 25
Okla. 576 560 8
Ore. 525 525 53
Pa. 498 495 70
R.I. 504 499 70
S.C. 479 475 61
S.D. 585 588 4
Tenn. 559 553 13
Texas 494 499 50
Utah 570 565 5
Vt. 514 506 70
Va. 508 499 65
Wash. 525 526 52
W.Va. 527 512 18
Wis. 584 595 7
Wyo. 546 551 10

We have now made a "data frame" where each of the four columns goes with a different variable. We can refer to the four columns using satdata$state, satdata$verbal, satdata$math, or satdata$pct. The following code will give us many of the results we got when looking at this data set in SAS. We'll begin by giving all of the variables shorter names.


v<-satdata$verbal

m<-satdata$math
p<-satdata$pct

summary(p)

hist(p)

qqnorm(p)
qqline(p)
Unfortunately S-Plus removed the interactive help menus, but you can still find out more about, for example, changing the histogram by typing help(hist). The code below performs a linear regression to predict v from m. It is often best to put the regression output into an object rather than simply running it, so that you can work with pieces of the output afterwards.



vmregress<-lm(v~m)

vmregress

attributes(vmregress)

vmregress$residuals

qqnorm(vmregress$residuals)
qqline(vmregress$residuals)
Finally, this is the code that would be used for making the confidence interval for the variance of the verbal scores. It uses the fact that (n-1)*s^2/sigma^2 has a chi-squared distribution with n-1 degrees of freedom, so the endpoints are (n-1)*s^2 divided by the upper and lower chi-squared quantiles.

alpha<-.05

df<-length(v)-1
clow <- df*var(v)/qchisq(1-alpha/2,df)
chi <- df*var(v)/qchisq(alpha/2,df)
c(clow,chi)
Some other commands include: rnorm to generate random numbers, try x<-rnorm(1000,0,1); pnorm to return the cumulative distribution function for the normal, try pnorm(2,0,1) or pnorm(2,2,1); or try help to see what any of t.test, var.test, and cor.test do.

Finally, to quit S-Plus, type q(); then type exit to close X-Win32, and don't forget to log off the machine using CTRL+ALT+DEL.


Notes on Printing Graphics and Functions:

To print a graphic, you need to send it to the right printer; otherwise the default is the printer in the room where the Sun machine actually sits (probably 310 LeConte). To print to the printer LEX-3 in 303A, go to the "Options" menu and choose "Printing...". In the command box, make it say lp -dlex3 and then click the "Print" button; it should say at the bottom of the graphics window that it has printed. Simply hit close to make that window go away. Note that you can't view or print the graphics if you are using telnet instead of X-windows.

Whenever a piece of code involves <-function, simply entering that batch of code will not generate any answers! All it does is tell S-Plus that you want to be able to run that function again and again without having to enter it all each time. To get particular output, you have to call the function with the appropriate values. Similarly, the simulation code doesn't give any output; it makes the variables containing the desired numbers, and you have to type those names to see what the values are.
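For example, after the edf function near the top of this page has been entered, typing just its name echoes the definition back, while calling it with data is what actually produces the plot:

edf
edf(ex3)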


Notes on Homework Three:

Page 133, #2: The command for the exact test to see if p is a certain value for data from a binomial distribution is:


binom.test(x, n, p=0.5, alternative="two.sided")

Instead of x use the number of 'successes', and instead of n use the number of 'trials'. The value 0.5 is the default for the null hypothesis; simply replace it with the appropriate value. Likewise, "two.sided" is the default for the alternative hypothesis; you may replace it with either "less" or "greater". (If it were desired to use the large sample approximation, prop.test would be used instead of binom.test.)
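For example, if there had been 3 'successes' in 20 'trials' (made-up numbers, just to show the form of the call) and the alternative hypothesis were p < 0.5, the command would be:

binom.test(3, 20, p=0.5, alternative="less")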

Page 133, #4: S-Plus does not have a built in command for finding the confidence intervals given by table A4. The book "Nonparametric Statistical Methods" by Hollander and Wolfe gives the formula for finding them though.

L = Y / (Y+ (n - Y + 1) * f [ 1 - alpha/2, 2(n - Y + 1), 2 Y ] )

and
U = 1 - (n-Y) / (n - Y + (Y + 1) * f [ 1 - alpha/2, 2(Y+1), 2(n - Y)])

where f[a, df1, df2] is the a-th percentile of the F distribution with the specified degrees of freedom. A function to generate this confidence interval could be written as follows.


binom.conf <- function(Y,n,alpha){

L <- Y / (Y+ (n - Y + 1) * qf(1 - alpha/2,2*(n - Y + 1),2*Y))
U <- 1 - ((n-Y) / (n - Y + (Y + 1) * qf( 1 - alpha/2,2*(Y+1),2*(n - Y))))
return(c(L,U))}
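For example, once the function has been entered, binom.conf(7, 20, 0.05) would give the 95% interval for 7 successes out of 20 trials (hypothetical counts, just to show how the function is called).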
If it was desired to use the large sample approximation, prop.test gives that as part of the output, but make sure to use "two.sided".

Page 164, #2: The sign test is simply the binomial test where D = # of times X > Y, n = # of non-ties, and p = 0.5. If you are given the full data set rather than just the counts, the command to get the # of times X > Y is sum(X > Y), and the command to find the number of non-ties is length(X) - sum(X==Y).
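For instance, with two paired samples entered as vectors (the values below are made up, just to illustrate the pattern), the whole test is:

X<-c(5.1,4.8,6.2,5.5,5.0)
Y<-c(4.9,5.0,5.8,5.7,5.0)
D<-sum(X>Y)
n<-length(X)-sum(X==Y)
binom.test(D,n,p=0.5,alternative="two.sided")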


Notes on Homework Four:

1) S-Plus has the built-in function t.test to perform the t-test (alternative is either "greater", "less", or "two.sided").


x<-c(-30,-25,-20,-15,0,1,1,1,2,2,2,3)

t.test(x,alternative="less",mu=0)
To get the confidence interval for the mean, you also use the function t.test, but instead of mu=0 and alternative="less", use conf.level=.95 for a 95% confidence interval, for example. The confidence interval won't be right if you use "less" or "greater"! To get the qqplot, simply use qqnorm(x) followed by qqline(x). Note how bad this looks for the "diet" example from class!
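For the diet data entered above, for example, t.test(x,conf.level=.95) gives the two-sided 95% interval for the mean.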

The following function, once entered, will perform the quantile test:


quantile.test<-function(dset,pstar,xstar,alternative="two.sided"){

T1<-sum(dset<=xstar)
T2<-sum(dset< xstar)
n<-length(dset)
if (alternative=="less"){
pval<-1-pbinom(T2-1,n,pstar)}
else if (alternative=="greater"){
pval<-pbinom(T1,n,pstar)}
else if(alternative=="two.sided"){
pval<-2*min(1-pbinom(T2-1,n,pstar),pbinom(T1,n,pstar))}
return(T1,T2,pval)}
dset is the list of the data, pstar is the percentile the test is for (so the median would be 0.5), xstar is the value that the null hypothesis gives, and alternative gives the alternative hypothesis (e.g. median < 10 would be alternative="less"). Reading through the code, it should be pretty clear that this function does exactly what we would do if doing the test by hand. sum(dset<=xstar) simply counts up the number of observations in dset that are < (less than) or = (equal) to xstar. pbinom returns the same value that you would get using the table in the back of the text. So, to run it for the diet example, we would simply use quantile.test(x,0.5,0,alternative="less").

Similarly, we can make a function to construct a confidence interval for the various quantiles. One important function in doing this is sort, which simply puts the data in order. A second is qbinom(alpha,n,p), which gives the "y" value whose cumulative probability is the smallest one larger than alpha. Note that this is slightly different from what you would get by taking the _closest_ one: taking the closest may give a confidence level less than the 1-alpha you set, while this always gives a greater one.


quantile.interval<-function(dset,pstar,conf){   

alpha <- 1 - conf
orderstat <- sort(dset)
n <- length(dset)
r <- (qbinom((alpha/2),n,pstar)-1) + 1
s <- (qbinom((1-alpha/2),n,pstar)-1) +1
low<-orderstat[r]
hi<-orderstat[s]
realconf<-pbinom(s-1,n,pstar)-pbinom(r-1,n,pstar)
return(r,s,low,hi,realconf)}
This uses formulas (41), (43), and (45) on page 147 to calculate the actual confidence level of the interval. Note that for discrete distributions the confidence statement is at least that value.
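For the diet data, for example, quantile.interval(x,0.5,0.95) returns r, s, the two order statistics that form the interval for the median, and the actual confidence level.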

3) To compare the powers we need to repeatedly generate a data set and see whether each of the two tests rejects the null hypothesis or not. At the end we simply take the percentage of times each test rejected, and that's our estimate of the power for each. How accurate are these estimates? They are just percentages, and we have the formulas for finding confidence intervals for percentages. The following code would do the simulation with 100 different data sets, each data set having 15 observations, and the power is calculated using an alpha level of 0.10. To see what the function rnorm does, simply use help(rnorm).


powert<-0

powermed<-0
for (i in 1:100){
x<-rnorm(15,0.25,1)
powert<-powert+(t.test(x,alternative="greater",mu=0)$p.value<=0.10)
powermed<-powermed+((quantile.test(x,0.5,0,alternative="greater")$pval)<=0.10)
}
It may take a few minutes to run. The above example should give a powert of around 35 (it changes every time you run it; it's a simulation!), so the estimated power for the t-test in this case is 35/100 = 0.35. For the median test it is around 20, so 0.20. The values can change a _lot_ though. Just work out the confidence intervals for the n=100 simulations.
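For example, once the loop has finished, binom.conf(powert,100,0.05) (using the binom.conf function from the homework three notes, if you have entered it) gives an exact 95% confidence interval for the true power of the t-test, and binom.conf(powermed,100,0.05) does the same for the median test.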

For the homework problem, do the simulation 400 times, with a sample size of 20, and an alpha of 0.05.


Notes on Homework Five:

1) The function t.test will not only do the one-sample t-test, but also the paired t-test. We saw on homework three how to conduct the sign test, and there is a built-in function for the signed rank test, wilcox.test. You can use help(t.test) or help(wilcox.test) to get the details on these two functions. The code to run all three of these tests for the data in example 1 on page 355 is:


x<-c(86,71,77,68,91,72,77,91,70,71,88,87)

y<-c(88,77,76,64,96,72,65,90,65,80,81,72)

t.test(y,x,alternative="less",mu=0,paired=T)

d<-sum(y>x)
n<-length(x)-sum(x==y)
binom.test(d, n, p=0.5, alternative="less")

wilcox.test(y,x,alternative="less",mu=0,paired=T,exact=T,correct=F)
The setting exact=T tells it to do the exact test (the table) instead of the normal approximation. If there are ties it will use the approximation anyway and will tell you so. The values may be slightly different from what you calculated by hand due to minor differences in how the calculations are done. Make sure you are using the correct alternative hypotheses.

2) To construct the Hodges-Lehmann confidence interval, we simply tell S-Plus to do exactly what we would do by hand. The only complication is that the S-Plus function that duplicates Table A12 only goes in one direction (give it the value and it gives the p). So a function is needed to "get the table value" correctly. The function below simply starts with 0 and goes up to the maximum value of the statistic, and sees which value corresponds most closely to p. You can try it with a few values and it should return the values in Table A12. Note that this function even works if p > 0.5.


qsignrank<-function(p,n){

if (p<0.5){
val<-0
for (i in 0:(n*(n+1)/4)){
if (psignrank(i,n)<=p){
val<-i
}}
p1<-psignrank(val,n)
p2<-psignrank(val+1,n)
val<-ceiling(val+max(0,(p-p1)/(p2-p1)))}
else if (p==0.5){val<-(n*(n+1)/4)}
else {val<-(n*(n+1)/2)-qsignrank((1-p),n)}
return(val)
}

The following function will construct the Hodges-Lehmann CI for the example 1 data, as described on page 361. The line theavgs<-matrix... simply tells S-Plus that we want theavgs to be a matrix of the correct size. The command lower.tri(theavgs)==F tells S-Plus that we only want to use the values in the upper part of the matrix, like on page 361.

HL.interval<-function(x,y,conflevel){

D <- sort(y-x)
alpha <- 1-conflevel
theavgs <- matrix(0,nrow=length(D),ncol=length(D))
wL<- qsignrank(alpha/2,length(D))
wU<- length(D)*(length(D)+1)/2 - wL +1
for (i in 1:length(D)){
for (j in 1:length(D)){
theavgs[i,j] <- (D[i]+D[j])/2 }}
L<-sort(theavgs[lower.tri(theavgs)==F])[wL]
U<-sort(theavgs[lower.tri(theavgs)==F])[wU]
return(D,theavgs,L,U)}
NOTE: the code for this function was changed on 10/19. It is OK if you used the previous code; it may just give slightly different values.

Note that it is impossible to construct a 90%, 95%, or 99% CI for this data using the Hodges-Lehmann estimator. We can however construct an 80% one. (You do not need to answer the questions associated with the data set in the book, but part c does point this out.)
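For example, with the x and y entered above, HL.interval(x,y,0.80) returns the sorted differences D, the matrix of averages theavgs, and the endpoints L and U of the 80% interval.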

3) The set-up here will be similar to the one on Homework 4. One trick we can use is that if we have two vectors x and y of length 20, then runif(20,x,y) will generate 20 uniforms, the first going from x[1] to y[1], the second from x[2] to y[2], and so on. The following code would run the simulation with n=25, alpha=0.10, and 300 simulation runs. Note that this may take five minutes or so to run.

 

alphat<-0

asign<-0
asrnk<-0
for (i in 1:300){
E<-runif(25,0,1)
X<-runif(25,10-E,10+E)
Y<-runif(25,10-E,10+E)
alphat<-alphat+(t.test(Y,X,alternative="greater",
mu=0,paired=T)$p.value<=0.10)
d<-sum(Y>X)
n<-length(X)-sum(X==Y)
asign<-asign+(binom.test(d,n,p=0.5,alternative="greater")$p.value<=0.10)
asrnk<-asrnk+(wilcox.test(Y,X,alternative="greater",
mu=0,paired=T,exact=T,correct=F)$p.value<=0.10)
}
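When the loop finishes, the estimated rejection rates (and hence the estimated levels of the three tests) can be seen by typing:

c(alphat,asign,asrnk)/300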

4) wilcox.test can also be used for the Mann-Whitney test; simply use paired=F instead of paired=T. This is also how to get the two-sample t-test using t.test.
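For two independent samples stored in vectors x and y, for example, the calls would look like wilcox.test(y,x,alternative="less",paired=F) and t.test(y,x,alternative="less",paired=F).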


Notes on Homework Six:

1) The function that performs the F test for two variances is var.test. The following code would analyze the data in example 1 on page 304.


pres<-c(10.8,11.1,10.4,10.1,11.3)

new<-c(10.8,10.5,11.0,10.9,10.8,10.7,10.8)

var.test(pres,new,alternative="greater")
It is necessary to program a special function to perform Conover's test though. The following function performs the large sample approximation.

conover.test<-function(x,y,alternative="two.sided"){

u<-abs(x-mean(x))
v<-abs(y-mean(y))
n<-length(x)
m<-length(y)
N<-n+m
R<-rank(c(u,v))
Tstat<-sum(R[1:n]^2)
Rbar2<-(1/N)*(sum(R^2))
R4<-sum(R^4)
T1<-(Tstat-n*Rbar2)/sqrt(((n*m*R4)/(N*(N-1)))-((n*m*(Rbar2^2))/(N-1)))
if (alternative=="less"){
pval<-pnorm(T1)}
else if (alternative=="greater"){
pval<-1-pnorm(T1)}
else{
pval<-2*min(1-pnorm(T1),pnorm(T1))}
return(T1,Tstat,pval)}
To perform the test you would simply need to run conover.test(pres,new,alternative="greater")

2) The function cor.test both calculates the statistic and performs the test for both Spearman's Rho and Kendall's Tau. Using help(cor.test) explains how to use it. Note that it may give slightly different values from the book due to the way it deals with ties.
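For example, cor.test(v,m,method="spearman") or cor.test(v,m,method="kendall") would test for association between the SAT verbal and math scores entered earlier (assuming they are still in your session); help(cor.test) gives the exact form the method argument takes if these don't match your version.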

3) The command lm(y~x) performs a linear regression for predicting y from x, and the output can be made to include the estimate of the slope and its standard error. Simply use z<-lm(y~x) and summary(z) to see these values. The following function makes the confidence interval for the slope of the line predicting y from x, using the standard linear regression model:

normalslope<-function(x,y,conf.level=.95){

coeffs<-summary(lm(y~x))$coefficients
df<-length(y)-2
bhat<-coeffs[2,1]
alpha<-1-conf.level
clow<-coeffs[2,1]-qt(1-alpha/2,df)*coeffs[2,2]
chi<-coeffs[2,1]+qt(1-alpha/2,df)*coeffs[2,2]
return(bhat,conf.level,clow,chi)}
To see why this works, just notice that summary(lm(y~x))$coefficients is a matrix where the second row goes with the slope, the first column is the estimate, and the second column is the standard error.
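For example, normalslope(m,v) gives the interval for the slope in the regression predicting the SAT verbal scores from the math scores entered earlier.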

The following function performs the sequence of steps listed on page 335 for constructing the nonparametric confidence interval using the large sample approximation (equation 11 on page 317).


npslope<-function(x,y,conf.level=.95){

n<-length(x)
maxN<-n*(n-1)/2
alpha<-1-conf.level
S<-rep(0,maxN)
N<-0
for (j in 2:n){
for (i in 1:(j-1)){
if (x[i]!=x[j]){
N<-N+1
S[N]<-(y[j]-y[i])/(x[j]-x[i])}
}}
S<-sort(S[1:N])
bhat<-median(S)
wval<-qnorm(1-alpha/2)*sqrt(n*(n-1)*(2*n+5)/18)
r<-floor(0.5*(N-wval))
s<-ceiling(0.5*(N+wval)+1)
clow<-S[r]
chi<-S[s]
return(wval,r,s,S,bhat,conf.level,clow,chi)}
Note that the answer here will differ slightly from what you would get by hand using Table A11 because this is the large sample approximation. Basically the above function just follows the steps on pages 335 and 336. The != means "is not equal to", the function ceiling rounds a number up, and the function floor rounds a number down. N and maxN are just used to determine how many S's we have to deal with.
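For example, npslope(m,v) gives the corresponding nonparametric interval for that same slope (along with the full list S of pairwise slopes, which can be long).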

4 and 5) The trickiest part in using the functions for ANOVA and its nonparametric counterparts is entering the data correctly. To enter the data from example 1 on page 291 and run the ANOVA and Kruskal-Wallis test we would use the following code.


meth<-factor(c(rep("1",9),rep("2",10),rep("3",7),rep("4",8)))

obs<-c(83,91,94,89,89,96,91,92,90,91,90,81,83,84,83,88,91,89,84,
101,100,91,93,96,95,94,78,82,81,77,79,81,80,81)

summary(aov(obs~meth))

kruskal.test(obs,meth)
The vector meth is simply a list of the categories the various observations go in: the first 9 values are of type "1", the next 10 are of type "2", etc. rep("1",9) simply means to repeat "1" nine times. The vector obs simply holds the values of the observations in order. The following code works in the same way for the data in example 1, using the ANOVA and Friedman's test.

homeowner<-factor(c(rep(1:12,4)))

grass<-factor(c(rep(1,12),rep(2,12),rep(3,12),rep(4,12)))
obs<-c(4,4,3,3,4,2,1,2,3.5,4,4,3.5,3,2,1.5,1,2,2,3,4,1,1,2,1,
2,3,1.5,2,1,2,2,1,2,3,3,2,1,1,4,4,3,4,4,3,3.5,2,1,3.5)

friedman.test(obs,grass,homeowner)
summary(aov(obs~grass+homeowner))
Note that for Friedman's test, the first variable must be the observations, the second must be the treatments, and the third must be the blocks. For the ANOVA you need to look at the line that is for the treatment variable.


Notes on Homework Seven:

Page 249 #1: The function to perform the Kolmogorov-Smirnov test is simply ks.gof. You also need to tell it which distribution you are testing. help(ks.gof) will give you the list of the ones that will work. The function for performing the Chi-squared goodness of fit test is chisq.gof. Again you need to tell it what distribution you want to use. You also need to tell it which groups you are going to put the data in, unless you want to use the automatic setting, and also how many parameters you are having estimated. help(chisq.gof) provides many of the details.

The following code first generates 50 random variables that are uniformly distributed between -2 and 2, and then does both the ks.gof and chisq.gof tests of whether the data come from a uniform distribution on -2 to 2. (So it should accept most of the time.) It uses ten equally spaced groups for the chi-squared test and doesn't use any estimated parameters.

It then tests whether the data are normal or not... without knowing what the parameters are. (So in this case we want it to reject, but remember these tests do not have a lot of power.) According to the help section, ks.gof is capable of estimating the values for the normal case. For the chi-squared version of this test it uses the five groups that the computer chooses itself; we need to give it the estimated values and tell it how many we estimated.


x<-runif(50,min=-2,max=2)


ks.gof(x,distribution="uniform",min=-2,max=2)
chisq.gof(x,cut.points=c(-2,-1.6,-1.2,-0.8,-0.4,
0,0.4,0.8,1.2,1.6,2.0),distribution="uniform",
min=-2,max=2)

ks.gof(x,distribution="normal")
chisq.gof(x,n.classes=5,distribution="normal",
mean=mean(x),sd=sqrt(var(x)),n.param.est=2)

One extra note on this problem: if you try to enter all the data by typing it in all at once with no returns, it will lose a bunch of the numbers. This is because there is a limit to how long a line can be and still be copied. So all you need to do is put a few returns in after a comma about half way through if you are typing it straight into S-Plus (not recommended), or cut and paste it over in parts if you are entering it into a word processor first.

Page 441 #1: We can use code similar to what we used to graph the EDF at the very top of this page to graph the confidence bands. The d you need to enter is the value from the table that goes with the confidence level you want. The function works in the same way as the edf function near the top of this page. It just needs to be done three times: once for the edf, once for the lower bound, and once for the upper bound.


edfconf<-function(y,d=0){

x<-sort(y)
plot(c(min(x)-1,max(x)+1),c(0,1),type="n",xlab="x",ylab="p")
lines(c(x[1]-1,x[1]),c(0,0),lty=1)
lines(c(x[1],x[1]),c(0,1/length(x)),lty=2)
for (i in 1:(length(x)-1)){
lines(c(x[i],x[i+1]),c(i/length(x),i/length(x)),lty=1)
lines(c(x[i+1],x[i+1]),c(i/length(x),(i+1)/length(x)),lty=2)}
lines(c(x[length(x)],x[length(x)]+1),c(1,1),lty=1)

lines(c(x[1]-1,x[1]),c(0,0),lty=1,lwd=2)
lines(c(x[1],x[1]),c(0,max(0,1/length(x)-d)),lty=1,lwd=2)
for (i in 1:(length(x)-1)){
lines(c(x[i],x[i+1]),c(max(0,i/length(x)-d),max(0,i/length(x)-d)),lty=1,lwd=2)
lines(c(x[i+1],x[i+1]),c(max(0,i/length(x)-d),max(0,(i+1)/length(x)-d)),
lty=1,lwd=2)}
lines(c(x[length(x)],x[length(x)]+1),c(1-d,1-d),lty=1,lwd=2)

lines(c(x[1]-1,x[1]),c(0+d,0+d),lty=1,lwd=2)
lines(c(x[1],x[1]),c(0+d,min(1,1/length(x)+d)),lty=1,lwd=2)
for (i in 1:(length(x)-1)){
lines(c(x[i],x[i+1]),c(min(1,i/length(x)+d),min(1,i/length(x)+d)),lty=1,lwd=2)
lines(c(x[i+1],x[i+1]),c(min(1,i/length(x)+d),min(1,(i+1)/length(x)+d)),
lty=1,lwd=2)}
lines(c(x[length(x)],x[length(x)]+1),c(1,1),lty=1,lwd=2)
}
NOTE: the above may be too long to cut and paste all at once, so paste it in two parts.
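So, for example, if your data were in a vector x, the call would look like edfconf(x,0.2), where 0.2 is just a placeholder; replace it with the d value from the table for your sample size and confidence level.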

Page 216, #4 The function that conducts the chi-squared tests in Section 4.2 is chisq.test. To run this test you need to enter the data in the form of a matrix. The following code is for example one on page 202. (nrow=2 means there are 2 rows, and byrow=T means that we are forming the data matrix one row at a time)

 

row1<-c(6,14,17,9)

row2<-c(30,32,17,3)
data<-matrix(c(row1,row2),nrow=2,byrow=T)
chisq.test(data,correct=F)

Page 176, #1: McNemar's test follows the same format as the chi-squared tests. The function to use is mcnemar.test. (again, using the correct=F option if you are not using the continuity correction)

Page 195, #1: The chi-squared test for differences in probabilities in section 4.1 is the same as the one used in section 4.2. The only difference is that the one here is a chi-square, while the discussion on page 181 uses a standard normal. Remember, though, that a chi-square with 1 degree of freedom is a standard normal squared. If you use chisq.test to do example 1 on page 183, you will get that the statistic is 1.6116 (= -1.2695 squared) and the p-value is 0.2043.
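You can check this relationship directly in S-Plus:

(-1.2695)^2
1-pchisq(1.6116,1)
2*(1-pnorm(1.2695))

The first line gives 1.6116, and the last two both give the p-value of 0.2043 (rounded to four decimal places).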


Notes on Homework Eight:

Page 238, #1: It may be just as easy to do this one by hand... chisq.test could be used to calculate T, and then (3) on page 230 could be used to calculate Cramer's statistic. Unfortunately chisq.test does not work if one of the rows or columns is all zeros. If we define 0/0 to be zero as described on page 329, the following code will calculate both T and Cramer's statistic.


cramer.stat<-function(x){

nr<-length(x[,1])
nc<-length(x[1,])
ex<-matrix(0,nrow=nr,ncol=nc)
Tstat<-0
for (i in 1:nr){
for (j in 1:nc){
ex[i,j]<-(sum(x[i,])*sum(x[,j])/sum(x))
if((!(sum(x[i,])==0))&&(!(sum(x[,j])==0))){
Tstat<-Tstat+(x[i,j]-ex[i,j])^2/ex[i,j]}
}}
N<-sum(x)
q<-min(length(x[,1]),length(x[1,]))
cramer<-sqrt(Tstat/(N*(q-1)))
return(Tstat,cramer)
}
The phi-coefficient and Yule's two statistics can be calculated simply by following the formulas. Note that these statistics are only defined for 2x2 tables and so for this problem we have to use only choices A and B, instead of A, B, and other.

yule.stat<-function(x){

a<-x[1,1]
b<-x[1,2]
c<-x[2,1]
d<-x[2,2]
alpha<-a*d/(b*c)
phi<-(a*d-b*c)/sqrt((a+b)*(c+d)*(a+c)*(b+d))
yuleq<-(a*d-b*c)/(a*d+b*c)
if(alpha==Inf){yulecollig<-1}
else {yulecollig<-(sqrt(alpha)-1)/(sqrt(alpha)+1)}
return(alpha,phi,yuleq,yulecollig)
}


H-L interval for MWW rank sum test:

This is similar to the one for the signed rank test, and just follows the instructions on pages 281-283.


HL2.interval<-function(x,y,conflevel){

alpha <- 1-conflevel
thematrix <- matrix(0,nrow=length(x),ncol=length(y))
x<-sort(x)
y<-sort(y)
big<-max(length(x),length(y))
if(big<50){
wL<-qwilcox(alpha/2,length(x),length(y))-length(x)*(length(x)+1)/2}
else {
n<-length(x)
m<-length(y)
N<-length(x)+length(y)
zp<-qnorm(alpha/2)
wp<-round(n*(N+1)/2+zp*sqrt(n*m*(N+1)/12))
wL<-wp-length(x)*(length(x)+1)/2}
wU<- length(x)*length(y)+1-wL
mat1<-matrix(rep(x,length(y)),byrow=T,nrow=length(y))
mat2<-t(matrix(rep(y,length(x)),byrow=T,nrow=length(x)))
thematrix<-mat1-mat2
L<-sort(thematrix)[wL]
U<-sort(thematrix)[wU]
return(wL,wU,thematrix,L,U)}
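To use it, give it the two independent samples and the confidence level, e.g. HL2.interval(x,y,0.95); it returns wL, wU, the matrix of pairwise differences, and the interval endpoints L and U.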