518 Project Sample Extract

In case we need it later, start the automatic help window:

help.start(gui="motif")

Say we have the following two unpaired data sets and wish to test whether they have equal means.

> x

[1] 1.10307518 -0.16883761 -0.23079351 -0.04519230 -0.60671367 -0.21540892

[7] 0.51668447 1.73707800 -0.19824105 -0.42189858 1.31438488 0.03050895

[13] 0.83574658 -0.28574632 0.54903485 -0.26240058 1.07397187 1.18961169

[19] 0.07189669 -0.96305472

> y

[1] 1.57769805 -0.08806207 3.04443683 -0.03853285 2.72570946 0.48420941

[7] 0.84167653 3.41423541 -0.19815307 -0.42078669 4.17812150 0.28912995

[13] 1.07975805 0.34447280 1.35142805 -0.25977628 2.64246163 1.24564407

[19] 1.21441543 -0.34081294

 

Finding distributions to simulate x and y from:

hist(x)

qqnorm(x)

qqline(x)

x looks fairly normal, so we will see whether we can treat it as normal. First, estimate its mean and standard deviation:

> mean(x)

[1] 0.2511853

> sqrt(var(x))

[1] 0.7375315

We can do a Kolmogorov-Smirnov test; for graphical output we can also plot the fitted normal curve overlaid on the histogram.

hist(x,xlim=c(-1,2),ylim=c(0,1),nclass=10,prob=T)

x2<- (-10:20)/10

y2<- dnorm(x2,mean(x),sqrt(var(x)))

par(new=T)

plot(x2,y2,xlim=c(-1,2),ylim=c(0,1),type="l")
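
The Kolmogorov-Smirnov test mentioned above can be run against the fitted normal; a minimal sketch (ks.test is the R call; S-PLUS's goodness-of-fit function may be named differently):

# compare x with a normal distribution using the estimated mean and sd
ks.test(x,"pnorm",mean(x),sqrt(var(x)))

Keep in mind the p-value is only approximate here, since the mean and sd were estimated from the same data.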

Similarly for the y's

hist(y)

hist(y,nclass=10)

qqnorm(y)

qqline(y)

y doesn't look very normal. We might think y is supposed to be between -0.5 and 4.5, for example, and so it could be a Beta multiplied by 5 and shifted 0.5 to the left. A Beta(a, b) has mean

a/(a+b) and variance ab/((a+b)^2 (a+b+1)).

> mean((y+.5)/5)

[1] 0.3308727

> var((y+.5)/5)

[1] 0.07622194

So, to make the math easier, call these 1/3 and 0.076. Let d = a+b; solving the two moment equations above gives...

a = 0.26, b = 0.52

hist(y,xlim=c(-1,5),ylim=c(0,1),nclass=10,prob=T)

x2<- (-10:50)/10

y2<- dbeta((x2+.5)/5,.26,.52)/5   # divide by 5 for the change of scale

par(new=T)

plot(x2,y2,xlim=c(-1,5),ylim=c(0,1),type="l")

This doesn't look good at all!

> mean(y+.5)

[1] 1.654364

> var(y+.5)

[1] 1.905549

A chi-squared distribution has mean = df and variance = 2df.

Try a chi-squared distribution with one degree of freedom, multiplied by the square root of two, and shifted left by a half...

hist(y,xlim=c(-1,5),ylim=c(0,1),nclass=10,prob=T)

x2<- (-10:50)/10

y2<- dchisq((x2+0.5)/sqrt(2),1)/sqrt(2)

par(new=T)

plot(x2,y2,xlim=c(-1,5),ylim=c(0,1),type="l")

 

This is too skewed to the right; try a chi-squared distribution with two degrees of freedom, divided by the square root of two,

and shifted left by a half.

hist(y,xlim=c(-1,5),ylim=c(0,1),nclass=10,prob=T)

x2<- (-10:50)/10

y2<- dchisq((x2+0.5)*sqrt(2),2)*sqrt(2)

par(new=T)

plot(x2,y2,xlim=c(-1,5),ylim=c(0,1),type="l")

Looking better... try threes instead of twos...

hist(y,xlim=c(-1,5),ylim=c(0,1),nclass=10,prob=T)

x2<- (-10:50)/10

y2<- dchisq((x2+0.5)*sqrt(3),3)*sqrt(3)

par(new=T)

plot(x2,y2,xlim=c(-1,5),ylim=c(0,1),type="l")

Go with the twos...
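
As a quick check on that choice, one can compare the fitted model's moments with the sample moments (a simulation sketch):

# chi-squared(2)/sqrt(2) has mean 2/sqrt(2) = 1.41 and variance 2*2/2 = 2,
# versus mean(y+.5) = 1.65 and var(y+.5) = 1.91 found above
ysim<- rchisq(10000,2)/sqrt(2)
mean(ysim)
var(ysim)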

 

Doing the test....

A lot of the tests are built into S-PLUS...

t.test

wilcox.test

cor.test

aov

kruskal.test

Use the help menu to find out what the options are... for example, choosing the alternative hypothesis, making it a paired test for t.test or wilcox.test, or choosing the type of correlation in cor.test.

When running a test it is always good to store the result in a variable, so that you can look at it more later.

> out1<-t.test(x,y,paired=F,alternative="two.sided")

> out1

Standard Two-Sample t-Test

data: x and y

t = -2.5808, df = 38, p-value = 0.0138

alternative hypothesis: true difference in means is not equal to 0

95 percent confidence interval:

-1.6116442 -0.1947125

sample estimates:

mean of x mean of y

0.2511853 1.154364

 

> attributes(out1)

$names:

[1] "statistic" "parameters" "p.value" "conf.int" "estimate"

[6] "null.value" "alternative" "method" "data.name"

$class:

[1] "htest"

 

Note that we don't actually want all the output, just the p-value.

> out1$"p.value"

[1] 0.01384421

In fact we just want to know how the p-value compares to alpha.

> (out1$"p.value"<=.05)

[1] T

> (out1$"p.value"<=.01)

[1] F

A T counts as a 1, and an F counts as a zero.
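
For example, summing a logical vector counts the number of T's, which is what the rejection counters in the simulation below rely on:

> sum(c(T,F,T,T))

[1] 3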

 

The Simulation Study

To test the alpha level, H0 must be true... so x and y need to have the same location....

Now x was normal with mean = 0.25 and sd = 0.74, while y was (chi-squared(2)/sqrt(2)) - 0.5, so y has mean 2/sqrt(2) - 0.5 = 0.91. We want to make these means equal (ideally we would match the variances too, but that is hard to do).

The means differ by 0.91 - 0.25 = 0.66, so shift x up by 0.33 and y down by 0.33 (giving each a mean of 0.58) to test the null hypothesis of equal

means.

Because S-PLUS doesn't deal well with long loops, we split the simulation into two parts... it takes around three minutes to run on some of the machines, maybe a bit longer on others.

 

alpha<-0.05

numrejectsnorm<-0

numrejectswilcox<-0

both<-0

# first half of the simulation: 106 data sets
for (i in 1:106){
  x<-rnorm(20,(0.25+0.33),0.74)
  y<- (rchisq(20,2)/sqrt(2))-0.5-0.33
  tempt<- t.test(x,y,alternative="two.sided",paired=F)
  tempw<- wilcox.test(x,y,alternative="two.sided",paired=F)
  # add 1 whenever the p-value is at or below alpha
  numrejectsnorm<-numrejectsnorm+(tempt$"p.value"<=alpha)
  numrejectswilcox<-numrejectswilcox+(tempw$"p.value"<=alpha)
  both<-both+(tempt$"p.value"<=alpha)*(tempw$"p.value"<=alpha)
}

# second half: another 106 data sets, accumulating into the same counters
for (i in 1:106){
  x<-rnorm(20,(0.25+0.33),0.74)
  y<- (rchisq(20,2)/sqrt(2))-0.5-0.33
  tempt<- t.test(x,y,alternative="two.sided",paired=F)
  tempw<- wilcox.test(x,y,alternative="two.sided",paired=F)
  numrejectsnorm<-numrejectsnorm+(tempt$"p.value"<=alpha)
  numrejectswilcox<-numrejectswilcox+(tempw$"p.value"<=alpha)
  both<-both+(tempt$"p.value"<=alpha)*(tempw$"p.value"<=alpha)
}

# proportion of the 212 simulated data sets rejected by each test
percentnorm<-numrejectsnorm/212
percentwilcox<-numrejectswilcox/212
percentboth<-both/212

 

The output is:

> percentnorm

[1] 0.0754717

> percentwilcox

[1] 0.1650943

Note that the alpha level for the Wilcoxon test looks just awful! That's because its assumptions weren't met. We might want to do a second "alpha-level study" where we make the medians the same; this might bring it a bit closer, even though technically the Wilcoxon test needs the distribution to be symmetric around the median. The problem is that we don't have a handy formula for the median of our shifted chi-squared(2) distribution. One way to figure it out is to generate a large number of values and take their median.

> median((rchisq(10000,2)/sqrt(2))-.5)

[1] 0.4733352

The medians differ by about 0.47 - 0.25 = 0.22, so instead of shifting them by 0.33, try shifting them both by 0.11. Keep in mind, though, that the assumptions still aren't met, because the errors would have to come from the same continuous population!

This wouldn't be a problem in a paired test, though. Your write-up needs to point out whenever the assumptions aren't met!
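
For instance, the loop for this second alpha-level study might just change the two shifts, with everything else kept the same (a sketch):

# shift x up by 0.11 and y down by 0.11 so the medians roughly agree
x<-rnorm(20,(0.25+0.11),0.74)
y<- (rchisq(20,2)/sqrt(2))-0.5-0.11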

The power study would be done similarly, by making the means of x and y farther apart. In particular, try it with the original x and y distributions you fit; try it where the means differ by about halfway between the observed difference and the null hypothesis; and try it where they differ by more than that.
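
For instance, at the observed difference the loop would simply use the fitted distributions with no shift at all (a sketch; the rest of the loop stays the same):

# power at the observed difference: means 0.25 for x and 0.91 for y
x<-rnorm(20,0.25,0.74)
y<- (rchisq(20,2)/sqrt(2))-0.5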

For each of these settings, you should give confidence intervals for the estimated alpha-levels and powers of both the Wilcoxon test and the t-test, using the material in Chapter 2. Also test whether the two tests differ using McNemar's test.
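
A minimal sketch of such an interval for the t-test's rejection proportion, assuming the usual large-sample formula for a proportion is what Chapter 2 has in mind:

# approximate 95% confidence interval based on the 212 simulated data sets
phat<-percentnorm
phat+c(-1,1)*qnorm(.975)*sqrt(phat*(1-phat)/212)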

# McNemar-style comparison: look at the runs where the two tests disagree
n1<-numrejectsnorm-both     # t-test rejected but the Wilcoxon test did not
n2<-numrejectswilcox-both   # Wilcoxon test rejected but the t-test did not

pstar<-n1/(n1+n2)
z<- (pstar-.5)/sqrt(.5*.5/(n1+n2))

Then use pnorm to get the p-value.
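
For example, a two-sided p-value from this normal approximation could be computed as:

# two-sided p-value for the McNemar z statistic
2*(1-pnorm(abs(z)))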

So the final output of this simulation study would be the two confidence intervals and the McNemar test, for each of the parameter settings.