Stat 518 - Fall 1999 - Project Notes
UNDER CONSTRUCTION!
What Distribution Does the Data Look Like?
The whole idea of the simulation study is to generate data that looks
just like your real data, but for which we know what the true population
is. That way we know whether the null hypothesis is really true for the
"fake data," and we can thus estimate the alpha level and power that the
tests you are using should have on data like yours. Remember that with
your real data we don't know whether the null hypothesis is true, so we
don't know whether we really should be accepting or rejecting. The
simulation studies thus give us an idea of how much we can trust the
conclusions you came to when you ran the tests on your data.
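As a concrete illustration of the idea, the sketch below simulates many
data sets for which the null hypothesis is known to be true, runs a
t-test on each, and counts how often the test rejects; that proportion
estimates the true alpha level. (The sample size, number of simulations,
and choice of a t distribution with 5 df are arbitrary for the example.)

```r
# Estimate the alpha level of a nominal 0.05-level t-test
# when the data really come from a heavy-tailed t (5 df).
nsim <- 1000
n <- 20
reject <- 0
for (i in 1:nsim) {
  x <- rt(n, df=5)              # "fake data"; H0 (mean = 0) is true
  p <- t.test(x, mu=0)$p.value
  if (p < 0.05) reject <- reject + 1
}
reject/nsim                     # estimated alpha level
```

If this proportion is far from 0.05, the test's stated level cannot be
trusted for data shaped like this.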
The first step, then, is to find out what distribution your data looks
like it comes from. In describing your data in the earlier sections you
probably should have done a q-q plot, and either a histogram or a box
plot (or both). There are four main possibilities, addressed below:
- 1) Your data looks normal: the q-q plot looks very good.
- 2) Your data is non-normal but unbounded: your data
doesn't seem to have a sharp drop-off on the lower end (like a chi-square)
or on both ends (like a uniform), but it isn't exactly normal. Instead
it is skewed (the histogram or box plot leans one way or the other) or
it has either heavy or light tails (the q-q plot makes an S shape).
- 3) Your data is bounded below: for example the values start from zero
and go up from there.
- 4) Your data is bounded above and below: for example your data may
be bounded between 0 and 100 because it is made up of percentages.
Keep in mind that data bounded between two numbers (low, high), such as
percentages between 0 and 100, is usually modeled with the beta
distribution.
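The sections below stop before covering case 4, so here is a sketch,
following the same method-of-moments pattern used for the other
distributions, for fitting a beta distribution to data bounded between
low and high. The rescaling and moment formulas are standard; the
variable names are my own.

```r
# Method of moments for the beta distribution on (low, high).
# Rescale the data to (0,1), then match the mean and variance:
#   alpha = m*(m*(1-m)/v - 1),  beta = (1-m)*(m*(1-m)/v - 1)
low <- 0
high <- 100
y <- (x - low)/(high - low)     # rescaled data in (0,1)
m <- mean(y)
v <- var(y)
a <- m*(m*(1-m)/v - 1)          # alpha (shape1)
b <- (1-m)*(m*(1-m)/v - 1)      # beta (shape2)
c(a, b)
```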
UNBOUNDED:
----------
If your data appears to be normally distributed from the q-q plot,
then it is fine to simply simulate your data from the normal
distribution. If, however, your data seems to have heavy or light
tails when you look at the q-q plot, or if it seems to be fairly skewed
when you look at a box plot, then you will want to simulate from a
distribution with the right skewness and tail weight.
One method of generating data that has the right mean, variance,
skewness, and heaviness of tails was proposed by Allen I. Fleishman
in the article: Fleishman, A.I. (1978). A method for simulating
non-normal distributions. Psychometrika 43, 521-532. He
called his method the "power method." The basic idea is that you
generate a standard normal Z and then use
X = a + b*Z + c*Z^2 + d*Z^3, where a, b, c, and d are
found based on the distribution you want.
The following function returns the mean, variance, skewness, and
kurtosis of your data set in that order.
sandk<-function(x){
  # sample moments about the mean
  n <- length(x)
  m1p <- mean(x)
  m2 <- sum((x-m1p)^2)/n
  m3 <- sum((x-m1p)^3)/n
  m4 <- sum((x-m1p)^4)/n
  # k-statistics (unbiased estimates of the cumulants)
  k1 <- m1p
  k2 <- n*m2/(n-1)
  k3 <- ((n^2)/((n-1)*(n-2)))*m3
  k4 <- (((n^2)/((n-1)*(n-2)))/(n-3))*((n+1)*m4 - 3*(n-1)*m2^2)
  # standardized skewness and (excess) kurtosis
  g1 <- k3/(k2^(3/2))
  g2 <- k4/(k2^2)
  c(mean=k1, variance=k2, skewness=g1, kurtosis=g2)}
Next you need to work through the three rather messy formulas
that Fleishman gives in his article. Luckily, Maple can solve these for
us! While we haven't used Maple in class yet, this is extremely easy.
All you need to do is change the values of g1 and g2 below to match
the g1 and g2 for your data. Then simply start Maple (it's on all the
computers in 303A), cut and paste the code into the window, and hit
return. It will then give you back the g1 and g2 you gave it, as well
as the b, c, and d that we are looking for. For example, the code
below will give you the values that go with g1=1.75 and g2=9 (very skewed
and very heavy tails). The values are b=.640165, c=.176266, and d=.099113.
eq1 := 2=2*b^2 + 12*b*d + g1^2/(b^2+24*b*d+105*d^2+2)^2 + 30*d^2;
eq2 := g2=24*(b*d+c^2*(1+b^2+28*b*d)+d^2*(12+48*b*d+141*c^2+225*d^2));
eq3 := c=g1/(2*(b^2+24*b*d+105*d^2+2));
eq4 := g1=1.75;
eq5 := g2=9;
fsolve({eq1,eq2,eq3,eq4,eq5},{b,c,d,g1,g2});
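If you would rather not use Maple, the same coefficients can be found
numerically. The sketch below uses the R function optim (S-Plus users
would substitute a minimizer such as nlmin); it minimizes the summed
squared residuals of Fleishman's three equations, with a = -c built in.

```r
# Numerically solve Fleishman's equations for b, c, d
# given the desired skewness g1 and kurtosis g2.
fleish.coef <- function(g1, g2) {
  ss <- function(p) {
    b <- p[1]; cc <- p[2]; d <- p[3]
    e1 <- b^2 + 6*b*d + 2*cc^2 + 15*d^2 - 1                 # variance = 1
    e2 <- 2*cc*(b^2 + 24*b*d + 105*d^2 + 2) - g1            # skewness
    e3 <- 24*(b*d + cc^2*(1 + b^2 + 28*b*d) +
              d^2*(12 + 48*b*d + 141*cc^2 + 225*d^2)) - g2  # kurtosis
    e1^2 + e2^2 + e3^2
  }
  optim(c(1, 0, 0), ss)$par     # returns (b, c, d)
}
fleish.coef(1.75, 9)            # should land near .640165, .176266, .099113
```

Note that the equations can have more than one root; check that the
returned values reproduce your g1 and g2 before using them.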
Once you have the values of b, c, and d, the following function
generates n observations with mean m, variance v, and the desired
skewness and kurtosis.
rlfleish<-function(n,m,v,b,c,d){
  Z<-rnorm(n,0,1)
  a<- -1*c                  # Fleishman requires a = -c so that E(Y) = 0
  Y<-a+b*Z+c*Z^2+d*Z^3
  X<-m+sqrt(v)*Y            # shift and scale to the desired mean and variance
  return(X)}
As a check, the following compares real t-distributed data to Fleishman
data generated with matching coefficients:
x<-rt(1000,df=5)
sandk(x)
y<-rlfleish(1000,0,1.461356,.7704679,0,0.071376534)
sandk(y)
x<-rt(1000,df=3)
sandk(x)
BOUNDED BELOW:
--------------
You need to set the value low to be the smallest value the data can take.
Three distributions to consider in this case are: F, gamma, and lognormal.
The F distribution has two parameters, df1 and df2. The following
code calculates the method of moments estimators by inverting the
formulas for the mean and variance of an F distribution.
low <- 0
n <- length(x)
m <- mean(x-low)
v <- var(x-low)
r2 <- 2*m/(m-1)          # solve mean = df2/(df2-2) for df2
r2 <- max(r2,0.1)        # keep the estimate positive
r1 <- (2*r2^3-4*r2^2)/(v*(r2-2)^2*(r2-4)-2*r2^2)   # solve the variance formula for df1
r1 <- max(r1,0.1)
ks.gof((x-low),distribution="f",df1=r1,df2=r2)
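As a quick sanity check (a sketch; the df values here are arbitrary),
you can run the method-of-moments code on data simulated from a known F
distribution and see whether r1 and r2 land near the truth:

```r
# Simulate F data with known df1=5, df2=12, then invert the moments.
x <- rf(1000, df1=5, df2=12)
low <- 0
m <- mean(x - low)
v <- var(x - low)
r2 <- max(2*m/(m - 1), 0.1)
r1 <- max((2*r2^3 - 4*r2^2)/(v*(r2 - 2)^2*(r2 - 4) - 2*r2^2), 0.1)
c(r1, r2)   # should be roughly near 5 and 12, though these estimates are noisy
```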
The gamma distribution has two parameters, alpha and beta.
If alpha=1 then it is an exponential distribution. If
alpha=v/2 and beta=2 then it is a chi-squared distribution
with v degrees of freedom. The following code calculates the
method of moments estimates, an approximation to the
maximum likelihood estimates, and the geometric mean of the two.
The formulas used below are from Johnson, Kotz, and Balakrishnan,
Volume 1. Note that S-Plus uses rate=1/beta.
low <- 0
n <- length(x)
m <- mean(x-low)
v <- var(x-low)
gm <- exp(mean(log(x-low)))   # geometric mean, computed on the log scale to avoid overflow
# method of moments
b1 <- ((n-1)*v)/(n*m)
a1 <- m/b1
# approximate maximum likelihood
Rn <- log(m/gm)
a2 <- (0.5000876+0.1648852*Rn-0.0544274*Rn^2)/Rn
b2 <- m/a2
# geometric mean of the two sets of estimates
a3 <- sqrt(a1*a2)
b3 <- sqrt(b1*b2)
ks.gof((x-low),distribution="gamma",shape=a1,rate=1/b1)
ks.gof((x-low),distribution="gamma",shape=a2,rate=1/b2)
ks.gof((x-low),distribution="gamma",shape=a3,rate=1/b3)
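To see the gamma estimators working, the sketch below runs them on data
simulated from a known gamma distribution (the shape, rate, and sample
size are arbitrary for the example):

```r
# Check the estimators on simulated gamma data with known
# alpha = 3 and beta = 2 (i.e. rate = 1/2).
x <- rgamma(1000, shape=3, rate=1/2)
n <- length(x)
m <- mean(x)
v <- var(x)
b1 <- ((n-1)*v)/(n*m)                               # method of moments
a1 <- m/b1
Rn <- log(m/exp(mean(log(x))))
a2 <- (0.5000876+0.1648852*Rn-0.0544274*Rn^2)/Rn    # approximate MLE
c(a1, a2)   # both should be near the true alpha = 3
```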
The lognormal distribution is simply the exp of a normal. The
parameters are the mean and the standard deviation of that normal.
x<-rlnorm(20,18,101)
low <- 0
n <- length(x)
ml <- mean(log(x))
sdl <- sqrt(var(log(x)))
c(ml,sdl)
ks.gof((x-low),distribution="lognormal",meanlog=ml,sdlog=sdl)