Stat 518 - Fall 1999 - Project Notes

What Distribution Does the Data Look Like?

The whole idea of the simulation study is to generate data that looks just like your real data, but for which we know what the true population looks like. This way we know whether the null hypothesis is really true for the "fake data," and we can thus estimate the alpha level and power that the tests you are using should have on data like yours. Remember that because we don't know whether the null hypothesis is true for your real data, we don't know whether we really should be accepting or rejecting. The simulation studies will thus give us an idea of how much we can trust the conclusions you came to when you ran the tests on your data.

The first step, then, is to find out what distribution your data looks like it comes from. In describing your data in the earlier sections you probably already made a q-q plot, and either a histogram or a box plot (or both); a quick reminder of the plotting commands is sketched below.
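If you still need those plots, here is a minimal sketch using the standard plotting commands (this assumes your data is in a vector called x; substitute whatever you named it):

qqnorm(x)    # normal quantile-quantile plot
qqline(x)    # reference line through the quartiles
hist(x)      # histogram
boxplot(x)   # box plot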

There are four main possibilities to keep in mind: the data may be unbounded, bounded below by a number (low), bounded above by a number (high), or bounded between two numbers (low, high), in which case the beta distribution is the natural choice.

UNBOUNDED:
----------

If your data appears to be normally distributed from the q-q plot, then it is ok to just use the normal distribution to simulate your data from. If, however, your data seems to have heavy tails or light tails when you look at the q-q plot, or if it seems to be pretty skewed when you look at the box plot, then you will want to simulate from a distribution that is skewed and has heavier or lighter tails. One method of generating data that has the right mean, variance, skew, and heaviness of tails was proposed by Allen I. Fleishman in the article:

Fleishman, A.I. (1978). A method for simulating non-normal distributions. Psychometrika, 43, 521-532.

He called his method the "power method." The basic idea is that you generate a standard normal Z and then use

X = a + b*Z + c*Z^2 + d*Z^3

where a, b, c, and d are found based on the distribution you want. The following function returns the mean, variance, skewness, and kurtosis of your data set, in that order.

sandk <- function(x){
  n <- length(x)
  m1p <- mean(x)
  m2 <- sum((x-m1p)^2)/n     # second central moment
  m3 <- sum((x-m1p)^3)/n     # third central moment
  m4 <- sum((x-m1p)^4)/n     # fourth central moment
  k1 <- m1p                  # k-statistics: unbiased cumulant estimates
  k2 <- n*m2/(n-1)
  k3 <- ((n^2)/((n-1)*(n-2)))*m3
  k4 <- (((n^2)/((n-1)*(n-2)))/(n-3))*((n+1)*m4 - 3*(n-1)*m2^2)
  g1 <- k3/(k2^(3/2))        # standardized skewness
  g2 <- k4/(k2^2)            # standardized (excess) kurtosis
  return(list(mean=k1, var=k2, skew=g1, kurt=g2))
}

Once you have the g1 and g2 for your data, you just need to follow through the three really gross formulas that Fleishman gives in his article. Luckily, Maple can solve these for us! While we haven't used Maple in class yet, this is extremely easy. All you need to do is change the values of g1 and g2 below to match the g1 and g2 for your data. Then simply start Maple (it's on all the computers in 303A), cut and paste the code into the window, and hit return. It will then give you back the g1 and g2 you gave it, as well as the b, c, and d that we are looking for. For example, the code below gives the values that go with g1=1.75 and g2=9 (very skewed and very heavy tails): b=.640165, c=.176266, and d=.099113.

eq1 := 2 = 2*b^2 + 12*b*d + g1^2/(b^2+24*b*d+105*d^2+2)^2 + 30*d^2;
eq2 := g2 = 24*(b*d + c^2*(1+b^2+28*b*d) + d^2*(12+48*b*d+141*c^2+225*d^2));
eq3 := c = g1/(2*(b^2+24*b*d+105*d^2+2));
eq4 := g1 = 1.75;
eq5 := g2 = 9;
fsolve({eq1,eq2,eq3,eq4,eq5},{b,c,d,g1,g2});

Once you have the values of b, c, and d, the following function simulates n observations with mean m, variance v, and the skewness and kurtosis that go with those coefficients.

rlfleish <- function(n,m,v,b,c,d){
  Z <- rnorm(n,0,1)              # standard normal input
  a <- -1*c                      # Fleishman sets a = -c so that Y has mean 0
  Y <- a + b*Z + c*Z^2 + d*Z^3   # standardized non-normal variate
  X <- m + sqrt(v)*Y             # rescale to the desired mean and variance
  return(X)
}

As a check that this works, compare the moments of t distributed data to the moments of data simulated from the matching Fleishman distribution:

x <- rt(1000,df=5)
sandk(x)
y <- rlfleish(1000,0,1.461356,.7704679,0,0.071376534)
sandk(y)
x <- rt(1000,df=3)
sandk(x)
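To put the whole power-method recipe together, here is a minimal sketch, assuming your data is in a vector x and reusing the b, c, and d from the g1=1.75, g2=9 Maple example above (for your own data you would plug in the values Maple returns for your g1 and g2):

mom <- sandk(x)            # step 1: moments of your data
c(mom$skew, mom$kurt)      # step 2: these are the g1 and g2 to give Maple
y <- rlfleish(length(x), mom$mean, mom$var, .640165, .176266, .099113)  # step 3: simulate
sandk(y)                   # step 4: should come out close to sandk(x)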
BOUNDED BELOW:
--------------

You need to set the value low to be the smallest value the data can take. Three distributions to try in this case are: the f, the gamma, and the lognormal.

The f distribution has two parameters, df1 and df2. The following code calculates the method of moments estimates.

low <- 0
n <- length(x)
m <- mean(x-low)
v <- var(x-low)
r2 <- 2*m/(m-1)      # solve E[X] = df2/(df2-2) for df2
r2 <- max(r2,0.1)    # keep the estimate positive
r1 <- (2*r2^3-4*r2^2)/(v*(r2-2)^2*(r2-4)-2*r2^2)   # solve the variance equation for df1
r1 <- max(r1,0.1)
ks.gof((x-low),distribution="f",df1=r1,df2=r2)

The gamma distribution has two parameters, alpha and beta. If alpha=1 then it is an exponential distribution. If alpha=v/2 and beta=2 then it is a chi-squared distribution with v degrees of freedom. The following code calculates the method of moments estimates, an approximation to the maximum likelihood estimates, and the geometric mean of the two. The formulas used below are from Johnson, Kotz, and Balakrishnan, Volume 1. Note that S-plus uses rate=1/beta.

low <- 0
n <- length(x)
m <- mean(x-low)
v <- var(x-low)
gm <- prod(x-low)^(1/n)    # geometric mean of the data
b1 <- ((n-1)*v)/(n*m)      # method of moments: beta = (biased variance)/mean
a1 <- m/b1                 # method of moments: alpha = mean/beta
Rn <- log(m/gm)
a2 <- (0.5000876+0.1648852*Rn-0.0544274*Rn^2)/Rn   # approximate MLE of alpha
b2 <- m/a2
a3 <- sqrt(a1*a2)          # geometric mean of the two estimates
b3 <- sqrt(b1*b2)
ks.gof((x-low),distribution="gamma",shape=a1,rate=1/b1)
ks.gof((x-low),distribution="gamma",shape=a2,rate=1/b2)
ks.gof((x-low),distribution="gamma",shape=a3,rate=1/b3)

The lognormal distribution is simply the exp of a normal; its parameters are the mean and standard deviation of that normal.

x <- rlnorm(20,18,101)     # a test data set; use your own x instead
low <- 0
n <- length(x)
ml <- mean(log(x-low))
sdl <- sqrt(var(log(x-low)))
c(ml,sdl)
ks.gof((x-low),distribution="lognormal",meanlog=ml,sdlog=sdl)
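Whichever of the three families passes the goodness-of-fit check, the simulated data is then generated from the fitted distribution with low added back on. A minimal sketch, using the parameter names from the code above (only one of the three lines applies, n is the size of your real data set, and this assumes the rgamma rate argument follows the rate=1/beta convention noted above):

xsim <- low + rf(n, df1=r1, df2=r2)              # if the f fit best
xsim <- low + rgamma(n, shape=a3, rate=1/b3)     # if the gamma fit best
xsim <- low + rlnorm(n, meanlog=ml, sdlog=sdl)   # if the lognormal fit best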