###################################################################################

# A complete regression example (Sec. 11.9)

# This is the fire damage data set (p. 561 of book).
# The response variable is fire damage (in thousands of dollars) to houses.
# The predictor variable is distance from the fire station (in miles).

# First we read in the data set and name the variables:

# Reading the data into a temporary file called "my.datafile":

my.datafile <- tempfile()
cat(file=my.datafile, "
3.4 26.2
1.8 17.8
4.6 31.3
2.3 23.1
3.1 27.5
5.5 36
0.7 14.1
3 22.3
2.6 19.6
4.3 31.3
2.1 24
1.1 17.3
6.1 43.2
4.8 36.4
3.8 26.1
", sep=" ")

# Name the data set and give the variables (columns) names:

firedata <- read.table(my.datafile, header=FALSE, col.names=c("distance", "damage"))

# Alternatively, could type:
# firedata <- read.table("http://people.stat.sc.edu/hitchcock/firedamageheaders.txt", header=TRUE)

attach(firedata)  # attaching the data frame

# Let's do a scatter plot to see if a linear relationship is appropriate.

plot(distance, damage)

# The linear relationship seems reasonable based on the plot.

# Let's use the lm function to estimate the slope and intercept of the linear model.
# Let's also plot the estimated regression line over our scatterplot:

SLR.fire <- lm(damage ~ distance)
SLR.fire
abline(SLR.fire)

# So our least-squares regression line is Y-hat = 10.28 + 4.92 X.
# How should we interpret these parameter estimates?

# Checking model assumptions by looking at a residual plot:

plot(distance, resid(SLR.fire)); abline(h=0)

dev.new()

# Checking the normality assumption:

qqnorm(resid(SLR.fire))

# A more detailed output:

summary(SLR.fire)

# Note that our estimate of sigma, denoted s
# (or "Residual standard error", as R calls it), is 2.31635.

# Suppose we want to test the model usefulness.
# We test H_0: beta_1 = 0 against the two-tailed alternative.
# The test statistic value t for this test is given as 12.53
# and the P-value is nearly 0.  We reject H_0 and conclude the model is useful.

# What about a 95% CI for beta_1?
# Note that the estimated slope is 4.919 and its standard error is given as 0.393.
# From Table VI, t_(.025) = 2.16 (for n-2 = 13 d.f.).
# The 95% CI for beta_1 is (4.919 - 2.16*0.393, 4.919 + 2.16*0.393).
# Therefore the 95% CI for beta_1 is (4.070, 5.768).
# How do we interpret this CI?

# The correlation coefficient r can be found using the following code:

cor(distance, damage)

# We see that r = .96098 for these data.

# The coefficient of determination r^2 is the square of r.
# It is given as .9235 on the summary output.  How do we interpret this value?

# Finally, suppose we want to calculate CIs for mean damage
# and prediction intervals for damage at particular distances.

# Calling predict() with interval="confidence" or interval="prediction" (and no new data)
# gives CIs for E(Y) and PIs for Y at the values of X (the distances) in the data set.

# But suppose we were interested in the 95% CI for mean damage and the 95% prediction
# interval for damage, for a house which was 3.5 miles from the fire station.

# Getting the 95% confidence interval for the mean at X = 3.5:

x.new.value <- data.frame(distance = c(3.5))
predict(SLR.fire, x.new.value, interval="confidence", level=0.95)

# Getting the 95% prediction interval for a new observation with X = 3.5:

predict(SLR.fire, x.new.value, interval="prediction", level=0.95)

# This shows the prediction for the house which was 3.5 miles from the station.
# We see the predicted damage for this house is about 27.5 thousand dollars.
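# As a quick check (a minimal sketch, not part of the book's output), the point
# prediction can be recovered directly from the fitted coefficients:
# Y-hat = 10.28 + 4.92*3.5, which is about 27.5.

coef(SLR.fire)[1] + coef(SLR.fire)[2] * 3.5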
# The 95% CI for mean damage for all houses 3.5 miles from the station is
# between 26.2 and 28.8 thousand dollars.

# The 95% prediction interval for the damage to a new house that is 3.5 miles from the
# station is (22.3, 32.7) in thousands of dollars.
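# Two optional follow-ups (a minimal sketch, assuming the objects created above are
# still in the workspace; none of this appears in the book's example):

# (1) R can compute the 95% CI for beta_1 directly; it should agree with the hand
# calculation above, and qt(0.975, 13) should be close to the table value 2.16:

qt(0.975, df = 13)
confint(SLR.fire, level = 0.95)

# (2) Plotting 95% confidence bands (for the mean) and prediction bands (for a new
# observation) over a grid of distances; "dist.grid" is just an illustrative name:

dist.grid <- data.frame(distance = seq(min(distance), max(distance), length.out = 100))
conf.band <- predict(SLR.fire, dist.grid, interval = "confidence", level = 0.95)
pred.band <- predict(SLR.fire, dist.grid, interval = "prediction", level = 0.95)

plot(distance, damage)
abline(SLR.fire)
lines(dist.grid$distance, conf.band[, "lwr"], lty = 2)  # lower confidence band
lines(dist.grid$distance, conf.band[, "upr"], lty = 2)  # upper confidence band
lines(dist.grid$distance, pred.band[, "lwr"], lty = 3)  # lower prediction band
lines(dist.grid$distance, pred.band[, "upr"], lty = 3)  # upper prediction band

# As expected, the prediction bands are wider than the confidence bands.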