###################################################################################

# A complete regression example (Sec. 11.9)

# This is the fire damage data set (p. 561 of book).
# The response variable is fire damage (in thousands of dollars) to houses.
# The predictor variable is distance from the fire station (in miles).

# First we read in the data set and name the variables:

# Reading the data into a temporary file called "my.datafile":

my.datafile <- tempfile()
cat(file=my.datafile, "
3.4 26.2
1.8 17.8
4.6 31.3
2.3 23.1
3.1 27.5
5.5 36
0.7 14.1
3 22.3
2.6 19.6
4.3 31.3
2.1 24
1.1 17.3
6.1 43.2
4.8 36.4
3.8 26.1
", sep=" ")

# Name the data set and give the variables (columns) names:

firedata <- read.table(my.datafile, header=FALSE, col.names=c("distance", "damage"))

# Alternatively, could type:
# firedata <- read.table("http://people.stat.sc.edu/hitchcock/firedamageheaders.txt", header=TRUE)

attach(firedata)  # attaching the data frame

# Let's do a scatter plot to see if a linear relationship is appropriate.

plot(distance, damage)

# The linear relationship seems reasonable based on the plot.

# Let's use the lm function to estimate the slope and intercept of the linear model.
# Let's also plot the estimated regression line over our scatterplot:

SLR.fire <- lm(damage ~ distance)
SLR.fire
abline(SLR.fire)

# So our least-squares regression line is Y-hat = 10.28 + 4.92 X.
# How should we interpret these parameter estimates?

# Checking model assumptions by looking at a residual plot:

plot(distance, resid(SLR.fire)); abline(h=0)

dev.new()

# Checking the normality assumption:

qqnorm(resid(SLR.fire))

# A more detailed output:

summary(SLR.fire)

# Note that our estimate of sigma, denoted s
# (or "Residual standard error", as R calls it), is 2.31635.

# Suppose we want to test the model usefulness.
# We test H_0: beta_1 = 0 against the two-tailed alternative.
# The test statistic value t for this test is given as 12.53
# and the P-value is nearly 0.  We reject H_0 and conclude the model is useful.

# What about a 95% CI for beta_1?
# Note that the estimated slope is 4.919 and its standard error is given as 0.393.
# From Table VI, t_(.025) = 2.16 (for n-2 = 13 d.f.).
# The 95% CI for beta_1 is (4.919 - 2.16*0.393, 4.919 + 2.16*0.393).
# Therefore the 95% CI for beta_1 is (4.070, 5.768).
# How do we interpret this CI?

# The correlation coefficient r can be found using the following code:

cor(distance, damage)

# We see that r = .96098 for these data.

# The coefficient of determination r^2 is the square of r.
# It is given as .9235 on the summary output.  How do we interpret this value?

# Finally, suppose we want to calculate CIs for mean damage
# and prediction intervals for damage at particular distances.

# Calling predict() with interval="confidence" or interval="prediction" (and no new data)
# gives CIs for E(Y) and PIs for Y at the values of X (the distances) in the data set.

# But suppose we were interested in the 95% CI for mean damage and the 95% prediction
# interval for damage, for a house which was 3.5 miles from the fire station.

# Getting the 95% confidence interval for the mean at X = 3.5:

x.new.value <- data.frame(distance = c(3.5))
predict(SLR.fire, x.new.value, interval="confidence", level=0.95)

# Getting the 95% prediction interval for a new observation with X = 3.5:

predict(SLR.fire, x.new.value, interval="prediction", level=0.95)

# This shows the prediction for the house which was 3.5 miles from the station.
# We see the predicted damage for this house is about 27.5 thousand dollars.
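# As a quick check (a minimal sketch, not part of the book's output), the point
# prediction can be recovered directly from the fitted coefficients:
# Y-hat = 10.28 + 4.92*3.5, which is about 27.5.

coef(SLR.fire)[1] + coef(SLR.fire)[2] * 3.5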
# The 95% CI for mean damage for all houses 3.5 miles from the station is
# between 26.2 and 28.8 thousand dollars.

# The 95% prediction interval for the damage to a new house that is 3.5 miles from the
# station is (22.3, 32.7) in thousands of dollars.
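# Two optional follow-ups (a minimal sketch, assuming the objects created above are
# still in the workspace; none of this appears in the book's example):

# (1) R can compute the 95% CI for beta_1 directly; it should agree with the hand
# calculation above, and qt(0.975, 13) should be close to the table value 2.16:

qt(0.975, df = 13)
confint(SLR.fire, level = 0.95)

# (2) Plotting 95% confidence bands (for the mean) and prediction bands (for a new
# observation) over a grid of distances; "dist.grid" is just an illustrative name:

dist.grid <- data.frame(distance = seq(min(distance), max(distance), length.out = 100))
conf.band <- predict(SLR.fire, dist.grid, interval = "confidence", level = 0.95)
pred.band <- predict(SLR.fire, dist.grid, interval = "prediction", level = 0.95)

plot(distance, damage)
abline(SLR.fire)
lines(dist.grid$distance, conf.band[, "lwr"], lty = 2)  # lower confidence band
lines(dist.grid$distance, conf.band[, "upr"], lty = 2)  # upper confidence band
lines(dist.grid$distance, pred.band[, "lwr"], lty = 3)  # lower prediction band
lines(dist.grid$distance, pred.band[, "upr"], lty = 3)  # upper prediction band

# As expected, the prediction bands are wider than the confidence bands.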