STAT 530, Fall 2022
--------------------

Homework 6
-----------


IMPORTANT NOTE: For EACH of these problems, write a couple of sentences 
explaining in words what substantive conclusions about the data
you can draw from the plots and/or analyses.

PROBLEM 1:
---------------

Use linear discriminant analysis (LDA) to build a classification rule for classifying the Bumpus bird data
into two groups ("survived" and "died") based on the 5 numerical measurements.  

The Bumpus bird data (along with a survival/death indicator vector) can be read in using the 
following R code:

bumpbird <- read.table("http://people.stat.sc.edu/hitchcock/bumpusbird.txt", header=T)
names(bumpbird) <- c("ID", "tot.length", "alar.length", "beak.head.length", "humerus.length", "keel.stern.length")
attach(bumpbird)
bumpbird.numeric <- bumpbird[,-1]
bumpbird.IDs <- bumpbird[,1]
survival.indicator <- as.factor(c(rep("survived",times=21),rep("died",times=28)))

(a) Use the LDA rule to predict the survival status for a hypothetical bird with:
tot.length=156, alar.length=242, beak.head.length=31.4, humerus.length=18.1, keel.stern.length=19.4
For part (a), assume equal prior probabilities of surviving and dying.
Give the probability of surviving for such a bird.
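
A minimal sketch of how the choice of prior enters an LDA prediction, shown on a stand-in data set
(an unbalanced subset of R's built-in iris data), NOT on the Bumpus data.  The object names below
are hypothetical.  In lda() from the MASS package, the 'prior' vector is given in the order of the
factor levels; when 'prior' is omitted, the sample proportions are used.

```r
library(MASS)  # provides lda()

# Stand-in data (an assumption for illustration): two iris species,
# made unbalanced (30 versicolor, 50 virginica) so the two priors differ
two.species <- droplevels(subset(iris, Species != "setosa"))
two.species <- two.species[-(1:20), ]

# Equal priors versus the default priors (sample proportions)
fit.equal <- lda(Species ~ ., data = two.species, prior = c(0.5, 0.5))
fit.prop  <- lda(Species ~ ., data = two.species)
fit.prop$prior

# One hypothetical new observation
new.obs <- data.frame(Sepal.Length = 6.0, Sepal.Width = 2.9,
                      Petal.Length = 4.8, Petal.Width = 1.6)

# Predicted class and posterior probabilities under each prior
pred.equal <- predict(fit.equal, newdata = new.obs)
pred.prop  <- predict(fit.prop,  newdata = new.obs)
pred.equal$class
pred.equal$posterior
pred.prop$posterior
```

The posterior probabilities in each row sum to 1; the column for a given group is the estimated
probability that the new observation belongs to that group.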

(b) Find the plug-in misclassification rate and the cross-validation misclassification rate 
for the LDA classification rule from part (a).

(c) Use the LDA rule to predict the survival status for a hypothetical bird with:
tot.length=156, alar.length=242, beak.head.length=31.4, humerus.length=18.1, keel.stern.length=19.4
For part (c), use the default prior probabilities which equal the sample proportions of birds 
surviving and dying. Give the probability of surviving for such a bird.

(d) Find the plug-in misclassification rate and the cross-validation misclassification rate 
for the LDA classification rule from part (c).  How do these compare to the rates that you found in part (b)?
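
The two misclassification rates can be sketched as follows, again on stand-in iris data rather
than the Bumpus data, with hypothetical object names.  The plug-in rate reclassifies the same
observations used to fit the rule; the cross-validation rate here relies on the CV=TRUE option
of lda(), which performs leave-one-out cross-validation.

```r
library(MASS)  # provides lda()

# Stand-in data (an assumption for illustration): two iris species
two.species <- droplevels(subset(iris, Species != "setosa"))
grp <- two.species$Species

# Plug-in (resubstitution) rate: fit the rule, then classify the training data
fit <- lda(Species ~ ., data = two.species, prior = c(0.5, 0.5))
plugin.class <- predict(fit, newdata = two.species)$class
table(actual = grp, predicted = plugin.class)
plugin.rate <- mean(plugin.class != grp)

# Leave-one-out cross-validation rate: with CV = TRUE, lda() returns
# the held-out class assignments directly instead of a fitted rule
fit.cv <- lda(Species ~ ., data = two.species, prior = c(0.5, 0.5), CV = TRUE)
table(actual = grp, predicted = fit.cv$class)
cv.rate <- mean(fit.cv$class != grp)

c(plugin = plugin.rate, cross.validation = cv.rate)
```

Dropping the 'prior' argument from both calls gives the corresponding rates under the default
priors.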


PROBLEM 2:  
---------------

(a) Use the CLASSIFICATION TREE approach on the Egyptian Skulls data in the Chapter 7 in-class R examples
to obtain the classification tree (show the plot of the tree), and classify a new skull with the 
following measurements into an Epoch:
MB = 133.0, BH = 130.0, BL = 95.0, NH = 50.0
You may assume equal prior probabilities of being in each category. 
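
A minimal classification-tree sketch using rpart(), shown on the built-in iris data as a stand-in
for the skulls data.  The object names are hypothetical, and passing equal priors through 'parms'
is one possible way to impose the equal-prior assumption, not necessarily the approach used in
the in-class examples.

```r
library(rpart)  # provides rpart()

# Stand-in data (an assumption for illustration): built-in iris data
k <- nlevels(iris$Species)

# Classification tree with equal prior probabilities for the k classes
tree.fit <- rpart(Species ~ ., data = iris, method = "class",
                  parms = list(prior = rep(1 / k, k)))

# Plot the tree and label its nodes
plot(tree.fit, margin = 0.1)
text(tree.fit, use.n = TRUE)

# Classify one hypothetical new observation
new.obs <- data.frame(Sepal.Length = 6.0, Sepal.Width = 2.9,
                      Petal.Length = 4.8, Petal.Width = 1.6)
tree.class <- predict(tree.fit, newdata = new.obs, type = "class")
tree.prob  <- predict(tree.fit, newdata = new.obs, type = "prob")
tree.class
tree.prob
```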

(b) Use the random forest approach to do the same classification as in part (a).  Comment on any similarities 
and/or differences between your conclusions in part (a) and in part (b).  What does the random forest approach 
tell you about the relative importance of the four predictors in the classification?
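
A minimal random forest sketch, assuming the randomForest package is installed and again using
the built-in iris data as a stand-in.  The object names, the seed, and the number of trees are
illustrative choices only.  Because the forest is built from random bootstrap samples, results
vary from run to run unless a seed is set.

```r
library(randomForest)  # assumed to be installed; provides randomForest()

set.seed(530)  # arbitrary seed, only to make the run reproducible

# Stand-in data (an assumption for illustration): built-in iris data
rf.fit <- randomForest(Species ~ ., data = iris, ntree = 500,
                       importance = TRUE)

# Classify one hypothetical new observation
new.obs <- data.frame(Sepal.Length = 6.0, Sepal.Width = 2.9,
                      Petal.Length = 4.8, Petal.Width = 1.6)
rf.class <- predict(rf.fit, newdata = new.obs)
rf.class

# Variable importance: larger values suggest a more important predictor
rf.imp <- importance(rf.fit)
rf.imp
varImpPlot(rf.fit)
```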


PROBLEM 3:  
---------------

*The Chapter 3 U.S. air pollution data set (DIFFERENT from the Chapters 1-2 air pollution data) 
is given on the course web page.  This R code will read in the data:

USairpol.full <- read.table("http://people.stat.sc.edu/hitchcock/usair.txt", header=T)
city.names <- as.character(USairpol.full[,1])
USairpol.data <- USairpol.full[,-1]
USairpol.data$Temp <- (-USairpol.data$Temp)
attach(USairpol.data)

*The variables in the data set, each measured on 41 U.S. cities, are described below.

SO2=sulphur dioxide content of air (a measure of air pollution)
Temp=average annual temperature in degrees F
Manuf=number of manufacturing enterprises employing 20 or more workers
Pop=Population size (1970 census) in thousands
Wind=Average annual wind speed in miles per hour
Precip=Average annual precipitation in inches
Days=Average number of days with precipitation per year

(a) Use a regression tree approach with SO2 as the dependent (response) variable and the other variables as 
independent (explanatory) variables.  (You can use the default settings of the 'rpart' function.)  
Show the plot of the tree.  Based on the tree, which seem to be the most important explanatory variables for
predicting sulphur dioxide content?
Use the regression tree to predict the SO2 for a city with:
Temp=60, Manuf=390, Pop=500, Wind=8.5, Precip=45, Days=110
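
A minimal regression-tree sketch with the default rpart() settings, shown on the built-in mtcars
data as a stand-in for the air pollution data; the object names and the new observation are
hypothetical.  Any transformation applied to a training column (such as the sign change on Temp
in the code above) would presumably need to be applied to the new observation as well before
calling predict().

```r
library(rpart)  # provides rpart()

# Stand-in data (an assumption for illustration): built-in mtcars data.
# With a numeric response, rpart() fits a regression tree by default.
reg.tree <- rpart(mpg ~ wt + hp + disp + qsec, data = mtcars)

# Plot the tree and label its nodes
plot(reg.tree, margin = 0.1)
text(reg.tree, use.n = TRUE)

# Variables used in the splits nearest the root matter most in the tree
reg.tree$variable.importance

# Predict the response for one hypothetical new observation
new.car <- data.frame(wt = 3.0, hp = 120, disp = 200, qsec = 17.5)
pred.mpg <- predict(reg.tree, newdata = new.car)
pred.mpg
```

The predicted value is the mean response of the training observations in the leaf that the new
observation falls into.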

(b) Use the random forest approach to do the same prediction as in part (a).  Comment on any similarities 
and/or differences between your conclusions in part (a) and in part (b).  What does the random forest approach 
tell you about the relative importance of the six predictors in the prediction?


PROBLEM 4:  EXTRA CREDIT for EVERYONE
--------------------------------------------------------------------------

Use the Sudden Infant Death Syndrome (SIDS) data given on the course web page.  The "Group" variable designates
49 healthy, surviving infants (Group = 1) and 16 infants who were SIDS victims (Group = 2).  The predictor 
variables were Heart Rate; Birthweight; Factor68 (a measurement based on recorded electrocardiograms 
and respiratory movements); and Gestational Age.

The data can be read into R by:

sidsdata <- read.table("http://people.stat.sc.edu/hitchcock/SIDSdata.txt", header=T)
attach(sidsdata)

Use the Support Vector Machine approach to classify a new baby with 
HR = 100, BW = 3000, Factor68 = 0.3, Gesage = 40
into either the healthy group or the SIDS group.

Comment on any choices of tuning parameters, settings, etc. that you used.
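
A minimal support vector machine sketch, assuming the e1071 package is installed and using two
iris species as a stand-in for the SIDS data.  The object names, kernel, and tuning grid are
illustrative choices only.  Note that svm() fits a regression when the response is numeric, so a
group label coded as numbers would need to be converted with as.factor() to get a classifier.

```r
library(e1071)  # assumed to be installed; provides svm() and tune()

# Stand-in data (an assumption for illustration): two iris species
two.species <- droplevels(subset(iris, Species != "setosa"))

# SVM classifier with a radial kernel; predictors are scaled by default
svm.fit <- svm(Species ~ ., data = two.species, kernel = "radial",
               cost = 1, scale = TRUE)

# Classify one hypothetical new observation
new.obs <- data.frame(Sepal.Length = 6.0, Sepal.Width = 2.9,
                      Petal.Length = 4.8, Petal.Width = 1.6)
svm.class <- predict(svm.fit, newdata = new.obs)
svm.class

# Cross-validated grid search over cost and gamma (illustrative grid)
set.seed(530)  # arbitrary seed; the cross-validation folds are random
tuned <- tune(svm, Species ~ ., data = two.species,
              ranges = list(cost = c(0.1, 1, 10), gamma = c(0.01, 0.1, 1)))
tuned$best.parameters

# Classify the new observation with the best model found by the search
tuned.class <- predict(tuned$best.model, newdata = new.obs)
tuned.class
```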