STAT 530, Fall 2022
--------------------

Homework 3
-----------

#### Only graduate students are required to do problem 5(b) below.  It is extra credit for undergraduates.
#### Undergraduates must do problems 1, 2, 3 and 4, and 5(a) below.

R TIP:

The code:
M <- matrix(c(w,x,y,z), nrow=2, ncol=2, byrow=T)

(where w,x,y,z are some numbers) will produce a 2-by-2 matrix M with entries:

[w  x
 y  z]

A 3-by-3 matrix can be obtained similarly by putting nine numbers separated by commas in c()
and using nrow=3, ncol=3 .


# 1)  Suppose our multivariate data have covariance matrix S =

[5  0  0
 0  9  0
 0  0  9]

a) Find the eigenvalues and eigenvectors of S.

HINT:  You can use the 'eigen' function in R, e.g.:
eigen(M)
gives the eigenvalues and eigenvectors of some matrix M.

b) Determine all three principal components for such a data set, using a PCA based on S.
   Remember that a principal component is a linear combination of the original variables,
   so your answer should be in the form of linear combinations of X1, X2, and X3.

HELPFUL NOTE:  In R, we can do a PCA where the covariance matrix is input rather than the 
data matrix using code such as:

my.pc <- princomp(covmat=my.S); summary(my.pc, loadings=T)

where my.S is the covariance matrix.

HINT: Remember what we said in class about blank spots in the printed output of coefficients!
To see the full set of coefficients, in the 'summary' code above, you can change
loadings=T
to 
loadings=T,cutoff=0


c) What can you say about the principal components associated with 
   eigenvalues that are the same value?


# 2)  Suppose a multivariate data set has sample covariance matrix S =

[36  5
  5  4]

a) Determine both principal components for such a data set, using a PCA based on S.  
(Refer to HELPFUL NOTE and R TIP from problem 1).
   Remember that a principal component is a linear combination of the original variables,
   so your answer should be in the form of linear combinations of X1 and X2.

b) Determine the correlation matrix R that corresponds to the covariance matrix S.

c) Determine both principal components for such a data set, using a PCA based on R.
Are the PCA results different from those in part (a)?  If so, try to explain why they are different.

HELPFUL NOTE 2:  We can do a PCA where the correlation matrix is input rather than the 
data matrix using code such as:

my.pc <- princomp(covmat=my.R); summary(my.pc, loadings=T)

where my.R is the correlation matrix.


########################################################################################################
NOTE: For EACH of the following problems, also write (at least) a couple of paragraphs explaining the
choices you made in doing the PCA and explaining in words what substantive conclusions 
about the data that you can draw from the PCA.  You can use relevant results and 
graphics to support your conclusions.
########################################################################################################

#### Only graduate students are required to do part (b) of problem 5.  It is extra credit for undergraduates.

# 3) Do and interpret a principal components analysis on the book's Exercise 3.3 data, with
the given correlation matrix, which can be entered into R with the following code:

my.cor.mat <- matrix(c(1,.402,.396,.301,.305,.339,.340,
.402,1,.618,.150,.135,.206,.183,
.396,.618,1,.321,.289,.363,.345,
.301,.150,.321,1,.846,.759,.661,
.305,.135,.289,.846,1,.797,.800,
.339,.206,.363,.759,.797,1,.736,
.340,.183,.345,.661,.800,.736,1),
ncol=7, nrow=7, byrow=T);

As mentioned in the book, the 7 variables are 'head length', 'head breadth', 'face breadth', 
'left finger length', 'left forearm length', 'left foot length','height'.

The measurements were taken on 3000 convicted criminals and were originally published in 1902.

Obtain the principal components (including choosing an appropriate number of PCs).
Also make an attempt to interpret your PCs.

See HELPFUL NOTE 2 and R TIP above, which will also help with this problem.

# 4) Perform an appropriate PCA on the "Contents of Foodstuffs" data set from the course web page, 
including the appropriate display of PC scores.  Also make an attempt to interpret your PCs.

*The "Contents of Foodstuffs" data set is given on the course web page.
Full descriptions of the observation names, in order, are given in the vector below.

This R code will read in the data:

food.full <- read.table("http://people.stat.sc.edu/hitchcock/foodstuffs.txt", header=T)
food.labels <- as.character(food.full[,1])
food.data <- food.full[,-1]
food.descriptions <- c('beef_braised','hamburger','beef roast','beef_steak','beef_canned','chicken_broiled',
'chicken_canned','beef_heart','lamb_leg_roast','lamb_shoulder_roast','ham_smoked','pork_roast','pork_simmered',
'beef_tongue','veal_cutlet','bluefish_baked','clams_raw','clams_canned','crabmeat_canned','haddock_fried',
'mackerel_broiled','mackerel_canned','perch_fried','salmon_canned','sardines_canned','tuna_canned','shrimp_canned')

#To see the labels and descriptions together:
cbind(food.labels,food.descriptions)

# 5) 

*The CHAPTER 3 U.S. air pollution data set (from chapter 3, DIFFERENT from the Chapters 1-2 air pollution data) 
is given on the course web page.  This R code will read in the data:

USairpol.full <- read.table("http://people.stat.sc.edu/hitchcock/usair.txt", header=T)
city.names <- as.character(USairpol.full[,1])
USairpol.data <- USairpol.full[,-1]

*These are the descriptions of the variables in the data set.  These are each measured on 41 U.S. cities.

SO2=sulphur dioxide content of air (a measure of air pollution)
Temp=average annual temperature in degrees F
Manuf=number of manufacturing enterprises employing 20 or more workers
Pop=Population size (1970 census) in thousands
Wind=Average annual wind speed in miles per hour
Precip=Average annual precipitation in inches
Days=Average number of days with precipitation per year

(a) Perform an appropriate PCA on the "CHAPTER 3 U.S. air pollution data set" data set, 
including the appropriate display of PC scores.  Also make an attempt to interpret your PCs. 
Identify any notable outliers.

(b, only required for graduate students) Perform another PCA after removing one or two of the most severe outliers.  Comment on any differences in the PCA results from this PCA compared to the PCA on the full data set.

# To remove one or two outlying observations, try code such as:

out.row1 <-        # Put the row number of the most severe outlier here (after the arrow)
USairpol.data.reduced <- USairpol.data[-out.row1,]
out.row2 <-        # Put the row number of another outlier here (after the arrow)
USairpol.data.reduced.more <- USairpol.data[-c(out.row1,out.row2),]