STAT 530, Fall 2022
--------------------

Homework 1
-----------


1)  Suppose our multivariate data have sample covariance matrix S =

[  2   -3    2
  -3    6    4
   2    4    3 ]

Note you can define this matrix in R with the code:

my.S <- matrix(c(2,-3,2,-3,6,4,2,4,3), byrow=TRUE, nrow=3, ncol=3)

a) Based on this covariance matrix, how many columns (variables) does the original data matrix have?
   Can you tell how many rows the original data matrix has?

b) Find and write (or print) the inverse of S.

c) Find and write (or print) the correlation matrix for this data set.

NOTE:  The R functions 'matrix' and 'solve' can help with this problem.
In R code:
solve(M)
will give the inverse of a matrix M.
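
NOTE: As a minimal sketch of how these functions fit together, the code below uses a small
hypothetical 2 x 2 covariance matrix 'ex.S' (made up for illustration, not the S from this problem).
It also uses the base R function 'cov2cor', which is one possible way to get a correlation matrix
from a covariance matrix.

```r
# Hypothetical 2 x 2 covariance matrix (illustration only, not the S above)
ex.S <- matrix(c(4, 2,
                 2, 9), byrow=TRUE, nrow=2, ncol=2)

# Inverse of ex.S
ex.S.inv <- solve(ex.S)
print(ex.S.inv)

# Check: a matrix times its inverse should be (numerically) the identity matrix
print(round(ex.S %*% ex.S.inv, 10))

# cov2cor() converts a covariance matrix into the corresponding correlation matrix
ex.R <- cov2cor(ex.S)
print(ex.R)
```

If 'solve' reports that the matrix is singular, the matrix has no inverse.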


2)  Suppose a multivariate data set has sample covariance matrix S =

[16  -2   4
 -2   9  -1
  4  -1  25]

(See the hint in problem 1 for how to define a matrix in R.)

a) Determine the matrix D^{-1/2}, where D^{-1/2} is defined in the Chapter 1 notes.

b) Calculate and print the sample correlation matrix R for this data set.

NOTE: In R code: 
M %*% N 
performs the matrix multiplication of M times N.
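
NOTE: The sketch below shows matrix multiplication on a small hypothetical matrix 'ex.S'
(made up for illustration). It assumes that D is the diagonal matrix holding the sample variances,
so that D^{-1/2} has 1/(standard deviation) on its diagonal and R = D^{-1/2} S D^{-1/2}.
Check this assumption against the Chapter 1 notes before relying on it.

```r
# Hypothetical 2 x 2 covariance matrix (illustration only)
ex.S <- matrix(c(4, 2,
                 2, 9), byrow=TRUE, nrow=2, ncol=2)

# Assumed definition: diagonal matrix with 1/sqrt(variance) on the diagonal
ex.D.neg.half <- diag(1 / sqrt(diag(ex.S)))
print(ex.D.neg.half)

# Correlation matrix via matrix multiplication
ex.R <- ex.D.neg.half %*% ex.S %*% ex.D.neg.half
print(ex.R)
```

Note that '%*%' is matrix multiplication, while '*' multiplies two matrices element by element.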


3) An air pollution data set (somewhat different than the US Air pollution data we studied in our class examples)
is given on the course web page.  
For this problem, we will focus only on the first 16 observations (cities).
You can read the data into R (as a data frame) with the code:

airpol.full <- read.table("http://people.stat.sc.edu/hitchcock/airpoll.txt", header=TRUE)
city.names <- as.character(airpol.full[1:16,1])
airpol.data.sub <- airpol.full[1:16,2:8]
# if you want to make the row labels be the city names, add this line of code:
airpol.data.sub <- data.frame(airpol.data.sub, row.names=city.names)

# Perform your analysis on the 'airpol.data.sub' subset.

a) Use R to calculate the sample covariance matrix and the sample correlation matrix for this data subset.  
Print these, rounding values to two digits.
Identify which pairs of variables seem to be strongly associated.  Write a paragraph describing the
nature (strength and direction) of the relationship between these variable pairs.

NOTE:  The variables measured on each city are:
Rainfall (mean annual precipitation in inches)
Education (median school years completed by those over 25 in 1960)
Population density (population per square mile in urbanized area in 1960)
Nonwhite (percent of urban area residents that are nonwhite)
NOX (relative pollution potential of Nitrogen Oxide)
SO2 (relative pollution potential of Sulfur Dioxide)
Mortality Rate (total age-adjusted mortality rate, in deaths per 100,000)

Both Nitrogen Oxide and Sulfur Dioxide are measures of air pollution.
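
NOTE: For part (a), the functions 'cov', 'cor', and 'round' may help. The sketch below applies them
to a small hypothetical data frame 'ex.data' (made-up values standing in for 'airpol.data.sub').
Here round(x, 2) rounds to two decimal places, which is one reading of "two digits".

```r
# Hypothetical toy data frame (illustration only, not the air pollution data)
ex.data <- data.frame(x1 = c(1.2, 2.5, 3.1, 4.8, 5.0),
                      x2 = c(10.4, 9.1, 7.7, 6.2, 5.9),
                      x3 = c(3.3, 1.9, 4.4, 2.8, 3.6))

# Sample covariance matrix and sample correlation matrix
ex.cov <- cov(ex.data)
ex.cor <- cor(ex.data)

# Print both, rounded to two decimal places
print(round(ex.cov, 2))
print(round(ex.cor, 2))
```

Correlations near +1 or -1 indicate strong linear association; the sign gives the direction.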

b) Use R to calculate the distance matrix for these observations (after scaling the variables by 
dividing each variable by its standard deviation).  Write a paragraph describing the four most 
similar pairs of cities and the four most different pairs of cities, giving evidence from the distance matrix.  
For the pairs that are different, look back at the data to help you describe which variables 
in particular make the observations different.
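
NOTE: One possible way to do the scaling and distance calculation is sketched below on a small
hypothetical data frame 'ex.data' (made-up values and row labels). It assumes Euclidean distance,
which is the default for the 'dist' function. The standard deviations are passed explicitly to 'sweep',
since scale(x, center=FALSE) on its own divides by the root mean square rather than the standard deviation.

```r
# Hypothetical toy data frame with made-up row labels (illustration only)
ex.data <- data.frame(x1 = c(1.2, 2.5, 3.1, 4.8, 5.0),
                      x2 = c(10.4, 9.1, 7.7, 6.2, 5.9),
                      x3 = c(3.3, 1.9, 4.4, 2.8, 3.6),
                      row.names = c("A", "B", "C", "D", "E"))

# Divide each variable (column) by its standard deviation
ex.sd <- sapply(ex.data, sd)
ex.scaled <- sweep(as.matrix(ex.data), 2, ex.sd, FUN = "/")

# Euclidean distance matrix between the rows (observations)
ex.dist <- dist(ex.scaled)
print(round(ex.dist, 2))

# List every pair of observations, sorted from most similar to most different
ex.mat <- as.matrix(ex.dist)
ex.pairs <- which(lower.tri(ex.mat), arr.ind = TRUE)
ex.sorted <- data.frame(obs1 = rownames(ex.mat)[ex.pairs[, "row"]],
                        obs2 = colnames(ex.mat)[ex.pairs[, "col"]],
                        distance = ex.mat[ex.pairs])
ex.sorted <- ex.sorted[order(ex.sorted$distance), ]
print(ex.sorted)
```

The first rows of 'ex.sorted' are the closest pairs and the last rows are the most distant pairs.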

c) Give a plot that will help assess whether this data set comes from a multivariate normal distribution.
   What is your conclusion, based on the plot?
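
NOTE: One common choice for such a plot is a chi-square plot of the ordered squared Mahalanobis
distances against chi-square quantiles. The sketch below assumes that approach and uses simulated
data 'ex.data' (generated here for illustration, not the air pollution data).

```r
# Simulated data: 20 observations on 3 variables (illustration only)
set.seed(530)
ex.data <- matrix(rnorm(60), nrow=20, ncol=3)

# Squared Mahalanobis distance of each observation from the sample mean vector
ex.d2 <- mahalanobis(ex.data, colMeans(ex.data), cov(ex.data))

# Chi-square quantiles with degrees of freedom = number of variables
ex.q <- qchisq(ppoints(nrow(ex.data)), df = ncol(ex.data))

# Chi-square plot: points near the reference line are consistent
# with multivariate normality
plot(ex.q, sort(ex.d2),
     xlab = "Chi-square quantiles",
     ylab = "Ordered squared Mahalanobis distances")
abline(0, 1)
```

Strong curvature or a few points far above the line suggest a departure from multivariate normality.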

GRADUATE STUDENTS ONLY (extra credit for undergrads):
Regardless of your answer to 3(c), attempt some Box-Cox transformation(s) on the 'airpol.data.sub' data.
Can any transformation(s) improve the multivariate normality?  Discuss this in a paragraph.
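
NOTE: As a minimal sketch, the Box-Cox transformation can be written by hand as below. The function
name 'box.cox' and the simulated data 'ex.skewed' are hypothetical, introduced only for illustration.
The transformation requires all values to be positive, and the choice of lambda for each variable is
left to you (lambda = 0 corresponds to the log transformation).

```r
# Hand-written Box-Cox transformation (hypothetical helper, not from a package)
box.cox <- function(x, lambda) {
  if (lambda == 0) log(x) else (x^lambda - 1) / lambda
}

# Simulated right-skewed positive data (illustration only)
set.seed(530)
ex.skewed <- matrix(rlnorm(60), nrow=20, ncol=3)

# Apply the same lambda to every column; here lambda = 0 (log transformation)
ex.trans <- apply(ex.skewed, 2, box.cox, lambda = 0)

# Compare a summary of the first variable before and after transforming
print(summary(ex.skewed[, 1]))
print(summary(ex.trans[, 1]))
```

After transforming, the normality plot can be redrawn on the transformed data to see whether it improved.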