Fall 2008
STAT 530 - Applied Multivariate Statistics
Tuesday / Thursday 4:00-5:15
203 BA Building

Course Website: http://www.stat.sc.edu/~habing/courses/530F08.html

9 Due: Thursday, December 4th Homework 9
8 Due: Thursday, November 20th Homework 8
7 Due: Thursday, November 13th Homework 7
6 Due: Thursday, October 30th Consider the beardat PCA example we saw in class on September 25, 2008. Conduct classical multidimensional scaling on that data using Euclidean distances. Verify that the first two principal components found using the covariance matrix are the same as the first two dimensions in the scaling. (Give the code or menu options you used, and the PCA and scaling values corresponding to the first 10 bears.)
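
The comparison asked for here could be sketched in R roughly as follows. This is only an illustration on simulated data: the matrix X is a made-up stand-in, not the beardat data from class, and the variable names are assumptions. Note that the two sets of coordinates can differ in the sign of each column, so the check compares absolute values.

```r
# Simulated stand-in data (NOT the beardat data from class)
set.seed(530)
X <- matrix(rnorm(50 * 4), nrow = 50, ncol = 4)
colnames(X) <- paste0("V", 1:4)

# PCA on the covariance matrix: prcomp() centers but does not rescale by default
pca <- prcomp(X)
pc_scores <- pca$x[, 1:2]

# Classical multidimensional scaling on Euclidean distances, two dimensions
mds <- cmdscale(dist(X), k = 2)

# Largest disagreement between the two, ignoring the sign of each column
max_diff <- max(abs(abs(pc_scores) - abs(mds)))
print(max_diff)

# Side-by-side values for the first 10 observations
print(head(cbind(pc_scores, mds), 10))
```

A value of max_diff that is zero up to rounding error indicates that the two methods give the same coordinates.
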
5 Due: Thursday, October 16th Homework 5
4 Due: Thursday, October 2nd The data set testdata.txt is from Mardia, Kent, and Bibby's Multivariate Analysis. It concerns the results of 88 students on a five-part exam: 1) Closed Book on Mechanics (Calculus-like Physics); 2) Closed Book on Vectors; 3) Open Book on Algebra; 4) Open Book on Analysis (The Theory of Calculus); and 5) Open Book in Statistics. Each exam is scored separately from 0 to 100.

a) Choose either the correlation matrix or the covariance matrix for your principal components analysis and justify your choice based on the characteristics of the data.
b) Describe how much information would be lost if the data were summarized using only 1, or only 2, or only 3, or only 4 components.
c) Find an interpretation for what each of the first four components measures.
d) If only four components are used, then only some portion of each of the original five variables will be explained. One way to measure what percent of each original variable is explained by the four principal components is to use the R-squared value from a multiple regression predicting the variable from the components. Using this method, which of the five variables are explained very well by the first four components, and which are not? (See BearReg.doc for how this would be done with the Bear data from class.)
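
One possible way to carry out the regression idea in part d) in R is sketched below. The data frame scores is simulated and its column names are hypothetical; it is not testdata.txt, and this is not necessarily the approach shown in BearReg.doc. The sketch uses the covariance matrix only for illustration; part a) still asks you to make and justify that choice yourself.

```r
# Simulated stand-in for five exam scores (NOT the real testdata.txt)
set.seed(530)
n <- 88
ability <- rnorm(n)
scores <- data.frame(
  mechanics  = 40 + 12 * ability + rnorm(n, sd = 12),
  vectors    = 50 + 10 * ability + rnorm(n, sd = 9),
  algebra    = 50 +  9 * ability + rnorm(n, sd = 6),
  analysis   = 47 + 11 * ability + rnorm(n, sd = 10),
  statistics = 42 + 13 * ability + rnorm(n, sd = 12)
)

# Principal components from the covariance matrix (illustrative choice)
pca <- prcomp(scores)
comps <- pca$x[, 1:4]          # scores on the first four components

# R-squared from regressing each original variable on the four components
r2 <- sapply(names(scores), function(v) {
  summary(lm(scores[[v]] ~ comps))$r.squared
})
print(round(r2, 3))
```

Variables with R-squared close to 1 are explained almost entirely by the first four components; lower values point to variables that rely more heavily on the dropped fifth component.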

3 Due: Tuesday, September 23rd The data set sat3.txt contains the educational data for each state and the District of Columbia. Part03 is the percent of students taking the SAT in 2003; Verbal03, Math03, and Total03 are the state averages for the respective sections; and Expend01 is the per pupil educational spending in 2001.

1) Construct a graphical display to support the argument that more money is not associated with higher test scores.
2) Construct a graphical display to counter the argument in 1 by also taking the percentage of students in the state taking the exam into account.

In both cases make sure your graphical display is clearly labeled and includes appropriate captions.

3) There are a large number of ways of finding outliers in data sets (some we've seen in this class, some are obvious extensions of things we've done in this class, and some are things you might have seen in a course on multiple regression). Considering the three variables Part03, Total03, and Expend01, choose a method that seems reasonable to you to identify any states you think are outliers and briefly explain your results.
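
As one example of the many possible methods (not the required one), squared Mahalanobis distances can be compared to a chi-square cutoff. The sketch below uses simulated values under the three variable names from the assignment, with one deliberately unusual row planted at the end; it does not use sat3.txt, and the 0.975 quantile is an arbitrary illustrative cutoff.

```r
# Simulated stand-in (NOT sat3.txt); only the column names match the assignment
set.seed(530)
dat <- data.frame(
  Part03   = rnorm(51, mean = 40,   sd = 25),
  Total03  = rnorm(51, mean = 1050, sd = 60),
  Expend01 = rnorm(51, mean = 7500, sd = 1500)
)
dat[51, ] <- c(95, 1300, 15000)   # planted unusual row, for illustration only

# Squared Mahalanobis distance of each row from the vector of column means
d2 <- mahalanobis(dat, colMeans(dat), cov(dat))

# Flag rows beyond an (arbitrary) chi-square quantile with 3 degrees of freedom
cutoff <- qchisq(0.975, df = ncol(dat))
flagged <- which(d2 > cutoff)
print(flagged)
```

Because the cutoff is a 97.5% quantile, a few ordinary rows may be flagged by chance, so any flagged state still needs to be looked at and explained.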

2 Due: Tuesday, September 16th 1) Imagine that someone wanted to come up with a total score to summarize each person's view of the oil crisis (Q1-Q20).
a) Explain why it doesn't make sense to just add up all of the numbers.
b) Find the correlation matrix for the Q1-Q20 data set and suggest two separate groups of questions that might be added separately.
c) How could these two scores be combined to form a single score?

2) Check whether the data set normsamp.txt is actually multivariate normal.
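
One common check, sketched here on simulated data rather than normsamp.txt, is a chi-square q-q plot of the squared Mahalanobis distances: if the data are multivariate normal with p variables, these distances should behave roughly like a chi-square sample with p degrees of freedom. The names X, n, and p are assumptions made for the illustration.

```r
# Simulated stand-in (NOT normsamp.txt): 200 observations on 3 variables
set.seed(530)
n <- 200
p <- 3
X <- matrix(rnorm(n * p), nrow = n, ncol = p)

# Squared Mahalanobis distances from the sample mean vector
d2 <- mahalanobis(X, colMeans(X), cov(X))

# Compare the ordered distances with chi-square quantiles on p degrees of freedom
qqplot(qchisq(ppoints(n), df = p), d2,
       xlab = "Chi-square quantiles", ylab = "Squared Mahalanobis distance")
abline(0, 1)   # points near this line are consistent with multivariate normality
```

It is also worth looking at q-q plots of the individual variables, since clear non-normality in any one of them rules out multivariate normality.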

1 Due: Tuesday, September 2nd This assignment uses the oildata data set from class on the 21st and 26th.

Using R, make two variables containing the ratings of "Low Energy Use", one for males and one for females. Conduct a two sample t-test to see whether there is a difference between the genders and check the assumptions using a q-q plot. Summarize your results.

Hint: t.test(x,y), qqnorm(x), qqline(x), and look over the sample code we had for making a subset of the data in class.
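
The functions in the hint fit together roughly as in the sketch below. The data frame oil and its column names gender and lowenergy are hypothetical stand-ins, not the actual oildata layout, and the subsetting shown here may differ from the sample code used in class.

```r
# Hypothetical stand-in for oildata; the column names are assumptions
set.seed(530)
oil <- data.frame(
  gender    = rep(c("M", "F"), each = 30),
  lowenergy = sample(1:7, 60, replace = TRUE)
)

# Make one variable of ratings for each gender
males   <- oil$lowenergy[oil$gender == "M"]
females <- oil$lowenergy[oil$gender == "F"]

# Two-sample t-test (t.test() does not assume equal variances by default)
result <- t.test(males, females)
print(result)

# Normal q-q plots to check the normality assumption in each group
par(mfrow = c(1, 2))
qqnorm(males, main = "Males")
qqline(males)
qqnorm(females, main = "Females")
qqline(females)
```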