Fall 2007
STAT 730 - Multivariate Analysis
Monday/Wednesday 2:30-3:45
201A LeConte

Course Website: http://www.stat.sc.edu/~habing/courses/730F07.html

Homework 1 - Due: Wednesday, September 4th
1.4.2 (0), (i), (ii), (iv)
1.6.1
Write an R function to calculate b_{1,p} and b_{2,p} (Mardia's multivariate skewness and kurtosis measures) on a data set. A sample data set is at: cork data. The values should be 4.476 and 22.957.
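One possible sketch of such a function (the name `mardia` and the use of the divisor-n covariance estimate, as in Mardia's definitions, are choices of this sketch):

```r
# Sketch: Mardia's multivariate skewness b_{1,p} and kurtosis b_{2,p}.
# Uses the biased (divisor n) covariance estimate; on the cork data the
# results should match the 4.476 and 22.957 quoted above.
mardia <- function(X) {
  X <- as.matrix(X)
  n <- nrow(X)
  Xc <- scale(X, center = TRUE, scale = FALSE)  # center each column
  S <- crossprod(Xc) / n                        # MLE covariance matrix
  G <- Xc %*% solve(S) %*% t(Xc)                # g_ij = (x_i - xbar)' S^-1 (x_j - xbar)
  c(b1p = sum(G^3) / n^2,                       # skewness: average of the g_ij^3
    b2p = mean(diag(G)^2))                      # kurtosis: average of the g_ii^2
}
```

For multivariate normal data b_{2,p} should be near p(p+2), which gives a quick sanity check on simulated data.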
Homework 2 - Due: Monday, September 17th
1.5.2, 1.5.3, 8.6.1
Let S=the matrix A in (A.6.16). For which value(s) of rho is the Mahalanobis transformation orthogonal?
Homework 3 - Due: Monday, October 8th
1) Show that c in problem 2.2.1 must equal p(p+1).
2) Let X ~ N_p(0, I). Define the (p-1)x1 vector U by u_i = (x_i + x_{i+1})/2. Find the distribution of U, including specifying the values of the covariance matrix.
3) Prove that the result (2.7.8) on page 49 holds for all spherically symmetric random variables. You may use the standard results concerning the univariate normal, chi-square, and F distributions from STAT 712-713 without proof. (Hint: Write (2.7.8) in terms of the r and theta from the polar transformation, letting n+m=p and considering theta_{p-m} in terms of the x_i, x_{p-m} in terms of r and theta, and r in terms of the x_i. Consider a particular (well-chosen) distribution for X.)
4) Show that the multivariate Cauchy distribution with parameters mu and Sigma is stable (see problem 2.6.5 for the p.d.f. and c.f.).
PS 1 - Due: Wednesday, October 24th (to be worked on individually)

1) Sequel to problem 8.6.1:
a) Show that if A_{pxp} is symmetric with all positive off-diagonal elements (and arbitrary diagonal elements), then all the coefficients of its first eigenvector must have the same sign.
b) Show that if all of the coefficients of the first eigenvector have the same sign then every other eigenvector must have at least one element with a positive coefficient and at least one with a negative coefficient.
2) Consider the class of distributions where the log of the characteristic function is given by (2.7.9) and each positive definite Omega_j is not exactly identical to any other Omega_i. Show that the sum of two independent random variables of this type with equal alpha and p (but arbitrary a, m, and Omega_j) has this same form, specifying the a, alpha, m, and Omega_j of the sum in terms of the two original sets of parameters.
3) Problem 3.2.4 parts a and b.
4) Provide the details justifying the results given in section 2.5.5 concerning the matrix normal distribution.

Homework 4 - Due: Monday, November 12th
1) Problem 3.4.14
2) Consider variables 6-9 of the psych data (paragraph comprehension through word meaning). Comment on how well this data set appears to approximate multivariate normality. You do not need to show all of the graphs used to reach your conclusion, just representative ones.
3) For this question, assume you are satisfied with the multivariate normality of the data set from question two (regardless of what you found there). Conduct a hypothesis test (or set of hypothesis tests) to determine how many principal components should be kept, based on the logic discussed in section 8.4.3. Adjust for multiple comparisons if more than one test is conducted.
4) The data set http://www.stat.sc.edu/~habing/courses/data/cereal.txt contains 235 ratings of breakfast cereals on 25 different attributes. Construct a parsimonious factor model for the attributes that describe the quality of cereal. The model should have interpretable factors where each variable tends to load only on one factor. Where possible, give an informal name to your factor. (For this assignment, ignore the fact that we have information on who did the ratings and that we know what brand of cereal was rated.)
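A hedged starting point for the factor model (the helper name `fit_fa`, the varimax rotation, and the 0.4 display cutoff are choices of this sketch; the number of factors is for you to settle):

```r
# Fit a k-factor model by maximum likelihood and display the large loadings.
fit_fa <- function(dat, k) {
  fa <- factanal(dat, factors = k, rotation = "varimax")
  print(fa$loadings, cutoff = 0.4)   # hide small loadings for interpretability
  invisible(fa)
}
# e.g. fit_fa(quality_attributes, 3) after reading cereal.txt and selecting
# the columns that describe quality
```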
Homework 5 - Due: Monday, December 3
The data set http://www.stat.sc.edu/~habing/courses/data/g2.txt contains the grades of 95 students in a STAT 110 course on 8 homework assignments and 3 exams. It also contains the college each student is from: Arts and Sciences; Communications; or Hotel, Restaurant, and Tourism.

1) Conduct a canonical correlation analysis to examine the relationship between the 8 homework grades and the three exam grades.
a) Use an alpha=0.10 level to determine how many of the canonical relationships are statistically significant (be sure to state which multiple comparison procedure you used).
b) Describe what the significant canonical variates are measuring in terms of the homework and exam grades.
c) Note that when predicting E3 from the 8 homeworks, the r-squared value is 0.2840 with a p-value of 0.0002. Why isn't the fact that this is more pronounced than your second canonical correlation a contradiction?
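A possible skeleton for question 1, written against generic inputs: HW is the n x 8 homework matrix and EX the n x 3 exam matrix. The function name `cancor_tests` and the use of Bartlett's chi-square approximation for part (a) are this sketch's choices; verify the formula against your notes before relying on it.

```r
# Canonical correlations plus Bartlett's sequential tests of
# H0: rho_j = rho_{j+1} = ... = 0 for j = 1, ..., min(p, q).
cancor_tests <- function(HW, EX) {
  n <- nrow(HW); p <- ncol(HW); q <- ncol(EX)
  r <- cancor(HW, EX)$cor                 # sample canonical correlations
  k <- length(r)
  tests <- sapply(1:k, function(j) {
    lambda <- prod(1 - r[j:k]^2)          # Wilks' lambda for the j-th test
    chisq <- -(n - 1 - (p + q + 1) / 2) * log(lambda)
    df <- (p - j + 1) * (q - j + 1)
    c(chisq = chisq, df = df, p.value = pchisq(chisq, df, lower.tail = FALSE))
  })
  list(cor = r, tests = t(tests))
}
```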

2) Conduct a linear discriminant analysis to predict the college of the student from the homework and exam scores.
a) What percent accuracy would you expect if you randomly distributed the 95 students among the three groups (so that the number in each assigned group matched the original number in that group)?
b) What percent accuracy would you expect if you assigned all of the observations to the largest of the three groups?
c) What percent accuracy did you observe using the entire data set?
d) What percent accuracy did you observe in the cross-validated data set?
e) Plot the first two dimensions of the discriminant function, showing the college membership of each student. Include on the plot the location of the mean for each of the three colleges.
f) Briefly relate this plot to the differences you saw between the accuracy estimate based on the entire data set and the accuracy estimate from the cross-validated data set.
g) Calculate the four MANOVA statistics for this problem, and comment on their significance at the alpha=0.10 level.
h) Relate your finding in (g) to your plot in (e).
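For parts (c) and (d), a hedged skeleton using lda() from the MASS package (shipped with R); the function name `lda_accuracy` and the assumption that the scores sit in one matrix with the college as a separate factor are this sketch's, so adapt it to the actual layout of g2.txt:

```r
library(MASS)   # provides lda()

# Resubstitution vs. leave-one-out accuracy for linear discriminant analysis.
lda_accuracy <- function(dat, grouping) {
  fit <- lda(dat, grouping = grouping)
  resub <- mean(predict(fit, dat)$class == grouping)   # accuracy on full data
  cv <- lda(dat, grouping = grouping, CV = TRUE)       # leave-one-out predictions
  loocv <- mean(cv$class == grouping)
  c(resubstitution = resub, cross.validated = loocv)
}
```

predict(fit, dat)$x gives the discriminant scores needed for the plot in part (e).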

PS 2
1) Practitioners often want to use factor analysis on data sets that aren't normally distributed (or even continuous). Consider the case where 1,000 examinees answer 41 questions on an exam, and the quality of their solutions can be represented by a (normally distributed) 1-factor model. Simulate an n=1,000, p=41 data set from a 1-dimensional factor model where the lambda_i = 0.5 and the psi_ii are 0.25. Demonstrate that a 1-factor model is appropriate for the data and fits well. Now consider the case where the instructor decides to award only scores of 0 or 1 on each question and grades the questions progressively more strictly. Simulate this situation by defining y_1 to be 1 if x_1 < -2 and 0 if x_1 >= -2; similarly, for y_2 use -1.9, for y_3 use -1.8, ..., and for y_41 use +2. Conduct an exploratory factor analysis on this dichotomized data set, indicating the number of factors suggested and the quality of the fit. Can you relate the values of the unrotated factor solution to the cut-off values?
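One way to set up the simulation described above (the seed and the use of sweep() for the dichotomization are choices of this sketch):

```r
set.seed(730)   # arbitrary seed, for reproducibility
n <- 1000; p <- 41
f <- rnorm(n)                                     # common factor scores
# x_ij = 0.5 f_i + e_ij with psi_ii = Var(e_ij) = 0.25, so sd(e) = 0.5
X <- 0.5 * outer(f, rep(1, p)) + matrix(rnorm(n * p, sd = 0.5), n, p)
cuts <- seq(-2, 2, by = 0.1)                      # one cutoff per question
Y <- 1 * sweep(X, 2, cuts, FUN = "<")             # y_ij = 1 if x_ij < cut_j
```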

2) Consider the subsample of the Bears data from class that consists of:
Female/Old - Allison, Edith, Smokey, Thelma, Tozia
Female/Young - Addy, Denise, Evelyn, Ness, Suzie
Male/Old - Buck, Clyde, Ian, Mighty, Pete
Male/Young - Floyd, Pasquale, Viking, Xavier, Willie
and the standardized chest girth, standardized head length, standardized head width, and standardized neck girth. Test whether the interaction between age and gender is significant in the corresponding two-way MANOVA (you may assume multivariate normality is satisfied; give the exact p-value).

Helpful code:
bears<-read.table("http://www.stat.sc.edu/~habing/courses/data/finbears.txt",header=TRUE)
# the 20 bears listed above and the 7 columns needed for this problem
b2<-bears[c(4,15,43,45,46,2,11,17,33,44,6,10,24,32,38,19,37,49,52,51), c(1,2,4,11:14)]

3) Consider the stress function that could be entered in R as:
stress<-function(x1,x2){sum((dist(x1)-dist(x2))^2)/sum(dist(x2)^2)}
where x1 is the original data set and x2 is the estimated map from classical multidimensional scaling. A common recommendation is that the fit is acceptable if the stress is less than 0.15. How many dimensions does this criterion recommend for the bears data in question 2? Why should you have been able to predict the stress value for k=4 before calculating it?
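A self-contained sketch of applying the criterion (the helper name `mds_stress` is this sketch's choice; the stress function is repeated from above so the block runs on its own):

```r
stress <- function(x1, x2) { sum((dist(x1) - dist(x2))^2) / sum(dist(x2)^2) }

# Stress of the classical scaling solution in k = 1, ..., kmax dimensions.
mds_stress <- function(dat, kmax = ncol(dat)) {
  sapply(1:kmax, function(k) stress(dat, cmdscale(dist(dat), k = k)))
}
# e.g. mds_stress(b2std), where b2std holds the four standardized measurements
```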

Bonus - Select one of the following two:

1) 14.2.4 (Notice this shows that basing the distance on the correlation satisfies the triangle inequality.)

or

2) Consider the following data set giving the counts of hair color and eye color combinations for a sample of people:

                        Hair
Eye     Fair    Red     Medium  Dark    Black
Light   688     116     584     188     4
Blue    326     38      241     110     3
Medium  343     84      909     412     26
Dark    98      48      403     681     81

Distance measures between the columns of the data could be defined using the chi-squared statistic:

coldist <- function(x) {
  # Chi-squared distances between the columns of a contingency table x
  C <- ncol(x)
  r <- nrow(x)
  ndoti <- apply(x, 2, sum)               # column totals
  N <- sum(x)
  pij <- matrix(0, ncol = C, nrow = r)    # column profiles
  dij <- matrix(0, ncol = C, nrow = C)    # squared distances
  for (i in 1:r) {
    for (j in 1:C) {
      pij[i, j] <- x[i, j] / ndoti[j]
    }
  }
  pidot <- apply(x, 1, sum) / N           # overall row proportions
  for (i in 1:C) {
    for (j in 1:C) {
      for (k in 1:r) {
        dij[i, j] <- dij[i, j] + ((pij[k, i] - pij[k, j])^2) / pidot[k]
      }
    }
  }
  sqrt(dij)
}

and a row distance could be defined as

rowdist<-function(x){coldist(t(x))}

Conduct separate 2-dimensional classical multidimensional scalings for hair color and eye color. (As an aside, the next step in the method called correspondence analysis would be to make sure the correct signs were used and to plot the graphs on top of each other to see the joint relationship.)
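A hedged sketch of the two scalings. The matrix below is the hair/eye table entered by hand, and `chisq_map` is a compact equivalent of coldist()/rowdist() above (chi-squared distances between columns, followed by classical MDS):

```r
haireye <- matrix(c(688, 116, 584, 188,  4,
                    326,  38, 241, 110,  3,
                    343,  84, 909, 412, 26,
                     98,  48, 403, 681, 81),
                  nrow = 4, byrow = TRUE,
                  dimnames = list(c("Light", "Blue", "Medium", "Dark"),
                                  c("Fair", "Red", "Medium", "Dark", "Black")))

chisq_map <- function(tab, k = 2) {
  P <- sweep(tab, 2, colSums(tab), "/")   # column profiles
  w <- rowSums(tab) / sum(tab)            # overall row proportions
  cmdscale(dist(t(P / sqrt(w))), k = k)   # chi-squared distances, then MDS
}

hair.map <- chisq_map(haireye)      # the 5 hair colors in 2 dimensions
eye.map  <- chisq_map(t(haireye))   # the 4 eye colors in 2 dimensions
```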