Assignment 1 (Due: Wednesday, September 4th)
1.4.2 (0), (i), (ii), (iv)
1.6.1 Write an R function to calculate b_{1,p} and b_{2,p} (the multivariate skewness and kurtosis) for a data set. A sample data set is at: cork data. The values should be 4.476 and 22.957.
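As a hedged sketch of one way such a function might look (the function name `mardia` is made up, and it assumes the divisor-n (MLE) covariance matrix, which is what the quoted cork values appear to use):

```r
# Sketch: Mardia's multivariate skewness b_{1,p} and kurtosis b_{2,p}.
# Assumes the biased (divisor-n) covariance estimate; 'mardia' is a
# hypothetical name, not from the assignment.
mardia <- function(X) {
  X  <- as.matrix(X)
  n  <- nrow(X)
  Xc <- scale(X, center = TRUE, scale = FALSE)  # center the columns
  S  <- crossprod(Xc) / n                       # MLE covariance (divisor n)
  D  <- Xc %*% solve(S) %*% t(Xc)              # Mahalanobis cross-products
  b1p <- sum(D^3) / n^2                        # average of cubed cross-products
  b2p <- mean(diag(D)^2)                       # average squared Mahalanobis distance
  c(b1p = b1p, b2p = b2p)
}
```

Running this on the cork data should reproduce (approximately) the values 4.476 and 22.957 quoted above.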
Assignment 2 (Due: Monday, September 17th)
1.5.2, 1.5.3, 8.6.1
Let S be the matrix A in (A.6.16). For which value(s) of rho is the Mahalanobis transformation orthogonal?
Assignment 3 (Due: Monday, October 8th)
1) Show that c in problem 2.2.1 must equal p(p+1).
2) Let X ~ N_p(0, I). Define the (p-1)x1 vector U by u_i = (x_i + x_{i+1})/2. Find the distribution of U, including specifying the values of the covariance matrix.
3) Prove that the result (2.7.8) on page 49 holds for all spherically symmetric random variables. You may use the standard results concerning the univariate normal, chi-square, and F distributions from pages 712-713 without proof. (Hint: Write (2.7.8) in terms of the r and theta from the polar transformation, letting n + m = p and considering theta_{p-m} in terms of the x_i, x_{p-m} in terms of the r and theta, and r in terms of the x_i. Consider a particular, well chosen, distribution for X.)
4) Show that the multivariate Cauchy with parameters mu and sigma is stable (see problem 2.6.5 for the p.d.f. and c.f.).
PS 1 (Due: Wednesday, October 24th)
To Be Worked on Individually
1) Sequel to problem 8.6.1:
Assignment 4 (Due: Monday, November 12th)
1) Problem 3.4.14
2) Consider variables 6-9 of the psych data (paragraph comprehension through word meaning). Comment on how well these data seem to approximate multivariate normality. You do not need to show all of the graphs used to reach your conclusion, just representative ones.
3) For this question, assume you are satisfied with the multivariate normality of the data set from question two (regardless of what you found there). Conduct a hypothesis test (or set of hypothesis tests) to determine how many principal components should be kept, based on the logic discussed in section 8.4.3. Adjust for multiple comparisons if more than one test is conducted.
4) The data set http://www.stat.sc.edu/~habing/courses/data/cereal.txt contains 235 ratings of breakfast cereals on 25 different attributes. Construct a parsimonious factor model for the attributes that describe the quality of cereal. The model should have interpretable factors where each variable tends to load on only one factor. Where possible, give an informal name to each factor. (For this assignment, ignore the fact that we have information on who did the ratings and that we know which brand of cereal was rated.)
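For question 3, one version of the section 8.4.3 logic is a likelihood-ratio test that the smallest k eigenvalues of the covariance matrix are equal (so the corresponding components carry no structure). A minimal sketch of that test, under the assumption that the statistic n k log(arithmetic mean / geometric mean) of the trailing eigenvalues is approximately chi-squared with (k+2)(k-1)/2 degrees of freedom (check this against 8.4.3 before using it; the function name `iso.test` is made up):

```r
# Sketch of the isotropy test for the last k eigenvalues.
# Assumed form: stat = n * k * log(a0 / g0), df = (k+2)(k-1)/2,
# where a0 and g0 are the arithmetic and geometric means of the
# k smallest eigenvalues -- verify against section 8.4.3.
iso.test <- function(eigvals, k, n) {
  lam <- tail(eigvals, k)          # the k smallest eigenvalues
  a0  <- mean(lam)                 # arithmetic mean
  g0  <- exp(mean(log(lam)))       # geometric mean
  stat <- n * k * log(a0 / g0)
  df   <- (k + 2) * (k - 1) / 2
  c(stat = stat, df = df, p.value = pchisq(stat, df, lower.tail = FALSE))
}
```

Applying this for k = p-1, p-2, ... and adjusting the resulting p-values for multiple comparisons is one way to organize question 3.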
Assignment 5 (Due: Monday, December 3)
The data set http://www.stat.sc.edu/~habing/courses/data/g2.txt contains the grades of 95
students in a STAT 110 course on 8 homework assignments and 3 exams. It also
contains which college they are from: Arts and Sciences, Communications, or
Hotel, Restaurant, and Tourism.
1) Conduct a canonical correlation analysis to examine the relationship
between the 8 homework grades and the three exam grades.
2) Conduct a linear discriminant analysis to predict the college of the student
from the homework and exam scores.
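A hedged sketch of how the two analyses might be set up with `cancor()` and `MASS::lda()`. The column names (hw1-hw8, exam1-exam3, college) are made up, and the placeholder data below stands in for g2.txt so the sketch runs on its own; the actual file's layout and names may differ.

```r
library(MASS)  # for lda()

# In practice: g2 <- read.table("http://www.stat.sc.edu/~habing/courses/data/g2.txt", ...)
# Placeholder data with the same shape (95 students, 8 HW, 3 exams, a college label):
set.seed(1)
g2 <- data.frame(matrix(rnorm(95 * 11), 95, 11))
names(g2) <- c(paste0("hw", 1:8), paste0("exam", 1:3))
g2$college <- sample(c("AS", "Comm", "HRT"), 95, replace = TRUE)

# 1) Canonical correlation between homework and exam grades
cc <- cancor(g2[, 1:8], g2[, 9:11])
cc$cor                                 # at most min(8, 3) = 3 canonical correlations

# 2) Linear discriminant analysis predicting college
fit  <- lda(college ~ ., data = g2)
pred <- predict(fit)$class
table(pred, g2$college)                # resubstitution confusion matrix
```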
PS 2
1) Practitioners often want to use factor analysis on data sets that aren't normally distributed (or even continuous). Consider the case where 1,000 examinees answer 41 questions on an exam, and the quality of their solutions can be represented by a (normally distributed) 1-factor model. Simulate an n = 1,000, p = 41 data set from a 1-dimensional factor model where the lambda_i = 0.5 and the psi_ii are 0.25. Demonstrate that a 1-factor model is appropriate for the data and fits well.
Now consider the case where the instructor decides to award only scores of 0 or 1 on each question and grades the questions progressively more strictly. Simulate this situation by defining y_1 to be 1 if x_1 < -2 and 0 if x_1 >= -2. Similarly, for y_2 use -1.9, for y_3 use -1.8, ..., and for y_41 use +2. Conduct an exploratory factor analysis on this data, indicating the number of factors suggested and the quality of the fit. Can you relate the values of the unrotated factor solution to the cut-off values?
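A hedged sketch of the simulation setup (the generating step and the dichotomization only; it is not a full solution, and the variable names are made up):

```r
# One-factor model: x_ij = lambda * f_i + e_ij with lambda = 0.5 and
# unique variance psi = 0.25 (so sd of the error is 0.5).
set.seed(1)
n <- 1000; p <- 41
f <- rnorm(n)                                     # common factor scores
X <- 0.5 * f %o% rep(1, p) +                      # loadings all 0.5
     matrix(rnorm(n * p, sd = 0.5), n, p)         # unique parts, psi_ii = 0.25

fa1 <- factanal(X, factors = 1)                   # should fit the continuous data well

# Progressively stricter 0/1 grading: cut-off -2 for item 1 up to +2 for item 41
cuts <- seq(-2, 2, by = 0.1)                      # length 41
Y <- sweep(X, 2, cuts, "<") * 1                   # y_ij = 1 if x_ij < cut_j, else 0
```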
2) Consider the subsample of the Bears data from class that consists of:
Helpful code:
3) Consider the stress function that could be entered in R as:
Bonus
Select one of the following two:
1) 14.2.4 (Notice this shows that basing the distance off the correlation satisfies the triangle inequality.)
or
2) Consider the following data set giving the counts of hair color and eye color combinations for a sample of people:
                     Hair
Eye       Fair   Red   Medium   Dark   Black
Light      688   116      584    188       4
Blue       326    38      241    110       3
Medium     343    84      909    412      26
Dark        98    48      403    681      81

Distance measures between the columns of the data could be defined using the chi-squared statistic:
coldist <- function(x){
  C <- ncol(x)
  r <- nrow(x)
  ndoti <- apply(x, 2, sum)
  N <- sum(x)
  pij <- matrix(0, ncol = C, nrow = r)
  dij <- matrix(0, ncol = C, nrow = C)
  for (i in 1:r){
    for (j in 1:C){
      pij[i, j] <- x[i, j] / ndoti[j]
    }
  }
  pidot <- apply(x, 1, sum) / N
  for (i in 1:C){
    for (j in 1:C){
      for (k in 1:r){
        dij[i, j] <- dij[i, j] + ((pij[k, i] - pij[k, j])^2) / pidot[k]
      }
    }
  }
  sqrt(dij)
}

and a row distance could be defined as

rowdist <- function(x){ coldist(t(x)) }

Conduct separate 2-dimensional classical multidimensional scalings for hair color and eye color. (As an aside, the next step in the method called correspondence analysis would be to make sure the correct signs were used and to plot the graphs on top of each other to see the joint relationship.)
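A hedged sketch of how the table might be entered and the two scalings run with `cmdscale()`, using the `coldist` and `rowdist` functions above (the object names `hair.eye`, `mds.hair`, and `mds.eye` are made up):

```r
# Enter the hair/eye table; row and column order follow the table above.
hair.eye <- matrix(c(688, 116, 584, 188,  4,
                     326,  38, 241, 110,  3,
                     343,  84, 909, 412, 26,
                      98,  48, 403, 681, 81),
                   nrow = 4, byrow = TRUE,
                   dimnames = list(Eye  = c("Light", "Blue", "Medium", "Dark"),
                                   Hair = c("Fair", "Red", "Medium", "Dark", "Black")))

# Classical MDS on the chi-squared distances, keeping 2 dimensions each
mds.hair <- cmdscale(coldist(hair.eye), k = 2)   # 5 hair colors in the plane
mds.eye  <- cmdscale(rowdist(hair.eye), k = 2)   # 4 eye colors in the plane
```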