Fall 2001
Statistics 530 - Statistical Methods I
Tuesday/Thursday 2:00-3:15
210A LeConte

Course Website: http://www.stat.sc.edu/~habing/courses/530F01.html

11. Due: Thursday, December 6th

For this assignment you will need to pick 5 locations on various sides of the Horseshoe. (At least one must be a building on the south side, one must be a building on the north side, and one must be a "landmark" in front of McKissick.) Print out the above map of the Horseshoe and mark your five chosen points on it.

Working in groups if you want, measure all of the pairwise distances between these five landmarks (10 distances in total). One way of doing this is to use a very long tape measure; another is to simply pace off the distances. If you pace off the distances, make sure that you keep the size of your paces constant... this could take work if several of you collaborate.

(Ok, ok, I can't take it anymore! You can instead simply use a ruler to measure the distances off the printed-out map if you want... but I will give a bit extra to those people who actually do go to the Horseshoe and measure between the buildings/landmarks.)

1) Use multidimensional scaling to construct a map of the five landmarks using the distance matrix you constructed.
2) After rotating and flipping (if needed), visually compare your map to the map you printed out and describe how closely they agree.
3) Why would this method be much less useful if you were, say, asked to construct the map for 20 locations instead of 5?
4) Perform a cluster analysis using the distance matrix you constructed. Does it return any useful information?
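For part 1, the idea behind classical (metric) multidimensional scaling can be sketched in a few lines. This is a minimal illustration only, assuming the measured distances are roughly Euclidean. The helper names `jacobi_eigen` and `classical_mds` and the landmark coordinates are hypothetical, made up for this example; for the assignment itself you would normally use a built-in routine such as `cmdscale` in R.

```python
import math

def jacobi_eigen(sym, sweeps=30):
    """Eigenvalues and eigenvectors (columns) of a symmetric matrix via Jacobi rotations."""
    n = len(sym)
    a = [row[:] for row in sym]
    v = [[float(i == j) for j in range(n)] for i in range(n)]
    for _ in range(sweeps):
        for p in range(n - 1):
            for q in range(p + 1, n):
                if a[p][q] == 0.0:
                    continue
                # Rotation angle that zeroes the off-diagonal entry a[p][q].
                if a[p][p] == a[q][q]:
                    theta = math.pi / 4
                else:
                    theta = 0.5 * math.atan(2 * a[p][q] / (a[q][q] - a[p][p]))
                c, s = math.cos(theta), math.sin(theta)
                for m in (a, v):  # column update: m <- m * J
                    for k in range(n):
                        mkp, mkq = m[k][p], m[k][q]
                        m[k][p] = c * mkp - s * mkq
                        m[k][q] = s * mkp + c * mkq
                for k in range(n):  # row update: a <- J^T * a
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k] = c * apk - s * aqk
                    a[q][k] = s * apk + c * aqk
    return [a[i][i] for i in range(n)], v

def classical_mds(dist, dims=2):
    """Coordinates whose pairwise distances approximate the symmetric matrix dist."""
    n = len(dist)
    sq = [[d * d for d in row] for row in dist]
    row_mean = [sum(r) / n for r in sq]
    grand = sum(row_mean) / n
    # Double-center the squared distances: B = -1/2 * J * D^2 * J.
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + grand)
          for j in range(n)] for i in range(n)]
    vals, vecs = jacobi_eigen(b)
    top = sorted(range(n), key=lambda k: vals[k], reverse=True)[:dims]
    # Scale each leading eigenvector by the square root of its eigenvalue.
    return [[vecs[i][k] * math.sqrt(max(vals[k], 0.0)) for k in top]
            for i in range(n)]

# Hypothetical landmark positions in meters (made up for illustration only).
truth = [(0, 0), (100, 0), (100, 40), (0, 40), (50, 60)]
dist = [[math.dist(p, q) for q in truth] for p in truth]

coords = classical_mds(dist)
recovered = [[math.dist(p, q) for q in coords] for p in coords]
max_error = max(abs(dist[i][j] - recovered[i][j])
                for i in range(5) for j in range(5))
print("largest distance discrepancy:", max_error)
```

The recovered coordinates are only determined up to rotation, reflection, and shift, which is why part 2 asks you to rotate and flip before comparing.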

10. Due: Thursday, November 29th

Consider the bears data set from homework 2. Conduct a canonical correlation analysis to predict the body measurements (body length, chest girth, and weight) from the head measurements (length of head, width of head, and neck girth).

1) Why does SAS only calculate three pairs of canonical variates for this data set?
2) Report the overall test of whether there is any relationship between the x variables as a group and the y variables as a group.
3) How many pairs of the canonical variates are statistically significantly correlated? (For each pair, give both the p-value and the R² and briefly describe the strength of the relationship.)
4) Give a brief summary of what you think the significant correlates seem to be measuring.

9. Due: Tuesday, November 13th

Choose a data set from the baseball data, the iris data, or the bumpus 2 data. Try several distance and agglomeration methods for clustering. (Give at least a partial list of which ones you tried.)
1) Which seems most reasonable? Why?
2) For the answer you chose in part 1, what seems to be the unifying theme behind the clusters? (For example, are they somewhat overlapping with some natural grouping of the observations? Or, are they based around one of the variables in particular?)
3) How close are they to the underlying group variable?
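To see why the choice of agglomeration method matters, here is a minimal sketch of agglomerative clustering on a distance matrix, comparing single linkage (`min`) with complete linkage (`max`). The function name `agglomerate` and the one-dimensional points are hypothetical, made up for illustration; for the assignment you would normally use the clustering routines in SAS or R (for example `hclust` in R).

```python
def agglomerate(dist, linkage=min):
    """Merge the two closest clusters repeatedly; return the merge history.

    linkage=min gives single linkage, linkage=max gives complete linkage.
    """
    clusters = [[i] for i in range(len(dist))]
    history = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters) - 1):
            for b in range(a + 1, len(clusters)):
                # Cluster-to-cluster distance under the chosen linkage rule.
                d = linkage(dist[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        history.append((sorted(clusters[a]), sorted(clusters[b]), d))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return history

# Hypothetical points on a line (made up for illustration only).
points = [0.0, 1.0, 2.5, 6.0, 6.5]
dist = [[abs(p - q) for q in points] for p in points]

single = agglomerate(dist, linkage=min)
complete = agglomerate(dist, linkage=max)
for left, right, height in single:
    print("single  ", left, right, height)
for left, right, height in complete:
    print("complete", left, right, height)
```

Both methods can merge the same observations in the same order yet report very different merge heights, so the dendrograms can suggest different numbers of clusters.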

8. Due: Tuesday, November 6th

Consider the data set http://www.stat.sc.edu/~habing/courses/data/bumpus2.txt. This is a larger version of the data set found in Table 1.1 on pages 2-3. In this case the data consist of the adult male sparrows. The variables are: survived? (1=yes, 0=no), total length (mm), alar extent (tip to tip of extended wings) (mm), weight (g), length of beak and head (mm), length of humerus (in), length of femur (in), length of tibiotarsus (in), width of skull (in), and length of keel of sternum (in).

1) Perform a discriminant analysis to distinguish between the survivors and non-survivors. Compare the measures of accuracy of the discriminant rule obtained from the entire sample's classification rates and from the jackknife method.
2) Perform a multiple logistic regression to predict the survival of the bird from the other variables. Does the logistic regression model fit? Does the model distinguish between the survivors and non-survivors? Construct a side-by-side box plot of the predicted probability of surviving for the survivor and non-survivor groups.
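The difference between the whole-sample (resubstitution) classification rate and the jackknife (leave-one-out) rate in part 1 can be illustrated with a much simpler rule. The sketch below swaps in a one-variable nearest-group-mean classifier as a stand-in for a full discriminant analysis. The function names and the data values are hypothetical, made up for illustration, and are not from the sparrow data.

```python
def nearest_mean_label(x, groups):
    """Assign x to the group whose sample mean is closest."""
    means = {g: sum(v) / len(v) for g, v in groups.items()}
    return min(means, key=lambda g: abs(x - means[g]))

def resubstitution_rate(groups):
    """Fraction classified correctly when the rule is built from all observations."""
    hits = sum(nearest_mean_label(x, groups) == g
               for g, values in groups.items() for x in values)
    return hits / sum(len(values) for values in groups.values())

def leave_one_out_rate(groups):
    """Fraction classified correctly when each observation is held out in turn."""
    hits = total = 0
    for g, values in groups.items():
        for i, x in enumerate(values):
            held_out = dict(groups)
            held_out[g] = values[:i] + values[i + 1:]  # rebuild rule without x
            hits += nearest_mean_label(x, held_out) == g
            total += 1
    return hits / total

# Hypothetical measurements for two groups (made up for illustration only).
groups = {"A": [1.0, 2.0, 3.0, 4.9], "B": [5.1, 7.0, 8.0, 9.0]}

print("resubstitution:", resubstitution_rate(groups))
print("leave-one-out: ", leave_one_out_rate(groups))
```

Because each observation helps build the rule that then classifies it, the whole-sample rate tends to be optimistic; holding the observation out removes that advantage.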

7. Due: Thursday, November 1st

In assignments 4 and 5 we examined the IRIS data. Use that data again to conduct a linear discriminant analysis.
1) Summarize the success in classifying using only the first linear discriminant function, using both of the first two functions, and using all four functions. Compare the success of these three.
2) Plot the observations by the first two linear discriminant functions. Compare this plot to the one in question 2 of homework 5.

6. Due: Thursday, October 18th

The data from the personality test are contained in two separate files. http://www.stat.sc.edu/~habing/courses/data/pers1.txt contains the observed scores based on the columns of the score sheet. (Column 1 was E/I, columns 2 and 3 were S/N, columns 4 and 5 were T/F, and columns 6 and 7 were J/P.) The second file, http://www.stat.sc.edu/~habing/courses/data/pers2.txt, contains the scores for all 70 questions for each person, arranged so that questions 1-10 go with column 1, 11-20 go with column 2, etc.

1) Why can we probably not use the data set pers2.txt for conducting exploratory factor analysis?
2) Conduct principal components analysis on pers1.txt. Can you find any interpretation for the first four components? Does it come close to reflecting what the structure of the columns should be?
3) Conduct exploratory factor analysis on pers1.txt using the varimax rotation and four factors. Can you find any interpretation for the first four factors? Does it come close to reflecting what the structure of the columns should be?
4) Draw a path diagram for the model you think the designers of the test seem to think should hold for the pers1.txt data.

5. Due: Tuesday, October 9th

For the iris data in assignment 4, conduct a principal components analysis using either SAS or R.
1) Report the principal components and the percent of variation explained by each of them.
2) Plot the irises by their first two principal components, using a different symbol for each type of iris. Does the plot seem to verify the findings from Homework assignment 4?
3) Put into words how the first principal component is combining the variables.
4) In your opinion, does the second component seem worthwhile to report? Why or why not?

4. Due: Tuesday, October 2nd

The web page http://www.stat.sc.edu/~habing/courses/data/iris.txt contains a famous data set gathered by E. Anderson and discussed by R.A. Fisher. It concerns the measurements of 150 irises from 3 species. The four variables that the flowers are measured on are: sepal length, sepal width, petal length, and petal width. (The sepal is a leaf in the outer whorl of leaves that protect the flower.)

1) Use SAS to conduct a MANOVA to see if the mean vectors for these 3 species are equal.
2) Use R to calculate the Mahalanobis distances between the three species. Which two species are most similar?
3) Use R to tell if any of the observed irises are closer to one of the other species' means than they are to their own species' mean.
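For reference, the squared Mahalanobis distance between the mean vectors of species i and j is usually written with the pooled within-species covariance matrix. This is the standard textbook form and is stated here as an assumption about what parts 2 and 3 intend; for part 3, an individual observation x takes the place of one of the means.

```latex
D^2_{ij} = (\bar{x}_i - \bar{x}_j)^{\top} \, S_{\text{pooled}}^{-1} \, (\bar{x}_i - \bar{x}_j)
```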

3. Due: Tuesday, September 25th

1) The R-templates page has the code for producing the various Chernoff faces used in class. For each of the four face types, comment on any changes that you think are needed to make them more useful (i.e. something doesn't vary enough or is too hard to see, etc...). Incorporate one of these changes into the code and verify that it performs in the way you planned.

2) A 1966 anthropometric data set compares two communities by examining the sizes of the femur and humerus of their members. The first community had a sample size of 27, a mean femur length of 460.4 mm, and a mean humerus length of 335.1 mm. The second community had a sample size of 20, a mean femur length of 444.3 mm and a mean humerus length of 323.1 mm. The pooled covariance matrix was found to be:

561.8   380.7
380.7   343.2

Show that each of the two pooled t-tests is significant at alpha=0.05, but that Hotelling's T² is not significant at alpha=0.05.
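The arithmetic behind the three test statistics can be sketched as follows. This is a minimal illustration using the standard two-sample formulas with the summary values given above. The variable names are my own, and the comparison against the tabled t and F critical values is left to you.

```python
import math

n1, n2 = 27, 20
mean1 = (460.4, 335.1)  # community 1: femur, humerus (mm)
mean2 = (444.3, 323.1)  # community 2: femur, humerus (mm)
pooled = ((561.8, 380.7),
          (380.7, 343.2))  # pooled covariance matrix

diff = (mean1[0] - mean2[0], mean1[1] - mean2[1])
scale = 1 / n1 + 1 / n2

# Univariate pooled t statistics, each with n1 + n2 - 2 degrees of freedom.
t_femur = diff[0] / math.sqrt(pooled[0][0] * scale)
t_humerus = diff[1] / math.sqrt(pooled[1][1] * scale)

# Quadratic form diff' * S^{-1} * diff, using the closed-form 2x2 inverse.
det = pooled[0][0] * pooled[1][1] - pooled[0][1] * pooled[1][0]
quad = (pooled[1][1] * diff[0] ** 2
        - 2 * pooled[0][1] * diff[0] * diff[1]
        + pooled[0][0] * diff[1] ** 2) / det

# Hotelling's T^2 and its F transformation with p and n1 + n2 - p - 1 df.
p = 2
t_squared = quad / scale
f_stat = (n1 + n2 - p - 1) / ((n1 + n2 - 2) * p) * t_squared

print("t (femur):   ", round(t_femur, 3))
print("t (humerus): ", round(t_humerus, 3))
print("T^2:         ", round(t_squared, 3))
print("F:           ", round(f_stat, 3))
```

The strong positive covariance between the two bones is what allows the multivariate statistic to behave differently from the two univariate ones.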

2. Due: Tuesday, September 11th

The web page http://www.stat.sc.edu/~habing/courses/data/bears.txt contains a subset of a data set described in Reader's Digest (April, 1979) and Sports Afield (September, 1981).

The data set consists of several measurements for bears that were captured, measured, and released. (In the full data set, several of the bears were actually caught multiple times over a period of years.) The variables in the data set are: estimated age in months, gender (1=male, 2=female), length of head in inches, width of head in inches, girth of the neck in inches, body length in inches, girth of the chest in inches, weight in pounds, and name. The observations are currently ordered by name.

Your assignment consists of three parts.
1) Construct a scatterplot matrix for the data and give a brief summary of the relationships between the variables that the plot reveals.
2) Construct a star plot for the data, taking care to order the variables and observations to highlight the major pattern that you see in the plot. Briefly describe why you ordered them in the way you did and what conclusions you can make.
3) Assume that weight was the most important variable to be predicted. Perform a multiple regression (you don't need to do all the transformations or variable checking). Briefly summarize what you found from the regression in terms of which variables seem to be the most important for predicting weight. Also, pick a few of the points with more extreme residuals and say if the star plot provides any insight into why they may be unusual.

1. Due: Tuesday, September 4th

The web page http://www.stat.sc.edu/~habing/courses/data/draft70.txt contains the data from the 1970 draft lottery to determine the order in which people would have to report to the draft board to serve in the Vietnam War (if they were still needed).

The first column is the day of the year from 1 to 366 (leap day included), the second column is the order in which the capsule containing that date was drawn from a container, the third column is the number of the month that the day was in, the fourth column is the name of the month, and the fifth column is the day of the month. Thus, those born on September 14th (in the 9th month and the 258th day of the year) were the first to report to the draft board.

The question that you are to try and answer is whether or not it appears that the method used to randomize the birthdays was fair. That is, did each birthday have the same chance of being selected each time a number was drawn? There are a large number of ways to analyze this data, but your assignment is to produce an easily explainable graph that shows if the draft was fair or not, and a short explanation to accompany it. If it was not fair your graph and explanation should show in what way it was unfair. Be sure to include a copy of any code you used to generate the output.
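One possible numerical summary to go with such a graph is the average draw order for each month. The sketch below assumes the file is whitespace-delimited with the five columns in the order described above and possibly a header row; that layout is an assumption, and the function name `mean_draw_by_month` is hypothetical. Apart from the September 14th row, the sample rows are placeholders and not the real lottery numbers.

```python
from collections import defaultdict

def mean_draw_by_month(lines):
    """Average draw order per month number from whitespace-separated rows."""
    totals = defaultdict(lambda: [0, 0])  # month -> [sum of draw orders, count]
    for line in lines:
        fields = line.split()
        if len(fields) < 5:
            continue
        try:
            draw, month = int(fields[1]), int(fields[2])
        except ValueError:
            continue  # skip a header or any non-numeric row
        totals[month][0] += draw
        totals[month][1] += 1
    return {m: s / c for m, (s, c) in sorted(totals.items())}

# Placeholder rows (NOT the real data, except September 14th being drawn first).
sample = [
    "day draw monthnum month dayofmonth",
    "1 100 1 January 1",
    "2 200 1 January 2",
    "258 1 9 September 14",
]
print(mean_draw_by_month(sample))
```

With the real file, the same function could be applied to the lines read from a saved copy of draft70.txt, and the twelve monthly averages plotted against month number.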