Fall 2003
Statistics 530 - Statistical Methods I
Monday/Wednesday/Friday 11:15-12:05
210A LeConte

Course Website: http://www.stat.sc.edu/~habing/courses/530F03.html

7. Due: Friday, December 5th. This assignment uses the www.stat.sc.edu/~habing/courses/data/crabs2.txt data set from the previous assignment.

1) We would like to perform a two-dimensional multidimensional scaling of the crab data using the standardized values, in the hope that the scaling will separate the four types of crabs. Why would two dimensions be desirable? Which of the three methods (Classical, Sammon, or Kruskal's nonmetric (iso)) do you think has the best chance of showing distinct clusters that are present in a data set? Why? (You shouldn't have to perform the scaling to answer this.)
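As a reminder of what "using the standardized values" means, here is a minimal sketch in Python (the assignment itself is to be done in SAS or R). It rescales each column to mean 0 and sample standard deviation 1 and then computes the Euclidean distances that any of the three scaling methods would start from. The function names and the toy numbers are made up for illustration; they are not the crab data.

```python
import math
import statistics

def standardize(rows):
    """Rescale each column to mean 0 and sample standard deviation 1."""
    cols = list(zip(*rows))
    means = [statistics.mean(c) for c in cols]
    sds = [statistics.stdev(c) for c in cols]
    return [[(x - m) / s for x, m, s in zip(row, means, sds)] for row in rows]

def distance_matrix(rows):
    """Euclidean distance between every pair of rows."""
    return [[math.dist(a, b) for b in rows] for a in rows]

# Made-up measurements (NOT the crab data); the second column is on a
# much larger scale than the first.
toy = [[1.0, 100.0], [2.0, 300.0], [3.0, 200.0], [4.0, 400.0]]

raw_d = distance_matrix(toy)            # dominated by the second column
z = standardize(toy)
std_d = distance_matrix(z)              # both columns now contribute comparably

print(round(raw_d[0][1], 2), round(std_d[0][1], 2))
```

Without standardizing, the distances are driven almost entirely by whichever variable has the largest spread, which is why the problem asks for the standardized values.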

2) Perform the scaling method you chose in part 1 on the crab data and construct a plot of the scaling where each crab is denoted solely by a dot. By looking at the separation of points in the scaling, divide the scaling into separate clusters. (Do not do any additional analyses to help you do this! You do not need to make 4 distinct clusters if you can't see them in the plot.)

3) Re-plot the scaling from part 2, this time labeling each crab by its type. Compare this plot to the best of the cluster analysis dendrograms from the previous homework. A group of six B/b crabs was apparently misclassified as O/o crabs in the cluster analysis. Guess which six crabs these are on the scaling (do not do any additional analyses to find them!). Two crabs in the cluster analysis were separated out from the remaining crabs. Guess which two crabs these are on the scaling (do not do any additional analyses to find them!).

4) Finally, repeat both the clustering and the scaling, labeling each observation by its observation number. How well did you guess in part 3?

6. Due: Monday, November 24th. www.stat.sc.edu/~habing/courses/530hmwk6F03.pdf
5. Due: Monday, November 10th. This assignment uses the data set http://www.stat.sc.edu/~habing/courses/data/orange.txt. The data set concerns several samples of orange juice from several different countries (BEL, LSP, TME, and VME). Each sample has had several chemical elements measured: boron (B), barium (BA), calcium (CA), potassium (K), magnesium (MG), manganese (MN), phosphorus (P), rubidium (RB), and zinc (ZN). The first variable is simply an ID number.

  1. How many canonical linear discriminant functions are required to perfectly classify the orange juice samples based on their chemical make-up?
  2. Construct a plot of each of the samples (labeled by their country) on their first two discriminant functions.
  3. Perform the linear discriminant analysis using only one discriminant function:
    • Interpret the first linear discriminant function.
    • How can you tell that the built in cross-validation function is not using only the first linear discriminant function?
    • Briefly describe what kind of error would be made using only the first linear discriminant function.
  4. Check whether or not the assumptions of equal covariance matrices and multivariate normality seem to be met. If either assumption does not appear to be met, describe the consequences (e.g., can we trust parts 2-3 above? what else could we not do?).
  5. Attempt to fit a logistic regression for those orange juice samples that are from either BEL or LSP. The computer gives a warning!?!? What is going on with these two groups that the estimation used for the logistic regression doesn't work? (e.g., relate it to the actual data and what the curve would have to look like; don't just repeat the error message)
4. Due: Friday, October 17th. www.stat.sc.edu/~habing/courses/530hmwk4F03.pdf
3. Due: Monday, October 6th. www.stat.sc.edu/~habing/courses/530hmwk3F03.pdf
2. Due: Monday, September 22nd. www.stat.sc.edu/~habing/courses/530hmwk2F03.pdf
1. Due: Friday, September 5th.

1) The web page http://www.stat.sc.edu/~habing/courses/data/draft70.txt contains the data from the 1970 draft lottery to determine the order in which people would have to report to the draft board to serve in the Vietnam War (if they were still needed). A container was filled with capsules, one for each day of the year. The container was then shaken, the capsules were drawn, and the order of the birthdays was recorded. This order was the order in which people were called up for the draft.

The first column in the data set is the day of the year from 1 to 366 (including the leap day), the second column is the order in which the capsule containing that date was drawn from the container, the third column is the number of the month that the day was in, the fourth column is the name of the month, and the fifth column is the day of the month. Thus, those born on September 14th (in the 9th month and the 258th day of the year) were the first to report to the draft board.
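The day-of-year arithmetic in the example above can be checked with a short Python sketch (the assignment itself is to be done in SAS or R). The sample line below follows the five-column layout just described; the values 258, 1, 9, and 14 come from the text, while the whitespace delimiter and the spelling of the month name are assumptions about the file. The year 1972 is used only because it is a convenient 366-day year.

```python
from datetime import date

def day_of_year(month, day, leap_year=1972):
    """Day number within a 366-day year (1972 is just a convenient leap year)."""
    return date(leap_year, month, day).timetuple().tm_yday

# Hypothetical record in the five-column layout described above.
# The delimiter and the month-name spelling in the real file are assumptions.
line = "258 1 9 SEP 14"
day_num, draw_order, month_num, month_name, day_of_month = line.split()

record = {
    "day_of_year": int(day_num),      # column 1: day of the year, 1 to 366
    "draw_order": int(draw_order),    # column 2: order in which the capsule was drawn
    "month": int(month_num),          # column 3: number of the month
    "month_name": month_name,         # column 4: name of the month
    "day_of_month": int(day_of_month) # column 5: day of the month
}

# September 14th should be day 258 when February has 29 days.
print(day_of_year(record["month"], record["day_of_month"]))  # prints 258
```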

The question that you are to try to answer is whether or not it appears that the method used to randomize the birthdays was fair. That is, did each birthday have the same chance of being selected? There are a large number of ways to analyze this data, but your assignment is to use either SAS or R to produce an easily explainable graph that shows whether the draft was fair or not, and a short explanation to accompany it. If it was not fair, your graph and explanation should show in what way it was unfair. Be sure to include a copy of any code you used to generate the output.

2) The web page http://www.stat.sc.edu/~habing/courses/data/bballtest.txt contains the baseball data we discussed in class. Use whichever package (SAS or R) you did not use for the first problem to construct a histogram of the number of home runs hit in 1986 (HR86) by players in the National League (League=N).