11 | Due: Thursday, December 6th |
For this assignment you will need to pick 5 locations on various sides of the
Horseshoe.
(At least one must be a building on the south side, one must be a building along the north side, and one must be a "landmark" in front of McKissick.)
Print out the above map of the Horseshoe and indicate your five chosen points.
Working in groups if you want, you need to measure off all the distances between these five landmarks. One way of doing this is to get a very long tape measure; another would be to simply pace off the distances. If you pace off the distances you need to make sure that you keep the size of your paces constant... this could take work if several of you collaborate. (Ok, ok, I can't take it anymore! You can instead simply use a ruler to measure the values off the printed-out map if you want... but I will give a bit extra to those people who actually do go to the Horseshoe and measure between the buildings/landmarks.)
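Five landmarks give ten pairwise distances. One way to keep them organized is sketched below in Python (the course itself uses SAS and R, so treat this as an illustration only). The landmark labels, pace counts, and pace length are all made up, and `build_distance_matrix` is a hypothetical helper name.

```python
from itertools import combinations

def build_distance_matrix(labels, pair_paces, pace_length):
    """Turn one pace count per pair of landmarks into a symmetric distance matrix."""
    index = {name: i for i, name in enumerate(labels)}
    expected = {frozenset(pair) for pair in combinations(labels, 2)}
    given = {frozenset(pair) for pair in pair_paces}
    if given != expected or len(pair_paces) != len(expected):
        raise ValueError("need exactly one measurement for every pair of landmarks")
    n = len(labels)
    matrix = [[0.0] * n for _ in range(n)]
    for (a, b), paces in pair_paces.items():
        distance = paces * pace_length
        matrix[index[a]][index[b]] = distance
        matrix[index[b]][index[a]] = distance  # distances are symmetric
    return matrix

# Hypothetical labels and pace counts, for illustration only
labels = ["A", "B", "C", "D", "E"]
pair_paces = {
    ("A", "B"): 52, ("A", "C"): 80, ("A", "D"): 61, ("A", "E"): 33,
    ("B", "C"): 47, ("B", "D"): 90, ("B", "E"): 58,
    ("C", "D"): 72, ("C", "E"): 64,
    ("D", "E"): 41,
}
pace_length = 0.75  # meters per pace (made-up calibration)
matrix = build_distance_matrix(labels, pair_paces, pace_length)
```

If several people pace off distances, each person would need their own `pace_length` before the numbers are combined.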
1) Use multidimensional scaling to construct a map of the five landmarks
using the distance matrix you constructed.
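For orientation, here is a minimal sketch of classical (Torgerson) multidimensional scaling in Python with NumPy. It is not the SAS or R procedure used in class, and the landmark coordinates are hypothetical values used only to manufacture a distance matrix whose answer is known.

```python
import numpy as np

def classical_mds(dist, k=2):
    """Classical MDS: find k-dimensional coordinates whose distances approximate `dist`."""
    d = np.asarray(dist, dtype=float)
    n = d.shape[0]
    # Double-center the squared distances: B = -1/2 * J D^2 J
    j = np.eye(n) - np.ones((n, n)) / n
    b = -0.5 * j @ (d ** 2) @ j
    # eigh handles symmetric matrices and returns eigenvalues in ascending order
    vals, vecs = np.linalg.eigh(b)
    order = np.argsort(vals)[::-1][:k]
    # Clip small negative eigenvalues that measurement error can produce
    scale = np.sqrt(np.clip(vals[order], 0.0, None))
    return vecs[:, order] * scale

# Hypothetical landmark positions (in paces), used only to build a test matrix
true_xy = np.array([[0, 0], [40, 0], [40, 30], [0, 30], [20, 15]], dtype=float)
dist = np.linalg.norm(true_xy[:, None, :] - true_xy[None, :, :], axis=2)

coords = classical_mds(dist)
recovered = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)
```

The recovered map can come out rotated or reflected relative to the real one, since distances alone do not fix the orientation.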
| ||||
10 | Due: Thursday, November 29th |
Consider the bears data set from homework 2. Conduct a canonical correlation
analysis to predict the body measurements (body length, chest
girth, and weight) from the head measurements (length of head, width of head,
and neck girth).
1) Why does SAS only calculate three pairs of canonical variates for this
data set? | ||||
9 | Due: Tuesday, November 13th |
Choose a data set from the baseball data, the iris data, or the
bumpus 2 data. Try several distance and agglomeration methods
for clustering. (Give at least a partial list of which ones you tried.)
1) Which seems most reasonable? Why?
2) For the answer you chose in part 1, what seems to be the unifying theme behind the clusters? (For example, are they somewhat overlapping with some natural grouping of the observations? Or, are they based around one of the variables in particular?)
3) How close are they to the underlying group variable? | ||||
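To make "distance and agglomeration methods" concrete, the following is a small sketch of agglomerative clustering in plain Python. It assumes Euclidean distance and offers only single and complete linkage; the points are made up and `agglomerate` is a hypothetical name. The SAS and R routines offer many more choices than this.

```python
from math import dist

def agglomerate(points, n_clusters, linkage="single"):
    """Merge the two closest clusters repeatedly until n_clusters remain."""
    clusters = [[i] for i in range(len(points))]
    # Single linkage uses the nearest pair of members, complete linkage the farthest
    pick = min if linkage == "single" else max

    def gap(a, b):
        return pick(dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > n_clusters:
        pairs = [(i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        i, j = min(pairs, key=lambda ij: gap(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Made-up points forming two obvious groups
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
single = agglomerate(points, 2, linkage="single")
complete = agglomerate(points, 2, linkage="complete")
```

On data this clean the two linkages agree. On real data they often do not, which is the point of trying several.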
8 | Due: Tuesday, November 6th |
Consider the data set
http://www.stat.sc.edu/~habing/courses/data/bumpus2.txt.
This is a larger version of the data set found in Table 1.1 on pages 2-3.
In this case the data consists of the adult male sparrows. The variables
are: survived? (1=yes, 0=no),
total length (mm), alar extent (tip to tip of extended wings) (mm),
weight (g), length of beak and head (mm), length of humerus (in),
length of femur (in), length of tibiotarsus (in), width of skull (in),
and length of keel of sternum (in).
1) Perform a discriminant analysis to distinguish between the survivors
and non-survivors. Compare the accuracy of the discriminant rule as measured by
the classification rates from the entire sample (resubstitution) and as measured by
the jackknife (leave-one-out) method.
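The difference between the two accuracy measures can be sketched as follows in Python with NumPy. This assumes a linear rule with a pooled covariance matrix and equal priors, which may differ from the options you choose in SAS. The data are simulated, not the sparrow data, and the function names are hypothetical.

```python
import numpy as np

def lda_predict(train_x, train_y, new_x):
    """Assign each row of new_x to the group with the nearest mean (Mahalanobis distance)."""
    groups = np.unique(train_y)
    means = np.array([train_x[train_y == g].mean(axis=0) for g in groups])
    # Pooled within-group covariance matrix
    resid = train_x - means[np.searchsorted(groups, train_y)]
    pooled = resid.T @ resid / (len(train_x) - len(groups))
    inv = np.linalg.inv(pooled)
    diff = new_x[:, None, :] - means[None, :, :]
    d2 = np.einsum("ngi,ij,ngj->ng", diff, inv, diff)
    return groups[np.argmin(d2, axis=1)]

def resubstitution_error(x, y):
    """Error rate when the rule is scored on the same data used to build it."""
    return float(np.mean(lda_predict(x, y, x) != y))

def jackknife_error(x, y):
    """Error rate when each observation is classified by a rule built without it."""
    wrong = 0
    for i in range(len(x)):
        keep = np.arange(len(x)) != i
        wrong += int(lda_predict(x[keep], y[keep], x[i:i + 1])[0] != y[i])
    return wrong / len(x)

# Simulated two-group data, for illustration only
rng = np.random.default_rng(1)
x = np.vstack([rng.normal(0, 1, (20, 3)), rng.normal(1, 1, (20, 3))])
y = np.repeat([0, 1], 20)
resub = resubstitution_error(x, y)
jack = jackknife_error(x, y)
```

Resubstitution tends to look optimistic because every observation helps build the rule that then classifies it.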
| ||||
7 | Due: Thursday, November 1st |
In assignments 4 and 5 we examined the IRIS data. Use that data
again to conduct a linear discriminant analysis.
1) Summarize the success in classifying using only the first linear discriminant function, using both of the first two functions, and using all four functions. Compare the success of these three.
2) Plot the observations by the first two linear discriminant functions. Compare this plot to the one in question 2 of homework 5. | ||||
6 | Due: Thursday, October 18th |
The data from the personality test is contained in two separate files.
http://www.stat.sc.edu/~habing/courses/data/pers1.txt contains
the observed scores based on columns in the score sheet. (Column 1 was
E/I, columns 2 and 3 were S/N, columns 4 and 5 were T/F, and columns 6 and 7
were J/P.) The second file
http://www.stat.sc.edu/~habing/courses/data/pers2.txt contains
the scores for all 70 questions for each person, arranged so that 1-10
go with column 1, 11-20 go with column 2, etc...
1) Why can we probably not use the data set pers2.txt for conducting
exploratory factor analysis?
| ||||
5 | Due: Tuesday, October 9th |
For the iris data in assignment 4, conduct a principal components analysis
using either SAS
or R.
1) Report the principal components and the percent of variation explained by each of them.
2) Plot the irises by their first two principal components, using a different symbol for each type of iris. Does the plot seem to verify the findings from Homework assignment 4?
3) Put into words how the first principal component is combining the variables.
4) In your opinion, does the second component seem worthwhile to report? Why or why not? | ||||
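For the principal components assignment above, the quantities being asked for can be sketched in Python with NumPy as below. This version assumes the components come from the covariance matrix; working from the correlation matrix instead is an equally reasonable choice and will give different answers. The data are simulated, not the iris data.

```python
import numpy as np

def pca(data):
    """Return loadings, percent of variance explained, and component scores."""
    x = np.asarray(data, dtype=float)
    centered = x - x.mean(axis=0)
    cov = np.cov(centered, rowvar=False)
    # eigh returns ascending eigenvalues, so reorder from largest to smallest
    vals, vecs = np.linalg.eigh(cov)
    order = np.argsort(vals)[::-1]
    vals, vecs = vals[order], vecs[:, order]
    percent = 100.0 * vals / vals.sum()
    scores = centered @ vecs
    return vecs, percent, scores

# Simulated data with four variables on different scales, for illustration only
rng = np.random.default_rng(0)
data = rng.normal(size=(30, 4)) * np.array([3.0, 2.0, 1.0, 0.5])
loadings, percent, scores = pca(data)
```

The first two columns of `scores` are what you would plot for question 2, and each column of `loadings` shows how that component weights the original variables.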
4 | Due: Tuesday, October 2nd |
The web page
http://www.stat.sc.edu/~habing/courses/data/iris.txt contains a
famous data set gathered by E. Anderson and discussed by R.A. Fisher.
It concerns the measurements of 150 irises from 3 species. The four
variables that the flowers are measured on are: sepal length, sepal width,
petal length, and petal width. (The sepal is a leaf in the outer whorl
of leaves that protect the flower.)
1) Use SAS to conduct a MANOVA to see if the mean vectors for these
3 species are equal.
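SAS reports several MANOVA statistics. As a rough illustration of one of them, Wilks' lambda is the ratio of the determinant of the within-group sums of squares and cross-products matrix to the determinant of the total one. The Python sketch below uses simulated data, not the iris data, and `wilks_lambda` is a hypothetical name.

```python
import numpy as np

def wilks_lambda(x, groups):
    """Wilks' lambda = det(within-group SSCP) / det(total SSCP)."""
    x = np.asarray(x, dtype=float)
    groups = np.asarray(groups)
    centered = x - x.mean(axis=0)
    total = centered.T @ centered
    within = np.zeros_like(total)
    for g in np.unique(groups):
        members = x[groups == g]
        resid = members - members.mean(axis=0)
        within += resid.T @ resid
    return float(np.linalg.det(within) / np.linalg.det(total))

# Simulated data: three groups of ten, with the second group shifted
rng = np.random.default_rng(0)
x = rng.normal(size=(30, 4))
groups = np.repeat([0, 1, 2], 10)
x[groups == 1] += 1.0
lam = wilks_lambda(x, groups)
```

Values near 1 mean the group means explain little of the variation; values near 0 suggest the mean vectors differ.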
| ||||
3 | Due: Tuesday, September 25th |
1) The R-templates page has the code for producing the various
Chernoff faces used in class. For each of the four face types,
comment on any changes that you think are needed to make them
more useful (i.e. something doesn't vary enough or is too hard
to see, etc...). Incorporate one of these changes into the
code and verify that it performs in the way you planned.
2) A 1966 anthropometric data set compares two communities by
examining the sizes of the femur and humerus of their members.
The first community had a sample size of 27, a mean femur length of
460.4 mm, and a mean humerus length of 335.1 mm. The second
community had a sample size of 20, a mean femur length of 444.3 mm
and a mean humerus length of 323.1 mm. The pooled covariance matrix
was found to be:
Show that each of the two pooled t-tests is significant at alpha=0.05, but that Hotelling's T^2 is not significant at alpha=0.05. | ||||
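For reference, the usual two-sample form of Hotelling's T^2 with a pooled covariance matrix, and its conversion to an F statistic, is written out below. The symbols are notation chosen for this note: n_1 and n_2 are the sample sizes, p is the number of variables, the x-bars are the mean vectors, and S_p is the pooled covariance matrix.

```latex
\[
T^2 = \frac{n_1 n_2}{n_1 + n_2}\,
      (\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_2)^{\top}
      \mathbf{S}_p^{-1}
      (\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_2),
\qquad
F = \frac{n_1 + n_2 - p - 1}{(n_1 + n_2 - 2)\,p}\; T^2
  \sim F_{p,\; n_1 + n_2 - p - 1}
\]
```

With the values in this problem, n_1 = 27, n_2 = 20, and p = 2, so the difference in mean vectors is (16.1, 12.0) mm and the F statistic has 2 and 44 degrees of freedom.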
2 | Due: Tuesday, September 11th |
The web page
http://www.stat.sc.edu/~habing/courses/data/bears.txt contains a subset
of a data set described in Reader's Digest (April 1979) and
Sports Afield (September 1981).
The data set consists of several measurements for bears that were captured, measured, and released. (In the full data set, several of the bears were actually caught multiple times over a period of years.) The variables in the data set are: estimated age in months, gender (1=male, 2=female), length of head in inches, width of head in inches, girth of the neck in inches, body length in inches, girth of the chest in inches, weight in pounds, and name. The observations are currently ordered by name.
Your assignment consists of three parts.
| ||||
1 | Due: Tuesday, September 4th |
The web page
http://www.stat.sc.edu/~habing/courses/data/draft70.txt contains the
data from the 1970 draft lottery to determine the order in which people would
have to report to the draft board to serve in the Vietnam War (if they were
still needed). The first column is the day of the year from 1 to 366 (including leap year), the second column is the order in which the capsule containing that date was drawn from a container, the third column is the number of the month that the day was in, the fourth column is the name of the month, and the fifth column is the day of the month. Thus, those born on September 14th (in the 9th month and the 258th day of the year) were the first to report to the draft board. The question that you are to try to answer is whether or not it appears that the method used to randomize the birthdays was fair. That is, did each birthday have the same chance of being selected each time a number was drawn? There are a large number of ways to analyze this data, but your assignment is to produce an easily explainable graph that shows if the draft was fair or not, and a short explanation to accompany it. If it was not fair your graph and explanation should show in what way it was unfair. Be sure to include a copy of any code you used to generate the output. |
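As one possible starting point (there are many others), the sketch below summarizes the draw order by month in Python. It assumes each row has already been read into a tuple in the five-column order described above. The rows shown are made-up placeholders, not values from draft70.txt, and `mean_draw_by_month` is a hypothetical name.

```python
from collections import defaultdict

def mean_draw_by_month(rows):
    """Average draw order for each month number.

    Each row is (day_of_year, draw_order, month_number, month_name, day_of_month).
    """
    totals = defaultdict(lambda: [0, 0])  # month -> [sum of draw orders, count]
    for _day_of_year, draw_order, month, _name, _day_of_month in rows:
        totals[month][0] += draw_order
        totals[month][1] += 1
    return {month: s / c for month, (s, c) in sorted(totals.items())}

# Made-up placeholder rows, for illustration only
rows = [
    (1, 10, 1, "Jan", 1),
    (2, 20, 1, "Jan", 2),
    (32, 6, 2, "Feb", 1),
    (33, 8, 2, "Feb", 2),
]
monthly = mean_draw_by_month(rows)
```

A summary like this still has to be turned into a graph and an explanation, and a different view of the data may make the case more clearly.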