STAT 530, Fall 2022
--------------------

Homework 5
-----------

Problem 1:

An educational testing expert was asked to judge, on a scale of 0 to 100, how dissimilar pairs of standardized tests were.
The following R code produces the resulting dissimilarity matrix for six kinds of standardized tests.

tests.diss <- 100*round(1-cov2cor(ability.cov$cov),2)
print(tests.diss)

Descriptions of the six tests are as follows: 

general: a non-verbal measure of general intelligence using Cattell's culture-fair test.
picture: a picture-completion test
blocks: block design
maze: mazes
reading: reading comprehension
vocab: vocabulary

Find a two-dimensional multidimensional scaling solution, plot the tests on a 2-D map, and try to interpret
the dimensions underlying the judgments about the distinctions among the tests.  Using a numerical measure,
assess how well the 2-D solution represents the dissimilarities among the tests.

Note that
row.names(tests.diss)
contains the vector of the test names.
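
As a starting point, the steps in Problem 1 might be sketched as follows. This is only one possible approach, assuming classical (metric) scaling via cmdscale() from base R; the names mds.fit, coords, eig, and fit.abs are illustrative and not part of the assignment, and the fit measure shown (share of the absolute eigenvalue sum captured by the first two dimensions) is one of several reasonable choices.

```r
# Dissimilarity matrix from the assignment (ability.cov ships with base R)
tests.diss <- 100 * round(1 - cov2cor(ability.cov$cov), 2)

# Classical MDS in two dimensions; eig = TRUE also returns the eigenvalues
mds.fit <- cmdscale(as.dist(tests.diss), k = 2, eig = TRUE)
coords <- mds.fit$points

# 2-D map with each test labeled by name
plot(coords[, 1], coords[, 2], type = "n",
     xlab = "Dimension 1", ylab = "Dimension 2")
text(coords[, 1], coords[, 2], labels = row.names(tests.diss))

# One numerical fit measure: proportion of the absolute eigenvalue sum
# accounted for by the first two dimensions (closer to 1 is better)
eig <- mds.fit$eig
fit.abs <- sum(abs(eig[1:2])) / sum(abs(eig))
print(fit.abs)

# cmdscale() also reports its own goodness-of-fit values
print(mds.fit$GOF)
```

Interpreting the axes is still a judgment call: look at which tests sit at opposite ends of each dimension.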



IMPORTANT NOTE: For the clustering problems below, also write several sentences
explaining in words what substantive conclusions about the data
you can draw from the plots and/or analyses.

Problem 2:  

Do both a hierarchical clustering and a partitioning clustering 
of the tennis racquet data on the course web page.  
For each clustering, you may pick your favorite specific approach.  
Give the partitions of racquets into clusters, give some plot(s) 
to visualize the cluster structure, and make an attempt to characterize the clusters.

The racquet data can be read in with the following code:
racq.data <- read.table("http://people.stat.sc.edu/hitchcock/racquetsdata530.txt", header=TRUE)
racquet.names <- as.character(racq.data[,1])
racquet.numeric.data <- racq.data[,-1]

The variables in the tennis racquets data set are:
X1 = length of racquet (in inches)
X2 = static weight (in ounces) = this is how much the racquet actually weighs on a scale
X3 = balance (in inches)  = this is a measure of whether the racquet is heavier on the head end or on the handle end;
     more negative values indicate a more head-heavy racquet; positive values indicate a more head-light racquet; 
     zero indicates an even balance.
X4 = swingweight = this is a complicated measure of how heavy the racquet FEELS when it is swung
X5 = headsize (in square inches) = the size of the racquet face (the strung area)
X6 = beamwidth (in mm) = the width of the cross-section (edge) of the racquet
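
Because the six variables are measured in different units, it may help to standardize them before clustering. The sketch below shows one hierarchical method (complete linkage) and one partitioning method (k-means) side by side. It is a rough illustration only: it runs on simulated stand-in data so that it works offline, and the names x, x.std, k, hc.groups, and km, as well as the choice k = 2, are assumptions rather than recommendations for the racquet data.

```r
# Simulated stand-in data; for the homework, use racquet.numeric.data instead
set.seed(530)
x <- rbind(matrix(rnorm(40, mean = 0), ncol = 4),
           matrix(rnorm(40, mean = 4), ncol = 4))
rownames(x) <- paste0("item", seq_len(nrow(x)))

# Standardize so that no single variable dominates the distances
x.std <- scale(x)

# Hypothetical number of clusters; choose this from the dendrogram or other diagnostics
k <- 2

# Hierarchical clustering: complete linkage on Euclidean distances
hc <- hclust(dist(x.std), method = "complete")
plot(hc, main = "Complete-linkage dendrogram")
hc.groups <- cutree(hc, k = k)

# Partitioning clustering: k-means with several random starts
km <- kmeans(x.std, centers = k, nstart = 25)

# Compare the two partitions
print(table(hierarchical = hc.groups, kmeans = km$cluster))

# Visualize the k-means clusters on the first two principal components
pc <- prcomp(x.std)
plot(pc$x[, 1:2], type = "n")
text(pc$x[, 1:2], labels = rownames(x), col = km$cluster)
```

With the real data, setting rownames(racquet.numeric.data) <- racquet.names before clustering would label the dendrogram and the scatterplot with the racquet names. Cluster means on the original (unstandardized) variables are one way to characterize the groups.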

Problem 3:

Do a model-based clustering of the pottery data set given in Table 1.3.  Verify that BIC suggests a 3-cluster solution.  Which covariance structure does BIC suggest for the best model? Give some plot(s) 
to visualize the cluster structure, and make an attempt to characterize the clusters (a PCA can help with this).

The pottery data can be read in with the following code:
pottfull <- read.table("http://people.stat.sc.edu/hitchcock/potteryTable63.txt", header=TRUE)
attach(pottfull)
pott <- pottfull[,-c(1,2)]

*** Read Page 8 of the Everitt and Hothorn book for some insight into the variables in the pottery data set,
which are amounts of 9 chemicals (Al2O3, Fe2O3, MgO, CaO, Na2O, K2O, TiO2, MnO, BaO) found in 46
specimens of Romano-British pottery.
Also note that "No" (number) and "Kiln" in the 'pottfull' data above are simply labeling variables and
should NOT be used as inputs to the clustering algorithm (although the value of Kiln might be informative in
interpreting the clustering result).
The 'pott' data set above contains only the numeric variables; the cluster analysis itself should be done on this.
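
A model-based clustering of the kind asked for in Problem 3 might be sketched as below. This assumes the mclust package, which the assignment does not name and which must be installed separately; it is one common choice, not necessarily the intended one. The sketch runs on the built-in faithful data as a stand-in so that it works offline, and the names x, fit, and groups are illustrative.

```r
# Assumes the mclust package is installed, e.g. via install.packages("mclust")
library(mclust)

# Stand-in data so the sketch runs offline; replace with 'pott' for the homework
x <- faithful

# Fit Gaussian mixtures with 1 to 6 clusters over mclust's default covariance structures
fit <- Mclust(x, G = 1:6)

# Covariance structure (a code such as "EEE" or "VVV") and number of clusters chosen by BIC
print(fit$modelName)
print(fit$G)

# BIC curves for every covariance structure; in mclust, larger BIC is better
plot(fit, what = "BIC")

# Hard cluster assignments, shown on the first two principal components
groups <- fit$classification
pc <- prcomp(x, scale. = TRUE)
plot(pc$x[, 1:2], col = groups, pch = groups)
```

With the pottery data, a cross-tabulation such as table(groups, pottfull$Kiln) is one way to check whether the clusters line up with the kilns.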