STAT 530, Fall 2022
--------------------

Homework 2
-----------

NOTE: The air pollution data set (somewhat different from the US Air pollution data we studied in our class examples) is given on the course web page. You should use the FULL data set for the problems given below. You can read the data into R (as a data frame) with the code:

airpol.full <- read.table("http://people.stat.sc.edu/hitchcock/airpoll.txt", header=TRUE)
city.names <- as.character(airpol.full[,1])
airpol.data <- airpol.full[,2:8]

# if you want to make the row labels be the city names, add this line of code:
airpol.full <- data.frame(airpol.full, row.names=city.names)

NOTE: For EACH of these problems, also write several sentences explaining in words what substantive conclusions about the data you can draw from the plots.

1. Make a star plot to display all 7 variables. Also make a plot using Chernoff faces. Write a short paragraph explaining what the plots tell you about the cities. You can include the "labels" argument to label the drawings for both the stars function and the faces function, e.g.:

labels=city.names

within the call of each function.

2. Produce a scatterplot matrix for this air pollution data set. Write a short paragraph explaining the main conclusions from the scatterplot matrix.

3. Produce chi-plots for the following pairs of variables in the air pollution data set: SO2 and Mortality; Nonwhite percentage and NOX; Education and Rainfall. Write comments about each plot. This exercise is to give you practice in making and interpreting chi-plots.

4. Do a bivariate boxplot of the pair of variables "Education" and "Mortality" from the air pollution data set. Explain what the plot tells you about the relationship between the two variables. Do you see any outliers? If so, which cities are they?

5. Do a bubble plot with "Education" and "Mortality" on the axes and "Population Density" represented by the bubbles.
Explain what the plot tells you about the relationships among the three variables. Comment on any notable cities.

6. Often you can investigate all possible combinations of two categorical variables (say, categ1 and categ2) by using their cross-product, categ1:categ2. Read in the 'Salaries' data frame from the 'carData' package with:

library(carData); data(Salaries, package="carData")

You may need to install the 'carData' package with install.packages("carData") first.

(a) Do a pirate plot (or side-by-side box plots, if you prefer) of 'salary' for the different level combinations of sex:discipline. [Note that discipline "A"=theoretical and "B"=applied.] What are your conclusions about salary comparisons across the sexes and disciplines based on this plot?

(b) Use the 'qplot' function in the 'ggplot2' package to do a symbolic scatter plot of salary against yrs.since.phd, with separate symbols for the sex categories. Include

+ geom_smooth(method='lm',se=FALSE)

on the end of your line of code that calls the qplot function to get best-fit lines through the points. What are your conclusions from this plot?

(c) Include

+ facet_wrap(~discipline)

on the end of your line of code in (b). What are your conclusions from the resulting plot?

7. Look at the interactive bubble plots (one for each year) at https://usafacts.org/projects/jobs/who-leaves?

(a) Explain the three variables represented in each bubble plot (year could be considered a fourth variable in the series of plots).

(b) For the 2021 plot, discuss the associations (or lack thereof) between each pair of variables, as seen in the plot.

(c) For the 2020 plot, what are the two occupations that had the highest percentage of workers leave? Give a possible explanation for why these occupations had an exceptionally large value for this variable in 2020.
(d) Extra Credit (mandatory for grad students): If you were doing a linear regression of the Y-axis variable on the X-axis variable for the 2021 data, what are three complications that you would have to address in your analysis?

NOTE: The preferred way to turn in the homework is to put your answers neatly in a Word document or PDF file and upload it to Blackboard according to the instructions.
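
NOTE: As a rough illustration of the bubble-plot technique in Problem 5, the sketch below uses base R's symbols() function. It is only one possible approach and deliberately uses R's built-in mtcars data rather than the air pollution data. The object names 'dat' and 'top' are hypothetical names chosen for this example.

```r
# Illustration only: built-in mtcars data, NOT the air pollution data set.
dat <- mtcars[, c("wt", "mpg", "hp")]

# When run non-interactively (e.g. with Rscript), draw to a null PDF device
# so that no plot file is written.
if (!interactive()) pdf(NULL)

# symbols() draws one circle per row. 'circles' supplies the radii and
# 'inches' rescales them so the largest circle has that radius in inches.
# Using sqrt() makes the circle AREA proportional to the third variable.
symbols(dat$wt, dat$mpg, circles = sqrt(dat$hp), inches = 0.25,
        xlab = "Weight (1000 lbs)", ylab = "Miles per gallon",
        main = "Bubble size = horsepower")

# Label the three observations with the largest bubbles.
top <- order(dat$hp, decreasing = TRUE)[1:3]
text(dat$wt[top], dat$mpg[top], labels = rownames(dat)[top],
     pos = 3, cex = 0.7)

if (!interactive()) invisible(dev.off())
```

The same pattern should carry over to the homework data by swapping in the relevant columns of the air pollution data frame and using the city names as labels.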