STAT 530 - Fall 2008 - SPSS Templates

The Basics of SPSS
The oildata Example
Finding and Displaying Correlation Matrices
Multivariate Normality (Section 1.4)
Graphics (Chapter 2)
Principal Components Analysis (Chapter 3)


The Basics of SPSS:

SPSS is a commercial statistical package whose name originally stood for "Statistical Package for the Social Sciences". A one-year license can be obtained for University-owned computers (contact your department's computer manager for more information). Student copies are available from several sources, including e-academy.

One of the nice things about SPSS is that much of it is menu driven and based around a spreadsheet. This makes it fairly easy and intuitive to use for anyone accustomed to using spreadsheets. Most tasks are reached through the menus across the top of the window.

The following should demonstrate some of the ways to use SPSS.

Begin by entering 5, 4, 3, and 2 into the first four rows of column 1. Notice that as you type the first value in the column, SPSS puts a generic name at the top of the column. By clicking on "Variable View" at the bottom left of the window we can get a bit more information about the variable we entered, including: the name (click on that window and we can rename it x), the type (currently numeric), the width, and the number of decimal places we want showing.

The mean can then be calculated using Analyze > Descriptive Statistics > Descriptives.... Notice you must choose the variable you want, so make sure x is highlighted in the left window and then click the arrow in the middle. When the variable is transferred over you can then click OK.

A histogram can be plotted using Graphs > Histogram, again selecting the variables.
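
For anyone who prefers typing commands, the two menu steps above can probably be reproduced from a syntax window (File > New > Syntax). This is only a sketch: it assumes the column has already been renamed x as described earlier, and the particular statistics requested are an illustration rather than something the menus require.

```spss
* Mean and other summaries for x (Analyze > Descriptive Statistics > Descriptives).
DESCRIPTIVES VARIABLES=x
  /STATISTICS=MEAN STDDEV MIN MAX.

* Histogram of x (Graphs > Histogram).
GRAPH
  /HISTOGRAM=x.
```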

A t-test can be performed using Analyze > Compare Means > One-Sample T Test. Note that the variable must be selected again. Also note that the value for the null hypothesis can be entered in the bottom middle, and that Options allows you to change the percentage of the confidence interval. Unlike R, there is no obvious way to do the one-sided tests.
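
The same test can be sketched in syntax. The test value 3 and the 95% level below are hypothetical choices for illustration only. As for one-sided tests, a common workaround is to halve the two-sided p-value that SPSS reports when the sample mean falls on the side named by the alternative hypothesis (and to use one minus that half otherwise).

```spss
* One-sample t test of H0: mean of x = 3 (3 is an assumed example value).
T-TEST
  /TESTVAL=3
  /MISSING=ANALYSIS
  /VARIABLES=x
  /CRITERIA=CI(.95).
```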

By right clicking on the various parts of the output window you can copy and paste the various output tables into Word, for example.

The Transform menu can be used to create new variables. For example we could set weirdx to be sin(x)+x^2. Choose the Compute... sub-menu. Type weirdx in for the Target Variable and use the function group menu, calculator pad, and arrow key to put the function into the Numeric Expression box. (Sin is one of the arithmetic options, and the up arrow appears when you select it; the ** in the bottom left of the calculator is the symbol for raising to a power.)

When you click OK it will add the new variable to the spreadsheet.
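
The Compute dialog described above should correspond to something like the following two lines of syntax. This is a minimal sketch that assumes the original variable is named x.

```spss
* Create weirdx as sin(x) plus x squared, using ** as the power operator.
COMPUTE weirdx = SIN(x) + x**2.
EXECUTE.
```

The EXECUTE line simply forces the pending transformation to run so that the new column shows up right away.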


The oildata Example:

SPSS is able to load in large data sets using the options in the File menu. Read Text Data will allow us to read in the oildata.txt file. This data set is: not predefined, delimited, with names at top, beginning on line 2, with each line a separate case, we want all the cases, and tab delimited seems correct. For now we can set all of the variables as numeric... although we might want to change that later.

The entire data set now appears in the spreadsheet, and we can see more detail on it by using the Variable View tab at the bottom.

Note that if we wanted to sort the file based on the values of a certain variable we could use Data > Split File, choose Organize output by groups and select the variable we want the groups based on (Gender for example). This will then sort the spreadsheet by that variable. We could then select and delete the desired rows and columns to make a smaller data set if we wanted. Data > Select Cases gives the option of deleting those that weren't selected.
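
In syntax the sort and the grouping are separate commands, which may make it clearer what Data > Split File is doing. The sketch below assumes the variable is named Gender as in the example above. Keep in mind that split-file processing stays in effect for later procedures until it is turned off; if only the sorting is wanted, the SORT CASES line alone (Data > Sort Cases in the menus) is enough.

```spss
* Sort the spreadsheet by Gender.
SORT CASES BY Gender.

* Optional: report later output separately for each Gender group.
SPLIT FILE SEPARATE BY Gender.

* Turn the grouping off again when it is no longer wanted.
SPLIT FILE OFF.
```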

The main difference with a larger data set is that you need to be careful about which variable you are selecting!


Finding and Displaying Correlation Matrices:

The following steps will find the correlation matrix for section 2 of the oildata set. They assume that you successfully read in the data as in the previous section.

One place we can go to get the correlations is the Correlate > Bivariate option under the Analyze menu. Selecting the variables we want will give us the matrix of correlations. (For this example, we want Econ, Conv, Flex, Safe, Low, and Dep.) Unfortunately this output window could be hard to read with a lot of variables because it also gives the p-value for testing if the correlation coefficient is zero, and the sample size used (observations with missing values are deleted for each pair).
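
As a rough syntax equivalent of the Bivariate dialog, the following sketch requests the same six-variable matrix. The PRINT and MISSING settings shown are what the dialog is assumed to use by default (two-tailed p-values and pairwise deletion).

```spss
* Full correlation matrix for the six questions, with p-values and pairwise Ns.
CORRELATIONS
  /VARIABLES=Econ Conv Flex Safe Low Dep
  /PRINT=TWOTAIL NOSIG
  /MISSING=PAIRWISE.
```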

A slightly less busy display can be obtained by choosing Data Reduction > Factor under the Analyze menu. Select the variables we want, and with the Descriptives button make sure that only Coefficients is checked. The top part of the output will be the correlation matrix, and we can ignore the rest of the output.

Beginning with the first column we can see that: Economy goes strongly with Safety and Low Energy Use; weakly with Convenience and Dependability; and is opposite of Flexibility. This suggests a group of questions with:

Group 1: Economy, Safety, and Low Energy Use (and maybe Convenience and Dependability)

Going to the next column we see that: Convenience goes strongly with Flexibility and Dependability; a bit more weakly with Economy and Safety; and negatively with Low Energy Use. This suggests a group of questions with:

Group 2: Convenience, Flexibility, Dependability (and maybe Safety and Economy)

If you had to break them into only two groups, this should already give you a good idea of how to do it... and you can check the remaining columns to see if we get any contradictions. In fact we could check out the correlations of the groups separately if we wanted to. This could be done the same way we got the entire correlation table... just using the three variables we put in each group instead of all six.

We can also get the correlations of the variables in the first group with the variables in the second. Under the File menu select New > Syntax and enter the following into the window that appears.

CORRELATIONS Econ Safe Low WITH Conv Flex Dep.

Hitting Run (9th of the 12 menus along the top of the syntax window) will then produce the output.

Can you see why group 1 all hangs together? What about group 2? Is there a question that looks like it would go well with either group? (Be careful reading that third correlation!) Also notice that none of the questions really strongly go in the "opposite direction" from the other questions (what would it look like if that were the case?)

If we wanted to we could use the Compute option from the Transform menu to calculate the scores for each person on each group of questions. Type Group1 into the target variable box and enter Econ+Safe+Low into the numeric expression box. You can then repeat the process for making Group2 out of the other three variables.

We can see that the first person should get a 12+1+8=21 for the group 1 score and a 16+25+16=57 for the group 2 score. (The homework doesn't ask you to do this though!)
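
The two group scores described above can also be sketched in a syntax window. The variable names are the ones used in this section; Group1 and Group2 are simply the target names suggested in the text.

```spss
* Total score on each group of questions for every person.
COMPUTE Group1 = Econ + Safe + Low.
COMPUTE Group2 = Conv + Flex + Dep.
EXECUTE.
```

Note that if any one of the three answers is missing for a person, the sum for that person will be missing as well.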


Multivariate Normality:

Data that follows a multivariate normal distribution comes from a very particular probability density function (whose formula we saw in class). This makes it relatively easy to generate random samples that should be multivariate normal. The following code generates a sample of size 1000 from a population where the first variable has mean 5, the second has mean 0, and the third has mean -1. The variances are 1, 1, and 4 respectively, with the covariance between 1 and 2 being 0.5, 1 and 3 being -0.2, and 2 and 3 being 0.

Note that we have to write our own program to do this in SPSS (unlike R) and it is a bit unpleasant if you have to do it from scratch. Of course you can just cut and paste it into a syntax window and run it! (You can open a syntax window using File > New > Syntax and the Run option is the 9th of the 12 choices at the top of the new screen). The only things that you might ever need to change are the covariance matrix, the mean vector, and the sample size n. The output is added to the spreadsheet as new columns.

MATRIX.
COMPUTE sigma =
{1, 0.5, -0.2;
0.5, 1, 0;
-0.2, 0, 4}.
COMPUTE mu = {5, 0, -1}.
COMPUTE n = 1000.
COMPUTE onemat = make(n,1,1).
COMPUTE q=NROW(sigma).
COMPUTE x = sqrt(-2*ln(UNIFORM(n,q)))&*COS((2*3.14159265358979)*UNIFORM(n,q)).
COMPUTE x=x*CHOL(sigma)+onemat*mu.
SAVE x/OUTFILE = *.
END MATRIX.

Unfortunately it is much more difficult to tell if a data set you have actually comes from a multivariate normal distribution. Generally we "cheat" and just check a few conditions that must be true if the data set is multivariate normal.

  1. Each of the variables is normal separately - Check using Q-Q plot
  2. Each pair of variables makes an ellipse - Check using a scatterplot (maybe with a bivariate box-plot on top)
  3. The set of generalized distances from each point to the center of the points is chi-square - Check using a chi-square plot.

Making the Q-Q plots: One way to make the Q-Q plots is to choose the Graphs > Q-Q... option. Select all of the variables you are interested in and choose OK. In the output window, ignore the "Detrended" plots.

Everyone will get a different random sample from the code above, but the points should probably be pretty close to the line, except maybe a few at the ends (remember this generated data has 1,000 observations... so just a few being off isn't bad).
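
The Q-Q dialog appears to correspond to the PPLOT command. The sketch below assumes the generated sample sits in columns named COL1, COL2, and COL3 (the names used for the chi-square plot code in this section); substitute your own variable names as needed.

```spss
* Normal Q-Q plot for each variable separately.
PPLOT
  /VARIABLES=COL1 COL2 COL3
  /NOLOG
  /NOSTANDARDIZE
  /TYPE=Q-Q
  /FRACTION=BLOM
  /TIES=MEAN
  /DIST=NORMAL.
```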

Making the scatterplots: If you have many variables, you would probably want to make each scatterplot on a separate screen and not put them all up at once (they would be too small to see!). This can be done by choosing the Simple Scatter option from Graphs > Scatter/Dot.... You could also choose the Matrix Scatter option if you wanted to see them all at once. Unfortunately there is no way to easily add something similar to the bivariate box-plot like you can in SAS or R.

Making the Chi-square plot: The formula for the distances used to make the chi-square plot is in the middle of page 10 in the text. The following code produces the plot in a similar manner to R, only taking the diagonal out of a big multiplied matrix instead of using a loop. The only thing you should ever need to change is the names of the variables COL1 COL2 COL3 on the second line (make them whatever you are using).

MATRIX.
GET X /VARIABLES = COL1 COL2 COL3.
COMPUTE N=NROW(X).
COMPUTE SUMX=T(CSUM(X)).
COMPUTE XBAR=SUMX/N.
COMPUTE J=MAKE(N,1,1).
COMPUTE S=(T(X)*X-SUMX*T(SUMX)/N)/(N-1).
COMPUTE OBSVALS=DIAG((X-J*T(XBAR))*INV(S)*T(X-J*T(XBAR))).
SAVE {X, OBSVALS} / OUTFILE=*.
END MATRIX.

This will add another column to your spreadsheet (and unfortunately write over the names of the data that was there). You can then select Graphs > Q-Q..., select the new column as the variable, Chi-square as the test distribution, and the appropriate degrees of freedom (3 in this case). If the data is really from a multivariate normal population, then the distances should be chi-squared, and this plot should look like a straight line. (Remember that small sample sizes aren't nearly as straight as big ones because of random fluctuation.)
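
In syntax the chi-square plot can be sketched with the same PPLOT command and a different test distribution. This assumes three original variables, so that the saved distances land in a fourth column that SPSS is expected to name COL4; check the spreadsheet for the actual name before running it.

```spss
* Chi-square Q-Q plot of the generalized distances, with 3 degrees of freedom.
PPLOT
  /VARIABLES=COL4
  /NOLOG
  /NOSTANDARDIZE
  /TYPE=Q-Q
  /FRACTION=BLOM
  /TIES=MEAN
  /DIST=CHI(3).
```

With a different number of variables, change the 3 in CHI(3) to match.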

In Conclusion: If the three checks above all look fairly good, then that is strong evidence that the population is (at least very approximately) multivariate normal, and we would be comfortable using any methods that made that assumption. If the checks failed a little then we would want to know how robust the method was. If the checks failed badly then we should probably not trust any method that needed multivariate normality!


Graphics:

The following instructions produce some basic descriptive plots for the census data set census.txt.

Variations on the Scatter Plot

A plot of x (birth) vs. y (heartd) can be produced by using Graphs > Scatter/Dot > Simple Scatter, moving Birth to the X Axis box and HeartD to the Y Axis box, and hitting OK.
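
The simple scatterplot above can probably be reproduced with the GRAPH command. A minimal sketch, assuming the variables were read in with the names Birth and HeartD:

```spss
* Scatterplot with Birth on the X axis and HeartD on the Y axis.
GRAPH
  /SCATTERPLOT(BIVAR)=Birth WITH HeartD
  /MISSING=LISTWISE.
```

The variable before WITH goes on the horizontal axis and the one after it on the vertical axis.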

To restrict the plot so that the x (birth) values go only from 14 to 15, double click on the full scatterplot in the viewer. This will open the Chart Editor. Select Edit > Select X Axis. In the Scale tab click the Auto button by Minimum and Maximum and enter the 14 and 15 values respectively. Hitting Apply will adjust the graph.

To add a regression line (again from the Chart Editor) select Elements > Fit Line at Total. Using the Fit Line tab of the box that appears you can then select either Linear regression or Loess for the nonparametric regression. (You might have to hit Apply to see it change).

If you wish to label the points in a graph with a variable name you must again use Graphs > Scatter/Dot > Simple Scatter. Birth should still be selected for X and HeartD for Y. Choose State for Label Cases by and hit OK. This should produce the plot with the state abbreviation near each point.

Some of the options (like "jittering" points) are only available using the Interactive mode. Select Graphs > Interactive > Scatterplot.... Move Birth to the box with the sideways arrow (the X-axis) and HeartD to the up arrow (the Y-axis). Note the mechanic here is slightly different: you just hold down the mouse button on the variable name and move it over. Click OK to produce the graph.

Double clicking on the resulting graph will open the interactive mode. Along the top line the third option should be a red bar graph with part of a face over it. Clicking on it will open the Chart Editor. To restrict the X axis to go from 14 to 15 click the first of the two "Scale Axis" options and then the "Edit" button. You can now click the check marks next to minimum and maximum and replace them with 14 and 15 respectively. To jitter the observations select "Cloud" and then the "Jittering" tab. Click the check-box, slide the bar over to about 1.0%, and click OK. (Note it will jitter both the X and Y variables.) To get out of edit mode click on the X in the upper right of the "Chart Manager" box, and then just click on some of the white area to the side of the "Interactive Graph".

Adding a Third Variable

A third variable can be added to a scatter plot using the Interactive mode. Select Graphs > Interactive > Scatterplot.... Move Birth to the box with the sideways arrow (the X-axis) and HeartD to the up arrow (the Y-axis). Double clicking on the resulting graph will open the interactive mode. Clicking the blue, white, and red box in the upper left corner will open the Assign Graph Variables box. Move Over65 to the Size box and you should see the graph change as you watch.

SPSS can also be used to produce a coplot, but it takes a bit more work. Select Transform > Visual Bander, put Over65 in the box on the right, and hit Continue. Click Over65. Then select Make Cutpoints.... Click on Equal Percentiles Based on Scanned Cases, enter 5 for the Number of Cutpoints and click Apply. Finally, type Over65Groups in the Banded Variable box and click OK. Now open the interactive scatterplot mode again and put Over65Groups in for the Panel Variable. To see how the groups were made you could select Graphs > Boxplot.... Choose Over65 for the Variable, Over65Groups for the Category Axis, and hit OK.

A scatterplot matrix can be made simply by using Graphs > Scatter/Dot and choosing Matrix Scatter. Simply put the three variables in the Matrix Variables box.
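
A syntax sketch of the scatterplot matrix follows. It assumes that "the three variables" are Birth, HeartD, and Over65, the ones used in the earlier plots of this section; swap in whichever variables you actually want.

```spss
* Scatterplot matrix of the three census variables.
GRAPH
  /SCATTERPLOT(MATRIX)=Birth HeartD Over65
  /MISSING=LISTWISE.
```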

Density Plots

SPSS doesn't appear to have an easily locatable two dimensional density estimator.

3D Plot

Rotatable 3D plots can be produced using Graphs > Scatter/Dot and choosing 3-D Scatter. Choose the desired variables for X, Y, and Z and click OK. To rotate the graph, double click on the output to open the Chart Editor. Double clicking on the Chart Editor box will open the Properties box and show the 3-D Rotation tab.


Principal Components Analysis:

The following instructions tell how to analyze the data in section 3.3 of the book. The first step is to get the data from: http://www.stat.sc.edu/~habing/courses/data/usair.txt

At first glance it seems that the analysis can be done by choosing: Analyze > Data Reduction > Factor. The variables in question can then be selected. In the Descriptives... box Coefficients should be the only thing checked. In Extraction... the method should be Principal Components; Analyze should be either Correlation Matrix or Covariance Matrix, whichever you want to analyze; the display should have Unrotated factor solution checked; and the number of factors should be six (the number of variables in this case). Rotation... should be set at none. Scores... should have Save as variables checked with Regression.
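
For reference, the dialog choices above are expected to paste as roughly the following FACTOR command. This is a sketch rather than the exact pasted output; it uses the six variable names that appear in the matrix code later in this section and analyzes the correlation matrix.

```spss
* Principal components extraction with six components, no rotation, scores saved.
FACTOR
  /VARIABLES NegTemp Manuf Pop Wind Precip Days
  /MISSING LISTWISE
  /ANALYSIS NegTemp Manuf Pop Wind Precip Days
  /PRINT CORRELATION EXTRACTION
  /CRITERIA FACTORS(6) ITERATE(25)
  /EXTRACTION PC
  /ROTATION NOROTATE
  /SAVE REG(ALL)
  /METHOD=CORRELATION.
```

Changing the last line to /METHOD=COVARIANCE should analyze the covariance matrix instead.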

Hitting OK should then produce some output. The correlation matrix at the top of the output matches the top of Table 3.2 as does the Total Variance Explained table (except that it gives the variances and not the standard deviations). Unfortunately the Component Matrix does not match the loadings matrix in the text. Instead it is simply the correlation between each of the components and the original variables. Further, the factor score estimates that have been added to the spreadsheet have all been standardized (as in factor analysis) instead of being left on the scale that we usually want in principal components!

There is apparently no easy menu-driven way to make SPSS give us these standard results either. One option is to multiply each of the estimated columns of factor scores by the square root of the corresponding eigenvalue (the variance you can read off from the output). You then need to perform multiple regression to predict each principal component from the original variables (if you used the covariance matrix), or from the standardized values of the original variables (if you used the correlation matrix). In the latter case you must also make SPSS do the standardization. In either case it is time consuming and produces fairly ugly output.

One alternative is to use the built-in matrix language. The following code will produce the output at the top of page 53 and save the estimated principal component values to some files. If you use this for other data sets you will need to change the GET and VNAMES lines (the first two lines after MATRIX.) to have the names of the variables you are using, the PCSNAMES and PCRNAMES lines (the next two) to have the correct number of components, and the two SAVE lines (the second and third lines from the end) to save them to the files you want.

MATRIX.
GET X /VARIABLES = NegTemp Manuf Pop Wind Precip Days.
COMPUTE VNAMES={"NegTemp","Manuf","Pop","Wind","Precip","Days"}.
COMPUTE PCSNAMES={"PCS1","PCS2","PCS3","PCS4","PCS5","PCS6"}.
COMPUTE PCRNAMES={"PCR1","PCR2","PCR3","PCR4","PCR5","PCR6"}.
COMPUTE N=NROW(X).
COMPUTE SUMX=T(CSUM(X)).
COMPUTE XBAR=SUMX/N.
COMPUTE J=MAKE(N,1,1).
COMPUTE S=(T(X)*X-SUMX*T(SUMX)/N)/(N-1).
COMPUTE SDMAT=SQRT(MDIAG(DIAG(S))).

COMPUTE R=INV(SDMAT)*S*INV(SDMAT).
PRINT R
/ TITLE 'Correlation Matrix'
/ FORMAT 'F8.3'
/ RNAMES=VNAMES
/ CNAMES=VNAMES.

PRINT /TITLE '--------------------------------------------------'.
PRINT /TITLE 'Principal Components Results Using The Covariance'.

COMPUTE EVALCOV=EVAL(S).
PRINT EVALCOV
/ TITLE 'Variance of Components (Using COV)'
/ FORMAT 'F9.4'
/ RNAMES=PCSNAMES.

COMPUTE PROPCOV=EVALCOV&/CSUM(EVALCOV).
PRINT PROPCOV
/ TITLE 'Proportion of Variance (Using COV)'
/ FORMAT 'F8.3'
/ RNAMES=PCSNAMES.

CALL SVD(S,LOADCOV,L1,A1).
PRINT LOADCOV
/ TITLE 'Loadings (Using COV)'
/ FORMAT 'F8.3'
/ RNAMES=VNAMES
/ CNAMES=PCSNAMES.

PRINT /TITLE '--------------------------------------------------'.
PRINT /Title 'Principal Components Results Using the Correlation'.

COMPUTE EVALCOR=EVAL(R).
PRINT EVALCOR
/ TITLE 'Variance of Components (Using COR)'
/ FORMAT 'F8.3'
/ RNAMES=PCRNAMES.

COMPUTE PROPCOR=EVALCOR&/CSUM(EVALCOR).
PRINT PROPCOR
/ TITLE 'Proportion of Variance Components (Using COR)'
/ FORMAT 'F8.3'
/ RNAMES=PCRNAMES.

CALL SVD(R,LOADCOR,L2,A2).
PRINT LOADCOR
/ TITLE 'Loadings (Using COR)'
/ FORMAT 'F8.3'
/ RNAMES=VNAMES
/ CNAMES=PCRNAMES.

COMPUTE PCS=(X-J*T(XBAR))*LOADCOV.
COMPUTE PCR=(X-J*T(XBAR))*INV(SDMAT)*LOADCOR.
SAVE PCS / OUTFILE='D:\pcacov.sav' / NAMES=PCSNAMES.
SAVE PCR / OUTFILE='D:\pcacor.sav' / NAMES=PCRNAMES.
END MATRIX.

You can add the newly created factor scores to your spreadsheet by choosing Data > Merge Files > Add Variables and selecting the file with the variables you want to add (either based on the covariance or based on the correlations). Click Ok and No.
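
The merge can also be sketched in syntax. The example below assumes the usair data is still the active data set, in its original row order, and that the correlation-based scores were saved to D:\pcacor.sav as in the matrix code above. The files are matched row by row, so neither one should be re-sorted first.

```spss
* Add the saved component scores as new columns of the active data set.
MATCH FILES
  /FILE=*
  /FILE='D:\pcacor.sav'.
EXECUTE.
```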

To reproduce the graph at the top of page 55 (assuming you have loaded the principal component values from the correlations into the spreadsheet) select Graphs > Scatter/Dot... then choose Simple Scatter. Choose PCR2 for the Y Axis, PCR1 for the X Axis, and Set Markers by City.

To get the output at the top of page 60, choose Analyze > Regression > Linear.... Select SO2 for the dependent variable and the first three PCR variables for the independents. Under Plots choose ZRESID for Y and ZPRED for X (just the z-scores for what we usually plot) and make sure that Normal Probability Plot is checked.
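
The regression dialog above should correspond to something like the following command. It is a sketch that assumes the merged spreadsheet contains SO2 along with PCR1, PCR2, and PCR3.

```spss
* Regress SO2 on the first three correlation-based principal components.
REGRESSION
  /DEPENDENT SO2
  /METHOD=ENTER PCR1 PCR2 PCR3
  /SCATTERPLOT=(*ZRESID ,*ZPRED)
  /RESIDUALS NORMPROB(ZRESID).
```

The SCATTERPLOT line requests the standardized residuals against the standardized predicted values, and the RESIDUALS line requests the normal probability plot.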