Spring 2002
STAT 778/EDRM 828 - Item Response Theory
Tuesday/Thursday 2:00-3:15
201A LeConte

Course Website: http://www.stat.sc.edu/~habing/courses/778S02.html

Assignment 9
(Optional)
Due: 4:30pm Thursday, May 2

Answer either of the following:

A) Use the program GGUM2000 to estimate the item and ability parameters for the data set cap.dat. The estimating program, data set, and questions can be found at: http://www.stat.sc.edu/~habing/courses/778fitS02.html#ggum. Report which subject (examinee) appears to have the most extreme pro-death penalty view, and which subject appears to have the least extreme pro-death penalty view. Finally, select the most discriminating pro-death penalty question with |delta|>2, the most discriminating anti-death penalty question with |delta|>2, and the most discriminating moderate question with |delta|<1.

or

B) The data set http://www.stat.sc.edu/~habing/courses/data/alet.dat is the same as the data set a1291 except that the various paragraphs have been collapsed into four polytomous items. Use parscale and the command file http://www.stat.sc.edu/~habing/courses/data/h9.psl to fit Muraki's generalized partial credit model to this data set. Notice that the standard phase 2 output file does not give all of the item parameters for the model... just the overall difficulty and discrimination. Which of the paragraphs seems most difficult? Most discriminating? Does this seem to match with what you would expect from the classical item analysis p-values and biserials for the original a1291 data set?

Assignment 8
Due: Tuesday, April 23

For this assignment we will briefly examine how the Mantel-Haenszel Z statistic and D statistic are affected by various characteristics of the exam. In all four cases, you will be asked to simulate a reference and focal group data set using the distributions of Donoghue and Allen as your base (see the R help page section on DIF), and to answer a question or two on each of the simulated data sets. For each case, let the exam consist of 24 items, and let the reference and focal groups each have 500 examinees.

1) Simulate an exam having no DIF, where both groups have standard normal ability distributions. Compare the resulting set of Z statistics to the standard normal reference distribution, and compare the resulting D statistics to the value 0.

2) Simulate an exam as in (1), but where the first item effectively has a b value that is 0.5 larger for the focal group than for the reference group. Comment on the size of the Z and D statistics for item 1 as compared to the values observed in problem 1. Also, calculate the mean of the D statistics for items 2-24. Why should I be able to be fairly certain that you will observe a negative value for the average of the D statistics for 2-24?

3) Simulate an exam as in (1), but where all of the items effectively have b values that are 0.5 larger for the focal group than for the reference group. Examine the resulting Z and D statistics. Why don't all of the items appear to be DIFed against the focal group? Assuming the simulated abilities are correct, why is or isn't this result a bad thing?

4) Simulate an exam as in (1), but where the focal group has ability mean of -0.25 instead of 0. Why don't all of the items appear to be DIFed against the focal group? Assuming the simulated abilities are correct, why is or isn't this result a bad thing?
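For readers who want to see what a Mantel-Haenszel D statistic measures, the following is a minimal Python sketch, not the course's R function. The function name `mh_d_dif`, the input layout (one 2x2 table per matching-score level), and the scaling D = -2.35 ln(alpha_MH) are assumptions taken from the usual delta-scale definition; the course function may use a different scaling or the opposite sign.

```python
import math

def mh_d_dif(tables):
    """Mantel-Haenszel D from per-score-level 2x2 tables (hypothetical helper).

    Each table is (ref_right, ref_wrong, focal_right, focal_wrong),
    the counts for one level of the matching score.
    """
    num = 0.0
    den = 0.0
    for ref_right, ref_wrong, focal_right, focal_wrong in tables:
        n = ref_right + ref_wrong + focal_right + focal_wrong
        if n == 0:
            continue  # skip empty score levels
        num += ref_right * focal_wrong / n
        den += ref_wrong * focal_right / n
    alpha_mh = num / den  # common odds ratio, reference vs. focal
    # Assumed convention: negative D means the item is harder for the focal group.
    return -2.35 * math.log(alpha_mh)

# Made-up counts for illustration only.
no_dif = [(30, 20, 30, 20), (40, 10, 40, 10)]
against_focal = [(30, 20, 20, 30), (40, 10, 30, 20)]
d_none = mh_d_dif(no_dif)
d_against = mh_d_dif(against_focal)
```

Under this assumed convention, identical reference and focal tables give D = 0, and tables in which the focal group does worse at every score level give a negative D.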

Assignment 7
Due: Thursday, April 11

1) Simulate ten data sets, each with 20 items and 1000 examinees. Each of the items should follow a 3PL model and have parameters a=1.0, b=0.0, and c=0.20. Each of the examinees should have their ability randomly sampled from a standard normal distribution. For each data set calculate Rosenbaum's Mantel-Haenszel Z statistic for the first pair of items when the items in question are not included in the score (the default for the function). How do these 10 values compare to the reference standard normal distribution? Using the same 10 simulated data sets, repeat this process, but include the two items in the score. How do the two sets of Z statistics compare to each other?

Problems 2 and 3 refer to the June 1992 analytical reasoning test that was used on the first exam. Recall that the data can be found at http://www.stat.sc.edu/~habing/courses/778ex1/a692a.dat. This is part of one of the data sets that is described in the 1996 Stout et al. paper. (The layout of items is described on page 337, and page 349 says that it seems to have the most multidimensionality of any of the tests considered.)

2) Based on the paragraph structure of the test, select three item pairs that may have significantly negative associations using Rosenbaum's Mantel-Haenszel procedure, and thus indicate multidimensionality. Did you find any significant values at the alpha=0.05 level? Select three additional item pairs, this time ones you would expect to have positive associations based on the paragraph structure of the test. How do the results for these two sets of item pairs compare?

3) Take a random sample of size 250 from the a692 data set and apply the HCA/CCPROX procedure to it [note that the data set only has 1,000 examinees in it to begin with when you are using sample]. How well does the output reflect the paragraph structure of the test? Which two items look like they are most closely related?
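The simulation step in problem 1 can be sketched in Python as follows. This is only an illustration, not the course's R simulation function: the function name `simulate_3pl`, the seed, and the use of the scaling constant D = 1.7 in the logistic 3PL are assumptions, and the course software may parameterize the model differently.

```python
import numpy as np

def simulate_3pl(n_examinees, a, b, c, rng, D=1.7):
    """Simulate 0/1 responses under a logistic 3PL model (hypothetical helper).

    a, b, c are arrays with one entry per item; abilities are drawn
    from a standard normal distribution.
    """
    a, b, c = (np.asarray(x, dtype=float) for x in (a, b, c))
    theta = rng.standard_normal(n_examinees)
    # Probability of a correct response for every examinee-item pair.
    z = D * a * (theta[:, None] - b)
    p = c + (1.0 - c) / (1.0 + np.exp(-z))
    # A response is correct when a uniform draw falls below p.
    responses = (rng.random(p.shape) < p).astype(int)
    return theta, responses

rng = np.random.default_rng(778)  # arbitrary seed, for reproducibility only
n_items = 20
data_sets = [
    simulate_3pl(1000, np.full(n_items, 1.0), np.zeros(n_items),
                 np.full(n_items, 0.20), rng)[1]
    for _ in range(10)
]
```

Each element of `data_sets` is a 1000 by 20 matrix of 0/1 responses that could then be passed to whatever routine computes the Z statistics.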

Assignment 6
Due: Thursday, March 28

These problems use the 3PL parameters for items 35-39 in Appendix A of the text.

1) Find the four item test that would make the most useful test for discriminating between examinees with ability below 1.0 and those above 1.0. (Do not use any of the items more than once.)

2) Say it is desired to have a test whose information is not too low at any ability between -2.0 to 2.0. Find the four item test whose lowest amount of information between -2.0 and 2.0 is higher than the lowest amount of information of any of the competing tests. (Do not use any of the items more than once.)

3) Consider a test made up of items 35 and 38. Find the standard error for an estimated ability of 0.
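The quantities used in these problems, item information and the standard error of an ability estimate, can be sketched in Python as below. The item parameters shown are made up for illustration and are not the Appendix A values; the function names and the scaling constant D = 1.7 are also assumptions.

```python
import math

def p_3pl(theta, a, b, c, D=1.7):
    """Probability of a correct response under the logistic 3PL model."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

def item_information(theta, a, b, c, D=1.7):
    """3PL item information: (D a)^2 * (Q / P) * ((P - c) / (1 - c))^2."""
    p = p_3pl(theta, a, b, c, D)
    return (D * a) ** 2 * ((1.0 - p) / p) * ((p - c) / (1.0 - c)) ** 2

def standard_error(theta, items, D=1.7):
    """Asymptotic standard error: 1 / sqrt(test information at theta)."""
    info = sum(item_information(theta, a, b, c, D) for a, b, c in items)
    return 1.0 / math.sqrt(info)

# Hypothetical (a, b, c) triples, NOT the parameters from the text.
hypothetical_items = [(1.0, 0.0, 0.2), (0.8, 0.5, 0.25)]
se_at_zero = standard_error(0.0, hypothetical_items)
```

Test information is the sum of the item information functions, so comparing candidate four-item tests amounts to comparing these sums over the ability range of interest.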

Assignment 5
Due: Thursday, February 28

This assignment uses the output files from HMWK #4.

1) Plot the two estimated item response functions for item 2 on the same axes.

2) Recall from HMWK 4 that the two programs estimated different means and standard deviations for the estimated abilities. Translate the estimated a and b values from bilog and parscale to a standard normal scale.

3) Plot the two modified IRFs on the same axes.

4) Notice that they still seem to disagree over part of the ability range. Why is this disagreement not as severe as the graphical display seems to indicate?

5) The file http://www.stat.sc.edu/~habing/courses/data/simul.ab contains the true examinee abilities that were used to simulate the data set. How well do these seem to match the ability estimates from PARSCALE and BILOG-MG?
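The rescaling asked for in problem 2 relies on the fact that the item response function depends on ability only through a(theta - b). A minimal Python sketch of the resulting transformation follows; the function name and the numbers are hypothetical and are not taken from any BILOG or PARSCALE output.

```python
def to_standard_scale(a, b, mean, sd):
    """Re-express a and b on a scale where ability has mean 0 and sd 1.

    If theta_std = (theta - mean) / sd, then a_std = a * sd and
    b_std = (b - mean) / sd leave a * (theta - b) unchanged.
    The guessing parameter c is not affected by the change of scale.
    """
    return a * sd, (b - mean) / sd

# Hypothetical values for illustration only.
a, b, mean, sd = 1.2, 0.5, 0.1, 0.9
a_std, b_std = to_standard_scale(a, b, mean, sd)

# Check the invariance at an arbitrary ability value.
theta = 0.7
theta_std = (theta - mean) / sd
original = a * (theta - b)
rescaled = a_std * (theta_std - b_std)
```

Here `original` and `rescaled` agree up to floating-point error, which is why the rescaled parameters describe the same item response function.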

Assignment 4
Due: Tuesday, February 26

Run both BILOG and PARSCALE with the default 3PL settings to estimate the item parameters and examinee abilities for the data set http://www.stat.sc.edu/~habing/courses/simul.dat. The command files for the programs can be found at http://www.stat.sc.edu/~habing/courses/data/bilogex.BLM and http://www.stat.sc.edu/~habing/courses/data/parsex.PSL respectively.

You do not need to turn in the entire set of output files! Instead, for each program, report: the estimated biserial correlation and p-values for the first two items, the estimated item parameters for the first two items, the estimated examinee abilities for the first two examinees, and the mean and standard deviations of the estimated ability distributions.

You will want to save the output files for homework 5 and 6.

Assignment 3
Due: Thursday, February 14

1) Consider an item with biserial correlation 0.7 and p-value 0.4. Give an estimate of the a and b parameters that this item would have under a normal ogive model when the ability distribution is standard normal. [You can practice using the formulas with Table 16.11.1 (page 379) in Lord and Novick.]

2) Simulate ten data sets, each with 20 items and 1000 examinees. Each of the items should follow the normal ogive model and have parameters identical to those you found in question 1. Each of the examinees should have their ability randomly sampled from a standard normal distribution. For each data set record the estimated p-value and biserial correlation for the first item. Briefly comment on how well the estimated p-value and biserial correlation approximate the "true values".

3) Simulate one more data set, as in question 2, but set the standard deviation of the ability distribution to be 0.5. How did the estimates change? Speculate on a reason why they would have changed (or not changed) in this particular way.

4) Simulate one more data set, as in question 2, but set the mean of the ability distribution to be 0.5. How did the estimates change? Speculate on a reason why they would have changed (or not changed) in this particular way.

5) Two different groups of examinees are given the same set of items. For the first group, item one is estimated to have parameters a=1.0 and b=0.5. For the second group the parameter estimates are a=0.8 and b=0.8. Give the relationship between the ability scale theta_1 estimated for the first group and ability scale theta_2 estimated for the second group. If we estimated the error on the theta_1 scale to be +/- 0.1, what would you estimate the error on the theta_2 scale?
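The conversion used in problem 1 can be sketched in Python as follows. It assumes the standard normal ogive relationships for a standard normal ability distribution, a = rho / sqrt(1 - rho^2) and b = -Phi^{-1}(p) / rho, where rho is the biserial correlation and p is the p-value. The function name and the example values are hypothetical and are deliberately different from those in the problem.

```python
import math
from statistics import NormalDist

def normal_ogive_from_classical(biserial, p_value):
    """Approximate normal ogive a and b from a biserial and a p-value.

    Assumes ability is standard normal, so that
    a = rho / sqrt(1 - rho^2) and b = -Phi^{-1}(p) / rho.
    """
    a = biserial / math.sqrt(1.0 - biserial ** 2)
    b = -NormalDist().inv_cdf(p_value) / biserial
    return a, b

# Hypothetical item: biserial 0.6, answered correctly by half the examinees.
a_mid, b_mid = normal_ogive_from_classical(0.6, 0.5)
# A harder hypothetical item with the same biserial.
a_hard, b_hard = normal_ogive_from_classical(0.6, 0.3)
```

With these formulas, an item answered correctly by exactly half of a standard normal population has b = 0, and items with smaller p-values have larger b.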

Assignment 2
Due: Tues. February 5th

Split the data set a1291 into three subsets of 2000 examinees each: arand = a random sample, alo = the lowest scoring of the remaining examinees, and ahi = the highest scoring of the remaining examinees.

1) Calculate the biserial correlations for the items on the alo subset of examinees. Also calculate all of the classical item analysis statistics for the items on the arand subset. Your goal is to see how these two sets of statistics (one from alo, one from arand) are related.

Analyze these quantities to see if you can discover how the statistics calculated on the random subset of examinees influence the biserial correlations calculated on the low ability subset. In particular, can you find what characteristics of the items in arand seem to lead to a high estimated biserial in alo? What characteristics lead to a negative estimated biserial in alo?

You may use any statistical package to calculate descriptive statistics, basic plots, or build models to identify the relationships. Briefly summarize your findings in a paragraph or so.

2) The estimated reliability of the test calculated using ahi is lower than that of arand... but somehow it also has a smaller sem. What characteristic of the ahi group is causing this?

Assignment 1
Due: Thurs. January 31st

1) Consider a 30 question exam with reliability of 0.80. Give an estimate of the reliability that each single item would have.

2) In a sentence or two, why is it possible for the estimated lambda2 value to exceed the true reliability of a test for that set of examinees?

3) Construct a 95% confidence interval (X+SEM type) for the true score of an examinee with observed score 1 on the a1291 data set. In a sentence or two, point out a weakness that this demonstrates in using the SEM in general.

4) For the a1291 data set, which item do you think would be the most useful in a test aimed at low ability examinees? How did you choose it?

5) For the a1291 data set, which item do you think would be the most useful for splitting the examinees into two groups of equal size? How did you choose it?
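One way to approach problem 1 is the Spearman-Brown formula, which assumes the test is made up of parallel items. The Python sketch below shows the formula and its inverse; the function names and the example values are hypothetical and are deliberately different from those in the problem.

```python
def spearman_brown(single_reliability, k):
    """Reliability of a test built from k parallel components."""
    return k * single_reliability / (1.0 + (k - 1) * single_reliability)

def single_item_reliability(test_reliability, k):
    """Invert Spearman-Brown: reliability of one of k parallel items."""
    return test_reliability / (k - (k - 1) * test_reliability)

# Hypothetical test: 10 items with overall reliability 0.5.
r_item = single_item_reliability(0.5, 10)
r_back = spearman_brown(r_item, 10)
```

Applying `spearman_brown` to the single-item value recovers the original test reliability, which is a quick check that the inversion was done correctly.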