Don Edwards

Statistics Department

University of South Carolina


Models and Measures of Inter-Rater Agreement

An agreement process is defined here as a pool of many items to be subjectively judged (e.g. patient biopsies) together with many potential raters (e.g. pathologists). In this talk we consider only the simplest case, where the outcome of any given judging is binary (1 = cancer, 0 = not cancer). Most would agree that a rating process is unreliable if judges examining the same item do not agree well, but how should the strength of inter-rater agreement be quantified? Cohen (1960) defined a simple chance-corrected measure of agreement for the case of two raters, which has since been generalized to the case of many raters. Cohen's κ is widely used, with more than 7,000 citations since 1967. In this talk, under simple data models which allow explicit probability calculations, we show that Cohen's κ overcorrects for chance. In particular, under these models it substantially underestimates the strength of inter-rater agreement for the diagnosis of rare diseases. An alternative, model-based measure of agreement, κ_M, is defined. It is compared to Cohen's κ in simulations and on data collected by Allsbrook et al. (2001) in a study carried out to assess the reliability of the Gleason grading scale for the diagnosis of prostate cancer.
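For readers unfamiliar with the two-rater statistic, the standard form of Cohen's κ is (p_o - p_e) / (1 - p_e), where p_o is the observed proportion of items on which the raters agree and p_e is the agreement expected by chance from each rater's marginal rate of positive calls. The sketch below computes it for binary ratings. The function name `cohen_kappa` and the 100-item data set are hypothetical and chosen only for illustration; they are not taken from the talk or from Allsbrook et al. (2001), and the sketch does not implement the model-based alternative measure.

```python
def cohen_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two raters giving binary (0/1) ratings to the same items."""
    if len(ratings_a) != len(ratings_b) or len(ratings_a) == 0:
        raise ValueError("need two non-empty rating lists of equal length")
    n = len(ratings_a)

    # Observed agreement: fraction of items given the same rating by both raters.
    p_o = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n

    # Marginal proportion of positive (1) ratings for each rater.
    p_a = sum(ratings_a) / n
    p_b = sum(ratings_b) / n

    # Chance agreement if the raters judged independently with these marginals.
    p_e = p_a * p_b + (1 - p_a) * (1 - p_b)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when chance agreement equals 1")

    return (p_o - p_e) / (1 - p_e)


# Hypothetical rare-condition data: 100 items, each rater calls 4 positive,
# and the raters disagree on only 4 items.
rater_a = [1] * 2 + [1] * 2 + [0] * 2 + [0] * 94
rater_b = [1] * 2 + [0] * 2 + [1] * 2 + [0] * 94

raw_agreement = sum(a == b for a, b in zip(rater_a, rater_b)) / len(rater_a)
kappa = cohen_kappa(rater_a, rater_b)
print(f"raw agreement = {raw_agreement:.2f}, Cohen's kappa = {kappa:.3f}")
# prints: raw agreement = 0.96, Cohen's kappa = 0.479
```

In this made-up example the raters agree on 96% of items, yet κ is below 0.5 because nearly all of that agreement is attributed to chance when positives are rare. This is only a numerical illustration of how the chance correction behaves at low prevalence, not a reproduction of the talk's analysis.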

Joint research with Dr. Kerrie Nelson, Massachusetts General Hospital

Principal reference: Nelson, Kerrie, and Edwards, Don (2008). "On population-based measures of agreement". The Canadian Journal of Statistics, 36: 411-426.

