Don Edwards
Statistics Department
University of South Carolina
Models and Measures of Inter-Rater Agreement
An agreement process is here defined to be a pool of many items to be
subjectively judged (e.g. patient biopsies) and many potential raters (e.g.
pathologists). In this talk we consider only the simplest case, where the
outcome of any given judging is binary (1=cancer, 0=not cancer). Most would
agree that the rating process is unreliable if there is not good agreement
between judges examining the same item - but how should the strength of the
inter-rater agreement be quantified? Cohen (1960) defined a simple
chance-corrected measure of agreement for the case of two raters, which has
been generalized to the case of many raters. Cohen's κ is widely used -
there have been more than 7,000 citations since 1967. In this talk, under
simple data models which allow explicit probability calculations, we show
that Cohen's κ overcorrects for chance. In particular, under these models
it substantially underestimates the strength of inter-rater agreement for
the diagnosis of rare diseases. An alternative model-based measure of
agreement, κ_M, is defined. It is compared to Cohen's κ in simulations and
using data collected by Allsbrook et al. (2001) in a study carried out to
assess the reliability of the Gleason grading scale for the diagnosis of
prostate cancer.
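
As a reminder of what "chance-corrected" means here, Cohen's κ for two raters is usually written as (p_o - p_e) / (1 - p_e), where p_o is the observed proportion of agreement and p_e is the agreement expected if the two raters judged independently with their own marginal rates. The sketch below is only an illustration of that standard formula. The function name and the two 2x2 tables are hypothetical, made up for this example; they are not data from the talk, and the code does not implement the model-based measure discussed in the abstract. Both tables have the same raw agreement (94 of 100 items), but κ comes out much lower when the positive rating is rare.

```python
def cohen_kappa(a, b, c, d):
    """Cohen's kappa for two raters and a binary outcome.

    a = both raters say 1, b = rater 1 says 1 and rater 2 says 0,
    c = rater 1 says 0 and rater 2 says 1, d = both raters say 0.
    """
    n = a + b + c + d
    p_obs = (a + d) / n              # observed proportion of agreement
    p1 = (a + b) / n                 # rater 1's proportion of 1s
    p2 = (a + c) / n                 # rater 2's proportion of 1s
    p_exp = p1 * p2 + (1 - p1) * (1 - p2)  # agreement expected by chance
    return (p_obs - p_exp) / (1 - p_exp)


# Hypothetical counts (assumed for illustration only).
common = cohen_kappa(47, 3, 3, 47)   # positives are about half of the items
rare = cohen_kappa(2, 3, 3, 92)      # positives are about 5% of the items

print(round(common, 3))  # 0.88
print(round(rare, 3))    # 0.368
```

With identical raw agreement of 0.94, the chance term p_e rises from 0.5 to 0.905 as the condition becomes rare, which is what pulls κ down in the second table.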
Joint research with Dr. Kerrie Nelson, Massachusetts General Hospital
Principal reference:
Nelson, Kerrie, and Edwards, Don (2008). "On population-based measures of
agreement". The Canadian Journal of Statistics, 36: 411-426.