Jaxk Reeves
Statistics Department
University of Georgia
Sporulation Data Revisited
The 'hot' topic in Statistical Genetics for the past few years has been
the analysis of micro-array data. Micro-array technology allows researchers to
measure the RNA expression levels of thousands of genes of
an organism simultaneously and over time. Such data can be very useful in
biological research, since, if properly analyzed, it can give researchers
clues as to the function of particular genes or suggest which genes
are likely to be involved in the bio-chemical pathways for certain
processes. In a typical experiment, the number of genes analyzed, n, is
frequently in the thousands, much larger than the number whose effects on
the process in question could possibly be measured at a significant level.
The statistician's role in this endeavor is two-fold: to find the genes for
which there is significant evidence of change in expression level during the
process and, among these, to cluster those showing the strongest evidence
of mutually associated expression and/or repression. A 'classic' data set
which many researchers in this field use to illustrate their pet techniques is the
yeast sporulation data set of Chu, DeRisi, et al. (Chu, Science, 1998),
available on-line at http://cmgm.stanford.edu/pbrown/sporulation. Although
micro-array techniques have improved significantly in the past 5 years and
there has been some move away from the use of the Red/Green intensity
measures used in this experiment, the experiment was very well designed for its time
and is not atypical of the sort of micro-array analysis currently being done
in many biological labs throughout the world. In this talk, I will examine
a number of statistical assumptions made in the analysis of this data set
and discuss the effects of these assumptions upon conclusions reached by the
authors of the sporulation study and, more generally, by those who use
similar statistical analysis techniques.