College of Arts & Sciences
Department of Statistics


Stephanie Hicks Colloquium

Thursday, February 18, 2016 - 2:45pm

Statistics Department Colloquium

Where: LeConte College, Room 210

Speaker: Stephanie Hicks

Affiliation: Harvard University, Department of Biostatistics and Computational Biology

Title:  Tackling Systematic Errors in Genomics with Examples from Single-Cell RNA-Sequencing and DNA methylation

Abstract: Systematic errors, including batch effects, have been widely reported as a major challenge in high-throughput technologies. In this talk, I will present two examples in functional genomics. In the first part, I will present an analysis of fifteen published single-cell RNA-Sequencing studies and illustrate that systematic errors can explain a substantial percentage of observed cell-to-cell expression variability. Specifically, we found that the proportion of genes reported as expressed explains a substantial part of observed variability and that this quantity varies systematically across experimental batches. Furthermore, we found that experimental designs that confound outcomes of interest with batch effects are common. In the second part, I will present a statistical model that accounts for cellular heterogeneity in DNA methylation profiles measured from whole blood. Current methods for estimating the relative proportions of cell types depend on creating costly, platform-dependent methylation profiles of purified blood cell types. Because of platform-specific systematic errors, a new whole blood sample of interest must be measured using the same platform technology as the purified blood cell types. Our method is based on the idea of identifying informative genomic regions that are clearly methylated or unmethylated for each cell type. This permits estimation across multiple platform technologies, because cell types preserve their methylation state in these regions regardless of platform, even though the observed measurements are platform dependent. We fit a mixture model with region-specific, platform-dependent random effects to estimate the unobservable platform-dependent methylation profiles and the relative cell type proportions. Under these assumptions, the challenge reduces to a missing data problem when using informative genomic regions.
I will demonstrate that the method accurately estimates the cell composition from purified and whole blood samples and is applicable across multiple platforms, including the Illumina 450K array and whole-genome bisulfite sequencing.