Stephanie Hicks Colloquium
Thursday, February 18, 2016 - 2:45pm
Statistics Department Colloquium
Where: LeConte College, Room 210
Speaker: Stephanie Hicks
Affiliation: Harvard University, Department of Biostatistics and Computational Biology
Title: Tackling Systematic Errors in Genomics with Examples from Single-Cell RNA-Sequencing and DNA methylation
Abstract: Systematic errors, including batch effects, have been widely reported as a major challenge in high-throughput technologies. In this talk, I will present two examples in functional genomics. In the first part, I will present an analysis of fifteen published single-cell RNA-Sequencing studies and illustrate that systematic errors can explain a substantial percentage of observed cell-to-cell expression variability. Specifically, we found that the proportion of genes reported as expressed explains a substantial part of observed variability and that this quantity varies systematically across experimental batches. Furthermore, we found that experimental designs that confound outcomes of interest with batch effects are common. In the second part, I will present a statistical model that accounts for cellular heterogeneity in whole blood measured from DNA methylation profiles. Current methods to estimate the relative proportions of cell types depend on creating costly, platform-dependent methylation profiles of purified blood cell types. Because of platform-specific systematic errors, a new whole blood sample of interest must be measured using the same platform technology as the purified blood cell types. Our method is based on the idea of identifying informative genomic regions that are clearly methylated or unmethylated for each cell type. This permits estimation across multiple platform technologies, because cell types preserve their methylation state in these regions regardless of platform, even though the observed measurements are platform dependent. We fit a mixture model with region-specific platform-dependent random effects to estimate the unobservable platform-dependent methylation profiles and the relative cell type proportions. Under these assumptions, the challenge reduces to a missing data problem when using informative genomic regions.
I will demonstrate that the method accurately estimates the cell composition from purified and whole blood samples and is applicable across multiple platforms, including the Illumina 450K array and whole-genome bisulfite sequencing.