Statistical methods for ChIP-seq and microbiome studies using next-generation DNA sequencing data

Goren, Emily

In this dissertation, we study two types of data generated by next-generation sequencing technologies. Chapter 2 addresses the analysis of ChIP-seq data with biological replicates to identify protein-binding sites. Chapters 3 and 4 address the analysis of microbiome data to estimate the causal effects of microbiome features on outcomes of interest in the presence of confounding variables.

ChIP-seq experiments aim to detect DNA-protein binding sites and require biological replication to draw inferential conclusions. However, there is currently no consensus on how to analyze ChIP-seq data with biological replicates. Very few methodologies exist for the joint analysis of replicated ChIP-seq data, and the available approaches range from combining the results of replicates analyzed individually to joint modeling of all replicates. Combining the results of replicates analyzed separately can reduce peak classification performance compared to joint modeling, while currently available methods for joint analysis may fail to control the false discovery rate at the nominal level. In Chapter 2, we propose BinQuasi, a peak caller for replicated ChIP-seq data that jointly models biological replicates using a generalized linear model framework and employs a one-sided quasi-likelihood ratio test to detect peaks. When applied to simulated and real data, BinQuasi performs favorably compared to existing methods, including better control of the false discovery rate than existing joint modeling approaches. BinQuasi offers a flexible approach to joint modeling of replicated ChIP-seq data that is preferable to combining the results of replicates analyzed individually. We created an R package called BinQuasi that is available at

Microbiome studies have uncovered associations between microbes and human, animal, and plant health outcomes. This has led to interest in identifying microbial interventions for the treatment of disease and the optimization of crop yields, which will require identifying the individual microbiome features that are relevant. That task is challenging because of the high dimensionality of microbiome data and the confounding that results from the complex and dynamic interactions among host, environment, and microbiome. The performance of variable selection and estimation procedures may be unsatisfactory when a categorical confounding variable produces differentially abundant features. For microbiome studies with such a confounding structure, we propose in Chapter 3 a standardization approach to estimating the population effects of individual microbiome features. Because of the high dimensionality and the confounding-induced correlation between features, we propose feature screening, selection, and estimation conditional on each stratum of the confounder. Comprehensive simulation studies demonstrate the advantages of our approach in recovering relevant features. Using a potential-outcomes framework, we outline the assumptions required to ascribe causal, rather than associational, interpretations to the identified microbiome effects. We applied the proposed approach to an agricultural study of the rhizosphere microbiome of sorghum in which nitrogen fertilizer application is a confounding variable. We identified microbial taxa that are consistent with biological understanding of potential plant-microbe interactions.
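The general idea of standardization over a categorical confounder can be illustrated with a minimal sketch. This is not the dissertation's implementation: it omits feature screening and selection, considers a single feature, and assumes a linear outcome model within each stratum. The function names and the toy data (two fertilizer strata, one feature) are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def ols_slope(x, y):
    """Slope of the simple least-squares regression of y on x."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    return sxy / sxx


def standardized_effect(x, y, c):
    """Average the stratum-specific slopes, weighting each stratum by its
    observed share of the sample (an estimate of P(C = c))."""
    strata = defaultdict(lambda: ([], []))
    for xi, yi, ci in zip(x, y, c):
        strata[ci][0].append(xi)
        strata[ci][1].append(yi)
    n = len(x)
    return sum(len(xs) / n * ols_slope(xs, ys) for xs, ys in strata.values())


# Hypothetical toy data: the feature is more abundant in the "high" stratum,
# and the outcome is also shifted upward there, so the strata differ in both.
feature = [0, 1, 2, 3, 4, 5, 6, 7]
stratum = ["low"] * 4 + ["high"] * 4
outcome = [0, 1, 2, 3, 22, 25, 28, 31]  # slope 1 in "low", slope 3 in "high"

naive = ols_slope(feature, outcome)  # pooled fit that ignores the confounder
standardized = standardized_effect(feature, outcome, stratum)
print(f"naive slope: {naive:.3f}, standardized effect: {standardized:.3f}")
```

In this toy example the pooled slope absorbs the between-stratum shift, whereas the standardized estimate is the stratum-size-weighted average of the within-stratum slopes.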

In Chapter 4, we present an inverse probability weighting approach to causal analysis of the effects of individual microbiome features in the presence of continuous confounding variables. In simulated microbiome data, we show that inverse probability weighting in marginal models provides microbiome effect estimates with lower bias and mean squared error than conditional regression adjustment for confounding. Our approach is demonstrated using an agricultural data set to identify soil microbes with the potential to modulate biomass production in sorghum.
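As a rough illustration of inverse probability weighting with a continuous confounder, the sketch below fits a weighted marginal model using stabilized weights. It is a simplified example under stated assumptions, not the method of Chapter 4: it assumes a single continuous feature, a normal linear model for the feature given the confounder, and a linear marginal outcome model. The simulation settings and all function names are hypothetical.

```python
import math
import random


def normal_pdf(v, mean, sd):
    """Density of a normal distribution evaluated at v."""
    z = (v - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def wls_line(x, y, w=None):
    """Weighted least-squares fit of y ~ a + b*x; returns (a, b)."""
    if w is None:
        w = [1.0] * len(x)
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    b = sxy / sxx
    return my - b * mx, b


def stabilized_weights(x, c):
    """Stabilized weights f(x) / f(x | c), assuming normal densities and a
    linear model for the feature x given the confounder c."""
    n = len(x)
    a, b = wls_line(c, x)  # feature model: x ~ a + b*c
    resid = [xi - (a + b * ci) for xi, ci in zip(x, c)]
    sd_cond = math.sqrt(sum(r * r for r in resid) / (n - 2))
    mx = sum(x) / n
    sd_marg = math.sqrt(sum((xi - mx) ** 2 for xi in x) / (n - 1))
    return [
        normal_pdf(xi, mx, sd_marg) / normal_pdf(xi, a + b * ci, sd_cond)
        for xi, ci in zip(x, c)
    ]


def simulate(n, seed):
    """Hypothetical data: c confounds both the feature x and the outcome y.
    The true effect of x on y is 1.0."""
    rng = random.Random(seed)
    c = [rng.gauss(0.0, 1.0) for _ in range(n)]
    x = [0.5 * ci + rng.gauss(0.0, 1.0) for ci in c]
    y = [1.0 * xi + 2.0 * ci + rng.gauss(0.0, 1.0) for xi, ci in zip(x, c)]
    return x, c, y


x, c, y = simulate(5000, seed=1)
naive_slope = wls_line(x, y)[1]  # unweighted fit that ignores the confounder
ipw_slope = wls_line(x, y, stabilized_weights(x, c))[1]
print(f"naive: {naive_slope:.3f}, IPW: {ipw_slope:.3f}, truth: 1.000")
```

Weighting by the ratio of the marginal to the conditional feature density approximately removes the dependence between feature and confounder in the weighted sample, so the weighted marginal fit can be read as an effect estimate only under the stated modeling assumptions and no unmeasured confounding.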

bioinformatics, biostatistics, causal inference, generalized linear models, high-dimensional inference, next-generation sequencing