Statistical methods for ChIP-seq and microbiome studies using next-generation DNA sequencing data

dc.contributor.advisor Peng Liu
dc.contributor.advisor Chong Wang
dc.contributor.author Goren, Emily
dc.contributor.department Statistics 2020-02-12T22:55:16.000 2020-06-30T03:20:12Z 2020-06-30T03:20:12Z Sun Dec 01 00:00:00 UTC 2019 2021-11-09 2019-01-01
dc.description.abstract <p>In this dissertation, we studied two types of data generated by next-generation sequencing technologies. Chapter 2 addresses the analysis of ChIP-seq data with biological replicates to identify protein-binding sites. Chapters 3 and 4 address the analysis of microbiome data to estimate the causal effects of microbiome features on outcomes of interest in the presence of confounding variables.</p> <p>ChIP-seq experiments aim to detect DNA-protein binding sites and require biological replication to draw inferential conclusions. However, there is currently no consensus on how to analyze ChIP-seq data with biological replicates. Very few methodologies exist for the joint analysis of replicated ChIP-seq data, with approaches ranging from combining the results of analyzing replicates individually to joint modeling of all replicates. Combining the results of replicates analyzed separately can reduce peak classification performance relative to joint modeling, while currently available methods for joint analysis may fail to control the false discovery rate at the nominal level. In Chapter 2, we propose BinQuasi, a peak caller for replicated ChIP-seq data that jointly models biological replicates using a generalized linear model framework and employs a one-sided quasi-likelihood ratio test to detect peaks. When applied to simulated and real data, BinQuasi performs favorably compared to existing methods, including better control of the false discovery rate than existing joint modeling approaches. BinQuasi offers a flexible approach to joint modeling of replicated ChIP-seq data, which is preferable to combining the results of replicates analyzed individually. We created an R package called BinQuasi that is available at</p> <p>Microbiome studies have uncovered associations between microbes and human, animal, and plant health outcomes.
This has led to interest in identifying microbial interventions for the treatment of disease and the optimization of crop yields, which will require identifying the individual microbiome features that are relevant. That task is challenging because of the high dimensionality of microbiome data and the confounding that results from the complex and dynamic interactions among host, environment, and microbiome. The performance of variable selection and estimation procedures may be unsatisfactory when a categorical confounding variable produces differentially abundant features. For microbiome studies with such a confounding structure, we propose in Chapter 3 a standardization approach to estimating the population effects of individual microbiome features. Because of the high dimensionality and the confounding-induced correlation between features, we propose feature screening, selection, and estimation conditional on each stratum of the confounder. Comprehensive simulation studies demonstrate the advantages of our approach in recovering relevant features. Using a potential-outcomes framework, we outline the assumptions required to ascribe causal, rather than associational, interpretations to the identified microbiome effects. We applied the proposed approach to an agricultural study of the rhizosphere microbiome of sorghum in which nitrogen fertilizer application is a confounding variable. We identified microbial taxa that are consistent with biological understanding of potential plant-microbe interactions.</p> <p>In Chapter 4, we present an inverse probability weighting approach to causal analysis of the effects of individual microbiome features in the presence of continuous confounding variables. In simulated microbiome data, we show that inverse probability weighting in marginal models provides microbiome effect estimates with lower bias and mean squared error than conditional regression adjustment for confounding.
Our approach is demonstrated using an agricultural data set for identification of soil microbes with the potential to modulate biomass production in sorghum.</p>
dc.format.mimetype application/pdf
dc.identifier archive/
dc.identifier.articleid 8693
dc.identifier.contextkey 16524763
dc.identifier.s3bucket isulib-bepress-aws-west
dc.identifier.submissionpath etd/17686
dc.language.iso en
dc.source.bitstream archive/|||Fri Jan 14 21:27:33 UTC 2022
dc.subject.disciplines Statistics and Probability
dc.subject.keywords bioinformatics
dc.subject.keywords biostatistics
dc.subject.keywords causal inference
dc.subject.keywords generalized linear models
dc.subject.keywords high-dimensional inference
dc.subject.keywords next-generation sequencing
dc.title Statistical methods for ChIP-seq and microbiome studies using next-generation DNA sequencing data
dc.type article
dc.type.genre dissertation
dspace.entity.type Publication
relation.isOrgUnitOfPublication 264904d9-9e66-4169-8e11-034e537ddbca Statistics dissertation Doctor of Philosophy