Assessing and accounting for correlation in RNA-seq data analysis

dc.contributor.advisor Dan . Nettleton
dc.contributor.advisor Roger . Wise
dc.contributor.author Liu, Meiling
dc.contributor.department Department of Statistics (LAS)
dc.date 2020-02-12T22:57:38.000
dc.date.accessioned 2020-06-30T03:20:31Z
dc.date.available 2020-06-30T03:20:31Z
dc.date.copyright Sun Dec 01 00:00:00 UTC 2019
dc.date.embargo 2020-11-26
dc.date.issued 2019-01-01
dc.description.abstract <p>RNA-sequencing (RNA-seq) technology is a high-throughput next-generation sequencing procedure. It allows researchers to measure gene transcript abundance at lower cost and with higher resolution.</p> <p>Advances in RNA-seq technology have prompted new methodological developments in several branches of quantitative analysis for RNA-seq data. In this dissertation, we focus on several topics related to RNA-seq data analysis.</p> <p>This dissertation comprises three papers on the analysis of RNA-seq data. We first introduce a method for detecting differentially expressed genes across experimental conditions with correlated RNA-seq data. We fit a general linear model to the transformed read counts of each gene and assume the error vector has a block-diagonal correlation matrix with unstructured blocks that account for within-gene correlations. To stabilize parameter estimation with limited replicates, we shrink the residual maximum likelihood estimator of the correlation parameters toward a mean-correlation locally weighted scatterplot smoothing curve. The shrinkage weights are determined by a hierarchical model and estimated via parametric bootstrap. Because information is shared across genes in parameter estimation, the null distribution of the test statistic is unknown and mathematically intractable. We therefore approximate this null distribution through a parametric bootstrap strategy.</p> <p>Next, we focus on correlation estimation between genes. Gene co-expression correlation estimation is a fundamental step in gene co-expression network construction. The correlation estimates can also be used as inputs to topological statistics that help analyze gene functions. We propose a new strategy for defining and estimating co-expression correlation. We introduce a motivating dataset with two factors and a split-plot experimental design. 
We define two types of co-expression correlations that originate from two different sources. We apply a linear mixed model to each gene pair. The correlations within the random effects and within the random errors represent the two types of correlations.</p> <p>Finally, we consider a basic topic in quantitative RNA-seq analysis: gene filtering. It is essential to remove genes with extremely low read counts before further analysis to avoid numerical problems and to obtain more stable estimates. Most differential expression and gene network analysis tools include built-in gene filtering functions. In general, these functions rely on a user-defined hard threshold for gene selection and fail to make full use of gene features, such as gene length and GC content level. Several studies have shown that gene features have a significant impact on RNA-sequencing efficiency and thus should be considered in subsequent analysis. We propose to fit a model involving a two-component mixture of Gaussian distributions to the transformed read counts for each sample and assume all parameters are functions of GC content. We adopt a modified semiparametric expectation-maximization algorithm for parameter estimation.</p> <p>We perform a series of simulation studies and show that, in many cases, the proposed methods improve upon existing methods and are more robust.</p>
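The gene-filtering idea in the abstract (a two-component Gaussian mixture fitted to transformed read counts) can be illustrated with a minimal sketch. This is not the dissertation's algorithm: the abstract describes a modified semiparametric EM in which all parameters are functions of GC content, whereas the sketch below assumes constant parameters and uses plain EM. The function name `fit_two_gaussian_mixture`, the simulated data, and the 0.5 posterior cutoff are all hypothetical choices made for illustration.

```python
import numpy as np

def fit_two_gaussian_mixture(x, n_iter=200, tol=1e-8):
    """Plain EM for a two-component univariate Gaussian mixture.

    Simplifying assumption: parameters are constants, not functions of
    GC content as in the method the abstract describes.
    """
    x = np.asarray(x, dtype=float)
    # Initialise the two components from the lower and upper halves of the data.
    med = np.median(x)
    lo, hi = x[x <= med], x[x > med]
    mu = np.array([lo.mean(), hi.mean()])
    sd = np.array([lo.std() + 1e-3, hi.std() + 1e-3])
    pi = np.array([0.5, 0.5])
    prev = -np.inf
    for _ in range(n_iter):
        # E-step: posterior probability of each component for each gene.
        z = (x[:, None] - mu) / sd
        dens = pi * np.exp(-0.5 * z ** 2) / (sd * np.sqrt(2 * np.pi))
        total = dens.sum(axis=1)
        resp = dens / total[:, None]
        # M-step: weighted updates of mixing proportions, means, and SDs.
        nk = resp.sum(axis=0)
        pi = nk / x.size
        mu = (resp * x[:, None]).sum(axis=0) / nk
        sd = np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk) + 1e-6
        # Log-likelihood under the parameters used in this E-step.
        loglik = np.log(total).sum()
        if loglik - prev < tol:
            break
        prev = loglik
    # resp holds the responsibilities from the last E-step performed.
    return pi, mu, sd, resp

# Simulated "transformed counts": a low-expression and a high-expression group.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0.0, 0.5, 300), rng.normal(5.0, 1.0, 700)])
pi, mu, sd, resp = fit_two_gaussian_mixture(x)

# Hypothetical filtering rule: keep genes more likely to be in the upper component.
keep = resp[:, 1] > 0.5
```

Filtering on a posterior probability rather than on a fixed count threshold is what replaces the user-defined hard cutoff that the abstract criticizes; letting the mixture parameters vary with GC content would be the next step toward the approach it describes.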
dc.format.mimetype application/pdf
dc.identifier archive/lib.dr.iastate.edu/etd/17731/
dc.identifier.articleid 8738
dc.identifier.contextkey 16524992
dc.identifier.s3bucket isulib-bepress-aws-west
dc.identifier.submissionpath etd/17731
dc.identifier.uri https://dr.lib.iastate.edu/handle/20.500.12876/31914
dc.language.iso en
dc.source.bitstream archive/lib.dr.iastate.edu/etd/17731/Liu_iastate_0097E_18527.pdf|||Fri Jan 14 21:28:09 UTC 2022
dc.subject.disciplines Bioinformatics
dc.subject.disciplines Statistics and Probability
dc.subject.keywords correlation
dc.subject.keywords differential expression analysis
dc.subject.keywords gene co-expression network
dc.subject.keywords RNA-seq
dc.title Assessing and accounting for correlation in RNA-seq data analysis
dc.type dissertation
dc.type.genre dissertation
dspace.entity.type Publication
relation.isOrgUnitOfPublication 264904d9-9e66-4169-8e11-034e537ddbca
thesis.degree.discipline Bioinformatics and Computational Biology; Statistics
thesis.degree.level dissertation
thesis.degree.name Doctor of Philosophy
File
Original bundle
Name: Liu_iastate_0097E_18527.pdf
Size: 2.51 MB
Format: Adobe Portable Document Format