Statistical methods for gene expression studies using next-generation sequencing experiments

dc.contributor.advisor Peng Liu
dc.contributor.author Bi, Ran
dc.contributor.department Statistics (LAS)
dc.date 2019-03-26T17:42:43.000
dc.date.accessioned 2020-06-30T03:13:38Z
dc.date.available 2020-06-30T03:13:38Z
dc.date.copyright Sat Dec 01 00:00:00 UTC 2018
dc.date.embargo 2001-01-01
dc.date.issued 2018-01-01
dc.description.abstract <p>In recent years, advances in high-throughput next-generation sequencing technologies have revolutionized genomic studies. The rapid progress of these technologies has resulted in an ever-increasing number of high-dimensional gene expression datasets available for analysis. However, due to the genetic complexity and high cost of such experiments, the number of replicates employed in an experiment is typically small. This leads to the so-called “small n, large p” problem, where n is the sample size and p is the number of genes; in this setting, the power of statistical inference is limited after adjusting for multiple testing. This dissertation presents novel statistical methods for gene expression experiments based on sequencing data, including sample size calculation and methods that borrow information across genes for identifying differentially expressed (DE) genes, detecting gene expression heterosis, and assessing differential translation across treatments.</p> <p>Chapter 2 proposes a one-time simulation-based sample size calculation method that controls the false discovery rate (FDR) for RNA-sequencing (RNA-seq) experimental design. Our procedure is based on the weighted linear model analysis facilitated by the voom method, which has been shown to have competitive performance in terms of power and FDR control for RNA-seq differential expression analysis. We derive a method that approximates the average power across the DE genes, and then calculate the sample size needed to achieve a desired average power while controlling FDR. Simulation results demonstrate that, with the sample size calculated by our method, the actual power of several widely used tests for differential expression is close to the desired power for RNA-seq data.</p> <p>Chapter 3 develops a semi-parametric Bayesian approach for DE analysis in RNA-seq data. 
More specifically, we model the count data from RNA-seq experiments with a Poisson-Gamma mixture model, and propose a Bayesian mixture modeling procedure with a Dirichlet process as the prior model for the distribution of fold changes between the two treatment means. We develop Markov chain Monte Carlo (MCMC) posterior simulation using the Metropolis-Hastings algorithm to generate posterior samples for differential expression analysis while controlling FDR. Simulation study results suggest that our proposed method outperforms other popular methods used for detecting DE genes.</p> <p>In Chapter 4, we extend the idea of Chapter 3 by proposing a powerful test to detect gene expression heterosis while controlling FDR. We use a similar Poisson-Gamma mixture model for RNA-seq count data, and propose a Bayesian mixture modeling procedure with a Dirichlet process as the prior for the distribution of fold changes between each parental line and the hybrid offspring. An MCMC sampling scheme based on the Gibbs algorithm provides posterior inference for detecting heterosis genes while controlling FDR. The effectiveness of our approach is demonstrated through simulation studies.</p> <p>Chapter 5 addresses another gene expression analysis challenge, this time with ribosome profiling data. It explores a new statistical framework, RiboZIP, to identify differentially translated genes (DTGs). We model the ribosome profiling data with a zero-inflated Poisson (ZIP) model, and propose a Bayesian hierarchical modeling procedure to assess differential translation while taking the pairing information between mRNA and RPF samples into account. An MCMC sampling scheme is employed for posterior inference to detect DTGs while controlling FDR. We investigate the performance of our method and compare it with several existing methods used for ribosome profiling data. 
The analysis results show that our RiboZIP method generally provides a better ranking of genes as well as a higher number of true significant results, while still adequately controlling FDR.</p> <p>In summary, this dissertation raises and addresses several statistical problems in transcriptome data analysis. All proposed methods are evaluated through simulation studies and applied to real data analysis with fruitful results.</p>
dc.format.mimetype application/pdf
dc.identifier archive/lib.dr.iastate.edu/etd/16790/
dc.identifier.articleid 7797
dc.identifier.contextkey 14007037
dc.identifier.s3bucket isulib-bepress-aws-west
dc.identifier.submissionpath etd/16790
dc.identifier.uri https://dr.lib.iastate.edu/handle/20.500.12876/30973
dc.language.iso en
dc.source.bitstream archive/lib.dr.iastate.edu/etd/16790/Bi_iastate_0097E_17650.pdf|||Fri Jan 14 21:06:01 UTC 2022
dc.subject.disciplines Biostatistics
dc.subject.disciplines Statistics and Probability
dc.subject.keywords Differential Expression/Translation
dc.subject.keywords False Discovery Rate
dc.subject.keywords MCMC
dc.subject.keywords Next-generation Sequencing
dc.subject.keywords Nonparametric Bayesian
dc.subject.keywords Sample Size Calculation
dc.title Statistical methods for gene expression studies using next-generation sequencing experiments
dc.type dissertation
dc.type.genre dissertation
dspace.entity.type Publication
relation.isOrgUnitOfPublication 264904d9-9e66-4169-8e11-034e537ddbca
thesis.degree.discipline Statistics
thesis.degree.level dissertation
thesis.degree.name Doctor of Philosophy
File
Original bundle
Name: Bi_iastate_0097E_17650.pdf
Size: 3.79 MB
Format: Adobe Portable Document Format