Statistical methods for gene expression studies using next-generation sequencing experiments

Bi, Ran

dc.contributor.advisor	Peng Liu
dc.contributor.author	Bi, Ran
dc.contributor.department	Statistics (LAS)
dc.date	2019-03-26T17:42:43.000
dc.date.accessioned	2020-06-30T03:13:38Z
dc.date.available	2020-06-30T03:13:38Z
dc.date.copyright	Sat Dec 01 00:00:00 UTC 2018
dc.date.embargo	2001-01-01
dc.date.issued	2018-01-01
dc.description.abstract	<p>In recent years, advances in high-throughput next-generation sequencing technologies have revolutionized genomic studies. The rapid progress of these technologies has resulted in an ever-increasing number of high-dimensional gene expression datasets available for analysis. However, due to the genetic complexity and high cost of such experiments, the number of replicates employed in an experiment is typically small. This introduces the so-called “small n, large p” problem, where n refers to the sample size and p refers to the number of genes, in which case the power of statistical inference is limited after adjusting for multiple testing. This dissertation presents novel statistical methods for gene expression experiments based on sequencing data, including sample size calculation and methods that borrow information across genes for identifying differentially expressed (DE) genes, detecting gene expression heterosis, and assessing differential translation across treatments.</p> <p>Chapter 2 proposes a one-time, simulation-based sample size calculation method that controls the false discovery rate (FDR) for RNA-sequencing (RNA-seq) experimental design. Our procedure is based on the weighted linear model analysis facilitated by the voom method, which has been shown to have competitive performance in terms of power and FDR control for RNA-seq differential expression analysis. We derive a method that approximates the average power across the DE genes, and then calculate the sample size needed to achieve a desired average power while controlling FDR. Simulation results show that, for RNA-seq data with the sample size calculated by our method, the actual power of several widely used tests for differential expression is close to the desired power.</p> <p>Chapter 3 develops a semi-parametric Bayesian approach for DE analysis in RNA-seq data. 
More specifically, we model the count data from RNA-seq experiments with a Poisson-Gamma mixture model, and propose a Bayesian mixture modeling procedure with a Dirichlet process as the prior model for the distribution of fold changes between the two treatment means. We develop Markov chain Monte Carlo (MCMC) posterior simulation using the Metropolis-Hastings algorithm to generate posterior samples for differential expression analysis while controlling FDR. Simulation study results suggest that our proposed method outperforms other popular methods used for detecting DE genes.</p> <p>In Chapter 4, we extend the idea of Chapter 3 by proposing a powerful test to detect gene expression heterosis while controlling FDR. We use a similar Poisson-Gamma mixture model for RNA-seq count data, and propose a Bayesian mixture modeling procedure with a Dirichlet process as the prior for the distribution of fold changes between each parental line and the hybrid offspring. An MCMC sampling scheme based on the Gibbs algorithm provides posterior inference to detect heterosis genes while controlling FDR. The effectiveness of our approach is demonstrated through simulation studies.</p> <p>Chapter 5 addresses another gene expression analysis challenge, this time with ribosome profiling data. It explores a new statistical framework, RiboZIP, to identify differentially translated genes (DTGs). We model the ribosome profiling data with a zero-inflated Poisson (ZIP) model, and propose a Bayesian hierarchical modeling procedure to assess differential translation while taking the pairing information between mRNA and RPF samples into account. An MCMC sampling scheme is employed for posterior inference to detect DTGs while controlling FDR. We investigate the performance of our method and compare it with several existing methods used for ribosome profiling data. 
The analysis results show that our RiboZIP method generally provides a better ranking of genes as well as a higher number of true significant results, while still adequately controlling FDR.</p> <p>In summary, this dissertation raises and addresses several statistical problems in transcriptome data analysis. All proposed methods are evaluated through simulation studies and applied to real data.</p>
dc.format.mimetype	application/pdf
dc.identifier	archive/lib.dr.iastate.edu/etd/16790/
dc.identifier.articleid	7797
dc.identifier.contextkey	14007037
dc.identifier.s3bucket	isulib-bepress-aws-west
dc.identifier.submissionpath	etd/16790
dc.identifier.uri	https://dr.lib.iastate.edu/handle/20.500.12876/30973
dc.language.iso	en
dc.source.bitstream	archive/lib.dr.iastate.edu/etd/16790/Bi_iastate_0097E_17650.pdf|||Fri Jan 14 21:06:01 UTC 2022
dc.subject.disciplines	Biostatistics
dc.subject.disciplines	Statistics and Probability
dc.subject.keywords	Differential Expression/Translation
dc.subject.keywords	False Discovery Rate
dc.subject.keywords	MCMC
dc.subject.keywords	Next-generation Sequencing
dc.subject.keywords	Nonparametric Bayesian
dc.subject.keywords	Sample Size Calculation
dc.title	Statistical methods for gene expression studies using next-generation sequencing experiments
dc.type	dissertation
dc.type.genre	dissertation
dspace.entity.type	Publication
relation.isOrgUnitOfPublication	264904d9-9e66-4169-8e11-034e537ddbca
thesis.degree.discipline	Statistics
thesis.degree.level	dissertation
thesis.degree.name	Doctor of Philosophy
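
Illustrative note on the count models named in the abstract. The sketch below is not taken from the dissertation. It shows the two textbook distributions the abstract mentions: the Poisson-Gamma mixture, whose marginal is a negative binomial, and the zero-inflated Poisson (ZIP). The function names and the mean/shape and lam/pi0 parameterizations are assumptions chosen for illustration; the record does not state which parameterization the author uses.

```python
import math


def poisson_pmf(k: int, lam: float) -> float:
    """P(Y = k) for Y ~ Poisson(lam), computed on the log scale."""
    if lam == 0.0:
        return 1.0 if k == 0 else 0.0
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))


def poisson_gamma_pmf(k: int, mean: float, shape: float) -> float:
    """Marginal P(Y = k) when Y | lam ~ Poisson(lam) and
    lam ~ Gamma(shape, scale = mean / shape), with mean > 0 and shape > 0.

    Integrating out lam gives a negative binomial with the same mean and
    variance mean + mean**2 / shape (hypothetical parameterization).
    """
    p = shape / (shape + mean)  # negative binomial "success" probability
    log_pmf = (
        math.lgamma(k + shape) - math.lgamma(shape) - math.lgamma(k + 1)
        + shape * math.log(p) + k * math.log1p(-p)
    )
    return math.exp(log_pmf)


def zip_pmf(k: int, lam: float, pi0: float) -> float:
    """P(Y = k) for a zero-inflated Poisson: with probability pi0 the count
    is a structural zero, otherwise it is drawn from Poisson(lam)."""
    base = (1.0 - pi0) * poisson_pmf(k, lam)
    return pi0 + base if k == 0 else base


def moments(pmf, upper: int = 2000):
    """Mean and variance of a count distribution by direct summation
    over 0 .. upper - 1 (the tail beyond that is assumed negligible)."""
    probs = [pmf(k) for k in range(upper)]
    mean = sum(k * q for k, q in enumerate(probs))
    var = sum((k - mean) ** 2 * q for k, q in enumerate(probs))
    return mean, var


if __name__ == "__main__":
    # Overdispersion: the variance exceeds the mean in both models.
    print(moments(lambda k: poisson_gamma_pmf(k, mean=10.0, shape=2.0)))
    print(moments(lambda k: zip_pmf(k, lam=4.0, pi0=0.3)))
```

In both models the variance exceeds the mean, which is the usual motivation for preferring them to a plain Poisson for sequencing counts. Under the parameterization assumed here, the mixture has variance mean + mean^2 / shape, and the ZIP has mean (1 - pi0) * lam.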

File

Name: Bi_iastate_0097E_17650.pdf
Size: 3.79 MB
Format: Adobe Portable Document Format

Collections

Theses and Dissertations