Statistical methods for gene expression studies using next-generation sequencing experiments

Bi, Ran
Major Professor
Peng Liu
Organizational Unit
As leaders in statistical research, collaboration, and education, the Department of Statistics at Iowa State University offers students an education like no other. We are committed to our mission of developing and applying statistical methods, and proud of our award-winning students and faculty.

In recent years, advances in high-throughput next-generation sequencing technologies have revolutionized genomic studies. The rapid progress of these technologies has resulted in an ever-increasing number of high-dimensional gene expression datasets available for analysis. However, due to the genetic complexity and high cost of such experiments, the number of replicates employed in an experiment is typically small. This introduces the so-called “small n, large p” problem, where n refers to the sample size and p refers to the number of genes, in which case the power of statistical inference is limited after adjusting for multiple testing. This dissertation presents novel statistical methods for gene expression experiments based on sequencing data, including sample size calculation and methods that borrow information across genes for identifying differentially expressed (DE) genes, detecting gene expression heterosis, and assessing differential translation across treatments.

Chapter 2 proposes a one-time, simulation-based sample size calculation method that controls the false discovery rate (FDR) for RNA-sequencing (RNA-seq) experimental design. Our procedure is based on the weighted linear model analysis facilitated by the voom method, which has been shown to have competitive performance in terms of power and FDR control for RNA-seq differential expression analysis. We derive a method that approximates the average power across the DE genes, and then calculate the sample size needed to achieve a desired average power while controlling FDR. Simulation results demonstrate that, for RNA-seq data with sample sizes calculated by our method, the actual power of several widely used tests for differential expression attains and stays close to the desired power.
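To make the notions of FDR control and average power concrete, the following is a minimal sketch only. It assumes the Benjamini-Hochberg step-up procedure as the FDR-controlling rule, which is a common choice but is not stated in this abstract, and it is not the sample size method of Chapter 2. The function names `benjamini_hochberg` and `average_power` are hypothetical.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return one boolean per gene: True if rejected at nominal FDR level alpha.

    Assumption: Benjamini-Hochberg step-up rule (not taken from the abstract).
    """
    m = len(p_values)
    # Indices of genes ordered by ascending p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Largest 1-based rank k such that p_(k) <= k * alpha / m.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * alpha / m:
            cutoff_rank = rank
    rejected = [False] * m
    # Step-up: every gene ranked at or below the cutoff is rejected.
    for idx in order[:cutoff_rank]:
        rejected[idx] = True
    return rejected


def average_power(rejected, is_de):
    """Fraction of truly DE genes that were declared significant."""
    n_de = sum(1 for d in is_de if d)
    if n_de == 0:
        return float("nan")
    hits = sum(1 for r, d in zip(rejected, is_de) if r and d)
    return hits / n_de
```

In a simulation study, `is_de` would be the known truth for each simulated gene, and `average_power` would be averaged over simulated datasets at each candidate sample size.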

Chapter 3 develops a semi-parametric Bayesian approach for DE analysis in RNA-seq data. More specifically, we model the count data from RNA-seq experiments with a Poisson-Gamma mixture model, and propose a Bayesian mixture modeling procedure with a Dirichlet process as the prior model for the distribution of fold changes between the two treatment means. We develop Markov chain Monte Carlo (MCMC) posterior simulation using the Metropolis-Hastings algorithm to generate posterior samples for differential expression analysis while controlling FDR. Simulation study results suggest that our proposed method outperforms other popular methods used for detecting DE genes.
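The Poisson-Gamma mixture mentioned above can be illustrated by simulation. The sketch below is an assumption-laden example, not the model of Chapter 3: it assumes a mean/dispersion parameterization in which the Poisson rate is Gamma distributed with shape 1/dispersion and scale mean × dispersion, so that the marginal count has mean `mean` and variance `mean + dispersion * mean**2`. The Dirichlet process prior and the MCMC sampler are not shown, and all function names are hypothetical.

```python
import math
import random


def sample_poisson(rng, lam):
    """Draw one Poisson(lam) count with Knuth's multiplication method.

    Suitable for moderate lam only, since exp(-lam) underflows for very large lam.
    """
    threshold = math.exp(-lam)
    k = 0
    prod = rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k


def sample_poisson_gamma(rng, mean, dispersion):
    """Draw one count from a Poisson whose rate is Gamma distributed.

    Assumed parameterization: shape = 1/dispersion, scale = mean * dispersion.
    """
    lam = rng.gammavariate(1.0 / dispersion, mean * dispersion)
    return sample_poisson(rng, lam)


def simulate_gene(rng, base_mean, fold_change, dispersion, n_per_group):
    """Simulate counts for one gene under two treatments.

    The second treatment mean is base_mean * fold_change.
    """
    control = [sample_poisson_gamma(rng, base_mean, dispersion)
               for _ in range(n_per_group)]
    treated = [sample_poisson_gamma(rng, base_mean * fold_change, dispersion)
               for _ in range(n_per_group)]
    return control, treated
```

For example, `simulate_gene(random.Random(0), 10.0, 2.0, 0.5, 3)` returns two lists of three counts each; with so few replicates per gene, the two groups overlap heavily, which is the motivation for borrowing information across genes.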

In Chapter 4, we extend the idea of Chapter 3 by proposing a powerful test to detect gene expression heterosis while controlling FDR. We use a similar Poisson-Gamma mixture model for RNA-seq count data, and propose a Bayesian mixture modeling procedure with a Dirichlet process as the prior for the distribution of fold changes between each parental line and the hybrid offspring. An MCMC sampling scheme based on the Gibbs sampler is used to provide posterior inference for detecting heterosis genes while controlling FDR. The effectiveness of our approach is demonstrated through simulation studies.

Chapter 5 addresses another gene expression analysis challenge, this time with ribosome profiling data. It explores a new statistical framework, RiboZIP, to identify differentially translated genes (DTGs). We model the ribosome profiling data with a zero-inflated Poisson (ZIP) model, and propose a Bayesian hierarchical modeling procedure to assess differential translation while taking the pairing information between mRNA and ribosome-protected fragment (RPF) samples into account. An MCMC sampling scheme is employed for posterior inference to detect DTGs while controlling FDR. We investigate the performance of our method and compare it with several existing methods used for ribosome profiling data. The analysis results show that our RiboZIP method generally provides a better ranking of genes as well as a higher number of true significant results, while still adequately controlling FDR.
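For readers unfamiliar with the zero-inflated Poisson distribution, the sketch below shows its standard probability mass function: with probability `pi` the count is a structural zero, and otherwise it follows a Poisson distribution with rate `lam`. This illustrates the generic ZIP distribution only; it is not the RiboZIP hierarchical model, and the function and parameter names are hypothetical.

```python
import math


def zip_pmf(k, lam, pi):
    """P(Y = k) for a zero-inflated Poisson with rate lam > 0 and zero-inflation pi.

    P(Y = 0) = pi + (1 - pi) * exp(-lam)
    P(Y = k) = (1 - pi) * Poisson(k; lam)   for k >= 1
    """
    # Poisson probability computed on the log scale for numerical stability.
    poisson = math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))
    if k == 0:
        return pi + (1.0 - pi) * poisson
    return (1.0 - pi) * poisson


def zip_log_likelihood(counts, lam, pi):
    """Log-likelihood of independent counts under a common ZIP(lam, pi).

    Assumes every count has non-negligible probability, so that the
    logarithm is finite.
    """
    return sum(math.log(zip_pmf(k, lam, pi)) for k in counts)
```

The mean of this distribution is `(1 - pi) * lam`, so excess zeros pull the average count below the Poisson rate; a plain Poisson model fitted to such data would underestimate `lam`.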

In summary, this dissertation raises and addresses several statistical problems in transcriptome data analysis. All proposed methods are evaluated through simulation studies and applied to real data.

Sat Dec 01 00:00:00 UTC 2018