Improving RNA-seq transcript quantification

Date
2022-12
Authors
Yuan, Lingnan
Major Professor
Advisor
Dorman, Karin
Liu, Peng
Dai, Xiongtao
Niemi, Jarad
Espin Palazon, Raquel
Committee Member
Abstract
RNA-seq is a deep sequencing technique used to analyze the expression of messenger RNA (mRNA) molecules (transcripts) in a cell or cells. Many existing tools for transcript quantification use the EM algorithm. This dissertation proposes several methods to improve the performance of these tools. In the first part of the dissertation, we incorporate EM acceleration methods (Anderson acceleration, SQUAREM, and quasi-Newton methods) into one of the most popular transcript quantification tools, Salmon. We show that the accelerated algorithms speed up the original EM algorithm at no cost in accuracy. The performance is consistent across different initializations and data characteristics. Versions with backtracking guarantee monotone convergence and enforce the boundary constraints with limited effect on speed. In the second part of the dissertation, we focus on estimation methods that better reflect the sparsity found in bulk and especially single-cell RNA-seq data. We introduce a penalty function, designed for probabilities, into the optimization. The penalty encourages estimated transcript abundances to lie on a vertex or edge of the probability simplex, thus achieving both shrinkage and parsimony in the estimated transcript abundances. The penalized EM algorithm distinguishes truly absent transcripts from expressed ones better than the original EM algorithm, in both bulk and single-cell RNA-seq data. In the third part of the dissertation, we focus on more efficient calculation of the quantification uncertainty, or estimated standard errors, of transcript abundances. Current methods to estimate quantification uncertainty rely heavily on resampling methods, such as the bootstrap and Gibbs sampling, which require a large number of expensive replicates for good accuracy. We show that a formulation derived using Louis' method can be used to estimate the quantification uncertainty without resampling, and we demonstrate its utility on simulated data.
All three methods should have broad utility in the quantification step of standard RNA-seq analyses.
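As a rough illustration of the first part, the sketch below compares a plain EM update for transcript proportions with a SQUAREM-style squared extrapolation that backtracks toward the plain EM iterate whenever the candidate leaves the simplex or lowers the log-likelihood. It is a toy under stated assumptions, not Salmon's code or the dissertation's implementation: the names `em_step`, `squarem_step`, and `fit`, the compatibility matrix `A`, and the class counts are all hypothetical, and the step length and backtracking rule are one common SQUAREM variant chosen for brevity.

```python
import numpy as np

def em_step(theta, A, counts):
    """One EM update for mixture proportions.
    A[c, j] = P(read class c | transcript j); counts[c] = reads in class c."""
    w = A * theta                            # joint weight of class c and transcript j
    w /= w.sum(axis=1, keepdims=True)        # E-step: responsibilities per class
    return counts @ w / counts.sum()         # M-step: expected share of reads

def log_lik(theta, A, counts):
    with np.errstate(divide="ignore", invalid="ignore"):
        return float(counts @ np.log(A @ theta))

def squarem_step(theta, A, counts):
    """Squared extrapolation from two EM steps, with backtracking."""
    t1 = em_step(theta, A, counts)
    t2 = em_step(t1, A, counts)
    r = t1 - theta                           # first difference
    v = (t2 - t1) - r                        # second difference
    nv = np.linalg.norm(v)
    if nv == 0.0:
        return t2
    alpha = min(-np.linalg.norm(r) / nv, -1.0)   # assumed step-length choice
    ll0 = log_lik(theta, A, counts)
    while alpha < -1.0 - 1e-8:
        cand = theta - 2.0 * alpha * r + alpha**2 * v
        # Accept only candidates that stay on the simplex and do not
        # decrease the log-likelihood (monotone convergence).
        if cand.min() >= 0.0 and log_lik(cand, A, counts) >= ll0:
            return em_step(cand, A, counts)  # stabilizing EM step
        alpha = (alpha - 1.0) / 2.0          # shrink toward alpha = -1
    return t2                                # alpha = -1 reduces to two EM steps

def fit(step, A, counts, tol=1e-12, max_iter=10_000):
    theta = np.full(A.shape[1], 1.0 / A.shape[1])   # uniform start
    history = [log_lik(theta, A, counts)]
    for _ in range(max_iter):
        new = step(theta, A, counts)
        history.append(log_lik(new, A, counts))
        done = np.abs(new - theta).max() < tol
        theta = new
        if done:
            break
    return theta, history

# Hypothetical toy data: 4 read classes, 3 transcripts.
A = np.array([[0.7, 0.1, 0.1],
              [0.1, 0.7, 0.1],
              [0.1, 0.1, 0.7],
              [0.1, 0.1, 0.1]])
counts = np.array([40.0, 28.0, 22.0, 10.0])

theta_em, hist_em = fit(em_step, A, counts)
theta_sq, hist_sq = fit(squarem_step, A, counts)
print(theta_em, len(hist_em) - 1)
print(theta_sq, len(hist_sq) - 1)
```

The counts here were constructed to match proportions (0.5, 0.3, 0.2) exactly, so both variants should settle on the same estimate; the printed iteration counts give a feel for how much the extrapolation helps on this small problem. Because the extrapolated candidate sums to one by construction, only non-negativity needs checking before the likelihood test.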
Type
dissertation