Improving RNA-seq transcript quantification

dc.contributor.advisor Dorman, Karin
dc.contributor.advisor Liu, Peng
dc.contributor.advisor Dai, Xiongtao
dc.contributor.advisor Niemi, Jarad
dc.contributor.advisor Espin Palazon, Raquel
dc.contributor.author Yuan, Lingnan
dc.contributor.department Statistics (LAS)
dc.date.accessioned 2023-01-10T20:08:10Z
dc.date.available 2023-01-10T20:08:10Z
dc.date.embargo 2025-01-10T00:00:00Z
dc.date.issued 2022-12
dc.date.updated 2023-01-10T20:08:11Z
dc.description.abstract RNA-seq is a deep sequencing technique used to analyze the expression of messenger RNA (mRNA) molecules (transcripts) in a cell or cells. Many existing tools for transcript quantification use the EM algorithm. This dissertation proposes several methods to improve the performance of these tools. In the first part of the dissertation, we incorporate EM acceleration methods, namely Anderson acceleration, SQUAREM, and quasi-Newton methods, into one of the most popular transcript quantification tools, Salmon. We show that the accelerated algorithms can speed up the original EM algorithm at no cost in accuracy. The performance is consistent across different initializations and data characteristics. Versions with backtracking guarantee monotone convergence and respect boundary constraints, with limited effect on speed. In the second part of the dissertation, we focus on estimation methods that better reflect the sparsity found in bulk and especially single-cell RNA-seq data. We introduce a penalty function, designed for probabilities, into the optimization. The penalty encourages estimated transcript abundances to lie on a vertex or edge of the probability simplex, thus achieving both shrinkage and parsimony in the estimated transcript abundances. The penalized EM algorithm distinguishes truly absent transcripts from expressed ones better than the original EM algorithm, in both bulk and single-cell RNA-seq data. In the third part of the dissertation, we focus on more efficient calculation of the quantification uncertainty, or estimated standard errors, of transcript abundances. Current methods to estimate quantification uncertainty rely heavily on resampling methods, such as the bootstrap and Gibbs sampling, which require a large number of expensive replicates for good accuracy. We show that a formulation derived using Louis' method can estimate the quantification uncertainty without resampling, and we demonstrate its utility on simulated data.
All three methods should have broad utility in the quantification step of standard RNA-seq analyses.
dc.format.mimetype PDF
dc.identifier.uri https://dr.lib.iastate.edu/handle/20.500.12876/Nr1VX9Rz
dc.language.iso en
dc.language.rfc3066 en
dc.subject.disciplines Statistics en_US
dc.subject.disciplines Bioinformatics en_US
dc.subject.keywords EM algorithm en_US
dc.subject.keywords Penalty function en_US
dc.subject.keywords Quantification uncertainty en_US
dc.subject.keywords RNA-seq en_US
dc.title Improving RNA-seq transcript quantification
dc.type dissertation en_US
dc.type.genre dissertation en_US
dspace.entity.type Publication
relation.isOrgUnitOfPublication 264904d9-9e66-4169-8e11-034e537ddbca
thesis.degree.discipline Statistics en_US
thesis.degree.discipline Bioinformatics en_US
thesis.degree.grantor Iowa State University en_US
thesis.degree.level dissertation
thesis.degree.name Doctor of Philosophy en_US
File
Original bundle
Name: Yuan_iastate_0097E_20621.pdf
Size: 867.66 KB
Format: Adobe Portable Document Format