Model-based clustering methods for high-throughput sequencing data

Date
2021-12
Authors
Peng, Xiyu
Major Professor
Advisor
Dorman, Karin S
Gu, Xun
Huang, Xiaoqiu
Nettleton, Dan
Phillips, Gregory J
Committee Member
Department
Statistics
Abstract
Biology is full of heterogeneity and stochasticity. This dissertation presents three studies that develop and apply model-based clustering methods to explore the diversity underlying genetic data. The first two studies address the clustering of amplicon sequence data. Amplicon sequencing is widely used to explore heterogeneity and rare variants in genetic populations. Resolving true biological variants and accurately quantifying their abundances from noisy amplicon sequence data is the foundation of downstream analyses. However, measured abundances are distorted by stochasticity and bias in amplification, along with errors introduced during polymerase chain reaction (PCR) and sequencing. The first study aims to correct these errors. We introduce a reference-free, model-based clustering method that rapidly resolves the number, abundance, and identity of real biological sequences in massive Illumina amplicon datasets. The method estimates a mixture model, using a greedy strategy to gradually select error-free sequences while approximately maximizing the likelihood. The second study further addresses amplification bias through accurate deduplication. We propose a deduplication method that estimates absolute molecular counts from amplicon sequence data with Unique Molecular Identifiers (UMIs). The method detects and corrects errors in both the UMIs and the sampled sequences, and it recognizes UMI collisions. In both studies, we benchmark our approaches and demonstrate that they outperform competing methods. We adopt computational strategies that make our algorithms applicable to massive datasets with millions of reads. In the third study, we focus on omics datasets. We propose a method for simultaneous dimension reduction and clustering that combines factor analysis for dimension reduction with a simple Gaussian mixture model for clustering. We introduce a penalization framework with sparsifying penalties imposed on the factor loadings. We show that the proposed method improves the accuracy of parameter estimation and the clustering performance on both simulated and real datasets.
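To make the deduplication problem in the second study concrete, the following is a minimal sketch of a naive, non-model-based baseline: it merges each UMI into a strictly more abundant UMI within a small Hamming distance and counts the remaining groups as molecules. It is not the method proposed in the dissertation, which is model-based and also corrects sequence errors and recognizes UMI collisions. The function names (`hamming`, `collapse_umis`, `molecule_count`), the `max_dist` threshold, and the toy reads are all assumptions made for illustration.

```python
from collections import Counter


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("UMIs must have equal length")
    return sum(x != y for x, y in zip(a, b))


def collapse_umis(umis, max_dist=1):
    """Greedily merge each UMI into a strictly more abundant UMI within max_dist.

    Hypothetical helper: returns a dict mapping every observed UMI to its
    representative (itself if no more abundant neighbor exists).
    """
    counts = Counter(umis)
    # Visit UMIs from most to least abundant; ties are broken alphabetically.
    ordered = sorted(counts, key=lambda u: (-counts[u], u))
    representatives = []
    assignment = {}
    for umi in ordered:
        parent = next(
            (r for r in representatives
             if counts[r] > counts[umi] and hamming(r, umi) <= max_dist),
            None,
        )
        if parent is None:
            representatives.append(umi)
            assignment[umi] = umi
        else:
            assignment[umi] = parent
    return assignment


def molecule_count(reads, max_dist=1):
    """Estimate the number of original molecules as the number of UMI groups."""
    assignment = collapse_umis([umi for umi, _ in reads], max_dist)
    return len(set(assignment.values()))


# Toy data (assumed): (UMI, sequence) pairs; "AACT" differs from the
# abundant UMI "AACG" at one position and is treated as an error copy.
reads = [("AACG", "s1")] * 5 + [("AACT", "s1")] + [("GGTA", "s2")] * 3
print(molecule_count(reads))
```

A distance-threshold heuristic like this one cannot distinguish a UMI error from two distinct molecules that happen to carry similar or identical UMIs, which is the kind of ambiguity a model-based approach is intended to resolve.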