Highly accurate approaches for some high-throughput sequencing data preprocessing tasks
Date
2022-12
Authors
Andari, Shofi
Major Professor
Advisor
Dorman, Karin
Yandeau-Nelson, Marna
Huang, Xiaoqiu
Severin, Andrew
Wurtele, Eve
Committee Member
Abstract
Next-generation sequencing (NGS) data underlie many experimental techniques in modern biology. Despite improvements in experimental protocols and the careful efforts of lab scientists, some NGS datasets are particularly noisy. We focus on two common pipeline steps that are sensitive to noisy data: read-pair merging and amplicon denoising. First, in NGS paired-end sequencing, a DNA fragment is read from both ends, generating paired forward and reverse reads. In many modern protocols that sample short fragments, these read pairs may overlap. Merging overlapping read pairs can provide longer, higher-quality single reads for use in downstream bioinformatics pipelines, especially in low-quality datasets. There are many bioinformatics tools for read-pair merging, but most account for errors in inflexible ways, such as allowing up to a user-provided number of mismatches. We examine the performance of several state-of-the-art read-pair merging tools on simulated and real datasets representing various indel rates, overlap sizes, and read qualities. We also present a highly accurate read-pair merger based on Needleman-Wunsch alignment with a custom, quality-based scoring system and demonstrate its superior performance on noisy datasets. Second, deep sequencing of amplicons is common in many NGS experiments, and a major task is to identify the true amplicons and discard those that contain errors. Most denoisers for Illumina data assume that true amplicons are observed multiple times, but noisy data may contain no replicated reads. We demonstrate that denoisers indeed fail on noisy datasets, even when replicated reads occur, and we propose an alternative approach that identifies candidate amplicons based on replicated partial reads.
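To make the idea of quality-aware read-pair merging concrete, the following is a minimal sketch and not the dissertation's implementation. It aligns the end of the forward read to the start of the reverse-complemented reverse read with a Needleman-Wunsch-style dynamic program in which the overlap ends are free. Everything specific here is an assumption made for illustration: the log-odds scoring in `pair_score`, the constant `GAP` penalty, the `min_score` threshold, the rule of keeping the higher-quality base at a disagreement, and all function names.

```python
import math

GAP = -4.0  # hypothetical constant gap penalty (assumption, not from the source)

def phred_to_error(q):
    """Convert a Phred quality score to an error probability."""
    return 10 ** (-q / 10)

def pair_score(a, b, qa, qb):
    """Log-odds that bases a and b come from the same true base (assumed model)."""
    ea, eb = phred_to_error(qa), phred_to_error(qb)
    # Both reads correct, or both wrong and wrong in the same way.
    p_agree = (1 - ea) * (1 - eb) + ea * eb / 3
    p = p_agree if a == b else (1 - p_agree) / 3
    return math.log(p / 0.25)  # compare against a uniform random base

def revcomp(seq):
    """Reverse complement of an uppercase ACGT string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def merge_pair(fwd, fq, rev, rq, min_score=5.0):
    """Merge a read pair; return the merged sequence, or None if no overlap is found.

    fwd, rev: reads as sequenced; fq, rq: lists of Phred scores for each read.
    """
    r, rq = revcomp(rev), rq[::-1]
    n, m = len(fwd), len(r)
    # S[i][j]: best score aligning some suffix of fwd[:i] with all of r[:j].
    S = [[0.0] * (m + 1) for _ in range(n + 1)]   # column 0 stays 0: free start in fwd
    T = [[None] * (m + 1) for _ in range(n + 1)]  # traceback moves
    for j in range(1, m + 1):
        S[0][j], T[0][j] = j * GAP, "L"
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = S[i - 1][j - 1] + pair_score(fwd[i - 1], r[j - 1], fq[i - 1], rq[j - 1])
            up = S[i - 1][j] + GAP    # base present in fwd only
            left = S[i][j - 1] + GAP  # base present in r only
            S[i][j], T[i][j] = max((diag, "D"), (up, "U"), (left, "L"))
    # fwd must be consumed to its end; the unaligned tail of r is free.
    j_end = max(range(m + 1), key=lambda j: S[n][j])
    if S[n][j_end] < min_score:
        return None
    i, j, cols = n, j_end, []
    while j > 0:
        move = T[i][j]
        if move == "D":
            # On disagreement keep the higher-quality base (assumed consensus rule).
            cols.append(fwd[i - 1] if fq[i - 1] >= rq[j - 1] else r[j - 1])
            i, j = i - 1, j - 1
        elif move == "U":
            cols.append(fwd[i - 1])
            i -= 1
        else:
            cols.append(r[j - 1])
            j -= 1
    return fwd[:i] + "".join(reversed(cols)) + r[j_end:]
```

Under this assumed scoring, a mismatch between two low-quality bases costs almost nothing while a mismatch between two high-quality bases is strongly penalized, which is the kind of flexibility a fixed mismatch count cannot provide. The sketch only handles the case where the fragment is at least as long as each read; pairs whose best overlap score falls below `min_score` are left unmerged.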
Type
dissertation