Probabilistic methods for quality improvement in high-throughput sequencing data

Date
2016-01-01
Authors
Yin, Xin
Major Professor
Karin S. Dorman
Committee Member
Gregory Phillips
Abstract

Advances in high-throughput next-generation sequencing (NGS) technologies have enabled the determination of millions of nucleotide sequences in massively parallel fashion at affordable cost. Many studies have shown that data produced by mainstream NGS platforms carry higher error rates than Sanger sequencing, and have demonstrated the negative impact of sequencing errors on a wide range of NGS applications. It is therefore critically important for the primary analysis of sequencing data to produce accurate, high-quality nucleotides for downstream bioinformatics pipelines.

Two bioinformatics problems are dedicated to the direct removal of sequencing errors: base-calling and error correction. However, existing error correction methods are mostly algorithmic and heuristic. Few can address insertion and deletion errors, the dominant error types produced by many platforms. On the other hand, most base-callers do not model the underlying genome structure of the sequencing data, which is necessary for improving base-calling quality, especially in low-quality regions. Applying a base-caller and an error corrector sequentially does not fully offset their respective shortcomings.

In recognition of these issues, this dissertation proposes a probabilistic framework that closely emulates the sequencing-by-synthesis (SBS) process adopted by many NGS platforms. The core idea is to model sequencing data (individual reads, or fluorescence intensities) as independent emissions from a hidden Markov model (HMM), with transition distributions modeling local and double-stranded dependence in the genome, and emission distributions modeling the subtle error characteristics of the sequencers. Building on this backbone, we develop three novel methods for improving the data quality of high-throughput sequencing: 1) PREMIER, an accurate probabilistic corrector of substitution errors in Illumina data; 2) PREMIER-bc, an integrated base-caller and error corrector that significantly improves base-calling quality; and 3) PREMIER-indel, an extended error correction method that addresses substitution, insertion, and deletion errors for SBS-based sequencers with good empirical performance.
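
As a rough illustration of the HMM decoding idea underlying this framework, the sketch below runs Viterbi decoding over a read, with hidden states for the true nucleotides and observations for the (possibly erroneous) base calls. The uniform transition table and substitution-only emission matrix are hypothetical placeholders for this example, not PREMIER's learned parameters.

```python
# Minimal Viterbi sketch of the HMM view of error correction: hidden states
# are the true nucleotides, observations are the base calls of a read, and
# decoding recovers the most likely true sequence. All parameters below are
# illustrative assumptions, not the dissertation's fitted models.
import numpy as np

BASES = "ACGT"

# Hypothetical transition probabilities between consecutive true bases,
# standing in for the local genome dependence the framework models.
trans = np.full((4, 4), 0.25)

# Hypothetical emission matrix: probability of calling base j when the true
# base is i, i.e. a uniform substitution-error model with rate eps.
eps = 0.01
emit = np.full((4, 4), eps / 3)
np.fill_diagonal(emit, 1.0 - eps)

def viterbi(read: str) -> str:
    """Return the most likely true sequence for an observed read."""
    obs = [BASES.index(b) for b in read]
    n = len(obs)
    # Initialize with a uniform prior over the first true base.
    logp = np.log(np.full(4, 0.25)) + np.log(emit[:, obs[0]])
    back = np.zeros((n, 4), dtype=int)
    for t in range(1, n):
        scores = logp[:, None] + np.log(trans)   # (prev state, current state)
        back[t] = scores.argmax(axis=0)          # best predecessor per state
        logp = scores.max(axis=0) + np.log(emit[:, obs[t]])
    # Trace back the highest-probability state path.
    path = [int(logp.argmax())]
    for t in range(n - 1, 0, -1):
        path.append(back[t][path[-1]])
    return "".join(BASES[s] for s in reversed(path))

# With uniform transitions the decoded sequence simply follows the emissions;
# an informative transition model is what would let decoding correct errors.
print(viterbi("ACGTTACG"))
```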

Our foray into probabilistic methods for base-calling and error correction provides immediate benefits to downstream analyses through increased sequencing data quality and, more importantly, a flexible, fully probabilistic basis for going beyond primary analysis.

Type
dissertation
Copyright
2016