Sequence-specific sequence comparison using pairwise statistical significance

Agrawal, Ankit
Major Professor
Xiaoqiu Huang
Computer Science

Sequence comparison is one of the most fundamental computational problems in bioinformatics, for which many approaches have been and are still being developed. In particular, pairwise sequence alignment forms the crux of both DNA and protein sequence comparison techniques, which in turn form the basis of many other applications in bioinformatics. Pairwise sequence alignment methods align two sequences using a substitution matrix (such as BLOSUM62) that assigns a score to each pair of aligned residues, and give an alignment score for the given sequence pair. Biologists routinely use such pairwise alignment programs to identify similar, or more specifically, related sequences (those having a common ancestor). It is widely accepted that the relatedness of two sequences is better judged by the statistical significance of the alignment score than by the alignment score alone. This research addresses the problem of accurately estimating the statistical significance of a pairwise alignment for the purpose of identifying related sequences, by making the sequence comparison process more sequence-specific.
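To illustrate why significance is more informative than a raw score, the following is a minimal sketch and not the method developed in this work. It assumes a simple match/mismatch scoring scheme in place of a substitution matrix such as BLOSUM62, a linear gap penalty, and a generic shuffle-based empirical p-value. The function names and parameter values are hypothetical and chosen only for illustration.

```python
import random


def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Smith-Waterman local alignment score with a linear gap penalty.

    Assumption: match/mismatch scores stand in for a real substitution matrix.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


def empirical_p_value(a, b, n_shuffles=200, seed=0):
    """Estimate how often a shuffled copy of b scores at least as well against a.

    Shuffling preserves the length and composition of b, so the null
    distribution is specific to this particular sequence pair.
    """
    rng = random.Random(seed)
    observed = local_alignment_score(a, b)
    chars = list(b)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(chars)
        if local_alignment_score(a, "".join(chars)) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return observed, (hits + 1) / (n_shuffles + 1)


if __name__ == "__main__":
    seq = "ACGTTGCAAGCTTAGGCTAACGT"
    score, p = empirical_p_value(seq, seq)
    print(f"score={score}, empirical p-value={p:.4f}")
```

The same raw score can be strong evidence for one pair and unremarkable for another, depending on sequence length and composition. Comparing against shuffles of the pair itself is one simple way to make the judgment sequence-specific.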

The major contributions of this research are as follows. First, sequence-specific strategies for pairwise sequence alignment are used in conjunction with sequence-specific strategies for statistical significance estimation, and accurate methods for pairwise statistical significance estimation using standard, sequence-specific, and position-specific substitution matrices are developed. Second, pairwise statistical significance is used to improve the performance of PSI-BLAST, the most popular database search program. Third, heuristics are designed and implemented that speed up pairwise statistical significance estimation by a factor of more than 200. Implementations of all the methods developed in this work are freely available online.

With the all-pervasive application of sequence alignment methods in bioinformatics using the ever-increasing sequence data, this work is expected to offer useful contributions to the research community.