Large-scale methods in computational genomics

dc.contributor.advisor Srinivas Aluru
dc.contributor.author Kalyanaraman, Anantharaman
dc.contributor.department Electrical and Computer Engineering
dc.date 2018-08-24T23:10:00.000
dc.date.accessioned 2020-06-30T07:43:21Z
dc.date.available 2020-06-30T07:43:21Z
dc.date.copyright Sun Jan 01 00:00:00 UTC 2006
dc.date.issued 2006-01-01
dc.description.abstract <p>The explosive growth in biological sequence data, coupled with the design and deployment of increasingly high-throughput sequencing technologies, has created a need for methods capable of processing large-scale sequence data in a time- and cost-effective manner. In this dissertation, we address this need through the development of faster algorithms, space-efficient methods, and high-performance parallel computing techniques for some key problems in computational genomics. The first problem addressed is the clustering of DNA sequences based on a measure of sequence similarity. Our clustering method: (i) guarantees linear space complexity, in contrast to the quadratic memory requirements of previously developed methods; (ii) identifies sequence pairs containing long maximal matches in decreasing order of their maximal match lengths, in run time proportional to the sum of the input and output sizes; (iii) provides heuristics that significantly reduce the number of pairs evaluated for sequence similarity without affecting quality; and (iv) has parallel strategies that provide linear speedup and a proportionate reduction in space per processor. Our approach has significantly extended the problem sizes that can be handled while also drastically reducing the time to solution. The next problem we address is the de novo detection of genomic repeats called Long Terminal Repeat (LTR) retrotransposons. Our algorithm guarantees linear space complexity and produces high-quality candidates for prediction in run time proportional to the sum of the input and output sizes. Validation of our approach on the yeast genome demonstrates both superior quality and superior performance compared to previously developed software. In a genome assembly project, fragments sequenced from a target genome are computationally assembled into numerous supersequences called "contigs", which are then ordered and oriented into "scaffolds". In this dissertation, we introduce a new problem called retroscaffolding, in which contigs are scaffolded based on knowledge of their LTR retrotransposon content. By identifying sequencing gaps that span LTR retrotransposons, retroscaffolding provides a mechanism for prioritizing sequencing gaps for finishing purposes. While most of the problems addressed here have been studied previously, the main contribution of this dissertation is the development of methods that can scale to the largest available sequence collections.</p>
dc.format.mimetype application/pdf
dc.identifier archive/
dc.identifier.articleid 2528
dc.identifier.contextkey 6094965
dc.identifier.s3bucket isulib-bepress-aws-west
dc.identifier.submissionpath rtd/1529
dc.language.iso en
dc.source.bitstream archive/|||Fri Jan 14 20:38:38 UTC 2022
dc.subject.disciplines Bioinformatics
dc.subject.disciplines Computer Sciences
dc.subject.keywords Electrical and computer engineering
dc.subject.keywords Computer engineering
dc.title Large-scale methods in computational genomics
dc.type article
dc.type.genre dissertation
dspace.entity.type Publication
relation.isOrgUnitOfPublication a75a044c-d11e-44cd-af4f-dab1d83339ff
thesis.degree.level dissertation
thesis.degree.name Doctor of Philosophy