Parallel clustering of expressed sequence tags
Abstract
Expressed sequence tags (ESTs) are DNA molecules experimentally derived from expressed portions of genes. Clustering ESTs is essential for gene recognition, for understanding important genetic variations such as those that result in diseases, and for removing redundancies in gene indices. Currently, the software programs most widely used for EST clustering are those developed to solve the related problem of fragment assembly. Because the two problems and their inputs differ in nature, fragment assembly programs are not an ideal match for clustering large EST data sets. In this thesis, we present the design and development of a parallel software system that targets large-scale EST clustering. The novel features of our approach include 1) space-efficient algorithms that keep the space requirement linear in the size of the input data set, 2) a combination of algorithmic techniques that reduce the total work without sacrificing the quality of EST clustering, and 3) parallel processing to reduce the run-time and facilitate the clustering of large data sets. Using a combination of these techniques, we report the clustering of 144,870 Arabidopsis ESTs in 9.5 minutes on a 64-processor IBM xSeries cluster with 512 MB memory per processor. By comparison, CAP3, a state-of-the-art sequential fragment assembly program, cannot process this data set with 512 MB because of insufficient memory, and takes 247 minutes when the memory is increased to 1 GB. We also clustered 327,632 rat ESTs in 47 minutes on 64 processors with 512 MB memory per processor.
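To make the clustering task concrete, the following is a minimal sequential sketch in Python. It is not the system described in this thesis. It assumes a deliberately simplified criterion, namely that two ESTs belong to the same cluster if they share an exact substring of length k, and it merges clusters with a union-find structure. The names `cluster_ests`, `find`, and the parameter `k` are hypothetical and chosen for illustration only. The sketch ignores reverse-complement strands, sequencing errors, and alignment quality, all of which a real EST clustering tool must handle.

```python
from collections import defaultdict


def find(parent, i):
    """Return the representative of i's set, shortening paths as it goes."""
    while parent[i] != i:
        parent[i] = parent[parent[i]]  # path halving
        i = parent[i]
    return i


def cluster_ests(seqs, k=8):
    """Group sequences that share at least one exact substring of length k.

    Simplifying assumption: a shared k-mer is treated as sufficient
    evidence of overlap. Returns clusters as sorted lists of indices.
    """
    parent = list(range(len(seqs)))
    first_seen = {}  # k-mer -> index of the first sequence containing it

    for idx, seq in enumerate(seqs):
        for pos in range(len(seq) - k + 1):
            kmer = seq[pos:pos + k]
            other = first_seen.setdefault(kmer, idx)
            if other != idx:
                root_a = find(parent, idx)
                root_b = find(parent, other)
                if root_a != root_b:
                    parent[root_a] = root_b  # merge the two clusters

    groups = defaultdict(list)
    for idx in range(len(seqs)):
        groups[find(parent, idx)].append(idx)
    return sorted(sorted(members) for members in groups.values())
```

Calling `cluster_ests(list_of_sequences, k=8)` returns a list of clusters, each a list of indices into the input. Because clusters are merged transitively, two ESTs can end up together without sharing a substring directly, provided a chain of overlapping ESTs connects them.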