Algorithms in supertree inference and phylogenetic data mining

Chen, Duhong
Major Professor
David Fernandez-Baca
Committee Member
Journal Title
Journal ISSN
Volume Title
Research Projects
Organizational Units
Computer Science
Organizational Unit
Journal Issue
Computer Science

Science and society would benefit enormously from comprehensive phylogenetic knowledge of the Tree of Life (ToL), a framework that includes above 1.7 million species on Earth. ToL shows how living things have evolved since the origins of life billions of years ago. With existing computational approaches, we cannot produce a global estimate of evolutionary history from molecular data of all these species. Therefore, the development of computer algorithms and methods is critical to the field of phyloinformatics. The goal of this dissertation is to design, analyze, and implement algorithms to solve three specific problems in the field of phyloinformatics to support ToL studies.;The first part of this dissertation is about algorithms for the matrix representation with flipping (MRF) method to construct large subtrees of the ToL. The utility of the MRF supertree method has been limited by the speed of its heuristic algorithms. We describe a new heuristic algorithm for MRF supertree construction that improves upon the speed of the previous heuristic by a factor of n (the number of taxa in the supertree). This new heuristic makes MRF tractable for large-scale supertree analyses and allows the first comparisons of MRF with other supertree methods using large empirical data sets. Analyses of three published supertree data sets indicate that MRF supertrees are equally or more similar to the input trees on average than matrix representation with parsimony (MRP) and modified mincut supertrees. The results also show that large differences may exist between MRF and MRP supertrees, and demonstrate that the MRF supertree method is a practical and potentially more accurate alternative to the nearly ubiquitous MRP supertree method.;The second part of this dissertation is dedicated to new algorithms for partitioning phylogenetic data sets. We describe two new methods to partition phylogenetic data sets of discrete characters based on pairwise compatibility. The partitioning methods make no assumptions regarding the phylogeny, model of evolution, or characteristics of the data. The methods first build a compatibility graph, in which each node represents a character in the data set. Edges in the compatibility graph may represent strict compatibility of characters or they may be weighted based on a fractional compatibility scoring procedure that measures how close the characters are to being compatible. Given the desired number of partitions, the partitioning methods then seek to cluster together the characters with the highest average pairwise compatibility, so that characters in each cluster are more compatible with each other than they are with characters in the other cluster(s). We demonstrate that the spectral partitioning effectively identifies characters with different evolutionary histories in simulated data sets, and it is better at highlighting phylogenetic conflict within empirical data sets than previously used partitioning methods.;The third part of this dissertation describes a new intelligent search engine, named PhyloFinder, for phylogenetic databases that we have implemented using trees from TreeBASE. PhyloFinder enables taxonomic queries, in which it identifies trees in the database containing the exact name of the query taxon and/or any synonymous taxon names and provides spelling suggestions for the query when there is no match. Additionally, PhyloFinder can identify trees containing descendants or direct ancestors the query taxon by making use of the ontology guide from NCBI taxonomy database. PhyloFinder also executes phylogenetic queries, in which it identifies trees that contain the query tree or topologies that are similar to the query tree. PhyloFinder can enhance the utility of any tree database by providing tools for both taxonomic and phylogenetic queries as well as visualization tools that highlight the query results and provide links to NCBI and TBMap.