Towards data cleaning in large public biological databases

Bagheri, Hamid
Major Professor
Hridesh Rajan
Organizational Unit
Computer Science

Computer Science (the theory, representation, processing, communication and use of information) is fundamentally transforming every aspect of human endeavor. The Department of Computer Science at Iowa State University advances computational and information sciences through: 1. educational and research programs within and beyond the university; 2. active engagement to help define national and international research and educational agendas; and 3. sustained commitment to graduating leaders for academia, industry and government.

The Computer Science Department was officially established in 1969, with Robert Stewart serving as the founding Department Chair. The faculty consisted of joint appointments with Mathematics, Statistics, and Electrical Engineering. In 1969, the building that now houses the Computer Science department, then simply called the Computer Science building, was completed. It was later named Atanasoff Hall. From the 1980s to the present, the department has expanded and developed its teaching and research agendas to cover many areas of computing.


As the cost of sequencing decreases, the amount of data deposited into public repositories is increasing rapidly. As sequencing data continues to accumulate in online repositories, scientists can increasingly use multi-tiered data to better answer biological questions. One main challenge for public biological repositories is the quality of their metadata. Unfortunately, most public databases have no methods for identifying errors in their metadata, which creates the potential for error propagation. Cleaning at this scale requires scalable infrastructure and algorithms. In this dissertation, we built a domain-specific language and large-scale infrastructure, called BoaG, to analyze the wealth of genomics data. We used BoaG's interface to reason about the provenance, frequencies, and quality of annotations. The second part of the dissertation focuses on cleaning public repositories at scale. Most public databases, such as the non-redundant (NR) database, rely on user input and cannot identify errors in the provided metadata, so those errors can propagate. Previous research analyzed misclassification based on sequence similarity on a small subset of the NR database. To the best of our knowledge, the amount of taxonomic misclassification in the entire database has not been quantified. We proposed and developed an automatic approach to detect and remove suspicious taxonomic assignments and mispredicted functional annotations. We also addressed the widely used sequence clustering information of public databases. Clusters have been shown to be useful for functional annotation, family classification, systems biology, structural genomics, and phylogenetic analysis [73]. We used CD-HIT [33] to cluster NR sequences at decreasing similarity levels: 95%, 90%, 85%, and so on down to 65%.
To improve the data quality of the clusters, we removed anomalies and then provided a confidence score based on the lineage of all sequences within each cluster. For the functional annotations, we used the Protein Ontology (PRO) [58] and the Gene Ontology [11], which are knowledge-based graphs, to detect potentially mispredicted functions. Ontologies are commonly used to express knowledge; in this dissertation, we leveraged them to improve the quality of public genomics databases. We proposed a computational method that abstracts ontology graphs into a lower-dimensional network representation, which makes it easier to reason about inconsistencies among a list of functional annotations. We found that the BoaG infrastructure required fewer lines of code, reduced storage size, and provided automatic parallelization for large-scale analyses on the NR dataset. BoaG's web interface is also implemented and publicly available, so researchers can test different hypotheses and share them with others. We identified 29,175,336 proteins in the NR database that have more than one distinct taxonomic assignment, among which 2,238,230 (7.6%) are potentially taxonomically misclassified. We also found that the total number of potential misclassifications in clusters at 95% similarity, above the genus level, is 3,689,089 out of 88M clusters, or 4% of the total clusters. This percentage of misclassifications in NR has a significant impact because of the potential for error propagation in downstream analysis. The method proposed in this dissertation will be a valuable tool for cleaning up large-scale public databases. The technique could be extended to address other kinds of annotation errors in public databases at scale.
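
The abstract does not define how the lineage-based confidence score is computed, so the following is only an illustrative sketch of one plausible reading: for a single cluster, take the majority taxon at a chosen rank and report the fraction of member sequences that agree with it. The function name `cluster_confidence`, the tuple representation of a lineage, and the example cluster are all assumptions made for illustration, not the dissertation's actual method or data.

```python
from collections import Counter

def cluster_confidence(lineages, rank_index):
    """Hypothetical score: share of sequences agreeing with the majority taxon.

    lineages   -- one tuple per sequence, ordered from the highest rank down
    rank_index -- position in the tuple of the rank to compare (0 = highest)
    """
    # Skip lineages that are too short to have an entry at this rank.
    taxa = [lin[rank_index] for lin in lineages if len(lin) > rank_index]
    if not taxa:
        return None, 0.0
    taxon, count = Counter(taxa).most_common(1)[0]
    return taxon, count / len(taxa)

# Made-up cluster of four sequences; one disagrees at the highest rank.
cluster = [
    ("Bacteria", "Proteobacteria", "Escherichia"),
    ("Bacteria", "Proteobacteria", "Escherichia"),
    ("Bacteria", "Proteobacteria", "Escherichia"),
    ("Eukaryota", "Chordata", "Homo"),
]
majority, score = cluster_confidence(cluster, 0)
print(majority, score)
```

Under this reading, a low score at a high rank would mark the cluster as containing a suspicious taxonomic assignment, and the minority sequences would be the candidates for review.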

Sat May 01 00:00:00 UTC 2021