Representing, comparing, and querying phenotype descriptions in plants using computational methods
Plant phenotype descriptions are abundant both in the literature and in community datastores. Aggregating, organizing, and analyzing these data requires that phenotype descriptions be represented in a computable format. One successful approach has been to develop standardized vocabularies and biological ontologies for annotating phenotypes. Such annotations reduce the sparsity of the data by allowing implicit information to be inferred, and they enable simple quantification of similarity between annotated phenotypes. This type of structured curation has shown promise for enabling dataset-wide analyses of plant phenotype descriptions, but the time and effort required to curate individual descriptions limits how well the approach scales as the volume of available text related to plant phenotypes increases. Computational approaches have the potential to alleviate this problem by providing methods for representing phenotype descriptions and quantifying phenotype similarity. In this work, computational pipelines for representing and comparing phenotypes are presented and evaluated for their ability to predict biological relationships between genes. Approaches from the natural language processing domain perform as well as similarity metrics over curated annotations for predicting shared phenotypes. These approaches also show promise both for helping curators organize large datasets and for enabling researchers to explore relationships among available phenotype descriptions. A web application for querying datasets of plant phenotype descriptions and identifying associated genes is also presented, and example use cases are discussed.
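As an illustration of what quantifying phenotype similarity directly from text can look like, the following is a minimal sketch that assumes a simple bag-of-words representation compared by cosine similarity. The abstract does not specify which representations or metrics the pipelines use, so this is not the method evaluated in this work; the gene identifiers, the example descriptions, and the function names are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def bag_of_words(text):
    """Represent a phenotype description as a mapping of word -> count."""
    return Counter(tokenize(text))


def cosine_similarity(a, b):
    """Cosine similarity between two word-count vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_similar(query, descriptions):
    """Rank genes by similarity of their phenotype description to a free-text query."""
    query_vec = bag_of_words(query)
    scored = [
        (gene, cosine_similarity(query_vec, bag_of_words(text)))
        for gene, text in descriptions.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical gene identifiers and descriptions, for illustration only.
descriptions = {
    "gene_a": "dwarf plant with short internodes and dark green leaves",
    "gene_b": "short plant with reduced internodes and green leaves",
    "gene_c": "pale yellow kernels with shrunken endosperm",
}

if __name__ == "__main__":
    for gene, score in rank_similar("short green plant", descriptions):
        print(f"{gene}\t{score:.3f}")
```

In this sketch, descriptions that share vocabulary score higher than those that do not, which is the basic property that makes a free-text query over phenotype descriptions possible. Raw word counts treat uninformative words such as "with" and "and" like any other term, so a practical pipeline would likely weight or filter terms.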