Discovering meaning from biological sequences: focus on predicting misannotated proteins, binding patterns, and G4-quadruplex secondary

Andorf, Carson
Major Professor
Vasant Honavar
Drena Dobbs
Committee Member
Organizational Units
Computer Science

Proteins are the principal catalytic agents, structural elements, signal transmitters, transporters, and molecular machines in cells. Experimental determination of protein function is expensive in time and resources compared to computational methods. Hence, assigning protein function, predicting protein binding patterns, and understanding protein regulation are important problems in functional genomics and key challenges in bioinformatics. This dissertation comprises three studies. In the first two papers, we apply machine-learning methods to (1) identify misannotated sequences and (2) predict the binding patterns of proteins. The third paper is (3) a genome-wide analysis of G4-quadruplex sequences in the maize genome. The first two papers are based on two-stage classification methods. The first stage uses machine-learning approaches that combine composition-based and sequence-based features. We use either decision trees (HDTree) or support vector machines (SVM) as second-stage classifiers and show that classification performance matches or exceeds that of more computationally expensive approaches. For study (1), our method identified potentially misannotated sequences within a well-characterized set of proteins in a popular bioinformatics database. We identified misannotated proteins and showed that they have contradictory AmiGO and UniProt annotations. For study (2), we developed a three-phase approach: Phase I predicts whether a protein binds to another protein. Phase II determines whether a protein-binding protein is a hub. Phase III classifies hub proteins based on the number of binding sites and the number of concurrent binding partners. For study (3), we carried out a computational genome-wide screen to identify non-telomeric G4-quadruplex (G4Q) elements in maize and to explore their potential role in gene regulation in flowering plants.
Analysis of G4Q-containing genes revealed that they are strikingly enriched in networks and pathways associated with electron transport, sugar degradation, and hypoxia responsiveness. The maize G4Q elements may play a previously unrecognized role in coordinating global regulation of gene expression in response to hypoxia, controlling carbohydrate metabolism during anaerobic metabolism. Together, the three studies provide predictive methods and new insights for identifying misannotated proteins, understanding protein binding patterns, and proposing a potentially new model for gene regulation.
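As a rough illustration of what a computational screen for G4Q elements in study (3) can look like, the Python sketch below scans a DNA sequence for candidate motifs on both strands. The abstract does not state which motif definition or tools the dissertation used. The pattern here (four runs of at least three guanines separated by loops of 1 to 7 nucleotides) is a commonly used definition and is an assumption, as is the hypothetical function name `find_g4_candidates`.

```python
import re

# Assumed motif (not taken from the dissertation): four G-runs of length >= 3
# separated by loops of 1-7 nucleotides. A C-rich match on the given strand
# corresponds to a G-rich motif on the complementary strand.
LOOP = r"[ACGT]{1,7}"
G4_PLUS = re.compile(LOOP.join([r"G{3,}"] * 4))
G4_MINUS = re.compile(LOOP.join([r"C{3,}"] * 4))


def find_g4_candidates(seq):
    """Return sorted (start, end, strand, motif) tuples for candidate G4 motifs.

    Coordinates are 0-based and end-exclusive, relative to the given strand.
    Matches reported for one strand do not overlap each other.
    """
    seq = seq.upper()
    hits = []
    for strand, pattern in (("+", G4_PLUS), ("-", G4_MINUS)):
        for match in pattern.finditer(seq):
            hits.append((match.start(), match.end(), strand, match.group()))
    return sorted(hits)


if __name__ == "__main__":
    example = "ATGGGTTGGGTTGGGTTGGGAT"
    for start, end, strand, motif in find_g4_candidates(example):
        print(start, end, strand, motif)
```

A genome-wide screen would apply such a scan to each chromosome sequence and then intersect the hit coordinates with gene annotations. That step is omitted here because it depends on the annotation files used.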