Sequence-based prediction of RNA-protein interactions
Interactions between RNAs and proteins are fundamental to many of the key roles that RNAs play in living systems, including translation, post-transcriptional regulation of gene expression, RNA splicing, and viral replication. Recently, new roles for RNA-protein interactions have emerged, following the discovery that the human genome is pervasively transcribed and produces thousands of non-coding RNAs (ncRNAs). Although the functions of many ncRNAs are not yet known, one emerging theme is that long non-coding RNAs (lncRNAs) often drive the formation of ribonucleoprotein (RNP) complexes, which in turn influence the regulation of gene expression. Although the human genome is predicted to encode almost as many different RNA-binding proteins as DNA-binding transcription factors, our current understanding of the cellular roles of RNA-binding proteins, how they recognize their targets, and how they are regulated lags far behind our understanding of transcription factors.
To improve our understanding of RNA-protein recognition and the regulation of RNA-protein interaction networks within cells, this dissertation has four related goals: (i) performing a rigorous and systematic evaluation of sequence- and structure-based methods for predicting RNA-binding residues in proteins; (ii) developing an improved method for predicting interfacial residues in RNA-binding proteins, using only sequence information; (iii) generating a comprehensive collection of RNA-protein interaction motifs (RPIMs); and (iv) developing improved methods for RNA-protein interaction partner prediction.
First, we present a systematic evaluation of state-of-the-art machine learning methods for predicting RNA-binding residues in proteins, using three carefully curated benchmark datasets and a rich set of data representations. We show that sequence-based methods trained using position-specific scoring matrices (PSSMs) perform better than structure-based methods, which use more complex features extracted from the 3D structures of proteins. Second, we present RNABindRPlus, a new method for predicting RNA-binding residues in proteins, using only sequence information. The predictor combines output from an optimized Support Vector Machine (SVM) classifier with the output from a novel homology-based method (HomPRIP). We show that RNABindRPlus performs better than all currently available methods for predicting interfacial residues in proteins. Third, we extract more than 30,000 unique RNA-protein interfacial motifs (RPIMs), consisting of contiguous residues from both the RNA and protein chains of characterized RNA-protein complexes. Lastly, we demonstrate the utility of RPIMs in predicting RNA-protein interaction partners. We employ them in an innovative and significantly improved method for partner prediction and show that it has both a high true positive rate and a much lower false positive rate than other available methods. Taken together, the results presented here provide important new insights into the determinants of RNA-protein recognition, in addition to valuable new software tools for interrogating and predicting RNA-protein complexes and interaction networks.
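As a rough illustration of what "contiguous residues" means for the protein side of an interfacial motif, the sketch below collects maximal runs of interface residues from a single chain. It is a minimal sketch under stated assumptions, not the procedure used in this work: the function name `contiguous_runs`, the per-residue boolean interface labels, and the `min_len` cutoff are all hypothetical, and pairing each protein run with its RNA counterpart is not shown.

```python
from typing import List, Sequence, Tuple


def contiguous_runs(
    sequence: str,
    is_interface: Sequence[bool],
    min_len: int = 3,  # hypothetical cutoff; the source does not state one
) -> List[Tuple[int, str]]:
    """Return (start_index, subsequence) for each maximal run of
    interface residues that is at least min_len residues long."""
    if len(sequence) != len(is_interface):
        raise ValueError("sequence and labels must have the same length")

    runs: List[Tuple[int, str]] = []
    start = None
    # A trailing False acts as a sentinel that closes a run ending at the
    # last residue of the chain.
    for i, flag in enumerate(list(is_interface) + [False]):
        if flag and start is None:
            start = i  # a new run begins here
        elif not flag and start is not None:
            if i - start >= min_len:
                runs.append((start, sequence[start:i]))
            start = None  # the run has ended
    return runs


# Toy example: labels are invented for illustration only.
labels = [False, True, True, True, False, False, True, True]
motifs = contiguous_runs("MKRGSYTA", labels, min_len=3)
```

Lowering `min_len` admits shorter runs; which length is appropriate would depend on how interface residues are defined, which this sketch leaves open.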