Identification of interface residues involved in protein-protein and protein-DNA interactions from sequence using machine learning approaches
Identification of interface residues involved in protein-protein and protein-DNA interactions is critical for understanding the functions of biological systems. Because experimental identification of interface residues cannot keep pace with the rate at which protein sequences are determined, computational methods that can identify interface residues are urgently needed. In this study, we apply machine-learning methods to identify interface residues, focusing on methods that use amino acid sequence information alone. We have developed classifiers that identify the residues involved in protein-protein and protein-DNA interactions using a window of primary sequence as input. The classifiers were evaluated on both representative datasets and specific cases of interest, using multiple performance measures. The results demonstrate the feasibility of identifying interface residues from sequence. We have also explored information beyond primary sequence to improve the performance of sequence-based classifiers. The results show that their performance can be improved by using the solvent accessibility and sequence entropy of the target residue as additional inputs. We have developed a database of protein-protein interfaces that consists of all the protein-protein interfaces derived from the Protein Data Bank. This database, for the first time, makes possible the quick and flexible retrieval of interface sets and various interface features. We have systematically analyzed the characteristics of interfaces using the largest dataset available. In particular, we compared interfaces with samples that had the same solvent accessibility as the interfaces. This strategy controls for the effect of solvent accessibility on the distributions of residues, secondary structure, and sequence entropy.
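
As an illustration of the two sequence-derived inputs mentioned above, the following minimal Python sketch shows one way a window of primary sequence around a target residue might be encoded for a classifier, and how a sequence entropy value might be computed. It is not the implementation used in this work. The function names, the one-hot encoding, the window half-width of 4, and the use of a multiple-sequence-alignment column as the basis for entropy are all assumptions made for the example.

```python
import math
from collections import Counter

# The 20 standard amino acids, in a fixed (arbitrary) order.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def window_features(sequence, center, half_width=4):
    """One-hot encode the residues in a window centered on `center`.

    Hypothetical helper: the window size and encoding are assumptions.
    Positions beyond either end of the sequence, and non-standard
    residue codes, are encoded as all zeros.
    """
    features = []
    for pos in range(center - half_width, center + half_width + 1):
        one_hot = [0] * len(AMINO_ACIDS)
        if 0 <= pos < len(sequence):
            idx = AA_INDEX.get(sequence[pos])
            if idx is not None:
                one_hot[idx] = 1
        features.extend(one_hot)
    return features


def column_entropy(column):
    """Shannon entropy (in bits) of one alignment column, ignoring gaps.

    Hypothetical helper: assumes entropy is taken over the residues
    observed at the target position in a multiple sequence alignment.
    """
    counts = Counter(c for c in column if c in AA_INDEX)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total) for n in counts.values())
```

In this sketch, a feature vector for a target residue could be formed by calling `window_features(sequence, i)` and appending `column_entropy(column_i)` together with a solvent accessibility value obtained separately. A fully conserved column gives an entropy of zero, and more variable columns give higher values.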