Machine learning for prediction of protein properties

Kool, Daniel Benjamin
Major Professor
Jernigan, Robert L
Jernigan, Robert
Underbakke, Eric
Dickerson, Julie
Huang, Xiaoqiu
Song, Guang
Committee Member
Biochem, Biophysics, and Molecular Biology
The first part of this thesis investigates the relationship between protein residues, geometries, and pocket features and their impact on protein functionality. The primary investigative tool throughout is machine learning, with a focus on the eXtreme Gradient Boosting (XGBoost) tree-based classification method. The approach emphasizes careful treatment and preparation of the data to obtain reliable and insightful results, and close attention is given to the preparation and analysis of input features in order to understand the underlying mechanisms, particularly the characteristics of the individual amino acids involved. One of the main challenges addressed in this thesis is how to deal with highly imbalanced datasets. To address it, various scaling/standardization functions were applied, together with techniques for generating synthetic samples. The results highlight significant differences and consistencies between these data preparation schemes. Additionally, we used the SHAP method to identify important features and variables for the machine learning model, obtaining global and residue-level importance values. Identifying these key features and variables gives a deeper understanding of some of the details of the underlying mechanisms that influence protein functionality. The methods and data preparation strategies are then extended to predict ligand binding residues. Specifically, the binding of two biologically significant ligands, HEM and PLP, is investigated using similar geometric and physicochemical properties. The insights gained from this study can inform future experimental work and accelerate the discovery of new therapies for diseases.
By accurately predicting ligand binding residues, we can better understand how proteins interact with their environment, both in general and in specific ways, and how these interactions might be modified to improve health outcomes. In addition to ligand binding, we explore the use of machine learning to predict free energy and phenotype changes caused by mutations in proteins. Understanding how mutations affect protein function helps clarify the mechanisms of diseases and, ultimately, supports the development of more effective restorative treatments. We also present a novel method that predicts the melting temperature of proteins from different datasets with high accuracy, using a neural network approach and considering two different sets of input features. This method can be used to better understand how proteins behave under different conditions, and to develop more stable and effective proteins for use in biotechnology and medicine. Finally, this study describes the development of BioMakie.jl, a Julia programming package that provides a range of tools for investigating proteins. The package currently allows users to view proteins and multiple sequence alignments, with ongoing development focused on creating new visualizations and connecting Julia's event systems to web/JavaScript. By reducing the need to know multiple coding languages and lowering the learning curve for protein analysis, BioMakie.jl aims to make it easier for researchers to explore and understand protein structures and functions. This tool can be used by researchers across a wide range of disciplines to better understand the fundamental building blocks of life and their mechanisms. Overall, we demonstrate the power of machine learning and feature importance methods in analyzing complex biological systems such as proteins. By gaining a deeper understanding of the underlying mechanisms that influence protein functionality, we can eventually develop more effective therapies for diseases.
Additionally, the development of BioMakie.jl provides a powerful tool for researchers to investigate proteins and gain new insights into their structures and functions.
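As a rough illustration of the data preparation the abstract describes for imbalanced datasets (standardizing features, then generating synthetic minority samples), the following is a minimal sketch and not the thesis's actual pipeline. The abstract does not say which scaling functions or which synthetic-sample technique were used, so the z-score scaling and SMOTE-style interpolation here are assumptions, as are the helper names `standardize` and `smote_like` and the invented toy feature values.

```python
import math
import random


def standardize(rows):
    """Z-score each feature column (zero mean, unit variance)."""
    n_cols = len(rows[0])
    means = [sum(r[j] for r in rows) / len(rows) for j in range(n_cols)]
    stds = []
    for j in range(n_cols):
        var = sum((r[j] - means[j]) ** 2 for r in rows) / len(rows)
        stds.append(math.sqrt(var) or 1.0)  # guard against constant columns
    return [[(r[j] - means[j]) / stds[j] for j in range(n_cols)] for r in rows]


def smote_like(minority, n_new, k=3, seed=0):
    """Create n_new synthetic rows, each interpolated between a randomly
    chosen minority row and one of its k nearest minority neighbours.
    This is an assumed, SMOTE-style stand-in for the unspecified technique."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        t = rng.random()  # position along the segment base -> other
        synthetic.append([b + t * (o - b) for b, o in zip(base, other)])
    return synthetic


# Hypothetical toy data: two features per residue, 8 negatives and 3 positives.
features = [
    [0.10, 1.2], [0.20, 1.0], [0.15, 1.1], [0.30, 0.9],
    [0.25, 1.3], [0.05, 1.4], [0.35, 0.8], [0.12, 1.0],
    [0.90, 3.1], [0.80, 2.9], [0.95, 3.4],
]
labels = [0] * 8 + [1] * 3

scaled = standardize(features)
minority = [x for x, y in zip(scaled, labels) if y == 1]
n_needed = labels.count(0) - labels.count(1)

new_rows = smote_like(minority, n_needed)
balanced_X = scaled + new_rows
balanced_y = labels + [1] * n_needed
```

In practice the scaler would be fitted on the training split only, and oversampling would be applied only to training data so that synthetic rows never leak into evaluation. The balanced set could then be passed to a classifier such as XGBoost, with an established implementation (for example, SMOTE in the imbalanced-learn package) replacing the hand-written interpolation.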