Exploring the utility and effectiveness of protein language models on protein relations
Date
2024-08
Authors
Kilinc, Mesih
Major Professor
Advisor
Jernigan, Robert L
Huang, Xiaoqiu
Macintosh, Gustavo
Walley, Justin
Wu, Zhijun
Committee Member
Abstract
Proteins are key macromolecules in the lifecycle of all organisms. Understanding the relationships between proteins improves our understanding of their function, because functions can be inferred from close relatives. Many industrial processes and the study of many diseases can benefit from a better understanding of protein functions. Because of this importance, traditional sequence-based protein similarity tools such as BLAST are widely used. The widely accepted paradigm is that sequence defines structure, which in turn defines dynamics and function. However, proteins are interesting in that a single amino acid mutation can sometimes change the whole structural topology, and yet similar structures can be found with as little as 10% sequence similarity. This shows the need to understand complex protein relations beyond local amino acid sequence dependencies. Recently, transformer-based language models for both natural languages and proteins have shown astonishing advances. They can capture complex relationships between language and its environment even though they are trained only on the sequence itself. In this work, we harness the contextual understanding of protein language models and optimize it for high accuracy and efficiency in identifying protein relations.
Type
dissertation