Applications of machine learning to solve biological puzzles

Mann, Carla
Major Professor
Drena L. Dobbs
Robert Jernigan
Organizational Unit
Genetics, Development and Cell Biology

The Department of Genetics, Development, and Cell Biology seeks to teach subcellular and cellular processes, genome dynamics, cell structure and function, and molecular mechanisms of development. It offers a Major in Biology and a Major in Genetics.

The Department of Genetics, Development, and Cell Biology was founded in 2005.

The era of “big data” has led to the generation of more biological data than any human could hope to process. This flood of data has necessitated the development of computational methods to assist in analysis, and has made it possible to begin to model complex biological systems. Machine learning methods represent one avenue for modeling, and allow for the identification of intricate and often cryptic sequence signals underlying many biological processes.

In this dissertation, I present two machine learning models, RPIDisorder and MEDJED, which were developed to predict RNA-protein interaction partners (RPIPs) and DNA double-strand break (DSB) repair by the microhomology-mediated end joining (MMEJ) pathway, respectively. I also present the Gene Sculpt Suite, a set of freely available web-based software tools for precision gene editing.

RPIDisorder uses signals from protein and RNA sequences (some of which have been previously utilized in published RNA-protein partner prediction methods), but it additionally exploits signal from disordered protein regions to predict interactions with greater specificity than has been possible before. RPIDisorder allows for the prediction of biologically relevant RNA-protein interaction networks, which in turn can assist in the development of clinical interventions for the numerous cancers and neurological and metabolic disorders associated with disruptions in RNA-protein interactions. RPIDisorder is freely available at

MEDJED (Microhomology-Evoked Deletion Judication EluciDation) uses signal within and surrounding short stretches of homologous DNA sequence (microhomologies) on either side of an introduced DSB to predict the extent to which a targeted genomic site will be repaired using the MMEJ pathway. MEDJED is freely available at
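The idea of microhomologies flanking a DSB can be illustrated with a minimal Python sketch. This is not MEDJED's implementation or feature set: the function names (`find_microhomologies`, `mmej_product`), the `Microhomology` record, and the `min_len`, `max_len`, and `window` parameters are all hypothetical choices made for illustration. The sketch only enumerates identical motifs that occur once upstream and once downstream of a cut position, and shows the deletion that would result if the two copies annealed.

```python
from typing import List, NamedTuple


class Microhomology(NamedTuple):
    """Hypothetical record for one pair of matching motifs flanking a cut."""
    motif: str
    left_start: int       # 0-based start of the upstream copy
    right_start: int      # 0-based start of the downstream copy
    deletion_length: int  # bases lost if the two copies anneal


def find_microhomologies(seq: str, cut: int, min_len: int = 3,
                         max_len: int = 10, window: int = 20) -> List[Microhomology]:
    """List motifs present on both sides of a DSB between seq[cut-1] and seq[cut].

    Assumption: only motifs lying entirely within `window` bases of the cut
    are considered. Shorter motifs nested in a longer match are also reported.
    """
    seq = seq.upper()
    if not 0 < cut < len(seq):
        raise ValueError("cut must fall strictly inside the sequence")
    left_lo = max(0, cut - window)
    right_hi = min(len(seq), cut + window)
    hits: List[Microhomology] = []
    for k in range(min_len, max_len + 1):
        # Upstream copies must end at or before the cut position.
        for i in range(left_lo, cut - k + 1):
            motif = seq[i:i + k]
            # Downstream copies must start at or after the cut position.
            j = seq.find(motif, cut, right_hi)
            while j != -1:
                hits.append(Microhomology(motif, i, j, j - i))
                j = seq.find(motif, j + 1, right_hi)
    return hits


def mmej_product(seq: str, hit: Microhomology) -> str:
    """Sequence left after annealing the two copies: one copy and the
    intervening bases are deleted."""
    return seq[:hit.left_start] + seq[hit.right_start:]
```

For example, `find_microhomologies("TTGCATGACCGGCATGTT", cut=10, min_len=5, max_len=5)` finds the motif `GCATG` on both sides of the cut, and `mmej_product` on that hit yields the 9-base deletion product. A predictive model would need features well beyond this enumeration; the sketch only makes the term "microhomology" concrete.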

The advent of gene-editing nucleases, including CRISPR/Cas systems, TALENs, and zinc finger nucleases, has made it possible to insert, delete, and precisely edit DNA. A great deal of recent research has focused on improving the efficiency and precision of these nucleases by leveraging endogenous DSB repair pathways, including non-homologous end joining (NHEJ) and homologous recombination (HR). However, homology-mediated end joining (HMEJ) pathways, including MMEJ and single-strand annealing (SSA), provide many advantages over NHEJ and HR. The Gene Sculpt Suite is a set of three web-based tools (GTagHD, MEDJED, and MENTHU) that leverage HMEJ pathways to enhance exogenous DNA knock-in (GTagHD) and to produce more efficient and precise gene knock-outs (MEDJED and MENTHU). The Gene Sculpt Suite is freely available at

Taken together, the results of these studies demonstrate that machine learning models can be valuable for identifying sequence signals that regulate macromolecular recognition, with numerous potential applications in both basic and applied research.

August 1, 2019