A modular data analysis pipeline for the discovery of novel RNA motifs

Date
2006-01-01
Authors
Schonfeld, Justin
Major Professor
Advisor
Daniel Ashlock
Dan Voytas
Organizational Unit
Mathematics
Department
Mathematics
Abstract

This dissertation presents a modular software pipeline that searches collections of RNA sequences for novel RNA motifs. Here the motifs incorporate elements of both primary and secondary structure. The motif search pipeline breaks sets of RNA sequences into shortened segments of RNA primary sequence, referred to as bricks. The bricks are then folded to obtain low-energy secondary structures. The distance estimation module of the pipeline calculates distances between the folded bricks, and the resulting distance matrices are analyzed for patterns.

An initial implementation of the pipeline is applied to synthetic and biological data sets. This implementation introduces a new distance measure for comparing RNA sequences, based on structural annotation of the folded sequence, as well as a new data analysis technique called non-linear projection. The modular nature of the pipeline is then used to explore the relationships between several different distance measures on random data, synthetic data, and a biological data set consisting of iron response elements. It is shown that the different distance measures capture different relationships between the RNA sequences. The non-linear projection algorithm is used to produce 2-dimensional projections of the distance matrices, which are examined by inspection and by k-means multiclustering. The pipeline successfully clusters synthetic RNA sequences based only on primary sequence data, as well as the iron response elements data set. The dissertation also presents a preliminary analysis of a large biological data set of HIV sequences.
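As a rough illustration of the segmentation and distance-matrix stages described in the abstract, the sketch below cuts sequences into fixed-length overlapping bricks and builds a symmetric distance matrix over them. It is not the dissertation's implementation: the function names, the window length and step, the example sequences, and the use of Hamming distance are all assumptions made for illustration. The folding step is omitted, and the structure-based distance measure introduced in the dissertation is replaced by a simple primary-sequence placeholder.

```python
from itertools import combinations


def segment(seq, length, step):
    """Cut one sequence into overlapping fixed-length windows ("bricks").

    The window length and step are hypothetical parameters.
    """
    return [seq[i:i + length] for i in range(0, len(seq) - length + 1, step)]


def hamming(a, b):
    """Placeholder distance: count of mismatched positions.

    Stands in for the dissertation's distance measures, which are not
    reproduced here.
    """
    if len(a) != len(b):
        raise ValueError("bricks must have equal length")
    return sum(x != y for x, y in zip(a, b))


def distance_matrix(items, dist):
    """Build a symmetric pairwise distance matrix with a zero diagonal."""
    n = len(items)
    matrix = [[0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        matrix[i][j] = matrix[j][i] = dist(items[i], items[j])
    return matrix


# Two made-up RNA sequences that differ at a single position.
sequences = ["GGCAGUGCUUCAAC", "GGCAGAGCUUCAAC"]
bricks = [b for s in sequences for b in segment(s, length=8, step=3)]
matrix = distance_matrix(bricks, hamming)
```

Because `distance_matrix` takes the distance function as an argument, a different measure can be swapped in without touching the other stages, which mirrors the modular design the abstract describes.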

Copyright
2006-01-01