A modular data analysis pipeline for the discovery of novel RNA motifs

Schonfeld, Justin

A modular data analysis pipeline for the discovery of novel RNA motifs

File

r_3217314.pdf (2.04 MB)

Date

2006-01-01

Authors

Schonfeld, Justin

Advisor

Daniel Ashlock

Dan Voytas

Altmetrics

Organizational Units

Organizational Unit

Mathematics

Welcome to the exciting world of mathematics at Iowa State University. From cracking codes to modeling the spread of diseases, our program offers something for everyone. With a wide range of courses and research opportunities, you will have the chance to delve deep into the world of mathematics and discover your own unique talents and interests. Whether you dream of working for a top tech company, teaching at a prestigious university, or pursuing cutting-edge research, join us and discover the limitless potential of mathematics at Iowa State University!

Department

Mathematics

Abstract

This dissertation presents a modular software pipeline that searches collections of RNA sequences for novel RNA motifs. In this case the motifs incorporate elements of primary and secondary structure. The motif search pipeline breaks up sets of RNA sequences into shortened segments of RNA primary sequence. The shortened segments are then folded to obtain low energy secondary structures. The distance estimation module of the pipeline then calculates distances between the folded bricks, and then analyzes the resulting distance matrices for patterns;An initial implementation of the pipeline is applied to synthetic and biological data sets. This implementation introduces a new distance measure for comparing RNA sequences based on structural annotation of the folded sequence as well as a new data analysis technique called non-linear projection. The modular nature of the pipeline is then used to explore the relationships between several different distance measures on random data, synthetic data, and a biological data set consisting of iron response elements. It is shown that the different distance measures capture different relationships between the RNA sequences. The non-linear projection algorithm is used to produce 2-dimensional projections of the distance matrices which are examined via inspection and k-means multiclustering. The pipeline is able to successfully cluster synthetic RNA sequences based only on primary sequence data as well as the iron response elements data set. The dissertation also presents a preliminary analysis of a large biological data set of HIV sequences.

Copyright

Sun Jan 01 00:00:00 UTC 2006

Collections

Theses and Dissertations

Full item page