A modular data analysis pipeline for the discovery of novel RNA motifs

dc.contributor.advisor Daniel Ashlock
dc.contributor.advisor Dan Voytas
dc.contributor.author Schonfeld, Justin
dc.contributor.department Mathematics
dc.date 2018-08-25T01:25:50.000
dc.date.accessioned 2020-06-30T07:25:25Z
dc.date.available 2020-06-30T07:25:25Z
dc.date.copyright Sun Jan 01 00:00:00 UTC 2006
dc.date.issued 2006-01-01
dc.description.abstract This dissertation presents a modular software pipeline that searches collections of RNA sequences for novel RNA motifs. In this case the motifs incorporate elements of primary and secondary structure. The motif search pipeline breaks up sets of RNA sequences into shortened segments of RNA primary sequence. The shortened segments are then folded to obtain low energy secondary structures. The distance estimation module of the pipeline then calculates distances between the folded segments (bricks) and analyzes the resulting distance matrices for patterns. An initial implementation of the pipeline is applied to synthetic and biological data sets. This implementation introduces a new distance measure for comparing RNA sequences based on structural annotation of the folded sequence, as well as a new data analysis technique called non-linear projection. The modular nature of the pipeline is then used to explore the relationships between several different distance measures on random data, synthetic data, and a biological data set consisting of iron response elements. It is shown that the different distance measures capture different relationships between the RNA sequences. The non-linear projection algorithm is used to produce 2-dimensional projections of the distance matrices, which are examined via inspection and k-means multiclustering. The pipeline is able to successfully cluster synthetic RNA sequences based only on primary sequence data as well as the iron response elements data set. The dissertation also presents a preliminary analysis of a large biological data set of HIV sequences.
dc.format.mimetype application/pdf
dc.identifier archive/lib.dr.iastate.edu/rtd/1300/
dc.identifier.articleid 2299
dc.identifier.contextkey 6093760
dc.identifier.doi https://doi.org/10.31274/rtd-180813-17348
dc.identifier.s3bucket isulib-bepress-aws-west
dc.identifier.submissionpath rtd/1300
dc.identifier.uri https://dr.lib.iastate.edu/handle/20.500.12876/66431
dc.language.iso en
dc.source.bitstream archive/lib.dr.iastate.edu/rtd/1300/r_3217314.pdf|||Fri Jan 14 19:42:12 UTC 2022
dc.subject.disciplines Biostatistics
dc.subject.disciplines Computer Sciences
dc.subject.disciplines Molecular Biology
dc.subject.keywords Mathematics
dc.subject.keywords Bioinformatics and computational biology
dc.title A modular data analysis pipeline for the discovery of novel RNA motifs
dc.type article
dc.type.genre dissertation
dspace.entity.type Publication
relation.isOrgUnitOfPublication 82295b2b-0f85-4929-9659-075c93e82c48
thesis.degree.discipline Bioinformatics and Computational Biology
thesis.degree.level dissertation
thesis.degree.name Doctor of Philosophy
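The first and third stages described in the abstract (segmenting sequences, then building a distance matrix over the segments) can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the function names, the window length and step, and the example sequences are hypothetical, and Hamming distance on primary sequence is only a stand-in for the structural distance measures the dissertation introduces. The folding, non-linear projection, and k-means stages are not shown.

```python
from itertools import combinations


def split_into_segments(sequence, length, step):
    """Cut one RNA sequence into fixed-length windows.

    The window length and step are hypothetical parameters, not values
    taken from the dissertation.
    """
    return [sequence[i:i + length]
            for i in range(0, len(sequence) - length + 1, step)]


def hamming_distance(a, b):
    """Count mismatched positions between two equal-length segments.

    Stand-in measure on primary sequence only; it is NOT the structural
    distance measure described in the abstract.
    """
    if len(a) != len(b):
        raise ValueError("segments must have equal length")
    return sum(x != y for x, y in zip(a, b))


def distance_matrix(segments, distance):
    """Build a symmetric matrix of pairwise distances between segments."""
    n = len(segments)
    matrix = [[0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        matrix[i][j] = matrix[j][i] = distance(segments[i], segments[j])
    return matrix


# Made-up example sequences, used only to exercise the functions.
sequences = ["GGGAAACCCU", "GGGAUACCCA"]
segments = [seg for s in sequences for seg in split_into_segments(s, 6, 2)]
matrix = distance_matrix(segments, hamming_distance)
```

Because the distance function is passed in as an argument, a different measure can be swapped in without touching the other stages, which mirrors the modular design the abstract describes.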
File
Original bundle
Name: r_3217314.pdf
Size: 2.04 MB
Format: Adobe Portable Document Format