Machine learning methods for omics data integration

dc.contributor.advisor Julie A. Dickerson
dc.contributor.author Zhou, Wengang
dc.contributor.department Department of Electrical and Computer Engineering
dc.date 2018-08-11T14:13:40.000
dc.date.accessioned 2020-06-30T02:41:16Z
dc.date.available 2020-06-30T02:41:16Z
dc.date.copyright Sat Jan 01 00:00:00 UTC 2011
dc.date.embargo 2013-06-05
dc.date.issued 2011-01-01
dc.description.abstract High-throughput technologies produce genome-scale transcriptomic and metabolomic (omics) datasets that allow system-level studies of complex biological processes. A key limitation is the small number of samples relative to the much larger number of features in these datasets. Machine learning methods can help integrate these large-scale omics datasets and identify key features from each dataset. A novel class-dependent feature selection method integrates the F statistic, maximum relevance binary particle swarm optimization (MRBPSO), and a class-dependent multi-category classification (CDMC) system. A set of highly differentially expressed genes is pre-selected using the F statistic as a filter for each dataset. MRBPSO and CDMC function as a wrapper to select desirable feature subsets for each class and classify the samples using those chosen class-dependent feature subsets. The results indicate that the class-dependent approaches can effectively identify unique biomarkers for each cancer type and improve classification accuracy compared to class-independent feature selection methods. The integration of transcriptomics and metabolomics data is based on a classification framework. Compared to integration approaches based on principal component analysis and non-negative matrix factorization, our proposed method achieves 20-30% higher prediction accuracy on Arabidopsis tissue development data. Metabolite-predictive genes and gene-predictive metabolites are selected from transcriptomic and metabolomic data, respectively. The constructed gene-metabolite correlation network can infer the functions of unknown genes and metabolites. Tissue-specific genes and metabolites are identified by the class-dependent feature selection method. Evidence from subcellular locations, gene ontology, and biochemical pathways supports the involvement of these entities in different developmental stages and tissues in Arabidopsis.
dc.format.mimetype application/pdf
dc.identifier archive/lib.dr.iastate.edu/etd/12238/
dc.identifier.articleid 3226
dc.identifier.contextkey 2808424
dc.identifier.doi https://doi.org/10.31274/etd-180810-4307
dc.identifier.s3bucket isulib-bepress-aws-west
dc.identifier.submissionpath etd/12238
dc.identifier.uri https://dr.lib.iastate.edu/handle/20.500.12876/26427
dc.language.iso en
dc.source.bitstream archive/lib.dr.iastate.edu/etd/12238/Zhou_iastate_0097E_12021.pdf|||Fri Jan 14 19:16:21 UTC 2022
dc.subject.disciplines Electrical and Computer Engineering
dc.title Machine learning methods for omics data integration
dc.type dissertation
dc.type.genre dissertation
dspace.entity.type Publication
relation.isOrgUnitOfPublication a75a044c-d11e-44cd-af4f-dab1d83339ff
thesis.degree.discipline Bioinformatics and Computational Biology
thesis.degree.level dissertation
thesis.degree.name Doctor of Philosophy
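
The abstract describes pre-selecting highly differentially expressed genes with the F statistic as a filter before the MRBPSO/CDMC wrapper stage. The block below is a minimal sketch of that filter step only, assuming a one-way ANOVA F statistic computed per feature across class labels. The function names `f_statistic` and `top_features_by_f`, the `top_k` parameter, and the toy data are hypothetical and are not taken from the dissertation; the wrapper stage is not shown.

```python
from statistics import mean


def f_statistic(values, labels):
    """One-way ANOVA F statistic for a single feature.

    values: one measurement per sample; labels: class label per sample.
    Assumes at least two classes and more samples than classes.
    """
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(y, []).append(v)
    n, k = len(values), len(groups)
    grand = mean(values)
    # Between-class and within-class sums of squares.
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups.values())
    ss_within = sum((v - mean(g)) ** 2 for g in groups.values() for v in g)
    if ss_within == 0:
        # Degenerate case: no within-class variation at all.
        return float("inf") if ss_between > 0 else 0.0
    return (ss_between / (k - 1)) / (ss_within / (n - k))


def top_features_by_f(X, labels, top_k):
    """Return the column indices of X with the top_k largest F statistics."""
    n_features = len(X[0])
    scores = [
        f_statistic([row[j] for row in X], labels) for j in range(n_features)
    ]
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return ranked[:top_k]


# Toy data (hypothetical): 6 samples, 2 features, 2 classes.
X = [
    [1.0, 5.0],
    [1.1, 3.0],
    [0.9, 4.0],
    [3.0, 4.1],
    [3.1, 5.1],
    [2.9, 2.9],
]
labels = ["a", "a", "a", "b", "b", "b"]
selected = top_features_by_f(X, labels, top_k=1)
```

In this toy example the first feature separates the two classes far more strongly than the second, so it is the one retained. In practice the retained subset would then be passed to a wrapper method for class-dependent selection.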
File
Original bundle
Name:
Zhou_iastate_0097E_12021.pdf
Size:
2.14 MB
Format:
Adobe Portable Document Format