Machine learning methods for omics data integration

Zhou, Wengang

dc.contributor.advisor	Julie A. Dickerson
dc.contributor.author	Zhou, Wengang
dc.contributor.department	Department of Electrical and Computer Engineering
dc.date	2018-08-11T14:13:40.000
dc.date.accessioned	2020-06-30T02:41:16Z
dc.date.available	2020-06-30T02:41:16Z
dc.date.copyright	Sat Jan 01 00:00:00 UTC 2011
dc.date.embargo	2013-06-05
dc.date.issued	2011-01-01
dc.description.abstract	High-throughput technologies produce genome-scale transcriptomic and metabolomic (omics) datasets that allow system-level studies of complex biological processes. The main limitation is that these datasets contain few samples relative to the much larger number of features. Machine learning methods can help integrate these large-scale omics datasets and identify key features from each dataset. A novel class-dependent feature selection method integrates the F statistic, maximum relevance binary particle swarm optimization (MRBPSO), and a class-dependent multi-category classification (CDMC) system. A set of highly differentially expressed genes is pre-selected using the F statistic as a filter for each dataset. MRBPSO and CDMC function as a wrapper to select desirable feature subsets for each class and classify the samples using those chosen class-dependent feature subsets. The results indicate that the class-dependent approaches can effectively identify unique biomarkers for each cancer type and improve classification accuracy compared to class-independent feature selection methods. The integration of transcriptomics and metabolomics data is based on a classification framework. Compared to integration approaches based on principal component analysis and non-negative matrix factorization, our proposed method achieves 20-30% higher prediction accuracies on Arabidopsis tissue development data. Metabolite-predictive genes and gene-predictive metabolites are selected from transcriptomic and metabolomic data, respectively. The constructed gene-metabolite correlation network can infer the functions of unknown genes and metabolites. Tissue-specific genes and metabolites are identified by the class-dependent feature selection method. Evidence from subcellular locations, gene ontology, and biochemical pathways supports the involvement of these entities in different developmental stages and tissues in Arabidopsis.
dc.format.mimetype	application/pdf
dc.identifier	archive/lib.dr.iastate.edu/etd/12238/
dc.identifier.articleid	3226
dc.identifier.contextkey	2808424
dc.identifier.doi	https://doi.org/10.31274/etd-180810-4307
dc.identifier.s3bucket	isulib-bepress-aws-west
dc.identifier.submissionpath	etd/12238
dc.identifier.uri	https://dr.lib.iastate.edu/handle/20.500.12876/26427
dc.language.iso	en
dc.source.bitstream	archive/lib.dr.iastate.edu/etd/12238/Zhou_iastate_0097E_12021.pdf\|\|\|Fri Jan 14 19:16:21 UTC 2022
dc.subject.disciplines	Electrical and Computer Engineering
dc.title	Machine learning methods for omics data integration
dc.type	dissertation
dc.type.genre	dissertation
dspace.entity.type	Publication
relation.isOrgUnitOfPublication	a75a044c-d11e-44cd-af4f-dab1d83339ff
thesis.degree.discipline	Bioinformatics and Computational Biology
thesis.degree.level	dissertation
thesis.degree.name	Doctor of Philosophy

File

Original bundle

Name: Zhou_iastate_0097E_12021.pdf
Size: 2.14 MB
Format: Adobe Portable Document Format

Collections

Theses and Dissertations