Improving text clustering for functional analysis of genes

dc.contributor.advisor Daniel Berleant
dc.contributor.advisor Eve Wurtele
dc.contributor.author Ding, Jing
dc.contributor.department Electrical and Computer Engineering
dc.contributor.other Electrical and Computer Engineering 2018-08-25T03:26:16.000 2020-06-30T08:05:18Z 2020-06-30T08:05:18Z 2006-01-01
dc.description.abstract <p>Continued rapid advances in genomic, proteomic and metabolomic technologies demand computer-aided methods and tools that can process large amounts of data efficiently and in a timely manner, extract meaningful information, and interpret data into knowledge. While numerous algorithms and systems have been developed for information extraction (i.e. profiling analysis), biological interpretation still largely relies on biologists' domain knowledge, as well as on collecting and analyzing functional information from various public databases. The goal of this project was to build a text clustering-based software system, called GeneNarrator, for functional analysis of genes (microarray experiments).</p> <p>GeneNarrator automatically collected MEDLINE citations for a list of genes as the source of functional information. A two-step clustering approach was designed to process the citations. The first-step (text) clustering grouped the citations into hierarchical topics. The second-step (gene) clustering grouped the genes based on the similarities of their occurrences across the clusters resulting from step one. Hence, we planned to demonstrate how, instead of manually collecting and tediously sifting through potentially thousands of citations, biologists can be presented with dozens of topics summarizing the citations, and with genes (or gene groups) mapped to those topics.</p> <p>To improve the first-step text clustering part of the system, several strategies were explored, including different vector space models (BOW-based or concept-based) for text representation, vector space dimensionality reduction (document frequency filtering), and multi-clustering. The greatest improvement came from multi-clustering. The clusterings were evaluated in terms of self-consistency and agreement with a manually constructed gold standard dataset using a newly proposed metric, normalized mutual information.</p>
dc.format.mimetype application/pdf
dc.identifier archive/
dc.identifier.articleid 2810
dc.identifier.contextkey 6105449
dc.identifier.s3bucket isulib-bepress-aws-west
dc.identifier.submissionpath rtd/1811
dc.source.bitstream archive/|||Fri Jan 14 21:37:06 UTC 2022
dc.subject Electrical and computer engineering
dc.subject Computer engineering
dc.subject Bioinformatics and computational biology
dc.title Improving text clustering for functional analysis of genes
dc.type dissertation
dspace.entity.type Publication
relation.isOrgUnitOfPublication a75a044c-d11e-44cd-af4f-dab1d83339ff
thesis.degree.name Doctor of Philosophy