Improving text clustering for functional analysis of genes

Date
2006-01-01
Authors
Ding, Jing
Abstract

Continued rapid advances in genomic, proteomic, and metabolomic technologies demand computer-aided methods and tools that can process large amounts of data efficiently and in a timely manner, extract meaningful information, and turn that information into knowledge. While numerous algorithms and systems have been developed for information extraction (i.e., profiling analysis), biological interpretation still relies largely on biologists' domain knowledge and on collecting and analyzing functional information from various public databases.

The goal of this project was to build a text clustering-based software system, called GeneNarrator, for functional analysis of genes from microarray experiments. GeneNarrator automatically collected MEDLINE citations for a list of genes as the source of functional information. A two-step clustering approach was designed to process the citations. The first step (text clustering) grouped the citations into hierarchical topics. The second step (gene clustering) grouped the genes based on the similarity of their occurrences across the clusters produced by the first step. We thus aimed to demonstrate that, instead of manually collecting and sifting through potentially thousands of citations, biologists can be presented with dozens of topics that summarize the citations, with genes (or gene groups) mapped to those topics.

To improve the first-step text clustering, several strategies were explored: different vector space models for text representation (bag-of-words (BOW)-based or concept-based), dimensionality reduction of the vector space (document frequency filtering), and multi-clustering. Multi-clustering yielded the largest improvement. The clusterings were evaluated in terms of self-consistency and agreement with a manually constructed gold standard dataset, using a newly proposed metric, normalized mutual information.
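
As an illustration of the evaluation idea only, the following minimal Python sketch computes a normalized mutual information score between two flat cluster labelings, for example a clustering result and a gold standard. It is not taken from the thesis: the function names are hypothetical, the normalization by the geometric mean of the two entropies is an assumed convention (the abstract does not state which normalization was used), and hierarchical clusterings would first have to be cut into flat labelings.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in nats) of a flat labeling."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def mutual_information(labels_a, labels_b):
    """Mutual information (in nats) between two labelings of the same items."""
    n = len(labels_a)
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    mi = 0.0
    for (a, b), c in joint.items():
        # p(a, b) * log( p(a, b) / (p(a) * p(b)) )
        mi += (c / n) * math.log(c * n / (count_a[a] * count_b[b]))
    return mi


def normalized_mutual_information(labels_a, labels_b):
    """NMI = I(A; B) / sqrt(H(A) * H(B)).

    Assumption: geometric-mean normalization; other conventions exist.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same items")
    if not labels_a:
        raise ValueError("labelings must not be empty")
    h_a, h_b = entropy(labels_a), entropy(labels_b)
    if h_a == 0.0 or h_b == 0.0:
        # Degenerate case: at least one labeling is a single cluster.
        return 1.0 if h_a == h_b else 0.0
    return mutual_information(labels_a, labels_b) / math.sqrt(h_a * h_b)


if __name__ == "__main__":
    # Hypothetical labels for six citations: clustering output vs. gold standard.
    predicted = [0, 0, 1, 1, 2, 2]
    gold = ["a", "a", "b", "b", "b", "c"]
    print(round(normalized_mutual_information(predicted, gold), 3))
```

The score depends only on how items are grouped, not on the label names, so a clustering that matches the gold standard up to renaming scores 1, while statistically independent labelings score 0.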

Keywords
Electrical and computer engineering, Computer engineering, Bioinformatics and computational biology