Data-driven discovery of rules for protein function classification based on sequence motifs

Date
2002-01-01
Authors
Wang, Xiangyun
Department
Theses & dissertations (Interdisciplinary)
Abstract

This thesis describes an approach to data-driven discovery of decision trees or rules for assigning protein sequences to functional families using sequence motifs. The method can capture regularities that are expressible in terms of the presence or absence of arbitrary combinations of motifs. A training set of peptidase sequences labeled with the corresponding MEROPS functional families or clans is used to automatically construct decision trees that capture regularities sufficient to assign the sequences to their respective functional families. The performance of the resulting decision tree classifiers is then evaluated on an independent test set. Experimental results show that the proposed approach matches or outperforms protein function classification based on the presence of a single characteristic motif in terms of accuracy, precision, and recall. We compared the rules constructed using motifs generated by a multiple-sequence-alignment-based motif discovery tool (MEME) with rules constructed using expert-annotated ProSite motifs (patterns and profiles). Our results indicate that the former provide a potentially powerful high-throughput technique for constructing protein function classifiers when adequate training data are available. Examination of the generated rules for the caspase (C14) family suggests that the proposed technique might be able to identify combinations of sequence motifs that characterize functionally significant three-dimensional structural features of proteins.
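
The general idea in the abstract, a decision tree whose nodes test the presence or absence of a motif, scored by accuracy, precision, and recall on held-out sequences, can be illustrated with a small sketch. This is not the thesis's implementation: the abstract does not name the tree-induction algorithm, so an ID3-style information-gain split is assumed here, and the motif names (`m1`, `m2`, `m3`), the family assignments, and the helper functions are all hypothetical toy values chosen for illustration only.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy, in bits, of a list of family labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def majority(labels):
    """Most frequent family label in the list."""
    return Counter(labels).most_common(1)[0][0]


def build_tree(rows, labels, motifs):
    """Grow an ID3-style tree (an assumed algorithm, not the thesis's).

    rows   : list of sets, the motifs found in each training sequence
    labels : family label of each training sequence
    motifs : set of motif names still available for splitting
    Returns a leaf label, or a tuple (motif, if_present, if_absent).
    """
    if len(set(labels)) == 1 or not motifs:
        return majority(labels)

    def split(motif):
        present = [i for i, row in enumerate(rows) if motif in row]
        absent = [i for i, row in enumerate(rows) if motif not in row]
        return present, absent

    def gain(motif):
        present, absent = split(motif)
        if not present or not absent:
            return 0.0
        remainder = sum(
            len(part) / len(labels) * entropy([labels[i] for i in part])
            for part in (present, absent)
        )
        return entropy(labels) - remainder

    best = max(sorted(motifs), key=gain)
    if gain(best) <= 1e-12:  # no motif separates the families any further
        return majority(labels)
    present, absent = split(best)
    rest = motifs - {best}
    return (
        best,
        build_tree([rows[i] for i in present], [labels[i] for i in present], rest),
        build_tree([rows[i] for i in absent], [labels[i] for i in absent], rest),
    )


def classify(tree, found):
    """Follow presence/absence tests until a leaf (family label) is reached."""
    while isinstance(tree, tuple):
        motif, if_present, if_absent = tree
        tree = if_present if motif in found else if_absent
    return tree


def precision_recall(true, pred, family):
    """Precision and recall for one family, treated as the positive class."""
    tp = sum(t == family and p == family for t, p in zip(true, pred))
    fp = sum(t != family and p == family for t, p in zip(true, pred))
    fn = sum(t == family and p != family for t, p in zip(true, pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical toy data: motif sets and family labels are invented.
train = [
    ({"m1", "m2"}, "C14"), ({"m1", "m2", "m3"}, "C14"),
    ({"m1"}, "C1"), ({"m1", "m3"}, "C1"),
    ({"m3"}, "S8"), ({"m2", "m3"}, "S8"),
]
test = [({"m1", "m2"}, "C14"), ({"m1", "m3"}, "C1"), ({"m2"}, "S8"), ({"m1"}, "C14")]

tree = build_tree([r for r, _ in train], [l for _, l in train], {"m1", "m2", "m3"})
true = [label for _, label in test]
pred = [classify(tree, found) for found, _ in test]
accuracy = sum(t == p for t, p in zip(true, pred)) / len(true)

print(tree)
print(accuracy, precision_recall(true, pred, "C14"))
```

On this toy data the learned tree tests `m1` and then `m2`, so the C14 label is reached only when both motifs are present. That is the kind of motif combination a single-characteristic-motif classifier cannot express.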

Copyright
Tue Jan 01 00:00:00 UTC 2002