PathBinder: a sentence repository of biochemical interactions extracted from MEDLINE
Abstract
MEDLINE is a fast-growing online database of scientific literature covering life science, medicine, health care, and related fields. It offers attractive opportunities for automatic information extraction, for tasks such as extracting networks of protein interactions, and for helping researchers who need to sift efficiently through the literature to find work relating to small sets of biochemicals of interest. PathBinder is a software system that extracts sentences containing potential biochemical interactions from the annual baseline distribution of the MEDLINE database. An interaction between two biochemicals is assumed if they co-occur in a single sentence. Single sentences were parsed from MEDLINE abstracts and scanned against a dictionary containing more than 80,000 entries (>40,000 biochemicals and their aliases) for the presence of at least two different biochemicals. The dictionary was constructed automatically by extracting names and synonyms of protein and non-protein biochemicals from four databases. The extracted sentences are organized in a repository, about 11 GB in size, and can be retrieved easily through a 2-level index system keyed on two biochemical names. The performance of PathBinder in terms of information extraction metrics (e.g. precision and recall) was evaluated using a sample MEDLINE file. Sentence parsing had a precision of 99.6% and a recall of 99.5%. Biochemical labeling had a precision of 80.5% and a recall of 57.3%.
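The pipeline described in the abstract (sentence parsing, dictionary-based labeling, co-occurrence filtering, and a 2-level index keyed on two biochemical names) can be illustrated with a minimal Python sketch. This is not PathBinder's actual implementation: the toy alias dictionary, the regex-based sentence splitter, the whole-token matching rule, and every function name below are assumptions made for illustration only. The last helper shows how precision and recall are conventionally computed from a predicted set and a gold-standard set.

```python
import re
from collections import defaultdict
from itertools import combinations

# Hypothetical toy dictionary: lowercase alias -> canonical biochemical name.
# The real dictionary is described as having more than 80,000 entries.
ALIASES = {
    "gibberellin": "gibberellin",
    "ga": "gibberellin",
    "abscisic acid": "abscisic acid",
    "aba": "abscisic acid",
    "alpha-amylase": "alpha-amylase",
}


def split_sentences(abstract):
    """Naive splitter (an assumption): break after . ! or ? when the next
    non-space character is a capital letter or a digit."""
    parts = re.split(r"(?<=[.!?])\s+(?=[A-Z0-9])", abstract)
    return [p.strip() for p in parts if p.strip()]


def label_biochemicals(sentence):
    """Return canonical names whose aliases occur as whole tokens."""
    lowered = sentence.lower()
    found = set()
    for alias, canonical in ALIASES.items():
        # Reject matches embedded in a longer word or hyphenated term.
        pattern = r"(?<![\w-])" + re.escape(alias) + r"(?![\w-])"
        if re.search(pattern, lowered):
            found.add(canonical)
    return found


def build_index(abstracts):
    """Two-level index: first name -> second name -> list of sentences.
    Only sentences with at least two different biochemicals are stored."""
    index = defaultdict(lambda: defaultdict(list))
    for abstract in abstracts:
        for sentence in split_sentences(abstract):
            names = sorted(label_biochemicals(sentence))
            for first, second in combinations(names, 2):
                index[first][second].append(sentence)
    return index


def lookup(index, name_a, name_b):
    """Retrieve sentences for a pair, regardless of argument order."""
    first, second = sorted((name_a, name_b))
    return index.get(first, {}).get(second, [])


def precision_recall(predicted, gold):
    """Precision = TP / |predicted|, recall = TP / |gold|."""
    true_positives = len(set(predicted) & set(gold))
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


if __name__ == "__main__":
    text = ("Gibberellin induces alpha-amylase in aleurone cells. "
            "ABA antagonizes GA signaling. The effect was dose dependent.")
    idx = build_index([text])
    print(lookup(idx, "gibberellin", "abscisic acid"))
```

Storing each pair under its alphabetically sorted names means a query for (A, B) and a query for (B, A) reach the same entry, which is one simple way to realize a lookup "based on two biochemical names". Note that co-occurrence is only a proxy for interaction, so a sketch like this returns candidate sentences rather than confirmed interactions.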