Learning classifiers from distributed, semantically heterogeneous, autonomous data sources

Thumbnail Image
Date
2004-01-01
Authors
Caragea, Doina
Major Professor
Advisor
Vasant Honavar
Committee Member
Journal Title
Journal ISSN
Volume Title
Publisher
Altmetrics
Authors
Research Projects
Organizational Units
Organizational Unit
Computer Science

Computer Science—the theory, representation, processing, communication and use of information—is fundamentally transforming every aspect of human endeavor. The Department of Computer Science at Iowa State University advances computational and information sciences through; 1. educational and research programs within and beyond the university; 2. active engagement to help define national and international research, and 3. educational agendas, and sustained commitment to graduating leaders for academia, industry and government.

History
The Computer Science Department was officially established in 1969, with Robert Stewart serving as the founding Department Chair. Faculty were composed of joint appointments with Mathematics, Statistics, and Electrical Engineering. In 1969, the building which now houses the Computer Science department, then simply called the Computer Science building, was completed. Later it was named Atanasoff Hall. Throughout the 1980s to present, the department expanded and developed its teaching and research agendas to cover many areas of computing.

Dates of Existence
1969-present

Related Units

Journal Issue
Is Version Of
Versions
Series
Department
Computer Science
Abstract

Recent advances in computing, communications, and digital storage technologies, together with development of high throughput data acquisition technologies have made it possible to gather and store large volumes of data in digital form. These developments have resulted in unprecedented opportunities for large-scale data-driven knowledge acquisition with the potential for fundamental gains in scientific understanding (e.g., characterization of macromolecular structure-function relationships in biology) in many data-rich domains. In such applications, the data sources of interest are typically physically distributed, semantically heterogeneous and autonomously owned and operated, which makes it impossible to use traditional machine learning algorithms for knowledge acquisition.;However, we observe that most of the learning algorithms use only certain statistics computed from data in the process of generating the hypothesis that they output and we use this observation to design a general strategy for transforming traditional algorithms for learning from data into algorithms for learning from distributed data. The resulting algorithms are provably exact in that the classifiers produced by them are identical to those obtained by the corresponding algorithms in the centralized setting (i.e., when all of the data is available in a central location) and they compare favorably to their centralized counterparts in terms of time and communication complexity.;To deal with the semantical heterogeneity problem, we introduce ontology-extended data sources and define a user perspective consisting of an ontology and a set of interoperation constraints between data source ontologies and the user ontology. We show how these constraints can be used to define mappings and conversion functions needed to answer statistical queries from semantically heterogeneous data viewed from a certain user perspective. That is further used to extend our approach for learning from distributed data into a theoretically sound approach to learning from semantically heterogeneous data.;The work described above contributed to the design and implementation of AirlDM, a collection of data source independent machine learning algorithms through the means of sufficient statistics and data source wrappers, and to the design of INDUS, a federated, query-centric system for knowledge acquisition from distributed, semantically heterogeneous, autonomous data sources.

Comments
Description
Keywords
Citation
Source
Subject Categories
Copyright
Thu Jan 01 00:00:00 UTC 2004