Semantics-based approach for generating partial views from linked life-cycle highway project data
The purpose of this dissertation is to develop methods that can assist data integration and extraction from heterogeneous sources generated throughout the life-cycle of a highway project. In the era of computerized technologies, project data is largely available in digital format. Due to the fragmented nature of the civil infrastructure sector, digital data are created and managed separately by different project actors in proprietary data warehouses. The differences in the data structure and semantics greatly hinder the exchange and fully reuse of digital project data. In order to address those issues, this dissertation carries out the following three individual studies.
The first study aims to develop a framework for interconnecting heterogeneous life cycle project data into an unified and linked data space. This is an ontology-based framework that consists of two phases: (1) translating proprietary datasets into homogeneous RDF data graphs; and (2) connecting separate data networks to each other. Three domain ontologies for design, construction, and asset condition survey phases are developed to support data transformation. A merged ontology that integrates the domain ontologies is constructed to provide guidance on how to connect data nodes from domain graphs.
The second study is to deal with the terminology inconsistency between data sources. An automated method is developed that employs Natural Language Processing (NLP) and machine learning techniques to support constructing a domain specific lexicon from design manuals. The method utilizes pattern rules to extract technical terms from texts and learns their representation vectors using a neural network based word embedding approach. The study also includes the development of an integrated method of minimal-supervised machine learning, clustering analysis, and word vectors, for computing the term semantics and classifying the relations between terms in the target lexicon.
In the last study, a data retrieval technique for extracting subsets of an XML civil data schema is designed and tested. The algorithm takes a keyword input of the end user and returns a ranked list of the most relevant XML branches. This study utilizes a lexicon of the highway domain generated from the second study to analyze the semantics of the end user keywords. A context-based similarity measure is introduced to evaluate the relevance between a certain branch in the source schema and the user query.
The methods and algorithms resulting from this research were tested using case studies and empirical experiments.
The results indicate that the study successfully address the heterogeneity in the structure and terminology of data and enable a fast extraction of sub-models of data. The study is expected to enhance the efficiency in reusing digital data generated throughout the project life-cycle, and contribute to the success in transitioning from paper-based to digital project delivery for civil infrastructure projects.