Information extraction with weak supervision

dc.contributor.advisor Li, Qi
dc.contributor.advisor Cai, Ying
dc.contributor.advisor Liu, Kevin
dc.contributor.advisor Gao, Hongyang
dc.contributor.advisor Huai, Mengdi
dc.contributor.author Zhou, Kang
dc.contributor.department Department of Computer Science
dc.date.accessioned 2025-02-11T17:23:01Z
dc.date.available 2025-02-11T17:23:01Z
dc.date.issued 2024-12
dc.date.updated 2025-02-11T17:23:02Z
dc.description.abstract This dissertation explores the development and application of weak supervision techniques to address key challenges in three fundamental information extraction (IE) tasks: Named Entity Recognition (NER), Relation Extraction (RE), and Entity Linking (EL). Traditional supervised learning methods in these domains often require extensive human annotations, which are costly and time-consuming, limiting their scalability and applicability in real-world scenarios. To overcome these limitations, this research introduces innovative weakly supervised methodologies for each of these tasks, aiming to reduce reliance on manual labeling while maintaining high performance. The first part of the dissertation presents a novel framework, Confidence-Based Multi-Class Positive and Unlabeled (Conf-MPU) learning, designed to enhance the performance of distantly supervised NER. By incorporating confidence scores into a multi-class PU learning approach, Conf-MPU effectively handles incomplete labeling and varying false negative rates inherent in distantly supervised data. Experimental results on benchmark datasets demonstrate that Conf-MPU significantly outperforms existing state-of-the-art methods, advancing the field of distantly supervised NER. The second part focuses on improving Relation Extraction through the integration of indirect supervision. A novel approach, DSRE-NLI, is introduced, which leverages a Natural Language Inference (NLI) engine and a Semi-Automatic Relation Verbalization (SARV) mechanism to diagnose and mitigate label noise in distantly supervised RE tasks. This method enhances the semantic diversity of relation templates with minimal human input, resulting in a significant performance boost over traditional distantly supervised methods on real and simulated datasets. 
The third part of the dissertation addresses challenges in Zero-Shot Entity Linking (ZSEL) with a new re-ranking approach, GenDecider, which incorporates “None of the Candidates” (NoC) judgments into the re-ranking process. By formulating the task as a generative process using the Llama model, GenDecider effectively detects scenarios where the correct entity is not among the retrieved candidates. This approach significantly improves the accuracy and reliability of ZSEL systems, as evidenced by its performance on the benchmark ZESHEL dataset. Collectively, the contributions of this dissertation lie in advancing weak supervision techniques across three critical IE tasks, reducing the dependency on extensive manual annotations, and improving the robustness and scalability of information extraction systems. The findings have broad implications for the development of practical, scalable IE solutions in data-rich environments. Future research directions include refining noise-handling mechanisms, optimizing computational efficiency, and expanding the proposed methods to multilingual and low-resource settings.
dc.format.mimetype PDF
dc.identifier.uri https://dr.lib.iastate.edu/handle/20.500.12876/Qr9mg7Jr
dc.language.iso en
dc.language.rfc3066 en
dc.subject.disciplines Computer science en_US
dc.subject.keywords Information Extraction en_US
dc.subject.keywords Weak Supervision en_US
dc.title Information extraction with weak supervision
dc.type dissertation en_US
dc.type.genre dissertation en_US
dspace.entity.type Publication
relation.isOrgUnitOfPublication f7be4eb9-d1d0-4081-859b-b15cee251456
thesis.degree.discipline Computer science en_US
thesis.degree.grantor Iowa State University en_US
thesis.degree.level dissertation
thesis.degree.name Doctor of Philosophy en_US
File
Original bundle
Name:
Zhou_iastate_0097E_21832.pdf
Size:
2.6 MB
Format:
Adobe Portable Document Format