Parallel Fractional Hot Deck Imputation and Variance Estimation for Big Incomplete Data Curing

Date
2020-10-06
Authors
Yang, Yicheng
Publisher
IEEE
Authors
Person
Kim, Jae Kwang
Professor
Person
Cho, In-Ho
Associate Professor
Department
Statistics
Abstract
Fractional hot-deck imputation (FHDI) is a general-purpose, assumption-free imputation method for handling multivariate missing data: it fills each missing item with multiple observed values rather than artificially created ones. The corresponding R package FHDI \cite{Im:2018} is general and efficient, but it is inadequate for big incomplete data because of its excessive memory requirements and long running times. As a first step toward tackling big incomplete data with FHDI, we developed a parallel version of fractional hot-deck imputation (named P-FHDI) suitable for curing large incomplete datasets. Results show a favorable speedup when P-FHDI is applied to big datasets with up to millions of instances or 10,000 variables. This paper explains the detailed parallel algorithms of P-FHDI for datasets with many instances (big-n) or high dimensionality (big-p) and confirms their favorable scalability. The proposed program inherits all the advantages of the serial FHDI and enables parallel variance estimation, which will benefit a broad audience in science and engineering.
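To make the core idea in the abstract concrete, the following is a minimal, serial sketch of fractional hot-deck imputation for a single variable. It is not the FHDI R package or the P-FHDI algorithm: the function names (`fractional_hot_deck`, `weighted_mean`), the pre-assigned imputation cells, and the equal fractional weights are all assumptions made for illustration. The point it shows is only that each missing item is replaced by several observed donor values, each carrying a fractional weight that sums to one.

```python
import random
from collections import defaultdict


def fractional_hot_deck(records, cell_key, target, n_donors=2, seed=0):
    """Toy fractional hot-deck imputation for one variable (illustrative only).

    A record whose `target` is None is replaced by up to `n_donors` copies.
    Each copy takes an observed value from a donor in the same imputation
    cell and an equal fractional weight, so the copies' weights sum to 1.
    Equal weights and given cells are simplifying assumptions.
    """
    rng = random.Random(seed)

    # Collect the observed (donor) values available in each imputation cell.
    donors = defaultdict(list)
    for rec in records:
        if rec[target] is not None:
            donors[rec[cell_key]].append(rec[target])

    imputed = []
    for rec in records:
        if rec[target] is not None:
            # Fully observed records keep their value with weight 1.
            imputed.append({**rec, "weight": 1.0})
            continue
        pool = donors[rec[cell_key]]
        if not pool:
            raise ValueError(f"no donors in cell {rec[cell_key]!r}")
        # Draw distinct donors without replacement; only observed values are used.
        chosen = rng.sample(pool, min(n_donors, len(pool)))
        for value in chosen:
            imputed.append({**rec, target: value, "weight": 1.0 / len(chosen)})
    return imputed


def weighted_mean(rows, target):
    """Mean of `target` over the imputed rows, using the fractional weights."""
    total_weight = sum(r["weight"] for r in rows)
    return sum(r["weight"] * r[target] for r in rows) / total_weight
```

A call such as `fractional_hot_deck(data, "cell", "y", n_donors=2)` on a list of dictionaries returns an expanded list in which every original record contributes a total weight of one, so weighted estimates can be computed directly from the imputed rows.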
Comments
This is a manuscript of an article published as Yang, Yicheng, Jaekwang Kim, and In-Ho Cho. "Parallel Fractional Hot Deck Imputation and Variance Estimation for Big Incomplete Data Curing." IEEE Transactions on Knowledge and Data Engineering (2020). DOI: 10.1109/TKDE.2020.3029146. Copyright 2020 IEEE. Attribution 4.0 International (CC BY 4.0). Posted with permission.