Parallel Fractional Hot Deck Imputation and Variance Estimation for Big Incomplete Data Curing
Date
2020-10-06
Publisher
IEEE
Abstract
Fractional hot-deck imputation (FHDI) is a general-purpose, assumption-free imputation method for handling multivariate missing data: it fills each missing item with multiple observed values rather than artificially created ones. The corresponding R package FHDI \cite{Im:2018} is general and efficient, but it is not adequate for big incomplete data because of its excessive memory requirements and long running times. As a first step toward tackling big incomplete data with FHDI, we developed a parallel version of fractional hot-deck imputation (named P-FHDI) suitable for curing large incomplete datasets. Results show a favorable speedup when P-FHDI is applied to big datasets with up to millions of instances or 10,000 variables. This paper explains the detailed parallel algorithms of P-FHDI for datasets with many instances (big-n) or high dimensionality (big-p) and confirms its favorable scalability. The proposed program inherits all the advantages of the serial FHDI and enables parallel variance estimation, which will benefit a broad audience in science and engineering.
Type
article
Comments
This is a manuscript of an article published as Yang, Yicheng, Jaekwang Kim, and In-Ho Cho. "Parallel Fractional Hot Deck Imputation and Variance Estimation for Big Incomplete Data Curing." IEEE Transactions on Knowledge and Data Engineering (2020).
DOI: 10.1109/TKDE.2020.3029146.
Copyright 2020 IEEE.
Attribution 4.0 International (CC BY 4.0).
Posted with permission.