Parallel Fractional Hot Deck Imputation and Variance Estimation for Big Incomplete Data Curing

Yang, Yicheng; Kim, Jae Kwang; Cho, In-Ho

File

2020-KimJaiKwang-ParallelFractional.pdf (3.42 MB)

Date

2020-10-06

Authors

Yang, Yicheng

Kim, Jae Kwang

Cho, In-Ho

Publisher

IEEE

Abstract

Fractional hot-deck imputation (FHDI) is a general-purpose, assumption-free imputation method for handling multivariate missing data. It fills each missing item with multiple observed values rather than artificially created values. The corresponding R package FHDI \cite{Im:2018} is general and efficient, but it is not adequate for big incomplete data because it requires excessive memory and long running times. As a first step toward tackling big incomplete data with FHDI, we developed a parallel version of fractional hot-deck imputation (named P-FHDI) suitable for curing large incomplete datasets. Results show a favorable speedup when P-FHDI is applied to big datasets with up to millions of instances or 10,000 variables. This paper explains the detailed parallel algorithms of P-FHDI for datasets with many instances (big-n) or high dimensionality (big-p) and confirms their favorable scalability. The proposed program inherits all the advantages of the serial FHDI and enables parallel variance estimation, which will benefit a broad audience in science and engineering.

Academic or Administrative Unit

Statistics (CALS)

Type

article

Comments

This is a manuscript of an article published as Yang, Yicheng, Jae Kwang Kim, and In-Ho Cho. "Parallel Fractional Hot Deck Imputation and Variance Estimation for Big Incomplete Data Curing." IEEE Transactions on Knowledge and Data Engineering (2020). DOI: 10.1109/TKDE.2020.3029146. Copyright 2020 IEEE. Attribution 4.0 International (CC BY 4.0). Posted with permission.