Development of general-purpose data curing platform for big incomplete data in infrastructure engineering and broad science

Yang, Yicheng
Major Professor
Cho, In Ho
Kim, Jae Kwang
Ceylan, Halil
Tian, Jin
Laflamme, Simon
Committee Member
Civil, Construction, and Environmental Engineering
Fractional hot-deck imputation (FHDI) is a general-purpose imputation method, free of distributional assumptions, for handling multivariate missing data: it fills each missing unit with multiple observed values rather than artificially created ones. The corresponding R package, FHDI, is general and efficient, but its computational cost and memory requirements prevent it from curing big incomplete data. Building on FHDI theory, we developed the first version of the parallel fractional hot-deck imputation (P-FHDI) program, which can handle big incomplete datasets with a large number of instances (so-called big-n) or high dimensionality (so-called big-p). P-FHDI inherits all the advantages of FHDI and strengthens it with a parallel Jackknife variance estimation. This dissertation explains the detailed parallel algorithms of P-FHDI for big-n data and the sure independence screening method used for variable reduction of big-p data. P-FHDI exhibits linear scalability for big incomplete datasets with millions of instances or 10,000 variables. However, excessive memory requirements and execution time become intractable obstacles when P-FHDI is applied to ultra incomplete data (i.e., data that are concurrently big-n and big-p). We therefore developed the ultra data-oriented P-FHDI (UP-FHDI), which is suited to handling ultra incomplete data. This dissertation illustrates the special ultra data-oriented parallel algorithms of UP-FHDI. In addition to the parallel Jackknife method, UP-FHDI adopts parallel linearization techniques to enable computationally efficient variance estimation. Results show promising scalability of UP-FHDI on ultra data and confirm its positive impact on subsequent deep learning performance. Remarkably, UP-FHDI can handle an ultra dataset with one million instances and 10,000 variables and places no restrictions on data types or volume.
Applications of UP-FHDI to various big incomplete data in infrastructure engineering and broad science affirm its accuracy, generality, and efficiency.
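
For readers unfamiliar with the Jackknife variance estimation mentioned in the abstract, the following is a minimal serial sketch of the generic delete-one Jackknife on complete data. It is not the P-FHDI or UP-FHDI implementation, which the abstract describes as parallel and applied to imputed data; the function name `jackknife_variance` and its `estimator` parameter are hypothetical names chosen for illustration.

```python
from statistics import mean


def jackknife_variance(values, estimator=mean):
    """Delete-one Jackknife variance of `estimator` over `values`.

    Illustrative only: a serial version on complete data, not the
    parallel variance estimation used by P-FHDI.
    """
    values = list(values)
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")

    # Recompute the estimate n times, each time leaving one unit out.
    replicates = [estimator(values[:i] + values[i + 1:]) for i in range(n)]

    # Spread of the replicates around their average, scaled by (n - 1) / n.
    center = sum(replicates) / n
    return (n - 1) / n * sum((r - center) ** 2 for r in replicates)


sample = [1.0, 2.0, 3.0, 4.0, 5.0]
print(jackknife_variance(sample))
```

Each of the n replicates is computed independently of the others, which is the property that makes this kind of estimator a natural candidate for parallel execution. For the sample mean, the delete-one Jackknife variance coincides with the usual estimate s²/n, which is 0.5 for the sample above.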