Topics in functional data analysis and machine learning predictive inference
This dissertation is composed of three research projects focused on functional data analysis and machine learning predictive inference.
The first project addresses covariance estimation, principal component analysis, and prediction for spatially correlated functional data. We develop a general framework and fully nonparametric estimation methods for spatial functional data collected in a geostatistical setting, where locations are sampled from a spatial point process and a random function is discretely observed at each location, contaminated with a functional nugget effect and measurement errors. We derive unified asymptotic convergence rates for the proposed estimators that apply to both sparse and dense functional data. Simulation studies and analyses of two real-estate datasets show that the proposed approach outperforms competing state-of-the-art approaches.
In the second project, we present a novel application of functional modeling to plant phenotypic data derived from crowdsourced images annotated by Amazon Mechanical Turk (MTurk) workers. The goal of this study is to estimate the effect of genotype and its interaction with environment on plant growth while adjusting for measurement errors arising from crowdsourced image analysis. We model plant height measurements as discrete observations of growth curves contaminated with MTurk worker random effects and heteroscedastic measurement errors. A reduced-rank functional model, along with a robust and shape-constrained estimation approach, is developed for growth curves and their derivatives, which depend on replicates, genotypes, and environmental conditions. As byproducts, the proposed model yields a new method for assessing the quality of MTurk worker data and an index for measuring the drought sensitivity of various genotypes.
In the third project, we propose a new approach to constructing random forest prediction intervals that uses the empirical distribution of out-of-bag prediction errors, and we provide theory that guarantees asymptotic coverage for the proposed intervals. We perform extensive numerical experiments, along with analyses of 60 real datasets, to compare the finite-sample properties of the proposed intervals with those of two state-of-the-art approaches: quantile regression forests and split conformal intervals. The results demonstrate the advantages, reliability, and efficiency of the proposed approach.
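As a rough illustration of the idea behind the third project, the following minimal sketch forms an interval by shifting a point prediction by empirical quantiles of out-of-bag errors. It is not the dissertation's implementation. It assumes that the out-of-bag errors (taken here to be observed response minus out-of-bag prediction) and the point prediction have already been computed by some random forest. The function names, the sign convention, and the quantile definition are all hypothetical choices made for this example.

```python
import math


def empirical_quantile(values, prob):
    """Return the smallest sample value whose empirical CDF is at least prob.

    Assumption: this order-statistic definition is one common convention;
    the abstract does not specify which quantile definition is used.
    """
    ordered = sorted(values)
    n = len(ordered)
    # Rank of the ceil(n * prob)-th order statistic, clamped to [1, n].
    rank = min(max(math.ceil(n * prob), 1), n)
    return ordered[rank - 1]


def oob_prediction_interval(point_prediction, oob_errors, alpha=0.1):
    """Sketch of a (1 - alpha) interval from out-of-bag prediction errors.

    Assumption: oob_errors holds (observed - out-of-bag prediction) for the
    training cases, and point_prediction is the forest's prediction for a
    new case. Both are supplied by the caller.
    """
    lower = point_prediction + empirical_quantile(oob_errors, alpha / 2)
    upper = point_prediction + empirical_quantile(oob_errors, 1 - alpha / 2)
    return lower, upper


if __name__ == "__main__":
    # Made-up errors, purely for illustration.
    errors = [-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
    print(oob_prediction_interval(10.0, errors, alpha=0.5))
```

In this sketch the interval need not be symmetric about the point prediction, since its endpoints follow the shape of the empirical error distribution.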