Nonparametric regression models with and without measurement error in the covariates, for univariate and vector responses: a Bayesian approach

Trujillo-Rivera, Eduardo
Major Professor
Alicia L. Carriquiry
Daniel J. Nordman

This dissertation addresses estimation in multivariate non-parametric regression of real-valued and vector-valued functions when the covariates are subject to classical measurement error. Different estimation approaches, including selection of bandwidth parameters, are studied and compared first for the case of no measurement error and then for the measurement-error case. New theoretical results on criteria for selecting the bandwidth parameter are presented for the vector-valued regression problem. We also conjecture on possible extensions of the methods to improve estimation in the multivariate response case.

In the context of semi-parametric regression with multiple covariates, it is known that the solution to the penalized least squares minimization problem can be interpreted as the mean of the posterior distribution arising from an empirical Bayesian approach. The probability model in this approach places a Gaussian process prior on the target regression function, with covariance structure depending on the reproducing kernel of an associated reproducing kernel Hilbert space. By the Representer Theorem, the solution to the minimization problem can be expressed as a linear combination of a set of known basis functions. We prove that under a different Bayesian model, with multivariate normal priors on the coefficients and covariance structure depending on a reproducing kernel, it is possible to obtain the same posterior estimates of the regression function as with the Gaussian process formulation. Our approach has an advantage over its predecessor: to predict the value of the target function anywhere on its domain and to produce credible intervals for the predictions, we only need to evaluate known basis functions using estimated parameters. In contrast, with the Gaussian process formulation we first need to fix the points where the process is to be estimated, and any subsequent evaluation of the process must be done externally. For computational reasons, obtaining an exact solution to the penalized least squares minimization problem is not practical; instead, we review, modify and implement an approximate solution. We show that the full conditional posterior distribution of the point-wise regression estimates is the same in both approaches.
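The equivalence between the two formulations can be checked numerically in a simplified setting. The sketch below is an illustration only, not the model of the dissertation: it assumes a strictly positive definite exponential kernel instead of thin plate splines, omits any unpenalized (parametric) component, and treats the variances `sigma2` and `tau2` as known. All names (`exp_kernel`, `tau2`, `lam`) are hypothetical. It compares the posterior mean at new points, and the posterior covariance at the design points, under a Gaussian process prior and under a normal prior on the basis coefficients.

```python
import numpy as np

def exp_kernel(A, B, length_scale=0.5):
    """Exponential kernel exp(-||a - b|| / length_scale) (assumed for illustration)."""
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)
    return np.exp(-d / length_scale)

rng = np.random.default_rng(0)
n, sigma2, tau2 = 20, 0.1, 2.0           # noise variance and prior variance (assumed known)
X = rng.uniform(size=(n, 2))             # two covariates, no measurement error
y = np.sin(2 * np.pi * X[:, 0]) + X[:, 1] + rng.normal(scale=np.sqrt(sigma2), size=n)
X_new = rng.uniform(size=(5, 2))         # prediction points

K = exp_kernel(X, X)                     # Gram matrix at the design points
K_new = exp_kernel(X_new, X)             # basis functions evaluated at new points
lam = sigma2 / tau2                      # smoothing parameter as a ratio of variances

# Formulation 1: Gaussian process prior f ~ GP(0, tau2 * k).
gp_mean_new = K_new @ np.linalg.solve(K + lam * np.eye(n), y)
gp_cov_design = tau2 * K - tau2 * K @ np.linalg.solve(K + lam * np.eye(n), K)

# Formulation 2: f(x) = sum_i c_i k(x, x_i) with prior c ~ N(0, tau2 * K^{-1}).
precision = K @ K / sigma2 + K / tau2    # posterior precision of the coefficients
c_cov = np.linalg.inv(precision)
c_mean = c_cov @ K @ y / sigma2
coef_mean_new = K_new @ c_mean           # prediction only needs basis evaluations
coef_cov_design = K @ c_cov @ K

print(np.max(np.abs(gp_mean_new - coef_mean_new)))
print(np.max(np.abs(gp_cov_design - coef_cov_design)))
```

In this toy setting both printed discrepancies should be at the level of floating-point error. Note that predicting at `X_new` under the coefficient formulation only requires evaluating the kernel basis functions there, which mirrors the practical advantage described above.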

We evaluated the performance of our method using simulation. We applied our Bayesian approach to existing methods proposed for estimation in non-parametric regression in the frequentist setting and compared the results. These methods include thin plate splines, a linear mixed model interpretation of thin plate splines, and tensor product splines with marginal thin plate splines. In all cases, we computed the previously mentioned approximate solution to the penalized least squares problem. The smoothing parameters are computed via an empirical Bayes approach that involves the minimization of score functions. We considered three different score functions from the literature. The linear mixed model formulation enables us to write the smoothing parameter as the ratio of two variances, and therefore we can estimate the parameter, as a fourth approach, using the standard Bayesian estimation framework. We compare the various approaches by focusing on frequentist properties of the Bayesian estimator of the regression function and of the point-wise credible intervals. In particular, we compute average coverage rates of the credible intervals for all methods, where the average is taken over the prediction points. We find that the average coverage probability is close to the nominal level, at least for predictions inside the region where the covariates were observed. Point-wise credible intervals, however, cannot be trusted to attain nominal coverage unless the prediction points lie inside that region and specific methods are used to select the smoothing parameters.
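The distinction between point-wise coverage and coverage averaged over prediction points can be made concrete with a short sketch. It assumes, purely for illustration, that each credible interval is the symmetric normal interval (posterior mean plus or minus a normal quantile times the posterior standard deviation); intervals built from posterior quantiles would be handled the same way once their endpoints are available. The function name `average_coverage` and the toy inputs are hypothetical.

```python
from statistics import NormalDist
import numpy as np

def average_coverage(truth, post_mean, post_sd, level=0.95):
    """Empirical coverage of point-wise credible intervals.

    truth:     (m,) true regression function at the m prediction points
    post_mean: (R, m) posterior means over R simulated data sets
    post_sd:   (R, m) posterior standard deviations over R simulated data sets
    Returns the (m,) point-wise coverage rates and their average over points.
    """
    z = NormalDist().inv_cdf(0.5 + level / 2)   # e.g. about 1.96 for level 0.95
    lower = post_mean - z * post_sd
    upper = post_mean + z * post_sd
    covered = (lower <= truth) & (truth <= upper)
    pointwise = covered.mean(axis=0)            # coverage at each prediction point
    return pointwise, pointwise.mean()          # average taken over prediction points

# Small hand-checkable example: 2 replicates, 3 prediction points.
truth = np.zeros(3)
post_mean = np.array([[0.0, 1.0, 3.0],
                      [0.0, 2.5, 0.0]])
post_sd = np.ones((2, 3))
pointwise, avg = average_coverage(truth, post_mean, post_sd)
print(pointwise, avg)
```

Here the first point is covered in both replicates and the other two in one replicate each, so the average coverage is 2/3 even though no single point attains exactly that rate. An average near the nominal level can therefore coexist with poor coverage at individual points.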

The simulation has two objectives: to study the performance of the estimators and to examine potential approaches involving basis functions with tractable form which might be used in a more complex setting with errors in the measurements. We argue that the Bayesian framework applied to the thin plate spline approach offers an acceptable trade-off between the computational complexity required to fit and predict from the model and the frequentist properties of the estimators. Using the proposed Bayesian model and thin plate splines, we extend our Bayes model to the regression problem with multiple regressors and classical measurement error in the covariates. We carried out a similar simulation study to examine the frequentist properties of the estimators. We discuss simulation results on point-wise estimation of the regression function, empirical coverage of point-wise credible intervals for evaluations of the regression function, and the performance of estimators of the observation-error variance.

While reviewing the literature, we found that many results are either presented without proof or with proofs that seemed incomplete to us. In those cases, we endeavored to write complete proofs for those results on which we relied. If the proof of a proposition is presented in this dissertation, that indicates that it was not available in the literature and can be considered original research. Whenever a proposition is listed without a proof, it means that the proof was published elsewhere and we include the corresponding citation.

Finally, we consider the case where the response is vector-valued and the form of the mean regression function is unknown. We first propose an estimation approach for the case of no measurement error in the covariates. We then extend the method to the case where covariates are measured with classical error. As in the univariate response case, we do not assume a form for the regression function, but we do formulate a set of assumptions that must be met. We propose, without complete proof, three methods for computing the smoothing parameters and extend the methods to theoretically address calculation of a diagonal bandwidth matrix and a general bandwidth matrix. We illustrate these methods via simulated examples.

Sun Jan 01 00:00:00 UTC 2017