Bayesian variable selection in ultra-high dimensional settings

dc.contributor.advisor Dutta, Somak
dc.contributor.advisor Roy, Vivekananda
dc.contributor.advisor Carriquiry, Alicia
dc.contributor.advisor Dorman, Karin
dc.contributor.advisor Kaiser, Mark
dc.contributor.author Li, Dongjin
dc.contributor.department Department of Statistics (LAS)
dc.date.accessioned 2022-11-08T23:52:03Z
dc.date.available 2022-11-08T23:52:03Z
dc.date.issued 2021-08
dc.date.updated 2022-11-08T23:52:03Z
dc.description.abstract This thesis is a collection of three papers focused on Bayesian variable selection for ultra-high dimensional problems. In many disciplines of modern scientific research, datasets with tens of thousands of predictors but only a limited number of observations are common. Yet only very few of these predictors are believed to be associated with the response, making variable selection in the ultra-high dimensional setup an important and challenging problem. In this thesis, we present novel Bayesian methods for variable selection in ultra-high dimensional settings. In Chapter \ref{ch.sven}, we propose a Bayesian variable selection method, called SVEN, for Gaussian linear regression models. The method is based on a hierarchical Gaussian linear model with priors placed on the regression coefficients as well as on the model space. The use of degenerate {\it spike} priors on inactive variables and Gaussian {\it slab} priors on the important predictors induces sparsity in the regression coefficients and yields an analytically available form of the posterior probability of a model. Strong model selection consistency is shown to hold when the number of predictors grows nearly exponentially with the sample size, and even when the norm of the mean effects due solely to the unimportant variables diverges, which is a novel and attractive feature. We develop a scalable variable selection algorithm with a built-in screening method that efficiently explores the enormous model space, rapidly identifies regions of high posterior probability, and provides fast inference and prediction. To further mitigate multimodality of the posterior distribution, we use a temperature schedule whose values are guided by our model selection consistency results. An appealing byproduct of SVEN is the construction of novel model-weight-adjusted prediction intervals. To implement SVEN, we develop an R package called ``Bravo'', which is now available on the Comprehensive R Archive Network (CRAN). In Chapter \ref{ch.bravo}, we describe the major features of the software and conduct step-by-step analyses of several examples to illustrate the usage of the functions in the package. In Chapter \ref{ch.spsven}, we extend the SVEN model to include spatial random effects and call the resulting Bayesian variable selection method SP-SVEN. The SP-SVEN model is based on a hierarchical Gaussian linear mixed model in which the well-known spike-and-slab priors are placed on the regression coefficients to achieve sparsity and a Gaussian intrinsic autoregression prior is assigned to the spatial random effects. The use of Gaussian conjugate priors ensures that the posterior distribution conditional on the two spatial parameters of the intrinsic autoregression is available in explicit form, and numerical integration is used to integrate out these two parameters to obtain the posterior distribution of a given model. For the priors on the two spatial parameters, we propose using data from unsequenced varieties, if available, to build a hierarchical mixture model, and using the corresponding posterior distribution of the spatial parameters as their prior for the variable selection model. We also develop a scalable algorithm that embeds model-based screening and uses fast Cholesky updates to compute the posterior probabilities, thereby achieving fast exploration of the gigantic model space and rapid discovery of the high posterior regions. The outstanding performance of SVEN and SP-SVEN is demonstrated through a number of simulation studies and several real data examples from genome-wide association studies in field trial experiments.
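To make the spike-and-slab hierarchy described in the abstract concrete, the following is a minimal illustrative sketch of such a model; here $\gamma$ denotes the set of active predictors, $X_\gamma$ the corresponding columns of the design matrix, and the sparsity weight $w$ and slab precision $\lambda$ are generic placeholders rather than the exact prior specification and hyperparameter choices given in Chapter \ref{ch.sven}.
\begin{align*}
y \mid \beta_\gamma, \sigma^2, \gamma &\sim N\!\left(X_\gamma \beta_\gamma, \, \sigma^2 I_n\right), \\
\beta_\gamma \mid \sigma^2, \gamma &\sim N\!\left(0, \, \tfrac{\sigma^2}{\lambda} I_{|\gamma|}\right), \qquad \beta_j = 0 \ \text{for } j \notin \gamma \ \text{(degenerate spike)}, \\
p(\gamma) &\propto w^{|\gamma|}(1-w)^{p-|\gamma|}.
\end{align*}
Because the slab is Gaussian and the spike is a point mass at zero, the marginal likelihood of each model $\gamma$ (with a conjugate treatment of $\sigma^2$) is available in closed form, which is what makes the posterior model probabilities analytically tractable as stated above.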
dc.format.mimetype PDF
dc.identifier.doi https://doi.org/10.31274/td-20240329-105
dc.identifier.orcid 0000-0002-0928-0307
dc.identifier.uri https://dr.lib.iastate.edu/handle/20.500.12876/qzoD43ew
dc.language.iso en
dc.language.rfc3066 en
dc.subject.disciplines Statistics en_US
dc.subject.keywords GWAS en_US
dc.subject.keywords hierarchical model en_US
dc.subject.keywords posterior prediction en_US
dc.subject.keywords shrinkage en_US
dc.subject.keywords stochastic search en_US
dc.subject.keywords subset selection en_US
dc.title Bayesian variable selection in ultra-high dimensional settings
dc.type dissertation en_US
dc.type.genre dissertation en_US
dspace.entity.type Publication
relation.isOrgUnitOfPublication 264904d9-9e66-4169-8e11-034e537ddbca
thesis.degree.discipline Statistics en_US
thesis.degree.grantor Iowa State University en_US
thesis.degree.level dissertation
thesis.degree.name Doctor of Philosophy en_US
File
Original bundle
Name: Li_iastate_0097E_19611.pdf
Size: 1.25 MB
Format: Adobe Portable Document Format