Bayesian variable selection in ultra-high dimensional settings

dc.contributor.advisor Dutta, Somak
dc.contributor.advisor Roy, Vivekananda
dc.contributor.advisor Carriquiry, Alicia
dc.contributor.advisor Dorman, Karin
dc.contributor.advisor Kaiser, Mark
dc.contributor.author Li, Dongjin
dc.contributor.department Department of Statistics (LAS)
dc.date.accessioned 2022-11-08T23:52:03Z
dc.date.available 2022-11-08T23:52:03Z
dc.date.issued 2021-08
dc.date.updated 2022-11-08T23:52:03Z
dc.description.abstract This thesis is a collection of three papers focused on Bayesian variable selection for ultra-high dimensional problems. In many disciplines of modern scientific research, datasets with tens of thousands of predictors but only a limited number of observations are common. Yet only very few of these predictors are believed to be associated with the response, making variable selection in the ultra-high dimensional setup an important and challenging problem. In this thesis, we present novel Bayesian methods for variable selection in ultra-high dimensional settings. In Chapter \ref{ch.sven}, we propose a Bayesian variable selection method, called SVEN, for Gaussian linear regression models. The method is based on a hierarchical Gaussian linear model with priors placed on the regression coefficients as well as on the model space. The use of degenerate {\it spike} priors on inactive variables and Gaussian {\it slab} priors on the important predictors induces sparsity in the regression coefficients and yields an analytically available form of the posterior probability of a model. Strong model selection consistency is shown to hold when the number of predictors grows nearly exponentially with the sample size, and even when the norm of the mean effects due solely to the unimportant variables diverges, which is a novel and attractive feature. We develop a scalable variable selection algorithm with a built-in screening method that efficiently explores the enormous model space, rapidly identifies regions of high posterior probability, and provides fast inference and prediction. To further mitigate multimodality of the posterior distribution, we use a temperature schedule whose values are guided by our model selection consistency results. An appealing byproduct of SVEN is the construction of novel model-weight-adjusted prediction intervals. To implement SVEN, we develop an R package called ``Bravo'', which is now available on the Comprehensive R Archive Network (CRAN). In Chapter \ref{ch.bravo}, we describe the major features of the software and conduct step-by-step analyses of several examples to illustrate the usage of the functions in the package. In Chapter \ref{ch.spsven}, we extend the SVEN model to include spatial random effects and call the resulting Bayesian variable selection method SP-SVEN. The SP-SVEN model is based on a hierarchical Gaussian linear mixed model in which the well-known spike-and-slab priors are placed on the regression coefficients to achieve sparsity and a Gaussian intrinsic autoregression prior is assigned to the spatial random effects. The use of Gaussian conjugate priors ensures that the posterior distribution conditional on the two spatial parameters of the intrinsic autoregression is available in explicit form, and numerical integration is used to integrate out these two parameters to obtain the posterior distribution of a given model. For the priors on the two spatial parameters, we propose using data from unsequenced varieties, if available, to build a hierarchical mixture model, and using the corresponding posterior distribution of the spatial parameters as their prior for the variable selection model. We also develop a scalable algorithm that embeds model-based screening and uses fast Cholesky updates to compute the posterior probabilities, thereby achieving fast exploration of the gigantic model space and rapid discovery of the high posterior regions. The outstanding performance of SVEN and SP-SVEN is demonstrated through a number of simulation studies and several real data examples from genome-wide association studies in field trial experiments.
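To make the spike-and-slab hierarchy described in the abstract concrete, the following is a minimal illustrative sketch of such a model; here $\gamma$ denotes the set of active predictors, $X_\gamma$ the corresponding columns of the design matrix, and the sparsity weight $w$ and slab precision $\lambda$ are generic placeholders rather than the exact prior specification and hyperparameter choices given in Chapter \ref{ch.sven}.
\begin{align*}
y \mid \beta_\gamma, \sigma^2, \gamma &\sim N\!\left(X_\gamma \beta_\gamma, \, \sigma^2 I_n\right), \\
\beta_\gamma \mid \sigma^2, \gamma &\sim N\!\left(0, \, \tfrac{\sigma^2}{\lambda} I_{|\gamma|}\right), \qquad \beta_j = 0 \ \text{for } j \notin \gamma \ \text{(degenerate spike)}, \\
p(\gamma) &\propto w^{|\gamma|}(1-w)^{p-|\gamma|}.
\end{align*}
Because the slab is Gaussian and the spike is a point mass at zero, the marginal likelihood of each model $\gamma$ (with a conjugate treatment of $\sigma^2$) is available in closed form, which is what makes the posterior model probabilities analytically tractable as stated above.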
dc.format.mimetype PDF
dc.identifier.doi https://doi.org/10.31274/td-20240329-105
dc.identifier.orcid 0000-0002-0928-0307
dc.identifier.uri https://dr.lib.iastate.edu/handle/20.500.12876/qzoD43ew
dc.language.iso en
dc.language.rfc3066 en
dc.subject.disciplines Statistics en_US
dc.subject.keywords GWAS en_US
dc.subject.keywords hierarchical model en_US
dc.subject.keywords posterior prediction en_US
dc.subject.keywords shrinkage en_US
dc.subject.keywords stochastic search en_US
dc.subject.keywords subset selection en_US
dc.title Bayesian variable selection in ultra-high dimensional settings
dc.type dissertation en_US
dc.type.genre dissertation en_US
dspace.entity.type Publication
relation.isOrgUnitOfPublication 264904d9-9e66-4169-8e11-034e537ddbca
thesis.degree.discipline Statistics en_US
thesis.degree.grantor Iowa State University en_US
thesis.degree.level dissertation
thesis.degree.name Doctor of Philosophy en_US
File
Original bundle
Name: Li_iastate_0097E_19611.pdf
Size: 1.25 MB
Format: Adobe Portable Document Format