Contributions to improve the accuracy and computational efficiency of genomic prediction
The discovery of genome-wide high-density molecular markers (e.g., single-nucleotide
polymorphisms, SNPs) has revolutionized genetic analyses in human medicine and in animal and
plant breeding. There are several active areas of research and development in whole-genome
analyses, including 1) collection or simulation of genomic data, 2) use of genomic data for
prediction or genome-wide association studies, and 3) validation of the performance of these analyses.
In this thesis, several statistical models and computational algorithms were proposed and investigated,
contributing to these three areas of research and development.
A contribution to the first area is a simulation strategy that drops down the origins and positions of chromosomal segments, rather than every allele state,
to efficiently simulate sequence data and complex pedigree structures across multiple generations. A software tool called XSim, which incorporates this
efficient strategy, was developed with implementations in C++ and Julia. XSim allows the genomes of founders to be characterized by real genome sequence
data and supports complex pedigree structures among descendants.
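
The segment-based idea described above can be illustrated with a minimal sketch. It is not XSim's code or interface: the function names, the representation of a haplotype as a list of (start, origin) segments, and the use of positions as fractions of chromosome length are all assumptions made for illustration. Allele states are looked up from the founder haplotypes only when they are needed.

```python
import bisect

def origin_at(hap, pos):
    """Return the founder origin carried by haplotype `hap` at position `pos`."""
    starts = [start for start, _ in hap]
    return hap[bisect.bisect_right(starts, pos) - 1][1]

def recombine(hap_a, hap_b, crossovers):
    """Build a gamete that copies hap_a, switching parent at each crossover.

    A haplotype is a list of (start, origin) segments sorted by start,
    with the first segment starting at 0.0 (hypothetical representation).
    """
    bounds = [0.0] + sorted(crossovers)
    parents = (hap_a, hap_b)
    gamete = []
    for k, lo in enumerate(bounds):
        hi = bounds[k + 1] if k + 1 < len(bounds) else float("inf")
        src = parents[k % 2]
        starts = [start for start, _ in src]
        i = bisect.bisect_right(starts, lo) - 1  # segment covering lo
        piece = [(lo, src[i][1])] + [seg for seg in src[i + 1:] if seg[0] < hi]
        for seg in piece:
            # Merge neighbouring segments that share the same origin.
            if not gamete or gamete[-1][1] != seg[1]:
                gamete.append(seg)
    return gamete

# Founder haplotypes are single segments; only origins are dropped down.
gamete1 = recombine([(0.0, "F1")], [(0.0, "F2")], [0.3, 0.7])
gamete2 = recombine(gamete1, [(0.0, "F3")], [0.5])

# Allele states are resolved once, at the end, from founder sequence data.
marker_pos = [0.1, 0.4, 0.6, 0.9]
founder_alleles = {"F1": [0, 1, 1, 0], "F2": [1, 1, 0, 0], "F3": [1, 0, 1, 1]}
alleles1 = [founder_alleles[origin_at(gamete1, pos)][j]
            for j, pos in enumerate(marker_pos)]
```

In this toy example the work per meiosis depends on the number of segments rather than on the number of markers, which is the source of the efficiency the paragraph describes.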
Several methods contributing to the use of genomic data for prediction and genome-wide association studies (GWAS) were proposed and investigated. Two
methods were proposed to improve the computational efficiency of Bayesian multiple-regression analyses. First, we showed how Gibbs samplers without the
use of the Metropolis-Hastings (MH) algorithm can be used for the BayesB method, where the prior for each marker
effect follows a mixture distribution with a point mass at zero with probability pi and a univariate-t distribution with probability
1-pi. We showed that by introducing an indicator variable in BayesB, indicating whether the marker effect for a locus is zero or non-zero, the marker effect
and locus-specific variance can be sampled using Gibbs. We considered three different versions
of the Gibbs sampler to sample each marker effect, locus-specific variance and
its indicator variable. Computational efficiencies, defined as the number of effective samples per second of computing time,
were compared with simulated data. Among the Gibbs samplers that were considered, the most efficient sampler was about 2.1 times as efficient as the MH
algorithm proposed by Meuwissen et al. and 1.7 times as efficient as that proposed
by Habier et al. Second, we proposed a strategy to parallelize
Gibbs sampling for each marker within each step of the
MCMC chain. This parallelization is accomplished by using an orthogonal data augmentation
strategy, where the marker covariate matrix is augmented by adding p new rows, where p is the number of markers,
such that its columns are orthogonal. The use of this strategy is expected to increase the speed of
Gibbs sampling with lower memory requirements. The parallel Gibbs sampling
approach using an augmented marker covariate matrix was shown for BayesC methods, where the prior for each marker
effect follows a mixture distribution with a point mass at zero and a univariate normal distribution. The full conditional distributions that are
needed for BayesC with orthogonal data augmentation (BayesC-ODA) were derived and the convergence
of BayesC-ODA was studied. In analyses of the simulated data, BayesC-ODA provided virtually
the same predictions of breeding values as BayesC when the chain length was about 20,000 to 80,000,
which is similar to the commonly used chain length of 50,000.
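
The orthogonal data augmentation step described above can be sketched as follows. This is an illustration under stated assumptions, not the thesis implementation: the p added rows are taken as the upper Cholesky factor of d*I - X'X, and the constant d is chosen with a simple row-sum (Gershgorin) bound plus a small `eps`. All function names are hypothetical. The block only builds the augmented matrix; the sampler itself is not shown.

```python
import math

def crossprod(A):
    """Return A'A for a matrix stored as a list of rows."""
    p = len(A[0])
    return [[sum(r[j] * r[k] for r in A) for k in range(p)] for j in range(p)]

def cholesky_upper(a):
    """Return upper-triangular U with U'U = a (a must be positive definite)."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = math.sqrt(s) if i == j else s / low[j][j]
    return [[low[j][i] for j in range(n)] for i in range(n)]

def oda_augment(X, eps=1e-3):
    """Append p rows to X so that the augmented columns are orthogonal.

    d bounds the largest eigenvalue of X'X from above (assumed choice),
    so d*I - X'X is positive definite and has a Cholesky factor.
    """
    p = len(X[0])
    xtx = crossprod(X)
    d = max(sum(abs(v) for v in row) for row in xtx) + eps
    target = [[(d if j == k else 0.0) - xtx[j][k] for k in range(p)]
              for j in range(p)]
    return X + cholesky_upper(target), d

# Toy marker covariate matrix: 4 individuals, 3 markers.
X = [[0.0, 1.0, 2.0], [1.0, 1.0, 0.0], [2.0, 0.0, 1.0], [1.0, 2.0, 2.0]]
W, d = oda_augment(X)
wtw = crossprod(W)  # expected to be (numerically) d times the identity
```

Because the cross-product of the augmented matrix is diagonal, the full conditional of one marker effect no longer involves the other marker effects, which is what allows the markers to be sampled in parallel within a step of the chain.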
Two methods were proposed or investigated to improve prediction accuracy of Bayesian multiple-
regression analyses. First, we proposed a flexible variable selection model for multiple-trait analyses
with BayesCpi or BayesB priors. This model was compared to single-trait methods and a previously proposed
multi-trait model using real and simulated data. Flexible variable selection showed an advantage when data were from two simulated traits, where a locus had an effect only on one of the traits. Second, we
compared alternative approaches to single-trait genomic prediction using genotyped and non-genotyped Hanwoo
beef cattle. In those data analyses, the single-step methods, which take advantage of all pedigree,
phenotypic and genomic information simultaneously, gave similar or higher prediction accuracies compared to
methods using only genotyped or non-genotyped individuals. Alternative priors allowed single-step Bayesian regression methods (SSBR) to outperform
single-step genomic best linear unbiased prediction (SSGBLUP) in some cases.
One method contributing to the validation of the performance of whole-genome analyses was proposed. In leave-one-out cross validation (LOOCV), each individual in turn is omitted from training and then used for validation.
Efficient LOOCV strategies were proposed for genomic best linear unbiased prediction (GBLUP) in scenarios when n > p or n < p, where
n is the number of observations and p is the number of markers. These strategies were compared
to naive application of LOOCV with simulated data. In these data analyses, efficient LOOCV, requiring little
more effort than a single analysis, was much faster than the naive LOOCV.
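
The idea that LOOCV can cost little more than a single analysis can be illustrated with a well-known identity for ridge-type regression, which is equivalent to GBLUP under a marker-effects model: the leave-one-out prediction error equals the full-data residual divided by 1 - h_ii, where h_ii is the leverage of individual i. The sketch below is a minimal illustration, not the strategies developed in the thesis. It assumes a model without fixed effects, a known ridge parameter `lam`, and hypothetical function names.

```python
import random

def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[piv] = m[piv], m[c]
        for r in range(n):
            if r != c:
                f = m[r][c] / m[c][c]
                for k in range(c, n + 1):
                    m[r][k] -= f * m[c][k]
    return [m[i][n] / m[i][i] for i in range(n)]

def ridge_fit(X, y, lam):
    """Return (X'X + lam*I) and the marker effects solving the ridge equations."""
    p = len(X[0])
    lhs = [[sum(r[j] * r[k] for r in X) + (lam if j == k else 0.0)
            for k in range(p)] for j in range(p)]
    rhs = [sum(r[j] * yi for r, yi in zip(X, y)) for j in range(p)]
    return lhs, solve(lhs, rhs)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def loocv_naive(X, y, lam):
    """Refit the model n times, each time leaving one individual out."""
    errors = []
    for i in range(len(y)):
        _, b = ridge_fit(X[:i] + X[i + 1:], y[:i] + y[i + 1:], lam)
        errors.append(y[i] - dot(X[i], b))
    return errors

def loocv_efficient(X, y, lam):
    """Fit once, then rescale each residual by 1 / (1 - h_ii)."""
    lhs, b = ridge_fit(X, y, lam)
    errors = []
    for xi, yi in zip(X, y):
        # Leverage h_ii = x_i' (X'X + lam*I)^{-1} x_i. A real implementation
        # would factor lhs once instead of re-solving for every individual.
        h = dot(xi, solve(lhs, xi))
        errors.append((yi - dot(xi, b)) / (1.0 - h))
    return errors

random.seed(1)
X = [[float(random.choice([0, 1, 2])) for _ in range(3)] for _ in range(12)]
y = [random.gauss(0.0, 1.0) for _ in range(12)]
naive = loocv_naive(X, y, lam=1.0)
fast = loocv_efficient(X, y, lam=1.0)
```

This version works with the p x p marker equations, so it corresponds to the n > p setting; the n < p case would typically use an n x n formulation instead.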