Contributions to improve the accuracy and computational efficiency of genomic prediction

Cheng, Hao
Major Professor
Rohan L. Fernando
Animal Science

The discovery of genome-wide high-density molecular markers (e.g., single-nucleotide polymorphisms, SNPs) has revolutionized genetic analyses in human medicine and in animal and plant breeding. There are several active areas of research and development in whole-genome analyses, including 1) collection or simulation of genomic data, 2) use of genomic data for prediction or genome-wide association studies, and 3) validation of the performance of these analyses. In this thesis, several statistical models and computational algorithms were proposed and investigated, contributing to these three areas of research and development.

A contribution to the first area is a simulation strategy that drops down the origins and positions of chromosomal segments, rather than every allele state, to efficiently simulate sequence data and complex pedigree structures across multiple generations. A software tool called XSim, which incorporates this efficient strategy, was developed with implementations in C++ and Julia. XSim allows the genomes of founders to be characterized by real genome sequence data and supports complex pedigree structures among descendants.
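
The segment-based idea can be illustrated with a minimal Python sketch. It is not XSim's code or API: the function names, the representation of a haplotype as sorted (start, founder origin) pairs, and the crossover model (one expected crossover per Morgan, no interference) are all assumptions made for illustration. Only segment origins and breakpoints are passed down the pedigree; allele states are looked up in the founder haplotypes when they are needed.

```python
import bisect
import random

def pieces(hap, lo, hi):
    """Slice a haplotype, stored as sorted (start, founder_origin) pairs, to [lo, hi)."""
    starts = [s for s, _ in hap]
    i = bisect.bisect_right(starts, lo) - 1          # segment covering position lo
    out = [(lo, hap[i][1])]
    out.extend(seg for seg in hap[i + 1:] if seg[0] < hi)
    return out

def meiosis(hap_a, hap_b, length, rng):
    """Build a gamete by copying segments, switching parental haplotype at crossovers.

    Assumed crossover model: exponential gaps with mean 1 Morgan, no interference.
    """
    current, other = (hap_a, hap_b) if rng.random() < 0.5 else (hap_b, hap_a)
    gamete, lo = [], 0.0
    while lo < length:
        hi = min(lo + rng.expovariate(1.0), length)  # position of the next crossover
        for seg in pieces(current, lo, hi):
            if gamete and gamete[-1][1] == seg[1]:
                continue                             # same origin: extend previous segment
            gamete.append(seg)
        current, other = other, current
        lo = hi
    return gamete

def origin_at(hap, pos):
    """Founder origin of the segment that covers map position pos."""
    starts = [s for s, _ in hap]
    return hap[bisect.bisect_right(starts, pos) - 1][1]

rng = random.Random(1)
length = 1.0                                         # chromosome length in Morgans
sire = ([(0.0, 0)], [(0.0, 1)])                      # founder haplotypes 0 and 1
dam = ([(0.0, 2)], [(0.0, 3)])                       # founder haplotypes 2 and 3
child = (meiosis(*sire, length, rng), meiosis(*dam, length, rng))
gamete = meiosis(*child, length, rng)                # passed on to a grandchild

# Allele states are resolved only at the end, from hypothetical founder data.
positions = [0.1, 0.5, 0.9]
founder_alleles = {0: [0, 0, 0], 1: [1, 1, 1], 2: [0, 1, 0], 3: [1, 0, 1]}
alleles = [founder_alleles[origin_at(gamete, pos)][k] for k, pos in enumerate(positions)]
print(gamete, alleles)
```

The cost of each meiosis in this sketch depends on the number of segments, not on the number of markers, which is why tracking segments can be cheaper than dropping every allele when marker density is high.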

Several methods contributing to the use of genomic data for prediction and genome-wide association studies (GWAS) were proposed and investigated. Two methods were proposed to improve the computational efficiency of Bayesian multiple-regression analyses. First, we showed how Gibbs samplers can be used for the BayesB method without the Metropolis-Hastings (MH) algorithm. In BayesB, the prior for each marker effect follows a mixture distribution with a point mass at zero with probability π and a univariate-t distribution with probability 1-π. We showed that by introducing an indicator variable in BayesB, indicating whether the marker effect for a locus is zero or non-zero, the marker effect and locus-specific variance can be sampled with a Gibbs sampler. We considered three different versions of the Gibbs sampler to sample each marker effect, its locus-specific variance and its indicator variable. Computational efficiencies, defined as the number of effective samples per second of computing time, were compared using simulated data. Among the Gibbs samplers considered, the most efficient was about 2.1 times as efficient as the MH algorithm proposed by Meuwissen et al. and 1.7 times as efficient as that proposed by Habier et al.

Second, we proposed a strategy to parallelize Gibbs sampling across markers within each step of the MCMC chain. This parallelization is accomplished by using an orthogonal data augmentation (ODA) strategy, where the marker covariate matrix is augmented by adding p new rows, where p is the number of markers, such that its columns are orthogonal. This strategy is expected to speed up Gibbs sampling while lowering memory requirements. The parallel Gibbs sampling approach using an augmented marker covariate matrix was shown for BayesC methods, where the prior for each marker effect follows a mixture distribution with a point mass at zero and a univariate normal distribution. The full conditional distributions needed for BayesC with orthogonal data augmentation (BayesC-ODA) were derived, and the convergence of BayesC-ODA was studied. In analyses of the simulated data, BayesC-ODA gave predictions of breeding values virtually identical to those from BayesC when the chain length was about 20,000 to 80,000, which is similar to the commonly used chain length of 50,000.
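
As an illustration of the augmentation step only, the following Python sketch builds p extra rows for a small covariate matrix so that the columns of the stacked matrix become orthogonal. It is a sketch under stated assumptions, not the thesis implementation: the helper names are hypothetical, the constant d is set to trace(X'X) + 1 (any value above the largest eigenvalue of X'X would do), and the extra rows are taken from a Cholesky factor of dI - X'X.

```python
import math

def gram(X):
    """Return X'X for a matrix stored as a list of rows."""
    p = len(X[0])
    return [[sum(row[j] * row[k] for row in X) for k in range(p)] for j in range(p)]

def cholesky_lower(A):
    """Lower-triangular L with L L' = A, for a symmetric positive definite A."""
    p = len(A)
    L = [[0.0] * p for _ in range(p)]
    for j in range(p):
        for k in range(j + 1):
            s = sum(L[j][m] * L[k][m] for m in range(k))
            if j == k:
                L[j][j] = math.sqrt(A[j][j] - s)
            else:
                L[j][k] = (A[j][k] - s) / L[k][k]
    return L

def oda_rows(X):
    """Return (X_aug, d) with X'X + X_aug'X_aug = d I.

    Assumption: d = trace(X'X) + 1, which exceeds the largest eigenvalue
    of X'X and therefore keeps d I - X'X positive definite.
    """
    XtX = gram(X)
    p = len(XtX)
    d = sum(XtX[j][j] for j in range(p)) + 1.0
    A = [[(d if j == k else 0.0) - XtX[j][k] for k in range(p)] for j in range(p)]
    L = cholesky_lower(A)
    X_aug = [[L[k][j] for k in range(p)] for j in range(p)]  # X_aug = L'
    return X_aug, d

# Hypothetical genotype covariates: 4 individuals, 3 markers coded 0/1/2.
X = [[0.0, 1.0, 2.0],
     [1.0, 1.0, 0.0],
     [2.0, 0.0, 1.0],
     [1.0, 2.0, 1.0]]
X_aug, d = oda_rows(X)
W = X + X_aug                 # stacked (n + p) x p matrix
G = gram(W)                   # diagonal with d on the diagonal, up to rounding
print(d, G)
```

Because the columns of the stacked matrix are orthogonal, the cross-products between markers vanish, which is what allows the marker effects to be updated independently of one another within an iteration. In the usual ODA formulation, the responses for the added rows are unobserved and are sampled as missing data; that step is not shown here.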

Two methods were proposed or investigated to improve the prediction accuracy of Bayesian multiple-regression analyses. First, we proposed a flexible variable selection model for multiple-trait analyses with BayesCπ or BayesB priors. This model was compared to single-trait methods and a previously proposed multi-trait model using real and simulated data. Flexible variable selection showed an advantage when data were from two simulated traits, where a locus had an effect on only one of the traits. Second, we compared alternative approaches to single-trait genomic prediction using genotyped and non-genotyped Hanwoo beef cattle. In those data analyses, the single-step methods, which take advantage of all pedigree, phenotypic and genomic information simultaneously, gave similar or higher prediction accuracies compared to methods using only genotyped or non-genotyped individuals. Alternative priors allowed single-step Bayesian regression (SSBR) methods to outperform single-step genomic best linear unbiased prediction (SSGBLUP) in some cases.

One method contributing to the validation of the performance of whole-genome analyses was proposed. In leave-one-out cross validation (LOOCV), each individual in turn is omitted from training and then used for validation. Efficient LOOCV strategies were proposed for genomic best linear unbiased prediction (GBLUP) in scenarios when n>p or n<p, where n is the number of observations and p is the number of markers. These strategies were compared to naive application of LOOCV with simulated data. In these data analyses, efficient LOOCV, requiring little more effort than a single analysis, was much faster than the naive LOOCV.
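
The reason LOOCV can cost little more than a single analysis can be seen from a standard identity for ridge-type predictors: the leave-one-out prediction error for individual i equals the full-data residual divided by 1 - h_ii, where h_ii is the i-th diagonal element of the hat matrix. The Python sketch below checks this identity against naive refitting on a toy data set. It makes several assumptions that are not taken from the thesis: a marker-effect (ridge regression) form of the model with no intercept, a fixed shrinkage parameter `lam`, made-up data, and hypothetical function names. It covers only the case where working with the p x p system is convenient.

```python
def transpose(A):
    return [list(col) for col in zip(*A)]

def inverse(A):
    """Gauss-Jordan inverse with partial pivoting (adequate for small matrices)."""
    n = len(A)
    M = [[float(v) for v in row] + [float(i == j) for j in range(n)]
         for i, row in enumerate(A)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        pivot = M[c][c]
        M[c] = [v / pivot for v in M[c]]
        for r in range(n):
            if r != c:
                f = M[r][c]
                M[r] = [vr - f * vc for vr, vc in zip(M[r], M[c])]
    return [row[n:] for row in M]

def ridge(X, y, lam):
    """Return (beta, A_inv) with A = X'X + lam I and beta = A_inv X'y."""
    Xt = transpose(X)
    p = len(Xt)
    A = [[sum(a * b for a, b in zip(Xt[j], Xt[k])) + (lam if j == k else 0.0)
          for k in range(p)] for j in range(p)]
    A_inv = inverse(A)
    Xty = [sum(x * yi for x, yi in zip(col, y)) for col in Xt]
    beta = [sum(a * b for a, b in zip(row, Xty)) for row in A_inv]
    return beta, A_inv

def loocv_naive(X, y, lam):
    """Refit n times, each time leaving one record out."""
    errors = []
    for i in range(len(y)):
        beta, _ = ridge(X[:i] + X[i + 1:], y[:i] + y[i + 1:], lam)
        errors.append(y[i] - sum(x * b for x, b in zip(X[i], beta)))
    return errors

def loocv_efficient(X, y, lam):
    """Fit once, then rescale each residual by 1 / (1 - h_ii)."""
    beta, A_inv = ridge(X, y, lam)
    errors = []
    for xi, yi in zip(X, y):
        Ax = [sum(a * x for a, x in zip(row, xi)) for row in A_inv]
        h_ii = sum(x * v for x, v in zip(xi, Ax))        # x_i' A_inv x_i
        residual = yi - sum(x * b for x, b in zip(xi, beta))
        errors.append(residual / (1.0 - h_ii))
    return errors

# Made-up data: 5 records, 2 marker covariates.
X = [[-1.0, 0.0], [0.0, 1.0], [1.0, -1.0], [1.0, 1.0], [-1.0, 0.0]]
y = [1.2, 0.7, -0.3, 1.9, 0.4]
naive = loocv_naive(X, y, lam=2.0)
fast = loocv_efficient(X, y, lam=2.0)
print(naive, fast)
```

The naive version solves n separate systems, while the efficient version solves one and reuses its inverse, which matches the qualitative behaviour described above.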

Sun Jan 01 00:00:00 UTC 2017