Contributions to improve the accuracy and computational efficiency of genomic prediction

Cheng, Hao

dc.contributor.advisor	Rohan L. Fernando
dc.contributor.author	Cheng, Hao
dc.contributor.department	Department of Animal Science
dc.date	2019-11-04T21:11:18.000
dc.date.accessioned	2020-06-30T03:08:21Z
dc.date.available	2020-06-30T03:08:21Z
dc.date.copyright	Sun Jan 01 00:00:00 UTC 2017
dc.date.embargo	2017-11-19
dc.date.issued	2017-01-01
dc.description.abstract	<p>The discovery of genome-wide high-density molecular markers (e.g., single-nucleotide polymorphisms, SNPs) has revolutionized genetic analyses in human medicine and in animal and plant breeding. There are several active areas of research and development in whole-genome analyses, including 1) collection or simulation of genomic data, 2) use of genomic data for prediction or genome-wide association studies, and 3) validation of the performance of these analyses. In this thesis, several statistical models and computational algorithms were proposed and investigated, contributing to these three areas of research and development.</p>
<p>A contribution to the first area is a simulation strategy that drops down origins and positions of chromosomal segments rather than every allele state to efficiently simulate sequence data and complex pedigree structures across multiple generations. A software tool called XSim, which incorporates the efficient strategy, was developed with implementations in C++ and Julia. XSim allows the genomes of founders to be characterized by real genome sequence data and supports complex pedigree structures among descendants.</p>
<p>Several methods contributing to the use of genomic data for prediction and genome-wide association studies (GWAS) were proposed and investigated. Two methods were proposed to improve the computational efficiency of Bayesian multiple-regression analyses. First, we showed how Gibbs samplers without the use of the Metropolis-Hastings (MH) algorithm can be used for the BayesB method, where the prior for each marker effect follows a mixture distribution with a point mass at zero with probability pi and a univariate-t distribution with probability 1-pi. We showed that by introducing an indicator variable in BayesB, indicating whether the marker effect for a locus is zero or non-zero, the marker effect and locus-specific variance can be sampled using Gibbs. We considered three different versions of the Gibbs sampler to sample each marker effect, locus-specific variance and its indicator variable. Computational efficiencies, defined as the number of effective samples per second of computing time, were compared using simulated data. Among the Gibbs samplers that were considered, the most efficient sampler is about 2.1 times as efficient as the MH algorithm proposed by Meuwissen et al. and 1.7 times as efficient as that proposed by Habier et al. Second, we proposed a strategy to parallelize Gibbs sampling for each marker within each step of the MCMC chain. This parallelization is accomplished by using an orthogonal data augmentation strategy, where the marker covariate matrix is augmented by adding p new rows, where p is the number of markers, such that its columns are orthogonal. The use of this strategy is expected to increase the speed of Gibbs sampling with lower memory requirements. The parallel Gibbs sampling approach using an augmented marker covariate matrix was shown for BayesC methods, where the prior for each marker effect follows a mixture distribution with a point mass at zero and a univariate normal distribution. The full conditional distributions that are needed for BayesC with orthogonal data augmentation (BayesC-ODA) were derived and the convergence of BayesC-ODA was studied. In analyses of the simulated data, BayesC-ODA provided virtually the same predictions of breeding values as BayesC when the chain length was about 20,000 to 80,000, which is similar to the commonly used chain length of 50,000.</p>
<p>Two methods were proposed or investigated to improve prediction accuracy of Bayesian multiple-regression analyses. First, we proposed a flexible variable selection model for multiple-trait analyses with BayesCpi or BayesB priors. This model was compared to single-trait methods and a previously proposed multi-trait model using real and simulated data. Flexible variable selection showed an advantage when data were from two simulated traits, where a locus had an effect on only one of the traits. Second, we compared alternative approaches to single-trait genomic prediction using genotyped and non-genotyped Hanwoo beef cattle. In those data analyses, the single-step methods, which take advantage of all pedigree, phenotypic and genomic information simultaneously, gave similar or higher prediction accuracies compared to methods using only genotyped or non-genotyped individuals. Alternative priors allowed single-step Bayesian regression methods (SSBR) to outperform single-step genomic best linear unbiased prediction (SSGBLUP) in some cases.</p>
<p>One method contributing to the validation of the performance of whole-genome analyses was proposed. In leave-one-out cross validation (LOOCV), each individual in turn is omitted from training and then used for validation. Efficient LOOCV strategies were proposed for genomic best linear unbiased prediction (GBLUP) in scenarios when n > p or n < p, where n is the number of observations and p is the number of markers. These strategies were compared to naive application of LOOCV with simulated data. In these data analyses, efficient LOOCV, requiring little more effort than a single analysis, was much faster than the naive LOOCV.</p>
dc.format.mimetype	application/pdf
dc.identifier	archive/lib.dr.iastate.edu/etd/16050/
dc.identifier.articleid	7057
dc.identifier.contextkey	11337878
dc.identifier.s3bucket	isulib-bepress-aws-west
dc.identifier.submissionpath	etd/16050
dc.identifier.uri	https://dr.lib.iastate.edu/handle/20.500.12876/30233
dc.language.iso	en
dc.source.bitstream	archive/lib.dr.iastate.edu/etd/16050/Cheng_iastate_0097E_16291.pdf\|\|\|Fri Jan 14 20:54:24 UTC 2022
dc.subject.disciplines	Genetics
dc.subject.disciplines	Statistics and Probability
dc.title	Contributions to improve the accuracy and computational efficiency of genomic prediction
dc.type	dissertation
dc.type.genre	dissertation
dspace.entity.type	Publication
relation.isOrgUnitOfPublication	85ecce08-311a-441b-9c4d-ee2a3569506f
thesis.degree.discipline	Genetics; Statistics
thesis.degree.level	dissertation
thesis.degree.name	Doctor of Philosophy
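
The abstract's last point, that LOOCV can cost little more than a single analysis, can be illustrated with a standard identity for ridge regression, to which a marker-effects form of GBLUP is equivalent when the variance ratio is treated as known. The sketch below is not taken from the thesis: the function names, the simulated data and the fixed penalty lam are all assumptions made for illustration, there is no intercept or other fixed effect, and only the marker-space computation (a p x p system, the natural choice when n > p) is shown. It compares refitting n times with the shortcut e_i / (1 - h_ii), where e_i is the residual from the single full fit and h_ii is the i-th diagonal element of the hat matrix.

```python
import random

def invert(M):
    """Gauss-Jordan inverse with partial pivoting (small dense matrices)."""
    n = len(M)
    A = [row[:] + [1.0 if i == j else 0.0 for j in range(n)]
         for i, row in enumerate(M)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(A[r][c]))
        A[c], A[piv] = A[piv], A[c]
        d = A[c][c]
        A[c] = [v / d for v in A[c]]
        for r in range(n):
            if r != c:
                f = A[r][c]
                A[r] = [vr - f * vc for vr, vc in zip(A[r], A[c])]
    return [row[n:] for row in A]

def ridge_fit(X, y, lam):
    """Return (beta, Cinv) with C = X'X + lam*I; no intercept is fitted."""
    p = len(X[0])
    C = [[sum(r[j] * r[k] for r in X) + (lam if j == k else 0.0)
          for k in range(p)] for j in range(p)]
    Cinv = invert(C)
    Xty = [sum(r[j] * yi for r, yi in zip(X, y)) for j in range(p)]
    beta = [sum(Cinv[j][k] * Xty[k] for k in range(p)) for j in range(p)]
    return beta, Cinv

def loo_naive(X, y, lam):
    """Refit n times, each time leaving one record out of training."""
    errs = []
    for i in range(len(y)):
        beta, _ = ridge_fit(X[:i] + X[i + 1:], y[:i] + y[i + 1:], lam)
        errs.append(y[i] - sum(b * x for b, x in zip(beta, X[i])))
    return errs

def loo_fast(X, y, lam):
    """One fit; LOO prediction error is e_i / (1 - h_ii)."""
    beta, Cinv = ridge_fit(X, y, lam)
    p = len(beta)
    errs = []
    for xi, yi in zip(X, y):
        resid = yi - sum(b * x for b, x in zip(beta, xi))
        # h_ii = x_i' (X'X + lam*I)^{-1} x_i
        h = sum(xi[j] * Cinv[j][k] * xi[k]
                for j in range(p) for k in range(p))
        errs.append(resid / (1.0 - h))
    return errs

# Hypothetical toy data: 12 records, 4 markers coded 0/1/2.
random.seed(1)
n, p, lam = 12, 4, 2.0
X = [[random.choice([0.0, 1.0, 2.0]) for _ in range(p)] for _ in range(n)]
y = [random.gauss(0, 1) for _ in range(n)]

naive = loo_naive(X, y, lam)
fast = loo_fast(X, y, lam)
max_diff = max(abs(a - b) for a, b in zip(naive, fast))
print(max_diff)
```

The two sets of prediction errors agree up to floating-point rounding, while the shortcut needs a single p x p inverse instead of n of them. When n < p, an equivalent computation on the n x n side would presumably be preferred; that variant is not shown here.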

File

Original bundle

Name: Cheng_iastate_0097E_16291.pdf
Size: 1.84 MB
Format: Adobe Portable Document Format

Collections

Theses and Dissertations