Contributions to improve the accuracy and computational efficiency of genomic prediction

dc.contributor.advisor Rohan L. Fernando
dc.contributor.author Cheng, Hao
dc.contributor.department Animal Science
dc.date 2019-11-04T21:11:18.000
dc.date.accessioned 2020-06-30T03:08:21Z
dc.date.available 2020-06-30T03:08:21Z
dc.date.copyright Sun Jan 01 00:00:00 UTC 2017
dc.date.embargo 2017-11-19
dc.date.issued 2017-01-01
dc.description.abstract <p>The discovery of genome-wide high-density molecular markers (e.g., single-nucleotide polymorphisms, SNPs) has revolutionized genetic analyses in human medicine and in animal and plant breeding. There are several active areas of research and development in whole-genome analyses, including 1) collection or simulation of genomic data, 2) use of genomic data for prediction or genome-wide association studies, and 3) validation of the performance of these analyses. In this thesis, several statistical models and computational algorithms were proposed and investigated, contributing to these three areas of research and development.</p>
<p>A contribution to the first area is a simulation strategy that drops down origins and positions of chromosomal segments, rather than every allele state, to efficiently simulate sequence data and complex pedigree structures across multiple generations. A software tool called XSim, which incorporates this efficient strategy, was developed with implementations in C++ and Julia. XSim allows the genomes of founders to be characterized by real genome sequence data and supports complex pedigree structures among descendants.</p>
<p>Several methods contributing to the use of genomic data for prediction and genome-wide association studies (GWAS) were proposed and investigated. Two methods were proposed to improve the computational efficiency of Bayesian multiple-regression analyses. First, we showed how Gibbs samplers without the use of the Metropolis-Hastings (MH) algorithm can be used for the BayesB method, where the prior for each marker effect follows a mixture distribution with a point mass at zero with probability pi and a univariate-t distribution with probability 1 - pi. We showed that by introducing an indicator variable in BayesB, indicating whether the marker effect for a locus is zero or non-zero, the marker effect and locus-specific variance can be sampled with a Gibbs sampler. We considered three different versions of the Gibbs sampler to sample each marker effect, locus-specific variance and its indicator variable. Computational efficiency, defined as the number of effective samples per second of computing time, was compared using simulated data. Among the Gibbs samplers that were considered, the most efficient sampler is about 2.1 times as efficient as the MH algorithm proposed by Meuwissen et al. and 1.7 times as efficient as that proposed by Habier et al. Second, we proposed a strategy to parallelize Gibbs sampling for each marker within each step of the MCMC chain. This parallelization is accomplished by using an orthogonal data augmentation strategy, where the marker covariate matrix is augmented by adding p new rows, where p is the number of markers, such that its columns are orthogonal. The use of this strategy is expected to increase the speed of Gibbs sampling with lower memory requirements. The parallel Gibbs sampling approach using an augmented marker covariate matrix was shown for BayesC methods, where the prior for each marker effect follows a mixture distribution with a point mass at zero and a univariate normal distribution. The full conditional distributions that are needed for BayesC with orthogonal data augmentation (BayesC-ODA) were derived and the convergence of BayesC-ODA was studied. In analyses of the simulated data, BayesC-ODA provided predictions of breeding values virtually identical to those from BayesC when the chain length was about 20,000 to 80,000, which is similar to the commonly used chain length of 50,000.</p>
<p>Two methods were proposed or investigated to improve the prediction accuracy of Bayesian multiple-regression analyses. First, we proposed a flexible variable selection model for multiple-trait analyses with BayesCpi or BayesB priors. This model was compared to single-trait methods and a previously proposed multi-trait model using real and simulated data. Flexible variable selection showed an advantage when data were from two simulated traits, where a locus had an effect on only one of the traits. Second, we compared alternative approaches to single-trait genomic prediction using genotyped and non-genotyped Hanwoo beef cattle. In those data analyses, the single-step methods, which take advantage of all pedigree, phenotypic and genomic information simultaneously, gave similar or higher prediction accuracies compared to methods using only genotyped or non-genotyped individuals. Alternative priors allowed single-step Bayesian regression (SSBR) methods to outperform single-step genomic best linear unbiased prediction (SSGBLUP) in some cases.</p>
<p>One method contributing to the validation of the performance of whole-genome analyses was proposed. In leave-one-out cross validation (LOOCV), each individual in turn is omitted from training and used for validation. Efficient LOOCV strategies were proposed for genomic best linear unbiased prediction (GBLUP) in scenarios when n &gt; p or n &lt; p, where n is the number of observations and p is the number of markers. These strategies were compared to naive application of LOOCV with simulated data. In these data analyses, efficient LOOCV, requiring little more effort than a single analysis, was much faster than the naive LOOCV.</p>
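The LOOCV claim in the abstract can be illustrated with a minimal sketch. It assumes the standard ridge-regression (marker-effects) form of GBLUP with a known shrinkage parameter, no unpenalized fixed effects, and n > p; the abstract does not give the thesis's own derivation, so this is the textbook hat-matrix identity rather than the author's implementation. The function names (`ridge_fit`, `loocv_naive`, `loocv_fast`), the toy data and the value of `lam` are all hypothetical. Under these assumptions, the leave-one-out prediction error for individual i equals its full-data residual divided by 1 - h_ii, where h_ii is the i-th diagonal element of H = X (X'X + lam I)^{-1} X', so one fit replaces n refits.

```python
def transpose(a):
    return [list(col) for col in zip(*a)]

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def solve(a, b):
    """Solve a @ x = b (b is a matrix) by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    m = [list(ra) + list(rb) for ra, rb in zip(a, b)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[piv] = m[piv], m[c]
        d = m[c][c]
        m[c] = [v / d for v in m[c]]
        for r in range(n):
            if r != c:
                f = m[r][c]
                m[r] = [vr - f * vc for vr, vc in zip(m[r], m[c])]
    return [row[n:] for row in m]

def penalized_lhs(X, lam):
    """Build X'X + lam * I (lam is an assumed, known shrinkage parameter)."""
    lhs = matmul(transpose(X), X)
    for j in range(len(lhs)):
        lhs[j][j] += lam
    return lhs

def ridge_fit(X, y, lam):
    """Marker effects b solving (X'X + lam * I) b = X'y."""
    rhs = matmul(transpose(X), [[v] for v in y])
    return [row[0] for row in solve(penalized_lhs(X, lam), rhs)]

def loocv_naive(X, y, lam):
    """Refit the model n times, each time omitting one individual."""
    errors = []
    for i in range(len(y)):
        b = ridge_fit(X[:i] + X[i + 1:], y[:i] + y[i + 1:], lam)
        errors.append(y[i] - sum(x * bj for x, bj in zip(X[i], b)))
    return errors

def loocv_fast(X, y, lam):
    """One fit on all data; error_i = residual_i / (1 - h_ii)."""
    C = solve(penalized_lhs(X, lam), transpose(X))  # (X'X + lam I)^{-1} X', p x n
    b = [sum(c * v for c, v in zip(row, y)) for row in C]
    errors = []
    for i, xi in enumerate(X):
        h_ii = sum(x * C[j][i] for j, x in enumerate(xi))
        residual = y[i] - sum(x * bj for x, bj in zip(xi, b))
        errors.append(residual / (1.0 - h_ii))
    return errors

# Toy data: 8 individuals, 3 markers coded 0/1/2 (made up for illustration).
X_demo = [[0.0, 1.0, 2.0], [1.0, 1.0, 0.0], [2.0, 0.0, 1.0], [0.0, 2.0, 2.0],
          [1.0, 0.0, 0.0], [2.0, 2.0, 1.0], [0.0, 0.0, 1.0], [1.0, 2.0, 2.0]]
y_demo = [1.2, 0.3, -0.5, 2.1, -0.8, 1.0, -0.1, 1.9]
fast = loocv_fast(X_demo, y_demo, 1.0)
naive = loocv_naive(X_demo, y_demo, 1.0)
print(max(abs(f - s) for f, s in zip(fast, naive)))
```

The two functions should agree up to floating-point rounding. The shortcut needs a single p x p solve, whereas the naive loop needs n of them; when n < p a different formulation would be needed, which this sketch does not cover.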
dc.format.mimetype application/pdf
dc.identifier archive/lib.dr.iastate.edu/etd/16050/
dc.identifier.articleid 7057
dc.identifier.contextkey 11337878
dc.identifier.s3bucket isulib-bepress-aws-west
dc.identifier.submissionpath etd/16050
dc.identifier.uri https://dr.lib.iastate.edu/handle/20.500.12876/30233
dc.language.iso en
dc.source.bitstream archive/lib.dr.iastate.edu/etd/16050/Cheng_iastate_0097E_16291.pdf|||Fri Jan 14 20:54:24 UTC 2022
dc.subject.disciplines Genetics
dc.subject.disciplines Statistics and Probability
dc.title Contributions to improve the accuracy and computational efficiency of genomic prediction
dc.type article
dc.type.genre dissertation
dspace.entity.type Publication
relation.isOrgUnitOfPublication 85ecce08-311a-441b-9c4d-ee2a3569506f
thesis.degree.discipline Genetics; Statistics
thesis.degree.level dissertation
thesis.degree.name Doctor of Philosophy
File
Original bundle
Name: Cheng_iastate_0097E_16291.pdf
Size: 1.84 MB
Format: Adobe Portable Document Format