Evaluation of Parametric and Nonparametric Statistical Methods in Genomic Prediction

Howard, Reka

File

Howard_iastate_0097E_15898.pdf (4.17 MB)

Date

2016-01-01

Authors

Howard, Reka

Advisor

Alicia Carriquiry

William Beavis

Department

Statistics

Abstract

The availability of high-density markers has resulted in increased interest in the use of markers for phenotype prediction in plant breeding. Genomic prediction is a technique that uses the marker and phenotypic information of individuals to build a model that enables plant breeders to predict the phenotypic values of individuals for which only genotypic scores are available. In recent years, a large number of parametric and nonparametric statistical methods have been developed for genomic prediction.

We first review parametric methods, including Least Squares Regression, Ridge Regression, Bayesian Ridge Regression, Least Absolute Shrinkage and Selection Operator (LASSO), Bayesian LASSO, Best Linear Unbiased Prediction (BLUP), Bayes A, Bayes B, Bayes C, and Bayes Cπ, and nonparametric methods, including the Nadaraya-Watson Estimator, Reproducing Kernel Hilbert Space, Support Vector Machine Regression, and Neural Networks. We then contrast the methods on accuracy and mean squared error (MSE) using simulated populations derived from crosses of inbred lines. The simulated genetic architectures are either completely additive or consist of two-way epistatic interactions, and the genetic architecture contributes either a low (0.3) or a high (0.7) proportion of the total simulated phenotypic variability.
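As a rough illustration of this kind of comparison, the following Python sketch contrasts one parametric method (ridge regression) with one nonparametric method (a Nadaraya-Watson estimator with a Gaussian kernel) on a simulated additive trait. It is not the simulation design used in the dissertation: the population sizes, the -1/1 marker coding, the penalty `lam`, the `bandwidth`, and the definition of accuracy as the Pearson correlation between predicted and observed phenotypes are all assumptions made for the example.

```python
import numpy as np


def simulate_additive(n, p, n_qtl, h2, rng):
    """Simulate -1/1 marker scores and a purely additive phenotype.

    All sizes and effect distributions are illustrative assumptions.
    """
    X = rng.choice([-1.0, 1.0], size=(n, p))
    qtl = rng.choice(p, size=n_qtl, replace=False)
    g = X[:, qtl] @ rng.normal(size=n_qtl)
    # Residual variance chosen so that var(g) / (var(g) + var(e)) is about h2.
    var_e = g.var() * (1.0 - h2) / h2
    y = g + rng.normal(scale=np.sqrt(var_e), size=n)
    return X, y


def ridge_predict(X_tr, y_tr, X_te, lam):
    """Parametric: ridge regression on marker scores with penalty lam."""
    mu = y_tr.mean()
    p = X_tr.shape[1]
    beta = np.linalg.solve(X_tr.T @ X_tr + lam * np.eye(p), X_tr.T @ (y_tr - mu))
    return mu + X_te @ beta


def nadaraya_watson_predict(X_tr, y_tr, X_te, bandwidth):
    """Nonparametric: Gaussian-kernel weighted average of training phenotypes."""
    d2 = ((X_te[:, None, :] - X_tr[None, :, :]) ** 2).sum(axis=2)
    # Shifting each row by its minimum leaves the normalized weights unchanged
    # and avoids all weights underflowing to zero.
    d2 = d2 - d2.min(axis=1, keepdims=True)
    w = np.exp(-d2 / (2.0 * bandwidth**2))
    return (w @ y_tr) / w.sum(axis=1)


def accuracy_and_mse(y_obs, y_pred):
    """Accuracy (Pearson correlation, an assumed definition) and MSE."""
    acc = np.corrcoef(y_obs, y_pred)[0, 1]
    mse = np.mean((y_obs - y_pred) ** 2)
    return acc, mse


rng = np.random.default_rng(0)
X, y = simulate_additive(n=200, p=100, n_qtl=10, h2=0.7, rng=rng)
X_tr, y_tr, X_te, y_te = X[:150], y[:150], X[150:], y[150:]

results = {
    "ridge": accuracy_and_mse(y_te, ridge_predict(X_tr, y_tr, X_te, lam=10.0)),
    "nadaraya_watson": accuracy_and_mse(
        y_te, nadaraya_watson_predict(X_tr, y_tr, X_te, bandwidth=5.0)
    ),
}
for name, (acc, mse) in results.items():
    print(f"{name}: accuracy={acc:.3f}, MSE={mse:.3f}")
```

Neither tuning parameter is optimized here; in practice both would typically be chosen by cross-validation, and the comparison would be repeated over many simulated data sets and for epistatic as well as additive architectures.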

Based on these preliminary results, we introduce Response Surface Methodology (RSM) as a systematic and efficient strategy for investigating genomic prediction methods across a wide range of design variables. We illustrate RSM with a simulated example in which the response we optimize is the difference between the prediction accuracies of a parametric method and a nonparametric method. We examine which combinations of the number of individuals, the number of markers, the number of quantitative trait loci (QTL), the percentage of epistasis, and the heritability maximize the estimated difference in accuracies. We found that the greatest impact on estimates of accuracy and MSE was due to the genetic architecture of the population and the heritability of the trait. When epistasis and heritability are highest, the advantage of using a nonparametric method over a parametric prediction method is greatest.
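The core computation in RSM can be sketched as fitting a second-order polynomial to responses observed at designed factor settings and then locating its stationary point. The Python sketch below does this for two coded factors only; reducing the problem to two factors, the function names, and the idea of labeling the factors as heritability and percentage of epistasis are assumptions for illustration, not the design used in the dissertation.

```python
import numpy as np


def quadratic_design_matrix(x1, x2):
    """Columns for the model b0 + b1*x1 + b2*x2 + b11*x1^2 + b22*x2^2 + b12*x1*x2."""
    return np.column_stack([np.ones_like(x1), x1, x2, x1**2, x2**2, x1 * x2])


def fit_response_surface(x1, x2, y):
    """Least-squares fit of the second-order model; returns the six coefficients."""
    A = quadratic_design_matrix(x1, x2)
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef


def stationary_point(coef):
    """Stationary point of the fitted surface and eigenvalues of its quadratic part.

    Writing the model as y = b0 + x'b + x'Bx, the gradient b + 2Bx vanishes at
    x_s = -0.5 * B^{-1} b. All eigenvalues negative indicates a maximum.
    """
    _, b1, b2, b11, b22, b12 = coef
    B = np.array([[b11, b12 / 2.0], [b12 / 2.0, b22]])
    b = np.array([b1, b2])
    x_s = -0.5 * np.linalg.solve(B, b)
    return x_s, np.linalg.eigvalsh(B)


# Hypothetical 3x3 factorial design in coded units (-1, 0, 1) for two factors.
levels = np.array([-1.0, 0.0, 1.0])
x1, x2 = [a.ravel() for a in np.meshgrid(levels, levels)]
# Made-up noiseless response with a known maximum at (0.3, -0.2).
y = 5.0 - (x1 - 0.3) ** 2 - 2.0 * (x2 + 0.2) ** 2

coef = fit_response_surface(x1, x2, y)
x_s, eigvals = stationary_point(coef)
print("stationary point:", x_s)
print("eigenvalues:", eigvals)
```

In an application like the one described, each response value would be an estimated difference in accuracy obtained from simulation at that design point, so the responses would be noisy and the fitted surface only approximate.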

Finally, we simulate data for a structured population consisting of multiple families, using the parental generation's phenotypic and genotypic information to predict the progeny's phenotypes. The simulations used high-density molecular genotypic scores from a sample of soybean varieties adapted to maturity zone 3 to establish the structured breeding population. In the simulation we consider low and high heritability, two different genetic architectures, and training data that contain either all of the parents or only the subset of parents with the highest phenotypic values. We define a different metric to evaluate genomic prediction techniques, in which we compare the simulated progeny having the highest phenotypic values with the predicted progeny having the highest phenotypic values based on their parental phenotypic and genotypic values. We found that if the genetic architecture is additive, then the parametric and nonparametric methods perform similarly according to the new metric. When epistasis is present, the nonparametric method had a higher percentage of identical parents than the parametric method.
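The abstract does not give the exact definition of this metric. As a simplified stand-in, the Python sketch below computes the share of the top-k candidates ranked by simulated value that also appear in the top-k ranked by predicted value. The function name `top_k_overlap`, the choice of k, and the example values are hypothetical.

```python
import numpy as np


def top_k_overlap(true_values, predicted_values, k):
    """Fraction of the k best candidates by true value that are also among
    the k best by predicted value (a simplified, assumed metric)."""
    true_top = set(np.argsort(true_values)[-k:].tolist())
    pred_top = set(np.argsort(predicted_values)[-k:].tolist())
    return len(true_top & pred_top) / k


# Made-up values for five candidates.
true_values = np.array([1.0, 5.0, 3.0, 9.0, 7.0])
predicted_values = np.array([2.0, 6.0, 1.0, 8.0, 0.0])
overlap = top_k_overlap(true_values, predicted_values, k=2)
print(overlap)
```

A metric of this kind rewards a method for identifying the same best candidates rather than for predicting every value closely, which is the distinction the paragraph above draws from accuracy and MSE.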

Copyright

2016-01-01

Collections

Theses and Dissertations
