Sensitivity of genomic selection to using different prior distributions
© Verbyla et al; licensee BioMed Central Ltd. 2010
Published: 31 March 2010
Genomic selection describes a selection strategy based on genomic estimated breeding values (GEBV) predicted from dense genetic markers such as single nucleotide polymorphism (SNP) data. Different Bayesian models have been suggested to derive the prediction equation, with the main difference centred around the specification of the prior distributions.
The simulated dataset of the 13th QTL-MAS workshop was analysed using four Bayesian approaches to predict GEBV for animals without phenotypic information. Different prior distributions were assumed to assess their effect on the accuracy of the predicted GEBV.
All methods produced GEBV that were highly correlated with the true breeding values. The models appear relatively insensitive to the choice of prior distributions for the QTL-MAS data set, which is consistent with the uniformity of performance of different methods found in real data.
Genomic selection describes a technique for evaluating an animal's breeding value by simultaneously evaluating and summing marker effects across the genome. It uses panels of SNPs covering the whole genome so that ideally all QTL are in linkage disequilibrium with at least one marker, thereby maximizing the proportion of genetic variance explained by the SNPs.
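As a minimal sketch of the idea of summing marker effects across the genome, the snippet below computes a breeding value for each animal as the genotype-weighted sum of estimated SNP effects. The function name `gebv`, the effect sizes and the genotypes are hypothetical, and genotypes are assumed to be coded as allele counts (0, 1 or 2).

```python
def gebv(genotypes, marker_effects):
    """Sum marker effects weighted by genotype codes (0, 1 or 2) for one animal."""
    if len(genotypes) != len(marker_effects):
        raise ValueError("one effect per marker is required")
    return sum(x * b for x, b in zip(genotypes, marker_effects))


# Hypothetical estimated effects for four SNPs and genotype codes for two animals.
effects = [0.5, -0.25, 0.0, 1.0]
animals = {"A": [2, 0, 1, 1], "B": [0, 2, 2, 0]}

# One breeding value per animal: the sum over markers of genotype times effect.
breeding_values = {name: gebv(g, effects) for name, g in animals.items()}
```

A polygenic term, where fitted, would be added to this sum; it is left out here to keep the sketch short.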
Meuwissen et al (2001) presented three models to produce GEBV. The first invoked the infinitesimal model assumption such that all SNPs had effects derived from the same normal distribution. The other approaches used a Bayesian framework to apply hierarchical models with different prior distributions assuming unequal variances across the SNPs, resulting in a t distribution as the prior distribution for the QTL effects. The specification of the prior distributions of the QTL effects has been reported to be important both for the accurate prediction of breeding values and for mapping multiple QTL across the entire genome.
The aim of this study was to assess the effect that different prior distributions, and subsequently the models using these priors, had on the accuracy of estimated GEBV using the 13th QTL-MAS simulated data set, where we had no prior knowledge of the trait's distribution of QTL effects.
The model fitted was $y = \mu 1_n + \sum_j X_j \beta_j + Zu + e$, where $y$ is the vector of phenotypes of the trait being analysed for all $n$ individuals, $\mu$ is the mean, $1_n$ is a vector of ones of length $n$, $X_j$ is a vector of indicator variables representing the genotypes of the $j$th marker for all individuals ($x_{ij} = 0, 1, 2$), $\beta_j$ is the size of the QTL effect associated with marker $j$, $u$ is the vector of random polygenic effects of length $n$ ($Z$ is the associated design matrix) and is assumed to be normally distributed, $u \sim N(0, A\sigma_u^2)$, where $A$ is the pedigree-derived additive genetic relationship matrix, and $e$ is the residual error, also assumed to be normally distributed, $e \sim N(0, I\sigma_e^2)$, where $I$ is the $n \times n$ identity matrix. The prior distributions for the variances of the random polygenic effects and the residual were uninformative flat priors of the form $\chi^{-2}(-2, 0)$. The GEBV at each time point were calculated as
Table 1: Prior distribution specifications
Bayes A/B (hybrid): SNP variance set to zero with probability 1 - π; sampled from the Bayes A prior with probability π
Indicator variable: γ_i ~ Bernoulli(π), with 1 - p(γ_i = 0) = p(γ_i = 1) = π
The other two models assumed mixture distributions for the SNP effects, reflecting the assumption that there is a large number of SNPs with zero or near-zero effects and a second, smaller set of SNPs with larger, significant effects. A Bayes A/B "hybrid" method was used. This approximation to Bayes B was used to keep computational and time demands reasonable. In this algorithm, after every k Bayes A iterations, Bayes B via the reversible jump algorithm is employed. The reversible jump algorithm is run multiple times per SNP, and any SNP with a final state of zero in the current Bayes B iterations is then set to zero for the subsequent k iterations of Bayes A. This maintains the correct transitions between models of differing dimensionality. The prior distributions are identical to those of the original Bayes B, using a mixture prior distribution for the SNP variance that allows a proportion, 1 - π, to be set to zero. The remaining proportion, π, is sampled from the same distribution as in Bayes A. See Meuwissen et al (2001) for more details of the priors and conditional distributions used.
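The mixture prior on the SNP variance can be illustrated with a short sketch that draws SNP effects from a Bayes B style prior: the variance is zero with probability 1 - π, and otherwise drawn from a scaled inverse chi-square distribution, the variance prior that gives a t distribution for the effects. The function names and the degrees of freedom (`nu = 4.0`) and scale (`scale = 0.01`) are illustrative assumptions, not values taken from this study.

```python
import random


def draw_snp_variance(pi, nu, scale, rng):
    """Draw a SNP variance from a Bayes B style mixture prior."""
    if rng.random() >= pi:
        return 0.0  # with probability 1 - pi the SNP has no effect
    # Chi-square with nu degrees of freedom is Gamma(shape=nu/2, scale=2).
    chi2 = rng.gammavariate(nu / 2.0, 2.0)
    # Scaled inverse chi-square draw: nu * scale / chi2.
    return nu * scale / chi2


def draw_snp_effect(pi, nu, scale, rng):
    """Draw a SNP effect: zero, or normal with the sampled variance."""
    var = draw_snp_variance(pi, nu, scale, rng)
    return 0.0 if var == 0.0 else rng.gauss(0.0, var ** 0.5)


rng = random.Random(1)
effects = [draw_snp_effect(0.05, 4.0, 0.01, rng) for _ in range(10000)]
share_nonzero = sum(e != 0.0 for e in effects) / len(effects)
```

With π = 0.05, about one draw in twenty is non-zero, which is the sparsity the mixture prior is meant to encode.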
A faster alternative to both the Bayes A/B hybrid and Bayes B is to use Stochastic Search Variable Selection (SSVS) (Bayes C [5, 6]). This avoids the problem of the changing dimensionality of the models by keeping the dimensionality constant across all models while still allowing the SNPs in the predictive set to change. Instead of removing all non-significant parameters, their posterior distributions are limited to values close to zero. The major advantage of this method is that it can be implemented using the Gibbs sampler instead of more computationally demanding algorithms such as the reversible jump algorithm. The indicator variable (γ_i) determines whether the ith SNP effect is sampled from the larger distribution (i.e. a significant effect) or from the small distribution with near-zero effects (see Table 1). The prior value of π (the proportion sampled from the non-zero distribution for Bayes A/B and from the larger distribution for Bayes C) was set to 0.05 for both methods, reflecting the fact that with 435 SNPs, it appeared reasonable to expect at least 21 SNPs to be associated with a QTL.
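The role of the indicator variable in a Gibbs sampler can be sketched as follows. Given the current value of a SNP effect, the conditional probability that γ_i = 1 compares the density of the effect under the larger and the small normal component, weighted by π and 1 - π. This is a generic SSVS step under simplifying assumptions: the two variances (`var_large`, `var_small`) are treated as fixed, hypothetical values rather than being sampled, and the function names are illustrative.

```python
import math
import random


def normal_pdf(x, variance):
    """Density of a zero-mean normal distribution at x."""
    return math.exp(-0.5 * x * x / variance) / math.sqrt(2.0 * math.pi * variance)


def prob_large(beta, pi, var_large, var_small):
    """Conditional probability that gamma_i = 1 given the current effect beta."""
    large = pi * normal_pdf(beta, var_large)
    small = (1.0 - pi) * normal_pdf(beta, var_small)
    return large / (large + small)


def sample_indicator(beta, pi, var_large, var_small, rng):
    """Gibbs update for gamma_i: a Bernoulli draw with the conditional probability."""
    return 1 if rng.random() < prob_large(beta, pi, var_large, var_small) else 0


pi = 0.05
expected_large = pi * 435  # prior expected number of SNPs in the larger component

# Hypothetical variances: a near-zero effect stays in the small component,
# a clearly non-zero effect is assigned to the larger one.
p_near_zero = prob_large(0.001, pi, var_large=1.0, var_small=0.0001)
p_big = prob_large(0.5, pi, var_large=1.0, var_small=0.0001)
```

Because every SNP keeps an effect in both states, the number of parameters never changes, which is why a plain Gibbs sampler is sufficient.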
The algorithms associated with each model were run for 30,000 iterations with the first 10,000 discarded as burn-in.
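A minimal sketch of how burn-in is typically handled: the first 10,000 of the 30,000 sampled values are dropped, and the posterior mean is taken over the remaining 20,000. The chain below is a hypothetical deterministic stand-in for sampler output, used only to show that discarding the early, unconverged part removes its influence on the estimate.

```python
def posterior_mean(chain, burn_in):
    """Average the samples that remain after discarding the burn-in period."""
    kept = chain[burn_in:]
    if not kept:
        raise ValueError("burn-in removes every sample")
    return sum(kept) / len(kept)


total_iterations = 30000
burn_in = 10000

# Hypothetical chain that starts at 0 and settles near 5, mimicking a sampler
# that has not yet converged during its early iterations.
chain = [5.0 - 5.0 * (0.999 ** t) for t in range(total_iterations)]

estimate = posterior_mean(chain, burn_in)  # uses the last 20,000 values
naive = posterior_mean(chain, 0)           # biased by the early iterations
```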
The problem of how to model the time series data and estimate GEBV at time point 600 was explored. However, there was little information available to estimate any inflection points or asymptotic values. The GEBV estimated at time points 265, 397 and 530 were found to have a linear relationship (i.e. they appeared to form the linear part of the growth curve). Consequently, as there was no other information available after time point 530 to predict asymptotes, the GEBV at time point 600 were estimated by fitting a linear regression through the breeding values at the three linear time points (265, 397 and 530).
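The extrapolation step can be sketched as an ordinary least-squares line through an animal's three GEBV, evaluated at time point 600. The helper `fit_line` and the GEBV values are hypothetical; only the time points come from the text.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


time_points = [265, 397, 530]
gebv_at_times = [1.0, 2.2, 3.5]  # hypothetical GEBV for one animal

slope, intercept = fit_line(time_points, gebv_at_times)
gebv_600 = intercept + slope * 600  # linear extrapolation beyond time point 530
```

The fit is repeated for each animal. Because the line is extended past the last observed time point, the prediction rests on the assumption that growth remains linear up to 600.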
Table 2: Correlations between estimated GEBV for unphenotyped animals at t = 600
Table 3: Comparison of true and estimated GEBV
The inclusion of the polygenic effect in the model (not simulated in the data) reduced the accuracy of prediction only slightly (by 0.01) and not significantly (results not shown). It was included in the model because its inclusion has been shown to produce slightly better accuracies of prediction while reducing the bias of the variance components.
Bayes BLUP produced a significantly different set of GEBV. This is evident from the much lower correlations with the other methods and the difference in regression coefficients between BLUP and the other methods. Despite these differences, Bayes BLUP produces good accuracy and a low MSE (Table 3). Hayes et al (2009) report that New Zealand, Australian, Dutch and United States studies all found that BLUP gave lower accuracy of GEBV than Bayesian methods for traits where a single QTL explains a large proportion of the genetic variance, e.g. DGAT1 for fat percentage. In the current dataset a finite number of QTL were simulated, and the largest amount of genetic variance explained by a single QTL was 10.5%. Despite this, Bayes BLUP is still able to produce very accurate GEBV compared to the other methods. One reason may be that a number of SNPs are required to pick up the effect of a single QTL, resulting in large numbers of SNPs with small effects, which matches the prior distribution of BLUP. However, if the percentage of genetic variance explained by a single QTL were larger, Bayes BLUP could be expected to produce worse results. This caveat should be considered when using Bayes BLUP.
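The comparison criteria mentioned above can be sketched as follows: accuracy as the correlation between true and estimated breeding values, a regression coefficient as a check on bias, and the mean squared error. The function name, the direction of the regression (true on estimated) and the example values are assumptions made for illustration.

```python
import math


def accuracy_metrics(true_bv, gebv):
    """Correlation, regression of true on estimated values, and MSE."""
    n = len(true_bv)
    mean_t = sum(true_bv) / n
    mean_g = sum(gebv) / n
    stt = sum((t - mean_t) ** 2 for t in true_bv)
    sgg = sum((g - mean_g) ** 2 for g in gebv)
    stg = sum((t - mean_t) * (g - mean_g) for t, g in zip(true_bv, gebv))
    return {
        "correlation": stg / math.sqrt(stt * sgg),
        "regression": stg / sgg,  # slope of true breeding value on GEBV
        "mse": sum((t - g) ** 2 for t, g in zip(true_bv, gebv)) / n,
    }


# Hypothetical values: the GEBV rank the animals perfectly but are shrunk by half.
metrics = accuracy_metrics([1.0, 2.0, 3.0, 4.0], [0.5, 1.0, 1.5, 2.0])
```

In this example the correlation is perfect while the regression coefficient and MSE reveal the shrinkage, which is why the three criteria are worth reporting together.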
All methods produced GEBV that were highly correlated (greater than 0.85) with the true breeding values despite diverse assumptions and prior distributions. This indicates that the hierarchical model is relatively insensitive to the choice of prior distributions for this data set. All models therefore perform well, which is consistent with the general uniformity of performance found across methods in real data. Despite the similar performance of the different methods, it is still recommended that any information about a trait's QTL effect distribution and phenotypic data be used to determine the choice of model, the prior distributions and the setting of the hyperparameters. This will maximise the likelihood of calculating the most accurate GEBV possible.
KV was funded by the Marie Curie Host Fellowships for Early Stage Research Training, as part of the 6th Framework Programme of the European Commission. This Publication represents the views of the Authors, not the European Commission, and the Commission is not liable for any use that may be made of the information.
This article has been published as part of BMC Proceedings Volume 4 Supplement 1, 2009: Proceedings of 13th European workshop on QTL mapping and marker assisted selection.
The full contents of the supplement are available online at http://www.biomedcentral.com/1753-6561/4?issue=S1.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.