Volume 4 Supplement 1
Proceedings of the 13th European workshop on QTL mapping and marker assisted selection
Genomic breeding value prediction using three Bayesian methods and application to reduced density marker panels
 Matthew A Cleveland^{1},
 Selma Forni^{1},
 Nader Deeb^{1} and
 Christian Maltecca^{2}
DOI: 10.1186/1753-6561-4-S1-S6
© Cleveland et al; licensee BioMed Central Ltd. 2010
Published: 31 March 2010
Abstract
Background
Bayesian approaches for predicting genomic breeding values (GEBV) have been proposed that allow for different variances for individual markers, resulting in a shrinkage procedure that uses prior information to coerce negligible effects towards zero. These approaches have generally assumed application to high-density genotype data on all individuals, which may not be the case in practice. In this study, three approaches were compared for their predictive power in computing GEBV when training at high SNP marker density and predicting at high or low densities: the well-known BayesA, a generalization of BayesA where scale and degrees of freedom are estimated from the data (Student-t) and a Bayesian implementation of the Lasso method. Twelve scenarios were evaluated for predicting GEBV using low-density marker subsets, including selection of SNPs based on genome spacing or size of additive effect and the inclusion of unknown genotype information in the form of genotype probabilities from pedigree and genotyped ancestors.
Results
The GEBV accuracy (calculated as the correlation between GEBV and traditional breeding values) was highest for Lasso, followed by Student-t and then BayesA. When comparing GEBV to true breeding values, Student-t was most accurate, though differences were small. In general the shrinkage applied by the Lasso approach was less conservative than that of BayesA or Student-t, indicating that Lasso may be more sensitive to QTL with small effects. In the reduced-density marker subsets the ranking of the methods was generally consistent. Overall, low-density, evenly-spaced SNPs did a poor job of predicting GEBV, but SNPs selected based on additive effect size yielded accuracies similar to those at high density, even when coverage was low. The inclusion of genotype probabilities in the evenly-spaced subsets showed promising increases in accuracy and may be more useful in cases where many QTL of small effect are expected.
Conclusions
In this dataset the Student-t approach slightly outperformed the other methods when predicting GEBV at both high and low density, but the Lasso method may have particular advantages in situations where many small QTL are expected. When markers were selected at low density based on genome spacing, the inclusion of genotype probabilities increased GEBV accuracy, which would allow a single low-density marker panel to be used across traits.
Background
A number of approaches have recently been proposed for the prediction of genomic breeding values for high-density single nucleotide polymorphism (SNP) panels. Methods commonly used fall into two categories, BLUP and Bayesian approaches. In a BLUP framework SNP effects are sampled from a normal distribution and the variance is assumed constant across SNPs [1]. In a Bayesian approach prior knowledge about the distribution of SNP effects is assumed, generally that many SNPs are likely to have small individual effects and only a few will have large effects [2], allowing for different variances for individual SNPs. This assumption results in a shrinkage procedure in which the prior information is used to coerce negligible effects toward zero. Different derivations of this shrinkage approach have been proposed, including BayesA [1]. In this method a scaled inverse-χ^{2} prior is assigned to SNP variances. Scale and degrees of freedom of the distribution are in this case set as hyperparameters and samples of the posterior distribution are obtained through MCMC methods. A generalization in which the hyperparameters regulating the shrinkage are treated as unknown parameters and estimated from the data leads to the well-known Student-t model [3], where the amount of shrinkage is controlled by the data. Alternative shrinkage approaches have also been recently proposed. A particularly appealing method is the least absolute shrinkage and selection operator (Lasso) [4]. In its Bayesian interpretation Lasso estimates can be seen as posterior mode estimates when the regression parameters have independent and identical Laplace priors. Yi and Xu [5] recently compared Lasso and Student-t models for QTL mapping. Prediction of genomic breeding values can be seen as a generalization of the same problem. It has been reported [6, 7] that Bayesian methods give higher genomic breeding value accuracies than BLUP methods.
There are few published results, though, on the performance of different shrinkage methods for genomic breeding value prediction. These approaches were initially developed assuming dense genome-wide SNP coverage. This may not be the case in practice, as it is often cost-prohibitive to genotype all animals at high density and it may be desired to predict genomic breeding values using low-density panels.
This study investigated the predictive performance of different Bayesian hierarchical approaches, BayesA, Student-t and Lasso, when training and predicting genomic breeding values at high density and when predicting at lower densities.
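The Bayesian reading of the two families of shrinkage priors described above can be written out explicitly. The sketch below assumes a linear model with residual variance \sigma_e^2 and ignores the overall mean; the symbols \lambda, v and s^2 are chosen here for illustration, and the parameterization used in the analysis may differ.

```latex
% Lasso as a posterior mode under independent Laplace priors
y \mid \beta, \sigma_e^2 \sim N(X\beta, \sigma_e^2 I), \qquad
p(\beta_j \mid \lambda) = \frac{\lambda}{2} \exp(-\lambda |\beta_j|)

-\log p(\beta \mid y) = \frac{1}{2\sigma_e^2} \lVert y - X\beta \rVert^2
  + \lambda \sum_j |\beta_j| + \text{const}

\hat{\beta}_{\text{mode}} = \arg\min_\beta \; \lVert y - X\beta \rVert^2
  + 2 \sigma_e^2 \lambda \sum_j |\beta_j|

% BayesA / Student-t as a scale mixture of normals
\beta_j \mid \sigma_{\beta_j}^2 \sim N(0, \sigma_{\beta_j}^2), \quad
\sigma_{\beta_j}^2 \sim \text{Inv-}\chi^2(v, s^2)
\;\Longrightarrow\; \beta_j \sim t_v(0, s^2)
```

The last line is the scale-mixture result of [3]: integrating out the SNP-specific variance leaves a scaled Student-t prior on each effect, which is why fixing v and s^2 gives BayesA and estimating them gives the Student-t model.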
Methods
The dataset used for analysis was simulated as part of the 13^{th} QTLMAS Workshop (see [8] for details). The data consisted of 5 sires, 20 dams and 2000 offspring, of which 1000 had phenotypes. The 1000 phenotyped offspring made up the training set, while the 1000 unphenotyped offspring comprised the prediction set for calculation of genomic breeding values (GEBV). All individuals were genotyped for 453 SNP markers, approximately equally spaced across five chromosomes of length one Morgan each.
Prediction of phenotypes and breeding values
The simulated dataset included phenotypes for five traits representing measures of yield at five different time points (t0, t132, t265, t397 and t530). A sixth phenotype was predicted to represent yield at a time point beyond the simulated data, time point 600 (t600). A number of nonlinear models were tested to predict t600 [9–12], and the Gompertz model [12] was found to best fit the data according to AIC [13] and BIC [14] measures. Least squares estimates of growth curve parameters were obtained for each individual using the procedure "NLIN" from SAS [15]. Individual growth curve parameters could then be used to calculate individual phenotypic predictions for any time point until maturity, including t600. Traditional breeding values were estimated using a single trait linear model for each of the time points. We report the results for t530 and t600.
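As an illustration of the growth-curve step, the following Python sketch fits a Gompertz curve to one individual's five records and extrapolates to t600. It is not the SAS NLIN procedure used in the study: the parameterization y(t) = a·exp(−b·exp(−k·t)), the grid search over the asymptote and the function names are all assumptions made for this example, and the fit requires positive yields below the asymptote.

```python
import math

def gompertz(t, a, b, k):
    """Gompertz curve with asymptote a, displacement b and rate k (assumed form)."""
    return a * math.exp(-b * math.exp(-k * t))

def fit_gompertz(times, ys, n_grid=2000):
    """Crude least-squares fit. For each candidate asymptote a on a grid,
    ln(-ln(y/a)) = ln(b) - k*t is a straight line in t, so b and k follow
    from simple linear regression; the a with the smallest residual sum of
    squares on the original scale is kept. Returns (rss, a, b, k)."""
    n = len(times)
    t_bar = sum(times) / n
    stt = sum((t - t_bar) ** 2 for t in times)
    y_max = max(ys)
    best = None
    for g in range(1, n_grid + 1):
        a = y_max * (1.0 + 2.0 * g / n_grid)   # a runs from just above y_max to 3*y_max
        z = [math.log(-math.log(y / a)) for y in ys]
        z_bar = sum(z) / n
        slope = sum((t - t_bar) * (zi - z_bar) for t, zi in zip(times, z)) / stt
        b, k = math.exp(z_bar - slope * t_bar), -slope
        rss = sum((y - gompertz(t, a, b, k)) ** 2 for t, y in zip(times, ys))
        if best is None or rss < best[0]:
            best = (rss, a, b, k)
    return best

def aic_bic(rss, n, p):
    """Least-squares AIC and BIC, up to a constant shared by all models
    fitted to the same n records with p parameters each."""
    base = n * math.log(rss / n)
    return base + 2 * p, base + p * math.log(n)
```

With the fitted (a, b, k) of an individual, `gompertz(600, a, b, k)` gives that individual's predicted t600 phenotype; competing curves can be ranked by passing their residual sums of squares to `aic_bic`.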
Description of models
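Each iteration of the Gibbs sampler cycled through the following steps, where σ_{e}^{2} denotes the residual variance, σ_{βj}^{2} the variance of SNP j (expressed through τ_{j}^{2} in the Lasso), λ the Lasso regularization parameter, and v and s^{2} the degrees of freedom and scale of the scaled inverse-χ^{2} prior:
 1) Sample µ from N(µ | y, β, σ_{e}^{2})
 2) Sample β_{j} from N(β_{j} | y, µ, σ_{e}^{2}, σ_{βj}^{2}), where the updates in this case are obtained through Gauss-Seidel with residual update [18]
 3) Sample σ_{e}^{2} from Inv−χ^{2}(σ_{e}^{2} | y, µ, β)
 4) Sample the SNP-specific variances:
   i. Lasso method:
      Sample τ_{j}^{−2} from InvGauss(τ_{j}^{−2} | β_{j}, λ)
      Sample λ^{2} from Gamma(λ^{2} | τ^{2})
   ii. BayesA, Student-t:
      Sample σ_{βj}^{2} from Inv−χ^{2}(σ_{βj}^{2} | β, v, s^{2})
      For Student-t only:
      Sample s^{2} from Gamma(s^{2} | σ_{β}^{2}, v)
      Sample v with a Metropolis step (v | s^{2}, σ_{β}^{2})
 5) Return to Step 1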
The Gibbs sampling algorithm for all three methods was implemented in R [19]. For each analysis a single chain of 15000 iterations was run with a burn-in period of 5500 iterations. Samples were stored every 30 iterations. Convergence of each chain was assessed both by visual inspection of the trace and by estimates of effective sample size for the variances, obtained with the R coda package [20]. Inferences on the parameters were based on the average of the posterior samples after burn-in.
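The sampling steps above can be sketched for the simplest of the three methods, BayesA, where v and s^{2} are fixed. The Python code below is an illustrative single-site sampler with residual updates, not the R implementation used in the study: the default values of v and s2, the flat prior on the log residual variance, the absence of thinning and the function names are assumptions made for this example.

```python
import random

def scaled_inv_chi2(rng, df, sum_sq):
    """Draw sum_sq / chi2(df), i.e. a scaled inverse chi-square variate."""
    return sum_sq / rng.gammavariate(df / 2.0, 2.0)

def bayes_a(y, X, n_iter=600, burn_in=200, v=4.0, s2=0.01, seed=1):
    """Gibbs sampler for BayesA with fixed hyperparameters v and s2.
    y: phenotypes; X: rows of genotypes coded 0/1/2.
    Returns posterior means of the overall mean and of the SNP effects."""
    rng = random.Random(seed)
    n, m = len(y), len(X[0])
    cols = [[row[j] for row in X] for j in range(m)]    # genotypes by SNP
    xtx = [sum(x * x for x in col) for col in cols]
    mu = sum(y) / n
    beta = [0.0] * m
    var_b = [s2] * m
    resid = [yi - mu for yi in y]                       # e = y - mu - X beta
    var_e = sum(r * r for r in resid) / n
    mu_sum, beta_sum, kept = 0.0, [0.0] * m, 0
    for it in range(n_iter):
        # Step 1: overall mean given all other parameters
        new_mu = rng.gauss(mu + sum(resid) / n, (var_e / n) ** 0.5)
        resid = [r - (new_mu - mu) for r in resid]
        mu = new_mu
        # Step 2: SNP effects one at a time, keeping the residuals current
        for j, col in enumerate(cols):
            rhs = sum(x * r for x, r in zip(col, resid)) + xtx[j] * beta[j]
            lhs = xtx[j] + var_e / var_b[j]
            new_b = rng.gauss(rhs / lhs, (var_e / lhs) ** 0.5)
            delta = new_b - beta[j]
            resid = [r - x * delta for x, r in zip(col, resid)]
            beta[j] = new_b
        # Step 3: residual variance (flat prior on its logarithm assumed)
        var_e = scaled_inv_chi2(rng, n, sum(r * r for r in resid))
        # Step 4 (BayesA): one variance per SNP from its scaled inverse chi-square
        for j in range(m):
            var_b[j] = scaled_inv_chi2(rng, v + 1.0, v * s2 + beta[j] ** 2)
        # Accumulate posterior means after burn-in
        if it >= burn_in:
            kept += 1
            mu_sum += mu
            for j in range(m):
                beta_sum[j] += beta[j]
    return mu_sum / kept, [b / kept for b in beta_sum]
```

The Student-t variant would add draws of s^{2} and v inside the loop, and the Lasso would replace Step 4 with the inverse-Gaussian and Gamma draws; both are omitted here to keep the sketch short.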
Genomic breeding values were then calculated as GEBV = X_{m}β_{m}, where X_{m} is a matrix of genotypes expressed as (0,1,2) and β_{m} is a vector of posterior mean effects for a particular method, for m SNPs. A cross-validation procedure was also used, in which phenotyped individuals were randomly split into training and prediction sets (90% training; 10% prediction) 10 times to assess the stability of the genomic predictions for t530 and t600.
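The product of genotypes and posterior mean effects, and the repeated 90%/10% split, can be sketched in a few lines of Python. The function names and the seeded random generator are assumptions made for this example; the study itself does not specify how the splits were drawn.

```python
import random

def gebv(X, beta):
    """GEBV = X beta: for each individual, the sum over SNPs of the genotype
    code (0, 1, 2) times the posterior mean effect of that SNP."""
    return [sum(x * b for x, b in zip(row, beta)) for row in X]

def cv_splits(ids, n_rep=10, train_frac=0.9, seed=0):
    """Yield (training, prediction) lists from repeated random splits."""
    rng = random.Random(seed)
    n_train = int(round(train_frac * len(ids)))
    for _ in range(n_rep):
        shuffled = list(ids)
        rng.shuffle(shuffled)
        yield shuffled[:n_train], shuffled[n_train:]
```

For each split, SNP effects would be re-estimated on the training individuals and `gebv` applied to the genotypes of the held-out individuals.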
Low-density marker subsets
Table 1: Number of SNPs included in the calculation of genomic breeding values in each low-density scenario
Scenario     Evenly-spaced^{a}  Largest effects^{b}  Genotype probabilities^{c}  Total
EVEN_19      19                 -                    -                           19
EVEN_38      38                 -                    -                           38
EVEN_76      76                 -                    -                           76
SIG_19       -                  19                   -                           19
SIG_38       -                  38                   -                           38
SIG_76       -                  76                   -                           76
EVEN_GP_19   19                 -                    434                         453
EVEN_GP_38   38                 -                    415                         453
EVEN_GP_76   76                 -                    377                         453
SIG_GP_19    -                  19                   434                         453
SIG_GP_38    -                  38                   415                         453
SIG_GP_76    -                  76                   377                         453
When a genotype was unknown, the corresponding element of the genotype matrix was replaced by its expected value, x_{ij} = P(1)_{ij} + 2P(2)_{ij}, where P(1) and P(2) are the probabilities of individual i having the heterozygous (coded as 1) and homozygous (coded as 2) genotypes, respectively, for marker j. When the actual genotype is known the matrix element is simply coded as before (0,1,2). This approach is related to the genetic predictor approach of Boer et al. [22].
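The coding of unknown genotypes by their probabilities can be illustrated with a short Python sketch. Here the probabilities come from simple Mendelian transmission given two genotyped parents, which is a deliberate simplification and not the procedure used in the study; the function names and the use of None for an ungenotyped SNP are assumptions made for this example.

```python
def offspring_probs(sire, dam):
    """Mendelian P(0), P(1), P(2) for an offspring, given parental genotypes
    coded as the number (0, 1, 2) of copies of the counted allele."""
    ps, pd = sire / 2.0, dam / 2.0           # chance each parent transmits the allele
    p2 = ps * pd
    p1 = ps * (1.0 - pd) + (1.0 - ps) * pd
    return 1.0 - p1 - p2, p1, p2

def expected_genotype(p1, p2):
    """Expected genotype code: 1 * P(heterozygote) + 2 * P(homozygote coded 2)."""
    return p1 + 2.0 * p2

def genotype_row(observed, sire_row, dam_row):
    """Row of the genotype matrix for one individual: the observed code where
    the SNP was genotyped, the expected code (from parents) where it is None."""
    row = []
    for g, s, d in zip(observed, sire_row, dam_row):
        if g is None:
            _, p1, p2 = offspring_probs(s, d)
            row.append(expected_genotype(p1, p2))
        else:
            row.append(g)
    return row
```

A row built this way can be multiplied by the full vector of SNP effects exactly as a row of observed genotypes would be, which is what lets the GP scenarios use all 453 effects.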
Results
Table 2: Correlations (Corr.) between genomic breeding values and breeding values from a traditional animal model for animals in the prediction set (without phenotypes), and coefficients of regression (b) of traditional on genomic breeding values, for t530 and t600.
             t530             t600
Method       Corr.   b        Corr.   b
BayesA       0.673   0.893    0.674   0.880
Student-t    0.718   1.019    0.720   1.010
Lasso        0.736   1.061    0.737   1.072
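The two summary statistics reported above, the correlation between traditional and genomic breeding values and the coefficient of regression of traditional on genomic breeding values, can be computed as in the following Python sketch. The function names are assumptions made for this example.

```python
import math

def _mean(values):
    return sum(values) / len(values)

def correlation(a, b):
    """Pearson correlation between two equally long lists."""
    ma, mb = _mean(a), _mean(b)
    sab = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    saa = sum((x - ma) ** 2 for x in a)
    sbb = sum((y - mb) ** 2 for y in b)
    return sab / math.sqrt(saa * sbb)

def regression_slope(response, predictor):
    """Slope b of the regression of response on predictor:
    cov(response, predictor) / var(predictor)."""
    mr, mp = _mean(response), _mean(predictor)
    cov = sum((r - mr) * (p - mp) for r, p in zip(response, predictor))
    var = sum((p - mp) ** 2 for p in predictor)
    return cov / var
```

Calling `regression_slope(ebv, gebv)` gives b; a value near 1 indicates that the genomic breeding values are on the same scale as the traditional ones, while values below 1 indicate that the genomic values are over-dispersed.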
Table 3: Correlations between genomic breeding values and breeding values from different low SNP-density approaches (and, in parentheses beneath each correlation, the change in correlation compared to the original full marker model), where all SNP effects are estimated in the same high SNP-density training set, for t530 and t600.
             t530                              t600
Scenario     BayesA    Student-t  Lasso        BayesA    Student-t  Lasso
EVEN_19      0.255     0.142      0.195        0.128     0.098      0.173
             (-0.418)  (-0.846)   (-0.594)     (-0.532)  (-0.622)   (-0.564)
EVEN_38      0.481     0.494      0.528        0.469     0.485      0.522
             (-0.192)  (-0.249)   (-0.242)     (-0.180)  (-0.235)   (-0.215)
EVEN_76      0.490     0.544      0.586        0.472     0.532      0.584
             (-0.183)  (-0.246)   (-0.192)     (-0.130)  (-0.188)   (-0.153)
SIG_19       0.663     0.699      0.709        0.669     0.692      0.709
             (-0.010)  (-0.049)   (-0.037)     (-0.025)  (-0.028)   (-0.028)
SIG_38       0.664     0.703      0.713        0.669     0.707      0.721
             (-0.009)  (-0.049)   (-0.033)     (-0.029)  (-0.013)   (-0.016)
SIG_76       0.667     0.709      0.711        0.672     0.712      0.729
             (-0.006)  (-0.046)   (-0.027)     (-0.035)  (-0.008)   (-0.008)
EVEN_GP_19   0.937     0.967      0.980        0.928     0.967      0.978
             (0.264)   (0.210)    (0.231)      (0.293)   (0.247)    (0.241)
EVEN_GP_38   0.733     0.785      0.861        0.736     0.789      0.862
             (0.060)   (0.018)    (0.049)      (0.111)   (0.069)    (0.125)
EVEN_GP_76   0.733     0.786      0.854        0.736     0.789      0.856
             (0.060)   (0.018)    (0.050)      (0.112)   (0.069)    (0.119)
SIG_GP_19    0.674     0.730      0.802        0.675     0.735      0.798
             (0.001)   (0.043)    (0.006)      (0.056)   (0.015)    (0.061)
SIG_GP_38    0.673     0.728      0.783        0.675     0.731      0.791
             (0)       (0.043)    (0.008)      (0.054)   (0.011)    (0.054)
SIG_GP_76    0.673     0.724      0.767        0.674     0.729      0.769
             (0)       (0.044)    (0.012)      (0.050)   (0.009)    (0.032)
Discussion
The use of low-density SNP subsets is based on the concept of Habier et al. [23], where SNP effects are estimated from a training dataset using high-density SNP genotypes, but GEBV are then calculated for individuals genotyped for only a small subset of the SNPs. These subsets may be chosen by selecting markers for even genome coverage or based on effect size for a certain trait, and ungenotyped SNPs may be filled in to approximate high-density coverage. The current analysis found that evenly-spaced SNPs alone did a poor job of predicting GEBV (Table 3). By chance this approach could produce high GEBV accuracies if selected SNPs happened to be in linkage disequilibrium (LD) with large QTL for a particular trait, but in general it would be expected that many QTL would not be represented by the low-density panel. In the current dataset average LD was low (results not shown), which explains the poorer performance of the evenly-spaced, low-density subset compared to other approaches. Selecting only SNPs with large effects in each of the three methods yielded GEBV that were nearly as accurate as when using all markers, in all cases. This result is likely specific to the case where few QTL of large or moderate effect are expected and thus few markers will account for most of the variance, which is presumed in this dataset based on Figure 1. In fact, the correlation between GEBV and EBV for t600 in the prediction set was 0.603 using the three SNPs with largest effects in BayesA, only a 7% reduction in accuracy.
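The two ways of choosing a low-density subset discussed above can be sketched as follows. This Python example assumes markers are indexed in map order along a single, evenly covered map and picks evenly-spaced subsets by index; how the subsets were actually placed across the five chromosomes is not specified here, and the function names are assumptions made for this example.

```python
def evenly_spaced(n_snps, k):
    """Indices of k SNPs spread evenly over n_snps markers in map order."""
    if k == 1:
        return [n_snps // 2]
    step = (n_snps - 1) / (k - 1)
    return [int(round(i * step)) for i in range(k)]

def largest_effects(beta, k):
    """Indices of the k SNPs with the largest absolute estimated effect."""
    order = sorted(range(len(beta)), key=lambda j: abs(beta[j]), reverse=True)
    return sorted(order[:k])

def subset_gebv(X, beta, keep):
    """GEBV using only the SNPs in keep, with effects that were estimated
    once from the full high-density training data."""
    return [sum(row[j] * beta[j] for j in keep) for row in X]
```

Note that `evenly_spaced` does not depend on the trait, whereas `largest_effects` returns a different panel for each trait and each method, which is the practical drawback of effect-based selection.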
The scenarios using genotype probabilities performed well and in most cases showed a small or no reduction in accuracy compared to using the full marker set. Due to the population structure (full- and half-sib families) and completeness of parental genotypes, it is expected that the genotype probabilities are a good representation of the true genotypes in this case. In a situation where there are fewer ties between individuals, the advantage of using genotype probabilities (in place of actual genotypes) is likely to be lower than what was found in this study. A number of the scenarios even showed large increases in accuracy to unrealistic levels (e.g., EVEN_GP_19, Table 3). Paradoxically, the evenly-spaced scenarios outperformed the largest SNP effects scenarios, and the best performance came from the smallest number of SNPs. This result can be attributed to calculating accuracy based on the EBV. With fewer markers and less information (based on even spacing), the GEBV calculated in EVEN_GP_19 are nearly identical within family and are implicitly based on family relationships, through SNP allele sharing, and thus the GEBV are approximations of the EBV rather than of the true breeding value. Using the EBV as a proxy for the true breeding value appears to be a poor choice in this case. Addition of true breeding values should make this a fairer comparison.
Epilogue
Table 4: Accuracy of genomic breeding values using three methods, as the correlation between true and predicted breeding values, for animals in the prediction set using all markers (ALL) and using alternative low-density approaches, for t600.
Scenario     BayesA  Student-t  Lasso
ALL          0.916   0.945      0.916
EVEN_19      0.040   0.206      0.258
EVEN_38      0.732   0.738      0.738
EVEN_76      0.734   0.761      0.758
SIG_19       0.913   0.931      0.910
SIG_38       0.915   0.938      0.914
SIG_76       0.915   0.943      0.921
EVEN_GP_19   0.658   0.674      0.671
EVEN_GP_38   0.833   0.84       0.817
EVEN_GP_76   0.834   0.846      0.825
SIG_GP_19    0.914   0.937      0.914
SIG_GP_38    0.915   0.940      0.917
SIG_GP_76    0.916   0.943      0.920
Conclusions
For this simulated dataset the Lasso method slightly outperformed BayesA and Student-t when considering accuracy as the correlation between GEBV and EBV, but Student-t performed best when comparing GEBV to true breeding values (TBV). BayesA and Student-t appeared to be more conservative in shrinkage of SNP effects, indicating that Lasso may be more sensitive to small QTL and thus may perform better than other methods for traits where large or moderate QTL are not expected. In the analysis of reduced marker density, few SNPs were needed to maintain levels of accuracy similar to the high-density SNP set when SNPs with large effect were selected. When markers were selected based on spacing, the use of genotype probabilities in place of known genotypes increased the accuracy of the GEBV, which would allow a single low-density panel to be used across traits.
Declarations
Acknowledgement
This article has been published as part of BMC Proceedings Volume 4 Supplement 1, 2009: Proceedings of 13th European workshop on QTL mapping and marker assisted selection.
The full contents of the supplement are available online at http://www.biomedcentral.com/1753-6561/4?issue=S1.
References
 1. Meuwissen THE, Hayes BJ, Goddard ME: Prediction of total genetic value using genome-wide dense marker maps. Genetics. 2001, 157: 1819-1829.
 2. Hayes BJ, Goddard ME: The distribution of the effects of genes affecting quantitative traits in livestock. Genet Sel Evol. 2001, 33: 209-229. 10.1186/1297-9686-33-3-209.
 3. Andrews DF, Mallows CL: Scale mixtures of normal distributions. J Royal Stat Soc B. 1974, 36: 99-102.
 4. Tibshirani R: Regression shrinkage and selection via the lasso. J Royal Stat Soc B. 1996, 58: 267-288.
 5. Yi N, Xu S: Bayesian LASSO for quantitative trait loci mapping. Genetics. 2008, 179: 1045-1055. 10.1534/genetics.107.085589.
 6. VanRaden PM: Efficient methods to compute genomic predictions. J Dairy Sci. 2008, 91: 4414-4423. 10.3168/jds.2007-0980.
 7. VanRaden PM, Van Tassell CP, Wiggans GR, Sonstegard TS, Schnabel RD, Taylor JF, Schenkel FS: Invited review: Reliability of genomic predictions for North American Holstein bulls. J Dairy Sci. 2009, 92: 16-24. 10.3168/jds.2008-1514.
 8. Coster A, Bastiaansen J, Calus M, Maliepaard C, Bink M: QTLMAS 2009: Simulated dataset. BMC Proc. 2010, 4 (Suppl 1): S3.
 9. Brody S: Bioenergetics and growth. 1945, Reinhold Publishing Corp.
 10. Von Bertalanffy L: Quantitative laws in metabolism and growth. The Quarterly Review of Biology. 1957, 32: 217-230. 10.1086/401873.
 11. Nelder JA: The fitting of a generalization of the logistic curve. Biometrics. 1961, 17: 89-110. 10.2307/2527498.
 12. Laird AK: Dynamics of relative growth. Growth. 1965, 29: 249-263.
 13. Akaike H: A new look at the statistical model identification. IEEE Trans Autom Control. 1974, 19: 716-723. 10.1109/TAC.1974.1100705.
 14. Schwarz G: Estimating the dimension of a model. Annals of Stat. 1978, 6 (2): 461-464. 10.1214/aos/1176344136.
 15. SAS Institute Inc: SAS 9.2 Help and Documentation. 2009.
 16. Park T, Casella G: The Bayesian Lasso. J Am Stat Assoc. 2008, 103: 681-686.
 17. de los Campos G, Naya H, Gianola D, Crossa J, Legarra A, Manfredi E, Weigel K, Cotes JM: Predicting quantitative traits with regression models for dense molecular markers and pedigree. Genetics. 2009, 182: 375-385. 10.1534/genetics.109.101501.
 18. Legarra A, Misztal I: Technical Note: Computing strategies in genome-wide selection. J Dairy Sci. 2008, 91: 360-366. 10.3168/jds.2007-0403.
 19. R Development Core Team: R: A language and environment for statistical computing. 2008.
 20. Plummer M, Best N, Cowles K, Vines K: CODA: Convergence diagnosis and output analysis for MCMC. R News. 2006, 6: 7-11.
 21. Kerr RJ, Kinghorn BP: An efficient algorithm for segregation analysis in large populations. J Anim Breed Genet. 1996, 113: 457-469.
 22. Boer MP, Wright D, Feng L, Podlich DW, Luo L, Cooper M, van Eeuwijk FA: A mixed-model quantitative trait loci (QTL) analysis for multiple-environment trial data using environmental covariables for QTL-by-environment interactions, with an example in maize. Genetics. 2007, 177: 1801-1813. 10.1534/genetics.107.071068.
 23. Habier D, Fernando RL, Dekkers JCM: Genomic selection using low-density marker panels. Genetics. 2009, 182: 343-353. 10.1534/genetics.108.100289.
Copyright
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.