 Introduction
 Open Access
 Published:
Comparison of analyses of the QTLMAS XII common dataset. I: Genomic selection
BMC Proceedings volume 3, Article number: S1 (2009)
Abstract
A dataset was simulated and distributed to participants of the QTLMAS XII workshop who were invited to develop genomic selection models. Each contributing group was asked to describe the model development and validation as well as to submit genomic predictions for three generations of individuals, for which they only knew the genotypes. The organisers used these genomic predictions to perform the final validation by comparison to the true breeding values, which were known only to the organisers. Methods used by the 5 groups fell in 3 classes 1) fixed effects models 2) BLUP models, and 3) Bayesian MCMC based models. The Bayesian analyses gave the highest accuracies, followed by the BLUP models, while the fixed effects models generally had low accuracies and large error variance. The best BLUP models as well as the best Bayesian models gave unbiased predictions. The BLUP models are clearly sensitive to the assumed SNP variance, because they do not estimate SNP variance, but take the specified variance as the true variance. The current comparison suggests that Bayesian analyses on haplotypes or SNPs are the most promising approach for Genomic selection although the BLUP models may provide a computationally attractive alternative with little loss of efficiency. On the other hand fixed effect type models are unlikely to provide any gain over traditional pedigree indexes for selection.
Introduction
Genomic selection
Hybrid Marker Assisted Selection (MAS) schemes were the first tool proposed to include information on a few main genes or quantitative trait loci (QTL) into best linear unbiased prediction (BLUP) of breeding values e.g. [1]. More recently, genomic selection (GS) [2] was proposed. This approach relies on a genomewide dense marker map, such that markers in linkage disequilibrium (LD) with each QTL are available. GS hence utilizes the data on all available markers to produce genomic estimated breeding values (GEBV) by summing the effects of all small chromosome segments characterized by their marker alleles. Because predictions using GS are based on marker associations and not pedigree information, the requirement to have phenotypes on selection candidates or their close relatives is relaxed and a breeding value can be obtained as soon as the genotypes are available. As a result, the method has the potential to increase genetic progress as well as reducing costs [3]. In breeding schemes with long generation intervals (e.g. cattle) the increase will mainly be a consequence of shortening the interval. In breeding schemes with shorter generation intervals the genetic gain may be increased due to higher accuracies of breeding values at the time of selection. With new genotyping technologies becoming available for livestock species (eg. Bovine50 Bead Chip http://www.illumin.com), GS is now becoming a very attractive approach to predict breeding values.
Several statistical methods have been proposed to be used in genomic prediction models [2, 4]. Apart from the specific models used, the prediction of genomic breeding values also involve steps of data editing, choice of response variable, utilization of marker information, and model validation.
The aim of this study was to compare approaches used to predict genomic breeding values in a situation that mimics real life. This was done by distributing genotypic data, phenotypic data, and pedigree information, but not the true breeding values to participants of the QTLMAS XII workshop. Participants then applied various approaches to predict genomic breeding values for three generations of nonphenotyped individuals. The organisers then compared the predictions with true simulated breeding values.
Methods
Simulation of data
Pedigree
The pedigree was simulated in three parts as illustrated in Figure 1. First a historic population was simulated with 50 generations without records (G_{h1} to G_{h50}), followed by 4 generations (G_{t1} to G_{t4}; the training set), with both genotype and phenotype records, and finally the last 3 generations (G_{v1} to G_{v3}; the validation set) were only genotypes and true breeding values were generated. The historic population was created by 100 founder individuals (50 males and 50 females) in generations G_{h1}. For each of the subsequent 50 generations, 50 males and 50 females were produced by randomly sampling parents from the previous generation. The base generation of the recorded pedigree (G_{t1}) had 15 males and 150 females. The parents for these were sampled randomly from individuals in G_{h50}. Each male was mated to 10 females and each mating pair produced 10 offspring. This created a fullsibhalfsib design, in which each male had 100 progeny and each female had 10 progeny and a total of 1500 individuals per generation. In the following 5 generations (G_{t2} to G_{t4} and G_{v1} to G_{v2}), 15 males and 150 females were selected randomly to be parents of the next generation and the same mating design was repeated. In the last 3 generations (G_{v1} to G_{v3}) 400 of the 1500 individuals were selected randomly to be genotyped. The resulting 1200 individuals constitute the validation set.
Marker and QTL
Marker alleles were sampled for 6000 biallelic loci on 6 chromosomes (1000 markers on each chromosome) with 0.1 cM between adjacent loci. In the founder individuals (G_{h1}), the two alleles at each marker locus were sampled with equal probabilities. Recombination was sampled according to Haldane's mapping function [5]. A total of 48 QTL were simulated. QTL positions were sampled under the assumption of a multinomial distribution of genes across the genome. The multinomial distribution used was based on the genetic map of the mouse genome [6]. The allele substitution effects of the QTL were drawn from a gamma distribution with scale parameter (α) 5.4 and shape parameter (β) 0.42 following [7]. We wanted to control the genetic effects and genomic location of four QTL. To achieve this, we replaced four of the 48 QTL that were closest to the desired locations with QTL of predefined effects. The allele substitution effects of these four fixed QTL were standardized based on their individual allelic frequencies in the last generation of the historic pedigree, so that each of these QTL explains a predefined percentage of the genetic variance. Positions and allele substitution effects and contribution to genetic and phenotypic variance for each QTL is available at http://www.computationalgenetics.se/QTLMAS08/QTLMAS/DATA_files/QTLMAS_simulated_effects.pdf
Phenotype
The phenotypes were obtained as the cumulative effects of the 44 randomly drawn QTL, the 4 defined QTL, and a random residual. First, the effects of the 44 randomQTL were summed. The effect was standardized to mean 0 and variance 1. Then the effects of four fixed QTL were added to constitute the total genetic effect. The residual variance was defined to obtain a heritability of 0.30. The individuals' phenotypes were derived as the sum of the individuals' genetic value and a random residual drawn from a normal distribution with mean zero and variance equal to the residual variance.
Validation of prediction models
Participants of the QTLMAS XII workshop were provided pedigree, phenotypic, and genomic data on the 4665 individuals of generations G_{t1} to G_{t4} and only pedigree and genomic data on generations G_{v1} to G_{v3}. Using these data, they used various models to estimate GEBVs for individuals in generations G_{v1} to G_{v3}. The properties of the reported GEBVs were then assessed by three different criteria to relate the GEBVs reported by the workshop participants and the true breeding values (TBV), which were only known by the organisers of the workshop.
The first criterion was the accuracy of the GEBVs as a measure of their predictive ability. Accuracies were calculated as the correlation between GEBVs and TBVs. The second criterion was the bias of the GEBVs, assessed as the coefficients of regressing TBVs on GEBVs. A regression coefficient close to 1 indicates that predictions are unbiased. The third criterion was the rank correlations between GEBVs and TBVs in the top 10% of individuals ranked on TBVs.
Genomic selection models
A range of different statistical methods were used by QTLMAS XII participants to analyse the simulated data. Some participants have provided GEBVs from all their investigated models to the organisers while some have provided GEBVs from selected models only. According to the properties of the models for which GEBVs have been provided, they can be classified into three categories: 1) fixed effect models, 2) BLUP based models with random marker effects, and 3) Bayesian MCMC based models with a mixture of some markers attributed a large variance and the majority attributed a small variance. Different characteristics of the models used and relation to the model specifications in the contributed papers are shown in Table 1 for the fixed effect models, in Table 2 for the BLUP based models, and in Table 3 for the Bayesian models. Except for models GP_{BLUP11}, GP_{BLUP12}, and GP_{BLUP16} the provided phenotypes were used as the response variable in the analyses. Only models GP_{Bayes1} and GP_{Bayes3} included a polygenic effect in the model and only models GP_{Bayes1}, GP_{Bayes3}, and GP_{Bayes5} used haplotype information.
Results and discussion
In generations 1–7, the mean minor allele frequency (MAF) of markers was 0.298. The cumulative distribution of MAF in Figure 2 shows a rather equal distribution, which is a consequence of random drift over the 50 generations from the original starting values of 0.5 for each locus. For 38 marker loci one allele was fixed. The mean linkage disequilibrium (r^{2}) between adjacent SNPs was 0.20 and the median was 0.11. This may be relatively low compared to other simulation studies that often use 1000 generations of random mating. On the other hand, the LD achieved in this paper seems very comparable to the realised values from real data analysis. For markers with a distance around 0.1 Mb the average R2 was 0.14 and 0.22 in different breeds and studies [8–10].
Fixed effect models
Only one contribution [11] used fixed effects models to predict GEBVs of the simulated data. They used 4 different sets of SNPs in the prediction models ranging from using all 6000 SNPs, every 5^{th} SNP, every 10^{th} SNP, every 20^{th} SNP or setting a MAF of 0.2. As expected, the model fit improved as more SNPs were fitted in the training set [11]. However, the predictive ability of the models in the validation dataset decreased with more SNPs in the model (Table 4). Using all 6000 SNPs resulted in a correlation between GEBVs and TBVs (averaged over generations G_{v1} to G_{v3}) of 0.16 compared to a correlation of 0.56 with 300 SNPs.
All evaluated fixed effects models overestimated breeding values severely (Table 4). The regression coefficients of TBVs on GEBVs were between 0.03 and 0.53 averaged over generations G_{v1} to G_{v3}. The biases increased with the number of markers in the model and were particularly strong when all SNPs were used. Table 4 shows that the rank correlations drop dramatically with the inclusion of more markers in the model. This means that the low regression coefficient for this model was not only due to a scaling effect. In the worst case more parameters are fitted than available observations, which lead to a serious overfitting of data. This is known to result in unstable prediction with large prediction errors [12]. As a result, the overfitted model had poor predictive ability for the data beyond the training set, resulting in a large variance of GEBVs. These results are well in line with the original observation of [2], where it was found that using a least squares approach leads to biased GEBVs and poor predictive ability. On the other hand [13] found relatively good results using fixed regression models, if markers were selected based on associations to the phenotypes and a liberal significance threshold. This indicates that fixed effects models could be more useful that the current simulations indicated, if a procedure is used that balance the problems of overfitting data and selecting SNPs with overestimated effects.
Random effect models
Three contributions presented GEBVs from BLUP models fitting SNPs as random effects. In two studies [11, 14] the variance explained by each SNP was given as either the total genetic variance (σ^{2}_{G}) or the total genetic variance divided by the number of SNPs (σ^{2}_{G}/N_{SNPs}). It was apparent that models using σ^{2}_{G} lead to results similar to those from the fixed effects models. This is most apparent when comparing least squares models GP_{Fix2} – GP_{Fix5} ([11]; Table 4) to BLUP models GP_{BLUP1} – GP_{BLUP4} ([11]; Table 5), where correlations and regression coefficients are very similar when the same number of SNPs were used. When σ^{2}_{G}/N_{SNPs} was used instead, the models performed much better. This is apparent when comparing accuracies from models GP_{BLUP1} to GP_{BLUP4} with those from models GP_{BLUP5} to GP_{BLUP8}. The correlation between GEBVs and TBVs increased from 0.23 to 0.75 when 3328 SNPs were used (GP_{BLUP1} and GP_{BLUP5}) and from 0.56 to 0.61 when only 300 SNPs were used (GP_{BLUP4} and GP_{BLUP8}). When using this variance, the SNP effects are regressed considerably more towards zero, which leads to virtually unbiased estimates for the models with more than 600 SNPs.
The effect of the assumed variance explained by each SNP on the accuracy of GEBV can also be seen in the other two contributions. In the contribution [14] GEBV were obtained from a BLUP model with the variance either σ^{2}_{G} or σ^{2}_{G}/N_{SNPs}. Here the correlation between TBVs and GEBVs increased from 0.55 to 0.77 and the regression coefficient increased from 0.41 to 0.94, when phenotypic values were used as the response variable. In the study by Pimentel et al. [15], the correlation was 0.22 and the regression was 0.06 using a model with the variance of each SNP equal to residual variance, while the correlation was 0.51 and the regression was 0.31 using a model with the variance σ^{2}_{G}/N_{SNPs}. These results clearly show that the importance of fitting the SNP effects as random effects and providing a reasonable SNP variance increases with the number of markers included in the models.
Mixture distribution/Bayesian models
Two contributions [16, 17] analysed data with Bayesian methods and in total submitted GEBVs from five different models. GEBVs from all models were more accurate (Table 6) than the nonBayesian models. All models had a high predictive ability with correlations between GEBVs and TBVs ranging from 0.84 to 0.92 averaged over generations G_{v1} to G_{v3}.
All Bayesian models used Gibbs sampling algorithms, in which marker or QTL effects were assumed to follow a mixture distribution, where relatively few markers were assumed to explain a large variance and a large number explained a very small variance. This is most likely the main reason for the improved performance of these models over the BLUP models in which all markers are assumed to explain the same amount of variance. The assumption of a homogeneous variance for all markers leads to a poor prediction of the effect of a QTL with a large contribution to the trait from a single marker even if they are in complete LD. In this simulated dataset the 10 largest QTL explain 82.9% of the genetic variance. This may favour the Bayesian models relative to BLUP models, compared to situations with a large number of QTL contributing more equally to the genetic variance.
Fitting single SNP or haplotype effects
Following the internal validation by [17], it can be seen that, for this data, it was an advantage to fit effects of haplotypes rather than effects of single SNPs with the model used in [17]. It must be noted that the data were provided with known haplotypes. In real life, haplotypes are estimated with errors, which may affect the results. The advantage of using haplotypes in this study is most likely because there is higher LD between the haplotypes and the QTL than between any of the individual markers and the QTL. On the other hand there are some disadvantages of fitting haplotype effects. These disadvantages include: 1) for a given position there are more effects to be estimated, 2) large haplotypes are more likely to break up by recombinations, 3) haplotypes are more sensitive to errors in the map. Therefore, the optimal size of haplotypes is a tradeoff between having a predictor in high LD with the underlying genes and the precision to estimate haplotype effects [18].
Models GP_{Bayes1} and GP_{Bayes2} fit haplotype effects with a correlation between alleles proportional to the probability of IBD. This is expected to perform better as more phenotypes contribute to the estimation of particular haplotype effects. However, Table 6 shows that the accuracies are slightly worse and regressions further from 1 compared with the other Bayesian models. In particular, it seems that the predictive ability, both assessed as accuracies and the rank correlations for the top animals, decreases faster over generations. A possible explanation is that the IBD based model is theoretically advantageous, but it may be associated with numerical problems. For instance, the IBD matrices calculated from pair wise comparisons of haplotypes are generally not positive definite and must be manipulated before inversion.
Inclusion of polygenic effects
Comparing GP_{Bayes1} to GP_{Bayes2} and GP_{Bayes3} to GP_{Bayes4} (Table 6) shows no effect of including a polygenic component in the model for these analyses. However, we do not know what proportion of genetic variance can be captured by SNP markers in real data. If SNP markers don't capture all the genetic variance it is still important to include a polygenic component. In the present study, the total genetic effect was simulated from biallelic QTL and the markers were uniformly distributed across the whole genome. However, in real data, part of the genetic variance comes from structural variation such as copy number variation, inversion, deletion etc. [19]. Also the simulated population is very homogeneous. In real data there are likely to be genetic structures that lead to spurious associations [20]. In such situations inclusion of pedigree may improve predictions more that what is observed in this simulated dataset.
Using EBVs or phenotypic values as response variables
Two of the contributions used both phenotypic values and EBVs as response variables in the analyses. In the study by Pimentel et al. [15], the correlation between TBV and GEBV derived from phenotypic values was slightly higher than the correlation for GEBV derived from EBVs. In the study by Macciotta et al. [14], using a model with SNP variance of σ^{2}_{G}/N_{SNPs}, the accuracy of GEBV based on EBVs was 0.53, but the accuracy was 0.77 for GEBV based on phenotypic values. The lower accuracy of GEBV obtained using EBVs as response variable is most likely due to information lost in the procedure of predicting breeding values, which do not have high accuracies themselves. If this is the case, EBVs should only be used as response variables when they have very high reliabilities. The same effect is seen in the models by [15] who also used both EBVs and phenotypes as the response variable.
Model validation
Several workshop participants performed internal validations by either estimating correlations between GEBVs and phenotypes or EBVs in the training set or by cross validation. When correlations are calculated within the training set only, this reflects mainly how well the model fit the data in the training set and not necessarily how well the model predict the next generation. When a statistical model includes too many covariates, relative to the amount of data available, the model may fit the data perfectly but have a poor predictive ability. This was most obvious in the results obtained using fixed effects models. In [11], the correlation between EBV and GEBV in the training set was highest when all markers were included and declined as markers were removed. The predictive ability was, however, actually very poor for the fixed model with all markers (GP_{Fix1}) and increased as markers were removed from the model (GP_{Fix2} – Gp_{Fix5}).
To assess the predictive ability, it is necessary to perform model validation. There are many approaches to model validation. A common approach in statistical practice is crossvalidation where data sample is partitioned into subsets. The analysis is then initially performed on a single of these subsets, while the others are retained for subsequent validation of the initial analysis. In the workshop, [15] validated the models using data of G_{t1}G_{t3} as training data and G_{t4} as test data. This approach may have the disadvantage of using EBVs for validation of GEBVs in G_{t4} that were based on information from phenotypes of individuals in G_{t1}G_{t3}. This creates a strong dependency between the data used for model development and validation, which could be avoided by using the phenotypic data in generation G_{t4} rather than EBVs. Villumsen et al. [17] used a 5fold cross validation where in each of the five validation sets, 20% of the data in G_{t4}, were taken as test data. The advantage of cross validation is that it makes it possible to retain training data as large as possible, while obtaining the amount of total test data as large as required (with maximal total test data equal to the whole data). In the simulated data of this workshop this strategy proved very efficient in selecting models that predicted the TBVs in the validation data very accurately.
Decrease of accuracies over generations
While three generations were simulated for validation, results are given as the mean accuracy of all three generations. This is because we could see no clear trend as to which differences in the models lead to a higher decrease in accuracy over the three generations. The only obvious result was a clear relation between the accuracy of GEBVs in generation G_{v1} and the decline from G_{v1} to G_{v3}, such that models with a high accuracy declined less than models with low accuracies.
It must be noted that the current comparison is based on a single replicate of simulated data and therefore the conclusions must be interpreted with caution.
Conclusion
The comparison of the different methods applied to the dataset by the workshop participants clearly shows a distinct clustering of the three approaches, where the Bayesian analyses gave the highest accuracies, followed by the BLUP models, while the fixed effects models generally had low accuracies and large error variance. However, some BLUP models were less biased than some Bayesian models. The BLUP models are clearly sensitive to the given SNP variance, because a BLUP does not estimate SNP variance, but takes the specified variance as the true variance. For instance, if the number of QTL would increase, and each QTL would have a smaller effect, it is expected that the differences between the BLUP and Bayesian models would be smaller. The current comparison suggests that Bayesian analyses on haplotypes or SNPs are the most promising approach for Genomic selection although the BLUP models may provide a computationally attractive alternative with little loss of efficiency. As already concluded by [2] fixed effect type models are unlikely to provide any gain over traditional pedigree indexes for selection.
Abbreviations
 GEBV:

genomic estimated breeding values
 GS:

genomic selection
 TBV:

true breeding value.
References
 1.
Fernando RL, Grossman M: Marker assisted selection using best linear unbiased prediction. Genet Sel Evol. 1989, 21: 467477. 10.1051/gse:19890407.
 2.
Meuwissen THE, Hayes BJ, Goddard ME: Prediction of total genetic value using genomewide dense marker map. Genetics. 2001, 157: 18191829.
 3.
Schaeffer LR: Strategy for applying genomewide selection in dairy cattle. J Anim Breed Genet. 2006, 123: 218223. 10.1111/j.14390388.2006.00595.x.
 4.
Gianola D, Fernando RL, Stella A: Genomicassisted prediction of genetic value with semiparametric procedures. Genetics. 2006, 173: 17611776. 10.1534/genetics.105.049510.
 5.
Haldane JBS: The combination of linkage values and the calculation of distances between the loci of linked factors. J Genet. 1919, 8: 299309. 10.1007/BF02983270.
 6.
NCBI 2005: National Centre for Biotechnology Information, MGI genetic map of the mouse genome (Mus musculus). [http://www.ncbi.nlm.nih.gov/Genomes/]
 7.
Hayes BJ, Goddard ME: The distribution of the effects og genes affecting quantitative traits in livestock. Genet Sel Evol. 2001, 33: 209229. 10.1051/gse:2001117.
 8.
McKay SD, Schnabel RD, Murdoch BM, Matukumalli LK, Aerts J, Coppieters W, Crews D, Dias Neto E, Gill CA, Gao C, Mannen H, Stothard P, Wang Z, Van Tassell CP, Williams JL, Taylor JF, Moore SS: Whole genome linkage disequilibrium maps in cattle. BMC Genetics. 2007, 8: 7410.1186/14712156874.
 9.
Sargolzaei M, Schenkel FS, Jansen GB, Schaeffer LR: Extent of Linkage Disequilibrium in Holstein Cattle in North America. J Dairy Sci. 2008, 91: 21062117. 10.3168/jds.20070553.
 10.
de Ross AP, Hayes BJ, Spelman RJ, Goddard ME: Linkage disequilibrium and persistence of phase in HolsteinFresian, Jersey and Angus cattle. Genetics. 179 (3): 15031512. 10.1534/genetics.107.084301.
 11.
Zukowski K, Suchochi T, Gontarek A, Szyda J: The impact of single nucleotide polymorphism selection on prediction of genomewide breeding values. BMC Proceedings. 2009, 3 (Suppl 1): S1310.1186/175365613s1s13.
 12.
Hastie T, Tibshirani R, Friedman J: The elements of statistical learning. 2003, New York, Springer
 13.
Macciotta PP, Gaspa G, Steri R, Pieramati C, Carnier P, Dimauro C: Preselection of the most significant SNPs for the estimation of genomic breeding values. BMC Proceedings. 2009, 3 (Suppl 1): S1410.1186/175365613s1s14.
 14.
Habier D, Fernando RL, Dekkers JCM: The impact on relationship information on genomeassisted breeding values. Genetics. 2007, 177: 23892397.
 15.
Pimentel ECG, König S, Schenkel FS, Simianer H: Comparison of statistical procedures for estimating polygenic effects using dense genomewide marker data. BMC Proceedings. 2009, 3 (Suppl 1): S1210.1186/175365613s1s12.
 16.
Calus MPL, de Roos APV, Veerkamp RF: Estimating genomic breeding values from the QTLMAS Workshop data using single SNP regression and the haplotype/IBD approach. BMC Proceedings. 2009, 3 (Suppl 1): S1010.1186/175365613s1s10.
 17.
Villumsen TM, Janss L: Bayesian genomic selection: the effect of haplotype length and priors. BMC Proceedings. 2009, 3 (Suppl 1): S1110.1186/175365613s1s11.
 18.
Villumsen TM, Janss L, Lund MS: The importance of haplotype length and heritability using genomic selection in dairy cattle. J Anim Breed Genet. 2008, Published online: 24 Sep 2008; doi 10.1111/j.14390388.2008.00747.x
 19.
Andrew J, Chang SZ, Eichler EE: Structural Variation of the human genome. Annual Review of Genomics and Human Genetics. 2006, 7 (1): 40710.1146/annurev.genom.7.080505.115618.
 20.
Yu J, Pressoir G, Briggs WH, Vroh Bi I, Yamasaki M, Doebley JF, McMullen MD, Gaut BS, Nielsen DM, Holland JB, Kresovich S, Buckler ES: A unified mixedmodel method for association mapping that accounts for multiple levels of relatedness. Nature Genetics. 2006, 38: 203208. 10.1038/ng1702.
Acknowledgements
This article has been published as part of BMC Proceedings Volume 3 Supplement 1, 2009: Proceedings of the 12th European workshop on QTL mapping and marker assisted selection. The full contents of the supplement are available online at http://www.biomedcentral.com/17536561/3?issue=S1.
Author information
Affiliations
Corresponding author
Additional information
Competing interests
The authors declare that they have no competing interests.
Authors' contributions
MSL, ÖC, and DJK designed the simulations. GSa and MSL wrote simulation software. GSa performed the simulations and computed correlations to true breeding values. MSL drafted the manuscript and all authors contributed to interpretation of results.
Rights and permissions
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
About this article
Cite this article
Lund, M.S., Sahana, G., de Koning, D. et al. Comparison of analyses of the QTLMAS XII common dataset. I: Genomic selection. BMC Proc 3, S1 (2009). https://doi.org/10.1186/175365613S1S1
Published: