Volume 3 Supplement 1
Comparison of analyses of the QTLMAS XII common dataset. I: Genomic selection
© Lund et al; licensee BioMed Central Ltd. 2009
Published: 23 February 2009
A dataset was simulated and distributed to participants of the QTLMAS XII workshop, who were invited to develop genomic selection models. Each contributing group was asked to describe the model development and validation as well as to submit genomic predictions for three generations of individuals, for which they only knew the genotypes. The organisers used these genomic predictions to perform the final validation by comparison to the true breeding values, which were known only to the organisers. The methods used by the five groups fell into three classes: 1) fixed effects models, 2) BLUP models, and 3) Bayesian MCMC-based models. The Bayesian analyses gave the highest accuracies, followed by the BLUP models, while the fixed effects models generally had low accuracies and large error variance. The best BLUP models as well as the best Bayesian models gave unbiased predictions. The BLUP models are clearly sensitive to the assumed SNP variance, because they do not estimate SNP variance but take the specified variance as the true variance. The current comparison suggests that Bayesian analyses on haplotypes or SNPs are the most promising approach for genomic selection, although the BLUP models may provide a computationally attractive alternative with little loss of efficiency. On the other hand, fixed effects models are unlikely to provide any gain over traditional pedigree indexes for selection.
Hybrid Marker Assisted Selection (MAS) schemes were the first tool proposed to include information on a few main genes or quantitative trait loci (QTL) in best linear unbiased prediction (BLUP) of breeding values, e.g. [1]. More recently, genomic selection (GS) [2] was proposed. This approach relies on a genome-wide dense marker map, such that markers in linkage disequilibrium (LD) with each QTL are available. GS hence uses the data on all available markers to produce genomic estimated breeding values (GEBV) by summing the effects of all small chromosome segments characterized by their marker alleles. Because predictions using GS are based on marker associations and not on pedigree information, the requirement to have phenotypes on selection candidates or their close relatives is relaxed, and a breeding value can be obtained as soon as the genotypes are available. As a result, the method has the potential to increase genetic progress and to reduce costs [3]. In breeding schemes with long generation intervals (e.g. cattle) the increase will mainly be a consequence of shortening the interval. In breeding schemes with shorter generation intervals the genetic gain may be increased due to higher accuracies of breeding values at the time of selection. With new genotyping technologies becoming available for livestock species (e.g. Bovine50 Bead Chip, http://www.illumin.com), GS is now becoming a very attractive approach to predict breeding values.
Several statistical methods have been proposed for use in genomic prediction models [2, 4]. Apart from the specific models used, the prediction of genomic breeding values also involves steps of data editing, choice of response variable, utilization of marker information, and model validation.
The aim of this study was to compare approaches used to predict genomic breeding values in a situation that mimics real life. This was done by distributing genotypic data, phenotypic data, and pedigree information, but not the true breeding values to participants of the QTL-MAS XII workshop. Participants then applied various approaches to predict genomic breeding values for three generations of non-phenotyped individuals. The organisers then compared the predictions with true simulated breeding values.
Simulation of data
Marker and QTL
Marker alleles were sampled for 6000 biallelic loci on 6 chromosomes (1000 markers on each chromosome) with 0.1 cM between adjacent loci. In the founder individuals (Gh1), the two alleles at each marker locus were sampled with equal probabilities. Recombination was sampled according to Haldane's mapping function [5]. A total of 48 QTL were simulated. QTL positions were sampled under the assumption of a multinomial distribution of genes across the genome. The multinomial distribution used was based on the genetic map of the mouse genome [6]. The allele substitution effects of the QTL were drawn from a gamma distribution with scale parameter (α) 5.4 and shape parameter (β) 0.42, following [7]. We wanted to control the genetic effects and genomic location of four QTL. To achieve this, we replaced the four of the 48 QTL that were closest to the desired locations with QTL of pre-defined effects. The allele substitution effects of these four fixed QTL were standardized based on their individual allelic frequencies in the last generation of the historic pedigree, so that each of these QTL explains a predefined percentage of the genetic variance. Positions, allele substitution effects, and contributions to genetic and phenotypic variance for each QTL are available at http://www.computationalgenetics.se/QTLMAS08/QTLMAS/DATA_files/QTLMAS_simulated_effects.pdf
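The marker sampling described above can be illustrated with a minimal Python sketch. It is not the organisers' simulation program: the function names, the random seed, and the single-chromosome setup are assumptions made for illustration. Only the marker count, the 0.1 cM spacing, the equal allele probabilities in founders, and the use of Haldane's mapping function come from the text.

```python
import math
import random

def haldane_recombination_fraction(distance_cm):
    """Haldane's mapping function: map distance in cM -> recombination fraction."""
    d_morgan = distance_cm / 100.0
    return 0.5 * (1.0 - math.exp(-2.0 * d_morgan))

def sample_founder_haplotype(n_loci, rng):
    """Founder alleles: 0 or 1 with equal probability at every locus."""
    return [rng.randint(0, 1) for _ in range(n_loci)]

def sample_gamete(hap_a, hap_b, r_adjacent, rng):
    """Copy one parental strand, switching strands with probability
    r_adjacent between each pair of adjacent loci (no interference)."""
    parents = (hap_a, hap_b)
    current = rng.randint(0, 1)  # strand the gamete starts on
    gamete = []
    for locus in range(len(hap_a)):
        if locus > 0 and rng.random() < r_adjacent:
            current = 1 - current  # crossover in this interval
        gamete.append(parents[current][locus])
    return gamete

rng = random.Random(2009)   # arbitrary seed (assumption)
n_loci = 1000               # markers per chromosome, from the text
spacing_cm = 0.1            # distance between adjacent loci, from the text

r = haldane_recombination_fraction(spacing_cm)
hap_a = sample_founder_haplotype(n_loci, rng)
hap_b = sample_founder_haplotype(n_loci, rng)
gamete = sample_gamete(hap_a, hap_b, r, rng)

print(f"recombination fraction between adjacent markers: {r:.6f}")
print(f"chromosome length spanned: {(n_loci - 1) * spacing_cm:.1f} cM")
```

At this marker density the recombination fraction between neighbours is just under 0.001, so long stretches of parental haplotype are transmitted intact.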
The phenotypes were obtained as the cumulative effects of the 44 randomly drawn QTL, the 4 defined QTL, and a random residual. First, the effects of the 44 random-QTL were summed. The effect was standardized to mean 0 and variance 1. Then the effects of four fixed QTL were added to constitute the total genetic effect. The residual variance was defined to obtain a heritability of 0.30. The individuals' phenotypes were derived as the sum of the individuals' genetic value and a random residual drawn from a normal distribution with mean zero and variance equal to the residual variance.
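The phenotype construction can be sketched as follows. The genetic components below are hypothetical placeholders, not the workshop QTL effects, and the sketch assumes that the heritability of 0.30 refers to the total genetic variance, so that the residual variance is σ²e = σ²G (1 - h²) / h².

```python
import random
import statistics

def standardize(values):
    """Rescale values to mean 0 and variance 1."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]

def residual_variance(genetic_variance, heritability):
    """Residual variance that yields the requested heritability."""
    return genetic_variance * (1.0 - heritability) / heritability

rng = random.Random(12)   # arbitrary seed (assumption)
n_animals = 2000          # hypothetical population size

# Hypothetical stand-ins for the summed effects of the 44 random QTL
# and of the 4 fixed QTL in each animal.
random_qtl_sum = [rng.gauss(0.0, 3.0) for _ in range(n_animals)]
fixed_qtl_sum = [rng.choice([-0.5, 0.0, 0.5]) for _ in range(n_animals)]

# Step 1: standardize the random-QTL part; step 2: add the fixed QTL.
polygenic = standardize(random_qtl_sum)
genetic = [g + f for g, f in zip(polygenic, fixed_qtl_sum)]

# Step 3: choose the residual variance for h2 = 0.30 and draw phenotypes.
var_g = statistics.pvariance(genetic)
var_e = residual_variance(var_g, 0.30)
phenotype = [g + rng.gauss(0.0, var_e ** 0.5) for g in genetic]

realized_h2 = var_g / statistics.pvariance(phenotype)
print(f"residual variance: {var_e:.3f}, realized h2: {realized_h2:.3f}")
```

The realized heritability differs slightly from 0.30 in any one sample because the residuals are random draws.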
Validation of prediction models
Participants of the QTLMAS XII workshop were provided with pedigree, phenotypic, and genomic data on the 4665 individuals of generations Gt1 to Gt4, and with only pedigree and genomic data on generations Gv1 to Gv3. Using these data, they applied various models to estimate GEBVs for individuals in generations Gv1 to Gv3. The reported GEBVs were then assessed by three criteria relating them to the true breeding values (TBV), which were known only to the organisers of the workshop.
The first criterion was the accuracy of the GEBVs as a measure of their predictive ability. Accuracies were calculated as the correlation between GEBVs and TBVs. The second criterion was the bias of the GEBVs, assessed as the coefficients of regressing TBVs on GEBVs. A regression coefficient close to 1 indicates that predictions are unbiased. The third criterion was the rank correlations between GEBVs and TBVs in the top 10% of individuals ranked on TBVs.
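The three criteria can be computed as in the sketch below. The function names and the toy data are assumptions; the text does not say which rank correlation was used, so Spearman's coefficient (a Pearson correlation on ranks) is assumed here.

```python
import random

def pearson(x, y):
    """Pearson correlation between two equally long sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def regression_slope(y, x):
    """Slope of the regression of y on x (here: TBV on GEBV)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def ranks(values):
    """Ranks starting at 1; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result

def top_fraction_rank_correlation(tbv, gebv, fraction=0.10):
    """Spearman correlation among the top animals ranked on TBV."""
    n_top = max(2, round(len(tbv) * fraction))
    top = sorted(range(len(tbv)), key=lambda i: tbv[i], reverse=True)[:n_top]
    return pearson(ranks([tbv[i] for i in top]), ranks([gebv[i] for i in top]))

# Toy data (hypothetical): predictions that rank animals perfectly
# but are spread twice as widely as the true breeding values.
rng = random.Random(15)
tbv = [rng.gauss(0.0, 1.0) for _ in range(200)]
gebv = [2.0 * t + 1.0 for t in tbv]

accuracy = pearson(gebv, tbv)
bias_slope = regression_slope(tbv, gebv)
top_rank_corr = top_fraction_rank_correlation(tbv, gebv)
print(accuracy, bias_slope, top_rank_corr)
```

In this toy case the accuracy and the rank correlation are both close to 1, while the regression coefficient is close to 0.5. This shows why the second criterion is reported separately: a regression coefficient below 1 flags over-dispersed GEBVs even when the ranking is perfect.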
Genomic selection models
Table 1. Fixed effects models, their name in the contributed paper, and number of SNPs fitted.
Table 2. Random effects BLUP models, their name in the contributed paper, assumed SNP variance and response variable used.
Table 3. Bayesian models, their name in the contributed paper, assumptions on SNP effects and polygenic effects.
Results and discussion
Fixed effect models
Table 4. Comparison of genomic estimated breeding values (GEBV) and true breeding values (TBV) for fixed effects models.
All evaluated fixed effects models severely overestimated breeding values (Table 4). The regression coefficients of TBVs on GEBVs were between 0.03 and 0.53, averaged over generations Gv1 to Gv3. The biases increased with the number of markers in the model and were particularly strong when all SNPs were used. Table 4 also shows that the rank correlations drop dramatically with the inclusion of more markers in the model. This means that the low regression coefficient for this model was not only due to a scaling effect. In the worst case, more parameters are fitted than there are observations, which leads to serious over-fitting of the data. This is known to result in unstable predictions with large prediction errors [12]. As a result, the over-fitted model had poor predictive ability for data beyond the training set, resulting in a large variance of GEBVs. These results are well in line with the original observation of [2], where it was found that a least squares approach leads to biased GEBVs and poor predictive ability. On the other hand, relatively good results have been reported for fixed regression models when markers were selected based on associations with the phenotypes using a liberal significance threshold. This indicates that fixed effects models could be more useful than the current simulations indicate, if a procedure is used that balances the problems of over-fitting the data and selecting SNPs with overestimated effects.
Random effect models
Table 5. Comparison of genomic estimated breeding values (GEBV) and true breeding values (TBV) for BLUP models.
The effect of the assumed variance explained by each SNP on the accuracy of GEBV can also be seen in the other two contributions. In the contribution by Macciotta et al., GEBV were obtained from a BLUP model with the SNP variance set to either $\sigma^2_G$ or $\sigma^2_G/N_{SNPs}$. Here the correlation between TBVs and GEBVs increased from 0.55 to 0.77 and the regression coefficient increased from 0.41 to 0.94 when phenotypic values were used as the response variable. In the study by Pimentel et al., the correlation was 0.22 and the regression was 0.06 using a model with the variance of each SNP equal to the residual variance, while the correlation was 0.51 and the regression was 0.31 using a model with the variance $\sigma^2_G/N_{SNPs}$. These results clearly show that the importance of fitting the SNP effects as random effects, and of providing a reasonable SNP variance, increases with the number of markers included in the models.
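Why the assumed SNP variance matters can be seen in a small ridge-type SNP-BLUP sketch. This is a generic illustration under stated assumptions, not the model of any contribution: the genotypes, effects, and variances are hypothetical, and the mean is handled by centring. The SNP effects solve (Z'Z + λI) a = Z'y with λ = σ²e / σ²SNP, so the assumed SNP variance directly sets the amount of shrinkage.

```python
import random

def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x

def snp_blup(genotypes, y, var_e, var_snp):
    """BLUP of SNP effects given an assumed (not estimated) SNP variance."""
    n, p = len(genotypes), len(genotypes[0])
    lam = var_e / var_snp  # shrinkage parameter
    y_mean = sum(y) / n
    col_means = [sum(row[j] for row in genotypes) / n for j in range(p)]
    z = [[row[j] - col_means[j] for j in range(p)] for row in genotypes]
    yc = [v - y_mean for v in y]
    lhs = [[sum(z[i][j] * z[i][k] for i in range(n)) + (lam if j == k else 0.0)
            for k in range(p)] for j in range(p)]
    rhs = [sum(z[i][j] * yc[i] for i in range(n)) for j in range(p)]
    return solve_linear_system(lhs, rhs)

# Hypothetical toy data: 60 animals, 20 SNPs coded 0/1/2, three causal SNPs.
rng = random.Random(27)
n, p = 60, 20
genotypes = [[rng.choice([0, 1, 2]) for _ in range(p)] for _ in range(n)]
true_effects = [1.0, -0.8, 0.6] + [0.0] * (p - 3)
y = [sum(g * a for g, a in zip(row, true_effects)) + rng.gauss(0.0, 1.5)
     for row in genotypes]

var_g, var_e = 1.0, 7.0 / 3.0  # assumed variances (h2 = 0.30)
effects_full = snp_blup(genotypes, y, var_e, var_g)       # each SNP: sigma2_G
effects_share = snp_blup(genotypes, y, var_e, var_g / p)  # each SNP: sigma2_G / N_SNPs

norm_full = sum(a * a for a in effects_full) ** 0.5
norm_share = sum(a * a for a in effects_share) ** 0.5
print(f"effect norm, SNP variance sigma2_G:          {norm_full:.3f}")
print(f"effect norm, SNP variance sigma2_G / N_SNPs: {norm_share:.3f}")
```

Assigning the whole genetic variance to every SNP gives weak shrinkage and therefore larger, more dispersed effect estimates, which is consistent with the lower regression coefficients reported for that setting.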
Mixture distribution/Bayesian models
Table 6. Comparison of genomic estimated breeding values (GEBV) and true breeding values (TBV) for Bayesian models.
All Bayesian models used Gibbs sampling algorithms, in which marker or QTL effects were assumed to follow a mixture distribution, where relatively few markers were assumed to explain a large variance and a large number explained a very small variance. This is most likely the main reason for the improved performance of these models over the BLUP models in which all markers are assumed to explain the same amount of variance. The assumption of a homogeneous variance for all markers leads to a poor prediction of the effect of a QTL with a large contribution to the trait from a single marker even if they are in complete LD. In this simulated dataset the 10 largest QTL explain 82.9% of the genetic variance. This may favour the Bayesian models relative to BLUP models, compared to situations with a large number of QTL contributing more equally to the genetic variance.
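How a gamma distribution of effects concentrates the genetic variance in a few QTL can be illustrated with the sketch below. It does not reproduce the 82.9% figure: the allele frequencies are hypothetical, the four fixed QTL are ignored, and the QTL are assumed to be in linkage equilibrium so that their variances add. The additive variance contributed by a biallelic QTL with allele frequency p and substitution effect a is 2p(1 - p)a². Note that Python's `random.gammavariate` takes the shape first and the scale second, whereas the text labels the scale α and the shape β.

```python
import random

def qtl_variance(allele_freq, substitution_effect):
    """Additive variance of a biallelic QTL: 2 p (1 - p) a^2."""
    return 2.0 * allele_freq * (1.0 - allele_freq) * substitution_effect ** 2

def top_k_share(variances, k):
    """Fraction of the summed variance explained by the k largest terms."""
    ordered = sorted(variances, reverse=True)
    return sum(ordered[:k]) / sum(ordered)

rng = random.Random(48)  # arbitrary seed (assumption)
n_qtl = 48               # number of QTL, from the text

# Shape 0.42 and scale 5.4 as in the text; gammavariate(shape, scale).
effects = [rng.gammavariate(0.42, 5.4) for _ in range(n_qtl)]
# Hypothetical allele frequencies; the simulated ones are not given here.
freqs = [rng.uniform(0.05, 0.95) for _ in range(n_qtl)]

variances = [qtl_variance(p, a) for p, a in zip(freqs, effects)]
share_top10 = top_k_share(variances, 10)
print(f"share of variance from the 10 largest QTL: {share_top10:.3f}")
```

With a shape parameter well below 1, most draws are small and a few are large, so the ten largest QTL typically account for far more than the 10/48 they would under equal contributions.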
Fitting single SNP or haplotype effects
Internal validation by one of the contributing groups showed that, for these data, it was an advantage to fit effects of haplotypes rather than effects of single SNPs with the model that group used. It must be noted that the data were provided with known haplotypes. In real life, haplotypes are estimated with errors, which may affect the results. The advantage of using haplotypes in this study is most likely because there is higher LD between the haplotypes and the QTL than between any of the individual markers and the QTL. On the other hand, there are some disadvantages of fitting haplotype effects: 1) for a given position there are more effects to be estimated, 2) large haplotypes are more likely to be broken up by recombination, and 3) haplotypes are more sensitive to errors in the map. Therefore, the optimal size of haplotypes is a trade-off between having a predictor in high LD with the underlying genes and the precision with which haplotype effects can be estimated.
Models GPBayes1 and GPBayes2 fit haplotype effects with a correlation between alleles proportional to the probability of IBD. This is expected to perform better, as more phenotypes contribute to the estimation of particular haplotype effects. However, Table 6 shows that the accuracies are slightly worse and the regressions further from 1 compared with the other Bayesian models. In particular, it seems that the predictive ability, assessed both as accuracies and as rank correlations for the top animals, decreases faster over generations. A possible explanation is that the IBD-based model is theoretically advantageous but may be associated with numerical problems. For instance, the IBD matrices calculated from pairwise comparisons of haplotypes are generally not positive definite and must be manipulated before inversion.
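The kind of manipulation meant here can be illustrated with one simple option: inflating the diagonal until a Cholesky factorization succeeds. This is only a sketch of a generic remedy and an assumption on our part; the contributions may have used a different procedure (for example eigenvalue bending), and the 3 x 3 matrix below is a hypothetical example, not an IBD matrix from the workshop data.

```python
def cholesky(matrix):
    """Return the lower-triangular Cholesky factor, or None if the matrix
    is not (numerically) positive definite."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                d = matrix[i][i] - s
                if d <= 1e-12:
                    return None  # non-positive pivot: not positive definite
                lower[i][j] = d ** 0.5
            else:
                lower[i][j] = (matrix[i][j] - s) / lower[j][j]
    return lower

def add_to_diagonal_until_pd(matrix, start=1e-4, factor=2.0, max_steps=60):
    """Add an increasing constant to the diagonal until Cholesky succeeds.
    Returns the adjusted matrix and the constant that was added."""
    if cholesky(matrix) is not None:
        return matrix, 0.0
    delta = start
    for _ in range(max_steps):
        adjusted = [[v + (delta if i == j else 0.0) for j, v in enumerate(row)]
                    for i, row in enumerate(matrix)]
        if cholesky(adjusted) is not None:
            return adjusted, delta
        delta *= factor
    raise ValueError("matrix could not be made positive definite")

# Hypothetical pairwise "similarity" matrix that is not positive definite:
# haplotypes 1-2 and 2-3 look very similar, but 1-3 look almost unrelated.
ibd_like = [
    [1.0, 0.9, 0.1],
    [0.9, 1.0, 0.9],
    [0.1, 0.9, 1.0],
]

adjusted, delta = add_to_diagonal_until_pd(ibd_like)
print("positive definite before:", cholesky(ibd_like) is not None)
print("added to diagonal:", delta)
```

Pairwise-computed similarities need not be mutually consistent, which is how such a matrix can end up with a negative eigenvalue. The price of the fix is that the adjusted matrix no longer equals the pairwise IBD probabilities.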
Inclusion of polygenic effects
Comparing GPBayes1 to GPBayes2 and GPBayes3 to GPBayes4 (Table 6) shows no effect of including a polygenic component in the model for these analyses. However, we do not know what proportion of genetic variance can be captured by SNP markers in real data. If SNP markers do not capture all the genetic variance, it is still important to include a polygenic component. In the present study, the total genetic effect was simulated from bi-allelic QTL and the markers were uniformly distributed across the whole genome. However, in real data, part of the genetic variance comes from structural variation such as copy number variation, inversions, and deletions [19]. Also, the simulated population is very homogeneous. In real data there are likely to be genetic structures that lead to spurious associations [20]. In such situations, inclusion of pedigree may improve predictions more than what is observed in this simulated dataset.
Using EBVs or phenotypic values as response variables
Two of the contributions used both phenotypic values and EBVs as response variables in the analyses. In the study by Pimentel et al., the correlation between TBV and GEBV derived from phenotypic values was slightly higher than the correlation for GEBV derived from EBVs. In the study by Macciotta et al., using a model with SNP variance of $\sigma^2_G/N_{SNPs}$, the accuracy of GEBV based on EBVs was 0.53, but the accuracy was 0.77 for GEBV based on phenotypic values. The lower accuracy of GEBV obtained using EBVs as the response variable is most likely due to information lost in the procedure of predicting breeding values, which do not have high accuracies themselves. If this is the case, EBVs should only be used as response variables when they have very high reliabilities. The same effect is seen in other models that used both EBVs and phenotypes as the response variable.
Several workshop participants performed internal validations, either by estimating correlations between GEBVs and phenotypes or EBVs in the training set or by cross validation. When correlations are calculated within the training set only, they mainly reflect how well the model fits the data in the training set and not necessarily how well the model predicts the next generation. When a statistical model includes too many covariates relative to the amount of data available, the model may fit the data perfectly but have poor predictive ability. This was most obvious in the results obtained using fixed effects models. In the fixed effects analyses, the correlation between EBV and GEBV in the training set was highest when all markers were included and declined as markers were removed. The predictive ability was, however, actually very poor for the fixed model with all markers (GPFix1) and increased as markers were removed from the model (GPFix2 to GPFix5).
To assess the predictive ability, it is necessary to perform model validation. There are many approaches to model validation. A common approach in statistical practice is cross-validation, where the data sample is partitioned into subsets. The analysis is then initially performed on one of these subsets, while the others are retained for subsequent validation of the initial analysis. In the workshop, models were also validated using data of Gt1-Gt3 as training data and Gt4 as test data. This approach may have the disadvantage of using EBVs for validation of GEBVs in Gt4 that were based on information from phenotypes of individuals in Gt1-Gt3. This creates a strong dependency between the data used for model development and validation, which could be avoided by using the phenotypic data in generation Gt4 rather than EBVs. Villumsen et al. used a 5-fold cross validation in which, for each of the five validation sets, 20% of the data in Gt4 were taken as test data. The advantage of cross validation is that the training data can be kept as large as possible while the total amount of test data can be made as large as required (up to the whole dataset). In the simulated data of this workshop, this strategy proved very efficient in selecting models that predicted the TBVs in the validation data very accurately.
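A 5-fold partition of this kind can be sketched as follows. The animal identifiers and generation sizes are hypothetical, and keeping the earlier generations in every training set is an assumption; the text only states that 20% of the Gt4 data formed each test set.

```python
import random

def k_fold_splits(ids, k, rng):
    """Yield (train, test) lists; each id appears in exactly one test fold."""
    shuffled = ids[:]
    rng.shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    for test in folds:
        test_set = set(test)
        train = [i for i in shuffled if i not in test_set]
        yield train, test

# Hypothetical identifiers: 1000 animals in Gt4, 3000 in Gt1-Gt3.
gt4_ids = list(range(1000))
earlier_ids = list(range(1000, 4000))

splits = list(k_fold_splits(gt4_ids, 5, random.Random(5)))
for fold, (train_gt4, test) in enumerate(splits, start=1):
    # Assumption: earlier generations stay in the training data of every fold.
    training = earlier_ids + train_gt4
    print(f"fold {fold}: {len(training)} training animals, {len(test)} test animals")
```

Across the five folds every Gt4 animal is used for testing exactly once, while each model is still trained on 80% of Gt4.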
Decrease of accuracies over generations
While three generations were simulated for validation, results are given as the mean accuracy over all three generations. This is because we could see no clear trend as to which differences between the models led to a larger decrease in accuracy over the three generations. The only obvious result was a clear relation between the accuracy of GEBVs in generation Gv1 and the decline from Gv1 to Gv3, such that models with a high accuracy declined less than models with low accuracies.
It must be noted that the current comparison is based on a single replicate of simulated data and therefore the conclusions must be interpreted with caution.
The comparison of the different methods applied to the dataset by the workshop participants clearly shows a distinct clustering of the three approaches: the Bayesian analyses gave the highest accuracies, followed by the BLUP models, while the fixed effects models generally had low accuracies and large error variance. However, some BLUP models were less biased than some Bayesian models. The BLUP models are clearly sensitive to the given SNP variance, because BLUP does not estimate SNP variance but takes the specified variance as the true variance. The ranking of methods also depends on the genetic architecture: if the number of QTL increased and each QTL had a smaller effect, the differences between the BLUP and Bayesian models would be expected to be smaller. The current comparison suggests that Bayesian analyses on haplotypes or SNPs are the most promising approach for genomic selection, although the BLUP models may provide a computationally attractive alternative with little loss of efficiency. As already concluded by [2], fixed effect type models are unlikely to provide any gain over traditional pedigree indexes for selection.
List of abbreviations used
GEBV: genomic estimated breeding values
TBV: true breeding value
This article has been published as part of BMC Proceedings Volume 3 Supplement 1, 2009: Proceedings of the 12th European workshop on QTL mapping and marker assisted selection. The full contents of the supplement are available online at http://www.biomedcentral.com/1753-6561/3?issue=S1.
1. Fernando RL, Grossman M: Marker assisted selection using best linear unbiased prediction. Genet Sel Evol. 1989, 21: 467-477. 10.1051/gse:19890407.
2. Meuwissen THE, Hayes BJ, Goddard ME: Prediction of total genetic value using genome-wide dense marker map. Genetics. 2001, 157: 1819-1829.
3. Schaeffer LR: Strategy for applying genome-wide selection in dairy cattle. J Anim Breed Genet. 2006, 123: 218-223. 10.1111/j.1439-0388.2006.00595.x.
4. Gianola D, Fernando RL, Stella A: Genomic-assisted prediction of genetic value with semiparametric procedures. Genetics. 2006, 173: 1761-1776. 10.1534/genetics.105.049510.
5. Haldane JBS: The combination of linkage values and the calculation of distances between the loci of linked factors. J Genet. 1919, 8: 299-309. 10.1007/BF02983270.
6. NCBI 2005: National Centre for Biotechnology Information, MGI genetic map of the mouse genome (Mus musculus). [http://www.ncbi.nlm.nih.gov/Genomes/]
7. Hayes BJ, Goddard ME: The distribution of the effects of genes affecting quantitative traits in livestock. Genet Sel Evol. 2001, 33: 209-229. 10.1051/gse:2001117.
8. McKay SD, Schnabel RD, Murdoch BM, Matukumalli LK, Aerts J, Coppieters W, Crews D, Dias Neto E, Gill CA, Gao C, Mannen H, Stothard P, Wang Z, Van Tassell CP, Williams JL, Taylor JF, Moore SS: Whole genome linkage disequilibrium maps in cattle. BMC Genetics. 2007, 8: 74. 10.1186/1471-2156-8-74.
9. Sargolzaei M, Schenkel FS, Jansen GB, Schaeffer LR: Extent of Linkage Disequilibrium in Holstein Cattle in North America. J Dairy Sci. 2008, 91: 2106-2117. 10.3168/jds.2007-0553.
10. de Roos AP, Hayes BJ, Spelman RJ, Goddard ME: Linkage disequilibrium and persistence of phase in Holstein-Friesian, Jersey and Angus cattle. Genetics. 2008, 179 (3): 1503-1512. 10.1534/genetics.107.084301.
11. Zukowski K, Suchochi T, Gontarek A, Szyda J: The impact of single nucleotide polymorphism selection on prediction of genomewide breeding values. BMC Proceedings. 2009, 3 (Suppl 1): S13. 10.1186/1753-6561-3-s1-s13.
12. Hastie T, Tibshirani R, Friedman J: The elements of statistical learning. 2003, New York, Springer
13. Macciotta PP, Gaspa G, Steri R, Pieramati C, Carnier P, Dimauro C: Pre-selection of the most significant SNPs for the estimation of genomic breeding values. BMC Proceedings. 2009, 3 (Suppl 1): S14. 10.1186/1753-6561-3-s1-s14.
14. Habier D, Fernando RL, Dekkers JCM: The impact on relationship information on genome-assisted breeding values. Genetics. 2007, 177: 2389-2397.
15. Pimentel ECG, König S, Schenkel FS, Simianer H: Comparison of statistical procedures for estimating polygenic effects using dense genome-wide marker data. BMC Proceedings. 2009, 3 (Suppl 1): S12. 10.1186/1753-6561-3-s1-s12.
16. Calus MPL, de Roos APV, Veerkamp RF: Estimating genomic breeding values from the QTL-MAS Workshop data using single SNP regression and the haplotype/IBD approach. BMC Proceedings. 2009, 3 (Suppl 1): S10. 10.1186/1753-6561-3-s1-s10.
17. Villumsen TM, Janss L: Bayesian genomic selection: the effect of haplotype length and priors. BMC Proceedings. 2009, 3 (Suppl 1): S11. 10.1186/1753-6561-3-s1-s11.
18. Villumsen TM, Janss L, Lund MS: The importance of haplotype length and heritability using genomic selection in dairy cattle. J Anim Breed Genet. 2008, Published online: 24 Sep 2008; doi 10.1111/j.1439-0388.2008.00747.x
19. Andrew J, Chang SZ, Eichler EE: Structural Variation of the human genome. Annual Review of Genomics and Human Genetics. 2006, 7 (1): 407. 10.1146/annurev.genom.7.080505.115618.
20. Yu J, Pressoir G, Briggs WH, Vroh Bi I, Yamasaki M, Doebley JF, McMullen MD, Gaut BS, Nielsen DM, Holland JB, Kresovich S, Buckler ES: A unified mixed-model method for association mapping that accounts for multiple levels of relatedness. Nature Genetics. 2006, 38: 203-208. 10.1038/ng1702.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.