
Pre-selection of most significant SNPS for the estimation of genomic breeding values


The availability of a large number of SNP markers throughout the genome of different livestock species offers the opportunity to estimate genomic breeding values (GEBVs). However, the estimation of many effects in a data set of limited size represents a severe statistical problem. A pre-selection of SNPs based on single regression may provide a reasonable compromise between accuracy of results, number of independent variables to be considered and computing requirements.

A total of 595 and 618 SNPs were pre-selected using a simple linear regression for each SNP, based on phenotypes or polygenic EBVs, respectively, with an average distance of about 1 cM between them. Chromosome 4 had the largest number of selected SNPs. Average correlations between GEBVs and TBVs were about 0.82 and 0.73 for the TRAINING generations when phenotypes or polygenic EBVs, respectively, were used as the dependent variable, whereas they decreased to 0.66 and 0.54 for the PREDICTION generations. The pre-selection of SNPs using phenotypes as the dependent variable, together with a BLUP estimation of marker genotype effects in which each marker contributes a variance equal to $\sigma^2_a/n_{snps}$, resulted in a remarkable accuracy of GEBV estimation (0.77) in the PREDICTION generations.


The availability of a large number of SNP markers throughout the genome of several livestock species allows the prediction of genomic breeding values (GEBVs) as the sum of the effects of the different haplotype intervals that cover the whole genome [1–3]. Although a dense marker map is a great advantage for identifying genome regions involved in the determinism of a trait, the estimation of the effects of a large number of haplotypes (up to several hundred thousand) from a limited number of phenotypes (some hundreds) is a substantial statistical and computational problem. BLUP methodologies are able to predict more haplotype effects than data points by treating them as random and assuming an equal variance for each interval [1]. Furthermore, the reconstruction of parental haplotypes can be avoided by using SNP genotypes directly. Indeed, the use of single markers instead of haplotypes resulted in only a slight reduction of accuracy in QTL fine mapping [4] and in no appreciable difference in MAS accuracy for a low heritability trait [2]. A further issue is whether or not all SNPs should be included in a predictive model [5]. A pre-selection of SNPs based on single regression could be a reasonable option to speed up calculations. Meuwissen et al. [1] reported an overestimation of haplotype effects and a low accuracy of GEBVs when a least-squares stepwise regression approach was used both to pre-select SNPs and to estimate their effects. In contrast, in genome-wide association analysis, the regression-based pre-selection of SNPs yielded a reasonable statistical power in QTL detection [6]. In any case, considering the size of available SNP platforms (50 k for cattle), a pre-selection of SNPs combined with a BLUP estimation of marker effects could be an acceptable compromise between accuracy of GEBVs, number of independent variables to be considered and computing requirements.
In the present paper, genomic breeding values were estimated on a simulated data set of 5,865 individuals by first selecting the most relevant SNPs and then using a BLUP methodology to estimate marker effects.



A simulated data set of 5,865 individuals generated for the XII QTL-MAS workshop was used. The genome consisted of six chromosomes, with a total of 6,000 SNP marker loci (1,000 per chromosome). Individuals of the first four generations (TRAINING data set) had pedigree, phenotype, and marker information available whereas those of the last three generations (PREDICTION data set) had only pedigree and marker information.

Polygenic breeding value prediction

Variance components and polygenic breeding values were estimated by analyzing the whole data set with the following single trait BLUP animal model using the MTDFREML package:

$$y_{ijk} = SEX_i + GEN_j + a_k + e_{ijk}$$

where $y_{ijk}$ is the trait value, $SEX_i$ is the fixed effect of sex (i = 1, 2), $GEN_j$ is the fixed effect of generation (j = 0–6), $a_k$ is the random additive genetic effect of the k-th animal ($a \sim N(0, A\sigma^2_a)$), and $e_{ijk}$ is the random residual ($e \sim N(0, I\sigma^2_e)$). The relationship matrix included 5,939 animals.

Pre-selection of SNPs

Data of the 4,665 animals of the TRAINING data set were analysed by a simple linear regression for each SNP.

$$y = \mu + SNP_i + e$$

where y is the phenotype or the polygenic EBV and $SNP_i$ is the genotype at the i-th SNP (i = 1 to 6,000). An empirical threshold of 1.6E-6 for the P value of the F test was fixed to retain markers for the subsequent multiple-SNP analysis.
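The single-SNP screening described above can be sketched as follows. This is only an illustration, not the authors' code: it assumes that each SNP is fitted as a three-class genotype factor (codes 0, 1, 2), so that the F test has 2 numerator degrees of freedom and its P value has a simple closed form. The function names `snp_f_test` and `preselect_snps` are hypothetical, and SNPs with fewer than three observed genotype classes are simply not retained.

```python
from collections import defaultdict


def snp_f_test(y, genotypes):
    """One-way F test of y on one SNP fitted as a 3-class factor (assumption).

    Returns (F, P). With 2 numerator df the survival function of the
    F distribution is exactly (1 + 2F/df2) ** (-df2/2).
    """
    groups = defaultdict(list)
    for value, g in zip(y, genotypes):
        groups[g].append(value)
    n = len(y)
    if len(groups) != 3 or n <= 3:
        return 0.0, 1.0  # simplification: such SNPs are never retained
    grand_mean = sum(y) / n
    means = {g: sum(v) / len(v) for g, v in groups.items()}
    ss_between = sum(len(v) * (means[g] - grand_mean) ** 2 for g, v in groups.items())
    ss_within = sum((x - means[g]) ** 2 for g, v in groups.items() for x in v)
    df2 = n - 3
    if ss_within == 0.0:
        return (float("inf"), 0.0) if ss_between > 0.0 else (0.0, 1.0)
    f_stat = (ss_between / 2.0) / (ss_within / df2)
    p_value = (1.0 + 2.0 * f_stat / df2) ** (-df2 / 2.0)
    return f_stat, p_value


def preselect_snps(y, genotype_columns, threshold=1.6e-6):
    """Return the indices of SNPs whose F-test P value is below the threshold."""
    return [i for i, column in enumerate(genotype_columns)
            if snp_f_test(y, column)[1] < threshold]


# Toy data: SNP 0 has a strong effect on y, SNP 1 is unrelated to it.
n = 60
snp0 = [i % 3 for i in range(n)]
snp1 = [(i // 3) % 3 for i in range(n)]
y = [2.0 * snp0[i] + ((i * 7) % 5 - 2) * 0.1 for i in range(n)]
selected = preselect_snps(y, [snp0, snp1])
print(selected)
```

The same function can be run with polygenic EBVs in place of phenotypes and with a different `threshold` value.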

GEBVs estimation

The effect of marker genotypes was estimated on the TRAINING data set with the following mixed linear model

$$y = \mu + SEX_i + GEN_j + \sum_{k=1}^{m} \mathbf{H}_k \mathbf{b}_k + e$$

where y is the phenotype, $\mathbf{b}_k$ is the vector of the three genotype effects for SNP k, $\mathbf{H}_k$ is the corresponding design matrix, m is the number of pre-selected SNPs, and e is the random residual. No interaction between SNP loci and a constant variance for each SNP locus were assumed. In studies where no variance components are estimated, the variance ratio $\lambda = \sigma^2_e/\sigma^2_a$ used in the solution of the mixed model equations is usually fixed a priori at a value (for example 1) that removes possible dependencies among intervals while treating these factors almost as fixed effects [7]. On the other hand, when variance components are estimated and an equal contribution of each locus to the variance is assumed, λ should be calculated as $\sigma^2_e/(\sigma^2_a/n_{snps})$. In this paper, both options (indicated as λ1 and λ2, respectively) were tested.
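A minimal sketch of how marker genotype effects could be obtained from the mixed model equations is given below. It assumes NumPy, genotypes coded 0, 1, 2, and a single variance ratio λ added to the diagonal of the genotype-effect block. The function names `genotype_design` and `blup_marker_effects` are hypothetical, and the software actually used for this step is not stated in the text.

```python
import numpy as np


def genotype_design(genotypes):
    """Build H = [H_1 ... H_m]: three indicator columns (codes 0, 1, 2) per SNP."""
    g = np.asarray(genotypes, dtype=int)       # shape (n_animals, m)
    n, m = g.shape
    H = np.zeros((n, 3 * m))
    rows = np.repeat(np.arange(n), m)
    cols = (3 * np.arange(m) + g).ravel()      # column of each animal's genotype
    H[rows, cols] = 1.0
    return H


def blup_marker_effects(y, X, H, lam):
    """Solve the mixed model equations

        [X'X   X'H        ] [beta]   [X'y]
        [H'X   H'H + lam*I] [ b  ] = [H'y]

    where X holds the fixed effects (mean, SEX, GEN) and lam is the variance ratio.
    """
    y = np.asarray(y, dtype=float)
    X = np.asarray(X, dtype=float)
    p = X.shape[1]
    W = np.hstack([X, H])
    lhs = W.T @ W
    lhs[p:, p:] += lam * np.eye(H.shape[1])
    rhs = W.T @ y
    # Least squares tolerates a rank-deficient fixed-effect block (dummy coding).
    solution = np.linalg.lstsq(lhs, rhs, rcond=None)[0]
    return solution[:p], solution[p:]


# Toy example: one SNP, six animals, overall mean as the only fixed effect.
y = [1.0, 1.0, 2.0, 2.0, 3.0, 3.0]
G = [[0], [0], [1], [1], [2], [2]]
X = np.ones((6, 1))
H = genotype_design(G)
beta, b = blup_marker_effects(y, X, H, lam=1.0)
beta_big, b_big = blup_marker_effects(y, X, H, lam=1412.0)
print(beta, b, b_big)
```

In the toy example the larger variance ratio shrinks the estimated genotype effects much more strongly towards zero.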

When EBVs were used as the dependent variable, the model for the estimation of marker effects did not include the SEX and GEN factors.

The genomic breeding value of the i-th animal of the whole data set was then calculated as

$$GEBV_i = \mu + \sum_{k=1}^{m} \mathbf{h}'_k \hat{\mathbf{b}}_k$$

Thus, four genomic breeding values were calculated for each animal: using phenotypes as the dependent variable with variance ratio λ1 (GEBV1) or λ2 (GEBV2), or using polygenic EBVs with variance ratio λ1 (GEBV3) or λ2 (GEBV4). Accuracy of genomic selection was calculated as the correlation between GEBVs and true breeding values (TBVs).
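The calculation of GEBVs and of their accuracy can be illustrated with the short sketch below. It assumes NumPy and a design matrix with one row per animal that stacks the genotype indicators of all retained SNPs. The function names `gebv` and `accuracy` and the toy numbers are hypothetical.

```python
import numpy as np


def gebv(mu, H, b_hat):
    """GEBV of each animal: mu plus the sum of its estimated genotype effects."""
    return mu + np.asarray(H, dtype=float) @ np.asarray(b_hat, dtype=float)


def accuracy(gebv_values, tbv):
    """Accuracy as the Pearson correlation between GEBVs and TBVs."""
    return float(np.corrcoef(gebv_values, tbv)[0, 1])


# Toy example: three animals, one SNP with three genotype effects.
H = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])
b_hat = np.array([-1.0, 0.0, 1.0])
values = gebv(2.0, H, b_hat)
acc = accuracy(values, [10.0, 20.0, 30.0])
print(values, acc)
```

Because PREDICTION animals have marker genotypes but no phenotypes, the same `gebv` call applies to them once the effects have been estimated on the TRAINING data.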

Results and discussion

Variance components estimated with the polygenic animal model were 1.324 and 3.142 for the additive genetic ($\sigma^2_a$) and residual ($\sigma^2_e$) variance, respectively, giving a heritability of 0.30. Average accuracies of EBV estimation were 0.71 and 0.33 for the TRAINING and PREDICTION generations, respectively.

The number of markers retained was 595 and 618 for the screening carried out on phenotypes and polygenic EBVs, respectively. In the latter case, a more stringent threshold was adopted (1.6E-22 for the P value of the F test) in order to retain a comparable number of markers. The larger number of significant SNPs obtained when EBVs were used could be partly explained by the lower noise in EBVs, from which the fixed effects and a large proportion of the residual have already been removed. The number of SNPs selected in both analyses was 411 (about 69% of the 595 phenotype-selected SNPs). The average marker distance was 0.99 cM for phenotype-selected SNPs (min 0.1, max 18.6 cM) and 0.95 cM for EBV-selected SNPs (min 0.1, max 19.3 cM). The distribution of selected markers across the six chromosomes is reported in Figure 1.

Figure 1

Distribution of number of selected markers based on phenotypes (white bars) or EBVs (black bars) across the six chromosomes.

Most of the selected SNPs are located on chromosomes 4 (more than 200) and 1 (about 140). The distributions of markers selected on phenotypes (white bars) and on polygenic EBVs (black bars) follow approximately the same pattern. The percentage of SNPs selected by both analyses ranged between 80% (chromosome 4) and 33% (chromosome 5). Based on the variance component estimates and on the number of markers retained, the variance ratios used for the estimation of marker genotype effects were 2.4 (λ1) for GEBV1 and GEBV3, 1,412 (λ2) for GEBV2 and 1,466 (λ2) for GEBV4.
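As a check on the arithmetic, the reported heritability and variance ratios can be reproduced from the variance components and the numbers of retained markers. The sketch below assumes that λ1 is the plain ratio of residual to additive variance, which is consistent with the reported value of 2.4. The variable and function names are hypothetical.

```python
sigma2_a = 1.324   # additive genetic variance from the animal model
sigma2_e = 3.142   # residual variance from the animal model

h2 = sigma2_a / (sigma2_a + sigma2_e)   # heritability
lambda1 = sigma2_e / sigma2_a           # same ratio for GEBV1 and GEBV3


def lambda2(n_snps):
    """Variance ratio when each of n_snps loci contributes sigma2_a / n_snps."""
    return sigma2_e / (sigma2_a / n_snps)


lambda2_pheno = lambda2(595)   # markers selected on phenotypes (GEBV2)
lambda2_ebv = lambda2(618)     # markers selected on polygenic EBVs (GEBV4)

print(round(h2, 2), round(lambda1, 1), lambda2_pheno, lambda2_ebv)
```

The two λ2 values differ only because the number of retained markers differs between the two screenings.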

Correlations between GEBVs and TBVs in the TRAINING generations are around 0.87 and 0.72 when phenotypes or EBVs, respectively, are used as the dependent variable (Table 1). These values are rather high, as expected, because the GEBVs were calculated for the animals whose phenotypes were used to estimate marker effects. In any case, the use of different variance ratios resulted in an appreciable increase in the accuracy of GEBV estimation based on phenotypes. Correlations between the different GEBVs and polygenic EBVs depend on the dependent variable used. Moreover, the correlation between TBVs and polygenic EBVs corresponds to the accuracy calculated in the animal model estimation.

Table 1 Correlations between true breeding values (TBV), polygenic EBVs, genomic breeding values estimated using markers selected on phenotype (GEBV1 and GEBV2) and on polygenic EBVs (GEBV3 and GEBV4) for the animals of the TRAINING generations (0–3)

Accuracies of GEBV estimation for the PREDICTION generations (Table 2) are lower (around 0.55), except for GEBV2, which still shows a value higher than 0.75, usually considered as the average GEBV accuracy in genome-wide selection schemes [3]. This result is comparable with those reported by de Roos et al. [8], in a work on real data using polygenic EBVs as true breeding values, and is of the same order as the correlations reported for simulated data with similar marker density [1, 9, 10]. The high accuracy of GEBV2 is clearly related to the use of a lower variance for each locus, which probably prevents an overestimation of marker effects. On the other hand, the adoption of a smaller variance did not affect estimates based on EBVs. The low correlation between EBVs and TBVs could be a possible explanation for these results. However, the accuracy of about 0.55 obtained with GEBV1, GEBV3 and GEBV4 should not be neglected either, being considerably higher than that obtained with polygenic EBVs (0.33).

Table 2 Correlations among true breeding values (TBV), polygenic EBVs, genomic breeding values estimated using markers selected on phenotype (GEBV1 and GEBV2) and on polygenic EBVs (GEBV3 and GEBV4) for the animals of the PREDICTION generations (4–6)

The pattern of correlations between TBVs and the different GEBVs across the seven generations considered (Figure 2) confirms the consistently higher accuracy of GEBV2 and the decrease in accuracy when passing from the TRAINING to the PREDICTION generations.

Figure 2

Pattern of correlations between true breeding values and GEBV1 (white bars), GEBV2 (black bars), GEBV3 (grey bars), GEBV4 (segmented bars) across the seven generations.

The strategy presented here for genomic breeding value estimation, which combines a pre-selection of SNPs based on least-squares regression with a BLUP estimation of marker effects, gave in general poorer GEBV accuracies than those reported in the literature for more sophisticated methods based on Bayesian inference and on the estimation of the IBD matrix. However, the GEBVs calculated using phenotypes as the dependent variable and assuming an equal contribution of each marker locus to the variance of the trait showed an accuracy closer to that of the best methods.


  1. Meuwissen THE, Hayes BJ, Goddard ME: Prediction of total genetic value using genome-wide dense marker maps. Genetics. 2001, 157: 1819-1829.

  2. Calus MPL, Meuwissen THE, de Roos PW, Veerkamp RF: Accuracy of genomic selection using different methods to define haplotypes. Genetics. 2008, 178: 553-561. 10.1534/genetics.107.080838.

  3. Schaeffer LR: Strategy for applying genome-wide selection in dairy cattle. J Anim Breed Genet. 2006, 123: 218-223. 10.1111/j.1439-0388.2006.00595.x.

  4. Grapes L, Dekkers JCM, Rothschild MF, Fernando RL: Comparing linkage disequilibrium-based methods for fine mapping quantitative trait loci. Genetics. 2004, 166: 1561-1570. 10.1534/genetics.166.3.1561.

  5. Gonzalez-Recio O, Gianola D, Long N, Weigel KA, Rosa GJM, Avendano S: Nonparametric methods for incorporating genomic information into genetic evaluations: an application to mortality in broilers. Genetics. 2008, 178: 2305-2313. 10.1534/genetics.107.084293.

  6. Aulchenko Y, de Koning DJ, Haley C: Genomewide rapid association using mixed model and regression: a fast and simple method for genomewide pedigree-based quantitative trait loci association analysis. Genetics. 2007, 177: 577-585. 10.1534/genetics.107.075614.

  7. Schaeffer LR: Factors affecting accuracy of genomic estimated breeding values. University of Guelph, 2006:1-20.

  8. de Roos APW, Schrooten C, Mullaart E, Calus MP, Veerkamp RF: Breeding values estimation for fat percentage using dense markers on Bos taurus autosome 14. J Dairy Sci. 2007, 90: 4821-4829. 10.3168/jds.2007-0158.

  9. Kolbehdari D, Schaeffer LR, Robinson JAB: Estimation of genome-wide haplotype effects in half-sib designs. J Anim Breed Genet. 2007, 124: 356-361. 10.1111/j.1439-0388.2007.00698.x.

  10. Muir WM: Comparison of genomic and traditional BLUP estimated breeding value accuracy and selection response under alternative trait and genomic parameters. J Anim Breed Genet. 2007, 124: 342-355. 10.1111/j.1439-0388.2007.00700.x.



Research funded by the Italian Ministry of Agricultural Policies (research project SELMOL). The authors wish to thank Prof. Aldo Cappio-Borlino for his contribution.

This article has been published as part of BMC Proceedings Volume 3 Supplement 1, 2009: Proceedings of the 12th European workshop on QTL mapping and marker assisted selection. The full contents of the supplement are available online at

Author information

Corresponding author

Correspondence to Nicolò PP Macciotta.

Additional information

Competing interests

The authors declare that they have no competing interests.

Rights and permissions

This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

About this article

Cite this article

Macciotta, N.P., Gaspa, G., Steri, R. et al. Pre-selection of most significant SNPS for the estimation of genomic breeding values. BMC Proc 3 (Suppl 1), S14 (2009).


  • Genomic Breeding
  • True Breeding
  • Estimate Marker Effect
  • Average Marker Distance
  • Random Genetic Additive Effect