The impact of single nucleotide polymorphism selection on prediction of genomewide breeding values
© Żukowski et al; licensee BioMed Central Ltd. 2009
Published: 23 February 2009
The study focuses on the impact of different sets of single nucleotide polymorphisms (SNPs) selected from the available data set on prediction of genomewide breeding values (GBVs) of animals. Correlations between breeding values estimated as additive polygenic effects (EBVs) and GBVs as well as correlations between true breeding values (TBVs) and GBVs are used as major criteria for the comparison of different SNP selection schemes and GBV estimation models.
The analysed data is the simulated data set from the XII QTL Workshop. In the analysis five different SNP data sets are considered. For prediction of EBVs a standard mixed animal model is applied, whereas GBVs are defined as the sum of additive effects of SNPs estimated for the different SNP data sets using three models: model 1 with fixed SNP effects, model 2 with fixed SNP effects and a random additive polygenic effect, and model 3 with random effects of uncorrelated SNP genotypes.
The additive polygenic and residual variance components estimated by the EBV model amount to 1.36 and 3.12, respectively. Differences between models are expressed by comparing the ranking of individuals based on EBV and on GBV and by correlations. Among the 100 individuals with the highest EBVs, depending on the model and data set, only between 11 and 37 are also among the individuals with the highest GBVs. The highest correlation between GBV and EBV amounts to 0.787 and is observed for model 3 with 3,328 SNPs selected based on their minor allele frequency. The lowest correlation, 0.519, is obtained for model 2 with 300 SNPs. Correlations between GBV estimates obtained from different models with the same number of SNPs range between 0.916 and 0.998, whereas correlations between different SNP data sets using the same model fall below 0.850.
These results indicate that applying high-throughput SNP genotyping technologies to the prediction of breeding values is a very promising approach, but before the method can be routinely applied, further methodological improvements regarding model construction and SNP selection are required.
The idea behind using high-throughput single nucleotide polymorphism (SNP) microarray technology in the cattle breeding industry is based on the assumption that the additive genetic merit of animals (mainly bulls) can be accurately predicted based on their genotypes at many SNPs. This study focuses on the impact of different sets of SNPs, selected from the available data set of 6,000 SNPs, on prediction of GBVs of animals. Correlations between breeding values estimated as additive polygenic effects (EBVs) using a standard mixed animal model, and GBVs, are used as a major criterion for the comparison of different SNP selection schemes and different GBV estimation models.
The analysed data is the simulated data set from the XII QTL Workshop, consisting of 5,865 individuals from seven generations, divided into (i) a group of 4,665 animals from generations 1–4 for which both phenotypes and genotypes are available, (ii) a group of 1,200 animals from generations 5–7 for which only genotypes are available. Phenotypes represent a quantitative trait, while genotypes represent 6,000 SNP markers evenly distributed every 0.1 cM over six chromosomes. In our analysis five different SNP data sets are considered. They comprise:
- a set of all available 6,000 SNPs (SNP6000),
- a set of 3,328 SNPs selected based on their estimated minor allele frequency (MAF) using the condition: MAF ≥ 0.3 (SNP3328),
- a set of 1,200 SNPs selected as every 5th SNP out of the available set (SNP1200),
- a set of 600 SNPs selected as every 10th SNP out of the available set (SNP600),
- a set of 300 SNPs selected as every 20th SNP out of the available set (SNP300).
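The two selection rules listed above (a MAF threshold and regular thinning) can be illustrated with a minimal sketch. It is not the authors' code: the function names are hypothetical, genotypes are assumed to be coded 0, 1 or 2 as in the models described later, and thinning is assumed to start from the first SNP, which the text does not specify.

```python
def minor_allele_frequency(codes):
    """Return the MAF of one SNP from genotype codes 0, 1, 2
    (the code counts copies of allele 2)."""
    freq_allele_2 = sum(codes) / (2 * len(codes))
    return min(freq_allele_2, 1.0 - freq_allele_2)


def select_by_maf(genotypes, threshold=0.3):
    """Return column indices of SNPs with MAF >= threshold.

    genotypes: list of rows (animals), each a list of 0/1/2 codes.
    """
    n_snps = len(genotypes[0])
    selected = []
    for j in range(n_snps):
        column = [row[j] for row in genotypes]
        if minor_allele_frequency(column) >= threshold:
            selected.append(j)
    return selected


def select_every_kth(n_snps, k):
    """Return indices of every k-th SNP.

    Assumption: counting starts at the first SNP (index 0).
    """
    return list(range(0, n_snps, k))


# Thinning 6,000 SNPs by 5, 10 and 20 gives the subset sizes named in the text.
for k in (5, 10, 20):
    print(k, len(select_every_kth(6000, k)))
```

With 6,000 markers, thinning by 5, 10 and 20 yields 1,200, 600 and 300 SNPs, matching the names SNP1200, SNP600 and SNP300.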
For prediction of EBVs a standard mixed animal model is applied: y = μ + Zα + e, where y is a vector of phenotypic values, μ is the overall mean, α is a vector of random additive polygenic effects of animals with a covariance matrix given by the numerator relationship matrix (A) and the additive polygenic variance component (σ²_α), and e is a vector of residuals. GBVs are defined as the sum of additive effects of SNPs, estimated from the different SNP data sets defined above using the following models:
- (1) y = μ + Xq + e, where q (NSNP × 1) is a vector of fixed additive SNP effects with the corresponding design matrix X with score 0, 1, or 2 for an SNP genotype 11, 12, or 22 respectively, NSNP is the number of SNPs considered and other model parameters are defined as above.
- (2) y = μ + Xq + Zα + e, with all the parameters defined as above.
- (3) y = μ + Zq + e, where q is a vector of random SNP effects with the corresponding design matrix Z with score 0, 1, or 2 for an SNP genotype 11, 12, or 22 respectively.
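As an illustration of how a model of type (3) can be solved and turned into GBVs, the following pure-Python sketch builds the mixed model equations for a fixed mean and random SNP effects and sums the estimated effects per animal. It is a sketch under stated assumptions, not the authors' R programme: the function names are hypothetical, and the shrinkage parameter `lam` is assumed to be the ratio of residual variance to the variance of a single SNP effect, a value the text does not report. Setting `lam` to 0 reduces the system to the least squares fit of model (1).

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented copy
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("coefficient matrix is singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def fit_snp_effects(genotypes, phenotypes, lam):
    """Estimate the mean and SNP effects for y = mu + Zq + e.

    lam is added to the diagonal of the SNP block only
    (assumed: residual variance / variance of one SNP effect).
    """
    n = len(phenotypes)
    w = [[1.0] + [float(g) for g in row] for row in genotypes]  # [1 | Z]
    dim = len(w[0])
    lhs = [[sum(w[i][r] * w[i][c] for i in range(n)) for c in range(dim)]
           for r in range(dim)]
    for j in range(1, dim):  # the mean (index 0) is not shrunk
        lhs[j][j] += lam
    rhs = [sum(w[i][r] * phenotypes[i] for i in range(n)) for r in range(dim)]
    solution = solve_linear_system(lhs, rhs)
    return solution[0], solution[1:]


def genomic_breeding_values(genotypes, snp_effects):
    """GBV of each animal: sum over SNPs of genotype code times effect."""
    return [sum(g * q for g, q in zip(row, snp_effects)) for row in genotypes]


# Toy data: 5 animals, 2 SNPs, phenotypes generated as 10 + 1*z1 + 2*z2.
z = [[0, 0], [1, 0], [0, 1], [2, 1], [1, 2]]
y = [10.0, 11.0, 12.0, 14.0, 15.0]
mu, q = fit_snp_effects(z, y, lam=1.0)
print(genomic_breeding_values(z, q))
```

The dense coefficient matrix grows with the square of the number of fitted effects, so a direct solve like this one is only practical for small SNP sets.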
Note that EBVs and GBVs are estimated for the 4,665 animals from the first four generations. Parameters of all the mixed models were estimated by solving the mixed model equations (MME), while effects in model 1 were estimated using the least squares approach. The DFREML package was used for the estimation of parameters and variance components of the EBV model, whereas the parameters of the GBV models (models 1–3) were estimated using R programmes. For models 1–3 the residual and additive polygenic variance components were assumed to be known and were set to the estimates obtained from the EBV model. Because of the excessive memory required to invert the coefficient matrix of the MME, we were unable to estimate parameters of models 2 and 3 for the data set with all SNPs.
The additive polygenic and residual variance components estimated by the EBV model amount to 1.36 and 3.12, respectively, which results in a heritability of 0.30.
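The heritability follows directly from the two variance components. A minimal check, with a hypothetical function name:

```python
def heritability(additive_variance, residual_variance):
    """Narrow-sense heritability: additive variance / total phenotypic variance."""
    return additive_variance / (additive_variance + residual_variance)


h2 = heritability(1.36, 3.12)  # 1.36 / 4.48
print(f"{h2:.2f}")
```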
Differences in top 100 ranking of individuals.
Correlation between EBV and GBV.
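Both comparison criteria used here, the overlap between the top 100 rankings and the EBV-GBV correlation, are simple to compute. The sketch below is illustrative only: the function names and the toy values are hypothetical and do not reproduce results from the study.

```python
from math import sqrt


def pearson_correlation(x, y):
    """Pearson correlation between two equally long lists of values."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def top_k_overlap(ebv, gbv, k=100):
    """Count animals that are in the top k by EBV and also in the top k by GBV.

    ebv, gbv: dicts mapping animal id -> breeding value.
    """
    top_ebv = set(sorted(ebv, key=ebv.get, reverse=True)[:k])
    top_gbv = set(sorted(gbv, key=gbv.get, reverse=True)[:k])
    return len(top_ebv & top_gbv)


# Toy example with four animals and k = 2 (values are made up).
ebv = {"a": 3.0, "b": 2.0, "c": 1.0, "d": 0.0}
gbv = {"a": 0.0, "b": 3.0, "c": 2.0, "d": 1.0}
ids = sorted(ebv)
print(top_k_overlap(ebv, gbv, k=2))
print(pearson_correlation([ebv[i] for i in ids], [gbv[i] for i in ids]))
```

The two criteria answer different questions: the correlation summarises agreement over all animals, while the overlap count looks only at the animals that would actually be selected.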
In the paper of Meuwissen et al., which was pioneering in the field of using multiple SNPs for the prediction of GBV, a similar correlation of 0.73 between true and predicted GBV, based on random SNP haplotype effects, was reported. However, using fixed SNP effects resulted in a correlation as low as 0.32, lower than the correlation in our study if at least 1,200 SNPs are considered. Much higher correlations of 0.95 between true additive genetic values and GBVs, estimated by a model with random SNP genotype effects and by a model with a random additive polygenic effect and SNP effects modelled by a kernel function, were observed by Gianola et al. (2006), but under the favourable conditions of unrelated individuals, no correlations between SNPs, and all 100 loci determining the trait fitted in the model. Similar correlations were also reported by Habier et al.
In summary, each of the methods applied in the present study has its drawbacks. The mixed animal model treats the additive genetic background as uniform and does not properly account for the existence of QTL along the genome. Model 1 suffers from overfitting as the number of included SNPs increases. Models 1 and 3 do not use information on relationships among individuals. In model 2 the additive polygenic relationships are given too much emphasis, since the corresponding variance component was not estimated for this model but simply assumed to be known and equal to the variance component of a pure polygenic model without SNPs. The number of fitted SNPs influences not only the estimates of GBV but also the feasibility of computations, and should therefore be chosen with caution. Although the highest EBV-GBV correlations are obtained for the data set with all 6,000 SNPs, similar values are observed using slightly more than half of the SNPs, selected based on MAF.
The most important result of this study, also reported by other authors [3, 5], is the overall low correlation between EBVs and GBVs, which indicates that the two quantities cannot be regarded as describing the same genetic background. The correlations between TBVs and GBVs are even lower.
In summary, the relatively simple models applied in this study are not stable enough (e.g. they are not robust to the number of fitted SNPs and are poorly correlated with EBV) to be used for routine national genetic evaluation of dairy cattle, especially if the EBVs estimated using a classical method are to be regarded as the desired selection criterion. On the other hand, practical application of more sophisticated methods is hampered by computational problems. Although the use of high-throughput SNP genotyping technologies for prediction of breeding values is a very promising approach, further methodological improvements regarding model construction and SNP selection procedures are needed before the method can be routinely applied.
This article has been published as part of BMC Proceedings Volume 3 Supplement 1, 2009: Proceedings of the 12th European workshop on QTL mapping and marker assisted selection. The full contents of the supplement are available online at http://www.biomedcentral.com/1753-6561/3?issue=S1.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.