Comparison of statistical procedures for estimating polygenic effects using dense genome-wide marker data
© Pimentel et al; licensee BioMed Central Ltd. 2009
Published: 23 February 2009
In this study we compared different statistical procedures for estimating SNP effects using the simulated data set from the XII QTL-MAS workshop. Five procedures were considered and tested in a reference population, i.e., the first four generations, for which phenotypes and genotypes were available. The procedures can be interpreted as variants of ridge regression that differ in how the shrinkage parameter is defined. Comparisons were made with respect to the correlation between genomic and conventional estimated breeding values. Moderate correlations were obtained from all methods. Two of the procedures were then used to predict genomic breeding values in the last three generations. Correlations between these and the true breeding values were also moderate. We concluded that the ridge regression procedures applied in this study did not outperform the simple use of a ratio of variances in a mixed model method, both providing moderate accuracies of predicted genomic breeding values.
The development of appropriate methods to detect a large number of DNA sequence variations in the genome has launched a series of studies [1, 2] attempting to associate such alterations with phenotypic variation in complex traits. High-density panels for genotyping thousands of single nucleotide polymorphisms (SNP) are now commercially available and their costs are likely to decrease over time. If the number of markers in such a panel is large enough that it covers the entire genome, then it may be assumed that most of the quantitative trait loci (QTL) associated with a given trait will be in linkage disequilibrium with at least some of these markers. The use of this new source of information in selection programs requires accurate estimation of the effects of QTL associated with the markers, or alternatively the effects of the markers themselves, on traits of interest. Genome-wide estimated breeding values (GEBV) can then be calculated by taking the summation of these effects across the whole genome. Here we compared different statistical approaches to estimate SNP effects using the simulated common data set provided by the XII QTL-MAS workshop.
The approach adopted here followed the implementation of genomic selection described in . Selection candidates have their GEBV calculated from a prediction equation. This prediction equation is derived and tested in another sample of animals, not necessarily related to the selection candidates, called the reference population. The reference population comprises a discovery data set, from which the prediction equation is derived; and a validation data set, in which the equation is tested to assess its accuracy. Animals in the reference population must then have both phenotype records and marker genotype information available.
The available simulated population consisted of 5,865 animals from seven generations. Animals from the first four generations had both phenotypic records and genotypes for the 6,000 SNP loci and therefore were used to form the reference population. The discovery and the validation data sets were defined as the animals belonging to the first three (3,165 animals) and the fourth (1,500 animals) generations, respectively.
Two genetic evaluations were performed: one for the animals in the discovery sample only (GE1); and another for all animals in the reference population (GE2). In both cases, an animal model with a fixed effect of gender was used. Variance components were estimated using VCE . Estimated breeding values (EBV) from GE1 were used as the response variable in the derivation of the prediction equation, in the discovery data set. Then correlations between GEBV and EBV, from GE2, were computed for the animals in the validation data set, and used as reference for comparison among the statistical procedures.
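The prediction equation was derived from the following linear model:
$$ y_i = \mu + \sum_{j=1}^{p} x_{ij} b_j + e_i $$
where:
yi is the EBV of the ith animal;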
μ is an overall mean;
xij is an indicator variable for the jth SNP genotype of the ith animal;
bj is the slope on the jth SNP genotype;
p is the number of genotyped SNPs;
ei is a random residual term.
The coefficients xij were defined as -1 for genotype A1A1, 0 for genotype A1A2 and +1 for genotype A2A2. Here we did not make any assumption about the positions of the QTL and assumed strictly additive effects for the markers. Therefore the genomic region represented by a given marker was treated much like a QTL and bj was actually an estimate of the QTL allele substitution effect.
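The genotype coding and the summation of marker effects into a GEBV can be illustrated with a minimal sketch. The function names, the toy genotypes and the effect values below are hypothetical and chosen only for illustration; whether the overall mean is added to the GEBV is also an assumption, so it is left as an optional argument.

```python
import numpy as np

# Additive coding described in the text: A1A1 -> -1, A1A2 -> 0, A2A2 -> +1.
GENOTYPE_CODE = {"A1A1": -1, "A1A2": 0, "A2A1": 0, "A2A2": 1}


def code_genotypes(genotypes):
    """Convert per-animal lists of genotype labels into the (n x p) matrix X."""
    return np.array(
        [[GENOTYPE_CODE[g] for g in animal] for animal in genotypes],
        dtype=float,
    )


def gebv(X, b, mu=0.0):
    """GEBV_i = mu + sum_j x_ij * b_j (mu defaults to 0, an assumption)."""
    return mu + X @ b


# Toy example: two animals, three SNPs, made-up effect estimates.
genotypes = [
    ["A1A1", "A1A2", "A2A2"],
    ["A2A2", "A2A2", "A1A2"],
]
X = code_genotypes(genotypes)
b = np.array([0.10, -0.05, 0.20])
values = gebv(X, b)
print(values)
```

With this coding, bj is the change in GEBV when one A1 allele is replaced by A2 at marker j, which is why it can be read as an allele substitution effect.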
ι is a (n × 1) vector of ones, where n is the number of genotyped animals;
W is a diagonal matrix with wii equal to the reliability of the EBV of the ith animal;
X is the (n × p) matrix of coefficients xij;
Φ is a square matrix of order p.
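One system of estimating equations that is consistent with these definitions is the weighted ridge form sketched below. It is offered as an assumption based on the listed terms, not as the authors' printed equation; the hats and the vector y (the EBV of the n genotyped animals) are notation introduced here.

```latex
\begin{bmatrix} \hat{\mu} \\ \hat{\mathbf{b}} \end{bmatrix}
=
\begin{bmatrix}
\boldsymbol{\iota}' \mathbf{W} \boldsymbol{\iota} & \boldsymbol{\iota}' \mathbf{W} \mathbf{X} \\
\mathbf{X}' \mathbf{W} \boldsymbol{\iota} & \mathbf{X}' \mathbf{W} \mathbf{X} + \boldsymbol{\Phi}
\end{bmatrix}^{-1}
\begin{bmatrix}
\boldsymbol{\iota}' \mathbf{W} \mathbf{y} \\
\mathbf{X}' \mathbf{W} \mathbf{y}
\end{bmatrix}
```

Under this form, setting Φ to a null matrix recovers weighted least squares, and any non-zero diagonal element of Φ shrinks the corresponding SNP effect towards zero.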
The key point in the estimation process is the definition of Φ: it characterizes the departure from weighted least squares and distinguishes the following statistical procedures:
BLUP1: Φ = I
VIF stands for Variance Inflation Factor and is defined as:
where abs(ti) is the absolute Student-t statistic for testing the null hypotheses that the value of the ith parameter is zero. The criterion for choosing the value of θ was the same as above.
RR2*: a variant of RR2 to be done in two steps only (i.e., without testing different values for θ): i) estimate SNP effects with BLUP1; ii) use the t-values of estimates to define the weights. In the second step, ϕi was set either to zero if abs(ti) exceeded the mean abs(ti) by more than 3 standard deviations, or to λ (used in BLUP2) otherwise.
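The two-step logic of RR2* can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: Φ is taken to be diagonal, the system solved is a weighted ridge system with the penalty applied to SNP effects only, and the t-like ratios are computed from the inverse of the coefficient matrix with a crude residual variance. The function names, the simulated data and the value of lam are hypothetical.

```python
import numpy as np


def solve_ridge(X, y, w, phi):
    """Solve an assumed weighted ridge system for (mu, b).

    X: (n x p) genotype codes; y: (n,) EBV; w: (n,) reliabilities
    (diagonal of W); phi: (p,) diagonal of Phi (assumed diagonal).
    Returns mu, b and approximate t-like ratios for b.
    """
    n, p = X.shape
    Z = np.hstack([np.ones((n, 1)), X])       # [iota, X]
    lhs = Z.T @ (w[:, None] * Z)              # Z' W Z
    lhs[1:, 1:] += np.diag(phi)               # penalise SNP effects only
    rhs = Z.T @ (w * y)                       # Z' W y
    inv = np.linalg.inv(lhs)
    sol = inv @ rhs
    resid = y - Z @ sol
    sigma2 = (w * resid**2).sum() / max(n - 1, 1)   # crude residual variance (assumption)
    se = np.sqrt(sigma2 * np.diag(inv))
    return sol[0], sol[1:], sol[1:] / se[1:]


def rr2_star(X, y, w, lam):
    """Two-step RR2*: BLUP1 first, then weights defined from abs(t)."""
    p = X.shape[1]
    # Step i: BLUP1, i.e. Phi = I.
    _, _, t = solve_ridge(X, y, w, np.ones(p))
    abs_t = np.abs(t)
    # Step ii: no shrinkage for markers whose abs(t) exceeds the mean
    # abs(t) by more than 3 standard deviations, lam otherwise.
    outlier = abs_t > abs_t.mean() + 3.0 * abs_t.std()
    phi = np.where(outlier, 0.0, lam)
    mu, b, _ = solve_ridge(X, y, w, phi)
    return mu, b, phi


# Simulated toy data: 200 animals, 50 SNPs, one marker with a real effect.
rng = np.random.default_rng(0)
n, p = 200, 50
X = rng.integers(-1, 2, size=(n, p)).astype(float)
true_b = np.zeros(p)
true_b[0] = 1.0
y = X @ true_b + rng.normal(scale=0.5, size=n)
w = np.full(n, 0.5)                           # constant reliability, for illustration
mu, b, phi = rr2_star(X, y, w, lam=10.0)
```

In this sketch lam plays the role of λ from BLUP2 and must be supplied by the user, since its value depends on the estimated variance components.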
Results and discussion
Estimated residual and additive genetic variances were 3.17 and 1.23 within the discovery sample, and 3.12 and 1.36 in the whole reference population, respectively. Within the discovery sample, reliabilities of the EBV ranged from 0.48 to 0.86, with average and standard deviation of 0.50 ± 0.05. The mean (± SD) of the EBV in the validation sample was 0.185 ± 0.844.
Means of GEBV and correlations between EBV and GEBV in the validation sample (Generation 3), from each procedure.
Mean GEBV ± SD     r(EBV, GEBV) ± SE
0.355 ± 0.719      0.499 ± 0.019
0.366 ± 0.550      0.611 ± 0.016
0.366 ± 0.580      0.588 ± 0.017
0.360 ± 0.533      0.630 ± 0.016
0.363 ± 0.556      0.603 ± 0.016
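The standard errors attached to the correlations above can be checked against a common large-sample approximation, SE(r) ≈ (1 − r²)/√(n − 1), using the 1,500 animals of the validation data set as n. The source does not say how the standard errors were obtained, so the formula is an assumption; the sketch below only shows that it agrees with the reported values to three decimals.

```python
import math


def corr_se(r, n):
    """Large-sample standard error of a Pearson correlation (assumed formula)."""
    return (1.0 - r * r) / math.sqrt(n - 1)


n_validation = 1500  # animals in the fourth generation, as stated in the text

# (correlation, reported SE) pairs taken from the table above.
reported = [
    (0.499, 0.019),
    (0.611, 0.016),
    (0.588, 0.017),
    (0.630, 0.016),
    (0.603, 0.016),
]

for r, se in reported:
    approx = corr_se(r, n_validation)
    print(f"r = {r:.3f}  approximate SE = {approx:.4f}  reported SE = {se:.3f}")
```

Because the standard errors are about 0.016 to 0.019, differences between correlations of a few hundredths, such as 0.611 versus 0.603, are small relative to their sampling error.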
In both BLUP procedures, equal variances were assumed for all markers. Therefore, the difference between them was due to the amount of shrinkage imposed. In BLUP2, the assumed variance for each marker was very small, which resulted in a large value for the ratio of variances, and thus a much stronger shrinkage of the parameter estimates (Figure 1). BLUP2 therefore corresponds more closely to a prior assumption that all marker effects are close to zero, and it does not allow individual markers to deviate much from this expectation. A more realistic assumption would be that QTL effects follow a Gamma distribution, where many have a small effect and few have a large effect, as suggested in  and used in [1, 2]. In our study, a prior distribution of variances of markers was not formally defined. Instead, the different weights in the RR1 and RR2 procedures were derived from the data, in the form of VIF and t-values.
When different levels of shrinkage were allowed by the weighting factors in the RR1 and RR2 methods, some discrimination among marker effects could be made (Figure 2). This feature was more pronounced in RR2, where weights were functions of t-values. These two ridge regression procedures were investigative (i.e., they tested different values of θ) and were used in an attempt to identify a parameter that could be applied in a simpler and faster way. Since RR2 seemed more promising, the t-values were chosen as the parameters to be used in the two-step procedure RR2*.
Methods BLUP2 and RR2* were then used to estimate SNP effects again using data from the whole reference population. Correlations between GEBV and the true breeding values in the last three generations ranged from 0.40 in generation 6 to 0.52 in generation 4 (table 2 in ). The lower correlation with the true breeding values can in part be explained by the use of EBV as a proxy for breeding values in the analyses performed here. Notice that the average reliability on the EBV in the discovery sample was only 0.5. In a real application, one would likely use highly accurate EBV to derive the prediction equation.
Correlations between GEBV and true breeding values, when the response variable on the estimation step was the phenotype.
Results from other methods presented at the Workshop indicated that the definition of priors in a full-fledged Bayesian framework may provide higher accuracies of genomic breeding values.
The ridge regression procedures applied in this study did not outperform the simple use of a ratio of variances in a mixed model method, both providing moderate accuracies of predicted genomic breeding values.
This study is part of the project FUGATO-plus GenoTrack and was financially supported by the German Ministry of Education and Research, BMBF, the Förderverein Biotechnologieforschung e.V. (FBF), Bonn, and Lohmann Tierzucht GmbH, Cuxhaven.
This article has been published as part of BMC Proceedings Volume 3 Supplement 1, 2009: Proceedings of the 12th European workshop on QTL mapping and marker assisted selection. The full contents of the supplement are available online at http://www.biomedcentral.com/1753-6561/3?issue=S1.
- Meuwissen THE, Hayes BJ, Goddard ME: Prediction of total genetic value using genome-wide dense marker maps. Genetics. 2001, 157: 1819-1829.
- Xu S: Estimating polygenic effects using markers of the entire genome. Genetics. 2003, 163: 789-801.
- Goddard ME, Hayes BJ: Genomic selection. J Anim Breed Genet. 2007, 124: 323-330. 10.1111/j.1439-0388.2007.00702.x.
- Groeneveld E: VCE User's Manual, Version 4.2.5. 1998, Mariensee, Germany: Institute of Animal Breeding and Animal Behavior, Federal Research Institute for Agriculture
- Roso VM, Schenkel FS, Miller SP, Schaeffer LR: Estimation of genetic effects in the presence of multicollinearity in multibreed beef cattle evaluation. J Anim Sci. 2005, 83: 1788-1800.
- Maindonald JH: Statistical computation. 1984, New York: John Wiley & Sons
- Hayes B, Goddard ME: The distribution of the effects of genes affecting quantitative traits in livestock. Genet Sel Evol. 2001, 33: 209-229. 10.1051/gse:2001117.
- Lund MS, Sahana G, deKoning D-J, Carlborg Ö: Comparison of analyses of the QTLMAS XII common dataset. I: Genomic selection. BMC Proceedings. 2009, 3 (Suppl 1): S1. 10.1186/1753-6561-3-s1-s1.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.