Volume 6 Supplement 2
Proceedings of the 15th European workshop on QTL mapping and marker assisted selection (QTLMAS)
Alternative strategies for selecting subsets of predicting SNPs by LASSOLARS procedure
 M Graziano Usai^{1}Email author,
 Antonello Carta^{1} and
 Sara Casu^{1}
DOI: 10.1186/175365616S2S9
© Usai et al.; licensee BioMed Central Ltd. 2012
Published: 21 May 2012
Abstract
Background
The least absolute shrinkage and selection operator (LASSO) can be used to predict SNP effects. This operator has the desirable feature of including in the model only a subset of explanatory SNPs, which can be useful both in QTL detection and GWS studies. LASSO solutions can be obtained by the least angle regression (LARS) algorithm. The big issue with this procedure is to define the best constraint (t), i.e. the upper bound of the sum of absolute value of the SNP effects which roughly corresponds to the number of SNPs to be selected. Usai et al. (2009) dealt with this problem by a crossvalidation approach and defined t as the average number of selected SNPs overall replications. Nevertheless, in small size populations, such estimator could give underestimated values of t. Here we propose two alternative ways to define t and compared them with the "classical" one.
Methods
The first (strategy 1), was based on 1,000 crossvalidations carried out by randomly splitting the reference population (2,000 individuals with performance) into two halves. The value of t was the number of SNPs which occurred in more than 5% of replications. The second (strategy 2), which did not use crossvalidations, was based on the minimization of the Cptype selection criterion which depends on the number of selected SNPs and the expected residual variance.
Results
The size of the subset of selected SNPs was 46, 189 and 64 for the classical approach, strategy 1 and 2 respectively. Classical and strategy 2 gave similar results and indicated quite clearly the regions were QTL with additive effects were located. Strategy 1 confirmed such regions and added further positions which gave a less clear scenario. Correlation between GEBVs estimated with the three strategies and TBVs in progenies without phenotypes were 0.9237, 0.9000 and 0.9240 for classical, strategy 1 and 2 respectively.
Conclusions
This suggests that the Cptype selection criterion is a valid alternative to the crossvalidations to define the best constraint for selecting subsets of predicting SNPs by LASSOLARS procedure.
Background
Where y_{ i } is the phenotype of the i^{ th } individual; x_{ ij } is the genotype of the i^{ th } individual at the j^{ th } marker; β_{ j } is the allelic substitution effect for the j^{ th } marker and t is the constraint that allows some estimated SNP effects to be exactly zero. The challenge with implementing the LASSO approach is how to best choose the constraint parameter (t) which in turn depends on the size of the subset of explanatory variables, in this case the number of SNPs [3].
The LASSO problem can be solved by quadratic programming [1] or by Bayesian approaches [4]. The latter have been implemented in several genome wise selection (GWS) and QTL detection studies [5, 6]. An alternative way to produce LASSO solution is a modified version of the Least Angle Regression (LARS) algorithm [7]. This procedure is a version of traditional forward selection methods which estimates the effects by successive iterations. For each iteration the SNP with the highest absolute correlation between genotypes and current residuals is added to the model. To obtain LASSO solutions the LARS procedure is modified so that either addition or subtraction of one marker to the model per iteration may occur. Usai et al. [8] suggested the LASSOLARS method to estimate the marker effects for genomic selection including a crossvalidation step to define the best constraint t. With this approach good results were obtained both in simulated and real dataset. Nonetheless one limitation of LASSOLARS is that, as a constrained version of the ordinary least squares, it cannot estimate the effects for a number of markers larger than the number of individuals in the reference population. Thus in real data, where the reference population size is often relatively limited and for those traits affected by a large number of QTL, LASSOLARS may not be able to predict the effect of all the QTL contributing to the total genetic variability. Moreover, by using crossvalidations at each replicate the training sample where SNP effects are estimated is only a portion of the total reference population. So the number of SNPs selected at each replication will be smaller than expected and their average overall replications will underestimate the true value of the best t.
In this study we propose two alternative strategies to define the t and compared them with the classical strategy [8].
Methods
Data
A simulated data set of 3,220 individuals generated for the 15th QTLMAS workshop was used. The first generation consisted in 220 founders (20 males and 200). The second generation consisted in 3,000 individuals organized in 20 sire families of 150 individuals each and 200 dam families of 15 individuals each. The dam families were nested in the sire families. The genome consisted in five chromosomes. Each chromosome was 1 Morgan long and carried 1998 SNPs evenly distributed. Genotypes were available for all the individuals. Phenotypes were available only for 2,000 progenies (1/3 of each sire and dam family) which represented the reference population. The further 1,000 progenies had genotypes but no phenotypes and represented the candidate population.
LASSOLARS classical strategy
At each crossvalidation replication the reference population was randomly split into training (T) and validation (V) samples of equal size. This strategy corresponded to that suggested by Usai et al. [8] where the 50 % random splitting was chosen since it gave the lowest t variability and the highest accuracy of GEBV estimates. At each replication LASSOLARS was run on T sample. At each step of the procedure the genomic breeding values (GEBV) of V was updated by the current set of SNP effects estimated on T. When the maximum of the correlation between the GEBVs and phenotypes of V was reached the LASSOLARS was stopped and the number and the identity of the active SNPs were retained. The procedure was replicate 1,000 times. Afterwards the average number of active SNPs was taken as the best value of t. Moreover the SNP frequency of occurrence (fo) was defined as the proportion of times that a SNP was selected (i.e. with non zero effect) over all replications.
Strategy 1
The only difference with respect to the classical strategy was that here the best t was defined as the number of SNPs which occurred in the crossvalidations more than 5% of the times. This strategy was based on the assumption that if a SNP is selected more than expected by chance it is probable that it affects or is linked to QTL affecting the traits. The probability of a SNP to be selected by chance was estimated by permuting 10,000 times phenotypes on genotypes and by running LASSOLARS until the best t estimated for classical strategy was attained. The frequency of occurrence of 5% corresponded to a probability lower than 0.0001.
Strategy 2
where y_{ i } and ŷ_{ i } are, respectively, the phenotype and the current predicted value of the i^{ th } individual; n is the number of individual; df is the degree of freedom which here corresponded to the number of active SNPs and σ_{ e }^{ 2 } is the residual variance of the complete model, i.e. the model including all SNPs. The latter was estimated by a REML procedure running a GBLUP animal model with ASREML software [9]. The genomic relationship matrix was built as described by Hayes et al. in 2009 [10]. Since this strategy did not require crossvalidations, LASSOLARS was run on the whole reference population until the minimum Cptype value was reached and the corresponding number of active SNPs was taken as best t.
GEBV estimation
Once the best t were defined, LASSOLARS was run on the whole population. For each strategy the procedure was stopped when the corresponding t SNPs were in the model. The estimated allelic substitution effects of the selected SNPs were used to calculate the GEBVs on the candidate population (1,000 progenies without phenotypes). The GEBV accuracy, defined as the correlation coefficients between true breeding values (TBVs) and GEBVs, and the regression coefficient of the TBVs on GEBVs were calculated for each strategy.
Results
Best t definition
QTL mapping
GEBV estimation
Genomic breeding value (GEBV) accuracy (r) and regression coefficient (b) of true breeding value (TBV) on GEBV for the three tested strategies
Strategy  r(TBV,GEBV)  b(TBV,GEBV) 

Classical  0.9237  1.2512 
Strategy1  0.9000  1.0220 
Strategy2  0.9240  1.1877 
Discussion
Our results demonstrated that LASSOLARS performs well estimating SNPs associated to QTL with additive effects. The detection of QTL with different action was rather poor. However it suggests the presence of the imprinted QTL and of the first epistatic QTL. The second epistatic QTL was neglected since LASSOLARS just selects the SNPs which underline the main portion of the variability explained by both QTL. Concerning the choice of the best constraint for LASSOLARS, classical and strategy2 although based on different procedures gave very similar results. This suggests that a valid estimation of the best constraint can be obtained without crossvalidation with a large computing time saving. Indeed, while the crossvalidation procedure took 3 hours and 35 minutes, strategy2 just took 8 seconds. Nevertheless, the current data set did not allowed to verify if the constraint estimation based on Cptype minimization can overcome the underestimation of t expected with crossvalidation. Thus a study based on a dataset with high ratio between number of QTL and reference population size is needed. Strategy1 seems in general the worst, since most of the selected SNPs did not correspond to those with the highest fo. This happened because the complementary among SNPs selected in the crossvalidation was not accounted for. Indeed if two SNPs are strongly correlated and equally correlated with the phenotype, they could be alternatively selected in the crossvalidation due to the random sampling. Nevertheless they explain the same portion of variance and when LASSOLARS runs on the whole reference population only one of them is selected. In fact most of the SNPs further selected by strategy1 respect to the other two were false positives (Figure 2). The presence of many false positives leads to a lower GEBV accuracy. The regression of TBVs on GEBVs close to 1 obtained by strategy1 could be due to the higher weight given to the QTL with imprinted and epistatic effects.
Conclusions
We conclude that the strategy based on the Cptype selection criterion is a valid alternative to the crossvalidations to define the best constraint for selecting subsets of predicting SNPs by LASSOLARS procedure.
List of Abbreviations used
 GBLUP:

Genomic Best Linear Unbiased Prediction
 GEBV:

Genomic Breeding Values
 GWS:

Genome Wise Selection
 LARS:

Least Angle Regression
 LASSO:

Least Absolute Shrinkage and Selection Operator
 QTL:

Quantitative Trait Locus
 REML:

REstricted Maximum Likelihood
 SNP:

Single Nucleotide Polymophism
 TBV:

True Breeding Value.
Declarations
Acknowledgements
Research funded by the program APQ "Attivazione del Centro Biodiversità al servizio dell'allevamento" of Regional Government of Sardinia.
This article has been published as part of BMC Proceedings Volume 6 Supplement 2, 2012: Proceedings of the 15th European workshop on QTL mapping and marker assisted selection (QTLMAS). The full contents of the supplement are available online at http://www.biomedcentral.com/bmcproc/supplements/6/S2.
Authors’ Affiliations
References
 Tibshirani R: Regression shrinkage and selection via the Lasso. J R Statist Soc B. 1996, 58: 267288.Google Scholar
 Hayes BJ, Goddard ME : The distribution of the effects of genes affecting quantitative traits in livestock. Genet Sel Evol. 2001, 33: 209229. 10.1186/12979686333209.PubMed CentralView ArticlePubMedGoogle Scholar
 Foster SD, Verbyla AP, Pitchford WS: Incorporating LASSO Effects into a Mixed Model for Quantitative Trait Loci Detection. J Agric Biol Environ Stat. 2007, 12 (Suppl 2): 300314.View ArticleGoogle Scholar
 Park T, Casella G : The Bayesian Lasso. J Am Stat Assoc. 2008, 103: 681686. 10.1198/016214508000000337.View ArticleGoogle Scholar
 Mutshinda C, Sillanpää M: Extended Bayesian LASSO for multiple quantitative trait loci mapping and unobserved phenotype prediction. Genetics. 2010, 186: 10671075. 10.1534/genetics.110.119586.PubMed CentralView ArticlePubMedGoogle Scholar
 Legarra A, RobertGranié C, Croiseau P, Guillaume F, Fritz S: Improved Lasso for genomic selection. Genet Res. 2011, 93: 7787. 10.1017/S0016672310000534.View ArticleGoogle Scholar
 Efron B, Hastie T, Johnstone I, Tibshirani R: Least angle regression. Ann Stat. 2004, 32: 407499. 10.1214/009053604000000067.View ArticleGoogle Scholar
 Usai MG, Goddard ME, Hayes BJ: LASSO with crossvalidation for genomic selection. Genet Res. 2009, 91: 427436. 10.1017/S0016672309990334.View ArticleGoogle Scholar
 Gilmour AR, Gogel BJ, Cullis BR, Thompson R: ASReml User Guide Release 3.0. VSN International Ltd. 2009, Hemel Hempstead, HP1 1ES, UKGoogle Scholar
 Hayes BJ, Bowman P, Chamberlain A, Verbyla K, Goddard ME: Accuracy of genomic breeding values in multibreed dairy cattle populations. Genet Sel Evol. 2009, 41: 5110.1186/129796864151.PubMed CentralView ArticlePubMedGoogle Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.