Comparison between single-marker analysis using Merlin and multi-marker analysis using LASSO for Framingham simulated data.

We compared family-based single-marker association analysis using Merlin and multi-marker analysis using LASSO (least absolute shrinkage and selection operator) for the low-density lipoprotein phenotype at the first visit for all 200 replicates of the Genetic Analysis Workshop 16 Framingham simulated data sets. Using the "answers," we selected single-nucleotide polymorphisms (SNPs) on chromosome 22 for comparison of results between single-marker and multi-marker analyses. For the major causal SNP rs2294207 on chromosome 22, single-marker and multi-marker analyses provided similar results, indicating the importance of this SNP. For the 12 polygenic SNPs on the same chromosome, both single-marker and multi-marker analyses failed to provide statistically significant associations, indicating that their effects were too weak to be detected by either method. The main difference between the two methods was that for the 14 SNPs near the major causal SNP, p-values from Merlin were the next smallest, whereas LASSO often excluded these non-causal neighboring SNPs entirely from the first 10,000 models.


Background
Association analysis is often performed using single markers or haplotype analysis of multiple single-nucleotide polymorphisms (SNPs) within adjoining short regions or candidate genes. However, analysis that simultaneously uses multiple markers may be more powerful for detecting several causal genes and, hence, may be more appropriate for complex diseases [1].
The least absolute shrinkage and selection operator (LASSO) is a penalized least squares method imposing the L1-penalty on the regression coefficients [2]. Because this penalty induces shrinkage, prediction using LASSO is more reproducible than the regular multiple linear regression, in the case when there are more predictors than individuals (small n large p). Compared with a regular multiple linear regression (ordinary least squares), LASSO can handle the multicollinearity resulting from the highly correlated markers. Moreover, due to the nature of the L1-penalty, many regression coefficients are exactly zero. Hence, LASSO does both shrinkage and automatic variable selection simultaneously, a form of parsimonious model selection.
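The difference between shrinkage alone and shrinkage with selection can be illustrated in the simplest setting of a single standardized predictor, where both penalized estimates have closed forms. The following is a minimal sketch under that assumption; the function names and the toy least-squares estimates are hypothetical and are not taken from the analysis in this paper.

```python
def soft_threshold(z, lam):
    """L1-penalized estimate: argmin over b of 0.5*(z - b)**2 + lam*abs(b)."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0  # estimates with |z| <= lam are set exactly to zero


def ridge_shrink(z, lam):
    """L2-penalized estimate: argmin over b of 0.5*(z - b)**2 + 0.5*lam*b**2."""
    return z / (1.0 + lam)  # shrinks toward zero but never reaches it


# Hypothetical unpenalized estimates for four standardized predictors.
ols_estimates = [2.0, 0.6, -0.3, -1.5]
lam = 0.5

lasso_estimates = [soft_threshold(z, lam) for z in ols_estimates]
ridge_estimates = [ridge_shrink(z, lam) for z in ols_estimates]

print("LASSO:", lasso_estimates)
print("ridge:", ridge_estimates)
```

In this toy case the third predictor is dropped by the L1 penalty, whereas the L2 penalty keeps all four predictors with smaller coefficients.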
Our main goal in this paper was to explore the performance of LASSO for SNP selection in association analysis. In particular, we compared the relative importance (ranks) of SNPs provided by LASSO to that of SNPs inferred by single-marker analysis.

Phenotypes and genotypes
We used the low-density lipoprotein (LDL) phenotype at the first visit for all 200 replicates of the Genetic Analysis Workshop 16 (GAW16) Framingham simulated data sets. This phenotype was adjusted for age, smoking, and diet separately for both sexes and then corrected for medication (HMG-CoA reductase inhibitors) [3]. Because the GAW16 data set only contained individuals with genotypes, we created records for untyped parents as founder individuals. Because their actual relationship with other members in the same family ID was not provided, one extended family was often divided into multiple families: 1129 families with size ranging from 1 to 470 became 1920 families with size ranging from 1 to 72. Chromosome 22 included one major causal SNP and 12 polygenic SNPs that influenced the simulated LDL phenotype [4]. To reduce the number of SNPs, we chose 5011 SNPs located between 23.28 Mb and 49.10 Mb, 0.1 Mb in each direction past the left and right influencing SNPs. We excluded SNPs with minor allele frequency (MAF) less than or equal to 0.003 (we wanted to include one polygenic SNP with MAF 0.004). The final data set for analysis consisted of 4589 SNPs and 6857 individuals.
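The MAF filter described above can be sketched as follows, assuming genotypes are coded as counts of one allele (0, 1, or 2) with None for missing values. The coding, the function names, and the toy SNP names are assumptions made for illustration only.

```python
def minor_allele_frequency(genotypes):
    """MAF from allele counts coded 0/1/2; None marks a missing genotype."""
    observed = [g for g in genotypes if g is not None]
    if not observed:
        return 0.0
    freq = sum(observed) / (2.0 * len(observed))  # frequency of the counted allele
    return min(freq, 1.0 - freq)


def filter_snps(snp_genotypes, threshold=0.003):
    """Keep SNPs whose MAF is strictly greater than the threshold."""
    return {
        name: genotypes
        for name, genotypes in snp_genotypes.items()
        if minor_allele_frequency(genotypes) > threshold
    }


# Hypothetical toy data: three SNPs typed on six individuals.
toy = {
    "snp_a": [0, 1, 2, 1, None, 0],
    "snp_b": [0, 0, 0, 0, 0, 0],  # monomorphic, removed by the filter
    "snp_c": [2, 2, 2, 1, 2, 2],
}
kept = filter_snps(toy)
print(sorted(kept))
```

With the threshold of 0.003, a SNP with MAF 0.004 is retained, which matches the stated reason for choosing this cutoff.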

Single-marker analysis using Merlin
For single-marker analysis, we used Merlin [5,6]. The family-based association test provided by Merlin has two advantages. First, missing genotypes (1.5% of all genotypes) were imputed, using flanking markers and family relationships, and incorporated in the association test. Second, unlike most family-based linkage and association programs, which do not provide results for data sets with Mendelian-inconsistent genotypes, the Merlin association test does provide results by ignoring families with Mendelian-inconsistent genotypes. Even though this may not be an optimal way to handle genotype errors, it avoids the need to remove genotype errors, which can be tedious for data sets with a large number of SNPs and large families. Linkage disequilibrium (LD) between the major causal SNP and other SNPs (measured by r²) was computed using the R package genetics.
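As a rough illustration of an r² measure of LD, the sketch below computes the squared Pearson correlation between the allele counts of two SNPs. This genotype-based quantity is a swapped-in approximation chosen for simplicity; it is not the estimator used by the R package genetics, and its values may differ. The function name and toy genotypes are hypothetical.

```python
def genotype_r2(x, y):
    """Squared Pearson correlation between allele counts (0/1/2) of two SNPs.

    Individuals with a missing genotype (None) at either SNP are skipped.
    Returns NaN if either SNP is monomorphic among the remaining individuals.
    """
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    mean_x = sum(a for a, _ in pairs) / n
    mean_y = sum(b for _, b in pairs) / n
    sxx = sum((a - mean_x) ** 2 for a, _ in pairs)
    syy = sum((b - mean_y) ** 2 for _, b in pairs)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in pairs)
    if sxx == 0 or syy == 0:
        return float("nan")
    return (sxy * sxy) / (sxx * syy)


# Hypothetical toy genotypes.
causal = [0, 1, 2, 1, 0, 2]
tagging = [0, 1, 2, 1, 0, 2]      # identical to the causal SNP
unlinked_a = [0, 0, 2, 2]
unlinked_b = [0, 2, 0, 2]         # uncorrelated with unlinked_a

print(genotype_r2(causal, tagging))
print(genotype_r2(unlinked_a, unlinked_b))
```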
Multi-marker analysis using LASSO
For covariate-adjusted phenotype $y_i$ and SNPs $x_{i1}, \ldots, x_{ip}$ of individual $i$ ($i = 1, \ldots, n$), the LASSO estimate is the solution of

$$\min_{\beta_0, \beta_1, \ldots, \beta_p} \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 \quad \text{subject to} \quad \sum_{j=1}^{p} |\beta_j| \le t.$$

The LASSO solution path provides a sequence of models, from the simplest model including only an intercept (when t = 0) to the most complex model including all SNPs as predictors (when t is very large). If a particular SNP becomes a predictor in the ith model, then that SNP tends to stay as a predictor for all larger models, but this does not always happen. For ranking SNPs, we used this "entry" number, which indicates when a particular SNP first becomes a predictor in the LASSO solution path. For our analysis, we evaluated the first 10,000 models in the LASSO solution path, using the R package lars [7]. We used Merlin to impute missing SNPs because lars requires each individual to have values for all predictors: removing individuals with partially missing SNPs would leave only one-tenth of the data. Using imputed genotypes also makes the data set more consistent with that used in the single-marker analysis.
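The idea of ranking SNPs by when they enter the path can be sketched as follows. This is not the lars algorithm used in the analysis: as a simpler stand-in, it fits the penalized (Lagrangian) form of LASSO by coordinate descent on a decreasing grid of penalty values and records the first grid step at which each coefficient becomes nonzero. The function names, the grid settings, and the toy genotypes are assumptions made for illustration, and every column is assumed to be non-constant.

```python
def soft_threshold(z, lam):
    """Closed-form LASSO update for one standardized predictor."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def standardize(column):
    """Center a column and scale it to unit variance (assumes it is not constant)."""
    n = len(column)
    mean = sum(column) / n
    centered = [v - mean for v in column]
    scale = (sum(v * v for v in centered) / n) ** 0.5
    return [v / scale for v in centered]


def lasso_entry_order(columns, y, n_lambdas=50, ratio=0.9, sweeps=200):
    """Return {column index: first grid step with a nonzero coefficient}."""
    n = len(y)
    cols = [standardize(c) for c in columns]
    y_mean = sum(y) / n
    resid = [v - y_mean for v in y]          # residual of the intercept-only model
    beta = [0.0] * len(cols)
    # Smallest penalty for which all coefficients are zero.
    lam = max(abs(sum(c[i] * resid[i] for i in range(n))) / n for c in cols)
    entry = {}
    for step in range(1, n_lambdas + 1):
        lam *= ratio                         # relax the penalty, warm-starting from beta
        for _ in range(sweeps):
            max_change = 0.0
            for j, c in enumerate(cols):
                # Covariance of column j with the partial residual (column j removed).
                rho = sum(c[i] * resid[i] for i in range(n)) / n + beta[j]
                new = soft_threshold(rho, lam)
                delta = new - beta[j]
                if delta != 0.0:
                    for i in range(n):
                        resid[i] -= delta * c[i]
                    beta[j] = new
                    max_change = max(max_change, abs(delta))
            if max_change < 1e-9:
                break
        for j, b in enumerate(beta):
            if b != 0.0 and j not in entry:
                entry[j] = step
    return entry


# Hypothetical toy data: 60 individuals, 3 SNPs.
n = 60
causal = [i % 3 for i in range(n)]
neighbor = [1 if i % 6 == 0 else causal[i] for i in range(n)]  # in strong LD with causal
unlinked = [(i // 3) % 3 for i in range(n)]                    # uncorrelated with causal
noise = [0.01 * ((i * 7) % 5 - 2) for i in range(n)]           # small deterministic noise
phenotype = [2.0 * causal[i] + noise[i] for i in range(n)]

entry = lasso_entry_order([causal, neighbor, unlinked], phenotype)
print(entry)
```

In this toy example the causal SNP enters at the first step, while the correlated neighbor enters later or not at all within the grid, which mirrors the exclusion of neighboring SNPs discussed in this paper.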

Results
Single-marker analysis using Merlin
Figure 1A shows association test results for Replicate 1 of the 200 simulated LDL phenotypes; results were consistent across all 200 replicates (Table 1). The major causal SNP rs2294207 showed statistically significant association, with p-value 4.5 × 10⁻¹⁹ for Replicate 1; across all 200 replicates, this SNP ranked 1.1 on average (Table 1), with p-values ranging from 6.9 × 10⁻¹³ to 1.6 × 10⁻²⁹. In Replicate 1, 14 SNPs near the major causal SNP (10 SNPs around 30.91 Mb and 4 SNPs around 30.95 Mb) had p-values ranging from 3.0 × 10⁻⁸ to 3.8 × 10⁻¹⁹ (Figure 1A); these SNPs showed significant association across all 200 replicates (Table 1). Ranks of these neighboring SNPs were almost in the order of their LD with the causal SNP. Out of the 12 polygenic SNPs, the most significantly associated SNP was rs5765113 (p-value 3.5 × 10⁻⁵, ranking 20 for Replicate 1); across all 200 replicates, this SNP ranked 35.8 on average (Table 1), with p-values ranging from 5.7 × 10⁻² to 7.9 × 10⁻⁸.

Multi-marker analysis using LASSO
[...] (Table 1). Because these nearby SNPs were highly correlated with the causal SNP, once they were included as predictors the causal SNP became a predictor much later (with average rank 5.3). In contrast to single-marker analysis, in which the top 15 SNPs with the smallest p-values were all near the major causal SNP, in the LASSO analysis only 3 of the top 15 SNPs were near the major causal SNP, and the remaining 12 SNPs were more or less uniformly located (Figure 1B). For Replicate 1 (Figure 1A), the 960 SNPs that were excluded from the LASSO analysis (cyan points) included these neighboring SNPs. This was consistent across all 200 replicates: all 14 neighboring SNPs were sometimes excluded from the LASSO solution path. For example, SNP rs136457 was excluded from the LASSO path in 159 out of 200 replicates even though its average rank from single-marker analysis was 11.9 (Table 1).
Overall, we found little consistency between ranks from Merlin and those from LASSO (correlation = 0.07 across all 200 replicates and correlation = 0.08 in Replicate 1, shown in Figure 1C).
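A correlation between two rank orders can be computed as the Pearson correlation of the rank vectors. The sketch below assumes both methods assign a rank to every SNP; how SNPs excluded from the LASSO path were ranked is not stated here, so the toy ranks are purely hypothetical.

```python
def rank_correlation(ranks_a, ranks_b):
    """Pearson correlation between two equally long lists of ranks."""
    n = len(ranks_a)
    mean_a = sum(ranks_a) / n
    mean_b = sum(ranks_b) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(ranks_a, ranks_b))
    var_a = sum((a - mean_a) ** 2 for a in ranks_a)
    var_b = sum((b - mean_b) ** 2 for b in ranks_b)
    return cov / (var_a * var_b) ** 0.5


# Hypothetical ranks of five SNPs from two methods.
single_marker_ranks = [1, 2, 3, 4, 5]
same_order = [1, 2, 3, 4, 5]
reversed_order = [5, 4, 3, 2, 1]

print(rank_correlation(single_marker_ranks, same_order))
print(rank_correlation(single_marker_ranks, reversed_order))
```

Values near zero, such as those reported above, indicate that a SNP's rank under one method carries almost no information about its rank under the other.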

Conclusion
In this paper, we applied single-marker analysis using Merlin and multi-marker analysis using LASSO to the simulated LDL phenotype data on chromosome 22. Single-marker analysis using Merlin correctly provided statistically significant association of the major causal SNP rs2294207, with p-value no greater than 6.9 × 10⁻¹³ for all 200 replicates. Multi-marker analysis using LASSO also included this causal SNP as the first predictor in 114 out of 200 replicates, indicating the importance of this SNP. When the causal SNP was not included as the first predictor, one of its three neighboring SNPs was included as the first predictor. Merlin declared 14 non-causal neighboring SNPs statistically significant, whereas the first 10,000 models in the LASSO solution paths often excluded these 14 SNPs. The 12 polygenic SNPs were less statistically significant than these 14 neighboring SNPs in both Merlin and LASSO analyses, indicating that their effects were too small to be detected. Overall, there was little consistency between the rank orders of the 4589 SNPs provided by Merlin and LASSO.
Our results indicate that Merlin and LASSO analyses provide different results. We observed that LASSO typically included 3 SNPs near the causal SNP out of the 15 SNPs that showed very strong association from Merlin and excluded the remaining SNPs from the LASSO path (up to the first 10,000 models). This may be useful because these neighboring SNPs are not causal. We expected that LASSO would provide better results for the 12 polygenic SNPs. However, this may not have occurred because the strength of their effects was much smaller than the effect of the major causal SNP; thus, for this data set the phenotype appears to be influenced by a single SNP, in which case single-marker analysis will perform better than multi-marker analysis. Hence, our results are inconclusive in terms of whether the LASSO analysis provides additional information.
The relative advantage of multi-marker analyses over single-marker will depend on the underlying disease model. Other penalized least-squares methods may provide results more similar to single-marker analysis than LASSO. Ridge regression (penalized regression with L2 penalty) shrinks the coefficients of correlated predictors toward each other, so they borrow strength from each other. In the extreme case of k identical predictors, they each get identical coefficients with 1/k th the size that any single one would get if fit alone. On the other hand, LASSO (with L1 penalty) is somewhat indifferent to very correlated predictors and will tend to pick one and ignore the rest. The elastic net regression (penalized regression with a convex combination of both penalties) can have the advantages of both ridge and LASSO [8].
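The behavior with identical predictors can be made explicit with a short derivation. It is a sketch under simplifying assumptions (k exact copies of one predictor column x, no intercept, penalty parameter λ), with notation introduced here for illustration.

```latex
% Ridge with k identical copies of x and coefficients \beta_1, \ldots, \beta_k:
\min_{\beta_1,\ldots,\beta_k}\;
  \Big\| y - x \sum_{j=1}^{k} \beta_j \Big\|^2 + \lambda \sum_{j=1}^{k} \beta_j^2 .

% The fit depends only on s = \sum_j \beta_j. For fixed s, the penalty
% \sum_j \beta_j^2 is smallest when \beta_j = s/k for all j, giving \lambda s^2 / k:
\min_{s}\; \| y - x s \|^2 + \frac{\lambda}{k}\, s^2
\quad\Longrightarrow\quad
\hat{s} = \frac{x^{\top} y}{x^{\top} x + \lambda/k},
\qquad
\hat{\beta}_j = \frac{1}{k} \cdot \frac{x^{\top} y}{x^{\top} x + \lambda/k} .

% A single copy fit alone gives x^{\top} y / (x^{\top} x + \lambda), so each
% \hat{\beta}_j is about 1/k of that value when \lambda is small relative to x^{\top} x.

% LASSO with the same copies: for coefficients of a common sign,
% \sum_j |\beta_j| = |s|, so the objective
\| y - x s \|^2 + \lambda |s|
% is the same for every split of s among the copies, and the solution is not unique.
```

Under these assumptions, ridge spreads the effect equally over the copies, whereas the L1 penalty gives no reason to prefer any particular split, which is consistent with LASSO keeping one of several highly correlated SNPs and dropping the others.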
We suspect that LASSO may provide better inference for diseases with multiple causal SNPs that are not in LD. For other cases (e.g., diseases with multiple causal SNPs in LD), ridge, elastic net, or haplotype analysis may provide better inference. Further investigation is needed.