
# Using penalized regression to predict phenotype from SNP data

Svetlana Cherlin^{1}, Richard A. J. Howey^{1} and Heather J. Cordell^{1}

**Published:** 17 September 2018

## Abstract

### Background

In a typical genome-enabled prediction problem there are many more predictor variables than observations. This prohibits the application of multiple linear regression, because the unique ordinary least squares estimators of the regression coefficients are not defined. To overcome this problem, penalized regression methods have been proposed, which shrink the coefficients toward zero.

### Methods

We explore prediction of phenotype from single nucleotide polymorphism (SNP) data in the GAW20 data set using a penalized regression approach (LASSO [least absolute shrinkage and selection operator] regression). We use 10-fold cross-validation to assess predictive performance and 10-fold nested cross-validation to specify a penalty parameter.

### Results

By analyzing approximately 600,000 SNPs we find that, when the sample size comprises a few hundred individuals, SNP effects are heavily penalized, resulting in poor predictive performance. Increasing the sample size to a few thousand individuals results in much weaker penalization of the true effects, thus greatly improving the prediction.

### Conclusions

LASSO regression results in heavy shrinkage of the regression coefficients and requires large sample sizes (several thousand individuals) to achieve good prediction.

## Background

In a typical genome-wide association study (GWAS), several thousand to several million single nucleotide polymorphism (SNP) markers are genotyped in a sample of several hundred to several thousand individuals, leading to many more predictor variables than observations. In this case, multiple linear regression cannot be used, because the unique ordinary least squares estimators of the regression coefficients are not defined. Methods that allow for more predictors than observations [1] may overfit the model, and overfitted models are likely to demonstrate poor predictive ability when applied to new data. To overcome these problems, penalized regression methods have been proposed [2–6], which shrink the regression coefficients toward zero. Depending on the form of the penalty function, some methods (eg, ridge regression [3]) only shrink the coefficients without setting them to zero, whereas other methods (eg, the least absolute shrinkage and selection operator [LASSO] regression [2]) allow shrinkage of the coefficients down to exactly zero, thus performing variable selection.
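The contrast between the two penalty types can be made concrete in the special case of an orthonormal design, where both estimators have closed forms: ridge rescales each ordinary least squares coefficient toward zero, while the LASSO soft-thresholds it. A minimal sketch (illustrative only, not part of the original analysis):

```python
def ridge_shrink(beta_ols, lam):
    """Ridge estimate for an orthonormal design: proportional shrinkage."""
    return beta_ols / (1.0 + lam)

def lasso_shrink(beta_ols, lam):
    """LASSO estimate for an orthonormal design: soft thresholding."""
    sign = 1.0 if beta_ols >= 0 else -1.0
    return sign * max(abs(beta_ols) - lam, 0.0)

# A large OLS coefficient is shrunk by both penalties, but only the
# LASSO sets a small coefficient exactly to zero.
for b in (2.0, 0.3):
    print(b, "->", round(ridge_shrink(b, 0.5), 3), round(lasso_shrink(b, 0.5), 3))
# 2.0 -> 1.333 (ridge) vs 1.5 (LASSO); 0.3 -> 0.2 (ridge) vs 0.0 (LASSO)
```

This is why the LASSO performs variable selection while ridge regression does not.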

The strength of the penalty is controlled by a regularization parameter that determines the amount of shrinkage imposed. One challenge of penalized approaches is choosing an optimal value of this parameter. This is often done by *k*-fold cross-validation, choosing the parameter value that, when used for fitting in the training folds, minimizes the average mean squared error in the test folds. Assessing predictive performance can also be done using *k*-fold cross-validation. In this case, the two cross-validation experiments are combined into one so-called nested cross-validation: an outer cross-validation loop is used to assess the predictive performance, while, within each outer fold, an inner cross-validation loop is used to choose the regularization parameter [7]. The most commonly used number of inner and outer folds is 10, as it has been shown to provide reasonable accuracy [8].
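The nested scheme can be sketched as follows. This is our own illustrative skeleton, not the authors' code; for brevity the penalized fit is a ridge closed form (the paper itself uses LASSO, which requires an iterative solver), and all function names are ours:

```python
import numpy as np

def kfold(n, k, rng):
    """Split shuffled indices 0..n-1 into k test folds."""
    return np.array_split(rng.permutation(n), k)

def penalized_fit(X, y, lam):
    """Stand-in penalized fit (ridge closed form); the paper uses LASSO,
    which needs an iterative solver instead of this one-line solution."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def inner_cv_lambda(X, y, lambdas, k, rng):
    """Inner loop: pick the lambda minimizing mean test-fold MSE."""
    folds = kfold(len(y), k, rng)
    mse = np.zeros(len(lambdas))
    for i, lam in enumerate(lambdas):
        for test in folds:
            train = np.setdiff1d(np.arange(len(y)), test)
            beta = penalized_fit(X[train], y[train], lam)
            mse[i] += np.mean((y[test] - X[test] @ beta) ** 2)
    return lambdas[int(np.argmin(mse))]

def nested_cv_predictions(X, y, lambdas, k=10, seed=0):
    """Outer loop: out-of-sample predictions, with lambda re-chosen
    inside every outer training fold (the test fold never sees it)."""
    rng = np.random.default_rng(seed)
    pred = np.empty(len(y))
    for test in kfold(len(y), k, rng):
        train = np.setdiff1d(np.arange(len(y)), test)
        lam = inner_cv_lambda(X[train], y[train], lambdas, k, rng)
        pred[test] = X[test] @ penalized_fit(X[train], y[train], lam)
    return pred
```

Predictive performance is then summarized by comparing `pred` with the observed phenotype (eg, by correlation or mean squared error); because the regularization parameter is chosen afresh within each outer training fold, this estimate is not biased by the parameter selection [7].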

Here, we focus on LASSO linear regression, which has the property of variable selection. We apply 10-fold cross-validation to assess the out-of-sample predictive performance, using 10-fold nested cross-validation to specify the penalty parameter. We explore the effect of the sample size on the predictive ability of LASSO regression.

## Methods

The GAW20 data are based on the Genetics of Lipid Lowering Drugs and Diet Network (GOLDN) study data set [9], which investigated the epigenetic determinants of triglyceride (TG) response. TGs are major blood lipids [10] and constitute an important biomarker of cardiovascular disease risk [11]. Previous GWAS have found a number of loci associated with TG levels [12]. We focus on predicting the TG response by analyzing the GWAS (SNP) data and four measures of TG. The first two measures (at visits 1 and 2) were taken before the lipid-lowering drug treatment; the last two (at visits 3 and 4) were taken after the treatment.

We performed quality control (QC) on the GAW20 GWAS (SNP) data using standard procedures outlined in Turner et al. [13]. SNP-level QC removed 63,907 SNPs with low minor allele frequency (< 0.01) and 2694 SNPs that failed a test of Hardy-Weinberg equilibrium (*p* ≤ 0.00001). Of the 822 individuals for whom we have SNP data, the simulated phenotype is available for 680 individuals and the real phenotype is available for 778 individuals. Working on a log scale, we take the mean of the TG measures for visits 1 and 2 as a baseline measure, and the mean of visits 3 and 4 as a follow-up measure; if only one measure of a pair is available, we take that measure on its own. We adjust the follow-up measure for the baseline measure, age, center, smoking status, and the first 20 principal components (PCs) of the SNP data using linear regression. The number of PCs was chosen by examining the quantile–quantile (Q-Q) plot of the *p* values from ordinary linear regression: with 20 PCs incorporated, the Q-Q plot showed no inflation of the *p* values, suggesting that relatedness and population stratification are accounted for. We take the standardized residuals as our final phenotype.
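The phenotype construction described above can be sketched as follows. This is our own illustration, assuming the TG measures are arranged as an n × 4 array (visits 1–4, with NaN for missing values); the function and variable names are hypothetical:

```python
import numpy as np

def build_phenotype(tg, covariates):
    """tg: (n, 4) array of TG measures at visits 1-4 (NaN where missing);
    covariates: (n, c) array of adjustment covariates (age, center,
    smoking status, PCs). Shapes and names are our assumptions."""
    log_tg = np.log(tg)
    # mean of visits 1-2 (baseline) and 3-4 (follow-up); nanmean falls
    # back to the single available visit when one measure is missing
    baseline = np.nanmean(log_tg[:, :2], axis=1)
    followup = np.nanmean(log_tg[:, 2:], axis=1)
    # adjust the follow-up measure for baseline + covariates by OLS
    design = np.column_stack([np.ones(len(tg)), baseline, covariates])
    coef, *_ = np.linalg.lstsq(design, followup, rcond=None)
    resid = followup - design @ coef
    # the standardized residuals are the final phenotype
    return (resid - resid.mean()) / resid.std(ddof=1)
```

The returned vector has mean zero and unit standard deviation by construction, which is convenient when comparing predictive performance across data sets.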

### LASSO

We assume the linear model

$$ \mathbf{y} = \beta_0 + \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}, $$

where y is a vector of response variables; X is an *n* × *p* matrix of predictor variables; *β*_{0} is an intercept; β = (*β*_{1}, …, *β*_{p}) is a vector of regression coefficients; and ϵ is a vector of the error terms, ϵ~N(0, *σ*^{2}). For *n* > *p* the estimated values of the coefficients are found by minimizing the residual sum of squares:

$$ \widehat{\boldsymbol{\beta}} = \underset{\boldsymbol{\beta}}{\operatorname{argmin}} \left\Vert \mathbf{y} - \beta_0 - \mathbf{X}\boldsymbol{\beta} \right\Vert_2^2. $$

This minimization problem does not have a unique solution for *n* < *p*. In this case, penalized regression is often used, where the estimators of β are found by minimizing the sum of the residual sum of squares and a penalty function:

$$ \widehat{\boldsymbol{\beta}} = \underset{\boldsymbol{\beta}}{\operatorname{argmin}} \left\{ \left\Vert \mathbf{y} - \beta_0 - \mathbf{X}\boldsymbol{\beta} \right\Vert_2^2 + P\left(\lambda, \boldsymbol{\beta}\right) \right\}, $$

where *P*(*λ*, β) is the penalty function with a regularization parameter *λ* that controls the amount of shrinkage. The LASSO penalty [2] utilizes an *ℓ*_{1}-norm penalty, that is, \( P\left(\lambda, \boldsymbol{\beta} \right)=\lambda \left\Vert \boldsymbol{\beta} \right\Vert_{\ell_1} \); consequently, the estimators of the coefficients take the form:

$$ \widehat{\boldsymbol{\beta}} = \underset{\boldsymbol{\beta}}{\operatorname{argmin}} \left\{ \left\Vert \mathbf{y} - \beta_0 - \mathbf{X}\boldsymbol{\beta} \right\Vert_2^2 + \lambda \left\Vert \boldsymbol{\beta} \right\Vert_{\ell_1} \right\}. $$

One important property of the LASSO penalty is that it allows the coefficients to be set to exactly zero, thus performing variable selection.
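This variable-selection behavior comes from the soft-thresholding step in the coordinate-wise update of the LASSO objective. A minimal sketch (our didactic implementation, not the solver used in the analysis; production analyses typically rely on optimized packages such as glmnet):

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=200):
    """LASSO by cyclic coordinate descent for the objective
    (1/(2n))*||y - X beta||^2 + lam*||beta||_1, assuming the columns
    of X are standardized and y is centered (so no intercept is needed)."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(p):
            # partial residual with feature j excluded from the fit
            r = y - X @ beta + X[:, j] * beta[j]
            rho = X[:, j] @ r
            # soft threshold: small coefficients land exactly on zero
            beta[j] = np.sign(rho) * max(abs(rho) - n * lam, 0.0) / col_sq[j]
    return beta
```

Coefficients whose marginal contribution `rho` falls below the threshold are set to exactly zero, which is how the LASSO discards the vast majority of SNPs in a genome-wide analysis.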

## Results

### Simulated data

### Real data

## Discussion

The predictive ability of LASSO was assessed on the simulated and the real GAW20 data sets through 10-fold nested cross-validation. Both data sets demonstrated poor predictive performance, and the predicted phenotype exhibited a noticeable shrinkage toward zero. By examining the effect of the most significant causal SNP in the simulated data, we found that it was heavily penalized. To investigate this issue, we analyzed a much larger data set (approximately 7000 individuals) and found that the effect of the causal SNP was much less penalized, thus enabling the best prediction possible with that SNP.

## Conclusions

The predictive ability of LASSO was assessed on the GAW20 data sets and on a much larger data set available from previous studies. Poor predictive performance was obtained for data sets of a few hundred individuals with a weak signal, which can be explained by the substantial shrinkage of the LASSO regression coefficients. Other regularized methods that do not shrink the coefficients as heavily might be of use: with hyper-LASSO [6] the extent of the shrinkage depends on the size of the coefficients, and adaptive LASSO [5] uses different adaptive weights for penalizing different coefficients; both can potentially lead to more moderate shrinkage. However, with LASSO, increasing the sample size from a few hundred to a few thousand individuals increased the strength of the signal and reduced the shrinkage of the regression coefficients, thus improving the prediction. We conclude that LASSO regression requires large sample sizes (several thousand individuals) to achieve good prediction.

## Declarations

### Funding

Publication of this article was supported by NIH R01 GM031575. Support for this work was provided by the Wellcome Trust (Grant 102858/Z/13/Z) and an MRC/Arthritis Research UK award: Maximizing Therapeutic Utility in RA (MATURA) (Grant MR-K015346).

### Availability of data and materials

The data that support the findings of this study are available from the Genetic Analysis Workshop (GAW), but restrictions apply to the availability of these data, which were used under license for the current study. Qualified researchers may request these data directly from GAW.

### About this supplement

This article has been published as part of BMC Proceedings Volume 12 Supplement 9, 2018: Genetic Analysis Workshop 20: envisioning the future of statistical genetics by exploring methods for epigenetic and pharmacogenomic data. The full contents of the supplement are available online at https://bmcproc.biomedcentral.com/articles/supplements/volume-12-supplement-9.

### Authors’ contributions

SC conducted the statistical analysis and drafted the manuscript. RAJH conducted the transformation of the data to a PLINK-readable format and performed QC on the data. HJC conceived the overall study and critically revised the manuscript. All the authors read and approved the final manuscript.

### Ethics approval and consent to participate

Not applicable.

### Consent for publication

Not applicable.

### Competing interests

The authors declare that they have no competing interests.

### Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Open Access**This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

## Authors’ Affiliations

## References

- Efron B, Hastie T, Johnstone I, Tibshirani R. Least angle regression. Ann Stat. 2004;32(2):407–99.
- Tibshirani R. Regression shrinkage and selection via the lasso. J R Stat Soc Series B Stat Methodol. 1996;58(1):267–88.
- Cessie SL, Houwelingen JCV. Ridge estimators in logistic regression. J R Stat Soc Ser C Appl Stat. 1992;41(1):191–201.
- Zou H, Hastie T. Regularization and variable selection via the elastic net. J R Stat Soc Series B Stat Methodol. 2005;67(2):301–20.
- Zou H. The adaptive lasso and its oracle properties. J Am Stat Assoc. 2006;101(476):1418–29.
- Hoggart CJ, Whittaker JC, De Iorio M, Balding DJ. Simultaneous analysis of all SNPs in genome-wide and re-sequencing association studies. PLoS Genet. 2008;4(7):e1000130.
- Varma S, Simon R. Bias in error estimation when using cross-validation for model selection. BMC Bioinformatics. 2006;7(1):91.
- Kohavi R. A study of cross-validation and bootstrap for accuracy estimation and model selection. In: Proceedings of the 14th International Joint Conference on Artificial Intelligence (IJCAI-95). 1995. p. 1137–43.
- Irvin MR, Zhi D, Joehanes R, Mendelson M, Aslibekyan S, Claas SA, Thibeault KS, Patel N, Day K, Jones LW, et al. Epigenome-wide association study of fasting blood lipids in the genetics of lipid-lowering drugs and diet network study. Circulation. 2014;130(7):565–72.
- Kwiterovich PO Jr. The metabolic pathways of high-density lipoprotein, low-density lipoprotein, and triglycerides: a current review. Am J Cardiol. 2000;86(12A):5L–10L.
- Miller M, Stone NJ, Ballantyne C, Bittner V, Criqui MH, Ginsberg HN, Goldberg AC, Howard WJ, Jacobson MS, Kris-Etherton PM, et al. Triglycerides and cardiovascular disease: a scientific statement from the American Heart Association. Circulation. 2011;123(20):2292–333.
- Willer CJ, Schmidt EM, Sengupta S, Peloso GM, Gustafsson S, Kanoni S, Ganna A, Chen J, Buchkovich ML, Mora S, et al. Discovery and refinement of loci associated with lipid levels. Nat Genet. 2013;45(11):1274–83.
- Turner S, Armstrong LL, Bradford Y, Carlson CS, Crawford DC, Crenshaw AT, de Andrade M, Doheny KF, Haines JL, Hayes G, et al. Quality control procedures for genome-wide association studies. Curr Protoc Hum Genet. 2011;Chapter 1:Unit 1.19.
- Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MA, Bender D, Maller J, Sklar P, de Bakker PI, Daly MJ, et al. PLINK: a tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet. 2007;81(3):559–75.