- Proceedings
- Open Access
- Published:

# A 2-step penalized regression method for family-based next-generation sequencing association studies

*BMC Proceedings*
**volume 8**, Article number: S25 (2014)

## Abstract

Large-scale genetic studies are often composed of related participants, and utilizing familial relationships can be cumbersome and computationally challenging. We present an approach to efficiently handle sequencing data from complex pedigrees that incorporates information from rare variants as well as common variants. Our method employs a 2-step procedure that sequentially regresses out correlation from familial relatedness and then uses the resulting phenotypic residuals in a penalized regression framework to test for associations with variants within genetic units. The operating characteristics of this approach are detailed using simulation data based on a large, multigenerational cohort.

## Background

As biological techniques for assaying genetic variation have advanced, so have methods for candidate gene, genome-wide, exome, and whole genome association studies. In addition to methodological advancements purely statistical in nature, each progression has resulted in new analytical complexities, including those relating to problems of assuring reliable data quality and handling massive multiple testing and computational challenges. Increasingly, the ability to analyze the complex data resulting from genetic studies relies on having specialized software and vast computational resources. Methods are needed that are able to appropriately respect data complexity and are also accessible to investigators who wish to prevent an expensive computing investment.

Family-based association methodologies in the specific context of next-generation sequencing have been proposed [1]. Like any approach incorporating related individuals, these methods must somehow either utilize or, at the least, take into account, the dependence structure that necessarily exists within pedigrees. This within-family information can be drawn upon for the sake of inference as in transmission-based test statistics or, alternatively, can be adjusted for to remove any dependency that could violate independence assumptions of the downstream statistical test. When a goal is computational simplicity, in the context of large, complex pedigrees it could be advantageous to perform a within-family adjustment. The next-generation sequencing data set generated from the Genetic Analysis Workshop 18 (GAW18) contains more than 8 million variants from 20 complex pedigrees, and is thus ideal to examine these types of approaches.

The transition from designing genome-wide association studies (GWAS) that rest upon the common disease and common variant hypothesis to exome and whole genome studies that are better equipped to ascertain effects from rare variations has resulted in much interest for methods to group or aggregate variants in order to test for the multiple rare variant hypothesis that rarer variants of larger effect underlie common disease variation. In what follows, we present a method that handles complex pedigrees in a computationally accessible manner while also incorporating information over a genetic functional unit.

## Methods

### Data set

Whole genome sequencing for a subset of the San Antonio Family Studies (SAFS) participants was conducted through the Type 2 Diabetes Genetic Exploration by Next-generation sequencing in Ethnic Samples (T2D-GENES) Consortium. We examined the 1,215,399 variants on chromosome 3 that were genotyped on the 955 fully phenotyped subjects.

### Statistical analysis

The simple and fast way to test the effect of a genetic marker on a trait is to use a contrast of the frequencies or means among genotype groups in a linear model. However, this method does not explicitly take into account relationships among family members when available, which can lead to both false-positive and false-negative associations. The use of a linear mixed model is a potential solution to this problem when family-based data is available. These models adjust for familial relationships by modeling the polygenic component between individuals as a random effect but are computationally intensive. Recently, Aulchenko et al [2] developed a rapid and robust 2-step method based on a mixed-model framework for family-based association studies. The first step of this method is to perform a single polygenic analysis using the complete pedigree but ignoring marker data. Subsequently, the residuals from this analysis, which are now adjusted for polygenic covariance and fixed covariate effects, are used as an updated quantitative trait for association testing using classical methods for unrelated individuals. To our knowledge, however, this approach has not been adapted to rare variant testing. We adopted this 2-step polygenic regression adjustment and residual testing approach to test for both common and rare variants by inclusion of a penalized regression step. We then applied the method to the SAFS genome sequencing data.

*Step 1: adjustment for family structure*. First, we estimated the kinship matrix, \text{\Phi}, in the SAFS pedigrees using a linkage disequilibrium-pruned subset of all common single-nucleotide polymorphisms (SNPs) genotyped in 955 participants [1, 2]. For the sake of comparison, we also calculated the pairwise kinship coefficients based on the provided pedigree structure [3]. These 2 approaches for parameterizing relatedness were highly concordant as expected (correlation = 0.962; average difference 0.003 ± 0.006). Because genomic kinship coefficients do not depend on the completeness and quality of pedigrees and can provide more accurate information on ancestral relatedness than the given pedigree structure [1, 4], we used genomic-estimated kinship to adjust for family structure in the following mixed model:

where *y*_{
i
} is the phenotype of the *i*th individual, \mu is a grand mean, {\beta}_{1} and {\beta}_{2} are the fixed effects of age and gender, and *G*_{
i
} and *e*_{
i
} are random additive polygenic and residual effects, respectively, for individual i. The vectors ** G**and e, consisting of all polygenic and residual effects, follow zero-mean multivariate normal distributions with variance-covariance matrices 2\text{\Phi}{\sigma}_{G}^{2} and {I}_{n}{\sigma}_{e}^{2}, respectively. Here {\sigma}_{G}^{2} is the additive genetic variance explained by the kinship-based polygenic component, {I}_{n} is an (

*n × n*) identity matrix,

*n*is the number of subjects, and {\sigma}_{e}^{2} is the residual variance. The following vector of polygenic residuals, {y}^{*}, now has no dependence induced by familial relationships and was estimated as {y}^{*}={{\displaystyle \hat{\sigma}}}_{e}^{2}{{\displaystyle \widehat{\sum}}}^{-1}\left({\xea}^{*}\right), where {\xea}^{*} is the vector of trait values adjusted for covariates, that is, y-\left({\displaystyle \widehat{\mu}}+{\widehat{\beta}}_{1}\phantom{\rule{2.77695pt}{0ex}}age+{\widehat{\beta}}_{2}\phantom{\rule{2.77695pt}{0ex}}gender\right), and {\displaystyle \widehat{\sum}} is the restricted maximum likelihood estimate of 2\text{\Phi}{\sigma}_{G}^{2}+I{\sigma}_{e}^{2}. These residuals were then used in the second step as quantitative traits from unrelated individuals. The R package GenABEL was used for this analysis [4].

*Step 2: penalized regression including a gene-based group penalty*. Instead of using the linear model in the original approach proposed by Aulchenko et al, we applied a lasso-type group penalized regression to better incorporate rare variants. This type of penalized regression has been applied to many forms of genetic analysis, including microarray data [5] and GWAS data [6]. Friedman et al [7] introduced the mixture of group and lasso penalties, and this approach has been explored using, for example, breast cancer GWAS data for testing common and rare variants [8]. In this study, variants were grouped by genes and assigned a weight, {s}_{j}=2\sqrt{{p}_{j}\left(1-{p}_{j}\right)}, where {p}_{j} is the minor allele frequency for the *j*th SNP. The factor 2 makes the value of the weight range between 0 and 1. In our study, we adopted this method for the whole genome sequencing data with the residuals generated via step 1 from the family data. Briefly, we have an objective function to be minimized that comprises the underlying sums of squares from ordinary regression (first term) modified by a conventional lasso penalty and, finally, the group penalty, as below:

where {||.||}_{2} denotes the L^{2} norm, SNP is an n\times m matrix of genotypes for the m SNPs, \beta is an m\times 1 vector of SNP effects, {\lambda}_{L} and {\lambda}_{E} are the lasso and group penalties, respectively, and {t}_{G} is a weight function operating on the *G*th group, often used to adjust for gene size. Here, {\beta}_{j} represents the (fixed) effect of the *j*th SNP, and {\beta}_{G} corresponds to (fixed) effects for SNPs within the *G*th group, where for our purposes *G* is defined by gene. The use of the *L*^{2} norm on the effects within a group encourages the incorporation of rare causal variants inside the same gene. To test the power of this approach on rare variants in particular, we further excluded common variants (minor allele frequency [MAF] ≥0.05) and repeated the analysis. These analyses were performed using the statistical software Mendel (Version 12.0) [9, 10]. This version did not, to our knowledge, incorporate {t}_{G} weighting, so its default of 1 was used throughout, giving equal weight to all genes.

## Results

We used GRCh37/hg19 build annotations to map chromosome 3 variants to genes; 521,355 of the 1,215,399 variants were located in a total of 1,165 genes. There are 188 causal variants, ie, variants with nonzero effect sizes for either diastolic blood pressure (DBP) or systolic blood pressure (SBP) at exam 1; these map to 31 unique genes. One hundred sixty-eight of these variants are causal for DBP and 134 for SBP. Genes with at least 1 causal variant are here referred to as *causal genes*. Table 1 presents SNP counts and other descriptors for each causal gene. Some variants used to simulate phenotypes lie nearby but outside of the genes assigned via the GAW simulation. As a result of these discrepancies, some causal genes include no mapped causal SNPs, an apparent contradiction. Interestingly, because we report on the proportion of simulation replicates that a SNP from a causal gene is in the final penalized regression model (Table 2), some causal genes with no causal SNPs are still detected. For example, *PAK2* contains 4 SNPs used to model SBP and DBP, but, although none of these map within *PAK2* using GRCh37/hg19, *PAK2* can still be found in the fitted models.

Table 2 provides the gene-based probabilities of discovering at least 1 SNP within each trait-specific (DBP or SBP) causal gene using either a 90% or a 50% lasso weight \left(\frac{{\lambda}_{L}}{\left({\lambda}_{L}+{\lambda}_{E}\right)}=\frac{{\lambda}_{L}}{\lambda}=0.5\right), as suggested by Zhou et al [8], and using all variants or only rare ones (MAF <0.05). For each model, we chose the regression penalty parameter to estimate the nonzero effects of approximately 134 variant predictors. This number was chosen for convenience from the number of causal SNPs for SBP. We examined other choices for this parameter (data not shown) and, even though not performing thorough cross-validation (a step that is common for penalized regression approaches but is not yet incorporated into Mendel), this appeared to provide a reasonable balance between capturing true and false positives in the final model. We examined how varying the relative contributions of the pure and group lasso penalties affects true-positive and false-positive rates. Figure 1 displays these rates for each of the 200 simulation replicates when testing all variants. Results when restricting to rare variants exhibit similar patterns, but have uniformly higher rates because of the exclusion of causal and noncausal genes with only common variants. An increase in the proportion of penalty for the pure lasso results in more distinct genes entering the final model, so both true-positive and false-positive rates are higher.

*MAP4* is consistently discovered using our approach, which is not surprising because the variants within *MAP4* confer more than 5% heritability for both DBP and SBP. No other gene encompasses that amount of heritability for either trait. In fact, the next highest heritability is that of *FLNB* for SBP (0.28%), and the other genes are correspondingly much less reliably detected. A few genes (eg, *ARHGEF3, FLNB*, and *SCAP*) are discovered with at least 10% probability in some models, although it is difficult to infer with confidence any pattern of characteristics for detection when these are so low. It appears that detection probability increases with more weight placed toward the pure lasso penalty, which is consistent with Figure 1. Restricting analysis to rare variants improves causal gene detection at least partially, as a result of the reduction in the number of genes considered, while retaining the same number of variant predictors in the model; for some genes (eg, *FLNB, MUC13, SCAP*, and *ZBTB38*) the improvement is considerable. Some interesting patterns in results from Table 2 can be attributed to gene characteristics in Table 1. For example, *FLNB* shows a markedly higher detection probability when testing SBP with all variants using a 90% lasso proportion; this discrepancy is because a common causal variant unique to SBP is removed from consideration when examining only common variants and is more reliably detected with pure lasso.

Per-gene false-positive rates are reasonably maintained (all below 0.03%), partially because of the upper bound constraint that the penalty tuning parameter imposes. That is, because only approximately 134 variants can enter the final regression model, there is a limit to the number of times each noncausal gene can be falsely discovered. False discovery rates (ie, the proportion of genes in the final models that are noncausal) are large, often above 90%, which is a known problem with penalized regression strategies.

## Discussion

Methods for handling complex pedigrees are varied and are often computationally intensive. The advent of next-generation sequencing exacerbates these complexities. We have proposed a novel method for approaching this problem that is easily implemented in freely available software packages and eases computational burden by regressing out correlations caused by familial relatedness in an initial step, preventing the estimation of often complex mixed models for each variant under consideration.

We chose to use a kinship matrix estimated based on genomic data to adjust for family structure because this confers several advantages over pedigree kinship [1, 4, 6]. First, it does not depend on the completeness and quality of the pedigree. For example, if one child is adopted but this information is not provided, the genomic kinship can give the correct estimation, while the pedigree kinship will classify the child as the first-degree relative of the parents and the sibs, which can induce bias. Second, genomic kinship may give a better estimate of a true covariance between individual genomes, while pedigree kinship provides only the expectation of the proportion of genome shared identical by descent. Third, genomic kinship can be incorporated in the presence of potential population stratification. Therefore, the use of genomic kinship is expected to lead to better estimates of polygenic model, and thus better power to detect association.

Methods that group variants within genes and treat the gene as a functional unit, as in the group lasso portion of our approach, can efficiently borrow information across the gene without necessarily testing the disease-influencing variants. We find that, in the context of the GAW18 simulated data, our method can successfully and consistently discover disease genes with sufficient heritability, but is largely underpowered when heritability is below 0.03%. More exploration of the proposed approach through simulation is warranted to examine cross-validation strategies for choosing lasso parameters, the related effects of linkage disequilibrium structure, and private variants. Notably, private variants (ie, those unique to a pedigree) will not likely be detected using our proposed methodology. The incorporation of tracking genetic transmissions will be required to do so, and any method that treats familial relatedness as a nuisance, as we do, in order to utilize test statistics for independent subjects will be severely underpowered for these variants.

## References

Zhu Y, Xiong M: Family-based association studies for next-generation sequencing. Am J Hum Genet. 2012, 90: 1028-1045. 10.1016/j.ajhg.2012.04.022.

Aulchenko YS, De Koning D-J, Haley C: Genomewide rapid association using mixed model and regression: a fast and simple method for genomewide pedigree-based quantitative trait loci association analysis. Genetics. 2007, 177: 577-585. 10.1534/genetics.107.075614.

Sinnwell JP, Therneau TM, Atkinson EJ, Schaid DJ, Matsumoto ME, McDonnell SK: Kinship2: an enhancement to the kinship R package with additional pedigree utilities. 2011, R package, [http://CRAN.R-project.org/package=kinship2]

Aulchenko YS, Ripke S, Isaacs A, Van Duijn CM: GenABEL: an R library for genome-wide association analysis. Bioinformatics. 2007, 23: 1294-1296. 10.1093/bioinformatics/btm108.

Wu TT, Lange K: Coordinate descent algorithms for lasso penalized regression. Ann Appl Stat. 2008, 2: 224-244. 10.1214/07-AOAS147.

Wu TT, Chen YF, Hastie T, Sobel E, Lange K: Genome-wide association analysis by lasso penalized logistic regression. Bioinformatics. 2009, 25: 714-721. 10.1093/bioinformatics/btp041.

Friedman J, Hastie T, Tibshirani R: A note on the group lasso and a sparse group lasso. 2010, 8-[http://arxiv.org/abs/1001.0736]

Zhou H, Sehl ME, Sinsheimer JS, Lange K: Association screening of common and rare genetic variants by penalized regression. Bioinformatics. 2010, 26: 2375-2382. 10.1093/bioinformatics/btq448.

Lange K, Cantor R, Horvath S, Perola M, Sabatti C, Sinsheimer JSE: MENDEL version 4.0: a complete package for the exact genetic analysis of discrete traits in pedigree and population data sets. Am J Hum Genet. 2001, 69: 504-10.1086/322739.

Zhou H, Alexander DH, Sehl ME, Sinsheimer JS, Sobel EM, Lange K: Penalized regression for genome-wide association screening of sequence data. Pac Symp Biocomput. 2011, 106-117.

## Acknowledgements

We are thankful to the section editor and two anonymous reviewers for constructive suggestions. The Genetic Analysis Workshop is supported by National Institutes of Health grant R01 GM031575. This work was supported by NIH grants 8P20GM103436-12 (DWF, KN) and K25AG043546 (DWF). The GAW18 whole genome sequence data were provided by the T2D-GENES Consortium, which is supported by NIH grants U01 DK085524, U01 DK085584, U01 DK085501, U01 DK085526, and U01 DK085545. The other genetic and phenotypic data for GAW18 were provided by the San Antonio Family Heart Study and San Antonio Family Diabetes/Gallbladder Study, which are supported by NIH grants P01 HL045222, R01 DK047482, and R01 DK053889. The Genetic Analysis Workshop is supported by NIH grant R01 GM031575.

This article has been published as part of *BMC Proceedings* Volume 8 Supplement 1, 2014: Genetic Analysis Workshop 18. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcproc/supplements/8/S1. Publication charges for this supplement were funded by the Texas Biomedical Research Institute.

## Author information

### Authors and Affiliations

### Corresponding author

## Additional information

### Competing interests

The authors declare that they have no competing interests.

### Authors' contributions

SS, XW, and DWF designed the overall study. XD, SS, KN, and DWF conducted the statistical analyses and created the tables and figure. SS and DWF drafted the manuscript, which was revised by XD, KN, and XW. All authors discussed the project throughout, read, and approved the final manuscript.

## Rights and permissions

**Open Access** This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.

The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

To view a copy of this licence, visit https://creativecommons.org/licenses/by/4.0/.

The Creative Commons Public Domain Dedication waiver (https://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

## About this article

### Cite this article

Ding, X., Su, S., Nandakumar, K. *et al.* A 2-step penalized regression method for family-based next-generation sequencing association studies.
*BMC Proc* **8, **S25 (2014). https://doi.org/10.1186/1753-6561-8-S1-S25

Published:

DOI: https://doi.org/10.1186/1753-6561-8-S1-S25

### Keywords

- Rare Variant
- Causal Gene
- Genomic Kinship
- Causal SNPs
- Group Lasso