Detection of rare functional variants using group ISIS
© Niu et al; licensee BioMed Central Ltd. 2011
Published: 29 November 2011
Genome-wide association studies have been firmly established in investigations of the associations between common genetic variants and complex traits or diseases. However, a large portion of complex traits and diseases cannot be explained well by common variants. Detecting rare functional variants becomes a trend and a necessity. Because rare variants have such a small minor allele frequency (e.g., <0.05), detecting functional rare variants is challenging. Group iterative sure independence screening (ISIS), a fast group selection tool, was developed to select important genes and the single-nucleotide polymorphisms within. The performance of the group ISIS and group penalization methods is compared for detecting important genes in the Genetic Analysis Workshop 17 data. The results suggest that the group ISIS is an efficient tool to discover genes and single-nucleotide polymorphisms associated to phenotypes.
Understanding the inherited basis of genetic variation in human health and disease is currently one of the most challenging tasks. Genome-wide association studies have been used to establish the statistical association between hundreds of loci across the genome and common complex traits. Although this approach has brought substantial knowledge and understanding of the diverse molecular pathways that underlie specific diseases, more evidence shows that a large portion of complex diseases cannot be explained by common genetic variants [1, 2]. Therefore alternative approaches are needed to detect and analyze rare variants associated with disease susceptibility genes. Although statistical methods for the detection of common functional variants (e.g., with minor allele frequencies [MAF] > 0.05) have been extensively developed and successively applied to numerous studies, methods for detecting rare functional variants are limited. Some methods developed for analysis of common variants can be easily extended to rare variants, for example, single-marker test, multiple-marker test, and collapsing methods, but their performance may not be optimal [3–5].
The primary purpose of this paper is to analyze quantitative traits Q1 and Q2 in replicates 1–200 of the Genetic Analysis Workshop 17 (GAW17) simulated mini-exome data . We study the GAW17 data set using modern ultra-high-dimensional model selection and group selection techniques. Given the natural group structure (i.e., genes) among single-nucleotide polymorphisms (SNPs), group selection tools can select the groups that consist of a number of weak predictors (i.e., SNPs with small MAFs) whose effect as a group on the phenotypes could be significant. In the context of the GAW17 data set, these weak predictors are just rare genetic variants. Contrary to collapsing methods, modern ultra-high-dimensional model selection techniques consider the joint effect among groups as well as among individuals and avoid oversimplification of the model. We propose group iterative sure independence screening (ISIS) for gene and SNP selection. We apply the method to analyze the GAW17 data and to compare it with penalized likelihood methods, such as the group least absolute shrinkage and selection operator (LASSO) and the group minimax concave penalty (MCP) in terms of the true significant genes (i.e., genes with significant SNPs) in the simulated GAW17 data. Functional variants are referred to as important variants throughout the text.
where y is an n × 1 response vector, X j is an n × p j matrix corresponding to the jth group of variables, β j is a p j × 1 coefficient vector, and ε is a random noise vector with normal distribution. Denote X = (X1, …, X J ) and .
We assume that the model is bilevel sparse, which means that only a small number of β j are nonzero vectors and, moreover, that each nontrivial β j is itself a sparse vector. In our analysis of the GAW17 data set, the response y is the quantitative phenotype Q1 or Q2, and the predictors are the 24,487 SNPs grouped in 3,205 genes. The bilevel sparse assumption, interpreted in this study, says that only a small number of genes are related to the phenotype of interest and that only some of the SNPs in these related genes are important. The assumption on sparsity plays a critical role in high-dimensional statistical modeling. The bilevel sparse assumption is appropriate for models with grouped predictors.
Because the GAW17 data are mini-exome human data, we use 0, 1, and 2 to denote genotypes AA, Aa, and aa, respectively. Thus each column in the design matrix X consists of the numbers 0, 1, and 2. Among the 24,487 SNPs in the data set, there are 9,433 SNPs with a MAF of 0.07% [= 1/(697 × 2)]; that is, this is the smallest MAF in the GAW17 data because only 1 individual out of 697 individuals has a variant at each such SNP locus. The fact that 9,433 is much greater than 697 makes no statistical model identifiable. Because of the nonidentifiability of the model, it is necessary to consider the group (gene) selection.
Many high-dimensional model selection and group selection techniques have been developed recently. One of the most popular methods is the penalized likelihood method, such as the LASSO, the smoothly clipped absolute deviation (SCAD) penalty, the MCP, and their extensions for group selection (e.g., group LASSO and group MCP) [7–11]. Two popular algorithms are used to find the maximizer of the penalized likelihood. The first algorithm is the least angle regression (LARS) algorithm and its extensions . It is efficient when the number of parameters (p) is comparable with the number of samples (n). However, when p is ultrahigh, as in our setting (24,487 potential predictors with only 697 observations), the LARS algorithm usually cannot be applied. The second algorithm is coordinate descent, which optimizes a target function with respect to a single variable at a time, iteratively cycling through all variables until convergence is reached . This method is even faster than LARS and can converge to the same solutions as LARS in many cases. However, it is unknown whether the two methods work well in the ultrahigh setting.
An alternative method for model selection that is different from the penalization approach is correlation learning, for example, sure independence screening (SIS) and iterative sure independence screening (ISIS) and their extensions for the group selection, called group SIS and group ISIS . Suppose that there are J groups of variables, denoted by G1, …, G J . In the first step of this approach, regression is performed using a single group of variables for each G j , j = 1, …, J. (Here, if the number of variables in G j is small, we can simply perform a least-squares regression. Otherwise, when the size of G j is large, we can use penalized regressions [LASSO or SCAD] or greedy algorithms [forward stepwise selection] to select the model and to estimate the coefficients.) After the first step, the residual sum of squares (RSS) is calculated by dividing by the degrees of freedom (df) for each regression using a single group. Then, the groups are ranked, and the top k groups are selected on the basis of the smallest RSS/df value. This is the main procedure for the group SIS method.
The group ISIS is just the iterative version of group SIS. First, one or several groups, say, G i 1, …, G ik , are selected using group SIS. Then, regression is performed using these selected groups together with a single new group from the rest of the groups for each of the new groups. Next, all the new groups are ordered by the RSS/df values, and top groups with the smallest RSS/df values are selected and added to the pool of selected groups. This process is iterated until some criterion is met. Note that this version of group ISIS corresponds to a version of ISIS introduced by Fan et al. , which improves the original ISIS given by Fan and Lv .
Group SIS and group ISIS are general strategies, and they have many variants in different applications. In this paper, we describe two procedures related to group ISIS for gene selection and apply them to analyze the GAW17 data. In the GAW17 data, the phenotypes may have strong associations with factors such as sex, age, and smoking. Therefore these factors are always considered important by default.
For procedure 1 we preset a maximal iteration K (12 or 15 in our study) and a constant C (5 in our simulation). In the first step, we perform a regression using each single group G j and sex, age, and smoking, where j goes from 1 to J. We restrict the model size to at most C + 3 when we do the regressions. For example, if the jth group G j consists of 12 variables, we use a forward stepwise selection method to select 5 among those 12 variables besides sex, age, and smoking. Then, we rank the RSS/df values for all groups. After ranking, we select the group, denoted by G(1), with the lowest RSS/df value in the model. At the kth iteration, we perform a regression using selected groups G(1), …, G( k −1), sex, age, and smoking together with a single group for each group from the rest. We restrict the whole model size to at most Ck + 3 when we do the regressions. Then, we select the new group G( k ) by ranking the RSS/df values. The procedure ends when k >K. In short, we select one group each time until the kth iteration. As we will see, this procedure can be considered a special case of procedure 2. However, as the counterpart of forward stepwise selection in the setting of group selection, procedure 1 is an interesting procedure in itself. Therefore we include it in our numeric studies.
Procedure 2 is similar to procedure 1 except that it selects multiple groups each time and allows deletion. To make the algorithm more efficient, we select k0 = 5 groups with the lowest RSS/df values at each iteration. After adding these five new groups, we do group selection immediately and keep only the selected groups before entering the next iteration. The procedure stops when it is stable, that is, when it deletes newly selected groups in the group selection. We use group LASSO and group MCP with either the Akaike information criterion (AIC) or the Bayesian information criterion (BIC) for group selection in the deletion step and discard the groups that are not selected by group LASSO or group MCP. We find that the procedure usually stops within 10 iterations, so we set the maximum number of iterations at 10. Group MCP and group LASSO are implemented using R package grpreg and can also be directly applied to select important genes. It is interesting to compare procedure 1 and procedure 2 with these methods.
Two phenotypes of Q1 and Q2 are studied as interesting responses in this paper. Because there are 200 replicates of data sets in GAW17 and because each replicate has a certain amount of noise, we consider three noise levels (low, medium, and high). In the low noise level case, we take average of 200 replicates as the response variable; the medium noise level refers to the average of 10 replicates. For the high noise level, only one replicate is considered. We select the first replicate as an example of high noise data to compare the different selection methods. Other individual replicates could have been used, but the same conclusion is obtained.
Comparisons of penalized likelihood methods and group ISIS procedures 1 and 2 at three different noise levels
Penalized likelihood method
Group ISIS procedure 1
Group ISIS procedure 2
Comparisons of penalized likelihood methods and group ISIS procedure 2 at medium noise level over 20 data sets
Penalized likelihood method
Group ISIS procedure 2
Genes and SNPs selected by group ISIS procedure 2 (group MCP with AIC) for the low noise data
C13S348*, C13S431, C13S522, C13S523, C13S524
C4S1861, C4S1878, C4S1884
C6S5441, C6S5446, C6S5449
C3S4859, C3S4869, C3S4873, C3S4874, C3S4875
C17S1019*, C17S1024, C17S1043, C17S1046, C17S1055
C9S376, C9S377, C9S444
C11S5292, C11S5301, C11S5302
Although it is unlikely in practice, when additional information is known in advance (e.g., all important SNPs are nonsynonymous or have MAF less than 0.2), we can filter out parts of the SNPs first and apply group ISIS to the rest. There are 13,572 nonsynonymous SNPs in 2,196 genes and 23,131 low-MAF (<0.2) SNPs in 3,100 genes. For the phenotypes Q1 and Q2, all important SNPs are nonsynonymous and have low MAF, so we lose nothing by searching among the restricted data set. We applied group ISIS to these data sets and found that the results were almost identical with the results reported in Table 1. Therefore we conclude that ultra-high dimensionality and existence of common variants (say, MAF > 0.2) hardly influence the performance of group ISIS. That is one of the main advantages of group ISIS.
The high noise level is the main reason that no method works well using only one replicate of the data. Besides that, to understand why some genes (e.g., VEGFC for Q1) cannot be selected even for the low noise data, we checked the data carefully. We found that SNP C4S4935, the only important SNP in gene VEGFC, is identical to SNP C13S348 in gene FLT1 in the design matrix. Gene FLT1 includes many SNPs related to Q1. So if FLT1 is selected, VEGFC does not have any priority to be selected. In short, the nonidentifiability may bring up some issue in gene selection when some important genes have few related SNPs.
Discussion and conclusions
In this study, we used modern ultra-high-dimensional model selection tools to detect important genes and important SNPs related to the phenotypes of interest. Group SIS and group ISIS were developed to conquer the difficulty of ultra-high dimensionality and nonidentifiability. As group LASSO is to LASSO, group SIS (or group ISIS) is the counterpart to SIS (ISIS) in the setting of group selection. These group selection tools work well under the bilevel sparse assumption. We used the penalized likelihood methods and group ISIS procedures to analyze the GAW17 data and compared their performance in recovering important genes associated with phenotypes Q1 and Q2. It seems that the group ISIS approach performs better than the penalized likelihood methods in terms of number of false positives and false discovery proportion. In particular, the proposed methods work well when additional replicates are available, that is, at the low and medium noise levels. When only one replicate with high noise is available, it seems that no method works well.
This research is supported partly by University of Arizona internal grants. The Genetic Analysis Workshop is supported by National Institutes of Health grant R01 GM031575.
This article has been published as part of BMC Proceedings Volume 5 Supplement 9, 2011: Genetic Analysis Workshop 17. The full contents of the supplement are available online at http://www.biomedcentral.com/1753-6561/5?issue=S9.
- McCarthy MI, Abecasis GR, Cardon LR, Goldstein DB, Little J, Ioannidis JPA, Hirschhorn JN: Genome-wide association studies for complex traits: consensus, uncertainty, and challenges. Nat Rev Genet. 2008, 9: 356-369. 10.1038/nrg2344.View ArticlePubMedGoogle Scholar
- Nejentsev S, Walker N, Riches D, Egholm M, Todd JA: Rare variants of IFIH1, a gene implicated in antiviral responses, protect against type 1 diabetes. Science. 2009, 324: 387-389. 10.1126/science.1167728.PubMed CentralView ArticlePubMedGoogle Scholar
- Li B, Leal SM: Methods for detecting associations with rare variants for common diseases: application to analysis of sequence data. Am J Hum Genet. 2008, 83: 311-321. 10.1016/j.ajhg.2008.06.024.PubMed CentralView ArticlePubMedGoogle Scholar
- Matthews AG, Haynes C, Liu C, Ott J: Collapsing SNP genotypes in case-control genome-wide association studies increases the type I error rate and power. Stat Appl Genet Mol Biol. 2008, 7: article 23Google Scholar
- Xiong M, Zhao J, Boerwinkle E: Generalized T2 test for genome association studies. Am J Hum Genet. 2002, 70: 1257-1268. 10.1086/340392.PubMed CentralView ArticlePubMedGoogle Scholar
- Almasy LA, Dyer TD, Peralta JM, Kent JW, Charlesworth JC, Curran JE, Blangero J: Genetic Analysis Workshop 17 mini-exome simulation. BMC Proc. 2011, 5 (suppl 9): S2-10.1186/1753-6561-5-S9-S2.PubMed CentralView ArticlePubMedGoogle Scholar
- Breheny P, Huang J: Penalized methods for bi-level variable selection. Stat Interface. 2009, 2: 369-380.PubMed CentralView ArticlePubMedGoogle Scholar
- Fan J, Li R: Variable selection via nonconcave penalized likelihood and its oracle properties. J Am Stat Assoc. 2001, 96: 1348-1360. 10.1198/016214501753382273.View ArticleGoogle Scholar
- Tibshirani R: Regression shrinkage and selection via the LASSO. J R Stat Soc Ser B. 1996, 58: 267-288.Google Scholar
- Yuan M, Lin Y: Model selection and estimation in regression with grouped variables. J R Stat Soc Ser B. 2006, 68: 49-67. 10.1111/j.1467-9868.2005.00532.x.View ArticleGoogle Scholar
- Zhang CH: Nearly unbiased variable selection under minimax concave penalty. Ann Stat. 2010, 38: 894-942. 10.1214/09-AOS729.View ArticleGoogle Scholar
- Efron B, Hastie T, Johnstone I, Tibshirani R: Least angle regression. Ann Stat. 2004, 32: 407-499. 10.1214/009053604000000067.View ArticleGoogle Scholar
- Friedman J, Hastie T, Hoffling H, Tibshirani R: Pathwise coordinate optimization. Ann Appl Stat. 2007, 1: 302-332. 10.1214/07-AOAS131.View ArticleGoogle Scholar
- Fan J, Lv J: Sure independence screening for ultrahigh dimensional feature space. J R Stat Soc Ser B. 2008, 70: 849-911. 10.1111/j.1467-9868.2008.00674.x.View ArticleGoogle Scholar
- Fan J, Samworth R, Wu Y: Ultrahigh dimensional feature selection: beyond the linear model. J Mach Learn Res. 2009, 10: 2013-2038.PubMed CentralPubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.