Comparative study of statistical methods for detecting association with rare variants in exome-resequencing data

Genome-wide association studies for complex traits are based on the common disease/common variant (CDCV) and common disease/rare variant (CDRV) assumptions. Under the CDCV hypothesis, classical genome-wide association studies using single-marker tests are powerful in detecting common susceptibility variants, but under the CDRV hypothesis they are not as powerful. Several methods have been recently proposed to detect association with multiple rare variants collectively in a functional unit such as a gene. In this paper, we compare the relative performance of several of these methods on the Genetic Analysis Workshop 17 data. We evaluate these methods using the unrelated individual and family data sets. Association was tested using 200 replicates for the quantitative trait Q1. Although in these data the power to detect association is often low, our results show that collapsing methods are promising tools. However, we faced the challenge of assessing the proper type I error to validate our power comparisons. We observed that the type I error rate was not well controlled; however, we did not find a general trend specific to each method. Each method can be conservative or nonconservative depending on the studied gene. Our results also suggest that collapsing and the single-locus association approaches may not be affected to the same extent by population stratification. This deserves further investigation.


Background
Classical genome-wide association studies have successfully detected many common genetic variants that are associated with complex traits. It is likely that low-frequency or rare variants are also contributing to genetic risk [1]. The statistical power to detect phenotypic association with such variants is limited because of the small number of observations for any given variant and a more stringent multiple test correction compared to common variants [2]. The simultaneous analysis of rare variants aims to identify accumulations of minor alleles within the same functional unit (e.g., gene).
Several new methods have been recently proposed to tackle the rare variant problem [2][3][4][5][6]. The principal difference between them lies in the way the information on the multiple rare variants is used. Some methods use a subset of variants that satisfy predefined selection criteria, whereas other methods use all variants. The methods also differ in the way in which the cumulative information on minor alleles within a functional unit is coded. Finally, multivariate collapsing approaches have also been proposed. Most of these recent developments have been applied to association analyses in data from unrelated individuals. A new method has been recently developed [4,6] that can be applied to both unrelated individual and family data.
In this paper, we evaluate and compare the power of different collapsing methods for detecting association of multiple rare variants with a quantitative trait. We first focus on the unrelated individuals data and then incorporate some of these approaches within the general framework of the mixed model for association analysis in the family data set of Genetic Analysis Workshop 17 (GAW17). We tried to answer the following questions: Does the use of a subset of rare variants perform better than using all variants? Do the collapsing approaches perform similarly with unrelated individual and family data sets? The analyses were performed using the GAW17 data with knowledge of the answers [7].

Methods
We studied the quantitative trait Q1 influenced by 39 variants in nine independent genes.

Statistical association analysis of rare variants
We carry out the association test at the gene level. Assume that a gene G contains J G variants denoted SNP j , j = 1, …, J G , and that MAF j is the minor allele frequency of SNP j . Let Y = (y 1 , …, y N ) be the observations of the phenotype Q1 in N unrelated individuals, and let X iG be the vector of genotypes of the SNPs in gene G for individual i. The genotypes are coded 0, 1, or 2, depending on the number of minor alleles.
Let T maf be a selection criterion on minor allele frequency (MAF) values. The association methods we have investigated vary according to a predefined T maf value (i.e., less than 1%, less than 5%, or less than 50%) and on the number of collapsing groups. They are all based on a linear regression modeling the relationship between the trait Y and the SNP data within a gene. We briefly review these methods in this Methods section. More details are given by Dering et al. [8].
Association testing in the unrelated individuals data set: univariate collapsing approaches The univariate collapsing approaches use only a subset of variants that satisfy the constraint MAF ≤ T maf , where T maf is a predefined selection value.
The first univariate collapsing approach is the collapsing and summation test (CAST). Let X iG (maf) be the vector of genotype scores of the SNPs with MAF <T maf , and let J G (maf) be the length of the vector X iG (maf). The variable C = C iG (maf) (i = 1, …, N) denotes the two collapsing strategies that we used: collapsing absence/presence (CA) and collapsing proportion (CP). For the CA strategy: and for the CP strategy: Equation (1) is based on the presence or absence of the minor allele at any rare variant in gene G within an individual [3]. Equation (2) is based on the proportion of rare variants with MAF ≤ T maf at which an individual i carries at least one copy of the minor allele [5]. The model is Y = Cb + ε, where e s ẽ , ) and σ 2 is the residual variance. The effect of b can be tested with a likelihood ratio test that follows a chi-square distribution with 1 degree of freedom (df).
The second univariate collapsing method is the variable-threshold (VT) approach [2], which uses the CP approach to collapse rare SNPs with MAF <T maf but maximizes the statistic according to T maf . All T maf values observed in the gene G are considered. For each T maf , a regression z-score is computed. Let z max be the maximum z-score across all T maf values. The test of association is based on z max , and its statistical significance is evaluated empirically by permutation.
The last univariate collapsing method is the weightedsum (WS) approach [2], which is a generalization of the binary trait weighted-sum approach proposed by Madsen and Browning [4] for quantitative traits. Under this approach, T maf = 0.5 (i.e., all variants in a gene G are used). The collapsing variable C for subject i in the WS approach is given by: where: For each gene G, a genetic score is calculated as: The significance of Z G is assessed empirically by permutation.
Association testing in the unrelated individuals data set: combined multivariate and collapsing approach The combined multivariate and collapsing (CMC) method originally proposed by Li and Leal [3] uses a multiple regression model that contains the CA method's collapsing variable of SNPs with MAF <T maf = 1% and includes all k remaining SNPs, X j1,…,k , individually.
The multivariate model (denoted here as CMC3) is: Evidence of association (∃j, b j ≠ 0, j = 0, …, k) is assessed with the likelihood ratio test, which follows a chi-square distribution with (k + 1) df.
Using only the SNPs with MAF ≤ 5%, we extended this model in two ways. In both extensions the multivariate model contains the CA collapsing variable of SNPs with MAF < 1%. In the first variation of this model (denoted CMC1), the multivariate model also contains the CA collapsing variable of the other SNPs (i.e., 1% ≤ MAF ≤ 5%). In contrast, in the second extension (denoted CMC2), the other SNPs are included individually in the multivariate model.
The CMC1 model is then written as: and the test of association is a likelihood ratio test with 2 df.
The CMC2 model is the same as Eq. (6), where k is the number of SNPs and 1% ≤ MAF ≤ 5%. Evidence of association is assessed with the likelihood ratio test with (k + 1) df.

Association testing in the unrelated individuals data set: single-marker test
For comparison purposes, we also carried out a single-locus association test. For a gene G, association with each SNP was tested using the likelihood ratio test. For each gene G, we obtained J G likelihood ratio test statistics, each with 1 df. The evidence of association at the gene level was based on the maximum of the J G likelihood ratio test statistics.
Single-marker (SM) tests were conducted with PLINK, version 1.07 [9]. The R.2.10.1 software was used for all collapsing approaches except the VT and WS approaches. For these two approaches we used the R script (http:// genetics.bwh.harvard.edu/vt/dokuwiki/) [2], and we set the number of permutations to 1,000.

Association testing in the family data set
We used the measured genotype (MG) test [10], which is a linear mixed model: where: s c 2 and s e 2 are the polygenic and the residual variances, respectively, and Φ i is the kinship matrix in family i. The SNP data in relatives were collapsed as described under the CA, CP, and WS collapsing approaches. In these three approaches, the test of association is a likelihood ratio test with 1 df. In addition, we also carried out the bivariate CMC1 approach using the likelihood ratio test with 2 df. We could not evaluate the VT approach because it maximizes T maf . We carried out the MG test using the QTDT software (http://www. sph.umich.edu/csg/abecasis/QTDT/).

Type I error rate and power estimates
The empirical distribution of each association approach was evaluated in unrelated individuals and in family data. Type I error and power rates were estimated by testing association of Q1 to each of the seven false causal genes and each of the nine true causal genes, respectively, using the 200 replicates. Type I error and power rates were derived at a nominal level of a = 5%.
In the unrelated individuals data set, we evaluated association with Q1 using 10 approaches: CA1 and CA5 with T maf = 1% and 5%, respectively; CP1 and CP5 with T maf = 1% and 5%, respectively; and VT, WS, CMC1, CMC2, CMC3, and SM. For the WS and VT tests, we used empirical P-values. For all remaining association tests we used tabulated nominal P-values. In each replicate, we tested for association of Q1 with each of the 16 genes using each of the 10 approaches. For each gene and for each association procedure we computed the proportion of replicates having a P-value ≤ a. For the SM approach, we applied a Bonferroni correction to account for the multiple tests; we computed the proportion of replicates such that the lowest P-value out of the J G SNPs was less than or equal to a/J G .
In the family data set, we evaluated similarly the following five approaches: CA1 and CA5 with T maf = 1% and 5%, respectively; CP1 and CP5 with T maf = 1% and 5%, respectively; and SM. We also evaluated the WS approach but used the tabulated P-value derived from a chi-square distribution with 1 df.

Results and discussion
The characteristics of the nine causal and seven noncausal genes are shown in Table 1. The total number of SNPs (causal and noncausal) per gene is given along with their distributions by MAF category. The MAF for the causal variants ranges from 0.07% to 16.5% in the 1000 Genomes Project data (for unrelated individuals), and the number of causal variants per gene varies from 1 (VEGFC, VEGFA) to 11 (FLT1). One causal gene (VEGFC) has one single SNP, and thus only one association approach (SM) can be applied. For the noncausal genes, the number of SNPs per gene ranges from 6 (CTSS) to 83 (LY75), and, as for the causal genes, most (>70%) of the SNPs are uncommon (MAF < 5%).
Estimates of Type I error and power rates in the unrelated individuals data set Table 2 shows the type I error rates estimated at the gene level of each association approach for the unrelated individuals data set. As can be seen, the type I error rate is not well controlled no matter which association approach is used: The rates can be higher or lower than expected. For some genes, almost all association approaches show inflated type I error rates (e.g., MAPT, IFI44). Conversely, for some other genes (FOXI1, LY75), the type I error rates of some approaches are inflated, whereas the other approaches tend to be conservative.
Overall, the SM and CMC3 approaches appear to have inflated type I errors more frequently. Interestingly, these two approaches are the only ones that used the common SNPs individually. Clearly, several SNPs in these sequence data, including those in our noncausal genes, have population-specific allele frequencies. Given that the genotype data were not simulated, we hypothesize that the inflated rates could be explained by the observed  differences in the mean of Q1 between the four populations (−0.059, −0.002, 0.021, and 0.072 in Africans, Chinese, Japanese, and Europeans, respectively. We recomputed the type I error accounting for possible clusters. First, we ran a principal components (PC) analysis with Eigenstrat [11] using the full mini-exome SNP data excluding SNPs with MAF < 5%. In each replicate, we computed the residual of Q1 obtained by regression of Q1 on the first five PCs. We reestimated the type I error levels using the residual of Q1 as the phenotype. The last 10 columns of Table 2 show the results. As can be seen, after adjusting for the five PCs, only a few of the type I error estimates remained higher than expected. In fact, most of the estimates were lower than expected.
In conclusion, to estimate the power of these approaches in the data sets, we used two strategies (Table 3): Power was first computed at a theoretical level of 5%, although the different approaches may not have comparable true false-positive rates; second, power was computed accounting for the five PCs, that is, using the residuals of Q1. All methods performed well for the KDR and FLT1 genes. Conversely, all but two methods performed poorly (power < 10%) for two genes: For ELAVL4 the power was greater than 30% using the SM and CMC3 approaches, and for HIF3A the power was greater than 17% for the CMC2 and CMC3 approaches. For the remaining four genes, one of the pooling methods outperformed the SM method after a Bonferroni correction. In these data, the CA and CP approaches had roughly similar power, and so, in what follows, the CP method will serve as a reference.
The choice of the threshold T maf seems to have a large effect on power, and, in general, the power is higher when the criteria are less stringent (T maf = 5% vs. 1%). Although this is not surprising for genes with causal SNPs having 1% < MAF < 5% (ARNT, HIF1A), we made the same observation for genes with all causal SNPs having a MAF < 1% (FLT4 and VEGFA; see Table 1). This may suggest that allele correlation within these genes exists among causal and noncausal rare variants. The VT approach, which does not require a predefined choice on T maf , did not appear to outperform the CP approach. On the other hand, one of the univariate (WS) or multivariate (CMC3) collapsing methods that uses all SNPs showed better power than the CP method. This again may be explained by allele correlation among SNPs. When adjusting for population stratification, again, all approaches had the greatest power for the FLT1 and KDR genes and the lowest power for the ELAVL4 and HIF3A genes. Nonetheless, most power estimates were lower, and the power drop was noticeable, especially for the FLT4 and HIF1A genes. However, it is unclear whether this drop is fully explained by the lower values of the adjusted false-positive rates. VEGFC 0.58 n/a n/a n/a n/a n/a n/a n/a n/a n/a VEGFC 0.56 n/a n/a n/a n/a n/a n/a n/a n/a n/a Estimates outside the 95% confidence interval are underlined. n/a, not applicable. a Bonferroni-corrected P-value.
Estimates of type I error and power rates in the family data set Table 4 shows the type I error and power rates estimated at the gene level of each association approach for the family data set. It also shows the number of SNPs, causal and noncausal, that are polymorphic in the family samples. Type I error rates appeared to be better controlled in the family data than in the unrelated individuals data set with a few exceptions, especially the MAPT gene, for which most type I errors were biased upward. This gene is located in a genomic region with a low recombination rate and a long range of linkage disequilibrium. All association approaches show high and similar power rates for VEGFA. High power (>80%) was observed for FLT1 using the SM and CP approaches and for KDR using the CA(0-5%), CP (0-5%), VT, and CMC1 approaches. In general, as observed in the unrelated individuals data set, the CA and CP approaches showed greater power under the less stringent T maf criterion of 5% versus when T maf = 1%.

Power of collapsing approaches in unrelated individuals versus family data set
Two causal genes (FLT1, KDR) were consistently detected with good power (>80%) in the unrelated individual and family data sets, irrespective of the association approach. One gene (VEGFA) was detected in the family sample but not in the unrelated individuals data set (power < 54%, or power < 32% after adjusting for population stratification). Conversely, ARNT was detected in the unrelated individuals data set (power = 96%, or power = 77% after adjusting for population stratification) but not in the family data (power = 12%).

Conclusions
We found that for some genes collapsing approaches may be powerful tools to detect multiple rare variants for complex traits. In particular, the choice of the threshold T maf seems to have a large effect on power, and, in general, we found a higher power when the criterion was less stringent (T maf = 5% vs. 1%). In the same vein, including all SNPs, whether by means of a univariate or a multivariate collapsing approach, can improve the power. In addition, a few of the causal genes were detected in both the related and the unrelated individuals data, whereas other causal genes were detected only in either the unrelated individuals or the family data. However, in these data the power of association was often limited. More important, we found that type I error rates may be highly variable between genes and between approaches. We faced the challenge of assessing the proper type I error to validate our power comparisons. We acknowledge that our type I and type II error rates may not be generalized because of the way the GAW17 data were simulated: Phenotype but not genotype data were generated. Further, because the genotypes of founders did not vary between replicates, each family was either always n/a n/a n/a n/a n/a n/a N, number of polymorphic SNPs. V, number of polymorphic causal variants. a Bonferroni-corrected P-value.
informative (at least one founder carries a causal variant) or never informative (no founder carries a causal variant) for testing association to a given causal variant. Finally, our results also raise an interesting point that might deserve future investigation, namely, that the collapsing and the single-locus association approaches may not be affected to the same extent by population stratification. Our results suggest that collapsing approaches may be more robust, especially in the presence of multiple variants.