Improved power by collapsing rare and common variants based on a data-adaptive forward selection strategy

Genome-wide association studies have been used successfully to detect associations between common genetic variants and complex diseases, but common single-nucleotide polymorphisms (SNPs) detected by these studies explain only 5–10% of disease heritability. Alternatively, the common disease/rare variants hypothesis suggests that complex diseases are often caused by multiple rare variants with moderate to high effects. Under this hypothesis, the analysis of the cumulative effect of rare variants may thus help us discover the missing genetic variations. Collapsing all rare variants across a functional region is currently a popular method to find rare variants that may have a causal effect on certain diseases. However, the power of tests based on collapsing methods is often impaired by misclassification of functional variants. We develop a data-adaptive forward selection procedure that selectively chooses only variants that improve the association signal between functional regions and the disease risk. We apply our strategy to the Genetic Analysis Workshop 17 unrelated individuals data with quantitative traits. The type I error rate and the power of different collapsing functions are evaluated. The substantially higher power of the proposed strategy was demonstrated. The new method provides a useful strategy for the association study of sequencing data by taking advantage of the selection of rare variants.


Background
Genome-wide association studies (GWAS) have been used successfully to detect associations between common genetic variants and complex diseases. However, common single-nucleotide polymorphisms (SNPs) detected by current GWAS explain only at most 5-10% of disease heritability [1]. One possible reason could be that another type of variant, rare variants, has not been considered in the current GWAS. Recent studies have shown that common diseases can be caused by functional variants with a wide spectrum of allele frequencies, including rare alleles [2][3][4]. GWAS on common SNPs are based on the currently popular common disease/common variants hypothesis for complex disease etiology; these studies are well suited for detecting genetic variants with high allele frequencies and relatively small to modest genetic effects. There are some difficulties in identifying variants based on the alternative common disease/rare variants hypothesis, which suggests that complex traits are caused collectively by multiple rare variants with moderate to high effects. Under this hypothesis, the analysis of the cumulative effect of rare variants may become crucial for discovering the missing genetic variation from traditional GWAS.
With the development of next-generation sequencing technologies, more rare variants can be genotyped, so the analysis of associations between rare variants and diseases becomes possible. It is well known that traditional GWAS lack power for detecting rare variants. More powerful tests are needed to analyze resequencing genetic data. Recently, Li and Leal [5] proposed a strategy that collapses all the rare variants across a functional region. The idea behind this strategy is to assume that each rare variant in a functional region contributes to a disease, and collapsing genotypes across variants will result in enriched association signals.
Several tests based on different collapsing strategies for case-control studies have been proposed. One is the cohort allelic sum test (CAST) [6], in which the number of individuals with one or more mutations in a group (e. g., a gene) are compared between case subjects and control subjects. Because CAST deals only with rare variants, the combined multivariate collapsing (CMC) method [5] generalized it by performing a multivariate test with common variants and collapsing scores of rare variants. A weighted-sum statistic [7] is another method that collapses both common and rare variants by adding different weights based on allele frequencies, assuming that rare variants have a high effect compared with common variants. With the regression approaches proposed by Morris and Zeggini [8], these methods can also be extended to quantitative phenotypes. In addition, the power of a single-marker test is usually low because of the lack of genetic variant information and the need to adjust for multiple corrections. Multiple-marker tests might also lose power as a result of higher degrees of freedom. The collapsing methods can avoid drawbacks from both single-marker tests and multiple-marker tests by considering all the genetic variant information with only 1 degree of freedom.
However, the collapsing methods may not be robust and could be seriously impaired by misclassification of collapsing regions [5]. Regions can usually be defined by genes, SNP allele frequencies, or variant functionality. If all rare variants within a collapsing region have the same positive or negative effect on a disease, then the association signal could be amplified; however, if collapsing combines functional and nonfunctional variants, this would adversely affect power. To address this problem, we develop a data-driven forward selection strategy in which a common variant is first chosen as a base to collapse a specific region and then rare variants are selected to be collapsed with the base SNP. The proposed method is well suited for detecting regions in which common and rare variants have the same genetic effect on a disease, especially when the association signal of common variants cannot be identified by traditional GWAS. The new method is robust to the size of the region and can efficiently deal with noise caused by misclassification of noncausal rare variants. We apply our method to the Genetic Analysis Workshop 17 (GAW17) unrelated individuals data with quantitative traits Q1 and Q2. The proposed method works for quantitative traits.

Data preprocessing
We analyze the GAW17 unrelated individuals data set, with 200 replicates of simulated phenotypes Q1 and Q2, to compare the power of different tests. There are 697 samples consisting of 209 case subjects and 488 control subjects. The data set contains 24,487 SNPs within 3,205 genes generated using real sequence data from the 1000 Genomes Project. We define rare variants as SNPs with a minor allele frequency less than 0.01 and perform collapsing within each gene. Because collapsing methods do not work well if the region includes too few SNPs, we first filter the genes according to a criterion of having at least 10 variants with one or more common SNPs. After filtering, we have 553 genes for analysis. The analysis is performed with the knowledge of the underlying simulating model.
To control the false-positive rate, we adjust phenotypes for the effects of confounding variables and population stratification. We perform a linear regression of the phenotypes from the first replicate on the variables Sex, Age, and Smoking status to select confounding variables. The results are shown in Table 1. Variables with a p-value greater than 0.05 are selected as covariates for the adjustment. The top five eigenvectors from Eigenstrat [9,10] are also considered covariates. For each replicate, residuals of the multivariate linear regression of phenotypes on all the selected covariates are regarded as the adjusted phenotypes.

Data-adaptive forward selection
In brief, our strategy is as follows. We start with the most significant common SNP within one region. Rare variants are then added to the collapsed set one at a time until there are no variants remaining or until there is no visible improvement in the goodness-of-fit of the fitted model. More specifically, assume that there are m common variants and n rare variants within a certain predefined genomic region. Let x denote the genotype for all samples of a certain common variant, and let g denote the rare variant. Our strategy consists of the following steps.
Step 1. Build a linear model on each common SNP. The SNP with the largest genetic effect as measured by the F statistic of the linear regression is selected as the base of the collapsing function for this region: F is calculated based on: Step 2. Collapse each rare variant with the base SNP in this region according to a specific collapsing function. Perform a linear regression on each collapsed score Collapse(S, g i ). Based on the F statistics, variants with the most significant values are then selected as the base for the next procedure: F new is calculated based on: Step 3. If F new >F, then update S, F, and n with S new , F new , and n − 1. Repeat step 2 until either F no longer increases or n = 0.
When the selection procedure is finished, the test statistic FS is defined as: which is the absolute value of the t statistics in the linear regression model: where S final is the final collapsed score after the selection. Under the null hypothesis, the selection procedure drives the statistics in two different directions; taking the absolute value allows us to proceed with the normalization step, described in the next subsection.

Genome-wide permutation test
To correct the bias resulting from selection and obtain the global empirical p-value, we perform a genome-wide permutation test. Assume that M permutations are performed for k candidate genes. Let FS i be the observed t statistics for the ith gene, and let FS null i j be the observed t statistics for the ith gene on the jth permutation. To compare the genetic effect across genes, we normalize the statistics using the estimated mean and variance from the permutation and obtain the adjusted statistic FS adji : and:

Comparison of tests
We investigate the performance of our forward selection strategy with three collapsing methods: (1) the indicator method, (2) the sum method, and (3) the weighted-sum method. Let r ij represent the genotype for the ith individual at the jth locus. The collapsing functions are then defined as follows. The indicator function: is a function of the presence or absence of any minor allele in any region within an individual, which was first used in CAST [6]. The sum function: is a function that describes the overall effect at any region within an individual. It has the same effect as proportion coding. Both the indicator and sum functions have been demonstrated to be powerful in the detection of associated rare variants [8]. The weightedsum function is: where the weight is calculated by: (there is one unreadable character of the right side of (16) in my computer. Please make sure it is same as before. It is q hat) and: and n is the total number of individuals. We estimate allele frequencies by jointly considering all subjects, because we could not follow the suggestion to estimate allele frequencies from unaffected individuals [7] when dealing with quantitative traits.
The test statistics of the indicator, sum, and weightedsum functions without a forward selection strategy are denoted T ind , T sum , and T ws , respectively, and their pvalues can be calculated from the t distribution and adjusted by the Bonferroni correction. Statistics T ind and T sum deal only with the rare variants, whereas T ws considers both common and rare variants.
The corresponding statistics obtained with our forward selection strategy are denoted FS ind , FS sum , and FS ws . Another test statistic, denoted T com , is also obtained by using linear regression on the most significant common variant in a region. A comparison of the proposed strategy with T com allows us to access the effect of adding rare variant information to common variant information in a specific genetic region.

Type I error and power
We analyzed 553 filtered genes on the 200 replicates for phenotypes Q1 and Q2. The type I error rate of the test and the power of the test are defined as follows. Take Q1 as an example; there are 4 functional and 549 nonfunctional genes. At a given significance level a, if the adjusted or global p-value for a gene is greater than a, we would reject the null hypothesis. Next, we consider (number of tests that rejected the null hypothesis)/[200(549)] for the 549 nonfunctional genes as the type I error rate of the test. Because different tests should adapt to different types of genes, power is calculated by (number of tests rejected at the null hypothesis)/200 for each functional gene.
We found that the type I error of tests using the weighted-sum function was inflated for Q1. To have a fair comparison, we selected a different significance level for each test to have the type I error rates of all tests at the same level. Some permutation tests cannot have exact 5% type I error because there were only 1,000 permutations. Therefore we chose a significance level so that all tests had a well-controlled false-positive rate of about 6% for Q1 and 5% for Q2 ( Table 2). The power of the tests was calculated based on the same significance levels as the type I error. Table 3 lists all the genes with power greater than 5% according to at least one test.
For Q1, FTL1 was detected by T com with 100% power, which indicated that the common SNP had a strong effect on the disease. However, T com is not optimal for evaluating the proposed strategy. Gene KDR is the case we want to consider, because T com became underpowered, which indicated that the effects of common SNPs were not significant. For KDR, FS ws achieved the highest power, followed by T ws , FS sum , and FS ind . We also found that tests considering both common and rare variants achieved higher power than those that considered only rare variants (T ind and T sum ) or common variants (T com ). All forward-selection-based tests were ranked at the top for KDR, which demonstrates the potential power of the forward selection strategy. For Q2, all tests became underpowered. Three genes (SREBF1, SIRT1, and VNN3) were detected by at least one test with power greater than 5%. FS ws achieved the highest power (8%) in detecting SIRT1.

Discussion
We have proposed a data-adaptive forward selection strategy for genetic association studies with multiple common and rare variants. The proposed test is aimed at selecting rare variants for collapsing that best amplify the association signal between functional regions and phenotypes. Traditional collapsing methods do not have the option of selectively collapsing only functional rare variants with the same effects on the risk of disease; thus they may be underpowered by misclassification in collapsing regions. The major advantage of our method is that it can selectively collapse rare variants with the same genetic effect as the common variant in the functional region, thus leading to a more significant association signal. To correct the bias resulting from selection and to alleviate the computational burden, we performed a genome-wide permutation test. We evaluated the power of our method using different collapsing functions, based on the same type I error. The results show that the proposed method has substantially higher power across different collapsing strategies. The way to select rare variants for each functional region also suggests that our method may have higher power to detect functional regions with an either positive (damage) or negative (protective) effect on the disease traits, even if no common variant is associated with the disease, so long as enough rare variants collectively affect the disease. The forward selection strategy could also be a powerful tool by adding different weights based on allele frequencies in order to lower the effect of common variants.

Conclusions
We developed a data-adaptive forward selection procedure by collapsing a common variant with selected rare variants. The validity and substantially higher power of the proposed strategy were demonstrated using the GAW17 data. The method provides a useful strategy for association studies of sequencing data by taking advantage of the selection of rare variants.