A gene-based approach for testing association of rare alleles

Rare genetic variants have been shown to contribute to susceptibility to common human diseases, and methods for detecting associations with rare variants are drawing much attention. In this report, we apply a gene-based approach to the 200 simulated data sets of unrelated individuals. The test can detect the association of some genes that carry multiple rare variants.


Background
Genome-wide association studies (GWAS) have shown promise in identifying the genetic basis of complex disorders, and many disease susceptibility regions have been identified using this approach. However, there is still "missing heritability" for most common diseases [1]. Part of the reason is that the statistical tests used in traditional GWAS may not have sufficient power to detect the association of rare genetic variants because of low allele counts in a sample. Rare genetic variants have been shown to contribute to the risk of some common disorders [2,3]. A possible approach is to combine the information at multiple rare genetic variants and test the association collectively in a gene or a pathway [4,5]. Dering et al. [6] provide a review of the association methods that combine information from multiple genetic markers. Here, we apply a gene-based approach for testing association of rare alleles [7] to the 200 simulated data sets of unrelated individuals provided by Genetic Analysis Workshop 17. The empirical type I error rate and power are reported.

Statistical test
We apply an association method that combines the information from the single-nucleotide polymorphisms (SNPs) in a particular gene. The method can be applied to any segment of the genome; here, we use genes as natural segments. Instead of using the standard chi-square test to compare the allele frequencies in case and control subjects, we propose to compare the mutation rates between the two groups. Specifically, for each individual in the case and control groups, we count the number of minor alleles in a specific gene at SNPs with a minor allele frequency (MAF) less than 0.01 (which we refer to as mutations). Let $P_i$ be the number of mutations in a gene in individual $i$ in the case group and $Q_j$ be the number of mutations in the same gene in individual $j$ in the control group. Let $\bar{P}$ and $\bar{Q}$ be the average number of mutations in the gene in the case and control groups, respectively, and let $M$ be the average number of mutations in the gene in the total sample. Also, let $S_{PQ}^2$ be the pooled sample variance of the number of mutations in the total sample, let $n_P$ and $n_Q$ be the number of individuals in the case and control groups, respectively, and let $N$ be the total sample size. The test statistic $T_G$ is defined from these quantities. Under the null hypothesis of no association of the set of SNPs in a gene with the disease, the average number of mutations in the case and control groups should be equal, and $T_G$ asymptotically follows a central chi-square distribution with one degree of freedom.
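As a minimal sketch of how such a statistic could be computed, the code below assumes one explicit form consistent with the definitions above: $S_{PQ}^2$ is taken as the sum of squared deviations of all per-individual mutation counts from $M$, divided by $N - 1$, and $T_G = (\bar{P} - \bar{Q})^2 / [S_{PQ}^2 (1/n_P + 1/n_Q)]$. The divisor $N - 1$, the exact algebraic form, and the function name `gene_based_test` are assumptions made for illustration and are not taken from the original method description [7].

```python
import math


def gene_based_test(case_counts, control_counts):
    """Return (T_G, p_value) for one gene.

    case_counts / control_counts: per-individual numbers of rare
    (MAF < 0.01) minor alleles in the gene, i.e. the P_i and Q_j.
    """
    n_p, n_q = len(case_counts), len(control_counts)
    n = n_p + n_q
    p_bar = sum(case_counts) / n_p          # mean in cases
    q_bar = sum(control_counts) / n_q       # mean in controls
    m = (sum(case_counts) + sum(control_counts)) / n  # overall mean M

    # ASSUMPTION: variance about the overall mean M with divisor N - 1.
    all_counts = list(case_counts) + list(control_counts)
    s2 = sum((x - m) ** 2 for x in all_counts) / (n - 1)
    if s2 == 0:
        # No variation in the gene: no evidence against the null.
        return 0.0, 1.0

    # ASSUMPTION: squared mean difference scaled by its estimated variance.
    t_g = (p_bar - q_bar) ** 2 / (s2 * (1 / n_p + 1 / n_q))

    # Upper tail of a chi-square with 1 df: P(X > t) = erfc(sqrt(t / 2)).
    p_value = math.erfc(math.sqrt(t_g / 2))
    return t_g, p_value
```

A gene would then be called significant at level 0.001 if the returned p-value falls below 0.001.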

Distribution of SNPs within gene
We identified all the SNPs within a specific gene by comparing the nucleotide position of each SNP against the starting and ending positions of the gene. AHNAK has the most SNPs (231), whereas 1,191 genes have only one SNP. The mean number of SNPs per gene is 8.33, with a variance of 227.6.
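The position-based assignment described above can be sketched as follows. The tuple layouts, the toy coordinates, and the names `assign_snps_to_genes`, `GENE_A`, and so on are hypothetical; the actual GAW17 file format is not assumed here, and whether the reported variance uses the sample or population formula is not stated in the text.

```python
from collections import defaultdict
from statistics import mean, variance


def assign_snps_to_genes(snps, genes):
    """Map each gene to the SNPs that fall within its boundaries.

    snps:  iterable of (snp_id, chromosome, position)
    genes: iterable of (gene_name, chromosome, start, end), ends inclusive
    Genes that contain no SNP are omitted from the result.
    """
    by_gene = defaultdict(list)
    for snp_id, chrom, pos in snps:
        for gene, g_chrom, start, end in genes:
            if chrom == g_chrom and start <= pos <= end:
                by_gene[gene].append(snp_id)
    return dict(by_gene)


# Hypothetical toy data, for illustration only.
snps = [("snp1", "1", 150), ("snp2", "1", 480),
        ("snp3", "1", 900), ("snp4", "2", 150)]
genes = [("GENE_A", "1", 100, 500), ("GENE_B", "1", 800, 1000),
         ("GENE_C", "2", 100, 200)]

snps_by_gene = assign_snps_to_genes(snps, genes)
counts = [len(ids) for ids in snps_by_gene.values()]
# Summary of SNPs per gene (sample variance is used here as an assumption).
print(mean(counts), variance(counts))
```

The nested loop is adequate for a few thousand genes; an interval index would be a natural replacement for larger data sets.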

Type I error
The affection status provided in each phenotype replicate file was used to put the samples into case and control groups. We then applied the gene-based test to all the 200 replicates of the simulated data. As an example, Table 1 lists the genes that are significant at the 0.001 level from the tests using the data from the first replicate.
According to the simulation model, 36 genes are involved in generating the disease affection status, as explained in the next subsection; the remaining 3,169 genes are not. Any significant gene among these 3,169 genes is counted as a false positive. For each replicate, we count the number of false positives and divide it by 3,169, and we take this ratio as an estimate of the type I error rate. Figure 1 plots the type I error rate in the 200 replicates at the 0.01 and 0.001 levels. The mean and standard deviation of the type I error rate are 0.00105 and 0.00837 at the 0.01 level and 0.00102 and 0.00268 at the 0.001 level.
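The per-replicate estimate described above is a simple proportion, which can be sketched as follows. The function name `false_positive_rate`, the dictionary layout, and the use of a strict inequality for "significant at level alpha" are assumptions for illustration.

```python
def false_positive_rate(p_values, causal_genes, alpha):
    """Estimate the type I error rate in one replicate.

    p_values:     dict mapping gene name -> p-value of the gene-based test
    causal_genes: set of genes used to simulate disease status
    alpha:        significance level, e.g. 0.01 or 0.001
    """
    # Keep only "null" genes, i.e. those not in the simulation model.
    null_p = [p for gene, p in p_values.items() if gene not in causal_genes]
    # ASSUMPTION: a gene is significant when its p-value is below alpha.
    false_positives = sum(1 for p in null_p if p < alpha)
    return false_positives / len(null_p)
```

In the setting of this report, `len(null_p)` would be 3,169 for every replicate, and the function would be applied once per replicate at each level.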

Power
In the simulation model [8], disease status is generated using a liability threshold model. The liability is a function of Q1, Q2, Q4, and a latent liability, which are influenced by 36 genes. For the test results of each replicate, we count the number of significant genes that are among the 36 genes (that is, true positives) and divide this number by 36 to obtain a rough estimate of power for that replicate. In replicate 1, six of the 36 genes are significant at the 0.001 level, giving a power estimate of 16.7%. Figure 2 plots the power estimates in the 200 replicates at the 0.01 and 0.001 levels. The power estimates vary considerably across replicates. The mean and standard deviation of the power estimates are 0.137 and 0.037 at the 0.01 level and 0.106 and 0.019 at the 0.001 level.
Because we have test results from 200 replicates, we also count the number of times a particular gene is called significant across the 200 replicates. Table 2 lists the genes that are significant at least 30 times over the 200 replicates at the 0.001 level. At the top of the list is the FLT1 gene, which is significant 171 times at the 0.01 level and 136 times at the 0.001 level, giving power estimates of 85.5% and 68.0%, respectively. According to the simulation model, 11 of the 35 SNPs in this gene influence Q1. Similarly, the test at the PIK3C2B gene is significant 139 times at the 0.01 level and 87 times at the 0.001 level, giving power estimates of 69.5% and 43.5%, respectively. PIK3C2B has 71 SNPs, 24 of which influence the disease liability.
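The two power summaries used in this subsection, the per-replicate proportion of causal genes detected and the per-gene proportion of replicates in which the gene is detected, can be sketched as follows. The function names and data layout are hypothetical, and a missing gene is treated as not significant by assumption.

```python
def replicate_power(p_values, causal_genes, alpha):
    """Fraction of causal genes that are significant in one replicate."""
    hits = sum(1 for g in causal_genes if p_values.get(g, 1.0) < alpha)
    return hits / len(causal_genes)


def detection_frequency(replicates, gene, alpha):
    """Fraction of replicates in which a given gene is significant.

    replicates: list of dicts, each mapping gene name -> p-value.
    """
    hits = sum(1 for p_values in replicates if p_values.get(gene, 1.0) < alpha)
    return hits / len(replicates)


# Arithmetic behind the figures quoted in the text.
print(round(6 / 36, 3))      # replicate 1, 0.001 level
print(round(171 / 200, 3))   # FLT1, 0.01 level
print(round(87 / 200, 3))    # PIK3C2B, 0.001 level
```

The first function corresponds to the estimates plotted per replicate, and the second to the per-gene counts summarized in Table 2.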

Discussion
Association methods for rare genetic variants are attracting much attention in the genome era, especially with the advance of next-generation sequencing technology. Because rare alleles appear in only a few individuals, traditional single-marker tests have low power. An alternative is to group genetic variants by gene or pathway and test the variants in each group collectively. In this report, we applied a gene-based approach to the data on unrelated individuals. The test is based on a Poisson mutation process for rare genetic variants. Our results show that the test has modest power to detect associated genes when all the underlying genes are considered. The type I error rate seems to be well controlled on average but is inflated in some replicates, as shown in Figure 1. However, the approach for estimating the type I error rate is not rigorous: in each replicate the estimate is based on only 3,169 "null genes" assumed to be unrelated to the disease status. There could be considerable Monte Carlo error. In addition, the assumption that all 3,169 genes are unrelated to the disease status may not hold if there are unknown interactions between genes used in simulating the disease phenotype and some of the null genes.
It is encouraging that the test can detect the association signal at FLT1 and PIK3C2B with relatively good power. Nonetheless, the validity and power of the test depend on the assumed distribution of the susceptibility mutations. If a particular gene has many susceptibility mutations, all of which contribute to disease risk, we would expect a larger difference between the numbers of mutations in the case and control groups, which could translate into higher power than for genes with fewer susceptibility mutations. The validity of conclusions drawn from the simulated data sets also depends on the simulation model and its compatibility with the test assumptions.
Because this is a group test of all the SNPs within one gene, the model might not work well for genes that have only one or two susceptibility mutations, whereas it does work well for genes with more susceptibility mutations, as in the cases of FLT1 and PIK3C2B, which have 11 and 24 susceptibility SNPs, respectively. The simulation model also assumes that all the minor alleles in the model increase disease risk, which may favor some of the collapsing methods.