Association screening for genes with multiple potentially rare variants: an inverse-probability weighted clustering approach
© Liu et al; licensee BioMed Central Ltd. 2011
Published: 29 November 2011
Volume 5 Supplement 9
Both common variants and rare variants are involved in the etiology of most complex diseases in humans. Developments in sequencing technology have led to the identification of a high density of rare variant single-nucleotide polymorphisms (SNPs) on the genome, each of which is carried by at most about 1% of the population. Genotypes derived from these SNPs allow one to study the involvement of rare variants in common human disorders. Here, we propose an association screening approach that treats genes as units of analysis. SNPs within a gene are used to create partitions of individuals, and inverse-probability weighting is used to overweight genotypic differences observed on rare variants. Association between a phenotype trait and the constructed partition is then evaluated. We consider three association tests (one-way ANOVA, chi-square test, and the partition retention method) and compare these strategies using the simulated data from the Genetic Analysis Workshop 17. Several genes that contain causal SNPs were identified by the proposed method as top genes.
Rare variants are common on the genome and have long been speculated to be involved in the etiology of most human disorders. In the 2000s, a large number of genome-wide association studies (GWAS) were conducted using relatively common single-nucleotide polymorphisms (SNPs) (with minor allele frequency [MAF] > 5%). Most of the common variants identified in these studies have borderline odds ratios and can explain only a small fraction of susceptibility to a disease. As a result, there has been increasing interest in the study of rare variants for complex diseases. This interest has also been fueled by advancements in sequencing technology. In particular, the availability of such technology has directly led to the implementation of the 1000 Genomes Project (http://www.1000genomes.org/), in which 1,000 genomes from individuals of different ethnic backgrounds were sequenced, consequently leading to the identification of a large number of rare variants (SNPs) with MAF < 1% and some very rare variants with MAF < 0.5%. Because of these low MAFs, association methods developed for common variants have limited efficiency for mapping rare variants in population studies. For these methods to have adequate power to detect individual rare variants, the sample size needs to increase substantially as the MAF decreases.
It is also more likely for a rare variant to contribute to the susceptibility of a disease as part of a group of rare variants in the same gene or pathway. Therefore grouping or collapsing rare variants is the most feasible option to improve efficiency in studying rare variants. Usually, the grouping is constructed on the basis of functional relevancy, physical proximity, or both. Once rare variants have been grouped, their genotypic information is combined, or collapsed, into a score that is usually univariate, and the association between the group of rare variants and the disease is then studied using the association between the univariate score and the disease traits. See Asimit and Zeggini and Dering et al. for excellent reviews of different methods for rare variant association analysis, including single-marker, multimarker, and various collapsing strategies. A popular alternative to collapsing genotypic information is to combine single-SNP statistics.
In this paper, we consider a gene-based association analysis for rare variants. This is equivalent to grouping based on the gene affiliation of SNPs. We propose using a clustering-based method for collapsing genotypic information of multiple SNPs within each gene. The clustering is based on an inverse-probability weighted sum of genotypic differences that highlights the variation at rare variant loci. Association between the collapsed partition label and the disease traits can then be readily evaluated using single-marker association methods, such as one-way analysis of variance (ANOVA), a chi-square test, and the partition retention method [4, 5]. We apply our approach to the simulated data of the Genetic Analysis Workshop 17 (GAW17) without knowledge of the simulation models. After the workshop, a comparison of our results with the simulation answers led to interesting observations regarding both the method and the simulated data. We discuss these observations in the Results section.
The simulated data set of GAW17 is a combination of real sequence data and simulated phenotypes. An exome of 3,205 autosomal genes, corresponding to 24,487 SNPs, was selected. Sequences of these SNPs were obtained from the 1000 Genomes Project on 697 unrelated subjects. SNPs with missing values were imputed using fastPhase. A majority of the SNPs (74%) were rare variants (MAF < 1%). Two hundred phenotype sets were simulated based on these common genotype data. Each simulated unrelated-individual data set has three quantitative trait values (Q1, Q2, Q4) and the Affected status Y, with 209 case subjects and 488 control subjects. Gene information and SNP information were provided. In particular, where the information was available, SNPs were labeled as synonymous or nonsynonymous.
We propose to evaluate an individual gene’s association with disease traits. SNPs within a gene are grouped for the association analysis. Our main focus is a collapsing strategy for multiple-SNP genotypes within a gene. We propose to create partitions of individuals (or observed genotypes) based on their genotypic differences evaluated by inverse-probability weighted similarity scores. It is easiest to begin with the alleles at a single SNP locus. For two individuals, we can count how often they carry the same allele and how often they carry different alleles. When the MAF is small, the chance of a random match on the major allele is high, so such a match carries little information. On the other hand, if a rare variant is involved in the etiology of a disease, then the case subjects are more likely to share the same rare variants than the control subjects are. Therefore, for rare variant association analysis, we want to overweight allelic or genotypic similarity on the minor alleles but not on the major alleles.
Inverse probability similarity measure: allelic similarity scores
Inverse probability similarity measure: genotypic similarity scores
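The weighting idea described above can be illustrated with a minimal sketch. It does not reproduce the scores in the tables; it assumes, for illustration only, that each shared allele contributes the inverse of that allele's population frequency, so that sharing a minor allele counts far more than sharing the major allele. Genotypes are assumed to be coded as minor-allele counts (0, 1, or 2), and the function names are hypothetical.

```python
def snp_similarity(g1, g2, maf):
    """Inverse-frequency weighted similarity of two genotypes at one SNP.

    g1, g2: minor-allele counts (0, 1, or 2) for two individuals.
    maf: minor allele frequency, strictly between 0 and 1.
    Assumption (not taken from the paper's tables): each shared allele
    is weighted by 1 / (frequency of that allele).
    """
    if not 0.0 < maf < 1.0:
        raise ValueError("maf must be strictly between 0 and 1")
    shared_minor = min(g1, g2)          # minor alleles carried by both
    shared_major = min(2 - g1, 2 - g2)  # major alleles carried by both
    return shared_minor / maf + shared_major / (1.0 - maf)


def gene_similarity(geno1, geno2, mafs):
    """Sum the per-SNP scores over all SNPs in one gene."""
    return sum(snp_similarity(a, b, p) for a, b, p in zip(geno1, geno2, mafs))


if __name__ == "__main__":
    mafs = [0.01, 0.20]  # one rare SNP, one common SNP (made-up values)
    carriers = gene_similarity([1, 0], [1, 0], mafs)
    noncarriers = gene_similarity([0, 0], [0, 0], mafs)
    print(carriers > noncarriers)  # sharing the rare allele dominates the score
```

Under this assumed weighting, two individuals who share a rare allele score much higher than two individuals who merely share the major allele, which is the behavior the text asks for.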
After obtaining the partition of individuals, for each gene we tested the association between the partition indexes obtained from the SNPs in that particular gene and the disease phenotypes. For the disease status Y, we considered one-way ANOVA, the chi-square test of independence, and the partition retention method. For continuous-valued disease outcomes Q1, Q2, and Q4, we considered one-way ANOVA and the partition retention method.
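As a rough illustration of the two classical tests, the sketch below computes the one-way ANOVA F statistic and the chi-square statistic of independence directly from partition labels. It uses only the standard library and stops at the test statistics; p-values would come from the corresponding F and chi-square reference distributions. The function names and the 0/1 coding of affected status are assumptions made for the example.

```python
from collections import defaultdict


def anova_f(labels, values):
    """One-way ANOVA F statistic of a trait across partition elements."""
    groups = defaultdict(list)
    for lab, v in zip(labels, values):
        groups[lab].append(v)
    n, k = len(values), len(groups)
    grand = sum(values) / n
    ss_between = sum(len(g) * (sum(g) / len(g) - grand) ** 2
                     for g in groups.values())
    ss_within = sum((v - sum(g) / len(g)) ** 2
                    for g in groups.values() for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


def chi_square_stat(labels, status):
    """Pearson chi-square statistic for partition label vs. 0/1 status.

    Assumes both status values occur at least once, so that no
    expected count is zero.
    """
    counts = defaultdict(lambda: [0, 0])  # label -> [controls, cases]
    for lab, y in zip(labels, status):
        counts[lab][y] += 1
    n = len(status)
    col = [sum(c[j] for c in counts.values()) for j in (0, 1)]
    stat = 0.0
    for c in counts.values():
        row = c[0] + c[1]
        for j in (0, 1):
            expected = row * col[j] / n
            stat += (c[j] - expected) ** 2 / expected
    return stat


if __name__ == "__main__":
    labels = [0, 0, 1, 1]
    print(anova_f(labels, [1.0, 2.0, 3.0, 4.0]))
    print(chi_square_stat(labels, [0, 0, 1, 1]))
```

With sparse partitions, some expected counts in the chi-square table become small, which is one reason a more robust score may be preferred.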
where n_i is the number of individuals in partition element i and Ȳ_i is the sample mean of element i. Ȳ and s are the sample mean and the standard deviation of all n individuals, respectively. Under the null hypothesis, I converges asymptotically to a weighted sum of chi-square random variables, each with 1 degree of freedom, and therefore has mean 1. The partition retention method is more robust to sparse partitions than the chi-square test and can be applied to both dichotomous disease status and continuous-valued traits. Intuitively, the I in the partition retention method evaluates the amount of influence a particular gene has on the disease phenotypes.
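A score of this kind can be sketched as follows. The form used here, I = Σ_i n_i² (Ȳ_i − Ȳ)² / (n s²), is an assumption chosen to match the symbols defined above and should be checked against the cited references on the partition retention method. The use of the n − 1 denominator for s² and the function name are likewise illustrative choices.

```python
from collections import defaultdict
from statistics import mean, variance


def partition_retention_score(labels, values):
    """Assumed form: I = sum_i n_i^2 (Ybar_i - Ybar)^2 / (n * s^2).

    labels: partition element of each individual.
    values: trait value of each individual (0/1 status or quantitative).
    s^2 is the sample variance of all n individuals (n - 1 denominator,
    an assumption of this sketch).
    """
    groups = defaultdict(list)
    for lab, y in zip(labels, values):
        groups[lab].append(y)
    n = len(values)
    ybar = mean(values)
    s2 = variance(values)
    weighted = sum(len(g) ** 2 * (mean(g) - ybar) ** 2 for g in groups.values())
    return weighted / (n * s2)


if __name__ == "__main__":
    # Partition aligned with the trait: elements differ in their means.
    print(partition_retention_score([0, 0, 1, 1], [1.0, 2.0, 3.0, 4.0]))
    # Partition unrelated to the trait: element means equal the overall mean.
    print(partition_retention_score([0, 1, 0, 1], [1.0, 1.0, 3.0, 3.0]))
```

Because each element's squared deviation is weighted by n_i², a deviation in a very small element contributes little, which fits the robustness to sparse partitions noted above.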
p-values for the ANOVA test and the chi-square test are derived from the corresponding asymptotic distributions. To address the multiple testing issue, we control the family-wise error rate using the conservative Bonferroni correction. For the evaluation using the partition retention I, we simply chose the top 0.1% of genes for each trait. A further examination of results from chromosome 4 revealed that, by using a cutoff of the top 0.1%, only 15 of the 200 replicates returned any null gene (a family-wise type I error rate of 15/200 = 7.5%), which suggests that the top 0.1% is a reasonable threshold. In practice, we suggest evaluating p-values using permutations and controlling the false discovery rate in order to have better sensitivity to real genetic signals.
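The two selection rules can be sketched as below. The significance level of 0.05 and the rounding rule for the top 0.1% (rounding up, so that 3,205 genes give 4 selected genes) are assumptions made for the example, as are the function and gene names.

```python
import math


def bonferroni_significant(pvalues, alpha=0.05):
    """Genes whose p-value is below alpha divided by the number of tests.

    pvalues: dict mapping gene name -> p-value.
    alpha: family-wise error rate (0.05 is an assumed default).
    """
    m = len(pvalues)
    return {gene for gene, p in pvalues.items() if p < alpha / m}


def top_fraction(scores, fraction=0.001):
    """Genes with the largest scores, keeping the top `fraction` of them.

    The count is rounded up and at least one gene is kept
    (an assumption of this sketch).
    """
    k = max(1, math.ceil(fraction * len(scores)))
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k]


if __name__ == "__main__":
    pvals = {"geneA": 0.001, "geneB": 0.02, "geneC": 0.5}  # made-up values
    print(bonferroni_significant(pvals))
    scores = {f"gene{i}": float(i) for i in range(3205)}
    print(top_fraction(scores))
```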
Because we have 200 simulation sets, for each gene we counted the number of times it was selected (either in the top 0.1% for I using the partition retention method or significant by Bonferroni correction for ANOVA and the chi-square test) for each trait for each method. We also compared the effects of partition sizes (results not shown). The significance varied between different partition sizes, and the partition size that corresponded to the most significant results also changed from simulation to simulation. Therefore we used the average count across six partition sizes (from 5 to 10) to rank genes. By visually examining the average counts (not shown), we observed that Q1 had strong genetic signals and that Q2 and Affected status were harder to map. For Q4, the one-way ANOVA identified many noncausal genes, or false positives, to which the partition retention method was relatively more immune.
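The ranking step can be sketched as follows, assuming the selection results are stored per partition size as a list with one set of selected genes per replicate. The data layout, function name, and gene names are hypothetical.

```python
from collections import Counter


def average_selection_counts(selected):
    """Average, over partition sizes, of how often each gene was selected.

    selected: dict mapping partition size -> list over replicates of the
    set of genes selected in that replicate (assumed layout).
    """
    totals = Counter()
    for per_replicate in selected.values():
        for genes in per_replicate:
            totals.update(genes)
    n_sizes = len(selected)
    return {gene: count / n_sizes for gene, count in totals.items()}


if __name__ == "__main__":
    # Two partition sizes and two replicates, with made-up gene names.
    selected = {
        5: [{"geneA"}, {"geneA", "geneB"}],
        6: [{"geneA"}, set()],
    }
    averages = average_selection_counts(selected)
    ranking = sorted(averages, key=averages.get, reverse=True)
    print(ranking, averages)
```

Averaging over partition sizes avoids picking the single most favorable size in each replicate, which the text notes changes from simulation to simulation.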
Association between a consistent false-positive gene (OR2T3) and a causal SNP at C13S523: cross-tabulation of the C13S523 genotype against the partition based on SNPs of OR2T3 (p = 1.8 × 10^−18 by Fisher’s exact test)
where β_i is the effect size β used in the simulation model for SNP i, which is 0 for noncausal SNPs.
In this paper, we propose a novel strategy for gene-based association analysis for genes with multiple potentially rare variants. The inverse-probability weighted clustering approach automatically adjusts weights for rare variants and overweights their genotypic variation when comparing individuals for an association study. Individuals are first partitioned on the basis of their genetic similarity on multiple SNPs in a gene, and this partition is then used to calculate association between a gene and a disease trait.
We also considered several association scores and the effect of including synonymous variants. Different methods seem to focus on nonoverlapping signals, which suggests a multimethod approach for future association studies. Our results indicate that the method gains power by considering multiple rare variants in a gene, as illustrated in Figure 1 for one of our identified causal genes. It is probably beneficial to consider both synonymous and nonsynonymous SNPs in future practice. Filtering out synonymous SNPs corresponds to a weight of 0 being assigned to synonymous SNPs and a weight of 1 being assigned to nonsynonymous SNPs, which can be extended to a smoother weighting scheme as a possible future direction.
For this simulation study, we used asymptotic p-values and the conservative Bonferroni correction because we needed to analyze 200 sets of data. In practice, we suggest evaluating p-values using permutations and controlling the false discovery rate in order to have better sensitivity to real genetic signals. Population information is provided with the simulated data. Some consistent false positives may have resulted from confounding due to population admixture. We recommend using existing methods, such as Eigensoft, to adjust for population stratification in real applications when applying our method. It should be pointed out that algorithms such as Eigensoft may convert the original discrete genotype data to continuous values, which requires modification to the similarity measure defined in Table 1.
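The suggested practice of permutation p-values with false discovery rate control can be sketched as follows. The Benjamini-Hochberg step-up procedure is used here as one common way to control the false discovery rate; the text does not name a specific procedure. The test statistic, the number of permutations, and all function names are assumptions made for the example.

```python
import random
from collections import defaultdict


def between_group_ss(labels, values):
    """Between-element sum of squares, used here as a simple statistic."""
    groups = defaultdict(list)
    for lab, v in zip(labels, values):
        groups[lab].append(v)
    grand = sum(values) / len(values)
    return sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups.values())


def permutation_pvalue(stat_fn, labels, values, n_perm=1000, seed=0):
    """Permutation p-value: shuffle trait values against fixed labels."""
    rng = random.Random(seed)
    observed = stat_fn(labels, values)
    shuffled = list(values)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if stat_fn(labels, shuffled) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_perm + 1)


def benjamini_hochberg(pvalues, q=0.05):
    """Genes declared significant by the Benjamini-Hochberg procedure."""
    m = len(pvalues)
    ranked = sorted(pvalues.items(), key=lambda kv: kv[1])
    cutoff = 0
    for rank, (_, p) in enumerate(ranked, start=1):
        if p <= q * rank / m:
            cutoff = rank  # largest rank meeting the step-up condition
    return {gene for gene, _ in ranked[:cutoff]}


if __name__ == "__main__":
    labels = [0] * 10 + [1] * 10
    trait = [0.0] * 10 + [10.0] * 10  # made-up, perfectly separated trait
    print(permutation_pvalue(between_group_ss, labels, trait))
    print(benjamini_hochberg({"a": 0.001, "b": 0.01, "c": 0.04, "d": 0.5}))
```

Permuting trait values while holding the partition fixed preserves the partition sizes, so the reference distribution reflects sparse partitions without relying on asymptotic approximations.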
This research was supported by National Institutes of Health (NIH) grants R01 GM070789 and 3R01 GM070789-05S1 and National Science Foundation grant DMS 0714669. The Genetic Analysis Workshop is supported by NIH grant R01 GM031575.
This article has been published as part of BMC Proceedings Volume 5 Supplement 9, 2011: Genetic Analysis Workshop 17. The full contents of the supplement are available online at http://www.biomedcentral.com/1753-6561/5?issue=S9.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.