A novel method to detect rare variants using both family and unrelated case-control data
© Feng et al; licensee BioMed Central Ltd. 2011
Published: 29 November 2011
To detect rare variants associated with a phenotype, we develop a novel statistical method that can use both family and unrelated case-control data. Unlike the currently existing methods, we first use family data to calculate weights to be given to rare variants, differentiating between concordantly affected and discordant sib pairs. These weights are then used in an association test applied to the unrelated case-control data. We applied the proposed method to the simulated sequencing data in Genetic Analysis Workshop 17 and identified two genes associated with the disease.
Genome-wide association studies, which are based on the common disease/common variants assumption, have successfully identified susceptibility loci for complex traits. However, the variants discovered through these studies explain only a modest portion of the trait variability . With the new technological advances, it has been suggested that it is time to shift the search from common variants of modest effect to rarer variants of large effect by effectively searching the full genome . Rare variants may hold promise to predict individual risk for personalized medicine because of their large effect, although it has been argued that common variants illuminate the biologic pathways that underlie diseases .
Bodmer and Tomlinson  suggested that a set of low-frequency variants from different genes can account for a significant proportion of the variability of relatively common diseases. To achieve reasonable statistical power, it is critical to define the rare variants and test them collectively. The existing statistical methods in the literature mainly collapse rare variants . Madsen and Browning  proposed using the inverse of the variance of the minor allele frequency (MAF) in control subjects as a weight and then collapsing the weighted rare variants.
which follows approximately a standard normal distribution under the null hypothesis. Madsen and Browning  demonstrated that this weighted-sum method is more powerful than the collapsing method .
Recently, we have demonstrated that family data are useful for searching for rare variants [7, 8], because the rare variants can be substantially enriched among segregating family members. Here, we present a statistical method to test rare variants by using both family and unrelated case-control sequencing data.
Defining a weight for each SNP using the family data set
We start by assuming that we have L variants belonging to a group (gene, pathway, specific genomic region, etc.). We have shown that family data, such as affected sib pairs, have enriched information for detecting rare disease-associated variants because the same disease variants are more likely to segregate within a family . Thus each family can be considered homogeneous; that is, affected family members share the same or allelic disease variants, the latter being an idea used in traditional linkage analysis. Our method uses either affected sib pairs or discordant sib pairs to determine weights for such rare variants.
Assume that a SNP has two alleles A and a, where A always refers to the minor allele. For the ith sib pair and the jth SNP, we define a genotype score . Let α be a predefined threshold. We always let if a SNP has a MAF greater than α. For those SNPs with MAF ≤ α , we define as follows. For affected sib pairs, if both sibs carry A, if neither sib carries A, and if one sib carries A and the other does not, where K is the number of other SNPs in the region with MAF <α carried by the other sib. For discordant sib pairs, if the affected sib carries allele A and the unaffected sib does not, if the affected sib does not carry A, and if both sibs carry A, where K is the number of alleles with MAF <α present at the other SNPs carried by the affected sib.
where p j is the MAF of the SNP, estimated from the case and control subjects combined. We rank the b j in descending order and give the jth SNP weight w j = b j if b j falls in the top quartile and weight w j = 0 if not. The proposed weight incorporates information about the variants shared (not shared) by affected (discordant) sib pairs.
Performing association tests
Application to the Genetic Analysis Workshop 17 sequence data set
The Genetic Analysis Workshop 17 (GAW17) simulated sequence data set includes both family and unrelated individuals data, which is ideal for the proposed method. Each replicate is composed of 697 individuals in the family data set and 697 individuals in the unrelated individuals data set. There are 200 replicates in total, all with the same genotype data but with phenotype data independently simulated across the replicates.
We first performed an analysis using only one replicate, which was selected to be the 82nd replicate. To calculate the weights, we clustered the sib pairs into affected and discordant sib pairs. Based on the 82nd phenotype replicate, we identified 38 affected and 22 discordant sib pairs. We then set α = 0.01 as the MAF cutoff to define the rare variants and calculated weights based on the identified 38 affected sib pairs using the Eq. (4-5).
The 82nd phenotype replicate includes 209 unrelated case subjects and 488 unrelated control subjects. This data set was used to test for association in each of 3,205 genes.
We compared the power of the proposed method with Madsen and Browning’s method using the same unrelated case-control data. The same MAF threshold of α = 0.01 was used to define rare variants for Madsen and Browning’s method.
We expect there to be little or no power for detecting rare variants using a single replicate because of the small sample size. We therefore included additional replicates until we reached a sample size of 400 affected sib pairs and 2,400 case subjects and 2,400 control subjects. We used 400 affected sib pairs, which was suggested by our previous study . Although the genotypes are the same for the different replicates, the genotype-phenotype association is simulated independently for the different replicates. Thus the way we increased the sample size will have little impact on our power comparison.
We applied the proposed method and Madsen and Browning’s method to 3,205 genes using the 82nd replicate. As expected, we found virtually no power for either method.
Causal gene with tested p-value
p-value (Madsen and Browning method)
p-value (our method)
8.63 × 10−6
1.48 × 10−5
8.67 × 10−10
Discussion and conclusions
In this paper, we propose a novel method to analyze the GAW17 data set. Unlike the existing methods, the proposed method calculates the weights using either affected or discordant sib pairs. The proposed method requires that both the case and control groups and the family members are genotyped for the same set of SNPs or are resequenced in the same region. Compared with Madsen and Browning’s method, the proposed method detects the true causal gene HSP90AA1, which was missed by Madsen and Browning’s method. Our method demonstrates that incorporating family data can potentially improve statistical power to detect rare variants in an association analysis, which is consistent with the result of Zhu et al. .
This work was supported by National Institutes of Health grant HL086718 from the National Heart, Lung, and Blood Institute, National Human Genome Research Institute grant HG003054, and National Center for Research Resources grant RR03655.
This article has been published as part of BMC Proceedings Volume 5 Supplement 9, 2011: Genetic Analysis Workshop 17. The full contents of the supplement are available online at http://www.biomedcentral.com/1753-6561/5?issue=S9.
- Visscher PM: Sizing up human height variation. Nat Genet. 2008, 40: 489-490. 10.1038/ng0508-489.View ArticlePubMedGoogle Scholar
- Goldstein DB: Common genetic variation and human traits. New Engl J Med. 2009, 360: 1696-1698. 10.1056/NEJMp0806284.View ArticlePubMedGoogle Scholar
- Hirschhorn JN: Genomewide association studies: illuminating biologic pathways. New Engl J Med. 2009, 360: 1699-1701. 10.1056/NEJMp0808934.View ArticlePubMedGoogle Scholar
- Bodmer W, Tomlinson I: Rare genetic variants and the risk of cancer. Curr Opin Genet Dev. 2010, 20: 262-267. 10.1016/j.gde.2010.04.016.View ArticlePubMedGoogle Scholar
- Li B, Leal SM: Methods for detecting associations with rare variants for common diseases: application to analysis of sequence data. Am J Hum Genet. 2008, 83: 311-321. 10.1016/j.ajhg.2008.06.024.PubMed CentralView ArticlePubMedGoogle Scholar
- Madsen BE, Browning SR: A groupwise association test for rare mutations using a weighted sum statistic. PLoS Genet. 2009, 5: e1000384-10.1371/journal.pgen.1000384.PubMed CentralView ArticlePubMedGoogle Scholar
- Zhu X, Feng T, Li Y, Lu Q, Elston RC: Detecting rare variants for complex traits using family and unrelated data. Genet Epidemiol. 2010, 32: 171-187.View ArticleGoogle Scholar
- Feng T, Zhu X: Genome-wide searching of rare genetic variants in WTCCC data. Hum Genet. 2010, 128: 269-280. 10.1007/s00439-010-0849-9.PubMed CentralView ArticlePubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.