Identity-by-descent filtering as a tool for the identification of disease alleles in exome sequence data from distant relatives
© Akula et al; licensee BioMed Central Ltd. 2011
Published: 29 November 2011
Large-scale, deep resequencing may be the next logical step in the genetic investigation of common complex diseases. Because each individual is likely to carry many thousands of variants, the identification of causal alleles requires an efficient strategy to reduce the number of candidate variants. Under many genetic models, causal alleles can be expected to reside within identity-by-descent (IBD) regions shared by affected relatives. In distant relatives, IBD regions constitute a small portion of the genome and can thus greatly reduce the search space for causal alleles. However, the effectiveness of this strategy is unknown. We test this strategy on the simulated mini-exome data set of extended pedigrees provided by Genetic Analysis Workshop 17. At the fourth- and fifth-degree level of relatedness, case-case pairs shared between 1% and 9% of the genome identical by descent. As expected, no genes were shared identical by descent by all case subjects, but 43 genes were shared by many case subjects across at least 50 replicates. We filtered variants in these genes based on population frequency, function, informativeness, and evidence of association using the family-based association test. This analysis highlighted five genes previously implicated in triglyceride, lipid, and cholesterol metabolism. Comparison with the list of true risk alleles revealed that strict IBD filtering followed by association testing of the rarest alleles was the most sensitive strategy. IBD filtering may be a useful strategy for narrowing down the list of candidate variants in exome data, but the optimal degree of relatedness of affected pairs will depend on the genetic architecture of the disease under study.
Single-nucleotide polymorphism (SNP) microarrays used in genome-wide association studies have been designed to interrogate SNPs with minor allele frequencies (MAFs) greater than or equal to 5%. Genome-wide association studies for a wide variety of complex diseases explain only a small proportion of disease heritability. The so-called missing heritability can be attributed to uncommon and rare variants that are not well interrogated by SNP arrays [1, 2]. This observation, combined with major advances in large-scale sequencing methods, has fueled the use of whole-exome and whole-genome sequencing to identify risk variants in common diseases. Using this approach, researchers have successfully identified rare variants involved in Mendelian disorders [3–5], but the number of candidate variants uncovered in these studies has been unexpectedly large, and close to 10,000 variants per individual may be functional. Because common diseases are thought to be genetically heterogeneous [2, 6], narrowing down the list of candidate variants to a few causal variants is a challenging process, and the best strategy remains unclear.
To identify loci that encode potential causative alleles, we test the strategy of identity-by-descent (IBD) filtering, that is, isolating IBD regions shared by affected individuals. In distant relatives, IBD regions constitute a small portion of the genome, effectively narrowing the search space for disease alleles under a variety of genetic models [3, 6]. IBD analysis may be sufficiently robust to detect loci involved in genetically heterogeneous traits where traditional genetic linkage analysis has failed [3–5, 7]. However, the effectiveness of this strategy in the face of high genetic heterogeneity is largely unknown. We apply this strategy to the mini-exome data set of eight large pedigrees in 200 simulated phenotype files provided by Genetic Analysis Workshop 17 (GAW17) (http://www.gaworkshop.org/gaw17/). When combined with typical filtering and family-based association testing (FBAT), IBD filtering analysis identified five candidate genes that were previously shown to be involved in triglyceride, lipid, and cholesterol metabolism.
We analyzed the mini-exome data in the GAW17 family data set, which consists of 697 individuals in eight extended pedigrees. We did not have any knowledge of the actual risk alleles or phenotypes; that is, we did not request the causal genes and markers (answers) from GAW17 until we had completed our analysis.
Identity by descent
Two or more alleles are identical by descent if they are inherited from the same ancestor. BEAGLE, GERMLINE, and PLINK are some statistical tools that are commonly used to calculate IBD between individuals [9–11], but in the current analysis we use IBD regions provided in the GAW17 simulated data. According to the GAW17 instructions, an IBD score of 0 indicates no sharing, an IBD score of 0.5 indicates sharing of one allele, and an IBD score of 1 indicates sharing of two alleles. However, because without inbreeding only full siblings can share two alleles identical by descent at a locus, an IBD score of 1 does not occur in the GAW17 pedigrees; hence we consider only IBD scores of 0.5 in our analysis.
First-degree relatives (parent-offspring) share 50% of their genomes, second-degree relatives (grandparents-grandchildren, avuncular pairs) about 25%, third-degree relatives, such as first cousins, about 12.5%, fourth-degree relatives about 6.25%, and fifth-degree relatives about 3.13%. Although these percentages are relatively stable for first-degree relatives, they tend to vary for more distant relatives because of the stochastic nature of recombination events.
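The percentages above follow from halving the expected sharing with each additional degree of relatedness. A minimal sketch of that arithmetic is given here; the function name is ours, and the values are expectations only, since realized sharing between distant relatives varies around them.

```python
def expected_ibd_fraction(degree: int) -> float:
    """Expected proportion of the genome shared IBD by relatives of a given degree.

    Assumes no inbreeding; realized sharing varies around this expectation.
    """
    if degree < 1:
        raise ValueError("degree must be a positive integer")
    return 0.5 ** degree


# Print the expected sharing for first- through fifth-degree relatives.
for degree in range(1, 6):
    print(f"degree {degree}: {expected_ibd_fraction(degree):.2%}")
```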
The first unknown factor involves the optimal degree of relatedness. More closely related cases will likely share more of the same risk alleles but will also share a larger portion of the genome, with many potential variants. More distantly related individuals will share less of the genome but may also carry distinct sets of risk alleles as a result of segregation, the introduction of risk alleles by married-in relatives, and new variants. Because these parameters are generally unknown and because the number of candidate functional variants carried by each individual is large, we opt for a strategy of stringent IBD filtering, focusing on relative pairs who share less than 10% of the genome, corresponding to fourth- and fifth-degree relatives.
IBD filtering in phenotype file 1
Rare SNPs (MAF < 0.1)
We repeated this approach for the remaining 199 replicates and then ranked each gene based on the number of replicates in which it was selected [4, 5]. Intuitively, this strategy should be quite robust to allelic heterogeneity but less robust to locus heterogeneity. If locus heterogeneity is expected to be high, one could retain genes that overlap with IBD regions in as few as one case-case pair and then use the detected genes from each family as an estimate of the intrafamilial locus heterogeneity.
In summary, our strategy involved the following steps: (1) calculating the IBD score between all pairs; (2) selecting affected pairs; (3) choosing case-case pairs that share between 1% and 9% of the genome; (4) selecting a list of genes shared by most case subjects; (5) repeating steps 1 through 4 for each of the 200 replicate files; (6) ranking each gene based on the number of replicate files from which it was detected.
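The steps summarized above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the code used in the analysis: the input layout (a per-replicate dictionary with `cases`, `sharing`, and `ibd_genes`) is hypothetical, and because the text does not define "most case subjects" numerically, the `min_fraction` cutoff of 0.5 is our assumption.

```python
from collections import Counter
from itertools import combinations


def select_distant_case_pairs(sharing, cases, low=0.01, high=0.09):
    """Return case-case pairs whose genome-wide IBD sharing lies in [low, high].

    sharing: dict mapping frozenset({id_a, id_b}) -> fraction of genome shared IBD
    cases:   iterable of affected individual ids
    Pairs are returned as tuples in sorted id order.
    """
    pairs = []
    for a, b in combinations(sorted(cases), 2):
        fraction = sharing.get(frozenset((a, b)), 0.0)
        if low <= fraction <= high:
            pairs.append((a, b))
    return pairs


def genes_shared_by_most_cases(pairs, ibd_genes, cases, min_fraction=0.5):
    """Genes in IBD regions of pairs that together cover > min_fraction of cases.

    ibd_genes: dict mapping (id_a, id_b), ids in sorted order, -> set of gene names
    min_fraction is an assumed cutoff; the source does not state one.
    """
    carriers = {}
    for pair in pairs:
        for gene in ibd_genes.get(pair, set()):
            carriers.setdefault(gene, set()).update(pair)
    n_cases = len(set(cases))
    return {g for g, ids in carriers.items() if len(ids) / n_cases > min_fraction}


def rank_genes(replicates, min_replicates=50):
    """Count the replicates in which each gene is selected; keep frequent genes."""
    counts = Counter()
    for rep in replicates:
        pairs = select_distant_case_pairs(rep["sharing"], rep["cases"])
        counts.update(
            genes_shared_by_most_cases(pairs, rep["ibd_genes"], rep["cases"])
        )
    return [(g, n) for g, n in counts.most_common() if n >= min_replicates]


# Toy replicate with made-up ids and gene names, for illustration only.
toy = {
    "cases": ["c1", "c2", "c3", "c4"],
    "sharing": {
        frozenset(("c1", "c2")): 0.05,
        frozenset(("c1", "c3")): 0.30,  # too closely related, excluded
        frozenset(("c2", "c3")): 0.02,
    },
    "ibd_genes": {
        ("c1", "c2"): {"GENE_A"},
        ("c2", "c3"): {"GENE_A", "GENE_B"},
    },
}
ranked = rank_genes([toy, toy], min_replicates=2)
```

Counting the individuals covered by sharing pairs, rather than the pairs themselves, is one possible reading of "shared by most case subjects"; a pair-based count would be an equally reasonable alternative.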
Because the simulated data set was not well suited for sophisticated filtering of variants, we applied two commonly used filters: the MAF in the 1000 Genomes Project data and the potential functional impact (retaining nonsynonymous variants). We applied a 10% MAF threshold because, in practice, earlier genome-wide association studies could be expected to have found more common variants if they conferred reasonable disease risk.
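As an illustration of the frequency and function filters described above, the following sketch keeps variants that lie in an IBD-selected gene, have a MAF below 10%, and are nonsynonymous. The `Variant` record and its field names are hypothetical and do not correspond to the GAW17 file format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    """Hypothetical variant record; field names are assumptions."""
    name: str
    gene: str
    maf: float            # minor allele frequency in the reference panel
    nonsynonymous: bool   # True if the variant changes the encoded amino acid


def filter_variants(variants, candidate_genes, maf_threshold=0.10):
    """Keep rare, nonsynonymous variants located in the candidate genes."""
    return [
        v for v in variants
        if v.gene in candidate_genes
        and v.maf < maf_threshold
        and v.nonsynonymous
    ]


# Made-up variants for illustration only.
variants = [
    Variant("v1", "GENE_A", 0.02, True),
    Variant("v2", "GENE_A", 0.25, True),   # too common
    Variant("v3", "GENE_A", 0.01, False),  # synonymous
    Variant("v4", "GENE_C", 0.01, True),   # gene not selected by IBD filtering
]
kept = filter_variants(variants, candidate_genes={"GENE_A"})
```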
Family-based association test
To exclude variants that were clearly not associated with the phenotype, we performed an FBAT analysis (http://biosun1.harvard.edu/~fbat/fbat.htm) on all 200 replicate files. We included markers that were informative in at least three out of eight pedigrees. Because there are multiple nuclear families in a pedigree, we used the FBAT option -e, as recommended by the software developer. We then ranked variants by the minimum FBAT p-value observed across the 200 replicate analyses. We also performed an FBAT -e analysis after setting the number of informative families to 5 and 8 (8 being the maximum number of pedigrees in the GAW17 data), in order to evaluate the effect of this parameter on the identification of true causal alleles (see Discussion and Conclusions section).
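The ranking step can be sketched as follows, assuming the FBAT results for each replicate have already been parsed into tuples of (variant, number of informative families, p-value). That tuple layout is our assumption and is not the native FBAT output format; the function name and defaults are likewise illustrative.

```python
def rank_by_min_pvalue(replicate_results, min_informative=3, alpha=0.05):
    """Rank variants by their minimum p-value across replicates.

    replicate_results: iterable of lists of
        (variant_name, n_informative_families, p_value) tuples (assumed layout).
    Results with fewer than min_informative informative families are skipped.
    Returns (variant, min_p) pairs with min_p < alpha, smallest p-value first.
    """
    best = {}
    for results in replicate_results:
        for variant, n_informative, p_value in results:
            if n_informative < min_informative:
                continue
            if variant not in best or p_value < best[variant]:
                best[variant] = p_value
    ranked = sorted(best.items(), key=lambda item: item[1])
    return [(variant, p) for variant, p in ranked if p < alpha]


# Two made-up replicates for illustration only.
replicates = [
    [("v1", 3, 0.20), ("v2", 2, 0.001)],   # v2 has too few informative families
    [("v1", 4, 0.01), ("v3", 5, 0.04)],
]
top = rank_by_min_pvalue(replicates)
```

Raising `min_informative` to 5 or 8 mimics the stricter settings mentioned above. Taking the minimum over many replicates inflates the chance of a small p-value, so the result serves as a ranking rather than a calibrated test.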
Summary of IBD analysis and FBAT analysis
IBD analysis
Total genes: 3,205
Genes shared by most people in at least 1 phenotype file: 1,798
Genes seen in at least 50 phenotype files: 43
FBAT analysis
Total SNPs in 43 genes: 956
SNPs with < 10% MAF: 876
Nonsynonymous SNPs: 525
FBAT p < 0.05: 12
Table: Top 43 genes from the IBD analysis (column: observed number of phenotype files)
Variant filtering revealed that the 43 top-ranked genes contained 956 variants. MAF and functional filtering reduced this list to 525 variants in 32 genes (Table 2, FBAT analysis).
Table: Candidate genes and variants (column: number of phenotype files)
Discussion and conclusions
We assume that the GAW17 data set is genetically heterogeneous. Therefore not all affected individuals share the same causal genes (locus heterogeneity), nor do they share the same variants (allelic heterogeneity). We addressed the locus heterogeneity problem by using IBD analysis between distantly related case subjects, selecting genes that were often but not always shared by case subjects. To address allelic heterogeneity, we considered all variants that passed our frequency and functionality filters and all variants located in genes selected by IBD filtering. Larger sample sizes would allow more liberal IBD filtering, increasing the robustness of this strategy in the face of locus heterogeneity.
Although the IBD filtering did substantially reduce the candidate gene list, there were still 43 candidate genes with many sequence variants. The top hits of the IBD filtering alone were F5 (shared by case-case pairs in 140 phenotype files) and NF2 (136 files). Neither of these genes contained variants that were seen in more than a few case subjects. Thus it was important to work down the list to identify variants that were more frequent in this data set. A larger data set would have allowed more discovered variants to be included in the analysis, potentially increasing robustness to allelic heterogeneity.
Filtering of variants based on MAF and potential function cut the list in half, but it is not clear whether this filtering method will be ideal for common complex traits. Depending on penetrance, true risk alleles might be fairly common in comparison data sets, especially those consisting of control subjects who have not been screened for the trait of interest. One could set the MAF threshold higher than 10% and exclude variants that are homozygous in a few control subjects, because these might be more likely to produce a recognized phenotype in control subjects. Similar arguments can be made about functionality. In practice, most studies of complex traits aim to include variants with regulatory or splicing effects, which we could not estimate in the GAW17 data set.
Family-based association testing was the final component of our strategy, aimed at eliminating variants (and genes) that were clearly not associated with the phenotype. In real-world data, power analysis would guide the choice of appropriate p-value thresholds for the family-based association testing, and candidates would generally be further evaluated in large case-control samples. Because many rare variants are singletons, nominated genes would typically be resequenced in additional case and control subjects to test the hypothesis that the genes harbor additional deleterious variants in case subjects that might not have been observed in the original study. See Krawitz et al. for a successful example of this strategy.
Our analysis nominated a set of five candidate genes, APOB, TTLL4, ACCN4, COL6A3, and TG, three of which are implicated in cardiovascular disease. Apolipoprotein B (APOB) is the main apolipoprotein component of low-density lipoproteins and is known to play a role in atherosclerotic plaque formation. ACCN4 (amiloride-sensitive cation channel 4) encodes an amiloride-sensitive sodium channel, and amiloride is often prescribed to control heart failure. The extracellular matrix of arteries and the myocardium has high levels of collagen fibers, and COL6A3 (alpha 3 type VI collagen isoform 5 precursor) encodes one of the alpha chains of collagen that participates in plaque and clot formation.
We necessarily used several arbitrary thresholds in this exercise. Ideally, the optimal thresholds would be selected at the start of a sequencing experiment, guided by the available sample size, replication resources, and educated guesses about the genetic architecture of the disease target.
At the conclusion of the GAW17 meeting, we requested the list of true causal genes so that we could assess the effect of our threshold choices on the results. The list of candidate genes identified by our IBD analysis included VEGFC, one of the genes simulated to harbor causal alleles in the GAW17 data (Table 3). However, VEGFC contained only one variant that was too rare to be informative in our FBAT analysis. A larger data set might allow more discovered variants to be considered, perhaps by grouping within each gene, potentially increasing robustness to allelic heterogeneity.
Table: Effect of IBD filtering and allele frequency thresholds on sensitivity (columns: FBAT -e Min Size; “Most cases” (TP%); “Max cases” (TP%))
These results suggest that IBD filtering is a promising strategy for narrowing down the list of candidate variants in exome data. Although the sensitivity was low in the simulated GAW17 data, IBD filtering should be particularly effective in founder populations where rare disease alleles are more likely to be inherited from a common ancestor. More theoretical work is needed to determine the optimal degree of relatedness at which case-case pairs should be selected and to identify the best strategy for ranking variants in IBD regions for further study. Much will depend on the genetic architecture of the disease under study.
The Genetic Analysis Workshop is supported by National Institutes of Health grant R01 GM031575. This work was supported by the National Institute of Mental Health (NIMH) Intramural Research Program. The computational analysis was performed on the Helix server at the National Institutes of Health.
This article has been published as part of BMC Proceedings Volume 5 Supplement 9, 2011: Genetic Analysis Workshop 17. The full contents of the supplement are available online at http://www.biomedcentral.com/1753-6561/5?issue=S9.
1. Maher B: Personal genomes: the case of the missing heritability. Nature. 2008, 456: 18-21.
2. Cirulli ET, Goldstein DB: Uncovering the roles of rare variants in common disease through whole-genome sequencing. Nat Rev Genet. 2010, 11: 415-425. 10.1038/nrg2779.
3. Ng SB, Buckingham KJ, Lee C, Bigham AW, Tabor HK, Dent KM, Huff CD, Shannon PT, Jabs EW, Nickerson DA, et al: Exome sequencing identifies the cause of a Mendelian disorder. Nat Genet. 2010, 42: 30-35. 10.1038/ng.499.
4. Ng SB, Bigham AW, Buckingham KJ, Hannibal MC, McMillin MJ, Gildersleeve HI, Beck AE, Tabor HK, Cooper GM, Mefford HC, et al: Exome sequencing identifies MLL2 mutations as a cause of Kabuki syndrome. Nat Genet. 2010, 42: 790-793. 10.1038/ng.646.
5. Krawitz PM, Schweiger MR, Rödelsperger C, Marcelis C, Kölsch U, Meisel C, Stephani F, Kinoshita T, Murakami Y, Bauer S, et al: Identity-by-descent filtering of exome sequence data identifies PIGV mutations in hyperphosphatasia mental retardation syndrome. Nat Genet. 2010, 42: 827-829. 10.1038/ng.653.
6. Dickson SP, Wang K, Krantz I, Hakonarson H, Goldstein DB: Rare variants create synthetic genome-wide associations. PLoS Biol. 2010, 8: e1000294. 10.1371/journal.pbio.1000294.
7. Roach JC, Glusman G, Smit AF, Huff CD, Hubley R, Shannon PT, Rowen L, Pant KP, Goodman N, Bamshad M, et al: Analysis of genetic inheritance in a family quartet by whole-genome sequencing. Science. 2010, 328: 636-639. 10.1126/science.1186802.
8. Almasy LA, Dyer TD, Peralta JM, Kent JW, Charlesworth JC, Curran JE, Blangero J: Genetic Analysis Workshop 17 mini-exome simulation. BMC Proc. 2011, 5 (suppl 9): S2. 10.1186/1753-6561-5-S9-S2.
9. Browning SR, Browning BL: High-resolution detection of identity by descent in unrelated individuals. Am J Hum Genet. 2010, 86: 526-539. 10.1016/j.ajhg.2010.02.021.
10. Gusev A, Lowe JK, Stoffel M, Daly MJ, Altshuler D, Breslow JL, Friedman JM, Pe’er I: Whole population, genome-wide mapping of hidden relatedness. Genome Res. 2009, 19: 318-326.
11. Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MA, Bender D, Maller J, Sklar P, de Bakker PI, Daly MJ, et al: PLINK: a tool set for whole-genome association and population-based linkage analysis. Am J Hum Genet. 2007, 81: 559-575. 10.1086/519795.
12. Hon L, Henn BM, Macpherson JM, Eriksson N, Wojcicki A, Avey L, Saxonov S, Mountain JL: Discovering distant relatives within a diverse set of populations using DNA segments identical by descent. American Society of Human Genetics, 59th annual meeting. 2009, [http://www.ashg.org/2009meeting/abstracts/fulltext/f10169.htm], Abstract 596.
13. McQueen MJ, Hawken S, Wang X, Ounpuu S, Sniderman A, Probstfield J, Steyn K, Sanderson JE, Hasani M, Volkova E: Lipids, lipoproteins, and apolipoproteins as risk markers of myocardial infarction in 52 countries (the INTERHEART Study): a case-control study. Lancet. 2008, 372: 224-233. 10.1016/S0140-6736(08)61076-4.
14. Rodriguez-Feo JA, Sluijter JP, de Kleijn DP, Pasterkamp G: Modulation of collagen turnover in cardiovascular disease. Curr Pharm Des. 2005, 11: 2501-2514. 10.2174/1381612054367544.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.