Volume 5 Supplement 9
Genetic Analysis Workshop 17: Unraveling Human Exome Data
Identity by descent and association analysis of dichotomous traits based on large pedigrees
- Tian Liu^{1} and
- Anbupalam Thalamuthu^{1}
https://doi.org/10.1186/1753-6561-5-S9-S31
© Liu and Thalamuthu; licensee BioMed Central Ltd. 2011
Published: 29 November 2011
Abstract
The goal of our analysis was to map functional loci that contribute to the case-control status of a trait of interest using large pedigrees. We used logistic regression fitted with generalized estimating equations to test associations between a dichotomous phenotype and all genotyped common and rare single-nucleotide polymorphisms (SNPs). In addition to the association study, we developed and applied a simple and fast identical-by-descent-based test to identify loci that were shared among affected individuals more often than expected by chance. For the top significant loci, we assessed the statistical power and the false discovery rate of both methods. We also demonstrated that family-based studies offer clear advantages over standard population-based association studies for the discovery of multiple rare causal variants.
Background
Population-based genome-wide association studies (GWAS) using unrelated individuals are becoming increasingly popular in genetic research. Recent large GWAS have shown that common genetic variants are involved in common diseases, but most of the variants found in this way account for only a small portion of the trait variance. On the other hand, accumulating evidence from candidate-gene-based resequencing suggests that many rare genetic variants contribute to the trait variance of common diseases. Pedigree resources are conventionally believed to be powerful for identifying rare variants and are considered appropriate for the application of linkage strategies. However, linkage analysis often requires nuclear families with multiple informative offspring and is more applicable to quantitative traits [1, 2]. Existing methods that appropriately perform association analyses of dichotomous traits in extended pedigrees are often computationally impractical for large-scale genome-wide association analyses. Hence we were motivated to explore alternative strategies that can be carried out efficiently on a genome-wide scale and can appropriately handle familial relationships in arbitrarily structured pedigrees.
The family data from Genetic Analysis Workshop 17 (GAW17) provide an opportunity to investigate approaches for family-based association testing. The approaches we consider in this work are a logistic regression model with covariate adjustment and a novel approach using identical-by-descent (IBD) measures.
Methods
We used the family data sets provided by GAW17. Each data set comprised 697 subjects (209 affected and 488 unaffected individuals) from 8 extended families and fully informative markers for 3,205 genes. Under the assumption of no recombination within genes, IBD scores were also provided at each gene location (see Almasy et al. [3] for additional details of the GAW17 data). Using these case-control data, we explored two different types of methods: association analysis and IBD linkage analysis.
Association analyses
where y_{i} is the affected status of individual i (case samples are coded 1 and control samples are coded 0) and g_{i} is the genotype of individual i at a given SNP marker. The intercept parameter μ is the baseline log odds, and the β parameters represent the log odds ratios corresponding to the variables used in the model. Assuming that minor allele B is the risk allele, g_{i} is coded 0, 1, or 2 corresponding to genotypes AA, AB, or BB, respectively.
Because individuals within large pedigrees are not independent, the joint likelihood function for all individuals has a complicated form. We used generalized estimating equations (GEE) [4, 5] with a fixed covariance structure to fit the logistic regression (Eq. (1)). We computed the kinship matrix of the 697 individuals using the kinship program in the R package [6] and used the kinship matrix to specify the covariance matrix in the GEE.
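For reference, a logistic model consistent with these definitions can be written as below. This is a sketch under assumptions: the covariate terms x_{ik} and their coefficients β_{k} are generic placeholders and do not name the specific covariates used in the analysis, and the genotype coefficient is labeled β_{g} only for illustration.

```latex
\operatorname{logit} P(y_i = 1)
  = \log \frac{P(y_i = 1)}{1 - P(y_i = 1)}
  = \mu + \beta_g \, g_i + \sum_{k} \beta_k \, x_{ik}
```

Under this form, exp(β_{g}) is the odds ratio per additional copy of the B allele.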
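The kinship matrix mentioned above can be illustrated with the standard recursive definition of kinship coefficients. The following is a minimal pure-Python sketch, not the R kinship program itself; the function name `kinship_matrix`, the tuple layout of the pedigree, and the requirement that parents be listed before their children are assumptions made for this example.

```python
def kinship_matrix(pedigree):
    """Kinship coefficients for a pedigree.

    pedigree: list of (id, father_id, mother_id) tuples, with parents
    listed before their children and (id, None, None) for founders.
    Founders are assumed unrelated and non-inbred (an assumption).
    """
    index = {person[0]: k for k, person in enumerate(pedigree)}
    n = len(pedigree)
    phi = [[0.0] * n for _ in range(n)]
    for i, (_, father, mother) in enumerate(pedigree):
        if father is None and mother is None:
            phi[i][i] = 0.5  # founder: kinship with self is 1/2
            continue
        if father is None or mother is None:
            raise ValueError("this sketch needs both parents or neither")
        f, m = index[father], index[mother]
        if f >= i or m >= i:
            raise ValueError("parents must be listed before children")
        # Kinship with every earlier individual: average over the parents.
        for j in range(i):
            phi[i][j] = phi[j][i] = 0.5 * (phi[f][j] + phi[m][j])
        # Self-kinship depends on the kinship between the two parents.
        phi[i][i] = 0.5 * (1.0 + phi[f][m])
    return phi


# Hypothetical nuclear family: two founders and two full siblings.
pedigree = [("P1", None, None), ("P2", None, None),
            ("C1", "P1", "P2"), ("C2", "P1", "P2")]
phi = kinship_matrix(pedigree)
```

In this small example the parent-offspring and full-sibling kinship coefficients both come out as 0.25, which is the textbook value for non-inbred relatives.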
After assigning a collapsed genotype to grouped rare variants, we evaluate the association between collapsed genotypes and disease outcomes using the logistic model specified in Eq. (1).
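One common way to form a collapsed genotype is a carrier indicator: an individual is coded 1 if he or she carries at least one minor allele at any rare variant in the gene and 0 otherwise. The sketch below illustrates that scheme only; it is not necessarily the collapsing rule used in this analysis, and the function name `collapse_rare`, the frequency threshold, and the sample-based allele frequency (which ignores relatedness) are all assumptions.

```python
def collapse_rare(genotypes, maf_threshold=0.01):
    """Carrier-indicator collapsing of rare variants within one gene.

    genotypes: one list per individual, holding minor-allele counts
    (0, 1, or 2) for each variant in the gene.
    Returns one 0/1 collapsed genotype per individual.
    """
    n = len(genotypes)
    n_variants = len(genotypes[0])
    # Naive sample allele frequency; relatedness is ignored (assumption).
    maf = [sum(row[v] for row in genotypes) / (2 * n)
           for v in range(n_variants)]
    rare = [v for v in range(n_variants) if maf[v] < maf_threshold]
    return [1 if any(row[v] > 0 for v in rare) else 0 for row in genotypes]


# Hypothetical toy data: 4 individuals, 3 variants. A loose threshold of
# 0.2 is used only so that the tiny example contains "rare" variants.
toy_genotypes = [[0, 1, 0],
                 [0, 1, 0],
                 [1, 2, 0],
                 [0, 0, 0]]
collapsed = collapse_rare(toy_genotypes, maf_threshold=0.2)
```

The resulting 0/1 vector can then take the place of g_{i} in the logistic model.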
IBD analysis
Phenotypes of relatives are similar because relatives share similar environments and similar genetic variants (genotypes or haplotypes). Genotypes are similar because these relatives share genes that are identical by descent. Intuitively, disease-associated loci are more likely to be similar among case samples than among control samples. Comparing the distributions of IBD scores between arbitrary pairs of case subjects and pairs of control subjects is therefore a promising approach to identifying loci associated with a trait. Based on this general idea, we formulated a new test using a 2 × 3 contingency table that can identify loci shared among affected individuals more often than expected by chance. Our algorithm considers all relationships simultaneously and can be used to test association in pedigrees of arbitrary size.
IBD configurations for pairs of individuals
State | Description |
---|---|
S_{1} | BB/BB; IBD = 1 |
S_{2} | BB/BB; IBD = 0.5 |
S_{3} | BB/BB; IBD = 0 |
S_{4} | BB/AB; IBD = 0.5 |
S_{5} | BB/AB; IBD = 0 |
S_{6} | BB/AA; IBD = 0 |
S_{7} | AB/AB; IBD = 1 |
S_{8} | AB/AB; IBD = 0.5 |
S_{9} | AB/AB; IBD = 0 |
S_{10} | AB/AA; IBD = 0.5 |
S_{11} | AB/AA; IBD = 0 |
S_{12} | AA/AA; IBD = 1 |
S_{13} | AA/AA; IBD = 0.5 |
S_{14} | AA/AA; IBD = 0 |
Under the null hypothesis of no association, the frequencies of the 14 states are the same in affected and unaffected samples. Consequently, pairs of affected individuals and pairs of unaffected individuals are equally likely to share zero, one, or two copies of the B allele IBD. Under the alternative hypothesis of association, we expect pairs of affected individuals to share at least one copy of the B allele more often. Equivalently, we expect affected individuals who form pairs to share at least one copy of the inherited B allele more often than unaffected individuals do.
Contingency table
Group | Count | Number of distinct individuals who form pairs IBD at B\|B | Number of distinct individuals who form pairs IBD at B\|− but not B\|B | Other |
---|---|---|---|---|
Case | Observed | nBB^{1} | nB^{1} | n^{1} |
Case | Expected | p_{case}(nBB^{1} + nBB^{0}) | p_{case}(nB^{1} + nB^{0}) | p_{case}(n^{1} + n^{0}) |
Control | Observed | nBB^{0} | nB^{0} | n^{0} |
Control | Expected | p_{control}(nBB^{1} + nBB^{0}) | p_{control}(nB^{1} + nB^{0}) | p_{control}(n^{1} + n^{0}) |
Under the null hypothesis H_{0}, affected and unaffected individuals should be distributed across the three columns in the same proportions, so each expected count is the group proportion (p_{case} or p_{control}) multiplied by the column total. The test statistic has an asymptotic chi-squared distribution with two degrees of freedom.
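Assuming the statistic takes the usual Pearson form over the six cells of the contingency table (an assumption, with O and E denoting the observed and expected counts laid out in the table, and the index c running over the three columns), it can be written as:

```latex
\chi^2 = \sum_{g \in \{\mathrm{case},\,\mathrm{control}\}} \sum_{c=1}^{3}
         \frac{\left(O_{gc} - E_{gc}\right)^2}{E_{gc}},
\qquad
E_{gc} = p_g \left(O_{\mathrm{case},c} + O_{\mathrm{control},c}\right)
```

With two rows and three columns, the degrees of freedom are (2 − 1)(3 − 1) = 2, which matches the reference distribution stated above.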
Contingency table of the toy example
Group | Count | Number of distinct individuals who form pairs IBD at B\|B | Number of distinct individuals who form pairs IBD at B\|− but not B\|B | Other |
---|---|---|---|---|
Case | Observed | 2 | 3 | 1 |
Case | Expected | (6/13)(2 + 0) = 12/13 | (6/13)(3 + 2) = 30/13 | (6/13)(1 + 5) = 36/13 |
Control | Observed | 0 | 2 | 5 |
Control | Expected | (7/13)(2 + 0) = 14/13 | (7/13)(3 + 2) = 35/13 | (7/13)(1 + 5) = 42/13 |
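The arithmetic in the toy table can be checked with a few lines of Python. This is a minimal sketch that assumes the test statistic is the usual Pearson chi-squared sum over the six cells; the function name `pearson_chi2` is hypothetical. The closed-form p-value exp(−x/2) is the exact upper tail of a chi-squared distribution with two degrees of freedom, so no statistics library is needed.

```python
import math


def pearson_chi2(table):
    """Pearson chi-squared statistic and degrees of freedom for a table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count = group proportion times column total.
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof


# Observed counts from the toy example: rows are case and control.
observed = [[2, 3, 1],
            [0, 2, 5]]
stat, dof = pearson_chi2(observed)

# Upper-tail probability; this closed form is valid only for dof == 2.
p_value = math.exp(-stat / 2.0)
print(f"chi2 = {stat:.3f}, dof = {dof}, p = {p_value:.4f}")
```

For these counts the statistic is about 4.82 with a p-value of about 0.09. Because every expected count in the toy table is below 5, the asymptotic approximation should be read as illustrative only.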
Results
We first compared the performances of two different family-based approaches (association analysis and IBD analysis) in terms of detection power and false discovery rate using the 200 simulated extended family data sets in GAW17. We then carried out standard stratified association analyses with the population-based data sets so that we could further investigate the application and the value of using family-based approaches. In particular, we were keen to find out whether family-based approaches have certain advantages in discovering rare causal variants.
Family-based association analysis
We also evaluated false-positive rates. We declared gene-level p-values significant at a prespecified cutoff α; genes with p-values less than α were considered detected in each simulation. Averaged over 200 replicates, the proportion of noncausal genes (among all 3,205 genes) wrongly declared significant was 0.096 at α = 0.05 and 0.033 at α = 0.01.
We further evaluated the power of detecting true functional genes at α levels of 0.05 and 0.01. Among all 3,205 genes, the family-based association study identified VEGFA as the most significant gene, with a power as high as 81% at α = 0.05. The four other causal genes with the highest detection rates were LPL, VNN1, SHC1, and SIRT1. Compared with the other 31 causal genes, these 5 genes have both strong effects and relatively high frequencies of risk-variant carriers.
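The way power and false-positive rate are tallied across replicates can be sketched as follows. This is an illustrative sketch only: the function name `detection_rates`, the data layout, and the tiny made-up p-values are assumptions and are not GAW17 results.

```python
def detection_rates(pvalues, causal, alpha):
    """Per-gene power and average false-positive rate across replicates.

    pvalues: one list per replicate, holding one gene-level p-value
             per gene (same gene order in every replicate).
    causal:  list of booleans marking which genes are truly causal.
    """
    n_rep = len(pvalues)
    hits = [0] * len(causal)
    for replicate in pvalues:
        for g, p in enumerate(replicate):
            if p < alpha:
                hits[g] += 1
    # Power: fraction of replicates in which each causal gene is detected.
    power = {g: hits[g] / n_rep for g, c in enumerate(causal) if c}
    # False-positive rate: detection fraction averaged over noncausal genes.
    null_rates = [hits[g] / n_rep for g, c in enumerate(causal) if not c]
    fpr = sum(null_rates) / len(null_rates)
    return power, fpr


# Made-up example: 2 replicates, 3 genes, only the first gene is causal.
toy_pvalues = [[0.001, 0.50, 0.03],
               [0.200, 0.04, 0.60]]
toy_causal = [True, False, False]
power_05, fpr_05 = detection_rates(toy_pvalues, toy_causal, alpha=0.05)
power_01, fpr_01 = detection_rates(toy_pvalues, toy_causal, alpha=0.01)
```

In practice `pvalues` would hold 200 rows of 3,205 gene-level p-values each, one row per simulated replicate.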
IBD analysis
Comparing family-based approaches with the population-based association analysis
For comparison with the family-based analyses, we also performed a standard stratified analysis on the 200 simulated population case-control data sets. We assessed the power of population-based studies for all causal genes based on the output of the PLINK computer program. FLT1, which has three common variants and large genetic effects, was the only gene with a detection power greater than 50%. For most other causal genes with rare variants, family-based studies had better detection power than population-based studies. The Q-Q plot in Figure 3a shows that the standard stratified analyses also had some inflation of false positives. The estimated inflation factors ranged from 1.02 to 1.37, and the actual false-positive rates at α = 0.05 and α = 0.01 were 0.089 and 0.029, respectively. Although all three types of analyses had comparable false-positive rates, the receiver operating characteristic (ROC) curves (Figure 1) show that the two family-based analyses clearly achieved higher sensitivity than the standard population-based association study.
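An inflation factor of the kind quoted above is commonly estimated as the median of the observed 1-degree-of-freedom chi-squared statistics divided by the median of the null distribution (about 0.4549). The sketch below follows that common definition; whether the reported values were computed in exactly this way is an assumption, and the function name `inflation_factor` is hypothetical.

```python
from statistics import NormalDist, median

# Median of the chi-squared distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364


def inflation_factor(pvalues):
    """Genomic inflation factor from two-sided p-values in (0, 1]."""
    std_normal = NormalDist()
    # A 1-df chi-squared statistic is a squared standard normal quantile.
    chi2_stats = [std_normal.inv_cdf(1.0 - p / 2.0) ** 2 for p in pvalues]
    return median(chi2_stats) / CHI2_1DF_MEDIAN


# Evenly spaced p-values mimic a well-calibrated test.
n_tests = 1001
uniform_p = [(i + 0.5) / n_tests for i in range(n_tests)]
lam_null = inflation_factor(uniform_p)

# Squaring the p-values makes them too small, mimicking inflation.
lam_inflated = inflation_factor([p * p for p in uniform_p])
```

A value close to 1 indicates little inflation, and values above 1 indicate an excess of small p-values.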
Conclusions
In this work, our analyses provide new insights into the genetic studies of family data with large extended pedigrees for dichotomous traits. We tested the utility of two types of family-based approaches: a logistic regression fitted using GEE and our proposed IBD test. The logistic regression with GEE was used to test the associations in large pedigrees while simultaneously controlling for environmental covariates. To properly handle the rare variants, we provide an operable and straightforward scheme to collapse the rare variants within a gene. This simple collapsing strategy was shown to be useful. As an example, we had a reasonable power for identifying causal gene SIRT1, which is enriched with rare causal variants. We also showed that the linkage of disease alleles can be tested based on IBD scores in extended pedigrees. By incorporating information from not only parents and siblings but also other ancestors, we developed a chi-square test based on a 2 × 3 contingency table. Our IBD test provides an attractive alternative to the conventional tests because it is computationally fast and does not require one to specify an inheritance model explicitly.
We compared the performance of family-based studies using these two approaches with the performance of population-based studies using the standard stratified analysis. Population-based studies seemed to have better power for detecting common variants, and family-based studies seemed to have better power for detecting rare variants. When a risk allele was present in early founders and its effect was relatively large, the IBD analysis clearly outperformed the family-based association analysis. We noticed that, even for rare variants that have extremely low frequencies or that are found in only a few families, the IBD analysis has a good chance of picking them up. For example, in this work, the power of detecting risk genes VEGFC and HIF3A using the IBD test was significantly higher than the power obtained using logistic regression. In other instances, family-based association analyses had better power to detect genes that have relatively high frequencies of risk alleles and relatively large genetic effects. In the worst scenario, if the risk allele was extremely rare, not present in early founders, and of small genetic effect, both methods failed. Because the two family-based approaches have their own advantages for dealing with rare variants, they can potentially complement each other. Combining these two types of analysis may be a more powerful solution in the search for causal variants.
Although hunting for rare causal variants using family data sets seems promising, many practical issues need to be addressed before the effectiveness of family-based analyses can be fully recognized. For example, the IBD test can produce unstable results, and it is not straightforward to obtain the gene-level IBD scores in the first place. Also, we noticed that all analyses showed some inflation of false positives. We suspect that this inflation may be caused by nonfunctional variants whose genotypes are highly correlated with those of the functional variants. Luedtke et al. [9] offer a detailed discussion of these so-called spuriously associated genes. Other possible sources of inflated false positives in practical studies include insufficient correction for population stratification, inappropriate handling of rare variants, and the effects of linkage disequilibrium structures. We will further investigate the influence of these possible sources in future studies.
Declarations
Acknowledgments
We thank the organizers of Genetic Analysis Workshop 17 for providing the exome data set. The Genetic Analysis Workshop is supported by National Institutes of Health grant R01 GM031575. We also thank all members of the Human Genetics Groups at the Genome Institute of Singapore for useful comments and input.
This article has been published as part of BMC Proceedings Volume 5 Supplement 9, 2011: Genetic Analysis Workshop 17. The full contents of the supplement are available online at http://www.biomedcentral.com/1753-6561/5?issue=S9.
References
- Weeks DE, Lange K: The affected-pedigree-member method of linkage analysis. Am J Hum Genet. 1988, 42: 315-326.
- Risch N: Linkage strategies for genetically complex traits. II. The power of affected relative pairs. Am J Hum Genet. 1990, 46: 229-241.
- Almasy LA, Dyer TD, Peralta JM, Kent JW, Charlesworth JC, Curran JE, Blangero J: Genetic Analysis Workshop 17 mini-exome simulation. BMC Proc. 2011, 5 (suppl 9): S2. 10.1186/1753-6561-5-S9-S2.
- Liang KY, Zeger SL: Longitudinal data analysis using generalized linear models. Biometrika. 1986, 73: 13-22. 10.1093/biomet/73.1.13.
- Zeger SL, Liang KY: Longitudinal data analysis for discrete and continuous outcomes. Biometrics. 1986, 42: 121-130. 10.2307/2531248.
- Carey VJ: gee: Generalized Estimation Equation solver, version 4.13. Ported to R by Thomas Lumley (versions 3.13 and 4.4) and Brian Ripley (version 4.13). 2007, [http://cran.r-project.org/]
- Asimit J, Zeggini E: Rare variant association analysis methods for complex traits. Annu Rev Genet. 2010, 44: 293-308. 10.1146/annurev-genet-102209-163421.
- Benjamini Y, Hochberg Y: Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc Ser B Meth. 1995, 57: 289-300.
- Luedtke A, Powers S, Petersen A, Sitarik A, Bekmetjev A, Tintle NL: Evaluating methods for the analysis of rare variants in sequence data. BMC Proc. 2011, 5 (suppl 9): S119. 10.1186/1753-6561-5-S9-S119.
Copyright
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.