Volume 5 Supplement 9
Exploration and comparison of methods for combining population- and family-based genetic association using the Genetic Analysis Workshop 17 mini-exome
© Fardo et al; licensee BioMed Central Ltd. 2011
Published: 29 November 2011
We examine the performance of various methods for combining family- and population-based genetic association data. Several approaches have been proposed for situations in which information is collected from both a subset of unrelated subjects and a subset of family members. Analyzing these samples separately is known to be inefficient, and it is important to determine the scenarios for which differing methods perform well. Others have investigated this question; however, no extensive simulations have been conducted, nor have these methods been applied to mini-exome-style data such as that provided by Genetic Analysis Workshop 17. We quantify the empirical power and false-positive rates for three existing methods applied to the Genetic Analysis Workshop 17 mini-exome data and compare relative performance. We use knowledge of the underlying data simulation model to make these assessments.
Study designs for genetic association studies fall into two broad categories: (1) population-based studies that recruit unrelated individuals and (2) family-based studies that collect pedigrees of related individuals. Often, both study designs are used for a particular investigation. For example, when a linkage study has been performed and family data are collected, follow-up analysis can include association using a new unrelated study population. The analytic methods appropriate for each design differ, which makes it difficult to aggregate association metrics across the study designs. Heuristically, population-based metrics attempt to quantify a measure of correlation or association between some function of genotype at a given marker and the disease phenotype, whereas family-based association measures use properties of Mendelian transmissions from parents to offspring and are inherently conditional.
Because analyzing the disparate types of data in isolation most often results in nonoptimal statistical power, investigators have proposed several methods for efficiently combining these data. In the Methods section we briefly summarize the three methods that we apply to the Genetic Analysis Workshop 17 (GAW17) data. Each approach is distinguished by the study designs for which it is appropriate, the assumptions necessary for valid inference, and the handling of population stratification (whether it is formally or informally tested or whether it is taken into account by means of adjustments). Operationally, these methods are distinguishable by computation and implementation considerations and by empirical performance, which we assess in this paper. Other researchers have investigated the question of relative performance [1]; however, no extensive simulations have been conducted for comparison.
An important consideration to keep in mind throughout this investigation is the underlying causal model that was used to generate the GAW17 data [2]. First, rather than reflecting the common disease/common variant hypothesis that the established methods presented address, the data-generating mechanism was consistent with the multiple rare variant, or common disease/rare variant (CDRV), hypothesis, which suggests that susceptibility to common disease is conferred by multiple rare variants with moderate to high penetrance. Intuitively, the current methods are not expected to perform well in identifying rare single-nucleotide polymorphisms (SNPs); in this paper we assess this performance and motivate possible modifications that would be successful when the CDRV hypothesis is true. In addition, the disease was simulated to have approximately 30% prevalence, which violates the often-invoked rare disease assumption.
The first attempts to combine population- and family-based association data were developed by Nagelkerke et al. [3], who used a likelihood framework to combine case-control data with family data by exploiting the likelihood formulation [4] of the transmission disequilibrium test (TDT) [5]. This approach assumes Hardy-Weinberg equilibrium (HWE), random mating, and a multiplicative model of allelic effect. Although no formal test of the appropriateness of combining the two types of data has been developed, we discuss ad hoc procedures.
Epstein et al. [6] generalized this work by relaxing the assumptions of HWE, random mating, and the assumed multiplicative mode of inheritance. In addition, they described a formal test for the appropriateness of combining case-control and case-trio data by comparing genotype relative risk (RR) estimates from between-individual and within-family analyses, respectively. The proposed two-stage procedure facilitates valid model selection in the presence of population stratification. Further extensions of this approach were made by Chen and Lin [7]. Their method uses weighted least squares to aggregate the disparate RRs and requires no assumptions for mating-type distributions.
Epstein et al.’s and Chen and Lin’s methods rely on two strong assumptions: a rare disease and the absence of population stratification. Later work has been targeted at both relaxing the rare disease assumption and adjusting for population stratification. Zhu et al. [8] used a principal components strategy to adjust for population stratification and to aggregate families and case-control samples by means of a linear regression framework. Within-family correlations were empirically estimated from the data and incorporated into the variance of the test statistic. Zhang et al. [9] proposed a similar method in which they defined a score test and used generalized estimating equations [10] to account for familial correlation. Their method can be more easily applied to multivariate outcomes. Other useful approaches, some with a focus on genome-wide association, have been proposed but are not evaluated here [11–21].
Because the approach by Chen and Lin [7] is not immediately generalizable to pedigrees, we extracted nuclear families and then sampled 194 trios from the nuclear families to provide a uniform comparison between the methods. These sampled data (697 unrelated case or control individuals and 582 family members from the 194 trios) are used for our comparisons. We assume an additive mode of inheritance throughout.
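For orientation, the TDT in its simplest form compares how often heterozygous parents transmit versus do not transmit a candidate allele to affected offspring. The following is a minimal sketch of that classic McNemar-type statistic only; it is not the likelihood-based combination of Nagelkerke et al., and the function name and the counts are hypothetical.

```python
import math

def tdt_statistic(b, c):
    """McNemar-type TDT chi-square (1 df) and its p-value.

    b: times heterozygous parents transmitted the candidate allele
    c: times heterozygous parents did not transmit it
    """
    if b + c == 0:
        raise ValueError("no informative (heterozygous) parents")
    chi2 = (b - c) ** 2 / (b + c)
    # Upper tail of a chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2))
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value

# Hypothetical counts, for illustration only
chi2, p_value = tdt_statistic(b=60, c=40)
print(f"TDT chi-square = {chi2:.2f}, p = {p_value:.4f}")
```

Because only transmissions from heterozygous parents enter the statistic, the test is conditional on parental genotypes, which is the source of its robustness to population stratification.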
Chen and Lin’s method
where $W_1$ and $W_2$ are weights derived from linear model theory assuming the parameter estimates follow a multivariate normal distribution (see Chen and Lin [7] for details). Here, the assumptions of a rare disease and no population stratification are necessary for validity. However, the test used to reject the appropriateness of combining the RR estimates is not well powered, as evidenced by our simulations, which often did not yield sufficient evidence to reject the null hypothesis of parameter equivalence even though the simulated disease is not, in fact, rare, and rarity is a necessary condition for such equivalence. This method was designed for case-trio and unrelated control subjects; however, in our analyses control offspring from the control trios are added to the case-control subsample.
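The general idea of aggregating two log-RR estimates and checking whether they are compatible can be sketched as follows. This sketch uses scalar inverse-variance weights and a Wald-type difference test as stand-ins, and it assumes the two estimates are independent; the actual weights in Chen and Lin's method come from their linear model derivation and may differ. All function names and input values are hypothetical.

```python
import math

def combine_log_rr(beta_cc, se_cc, beta_trio, se_trio):
    """Inverse-variance weighted average of two independent log-RR estimates."""
    prec_cc = 1.0 / se_cc ** 2        # precision of the case-control estimate
    prec_trio = 1.0 / se_trio ** 2    # precision of the within-family estimate
    w1 = prec_cc / (prec_cc + prec_trio)
    w2 = 1.0 - w1
    beta = w1 * beta_cc + w2 * beta_trio
    se = math.sqrt(1.0 / (prec_cc + prec_trio))
    return beta, se, w1, w2

def equivalence_test(beta_cc, se_cc, beta_trio, se_trio):
    """Two-sided Wald test of H0: both designs estimate the same log RR."""
    z = (beta_cc - beta_trio) / math.sqrt(se_cc ** 2 + se_trio ** 2)
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value

# Hypothetical estimates, for illustration only
beta, se, w1, w2 = combine_log_rr(0.4, 0.2, 0.2, 0.2)
z, p_equiv = equivalence_test(0.4, 0.2, 0.2, 0.2)
print(f"combined log RR = {beta:.3f} (SE {se:.3f}), equivalence p = {p_equiv:.3f}")
```

With standard errors of this size, a twofold difference in the two log-RR estimates is not rejected, which illustrates how a test of this kind can lack power to flag estimates that should not be pooled.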
Zhu et al.’s method
where $N$ is the number of families, $k_i$ is the number of individuals in the $i$th family, and $N_T$ is the total number of individuals. Within-family correlations are taken into account in the calculation of the variance of $T$ to construct a Wald test. Although this method requires enough markers to estimate principal components, it has the distinct advantage of being robust to population stratification. It can incorporate more complex family structures and does not discard any of the GAW17 data for analysis. Software to apply this approach, FamCC, is available from Zhu et al. [8].
Zhang et al.’s method
Zhang et al.’s [9] method adapts a score test statistic proposed by Lange et al. [23] that applies generalized estimating equations to family-based association tests. To obtain estimates for the score test statistic, the components of the test statistic are decomposed into two mutually exclusive sets: the unrelated individuals and the trios. Traits are treated as constants so that the population genotype mean and variance are estimated for the unrelated individuals and the genotype mean and variance for the offspring are defined through Mendelian transmissions. Similar to Zhu et al.’s method, this framework allows for incorporation of covariates, but unlike the other methods considered, it can easily handle missing parents.
where $g_{im}$ and $g_{if}$ are the mother’s and father’s genotypes in the $i$th family, respectively. The score $Z = U + R$ is squared and standardized by its variance to provide a score test. Zhang et al. [9] provide a Java-based program, GAP, for analysis.
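To illustrate how Mendelian transmissions define the offspring genotype mean and variance, the sketch below computes a family-based score contribution for complete trios under additive (0/1/2) coding. Given parental genotypes, the expected offspring genotype is the average of the two parental genotypes, and each heterozygous parent contributes a variance of 1/4. This is a generic sketch in the spirit of family-based score tests, not the GAP implementation; the function names, the trait offset, and the example trios are assumptions.

```python
import math

def trio_score(trios, offset=0.0):
    """Score and variance from trios, conditioning on parental genotypes.

    trios: iterable of (trait, g_child, g_mother, g_father), genotypes coded 0/1/2
    offset: constant subtracted from the trait (an assumed tuning choice)
    """
    score, variance = 0.0, 0.0
    for trait, g_child, g_mother, g_father in trios:
        t = trait - offset
        expected = (g_mother + g_father) / 2.0     # Mendelian expectation
        n_het = (g_mother == 1) + (g_father == 1)  # informative parents
        score += t * (g_child - expected)
        variance += t * t * n_het / 4.0            # 1/4 per heterozygous parent
    return score, variance

def score_test(score, variance):
    """Squared score standardized by its variance; chi-square with 1 df."""
    if variance <= 0:
        raise ValueError("no informative families")
    chi2 = score ** 2 / variance
    return chi2, math.erfc(math.sqrt(chi2 / 2.0))

# Hypothetical affected-offspring trios (trait = 1), for illustration only
trios = [(1, 2, 1, 1), (1, 1, 1, 0), (1, 1, 0, 1), (1, 2, 1, 2)]
score, variance = trio_score(trios)
chi2, p_value = score_test(score, variance)
print(f"score = {score}, variance = {variance}, chi-square = {chi2:.2f}")
```

In a combined analysis, a contribution of this kind would be added to a corresponding score from the unrelated individuals, with the variances summed, before forming the test statistic.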
For each method we tested all 24,487 SNPs from the GAW17 data using the 697 unrelated individuals in the case-control sample and the subsampled 194 trios (582 individuals) in each of the 200 simulation replicates, with affected status as the phenotype. Although an adjustment for multiple testing would be appropriate for this study design, we chose to use a 5% nominal level of significance throughout in order to better compare the methods. These methods readily generalize to other genetic models, but we assumed an additive mode of inheritance throughout.
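The empirical rejection rate for a SNP is the fraction of simulation replicates in which its p-value falls below the nominal level; averaged over noncausal SNPs it estimates the false-positive rate, and for a causal SNP it estimates power. A minimal sketch follows, in which the data layout (a dictionary mapping SNP identifiers to per-replicate p-values), the function names, and the toy values are all assumptions.

```python
def empirical_rejection_rate(p_values, alpha=0.05):
    """Fraction of replicates in which a single SNP is rejected at level alpha."""
    if not p_values:
        raise ValueError("no replicates supplied")
    return sum(p < alpha for p in p_values) / len(p_values)

def average_rejection_rate(p_by_snp, snps, alpha=0.05):
    """Mean of per-SNP empirical rejection rates over a chosen set of SNPs."""
    rates = [empirical_rejection_rate(p_by_snp[s], alpha) for s in snps]
    return sum(rates) / len(rates)

# Toy p-values for two hypothetical SNPs across four replicates
p_by_snp = {
    "snp_a": [0.01, 0.20, 0.04, 0.60],
    "snp_b": [0.50, 0.03, 0.70, 0.90],
}
print(empirical_rejection_rate(p_by_snp["snp_a"]))
print(average_rejection_rate(p_by_snp, ["snp_a", "snp_b"]))
```

In the setting described above, each list would hold 200 p-values (one per replicate), and the set of SNPs passed to the averaging function would be chosen using knowledge of the simulation model.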
[Table: Average empirical rejection rates, including rates with SNPs from spurious genes removed. Methods compared: Chen and Lin; Zhang et al.; Zhu et al.]
SNP discovery power
[Table: Empirical rejection rates for top causal SNPs. Methods compared: Chen and Lin; Zhang et al.; Zhu et al.]
Discussion and conclusions
Several methods address the problem of combining population- and family-based genetic association data. These methods differ fundamentally in whether they incorporate within-family transmissions and rely on tests for population stratification to justify effect estimate aggregation or perform between-individual analyses using family data. Performance related to population stratification cannot be assessed here because no stratification was simulated in the GAW17 data.
Although the Zhang et al. [9] method performed better than the other two methods considered, we did see that no method was well powered to detect causal SNPs in this scenario. Both the Zhang et al. [9] and the Zhu et al. [8] methods allow for more general pedigree structures than the trios-only analysis performed here and will likely perform more favorably when larger pedigrees are considered. In future work, we plan to adapt aggregation methods suitable for the CDRV hypothesis.
We thank the two anonymous reviewers for providing suggestions that improved this manuscript. We also thank Shelley Bull for helpful discussions and Mike Epstein, Tao Feng, Lei Zhang, and Xiaofeng Zhu for coding and software. DWF was supported by National Institutes of Health (NIH) National Center for Research Resources (NCRR) grant P20RR020145, and DWF and JL were supported by NIH NCRR grant 5P20RR016481-10.
This article has been published as part of BMC Proceedings Volume 5 Supplement 9, 2011: Genetic Analysis Workshop 17. The full contents of the supplement are available online at http://www.biomedcentral.com/1753-6561/5?issue=S9.
1. Infante-Rivard C, Mirea L, Bull SB: Combining case-control and case-trio data from the same population in genetic association analyses: overview of approaches and illustration with a candidate gene study. Am J Epidemiol. 2009, 170: 657-664. 10.1093/aje/kwp180.
2. Almasy LA, Dyer TD, Peralta JM, Kent JW, Charlesworth JC, Curran JE, Blangero J: Genetic Analysis Workshop 17 mini-exome simulation. BMC Proc. 2011, 5 (suppl 9): S2. 10.1186/1753-6561-5-S9-S2.
3. Nagelkerke NJD, Hoebee B, Teunis P, Kimman TG: Combining the transmission disequilibrium test and case-control methodology using generalized logistic regression. Eur J Hum Genet. 2004, 12: 964-970. 10.1038/sj.ejhg.5201255.
4. Abel L, Müller-Myhsok B: Maximum-likelihood expression of the transmission/disequilibrium test and power considerations. Am J Hum Genet. 1998, 63: 664-667. 10.1086/301975.
5. Spielman RS, McGinnis RE, Ewens WJ: Transmission test for linkage disequilibrium: the insulin gene region and insulin-dependent diabetes mellitus (IDDM). Am J Hum Genet. 1993, 52: 506-516.
6. Epstein MP, Veal CD, Trembath RC, Barker JNWN, Li C, Satten GA: Genetic association analysis using data from triads and unrelated subjects. Am J Hum Genet. 2005, 76: 592-608. 10.1086/429225.
7. Chen YH, Lin HW: Simple association analysis combining data from trios/sibships and unrelated controls. Genet Epidemiol. 2008, 32: 520-527. 10.1002/gepi.20325.
8. Zhu X, Li S, Cooper RS, Elston RC: A unified association analysis approach for family and unrelated samples correcting for stratification. Am J Hum Genet. 2008, 82: 352-365. 10.1016/j.ajhg.2007.10.009.
9. Zhang L, Pei YF, Li J, Papasian CJ, Deng HW: Univariate/multivariate genome-wide association scans using data from families and unrelated samples. PLoS One. 2009, 4: e6502. 10.1371/journal.pone.0006502.
10. Liang K, Zeger S: Longitudinal data analysis using generalized linear models. Biometrika. 1986, 73: 13-22. 10.1093/biomet/73.1.13.
11. Weinberg CR, Wilcox AJ, Lie RT: A log-linear approach to case-parent-triad data: assessing effects of disease genes that act either directly or through maternal effects and that may be subject to parental imprinting. Am J Hum Genet. 1998, 62: 969-978. 10.1086/301802.
12. Weinberg CR, Umbach DM: A hybrid design for studying genetic influences on risk of diseases with onset early in life. Am J Hum Genet. 2005, 77: 627-636. 10.1086/496900.
13. Kazeem GR, Farrall M: Integrating case-control and TDT studies. Ann Hum Genet. 2005, 69 (pt 3): 329-335.
14. Joo J, Tian X, Zheng G, Stylianou M, Lin JP, Geller NL: Joint analysis of case-parents trio and unrelated case-control designs in large scale association studies. BMC Proc. 2007, 1 (suppl 1): S28. 10.1186/1753-6561-1-s1-s28.
15. Dudbridge F: Likelihood-based association analysis for nuclear families and unrelated subjects with missing genotype data. Hum Hered. 2008, 66: 87-98. 10.1159/000119108.
16. Pfeiffer RM, Pee D, Landi MT: On combining family and case-control studies. Genet Epidemiol. 2008, 32: 638-646. 10.1002/gepi.20338.
17. Hsu L, Starr JR, Zheng Y, Schwartz SM: On combining triads and unrelated subjects data in candidate gene studies: an application to data on testicular cancer. Hum Hered. 2009, 67: 88-103. 10.1159/000179557.
18. Vermeulen SH, Shi M, Weinberg CR, Umbach DM: A hybrid design: case-parent triads supplemented by control-mother dyads. Genet Epidemiol. 2009, 33: 136-144. 10.1002/gepi.20365.
19. Guo CY, Lunetta KL, DeStefano AL, Cupples LA: Combined haplotype relative risk (CHRR): a general and simple genetic association test that combines trios and unrelated case-controls. Genet Epidemiol. 2009, 33: 54-62. 10.1002/gepi.20356.
20. Zheng Y, Heagerty PJ, Hsu L, Newcomb PA: On combining family-based and population-based case-control data in association studies. Biometrics. 2010, 66: 1024-1033. 10.1111/j.1541-0420.2010.01393.x.
21. Lasky-Su J, Won S, Mick E, Anney RJL, Franke B, Neale B, Biederman J, Smalley SL, Loo SK, Todorov A, et al.: On genome-wide association studies for family-based designs: an integrative analysis approach combining ascertained family samples with unselected controls. Am J Hum Genet. 2010, 86: 573-580. 10.1016/j.ajhg.2010.02.019.
22. Schaid DJ, Sommer SS: Genotype relative risks: methods for design and analysis of candidate-gene association studies. Am J Hum Genet. 1993, 53: 1114-1126.
23. Lange C, Silverman EK, Xu X, Weiss ST, Laird NM: A multivariate family-based association test using generalized estimating equations: FBAT-GEE. Biostatistics. 2003, 4: 195-206. 10.1093/biostatistics/4.2.195.
24. Zhang L, Li J, Pei YF, Liu Y, Deng HW: Tests of association for quantitative traits in nuclear families using principal components to correct for population stratification. Ann Hum Genet. 2009, 73 (pt 6): 601-613.
25. Luedtke A, Powers S, Petersen A, Sitarik A, Bekmetjev A, Tintle N: Evaluating methods for the analysis of rare variants in sequence data. BMC Proc. 2011, 5 (suppl 9): S119. 10.1186/1753-6561-5-S9-S119.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.