Comparing strategies for evaluation of candidate genes in case-control studies using family data

Tian, Xin; Joo, Jungnam; Wu, Colin O; Lin, Jing-Ping

doi:10.1186/1753-6561-1-S1-S31

Volume 1 Supplement 1

Genetic Analysis Workshop 15: Gene Expression Analysis and Approaches to Detecting Multiple Functional Loci

Proceedings
Open access
Published: 18 December 2007

BMC Proceedings volume 1, Article number: S31 (2007)

Abstract

The goal of this analysis is to compare different test strategies for genetic association in case-control studies using related individuals. The first test is the trend test that is corrected for related individuals on the basis of identity-by-descent information. The second approach is to use generalized estimating equations to adjust for the correlation between relatives, and the third is the multiple outputation method. We compare the power of these test strategies in a simulation study, and apply these methods to a candidate gene dataset of Genetic Analysis Workshop 15 from the North American Rheumatoid Arthritis Consortium.

Background

The case-control design is a widely used and powerful approach for genetic association studies [1, 2]. Genotype frequencies are compared between case and control samples to identify candidate genes or nearby markers that are associated with susceptibility to a disease. Although association studies may be subject to population stratification, it has been recognized that this effect is small in well designed studies that sample controls and cases from a homogeneous population, or that match controls to cases on major confounding variables such as age, gender, and race-ethnicity [1]. Recently, there has been increasing interest in statistical methods that evaluate association between genetic markers and disease status using family-based data [2, 3]. Such methods allow data available from linkage studies or multicase families to be used efficiently to test for association. Unlike traditional case-control studies in which all individuals are unrelated, cases from the same family are often correlated because these individuals share genetic and environmental conditions. Consequently, the frequency of risk alleles at a marker locus is usually increased among related cases relative to unrelated cases. Using related cases sampled from families or ascertained from family linkage studies together with unrelated controls may increase the false-positive rate (type I error) of an association test, compared to the traditional case-control design based on independent samples. Ignoring the dependence among related individuals may lead to incorrect or spurious results. Hence, any test of genetic association must account for correlation among family members.

Different methods may be used to evaluate genetic associations of candidate genes in case-control studies when some individuals (cases or controls) are related. We briefly sketch three of these methods: the Cochran-Armitage trend test corrected for identity-by-descent (IBD) information, the generalized estimating equations (GEE) method, and the multiple outputation (MO) method. Little is known about their relative efficiency and performance. We compare their power in a simulation study and apply these methods to the candidate gene data of Genetic Analysis Workshop 15 (GAW15) from the North American Rheumatoid Arthritis Consortium (NARAC), which contains affected sibs with rheumatoid arthritis and unrelated controls.

Methods

Cochran-Armitage trend test accounting for related individuals

Consider data for a case-control study of genetic association as in Table 1. Assume a marker of a candidate gene with two alleles, N and M, where N is a normal allele and M is a risk allele or is in linkage disequilibrium with a risk allele. Denote the genotypes as g₀ = NN, g₁ = NM, and g₂ = MM. Let the genotype frequencies for cases and controls be p_j and q_j, j = 0, 1, 2, respectively. Hence, the null hypothesis of no association is p_j = q_j for each j.

Table 1 The data in a case-control study


Given the data in Table 1, the Cochran-Armitage trend test for association [4] between a disease and a marker can be written as Z_x = U(x)/$\hat{σ}$, where U(x) = $n^{-1} \sum_{j=0}^{2} x_{j} (S r_{j} - R s_{j})$, and x = (x₀, x₁, x₂)^T is a set of increasing scores (weights) assigned a priori to the three genotypes (g₀, g₁, g₂) based on the underlying genetic model. Here r_j and s_j denote the numbers of cases and controls with genotype g_j, n_j = r_j + s_j, R and S are the total numbers of cases and controls, and n = R + S. Under the null hypothesis, $\mathrm{var}[U(x)] = n^{-1} R S [\sum_{j=0}^{2} x_{j}^{2} p_{j} - (\sum_{j=0}^{2} x_{j} p_{j})^{2}]$, which can be estimated by $\hat{σ}^{2} = n^{-3} R S [n \sum_{j=0}^{2} x_{j}^{2} n_{j} - (\sum_{j=0}^{2} x_{j} n_{j})^{2}]$; Z_x asymptotically follows a standard normal distribution N(0, 1).
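The trend statistic for unrelated subjects can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' code; it assumes that r_j and s_j are the case and control counts for genotype g_j and that n_j = r_j + s_j, and the function name `trend_test` is hypothetical.

```python
from math import sqrt, erfc

def trend_test(cases, controls, scores=(0.0, 1.0, 2.0)):
    """Cochran-Armitage trend statistic Z_x for unrelated cases and controls.

    cases, controls: genotype counts (r0, r1, r2) and (s0, s1, s2).
    scores: increasing scores (x0, x1, x2) for genotypes (g0, g1, g2).
    Returns (Z_x, two-sided p-value from the N(0, 1) approximation).
    """
    R, S = sum(cases), sum(controls)
    n = R + S
    totals = [r + s for r, s in zip(cases, controls)]  # n_j = r_j + s_j
    # U(x) = n^{-1} * sum_j x_j (S r_j - R s_j)
    u = sum(x * (S * r - R * s) for x, r, s in zip(scores, cases, controls)) / n
    # sigma_hat^2 = n^{-3} R S [ n sum_j x_j^2 n_j - (sum_j x_j n_j)^2 ]
    sum_x2 = sum(x * x * nj for x, nj in zip(scores, totals))
    sum_x = sum(x * nj for x, nj in zip(scores, totals))
    var_hat = R * S * (n * sum_x2 - sum_x ** 2) / n ** 3
    z = u / sqrt(var_hat)
    p_two_sided = erfc(abs(z) / sqrt(2.0))
    return z, p_two_sided

# Illustrative counts only (not data from the paper).
z, p = trend_test(cases=(30, 50, 20), controls=(50, 40, 10))
```

This version treats every subject as independent, so it corresponds to the uncorrected trend test; it does not include the correction for related cases.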

However, because cases and controls within the same family may be biologically related, Slager and Schaid [3] proposed the following method for estimating the variance to account for correlations among related cases or controls. Let u_i = (u_i0, u_i1, u_i2)^T be the genotype indicator vector for the i-th case, where u_ij = 1 if the i-th case has genotype g_j and u_ij = 0 otherwise, i = 1,...,R. Similarly, we use v_k, k = 1,...,S, for controls. Then $r = (r_{0}, r_{1}, r_{2})^{T} = \sum_{i=1}^{R} u_{i}$, and $s = (s_{0}, s_{1}, s_{2})^{T} = \sum_{k=1}^{S} v_{k}$. Let φ = R/n. Then the above test statistic is U(x) = x^T[(1 - φ)r - φs], and var[U(x)] = x^T{var[(1 - φ)r - φs]}x = $x^{T} \{ (1 - φ)^{2} \mathrm{var}(\sum_{i} u_{i}) + φ^{2} \mathrm{var}(\sum_{k} v_{k}) - 2 φ (1 - φ) \mathrm{cov}(\sum_{i} u_{i}, \sum_{k} v_{k}) \} x$.

Here the variance and covariance terms can be calculated based on the multinomial distributions and IBD-sharing probabilities for pairs of related individuals [3].
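To illustrate how IBD sharing enters these covariance terms, the sketch below computes the joint genotype distribution of a pair of relatives and the resulting covariance of their genotype scores. It is not the estimator of [3]: it assumes Hardy-Weinberg equilibrium, a known M-allele frequency `p`, and known IBD probabilities, and all function names are hypothetical.

```python
from itertools import product

def genotype_dist(p):
    """HWE probabilities of carrying 0, 1, or 2 copies of allele M."""
    q = 1.0 - p
    return [q * q, 2.0 * p * q, p * p]

def pair_joint(p, ibd):
    """Joint distribution J[a][b] of M-allele counts for two relatives.

    ibd: (P(IBD=0), P(IBD=1), P(IBD=2)) for the pair.
    """
    freq = {0: 1.0 - p, 1: p}  # allele 0 = N, allele 1 = M
    g = genotype_dist(p)
    joint = [[0.0] * 3 for _ in range(3)]
    for a, b in product(range(3), repeat=2):
        joint[a][b] += ibd[0] * g[a] * g[b]   # no sharing: independent genotypes
        if a == b:
            joint[a][b] += ibd[2] * g[a]      # both alleles shared: identical genotypes
    # One allele shared: a common allele s plus one independent allele each.
    for s, t1, t2 in product((0, 1), repeat=3):
        joint[s + t1][s + t2] += ibd[1] * freq[s] * freq[t1] * freq[t2]
    return joint

def score_covariance(p, ibd, scores=(0.0, 1.0, 2.0)):
    """Covariance of the genotype scores x^T u_i and x^T u_k for the pair."""
    g = genotype_dist(p)
    joint = pair_joint(p, ibd)
    mean = sum(x * gj for x, gj in zip(scores, g))
    cross = sum(scores[a] * scores[b] * joint[a][b]
                for a in range(3) for b in range(3))
    return cross - mean * mean

# Full sibs with the prior IBD distribution (1/4, 1/2, 1/4).
cov_sibs = score_covariance(0.3, (0.25, 0.5, 0.25))
```

For unrelated individuals, with IBD probabilities (1, 0, 0), the covariance is zero and the variance reduces to the multinomial form used by the uncorrected trend test.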

Generalized estimating equations (GEE) method

The GEE method developed by Liang and Zeger [5] for the analysis of longitudinal data can be applied to case-control data in genetic studies. Let $y_{i} = (y_{i1}, ..., y_{i, n_{i}})^{T}$ be the response vector for the n_i related subjects in family i, i = 1,...,m, where m is the total number of families. For a binary trait, y_ij = 1 for cases and 0 for controls. The logistic regression model can be considered for the case-control data in Table 1: log[E(y_ij)/(1 - E(y_ij))] = β₀ + β₁x_ij + $β_{2}^{T}$w_ij, where x_ij = x₀, x₁, or x₂ is the score assigned to the genotype as above, and w_ij denotes other covariates. The test of genetic association is equivalent to the test of β₁ = 0. Because related family members are correlated, conventional methods that assume independence are not valid. The estimate and standard error for β = (β₀, β₁, $β_{2}^{T}$)^T based on the GEE procedure take into account the within-family correlation, where β is estimated by solving the equations $\sum_{i=1}^{m} (\frac{\partial μ_{i}}{\partial β})^{T} V_{i}^{-1} (y_{i} - μ_{i}) = 0$, with μ_i = E(y_i; β) and V_i = V_i(y_i; β, θ) denoting the "working" covariance matrix of y_i. The estimate of β is asymptotically normally distributed and its variance is given by $Σ = Σ_{1}^{-1} Σ_{2} Σ_{1}^{-1}$, where $Σ_{1} = \sum_{i=1}^{m} (\frac{\partial μ_{i}}{\partial β})^{T} V_{i}^{-1} (\frac{\partial μ_{i}}{\partial β})$, and $Σ_{2} = \sum_{i=1}^{m} (\frac{\partial μ_{i}}{\partial β})^{T} V_{i}^{-1} (y_{i} - μ_{i}) (y_{i} - μ_{i})^{T} V_{i}^{-1} (\frac{\partial μ_{i}}{\partial β})$. There are a number of choices for V_i, and it has been shown that the GEE estimates are valid and consistent even if the working covariance matrix is misspecified. For family or affected sib-pair data, a simple and reasonable choice is the exchangeable correlation matrix with a common correlation θ for each pair of relatives [6].
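The sandwich variance above can be illustrated with a small pure-Python sketch. To keep it short it makes two simplifying assumptions that differ from the analysis described here: it uses a working-independence correlation rather than the exchangeable structure, and it includes only the intercept and the genotype score (no covariates w_ij). The function name `gee_independence` is hypothetical.

```python
from math import exp, sqrt

def _sums(families, b0, b1):
    """Accumulate the score vector, Sigma_1, and Sigma_2 at (b0, b1)."""
    g0 = g1 = a00 = a01 = a11 = m00 = m01 = m11 = 0.0
    for fam in families:
        s0 = s1 = 0.0                      # family-level score contribution
        for y, x in fam:
            mu = 1.0 / (1.0 + exp(-(b0 + b1 * x)))
            w = mu * (1.0 - mu)
            s0 += y - mu
            s1 += (y - mu) * x
            a00 += w
            a01 += w * x
            a11 += w * x * x
        g0 += s0
        g1 += s1
        m00 += s0 * s0                     # Sigma_2 sums outer products per family
        m01 += s0 * s1
        m11 += s1 * s1
    return (g0, g1), (a00, a01, a11), (m00, m01, m11)

def gee_independence(families, max_iter=50, tol=1e-10):
    """Logistic GEE with working independence; families is a list of
    clusters, each a list of (y, x) with y in {0, 1} and x the genotype score.
    Returns (beta0, beta1, robust standard error of beta1)."""
    b0 = b1 = 0.0
    for _ in range(max_iter):              # Newton-Raphson on the estimating equations
        (g0, g1), (a00, a01, a11), _ = _sums(families, b0, b1)
        det = a00 * a11 - a01 * a01
        d0 = (a11 * g0 - a01 * g1) / det
        d1 = (a00 * g1 - a01 * g0) / det
        b0, b1 = b0 + d0, b1 + d1
        if abs(d0) + abs(d1) < tol:
            break
    _, (a00, a01, a11), (m00, m01, m11) = _sums(families, b0, b1)
    det = a00 * a11 - a01 * a01
    i01, i11 = -a01 / det, a00 / det       # second row of Sigma_1^{-1}
    # (Sigma_1^{-1} Sigma_2 Sigma_1^{-1})[1][1]
    v11 = i01 * i01 * m00 + 2.0 * i01 * i11 * m01 + i11 * i11 * m11
    return b0, b1, sqrt(v11)
```

The Wald statistic for association is then beta1 divided by its robust standard error. Because Σ₂ sums residual products within each family before squaring, duplicating every subject within a family leaves the robust standard error unchanged, whereas a naive standard error would shrink.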

Multiple outputation (MO) method

The MO method proposed by Hoffman et al. [7] and Follmann et al. [8] provides inferences for clustered, correlated data by averaging analyses of independent data. For independent case-control data in genetic studies, several methods can provide a normally distributed statistic, $\hat{β}$, for the genetic association and an estimate of its variance, $\hat{σ}^{2}$. For example, the trend test statistic Z_x above is a sensible choice, which estimates the weighted differences of the genotype frequencies. For case-control data sampled from families, a new sample of independent individuals can be obtained by randomly selecting one individual from each family, and then $\hat{β}$ and $\hat{σ}^{2}$ can be computed from this new sample. After this resampling is repeated many times, the MO estimate of association is the average of the $\hat{β}$ values, and its variance is estimated by the average of the $\hat{σ}^{2}$ values minus the sample variance of the $\hat{β}$ values. The MO estimate has been shown to be asymptotically normally distributed.
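The resampling scheme can be sketched as follows. This is a minimal illustration under stated assumptions: each unrelated control is treated as a family of size one, the per-sample analysis is a simple difference in mean genotype score (a stand-in for the trend statistic), and the names `multiple_outputation` and `mean_score_difference` are hypothetical.

```python
import random
from statistics import mean, variance

def mean_score_difference(sample):
    """Example analysis for independent records (y, x): difference in mean
    genotype score between cases (y = 1) and controls (y = 0), with its
    estimated variance."""
    cases = [x for y, x in sample if y == 1]
    controls = [x for y, x in sample if y == 0]
    beta = mean(cases) - mean(controls)
    var = variance(cases) / len(cases) + variance(controls) / len(controls)
    return beta, var

def multiple_outputation(families, analyze, n_rep=1000, seed=0):
    """families: list of clusters, each a list of records.
    analyze: maps a list of independent records to (beta_hat, var_hat).
    Returns the MO estimate and its estimated variance."""
    rng = random.Random(seed)
    betas, variances = [], []
    for _ in range(n_rep):
        # Keep one randomly chosen member per family ("outputation").
        sample = [rng.choice(fam) for fam in families]
        beta, var = analyze(sample)
        betas.append(beta)
        variances.append(var)
    beta_mo = mean(betas)
    # Average within-sample variance minus between-sample variance of beta_hat.
    # With few replicates this difference can be negative.
    var_mo = mean(variances) - variance(betas)
    return beta_mo, var_mo
```

A test statistic can then be formed as `beta_mo / var_mo ** 0.5` whenever `var_mo` is positive; increasing `n_rep` reduces the Monte Carlo noise in both quantities.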

Results

A simulation study

To compare the performance of the three methods, we conducted a small simulation by generating case-control data sets and computing the empirical power for all the tests under three genetic models: recessive, additive, and dominant. The simulations were similar to those performed by Tian et al. [9] with 10,000 replications. We assume that the disease prevalence, K, is 0.1, the marker allele frequency, p, is 0.3, and Hardy-Weinberg equilibrium holds. To facilitate the calculation, each case-control data set included 200 cases generated as 100 affected sib pairs drawn from 100 different families, and 200 unrelated controls. Let the genotype relative risks RR₁ = f₁/f₀, and RR₂ = f₂/f₀, where f₀, f₁ and f₂ are the penetrances for genotypes g₀, g₁, and g₂. Thus, equivalently, the null hypothesis can be written as RR₁ = RR₂ = 1. The alternative hypothesis can be specified by varying RR₁ and RR₂.
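The penetrances implied by a given prevalence, allele frequency, and pair of relative risks follow from K = f₀P(g₀) + f₁P(g₁) + f₂P(g₂) under Hardy-Weinberg equilibrium. The sketch below works this out together with the marginal genotype frequencies of a single case and a single control; it does not reproduce the joint sampling of affected sib pairs used in the simulation, and the function names are hypothetical.

```python
def penetrances(K, p, rr1, rr2):
    """Penetrances (f0, f1, f2) given prevalence K, risk-allele frequency p,
    and genotype relative risks RR1 = f1/f0 and RR2 = f2/f0, under HWE."""
    q = 1.0 - p
    g = (q * q, 2.0 * p * q, p * p)          # P(g0), P(g1), P(g2)
    f0 = K / (g[0] + rr1 * g[1] + rr2 * g[2])
    return f0, rr1 * f0, rr2 * f0

def case_control_genotype_freqs(K, p, rr1, rr2):
    """Genotype frequencies among cases (p_j) and controls (q_j) by Bayes' rule."""
    q = 1.0 - p
    g = (q * q, 2.0 * p * q, p * p)
    f = penetrances(K, p, rr1, rr2)
    cases = tuple(fj * gj / K for fj, gj in zip(f, g))
    controls = tuple((1.0 - fj) * gj / (1.0 - K) for fj, gj in zip(f, g))
    return cases, controls

# Under the null hypothesis RR1 = RR2 = 1, every penetrance equals K
# and cases and controls share the same genotype frequencies.
null_f = penetrances(K=0.1, p=0.3, rr1=1.0, rr2=1.0)
```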

Table 2 displays the empirical power of the trend test with variance corrected by IBD information (Z_IBD-Tr), the test based on the GEE estimate (Z_GEE), and the test based on the MO estimate (Z_MO). The relative risks RR₁ and RR₂ were chosen so that a particular trend test had about 85% power for each given model. The scores (x₀, x₁, x₂) = (0, 1, 2) for the additive model were used for the three tests in the simulations, assuming the underlying model was unknown. Under the null hypothesis of no association, all three tests have the correct type I error, around 0.05. For all three genetic models, both the GEE and MO tests have relatively good power, ranging from 73% to 84%, compared with the IBD-corrected trend test.

Table 2 The empirical power of the three tests


Application

The GAW15 NARAC candidate gene data consisted of affected sibs with rheumatoid arthritis from multiplex families and unrelated controls. The candidate gene data from the PTPN22 locus [10] had 14 SNPs genotyped on 1269 cases and 1519 unrelated controls. The cases were from 665 families: 123 families had 1 case, 492 families had 2 affected siblings, and 50 families had 3 or more affected siblings. For sib pairs from the same family, their IBD sharing probabilities were calculated using the software MERLIN [11].

Table 3 presents results based on the three testing methods and the trend test without adjusting for correlated cases. The performance of these tests is comparable. The Bonferroni correction was applied to adjust for multiple testing of 14 SNPs, and only the SNPs with an adjusted p-value less than 0.05 in any one of the tests are presented. All three test methods identified the same markers as significantly associated with susceptibility to rheumatoid arthritis. The unadjusted trend test that assumed independent cases overestimated the association and could result in a larger false-positive rate.
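The Bonferroni adjustment used here amounts to multiplying each raw p-value by the number of tests and capping the result at 1. A minimal sketch follows; the p-values are made up for illustration and are not results from Table 3, and the function name is hypothetical.

```python
def bonferroni(p_values, n_tests):
    """Bonferroni-adjusted p-values for n_tests simultaneous tests."""
    return [min(1.0, n_tests * p) for p in p_values]

# Hypothetical raw p-values for three of the 14 SNPs.
raw = [0.0001, 0.004, 0.2]
adjusted = bonferroni(raw, n_tests=14)
significant = [p_adj < 0.05 for p_adj in adjusted]
```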

Table 3 Results for the NARAC candidate gene data


Discussion

We consider three methods that use completely different approaches to account for correlation among family members. The IBD-corrected trend test requires genotype information from parents or other family members to obtain more accurate IBD calculations. Because the variance of the test is corrected for correlation among related cases using the genealogy and marker information, this test is expected to be more powerful than tests using only family pedigree information. The GEE approach estimates the correlation among related cases through a working correlation matrix, and the MO method accounts for the correlation through repeated sampling. In our simulation study, the GEE and MO approaches appear to have similar power. Note that Follmann et al. [8] showed that the GEE estimates under an exchangeable working correlation performed better than MO in some simulations; however, the GEE may have problems converging. They also showed that in certain simple settings MO was slightly more powerful than or competitive with GEE under a working-independence correlation. The relative efficiency of these tests is unknown in general, and a more extensive simulation would be required to explore their behavior. In addition, compared to the IBD-corrected trend test, both GEE and MO are simple and broadly applicable approaches that can also easily adjust for multiple covariates.

Note that these methods for case-control studies are sensitive to population stratification. In genetic association studies, case-control and family-based designs are two fundamentally different approaches. While case-control designs study the contrast of allele or genotype frequencies between cases and controls to identify associations within populations, family-based designs use families to look for susceptibility alleles through transmission within families. Thus, when population stratification is suspected, family-based designs are preferred to case-control designs. For such designs, the well known transmission disequilibrium test (TDT) and its various extensions, such as the family-based association tests (FBATs), are commonly used [2, 12]. They are robust against population substructure. However, trios consisting of an affected child and both parents are needed for the TDT, and these may be difficult to obtain. Other designs such as affected sibs and discordant sib pairs have been shown to be less powerful than case-control studies for both rare and common diseases [2, 3]. Moreover, to test bi-allelic markers like SNPs, family-based tests require a large number of families because they discard all the homozygous (non-informative) parents. For the above GAW15 example, most of the parental genotypes and unaffected siblings are not available in the NARAC candidate gene data. Thus, this data set is not suitable for either the TDT or the FBAT. Therefore, when there is no evidence of major population substructure, the cases collected from families for linkage studies can be reused for association, and additional unrelated controls may be obtained and genotyped to increase the power to confirm the candidate marker.

The test results from the three methods depend on the scores assigned to the genotypes based on the assumption of the underlying genetic models such as recessive, additive, and dominant. In practice, since the genetic model is unknown for most complex diseases, the additive model is usually assumed first, with x = (0, 1, 2) indicating the numbers of risk alleles. Applying a trend test with one set of scores would result in a loss of power if the genetic model is misspecified. Hence, more robust tests can be considered to protect against model uncertainty [9].

Conclusion

In summary, we compare three methods of testing genetic association for case-control studies with cases drawn from families and unrelated controls. Our results indicate that all three methods perform well, and their performance is comparable in the simulation and application to the GAW15 NARAC data. All three methods can be applied to more general situations where the controls or both cases and controls are also correlated.

Abbreviations

FBAT: Family-based association test
GAW: Genetic Analysis Workshop
GEE: Generalized estimating equation
IBD: Identical by descent
MO: Multiple outputation
NARAC: North American Rheumatoid Arthritis Consortium
SNP: Single-nucleotide polymorphism
TDT: Transmission-disequilibrium test

References

1. Risch N: Searching for genetic determinants in the new millennium. Nature. 2000, 405: 847-856. 10.1038/35015718.
2. Laird NM, Lange C: Family-based designs in the age of large-scale gene-association studies. Nat Rev Genet. 2006, 7: 385-394. 10.1038/nrg1839.
3. Slager SL, Schaid DJ: Evaluation of candidate genes in case-control studies: a statistical method to account for related subjects. Am J Hum Genet. 2001, 68: 1457-1462. 10.1086/320608.
4. Sasieni PD: From genotypes to genes: doubling the sample size. Biometrics. 1997, 53: 1253-1261. 10.2307/2533494.
5. Liang KY, Zeger SL: Longitudinal data analysis using generalized linear models. Biometrika. 1986, 73: 13-22. 10.1093/biomet/73.1.13.
6. Liang KY, Pulver AE: Analysis of case-control/family sampling design. Genet Epidemiol. 1996, 13: 253-270. 10.1002/(SICI)1098-2272(1996)13:3<253::AID-GEPI3>3.0.CO;2-7.
7. Hoffman EB, Sen PK, Weinberg CR: Within-cluster resampling. Biometrika. 2001, 88: 1121-1134. 10.1093/biomet/88.4.1121.
8. Follmann D, Proschan M, Leifer E: Multiple outputation: inference for complex clustered data by averaging analyses from independent data. Biometrics. 2003, 59: 420-429. 10.1111/1541-0420.00049.
9. Tian X, Joo J, Zheng G, Lin JP: Robust trend tests for genetic association in case-control studies using family data. BMC Genet. 2005, 6 (Suppl 1): S107. 10.1186/1471-2156-6-S1-S107.
10. Carlton VE, Hu X, Chokkalingam AP, Schrodi SJ, Brandon R, Alexander HC, Chang M, Catanese JJ, Leong DU, Ardlie KG, Kastner DL, Seldin MF, Criswell LA, Gregersen PK, Beasley E, Thomson G, Amos CI, Begovich AB: PTPN22 genetic variation: evidence for multiple variants associated with rheumatoid arthritis. Am J Hum Genet. 2005, 77: 567-581. 10.1086/468189.
11. Abecasis GR, Cherny SS, Cookson WO, Cardon LR: Merlin-rapid analysis of dense genetic maps using sparse gene flow trees. Nat Genet. 2002, 30: 97-101. 10.1038/ng786.
12. Spielman RS, McGinnis RE, Ewens WJ: Transmission test for linkage disequilibrium: the insulin gene region and insulin-dependent diabetes mellitus (IDDM). Am J Hum Genet. 1993, 52: 506-516.

Acknowledgements

This article has been published as part of BMC Proceedings Volume 1 Supplement 1, 2007: Genetic Analysis Workshop 15: Gene Expression Analysis and Approaches to Detecting Multiple Functional Loci. The full contents of the supplement are available online at http://www.biomedcentral.com/1753-6561/1?issue=S1.

Author information

Authors and Affiliations

Office of Biostatistics Research, National Heart, Lung and Blood Institute, 6701 Rockledge Drive, Bethesda, Maryland, 20892, USA
Xin Tian, Jungnam Joo, Colin O Wu & Jing-Ping Lin

Corresponding author

Correspondence to Xin Tian.

Additional information

Competing interests

The authors declare that they have no competing interests.

Rights and permissions

This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

About this article

Cite this article

Tian, X., Joo, J., Wu, C.O. et al. Comparing strategies for evaluation of candidate genes in case-control studies using family data. BMC Proc 1 (Suppl 1), S31 (2007). https://doi.org/10.1186/1753-6561-1-S1-S31
