Comparing strategies for evaluation of candidate genes in case-control studies using family data
BMC Proceedings volume 1, Article number: S31 (2007)
Abstract
The goal of this analysis is to compare different test strategies for genetic association in case-control studies using related individuals. The first test is the trend test that is corrected for related individuals on the basis of identity-by-descent information. The second approach is to use generalized estimating equations to adjust for the correlation between relatives, and the third is the multiple outputation method. We compare the power of these test strategies in a simulation study, and apply these methods to a candidate gene dataset of Genetic Analysis Workshop 15 from the North American Rheumatoid Arthritis Consortium.
Background
The case-control design is a widely used and powerful approach for genetic association studies [1, 2]. Genotype frequencies are compared between case and control samples to identify candidate genes or nearby markers that are associated with susceptibility to a disease. Although association studies may be subject to population stratification, it has been recognized that this effect is small in well-designed studies that sample controls and cases from a homogeneous population, or that match cases and controls on major confounding variables such as age, gender, and race-ethnicity [1]. Recently, there has been increasing interest in statistical methods that evaluate association between genetic markers and disease status using family-based data [2, 3]. This would allow data available from linkage studies or multi-case families to be used efficiently to test for association. Unlike traditional case-control studies in which all individuals are unrelated, cases from the same family are often correlated because these individuals share genetic and environmental conditions. Consequently, the frequency of risk alleles at a marker locus is usually increased among related cases relative to unrelated cases. Using related cases sampled from families or ascertained from family linkage studies, together with unrelated controls, may increase the false-positive rate (type I error) of an association test compared to the traditional case-control design based on independent samples. Ignoring the dependence among related individuals may lead to incorrect or spurious results. Hence, any test of genetic association must account for correlation among family members.
Different methods may be used to evaluate genetic associations of candidate genes in case-control studies when some individuals (cases or controls) are related. We briefly sketch three of these methods: the Cochran-Armitage trend test corrected for identity-by-descent (IBD) information, the generalized estimating equations method, and the multiple outputation method. Little is known about their relative efficiency and performance. We compare their power in a simulation study and apply these methods to the candidate gene data of Genetic Analysis Workshop 15 (GAW15) from the North American Rheumatoid Arthritis Consortium (NARAC), which contains affected sibs with rheumatoid arthritis and unrelated controls.
Methods
Cochran-Armitage trend test accounting for related individuals
Consider data for a case-control study of genetic association as in Table 1. Assume a marker of a candidate gene with two alleles, N and M, where N is a normal allele and M is a risk allele or is in linkage disequilibrium with a risk allele. Denote the genotypes as $g_0 = NN$, $g_1 = NM$, and $g_2 = MM$. Let the genotype frequencies for cases and controls be $p_j$ and $q_j$, $j = 0, 1, 2$, respectively. Hence, the null hypothesis of no association is $p_j = q_j$ for each $j$.
Given the data in Table 1, the Cochran-Armitage trend test for association [4] between a disease and a marker can be written as $Z_x = U(x)/\widehat{\sigma}$, where $U(x) = n^{-1}\sum_{j=0}^{2} x_j (S r_j - R s_j)$, and $x = (x_0, x_1, x_2)^T$ is a set of increasing scores (weights) assigned to the three genotypes $(g_0, g_1, g_2)$ a priori based on the underlying genetic model. Under the null hypothesis, $\mathrm{var}[U(x)] = n^{-1} RS \left[\sum_{j=0}^{2} x_j^2 p_j - \left(\sum_{j=0}^{2} x_j p_j\right)^2\right]$, which can be estimated by $\widehat{\sigma}^2 = n^{-3} RS \left[n \sum_{j=0}^{2} x_j^2 n_j - \left(\sum_{j=0}^{2} x_j n_j\right)^2\right]$; $Z_x$ asymptotically follows a standard normal distribution $N(0, 1)$.
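The trend statistic for independent samples can be computed directly from genotype counts. The following is a minimal Python sketch, assuming the usual layout for Table 1 (not reproduced here): $r_j$ and $s_j$ are the numbers of cases and controls with genotype $g_j$, $R$ and $S$ are the totals, $n_j = r_j + s_j$, and $n = R + S$. The function name `trend_test` and the counts in the usage note are illustrative only.

```python
import math

def trend_test(r, s, x=(0.0, 1.0, 2.0)):
    """Cochran-Armitage trend statistic Z_x for unrelated cases and controls.

    r, s: genotype counts (NN, NM, MM) for cases and controls.
    x:    increasing genotype scores; (0, 1, 2) is the additive choice.
    Returns (Z_x, two-sided p-value).
    """
    R, S = sum(r), sum(s)
    n = R + S
    n_j = [rj + sj for rj, sj in zip(r, s)]  # pooled genotype counts
    # U(x) = n^{-1} * sum_j x_j (S r_j - R s_j)
    u = sum(xj * (S * rj - R * sj) for xj, rj, sj in zip(x, r, s)) / n
    # sigma^2 = n^{-3} R S [ n sum_j x_j^2 n_j - (sum_j x_j n_j)^2 ]
    sum_x2 = sum(xj * xj * nj for xj, nj in zip(x, n_j))
    sum_x = sum(xj * nj for xj, nj in zip(x, n_j))
    var = R * S * (n * sum_x2 - sum_x ** 2) / n ** 3
    z = u / math.sqrt(var)
    # Two-sided normal p-value: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2))
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value
```

For example, `trend_test((30, 50, 20), (50, 40, 10))` uses made-up counts for 100 cases and 100 controls. This version ignores relatedness, so it corresponds to the unadjusted test rather than the IBD-corrected one.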
However, because cases and controls within the same family may be biologically related, Slager and Schaid [3] proposed the following method for estimating the variance to account for correlations among related cases or controls. Let $u_i = (u_{i0}, u_{i1}, u_{i2})^T$ be the genotype indicator vector for the $i$th case, where $u_{ij} = 1$ if the $i$th case has genotype $g_j$ and $u_{ij} = 0$ otherwise, $i = 1, \ldots, R$. Similarly, we use $v_j$ for controls. Then $r = (r_0, r_1, r_2)^T = \sum u_i$ and $s = (s_0, s_1, s_2)^T = \sum v_j$. Let $\phi = R/n$. Then the above test statistic is $U(x) = x^T[(1 - \phi) r - \phi s]$, and $\mathrm{var}[U(x)] = x^T\{\mathrm{var}[(1 - \phi) r - \phi s]\}x = x^T\{(1 - \phi)^2\,\mathrm{var}(\sum u_i) + \phi^2\,\mathrm{var}(\sum v_j) - 2\phi(1 - \phi)\,\mathrm{cov}(\sum u_i, \sum v_j)\}x$.
Here the variance and covariance terms can be calculated based on the multinomial distributions and IBD-sharing probabilities for pairs of related individuals [3].
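As a consistency check, the vector form of $U(x)$ agrees with the count form, and its variance reduces to the independent-sample expression when no individuals are related. The derivation below is a sketch that assumes $n = R + S$ and, for the second display, that all individuals are independent with a common genotype distribution $p = (p_0, p_1, p_2)^T$ under the null hypothesis.

```latex
\begin{align*}
U(x) &= n^{-1}\sum_{j=0}^{2} x_j\,(S r_j - R s_j)
      = \sum_{j=0}^{2} x_j\left[(1-\phi)\,r_j - \phi\,s_j\right]
      = x^{T}\left[(1-\phi)\,r - \phi\,s\right],
      \qquad \phi = \frac{R}{n},\quad 1-\phi = \frac{S}{n}. \\[6pt]
\mathrm{var}[U(x)]
     &= \left[(1-\phi)^2 R + \phi^2 S\right]
        x^{T}\left(\mathrm{diag}(p) - p\,p^{T}\right)x
      = \frac{RS}{n}\left[\sum_{j=0}^{2} x_j^2 p_j
        - \Bigl(\sum_{j=0}^{2} x_j p_j\Bigr)^{2}\right],
\end{align*}
% using var(sum u_i) = R (diag(p) - p p^T), var(sum v_j) = S (diag(p) - p p^T),
% cov(sum u_i, sum v_j) = 0, and (1-phi)^2 R + phi^2 S = RS(R+S)/n^2 = RS/n.
```

With related individuals, the variance terms gain extra pairwise covariances between relatives, which is where the IBD-sharing probabilities enter.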
Generalized estimating equations (GEE) method
The GEE approach developed by Liang and Zeger [5] for the analysis of longitudinal data can be applied to case-control data in genetic studies. Let $y_i = (y_{i1}, \ldots, y_{i,n_i})^T$ be the response vector for the $n_i$ related subjects in family $i$, $i = 1, \ldots, m$, where $m$ is the total number of families. For a binary trait, $y_{ij} = 1$ for cases and 0 for controls. The logistic regression model can be considered for the case-control data in Table 1: $\log[E(y_{ij})/(1 - E(y_{ij}))] = \beta_0 + \beta_1 x_{ij} + \beta_2^T w_{ij}$, where $x_{ij} = x_0$, $x_1$, or $x_2$ is the score assigned to the genotype as above, and $w_{ij}$ denotes other covariates. The test of genetic association is equivalent to the test of $\beta_1 = 0$. Because related family members are correlated, conventional methods that assume independence are incorrect. The estimate and standard error for $\beta = (\beta_0, \beta_1, \beta_2^T)^T$ based on the GEE procedure take into account the within-family correlation, where $\beta$ is estimated by solving the equations $\sum_{i=1}^{m} \left(\frac{\partial \mu_i}{\partial \beta}\right)^T V_i^{-1} (y_i - \mu_i) = 0$, with $\mu_i = E(y_i; \beta)$ and $V_i = V_i(y_i; \beta, \theta)$ denoting the "working" covariance matrix of $y_i$. The estimate of $\beta$ is asymptotically normally distributed and its variance is given by $\Sigma = \Sigma_1^{-1} \Sigma_2 \Sigma_1^{-1}$, where $\Sigma_1 = \sum_{i=1}^{m} \left(\frac{\partial \mu_i}{\partial \beta}\right)^T V_i^{-1} \left(\frac{\partial \mu_i}{\partial \beta}\right)$, and $\Sigma_2 = \sum_{i=1}^{m} \left(\frac{\partial \mu_i}{\partial \beta}\right)^T V_i^{-1} (y_i - \mu_i)(y_i - \mu_i)^T V_i^{-1} \left(\frac{\partial \mu_i}{\partial \beta}\right)$. There are a number of choices for $V_i$, and it has been shown that the GEE estimates are valid and consistent even if the working covariance matrix is misspecified.
For family or affected sib-pair data, a simple and reasonable choice is the exchangeable correlation matrix with a common correlation $\theta$ for each pair of relatives [6].
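To illustrate how the sandwich variance $\Sigma_1^{-1}\Sigma_2\Sigma_1^{-1}$ absorbs within-family correlation, the following Python sketch fits the logistic model with a single genotype score and no other covariates. It deliberately assumes a working-independence $V_i$ rather than the exchangeable structure mentioned above, so that the estimating equations reduce to ordinary logistic score equations and all matrices are 2 x 2. The function names and the data layout (a list of families, each a list of `(y, x)` pairs) are hypothetical.

```python
import math

def _gee_pieces(clusters, b0, b1):
    """Score vector, Sigma_1 and Sigma_2 at (b0, b1) under working independence."""
    g = [0.0, 0.0]        # sum_i D_i^T V_i^{-1} (y_i - mu_i)
    a = [0.0, 0.0, 0.0]   # Sigma_1 entries (00, 01, 11)
    m = [0.0, 0.0, 0.0]   # Sigma_2 entries (00, 01, 11)
    for family in clusters:
        s0 = s1 = 0.0     # this family's contribution to the score
        for y, x in family:
            mu = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            w = mu * (1.0 - mu)
            s0 += y - mu
            s1 += x * (y - mu)
            a[0] += w
            a[1] += w * x
            a[2] += w * x * x
        g[0] += s0
        g[1] += s1
        # Outer product of the family-level score: residuals of relatives
        # are summed before squaring, which is what captures correlation.
        m[0] += s0 * s0
        m[1] += s0 * s1
        m[2] += s1 * s1
    return g, a, m

def gee_independence(clusters, n_iter=50):
    """Return (beta_1, robust SE, Wald Z) for logit P(y=1) = b0 + b1 * x."""
    b0 = b1 = 0.0
    for _ in range(n_iter):  # Newton steps: beta += Sigma_1^{-1} * score
        g, a, _ = _gee_pieces(clusters, b0, b1)
        det = a[0] * a[2] - a[1] ** 2
        b0 += (a[2] * g[0] - a[1] * g[1]) / det
        b1 += (a[0] * g[1] - a[1] * g[0]) / det
    _, a, m = _gee_pieces(clusters, b0, b1)
    det = a[0] * a[2] - a[1] ** 2
    i01, i11 = -a[1] / det, a[0] / det  # second row of Sigma_1^{-1}
    # var(b1) is the (1, 1) element of Sigma_1^{-1} Sigma_2 Sigma_1^{-1}
    var_b1 = i01 * (m[0] * i01 + m[1] * i11) + i11 * (m[1] * i01 + m[2] * i11)
    se = math.sqrt(var_b1)
    return b1, se, b1 / se
```

Unrelated controls enter as one-person families. Under working independence the point estimate is the same as in ordinary logistic regression; only the standard error changes when relatives are grouped. An exchangeable working correlation would additionally require estimating $\theta$ and is not covered by this sketch.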
Multiple outputation (MO) method
The MO method proposed by Hoffman et al. [7] and Follmann et al. [8] provides inferences for clustered correlated data by averaging analyses of independent data. For independent case-control data in genetic studies, several methods can provide a normally distributed statistic, $\widehat{\beta}$, for the genetic association and an estimate of its variance, $\widehat{\sigma}^2$. For example, the trend test statistic $Z_x$ above is a sensible choice, which estimates the weighted differences of the genotype frequencies. For case-control data sampled from families, a new sample can be obtained by randomly selecting one individual from each family, and then $\widehat{\beta}$ and $\widehat{\sigma}^2$ can be computed based on this new sample. After repeating this multiple times, the estimate of association is the average of the $\widehat{\beta}$ values, and an estimate of its variance is given by the average of the $\widehat{\sigma}^2$ values minus the sample variance of the $\widehat{\beta}$ values. The MO estimate has been shown to be asymptotically normally distributed.
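The resampling loop described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the implementation used in the study: records are `(y, x)` pairs with `y` the case indicator and `x` the genotype score, unrelated controls are one-person families, and the per-sample statistic `mean_score_difference` (the difference in mean genotype score between cases and controls) is a hypothetical stand-in for whichever independent-data statistic is preferred.

```python
import random
import statistics

def mean_score_difference(sample):
    """Illustrative statistic for independent data: (beta_hat, var_hat)."""
    cases = [x for y, x in sample if y == 1]
    controls = [x for y, x in sample if y == 0]
    beta = statistics.fmean(cases) - statistics.fmean(controls)
    var = (statistics.variance(cases) / len(cases)
           + statistics.variance(controls) / len(controls))
    return beta, var

def multiple_outputation(families, estimate, n_rep=1000, seed=0):
    """Average an independent-data analysis over random one-per-family samples.

    families: list of families, each a list of records.
    estimate: function mapping a list of records to (beta_hat, var_hat).
    """
    rng = random.Random(seed)
    betas, variances = [], []
    for _ in range(n_rep):
        # "Outputate": keep one randomly chosen member of every family.
        sample = [rng.choice(family) for family in families]
        beta, var = estimate(sample)
        betas.append(beta)
        variances.append(var)
    beta_mo = statistics.fmean(betas)
    # Mean of the variance estimates minus the sample variance of the betas.
    var_mo = statistics.fmean(variances) - statistics.variance(betas)
    return beta_mo, var_mo
```

With a small number of repetitions the variance estimate can come out negative, so `n_rep` should be large in practice. The test statistic is then `beta_mo / var_mo ** 0.5`.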
Results
A simulation study
To compare the performance of the three methods, we conducted a small simulation by generating case-control data sets and computing the empirical power for all the tests under three genetic models: recessive, additive, and dominant. The simulations were similar to those performed by Tian et al. [9] with 10,000 replications. We assume that the disease prevalence, $K$, is 0.1, the marker allele frequency, $p$, is 0.3, and Hardy-Weinberg equilibrium holds. To facilitate the calculation, each case-control data set included 200 cases generated as 100 affected sib pairs drawn from 100 different families, and 200 unrelated controls. Let the genotype relative risks be $RR_1 = f_1/f_0$ and $RR_2 = f_2/f_0$, where $f_0$, $f_1$, and $f_2$ are the penetrances for genotypes $g_0$, $g_1$, and $g_2$. Thus, equivalently, the null hypothesis can be written as $RR_1 = RR_2 = 1$. The alternative hypothesis can be specified by varying $RR_1$ and $RR_2$.
Table 2 displays the empirical power of the trend test with variance corrected by IBD information ($Z_{IBDTr}$), the test based on the GEE estimate ($Z_{GEE}$), and the test based on the MO estimate ($Z_{MO}$). The relative risks $RR_1$ and $RR_2$ were chosen so that a particular trend test had about 85% power for each given model. The additive-model scores $(x_0, x_1, x_2) = (0, 1, 2)$ were used for the three tests in the simulations, assuming the underlying model was unknown. Under the null hypothesis of no association, all three tests have the correct type I error, around 0.05. For all three genetic models, both the GEE and MO tests have relatively good power, ranging from 73% to 84%, compared with the IBD-corrected trend test.
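The simulation parameters determine the penetrances and hence the genotype distributions from which unrelated individuals are drawn. The sketch below is a hedged illustration: it assumes $p$ is the frequency of the risk allele M and that controls are unaffected individuals, neither of which is stated explicitly above, and the function name is hypothetical.

```python
def genotype_frequencies(K, p, rr1, rr2):
    """Penetrances and genotype frequencies for a single case or control.

    K: disease prevalence; p: frequency of allele M (assumed risk allele);
    rr1, rr2: genotype relative risks f1/f0 and f2/f0.
    Returns (penetrances, case_freqs, control_freqs) ordered (NN, NM, MM).
    """
    # Hardy-Weinberg genotype frequencies in the population
    hwe = [(1 - p) ** 2, 2 * p * (1 - p), p ** 2]
    # K = f0 * (P(NN) + RR1 * P(NM) + RR2 * P(MM)) fixes the baseline f0
    f0 = K / (hwe[0] + rr1 * hwe[1] + rr2 * hwe[2])
    pen = [f0, rr1 * f0, rr2 * f0]
    # Bayes' rule: P(g | affected) and P(g | unaffected)
    cases = [f * g / K for f, g in zip(pen, hwe)]
    controls = [(1 - f) * g / (1 - K) for f, g in zip(pen, hwe)]
    return pen, cases, controls
```

For instance, `genotype_frequencies(0.1, 0.3, 1.0, 1.0)` returns the Hardy-Weinberg frequencies for both groups, matching the null hypothesis. Affected sib pairs require the joint genotype distribution of two siblings given that both are affected, which this sketch does not cover.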
Application
The GAW15 NARAC candidate gene data consisted of affected sibs with rheumatoid arthritis from multiplex families and unrelated controls. The candidate gene data from the PTPN22 locus [10] had 14 SNPs genotyped on 1269 cases and 1519 unrelated controls. The cases were from 665 families: 123 families had 1 case, 492 families had 2 affected siblings, and 50 families had 3 or more affected siblings. For sib pairs from the same family, their IBD sharing probabilities were calculated using the software MERLIN [11].
Table 3 presents results based on the three testing methods and the trend test without adjusting for correlated cases. The performance of these tests is comparable. The Bonferroni correction was applied to adjust for multiple testing of 14 SNPs, and only the SNPs with an adjusted p-value less than 0.05 in any one of the tests are presented. All three test methods identified the same markers that were significantly associated with susceptibility to rheumatoid arthritis. The unadjusted trend test that assumed independent cases overestimated the association and could result in a larger false-positive rate.
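The Bonferroni adjustment used here multiplies each raw p-value by the number of tests and caps the result at 1. A minimal Python sketch follows; the function name and the p-values in the usage note are made up for illustration and are not taken from Table 3.

```python
def bonferroni(p_values, n_tests=None):
    """Bonferroni-adjusted p-values: min(1, m * p) for m tests."""
    m = len(p_values) if n_tests is None else n_tests
    return [min(1.0, m * p) for p in p_values]
```

With 14 SNPs, `bonferroni([0.001, 0.01, 0.5], n_tests=14)` gives adjusted values of about 0.014, 0.14, and 1.0, so only the first hypothetical SNP would pass the 0.05 threshold.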
Discussion
We consider three methods that use completely different approaches to account for correlation among family members. The IBD-corrected trend test requires genotype information from parents or other family members to obtain a more accurate IBD calculation. Because the variance of the test is corrected for correlation among related cases using the genealogy and marker information, this test is expected to be more powerful than tests using only family pedigree information. The GEE approach estimates the correlation among related cases through a working correlation matrix, and the MO method accounts for the correlation through repeated sampling. In our simulation study, the GEE and MO approaches appear to have similar power. Note that Follmann et al. [8] showed that the GEE estimates under an exchangeable working correlation performed better than MO in some simulations; however, the GEE may have problems converging. They also showed that in certain simple settings MO was slightly more powerful than, or competitive with, GEE with a working-independence correlation. The relative efficiency of these tests is unknown in general, and a more extensive simulation would be required to explore their behavior. In addition, compared to the IBD-corrected trend test, both GEE and MO are simple and broadly applicable approaches that can also easily adjust for multiple covariates.
Note that these methods used in case-control studies are sensitive to population stratification. In genetic association studies, case-control and family-based designs are two fundamentally different approaches. While case-control designs study the contrast of allele/genotype frequencies between cases and controls to identify associations within populations, family-based designs use families to look for susceptibility alleles through transmission within families. Thus, when population stratification is suspected, family-based designs are preferred to case-control designs. For such designs, the well-known transmission disequilibrium test (TDT) and its various extensions, such as the family-based association tests (FBATs), are commonly used [2, 12]. They are robust against population substructure. However, trios consisting of an affected child and parents are needed for the TDT, which may be difficult to obtain. Other designs such as affected sibs and discordant sib pairs have been shown to be less powerful than case-control studies for both rare and common diseases [2, 3]. Moreover, to test biallelic markers like SNPs, family-based tests require a large number of families because they discard all the homozygous (non-informative) parents. For the above GAW15 example, most parental genotypes and unaffected siblings are not available in the NARAC candidate gene data. Thus, this data set is not suitable for either the TDT or FBAT tests. Therefore, when there is no evidence of major population substructure, the cases collected from families for linkage studies can be recycled for association, and additional unrelated controls may be obtained and genotyped to increase the power to confirm the candidate marker.
The test results from the three methods depend on the scores assigned to the genotypes based on the assumption of the underlying genetic models such as recessive, additive, and dominant. In practice, since the genetic model is unknown for most complex diseases, the additive model is usually assumed first, with x = (0, 1, 2) indicating the numbers of risk alleles. Applying a trend test with one set of scores would result in a loss of power if the genetic model is misspecified. Hence, more robust tests can be considered to protect against model uncertainty [9].
Conclusion
In summary, we compare three methods of testing genetic association for case-control studies with cases drawn from families and unrelated controls. Our results indicate that all three methods perform well, and their performance is comparable in the simulation and application to the GAW15 NARAC data. All three methods can be applied to more general situations where the controls or both cases and controls are also correlated.
Abbreviations
FBAT: Family-based association test
GAW: Genetic Analysis Workshop
GEE: Generalized estimating equation
IBD: Identical by descent
MO: Multiple outputation
NARAC: North American Rheumatoid Arthritis Consortium
SNP: Single-nucleotide polymorphism
TDT: Transmission-disequilibrium test
References
1. Risch N: Searching for genetic determinants in the new millennium. Nature. 2000, 405: 847-856. 10.1038/35015718.
2. Laird NM, Lange C: Family-based designs in the age of large-scale gene-association studies. Nat Rev Genet. 2006, 7: 385-394. 10.1038/nrg1839.
3. Slager SL, Schaid DJ: Evaluation of candidate genes in case-control studies: a statistical method to account for related subjects. Am J Hum Genet. 2001, 68: 1457-1462. 10.1086/320608.
4. Sasieni PD: From genotypes to genes: doubling the sample size. Biometrics. 1997, 53: 1253-1261. 10.2307/2533494.
5. Liang KY, Zeger SL: Longitudinal data analysis using generalized linear models. Biometrika. 1986, 73: 13-22. 10.1093/biomet/73.1.13.
6. Liang KY, Pulver AE: Analysis of case-control/family sampling design. Genet Epidemiol. 1996, 13: 253-270. 10.1002/(SICI)1098-2272(1996)13:3<253::AID-GEPI3>3.0.CO;2-7.
7. Hoffman EB, Sen PK, Weinberg CR: Within-cluster resampling. Biometrika. 2001, 88: 1121-1134. 10.1093/biomet/88.4.1121.
8. Follmann D, Proschan M, Leifer E: Multiple outputation: inference for complex clustered data by averaging analyses from independent data. Biometrics. 2003, 59: 420-429. 10.1111/1541-0420.00049.
9. Tian X, Joo J, Zheng G, Lin JP: Robust trend tests for genetic association in case-control studies using family data. BMC Genet. 2005, 6 (Suppl 1): S107. 10.1186/1471-2156-6-S1-S107.
10. Carlton VE, Hu X, Chokkalingam AP, Schrodi SJ, Brandon R, Alexander HC, Chang M, Catanese JJ, Leong DU, Ardlie KG, Kastner DL, Seldin MF, Criswell LA, Gregersen PK, Beasley E, Thomson G, Amos CI, Begovich AB: PTPN22 genetic variation: evidence for multiple variants associated with rheumatoid arthritis. Am J Hum Genet. 2005, 77: 567-581. 10.1086/468189.
11. Abecasis GR, Cherny SS, Cookson WO, Cardon LR: Merlin-rapid analysis of dense genetic maps using sparse gene flow trees. Nat Genet. 2002, 30: 97-101. 10.1038/ng786.
12. Spielman RS, McGinnis RE, Ewens WJ: Transmission test for linkage disequilibrium: the insulin gene region and insulin-dependent diabetes mellitus (IDDM). Am J Hum Genet. 1993, 52: 506-516.
Acknowledgements
This article has been published as part of BMC Proceedings Volume 1 Supplement 1, 2007: Genetic Analysis Workshop 15: Gene Expression Analysis and Approaches to Detecting Multiple Functional Loci. The full contents of the supplement are available online at http://www.biomedcentral.com/17536561/1?issue=S1.
Competing interests
The author(s) declare that they have no competing interests.
Rights and permissions
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Keywords
 Generalize Estimate Equation
 Trend Test
 Transmission Disequilibrium Test
 Genetic Analysis Workshop
 Unrelated Control