Comparing strategies for evaluation of candidate genes in case-control studies using family data.

The goal of this analysis is to compare different test strategies for genetic association in case-control studies using related individuals. The first test is the trend test that is corrected for related individuals on the basis of identity-by-descent information. The second approach is to use generalized estimating equations to adjust for the correlation between relatives, and the third is the multiple outputation method. We compare the power of these test strategies in a simulation study, and apply these methods to a candidate gene dataset of Genetic Analysis Workshop 15 from the North American Rheumatoid Arthritis Consortium.


Background
The case-control design is a widely used and powerful approach for genetic association studies [1,2]. Genotype frequencies are compared between case and control samples to identify candidate genes or nearby markers that are associated with susceptibility to a disease. Although association studies may be subject to population stratification, it has been recognized that this effect is small in well-designed studies that sample cases and controls from a homogeneous population, or that match cases and controls on the major confounding variables such as age, gender, and race-ethnicity [1]. Recently, there has been increasing interest in statistical methods that evaluate association between genetic markers and disease status using family-based data [2,3]. Such methods allow data available from linkage studies or multicase families to be used efficiently to test for association.
Unlike traditional case-control studies in which all individuals are unrelated, cases from the same family are often correlated because these individuals share genetic and environmental conditions. Consequently, the frequency of risk alleles at a marker locus is usually increased among related cases relative to unrelated cases. Using related cases sampled from families or ascertained from family linkage studies and unrelated controls may increase the false positive rate (type I error) of an association test, compared to the traditional case-control design based on independent samples. Ignoring the dependence among related individuals may potentially lead to incorrect or spurious results. Hence, any test of genetic association must account for correlation among family members. Different methods may be used to evaluate genetic associations of candidate genes in case-control studies when some individuals (cases or controls) are related. We briefly sketch three of these methods, the Cochran-Armitage trend test corrected for identity-by-descent (IBD) information, the generalized estimating equations method, and the multiple outputation method. Little is known about their relative efficiency and performance. We compare their power in a simulation study and apply these methods to the candidate gene data of Genetic Analysis Workshop 15 (GAW15) from the North American Rheumatoid Arthritis Consortium (NARAC), which contains affected sibs with rheumatoid arthritis and unrelated controls.

Cochran-Armitage trend test accounting for related individuals
Consider data for a case-control study of genetic association as in Table 1. Assume a marker of a candidate gene with two alleles, N and M, where N is a normal allele and M is a risk allele or is in linkage disequilibrium with a risk allele. Denote the genotypes as $g_0$ = NN, $g_1$ = NM, and $g_2$ = MM. Let the genotype frequencies for cases and controls be $p_j$ and $q_j$, $j = 0, 1, 2$, respectively. Hence, the null hypothesis of no association is $p_j = q_j$ for each $j$.
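Given the data in Table 1, let $r_j$ and $s_j$ denote the numbers of cases and controls with genotype $g_j$, with $n_j = r_j + s_j$, $R = \sum_j r_j$ cases, $S = \sum_j s_j$ controls, and $n = R + S$. The Cochran-Armitage trend test for association [4] between a disease and a marker can be written as $Z_x = U / \{\widehat{\mathrm{Var}}(U)\}^{1/2}$, where $U = \sum_{j=0}^{2} x_j \{(1 - \phi) r_j - \phi s_j\}$, $\phi = R/n$, and $x = (x_0, x_1, x_2)^T$ is a set of increasing scores (weights) assigned to the three genotypes $(g_0, g_1, g_2)$ a priori based on the underlying genetic model. Under the null hypothesis and with independent individuals, $\mathrm{Var}(U) = n\phi(1 - \phi)\{\sum_j x_j^2 p_j - (\sum_j x_j p_j)^2\}$, which can be estimated by substituting $\hat{p}_j = n_j/n$ for $p_j$; $Z_x$ asymptotically follows a standard normal distribution N(0, 1).
However, because cases and controls within the same family may be biologically related, Slager and Schaid [3] proposed the following method for estimating the variance to account for correlations among related cases or controls. Let $u_i = (u_{i0}, u_{i1}, u_{i2})^T$ be the genotype indicator vector for the $i$th case, where $u_{ij} = 1$ if the $i$th case has genotype $g_j$ and $u_{ij} = 0$ otherwise, $i = 1, \ldots, R$. Similarly, we use $v_i$ for the $i$th control, $i = 1, \ldots, S$. Then $r = (r_0, r_1, r_2)^T = \sum_{i=1}^{R} u_i$ and $s = (s_0, s_1, s_2)^T = \sum_{i=1}^{S} v_i$. With $\phi = R/n$, the numerator of the above test can be written as $U = x^T\{(1 - \phi) r - \phi s\}$, so that $\mathrm{Var}(U) = (1 - \phi)^2 \mathrm{Var}(x^T r) + \phi^2 \mathrm{Var}(x^T s) - 2\phi(1 - \phi)\mathrm{Cov}(x^T r, x^T s)$, with $\mathrm{Var}(x^T r) = \sum_{i} \mathrm{Var}(x^T u_i) + 2\sum_{i<k} \mathrm{Cov}(x^T u_i, x^T u_k)$ and analogously for $x^T s$. Here the variance and covariance terms can be calculated based on the multinomial distributions and IBD-sharing probabilities for pairs of related individuals [3].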
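As a rough illustration, the following Python sketch computes the trend statistic for a 2 x 3 table of genotype counts, using $U = \sum_j x_j\{(1 - \phi) r_j - \phi s_j\}$ and the null variance $n\phi(1 - \phi)\{\sum_j x_j^2 \hat{p}_j - (\sum_j x_j \hat{p}_j)^2\}$. It treats all individuals as independent; the IBD-based variance correction of Slager and Schaid [3] is not implemented. The function name `trend_test` and the argument layout are assumptions made for this sketch.

```python
import math

def trend_test(cases, controls, scores=(0.0, 1.0, 2.0)):
    """Cochran-Armitage trend statistic for genotype counts (NN, NM, MM).

    Assumes independent individuals: no correction for related cases.
    Returns (z, two-sided p-value).
    """
    R, S = sum(cases), sum(controls)
    n = R + S
    phi = R / n
    # U = sum_j x_j * ((1 - phi) * r_j - phi * s_j)
    U = sum(x * ((1 - phi) * r - phi * s)
            for x, r, s in zip(scores, cases, controls))
    # Pooled genotype frequencies estimate p_j under the null hypothesis
    p_hat = [(r + s) / n for r, s in zip(cases, controls)]
    m1 = sum(x * p for x, p in zip(scores, p_hat))
    m2 = sum(x * x * p for x, p in zip(scores, p_hat))
    var_U = n * phi * (1 - phi) * (m2 - m1 ** 2)
    z = U / math.sqrt(var_U)
    # Two-sided p-value from the standard normal distribution
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value
```

With related cases, the denominator computed here would tend to be too small, which is the reason the variance is corrected in the approach described above.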

Generalized estimating equations (GEE) method
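The GEE approach developed by Liang and Zeger [5] for the analysis of longitudinal data can be applied to case-control data in genetic studies. Let $y_i = (y_{i1}, \ldots, y_{in_i})^T$ be the response vector for the $n_i$ related subjects in the $i$th family, $i = 1, \ldots, m$, where $m$ is the total number of families. For a binary trait, $y_{ij} = 1$ for cases and 0 for controls. A logistic regression model can be considered for the case-control data in Table 1: $\mathrm{logit}\, P(y_{ij} = 1) = \beta_0 + \beta_1 x_{ij}$, where $x_{ij}$ is the score assigned to the genotype of the $j$th subject in the $i$th family, so that the null hypothesis of no association corresponds to $\beta_1 = 0$. The GEE method requires a working correlation matrix for the responses within a family. For affected sib-pair data, a simple and reasonable choice is the exchangeable correlation matrix with a common correlation $\theta$ for each pair of relatives [6].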
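The GEE idea can be sketched in a few lines of Python for the simplest case. The sketch below is a simplification and not the exchangeable-correlation fit described in the text: it uses a working independence correlation, under which the GEE point estimates coincide with ordinary logistic regression and the family structure enters only through the cluster-robust (sandwich) variance. The function name `gee_logistic_independence`, the single-covariate model, and the fixed number of Newton steps are assumptions of this sketch.

```python
import math
from collections import defaultdict

def _expit(t):
    return 1.0 / (1.0 + math.exp(-t))

def gee_logistic_independence(y, x, groups, n_iter=25):
    """Fit logit P(y=1) = b0 + b1*x with a working independence correlation.

    y: 0/1 disease status; x: genotype score; groups: family identifier.
    Returns (b1, robust standard error of b1, z statistic).
    """
    b0 = b1 = 0.0
    for _ in range(n_iter):  # Newton-Raphson on the logistic score equations
        g0 = g1 = h00 = h01 = h11 = 0.0
        for yi, xi in zip(y, x):
            m = _expit(b0 + b1 * xi)
            w = m * (1.0 - m)
            g0 += yi - m
            g1 += (yi - m) * xi
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    # "Bread" (information matrix) and per-family score sums ("meat")
    h00 = h01 = h11 = 0.0
    fam_score = defaultdict(lambda: [0.0, 0.0])
    for yi, xi, gi in zip(y, x, groups):
        m = _expit(b0 + b1 * xi)
        w = m * (1.0 - m)
        h00 += w
        h01 += w * xi
        h11 += w * xi * xi
        fam_score[gi][0] += yi - m
        fam_score[gi][1] += (yi - m) * xi
    det = h00 * h11 - h01 * h01
    a0, a1 = -h01 / det, h00 / det  # row of the inverse bread for b1
    # Sandwich variance of b1: sum over families of (a . score_family)^2
    var_b1 = sum((a0 * s0 + a1 * s1) ** 2 for s0, s1 in fam_score.values())
    se = math.sqrt(var_b1)
    return b1, se, b1 / se
```

Unrelated controls can be passed as families of size one. Summing the score contributions within each family before squaring is what lets correlated sib pairs inflate the variance relative to an analysis that treats every subject as its own cluster.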

Multiple outputation (MO) method
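The MO method proposed by Hoffman et al. [7] and Follmann et al. [8] provides inferences for clustered correlated data by averaging analyses of independent data. For independent case-control data in genetic studies, several methods can provide a normally distributed statistic, $\hat{\beta}$, for the genetic association and an estimate of its variance, $\hat{\sigma}^2$. For example, the trend test statistic $Z_x$ above is a sensible choice; its numerator estimates the weighted difference of the genotype frequencies between cases and controls. In each outputation, one individual is randomly retained from each family so that the reduced data set contains only independent observations, and $\hat{\beta}$ and $\hat{\sigma}^2$ are computed from it. After many repetitions, the MO estimate is the average of the $\hat{\beta}$ values, and its variance is estimated by the average of the $\hat{\sigma}^2$ values minus the sample variance of the $\hat{\beta}$ values across outputations [7,8]. The resulting test statistic has been shown to be asymptotically normally distributed.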
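A minimal Python sketch of multiple outputation for affected families and unrelated controls is given below. It assumes the usual MO recipe: keep one randomly chosen case per family, analyze the resulting independent data, repeat, average the estimates, and subtract the between-outputation variance from the average within-outputation variance. The names `multiple_outputation` and `score_difference`, the number of outputations, and the choice of the difference in mean genotype score as $\hat{\beta}$ are assumptions of this sketch.

```python
import random

def score_difference(case_genos, control_genos, scores=(0.0, 1.0, 2.0)):
    """beta_hat = mean score in cases - mean score in controls, with its
    null variance for independent data (pooled variance of the scores)."""
    xr = [scores[g] for g in case_genos]
    xs = [scores[g] for g in control_genos]
    R, S = len(xr), len(xs)
    beta = sum(xr) / R - sum(xs) / S
    pooled = xr + xs
    mean = sum(pooled) / (R + S)
    sigma2 = sum((v - mean) ** 2 for v in pooled) / (R + S)
    return beta, sigma2 * (1.0 / R + 1.0 / S)

def multiple_outputation(families, controls, stat_fn=score_difference,
                         n_out=200, seed=0):
    """families: list of lists of case genotypes (0, 1, 2), one list per family.
    controls: list of genotypes of unrelated controls.
    Returns (MO estimate, MO variance estimate)."""
    rng = random.Random(seed)
    betas, variances = [], []
    for _ in range(n_out):
        # Keep one randomly chosen case per family -> independent cases
        kept = [rng.choice(fam) for fam in families]
        b, v = stat_fn(kept, controls)
        betas.append(b)
        variances.append(v)
    beta_mo = sum(betas) / n_out
    # Sample variance of the estimates across outputations
    s2 = sum((b - beta_mo) ** 2 for b in betas) / (n_out - 1)
    var_mo = sum(variances) / n_out - s2
    return beta_mo, var_mo
```

In small samples the variance estimate can come out non-positive, so a practical implementation would need to check `var_mo` before forming a test statistic such as `beta_mo / var_mo ** 0.5`.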

A simulation study
To compare the performance of the three methods, we conducted a small simulation by generating case-control data sets and computing the empirical power of all the tests under three genetic models: recessive, additive, and dominant. The simulations were similar to those performed by Tian et al. [9], with 10,000 replications. We assume that the disease prevalence, $K$, is 0.1, the marker allele frequency, $p$, is 0.3, and Hardy-Weinberg equilibrium holds. To facilitate the calculation, each case-control data set included 200 cases generated as 100 affected sib pairs drawn from 100 different families, and 200 unrelated controls. Let the genotype relative risks be $RR_1 = f_1/f_0$ and $RR_2 = f_2/f_0$, where $f_0$, $f_1$, and $f_2$ are the penetrances for genotypes $g_0$, $g_1$, and $g_2$. Thus, equivalently, the null hypothesis can be written as $RR_1 = RR_2 = 1$. The alternative hypothesis can be specified by varying $RR_1$ and $RR_2$.
[...] [11]. Table 3 presents results based on the three testing methods and the trend test without adjusting for correlated cases. The performance of these tests is comparable. The Bonferroni correction was applied to adjust for multiple testing of 14 SNPs, and only the SNPs with an adjusted p-value less than 0.05 in any one of the tests are presented. All three test methods identified the same markers that were significantly associated with susceptibility to rheumatoid arthritis. The unadjusted trend test that assumed independent cases overestimated the association and could result in a larger false-positive rate.
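Returning to the simulation design, the data-generating step might be sketched in Python as follows. The penetrances follow from $K = \sum_j P(g_j) f_j$ under Hardy-Weinberg equilibrium together with $f_1 = RR_1 f_0$ and $f_2 = RR_2 f_0$. The sketch assumes that $p$ is the frequency of the risk allele M, that sib pairs are obtained by rejection sampling from random-mating parents, and that controls are unaffected individuals; these choices and all function names are assumptions of this illustration rather than details taken from the study.

```python
import random

def penetrances(K, p, rr1, rr2):
    """Solve K = (1-p)^2 f0 + 2p(1-p) f1 + p^2 f2 with f1 = rr1*f0, f2 = rr2*f0."""
    q = 1.0 - p
    f0 = K / (q * q + 2 * p * q * rr1 + p * p * rr2)
    return f0, rr1 * f0, rr2 * f0

def affected_sib_pair(p, f, rng):
    """Genotypes (number of M alleles) of two sibs, given both are affected."""
    while True:
        # Parental alleles drawn independently (HWE); True means allele M
        father = (rng.random() < p, rng.random() < p)
        mother = (rng.random() < p, rng.random() < p)
        # Each sib receives one random allele from each parent
        sibs = [rng.choice(father) + rng.choice(mother) for _ in range(2)]
        # Accept the family only if both sibs are affected
        if all(rng.random() < f[g] for g in sibs):
            return tuple(sibs)

def unrelated_control(p, f, rng):
    """Genotype of a random unaffected individual."""
    while True:
        g = (rng.random() < p) + (rng.random() < p)
        if rng.random() >= f[g]:
            return g

def simulate_dataset(K=0.1, p=0.3, rr1=1.0, rr2=1.0,
                     n_pairs=100, n_controls=200, seed=0):
    rng = random.Random(seed)
    f = penetrances(K, p, rr1, rr2)
    pairs = [affected_sib_pair(p, f, rng) for _ in range(n_pairs)]
    controls = [unrelated_control(p, f, rng) for _ in range(n_controls)]
    return pairs, controls
```

Empirical power would then be estimated as the proportion of replicated data sets in which a given test rejects the null hypothesis at the chosen significance level.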

Discussion
We consider three methods that use completely different approaches to account for correlation among family members. The IBD-corrected trend test requires genotype information from parents or other family members to obtain more accurate IBD calculations. Because the variance of the test is corrected for correlation among related cases using both the genealogy and marker information, this test is expected to be more powerful than tests that use only family pedigree information. The GEE approach estimates the correlation among related cases through a working correlation matrix, and the MO method accounts for the correlation through repeated sampling. In our simulation study, the GEE and MO approaches appear to have similar power. Note that Follmann et al. [8] showed that the GEE estimates under an exchangeable working correlation performed better than MO in some simulations; however, the GEE may have problems converging. They also showed that in certain simple settings MO was slightly more powerful than or competitive with GEE with a working independence correlation. The relative efficiency of these tests is unknown in general, and a more extensive simulation would be required to explore their behavior. In addition, compared to the IBD-corrected trend test, both GEE and MO are simple and broadly applicable approaches that can also easily adjust for multiple covariates.
Note that these methods, like other methods used in case-control studies, are sensitive to population stratification. In genetic association studies, case-control and family-based designs are two fundamentally different approaches. While case-control designs study the contrast of allele or genotype frequencies between cases and controls to identify associations within populations, family-based designs use families to look for susceptibility alleles through transmission within families. Thus, when population stratification is suspected, family-based designs are preferred to case-control designs. For such designs, the well-known transmission disequilibrium test (TDT) and its various extensions, such as the family-based association tests (FBATs), are commonly used [2,12]. They are robust against population substructure. However, trios consisting of an affected child and both parents are needed for the TDT, and these may be difficult to obtain. Other designs such as affected sibs and discordant sib pairs have been shown to be less powerful than case-control studies for both rare and common diseases [2,3]. Moreover, to test bi-allelic markers such as SNPs, family-based tests require a large number of families because they discard all homozygous (non-informative) parents. For the above GAW15 example, most parental genotypes and genotypes of unaffected siblings are not available in the NARAC candidate gene data. Thus, this data set is not suitable for either the TDT or FBATs. Therefore, when there is no evidence of major population substructure, the cases collected from families for linkage studies can be recycled for association, and additional unrelated controls may be obtained and genotyped to increase the power to confirm the candidate marker.
The test results from the three methods depend on the scores assigned to the genotypes based on the assumption of the underlying genetic models such as recessive, additive, and dominant. In practice, since the genetic model is unknown for most complex diseases, the additive model is usually assumed first, with x = (0, 1, 2) indicating the numbers of risk alleles. Applying a trend test with one set of scores would result in a loss of power if the genetic model is misspecified. Hence, more robust tests can be considered to protect against model uncertainty [9].
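The dependence on the scores can be illustrated with a small Python sketch that evaluates the uncorrected trend statistic under three score vectors. The additive scores (0, 1, 2) are those given in the text; the recessive scores (0, 0, 1) and dominant scores (0, 1, 1) are the conventional choices and are an assumption here, as are the function names and the hypothetical genotype counts.

```python
import math

# Score vectors for genotypes (NN, NM, MM) under each genetic model
MODEL_SCORES = {
    "recessive": (0.0, 0.0, 1.0),
    "additive": (0.0, 1.0, 2.0),
    "dominant": (0.0, 1.0, 1.0),
}

def trend_z(cases, controls, scores):
    """Trend statistic for independent individuals, written as the difference
    in mean score between cases and controls over its null standard error."""
    R, S = sum(cases), sum(controls)
    n = R + S
    beta = (sum(x * r for x, r in zip(scores, cases)) / R
            - sum(x * s for x, s in zip(scores, controls)) / S)
    pooled = [(r + s) / n for r, s in zip(cases, controls)]
    m1 = sum(x * p for x, p in zip(scores, pooled))
    m2 = sum(x * x * p for x, p in zip(scores, pooled))
    return beta / math.sqrt((m2 - m1 ** 2) * (1.0 / R + 1.0 / S))

def z_by_model(cases, controls):
    return {name: trend_z(cases, controls, x)
            for name, x in MODEL_SCORES.items()}

# Hypothetical counts in which only MM is enriched among cases
z = z_by_model(cases=(40, 40, 20), controls=(45, 45, 10))
```

For these hypothetical counts the recessive scores give the largest statistic and the dominant scores the smallest, which illustrates how a mismatched score vector can weaken the test. The statistic is unchanged by a linear rescaling of the scores, so (0, 0.5, 1) and (0, 1, 2) are equivalent.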

Conclusion
In summary, we compare three methods of testing genetic association for case-control studies with cases drawn from families and unrelated controls. Our results indicate that all three methods perform well, and their performance is comparable in the simulation and application to the GAW15 NARAC data. All three methods can be applied to more general situations where the controls or both cases and controls are also correlated.

Competing interests
The author(s) declare that they have no competing interests.