Joint analysis of case-parents trio and unrelated case-control designs in large scale association studies.

We present a new method for testing association when data from both case-parents trios and unrelated controls are available. Our method combines test statistics for case-parents trio and unrelated case-control studies by adjusting for the correlation that arises when the same set of cases is used for both tests. We further consider several analytical approaches for two-stage studies on a large number of markers, including methods based on the joint analysis. The performance of the proposed approaches is examined by analyzing the simulated data provided by the Genetic Analysis Workshop 15.


Background
Genetic association studies are a popular method to detect genetic markers associated with a complex human disease. Two common designs in genetic association studies are family-based designs using case-parents trios and population-based designs using unrelated cases and controls. The transmission disequilibrium test (TDT) is frequently used to analyze case-parents trio data [1]. The TDT tests for both linkage and association and is not sensitive to population admixture and stratification. Using a likelihood approach, Schaid and Sommer [2] proposed TDT-type statistics that are more powerful than the TDT for a specific genetic model (see also [3]). For the unrelated case-control design, a linear trend test [4], which is often more powerful than the TDT based on case-parents trios, can be considered, particularly when obtaining a sufficient number of trios is difficult.
Data that contain both case-parents trios and unrelated cases and controls on the same set of markers are increasingly available. Nagelkerke et al. [5] provided a few situations where such a mixture of case-parents trios and unrelated cases and controls can occur: 1) a case-parents trio design was originally considered and then unrelated controls were added, 2) a case-control design was originally considered and then the parents of the cases were added to confirm the findings. Such designs are typically analyzed in two stages, and strategies for analyzing this type of data while fully utilizing the given information are important.
In this paper, we study several approaches for testing genome-wide association in such situations. Based on the design, either a TDT-type statistic or a linear trend test will be used in the first stage to select a proportion of markers that will be tested in the second stage. The other test will then be applied in the second stage while controlling the genome-wide false positive rates by adjusting for the correlation with the first stage. Following a recently proposed method by Skol et al. [6], we also study a joint analysis for the second stage.

Methods
Consider a marker with two alleles, M and N, where M itself is a risk allele or is in linkage disequilibrium with a risk allele with frequency $p$, and N is a normal allele with frequency $q = 1 - p$. Penetrances are defined as the probabilities of disease conditional on the genotypes, that is, $f_0 = \Pr(\text{disease} \mid NN)$, $f_1 = \Pr(\text{disease} \mid NM)$, and $f_2 = \Pr(\text{disease} \mid MM)$. No association implies $f_0 = f_1 = f_2$, whereas $f_0 \le f_1 \le f_2$ with at least one strict inequality implies that there is an association between the marker and the disease. Using $f_0$ as a baseline penetrance, the genotype relative risks are defined as $\psi_i = f_i / f_0$ for $i = 1, 2$. A genetic model is recessive, additive, or dominant when $f_0 = f_1$ (or $\psi_1 = 1$, $\psi_2 = \psi$), $f_1 = (f_0 + f_2)/2$ (or $\psi_1 = \psi$, $\psi_2 = 2\psi - 1$), or $f_1 = f_2$ (or $\psi_1 = \psi_2 = \psi$), respectively.
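As a minimal illustration of these definitions, the sketch below maps each genetic model to its pair of genotype relative risks and, assuming HWE in the general population, computes the genotype frequencies expected among cases by Bayes' rule. The function names are hypothetical and are not part of the original analysis.

```python
def relative_risks(model, psi):
    """Return (psi_1, psi_2) for a one-parameter genetic model."""
    if model == "recessive":
        return (1.0, psi)
    if model == "additive":
        return (psi, 2.0 * psi - 1.0)
    if model == "dominant":
        return (psi, psi)
    raise ValueError(f"unknown model: {model}")


def case_genotype_freqs(p, psi1, psi2):
    """Genotype frequencies (NN, NM, MM) among cases.

    Assumes HWE in the population, so P(genotype | case) is proportional
    to the HWE frequency times the genotype relative risk.
    """
    q = 1.0 - p
    weights = (q * q, 2.0 * p * q * psi1, p * p * psi2)
    total = sum(weights)
    return tuple(w / total for w in weights)
```

With $\psi = 1$ every model returns (1, 1), and the case genotype frequencies reduce to the HWE frequencies, which is the null hypothesis of no association.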

Case-parents trio design
In the case-parents trio design, cases and their parents are selected from the population and their genotypes are obtained. There are six possible parental mating types for a marker with two alleles M and N: 1) MM × MM, 2) MM × NM, 3) MM × NN, 4) NM × NM, 5) NM × NN, and 6) NN × NN. These six mating types are given in the first column of Table 1. The second column provides case genotypes for each mating type, and the third column is the sample size of trios under each mating type. The probabilities of parental mating types can be calculated by assuming Hardy-Weinberg equilibrium (HWE), and the probability of each $n_{ij}$ can then be obtained and is presented in the fourth column. The last column contains the probabilities of a case genotype given parental mating type [2]. Schaid and Sommer [2] suggested an analysis conditional on parental mating types that provides unbiased estimates of genotype relative risks. Denoting the likelihood function for a given model by $L(\psi)$, the score test for $H_0: \psi = 1$, which we denote $Z_{TDT}$, is obtained as

$$Z_{TDT} = \left. \frac{\partial \log L(\psi) / \partial \psi}{\left\{ -\partial^2 \log L(\psi) / \partial \psi^2 \right\}^{1/2}} \right|_{\psi = 1}.$$
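The conditional probabilities described above can be sketched as follows. This is an illustration under stated assumptions, not a reproduction of Table 1: `mating_type_probs` gives mating-type probabilities in the general population under HWE and random mating (before ascertainment through an affected child), and `case_genotype_given_mating` reweights Mendelian transmission probabilities by the genotype relative risks. All names are hypothetical.

```python
# Mendelian offspring genotype probabilities for each parental mating type,
# indexed by the number of M alleles in the child: (NN, NM, MM).
MENDEL = {
    ("MM", "MM"): (0.0, 0.0, 1.0),
    ("MM", "NM"): (0.0, 0.5, 0.5),
    ("MM", "NN"): (0.0, 1.0, 0.0),
    ("NM", "NM"): (0.25, 0.5, 0.25),
    ("NM", "NN"): (0.5, 0.5, 0.0),
    ("NN", "NN"): (1.0, 0.0, 0.0),
}


def mating_type_probs(p):
    """Population mating-type probabilities under HWE and random mating."""
    q = 1.0 - p
    geno = {"NN": q * q, "NM": 2.0 * p * q, "MM": p * p}
    probs = {}
    for a, b in MENDEL:
        # An unordered pair of distinct genotypes can arise in two orders.
        factor = 1.0 if a == b else 2.0
        probs[(a, b)] = factor * geno[a] * geno[b]
    return probs


def case_genotype_given_mating(psi1, psi2):
    """P(case genotype | mating type), weighting Mendelian transmission
    by the genotype relative risks (1, psi1, psi2)."""
    rr = (1.0, psi1, psi2)
    out = {}
    for mating, mendel in MENDEL.items():
        weights = [m * r for m, r in zip(mendel, rr)]
        total = sum(weights)
        out[mating] = tuple(w / total for w in weights)
    return out
```

For example, for NM × NM parents the case genotype probabilities are proportional to (1, 2ψ₁, ψ₂), and they reduce to the Mendelian values (1/4, 1/2, 1/4) when ψ₁ = ψ₂ = 1.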

Unrelated case-control design
For the unrelated case-control design, denote the counts of the three genotypes NN, NM, and MM as $(r_0, r_1, r_2)$ in cases and $(s_0, s_1, s_2)$ in controls, which follow multinomial distributions $\mathrm{mul}(R; p_0, p_1, p_2)$ and $\mathrm{mul}(S; q_0, q_1, q_2)$, respectively. The null hypothesis of no association then implies $p_i = q_i$ for each $i$.
Sasieni [4] proposed a method that uses the marker genotype as a covariate in a logistic regression model, where the genotype is coded by increasing scores, that is, 0, $x$, and 1 for NN, NM, and MM, with $0 \le x \le 1$. The optimal scores for the recessive, additive, and dominant models are $x$ = 0, 1/2, and 1, respectively [4,7], and the trend test [7], denoted $Z_{CC}$, is computed from the genotype counts with scores $(x_0, x_1, x_2) = (0, x, 1)$. When both case-parents trios and unrelated controls are available, we use a weighted combination of $Z_{TDT}$ and $Z_{CC}$ with weights $w_1$ and $w_2$, adjusted for the correlation between the two tests, as a test statistic in a joint analysis. We consider a uniform weight, that is, $w_1 = w_2 = 1$ [8,9], for simplicity. Other choices of weight, such as a weight proportional to the number of informative cases used in each test, can also be considered.
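A minimal sketch of these two statistics is given below. The trend statistic uses the standard Cochran-Armitage form with scores (0, x, 1). The combined statistic is an assumption: it is the usual standardization of a weighted sum of two standard normal statistics with correlation `rho`, and the exact expression and the value of the correlation used in the paper are not reproduced here. The function names and the `rho` argument are hypothetical.

```python
import math


def trend_test(cases, controls, x=0.5):
    """Cochran-Armitage trend statistic for genotype counts (NN, NM, MM)."""
    scores = (0.0, x, 1.0)
    R, S = sum(cases), sum(controls)
    N = R + S
    totals = [r + s for r, s in zip(cases, controls)]
    # Score-weighted difference between case and control genotype counts.
    u = sum(xi * (S * r - R * s) for xi, r, s in zip(scores, cases, controls))
    spread = (N * sum(xi * xi * n for xi, n in zip(scores, totals))
              - sum(xi * n for xi, n in zip(scores, totals)) ** 2)
    return math.sqrt(N) * u / math.sqrt(R * S * spread)


def joint_statistic(z_tdt, z_cc, rho, w1=1.0, w2=1.0):
    """Weighted sum of two correlated standard normal statistics,
    rescaled to unit variance under the null (assumed form)."""
    return (w1 * z_tdt + w2 * z_cc) / math.sqrt(
        w1 * w1 + w2 * w2 + 2.0 * w1 * w2 * rho)
```

Dividing by the square root of $w_1^2 + w_2^2 + 2 w_1 w_2 \rho$ keeps the combined statistic standard normal under the null, so a larger correlation between the two tests means that the second test contributes less additional information.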

Two-stage method in large scale association studies
To test $K$ markers in a two-stage analysis, we consider four strategies that use either $Z_{TDT}$ or $Z_{CC}$ in the first stage, depending on the intended design; in each case, either the other test or the joint test is applied in the second stage. As in Skol et al. [6], we obtain thresholds $C_1$ and $C_2$ (or $C_{joint}$) for the two stages in each strategy by controlling the genome-wide significance level at $\alpha$. $C_1$ can be obtained as $C_1 = \Phi^{-1}(1 - \pi_1/2)$, where $\pi_1$ is the proportion of markers selected in the first stage. On the other hand, $C_{joint}$ and $C_2$ need to be calculated iteratively so that they satisfy

$$\sum_{i=1}^{K} P(|Z_{1i}| > C_1, |Z_{joint,i}| > C_{joint}) = \alpha \tag{1}$$

when the joint analysis is used, or

$$\sum_{i=1}^{K} P(|Z_{1i}| > C_1, |Z_{2i}| > C_2, Z_{1i} Z_{2i} > 0) = \alpha \tag{2}$$

when the other test is used. Here, $Z_{1i}$ and $Z_{2i}$ denote the tests used in the first and the second stage for the $i$th SNP ($Z_{2i}$ is replaced by $Z_{joint,i}$ when the joint analysis is used in the second stage). The subscript $i$ is needed because the correlations between the two tests generally differ across SNPs. Under HWE, however, we can show that this correlation is a constant (Appendix), and these equations then simplify to $P(|Z_1| > C_1, |Z_{joint}| > C_{joint}) = \alpha/K$ and $P(|Z_1| > C_1, |Z_2| > C_2, Z_1 Z_2 > 0) = \alpha/K$.
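The threshold calculation can be sketched as follows, assuming that the two stage statistics are standard bivariate normal under the null with a known constant correlation `rho`. The correlation itself is derived in the Appendix and is treated here as a hypothetical input. The sketch uses Simpson's rule for the bivariate tail probability and bisection for the second-stage threshold; it is one possible numerical approach, not necessarily the one used in the paper.

```python
import math
from statistics import NormalDist


def upper_tail(x):
    """P(Z > x) for a standard normal variable."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def two_stage_prob(c1, c2, rho, same_sign=True, steps=2000):
    """P(|Z1| > c1, |Z2| > c2), optionally also requiring Z1 * Z2 > 0,
    for a standard bivariate normal with correlation rho (|rho| < 1)."""
    s = math.sqrt(1.0 - rho * rho)
    h = 12.0 / steps  # integrate Z1 over [c1, c1 + 12]; steps must be even
    total = 0.0
    for k in range(steps + 1):
        z = c1 + k * h
        cond = upper_tail((c2 - rho * z) / s)       # P(Z2 > c2 | Z1 = z)
        if not same_sign:
            cond += upper_tail((c2 + rho * z) / s)  # P(Z2 < -c2 | Z1 = z)
        density = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
        weight = 1 if k in (0, steps) else (4 if k % 2 else 2)
        total += weight * density * cond
    # The factor 2 accounts for the symmetric region Z1 < -c1.
    return 2.0 * total * h / 3.0


def second_stage_threshold(c1, rho, target, same_sign=True):
    """Bisection for the second-stage threshold giving probability target."""
    lo, hi = 0.0, 10.0
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if two_stage_prob(c1, mid, rho, same_sign) > target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# First-stage threshold for pi_1 = 0.1, as in the text.
c1 = NormalDist().inv_cdf(1.0 - 0.1 / 2.0)
```

Calling `second_stage_threshold(c1, rho, alpha / K)` with `same_sign=True` corresponds to the simplified form of Eq. (2), and `same_sign=False` with the correlation between the first-stage and joint statistics corresponds to the simplified form of Eq. (1).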

Data
The Genetic Analysis Workshop 15 provided simulated rheumatoid arthritis data that contain 1500 families with affected sib pairs and their parents, and 2000 unrelated controls, genotyped on 9187 SNPs distributed throughout the genome. We used the first simulated data set and randomly selected one sibling from each affected sib pair as the case for data analysis. The minor allele frequencies of all 9187 SNPs were greater than 1%.

Results
To apply the two-stage analysis, we first obtained the threshold for each strategy using $\pi_1 = 0.1$ ($C_1 = 1.6449$) and Eq. (1) and (2). Thus, we controlled the genome-wide false-positive rate at 0.05, and we define a "significant" SNP as one whose test statistic exceeds the threshold in both stages. As expected, a slightly larger threshold for the second stage is required for the joint analysis to control the same genome-wide false-positive rate ($C_2 = 4.5121$ vs. $C_{joint} = 4.5470$) [6]. Table 2 summarizes results based on an additive genetic model. The chromosome, SNP name, and distance from the nearest major gene are listed in the first three columns of the table, and if the SNP was selected by the specified method (last three columns), the p-value from the second stage is listed. Even with the larger threshold, the joint analysis in the second stage found more significant SNPs near the major genes. In addition, when the joint analysis was used in the second stage, the same set of significant SNPs was found regardless of the choice of test statistic in the first stage. However, different results were obtained when either $Z_{TDT}$ or $Z_{CC}$ was used in the first stage followed by the other test in the second stage. Specifically, the joint analysis using either $Z_{TDT}$ or $Z_{CC}$ in the first stage found 18 significant SNPs, of which 9 and 14 were located within 1 Mb (bold) and 5 Mb (italic) of the major genes, respectively. When $Z_{TDT}$ in the first stage was followed by $Z_{CC}$ in the second stage, we found 17 significant SNPs, of which 8 were located within 1 Mb of the causal genes and 13 within 5 Mb. These methods found SNPs near the major genes on chromosomes 6, 11, and 18. On the other hand, when we used $Z_{CC}$ in the first stage followed by $Z_{TDT}$ in the second stage, a total of 10 significant SNPs were found: 7 of them were located within 1 Mb of a major gene, only on chromosomes 6 and 11, and all 10 were located within 5 Mb. A SNP near the major gene on chromosome 18 was not found by this method.
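The decision rule applied above can be sketched as a short filter over per-SNP statistics. The thresholds are those reported in the text; the function name, the dictionary layout, and the SNP labels in the usage note are hypothetical. The sign condition mirrors Eq. (2) and is switched off when the second stage uses the joint statistic, as in Eq. (1).

```python
def significant_snps(z_first, z_second, c1, c2, require_same_sign=True):
    """Return SNPs whose statistics exceed the threshold in both stages.

    z_first and z_second map SNP names to first- and second-stage
    statistics; only SNPs passing the first stage are examined further.
    """
    hits = []
    for snp, z1 in z_first.items():
        z2 = z_second.get(snp)
        if z2 is None or abs(z1) <= c1:
            continue  # not carried forward to the second stage
        if abs(z2) > c2 and (not require_same_sign or z1 * z2 > 0):
            hits.append(snp)
    return hits
```

For instance, with made-up statistics `{"snp_a": 2.0, "snp_c": -3.0}` in the first stage and `{"snp_a": 5.0, "snp_c": 5.0}` in the second, the call with $C_1 = 1.6449$ and $C_2 = 4.5121$ keeps only `snp_a`, because the two statistics for `snp_c` point in opposite directions.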
When we applied these three tests ($Z_{CC}$, $Z_{TDT}$, $Z_{joint}$) in a single-stage analysis, each test found the same set of SNPs identified in the two-stage analysis with the corresponding test at the second stage. That is, $Z_{CC}$, $Z_{TDT}$, and $Z_{joint}$ in a single-stage analysis found the 17, 10, and 18 SNPs listed in columns 4, 5, and 6 of Table 2, respectively. This implies that a two-stage analysis can maintain power with a substantially reduced genotyping cost while controlling the same genome-wide false-positive rate [6].

Discussion
In this paper, we presented a new method for testing association when both case-parents trios and unrelated controls are available. Because parents are selected for having an affected child, we consider the characteristics of unaffected parents to be different from those of unrelated controls in case-control studies. Thus, the genotype information of parents was used only for $Z_{TDT}$ and not for $Z_{CC}$. By adjusting for the correlation between the two test statistics ($Z_{TDT}$ and $Z_{CC}$), we proposed a combined test statistic for analyzing such data.
For data with a large number of markers in a two-stage analysis, we considered several analytical approaches following the method by Skol et al. [6]. Even with a slightly larger threshold required, more SNPs near the major genes were found using the joint analysis in the second stage. Also, we noticed the choice of test for the first stage was important when two separate tests were used in the two stages, but when the joint analysis was used, the impact of which test was used first seemed to be less important. The added benefit of the joint analysis was rather minor compared to what was studied by Skol et al. [6] because the two tests for the first and the second stages were highly correlated even without using the joint analysis. Nevertheless, the joint analysis found slightly more significant SNPs and is robust against the choice of the first stage test. These properties suggest that the joint analysis would be desirable.
Our method can be generalized to data with missing genotypes by either imputing the missing genotypes based on partially available data [5,10], or by omitting cases without complete parental information from $Z_{TDT}$. In this situation, the correlation between $Z_{TDT}$ and $Z_{CC}$ will decrease, and therefore the advantage of the joint analysis could be accentuated. Complete justification, however, requires further study.

Conclusion
We presented a new method for testing association when data from both case-parents trios and unrelated controls are available. By deriving the correlation of test statistics for these two designs, we proposed a combined test as a joint analysis. In a two-stage analysis for testing a large number of markers, we found that the joint analysis detects more SNPs near the major genes than other methods that do not use the combined test in the second stage. This approach is also robust against the choice of the first stage test.