Joint analysis of case-parents trio and unrelated case-control designs in large scale association studies

Joo, Jungnam; Tian, Xin; Zheng, Gang; Stylianou, Mario; Lin, Jing-Ping; Geller, Nancy L

doi:10.1186/1753-6561-1-S1-S28

Volume 1 Supplement 1

Genetic Analysis Workshop 15: Gene Expression Analysis and Approaches to Detecting Multiple Functional Loci

Proceedings
Open access
Published: 18 December 2007

BMC Proceedings volume 1, Article number: S28 (2007)

Abstract

We present a new method for testing association when data from both case-parents trios and unrelated controls are available. Our method combines test statistics for case-parents trio and unrelated case-control studies by adjusting for the correlation that arises when the same set of cases is used for both tests. We further consider several analytical approaches for two-stage studies on a large number of markers, including methods based on the joint analysis. The performance of the proposed approaches is examined by analyzing the simulated data provided by the Genetic Analysis Workshop 15.

Background

Genetic association studies are a popular method to detect genetic markers associated with a complex human disease. Two common designs in genetic association studies are family-based designs using case-parents trios and population-based designs using unrelated cases and controls. The transmission disequilibrium test (TDT) is frequently used to analyze the case-parents trio data [1]. The TDT tests for both linkage and association and is not sensitive to population admixture and stratification. Using a likelihood approach, Schaid and Sommer [2] proposed TDT-type statistics that are more powerful than the TDT for a specific genetic model (see also [3]). For the unrelated case-control design, a linear trend test [4], which is often more powerful than the TDT based on case-parents trios, can be considered specifically when obtaining a sufficient number of trios is difficult.

Data that contain both case-parents trios and unrelated cases and controls on the same set of markers are increasingly available. Nagelkerke et al. [5] provided a few situations where such a mixture of case-parents trios and unrelated cases and controls can occur: 1) a case-parents trio design was originally considered and then unrelated controls were added, 2) a case-control design was originally considered and then the parents of the cases were added to confirm the findings. Such designs are typically analyzed in two stages, and strategies for analyzing this type of data while fully utilizing the given information are important.

In this paper, we study several approaches for testing genome-wide association in such situations. Based on the design, either a TDT-type statistic or a linear trend test will be used in the first stage to select a proportion of markers that will be tested in the second stage. The other test will then be applied in the second stage while controlling the genome-wide false positive rates by adjusting for the correlation with the first stage. Following a recently proposed method by Skol et al. [6], we also study a joint analysis for the second stage.

Methods

Consider a marker with two alleles, M and N, where M itself is a risk allele or is in linkage disequilibrium with a risk allele with frequency p, and N is a normal allele with frequency q = 1 - p. Penetrances are defined as the probabilities of disease conditional on the genotypes, that is, f₀ = Pr(disease|NN), f₁ = Pr(disease|NM), and f₂ = Pr(disease|MM). No association implies f₀ = f₁ = f₂, whereas f₀ ≤ f₁ ≤ f₂ with at least one strict inequality implies an association between the marker and the disease. Using f₀ as the baseline penetrance, the genotype relative risks are defined as ψ_i = f_i/f₀ for i = 1, 2. A genetic model is recessive, additive, or dominant when f₀ = f₁ (or ψ₁ = 1, ψ₂ = ψ), f₁ = (f₀ + f₂)/2 (or ψ₁ = ψ, ψ₂ = 2ψ - 1), or f₁ = f₂ (or ψ₁ = ψ₂ = ψ), respectively.
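The mapping from a genetic model to the pair (ψ₁, ψ₂) can be sketched in a few lines of Python. This is only an illustration of the definitions above; the function name `genotype_relative_risks` is hypothetical and does not come from the paper.

```python
def genotype_relative_risks(model: str, psi: float) -> tuple[float, float]:
    """Return (psi1, psi2) for a genetic model, given one parameter psi.

    Illustrative helper (the name is an assumption, not from the paper).
    psi1 = f1/f0 and psi2 = f2/f0 are the relative risks of NM and MM.
    """
    if model == "recessive":   # f0 = f1
        return 1.0, psi
    if model == "additive":    # f1 = (f0 + f2) / 2
        return psi, 2.0 * psi - 1.0
    if model == "dominant":    # f1 = f2
        return psi, psi
    raise ValueError(f"unknown model: {model!r}")
```

Under all three models, ψ = 1 gives ψ₁ = ψ₂ = 1, which is the null hypothesis of no association.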

Case-parents trio design

In the case-parents trio design, cases and their parents are selected from the population and their genotypes are obtained. There are six possible parental mating types for a marker with two alleles M and N: 1) MM × MM, 2) MM × NM, 3) MM × NN, 4) NM × NM, 5) NM × NN, and 6) NN × NN. These six mating types are given in the first column of Table 1. The second column provides case genotypes for each mating type, and the third column is the sample size of trios under each mating type. The probabilities of parental mating types can be calculated by assuming Hardy-Weinberg equilibrium (HWE), and the probability of each n_ij can then be obtained and is presented in the fourth column. The last column contains the probabilities of a case genotype given parental mating type (Schaid and Sommer [2]).

Table 1 Conditional probabilities of genotype given parental mating types and offspring disease status
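Since the body of Table 1 is not reproduced here, the following Python sketch shows how such probabilities can be computed from first principles. It assumes HWE and random mating in the parental population, Mendelian transmission, and that disease risk depends only on the case's own genotype. All function and variable names are illustrative, and the output is not claimed to match the layout of the published table.

```python
# Mendelian offspring genotype probabilities per mating type
# (inner key = number of M alleles carried by the offspring).
MENDEL = {
    "MMxMM": {2: 1.0},
    "MMxNM": {1: 0.5, 2: 0.5},
    "MMxNN": {1: 1.0},
    "NMxNM": {0: 0.25, 1: 0.5, 2: 0.25},
    "NMxNN": {0: 0.5, 1: 0.5},
    "NNxNN": {0: 1.0},
}

def mating_type_probs(p):
    """Population mating-type probabilities under HWE and random mating."""
    q = 1.0 - p
    return {
        "MMxMM": p**4, "MMxNM": 4 * p**3 * q, "MMxNN": 2 * p**2 * q**2,
        "NMxNM": 4 * p**2 * q**2, "NMxNN": 4 * p * q**3, "NNxNN": q**4,
    }

def case_genotype_given_mating(psi1, psi2):
    """P(case genotype | mating type, offspring affected)."""
    psi = {0: 1.0, 1: psi1, 2: psi2}
    out = {}
    for mt, probs in MENDEL.items():
        total = sum(pr * psi[g] for g, pr in probs.items())
        out[mt] = {g: pr * psi[g] / total for g, pr in probs.items()}
    return out

def trio_cell_probs(p, psi1, psi2):
    """Joint probability of (mating type, case genotype) among case trios."""
    psi = {0: 1.0, 1: psi1, 2: psi2}
    weights = {
        (mt, g): pm * MENDEL[mt][g] * psi[g]
        for mt, pm in mating_type_probs(p).items()
        for g in MENDEL[mt]
    }
    total = sum(weights.values())
    return {cell: w / total for cell, w in weights.items()}
```

The joint distribution has 10 cells, one for each n_ij category, and under ψ₁ = ψ₂ = 1 the conditional probabilities reduce to the Mendelian proportions.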

Schaid and Sommer [2] suggested an analysis conditional on parental mating types that provides unbiased estimates of genotype relative risks. Denote the likelihood function for a given model as L(ψ); the score test for H₀: ψ = 1 is then given by $\left. \frac{\partial \log L(ψ) / \partial ψ}{\{- \partial^{2} \log L(ψ) / \partial ψ^{2}\}^{1/2}} \right|_{ψ = 1}$.

Unrelated case-control design

For the unrelated case-control design, denote the counts of the three genotypes NN, NM, and MM as (r₀, r₁, r₂) in cases and (s₀, s₁, s₂) in controls, which follow the multinomial distributions mul(R: p₀, p₁, p₂) and mul(S: q₀, q₁, q₂), respectively. The null hypothesis of no association then implies p_i = q_i for each i.

Sasieni [4] proposed a method that uses the marker genotype as a covariate in the logistic regression model where the genotype is coded by increasing scores, that is, 0, x, and 1 for NN, NM, and MM, where 0 ≤ x ≤ 1. The optimal scores for recessive, additive, and dominant models are x = 0, 1/2, and 1 [4, 7], and the trend test [7] is given by $Z_{CC} = \frac{U(x)}{\sqrt{\mathrm{Var}(U(x))}}$, where $U(x) = \sum_{i = 0}^{2} x_{i} (1 - R/N) r_{i} - \sum_{i = 0}^{2} x_{i} (R/N) s_{i}$ and $\mathrm{Var}(U(x)) = N^{-1} R S \left\{ \sum_{i} x_{i}^{2} p_{i} - \left( \sum_{i} x_{i} p_{i} \right)^{2} \right\}$ for (x₀, x₁, x₂) = (0, x, 1) and N = R + S. Under the null hypothesis, Z_CC asymptotically follows the standard normal distribution.
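As a concrete illustration, the trend statistic can be computed directly from the two genotype count vectors. The sketch below follows the formulas for U(x) and Var(U(x)) given above. It assumes that the null genotype probabilities p_i are estimated by the pooled frequencies (r_i + s_i)/N, which is one common choice and is not spelled out in the text. The function name `trend_test` is illustrative.

```python
import math

def trend_test(cases, controls, x=0.5):
    """Trend statistic Z_CC for genotype counts ordered (NN, NM, MM).

    cases = (r0, r1, r2), controls = (s0, s1, s2); scores are (0, x, 1).
    Assumption: null genotype probabilities are estimated by pooling
    cases and controls.
    """
    scores = (0.0, x, 1.0)
    R, S = sum(cases), sum(controls)
    N = R + S
    # U(x) = sum_i x_i (1 - R/N) r_i - sum_i x_i (R/N) s_i
    u = sum(xi * ((1 - R / N) * r - (R / N) * s)
            for xi, r, s in zip(scores, cases, controls))
    pooled = [(r + s) / N for r, s in zip(cases, controls)]
    mean = sum(xi * pi for xi, pi in zip(scores, pooled))
    second = sum(xi**2 * pi for xi, pi in zip(scores, pooled))
    var = R * S / N * (second - mean**2)
    return u / math.sqrt(var)
```

Setting `x` to 0, 0.5, or 1 gives the scores that are optimal for the recessive, additive, and dominant models, respectively.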

Combined test of Z_TDT and Z_CC

Because the cases used in Z_TDT and Z_CC overlap, results from the two tests are correlated, and this correlation, ρ, must be considered when obtaining a combined test. By noting that both tests are functions of a multinomial random variable n with dimension 10 for the 10 n_ij categories from Table 1, the correlation between Z_TDT and Z_CC can be obtained given a specific genetic model (Appendix). The probability of each category can be consistently estimated by the observed counts, and ρ can be consistently estimated by the sample correlation between Z_TDT and Z_CC.

We propose the weighted average, $Z_{joint} = \frac{\sqrt{w_{1}} Z_{TDT} + \sqrt{w_{2}} Z_{CC}}{\sqrt{w_{1} + w_{2} + 2 \sqrt{w_{1} w_{2}} ρ}}$, as a test statistic in a joint analysis. We consider a uniform weight, that is, w₁ = w₂ = 1 [8, 9], for simplicity. Other choices of weight, such as a weight proportional to the number of informative cases used in each test, can also be considered.
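The weighted average is simple to compute once the two statistics and ρ are available. The following Python sketch implements the formula above; the function name `joint_statistic` is illustrative. The denominator is the standard deviation of the numerator when both statistics have unit variance and correlation ρ, so the result again has unit variance under the null hypothesis.

```python
import math

def joint_statistic(z_tdt, z_cc, rho, w1=1.0, w2=1.0):
    """Weighted combination of Z_TDT and Z_CC with correlation rho."""
    numerator = math.sqrt(w1) * z_tdt + math.sqrt(w2) * z_cc
    # Var(numerator) = w1 + w2 + 2 * sqrt(w1 * w2) * rho under the null.
    denominator = math.sqrt(w1 + w2 + 2.0 * math.sqrt(w1 * w2) * rho)
    return numerator / denominator
```

With uniform weights and ρ close to 1, the joint statistic is close to the average of the two tests; with ρ = 0 it reduces to (Z_TDT + Z_CC)/√2.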

Two-stage method in large scale association studies

To test K markers in a two-stage analysis, we consider four strategies that use either Z_TDT or Z_CC in the first stage based on the intended design, and in each situation, the other test or the joint test is applied in the second stage. As in Skol et al. [6], we obtain thresholds C₁ and C₂ (or C_joint) for the two stages in each strategy by controlling the genome-wide significance level at α. C₁ can be obtained as C₁ = Φ⁻¹(1 - π₁/2), where π₁ is the proportion of markers selected in the first stage. On the other hand, C_joint and C₂ need to be calculated iteratively so that they satisfy

$\sum_{i = 1}^{K} P(|Z_{1i}| > C_{1}, |Z_{joint,i}| > C_{joint}) = α$ (1)

or

$\sum_{i = 1}^{K} P(|Z_{1i}| > C_{1}, |Z_{2i}| > C_{2}, Z_{1i} Z_{2i} > 0) = α$ (2)

when the joint analysis is used or when the other test is used, respectively. Here, Z_1i and Z_2i denote the tests used in the first and the second stage for the i-th SNP (Z_2i is replaced by Z_joint,i when the joint analysis is used in the second stage). We need the subscript i because the correlations between the two tests for different SNPs are generally not the same. Under HWE, however, we can show that this correlation is a constant (Appendix), and these equations can then be simplified to P(|Z₁| > C₁, |Z_joint| > C_joint) = α/K and P(|Z₁| > C₁, |Z₂| > C₂, Z₁Z₂ > 0) = α/K.
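The simplified equations can be solved numerically. The following Python sketch is one way to do so and is not the authors' implementation. It assumes that the pair of statistics is standard bivariate normal under the null hypothesis, uses Simpson's rule for the bivariate tail probability, and finds the threshold by bisection. With uniform weights, the correlation between the first-stage test and Z_joint is √((1 + ρ)/2), which follows from the definition of Z_joint. The example values K = 9187, R = 1500, and N = 3500 and the HWE form ρ = √(1 - R/N) are assumptions taken from the Data section and the Appendix.

```python
import math
from statistics import NormalDist

STD = NormalDist()  # standard normal distribution

def upper_orthant(c1, c2, rho, n=2000, upper=10.0):
    """P(Z1 > c1, Z2 > c2) for a standard bivariate normal, correlation rho.

    Integrates phi(z) * P(Z2 > c2 | Z1 = z) over [c1, upper] by Simpson's
    rule (n must be even; the tail beyond `upper` is negligible).
    """
    s = math.sqrt(1.0 - rho * rho)
    def f(z):
        return STD.pdf(z) * STD.cdf((rho * z - c2) / s)
    h = (upper - c1) / n
    total = f(c1) + f(upper)
    for k in range(1, n):
        total += (4 if k % 2 else 2) * f(c1 + k * h)
    return total * h / 3.0

def replication_rate(c1, c2, rho):
    """P(|Z1| > c1, |Z2| > c2, Z1 * Z2 > 0), as in the simplified Eq. (2)."""
    return 2.0 * upper_orthant(c1, c2, rho)

def joint_rate(c1, c_joint, rho):
    """P(|Z1| > c1, |Z_joint| > c_joint), as in the simplified Eq. (1)."""
    r = math.sqrt((1.0 + rho) / 2.0)  # corr(Z1, Z_joint) when w1 = w2 = 1
    return 2.0 * (upper_orthant(c1, c_joint, r)
                  + upper_orthant(c1, c_joint, -r))

def solve_threshold(rate, target, lo=0.0, hi=10.0, iters=60):
    """Bisection for c with rate(c) = target (rate decreases in c)."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if rate(mid) > target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

K, alpha, pi1 = 9187, 0.05, 0.1
c1 = STD.inv_cdf(1.0 - pi1 / 2.0)    # first-stage threshold
rho = math.sqrt(1.0 - 1500 / 3500)   # assumed HWE correlation (Appendix)
c2 = solve_threshold(lambda c: replication_rate(c1, c, rho), alpha / K)
c_joint = solve_threshold(lambda c: joint_rate(c1, c, rho), alpha / K)
print(f"C1 = {c1:.4f}, C2 = {c2:.4f}, C_joint = {c_joint:.4f}")
```

How closely the printed thresholds agree with those reported in the Results depends on the assumptions listed above. Because a higher correlation with the first-stage test increases the joint tail probability, C_joint is necessarily larger than C₂ in this setup.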

Data

The Genetic Analysis Workshop 15 provided simulated rheumatoid arthritis data that contain 1500 families with affected sib pairs and their parents, and 2000 unrelated controls, genotyped on 9187 SNPs distributed throughout the genome. We used the first simulated data set and randomly selected one sibling from each affected sib pair for data analysis. The minor allele frequencies of all 9187 SNPs were greater than 1%.

Results

To apply the two-stage analysis, we first obtained the thresholds for each strategy using π₁ = 0.1 (C₁ = 1.6449) and Eqs. (1) and (2). We thus control the genome-wide false-positive rate at 0.05, and we define a "significant" SNP as one whose test statistic exceeds the threshold in both stages. As expected, a slightly larger threshold for the second stage is required for the joint analysis to control the same genome-wide false-positive rate (C₂ = 4.5121 vs. C_joint = 4.5470) [6]. Table 2 summarizes results based on an additive genetic model. The chromosome, SNP name, and distance from the nearest major gene are listed in the first three columns of the table; the last three columns list the p-value from the second stage if the SNP was selected by the specified method.

Even with the larger threshold, the joint analysis in the second stage found more significant SNPs near the major genes. Moreover, when the joint analysis was used in the second stage, the same set of significant SNPs was found regardless of the choice of the test statistic in the first stage. In contrast, different results were obtained when either Z_TDT or Z_CC was used in the first stage followed by the other test in the second stage. Specifically, the joint analysis using either Z_TDT or Z_CC in the first stage found 18 significant SNPs, of which 9 were located within 1 Mb (bold) and 14 within 5 Mb (italic) of the major genes. When Z_TDT in the first stage was followed by Z_CC in the second stage, we found 17 significant SNPs, of which 8 were located within 1 Mb of the causal genes and 13 within 5 Mb. These methods found SNPs near the major genes on chromosomes 6, 11, and 18. On the other hand, when we used Z_CC in the first stage followed by Z_TDT in the second stage, a total of 10 significant SNPs were found: 7 were located within 1 Mb of a major gene, only on chromosomes 6 and 11, and all 10 were located within 5 Mb. A SNP near the major gene on chromosome 18 was not found by this method.

Table 2 Two-stage analysis: selected SNPs and their corresponding p-values in the second stage

When we applied these three tests (Z_CC, Z_TDT, Z_joint) in a single-stage analysis, each test found the same set of SNPs identified in the two-stage analysis that used the corresponding test in the second stage. That is, Z_CC, Z_TDT, and Z_joint in a single-stage analysis found the 17, 10, and 18 SNPs in columns 4, 5, and 6 of Table 2, respectively. This implies that a two-stage analysis can maintain power with a substantially reduced genotyping cost while controlling the same genome-wide false-positive rate [6].

Discussion

In this paper, we presented a new method for testing association when both case-parents trios and unrelated controls are available. Because parents are selected for having an affected child, we consider the characteristics of non-affected parents to be different from those of unrelated controls in case-control studies. Thus, the genotype information of parents was used only for Z_TDT and not for Z_CC. By adjusting for the correlation between the two test statistics (Z_TDT and Z_CC), we proposed a combined test statistic for analyzing such data.

For data with a large number of markers in a two-stage analysis, we considered several analytical approaches following the method by Skol et al. [6]. Even with a slightly larger threshold required, more SNPs near the major genes were found using the joint analysis in the second stage. Also, we noticed the choice of test for the first stage was important when two separate tests were used in the two stages, but when the joint analysis was used, the impact of which test was used first seemed to be less important. The added benefit of the joint analysis was rather minor compared to what was studied by Skol et al. [6] because the two tests for the first and the second stages were highly correlated even without using the joint analysis. Nevertheless, the joint analysis found slightly more significant SNPs and is robust against the choice of the first stage test. These properties suggest that the joint analysis would be desirable.

Our method can be generalized to data with missing genotypes by either imputing the missing genotypes based on partially available data [5, 10], or by omitting cases without complete parental information from Z_TDT. In this situation, the correlation between Z_TDT and Z_CC will decrease, and therefore, the advantage of the joint analysis could be accentuated. Complete justification, however, requires further study.

Conclusion

We presented a new method for testing association when data from both case-parents trios and unrelated controls are available. By deriving the correlation of test statistics for these two designs, we proposed a combined test as a joint analysis. In a two-stage analysis for testing a large number of markers, we found that the joint analysis detects more SNPs near the major genes than other methods that do not use the combined test in the second stage. This approach is also robust against the choice of the first stage test.

Appendix

When the conditional likelihood is used for Z_TDT, n₁ = n₁₂, n₂ = (n₂₁, n₂₂), n₃ = n₃₁, n₄ = (n₄₀, n₄₁, n₄₂), n₅ = (n₅₀, n₅₁), and n₆ = n₆₀ are independent random variables conditional on parental mating types (m), where n₂ and n₅ follow binomial distributions and n₄ follows a trinomial distribution with probabilities given in column 5 of Table 1 [2]. The score test for H₀: ψ = 1 is then written as $Z_{TDT} = \frac{U_{T}(n) - E(U_{T}(n) | m)}{\sqrt{\mathrm{Var}(U_{T}(n) | m)}}$, where U_T(n) = n₂₂ + n₄₂, n₂₂ + n₄₂ + 0.5(n₂₁ + n₄₁ + n₅₁), and n₄₂ + n₄₁ + n₅₁ for the recessive, additive, and dominant models, respectively. By applying the variance decomposition formula, we obtain the correlation between Z_TDT and Z_CC as $(1 - R/N) \frac{E(\sqrt{\mathrm{Var}(U_{T}(n) | m)})}{\sqrt{\mathrm{Var}(U(x))}}$. An additional distributional assumption needs to be made for parental genotypes. We considered the six parental mating types as a six-dimensional multinomial distribution, and the corresponding probabilities were consistently estimated by the observed counts.
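The standardized form of Z_TDT above can be evaluated directly from the trio counts. The Python sketch below is an illustration under stated assumptions: it scores the case genotypes NN, NM, and MM as (0, x, 1) and uses the Mendelian proportions as the null distribution within each informative mating type (2, 4, and 5). With x = 0, 0.5, and 1 this gives the same statistic as the U_T(n) listed above, because terms that are constant given the mating types cancel after centering. The function name and the dictionary layout are hypothetical.

```python
import math

# Null distribution of the case genotype (number of M alleles) within
# each informative mating type: 2 = MM x NM, 4 = NM x NM, 5 = NM x NN.
NULL_PROBS = {
    2: {1: 0.5, 2: 0.5},
    4: {0: 0.25, 1: 0.5, 2: 0.25},
    5: {0: 0.5, 1: 0.5},
}

def tdt_score_test(counts, x=0.5):
    """Z_TDT from counts[(mating_type, n_M_alleles)] = n_ij.

    x = 0, 0.5, 1 corresponds to the recessive, additive, and dominant
    models. Missing cells are treated as zero counts.
    """
    scores = {0: 0.0, 1: x, 2: 1.0}
    u = expected = variance = 0.0
    for mt, probs in NULL_PROBS.items():
        n_mt = sum(counts.get((mt, g), 0) for g in probs)
        mean = sum(scores[g] * pr for g, pr in probs.items())
        second = sum(scores[g] ** 2 * pr for g, pr in probs.items())
        u += sum(scores[g] * counts.get((mt, g), 0) for g in probs)
        expected += n_mt * mean               # E(U_T | m)
        variance += n_mt * (second - mean**2) # Var(U_T | m)
    if variance == 0.0:
        raise ValueError("no informative trios for this model")
    return (u - expected) / math.sqrt(variance)
```

For example, with only NM × NN trios and the additive scores, the statistic reduces to the familiar TDT form (n₅₁ - n₅₀)/√(n₅₀ + n₅₁).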

Under HWE, we can show that the correlation for the three models can be simplified to $\sqrt{1 - R/N}$ when all cases have parental genotypes available. When only a proportion of cases overlaps between the case-parents and case-control designs, we can introduce an additional parameter η < 1 such that $\sum_{ij} n_{ij} = η R$, and the correlation between Z_TDT and Z_CC is reduced to $η \sqrt{1 - R/N}$.
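As a quick numerical illustration of these expressions, the sketch below evaluates the correlation for given sample sizes. The function name is hypothetical, and the example values of 1500 cases and 2000 controls are an assumption based on the Data section, where one sibling is drawn from each of the 1500 families.

```python
import math

def tdt_cc_correlation(n_cases, n_controls, eta=1.0):
    """Correlation of Z_TDT and Z_CC under HWE: eta * sqrt(1 - R/N).

    eta is the proportion of cases that also have parental genotypes.
    """
    total = n_cases + n_controls
    return eta * math.sqrt(1.0 - n_cases / total)

# Assumed design: 1500 cases (all with parents) and 2000 controls.
rho = tdt_cc_correlation(1500, 2000)
print(f"rho = {rho:.4f}")  # prints rho = 0.7559
```

Adding more controls moves R/N toward 0 and the correlation toward η, so the two tests share more information as the control group grows.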

References

1. Spielman RS, McGinnis R, Ewens WJ: Transmission test for linkage disequilibrium: the insulin gene region and insulin-dependent diabetes mellitus (IDDM). Am J Hum Genet. 1993, 52: 506-516.
2. Schaid DJ, Sommer SS: Genotype relative risks: methods for design and analysis of candidate-gene association studies. Am J Hum Genet. 1993, 53: 1114-1126.
3. Zheng G, Freidlin B, Gastwirth JL: Robust TDT-type candidate-gene association tests. Ann Hum Genet. 2002, 66: 145-155. 10.1046/j.1469-1809.2002.00104.x.
4. Sasieni PD: From genotypes to genes: doubling the sample size. Biometrics. 1997, 53: 1253-1261. 10.2307/2533494.
5. Nagelkerke NJD, Hoebee B, Teunis P, Kimman TG: Combining the transmission disequilibrium test and case-control methodology using generalized logistic regression. Eur J Hum Genet. 2004, 12: 964-970. 10.1038/sj.ejhg.5201255.
6. Skol AD, Scott LJ, Abecasis GR, Boehnke M: Joint analysis is more efficient than replication-based analysis for two-stage genome-wide association studies. Nat Genet. 2006, 38: 209-213. 10.1038/ng1706.
7. Zheng G, Freidlin B, Li Z, Gastwirth JL: Choice of scores in trend tests for case-control studies of candidate-gene associations. Biometrical J. 2003, 45: 335-348. 10.1002/bimj.200390016.
8. O'Brien PC: Procedures for comparing samples with multiple endpoints. Biometrics. 1984, 40: 1079-1087. 10.2307/2531158.
9. Tang DI, Geller NL, Pocock SJ: On the design and analysis of randomized clinical trials with multiple endpoints. Biometrics. 1993, 49: 23-30. 10.2307/2532599.
10. Weinberg CR: Allowing for missing parents in genetic studies of case-parent triads. Am J Hum Genet. 1999, 64: 1186-1193. 10.1086/302337.

Acknowledgements

This article has been published as part of BMC Proceedings Volume 1 Supplement 1, 2007: Genetic Analysis Workshop 15: Gene Expression Analysis and Approaches to Detecting Multiple Functional Loci. The full contents of the supplement are available online at http://www.biomedcentral.com/1753-6561/1?issue=S1.

Author information

Authors and Affiliations

Office of Biostatistics Research, National Heart, Lung and Blood Institute, 6701 Rockledge Drive, MSC 7913, Bethesda, MD, 20892-7913, USA
Jungnam Joo, Xin Tian, Gang Zheng, Mario Stylianou, Jing-Ping Lin & Nancy L Geller

Corresponding author

Correspondence to Jungnam Joo.

Additional information

Competing interests

The author(s) declare that they have no competing interests.

Rights and permissions

This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

About this article

Cite this article

Joo, J., Tian, X., Zheng, G. et al. Joint analysis of case-parents trio and unrelated case-control designs in large scale association studies. BMC Proc 1 (Suppl 1), S28 (2007). https://doi.org/10.1186/1753-6561-1-S1-S28
