# Joint analysis of case-parents trio and unrelated case-control designs in large scale association studies

*BMC Proceedings* **volume 1**, Article number: S28 (2007)

## Abstract

We present a new method for testing association when data from both case-parents trios and unrelated controls are available. Our method combines test statistics for case-parents trio and unrelated case-control studies by adjusting for the correlation that arises when the same set of cases is used for both tests. We further consider several analytical approaches for two-stage studies on a large number of markers, including methods based on the joint analysis. The performance of the proposed approaches is examined by analyzing the simulated data provided by the Genetic Analysis Workshop 15.

## Background

Genetic association studies are a popular method to detect genetic markers associated with a complex human disease. Two common designs in genetic association studies are family-based designs using case-parents trios and population-based designs using unrelated cases and controls. The transmission disequilibrium test (TDT) is frequently used to analyze case-parents trio data [1]. The TDT tests for both linkage and association and is not sensitive to population admixture and stratification. Using a likelihood approach, Schaid and Sommer [2] proposed TDT-type statistics that are more powerful than the TDT for a specific genetic model (see also [3]). For the unrelated case-control design, a linear trend test [4], which is often more powerful than the TDT based on case-parents trios, can be considered, particularly when obtaining a sufficient number of trios is difficult.

Data that contain both case-parents trios and unrelated cases and controls on the same set of markers are increasingly available. Nagelkerke et al. [5] described situations in which such a mixture of case-parents trios and unrelated cases and controls can occur: 1) a case-parents trio design was originally considered and unrelated controls were added later, and 2) a case-control design was originally considered and the parents of the cases were added later to confirm the findings. Such designs are typically analyzed in two stages, so strategies that fully utilize the available information when analyzing this type of data are important.

In this paper, we study several approaches for testing genome-wide association in such situations. Based on the design, either a TDT-type statistic or a linear trend test will be used in the first stage to select a proportion of markers that will be tested in the second stage. The other test will then be applied in the second stage while controlling the genome-wide false positive rates by adjusting for the correlation with the first stage. Following a recently proposed method by Skol et al. [6], we also study a joint analysis for the second stage.

## Methods

Consider a marker with two alleles, M and N, where M itself is a risk allele or is in linkage disequilibrium with a risk allele with frequency *p*, and N is a normal allele with frequency *q* = 1 - *p*. Penetrances are defined as the probabilities of disease conditional on the genotypes, that is, *f*_{0} = Pr(disease|NN), *f*_{1} = Pr(disease|NM), and *f*_{2} = Pr(disease|MM). No association implies *f*_{0} = *f*_{1} = *f*_{2}, whereas *f*_{0} ≤ *f*_{1} ≤ *f*_{2} with at least one strict inequality implies there is an association between the marker and a disease. Using *f*_{0} as a baseline penetrance, the genotype relative risks are defined as *ψ*_{i} = *f*_{i}/*f*_{0} for *i* = 1, 2. A genetic model is recessive, additive, or dominant when *f*_{0} = *f*_{1} (or *ψ*_{1} = 1, *ψ*_{2} = *ψ*), *f*_{1} = (*f*_{0} + *f*_{2})/2 (or *ψ*_{1} = *ψ*, *ψ*_{2} = 2*ψ*-1), or *f*_{1} = *f*_{2} (or *ψ*_{1} = *ψ*_{2} = *ψ*), respectively.
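The three one-parameter models above can be written out as a small helper. This is a minimal sketch: the function names `genotype_relative_risks` and `penetrances` are hypothetical and chosen only for illustration.

```python
def genotype_relative_risks(psi, model):
    """Return (psi1, psi2) implied by a single relative-risk parameter psi."""
    if model == "recessive":   # f0 = f1
        return 1.0, psi
    if model == "additive":    # f1 = (f0 + f2) / 2
        return psi, 2.0 * psi - 1.0
    if model == "dominant":    # f1 = f2
        return psi, psi
    raise ValueError(f"unknown model: {model!r}")


def penetrances(f0, psi, model):
    """Return (f0, f1, f2) using f_i = psi_i * f0."""
    psi1, psi2 = genotype_relative_risks(psi, model)
    return f0, f0 * psi1, f0 * psi2
```

For example, `penetrances(0.1, 1.5, "additive")` gives penetrances whose middle value is the average of the outer two, which is the defining property of the additive model.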

### Case-parents trio design

In the case-parents trio design, cases and their parents are selected from the population and their genotypes are obtained. There are six possible parental mating types for a marker with two alleles M and N: 1) MM × MM, 2) MM × NM, 3) MM × NN, 4) NM × NM, 5) NM × NN, and 6) NN × NN. These six mating types are given in the first column of Table 1. The second column provides case genotypes for each mating type, and the third column is the sample size of trios under each mating type. The probabilities of parental mating types can be calculated by assuming Hardy-Weinberg equilibrium (HWE), and the probability of each *n*_{ij} can then be obtained and is presented in the fourth column. The last column contains the probabilities of a case genotype given parental mating type (Schaid and Sommer [2]).
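The cell probabilities described above can be sketched numerically. The code below is not a reproduction of Table 1; it is a derivation under stated assumptions: random mating with HWE in the parental generation, Mendelian transmission, and trios ascertained through an affected child. The names `mating_type_probs`, `OFFSPRING`, and `trio_cell_probs` are hypothetical, and the index *j* is taken to be the number of M alleles carried by the case.

```python
def mating_type_probs(p):
    """Population probabilities of the six unordered mating types under HWE."""
    q = 1.0 - p
    return {
        1: p**4,             # MM x MM
        2: 4 * p**3 * q,     # MM x NM
        3: 2 * p**2 * q**2,  # MM x NN
        4: 4 * p**2 * q**2,  # NM x NM
        5: 4 * p * q**3,     # NM x NN
        6: q**4,             # NN x NN
    }


# Mendelian offspring genotype probabilities; inner key = number of M alleles.
OFFSPRING = {
    1: {2: 1.0},
    2: {1: 0.5, 2: 0.5},
    3: {1: 1.0},
    4: {0: 0.25, 1: 0.5, 2: 0.25},
    5: {0: 0.5, 1: 0.5},
    6: {0: 1.0},
}


def trio_cell_probs(p, psi1, psi2):
    """P(mating type i, case genotype j | child affected) for the 10 cells.

    The baseline penetrance f0 cancels in the normalisation, so only the
    genotype relative risks are needed.
    """
    rr = {0: 1.0, 1: psi1, 2: psi2}
    mating = mating_type_probs(p)
    weights = {
        (i, j): mating[i] * t * rr[j]
        for i, row in OFFSPRING.items()
        for j, t in row.items()
    }
    total = sum(weights.values())
    return {cell: w / total for cell, w in weights.items()}
```

Under the null hypothesis (`psi1 = psi2 = 1`) the cell probabilities reduce to the mating-type probability times the Mendelian transmission probability.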

Schaid and Sommer [2] suggested an analysis conditional on parental mating types that provides unbiased estimates of genotype relative risks. Denote the likelihood function for a given model by *L*(*ψ*). The score test statistic for *H*_{0}: *ψ* = 1, which we denote by *Z*_{TDT}, is then $\left.\frac{\partial \log L(\psi)/\partial \psi}{\{-\partial^{2}\log L(\psi)/\partial \psi^{2}\}^{1/2}}\right|_{\psi=1}$.

### Unrelated case-control design

For the unrelated case-control design, denote the genotype counts of three genotypes NN, MN and MM as (*r*_{0}, *r*_{1}, *r*_{2}) in cases and (*s*_{0}, *s*_{1}, *s*_{2}) in controls that follow multinomial distributions *mul*(*R*: *p*_{0}, *p*_{1}, *p*_{2}) and *mul*(*S*: *q*_{0}, *q*_{1}, *q*_{2}). Then the null hypothesis of no association implies *p*_{i} = *q*_{i} for each *i*.

Sasieni [4] proposed a method that uses the marker genotype as a covariate in the logistic regression model where the genotype is coded by increasing scores, that is, 0, *x*, and 1 for NN, NM, and MM, where 0 ≤ *x* ≤ 1. The optimal scores for recessive, additive and dominant models are *x* = 0, 1/2, and 1, respectively [4, 7], and the trend test [7] is given by $Z_{CC}=\frac{U(x)}{\sqrt{\mathrm{Var}(U(x))}}$, where $U(x)=\sum_{i=0}^{2}x_{i}(1-R/N)r_{i}-\sum_{i=0}^{2}x_{i}(R/N)s_{i}$ and $\mathrm{Var}(U(x))=N^{-1}RS\left\{\sum_{i}x_{i}^{2}p_{i}-\left(\sum_{i}x_{i}p_{i}\right)^{2}\right\}$ for (*x*_{0}, *x*_{1}, *x*_{2}) = (0, *x*, 1) and *N* = *R*+*S*. Under the null hypothesis, *Z*_{CC} asymptotically follows the standard normal distribution.
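The trend statistic can be computed directly from the two genotype count vectors. The sketch below makes one assumption that the text does not spell out: the null genotype probabilities *p*_{i} are estimated by the pooled frequencies (*r*_{i} + *s*_{i})/*N*. The function name `trend_test` is hypothetical.

```python
import math


def trend_test(cases, controls, x=0.5):
    """Trend statistic Z_CC for genotype counts ordered (NN, NM, MM).

    x is the score of the heterozygote: 0 (recessive), 0.5 (additive),
    or 1 (dominant).
    """
    scores = (0.0, x, 1.0)
    R, S = sum(cases), sum(controls)
    N = R + S
    # U(x) = sum_i x_i (1 - R/N) r_i - sum_i x_i (R/N) s_i
    u = sum(xi * (1 - R / N) * ri for xi, ri in zip(scores, cases)) - sum(
        xi * (R / N) * si for xi, si in zip(scores, controls)
    )
    # Assumption: p_i estimated by pooled genotype frequencies under H0.
    pooled = [(r + s) / N for r, s in zip(cases, controls)]
    mean = sum(xi * pi for xi, pi in zip(scores, pooled))
    second = sum(xi**2 * pi for xi, pi in zip(scores, pooled))
    variance = R * S / N * (second - mean**2)
    if variance <= 0:
        raise ValueError("score variance is zero; the marker is uninformative")
    return u / math.sqrt(variance)
```

When cases and controls have identical genotype proportions, the statistic is zero, as expected under no association.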

### Combined test of *Z*_{TDT} and *Z*_{CC}

Because the cases used in *Z*_{TDT} and *Z*_{CC} overlap, results from the two tests are correlated, and this correlation, *ρ*, must be considered when obtaining a combined test. By noting that both tests are functions of a multinomial random variable *n* with dimension 10 for the 10 *n*_{ij} categories from Table 1, the correlation between *Z*_{TDT} and *Z*_{CC} can be obtained given a specific genetic model (Appendix). The probability of each category can be consistently estimated by the observed counts and *ρ* can be consistently estimated by the sample correlation between *Z*_{TDT} and *Z*_{CC}.

We propose the weighted average, ${Z}_{\text{joint}}=\frac{\sqrt{{w}_{1}}{Z}_{TDT}+\sqrt{{w}_{2}}{Z}_{CC}}{\sqrt{({w}_{1}+{w}_{2}+2\sqrt{{w}_{1}{w}_{2}}\rho )}}$, as a test statistic in a joint analysis. We consider a uniform weight, that is, *w*_{1} = *w*_{2} = 1 [8, 9] for simplicity. Other choices of weight, such as a weight proportional to the number of informative cases used in each test, can also be considered.
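The weighted average is straightforward to evaluate once *ρ* is available. In the sketch below, the function name `z_joint` is hypothetical; the denominator is the standard deviation of the numerator when both inputs have unit variance and correlation *ρ*, which is why the combined statistic is again standard normal under the null hypothesis.

```python
import math


def z_joint(z_tdt, z_cc, rho, w1=1.0, w2=1.0):
    """Weighted combination of Z_TDT and Z_CC adjusted for their correlation."""
    numerator = math.sqrt(w1) * z_tdt + math.sqrt(w2) * z_cc
    denominator = math.sqrt(w1 + w2 + 2.0 * math.sqrt(w1 * w2) * rho)
    return numerator / denominator
```

With the uniform weights used in the text, `z_joint(z_tdt, z_cc, rho)` reduces to `(z_tdt + z_cc) / sqrt(2 + 2 * rho)`.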

### Two-stage method in large scale association studies

To test *K* markers in a two-stage analysis, we consider four strategies that use either *Z*_{TDT} or *Z*_{CC} in the first stage based on the intended design, and in each situation, the other test or the joint test is applied in the second stage. As in Skol et al. [6], we obtain thresholds *C*_{1} and *C*_{2} (or *C*_{joint}) for two stages in each strategy by controlling the genome-wide significance level at *α*. *C*_{1} can be obtained as *C*_{1} = Φ^{-1}(1 - *π*_{1}/2), where *π*_{1} is the proportion of markers selected in the first stage. On the other hand, *C*_{joint} and *C*_{2} need to be calculated iteratively so that they satisfy

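$\sum_{i=1}^{K}P(|Z_{1i}| > C_{1}, |Z_{\text{joint}i}| > C_{\text{joint}}) = \alpha$ (1)

or

$\sum_{i=1}^{K}P(|Z_{1i}| > C_{1}, |Z_{2i}| > C_{2}, Z_{1i}Z_{2i} > 0) = \alpha$ (2)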

when the joint analysis is used or when the other test is used, respectively. Here, *Z*_{1i} and *Z*_{2i} denote the tests used in the first and the second stage for the *i*^{th} SNP (*Z*_{2i} is replaced by *Z*_{joint*i*} when the joint analysis is used in the second stage). We need the subscript *i* because the correlations between two tests for different SNPs are generally not the same. Under HWE, however, we can show this correlation is a constant (Appendix), and these equations can then be simplified to *P*(|*Z*_{1}| > *C*_{1}, |*Z*_{joint}| > *C*_{joint}) = *α*/*K* and *P*(|*Z*_{1}| > *C*_{1}, |*Z*_{2}| > *C*_{2}, *Z*_{1}*Z*_{2} > 0) = *α*/*K*.
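The threshold search in the simplified (constant-correlation) case can be sketched with the standard library alone. This is an illustration, not the authors' implementation: it assumes the two statistics are standard bivariate normal under the null hypothesis, evaluates the quadrant probability by Simpson's rule, and finds the second-stage threshold by bisection on a bracket of [0, 10]. All function names are hypothetical.

```python
import math
from statistics import NormalDist


def upper_tail(x):
    """P(Z > x) for standard normal Z, via erfc for accuracy in the far tail."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def first_stage_threshold(pi1):
    """C1 = Phi^{-1}(1 - pi1 / 2)."""
    return NormalDist().inv_cdf(1.0 - pi1 / 2.0)


def quadrant_prob(c1, c2, rho, steps=2000):
    """P(Z1 > c1, Z2 > c2), using Z2 | Z1 = z ~ N(rho * z, 1 - rho**2).

    Assumes c1 >= 0 and an even number of Simpson steps.
    """
    if not -1.0 < rho < 1.0:
        raise ValueError("rho must lie strictly between -1 and 1")
    sd = math.sqrt(1.0 - rho * rho)

    def integrand(z):
        density = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
        return density * upper_tail((c2 - rho * z) / sd)

    width = 12.0  # the normal density is negligible beyond c1 + 12
    h = width / steps
    total = integrand(c1) + integrand(c1 + width)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * integrand(c1 + k * h)
    return total * h / 3.0


def two_stage_prob(c1, c2, rho, same_sign):
    """same_sign=True:  P(|Z1| > c1, |Z2| > c2, Z1 * Z2 > 0).
    same_sign=False: P(|Z1| > c1, |Z2| > c2)."""
    prob = 2.0 * quadrant_prob(c1, c2, rho)
    if not same_sign:
        # Opposite-sign quadrants: replace Z2 by -Z2, which flips rho.
        prob += 2.0 * quadrant_prob(c1, c2, -rho)
    return prob


def second_stage_threshold(c1, rho, alpha, n_markers, same_sign=True):
    """Bisection for the threshold whose per-marker probability is alpha / K."""
    target = alpha / n_markers
    lo, hi = 0.0, 10.0  # assumed bracket for the threshold
    for _ in range(50):
        mid = 0.5 * (lo + hi)
        if two_stage_prob(c1, mid, rho, same_sign) > target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

For the strategy that applies the other test in the second stage, one would call `second_stage_threshold(c1, rho, alpha, K)` with `rho` the correlation between the two tests. For the joint analysis with uniform weights, the relevant correlation is between the first-stage test and the joint statistic, which works out to `math.sqrt((1 + rho) / 2)` (a derivation made here, not stated in the text), with `same_sign=False`.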

### Data

The Genetic Analysis Workshop 15 provided simulated rheumatoid arthritis data that contain 1500 families with affected sib pairs and their parents, and 2000 unrelated controls on 9187 SNPs distributed throughout the genome. We used the first simulated data set and randomly selected one sibling from each affected sib pair for data analysis. The minor allele frequencies of all 9187 SNPs were greater than 1%.

## Results

To apply the two-stage analysis, we first obtained the threshold for each strategy using *π*_{1} = 0.1 (*C*_{1} = 1.6449) and Eq. (1) and (2). The genome-wide false-positive rate is thereby controlled at 0.05, and we define a "significant" SNP as one whose test statistic exceeds the threshold in both stages. As expected, a slightly larger threshold for the second stage is required for the joint analysis to control the same genome-wide false-positive rate (*C*_{2} = 4.5121 vs. *C*_{joint} = 4.5470) [6]. Table 2 summarizes results based on an additive genetic model. The chromosome, SNP name, and distance from the nearest major gene are listed in the first three columns of the table, and if the SNP was selected by the specified method (last three columns), the *p*-value from the second stage is listed. Even with the larger threshold, the joint analysis in the second stage found more significant SNPs near the major genes. Moreover, when the joint analysis was used in the second stage, the same set of significant SNPs was found regardless of the choice of the test statistic in the first stage. However, different results were obtained when either *Z*_{TDT} or *Z*_{CC} was used in the first stage followed by the other test in the second stage.

Specifically, the joint analysis using either *Z*_{TDT} or *Z*_{CC} in the first stage found 18 significant SNPs, among which 9 and 14 were located within 1 Mb (bold) and 5 Mb (italic) of the major genes. When *Z*_{TDT} in the first stage was followed by *Z*_{CC} in the second stage, we found 17 significant SNPs, of which 8 were located within 1 Mb of the causal genes and 13 were located within 5 Mb. These methods found SNPs near the major genes on chromosomes 6, 11, and 18. On the other hand, when we used *Z*_{CC} in the first stage followed by *Z*_{TDT} in the second stage, a total of 10 significant SNPs were found: 7 of them were located within 1 Mb of a major gene, only on chromosomes 6 and 11, and 10 were located within 5 Mb. A SNP near the major gene on chromosome 18 was not found by this method.

When we applied these three tests (*Z*_{CC}, *Z*_{TDT}, *Z*_{joint}) in a single-stage analysis, each test found the same set of SNPs identified in a two-stage analysis with the corresponding test at the second stage. That is, *Z*_{CC}, *Z*_{TDT}, and *Z*_{joint} in a single stage found the 17, 10, and 18 SNPs in columns 4, 5, and 6 of Table 2, respectively. This implies that a two-stage analysis can maintain power with a substantially reduced genotyping cost while controlling the same genome-wide false-positive rate [6].

## Discussion

In this paper, we presented a new method for testing association when both case-parents trios and unrelated controls are available. Because parents are selected for having an affected child, we consider the characteristics of non-affected parents to be different from those of unrelated controls in case-control studies. Thus, the genotype information of parents was used only for *Z*_{TDT} and not for *Z*_{CC}. By adjusting for the correlation between the two test statistics (*Z*_{TDT} and *Z*_{CC}), we proposed a combined test statistic for analyzing such data.

For data with a large number of markers in a two-stage analysis, we considered several analytical approaches following the method by Skol et al. [6]. Even with a slightly larger threshold required, more SNPs near the major genes were found using the joint analysis in the second stage. Also, we noticed the choice of test for the first stage was important when two separate tests were used in the two stages, but when the joint analysis was used, the impact of which test was used first seemed to be less important. The added benefit of the joint analysis was rather minor compared to what was studied by Skol et al. [6] because the two tests for the first and the second stages were highly correlated even without using the joint analysis. Nevertheless, the joint analysis found slightly more significant SNPs and is robust against the choice of the first stage test. These properties suggest that the joint analysis would be desirable.

Our method can be generalized to data with missing genotypes by either imputing the missing genotypes based on partially available data [5, 10], or by omitting cases without complete parental information from *Z*_{TDT}. In this situation, the correlation between *Z*_{TDT} and *Z*_{CC} will decrease, and therefore, the advantage of the joint analysis could be accentuated. Complete justification, however, requires further study.

## Conclusion

We presented a new method for testing association when data from both case-parents trios and unrelated controls are available. By deriving the correlation of test statistics for these two designs, we proposed a combined test as a joint analysis. In a two-stage analysis for testing a large number of markers, we found that the joint analysis detects more SNPs near the major genes than other methods that do not use the combined test in the second stage. This approach is also robust against the choice of the first stage test.

## Appendix

When the conditional likelihood is used for *Z*_{TDT}, *n*_{1} = *n*_{12}, *n*_{2} = (*n*_{21}, *n*_{22}), *n*_{3} = *n*_{31}, *n*_{4} = (*n*_{40}, *n*_{41}, *n*_{42}), *n*_{5} = (*n*_{50}, *n*_{51}), and *n*_{6} = *n*_{60} are independent random variables conditional on parental mating types (*m*), where *n*_{2} and *n*_{5} follow a binomial distribution and *n*_{4} follows a trinomial distribution with probabilities given in column 5 of Table 1 [2]. The score test for *H*_{0}: *ψ* = 1 is then written as $Z_{TDT}=\frac{U_{T}(n)-E(U_{T}(n)\mid m)}{\sqrt{\mathrm{Var}(U_{T}(n)\mid m)}}$, where *U*_{T}(*n*) = *n*_{22}+*n*_{42}, *n*_{22}+*n*_{42}+0.5(*n*_{21}+*n*_{41}+*n*_{51}), and *n*_{42}+*n*_{41}+*n*_{51} for the recessive, additive, and dominant models, respectively. By applying the variance decomposition formula, we obtain the correlation between *Z*_{TDT} and *Z*_{CC} as $(1-R/N)\frac{E\left(\sqrt{\mathrm{Var}(U_{T}(n)\mid m)}\right)}{\sqrt{\mathrm{Var}(U(x))}}$. An additional distributional assumption needs to be made for parental genotypes. We considered six parental mating types as a six dimensional multinomial distribution, and the corresponding probabilities were consistently estimated by the observed counts.
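For the additive model, the score statistic can be evaluated from the trio counts alone. The sketch below assumes that, under the null hypothesis, the conditional case-genotype probabilities given mating type are the Mendelian ones (1/2, 1/2 for MM × NM and NM × NN, and 1/4, 1/2, 1/4 for NM × NM); the conditional mean and variance in the code are derived from that assumption rather than quoted from Table 1. The function name `z_tdt_additive` and the dictionary layout are hypothetical.

```python
import math


def z_tdt_additive(n):
    """Additive-model Z_TDT from trio counts.

    n maps (mating type i, number of M alleles in the case j) to a count,
    for example n[(4, 2)] for an MM case with NM x NM parents. Missing
    cells are treated as zero.
    """
    def count(i, j):
        return n.get((i, j), 0)

    n2 = count(2, 1) + count(2, 2)                # MM x NM trios
    n4 = count(4, 0) + count(4, 1) + count(4, 2)  # NM x NM trios
    n5 = count(5, 0) + count(5, 1)                # NM x NN trios

    # U_T(n) = n22 + n42 + 0.5 (n21 + n41 + n51)
    u = count(2, 2) + count(4, 2) + 0.5 * (count(2, 1) + count(4, 1) + count(5, 1))
    # Null conditional moments given mating types (Mendelian transmission).
    expected = 0.75 * n2 + 0.5 * n4 + 0.25 * n5
    variance = (n2 + 2 * n4 + n5) / 16.0
    if variance == 0:
        raise ValueError("no informative (heterozygous-parent) trios")
    return (u - expected) / math.sqrt(variance)
```

Under these assumptions the statistic coincides with the classical TDT written as (*b* - *c*)/√(*b* + *c*), where *b* and *c* count M alleles transmitted and not transmitted by heterozygous parents.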

Under HWE, we can show that the correlation for three models can be simplified to $\sqrt{1-R/N}$ when all cases have parental genotypes available. When only a proportion of cases overlaps between case-parents and case-control designs, we can introduce an additional parameter *η* < 1 such that $\sum_{ij}n_{ij}=\eta R$, and the correlation between *Z*_{TDT} and *Z*_{CC} is reduced to $\eta \sqrt{1-R/N}$.
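The simplified correlation is a one-line calculation. The sketch below evaluates it for the sample sizes given in the Data section (1500 cases, 2000 controls) purely as an arithmetic illustration; the paper does not report the correlation value used in its analysis, and the function name `hwe_correlation` is hypothetical.

```python
import math


def hwe_correlation(n_cases, n_controls, eta=1.0):
    """Correlation eta * sqrt(1 - R/N) between Z_TDT and Z_CC under HWE.

    eta is the fraction of cases that also have parental genotypes.
    """
    if not 0.0 < eta <= 1.0:
        raise ValueError("eta must be in (0, 1]")
    total = n_cases + n_controls
    return eta * math.sqrt(1.0 - n_cases / total)


rho = hwe_correlation(1500, 2000)  # sqrt(2000 / 3500), roughly 0.756
```

Lowering `eta` scales the correlation down proportionally, which matches the remark in the Discussion that incomplete parental data reduce the correlation between the two tests.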

## References

1. Spielman RS, McGinnis R, Ewens WJ: Transmission test for linkage disequilibrium: the insulin gene region and insulin-dependent diabetes mellitus (IDDM). Am J Hum Genet. 1993, 52: 506-516.

2. Schaid DJ, Sommer SS: Genotype relative risks: methods for design and analysis of candidate-gene association studies. Am J Hum Genet. 1993, 53: 1114-1126.

3. Zheng G, Freidlin B, Gastwirth JL: Robust TDT-type candidate-gene association tests. Ann Hum Genet. 2002, 66: 145-155. 10.1046/j.1469-1809.2002.00104.x.

4. Sasieni PD: From genotypes to genes: doubling the sample size. Biometrics. 1997, 53: 1253-1261. 10.2307/2533494.

5. Nagelkerke NJD, Hoebee B, Teunis P, Kimman TG: Combining the transmission disequilibrium test and case-control methodology using generalized logistic regression. Eur J Hum Genet. 2004, 12: 964-970. 10.1038/sj.ejhg.5201255.

6. Skol AD, Scott LJ, Abecasis GR, Boehnke M: Joint analysis is more efficient than replication-based analysis for two-stage genome-wide association studies. Nat Genet. 2005, 38: 209-213. 10.1038/ng1706.

7. Zheng G, Freidlin B, Li Z, Gastwirth JL: Choice of scores in trend tests for case-control studies of candidate-gene associations. Biometrical J. 2003, 45: 335-348. 10.1002/bimj.200390016.

8. O'Brien PC: Procedures for comparing samples with multiple endpoints. Biometrics. 1984, 40: 1079-1087. 10.2307/2531158.

9. Tang DI, Geller NL, Pocock SJ: On the design and analysis of randomized clinical trials with multiple endpoints. Biometrics. 1993, 49: 23-30. 10.2307/2532599.

10. Weinberg CR: Allowing for missing parents in genetic studies of case-parent triads. Am J Hum Genet. 1999, 64: 1186-1193. 10.1086/302337.

## Acknowledgements

This article has been published as part of *BMC Proceedings* Volume 1 Supplement 1, 2007: Genetic Analysis Workshop 15: Gene Expression Analysis and Approaches to Detecting Multiple Functional Loci. The full contents of the supplement are available online at http://www.biomedcentral.com/1753-6561/1?issue=S1.

## Additional information

### Competing interests

The author(s) declare that they have no competing interests.

## Rights and permissions

This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

## About this article

### Keywords

- Joint Analysis
- Significant SNPs
- Transmission Disequilibrium Test
- Genetic Analysis Workshop
- Unrelated Control