Genome-wide association tests by two-stage approaches with unified analysis of families and unrelated individuals
- Xuexia Wang^{1},
- Zhaogong Zhang^{1, 2},
- Shuanglin Zhang^{1, 2} and
- Qiuying Sha^{1}
https://doi.org/10.1186/1753-6561-1-S1-S140
© Wang et al; licensee BioMed Central Ltd. 2007
Published: 18 December 2007
Abstract
Multiple testing is a problem in genome-wide or region-wide association studies. In this report, we consider a study design given by the Genetic Analysis Workshop 15 (GAW15) Problem 3 – nuclear families (parents with their affected children) and unrelated controls. Based on this design, we propose three two-stage approaches to deal with the problem of multiple testing. The tests in the first stage, statistically independent of the association test used in the second stage, are used to screen or select single-nucleotide polymorphisms (SNPs). Then, in the second stage, a family-based association test is performed on a much smaller set of selected SNPs. Thus, the problem of multiple testing is much less severe. Our simulation studies and application to the dense SNP data of chromosome 6 in the GAW15 Problem 3 show that the two-stage methods are more powerful than the one-stage method (using the family-based association test only).
Background
Genome-wide or region-wide association is a promising approach to mapping complex disease genes [1, 2]. However, the success of such studies will depend on whether the information gained from the increased number of single-nucleotide polymorphisms (SNPs) is diluted by the multiple-comparison problem [3]. When tens or hundreds of thousands of SNPs are tested for association, the p-values need to be adjusted to control the type I error rate. Most multiple-testing adjustments, including the Bonferroni correction for controlling the family-wise error rate and the method proposed by Benjamini and Hochberg [4] for controlling the false discovery rate (FDR), become more conservative as more tests are done.
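As a rough illustration of why fewer tests help, the sketch below compares the per-test Bonferroni threshold for a full scan with that for a screened subset. The function name `bonferroni_threshold` and the subset size of 100 SNPs are illustrative assumptions; 17,820 is the number of chromosome 6 SNPs in the GAW15 data analyzed in this report.

```python
def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test threshold that keeps the family-wise error rate at alpha."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return alpha / n_tests


# Full scan of all 17,820 SNPs versus a hypothetical screened set of 100 SNPs.
full_scan = bonferroni_threshold(0.05, 17820)
screened = bonferroni_threshold(0.05, 100)

# How much less stringent the per-test threshold becomes after screening.
ratio = screened / full_scan
```

Under these assumed numbers the per-test threshold after screening is larger by the factor 17,820/100, which is the gain the two-stage designs aim for.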
In case-control studies, several authors have proposed a two-stage design that utilizes two independent samples [5, 6]. The first sample is used to screen and select SNPs for association tests. The association tests are conducted on the selected SNPs by using the second sample, so that the number of association tests is diminished and the correction for multiple testing is less severe. Recently, in mapping quantitative trait loci using family data, Van Steen et al. [3] proposed an interesting approach that performs the SNP screening and association test using the same sample. The basic idea of Van Steen et al.'s method is that the screening test based on the traits and between-family genotype scores is statistically independent of the association test that depends on trait values and within-family genotype scores. The screening test is used first to select SNPs, and the association test is performed on a much smaller set of selected SNPs. Unfortunately, the same idea cannot be applied to family-based analyses for qualitative traits.
In this article, we propose several two-stage methods to test association for qualitative traits by using nuclear families (including parental phenotypes) or nuclear families and unrelated controls. To analyze a data set of nuclear families, we compare the allele frequency in affected parents with that in unaffected parents (test I) to screen and select SNPs. Then the pedigree disequilibrium test (PDT) [7] is used to perform the association test on the selected SNPs by comparing the alleles that are transmitted to the children with those that are not transmitted. To analyze a data set that contains nuclear families and unrelated controls, such as the data set in the GAW15 Problem 3, we propose two methods to screen SNPs. One compares the allele frequency in parents with that in unrelated controls (test II). The other is a combination of tests I and II. All the proposed screening tests are independent of the association test, that is, the PDT. Furthermore, because a significant association depends only on the results of the PDT, the proposed two-stage approaches are robust to population admixture. We compare the performance of the proposed methods with that of the PDT alone through simulation studies and analysis of the data set of the GAW15 Problem 3. Our simulation and the GAW15 data analysis results show that the three proposed two-stage methods have correct type I error rates and, in most cases, are more powerful than the PDT.
Methods
Consider a sample of n nuclear families and N unrelated controls. Suppose that we have genotyped M markers across the genome or in a candidate region for each sampled individual. All children in the nuclear families are affected, and the disease status of the parents is available. We consider this kind of sample because it is the design of the GAW15 Problem 3 data set. To detect disease susceptibility loci under this sample structure, we propose three methods. All three are two-stage approaches: the tests in the first stage are used to screen and select SNPs, and the test in the second stage is used to test association on the selected SNPs.
The screening tests in the first stage are built on the statistic that compares the frequency of an allele A between $N_{1}$ cases and $N_{2}$ controls, $T=(\widehat{p}-\widehat{q})/\sigma$, where $\widehat{p}$ and $\widehat{q}$ are the sample frequencies of allele A in cases and controls, respectively; $\sigma^{2}=\left(\frac{1}{2N_{1}}+\frac{1}{2N_{2}}\right)p_{0}(1-p_{0})$ is the estimate of the variance of $\widehat{p}-\widehat{q}$; and $p_{0}$ is the sample frequency of allele A in the whole sample. Under the null hypothesis of no association, this test statistic asymptotically follows a standard normal distribution. When the absolute value of T is large, we reject the null hypothesis of no association. Based on the test statistic T, we propose the following three tests that can be used in the first stage to screen SNPs:
1. Consider affected parents of the sampled nuclear families as cases and unaffected parents of the sampled nuclear families as controls. The test statistic T based on this sample is denoted by T_{cc}. T_{cc} uses only the nuclear families (it does not need the unrelated controls).
2. Consider all the parents of the n sampled nuclear families as cases and the N sampled unrelated controls as controls. The test statistic T based on this sample is denoted by T_{pc}. If A is a high-risk allele, the frequency of A among the parents should be higher than that in the controls, because each pair of parents has at least one affected child.
3. The third approach is a combination of T_{pc} and T_{cc}. The test statistic of this approach is Fisher's combination of the p-values of the two tests and is given by T_{cb} = -2(log P_{1} + log P_{2}), where P_{1} and P_{2} are the p-values of the tests T_{pc} and T_{cc}, respectively. Under the null hypothesis of no association, T_{cb} will follow a χ^{2} distribution with 4 degrees of freedom [8].
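The allele-frequency statistic T and Fisher's combination described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, N_1 and N_2 are taken to be the numbers of cases and controls, allele counts are assumed to be out of 2N chromosomes, and the p-value of T is taken to be two-sided because the null hypothesis is rejected for large absolute values of T.

```python
import math


def allele_frequency_test(case_a, n_cases, control_a, n_controls):
    """Return (T, two-sided p-value) for comparing allele A frequencies.

    case_a / control_a: counts of allele A among the 2*n_cases and
    2*n_controls chromosomes (assumed input format).
    """
    p_hat = case_a / (2 * n_cases)
    q_hat = control_a / (2 * n_controls)
    # p0: frequency of allele A in the whole (pooled) sample.
    p0 = (case_a + control_a) / (2 * (n_cases + n_controls))
    variance = (1 / (2 * n_cases) + 1 / (2 * n_controls)) * p0 * (1 - p0)
    if variance == 0:
        raise ValueError("allele A is monomorphic in the pooled sample")
    t = (p_hat - q_hat) / math.sqrt(variance)
    # Two-sided standard normal p-value: 2 * (1 - Phi(|t|)).
    p_value = math.erfc(abs(t) / math.sqrt(2))
    return t, p_value


def fisher_combination(p1, p2):
    """Return (T_cb, p-value) for T_cb = -2(log p1 + log p2)."""
    t_cb = -2 * (math.log(p1) + math.log(p2))
    # Survival function of chi-square with 4 df: exp(-x/2) * (1 + x/2).
    p_value = math.exp(-t_cb / 2) * (1 + t_cb / 2)
    return t_cb, p_value


# Made-up counts: 60 of 200 case chromosomes and 40 of 200 control
# chromosomes carry allele A.
t_stat, t_p = allele_frequency_test(60, 100, 40, 100)
t_cb, cb_p = fisher_combination(0.05, 0.05)
```

For T_{cc} the cases and controls would be the affected and unaffected parents, and for T_{pc} they would be all parents and the unrelated controls; the closed-form χ^{2} tail used here is specific to 4 degrees of freedom.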
The association test in the second stage is the PDT [7]. Its test statistic is given by $PDT=U/\widehat{\sigma}$. Under the null hypothesis of no association, the PDT follows the standard normal distribution.
When we apply the two-stage approaches, we first apply one of T_{pc}, T_{cc}, or T_{cb} to each of the M markers and get M p-values. We select the L markers with the smallest p-values (we will discuss later how to choose L). Then, we apply the PDT to the L selected SNPs, and declare a SNP significant if the p-value of the PDT at this marker is less than a threshold δ_{Lα}. The threshold δ_{Lα} is determined by controlling the FDR, the expected proportion of falsely rejected null hypotheses among all rejected null hypotheses, at level α. To control the FDR we can choose the cut-off δ_{Lα} as follows [4]: let p_{(1)},...,p_{(L)} be the ordered p-values when we apply the PDT to the L selected markers; then ${\delta}_{L\alpha}=\mathrm{max}\{{p}_{(i)}:{p}_{(i)}\le \frac{i\alpha}{L}\}$.
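The two-stage procedure can be sketched as below, assuming the first-stage and PDT p-values have already been computed for every marker. The function names and the toy p-values are hypothetical. One labeled deviation from the text: because the threshold is itself one of the observed PDT p-values, the sketch rejects when a p-value is less than or equal to the threshold, as in the usual Benjamini-Hochberg step-up rule.

```python
from typing import List, Optional, Sequence


def bh_threshold(p_values: Sequence[float], alpha: float) -> Optional[float]:
    """Return max{p_(i) : p_(i) <= i*alpha/L}, or None if no p-value qualifies."""
    n = len(p_values)
    delta = None
    for i, p in enumerate(sorted(p_values), start=1):
        if p <= i * alpha / n:
            delta = p
    return delta


def two_stage(screen_p: Sequence[float], pdt_p: Sequence[float],
              n_select: int, alpha: float) -> List[int]:
    """Return indices of markers declared significant by the two-stage test."""
    # Stage 1: keep the n_select markers with the smallest screening p-values.
    order = sorted(range(len(screen_p)), key=lambda m: screen_p[m])
    selected = order[:n_select]
    # Stage 2: apply the FDR cut-off to the PDT p-values of the selected markers only.
    delta = bh_threshold([pdt_p[m] for m in selected], alpha)
    if delta is None:
        return []
    return sorted(m for m in selected if pdt_p[m] <= delta)


# Toy example with five markers (made-up p-values).
screen_p = [0.5, 0.001, 0.2, 0.01, 0.9]
pdt_p = [0.0001, 0.03, 0.004, 0.02, 0.5]
significant = two_stage(screen_p, pdt_p, n_select=2, alpha=0.05)
```

In this toy example marker 0 has the smallest PDT p-value but is never tested, because it does not pass the first-stage screen; only markers selected in stage one can be declared significant.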
In our simulation studies and application to analyze the GAW15 simulated data, we use the following method to calculate the power of the two-stage test to detect one disease locus, say locus D. Suppose that there are K replicated samples. Let k denote the number of samples in which locus D is selected in the first stage and the p-value of locus D in the second stage is less than δ_{ Lα }. Then, the power to detect Locus D is k/K.
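The power estimate k/K can be sketched as follows. The input format, one tuple per replicate holding the first-stage selection indicator, the second-stage PDT p-value of locus D, and that replicate's threshold, is an assumption made for illustration, as are the toy values. As an assumption of the sketch, a p-value equal to the threshold counts as a detection, since the threshold is itself an observed p-value.

```python
from typing import Iterable, Optional, Tuple

Replicate = Tuple[bool, float, Optional[float]]


def estimate_power(replicates: Iterable[Replicate]) -> float:
    """Return k/K, the fraction of replicates in which the locus is detected.

    Each replicate is (selected_in_stage_one, pdt_p_value, delta), where
    delta is None when no PDT p-value met the FDR criterion.
    """
    total = 0      # K: number of replicate samples
    detected = 0   # k: replicates where the locus is selected and significant
    for selected, p_value, delta in replicates:
        total += 1
        if selected and delta is not None and p_value <= delta:
            detected += 1
    if total == 0:
        raise ValueError("at least one replicate is required")
    return detected / total


# Four made-up replicates: only the first one counts as a detection.
toy = [(True, 0.001, 0.01), (False, 0.001, 0.01),
       (True, 0.2, 0.01), (True, 0.01, None)]
power = estimate_power(toy)
```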
Results
Simulated data
Estimated type I error rates at nominal level 0.05, 1000 replicate simulation data sets
Sample size | Minor allele frequency | α^{a} | Two children^{b}, PDT | Two children^{b}, T_{pc} | Two children^{b}, T_{cc} | Two children^{b}, T_{cb} | One child^{b}, PDT | One child^{b}, T_{pc} | One child^{b}, T_{cc} | One child^{b}, T_{cb} |
---|---|---|---|---|---|---|---|---|---|---|
100 families, 120 controls | 0.1 | 0.05 | 0.014 | 0.032 | 0.043 | 0.042 | 0.031 | 0.033 | 0.044 | 0.043 |
100 families, 120 controls | 0.1 | 0.1 | | 0.023 | 0.028 | 0.032 | | 0.029 | 0.041 | 0.029 |
100 families, 120 controls | 0.1 | 0.2 | | 0.025 | 0.027 | 0.027 | | 0.031 | 0.031 | 0.026 |
100 families, 120 controls | 0.3 | 0.05 | 0.024 | 0.042 | 0.049 | 0.047 | 0.036 | 0.05 | 0.045 | 0.053 |
100 families, 120 controls | 0.3 | 0.1 | | 0.04 | 0.043 | 0.035 | | 0.051 | 0.047 | 0.046 |
100 families, 120 controls | 0.3 | 0.2 | | 0.046 | 0.038 | 0.037 | | 0.054 | 0.047 | 0.053 |
500 families, 600 controls | 0.1 | 0.05 | 0.048 | 0.051 | 0.051 | 0.059 | 0.053 | 0.052 | 0.054 | 0.051 |
500 families, 600 controls | 0.1 | 0.1 | | 0.04 | 0.052 | 0.042 | | 0.052 | 0.041 | 0.045 |
500 families, 600 controls | 0.1 | 0.2 | | 0.042 | 0.046 | 0.044 | | 0.04 | 0.047 | 0.05 |
500 families, 600 controls | 0.3 | 0.05 | 0.059 | 0.053 | 0.047 | 0.045 | 0.042 | 0.046 | 0.04 | 0.042 |
500 families, 600 controls | 0.3 | 0.1 | | 0.054 | 0.042 | 0.046 | | 0.048 | 0.034 | 0.044 |
500 families, 600 controls | 0.3 | 0.2 | | 0.053 | 0.034 | 0.048 | | 0.047 | 0.043 | 0.041 |
GAW15 data analysis
We applied the three two-stage approaches and the PDT to the dense SNP data of chromosome 6 in the GAW15 Problem 3 (simulated rheumatoid arthritis data). The data contain 100 replicate data sets. Each replicate data set contains 1500 nuclear families, each with two affected children, and 2000 unrelated controls. Each individual has genotypes at 17,820 SNPs on chromosome 6. From the answers provided with the data set, we know that there are three trait loci on chromosome 6: Locus DR, Locus C, and Locus D. Locus DR affects the risk of rheumatoid arthritis (RA). Locus C increases RA risk only in women. These two loci are in the same position. The typed SNP 3437 on chromosome 6 is in the same position as Locus DR and Locus C, that is, the recombination rates between SNP 3437 and Locus DR and between SNP 3437 and Locus C are both zero. The rare allele of Locus D increases RA risk five-fold. In the dense SNP panel of chromosome 6, the SNP that is nearest to Locus D is SNP 3917. The genetic distance between Locus D and SNP 3917 is 0.00171 cM, and the physical distance is 1565 bp.
Power for detecting SNP 3437 based on the dense SNP data of chromosome 6 by using 150 families and 200 unrelated controls
α^{a} | Two children^{b}, PDT | Two children^{b}, T_{pc} | Two children^{b}, T_{cc} | Two children^{b}, T_{cb} | One child^{b}, PDT | One child^{b}, T_{pc} | One child^{b}, T_{cc} | One child^{b}, T_{cb} |
---|---|---|---|---|---|---|---|---|
0.01 | | 0.93 | 0.15 | 0.96 | | 0.66 | 0.13 | 0.87 |
0.05 | | 0.95 | 0.39 | 0.96 | | 0.65 | 0.29 | 0.8 |
0.1 | 0.83 | 0.96 | 0.54 | 0.96 | 0.53 | 0.64 | 0.4 | 0.75 |
0.3 | | 0.91 | 0.77 | 0.91 | | 0.58 | 0.53 | 0.61 |
0.5 | | 0.88 | 0.81 | 0.88 | | 0.56 | 0.54 | 0.56 |
Power for detecting marker 3917 based on the dense SNP data of chromosome 6 by merging two replicate data sets to increase sample size
α^{a} | Two children^{b}, PDT | Two children^{b}, T_{pc} | Two children^{b}, T_{cc} | Two children^{b}, T_{cb} | One child^{b}, PDT | One child^{b}, T_{pc} | One child^{b}, T_{cc} | One child^{b}, T_{cb} |
---|---|---|---|---|---|---|---|---|
0.01 | | 0.36 | 0.22 | 0.44 | | 0.24 | 0.18 | 0.32 |
0.05 | | 0.56 | 0.32 | 0.68 | | 0.4 | 0.28 | 0.52 |
0.1 | 0.56 | 0.54 | 0.42 | 0.68 | 0.3 | 0.42 | 0.36 | 0.54 |
0.3 | | 0.58 | 0.54 | 0.64 | | 0.32 | 0.28 | 0.32 |
0.5 | | 0.62 | 0.56 | 0.64 | | 0.34 | 0.28 | 0.34 |
Discussion
In this report, we proposed three two-stage approaches for genome-wide or region-wide association studies that analyze family data or data sets that contain family data as well as unrelated controls. Based on our simulation studies and application to the data sets of the GAW15 Problem 3, we are able to demonstrate that, in the case of one child in each family (the typical data set of the TDT design), all three two-stage approaches are more powerful than the PDT. In almost all the cases we considered, T_{cb}, which uses family data and unrelated controls, is more powerful than the PDT, and in several cases T_{cb} can double the power of the PDT. How to choose the value of the threshold α remains an open problem. From our simulation studies, one can see that a value of α around 0.01 may be a good choice for T_{pc} and T_{cb}. If only the family data are available, we would use the two-stage approach T_{cc}. In the case of one child, a value around 0.1 may be a good choice for α. In the case of two children, T_{cc} does not provide much benefit. In general, the choice of α needs further investigation.
Conclusion
Our simulation and the GAW15 data analysis results show that the three proposed two-stage methods have correct type I error rates and, in most cases, are more powerful than the PDT.
Declarations
Acknowledgements
This work was supported by National Institutes of Health (NIH) grants R01 GM069940, R03 HG 003613, R01 HG003054, and R03 AG024491.
This article has been published as part of BMC Proceedings Volume 1 Supplement 1, 2007: Genetic Analysis Workshop 15: Gene Expression Analysis and Approaches to Detecting Multiple Functional Loci. The full contents of the supplement are available online at http://www.biomedcentral.com/1753-6561/1?issue=S1.
References
- Risch N: Searching for genetic determinants in the new millennium. Nature. 2000, 405: 847-856. 10.1038/35015718.
- Hirschhorn JN, Daly MJ: Genome-wide association studies for common diseases and complex traits. Nat Rev Genet. 2005, 6: 95-108. 10.1038/nrg1521.
- Van Steen K, McQueen MB, Herbert A, Raby B, Lyon H, Demeo DL, Murphy A, Su J, Datta S, Rosenow C, Christman M, Silverman EK, Laird NM, Weiss ST, Lange C: Genomic screening and replication using the same data set in family-based association testing. Nat Genet. 2005, 37: 683-691. 10.1038/ng1582.
- Benjamini Y, Hochberg Y: Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc B. 1995, 57: 289-300.
- Satagopan JM, Elston RC: Optimal two-stage genotyping in population-based association studies. Genet Epidemiol. 2003, 25: 149-157. 10.1002/gepi.10260.
- Wang H, Thomas DC, Peer I, Stram DO: Optimal two-stage genotyping designs for genome-wide association scan. Genet Epidemiol. 2006, 30: 356-368. 10.1002/gepi.20150.
- Martin ER, Monks SA, Warren LL, Kaplan NL: A test for linkage and association in general pedigrees: the pedigree disequilibrium test. Am J Hum Genet. 2000, 67: 146-154. 10.1086/302957.
- Fisher RA: Statistical Methods for Research Workers. 1932, London: Oliver and Boyd, 4th edition.
Copyright
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.