Marker selection for whole-genome association studies with two-stage designs using dense single-nucleotide polymorphisms
- Jing Li
https://doi.org/10.1186/1753-6561-1-S1-S136
© Li; licensee BioMed Central Ltd. 2007
Published: 18 December 2007
Abstract
Large-scale genome-wide association studies are increasingly common, due in large part to recent advances in genotyping technology. Despite a dramatic drop in genotyping costs, it is still too expensive for many researchers to genotype thousands of individuals for hundreds of thousands of single-nucleotide polymorphisms (SNPs) in a large-scale whole-genome association study. A two-stage design is a promising alternative: in the first stage, only a small fraction of the samples is genotyped and tested using a dense set of SNPs, and only the small subset of markers that show moderate association with the disease is genotyped in the second stage. In this report, I developed an approach to select and prioritize SNPs for association studies with a two-stage or multi-stage design. In the first stage, the method not only evaluates associations of SNPs with the disease of interest, but also explicitly explores correlations among SNPs. I applied the approach to the simulated Genetic Analysis Workshop 15 Problem 3 data sets, which model the complex genetic architecture of rheumatoid arthritis. Results show that the method can greatly reduce the number of SNPs required in later stage(s) without sacrificing mapping precision.
Background
A two-stage design has been a promising strategy for genome-wide association studies [1–5], primarily for the purpose of reducing genotyping costs. Studies have shown that two-stage designs can effectively reduce costs, even when the per-genotype cost in stage two (using specially designed arrays) is much higher than in stage one (using fixed arrays) [3]. The optimal two-stage design, which achieves minimum cost at a similar overall significance level and statistical power, depends on many factors, such as disease allele frequencies, disease effects, the fraction of samples genotyped in stage one, the fraction of markers genotyped in stage two, and the ratio of genotyping costs between stage one and stage two. Several groups have investigated the issue using different statistical tests under different assumptions [1–4].
Generally speaking, three test strategies can be adopted in stage two: replication-based analysis, joint analysis assuming homogeneity between stages, and joint analysis that allows heterogeneity between stages [1–4]. In a replication-based study, data from stage two are considered alone and a positive association is reported if a statistical score reaches its significance level. In a joint analysis, subjects from stage one and stage two are considered together at the end: if homogeneity is assumed, raw data from the two stages are combined first to obtain an overall statistic, and if heterogeneity is assumed, the statistics from the two stages are combined [4]. A common practice for evaluating statistical significance under multiple testing with all three methods is to use Bonferroni-adjusted p-values, which in effect assumes that all single-nucleotide polymorphisms (SNPs) are independent and in linkage equilibrium. Based on data from the HapMap project [6] and other sources such as the Cancer Genetic Markers of Susceptibility (CGEMS) project http://cgems.cancer.gov, the assumption of linkage equilibrium is unlikely to hold for SNP arrays with hundreds of thousands of markers, because many nearby SNPs are in high linkage disequilibrium. The Bonferroni correction is therefore highly conservative, which may partially explain the preliminary negative results from the CGEMS project: none of the 300 K SNPs is significantly associated with prostate cancer at a genome-wide level of 0.05 after the Bonferroni correction. Permutation tests can be performed for the replication-based analysis, but it is not straightforward to extend them to joint analysis [7]. In addition, permutation tests are usually time-consuming and unlikely to scale up to genome-wide studies.
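To make the scale of the Bonferroni adjustment concrete, the following minimal sketch computes the per-SNP threshold for a 300 K array at a genome-wide level of 0.05. The function names are illustrative only and are not taken from the paper.

```python
def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test threshold that keeps the family-wise error rate at or below alpha."""
    if n_tests <= 0:
        raise ValueError("n_tests must be positive")
    return alpha / n_tests


def bonferroni_adjust(p_values):
    """Bonferroni-adjusted p-values (each multiplied by the number of tests, capped at 1)."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# A 300 K array tested at a genome-wide level of 0.05.
threshold = bonferroni_threshold(0.05, 300_000)
print(f"{threshold:.2e}")  # 1.67e-07
```

Because the threshold is divided by the raw number of SNPs, correlated SNPs in high linkage disequilibrium are each counted as a separate test, which is the source of the conservativeness discussed above.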
In this report, I explicitly explore the dependence between SNPs within a two-stage design using the simulated dense SNP data sets provided by Genetic Analysis Workshop 15 (GAW15) by applying a clustering algorithm and employing the joint analysis strategy for power studies.
Methods
I applied the clustering algorithm within a two-stage design using the joint analysis on the simulated data sets of Problem 3. All analyses were carried out with knowledge of the true disease gene locations. I first tested the algorithm on the dense SNP set on chromosome 6, which contains the HLA-DRB1 locus and Locus D. The total number of SNPs is 17,820, with an average inter-marker interval of 10 kbp, which corresponds to a 300 K array. As a comparison, I also applied the algorithm to the SNP data of chromosome 18, which mimic a 10 K SNP chip set. SNP data on chromosome 1 were used to evaluate the type I errors. I first constructed data sets for a case-control study with a two-stage design. For each data set, only one affected child was randomly chosen as a case subject from each nuclear family with an affected sib pair, and one child was selected as a control subject from each normal family. Therefore, all cases and controls are independent. Because some alleles around the HLA-DRB1 locus have very strong effects on disease status, only a very small fraction of cases and controls was randomly selected for testing from all subjects (1500 cases and 2000 controls). Let n denote the total number of subjects tested in stage one and stage two together, with an equal number of cases and controls. For chromosome 6, n took the values 100, 200, and 300. Let f denote the fraction of subjects in stage one; f took the values 0.3, 0.4, and 0.5 in this experiment. I assumed that only nf subjects were genotyped for all m SNPs in stage one. The Pearson χ^{2} statistic was used to select a subset of k SNPs for stage two based on a significance level of 0.05 without adjustment. The clustering algorithm was then applied to the k SNPs with an LD threshold of D' = 0.8 and a distance threshold of 100 kbp for chromosome 6. For each parameter combination, 100 independent replicates were randomly sampled from the original data sets.
I investigated and compared the power, costs, significance levels, and prediction errors (the distances from the predicted locations to the true gene location) of three methods: the one-stage design using all data, the two-stage design without clustering, and the two-stage design with clustering. For chromosome 18 and chromosome 1, because the total number of markers on each chromosome is much smaller than the number of SNPs on chromosome 6, and the effect of Locus E on chromosome 18 is much smaller than that of the HLA-DRB1 locus, a different set of parameters was used (e.g., n = 750, 1000, 1250; and a distance threshold for clustering of 5 Mbp).
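The stage-one selection step can be illustrated with a short sketch. The text specifies a Pearson χ^{2} statistic and an unadjusted level of 0.05 but not the form of the contingency table, so the 2 × 2 allele-count table used here is an assumption, as are the function names and the toy counts.

```python
import math


def allelic_chi2(case_a, case_b, ctrl_a, ctrl_b):
    """Pearson chi-square (1 df) on a 2x2 table of allele counts.

    Returns (statistic, p_value). The 2x2 allelic table is an assumption;
    the paper only states that a Pearson chi-square statistic was used.
    """
    n = case_a + case_b + ctrl_a + ctrl_b
    denom = ((case_a + case_b) * (ctrl_a + ctrl_b)
             * (case_a + ctrl_a) * (case_b + ctrl_b))
    if denom == 0:
        return 0.0, 1.0  # monomorphic SNP or empty group: no evidence
    stat = n * (case_a * ctrl_b - case_b * ctrl_a) ** 2 / denom
    # Survival function of chi-square with 1 df: P(Z^2 > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


def select_for_stage_two(counts, alpha=0.05):
    """Keep SNPs whose unadjusted stage-one p-value is below alpha."""
    return [snp for snp, c in counts.items() if allelic_chi2(*c)[1] < alpha]


# Toy allele counts: (case allele A, case allele B, control allele A, control allele B).
toy_counts = {
    "snp1": (40, 20, 20, 40),  # strong allele-frequency difference
    "snp2": (30, 30, 31, 29),  # essentially no difference
}
selected = select_for_stage_two(toy_counts)
print(selected)  # ['snp1']
```

Using an unadjusted threshold in stage one is deliberately liberal: the aim at this point is only to carry plausible SNPs forward, with the final significance assessed in the joint analysis.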
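The details of the clustering algorithm are not reproduced in this section, so the following is only one plausible greedy scheme that uses the same two thresholds (D' and physical distance); it should not be read as the author's algorithm. The function names, the seed-based grouping rule, and the choice of the smallest p-value as cluster representative are all assumptions made for illustration.

```python
def d_prime(p_ab: float, p_a: float, p_b: float) -> float:
    """Lewontin's D' from the haplotype frequency p_ab and allele frequencies p_a, p_b."""
    d = p_ab - p_a * p_b
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    return abs(d) / d_max if d_max > 0 else 0.0


def cluster_snps(snps, ld, ld_threshold=0.8, max_distance=100_000):
    """Greedy left-to-right grouping (hypothetical scheme).

    snps: iterable of (snp_id, position_bp, p_value)
    ld:   function (snp_id, snp_id) -> D'
    A SNP joins the current cluster if it lies within max_distance of the
    cluster's first SNP and its D' with that SNP reaches ld_threshold.
    """
    clusters = []
    for snp in sorted(snps, key=lambda s: s[1]):
        if clusters:
            seed = clusters[-1][0]
            close = snp[1] - seed[1] <= max_distance
            if close and ld(seed[0], snp[0]) >= ld_threshold:
                clusters[-1].append(snp)
                continue
        clusters.append([snp])
    return clusters


def representatives(clusters):
    """One SNP per cluster for stage two: here, the one with the smallest p-value."""
    return [min(cluster, key=lambda s: s[2])[0] for cluster in clusters]


# Toy data: positions in bp, made-up p-values and pairwise D' values.
toy_snps = [("snp1", 1_000, 0.01), ("snp2", 20_000, 0.001),
            ("snp3", 90_000, 0.04), ("snp4", 500_000, 0.02)]
toy_ld = {("snp1", "snp2"): 0.95, ("snp1", "snp3"): 0.5, ("snp3", "snp4"): 0.9}
reps = representatives(cluster_snps(toy_snps, lambda a, b: toy_ld.get((a, b), 0.0)))
print(reps)  # ['snp2', 'snp3', 'snp4']
```

In the toy data, snp4 is in high LD with snp3 but lies beyond the distance threshold, so it stays in its own cluster; this mirrors the role the two thresholds play in the text.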
Results
Power, number of positive SNPs, and significance levels
Mean (SD) number of positive SNPs at significance level α and fraction of samples f in stage one and for sample sizes 100, 200, and 300 for each method (one-stage design, two-stage design, and two-stage design with clustering). One-stage values do not depend on f and are listed once per α, in the f = 0.4 row.
α | f | n = 100: 1 stage | n = 100: 2 stage | n = 100: 2 stage-c | n = 200: 1 stage | n = 200: 2 stage | n = 200: 2 stage-c | n = 300: 1 stage | n = 300: 2 stage | n = 300: 2 stage-c |
---|---|---|---|---|---|---|---|---|---|---|
0.05 | 0.3 | | 15.8(± 0.42) | 6.8(± 0.20) | | 27.6(± 0.54) | 10.7(± 0.25) | | 39.7(± 0.62) | 15.0(± 0.29) |
0.05 | 0.4 | 17.9(± 0.44) | 16.1(± 0.43) | 6.5(± 0.19) | 31.1(± 0.62) | 28.3(± 0.58) | 11.3(± 0.25) | 44.9(± 0.71) | 40.1(± 0.63) | 15.3(± 0.27) |
0.05 | 0.5 | | 16.1(± 0.42) | 6.9(± 0.18) | | 28.2(± 0.56) | 11.4(± 0.26) | | 40.5(± 0.64) | 15.7(± 0.27) |
0.01 | 0.3 | | 14.0(± 0.41) | 6.0(± 0.20) | | 24.6(± 0.48) | 9.8(± 0.23) | | 35.6(± 0.57) | 13.5(± 0.27) |
0.01 | 0.4 | 15.7(± 0.41) | 14.2(± 0.38) | 6.0(± 0.18) | 27.6(± 0.54) | 25.0(± 0.47) | 10.4(± 0.22) | 39.4(± 0.62) | 36.0(± 0.54) | 14.0(± 0.26) |
0.01 | 0.5 | | 14.2(± 0.39) | 6.2(± 0.16) | | 25.1(± 0.48) | 10.3(± 0.23) | | 36.1(± 0.56) | 14.3(± 0.25) |
Mean (SD) significance levels (-log_{10}(p)) for each design for fraction of samples f in stage one and for sample sizes 100, 200 and 300 for each method. One-stage values do not depend on f and are listed once, in the f = 0.4 row.
f | n = 100: 1 stage | n = 100: 2 stage | n = 100: 2 stage-c | n = 200: 1 stage | n = 200: 2 stage | n = 200: 2 stage-c | n = 300: 1 stage | n = 300: 2 stage | n = 300: 2 stage-c |
---|---|---|---|---|---|---|---|---|---|
0.3 | | 10.8(± 0.29) | 11.0(± 0.30) | | 25.1(± 0.40) | 25.3(± 0.41) | | 39.7(± 0.47) | 39.9(± 0.49) |
0.4 | 11.6(± 0.30) | 10.7(± 0.29) | 11.0(± 0.30) | 26.1(± 0.40) | 25.1(± 0.40) | 25.4(± 0.40) | 40.8(± 0.47) | 39.7(± 0.47) | 40.1(± 0.47) |
0.5 | | 10.8(± 0.30) | 11.1(± 0.30) | | 25.2(± 0.40) | 25.5(± 0.40) | | 39.7(± 0.47) | 40.1(± 0.47) |
Distances
Mean (SD) distances of the predicted locus from the disease locus (kbp) for fraction of samples f in stage one and for sample sizes 100, 200 and 300 for each method. One-stage values do not depend on f and are listed once, in the f = 0.4 row.
f | n = 100: 1 stage | n = 100: 2 stage | n = 100: 2 stage-c | n = 200: 1 stage | n = 200: 2 stage | n = 200: 2 stage-c | n = 300: 1 stage | n = 300: 2 stage | n = 300: 2 stage-c |
---|---|---|---|---|---|---|---|---|---|
0.3 | | 68(± 5.8) | 85(± 5.8) | | 85(± 6.0) | 90(± 5.9) | | 82(± 5.7) | 83(± 5.7) |
0.4 | 68(± 5.7) | 67(± 5.7) | 75(± 5.8) | 86(± 6.0) | 87(± 6.0) | 91(± 6.0) | 81(± 5.6) | 81(± 5.6) | 82(± 5.7) |
0.5 | | 69(± 5.8) | 73(± 5.9) | | 86(± 6.0) | 90(± 6.0) | | 80(± 5.7) | 81(± 5.7) |
Number of genotyped SNPs and costs
By clustering nearby SNPs that are in high LD, one can substantially reduce genotyping costs in stage two. On average, the number of SNPs carried into the second stage with clustering (781 ± 12) is less than half of the number without clustering (1846 ± 72). These numbers are very robust with regard to sample size and the fraction of samples genotyped in stage one. The stage-one costs of the two two-stage methods are the same (about 30% to 50% of the cost of the one-stage design, depending on f), and the stage-two cost with clustering is about half of that without clustering. The overall saving depends on the ratio of the per-SNP genotyping costs in stage one and stage two.
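The cost comparison can be worked through with a simple model. The sketch below assumes that cost is proportional to the number of genotypes, that stage two genotypes only the remaining n(1 − f) subjects on the k selected SNPs, and that c1 and c2 are hypothetical per-genotype costs in stages one and stage two; none of these names or the cost model itself come from the paper. The SNP counts (m = 17,820, k = 1846 or 781) are the values reported in the text.

```python
def one_stage_cost(n, m, c1=1.0):
    """All n subjects genotyped on all m SNPs."""
    return n * m * c1


def two_stage_cost(n, f, m, k, c1=1.0, c2=1.0):
    """n*f subjects on m SNPs (stage one) plus n*(1-f) subjects on k SNPs (stage two)."""
    return n * f * m * c1 + n * (1 - f) * k * c2


n, f, m = 300, 0.3, 17_820
base = one_stage_cost(n, m)
for label, k in [("without clustering", 1846), ("with clustering", 781)]:
    for c2 in (1.0, 10.0):  # hypothetical stage-two / stage-one cost ratios
        ratio = two_stage_cost(n, f, m, k, c2=c2) / base
        print(f"{label}, c2/c1 = {c2:g}: {ratio:.3f} of one-stage cost")
```

Under this model, with equal per-genotype costs the total is dominated by stage one, so halving k changes the total only modestly; the benefit of clustering grows as stage-two genotyping becomes more expensive per SNP, which is why the overall saving depends on the cost ratio.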
Rare alleles
There is another locus on chromosome 6, about 5 cM away from the HLA-DRB1 locus, that contributes to the development of rheumatoid arthritis (RA). However, because the disease allele has a very low frequency (0.0083), the above procedure cannot detect the signal with small sample sizes (smaller than 300).
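A rough back-of-the-envelope calculation suggests why so rare an allele is hard to detect at these sample sizes. The sketch assumes, purely for illustration, that 0.0083 is the allele frequency among controls, that chromosomes are sampled independently, and that a study with n = 300 contains 150 controls; the function names are hypothetical.

```python
def expected_allele_copies(n_subjects: int, allele_freq: float) -> float:
    """Expected number of copies of the allele among 2 * n_subjects chromosomes."""
    return 2 * n_subjects * allele_freq


def prob_no_copies(n_subjects: int, allele_freq: float) -> float:
    """Probability that none of the 2 * n_subjects chromosomes carries the allele."""
    return (1.0 - allele_freq) ** (2 * n_subjects)


copies = expected_allele_copies(150, 0.0083)
p_none = prob_no_copies(150, 0.0083)
print(f"expected copies: {copies:.2f}, P(no copies): {p_none:.3f}")
```

With only a handful of expected copies in a group, the allele counts that enter a χ^{2} test are very small, so little power can be expected under these assumptions.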
Results on chromosome 18
The same procedure was applied to chromosome 18 (with 303 markers) using a different set of parameters. Results show that almost no SNPs that are significant in stage one can be grouped together when the LD threshold is D' = 0.8, even when the distance threshold is as large as 5 Mbp. Therefore, the above approach is effective mainly when using very dense SNP sets such as 300 K or 500 K arrays.
Type I errors
No genes on chromosome 1 have effects on RA in the simulated data, so it was used as a data set for evaluating type I errors for the three methods. Because this is another data set mimicking a 10 K SNP chip, the results from the two-stage designs with and without clustering are quite similar, and both methods have correct type I error rates at both the 0.05 and 0.01 levels (sample sizes 750, 1000, and 1250). The one-stage design using the Bonferroni correction has correct but much lower error rates, which means that the Bonferroni correction is conservative even for SNPs with low correlations.
Discussion and conclusion
For very dense SNP arrays, it is highly likely that SNPs within a short distance are not independent of each other. In this report, I have investigated a strategy of evaluating SNP correlations within a two-stage design using case-control samples, and have applied the algorithm to the Problem 3 simulated data sets of GAW15. The strategy can reduce the genotyping costs in stage two by half with similar or better performance (power/significance level, number of false positives, mapping precision) on data sets based on 300 K SNP arrays. Two-stage designs are promising for genome-wide association studies. As illustrated in this paper, advanced processing in stage one can further reduce genotyping costs in later stages without sacrificing mapping precision. A potential drawback of using SNPs with little redundancy is that a failed assay for a marker SNP in stage two loses information on a whole region of the genome.
Declarations
Acknowledgements
This work was supported by NIH/NLM grant LM008991, and in part by NIH/NCRR grant RR03655. The author thanks the group editors and referees for their diligent review and time spent in helping to improve the manuscript.
This article has been published as part of BMC Proceedings Volume 1 Supplement 1, 2007: Genetic Analysis Workshop 15: Gene Expression Analysis and Approaches to Detecting Multiple Functional Loci. The full contents of the supplement are available online at http://www.biomedcentral.com/1753-6561/1?issue=S1.
References
- Satagopan JM, Elston RC: Optimal two-stage genotyping in population-based association studies. Genet Epidemiol. 2003, 25: 149-157. 10.1002/gepi.10260.
- Thomas D, Xie R, Gebregziabher M: Two-stage sampling designs for gene association studies. Genet Epidemiol. 2004, 27: 401-414. 10.1002/gepi.20047.
- Wang H, Thomas DC, Pe'er I, Stram DO: Optimal two-stage genotyping designs for genome-wide association scans. Genet Epidemiol. 2006, 30: 356-368. 10.1002/gepi.20150.
- Skol AD, Scott LJ, Abecasis GR, Boehnke M: Joint analysis is more efficient than replication-based analysis for two-stage genome-wide association studies. Nat Genet. 2006, 38: 209-213. 10.1038/ng1706.
- Hirschhorn JN, Daly MJ: Genome-wide association studies for common diseases and complex traits. Nat Rev Genet. 2005, 6: 95-108. 10.1038/nrg1521.
- The International HapMap Consortium: A haplotype map of the human genome. Nature. 2005, 437: 1299-1320. 10.1038/nature04226.
- Lin DY: Evaluating statistical significance in two-stage genomewide association studies. Am J Hum Genet. 2006, 78: 505-509. 10.1086/500812.
Copyright
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.