Genome-wide association studies using single-nucleotide polymorphisms versus haplotypes: an empirical comparison with data from the North American Rheumatoid Arthritis Consortium.

The high genomic density of the single-nucleotide polymorphism (SNP) sets that are typically surveyed in genome-wide association studies (GWAS) now allows the application of haplotype-based methods. Although the choice of haplotype-based vs. individual-SNP approaches is expected to affect the results of association studies, few empirical comparisons of method performance have been reported on the genome-wide scale in the same set of individuals. To measure the relative ability of the two strategies to detect associations, we used a large dataset from the North American Rheumatoid Arthritis Consortium to: 1) partition the genome into haplotype blocks, 2) associate haplotypes with disease, and 3) compare the results with individual-SNP association mapping. Although some associations were shared across methods, each approach uniquely identified several strong candidate regions. Our results suggest that the application of both haplotype-based and individual-SNP testing to GWAS should be adopted as a routine procedure.


Background
Advances in genotyping technology have stimulated genome-wide association studies (GWAS) of common human diseases. The high genomic density of single-nucleotide polymorphisms (SNPs) available on current genotyping platforms raises the prospect of combining neighboring SNPs into haplotypes for association analysis.
Haplotype-based association testing offers several advantages over the standard "one-SNP-at-a-time" approach [1]. First, genome-wide haplotype approaches reduce the dimension of association testing when a single global test for a block is used. Performing fewer tests preserves power and helps to maintain reasonable false-positive rates. Second, haplotype methods facilitate detection of associations driven by cis-interactions among nearby SNPs that might be missed by methods that consider SNPs one at a time. Finally, haplotype approaches 1) recognize that variation in populations is inherently structured into genomic blocks and 2) exploit these correlations among SNPs. For all these reasons, using haplotypes in association testing is expected to increase power relative to single-SNP approaches, and studies based on human haplotype structure have provided support for this claim [2].
Nevertheless, haplotype association mapping faces several challenges. Haplotype block structure and phase are rarely observed directly in human genotyping data, so they must be inferred by statistical procedures that introduce additional error. Haplotype reconstruction methods that use different criteria produce different results [3], leaving the choice of the best approach for association mapping unclear. Moreover, when a block contains a large number of haplotypes, the increased degrees of freedom within a block can erode power.
The performance of haplotype-based and single-SNP association mapping has been compared, with mixed results. Power calculations using explicit analytical formulas [4] and simulations based on the HapMap [2] suggested that using haplotypes can significantly improve power. In contrast, Long and Langley [5] argued, on the basis of their simulation studies, that single-SNP tests are at least as powerful. Empirical comparisons have also yielded inconsistent conclusions [4].
The observed inconsistencies suggest that the performance of methods may depend on the nature of the data. Some associations might be detected more readily using individual SNPs, while others might only be discovered using haplotypes. Moreover, most previous studies have restricted their comparisons of method performance to a small subset of the genome. To address these issues, we conducted GWAS with both haplotype-based and individual-SNP methods. By applying both approaches to the same genome-wide dataset, the North American Rheumatoid Arthritis Consortium (NARAC) data provided by the Genetic Analysis Workshop 16 (GAW16), we could directly compare the performance of these methods in an empirical context.

Phenotypes and genotypes
The data set from the NARAC contained 868 case subjects with rheumatoid arthritis (RA) and 1,194 matched control subjects. Individuals were genotyped at 545,080 SNPs. Our study focused on the autosomes, where genotypes at 531,689 SNPs were available. After removing 17,754 SNPs that showed 1) a deviation from Hardy-Weinberg equilibrium in controls or 2) a minor allele frequency below 0.001 in the total sample, 513,935 SNPs were retained for further analysis.
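The two filters described above can be sketched as follows. This is a minimal illustration, not the procedure actually used: it assumes genotype counts are already tabulated per SNP, it uses a simple 1-df chi-square goodness-of-fit test for Hardy-Weinberg equilibrium, and the cutoff `hwe_alpha` is a hypothetical value, since the text does not state the threshold that was applied.

```python
import math

def minor_allele_frequency(n_aa, n_ab, n_bb):
    """Minor allele frequency from genotype counts (AA, AB, BB)."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)
    return min(p, 1 - p)

def hwe_chi2_pvalue(n_aa, n_ab, n_bb):
    """P-value of a 1-df chi-square test against Hardy-Weinberg proportions."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)
    q = 1 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    if min(expected) == 0:
        return 1.0  # monomorphic SNP: the test is uninformative
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of a chi-square variable with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2))

def passes_qc(control_counts, total_counts, maf_min=0.001, hwe_alpha=1e-4):
    """Keep a SNP if it is in HWE among controls and not too rare overall.

    hwe_alpha is an assumed cutoff (not given in the text); maf_min = 0.001
    is the minor allele frequency bound quoted above.
    """
    in_hwe = hwe_chi2_pvalue(*control_counts) >= hwe_alpha
    common_enough = minor_allele_frequency(*total_counts) >= maf_min
    return in_hwe and common_enough
```

For example, `passes_qc((25, 50, 25), (50, 100, 50))` keeps a SNP whose control genotypes match Hardy-Weinberg proportions exactly, whereas a SNP with no heterozygotes among controls would be removed.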
Haplotype block partitioning algorithms
Two methods were used to define haplotype blocks. The first method [6] was based on D', a normalized measure of linkage disequilibrium (LD). This method identified informative pairs of SNPs as Category (1): those for which the upper 95% confidence bound on D' was between 0.7 and 0.98 and Category (2): those for which the upper confidence bound on D' was less than 0.9. We defined a haplotype block as a region where at least 95% of pairs among informative SNPs belonged to Category (1). The second method [7] used the four-gamete rule. For each pair of SNPs, the population frequencies of the four possible haplotypes were calculated, and the number of haplotypes with observed frequencies of at least 0.01 was counted. Blocks were constructed by combining consecutive pairs of SNPs for which only three haplotypes were observed. Both methods were implemented in the computer program Haploview [8].
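The four-gamete rule can be illustrated with a short Python sketch. This is not Haploview's implementation, and it rests on several simplifying assumptions: haplotypes are supplied already phased (Haploview estimates pairwise haplotype frequencies from unphased genotypes), a SNP is added to the current block only if every pair it forms with SNPs already in the block shows fewer than four observed two-SNP haplotypes, and blocks are grown greedily from left to right. The function names are hypothetical.

```python
from collections import Counter

def observed_gametes(haplotypes, i, j, min_freq=0.01):
    """Count two-SNP haplotypes at SNPs i and j with frequency >= min_freq."""
    counts = Counter((h[i], h[j]) for h in haplotypes)
    n = len(haplotypes)
    return sum(1 for c in counts.values() if c / n >= min_freq)

def four_gamete_partitions(haplotypes, min_freq=0.01):
    """Greedy left-to-right partition of SNP indices.

    Assumption: a SNP joins the current partition only if all pairs it forms
    with SNPs already in the partition show fewer than four observed gametes.
    Partitions with one SNP are singletons; those with two or more are blocks.
    """
    n_snps = len(haplotypes[0])
    partitions = []
    current = [0]
    for snp in range(1, n_snps):
        compatible = all(
            observed_gametes(haplotypes, prev, snp, min_freq) < 4
            for prev in current
        )
        if compatible:
            current.append(snp)
        else:
            partitions.append(current)
            current = [snp]
    partitions.append(current)
    return partitions

# Toy phased haplotypes over three biallelic SNPs (alleles coded 0/1).
toy = ["000", "110", "111", "001"]
print(four_gamete_partitions(toy))  # SNPs 0 and 1 form a block; SNP 2 is a singleton
```

In the toy data, SNPs 0 and 1 show only two haplotypes, while SNP 2 shows all four combinations with SNP 0, so it starts a new partition.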
To examine the effects of changes in parameters on block definition, we varied the proportion of informative SNP pairs in Category (1) needed to combine blocks (in Gabriel's method) and the minimum frequency needed to call a haplotype "observed" (in the four-gamete rule algorithm). We compared the block sizes that resulted from applying these algorithms to the 8,051 SNPs on chromosome 21 and assuming a range of values for these two parameters. Although variation in block size was observed, block sizes for parameter values near the defaults were fairly stable; consequently, default values (0.95 in Gabriel's method, 0.01 in the four-gamete rule algorithm) were used in all subsequent analyses.
Testing for associations between disease status and genotype
A high variance inflation factor (1.45) [9] suggested that association analyses of these data might be affected by population stratification. To account for these effects, we calculated the top eigenvectors of the covariance matrix across the samples [10] using SNPs sampled at every fifth position after excluding SNPs on the short arms of chromosomes 6 and 8, as Plenge et al. suggested [11]. Ten outliers, detected from ten eigenvectors, were excluded in all subsequent analyses. Both individual SNP associations and haplotype associations were measured by likelihood ratio tests via logistic regression in which three eigenvectors were included as covariates to correct for population stratification. The tests for individual SNP associations were implemented in the computer program PLINK [12,13]. For haplotype association tests, we estimated haplotypes in each block by the standard expectation maximization algorithm, implemented in PLINK, and conducted likelihood ratio tests via logistic regression on the haplotypes using the statistical package R [14]. Because we aimed to detect collective associations between groups of haplotypes and arthritis, we used a single global test of association for haplotypes. p-Values were compared to the Bonferroni threshold (α = 0.05 divided by the number of tests) to identify statistically significant loci (SNPs not belonging to haplotypes were counted in both sets of analyses).
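The global likelihood ratio test for a block can be sketched as follows. The sketch assumes that the maximized log-likelihoods of two nested logistic regressions are already available (a null model with the eigenvector covariates only, and a full model that adds the haplotype terms), and that a block with H haplotypes contributes H - 1 degrees of freedom because one haplotype serves as the reference. Both the function names and the degrees-of-freedom convention are assumptions made for illustration; the model fitting itself was done in R and is not reproduced here.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df.

    Uses the recurrence Q(a + 1, z) = Q(a, z) + z**a * exp(-z) / Gamma(a + 1)
    for the regularized upper incomplete gamma function, with z = x / 2.
    """
    z = x / 2
    if df % 2 == 0:
        a, q = 1.0, math.exp(-z)             # df = 2 base case
    else:
        a, q = 0.5, math.erfc(math.sqrt(z))  # df = 1 base case
    while a < df / 2:
        q += z ** a * math.exp(-z) / math.gamma(a + 1)
        a += 1
    return q

def global_haplotype_lrt(loglik_null, loglik_full, n_haplotypes):
    """Likelihood ratio test of all haplotype effects in a block at once.

    Assumes the full model has n_haplotypes - 1 more parameters than the
    null model (one haplotype taken as reference).
    """
    statistic = 2 * (loglik_full - loglik_null)
    df = n_haplotypes - 1
    return statistic, df, chi2_sf(statistic, df)

# Hypothetical log-likelihoods for a block with four haplotypes.
stat, df, p = global_haplotype_lrt(-100.0, -96.0, 4)
print(stat, df, round(p, 4))
```

The same statistic with df = 1 corresponds to the single-SNP test under an additive coding, which makes explicit why blocks with many haplotypes pay a price in degrees of freedom.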

Results
In what follows, we refer to a haplotype block as a partition containing at least two SNPs and a singleton as a partition containing only one SNP. GAB blocks and GAM blocks represent blocks constructed by the algorithm of Gabriel et al. [6] and the four-gamete rule, respectively.
Haplotype block partitioning
GAM partitioned the genome into more blocks (100,121) than GAB (97,881). On average, GAM blocks included more SNPs and were larger in size than GAB blocks (median number of SNPs: 3 in GAB; 4 in GAM), suggesting greater genomic coverage by GAM blocks. We note that a block is defined when a partition consists of at least two SNPs. Because the GAB method produces more singletons, the GAM method has both more blocks and a higher average block size. Block sizes estimated by both methods (GAM median size = 8,639 bp; GAB median size = 7,335 bp) were similar to those observed for other populations of European descent [15]. We also uncovered considerable variation in haplotype block structure across the genome, with block size ranging from 2 bp to 3,547,000 bp for both methods. Similar numbers of SNPs were assigned to haplotype blocks on the different chromosomes (median = 3 for most chromosomes). Because chromosomes vary in physical size, this suggests that variation in block size among chromosomes primarily reflected differences in the density of genotyped SNPs. Most (92%) SNPs were localized to GAB or GAM blocks, indicating that the density of genotyped SNPs was sufficient to conduct haplotype-based association analysis.
We ran Haploview on an Intel Xeon 3 GHz dual quad-core system with 32 GB of RAM. GAB and GAM block partitioning of chromosome 22 (8,205 SNPs) required 19 and 14 minutes, respectively.
Haplotype association test vs. individual SNP association test
A large number of tests on chromosome 6 showed strong associations (data not shown). After Bonferroni correction, the significance level of 0.05 becomes 9.73 × 10⁻⁸ for single-SNP tests (513,935 tests), 2.71 × 10⁻⁷ for GAB (184,504 tests: 97,881 GAB blocks plus 86,623 singletons), and 3.11 × 10⁻⁷ for GAM (160,737 tests: 100,121 GAM blocks plus 60,616 singletons). Several associations survived the stringent Bonferroni correction for multiple testing (51 GAB blocks, 50 GAM blocks, and 21 individual SNPs), and some associations were shared among methods. A total of 8 out of 51 significant GAB blocks (Table 1) and 6 out of 50 significant GAM blocks included SNPs that were also significant in individual SNP association tests. Four SNPs showed significant associations using all three methods (Table 1). However, many associations were detected by only one type of test. Forty-three GAB blocks and 44 GAM blocks that showed significant associations were not detected by individual SNP association tests, and 11 of the 21 SNPs that showed significant associations in individual tests were not significant in haplotype association tests.
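The Bonferroni thresholds quoted above follow directly from the test counts, as the short Python check below illustrates. The dictionary keys and the function name are labels chosen here for illustration; the counts are those reported in the text.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance level that keeps the family-wise level at alpha."""
    return alpha / n_tests

# Number of tests per analysis: blocks plus singletons for the haplotype methods.
n_tests = {
    "single SNP": 513_935,
    "GAB": 97_881 + 86_623,
    "GAM": 100_121 + 60_616,
}

for method, n in n_tests.items():
    print(f"{method}: {n} tests, threshold {bonferroni_threshold(0.05, n):.2e}")
```

Because the haplotype analyses involve roughly one third as many tests, their per-test thresholds are about three times less stringent than the single-SNP threshold.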
We asked whether relaxing the significance threshold for individual SNP association tests by a factor of 2,000 (to 9.73 × 10⁻⁸ × 2,000 ≈ 1.95 × 10⁻⁴) and considering haplotype associations within ±100 kb of the resulting significant SNPs improved the consistency between individual SNP association tests and haplotype association tests. However, even under these liberal criteria, only 25 GAB blocks and 29 GAM blocks overlapped with the regions containing significant SNP association tests.
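The window-based comparison can be sketched in Python as follows. The representation of blocks as (chromosome, start, end) tuples and of SNPs as (chromosome, position) tuples, the function names, and the toy coordinates are all assumptions made for illustration; only the 2,000-fold relaxation and the ±100 kb window come from the text.

```python
WINDOW_BP = 100_000  # +/- 100 kb around each significant SNP

def relaxed_threshold(bonferroni, factor=2000):
    """Relax a Bonferroni threshold by a multiplicative factor."""
    return bonferroni * factor

def blocks_near_snps(blocks, snps, window=WINDOW_BP):
    """Return the blocks that overlap a +/- window around any SNP.

    blocks: iterable of (chrom, start, end); snps: iterable of (chrom, pos).
    A block overlaps when it lies on the same chromosome and its interval
    intersects [pos - window, pos + window].
    """
    hits = []
    for chrom, start, end in blocks:
        if any(
            c == chrom and start <= pos + window and end >= pos - window
            for c, pos in snps
        ):
            hits.append((chrom, start, end))
    return hits

# Hypothetical coordinates: only the first block lies within 100 kb of the SNP.
toy_blocks = [(6, 1_000_000, 1_010_000), (6, 5_000_000, 5_020_000), (1, 1_000_000, 1_005_000)]
toy_snps = [(6, 1_090_000)]
print(blocks_near_snps(toy_blocks, toy_snps))
print(f"{relaxed_threshold(0.05 / 513_935):.3g}")
```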
Most of the SNPs and haplotypes that showed significant associations on chromosome 6 were located near the HLA region. Similar to the pattern for significant associations in other genomic regions, p-values from haplotype association tests were smaller than p-values from individual SNP association tests (Figure 1).
Haplotype-based association tests of 1,578 blocks on chromosome 22 using R required 4 minutes on an Intel Xeon 3 GHz dual quad-core system with 32 GB of RAM.
Haplotype association test: GAB vs. GAM
Twenty-five significant GAB blocks and 25 significant GAM blocks overlapped with each other. Of these 25 overlapping pairs, 18 consisted of identical GAB and GAM blocks, which explains why the haplotype-based associations for these 18 blocks were consistent between the two partitions. For the remaining seven overlapping blocks, we hypothesize that signals of association with disease were strong, so that their detection was not sensitive to block partitioning. The fact that three of these blocks included SNPs that were significant in individual SNP association tests supports this hypothesis. In the regions where association tests in GAB and GAM blocks showed inconsistent results, we observed that differences between GAB block partitions and GAM block partitions were not unusual. In many cases, one block contained the other block with an additional one or two SNPs, but results from haplotype-based association tests in the two blocks were still substantially different.

Conclusion
GWAS now commonly survey SNPs at a genomic density similar to that of this study. Consequently, our observation that most of the genome could be organized into multi-SNP haplotypes indicates that available resources are sufficient to conduct haplotype-based mapping on the genomic scale. Those regions that were significantly associated with RA in both individual-SNP tests and haplotype-based tests represent promising candidates for further study. For example, the HLA region on chromosome 6 remained significant for all three genome-wide association tests, strongly suggesting that genes in this region contribute to disease risk.
Although some associations were observed consistently across methods, some associations were only detected using haplotype-based tests. Several factors might explain these differences. Haplotype-based methods required approximately 65% fewer tests than the individual-SNP approach. As a result, the multiple testing correction was less severe for haplotype-based methods. Haplotype-based methods can also detect cis-interactions among several causal variants [16]. Furthermore, because the power to detect associations is maximized when marker and causal variant frequencies are similar, analyses using haplotypes could find associations with rare alleles that analyses using individual SNPs may miss. We also discovered associations using individual SNPs that were not seen in haplotype tests. Perhaps these represented cases in which only a single SNP exhibited strong LD with a causal variant, so that forming haplotypes with several adjacent SNPs diluted the strength of association. Regardless of the explanation for observed differences among methods, our results indicate that the application of both individual-SNP and haplotype-based approaches to GWAS will maximize the potential for finding biologically important associations.
Although some regions show consistent significant associations in different block partitions (GAB and GAM), in most regions haplotype-based association tests are highly sensitive to changes in block partitions. This result suggests that the effects of other block partitioning algorithms on GWAS should be compared. For example, haplotype-based association testing using a sliding window of fixed physical or genetic size would be an alternative approach. Although this strategy is easily implemented, it ignores information about haplotype block structure. The variation in block structure across the genome suggests that methods that use this structure (such as those applied in this paper) should be more powerful for GWAS, but this issue needs to be examined.
Our study also suggests several avenues for future research. Additional measurements of the effects of different haplotype partitioning algorithms on the power of downstream association tests (in both simulations and empirical data) would be useful. For example, the error inherent in haplotype block estimation needs to be incorporated in association analysis. Furthermore, the likelihood ratio tests used here ignored the evolutionary relationships among haplotypes. An improved analysis that uses this information (e.g., a cladistic analysis [17]) would be worthwhile.