Density-based clustering in haplotype analysis for association mapping
© Igo et al; licensee BioMed Central Ltd. 2007
Published: 18 December 2007
Clustering of related haplotypes in haplotype-based association mapping has the potential to improve power by reducing the degrees of freedom without sacrificing important information about the underlying genetic structure. We have modified a generalized linear model approach for association analysis by incorporating a density-based clustering algorithm to reduce the number of coefficients in the model. Using the GAW 15 Problem 3 simulated data, we show that our novel method can substantially enhance power to detect association with the binary rheumatoid arthritis (RA) phenotype at the HLA-DRB1 locus on chromosome 6. In contrast, clustering did not appreciably improve performance at locus D, perhaps a consequence of a rare susceptibility allele and of the overwhelming effect of HLA-DRB1/locus C, 5 cM distal. Optimization of parameters governing the clustering algorithm identified a set of parameters that delivered nearly ideal performance in a variety of situations. The cluster-based score test was valid over a wide range of haplotype diversity, and was robust to severe departures from Hardy-Weinberg equilibrium encountered near HLA-DRB1 in RA case-control samples.
Haplotypes generally contain more information than individual single-nucleotide polymorphisms (SNPs) about the underlying genetic architecture, and therefore offer greater power to detect association between markers and traits. However, the power of haplotype-based methods for association mapping, like that of other approaches, is diminished in studies of complex traits by the presence of both allelic heterogeneity (i.e., mutations arising more than once in the same gene) and locus heterogeneity. One approach to ameliorate the effect of allelic heterogeneity is to cluster similar haplotypes, under the assumption that these may have diverged more recently in a population's history than the occurrence of a disease-causing mutation.
We combined the density-based clustering algorithm of Li and Jiang [1] with the generalized linear model (GLM) approach of Schaid et al. [2, 3] for association mapping. Based on real pedigrees and SNPs, the simulated Genetic Analysis Workshop (GAW) 15 Problem 3 data sets provide an outstanding opportunity to compare the performance of our novel cluster-based method with the original, haplotype-based approach. The region near the HLA-DRB1 gene, in addition, presents an unusual context for rheumatoid arthritis (RA), on account of the very strong effect of certain HLA-DRB1 alleles on the phenotype [4], potentially inducing deviation from Hardy-Weinberg equilibrium (dHWE) in nearby SNPs in case-control samples. The GLM used in both methods relies on the assumption of HWE in calculating posterior probabilities of haplotype pairs from unphased SNP genotypes. The original approach of Schaid et al. appears to be robust to dHWE in simulated case-control data generated under a simple genetic model [5]. However, the sensitivity of our novel approach to dHWE remains to be tested.
In this report, we compare the performance of the haplotype- and cluster-based methods in detecting association with RA, and assess the type I error of both methods in the presence and absence of dHWE.
All analyses were carried out with knowledge of the true location of susceptibility loci.
Marker names from the chromosome 6 dense SNP scan are abbreviated here such that "denseSNP6_N" will be denoted as "SNP N". We tested the markers flanking the HLA-DRB1 locus (DRB1, coincident with SNP 3437, 49.5 cM) and locus D (between SNPs 3916 and 3917, 54.6 cM) for redundancy using BEST [6], and removed one redundant marker, SNP 3434, from the region near DRB1. We explored patterns of linkage disequilibrium (LD) and assessed the significance of nonzero LD by the likelihood ratio test in HaploView version 3.32 [7].
Testing for dHWE was carried out using the exact test in HaploView, and, for confirmation, the exact and χ² tests as implemented in the R genetics package, version 1.2.0. Analyses were performed on sets of 1500 cases – one affected sib chosen at random from each affected-sib pair (ASP) family – and 2000 unrelated controls. All cases and controls in each set were from the same replicate.
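As an illustration of the confirmatory χ² test for dHWE, the following is a minimal sketch of the textbook goodness-of-fit test on genotype counts at one biallelic SNP. It is not the exact test used in HaploView, nor the code of the R genetics package; the function name hwe_chi_square and the example counts are hypothetical.

```python
import math

def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square goodness-of-fit test for Hardy-Weinberg proportions.

    Takes observed genotype counts at a biallelic SNP and returns
    (statistic, p_value), using 1 degree of freedom.
    """
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # estimated frequency of allele A
    q = 1.0 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Upper tail of a chi-square with 1 d.f.: P(X > x) = erfc(sqrt(x / 2))
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value

# Hypothetical counts: the first sample sits exactly at HWE, the second
# has no heterozygotes at all and departs from HWE as far as possible.
print(hwe_chi_square(100, 200, 100))
print(hwe_chi_square(200, 0, 200))
```

In a case-only sample near a strong susceptibility locus, an excess or deficit of heterozygotes of this kind is what the test would flag.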
We used the ASSOC program in S.A.G.E. [8] for case-control tests of association. The transmission disequilibrium test (TDT) was performed on trios of parents and affected child as implemented in the S.A.G.E. program TDTEX. Trios were selected with one random offspring from all 1500 ASP families in each replicate. In addition, the generalized family-based approach implemented in FBAT version 1.7.2 [9, 10] was carried out in parallel on sets of 1500 complete ASP families, with a null hypothesis of no linkage or association.
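For readers unfamiliar with the TDT, a minimal sketch of its classical asymptotic (McNemar-type) form for a single biallelic marker follows. This is an assumption-laden illustration and not the TDTEX implementation; the function name tdt and the transmission counts are hypothetical.

```python
import math

def tdt(transmitted, untransmitted):
    """Classical TDT statistic from heterozygous-parent transmissions.

    transmitted:   heterozygous parents who passed the allele of interest
                   to the affected child
    untransmitted: heterozygous parents who passed the other allele
    Returns (statistic, p_value) with 1 degree of freedom.
    """
    total = transmitted + untransmitted
    if total == 0:
        return 0.0, 1.0  # no informative parents
    stat = (transmitted - untransmitted) ** 2 / total
    p_value = math.erfc(math.sqrt(stat / 2.0))  # chi-square tail, 1 d.f.
    return stat, p_value

# Hypothetical counts: 60 transmissions versus 40 non-transmissions.
print(tdt(60, 40))
```

Only heterozygous parents contribute to the statistic, which is why trios rather than unrelated individuals are required.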
Association mapping by linear regression with clustering of haplotypes
We have extended the regression-based approach for association testing of Schaid et al. to incorporate haplotype groups via a density-based clustering algorithm [1]. The primary goal was to reduce the dimensionality of the regression by clumping together haplotypes that are likely to have diverged recently, whether through mutation or recombination. Posterior haplotype probabilities from unphased data are obtained from the Decipher program in S.A.G.E. [8] and are imported into a modified version of the HapMiner program [1] as haplotype weights for clustering. Each pair of haplotypes is assigned a similarity score, a generalization of several scores previously described, which is converted to a distance metric on the interval [0,1]. Clusters are formed in regions of high density (haplotype weight). A haplotype is designated a "core" haplotype if enough density, determined by the density threshold MinPts, is located within a given distance ε from it. Haplotypes within this ε neighborhood are clustered together. We modified HapMiner such that very common haplotypes, defined by the parameter pmin, are never clustered together. This prevents improper grouping of ancient haplotypes. For the analyses presented here, we selected a value for pmin of 1/(2k), where k is the number of haplotypes present with a frequency large enough to include in the GLM (see below).
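The clustering rule described above can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the HapMiner code: the normalized Hamming distance stands in for the similarity-based distance metric, MinPts is treated as a threshold on summed haplotype weight, and the names cluster_haplotypes, hamming_distance, min_pts and p_min are hypothetical.

```python
from collections import deque

def hamming_distance(h1, h2):
    """Fraction of marker positions at which two haplotypes differ (in [0, 1])."""
    return sum(a != b for a, b in zip(h1, h2)) / len(h1)

def cluster_haplotypes(weights, eps, min_pts, p_min):
    """Weighted density-based clustering of haplotypes.

    weights: dict mapping haplotype string -> weight (e.g. estimated frequency)
    eps:     neighborhood radius on the distance scale
    min_pts: summed weight required within eps for a haplotype to be "core"
    p_min:   haplotypes with weight >= p_min are "common" and are never
             placed in the same cluster as another common haplotype
    Returns a dict mapping haplotype -> integer cluster label.
    """
    haps = sorted(weights, key=weights.get, reverse=True)
    neighbors = {h: [g for g in haps if hamming_distance(h, g) <= eps]
                 for h in haps}
    is_core = {h: sum(weights[g] for g in neighbors[h]) >= min_pts for h in haps}
    is_common = {h: weights[h] >= p_min for h in haps}

    labels = {}
    next_label = 0
    for seed in haps:  # heaviest haplotypes seed clusters first
        if seed in labels or not is_core[seed]:
            continue
        labels[seed] = next_label
        has_common = is_common[seed]
        queue = deque([seed])
        while queue:  # grow the cluster through core haplotypes only
            current = queue.popleft()
            for g in neighbors[current]:
                if g in labels or (is_common[g] and has_common):
                    continue  # already assigned, or a second common haplotype
                labels[g] = next_label
                has_common = has_common or is_common[g]
                if is_core[g]:
                    queue.append(g)
        next_label += 1
    for h in haps:  # haplotypes not reached from any core remain singletons
        if h not in labels:
            labels[h] = next_label
            next_label += 1
    return labels

# Hypothetical frequencies for five four-marker haplotypes (k = 5).
freqs = {"0000": 0.40, "0001": 0.05, "1111": 0.40, "1110": 0.05, "0101": 0.10}
print(cluster_haplotypes(freqs, eps=0.25, min_pts=0.3, p_min=1 / (2 * len(freqs))))
```

In this toy input the two rare haplotypes join their common one-step neighbors, while "0101", which is common under pmin = 1/(2k), is kept apart from the cluster that already contains "0000".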
Cluster assignments for all possible haplotypes are imported into the haplo.score function in HaploStats [2, 3]. This method first estimates haplotype frequencies by the expectation maximization algorithm, and uses the frequencies to calculate posterior probabilities of haplotype pairs for each individual, assuming HWE. The posterior probabilities, in turn, are incorporated into a score test for association based on the likelihood function of a particular GLM – for case-control data, a logistic model – in which each haplotype is assigned a model coefficient. The global score statistic is asymptotically distributed, under the null model of no association (all coefficients equal to 0), as a χ² random variable with degrees of freedom equal to the rank of the variance matrix for the score statistic. In our cluster-based approach, parameter estimates are obtained for clusters, rather than for haplotypes. We calculated the variance of the score statistic as per the generalized score test of Boos [12] as implemented by Tzeng et al. [13] because we found that variance calculation in Schaid et al. based on the Louis information [14] inflated the type I error of the test when covariates were included in the analysis (data not shown). To prevent numerical instability and loss of power resulting from estimation of rare haplotypes, only haplotypes or clusters with frequencies above 0.002 were included.
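The effect of clustering on the number of terms entering the model can be illustrated with a short sketch. It only sums haplotype frequencies within clusters and applies the 0.002 frequency cutoff; it does not reproduce haplo.score or the score test itself, and the function name cluster_frequencies and the example frequencies are hypothetical.

```python
def cluster_frequencies(hap_freqs, labels, min_freq=0.002):
    """Sum haplotype frequencies within clusters and drop rare clusters.

    hap_freqs: dict mapping haplotype -> estimated frequency
    labels:    dict mapping haplotype -> cluster label
    Returns a dict mapping cluster label -> frequency for clusters whose
    combined frequency exceeds min_freq.
    """
    totals = {}
    for hap, freq in hap_freqs.items():
        totals[labels[hap]] = totals.get(labels[hap], 0.0) + freq
    return {c: f for c, f in totals.items() if f > min_freq}

# Hypothetical frequencies: D and E are each too rare to model on their own.
freqs = {"A": 0.5, "B": 0.3, "C": 0.1975, "D": 0.0015, "E": 0.001}

# Without clustering, every haplotype is its own group: D and E are dropped.
unclustered = cluster_frequencies(freqs, {h: h for h in freqs})

# With clustering, D and E share a group whose frequency passes the cutoff,
# and A and B are merged, so fewer terms enter the model.
clustered = cluster_frequencies(freqs, {"A": 0, "B": 0, "C": 1, "D": 2, "E": 2})

print(len(unclustered), len(clustered))
```

Here the clustered model retains information from haplotypes D and E that the unclustered model would discard, with the same number of terms.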
Power and type I error analyses on simulated datasets
We carried out studies of power and type I error of association mapping methods on subsets of all 100 replicates. Cases were randomly selected from the offspring of ASP families, such that no sample contained both sibs from any family; controls were chosen at random from the unrelated controls. All individuals within a sample were taken from the same replicate. We adjusted the sample size for each analysis to yield moderate (40–60%) power from the haplotype-only test. Where necessary, we included sex and the number of DRB1*04 alleles in the model as covariates, to reduce the signal strength to a level useful for power comparisons; this adjustment was not part of the analysis. Type I error in association-positive regions was assessed by randomizing case/control status (and covariates, if any) relative to genotype data by permutation.
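The permutation scheme for assessing type I error can be sketched as below. As a simplifying assumption, a 2×2 χ² test of case status against carrier status of a single haplotype stands in for the GLM score test; in an adjusted analysis the covariates would be shuffled together with case status. The function names and the sample are hypothetical.

```python
import math
import random

def chi_square_2x2_p(status, carrier):
    """Asymptotic p-value (1 d.f.) for association between two 0/1 vectors."""
    n = len(status)
    row1, col1 = sum(status), sum(carrier)
    row0, col0 = n - row1, n - col1
    if min(row1, row0, col1, col0) == 0:
        return 1.0  # degenerate table, no test possible
    a = sum(1 for s, c in zip(status, carrier) if s == 1 and c == 1)
    b, c = row1 - a, col1 - a
    d = n - a - b - c
    stat = n * (a * d - b * c) ** 2 / (row1 * row0 * col1 * col0)
    return math.erfc(math.sqrt(stat / 2.0))

def permutation_type1_error(status, carrier, n_perm=1000, alpha=0.05, seed=0):
    """Fraction of case/control permutations that reject at level alpha."""
    rng = random.Random(seed)
    shuffled = list(status)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break any link between trait and genotype
        if chi_square_2x2_p(shuffled, carrier) <= alpha:
            hits += 1
    return hits / n_perm

# Hypothetical sample: 100 cases (60 carriers) and 100 controls (30 carriers).
status = [1] * 100 + [0] * 100
carrier = [1] * 60 + [0] * 40 + [1] * 30 + [0] * 70
print(chi_square_2x2_p(status, carrier))
print(permutation_type1_error(status, carrier, n_perm=500))
```

A valid test should return a rejection rate close to the nominal level on the permuted data even when the unpermuted data show strong association.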
A second null-model analysis was performed at locations far removed from the HLA region and locus D. The mean haplotype diversity, measured as the number of unique haplotypes present in the phased data, for all sets of six and eight consecutive markers was estimated over replicates 1–50 in two large regions on chromosome 6q comprising SNPs 11001–12000 and 15001–16000. Four levels of diversity were defined: low (10th percentile of all marker sets), medium (50th), high (90th) and very high (99th). Samples were extracted at selected locations at each diversity level, and score tests for association were carried out as above (without permutation).
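A minimal sketch of the diversity measure, assuming phased haplotypes are available as equal-length strings: it counts unique haplotypes in every window of consecutive markers and reads off percentiles of those counts. Averaging over replicates is omitted, the nearest-rank percentile is one convention among several, and the function names and data are hypothetical.

```python
import math

def window_diversity(haplotypes, window):
    """Number of unique haplotypes in each run of `window` consecutive markers."""
    n_markers = len(haplotypes[0])
    return [len({h[i:i + window] for h in haplotypes})
            for i in range(n_markers - window + 1)]

def percentile(values, pct):
    """Nearest-rank percentile of a list of values."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

# Hypothetical phased haplotypes over four markers, scanned in windows of two.
phased = ["0000", "0001", "0011", "0111"]
diversity = window_diversity(phased, window=2)
print(diversity)
print(percentile(diversity, 10), percentile(diversity, 50))
```

Marker sets whose diversity falls at the chosen percentiles could then serve as the low, medium, high and very high diversity test locations.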
Deviation from HWE
Power analyses at HLA-DRB1 and locus D
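Table 1. Power of haplotype- and cluster-based association analyses at the HLA-DRB1 locus. Columns: power at α = 0.05; power at α = 0.01; mean d.f. Row groups: 6 markers; 8 markers. (Table values and footnotes are not preserved in this copy.)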
Table 2. Power of association analyses at locus D. Columns: power with no covariates (α = 0.05, α = 0.01); power, adjusted (α = 0.05, α = 0.01); mean d.f. Rows: 6 markers; clusters, ε = 0.4; clusters, ε = 0.5; 8 markers; clusters, ε = 0.7; clusters, ε = 0.5. (Table values and footnotes are not preserved in this copy.)
Ability to detect association at locus D was markedly reduced with adjustment for sex and number of DRB1*04 alleles, with power at the 0.01 significance level falling almost to background (Table 2). This observation strongly suggests that DRB1 and locus C are providing most of the genetic signal at locus D. Low but significant LD (|D'| < 0.1; LOD score > 2 for H0: D' = 0) was observed between SNP 3437 at DRB1 and SNP 3917, 1.6 kb from locus D (data not shown). Given the overwhelming effect of DRB1 on RA, this small level of LD may explain the association between RA and haplotypes at locus D.
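For reference, the normalized LD coefficient D' between two biallelic markers can be computed from haplotype and allele frequencies as in the following sketch. It uses Lewontin's standard definition; the function name d_prime and the example frequencies are hypothetical, and the LOD-based significance test reported by HaploView is not reproduced.

```python
def d_prime(p_ab, p_a, p_b):
    """Lewontin's D' between two biallelic markers.

    p_ab: frequency of the haplotype carrying allele A and allele B
    p_a:  frequency of allele A at the first marker
    p_b:  frequency of allele B at the second marker
    """
    d = p_ab - p_a * p_b  # raw disequilibrium coefficient D
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    return d / d_max if d_max > 0 else 0.0

# Hypothetical frequencies: complete LD, linkage equilibrium, complete repulsion.
print(d_prime(0.5, 0.5, 0.5), d_prime(0.25, 0.5, 0.5), d_prime(0.0, 0.5, 0.5))
```

A value of |D'| below 0.1, as observed between SNP 3437 and SNP 3917, lies close to the equilibrium case in this scale.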
Assessment of type I error
False-positive rates measured at HLA-DRB1 in the absence of covariates matched expectation when trait values were permuted relative to genotypes, both with and without haplotype clustering, as did those measured at locus D. With adjustment for sex and DRB1*04 alleles, cluster-based score tests again yielded proper type I error, whereas the test was somewhat conservative without clustering, returning a false-positive rate of about 0.025 at a significance level of 0.05. Results were similar for all six- and eight-marker haplotypes examined in power analyses (data not shown). Thus, our cluster-based score test appears to be robust to the severe dHWE encountered within the HLA region.
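Table 3. Type I error of score tests, as a function of haplotype diversity. Columns: two groups, each reporting α = 0.05 and α = 0.01. (Table values, row labels and column-group labels are not preserved in this copy.)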
In summary, incorporation of haplotype clustering by the procedure of Li and Jiang [1] noticeably improves the power of the association mapping approach of Schaid et al. to detect association with RA at the DRB1 locus (with adjustments to reduce signal strength), but only minimally improves power at locus D. In general, we expect clustering, in the presence of allelic heterogeneity, to improve performance of the score test and to enhance our ability to identify causative variants. Clustering also promises to increase power of the test in regions with extensive haplotype diversity by grouping rare haplotypes and thus reducing the degrees of freedom of the score test. However, because the Problem 3 data were not simulated under a coalescent model incorporating independent disease-causing mutations at DRB1, we could not directly test these hypotheses. Similarly, haplotype analysis would not necessarily be expected to improve upon single-SNP association methods given data simulated in this manner, especially at a trait locus as overwhelmingly influential as DRB1. Indeed, at least two other GAW15 studies did not obtain more significant results from haplotype analysis than with single-locus approaches [15, 16].
Comparisons of the two approaches at these loci suggest guidelines for selecting operating parameters for HapMiner. Although performance was optimized at several values of ε, setting ε to 0.5 provided near-optimal results in all situations. The choice of MinPts affected performance very little at the relatively large range of ε displayed here. At small values of ε, however, MinPts may significantly affect the degree of clustering (data not shown). Limiting the extent of clustering by setting pmin relatively small prevents "overclustering," in which haplotypes not recently diverged are grouped together, but also reduces the potential advantage of clustering. In practice, a reduction in power on clustering haplotypes is indicative of overclustering (data not shown; RPI, unpublished results). The Shannon information criterion employed by Tzeng et al. [13, 17] for determining "core" haplotypes may also prove useful for limiting clustering of common haplotypes by the HapMiner algorithm. This method differs from that of Tzeng et al.  in that clustering is based on the distance metric rather than on an evolutionary model. In addition, less common haplotypes are not necessarily grouped with the most common ones, allowing widely diverged haplotypes to remain distinct.
Our work provides evidence that the GLM framework for association mapping is robust to severe departures from HWE under the null model. However, the GLM approach appears to be sensitive to adjustment for a covariate that is very tightly correlated with the trait: it may lose power in the presence of more extensive haplotype diversity. Clustering decreased this sensitivity, most likely by reducing the number of coefficients. It is possible that removing the multiplicative effect of HLA-DRB1*04 alleles on the odds of RA also extracted most of the trait information, perhaps changing the null distribution of the score statistic.
The apparent association at locus D appears to be largely due to HLA-DRB1 and locus C. The Problem 3 data appear to be unusual in that one locus exerts such a strong effect on the disease of interest that association is clearly discernible from a distance of over 5 cM. It is not surprising that our ability to detect locus D was marginal. The association study design is predicated on the "common disease-common variant" hypothesis [18], which posits that complex disease is characterized by small disease-locus effects for ancient, common alleles, and rampant locus heterogeneity. Locus D, on the other hand, has a very low disease allele frequency (0.008), and although the increase in risk is large with each disease allele, not enough susceptible genotypes were available in the case population to detect it.
We thank Dan Baechle for his programming expertise. This work was supported in part by NIH grant HL07567. JL is supported in part by NIH/NLM grant 008911 and a startup fund from Case Western Reserve University. Some of the results in this paper were obtained using the program package S.A.G.E., which is supported by a U.S. Public Health Service Resource Grant (RR03655) from the National Center for Research Resources.
This article has been published as part of BMC Proceedings Volume 1 Supplement 1, 2007: Genetic Analysis Workshop 15: Gene Expression Analysis and Approaches to Detecting Multiple Functional Loci. The full contents of the supplement are available online at http://www.biomedcentral.com/1753-6561/1?issue=S1.
1. Li J, Jiang T: Haplotype-based linkage disequilibrium mapping via direct data mining. Bioinformatics. 2005, 21: 4384-4393. 10.1093/bioinformatics/bti732.
2. Schaid DJ: Evaluating associations of haplotypes with traits. Genet Epidemiol. 2004, 27: 348-364. 10.1002/gepi.20037.
3. Schaid DJ, Rowland CM, Tines DE, Jacobson RM, Poland GA: Score tests for association between traits and haplotypes when linkage phase is ambiguous. Am J Hum Genet. 2002, 70: 425-434. 10.1086/338688.
4. Newton JL, Harney SMJ, Wordsworth BP, Brown MA: A review of the MHC genetics of rheumatoid arthritis. Genes Immun. 2004, 5: 151-157. 10.1038/sj.gene.6364045.
5. Satten GA, Epstein MP: Comparison of prospective and retrospective methods for haplotype inference in case-control studies. Genet Epidemiol. 2004, 27: 192-201. 10.1002/gepi.20020.
6. Sebastiani P, Lazarus R, Weiss ST, Kunkel LM, Kohane IS, Ramoni MF: Minimal haplotype tagging. Proc Natl Acad Sci USA. 2003, 100: 9900-9905. 10.1073/pnas.1633613100.
7. Barrett JC, Fry B, Maller J, Daly MJ: Haploview: analysis and visualization of LD and haplotype maps. Bioinformatics. 2005, 21: 263-265. 10.1093/bioinformatics/bth457.
8. Statistical Analysis for Genetic Epidemiology, version 5.2. [http://darwin.cwru.edu/sage/]
9. Laird NM: Implementing a unified approach to family-based tests of association. Genet Epidemiol. 2000, 19 (Suppl 1): S36-S42. 10.1002/1098-2272(2000)19:1+<::AID-GEPI6>3.0.CO;2-M.
10. Rabinowitz D, Laird N: A unified approach to adjusting association tests for population admixture with arbitrary pedigree structure and arbitrary missing marker information. Hum Hered. 2000, 50: 211-223. 10.1159/000022918.
11. Tzeng J-Y, Devlin B, Wasserman L, Roeder K: On the identification of disease mutations by the analysis of haplotype similarity and goodness of fit. Am J Hum Genet. 2003, 72: 891-902. 10.1086/373881.
12. Boos DD: On generalized score tests. Am Stat. 1992, 46: 327-333. 10.2307/2685328.
13. Tzeng J-Y, Wang C-H, Kao J-T, Hsiao CK: Regression-based association analysis with clustered haplotypes through use of genotypes. Am J Hum Genet. 2006, 78: 231-242. 10.1086/500025.
14. Louis TA: Finding the observed information matrix when using the EM algorithm. J Royal Stat Soc B. 1982, 44: 226-233.
15. Pankratz N: A two-stage classification approach identifies seven susceptibility genes for a simulated complex disease. BMC Proc. 2007, 1 (Suppl 1): S30.
16. Yoo YJ, Gao G, Zhang K: Case-control association analysis of rheumatoid arthritis with candidate genes using related cases. BMC Proc. 2007, 1 (Suppl 1): S33.
17. Tzeng J-Y: Evolutionary-based grouping of haplotypes in association analysis. Genet Epidemiol. 2005, 28: 220-231. 10.1002/gepi.20063.
18. Collins FS, Guyer MS, Chakravarti A: Variations on a theme: cataloguing human DNA sequence variation. Science. 1997, 278: 1580-1581. 10.1126/science.278.5343.1580.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.