General methods
All analyses were carried out with knowledge of the true location of susceptibility loci.
Marker names from the chromosome 6 dense SNP scan are abbreviated here such that "denseSNP6_N" will be denoted as "SNP N". We tested the markers flanking the HLA-DRB1 locus (DRB1, coincident with SNP 3437, 49.5 cM) and locus D (between SNPs 3916 and 3917, 54.6 cM) for redundancy using BEST [6], and removed one redundant marker, SNP 3434, from the region near DRB1. We explored patterns of linkage disequilibrium (LD) and assessed the significance of nonzero LD by the likelihood ratio test in HaploView version 3.32 [7].
Testing for dHWE was carried out using the exact test in HaploView, and, for confirmation, the exact and χ2 tests as implemented in the R genetics library package, version 1.2.0. Analyses were performed on sets of 1500 cases – one affected sib chosen at random from each affected-sib pair (ASP) family – and 2000 unrelated controls. All cases and controls in each set were from the same replicate.
We used the ASSOC program in S.A.G.E. [8] for case-control tests of association. The transmission disequilibrium test (TDT) was performed on trios of parents and affected child as implemented in the S.A.G.E. program TDTEX. Trios were selected with one random offspring from all 1500 ASP families in each replicate. In addition, the generalized family-based approach implemented in FBAT version 1.7.2 [9, 10] was carried out in parallel on sets of 1500 complete ASP families, with a null hypothesis of no linkage or association.
Association mapping by linear regression with clustering of haplotypes
We have extended the regression-based approach for association testing of Schaid et al. [3] to incorporate haplotype groups via a density-based clustering algorithm [1]. The primary goal was to reduce the dimensionality of the regression by clumping together haplotypes that are likely to have diverged recently, whether through mutation or recombination. Posterior haplotype probabilities from unphased data are obtained from the Decipher program in S.A.G.E. [8] and are imported into a modified version of the HapMiner program [1] as haplotype weights for clustering. Each pair of haplotypes is assigned a similarity score, a generalization of several scores previously described [11], which is converted to a distance metric on the interval [0,1] [1]. Clusters are formed in regions of high density (haplotype weight). A haplotype is designated a "core" haplotype if enough density, determined by the density threshold MinPts, is located within a given distance ε from it. Haplotypes within this ε neighborhood are clustered together. We modified HapMiner such that very common haplotypes, defined by the parameter pmin, are never clustered together. This prevents improper grouping of ancient haplotypes. For the analyses presented here, we selected a value for pmin of 1/(2k), where k is the number of haplotypes present with a frequency large enough to include in the GLM (see below).
Cluster assignments for all possible haplotypes are imported into the haplo.score function in HaploStats [2, 3]. This method first estimates haplotype frequencies by the expectation maximization algorithm, and uses the frequencies to calculate posterior probabilities of haplotype pairs for each individual, assuming HWE. The posterior probabilities, in turn, are incorporated into a score test for association based on the likelihood function of a particular GLM – for case-control data, a logistic model – in which each haplotype is assigned a model coefficient [3]. A global score test for association is asymptotically distributed, under the null model of no association (all coefficients equal to 0), as a χ2 random variable with degrees of freedom equal to the rank of the variance matrix for the score statistic. In our cluster-based approach, parameter estimates are obtained for clusters, rather than for haplotypes. We calculated the variance of the score statistic as per the generalized score test of Boos [12] as implemented by Tzeng et al. [13] because we found that variance calculation in Schaid et al. [3] based on the Louis information [14] inflated the type I error of the test when covariates were included in the analysis (data not shown). To prevent numerical instability and loss of power resulting from estimation of rare haplotypes [3], only haplotypes or clusters with frequencies above 0.002 were included.
Power and type I error analyses on simulated datasets
We carried out studies of power and type I error of association mapping methods on subsets of all 100 replicates. Cases were randomly selected from the offspring of ASP families, such that no sample contained both sibs from any family; controls were chosen at random from the unrelated controls. All individuals within a sample were taken from the same replicate. We adjusted the sample size for each analysis to yield moderate (40–60%) power from the haplotype-only test. Where necessary, we included sex and the number of DRB1*04 alleles in the model as covariates, to reduce the signal strength to a level useful for power comparisons; this adjustment was not part of the analysis. Type I error in association-positive regions was assessed by randomizing case/control status (and covariates, if any) relative to genotype data by permutation.
A second null-model analysis was performed at locations far removed from the HLA region and locus D. The mean haplotype diversity, measured as the number of unique haplotypes present in the phased data, for all sets of six and eight consecutive markers was estimated over replicates 1–50 in two large regions on chromosome 6q comprising SNPs 11001–12000 and 15001–16000. Four levels of diversity were defined: low (10th percentile of all marker sets), medium (50th), high (90th) and very high (99th). Samples were extracted at selected locations at each diversity level, and score tests for association were carried out as above (without permutation).