### Statistical method

For now, assume haplotype phase is known. Therefore, for a sample with *N* subjects, there are 2*N* haplotypes. Let *Y* denote disease status for each of the 2*N* haplotypes, with values of 0 and 1 according to whether a haplotype came from a control or case, respectively. Let *X* denote the 2*N* alleles of a new single-nucleotide polymorphism (SNP) we wish to add to the current set of haplotypes, with values 0 and 1 according to whether an allele is common or rare, respectively. Let *H* be the number of distinguishable current haplotypes. We create *H* strata according to the current set of haplotypes, and use the MH procedure to test the association of *X* and *Y*, conditional on the current haplotypes. For the *h*^{th} stratum, let *n*_{ijh} denote the number of haplotypes with *X* = *i* and *Y* = *j*. It is well known that, conditional on fixed row and column margins, the entry *n*_{11h} has a hypergeometric distribution. Under the null hypothesis that *X* and *Y* are not associated in any stratum, the MH statistic,

MH=\frac{{\left[{\displaystyle {\sum}_{h}({n}_{11h}-E({n}_{11h}))}\right]}^{2}}{{\displaystyle {\sum}_{h}Var({n}_{11h})}}

has an asymptotic chi-square distribution with one degree of freedom, where

\begin{array}{cc}{\mu}_{11h}=E({n}_{11h})=\frac{{n}_{1+h}{n}_{+1h}}{{n}_{++h}},& Var({n}_{11h})=\frac{{n}_{1+h}{n}_{0+h}{n}_{+1h}{n}_{+0h}}{{n}_{++h}^{2}({n}_{++h}-1)}.\end{array}
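As a rough illustration of the MH statistic and the moments given above, the following Python sketch computes the statistic from a list of 2 × 2 tables, one per stratum. The function name `mh_statistic` and the nested-list layout `table[i][j]` (with *i* indexing *X* and *j* indexing *Y*) are assumptions made for this example and do not come from the original analysis. Strata with a total count of 1 or less are skipped because their variance term is undefined.

```python
import math

def mh_statistic(tables):
    """Return (MH, p-value) for a list of 2x2 tables, one per stratum.

    Each table is [[n00, n01], [n10, n11]], indexed as table[X][Y].
    Counts may be fractional (for example, expected counts).
    """
    diff_sum = 0.0  # sum over strata of n_11h - E(n_11h)
    var_sum = 0.0   # sum over strata of Var(n_11h)
    for (n00, n01), (n10, n11) in tables:
        n1p, n0p = n10 + n11, n00 + n01  # row margins (X = 1, X = 0)
        np1, np0 = n01 + n11, n00 + n10  # column margins (Y = 1, Y = 0)
        n = n1p + n0p                    # stratum total n_++h
        if n <= 1:
            continue  # variance formula is undefined for this stratum
        diff_sum += n11 - n1p * np1 / n
        var_sum += n1p * n0p * np1 * np0 / (n * n * (n - 1))
    if var_sum == 0:
        return float("nan"), float("nan")
    mh = diff_sum ** 2 / var_sum
    # Upper tail of a chi-square with 1 df: P(Z^2 > mh) = erfc(sqrt(mh / 2))
    p_value = math.erfc(math.sqrt(mh / 2.0))
    return mh, p_value

# Two strata, each defined by one current haplotype
mh, p = mh_statistic([[[10, 5], [5, 10]], [[8, 8], [8, 8]]])
```

The second stratum in the usage line has no association between *X* and *Y*, so it adds nothing to the numerator but still contributes to the variance in the denominator.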

By applying the MH approach sequentially, we can decide which markers should be added to a variable-length haplotype. When the haplotype phase was unknown, the posterior probabilities of all possible haplotype pairs were estimated for each subject by the expectation-maximization (EM) algorithm using haplo.em [5]. The estimated haplotypes were then used in the analysis, weighted by these posterior probabilities. In this situation, *n*_{ijh} is an expected count, namely the sum of the posterior probabilities of all estimated haplotypes with *X* = *i* and *Y* = *j* in the *h*^{th} stratum, and it can be a fraction instead of an integer.
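The expected counts for the phase-unknown case can be sketched as a weighted tally. The record layout below, `(stratum, x, y, weight)`, is a hypothetical one chosen for illustration; it is not the output format of haplo.em. Each record stands for one haplotype from one candidate haplotype pair of a subject, and `weight` is the posterior probability of that pair.

```python
from collections import defaultdict

def expected_tables(records):
    """Accumulate expected counts n_ijh from weighted haplotype records.

    records: iterable of (stratum, x, y, weight), where stratum labels the
    current haplotype, x is the allele (0/1) of the SNP being tested, y is
    disease status (0/1), and weight is a posterior probability.
    Returns {stratum: [[n00, n01], [n10, n11]]}, indexed as table[x][y].
    """
    tables = defaultdict(lambda: [[0.0, 0.0], [0.0, 0.0]])
    for stratum, x, y, weight in records:
        tables[stratum][x][y] += weight
    return dict(tables)

# One case subject (y = 1) with two candidate haplotype pairs:
# pair 1 has posterior probability 0.7, pair 2 has 0.3.
records = [
    ("01", 1, 1, 0.7), ("00", 0, 1, 0.7),  # haplotypes of pair 1
    ("01", 0, 1, 0.3), ("00", 1, 1, 0.3),  # haplotypes of pair 2
]
tables = expected_tables(records)
```

In this toy input the subject contributes a total weight of 2, one unit per chromosome, spread over the cells according to the pair probabilities.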

When scanning SNP *X*_{0} at position *x*, we examined SNPs close to it on both sides to determine whether they provided additional information for association. The two alleles 0 and 1 of SNP *X*_{0} separate the sample of alleles into two strata: those with allele 1 and those with allele 0. We first examined whether at least one of the nearest SNPs on each side (left and right) of *X*_{0} provided information for association, conditional on *X*_{0}. If at least one of them offered substantial additional information, we combined *X*_{0} with the SNP(s) into a multilocus haplotype variable and tested whether the second nearest marker on each side of *X*_{0} should be combined. The process was continued until no further SNP was added. More details about the algorithm can be found in Yu and Schaid [3]. Once the sequential search procedure ended, two test statistics were calculated: 1) A -log_{10}(*p*-value) for the haplotype-based chi-square test for the contingency table of disease status and the haplotypes constructed from all markers among the sequentially chosen SNPs. Denote this statistic by {\chi}_{H}^{2}(*x*). 2) A -log_{10}(*p*-value) for the sum of conditional chi-square statistics, Sum(*x*). Because conditioning creates independent chi-square statistics, Sum(*x*) has an asymptotic chi-square distribution with degrees of freedom equal to the number of variables combined. Denote this statistic by {\chi}_{S}^{2}(*x*).
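The search described above can be sketched for phase-known data as follows. This is only one plausible reading and not the algorithm of Yu and Schaid [3]: the cutoff `alpha` stands in for "substantial additional information" and is a hypothetical parameter, both neighbors are tested against the same current strata, and the search moves one marker outward on both sides after every successful step. The names `mh_pvalue` and `sequential_search` are also invented for this example.

```python
import math

def mh_pvalue(strata, x, y):
    """MH test p-value for x versus y (0/1 lists) within the given strata."""
    tables = {}
    for h, xi, yi in zip(strata, x, y):
        tables.setdefault(h, [[0, 0], [0, 0]])[xi][yi] += 1
    diff = var = 0.0
    for (n00, n01), (n10, n11) in tables.values():
        n1p, n0p, np1, np0 = n10 + n11, n00 + n01, n01 + n11, n00 + n10
        n = n1p + n0p
        if n > 1:
            diff += n11 - n1p * np1 / n
            var += n1p * n0p * np1 * np0 / (n * n * (n - 1))
    if var == 0:
        return 1.0  # candidate SNP is uninformative in every stratum
    return math.erfc(math.sqrt(diff * diff / var / 2.0))

def sequential_search(snps, y, start, alpha=0.05):
    """Return sorted indices of SNPs combined with the SNP at index start.

    snps: list of 0/1 allele lists, one per SNP, in map order.
    alpha: hypothetical cutoff for adding a SNP (an assumption).
    """
    chosen = [start]
    strata = [(a,) for a in snps[start]]  # strata defined by X_0 alone
    left, right = start - 1, start + 1
    while True:
        candidates = [k for k in (left, right) if 0 <= k < len(snps)]
        added = [k for k in candidates
                 if mh_pvalue(strata, snps[k], y) < alpha]
        if not added:
            break
        for k in added:
            chosen.append(k)
            # Refine the strata with the alleles of the newly added SNP
            strata = [s + (a,) for s, a in zip(strata, snps[k])]
        left, right = left - 1, right + 1
    return sorted(chosen)

y = [0, 1] * 8
snps = [[0] * 16, [0] * 8 + [1] * 8, list(y)]
selected = sequential_search(snps, y, start=1)
```

In the toy data, the left neighbor is monomorphic and is never added, while the right neighbor matches disease status exactly within each stratum and is added in the first step.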

Thus, for the marker at physical position *x*, three statistics were calculated: 1) the traditional single-marker chi-square statistic {\chi}_{0}^{2}(*x*), which uses only the marker being scanned, 2) the sequential haplotype statistic {\chi}_{H}^{2}(*x*), and 3) the sequential summary statistic {\chi}_{S}^{2}(*x*). Using permutation of the disease status, a pointwise *p*-value was defined as the proportion of permutations in which the permuted statistic was larger than the observed statistic at a position. For regional *p*-values, we used the maximum of the statistics across the whole region as the test statistic and defined the *p*-values correspondingly. Both the pointwise and regional *p*-values were calculated.
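A minimal sketch of the two permutation *p*-values follows, assuming the statistics have already been computed for the observed data and for each permutation of disease status. The function name and input layout are assumptions. For the regional *p*-value, the text admits more than one reading; here the observed statistic at each position is compared with the permutation distribution of the regional maximum, so the value at the position of the observed maximum is the *p*-value for the region as a whole.

```python
def permutation_pvalues(observed, permuted):
    """Return (pointwise, regional) permutation p-values per position.

    observed: list of observed statistics, one per position.
    permuted: list of rows, one per permutation, each row holding the
              statistics at every position for that permutation.
    """
    n_perm = len(permuted)
    n_pos = len(observed)
    # Pointwise: compare each position with its own permutation distribution
    pointwise = [
        sum(1 for row in permuted if row[k] > observed[k]) / n_perm
        for k in range(n_pos)
    ]
    # Regional: compare each position with the distribution of the maximum
    maxima = [max(row) for row in permuted]
    regional = [
        sum(1 for m in maxima if m > observed[k]) / n_perm
        for k in range(n_pos)
    ]
    return pointwise, regional

observed = [2.0, 5.0]
permuted = [[1.0, 6.0], [3.0, 1.0], [0.5, 0.5], [2.5, 4.0]]
pointwise, regional = permutation_pvalues(observed, permuted)
```

Because the regional maximum is at least as large as the statistic at any single position, a regional *p*-value computed this way is never smaller than the corresponding pointwise one.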

### Data

The data from the *PTPN22* gene [6] contain both case siblings and unrelated controls. To create unrelated case-control data, one sib from each case family was randomly chosen. When analyzing the data, we assumed that the confirmed variant R620W was not observed, i.e., we used its surrogate markers (12 SNPs). To evaluate the pointwise and regional associations with disease status, one million permutations were used.

The NARAC association mapping data consist of 2300 SNPs across a 10-Mb region on chromosome 18q. The data were collected from 460 unrelated cases and 460 unrelated controls. Two individuals with 5% or more missing SNPs were removed. After dropping markers with a minor allele frequency less than 0.05 or a Hardy-Weinberg equilibrium test *p*-value less than 0.01, 2186 SNPs were used in the analysis. The pointwise and regional associations were evaluated using 10,000 permutations.
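The marker filter can be sketched as below. The source does not say which Hardy-Weinberg test was used or in which subjects it was applied, so the one-degree-of-freedom chi-square goodness-of-fit test here is an assumption, as are the function name `passes_qc` and its parameters.

```python
import math

def passes_qc(n_aa, n_ab, n_bb, maf_min=0.05, hwe_min=0.01):
    """Return True if a SNP passes the MAF and HWE filters.

    n_aa, n_ab, n_bb: genotype counts for the SNP.
    The HWE test is a 1-df chi-square goodness-of-fit test (an assumption).
    """
    n = n_aa + n_ab + n_bb
    if n == 0:
        return False
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele a
    q = 1 - p
    if min(p, q) < maf_min:
        return False  # minor allele too rare (also covers monomorphic SNPs)
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    p_hwe = math.erfc(math.sqrt(chi2 / 2.0))  # upper tail, chi-square 1 df
    return p_hwe >= hwe_min

keep = passes_qc(25, 50, 25)
```

An exact test may be preferable for markers with small genotype counts; the chi-square version is used here only to keep the sketch short.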