Association mapping via a class of haplotype-sharing statistics

We present a class of haplotype-sharing statistics useful for association mapping in case-parent trio data. The framework presented allows derivation of novel tests as well as new simplified variance estimators for previously proposed tests. We give an overview of this framework and apply four such tests to the simulated data of Genetic Analysis Workshop 15. We find that these haplotype-based statistics result in greater power and better risk locus localization than the single locus single-nucleotide polymorphism analysis.


Background
Haplotype-sharing methods attempt to utilize insights from population genetics while maintaining the simplified statistical model used for association studies in genetic epidemiology. Coalescent models suggest that for some diseases, chromosomes of affected persons share a more recent common ancestor than a randomly selected pair of chromosomes. If a disease-causing mutation is relatively recent, haplotypes of affected persons may be identical by state (IBS) over a longer region near a risk locus than would be found among randomly selected haplotypes. Thus, haplotype sharing attempts association mapping by looking for regions where the patterns of IBS similarity among haplotypes of affected persons differ from those found in random haplotypes.
In a recent paper, we derived the distribution of some previously proposed and novel haplotype-sharing tests [1].
Here, we give an overview of these results and apply them to the Genetic Analysis Workshop 15 (GAW15) Problem 3 data. We consider statistics of the form
[...] (1)
It is possible to show that taking γ = [...] yields the numerator of the haplotype-sharing statistics considered by each of van der Meulen and te Meerman [2], Bourgain et al. [3], Tzeng et al. [4], and Zhang et al. [5], though these statistics differ in the computation of their variances. Writing these "standard" haplotype-sharing tests in the form of Eq. (1) allows us to interpret them as looking for differences between the vectors [...] and [...] that are in the direction of [...].
An appealing choice of γ is ([...] - [...]), as this direction weights differences in haplotypes by their differences in frequency (Gerard te Meerman, personal communication). However, Slutsky's theorem no longer applies, as [...] under the null hypothesis. Instead, we use the fact that [...] is a quadratic form whose distribution is a mixture of independent χ² variates, with weights given by the eigenvalues of the matrix S_k. Following Imhof [8], we approximate this weighted χ² distribution using a three-moment approximation. We refer to the resulting test as the cross test.

Methods
Finally, we note that because the p test uses [...], while the cross test uses γ = ([...] - [...]), the two tests appear to look at sharing in orthogonal directions; hence, a combined test seems desirable. Thus, we seek the distribution of [...].
Once again, this is a quadratic form whose distribution is a mixture of independent χ² variates, with weights given by the eigenvalues of the matrix [...], and we approximate this distribution as in Imhof [8].
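As an illustration of the kind of approximation cited from Imhof [8], the following is a minimal sketch of a standard three-moment match for a weighted sum of independent one-degree-of-freedom χ² variates. It is not necessarily the authors' implementation. The function names (`three_moment_pvalue`, `chi2_sf`, `_gamma_q`) are hypothetical, the weights are assumed to be the already-computed eigenvalues, and the sum of their cubes is assumed to be positive. The χ² tail probability is computed with a textbook incomplete-gamma routine so that only the Python standard library is needed.

```python
import math
from typing import Sequence


def _gamma_q(a: float, x: float) -> float:
    """Regularized upper incomplete gamma function Q(a, x), for a > 0."""
    if x <= 0.0:
        return 1.0
    log_prefactor = -x + a * math.log(x) - math.lgamma(a)
    if x < a + 1.0:
        # Series expansion of the lower tail P(a, x); return 1 - P.
        term = 1.0 / a
        total = term
        denom = a
        for _ in range(10_000):
            denom += 1.0
            term *= x / denom
            total += term
            if abs(term) < abs(total) * 1e-16:
                break
        return 1.0 - total * math.exp(log_prefactor)
    # Continued fraction (modified Lentz method) for the upper tail.
    tiny = 1e-300
    b = x + 1.0 - a
    c = 1.0 / tiny
    d = 1.0 / b
    h = d
    for i in range(1, 10_000):
        an = -i * (i - a)
        b += 2.0
        d = an * d + b
        d = tiny if abs(d) < tiny else d
        c = b + an / c
        c = tiny if abs(c) < tiny else c
        d = 1.0 / d
        delta = d * c
        h *= delta
        if abs(delta - 1.0) < 1e-15:
            break
    return h * math.exp(log_prefactor)


def chi2_sf(x: float, df: float) -> float:
    """Upper tail probability P(X > x) for a chi-square variate with df degrees of freedom."""
    return _gamma_q(df / 2.0, x / 2.0)


def three_moment_pvalue(q: float, weights: Sequence[float]) -> float:
    """Approximate P(Q > q) for Q = sum_i w_i * Z_i^2 with Z_i independent N(0, 1).

    Matches the first three cumulants of Q to those of a shifted and scaled
    chi-square variate. Assumes sum(w_i^3) > 0.
    """
    c1 = sum(weights)
    c2 = sum(w ** 2 for w in weights)
    c3 = sum(w ** 3 for w in weights)
    if c3 <= 0.0:
        raise ValueError("three-moment match requires sum of cubed weights > 0")
    df = c2 ** 3 / c3 ** 2                      # matched degrees of freedom
    y = (q - c1) * math.sqrt(df / c2) + df      # matched evaluation point
    return chi2_sf(y, df)
```

When all weights are equal, the match is exact: for weights (2, 2) the statistic is twice a χ² variate with 2 degrees of freedom, so `three_moment_pvalue(20.0, [2.0, 2.0])` reduces to the tail probability of that variate at 10.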

Application to GAW15 data
We compared the rho, p, cross, and combined tests by applying them to the GAW15 Problem 3 simulated "loose" SNP set for chromosome 6. We extracted 200 trios from each of 100 replicates by taking the first affected sibling and his or her parents from the first 200 families in each data set. We used only 200 trios both to speed up computation and because the effect of the risk locus on chromosome 6 was so strong that a reduced data set seemed more realistic. We used the answers to guide our analysis throughout. Specifically, we focused on a 10-cM region (45 cM to 55 cM) around the DR rheumatoid arthritis risk locus on chromosome 6 (the DR locus is at 49.45557055 cM).
In each data set we scanned the region using haplotype windows of 10 loci. The windows were shifted through the region two SNPs at a time, so that if the first window started with SNP1, the next window started with SNP3. The rho, p, cross, and combined tests were computed for each window, and the transmission disequilibrium test (TDT) was applied to each SNP in the region. Estimates of the haplotype frequencies required for the test statistics were computed using the software package HAPLORE [9].
In each data set we computed max{-log10(P value)} for each test (where the maximum is taken over loci) and recorded this value and its position, which we took as an estimate of the location of the risk locus; for the haplotype-based tests, the location of a window was taken as the average location of SNPs 5 and 6 in the window. An average localization bias for each test was then computed by averaging the distance between the estimated locations and the true risk locus position over the 100 data sets. Finally, we compared the empirical distributions of -log10(P value) values for each test at three loci to investigate the effect of increasing distance from the true disease locus on the performance of each test.
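The window scan and localization summary described above can be sketched as follows. This is an illustrative outline only: the function names are hypothetical, SNP positions are assumed to be given in cM as a simple list, and the bias is computed here as the mean signed difference between estimated and true positions, which is an assumption about the sign convention. The mean squared error (MSE) is computed from the same differences.

```python
from statistics import mean
from typing import List, Sequence, Tuple


def window_starts(n_snps: int, width: int = 10, step: int = 2) -> List[int]:
    """0-based start indices of windows of `width` SNPs, shifted `step` SNPs at a time."""
    return list(range(0, n_snps - width + 1, step))


def window_location(positions_cm: Sequence[float], start: int) -> float:
    """Location of a 10-SNP window: average position of its 5th and 6th SNPs."""
    return 0.5 * (positions_cm[start + 4] + positions_cm[start + 5])


def peak_signal(neg_log10_p: Sequence[float],
                locations: Sequence[float]) -> Tuple[float, float]:
    """Largest -log10(P value) over loci and the location where it occurs."""
    best = max(range(len(neg_log10_p)), key=lambda i: neg_log10_p[i])
    return neg_log10_p[best], locations[best]


def localization_summary(estimates: Sequence[float],
                         true_position: float) -> Tuple[float, float]:
    """Mean signed error (bias) and mean squared error over replicates."""
    errors = [e - true_position for e in estimates]
    return mean(errors), mean(err * err for err in errors)
```

For example, with 14 SNPs the windows start at indices 0, 2, and 4, and a window starting at index 2 is located midway between the SNPs at indices 6 and 7.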
Figure 1 presents the results of the rho, p, cross, combined, and TDT tests in the 10-cM region of the chromosome 6 risk locus for Replicate 1. Three things are apparent from this analysis. First, the haplotype-based methods seem to be more powerful than the TDT, yielding much larger -log10(P value) values. Second, the haplotype-based methods seem to localize the risk locus well. Finally, the signals from the haplotype-based methods seem to be more concentrated around the risk locus: they are larger at the locus and drop off more quickly away from it than the TDT signal. Visual inspection of other data replicates suggests the same pattern; to confirm, we investigated each of the above points systematically.
First, to summarize the power of the various tests, we report the first quartile, median, mean, and third quartile of max{-log10(P value)} for each test over the 100 replicates (Table 1). The values for the haplotype-based methods are consistently higher than those for the TDT, and the cross test performs best among all tests. Next, we report the localization bias and mean squared error (MSE) of the TDT and each of the haplotype-sharing tests (Table 1). Here, once again, the cross test appears to do better than the others, though the small biases involved make it difficult to draw firm conclusions.
Finally, Figure 2 presents the empirical distribution functions of -log10(P value) values for each test statistic at three different loci. Our findings are consistent with the observations in Replicate 1: across the replicates, the haplotype-based methods have larger -log10(P value) values at the risk locus and drop off more quickly away from the risk locus than the TDT. In particular, at 1.036 cM from the disease locus, essentially all replicates have a non-significant test statistic (i.e., values that fall to the left of the gray vertical line in Figure 2) for all of the haplotype-sharing tests, while most replicates have a significant TDT. By 0.244 cM the situation has changed: all replicates have significant haplotype-sharing tests, while about 40% of replicates have a non-significant TDT. At 0.004 cM from the disease locus, all tests are significant, but the superiority of the cross statistic for these data is more readily apparent.
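The per-locus comparison across replicates can be sketched as below: an empirical distribution function of -log10(P value) values and the fraction of replicates that are significant at a given locus. The function names are hypothetical, and the significance level of 0.05 is an assumption, since the threshold marked by the gray line in Figure 2 is not stated in the text.

```python
import math
from typing import Sequence


def ecdf(values: Sequence[float], x: float) -> float:
    """Empirical distribution function: fraction of values less than or equal to x."""
    return sum(v <= x for v in values) / len(values)


def fraction_significant(p_values: Sequence[float], alpha: float = 0.05) -> float:
    """Fraction of replicates whose -log10(P value) exceeds -log10(alpha).

    alpha = 0.05 is an assumed threshold, not one taken from the source.
    """
    threshold = -math.log10(alpha)
    return sum(-math.log10(p) > threshold for p in p_values) / len(p_values)
```

Applied to the P values of one test at one locus over the 100 replicates, `fraction_significant` gives the kind of summary quoted above (for instance, the share of replicates with a significant TDT at a given distance from the disease locus).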

Conclusion
We presented an overview of a new framework for deriving haplotype-sharing statistics and applied four such statistics to the GAW15 simulated data. Our findings suggest that these haplotype-based statistics can result in greater power and better risk locus localization than single-locus SNP analysis.

Figure 1. Analysis of Replicate 1 in a 10-cM region containing the risk locus.
Figure 2. Empirical distribution function of -log10(P value) values for three loci over 100 replicates.