# Association mapping via a class of haplotype-sharing statistics

*BMC Proceedings* **volume 1**, Article number: S123 (2007)

## Abstract

We present a class of haplotype-sharing statistics useful for association mapping in case-parent trio data. The framework presented allows derivation of novel tests as well as new simplified variance estimators for previously proposed tests. We give an overview of this framework and apply four such tests to the simulated data of Genetic Analysis Workshop 15. We find that these haplotype-based statistics result in greater power and better risk locus localization than single-locus single-nucleotide polymorphism (SNP) analysis.

## Background

Haplotype-sharing methods attempt to utilize insights from population genetics while maintaining the simplified statistical model used for association studies in genetic epidemiology. Coalescent models suggest that for some diseases, chromosomes of affected persons share a more recent common ancestor than a randomly selected pair of chromosomes. If a disease-causing mutation is relatively recent, haplotypes of affected persons may be identical by state (IBS) over a longer region near a risk locus than would be found among randomly selected haplotypes. Thus, haplotype sharing attempts association mapping by looking for regions where the patterns of IBS similarity among haplotypes of affected persons differ from those found in random haplotypes.

In a recent paper, we derived the distribution of some previously proposed and novel haplotype-sharing tests [1]. Here, we give an overview of these results and apply them to the Genetic Analysis Workshop 15 (GAW15) Problem 3 data.

## Methods

For the *i*^{th} of *n* case-parent trios, let *H*_{1i} and *H*_{2i} be the paternal transmitted and untransmitted haplotypes, while *H*_{3i} and *H*_{4i} denote the maternal transmitted and untransmitted haplotypes. Assume haplotypes having *L* loci, so that there are 2^{L} possible haplotypes. Let S_{k}(*H*_{1}, *H*_{2}) measure the similarity between haplotypes *H*_{1} and *H*_{2} at a fixed locus *k*. Many similarity metrics are possible; here we measure similarity by the maximum information length contrast, the number of loci *H*_{1} and *H*_{2} share IBS looking upstream and downstream from a fixed locus *k*. Let S_{k} be the matrix having (*i*, *j*)^{th} element S_{k}(*H*_{i}, *H*_{j}). Let $\widehat{\pi}$, $\widehat{\rho}$, and $\widehat{p}$ denote vectors of haplotype frequency estimators for untransmitted, transmitted, and all haplotypes respectively, obtained under phase uncertainty.
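The similarity measure can be illustrated with a minimal sketch. It assumes haplotypes are coded as equal-length allele sequences, that loci are indexed from 0, that the count includes locus *k* itself, and that sharing is zero when the two haplotypes differ at *k*; the text does not spell out these conventions, and the function names `shared_length` and `similarity_matrix` are hypothetical.

```python
from typing import List, Sequence


def shared_length(h1: Sequence[int], h2: Sequence[int], k: int) -> int:
    """Count contiguous loci around locus k (0-based) where h1 and h2 match.

    Assumption: locus k is counted, and the result is 0 if the
    haplotypes differ at k.
    """
    if h1[k] != h2[k]:
        return 0
    count = 1  # locus k itself
    j = k - 1
    while j >= 0 and h1[j] == h2[j]:  # extend upstream
        count += 1
        j -= 1
    j = k + 1
    while j < len(h1) and h1[j] == h2[j]:  # extend downstream
        count += 1
        j += 1
    return count


def similarity_matrix(haplotypes: Sequence[Sequence[int]], k: int) -> List[List[int]]:
    """Matrix whose (i, j) entry is the sharing between haplotypes i and j at k."""
    return [[shared_length(a, b, k) for b in haplotypes] for a in haplotypes]


# Four illustrative 5-locus haplotypes (made-up data).
haps = [(0, 0, 1, 1, 0), (0, 0, 1, 0, 0), (1, 0, 1, 1, 0), (1, 1, 0, 0, 1)]
S = similarity_matrix(haps, k=2)
for row in S:
    print(row)
```

Under these conventions the matrix is symmetric, and each diagonal entry equals the haplotype length.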

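We consider statistics of the form

$${U}_{k}\left(\gamma \right)={\gamma}^{T}{S}_{k}\left(\widehat{\rho}-\widehat{\pi}\right).\qquad (1)$$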

It is possible to show that taking *γ* = $\widehat{p}$ yields the numerator of the haplotype-sharing statistics considered by each of van der Meulen and te Meerman [2], Bourgain et al. [3], Tzeng et al. [4], and Zhang et al. [5], though these statistics differ in the computation of their variances. Writing these "standard" haplotype sharing tests in the form Eq. (1) allows us to interpret them as looking for differences between vectors $\widehat{\rho}$ and $\widehat{\pi}$ that are in the direction of ${\widehat{p}}^{T}{S}_{k}$, i.e., in the direction of sharing with the parental haplotypes. The form of *U*_{k}(*γ*) also allows us to derive a simple formula for its variance. We make explicit the fact that *γ* is often a function of the data by writing $\widehat{\gamma}$. Using Slutsky's theorem [6, Section 1.5.4], as long as $\widehat{\gamma}\stackrel{p}{\to}{\gamma}_{0}\ne 0$ under the null hypothesis, Var{*U*_{k}($\widehat{\gamma}$)} can be estimated by ${\widehat{\gamma}}^{T}{S}_{k}\widehat{\Sigma}{S}_{k}\widehat{\gamma}$, where $\widehat{\Sigma}$ is the empirical variance estimator of ($\widehat{\rho}$ - $\widehat{\pi}$). This variance estimator is considerably simpler than those previously proposed, and is valid even with phase uncertainty and for stratified populations [1].

Use of *γ* = $\widehat{p}$ yields the statistic ${T}_{\widehat{p}}={U}_{k}^{2}\left(\widehat{p}\right)/\text{Var}\left\{{U}_{k}\left(\widehat{p}\right)\right\}$, which we refer to as the *p* test. Another choice, *γ* = $\widehat{\rho}$, was used by Levinson et al. [7], who contrasted sharing in transmitted haplotypes, ${\widehat{\rho}}^{T}{S}_{k}\widehat{\rho}$, with the cross product ${\widehat{\rho}}^{T}{S}_{k}\widehat{\pi}$ to give ${\widehat{\rho}}^{T}{S}_{k}\widehat{\rho}-{\widehat{\rho}}^{T}{S}_{k}\widehat{\pi}={\widehat{\rho}}^{T}{S}_{k}\left(\widehat{\rho}-\widehat{\pi}\right)$. We call this the *rho* test.
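A minimal sketch of the statistic, its variance estimate, and the resulting ratio for a fixed direction *γ* follows. It rests on simplifying assumptions not made in the text: phase is taken as known, each trio contributes a vector equal to half the difference between its transmitted and untransmitted haplotype counts, $\widehat{\Sigma}$ is taken to be the sample covariance of those contributions divided by the number of trios, and the similarity matrix is symmetric. All function names and the toy numbers are hypothetical.

```python
from typing import List, Sequence, Tuple

Vector = List[float]
Matrix = List[List[float]]


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def mat_vec(m: Sequence[Sequence[float]], v: Sequence[float]) -> Vector:
    return [dot(row, v) for row in m]


def mean_contribution(contribs: Sequence[Sequence[float]]) -> Vector:
    """Average per-trio contribution; plays the role of rho_hat - pi_hat."""
    n = len(contribs)
    return [sum(c[j] for c in contribs) / n for j in range(len(contribs[0]))]


def covariance_of_mean(contribs: Sequence[Sequence[float]]) -> Matrix:
    """Assumed form of Sigma_hat: sample covariance of contributions over n."""
    n = len(contribs)
    m = len(contribs[0])
    mean = mean_contribution(contribs)
    centered = [[c[j] - mean[j] for j in range(m)] for c in contribs]
    return [[sum(r[a] * r[b] for r in centered) / (n * (n - 1))
             for b in range(m)] for a in range(m)]


def sharing_test(gamma: Sequence[float], S: Matrix,
                 contribs: Sequence[Sequence[float]]) -> Tuple[float, float, float]:
    """Return (U, variance estimate, U^2 / variance) for direction gamma."""
    diff = mean_contribution(contribs)
    s_gamma = mat_vec(S, gamma)  # equals S^T gamma because S is symmetric
    u = dot(s_gamma, diff)
    var = dot(s_gamma, mat_vec(covariance_of_mean(contribs), s_gamma))
    return u, var, u * u / var


# Toy data: three haplotypes, four trios (illustrative only).
S = [[5.0, 3.0, 1.0], [3.0, 5.0, 2.0], [1.0, 2.0, 5.0]]
contribs = [[1.0, -0.5, -0.5], [0.5, 0.5, -1.0], [0.5, -1.0, 0.5], [0.0, 0.5, -0.5]]
p_hat = [0.375, 0.3125, 0.3125]  # pooled frequencies implied by the toy trios
U, var, T = sharing_test(p_hat, S, contribs)
print(U, var, T)
```

Passing a different fixed direction, for example an estimate of the transmitted frequencies, gives the corresponding statistic with the same code; handling phase uncertainty would require the estimators of [1] instead of the simple contributions used here.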

An appealing choice of *γ* is ($\widehat{\rho}$ - $\widehat{\pi}$), as this direction weights differences in haplotypes by their differences in frequency (Gerard te Meerman, personal communication). However, Slutsky's theorem no longer applies as $\left(\widehat{\rho}-\widehat{\pi}\right)\stackrel{p}{\to}0$ under the null hypothesis. Instead, we use the fact that ${U}_{k}\left(\widehat{\rho}-\widehat{\pi}\right)={\left(\widehat{\rho}-\widehat{\pi}\right)}^{T}{S}_{k}\left(\widehat{\rho}-\widehat{\pi}\right)$ is a quadratic form whose distribution is a mixture of independent *χ*^{2} variates, with weights given by the eigenvalues of the matrix $\widehat{\Sigma}{S}_{k}$. Following Imhof [8], we approximate this weighted *χ*^{2} distribution using a three-moment approximation. We refer to the resulting test as the *cross* test.
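As a rough illustration of a three-moment approximation, the sketch below uses the Pearson-type form commonly attributed to Imhof [8]: a weighted sum of independent one-degree-of-freedom *χ*^{2} variates with weights *λ*_{j} is matched to a single *χ*^{2} variate through the power sums of the weights. That this matches the authors' implementation in every detail is an assumption, and the function name `three_moment_match` is hypothetical. The weights would be the eigenvalues of $\widehat{\Sigma}{S}_{k}$; they are passed in directly here.

```python
import math
from typing import Sequence, Tuple


def three_moment_match(weights: Sequence[float], x: float) -> Tuple[float, float]:
    """Match Q = sum_j w_j * chi2_1 to a single chi-square variate.

    Returns (h, y) such that P(Q > x) is approximated by P(chi2_h > y),
    with c_s = sum_j w_j**s, h = c_2**3 / c_3**2 and
    y = (x - c_1) * sqrt(h / c_2) + h.
    """
    c1 = sum(weights)
    c2 = sum(w ** 2 for w in weights)
    c3 = sum(w ** 3 for w in weights)
    if c3 <= 0:
        raise ValueError("this sketch requires a positive third power sum")
    h = c2 ** 3 / c3 ** 2
    y = (x - c1) * math.sqrt(h / c2) + h
    return h, y


# Equal unit weights reduce to an ordinary chi-square with 3 degrees of freedom.
h_equal, y_equal = three_moment_match([1.0, 1.0, 1.0], 4.2)
# Unequal weights give a fractional number of degrees of freedom.
h_mixed, y_mixed = three_moment_match([2.0, 1.0], 6.0)
print(h_equal, y_equal, h_mixed, y_mixed)
```

The tail probability then comes from any chi-square survival function that accepts non-integer degrees of freedom, for example `scipy.stats.chi2.sf(y, h)` if SciPy is available; a non-positive `y` corresponds to an approximate *P*-value of 1.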

Finally, we note that because the *p* test uses $\gamma =\widehat{p}=\tfrac{1}{2}\left(\widehat{\rho}+\widehat{\pi}\right)$, while the *cross* test uses *γ* = ($\widehat{\rho}$ - $\widehat{\pi}$), the two tests appear to be looking at sharing in orthogonal directions; hence, a *combined* test seems desirable. Because ${S}_{k}$ is symmetric, ${U}_{k}^{2}\left(\widehat{p}\right)={\left(\widehat{\rho}-\widehat{\pi}\right)}^{T}{S}_{k}\widehat{p}\,{\widehat{p}}^{T}{S}_{k}\left(\widehat{\rho}-\widehat{\pi}\right)$. Thus, we seek the distribution of ${T}_{\widehat{p}}+{U}_{k}\left(\widehat{\rho}-\widehat{\pi}\right)={\left(\widehat{\rho}-\widehat{\pi}\right)}^{T}\left[\frac{{S}_{k}\widehat{p}\,{\widehat{p}}^{T}{S}_{k}}{{\widehat{p}}^{T}{S}_{k}\widehat{\Sigma}{S}_{k}\widehat{p}}+{S}_{k}\right]\left(\widehat{\rho}-\widehat{\pi}\right)$. Once again, this is a quadratic form whose distribution is a mixture of independent *χ*^{2} variates, with weights given by the eigenvalues of the matrix $\widehat{\Sigma}\left[\frac{{S}_{k}\widehat{p}\,{\widehat{p}}^{T}{S}_{k}}{{\widehat{p}}^{T}{S}_{k}\widehat{\Sigma}{S}_{k}\widehat{p}}+{S}_{k}\right]$, and we approximate this distribution as in Imhof [8].

### Application to GAW15 data

We compared the *rho*, *p*, *cross*, and *combined* tests by applying them to the GAW15 Problem 3 simulated "loose" SNP set for chromosome 6. We extracted 200 trios from each of 100 replicates by taking the first affected sibling and their parents from the first 200 families in each data set. We used only 200 trios both to speed up computation and because the effect of the risk locus on chromosome 6 was so strong that a reduced data set seemed more realistic. We used the answers to guide our analysis throughout. Specifically, we focused on a 10-cM region (45 cM to 55 cM) around the DR rheumatoid arthritis risk locus on chromosome 6 (the DR locus is at 49.45557055 cM). In each data set we scanned the region using haplotype windows of 10 loci. The windows were shifted through the region two SNPs at a time, so that if the first window started with SNP1 the next window would start with SNP3. The *rho*, *p*, *cross*, and *combined* tests were computed for each window, and the transmission disequilibrium test (TDT) was applied to each SNP in the region. Estimates of haplotype frequencies required for the computation of the test statistics were computed using the software package HAPLORE [9].

In each data set we computed max{-log_{10}(*P*_{value})} for each test (where the maximum is taken over loci) and noted this value and its position, which we take as an estimate of the location of the risk locus. For the haplotype-based tests, the position of a window is taken as the average location of SNPs 5 and 6 in the window. An average localization bias for each test was then computed by averaging the distance between the estimated locations and the true risk locus position over the 100 data sets. We compared the empirical distributions of -log_{10}(*P*_{value}) values for each test at three loci to investigate the effect of increasing distance from the true disease locus on the performance of each test.
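The scanning and localization steps can be sketched as follows. The sketch assumes 0-based SNP indices, takes localization bias as the signed mean of (estimate minus true position), and pairs it with a mean squared error; whether the distance in the text is signed or absolute is not stated. The map positions and *P*-values are made up, and all function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def window_starts(n_snps: int, width: int = 10, step: int = 2) -> List[int]:
    """0-based start indices of windows of `width` SNPs shifted by `step`."""
    return list(range(0, n_snps - width + 1, step))


def window_location(positions_cm: Sequence[float], start: int, width: int = 10) -> float:
    """Average map position of the two middle SNPs (SNPs 5 and 6 of 10)."""
    mid = start + width // 2
    return 0.5 * (positions_cm[mid - 1] + positions_cm[mid])


def peak(p_values: Sequence[float], locations: Sequence[float]) -> Tuple[float, float]:
    """Largest -log10(P) over the scan and the location where it occurs."""
    scores = [-math.log10(p) for p in p_values]
    best = max(range(len(scores)), key=scores.__getitem__)
    return scores[best], locations[best]


def localization_summary(estimates: Sequence[float], true_position: float) -> Tuple[float, float]:
    """Signed bias and mean squared error of the estimated locations."""
    errors = [e - true_position for e in estimates]
    bias = sum(errors) / len(errors)
    mse = sum(err ** 2 for err in errors) / len(errors)
    return bias, mse


# Toy scan: 15 evenly spaced SNPs and one made-up P-value per window.
positions = [45.0 + 0.5 * i for i in range(15)]
starts = window_starts(len(positions))
locations = [window_location(positions, s) for s in starts]
score, where = peak([1e-3, 1e-8, 1e-5], locations)

# Toy localization summary over three replicates.
bias, mse = localization_summary([49.5, 49.3, 49.6], 49.4)
print(starts, locations, score, where, bias, mse)
```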

## Results and discussion

Figure 1 presents the results of the *rho*, *p*, *cross*, *combined*, and TDT tests in the 10-cM region of the chromosome 6 risk locus for Replicate 1. Three things are apparent from this analysis. First, the haplotype-based methods seem to be more powerful than the TDT, yielding much larger -log_{10}(*P*_{value}) values. Second, the haplotype-based methods seem to localize the risk locus well. Finally, the signal from the haplotype-based methods seems to be more concentrated around the risk locus, being both larger at the locus and dropping more quickly away from the risk locus than that of the TDT. Visual inspection of other data replicates suggests the same pattern; to confirm, we investigated each of the above points systematically.

First, in order to summarize the power of the various tests, we report the first quartile, median, mean, and third quartile of the max{-log_{10}(*P*_{value})} of each test over the 100 replicates (Table 1). We see that the haplotype-based methods yield consistently higher values and that the *cross* test performs best among all tests. Next, we report the localization bias and mean squared error (MSE) of the TDT and each of the haplotype sharing tests (Table 1). Here, once again, the *cross* test appears to do better than the others, though we note that the small biases involved make it difficult to draw conclusions.

Finally, Figure 2 presents the empirical distribution functions of -log_{10}(*P*_{value}) values for each test statistic at three different loci. Our findings are consistent with the observations in Replicate 1: the haplotype-based methods have larger -log_{10}(*P*_{value}) values at the risk locus and drop off more quickly away from the risk locus than the TDT throughout the replications. In particular, at 1.036 cM from the disease locus, essentially all replicates have a non-significant test statistic (i.e., values that fall to the left of the gray vertical line in Figure 2) for all of the haplotype sharing tests, while most replicates have a significant TDT. By 0.244 cM the situation has changed, and all replicates have significant haplotype-sharing tests while about 40% of replicates have a non-significant TDT. At 0.004 cM from the disease locus, all tests are significant, but the superiority of the *cross* statistic for these data is more readily apparent.

## Conclusion

We presented an overview of a new framework for deriving haplotype-sharing statistics and applied four such statistics to the GAW15 simulated data. Our findings suggest that these haplotype-based statistics can result in greater power and better risk locus localization compared to the single-SNP (TDT) analysis. The framework presented allows visualization of relationships between tests and computation of simplified estimators of the asymptotic distribution of the test statistics. This second feature is quite important because previous estimators have been complex or have depended on permutation procedures, making systematic power studies difficult or impossible.

## References

1. Allen AS, Satten GA: Statistical models for haplotype sharing in case-parent trio data. Hum Hered. 2007, 64: 35-44. 10.1159/000101421.

2. Van der Meulen M, te Meerman G: Haplotype sharing analysis in affected individuals from nuclear families with at least one affected offspring. Genet Epidemiol. 1997, 14: 915-919. 10.1002/(SICI)1098-2272(1997)14:6<915::AID-GEPI59>3.0.CO;2-P.

3. Bourgain C, Genin E, Quesneville H, Clerget-Darpoux F: Search for multifactorial disease susceptibility genes in founder populations. Ann Hum Genet. 2000, 64: 255-265. 10.1046/j.1469-1809.2000.6430255.x.

4. Tzeng J, Devlin B, Wasserman L, Roeder K: On the identification of disease mutations by the analysis of haplotype similarity and goodness of fit. Am J Hum Genet. 2003, 72: 891-902. 10.1086/373881.

5. Zhang S, Sha Q, Chen H, Dong J, Jiang R: Transmission/disequilibrium test based on haplotype sharing for tightly linked markers. Am J Hum Genet. 2003, 73: 566-579. 10.1086/378205.

6. Serfling R: Approximation Theorems of Mathematical Statistics. 1980, New York: John Wiley & Sons.

7. Levinson D, Kirby A, Slepner S, Nolte I, Spijker G, te Meerman G: Simulation studies of detection of a complex disease in a partially isolated population. Am J Med Genet (Neuropsych Genet). 2001, 105: 65-70. 10.1002/1096-8628(20010108)105:1<65::AID-AJMG1064>3.0.CO;2-0.

8. Imhof J: Computing the distribution of quadratic forms in normal variables. Biometrika. 1961, 48: 419-426.

9. Zhang K, Sun F, Zhao H: HAPLORE: a program for haplotype reconstruction in general pedigrees without recombination. Bioinformatics. 2005, 21: 90-103. 10.1093/bioinformatics/bth388.

## Acknowledgements

ASA acknowledges support from the National Heart, Lung, and Blood Institute, National Institutes of Health, grant K25 HL077663.

This article has been published as part of *BMC Proceedings* Volume 1 Supplement 1, 2007: Genetic Analysis Workshop 15: Gene Expression Analysis and Approaches to Detecting Multiple Functional Loci. The full contents of the supplement are available online at http://www.biomedcentral.com/1753-6561/1?issue=S1.

## Additional information

### Competing interests

The author(s) declare that they have no competing interests.

## Rights and permissions

This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

## About this article

### Keywords

- Association Mapping
- Risk Locus
- Transmission Disequilibrium Test
- Genetic Analysis Workshop
- Affected Person