Association mapping via a class of haplotype-sharing statistics

Allen, Andrew S; Satten, Glen A

doi:10.1186/1753-6561-1-S1-S123

Volume 1 Supplement 1

Genetic Analysis Workshop 15: Gene Expression Analysis and Approaches to Detecting Multiple Functional Loci

Proceedings
Open access
Published: 18 December 2007

BMC Proceedings volume 1, Article number: S123 (2007)

Abstract

We present a class of haplotype-sharing statistics useful for association mapping in case-parent trio data. The framework presented allows derivation of novel tests as well as new simplified variance estimators for previously proposed tests. We give an overview of this framework and apply four such tests to the simulated data of Genetic Analysis Workshop 15. We find that these haplotype-based statistics result in greater power and better risk locus localization than the single locus single-nucleotide polymorphism analysis.

Background

Haplotype-sharing methods attempt to utilize insights from population genetics while maintaining the simplified statistical model used for association studies in genetic epidemiology. Coalescent models suggest that for some diseases, chromosomes of affected persons share a more recent common ancestor than a randomly selected pair of chromosomes. If a disease-causing mutation is relatively recent, haplotypes of affected persons may be identical by state (IBS) over a longer region near a risk locus than would be found among randomly selected haplotypes. Thus, haplotype sharing attempts association mapping by looking for regions where the patterns of IBS similarity among haplotypes of affected persons differ from those found in random haplotypes.

In a recent paper, we derived the distribution of some previously proposed and novel haplotype-sharing tests [1]. Here, we give an overview of these results and apply them to the Genetic Analysis Workshop 15 (GAW15) Problem 3 data.

Methods

For the $i$-th of $n$ case-parent trios, let $H_{1i}$ and $H_{2i}$ be the paternal transmitted and untransmitted haplotypes, while $H_{3i}$ and $H_{4i}$ denote the maternal transmitted and untransmitted haplotypes. Assume haplotypes have $L$ loci, so that there are $2^{L}$ possible haplotypes. Let $S_{k}(H_{1}, H_{2})$ measure the similarity between haplotypes $H_{1}$ and $H_{2}$ at a fixed locus $k$. Many similarity metrics are possible; here we measure similarity by the maximum information length contrast, the number of loci $H_{1}$ and $H_{2}$ share IBS looking upstream and downstream from a fixed locus $k$. Let $S_{k}$ be the matrix having $(i, j)$-th element $S_{k}(H_{i}, H_{j})$. Let $\hat{π}$, $\hat{ρ}$, and $\hat{p}$ denote vectors of haplotype frequency estimators for untransmitted, transmitted, and all haplotypes respectively, obtained under phase uncertainty.
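The similarity measure described above can be sketched in a few lines of Python. This is an illustration only, under stated assumptions: haplotypes are strings of allele codes, locus $k$ is a 0-based marker index inside the haplotype, sharing is counted as the contiguous run of IBS loci that contains $k$, and the score is zero when the alleles at $k$ differ. The function names `ibs_sharing_length` and `similarity_matrix` are hypothetical and do not come from the paper or its software.

```python
from itertools import product


def ibs_sharing_length(h1, h2, k):
    """Count the contiguous loci around index k at which h1 and h2 are IBS.

    Assumption: a mismatch at locus k itself gives a score of zero.
    """
    if len(h1) != len(h2):
        raise ValueError("haplotypes must have the same number of loci")
    if h1[k] != h2[k]:
        return 0
    length = 1
    j = k - 1
    while j >= 0 and h1[j] == h2[j]:  # extend upstream
        length += 1
        j -= 1
    j = k + 1
    while j < len(h1) and h1[j] == h2[j]:  # extend downstream
        length += 1
        j += 1
    return length


def similarity_matrix(haplotypes, k):
    """Matrix whose (i, j) entry is the sharing length of haplotypes i and j at k."""
    return [[ibs_sharing_length(a, b, k) for b in haplotypes] for a in haplotypes]


# All 2^L biallelic haplotypes for L = 3 loci, scored at the middle locus.
haplotypes = ["".join(alleles) for alleles in product("01", repeat=3)]
S = similarity_matrix(haplotypes, k=1)
```

Because the sharing length does not depend on the order of its two arguments, the resulting matrix is symmetric, a property the quadratic forms used later rely on.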

We consider statistics of the form

$$U_{k}(γ) = γ^{T} S_{k} (\hat{ρ} - \hat{π}). \tag{1}$$

It is possible to show that taking $γ = \hat{p}$ yields the numerator of the haplotype-sharing statistics considered by each of van der Meulen and te Meerman [2], Bourgain et al. [3], Tzeng et al. [4], and Zhang et al. [5], though these statistics differ in the computation of their variances. Writing these "standard" haplotype-sharing tests in the form of Eq. (1) allows us to interpret them as looking for differences between the vectors $\hat{ρ}$ and $\hat{π}$ that are in the direction of $\hat{p}^{T} S_{k}$, i.e., in the direction of sharing with the parental haplotypes. The form of $U_{k}(γ)$ also allows us to derive a simple formula for its variance. We make explicit the fact that $γ$ is often a function of the data by writing $\hat{γ}$. Using Slutsky's theorem [6, Section 1.5.4], as long as $\hat{γ} \overset{p}{\to} γ_{0} \neq 0$ under the null hypothesis, $\mathrm{Var}\{U_{k}(\hat{γ})\}$ can be estimated by $\hat{γ}^{T} S_{k} \hat{Σ} S_{k} \hat{γ}$, where $\hat{Σ}$ is the empirical variance estimator of $(\hat{ρ} - \hat{π})$. This variance estimator is considerably simpler than those previously proposed, and is valid even with phase uncertainty and for stratified populations [1].

Use of $γ = \hat{p}$ yields the statistic $T_{\hat{p}} = U_{k}^{2}(\hat{p}) / \mathrm{Var}\{U_{k}(\hat{p})\}$, which we refer to as the p test. Another choice, $γ = \hat{ρ}$, was used by Levinson et al. [7], who contrasted sharing in transmitted haplotypes, $\hat{ρ}^{T} S_{k} \hat{ρ}$, with the cross product $\hat{ρ}^{T} S_{k} \hat{π}$ to give $\hat{ρ}^{T} S_{k} \hat{ρ} - \hat{ρ}^{T} S_{k} \hat{π} = \hat{ρ}^{T} S_{k} (\hat{ρ} - \hat{π})$. We call this the rho test.
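As a rough illustration of Eq. (1) and the variance estimator, the sketch below computes $U_{k}(γ)$, the estimate $γ^{T} S_{k} \hat{Σ} S_{k} γ$, and the standardized statistic for the p and rho tests. Everything numeric here is a made-up toy input, not GAW15 data: the two-haplotype frequency vectors and the matrix `sigma_hat` are hypothetical, and in practice $\hat{Σ}$ would come from the empirical variance estimator described in [1]. Referring the standardized statistic to a $χ^{2}$ distribution with 1 degree of freedom is an assumption of this sketch, motivated by the asymptotic normality argument above.

```python
import math


def mat_vec(matrix, vector):
    """Matrix-vector product for plain nested lists."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def sharing_statistic(gamma, S, rho_hat, pi_hat):
    """U_k(gamma) = gamma^T S_k (rho_hat - pi_hat), as in Eq. (1)."""
    diff = [r - p for r, p in zip(rho_hat, pi_hat)]
    return dot(gamma, mat_vec(S, diff))


def sharing_variance(gamma, S, sigma_hat):
    """gamma^T S_k Sigma_hat S_k gamma, assuming S_k is symmetric."""
    s_gamma = mat_vec(S, gamma)
    return dot(s_gamma, mat_vec(sigma_hat, s_gamma))


def standardized_test(gamma, S, rho_hat, pi_hat, sigma_hat):
    """Squared statistic divided by its estimated variance."""
    u = sharing_statistic(gamma, S, rho_hat, pi_hat)
    return u * u / sharing_variance(gamma, S, sigma_hat)


def chi2_1df_pvalue(t):
    """Upper tail of a chi-square with 1 d.f. (assumed reference distribution)."""
    return math.erfc(math.sqrt(t / 2.0))


# Hypothetical toy inputs: two haplotypes, one locus.
S = [[1.0, 0.0], [0.0, 1.0]]
rho_hat = [0.6, 0.4]                             # transmitted frequencies
pi_hat = [0.5, 0.5]                              # untransmitted frequencies
p_hat = [0.5 * (r + p) for r, p in zip(rho_hat, pi_hat)]
sigma_hat = [[0.001, -0.001], [-0.001, 0.001]]   # made-up covariance of the difference

t_p = standardized_test(p_hat, S, rho_hat, pi_hat, sigma_hat)      # p test
t_rho = standardized_test(rho_hat, S, rho_hat, pi_hat, sigma_hat)  # rho test
```

With only two haplotypes the difference vector is one-dimensional, so the p and rho statistics coincide in this toy case; with more haplotypes the choice of $γ$ changes the direction being tested.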

An appealing choice of $γ$ is $(\hat{ρ} - \hat{π})$, as this direction weights differences in haplotypes by their differences in frequency (Gerard te Meerman, personal communication). However, Slutsky's theorem no longer applies as $(\hat{ρ} - \hat{π}) \overset{p}{\to} 0$ under the null hypothesis. Instead, we use the fact that $U_{k}(\hat{ρ} - \hat{π}) = (\hat{ρ} - \hat{π})^{T} S_{k} (\hat{ρ} - \hat{π})$ is a quadratic form whose distribution is a mixture of independent $χ^{2}$ variates, with weights given by the eigenvalues of the matrix $\hat{Σ} S_{k}$. Following Imhof [8], we approximate this weighted $χ^{2}$ distribution using a three-moment approximation. We refer to the resulting test as the cross test.
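A minimal sketch of a three-moment approximation of this kind follows. It uses the moment-matching form given in Imhof [8]: with $c_{s} = \sum_{j} λ_{j}^{s}$, the tail probability of the quadratic form at $x$ is approximated by that of a $χ^{2}$ variate with $h' = c_{2}^{3} / c_{3}^{2}$ degrees of freedom evaluated at $y = (x - c_{1}) \sqrt{h' / c_{2}} + h'$. Since $\sum_{j} λ_{j}^{s}$ equals the trace of $(\hat{Σ} S_{k})^{s}$, no eigenvalue routine is needed. The function names, the restriction to $c_{3} > 0$, and the hand-written $χ^{2}$ tail function are choices made for this sketch, not details taken from the paper.

```python
import math


def mat_mul(a, b):
    """Product of two matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def trace(a):
    return sum(a[i][i] for i in range(len(a)))


def chi2_sf(x, df):
    """P(chi-square with df degrees of freedom > x); df may be non-integer."""
    if x <= 0.0:
        return 1.0
    a, z = 0.5 * df, 0.5 * x
    log_prefactor = -z + a * math.log(z) - math.lgamma(a)
    if z < a + 1.0:
        # Power series for the lower tail, then take the complement.
        term = total = 1.0 / a
        n = a
        for _ in range(10000):
            n += 1.0
            term *= z / n
            total += term
            if abs(term) < abs(total) * 1e-15:
                break
        return 1.0 - total * math.exp(log_prefactor)
    # Continued fraction (modified Lentz method) for the upper tail.
    tiny = 1e-300
    b = z + 1.0 - a
    c = 1.0 / tiny
    d = 1.0 / b
    h = d
    for i in range(1, 10000):
        an = -i * (i - a)
        b += 2.0
        d = an * d + b
        d = tiny if abs(d) < tiny else d
        c = b + an / c
        c = tiny if abs(c) < tiny else c
        d = 1.0 / d
        delta = d * c
        h *= delta
        if abs(delta - 1.0) < 1e-15:
            break
    return h * math.exp(log_prefactor)


def three_moment_pvalue(x, sigma_hat, S):
    """Approximate upper tail of d^T S d at x when d has covariance sigma_hat."""
    a1 = mat_mul(sigma_hat, S)
    a2 = mat_mul(a1, a1)
    a3 = mat_mul(a2, a1)
    c1, c2, c3 = trace(a1), trace(a2), trace(a3)  # sums of eigenvalue powers
    if c3 <= 0.0:
        raise ValueError("this sketch only handles a positive third moment")
    h = c2 ** 3 / c3 ** 2
    y = (x - c1) * math.sqrt(h / c2) + h
    return chi2_sf(y, h)
```

As a sanity check on the design, when $\hat{Σ} S_{k}$ is an identity matrix of size $m$ the formula gives $h' = m$ and $y = x$, so the approximation reduces to the exact $χ^{2}_{m}$ tail.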

Finally, we note that because the p test uses $γ = \hat{p} = \frac{1}{2}(\hat{ρ} + \hat{π})$, while the cross test uses $γ = (\hat{ρ} - \hat{π})$, the two tests appear to be looking at sharing in orthogonal directions; hence, a combined test seems desirable. Thus, we seek the distribution of $T_{\hat{p}} + U_{k}(\hat{ρ} - \hat{π}) = (\hat{ρ} - \hat{π})^{T} \left[ \frac{S_{k} \hat{p} \hat{p}^{T} S_{k}}{\hat{p}^{T} S_{k} \hat{Σ} S_{k} \hat{p}} + S_{k} \right] (\hat{ρ} - \hat{π})$, where the first term in brackets follows from writing $U_{k}^{2}(\hat{p}) = (\hat{ρ} - \hat{π})^{T} S_{k} \hat{p} \hat{p}^{T} S_{k} (\hat{ρ} - \hat{π})$ for symmetric $S_{k}$. Once again, this is a quadratic form whose distribution is a mixture of independent $χ^{2}$ variates, with weights given by the eigenvalues of the matrix $\hat{Σ} \left[ \frac{S_{k} \hat{p} \hat{p}^{T} S_{k}}{\hat{p}^{T} S_{k} \hat{Σ} S_{k} \hat{p}} + S_{k} \right]$, and we approximate this distribution as in Imhof [8].

Application to GAW15 data

We compared the rho, p, cross, and combined tests by applying them to the GAW15 Problem 3 simulated "loose" SNP set for chromosome 6. We extracted 200 trios from each of 100 replicates by taking the first affected sibling and their parents from the first 200 families in each data set. We used only 200 trios both to speed up computation and because the effect of the risk locus on chromosome 6 was so strong that a reduced data set seemed more realistic. We used the answers to guide our analysis throughout. Specifically, we focused on a 10-cM region (45 cM to 55 cM) around the DR rheumatoid arthritis risk locus on chromosome 6 (the DR locus is at 49.45557055 cM).

In each data set we scanned the region using haplotype windows of 10 loci. The windows were shifted through the region two SNPs at a time, so that if the first window started with SNP1 the next window would start with SNP3. The rho, p, cross, and combined tests were computed for each window, and the transmission disequilibrium test (TDT) was applied to each SNP in the region. Estimates of haplotype frequencies required for the computation of the test statistics were computed using the software package HAPLORE [9]. In each data set we computed max{-log₁₀(P_value)} for each test (where the max is taken over loci) and noted this value and its position, which we took as an estimate of the location of the risk locus; for the haplotype-based tests the location is taken as the average location of SNPs 5 and 6 in the window. An average localization bias for each test was then computed by averaging the distance between the estimated locations and the true risk locus position over the 100 data sets. We compared the empirical distributions of -log₁₀(P_value) values for each test at three loci to investigate the effect of increasing distance from the true disease locus on the performance of each test.
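The window scan and localization summary can be outlined as follows. This is a sketch under assumptions, not the analysis code used for the paper: SNP indices are 0-based, the window location is the mean map position of the fifth and sixth SNPs in a 10-SNP window, the bias is taken as the signed mean of (estimate minus truth), and all function names and example numbers are hypothetical.

```python
import math


def sliding_windows(n_snps, width=10, step=2):
    """Start-to-end index ranges of windows shifted `step` SNPs at a time."""
    return [range(start, start + width)
            for start in range(0, n_snps - width + 1, step)]


def window_location(positions, window):
    """Mean map position of the two middle SNPs (SNPs 5 and 6 when width is 10)."""
    idx = list(window)
    mid = len(idx) // 2
    return 0.5 * (positions[idx[mid - 1]] + positions[idx[mid]])


def peak(p_values, locations):
    """Largest -log10(p) over windows and the location at which it occurs."""
    scores = [-math.log10(p) for p in p_values]
    best = max(range(len(scores)), key=scores.__getitem__)
    return scores[best], locations[best]


def localization_summary(estimates, true_position):
    """Signed bias and mean squared error of the location estimates."""
    errors = [e - true_position for e in estimates]
    bias = sum(errors) / len(errors)
    mse = sum(err * err for err in errors) / len(errors)
    return bias, mse


# Hypothetical example: 14 SNPs spaced 0.1 cM apart give three windows.
positions = [0.1 * i for i in range(14)]
windows = sliding_windows(len(positions))
locations = [window_location(positions, w) for w in windows]
best_score, best_location = peak([0.01, 1e-5, 0.2], locations)
```

Applying `peak` to each replicate and passing the resulting locations to `localization_summary` would give per-test summaries comparable in spirit to those reported in Table 1.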

Results and discussion

Figure 1 presents the results of the rho, p, cross, combined, and TDT tests in the 10-cM region of the chromosome 6 risk locus for Replicate 1. Three things are apparent from this analysis. First, the haplotype-based methods seem to be more powerful than the TDT, yielding much larger -log₁₀(P_value) values. Second, the haplotype-based methods seem to localize the risk locus well. Finally, the haplotype-based methods seem to be more concentrated around the risk locus, being both larger at the locus and dropping more quickly away from the risk locus than the TDT. Visual inspection of other data replicates suggests the same pattern; to confirm, we investigated each of the above points systematically.

First, in order to summarize the power of the various tests, we report the first quartile, median, mean, and third quartile of the max{-log₁₀(P_value)} of each test over the 100 replicates (Table 1). We see that the values for the haplotype-based methods are consistently higher and that the cross test performs best among all tests. Next, we report the localization bias and MSE of the TDT and each of the haplotype-sharing tests (Table 1). Here, once again, the cross test appears to do better than the others, though we note that the small biases involved make it difficult to draw conclusions.

Finally, Figure 2 presents the empirical distribution functions of -log₁₀(P_value) values for each test statistic at three different loci. Our findings are consistent with the observations in Replicate 1: the haplotype-based methods have larger -log₁₀(P_value) values at the risk locus and drop off more quickly away from the risk locus than the TDT throughout the replications. In particular, at 1.036 cM from the disease locus, essentially all replicates have a non-significant test statistic (i.e., values that fall to the left of the gray vertical line in Figure 2) for all of the haplotype-sharing tests, while most replicates have a significant TDT. By 0.244 cM the situation has changed, and all replicates have significant haplotype-sharing tests while about 40% of replicates have a non-significant TDT. At 0.004 cM from the disease locus, all tests are significant, but the superiority of the cross statistic for these data is more readily apparent.

Table 1 Bias and power summaries of 100 data replicates

Conclusion

We presented an overview of a new framework for deriving haplotype-sharing statistics and applied four such statistics to the GAW15 simulated data. Our findings suggest that these haplotype-based statistics can result in greater power and better risk locus localization compared to the single-SNP (TDT) analysis. The framework presented allows visualization of relationships between tests and computation of simplified estimators of the asymptotic distribution of the test statistics. This second feature is quite important because previous estimators have been complex or have depended on permutation procedures, making systematic power studies difficult or impossible.

References

1. Allen AS, Satten GA: Statistical models for haplotype sharing in case-parent trio data. Hum Hered. 2007, 64: 35-44. 10.1159/000101421.
2. Van der Meulen M, te Meerman G: Haplotype sharing analysis in affected individuals from nuclear families with at least one affected offspring. Genet Epidemiol. 1997, 14: 915-919. 10.1002/(SICI)1098-2272(1997)14:6<915::AID-GEPI59>3.0.CO;2-P.
3. Bourgain C, Genin E, Quesneville H, Clerget-Darpoux F: Search for multifactorial disease susceptibility genes in founder populations. Ann Hum Genet. 2000, 64: 255-265. 10.1046/j.1469-1809.2000.6430255.x.
4. Tzeng J, Devlin B, Wasserman L, Roeder K: On the identification of disease mutations by the analysis of haplotype similarity and goodness of fit. Am J Hum Genet. 2003, 72: 891-902. 10.1086/373881.
5. Zhang S, Sha Q, Chen H, Dong J, Jiang R: Transmission/disequilibrium test based on haplotype sharing for tightly linked markers. Am J Hum Genet. 2003, 73: 566-579. 10.1086/378205.
6. Serfling R: Approximation Theorems of Mathematical Statistics. 1980, New York: John Wiley & Sons.
7. Levinson D, Kirby A, Slepner S, Nolte I, Spijker G, te Meerman G: Simulation studies of detection of a complex disease in a partially isolated population. Am J Med Genet (Neuropsych Genet). 2001, 105: 65-70. 10.1002/1096-8628(20010108)105:1<65::AID-AJMG1064>3.0.CO;2-0.
8. Imhof J: Computing the distribution of quadratic forms in normal variables. Biometrika. 1961, 48: 419-426.
9. Zhang K, Sun F, Zhao H: HAPLORE: a program for haplotype reconstruction in general pedigrees without recombination. Bioinformatics. 2005, 21: 90-103. 10.1093/bioinformatics/bth388.

Acknowledgements

ASA acknowledges support from National Heart Lung and Blood Institute, National Institutes of Health grant K25 HL077663.

This article has been published as part of BMC Proceedings Volume 1 Supplement 1, 2007: Genetic Analysis Workshop 15: Gene Expression Analysis and Approaches to Detecting Multiple Functional Loci. The full contents of the supplement are available online at http://www.biomedcentral.com/1753-6561/1?issue=S1.

Author information

Authors and Affiliations

Department of Biostatistics and Bioinformatics, Duke University, Hock Plaza, Suite 1102, 2424 Erwin Road, Durham, North Carolina, 27705, USA
Andrew S Allen
Duke Clinical Research Institute, Duke University, North Pavilion, 2400 Pratt Street, Durham, North Carolina, 27705, USA
Andrew S Allen
Centers for Disease Control and Prevention, Mailstop K-23, 4770 Buford Highway, Atlanta, Georgia, 30345, USA
Glen A Satten

Corresponding author

Correspondence to Andrew S Allen.

Additional information

Competing interests

The author(s) declare that they have no competing interests.

Rights and permissions

This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
