For the ith of ncase-parent trios, let H1iand H2ibe the paternal transmitted and untransmitted haplotypes, while H3iand H4idenote the maternal transmitted and untransmitted haplotypes. Assume haplotypes having L loci, so that there are 2Lpossible haplotypes. Let S
k
(H1, H2) measure the similarity between haplotypes H1 and H2 at a fixed locus k. Many similarity metrics are possible; here we measure similarity by the maximum information length contrast, the number of loci H1 and H2 share IBS looking upstream and downstream from a fixed locus k. Let S
k
be the matrix having (i, j)th element S
k
(H
i
, H
j
). Let , , and denote vectors of haplotype frequency estimators for untransmitted, transmitted, and all haplotypes respectively, obtained under phase uncertainty.
We consider statistics of the form
(1)
It is possible to show that taking γ = yields the numerator of the haplotype-sharing statistics considered by each of van der Meulen and te Meerman [2], Bourgain et al. [3], Tzeng et al. [4], and Zhang et al. [5], though these statistics differ in the computation of their variances. Writing these "standard" haplotype sharing tests in the form Eq. (1) allows us to interpret them as looking for differences between vectors and that are in the direction of TS
k
, i.e., in the direction of sharing with the parental haplotypes. The form of U
k
(γ) also allows us to derive a simple formula for its variance. We make explicit the fact that γ is often a function of the data by writing . Using Slutsky's theorem [6, Section 1.5.4], as long as under the null hypothesis, Var{U
k
()} can be estimated by , where is the empirical variance estimator of ( - ). This variance estimator is considerably simpler than those previously proposed, and is valid even with phase uncertainty and for stratified populations [1]. Use of γ = yields the statistic , which we refer to as the p test. Another choice, γ = , was used by Levinson et al. [7], who contrasted sharing in transmitted haplotypes, , with the cross product to give . We call this the rho test.
An appealing choice of γ is ( - ), as this direction weights differences in haplotypes by their differences in frequency (Gerard te Meerman, personal communication). However, Slutsky's theorem no longer applies as under the null hypothesis. Instead, we use the fact that is a quadratic form whose distribution is a mixture of independent χ2 variates, with weights given by the eigenvalues of the matrix S
k
. Following Imhof [8], we approximate this weighted χ2 distribution using a three-moment approximation. We refer to the resulting test as the cross test.
Finally, we note that because the p test uses , while the cross test uses γ = ( - ), the two tests appear to be looking at sharing in orthogonal directions; hence, a combined test seems desirable. Thus, we seek the distribution of . Once again, this is a quadratic form whose distribution is a mixture of independent χ2 variates, with weights given by the eigenvalues of the matrix , and we approximate this distribution as in Imhof [8].
Application to GAW15 data
We compare the rho, p, cross, and combined tests by applying them to the GAW15 Problem 3 simulated "loose" SNP set for chromosome 6. We extracted 200 trios from each of 100 replicates by taking the first affected sibling and their parents from the first 200 families in each data set. We used only 200 trios both to speed up computation and because the effect of the risk locus on chromosome 6 was so strong that a reduced data set seemed more realistic. We used the answers to guide our analysis throughout. Specifically, we focused on a 10-cM region (45 cM to 55 cM) around the DR rheumatoid arthritis risk locus on chromosome 6 (DR locus is at 49.45557055 cM). In each data set we scanned the region using haplotype windows of 10 loci. The windows were shifted through the region two SNPs at a time so that if the first window started with SNP1 the next window would start with SNP3. The rho, p, cross, and combined tests were computed for each window and the transmission disequilibrium test (TDT) was applied to each SNP in the region. Estimates of haplotype frequencies required for the computation of the test statistics were computed using the software package HAPLORE [9]. In each data set we compute the max{-log10(P
value
)} for each test (where the max is taken over loci) and note this value and its position (for the haplotype-based tests the location is taken as the average location of SNPs 5 and 6 in the window), which we take as an estimate of the location of the risk locus. An average localization bias for each test was then computed by averaging the distance between the estimated locations and the true risk locus position over the 100 data sets. We compared the empirical distributions of -log10(P
value
) values for each test at three loci to investigate the effect of increasing distance from the true disease locus on the performance of each test.