Materials
Data from both Problems 2 and 3 of Genetic Analysis Workshop 15 (GAW15) were analyzed. From Problem 2, we used a North American Rheumatoid Arthritis Consortium (NARAC) data set of 2300 single-nucleotide polymorphisms (SNPs) covering 10 Mb of chromosome 18q, a region that had shown evidence of linkage to rheumatoid arthritis (RA) [10]. This data set contained 460 cases and 460 controls, the controls having been recruited from a New York City population. Problem 3 provided simulated RA data on 9187 SNPs distributed over the entire genome. We analyzed Replicates 1 through 10 of the affected sib-pair nuclear families. From each family, the haplotypes transmitted to the first affected sib were used as cases and the non-transmitted haplotypes as controls, regardless of the affection status of the parents. We had no prior knowledge of the answers.
Haplotype inference
The genotypes of the 920 subjects in the NARAC data set were phased using the program 2SNP [11], which reconstructed the 1840 haplotypes for all 2300 SNPs in approximately 20 minutes. The simulated data sets were constructed in a way that allowed phased haplotypes to be extracted directly, so these data sets were analyzed without phase ambiguities or missing alleles.
Statistical analysis
Haplotype sharing between two haplotypes X and Y from the perspective of each locus k, denoted as h(X, Y; k), can be evaluated as the number of consecutive SNPs, including locus k, that carry the same alleles in the telomeric and centromeric directions. Given a sample of case haplotypes X_1, ..., X_N and a sample of control haplotypes Y_1, ..., Y_M, four measures of haplotype sharing at locus k are defined as follows:
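case sharing: $SH_{\mathrm{CASE}}(k) = \frac{2}{N(N-1)} \sum_{i<j} h(X_i, X_j; k)$;
control sharing: $SH_{\mathrm{CTR}}(k) = \frac{2}{M(M-1)} \sum_{i<j} h(Y_i, Y_j; k)$;
cross sharing: $SH_{\mathrm{CROSS}}(k) = \frac{1}{NM} \sum_{i=1}^{N} \sum_{j=1}^{M} h(X_i, Y_j; k)$;
overall sharing: $SH_{\mathrm{ALL}}(k) = \frac{2}{(N+M)(N+M-1)} \sum_{i<j} h(Z_i, Z_j; k)$, where $Z_1, \ldots, Z_{N+M}$ denotes the pooled sample of case and control haplotypes.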
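To make the definition of h(X, Y; k) concrete, the following Python sketch computes the sharing length for one pair of haplotypes and averages it over pairs. It assumes that haplotypes are equal-length strings of allele codes, that h is 0 when the two haplotypes differ at locus k itself, and that each of the four measures is the plain average of h over the relevant pairs. The function names are illustrative and are not taken from the authors' software.

```python
from itertools import combinations


def sharing(x, y, k):
    """h(X, Y; k): length of the run of identical alleles around locus k.

    Assumption: returns 0 if the two haplotypes differ at locus k itself.
    """
    if x[k] != y[k]:
        return 0
    left = k
    while left > 0 and x[left - 1] == y[left - 1]:
        left -= 1
    right = k
    while right < len(x) - 1 and x[right + 1] == y[right + 1]:
        right += 1
    return right - left + 1


def mean_within(haps, k):
    """Average sharing at locus k over all unordered pairs in one sample."""
    pairs = list(combinations(haps, 2))
    return sum(sharing(a, b, k) for a, b in pairs) / len(pairs)


def mean_between(cases, controls, k):
    """Average sharing at locus k over all case-control pairs."""
    total = sum(sharing(a, b, k) for a in cases for b in controls)
    return total / (len(cases) * len(controls))


def sharing_measures(cases, controls, k):
    """Case, control, cross, and overall sharing at locus k."""
    return {
        "case": mean_within(cases, k),
        "control": mean_within(controls, k),
        "cross": mean_between(cases, controls, k),
        "all": mean_within(list(cases) + list(controls), k),
    }
```

Because every pair is visited for every locus, this direct form scales quadratically with the number of haplotypes and is meant only to illustrate the definitions.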
The first haplotype-sharing method, the HSS, compares haplotype sharing among cases with haplotype sharing among controls. In contrast to the earlier HSS [7, 8], this formulation corrects for linkage disequilibrium (LD) other than that caused by the disease mutation. We hypothesize that haplotype sharing will be larger among cases than among controls at loci involved in the disease and at other loci in LD with them, because i) haplotypes containing a risk allele are more likely to be similar to each other and dissimilar to haplotypes containing a non-risk allele; and ii) haplotypes containing the risk allele may be shared over longer stretches. The first factor follows from the concepts of association and LD. The second can be explained by the presumably shorter coalescence times of disease alleles, and hence fewer recombination events, in the sample of cases compared with the wild-type alleles in the sample of controls. The HSS at locus k is defined as
$$t_{\mathrm{HSS}}(k) = \frac{SH_{\mathrm{CASE}}(k) - SH_{\mathrm{CTR}}(k)}{\sqrt{\mathrm{sd}_{SH_{\mathrm{CASE}}}(k)^2 + \mathrm{sd}_{SH_{\mathrm{CTR}}}(k)^2}}$$
where sdSHCASE(k) and sdSHCTR(k) are the estimates of the standard deviation of the mean haplotype sharing at locus k, accounting for LD, among cases and controls, respectively. When N and M are large, SHCASE(k) and SHCTR(k) are approximately normally distributed (Central Limit Theorem) and, because SHCASE(k) and SHCTR(k) are independent, the significance of t_HSS(k) can be derived from a t-distribution with N + M - 2 degrees of freedom (i.e., N - 1 from the cases and M - 1 from the controls). The main statistical problem in evaluating mean haplotype sharing is how to calculate the variance of the mean sharing over all pairs of haplotypes. Haplotypes generally share alleles in groups, so the sharing values of different haplotype pairs are not independent. For the HSS, we derived an unbiased estimate of the standard deviation of the mean haplotype sharing from the theory of U-statistics (see Appendix).
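As an illustration of how the test statistic could be assembled once the mean sharing values and their standard deviations are available, the sketch below assumes that t_HSS(k) is the difference in mean sharing divided by the square root of the summed squared standard deviations. The U-statistic standard deviations (see Appendix) are taken as given inputs rather than computed. The one-sided p-value uses a normal approximation to the t-distribution, an assumption that is reasonable only when N + M - 2 is large. The function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def hss_statistic(sh_case, sh_ctr, sd_case, sd_ctr, n_case, n_ctr):
    """Return (t, degrees of freedom, approximate one-sided p-value).

    sd_case and sd_ctr are the standard deviations of the mean sharing in
    cases and controls; they are inputs here, not derived from the data.
    """
    t = (sh_case - sh_ctr) / sqrt(sd_case ** 2 + sd_ctr ** 2)
    df = n_case + n_ctr - 2
    # Assumption: for large df the t-distribution is close to the standard
    # normal, so the upper-tail normal probability is used as the p-value.
    p = 1.0 - NormalDist().cdf(t)
    return t, df, p
```

For small samples, the p-value would have to be taken from the t-distribution itself rather than from this approximation.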
The second haplotype-sharing method is the CROSS test. It rests on the hypothesis that a case haplotype and a control haplotype differ from each other in the region of a disease locus and will therefore show less haplotype sharing (cross sharing; SHCROSS) than two random haplotypes (SHALL). This test incorporates more information on allele frequency differences between cases and controls (i.e., the single-SNP association "signal") than the HSS. In contrast to the HSS, an equivalent U-statistics variance of the cross sharing cannot be estimated because of the correlation between SHCROSS and SHALL. Therefore, the variance of the cross sharing is estimated by a sequential randomization procedure in which case and control status is randomly permuted over the haplotypes for as long as the interim significance estimate remains interesting (i.e., p-value < 0.1). To keep the test fast, and hence feasible for whole-genome screens with a high density of SNPs, the significance is not determined from the randomization procedure itself; instead, the variance of SHCROSS(k) - SHALL(k) is estimated from a maximum of 200 randomizations, which is sufficient to provide a reasonably accurate variance estimate. The CROSS test at locus k is then defined as
$$z_{\mathrm{CROSS}}(k) = \frac{SH_{\mathrm{CROSS}}(k) - SH_{\mathrm{ALL}}(k)}{\mathrm{sd}\left(SH_{\mathrm{CROSS}}(k) - SH_{\mathrm{ALL}}(k)\right)}$$
Note that a negative value indicates positive association. Because of the correlation between SHCROSS and SHALL, the tails of the z_CROSS(k) distribution are not properly approximated by a normal distribution, leading to downward-biased p-values for extreme z-values. Therefore, the z-values are transformed to a chi-square distribution with ν degrees of freedom.
With an appropriately chosen ν, the distribution resembles the true z-score distribution, especially in the tails, so that realistic p-values are obtained. The best choice for ν typically depends on the sample size and on the individual chromosome. For the current study, we empirically derived the value for ν that minimized the bias in p-values in non-associated regions.
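The variance estimation by randomization described for the CROSS test can be sketched as follows. This is a simplified illustration, not the authors' implementation: it uses a fixed number of 200 label permutations, omits both the sequential stopping rule and the chi-square transformation, assumes that h is 0 when two haplotypes differ at locus k, and divides the observed SHCROSS(k) - SHALL(k) by the standard deviation of the permuted values. All function names are hypothetical.

```python
import random
from itertools import combinations
from statistics import stdev


def sharing(x, y, k):
    """Run length of identical alleles around locus k (0 if k differs)."""
    if x[k] != y[k]:
        return 0
    left = k
    while left > 0 and x[left - 1] == y[left - 1]:
        left -= 1
    right = k
    while right < len(x) - 1 and x[right + 1] == y[right + 1]:
        right += 1
    return right - left + 1


def cross_minus_all(pair_sharing, is_case):
    """SHCROSS(k) - SHALL(k) for one case/control labelling."""
    cross = [h for (i, j), h in pair_sharing.items() if is_case[i] != is_case[j]]
    overall = list(pair_sharing.values())
    return sum(cross) / len(cross) - sum(overall) / len(overall)


def cross_z(cases, controls, k, n_perm=200, seed=0):
    """z-score of the cross sharing, with a permutation-based standard deviation."""
    haps = list(cases) + list(controls)
    labels = [True] * len(cases) + [False] * len(controls)
    # Pairwise sharing does not depend on the labels, so compute it once.
    pair_sharing = {
        (i, j): sharing(haps[i], haps[j], k)
        for i, j in combinations(range(len(haps)), 2)
    }
    observed = cross_minus_all(pair_sharing, labels)
    rng = random.Random(seed)
    null = []
    for _ in range(n_perm):
        permuted = labels[:]
        rng.shuffle(permuted)  # reassign case/control status at random
        null.append(cross_minus_all(pair_sharing, permuted))
    sd = stdev(null)
    if sd == 0:
        raise ValueError("permuted differences have zero variance at this locus")
    return observed / sd
```

Only the labels are permuted, so the overall sharing stays constant across randomizations and the spread of the permuted differences reflects the cross sharing alone.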
In order to compare the performances of the HSS and CROSS, we also performed single-SNP and haplotype-association analysis. Single-SNP association was tested by means of a chi-square test. For haplotype association, frequencies of haplotypes of five consecutive SNPs were counted and a log-likelihood ratio test was performed including only haplotypes with n > 10 to assess the significance of the difference between cases and controls (our own software, available on request).
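For comparison, a single-SNP test of this kind can be sketched as a Pearson chi-square test on the 2 x 2 table of allele counts in case and control haplotypes. The source does not state whether alleles or genotypes were tabulated, so the allele-based table, the coding of one allele as "1", and the function name are assumptions.

```python
from math import erfc, sqrt


def snp_chi_square(case_haps, control_haps, k):
    """Pearson chi-square (1 degree of freedom) for allele counts at SNP k.

    Assumption: haplotypes are strings and one of the two alleles is coded "1".
    Returns (chi-square statistic, p-value).
    """
    a = sum(1 for h in case_haps if h[k] == "1")     # cases with allele "1"
    b = len(case_haps) - a                           # cases with the other allele
    c = sum(1 for h in control_haps if h[k] == "1")  # controls with allele "1"
    d = len(control_haps) - c                        # controls with the other allele
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0, 1.0  # monomorphic SNP or empty group: no evidence
    chi2 = n * (a * d - b * c) ** 2 / denom
    # Upper tail of the chi-square distribution with 1 degree of freedom.
    p = erfc(sqrt(chi2 / 2.0))
    return chi2, p
```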
We used a conservative Bonferroni correction to correct for multiple testing. Hence, in the real data, a result was considered significant if -log10(p-value) was larger than -log10(0.05/2300) = 4.66; in the simulation study, a result was considered significant if -log10(p-value) was larger than -log10(0.05/9187) = 5.26 and suggestive if it was larger than -log10(0.10/9187) = 4.96.
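The thresholds follow directly from the Bonferroni formula. The short sketch below recomputes them, assuming that log denotes the base-10 logarithm; the function name is illustrative.

```python
from math import log10


def bonferroni_threshold(alpha, n_tests):
    """-log10 of the Bonferroni-corrected significance level alpha / n_tests."""
    return -log10(alpha / n_tests)


# SNP counts taken from the text: 2300 (real data) and 9187 (simulated data).
real_significant = bonferroni_threshold(0.05, 2300)
sim_significant = bonferroni_threshold(0.05, 9187)
sim_suggestive = bonferroni_threshold(0.10, 9187)
```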