Simulated data sets
From the simulated rheumatoid arthritis data set consisting of 1500 families with two affected children and 2000 unrelated controls (Problem 3), the first affected child from each of the first 1000 families was chosen to constitute the case group. For each replication the cases were matched by sex with 1000 controls. With prior knowledge of the disease-causing loci, 21 SNPs (3427 to 3447) including Locus C of the high-density scan of chromosome 6 were extracted. In addition, 20 SNPs (260 to 279) from chromosome 18 around Locus E were chosen.
Because of the known strong effect of Locus C, smaller samples, each consisting of 50 or 100 cases and controls, were used for the analysis of chromosome 6. Females and males were analyzed separately because of the known gender-specific interaction between Locus C and the Disease Locus DR. Data on chromosome 18 were analyzed using samples of 500 or 1000 cases and controls for both sexes combined. The haplotypes used for analysis were provided by GAW15.
Measures of haplotype similarity $L_{ij}(x)$
We applied four measures of haplotype similarity based on the number of shared intervals, i.e., the number of intervals surrounding a marker that are flanked by markers carrying the same alleles (identical by state, IBS). Modified versions of these four measures were also employed to account for the sharing of a single marker and for the unobserved regions between the examined markers beyond the shared region. The first and commonly used measure, N, counts the number of shared intervals in the vicinity of a specific marker. Its modified version, N+, corresponds to N + 1. For the next three measures, N was weighted by the length between the first and the last shared markers, expressed as physical length in kilobase pairs (KB), as genetic length in centimorgans (CM), or in linkage disequilibrium units (LDU). LDUs were introduced by Morton et al. as a genetic distance based on the observed haplotype frequencies [3]. Maniatis et al. showed that the use of LDUs might improve the power of single-point linkage analysis and the power to identify disease-causing variants [4]. The software LDMAP [4] was used to determine the LDUs of the chosen markers. For the modified versions, KB+, CM+, and LDU+, half of the distance to the adjacent marker was added both before the first shared marker and after the last shared marker. When the first or the last of all investigated markers was involved, half of the distance to the second or to the penultimate marker, respectively, was used as a proxy. We also studied the measure proposed by Yu et al. [5], which gives greater weight to the sharing of rare marker alleles than to that of common alleles. The weights are determined as the probability of particular alleles at the specific marker, conditional on the surrounding alleles, and are estimated from the allele frequencies of control haplotypes.
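The count-based and length-based measures can be illustrated with a minimal Python sketch. It is not the authors' R implementation, and it rests on several assumptions: haplotypes are equal-length allele sequences, `pos` holds marker positions in any one unit (kb, cM, or LDU), the length-based measure is taken as the distance between the first and last shared markers, and all measures are set to 0 when the alleles at marker x differ. The function names `shared_run` and `similarity` are hypothetical.

```python
def shared_run(h1, h2, x):
    """Return (first, last) marker indices of the IBS run containing x, or None."""
    if h1[x] != h2[x]:
        return None
    first = x
    while first > 0 and h1[first - 1] == h2[first - 1]:
        first -= 1
    last = x
    while last < len(h1) - 1 and h1[last + 1] == h2[last + 1]:
        last += 1
    return first, last


def similarity(h1, h2, x, pos):
    """Count- and length-based sharing around marker x (illustrative only)."""
    run = shared_run(h1, h2, x)
    if run is None:
        # Assumption: no sharing at x gives zero for every measure.
        return {"N": 0, "N+": 0, "len": 0.0, "len+": 0.0}
    first, last = run
    n = last - first                    # number of shared intervals
    length = pos[last] - pos[first]     # distance between first and last shared markers
    # Gap to the neighbouring marker outside the run; at the map ends the
    # outermost interval (to the second or penultimate marker) is the proxy.
    left_gap = pos[first] - pos[first - 1] if first > 0 else pos[1] - pos[0]
    right_gap = pos[last + 1] - pos[last] if last < len(pos) - 1 else pos[-1] - pos[-2]
    return {
        "N": n,
        "N+": n + 1,
        "len": float(length),
        "len+": length + 0.5 * left_gap + 0.5 * right_gap,
    }


h1 = [1, 0, 1, 1, 0, 1]
h2 = [0, 0, 1, 1, 1, 1]
pos = [0, 10, 30, 60, 100, 150]   # hypothetical marker positions
print(similarity(h1, h2, 2, pos))
```

In this toy example the run around marker index 2 spans indices 1 to 3, so N counts two shared intervals and the "+" length adds half of each flanking gap.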
Exploratory analysis
Kruskal's nonmetric multidimensional scaling (MDS) was used to explore the resemblance among the different similarity measures [6]. MDS iteratively minimizes the stress value, a least-squares criterion that compares the observed distances with those estimated from the fitted configuration and that is used to assess the dimensionality of the data. The smaller the stress value, the better the fit, with stress values between 0 and 2.5% indicating an excellent goodness of fit. After dimensionality assessment, a graphical representation of the data permitted the investigation of resemblance among similarity measures.
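To make the stress criterion concrete, the following Python sketch computes Kruskal's stress-1 for a given configuration. It is an illustration only, not the analysis code: it assumes the observed dissimilarities and the distances in the fitted configuration are supplied as two paired lists, it ignores special handling of ties, and the function names are hypothetical. The disparities are obtained by monotone (isotonic) regression using the pool-adjacent-violators algorithm.

```python
import math


def pava(values):
    """Least-squares non-decreasing fit to `values` (pool adjacent violators)."""
    blocks = []  # each block stores [sum, count]
    for v in values:
        blocks.append([v, 1])
        # Merge backwards while the previous block mean exceeds the current one.
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    fitted = []
    for s, c in blocks:
        fitted.extend([s / c] * c)
    return fitted


def kruskal_stress(dissimilarities, config_distances):
    """Stress-1: sqrt(sum (d - d_hat)^2 / sum d^2), returned as a fraction."""
    order = sorted(range(len(dissimilarities)), key=lambda k: dissimilarities[k])
    d = [config_distances[k] for k in order]   # distances in rank order of the data
    d_hat = pava(d)                            # monotone disparities
    num = sum((a - b) ** 2 for a, b in zip(d, d_hat))
    den = sum(a * a for a in d)
    return math.sqrt(num / den)


print(kruskal_stress([1, 2, 3], [1, 2, 3]))  # perfectly monotone configuration
print(kruskal_stress([1, 2, 3], [1, 3, 2]))  # one rank violation
```

The value is returned as a fraction; some software reports the same quantity as a percentage.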
Mantel statistics using haplotype sharing and power analysis
The haplotype-based Mantel statistic correlates the haplotype similarity $L_{ij}(x)$ at every marker x, where i and j are two haplotypes, with the phenotypic similarity of the two individuals, with phenotypes $s_i$ and $s_j$, who carry the haplotypes i and j [2]. The phenotypic similarity is defined as the mean-corrected product $Y_{ij} = (s_i - \mu)(s_j - \mu)$, where μ denotes the expectation of the phenotype in the sample, i.e., μ = 0.5, with a case coded as 1 and a control as 0. Thus, the statistic is defined as the sum of the cross products of $L_{ij}(x)$ and $Y_{ij}$:

$$M(x) = \sum_{i<j} L_{ij}(x)\, Y_{ij}$$
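As a hedged illustration of the statistic at a single marker, the Python sketch below sums the products of haplotype similarity and mean-corrected phenotype products over all haplotype pairs. The similarity matrix `L`, the per-haplotype phenotype list, and the function name are hypothetical, and summing over all pairs i < j (including the two haplotypes of one individual) is an assumption rather than a detail confirmed by the text.

```python
from itertools import combinations


def mantel_statistic(L, pheno_by_hap, mu=0.5):
    """Sum over haplotype pairs i < j of L[i][j] * (s_i - mu) * (s_j - mu).

    L            : symmetric matrix of haplotype similarities at one marker
    pheno_by_hap : phenotype (1 = case, 0 = control) of the carrier of each haplotype
    """
    total = 0.0
    for i, j in combinations(range(len(L)), 2):
        total += L[i][j] * (pheno_by_hap[i] - mu) * (pheno_by_hap[j] - mu)
    return total


# Toy example: four haplotypes, the first two carried by cases.
L = [
    [0, 2, 1, 0],
    [2, 0, 0, 3],
    [1, 0, 0, 1],
    [0, 3, 1, 0],
]
print(mantel_statistic(L, [1, 1, 0, 0]))
```

Pairs with concordant phenotypes contribute positively and discordant pairs negatively, so excess sharing among cases (or among controls) pushes the statistic upward.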
Statistical significance was assessed via a Monte Carlo permutation test. The empirical null distribution, i.e., the distribution under which the genetic and the phenotypic similarity are independent, was estimated by permuting the phenotype 1000 times while keeping together the two haplotypes derived from an individual. The empirical p-value was derived by comparing the observed statistic against the empirical distribution.
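The permutation scheme can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' code: haplotypes 2k and 2k+1 are taken to belong to individual k, the test is treated as one-sided (large values of the statistic are extreme), and the p-value uses the (b + 1)/(B + 1) convention. All function and variable names are hypothetical.

```python
import random
from itertools import combinations


def mantel_statistic(L, pheno_by_hap, mu=0.5):
    """Sum over haplotype pairs i < j of L[i][j] * (s_i - mu) * (s_j - mu)."""
    return sum(L[i][j] * (pheno_by_hap[i] - mu) * (pheno_by_hap[j] - mu)
               for i, j in combinations(range(len(L)), 2))


def expand(phenotypes):
    """Repeat each individual's phenotype for both of its haplotypes."""
    return [p for p in phenotypes for _ in range(2)]


def permutation_p_value(L, phenotypes, n_perm=1000, seed=0):
    """One-sided empirical p-value; phenotypes are shuffled among individuals,
    so the two haplotypes of an individual always stay together."""
    rng = random.Random(seed)
    observed = mantel_statistic(L, expand(phenotypes))
    shuffled = list(phenotypes)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if mantel_statistic(L, expand(shuffled)) >= observed:
            hits += 1
    # Assumption: add-one convention so the p-value is never exactly zero.
    return (hits + 1) / (n_perm + 1)


# Toy data: 6 individuals (12 haplotypes); only haplotypes of the first
# three individuals share with each other.
L = [[10.0 if (i < 6 and j < 6 and i != j) else 0.0 for j in range(12)]
     for i in range(12)]
print(permutation_p_value(L, [1, 1, 1, 0, 0, 0], n_perm=1000, seed=1))
```

Shuffling at the level of individuals rather than haplotypes is what preserves the pairing of the two haplotypes described in the text.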
The different haplotype similarity measures were calculated for each replication r = 1,..., 100 for both chromosomes and employed to investigate the power of the Mantel statistics to map the disease locus. The power was estimated as the number of replications with a significant test result (p-value less than α = 0.05) divided by 100, the total number of replications.
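The power estimate described above reduces to a simple proportion, shown here as a short Python sketch. The function name and the example p-values are hypothetical and serve only to illustrate the calculation.

```python
def estimate_power(p_values, alpha=0.05):
    """Fraction of replications whose p-value is strictly below alpha."""
    return sum(1 for p in p_values if p < alpha) / len(p_values)


# Hypothetical p-values from four replications.
print(estimate_power([0.01, 0.20, 0.04, 0.50]))
```

With 100 replications, as in the study design, the estimate can only take values in steps of 0.01.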
Data management and the calculation of the similarity measures and of the Mantel statistics were performed in the R programming language.