Data preparation
The data, provided by the North American Rheumatoid Arthritis Consortium (NARAC), consist of 2300 single-nucleotide polymorphisms (SNPs) in a 10-Mb region of 18q21 with linkage evidence in U.S. and French scans [5]. Illumina genotyped these markers in 460 cases and 460 controls from New York, matched for age and gender. The genotypic data for controls were screened, and 7 SNPs with $\chi^2 > 10$ for the Hardy-Weinberg test [6] were removed, leaving 2293 to be analyzed. CHROMSCAN requires SNPs to be located on both physical and LDU scales. Physical locations were taken from build 35 of the human genome sequence. Unlike physical maps, LDU maps are study-specific, and several are available, corresponding to the four HapMap samples (CEU, CHB, JPT, and YRI) separately and combined (cosmopolitan). The LDU map with the highest SNP density and population attributes closest to the experimental data should be optimal. We therefore used LDU locations relative to the CEU HapMap data, which have a density of 1 SNP per 863 bp compared to 1 SNP per 4139 bp in the NARAC data. We also used the kilobase map to determine the robustness and power of LDU maps compared with physical maps.
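The Hardy-Weinberg screen described above can be sketched as follows. This is a minimal illustration, assuming a 1-df Pearson chi-square computed from genotype counts in controls and a removal threshold of 10; the function names and the SNP identifiers are hypothetical and are not taken from the NARAC pipeline.

```python
def hwe_chi_square(n_aa, n_ab, n_bb):
    """Pearson chi-square (1 df) for Hardy-Weinberg proportions from genotype counts."""
    n = n_aa + n_ab + n_bb
    if n == 0:
        return 0.0
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    # Skip classes with zero expectation (monomorphic SNPs)
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)


def passes_hwe(counts, threshold=10.0):
    """True if the SNP is retained (chi-square does not exceed the assumed threshold)."""
    return hwe_chi_square(*counts) <= threshold


# Hypothetical control genotype counts (AA, AB, BB) for two made-up SNPs
snps = {"snp_demo_1": (115, 230, 115), "snp_demo_2": (230, 0, 230)}
kept = [name for name, counts in snps.items() if passes_hwe(counts)]
```

In this toy input the first SNP matches Hardy-Weinberg proportions exactly and is kept, whereas the second has no heterozygotes and is removed.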
LDU map construction
The theory for constructing LDU maps has been described [7]. Briefly, the LDU distance for the $i$th SNP interval is given by $\varepsilon_i d_i$, where $\varepsilon_i$ describes the exponential decline of association with physical distance $d_i$ in kb. Values of $\varepsilon_i$ are estimated by composite likelihood that fits the Malecot model [8] to multiple pairwise diplotype data. The Malecot equation, given by $\rho = (1 - L)M e^{-\varepsilon d} + L$, uses additional parameters to describe association at the last major bottleneck ($M$) and residual association at large distance ($L$) to predict $\rho$, the probability of association.
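As a small illustration of the two quantities just defined, the sketch below accumulates the interval distances $\varepsilon_i d_i$ into LDU map locations and evaluates the Malecot prediction for a given distance. The function names are hypothetical, the first SNP is placed at 0 LDU by assumption, and the values of $\varepsilon_i$ are taken as given rather than estimated.

```python
import math
from itertools import accumulate


def ldu_locations(eps, d_kb):
    """Cumulative LDU location of each SNP; the first SNP is placed at 0 (assumption).

    eps[i] and d_kb[i] describe the interval between SNP i and SNP i + 1.
    """
    if len(eps) != len(d_kb):
        raise ValueError("eps and d_kb must describe the same intervals")
    return [0.0] + list(accumulate(e * d for e, d in zip(eps, d_kb)))


def malecot_rho(d_kb, eps, M, L):
    """Malecot prediction of association at physical distance d_kb."""
    return (1 - L) * M * math.exp(-eps * d_kb) + L


# Three intervals; the middle one has eps = 0 and so adds no LDU distance
locations = ldu_locations([0.1, 0.0, 0.5], [10.0, 20.0, 2.0])
```

An interval with $\varepsilon_i = 0$ contributes no LDU distance, which is how blocks of strong LD appear as flat steps on the map.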
Association mapping
The CHROMSCAN program [3] uses a model similar to LDU maps except that the exponential term is replaced by $\varepsilon\Delta(S_i - S)$ to estimate the location ($S$) of a disease gene, where $S_i$ is the location of the $i$th marker in kilobases or LDU. The Kronecker $\Delta$ is used for map direction and assures a correct sign, with $\Delta = 1$ if $S_i \ge S$ or $-1$ if $S_i < S$. To calculate the expected association with distance, $z_i$, the model becomes $z_i = (1 - L)M e^{-\varepsilon\Delta(S_i - S)} + L$, where $M$ is diminished by complex inheritance and $L$ is the association at large distance. The observed association $\hat{z}_i$ is determined from a 2 × 2 table of counts $a$, $b$, $c$, $d$ between affection status and the two alleles of each SNP, where $ad - bc \ge 0$ and $b \le c$ is ensured by rearrangement of columns and rows [9]. Given the observed associations $\hat{z}_i$, the Malecot parameters are estimated iteratively using composite likelihood, which evades a heavy Bonferroni correction by combining information over all loci within a region as $\Lambda = \sum_i K_i(\hat{z}_i - z_i)^2$, where $\hat{z}_i$ and $z_i$ are the observed and expected association values, respectively, at the $i$th SNP. Their squared difference is weighted by information ($K_i$), which is estimated as $K_i = \chi^2_i / \hat{z}_i^2$, where $\chi^2_i$ is the Pearson $\chi^2$ from the 2 × 2 table.
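The pieces of this calculation can be sketched in a few lines. The sketch rests on two assumptions that are not spelled out in the text: the observed association is taken as $(ad - bc)/((a + b)(b + d))$ after rearrangement of the table, and the information is taken as the Pearson chi-square divided by the squared observed association. All function names are hypothetical, and this is not CHROMSCAN code.

```python
import math


def expected_z(s_i, s, eps, M, L):
    """Expected association at marker location s_i for a disease locus at s."""
    delta = 1 if s_i >= s else -1  # sign term that keeps the exponent non-positive
    return (1 - L) * M * math.exp(-eps * delta * (s_i - s)) + L


def observed_association(a, b, c, d):
    """Return (z_hat, K) from a 2 x 2 table [[a, b], [c, d]] of allele counts.

    Assumed estimator: z_hat = (ad - bc) / ((a + b)(b + d)).
    Assumed weight:    K = chi-square / z_hat^2, written in closed form so that
                       it stays defined when z_hat is 0.
    """
    if a * d - b * c < 0:          # swap columns so that ad - bc >= 0
        a, b, c, d = b, a, d, c
    if b > c:                      # swap rows and columns so that b <= c
        a, b, c, d = d, c, b, a
    if min(a + b, c + d, a + c, b + d) == 0:
        raise ValueError("table has an empty margin")
    n = a + b + c + d
    z_hat = (a * d - b * c) / ((a + b) * (b + d))
    # chi2 = n (ad - bc)^2 / ((a+b)(c+d)(a+c)(b+d)); dividing by z_hat^2 gives:
    k = n * (a + b) * (b + d) / ((c + d) * (a + c))
    return z_hat, k


def composite_lambda(z_obs, weights, locations, s, eps, M, L):
    """Weighted sum of squared differences between observed and expected association."""
    return sum(
        k * (z - expected_z(s_i, s, eps, M, L)) ** 2
        for z, k, s_i in zip(z_obs, weights, locations)
    )
```

For a table with counts 60, 40, 40, 60 this gives an observed association of 0.2 with weight 200, which corresponds to a Pearson chi-square of 8.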
Sub-hypotheses of the Malecot model are used to test for a causal polymorphism. Model A, which estimates none of the parameters and uses $M = 0$ with predicted $L$ [10], is taken as the null hypothesis $H_0$, in which there is no association between affection status and SNPs. Model D estimates $M$, $S$, and $L$. The $\Lambda_A - \Lambda_D$ comparison therefore tests for a disease determinant at location $S$. For both models, $\varepsilon$ is fixed to 1 for the LDU map and to a value determined from pairwise marker-by-marker association data for the kilobase map. To account for autocorrelation between SNPs as a result of LD, the significance of evidence is determined by a rank-based permutation test [3].
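The comparison of Model A with Model D can be illustrated with a toy fit. In this sketch a coarse grid search stands in for the iterative estimation used by CHROMSCAN, the predicted L for Model A is supplied by the caller because its derivation [10] is not described here, and all function names are hypothetical.

```python
import math


def lam(z_obs, weights, locs, S, M, L, eps=1.0):
    """Weighted sum of squared differences between observed and expected association."""
    total = 0.0
    for z, k, s_i in zip(z_obs, weights, locs):
        # Delta * (S_i - S) is the absolute distance |S_i - S|
        expected = (1 - L) * M * math.exp(-eps * abs(s_i - S)) + L
        total += k * (z - expected) ** 2
    return total


def fit_model_a(z_obs, weights, locs, L_pred):
    """Null model: M = 0, so every SNP is expected to show association L_pred."""
    return lam(z_obs, weights, locs, S=locs[0], M=0.0, L=L_pred)


def fit_model_d(z_obs, weights, locs, steps=20):
    """Grid search over S, M and L (a stand-in for iterative estimation).

    Returns (lambda_D, S, M, L) for the best grid point.
    """
    lo, hi = min(locs), max(locs)
    best = (math.inf, None, None, None)
    for i in range(steps + 1):
        S = lo + (hi - lo) * i / steps
        for j in range(steps + 1):
            M = j / steps
            for k in range(steps + 1):
                L = k / steps
                value = lam(z_obs, weights, locs, S, M, L)
                if value < best[0]:
                    best = (value, S, M, L)
    return best


# Synthetic data generated from the model with S = 5, M = 0.5, L = 0.1 on an LDU scale
locs = [float(x) for x in range(11)]
z_obs = [(1 - 0.1) * 0.5 * math.exp(-abs(s - 5.0)) + 0.1 for s in locs]
weights = [1.0] * len(locs)
lambda_a = fit_model_a(z_obs, weights, locs, L_pred=0.1)
lambda_d, S_hat, M_hat, L_hat = fit_model_d(z_obs, weights, locs)
statistic = lambda_a - lambda_d
```

Because the synthetic data were generated from a grid point, Model D fits them exactly and the statistic equals the lack of fit of Model A. With real data the significance of the statistic would still have to come from the permutation procedure.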
Three separate analyses of the data were performed with CHROMSCAN. The first was a preliminary screen of the entire 10-Mb bin, which was divided into 18 nonoverlapping regions, each with at least 30 SNPs and covering at least 10 LDUs. To determine accurate levels of significance, the number of permutation replicates must approach the actual level of significance so that interpolation of the variance under $H_1$ is reliable. To minimize computation time, the initial analysis was restricted to 100 replicates. Significant regions identified by the initial screen were re-analyzed separately using 1000 and 5000 replicates to verify convergence. In the second analysis, intended to demonstrate the power of LDU maps, this re-analysis was repeated using the kilobase map and two estimates of the exponential decline $\varepsilon$, one derived from the significant region and one from the 10-Mb region [11]. The risk for rheumatoid arthritis is elevated in females, especially with late onset (ages 35 to 60) [12]. Our third analysis therefore stratified cases into three groups: males, females with onset at age 39 or younger, and females with onset at age 40 or older. The partition of females around an onset age of 40 was chosen to give approximately equal numbers of 'early' and 'late' onset cases. Unaffected controls were stratified to match: all male controls formed one group (similar in age and total number to the affected males), and female controls were divided by current age to give totals similar to those of the two female case groups. This analysis was restricted to significant regions from the initial screen and used 5000 replicates.
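The replicate counts used above (100, 1000, and 5000) limit the smallest p-value that a permutation test can report. The sketch below shows this with a plain label-shuffling test on a toy allele-frequency statistic. It is not the rank-based procedure of CHROMSCAN, and the statistic, function names, and data are hypothetical.

```python
import random


def allele_freq_diff(alleles, is_case):
    """Toy statistic: absolute difference in allele frequency, cases vs controls."""
    case = [a for a, c in zip(alleles, is_case) if c]
    ctrl = [a for a, c in zip(alleles, is_case) if not c]
    return abs(sum(case) / len(case) - sum(ctrl) / len(ctrl))


def permutation_p(alleles, is_case, n_rep, seed=0):
    """Empirical p-value from n_rep random shuffles of the case/control labels."""
    rng = random.Random(seed)
    observed = allele_freq_diff(alleles, is_case)
    labels = list(is_case)
    exceed = 0
    for _ in range(n_rep):
        rng.shuffle(labels)
        if allele_freq_diff(alleles, labels) >= observed:
            exceed += 1
    # Adding 1 to numerator and denominator counts the observed data as one replicate
    return (exceed + 1) / (n_rep + 1)


# Extreme toy data: allele 1 in all 460 cases, allele 0 in all 460 controls
alleles = [1] * 460 + [0] * 460
is_case = [True] * 460 + [False] * 460
p_100 = permutation_p(alleles, is_case, n_rep=100)
```

However strong the association, 100 replicates cannot give a p-value below 1/101, which is why regions flagged by the initial screen need many more replicates before their significance can be resolved.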