Autosomal microsatellite data for the NARAC, UK, and ECRAF samples were combined with the Illumina single-nucleotide polymorphism (SNP) sample from Canada. The genotypic data were stored in sample-specific files, and their marker maps were aligned to improve map correspondence between samples. The Canada and ECRAF loci were placed on the NARAC and UK genetic map using the NCBI physical positions of the NARAC, ECRAF, and Canada loci; see Segurado et al. for more detail. Non-Caucasians were removed from the NARAC sample to minimize heterogeneity. No ethnicity information was available for the UK, ECRAF, and Canada samples. The software GRR was used to identify pedigrees with potentially incorrect inheritance structures, and these pedigrees were removed from further analyses. No Mendelian inheritance errors were detected with PedCheck.
Linkage disequilibrium (LD)
Markers in LD were removed to minimize the chance of excess false positives. Microsatellite markers separated by less than 0.5 cM were identified, and those with the lowest single-point information content were removed. The SNP map was thinned to a 0.5 cM grid on the basis of location. Any remaining SNP pairs with r² > 0.05 and separated by less than 5 cM were thinned further until no such pairs remained.
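The two SNP thinning steps described above can be sketched as follows. This is a minimal illustration rather than the pipeline actually used: the function names, the greedy left-to-right rule, and the choice to drop the later SNP of a correlated pair are all assumptions, and `r2` is a hypothetical user-supplied function returning the pairwise r² between two named SNPs.

```python
from typing import Callable, List, Tuple

Marker = Tuple[str, float]  # (marker name, map position in cM)


def thin_to_grid(markers: List[Marker], spacing: float = 0.5) -> List[Marker]:
    """Greedily keep markers so consecutive kept markers are >= spacing cM apart."""
    kept: List[Marker] = []
    for name, pos in sorted(markers, key=lambda m: m[1]):
        if not kept or pos - kept[-1][1] >= spacing:
            kept.append((name, pos))
    return kept


def prune_ld(
    markers: List[Marker],
    r2: Callable[[str, str], float],  # hypothetical pairwise r^2 lookup
    max_r2: float = 0.05,
    window: float = 5.0,
) -> List[Marker]:
    """Drop any SNP with r^2 > max_r2 to an already kept SNP within `window` cM."""
    kept: List[Marker] = []
    for name, pos in sorted(markers, key=lambda m: m[1]):
        in_ld = any(
            pos - kept_pos < window and r2(kept_name, name) > max_r2
            for kept_name, kept_pos in kept
        )
        if not in_ld:  # assumption: the later SNP of a correlated pair is dropped
            kept.append((name, pos))
    return kept
```

Because each candidate is compared with every SNP already kept inside the window, no retained pair closer than 5 cM exceeds the r² threshold once the pass is complete.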
The genetic and clinical phenotypes analyzed were gender (binary), age at onset (AAO; continuous), definite erosion (binary), and rheumatoid factor (RF) IgM (four levels; treated as continuous). The ECRAF and Canada phenotype information available was limited to gender and RA status. The RA susceptibility locus HLA-DRB1 on chromosome 6 was also investigated. We defined a binary measure for HLA to represent whether an individual carried a high-risk allele, as described in the GAW15 Problem 2 data description. An individual was coded as HLA+ if they carried at least one copy of any of the five high-risk alleles, i.e., DRB1*0401, 0404, 0405, 0408, or 0409. HLA- was defined as carrying no copies of either the seven medium-risk alleles (i.e., DRB1*0101, 0102, 0104, 0105, 1001, 1402, 1406) or the five high-risk alleles.
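The HLA coding rule can be written as a small function. This is a sketch under stated assumptions: the function name and the four-digit allele strings are illustrative, and the text does not say how carriers of a medium-risk allele without a high-risk allele were handled, so returning `None` (uncoded) for them is an assumption.

```python
from typing import Iterable, Optional

# DRB1 allele groups as listed in the text.
HIGH_RISK = {"0401", "0404", "0405", "0408", "0409"}
MEDIUM_RISK = {"0101", "0102", "0104", "0105", "1001", "1402", "1406"}


def hla_status(alleles: Iterable[str]) -> Optional[str]:
    """Return 'HLA+', 'HLA-', or None for one individual's DRB1 alleles."""
    carried = set(alleles)
    if carried & HIGH_RISK:
        return "HLA+"  # at least one high-risk allele
    if not carried & MEDIUM_RISK:
        return "HLA-"  # neither high-risk nor medium-risk alleles
    # Assumption: medium-risk carriers without a high-risk allele are left uncoded.
    return None
```

For example, `hla_status(["0401", "0101"])` gives `"HLA+"`, while `hla_status(["0301", "0701"])` gives `"HLA-"`.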
The familiality of the phenotypes AAO, definite erosion, and RF in individuals affected with RA was assessed in a mixed-effects regression framework, with the phenotype of interest as the dependent variable, using the software packages MIXOR and MIXREG. Intra-class correlation coefficients (ICCs) were estimated; these indicate the proportion of unexplained variance attributable to family membership, i.e., the strength of the familial effect.
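For a random-intercept model, the ICC is the family-level variance divided by the total unexplained variance. The sketch below shows that arithmetic only; it does not reproduce MIXOR or MIXREG output, the function name is hypothetical, and fixing the residual variance at π²/3 for a binary outcome assumes a logistic link.

```python
import math

# Assumption: residual variance of the latent logistic distribution,
# used when the outcome is binary and the model has a logistic link.
LOGISTIC_RESIDUAL_VAR = math.pi ** 2 / 3


def icc(family_var: float, residual_var: float) -> float:
    """Proportion of unexplained variance attributable to family membership."""
    return family_var / (family_var + residual_var)
```

With a continuous phenotype such as AAO, `residual_var` would be the estimated within-family variance; with a binary phenotype such as definite erosion, `LOGISTIC_RESIDUAL_VAR` could be passed instead under the stated assumption.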
Multipoint model-free affected relative-pair (ARP) linkage analysis was performed with the raw phenotypes AAO, definite erosion, RF, HLA, and gender included as covariates (in separate analyses). Sample-specific allele frequencies and pair-wise IBD (identity-by-descent) allele sharing probabilities using information from the full pedigree were estimated by MERLIN at 2 cM intervals. For each chromosome, the IBD estimates from the four samples were combined into a single file. Assuming the maternal and paternal alleles to be inherited independently, the allele sharing probability, p, can be modelled in a logistic regression framework and can be written as logit(p) = O + α + βx, where O is a fixed offset that depends on the relationship between the pair, α is a measure of divergence of IBD from the null in the sample as a whole, and β incorporates covariate x into the model. Because the parameters p, O, and α are based on pairs of individuals, so must be the covariate parameter. When considering a continuous measure, covariates were constructed for the mean and difference for each pair. A binary measure (- or +) was resolved into either -/-, -/+, or +/+ pairs of individuals. For further information on including covariates in the model and constraining the parameters, see Hamshere et al. The IBD estimates and covariate data were then used to estimate the allele sharing probability p, given particular covariates, and then to obtain ARP linkage statistics. Because HLA resides on chromosome 6, no HLA covariate analysis was performed on chromosome 6.
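The model logit(p) = O + α + βx and the pair-level covariates can be illustrated with a short sketch. It only evaluates the model for given parameter values and does not fit it or apply the parameter constraints described by Hamshere et al. The function names are hypothetical, and using the absolute difference for an unordered pair is an assumption.

```python
import math
from typing import Tuple


def sharing_probability(offset: float, alpha: float, beta: float, x: float) -> float:
    """Invert logit(p) = O + alpha + beta * x to obtain p."""
    eta = offset + alpha + beta * x
    return 1.0 / (1.0 + math.exp(-eta))


def continuous_pair_covariates(x1: float, x2: float) -> Tuple[float, float]:
    """Pair-level mean and difference of a continuous measure (e.g. AAO)."""
    # Assumption: the difference is taken as absolute because pairs are unordered.
    return (x1 + x2) / 2, abs(x1 - x2)


def binary_pair_class(a: bool, b: bool) -> str:
    """Resolve two binary values into the pair class -/-, -/+, or +/+."""
    return {0: "-/-", 1: "-/+", 2: "+/+"}[int(a) + int(b)]
```

With α = 0 and β = 0 the sharing probability reduces to the value implied by the offset alone, which is why α measures the departure from the null in the sample as a whole.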
For each chromosome and covariate, two multipoint LOD scores were produced at each 2-cM position: i) the covariate LOD score and ii) a univariate LOD score calculated using only the ARPs included in the covariate analysis, i.e., excluding those with missing covariate data. An increase in the maximum LOD score over the chromosome (i minus ii; ILOD) in excess of 2.0 was taken to indicate a potential covariate effect.
Empirical significance levels for each LOD score peak in the observed data were obtained as follows. We simulated 10,000 replicates of chromosome 22 in the absence of linkage, using the same pedigree structures, marker locations, marker allele frequencies, and missing genotype patterns as the original data. The average number of peaks per chromosome reaching the required height was calculated from these replicates (peaks were defined as local maxima in the LOD score curve separated by at least 30 cM). The number of peaks per genome was approximated by multiplying this average by 60, because the length of chromosome 22 is approximately 1/60 of the total length of the autosomes in this sample. This procedure gives similar results to those obtained by simulating replicates of all 22 chromosomes (data not shown), and is considerably easier computationally.
Correction for the multiple testing of six non-independent genome scans was applied as follows. First, criteria were chosen for each covariate to give the same significance level (i.e., number of peaks expected by chance per genome scan) as the test peak. Then, for each replicate chromosome, the locations and heights of all the peaks from all six covariates were combined into a single list, and the total number of peaks greater than their corresponding criterion was obtained. The distance criterion of 30 cM for defining separate peaks ensured that peaks from several covariates that were close together (i.e., non-independent) were counted only once. The expected number of peaks per genome was calculated as before.
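The peak-counting arithmetic can be sketched as below. This is an illustration under stated assumptions, not the code used for the analysis: the function names are hypothetical, ILOD is read as the difference between the two chromosome-wide maxima, and when two local maxima lie within 30 cM of each other the higher one is kept.

```python
from typing import List, Sequence, Tuple


def ilod(covariate_lods: Sequence[float], univariate_lods: Sequence[float]) -> float:
    """Increase in the chromosome-wide maximum LOD (covariate minus univariate)."""
    return max(covariate_lods) - max(univariate_lods)


def find_peaks(
    positions: Sequence[float],
    lods: Sequence[float],
    min_height: float,
    min_sep: float = 30.0,
) -> List[Tuple[float, float]]:
    """Local maxima >= min_height, at least min_sep cM apart (highest kept first)."""
    n = len(lods)
    candidates = []
    for i in range(n):
        left = lods[i - 1] if i > 0 else float("-inf")
        right = lods[i + 1] if i < n - 1 else float("-inf")
        if lods[i] >= min_height and lods[i] >= left and lods[i] >= right:
            candidates.append((positions[i], lods[i]))
    candidates.sort(key=lambda peak: peak[1], reverse=True)
    kept: List[Tuple[float, float]] = []
    for pos, lod in candidates:
        if all(abs(pos - kept_pos) >= min_sep for kept_pos, _ in kept):
            kept.append((pos, lod))
    return sorted(kept)


def expected_peaks_per_genome(peaks_per_replicate: Sequence[int], scale: float = 60.0) -> float:
    """Mean peak count per simulated chromosome 22 replicate, scaled to the genome."""
    return scale * sum(peaks_per_replicate) / len(peaks_per_replicate)
```

In use, `find_peaks` would be applied to each simulated replicate with the height of the observed peak as `min_height`, and the resulting counts passed to `expected_peaks_per_genome`.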
Following Lander and Kruglyak, we called a peak in the observed data "genome-wide significant" if the expected number of peaks per genome of at least that height, estimated from the simulated data, was ≤ 0.05, and "genome-wide suggestive" if this quantity was < 1.0.
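The two thresholds translate directly into a classification rule. The sketch below is illustrative only; the function name and the label used for peaks meeting neither criterion are assumptions.

```python
def classify_peak(expected_per_genome: float) -> str:
    """Label a peak by the expected number of equally high chance peaks per genome."""
    if expected_per_genome <= 0.05:
        return "genome-wide significant"
    if expected_per_genome < 1.0:
        return "genome-wide suggestive"
    return "neither"  # assumption: label for peaks meeting neither criterion
```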