Details regarding NARAC subject enrollment, phenotype designation, and genotype information are published elsewhere [6]. Using all 404 SNPs available for chromosome 6 and all 104 SNPs available for chromosome 21, we checked for genotype errors using the software CheckErrors [7], which uses graphical modeling to calculate the posterior probability of genotype errors in pedigrees. Because of errors, we eliminated 16 SNPs from chromosome 6 and 1 SNP from chromosome 21. We performed a linkage analysis on each of these cleaned files using the multipoint Markov-chain Monte Carlo (MCMC) linkage method MCLINK [8], which computes the robust multipoint TLOD [9] statistic. For each chromosome, we identified peak regions (TLOD ≥ 2.0), not accounting for LD. To ensure that the peak could be resolved in other analyses, we included ~20 SNPs on either side of the peak and called these data our "Complete" SNP sets. Because 63.7% of the parents in the data set were not genotyped, these linkage results are likely to be biased by the underlying LD.
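The construction of a "Complete" SNP set can be illustrated with a minimal sketch. It assumes a single peak region per chromosome, spanning the first to the last SNP with TLOD ≥ 2.0, and a list of per-SNP TLOD scores ordered by map position; the function name and its parameters are hypothetical and are not part of MCLINK.

```python
def complete_snp_set(tlods, threshold=2.0, flank=20):
    """Return indices of the peak SNPs plus `flank` SNPs on either side.

    tlods: per-SNP TLOD scores ordered by map position (assumed input).
    The peak is taken to span the first through the last SNP whose score
    reaches the threshold, which is a simplifying assumption.
    """
    peak = [i for i, score in enumerate(tlods) if score >= threshold]
    if not peak:
        return []
    start = max(0, min(peak) - flank)                   # do not run off the left end
    end = min(len(tlods) - 1, max(peak) + flank)        # or the right end
    return list(range(start, end + 1))


if __name__ == "__main__":
    scores = [0.1] * 30 + [2.5, 3.0, 2.1] + [0.2] * 30
    selected = complete_snp_set(scores)
    print(selected[0], selected[-1], len(selected))
```

Truncating the window at the chromosome ends keeps the sketch usable when a peak lies fewer than 20 SNPs from the first or last marker.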
The four methods (i.e., the three novel methods and SNPLINK) were each applied to the "Complete" SNP data files. For the method combining graphical modeling of LD with linkage analysis, we used the new McLink software (http://www-genepi.med.utah.edu/~alun/software/), which we will refer to as McLink-LD in this manuscript. For the three LD elimination methods, we used both the original MCLINK [8] and Merlin [10] software packages to perform the linkage analyses. Allele frequencies were estimated from all genotyped individuals at each locus, rather than only the founders. Genotype data were available for only ~15% of the pedigree founders, and because the total number of individuals with genotype data was large (n = 1991 individuals), the allele frequencies estimated from all genotyped individuals provide reasonable estimates of the founder allele frequencies, even without adjustment for familial relationships. The original pedigree structure for all 757 pedigrees was used for all analyses, except for the Merlin analyses, in which 56 large pedigrees were removed because of memory limitations. For the genetic map, we assumed that 1 Mb was equivalent to 1 cM. Although this is a simplistic assumption, when inter-marker distances are relatively short and a dense marker map is used, the assumption has been shown to produce linkage results nearly identical to those from a more detailed genetic map [11]. We performed a parametric analysis using a dominant model, assuming a minor allele frequency of 0.01 and penetrances of 0.008, 0.5, and 0.5 for carriers of none, one, and two disease alleles, respectively. Our parametric model results in an overall prevalence rate of 7.5 cases per 1000 individuals, which is consistent with published prevalence rates for rheumatoid arthritis [12].
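Estimating allele frequencies from all genotyped individuals amounts to simple allele counting at each locus. The following is a minimal sketch under assumed conventions: genotypes are pairs of allele codes, and ungenotyped individuals are represented by None. The function name is hypothetical, and this is not the code used by MCLINK or Merlin.

```python
from collections import Counter


def allele_frequencies(genotypes):
    """Estimate allele frequencies at one locus by counting alleles.

    genotypes: iterable of (allele1, allele2) tuples, or None for an
    individual without genotype data (an assumed representation).
    Relatedness between individuals is ignored, as in the text.
    """
    counts = Counter()
    for genotype in genotypes:
        if genotype is None:
            continue                      # skip ungenotyped individuals
        counts.update(genotype)           # count both alleles
    total = sum(counts.values())
    if total == 0:
        return {}
    return {allele: n / total for allele, n in counts.items()}


if __name__ == "__main__":
    locus = [(1, 1), (1, 2), None, (2, 2), (1, 1)]
    print(allele_frequencies(locus))
```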
McLink-LD
Thomas and Camp [13] and Thomas [14] introduced graphical modeling as an approach to represent allelic association in a tractable way. A graphical model consists of two elements: a Markov graph, whose vertices represent variables and are connected in such a way that, given the states of its neighbors, the state of a variable is conditionally independent of any other variable; and the parameters that specify the conditional dependences. In the case of discrete data, these parameters are given by multinomial distributions on the states of the variables in the maximal cliques of the graph. Thomas [14] developed a two-stage scheme to apply this to data from a random sample of diploids genotyped at multiple loci. In the first stage, an initial graphical model is assumed and, given the observed genotypes, an imputation of the haplotypes is made. In the second stage, given the imputed haplotypes, a new graphical model is estimated. These stages are iterated in a simulated annealing search for an optimal LD model, and are implemented in a program called HapGraph (http://www-genepi.med.utah.edu/~alun/software/).
By applying graphical modeling to the haplotypes of founders in a pedigree, we are able to obtain valid linkage statistics for dense SNP loci without having to discard any data. The new McLink-LD software, made available by Alun Thomas, incorporates the LD model obtained from graphical modeling and computes LOD score statistics using MCMC methods similar to those described by Thomas et al. [8]. The program also models genotype errors using the approach of Thomas and Camp [15], so that checking for apparent Mendelian segregation is unnecessary. Recombination fractions can be estimated on the interval (0, 1) to take advantage of any potential evidence for linkage and to identify possible model misspecification. Full details of the method, including use of the program, are given by Thomas [16].
Although McLink-LD can estimate an LD model using a large subset of unrelated individuals and can also model genotype errors, for consistency with the LD elimination methods we performed LD assessment on 100 unrelated, randomly selected individuals with genotype data from the 757 rheumatoid arthritis families. A sample of 100 individuals has been shown previously to be sufficient for LD modeling to capture the underlying genetic variation [17]. All SNPs in the "Complete" SNP sets were included in the analyses. Genotype errors had previously been eliminated from these data files. For consistency with the other LD methods, the results displayed are over the recombination fraction interval (0, 0.5).
PCA-haplotype method
Principal-component analysis (PCA) has been used for selection of tagging SNPs for candidate gene studies [4, 5]. The advantages of PCA are that it can capture the underlying genetic variation without redundancy, that genetic markers are not required to be contiguous, and that statistical packages (e.g., SAS, SPSS, STATA) are readily available to perform the analyses. Here we apply PCA methodology to larger genomic regions of interest using both haplotype data and genotype data. To apply the method to haplotype data, we used the same 100 unrelated, randomly selected individuals described above as a subset of the total arthritis resource to characterize the LD structure of the regions. We performed pair-wise D' analysis for the 100 independent individuals and all pairs of SNPs in the "Complete" SNP data sets using an in-house modified version of the EMLD [18] software that increases the number of markers that can be studied at one time. All markers with a D' value ≥ 0.7 and within 2 million base pairs of each other were considered for potential removal due to high LD. These high-LD markers were phased using the software SNPHAP [19], and the resulting haplotypes were entered into PCA. Eigenvalue thresholds were set to capture at least 90% of the genetic variation of extracted factors. For each of the resulting LD groups, the SNP with the highest factor loading (required to be ≥ 0.4 in absolute value) was retained, while all other SNPs were eliminated as providing redundant information. Linkage analysis was then performed on the "Complete" SNP set containing all 757 families, but modified to include only SNPs with D' < 0.7 and SNPs selected by the PCA.
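The pair-wise screening step can be illustrated with a minimal sketch that computes D' for two biallelic SNPs and flags pairs with D' ≥ 0.7 lying within 2 million base pairs of each other. The sketch assumes phased haplotypes with alleles coded 1 and 2, whereas EMLD estimates haplotype frequencies from unphased genotypes; the function names and data layout are hypothetical.

```python
def d_prime(pairs):
    """D' for two biallelic SNPs from phased haplotypes.

    pairs: list of (allele_at_snp1, allele_at_snp2), alleles coded 1 or 2.
    """
    n = len(pairs)
    p_a = sum(1 for a, _ in pairs if a == 1) / n             # freq of allele 1 at SNP 1
    p_b = sum(1 for _, b in pairs if b == 1) / n             # freq of allele 1 at SNP 2
    p_ab = sum(1 for a, b in pairs if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b                                     # raw disequilibrium D
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    return abs(d) / d_max if d_max > 0 else 0.0              # 0.0 for monomorphic SNPs


def high_ld_pairs(haplotypes, positions, threshold=0.7, max_bp=2_000_000):
    """Return index pairs (i, j) with D' >= threshold and distance <= max_bp.

    haplotypes: list of tuples, one allele per SNP; positions: base pairs.
    """
    flagged = []
    m = len(positions)
    for i in range(m):
        for j in range(i + 1, m):
            if abs(positions[j] - positions[i]) > max_bp:
                continue                                     # too far apart to consider
            if d_prime([(h[i], h[j]) for h in haplotypes]) >= threshold:
                flagged.append((i, j))
    return flagged


if __name__ == "__main__":
    haps = [(1, 1, 1), (1, 1, 2), (2, 2, 1), (2, 2, 2)]
    print(high_ld_pairs(haps, [0, 1000, 3_000_000]))
```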
PCA-genotype method
For the PCA-Genotype method, we also used the same 100 unrelated, randomly selected individuals described above to characterize LD. For each SNP, we recoded all of the genotype data from the "Complete" SNP data sets as -1, 0, or 1 [20], corresponding to the homozygous wild-type genotype (1, 1), the heterozygous genotype [(1, 2) or (2, 1)], or the homozygous rare genotype (2, 2), respectively. All of the recoded genotype data were then entered into the two-step PCA method proposed by Horne and Camp [17]. The first PCA step was performed as described above for the PCA-Haplotype method. The second PCA step was used to select, among multiple markers in an LD group with a factor loading ≥ 0.4 in absolute value, the marker(s) that best capture(s) the underlying genetic variation. Because more markers are typically included in each LD group in the PCA-Genotype method than in the PCA-Haplotype method, the two-step rather than the single-step PCA methodology was used here. As with the PCA-Haplotype method, we again modified the "Complete" SNP sets for all 757 families to include only markers that were retained from the PCA.
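The recoding and the first PCA step can be sketched as follows. This is a minimal illustration rather than the Horne and Camp [17] implementation: it assumes NumPy is available, complete genotype data, polymorphic SNPs, and unrotated principal-component loadings computed from the correlation matrix; the second PCA step is not shown. The function names and the default thresholds (90% of variance, loading of 0.4 in absolute value) are chosen here only to mirror the text.

```python
import numpy as np


def recode(genotype):
    """Map (1, 1) -> -1, (1, 2) or (2, 1) -> 0, and (2, 2) -> 1."""
    a, b = genotype
    return a + b - 3


def select_snps(genotypes, var_threshold=0.90, min_loading=0.4):
    """Return sorted indices of SNPs retained by a single PCA step.

    genotypes: one list per individual, holding an (a1, a2) tuple per SNP.
    """
    x = np.array([[recode(g) for g in person] for person in genotypes], dtype=float)
    r = np.corrcoef(x, rowvar=False)                 # SNP-by-SNP correlation matrix
    eigvals, eigvecs = np.linalg.eigh(r)             # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]                # reorder to descending
    eigvals = np.clip(eigvals[order], 0.0, None)     # guard against tiny negatives
    eigvecs = eigvecs[:, order]
    cumulative = np.cumsum(eigvals) / eigvals.sum()
    k = min(int(np.searchsorted(cumulative, var_threshold)) + 1, len(eigvals))
    loadings = eigvecs[:, :k] * np.sqrt(eigvals[:k]) # unrotated component loadings
    keep = set()
    for c in range(k):
        best = int(np.argmax(np.abs(loadings[:, c])))
        if abs(loadings[best, c]) >= min_loading:
            keep.add(best)                           # one representative per component
    return sorted(keep)


if __name__ == "__main__":
    code = {-1: (1, 1), 0: (1, 2), 1: (2, 2)}
    snp_a = [-1, -1, 1, 1, 0, 0]                     # SNPs 0 and 1 are identical
    snp_c = [-1, 1, -1, 1, 0, 0]                     # SNP 2 is uncorrelated with them
    data = [[code[a], code[a], code[c]] for a, c in zip(snp_a, snp_c)]
    print(select_snps(data))
```

In the toy data, the two identical SNPs load on the same component, so only one of them is retained alongside the uncorrelated third SNP.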
SNPLINK
SNPLINK [3] is a freely available Perl script that removes markers in high LD by computing either D' or r² between consecutive marker pairs for all individuals in a data set, ignoring relationships. Only one marker from each high-LD pair of SNPs, based on a high-LD threshold defined by the user, is retained. We defined high LD to be D' ≥ 0.7, consistent with our PCA-Haplotype method. We successfully analyzed the chromosome 6 "Complete" SNP data using SNPLINK, but found that SNPLINK halted unexpectedly when running the chromosome 21 data. Therefore, following the SNPLINK protocol, we manually identified which SNPs to eliminate because of high LD on both chromosomes 6 and 21. Our results for chromosome 6 compared well with the SNPLINK output. Four markers differed between the two marker lists; in these cases, when choosing between two equivalent markers in a high-LD pair, we retained the marker opposite to the one selected by SNPLINK. Hence, we feel confident that our analysis of the chromosome 21 data set would be similar to the output of SNPLINK had we been able to obtain it.
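The manual elimination can be illustrated with a minimal sketch of consecutive-pair pruning. This is not SNPLINK's code: it assumes a greedy left-to-right pass in which each marker is compared with the most recently retained marker, and it always keeps the first marker of a high-LD pair, a choice that is arbitrary when the two markers are equivalent. The function name and the `ld` callable are hypothetical.

```python
def prune_consecutive(n_markers, ld, threshold=0.7):
    """Return indices of markers retained after consecutive-pair pruning.

    ld(i, j): callable returning D' (or r^2) between markers i and j
    (an assumed interface). Markers are indexed in map order.
    """
    kept = []
    for i in range(n_markers):
        if kept and ld(kept[-1], i) >= threshold:
            continue                      # marker i is in high LD with the last kept marker
        kept.append(i)
    return kept


if __name__ == "__main__":
    table = {(0, 1): 0.9, (0, 2): 0.1, (2, 3): 0.8, (2, 4): 0.3}
    print(prune_consecutive(5, lambda i, j: table[(i, j)]))
```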