Comparison of genome-wide single-nucleotide polymorphism linkage analyses in Caucasian and Hispanic NARAC families.

We performed linkage analysis on families with rheumatoid arthritis, stratifying by ethnic origin. We compared results using either Kong and Cox nonparametric LOD scores or MOD score analysis using the software GeneHunter MODSCORE. We first applied SNPLINK to remove markers showing excess linkage disequilibrium from the SNPs in the Illumina IV SNP Linkage panel. In this analysis there were 659 self-reported Caucasian families and 29 self-reported Hispanic families in the NARAC collection. Chromosome 19 yielded MOD scores > 3.00 in the Hispanic group, while chromosomes 2, 6, 7, 11, and XY had MOD scores > 3.00 in the Caucasian group. We performed simulation studies to evaluate the empirical distribution of the MOD score for autosomal loci separately in Hispanics and Caucasians. Results showed genome-wide significant evidence for linkage in Caucasians for chromosomes 2q and 6p, but no significant evidence for any linkage in Hispanics, including little evidence for linkage to chromosome 6p in this group. A comparison of phenotypes between the two ethnic groups suggested a significantly earlier mean age of onset, a higher percentage of anti-cyclic citrullinated peptide-positive individuals, and a lower percentage of affected individuals carrying shared epitopes in Hispanics than in Caucasians. A larger sample of Hispanic families is needed to identify linkage regions.


Background
Rheumatoid arthritis (RA) is a complex genetic disease with possible genetic heterogeneity among different ethnic groups [1]. There is a lack of information concerning genetic risk factors for RA in Hispanic populations, so we sought to characterize, in the available sample, both the clinical features and the genetic profiles that influence disease risk. We compared phenotypes between Caucasian and Hispanic families, including age of onset, rheumatoid factor (RF)-IgM and anti-CCP (anti-cyclic citrullinated peptide) levels, and the percentage of affected individuals carrying shared epitopes.
Previously we performed a genome-wide single-nucleotide polymorphism (SNP) analysis of NARAC (North American Rheumatoid Arthritis Consortium) Caucasian families and identified two new loci at 2q33 and 11p12, in addition to confirming evidence for linkage in the HLA region (Kong and Cox LOD score of 16.14) [2].
Here we applied standard linkage analysis methods and the MOD score approach to this same set of Caucasian families and to a set of Hispanic families to compare evidence of linkage to RA in the two groups. In addition, the MOD score method provides estimates of the penetrance for putative disease-susceptibility loci, while conditioning on the disease status, thus adjusting for ascertainment of the families. However, the distribution of the MOD score test statistic is complex, and we therefore performed extensive simulations to obtain empirical p-values.
Evidence of linkage on chromosomes 2 and 6 was confirmed by MOD score analysis for the Caucasian group, whereas the weak evidence of linkage to chromosome 6 in the Hispanic group was not significant using empirical p-values. Significant differences in phenotype between the two groups were found in age of onset, proportion of anti-CCP-positive individuals, and percentage of affected individuals carrying the shared epitopes.

Methods
An r² value of 0.05, where r² is a measure of linkage disequilibrium (LD), was used as the cut-off for removing markers in LD using SNPLINK [3,4]. The data sets for both Caucasians and Hispanics were analyzed by SNPLINK with Merlin [3,5] and by GeneHunter using MODSCORE [6].
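The idea of thinning markers by an r² cut-off can be illustrated with a minimal sketch. This is not SNPLINK's algorithm, which is not described here. It assumes genotypes coded as 0/1/2 allele counts, approximates r² by the squared Pearson correlation of those counts, and uses a simple greedy rule; the function names `r_squared` and `prune_by_ld` and the toy genotypes are hypothetical.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two genotype vectors (0/1/2 coding)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a monomorphic marker carries no LD information
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return sxy * sxy / (sxx * syy)


def prune_by_ld(genotypes, threshold=0.05):
    """Greedy pruning (an assumed rule): keep a marker only if its r^2 with
    every marker already kept is at or below the threshold."""
    kept = []
    for idx, marker in enumerate(genotypes):
        if all(r_squared(marker, genotypes[j]) <= threshold for j in kept):
            kept.append(idx)
    return kept


# Toy data: marker 1 duplicates marker 0, marker 2 is uncorrelated with it.
snp_a = [0, 1, 2, 0, 1, 2]
snp_b = [0, 1, 2, 0, 1, 2]
snp_c = [1, 0, 1, 1, 0, 1]
kept_markers = prune_by_ld([snp_a, snp_b, snp_c], threshold=0.05)
print(kept_markers)
```

In practice LD is usually evaluated only between nearby markers; the all-pairs comparison above is kept only for brevity.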
To assess significance in the Hispanic sample, we performed simulations of the families for all chromosomes (excluding the X chromosome) using 10,000 replicate samples with the computer program Allegro [7]. To derive a genome-wide estimate of the maximum MOD score, we selected from each replicate the maximum MOD score across all autosomal loci. Owing to the computational intensity of the simulations in the much larger Caucasian sample, we could complete only 1000 replicates. To obtain more precise estimates of the empirical MOD score distribution in this group, we derived 100,000 bootstrap samples from the simulated results for Caucasians. To perform the bootstrap, for each chromosome we randomly selected a maximum MOD score from among the 1000 replicated results, and then took the largest of these values across all 22 chromosomes as the maximum genomic MOD score for that bootstrap sample. This process was repeated 100,000 times to obtain the empirical MOD score distribution.
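The bootstrap described above can be sketched as follows. This is one reading of the procedure, not the code used in the study: it assumes that for each chromosome a vector of per-replicate maximum MOD scores is available, and it draws one value per chromosome independently. The function name `bootstrap_genomewide_max` is hypothetical, and the toy scores are placeholder exponential draws, not real MOD scores.

```python
import random


def bootstrap_genomewide_max(chrom_max, n_boot, seed=0):
    """chrom_max[c] holds the simulated per-replicate maximum scores for
    chromosome c. Each bootstrap sample draws one score per chromosome at
    random and keeps the largest as the genome-wide maximum."""
    rng = random.Random(seed)
    genome_max = []
    for _ in range(n_boot):
        draws = [rng.choice(scores) for scores in chrom_max]
        genome_max.append(max(draws))
    return genome_max


# Toy input: 22 chromosomes x 1000 replicates of synthetic scores.
toy_rng = random.Random(1)
toy = [[toy_rng.expovariate(1.5) for _ in range(1000)] for _ in range(22)]
boot = bootstrap_genomewide_max(toy, n_boot=2000)
print(len(boot), max(boot))
```

Resampling chromosomes independently treats the chromosome-specific simulations as exchangeable across replicates, which is reasonable when chromosomes are simulated independently under the null hypothesis of no linkage.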
Distributions of phenotypes, including age of onset, anti-CCP, RF-IgM and shared epitopes, were also compared between the two groups. The significance of the differences was formally assessed using a t-test, binomial test, or survival modeling. Segregation analysis cannot be done because of the complex ascertainment used for selecting these families: we know who the primary proband is but not who the obligatory additional proband is. Therefore, we have chosen to apply a MOD score approach, which conditions on the disease status in the family and subsequently performs a segregation analysis to estimate parameters describing penetrance.

Results
The results presented in Table 1 show maximal Kong and Cox (KC) LOD scores and MOD scores on each chromosome separately for Caucasians and for Hispanics. For Caucasians, maximal KC LOD scores exceeding 3.00 are found on chromosomes 2, 6, 11, and the pseudoautosomal region of XY (we did not study this region further because results from Amos et al. [2] suggest the XY region reflects a false-positive signal). MOD score analysis indicated identical positions and slightly higher scores for each of these chromosomes. In Hispanics, the two analyses gave contrasting results: no KC LOD score exceeded 1.50, but 9 chromosomes yielded MOD scores higher than 1.50. Of note, chromosome 6 showed only weak evidence for linkage in Hispanics using either KC LOD score or MOD score methods. MOD score analysis suggested evidence for linkage on chromosome 19, which was not supported by the KC LOD score analysis, suggesting a possible false-positive result. Best-fitting models from MOD score analyses corresponding to maximal MOD scores of 1.50 or greater are provided in Table 2 for Caucasians and Hispanics, even though they often lead to excess predicted prevalence of the disease.
To better characterize the results that we obtained from MOD score analysis suggesting linkage on chromosome 19 in Hispanics, we performed a genome-wide simulation study with 10,000 replicates. Figure 1 presents a distribution of maximum MOD scores among all 22 autosomal chromosomes of each of the 10,000 replicates from genome-wide simulation data. A max MOD score of 3.03 corresponds to a genome-wide p-value of 0.42, suggesting that evidence on chromosome 19 was a false-positive finding for Hispanics.
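A genome-wide empirical p-value of this kind is the fraction of simulated replicates whose genome-wide maximum MOD score reaches or exceeds the observed score. The sketch below assumes the simple proportion estimator; the text does not state whether a corrected estimator such as (count + 1) / (n + 1) was used. The function name `empirical_p_value` and the toy values are hypothetical.

```python
def empirical_p_value(observed, simulated_max):
    """Fraction of simulated genome-wide maxima that are at least as large
    as the observed score."""
    exceed = sum(1 for score in simulated_max if score >= observed)
    return exceed / len(simulated_max)


# Toy illustration with four made-up simulated maxima.
toy_simulated = [1.0, 2.0, 3.0, 4.0]
p_toy = empirical_p_value(3.0, toy_simulated)
print(p_toy)  # 0.5
```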
We also performed a similar simulation study for Caucasians, using only 1000 replicates because of the computational burden. Because the sample size for the Caucasian group is more than 20 times larger than that of the Hispanic group, more than 20 days of CPU time per chromosome would be needed, so a genome-wide study of 10,000 replicates is computationally prohibitive. The results are summarized in Figure 2. Maximum MOD scores of 16.87, 4.10, 3.70, and 3.22 on chromosomes 6, 2, 11, and 7 correspond to genome-wide p-values of ~0.0, 0.08, 0.17, and 0.42, respectively. The empirically derived MOD scores corresponding to a p-value of 0.05 differed somewhat between Hispanics and Caucasians: the empirical MOD score for 5% significance is 4.39 for Caucasians and 4.08 for Hispanics.
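An empirical critical value for a given significance level can be read off the simulated distribution of genome-wide maxima as an upper quantile. The sketch below assumes a plain order-statistic definition with no interpolation, which may differ from the rule used to obtain 4.39 and 4.08; the function name `empirical_threshold` and the toy scores are hypothetical.

```python
import math


def empirical_threshold(simulated_max, alpha=0.05):
    """Smallest simulated score t such that at most a fraction alpha of the
    simulated genome-wide maxima are strictly greater than t."""
    ordered = sorted(simulated_max)
    k = math.ceil((1 - alpha) * len(ordered)) - 1
    return ordered[max(k, 0)]


# Toy illustration: eight made-up maxima and a 25% level.
toy_scores = [3.0, 1.0, 8.0, 6.0, 2.0, 7.0, 5.0, 4.0]
cutoff = empirical_threshold(toy_scores, alpha=0.25)
print(cutoff)  # 6.0
```

With this definition, an observed score larger than the cut-off has an empirical p-value no greater than alpha.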
Comparisons of the distributions of phenotypes, including age of onset, anti-CCP, RF-IgM, and shared epitopes, between the two groups are presented in Table 3. The Hispanic group was found to have an earlier age of onset (mean 34.48 vs. 39.35), higher anti-CCP values (mean 120.20 vs. 107.51), higher RF-IgM values (mean 319.10 vs. 255.06), and a higher percentage of anti-CCP-positive individuals (anti-CCP ≥ 20, 91.53% vs. 76.45%, results not shown), yet a smaller percentage of affected individuals carried the shared epitopes (71.88% vs. 84.55%). The difference in age of onset was assessed by survival analysis with a robust variance correction to account for familial correlation of the ages. This analysis suggested significantly different hazards between Hispanics and Caucasians (p-value = 0.02 in Cox regression analysis after correction). The means of anti-CCP and RF-IgM in the two groups were compared using a t-test, and the differences were not significant (p-value = 0.18 and p-value = 0.20 for anti-CCP and RF-IgM, respectively). However, the difference in the proportion of anti-CCP-positive individuals between Hispanics and Caucasians is highly significant using the binomial test (two-sided exact p-value = 0.0032). The difference in the percentage of affected individuals carrying shared epitopes is also significant (two-sided exact p-value = 0.01 using the binomial test).
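A two-sided exact binomial test of the kind mentioned above can be sketched as follows. The sketch assumes a one-sample test in which the Caucasian proportion is treated as a fixed null value and the two-sided p-value sums the probabilities of all outcomes no more likely than the observed one; the text does not say which variant was used. The underlying counts are not reported, so the counts 54 of 59 are purely hypothetical, chosen only because 54/59 rounds to 91.53%. The function names are also hypothetical.

```python
from math import comb


def binom_pmf(k, n, p):
    """Probability of exactly k successes in n Bernoulli(p) trials."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


def exact_binomial_two_sided(k, n, p0):
    """Two-sided exact p-value: total probability, under p0, of outcomes
    that are no more likely than the observed count k."""
    observed = binom_pmf(k, n, p0)
    tolerance = 1 + 1e-7  # guards against floating-point ties
    total = sum(
        pm
        for pm in (binom_pmf(i, n, p0) for i in range(n + 1))
        if pm <= observed * tolerance
    )
    return min(1.0, total)


# Hypothetical counts (not from the paper): 54 positive out of 59,
# tested against the Caucasian proportion of 0.7645 as the null value.
p_value = exact_binomial_two_sided(54, 59, 0.7645)
print(p_value)
```

Because the counts are assumed, the printed value should not be expected to reproduce the reported p-value of 0.0032.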
The mean number of affected siblings in the families is about the same in both groups (2.10 and 2.13 for Hispanics and Caucasians, respectively). The proportion with parents available might affect MOD score calculations, but is actually higher in Hispanics (65.52%) than in Caucasians (41.09%). Allele frequencies for the SNPs giving high Kong and Cox LOD scores or MOD scores on chromosomes 6 and 19 did not reveal a significant difference between the two ethnic groups (data not shown). There were no detectable genotyping errors in these ethnic groups.
Figure 1. Distribution of genome-wide maximum simulated MOD scores in each of 10,000 replicates for the Hispanics. In the genome-wide simulated data for all 22 autosomes, the maximum MOD score of 3.03 from the real Hispanic data on chromosome 19 (indicated by the arrow) corresponds to a p-value of 0.42.

Discussion
The results may reflect underlying genetic variations between Caucasian and Hispanic groups that could be useful for diagnosis and treatment of disease. The sample size of Hispanics available for study strongly limits our ability to generalize our findings. Using the Caucasian sample as the standard, the expected LOD score in Hispanics from 29 families is 0.71, so the evidence for linkage to chromosome 6 is comparable to its expectation. However, we note that chromosomes other than 6 yielded higher LOD scores in the Hispanics, suggesting that non-HLA-region genes may play a stronger role in this population than in Caucasian populations. The weaker linkage to HLA and the lower percentage of individuals carrying shared epitopes in Hispanics are of interest because associations with the shared epitopes in that group are quite weak and tend to be lower in general [8]. There may also be genetic heterogeneity between the Hispanic and Caucasian groups. It is important to note that the regions identified by KC LOD score and MOD score were consistent in Caucasian families but not in the Hispanic family data. This difference could also be due to the small number of families in the Hispanic group. The sample size may be important when performing MOD score analyses because MOD scores are optimized over several parameters.
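The expected LOD score of 0.71 quoted in the Discussion is consistent with a simple proportional scaling of the Caucasian result. The sketch below assumes that the expected LOD score grows linearly with the number of families and that the reference value is the Kong and Cox LOD score of 16.14 reported for the HLA region in 659 Caucasian families; the text does not spell out the calculation, and the function name `expected_lod` is hypothetical.

```python
def expected_lod(reference_lod, reference_families, target_families):
    """Scale a reference LOD score by the ratio of family counts
    (assumes a comparable average contribution per family)."""
    return reference_lod * target_families / reference_families


lod_hispanic = expected_lod(16.14, reference_families=659, target_families=29)
print(round(lod_hispanic, 2))  # 0.71
```

The scaling assumes similar family structures and the same underlying genetic effect in both groups, which is the premise being examined in the Discussion.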
Figure 2. Distribution of genome-wide simulated MOD scores in 100,000 bootstraps on 1000 replicates for the Caucasians. Because of the limit on the number of points that could be plotted, only every tenth genome-wide maximum simulated MOD score from the 100,000 bootstraps is shown. Maximum MOD scores of 16.87, 4.10, 3.70, and 3.22 from the real Caucasian data on chromosomes 6, 2, 11, and 7 correspond to genome-wide p-values of ~0.0, 0.08, 0.17, and 0.42, respectively, as indicated by arrows.