Modeling of PTPN22 and HLA-DRB1 susceptibility to rheumatoid arthritis.

In the present paper, we used the North American Rheumatoid Arthritis Consortium data provided for Genetic Analysis Workshop 15 Problem 2 to 1) estimate the penetrances of PTPN22 and HLA-DRB1 and 2) test the selected model of PTPN22 conditional on the rheumatoid factor status. To achieve these aims, we used the marker association segregation chi-square method, simultaneously fitting both genotype frequency and identical-by-descent distributions in a sample of 3690 White individuals from 604 nuclear families. A co-dominant model fitted the rs2476601 (R620W) single-nucleotide polymorphism (SNP) of the PTPN22 gene well, whereas a lack of fit for all models was observed for the HLA-DRB1 locus. Testing genetic models of rheumatoid arthritis that include the PTPN22 SNP in addition to the HLA-DRB1 locus did not affect the results, nor did subgroup analysis of PTPN22 conditional on the rheumatoid factor status. In conclusion, the PTPN22 R620W SNP is a risk factor for rheumatoid arthritis. The genetic architecture of the HLA-DRB1 locus is highly complex, and more elaborate modeling of this locus is required.


Background
The HLA locus, and in particular several alleles of HLA-DRB1, has been associated with rheumatoid arthritis (RA) and other autoimmune disorders [1,2]. The specific alleles implicated vary according to population. A commonly reported set of alleles is DRB1*0101, 0102, 0401, 0404, 0405, 0408, and 1001 [2]. However, these alleles are either associated with severe forms of RA, or are weakly and sometimes inconsistently associated with RA. Moreover, variability in disease expression is observed in individuals with the same HLA background, ranging from some being severely affected to others being unaffected [2]. Therefore, HLA-DRB1 alone does not explain the genetic susceptibility to common forms of RA. In addition to the HLA locus, other regions have been identified by genome scans [1,3,4] and/or by candidate gene approaches [1,5]. Among the non-HLA loci, PTPN22, a gene encoding protein tyrosine phosphatase nonreceptor 22, is considered a strong candidate [1]. Phosphatases are known to play a role in immune-cell homeostasis. Recently, a functional single-nucleotide polymorphism (SNP) in the PTPN22 gene (R620W allele in rs2476601) was reported to be associated with RA [4,5]. However, neither HLA-DRB1 nor PTPN22 rs2476601 individually fully explains the genetic contribution to RA, nor are they necessary or sufficient for RA to be present in a given individual. Moreover, the association with RA may be specific to rheumatoid factor-positive RA patients, as recently reported for another PTPN22 variant [6].
Because there is strong evidence implicating these two genes in RA, and because the effect of PTPN22 variants may be restricted to rheumatoid factor-positive patients, the purpose of our study is twofold: first, to estimate the genetic model (including penetrances) associated with these two genes in White nuclear families from the North American Rheumatoid Arthritis Consortium (NARAC) data; and second, to model PTPN22 susceptibility conditional on the rheumatoid factor status in RA patients. In the present paper, we report that a co-dominant model best fits the PTPN22 data and that stratification based on rheumatoid factor status did not modify the results. No model fitted the HLA-DRB1 locus.

Data
We used the NARAC data of Problem 2 of the Genetic Analysis Workshop 15 (GAW15). For the present paper, only the data from individuals labeled as "Caucasian" and those with "unknown" ethnicity were considered (n = 3690). We pooled these two ethnic categories because the majority of the NARAC sample was White and the allele frequencies for the four HLA-DRB1 microsatellites (D6S265, TNFA, D6S273, and D6S1629), as well as the rs2476601 SNP for PTPN22, were not statistically different in the "unknown" ethnicity group compared to the group labeled as Caucasian. We selected a sib-pair design, and only one sib-pair per nuclear family was included in the analysis. We could not use the full sib-pair data available with the selected analytic approach because the various sib-pairs within a sibship are not independent. In families with more than two affected sibs, the sibs with the most complete data and the closest in age were kept for analysis. The remaining affected sibs, half sibs, and unaffected sibs were excluded from the analysis. In the context of multigenerational pedigrees, each generation was split into nuclear families. Parental data were used to compute the penetrances. Using the data provided through GAW15, we defined as the index case the sib from each pair who had the most complete data.
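The sib-pair selection rule described above can be sketched in a few lines. This is only an illustration under stated assumptions: the record keys (`id`, `age`, `completeness`) are hypothetical, and the way the two criteria are combined (completeness first, age gap as tie-breaker) is our reading of the text, not a documented procedure.

```python
from itertools import combinations


def pick_sib_pair(affected_sibs):
    """Pick one affected sib pair from a nuclear family.

    affected_sibs: list of dicts with hypothetical keys
      'id', 'age', and 'completeness' (fraction of non-missing data).
    Assumed rule: prefer the pair with the most complete data,
    then break ties by the smallest age difference.
    """
    if len(affected_sibs) < 2:
        return None  # no pair can be formed

    def rank(pair):
        a, b = pair
        total_completeness = a["completeness"] + b["completeness"]
        age_gap = abs(a["age"] - b["age"])
        # min() picks the highest completeness, then the smallest gap
        return (-total_completeness, age_gap)

    best = min(combinations(affected_sibs, 2), key=rank)
    return tuple(sorted(s["id"] for s in best))


# Made-up family for illustration only (not NARAC data).
family = [
    {"id": 1, "age": 40, "completeness": 1.0},
    {"id": 2, "age": 45, "completeness": 1.0},
    {"id": 3, "age": 41, "completeness": 1.0},
]
chosen = pick_sib_pair(family)
```

With equal completeness for all three sibs, the pair with the smallest age gap (sibs 1 and 3) is kept.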

HLA-DRB1 allele classification and PTPN22 rs2476601 allele labeling
Because the HLA-DRB1 alleles reported to be associated with RA share an RAA motif at position 71-74, our classification of the HLA-DRB1 alleles is based on their amino acid sequence at that position [7]. When allele sub-types were not available, we assigned the alleles according to their frequencies (e.g., 01R alleles were considered as *0101; individuals with HLA-DRB1*14 alleles were randomly assigned, half to *1401 and half to *1402). We assigned HLA-DRB1*0101, *0405, *0408, *1001, *1402, and *16 to E1; *0401 and *0409 to E2; *0102, *0404, *423, *12, and *1406 to E3; and all other alleles to Ex. The alleles classified as Ex were considered the non-susceptibility alleles. For PTPN22, the susceptibility allele of the R620W missense SNP (rs2476601) corresponds to the minor allele T, whereas the common allele is C.
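The allele grouping above amounts to a lookup table. The sketch below is a hypothetical helper, not the code used in the study; the group membership is copied from the text as written, and the prefix rule for two-digit entries (e.g., treating every *16 subtype as E1) is an assumption.

```python
# Group membership copied from the text; allele codes are kept as written there.
GROUPS = {
    "E1": ["0101", "0405", "0408", "1001", "1402", "16"],
    "E2": ["0401", "0409"],
    "E3": ["0102", "0404", "423", "12", "1406"],
}


def classify_drb1(allele):
    """Return 'E1', 'E2', 'E3', or 'Ex' for a label such as '*0401'."""
    code = allele.upper().replace("HLA-DRB1", "").replace("*", "").strip()
    # Pass 1: exact match on the allele code.
    for group, members in GROUPS.items():
        if code in members:
            return group
    # Pass 2 (assumption): two-digit entries cover all of their subtypes.
    for group, members in GROUPS.items():
        for member in members:
            if len(member) == 2 and code.startswith(member):
                return group
    return "Ex"  # everything else is a non-susceptibility allele


def classify_genotype(allele_1, allele_2):
    """Return the unordered pair of group labels for a genotype."""
    return tuple(sorted((classify_drb1(allele_1), classify_drb1(allele_2))))
```

For example, `classify_genotype("*0301", "*0401")` groups the genotype as one Ex allele and one E2 allele.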

Statistical analyses
Genetic models for HLA-DRB1 and PTPN22 genes in RA were tested using the marker association segregation chi-square (MASC) method [8]. Details of this approach have been described elsewhere [7,8]. Briefly, the MASC method is based on the idea of minimizing a sum of independent chi-squares and testing the goodness-of-fit of various models [8]. This approach estimates penetrances using information on both the association and the segregation of the marker with the disease. Thus, MASC uses the allelic association information from the genotype distribution among unrelated index cases, as well as the linkage information, based on the proportion of siblings sharing 2, 1, or 0 alleles identical by descent (IBD), from each index case and its affected sib. To deal with potential IBD estimation uncertainty, the probability of sharing 2, 1, or 0 alleles IBD was computed for flanking SNPs using MERLIN. Based on an IBD estimation probability of 80% or more for individual SNPs, no ambiguity in IBD sharing information was detected. Because the MASC approach conditions on IBD status, it cannot accommodate analyses of large sibships due to the non-independence of the sib-pairs within sibships. Nuclear families with the following three configurations were included: affected sibs and unaffected parents, affected sibs and one affected parent, and affected sibs and affected parents. These corresponded to MASC labels of C2, C4, and C6, respectively. The data were then stratified according to these distributions (i.e., family, genotype, and IBD distributions) using the information on both the index cases and their relatives. The expected distributions were then computed for each proposed model. These distributions were used to estimate the relative penetrance of each genotype, which is the ratio of the penetrance for a given genotype to the penetrance for the referent (i.e., the higher-risk genotype).
These estimations require knowledge of the genotype frequencies in the general population. These were estimated using the affected family-based controls (AFBAC) method [9]. This approach uses the parental alleles not transmitted to the children, assuming Hardy-Weinberg equilibrium. In the context of multiplex ascertainment, such as in the current study, the average of the parental alleles transmitted to both sibs is compared to the AFBAC population of parental alleles never transmitted to the affected sib-pair. In the framework of MASC, the genetic model is good (i.e., explains the observed association and linkage data) when the expected and the observed distributions do not differ significantly. Therefore, a p-value > 0.05 corresponds to acceptance of the model, whereas a p-value < 0.05 implies that other factors not modeled are involved in the disease expression. We first fitted the co-dominant model, i.e., the most general model. To test whether the penetrances of the different genotypes differ significantly, pairwise comparisons were performed with the maximum likelihood ratio test. Since the co-dominant model fitted well, we then tested whether the penetrances for the heterozygotes and the homozygotes were equal (dominant vs. co-dominant, recessive vs. co-dominant). Confidence intervals for all estimates were computed using a bootstrap procedure.
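To make the "sum of independent chi-squares" idea concrete, the following sketch computes a Pearson chi-square for each distribution (for instance genotype and IBD sharing) and adds them. It illustrates only the goodness-of-fit criterion under simplifying assumptions; it is not the MASC software, the function names are hypothetical, and the counts are made up. A full analysis would minimize this quantity over the model parameters.

```python
def pearson_chi2(observed_counts, expected_props):
    """Pearson chi-square of observed counts against expected proportions."""
    n = sum(observed_counts)
    return sum(
        (obs - n * p) ** 2 / (n * p)
        for obs, p in zip(observed_counts, expected_props)
    )


def summed_chi2(distributions):
    """Sum chi-squares over distributions assumed to be independent.

    distributions: list of (observed_counts, expected_proportions) pairs,
    where the expected proportions come from the genetic model under test.
    """
    return sum(pearson_chi2(obs, exp) for obs, exp in distributions)


# Made-up counts for illustration only (not NARAC data).
ibd_sharing = ([30, 50, 20], [0.25, 0.50, 0.25])   # pairs sharing 2, 1, 0 alleles IBD
genotypes = ([10, 40, 50], [0.10, 0.40, 0.50])     # index-case genotype classes
total = summed_chi2([ibd_sharing, genotypes])
```

A small total indicates that the distributions expected under the model are close to the observed ones; the associated degrees of freedom depend on the number of categories and of estimated parameters, which this sketch does not track.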
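The step from control allele counts to population genotype frequencies can be sketched as follows. The sketch assumes the never-transmitted parental alleles have already been identified (that step is not shown), and the function names are hypothetical. Genotype frequencies follow from Hardy-Weinberg equilibrium: p² for homozygotes and 2pq for heterozygotes.

```python
from collections import Counter
from itertools import combinations_with_replacement


def allele_frequencies(untransmitted_alleles):
    """Allele frequencies among parental alleles never transmitted."""
    counts = Counter(untransmitted_alleles)
    total = sum(counts.values())
    return {allele: count / total for allele, count in counts.items()}


def hwe_genotype_frequencies(freqs):
    """Genotype frequencies under Hardy-Weinberg equilibrium."""
    result = {}
    for a, b in combinations_with_replacement(sorted(freqs), 2):
        # Homozygote: p^2; heterozygote: 2pq.
        result[(a, b)] = freqs[a] ** 2 if a == b else 2 * freqs[a] * freqs[b]
    return result


# Made-up list of never-transmitted alleles (not NARAC data).
toy_freqs = allele_frequencies(["C", "C", "T", "C"])

# Genotype frequencies for the T-allele frequency of 5% reported in this paper.
ptpn22 = hwe_genotype_frequencies({"C": 0.95, "T": 0.05})
```

With a T-allele frequency of 5%, the expected population genotype frequencies are approximately 90.25% C/C, 9.5% C/T, and 0.25% T/T.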

Lack of fit of IBD data for HLA-DRB1 (data not shown)
The estimated general population allele frequencies are 15%, 6%, 5%, and 74% for E1, E2, E3, and Ex, respectively. These frequencies are similar to those of the population recruited in France through the European Consortium on Rheumatoid Arthritis [7]. However, the allele classification used in our study was not exactly like the new classification validated by the European Consortium study because of the lack of sub-typing for some alleles, in particular DRB1*11 and DRB1*13. In our study, all the expected distributions related to HLA-DRB1 differed significantly from the observed distributions (p = 0.002). In particular, in C4 families (i.e., the sibs and one parent are affected), 79% of Ex/Ex probands share one HLA-DRB1 allele IBD with their affected sibling, compared to the expected proportion of 48%. Thus, this model does not explain all the observations. Therefore, we could not model epistasis by analyzing HLA-DRB1 conditional on PTPN22.

A co-dominant model fits IBD data for PTPN22
The estimated general population allele frequency for the T allele is 5% (thus, 95% for C) (data not shown). Table 1 shows the expected and the observed distributions for the family configurations, the genotypes, and the IBD sharing. Compared to the distributions at the HLA-DRB1 locus, there are more missing genotype values for the PTPN22 locus, in particular for parents. Moreover, determination of IBD is sometimes difficult because the rs2476601 marker, being a SNP, is bi-allelic. In Table 2, coupling refers to the probability of carrying the susceptibility allele given the marker allele. In this table, we observe that the T allele has a coupling probability of 1, whereas the probability associated with the C allele is much lower (0.07). This strongly suggests that the susceptibility allele is T, i.e., the marker allele itself, or an allele of a gene in very strong linkage disequilibrium with the marker locus.

Conclusion
Our results support the role of SNP R620W of the PTPN22 gene in the risk of developing RA. None of the models tested fitted the data for HLA-DRB1. This further reinforces the conclusion of Tezenas du Montcel et al. [7] that classification of HLA-DRB1 alleles is complex, and that some classification systems may not capture the complexity at this locus. Alternatively, the lack of fit of the main models tested for HLA-DRB1 may also suggest that epistasis is a major part of the underlying genetic architecture for this locus, i.e., the effect of an interaction without measurable main effects. A better understanding of the genetic architecture of RA will be essential not only to identify additional genes implicated in RA but also as a critical component of an eventual translation of genetic research knowledge into clinical benefits for patients.