Data
We used the NARAC data from Problem 2 of the Genetic Analysis Workshop 15 (GAW15). For the present paper, only data from individuals labeled as "Caucasian" and those with "unknown" ethnicity were considered (n = 3690). We pooled these two categories because the majority of the NARAC sample was white and because the allele frequencies for the four HLA-DRB1 microsatellites (D6S265, TNFA, D6S273, and D6S1629), as well as for the PTPN22 SNP rs2476601, did not differ significantly between the "unknown" ethnicity group and the group labeled as Caucasian. We selected a sib-pair design, and only one sib-pair per nuclear family was included in the analysis. The full set of available sib-pairs could not be used with the selected analytic approach because sib-pairs within the same sibship are not independent. In families with more than two affected sibs, we kept the two sibs with the most complete data and the closest ages. The remaining affected sibs, half sibs, and unaffected sibs were excluded from the analysis. Multigenerational pedigrees were split into nuclear families, generation by generation. Parental data were used to compute the penetrances. Using the data provided through GAW15, we defined the index case as the sib from each pair with the most complete data.
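The selection rule described above can be illustrated with a minimal Python sketch. It is not the procedure actually run on the GAW15 data: the field names (`id`, `age`, `n_typed`) are hypothetical, and the way the two criteria are combined (completeness first, age difference as a tie-breaker) is an assumption, since the text does not state their relative priority.

```python
from itertools import combinations

def select_sib_pair(affected_sibs):
    """Pick one affected sib-pair from a nuclear family.

    affected_sibs: list of dicts with hypothetical keys
      'id'      - individual identifier
      'age'     - age in years
      'n_typed' - number of non-missing data fields (completeness)
    Returns (index_case_id, sib_id), or None if fewer than two sibs.
    """
    if len(affected_sibs) < 2:
        return None

    def rank(pair):
        a, b = pair
        # Assumption: prefer the most complete pair, then the closest in age.
        return (-(a["n_typed"] + b["n_typed"]), abs(a["age"] - b["age"]))

    a, b = min(combinations(affected_sibs, 2), key=rank)
    # The index case is the sib of the pair with the most complete data.
    index, sib = sorted((a, b), key=lambda s: -s["n_typed"])
    return index["id"], sib["id"]

family = [
    {"id": "s1", "age": 50, "n_typed": 9},
    {"id": "s2", "age": 48, "n_typed": 10},
    {"id": "s3", "age": 60, "n_typed": 9},
]
pair = select_sib_pair(family)
```

In this toy family, two candidate pairs are equally complete, so the age difference decides between them.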
HLA-DRB1 allele classification and PTPN22 rs2476601 allele labeling
Because the HLA-DRB1 alleles reported to be associated with RA share an RAA motif at position 71–74, our classification of the HLA-DRB1 alleles is based on their amino acid sequence at that position [7]. When allele sub-types were not available, we assigned sub-types according to their frequencies (e.g., 01R alleles were considered as *0101; individuals with HLA-DRB1*14 alleles were randomly assigned, half to *1401 and half to *1402). We classified HLA-DRB1*0101, *0405, *0408, *1001, *1402, and *16 as E1; *0401 and *0409 as E2; *0102, *0404, *0423, *12, and *1406 as E3; and all other alleles as Ex. The alleles classified as Ex were considered non-susceptibility alleles. For PTPN22, the susceptibility allele of the R620W missense SNP (rs2476601) is the minor allele T, whereas the common allele is C.
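As an illustration, the allele grouping above can be written as a small lookup. This is a sketch only: the function names and the accepted input spellings are assumptions, and two-digit entries (*16, *12) are treated as matching any sub-type of that allele group.

```python
# Group membership as listed in the text; 0423 is written in four-digit form.
GROUPS = {
    "E1": {"0101", "0405", "0408", "1001", "1402", "16"},
    "E2": {"0401", "0409"},
    "E3": {"0102", "0404", "0423", "12", "1406"},
}

def classify_drb1(allele: str) -> str:
    """Map an HLA-DRB1 allele name (e.g. 'HLA-DRB1*0401' or '*16') to E1/E2/E3/Ex."""
    code = allele.upper().replace("HLA-DRB1", "").replace("DRB1", "")
    code = code.lstrip("*").replace(":", "")
    for group, members in GROUPS.items():
        # Match the full four-digit code, or the two-digit allele group.
        if code in members or code[:2] in members:
            return group
    return "Ex"  # all remaining alleles are non-susceptibility alleles

def count_risk_alleles(genotype: str) -> int:
    """Number of PTPN22 rs2476601 susceptibility (T) alleles in a genotype such as 'CT'."""
    return genotype.upper().count("T")
```

For example, `classify_drb1("HLA-DRB1*0401")` falls in E2, while `classify_drb1("*1401")` falls in Ex because only the *1402 sub-type of *14 is listed.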
Statistical analyses
Genetic models for the HLA-DRB1 and PTPN22 genes in RA were tested using the marker association segregation chi-square (MASC) method [8]. Details of this approach have been described elsewhere [7, 8]. Briefly, the MASC method minimizes a sum of independent chi-squares and tests the goodness-of-fit of various models [8]. It estimates penetrances by simultaneously using information on marker association and on segregation with the disease. Thus, MASC uses the allelic association information from the genotype distribution among unrelated index cases, as well as the linkage information from the proportion of index case and affected sib pairs sharing 2, 1, or 0 alleles identical by descent (IBD). To deal with potential uncertainty in IBD estimation, the probability of sharing 2, 1, or 0 alleles IBD was computed for flanking SNPs using MERLIN. Using a threshold of at least 80% for the estimated IBD probability at individual SNPs, no ambiguity in IBD sharing was detected. Because the MASC approach conditions on IBD status at one stage, it cannot accommodate analyses of large sibships, due to the non-independence of the sib-pairs within sibships. Nuclear families with the following three configurations were included: affected sibs and unaffected parents, affected sibs and one affected parent, and affected sibs and affected parents. These corresponded to the MASC labels C2, C4, and C6, respectively. The data were then stratified according to these distributions (i.e., family, genotype, and IBD distributions) using the information on both the index cases and their relatives. The expected distributions were then computed for each proposed model. These distributions were used to estimate the relative penetrance of each genotype, defined as the ratio of the penetrance of a given genotype to the penetrance of the referent genotype (i.e., the higher-risk genotype).
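The core idea, a sum of independent chi-squares comparing observed and model-expected distributions, can be sketched as follows. This is a deliberately simplified illustration and not the MASC implementation: it ignores the multiplex ascertainment and family-configuration strata, it takes the expected IBD proportions as given rather than deriving them from the model, and all function names and numeric values are hypothetical.

```python
def relative_penetrances(penetrance):
    """Express each genotype penetrance relative to the highest-risk genotype."""
    ref = max(penetrance.values())
    return {g: p / ref for g, p in penetrance.items()}

def expected_case_genotypes(geno_freq, rel_pen):
    """Genotype proportions among cases: P(g | affected) is proportional to f_g * pen_g.

    Simplification: valid for singly ascertained cases only.
    """
    weights = {g: geno_freq[g] * rel_pen[g] for g in geno_freq}
    total = sum(weights.values())
    return {g: w / total for g, w in weights.items()}

def pearson_chi2(observed, expected_prop):
    """Pearson chi-square between observed counts and expected proportions."""
    n = sum(observed.values())
    return sum(
        (observed[k] - n * expected_prop[k]) ** 2 / (n * expected_prop[k])
        for k in observed
    )

def summed_chi2(components):
    """Sum of independent chi-squares over (observed, expected) components."""
    return sum(pearson_chi2(obs, exp) for obs, exp in components)

# Toy values for illustration only (not estimates from the NARAC data).
geno_freq = {"CC": 0.81, "CT": 0.18, "TT": 0.01}
rel_pen = relative_penetrances({"CC": 0.01, "CT": 0.02, "TT": 0.04})
expected_geno = expected_case_genotypes(geno_freq, rel_pen)
observed_geno = {"CC": 130, "CT": 62, "TT": 8}
observed_ibd = {2: 60, 1: 100, 0: 40}
expected_ibd = {2: 0.30, 1: 0.50, 0: 0.20}  # assumed given, not model-derived
stat = summed_chi2([(observed_geno, expected_geno), (observed_ibd, expected_ibd)])
```

In a full analysis the penetrances would be the free parameters, chosen to minimize the summed statistic; here they are fixed so that the arithmetic stays visible.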
Estimating the relative penetrances requires knowledge of the genotype frequencies in the general population. These were estimated using the affected family-based control (AFBAC) method [9]. This approach uses the parental alleles not transmitted to the children and assumes Hardy-Weinberg equilibrium. In the context of multiplex ascertainment, as in the current study, the average of the parental alleles transmitted to both sibs is compared to the AFBAC population of parental alleles never transmitted to the affected sib-pair. In the framework of MASC, a genetic model fits (i.e., explains the observed association and linkage data) when the expected and the observed distributions do not differ significantly. Therefore, a p-value > 0.05 corresponds to acceptance of the model, whereas a p-value < 0.05 implies that other factors not included in the model are involved in disease expression. We first fitted the co-dominant model, i.e., the most general model. To test whether the penetrances of the different genotypes differed significantly, pairwise comparisons were performed with the maximum likelihood ratio test. Since the co-dominant model fitted well, we then tested whether the penetrances of the heterozygotes and the homozygotes were equal (dominant vs. co-dominant, recessive vs. co-dominant). Confidence intervals for all estimates were computed using a bootstrap procedure.
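The population-frequency step and the bootstrap can be sketched in a few lines. This is an illustrative outline under stated assumptions: the input layout (one list of never-transmitted parental alleles per family) and the function names are hypothetical, and a percentile bootstrap that resamples families is assumed because the text does not specify the bootstrap variant.

```python
import random
from collections import Counter

def allele_freqs(alleles):
    """Allele frequencies from a pool of never-transmitted parental alleles."""
    counts = Counter(alleles)
    n = sum(counts.values())
    return {a: c / n for a, c in counts.items()}

def hwe_genotype_freqs(freqs):
    """Genotype frequencies under Hardy-Weinberg equilibrium (p^2 and 2pq terms)."""
    alleles = sorted(freqs)
    out = {}
    for i, a in enumerate(alleles):
        for b in alleles[i:]:
            out[(a, b)] = freqs[a] ** 2 if a == b else 2 * freqs[a] * freqs[b]
    return out

def bootstrap_ci(families, statistic, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI, resampling whole families with replacement."""
    rng = random.Random(seed)
    stats = sorted(
        statistic(rng.choices(families, k=len(families))) for _ in range(n_boot)
    )
    lo_idx = max(0, round(alpha / 2 * n_boot))
    hi_idx = min(n_boot - 1, round((1 - alpha / 2) * n_boot) - 1)
    return stats[lo_idx], stats[hi_idx]

def freq_T(families):
    """Frequency of the T allele among the pooled never-transmitted alleles."""
    pooled = [a for fam in families for a in fam]
    return allele_freqs(pooled).get("T", 0.0)

# Toy input for illustration only: never-transmitted alleles per family.
toy_families = [["C", "C"]] * 40 + [["C", "T"]] * 10
genotype_freqs = hwe_genotype_freqs(allele_freqs([a for f in toy_families for a in f]))
ci_low, ci_high = bootstrap_ci(toy_families, freq_T)
```

Resampling families rather than individual alleles keeps the alleles from one family together, which is one reasonable way to respect the within-family structure.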