Volume 3 Supplement 7
Genetic Analysis Workshop 16
Representation of genetic association via attributable familial relative risks in order to identify polymorphisms functionally relevant to rheumatoid arthritis
 Justo Lorenzo Bermejo^{1, 2}Email author,
 Christine Fischer^{3},
 Anke Schulz^{4},
 Nadine Cremer^{4},
 Rebecca Hein^{4},
 Lars Beckmann^{4},
 Jenny ChangClaude^{4} and
 Kari Hemminki^{2, 5}
DOI: 10.1186/175365613S7S10
© Bermejo et al; licensee BioMed Central Ltd. 2009
Published: 15 December 2009
Abstract
The results from association studies are usually summarized by a measure of evidence of association (frequentist or Bayesian probability values) that does not directly reflect the impact of the detected signals on familial aggregation. This article investigates the possible advantage of a twodimensional representation of genetic association in order to identify polymorphisms relevant to disease: a measure of evidence of association (the Bayes factor, BF) combined with the estimated contribution to familiality (the attributable sibling relative risk, λ_{s}). Simulation and data from the North American Rheumatoid Consortium (NARAC) were used to assess the possible benefit under several scenarios. Simulation indicated that the allele frequencies to reach the maximum BF and the maximum attributable λ_{s} diverged as the size of the genetic effect increased. The representation of BF versus attributable λ_{s} for selected regions of NARAC data revealed that SNPs involved in replicated associations clearly departed from the bulk of SNPs in these regions. In the 12 investigated regions, and particularly in the lowrecombination major histocompatibility region, the ranking of SNPs according to BF differed from the ranking of SNPs according to attributable λ_{s}. The present results should be generalized using more extensive simulations and additional real data, but they suggest that a characterization of genetic association by both BF and attributable λ_{s} may result in an improved ranking of variants for further biological analyses.
Introduction
Susceptibility to rheumatoid arthritis (RA) is determined by both genetic and environmental factors, with an estimated sibling relative risk of 510. The HLADRB1 and PTPN22 genes explain approximately 50% of the familial aggregation. A recent genomewide association (GWA) study carried out by the Wellcome Trust Case Control Consortium (WTCCC) found nine singlenucleotide polymorphisms (SNPs) with associated probability values of less than 10^{5} that did not map to loci previously related to RA [1]. In agreement with the common diseasecommon variant hypothesis, the novel variants were common (minor allele frequencies from 0.07 to 0.41) and they showed genotype relative risks (GRRs) of less than 2.3. The authors of the WTCCC article recognized a disparity between the large populationattributable fraction (PAF) and the modest sibling relative risk (λ_{s}) explained by the newly identified variants [1]. This disparity is related to the technical characteristics of the platforms used in GWA studies, designed to scrutinize polymorphisms (common variants).
Let us assume that a polymorphism is a marker of a rarer causal variant, and the markerspecific PAF equals the PAF of the causal variant. Under this assumption, we have demonstrated that the causal variant has a higher attributable λ_{s} than the marker [2]. In other words: for a fixed PAF, causal variants contribute more to familial aggregation (λ_{s}) than the polymorphisms they are represented by.
There is much debate concerning the representation of statistical evidence in GWA studies. The frequentist pvalue depends on the power of the study and it does not provide a consistent measure of evidence because identical pvalues do not imply identical evidence of genetic association [3]. In contrast, the Bayes factor (BF) statistic provides a single measure of the strength of evidence of association.
We hypothesize here that the representation of genetic association by BF together with the attributable λ_{s} may help to identify polymorphisms relevant to RA, and use simulation and data from the North American Rheumatoid Consortium (NARAC) to evaluate this assumption.
Methods
Derivation of attributable risks and Bayes factors
The sibling relative risk (λ_{s}) attributable to a gene reflects the extent to which a particular genetic locus contributes to familial aggregation. The calculation of attributable λ_{s} values was first described in an early paper of James [4]. If the frequency of the highrisk allele is denoted by RAF, the relative risk of RA for highrisk allele homozygotes compared to lowrisk allele homozygotes by GRR_{hom}, the relative risk of heterozygotes compared to lowrisk allele homozygotes by GRR_{het} and the disease prevalence among lowrisk allele homozygotes by κ_{0}, the attributable λ_{s} is given by: λ_{s} = 1+(1/2V_{a}+1/4V_{d})/K^{2}, where V_{a} (additive genetic variance divided by κ_{0}^{2}) equals 2RAF(1RAF)[(1RAF)(1GRR_{het})RAF(GRR_{hom}GRR_{het})]^{2}, V_{d} (dominance genetic variance divided by κ_{0}^{2}) equals RAF^{2}(1RAF)^{2}[1+GRR_{hom}2GRR_{het}]^{2} and K (population prevalence divided by κ_{0}) equals RAF^{2}GRR_{Hom}+2RAF(1RAF)GRR_{het}+ (1RAF)^{2}. Note that if the overall disease prevalence is 50%, λ_{s} cannot exceed two by definition. However, an important property of the attributable λ_{s} is that it only depends on RAF, GRR_{hom}, and GRR_{het}. That is, the attributable λ_{s} is independent of the disease prevalence among lowrisk homozygotes κ_{0}.
The BF statistic is the ratio of the probability of the observed data under the assumption that there is a true association to its probability under the null hypothesis (absence of association). A small BF provides evidence in favor of a true association. To investigate the relationship between attributable λ_{s} and the BF, we calculated the expected distribution of genotypes in 1000 cases and 1000 controls when GRR_{hom} = 1.5, 2, 3, 4, 5 and 10. Calculations were carried out for the dominant model, which assumes that GRR_{het} = GRR_{hom}, the recessive model (GRR_{het} = 1), and for the additive model (GRR_{het} = (1+GRR_{hom})/2). Under HardyWeinberg equilibrium, the expected distribution of genotypes in controls (D = 0) is: Pr(G = aaD = 0) = (1RAF)^{2}; Pr(G = AaD = 0) = 2RAF(1RAF); Pr(G = AAD = 0) = RAF^{2} The expected distribution of genotypes in cases (D = 1) satisfies:
GRR_{hom} = (Pr(D = 1  G = AA)/Pr(D = 0  G = AA))/(Pr(D = 1  G = aa)/Pr(D = 0  G = aa)) and GRR_{het} = (Pr(D = 1  G = Aa)/Pr(D = 0  G = Aa))/(Pr(D = 1  G = aa)/Pr(D = 0  G = aa)). For example, if RAF = 0.2 and GRR_{hom} = GRR_{het} = 2 (dominance), the expected distribution is (aa = 640, Aa = 320, AA = 40) in controls and (aa = 471, Aa = 471, AA = 59) in cases.
The expected distributions of genotypes were investigated by Bayesian logistic regression. We considered a general threegenotype model of association. Calculation of BFs requires assumptions about effect sizes. We assumed N(0,1) priors on the mean and the two genetic effects. The function MCMClogit (R package MCMCpack [5]) was used to generate a 10,000 sample dataset from the posterior distribution using a random walk Metropolis algorithm. The BayesFactor function in the same package was used to compute log marginal likelihoods for a model 'with' compared to a model 'without' genotypic information using a Laplace approximation.
Investigated regions
Description and results from the 12 investigated regions
WTCCC study^{a}  NARAC data^{b}  

Chr  SNP  Position  MAD of log_{10}  Outlying SNPs, gene regions  log_{10} (BF)  RAF  λ_{s} Median (5^{th}95^{th})  
GRR_{hom}  GRR_{het}  
1p13  rs6679677  11,401,850  0.282  0.199  rs2476601, PTPN22  3.91  0.084  1.048 (1.0171.098) 
1p36  rs6684865  2,578,391  0.282  0.199         
1p31  rs11162922  80,284,079  0.282  0.199         
4p15  rs3816587  25,093,513  0.255  0.188  rs12505556, SLC34A2  3.04  0.112  1.071 (1.0111.630) 
6p21a  rs6457617  32,771,829  0.306  0.206  rs2395175, MHC  16.48  0.148  1.164 (1.1081.241) 
rs7765379, MHC  6.11  0.113  1.175 (1.0531.668)  
6p21b  rs615672  32,682,149  0.306  0.206  rs2395175, MHC  16.48  0.148  1.164 (1.1081.241) 
rs7765379MHC  6.11  0.113  1.175 (1.0531.668)  
6q23  rs6920220  138,048,197  0.306  0.206         
7q32  rs11761231  130,827,294  0.268  0.185         
10p15  rs2104286  6,139,051  0.290  0.197         
13q12  rs9550642  19,848,092  0.286  0.200  rs1407961, ZMYM2  3.59  0.920  1.033 (1.0331.225) 
21q22  rs2837960  41,433,788  0.288  0.208  rs468646, MX1  4.28  0.488  1.038 (1.0191.057) 
rs466092, MX2  3.59  0.488  1.030 (1.0161.048)  
22q13  rs743777  35,876,107  0.283  0.216  rs3218258, IL2RB  9.63  0.771  1.061 (1.0411.084) 
rs710183, CSF2RB  4.26  0.072  1.028 (1.0111.183)  
rs8137446, KCTD17  3.48  0.903  1.035 (1.0181.123) 
Calculation of attributable risks and Bayes factors for NARAC data
SNPs in the 12 selected intervals were extracted from NARAC data, and the association between RA and the retrieved SNPs was represented by 1) the base 10 logarithm of the BF (log_{10}BF) and 2) the attributable λ_{s}. RA status (affected or unaffected) was modelled by logistic regression, taking into account the covariates gender, number of HLADRB1 sharedepitope alleles (NN = 0, SN = 1 or SS = 2) and genotype (three levels). Individuals with missing information were excluded from the calculations. Only SNPs with the three genotypes represented in both cases and controls were considered.
As already mentioned, calculation of BFs requires assumptions on effect sizes. We calculated frequentist estimates of logistic regression coefficients over the corresponding entire chromosomes using NARAC data. Density plots and tests of goodness of fit of frequentist estimates of log_{10}GRR_{Hom} and log_{10}GRR_{Het} for the seven investigated chromosomes indicated that GRR variation was better represented by the median absolute deviation (MAD) than by the standard deviation (data not shown). Therefore, we assumed N(0, MAD^{2}) priors on the logarithms of genetic effects. MCMClogit and BayesFactor [5] were used to calculate BFs for each SNP in the selected regions based on NARAC data. The number of individuals with complete data and the observed allele frequencies in controls were used to draw 10,000 samples from a binomial distribution, which were combined with the posterior estimates of genetic effects to simulate the distribution of attributable λ_{s} for each SNP.
A bagplot is a bivariate generalization of the nonparametric univariate boxplot [6]. In a bagplot, the region which contains 50% of the observations with greatest bivariate depth is denominated the bag. Observations outside the bag expanded three times are considered outliers, they are too far away from the data's central bulk. These outliers are indicated by closed dots in univariate boxplots. Bagplots have been shown to be equivariant for linear transformations and not limited to elliptical distributions. Due to these properties and the characteristics of our data (unknown shape of the underlying distributions), we used nonparametric bagplots to identify deviating SNPs. Bagplots were determined using the function compute.bagplot in the R package aplpack [7].
Results
Theoretical relationship between Bayes factor and attributable risk
Allele frequencies at maximum λ_{s} and maximum log_{10}(BF)
Dominant model  Recessive model  Additive model  

GRR_{hom}  Max log_{10}(BF)  Max λ_{s}  Max log_{10}(BF)  Max λ_{s}  Max log_{10}(BF)  Max λ_{s} 
1.5  0.24  0.21  0.70  0.66  0.39  0.40 
2  0.22  0.17  0.64  0.61  0.38  0.33 
3  0.20  0.13  0.61  0.54  0.33  0.25 
4  0.17  0.10  0.59  0.48  0.26  0.20 
5  0.16  0.08  0.57  0.45  0.24  0.17 
10  0.14  0.05  0.49  0.33  0.17  0.09 
Bayes factors and attributable sibling relative risks in the investigated regions
The analysis of complete chromosomes by frequentist logistic regression resulted in 2(GRR_{hom}, GRR_{het}) × 8(chromosomes 1, 4, 6, 7, 10, 13, 21, 22) sets of genetic effects. Table 1 shows the MADs of log_{10}GRR_{hom} and log_{10}GRR_{het} used as scale parameters in the normal prior distributions. As expected, the a priori variance of genetic effects was highest for chromosome 6 (MHC region).
Discussion
This study explored the advantage of combining BF, a measure of statistical evidence of association, and the attributable λ_{s}, a measure of contribution to familial aggregation, in order to characterize the effect of a particular polymorphism on disease risk. Both simulation and NARACbased results showed that BF usually correlates with the attributable λ_{s}. Most markers identified in GWA studies confer a slightly increased risk of disease. For this kind of modest genetic effects (GRR ≤ 2), simulation indicated that the maximum log_{10}BF and the maximum attributable λ_{s} were expected to be found for variants with similar frequencies: around 20% (dominant), 40% (additive) and 70% (recessive penetrances). The larger the size of the genetic effect, the lower the RAF to reach the maxima, and the larger the difference between RAFs. Causal variants generally show stronger effects than markers from GWA studies. For example, a recent GWA study identified five novel breast cancer susceptibility loci. The highest GRR was 1.6, which can be compared with GRR = 6 to 30 for carriers of BRCA1 mutations. Simulation suggested that, for this kind of strong genetic effects, attributable λ_{s} and BF are two different dimensions.
The representation of λ_{s} versus log_{10}BF based on NARAC data confirmed that SNPs which have shown replicated associations clearly departed from the majority of polymorphisms in the selected regions. For example, the functionally relevant SNP rs2476601 was unequivocally separated from the rest of SNPs in the 1p13 region. Other genes located in the proximity of the most extreme SNPs were ZMYM2, MX1, MX2, IL2RB, and CSF2RB. This list includes SNPs which show strong evidence of association (log_{10}BF higher than three) and outlying λ_{s} values. SNPs with high attributable λ_{s} values but relatively low BFs may also be of interest. Simulation showed that one BF value corresponds to two different λ_{s} values. In the 12 regions investigated using NARAC data, the ranking of SNPs according to BF differed from the ranking according to attributable λ_{s}. The difference was particularly clear in the lowrecombination MHC region, where rs7765379 showed the maximum λ_{s} and rs2395175 showed the maximum log_{10}BF. This difference in rankings suggests that, in addition to a measure of association evidence, the extensive finemapping needed to characterize etiological variants may benefit from the consideration of the attributable λ_{s}. Future studies based on more extensive simulations and additional real data should investigate this possibility.
The present study made use of Bayesian and robust statistics. The Bayesian approach has been overlooked in the analysis of GWA studies, where the adjustment for multiple testing, the relationship between power and statistical significance, and the selection of disease models are important issues [1, 3]. We only considered SNPs with the three genotypes represented in both cases and controls. This strategy would initially hamper the identification of rare causal variants but, on the other hand, logistic regression has not been robust to particular types of outliers, and a threegenotype model with well represented categories would result in more robust estimates. Robust methods aim to describe the structure best fitting to the bulk of the data, and to identify deviating observations. Robust statistics are used a lot less than they ought to be in genetic epidemiology. The use of MADs and bagplots in the present article illustrates their advantage over classical methods in practical situations.
Conclusion
The association results from GWA studies are usually summarized by a measure of evidence of association (frequentist or Bayesian probability values), which do not reflect the contribution of the identified signals to familial aggregation. We propose here a twodimensional characterization of genetic association consisting of the attributable λ_{s} and the BF.
List of abbreviations used
 BF:

Bayes factor
 GRR:

Genotype relative risk
 GWA:

Genomewide association
 λ _{s} :

sibling relative risk
 MAD:

Median absolute deviation
 MHC:

Major histocompatibility complex
 PAF:

Populationattributable fraction
 RA:

Rheumatoid arthritis
 SNP:

Singlenucleotide polymorphism
 WTCCC:

Wellcome trust case control consortium
Declarations
Acknowledgements
The Genetic Analysis Workshops are supported by NIH grant R01 GM031575 from the National Institute of General Medical Sciences. JLB was supported by Deutsche Krebshilfe and the Swedish Cancer Society. RH was supported by NGFN 2 (SMPGEM, PGES30T09), funded by the BMBF.
This article has been published as part of BMC Proceedings Volume 3 Supplement 7, 2009: Genetic Analysis Workshop 16. The full contents of the supplement are available online at http://www.biomedcentral.com/17536561/3?issue=S7.
Authors’ Affiliations
References
 Wellcome Trust Case Control Consortium: Genomewide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature. 2007, 447: 661678. 10.1038/nature05911.View Article
 Hemminki K, Försti A, Bermejo JL: The 'common diseasecommon variant' hypothesis and familial risks. PLoS ONE. 2008, 3: e250410.1371/journal.pone.0002504.PubMed CentralView ArticlePubMed
 Wakefield J: Bayes factors for genomewide association studies: comparison with Pvalues. Genet Epidemiol. 2009, 33: 7986. 10.1002/gepi.20359.View ArticlePubMed
 James JW: Frequency in relatives for an allornone trait. Ann Hum Genet. 1971, 35 (1): 4749. 10.1111/j.14691809.1956.tb01377.x.View ArticlePubMed
 Martin AD, Quinn KM: MCMCpack: (MCMC) Package. [http://mcmcpack.wustl.edu/wiki/index.php/Main_Page]
 Rousseeuw PJ, Ruts I, Tukey JW: The Bagplot: a Bivariate boxplot. Am Stat. 1999, 53: 382387. 10.2307/2686061.
 Wolf P: aplpack: Another Plot PACKage. [http://www.wiwi.unibielefeld.de/~wolf/software/aplpack/]
Copyright
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.