- Open Access
Comparison of the power of haplotype-based versus single- and multilocus association methods for gene × environment (gene × sex) interactions and application to gene × smoking and gene × sex interactions in rheumatoid arthritis
BMC Proceedingsvolume 1, Article number: S73 (2007)
Accounting for interactions with environmental factors in association studies may improve the power to detect genetic effects and may help identifying important environmental effect modifiers. The power of unphased genotype-versus haplotype-based methods in regions with high linkage disequilibrium (LD), as measured by D', for analyzing gene × environment (gene × sex) interactions was compared using the Genetic Analysis Workshop 15 (GAW15) simulated data on rheumatoid arthritis with prior knowledge of the answers. Stepwise and regular conditional logistic regression (CLR) was performed using a matched case-control sample for a HLA region interacting with sex. Haplotype-based analyses were performed using a haplotype-sharing-based Mantel statistic and a test for haplotype-trait association in a general linear model framework. A step-down minP algorithm was applied to derive adjusted p-values and to allow for power comparisons. These methods were also applied to the GAW15 real data set for PTPN22.
For markers in strong LD, stepwise CLR performed poorly because of the correlation/collinearity between the predictors in the model. The power was high for detecting genetic main effects using simple CLR models and haplotype-based methods and for detecting joint effects using CLR and Mantel statistics. Only the haplotype-trait association test had high power to detect the gene × sex interaction.
In the PTPN22 region with markers characterized by strong LD, all methods indicated a significant genotype × sex interaction in a sample of about 1000 subjects. The previously reported R620W single-nucleotide polymorphism was identified using logistic regression, but the haplotype-based methods did not provide any precise location information.
The inclusion of gene × environment interaction terms in association analyses may improve the power to detect genetic effects and may contribute to the identification of important environmental effect modifiers . Current technology makes it possible to genotype very dense sets of markers and the underlying genomic structure can be captured using haplotypes. The merits and drawbacks of haplotype-based compared with unphased genotype-based methods in the analysis of joint effects of genetic and non-genetic factors in regions with high linkage disequilibrium (LD, as measured by D') are of particular interest. When investigating candidate genes using dense markers that are in strong LD, the goal is to distinguish potentially causal variant(s) from those showing association merely due to LD. Using the Genetic Analysis Workshop 15 (GAW15) simulated data, we compared the power of three genotype- and haplotype-based methods to account for gene × environment interactions both to identify the genetic region and to identify the causal variant(s) in case-control scenarios. The methods are ordinary conditional logistic regression (CLR) and stepwise conditional logistic regression (SCLR) , Mantel statistics using haplotype sharing , and a haplotype-trait association statistic using a maximum likelihood estimate to test for interaction effects . In contrast to the haplotype-trait association statistic, which provides a statistical test for each haplotype, CLR, SCLR, and the Mantel statistics provide a statistical test for each genetic marker. The methods also differ in the number of tests performed as well as in the number of degrees of freedom. To account for multiple hypotheses testing when investigating multiple single-nucleotide polymorphisms (SNPs) in a candidate gene and to obtain comparable power estimates, we applied a step-down minP algorithm .
In addition, we explored the performance of these methods in the GAW15 real data set for PTPN22 using markers in strong LD (in terms of D'). The data are a subset of the published data on rheumatoid arthritis (RA) . RA is a complex autoimmune disease more common in women and in smokers  and with a moderately strong genetic component. Genetic association with RA was found in the PTPN22 gene [6, 8, 9], where the SNP R620W has been found to have a stronger effect in males than in females [6, 10].
Simulated data sets
From the GAW15 simulated data modeling RA, we selected 100 replicate samples to represent a matched case-control study. For each replicate, the first affected offspring within an affected sibling pair of the first 500 families (1500 families in total) was chosen as case. Out of 2000 unrelated unaffected controls, 500 were matched to the cases by age and sex. The proportion of females:males in the case-control pairs was 2:1. In the simulated HLA region from the high-density scan on chromosome 6 with 18,000 SNPs, the disease loci DR and C have the same physical position and are in high LD. The DR locus has been simulated to have a strong effect on the risk of RA independently of sex, while locus C was generated to interact with sex, such that it increases risk only in women. We investigated these loci and the 10 flanking SNPs on either side, spanning a region of about 300 kb. The alleles of the multiallelic DR locus were recoded as biallelic based on the answers, i.e., the risk allele vs. all other alleles combined.
Patients and genotypes in the RA sample
For the analysis of the real data set on PTPN22, we used all 1001 individuals from the NARAC (North American Rheumatoid Arthritis Consortium) sample for whom genotypes were available, these were 665 unrelated RA cases and 336 unrelated unaffected controls (Table 1, see Plenge et al.  for details of ascertainment). For the analysis of interaction effects of SNPs and smoking, we excluded individuals whose smoking status was not known. The control sample is not representative of the population regarding the distribution of sex and smoking because it includes a higher percentage of females and smokers than the case sample. Thus the estimated main effects of sex and smoking are strongly biased and not presented here. All 14 SNPs provided in PTPN22, which span approximately 58 kb, were investigated (Table 2).
Statistical analysis methods
1. Logistic regression
We investigated four different sets of CLR models: 1) the main effect of each marker individually in 22 separate regression models; 2) only the joint effect of each genotyped marker individually with sex as an interaction term. This approach was considered in order to compare CLR and the Mantel statistics for joint effects, M1, mentioned below; 3) both the main effect of each marker individually and their interaction with sex; 4) SCLR (stepwise CLR with backwards model selection based on the Akaike information criterion) with the full model including all main and first-order interaction effects of the 22 markers simultaneously. The SCLR approach was proposed specifically to distinguish between causal variants and those merely in LD . Note that these analyses only use unphased multilocus genotype data. We used an additive genetic model on the logit scale, i.e., multiplicative on the odds ratio scale, without dominance effects to reduce the number of degrees of freedom.
We employed analogous unconditional LR models for the real data on PTPN22 because cases and controls were not matched. Here, gene × environment (current smoking) and gene × sex interactions were investigated. All CLR and LR analyses were performed using the computer program R (version 2.3.1), using the general linear model (glm) procedure and the step function for stepwise model selection.
2. Haplotype-based analysis
2.1 Mantel statistic using haplotype sharing
We applied the approach of Mantel's statistics for space-time clustering to correlate genetic and phenotypic similarity : whereby x denotes a putative disease locus, i and j are haplotypes, L ij (x) is measured as the number of intervals surrounding x flanked by markers identical by state (haplotype sharing). The sum is over all pairwise comparisons of haplotypes. denotes the phenotypic similarity of the individuals s i and s j , and is defined as the mean-corrected product , where μ y denotes the sample mean, and and the disease status of s i and s j . To analyze the joint effect of a marker locus and an environmental factor, a measure of exposure similarity was introduced: . is computed as = 1, = , and = 0 else, where and denote the environmental factor of s i and s j . M1 is a test for joint effects and does not distinguish between genetic main effects and interaction effects.
2.2 Haplotype-trait association
We applied a test for haplotype-trait association in a general linear model framework, using maximum likelihood estimates for main and interaction effects of haplotypes and non-genetic factors, allowing for haplotype phase to be ambiguous [4, 11]. Two regression models were applied. The first model contained sex and all main effects of haplotypes. The second model included all main effects and first-order haplotype-sex interaction effects. The effects of the haplotypes were modeled as additive. We first used the most frequent haplotype, which included the risk alleles on the loci DR and C, as baseline category. We also analyzed the data using a less frequent haplotype, which did not include any risk alleles, as the baseline haplotype. Certainly, without prior knowledge of the answers, we would not have done so. Only haplotypes with a frequency of at least 5% were considered. We modified the function haplo.glm, which is included in the haplo.stats R-library .
For the PTPN22 data set, we used a model which included the main effects of the haplotypes, sex, and smoking, as well as the first-order haplotype-sex or haplotype-smoking interaction terms.
3. Permutation procedure and step-down minP adjusted p-values
The numbers of tests and degrees of freedom differed between the statistical methods and models. Thus, we permuted the case-control status while keeping together genotypes and sex-as well as smoking status in the analysis of the real data-for each individual, and calculated adjusted p-values by a step-down minP algorithm .
As depicted in Table 3, power of CLR was generally high for Model 1 to detect the genetic main effect at the DR locus and for Model 2 to detect the joint effect of the SNPs and sex. For SNPs 5 and 15 only, there was hardly any power. In contrast, modeling of both a genetic main effect and an interaction effect resulted in very low power to detect the interaction and low to moderate power to detect the genetic main effect. SCLR performed highly unsatisfactorily and had very low power for all effects modeled.
The haplotype-sharing-based Mantel statistics had 100% power for all markers both for the genetic main effect and for the joint effect (Table 4) even when no more than 50 case-control pairs were investigated (data not shown). For the haplotype-trait association test, we present results only for the four haplotypes, which were observed in at least 80 of 100 samples (Table 2). The GAW15 data were simulated such that the allele coded 3 at the DR locus increases RA risk while the allele coded as 1 at locus C was simulated to increase risk for RA only in women. Thus it was surprising that both risk-related alleles were also included in the most common haplotype, the reference haplotype by default. All three haplotypes indicated the main and the interaction effect with power estimates ranging between 0.82 and 1, and 0.65 and 0.78, respectively. We reexamined the data using as reference the second most frequent haplotype, which did not comprise the risk alleles. The estimated power was moderate to high for the detection of the main effect (0.57 to 1) and low for the detection of the interaction effect with sex (0.08 to 0.51, data not shown).
In the PTPN22 region, LR identified significant effects at R620W (p = 0.04) and rs1217413 (0.05), considering main effects only and when considering also interaction effects with sex (p = 0.007 and p = 0.026, respectively) (Table 2). The remaining SNPs surrounding R620W did not show significant results. An interaction effect with smoking was not observed. As observed for the simulated data, power was low for all effects modeled using stepwise LR (data not shown).
The Mantel statistics did not yield significant main effects of the investigated SNPs with the lowest adjusted p-value at 0.23 (Table 2). However, evidence for joint effects (M1) both with sex and with smoking was found (lowest adjusted p-values of 0.02 and <0.001, respectively). Six common haplotypes with frequencies ≥ 5% were estimated in PTPN22 (Table 5). Applying the haplotype-trait association test, one haplotype detected with a frequency 13.6% was associated with a significantly increased risk in the model comprising only main effects. This haplotype differed from all other haplotypes at locus R620W (Table 5). One haplotype with frequency 8.52%, which was found to be significant in the haplotype-sex analysis, was associated with a decreased risk. Compared to the reference haplotype, this haplotype carries the same allele at R620W, but differs for SNPs rs12730735 and rs1217414. In both cases, the respective alleles are also present on other haplotypes, which do not show an interaction, thus a specific interacting disease variant could not be identified. There was no evidence for a haplotype-smoking interaction.
Current technology permits genotyping of dense marker sets across the whole genome. Once associated regions have been identified, efforts will be made to confirm these results in independent samples by a hypothesis-based candidate gene approach in which control of the type I error rate will be of great importance. Accounting for gene × environment interaction given moderate genetic main effects may improve the power to detect or confirm an association with a gene, a genetic region, or causal variant(s). Given this scenario, we investigated the performance of some methods proposed for detecting genetic main and gene × environmental interaction effects. We found that stepwise LR or CLR performed poorly given strong LD (as measured by D') between markers-a phenomenon called (multi-)collinearity that results in overfiting of the model and weak robustness . It is well known that automated stepwise methods do not necessarily produce the best model if there are redundant predictors; often the model created will have poor predictive accuracy [13–15]. Although stepwise model selection was proposed for elucidating which of several markers in LD might be functionally relevant , this strategy seems to be inefficient when strong LD (in terms of D') exists. In this situation, ordinary LR performs very well when only the main effect of each marker is tested separately in a model and the p-values are adjusted for the total number of tests performed. The low power for SNP 5 is likely due to low LD (r2 ≈ 0) with the disease loci and non-matching allele frequencies . This does not appear to apply for SNP 15, where power is low but LD is stronger. All other markers had a power of about 100% for the genetic main effect despite mostly low r2-values between these markers and the disease loci. Therefore, in the simulated scenario, it was not possible to localize the functional variant within this region, e.g., there was no general tendency for the true causal locus to have the smallest p-value or largest effect size estimate. The simulated genetic main effect was also much stronger than the interaction effect, so that there was no improvement adding a gene × sex interaction term in the model. A much larger sample size would have been necessary for detecting the interaction effect.
There was only a limited number of haplotypes estimated from the 22 simulated markers due to the strong LD (as measured by D') between them. Both haplotype-based methods were able to detect association in the investigated region but they failed to identify the disease variant(s). The findings regarding joint effects suggest that analysis for joint effects may be an option for detecting genetic effects in some situations. For smaller sample sizes, the Mantel statistics for joint effects (M1), however, is more powerful than a similar LR modeling of interaction effects (data not shown).
For the analysis of PTPN22 in RA, we used a subset of the data of Plenge et al. , who identified an interaction of R620W with sex in a case-only analysis in their complete data set of about 4100 RA cases and controls. Using the different LR models, we were also able to detect this genotype × sex interaction in a subsample of 1001 subjects at R620W. The lack of significance for the genetic main effect in the model including both main and interaction terms, indicated that the effect of R620W indeed depends on the gender of the genotype carrier. Furthermore, it was possible to distinguish R620W from neighboring markers with allele frequencies that did not match and low r2-values . The Mantel statistic failed to identify main effects but showed strong evidence for a joint effect both with gender and with smoking in model M1. These significant effects may however be due to apparent strong main effects of sex and smoking caused by an unrepresentative control sample and may not reflect a true interaction effect. No specific marker could be detected as disease-causing polymorphism because of the dependency of the test statistics at different markers. By contrast, the results of the haplotype association method haplo.glm were similar to the results of the LR models implicating the R620W locus as a risk locus for RA. However, it is unclear whether the significant haplotype-sex interaction observed for a different haplotype refers to a further disease variant.
Our results suggest that stepwise or other automated variable selection methods are not suitable for the investigation of gene × environment interactions in regions with high LD, as measured by D'. Given strong genetic main effects and moderate gene-sex interaction, as in the simulated data, both simple CLR and haplotype-based methods have adequate power for the detection of genetic main effects without considering gene × sex interaction in moderate sample sizes. In the simulated scenario, haplotype-trait association tests had better power than simple LR modeling to detect moderate gene × sex interactions. However, this result cannot be generalized to other situations without further simulation studies under different genetic models. On the other hand, haplotype-based methods that always use data from consecutive markers tend to yield similar results for neighboring markers, thus making localization of a causal variant potentially more difficult than with genotype-based methods that consider each marker at a time. Thus, the potential for localization of causal variants with each method should also be explored in further simulation studies.
Hunter DJ: Gene-environment interactions in human diseases. Nat Rev Genet. 2005, 6: 287-298. 10.1038/nrg1578.
Cordell HJ, Clayton DG: A unified stepwise regression procedure for evaluating the relative effects of polymorphisms within a gene using case/control or family data: application to HLA in type 1 diabetes. Am J Hum Genet. 2002, 70: 124-141. 10.1086/338007.
Beckmann L, Thomas DC, Fischer C, Chang-Claude J: Haplotype sharing analysis using Mantel statistics. Hum Hered. 2005, 59: 67-78. 10.1159/000085221.
Lake SL, Lyon H, Tantisira K, Silverman EK, Weiss ST, Laird NM, Schaid DJ: Estimation and tests of haplotype-environment interaction when linkage phase is ambiguous. Hum Hered. 2003, 55: 56-65. 10.1159/000071811.
Obreiter M, Fischer C, Chang-Claude J, Beckmann L: SDMinP: a program to control the family wise error rate using step-down minP adjusted P-values. Bioinformatics. 2005, 21: 3183-3184. 10.1093/bioinformatics/bti480.
Plenge RM, Padyukov L, Remmers EF, Purcell S, Lee AT, Karlson EW, Wolfe F, Kastner DL, Alfredsson L, Altshuler D, Gregersen PK, Klareskog L, Rioux JD: Replication of putative candidate-gene associations with rheumatoid arthritis in >4,000 samples from North America and Sweden: association of susceptibility with PTPN22, CTLA4, and PADI4. Am J Hum Genet. 2005, 77: 1044-1060. 10.1086/498651.
Criswell LA, Merlino LA, Cerhan JR, Mikuls TR, Mudano AS, Burma M, Folsom AR, Saag KG: Cigarette smoking and the risk of rheumatoid arthritis among postmenopausal women: results from the Iowa Women's Health Study. Am J Med. 2002, 112: 465-471. 10.1016/S0002-9343(02)01051-3.
Begovich AB, Carlton VE, Honigberg LA, Schrodi SJ, Chokkalingam AP, Alexander HC, Ardlie KG, Huang Q, Smith AM, Spoerke JM, Conn MT, Chang M, Chang SY, Saiki RK, Catanese JJ, Leong DU, Garcia VE, McAllister LB, Jeffery DA, Lee AT, Batliwalla F, Remmers E, Criswell LA, Seldin MF, Kastner DL, Amos CI, Sninsky JJ, Gregersen PK: A missense single-nucleotide polymorphism in a gene encoding a protein tyrosine phosphatase (PTPN22) is associated with rheumatoid arthritis. Am J Hum Genet. 2004, 75: 330-337. 10.1086/422827.
Carlton VE, Hu X, Chokkalingam AP, Schrodi SJ, Brandon R, Alexander HC, Chang M, Catanese JJ, Leong DU, Ardlie KG, Conn MT, Chang M, Chang SY, Saiki RK, Catanese JJ, Leong DU, Garcia VE, McAllister LB, Jeffery DA, Lee AT, Batliwalla F, Remmers E, Criswell LA, Seldin MF, Kastner DL, Amos CI, Sninsky JJ, Gregersen PK: PTPN22 genetic variation: evidence for multiple variants associated with rheumatoid arthritis. Am J Hum Genet. 2005, 77: 567-581. 10.1086/468189.
Pierer M, Kaltenhauser S, Arnold S, Wahle M, Baerwald C, Hantzschel H, Wagner U: Association of PTPN22 1858 single-nucleotide polymorphism with rheumatoid arthritis in a German cohort: higher frequency of the risk allele in male compared to female patients. Arthritis Res Ther. 2006, 8: R75-10.1186/ar1945.
Schaid DJ, Rowland CM, Tines DE, Jacobson RM, Poland GA: Score tests for association between traits and haplotypes when linkage phase is ambiguous. Am J Hum Genet. 2002, 70: 425-434. 10.1086/338688.
Draper N, Smith H: Applied Regression Analysis. 1998, New York: John Wiley and Sons, 369-
Derkson S, Keselman H: Backward, forward and stepwise automated subset selection algorithms: frequency of obtaining authentic and noise variables. Br J Math Statist Psychol. 1992, 45: 265-282.
Judd C, McClelland G: Data Analysis: A Model-Comparison Approach. 1989, San Diego: Harcourt Brace Jovanovich
Steyerberg EW, Eijkemans MJ, Harrell FE, Habbema JD: Prognostic modeling with logistic regression analysis: in search of a sensible strategy in small data sets. Med Decis Making. 2001, 21: 45-56.
Zondervan KT, Cardon LR: The complex interplay among factors that influence allelic association. Nat Rev Genet. 2004, 5: 89-100. 10.1038/nrg1270.
This work was supported by the Deutsche Forschungsgemeinschaft (grant CH117/3-1) and by the Bundesministerium für Bildung und Forschung through the German National Genome Net (NGFN, grant numbers 01GR0460, 01GR0461).
This article has been published as part of BMC Proceedings Volume 1 Supplement 1, 2007: Genetic Analysis Workshop 15: Gene Expression Analysis and Approaches to Detecting Multiple Functional Loci. The full contents of the supplement are available online at http://www.biomedcentral.com/1753-6561/1?issue=S1.
The author(s) declare that they have no competing interests.
Astrid Dempfle, Rebecca Hein contributed equally to this work.