Pattern-based mining strategy to detect multi-locus association and gene × environment interaction.

As genome-wide association studies grow in popularity for the identification of genetic factors for common and rare diseases, analytical methods that efficiently comb through large numbers of genetic variants to identify disease associations are increasingly in demand. We have developed a pattern-based data-mining approach to discover unlinked multilocus genetic effects for complex disease and to detect genotype × phenotype/genotype × environment interactions. On a densely mapped chromosome 18 data set for rheumatoid arthritis that was made available by Genetic Analysis Workshop 15, this method detected two potential two-locus associations as well as a putative two-locus gene × gender interaction.


Background
Long considered to hold promise for dissecting the genetic etiology of complex diseases [1], genome-wide association studies have produced significant and sometimes highly reproducible findings in an increasing number of publications [2]. Despite early successes that demonstrated the feasibility of whole-genome studies, strategies for fully analyzing genome-wide data, such as detecting multilocus genetic effects for complex disease or detecting genotype × phenotype/genotype × environment interactions, are still underdeveloped or underutilized, affecting the effectiveness of those large-scale studies on advancing our understanding of the genetic contribution to common human diseases [3,4]. An increasing number of methods have been proposed to detect multilocus genetic association: some make use of haplotypes [5] or logistic regression [6], while others use nonparametric "data mining" strategies such as the multifactor dimensionality reduction (MDR) [7] and neural networks [8]. Methods to analyze pair-wise interactions between unlinked loci have also been developed [9].
We have developed a pattern-based mining strategy (manuscript in preparation) to detect local (markers in moderate or strong linkage disequilibrium, or LD) and global (unlinked markers) multilocus genetic associations as well as gene × gene/gene × environment interactions. The pattern-based method exhaustively yet efficiently identifies all patterns satisfying predefined pattern search criteria and evaluates the association of each pattern with disease state through a χ²-based test statistic. For the work described in this report, we applied the pattern-based method to the chromosome 18 data set on rheumatoid arthritis (RA) ascertained by the North American Rheumatoid Arthritis Consortium (NARAC) that was made available for the Genetic Analysis Workshop 15 (GAW15). Ten patterns were found to be significantly associated with the RA phenotype after multiple testing correction. The significance of those ten patterns was confirmed using Monte Carlo simulation. Furthermore, we identified a potentially significant multi-gene/gender interaction involving two loci: SNP0177 and the LD region containing markers SNP0603-SNP0615.

Pattern Examiner: a pattern-based method to detect multi-locus association and interaction
Data are organized in a two-dimensional data matrix with markers as columns, individuals as rows, and individuals' alleles or genotypes as cell values. Each marker is represented by five columns: one column for each of the two alleles of the marker and one column for each of the three possible genotypes for the marker. A pattern is defined as a maximal sub-matrix of the data matrix in which the value of each marker across all individuals in the sub-matrix satisfies a predefined equivalence criterion, such as having the same genotype value. A sub-matrix is maximal if 1) no more rows can be added while keeping the columns fixed, and 2) no more columns can be added while keeping the rows fixed. Under this formulation, patterns can be used to model both multilocus allelic and multilocus genotypic contributions to disease state.
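The pattern definition above can be illustrated with a small sketch. It assumes one possible reading of the five-column encoding, in which each column is a yes/no feature ("carries allele 1", "has genotype 1/2", and so on) and the equivalence criterion is that every individual in the pattern has that feature. The marker names `M1` and `M2`, the toy genotypes, and all function names are hypothetical; this is not the Pattern Examiner implementation.

```python
def encode(genotype_row):
    """Turn {marker: (allele, allele)} into a set of (marker, feature) items.

    Each marker can contribute up to five distinct features overall:
    two allele features and three genotype features.
    """
    items = set()
    for marker, alleles in genotype_row.items():
        if alleles is None:  # a missing genotype contributes no features
            continue
        a, b = sorted(alleles)
        items.add((marker, f"allele {a}"))
        items.add((marker, f"allele {b}"))
        items.add((marker, f"genotype {a}/{b}"))
    return frozenset(items)


def support_rows(encoded, columns):
    """All individuals (row indices) that carry every feature in `columns`."""
    return [i for i, items in enumerate(encoded) if columns <= items]


def closure(encoded, rows):
    """All features shared by every individual in `rows`."""
    if not rows:
        return frozenset()
    return frozenset.intersection(*(encoded[i] for i in rows))


def is_maximal(encoded, columns):
    """Rows are maximal by construction (support_rows returns every match);
    columns are maximal when no further feature is shared by all those rows."""
    rows = support_rows(encoded, columns)
    return bool(rows) and closure(encoded, rows) == columns


# Hypothetical toy data: four cases typed at two biallelic markers.
cases = [
    {"M1": (1, 1), "M2": (1, 2)},
    {"M1": (1, 2), "M2": (1, 2)},
    {"M1": (1, 1), "M2": (2, 2)},
    {"M1": (2, 2), "M2": (1, 2)},
]
encoded = [encode(row) for row in cases]
columns = frozenset({("M1", "allele 1"), ("M2", "genotype 1/2")})

rows = support_rows(encoded, columns)
print(rows)                                        # individuals matching both features
print(is_maximal(encoded, columns))                # False: more shared columns exist
print(is_maximal(encoded, closure(encoded, rows))) # True after adding them
```

In this toy example the candidate column set is not maximal because genotype 1/2 at `M2` implies both `M2` alleles, so those two columns can be added without losing any row.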
Pattern Examiner is a nonparametric data mining-based method for the detection of multilocus associations and gene × gene/gene × environment interactions in data collected in population-based case/control studies. This method has two steps: 1) pattern discovery and 2) significance evaluation. In the pattern discovery step, patterns are identified using data from the case population only as input. The extensiveness (and execution time) of the pattern discovery step is controlled by two parameters: the support threshold, which specifies the minimum number of rows a pattern must have; and the locus threshold, which specifies the extent of locus interaction. For example, with the support and locus thresholds set to 20 and 2, respectively, all reported patterns will be supported by 20 or more cases and will mostly contain one or two markers. In the significance evaluation step, a 2 × 2 contingency table is constructed for each pattern to tally its support in the case and control populations ("case support" and "control support", respectively). The two categorical variables tabulated are population type ("cases" vs. "controls") and pattern match status ("matches" vs. "does not match"). Partially missing data are excluded. The p-value is obtained from a χ² test of independence and then adjusted for multiple testing. A modified Bonferroni correction for multiple testing is applied to each pattern, using as the correction factor the number of patterns whose case support is equal to or greater than that of the pattern under evaluation, rather than the total number of patterns identified. As a result, the adjusted significance is robust against the arbitrary selection of values for parameters in the pattern discovery step. The odds ratio with confidence interval is also calculated for each pattern.
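The significance evaluation step can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' code: the χ² statistic is computed without a continuity correction (the text does not say whether one was used), adjusted p-values are capped at 1, and the function names and example counts are hypothetical.

```python
import bisect
import math


def chi2_2x2(case_match, case_total, ctrl_match, ctrl_total):
    """Pearson chi-square test of independence on a 2 x 2 table.

    Totals should already exclude individuals with missing data.
    Returns (statistic, p_value) with 1 degree of freedom.
    """
    a, b = case_match, case_total - case_match
    c, d = ctrl_match, ctrl_total - ctrl_match
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:  # an empty margin carries no information
        return 0.0, 1.0
    stat = n * (a * d - b * c) ** 2 / denom
    # Survival function of chi-square with 1 df: erfc(sqrt(x / 2)).
    return stat, math.erfc(math.sqrt(stat / 2.0))


def modified_bonferroni(patterns):
    """patterns: list of (case_support, raw_p_value).

    Each p-value is multiplied by the number of patterns whose case
    support is equal to or greater than that pattern's case support.
    """
    supports = sorted(s for s, _ in patterns)
    n = len(supports)
    adjusted = []
    for support, p in patterns:
        factor = n - bisect.bisect_left(supports, support)
        adjusted.append(min(1.0, p * factor))
    return adjusted


# Hypothetical counts: 30 of 40 cases and 20 of 40 controls match a pattern.
stat, p = chi2_2x2(30, 40, 20, 40)
print(round(stat, 3), round(p, 4))

# Hypothetical pattern list: higher-support patterns get a smaller factor.
print(modified_bonferroni([(100, 0.001), (50, 0.001), (20, 0.001)]))
```

Because the correction factor for a pattern depends only on patterns with at least as much case support, lowering the support threshold adds patterns below it without changing the adjusted p-values of the patterns already reported.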

Test for differential gene × gender interactions in cases vs. controls
To test the null hypothesis that the genotype × gender interaction does not differ between cases and controls, three 2 × N contingency tables were constructed with the observed N genotypes of a significant pattern as rows and genders as columns. Three chi-square values, $\chi^2_{\text{case}}$, $\chi^2_{\text{control}}$, and $\chi^2_{\text{pooled}}$, were then obtained for cases only, controls only, and cases and controls pooled, respectively. The p-value was obtained using $\chi^2 = \chi^2_{\text{case}} + \chi^2_{\text{control}} - \chi^2_{\text{pooled}}$ with $N - 1$ degrees of freedom.
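A minimal sketch of this test follows, assuming Pearson chi-square statistics on genotype-by-gender count tables. The counts are hypothetical, the function names are illustrative, and clamping a negative difference to zero is an added safeguard that the text does not mention.

```python
import math


def pearson_chi2(table):
    """Pearson chi-square statistic for an r x c table of counts."""
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    n = sum(row_tot)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_tot[i] * col_tot[j] / n
            if expected > 0:
                stat += (observed - expected) ** 2 / expected
    return stat


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable, integer df >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 1:
        q, k = math.erfc(math.sqrt(half)), 1  # exact value for df = 1
    else:
        q, k = math.exp(-half), 2             # exact value for df = 2
    while k < df:                             # step the df up by 2 each time
        q += half ** (k / 2.0) * math.exp(-half) / math.gamma(k / 2.0 + 1.0)
        k += 2
    return q


def differential_interaction(case_table, control_table):
    """Rows are the N observed genotypes; the two columns are the genders."""
    pooled = [[a + b for a, b in zip(r1, r2)]
              for r1, r2 in zip(case_table, control_table)]
    stat = (pearson_chi2(case_table) + pearson_chi2(control_table)
            - pearson_chi2(pooled))
    stat = max(0.0, stat)  # assumption: treat a negative difference as zero
    df = len(case_table) - 1
    return stat, df, chi2_sf(stat, df)


# Hypothetical counts for N = 2 genotypes (rows) by gender (columns).
case_table = [[30, 10], [10, 30]]
control_table = [[20, 20], [20, 20]]
print(differential_interaction(case_table, control_table))
```

In the toy example the genotype and gender counts are associated in cases but not in controls, so pooling dilutes the association and the difference between the summed and pooled statistics is large.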

Significant association between two-locus patterns and RA
We identified 2.6 million patterns, mostly containing two markers, from the NARAC chromosome 18 data set, using a support threshold of 20 and a locus threshold of 2. From this set, 65,689 patterns were found to have p-values ≤ 0.01 before multiple testing correction. After Bonferroni correction, ten patterns remained significant, as shown in Table 1 (Column "Adjusted p-value"). Interestingly, all significant patterns share the following characteristics: 1) they all contain marker SNP0177 with allele 1, suggesting a dominant effect for marker SNP0177; 2) they all have an odds ratio of around two, suggesting a modest relative risk; 3) they all have a large case support (more than 50% of all cases), suggesting a rather common inheritance; and 4) they are all mapped to intergenic regions on chromosome 18. Furthermore, except for markers SNP0672 and SNP0177, all other markers in those ten patterns are mapped to two LD blocks as identified by the HapBlock program.
To further evaluate the significance of those ten patterns, we performed a Monte Carlo simulation using 400 simulated data sets generated by randomizing the case-control assignment of the 920 individuals in the study while maintaining the female/male ratio in both cases and controls. The pattern-based method was applied to each simulated data set. On average, 2.7 million patterns were identified from each data set, which were then subjected to the same test statistic and multiple testing correction. In 8 out of 400 (2%) simulated data sets we observed ten or more patterns that had p-values less than 0.05 after multiple testing correction (false-positive patterns). In 20 out of 400 (5%) simulated data sets we observed five or more false-positive patterns. These results suggest that the ten significant patterns discovered in the real data set are unlikely to be an artifact of chance alone. Furthermore, if we consider the fact that these ten patterns share the same marker, SNP0177, then the significance becomes even stronger: out of the 400 simulated data sets there was only 1 (0.25%) in which ten or more patterns shared a common marker.
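One way to randomize case-control assignment while keeping the female/male ratio fixed in both groups is to shuffle the labels separately within each gender. The sketch below assumes that approach (the text does not describe the exact procedure) and uses hypothetical names and toy data; the empirical rate at the end reproduces the 8-in-400 arithmetic.

```python
import random


def permute_labels_within_gender(labels, genders, rng):
    """Shuffle case/control labels within each gender stratum.

    The number of female cases, female controls, male cases, and male
    controls is unchanged, so the female/male ratio is preserved.
    """
    permuted = list(labels)
    for gender in sorted(set(genders)):
        idx = [i for i, g in enumerate(genders) if g == gender]
        stratum = [labels[i] for i in idx]
        rng.shuffle(stratum)
        for i, label in zip(idx, stratum):
            permuted[i] = label
    return permuted


def empirical_rate(observed_count, null_counts):
    """Fraction of simulated data sets with at least `observed_count`
    significant patterns."""
    hits = sum(1 for c in null_counts if c >= observed_count)
    return hits / len(null_counts)


# Hypothetical toy cohort: 6 females and 4 males.
labels = ["case", "case", "case", "control", "control", "control",
          "case", "control", "control", "control"]
genders = ["F"] * 6 + ["M"] * 4
rng = random.Random(0)  # fixed seed only to make the sketch repeatable
print(permute_labels_within_gender(labels, genders, rng))

# 8 of 400 simulated data sets reached ten or more significant patterns.
null_counts = [10] * 8 + [0] * 392
print(empirical_rate(10, null_counts))  # 8 / 400
```

Each permuted label vector would then be run through the full pattern discovery and significance evaluation pipeline to obtain one entry of `null_counts`.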

Significant interaction between multilocus genotypes and gender in RA
It is known that the incidence rate of RA is higher in females. Indeed, there is a matching 3.82:1 female/male ratio in both the case sample and the control sample of this data set. However, as shown in Table 1 (columns "No. of cases (F/M ratio)" and "No. of controls (F/M ratio)"), notable differences in the female/male ratio between cases and controls were observed for many significant patterns. To investigate the genotype × gender interaction further, we constructed 2 × N contingency tables for each marker in the significant patterns and for each significant pattern, with the observed genotypes for the marker or pattern as rows. Four variations of the contingency tables were constructed, as detailed in the legend of Table 2. Table 2 shows a representative pattern demonstrating significant interaction between multilocus genotypes and gender. Individually, neither marker SNP0177 nor marker SNP0615 was significant in the male population (bold). Together, however, they demonstrated strong significance in the male population that reached the same magnitude (p = 0.00118) as in females (p = 0.00109) despite the much smaller sample size. When both female and male populations were considered together ("With gender partition"), an even stronger association between the pair of markers and the disease was observed. For individuals with the 1/1-1/2 genotype for markers SNP0177 and SNP0615, an odds ratio of 11.7 (confidence interval: 1.34-102.86) was observed in females over males. For individuals with the 2/2-1/2 genotype for markers SNP0177 and SNP0615, an odds ratio of 7.51 (confidence interval: 1.94-28.99) was observed in males over females. Logistic regression analysis on the interaction between the pattern and gender also yielded a significant p-value of 0.0021 (data not shown). Furthermore, a test for differential interaction in cases and controls yielded a p-value of 0.0016, suggesting that the interaction between the SNP0177-SNP0615 pattern and gender differs significantly between affected and unaffected individuals. Similar results were obtained for all patterns containing markers in the SNP0603-0615 LD block (data not shown). On the other hand, no differential interaction in cases and controls was observed for the three remaining significant patterns (data not shown).
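The width of the confidence intervals quoted above is typical of odds ratios computed from small cell counts. The sketch below uses the standard log-odds-ratio (Woolf) interval at the 95% level; the text does not state which interval method or level was used, so both are assumptions, and the counts are hypothetical rather than taken from Table 2.

```python
import math


def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio (a*d)/(b*c) with a Woolf (log-scale) confidence interval.

    a, b, c, d are the four cells of a 2 x 2 table and must all be positive.
    z = 1.96 corresponds to an approximate 95% interval.
    """
    odds_ratio = (a * d) / (b * c)
    se_log = math.sqrt(1.0 / a + 1.0 / b + 1.0 / c + 1.0 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log)
    upper = math.exp(math.log(odds_ratio) + z * se_log)
    return odds_ratio, lower, upper


# Hypothetical counts: same odds ratio, very different precision.
print(odds_ratio_ci(20, 10, 10, 20))      # moderate counts
print(odds_ratio_ci(200, 100, 100, 200))  # ten times the sample size
```

Both calls give the same point estimate, but the interval shrinks substantially with the larger counts because each cell contributes 1/count to the variance of the log odds ratio.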

Discussion
Several lines of evidence suggest that the significant multilocus associations we detected with Pattern Examiner on the NARAC chromosome 18 data set provided by GAW15 might be true associations: 1) except for markers SNP0177 and SNP0672, multiple markers from the same LD region (markers SNP1130 and 1131, markers SNP0603-0615) were found in different significant patterns; 2) all markers in the same LD region displayed a consistent dominant (an allele in the pattern) or recessive (a homozygote genotype in a pattern) effect; 3) a Monte Carlo simulation with randomized cases and controls produced a type I error probability of 0.02 for the observed results. The singleton SNP0177 does raise a flag because it is in a strong LD region with adjacent markers. Further investigation is necessary to confirm the role SNP0177 plays in the association. The ultimate proof of true association will have to come from replication studies. Taking advantage of our method's ability to detect multilocus association, we performed a novel multilocus gene × gender interaction analysis and detected a two-locus gene × gender interaction that was supported by three independent assays. Detailed analysis revealed large odds ratios for certain genotype combinations. However, the wide confidence intervals (due to small sample size) dampen our enthusiasm. Again, further investigation with larger sample sizes is necessary to confirm this observation.
Several reports in the "Association - Problem 2" group of GAW15 have identified the LD region including markers SNP1097-1107 to be significantly associated with RA [11; Zhu G, Rao S, Li X, personal communication]. Although we also found markers SNP1097-1107 to be significant using a genotype-based single-marker χ² test (p = 0.0002 for most markers in this region), the significance disappeared after Bonferroni correction with 2300 markers. After applying a less stringent Bonferroni correction using the number of LD blocks (233 LD blocks were identified by the HapBlock program) in this 10-Mb region, those markers resurfaced as significant. Because none of the significant patterns identified by the pattern-based method included markers SNP1097-1107, this locus might act alone as a risk factor for RA.
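The effect of the two correction factors can be checked with simple arithmetic. The sketch below assumes the standard Bonferroni adjustment (raw p-value multiplied by the number of tests, capped at 1) and the 0.05 threshold used elsewhere in this report; the function name is illustrative.

```python
def bonferroni(p_value, n_tests):
    """Standard Bonferroni adjustment, capped at 1."""
    return min(1.0, p_value * n_tests)


raw_p = 0.0002  # single-marker p-value reported for most markers in the region

per_marker = bonferroni(raw_p, 2300)  # correcting for all 2300 markers
per_block = bonferroni(raw_p, 233)    # correcting for the 233 LD blocks

print(f"{per_marker:.4f}", per_marker < 0.05)  # about 0.46, not significant
print(f"{per_block:.4f}", per_block < 0.05)    # about 0.047, significant
```

The block-level result sits only just below 0.05, so the conclusion for this region is sensitive to the choice of correction factor.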
By focusing on detecting multilocus association and interactions, the pattern-based method complements the traditional single marker association test. A preliminary power analysis (data not shown) suggested that this method has more power than the traditional single marker-based analysis under a couple of two-locus disease models. Additional power analysis is needed to further validate its utility.

Conclusion
We have identified two potential multilocus associations (SNP0177/SNP1130-1131 and SNP0177/SNP0603-0615) with RA as well as evidence of interaction between gender and loci SNP0177/SNP0603-0615.

Table 2 legend (opening text missing): ... Table 1 and for each significant pattern with the observed N genotypes as rows. Each contingency table has two columns for case and control. In "Without gender partition", individuals were grouped and allocated to table cells according to genotypes regardless of their gender. In "With gender partition", individuals were grouped together according to genotypes and gender with each combination of genotype and gender as a row. In "Males only", only males were grouped and allocated to table cells according to genotypes. In "Females only", only females were grouped and allocated to table cells according to genotypes.
b Bold font indicates results for males only.