Volume 1 Supplement 1
Pattern-based mining strategy to detect multi-locus association and gene × environment interaction
© Li et al; licensee BioMed Central Ltd. 2007
Published: 18 December 2007
As genome-wide association studies grow in popularity for the identification of genetic factors for common and rare diseases, analytical methods to comb through large numbers of genetic variants efficiently to identify disease association are increasingly in demand. We have developed a pattern-based data-mining approach to discover unlinked multilocus genetic effects for complex disease and to detect genotype × phenotype/genotype × environment interactions. On a densely mapped chromosome 18 data set for rheumatoid arthritis that was made available by Genetic Analysis Workshop 15, this method detected two potential two-locus associations as well as a putative two-locus gene × gender interaction.
Long considered to hold promise for dissecting the genetic etiology of complex diseases, genome-wide association studies have produced significant and sometimes highly reproducible findings in an increasing number of publications. Despite early successes that demonstrated the feasibility of whole-genome studies, strategies for fully analyzing genome-wide data, such as detecting multilocus genetic effects for complex disease or detecting genotype × phenotype/genotype × environment interactions, are still underdeveloped or underutilized, which limits the effectiveness of those large-scale studies in advancing our understanding of the genetic contribution to common human diseases [3, 4]. An increasing number of methods have been proposed to detect multilocus genetic association: some make use of haplotypes or logistic regression, while others use nonparametric "data mining" strategies such as multifactor dimensionality reduction (MDR) and neural networks. Methods to analyze pair-wise interactions between unlinked loci have also been developed.
We have developed a pattern-based mining strategy (manuscript in preparation) to detect local (markers in moderate or strong linkage disequilibrium, or LD) and global (unlinked markers) multilocus genetic associations as well as gene × gene/gene × environment interactions. The pattern-based method exhaustively yet efficiently identifies all patterns satisfying pre-defined pattern search criteria and evaluates the association of each pattern with disease state through a χ²-based test statistic. For the work described in this report, we applied the pattern-based method to the chromosome 18 data set on rheumatoid arthritis (RA) ascertained by the North American Rheumatoid Arthritis Consortium (NARAC) that was made available for the Genetic Analysis Workshop 15 (GAW15). Ten patterns were found to be significantly associated with the RA phenotype after multiple testing correction. The significance of those ten patterns was confirmed using Monte Carlo simulation. Furthermore, we identified a potentially significant multi-gene/gender interaction involving two loci: SNP0177 and the LD region containing markers SNP0603–SNP0615.
Pattern Examiner: a pattern-based method to detect multi-locus association and interaction
Data are organized in a two-dimensional data matrix with markers as columns, individuals as rows, and individuals' alleles or genotypes as cell values. Each marker is represented by five columns: one column for each of the two alleles of the marker and one column for each of the three possible genotypes for the marker. A pattern is defined as a maximal sub-matrix of the data matrix in which the value of each marker across all individuals in the sub-matrix satisfies a predefined equivalence criterion such as same genotype value. A sub-matrix is maximal if 1) no more rows can be added while keeping the columns fixed and, 2) no more columns can be added when keeping the rows fixed. Under this formulation patterns can be used to model both multilocus allelic and multilocus genotypic contributions to disease state.
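The pattern definition above can be illustrated with a small sketch. It assumes that each of the five columns per marker is a binary indicator (carries allele 1, carries allele 2, has genotype 1/1, 1/2, or 2/2); the function names `encode` and `maximal_pattern` and the toy markers M1 and M2 are hypothetical and are not taken from Pattern Examiner. Starting from a seed set of columns, the sketch collects every individual matching the seed and then every column those individuals share, which yields a sub-matrix to which no further row or column can be added.

```python
def encode(individuals):
    """Turn each individual's genotypes into a set of (marker, column) items.

    individuals: list of dicts mapping marker -> (allele, allele), alleles
    coded 1 or 2, or None when the genotype is missing.
    Assumed encoding: five binary columns per marker (two allele columns,
    three genotype columns); an individual holds the columns that apply.
    """
    rows = []
    for person in individuals:
        items = set()
        for marker, genotype in person.items():
            if genotype is None:  # missing genotype contributes no columns
                continue
            a, b = sorted(genotype)
            items.add((marker, f"allele {a}"))
            items.add((marker, f"allele {b}"))
            items.add((marker, f"genotype {a}/{b}"))
        rows.append(frozenset(items))
    return rows


def maximal_pattern(rows, seed_columns):
    """Grow a seed set of columns into a maximal sub-matrix.

    Returns (row_ids, columns): all rows that match the seed, and all
    columns shared by those rows. No row and no column can be added.
    """
    seed = frozenset(seed_columns)
    row_ids = [i for i, row in enumerate(rows) if seed <= row]
    if not row_ids:
        return [], seed
    columns = frozenset.intersection(*(rows[i] for i in row_ids))
    return row_ids, columns


# Toy data: four individuals typed at two hypothetical markers.
individuals = [
    {"M1": (1, 1), "M2": (2, 2)},
    {"M1": (1, 2), "M2": (2, 2)},
    {"M1": (1, 1), "M2": (1, 2)},
    {"M1": (2, 2), "M2": (2, 2)},
]
rows = encode(individuals)
row_ids, columns = maximal_pattern(rows, {("M1", "allele 1")})
```

In this toy data the seed "M1 allele 1" grows into a two-marker allelic pattern (M1 allele 1 together with M2 allele 2) supported by the first three individuals. Under this simple encoding a homozygous genotype column always implies the corresponding allele column, so a real implementation would presumably suppress such redundant columns when reporting patterns.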
Pattern Examiner is a nonparametric data mining-based method for the detection of multilocus associations and gene × gene/gene × environment interactions in data collected in population-based case/control studies. The method has two steps: 1) pattern discovery and 2) significance evaluation. In the pattern discovery step, patterns are identified using data from the case population only. The extensiveness (and execution time) of the pattern discovery step is controlled by two parameters: the support threshold, which specifies the minimum number of rows a pattern must have; and the locus threshold, which specifies the extent of locus interaction. For example, with the support and locus thresholds set to 20 and 2, respectively, all reported patterns will have a case support of 20 or more and mostly one or two markers. In the significance evaluation step, a 2 × 2 contingency table is constructed for each pattern to tally its support in the case and control populations ("case support" and "control support", respectively). The two categorical variables tabulated are population type ("cases" vs. "controls") and pattern match status ("matches" vs. "does not match"). Partially missing data are excluded. The p-value is obtained from a χ² test of independence and then adjusted for multiple testing. A modified Bonferroni correction is applied to each pattern, using as the correction factor the number of patterns whose case support is equal to or greater than that of the pattern under evaluation, rather than the total number of patterns identified. As a result, the adjusted significance is robust against the arbitrary selection of parameter values in the pattern discovery step. The odds ratio with confidence interval is also calculated for each pattern.
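The significance evaluation step can be sketched as follows. This is a minimal illustration under stated assumptions, not the Pattern Examiner implementation: the χ² test is taken to be Pearson's test without continuity correction, the confidence interval is taken to be the 95% interval from the standard log odds ratio (Woolf) approximation, and the function names are hypothetical. The text does not specify either choice.

```python
import bisect
import math


def chi2_2x2(a, b, c, d):
    """Pearson chi-square (1 df, no continuity correction) for the table
    [[a, b], [c, d]]: a = cases matching, b = cases not matching,
    c = controls matching, d = controls not matching.
    Returns (statistic, p_value)."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0, 1.0
    stat = n * (a * d - b * c) ** 2 / denom
    # For 1 degree of freedom the upper tail equals erfc(sqrt(x / 2)).
    return stat, math.erfc(math.sqrt(stat / 2))


def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio with an approximate 95% interval (assumed Woolf method).
    Requires all four cells to be non-zero."""
    odds_ratio = (a * d) / (b * c)
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return odds_ratio, odds_ratio * math.exp(-z * se), odds_ratio * math.exp(z * se)


def modified_bonferroni(patterns):
    """patterns: list of (case_support, raw_p_value).
    Each p-value is multiplied by the number of patterns whose case
    support is equal to or greater than its own, capped at 1."""
    supports = sorted(support for support, _ in patterns)
    adjusted = []
    for support, p in patterns:
        factor = len(supports) - bisect.bisect_left(supports, support)
        adjusted.append(min(1.0, p * factor))
    return adjusted
```

Because the correction factor counts only patterns with at least as much case support, lowering the support threshold adds patterns that do not change the adjusted p-values of the better-supported ones, which is the robustness property described above.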
Test for differential gene × gender interactions in cases vs. controls
To test the null hypothesis that genotype × gender interaction does not differ between cases and controls, three N × 2 contingency tables were constructed, with the N observed genotypes of a significant pattern as rows and the two genders as columns. Three chi-square values, χ²_case, χ²_control, and χ²_pooled, were then obtained for cases only, controls only, and cases and controls pooled, respectively. The p-value was obtained using χ² = χ²_case + χ²_control − χ²_pooled with N − 1 degrees of freedom.
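A minimal sketch of this heterogeneity test is given below, assuming Pearson chi-square statistics for each table. The function names are hypothetical, and the tail probability is computed with a closed-form recurrence for integer degrees of freedom so that only the standard library is needed.

```python
import math


def pearson_chi2(table):
    """Pearson chi-square statistic for a table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            if expected > 0:
                stat += (observed - expected) ** 2 / expected
    return stat


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable, integer df >= 1."""
    if x <= 0:
        return 1.0
    if df % 2 == 1:
        q, k = math.erfc(math.sqrt(x / 2)), 1   # exact tail for 1 df
    else:
        q, k = math.exp(-x / 2), 2              # exact tail for 2 df
    while k < df:                               # step from k to k + 2 df
        q += (x / 2) ** (k / 2) * math.exp(-x / 2) / math.gamma(k / 2 + 1)
        k += 2
    return q


def interaction_heterogeneity(case_table, control_table):
    """Tables have one row per genotype and one column per gender.
    Returns (statistic, df, p_value) for chi2_case + chi2_control - chi2_pooled."""
    pooled = [[x + y for x, y in zip(case_row, control_row)]
              for case_row, control_row in zip(case_table, control_table)]
    stat = (pearson_chi2(case_table) + pearson_chi2(control_table)
            - pearson_chi2(pooled))
    df = len(case_table) - 1
    return stat, df, chi2_sf(stat, df)
```

With Pearson statistics the combined value is not guaranteed to be non-negative in small samples; the sketch simply reports a p-value of 1 in that case.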
Significant association between two-locus patterns and RA
Significant patterns identified with the pattern-based method
[Values for the columns No. of cases (F/M ratio), No. of controls (F/M ratio), and Odds ratio (confidence interval) are not available.]
Markers in pattern | p-value
SNP0177 (allele 1); SNP1131 (allele 1) | 5.04 × 10⁻⁸
SNP0177 (allele 1); SNP0610 (genotype 2/2) | 1.57 × 10⁻⁷
SNP0177 (allele 1); SNP1130 (allele 2) | 8.03 × 10⁻⁸
SNP0177 (allele 1); SNP0615 (genotype 1/1) | 3.08 × 10⁻⁷
SNP0177 (allele 1); SNP0603 (genotype 1/1) | 1.49 × 10⁻⁷
SNP0177 (allele 1); SNP0604 (genotype 2/2); SNP0605 (genotype 2/2) | 4.42 × 10⁻⁷
SNP0177 (allele 1); SNP0609 (genotype 1/1) | 3.41 × 10⁻⁷
SNP0177 (allele 1); SNP0606 (genotype 2/2) | 8.68 × 10⁻⁷
SNP0177 (allele 1); SNP0672 (allele 2) | 2.33 × 10⁻⁶
SNP0177 (allele 1); SNP0608 (genotype 1/1) | 4.91 × 10⁻⁷
To further evaluate the significance of those ten patterns, we performed a Monte Carlo simulation using 400 simulated data sets generated by randomizing the case-control assignment of the 920 individuals in the study while maintaining the female/male ratio in both cases and controls. The pattern-based method was applied to each simulated data set. On average, 2.7 million patterns were identified from each data set; these were then subjected to the same test statistic and multiple-testing correction. In 8 of the 400 (2%) simulated data sets we observed ten or more patterns with p-values less than 0.05 after multiple-testing correction (false-positive patterns). In 20 of the 400 (5%) simulated data sets we observed five or more false-positive patterns. These results suggest that the ten significant patterns discovered in the real data set are unlikely to be an artifact of chance alone. Furthermore, if we take into account that these ten patterns share the same marker, SNP0177, the evidence becomes stronger still: among the 400 simulated data sets there was only 1 (0.25%) in which ten or more patterns shared a common marker.
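One way to randomize case-control assignment while keeping the female/male ratio fixed in both groups is to shuffle the labels separately within each gender. The sketch below assumes that this is how the constraint was met, which the text does not state; the function names are hypothetical, and `count_false_positives` stands in for a full run of pattern discovery and significance evaluation on one simulated data set.

```python
import random


def permute_labels_within_gender(labels, genders, rng):
    """Shuffle case/control labels separately inside each gender stratum,
    so the number of female and male cases (and controls) is unchanged."""
    permuted = list(labels)
    for gender in sorted(set(genders)):
        idx = [i for i, g in enumerate(genders) if g == gender]
        shuffled = [labels[i] for i in idx]
        rng.shuffle(shuffled)
        for i, label in zip(idx, shuffled):
            permuted[i] = label
    return permuted


def empirical_rate(count_false_positives, labels, genders,
                   n_sim=400, threshold=10, seed=0):
    """Fraction of simulated data sets in which at least `threshold`
    patterns remain significant after correction.

    count_false_positives: caller-supplied function that takes permuted
    labels and returns the number of significant patterns found."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sim):
        permuted = permute_labels_within_gender(labels, genders, rng)
        if count_false_positives(permuted) >= threshold:
            hits += 1
    return hits / n_sim
```

With 400 simulated data sets, 8 hits at a threshold of ten patterns give the reported rate of 8/400 = 0.02.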
Significant interaction between multilocus genotypes and gender in RA
Evidence of interaction between multiple loci and gender in cases
[Table values not available; results were tabulated with gender partition and without gender partition.]
Several lines of evidence suggest that the significant multilocus associations we detected with Pattern Examiner on the NARAC chromosome 18 data set provided by GAW15 might be true associations: 1) except for marker SNP0177 and SNP0672, multiple markers from the same LD region (markers SNP1130 and 1131, markers SNP0603–0615) were found to be in different significant patterns; 2) all markers in the same LD region displayed consistent dominant (an allele in the pattern) or recessive (a homozygote genotype in a pattern) effect; 3) a Monte Carlo simulation with randomized cases and controls produced a type I error probability of 0.02 for the observed results. The singleton SNP0177 does raise a flag because it is in a strong LD region with adjacent markers. Further investigation is necessary to confirm the role SNP0177 plays in the association. The ultimate proof for true association will have to come from replication studies. Taking advantage of our method's ability to detect multilocus association, we performed a novel multilocus gene × gender interaction analysis and detected a two-locus gene × gender interaction that was supported by three independent assays. Detailed analysis revealed large odds ratios for certain genotype combinations. However, a wide confidence interval (due to small sample size) dampens our enthusiasm. Again, further investigation with larger sample sizes is necessary to confirm this observation.
Several reports in the "Association – Problem 2" group of GAW15 have identified the LD region including markers SNP1097–1107 to be significantly associated with RA [11; Zhu G, Rao S, Li X, personal communication]. Although we also found markers SNP1097–1107 to be significant using a genotype-based single-marker χ² test (p = 0.0002 for most markers in this region), the significance disappeared after Bonferroni correction for 2300 markers. After applying a less stringent Bonferroni correction based on the number of LD blocks (233 LD blocks were identified by the HapBlock program) in this 10-Mb region, those markers resurfaced as significant. Because none of the significant patterns identified by the pattern-based method included markers SNP1097–1107, this locus might act alone as a risk factor for RA.
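The effect of the two correction factors can be checked with a short calculation. The sketch below uses the reported single-marker p-value of 0.0002 and assumes the usual 0.05 significance level; the function name is hypothetical.

```python
def bonferroni(p_value, n_tests):
    """Bonferroni-adjusted p-value, capped at 1."""
    return min(1.0, p_value * n_tests)


p_single = 0.0002
by_markers = bonferroni(p_single, 2300)  # about 0.46, not significant at 0.05
by_blocks = bonferroni(p_single, 233)    # about 0.047, significant at 0.05
```

Correcting for 233 LD blocks rather than 2300 markers is what moves the adjusted p-value from roughly 0.46 to just under 0.05.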
By focusing on detecting multilocus association and interactions, the pattern-based method complements the traditional single marker association test. A preliminary power analysis (data not shown) suggested that this method has more power than the traditional single marker-based analysis under a couple of two-locus disease models. Additional power analysis is needed to further validate its utility.
We have identified two potential multilocus associations (SNP0177/SNP1130–1131 and SNP0177/SNP0603–0615) with RA as well as evidence of interaction between gender and loci SNP0177/SNP0603–0615.
We thank Drs. Fatemeh Haghighi, Peter Gregersen, Wentian Li, and Jurg Ott for many helpful discussions. This work is supported by a Small Business Innovation Research (SBIR) grant to ZL (2R44CA101432-02A1).
This article has been published as part of BMC Proceedings Volume 1 Supplement 1, 2007: Genetic Analysis Workshop 15: Gene Expression Analysis and Approaches to Detecting Multiple Functional Loci. The full contents of the supplement are available online at http://www.biomedcentral.com/1753-6561/1?issue=S1.
1. Risch N: Searching for genetic determinants in the new millennium. Nature. 2000, 405: 847-856. 10.1038/35015718.
2. Arking D, Pfeufer A, Post W, Kao WH, Newton-Cheh C, Ikeda M, West K, Kashuk C, Akyol M, Perz S, Jalilzadeh S, Illig T, Gieger C, Guo CY, Larson MG, Wichmann HE, Marbán E, O'Donnell CJ, Hirschhorn JN, Kääb S, Spooner PM, Meitinger T, Chakravarti A: A common genetic variant in the NOS1 regulator NOS1AP modulates cardiac repolarization. Nat Genet. 2006, 38: 644-651. 10.1038/ng1790.
3. Evans D, Cardon L: Genome-wide association: a promising start to a long race. Trends Genet. 2006, 22: 350-354. 10.1016/j.tig.2006.05.001.
4. Moore J, Ritchie M: The challenges of whole-genome approaches to common diseases. JAMA. 2004, 291: 1642-1643. 10.1001/jama.291.13.1642.
5. Lou X, Casella G, Littell R, Yang M, Johnson J, Wu R: A haplotype-based algorithm for multilocus linkage disequilibrium mapping of quantitative trait loci with epistasis. Genetics. 2003, 163: 1533-1548.
6. Tan Q, Christiansen L, Christensen K, Bathum L, Li S, Zhao J, Kruse T: Haplotype association analysis of human disease traits using genotype data of unrelated individuals. Genet Res. 2005, 86: 223-231. 10.1017/S0016672305007792.
7. Ritchie M, Hahn L, Roodi N, Bailey L, Dupont W, Parl F, Moore J: Multifactor-dimensionality reduction reveals high-order interactions among estrogen-metabolism genes in sporadic breast cancer. Am J Hum Genet. 2001, 69: 138-147. 10.1086/321276.
8. Lucek P, Ott J: Neural network analysis of complex traits. Genet Epidemiol. 1997, 14: 1101-1106. 10.1002/(SICI)1098-2272(1997)14:6<1101::AID-GEPI90>3.0.CO;2-K.
9. Zhao J, Jin L, Xiong M: Test for interaction between two unlinked loci. Am J Hum Genet. 2006, 79: 831-845. 10.1086/508571.
10. Zhang K, Qin Z, Chen T, Liu J, Waterman M, Sun F: HapBlock: haplotype block partitioning and tag SNP selection software using a set of dynamic programming algorithms. Bioinformatics. 2005, 21: 131-134. 10.1093/bioinformatics/bth482.
11. Tapper W, Collins A, Morton NE: Mapping a gene for rheumatoid arthritis on chromosome 18q21. BMC Proc. 2007, 1 (Suppl 1): S18.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.