Volume 3 Supplement 7
Integration of a priori gene set information into genome-wide association studies
© Sohns et al; licensee BioMed Central Ltd. 2009
Published: 15 December 2009
In genome-wide association studies (GWAS), genetic markers are often ranked to select genes for further pursuit. Especially for moderately associated and interrelated genes, information on genes and pathways may improve the selection. We applied and combined two main approaches for data integration, gene set enrichment analysis (GSEA) and hierarchical Bayes prioritization (HBP), in a GWAS for rheumatoid arthritis. Many associated genes are located in the HLA region on 6p21. However, the ranked lists of genes and gene sets differ considerably depending on the chosen approach: HBP changes the ranking only slightly, and its top 100 gene lists contain primarily HLA genes, whereas GSEA also includes many non-HLA genes.
With genotyping chips containing 500,000 or more single-nucleotide polymorphisms (SNPs) and good genome coverage, genome-wide association studies (GWAS) are now widely used to search for susceptibility genes for complex diseases. For 500,000 statistical tests in parallel and a nominal level of 0.05, the genome-wide significance level is 10⁻⁷ (0.05/500,000). Hence, moderately associated SNPs have a poor chance of being found at this level, even in very large samples. Often SNPs are ranked as a first step to select a "most promising" subset of SNPs or genes to follow up. Thus, it is of interest not to overlook so-called "gene sets" of related genes, e.g., related by pathway, function, or structure, which jointly account for genomic association with the investigated trait.
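The threshold quoted above can be reproduced with a short calculation. This is only a sketch of the arithmetic, read here as a Bonferroni-style division of the nominal level by the number of tests; the variable names are illustrative and not taken from the original analysis.

```python
# Genome-wide significance level: nominal level divided by the number
# of parallel tests (Bonferroni-style correction, as assumed here).
nominal_alpha = 0.05
n_tests = 500_000

genome_wide_alpha = nominal_alpha / n_tests
print(f"genome-wide level: {genome_wide_alpha:.0e}")
```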
Considering p-values for marker selection without external information may yield many false positives. Here we focus on two approaches that incorporate molecular genetic knowledge: hierarchical Bayes prioritization (HBP) by Lewinger et al. and gene set enrichment analysis (GSEA) as applied to GWAS by Wang et al. We compare the two methods and present ways to combine them in a GWAS for rheumatoid arthritis.
We applied the strategies to the genome-wide Genetic Analysis Workshop 16 (GAW16) Rheumatoid Arthritis (RA) data from the North American Rheumatoid Arthritis Consortium (NARAC). These data include 868 cases and 1194 controls, recruited to Institutional Review Board-approved protocols and genotyped on Illumina 550 k SNP chips. All research was carried out in accordance with the Declaration of Helsinki.
GSEA and HBP
Gene set methods in GWAS are based on an initial ranking of single SNPs by p-values (here: Cochran-Armitage trend test). They then either identify biologically relevant "pathways" with functional genetic variation or support the prioritization of associated candidate markers or genes. They can be seen as an enhancement to reveal the full spectrum of genes influencing disease.
GSEA was originally developed for gene expression analysis and was recently proposed for GWAS. To each gene we assigned the maximal test statistic among all of its SNPs and ranked the genes from largest to smallest maximum. The enrichment score (ES) for each gene set measures whether its genes are randomly distributed in the ranking or concentrated at the top. Statistical significance was assessed by permutations and the family-wise error rate (FWER). The leading edge subset (LES) was defined as the high-scoring genes of the significant gene sets that drive the ES.
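The gene-level scoring and the enrichment score can be illustrated with a minimal sketch. It assumes a weighted Kolmogorov-Smirnov-like running sum in the style of the original GSEA, with weighting exponent p = 1; it is not the GenGen implementation, and the function names, SNP identifiers, and toy statistics are hypothetical. Permutation-based significance is omitted.

```python
def gene_scores(snp_stats, snp_to_gene):
    """Assign each gene the maximal test statistic among its SNPs."""
    best = {}
    for snp, stat in snp_stats.items():
        gene = snp_to_gene.get(snp)
        if gene is None:          # SNP without gene annotation is skipped
            continue
        if gene not in best or stat > best[gene]:
            best[gene] = stat
    return best


def enrichment_score(scores, gene_set, p=1.0):
    """Maximum positive deviation of a weighted running sum.

    Walking down the ranked gene list, the sum increases for genes in the
    set (weighted by |statistic|**p) and decreases for genes outside it.
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    hits = [g for g in ranked if g in gene_set]
    n_miss = len(ranked) - len(hits)
    norm = sum(abs(scores[g]) ** p for g in hits)
    if not hits or n_miss == 0 or norm == 0:
        raise ValueError("gene set must be a proper, informative subset")
    running, es = 0.0, 0.0
    for g in ranked:
        if g in gene_set:
            running += abs(scores[g]) ** p / norm
        else:
            running -= 1.0 / n_miss
        es = max(es, running)
    return es


# Toy data (hypothetical): six SNPs mapped to four genes.
snp_stats = {"rs1": 9.0, "rs2": 4.0, "rs3": 7.0,
             "rs4": 1.0, "rs5": 0.5, "rs6": 3.0}
snp_to_gene = {"rs1": "A", "rs2": "A", "rs3": "B",
               "rs4": "C", "rs5": "D", "rs6": "D"}

scores = gene_scores(snp_stats, snp_to_gene)
es_top = enrichment_score(scores, {"A", "B"})     # set at the top of the ranking
es_bottom = enrichment_score(scores, {"C", "D"})  # set at the bottom
print(es_top, es_bottom)
```

In this toy ranking the set made up of the two top-ranked genes receives a much higher ES than the set made up of the two bottom-ranked genes, which is the behavior the ES is meant to capture.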
HBP aims to re-rank markers using prior covariates on each marker. Regression coefficients for the relationship between prior covariates and observed single-marker association statistics are estimated i) in a logistic model for the prior probability of association, using marker distance to genes and gene set information, and ii) in a linear model for the strength of association. Using the a posteriori probability that a marker is associated, a re-ranked marker list is created. For GSEA we used the GenGen package by Wang; for HBP we used a routine for the statistical package R provided by Lewinger et al.
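How a prior probability and a strength-of-association model combine into an a posteriori probability can be sketched as follows. This is a simplified two-group normal model chosen for illustration and may differ in detail from the formulation of Lewinger et al.; in particular, the coefficients alpha, beta and the variance tau2 are taken as given here, whereas HBP estimates them from the data. All names and numbers are hypothetical.

```python
import math


def normal_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def posterior_prob(t, z, w, alpha, beta, tau2):
    """A posteriori probability that a marker with statistic t is associated.

    Assumed model (simplification):
      prior probability  = logistic(alpha . z)        (logistic model)
      mean effect        = beta . w                   (linear model)
      t | not associated ~ N(0, 1)
      t | associated     ~ N(beta . w, 1 + tau2)
    """
    prior = 1.0 / (1.0 + math.exp(-sum(a * zi for a, zi in zip(alpha, z))))
    mu = sum(b * wi for b, wi in zip(beta, w))
    f1 = normal_pdf(t, mu, 1.0 + tau2)
    f0 = normal_pdf(t, 0.0, 1.0)
    return prior * f1 / (prior * f1 + (1.0 - prior) * f0)


# Two markers with the same statistic; only the second lies in a gene set
# (second prior covariate = 1), so it has the larger prior probability.
alpha, beta, tau2 = [-3.0, 2.0], [3.0], 1.0
p_outside = posterior_prob(2.5, z=[1, 0], w=[1], alpha=alpha, beta=beta, tau2=tau2)
p_inside = posterior_prob(2.5, z=[1, 1], w=[1], alpha=alpha, beta=beta, tau2=tau2)

# Re-ranking: markers are sorted by decreasing a posteriori probability.
reranked = sorted({"snp_outside": p_outside, "snp_inside": p_inside}.items(),
                  key=lambda kv: kv[1], reverse=True)
print(reranked)
```

Under these assumptions, two markers with identical test statistics are ranked differently because their prior covariates differ, which is the mechanism by which gene set information can change the ranking.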
Gene set and SNP annotation
For gene-to-pathway annotation (GtP), we used a file in the GenGen package, and for SNP-to-gene annotation (StG), we used files from Illumina. Gene Name Service (GNS) was used to ensure gene name consistency. We used information from the GtP as "gene set info" and information from the StG about the physical and functional position of the SNP relative to the nearest known/predicted gene (e.g., synonymous, coding, 3'UTR) as "SNP info". We combined gene sets with a large overlap and excluded sets with fewer than 11 genes. Finally, 876 gene sets remained.
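The gene set preparation can be sketched as below. The text does not state how "large overlap" was measured, so the Jaccard index and the threshold of 0.8 are assumptions made purely for illustration; only the minimum size of 11 genes comes from the description above. Function and set names are hypothetical.

```python
def jaccard(a, b):
    """Overlap of two sets as intersection size over union size."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def merge_overlapping(gene_sets, threshold=0.8):
    """Greedily merge each set into the first earlier set it overlaps strongly.

    The overlap measure and threshold are assumptions, not the authors' rule.
    """
    merged = []  # list of (name, genes) pairs
    for name, genes in gene_sets.items():
        genes = set(genes)
        for i, (m_name, m_genes) in enumerate(merged):
            if jaccard(genes, m_genes) >= threshold:
                merged[i] = (m_name + "+" + name, m_genes | genes)
                break
        else:
            merged.append((name, genes))
    return dict(merged)


def drop_small(gene_sets, min_genes=11):
    """Exclude sets with fewer than min_genes genes."""
    return {n: g for n, g in gene_sets.items() if len(g) >= min_genes}


# Toy example: S1 and S2 share 11 of 13 genes and are merged; S3 is too small.
raw_sets = {
    "S1": {f"g{i}" for i in range(12)},
    "S2": {f"g{i}" for i in range(11)} | {"g12"},
    "S3": {f"x{i}" for i in range(5)},
}
prepared = drop_small(merge_overlapping(raw_sets))
print({name: len(genes) for name, genes in prepared.items()})
```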
Strategies of data analysis with GSEA or HBP
I) GSEA alone
GSEA was performed on basic single-SNP association test statistics. The results are p-values for gene sets and a list of LES genes.
II) One-step HBP
HBP was performed using SNP information and gene set indicators (1 = gene in set, 0 otherwise) as prior covariates. The result is a ranking of SNPs.
III) Two-step HBP
HBP was performed using SNP information as prior covariate, followed by HBP additionally using gene set information. The average a posteriori probability of association of all remaining genes of the considered set was used as gene set information for all SNPs of a gene. The gene-specific probability is the maximum of the a posteriori probabilities of gene SNPs. The result is a ranking of SNPs.
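The construction of the gene set covariate for the second HBP step can be sketched as follows. The sketch assumes that "remaining genes" means the other genes of the set, excluding the gene under consideration, and that genes outside the set receive a value of 0; both points are interpretations rather than details confirmed by the description. Names and probabilities are hypothetical.

```python
def gene_probabilities(snp_post, snp_to_gene):
    """Gene-specific probability: maximum a posteriori probability of its SNPs."""
    probs = {}
    for snp, p in snp_post.items():
        gene = snp_to_gene[snp]
        probs[gene] = max(p, probs.get(gene, 0.0))
    return probs


def set_covariate(gene, gene_set, gene_probs):
    """Average probability of the other genes in the set (leave-one-out mean)."""
    if gene not in gene_set:
        return 0.0  # assumption: no set information for genes outside the set
    others = [gene_probs[g] for g in gene_set if g != gene and g in gene_probs]
    return sum(others) / len(others) if others else 0.0


# First-step a posteriori probabilities for four SNPs in three genes (toy values).
snp_post = {"rs1": 0.9, "rs2": 0.2, "rs3": 0.4, "rs4": 0.6}
snp_to_gene = {"rs1": "A", "rs2": "A", "rs3": "B", "rs4": "C"}
gene_set = {"A", "B", "C"}

gene_probs = gene_probabilities(snp_post, snp_to_gene)
# Every SNP inherits the covariate of its gene for the considered set.
snp_covariate = {snp: set_covariate(gene, gene_set, gene_probs)
                 for snp, gene in snp_to_gene.items()}
print(snp_covariate)
```

Because the covariate is derived from the observed association of the other genes in the set, it is data-driven rather than fixed in advance.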
IV) HBP followed by GSEA
HBP was performed using SNP information as prior covariate followed by GSEA using the a posteriori probabilities of HBP as entry ranking. Results are p-values for gene sets and a list of LES genes.
This index is 0 if all elements are different, and 1 if all lists contain the same elements. It may be roughly interpreted as the chance of an element to appear in another list as well.
After quality control and trend test, 334 SNPs are significant at the genome-wide level. They belong to 90 genes (81 in the HLA region) that are involved in 153 gene sets. Due to computer limitations we had to restrict the number of considered pathways to 100, thus using only the top 75 genes. This left only two a priori gene sets without genes from the HLA region and hence shifted the preference towards HLA.
Strategy I (GSEA) yielded 20 gene sets with FWER < 0.05. The 19 best-ranked gene sets contained the top 100 LES genes. Strategy IV (HBP+GSEA) resulted in three gene sets with FWER < 0.05. The two best-ranked gene sets contained 68 LES genes. Only a subset of the LES genes of the third gene set could be added to fill the list up to 100 top genes.
Comparison of most promising genes
Comparison of gene set ranking
Top gene sets after applying GSEA, one/two-step HBP, or HBP+GSEA
GWAS aim to discover new associations and novel disease genes. For complex diseases, many potentially interacting genes may be involved. Biological processes, indicated by gene sets rather than single genes, might warrant further investigation.
The gene set approaches cannot replace the original GWAS ranking, but they may identify additional SNPs within sets that escaped identification due to weak marginal effects. Locus heterogeneity within one pathway, also a possible replication problem, can be considered. Gene set approaches can help to structure results and to distinguish truly associated from unassociated markers. In this context, please note that not all biological details can be incorporated, especially because many gene sets are not yet well understood and updates in databases lag behind knowledge.
In this GWAS, 87 of the top 100 initially ranked genes are in the HLA region, a region well known for its role in RA. The special challenges are to contrast genes within the HLA region and also to identify non-HLA susceptibility genes.
Neither GSEA nor HBP is a gold standard for the integration of gene set information into GWAS. We found considerable differences in the resulting lists of the most promising genes and gene sets. The chance of a gene appearing in more than one of our final gene lists is only 50%. The same is true for gene sets. Although the top 100 gene lists of the two HBP-only approaches are almost identical, their lists of gene sets overlapped in only 3 of the top 20 entries. These heterogeneous results point to methodological differences. GSEA uses the ranking of genes to find enriched gene sets by summing ranks, while HBP uses prior gene set information to change the ranking of SNPs. Hence, GSEA leads directly to a list of the most promising gene sets and only builds a bridge to a list of genes via the LES. For HBP, the reverse is true.
In GSEA, genes with many SNPs are favored by using the maximal test statistic per gene. In HBP, considering all markers of a gene may penalize larger genes, because a true association signal at one marker might be diluted by all unassociated markers of the gene. GSEA corrects for linkage disequilibrium structure and for multiple testing by false-discovery rate or FWER by a computationally intensive permutation procedure. Neither correction is considered for the much faster HBP.
Strategies II and III differ only in the way gene set information is prepared for HBP. In Strategy II we used an indicator for a gene set as "prior" information; in Strategy III we used set-specific weights derived from the observed association, which is not strictly "prior".
In comparison with single-SNP analysis, HBP can "be superior when the proportion of true positive associations is not too small, as in GWAS with hundreds of truly associated SNPs". This can explain the difference in identified gene sets when compared with GSEA. Lewinger et al. also stated that "when the non-centrality parameters of the true associations are large enough to be picked by the raw test statistics there is little to be gained from prior covariates". Hence, in this GWAS with 334 genome-wide significant SNPs, the list of most promising genes did not change for HBP. However, with GSEA new non-HLA genes were identified. The methods have a substantial influence on the re-ranked list of top genes. Strategies I and IV incorporate significance of gene sets, while Strategies II and III use regression coefficients for selecting the top sets, which provide only information on the magnitude of up-ranking of the genes included in the gene sets.
Note that combining different ranking lists, which was not considered here, may lead to so-called voting paradoxes.
In summary, HBP keeps the prominent role of the HLA-complex while GSEA enriches the top gene list with non-HLA genes. Both methods identified the well known association of HLA and RA. The finding of non-HLA SNPs by the GSEA suggests that HLA and non-HLA markers are involved in the disease process. All strategies included the sets GO0002460 and hsa04612 in their top 20 gene set lists. Thus, all have the ability to recognize HLA-dominated gene sets as well as other sets. Because both approaches have their own rationale, the choice of the method is currently a matter of preference. The main advantage of HBP over GSEA is that different types of prior information can be considered, not only gene set information.
The considered methods were developed to increase signals jointly for weakly informative markers in different genes but within one gene set. Because GSEA uses only the maximal SNP test statistic per gene, several weakly informative markers within one gene will not be detected. This problem may be addressed by combining the SNP statistics with one-gene statistics or by processing SNP sets instead of gene sets. We concentrated on single-SNP methods. Please note that, depending on the context, haplotype approaches or machine learning methods might be more advantageous.
Considering prior information, e.g., sets of biologically interrelated genes, is a promising method in GWAS analysis. Some critical aspects still need to be examined, including whether to reduce the set of markers and how. The chosen method has a large impact, as the resulting lists of "most promising" genes or gene sets may be very different.
List of abbreviations used
FWER: Family-wise error rate
GAW16: Genetic Analysis Workshop 16
GSEA: Gene set enrichment analysis
GNS: Gene Name Service
GWAS: Genome-wide association study
HBP: Hierarchical Bayes prioritization
LES: Leading edge subset
NARAC: North American Rheumatoid Arthritis Consortium
The Genetic Analysis Workshops are supported by NIH grant R01 GM031575 from the National Institute of General Medical Sciences. This research was supported by the German National Genome Research Network (BMBF, grant 01GS0837).
This article has been published as part of BMC Proceedings Volume 3 Supplement 7, 2009: Genetic Analysis Workshop 16. The full contents of the supplement are available online at http://www.biomedcentral.com/1753-6561/3?issue=S7.
- Lewinger JP, Conti DV, Baurley JW, Triche TJ, Thomas DC: Hierarchical Bayes prioritization of marker associations from a genome-wide association scan for further investigation. Genet Epidemiol. 2007, 31: 871-882. 10.1002/gepi.20248.
- Wang K, Li M, Bucan M: Pathway-based approach for analysis of genomewide association studies. Am J Hum Genet. 2007, 81: 1278-1283. 10.1086/522374.
- Chasman DI: On the utility of gene set methods in genomewide association studies of quantitative traits. Genet Epidemiol. 2008, 32: 658-668. 10.1002/gepi.20334.
- Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, Paulovich A, Pomeroy SL, Golub TR, Lander ES, Mesirov JP: Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci. 2005, 102: 15545-15550. 10.1073/pnas.0506580102.
- GenGen. [http://openbioinformatics.org/gengen]
- Lin K-T, Liu C-H, Chiou J-J, Tseng W-H, Hsu C-N: Gene name service: no-nonsense alias resolution service for Homo sapiens genes. Proceedings 2007 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology Workshops (WI-IAT Workshops 2007): 2007 November 2-5; Silicon Valley. 2007, Washington, DC: IEEE, 185-188.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.