Skip to main content


Finding pathway regulators: gene set approach using peak identification algorithms


Recently, a number of different approaches have been used to examine variation in gene expression and to identify genes whose level of transcript differed greatly among unrelated individuals. Previous studies have commonly focused on identifying determinants that regulate gene expressions by targeting individual genes. However, it is difficult to detect true differences in the level of gene expression among genotypes from noise due to issues such as multiple testing and limited sample size. To increase the statistical power for detecting this difference, we consider a 'gene set' approach by focusing on subtle but coordinated changes in gene expression across multiple genes rather than individual genes. We defined a 'gene set' as a set of genes in the same biological pathway and focused on identifying common regulators based on an assumption that the genes within the same pathway are controlled by common regulators. We applied the gene set approach to the expression data of mRNA in Centre d'Etude du Polymorphisme Humain lymphoblast cells to identify regulators controlling the genes in a biological pathway. Our gene set approach successfully identified potent regulators controlling gene expression in an inflammatory response pathway.


In the past 2 years, profiling gene expression has become popular in the field of human genetics as various polymorphic functions located on the human genome have been characterized in the context of large-scale gene expression studies [15]. Understanding the genetic mechanisms controlling gene expression will give an insight into the functional elements of the human genome, and ultimately reveal functional polymorphisms affecting the risks associated with certain human diseases [6].

In previous studies, researchers have focused on identifying the expression quantitative trait loci (eQTL) that regulate gene expression by targeting individual genes. However, issues such as multiple testing, variations in expression across individuals, and limited sample size in human studies make it difficult to discern true differences in the level of gene expression from noise, particularly where alterations in gene expression among genotypes are modest. These problems have been well defined in analyses of microarray data, and several approaches have been suggested to increase the statistical power to detect this difference. It is by focusing on subtle but coordinated changes in gene expression across multiple genes in a gene set rather than individual genes that the most promising route to a solution is to be found [7].

In using the concept of 'gene set', we consider a two-step procedure to the analysis of the Genetic Analysis Workshop 15 (GAW15) Problem 1 data to identify regulators controlling the functional pathways in lymphoblast cells. The first step uses a peak identification algorithm (PIA) to detect genetic markers with relatively high and similar significance patterns in the same pathway. The second step applies Fisher's exact test to detect the common regulator (peak) of the pathway, under the assumption that the number of peaks at a specific marker has the same distribution as that for all other markers.


Identifying eQTL for a single gene

In this study, expression profiles for 8798 transcripts and genotype variations for 2882 single-nucleotide polymorphisms (SNPs) were selected. A multipoint genome-wide linkage analysis was performed using SIBPAL, a sub-routine of the Statistical Analysis for Genetic Epidemiology (S.A.G.E) software package. Pair-wise phenotype differences between siblings were weighted using an option ('W4') of SIBPAL.

Defining gene set by functional pathway

We defined a 'gene set' as a set of genes with a common biological pathway. Information on the pathway was obtained from four different sources to assign genes to each of the pathways: Kyoto Encyclopedia of Genes and Genomes (KEGG) [8], Gene Map Annotator and Pathway Profiler (GenMAPP) [9], Pharmacogenetics and Pharmacogenomics Knowledge Base (PharmGKB) [10], and BioCarta There are 467 different gene sets. Each gene set contains various numbers of genes, ranging from 5 to 53.

Peak identification

In order to detect genetic markers with relatively high and similar patterns of p-values to other genes in the same pathway, we applied the PIA. The PIA uses smoothing methods, which have the advantage of enhancing peaks and getting rid of spurious peaks by using the average of neighboring peaks.

First, we transformed the p-value from a multipoint genome-wide linkage analysis to s = -log(p-value). The higher value of s indicates the stronger effect of the marker on the level of gene expression. Next, we searched the peaks of s in the gene expression data for all SNP markers located on each gene. The values of s were first smoothed using the local variations from the nearest neighbors. Then potential peaks were identified using Ripley's K function [11]. In this study, the 'PROcess' of R package was used to identify these peaks

Detecting significant markers in the common pathway

Once peaks showing the significant effects on expression level of each gene were identified, the common peaks in a set of genes were sought. We assumed that the number of peaks at specific markers across genes has a hypergeometric distribution under the null hypothesis that the number of peaks of genes at a specific marker has the same distribution as that for all other markers. We tested the significance of sharing peaks at all specific markers across genes using Fisher's exact test.


Because our gene expression data were obtained from a lymphoblastic cell line, we focused on four different pathways directly related to immune systems, namely the 'inflammatory response pathway', 'glucocorticoid and inflammatory genes pathway', 'glucocorticoid and inflammatory genes pathway gene regulation', and 'cytokines and inflammatory response pathway'. These pathways contain 23, 7, 5, and 24 genes, respectively. Among these four pathways, only 'inflammatory response pathway' showed a significant result. Here, we present the results of a multipoint linkage test, peak identification, and summary peaks of genes related to this pathway.

Identifying eQTL for a single gene

The results of a multipoint linkage analysis for the 23 genes are shown in Table 1. The threshold of p-values in this linkage analysis was 7.903701 × 10-5 (e.g., adjusted significant level of 0.01 using the Bonferroni correction). The three significant eQTL regions controlling gene expression level were identified from the 23 genes. These three regions are unlinked (Table 1), so it could not be determined whether the corresponding QTLs work in coordination or act independently on the pathway.

Table 1 eQTLs of Genes involved in the Inflammatory response pathway

Peak identification

The 23 genes in the 'inflammatory response pathway' were smoothed using the nearest neighboring peaks. Then we found peaks of s value using the PIA. Figure 1A shows the smoothed s values and the local variances across all chromosomes. The blue line represents the smoothed peaks, the red circles represent the peaks selected, and the black line represents the local variance.

Figure 1

The result of peak identification and detecting common significant marker. A, The blue line represents the smoothed peaks, the red circles represent the peaks selected, and the black line represents the local variances. B, Peak identification of markers on chromosome 14. The markers in the red box represent the significant common markers of the inflammatory response pathway.

Detecting significant markers in the common pathway

Four significant markers were identified by Fisher's exact tests. The negative log p-values from Fisher's exact test for genome-wide markers are shown in Figure 2. The red line indicates the threshold of p-values adjusted by the Bonferroni correction. Markers showing statistical significances and their positions are shown in Table 2. Among the four markers, the lowest p-value 1.5787 × 10-9 was observed at rs766737, which is linked to the regulator of the 'inflammatory response pathway'. The six different genes among the 23 genes shared common peaks at rs766737 located on chromosome 14 (Figure 1B).

Figure 2

The result of Fisher's exact test. The y-axis is the -log(p-value) from Fisher's exact test for each marker. The red line indicates the adjusted significance level using Bonferroni correction.

Table 2 The significant markers selected by Fisher's exact test

Discussion and conclusion

In this study, we proposed a method to integrate genetic data and gene expression data based on the gene set approach. We defined a 'gene set' as a set of genes in the same biological pathway and focused on identifying common regulators based on an assumption that genes within the same pathway are controlled by common regulators, either directly or indirectly. However, the genes in the low cascade of the pathway are not easily detected by using a classical linkage analysis with a strict threshold adjusting for multiple comparison problems, even though these genes are controlled by common regulators.

We, therefore, proposed a two-step procedure for detecting pathway regulators. In the first step, we performed the PIA to detect p-values in a multipoint linkage analysis. The PIA allowed us to choose markers controlling subtle overall changes in gene expression levels. Once peaks were identified, the peaks of a specific gene were compared to those of other genes in the same pathway, in the second step. For detecting the common peaks of genes within the same pathway at a specific marker, we performed Fisher's exact test and obtained one potent regulator for the inflammatory response pathway.

When we applied the gene set approach to the GAW 15 Problem 1 data, there were significant differences in results between the classical eQTL approach and the gene set approach. First of all, the classical eQTL approach only detected 3 genes with significant eQTL among the 23 genes. However, these 3 eQTLs are not linked, so we were unable to conclude whether or not these genes are regulated by the same regulators. On the other hand, the gene set approach identified a significant common regulator controlling gene expressions in the 'inflammatory response pathway'. The most significant marker, rs766737, is located within 2 kb from a mRNA transcript for the T cell receptor alpha locus (TRA@) on 14q11.2.

The T cell receptor (TCR) is a molecule found on the surface of T lymphocytes that is responsible for recognizing antigens bound to major histocompatibility complex (MHC) molecules. The TCRs recognize foreign antigens, and then convey the message to the nucleus to induce an inflammatory response [12]. Our bodies produce many T cells, each with specific TCRs on their surfaces through the recombination of the genes that encode the receptors, before they have encountered complementary antigens. Thus, the gene set approach convinced us that the genotype variation in a TCR has an effect on the expression level of genes in an inflammatory response pathway.

Our method has two advantages over the classical eQTL approach. First, it makes possible to obtain a more functional inference of the result on the basis of the biological pathway. Using prior knowledge we were able to obtain regulators of the whole pathway rather than individual genes. Second, it makes possible to detect relatively small but global effects on the genes interacting in the same pathway.

The proposed method can be further extended. The current method does not consider any linkage disequilibrium block structures. We think a test for significant regions using flanking markers could improve the power of detection of the common regulators.


  1. 1.

    Brem RB, Yvert G, Clinton R, Kruglyak L: Genetic dissection of transcriptional regulation in budding yeast. Science. 2002, 296: 752-755. 10.1126/science.1069516.

  2. 2.

    Cheung VG, Conlin LK, Weber TM, Arcaro M, Jen KY, Morley M, Spielman RS: Natural variation in human gene expression assessed in lymphoblastoid cells. Nat Genet. 2003, 33: 422-425. 10.1038/ng1094.

  3. 3.

    Monks SA, Leonardson A, Zhu H, Cundiff P, Pietrusiak P, Edwards S, Phillips JW, Sachs A, Schadt EE: Genetic inheritance of gene expression in human cell lines. Am J Hum Genet. 2004, 75: 1094-1105. 10.1086/426461.

  4. 4.

    Morley M, Molony CM, Weber TM, Devlin JL, Ewens KG, Spielman RS, Cheung VG: Genetic analysis of genome-wide variation in human gene expression. Nature. 2004, 430: 743-747. 10.1038/nature02797.

  5. 5.

    Chesler EJ, Lu L, Shou S, Qu Y, Gu J, Wang J, Hsu HC, Mountz JD, Baldwin NE, Langston MA, Threadgill DW, Manly KF, Williams RW: Complex trait analysis of gene expression uncovers polygenic and pleiotropic networks that modulate nervous system function. Nat Genet. 2005, 37: 233-242. 10.1038/ng1518.

  6. 6.

    Pastinen T, Ge B, Hudson TJ: Influence of human genome polymorphism on gene expression. Hum Mol Genet. 2006, 15: R9-R16. 10.1093/hmg/ddl044.

  7. 7.

    Mootha VK, Lindgren CM, Eriksson KF, Subramanian A, Sihag S, Lehar J, Puigserver P, Carlsson E, Ridderstråle M, Laurila E, Houstis N, Daly MJ, Patterson N, Mesirov JP, Golub TR, Tamayo P, Spiegelman B, Lander ES, Hirschhorn JN, Altshuler D, Groop LC: PGC-1alpha-responsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes. Nat Genet. 2003, 33: 267-273. 10.1038/ng1180.

  8. 8.

    Kanehisa M, Goto S, Kawashima S, Okuno Y, Hattori M: The KEGG resource for deciphering the genome. Nucleic Acids Res. 2004, 32: D277-D280. 10.1093/nar/gkh063.

  9. 9.

    Dahlquist KD, Salomonis N, Vranizan K, Lawlor SC, Conklin BR: GenMAPP, a new tool for viewing and analyzing microarray data on biological pathways. Nat Genet. 2002, 31: 19-20. 10.1038/ng0502-19.

  10. 10.

    Thorn CF, Klein TE, Altman RB: PharmGKB: the pharmacogenetics and pharmacogenomics knowledge base. Methods Mol Biol. 2005, 311: 179-191.

  11. 11.

    Ripley BD: Modelling spatial patterns. J Roy Stat Soc Series B. 1977, 39: 172-179.

  12. 12.

    Purves WK, Sadava D, Orians GH, Heller C: Life: The Science of Biology. 2001, Sunderland, MA: Sinauer Associates and W. H. Freeman and Company, 6

Download references


This work was supported by the National Research Laboratory Program of Korea Science and Engineering Foundation (M10500000126).

This article has been published as part of BMC Proceedings Volume 1 Supplement 1, 2007: Genetic Analysis Workshop 15: Gene Expression Analysis and Approaches to Detecting Multiple Functional Loci. The full contents of the supplement are available online at

Author information

Correspondence to Taesung Park.

Additional information

Competing interests

The author(s) declare that they have no competing interests.

Rights and permissions

Reprints and Permissions

About this article


  • Significant Marker
  • Common Regulator
  • Common Peak
  • Multipoint Linkage Analysis
  • Significant eQTL