Gene network reconstruction from microarray data
© Jaffrezic and Tosser-Klopp; licensee BioMed Central Ltd. 2009
Published: 16 July 2009
Often, software available for biological pathways reconstruction rely on literature search to find links between genes. The aim of this study is to reconstruct gene networks from microarray data, using Graphical Gaussian models.
The GeneNet R package was applied to the Eadgene chicken infection data set. No significant edges were found for the list of differentially expressed genes between conditions MM8 and MA8. On the other hand, a large number of significant edges were found among 85 differentially expressed genes between conditions MM8 and MM24.
Many edges were inferred from the microarray data. Most of them could, however, not be validated using other pathway reconstruction software. This was partly due to the fact that a quite large proportion of the differentially expressed genes were not annotated. Further biological validation is therefore needed for these networks, using for example in vitro invalidation of genes.
Two main approaches have been proposed in the literature for gene network reconstruction from microarray data, namely Bayesian networks and Graphical Gaussian models. Bayesian networks are directed acyclic graphs, i.e. no feedback loop is possible. They are usually very computationally intensive and, as far as we are aware of, no R package is available for large-scale gene network reconstruction using Bayesian networks. On the other hand, Graphical Gaussian models are undirected graphs and are very computationally efficient. An R package is available for gene network reconstruction from microarray data using Graphical Gaussian models, namely GeneNet .
Werhli et al.  presented a comparison study between Bayesian networks and Graphical Gaussian models for gene network reconstruction. They concluded that both methods provided quite similar results for network reconstruction based on observed microarray data. We therefore chose, in this study, to base inference on Graphical Gaussian models.
Graphical Gaussian models
Let Xbe the observed data matrix with N rows, corresponding to the number of samples, and G columns, corresponding to the number of genes. Xis supposed to follow a multivariate normal distribution (μ, Σ), with mean vector μ= (μ1,...., μ G )' and positive-definite covariance matrix Σ = (σ ij )(1≤i,j≤G).
Covariance parameters σ ij can also be written as: σ ij = ρ ij σ i σ j , where and are the variance terms for genes i and j, respectively. Parameter ρ ij corresponds to the Pearson correlation coefficient between genes i and j.
with Σ-1 = (ω ij ), for 1 ≤ i, j ≤ G.
Second, the partial correlation matrix has to be calculated using the previous equations. Finally, statistical tests can be performed to determine the partial correlation coefficients that are different from 0, and which correspond to the significant edges of the graph.
The procedure described above is, however, only applicable when sample size N is larger than the number of variables G. In fact, the sample covariance matrix is otherwise not positive-definite and cannot be inverted, which prevents a direct computation of the partial correlation matrix. In microarray experiments, however, we are very often in situations where sample size N is much smaller than the total number of genes G.
Schäfer and Strimmer  therefore proposed to use a shrunk estimate of the covariance matrix using a James-Stein estimator. The aim of this approach is to construct a well conditioned positive-definite matrix so that the matrix has full rank and can easily be inverted.
where is the estimated empirical covariance matrix. Shrinkage parameter λ is chosen to minimize the mean-squared error (MSE) and can be determined analytically .
where s ij are the empirical covariance parameters.
An edge-specific local FDR procedure was then defined, based on the estimated partial correlation coefficients. As recommended by Efron , an edge is considered significant if its local FDR value is smaller than 20%.
The GeneNet R package  was applied to the Eadgene chicken infection data set  and the R code used to produce these analyses is available from the first author. We considered here the lists of differentially expressed genes obtained for two sets of conditions. In condition MA, chickens were infected at two weeks of age with a parasite called Eimeria maxima and two weeks later with the parasite called Eimeria acervulina, and in condition MM, chickens were infected first with E. maxima and afterwards with the same parasite E. maxima. Two time points were sampled post infection: 8 hours and 24 hours. At a 5% Benjamini-Hochberg (BH) threshold, 85 genes were found differentially expressed between groups MM and MA at 8 hours post infection, whereas 800 genes were found differentially expressed at a 5% BH threshold for condition MM between the two time points 8 and 24 hours. Due to the quite small number of biological replicates per condition (5 animals), network inference can only be performed on a few dozens of genes. For conditions MM8 and MM24, we therefore considered a more stringent BH threshold, with 116 differentially expressed genes at a 1% Benjamini-Hochberg threshold.
As no missing values are allowed in GeneNet, 58 genes were used for network reconstruction among the list of differentially expressed genes between conditions MM8 and MA8. Network inference was performed using the expression values for each condition independently. For this first analysis, no significant edges were found at the recommended 20% local FDR  threshold, for either condition MM8 or MA8.
For conditions MM8 and MM24, as no missing values are allowed, network reconstruction was based on 85 genes among the 116 found differentially expressed at a 1% Benjamini-Hochberg threshold.
For expression values observed in condition MM8, 2356 edges were found significant at the 20% local FDR threshold among the 85 genes, and even 1964 edges were found significant at the more stringent 5% local FDR threshold. Similarly, a very large number of edges were found significant between these 85 genes for condition MM24. In fact, 1760 edges were found significant at the 20% local FDR threshold, and 1156 at the 5% local FDR threshold.
Names of the annotated genes for condition MM8.
Human ortholog HGNC
Names of the annotated genes for condition MM24.
Human ortholog HGNC
Gene network reconstruction was based here on the expression data from this experiment only. It would be interesting to integrate some prior biological knowledge such as gene relationships already found in the literature, or to combine expression values from several studies, to have more power and accuracy for the edge detection.
Biological validation of the edges inferred here with Graphical Gaussian models was very difficult, mainly due to the lack of annotation for the lists of differentially expressed genes. An important effort therefore has to be made in the near future to obtain a more complete annotation of the chicken genome and other livestock species.
In the approach presented here, and based on partial correlations, it is only possible to model linear dependencies between genes. In order to take into account non linear relationships it may be possible, as suggested by Hausser and Strimmer  to use entropy instead of partial correlations to infer the edges between genes.
As only two time points were available in this study, static networks were considered here using the expression values at each time point separately. Several methods have recently been proposed for gene network reconstruction in time course studies, mainly based on VAR1 models . If additional time points were added in the future to this experiment, it would be interesting to use these methods to study the gene relationships over time.
This article has been published as part of BMC Proceedings Volume 3 Supplement 4, 2009: EADGENE and SABRE Post-analyses Workshop. The full contents of the supplement are available online at http://www.biomedcentral.com/1753-6561/3?issue=S4.
- Schäfer J, Strimmer K: A shrinkage approach to large-scale covariance matrix estimation and implications for functional genomics. Stat Appl Genet Mol Biol. 2005, 4: 32-Google Scholar
- Werhli AV, Grzegorczyk M, Husmeier D: Comparative evaluation of reverse engineering gene regulatory networks with relevance networks, graphical gaussian models and bayesian models. Bioinformatics. 2006, 22: 2523-2531. 10.1093/bioinformatics/btl391.View ArticlePubMedGoogle Scholar
- Whittaker J: Graphical Models in Applied Multivariate Statistics. 1990, New York: WileyGoogle Scholar
- Efron B: Local false discovery rates. 2005, Technical report Department of Statistics, Stanford UniversityGoogle Scholar
- Swinkels W, Cornelissen J, Rebel A: Immune reactions after a homologous or heterologous challenge of broilers primed with Eimeria maxima. 2009 in pressGoogle Scholar
- Hausser J, Strimmer K: Entropy inference and the James-Stein estimator, with application to nonlinear gene association networks. 2009 in pressGoogle Scholar
- Opgen-Rhein R, Strimmer K: Learning causal networks from systems biology time course data: an effective model selection procedure for the vector autoregressive process. BMC Bioinformatics. 2007, 8 (Suppl 2): S3-10.1186/1471-2105-8-S2-S3.PubMed CentralView ArticlePubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.