MetaPath: identifying differentially abundant metabolic pathways in metagenomic datasets
© Liu and Pop; licensee BioMed Central Ltd. 2011
Published: 28 April 2011
Skip to main content
Volume 5 Supplement 2
© Liu and Pop; licensee BioMed Central Ltd. 2011
Published: 28 April 2011
Enabled by rapid advances in sequencing technology, metagenomic studies aim to characterize entire communities of microbes bypassing the need for culturing individual bacterial members. One major goal of metagenomic studies is to identify specific functional adaptations of microbial communities to their habitats. The functional profile and the abundances for a sample can be estimated by mapping metagenomic sequences to the global metabolic network consisting of thousands of molecular reactions. Here we describe a powerful analytical method (MetaPath) that can identify differentially abundant pathways in metagenomic datasets, relying on a combination of metagenomic sequence data and prior metabolic pathway knowledge.
First, we introduce a scoring function for an arbitrary subnetwork and find the max-weight subnetwork in the global network by a greedy search algorithm. Then we compute two p values (p abund and p struct ) using nonparametric approaches to answer two different statistical questions: (1) is this subnetwork differentically abundant? (2) What is the probability of finding such good subnetworks by chance given the data and network structure? Finally, significant metabolic subnetworks are discovered based on these two p values.
In order to validate our methods, we have designed a simulated metabolic pathways dataset and show that MetaPath outperforms other commonly used approaches. We also demonstrate the power of our methods in analyzing two publicly available metagenomic datasets, and show that the subnetworks identified by MetaPath provide valuable insights into the biological activities of the microbiome.
We have introduced a statistical method for finding significant metabolic subnetworks from metagenomic datasets. Compared with previous methods, results from MetaPath are more robust against noise in the data, and have significantly higher sensitivity and specificity (when tested on simulated datasets). When applied to two publicly available metagenomic datasets, the output of MetaPath is consistent with previous observations and also provides several new insights into the metabolic activity of the gut microbiome. The software is freely available at http://metapath.cbcb.umd.edu.
Metagenomics is a new scientific field that involves the analysis of organismal DNA sequences obtained directly from an environmental sample, enabling studies of microorganisms that are not easily cultured in a laboratory . Metagenomic studies, pioneered in the early 2000s , have recently increased in number and scope due to the emergence of next generation sequencing technologies. Due to the difficulty of assembling entire organisms from a metagenomic dataset , most analyses take a gene-centric view, treating the community as an aggregate and ignoring the exact assignment of genes to individual organisms. In fact, it can be argued that the environment is better characterized by its gene complement rather than by its taxonomic composition, given that similar biological functions can be performed by microbes of distinct taxonomic origins. Supporting this view is the observation that, while the taxonomic composition of the human gut microbiome varies significantly between people, the functional profile is remarkably stable across samples . The functional profile for a sample can be recovered by mapping sequences to gene families , subsystems  or metabolic pathways . The relative abundance of each functional category can be estimated by counting how many sequences are assigned to each category, and this information is the basis for detailed comparisons of the functional potential of different functions. In a typical comparative metagenomics experiment, random shotgun sequences are generated from a collection of samples belonging to two groups, for example, obese or lean twins , and healthy infants or adults . An important biological problem is to find differentially abundant functional signatures (e.g., genes or metabolic pathways) that are selected for by their local environments. Traditional analysis approaches compare the relative abundances of the categories one-at-a-time between different phenotypes, and compute the significance using one of several statistical approaches [8–10]. When comparing communities at the gene family level, many functional categories are commonly found to be differentially abundant, even after correcting for multiple hypothesis testing [3, 7]. The interpretation of these data can be daunting. An alternative approach focuses on functional subsystems and metabolic pathway comparisons , the number of which is much smaller than gene families. Results at these levels are easier to interpret and can provide a stronger evidence of distinct functional capacities than at the level of individual gene families. Such analyses, however, can be unnecessarily coarse. For example, the use of KEGG pathways as a basis for analysis is complicated by the following issues: (1) the definitions of pathways in KEGG are coercive, and the interactions between these pathways are ignored; (2) the genes in a pathway may not be fully covered by the identified genes in a metagenomic sample; (3) significant differences in the abundance of certain genes may be masked once the abundance of all genes in a pathway is aggregated.
To address these problems, we introduce a general method (MetaPath) for searching the global metabolic network to find differentially abundant finer-level subnetworks. For the purposes of this paper we define a subnetwork to be a connected set of genes that is statistically enriched or depleted in one group of samples. Underlying our approach is a statistical scoring system that captures the differential abundance for a given subnetwork, combined with a greedy search algorithm for a maximum weighted subgraph, to indentify the highest scoring subnetworks. Unlike previous approaches, MetaPath explicitly searches significant subnetwork in the global metabolic network (rather than the KEGG defined pathways), enabling us to detect subnetworks spanning predefined “containers”. In addition, we developed rigorous statistical methods that take into account the topology of the network when testing the significance of the subnetworks.
Using simulated datasets, we demonstrate that Metapath outperforms previously described approaches for comparing biological networks based on abundance data. We show that our findings are more robust to noisy data than the results of single gene comparisons, and that MetaPath can find finer-level subnetwork than can be found by comparing predefined KEGG pathways. We also discuss the biological significance of the results derived from the application of MetaPath to actual metagenomic datasets, demonstrating that the output from MetaPath is easy to interpret and provides valuable biological insights. The software is freely available at http://metapath.cbcb.umd.edu.
where is the inverse cumulative density function (CDF) of standard normal distribution; G1 and G2 represent two different phenotypic groups. Using this formula, if a reaction is more abundant in population G1, then its Z score will be positive and vice versa. We are specifically interested in finding a pathway whose reactions are either enriched or depleted as a whole, as apposed to previous approaches [13, 14] that identify active or perturbed subnetworks, which may contain a mixture of enriched and depleted components. Similar to the approach of  we define the aggregate score for a particular subnetwork to be the sum of the Z scores over all reactions contained within it: , where k is the size (number of metabolic reactions) of the subnetwork.
This algorithm tries to find a connected metabolic subnetwork, which can have any arbitrary structure, with maximum weight. However, it is believed that in metabolic networks, chains are especially more biologically meaningful and interesting, because they attempt to capture the structure of a series of reactions that are successively connected. To allow this idea, we modify line 8 of the above algorithm to “Pick an edge ej which has the highest weight of the edges that are adjacent to and have the same direction with ej-1”. Both searching algorithms are implemented in our program and can be selected through command-line parameters. To find all significant subnetworks (computing significance is discussed below), we iteratively remove the edges in the global network that are contained in previously found significant subnetworks, and rerun our greedy search on the rest of the network until we can no longer find any additional significant subnetworks. Note, that unlike the original version of our code , the search algorithm is not limited to given subnetwork size, rather will find all significant subnetworks irrespective of size.
Differential abundance is assessed on an edge-by-edge basis (reaction-by-reaction) using Metastats;
The significance estimates (p-values) from Metastats are fed into a greedy search algorithm to determine all maximally weighted subnetworks(in terms of statistical Z-scores) in the global metabolic network;
The significance of each subnetwork detected by the greedy search algorithm is assessed using both a topology-independent bootstrapping approach (p abund ), and a topology-dependent bootstrapping approach (p struct );
The subnetworks determined to be significant (p abund ≤ 0.05 and p struct ≤ 0.05) are reported to the user (Note: the threshold for significance can be adjusted through command-line parameters). The pathways are ranked by p abund values.
In order to validate our methods, we have designed a simulated metagenomic study and compared the results with three previous approaches: (i) identifying significantly active subnetworks using simulated annealing and greedy search ; (ii) discovering significant individual reactions using Metastats ; and (iii) finding differentially abundant KEGG defined pathways, an approach widely used in metagenomic functional comparison [3, 7, 10]. We choose these tools because they are addressing similar biological problems. However they do not solve the exact same problem as this paper, which is finding differentially abundant subnetworks that may span two or more KEGG defined pathways (see discussion in the Background section). Here the goal of this simulated study is to show that the computational problem in this paper can not be directly solved by applying methods previously developed in a related context.
We designed a simulated metabolic pathways dataset in which five subjects are created for each of the two groups with distinct phenotypes. To generate the artificial reaction abundance matrix (where rows represent reactions and columns represent subjects), a Gaussian distribution is created for each reaction, whose mean is randomly chosen from a real metagenomic dataset (gut microbiome from obese and lean subjects ). The variance of each distribution is calculated by setting the relative standard deviation (standard deviation divided by the mean) to 0.2. If we define a reaction to be equally abundant between two groups under comparison, then a random abundance value is generated from the same distribution for each subject. Otherwise, if a reaction is defined to be significantly enriched in one group, then another normal distribution is created for this reaction by increasing the mean such that the p value of the difference for the two distributions is less than a predefined value (0.05 and 0.01 were used). In this study, we have chosen a subnetwork (a series of reactions with length 5 or 10) to be enriched in one population. The goal is to compare different methods in recovering this significant subnetwork (a set of significant reactions) based on the simulated abundance matrix. Biologically, the enriched pathways indicate functional enrichment of certain biological processes in a microbial community.
Five subnetworks (Fig. 5a-5e) are completely contained in the KEGG Fatty Acid Biosynthesis pathway, which consists of catabolic processes that can generate energy and primary metabolites from fatty acids. Our findings are consistent with previous observations and biochemical analysis in microbiota transplantation experiments in germ-free mice , where the concentrations of short-chain fatty acids in the caeca of obese mice are higher than lean mice, suggesting that the gut microbiome in obese subjects has an increased capacity for dietary energy harvest.
Another interesting significant networks consists of 10 reactions (Fig. 5f), of which 8 belong to Cysteine and Methionine Metabolism and 2 belong to Sulfur Metabolism. Many reactions in this subnetwork are connected by the L-Homocysteine molecule. In addition, three other subnetworks (Fig. 5g-5i) we discovered further confirm its potential involvement in obesity, because all these three pathways contain L-homocysteine as metabolite. It is well-known that a high level of blood serum homocysteine is a risk factor for cardiovascular disease , and obesity — an increasingly prevalent metabolic disorder — is closely associated with heart disease . Significant correlations between plasma homocysteine concentrations and obesity have been previously reported [18, 20–23]. The finding of increased potential for homocysteine metabolism within the obese gut microbiome provides an interesting hypothesis for future studies that, the gut microbiome may either have a direct role in the elevation of homocysteine levels in plasma, or may indirectly affect the hepatic biosynthesis of this amino-acid in the human body.
A second data-set comprises gut microbiome samples from 4 infants and 9 adults individuals which were sequenced by Kurokawa, et al., 2007 . The sequences were annotated and mapped to the reactions of KEGG pathway using BLASTX (E value < 10-8, hit length coverage ≥ 50% of a query sequence), resulting in total 1781 unique reactions within the 13 metagenomic samples. Based on 10 runs of Metastats, 383.7±1.56 reactions are significant using p value cutoff of 0.05, including 268.7±1.56 and 115±0 reactions enriched in infant and adult subjects respectively.Using a q value cutoff of 0.05, 167.2±2.7 reactions are significant, including 133.2±2.7 and 34±0 reactions enriched in infant and adult subjects respectively.Compared with the previous dataset (obese and lean twins samples), the predictions of significant reactions are much more consistent across different permutations.
The pathway in Fig. 6i (enriched in adult subjects) is part of the lipopolysaccharide biosynthesis. Lipopolysaccharides are a building block of the outer membrane of Gram-negative bacteria. The enrichment of pathway Fig. 6i in adult subject may be a result of the fact that Gram-negative bacteria are also enriched in adults. Specifically, Bacteroides, a genus of Gram-negative bacteria, are a major constituent of adult gut microbiome, but not highly prevalent in infants. Fig. 6h and Fig. 6j (enriched in adult) are pathways related with pyrimidine metabolism. The metabolites RNA, cytidine and uridine, which are contained in pyrimidine metabolism, are normally obtained from high RNA food such as organ meats, broccoli, and brewer’s yeast, which are not available to unweaned infants, as they are not present in high abundance in milk. The pathway in Fig. 6g (enriched in adult) is part of fructose and mannose metabolism a pathway related to carbohydrate metabolism. This is also consistent with COG-based analyses indicating that many mono- or disaccharides metabolism genes are enriched in adults , explained by the fact that colonic microbiota in adults uses indigestible polysaccharides as resources for energy production and biosynthesis of cellular components.
We have introduced a statistical method for finding significant metabolic subpathways from metagenomic datasets. Compared with previous methods, results from MetaPath are more robust to noise in the data, and have significantly higher sensitivity and specificity (when tested on simulated datasets). When applied to two publicly available metagenomic data-sets the output of MetaPath is consistent with previous observations and also provides several new insights into the metabolic activity of the gut microbiome. Finally, MetaPath is efficient: a typical metagenomic dataset and the corresponding metabolic network (about 2000 edges) can be analyzed in half an hour on a single processor.
While showing promising results, our methods have several limitations that we plan to address in the near future. First, and foremost, we restrict ourselves to pathways of a fixed length — a restriction necessary for accurately computing the null distribution of pathway scores. This can severely affect our ability to discover long pathways whose abundance differs only slightly, but significantly, between samples. Second, we currently estimate gene abundances by simply counting the number of sequencing reads that map to a certain gene. Such an approach ignores differences in the length of genes, potentially leading to incorrect conclusions. We plan to address this issue by incorporating a recently-published  method that can accurately correct for gene-length effects. The software described in this paper is freely-available under an open-source license from http://metapath.cbcb.umd.edu
We thank Niranjan Nagarajan, Carl Kingsford, James White and Saket Navlakha, Theodore Gibbons for helpful discussions. This work was supported in part by grants R01-HG004885 from the NIH, and IIS-0812111 from the NSF, both to MP.
This article has been published as part of BMC Proceedings Volume 5 Supplement 2, 2011: Proceedings of the 6th International Symposium on Bioinformatics Research and Applications (ISBRA'10). The full contents of the supplement are available online at http://www.biomedcentral.com/1753-6561/5?issue=S2.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.