Ecology-guided prediction of cross-feeding interactions in the human gut microbiome
Overview of the GutCP algorithm
Our approach rests on the idea that cross-feeding interactions (that is, knowledge of which metabolites each microbial species can consume and produce) can be used to mechanistically connect the levels of microbes and metabolites in the human gut. Several mechanistic models in past studies have shown that this is indeed possible (refs. 18, 20, 29, 36, 37). While GutCP is generalizable and can be used with any of these models, in this paper we use a previously published consumer-resource model (ref. 20). We use this model because of its context and performance: it is built specifically for the human gut and is best able to explain the experimentally measured species composition of the gut microbiome together with its resulting metabolic environment, or fecal metabolome (compared with other state-of-the-art methods, such as ref. 29). To predict the metabolome from the microbiome, it relies on a manually curated set of known cross-feeding interactions (ref. 9). It then uses these known interactions to follow the stepwise flow of metabolites through the gut. At each step (ecologically, at each trophic level), the metabolites available to the gut are utilized by the microbial species that are capable of consuming them, and a fraction of these metabolites is secreted as metabolic byproducts. These byproducts are then available for consumption by another set of species in the next trophic level. After several such steps, the metabolites that are left unconsumed constitute the fecal metabolome.
We hypothesized that adding new, yet-undiscovered cross-feeding interactions would improve our ability to predict the levels of metabolites with our mechanistic and causal model. Specifically, we predict that the set of undiscovered interactions resulting in the most accurate and optimal improvement in predictions would be the most likely candidates for true cross-feeding interactions. Inferring such an optimal set of new cross-feeding interactions or reactions is the main logic driving GutCP. In what follows, we sometimes refer to cross-feeding reactions (i.e., metabolite consumption or production by microbes) as “links” in an overall cross-feeding network of the gut microbiome, whose nodes are microbes and metabolites (Fig. 1a; metabolites in blue, microbes in orange); the links themselves are directed edges connecting the nodes. Links can be of two types: consumption or nutrient uptake reactions (from nutrients to microbes) and production or nutrient secretion reactions (from microbes to their metabolic byproducts).
Fig. 1: Overview of the GutCP algorithm.
a Schematic of the original set of known cross-feeding interactions (top) and bar plot of the prediction error for each metabolite and microbe (bottom). The cross-feeding interactions are represented as a network, whose nodes are either metabolites (cyan circles) or microbial species (orange ellipses), and directed links represent the abilities of different species to consume (red arrows) and produce (blue arrows) individual metabolites. b GutCP adds a new consumption link (red) and production link (blue) as added links reduce the prediction errors for metabolites and microbes.
The salient aspects of our method are outlined in Fig. 1. We start with the known set of consumption and production links that were originally used by the model; these links are known from direct experiments and represent a ground-truth dataset or original cross-feeding network (ref. 9). These are shown in Fig. 1a through the pink and blue arrows connecting nutrients 1 through 6 with microbes (a) through (c). For each sample, using only the species abundance from the microbiome, we use the model to quantitatively estimate the microbiome’s species and metabolomic composition. Briefly, we assume that a defined set of polysaccharides, common to human diets, is available as the nutrient intake to the gut (nutrients 1 and 4 in Fig. 1a). We calculate the microbiome and metabolome profiles separately for each individual, because each individual harbors a different set of microbial species in the gut. At the first trophic level, all microbial species that are capable of using the polysaccharides (indicated by the pink arrows in Fig. 1a) consume each of them in proportion to their abundances (microbes a, b, and c in Fig. 1a). They subsequently secrete a fixed fraction of the consumed nutrients as metabolic byproducts; every species at this trophic level secretes all the metabolic byproducts it is known to secrete (blue arrows in Fig. 1a) in equal proportion (nutrients 2–6 in Fig. 1a). At the next trophic level, all species detected in the individual’s gut that can consume the newly secreted byproducts consume them as nutrients, secreting a new set of byproducts, and this continues for four trophic levels (not shown in Fig. 1a for simplicity). At the end of this process, all metabolites that remain unconsumed by the community comprise the metabolome of the individual, and the microbial species that consume nutrients and grow comprise the microbiome of the individual (for a complete description, see “Methods” and previous work, ref. 20).
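The stepwise flow described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the published model of ref. 20: the function name `simulate_flow`, the dictionary-based inputs, and the value of `byproduct_fraction` are all hypothetical, and metabolites secreted at the last trophic level are simply counted as unconsumed.

```python
from collections import defaultdict

def simulate_flow(intake, abundance, consumes, produces,
                  byproduct_fraction=0.5, levels=4):
    """Toy trophic-level flow (hypothetical helper, not the authors' code).

    intake:    {metabolite: amount} entering the gut
    abundance: {species: abundance} for one individual
    consumes / produces: {species: set of metabolites} (the known links)
    byproduct_fraction is an assumed value for the 'fixed fraction' secreted.
    """
    available = dict(intake)       # metabolites entering the current level
    leftover = defaultdict(float)  # metabolites nobody present can consume
    biomass = defaultdict(float)   # nutrients retained by each species
    for _ in range(levels):
        secreted = defaultdict(float)
        for met, amount in available.items():
            eaters = [s for s in abundance if met in consumes.get(s, ())]
            total = sum(abundance[s] for s in eaters)
            if total == 0:
                leftover[met] += amount
                continue
            for s in eaters:
                # Each consumer takes a share proportional to its abundance.
                share = amount * abundance[s] / total
                outputs = produces.get(s, ())
                out = share * byproduct_fraction if outputs else 0.0
                biomass[s] += share - out
                # Byproducts are split equally among known secretions.
                for by in outputs:
                    secreted[by] += out / len(outputs)
        available = secreted
    # Whatever was secreted at the final level stays unconsumed.
    for met, amount in available.items():
        leftover[met] += amount
    return dict(leftover), dict(biomass)
```

In this sketch the first returned dictionary plays the role of the predicted metabolome and the second the predicted microbiome; total mass is conserved across the two.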
For each metabolite and microbial species, there can be two kinds of prediction errors, or biases: individual (the sample-specific difference between predicted and measured levels) and systematic (average difference across all samples). We focused on the “systematic bias” for each metabolite and microbial species: the average deviation of the predicted levels from the measured levels across all samples in our dataset (Fig. 1a, bottom). The systematic bias for each metabolite and microbe tells us whether our model generally tends to predict their level to be greater than observed (overpredicted), less than observed (underpredicted), or neither (well-predicted). We assume that metabolites and microbes with a large systematic bias are most likely to harbor missing consumption or production links that are relevant across many samples. We prioritize adding links to them in proportion to their systematic biases.
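As a rough illustration of the systematic bias, the sketch below averages the difference between predicted and measured levels over samples and labels each entity accordingly. The plain difference, the `tolerance` cutoff, and the function names are assumptions made for this example; the exact definition used by GutCP is given in “Methods”.

```python
from statistics import mean

def systematic_bias(predicted, measured):
    """Mean (predicted - measured) per metabolite or microbe across samples.

    predicted, measured: {sample: {entity: level}}.
    The plain difference is an assumption; the paper's exact scale may differ.
    """
    entities = set().union(*(levels.keys() for levels in measured.values()))
    bias = {}
    for entity in entities:
        diffs = [predicted[s].get(entity, 0.0) - measured[s][entity]
                 for s in measured if entity in measured[s]]
        bias[entity] = mean(diffs)
    return bias

def classify(bias, tolerance=0.1):
    """Label each entity; `tolerance` is a hypothetical cutoff."""
    labels = {}
    for entity, value in bias.items():
        if value > tolerance:
            labels[entity] = "overpredicted"
        elif value < -tolerance:
            labels[entity] = "underpredicted"
        else:
            labels[entity] = "well-predicted"
    return labels
```

Entities with the largest absolute bias would then be the first candidates for a missing consumption or production link.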
After measuring the systematic bias for each metabolite and microbe, GutCP proceeds in discrete steps (Fig. 1a, b). At each step, we attempt to add a new link to the current cross-feeding network. This new link is chosen randomly from the entire set of combinatorially possible links (see “Methods”; for S species, M metabolites, and two kinds of links, consumption and production, there are a total of 2SM combinatorially possible links). We accept this link, keeping it in the current network, if it leads to an overall improvement in the agreement between the predicted and measured levels of microbes and metabolites. We repeat the process of adding new links, accepting or rejecting each one, until the improvements in the predicted levels of metabolites and microbes become insignificant. Overall, GutCP can add several links to improve the agreement between the predicted and measured levels of microbes and metabolites (in Fig. 1a, b, bottom, adding the extra red and blue links at the top results in improved predictions for metabolite (1), metabolite (3), and microbe (b)). Figure 2a shows how the cross-feeding network improves over a typical GutCP run via the red trajectory, starting from the original network (Fig. 2a, top left) and ending at the final network state (Fig. 2a, bottom right). Trajectories from 100 other runs are shown in gray. GutCP repeatably both reduces the error of the metabolome predictions (y axis; measured as $\log_{10}\left(\frac{\text{pred}-\text{meas}}{\text{measurement}}\right)$) and improves the correlation between the predicted and measured metabolomes (x axis).
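The accept-or-reject loop can be sketched as follows. This is a simplified illustration, not the GutCP implementation: candidates are proposed uniformly at random (the bias-proportional prioritization is omitted), `error_fn` is a placeholder for whatever prediction error the underlying model reports, and the `patience` stopping rule is an assumed stand-in for "improvements become insignificant".

```python
import random

def all_candidate_links(species, metabolites):
    """Enumerate the 2*S*M possible directed links."""
    links = []
    for s in species:
        for m in metabolites:
            links.append(("consumption", m, s))  # metabolite -> microbe
            links.append(("production", s, m))   # microbe -> metabolite
    return links

def greedy_link_search(known_links, candidates, error_fn,
                       max_steps=1000, patience=200, rng=None):
    """Propose random links; keep one only if the error decreases.

    error_fn(network) -> float is supplied by the caller (hypothetical).
    The search stops after `patience` consecutive rejections (assumed rule).
    """
    rng = rng or random.Random(0)
    network = set(known_links)
    pool = [link for link in candidates if link not in network]
    best = error_fn(network)
    added, stale = [], 0
    for _ in range(max_steps):
        if not pool or stale >= patience:
            break
        link = pool[rng.randrange(len(pool))]
        trial = error_fn(network | {link})
        if trial < best:          # accept: the link improves agreement
            network.add(link)
            added.append(link)
            pool.remove(link)
            best, stale = trial, 0
        else:                     # reject: leave the network unchanged
            stale += 1
    return network, added, best
```

With the 72 species and 221 metabolites of the dataset used here, `all_candidate_links` would enumerate 2 × 72 × 221 = 31,824 links, which is why random proposals combined with an acceptance test are a practical way to explore the space.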
Fig. 2: Improvement in predictions using GutCP.
a Improvement in log error ($\log_{10}\left(\frac{\text{pred}-\text{meas}}{\text{measurement}}\right)$) and the correlation between the predicted and measured fecal metabolome during 100 typical runs of the GutCP algorithm. The gray point at the top left indicates the performance of the original cross-feeding network of ref. 9, and the black points at the bottom right, that of improved networks predicted using GutCP. A trajectory example, highlighting how performance improves over a GutCP run, is shown in red, and others are shown in gray. b Rarefaction curve showing the number of unique cross-feeding interactions discovered by GutCP over 100 runs of the algorithm. c Prevalence of links, i.e., the number of GutCP runs in which they repeatedly appeared (red dots; total 100 runs) and for comparison, a corresponding binomial distribution with the same mean (black dotted line). P values for different prevalences are estimated using the one-sided binomial test.
Cross-validating the newly predicted interactions
To test whether the cross-feeding interactions predicted by GutCP generalize to unseen datasets, we performed fourfold cross-validation. We used a sample -omics dataset of the gut microbiome and metabolome obtained from 41 human individuals, comprising 221 metabolites and 72 microbial species (data from ref. 38). We split our -omics dataset into two subsets: training (three-fourths of the individuals) and test (one-fourth of the individuals). We then ran GutCP on the training subset to discover new interactions and added them to the ground-truth interactions taken from ref. 9. Doing so resulted in a network of cross-feeding interactions learned only from the training subset of the data. Finally, we evaluated the improvement in accuracy of metabolome predictions resulting from the trained network on the unseen test subset of the data. We repeated this process three more times, each time splitting the full dataset into a training subset (with a randomly chosen three-fourths of the individuals) and a test subset (with the remaining one-fourth of the individuals); finally, we calculated the average improvement in prediction accuracy over all four splits.
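The splitting procedure can be illustrated with a short sketch. It assumes repeated random three-fourths/one-fourth splits as described above; the rounding of the training-set size, the seed, and the `fit` and `evaluate` callables are hypothetical placeholders rather than parts of GutCP.

```python
import random

def random_splits(individuals, n_splits=4, train_fraction=0.75, seed=0):
    """Yield (train, test) lists; each split is drawn independently at random."""
    rng = random.Random(seed)
    n_train = round(train_fraction * len(individuals))  # rounding is assumed
    for _ in range(n_splits):
        shuffled = list(individuals)
        rng.shuffle(shuffled)
        yield shuffled[:n_train], shuffled[n_train:]

def cross_validate(individuals, fit, evaluate, **split_kwargs):
    """Average a test-set score over all splits.

    fit(train) -> model and evaluate(model, test) -> float are supplied
    by the caller; here they stand in for running GutCP and scoring it.
    """
    scores = []
    for train, test in random_splits(individuals, **split_kwargs):
        model = fit(train)               # learn links on the training subset only
        scores.append(evaluate(model, test))
    return sum(scores) / len(scores)
```

For 41 individuals this sketch yields 31 training and 10 test individuals per split.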
We found that both the training and test set performances after using the links predicted by GutCP were significantly better than the baseline given by the original cross-feeding network (Table 1). Specifically, the two measures of model performance, the logarithmic error and the average correlation, improved by 64% and 20%, respectively, after adding GutCP’s discovered interactions. In addition, the test set performance was comparable to the training set performance (6% difference; Table 1). This suggests that the cross-feeding interactions inferred by GutCP are unlikely to be a result of over-fitting.
Table 1 Cross-validating the newly predicted interactions.
Building a consensus-based atlas of predicted cross-feeding interactions
Having confirmed that GutCP is unlikely to over-fit data, we pooled the entire sample dataset of 41 individuals and ran 100 independent instances of our prediction algorithm on it; we verified that incorporating more instances did not qualitatively affect our results (Fig. 2b shows a rarefaction curve, which highlights the number of new links discovered by GutCP as we perform more runs of the algorithm). Each run of the algorithm resulted in an average of 140 newly predicted cross-feeding interactions. Then, based on consensus from many runs, we assigned a confidence level to each predicted interaction, namely the fraction of GutCP runs in which it was discovered. By calculating a null distribution (Fig. 2c, black), which predicts the fraction of GutCP runs in which a random link would be discovered by chance, we assigned a P value to each link and set a threshold at P = $10^{-3}$ (Fig. 3c, red; see “Methods” for details). Doing so resulted in a complete consensus-based atlas of 293 predicted cross-feeding interactions, which we have provided as a resource for experimental verification in Supplementary Table 1. Figure 3a shows a condensed version of these interactions obtained from the simulation with the best performance (the trajectory example in Fig. 2a with the lowest log error and highest correlation coefficient) in the form of a matrix; specifically, newly added interactions are in dark colors, and old interactions in faded colors. Supplementary Fig. 3 shows a complete version of this matrix. Note that some of the predicted interactions in Fig. 3a are unrealistic, e.g., the production of certain sugars like D-Fructose and D-Sorbitol. Such interactions are unlikely to be predicted in repeated simulations, and thus will not be part of the final consensus set. This illustrates the power of pooling results from several simulations to arrive at a set of highly probable predictions.
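The consensus step can be sketched as a prevalence count followed by a one-sided binomial tail probability. This is an illustrative sketch only: the per-run chance probability `p_null` must be supplied by the caller, and how it is derived from the data (see “Methods”) is not reproduced here; the function names are hypothetical.

```python
from collections import Counter
from math import comb

def prevalence(runs):
    """Count in how many runs each link appears. runs: iterable of link sets."""
    counts = Counter()
    for links in runs:
        counts.update(set(links))
    return counts

def binomial_tail(k, n, p):
    """One-sided P value: probability of at least k successes in n trials."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def consensus_links(runs, p_null, alpha=1e-3):
    """Keep links whose prevalence is unlikely under the null.

    p_null is the assumed per-run probability of finding a link by chance.
    Returns {link: (prevalence, P value)} for links with P value < alpha.
    """
    runs = list(runs)
    n = len(runs)
    selected = {}
    for link, k in prevalence(runs).items():
        p_value = binomial_tail(k, n, p_null)
        if p_value < alpha:
            selected[link] = (k, p_value)
    return selected
```

A link that appears in nearly every run gets a vanishingly small P value, whereas a link found once in 100 runs is indistinguishable from chance and is dropped from the atlas.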
Fig. 3: New cross-feeding interactions predicted by GutCP.
a Concise matrix representation of the improved cross-feeding network of the gut microbiome predicted by GutCP (the trajectory example in Fig. 2a with the best performance). The rows are metabolites, and the columns, microbial species. Faded cells represent the original, known set of cross-feeding interactions: production links (light blue), consumption links (light red), and bidirectional links (gray). The new cross-feeding interactions predicted by GutCP are shown in dark colors: production links in dark blue, consumption links in dark red, and bidirectional links in black. b Network of 293 new links predicted by GutCP (with a P value …