From SSN to PFCs
We analyzed the 1,914,171 proteins from 885 MAGs from marine plankton, recovered from 12 geographically bound assemblies of metagenomic sets corresponding to a total of 93 Tara Oceans samples from the 0.2 to 3 µm and 0.2 to 1.6 µm size fractions21. A flowchart of our bioinformatic pipeline is available in Supplementary Fig. 1. 39.6% of the MAGs’ proteins (757,457) were involved in our SSN, i.e., they had at least one similarity relationship with another protein that satisfied the chosen threshold of 80% similarity and 80% coverage (see “Methods”). In total, 51.1% of the network proteins could be annotated to 4922 unique molecular function IDs in the KEGG database37, associated with 327 distinct metabolic pathways (a full list of these pathways is available in Supplementary Data 1). In total, 85.2% of the network proteins were annotated to 17,009 eggNOG functional descriptions38,39.
The SSN involved 233,756 connected components (CCs), i.e., groups of nodes (here proteins) connected together by at least one path and disconnected from the rest of the network. According to KEGG and eggNOG databases, 15.3% and 48.5% of the CCs remained without any functional annotation (i.e., all sequences from the CC were unmatched in the databases, or had a match but were not yet linked to any biological function, Table 1), and 14.8% were functionally unannotated for both databases. We ranked the functional homogeneity of CCs involving at least one functional annotation from 0 (all annotations in the CC are different) to 1 (all annotations in the CC are the same) and found mean homogeneity scores of 0.99 over 1 for KEGG annotations and 0.94 over 1 for eggNOG ones (see “Methods” for score calculation details). Only 88 (0.04%) CCs had a homogeneity score below 0.5 in both annotation databases, all with sizes below five proteins. 177 CCs (0.07%) had a score below 0.8 in both databases, all under 12 proteins in size. These CCs were kept in the analysis while tagged as poorly homogenous. We thereafter considered each CC as a PFC, numbered from #1 to #233,756.
To check for the influence of taxonomic relationships between the MAGs on our PFCs, we computed different metrics based on MAGs taxonomic annotations provided by Delmont et al.21. (Table 1). This taxonomic annotation based on 43 single-copy core genes allowed to annotate 100% of the MAGs at the domain level, and 95% of the MAGs at the phylum level, the remaining 5% corresponding to Bacteria MAGs of unidentified phyla21. Only 1330 PFCs (0.6%) mixed proteins from the Archaea and Bacteria domains. PFCs were very homogeneous at the phylum level, then the homogeneity decreased at lower taxonomic rank, meaning that PFCs studied here were generally not specific from a single class, order, family, genus, or MAG (Table 1). In total, 7834 PFCs (3.4%) were only composed of proteins with no functional annotation in KEGG and eggNOG databases, and no taxonomic annotation under the phylum level. Their sizes ranged from 2 to 30 proteins (mean of 2.62). Their 20,552 proteins came from Euryarchaeota MAGs (12,458; 60.6%), Bacteria MAGs of unidentified phylum (2742; 13.3%), Candidatus Marinimicrobia MAGs (2451; 11.9%), Proteobacteria MAGs (1528; 7.4%), Acidobacteria MAGs (1031; 5%), Verrucomicrobia MAGs (103; 0.5%), Planctomycetes MAGs (89; 0.4%), Bacteroidetes MAGs (79; 0.4%), Chloroflexi MAGs (59; 0.3%) and Candidate Phyla Radiation MAGs (12; 0.05%). We hereafter considered these functionally and taxonomically unknown PFCs as “dark” PFCs40,41. Their nucleotidic sequences are available in separate supplementary files (see “Data availability”). The abundance of dark PFCs was significantly different from the abundance of other PFCs in 85 samples over 93 (two-sided Wilcoxon rank-sum test, p-value < 0.05). The median abundance of dark PFCs was higher than the one of other PFCs in 36 of these 85 samples, and lower in the 49 others. Further details on dark PFCs’ abundances are available in section I of Supplementary notes.
Identification of PFCs highly related to environmental gradients
To identify the PFCs that responded the most to environmental gradients, we first selected the 228,914 clusters with non-zero variance abundance profiles (i.e., at least 10% of distinct abundance values across all samples, and less than a 95 to 5 ratio between the most and the second most observed abundance value, please see “Methods” for more details), to avoid the creation of constant or near-constant training and/or test sets. We then built random forest regression models for each of these 228,914 clusters. We used the sequence abundances as response variables or labels, and 52 environmental variables as explanatory variables (see “Methods” for details of model training and tuning). More than half of the random forest regression models showed a clear statistical signal: 130,651 models (55.9%) had R2 values over 0.25, corresponding to PFCs linked to environmental conditions, and 14,585 (6.4%) had R2 values over 0.5 (Fig. 1A), corresponding to PFCs highly linked to environmental gradients (hlePFCs), from which the abundances were potentially predictable from the environmental context (Fig. 1B). The mean R2 value over all models was 0.29, with a maximum of 0.88 (Fig. 1A). Longhurst biogeographical provinces42 were detected as the most important predictor in 98,450 models (43.0%) and were in the top three most important predictors in 167,039 models (73.0%) (Fig. 1C). Among models with biogeographical provinces as the best predictor, the mean R2 reached 0.30. The temperature was in the top three most important predictors in 12,673 models (5.5%). Models with temperature as the best predictor had a mean R2 of 0.43, which was the fourth-highest value of all quantitative variables, behind sunshine duration (0.49), ammonium at 5 m depth (0.46), and annual density (0.43).
A Density distribution of R2 values of all the 228,914 random forest models. The green curve corresponds to R2 values obtained using out-of-bag samples of the fivefold cross-validation process applied during model training, while the red curve corresponds to R2 values obtained from predictions made on test sets using the selected trained models (See “Methods” for details on the statistical approach). The mean R2 value over all models was of 0.29 using out-of-bag samples, and 0.28 based on predictions over test sets. B Relationship between abundances observations from test sets and the corresponding model predictions, for all the 14,585 models showing R2 values above 0.5, i.e., a total of 3,174,580 predictions/observations couples. Both observations and predictions were log-transformed for graphical purposes, as the range of abundance values varied across three orders of magnitudes across different PFCs. The space of the plot was divided into hexagonal bins colored according to the number of points located in each of them, in log scale. The black line represents exact predictions, while the red line corresponds to the linear model fit between the two plotted variables (R2 of 0.70). C The rank of importance of each environmental variable in models. Ranks were attributed from 1 for the most important to 52 for the least important variable in each model (a lower rank then indicates higher importance in models). Each boxplot summarizes the ranks of the importance of its focal variable in all the 228,914 models. Boxplots minima and maxima correspond to −1.5 and 1.5 times the interquartile range, while the limits of boxes correspond to the first and third quartiles. The centerline indicates the median. All points outside of the minima-maxima range are plotted.
We focused on the 14,585 PFCs associated with models showing R2 values over 0.5, hereafter called “hlePFCs” for hlePFCs PFCs. 246 KEGG pathways were associated with the 14,585 hlePFCs, i.e., 75% of the pathways identified on the full network were detected in hlePFCs. Supplementary Data 1 gives a detailed list of all pathways detected in our network, along with their total number of occurrences in PFCs, and their number of occurrences in hlePFCs.
The functional homogeneity of the 14,585 hlePFCs was similar to the one of the total 233,756 PFCs (Tables 1, 2). In parallel, 9.2% of the hlePFCs were functionally unannotated (e.g., only composed of unannotated proteins in the KEGG and the eggNOG functional databases), while 14.8% of the PFCs were functionally unannotated.
Proportions of taxonomically homogeneous PFCs were similar between hlePFCs (Table 2) and total PFCs (Table 1), all above 90% at the phylum, class, order, and family level when considering only PFCs including at least one protein taxonomically annotated. At the genus level, the proportion of taxonomically homogeneous PFCs decreased from 91.9% in total PFCs to 66.6% in hlePFCs (Tables 1 and 2). Hence, hlePFCs tended to be shared by more genera compared to all PFCs, while retaining a high functional homogeneity. The proportion of taxonomically unannotated hlePFCs was lower than the one of total PFCs at the phylum and class levels but was higher at the order, family, and genus levels (Tables 1 and 2).
The mean R2 values of models associated with the 7834 dark PFCs (i.e., PFCs without any functional annotation and without taxonomic annotation below the phylum level) was of 0.27, and 166 (i.e., 2.1%) of them were selected among the 14,585 hlePFCs. These 166 dark hlePFCs corresponded to 357 proteins belonging to 51 unique MAGs, annotated as Proteobacteria (8 MAGs, 186 proteins), Euryarchaeota (19 MAGs, 40 proteins), Candidatus Marinimicrobia (13 MAGs, 97 proteins), Bacteroidetes (2 MAGs, 2 proteins), Chloroflexi (1 MAG, 5 proteins), Verrucomicrobia (1 MAG, 2 proteins) and Bacteria unannotated at Phylum level (7 MAGs, 25 proteins). Forty-six of these 51 MAGs were estimated to be over 70% complete, and the remaining five showed completion estimates between 50 and 70%21.
Global biogeography of the PFCs hlePFCs
The canonical correspondence analysis (CCA) achieved on the 14,585 hlePFCs to investigate their biogeographical repartition had an R2 value of 68.2% and was significant (p-value < 0.001). The first axis (15.38% of explained variance) opposed warm samples from the Indian Ocean (CCA1 > 0) to saline samples from the Mediterranean (CCA1 < 0) (Fig. 2). The second axis (13.17%) opposed cold and nutrient-rich samples from the Southern Ocean (CCA2 > 0) to warmer and more oligotrophic samples. Samples from geographically close biogeographical provinces appeared close to each other in the CCA space, with samples from the Southern Ocean and the Atlantic zones on the bottom, the Pacific Ocean in the middle, and the Indian Ocean on the right of the CCA1-CCA2 space (Fig. 2). The closest sample from the Mediterranean ones in the CCA space was from the surface of the closest Atlantic station to the strait of Gibraltar (station TARA_004) at the entrance of the Mediterranean Sea (sample highlighted by a black arrow in Fig. 2).
hlePFCs are represented as gray dots, quantitative environmental variables as arrows, and samples as points colored and shaped according to their biogeographical province (correspondence between four letters codes used here and full biogeographical provinces names, as well as descriptions of all other environmental variables are available in Supplementary Data 2). A map of Longhurst biogeographical provinces42 colored using the same color scale is shown in the upper right panel. For simplification issues, other qualitative variables (season moment, depth, and ocean region) were not represented. The closest sample from the Mediterranean ones in the CCA space, which comes from the closest Atlantic station to the strait of Gibraltar, was highlighted by a plain black arrow. On the right and upper left panels, density plots are represented along each axis, illustrating the density of functionally annotated and unannotated hlePFCs based on KEGG annotations (functionally annotated hlePFCs contain at least one functionally annotated protein; functionally unannotated hlePFCs contain only functionally unannotated proteins). The mean hlePFC density was of 0.25 along CCA1 (standard deviation = 0.2, maximum = 0.71), and 0.2 along CCA2 (standard deviation = 0.42, maximum = 2.46). The mean difference in density between functionally annotated and unannotated hlePFCs along CCA1 was of 0.02 (standard deviation = 0.07, maximum = 0.23). The mean density difference between annotated and unannotated hlePFCs along CCA2 was 0.01 (standard deviation = 0.24, maximum = 1.33). Similar observations were done using eggNOG annotations densities (Supplementary Fig. 2).
Combining the CCA results with the functional annotation of hlePFCs, we observed that the vast majority of metabolic pathways were not enriched in particular environmental conditions (Fig. 3). Among the few exceptions, Atrazine degradation was lightly enriched in the Mediterranean Sea (CCA10), while RNA polymerase and AMPK signaling pathways were lightly enriched in nutrient-rich, cold waters (Fig. 3). Pathways related to biogeochemical functions (e.g., carbon fixation pathways in bacteria/archaea, Nitrogen metabolism, or methane metabolism) or linked to ecological interactions between organisms (e.g., biosynthesis of antibiotics, quorum sensing, or ABC transporters) were present homogeneously in the CCA space (Supplementary Fig. 3). Functionally unannotated hlePFCs were more abundant around −1 and above 1 along CCA1 (Fig. 2), corresponding to hlePFCs associated with Mediterranean samples and Indian Ocean samples, respectively. The 166 dark hlePFCs were associated with Mediterranean samples, and almost absent from polar samples (Fig. 5B, Supplementary Fig. 3).
Gray dots represent hlePFCs in the same way as in Fig. 2, while colored points indicate the barycenters of KEGG metabolic pathways that occurred at least ten times among hlePFCs, the barycenter of a pathway corresponding to the barycenter of the positions of its associated hlePFCs. The color of each barycenter codes for the number of PFCs annotated to the focal pathway, in log scale. A red triangle indicates the barycenter of all hlePFCs in the CCA space. The barycenter of dark hlePFCs (i.e., hlePFCs without functional annotation and the taxonomical assignment below the phylum level) was represented in black Colored arrows indicate the environmental conditions associated with the different zones of the CCA space (see Fig. 2).
We then examined the position of hlePFCs associated with different levels of taxonomic annotations in the CCA space. hlePFCs containing sequences from the phylum Candidatus Marinimicrobia were found mostly in Mediterranean samples and were almost absent from polar waters (Fig. 4A). Overall, the various phyla did not show any strong associations to a particular niche in the CCA space (cf. the central positions and large standard deviation bars for the various phyla on Fig. 4A). Similarly, most of the classes’ barycenters on Fig. 4B showed important standard deviation bars, with the exception of the cyanobacterial class of Prochlorales which seemed restrained to warm and oligotrophic waters (Fig. 4B). No particular order, family, or genus was found to be strongly associated with Mediterranean samples, and the genus Polaribacter was the only taxa strongly associated with polar samples (barycenter position on CCA2 = −1.46).
A Six selected phylum and (B) seven selected classes. These taxa were selected because they had the most peripheral barycenters’ positions in the CCA space. Error bars correspond to the standard deviations of hlePFCs positions around their barycenters on CCA1 and CCA2 axes for each taxon. The size of barycenters represents the number of associated hlePFCs for each taxon, with the exact corresponding values written in white in each barycenter and in the legend. Colored arrows indicate the environmental conditions associated with the different zones of the CCA space (cf. Fig. 2).
In total, 3345 hlePFCs were highly overabundant in the Mediterranean Sea (CCA1), corresponding to 8736 proteins from 193 different MAGs of 10 classes. 85 of these MAGs originated from the same assembly performed by Delmont et al.21. on Mediterranean samples, and accounted for 5537 proteins (63.4% of the 8736 present in the zone, Supplementary Fig. 4). A similar yet less marked pattern was observed along CCA2, with 189 of the 697 (26%) proteins of hlePFCs correlated to cold and rich waters (CCA2 < −2) coming from MAGs of the Southern Ocean assembly (Supplementary Fig. 4).
Our analysis allowed us to identify environmental variables driving the abundance of functionally unannotated hlePFCs. For example, PFC #90,382 was composed of eight unannotated proteins coming from four different MAGs (3 Flavobacteriales, 1 Gammaproteobacteria), and had a strong response to high temperature (Fig. 5A), as well as other environmental variables (R2 value of 0.501 for the associated random forest model). Conversely, PFCs #102,286 (two proteins coming from the same Saprospiraceae, R2 value of 0.68), #210,456 (two proteins from two distinct Flavobacteriaceae, R2 value of 0.83), and #233,673 (two proteins from two distinct Flavobacteriales, R2 value of 0.63) were highly linked to cold temperature (Fig. 5A). PFCs #172,397, #172,465 and #26,732 were dark hlePFCs overabundant in Mediterranean samples (Fig. 5B).
A Relationships between normalized sequence abundance and temperature for six selected protein functional clusters highly linked to the environment (hlePFCs) that were only composed of functionally unannotated sequences. The three graphs on the left, in purple, correspond to the three hlePFCs functionally unannotated in both KEGG and eggNOG databases that had the lowest positions along the second axis of the canonical correspondence analysis (CCA2) (cold and nutrient-rich waters). The three graphs in the middle, in green, correspond to the three hlePFCs functionally unannotated in both KEGG and eggNOG databases that had the highest positions along CCA1 (oligotrophic and warm waters). B Relationships between normalized sequence abundance and location of sampling, whether in the Mediterranean Sea or not, for 3 dark hlePFCs (only functionally unannotated sequences and no taxonomic annotations under the phylum level). These 3 hlePFCs had the lowest positions among dark PFCs on the first axis of the canonical correspondence analysis (CCA2) (correlated to Mediterranean samples). Each boxplot summarizes the abundance values of its focal PFC in Mediterranean samples (n = 6) and non-Mediterranean samples (n = 87). Boxplots minima and maxima correspond to −1.5 and 1.5 times the interquartile range, while the limits of boxes correspond to the first and third quartiles. The centerline indicates the median. All points outside of the minima-maxima range are plotted.
Positions of all functionally and/or taxonomically unannotated hlePFCs in the CCA space, the most important drivers of their abundance according to random forest models, the nucleotidic sequences of their proteins, and their MAGs of origins are all publicly accessible (link in “Data availability”).
Robustness of the observed biogeographical patterns
In total, 6.2% of the 233,756 PFCs that were created in this study were used in our biogeographical analysis. In section II of Supplementary notes, we provide a similar biogeographical analysis, this time including the 130,651 PFCs showing R2 values above 0.25, i.e., 55.9% of the total 233,756 PFCs. A CCA based on these 130,651 PFCs allowed to identify two samples from the surface and deep chlorophyll maximum of station 93 as strong outliers due to their singular position along CCA1 (Supplementary Fig. 5A). The outlying nature of these two samples was explained by the strong overabundance of PFCs composed of proteins from 5 MAGs of the genus Pseudoalteromonas (Supplementary Figs. 6 and 7). The general patterns observed using the 14,585 hlePFCs (Fig. 2) were confirmed by the third and fourth dimensions of the CCA on 130,651 PFCs (Supplementary Figure 5B).
Source: Ecology - nature.com