A network approach to elucidate and prioritize microbial dark matter in microbial communities
Overall strategy to detect the relevance of Unknown taxa
A pipeline based on network analysis was developed to detect and quantitatively measure the overall and individual impact of Unknown taxa on their environmental communities (Fig. 1). Briefly, Illumina 16S rRNA sequencing fastq files belonging to four distinctive aquatic extreme environments (i.e., hot springs, hypersaline, deep sea, and polar habitats that included both Arctic and Antarctic samples) were collected from public databases and 45 different BioProjects (Fig. 2a and Supplementary Dataset S1). We included different environment types to assess general and environment-specific patterns and chose to use extreme habitats as they comprise some of the harshest and relatively understudied habitats on Earth, and therefore, are likely to contain uncharacterized organisms.
Fig. 1: Overview of the analysis pipeline.
A minimum of 250 samples was retrieved for each of the four different extreme environments—hot springs (red), hypersaline (dark green), deep sea (turquoise), and polar (blue). Sequence reads were quality filtered, assigned to a taxonomy, and clustered to OTUs. At each classification level, any unassigned, ambiguous, or uncultured OTUs were designated as Unknowns, or “microbial dark matter” (MDM). For each environment, at each classification level, the direct co-occurrence relationship between all OTUs was mathematically modeled as a network. Networks were created for each environment, across all taxonomic classification levels, including all OTUs (Original, orange), excluding MDM (Without Unknowns, light green), and excluding an equal number of random Knowns (Bootstrap, blue). Network centrality metrics (i.e., degree, betweenness, and closeness) were calculated for each node, compared, and visualized as boxplots between these network types. Hub scores were calculated for each node in the Original network and networks were recreated, resizing by hub score, where the largest size node indicates a top hub species.
Full size image
Fig. 2: Summary of environmental 16S rRNA gene data.
a Map of the sample sites used in this study. Circles symbolize sample locations and are color-coded by environment: hot springs (HS, red), hypersaline (HY, dark green), deep sea (DS, turquoise), and polar (PO, blue). b Summary of data used in this study. OTUs counts are provided at the genus level. c Proportion of “microbial dark matter” (MDM) OTUs for each environment labeled as unassigned (dark blue), uncultured (dark red), and ambiguous (yellow) after SILVA and UCLUST-based taxonomic assignment to OTU. d Venn diagram of shared OTUs in four extreme environments. Each pie chart depicts the proportion of unique OTUs that were Known (lighter shade) and Unknown (darker shade) for each environment, with the bottom-most pie chart showing combined data for all environments. e Prevalence curves indicate the number of unique OTUs consistently present at an increasing number of samples. Dotted lines signify the prevalence of MDM OTUs and solid lines signify the prevalence of Known OTUs.
Full size image
After quality filtering, reads were mapped to OTUs and annotated against the SILVA (v128) reference database [38] by an open-reference strategy, i.e., allowing the detection of Unknown OTUs [39]. Over two million Known and novel taxa were observed with the vast majority of the taxa annotated as Bacteria. Only the bacterial taxa or OTUs unclassified at the domain-level were targeted for downstream network analyses to demonstrate the feasibility of this approach across ecosystems. As the term “microbial dark matter” can have a broad meaning, here, we define Unknown taxa as uncultured, unassigned, or ambiguous by the reference database at each taxonomic classification level (e.g., phylum to genus). For each environment, networks reflecting across-samples co-occurrence relationships between all taxa, Known and Unknown, were constructed and referred to as the “Original” networks (Fig. 1). To assess the role of the Unknown taxa on network structure and properties, the Unknown nodes were removed from the ‘Original’ network and a new network, referred to as ‘Without Unknown’ was reconstructed. To ensure that changes in network properties were not caused just by the number of nodes, a third network, referred to as the “Bootstrap” network, was created where a random set of nodes of the same number as the Unknown OTUs was removed. The relevance of the Unknown taxa was assessed by comparing changes in degree, closeness, and betweenness scores between the three network types and by evaluating the frequency of Unknowns as top hubs within each of the “Original” environmental networks.
A similar and significant fraction of Unknown taxa populates diverse environments
To assess whether there were distinctive patterns or trends of Unknown taxa within the four targeted environments, data from each habitat type were mined from several geographical locations across the globe (Fig. 2a and Supplementary Dataset S1). Reads were collected from the online repositories National Center for Biotechnology Information Sequence Read Archive (NCBI SRA) and Joint Genome Institute Genomes Online Database (JGI Gold). For each environment, between 255 and 286 16S rRNA samples and between 51 and 57 million reads were included in the analysis, resulting in a total of 219,980,340 reads from 1086 samples (Fig. 2b).
After processing, quality filtering, and OTU assignment steps were completed, there were 2102,595 unique OTUs totaling 164,896,127 amplicon read counts derived from these samples (Fig. 2b). The relative proportions of Unknown OTUs, which were designated as unassigned, ambiguous, or uncultured by the SILVA database, were compared between environments. Results indicated that all environments showed similar relative contributions of these three Unknown types, with unassigned and uncultured OTUs making up the majority of the Unknown component composition (Fig. 2c). Regardless of environment type, within each of the four habitats, between 25 and 38% of unique OTUs were cataloged as Unknown (Fig. 2b, d). Samples collected from polar habitats were significantly enriched (Fisher’s Exact Test p value < 0.05) in Unknown OTUs despite having the highest total read counts. The higher proportion of Unknown OTUs in the polar samples compared to the other habitats likely reflects the less-well-characterized biological diversity of these Arctic and Antarctic ecosystems.
Next, the proportion of shared Known and Unknown OTUs between environments was evaluated. Most OTUs, regardless of assigned or unassigned taxonomic status, were environment-specific, with only 11,318 out of the 2,102,595 OTUs present in all four of the environments (Fig. 2d). The majority of shared OTUs were observed between the hypersaline and polar environments and between hypersaline and hot springs environments, possibly reflecting the widespread distribution of hypersaline habitats across diverse thermal zones. Unsurprisingly, polar and hot springs environmental samples shared the least number of OTUs (Fig. 2d). Given this low common OTU pool across environments, network analyses were applied to each environment independently.
Last, the prevalence (i.e., the percentage of samples with nonzero counts in which an OTU was detected) of Known and Unknown OTUs was evaluated at the genus level to assess the consistency of OTU detection within each environment. The OTU matrix was sparse, with the majority of taxa observed in ≤50 (~20%) samples (Fig. 2e). However, prevalence curves were generally very similar for Known and Unknown OTUs in all four environments, indicating that Unknown OTUs are not necessarily rarer than already characterized species (Fig. 2e). Moreover, we confirmed that Unknown OTUs, like Known OTUs, were generally present and consistent across multiple studies within the same environment and did not tend to concentrate in any particular project (Supplementary Figs. S1–4). Consequently, these results indicated that a network analysis of these data would be a reflection of the co-occurrence structure of the community and not of potential compositional bias.
Network analysis of OTU abundance at different taxonomic levels reveals the connectivity of unknown microbes
Having demonstrated that the Unknown taxa comprise a substantial proportion of unique OTUs and have comparable abundances to Known taxa within a community, network metrics were used to effectively compare the ecological relevance of both Known and Unknown taxa in subsequent networks. Microbial association networks were constructed that featured only significant co-occurrence correlation relationships for OTUs with a notable prevalence in each environment, meaning that any OTUs that were not detected in a sufficient number of samples were removed. To select a suitable prevalence threshold, the proportion of Known and Unknown taxa across a range of sample percentages was evaluated (Supplementary Fig. S5). Across all taxonomic levels, the Known and Unknown taxa of hot springs and polar habitats were more prevalent than those of hypersaline and deep sea communities; therefore, a slightly more stringent prevalence threshold (40%) was chosen for the former, and a lower threshold value (30%) chosen for the latter environments. These thresholds resulted in the retention of a similar fraction of data from the initial OTU count (Table 1), with 102–297 nodes present per environment, making the networks both suitably large and comparable.
Table 1 Breakdown of node and edge attributes across extreme environmental networks.
Full size table The SpiecEasi Meinshausen–Buhlmann (MB) neighborhood algorithm was then used to construct networks (see “Methods”) that contained at least 100 nodes (i.e., OTUs) per environment and had edge-to-node ratios that varied from 1.9 and 2.9 (Table 1). Although most OTUs that passed the prevalence criteria became elements of the networks, no relationship between the initial data size (i.e., number of samples and taxa) and the interconnectedness (i.e., nodes/edges ratio) of the resulting network was observed (Table 1). The two environments with the highest number of initial OTUs (i.e., hypersaline and deep sea) had the lowest number of prevalent members, 193 and 102, respectively, and very different edge/node ratios, 2.7 and 1.9, respectively, indicating that high prevalence does not necessarily correlate with high co-occurrence. Similarly, the polar and hot springs networks retained a high number of prevalent OTUs but differed in edge number, yet again, indicating that the observed network structures were the result of the intrinsic properties of each environment and were not dependent on the sampling procedure. Interestingly, when evaluating the proportion of Unknown edges and Unknown-Unknown connections at the genus level, similar patterns were observed across environments. Between 45 and 62% of all connected nodes were Unknown OTUs and a higher proportion of Unknown-Unknown versus Known-Known links was present at the genus level (Table 1), for all environments but hot springs, where a higher proportion of Known-Unknown and Known-Known links was observed.
The results of this global analysis of network construction and composition demonstrate that although the general community connectivity might be environment-specific, the relative contribution of Known and Unknown taxa to these networks is similar. Once again, network properties were not a direct outcome of sampling biases, but more likely, reflect the biology of their respective ecosystems.
Unknown taxa play important roles in interconnectedness and connectivity of extreme environmental microbial networks
Next, the position and neighborhood of Unknown nodes were examined. At the phylum level, Unknown taxa were present in the hot spring, hypersaline, and polar environments, but were not found in the deep sea network (Supplementary Fig. S6). The class level was the first taxonomic classification rank in which Unknown taxa were present in all environmental networks. To accurately assess the role of Unknowns, we evaluated class-level networks and observed that the hypersaline and polar Unknown OTUs created distinctive clusters, whereas hot springs and deep sea Unknown OTUs were intermixed with Known taxa (Fig. 3a). Hypersaline and polar Unknowns consistently appeared to be isolated and peripherally located compared to the centrally positioned hot springs and deep sea Unknown nodes across almost all classification ranks (Supplementary Fig. S6). Consequently, these results suggest that the clustering pattern is unrelated to a higher abundance of Unknowns and is more environment dependent. For example, the targeted hot spring environments had the highest number of Unknown OTUs yet showed the most dispersed connections between the these taxa (Fig. 3a). Thus, the inclusion of the Unknown taxa in our environmental networks models was, as anticipated, not solely the consequence of their level of prevalence, but rather a reflection of a particular abundance co-variation pattern.
Fig. 3: Analysis of environmental network taxa interconnectedness. a Microbial networks at class taxonomic classification level. Nodes are colored by class assignment, with gray nodes representing Unknown taxa at the class level. b Bar graphs of the co-occurrence relationships (i.e., edges) of Unknown OTUs with other taxa at the class level within each environmental network. Y axis labels and colors signify the different classes with which Unknowns were found to co-occur. Unknown-Unknown relationships are represented in gray. Full size image For all four environments, Unknown OTUs had more frequently shared edges among them than with classified taxa (Fig. 3b). Unknown OTUs were found to frequently co-occur with each other within each environmental network, although the frequency of within-class interactions for Unknowns at the class level was found to be statistically no greater than all other within-class interactions for each environment (Supplementary Fig. S7). In fact, the pattern of a high frequency of shared edges among members of the same class held true for known classes as well (Supplementary Fig. S8). In accordance with other studies, these results demonstrate that OTUs of the same taxonomic classification most frequently co-occur with each other [40, 41]. Furthermore, the high frequency of shared edges between Unknown classes suggests that Unknown OTUs might be taxonomically related.
To ensure that the observations found were reproducible, robust, and not biased by earlier steps of the analysis, the diversity and position of Unknown taxa in the networks were examined for several parameters. Although the number of Unknown nodes changed at each level (Supplementary Fig. S6), the environment-specific network patterns observed at the class level (Fig. 3a) were retained at other taxonomic levels. In addition, to determine whether the topology of the network was a direct consequence of our correlation metric of choice or the prevalence threshold, three other network construction approaches were used: SparCC [11], CClasso [42], and Pearson correlation. Network analyses were performed across a range of prevalence thresholds (15–35%, at 5% increments). Again, regardless of the network construction approach or sample percentage applied, network shape and Unknown OTU position remained consistent and each environment exhibited a distinctive pattern of Unknown taxa inclusion. For example, Unknown nodes continued to occupy peripheral positions for hypersaline and polar networks, whereas nodes in hot springs and deep sea environments were more centrally located when applying different correlation metrics (Supplementary Fig. S9). Although networks appeared “noisy” at more lenient prevalence thresholds (15–20%; Supplementary Figs. S10–13), the networks and positioning of Unknowns at higher percent thresholds were similar in appearance to the “Original” networks for all environments. Based on these results, we found that our general analysis strategy was robust across parameter choices and, therefore, these networks captured critical features of the relationships among taxa within each distinctive environment.
Microbial dark matter acts as unifiers and frequent hubs within extreme environmental networks
Although these results suggest that Unknown taxa were highly interconnected, these observations did not reveal how the presence of Unknowns affected the overall community structure. To more fully understand this role, we analyzed how network properties changed in the presence and absence of Unknown OTU nodes. We evaluated changes in degree, betweenness, and closeness, as different network metrics reveal different aspects of the relevance of nodes within their networks. This approach has the potential to ascertain whether certain Unknown OTUs were more centrally positioned (e.g., due to high closeness scores), more essential for joining other taxa (e.g., high betweenness), or simply more prevalent and likely to co-occur with others (e.g., high degree). To control for the effect of node removal and distinguish effects of Unknown taxa from network size, networks were generated that excluded several randomly picked Known nodes equal to the number of Unknown OTUs. This process was repeated 100 times to create a distribution of “Null” or “Bootstrap” networks for statistical comparisons. Differences in network parameters between networks without Unknown OTUs and the “Original” or “Bootstrap” networks were determined by the Wilcoxon test and p values were adjusted using the Holm method [43]. Strikingly, removal of the Unknown taxa caused a statistically significant impact on all measured network metrics in all four studied environments (Table 2, Fig. 4, and Supplementary Figs. S14–24). In the polar environments, for example, removal of Unknown OTUs caused a significant decrease in degree (p value More
