Abiotic selection of microbial genome size in the global ocean
Non-prokaryotic metagenomic sequences confound average genome size estimations

In this work, we employed MicrobeCensus (ref. 22) for de novo estimation of the average genome size (AGS) of microorganisms captured in shotgun metagenome sequences (Fig. 1a; Supplementary Data 1). Briefly, MicrobeCensus optimally aligns metagenomic reads to a set of 30 conserved single-copy gene (CSCG) families found in prokaryotes (ref. 22). Based on these mappings, the relative abundance of each CSCG is then computed and used to estimate AGS based on the proportionality principle: the AGS of the community is inversely proportional to the relative abundance of each marker gene (ref. 22). Finally, a weighted average AGS is calculated that excludes outliers to obtain a robust AGS estimate for a given metagenomic sample (ref. 22).

Fig. 1: Eukaryotic and viral metagenomic reads bias AGS estimates in marine microbial metagenomes. a Schematic workflow of procedures used for estimating AGS in metagenomic samples. AGS is estimated based directly on preprocessed high-quality metagenomic reads (AGS1) and after three iterative steps to remove potential eukaryotic reads (AGS2) and viral reads detected based on the RefSeq viral genome database (AGS3) or de novo (AGS4). See the “Methods” section for more details. b Relationship between depth and proportion of total putative eukaryotic and viral sequences in marine metagenomic collections. The blue line indicates the fitted one-tailed Spearman correlation (r), with the corresponding 95% confidence intervals for the curve indicated by grey bands. The density distribution of the estimated proportion of contaminants is shown in green, with the corresponding median values (µ) highlighted. Values in parentheses denote the filter size range of sampled metagenomes. c The fraction of ‘contaminating’ reads is highest in the epipelagic ocean relative to other ocean depth layers. EPI Epipelagic (~3–200 m), MES Mesopelagic (200–1000 m), BAT Bathypelagic (1000–4000 m).
Values in parentheses indicate the number of metagenomes. Only the results from the Malaspina Vertical Profiles (MProfile) metagenomes are shown as they cover greater depths of the global ocean (mean 1114 m; Supplementary Data 1). d Eukaryotic and viral metagenomic sequences significantly increase AGS estimates for prokaryotic plankton in marine metagenomes. Values in parentheses show the number of metagenomes for AGS1 and AGS2. e AGS estimates decreased in most metagenomic samples (85%; n = 220) after decontamination compared to predictions directly from preprocessed metagenomes by 1–19% (n = 39). Boxplots (c–e) show the median as middle horizontal (c, d) or vertical (e) lines and interquartile ranges as boxes (whiskers extend no further than 1.5 times the interquartile ranges). Data are shown as circular symbols, while mean values are shown as white diamonds. Values at the top (c, d) indicate the adjusted significant P-values of the unpaired (c) and paired (d) two-sided Wilcoxon test with Benjamini-Hochberg correction. Source data are provided as a Source Data file.

Of note, the AGS of complete prokaryotic genomes increases with the cumulative number of associated phages and other mobile genetic elements (ref. 37). Similarly, AGS estimates derived from metagenomic sequences of uncultured “free-living” microbes (captured in 0.1–3 µm-size filters) may also be affected by putative phage and eukaryotic microbiomes sequenced concurrently in fractionated seawater samples (see refs. 8, 22). To evaluate this possibility in our AGS predictions, we compared AGS estimates obtained directly from quality-controlled metagenomes with estimates from the same metagenomes iteratively subjected to three (de novo) decontamination procedures to filter out potential eukaryotic and viral sequence reads (Fig. 1a; see details in the “Methods” section).
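The proportionality principle behind these AGS estimates can be illustrated with a short sketch. It is not the MicrobeCensus implementation: the function name `estimate_ags`, the per-family constants and weights, and the median-absolute-deviation outlier rule are all assumptions made here for illustration, since the text above states only that AGS is inversely proportional to marker-gene relative abundance and that a weighted, outlier-excluding average is taken.

```python
from statistics import median


def estimate_ags(rel_abundance, coefficient, weight, max_mad=3.0):
    """Return a weighted average genome size (bp) from per-marker estimates.

    rel_abundance: {family: fraction of reads assigned to that marker family}
    coefficient:   {family: hypothetical proportionality constant (bp)}
    weight:        {family: hypothetical positive reliability weight}
    """
    # Proportionality principle: AGS_i = c_i / R_i for each marker family i.
    per_marker = {
        fam: coefficient[fam] / r
        for fam, r in rel_abundance.items()
        if r > 0
    }
    if not per_marker:
        raise ValueError("no marker family had a non-zero relative abundance")

    # Assumed outlier rule: drop estimates further than max_mad median
    # absolute deviations (MAD) from the median. If the MAD is zero,
    # every estimate is kept.
    values = list(per_marker.values())
    centre = median(values)
    mad = median(abs(v - centre) for v in values)
    kept = {
        fam: v
        for fam, v in per_marker.items()
        if mad == 0 or abs(v - centre) <= max_mad * mad
    }

    # Weighted average over the retained marker families.
    total_weight = sum(weight[fam] for fam in kept)
    return sum(weight[fam] * v for fam, v in kept.items()) / total_weight


if __name__ == "__main__":
    # Made-up values for four hypothetical marker families.
    abundance = {"fam_a": 0.0010, "fam_b": 0.0012, "fam_c": 0.0008, "fam_d": 0.00001}
    constants = {fam: 3000.0 for fam in abundance}
    weights = {fam: 1.0 for fam in abundance}
    print(f"AGS estimate: {estimate_ags(abundance, constants, weights):.0f} bp")
```

In this toy input, the family with a very low relative abundance yields an implausibly large per-marker estimate and is excluded before averaging, which mirrors the role of outlier exclusion described above.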
Overall, putatively ‘contaminating’ viral and eukaryotic reads accounted for 1% to 20% (average 7.5%) of the high-quality trimmed sequences in the four microbial metagenome collections (Fig. 1b; Supplementary Data 1). As expected, the average proportion of contaminating sequences in metagenomes from large (0.2–3.0 µm) and small (0.1–1.2 µm) size fraction filters was the highest (~11%) and lowest (~5%), respectively (Fig. 1b). In addition, the proportion of contaminating reads was significantly dependent on the depth layer of the ocean (Kruskal-Wallis χ² = 32.40, df = 2, p […]

[…] 200–1000 m), and bathypelagic (BAT, > 1000–4000 m). c AGS estimates in the “free-living” (0.2–0.8 µm) and particle-associated (0.8–20 µm) bathypelagic microbiome sampled latitudinally at 4000 m depth during the Malaspina expedition. Boxplots show the median as middle horizontal line and interquartile ranges as boxes (whiskers extend no further than 1.5 times the interquartile ranges). Data are shown as circular symbols, while mean values are shown as white diamonds. Values at the top indicate the adjusted significant P-values of the unpaired (b) and paired (c) two-sided Wilcoxon test with Benjamini-Hochberg correction. The number of metagenomes analyzed is indicated in parentheses in all three panels. Source data are provided as a Source Data file.

The median AGS estimate range of 2.2 to ~3.0 Mbp in the sampled free-living (0.1–3 µm in size) marine prokaryotic communities (n = 209 metagenomes) is consistent with other large-scale metagenome sequence-based estimates and the sizes of metagenome-assembled prokaryotic genomes (MAGs; in 0.22–3 µm filters) from the photic ocean (surface to mesopelagic) based on the Tara Oceans Expedition (1.5–2.3 Mbp) (refs. 15, 16).
Overall, our metagenome sequence-based AGS estimates support the unimodal distribution of prokaryotic genome sizes recently demonstrated in environmental genomes in several biomes (ref. 38) and on cultured isolates (including marine bacterioplankton) (refs. 14, 39). However, estimates from isolates are likely biased since current cultivation approaches tend to favor copiotrophs (see ref. 3).

We next tested whether the derived AGS estimates depended on microbial cell size by analyzing 25 paired bathypelagic metagenomes (MDeep; Supplementary Data 1) sampled during the global Malaspina Expedition (ref. 40) in which both prokaryotic life strategies, free-living (0.2–0.8 µm size) and particle-associated (0.8–20 µm size), were sampled simultaneously (ref. 35). The analyzed metagenomes (MDeep) were from the Atlantic, Pacific, and Indian Ocean provinces and cover a spatial distance of 9437 km with an average depth (± SD) of 3688 ± 526 m at the tropical and subtropical latitudes (–33.55°N to 32.0788°N). These microbial metagenomes were also screened for contaminating eukaryotic and viral sequences as indicated in Fig. 1a (see details in the “Methods” section and Supplementary Data 1). The genomes of bathypelagic prokaryotes associated with marine particles (5.6 ± 0.97 Mbp) were twice as large (paired two-sided Wilcoxon test, p […] 3 µm) prokaryotes, respectively (Supplementary Data 3). These estimates are also consistent with those of MAGs reconstructed from the same metagenomes in the Challenger Deep (Mariana Trench) (ref. 43).
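The figure captions report Wilcoxon P-values adjusted with the Benjamini-Hochberg correction. As a minimal sketch of how that adjustment works (not the authors' code; the function name `benjamini_hochberg` and the example P-values are hypothetical), each P-value is scaled by the number of tests divided by its rank, and the results are then forced to be monotone from the largest P-value downwards:

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P-values in the input order."""
    m = len(p_values)
    # Indices of the P-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down so that each adjusted value is
    # never larger than the adjusted value of the next-larger P-value.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


if __name__ == "__main__":
    # Made-up raw P-values from four hypothetical pairwise comparisons.
    raw = [0.01, 0.04, 0.03, 0.005]
    for p, q in zip(raw, benjamini_hochberg(raw)):
        print(f"raw p = {p:.3f}, adjusted p = {q:.3f}")
```

Adjusted values are capped at 1 and returned in the same order as the input, so they can be attached directly to the comparisons they came from.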
Overall, this reinforces the patterns of larger AGS in particle-associated compared to free-living bathypelagic prokaryotes, and larger microbial genomes in the deep ocean compared to the upper ocean.

AGS patterns are not geographically constrained

Examination of the geographic patterns of AGS estimates showed that AGS distribution was independent of geographic distance in both the regional (Red Sea, Mantel statistic r = 0.01824, p = 0.2971) and global (MProfile, r = –0.01413, p = 0.7924) ocean metagenomes. Furthermore, AGS estimates in the vertically profiled global Malaspina metagenomes (MProfile, n = 81) did not differ significantly among the Longhurst biogeochemical provinces sampled (n = 9; Kruskal-Wallis χ² = 1.0006, df = 8, p = 0.9982; Supplementary Data 1). The lack of covariance between the patterns of AGS estimates and geographic distance or Longhurst province sampled may reflect the high connectivity of microbial communities throughout the global ocean, particularly the redistributive effects of circulation by ocean currents and other transport processes, as well as the enormous population sizes of plankton that allow dispersal constraints to be overcome (refs. 44, 45). This is consistent with the relatively small differences in microbial assemblages recently found in different ocean basins (refs. 23, 46). Another possible explanation is the effect of seasonality, which can cause selection of different taxa, resulting in the succession of microbial communities and affecting their distribution (see ref. 47), and thus influence AGS patterns.

An assessment of the relationship between AGS and measured environmental variables (Supplementary Fig.
S1; Data 1), performed separately for the Red Sea metagenomes (regional scale) and Malaspina Vertical Profiles metagenomes (global scale), showed that the cumulative effect of temperature, salinity, dissolved oxygen, and depth on AGS patterns was significant at both the regional scale (n = 45; Mantel statistic r = 0.1944, p = 0.0057) and the global scale (n = 81; Mantel statistic r = 0.1779, p = 1 × 10⁻⁴). This result suggests that environmental conditions are a driving force behind predicted AGS patterns in the marine microbiome. While no significant interaction effect was evident between many environmental variables (i.e., salinity, depth, oxygen, nitrate, and phosphate) in controlling AGS patterns (one-way ANOVA, p […]