2,809 draft MAGs from the rumen ecosystem
We amassed 3.2 terabase pairs (Tbp) of data from 346 publicly available and 66 new rumen metagenome datasets (Supplementary Table 1). The metagenomes were from cattle (312 samples, 2.1 Tbp), sheep (75 samples, 888.4 gigabase pairs (Gbp)), moose (9 samples, 108.8 Gbp), deer (8 samples, 62.9 Gbp), and bison (8 samples, 52.3 Gbp). Metagenomes were assembled independently to reduce the influence of strain variation and improve the recovery of closely related genomes18,19. Following refinement, dereplication, and filtering of resulting population genomes, we identified 2,809 nonredundant MAGs satisfying the following criteria: dRep20 genome quality score ≥60, ≥75% complete, ≤10% contamination, N50 ≥5 kbp, and ≤500 contigs.
The median estimated completeness and contamination of the MAGs were 89.7% and 0.9%, respectively (Fig. 1a and Supplementary Data 1). Further, recovered MAGs had a median genome size of 2.2 Mbp, a median of 131 contigs, and a median N50 of 28.3 kbp (Fig. 1b). The proposed minimum information about a MAG (MIMAG) specifies high-quality draft genomes to have an estimated ≥90% completeness, ≤5% contamination, at least 18 tRNAs, and contain 23S, 16S, and 5S rRNA genes21. It remains challenging to reconstruct rRNA genes from short metagenomic reads due to the high sequence similarity of rRNA genes in closely related species. As a result, despite high estimated completeness and low contamination rates, only 20 MAGs meet the MIMAG standards for a high-quality draft genome. We identified a 16S rRNA gene in 197 of the MAGs. The remaining MAGs are characterized as medium-quality MAGs under the MIMAG standards.
The majority of bacterial MAGs belonged to phyla Firmicutes or Bacteroidota (2,326; Fig. 2a and Supplementary Data 1). Additionally, we assembled 12 bacterial genomes from the superphylum Patescibacteria. At lower taxonomic ranks, Lachnospiraceae (415) and Prevotella (398) were the dominant family and genus identified among the assembled bacterial genomes. The most prevalent archaeal family and genus were Methanobacteriaceae (45) and Methanobrevibacter (35), respectively (Fig. 2b). The recovered MAGs represent several new taxonomic lineages, as four genomes could not be classified at the rank of order, 16 at the rank of family, and 243 at the genus rank.
Species-level overlap between reference genomes, the Hungate1000 Collection, and rumen MAGs
To further characterize the assembled genomes, we compared the MAGs to other rumen-specific genome collections, specifically genomes generated from the Hungate1000 project3 and MAGs identified from the Stewart et al. studies4,5. We clustered genomes based on approximate species-level thresholds (≥95% ANI) and calculated the intersection between MAGs in the current study and the Hungate1000 Collection (410 genomes)3, MAGs from Stewart et al. (4,941 genomes)4,5, and a dereplicated genome collection from the GTDB (22,441 genomes, see Methods)22, which includes reference isolate genomes and some environmental MAGs23. It should be noted that we used the raw data from the first of the Stewart et al. studies4 (Supplementary Table 1), but with different assembly and binning approaches. Approximately one-third of the MAGs (1,007) did not exhibit ≥95% ANI with a genome in the GTDB, Stewart et al. MAGs, or the Hungate1000 isolates (Fig. 3a). When considering the pairwise intersections between the datasets, 98 (3.5%), 933 (33.2%), and 1,438 (51.2%) of the MAGs in the current study had ≥95% ANI with a genome in the Hungate1000 Collection3, GTDB22, and Stewart et al.4,5, respectively. One hundred twenty-one (29.5%), 552 (2.5%), and 3,125 (63.2%) of the genomes from the Hungate1000 Collection3, GTDB22, and Stewart et al.4,5 displayed ≥95% ANI with a MAG from the current study. Together, these results indicate that we recovered a majority of previous rumen genomic diversity with additional lineages not previously identified in other major rumen genomic collections.
We applied an additional clustering approach to identify the approximate number of species represented by the rumen-specific genomes assembled in this study, in the Hungate1000 Collection3, and Stewart et al.4,5. A 95% ANI threshold yielded 3,541 clusters from the combination of the datasets (Supplementary Data 2). Of the 3,541 clusters, 2,024 contained a MAG from the current study, and 1,135 were composed exclusively of MAGs from the current study. In comparison, 2,175 and 286 clusters were comprised of genomes from Stewart et al.4,5 and the Hungate1000 Collection3, respectively. The majority of 95% ANI clusters (2,166) are only comprised of a single genome (Fig. 3b). Furthermore, a rarefaction curve suggests the 8,160 genomes from the genomic collections analyzed here only represent a fraction of the estimated microbial species diversity in the rumen (Fig. 3c). The genome with the best dRep score from each cluster was used to generate a phylogenetic tree highlighting the species diversity within each rumen genomic collection and represents the vast diversity of rumen bacterial (Fig. 3d) and archaeal (Fig. 3e) genomes published to date.
As stated previously, the median genome size of reconstructed MAGs was 2.2 Mbp, smaller than the median size of genomes from the Hungate1000 project (3.1 Mbp)3. To provide an assessment at a finer resolution, genome sizes of MAGs and Hungate1000 genomes3 belonging to the same 95% ANI cluster were compared (Supplementary Fig. 1). Adjusted sizes of MAGs and Hungate1000 genomes that are ≥95% complete displayed a regression coefficient of 0.96 with a slope of 0.86, indicating the binning process likely did not lead to extensive losses and systematic biases in the reconstructed genomes. Instead, it further highlights that current culturing approaches have not brought large portions of rumen microbial diversity into culture and putatively supports previous findings from the human gut that revealed genome-reduction in uncultured bacteria24.
Rumen metagenome classification rates using reference and rumen-specific genomes
Utilizing an approach similar to Stewart et al.4,5, we investigated the influence of MAGs on rates of metagenomic read classification. The baseline for read classification was the standard Kraken database containing bacterial, archaeal, fungal, and protozoal RefSeq genomes25. Each rumen-specific dataset was incrementally added to the Kraken RefSeq genomic database in the following order to build new databases: the Hungate1000 Collection3, MAGs from Stewart et al.4,5, and MAGs from the current study. Each individual and collective database was used for classification of sample reads that underpinned metagenomic binning and from a rumen metagenomic dataset not used in the reconstruction of MAGs26. MAGs from the current work classified more reads from deer, moose, and sheep metagenomes, while the more numerous MAGs from Stewart et al.4,5 classified more reads from bison and cattle metagenomes (Supplementary Fig. 2a). The addition of MAGs improves classification relative to databases primarily based on cultured isolates, like the Hungate1000 Collection3 (Supplementary Fig. 2b). Using the combination of all reference and rumen-specific genomes, the median classification rate on an independent set of cattle metagenomes was 62.6%.
Phylogenetic characterization of biosynthetic gene clusters
Microbial genome mining is a powerful tool for natural product discovery. We sought to explore the extent of secondary metabolite diversity coded by the MAGs in the current study, the Hungate1000 Collection3, and Stewart et al. MAGs4,5. We identified 14,814 BGCs encoded by the 8,160 rumen-specific genomes using antiSMASH27 (Fig. 4a and Supplementary Data 3). The majority of BGCs were NRPS (5,346), followed by aryl polyenes (2,800), sactipeptides (2,126), and bacteriocins (1,943). Only a few PKS were identified (75). Firmicutes harbored the vast majority of clusters for NRPS, sactipeptide, lantipeptide, lassopeptide, and bacteriocin synthesis (Fig. 4b). At lower taxonomic ranks, DTU089 (979), Bacteroidaceae (934), and Lachnospiraceae (923) coded for the bulk of NRPS gene clusters. Moreover, Acidaminococcaceae genomes contained 21.2% of identified bacteriocins and Ruminococcus spp. possessed the bulk of sactipeptides and lantipeptides. Archaea were predicted to code 737 BGCs, including an average of 3.8 NRPS gene clusters per genome (Fig. 4a).
NRPS exhibit high molecular and structural diversity resulting in a wide array of biological activities. The diversity of NRPS, combined with their proteolytic stability and selective bioactivity, has resulted in the development of many NRPS as antimicrobials and other therapeutic agents28. Given the prevalence of NRPS among the recovered MAGs (Fig. 4a), the peptides appear to be important bioactive metabolites in the rumen. To gain fundamental insight into the phylogenetic diversity of rumen NRPS, we built a network based on BGC similarity using BiG-SCAPE29. BiG-SCAPE uses protein domain content, order, copy number, and sequence identity to calculate a distance metric. We assessed the similarity of NRPS gene clusters identified in Firmicutes, Bacteroidota, and Euryarchaeota, as these three phyla coded for 96.4% of assembled NRPS gene clusters from rumen genomes. With a BiG-SCAPE similarity threshold of 0.3, the resulting network consisted of 3,436 nodes (NRPS BGCs on contigs ≥10 kbp) and 79,112 edges (Fig. 4c and Supplementary Data 4). As expected, the network analysis depicted high inter- and intra-phylum genetic diversity among the NRPS gene clusters. The median intra-phylum, -family, and -genus similarity was 0.40, 0.44, and 0.46, respectively, while the median inter-phylum, -family, and -genus similarity was 0.32, 0.34, and 0.34, respectively. Further, only 2.6% of edges were inter-phylum and 69.0% were intra-family. Of the 6,594 Euryarchaeota edges, 8.1% were Euryarchaeota-Firmicutes (median similarity of 0.32) and 2.0% of edges were Euryarchaeota-Bacteroidota (median similarity of 0.31). To further examine the phylogenetic relationships of rumen Euryarchaeota NRPS, we clustered 265 NRPS gene clusters (≥10 kbp) from 85 near-complete Euryarchaeota genomes at a higher similarity threshold of 0.75, yielding 57 NRPS clusters (Fig. 4d). The distribution of NRPS clusters amongst the genomes suggests there exists a strong relationship between methanogen phylogeny and NRPS similarity. Only Methanobrevibacter genomes contain NRPS gene clusters, and genomes of the same species often possessed many of the same NRPS clusters (see genomes highlighted in blue in Fig. 4d). However, there are instances in which closely related methanogens code for a contrasting pattern of NRPS clusters or no NRPS clusters at all (see genomes highlighted in red in Fig. 4d).
Bacteriocins likely serve as regulatory elements in complex microbial communities such as the rumen. Consequently, bacteriocins have been studied and characterized for their bactericidal activity and as agents that modulate the microbiota structure and function30. In particular, lanthipeptides, a class of ribosomally synthesized and post-translationally modified peptides (RiPPs) with thioether cross-linked amino acids31, are of pharmaceutical, preservative, and agricultural interest due to their strong antimicrobial properties against gram-positive pathogens31,32,33, low levels of antimicrobial resistance34, and stability35. We identified 195 rumen lanthipeptide BGCs from the Hungate1000 genomes and MAGs from Stewart et al. and the current study. Rumen lanthipeptide BGCs were clustered with 22,870 lanthipeptide BGCs from RefSeq genomes36,37 into gene cluster families (GCFs; groups of BGCs that may generate highly similar products). Clustering with BiG-SCAPE29 yielded 4,565 GCFs, 120 of which contained a rumen lanthipeptide. The 120 GCFs were composed of 519 lanthipeptide BGCs, where 324 were from RefSeq isolates and 195 from rumen genomes (Fig. 5a). The 324 RefSeq BGCs fell into only 18 GCFs. Lanthipeptides from the Hungate1000 isolates clustered into 36 GCFs, while rumen MAG lanthipeptides belonged to 92 GCFs, 82 of which were exclusively composed of MAG lanthipeptides. Together, this evidence suggests rumen MAGs code for diverse and novel lanthipeptides not represented in cultured isolates, including the Hungate Collection.
We sought to further examine the differences in rumen MAG lanthipeptides relative to isolates and the taxonomic diversity of rumen microbes coding for lanthipeptides. The 195 rumen lanthipeptides were mainly found in Firmicutes genomes, with a subset from Bacteroidota and Actinobacteriota (Fig. 5b). Fifty-two of the 55 lanthipeptides from the Hungate Collection isolates were from Firmicutes (94.5%). At the family-level, these 52 Firmicutes BGCs were distributed evenly between Lachnospiraceae and Streptococcaceae. In contrast, 19.2% and 8.6% of lanthipeptides from rumen MAGs belonged to Bacteroidota and Actinobacteriota, respectively. Lanthipeptides from MAGs were also found in Muribaculaceae and Oscillospiraceae. Moreover, 26.4% of rumen MAG lanthipeptides, compared to 3.6% of Hungate Collection isolates, were found in Eubacterium genomes. The majority of Eubacterium MAG lanthipeptides (62.1%) belonged to a single GCF, suggesting they code for very similar products. Lastly, antiSMASH predicted the bulk of the rumen lanthipeptides were Class II lanthipeptides, with fewer Class I and Class III types (Fig. 5b). Nearly all of the Class I lanthipeptides were from Hungate isolates. The above analysis of lanthipeptide diversity further supports that rumen MAGs code for novel secondary metabolites not represented in cultured isolates.
We aligned previously published rumen metatranscriptome data from steers characterized as having high and low feed efficiency to the BGCs to demonstrate if the identified BGCs are active and to explore potential ecological roles of secondary metabolites. Despite data from the metatranscriptome study not being applied to reconstruct genomes in the current study, we identified the expression of 554 gene clusters from rumen-specific genomes in the 20 metatranscriptomes (≥100 aligned reads). Metatranscriptome read count data were normalized independently for each genome to better account for the variation in taxonomic composition across samples38. Genome-specific normalization resulted in the identification of 17 differentially expressed gene clusters between steers with high and low feed efficiency (DESeq239 false discovery rate adjusted P < 0.05; Supplementary Data 5). Of the 17 differentially expressed BGCs, 16 exhibited higher expression levels in the rumen samples from less efficient steers with higher residual feed intake. Further, Prevotella and Selenomonas coded for 12 of the differentially expressed BGCs (70.6%). All of the differentially expressed Selenomonas BGCs were sactipeptides (n = 7), while the Prevotella BGCs were more diverse and included NRPS and aryl polyenes.
Microdiversity of BGCs and MAGs
Phylogenetic analyses of BGC often revealed high inter-species diversity (i.e, methanogen NRPS diversity in Fig. 4d). We next investigated patterns of sub-species microdiversity in rumen BGCs. In order to reduce the influence of study-to-study effects, we focused on the microdiversity of MAGs across 282 metagenomes in the Stewart et al. studies4,5. MAGs with ≥50% of its genome covered by at least 5 reads were considered as detected in a sample and used for microdiversity analyses. The within-sample microdiversity of genes and genomes were assessed using InStrain40. Our phylogenetic analysis identified that different classes of BGCs are enriched in certain lineages (Fig. 4a, b). As a result, the nucleotide diversity values for genes were normalized using the mean genome-wide nucleotide diversity for each MAG to account for lineage-specific evolutionary processes and more accurately compare patterns of microdiversity in BGCs across lineages. There were significant differences in the nucleotide diversity of genes from the four major classes of BGCs identified in rumen-specific genomes (Kruskal–Wallis H = 1795.5, ε2 = 0.001, P < 2.2 × 10−16; Fig. 6a), but the effect size (ε2) between BGC types was negligible. Outliers with high microdiversity were bacteriocin genes from RC9 and UBA3207 sp. as well as NRPS genes from CAG-710 and UBA9715 sp. Additionally, we explored the association of genome-wide and secondary metabolism gene microdiversity with cattle breed. The mean nucleotide diversity of MAGs (Kruskal–Wallis H = 1027.5, ε2 = 0.0265, P < 2.2 × 10−16; Fig. 6b) and the normalized nucleotide diversity of genes from BGCs (Kruskal–Wallis H = 403.84, ε2 = 0.0003, P < 2.2 × 10−16; Fig. 6c) were both significantly different between the four breeds. The effect size (ε2) of microdiversity difference between breeds was much larger for the genome-wide comparison than for genes from BGCs. This finding raised the question if genes from BGCs have different nucleotide diversity relative to other genes. We found that genes across all BGCs had lower normalized nucleotide diversity compared to all other genes from investigated MAGs (Wilcoxon rank-sum W = 6.11 × 1013, Vargha and Delaney’s A = 0.507, P < 2.2 × 10−16; Fig. 6d). The raw nucleotide diversity values were higher for genes in BGCs than other genes (Wilcoxon rank-sum W = 5.801 × 1013, Vargha and Delaney’s A = 0.481, P < 2.2 × 10−16). Regardless, again we find the effect size of the difference to be very small though. Together, microdiversity analyses suggest rumen microbial BGC diversity is comparable across the prevalent BGC classes, breeds, and similar to other genes.
Source: Ecology - nature.com