Insights into rumen microbial biosynthetic gene cluster diversity through genome-resolved metagenomics

2,809 draft MAGs from the rumen ecosystem

We amassed 3.2 terabase pairs (Tbp) of data from 346 publicly available and 66 new rumen metagenome datasets (Supplementary Table 1). The metagenomes were from cattle (312 samples, 2.1 Tbp), sheep (75 samples, 888.4 gigabase pairs (Gbp)), moose (9 samples, 108.8 Gbp), deer (8 samples, 62.9 Gbp), and bison (8 samples, 52.3 Gbp). Metagenomes were assembled independently to reduce the influence of strain variation and improve the recovery of closely related genomes^18,19. Following refinement, dereplication, and filtering of resulting population genomes, we identified 2,809 nonredundant MAGs satisfying the following criteria: dRep²⁰ genome quality score ≥60, ≥75% complete, ≤10% contamination, N50 ≥5 kbp, and ≤500 contigs.

The median estimated completeness and contamination of the MAGs were 89.7% and 0.9%, respectively (Fig. 1a and Supplementary Data 1). Further, recovered MAGs had a median genome size of 2.2 Mbp, a median of 131 contigs, and a median N50 of 28.3 kbp (Fig. 1b). The proposed minimum information about a MAG (MIMAG) specifies high-quality draft genomes to have an estimated ≥90% completeness, ≤5% contamination, at least 18 tRNAs, and contain 23S, 16S, and 5S rRNA genes²¹. It remains challenging to reconstruct rRNA genes from short metagenomic reads due to the high sequence similarity of rRNA genes in closely related species. As a result, despite high estimated completeness and low contamination rates, only 20 MAGs meet the MIMAG standards for a high-quality draft genome. We identified a 16S rRNA gene in 197 of the MAGs. The remaining MAGs are characterized as medium-quality MAGs under the MIMAG standards.

Fig. 1: Genomic properties of 2,809 rumen MAGs.

a CheckM completeness and contamination estimates for the 2,809 population genomes recovered from rumen metagenomes. The size of the point on the scatter plot corresponds to the dRep genome quality score, where Quality = Completeness − (5 ⋅ Contamination) + (Contamination ⋅ (Strain Heterogeneity/100)) + 0.5 ⋅ (({mathrm{log}},)(N50). The reported MAGs meet the following minimum criteria: genome quality score ≥60, ≥75% complete, ≤10% contamination, N50 ≥5 kbp, and ≥500 contigs. b The frequency distribution of the number of contigs and genome sizes of reconstructed MAGs.

Full size image

The majority of bacterial MAGs belonged to phyla Firmicutes or Bacteroidota (2,326; Fig. 2a and Supplementary Data 1). Additionally, we assembled 12 bacterial genomes from the superphylum Patescibacteria. At lower taxonomic ranks, Lachnospiraceae (415) and Prevotella (398) were the dominant family and genus identified among the assembled bacterial genomes. The most prevalent archaeal family and genus were Methanobacteriaceae (45) and Methanobrevibacter (35), respectively (Fig. 2b). The recovered MAGs represent several new taxonomic lineages, as four genomes could not be classified at the rank of order, 16 at the rank of family, and 243 at the genus rank.

Fig. 2: Phylogenetic relationships and coverage patterns of near-complete MAGs.

a Phylogenomic analysis of 1,163 near-complete (≥90% complete, ≤5% contamination, and N50 ≥15 kbp) bacterial MAGs and (b) 20 near-complete archaeal MAGs inferred from the concatenation of phylogenetically informative proteins. Layers below the genomic trees designate bacterial phylum or archaeal genus based on GTDB taxonomic assignments, genomic size (0–5 Mbp), and the mean number of bases with ≥1× coverage in a rumen metagenomic dataset (layer color indicates the ruminant the data was collected from). The mean number of bases with ≥1× coverage was used as input for hierarchical clustering of rumen metagenomic datasets based on Euclidean distance and Ward linkage. The bacterial and archaeal phylogenetic trees are provided as Supplementary Data 6 and Supplementary Data 7, respectively.

Full size image

Species-level overlap between reference genomes, the Hungate1000 Collection, and rumen MAGs

To further characterize the assembled genomes, we compared the MAGs to other rumen-specific genome collections, specifically genomes generated from the Hungate1000 project³ and MAGs identified from the Stewart et al. studies^4,5. We clustered genomes based on approximate species-level thresholds (≥95% ANI) and calculated the intersection between MAGs in the current study and the Hungate1000 Collection (410 genomes)³, MAGs from Stewart et al. (4,941 genomes)^4,5, and a dereplicated genome collection from the GTDB (22,441 genomes, see Methods)²², which includes reference isolate genomes and some environmental MAGs²³. It should be noted that we used the raw data from the first of the Stewart et al. studies⁴ (Supplementary Table 1), but with different assembly and binning approaches. Approximately one-third of the MAGs (1,007) did not exhibit ≥95% ANI with a genome in the GTDB, Stewart et al. MAGs, or the Hungate1000 isolates (Fig. 3a). When considering the pairwise intersections between the datasets, 98 (3.5%), 933 (33.2%), and 1,438 (51.2%) of the MAGs in the current study had ≥95% ANI with a genome in the Hungate1000 Collection³, GTDB²², and Stewart et al.^4,5, respectively. One hundred twenty-one (29.5%), 552 (2.5%), and 3,125 (63.2%) of the genomes from the Hungate1000 Collection³, GTDB²², and Stewart et al.^4,5 displayed ≥95% ANI with a MAG from the current study. Together, these results indicate that we recovered a majority of previous rumen genomic diversity with additional lineages not previously identified in other major rumen genomic collections.

Fig. 3: Genomes sharing ≥95% ANI between databases and the characterization of rumen-specific 95% ANI clusters.

a The approximate number of species overlapping amongst rumen-specific and reference genomic datasets. Genomes demonstrating ≥95% ANI were considered to be shared between two datasets. Presented are a subset of intersections in which a MAG from the current study was the query genome. b The number of genomes comprising each of the 3,541 95% ANI clusters generated from 8,160 rumen microbial genomes in the current study, the Hungate1000 Collection³, and Stewart et al. studies^{4, 5}. c Rarefaction analysis based on subsampling 95% ANI clusters at steps of 500 genomes indicates the 8,160 genomes from recently published rumen genomic collections still only represent a fraction of expected microbial species diversity in the rumen ecosystem. Phylogenomic relationships of the 1,781 near-complete bacterial (d) and 35 near-complete archaeal (e) representative genomes with the highest dRep genome quality score from the 3,541 95% ANI clusters generated from 8,160 rumen-specific genomes. Near-complete genomes were defined as being ≥90% complete, having ≤5% contamination, and contig N50 ≥15 kbp. Layers surrounding the genomic trees indicate the bacterial phyla or archaeal genera and the log normalized number of genomes from each rumen genomic collection belonging to the same 95% ANI cluster. The bacterial and archaeal phylogenetic trees are provided as Supplementary Data 8 and Supplementary Data 9, respectively.

Full size image

We applied an additional clustering approach to identify the approximate number of species represented by the rumen-specific genomes assembled in this study, in the Hungate1000 Collection³, and Stewart et al.^4,5. A 95% ANI threshold yielded 3,541 clusters from the combination of the datasets (Supplementary Data 2). Of the 3,541 clusters, 2,024 contained a MAG from the current study, and 1,135 were composed exclusively of MAGs from the current study. In comparison, 2,175 and 286 clusters were comprised of genomes from Stewart et al.^4,5 and the Hungate1000 Collection³, respectively. The majority of 95% ANI clusters (2,166) are only comprised of a single genome (Fig. 3b). Furthermore, a rarefaction curve suggests the 8,160 genomes from the genomic collections analyzed here only represent a fraction of the estimated microbial species diversity in the rumen (Fig. 3c). The genome with the best dRep score from each cluster was used to generate a phylogenetic tree highlighting the species diversity within each rumen genomic collection and represents the vast diversity of rumen bacterial (Fig. 3d) and archaeal (Fig. 3e) genomes published to date.

As stated previously, the median genome size of reconstructed MAGs was 2.2 Mbp, smaller than the median size of genomes from the Hungate1000 project (3.1 Mbp)³. To provide an assessment at a finer resolution, genome sizes of MAGs and Hungate1000 genomes³ belonging to the same 95% ANI cluster were compared (Supplementary Fig. 1). Adjusted sizes of MAGs and Hungate1000 genomes that are ≥95% complete displayed a regression coefficient of 0.96 with a slope of 0.86, indicating the binning process likely did not lead to extensive losses and systematic biases in the reconstructed genomes. Instead, it further highlights that current culturing approaches have not brought large portions of rumen microbial diversity into culture and putatively supports previous findings from the human gut that revealed genome-reduction in uncultured bacteria²⁴.

Rumen metagenome classification rates using reference and rumen-specific genomes

Utilizing an approach similar to Stewart et al.^4,5, we investigated the influence of MAGs on rates of metagenomic read classification. The baseline for read classification was the standard Kraken database containing bacterial, archaeal, fungal, and protozoal RefSeq genomes²⁵. Each rumen-specific dataset was incrementally added to the Kraken RefSeq genomic database in the following order to build new databases: the Hungate1000 Collection³, MAGs from Stewart et al.^4,5, and MAGs from the current study. Each individual and collective database was used for classification of sample reads that underpinned metagenomic binning and from a rumen metagenomic dataset not used in the reconstruction of MAGs²⁶. MAGs from the current work classified more reads from deer, moose, and sheep metagenomes, while the more numerous MAGs from Stewart et al.^4,5 classified more reads from bison and cattle metagenomes (Supplementary Fig. 2a). The addition of MAGs improves classification relative to databases primarily based on cultured isolates, like the Hungate1000 Collection³ (Supplementary Fig. 2b). Using the combination of all reference and rumen-specific genomes, the median classification rate on an independent set of cattle metagenomes was 62.6%.

Phylogenetic characterization of biosynthetic gene clusters

Microbial genome mining is a powerful tool for natural product discovery. We sought to explore the extent of secondary metabolite diversity coded by the MAGs in the current study, the Hungate1000 Collection³, and Stewart et al. MAGs^4,5. We identified 14,814 BGCs encoded by the 8,160 rumen-specific genomes using antiSMASH²⁷ (Fig. 4a and Supplementary Data 3). The majority of BGCs were NRPS (5,346), followed by aryl polyenes (2,800), sactipeptides (2,126), and bacteriocins (1,943). Only a few PKS were identified (75). Firmicutes harbored the vast majority of clusters for NRPS, sactipeptide, lantipeptide, lassopeptide, and bacteriocin synthesis (Fig. 4b). At lower taxonomic ranks, DTU089 (979), Bacteroidaceae (934), and Lachnospiraceae (923) coded for the bulk of NRPS gene clusters. Moreover, Acidaminococcaceae genomes contained 21.2% of identified bacteriocins and Ruminococcus spp. possessed the bulk of sactipeptides and lantipeptides. Archaea were predicted to code 737 BGCs, including an average of 3.8 NRPS gene clusters per genome (Fig. 4a).

Fig. 4: Characterization of BGCs from 8,160 rumen genomes and MAGs.

a Number and types of BGCs identified from select phyla in genomes from the Hungate1000 Collection³, Stewart et al. studies^{4, 5}, and the current study. b Phylogenomic analysis of 1,766 near-complete Firmicutes genomes inferred from the concatenation of phylogenetically informative proteins. The inner layer surrounding the genomic tree designates taxonomic annotations, while the remaining layers depict the log normalized number of BGCs in the genome with the ascribed function. Bacterial class and order labels are displayed for those lineages in which more than 50 genomes were identified. Near-complete genomes were defined as being ≥90% complete, having ≤5% contamination, and contig N50 ≥15 kbp. The phylogenetic tree is provided as Supplementary Data 10. c A relational network of NRPS gene clusters in Firmicutes, Bacteroidota, and Euryarchaeota highlights the similarity of NRPS BGCs from Euryarchaeota and Firmicutes. Edge weight represents the similarity of two BGCs, as determined by BiG-SCAPE (i.e. darker edges demonstrate more similarity between two BGCs). Edges are only shown for BGCs with ≥0.3 BiG-SCAPE similarity. Nodes from each phylum are duplicated to illustrate intra-phylum relationships and nodes along a given axis are ordered alphabetically by taxonomic family. d The association between genome phylogeny and the similarity of NRPS gene clusters coded by near-complete Euryarchaeota genomes. BGCs designated as NRPS were clustered with BiG-SCAPE. The relationship between NRPS clusters was portrayed through the hierarchical clustering of pairwise inter-cluster similarities. The number of NRPS clusters coded by each genome (range of 0–3) is presented alongside the assigned genus. A group of Methanobrevibacter genomes, likely of the same species (≥95% ANI), possessed very similar NRPS clusters (highlighted in blue). Yet, phylogenetically closely related genomes, belonging to two different 95% ANI clusters, did not code for any identified NRPS gene clusters (highlighted in red). The phylogenetic tree is based on the concatenation of 122 phylogenetically informative archaeal proteins and is available as Supplementary Data 11.

Full size image

NRPS exhibit high molecular and structural diversity resulting in a wide array of biological activities. The diversity of NRPS, combined with their proteolytic stability and selective bioactivity, has resulted in the development of many NRPS as antimicrobials and other therapeutic agents²⁸. Given the prevalence of NRPS among the recovered MAGs (Fig. 4a), the peptides appear to be important bioactive metabolites in the rumen. To gain fundamental insight into the phylogenetic diversity of rumen NRPS, we built a network based on BGC similarity using BiG-SCAPE²⁹. BiG-SCAPE uses protein domain content, order, copy number, and sequence identity to calculate a distance metric. We assessed the similarity of NRPS gene clusters identified in Firmicutes, Bacteroidota, and Euryarchaeota, as these three phyla coded for 96.4% of assembled NRPS gene clusters from rumen genomes. With a BiG-SCAPE similarity threshold of 0.3, the resulting network consisted of 3,436 nodes (NRPS BGCs on contigs ≥10 kbp) and 79,112 edges (Fig. 4c and Supplementary Data 4). As expected, the network analysis depicted high inter- and intra-phylum genetic diversity among the NRPS gene clusters. The median intra-phylum, -family, and -genus similarity was 0.40, 0.44, and 0.46, respectively, while the median inter-phylum, -family, and -genus similarity was 0.32, 0.34, and 0.34, respectively. Further, only 2.6% of edges were inter-phylum and 69.0% were intra-family. Of the 6,594 Euryarchaeota edges, 8.1% were Euryarchaeota-Firmicutes (median similarity of 0.32) and 2.0% of edges were Euryarchaeota-Bacteroidota (median similarity of 0.31). To further examine the phylogenetic relationships of rumen Euryarchaeota NRPS, we clustered 265 NRPS gene clusters (≥10 kbp) from 85 near-complete Euryarchaeota genomes at a higher similarity threshold of 0.75, yielding 57 NRPS clusters (Fig. 4d). The distribution of NRPS clusters amongst the genomes suggests there exists a strong relationship between methanogen phylogeny and NRPS similarity. Only Methanobrevibacter genomes contain NRPS gene clusters, and genomes of the same species often possessed many of the same NRPS clusters (see genomes highlighted in blue in Fig. 4d). However, there are instances in which closely related methanogens code for a contrasting pattern of NRPS clusters or no NRPS clusters at all (see genomes highlighted in red in Fig. 4d).

Bacteriocins likely serve as regulatory elements in complex microbial communities such as the rumen. Consequently, bacteriocins have been studied and characterized for their bactericidal activity and as agents that modulate the microbiota structure and function³⁰. In particular, lanthipeptides, a class of ribosomally synthesized and post-translationally modified peptides (RiPPs) with thioether cross-linked amino acids³¹, are of pharmaceutical, preservative, and agricultural interest due to their strong antimicrobial properties against gram-positive pathogens^31,32,33, low levels of antimicrobial resistance³⁴, and stability³⁵. We identified 195 rumen lanthipeptide BGCs from the Hungate1000 genomes and MAGs from Stewart et al. and the current study. Rumen lanthipeptide BGCs were clustered with 22,870 lanthipeptide BGCs from RefSeq genomes^36,37 into gene cluster families (GCFs; groups of BGCs that may generate highly similar products). Clustering with BiG-SCAPE²⁹ yielded 4,565 GCFs, 120 of which contained a rumen lanthipeptide. The 120 GCFs were composed of 519 lanthipeptide BGCs, where 324 were from RefSeq isolates and 195 from rumen genomes (Fig. 5a). The 324 RefSeq BGCs fell into only 18 GCFs. Lanthipeptides from the Hungate1000 isolates clustered into 36 GCFs, while rumen MAG lanthipeptides belonged to 92 GCFs, 82 of which were exclusively composed of MAG lanthipeptides. Together, this evidence suggests rumen MAGs code for diverse and novel lanthipeptides not represented in cultured isolates, including the Hungate Collection.

Fig. 5: Phylogenetic diversity of 195 lanthipeptide BGCs coded by rumen genomes.

a Network depicting the similarity between lanthipeptide BGCs identified from complete and draft isolate genomes in RefSeq and rumen genomes of the Hungate1000 collection, Stewart et al. MAGs, and MAGs from the current study. The BGCs were clustered into gene cluster families (GCFs) with BiG-SCAPE²⁹. Only the GCFs containing a rumen genome and at least two BGCs were visualized. Nodes in the network represent BGCs and edges connect BGCs with BiG-SCAPE defined similarity ≥0.3. b Phylogenetic relationships of 120 near-complete rumen bacterial genomes coding for lanthipeptide BGCs. Near-complete genomes were defined as being ≥90% complete, having ≤5% contamination, and contig N50 ≥15 kbp. Layers surrounding the genomic trees indicate the bacterial phyla and family, if the genome is a MAG or Hungate Collection isolate, and the class of lanthipeptide, as predicted by antiSMASH²⁷. Genomes without an indicated lanthipeptide class were not classified by antiSMASH. The phylogenetic tree is based on the concatenation of 120 phylogenetically informative bacterial proteins and is available as Supplementary Data 12.

Full size image

We sought to further examine the differences in rumen MAG lanthipeptides relative to isolates and the taxonomic diversity of rumen microbes coding for lanthipeptides. The 195 rumen lanthipeptides were mainly found in Firmicutes genomes, with a subset from Bacteroidota and Actinobacteriota (Fig. 5b). Fifty-two of the 55 lanthipeptides from the Hungate Collection isolates were from Firmicutes (94.5%). At the family-level, these 52 Firmicutes BGCs were distributed evenly between Lachnospiraceae and Streptococcaceae. In contrast, 19.2% and 8.6% of lanthipeptides from rumen MAGs belonged to Bacteroidota and Actinobacteriota, respectively. Lanthipeptides from MAGs were also found in Muribaculaceae and Oscillospiraceae. Moreover, 26.4% of rumen MAG lanthipeptides, compared to 3.6% of Hungate Collection isolates, were found in Eubacterium genomes. The majority of Eubacterium MAG lanthipeptides (62.1%) belonged to a single GCF, suggesting they code for very similar products. Lastly, antiSMASH predicted the bulk of the rumen lanthipeptides were Class II lanthipeptides, with fewer Class I and Class III types (Fig. 5b). Nearly all of the Class I lanthipeptides were from Hungate isolates. The above analysis of lanthipeptide diversity further supports that rumen MAGs code for novel secondary metabolites not represented in cultured isolates.

We aligned previously published rumen metatranscriptome data from steers characterized as having high and low feed efficiency to the BGCs to demonstrate if the identified BGCs are active and to explore potential ecological roles of secondary metabolites. Despite data from the metatranscriptome study not being applied to reconstruct genomes in the current study, we identified the expression of 554 gene clusters from rumen-specific genomes in the 20 metatranscriptomes (≥100 aligned reads). Metatranscriptome read count data were normalized independently for each genome to better account for the variation in taxonomic composition across samples³⁸. Genome-specific normalization resulted in the identification of 17 differentially expressed gene clusters between steers with high and low feed efficiency (DESeq2³⁹ false discovery rate adjusted P < 0.05; Supplementary Data 5). Of the 17 differentially expressed BGCs, 16 exhibited higher expression levels in the rumen samples from less efficient steers with higher residual feed intake. Further, Prevotella and Selenomonas coded for 12 of the differentially expressed BGCs (70.6%). All of the differentially expressed Selenomonas BGCs were sactipeptides (n = 7), while the Prevotella BGCs were more diverse and included NRPS and aryl polyenes.

Microdiversity of BGCs and MAGs

Phylogenetic analyses of BGC often revealed high inter-species diversity (i.e, methanogen NRPS diversity in Fig. 4d). We next investigated patterns of sub-species microdiversity in rumen BGCs. In order to reduce the influence of study-to-study effects, we focused on the microdiversity of MAGs across 282 metagenomes in the Stewart et al. studies^4,5. MAGs with ≥50% of its genome covered by at least 5 reads were considered as detected in a sample and used for microdiversity analyses. The within-sample microdiversity of genes and genomes were assessed using InStrain⁴⁰. Our phylogenetic analysis identified that different classes of BGCs are enriched in certain lineages (Fig. 4a, b). As a result, the nucleotide diversity values for genes were normalized using the mean genome-wide nucleotide diversity for each MAG to account for lineage-specific evolutionary processes and more accurately compare patterns of microdiversity in BGCs across lineages. There were significant differences in the nucleotide diversity of genes from the four major classes of BGCs identified in rumen-specific genomes (Kruskal–Wallis H = 1795.5, ε² = 0.001, P < 2.2 × 10⁻¹⁶; Fig. 6a), but the effect size (ε²) between BGC types was negligible. Outliers with high microdiversity were bacteriocin genes from RC9 and UBA3207 sp. as well as NRPS genes from CAG-710 and UBA9715 sp. Additionally, we explored the association of genome-wide and secondary metabolism gene microdiversity with cattle breed. The mean nucleotide diversity of MAGs (Kruskal–Wallis H = 1027.5, ε² = 0.0265, P < 2.2 × 10⁻¹⁶; Fig. 6b) and the normalized nucleotide diversity of genes from BGCs (Kruskal–Wallis H = 403.84, ε² = 0.0003, P < 2.2 × 10⁻¹⁶; Fig. 6c) were both significantly different between the four breeds. The effect size (ε²) of microdiversity difference between breeds was much larger for the genome-wide comparison than for genes from BGCs. This finding raised the question if genes from BGCs have different nucleotide diversity relative to other genes. We found that genes across all BGCs had lower normalized nucleotide diversity compared to all other genes from investigated MAGs (Wilcoxon rank-sum W = 6.11 × 10¹³, Vargha and Delaney’s A = 0.507, P < 2.2 × 10⁻¹⁶; Fig. 6d). The raw nucleotide diversity values were higher for genes in BGCs than other genes (Wilcoxon rank-sum W = 5.801 × 10¹³, Vargha and Delaney’s A = 0.481, P < 2.2 × 10⁻¹⁶). Regardless, again we find the effect size of the difference to be very small though. Together, microdiversity analyses suggest rumen microbial BGC diversity is comparable across the prevalent BGC classes, breeds, and similar to other genes.

Fig. 6: Comparison of the microdiversity of MAGs and BGCs from cattle metagenomes in the Stewart et al. studies^{4, 5}.

The within-sample nucleotide diversity of BGCs was statistically different between BGC types, but the effect size of the difference was small (ε² = 0.001) (a). The difference in nucleotide diversity across breeds was greater for MAGs (ε² = 0.0265) (b) than for genes in BGCs (ε² = 0.0003) (c). Additionally, the effect size of the difference between the normalized nucleotide diversity of genes from BGCs and other genes was small (Vargha and Delaney’s A = 0.507) (d). Genome-wide normalized nucleotide diversity is the nucleotide diversity of a gene relative to the mean nucleotide diversity of the MAG. The genome-wide normalized nucleotide diversity metric was used to reduce the influence of lineage-specific evolutionary processes, allowing for a more accurate comparison of gene nucleotide diversity across microbial populations. The same conclusions were identified using the raw nucleotide diversity in the place of genome-wide normalized nucleotide diversity. The outlier points have been removed from the boxplots for clarity.

Full size image

Source: Ecology - nature.com

Insights into rumen microbial biosynthetic gene cluster diversity through genome-resolved metagenomics

2,809 draft MAGs from the rumen ecosystem

Species-level overlap between reference genomes, the Hungate1000 Collection, and rumen MAGs

Rumen metagenome classification rates using reference and rumen-specific genomes

Phylogenetic characterization of biosynthetic gene clusters

Microdiversity of BGCs and MAGs

Storm surge and ponding explain mangrove dieback in southwest Florida following Hurricane Irma

Parental selection for growth and early-life low stocking density increase the female-to-male ratio in European sea bass

ITALIAN LANGUAGE

ENGLISH LANGUAGE