Soil diversity, taxonomic classification and binning of BGCs
Nonpareil analysis estimated an abundance-weighted coverage of 85.3% for the 44.4 Gb used in the long-read assembly. To achieve 95% and 99% coverage, respectively, 250 Gb and 1.6 Tb of sequencing were predicted to be necessary. Alpha diversity was estimated at Nd = 21.6. Contigs were binned using CONCOCT, MaxBin2 and MetaBAT2, consensus bins were generated using metaWRAP refine and classified using GTDB-Tk. This yielded 114 bacterial bins with CheckM completeness > 50% and contamination < 10% containing 278 BGCs (see Table 1.) Since only 278 BGCs had been binned, an additional contig-based classification approach was adopted. All contigs were classified using CAT with a database based on GTDB r89 proteins, leading to a classification of 93% of BGC-containing contigs at a phylum level (Fig. 1B, C). A cross-check of bin-level classification and contig-level classification of the 269 binned and CAT-classified BGC-containing contigs showed three conflicts at different levels in total (phylum: 0, class: 1, order: 1, family: 0, genus: 1, species: 0). Of the 2892 total binned and CAT-classified contigs, 52 (1.7%) were classified differently at order level using CAT. This indicates that the risk of misclassification of BGC-containing contigs by CAT is low, but cannot be excluded. Bin-level classification was preferred where available.
Recovery of diverse and full-length BGCs
The polished assembly was analysed using antiSMASH v5.1. A total of 1417 BGCs were identified on 1350 contigs (Table 1). A total of 564 BGCs (39.8%) were identified as being on a contig edge and were therefore categorised potentially incomplete, while 853 (60.2%) were full length. The most abundant classes of BGCs were terpenes (27.2%), followed by NRPS (15.7%) and bacteriocins (10.1%). In particular, terpenes were dominated by few subclasses. Out of 401 observed terpene BGCs, 321 contained a squalene/phytoene synthase Pfam domain (PF00494). This indicates that the product of these BGCs is a tri- or tetraterpene. Forty-four BGCs also contained a squalene/hopene cyclase (N terminal; PF13249), 39 BGCs contained a carotenoid synthase (PF04240), while 47 contained a lycopene cyclase domain (PF05834).
Approximately half of the ribosomally synthesised and post-translationally modified peptides (RiPPs) identified in the sample contained methanobactin-like DUF692 domains (PF05114). However, no BGCs resembling known methanobactin BGCs were found.
The proportion of proteins identified as too short on BGC-containing contigs was estimated at 63%. It is possible that this measure was influenced by the UniProt reference database not containing representative proteins for the mostly uncultivated strains recovered in this study. However, fragmentation of ORFs through indels was clearly visible, especially in NRPS and PKS BGCs in which whole megasynthase genes were broken up into several fragments.
Long reads and GTDB improve phylogenetic classification of environmental BGCs
The use of GTDB proteins instead of the NCBI non-redundant protein database increased the classification success of BGC-containing contigs from 36.8% classified at order level with the NCBI database to 71.8% with GTDB. The difference was mainly due to BGCs from MAG-derived orders which were not present in the NCBI database, such as UBA7966. However, the GTDB database is also much smaller than the NCBI nr database, and many MAG-derived clades especially at lower taxonomic ranks do not have many representatives in the GTDB database. To avoid misclassifications, we therefore decided to conduct analysis at class and order level, even if contigs were classified at lower taxonomic ranks.
To assess the advantages of long-read sequencing for BGCs detection and classification, the output was compared with BiosyntheticSPAdes, which allows the assembly of NRPS and PKS from short-read sequences by following an ambiguous assembly graph using a priori information about their modularity. Using BiosyntheticSPAdes with the 28 Gb of short reads, 228 unambiguous NRPS/PKS BGCs were predicted. Sixty one of these were above 5 Kb long and five NRPS were larger than 30 Kb. Furthermore, 202 other BGCs were predicted from other contigs. 96.7% of BGCs were marked as on a contig edge, i.e. not full length. Indeed, 392 out of 430 BiosyntheticSPAdes BGCs could be aligned to 255 long-read BGCs using blastn (E-value < 1E−90), indicating that mostly the same BGCs were assembled, but they were fragmented in the short-read assembly (see Supplementary Fig. 1). In the case of NRPS/PKS BGCs, even the BGCs on contigs with the highest coverage (>120×) were fragmented into two or more contigs. Classification success using the same binning and CAT approach was lower (68% at phylum level, 50% at order level; 48 BGCs binned). This could be attributed to the lack of genomic context around the BGCs. While BiosyntheticSPAdes predicted a large number of BGCs in total, the practical usability and interpretability of the output remained low, since completeness, cluster borders and potential modification genes could not be assessed and phylogenetic classification success was reduced.
Highly divergent BGCs found in unusual specialised metabolite producer phyla
Examination of the BGC counts by BGC type and phylum showed that the three well-known producer phyla Actinobacteriota, Proteobacteria and Bacteroidota together contributed over 60% of BGCs (Fig. 2A). BGCs attributed to Acidobacteriota and Verrucomicrobiota represented up to 20% of the total BGCs, while other phyla constituted the remaining 12%, and 7% remained unclassified at phylum level. In particular, 20% of NRPS remained unclassified at phylum level. No archaeal BGCs were found.
A BGCs by phylum and BGC type (phyla with a count < 20 removed; products with count < 10 under “others”. B BiG-SLiCE distances of BGCs by phylum, with the black dotted line indicating d = 900 and the grey dotted line d = 1800 (phyla with a count < 20 removed). C–H BiG-SLiCE distances for different BGC types plotted by phylum (phyla with <5 BGCs of the type removed; hybrid BGCs counted for both classes). Each point indicates a BGC. Salmon = BGC not on contig edge, Light blue = BGC on contig edge.
The 1417 BGCs were then analysed with BiG-SLiCE’s query mode in order to calculate their distance (d) from a set of pre-computed GCFs comprised of 1.2 mio known BGCs. The analysis showed that 845 out of 1417 BGCs (59.6%) had a d > 900, indicating that they were only distantly related to a GCF. Fifty-five outliers were found with d > 1800, indicating extremely divergent BGCs. A wide span of distances was present within each phylum which indicates that each phylum contained BGCs that are both closely and distantly related to known BGCs (Fig. 2B). The median distances showed significant variation between phyla, with Bacteroidota containing the highest novelty (median d = 1227) and Planctomycetota the lowest (median d = 742). This overall score was, however, influenced by the fact that different classes of BGC scored differently. For example, NRPS/PKS BGCs scored higher than, e.g. terpenes or bacteriocins. Rankings of single BGC classes showed that the high Bacteroidota score was partly driven by the large number of NRPS (Fig. 2C) and the small number of terpenes and bacteriocins (Fig. 2E, F) in the phylum. This is evidenced by the fact that other phyla scored the highest in individual BGC classes. For NRPS BGCs, Gemmatimonadota, Acidobacteriota and Verrucomicrobiota showed the highest values for d (Fig. 2C). Gemmatimonadota furthermore showed the highest value for d when considering terpene BGCs (Fig. 2E), while Acidobacteriota scored high for lassopeptides, arylpolyenes and PKS (Fig. 2G, H, D). Furthermore, BGCs on a contig edge tended to score lower. To check whether low coverage and the resulting insertion and deletion errors in the assembly led to overestimation of d, contig coverage as well as percentage of correctly sized ORFs (as calculated by ideel) were plotted against d. There was a positive correlation of d values with increased coverage up until a coverage of ca 10, indicating an underestimation of novelty at low coverage. Similarly, for contigs with under 20% correctly sized ORFs, there was a slight positive correlation between the percentage of correctly sized ORFs and distance. As expected, coverage showed a strong positive correlation with percentage of correctly sized ORFs (see Supplementary Figs. 2–4).
Acidobacterial BGCs
Analysis of acidobacterial BGCs by order (Fig. 3A) showed that terpenes were the most numerous, but with significant contributions from PKS, NRPS, lassopeptide and bacteriocin clusters. The orders of Pyrinomonadales and Vicinamibacterales constituted >60% of BGCs.
A BGC counts by BGC type and order in phylum Acidobacteriota. B Map of a large Acidobacteriota contig (order Vicinamibacterales) and the BGCs on it. C Cluster map of proposed functions of genes in BGC1, BGC2 and BGC3. Functions were predicted from BLASTing against NCBI nr database as well as antiSMASH module predictions. A detailed table of homologous proteins can be found in Supplementary files.
BiG-SCAPE analysis showed that BGCs mainly clustered together within orders (Supplementary Table 1). None of the families contained MIBiG clusters at the cutoffs used. Acidobacteriota showed a large number of lassopeptides, 16 of which grouped into two GCFs. NRPS-like BGCs also contributed a large number to the sample. In particular, one NRPS-like family from the order Vicinamibacterales showed homology to the VEPE BGC from Myxococcus xanthus (MIBiG BGC0000871). Furthermore, seven NRPS/PKS with a megasynthase gene length of over 20 Kb were found with the largest BGC measuring 89 Kb of NRPS and PKS megasynthase genes. The largest Acidobacteriota (order Vicinamibacterales) contig was 1.5 Mb in size and contained three BGCs: a PKS, a terpene and a NRPS/PKS hybrid cluster (Fig. 3B, C). BGC1 (d = 1397) contained a partial one-module NRPS followed by a partial PKS module as well as transporter genes and a TonB-dependent receptor protein, suggesting a role as a siderophore. BGC2 (d = 1103) contained squalene/phytoene synthase genes and several potential tailoring enzymes. BGC3 (d = 1977) contained a complete NRPS and a partial NRPS module and an incomplete PKS domains. Several gaps visible in the BGC make a sequencing error seem possible, leading to truncated genes and therefore missing domains.
Verrucomicrobial BGCs
The analysis of Verrucomicrobial BGCs by order (Fig. 4A) showed that the vast majority of BGCs were terpenes, followed by arylpolyenes, PKS, NRPS, as well as ladderanes. The most prolific producer orders were Opitutales, Pedosphaerales and Chtoniobacterales. Verrucomicrobial BGCs did not show strong clustering into conserved GCFs compared to Acidobacteriota (Supplementary Table 2). One NRPS and one PKS BGC were the only BGCs that clustered with MIBiG clusters. The largest Verrucomicrobiota contig (order Opitutales) was 2.6 Mb in size and featured five BGCs, two of which were NRPS-PKS hybrids with megasynthase genes above 20 Kb (Fig. 4B, C). BGC1 (d = 1479) contained a ladderane-type 3-oxoacyl-[acyl-carrier-protein] synthase. BGC2 (d = 1305) contained four NRPS modules interspersed by one PKS module. BGC3 (d = 673) contained a squalene-hopene cyclase, indicating a role in hopanoid biosynthesis. BGC4 (d = 1142) encoded a chalcone/stilbene synthase. BGC5 (d = 1340) contained a PKS module followed by five NRPS modules. The third module, however, showed a truncated A domain, with the antiSMASH HMM NRPS-A_a3 only matching around 50 bp at the end of ORF ctg423_1968. This could be explained by a sequencing error in which an indel lead to a frameshift, causing a premature stop codon. Indeed, nucleotide-level BLAST of the gap between ctg423_1968 and the PCP-domain containing ctg423_1970 showed a match to known A domains. It is, however, not possible to rule out potential pseudogenisation.
A BGC counts by BGC type and order in phylum Verrucomicrobiota. B Map of a large Verrucomicrobiota contig (order Opitutales) and the BGCs on it. C Cluster map of proposed functions of genes in BGC1–BGC5. Functions were predicted from BLASTing against NCBI nr database as well as antiSMASH module predictions. X axis represents basepairs. A detailed table of homologous proteins can be found in Supplementary files.
Uncultivated and underexplored classes and orders from Actinobacteriota and Proteobacteria show a large biosynthetic potential
Actinobacteriota: Acidimicrobiia and Thermoleophilia
The phylum Actinobacteriota (335 BGCs) featured a large amount of BGCs unclassified at order level. Therefore, they were analysed by class (Fig. 5A). The class Actinobacteria (114 BGCs) contained BGC-rich genera such as Streptomyces and Pseudonocardia and accordingly contributed a large amount of BGCs in the sample. The class Acidimicrobiia (90 BGCs) contained the genera Illumatobacter and Microthrix and several uncultured genera. The class Thermoleophilia (95 BGCs) contained genera such as Solirubrobacter and Patulibacter, besides uncultured genera, and contributed to a large amount of the bacteriocin and betalactone BGCs. The amount of BGCs in these classes that were not placed into lower taxonomic ranks indicated that there is a large unexplored diversity of uncultured Actinobacteriota containing a great diversity of BGCs.
A BGC counts by BGC type and class in Actinobacteriota. B Map of a large Actinobacteriota contig (order IMCC26256) and number of basepairs. C Cluster map of proposed functions of genes in BGC1 and BGC2. Functions were predicted from BLASTing against NCBI nr database as well as antiSMASH module predictions. X axis represents basepairs. A detailed table of homologous proteins can be found in Supplementary files.
Remarkably, one circular genome from the uncultured order IMCC26256 from the class Acidimicrobiia was recovered in a single contig, measuring 3.3 Mb in size and containing two BGCs (Fig. 5B, C). The terpene BGC (d = 1398) contained a squalene synthase, a lycopene cyclase and polyprenyl synthetases, suggesting a role in pigment formation. The CaiA-related BGC (d = 1869) contained an acyl-CoA dehydrogenase related to CaiA (involved in saccharide antibiotic BGCs). BLAST hits indicated other genes related to small organic acids, sugars and nucleoside metabolism.
Two families of terpenes containing terpene cylases, methyltransferases and/or P450s showing similarity to the known geosmin and 2-methylisoborneol BGCs were found, with members belonging to both Acidimicrobiia, Thermoleophilia and unclassified Actinobacteriota. One BGC from a Streptomyces spp. was detected, containing an LmbU-like gene on the very edge of the contig. BiG-SCAPE analysis showed that Actinobacteriota BGCs mostly grouped within the classes, and one lanthipeptide BGC grouped with MIBiG BGCs at the cut-off used (Supplementary Table 3).
Proteobacteria: the uncultured methanotrophic order UBA7966 as a specialised metabolite producer
Analysis at the order level of proteobacterial BGCs showed that the biggest contributor was the Burkholderiales order with 116 BGCs (Fig. 6A) followed by order UBA7966 with 96 BGCs. UBA7966 BGCs included a variety of BGCs, including terpenes, bacteriocins, phosphonates, NRPS and NRPS hybrids, NRPS-like and arylpolyenes. In particular, the high abundance of NRPS-like and phosphonate BGCs in UBA7966 contrasted with the lower counts in other proteobacterial orders in the dataset. By order, UBA7966 contigs also showed a high average coverage of 26×, compared to the total average of 10.2×, indicating a high abundance. The total length of UBA7966 contigs was 53 Mb, indicating the presence of several genomes.
A BGC counts by BGC type and order in the phylum Proteobacteria. B Cluster layout of three gammaproteobacterial DUF692-containing BGCs representatives: contig_12391 for FAM_02418, contig_14956 for FAM_02526 and scaffold_15362 for FAM_02384. C Sequence logo generated from an HMM of 301 potential precursor peptides. D Similarity network generated from BiG-SCAPE with brown: FAM_02384, turquoise: FAM_02418, green: FAM_02526.
The order UBA7966 is an uncultured order consisting of one family, UBA7966, which contains two genera, UBA7966 and USCγ-Taylor. UBA7966-family bin bin.3 was assigned no genus, while all CAT-assigned contigs were assigned species USCγ-Taylor sp002007425, the only species in the USCγ-Taylor genus. The USCγ-Taylor genus is based on a putatively methanotrophic MAG extracted from a methane-oxidising soil metagenome from Taylor Valley in Antarctica (Genbank accession GCA_002007425.1) [30]. The low number of UBA7966 reference genomes in the GTDB database means, however, that these classifications remain an approximation. The two closest orders to UBA7966 that contain cultured representatives, Beggiatoales and Nitrosococcales, both have members implicated in methanotrophy, sulfur cycling and ammonia oxidation as well as varying degrees of chemolithotrophy and chemoautotrophy [51,52,53,54]. On all the contigs assigned to order UBA7966 by CAT, four pmoCAB operons were found, with pmoA showing 92.9–96.8% identity with pmoA from USCγ-Taylor. This indicates that, in addition to the methanotrophy of USCγ-Taylor, other members of the order UBA7966 could be involved in similar lifestyles.
When analysed with BiG-SCAPE at cut-off 0.7 (Supplementary Table 4), phosphonates (median d = 1421), NRPS/NRPS-like (median d = 1262) and bacteriocins seemed to form especially conserved GCFs. Other GCFs were shared with other proteobacterial orders. With 96 BGCs, UBA7966 contributed a similar number of BGCs as the established specialised metabolite producing order Burkholderiales (116 BGCs). However, the BiG-SLiCE distances of UBA7966 were higher than Burkholderiales for all but one BGC class, indicating more novel BGCs (Supplementary Fig. 5).
The potential methanotrophy of UBA7966 suggested the potential presence of methanobactins, but no BGCs corresponding to known methanobactins were found in the dataset. On the other hand, an abundance of DUF692-containing BGCs were observed, grouping into three GCFs. DUF692 proteins are a diverse family of proteins with largely unknown functions, although some are known to be involved in methanobactin biosynthesis [55]. The analysis of three related GCFs containing DUF692 domains (including BGCs from UBA7966 and unclassified gammaproteobacterial contigs) showed that FAM_02526 (two BGCs), FAM_02384 (three BGCs) and FAM_02418 (six BGCs) (Fig. 6B, D) all contained a short (circa 240 bp) ORF followed by first a DUF692-domain containing protein, then a DUF2063-domain containing protein. Furthermore, a putative cation antiporter was found upstream of the precursor peptide. The three families differed by the genes surrounding this core cluster (Fig. 6B). The 11 small translated 240 bp ORFs were aligned using Clustal Omega and a HMM search was made in EBI reference proteome database with a cut-off E-value of 1E−10. The resulting 290 protein sequences (almost exclusively from Proteobacteria) plus 11 original sequences were aligned using Clustal Omega and a HMM was generated and visualised using skylign.org. The resulting logo showed a low degree of sequence conservation except for a pattern of six conserved cysteines—some followed by glycines—within 40 amino acids towards the N-terminus, and a slightly conserved hydrophobic patch towards the C-terminus (Fig. 6C). This might represent a potential precursor peptide, with the six cysteines marking the potential core peptide.
The UBA7966 order also contained larger BGCs such as four NRPS/NRPS-PKS BGCs with megasynthase genes with a length of more than 20 Kb, the largest cluster possessing 56 Kb of PKS (seven modules) along with NRPS (three modules) genes. This latter BGC also formed a BiG-SCAPE GCF with several MIBiG BGCs which shared the presence of a small peptide moiety followed by several malonyl units.
Low numbers of BGC found in other underexplored phyla
Lower numbers of BGCs were detected in the phyla Gemmatimonadota (31 BGCs), Planctomycetota (29 BGCs), Myxococcota (22 BGCs), Patescibacteria(9 BGCs), Methylomirabilota (5 BGCs), Bdellovibrionota_B (8 BGCs), Elusimicrobiota (4 BGCs), Armatimonadota (4 BGCs) and Binatota (3 BGCs) (Fig. 7A, Supplementary Table 5).
A Distribution of BGCs among phyla with 31 or fewer BGCs in the dataset. B Map of a large Gemmatimonadota contig (order Gemmatimonadales) and BGCs detected on it. C Cluster map of proposed functions of genes in BGC1 and BGC2. Functions were predicted from BLASTing against NCBI nr database as well as antiSMASH module predictions. X axis represents basepairs. A detailed table of homologous proteins can be found in Supplementary files.
One remarkably long (1.5 Mb, Fig. 7B, C) Gemmatimonadota contig from the order Gemmatimonadales was found to contain two BGCs: one terpene (d = 998) and one NRPS/PKS BGC (d = 1423). BGC1 contained a phytoene synthase and several related oxidases. BGC2 contained six PKS modules and two NRPS modules as well as modifying enzymes presence of a TonB receptor indicated that the product could serve as a siderophore.
Source: Ecology - nature.com