Prokaryotic viruses impact functional microorganisms in nutrient removal and carbon cycle in wastewater treatment plants

AS systems display many novel and a high fraction of shared viruses

We used metagenomic sequencing of 30 Gb of sequences per sample to characterize the composition of viral concentrates across six WWTPs (see Methods). After assembly and mapping of reads, 24–34% of the total sequence information could be classified as viral using two different identification pipelines (see Methods). By combining the results of these two pipelines, the final data set consisted of 50,037 viral contigs with an N50 >20 kb. To evaluate whether our sequencing effort sufficiently sampled the viromes, all six WWTP samples were subsampled iteratively to evaluate the saturation dynamics. Rarefaction curves of the number of viral contigs, reached a plateau ~15 Gb of sequencing data for all six samples (Fig. 1a), indicating adequate recovery of prokaryotic viruses in these AS systems at the sequencing depth (30 Gb per sample) in this study.

Fig. 1: Compositional variation in viromes among wastewater treatment plants.

a Rarefaction curve of each sample. b Rank-abundance curve of each sample. c PCoA analysis of viromes based on the relative abundance of viral genera. d Relative abundance and appearance of dominant viral genera in each WWTP. White color denotes no appearance in this WWTP. Source data are provided as a Source Data file.

Full size image

Viral contigs were further classified into 8756 viral clusters (VC, equivalent to viral genera) using vConTACT2 by calculating the gene-content-based distance between viral contigs (~40% proteome similarity)¹¹, and each VC was assigned an ID for identification (mean length = 15.2 kb, mean genera size = 3). Compared with the current number of viral sequences from AS systems, our sequencing data increase the AS virome database (N = 2103 in the IMG/VR database v.2.0)¹⁰ by 12-fold at the genus level and by 23-fold at the species-level (95% identity, 80% coverage). Comparison with NCBI RefSeq viral genome database showed that across the six AS systems, only 0.4–1.6% of total viral contigs (coverage percentage) could be assigned to a known viral family. Similar to previously described viral metagenomes from the soil, freshwater, and marine system^7,12,13, this limited annotation highlights substantial uncharacterized viral diversity in AS communities. Among these recognizable viruses, members of the family Podoviridae (short-tailed phages from the Caudovirales order) were the most prevalent, comprising on average 41.3% of these viral contigs (coverage percentage) across the six WWTPs.

All samples displayed high but variable diversity of viral genera with Shannon’s diversity index H’ ranging from 5.22 to 7.14 and Pielou’s evenness index J’ ranging from 0.71 to 0.86 (Supplementary Data 1). These differences are evident in rank-abundance curves, which show that each sample has different viral frequency patterns (Fig. 1b). Most viruses occurred at low frequency with the relative abundance of individual genera diminishing below 0.1% after counting the top 138 viral genera.

Principal coordinate analysis (PCoA) of the Bray–Curtis dissimilarity based on the relative abundance of viral genera suggested that most samples are divergent from each other (Fig. 1c), with only two pairs of AS viromes, ST and STL as well as SK and SWH, displaying higher similarity to each other.

The overall variability in the viromes is also reflected in different dominance patterns. Each AS sample yielded a different dominant viral genus, and while these were also abundant in some WWTPs, they were below the detection limit in others (Fig. 1d). Although high relative abundance across all WWTPs indicates linkage to consistently abundant hosts, highly variable occurrence suggests that host populations are also more dynamic.

Although the viromes appear overall variable in rank abundance, many viral genera were shared across the WWTPs. Fifty-three viral genera were detected in all samples and were thus considered to be common members of the AS viromes, accounting for 1.7–5.4% of viral contigs (coverage percentage) in each WWTP (Fig. 2). Thirteen of these common viral genera were also present in AS viromes in the IMG/VR database v.2.0¹⁰. Of the total of 8756 unique viral genera collected across samples, STL and ST contained the largest fraction (5245 and 5149 genera, respectively) and shared most viral genera (N = 2885) with each other, far exceeding the number of all the other shared or unique viral genera (Fig. 2). These two WWTPs also had only 45 and 64 site-specific viral genera, consistent with the pattern in the PCoA virome profile. On the other hand, SWH possessed the most unique viral genera (N = 323), making up 11.0% of the total (Fig. 2). In fact, only relatively few viral genera were found exclusively in one of the WWTPs.

Fig. 2: Shared viral genera in each WWTP.

The UpSet⁵⁵ chart shows the total number of viral genera and their sharedness in each WWTP. The bar chart on the top right shows the distribution of predicted hosts for common viral genera in all WWTPs. Shared viral genera between ST and STL were labeled orange and shared viral genera in all WWTPs were labeled red. Source data are provided as a Source Data file.

Full size image

Overall, these results suggest that the virome across WWTPs consists of many shared genera. The lack of detection of some viral genera in the AS virome of one WWTP may be primarily due to the biological variation in the grab samples and/or the technical variation. Hence, if such technical and biological variations are taken into account, the virome shared among all AS maybe even more diverse.

Viruses infect a broad spectrum of bacteria and archaea

To examine putative host associations of all 50,037 viral contigs in the six WWTPs, we amassed a database of approximately three million CRISPR-Cas spacers from the NCBI prokaryotic complete genomes and metagenomes database (https://www.ncbi.nlm.nih.gov/assembly/). Host prediction was performed by matching CRISPR-Cas spacers at a sequence identity above 97%, sequence coverage over 90% in length, and mismatches <1 using the BLASTn-short task function (see Methods). Because the spacer content in a CRISPR array reflects recent virus encounters of bacteria and archaea, this approach has the potential to accurately link viruses to their hosts¹⁴ although it is obviously limited to hosts containing CRISPR-Cas systems¹⁵. Based on the cutoffs used, we recovered a total of 5879 viral contigs (4897 viral genera) (11.7% recall rate) with their predicted hosts, comprising 11.0–22.6% of viral contigs (coverage percentage) in each WWTP. Considering the random recall rate (0.70%) simply happens by chance (see Methods), the percentage of erroneous associations can be 6%.

At the phylum level, viruses were predicted to infect a wide range of archaea and bacteria, including 8 archaeal and 58 bacterial phyla (Fig. 3). Within the Archaea domain, 206 viruses were linked to Euryarchaeota, 10-fold more than to any other archaeal phylum and consistent with a previous survey of AS communities in WWTPs finding that Euryarchaeota was the most abundant lineage of archaea¹⁶. Euryarchaeota-infecting viruses can account for 0.25–0.56% in the viral community. Among bacteria, viruses infecting Proteobacteria (N = 2501), Actinobacteria (N = 1895), and Firmicutes (N = 1655) were the most abundant. The relative abundance of these viruses is quite similar to each other, i.e., 6.06%, 5.53%, and 4.20% for Proteobacteria, Actinobacteria, and Firmicutes-infecting phages, respectively. The higher proportion of actinophages indicates that actinophages, which have recently been discovered to be the most dominant viral groups in freshwater^13,17, are also abundant in AS systems.

Fig. 3: Taxonomy of predicted hosts of viral contigs from all samples at phylum level.

Eight phyla of archaea and 58 phyla of bacteria were shown in the graph. Green dots represented archaea and orange dots represented bacteria. The top three bacteria and top one archaea that could be infected by most viruses were highlighted with a gray background. Source data are provided as a Source Data file.

Full size image

Out of the 53 common viral genera, 16 could be assigned to more narrowly defined putative host taxa (Fig. 2). These hosts included members of the aerobic heterotrophs Mycobacterium, Bifidobacterium, and Actinomyces, which play important roles in organic carbon removal in WWTPs, and were predicted to be infected by ten, six, and four common viral genera, respectively. The methanogen Methanosarcina and a bacterium implicated in phosphorus removal, Candidatus Accumulibacter, also had predicted interactions with the common viral genera. These predicted interactions between common viral genera and functional microorganisms highlight the potential impact of viruses on the operational performance of WWTPs.

Matching of CRISPR spacers also suggested a range of host specificities of the AS viruses. The majority (68%) of CRISPR-annotated viral contigs (coverage percentage) were identified to be specialists at the host genus level, suggesting a narrow host range. However, 1668 out of 5879 viruses may be able to infect more than one host genus as suggested by the occurrence of CRISPR signatures in divergent host genomes. Intriguingly, 135 viruses might be able to infect hosts from different domains of life, with 73 displaying perfect matches to spacers in the CRISPR database, indicating viruses might occasionally switch hosts from bacteria to archaea. Considering the random error percentage of 6% in host identifications in our samples, the possibility that each one of these identifications is false equals P(f) = 1−(1−0.06)² = 0.1164 = 11.6%. Therefore, of 135 viruses infecting two domains of life, ~16 (11.6%) would be expected to have happened just by chance. Whether such switching happens primarily on ecological or evolutionary timescales through existing broad-host affinity or recombination of receptor domains remains poorly understood. However, recent experimental and high-resolution host mapping has highlighted that broad-host-range viruses, including those able to infect different domains of life, are widely distributed in nature^18,19,20.

Viruses may impact key functional microorganisms in WWTPs

To further investigate the connections between functional microorganisms and affiliated viruses in WWTPs, putative host genera identified through CRISPR-Cas spacer matches were linked to metabolic functions according to their classification in the MIDAS database²¹. This data set was supplemented with manually selected methanogens and sulfate/sulfur-reducing microorganisms. An ensemble of 100 functional microorganisms at the genera level was recovered with links to 1345 viruses, accounting for an average of 25.1% of CRISPR-annotated viral contigs (coverage percentage) across the WWTPs. For those broad-host viruses, we assumed that they could indirectly impact multiple functions in WWTPs. For example, viral contig STL1812_19787 could infect both Candidatus Accumulibacter (phosphate-accumulating organism) and Desulfosoma (sulfate-reducing organism). By infecting these two hosts, STL1812_19787 could indirectly impact both the phosphate removal process and the sulfate-reducing process. In this case, we add one to the count of both Candidatus Accumulibacter and Desulfosoma in Fig. 4.

Fig. 4: Virus–host profiles of functional microorganisms in WWTPs.

a Virus–host profiles of functional microorganisms related to aerobic oxidation. b Virus–host profiles of functional microorganisms related to the anaerobic process. c Virus–host profiles of functional microorganisms related to nutrient removal (N, P, S). The size of the circle indicates the number of viral genera that could infect the corresponding microbial genera. Source data are provided as a Source Data file.

Full size image

Virus-host connections were identified for 48 genera of aerobic heterotrophic bacteria (Fig. 4a). Seventeen of these genera were found to have virus–host connections in all the six WWTPs. Among them, Mycobacterium, Streptococcus, and Acinetobacter had the highest number of predicted viral interactions, i.e., 402, 263, and 109, respectively. Notably, Mycobacterium is notorious for being the major sludge foaming bacteria in AS, accounting for an average of 3% of 16S rRNA sequences in a previous study about foaming bacteria dynamics over 5 years²². Extensive links between Mycobacterium and its corresponding viruses provide a valuable genomic database for the potential future design of phage treatment to tackle sludge foaming problems.

Our data (Fig. 4b) suggest that viruses can infect 53 functional anaerobic genera in AS, including 30 fermenters, 6 acetogens, and 17 methanogens. Anaerobic processes could occur in microenvironments within sludge flocs²³ in spite of aeration overall creating an unfavorable environment for anaerobes in the bulk water and outer layers of AS flocs. However, some viruses that display specificity for sulfate reducers (owing to seawater flushing in Hong Kong) and methanogens might be detected in the AS system due to their presence in the influent. Results showed that they account for a very low percentage (0.13–0.28%) in total viral contigs (coverage percentage). These viruses might be carried from the wastewater influent.

Viruses may also impact bacteria involved in nutrient removal from AS as indicated by virus-host connections to 47 bacterial genera previously implicated in nitrogen and phosphorus removal, and sulfur cycle (Fig. 4c). These include one genus of ammonia-oxidizing bacteria (AOB) and three genera of nitrite-oxidizing bacteria (NOB). Ten viral genera were found in five WWTPs that could infect the AOB Nitrosomonas. Polyphosphate-accumulating bacteria were represented by Candidatus Accumulibacter and Tetrasphaera, where the first had a total of 24 viral matches across all WWTPs. Finally, a wide range of sulfate-reducing bacteria, which can generate hydrogen sulfide from sulfate, were represented by 21 genera, where 16 and 6 viral genera had connections in all WWTPs with Desulfovibrio and Desulfosoma, respectively. These results suggest that viruses could indirectly impact biological nutrient removal and carbon cycling in WWTPs through lysis of key functional microorganisms in AS.

We have also checked the number of spacers in all Midas genera (functional microorganisms in WWTPs) in our study. Results showed that these Midas genera contain on average 44 spacers in their genome and there exists an uneven distribution of CRISPR in prokaryotic groups (Supplementary Data 8). It should be noticed that the number of phage-prey interactions that can be detected using this approach is a minority and does not always represent present scenarios. Even phages infecting a CRISPR-positive strain may leave no trace in the form of spacers. Therefore, the conclusions that derive from these predictions must be handled with caution.

Viruses encode extensive auxiliary metabolic genes

Viral ORFs were distributed across 23 COG functional classes, demonstrating their high diversity. Aside from those with unknown or hypothetical functions, on average >1000 ORFs in each sample were found in the following categories, L: replication, recombination, and repair, M: cell wall/membrane/envelope biogenesis, and K: transcription, confirming that the main functions sustain viral reproduction and transcription. It is noteworthy that viruses encode on average 541 ORFs (1.4% of annotated viral ORFs) in each sample in category G: carbohydrate transport and metabolism. After removing redundant viral ORFs, most unique ORFs (N = 1610) were classified into the glycoside hydrolases (GH) module²⁴ (Fig. 5b). These GHs may be involved in the digestion of capsules to allow the viral tail to reach its membrane receptor on the host. The high representation of this function may be explained by the prevalence of biofilm formation among AS microbes and has previously also been noted among mangrove sediment viruses²⁵.

Fig. 5: Distribution of auxiliary metabolic genes (AMGs) relevant to the carbon cycle and nutrient removal.

a Boxplot of the overall gene profile for six WWTPs was summarized both as viral ORF hits and viral ORF relative abundance (ORF hits to each COG function class/total ORFs that have hits to eggNOG database). Data in a are presented as mean values (center) and 25%, 75% percentiles (bounds of box). The minima and maxima represent the range of the data. b Number of unique viral ORFs related to each CAZy function class was shown in a descendant order. Source data are provided as a Source Data file.

Full size image

Previous work has shown that many prokaryotic viruses carry auxiliary metabolic genes (AMGs), which can modulate host energy metabolism to provide an energetic advantage during viral genome and protein synthesis²⁶. Broadly speaking, AMGs refer to all metabolic genes in lytic phages²⁷, i.e., all genes in categories C, E, F, G, H, I, and P. Though only about a quarter of the viral ORFs have annotations in the eggNOG database, our data reveal a large repertoire of potential AMGs in the AS viromes (Fig. 5a). For example, 72 ORFs in total are potentially involved with carbon fixation pathways and 35 of them annotate as photosynthetic carbon fixation pathways in KEGG²⁸. Moreover, 10 viral ORFs belong to CobS genes, which are essential for the biosynthesis of cobalamin. Cobalamin biosynthesis pathway is usually not complete in bacterial genomes²⁹, and these viral encoded CobS genes could possibly assist the host metabolic capability. Seven viral genes encode adenylyl-sulfate kinase (CysC), which could facilitate host’s assimilatory sulfate reduction. By modulating host metabolism during infection, AMGs could alter the specific functions of their hosts in WWTPs and therefore influence carbon cycling and the removal of nutrients.

Viromes are shared between WWTPs and the water environment

To investigate whether the WWTP viral genera also occur in other habitats, we compared all viral sequences recovered here with the IMG/VR database v.2.0, which consists of 735,112 viral contigs predicted from metagenomic data¹⁰. Viral sequences from five ecosystems (AS, AD, solid waste, freshwater, marine) were included for comparison. For AS (N = 2103), AD (N = 8580), and solid waste (N = 5760), all viral sequences were subjected to our viral clustering pipeline, whereas for freshwater and marine environments, we each randomly selected 50,037 viral sequences to match the number in our samples (N = 50,037).

Results showed that viral genera in our samples were shared among multiple environments. There was considerable overlap between WWTPs and freshwater (N = 402) and marine (N = 172) environments (Fig. 6). Moreover, 200, 273, and 244 viral genera were shared with AS, AD, and solid waste, respectively (Fig. 6). This represents a higher number than with marine viromes, and if normalized against the dataset size, there is also a larger fraction of connections than with freshwater viromes. When hosts were predicted for these shared viral genera, Proteobacteria was the most shared host phylum, followed by Firmicutes, Actinobacteria, Bacteroidetes, and Cyanobacteria (Supplementary Data 2). At the genus level, Bacillus was the most abundant between our samples and marine, AD, and AS environments, while Streptomyces displayed a higher prevalence between our samples and freshwater and solid waste environments (Supplementary Data 3). Although it is difficult to identify the source and sink dynamics, AS and AD are typical processes in WWTPs, and solid waste viral sequences mainly stem from compost and leachate microbial communities. The considerable sharing of viromes between WWTPs and marine water may be caused by Hong Kong’s extensive application of marine water for toilet flushing, causing the influent sewage of WWTPs to contain a sizable amount of marine water. These data thus suggest that viruses are extensively shared and that the same viral genera may manipulate microbial communities in these different environments.

Fig. 6: Connections between viral genera in AS samples and in five ecosystems from IMG/VR database.

Pairwise connections were shown in a Circos⁵⁶ plot between samples and different ecosystems, including AS, AD, solid waste, freshwater, and marine water. Source data are provided as a Source Data file.

Full size image

Hi-C validation of virus–host interactions in AS system

High throughput chromosome conformation capture (Hi-C) method was used to validate the virus–host connections predicted by our CRISPR-based methods using an additional sample in December 2020 at ST WWTP, by referring to viral contigs and host genome bins obtained from direct sequencing using Illumina and Nanopore metagenomic sequencing (Fig. 7).

Fig. 7: General workflow to validate the precision of CRISPR-based methods in the present study.

For Illumina data, 91% precision was observed. For Nanopore data, 94% precision was observed. Source data are provided as a Source Data file.

Full size image

As for Illumina metagenomic sequencing, 4578 viral contigs were identified and 1695 of them were deconvoluted in Hi-C data to have virus–host interactions with 197 host bins (Supplementary Data 9). To compare the Hi-C results with the CRISPR-based methods, 21 viruses were predicted by BLASTn-short to link with spacers in eight bins (Supplementary Data 10).

As for hybrid assembly using both Nanopore and Illumina reads, 2593 viral contigs were identified and 989 of them were deconvoluted in Hi-C data to have virus–host interactions with 144 host bins (Supplementary Data 11). To compare the Hi-C results with the CRISPR-based methods, 28 viruses were predicted by BLASTn-short to link with spacers in 10 bins (Supplementary Data 12).

Results show that CRISPR-based results have very high accuracy. For Illumina data, of the 21 virus–host connections predicted using CRISPR spacers, 11 are simultaneously found in Hi-C data and 10 are not detected in Hi-C data. Of 11 detected connections, only 1 is different in Hi-C data and 10 are the same (91% precision) (Supplementary Data 13). Also for the Nanopore/Illumina hybrid data, of the 28 virus–host connections predicted using CRISPR spacers, 16 are simultaneously found in Hi-C data and 12 are not detected in Hi-C data. Of 16 detected connections, only 1 is different in Hi-C data, 15 are the same (94% precision) (Supplementary Data 14).

It should be noticed that some of the predicted CRISPR-based virus–host interactions are undetected in Hi-C data. CRISPR spacers represent a collection of memories regarding past virus invasions, whereas Hi-C data provide a snapshot of ongoing virus–host interactions. Also, Hi-C crosslinking may not be 100% efficient and might miss some of the virus–host interactions.

Prokaryotic viruses impact functional microorganisms in nutrient removal and carbon cycle in wastewater treatment plants

AS systems display many novel and a high fraction of shared viruses

Viruses infect a broad spectrum of bacteria and archaea

Viruses may impact key functional microorganisms in WWTPs

Viruses encode extensive auxiliary metabolic genes

Viromes are shared between WWTPs and the water environment

Hi-C validation of virus–host interactions in AS system

Microbial transfers from permanent grassland ecosystems to milk in dairy farms in the Comté cheese area

Marauding elephants, menacing macaques and epicurean bears

ITALIAN LANGUAGE

ENGLISH LANGUAGE