Infection strategy and biogeography distinguish cosmopolitan groups of marine jumbo bacteriophages
Detection and validation of high-quality jumbo phage binsDue to the large size of jumbo bacteriophage genomes, it is likely that they are present in multiple distinct contigs in metagenomic datasets and therefore require binning to recover high-quality metagenome-assembled genomes (MAGs) [28]. This has been shown for large DNA viruses that infect eukaryotes, where several recent studies have successfully employed binning approaches to recover viral MAGs [2, 3, 30]. Here, we used the same 1545 high-quality metagenomic assemblies [31] used in a recent study to recover giant viruses of eukaryotes [3], but we modified the bioinformatic pipeline to identify bins of jumbo bacteriophages. These metagenomes were compiled by Parks et al. [31] and included available metagenomes on the NCBI’s Short Read Archive by December 31, 2015 (see Parks et al. [31]). This dataset includes a wide variety of marine metagenomes (n = 469) including many non-Tara metagenomes (n = 165). We focused our benchmarking and distribution analyses on Tara data [29] because of the well-curated metadata and size fractions in this dataset. We first binned the contigs from these assemblies with MetaBat2 [32], which groups contigs together based on similar nucleotide composition and coverage profiles, and focused on bins of at least 200 kilobases in total length. We subsequently identified bins composed of bacteriophage contigs through analysis with VirSorter2 [33], VIBRANT [34], and CheckV [35] (see Methods for details).The occurrence of multiple copies of highly conserved marker genes is typically used to assess the level of contamination present in metagenome-derived genomes of bacteria and archaea [36]. Because bacteriophage lack these marker genes [37], we developed alternative strategies to assess possible contamination in our jumbo phage bins. Firstly, we refined the set of bins by retaining those with no more than 5 contigs that were each at least 5 kilobases in length to reduce the possibility that spurious contigs were put together. Secondly, we assessed the possibility that two strains of smaller phages with similar nucleotide composition may be binned together by aligning the contigs in a bin to each other. Bins that had contigs with high sequence similarity across the majority of their lengths were discarded (Supplementary Fig. 1). Thirdly, we discarded bins if their contigs exhibited aberrant co-abundance profiles in different metagenomes (see Supplementary Methods). To generate these co-abundance profiles, we mapped reads from 225 marine metagenomes provided by Tara Oceans [29] onto the bins. Coverage variation between contigs was benchmarked based on read-mapping results from artificially-fragmented reference genomes present in the samples (See Methods for details). Only bins with coverage variation below our empirically-derived threshold were retained. Using this stringent filtering, we identified 85 bins belonging to jumbo bacteriophages. These bins ranged in length from 202 kbp to 498 kbp, and 31 consisted of a single contig, while 54 consisted of 2–5 contigs (Supplementary Fig. 2).To assess global diversity patterns of jumbo bacteriophages, we combined these jumbo phage bins together with a compiled database of previously-identified jumbo phages that included all complete jumbo Caudovirales genomes on RefSeq (downloaded July 5th, 2020), the INPHARED database [14], a recent survey of cultivated jumbo phages [6], the Al-Shayeb et al. study [4], and marine jumbo phage contigs from metagenomic surveys of GOV 2.0 [26] (60 jumbo phages), ALOHA 2.0 [38] (8 jumbo phages), and one megaphage MAG recovered from datasets of the English Channel [39]. Ultimately, we arrived at a set of 244 jumbo phages, including the 85 bins, that were present in at least one Tara Oceans sample (min. 20% genome covered, see Methods) or deriving from a marine dataset (i.e. ALOHA, GOV 2.0) which we analyzed further in this study and refer to as marine jumbo phages. Statistics on genomic features can be found in Supplementary Dataset 1.Marine jumbo phages belong to distinct groups with diverse infection strategiesBecause bacteriophages lack high-resolution, universal marker genes for classification, such as 16S rRNA in bacteria, phages are often grouped by gene content [40, 41]. Here, we generated a bipartite network that included the 85 bins of jumbo phages with a dataset of available Caudovirales complete genomes in RefSeq (3012 genomes; downloaded July 5th, 2020) and the full set of reference jumbo phages described above. To construct the bipartite network, we compared proteins encoded in all the phage genomes to the VOG database, and each genome was linked to VOG hits that were present (Fig. 1, Supplementary Dataset 2, see Methods for details). To identify groups of phage genomes with similar VOG profiles, we employed a spinglass community detection algorithm [42] to generate genome clusters. Similar methods have recently been used to analyze evolutionary relationships in other dsDNA viruses [41]. The marine jumbo phages of this study clustered into five groups that included both jumbo and non-jumbo phage genomes (Fig. 2a). We refer to these five clusters as Phage Genome Clusters (PGCs): PGC_A, PGC_B, PGC_C, PGC_D, and PGC_E. These PGCs included cultured phages and metagenome-derived jumbo phages found in various environments (i.e. aquatic, engineered) and isolated on a diversity of hosts (i.e. Firmicutes, Proteobacteria, Bacteroidetes) (Fig. 2b, c). Of the marine jumbo phages, 135 belonged to PGC_A, 11 to PGC_B, 90 to PGC_C, 7 to PGC_D, and 1 to PGC_E (Fig. 1b). In addition to this network-based analysis, we also examined phylogenies of the major capsid protein (MCP) and the terminase large subunit (TerL) encoded by the marine jumbo phages and the same reference phage set examined in the network (Fig. 1c, d). With the exception of PGC_A, the marine jumbo phages that belong to the same PGC appeared more closely related to each other than those belonging to different clusters. The polyphyletic placement of jumbo phage PGCs in these marker gene phylogenies is consistent with the view that genome gigantism evolved multiple times, independently within the Caudovirales [6].Fig. 1: Bipartite network and marker gene analyses of jumbo phages.a Network with marine jumbos and references as nodes and edges based on shared VOGs. Marine jumbo phage nodes are colored by PGC as detected with spinglass community detection analysis, other nodes are in gray. Edges and VOG nodes have been omitted to more clearly represent the pattern of phage clustering. Node size corresponds to the natural log of genome length in kilobases. b Barchart of the number of members in each PGC. PGCs with marine jumbo phages are denoted with a star and the number of marine jumbo phages in that PGC. Proportion of marine jumbo phages in that PGC is colored. Phylogenies of TerL (c) and MCP (d) proteins with references and bins. Inner ring and branches are colored by the 5 PGCs that marine jumbo phages belong to. Navy blue circles in the outer ring denote marine jumbo phages.Full size imageFig. 2: Statistics of the Phage Genomes Clusters (PGCs).a Boxplot of genome length in each network cluster (x-axis is PGC number). Star denotes PGC with a marine jumbo phage and the color matches the PGC letters of Fig. 1. b Stacked barplot of the metagenome environment from which each phage derives from in each PGC (x-axis). Reference (yellow) are cultured phages, in black are the bins of jumbo phages from this study. c Stacked barplot of the host phylum of the RefSeq cultured phages in each cluster; metagenomic phages are in gray.Full size imageWe then compared functional content encoded by the marine jumbo phages in the PGCs to identify functional differences that distinguish these groups. PGC_E was excluded from this analysis because this genome cluster contained only a single jumbo phage. Collectively, most genes of the marine jumbo phages could not be assigned a function (mean: 86.60%, std dev: 7.01%; Supplementary Dataset 3), which is common with environmental viruses [43, 44]. Genes with known functions primarily belonged to functional categories related to viral replication machinery, such as information processing and virion structure (Fig. 3a), and these genes drove the variation between the genome clusters of marine jumbo phages (Fig. 3b). A recent comparative genomic analysis of cultivated jumbo phages was able to identify three types of jumbo phages that are defined by different infection strategies and host interactions (referred to as Groups 1–3) [6]. We cross-referenced our PGCs and found that PGCs B, C, and D of this study corresponded to Groups 1, 2, and 3, respectively, suggesting that these genome clusters contain phages with distinct infection and replication strategies. PGC_A corresponded to multiple groups, indicating that this genome cluster contains a particularly broad diversity of phages.Fig. 3: Functional predictions of PGCs.a Functional categories for genes encoded by jumbo phages averaged by PGC. b Heatmap of proportion of genomes in each PGC that contain the listed genes. Listed genes were selected based on containing a known function and having a variance between the PGCs above 0.2. Dendrogram was generated based on hierarchical clustering in pheatmap.Full size imageThe second largest phage cluster with marine jumbo phages, PGC_B, consists of 238 phages (11 (4.6%) marine jumbo phages, including 10 bins generated here), and included cultured phages of Group 1, which is typified by Pseudomonas aeruginosa phage PhiKZ. Supporting this correspondence with Group 1, all marine jumbo phages of PGC_B encoded the same distinct replication and transcription machinery, including a divergent family B DNA polymerase and a multi-subunit RNA polymerase (Fig. 3b, Supplementary Dataset 3). These marine jumbo phages also encoded a PhiKZ internal head protein, and they uniquely encoded shell and tubulin homologs which has recently been found in PhiKZ phages to assist in the formation of a nucleus-like compartment during infection that protects the replicating phage from host defenses [18, 19]. Although we could not confidently predict hosts for the 11 metagenomic marine jumbo phages in this PGC_B (Supplementary Dataset 1), the cultured phages of this genome cluster infect pathogenic bacteria belonging to the phyla Proteobacteria (178 phages) and Firmicutes (6 phages) (Fig. 2c), implicating a potential host range for marine jumbo phages in PGC_B.The next largest phage genome cluster, PGC_C, comprised of 156 phages total (90 marine jumbo phages (57.7%); 4 bins generated from this study) and included reference jumbo phages in Group 2 (31, 19.9%) which are typified by Alphaproteobacteria and Cyanobacteria phages. Likewise, the host range of other cultured phages in PGC_C support the Group 2 correspondence, either infecting Cyanobacteria (139 phages) or Proteobacteria (4 phages) (Fig. 2c). Furthermore, all 3 marine metagenomic phages in PGC_C for which hosts could be predicted were matched to Cyanobacteria hosts (Supplementary Dataset 1). Functional annotations of PGC_C marine jumbo phages revealed nearly all encode a family B DNA polymerase (97.8% of phages) and the photosystem II D2 protein (PF00124, VOG04549) characteristic of cyanophages (90% of phages) (Fig. 3b). This PGC included the reference Prochlorococcus phage P-TIM68 (NC_028955.1), which encodes components of both photosystem I and II as a mechanism to enhance cyclic electron flow during infection [45]. This suggests that an enhanced complement of genes used to manipulate host physiology during infection may be a driver of large genome sizes in this group. Additionally, most of the PGC_C marine jumbo phages encoded lipopolysaccharide biosynthesis proteins (76%), which have been found in cyanophage genomes that may induce a “pseudolysogeny” state, when infected host cells are dormant, by changing the surface of the host cell and preventing additional phage infections [6] (Supplementary Dataset 3). Taken together, most marine jumbo phages of PGC_C likely follow host interactions of jumbo cyanophages, such as potentially manipulating host metabolism by encoding their own photosynthetic genes and potentially inducing a pseudolysogenic state.Finally, phages of PGC_D totaled at 47 phages, of which 7 were marine jumbo phages generated in this study (14.9%). This group included Group 3 jumbo phages (15, 31.9%), which is primarily distinguished by encoding a T7-type DNA polymerase but is not typified by a particular phage type or host range. Supporting this grouping, all marine jumbo phages in this study encoded a T7 DNA polymerase which belongs to family A DNA polymerases (Fig. 3b, Supplementary Dataset 3). Most of the other genes distinctively encoded by the marine jumbo phages in this group included structural genes related to T7 (T7 baseplate, T7 capsid proteins), a eukaryotic DNA topoisomerase I catalytic core (PF01028), and DNA structural modification genes (MmcB-like DNA repair protein, DNA gyrase B). Hosts of metagenomic marine jumbo phages in PGC_D could not be predicted (Supplementary Dataset 1); however, cultured Group 3 jumbo phages in PGC_D all infect Proteobacteria, primarily Enterobacteria and other pathogens.The largest of the phage genome clusters, PGC_A, contained 469 phages, including 135 marine jumbo phages (63 bins from this study). This genome cluster contained the largest jumbo phages, such as Bacillus phage G (498 kb) and the marine megaphage Mar_Mega_1 (656 kb) recently recovered from the English Channel [39]. Unlike other PGCs, PGC_A contained mostly metagenomic phages (401, 85%, Fig. 2b, c). Considering PGC_A contains the largest jumbo phages (Figs. 1b, 2a), the vast genetic diversity in this PGC might explain why few genes were found to distinguish this group. Of the genes unique to PGC_A, only one was present in at least half of the phages (51.9%), which was a Bacterial DNA polymerase III alpha NTPase domain (PF07733). The host ranges of cultured phages from this PGC further reflect the large diversity of this group and included a variety of phyla and genera that can perform complex metabolisms or lifestyles, such as the nitrogen-fixing Cyanobacteria of the Nodularia genus isolated from the Baltic Sea (accessions NC_048756.1 and NC_048757.1) and the Bacteroidetes bacteria Rhodothermus isolated from a hot spring in Iceland (NC_004735.1) [46]. Because this group contains an abundance of metagenome-derived genomes that encode mostly proteins with no VOG annotation (Supplementary Dataset 2), it is possible that it includes several distinct lineages that could not be distinguished using the community detection algorithm of the bipartite network analysis.Relative abundance of jumbo bacteriophages across size fractionsTo explore the distribution of the marine jumbo phages in the ocean, we first examined the size fractions in which the jumbo phages were most prevalent. To remove redundancy for the purposes of read mapping, we examined the 244 jumbo phages at the population-level ( >80% genes shared with >95% average nucleotide identity [24]), corresponding to 142 populations (11 unique to this study, corresponding to 47 bins). We then mapped Tara Oceans metagenomes onto the 142 jumbo phage populations, and 102 of these populations could be detected [min. 20% of genome covered], resulting in 74 populations in PGC_A, 2 in PGC_B, 22 in PGC_C, 3 in PGC_D, and 1 in PGC_E. Out of the 225 Tara Oceans metagenomes examined, 213 (94.6%) contained at least one jumbo phage population (median: 7, Supplementary Dataset 4). Jumbo phages were more frequently detected in samples below 0.22 µm ( More