Resolving the structure of phage–bacteria interactions in the context of natural diversity

Sampling

Environmental sampling

Samples were collected from the littoral marine zone at Canoe Cove, Nahant, Massachusetts, USA, on 22 August (ordinal day 222), 18 September (261) and 13 October (286) 2010, during the course of the three month Nahant Collection Time Series sampling¹¹.

Bacterial isolation and characterization

Bacterial isolation

Bacterial strains were isolated from water samples using a fractionation-based approach⁷ as previously described^19,20. In brief, seawater was passed first through a 63um plankton net and then sequentially through 5um (Whatman 111113 or Sterlitech PCT5047100), 1um (Whatman 111110 or Sterlitech PCT1047100), and 0.2um (Whatman 111106) hydrophilic polycarbonate filters; material recovered on the filters was resuspended by shaking for 20 min; dilution series of resuspended cells were filtered onto 0.2um polyethersulfone filters (Pall 66234) in a carrier solution of artificial seawater (40 g Sigma Sea Salts, S9883; 0.2um filtered), and filters placed directly onto Vibrio-selective MTCBS plates (BD Difco TCBS Agar 265020, supplemented with with 10 g NaCl per liter to 2% final w/v). Colonies (96) from each of three replicates of each size fraction were selected from the dilution plates with the fewest numbers of colonies (1,152 isolates per isolation day). Colonies were purified by serial passage, first onto TSB-II (Difco Tryptic Soy Broth, 1.5% BD Difco Bacto Agar 214010, amended with 15 g NaCl to 2% w/v), second onto MTCBS, finally onto TSB-II again. Colonies were inoculated into 1 ml of 2216 Marine Broth (BD Difco 279110) in 96-well 2 ml culture blocks and allowed to grow, shaking at room temperature, for 48 h. Glycerol stocks were prepared by combining 100 ul of culture with 100 ul of 50% glycerol (BDH 1172-4LP) in 96-well microtiter plates and sealed with adhesive aluminum foil for preservation at −80 °C.

Bacterial hsp60 gene sequencing

To obtain hsp60 gene sequences for isolates, Lyse and Go (LNG) (Pierce, Thermo Scientific 78882) treatments of subsamples of the same overnights cultures used in the bait assay (described below) were used directly as template in PCR amplification reactions. PCR reactions were prepared in 30 ul volumes, as follows: 1 ul LNG template, 3 ul 10x buffer, 3 ul 2 mM dNTPs, 3 ul 2um hsp60-F primer, 3 ul 2um hsp60-R primer, 0.3 ul NEB Taq, 16.7 ul PCR-grade HOH; with hsp60-F (H279) primer sequence: 5′-GAA TTC GAI III GCI GGI GAY GGI ACI ACI AC-3′, and hsp60-R (H280) primer sequence: 5′-CGC GGG ATC CYK IYK ITC ICC RAA ICC IGG IGC YTT-3′⁶¹ (Supplementary Table 1). PCR thermocycling conditions were as follows: initial denaturation at 94 °C for 2 min; 35 cycles of 94 °C for 1 min, 37 C for 1 min, 72 °C for 1 min; final annealing at 72 °C for 6 min; hold at 10 °C. PCR products were cleaned up by isopropyl alcohol (IPA) precipitation, as follows: addition of 100 ul 75% IPA to 30 ul PCR reaction product, gentle inversion mixing followed by 25 min incubation at RT, 30 min centrifugation at 2800 rcf, addition of 50 ul 70% IPA with gentle inversion wash, centrifugation at 2000 rcf, inversion on paper towels to remove IPA, 10 min centrifugation at 700 rcf, air drying in PCR hood for 30 min, resuspension in 30 ul PCR HOH. PCR products were Sanger sequenced (Genewiz, Inc.) using hsp60-R primer, as follows: 5 ul of 5 um hsp60-R primer, 7 ul nuclease free water, 3 ul DNA template. For a subset of strains hsp60 sequences were obtained from subsequently determined whole-genome sequences. Hsp60 sequences were aligned to the hsp60 sequence previously published for Vibrio 1S_84 and trimmed to 422 bases using Geneious (https://www.geneious.com/). Accession numbers for these 1287 strains are provided in Supplementary Data 1, where they are identified as baxSet1287.

Bacterial hsp60 phylogenies

A phylogenetic tree of relationships among bacterial isolates screened in the bait assay (described below) was produced based on a 422 bp fragment of the hsp60 gene, derived either from Sanger or whole genome sequences; with E. coli K12 serving as the outgroup. Sequences from each of the three days of isolation were aligned using muscle v.3.8.31⁶² with default settings (muscle -in $seqsALL -out $seqsALL.muscleAln), and a single tree including all 1287 sequences from all the days was generated using FastTree v.2.1.8⁶³ (FastTree -gtr -gamma -nt -spr 4 -slow < $seqsALL.muscleAln > $seqsALL.muscleAln.fasttree). For presentation in Fig. 1 three sub-trees including only nodes from each day were produced using PareTree v.1.0.2⁶⁴ (java -jar PareTree1.0.2.jar -t O -del notDay222.txt -f $seqsALL.$round.muscleAln.fasttree.DAY222). Trees were visualized using iTOL⁶⁵ and painted with metadata for each of the strains, including: sensitivity to killing in agar overlay by co-occurring phage predators collected on the same day and, for the subset of strains that were genome sequenced and also included in the host range matrix, the bacterial species, based on concatenated ribosomal protein analysis using RiboTree⁶⁶ as described below. Isolation days for each of the strains included in these analyses are provided in Supplementary Data 1, where these strains are identified as baxSet1287.

Bacterial genome sequencing and assignment to populations

To assign genome-sequenced bacterial isolates used in the host range assay to species, we use the RiboTree tool⁶⁶ to produce a phylogeny based on concatenated single copy ribosomal proteins as in²³. We include strains of previously described Vibrionaceae in preliminary analyses as reference strains and assign species names to new isolates based on clustering with named representatives, as well as provide placeholder names for newly identified clades with no previously described representatives. Trees were visualized using iTOL⁶⁵ and the representation including only those strains included in the host range assay is shown in Supplementary Fig. 1; population assignments and accession numbers for this set of 294 genomes, which also includes a small number of previously isolated bacterial strains that were included in the host range assay (described below), are provided in Supplementary Data 1, where they are identified as baxSet294.

Viral isolation and characterization

We have previously described features of the viruses of the Nahant Collection²⁰, as well as approaches used for the standardization of their genome assemblies¹⁹, additional details are provided below.

Viral sample collection

The iron chloride flocculation approach was used to generate 1000-fold concentrated viral samples from 0.2 um-filtered seawater, as follows. For each isolation day, triplicate 4 L seawater samples were filtered through 0.2 um polyethersulfone cartridge filters (Millipore, Sterivex, SVGP0150) into collection bottles, spiked with 400 uL of FeCl₃ solution (10 gL⁻¹ Fe; as 4.83 g FeCl₃•6H₂O (Mallinckrodt 5029) into 100 ml H₂O), and allowed to incubate at room temperature for at least 1 h. Virus-containing flocs were then recovered from the sample by filtration onto 90 mm 0.2 um polycarbonate filters (Millipore, Isopore, GTTP09030) under gentle vacuum in a 90 mm glass cup-frit system (e.g Kontes funnel 953755-0090, fritted base 953752-0090, and clamp 953753-0090); once liquid was fully passed, the funnel was removed and, with the vacuum pump left on, the filters were folded into quarters, removed from the fritted base, and inserted into a 7 ml borosilicate glass vial. A volume of 4 ml of oxalate-EDTA solution (prepared from stock solution as 10 ml 2 M Mg₂EDTA (J.T. Baker, JTL701-5), 10 ml 2.5 M Tris-HCl (Promega PAH5123), 25 ml 1 M oxalic acid (Mallinckrodt 2752); adjusted to pH 6 with 10 M NaOH (J.T. Baker, 3722-01); final volume 100 ml; used within 7 days of preparation and maintained at room temperature in the dark) was added to the vial and the sample allowed to dissolve at room temperature for at least 30 min before transfer to storage at 4 °C. A reagent used in this original formulation (JT-Baker 7501 Mg₂EDTA) is no longer available and an updated recipe is provided elsewhere⁶⁷.

Bait assay and associated viral plaque archival

In order to obtain estimates of co-occurring phage predator loads at bacterial strain level resolution, and generate plaques from which to isolate phages, we exposed 1440 purified bacterial isolates to phage concentrates from their same day of isolation (1334 yielded lawns sufficient to evaluate for plaques, and hsp60 sequences could be determined for 1287 of these). Bacterial strains screened included 480 isolates from each ordinal day, representing 120 strains from each of 4 size-fractionation classes (0.2 um, 1.0 um, 5.0 um, 63 um) details of isolation origin are provided for each strain in Supplementary Data 1, and description of naming conventions is as previously described¹⁹. For the bait assay each strain was mixed in agar overlay with seawater concentrates containing viruses (15 ul concentrate, equivalent to 15 ml unconcentrated seawater assuming 100% recovery efficiency; derived from pooling of three replicate virus concentrates from each day). We note that recoveries were not tested for individual samples and that previous tests¹⁴ of recovery efficiency have shown that resuspension of iron flocculates in oxalate solution yields initial recoveries of approximately 50% (49 ± 3% and 55 ± 11% for a marine sipho- and myo-virus respectively, at 24 h post re-suspension) and shows low decay rate over time (47 ± 5% and 73 ± 16% for a marine sipho- and myo-virus respectively, at 38 days post re-suspension). All of our assays were performed approximately 8–9 months post-sampling from oxalate concentrates stored at 4 °C. Agar overlays were performed based on the previously described Tube-free method¹³, as follows. Bacterial strains were prepared for agar overlay plating by streaking out from glycerol stocks onto 2216MB agar plates with 1.5% agar (Difco, BD Bacto, 214010), and allowed to grow for 2 days at room temperature. Strains were then inoculated into 1 ml 2216MB in a 96-well culture block and incubated 24 h at room temperature shaking at 275 rpm on a VWR DS500E orbital shaker. Immediately prior to use in direct plating the OD600 was measured in 96-well microtiter plates and subsamples were taken for Lyse and Go (LNG) processing for DNA (10 ul culture, 10 ul LNG). Phage concentrates were prepared for plating by pooling 1.2 ml from each of the concentrate replicates into a 7 ml borosilicate scintillation vial. Cultures were transferred from overnight culture blocks to 96-well PCR plates in 100 ul volume and 15 ul of pooled phage concentrate was added to cultures one row at a time, with each row plated in agar overlay before adding phage concentrate to the next row of bacterial cultures. Mixed samples of 100 ul bacterial overnight culture and 15 ul pooled phage concentrate were transferred to the surface of bottom agar plates (2216MB, 1% agar, 5% glycerol, 125 ml L⁻¹ of chitin supplement [40 g L⁻¹ coarsely ground chitin, autoclaved, 0.2 um filtered]). A 2.5 ml volume of 52 °C molten top agar (2216MB, 0.4% agar, 5% glycerol BDH 1172-4LP) was added to the surface of the bottom agar and swirled around to incorporate and evenly disperse the mixed bacterial and phage sample into an agar overlay lawn. Agar overlay lawns were held at room temperature for 14–16 days and observed for plaque formation. Glycerol was incorporated into this assay to facilitate detection of plaques⁶⁸. Chitin supplement was incorporated into this assay to facilitate detection of phages interacting with receptors upregulated in response to chitin degradation products. A variety of preliminary tests exploring potential optimizations to agar compositions for direct plating indicated that the addition of chitin did not negatively impact recovery of plaques with control phage strains tested. After approximately 2 weeks, plaques on agar overlay lawns were cataloged and described with respect to plaque morphology and plaques were picked for storage based on the previously described Archiving Plaques method¹³, as follows. All plaques were archived from plates containing less than 25 plaques, on plates with larger numbers of plaques a random subsample of plaques from each distinct morphology were archived. A polypropylene 96-well PCR plate was filled with 200 ul aliquots of 0.2 um filtered 2216MB, agar plugs were collected from plates using a 1 ml barrier pipette tip and ejected into the 2216MB, skipping one well between each sample to minimize potential for cross-contamination, for a final count of 48 phage plugs per plate. Plaque plugs were soaked at 4 °C for several hours to allow elution of phage particles into the media. After soaking, 96-well plates were centrifuged at 2,000 rcf for 3 min before proceeding to the next step. Plug soaks were then processed for two independent storage treatments. For storage at 4 °C, plates were processed by transferring 150 ul of eluate from each well to a 0.2 um filtration plate (Millipore, Multiscreen HTS GV 0.22um Filter Plate MSGVS22) and gently filtered under vacuum to remove bacteria, the cell-free filtrates containing eluted phage particles from each plaque plug were stored at 4 °C. For storage at −20 °C, 50 ul of 50% glycerol was added to the residual ~50 ul of the plug elution, often still containing the agar plug. In this way all plaques were characterized and many plaques from each strain were archived in two independent sets of conditions. Total plaque counts for all strains included in the bait assay are represented in Fig. 1, and provided in Supplementary Data 1, where they are identified as baxSet1287. Notes on limitations to the assay: Water temperatures on each of the three isolations days were 13.8 °C, 16.3 °C, and 14.2 °C, for days 222, 261, and 286; as bait assays were performed at room temperature (approximately 22 °C) some phages requiring lower temperatures may not have yielded plaques. The majority of plates were evaluated for plaque formation twice, on day 1 and day 13, thus any plaques appearing after day 1 and disappearing before day 13 – for example as a result of overgrowth of lysogens—are likely to have been missed in these assays.

Viral purification

A subset of plaques archived during the bait assay was selected for phage purification, genome sequencing, and host range characterization. This subset included single randomly-selected representatives from each plaque-positive bacterial strain. Minor details of the purification and lysate preparation varied across samples but were largely as follows. Phages were purified from inocula derived primarily from −20 °C plaque archives, and secondarily from 4 °C archives when primary attempts with −20 °C stocks failed to produce plaques. Three serial passages were performed using Molten Streaking for Singles¹³ method. Agar overlay lawns for passages were prepared by aliquoting 100 ul of host overnight culture (4 ml 2216MB, colony inoculum from streak on 2216MB with 1.5% Bacto Agar, shaken overnight at RT at 250 rpm on VWR DS500E orbital shaker) onto a standard size bottom agar plate and adding 2.5 ml of molten 52 °C top agar as in the bait assay, swirling to disperse the host into the top agar and form a lawn, and streaking-in phage with a toothpick either from the plaque archive or directly from well-separated plaques in overlays from the previous step in serial purification. Following plaque formation on the third serial passage plate plaque plugs were picked using barrier tip 1 ml pipettes and ejected into 250 ul of 2216MB to elute overnight at 4 °C. Plaque eluates were spiked with 20 ul of host culture and grown with shaking for several hours to generate a primary small-scale lysate. Small scale primary lysates were centrifuged to pellet cells and titered by drop spot assay to estimate optimal inoculum volume to achieve confluent lysis in a 150 mm agar overlay plate lysate. Plate lysates were generated by mixing 250 ul of overnight host culture with primary lysate and plating in 7.5 ml agar overlay. After development of confluent lysis of lawns as compared against negative control without phage addition, the lysates were harvested by addition of 25 ml of 2216MB, shredding of the agar overlay with a dowel, and collection of the broth and top agar. Freshly harvested lysates were stored at 4 °C overnight for elution of phage particles, the following day lysates were centrifuged at 5,000 rcf for 20 min and the supernatant filtered through a 0.2 um Sterivex filter into a 50 ml tube and stored at 4 °C.

Viral genome sequencing

Sequencing of Nahant Collection viruses was described in previous work¹⁹, and was performed as follows. For DNA extraction approximately 18 ml of phage lysate was concentrated using a 30 kD centrifugal filtration device (Millipore, Amicon Ultra Centrifugal Filters, Ultracel 30 K, UFC903024) and washed with 1:100 2216MB to reduce salt concentrations inhibitory to downstream nuclease treatments. Concentrates were brought to approximately 500 ul using 1:100 diluted 2216MB and then treated with DNase I and RNase A (Qiagen RNase A 100 mg mL⁻¹) for 65 min at 37 °C to digest unencapsidated nucleic acids. Nuclease treated concentrates were extracted using an SDS, KOAc, phenol-chloroform extraction and resuspended in EB Buffer (Qiagen 19086) for storage at 20 °C. Phage genomic DNA was sheared by sonication in preparation for genome library preparation. DNA concentrations of extracts were determined using PicoGreen (Invitrogen, Quant-iT PicoGreen dsDNA Reagent and Kits P7589) in a 96-well format and samples brought to 5 ug in 100 ul final volume of PCR-grade water diluent for sonication. Samples were sonicated in batches of 6 for 6 cycles of 5 min each, at an interval of 30 s on/off on the Low Intensity setting of the Biogenode Bioruptor to enrich for a fragment size of ~300 bp. Illumina constructs were prepared from sheared DNA as follows: end repair of sheared DNA (NEB, Quick Blunting Kit, E1201L), 0.72×/0.21× dSPRI (AMPure XP SPRI Beads) size selection to enrich for ~300 bp sized fragments, ligation (NEB, Quick Ligation Kit, M2200L) of Illumina adapters and unique pairs of forward and reverse barcodes for each sample, SPRI (AMPure XP SPRI Beads) clean up, nick translation (NEB, Bst DNA polymerase, M0275L), and final SPRI (AMPure XP SPRI Beads) clean up (Rodrigue et al., 2010). Constructs were enriched by PCR using PE primers following qPCR-based normalization of template concentrations. Enrichment PCRs were prepared in octuplicate 25 ul volumes, with the recipe: 1 ul Illumina construct template, 5 ul 5x Phusion polymerase buffer (NEB, 5X Phusion HF Reaction Buffer, B0518S), 0.5 ul 10 mM dNTPs (NEB, dNTP Mix (1 mM; 0.5 ml), N1201AA), 0.25 ul 40 uM IGA-PCR-PE-F primer, 0.25 ul 40 uM IGA-PCR-PE-R primer, 0.25 ul Phusion polymerase (NEB, Phusion High Fidelity DNA Pol, M0530L), 17.75 ul PCR-grade water. PCR thermocycling conditions were as follows: initial denaturation at 98 °C for 20 sec; batch dependent number of cycles (range of 12–28) of 98 °C for 15 sec, 60 °C for 20 see, 72 °C for 20 sec; final annealing at 72 °C for 5 min; hold at 10 °C. For each sample 8 replicate enrichment PCR reactions were pooled and purified by 0.8x SPRI beads (AMPure XP) clean up. Each sample was then checked by Bioanalyzer (2100 expert High Sensitivity DNA Assay) to confirm the presence of a unimodal distribution of fragments with a peak between 350–500 bp. Sequencing of phage genomes was distributed over 4 paired-end sequencing runs as follows: HiSeq library of 18 samples pooled with 18 external samples, 3 MiSeq libraries each containing ~100 multiplexed phage genomes. Accession numbers for all sequenced phage genomes are provided in Supplementary Data 1, where they are identified as phageSet283; the subset of phages used in the majority of analyses in this work are identified as phageSet248 and exclude non-independent isolates derived from the same plaque, as well as well as identical phages isolated from multiple independent plaques from the same host strain in the bait assay.

Viral protein clustering

To characterize and annotate groups of proteins in assembled viral genomes in the Nahant Collection¹⁹, proteins were clustered using MMseqs2 v. 2.23394⁶⁹ with default parameter settings, the 21,937 proteins reported in the GenBank files associated with each of the 283 Nahant Collection phage genomes were clustered into 5,929 clusters including 2,978 singletons. MMseqs2 cluster assignments for each protein sequence are provided in Supplementary Data 6.

Viral protein cluster annotation

All proteins were annotated using InterProScan⁷⁰ v.5.39–77.0; eggNOG-mapper^71,72 v.2 using both automated and viral HMM selection options; Meta-iPVP⁷³; and with best matches to 9518 Viral Orthologous Groups⁷⁴ HMM profiles (obtained at http://dmk-brain.ecn.uiowa.edu/pVOGs/downloads.html); search was performed with hmmer, requiring a bitscore of 50 or greater (highest e-value 5.80E-13), as follows: hmmsearch -o $out_dir/$hmm_group.$hmmfile.$prots_short_name.hmm.out -tblout $out_dir/$hmm_group.$hmmfile.$prots_short_name.hmm.tbl.out -noali -T 50 $hmmfile $prots_dir/$prots_file. Annotations for viral protein clusters are provided in Supplementary Data 6.

Receptor binding proteins (RBPs) were annotated as follows. RBPs were defined here to include both globular and fibrous host interacting proteins and general protein annotations were reviewed for similarity to known phage receptor binding proteins and supplemented with Phyre2⁷⁵, HHpred, and literature review⁷⁶. Annotated RBPs were mapped onto phage genome diagrams and additional RBPs were annotated based on gene order conservation with phages in the same genus for which RBPs were already identified; annotated RBPs were then used to iteratively search against all Nahant Collection phage proteins using the jackhmmer search tool in the HMMER⁷⁷ v.3.2.1 package (jackhmmer -cpu 16 -N 3 -E 0.00001 -incE 0.01 -incdomE 0.01 -o $run.$1.vs.$2.jackhmmer.iters-$iters.out -tblout $run.$1.vs.$2.jackhmmer.iters-$iters.tbl.out -domtblout $run.$1.vs.$2.jackhmmer.iters-$iters.dom.tbl.out $queryFASTAS $subjectFASTAS) and new hits were manually reviewed. All annotations were performed on a protein-cluster level and annotations of proteins and protein clusters as “adsorption – RBP” are indicated in Supplementary Data 6.

Recombinases were annotated as follows: Homologs of single strand annealing protein recombinases in the Rad52, Rad51 and Gp2.5 superfamilies in the Nahant Collection phages were identified as described below. First, iterative HMM searches were performed against the Nahant Collection phage proteins using as seeds 194 recombinases identified in Lopes et al.⁴⁴ (excluding RecET fusion protein YP_512292.1; http://biodev.extra.cea.fr/virfam/table.aspx), these represent 6 families of SSAP recombinases (UvsX, Sak4, Sak, RedB, ERF, and Gp2.5); searches were performed using the jackhmmer function of HMMER v.3.1.2 (jackhmmer -cpu 16 -N 5 -E 0.00001 –incE 0.01 -incdomE 0.01 -o $run.$1.vs.$2.jackhmmer.out -tblout $run.$1.vs.$2.jackhmmer.tbl.out -domtblout $run.$1.vs.$2.jackhmmer.dom.tbl.out $queryFASTAS $subjectFASTAS) – this yielded 156 proteins. Second, all hits were plotted onto genome diagrams for all phages in the collection and additional candidate recombinases identified based on gene neighborhood comparisons (Supplementary Data 9) – this step identified 4 additional protein clusters (mmseqs 297, 149, 2211, and 600), totaling 224 proteins. Third, all proteins clusters were curated by manual review of annotations made using InterProScan⁷⁰, EggNOG-mapper⁷¹, and Phyre2⁷⁵ (annotations provided in Supplementary Data 6) to identify potential false positives (none identified), and references to recombinases in annotations. Where these annotation methods did not provide additional support, sequences were evaluated for additional support using HHpred⁷⁸ (hhsearch -cpu 8 -i../results/full.a3m -d /cluster/toolkit/production/databases/hh-suite/mmcif70/pdb70 -o../results/2058109.hhr -oa3m../results/2058109.a3m -p 20 -Z 250 -loc -z 1 -b 1 -B 250 -ssm 2 -sc 1 -seq 1 -dbstrlen 10000 -norealign -maxres 32000 -contxt /cluster/toolkit/production/bioprogs/tools/hh-suite-build-new/data/context_data.crf) as implemented on the MPI Bioinformatics Toolkit webserver (mmseq 2896 and 5138 both gave >99% probability hits to DNA repair protein Rad52 with PDB ID 5JRB_G), or JackHMMER (-E 1 -domE 1 -incE 0.01 -incdomE 0.03 -mx BLOSUM62 -pextend 0.4 -popen 0.02 -seqdb uniprotkb) as implemented on the EMBL-EBI webserver (mmseq 2990 showed hits to diverse RedB family RecT-like sequences at e-value ≤1e-05). Following this third step, there were 3 protein clusters for which support was limited, these were included in the final dataset as putative SSAP recombinases but are highlighted here. Protein cluster mmseq 297 (present in 21 phages in 6 genera): was always encoded by genes adjacent to genes in protein cluster mmseq 3923, which was itself a recombinase associated exonuclease that was found either adjacent to mmseq 297 or to the well-supported putative SSAP recombinase mmseq 3721 (sometimes separated by one gene from mmseq 3721). Protein cluster mmseq 600 (present in 2 phages in 2 genera): was encoded adjacent to a protein cluster annotated as a recombination associated exonuclease; iterative HHMER searches of a mmseq 600 cluster representative (AUR82881.1) against Viruses in UniProtKB using jackhmmer yielded hits to proteins in mmseq 297 in iteration 3. Protein cluster mmseq 2990 (present in 1 phage): was encoded adjacent to two small proteins encoding putative recombination associated exonucleases and was in the same genomic position relative to neighboring genes as putative recombinases in related phages in the genus. Finally, all putative SSAP recombinase genes were assigned to a recombinase family by clustering based on 2 iterations of all-by-all HMM jackhmmer sequence similarity searches of all candidates and the reference seed set of Lopes⁴⁴ (jackhmmer -cpu 16 -N 2 -E 0.00001 -incE 0.01 -incdomE 0.01 -o $run.$1.vs.$2.jackhmmer.out -tblout $run.$1.vs.$2.jackhmmer.tbl.out -domtblout $run.$1.vs.$2.jackhmmer.dom.tbl.out $queryFASTAS $subjectFASTAS); similarities were were visualized using Cytoscape v.3.3.0 using the “Edge-weighted Spring Embedded Layout” based on jackhmmer score, clusters were identified using the ClusterMaker2 v.1.2.1 Cytoscape plugin with the MCL cluster option and all settings at default and Granularity=2.5. Proteins in 3 mmseq clusters (149, 297, 600) did not fall into MCL clusters with recombinases from the annotated seed set and therefore are described as “unknown” rather than being assigned to a family of recombinases. All final assignments of genes to a recombinase superfamily and family, as well as all associated annotations, are provided in Supplementary Data 6 (sheet A.prots_overview column anno_Recombinase_manual). Additional details regarding seed sequences and MCL cluster assignments associated with recombinase analyses are provided in Supplementary Data 7 which contains a main descriptor sheet (00.readme), an overview of the 224 Nahant phages with recombinases (sheet 01.NahantPhageRecombinases_224), a table of InterPro domains associated with each of the reference and Nahant recombinases, with specific mmseqs and MCL clusters (sheet 02.IPR_annos_Lopes+Nahant), a list of all references used (sheet 03.List1_LOPES_ALL.noETfusion), the output of the iterative jackhmmer search with seeds against all Nahant Collection proteins (sheet 04.List1_vs_NahantProts), the output of the all-by-all jackhmmer search for 194 references and 224 putative Nahant recombinases (sheet 05.Lopes+Nahant224_v_self2iter), and information on the assignment of all Nahant and reference proteins to MCL clusters as shown in Fig. 6 (06.Recombinase_assign_by_MCL).

All proteins were assigned to one of three broad categories – structural, other (non-structural), or no prediction – based on manual review of annotations derived from: NCBI product ID, Virfam²¹, PhANNs⁷⁹, pVOGs⁷⁴, eggNOG-mapper⁷², Phyre2⁷⁵, the MPI Bioinformatics implementation of HHpred⁷⁸, and targeted annotations of predicted receptor binding proteins and recombinases (see descriptions for targeted annotations in Methods, above). Protein clusters (mmseq groups) were reviewed for conflicting calls and ultimately all proteins within each protein cluster (mmseqsID) were assigned to a single category. All assignments, and annotations on which they were based, are provided in Supplementary Data 6.

The approach for assigning annotations to these broad categories was as follows: Step 1) All genes identified as putative recombinases through targeted annotations were assigned as “other”. Step 2) All genes identified as putative receptor binding proteins through targeted annotations were assigned as “structural”. Step 3) Genes not assigned to a category in steps 1 and 2, and which were identified by Virfam as “head-neck-tail” associated were assigned as follows: Genes annotated by Virfam as a terminase (TerL) were assigned as “other”; genes annotated by Virfam as a major capsid protein (MCP), portal (portal), adaptor (Ad1, Ad2, Ad3), head-closure (Hc1, Hc2, Hc3), tail completion (Tc1, Tc2), major tail protein (MTP), neck (Ne), or sheath (sheath) were assigned as “structural”. Step 4) Genes not assigned to a category in steps 1–3, were assigned as “structural” or “other” (non-structural) if identified as such by PhANNs with a confidence of ≥95%. Cases where conflicting annotations were observed between PhANNs and other annotations were flagged for review in subsequent steps. Step 5) Genes with annotations of VOG0263 (DNA transfer protein); terminal protein, any reference to internal virion protein, DNA circularization protein, and MuF-like proteins were assigned as “other”; in the case of conflict the Step 5 annotation superseded the prior annotations. Step 6) Genes with annotation as a terminase (large subunit, small subunit, and unspecified) by any of the tools (requiring ≥ 90% confidence if based on Phyre2) were assigned as “other”. Step 7) All genes lacking support across annotations were assigned as “no prediction”, high confidence Phyre2 predictions qualitatively judged as inappropriate were disregarded. Step 8) Genes flagged in Step 4 were reviewed and assigned as “structural” when containing any structural related genes (i.e. those listed in Step 3 and any others identifiable as structural based on words in the annotations and consensus across tools, e.g. containing the word baseplate, capsid, coat, head, spike, tail, whisker, fibritin). Additional targeted annotation by HHpred was used to facilitate assignment to “structural” (known structural proteins as described for Step 3 and in the aforementioned list), “other” (non-structural), “no prediction” (e.g. no assignable function based on available annotations and a PhANNs confidence of <95% for its category of “other”). Step 9) All protein clusters (genes with the same mmseqsID) were reviewed for consistency of annotation among member genes, and additional targeted annotation by HHpred was used to facilitate assignment to “structural” (known structural proteins as described for Step 3 and Step 8), “other” (non-structural), “no prediction” (e.g. no assignable function based on available annotations, a PhANNs confidence of <95% for its category of “other”). In cases where existing assignments of genes within the protein cluster contained both “no prediction” and “other” calls, the “no prediction” call prevailed where these represented more than ~30% of the calls across all genes in the cluster.

Annotation of viral potential for temperate lifestyle

Overview

We identified 6 genera of phages as likely representing temperate phages (indicated in Fig. 2).

Bacterial genome read mapping

In order to evaluate the possibility that phages closely related to the Nahant Collection phages reside in the bacterial hosts in this study as prophages we used a read mapping approach. Briefly, reads from each of 276 bacterial genomes isolated from Nahant were mapped against each of the 248 phages and coverage in terms of total bases and genes was considered. Overall, results of this analysis are consistent with our other approaches for assessing potential for lysogeny in these phages. Confirmed as prophages are the known active prophage (1 phage, NCVICG_31; this is the sole case of a prophage being isolated from its own host of isolation and was the sole phage in NCVICG_31, a myovirus in the subfamily Peduovirinae) and the transposable phages (7 phages, NCVICG_41), both genera of which show extensive recruitment of bacterial reads, covering 100 and 93% of their genomes, respectively. In addition to these 8 phages, 58 phages recruited bacterial genome reads covering up to 510 bases (range: 30–510 bases covered). Investigation of the genes to which these reads mapped showed that in only one case was a gene covered at ≥ 70%, this was a single strand binding protein in one phage in NVICG_43 (a group which includes 20 members). Where there was any mapping of reads the patterns were bimodal, with either extensive coverage (which we describe in Fig. 2 as strong evidence, in category A) or very limited coverage (which we note in Fig. 2 as weak evidence, in category B).

Results of the analyses are reported in the supplementary data as follows: The total number of bases covered by bacterial genome reads for each individual phage genome are reported in a summary form that indicates both the total bases covered when all bacterial genome reads are considered together (aggregate) and when each bacterial genome’s reads are considered alone (individual) (Supplementary Data 5); additional data on minimum, maximum, and average read depth for each phage and each analysis type (individual vs aggregate) are also reported (Supplementary Data 5); all genes covered at ≥70% by bacterial genome mappings are indicated in the column entitled reads_mapping_from_bacterial_genome in sheet D of Supplementary Data 5.

Read mapping analyses were performed as follows: Demultiplexed bacterial genome reads were quality trimmed using fastp⁸⁰ v.0.20.1 with default settings (example: fastp -i $1_1.fastq -I $1_2.fastq -o $fastpDir/$1_1.FASTP.fastq -O $fastpDir/$1_2.FASTP.fastq -html $fastpDir/$1.FASTP.html -json $fastpDir/$1.FASTP.json). Phage genomes were indexed for read mapping using bwa⁸¹ v.0.7.17 with default settings example: Forward and reverse reads bacterial genome reads were then independently mapped onto phage genomes and resulting bam files sorted using samtools⁸² v.1.8 (here the example for mapping of forward reads: while read -r line; do cd $phageDir; (bwa mem -t 12 $line $reads/$1_1.FASTP.fastq | samtools view -h -F 4 – | samtools sort -@12 -o $interim/$1.vs.$line.BWA_MEM_ALN.FORWARDv2.sam.sorted.bam -) 2» $interim/$1.vs.$line.BWA_MEM_ALN.FORWARDv2.sam.sorted.bam.log; done < $scriptsDir/phageGenomes.names). Forward and reverse read mappings were merged and sorted using samtools and genome coverage evaluated using bedtools⁸³ v.2.29.2; this was done using both an “individual” approach where reads were merged only for each bacterial genome alone versus each phage (example: while read -r line; do cd $interim; (samtools merge -@$SLURM_CPUS_PER_TASK – $1.vs.$line.BWA_MEM_ALN.FORWARDv2.sam.sorted.bam $1.vs.$line.BWA_MEM_ALN.REVERSEv2.sam.sorted.bam | samtools sort -@$SLURM_CPUS_PER_TASK – -o – | bedtools genomecov -ibam – -d > $1.vs.$line.BWA_MEM_ALN.FandR.merged.sorted.bam.bed) 2» $1.vs.$line.merge2bed.log; done < $scriptsDir/phageGenomes.names #note that the .bed file is in fact a genomecov output file format and not a bed file format) and using an “aggregate” approach where reads from all bacterial genomes were merged for mapping onto each phage genome (example: while read -r line; do cd $interimSelect276; (samtools merge -@$SLURM_CPUS_PER_TASK – 10 N*.vs.$line.BWA_MEM_ALN.* | samtools sort -@$SLURM_CPUS_PER_TASK – -o $outDir276/ALL276.vs.$line.merged.sorted.bam) 2» $outDir276/ALL276.vs.$line.merge2bam.log; done < $scriptsDir/phageGenomes.names followed by while read -r line; do cd $interimSelect276; (bedtools genomecov -ibam $outDir276/ALL276.vs.$line.merged.sorted.bam -d > $outDir276/ALL276.vs.$line.merged.sorted.bam.genomecov) 2» ALL276.vs.$line.bam2genomecov.log; done < $scriptsDir/phageGenomes.names). Summary mapping information was obtained from genomcov files using awk⁸⁴ (example: while read -r line; do cd $interim; (awk -v OFS = ‘t’ ‘BEGIN {count = 0} {if ($3 > 1) count=count+1} END {print FILENAME,count,“coveredBases”}’ $1.vs.$line.BWA_MEM_ALN.FandR.merged.sorted.bam.bed » $1.vs.$line.awkGenomeCovInfos.summary && awk -v OFS = ‘t’ ‘END {print FILENAME,NR,“totalBases”}’ $1.vs.$line.BWA_MEM_ALN.FandR.merged.sorted.bam.bed » $1.vs.$line.awkGenomeCovInfos.summary && awk -v OFS = ‘t’ ‘{sum=sum + $3} END {print FILENAME,sum/NR,“averageDepth”}’ $1.vs.$line.BWA_MEM_ALN.FandR.merged.sorted.bam.bed » $1.vs.$line.awkGenomeCovInfos.summary && awk -v OFS = ‘t’ ‘BEGIN {max=0} {if ($3>max) max = $3} END {print FILENAME,max,“maxDepth”}’ $1.vs.$line.BWA_MEM_ALN.FandR.merged.sorted.bam.bed » $1.vs.$line.awkGenomeCovInfos.summary && awk -v OFS = ‘t’ ‘BEGIN {min=999999} {if ($3<min) min = $3} END {print FILENAME,min,“minDepth”}’ $1.vs.$line.BWA_MEM_ALN.FandR.merged.sorted.bam.bed » $1.vs.$line.awkGenomeCovInfos.summary && awk -v OFS = ‘t’ ‘BEGIN {count = 0} {if ($3 = =0) count=count+1} END {print FILENAME,count,“uncoveredBases”}’ $1.vs.$line.BWA_MEM_ALN.FandR.merged.sorted.bam.bed » $1.vs.$line.awkGenomeCovInfos.summary) 2» $1.vs.$line.awkGenomeCovInfos.log; done < $scriptsDir/phageGenomes.names). Phage genome gff3 files were converted to bed files using bedops⁸⁵ v.2.4.39 (example: for file in *gff3; do cd $gffDir; gff2bed < $file > $file.bed; done) and any genes with 70% or more overlapped by recruited read regions identified using bedops (example: while read line; do cd $gffDir; bedops -element-of 70% $line.gff3.bed.GENE.bed ALL276.vs.$line.merged.sorted.bam.bed » ALL276.vs.$line.genesCovered_70pct.bed; done < $scriptsDir/phageGenomes.names). All programs were installed using conda v.4.9.0.

Gene-based annotation

Integrases were identified in phages in 5 of the genera. The phages in NCVICG_12 (2 phages), NCVICG_24 (4 phages), and NCVICG_31 (1 phage), were identified as encoding integrases based on annotations with iterative jackhmmer searches with PF00589 phage integrase seed alignment, and by EggNOG-mapper and InterProScan annotation. The phages in NCVICG_23 and NCVICG_28 were identified as encoding PF00239 family integrases based on InterProScan annotation. The phages in NCVICG_41 (7 phages) were identified as encoding a Mu-like transposase, PF02914, on the basis of InterProScan annotations. Iterative jackhmmer searches with N15 phage linear plasmid maintenance gene SopA (NP_046923.1) and E. coli ParA (AAA99230.1) yielded no hits to any of the Nahant Collection phages. Finally, phages in 19 genera in the collection encode genes with transcriptional repressor domains represented by IPR010982, IPR010744, IPR001387, IPR032499.

Viral genome clustering

To understand how the diversity of viral genomes in the Nahant Collection is organized, we use the VICTOR classifier³², which determines genome to genome distances between concatenated amino acid sequences of viral proteomes using the Genome-BLAST Distance Phylogeny method⁸⁶ and clusters these using OPTSIL⁸⁷ and criteria optimized by benchmarking to reference ICTV prokaryotic virus taxonomic units available at the time of the development of the tool³², with the fraction of links required for cluster fusion of 0.5⁸⁸. Average support values for the phylogenomic trees using the d₀, d₄, and d₆VICTOR formulas were 49%, 31%, and 51%, respectively, and results presented here were those derived from the d₆ formula, for which 171 species-level and 49 genus-level clusters. VICTOR taxonomy assignments for all phages included in the analysis are provided in Supplementary Data 1 sheet B, where they are identified as phageSet283. Genome diagrams of all phages ordered by VICTOR distance are shown in Supplementary Data 8 (with all genes colored by and labeled based on the protein cluster number to which they belong) and Supplementary Data 9 (with only recombinases highlighted), these figures were generated in R v.3.6.1 with the packages genoPlotR v.0.8.10, ape v.5.4-1, and readr v.1.3.1.

Viral genome relatedness to previously described phages

To assess the robustness of the predicted VICTOR-based genus level groupings and their relation to previously identified viruses with known hosts we clustered the Nahant Collection phages with previously described phage genomes. We use the curated list of NCBI phage genomes generously provided as a public resource by the laboratory of Andrew Millard (16,103 phage genomes: http://s3.climb.ac.uk/ADM_share/crap/website/26Aug2019_phages.gb.gz; methods now published⁸⁹). We reduced the full list of 16,103 phage genomes to 10,663 genomes based on 95% identity clustering using the dedupe.sh tool in BBtools; we then added Nahant Collection phages (identified as phageSet283 in Supplementary Data 1 sheet B) not already included in the list, yielding a total of 10,722 phage genomes (see Supplementary Data 4 sheet A for a list of all phage genome accessions included). We next used vConTACT2 v.0.9.10²⁹ to predict viral genome clusters (vcontact -raw-proteins $rawprots -rel-mode ‘Diamond’ -proteins-fp $gene2genome -db ‘ProkaryoticViralRefSeq94-Merged’ -pcs-mode MCL -vcs-mode ClusterONE -c1-bin /home/k6logc/miniconda3/bin/cluster_one-1.0.jar -output-dir $outdir; see Supplementary Data 4 sheet B for full clustering output for all members).

The vConTACT2 analysis allowed us to identify 47 previously described phages that belong to 17 Nahant Collection VICTOR genera; none of these were found to be members of the same species as Nahant phages when examined together with VICTOR (Supplementary Data 4 for information about 47 previously described phages). The majority of previously described phages in Nahant Collection VICTOR genera also infected hosts in either the Vibrionaceae or Shewanellaceae (a second family of hosts also represented in this study), consistent with a previous finding that phage genera are specific to host families³². However, in 4 of the 17 genera, previously described phages had hosts in Gammaproteobacterial orders that were non-Vibrionales, including the Enterobacterales, Aeromonadales, Pasteurellales, and Alteromonadales. The genus of phages for which previous isolates had the most diverse hosts (NCVICG_31) contains phages that, in the 2019 taxonomy revision by the International Committee on the Taxonomy of Viruses, were assigned to multiple genera within the phage subfamily Peduovirinae; and was represented in the Nahant Collection by only a single phage – the sole case in this study of isolating a host-derived prophage (as the result of a prophage forming a plaque on its own host in the bait assay).

With respect to the Nahant phages alone, correspondence between VICTOR genera and vConTACT2 clusters was overall high (see Supplementary Data 4 sheet C for cluster assignments for Nahant phages and Supplementary Data 4 sheet D for correspondence between genera and vConTACT2 clusters). However, we found that a number of VICTOR genera (NCVICG) over-clustered phages as compared with vConTACT2, as follows. The representatives of the non-tailed family Autolykiviridae were identified as all being members of a single genus by VICTOR (NCVICG_17) but were split into 2 vConTACT2 clusters. Representatives of NVCG_36 were split across 2 vConTACT2 clusters. The two phages in NCVICG_36 were identified as separate outliers in vConTACT2. The phages in NCVICG_47 were split across 3 vConTACT2 subclusters within a single cluster. And, three VICTOR genera (NCVICG: 14, 21, and 46) that otherwise exactly corresponded with vConTACT2 clusters, each also included a member classified as an outlier by vConTACT2. Finally, 7 phages (members of NCVICG: 7, 9, 12, and 28) that were included as inputs to the vConTACT2 analysis were excluded from the summary outputs in multiple run attempts for reasons that are unclear.

Host range

Host range assay

Host range of viruses was determined as follows, and as also previously described^13,20. Cell-free phage lysates were stamped onto host agar overlay lawns and observed for changes in lawn morphology proximal to each stamp. Phage application to host lawns was performed using a 96-well blotter (BelArt, Bel-blotter 96-tip replicator 378760002) that was set into a microtiter plate containing arrayed phage lysate, transferred to the surface of the host lawn, and allowed to remain in contact for several minutes. Each 96-stamp contained 3 replicates of each phage lysate, distributed across three panels (columns 1–4, 5–8, 9–12) each with a unique array of the 32 samples (including one negative control). 96-well blotters were microwave steam sterilized (Tommee Tippee, Closer to Nature Microwave Steam Sterilizer) in batches for continuing re-use during plating sessions. Bacterial strains were prepared for the infection assay by inoculating 1 ml volumes of 2216MB in 2 ml 96-well culture blocks directly from glycerol stocks and shaking them at RT for approximately 48 h. Agar overlays were prepared by transferring 250 ul aliquots of host culture to bottom agar plates (2216MB, 1% Bacto Agar, 5% glycerol; in 150 mm diameter plates) and adding 7 ml of molten 52 °C top agar (2216MB, 0.4% Bacto Agar, 5% glycerol). Phages were prepared by distributing lysates into a 2 ml 96-well culture block in panels as described above, aliquots of <200 ul were then transferred into shallow microtiter plates so that the blotter could phage lysate by capillary action. Host lawns were stamped with phage lysates within 5–6 h of plating lawns. Agar overlays were assessed for changes in lawn morphology associated with phage treatment and scored blind with respect to phage identity and arrangement of replicates. Plates were scored for the presence of interactions on days 1, 2, 3, 7, 14, 21, and 30, and the outer bands of the interaction zones were marked with a different color for each time point. After 30 days the interactions for each strain were recorded and the approximate diameter for each interaction at each time point was recorded. During recording of the interactions for each plate an additional qualitative measure of confidence in the projected positive or negative call of the interaction was made. For example, where 2 of 3 replicates were positive for a phage on a lawn with no other positive interactions such an interaction would be called by the qualitative measure as “real”; alternatively, where 2 of 3 replicates were positive for a phage on a lawn containing several other positive interactions the qualitative measure might call these replicates “contam” if they were high-titer interactions and occurred in close proximity to other positive interactions. Evaluation of changes in clearing sizes showed that the majority of clearings (>90%) increased in size over the course of the observation period, consistent with these representing replicative infections rather than inhibition or killing from without. If any of the clearings that do not increase in size represent non-replicative interactions this would reduce the total true killing interactions further and underscore our finding of the general sparsity of interactions. A list of all infection pairs and information regarding plaque sizes and increases are provided in Supplementary Data 1 sheet C, and represent interactions between phages identified in the Supplement as phageSet248 and hosts identified as baxSet279.

Characterization of phage-host interaction features

BiMat modularity analysis

To characterize large scale features of the infection network we use the BiMat MatLab package²² as described in^5,6,90. Modularity was quantified using the leading eigenvector method, with a Kernighan-Lin tuning step performed after module detection, and nestedness was quantified using overlap and decreasing fill (NODF). Statistical significance of modularity and nestedness was tested against 1000 random matrices generated using the equiprobable method, preserving the overall matrix connectivity. Modularity values were as follows: Qb value 0.7306, mean 0.4362, std 0.0047, z-score 63.2774, t-score 2001.0077, percentile 100; Qr value 0.9318, mean 0.1004, std 0.0219, z-score 37.9184, t-score 1199.0848, percentile 100. Nestedness values were as follows: Nestedness value: 0.0300, mean 0.0230, std 0.0005, z-score 14.0305, t-score 443.6833, percentile 100. The 248 phages included in the analysis (phageSet248 in Supplementary Data 1) are genome sequenced phages isolated during the Nahant study, excluding cases of duplicate phages purified from the same plaque and excluding duplicate phages purified from different plaques on the same host; the 279 bacteria included in the analyses (baxSet279 in Supplementary Data 1) include all bacterial strains screened in the host range assay for which there was a positive interaction with a phage in phageSet248 (ie. host strains that were assayed but not killed by any phages were not included); these represent 1,436 infections out of a possible set of 69,192 and yield a connectance or fill of 0.021. To facilitate visual comparisons between the matrix of interactions between phages and bacteria with known species assignments (Fig. 2b) and the BiMat results, the representation of the BiMat analysis as shown in the main text (Fig.2a) includes only the subset of interactions between phages in phageSet248 and the 259 bacterial strains (baxSet259) that is the intersection of the 279 bacterial strains (baxSet279) included in the BiMat analysis and the 294 bacterial genomes (baxSet294) for which genomes were available. BiMat module assignments for all phages and hosts are provided in Supplementary Data 1 sheet C.

Average nucleotide identity

FastANI v.1.32 was used to determine average nucleotide identity (ANI) for phages and hosts as follows. For phages, run parameters were: kmer size of 16, fragment length of 100, minimum fraction of shorter genome coverage of 75% (fastANI -kmer 16 -fragLen 100 -minFraction 0.75 -matrix -ql $genomesPathsList -rl $genomesPathsList -o phageSet248_k16fL100Frax75.fastani.out). For bacteria, run parameters were: kmer size of 16, fragment length of 3000, minimum fraction of shorter genome coverage of 50% (fastANI -kmer 16 -fragLen 3000 -minFraction 0.5 -matrix -ql $genomesPathsList -rl $genomesPathsList -o baxSet294_k16fL3000Frax50.fastani.out). Results of ANI analyses are provided in Supplementary Data 1.

Host range divergence analyses within VIC-species and VIC-genera

To quantify overlap in host range profiles between phages we develop a metric of host range divergence (represented as concordance (1-divergence) in the main text and Fig. 4). We normalize the binary vector ({{{{{{boldsymbol{x}}}}}}}_{i}=left({x}_{1},{x}_{2},ldots ,{x}_{m}right)) representing the killing host range of a phage i across all m hosts so that it sums to 1, and interpret the result ({p}_{i}={x}_{i}/{sum }_{j}^{m}{x}_{j}) as a probability distribution of killing across all m hosts for a single phage i. We then define the scaled host range divergence of a given genus consisting of n phages to be the normalized generalized Jensen-Shannon divergence (gJSD) of their infection probability vectors, ({D}_{h}={{mbox{gJSD}}}left({{{{{{boldsymbol{p}}}}}}}_{{{{{{boldsymbol{1}}}}}}},ldots ,{{{{{{boldsymbol{p}}}}}}}_{{{{{{boldsymbol{n}}}}}}}right)/{{log }}_{2}left(nright)). This has the property that (0le {D}_{h}le 1), where ({D}_{h}=0) means the host ranges of all phages overlap exactly, and ({D}_{h}=1) means none of the host ranges have any overlap with each other. We note that we took a conservative approach in determining genus-level concordances as presented in Fig. 3 by including all phages within each VIC-genus in the calculation of concordance rather than collapsing intra-species blooms, which are largely comprised of phages with overlapping host ranges. Because high overlap of host range within VIC-species increases the value of the overall genus-level concordance metric (because of the higher number of pairwise comparisons with high values of concordance within species), this approach will be affected by the evenness of species abundances within a genus and including all members yields up to >24 times higher values of the concordance metric than when only including single representatives of each species in this dataset. Calculations of concordance are provided in Source Data Fig. 4 for VIC-genera using both the approach of using all members (Source Data Fig. 4 sheet B, with phages identified as phageSet248) and that of using only single representatives from each VIC-species (Source Data Fig. 4 sheet C, with phages identified as phageSet171); values for VIC-species are provided in Source Data Fig. 4 sheet A. A Welch’s t-test between the set of VIC-genus level concordances and the set of VIC-species level concordances yielded a p-value of 1.45e-07, suggesting that concordance differs significantly between the VIC-genus and VIC-species levels. Analyses and visualizations were performed in R 3.6.1 with the following packages: ggrepel 0.8.1, ape 5.3, combinat 0.0-8, Infotheo 1.2.0, philentropy 0.4.0.9000, igraph 1.2.4.1, ggraph 2.0.0, cowplot 1.0.0, data.table 1.12.2, ggplot2 3.2.1, tidyverse 1.3.0, ggtree 2.0.4, and patchwork 1.1.1.

Characterization of sequence sharing

Homologous recombination within species and genera

To assess the extent of recombination between closely related viruses we used viral species and genera as operationally defined by VICTOR as the framework and estimated effective relative contribution of recombination over mutation (r/m) as follows. HomBlocks v.1.0⁹¹ was used with default parameter settings to identify, extract, and trim conserved regions within genera based on alignments with progressiveMauve (build February 13, 2015)⁹² and trimAl v.1.2⁹³. Phylogenetic relationships between sequences were predicted using IQ-TREE v.1.6.12⁹⁴. ClonalFrameML v.1.12⁹⁵ was then used to evaluate evidence for recombination within species and genera. Estimates of relative effect of recombination to mutation (r/m) were based on the formula r/m = R/theta * delta * nu; where R/theta is the ratio of recombination to mutation rate, delta is mean import length, and nu is the nucleotide distance between imported sequences. We include in our analysis a control set of closely related siphovirus genomes which in a previous study were found to have an r/m of 23.50³⁵, accession numbers for these phages are provided in Source Data Fig. 4 sheet G; using the methods we describe here we find a similar r/m of 18.3. All analyses were performed only where there were ≥3 phages per species or genus. For genus level calculations of r/m we performed analyses using two different sets of genomes: first, as presented in Fig. 4e, g including all phages in each genus (Source Data Fig. 4 sheet E, using phages identified as phageSet248); second, including only 1 representative from each species (Source Data Fig. 4 sheet F, using phages identified as phageSet171). As for estimates of host range divergence, estimates of r/m are affected by evenness of species abundances within a genus and including all members can result in reduced estimates of r/m; for example, for NCVICG_17, the Autolykiviridae, estimates of genus-level r/m are >285 higher when considering only species representatives rather than all members of the genus.

Sharing of 25-mers

To assess nucleotide sequence sharing between phages in the collection overall, we used methods that were not limited by requirement for large scale pairwise genome conservation. We use Mash v.2.2.2⁹⁶ to create a sketch file for all phage genomes (mash sketch -o $out -k 400000 -s 25 -i $concatGenomesFile), using a k-mer size of 25 and a sufficiently large sketch size to capture all 25-mers in the largest phage genome; we next generate the mash distance output (mash dist $out.msh $out.msh -i >> $out.msh.ALLbyALL.dist), which includes a count of shared 25-mers between all pairs of phage genomes. To compare the infection and k-mer sharing networks, we created a binary vector reflecting whether each pair of phages shares at least one host in the observed infection matrix. Similarly, we created a binary vector reflecting whether each pair of phages shares at least one 25-mer between their respective genomes, as reported by Mash. We then calculated the mutual information between these vectors using the R function mutinformation() from package infotheo 1.2.0. Using mutual information analysis to ask whether phages that share ≥1 25-mer also share ≥1 hosts, we found that one matrix does not predict the other (0.018 bits out of a maximum value of 1 predicted). Host sharing information and Mash outputs for all phage-phage pairs are provided in Source Data Fig. 5 and were based on analysis of phages identified in Supplementary Data 1 as phageSet248.

BRIG plotting

To visualize regions of genome conservation and divergence within sets of phages as compared to a reference, as shown in Fig. 4d we used the BLAST Ring Image Generator tool BRIG⁹⁷.

Transfers with identifiable directionality between genera

To use a method that could predict directionality of transfer between phages in different genera, we used MetaCHIP⁴¹, providing VICTOR genus as the grouping variable. This method is a conservative estimator as it requires that candidate regions be conserved within donor groups but not in recipient groups. Modifications were performed to the source code to accept the amino acid-based phylogenetic tree as output by VICTOR as the phylogeny used in the MetaCHIP analysis. MetaCHIP parameters were set to 90% identity cutoff, 200 bp alignment length cutoff, 75% coverage cutoff, 80% end match identity cutoff, and 10 Kbp flanking region length. MetaCHIP results are provided in Source Data Fig. 5.

Potential for cross-recruitment of reads between viral genomes

To assess the potential for cross-recruitment of metagenomic reads to lead to false positive detection of phage genomes in metagenomic sequencing datasets we simulated Illumina reads for each of the phage genomes using parameters representative of high quality viral metagenome datasets and then asked whether they yielded a positive identification using a recently developed reference mapping tool designed for rapid characterization of large phage metagenomes. We included 248 phages of the Nahant Collection, as well as 47 previously described phage genomes that in vConTACT2 analyses (above) were found to be members of the same VIC-genera as the Nahant Collection phages. Simulated Illumina paired end reads were generated using DWGSIM v.1.1.11⁹⁸, with settings of: outer distance 500, standard deviation of read lengths 50, number of reads 100000, read lengths for both reads of 250 bp, and error rate 0.0010 (dwgsim -d 500 -s 50 -N 100000 -1 250 -2 250 -r 0.0010 -c 0 -S 2 $genome $genome.simreads). Cross recruitment of reads to each of the 295 phages was performed independently for simulated reads from each phage using default settings of FastViromeExplorer⁴³ for criteria required to define a genome as present in a read dataset: genome coverage [0.1]; ratio of coverage to expected coverage [0.3], which considers evenness of read mapping; and minimum number of mapped reads [10]. Notably, performing this analysis with 100bp-length read sets eliminated cross-genus mappings that were identified using 250 bp read sets. This is thought to occur because FastViromeExplorer uses a Kallisto⁹⁹ based pseudoalignment approach, which requires only a single matching 31-mer to map a given read to a target genomes, and thus where a 100 bp read and a 250 bp read both share only a single 31-mer with a reference genome the 250 bp read will result in overall greater coverage of the genome. All information about the phages included in the cross-recruitment analysis, and the associated data, are provided in Source Data Supplementary Fig. 7.

Biological materials availability

Phage and bacterial isolates are available from the authors upon request.

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.