Phylogenetic diversity of NCLDV MAGs
To address critical questions regarding the genomic diversity, evolutionary relationships, and virocell metabolism of NCLDVs in the environment, we developed a workflow to generate metagenome-assembled genomes (MAGs) of NCLDVs from publicly-available metagenomic data (see Methods, Supplementary Fig. 1). We surveyed 1545 metagenomes and generated 501 novel NCLDV MAGs from individual samples that ranged in size from 100 to 1400 Kbp. Our workflow included steps to remove contigs potentially originating from cellular organisms or bacteriophage, as well as to minimize possible strain heterogeneity in each MAG. To ensure our NCLDV MAGs represented nearly-complete genomes, we only retained MAGs that contained at least 4 of 5 key NCLDV marker genes that are known to be highly conserved in these viruses11 and had a total length >100 Kbp (see Methods for details and rationale). Most of the MAGs were generated from marine and freshwater environments (444 and 36, respectively), but we also found 21 in metagenomes from bioreactors, wastewater treatment plants, oil fields, and soil samples (labeled “other” in Fig. 1, Supplementary Fig. 2; details in Supplementary Dataset 1).
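The retention criteria described above (at least 4 of 5 marker genes and a total length above 100 Kbp) can be illustrated with a minimal sketch. The marker labels `m1` to `m5`, the `Mag` record, and the bin names are hypothetical placeholders; the actual workflow is described in the Methods and is not reproduced here.

```python
from dataclasses import dataclass

# Thresholds taken from the text; everything else below is illustrative.
MIN_MARKERS = 4            # at least 4 of the 5 marker genes
MIN_LENGTH_BP = 100_000    # total length must exceed 100 Kbp

# Hypothetical labels standing in for the 5 conserved marker genes.
MARKER_SET = frozenset({"m1", "m2", "m3", "m4", "m5"})


@dataclass(frozen=True)
class Mag:
    name: str
    length_bp: int
    markers_found: frozenset


def passes_quality_filter(mag: Mag) -> bool:
    """Return True if the MAG meets both retention criteria."""
    n_markers = len(mag.markers_found & MARKER_SET)
    return n_markers >= MIN_MARKERS and mag.length_bp > MIN_LENGTH_BP


# Toy candidates (made-up values, not data from the study).
candidates = [
    Mag("bin_a", 250_000, frozenset({"m1", "m2", "m3", "m4"})),
    Mag("bin_b", 90_000, frozenset({"m1", "m2", "m3", "m4", "m5"})),
    Mag("bin_c", 400_000, frozenset({"m1", "m2", "m3"})),
]
retained = [m.name for m in candidates if passes_quality_filter(m)]
print(retained)
```

In this toy input only `bin_a` satisfies both criteria: `bin_b` is too short and `bin_c` has too few markers.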

a Phylogeny of the 501 NCLDV MAGs presented in this study together with 121 reference genomes. The phylogeny was constructed from a concatenated alignment of 5 highly conserved marker genes that are present throughout the NCLDV families using the VT + F + I + G4 model in IQ-TREE. The tree is rooted at the Poxviridae/Asfarviridae branch, consistent with previous studies11. The inner strip is colored according to the phylogeny of the MAGs, while the outer strip is colored according to the habitat in which they were found. The bar chart represents genome size, which ranges from 100–2474 Kbp, and the dotted line denotes the 500 Kbp mark. Clades with >5 genomes are indicated with two-letter abbreviations and clade numbers. MM: Mimiviridae, EP: Early Phycodnaviridae, LP: Late Phycodnaviridae, IR: Iridoviridae, MR: Marseilleviridae, PT: Pithoviridae. For the list of all the clades, see Dataset 1. b Average amino acid identity (AAI) heatmap of the MAGs and reference genomes, with rows and columns clustered according to the phylogeny.
We constructed a multi-locus phylogenetic tree of the NCLDV MAGs together with 121 reference genomes using five highly conserved genes that have been used previously for phylogenetic analysis of these viruses11 (Fig. 1). The majority of our MAGs placed within the Mimiviridae and Phycodnaviridae families (350 and 126, respectively), but we also identified new genomes in the Iridoviridae (16), Asfarviridae (7), Marseilleviridae (1), and Pithoviridae (1). The identification of a large number of Mimiviridae members in our study is consistent with previous analyses suggesting high diversity of this family in marine systems18,25,26. Our phylogeny revealed that the Phycodnaviridae are polyphyletic and consist of at least two distinct monophyletic groups, one of which is sister to the Mimiviridae (termed Late Phycodnaviridae, 108 MAGs), and one of which is basal branching to the Mimiviridae-Late Phycodnaviridae clade (termed Early Phycodnaviridae, 18 MAGs). The monophyly of the combined Phycodnaviridae and Mimiviridae families has been suggested previously based on concatenated marker gene phylogenies11,15,27, although one recent study reported an alternative topology in which the Asfarviridae also placed within this broader group28. In addition to the phylogeny, we evaluated the pairwise Average Amino Acid Identity (AAI) between NCLDV genomes to assess genomic divergence. AAI values provided results that were largely consistent with our phylogenetic analysis and showed that intra-family AAI values ranged from 26 to 100% (Fig. 1b), highlighting the substantial sequence divergence even between NCLDV genomes within the same family.
Given the large diversity within each of the NCLDV families, we sought to identify major clades within these groups that could be used for finer-grained classification. Using the rooted NCLDV phylogeny we calculated optimal clades within each family using the Dunn index29 (see Methods), resulting in 54 total clades, including 18 from the Mimiviridae, 13 from the Early Phycodnaviridae, and 6 from the Late Phycodnaviridae (Fig. 1, Supplementary Dataset 1). No cultured representatives were present in 31 of the clades (57%), including 2 from the Asfarviridae, 9 from the Early Phycodnaviridae, 1 from the Iridoviridae, 3 from the Late Phycodnaviridae, 1 from the Marseilleviridae, 14 from the Mimiviridae, and 1 from the Pithoviridae. Compared to references available in GenBank, this greatly increases the number of available genomes in the Mimiviridae and Phycodnaviridae, highlighting the vast diversity of environmental NCLDV that have not been sampled using culture-based methods.
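For readers unfamiliar with the Dunn index, the following sketch shows its generic definition: the smallest distance between members of different clusters divided by the largest distance between members of the same cluster, so that higher values indicate more compact, better-separated groups. The use of a pairwise (for example, patristic) distance matrix as input is an assumption made here for illustration; the exact procedure used for clade delineation is given in the Methods and may differ.

```python
from itertools import combinations


def dunn_index(dist, labels):
    """Generic Dunn index: min inter-cluster distance / max intra-cluster diameter.

    dist   -- symmetric matrix (list of lists) of pairwise distances
    labels -- cluster label for each row/column of dist
    """
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("need at least two clusters")

    # Largest distance between any two members of the same cluster.
    max_diameter = max(
        (dist[i][j]
         for members in clusters.values()
         for i, j in combinations(members, 2)),
        default=0.0,
    )
    # Smallest distance between members of two different clusters.
    min_separation = min(
        dist[i][j]
        for a, b in combinations(list(clusters.values()), 2)
        for i in a
        for j in b
    )
    if max_diameter == 0:
        return float("inf")
    return min_separation / max_diameter


# Toy distances for four tips (made-up values): tips 0-1 and 2-3 are close pairs.
toy_dist = [
    [0.0, 0.1, 1.0, 1.0],
    [0.1, 0.0, 1.0, 1.0],
    [1.0, 1.0, 0.0, 0.2],
    [1.0, 1.0, 0.2, 0.0],
]
good = dunn_index(toy_dist, [0, 0, 1, 1])   # matches the two tight pairs
poor = dunn_index(toy_dist, [0, 1, 1, 1])   # splits a tight pair
print(good > poor)
```

Scanning candidate partitions of a family and keeping the one with the highest index is one way such a criterion could be used to choose clade boundaries.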
Analysis of the genome size distribution across the NCLDV phylogeny provided results that are consistent with the current knowledge of these viruses. For example, Late Phycodnaviridae clade 1 contained sequenced representatives of the Prasinoviruses, including known Ostreococcus and Micromonas viruses, which encode the smallest Phycodnaviridae genomes known30. Consistent with this finding, the MAGs belonging to this clade were also smaller and ranged in size from 100–225 Kbp, suggesting that small genome size is broadly characteristic of this group. By comparison, genomes in the Early Phycodnaviridae were larger and formed more divergent groups with long branches, suggesting a large amount of untapped diversity in this clade (Fig. 1a). The Pandoraviruses and Mollivirus sibericum, notable for their particularly large genomes, formed a distinct clade in the Early Phycodnaviridae. The Coccolithoviruses and Phaeoviruses31 were also placed in the Early Phycodnaviridae, and we identified 7 and 2 new members of these groups, respectively. Compared to the Late Phycodnaviridae, genome sizes of our MAGs were also notably higher in the Mimiviridae, which are known to encode among the largest viral genomes. In Mimivirus clade 16, which includes Acanthamoeba polyphaga mimivirus, we identified 19 new MAGs, 13 of which have genomes >500 Kbp. Taken together, these results are consistent with the larger genomes that have been observed in the Mimiviridae compared to the Late Phycodnaviridae11. Although our NCLDV MAGs contain most marker genes we would expect to find in these genomes, it is likely that many are not complete, and these genome size estimates are therefore best interpreted as underestimates.
Evolutionary genomics of the NCLDV
To assess the diversity of protein families across the NCLDV families, we calculated orthologous groups (OGs) between our MAGs and 126 reference genomes, resulting in 81,411 OGs (Supplementary Dataset 2). Of these, only 21,927 (27%) had detectable homology to known protein families in the EggNOG, Pfam, TigrFam, and VOG databases, highlighting the large number of novel genes in NCLDV genomes that has been observed in other studies32,33. Moreover, 55,692 (68%) of the OGs were present in only one NCLDV genome (singleton OGs), and overall the degree distribution of protein family membership revealed only a small number of widely-shared protein families (Fig. 2a, b), consistent with what has been shown for dsDNA viruses in general34. To visualize patterns of gene sharing across the NCLDV we constructed a bipartite network in which both genomes and OGs can be represented (Fig. 2c). Analysis of this network revealed primarily family-level clustering, with the Mimiviridae and Early and Late Phycodnaviridae clustering near each other, and the Pithoviridae, Marseilleviridae, and Poxviridae clustering separately. Interestingly, although Pandoraviruses are members of the Early Phycodnaviridae clade, they clustered independently in a small sub-network, indicating that the particularly large genomes and novel genomic repertoires in this group are distinct from all other NCLDVs. These patterns suggest that genomic content in the NCLDV is shaped in part by evolutionary history, but that large-scale gains or losses of genomic content can occur over short evolutionary timescales, similar to what has occurred in the Pandoraviruses. This indicates that over long evolutionary timescales the genome evolution of NCLDV is shaped by a mixture of vertical inheritance and lateral gene transfer (LGT), in many ways at least qualitatively similar to that of Bacteria and Archaea35.
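The singleton count and degree distribution reported above are simple summaries of a genome-to-OG membership table, which is also the edge list of the bipartite network. A minimal sketch, assuming a hypothetical mapping from genome name to the set of OGs it encodes (the toy identifiers are made up, while the two percentage checks at the end use only the counts quoted in the text):

```python
from collections import Counter


def og_degree_summary(genome_to_ogs):
    """Summarize how many genomes each OG occurs in.

    Returns (total OGs, number of singleton OGs, degree distribution),
    where the distribution maps 'number of genomes' -> 'number of OGs'.
    """
    degree = Counter()
    for ogs in genome_to_ogs.values():
        degree.update(set(ogs))          # count each OG once per genome
    n_singletons = sum(1 for d in degree.values() if d == 1)
    distribution = dict(sorted(Counter(degree.values()).items()))
    return len(degree), n_singletons, distribution


# Toy membership table (hypothetical genomes and OGs).
toy = {
    "g1": {"OG1", "OG2", "OG3"},
    "g2": {"OG1", "OG4"},
    "g3": {"OG1", "OG2", "OG5"},
}
n_ogs, n_singletons, distribution = og_degree_summary(toy)
print(n_ogs, n_singletons, distribution)

# Percentages recomputed from the counts reported in the text.
pct_singleton = round(100 * 55_692 / 81_411)
pct_annotated = round(100 * 21_927 / 81_411)
print(pct_singleton, pct_annotated)
```

Each (genome, OG) pair in such a table corresponds to one edge of the bipartite network, so the same input could be passed to a graph library for layout.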

a The distribution of the orthologous groups (OGs) in the NCLDV MAGs and reference genomes. The barplot on the left shows the proportion of OGs in each frequency category that could be assigned an annotation, while the barplot on the right shows the total number of OGs in each frequency category (log scale). b The degree distribution of the OG occurrence in the genomes analyzed. The best fit to a power law distribution is also shown. c A bipartite network of the OGs, with large nodes corresponding to genomes and small nodes corresponding to OGs. The size of the genome nodes is proportional to their genome size, and they are colored according to their family-level classification.
To further elucidate the evolutionary history of the large number of genes in NCLDVs, we investigated clade-specific patterns in gene sharing. We found distinct clustering of NCLDV OGs based on their presence in NCLDV clades, indicating that the majority of the OGs are unevenly distributed across clades (Fig. 3a). This was confirmed by an enrichment analysis, where we identified sets of enriched OGs in each of the major NCLDV clades (Mann-Whitney U test, corrected p-value <0.01). The most common functional categories among the clade-specific OGs are predicted to be involved in DNA replication, translation, and transcription. Translational machinery was particularly enriched in Mimivirus clade 16, which contains many cultivated representatives known to have the highest proportion of translation-associated genes of any virus36,37. The clade-specific genomic repertoires of NCLDV suggest that this is an appropriate phylogenetic scale for examining functional diversity across the NCLDV, and we anticipate these clades will be useful groupings that can be used in future studies examining spatiotemporal trends in viral diversity in the environment.
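The clade-level occurrence pattern underlying this analysis can be tabulated as the percent of genomes in each clade that encode a given OG. The sketch below shows only that tabulation, not the enrichment test itself; the genome names, clade labels, and OG identifiers are hypothetical.

```python
from collections import defaultdict


def clade_occupancy(genome_to_clade, genome_to_ogs):
    """Percent of genomes in each clade that encode each OG.

    Returns {og: {clade: percent}}; clades in which an OG is absent are omitted.
    """
    clade_sizes = defaultdict(int)
    counts = defaultdict(lambda: defaultdict(int))
    for genome, clade in genome_to_clade.items():
        clade_sizes[clade] += 1
        for og in set(genome_to_ogs.get(genome, ())):
            counts[og][clade] += 1
    return {
        og: {clade: 100.0 * n / clade_sizes[clade] for clade, n in per_clade.items()}
        for og, per_clade in counts.items()
    }


# Toy input (made-up assignments, not data from the study).
toy_clades = {"g1": "MM16", "g2": "MM16", "g3": "LP1", "g4": "LP1"}
toy_ogs = {
    "g1": {"OGa", "OGb"},
    "g2": {"OGa"},
    "g3": {"OGb"},
    "g4": {"OGb"},
}
occupancy = clade_occupancy(toy_clades, toy_ogs)
print(occupancy["OGa"], occupancy["OGb"])
```

In the toy example `OGa` is restricted to one clade, the kind of uneven distribution that a subsequent statistical test would flag as clade-specific.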

a The barplot shows the number of enriched OGs in each of the major NCLDV clades analyzed in this study. Only a subset of total functional categories are shown here; a full table can be found in Supplementary Dataset 2. The heatmap shows the occurrence of OGs with >5 total members across the major NCLDV clades, with shading corresponding to the percent of MAGs in that clade that encode a given OG. b A bubble plot of select metabolic genes detected in the NCLDV clades, with bubble size proportional to the percent of genomes in a clade that encode that protein. The numbers in brackets next to each enzyme name denote the number of these proteins observed in the MAGs we present here and the number observed in reference NCLDV genomes, respectively. G3P: glycerol-3-phosphate; LCM: large conductance mechanosensitive; SCM: small conductance mechanosensitive.
Metabolic potential of the NCLDV
Relatively recent studies on model NCLDV-host systems have pointed out the presence of genes involved in rewiring key aspects of cell physiology during infection, such as apoptosis, nutrient processing and acquisition, and oxidative stress regulation23,38,39,40. We found a number of genes involved in such processes to be broadly encoded across NCLDVs, particularly in the Mimiviridae and Phycodnaviridae families (Fig. 3b). Superoxide dismutase (SOD) and glutathione peroxidase (GPx), key players in regulating cellular oxidative stress, are prevalent in phylogenetically divergent NCLDVs. Giant virus replication potentially occurs under high oxidative stress inside the host cells40, and thus the presence of enzymes with antioxidant activity might be crucial in preventing damage to the viral machinery. SOD was biochemically characterized in Megavirus chilensis, and was suggested to reduce the oxidative stress induced early in the infection39. In addition, GPx was found to be upregulated during infection by algal giant viruses39,40. Genes putatively involved in the regulation of cellular apoptosis are also widespread in giant viruses, including C14-family caspase-like proteins and several classes of apoptosis inhibitors, such as Bax141. C14-family metacaspases were reported in a giant virus obtained through a single-virus genomics approach, while viral activation and recruitment of cellular metacaspase was found during Emiliania huxleyi virus (EhV) replication38,39. In Chlorella viruses, a K+ channel (KcV) protein mediates host cell membrane depolarization, facilitating genome delivery within the host42. We identified KcV in genomes from all the major clades of Late Phycodnaviridae and Mimiviridae, suggesting that host membrane depolarization is a widely-adopted aspect of NCLDV infection strategy.
Lastly, in almost all the major Mimiviridae and Phycodnaviridae clades we detected genes involved in DNA repair and processing, such as photolyases, mismatch repair (mutS), histones, and histone acetyl transferases, of which the latter two have previously been reported in a number of giant virus families, with a possible role of viral histones in packaging of DNA within the capsid9,15,43,44. All together, these results demonstrate that many important aspects of viral reproduction and infection found in cultivated NCLDV are widespread in nature and a common feature of virocell metabolism during giant virus infection.
Viruses are thought to restructure host metabolism during infection to align with virion production rather than cell growth, leading to altered nutrient demands inside the cell20,45. We found that genes involved in nutrient acquisition and light-driven energy generation are widespread in several NCLDV clades, including rhodopsins, chlorophyll a/b binding proteins, ferritin, central nitrogen metabolism, and diverse nutrient transporters (Fig. 3b), consistent with other studies that have observed some of these genes in diverse NCLDV genomes23,46,47. In addition, studies on the structure and mechanism of rhodopsin present in two giant viruses have revealed that these are light-driven proton pumps, with potential to reshape energy transfer within the infected host48,49. Similarly, widely-distributed chlorophyll a/b binding proteins in giant viruses might increase photosynthetic light-harvesting capacity of infected cells, since protists and plants are known to suppress their photosynthetic machineries, including the chlorophyll binding antenna proteins, in response to virus infection39,50,51. In addition, the presence of the key eukaryotic iron storage protein ferritin52 and transporters predicted to target ammonium (AmT), phosphorus (Phosphate permease and Phosphate:Na+ symporters), sulfur (TauE/SafE family), and iron (Fe2+/Mn2+ transporters) highlights the shifting nutrient demands of virocells compared to their uninfected counterparts. Most of the MAGs were found in aquatic environments where nutrient availability may be limiting for cellular growth, and alteration of nutrient acquisition strategies during infection may be a key mechanism for increasing viral production. For example, although iron is crucial for photosynthesis and myriad other cellular processes53, it is often present in low concentrations in marine environments54,55, and the production of viral ferritin may aid in regulating the availability of this key micronutrient during virion production. 
Moreover, nitrogen and phosphorus are limiting for microbial growth in many marine ecosystems, and given that the N:C and N:P ratios of viral biomass are relatively higher than those of cellular material56, it is likely crucial for viruses to boost acquisition of these nutrients with their own transporters. Indeed, a recent study has revealed that an NCLDV-encoded ammonium transporter (AmT) can influence the nutrient flux in host cells by altering the dynamics of ammonium uptake23.
Strikingly, many NCLDV genomes encode genes involved in central carbon metabolism, including most of the enzymes for glycolysis, gluconeogenesis, the TCA cycle, and the glyoxylate shunt (Figs. 3b, 4a, Supplementary Fig. 5). Central carbon metabolism is generally regarded as a fundamental feature of cellular life, and so it is remarkable to consider that giant viruses cumulatively encode nearly every step of these pathways. These genes were particularly enriched in Mimivirus clades 1, 9, and 16, but a few of them were also present in several Phycodnaviridae members (Figs. 3b, 4a). The glycolytic enzymes glyceraldehyde-3-phosphate dehydrogenase (G3P), phosphoglycerate mutase (PGM), and phosphoglycerate kinase (PGK), as well as the TCA cycle enzymes aconitase and succinate dehydrogenase (SDH) were particularly prevalent. In addition, we identified a fused gene in 16 MAGs that encodes the glycolytic enzymes G3P and PGK, which carry out adjacent steps in glycolysis (Fig. 4a, b), representing a unique domain architecture that has not been reported in cellular lineages before. Interestingly, in many MAGs, TCA cycle genes were co-localized on viral contigs, suggesting possible co-regulation of these genes during infection (Fig. 4c). Remarkably, one NCLDV MAG (ERX552257.96) encoded enzymes for 7 out of 10 steps of glycolysis (Fig. 4d), highlighting the high degree of metabolic independence that some giant viruses can achieve from their hosts. The fact that viruses encode these components of diverse central metabolic pathways underscores their potential to fundamentally reprogram virocell metabolism through manipulation of intracellular carbon fluxes.
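Pathway coverage of the kind reported above (for example, 7 of 10 glycolytic steps in a single MAG) can be tallied from a list of detected enzymes. In the sketch below the step abbreviations follow those used in the Fig. 4 legend, while the set of detected enzymes is a hypothetical example and does not represent the actual content of any MAG. A fused gene written as `G3P+PGK` is counted toward both steps, which is an assumption made for illustration.

```python
# The ten glycolytic steps, abbreviated as in the Fig. 4 legend.
GLYCOLYSIS_STEPS = (
    "HK", "PGI", "PFK", "ALD", "TPI", "G3P", "PGK", "PGM", "ENO", "PYK",
)


def glycolysis_coverage(detected):
    """Split the glycolytic steps into those covered and those missing.

    detected -- iterable of enzyme labels; a fused gene such as 'G3P+PGK'
                is counted toward each of its component steps.
    """
    covered = set()
    for label in detected:
        covered.update(part.strip() for part in label.split("+"))
    present = [step for step in GLYCOLYSIS_STEPS if step in covered]
    missing = [step for step in GLYCOLYSIS_STEPS if step not in covered]
    return present, missing


# Hypothetical annotation for one genome (illustrative only).
toy_detected = {"ALD", "TPI", "G3P+PGK", "PGM", "ENO", "PYK"}
present, missing = glycolysis_coverage(toy_detected)
print(f"{len(present)} of {len(GLYCOLYSIS_STEPS)} steps:", present)
print("missing:", missing)
```

The same tally could be extended to the TCA cycle or gluconeogenesis by swapping in the corresponding list of steps.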

a Schematic of glycolysis, gluconeogenesis, and the TCA cycle, with the number of NCLDV MAGs harboring a particular enzyme provided beside abbreviated enzyme names. Enzymes that were not detected in any of the studied NCLDVs are in gray. b Representative CDS from genome ERX552243.92 illustrating the domain organization (PFAM and Interpro) of the fused-domain gene (G3P + PGK) involved in glycolysis that was detected in 16 of the NCLDV MAGs. c Example of co-localization of genes involved in the TCA cycle on genomic contigs from five representative NCLDV MAGs. Locations of a number of other genes commonly present in NCLDVs are also shown. d Presence/absence of genes involved in central carbon metabolism in NCLDV genomes assembled in this study. Only the genomes harboring 3 or more enzymes are shown. G3P + PGK indicates the fused-domain gene illustrated in panel b. Blue arrow indicates the genome that harbors 7 out of 10 enzymes involved in glycolysis. HK: hexokinase, PGI: Phosphoglucoisomerase, PFK: Phosphofructokinase, ALD: aldolase, TPI: Triose-phosphate isomerase, G3P: Glyceraldehyde 3-phosphate dehydrogenase, PGK: Phosphoglycerate kinase, PGM: Phosphoglycerate mutase, ENO: Enolase, PYK: Pyruvate kinase, PEPCK: PEP carboxykinase, FBP: Fructose 1,6-bisphosphatase, G6P: Glucose 6-phosphatase, PDH: Pyruvate dehydrogenase, PC: Pyruvate carboxylase, CS: Citrate synthase, ACON: Aconitase, ICL: Isocitrate lyase, ICD: Isocitrate dehydrogenase, αKDH: α-ketoglutarate dehydrogenase, SCS: Succinyl-CoA synthetase, SD: Succinate dehydrogenase (subunits A, B and C), FH: Fumarate hydratase, MS: Malate synthase, MDH: Malate dehydrogenase.
Although the prevalence of genes involved in central carbon metabolism in the NCLDV MAGs strongly implicates them in modulating host metabolism, it is unclear at this point if these enzymes function in the same physiological context as their corresponding host versions. For example, succinate dehydrogenase has an important role in modulating cellular oxidative damage57, and could have a similar function during NCLDV propagation, which is carried out within a highly oxidative cellular environment. Moreover, in a recent study G3P was directly implicated in starvation-induced negative regulation of vesicle formation in the Golgi and several other cellular transport pathways independent of glycolysis58. It was proposed that by reducing energy consumption during starvation, G3P plays a complementary role in energy homeostasis alongside autophagy, which, in contrast, increases energy availability. Although the modulation of autophagy during NCLDV infection remains to be elucidated, it is possible that viral G3P could help avoid the lethal consequences of starvation in the hosts, while autophagy-mediated recycling of proteins could make amino acids and other nutrients available. This possibility is further strengthened by the fact that a large number of NCLDV MAGs harbor phosphate starvation inducible protein (PhoH) and nutrient transporters which might work in concert with the G3P-mediated mechanism to ensure virus propagation within the energy-limited host cells.
Evolutionary history of NCLDV metabolic genes
Phylogenies of a number of viral metabolic genes identified here together with their cellular homologs revealed that NCLDV sequences tended to group together in deep-branching clades, except for a few cases where multiple acquisitions from cellular sources were evident (Fig. 5, Supplementary Figs. 6–15). For example, aconitase, succinate dehydrogenase subunits B and C, PhoH, glyceraldehyde-3-phosphate dehydrogenase, and superoxide dismutase all showed distinct deep-branching viral clusters and were present in members of multiple NCLDV families, suggesting they diverged from their cellular homologs in the distant past (Fig. 5, Supplementary Figs. 6–15). This pattern was also observed for rhodopsin, similar to previous reports that NCLDV rhodopsins represent a virus-specific clade48, although our study suggests that at least some NCLDVs independently acquired a bacterial rhodopsin. Phosphoglycerate kinase, chlorophyll a/b binding proteins, and ammonium transporter (AmT) also appear to have been acquired multiple times, but nonetheless show several deep-branching viral clades. These results demonstrate that while NCLDVs have acquired numerous central metabolic genes from cellular hosts, many of these metabolic genes have subsequently diversified into virus-specific lineages. Indeed, detailed functional characterization of viral rhodopsin and Cu-Zn superoxide dismutases has revealed that they have different structural and mechanistic properties compared to the cellular homologs48,59, indicating that many metabolic genes in giant viruses evolved to have specific functions in the context of host-virus interactions. Our finding of the fused G3P-PGK glycolytic enzyme in many Mimivirus MAGs further reinforces this view and demonstrates that NCLDV are unique drivers of evolutionary innovation in metabolic genes.

NCLDV-specific clusters are encircled with dashed ovals in each of the trees, and the number of genes from different NCLDV clades contributing to these monophyletic groups is also provided (MM: Mimiviridae, EP: Early Phycodnaviridae, LP: Late Phycodnaviridae). Colors of the clade names correspond to those in Fig. 1. Although node support values are omitted for visual clarity, all the NCLDV-specific nodes are supported by >90% ultrafast bootstrap values (see Methods and Data availability statement for details). One asterisk denotes bacteriophage sequences, which are only present in the PhoH tree. Double asterisks denote unclassified sequences (environmental), which are only present in the rhodopsin tree.
These results run contrary to a canonical view of viral evolution in which viruses are seen as pickpockets that sporadically acquire genes from their cellular hosts rather than encoding their own virus-specific metabolic machinery60. Although these metabolic enzymes were likely acquired from cellular lineages at some point, their distinct evolutionary trajectory differentiates them from their cellular counterparts and demonstrates that NCLDV are themselves a driver of evolutionary innovation in core metabolic pathways. A recent study suggested that NCLDV have ancient origins and may even pre-date the last eukaryotic common ancestor28, indicating there has been a long period of co-evolution between these viruses and their hosts during which these gene acquisitions could have taken place.
Source: Ecology - nature.com