Unique mobile elements and scalable gene flow at the prokaryote–eukaryote boundary revealed by circularized Asgard archaea genomes

Hydrothermal vent rock and sediment sample collection

Rock no. NA091-R045 (source of Ca. H. endolithica PR6, Ca. H. repetitus FW102 and Thorarchaeote FW25) and rock no. NA091-R008 (source of Heimdall group Gerdarchaeote AC18) were retrieved from the Auka hydrothermal vent site situated on the margin of the southern Pescadero Basin of the Gulf of California using remotely operated vehicle Hercules during research expedition NA091 on E/V Nautilus on 2 November 2017. Local venting fluids have a measured temperature approaching 300 °C, contain hydrocarbons and hydrogen and are precipitating minerals, such as calcite and barite¹⁵. R045 was collected during dive H1658 at coordinates 23.956987786° N, 108.86227922° W at a water depth of 3,674 m, near shimmering water, a sign of locally focused hydrothermal fluid discharge. R008 was collected during dive H1657 at coordinates 23° 57′ N, 108° 52′ W at a water depth of 3,651 m. After shipboard recovery, rock samples were placed in Mylar bags prefilled with 0.2 µm filtered bottom seawater collected during the same dive, flushed with N₂ gas for 10 min, sealed and stored at 4 °C until preparation for incubations in the laboratory.

Sediment sample no. FK181031-S0193-PC3 (source of Ca. H. aukensis) was collected during the research expedition FK181031 on R/V Falkor to the southern Pescadero Basin on 14 November 2018. The sample was collected during dive S193 at the Auka hydrothermal vent site (23.954822° N, 108.863009° W, water depth of 3,657 m), near the site where rocks nos. NA091-R045 and NA091-R008 were collected in 2017. The sediment push core was extruded upwards and sectioned into discrete 3 cm depth horizons on board immediately after recovery, transferred into sterile Whirl-Pak bags and sealed in a larger Mylar bag, flushed with argon gas, heat-sealed and stored at 4 °C until use in the laboratory.

Sample collection permits for the expedition were granted by the Dirección General de Ordenamiento Pesquero y Acuícola, Comisión Nacional de Acuacultura y Pesca (Permiso de Pesca de Fomento no. PPFE/DGOPA-200/18) and the Dirección General de Geografía y Medio Ambiente, Instituto Nacional de Estadística y Geografía (authorization no. EG0122018), with the associated diplomatic note no. 18-2083 (CTC/07345/18) from the Secretaría de Relaciones Exteriores-Agencia Mexicana de Cooperación Internacional para el Desarrollo/Dirección General de Cooperación Técnica y Científica.

Artificial seawater medium recipe

Artificial seawater was prepared as described in Scheller et al.⁴⁷ with minor modifications. Briefly, 1 l of artificial seawater (ASW) medium contained 46.6 mM MgCl₂, 9.2 mM CaCl₂, 485 mM NaCl, 7 mM KCl, 20 mM Na₂SO₄, 1 mM K₂HPO₄, 2 mM NH₄Cl, 1 ml of 1,000× trace element solution, 1 ml of 1,000× vitamin solution and 0.5 mg of resazurin and was buffered by 25 mM HEPES buffer adjusted to pH 7.5. One litre of 1,000× trace element solution contained 50 mM nitrilotriacetic acid, 5 mM FeCl₃, 2.5 mM MnCl₂, 1.3 mM CoCl₂, 1.5 mM ZnCl₂, 0.32 mM H₃BO₃, 0.38 mM NiCl₂, 0.03 mM Na₂SeO₃, 0.01 mM CuCl₂, 0.21 mM Na₂MoO₄ and 0.02 mM Na₂WO₄. One litre of 1,000× vitamin solution contained 82 μM d-biotin, 45 μM folic acid, 490 μM pyridoxine, 150 μM thiamine, 410 μM nicotinic acid, 210 μM pantothenic acid, 310 μM para-aminobenzoic acid, 240 μM lipoic acid, 14 μM choline chloride and 7.4 μM vitamin B₁₂.

Enrichment cultivation

Rock no. NA091-R045 was anaerobically fragmented; then, approximately 5 g wet weight was crushed using a sterile agate mortar and pestle on 8 November 2018 and immediately immersed in anaerobic ASW medium in 25–125 ml of butyl rubber-stoppered serum bottles supplemented with different carbon/energy sources, including lactate, H₂/CO₂, hexane and decane and incubated in the dark at 40 °C (Extended Data Fig. 1a). The headspace for all cultures was flushed and overpressurized with N₂ gas (2 atm). For the H₂-containing cultures, the N₂ gas headspace was replaced with H₂/CO₂ at an 80:20 mixture by flushing for 1 min and subsequent equilibration at 2 atm. After 33 d of incubation, the lactate-fed first-generation culture produced 5 mM sulphide, indicating active sulphate reduction. This enrichment was mixed by gentle shaking and diluted 1:100 vol/vol into fresh anaerobic ASW medium containing the same suite of carbon/energy sources as described above (Extended Data Fig. 1b). A transfer using the liquid fraction-lacking rock particles from the primary lactate enrichment was also included to enrich for members of the planktonic community alone with lactate as the carbon and energy source. This enrichment was later found to be devoid of the AAG (Heimdall) phylotype. Third- and fourth-generation cultures were set up in the following months through 1:100 dilution (Extended Data Fig. 1b). Further details of microbial community development in these enrichments are provided in Supplementary Note 1 and Supplementary Tables 1–3.

R008 was prepared as above except using 2 atm of methane in the headspace as the sole carbon source and electron donor. The culture was passaged twice using a 1:100 dilution under the same culturing conditions; the cell fraction was collected by centrifugation after a total of 22 months for metagenomic sequencing (described below).

For sediment enrichment cultivation, the top 3 cm section of the sediment core was mixed with anaerobic ASW at a 1:4 vol/vol ratio; a total of 60 ml volume each was dispensed into seven 125 ml glass serum bottles sealed with butyl rubber stoppers. The headspace was replaced by ethane (2 atm) in 2 bottles (Supplementary Table 5), while the headspace in 1 bottle was replaced by 100% N₂ gas (2 atm). The cultures were incubated at 37 °C in the dark. Further details on microbial community development are provided in Supplementary Note 1 and Supplementary Table 4.

Mineralogical analyses

The mineralogical composition of rocks NA091-R045 and R008 was characterized on a PANalytical X’Pert Pro X-Ray diffractometer. A dried rock aliquot was finely powdered using a clean agate mortar and pestle and scanned from 3 to 75° (2θ angle) at a 0.0167° step size. Mineral identification was performed with the X’Pert HighScore software v4.1 using the search and march algorithm.

DNA extraction

Combined cells with rock or sediment substrate were pelleted through centrifugation at 13,000 r.p.m. for 3 min. For amplicon sequencing, unless specified in Supplementary Table 6, DNA was extracted using the Qiagen DNeasy PowerSoil kit (catalogue no. 47014) according to the manufacturer’s instructions as described previously⁴⁸ with a minor modification, where mechanical shearing was carried out using the MP Biomedicals FastPrep-24 system (catalogue no. 116004500) at level 5.5 for 45 s. For genomic sequencing, incubated rock and sediment cultures were extracted using multiple approaches, including the Qiagen DNeasy PowerSoil kit, ZymoBIOMICS 96 MagBead DNA Kit (catalogue no. D4302; Zymo Research Corporation), Quick-DNA 96 Kit (catalogue no. D3010; Zymo Research Corporation), ZymoBIOMICS DNA Microprep Kit (catalogue no. D4301; Zymo Research Corporation) and a standard phenol/chloroform-based protocol. The list of samples and their extraction methods are provided in Supplementary Table 6.

16S rRNA gene amplicon sequencing

For amplicon (iTAG) sequencing of 16S rRNA genes, extracted DNA was amplified using primer pair 515f/806r GTGCCAGCMGCCGCGGTAA/ GGACTACHVGGGTWTCTAAT, barcoded and sequenced at Laragen using the Illumina MiSeq platform and analysed using Qiime v.1.8.0 (ref. ⁴⁹) as described previously⁴⁸. Taxonomic assignment was based on the SILVA 138 database (https://www.arb-silva.de)⁵⁰.

Full-length 16S archaeal rRNA gene sequences were amplified using the archaeal primer pair SSU1Arf/SSU1492Rngs TCCGGTTGATCCYGCBRG/ CGGNTACCTTGTKACGAC as described by Bahram et al.⁵¹, multiplexed as instructed by PacBio and sequenced using the PacBio Sequel II at the Brigham Young University DNA Sequencing Center and then analysed using the DADA2 package v1.9.1 in R v3.6.0 as described in Callahan et al.⁵² using the SILVA 138 database for taxonomic classification. Note that in the SILVA 138 database, all Asgard archaea clades are classified under Asgardarchaeota.

Metagenomic sequencing

A total of 11 metagenomic sequencing runs were performed using the Illumina and Oxford Nanopore platforms, with details listed in Supplementary Table 6. For Illumina short-read sequencing, libraries were constructed using the NEBNext Ultra and Nextera Flex Library kits as specified in the Supplementary Table 6. Sequencing was carried out using a HiSeq 2500 system (single-end, 100 bp) at the Caltech Genetics and Genomics Laboratory and HiSeq 4000 system at Novogene (paired-end, 150 bp). Only paired-end data were used for assembly, while all data were used for error correction. Due to the low DNA quantity obtained from the sediment incubation that yielded Ca. H. aukensis, we used multiple displacement amplification with the QIAGEN REPLI g Midi Kit before library preparation for Nanopore sequencing. Oxford Nanopore sequencing libraries were constructed using the PCR Barcoding Kit (catalogue no. SQK-PBK004) and were sequenced on MinION flow cells FLO-MIN106. Base calling was performed with the ONT Guppy software v.3.4.5.

Genome assembly, error correction and read coverage mapping

Two different approaches were used to assemble contiguous genomes from metagenomes. For species of interest, if Nanopore sequencing yielded high read coverage and read lengths N50 > 2 kb, we obtained more contiguous genomes through de novo assembly purely based on Nanopore reads. If Nanopore sequencing did not yield a high number of reads or exhibited low read lengths, we obtained more contiguous genomes through de novo assembly first based on Illumina reads and then joined using Nanopore reads.

For Ca. H. endolithica, Nanopore sequencing data were assembled de novo using Canu¹⁷ v.2.1, which yielded a 30 Mbp assembly, including a 3.4 Mbp contig. The approximate 40 kilobase (kb) regions at two ends of an approximate 3.4 Mbp contig were repetitive. This repeated region was deleted at one end and the two ends were joined to result in a circular genome. The resulting genome was mapped using BamM (http://ecogenomics.github.io/BamM/, based on Burrows–Wheeler Aligner⁵³ mapping) with 150 bp Illumina paired-end reads (88× coverage on average) and 100 bp single-end reads (20× coverage). Mapped reads were then used for error correction through pilon⁵⁴ v.1.22. To account for the reduced mapping at the edges (approximate 50 bp region), the two ends of the genomic sequence were joined, read-mapped and error-corrected again using the same methods. After the genome was annotated, it was rotated such that the genomic sequence ended with tRNA (GlyCCC), which was the integration site of the putative provirus HeimV1. All sequencing reads derived from incubations of the same rock were mapped onto the final genome using BamM, which was then used for coverage calculation through bedtools (https://bedtools.readthedocs.io/en/latest/).

For Ca. H. aukensis, Illumina PE150 bp sequencing data were assembled using SPAdes¹⁸ v.3.14.1 with the ‘-meta’ option and k-mers 21,33,55,77,99. The assembly was then scaffolded using Nanopore reads through two iterations of LRScaf⁵⁵ v.1.1.10. The Ca. H. aukensis genome was joined after trimming the identical sequences at the two ends. The end-joining region was verified through PCR amplification and Sanger sequencing using the primer pair CGCTTTCTTCAAACAATATTTCTGGTG/CTTACTTTCTCTCGGTCCATTTTTCAC. Finally, a 1 kbp stretch of unresolved genomic sequence at an approximate 2.9 Mbp position was resequenced through PCR amplification and Sanger sequencing using the primers GAGTTTTTTCAATCTTATAATGCCAAACTAAAAAATAG (forward), CAGTCAGATTTGACACAATTTTGGTC (reverse) and GCTGGACTCAACCTATAACTAATAGT (reverse). The final assembly was read-mapped, error-corrected through pilon v.1.24 using 346× coverage. It was rotated as described above to place the tRNA gene GlyCCC at the end.

The metagenome containing the Lokiarchaeote Ca. H. repetitus FW102 was assembled using Canu v.2.1, as described for the Ca. H. endolithica genome, and then binned using metabat2 v.2.15 (ref. ⁵⁶) with default parameters. The bin was then used to recruit long reads using minimap2 v.2.17 and reassembled and binned again. We then used LRScaf to scaffold the contigs and used ten iterations of pilon v.1.24 to achieve error correction and resolve ambiguous bases.

The Thorarcheote FW25 MAG was assembled using the hybrid assembly of Illumina reads and Nanopore reads using SPAdes v.3.14.1 with k-mers 21,33,55,77,99, and then binned using metabat2 v.2.15 with default parameters. The MAG bin was then used to recruit reads through MIRAbait in the MIRA v.4 package (http://mira-assembler.sourceforge.net/docs/DefinitiveGuideToMIRA.html#chap_intro). These reads were then used for hybrid assembly with Nanopore long reads via SPAdes v.3.14.1 with k-mers 21,33,55,77,99. It was then binned again using metabat2 v.2.15 with default parameters to yield the final Thorarcheote FW25 MAG.

The metagenome containing Gerdarchaeote AC18 was assembled from Illumina reads using SPAdes v.3.14.1 with k-mers 21,33,55,77,99 and then binned using metabat2 v.2.15 with default parameters. The MAG bin was then used to recruit reads through MIRAbait in the MIRA v.4 package and then reassembled and binned using SPAdes and metabat2 to yield the final Gerdarchaeote AC18 bin.

Alignment fraction, ANI and AAI

ANI and alignment fraction values, independently calculated for rRNA, tRNA and coding gene sequences were obtained using ANIcalculator⁵⁷ 2014-127, v.1.0 (https://ani.jgi.doe.gov/html/download.php?). Note that Lokiarchaeote FW102 contains 2 copies of 16S rRNA genes at 99% identity with each other, and Thorarchaeote BC has a partial 16S rRNA gene. The alignment of 16S rRNA was carried out using SINA⁵⁸ v.1.2.11. The AAI values of translated proteomes were obtained with the enveomics package v1.8.0⁵⁹. The final output is shown in Supplementary Table 7.

Genome and mobilome annotations

Gene calling was done using a combination of Prodigal v.2.6.3 and Glimmer v.3.0.2 using translation code 11 within the RASTtk⁶⁰ pipeline, now under the PATRIC package v1.032⁶¹. Translated coding sequences were annotated and domain-assigned using eggNOG mapper³⁹ v.2. The tRNA, 16S rRNA and 23S rRNA genes were identified using RNAmmer⁶² v.1.2 embedded in RASTtk. Thus far, 5S rRNA gene sequences could not be predicted through the existing HMM using various approaches. Long, non-tandem repeats were identified using RASTtk with the default cut-off of 95% identity and 100 bp. Tandem repeat sequences were identified using RASTtk, Prokka v1.14.6 and CRISPRCasTyper 1.1.4⁶³. Prokka and CRISPRCasTyper both employ MinCED (https://github.com/ctSkennerton/minced) to identify repeats and detect intragenic tandem repeats, which were manually removed from the CRISPR–Cas analyses. The Cas genes were annotated using CRISRCasTyper.

All identified Heimdallarchaeum mobilomes were further analysed using PSI-BLAST 1.10.0⁶⁴, CDD search v3.19⁶⁵ and PhANNs webserver (version March 2021)³⁷.

Genome evaluation and HMM construction

Marker coverage was carried out using a two-step process. First, we used the automated marker analyses via CheckM⁶⁶ v.1.1.3 with the lineage_wf option and the default HMM E value cut-off, which included the 149 standard archaeal single-copy marker set. Next, each of the missing markers was examined with hmmer⁶⁷ v.3.3.2 using the hmmsearch option with manual inspection of alignment regions and bitscores. This rescued markers unidentified through the default cut-offs by CheckM as well as divergent variants that most likely functionally replace the genuinely missing marker. The detailed description of markers missed by CheckM can be found in Supplementary Note 2 and the final evaluation of marker presence is displayed in Extended Data Fig. 4a and Supplementary Table 15. Next, we constructed an updated HMM set to replace the CheckM set by (1) updating all HMM to the most recent versions, (2) removing the six commonly missing or duplicated markers shown in Extended Data Fig. 4a from the list and (3) overcoming the pitfall of existing HMMs constructed using only a few sequences acquired from Euryarchaeota and Crenarchaeota. We manually constructed Asgard-specific versions based on the 282 Asgard archaea genomes. The HMMs constructed in this study are PF00832.ASG, PF00861.ASG, PF01194.ASG, PF01287.ASG, PF01667.ASG, PF03874.ASG, PF03876.ASG, PF13656.ASG, TIGR00270.ASG, TIGR00336.ASG, TIGR00442.ASG, TIGR02338.ASG and TIGR03677.ASG. The updated HMM file has been provided as a supplementary data file. The updated HMM was used to evaluate the 282 genomes reported in this study and in the literature^{3,6,7,8,9,10,11,12,16,23,26,68,69,70,71,72,73,74,75,76,77} through (1) CheckM, which uses Prodigal for gene calling, and (2) the more up to date HMMER3.2.2 on our gene calls described above. The latter generally produced slightly higher completeness and redundancy values (Supplementary Tables 8 and 9). For the expanded set of Asgard archaea genomes used for the phylogenomic analyses shown in Extended Data Fig. 4b, we applied the following filtering criteria: ≤100 contigs, >96% marker completeness and <8% marker redundancy. We also took the evenness of taxonomic sampling into account. The set is also shown in the Asgard archaea tree in Extended Data Fig. 2. The importance of genome quality evaluation is highlighted in Extended Data Fig. 9.

Phylogenomics

A phylogenomic tree of Asgard archaea was constructed with IQ-TREE v.2.1.2 (ref. ⁷⁸) using a partitioned analysis⁷⁹ with model selection using ModelFinder⁸⁰ and 1,000 ultrafast bootstrap replicates using UFBoot2⁸¹ on a concatenated alignment generated from MUSCLE⁸² v.3.8.1551 alignments of 76 archaeal marker genes identified in the genomes using HMMs included with anvi’o v.6.2 (ref. ⁸³). The phylogenomic tree was visualized using iTOL⁸⁴ and rooted with the TACK superphylum.

The Archaea–Eukaryota phylogenomic tree, including the Asgard genomes discussed in this study, was constructed based on the 56 Archaea–Eukaryota ribosomal proteins used by Zaremba-Niedzwiedzka et al.³ using reference sequences from the corresponding Dryad repository. In addition to the Asgard archaea identified in this study, additional sequences of the most complete genomes representing different lineages of the TACK superphylum were added to the dataset. Sequences of 56 archaeal COGs obtained from the Dryad repository were used as reference databases to retrieve homologous sequences from target genomes using BLAST⁸⁵ v.2.10.1. Each set of archaeal COG sequences were aligned using MUSCLE v.3.8.1551 and inspected and trimmed manually. Manually trimmed alignments were then further trimmed using BMGE⁸⁶, recoded to four-state SR4 using a custom script (https://github.com/dspeth/bioinfo_scripts/tree/master/phylogeny) and finally concatenated and converted to PHYLIP format using catfasta2phyml v1.1.0 (https://github.com/nylander/catfasta2phyml). The final concatenated, recoded alignment was used to calculate phylogenies using IQ-TREE v.2.1.2 (ref. ⁷⁸) using a C60 model adapted for SR4 recoded data by Zaremba-Niedzwiedzka et al.³ and 1,000 ultrafast bootstrap replicates using UFBoot. The phylogenomic tree was visualized using iTOL⁸⁴ and rooted with Euryarchaeota as the outgroup. The genomes and conserved genes used for the phylogenomic analyses are listed in Supplementary Tables 16 and 17.

Discovery of Heimdallarchaeum-targeting mobile elements through CRISPR spacer targeting

Repeat sequences from the Heimdallarchaeum CRISPR arrays were used to blast against the CRISPR repeats we recruited, using CRISPRCasTyper, from multiple databases with a 95% alignment and 95% identity cut-off. The databases include GTDB v.95, our in-house assemblies from the Pescadero Basin (this study, F.W. et al. manuscript in preparation and Speth et al.⁸⁷; Supplementary Table 10, 22 sets) and published assemblies from the Guaymas Basin²² (Supplementary Table 11, 16 sets).

While no homologous CRISPR repeats were found in the entire GTDB database, we found several CRISPR arrays from the Guaymas and Pescadero assemblies with identical repeats to the Heimdallarchaeum CRISPR repeats found in this study, demonstrating the specificity of the CRISPR discovery approach. Since both the Guaymas and Pescadero CRISPR sets comprise assembled sequences that were not de-replicated, the entire CRISPR spacer collection from the recruited CRISPR arrays was de-replicated using a 100% identity cut-off. Notably, no spacer overlap was found between the Guaymas and Pescadero CRISPR sets. In total, the final de-replicated, putative Heimdallarchaeota spacerome in this study consisted of 455 from the 2 original Heimdallarchaeum genomes, 578 from the Pescadero Basin assemblies and 532 from the Guaymas Basin assemblies. We note that the above set likely only represents a fraction of the true Heimdallarchaeum spacerome given that the original CRISPR repeats came from only two species.

Next, to identify potential mobile genetic elements (MGEs) targeted by the Heimdallarchaeum spacerome, we used BLAST to search for spacer matches in the above three assembly datasets, the two Ca. Heimdallarchaeum genomes and various published virus databases/datasets, which are the RefSeq virus database r98⁸⁸, IMG/VR v.3 (ref. ³⁶) and the huge phage⁸⁹, giant virus⁹⁰ and Loki’s castle virus datasets⁹¹. To avoid self-matches, the CRISPR arrays containing the spacers were replaced by Ns in their respective assemblies. For the homology cut-off, we used 95% alignment and 95% identity as described previously⁹². Strikingly, no spacer matches were found from any of the viral datasets or GTDB genome database. The spacer matches to the Guaymas and Pescadero Basins metagenomes are listed in Supplementary Tables 12 and 13.

We then de-replicated the putative MGEs/viruses identified above using BLAST, removed contigs smaller than 2.8 kb and manually examined the target gene neighbourhoods and potential self-match due to CRISPR arrays that evaded detection and blocking. These contigs, together with the ones described in Fig. 2, ultimately constitute the 56 putative Heimdallarchaeota MGEs listed in Supplementary Table 14.

Resolution of the genomic insertion and circularization of HeimV1

To capture the two different states during the life cycles of HeimV1 (Fig. 3f), we used three primer sets to amplify the sequences around the two insertion sites of HeimV1 and confirmed them using gel electrophoresis and Sanger sequencing. Set 1 amplified the region between upstream tRNA GlyCCC in the Ca. H. endolithica genome and the first coding gene of the HeimV1 (GTGAATCAATAGCTTTCACTTATAATGAG/GTGATTGTATTAAGTCTGCAACATATTC). Set 2 amplified the regions containing the transposase in the Ca. H. endolithica genome and the integrase in HeimV1 (CTTAGATATGTACGTGATAGGATCATATG/CTTCTTTCCTCTTTTTGTCTCTGCTTC). Set 3 amplified the two ends of the circular HeimV1 (CTTAGATATGTACGTGATAGGATCATATG/GTGATTGTATTAAGTCTGCAACATATTC). Each primer set amplified approximately 2 kb of target regions with set 1 and set 2 indicating the presence of the integrated state of HeimV1 and set 3 indicating the circular state.

Protein clustering of integrases and transposases

Protein sequences showing integrase and transposase domains, identified using eggNOG mapper from the 8 Asgard archaea MAGs, were pooled and clustered at 90% sequence identity using cd-hit⁹³ v.4.8.1. The resulting representative sequences were used for two sequential rounds of homology searches using DIAMOND⁹⁴ v.2.0.6 against the protein sequences obtained from the GTDB v.95 genome database. A cut-off of >20% sequence identity, >85% sequence alignment and <15% length difference was used for the first round; a cut-off of >30% sequence identity, >90% sequence alignment and <10% length difference was used for the second round. The resulting protein sequences were combined with the Asgard archaea integrases/transposases originally pooled and were clustered together using 95% sequence identity with cd-hit. The resulting 96,367 representative sequences were clustered using ASM-Clust⁹⁵ with a sequence subset size of 5,000 to generate the alignment score matrix, using default values for the other settings.

Taxonomic profiling through protein orthologues

The taxonomic clustering and COG analyses were carried out using eggNOG mapper³⁹ with the eggNOG orthologue database v.5.0. The protein counts belonging to each taxonomic group (Archaea/Bacteria/Eukaryota/Unassigned) were extracted from the output and fitted linearly with MATLAB R2018a using the polyfit function and yielding Fig. 5c.

Since different proteins evolved at different rates, we combined the use of a single cut-off-based protein clustering approach with functional domain-based manual refinement to capture and compare ERPs across the lineages selected in this study. First, we used BLAST v.2.2.26 to evaluate the sequence homologies within the entire proteome of the eight MAGs in this study. We then used an 80% alignment length (relative to the length of the shorter protein sequence) and 0.24 alignment × identity cut-off to yield candidate protein clusters, which we then cross-referenced with the Eukaryota group in the eggNOG classification to generate 227 candidate ERP clusters. Finally, we manually examined the relatedness within and between each ERP cluster through batch searches using the conserved domain database⁶⁵. This led to the recombination of the candidate ERP clusters into the functionally distinct 135 ERP families. To align with previous work^3,6, all small GTPases were classified as one single ERP family, constituting 291 proteins from the 8 representative Asgard archaea MAGs.

Maximum-likelihood analyses of proteins encoded by HeimV1 and HeimV2

Homology search for all peptide sequences of HeimV1 through DIAMOND⁹⁴ v.2.0.6 was carried out against the GTDB v.95, Pescadero Basin and Guaymas assemblies, RefSeq virus database⁸⁸, IMG/VR³⁶ and huge phage⁸⁹, giant virus⁹⁰ and Loki’s castle virus datasets⁹¹. The search outputs were pre-clustered with a 70% identity cut-off using cd-hit v.4.8.1 (ref. ⁹³). The representative sequences were aligned using the MAFFT v.7.475 (ref. ⁹⁶) option linsi and trimmed with trimAl v.1.4.1 (ref. ⁹⁷), option gappyout. Maximum-likelihood analyses were carried out with IQ-TREE v.2.1.12 (ref. ⁷⁸) using the LG4X model and ultrafast bootstrap with 2,000 replicates. The phylogenetic tree was visualized and prepared using iTOL⁸⁴.

Reporting Summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.

Source: Ecology - nature.com