Standardized multi-omics of Earth’s microbiomes reveals microbial and metabolite diversity
Dataset description

Sample collection

Our research complies with all relevant ethical regulations following policies at the University of California, San Diego (UCSD). Animal samples that were sequenced were not collected at UCSD and are not for vertebrate animals research at UCSD following the UCSD Institutional Animal Care and Use Committee (IACUC). Samples were contributed by 34 principal investigators of the Earth Microbiome Project 500 (EMP500) Consortium and are samples from studies at their respective institutions (Supplementary Table 1). Relevant permits and ethics information for each parent study are described in the ‘Permits for sample collection’ section below. Samples were contributed as distinct sets referred to here as studies, where each study represented a single environment (for example, terrestrial plant detritus). To achieve more even coverage across microbial environments, we devised an ontology of sample types (microbial environments), the EMP Ontology (EMPO) (http://earthmicrobiome.org/protocols-and-standards/empo/)1, and selected samples to fill out EMPO categories as broadly as possible. EMPO recognizes strong gradients structuring microbial communities globally, and thus classifies microbial environments (level 4) on the basis of host association (level 1), salinity (level 2), host kingdom (if host-associated) or phase (if free-living) (level 3) (Fig. 1a). As we anticipated previously1, we have updated the number of levels as well as states therein for EMPO (Fig. 1b) on the basis of an important additional salinity gradient observed among host-associated samples when considering the previously unreported shotgun metagenomic and metabolomic data generated here (Fig. 3c,d). We note that although we were able to acquire samples for all EMPO categories, some categories are represented by a single study.

Samples were collected following the Earth Microbiome Project sample submission guide50.
Briefly, samples were collected fresh, split into 10 aliquots and then frozen, or alternatively collected and frozen, and subsequently split into 10 aliquots with minimal perturbation. Aliquot size was sufficient to yield 10–100 ng genomic DNA (approximately 10^7–10^8 cells). To leave samples amenable to chemical characterization (metabolomics), buffers or solutions for sample preservation (for example, RNAlater) were avoided. Ethanol (50–95%) was allowed as it is compatible with LC–MS/MS although it should also be avoided if possible.

Sampling guidance was tailored for four general sample types: bulk unaltered (for example, soil, sediment, faeces), bulk fractionated (for example, sponges, corals, turbid water), swabs (for example, biofilms) and filters. Bulk unaltered samples were split fresh (or frozen), sampled into 10 pre-labelled 2 ml screw-cap bead beater tubes (Sarstedt, 72.694.005 or similar), ideally with at least 200 mg biomass, and flash frozen in liquid nitrogen (if possible). Bulk fractionated samples were fractionated as appropriate for the sample type, split into 10 pre-labelled 2 ml screw-cap bead beater tubes, ideally with at least 200 mg biomass, and flash frozen in liquid nitrogen (if possible). Swabs were collected as 10 replicate swabs using 5 BD SWUBE dual cotton swabs with wooden stick and screw cap (281130). Filters were collected as 10 replicate filters (47 mm diameter, 0.2 µm pore size, polyethersulfone (preferred) or hydrophilic PTFE filters), placed in pre-labelled 2 ml screw-cap bead beater tubes, and flash frozen in liquid nitrogen (if possible). All sample types were stored at –80 °C if possible, otherwise –20 °C.

To track the provenance of sample aliquots, we employed a QR coding scheme. Labels were affixed to aliquot tubes before shipping when possible. QR codes had the format ‘name.99.s003.a05’, where ‘name’ is the PI name, ‘99’ is the study ID, ‘s003’ is the sample number and ‘a05’ is the aliquot number.
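The label format described above lends itself to a small parser for inventory scripts. The following Python sketch is illustrative only: the regular expression is inferred from the single example ‘name.99.s003.a05’, so the allowed characters and digit counts are assumptions, and the names AliquotLabel and parse_aliquot_label are hypothetical rather than part of any EMP tooling.

```python
import re
from typing import NamedTuple


class AliquotLabel(NamedTuple):
    """Fields encoded in one aliquot QR label."""
    pi_name: str
    study_id: int
    sample_number: int
    aliquot_number: int


# Assumption: four dot-separated fields, with 's' and 'a' prefixes on the
# sample and aliquot numbers, as in the example 'name.99.s003.a05'.
_LABEL_RE = re.compile(
    r"^(?P<name>[^.]+)\.(?P<study>\d+)\.s(?P<sample>\d+)\.a(?P<aliquot>\d+)$"
)


def parse_aliquot_label(label: str) -> AliquotLabel:
    """Split a scanned label into PI name, study ID, sample and aliquot."""
    match = _LABEL_RE.match(label.strip())
    if match is None:
        raise ValueError(f"unrecognized aliquot label: {label!r}")
    return AliquotLabel(
        pi_name=match["name"],
        study_id=int(match["study"]),
        sample_number=int(match["sample"]),
        aliquot_number=int(match["aliquot"]),
    )


parsed = parse_aliquot_label("name.99.s003.a05")
print(parsed)
```

Rejecting labels that do not match, rather than guessing, keeps mis-scanned codes out of the inventory spreadsheet.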
QR codes (version 2, 25 pixels × 25 pixels) were printed on 1.125″ × 0.75″ rectangular and 0.437″ circular cap Cryogenic Direct Thermal labels (GA International, DFP-70) using a Zebra model GK420d printer and ZebraDesigner Pro 3 software for Windows. After receipt but before aliquots were stored in freezers, QR codes were scanned into a sample inventory spreadsheet using a QR scanner.

Sample metadata

Environmental metadata were collected for all samples on the basis of the EMP Metadata Guide, which combines guidance from the Genomic Standards Consortium MIxS (Minimum Information about any Sequence) standard74 and the Qiita Database (https://qiita.ucsd.edu)51. The metadata guide provides templates and instructions for each MIxS environmental package (that is, sample type). Relevant information describing each PI submission, or study, was organized into a separate study metadata file (Supplementary Table 1).

Metabolomics

LC–MS/MS sample extraction and preparation

To profile metabolites among all samples, we used LC–MS/MS, a versatile method that detects tens of thousands of metabolites in biological samples. All solvents and reactants used were LC–MS grade. To maximize the biomass extracted from each sample, the samples were prepared depending on their sampling method (for example, bulk, swabs, filter and controls). The bulk samples were transferred into a microcentrifuge tube (polypropylene, PP) and dissolved in 7:3 MeOH:H2O using a volume varying from 600 µl to 1.5 ml, depending on the amounts of sample available, and homogenized in a tissue lyser (QIAGEN) at 25 Hz for 5 min. Then, the tubes were centrifuged at 2,000 × g for 15 min, and the supernatant was collected in a 96-well plate (PP). For swabs, the swabs were transferred into a 96-well plate (PP) and dissolved in 1.0 ml of 9:1 ethanol:H2O. The prepared plates were sonicated for 30 min, and after 12 h at 4 °C, the swabs were removed from the wells.
The filter samples were dissolved in 1.5 ml of 7:3 MeOH:H2O in microcentrifuge tubes (PP) and sonicated for 30 min. After 12 h at 4 °C, the filters were removed from the tubes. The tubes were centrifuged at 2,000 × g for 15 min, and the supernatants were transferred to 96-well plates (PP). The process control samples (bags, filters and tubes) were prepared by adding 3.0 ml of 2:8 MeOH:H2O and recovering 1.5 ml after 2 min. After the extraction process, all sample plates were dried with a vacuum concentrator and subjected to solid phase extraction (SPE). SPE was used to remove salts that could reduce ionization efficiency during mass spectrometry analysis, as well as the most polar and non-polar compounds (for example, waxes) that cannot be analysed efficiently by reversed-phase chromatography. The protocol was as follows: the samples (in plates) were dissolved in 300 µl of 7:3 MeOH:H2O and put in an ultrasound bath for 20 min. SPE was performed with SPE plates (Oasis HLB, hydrophilic-lipophilic-balance, 30 mg with particle sizes of 30 µm). The SPE beds were activated by priming them with 100% MeOH, and equilibrated with 100% H2O. The samples were loaded on the SPE beds, and 100% H2O was used as wash solvent (600 µl). The eluted washing solution was discarded, as it contains salts and very polar metabolites that subsequent metabolomics analysis is not designed for. The sample elution was carried out sequentially with 7:3 MeOH:H2O (600 µl) and 100% MeOH (600 µl). The obtained plates were dried with a vacuum concentrator. For mass spectrometry analysis, the samples were resuspended in 130 µl of 7:3 MeOH:H2O containing 0.2 µM of amitriptyline as an internal standard. The plates were centrifuged at 30 × g for 15 min at 4 °C. 
Samples (100 µl) were transferred into new 96-well plates (PP) for mass spectrometry analysis.

LC–MS/MS sample analysis

The extracted samples were analysed by ultra-high performance liquid chromatography (UHPLC, Vanquish, Thermo Fisher) coupled to a quadrupole-Orbitrap mass spectrometer (Q Exactive, Thermo Fisher) operated in data-dependent acquisition mode (LC–MS/MS in DDA mode). Chromatographic separation was performed using a Kinetex C18 1.7 µm (Phenomenex), 100 Å pore size, 2.1 mm (internal diameter) × 50 mm (length) column with a C18 guard cartridge (Phenomenex). The column was maintained at 40 °C. The mobile phase was composed of a mixture of (A) water with 0.1% formic acid (v/v) and (B) acetonitrile with 0.1% formic acid. Chromatographic elution method was set as follows: 0.00–1.00 min, isocratic 5% B; 1.00–9.00 min, gradient from 5% to 100% B; 9.00–11.00 min, isocratic 100% B; followed by equilibration 11.00–11.50 min, gradient from 100% to 5% B; 11.50–12.50 min, isocratic 5% B. The flow rate was set to 0.5 ml min−1.

The UHPLC was interfaced to the orbitrap using a heated electrospray ionization source with the following parameters: ionization mode, positive; spray voltage, +3,496.2 V; heater temperature, 363.90 °C; capillary temperature, 377.50 °C; S-lens RF, 60 arbitrary units (a.u.); sheath gas flow rate, 60.19 a.u.; and auxiliary gas flow rate, 20.00 a.u. The MS1 scans were acquired at a resolution (at m/z 200) of 35,000 in the m/z 100–1500 range, and the fragmentation spectra (MS2) scans at a resolution of 17,500 from 0 to 12.5 min. The automatic gain control target and maximum injection time were set at 1.0 × 10^6 and 160 ms for MS1 scans, and set at 5.0 × 10^5 and 220 ms for MS2 scans, respectively. Up to three MS2 scans in data-dependent mode (Top 3) were acquired for the most abundant ions per MS1 scans using the apex trigger mode (4–15 s), dynamic exclusion (11 s) and automatic isotope exclusion. The starting value for MS2 was m/z 50.
Higher-energy collision induced dissociation (HCD) was performed with a normalized collision energy of 20, 30 and 40 eV in stepped mode. The major background ions originating from the SPE were excluded manually from the MS2 acquisition. Analyses were randomized within each plate, and blank samples were analysed every 20 injections. A quality control mix sample assembled from 20 random samples across the sample types was injected at the beginning, the middle and the end of each plate sequence. The chromatographic shift observed throughout the batch was estimated as less than 2 s, and the relative standard deviation of ion intensity was 15% per replicate.

LC–MS/MS data processing

The mass spectrometry data were centroided and converted from the proprietary format (.raw) to the m/z extensible markup language format (.mzML) using ProteoWizard (ver. 3.0.19, MSConvert tool)75. The mzML files were then processed with MZmine 2 toolbox76 using the ion-identity networking modules77 that allow advanced detection for adduct/isotopologue annotations. The MZmine processing was performed on an Ubuntu 18.04 LTS 64-bit workstation (Intel Xeon E5-2637, 3.5 GHz, 8 cores, 64 GB of RAM) and took ~3 d. The MZmine project, the MZmine batch file (.XML format) and results files (.MGF and .CSV) are available in the MassIVE dataset MSV000083475. The MZmine batch file contains all the parameters used during the processing. In brief, feature detection and deconvolution were performed with the ADAP chromatogram builder78 and local minimum search algorithm. The isotopologues were regrouped and the features (peaks) were aligned across samples. The aligned peak list was gap filled and only peaks with an associated fragmentation spectrum and occurring in a minimum of three files were conserved. Peak shape correlation analysis grouped peaks originating from the same molecule and annotated adduct/isotopologue with ion-identity networking77.
Finally, the feature quantification table results (.CSV) and spectral information (.MGF) were exported with the GNPS module for feature-based molecular networking analysis on GNPS79 and with SIRIUS export modules.

LC–MS/MS data annotation

The results files of MZmine (.MGF and .CSV files) were uploaded to GNPS (http://gnps.ucsd.edu)52 and analysed with the feature-based molecular networking workflow79. Spectral library matching was performed against public fragmentation spectra (MS2) spectral libraries on GNPS and the NIST17 library.

For the additional annotation of small peptides, we used the DEREPLICATOR tools available on GNPS80,81. We then used SIRIUS82 (v. 4.4.25, headless, Linux) to systematically annotate the MS2 spectra. Molecular formulae were computed with the SIRIUS module by matching the experimental and predicted isotopic patterns83, and from fragmentation trees analysis84 of MS2. Molecular formula prediction was refined with the ZODIAC module using Gibbs sampling85 on the fragmentation spectra (chimeric spectra or those with poor fragmentation were excluded). In silico structure annotation using structures from biological databases was done with CSI:FingerID86.
Systematic class annotations were obtained with CANOPUS41 and used the NPClassifier ontology87.

The parameters for SIRIUS tools were set as follows, for SIRIUS: molecular formula candidates retained, 80; molecular formula database, ALL; maximum precursor ion m/z computed, 750; profile, orbitrap; m/z maximum deviation, 10 ppm; ions annotated with MZmine were prioritized and other ions were considered (that is, [M+H3N+H]+, [M+H]+, [M+K]+, [M+Na]+, [M+H-H2O]+, [M+H-H4O2]+, [M+NH4]+); for ZODIAC: the features were split into 10 random subsets for lower computational burden and computed separately with the following parameters: threshold filter, 0.9; minimum local connections, 0; for CSI:FingerID: m/z maximum deviation, 10 ppm; and biological database, BIO.

To establish putative microbially related secondary metabolites, we collected annotations from spectral library matching and the DEREPLICATOR+ tools and queried them against the largest microbial metabolite reference databases (Natural Products Atlas88 and MIBiG89). Molecular networking79 was then used to propagate the annotation of microbially related secondary metabolites throughout all molecular families (that is, the network component).

LC–MS/MS data analysis

We combined the annotation results from the different tools described above to create a comprehensive metadata file describing each metabolite feature observed. Using that information, we generated a feature-table including only secondary metabolite features determined to be microbially related. We then excluded very low-intensity features introduced to certain samples during the gap-filling step described above. These features were identified on the basis of presence in negative controls that were universal to all sample types (that is, bulk, filter and swab) and by their relatively low per-sample intensity values. Finally, we excluded features present in positive controls for sampling devices specific to each sample type (that is, bulk, filter or swab).
The final feature-table included 618 samples and 6,588 putative microbially related secondary metabolite features that were used for subsequent analysis.

We used QIIME 2’s90 (v2020.6) ‘diversity’ plugin to quantify alpha-diversity (that is, feature richness) for each sample and ‘deicode’91 to quantify beta-diversity (that is, robust Aitchison distances, which are robust to both sparsity and compositionality in the data) between each pair of samples. We parameterized our robust Aitchison principal components analysis (RPCA)91 to exclude samples with fewer than 500 features and features present in fewer than 10% of samples. We used the ‘taxa’ plugin to quantify the relative abundance of microbially related secondary metabolite pathways and superclasses (that is, on the basis of NPClassifier) within each environment (that is, for each level of EMPO 4), and ‘songbird’ v1.0.492 to identify sets of microbially related secondary metabolites whose abundances were associated with certain environments. We parameterized our ‘songbird’ model as follows: epochs, 1,000,000; differential prior, 0.5; learning rate, 1.0 × 10^−5; summary interval, 2; batch size, 400; minimum sample count, 0; and training on 80% of samples at each level of EMPO 4 using ‘Animal distal gut (non-saline)’ as the reference environment. Environments with fewer than 10 samples were excluded to optimize model training (that is, ‘Animal corpus (non-saline)’, ‘Animal proximal gut (non-saline)’, ‘Surface (saline)’). The output from ‘songbird’ includes a rank value for each metabolite in every environment, which represents the log fold change for a given metabolite in a given environment92. We compared log fold changes for each metabolite from this run to those from (1) a replicate run using the same reference environment and (2) a run using a distinct reference environment: ‘Water (saline)’.
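The run-to-run comparison described above amounts to a rank correlation between two vectors of per-metabolite log fold changes. The following Python sketch shows one way to compute a Spearman coefficient using only the standard library; it is not the code used for the study, the function names are hypothetical, and the example values are invented purely for illustration.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the block while the next sorted value ties with this one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared_rank
        start = end + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


# Invented log fold changes for four metabolites in two hypothetical runs.
original_run = [2.1, -0.4, 0.9, -1.7]
replicate_run = [1.8, -0.2, 1.1, -2.0]
rho = spearman(original_run, replicate_run)
print(rho)
```

Because only ranks enter the calculation, the coefficient is insensitive to a constant offset between runs, which is useful when the reference environment changes and all log fold changes shift together.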
We found strong Spearman correlations in both cases (Supplementary Table 8), and therefore focused on results from the original run using ‘Animal distal gut (non-saline)’ as the reference environment, as it has previously been shown to be relatively unique among other habitats. In addition to summarizing the top 10 metabolites for each environment (Supplementary Table 3), we used the log fold change values in our multi-omics analyses described below.

We used the RPCA biplot and QIIME 2’s90 EMPeror93 to visualize differences in composition among samples, as well as the association with samples of the 25 most influential microbially related secondary metabolite features (that is, those with the largest magnitude across the first three principal component loadings). We tested for significant differences in metabolite composition across all levels of EMPO using PERMANOVA implemented with QIIME 2’s ‘diversity’ plugin90 and using our robust Aitchison distance matrix as input. In parallel, we used the differential abundance results from ‘songbird’ described above to identify specific microbially related secondary metabolite pathways and superclasses that varied strongly across environments. We then went back to our metabolite feature-table to visualize differences in the relative abundances of those pathways and superclasses within each environment by first selecting features and calculating log-ratios using ‘qurro’94, and then plotting using the ‘ggplot2’ package95 in R96 v4.0.0. We tested for significant differences in relative abundances across environments using Kruskal–Wallis tests implemented with the base ‘stats’ package in R96.

GC–MS sample extraction and preparation

To profile volatile small molecules among all samples in addition to what was captured with LC–MS/MS, we used gas chromatography coupled with mass spectrometry (GC–MS). All solvents and reactants were GC–MS grade.
Two protocols were used for sample extraction, one for the 105 soil samples and a second for the 356 faecal and sediment samples that were treated as biosafety level 2. The 105 soil samples were received at the Pacific Northwest National Laboratory and processed as follows. Each soil sample (1 g) was weighed into microcentrifuge tubes (Biopur Safe-Lock, 2.0 ml, Eppendorf). H2O (1 ml) and one scoop (~0.5 g) of a 1:1 (v/v) mixture of garnet (0.15 mm, Omni International) and stainless steel (0.9–2.0 mm blend, Next Advance) beads and one 3 mm stainless steel bead (Qiagen) were added to each tube. Samples were homogenized in a tissue lyser (Qiagen) for 3 min at 30 Hz and transferred into 15 ml polypropylene tubes (Olympus, Genesee Scientific). Ice-cold water (1 ml) was used to rinse the smaller tube and combined into the 15 ml tube. Chloroform:methanol (10 ml, 2:1 v/v) was added and samples were rotated at 4 °C for 10 min, followed by cooling at −70 °C for 10 min and centrifuging at 150 × g for 10 min to separate phases. The top and bottom layers were combined into 40 ml glass vials and dried using a vacuum concentrator. Chloroform:methanol (1 ml, 2:1) was added to each large glass vial and the sample was transferred into 1.5 ml tubes and centrifuged at 1,300 × g. The supernatant was transferred into glass vials and dried for derivatization.

The remaining 356 samples received from UCSD that included faecal and sediment samples were processed as follows: 100 µl of each sample was transferred to a 2 ml microcentrifuge tube using a scoop (MSP01, Next Advance). The final volume of the sample was brought to 1.5 ml, ensuring that the solvent ratio is 3:8:4 H2O:CHCl3:MeOH by adding the appropriate volumes of H2O, MeOH and CHCl3. After transfer, one 3 mm stainless steel bead (QIAGEN), 400 µl methanol and 300 µl H2O were added to each tube and the samples were vortexed for 30 s. Then, 800 µl chloroform was added and samples were vortexed for 30 s.
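The volumes quoted above follow directly from splitting 1.5 ml according to the 3:8:4 H2O:CHCl3:MeOH ratio. The short Python sketch below reproduces that arithmetic; the function name is hypothetical, and the calculation assumes that the three added solvents alone make up the 1.5 ml (that is, the volume of the sample itself is ignored).

```python
from fractions import Fraction


def solvent_volumes_ul(total_ul, ratio):
    """Split a total volume (in µl) across solvents according to a part ratio."""
    total_parts = sum(ratio.values())
    return {
        solvent: Fraction(total_ul) * parts / total_parts
        for solvent, parts in ratio.items()
    }


# 3:8:4 H2O:CHCl3:MeOH at a final volume of 1.5 ml (1,500 µl).
volumes = solvent_volumes_ul(1500, {"H2O": 3, "CHCl3": 8, "MeOH": 4})
for solvent, volume in volumes.items():
    print(f"{solvent}: {float(volume):.0f} ul")
# H2O: 300 ul
# CHCl3: 800 ul
# MeOH: 400 ul
```

Exact fractions are used so that ratios which do not divide the total evenly are not silently rounded before the final formatting step.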
After centrifuging at 150 × g for 10 min to separate phases, the top and bottom layers were combined in a vial and dried for derivatization.

The samples were derivatized for GC–MS analysis as follows: 20 µl of a methoxyamine solution in pyridine (30 mg ml−1) was added to the sample vial and vortexed for 30 s. A bath sonicator was used to ensure that the sample was completely dissolved. Samples were incubated at 37 °C for 1.5 h while shaking at 1,000 r.p.m. N-methyl-N-trimethylsilyltrifluoroacetamide (80 µl) and 1% trimethylchlorosilane solution was added and samples were vortexed for 10 s, followed by incubation at 37 °C for 30 min, with 1,000 r.p.m. shaking. The samples were then transferred into a vial with an insert.

An Agilent 7890A gas chromatograph coupled with a single quadrupole 5975C mass spectrometer (Agilent) and an HP-5MS column (30 m × 0.25 mm × 0.25 μm; Agilent) was used for untargeted analysis. Samples (1 μl) were injected in splitless mode, and the helium gas flow rate was determined by the Agilent Retention Time Locking function on the basis of analysis of deuterated myristic acid (Agilent). The injection port temperature was held at 250 °C throughout the analysis. The GC oven was held at 60 °C for 1 min after injection, and the temperature was then increased to 325 °C at a rate of 10 °C min−1, followed by a 10 min hold at 325 °C. Data were collected over the mass range of m/z 50–600. A mixture of FAMEs (C8–C28) was analysed each day with the samples for retention index alignment purposes during subsequent data analysis.

GC–MS data processing and annotation

The data were converted from vendor’s format to the .mzML format and processed using GNPS GC–MS data analysis workflow (https://gnps.ucsd.edu)97. The compounds were identified by matching experimental spectra to the public libraries available at GNPS, as well as NIST 17 and Wiley libraries.
The data are publicly available at the MassIVE depository (https://massive.ucsd.edu); dataset ID: MSV000083743. The GNPS deconvolution is available in GNPS (https://gnps.ucsd.edu/ProteoSAFe/status.jsp?task=d5c5135a59eb48779216615e8d5cb3ac), as is the library search (https://gnps.ucsd.edu/ProteoSAFe/status.jsp?task=59b20fc8381f4ee6b79d35034de81d86).

GC–MS data analysis

For multi-omics analyses including GC–MS data, we first removed noisy (that is, suspected background contaminants and artifacts) features by excluding those with balance scores […] 1.5–2 kb DNA fragments’ (Oxford Nanopore Technologies). The resulting product consists of uniquely tagged rRNA operon amplicons. The uniquely tagged rRNA operons were amplified in a second PCR, where the reaction (100 µl) contained 2 U Platinum SuperFi DNA Polymerase High Fidelity (Thermo Fisher) and a final concentration of 1X SuperFi buffer, 0.2 mM of each dNTP, and 500 nM of each forward and reverse synthetic primer targeting the tailed primers from above. The PCR cycling parameters consisted of an initial denaturation (3 min at 95 °C) and then 25–35 cycles of denaturation (15 s at 95 °C), annealing (30 s at 60 °C) and extension (6 min at 72 °C), followed by final extension (5 min at 72 °C). The PCR product was purified using the custom bead purification protocol above. Batches of 25 amplicon libraries were barcoded and sent for PacBio Sequel II library preparation and sequencing (Sequel II SMRT Cell 8M and 30 h collection time) at the DNA Sequencing Center at Brigham Young University. Circular consensus sequencing (CCS) reads were generated using CCS v.3.4.1 (https://github.com/PacificBiosciences/ccs) using default settings.
UMI consensus sequences were generated using the longread_umi pipeline (https://github.com/SorenKarst/longread_umi) with the following command:

    longread_umi pacbio_pipeline -d ccs_reads.fq -o out_dir -m 3500 -M 6000 -s 60 -e 60 -f CAAGCAGAAGACGGCATACGAGAT -F AGRGTTYGATYMTGGCTCAG -r AATGATACGGCGACCACCGAGATC -R CGACATCGAGGTGCCAAAC -U '0.75;1.5;2;0' -c 2

Amplicon data analysis

For multi-omics analyses including amplicon sequence data, we processed each dataset for comparison of beta-diversity. For all amplicon data except that for bacterial full-length rRNA amplicons, raw sequence data were converted from bcl to fastq, and then multiplexed files for each sequencing run were uploaded as separate preparations to Qiita (study: 13114).

For each 16S sequencing run, in Qiita, data were demultiplexed, trimmed to 150 bp and denoised using Deblur122 to generate a feature-table of sub-operational taxonomic units (sOTUs) per sample, using default parameters. We then exported feature-tables and denoised sequences from each sequencing run, used QIIME 2’s ‘feature-table’ plugin to merge feature-tables and denoised reads across sequencing runs, and placed all denoised reads into the GreenGenes 13_8 phylogeny123 via fragment insertion using QIIME 2’s90 SATé-Enabled Phylogenetic Placement (SEPP)124 plugin to produce a phylogeny for diversity analyses. To allow for phylogenetically informed diversity analyses, reads not placed during SEPP (that is, 513 sOTUs, 0.1% of all sOTUs) were removed from the merged feature-table. We then used QIIME 2’s ‘feature-table’ plugin to exclude singleton sOTUs and rarefy the data to 5,000 reads per sample. Rarefaction depths for all amplicon analyses were chosen to best normalize sampling effort per sample while maintaining ≥75% of samples representative of Earth’s environments, and also to maintain consistency with the analyses from EMP release 1.
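To make the singleton filtering and rarefaction steps concrete, the following Python sketch applies both to a toy feature-table. It illustrates the general technique (random subsampling of reads without replacement to a fixed depth) rather than the QIIME 2 implementation; the function names and toy counts are hypothetical, ‘singleton’ is assumed here to mean a feature with a total count of 1 across all samples, and a depth of 5 stands in for the 5,000 reads used in the study.

```python
import random


def drop_singletons(table):
    """Remove features whose total count across all samples is 1 (assumed definition)."""
    totals = {}
    for counts in table.values():
        for feature, n in counts.items():
            totals[feature] = totals.get(feature, 0) + n
    return {
        sample: {f: n for f, n in counts.items() if totals[f] > 1}
        for sample, counts in table.items()
    }


def rarefy_sample(counts, depth, rng):
    """Subsample `depth` reads without replacement; None if the sample is too shallow."""
    if sum(counts.values()) < depth:
        return None
    # Expand counts into one entry per read, then draw without replacement.
    reads = [feature for feature, n in counts.items() for _ in range(n)]
    subsampled = {}
    for feature in rng.sample(reads, depth):
        subsampled[feature] = subsampled.get(feature, 0) + 1
    return subsampled


def rarefy_table(table, depth, seed=0):
    """Rarefy every sample to `depth`, dropping samples with fewer reads."""
    rng = random.Random(seed)
    rarefied = {}
    for sample, counts in table.items():
        result = rarefy_sample(counts, depth, rng)
        if result is not None:
            rarefied[sample] = result
    return rarefied


# Toy table: sample -> {feature: read count}.
toy = {
    "s1": {"otu_a": 6, "otu_b": 3, "otu_c": 1},
    "s2": {"otu_a": 2, "otu_b": 1},
}
rarefied = rarefy_table(drop_singletons(toy), depth=5)
print(rarefied)
```

In this toy case ‘otu_c’ is removed as a singleton and ‘s2’ is dropped for having fewer than 5 reads, which mirrors the trade-off between depth and sample retention described above.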
We then used QIIME 2’s90 ‘diversity’ plugin to estimate alpha-diversity (that is, sOTU richness) and beta-diversity (that is, unweighted UniFrac distances). The final feature-table for 16S beta-diversity analysis included 681 samples and 93,260 features. We performed a comparative analysis of the data including and excluding the reads not placed during SEPP, and note that both alpha-diversity (that is, sOTU richness) and beta-diversity (that is, sample–sample RPCA distances) were highly correlated between datasets (Spearman r = 1.0) (Supplementary Fig. 5). We thus proceeded with the SEPP-filtered dataset and used phylogenetically informed diversity metrics where applicable.

For 18S data, we used QIIME 2’s90 ‘demux’ plugin’s ‘emp-paired’ method125,126 to first demultiplex each sequencing run, and then the ‘cutadapt’ plugin’s127 ‘trim-paired’ method to trim sequencing primers from reads. We then exported trimmed reads, concatenated R1 and R2 read files per sample, and denoised reads using Deblur’s122,128 ‘workflow’ with default settings, trimming reads to 90 bp, and taking the ‘all.biom’ and ‘all.seqs’ output, for each sequencing run. We then used QIIME 2’s ‘feature-table’ plugin to merge feature-tables and denoised sequences across sequencing runs, and then the ‘feature-classifier’ plugin’s ‘classify-sklearn’ method to classify taxonomy for each sOTU via pre-fitted machine-learning classifiers129 and the SILVA 138 reference database130. We then used QIIME 2’s90 ‘feature-table’ plugin to exclude reads assigned to bacteria and archaea, singleton sOTUs and samples with a total frequency of