The biogeographic differentiation of algal microbiomes in the upper ocean from pole to pole
Research cruisesThis dataset consists of sequence data from 4 separate cruises: ARK-XXVII/1 (PS80)—17th June to 9th July 2012; Stratiphyt-II— April to May 2011; ANT-XXIX/1 (PS81)—1st to 24th November 2012 and ANT-XXXII/2 (PS103)—16th December 2016 to 3rd February 2017 and covers a transect of the Atlantic Ocean from Greenland to the Weddell Sea (71.36°S to 79.09°N) (Supplementary Table 1). In order to study the composition, distribution and activity of microbial communities in the upper ocean across the broadest latitudinal ranges possible, samples have been collected during four field campaigns as shown in Fig. 1A. The first collection of samples was collected in the North Atlantic Ocean from April to May 2011 by Dr. Willem van de Poll of the University of Groningen, Netherlands and Dr. Klaas Timmermans of the Royal Netherlands Institute for Sea Research. The second set of samples was collected in the Arctic Ocean from June to July 2012, and the third set of samples was collected in the South Atlantic Ocean from October to November 2012. Both of which were collected by Dr. Katrin Schmidt of the University of East Anglia. The final set of samples was collected in the Antarctic Ocean from December 2016 to January 2017 by Dr. Allison Fong of the Alfred-Wegener Institute for Polar and Marine Research, Bremerhaven, Germany.SamplingWater samples from the Arctic Ocean and South Atlantic Ocean expeditions were collected using 12 L Niskin bottles (Rosette sampler with an attached Sonde (CTD, conductivity, temperature, depth) either at the chlorophyll maximum (10–110 m) and/or upper of the ocean (0–10 m). As soon as the rosette sampler was back on board, water samples were immediately transferred into plastic containers and transported to the laboratory. All samples were accompanied by measurements on salinity, temperature, sampling depth and silicate, nitrate, phosphate concentration (Supplementary Table 1). Water samples were pre-filtered with a 100 μm mesh to remove larger organisms and subsequently filtered onto 1.2 μm polycarbonate filters (Isopore membrane, Millipore, MA, USA). All filters were snap frozen in liquid nitrogen and stored at −80 °C until further analysis.Water samples from the North Atlantic Ocean cruise were also taken with 12 L Niskin bottles attached to a Rosette sampler with a Sonde. However, these samples were filtered onto 0.2 μm polycarbonate filters (Isopore membrane, Millipore, MA, USA) without pre-filtration but snap frozen in liquid nitrogen and stored at −80 °C as the other samples.Water samples from the Southern Ocean cruise were taken with 12 L Niskin bottles attached to an SBE911plus CTD system equipped with 24 Niskin samplers. These samples were filtered onto 1.2 μm polycarbonate membrane filters (Merck Millipore, Germany) in a container cooled to 4 °C and snap frozen in liquid nitrogen and stored at −80 °C as the other samples. Environmental data recorded at the time of sampling can be found in Supplementary Table 1.DNA extractions: Arctic Ocean and South Atlantic Ocean samplesDNA was extracted with the EasyDNA Kit (Invitrogen, Carlsbad, CA, USA) with modification to optimise DNA quantity and quality. Briefly, cells were washed off the filter with pre-heated (65 °C) Solution A and the supernatant was transferred into a new tube with one small spoon of glass beads (425–600 μm, acid washed) (Sigma-Aldrich, St. Louis, MO, USA). Samples were vortexed three times in intervals of 3 s to break the cells. RNase A was added to the samples and incubated for 30 min at 65 °C. The supernatant was transferred into a new tube and Solution B was added followed by a chloroform phase separation and an ethanol precipitation step. DNA was pelleted by centrifugation and washed several times with isopropanol, air dried and suspended in 100 μL TE buffer (10 mM Tris-HCl, pH 7.5, 1 mM EDTA, pH 8.0). Samples were snap frozen in liquid nitrogen and stored at −80 °C until sequencing.DNA extractions: North Atlantic Ocean samplesNorth Atlantic Ocean samples were extracted with the ZR-Duet™DNA/RNA MiniPrep kit (Zymo Research, Irvine, USA) allowing simultaneous extraction of DNA and RNA from one sample filter. Briefly, cells were washed from the filters with DNA/RNA Lysis Buffer and one spoon of glass beads (425–600 μm, Sigma-Aldrich, MO, USA) was added. Samples were vortexed quickly and loaded onto Zymno-Spin™IIIC columns. The columns were washed several times and DNA was eluted in 60 μmL, DNase-free water. Samples were snap frozen in liquid nitrogen and stored at −80 °C until sequencing.DNA extractions: Southern Ocean samplesDNA from the Southern Ocean samples was extracted with the NucleoSpin Soil DNA extraction kit (Macherey‐Nagel) following the manufacturer’s instructions. Briefly, cells were washed from the filters with DNA Lysis Buffer and into a lysis tube containing glass beads was added. Samples were disrupted by bead beating for 2 × 30 s interrupted by 1 min cooling on ice and loaded onto the NucleoSpin columns. The columns were washed three times and DNA was eluted in 50 μL, DNase-free water. Samples were stored at −20 °C until further processing.Amplicon sequencing of 16S and 18S rDNAAll extracted DNA samples were sequenced and pre-processed by the Joint Genome Institute (JGI) (Department of Energy, Berkeley, CA, USA). iTAG amplicon sequencing was performed at JGI with primers for the V4 region of the 16S (FW(515F): GTGCCAGCMGCCGCGGTAA; RV(806R): GGACTACNVGGGTWTCTAAT)49 and 18S (FW(565F): CCAGCASCYGCGGTAATTCC; RV(948R): ACTTTCGTTCTTGATYRA)50. (Supplementary Table 6) rRNA gene (on an Illumina MiSeq instrument with a 2 × 300 base pairs (bp) read configuration51. 18S sequences were pre-processed, this consisted of scanning for contamination with the tool Duk (US Department of Energy Joint Genome Institute (JGI), 2017,a) and quality trimming of reads with cutadapt52. Paired end reads were merged using FLASH53 with a max mismatch set to 0.3 and min overlap set to 20. A total of 54 18S samples passed quality control after sequencing. After read trimming, there was an average of 142,693 read pairs per 18S sample with an average length of 367 bp and 2.8 Gb of data over all samples.16S sequences were pre-processed, this consisted of merging the overlapping read pairs using USEARCH’s merge pairs54 with the parameter minimum number of differences (merge max diff pct) set to 15.0 into unpaired consensus sequences. Any reads that could not be merged are discarded. JGI then applied the tool USEARCH’s search oligodb tool with the parameters mean length (len mean) set to 292, length standard deviation (len stdev) set to 20, primer trimmed max difference (primer trim max diffs) set to 3, a list of primers and length filter max difference (len filter max diffs) set to 2.5 to ensure the Polymerase Chain Reaction (PCR) primers were located with the correct direction and inside the expected spacing. Reads that did not pass this quality control step were discarded. With a max expected error rate (max exp err rate) set to 0.02, JGI evaluated the quality score of the reads and those with too many expected errors were discarded. Any identical sequence was de-duplicated. These are then counted and sorted alphabetically for merging with other such files later. A total of 57 × 16S samples passed quality control after sequencing. There was an average 393,247 read pairs per sample and an average base length of 253 bp for each sequence with a total of 5.6 Gb.RNA extractions: Arctic Ocean and Atlantic samplesRNA from the Arctic and Atlantic Ocean samples was extracted using the Direct-zol RNA Miniprep Kit (Zymo Research, USA). Briefly, cells were washed off the filters with Trizol into a tube with one spoon of glass beads (425–600 μm, Sigma-Aldrich, MO, USA). Filters were removed and tubes bead beaten for 3 min. An equal volume of 95% ethanol was added, and the solution was transferred onto Zymo-Spin™ IICR Column and the manufacturer instructions were followed. Samples were treated with DNAse to remove DNA impurities, snap frozen in liquid nitrogen and stored at −80 °C until sequencing.RNA extractions: Southern OceanRNA from the Southern Ocean samples was extracted using the QIAGEN RNeasy Plant Mini Kit (QIAGEN, Germany) following the manufacturer’s instructions with on-column DNA digestion. Cells were broken by bead beating like for the DNA extractions before loading samples onto the columns. Elution was performed with 30 µm RNase-free water. Extracted samples were snap frozen in liquid nitrogen and stored at −80 °C until sequencing.Metatranscriptome sequencingAll samples were sequenced and pre-processed by the U.S. Department of Energy Joint Genome Institute (JGI). Metatranscriptome sequencing was performed on an Illumina HiSeq-2000 instrument27. A total of 79 samples passed quality control after sequencing with 19.87 Gb of sequence read data over all samples for analysis. This comprised a total of 34,241,890 contigs, with an average length of 503 and an average GC% of 51%. This resulted in 36354419 of non-redundant genes detected.JGI employed their suite of tools called BBTools55 for preprocessing the sequences. First, the sequences were cleaned using Duk a tool in the BBTools suite that performs various data quality procedures such as quality trimming and filtering by kmer matching. In our dataset, Duk identified and removed adaptor sequences, and also quality trimmed the raw reads to a phred score of Q10. In Duk the parameters were; kmer-trim (ktrim) was set to r, kmer (k) was set to 25, shorter kmers (mink) set to 12, quality trimming (qtrim) was set to r, trimming phred (trimq) set to 10, average quality below (maq) set to 10, maximum Ns (maxns) set to 3, minimum read length (minlen) set to 50, the flag “tpe” was set to t, so both reads are trimmed to the same length and the “tbo” flag was set to t, so to trim adaptors based on pair overlap detection. The reads were further filtered to remove process artefacts also using Duk with the kmer (k) parameter set to 16.BBMap55 is another a tool in the BBTools suite, that performs mapping of DNA and RNA reads to a database. BBMap aligns the reads by using a multi-kmer-seed-and-extend approach. To remove ribosomal RNA reads, the reads were aligned against a trimmed version of the SILVA database using BBMap with parameters set to; minratio (minid) set to 0.90, local alignment converter flag (local) set to t and fast flag (fast) set to t. Also, any human reads identified were removed using BBMap.BBmerge56 is a tool in the BBTools suite that performs the merging of overlapping paired end reads (Bushnell, 2017). For assembling the metatranscriptome, the reads were first merged with the tool BBmerge, and then BBNorm was used to normalise the coverage so as to generate a flat coverage distribution. This type of operation can speed up assembly and can even result in an improved assembly quality.Rnnotator52 was employed for assembling the metatranscriptome samples 1–68. Rnnotator assembles the transcripts by using a de novo assembly approach of RNA-Seq data and it accomplishes this without a reference genome52. MEGAHIT57 was employed for assembling the metatranscriptome samples 69–82. The tool BBMap was used for reference mapping, the cleaned reads were mapped to metagenome/isolate reference(s) and the metatranscriptome assembly.Metatranscriptome analysisJGI performed the functional analysis on the metatranscriptomic dataset. JGI’s annotation system is called the Metagenome Annotation Pipeline (MAP) (v4.15.2)27. JGI used HMMER 3.1b258 and the Pfam v3059 database for the functional analysis of our metatranscriptomic dataset. This resulted in 11,205,641 genes assigned to one or more Pfam domain. This resulted in 8379 Pfam functional assignments and their gene counts across the 79 samples. The files were further normalised by applying hits per million.18S rDNA analysisA reference dataset of 18S rRNA gene sequences that represent algae taxa was compiled for the construction of the phylogenetic tree by retrieving sequences of algae and outgroups taxa from the SILVA database (SSUREF 115)60 and Marine Microbial Eukaryote Transcriptome Sequencing Project (MMETSP) database61. The algae reference database consists of 1636 species from the following groups: Opisthokonta, Cryptophyta, Glaucocystophyceae, Rhizaria, Stramenopiles, Haptophyceae, Viridiplantae, Alveolata, Amoebozoa and Rhodophyta. A diagram of the 18S classification pipeline can be found in Supplementary Fig. 1. In order to construct the algae 18S reference database, we first retrieved all eukaryotic species from the SILVA database with a sequence length of > = 1500 base pairs (bp) and converted all base letters of U to T. Under each genus, we took the first species to represent that genus. Using a custom written script (https://github.com/SeaOfChange/SOC/blob/master/get_ref_seqs.pl), the species of interest (as stated above) were selected from the SILVA database, classified with NCBI taxa IDs and a sequence information file produced that describes each of the algae sequences by their sequence ID and NCBI species ID. Taxonomy from the NCBI database, eukaryote sequences from the SILVA database and a list of algal taxa including outgroups were used as input for the script. This information was combined with the MMETSP database excluding duplications.The algae reference database was clustered to remove closely related sequences with CD-HIT (4.6.1)62 using a similarity threshold of 97%. Using ClustalW (2.1)63 we aligned the reference sequences with the addition of the parameter iteration numbers set to 5. The alignment was examined by colour coding each species to their groups and visualising in iTOL64. It was observed that a few species were misaligning to other groups and these were then deleted using Jalview65. The resulting alignment was tidied up with TrimAL (1.1)66 by applying parameters to delete any positions in the alignment that have gaps in 10% or more of the sequence, except if this results in less than 60% of the sequence remaining. A maximum likelihood phylogenetic reference tree and statistics file based on our algae reference alignment was constructed by employing RaxML (8.0.20)67 with a general time reversible model of nucleotide substitution along with the GAMMA model of rate heterogeneity. For a description of the lineages of all species back to the root in the algae reference database, the taxa IDs were submitted for each species to extract a subset of the NCBI taxonomy with the NCBI taxtastic tool (0.8.4)68 Based on the algae reference multiple sequence alignment, with HMMER3 (3.1B1)69 a Profile HMM was created. A pplacer reference package using taxtastic was generated, which produced an organized collection of all the files and taxonomic information into one directory. With the reference package, a SQLite database was created using pplacer’s Reference Package PReparer (rppr). With hmmalign, the query sequences were aligned to the reference set and created a combined Stockholm format alignment. Pplacer (re-aligned to the reference set and created a combined Stockholm format alignment. Pplacer (1.1)70 was used to place the query sequences on the phylogenetic reference tree by means of the reference alignment according to a maximum likelihood model70 The place files were converted to CSV with pplacer’s guppy tool; in order to easily take those with a maximum likelihood score of > = 0.5 and counted the number of reads assigned to each classification. This resulted in 6,053,291 reads that were taxonomically assigned being taken for analysis.Normalisation of 18S rDNA gene copy number18S rDNA gene copy number vary widely among eukaryotes. In order to create an estimate of abundances of the species in the samples the data had to be normalised. Previous work has explored the link between copy number and genome size71. However, there is not a single database of 18S rDNA gene copy numbers for eukaryote species. In order to address this, gene copy number and related genome sizes of 185 species across the eukaryote tree was investigated and plotted (Supplementary Fig. 2, Supplementary Table 4)68,71,72,73,74,75,76,77,78,79. Based on the log transformed data, a significant correlation with a R2 of 0.55 with a p-value More
