
The biogeographic differentiation of algal microbiomes in the upper ocean from pole to pole


Research cruises

This dataset consists of sequence data from four separate cruises: ARK-XXVII/1 (PS80), 17 June to 9 July 2012; Stratiphyt-II, April to May 2011; ANT-XXIX/1 (PS81), 1 to 24 November 2012; and ANT-XXXII/2 (PS103), 16 December 2016 to 3 February 2017. Together they cover a transect of the Atlantic Ocean from Greenland to the Weddell Sea (71.36°S to 79.09°N) (Supplementary Table 1). To study the composition, distribution and activity of microbial communities in the upper ocean across the broadest latitudinal range possible, samples were collected during four field campaigns as shown in Fig. 1A. The first set of samples was collected in the North Atlantic Ocean from April to May 2011 by Dr. Willem van de Poll of the University of Groningen, Netherlands, and Dr. Klaas Timmermans of the Royal Netherlands Institute for Sea Research. The second set was collected in the Arctic Ocean from June to July 2012 and the third in the South Atlantic Ocean from October to November 2012, both by Dr. Katrin Schmidt of the University of East Anglia. The final set was collected in the Southern Ocean from December 2016 to January 2017 by Dr. Allison Fong of the Alfred Wegener Institute for Polar and Marine Research, Bremerhaven, Germany.

Sampling

Water samples from the Arctic Ocean and South Atlantic Ocean expeditions were collected using 12 L Niskin bottles mounted on a rosette sampler with an attached Sonde (CTD; conductivity, temperature, depth), either at the chlorophyll maximum (10–110 m) and/or in the upper layer of the ocean (0–10 m). As soon as the rosette sampler was back on board, water samples were immediately transferred into plastic containers and transported to the laboratory. All samples were accompanied by measurements of salinity, temperature, sampling depth and silicate, nitrate and phosphate concentrations (Supplementary Table 1). Water samples were pre-filtered through a 100 μm mesh to remove larger organisms and subsequently filtered onto 1.2 μm polycarbonate filters (Isopore membrane, Millipore, MA, USA). All filters were snap frozen in liquid nitrogen and stored at −80 °C until further analysis.

Water samples from the North Atlantic Ocean cruise were also taken with 12 L Niskin bottles attached to a rosette sampler with a Sonde. However, these samples were filtered onto 0.2 μm polycarbonate filters (Isopore membrane, Millipore, MA, USA) without pre-filtration, then snap frozen in liquid nitrogen and stored at −80 °C like the other samples.

Water samples from the Southern Ocean cruise were taken with 12 L Niskin bottles attached to an SBE911plus CTD system equipped with 24 Niskin samplers. These samples were filtered onto 1.2 μm polycarbonate membrane filters (Merck Millipore, Germany) in a container cooled to 4 °C, then snap frozen in liquid nitrogen and stored at −80 °C like the other samples. Environmental data recorded at the time of sampling can be found in Supplementary Table 1.

DNA extractions: Arctic Ocean and South Atlantic Ocean samples

DNA was extracted with the EasyDNA Kit (Invitrogen, Carlsbad, CA, USA) with modifications to optimise DNA quantity and quality. Briefly, cells were washed off the filter with pre-heated (65 °C) Solution A, and the supernatant was transferred into a new tube containing a small spoon of glass beads (425–600 μm, acid washed) (Sigma-Aldrich, St. Louis, MO, USA). Samples were vortexed three times in 3 s intervals to break the cells. RNase A was added and the samples were incubated for 30 min at 65 °C. The supernatant was transferred into a new tube and Solution B was added, followed by a chloroform phase separation and an ethanol precipitation step. DNA was pelleted by centrifugation, washed several times with isopropanol, air dried and resuspended in 100 μL TE buffer (10 mM Tris-HCl, pH 7.5, 1 mM EDTA, pH 8.0). Samples were snap frozen in liquid nitrogen and stored at −80 °C until sequencing.

DNA extractions: North Atlantic Ocean samples

North Atlantic Ocean samples were extracted with the ZR-Duet™ DNA/RNA MiniPrep kit (Zymo Research, Irvine, USA), allowing simultaneous extraction of DNA and RNA from one sample filter. Briefly, cells were washed from the filters with DNA/RNA Lysis Buffer and one spoon of glass beads (425–600 μm, Sigma-Aldrich, MO, USA) was added. Samples were vortexed briefly and loaded onto Zymo-Spin™ IIIC columns. The columns were washed several times and DNA was eluted in 60 μL DNase-free water. Samples were snap frozen in liquid nitrogen and stored at −80 °C until sequencing.

DNA extractions: Southern Ocean samples

DNA from the Southern Ocean samples was extracted with the NucleoSpin Soil DNA extraction kit (Macherey-Nagel) following the manufacturer's instructions. Briefly, cells were washed from the filters with lysis buffer into a lysis tube containing glass beads. Samples were disrupted by bead beating for 2 × 30 s, interrupted by 1 min of cooling on ice, and loaded onto the NucleoSpin columns. The columns were washed three times and DNA was eluted in 50 μL DNase-free water. Samples were stored at −20 °C until further processing.

Amplicon sequencing of 16S and 18S rDNA

All extracted DNA samples were sequenced and pre-processed by the Joint Genome Institute (JGI) (Department of Energy, Berkeley, CA, USA). iTAG amplicon sequencing of the V4 region of the 16S rRNA gene (FW(515F): GTGCCAGCMGCCGCGGTAA; RV(806R): GGACTACNVGGGTWTCTAAT)49 and the 18S rRNA gene (FW(565F): CCAGCASCYGCGGTAATTCC; RV(948R): ACTTTCGTTCTTGATYRA)50 (Supplementary Table 6) was performed at JGI on an Illumina MiSeq instrument with a 2 × 300 base pair (bp) read configuration51. 18S sequences were pre-processed by scanning for contamination with the tool Duk (US Department of Energy Joint Genome Institute (JGI), 2017a) and quality trimming of reads with cutadapt52. Paired-end reads were merged using FLASH53 with the maximum mismatch set to 0.3 and the minimum overlap set to 20. A total of 54 18S samples passed quality control after sequencing. After read trimming, there was an average of 142,693 read pairs per 18S sample, with an average length of 367 bp and 2.8 Gb of data over all samples.

16S sequences were pre-processed by merging the overlapping read pairs into unpaired consensus sequences using USEARCH's merge pairs command54 with the maximum number of differences (merge max diff pct) set to 15.0. Any reads that could not be merged were discarded. JGI then applied USEARCH's search oligodb tool with mean length (len mean) set to 292, length standard deviation (len stdev) set to 20, primer trimmed max difference (primer trim max diffs) set to 3, a list of primers, and length filter max difference (len filter max diffs) set to 2.5, to ensure the Polymerase Chain Reaction (PCR) primers were located in the correct orientation and within the expected spacing. Reads that did not pass this quality control step were discarded. With the maximum expected error rate (max exp err rate) set to 0.02, JGI evaluated the quality scores of the reads and discarded those with too many expected errors. Identical sequences were de-duplicated, then counted and sorted alphabetically for merging with other such files later. A total of 57 16S samples passed quality control after sequencing. There was an average of 393,247 read pairs per sample and an average sequence length of 253 bp, with a total of 5.6 Gb.

RNA extractions: Arctic Ocean and Atlantic samples

RNA from the Arctic and Atlantic Ocean samples was extracted using the Direct-zol RNA Miniprep Kit (Zymo Research, USA). Briefly, cells were washed off the filters with TRIzol into a tube with one spoon of glass beads (425–600 μm, Sigma-Aldrich, MO, USA). Filters were removed and the tubes were bead beaten for 3 min. An equal volume of 95% ethanol was added, the solution was transferred onto a Zymo-Spin™ IICR column, and the manufacturer's instructions were followed. Samples were treated with DNase to remove DNA impurities, snap frozen in liquid nitrogen and stored at −80 °C until sequencing.

RNA extractions: Southern Ocean

RNA from the Southern Ocean samples was extracted using the QIAGEN RNeasy Plant Mini Kit (QIAGEN, Germany) following the manufacturer's instructions with on-column DNA digestion. Cells were broken by bead beating as for the DNA extractions before loading samples onto the columns. Elution was performed with 30 µL RNase-free water. Extracted samples were snap frozen in liquid nitrogen and stored at −80 °C until sequencing.

Metatranscriptome sequencing

All samples were sequenced and pre-processed by the U.S. Department of Energy Joint Genome Institute (JGI). Metatranscriptome sequencing was performed on an Illumina HiSeq-2000 instrument27. A total of 79 samples passed quality control after sequencing, with 19.87 Gb of sequence read data over all samples for analysis. This comprised a total of 34,241,890 contigs, with an average length of 503 bp and an average GC content of 51%, and resulted in 36,354,419 non-redundant genes being detected.

JGI employed their suite of tools called BBTools55 for preprocessing the sequences. First, the sequences were cleaned using Duk, a tool in the BBTools suite that performs various data quality procedures such as quality trimming and filtering by kmer matching. In our dataset, Duk identified and removed adaptor sequences and quality trimmed the raw reads to a phred score of Q10. The Duk parameters were: kmer-trim (ktrim) set to r, kmer (k) set to 25, shortest kmer (mink) set to 12, quality trimming (qtrim) set to r, trimming phred (trimq) set to 10, minimum average quality (maq) set to 10, maximum Ns (maxns) set to 3, minimum read length (minlen) set to 50, the "tpe" flag set to t so that both reads are trimmed to the same length, and the "tbo" flag set to t so that adaptors are trimmed based on pair overlap detection. The reads were further filtered to remove process artefacts, also using Duk, with the kmer (k) parameter set to 16.

BBMap55 is another tool in the BBTools suite that maps DNA and RNA reads to a database, aligning the reads with a multi-kmer-seed-and-extend approach. To remove ribosomal RNA reads, the reads were aligned against a trimmed version of the SILVA database using BBMap with minratio (minid) set to 0.90, the local alignment converter flag (local) set to t and the fast flag (fast) set to t. Any human reads identified were also removed using BBMap.

BBMerge56 is a tool in the BBTools suite that merges overlapping paired-end reads. For assembling the metatranscriptome, the reads were first merged with BBMerge, and BBNorm was then used to normalise the coverage so as to generate a flat coverage distribution. This type of operation can speed up assembly and can even improve assembly quality.

Rnnotator52 was employed for assembling metatranscriptome samples 1–68. Rnnotator assembles transcripts using a de novo assembly approach for RNA-Seq data without requiring a reference genome52. MEGAHIT57 was employed for assembling metatranscriptome samples 69–82. BBMap was used for reference mapping: the cleaned reads were mapped to metagenome/isolate reference(s) and to the metatranscriptome assembly.

Metatranscriptome analysis

JGI performed the functional analysis on the metatranscriptomic dataset using their annotation system, the Metagenome Annotation Pipeline (MAP) (v4.15.2)27. HMMER 3.1b258 and the Pfam v3059 database were used for the functional analysis, which resulted in 11,205,641 genes assigned to one or more Pfam domains, yielding 8379 Pfam functional assignments and their gene counts across the 79 samples. The files were further normalised by applying the hits per million method.
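
As a minimal illustration of this hits-per-million scaling (the table layout and column names below are hypothetical, not JGI's output format), the normalisation can be expressed as:

```python
import pandas as pd

def hits_per_million(counts: pd.DataFrame) -> pd.DataFrame:
    """Scale each sample (column) so its counts sum to one million.

    `counts` is assumed to hold raw gene counts with one row per Pfam
    domain and one column per sample; this layout is illustrative only.
    """
    return counts.div(counts.sum(axis=0), axis=1) * 1_000_000

# Toy example with Pfam counts for two stations
counts = pd.DataFrame(
    {"station_01": [120, 30, 850], "station_02": [10, 500, 490]},
    index=["PF00001", "PF00002", "PF00005"],
)
print(hits_per_million(counts))
```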

18S rDNA analysis

A reference dataset of 18S rRNA gene sequences representing algal taxa was compiled for the construction of the phylogenetic tree by retrieving sequences of algae and outgroup taxa from the SILVA database (SSURef 115)60 and the Marine Microbial Eukaryote Transcriptome Sequencing Project (MMETSP) database61. The algae reference database consists of 1636 species from the following groups: Opisthokonta, Cryptophyta, Glaucocystophyceae, Rhizaria, Stramenopiles, Haptophyceae, Viridiplantae, Alveolata, Amoebozoa and Rhodophyta. A diagram of the 18S classification pipeline can be found in Supplementary Fig. 1. To construct the algae 18S reference database, we first retrieved all eukaryotic species from the SILVA database with a sequence length of ≥1500 base pairs (bp) and converted all U bases to T. Under each genus, we took the first species to represent that genus. Using a custom written script (https://github.com/SeaOfChange/SOC/blob/master/get_ref_seqs.pl), the species of interest (as stated above) were selected from the SILVA database and classified with NCBI taxa IDs, and a sequence information file was produced that describes each of the algal sequences by its sequence ID and NCBI species ID. Taxonomy from the NCBI database, eukaryote sequences from the SILVA database and a list of algal taxa including outgroups were used as input for the script. This information was combined with the MMETSP database, excluding duplications.

The algae reference database was clustered to remove closely related sequences with CD-HIT (4.6.1)62 using a similarity threshold of 97%. Using ClustalW (2.1)63, we aligned the reference sequences with the number of iterations set to 5. The alignment was examined by colour coding each species according to its group and visualising in iTOL64. A few species were observed to misalign to other groups and were deleted using Jalview65. The resulting alignment was tidied up with TrimAl (1.1)66 by deleting any positions in the alignment with gaps in 10% or more of the sequences, except where this would leave less than 60% of the sequence remaining. A maximum likelihood phylogenetic reference tree and statistics file based on our algae reference alignment were constructed with RAxML (8.0.20)67 using a general time reversible model of nucleotide substitution along with the GAMMA model of rate heterogeneity. To describe the lineages of all species in the algae reference database back to the root, the taxa IDs for each species were submitted to the NCBI taxtastic tool (0.8.4)68 to extract a subset of the NCBI taxonomy. Based on the algae reference multiple sequence alignment, a profile HMM was created with HMMER3 (3.1b1)69. A pplacer reference package was generated using taxtastic, which organised all the files and taxonomic information into one directory. From the reference package, a SQLite database was created using pplacer's Reference Package PReparer (rppr). With hmmalign, the query sequences were aligned to the reference set, creating a combined Stockholm format alignment. Pplacer (1.1)70 was used to place the query sequences on the phylogenetic reference tree by means of the reference alignment according to a maximum likelihood model70. The place files were converted to CSV with pplacer's guppy tool in order to take those with a maximum likelihood score of ≥0.5 and count the number of reads assigned to each classification. This resulted in 6,053,291 taxonomically assigned reads being taken for analysis.
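
The final filtering and counting step can be sketched as follows; the guppy CSV column names used here ("classification", "like_weight_ratio") are assumptions about the file layout rather than a documented format:

```python
import pandas as pd

# Keep placements with a maximum likelihood score >= 0.5 and count reads
# per classification. Column names are assumed, not guaranteed.
placements = pd.read_csv("placements.csv")
confident = placements[placements["like_weight_ratio"] >= 0.5]
read_counts = confident.groupby("classification").size().sort_values(ascending=False)
read_counts.to_csv("classified_read_counts.csv")
```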

Normalisation of 18S rDNA gene copy number

18S rDNA gene copy numbers vary widely among eukaryotes. In order to estimate the abundances of the species in the samples, the data had to be normalised. Previous work has explored the link between copy number and genome size71. However, there is no single database of 18S rDNA gene copy numbers for eukaryote species. To address this, gene copy numbers and the corresponding genome sizes of 185 species across the eukaryote tree were collated and plotted (Supplementary Fig. 2, Supplementary Table 4)68,71,72,73,74,75,76,77,78,79. Based on the log-transformed data, a significant correlation between genome size and 18S copy number was observed (R2 = 0.55, p-value < 2.2e−16). A regression equation was determined (f(x) = 0.66x + 0.75), as shown in Supplementary Fig. 2.

To derive this equation, the genome sizes for the species in the reference datasets were retrieved from the NCBI genome database. Since some genome sizes were unavailable, an average of the available genome sizes of closely related species was used for species with missing values. More specifically, a taxonomic lineage of the relevant subset of the NCBI database was first obtained by submitting the taxa IDs to the NCBI taxtastic tool68. Average genome sizes were then calculated by utilizing the parent ID and taxa ID columns and the known genome sizes of the lowest common ancestor. The 18S datasets were normalised using the copy numbers estimated from the assigned genome sizes via the regression equation. The files were further normalised by applying the hits per million reads method.
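
A minimal sketch of this copy-number normalisation, assuming the fitted relation f(x) = 0.66x + 0.75 links log10-transformed genome size to log10-transformed 18S copy number (the log base, genome-size units and example values are illustrative assumptions):

```python
import numpy as np
import pandas as pd

def estimate_copy_number(genome_size_mb: pd.Series) -> pd.Series:
    """Estimate 18S copy number from genome size via the fitted regression.

    Assumes f(x) = 0.66x + 0.75 relates log10 genome size (x) to log10 copy
    number; the log base and the Mb units are illustrative assumptions.
    """
    return 10 ** (0.66 * np.log10(genome_size_mb) + 0.75)

def normalise_by_copy_number(abundance: pd.Series, genome_size_mb: pd.Series) -> pd.Series:
    """Divide raw read counts by the estimated per-species copy number."""
    return abundance / estimate_copy_number(genome_size_mb)

# Toy example: two species with assumed genome sizes (Mb) and raw 18S counts
genome_sizes = pd.Series({"Species A": 55.0, "Species B": 1200.0})
raw_counts = pd.Series({"Species A": 4000, "Species B": 9000})
print(normalise_by_copy_number(raw_counts, genome_sizes))
```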

18S rDNA file preparation

In our 18S rDNA dataset, we had taxonomic assignments from the eukaryote node down to the species nodes. We employed Metagenome Analyzer (MEGAN) (5.10.3)80 to extract specific taxonomic levels. In MEGAN, we extracted the classifications at the taxonomic rank of species, generating a file for each station that contained the species names and their assigned abundances. The files were further normalised to hits per million.

In MEGAN, we extracted the leaves of the taxonomy tree at the rank of class and above but excluded assignments to the eukaryote node. Firstly, a file was generated for each station containing all assignments to the class nodes, with any assignments in their respective lineages down to species summed up under the individual class node. Secondly, we included nodes on the leaves of the tree that were not highlighted as class level in MEGAN. These leaves were not highlighted because, in the NCBI taxonomy, some species do not have a designation at every taxonomic level. We took the nodes that were not highlighted on the leaves of the tree, summed them together within their respective lineages and placed them under a new name. For example, under the phylum Rhizaria, the leaves of the tree include Cercozoa, Gromiidae and unclassified Rhizaria, which are not highlighted. Their abundances were summed together and renamed Nc. Rhizaria, with "Nc." standing for "No class". The abundances assigned directly to Rhizaria were not included in this calculation. The leaves of the tree made up 34% of the total 18S rDNA dataset. The internal nodes between the leaves of the tree at the taxonomic rank of class and the eukaryote node were given a "U." in front of their names, with "U." standing for "Unknown". This was done to highlight that, while they are associated with the lower lineages, they are considered separate, as the assignments to those nodes could not be resolved any lower. The internal nodes made up 29% of the total 18S rDNA dataset.
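
A simplified sketch of this class-level aggregation, using a flat lineage table instead of MEGAN's taxonomy tree (the column names and example lineages are illustrative assumptions):

```python
import pandas as pd

# Toy assignment table: one row per read group with its lineage and abundance.
assignments = pd.DataFrame({
    "phylum":    ["Rhizaria", "Rhizaria", "Rhizaria", "Haptophyta"],
    "class":     ["Chlorarachniophyceae", None, None, "Coccolithophyceae"],
    "species":   ["Bigelowiella natans", "Cercozoa sp.", "Gromiidae sp.", "Emiliania huxleyi"],
    "abundance": [120, 40, 15, 300],
})

# Reads with a class assignment are summed under that class; reads whose
# lineage lacks a class rank are pooled under "Nc. <phylum>" ("No class").
assignments["class_label"] = assignments["class"].fillna("Nc. " + assignments["phylum"])
class_counts = assignments.groupby("class_label")["abundance"].sum()
print(class_counts)
```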

The abundance assigned to the eukaryote node was excluded from our analysis, as these sequences could not be classified any lower; this comprised a total of 37% of the 18S rDNA dataset. A file was generated for each station that contained the class nodes, "Nc." nodes and "U." nodes with their respective abundances. The files were further normalised to hits per million. Throughout the paper we refer to the analysis of these files at the taxonomic rank of class.

16S rDNA analysis

JGI performed the classification analysis on the 16S rDNA dataset81,82. JGI's 16S rDNA classification pipeline (iTagger v2.1) consists of first removing samples with fewer than 1000 sequences. The remaining samples and the de-duplicated identical sequences from the preprocessing step are then combined and their sequences sorted by decreasing abundance. The sequences are split according to whether they can serve as cluster centroids, which requires a minimum abundance of at least 3 copies; the low-abundance sequences are set aside and not used for clustering. USEARCH's83 cluster otus command is employed to incrementally cluster the clusterable sequences. This begins at 99% identity, and the radius is increased by 1% at each iteration until an OTU clustering identity of 97% is reached. At each step, the sequences are sorted by decreasing abundance. Once clustering is complete, USEARCH's usearch global is used to map the low-abundance sequences to the cluster centroids. These are added to the OTU counts if they fall within the prescribed percent identity threshold; otherwise they are discarded. USEARCH's UTAX, together with the SILVA database, is used to classify the clustered centroid sequences. The predicted taxonomic classifications are then filtered with a cutoff of 0.5, and any chloroplast sequences identified are removed. The final accepted OTUs and read counts for each sample are placed in a taxonomic classification file.

Normalisation of 16S rDNA gene copy number

In order to normalise for 16S copy number, the 16S copy numbers for the species in the dataset were retrieved from the Ribosomal RNA Operon Copy Number Database (rrnDB)84. The rrnDB database version 5.3 contained 3021 bacterial entries at the time. Firstly, since multiple entries of a species exist in the rrnDB database due to the presence of different strains, we obtained an average copy number for each species, which resulted in 2876 species entries. The higher taxonomic levels for the rrnDB species needed to be established so that we could calculate their average copy numbers. To describe the lineages of all species in the rrnDB database back to the root, we submitted the species names for each entry to the NCBI taxtastic tool68 to extract a subset of the NCBI taxonomy, thus producing a Taxtastic file. This Taxtastic file, based on species from the rrnDB database, was used to calculate the average copy number for higher taxonomic levels from the known species-level copy numbers, using the parent ID and taxa ID layout of the file. A Taxtastic file based on the 16S rDNA species from our dataset was then generated, and we assigned our 16S species entries a copy number, from species to root, from the prepared average copy number rrnDB Taxtastic file. Not all copy numbers in the 16S rDNA dataset were known; for those that were missing, we took the average copy number of closely related species from the taxonomic level above. The 16S dataset was normalised by dividing by the assigned copy number. The files were further normalised by applying the hits per million reads method.
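
The propagation of average copy numbers from species to higher taxonomic levels can be sketched as follows, using a toy taxtastic-style parent/child table (the table contents and column names are illustrative assumptions):

```python
import pandas as pd

# Toy taxonomy: species-level copy numbers are known, higher ranks are not.
taxonomy = pd.DataFrame({
    "tax_id":      ["root", "proteo", "gamma", "sp1", "sp2", "sp3"],
    "parent_id":   [None,   "root",   "proteo", "gamma", "gamma", "proteo"],
    "copy_number": [None,   None,     None,     7.0,     5.0,     2.0],
})

def fill_average_copy_numbers(tax: pd.DataFrame) -> pd.DataFrame:
    """Fill missing copy numbers with the mean over each node's descendants."""
    children = tax.groupby("parent_id")["tax_id"].apply(list).to_dict()
    copies = dict(zip(tax["tax_id"], tax["copy_number"]))

    def known_values(node):
        # Collect known copy numbers from `node` or, failing that, its descendants.
        if pd.notna(copies.get(node)):
            return [copies[node]]
        values = []
        for child in children.get(node, []):
            values.extend(known_values(child))
        return values

    filled = []
    for node in tax["tax_id"]:
        values = known_values(node)
        filled.append(sum(values) / len(values) if values else None)
    out = tax.copy()
    out["copy_number"] = filled
    return out

print(fill_average_copy_numbers(taxonomy))
```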

16S rDNA file preparation

In our 16S rDNA dataset, we had taxonomic assignments from the bacteria node down to the genus nodes. We extracted the classifications at the taxonomic rank of genus. This consisted of a file being generated for each station that contained the genus names and their assigned abundances. The files were further normalised by applying the hits per million reads method.

We extracted the leaves of the tree, which included class nodes and "Nc." nodes, with their respective abundances; this accounted for 94% of the 16S rDNA dataset. We also extracted the internal nodes and placed "U." in front of their names, accounting for 3% of the 16S rDNA dataset. The abundance assigned to the bacteria node was excluded from our analysis and comprised a total of 3% of the 16S rDNA dataset. We generated a file for each station that contained the class nodes, "Nc." nodes and "U." nodes with their respective abundances. The files were further normalised by applying the hits per million reads method. Throughout the paper we refer to the analysis of these files at the taxonomic rank of class.

Statistical analysis

Alpha diversity (Shannon index) in relation to environmental covariates

The Shannon index H'85 was used to calculate abundance-weighted richness per station. The Shannon index was used rather than the Simpson index, as the latter is heavily weighted towards the most abundant orders. The Shannon index was calculated based on the following equation:

$$H' = -\sum_{i=1}^{S} p_i \ln p_i$$

Environmental covariates were related to the Shannon index (H') by fitting generalized linear models. Stepwise backward selection of covariates was used for model building, removing non-significant covariates until all remaining covariates were significant at a p-value < 0.05.
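
A minimal sketch of these two steps, computing H' per station and performing backward selection with a Gaussian generalized linear model (the actual analysis was done in R; the covariate table layout and the Gaussian family here are assumptions):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

def shannon_index(counts: np.ndarray) -> float:
    """Shannon index H' = -sum(p_i * ln p_i) over taxa with non-zero counts."""
    p = counts[counts > 0] / counts.sum()
    return float(-(p * np.log(p)).sum())

def backward_select(env: pd.DataFrame, h: pd.Series, alpha: float = 0.05):
    """Drop the least significant covariate until all remaining ones have p < alpha."""
    covariates = list(env.columns)
    while covariates:
        X = sm.add_constant(env[covariates])
        fit = sm.GLM(h, X, family=sm.families.Gaussian()).fit()
        pvals = fit.pvalues.drop("const")
        worst = pvals.idxmax()
        if pvals[worst] <= alpha:
            return fit                 # all remaining covariates significant
        covariates.remove(worst)       # remove the least significant covariate
    return None
```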

Beta diversity in relation to environmental factors was calculated across the transect based on a Hellinger-transformed class abundance matrix using the vegdist function of the vegan package86. The Bray-Curtis dissimilarity index87 was used as a measure of beta diversity and was calculated based on the following equation:

$$BC_{ij} = \frac{\sum_{k} |n_{ik} - n_{jk}|}{\sum_{k} (n_{ik} + n_{jk})}$$
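
A minimal numpy sketch of the Hellinger transformation and Bray-Curtis dissimilarity described above (the actual analysis used vegan's vegdist; the toy values are illustrative):

```python
import numpy as np

def hellinger(matrix: np.ndarray) -> np.ndarray:
    """Hellinger transformation: square root of row-wise relative abundances."""
    row_sums = matrix.sum(axis=1, keepdims=True)
    return np.sqrt(matrix / row_sums)

def bray_curtis(x: np.ndarray, y: np.ndarray) -> float:
    """Bray-Curtis dissimilarity between two abundance vectors."""
    return float(np.abs(x - y).sum() / (x + y).sum())

# Toy example: two stations, three classes
abund = np.array([[10.0, 0.0, 5.0],
                  [2.0, 8.0, 5.0]])
h = hellinger(abund)
print(bray_curtis(h[0], h[1]))
```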

Evenness and occupancy

Abundance, station evenness and occupancy plots were produced for each 18S rDNA class (n = 54 samples) and 16S rDNA class (n = 57 samples) (Supplementary Fig. 5, Supplementary Table 3). The x-axis represents the number of stations in which a class occurs, and the y-axis represents the evenness of that class across the stations in which it occurs. Evenness was calculated using a dispersion index, a variant of Pielou's evenness J'88 based on Shannon's H'85,89. Each circle represents the abundance of a class: its size is scaled by setting the radius to the square root of the total class abundance divided by π, so that circle area is proportional to abundance.
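
A hedged sketch of how such an occupancy/evenness summary can be computed; here evenness is approximated with Pielou's J' as a stand-in for the dispersion-index variant used in the paper, and the input layout (classes as rows, stations as columns) is an assumption:

```python
import numpy as np
import pandas as pd

def occupancy_evenness(abund: pd.DataFrame) -> pd.DataFrame:
    """Summarise each class (row) across stations (columns)."""
    summary = {}
    for taxon, row in abund.iterrows():
        present = row[row > 0]
        occupancy = int(len(present))
        if occupancy == 0:
            continue                      # class absent everywhere; nothing to plot
        p = present / present.sum()
        shannon = float(-(p * np.log(p)).sum())
        evenness = shannon / np.log(occupancy) if occupancy > 1 else 0.0
        summary[taxon] = {
            "occupancy": occupancy,
            "evenness": evenness,
            "radius": float(np.sqrt(row.sum() / np.pi)),  # circle size used in the plot
        }
    return pd.DataFrame(summary).T
```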

Canonical correspondence analyses (CCAs)

The R package vegan90 was employed to perform a Canonical Correspondence Analysis (CCA) on each of the 18S, 16S and metatranscriptome Pfam datasets against the individual environmental variables. The environmental data consisted of temperature, salinity, nitrate/nitrite, phosphate and silicate (Supplementary Fig. 6).

Network analysis

A network analysis was performed using the R package Weighted Gene Co-Expression Network Analysis (WGCNA)91. The first analysis was performed on samples of combined prokaryotes at the taxonomic rank of genus and eukaryotes at the taxonomic rank of species, to describe networks derived from their log10-scaled abundances. The normalised prokaryote and eukaryote files were combined for each station. A signed adjacency measure for each lineage was determined by raising the absolute value of the Pearson correlation coefficient to the power of 11. A topological overlap measure (TOM) was calculated from the resulting adjacency matrix, and hierarchical clustering was carried out on the TOM, which resulted in two subnetworks (modules) being identified (Fig. 4). The second analysis was performed on samples of the metatranscriptome Pfam dataset to describe networks derived from their log10-scaled gene counts. Here, the adjacency measure was determined by raising the absolute value of the Pearson correlation coefficient to the power of 12. A topological overlap measure was calculated from the resulting adjacency matrix, and hierarchical clustering on the TOM again resulted in two subnetworks being identified (Fig. 2, Supplementary Table 5).
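
A simplified Python sketch of the adjacency, TOM and clustering steps described above (the analysis itself used the WGCNA R package; this stand-in assumes a stations-by-lineages matrix of log10-scaled abundances and uses the standard unsigned TOM formula):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def wgcna_like_modules(profiles: np.ndarray, power: int = 11, n_modules: int = 2):
    """Cluster lineages into modules from a (stations x lineages) abundance matrix."""
    corr = np.corrcoef(profiles, rowvar=False)        # Pearson correlation between lineages
    adjacency = np.abs(corr) ** power                 # soft-thresholded adjacency
    k = adjacency.sum(axis=0) - 1.0                   # connectivity, excluding self-adjacency
    shared = adjacency @ adjacency - 2.0 * adjacency  # shared neighbours, excluding i and j
    tom = (shared + adjacency) / (np.minimum.outer(k, k) + 1.0 - adjacency)
    np.fill_diagonal(tom, 1.0)
    dissimilarity = 1.0 - tom
    np.fill_diagonal(dissimilarity, 0.0)
    tree = linkage(squareform(dissimilarity, checks=False), method="average")
    return fcluster(tree, t=n_modules, criterion="maxclust")
```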

When incorporating environmental data, latitude values were redefined, so that the North pole is 0°, the Equator is 90° and the South pole is 180°. Unaltered environmental data can be found in Supplementary Table 1.

Beta diversity break-point analysis

The break-point analysis is based on the methodology from ref. 92. The beta diversity index used in the break-point analyses is the Sørensen index. A breakpoint was determined and plotted for each of the Pfam protein family, 18S rDNA and 16S rDNA datasets. Breakpoints in the 18S and 16S rDNA datasets were investigated within the temperature range of 7 °C to 29.02 °C. When incorporating environmental data, latitude values were redefined so that the North Pole is 0°, the Equator is 90° and the South Pole is 180°. Unaltered environmental data can be found in Supplementary Table 1.

The break-point analysis was generated using piecewise regression in R. This was done by first producing a presence-absence matrix for each dataset. A multiple-site dissimilarity was computed on the presence-absence matrix with beta.pair, a function from the betapart R package, with the dissimilarity index set to Sørensen, thus producing a distance object called beta.sor34. Outliers were identified with bagplot, a function from the aplpack R package, and removed from the analyses. The remaining values were then plotted against the environmental variable (temperature or altered latitude) and searched for possible breakpoints, selecting the one with the lowest mean squared error.
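
A simplified sketch of the breakpoint search, fitting two independent linear segments at each candidate breakpoint and keeping the one with the lowest mean squared error (the actual analysis used piecewise regression in R; this is a stand-in, not the paper's implementation):

```python
import numpy as np

def find_breakpoint(x: np.ndarray, y: np.ndarray):
    """Return the candidate breakpoint minimising the pooled MSE of two linear fits."""
    best_bp, best_mse = None, np.inf
    for bp in np.unique(x)[2:-2]:                 # keep a few points in each segment
        mse = 0.0
        for mask in (x <= bp, x > bp):
            coeffs = np.polyfit(x[mask], y[mask], deg=1)
            residuals = y[mask] - np.polyval(coeffs, x[mask])
            mse += float((residuals ** 2).sum())
        mse /= len(x)
        if mse < best_mse:
            best_bp, best_mse = bp, mse
    return best_bp, best_mse
```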

For the 18S rDNA and 16S rDNA datasets, a number of samples in the North Atlantic Ocean did not pass quality control before sequencing. Consequently, the 18S rDNA and 16S rDNA break-point analyses had gaps in their plots in the North Atlantic Ocean region. To investigate the effect of the missing samples, four model scenarios were produced to mimic them. The first scenario filled in beta diversity values for the missing North Atlantic Ocean stations with values from the closest stations by latitude, resulting in breakpoints for the 18S and 16S rDNA of 20.66 °C and 9.49 °C, respectively. The second scenario filled in the missing values with values from the Arctic Ocean, resulting in breakpoints of 14.4 °C and 12.07 °C, respectively. The third scenario used values from the South Atlantic Ocean, resulting in breakpoints of 9.49 °C and 12.22 °C, respectively. The fourth scenario used values from both the Arctic Ocean and the South Atlantic Ocean, resulting in breakpoints of 14.4 °C and 12.22 °C, respectively.

A break-point analysis was performed for the Pfam protein family beta diversity against temperature with the North Atlantic Ocean samples (Stratiphyt-II) removed, to test whether key results remain unchanged (Supplementary Fig. 10e). A breakpoint of 18.2 °C was determined with a p-value of 1.65e−11. Hence, the main result (Fig. 5A) remains unchanged.

IPCC-based modelling of geographical shifts in beta-diversity breakpoints across the North Atlantic

To assess where these boundaries are, we began with the HadISST dataset93, taking the 1961–1990 climatology (Fig. 6). For estimates of changes over the 21st century, we used the RCP 8.5 HadGEM2-ES CMIP5 experiment37. A historical HadGEM2-ES experiment was also run for CMIP5, which we used to bias-correct the projected temperatures. This was achieved by determining the differences between the 1961–1990 HadISST and HadGEM2-ES temperatures for each grid box and adding them to the projections. Grid boxes that contain sea ice in the climatology were excluded from this analysis.
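
A minimal sketch of this delta bias correction (array shapes, variable names and the masking convention are assumptions):

```python
import numpy as np

def bias_correct(projection, model_hist_clim, obs_clim, ice_mask):
    """Delta-correct projected temperatures grid box by grid box.

    All inputs are 2-D (lat, lon) arrays; ice_mask is True where the
    climatology contains sea ice, and those grid boxes are set to NaN.
    """
    delta = obs_clim - model_hist_clim          # per-grid-box model bias (observed - modelled)
    corrected = projection + delta              # add the bias to the projected field
    return np.where(ice_mask, np.nan, corrected)
```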

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.

