in

Triton, a new species-level database of Cenozoic planktonic foraminiferal occurrences

Data sources

No single comprehensive dataset of planktonic foraminiferal distributional records currently exists. Instead, these data are available from a wide range of sources in many different structures. Some of these sources are compilations of existing data (e.g., Neptune14,15,16, ForCenS21), and others derive from individual sampling sites (e.g. ocean drilling expeditions). Triton combines these disparate sources (Fig. 1) to produce a single spatio-temporal dataset of Cenozoic planktonic foraminifera with updated and consistent taxonomy, age models, and paleo-coordinates.

Neptune is currently the most comprehensive database of fossil plankton data, with records exclusively from the DSDP, ODP and IODP representing planktonic foraminifera, calcareous nannofossils, diatoms, radiolaria and dinoflagellates14,15,16. A subset of these sites is included in Neptune, representing those with the most continuous sampling through time. The raw data from Neptune form the core of our dataset. All foraminiferal occurrences for the Cenozoic (i.e. last 66 Ma) were downloaded using the GTS 2012 timescale. In the download options, all questionable identifications and invalid taxa were removed, as were records that had been identified as reworked.

In addition to Neptune, three other compilation datasets were included in Triton: ForCenS21, which consists of global core-top samples; the Eocene data from Fenton, et al.8 created based on literature searches for planktonic foraminiferal data in the Eocene; and the land-based records from Lloyd, et al.22 that were created from literature searches. The marine records in Lloyd, et al.22 were not included, as they were obtained from Neptune.

Following preliminary compilation of existing datasets, we identified all legacy DSDP, ODP and IODP cores missing from Triton. The online DESCLogik (http://web.iodp.tamu.edu/DESCReport/) and Pangaea17 databases were then mined for .csv files containing planktonic foraminiferal species count data for the missing cores, supplemented with data from AWI_Paleo (URI: http://www.awi.de/en/science/geosciences/marine-geology.html), GIK/IFG (URI: http://www.ifg.uni-kiel.de/), MARUM (URI: https://www.marum.de/index.html), and QUEEN (URI: http://ipt.vliz.be/eurobis/resource?r=pangaea_2747). All additional cores were assessed individually by inspecting the scientific drilling proceedings to determine whether sites were suitable to contribute to our dataset. The primary assessment criterion was identification of continuous sedimentary sections, wherein two or more confidently assigned consecutive chronostratigraphic tie points existed to allow for construction of age models.

In addition to these longer cores, many sediment sampling projects have produced planktonic foraminiferal distribution data from shorter cores that tend to correspond in age to the last few million years. The website PANGAEA17 (www.pangaea.de) has been used as a repository for most of these occurrence data. This website was searched using the terms “plank* AND foram”, with resulting datasets downloaded using the R package ‘pangaear’23. These datasets were filtered to exclude records collected using multinets, sediment traps or box cores, as these methods produce samples not easily correlated to sediment cores. Column names allowed for further filtering to exclude records with no species-level data, records that had only isotopic data (rather than abundance data), or records with no age controls.

Data processing

The data sources underpinning Triton serve their records in different formats. Therefore, processing was necessary to convert records into a unified framework, with one species per row for each sample and associated metadata (see below for details). Some metadata could be used without modification when available (e.g. water depth, data source), whereas other data needed processing to ensure consistency (e.g. abundance, paleo-coordinates, age). Without this processing, samples from different sources were not directly comparable. Where data were not available, they were set to NA. Those records with missing data in crucial columns (species name, abundance, age, and paleo-coordinates) were removed from the final dataset. All data processing was performed using R v. 3.6.124.

Taxonomic consistency is essential to enable comparison of datasets created at different times. The species and synonymy lists used in Triton are based on the Paleogene Atlases20,25,26, with additional information from mikrotax27 (http://www.mikrotax.org/pforams/). These sources were supplemented, when necessary with more up to date literature including Poole and Wade28 and Lam and Leckie29. (A full list of the taxonomic sources can be found in the PFdata.xlsx file18.) A synonymy list was generated to convert species names to the senior synonym. At the same time, typographic errors were corrected. For example, Globototalia flexuosa should be Globorotalia flexuosa. Exclusively Mesozoic taxa were omitted, as were all instances when species names were unclear or imprecise (i.e. not at the species level). Junior synonyms were merged with their senior synonyms and their abundances summed, although the original names and abundances are also retained in the processed dataset. For presence/absence samples, these numerical merged abundances were set to one (i.e. present). The full species list and list of synonyms can be found in the accompanying data.

Abundance data for planktonic foraminifera are provided in different formats: presence/absence, binned abundance, relative abundance, species counts, and number of specimens per gram. These metrics were converted into numeric relative abundance to make comparisons easier, although both the original abundance value and its numeric version are retained, as is a record of the abundance type. Presence/absence data were converted to a binary format (one for present; zero for absent). Species counts were converted to relative percent abundances based on the total number of specimens in the sample (this was calculated where it was not already recorded). When full counts were not performed, binned abundances were frequently used. These binned abundances were converted into numeric abundances based on the sequence. So, for example, the categorical labels of N, P, R, F, C, A, D (indicating none, present, rare, few, common, abundant, dominant) were converted to a numerical sequence of 0 to 6. As the meaning of letters can depend on the context (e.g. ‘A’ could be absent or abundant), conversion was done in a semi-automated fashion on a sample-by-sample basis. A value of 0.01 was assigned to records where an inconsistent abundance was recorded (e.g. samples with mostly numeric counts but a few species were designated ‘P’, indicating presence). Samples with zero abundance were retained in the full dataset to provide an indication of sampling.

The age of samples were recorded in multiple ways. For some samples, age models provide precise numerical estimates of the age (e.g., those in Neptune). Other samples are dated relative to stratigraphic events such as biostratigraphic zones (including benthic and planktonic foraminifera, diatoms, radiolarians and nannofossils) or magnetic reversals. In this case, ages sometimes needed to be converted to reflect revised age estimates. The start and end dates of biostratigraphic zones are defined in relation to events in marker species, e.g. their speciation, extinction or acme events. All such marker events were updated to their most recent estimates and tuned to the GTS 2020 timescale19. The process of updating included correction of synonymies. Additional care was taken to ensure the correct interpretation of abbreviations (e.g. determining whether LO meant lowest occurrence or last occurrence) based on the entire list of events for a study. Where up-to-date ages were not available or events were ambiguous, they were removed from the age models.

The marker events defining a zone can depend on the zonal scheme used. For example, Berggren30 defined the base of the planktonic foraminifera zone M8 as the first occurrence of Fohsella fohsi. Wade et al.31 used this same event to define the base of M9. Therefore, the zonal scheme was recorded when collecting age models, to accurately convert ages to the GTS 2020 time scale. Some marker events have different ages depending on the ocean basin or latitude, and these differences are not necessarily well studied31,32. Where these differences in marker events have been recorded, the coordinates of a site were used to determine whether sites were in the Atlantic or Indo-Pacific Ocean, and whether they were tropical or temperate (with the division at 23.5° latitude). However, this is an area where more research is needed to improve the accuracy of higher-latitude dating32. Magnetostratigraphic ages were also tuned to the GTS 2020 timescale.

We constructed new age models for samples not already assigned a numeric age. Where the depths of biostratigraphic events were already recorded, these were converted directly to GTS 2020. Where samples were not given any ages, often the case for the cores collected in the early days of ocean drilling, ages were reconstructed from the shipboard and post-cruise biostratigraphic data available in DESCLogik, Pangaea, and drilling publications. For holes where no tie point data were retrievable, biostratigraphic count data were extracted directly from drilling publications, and biostratigraphic events were assigned via GTS 2020. The first and last occurrences in raw shipboard biostratigraphic data often do not represent true datums, and careful assessment of the shipboard, and post-cruise literature was a prerequisite to confidently assigning chronostratigraphic datums. Tie point depths were assigned as the midpoint depth between the core sample before and after an event. For example, for an extinction event, the recorded depth was the midway point between the last recorded occurrence of a species and the first sample from which the species is absent. All sites were assessed individually to determine the age of the seafloor. Where IODP reports or sample-based publications strictly stated that the sediment surface (i.e. 0.00 meters below seafloor (mbsf)) was deemed to be “Holocene”, “Recent” or “Modern” in age, an additional 0 Ma tie point was assigned appropriately. All samples present outside the maximum/minimum age tie points for that site were removed, as they could not be confidently assigned an age. During assessment, individual drilling reports were investigated for geological structures. Where features such as unconformities, reverse faults, stratigraphic inversions, décollements, and major slips and slumps were identified, separate age models were generated for individual intact stratal intervals to account for potential externally emplaced or repeated strata (see “Age models” and “Triton working” in the figshare data repository18). Similarly, age gaps of greater than 10% of the age range of the core were classified as hiatuses, leading to separate age models (see Fig. 4). Cores of denser sediments that have been sampled using rotary drilling will often have only ~50–60% recovery in a core (9.5 m)33. As it is not possible to determine where the recovered core material came from within this length, all intact core pieces are grouped together as a continuous section from the section top, regardless of where the pieces were sourced (e.g. 4.5 m of recovered material will be recorded as 0–4.5 m of cored interval even if some came from 9–9.5 m). Consequently, age estimates within cores where recovery was low, typically the samples collected longer ago, will necessarily be less certain.

Fig. 4

Different age model estimates applied to core material from IODP Site U1499A in the South China Sea. Mag – mean age based only on the magnetostratigraphic marker events. Zones – mean age based on all the marker events. Int Mag – interpolation of the points between the magnetostratigraphic marker events. Interp – interpolation between the full set of marker events. Model – the model of age as a function of depth. Note the hiatus between 50 and 100 m. For the shallower section of the dataset, with only three data points, a simple linear model was used. For the deeper section, a GAM smooth was fitted. For this site, the model predictions were chosen as the best fit.

Full size image

Using the updated marker event ages, we created age-depth plots and modelled the best fit to the data. There are different ways of creating these models, and multiple methods were applied to each core. The one that provided the best fit to the original data was chosen (the different age models are available in “Age models” in the figshare data repository18). These choices were confirmed manually (see Fig. 4). The simplest age model used interpolation of the marker events to create ‘zones’ and assign estimated ages assuming a continuous sedimentation rate between the start and end of each of these zones. Where the events do not provide a continuous sequence (e.g. gaps in the zonal markers), age estimates were assigned as the mean of that zone with error estimates of the width of the zone. Where magnetostratigraphic events were present they were given preference. This method leads to different estimates of sedimentation rate for each zone. The more complex age model estimates a smoother sedimentation rate. When there were fewer than 5 marker events, a linear model of age as a function of depth was fitted for the entire core. For larger datasets, generalised additive models (GAMs) for the same variables were used, to allow for variation in sedimentation rates through time. GAMs were run using the mgcv R library, with a gamma value of 1.134. The type of age model used in the analysis was recorded. Where appropriate, the number of points and the r2 of the model are recorded to give an indication of the accuracy of the age model.

The latitude and longitude coordinates of samples were recorded in decimal degrees. For all samples except modern ones, plate tectonic reconstructions were necessary to determine the coordinates at which the sample was originally deposited. Reconstructions were performed using the Matthews, et al.35 plate motion model, which is an updated version of the Seton, et al.36 model used by Neptune. Comparisons of age models35,36,37,38,39 suggest this model is most appropriate for the deep sea environment where most of the samples occur, and is able to assign coordinates to significantly more sites than the Scotese39 GPlates model. This test was performed with a subset of the data (10633 unique sites); the Matthews, et al.35 model provided paleocoordinates for 95% of the data, whilst the GPlates model only provided coordinates for 17% of the data. The calculation of paleocoordinates was automated using an adaptation of https://github.com/macroecology/mapast.

When sediment samples are derived from multiple sources, duplication will inevitably occur. All such duplicated records, identified based on the combination of species, abundance, sample depth, and coordinate values, were removed. Additionally, working on an individual record level, species that occurred significantly outside their known ranges were flagged (following updated age models) on the assumption these records were misidentifications, contamination or re-working. Records were classified as falling significantly outside their known range if they were more than 5 Ma outside the species’ range in the Palaeogene (66-23 Ma) and more than 2 Ma in the Neogene (23-0 Ma). These values were chosen based on the tradeoff between removing reworked specimens and allowing for some errors in the age estimates. Age estimates for older samples tend to be less precise. Ages were obtained from Lamyman et al. (in prep) and are available in “PFdata” in the figshare data repository18. In total, 10,990 suspect records were flagged (~2% of all records).


Source: Ecology - nature.com

Engineered yeast could expand biofuels’ reach

Insights into rumen microbial biosynthetic gene cluster diversity through genome-resolved metagenomics