Complex networks of marine heatwaves reveal abrupt transitions in the global ocean
SST data
I used the National Oceanic and Atmospheric Administration (NOAA) daily optimum interpolation SST gridded dataset V2.0 to identify MHWs in the period 1 January 1982 to 31 December 20184,28. The dataset is a blend of observations from satellites, ships and buoys and includes bias adjustment of satellite and ship observations to compensate for platform differences and sensor biases. Remotely sensed SSTs were obtained through the Advanced Very High Resolution Radiometer and interpolated daily onto a 0.25° × 0.25° spatial grid globally. Data were provided by the NOAA/OAR/ESRL PSD, Boulder, Colorado, USA, from their Web site at https://www.esrl.noaa.gov/psd/ accessed in January 2019.
I also obtained data from simulation models implemented in the fifth phase of the Coupled Model Intercomparison Project (CMIP5). I used the first ensemble member r1i1p1 of 12 coupled Earth System Models that allowed the analysis of variation in daily SSTs in all scenarios: CNMR-CM5, GFDL-CM3, GFDL-ESM2G, GFDLESM2M, IPSL-CM5A-LR, IPSL-CM5A-MR, MPI-ESM-LR, MPI-ESM-MR, MIROC5, MIROC-ESM, MRI-CGCM3, MIROC-ESM-CHEM. A climatology (the statistical properties of the timeseries, including the mean, variance, seasonal cycle and quantiles; see section “Identifying marine heatwaves” below for derivation) was obtained from historical simulations over the period 1861–2005 and used to identify MHWs for the historical scenario and for simulated SSTs over the period 2006–2100 following high and low emission scenarios (RCP 8.5 and RCP 2.6, respectively; RCP, representative concentration pathway). All the analyses on simulated data were implemented on a multi-model ensemble obtained by averaging the twelve models, unless otherwise indicated. For comparisons with the simulated SSTs, the satellite 0.25° × 0.25° data were regridded by averaging daily onto a regular 1° × 1° grid. The climatology for the satellite MHWs was derived from the whole observational period (1982–2018).
Whether a fixed climatology is appropriate instead of using shifting baselines to define MHWs is a matter of debate13,40. Here, the historical scenario provides a common reference to gauge shifts in the spatiotemporal dynamics of projected MHWs under high (RCP 8.5) and low (RCP 2.6) emission scenarios.
Identifying marine heatwaves
I identified MHWs from daily observed and simulated SST timeseries within each 1° × 1° cell following Hobday et al.27, who define a MHW as an anomalously warm water event with daily SSTs exceeding the seasonally varying 90th percentile (climatological threshold) for at least 5 consecutive days. The climatological mean and threshold were computed for each calendar day within a 11-day window centered on the focal day across all years within the climatological period. The mean and threshold were further smoothed by applying a 31-day moving average. Two events with a break of less than 3 days were considered the same MHW. I then derived characteristic metrics of MHWs, including duration, intensity and frequency and linked them to network properties (see below, Network analysis). Only SST timeseries with less than 10% of missing data were used in the analysis. I used the R package heatwaveR to identify marine heatwaves from SSTs41.
Topological data analysis and the Mapper workflow
Topological Data Analysis (TDA) is a collection of statistical methods based on topology, the field of mathematics that deals with the study of shapes, to find structure in complex datasets24. The Mapper algorithm is one tool of TDA that allows reducing high-dimensional data into a combinatorial object that encapsulates the original topological and geometric information of the data, such that points close to each other are more similar than distant points. The combinatorial object, also called a shape graph, is indeed a network with nodes and edges. The statistical properties of the TDA-based Mapper algorithm and how it relates to other non-linear dimensionality reduction techniques have been discussed in Ref.22. Here, I briefly summarize the five key steps involved in a Mapper analysis (Fig. 1). The first step of MAPPER consists of collapsing a raster stack of spatiotemporal data of MHWs into a binary 2D matrix where rows are timeframes (days) and columns are 1° × 1° cells arranged sequentially to represent the occurrence of MHWs across the global ocean. The first column of the matrix corresponds to the upper-left pixel of the raster centered at 89.5°N and − 180°W and the subsequent 364 columns represent adjacent pixels within the same latitude. Column 366 is centered at 88.5°N and − 180°W and so on, with the last column of the matrix corresponding to the lower-right pixel at − 89.5°S and 180°E. Although this scheme would result in matrices with 64,800 columns (360 × 180), I used reduced matrices in computations by excluding pixels on land or where missing SST values prevented the identification of MHWs. The final size of the matrices used in the analysis is (rows × columns) 13,514 × 42,365 for observed SSTs and 52,960 × 41,968, 34,675 × 41,074 and 34,675 × 41,482 for simulated SSTs under the historical, RCP 2.6 and RCP 8.5 scenarios, respectively.
The second step of Mapper involves dimensionality reduction or filtering. I used the Uniform Manifold Approximation and Projection dimensionality reduction (UMAP) algorithm to perform nonlinear dimensionality reduction42. This algorithm is similar to t-distributed Stochastic Neighbor Embedding (tSNE), which is widely used in machine learning43. The advantage of UMAP is that it has superior run time performance compared to tSNE, while retaining the ability to preserve the local structure of the original high-dimensional space after projection into the low-dimensional space.
The third step of Mapper consists of dividing the output range generated by the filtering process into overlapping bins. The number of bins and the amount of overlap are determined by the resolution (R) and gain (G) parameters, respectively. I used an optimization procedure to objectivity identify the combination of parameters R and G that best localized timeframes with similar cumulative intensity of MHWs nearby in the network (see “Parameter search and sensitivity analysis”). This procedure selected R = 24 and G = 45 for observed SSTs and for the RCP 8.5 scenario, R = 22 and G = 45 for the historical scenario and R = 12 and G = 25 for the RCP 2.6 scenario.
The fourth step of Mapper consists of partial clustering of timeframes within bins. Although Mapper is flexible and can accommodate different clustering methods and distance functions, I employed single-linkage clustering with Euclidean distance25. It is worth noting that this approach does not involve averaging of timeframes within clusters, so the original information is preserved in a compressed representation of the data.
The fifth and final step involves the generation of the network graph from the low dimensional compressed representation of the data. Clusters become nodes in the network and nodes become connected if they share one or more timeframes. I implemented the TDA-based Mapper algorithm using a parallelized version of function mapper2D in the R package TDAmapper44.
Network analysis
I employed two widely used measures of network topology, modularity and node degree, to compare the structure of the four MHW networks. Modularity describes the strength of division of a network into communities—i.e. cohesive groups of nodes that have dense connections among them and that are only sparsely connected with nodes in other groups. High modularity indicates the presence of distinct regimes of spatiotemporal dynamics of MHWs. As a second measure of network structure, I used mean node degree—where the degree of a node is the number of edges that are incident to that node. High mean node degree indicates that many nodes share one or more timeframes and depicts similar spatiotemporal patterns of MHWs within those nodes. In contrast, low node degree indicates the occurrence of many isolated nodes with few timeframes in common and more isolated MHWs. I computed modularity and node degree with functions ‘modularity’ and ‘degree’ in the R package igraph45.
To provide significance tests for the observed measures of network topology and to evaluate if they originated simply from non-stationarity properties in the original data, I run two null models based on surrogate data for each of the four networks. Surrogate data can be obtained through the Fourier transform of the original timeseries, shuffling the phases and applying the inverse transform to generate the surrogate series46. Phase randomization preserves the power spectrum, autocorrelation function and other linear properties of the data, but not the amplitude distribution. To address this potential drawback, I generated surrogate timeseries via the Theiler’s Amplitude Adjusted Fourier Transform (AAFT) using function ‘surrogate’ in the R package fractal, which also preserves the amplitude distribution of the original timeseries47. Using this approach, I applied two schemes of randomization—one employing a random sequence for each timeseries and one employing the same sequence for all timeseries. Randomizing using a fixed sequence for all timeseries (constant phase) randomizes the nonlinear properties of the data while preserving linear properties, such as the linear cross-correlation function. The randomization scheme based on random sequences (random phase) also disrupts linear relationships in the data. A significant departure of the observed statistic from the null model under constant phase randomization allows rejecting the null hypothesis that the observed time series is a monotonic nonlinear transformation of a Gaussian process. A significant departure from the null model under random phase allows rejecting also the null hypothesis that the original data come from a linear Gaussian process. To assess significance, 1000 randomizations were performed for each network under each scheme of random and constant phase and a two-tailed test was performed at α = 0.025 to account for multiple testing (Bonferroni correction).
To quantify temporal transitions of MHWs, I estimated node degree of the temporal connectivity matrix (TCM) obtained from each network. The degree for each node in the TCM was estimated by counting the number of non-zero edges connected to that node22. Temporal fluctuations in node degree were benchmarked against the confidence intervals of the random phase null model, estimated as twice the standard deviation of the null distribution. A Generalized Additive Model (GAM) smooth function was fitted to node degree data to visualize temporal trends. The timing of collapse of node degree for the historical scenario was estimated as the year when the smooth curve intersected the upper confidence limit of the null model. To provide a measure of uncertainty, I obtained analogous estimates of the year of collapse by repeating the whole analysis for each of the twelve ESMs separately and computing the median and the bootstrap standard error (n = 1000) of these estimates. A similar analysis was done on the RCP 8.5 data to estimate the year when node degree increased again and diverged significantly from the null distribution. To determine the duration of the different period of connectivity identified in the TCMs, I used the change point algorithm implemented in function cpt.mean of package changepoint48.
Parameter search and sensitivity analysis
To objectively identify parameters R and G (resolution and gain) as part of the binning process in the Mapper algorithm, I used an optimization procedure that best localized timeframes with similar patterns of MHWs nearby in the network. Localization can be done for any of the properties of MHWs. I used cumulative intensity as the localization criterion since it was a good proxy for other properties of MHWs, such as duration (r = 0.88, p More
