Global and regional ecological boundaries explain abrupt spatial discontinuities in avian frugivory interactions

Dataset acquisition

Plant-frugivore network data were obtained through different online sources and publications (Supplementary Table 1). Only networks that met the following criteria were retrieved: (i) the network contains quantitative data (a measure of interaction frequency) from a location, pooling through time if necessary; (ii) the network includes avian frugivores. Importantly, we removed non-avian frugivores from our analyses because only 28 out of 196 raw networks (before data cleaning) sampled non-avian frugivores, and retaining them would generate spurious apparent turnover between networks that did vs. did not sample those taxa. In addition, the removal of non-avian frugivores did not strongly decrease the number of frugivores in our dataset (Supplementary Fig. 20a) or the total number of links in the global network of frugivory (Supplementary Fig. 20b). Furthermore, non-avian frugivores, as well as their interactions, were not shared across ecoregions and biomes (Supplementary Fig. 21), so their inclusion would only strengthen the results we found (though as noted above, we believe that this would be spurious because they are not as well sampled); (iii) the network (after removal of non-avian frugivores) contains more than two species in each trophic level. Because this size threshold was somewhat arbitrary, we used a sensitivity analysis to assess the effect of our network size threshold on the reported patterns (see Sensitivity analysis section in the Supplementary Methods and Supplementary Figs. 22–24); and (iv) network sampling was not taxonomically restricted, that is, sampling was not focused on a specific taxonomic group, such as a given plant or bird family.
Note, however, that authors often select focal plants or frugivorous birds to be sampled, but this was not considered as a taxonomic restriction if plants and birds were not selected based on their taxonomy (e.g., focal plants were selected based on the availability of fruits at the time of sampling, or focal birds were selected based on previous studies of bird diet in the study site). The first source for network data was the Web of Life database⁴², which contains 33 georeferenced plant-frugivore networks from 28 published studies, of which 12 networks met our criteria.

We also accessed the Scopus database on 04 May 2020 using the following keyword combination: (“plant-frugivore*” OR “plant-bird*” OR “frugivorous bird*” OR “avian frugivore*” OR “seed dispers*”) AND (“network*” OR “web*”) to search for papers that include data on avian frugivory networks. The search returned a total of 532 studies, from which 62 networks that met the above criteria were retrieved. We also contacted authors to obtain plant-frugivore networks that were not publicly available, which provided a further 110 networks. The remaining networks (N = 12) were obtained by checking the database from a recently published study¹². In total, 196 quantitative avian frugivory networks were used in our analyses.

Generating the distance matrices to serve as predictor and response variables

Ecoregion and biome distances

We used the most up-to-date (2017) map of ecoregions and biomes³, which divides the globe into 846 terrestrial ecoregions nested within 14 biomes, to generate our ecoregion and biome distance matrices. Of these, 67 ecoregions and 11 biomes are represented in our dataset (Supplementary Figs. 1 and 2). We constructed two alternative versions of both the ecoregion and biome distance matrices. In the first, binary version, if two ecological networks were from localities within the same ecoregion/biome, a dissimilarity of zero was given to this pair of networks, whereas a dissimilarity of one was given to a pair of networks from distinct ecoregions/biomes (this is equivalent, up to a constant scaling factor, to calculating the Euclidean distance on a presence–absence matrix with networks in rows and ecoregions/biomes in columns).
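As a minimal illustration of the binary version described above, the following Python sketch assigns zero to pairs of networks in the same ecoregion and one otherwise. It is not the authors' code (the analyses were run in R), and the ecoregion labels are hypothetical placeholders.

```python
def binary_distance_matrix(labels):
    """Return an N x N matrix: 0 if two networks share a label, 1 otherwise."""
    return [[0 if a == b else 1 for b in labels] for a in labels]

# Hypothetical ecoregion assignment for four local networks
ecoregion_of_network = ["ecoregion_A", "ecoregion_A", "ecoregion_B", "ecoregion_A"]
ecoregion_dist = binary_distance_matrix(ecoregion_of_network)

for row in ecoregion_dist:
    print(row)
```

The same function applies unchanged to biome labels.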

In the second, quantitative version, we estimated the pairwise environmental dissimilarity between our ecoregions and biomes using six environmental variables recently demonstrated to be relevant in predicting ecoregion distinctness, namely mean annual temperature, temperature seasonality, mean annual rainfall, rainfall seasonality, slope and human footprint³⁸. We obtained climatic and elevation data from WorldClim 2.1⁴³ at a spatial resolution of 1-km². We transformed the elevation raster into a slope raster using the terrain function from the raster package⁴⁴ in R⁴⁵. As a measure of human disturbance, we used human footprint—a metric that combines eight variables associated with human disturbances of the environment: the extent of built environments, crop land, pasture land, human population density, night-time lights, railways, roads and navigable waterways²⁶. The human footprint raster was downloaded at a 1-km² resolution²⁶. Because human footprint data were not available for one of our ecoregions (Galápagos Islands xeric scrub), we estimated human footprint for this ecoregion by converting visually interpreted scores into the human footprint index. We did this by analyzing satellite images of the region and following a visual score criterion²⁶. Given the previously demonstrated strong agreement between visual score and human footprint values²⁶, we fitted a linear model using the visual score and human footprint data from 676 validation plots located within the Deserts and xeric shrublands biome – the biome in which the Galápagos Islands xeric scrub ecoregion is located – and estimated the human footprint values for our own visual scores using the predict function in R⁴⁵.
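The conversion of visual scores into human footprint values can be sketched as an ordinary least-squares fit followed by prediction. The Python example below is only an illustration under stated assumptions: the calibration values are invented, the function names are hypothetical, and the original analysis used a linear model and the predict function in R.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

def predict(intercept, slope, new_x):
    """Predicted response for a new predictor value."""
    return intercept + slope * new_x

# Invented calibration data: visual scores and matching human footprint values
visual_scores = [0.0, 2.0, 4.0, 6.0]
footprint_values = [1.0, 5.0, 9.0, 13.0]

intercept, slope = fit_line(visual_scores, footprint_values)
estimated_footprint = predict(intercept, slope, 3.0)  # footprint for a new visual score
print(intercept, slope, estimated_footprint)
```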

We used 1-km² resolution rasters and the extract function from the raster package⁴⁴ to calculate the mean value of each of our six environmental variables for each ecoregion in our dataset. Because biomes are considerably larger than ecoregions (which makes obtaining environmental data for biomes more computationally expensive) we used a coarser spatial resolution of 5-km² for calculating the mean values of environmental variables for each biome. Since a 5-km² resolution raster was not available for human footprint, we transformed the 1-km² resolution raster into a 5-km² raster using the resample function from the same package.

To combine these six environmental variables into quantitative matrices of ecoregion and biome environmental dissimilarity, we ran a Principal Component Analysis (PCA) on our scaled multivariate data matrix (where rows are ecoregions or biomes and columns are environmental variables). From this PCA, we selected the scores of the first four (ecoregions) and first three (biomes) principal components, which represented 89.6% and 88.7% of the variance, respectively, and converted them into a distance matrix by calculating the Euclidean distance between pairs of ecoregions/biomes using the vegdist function from the vegan package⁴⁶. Finally, we transformed the ecoregion or biome distance matrix into an N × N matrix where N is the number of local networks. In this matrix, cell values represent the pairwise environmental dissimilarity between the ecoregions/biomes where the networks are located. The main advantage of using this quantitative approach is that, instead of simply evaluating whether avian frugivory networks located in distinct ecoregions or biomes are different from each other in terms of network composition and structure (as in our binary approach), we were also able to determine whether the extent of network dissimilarity depended on how environmentally different the ecoregions or biomes are from one another.
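The last two steps above (Euclidean distance on the retained component scores, then expansion to an N × N matrix of networks) can be sketched as follows. This Python example assumes the principal component scores are already available; the scores, labels and function names are hypothetical, and the PCA itself is not reproduced.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two score vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def network_level_matrix(region_of_network, scores):
    """Expand region-level distances to an N x N matrix over local networks."""
    return [
        [euclidean(scores[r_i], scores[r_j]) for r_j in region_of_network]
        for r_i in region_of_network
    ]

# Hypothetical principal component scores (two retained axes for brevity)
pc_scores = {"ecoregion_A": [0.0, 0.0], "ecoregion_B": [3.0, 4.0]}
# Hypothetical ecoregion of each of three local networks
region_of_network = ["ecoregion_A", "ecoregion_B", "ecoregion_A"]

env_dist = network_level_matrix(region_of_network, pc_scores)
print(env_dist)
```

Networks in the same ecoregion receive a dissimilarity of zero, as in the binary version, whereas networks in different ecoregions receive a value that grows with their environmental difference.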

Local-scale human disturbance distance

To generate our local human disturbance distance matrix, we extracted human footprint data at a 1-km² spatial resolution²⁶ and calculated the mean human footprint values within a 5-km buffer zone around each network site. For the networks located within the Galápagos Islands xeric scrub ecoregion (N = 4), we estimated the human footprint index using the same method described in the previous section for ecoregion- or biome-scale human footprint. We then calculated the pairwise Euclidean distance between human footprint values from our network sites. Thus, low cell values in the local human disturbance distance matrix indicate pairs of network sites with a similar level of human disturbance, while high values represent pairs of network sites with very different levels of human disturbance.

Spatial distance

The spatial distance matrix was generated using the Haversine (i.e., great circle) distance between all pairwise combinations of network coordinates. In this matrix, cell values represent the geographical distance between network sites.
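For reference, a minimal Python sketch of the Haversine distance is given below. The mean Earth radius of 6371 km is an assumed constant, and the site coordinates are placeholders rather than real network locations.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    # min() guards against rounding pushing sqrt(a) marginally above 1
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))

# Placeholder coordinates (latitude, longitude) for three network sites
sites = [(0.0, 0.0), (0.0, 90.0), (-3.0, 37.0)]
spatial_dist = [[haversine_km(*p, *q) for q in sites] for p in sites]
print(spatial_dist[0][1])
```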

Elevational difference

We calculated the Euclidean distance between pairwise elevation values (estimated as meters above sea level) of network sites to generate our elevational difference matrix. Elevation values were obtained from the original sources when available or using Google Earth⁴⁷. In the elevational difference matrix, low cell values represent pairs of network sites within similar elevations, whereas high values represent pairs of network sites within very different elevations.

Network sampling dissimilarity

We used the metadata retrieved from each of our 196 local networks to generate our network sampling dissimilarity matrices, which aim to control statistically for differences in network sampling. There are many ways in which sampling effort could be quantified, so we began by calculating a variety of metrics, then narrowed our options by assessing which of these was most related to network metrics. We divided the sampling metrics into two categories: time span-related metrics (i.e., sampling hours and months) and empirical metrics of sampling completeness (i.e., sampling completeness and sampling intensity), which aim to account for how complete network sampling was in terms of species interactions (Supplementary Table 2).

We selected the quantitative sampling metrics to be included in our models based on (i) the fit of generalized linear models evaluating the relationship between number of sampling hours and sampling months of the study and network-level metrics (i.e., bird richness, plant richness and number of links), and (ii) how well time span-related metrics, sampling completeness and sampling intensity predicted the proportion of known interactions that were sampled in each local network (hereafter, ratio of interactions) for a subset of the data. This latter metric, defined as the ratio between the number of interactions in the local network and the number of known possible interactions in the region involving the species in the local network, captures raw sampling completeness. Therefore, ratio of interactions estimates, for a given set of species, the proportion of all their interactions known for a region that are found to occur among those same species in the local network. To calculate this metric, we needed high-resolution information on the possible interactions, so we used a subset of 14 networks sampled in Aotearoa New Zealand, since there is an extensive compilation of frugivory events recorded for this country⁴⁸. After this process, we selected number of sampling hours, number of sampling months and sampling intensity for inclusion in our statistical models (Supplementary Figs. 7 and 8; Supplementary Table 2). We generated the corresponding distance matrices by calculating the Euclidean distance between metric values. Similarly, we generated a Euclidean distance matrix for differences in sampling year between pairs of networks, which aims to account for long-term changes in the environment, species composition and network sampling methods. We obtained the sampling year of our local networks from the original sources and calculated the mean sampling year value for those networks sampled across multiple years.
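The ratio of interactions defined above can be illustrated with a short Python sketch. It reflects one reading of the definition (the denominator counts regionally known interactions whose plant and bird both occur in the local network, plus any locally observed link missing from the regional list); species names and data are hypothetical.

```python
def ratio_of_interactions(local_links, regional_links):
    """Share of regionally known links among local species that were observed locally."""
    local_links = set(local_links)
    plants = {plant for plant, _ in local_links}
    birds = {bird for _, bird in local_links}
    # Regionally known interactions restricted to species present in the local network
    possible = {(p, b) for p, b in regional_links if p in plants and b in birds}
    # Assumption: a locally observed link always counts as a known interaction
    possible |= local_links
    return len(local_links) / len(possible)

# Hypothetical (plant, bird) links
local = [("plant_1", "bird_1"), ("plant_2", "bird_2")]
regional = [
    ("plant_1", "bird_1"),
    ("plant_1", "bird_2"),
    ("plant_2", "bird_2"),
    ("plant_3", "bird_1"),  # ignored: plant_3 is absent from the local network
]

ratio = ratio_of_interactions(local, regional)
print(ratio)
```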

Because sampling methods, such as sampling design, focus (i.e., focal taxa, which determines whether a zoocentric or phytocentric method was used), interaction frequency type (i.e., how interaction frequency was measured) and coverage (total or partial) might also affect the observed plant-frugivore interactions⁴⁹, we combined these variables into a single distance matrix to estimate the overall differences in sampling methods between networks. Because most of these variables were categorical with multiple levels (Supplementary Table 3), we generated our methods dissimilarity matrix by using a generalization of Gower’s distance method⁵⁰, which allows the treatment of different types of variables when calculating distances. For this, we used the dist.ktab function from the ade4 package⁵¹. We ran a Principal Coordinates Analysis (PCoA) on this distance matrix, selected the first four axes, which explained 81.2% of the variation in methods dissimilarity, and calculated the Euclidean distance between pairs of networks using the vegdist function from the vegan package⁴⁶ in R⁴⁵.
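To convey the idea behind a Gower-type distance for categorical variables, the Python sketch below computes the proportion of variables on which two networks differ. This is the basic categorical case only, not the generalized method implemented in dist.ktab, and the variable levels are hypothetical.

```python
def gower_categorical(a, b):
    """Basic Gower distance for categorical records: share of mismatching variables."""
    if len(a) != len(b):
        raise ValueError("records must describe the same variables")
    mismatches = sum(1 for x, y in zip(a, b) if x != y)
    return mismatches / len(a)

# Hypothetical records: (design, focus, interaction frequency type, coverage)
methods = [
    ("transect", "phytocentric", "visits", "total"),
    ("transect", "zoocentric", "visits", "partial"),
    ("focal", "phytocentric", "fruits_removed", "total"),
]

methods_dist = [[gower_categorical(a, b) for b in methods] for a in methods]
print(methods_dist)
```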

Network dissimilarity

We generated three network dissimilarity matrices to be our response variables in the statistical models. In the first, cell values represent the pairwise dissimilarity in species composition between networks (beta diversity of species; β_S)²⁷. Second, we measured interaction dissimilarity (beta diversity of interactions; β_WN), which represents the pairwise dissimilarity in the identity of interactions between networks²⁷. Importantly, we did not include interaction rewiring (β_OS) in our main analysis because this metric can only be calculated for networks that share interaction partners (i.e., it estimates whether shared species interact differently)²⁷, which limited the number and the spatial distribution of networks available for analysis (but see the Rewiring analysis section for an analysis on the subset of our dataset for which this was possible). Metrics were calculated using the network_betadiversity function from the betalink package⁵² in R⁴⁵.
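The two metrics can be sketched in Python by applying one dissimilarity index to species sets (β_S) and to interaction sets (β_WN). The example assumes Whittaker's index, which reduces to (b + c) / (2a + b + c) for a shared items and b and c items unique to each network; the source does not state which index was used, and the networks shown are hypothetical.

```python
def whittaker_beta(set1, set2):
    """Whittaker's dissimilarity, equal to (b + c) / (2a + b + c)."""
    a = len(set1 & set2)   # items shared by both networks
    b = len(set1 - set2)   # items only in the first network
    c = len(set2 - set1)   # items only in the second network
    return (b + c) / (2 * a + b + c)

def species_of(links):
    """Species set of a network, tagged by trophic level to avoid name clashes."""
    return {("plant", p) for p, _ in links} | {("bird", b) for _, b in links}

# Hypothetical networks as sets of (plant, bird) interactions
network_1 = {("plant_1", "bird_1"), ("plant_2", "bird_1")}
network_2 = {("plant_1", "bird_1"), ("plant_3", "bird_2")}

beta_s = whittaker_beta(species_of(network_1), species_of(network_2))
beta_wn = whittaker_beta(network_1, network_2)
print(beta_s, beta_wn)
```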

Finally, we calculated a third dissimilarity matrix to capture overall differences in network structure. We recognize that there are many potential metrics of network structure, and that many of these are strongly correlated with one another⁵³⁻⁵⁶. We therefore chose a range of metrics that captured the number of links, their relative weightings (including across trophic levels), and their arrangement among species, then combined these into a single distance matrix. Specifically, we quantified network structural dissimilarity using the following metrics: weighted connectance, weighted nestedness, interaction evenness, PDI and modularity.

Weighted connectance represents the number of links relative to the number of possible links, weighted by the frequency of each interaction⁵⁵, and is therefore a measure of network-level specialization (higher values of weighted connectance indicate lower specialization). Importantly, it has been suggested that connectance affects persistence in mutualistic systems⁵⁴. We measured nestedness (i.e., the pattern in which specialist species interact with proper subsets of the species that generalist species interact with) using the weighted version of nestedness based on overlap and decreasing fill (wNODF)⁵⁷. Notably, nested structures have been commonly reported in plant-frugivore networks³³. Interaction evenness is Shannon’s evenness index applied to species interactions and represents how evenly distributed the interactions are in the network²¹,⁵⁸. This metric has been previously demonstrated to decline with habitat modification as a consequence of some interactions being favored over others in high-disturbance environments²¹. PDI (Paired Difference Index) is a measure of species-level specialization on resources and a reliable indicator not only of specialization, but also of absolute generalism⁵⁹. Thus, this metric contributes to understanding the ecological processes that drive the prevalence of specialists or generalists in ecological networks⁵⁹. To obtain a network-level PDI, we calculated the weighted mean PDI for each local network. Finally, we calculated modularity (i.e., the level of compartmentalization within networks) using the DIRTPLAwb+ algorithm⁶⁰. Modularity estimates the extent to which species within modules interact more with each other than with species from other modules⁶¹, and it has been demonstrated to affect the persistence and resilience of mutualistic networks⁵⁴.
All the selected network metrics are based on weighted (quantitative) interaction data, as these have been suggested to be less biased by sampling incompleteness⁶² and to better reflect environmental changes²¹. All network metrics were calculated using the bipartite package⁶³ in R⁴⁵.
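As an example of a weighted metric, interaction evenness can be sketched in Python as Shannon diversity of interaction frequencies divided by its maximum. The sketch normalizes by the logarithm of the number of realized links, which is an assumption: the normalization used in the original analysis is not stated in the source. The matrices are hypothetical.

```python
import math

def interaction_evenness(weights):
    """Shannon evenness of interaction frequencies (rows: plants, columns: birds)."""
    freqs = [w for row in weights for w in row if w > 0]
    if len(freqs) < 2:
        raise ValueError("evenness needs at least two realized links")
    total = sum(freqs)
    shannon = -sum((w / total) * math.log(w / total) for w in freqs)
    # Assumed normalization: log of the number of realized links
    return shannon / math.log(len(freqs))

# Hypothetical interaction matrices
even_network = [[5, 5], [5, 5]]
skewed_network = [[97, 1], [1, 1]]

even_value = interaction_evenness(even_network)
skewed_value = interaction_evenness(skewed_network)
print(even_value, skewed_value)
```

A network dominated by a single interaction yields a value well below that of a network whose interactions are equally frequent.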

We ran a Principal Component Analysis (PCA) on our scaled multivariate data matrix (N × M where N is the number of local networks in our dataset and M is the number of network metrics), selected the scores of the first three principal components, which represented 89.9% of the variance in network metrics, and converted them into a network structural dissimilarity matrix by calculating the Euclidean distance between networks. In this distance matrix, cell values represent differences in the overall architecture of networks (over all the network metrics calculated), and therefore provide a complementary approach for evaluating how species interaction patterns vary across large-scale environmental gradients.

Statistical analysis

We employed a two-tailed statistical test that combines Generalized Additive Models (GAM)²⁹ and Multiple Regression on distance Matrices (MRM)³⁰ to evaluate the effect of each of our predictor distance matrices on our response matrix. With this approach, we were able to fit GAMs where the predictor and response variables are distance matrices, while accounting for the non-independence of distances from each local network by permuting the response matrix³⁰. The main advantage of using GAMs is their flexibility in modeling non-linear relationships through smooth functions, which are represented by a sum of simpler, fixed basis functions that determine their complexity²⁹. Using GAM-based MRM models allowed us to obtain F values for each of the smooth terms (i.e., smooth functions of the predictor variables in our model), and test statistical significance at the level of individual variables. The binary versions of ecoregion and biome distance matrices (with two levels, “same” or “distinct”) were treated as categorical variables in the models, and t values were used for determining statistical significance. We fitted GAMs with thin plate regression splines⁶⁴ using the gam function from the mgcv package²⁹ in R⁴⁵. Smoothing parameters were estimated using restricted maximum likelihood (REML)²⁹. Our GAM-based MRM models were calculated using a modified version of the MRM function from the ecodist package⁶⁵, which allowed us to combine GAMs with the permutation approach from the original MRM function (see Code availability). All models were run with 1000 permutations (i.e., shuffling) of the response matrix.
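The permutation logic can be illustrated with a simplified Python sketch. Rows and columns of the response matrix are shuffled together, so that all distances involving a given network move as a unit, and the observed statistic is compared with its permutation distribution. For brevity the sketch replaces the GAM with a Pearson correlation between the unfolded matrices; it is therefore a stand-in for the general approach, not the modified MRM function, and all data are invented.

```python
import math
import random

def lower_triangle(m):
    """Unfold the lower triangle of a square matrix into a list."""
    return [m[i][j] for i in range(len(m)) for j in range(i)]

def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def permutation_test(response, predictor, n_perm=1000, seed=1):
    """Two-tailed permutation test that shuffles networks in the response matrix."""
    rng = random.Random(seed)
    x = lower_triangle(predictor)
    observed = pearson(x, lower_triangle(response))
    order = list(range(len(response)))
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(order)
        # Apply the same reordering to rows and columns
        permuted = [[response[i][j] for j in order] for i in order]
        if abs(pearson(x, lower_triangle(permuted))) >= abs(observed) - 1e-12:
            extreme += 1
    return observed, (extreme + 1) / (n_perm + 1)

# Invented example: five sites on a line; response equals predictor distance
positions = [0.0, 1.0, 3.0, 7.0, 12.0]
predictor = [[abs(a - b) for b in positions] for a in positions]
response = [row[:] for row in predictor]

observed, p_value = permutation_test(response, predictor)
print(observed, p_value)
```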

We explored the unique and shared contributions of our predictor variables to network dissimilarity using deviance partitioning analyses. These were performed by fitting reduced models (i.e., GAMs where one or more predictor variables of interest were removed) using the same smoothing parameters as in the full model and comparing the explained deviance. We fixed smoothing parameters for comparisons in this way because these parameters tend to vary substantially (to compensate) if one of two correlated predictors is dropped from a GAM.
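For two predictors (or two groups of predictors), the comparison of explained deviance can be written as simple arithmetic, sketched below in Python. The partition shown is the standard two-set scheme and is given as an assumption about how full and reduced models are compared; the deviance values are invented.

```python
def partition_deviance(full, only_a, only_b):
    """Split explained deviance into unique and shared fractions for two predictors.

    full   : deviance explained by the model with both predictors
    only_a : deviance explained by the reduced model containing only A
    only_b : deviance explained by the reduced model containing only B
    """
    return {
        "unique_a": full - only_b,          # lost when A is dropped
        "unique_b": full - only_a,          # lost when B is dropped
        "shared": only_a + only_b - full,   # attributable to either predictor
    }

# Invented proportions of explained deviance
fractions = partition_deviance(full=0.40, only_a=0.25, only_b=0.30)
print(fractions)
```

The three fractions sum to the deviance explained by the full model.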

Assessing the influence of individual studies on the reported patterns

Because our dataset comprises 196 local frugivory networks obtained from 93 different studies, and some of these studies contained multiple networks, we needed to evaluate whether our results were strongly biased by individual studies. To do this, we followed the approach from a previous study⁶⁶ and tested whether F values of smooth terms and t values of categorical variables (binary version of ecoregion and biome distances) changed significantly when jackknifing across studies. We did this by dropping one study from the dataset and re-fitting the models, and then repeating this same process for all the studies in our dataset.
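The leave-one-study-out procedure can be sketched generically in Python. In the sketch, fit_statistic stands for any function that re-fits a model on a subset of networks and returns a statistic; here it is a hypothetical placeholder that returns a simple mean, and the data are invented.

```python
def jackknife_by_study(networks, fit_statistic):
    """Recompute a statistic after dropping each study in turn."""
    studies = sorted({net["study_id"] for net in networks})
    results = {}
    for study in studies:
        subset = [net for net in networks if net["study_id"] != study]
        results[study] = fit_statistic(subset)
    return results

def mean_value(networks):
    """Placeholder for a model re-fit: mean of an invented per-network value."""
    return sum(net["value"] for net in networks) / len(networks)

# Invented networks; study 2 contributes two networks
networks = [
    {"study_id": 1, "value": 1.0},
    {"study_id": 2, "value": 2.0},
    {"study_id": 2, "value": 3.0},
    {"study_id": 3, "value": 6.0},
]

jackknife = jackknife_by_study(networks, mean_value)
print(jackknife)
```

Studies whose removal shifts the statistic most are those with the greatest influence on the result.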

We found a number of consistent patterns within different subsets of the data (Supplementary Figs. 15 and 16); however, some of the patterns we observed appear to be driven by individual studies with multiple networks, and hence are less representative. For instance, the study with the greatest number of networks in our dataset (study ID = 76), which contains 35 plant-frugivore networks sampled across an elevation gradient on Mt. Kilimanjaro, Tanzania⁶⁷, had an overall high influence on the results when compared with the other studies. By re-running our GAM-based MRM models after removing this study from our dataset, we found that the effect of biome boundaries on interaction dissimilarity was no longer significant, whereas the effects of ecoregion boundaries, human disturbance distance, spatial distance and elevational differences remained consistent with those from the full dataset (Supplementary Table 33). Nevertheless, all the results were qualitatively similar to those obtained for the entire dataset when using network structural dissimilarity as the response variable (Supplementary Table 34).

Rewiring analysis

Interaction rewiring (β_OS) estimates the extent to which shared species interact differently²⁷. Because this metric can only be calculated for networks that share species from both trophic levels, we selected a subset of network pairs that shared plants and frugivorous birds (N = 1314) to test whether interaction rewiring increases across large-scale environmental gradients. Importantly, since not all possible combinations of network pairs contained values of interaction rewiring (i.e., not all pairs of networks shared species), a pairwise distance matrix could not be generated for this metric. Thus, we were not able to use the same statistical approach used in our main analysis, which is based on distance matrices (see Statistical analysis section). Instead, we fitted a Generalized Additive Mixed-effects Model (GAMM) using ecoregion, biome, human disturbance, spatial, elevational, and sampling-related distance metrics as fixed effects and network IDs as random effects (to account for the non-independence of distances) (Supplementary Table 35). We also fitted a reduced model with only ecoregion and biome distance metrics as predictor variables (Supplementary Table 36). The binary versions of the ecoregion and biome distance metrics (with two levels, “same” or “distinct”) were used as categorical variables in both models. Interaction rewiring (β_OS) was calculated using the network_betadiversity function from the betalink package⁵² in R⁴⁵. Although it has been recently argued that this metric may overestimate the importance of rewiring for network dissimilarity⁶⁸, our main focus was not the partitioning of network dissimilarity into species turnover and rewiring components, but rather simply detecting whether the sub-web of shared species interacted differently. In this case, β_OS (as developed by ref. 27) is an adequate and useful metric⁶⁸. We fitted our models using the gamm4 function from the gamm4 package⁶⁹ in R⁴⁵.
Smoothing parameters were estimated using restricted maximum likelihood (REML)²⁹.

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.