Widespread homogenization of plant communities in the Anthropocene
Estimating native plant species’ distributionsWe used the newly developed species database, GreenMaps, to estimate native plant species’ distributions59. GreenMaps includes global distribution maps for ~230,000 vascular plant species. Maps were generated using species distribution models – the statistical estimation of species geographic distributions based on only some known occurrences and environmental conditions – derived from carefully curated species occurrence records. Occurrence records were obtained from a variety of sources, including herbarium specimens, primary literature, personal observation, and online data repositories including the Global Biodiversity Information Facility60,61,62, and Integrated Digitized Biocollections (https://www.idigbio.org/). These records were thoroughly cleaned to reconcile names to follow currently accepted taxonomies [e.g., World Flora Online (www.worldfloraonline.org)], and to remove duplicates and records with doubtful or imprecise localities. Two stringent spatial filters were employed to restrict species’ distributions to their known native ranges (i.e., realized niches) and to prevent erroneous records and predictions in areas that contain suitable habitat but are unoccupied by the species (i.e., fundamental niche). First, we applied the spatial constraint, APGfamilyGeo, which are expert drawn occurrence polygons (“expert maps”) of plant family distributions63,64 (see Data availability) to restrict species to within these distributions. Second, we applied GeoEigenvectors, which are orthogonal variables representing spatial relationships among cells in a grid, encompassing the geometry of the study region at various scales65. For the latter, we generated a pairwise geographical connectivity matrix among grid cells to establish a truncation distance for the eigenvector-based spatial filtering, returning a total of 150 spatial filters. These filters were then resampled to the same resolution as the input environmental variables, and were included with the bioclimatic variables in the species distribution modeling. Bioclimatic variables were derived from WorldClim66 for a total of 19 variables (Supplementary Table 1). Species distribution models (SDMs) were fitted using four different algorithms: generalized linear models (GLM), generalized boosted models (GBM), maximum entropy (MaxEnt), and random forests (RF) with a binomial error distribution (with logit link). Model settings were chosen to yield intermediately complex response surfaces. Model performance was evaluated using area under the receiver operating curve (AUC) and true skill statistic (TSS) scores. AUC scores range from 0 to 1 and should be maximized whereas TSS scores range from −1 to 1. Prior to model building, all predictor variables were standardized. Univariate variable importance for each predictor was assessed in a 5-fold spatial block cross-validation design. The ensemble predictions from species distribution models were derived using un-weighted ensemble means. Predictive model performance was assessed using a 5-fold spatial block cross-validation. We generated a total of 230,000 range maps, representing species within 382 families at a resolution of 50 × 50 km which was also resampled to 100 × 100 km. To our knowledge, this makes it the largest and only global assessment of geographic distributions for plants at the species-level. Our approach of modeling species distributions follows the guidelines of ODMAP (Overview, Data, Model, Assessment, Prediction), a comprehensive framework of best practices for reporting species distribution models67 (see Supplementary Material 1). These maps were stacked and converted to a community matrix for downstream analyses. We also provide a new R function, sdm, for performing the SDMs across four algorithms (random forest, generalized linear models, gradient boosted machines, and MaxEnt) tailored for SDMs of large datasets. The sdm function is included in our R package phyloregion68 along with improved documentation and vignettes to show practical application of this functionality under various modeling scenarios. The sdm function was designed with multiple checks such that any species that did not meet one or more checks were filtered out. A feature of novelty of the sdm function is the addition of an algorithm that allows a user to exclude records that occur within a certain distance to herbaria, museums or other infrastructure. By default, we used the most updated version of Index Herbariorum, a global directory of herbaria69, but a user has the option to specify their own infrastructure to exclude.We validated the output distribution maps against the Kew Plants of the World Online database (POWO; http://www.plantsoftheworldonline.org/), which includes native distribution maps for all plants of the world within major biogeographically defined areas recognized by the Biodiversity Information Standards (also known as the Taxonomic Databases Working Group (TDWG))70. Although the Kew’s distributions of native species are largely based on state/province level such that if a species was observed in any location within a state the whole state is marked as its distribution range, our GreenMaps approach only used the Kew distributions to restrict modeled species distributions within such biogeographic areas. See ref. 59 for full description of the workflow. The range map rasters were converted to a community matrix using the function raster2comm in our new R package phyloregion68 for downstream analysis.Estimating non-native plant species’ distributionsWe used the Global Naturalized Alien Flora (GloNAF) database version 1.271,72 to compile a checklist of non-native species, including documented records of alien plants that have dispersed into new regions largely by humans, and which have become successfully naturalized73,74. The dataset includes non-native species distributions within TDWG regions. We generated species’ distributions for these species using the GreenMaps approach59 described above, but removing the spatial filters APGfamilyGeo and GeoEigenvectors. The non-native species ranges were modeled using occurrences that fell outside the boundaries of the native range of each species as determined by Plants of the World Online (POWO). Specifically, we used the following R code to subset occurrences falling outside of POWO as follows:$$y , < -!x[!complete.cases(sp::over(x,powo)),]$$ (1) where x is a data frame of occurrence of a species, and powo a shapefile of the native range of the species. We then used the output y to model the distribution of non-native species using the sdm function in the R package phyloregion68. We validated our non-native species distribution models against the GloNAF dataset by overlaying grid cells of non-native species predictions within GloNAF’s TDWG levels, and selecting only those projected occurrences that fell within the naturalized range indicated by GloNAF. Such approach allowed us to capture the precise distribution of the non-native species within a state/province as opposed to broadly scoring them present or absent in a state/province as did GloNAF. From our dataset of non-native species, we also identified ‘superinvasives’, here defined as non-native species with 1.5× the interquartile range above the third quartile of their invaded range size within a TDWG region.Recently extinct and threatened plant speciesWe compiled information on recent plant extinctions and conservation status of each mapped species. Our dataset of recent extinctions comes from a dataset that includes 1065 plant species that have become extinct since Linnaeus’ Species Plantarum75, derived from a comprehensive literature review and assessments of the International Union for Conservation of Nature (IUCN) Red List of Threatened Species26,76. We also explored alternative scenarios of increasing future extinction intensity, considering future losses of currently extant native species, some of which are not currently recognized as of global concern (data from ref. 27). For the latter analysis, we compiled information on the conservation status of each species and apply the term ‘extinction’ loosely, which included both native species lost from a region as well as native species that may still be present in some part or all of their native ranges, but they are unlikely to remain so in the near future if current trends continue (see ref. 27). This dataset comes from machine-learning predictions of conservation status for over 150,000 land plant species27 defined as the probability of each species as belonging to a Red List non-Least Concern category (i.e., likely of being at risk on some level) based on geographic, environmental, and morphological trait data, variables that are key in predicting conservation risk27. For our purposes here, we assumed that Least Concern species were not at risk of extinction; although we recognize that a substantial proportion of these species may in fact be endangered27,77. Within this framework, extinction risk is defined using the expected probability of extinction over 100 years of each taxon78, scaled as follows: Least Concern = 0.001, Near Threatened and Conservation Dependent = 0.01, Vulnerable = 0.1, Endangered = 0.67, and Critically Endangered = 0.999. We used these statistical projections to estimate future extinction scenarios because they can be fit to over 150,000 land plant species, whereas formal IUCN Red List assessments are currently available for only 33,573 plant species (March 15, 2020).The final dataset used for our analysis included 205,456 native species, 1065 recently extinct species, extinction projections for 150,000 species, and 10,138 naturalized species.Phylogenetic dataWe applied the dated phylogeny for seed plants of the world from ref. 79, which includes 353,185 terminal taxa. The ref. 79 phylogeny was assembled using a hierarchical clustering analysis of DNA sequence data of major seed plant clades and was resolved using data from the Open Tree of Life project. This represents one of the most comprehensive phylogenies of vascular plants at a global scale and includes all species in our analysis. It also provides divergence time estimates to facilitate downstream analytics.Data analysisWe quantified changes in alpha and beta diversity between the Holocene (native species’ assemblages in each region before widespread migration by humans as initiated by the Columbian Exchange circa 149216) and Anthropocene (non-native naturalizations, and recent past and projected plant extinctions)26 epochs across 100 × 100 km grid cells within major biogeographically defined areas recognized by the Biodiversity Information Standards (also known as the Taxonomic Databases Working Group (TDWG))70. These TDWG geographic regions correspond to continents, countries, states and provinces. We then explored differences in biotic homogenization under varying future scenarios of extinction including naturalizations only, ‘no superinvasives’, ‘best case’ ‘business as usual’, ‘increased extinction’ and ‘worst case’. Our definition of best case refers to recent plant extinctions and naturalizations, and assumes no future extinctions, business as usual assumes loss of Critically Endangered (CR) species, increased extinction assumes loss of Critically Endangered (CR) and Endangered (EN) species, and the worst case scenario assumes loss of all threatened species. Because biodiversity patterns are scale dependent, varying along spatial grains and geographic extents80,81, we repeated all analyses at spatial grid resolution of 50 × 50 km.Temporal changes in α-diversity across plant communitiesFor each grid cell, temporal and spatial change in α-diversity was quantified as the difference in species (or phylogenetic) diversity between the Anthropocene (j) and Holocene (i) periods (see above) expressed as:$$varDelta alpha =(alpha j-alpha i)/alpha i$$ (2) Negative Δα values imply that alpha diversity has decreased and positive values indicate increased alpha diversity. Species α-diversity was calculated as the total count of species in each grid cell. Phylogenetic α-diversity was computed as the sum of the phylogenetic branch lengths connecting species from the tip to the root of a dated phylogenetic tree in each grid cell82. We also assessed changes in phylogenetic (α) diversity standardized for species richness by calculating standard effects sizes of phylogenetic diversity in communities by shuffling the tips in the phylogeny based on 1000 randomizations. For each iteration of the randomization, the analysis was regenerated using the same set of spatial conditions, but using the randomized version of the tree after which the z-score for each index value was calculated (observed - expected)/sqrt (variance). Temporal changes in α-diversity was assessed at the spatial grain resolution of 50 and 100 km to account for the effects of scale.Temporal changes in compositional turnover across florasWithin TDWG geographic regions, we generated pairwise distance matrices of phylogenetic β-diversity (βphylo)83 and species β-diversity (βtax) between all pairs of grid cells, and compared Holocene and Anthropocene epochs. We used Simpson’s index for quantifying compositional turnover because it is insensitive to differences in total diversity among sites84,85. The phylogenetic equivalent, βphylo, represents the proportion of shared phylogenetic branch lengths between cells, and ranges from 0 (species sets are identical and all branch lengths are shared) to 1 (species sets share no phylogenetic branches). We calculated change in compositional turnover (Δβ) as:$$varDelta beta =(beta j-beta i)/beta i$$ (3) where j is the Anthropocene species pool and i refers to the Holocene species composition. Negative Δβ values imply that taxonomic/phylogenetic similarity has increased (i.e., biotic homogenization) and positive values indicate biotic differentiation. To assess sensitivity to our choice of diversity index, we re-ran all analyses using Sorensen and Jaccard dissimilarity indices. All (phylogenetic) β-diversity metrics were calculated using our new R package phyloregion68.Effect of superinvasive speciesTo determine the extent to which a small number of superinvasive non-native species may be driving patterns of homogenization, we re-ran the main analyses described above, but excluded non-native species with the widest ranges within biomes, i.e., species that are more than 1.5× the interquartile range above the third quartile of (invaded) range sizes (i.e., statistical outliers) within TDWG regions. Our definition of range size corresponds to the number of grid cells occupied by a species.Phylogenetic structure of naturalizationsWe evaluated whether naturalized species were more likely to have become naturalized in recipient communities in the absence of close relatives—Darwin’s naturalization hypothesis—by comparing the mean phylogenetic distance between each non-native species and its nearest phylogenetic neighbor in the recipient flora. Larger mean phylogenetic distances indicate that non-native species tend to be less closely related to the native flora. We first ran each analysis on a set of 100 trees. Significance was assessed by comparing the distribution of observed phylogenetic distances to a null model shuffling non-native status randomly on the tips of the phylogeny (1000 replicates) as implemented in the R package phyloregion68.Drivers of change in composition across florasTo relate change in alpha and beta diversity to possible external drivers, we obtained three sets of variables for each site: (i) ecological: mean annual precipitation (MAP), mean annual temperature (MAT), and elevation; (ii) evolutionary: range size (as proxy for dispersal potential, defined as the average range size across species within a grid cell); and (iii) anthropogenic: wilderness index (inverse of human footprint index). MAP, MAT, and elevation were obtained from the WorldClim database66; the geographic range of each species was calculated as the number of cells a species occupied. The Wilderness Index was obtained from ref. 86, and describes the degree to which a place is remote from and undisturbed by the influences of modern society86. These variables were converted to Behrmann equal-area projection using the function projectRaster in the R package raster87.We used a linear mixed effects (LME) model of temporal change in, separately, species (α) richness, phylogenetic (α) diversity, phylogenetic (α) diversity standardized for richness, β-diversity, and phylogenetic β-diversity between the Anthropocene and Holocene, against ecological, evolutionary and anthropogenic variables as predictors. We used level 3 regions as recognized by the Biodiversity Information Standards as a random effect, allowing us to account for idiosyncratic differences between regions. Changes in metrics of β-diversity were applied to grid cells by taking the average dissimilarity to other cells within a region as defined by the TDWG level 3 biomes, whereas changes in metrics of α diversity were applied directly to grid cells. We also included a spatial covariate of geographical coordinates as an additional predictor variable to account for spatial autocorrelation. Our model can be formulated as follows:$${triangle }_{i}=beta 0+beta 1,{{MAT}}_{i}+beta 2,{{MAP}}_{i}+beta 3,{{elevation}}_{i}+beta 4,{{range}{{{{{rm{_}}}}}}{_size}}_{i}+beta 5,{{wilderness}}_{i}+{e}_{i}$$ (4) where ∆i is the temporal diversity change (temporal changes in metrics of α or β diversity) between the Anthropocene and Holocene in grid cell i, β0 to β5 are fixed effect parameters, and ei is residual error. The LME model was fitted using the lme function in the nlme R package88.A vignette, with a worked example, data and R codes describing all the steps for the analyses, is also provided on Dryad (https://doi.org/10.5061/dryad.f4qrfj6st).Reporting summaryFurther information on research design is available in the Nature Research Reporting Summary linked to this article. More