The origin and evolution of open habitats in North America inferred by Bayesian deep learning models

Data

Spatial and temporal range

We focused on a geographic area that is defined by a cropping window with the corner points P₁ (Lon = −180, Lat = 25) and P₂ (Lon = −52, Lat = 80), covering the majority of the North American continent (e.g., Fig. 3). We focused on the last 30 Myr, a time span encompassing most of our available sites with paleovegetation information (Supplementary Fig. 1). From the following data sources, we only selected those data points that fall within this spatiotemporal range.

Our approach described below required discretizing the input data of past vegetation labels and fossil occurrences into time-bins. For this, we chose the age boundaries of geological stages defined in the International Chronostratigraphic Chart, v2020/03⁴⁵, since these stages are expected to represent meaningful temporal units for analyzing both faunal and floral patterns. A total of 17 geological stages fell within our selected time frame of the last 30 Myr. We discretized the ages of all data points (vegetation data and fossil occurrences) that fell within a given stage by setting them to the midpoint of the respective stage.

Paleovegetation data

We reviewed a large body of peer-reviewed literature containing paleovegetation reconstructions and compiled a database of 331 sites with paleovegetation data for North America (Supplementary Data 1). These sites represent individual vegetation reconstructions based on fossil evidence (phytoliths, pollen, macrofossil assemblages) of distinct locations in time and space. We condensed the vegetation interpretation of the compiled vegetation data, which in many cases described specific vegetation ecosystem components, into the broader labels “open” versus “closed” vegetation. This resulted in 180 sites being labeled as closed and 151 as open, their dating rounded to the midpoint of the nearest geological stage (Supplementary Data 1). For several of these sites we found multiple vegetation reconstructions in the reviewed literature, for example when multiple sediment samples were taken from the same horizon of a given formation, belonging to the same geological stage. We treated these spatiotemporal duplicates as a single data point, excluding sites with mixed vegetation information (i.e., containing both open and closed vegetation reconstructions).

Current vegetation data

To supplement the limited number of paleovegetation sites, we compiled data about the current vegetation within our study area. In order to obtain current vegetation patterns, we downloaded the SYNMAP Global Potential Vegetation data²⁹. As for the paleovegetation data, we collapsed the more detailed biome data into broader categories by coding the SYNMAP biome IDs < 37 as “closed” and biome IDs ≥ 37 as “open”. The resolution of the SYNMAP current vegetation raster was 0.5° longitude × 0.5° latitude, which equates to a spatial resolution of ~50 × 50 km grid cells (at the equator). We extracted all current vegetation grid cells that fell within our defined cropping window, excluding all sea water cells as well as large continental lakes. This resulted in 11,048 terrestrial grid cells with current vegetation information. For these grid cells, we extracted the coordinates of the cell-center as well as the corresponding vegetation label.

The compiled paleovegetation and current vegetation points constitute the pool of vegetation information from which we sampled subsets to train our model. From here on we refer to these data points as our training instances. We trained several BNN models, using different combinations of the paleovegetation points (n = 331) and current vegetation points (n = 11,048).

Fossil data

We downloaded all available mammal fossil data of the last 30 Ma from the Paleobiology Database (https://paleobiodb.org/, downloaded in October 2018). We removed all entries that were not identified to species level, as well as all spatiotemporal duplicates. In several cases, the fossil data downloaded from the major databases contained minor spelling inconsistencies in the genus names and species epithets. To correct these misspellings, which can lead to an overestimation of the number of genera and species in the dataset, we used the algorithm implemented in the PyRate package⁴⁶, which automatically identifies common typos in scientific names. Finally, we removed all aquatic families from the dataset (dugongs, pinnipeds, and whales).

For each fossil occurrence, we determined the mean age of the respective stratigraphic age interval. We reduced the taxonomic resolution of the mammal data to genus-level with the main purpose to reduce the number of taxa, while increasing the spatial and temporal extent of each taxon as well as to avoid taxonomic biases such as over-splitting or lumping of species in different genera, depending on taxonomic authority. These potential taxonomic biases are expected to have a smaller impact on genus level compared to species level. To further reduce the number of taxa to only the most informative ones, we only kept genera that were present in more than half of the geological stages covered in this study, based on the first and last occurrence date of each genus in the fossil record (assumed presence in at least 9 of 17 stages). This resulted in 65 selected mammal genera (Supplementary Table 1). While the model can potentially handle any number of taxa, taxa with occurrences spanning multiple locations and time bins are expected to be most informative in our supervised learning approach.

As an addition to the mammal fossil data, we compiled a large dataset of plant macrofossils from the Cenozoic Angiosperm database²⁴. Due to the sparse fossil record of plants with a taxonomic resolution of species or genus level, we decided to reduce the taxonomic resolution of the plant fossil data to family level. As with the mammal fossil data, we took the mean age of the stratigraphic age interval of each fossil occurrence and only selected plant families that were present in North America during at least 9 of the 17 geological stages. This resulted in 35 selected plant families (Supplementary Table 1). The final fossil data, consisting of the selected mammal and plant taxa (n = 100), amounted to a total of 5514 fossil occurrences (4770 mammal and 744 plant fossils, Supplementary Data 2).

Current occurrences

To complement the occurrence data extracted from the fossil record, we extracted current occurrences for all selected taxa from the Global Biodiversity Information Facility (GBIF, www.gbif.org, accessed in September 2019). For all mammal genera we downloaded the data through the R-package rgbif⁴⁷, only allowing human observations (as opposed to, e.g., machine observations or fossil data) and restricting the search to North American occurrences (Canada, Mexico, or USA), using the following command:

occ_search(taxonKey=taxon_id, return=“data”, hasCoordinate=TRUE, country=c(‘US’,‘CA’,‘MX’), basisOfRecord=‘HUMAN_OBSERVATION’)

Due to the large data volumes for the selected plant families, which result in very long waiting times and occasional time-out errors when using the rgbif package, we instead downloaded the current occurrences of the selected plant families directly from the GBIF online interface (download https://doi.org/10.15468/dl.nxuyg8).

After filtering these occurrences to exactly match the cropping window defined in this study (see above), this resulted in a total of 1,299,782 current occurrences for the selected extant mammal and plant taxa (109,027 mammal and 1,190,755 plant occurrences, Supplementary Data 2). Finally, all fossil and current occurrences of the selected taxa were merged into one data-frame and jointly treated as occurrence data, independently of the data origin as fossil or GBIF observation. For all further steps, we only selected those occurrences that fell within the cropping window defined as described above.

While the current distribution of taxa—and in effect their recorded spatial occurrences—are affected by human impact (a bias that is not present in the fossil occurrence data), we are assuming here that these current occurrences are still informative about a taxon’s habitat preference. This assumption holds true, unless there is reason to assume that taxa completely shift their habitat preference from open to closed habitat or vice versa, due to human impact. For the purpose of this study, we don’t expect this assumption to be violated. Only if this assumption was violated for a substantial number of taxa would we expect this potential bias to affect our model predictions.

Climate and elevation models

The paleoprecipitation and temperature data were reconstructed based on global climate raster data with a spatial resolution of 1° longitude × 1° latitude (raw data provided by Christopher Scotese). These rasters derive from the PALEOMAP Project, which has produced paleogeographic maps at 5-Myr intervals²⁵ and has assembled related precipitation and temperature data based on the HadleyCM3 paleoclimate simulations⁴⁸. Similarly, we downloaded global elevation rasters through time, generated by Scotese and Wright⁴⁹. Because the paleoclimate and elevation estimates are only available in 5 Myr intervals, we linearly interpolated the values into 1 Myr year intervals to reach higher temporal resolution. Since no directly measurable and spatially explicit and complete data exists to inform our models about climate and elevation through deep time, we apply these estimates—which themselves have been generated through modeling—as part of the input data for our models. To test to what extent potential biases or errors in these modeled data may affect our model predictions, we added increasing levels of noise to these data before making predictions with our models. For each of these predictors (precipitation, temperature, and elevation) we randomly resampled values for each grid cell from a uniform distribution ranging between ±10%, 20%, and 50% of the original value. These modified data were then used in combination with all other features to make vegetation predictions, to quantify how such uncertainties in the data affect our model predictions. This had no detectable effect on our vegetation predictions, as can be seen based on the produced vegetation maps for the 50% perturbated climate and precipitation grids (Supplementary Figs. 9 and 10).

As additional predictors, we downloaded estimates of mean global temperature that are based on oxygen isotope data²⁷, and mean global atmospheric CO₂ concentration estimates based on carbon isotope data from fossil soils and stomatal pore density of fossilized leaves⁵⁰. In theory, there are many other predictors that would be useful for the task of vegetation prediction, such as seasonal climatic variables and fluctuations of different elements in the atmospheric composition. However, the limitation is usually that these predictors are not available throughout the whole time frame covered in this study (last 30 Myr), particularly not in a spatially explicit manner as spatial grids. Future studies may be able to compile such data throughout deep time (based on measurements or modeled data) and be able to apply them as additional predictors in models similar to the ones presented here.

Feature generation

An essential element of applying neural networks is the process of feature generation, which describes the transformation of the raw data into numerical features that can be fed into the neural network. Each input data point, which is commonly referred to as an instance, consists of a list of associated feature values. In our case, the training instances consist of specific points in space-time with available vegetation information, and the associated features contain the information about nearby occurrences of the selected taxa (biotic features), as well as other information about climate, geography, and time (abiotic features), in relation to the given point.

Biotic features

For a given instance (vegetation point), defined by its spatial and temporal coordinates, we extracted the geographic distance between this instance and the closest occurrence of each taxon, and we did so for each geological stage (Fig. 1). To calculate these distances, we transformed all geographic data into the Albers equal area projection and then calculated the distance between a given pair of coordinates in this projection. If a taxon was present in all stages, this resulted in 17 geographic distances extracted for this taxon, one for each stage. These spatial distances were calculated using the current coordinates (instead of the paleocoordinates) of each point, assuming that the relative spatial distance between any two given points within North America is not affected (or negligibly so) by continental movements during the last 30 Myr, although their absolute coordinate values have changed through time.

In addition, we extracted the temporal distances between the selected taxon-occurrences and the given vegetation point, by measuring the difference between the age of the training instance and the midpoint of the geological stage of a given taxon occurrence. This resulted in N pairs of geographic and temporal distances to each taxon, where N is the number of stages this taxon occurred in. We designed our BNN model to estimate parameters to summarize the spatial and temporal distances of the selected occurrences of each taxon into one taxon-specific feature value, representing a measure of general “proximity” of each taxon, which we explain in more detail below (Fig. 2).

Abiotic features

In addition to the biotic features, we extracted the temperature, precipitation, and elevation associated with the space-time coordinates of a given instance. For this step we transformed the coordinates of each given vegetation label into the equivalent paleocoordinates at the time of the record, using the “PALEOMAP” model of the mapast R-package²⁶. We extracted the modeled temperature, precipitation, and elevation of these paleocoordinates from the rasterized climate and elevation data²⁵ as three separate features. In addition, we extracted the mean global temperature and the average atmospheric CO₂ concentration at the given time point. Finally, we added the absolute paleocoordinates (longitude and latitude) as well as the age of the vegetation point as three additional features.

Our neural network was trained on a total of 100 biotic features (one for each selected taxon), 4 climatic features, 1 elevation feature, and 3 spatiotemporal features, resulting in a total of 108 features for each instance.

To avoid potential biases based on the absolute values of given features, we scaled all features to a range between 0 and 1. The rescaling was done jointly for all training and prediction instances, in order to avoid differences in rescaling-factors between features in the training instances and those in the prediction instances.

Selecting training and test data

For the training of our neural network we had a total of 11,379 points with vegetation information available, consisting of 331 paleovegetation points and 11,048 current vegetation points. To test whether the larger number of current vegetation instances might bias our past vegetation predictions, we explored different combinations of paleovegetation and current vegetation instances during training of the model (Table 1).

To evaluate the prediction accuracy of our trained models, we performed a five-fold cross-validation, training each of the five cross-validation models on 80% of the available instances, while sparing the remaining 20% as a test set. The instances for each cross-validation fold were selected ensuring the same proportion of paleovegetation instances and current instances in each cross-validation fold. We then determined the prediction accuracy of the model as the average test set prediction accuracy across all 5 cross-validation folds, which we determined separately for all paleovegetation instances and all current instances. The final prediction accuracy of each model was then determined as the weighted mean between the paleovegetation prediction accuracy and the current vegetation prediction accuracy of the model, weighing the paleovegetation component ten times higher, as it represents the accuracy across ten geological stages that are covered by our paleovegetation data (Supplementary Fig. 1), while the current data only represent a single geological stage.

Neural network configuration

We developed a BNN classification model that maps raw spatial and temporal distances of selected taxon occurrences (fossil or current) to a set of vegetation classes. These distance features can be complemented by any set of additional features, such as the abiotic features used in this study. The BNN model consists of multiple hidden layers generating a numerical representation of the features in multidimensional space, as well as an output layer that maps the nodes of the last hidden layer to the output classes, in this case open and closed habitats. Given the flexibility of our model and the fact that it is based on absolute distance measures, it may be applied to any vegetation prediction task, independently of the spatial and temporal scale of the data.

The first two hidden layers are only applied to the taxon distance features, not to the additional abiotic features. In these layers, the raw spatial and temporal occurrence distances are combined into a single value per taxon, which represents a measure of proximity of each taxon to a given input instance.

The raw distances are provided in pairs of one spatial and one temporal distance measurement, both associated with a specific occurrence of a taxon. We indicate with (Delta {s}_{ij}) and (Delta {t}_{ij}) the spatial and temporal distances for a species (iin {1,,ldots,I}) at a geological stage (jin {1,,ldots,J}) (Fig. 1). These are used as input in a first hidden layer (Eq. 1) of a sparse neural network with parameter sharing resulting in one node for each species and geological time:

$${h}_{ij}^{left(1right)}=gleft({w}_{{{{{{rm{s}}}}}}}Delta {s}_{ij}+{w}_{{{{{{rm{t}}}}}}}Delta {t}_{ij}right)$$

(1)

where ({w}_{s}) and ({w}_{t}) are weights associated with space and time distances, respectively, shared among all species and occurrences and ({g}left(cdot right)) is the swish⁵¹ activation function (Eq. 2):

$${{{{rm{swish}}}}}left({x}right)={x}times {left(1+{{{{{rm{exp }}}}}}left(-{x}right)right)}^{-1}$$

(2)

The swish activation function was used after each hidden layer in the model. To reduce the number of estimated parameters for better convergence, the space and time weights are shared among all occurrences under the assumption that the relative importance of space and time in determining the proximity of a given occurrence is expected to be the same for all occurrences of different taxa.

After combining spatial and temporal distances into one spatiotemporal distance value in this way, we estimate specific taxon-weights for each taxon and geological stage, which are then used to collapse the multiple spatiotemporal distances across different geological stages into one single feature value for each taxon. This happens in the second hidden layer (Eq. 3):

$${h}_{i}^{left(2right)}=gleft({displaystylesum }_{j=1}^{J}{h}_{ij}^{left(1right)}{W}_{ij}^{left(2right)}right)$$

(3)

where ({W}^{left(2right)}) is a matrix of weights associated with each geological stage (j) specific to taxon (i). The second hidden layer ({h}^{left(2right)}) thus includes one node for each species, which provides the input, along with additional abiotic features (f), to a fully connected neural network. Depending on the chosen pooling strategy, these taxon feature values are either fed as individual features into the next layer (no pooling) or are summarized into one faunal and one floral feature, by either extracting the maximum output value from layer ({h}^{left(2right)}) (max-pooling) or by summing all output values (sum-pooling) across all mammal and plant taxa, respectively.

Following the initial two layers, the taxon-features (({h}^{left(2right)}), n = 100 or n = 2, depending on pooling strategy) are fed together with the additional abiotic features ((f), n = 8) into a fully connected neural network and eventually mapped to the binary vegetation classes in the output layer (Fig. 2b). Given a set of input features (x={{h}^{left(2right)},f}) of size (M) the next hidden layer (Eq. 4) with (n) nodes is obtained through:

$${h}_{n}^{left(3right)}=gleft({displaystylesum }_{m=1}^{M}{x}_{m}{W}_{mn}^{left(3right)}right)$$

(4)

where ({W}^{left(3right)}) is a matrix of (Mtimes n) weights. Finally, the output of the neural network (Eq. 5) is binary and quantifies the probability associated with each class (closed and open habitats):

$${y}_{o}={{{{{rm{sigma }}}}}}left({displaystylesum }_{n}{h}_{n}^{left(3right)}{W}_{no}^{left(4right)}right)$$

(5)

where (o=2), ({W}^{left(4right)}) is a matrix of (ntimes 2) weights, and ({{{{{rm{sigma }}}}}}left(cdot right)) is the softmax⁵² function (Eq. 6):

$${{{{{rm{sigma }}}}}}left({x}_{{{{{{rm{k}}}}}}}right)=frac{{{{{{rm{exp }}}}}}left({x}_{{{{{{rm{k}}}}}}}right)}{{displaystylesum }_{o}{{{{{rm{exp }}}}}}left({x}_{o}right)}$$

(6)

We tested different network configurations in terms of number of layers and nodes per layer, different pooling strategies, as well as different combinations of training features and instances, and selected the best model based on the highest test set prediction accuracy (Table 1).

The parameters of the model (({w}_{{{{{{rm{s}}}}}}},{w}_{{{{{{rm{t}}}}}}},{W}^{left(2right)},{W}^{left(3right)},{W}^{left(4right)})) were jointly estimated using a Metropolis Hastings Markov Chain Monte Carlo (MCMC) algorithm⁵³. During training, all weights of the model are initially drawn randomly from a normal distribution centered in 0 and are then updated via MCMC sampling. We used a standard normal distribution as prior on all weights (parameters of the model). During model testing we ran an MCMC chain for 200,000 generations for each cross-validation replicate, sampling every 200 iterations. We selected the best model based on the highest prediction accuracy, and then trained a final production model with these best model settings using all available instances (no test set) for 400,000 additional MCMC generations, departing from the parameter values estimated during cross-validation.

Our BNN implementation allows not only to estimate the most probable vegetation label for a given point in time and space, but also to calculate the posterior probability of this label, providing an inherent measure of uncertainty. We calculated the posterior probability of each class label for a given instance as the mean class probability across all posterior samples. This ability makes BNNs an attractive alternative to regular neural network algorithms, which allow no such uncertainty modeling, although analogous approximations exist, such as Monte Carlo dropout⁵⁴.

Feature importance

To determine the relative importance of each feature used in our model, we applied the method of permutation feature importance (sensu Breiman⁵⁵). In this approach, the values of a given feature are randomly shuffled across all instances of the test or training set. This process masks any existing information that lies within the data of a given feature. The class labels for all instances are then predicted using the modified feature matrix. The resulting prediction accuracy is then compared with that of the original feature matrix and the difference between these accuracies ((varDelta {acc})) is interpreted as a measure of relative importance of the shuffled feature for the classification task. We repeated this process for each feature column in our feature matrix (n = 108), using the complete training set, and ranked the features based on their (varDelta {acc}) values (Fig. 5).

Predicting vegetation labels

To produce continuous vegetation maps across North America, we constructed a 0.5° × 0.5° grid across the cropping window defined in this study and extracted the coordinates of the cell-center for each grid cell (n = 11,731). For these points, we extracted spatiotemporal taxon distances and abiotic features in the same manner as for the training instances. We repeated this process in 1 Myr steps starting in the present (t = 0) throughout the last 30 Myr (t = 30), producing 31 feature-datasets of North America through time, considering tectonic movement (mapast²⁶). Based on the BNN weights sampled during training by the MCMC (excl. burn-in) we determined the posterior probabilities of each vegetation label for each given point (Fig. 3).

To produce more spatially explicit subsets of the North America grid, we downloaded shape files delineating the ecoregions of North America (Level 1 ecoregions³⁰, downloaded from https://www.epa.gov/eco-research/ecoregions-north-america). We identified all grid cells that fall within each ecoregion and extracted the vegetation predictions for these cells, to track the spread of open vegetation in each of these ecoregions separately.

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.