Eco-ISEA3H, a machine learning ready spatial database for ecometric and species distribution modeling

Our objective in developing the Eco-ISEA3H database³⁷ was to compile a coordinated, global set of tabular data, characterizing environmental conditions and the geographic distributions of large mammalian species. The database was built on the ISEA3H DGGS, a multi-resolution system of global grids, each grid dividing the Earth’s surface into discrete, equal-area hexagonal cells. These cells constitute areal units of observation, uniformly resampling data provided in different coordinate reference systems, spatial resolutions, geographic data models, and file formats. We included data at six consecutive ISEA3H resolutions, in which cell centroid spacing ranges from 29 kilometers to approximately 450 kilometers.

Eco-ISEA3H themes and variables were derived from 17 geospatial data sources, and represent 3,033 features to be used for ML-based predictive modeling. Source datasets were published in raster or vector format, data models built on fundamentally different representations of spatial phenomena. Raster datasets comprise regular arrays of pixels, each pixel holding a value, while vector datasets comprise point, line, and polygon features, each feature defined by one or more (x, y) coordinate pairs and attributed with one or more values. Our task was to integrate these disparate source datasets, resampling and summarizing the values of raster pixels and vector features via the discrete, equal-area cells of the ISEA3H global grid system. The hexagonal cells on which the Eco-ISEA3H database³⁷ is built thus serve as unifying observational units for SDM and ecometric analysis and modeling.

From the statistical and ML perspective, each areal observational unit is characterized by (1) a set of environmental variables, representing climatic conditions, soil and near-surface lithology, land cover, and physical geography; and (2) a set of occurrence variables, representing the present and estimated natural distributions of large mammalian species. Predictive modeling tasks can be formulated in two directions: predicting species’ occurrences as a function of climatic and other environmental conditions (as in SDM studies), or predicting climatic and other environmental conditions as a function of species’ occurrences and functional traits (as in ecometric studies).

Spatial units of observation

To study continuous spatial phenomena over a region of interest, it is often necessary to divide the region into a number of discrete, areal observational units, which may be used in statistical summaries and/or modeling. Machine learning methods for ecometric and species distribution modeling require discrete observational units, each characterized by two sets of variables, one describing environmental conditions, the other species’ geographic distributions. A major question in data representation concerns the form of these units; defining discrete spatial units of observation constitutes a well-known problem in geography, termed the modifiable areal unit problem (MAUP)³⁸. As we change the size of proposed observational units, or change the boundaries between units while holding unit areas constant, measures of interest within these units – and derived summary statistics and model parameters – may differ; these are termed the “scale” and “zone” effects, respectively³⁸.
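The scale and zone effects can be illustrated with a toy calculation. The sketch below is not part of the Eco-ISEA3H workflows; the one-dimensional transect, its values, and the function name `aggregate` are hypothetical, chosen only to show how the variance among unit means changes with unit size and with boundary placement.

```python
from statistics import mean, pvariance

def aggregate(values, size, offset=0):
    """Mean of each complete run of `size` consecutive values, starting at `offset`."""
    return [mean(values[i:i + size])
            for i in range(offset, len(values) - size + 1, size)]

# Hypothetical transect of a measured quantity (illustrative values only).
transect = [1, 1, 9, 9, 1, 1, 9, 9, 1, 1, 9, 9]

# Scale effect: same origin, different unit sizes.
var_size2 = pvariance(aggregate(transect, 2))   # units align with the pattern
var_size4 = pvariance(aggregate(transect, 4))   # units average the pattern away

# Zone effect: same unit size, boundaries shifted by one position.
var_shifted = pvariance(aggregate(transect, 2, offset=1))
```

With these values, units of size 2 preserve the alternating pattern, while units of size 4, or units of size 2 shifted by one position, yield identical unit means and so hide it entirely.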

Our objective in utilizing the ISEA3H DGGS³⁴ was to implement a robust spatial division of the Earth’s surface. The grid cells of the DGGS discretize the Earth’s sphere, forming, at each DGGS resolution, a global set of areal observational units with which to sample and summarize source datasets. To be optimally effective in the observation, simulation, and visualization of spatial phenomena, such a grid must meet certain structural criteria. We propose, modifying the Goodchild Criteria³⁹, that the observational units of a DGGS grid must (1) be contiguous, (2) be equivalent, (3) minimize intra-unit variability, (4) have uniform topology with neighboring units, and (5) be visually effective, facilitating interpretation and communication. Each criterion is discussed in detail below; further, we argue that the ISEA3H DGGS selected for this study satisfies these criteria.

Contiguity & congruency

We suggest that a regular tiling maximally satisfies the criteria of (1) contiguity and (2) equivalence. A tiling is simply a set of shapes which cover a plane without gaps or overlaps⁴⁰. A regular tiling is one of a class of tilings in which the tiles – our observational units – are highly equal; such tilings are monohedral, and composed of congruent, regular (equiangular and equilateral) polygons. Thus, regular tilings are also highly symmetrical, being vertex-, edge-, tile-, and flag-transitive. Three regular polygons may be used to create a regular tiling: the equilateral triangle, the square, and the regular hexagon⁴⁰.

With this suggestion, we follow common convention; in ecology, grids of square (or rectangular) cells are most often utilized, motivated in part by the use of raster datasets⁴¹, made of rectilinear rows and columns of pixels. However, it should be noted that while the square cells of these grids are equal in the coordinate reference system in which they are defined, such cells are rarely congruent, or indeed even square, on the Earth’s surface. The properties of the ISEA projection selected for this DGGS – area preservation, and relatively low angular distortion – serve to retain considerable congruency when inversely projecting grid cells to the spherical surface of the Earth.

Compactness

To accurately represent the spatially continuous phenomena of the Earth system, the grid cells of a DGGS – the areal observational units used in summarizing, modeling, and visualizing – must effectively discretize these phenomena. Thus, the DGGS must be structured such that (3) intra-unit variability is minimized, and inter-unit variability is maximized. In this way, patterns of variation among units more accurately represent patterns of variation inherent in the phenomena.

Intra-unit variability may be minimized, in expectation, by compact observational units. Tobler’s oft-cited first law of geography serves as explanation: “everything is related to everything else, but near things are more related than distant things”⁴². Thus, compact units, in which all portions of the interior are nearer each other, are expected to contain less interior variability than elongated units, in which portions of the interior may be more distant. Given these properties, compact units are optimal in the context of DGGS development, elongated units in the context of efficient ecological sampling.

Regular hexagons are the most compact of the three polygons – the equilateral triangle, square, and regular hexagon – admitting regular tilings. This compactness may be expressed in several related and complementary ways. First, of any equal-area tiling, regular hexagons have the minimum possible ratio of perimeter to area⁴³. In minimizing perimeter length per unit area, regular hexagons are thus the most circle-like of the polygons admitting equal-area tilings. Relatedly, regular hexagonal packing is the highest-density arrangement of equal-area circles on a plane⁴⁴.

Finally, a regular hexagonal lattice optimally quantizes a plane; of the polygons admitting regular tilings, regular hexagons minimize the mean squared distance of any point to the nearest polygon centroid⁴⁵. This distance, or “dimensionless second moment,” quantifies the more qualitative notion of interior nearness discussed in relation to Tobler’s Law.
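Both compactness measures can be checked numerically for the three polygons admitting regular tilings. The following is a minimal sketch, assuming unit-area polygons and the standard closed-form expressions for a regular n-gon; the function name `regular_polygon_metrics` is illustrative and does not come from the cited works.

```python
import math

def regular_polygon_metrics(n):
    """Perimeter and dimensionless second moment of a unit-area regular n-gon."""
    # Side length s and apothem a follow from area = n * s * a / 2 = 1,
    # with a = s / (2 * tan(pi / n)).
    s = math.sqrt(4 * math.tan(math.pi / n) / n)
    a = s / (2 * math.tan(math.pi / n))
    perimeter = n * s
    # Polar second moment about the centroid: J = A * (a**2 / 2 + s**2 / 24), A = 1.
    # Dimensionless second moment in two dimensions: G = J / (2 * A**2).
    second_moment = (a ** 2 / 2 + s ** 2 / 24) / 2
    return perimeter, second_moment

# Triangle, square, and hexagon, all scaled to unit area.
metrics = {n: regular_polygon_metrics(n) for n in (3, 4, 6)}
circle_perimeter = 2 * math.sqrt(math.pi)  # lower bound for any unit-area shape
```

Under these assumptions, both the perimeter and the second moment decrease from triangle to square to hexagon, with the hexagon's perimeter closest to that of the unit-area circle.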

Topology

In addition to maximally satisfying the (3) compactness criterion, regular hexagons have a topological advantage over equilateral triangles and squares. Of these three regular polygons, hexagons have the simplest relationship with neighbors in a tiling or grid, each (4) uniformly sharing an edge with the six adjacent hexagons forming its first-order neighborhood. Triangles and squares, in contrast, share only a single vertex with nine or four neighbors, respectively, and an edge with three or four neighbors, complicating the definition of neighborhood in these grids.

It follows that hexagonal topology has greater angular resolution than edge-based triangular or square topologies; movement may be simulated between cells in six directions, rather than in three or four, respectively. These properties – neighborhood simplicity and angular resolution – were confirmed by Golay⁴⁶, in the context of pattern transformation operations on two-dimensional arrays. Further, these properties likely account for the widespread use of hexagonal grids in strategy board games, since these grids were introduced in the early 1960s⁴⁷.

Differing grid topologies affect the results of ecological models simulating dispersal. White and Kiester⁴⁸, for example, found the topology of the network of communities in a neutral community ecology model – in which simulated communities had hexagonal neighborhoods, or von Neumann, Moore, or Margolus neighborhoods – affected modeled species abundances and diversities, but in complex ways, which differed given different model parameter values. (Note that the four neighbors with which a square cell shares an edge are termed its rook, or von Neumann neighborhood, and these plus the four neighbors with which it shares a single vertex its queen, or Moore neighborhood.)
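The neighborhood definitions named above can be written as sets of index offsets. This is an illustrative sketch only: the axial (q, r) coordinate convention used for the hexagonal grid is an assumption made here for convenience, and it is not how ISEA3H cells are indexed.

```python
# Offsets on a square grid, as (row, column) steps.
VON_NEUMANN = [(-1, 0), (0, -1), (0, 1), (1, 0)]            # shared edge (rook)
MOORE = VON_NEUMANN + [(-1, -1), (-1, 1), (1, -1), (1, 1)]  # plus shared vertex (queen)

# Offsets on a hexagonal grid, in axial (q, r) coordinates (an assumed convention).
HEXAGONAL = [(1, 0), (1, -1), (0, -1), (-1, 0), (-1, 1), (0, 1)]

def neighbors(cell, offsets):
    """Return the cells reached from `cell` by each offset."""
    return [(cell[0] + d0, cell[1] + d1) for d0, d1 in offsets]
```

On the square grid, a model must choose between the two neighborhoods; on the hexagonal grid, the six edge-sharing neighbors are the only first-order candidates.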

Visualization

Finally, in addition to these gains in representational accuracy, (5) hexagonal tilings are more visually effective than square tilings. Whether used in cartography or other two-dimensional data visualization, tilings inevitably create visual lines, artifacts of the lattice of shared edges between tiles⁴⁹. Given our “sense of gravitational balance,” Carr et al.⁴⁹ argue the horizontal and vertical lines of square tilings strongly distract the human eye, obscuring data-driven patterns in a dataset so visualized. The non-orthogonal lines of hexagonal tilings, however, feature less prominently, and thus distract less from patterns of interest⁴⁹.

Note that this is not an issue of aesthetics only: maps are often essential tools in scientific reasoning and communication, and effective visualization is important. Indeed, Carr et al.⁴⁹ suggest this visual advantage makes a stronger case for hexagonal tilings than the representational advantages discussed previously.

DGGS sampling workflows

The set of scripted workflows developed to incorporate spatial datasets into the Eco-ISEA3H database³⁷ utilizes published spatial libraries and packages for Python and R, and includes several validation steps, intended to verify the integrity of source datasets and the fidelity of the transfer to the DGGS. Workflows developed for raster datasets are presented in Fig. 1, and workflows for vector datasets in Fig. 2.

Fig. 1

Workflow developed to incorporate raster datasets into the ISEA3H DGGS.


Fig. 2

Workflow developed to incorporate vector datasets into the ISEA3H DGGS.


To begin, one general principle guides each workflow: each source dataset is processed in its native coordinate reference system. In all cases, a representation of the DGGS is developed in the coordinate reference system of the source dataset, and used in summarizing that dataset. The guiding premise here is that the spatial dataset is as the authors intended it in the coordinate reference system in which it is published and distributed.

This is especially relevant for vector polygon datasets. Consider, for example, certain species’ range polygons published by the IUCN Red List⁵⁰; these polygons are defined only roughly, having relatively few, widely spaced vertices, connected by arcs many hundreds of kilometers in length. These arcs are “straight” in the plate carrée projection with which the dataset’s WGS84 latitude/longitude coordinates are visualized by default. If vertex coordinates were projected into another coordinate reference system, the arcs would be similarly “straight” in this new system, and thus potentially trace very different paths across the Earth’s surface. Absent information to the contrary, we assume the arcs are as intended in the reference system in which the data are distributed.
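The sensitivity of a long polygon edge to the reference system can be illustrated by comparing the midpoint of an edge drawn straight in latitude/longitude space with the midpoint of the great-circle arc between the same vertices. The sketch below assumes a spherical Earth rather than the WGS84 ellipsoid, uses two hypothetical vertices, and does not handle antipodal points or edges crossing the antimeridian.

```python
import math

def to_xyz(lat, lon):
    """Unit vector for a latitude/longitude given in degrees."""
    phi, lam = math.radians(lat), math.radians(lon)
    return (math.cos(phi) * math.cos(lam), math.cos(phi) * math.sin(lam), math.sin(phi))

def great_circle_midpoint(p, q):
    """Midpoint of the great-circle arc between two (lat, lon) points, in degrees."""
    x, y, z = (a + b for a, b in zip(to_xyz(*p), to_xyz(*q)))
    norm = math.sqrt(x * x + y * y + z * z)
    return (math.degrees(math.asin(z / norm)), math.degrees(math.atan2(y, x)))

def plate_carree_midpoint(p, q):
    """Midpoint of the segment drawn straight in latitude/longitude space."""
    return ((p[0] + q[0]) / 2, (p[1] + q[1]) / 2)

# Two hypothetical polygon vertices on the 60th parallel, 90 degrees of longitude apart.
a, b = (60.0, 0.0), (60.0, 90.0)
mid_pc = plate_carree_midpoint(a, b)   # stays on the parallel
mid_gc = great_circle_midpoint(a, b)   # bows toward the pole
```

For this pair of vertices, the two midpoints differ by several degrees of latitude, which is the kind of divergence the native-reference-system principle is meant to avoid.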

The spatial structure of raster datasets depends similarly on each dataset’s coordinate reference system; rasters are made of rows and columns of pixels, rectilinear and orthogonal only in the raster’s native coordinate reference system. We assume raster pixels are “atomic” units, each indivisible and representative of the area it natively covers. Thus, we query the DGGS at each pixel’s centroid, and assign the pixel wholly to the coincident DGGS cell.

Raster dataset processing

If necessary, source raster datasets were first converted to the GeoTIFF file format, so that the files were readable in the open-source GIS software used later in the processing workflow. GeoTIFF files are simply Tag Image File Format (TIFF) image files with embedded georeferencing information, describing the dataset’s spatial extent and coordinate reference system. Hierarchical Data Format Release 4 (HDF4) files were converted to GeoTIFF format using the Geospatial Data Abstraction Library (GDAL) translate utility⁵¹.

Next, raster tiles containing ISEA3H hexagon identification (HID) indexing numbers were generated; these integer HIDs uniquely identify each cell at a given ISEA3H resolution. A set of HID raster tiles was required for each source raster dataset, for each ISEA3H resolution, because (1) GeoTIFF rasters are able to hold only a single value at each pixel; and (2) HIDs sequentially number cells at a given ISEA3H resolution, from 1 to the number of cells present at that resolution. Thus, HIDs are not unique between resolutions; HID 84, for example, identifies a cell at each ISEA3H resolution 2 and higher.
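The overlap of HIDs between resolutions follows from the number of cells at each resolution. The sketch below assumes the standard cell count for an aperture-3 grid on the icosahedron, 10 × 3^r + 2 at resolution r (twelve of these cells are pentagons); this formula is not stated in the text, but it is consistent with the HID 84 example. The function names are illustrative.

```python
def isea3h_cell_count(resolution):
    """Number of cells in an aperture-3 icosahedral grid at a given resolution."""
    return 10 * 3 ** resolution + 2

def resolutions_containing(hid, max_resolution=10):
    """Resolutions (0 to max_resolution) at which a sequential HID exists."""
    return [r for r in range(max_resolution + 1) if 1 <= hid <= isea3h_cell_count(r)]
```

Under this assumption, resolution 1 has 32 cells and resolution 2 has 92, so HID 84 first appears at resolution 2 and recurs at every finer resolution.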

The HID raster tiles generated for a source raster dataset matched that dataset’s grid resolution, extent, and coordinate reference system precisely; thus, there was a one-to-one correspondence between the pixels of the HID raster tiles and the source raster dataset tiles. For each tile, pixel centroid coordinates were passed to the dggridR package⁵² for R, which returned the ISEA3H cell identification number for that location. In this way, the pixels of the source raster were treated as indivisible units, assigned wholly to a particular HID on the basis of each pixel’s centroid. HID rasters were written in GeoTIFF format using the raster package⁵³ for R.
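The centroid-based assignment can be sketched in a few lines. The example below is written in Python rather than R and does not call dggridR; the `lookup` argument is a stand-in for whatever function returns a cell identifier for a coordinate pair, and `toy_lookup` is a placeholder that is not a DGGS. The tile origin and pixel size parameters are assumed to describe a north-up raster.

```python
def pixel_centroids(x_min, y_max, pixel_width, pixel_height, n_rows, n_cols):
    """Yield (row, col, x, y) for the centroid of every pixel in a north-up tile."""
    for row in range(n_rows):
        for col in range(n_cols):
            x = x_min + (col + 0.5) * pixel_width
            y = y_max - (row + 0.5) * pixel_height
            yield row, col, x, y

def build_hid_tile(x_min, y_max, pixel_width, pixel_height, n_rows, n_cols, lookup):
    """Fill a tile-shaped grid with the cell ID returned by `lookup(x, y)`."""
    tile = [[None] * n_cols for _ in range(n_rows)]
    for row, col, x, y in pixel_centroids(x_min, y_max, pixel_width, pixel_height,
                                          n_rows, n_cols):
        tile[row][col] = lookup(x, y)
    return tile

# Placeholder lookup, NOT a DGGS: it labels the western half 1 and the eastern half 2.
def toy_lookup(x, y):
    return 1 if x < 0 else 2

hid_tile = build_hid_tile(-2.0, 1.0, 1.0, 1.0, n_rows=2, n_cols=4, lookup=toy_lookup)
```

Because the identifier grid shares the source tile's shape and georeferencing, each source pixel can later be paired with its HID by position alone.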

In equal-area projected coordinate reference systems, simple counts of the number of raster pixels assigned to each HID were sufficient to determine each ISEA3H cell’s total area. In all other cases – for example, for raster datasets using the World Geodetic System 1984 (WGS84) coordinate reference system – raster tiles containing pixel areas were generated. These areas were calculated by passing each pixel’s corner coordinates to the GeographicLib library⁵⁴ for Python.
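The need for area rasters in geographic coordinates can be seen from the area of a latitude/longitude pixel. The sketch below swaps in a closed-form spherical approximation, using the authalic radius of the WGS84 ellipsoid, in place of the ellipsoidal calculation the workflow performed with GeographicLib; the function name and its parameters are illustrative.

```python
import math

AUTHALIC_RADIUS_KM = 6371.0072  # radius of the sphere with the WGS84 ellipsoid's area

def pixel_area_km2(lat_south, lat_north, lon_width_deg, radius=AUTHALIC_RADIUS_KM):
    """Spherical area of a pixel bounded by two parallels and two meridians."""
    band = math.sin(math.radians(lat_north)) - math.sin(math.radians(lat_south))
    return radius ** 2 * math.radians(lon_width_deg) * band

# A 1-degree pixel at the equator covers more ground than one at 60 degrees north.
area_equator = pixel_area_km2(0.0, 1.0, 1.0)
area_60n = pixel_area_km2(60.0, 61.0, 1.0)
```

Pixels of equal angular size shrink toward the poles, so a simple pixel count would overweight high-latitude pixels in any area-integrated summary.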

Finally, source raster dataset tiles, HID raster tiles, and area raster tiles (for source rasters using non-authalic coordinate reference systems) were superimposed to generate summary tabular files, describing the features of the source raster dataset by ISEA3H cell. The specifics of this process, which utilized functions of the raster package⁵³ for R, depended on whether the source raster contained discrete, categorical values, or continuous, real-numbered values.

Discrete themes

For each source raster dataset containing discrete pixel values, one or more of the following summary statistics were calculated. While the centroid attribute requires a simple point sample, the fraction and mode attributes are area-integrated, and involve a multiple-step sampling process. For rasters using an authalic coordinate reference system, the raster package’s crosstab function⁵³ was used to generate a contingency table for each tile; applied to source raster and HID raster tiles, the function tallied the number of pixels of each class coincident with each HID, for each tile. These tile-specific tables were then summed, to obtain total counts of pixels of each class within each HID.

For rasters using a non-authalic coordinate reference system, area raster tiles were required as well. For each tile, a vector of classes present in the source raster was assembled. For each of these classes in turn, a mask raster tile was generated, retaining pixels belonging to the class, and screening pixels belonging to all other classes. This mask was applied to the area raster tile, and retained pixels were summed within each HID using the raster package’s zonal function⁵³. Thus, a contingency table was compiled for each raster tile, containing the area of each class within each HID. Finally, these tile-specific tables were summed, to obtain the total area of each class within each HID.
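The tile-wise tallying and summation described in the two preceding paragraphs can be sketched as follows. This is a simplified Python analogue, not the raster package's crosstab or zonal functions; tiles are represented as flat lists, the class codes and HIDs are hypothetical, and the handling of null pixels is an assumption.

```python
from collections import Counter

def tile_table(classes, hids, areas=None):
    """Tally area (or pixel count, if `areas` is None) per (HID, class) for one tile."""
    table = Counter()
    for i, (cls, hid) in enumerate(zip(classes, hids)):
        if cls is None or hid is None:      # skip null pixels
            continue
        table[(hid, cls)] += 1 if areas is None else areas[i]
    return table

def sum_tables(tables):
    """Sum tile-specific tables into one table for the whole dataset."""
    total = Counter()
    for table in tables:
        total.update(table)
    return total

# Two hypothetical tiles, flattened to lists: class codes, HIDs, pixel areas.
tile_a = tile_table(["Af", "Af", "ET"], [7, 7, 8], [2.0, 2.0, 1.5])
tile_b = tile_table(["Af", "ET", None], [7, 8, 8], [2.0, 1.5, 1.5])
class_area = sum_tables([tile_a, tile_b])
```

Omitting `areas` gives the pixel-count case for authalic reference systems; supplying it gives the area-weighted case.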

Centroid. The centroid attribute records the categorical value occurring at each ISEA3H cell’s centroid. Where the source raster dataset contains a null value at a centroid, the cell is assigned a flag signifying no value is available.
Fraction. The fraction attributes record the proportion of each ISEA3H cell’s area covered by each categorical value. For example, the Köppen-Geiger climate classification system, as implemented by Beck et al.⁵⁵, includes 30 classes, listed in Table 4. Thus, each ISEA3H cell has an associated set of 30 fraction attributes for this dataset, recording the proportions of the cell’s area covered by the 30 categorical values, from tropical rainforest (Af) to polar tundra (ET).
Mode. The mode attribute records the categorical value covering the greatest proportion of each ISEA3H cell’s area. For example, if an ISEA3H cell had a fraction value of 0.4 for some hypothetical categorical value A, 0.3 for B, and 0.3 for C, it would be assigned a mode value of A. A mode attribute is specified for cells in which the sum of the fraction attributes is greater than or equal to 0.2; where fraction attributes total less than 0.2, a flag signifying no value is assigned.
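The fraction and mode attributes can be derived from per-class areas as in the sketch below. The 0.2 threshold is taken from the text; the use of `None` as the no-value flag and the function names are assumptions, and the text does not say how ties between classes are resolved (here the first class encountered wins).

```python
MODE_THRESHOLD = 0.2   # minimum summed fraction, as stated for the mode attribute
NO_VALUE = None        # stand-in for the database's no-value flag (an assumption)

def fractions(class_area, cell_area):
    """Proportion of one cell's area covered by each class."""
    return {cls: area / cell_area for cls, area in class_area.items()}

def mode(cell_fractions, threshold=MODE_THRESHOLD):
    """Class with the largest fraction, or NO_VALUE if coverage is below threshold."""
    if sum(cell_fractions.values()) < threshold:
        return NO_VALUE
    return max(cell_fractions, key=cell_fractions.get)

# The worked example from the text: A covers 0.4 of the cell, B and C 0.3 each.
example = fractions({"A": 40.0, "B": 30.0, "C": 30.0}, cell_area=100.0)
sparse = fractions({"A": 10.0, "B": 5.0}, cell_area=100.0)
```

Here `mode(example)` selects A, while `mode(sparse)` returns the no-value flag because the classes together cover less than 0.2 of the cell.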

Continuous variables

For each source raster dataset containing continuous pixel values, one or more of the following summary statistics were calculated. Again, the centroid attribute requires only a simple point sample, while the mean attribute is area-integrated, requiring area raster tiles for source rasters using a non-authalic coordinate reference system.

Centroid. The centroid attribute records the continuous value occurring at each ISEA3H cell’s centroid. Where the source raster dataset contains a null value at a centroid, the cell is assigned a flag signifying no value is available.
Mean. The mean attribute records the area-weighted arithmetic mean of the continuous values of raster pixels within each ISEA3H cell. For raster datasets in authalic coordinate reference systems, the area-weighted mean is equivalent to the simple mean of the values of raster pixels within each cell; however, in all other cases, pixel values are weighted by pixel areas per the equation below, in which w_i and x_i indicate the area and value, respectively, of each pixel i within an ISEA3H cell containing n pixels.

$$\overline{x}=\frac{\sum_{i=1}^{n}w_{i}x_{i}}{\sum_{i=1}^{n}w_{i}}$$

For each tile, source raster values and area values were multiplied, pixel by pixel, using the raster package’s * arithmetic operator⁵³. The resulting product raster tile, as well as the area raster tile, were then summed within each HID using the raster package’s zonal function⁵³. Finally, these tile-specific tables were summed, to obtain both the numerator (summed product values) and denominator (summed area values) for the above equation, for each HID.
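The accumulation of numerator and denominator across tiles can be sketched as follows. This is a simplified Python analogue of the procedure, not the raster package's zonal function; tiles are flat lists, the values are hypothetical, and skipping null pixels is an assumption.

```python
from collections import defaultdict

def tile_sums(values, hids, areas):
    """Per-HID sums of area*value (numerator) and area (denominator) for one tile."""
    sums = defaultdict(lambda: [0.0, 0.0])
    for x, hid, w in zip(values, hids, areas):
        if x is None or hid is None:        # skip null pixels
            continue
        sums[hid][0] += w * x
        sums[hid][1] += w
    return sums

def weighted_means(tiles):
    """Combine tile-specific sums, then divide numerator by denominator per HID."""
    total = defaultdict(lambda: [0.0, 0.0])
    for sums in tiles:
        for hid, (num, den) in sums.items():
            total[hid][0] += num
            total[hid][1] += den
    return {hid: num / den for hid, (num, den) in total.items() if den > 0}

# Hypothetical cell 7 split across two tiles: values 10 and 20 with areas 1 and 3.
tile_a = tile_sums([10.0], [7], [1.0])
tile_b = tile_sums([20.0], [7], [3.0])
means = weighted_means([tile_a, tile_b])
```

Keeping the two sums separate until the end is what allows a cell that straddles tile boundaries to receive a single, correctly weighted mean.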

Vector dataset processing

Source vector datasets incorporated into the Eco-ISEA3H database³⁷ contain polygon features, discrete areas assigned a categorical value. A dataset may (1) contain polygons of several different classes; for example, the vector shapefile published by Olson et al.⁵⁶ contains ecoregion polygons, each assigned to one of several biogeographic realms. Alternatively, a dataset may (2) represent a single class, with polygons indicating class presence; for example, the shapefiles published by the IUCN Red List⁵⁰ each represent a species’ geographic range, with polygons indicating regions in which the species is present. In both cases, the summary statistics discussed in reference to raster datasets containing discrete values may be calculated.

Prior to inclusion in the Eco-ISEA3H database³⁷, source vector datasets were preprocessed. To simplify the geographic representation of the class(es) of interest – that is, to remove unnecessary polygon boundaries – dataset polygons were dissolved, either on the class attribute in case (1), or globally in case (2), using the QGIS open-source desktop GIS application. The geodesic areas of dissolved polygons were then calculated using the GeographicLib library⁵⁴. Finally, the geometries of dissolved polygons were checked for conformance with the OGC Simple Feature Access standard⁵⁷ using the Shapely library⁵⁸ for Python, ensuring these features served as valid input in the processing workflow to follow.

The intersection of source dataset polygons and ISEA3H cell polygons is central to the vector processing workflow. Source polygons result from the preliminary simplification and verification steps just discussed; cell polygons result from polygonizing a set of HID raster tiles for the ISEA3H resolution of interest. The polygonizing procedure utilized the open-source GDAL command-line tools polygonize and ogrmerge⁵¹, as well as the GeographicLib⁵⁴ and Shapely⁵⁸ libraries. Polygonizing HID raster tiles of the appropriate coordinate reference system (specifically, the system matching that of the source polygon dataset) ensured HID polygon boundaries displayed both proper geodesic curvature and the shape distortion induced by the ISEA map projection.

Intersection is a set-theoretic operation, returning polygons representing each coincident class/HID combination. The operation was implemented via the Shapely library⁵⁸, and the geodesic areas of intersected polygons were calculated via the GeographicLib library⁵⁴. Note that the scripted intersection tools developed for the Eco-ISEA3H database³⁷ allow limiting the ISEA3H cells included in a single tool run, to break the processing of large datasets into manageable pieces. Runs may be limited to a user-specified range of HIDs. Additionally, if cells at the next coarser or finer ISEA3H resolution have been intersected with the source dataset, cells retained by the operation may be used as a spatial index; a list of coincident HIDs at the ISEA3H resolution of interest may be generated, and used to limit tool runs.

An output shapefile is written, containing intersected polygons attributed with the geodesic area, the HID, and in case (1), the source class. Next, an additional verification of the geometries of these intersected polygons is performed. Each intersected polygon is superimposed over the original ISEA3H cell polygon having the same HID. If intersected polygons have too few vertices to be valid, or are not contained by the original cell polygon from which each was derived, these polygons are flagged for review and revision. This step was implemented to catch geometry errors observed early in the development of the Eco-ISEA3H intersection tools.

Finally, the geodesic areas of intersected polygons are totaled, and the total area of each class within each HID is calculated. Dividing by the geodesic areas of the original ISEA3H cell polygons, these class totals are expressed as fractions of each cell’s total area. In two final verification steps, (1) the total intersected area of each class, across all HIDs, is compared to the area of the same class in the source dataset; and (2) class fraction values are confirmed to be less than or equal to unity within each HID. Deviations are flagged for review and revision.
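The two final verification steps can be sketched as a single check over tabulated areas. The function name, the dictionary layout, and the relative tolerance `rel_tol` are hypothetical; the text does not state what tolerance, if any, the Eco-ISEA3H tools apply.

```python
def verify_fractions(intersected, cell_area, source_area, rel_tol=1e-6):
    """Return (class_mismatches, fraction_violations) for intersected areas.

    intersected: {(hid, cls): area}; cell_area: {hid: area}; source_area: {cls: area}.
    """
    class_total = {}
    for (hid, cls), area in intersected.items():
        class_total[cls] = class_total.get(cls, 0.0) + area

    # Check (1): area summed across all HIDs matches the class area in the source.
    mismatches = [cls for cls, expected in source_area.items()
                  if abs(class_total.get(cls, 0.0) - expected) > rel_tol * expected]

    # Check (2): no class fraction exceeds unity within any cell.
    violations = [(hid, cls) for (hid, cls), area in intersected.items()
                  if area / cell_area[hid] > 1.0 + rel_tol]
    return mismatches, violations

# Hypothetical areas: realm_b is deliberately inconsistent to trigger both checks.
intersected = {(7, "realm_a"): 60.0, (8, "realm_a"): 40.0, (8, "realm_b"): 120.0}
result = verify_fractions(intersected,
                          cell_area={7: 100.0, 8: 100.0},
                          source_area={"realm_a": 100.0, "realm_b": 90.0})
```

In this example, realm_b fails both checks: its intersected total differs from the source area, and its fraction in cell 8 exceeds unity.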

Data sources & themes

The Eco-ISEA3H database³⁷ incorporates 17 source datasets, characterizing the Earth’s climate, geology, land cover, and physical geography, as well as human population density and the geographic ranges of nearly 900 large mammalian species. Data sources are listed in Table 1. We first present a brief overview of these sources, and describe sources and themes in greater detail in the following sections.

Table 1 Source datasets and themes included in the Eco-ISEA3H database³⁷. Each dataset is described by full and abbreviated name, source, spatial resolution (for datasets published/distributed at more than one resolution), version, and scenario. Each theme is described by full and abbreviated name and type (whether it contains discrete, categorical values or continuous, real-valued variables).


Climate is characterized primarily by temperature- and precipitation-based averages and extremes, summarized over the past 50 to 70 years, and forecasted for 40 to 60 years in the future under the RCP 8.5 climate change scenario⁵⁹; data sources include WorldClim^30,31, ENVIREM⁶⁰, and the ETCCDI extremes indices derived by Sillmann et al.^61,62 from ERA-40⁶³ and CCSM4⁶⁴. Additionally, present climate is classified via the Köppen-Geiger climate classification system, from GLOH2O⁵⁵. Geological data include soil types, from the Digital Soil Map of the World (DSMW)⁶⁵; near-surface rock types, from the Global Lithological Map (GLiM)⁶⁶; and sedimentary basin types⁶⁷. Human geography is quantified by human population density, from the Gridded Population of the World (GPW)⁶⁸. Land cover is described by the International Geosphere-Biosphere Programme (IGBP) cover classification scheme, from MCD12Q1⁶⁹; and by percent tree, non-tree, and non-vegetated cover, from MOD44B⁷⁰. The Earth’s physical geography is characterized by continental and island landmasses, from Natural Earth; lakes and wetlands, from the Global Lakes and Wetlands Database (GLWD)⁷¹; biogeographic realms⁵⁶; and terrestrial topography and ocean bathymetry, from ENVIREM⁶⁰ and SRTM30_PLUS⁷². Finally, distributional data include the present and estimated natural ranges of large mammalian species, from the IUCN Red List⁵⁰ and the Phylogenetic Atlas of Mammal Macroecology (PHYLACINE)^73,74.

Climate

ENVIREM. The ENVIREM (ENVIronmental Rasters for Ecological Modeling) dataset⁶⁰ contains 16 climatic variables derived from WorldClim v1.4 monthly temperature and precipitation³⁰, and extraterrestrial radiation. These are intended to complement the WorldClim v1.4 bioclimatic variables³⁰, capturing additional environmental features directly relevant to floral and faunal physiology and ecology⁶⁰. Source rasters at 30 arc-second resolution were summarized by area-weighted mean at ISEA3H resolutions 8 and 9. Variable codes, descriptions, and units are listed in Table 2. Title and Bemmels⁶⁰, and references therein, provide full definitions and calculation methods for these variables.
Table 2 Codes, descriptions, and units for the 16 ENVIREM climatic variables, from Title and Bemmels⁶⁰.
ETCCDI. A comprehensive set of 27 climate extremes indices was defined by the Expert Team on Climate Change Detection and Indices (ETCCDI); these generally capture “moderate” extremes, having recurrence intervals of a year or shorter, and are based on observed/simulated daily temperature and precipitation^61,62. Sillmann et al.^61,62 derive these indices from results of a number of global climate models and atmospheric reanalyses, several of which were incorporated in the Eco-ISEA3H database³⁷. Given the relatively low-resolution grids used in modeling and reanalysis, these source rasters were interpolated to ISEA3H cell centroids by inverse (geodesic) distance weighting (IDW). Variable codes, descriptions, and units are listed in Table 3. Sillmann et al.⁶¹ provide full definitions and calculation methods for these indices.
Table 3 Codes, descriptions, and units for the 27 ETCCDI climate extremes indices, from Sillmann et al.^61,62.

The Eco-ISEA3H database³⁷ includes ETCCDI variables based on results of the ERA-40 reanalysis⁶³, produced by the European Centre for Medium-Range Weather Forecasts (ECMWF). The reanalysis combines past meteorological observations with a weather forecasting model, producing a global representation of the state of the atmosphere for each reanalysis time step, usually a six-hour interval⁶³. These were averaged for the period 1958 to 2001, the 44 full years for which the ERA-40 reanalysis was conducted, and were interpolated to ISEA3H resolutions 5 to 9.
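The inverse distance weighting used to interpolate the coarse ETCCDI grids to cell centroids can be sketched as follows. Several choices here are assumptions not given in the text: distances are great-circle distances on a sphere (a stand-in for geodesic distances on the ellipsoid), the distance exponent is 2, and all supplied samples contribute to the estimate. The sample coordinates and values are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0072  # authalic sphere; a simplification of the ellipsoid

def great_circle_km(p, q):
    """Haversine distance between two (lat, lon) points given in degrees."""
    phi1, lam1, phi2, lam2 = map(math.radians, (*p, *q))
    h = (math.sin((phi2 - phi1) / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin((lam2 - lam1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))

def idw(target, samples, power=2.0):
    """Inverse-distance-weighted estimate at `target` from [((lat, lon), value)]."""
    num = den = 0.0
    for point, value in samples:
        d = great_circle_km(target, point)
        if d == 0.0:
            return value                     # target coincides with a sample
        w = d ** -power
        num += w * value
        den += w
    return num / den

# Hypothetical coarse-grid values around a cell centroid at (0.5, 0.5).
grid = [((0.0, 0.0), 10.0), ((0.0, 1.0), 10.0), ((1.0, 0.0), 20.0), ((1.0, 1.0), 20.0)]
estimate = idw((0.5, 0.5), grid)
```

Because the weights depend on distance over the Earth's surface rather than on differences in degrees, the estimate behaves consistently at all latitudes.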

Additionally, the database includes ETCCDI variables based on results of the Community Climate System Model v4 (CCSM4), a global climate model developed for CMIP5⁶⁴. These were averaged for the period 1950 to 2000, to match the approximate period covered by WorldClim v1.4, and for the period 2061 to 2080, to match the final interval for which CCSM4 model results were downscaled/debiased using WorldClim v1.4³⁰. Variables were interpolated to ISEA3H resolution 9.

ETCCDI variables for this latter period represent conditions under Representative Concentration Pathway (RCP) 8.5, the RCP resulting in the highest radiative forcing (8.5 W/m²) by 2100⁵⁹. This scenario was selected such that future conditions maximally different from the present might be considered; in RCP 8.5, rapid population growth, and relatively slow growth in per capita income and technological development, lead to high energy demand without associated climate mitigation policies, resulting in greenhouse gas emissions and atmospheric concentrations increasing significantly in the coming decades⁵⁹.

Köppen-Geiger Climate Classification. As implemented by Beck et al.⁵⁵, the Köppen-Geiger system classifies the Earth’s terrestrial climates into five primary classes, and further into 30 subclasses, based on a set of threshold criteria referencing monthly mean temperature and precipitation. These climate classes are ecologically significant, as regions within each class support floral communities sharing common characteristics. Beck et al.⁵⁵ utilize four climatic datasets, including WorldClim v1.x and v2.x, adjusted to the period 1980 to 2016, to define the present-day classes incorporated in the Eco-ISEA3H database³⁷. The source raster at 30 arc-second resolution was summarized by fraction and mode at ISEA3H resolution 9. Variable codes and descriptions are listed in Table 4.
Table 4 Codes and descriptions for the 30 Köppen-Geiger climate classes, from Beck et al.⁵⁵.
WorldClim v1.4. The first-generation WorldClim dataset³⁰ contains four monthly themes, each with 12 variables, characterizing monthly temperature and precipitation; additionally, it contains 19 bioclimatic variables, derived from the monthly variables, capturing biologically relevant seasonal and annual features of the climate system. These bioclimatic variables, first developed for the BIOCLIM species distribution modeling (SDM) package⁷⁵, are used extensively in SDM studies; a recent synthesis found most were included in more than 1,000 published MaxEnt SDMs (of 2,040 reviewed)⁷⁶.

WorldClim monthly temperature and precipitation rasters are interpolated from weather station observations averaged for the approximate period 1950 to 2000. The interpolation was done using thin plate smoothing splines, with latitude, longitude, and elevation as predictor variables³⁰. These rasters characterize present-day climate, and further served as an observational baseline with which the predictions of CMIP5 global climate models were downscaled and bias-corrected.

The 19 bioclimatic variables, for both present-day and future conditions (the latter averaged for the period 2061 to 2080, from the CCSM4 RCP 8.5 simulation), were incorporated into the Eco-ISEA3H database³⁷; source rasters at 30 arc-second resolution were summarized by area-weighted mean at ISEA3H resolution 9. Variable codes, descriptions, and units are listed in Table 5. O’Donnell and Ignizio⁷⁷ provide full definitions and calculation methods for these variables.

Table 5 Codes, descriptions, and units for the 19 WorldClim bioclimatic variables, from v1.4³⁰ and v2.0³¹.


WorldClim v2.0. The second-generation WorldClim dataset³¹ contains seven monthly themes, each with 12 variables, characterizing monthly temperature, precipitation, solar radiation, wind speed, and vapor pressure; additionally, it contains the standard set of 19 bioclimatic variables, derived from monthly temperature and precipitation.

As in the first-generation dataset, monthly rasters were interpolated from weather station observations, averaged here for the approximate period 1970 to 2000³¹. Again, thin plate smoothing splines were used in the interpolation, but with additional covariates included for one or more interpolated features: distance to coast, computed extraterrestrial radiation, and three satellite-derived observations – cloud cover, and maximum and minimum land surface temperature, from the Moderate Resolution Imaging Spectroradiometer (MODIS) instrument.

The 12 source rasters for each of the seven monthly themes, at 30 arc-second resolution, were summarized by centroid at ISEA3H resolutions 5 to 10. Additionally, the 19 source bioclimatic rasters, at 30 arc-second resolution, were summarized by centroid at ISEA3H resolutions 5 to 10, and by area-weighted mean at ISEA3H resolutions 6 to 9. Codes, descriptions, and units for the bioclimatic variables are listed in Table 5.
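The two summaries named above can be contrasted with a short sketch. It assumes that the centroid summary takes the value of the raster pixel containing the cell centroid, and that the area-weighted mean weights each overlapping pixel by its area of overlap with the cell. Planar coordinates are used for simplicity; function names and inputs are hypothetical, and the sketch does not reproduce the sampling workflow itself.

```python
def centroid_value(grid, x0, y0, pixel_size, cx, cy):
    """Value of the pixel containing the point (cx, cy).

    grid rows run north to south; (x0, y0) is the upper-left corner.
    Returns None when the centroid falls outside the raster.
    """
    col = int((cx - x0) // pixel_size)
    row = int((y0 - cy) // pixel_size)
    if 0 <= row < len(grid) and 0 <= col < len(grid[0]):
        return grid[row][col]
    return None


def area_weighted_mean(values, overlap_areas):
    """Mean of pixel values weighted by overlap area with the cell.

    Pixels with no data (None) or zero overlap are ignored.
    """
    pairs = [(v, a) for v, a in zip(values, overlap_areas)
             if v is not None and a > 0]
    total = sum(a for _, a in pairs)
    if total == 0:
        return None
    return sum(v * a for v, a in pairs) / total


# Hypothetical 2 x 2 raster and a cell centroid in the lower-right pixel.
grid = [[1.0, 2.0], [3.0, 4.0]]
print(centroid_value(grid, 0.0, 2.0, 1.0, 1.5, 0.5))  # 4.0
print(area_weighted_mean([10.0, 20.0], [3.0, 1.0]))   # 12.5
```

The centroid summary is a point sample and so remains cheap at any resolution, whereas the area-weighted mean integrates all pixels within the cell and smooths local variability.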

Geology

DSMW. The Digital Soil Map of the World (DSMW)⁶⁵ describes the geographic distribution and physical and chemical properties of the world’s soils. The DSMW was digitized from the FAO-UNESCO Soil Map of the World, printed at 1:5,000,000 scale. Each digitized mapping unit is assigned a number of soil attributes; here we classify units via the DOMSOI attribute, the dominant soil or land unit code. The DSMW includes 117 soils in 26 major soil groupings, as well as six other land units, for a total of 123 DOMSOI classes. The source vector dataset was summarized by fraction and mode at ISEA3H resolutions 5, 6, and 9. Variable codes and descriptions are listed in Table 6.
Table 6 Codes and descriptions for the 123 DSMW soil and land units, from the FAO⁶⁵.
GLiM. The Global Lithological Map (GLiM)⁶⁶ represents the rock and unconsolidated sediments at or near the Earth’s terrestrial surface; this geological material is a source of geochemical flux to the Earth’s soils, biosphere, and hydrosphere. Hartmann and Moosdorf⁶⁶ compiled the map and accompanying database from 92 regional geological maps and 318 literature sources. Rock was classified into 16 first-level lithological classes; 12 second-level and 14 third-level subclasses further describe specific mineralogical and physical properties.

The source vector dataset was summarized by centroid at ISEA3H resolution 9. Variable codes and descriptions are listed in Table 7. The attribute assigned each ISEA3H cell takes the form xxyyzz; underscore characters (_) in the yy and/or zz slots indicate the second- and/or third-level subclasses were undefined.
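A cell attribute of the form xxyyzz can be split into its three levels as sketched below. The function name is hypothetical, and the example codes are placeholders rather than actual GLiM classes; the sketch assumes each slot is exactly two characters, with a pair of underscores marking an undefined subclass.

```python
def parse_glim_code(code):
    """Split an xxyyzz attribute into (first, second, third) levels.

    A slot filled with underscores is returned as None (undefined).
    """
    if len(code) != 6:
        raise ValueError("expected a six-character code of the form xxyyzz")
    first, second, third = code[0:2], code[2:4], code[4:6]

    def level(slot):
        return None if slot == "__" else slot

    return first, level(second), level(third)


# Placeholder codes, not real GLiM classes.
print(parse_glim_code("aabbcc"))  # ('aa', 'bb', 'cc')
print(parse_glim_code("aa__cc"))  # ('aa', None, 'cc')
print(parse_glim_code("aa____"))  # ('aa', None, None)
```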

Table 7 Codes and descriptions for the 16 first-level, 12 second-level, and 14 third-level GLiM lithological classes, from Hartmann and Moosdorf⁶⁶.


Sedimentary Basins. Sedimentary basins are areas of subsidence in the Earth’s crust, in which sediments eroded from uplands are deposited and potentially preserved for a million or more years⁶⁷, thus entering the planet’s long-term geological record. Nyberg and Howell⁶⁷ delineate active sedimentary basins, covering both the Earth’s terrestrial surface and marine areas over continental crust. The authors operationally defined basins as low-relief areas containing Quaternary Period sediments, and further classified the basins by tectonic setting, identifying backarc, forearc, foreland, extensional, intracratonic, passive margin, and strike-slip basins on the basis of published literature and geological maps⁶⁷.

Terrestrial basins were incorporated in the Eco-ISEA3H database³⁷. Note that no terrestrial backarc basins were delineated. The source vector dataset was summarized by fraction and mode at ISEA3H resolution 9.

Human geography

GPW. Human population density is one of several measures of human presence and activity which together define the human “footprint,” associated with profound, adverse effects on natural systems⁷⁸. Given this pervasive impact, data characterizing degree of human influence are used as predictors in some ecological models, including SDMs²⁸. The Gridded Population of the World (GPW)⁶⁸ density dataset represents the global distribution of human population density, developed using census records, population registers, and the administrative boundaries of approximately 13.5 million national and subnational units. Density, measured by population count per square kilometer, was estimated every five years, from 2000 to 2020, inclusive. The source raster dataset for each year, at 30 arc-second resolution, was summarized by area-weighted mean at ISEA3H resolutions 6 to 9.

Land cover

MCD12Q1. The Moderate Resolution Imaging Spectroradiometer (MODIS) land cover type (MCD12Q1) dataset⁶⁹ describes land cover globally, via six different classification schemes. The Eco-ISEA3H database³⁷ includes land cover classified via the International Geosphere-Biosphere Programme (IGBP) scheme, initially developed for the DISCover dataset⁷⁹; the IGBP scheme includes 16 land cover classes, 13 natural and three anthropogenically modified. The MCD12Q1 dataset is derived from reflectance data collected by the MODIS instruments aboard the Terra and Aqua satellites; the two instruments observe the entirety of the Earth’s surface every one to two days, recording reflectance in 36 spectral bands.

MCD12Q1 land cover is estimated annually. For each year, reflectance time-series data are smoothed and gap-filled via smoothing splines; derived spectro-temporal features are used as input to a random forest classifier; and output land cover classifications are post-processed, to incorporate prior knowledge and reduce inter-annual variability⁶⁹. The source raster dataset for 2001 and 2014 to 2018, inclusive, at approximately 500 meter resolution, was summarized by centroid, fraction, and mode at ISEA3H resolutions 5 to 10. Variable codes and descriptions are listed in Table 8.

Table 8 Codes and descriptions for the 16 IGBP land cover classes, from Friedl and Sulla-Menashe⁶⁹.


MOD44B. The MODIS vegetation continuous fields (VCF) dataset (MOD44B)⁷⁰ describes global land cover quantitatively, as fractions of three cover components: tree canopy, non-tree canopy, and non-vegetated, barren cover. Note that canopy cover, as defined here, indicates the area over which light is intercepted; this differs from crown cover, which indicates the area covered by a plant’s crown regardless of light interception/penetration. The MOD44B dataset is derived from reflectance data collected by the MODIS instrument aboard the Terra satellite; for each annual VCF estimate, reflectance time-series data are used as input to a bagged ensemble of linear regression trees⁷⁰. The source raster dataset for 2018, at approximately 250 meter resolution, was summarized by area-weighted mean at ISEA3H resolution 9.

Physical geography

Biogeographic Realms. As defined by Olson et al.⁵⁶, the eight terrestrial biogeographic realms are the broadest divisions of the Earth’s terrestrial flora and fauna; these may be further subdivided into biomes and ecoregions, the latter containing distinct natural communities. Olson et al.⁵⁶ developed this hierarchical system primarily for global and regional conservation planning. Realm, biome, and ecoregion delineations are based on expert knowledge, contributed by more than 1,000 scientists working in relevant fields; these divisions thus incorporate knowledge of endemic taxa, unique species assemblages, and local geological and biogeographical history⁵⁶. Realms were included in the Eco-ISEA3H database³⁷ to provide a high-level classification of the Earth’s biogeography, from a source frequently cited in the scientific literature. The source vector dataset was summarized by fraction and mode at ISEA3H resolutions 5 to 9. Variable codes and descriptions are listed in Table 9.
Table 9 Codes and descriptions for the eight biogeographic realms, from Olson et al.⁵⁶.
ENVIREM. In addition to the climatic variables discussed previously, the ENVIREM dataset⁶⁰ contains two topographic variables, derived from SRTM30_PLUS. These two indices characterize terrain roughness, a measure of variability in local elevation; and topographic wetness, a function of slope and upgradient contributing area. Source rasters at 30 arc-second resolution were summarized by area-weighted mean at ISEA3H resolutions 8 and 9. Variable codes, descriptions, and units are listed in Table 10.
Table 10 Codes, descriptions, and units for the two ENVIREM topographic variables, from Title and Bemmels⁶⁰.
GLWD. The Global Lakes and Wetlands Database (GLWD)⁷¹, Level 3, represents the maximum extent of lakes, reservoirs, rivers, and a number of wetland types, comprising 12 waterbody classes in total. Lehner and Döll⁷¹ compiled the three levels of the GLWD by combining seven source map and attribute datasets, and suggest Level 3 may be useful as input in global hydrologic and climatic modeling. The source raster dataset at 30 arc-second resolution was summarized by fraction and mode at ISEA3H resolution 9. Variable codes and descriptions are listed in Table 11.
Table 11 Codes and descriptions for the 12 GLWD waterbody classes, from Lehner and Döll⁷¹.
Natural Earth. Natural Earth is a public-domain collection of raster and vector datasets developed for production cartography. Three vector themes describing physical geography were incorporated: Land, which includes continents and major islands; Islands, which includes additional minor islands; and Lakes, which includes lakes and reservoirs. Source vector datasets at 1:10,000,000 scale were summarized by fraction at ISEA3H resolutions 5 to 9. Further, fractions for a Terra theme were calculated, by adding per-cell Land and Islands, and subtracting Lakes. The Terra theme may be thresholded (for example, at a fraction value ≥0.5) to identify terrestrial ISEA3H cells, excluding cells covered primarily by ocean or freshwater habitat.
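The Terra calculation and the example threshold can be sketched as follows. Function names and the per-cell fractions are hypothetical. Clamping the result to the interval from 0 to 1 is an added assumption, intended only to guard against small inconsistencies between the three source themes; it is not described in the text above.

```python
def terra_fraction(land, islands, lakes):
    """Per-cell Terra fraction: Land plus Islands, minus Lakes."""
    value = land + islands - lakes
    # Assumption: clamp to [0, 1] in case the themes overlap slightly.
    return min(1.0, max(0.0, value))


def is_terrestrial(fraction, threshold=0.5):
    """Apply the example threshold (fraction >= 0.5)."""
    return fraction >= threshold


# Hypothetical cells: (Land, Islands, Lakes) fractions.
cells = {
    "cell_a": (0.70, 0.05, 0.10),  # mostly land
    "cell_b": (0.30, 0.00, 0.00),  # mostly ocean
    "cell_c": (0.90, 0.00, 0.50),  # land, but half covered by lakes
}
for cell_id, (land, islands, lakes) in cells.items():
    terra = terra_fraction(land, islands, lakes)
    print(cell_id, round(terra, 2), is_terrestrial(terra))
```

Only the first of these hypothetical cells would be retained as terrestrial at the 0.5 threshold.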
SRTM30_PLUS. The SRTM30_PLUS dataset⁷² is a global digital elevation model (DEM), representing the Earth’s terrestrial topography and ocean bathymetry. A number of elevation sources were incorporated in developing the DEM; terrestrial topography was derived from the Shuttle Radar Topography Mission (SRTM) at latitudes between ±60°, from GTOPO30 in the Arctic, and from GLAS/ICESat in the Antarctic. Ocean bathymetry was derived from satellite radar altimetry, calibrated on 298 million corrected ship-based depth soundings, gathered from several sounding sources⁷². The source raster dataset at 30 arc-second resolution was summarized by area-weighted mean at ISEA3H resolutions 6 to 10.

Species ranges

From the Red List and the Phylogenetic Atlas, the geographic ranges of species belonging to four mammalian orders were sampled: Artiodactyla (even-toed ungulates), Perissodactyla (odd-toed ungulates), Primates, and Proboscidea (elephants). These species are primarily large-bodied herbivores, and as such are frequently the subject of dental ecometrics research; for example, averaged dental traits of communities of these mammals have been used to predict measures of local precipitation, at both global³ and regional¹¹ scales.

IUCN Red List. The International Union for Conservation of Nature’s (IUCN) Red List of Threatened Species⁵⁰ comprises global assessments of the conservation status of nearly 150,000 floral, faunal, and fungal species. The Red List includes expert-delineated geographic ranges for most of these species, including most extant mammalian species. For each species, portions of the range for which the species’ presence was coded extant, and for which its origin was coded native or reintroduced, were sampled. Source vector datasets were summarized by fraction at ISEA3H resolutions 8 and 9 (Artiodactyla and Perissodactyla), 9 (Primates), and 7 to 9 (Proboscidea).
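The selection of range portions described above amounts to a filter on two attributes. The sketch below uses plain-text labels and hypothetical field names ("presence", "origin") in place of the coded attribute values in the Red List spatial data, and the records are invented for illustration.

```python
KEPT_PRESENCE = {"extant"}
KEPT_ORIGIN = {"native", "reintroduced"}


def select_range_parts(parts):
    """Keep range portions coded extant and native or reintroduced."""
    return [
        part for part in parts
        if part["presence"] in KEPT_PRESENCE
        and part["origin"] in KEPT_ORIGIN
    ]


# Hypothetical range portions for one species.
parts = [
    {"part_id": 1, "presence": "extant", "origin": "native"},
    {"part_id": 2, "presence": "extant", "origin": "introduced"},
    {"part_id": 3, "presence": "extinct", "origin": "native"},
    {"part_id": 4, "presence": "extant", "origin": "reintroduced"},
]
kept = select_range_parts(parts)
print([part["part_id"] for part in kept])  # [1, 4]
```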
PHYLACINE. The Phylogenetic Atlas of Mammal Macroecology (PHYLACINE)⁷³,⁷⁴ includes trait, phylogeny, and geographic range data for all mammalian species known from the last interglacial period (approximately 130,000 years ago) to the present, both extant and recently extinct. PHYLACINE includes species’ ranges under two scenarios, both of which were incorporated: present-day ranges, from the IUCN v2016.3; and “present-natural” ranges, for which each species’ present-day range was modified to estimate its distribution under current climatic conditions, but absent anthropogenic pressure. This included, among eight modification categories, reconnecting fragmented ranges, by filling suitable intervening habitat; and expanding ranges reduced by human activity, by filling suitable adjacent habitat. Present-natural range modifications are documented for each species in PHYLACINE’s metadata, and intended to mitigate human impact on the results of macroecological analysis and modeling. Source rasters at approximately 100 kilometer resolution were summarized by centroid at ISEA3H resolution 9.
