Large-scale forecasting of Heracleum sosnowskyi habitat suitability under the climate change on publicly available data

From the popular algorithms, we chose the Random forest model as the most suitable for our case. The data required for predictions can be divided into plant occurrence records and environmental features. Bioclimatic variables and soil properties were selected as the main environmental features. All of the data were obtained from open sources.

Heracleum Sosnowskiy plant description

Heracleum sosnowskyi is a monocarpic perennial plant of the Apiaceae family. The height is up to 3–5 m with a straight stem up to 12 cm in diameter. HS compound steam leaves can reach 150 cm, both long and wide³⁸. The blooming period starts in July and continues until the end of September. Plant reproduction is performed by seeds only. The seeds’ depth of germination is reported as mainly in the upper 5 cm down to 15 cm of soil. One plant can produce 10–20,000 seeds^39,40. Seeds germinate in the early spring, while some have reported that a period of cold stratification for the dormancy break is obligatory for germination development. Suitable conditions for HS include a temperate climate with warm humid summers and cold winters, while it is probably not drought resistant. Plants of HS tend to neutral soils with a pH range from 6 to 7, rich in nutrients, and being reported as nitrophilous, so the eutrophication of the environment favours HS development. HS plants do not tolerate shade conditions in the first growing period.

HS is mostly spread in artificial and semi-natural habitats, including grasslands, pastures, parks, roadsides, agricultural fields, riverbanks or canal sides, and other distributed habitats. Currently, the main pathways of spread include an involuntary entry with soil on vehicles, machinery, footwear or the use of soil as a commodity (as the growing medium rich in organic matter)³⁹.

Study area

The area for modelling extends from approximately 41(^{circ }) to 70(^{circ }) N and from 27(^{circ }) to 60(^{circ }) E, and Kaliningrad region, it equals to approximately 4 mln km² (Fig. 4).

Figure 4

Map of the study area: white colour represents the territory used for prediction, red points correspond to the dataset of HS occurrence, collected from the available sources.

Full size image

The European part of Russia is the most inhabited part of the country, and it is the home of approximately 80% of the total population of Russia. It includes the East European Plain, Caucasus mountains and Ural mountains, with the predominance of the East European Plain. Environmental characteristics across the territory of study vary significantly. The climate is changing from semi-arid in the south to subarctic in the north, including humid continental climate conditions. Natural vegetation is represented by almost all types of biomes with the prevalence of different types of forests: broadleaf and mixed forests, coniferous forests, and boreal forests (taiga), while the area of arable lands is reported to be approximately 650,000 km²^41,42. The territory is subjected to the constant land-use types and cover changes due to the urbanization and switch of the status of arable lands—i.e. reduction of croplands and development of fallows and forests, and, vice versa, returning of some of them into the cultivation process⁴³. The soil cover is represented by the contrast by their physicochemical properties groups, in the northern part of Luvisols, Podzols, Histosols, while of the southern part—by Chernozems, Kastanozems, Solonetz⁴⁴.

Collection of the input data

Plant occurrence data

Plant occurrence coordinates were collected from several publicly available sources related to citizen science projects: the Global Biodiversity Information Facility database⁴⁵, iNaturalist database⁴⁶, and the database of the “Antiborschevik” community⁴⁷. Records were documented by human observation and collected from 2000 to 2021. The overall number of initial occurrence points from combined sources is 7637.

Environmental predictors

Climate data Modelling was performed for current and future climate conditions at its two scenarios, selected year ranges were 2000–2018 and 2040–2060 respectively.

Climatic variables were collected from the Worldclim database⁴⁸, containing the average seasonal information relevant to the physiological characteristics of species and available at different resolutions. We chose 10 arc-minutes spatial resolution taking into account the size of the studied area. Table 1 provides a short description of the used bioclimatic features, and we refer the reader to the Worldclim project for detailed information on the variables’ calculation.

For the future climate scenarios, we used two Shared Socioeconomic Pathways (SSPs)⁴⁹—1-2.6 and 5-8.5, corresponding to the lowest (keeping global mean temperature increase below 2 (^{circ })C) and the highest (at the increase of population without technological change) predicted future greenhouse gases emission scenarios. For these data, we took the same resolution (10 arc-minutes) as discussed above.

We used the Equilibrium Climate Sensitivity to select the climate model to model future HS distribution. Equilibrium climate sensitivity (ECS) is defined as the global mean surface air temperature change due to a rapid doubling of carbon dioxide concentrations as soon as the associated ocean-atmosphere-sea ice system reaches equilibrium. As the ECS value increases, the model’s sensitivity to the CO(_2) concentration in the atmosphere increases. We have chosen CanESM5 model (ECS—5.6), CNRM-CM6-1 model (ECS—4.3) and BCC-CSM2-MR model (ECS—3.0)⁵⁰.

Table 1 Description of used bioclimatic variables.

Full size table

For the future climate scenarios we selected three climate models:

BCC-CSM2-MR Beijing Climate Center climate system model developed in Beijing Climate Center, China Meteorological Administration⁵¹. Model has horizontal resolution 1.125(^{circ }) by 1.125(^{circ }).
CanESM5 Canadian Earth System Model version 5 developed in Canadian Center for Climate Modelling and Analysis, Canada⁵². Horizontal resolution 2.81(^{circ }) by 2.81(^{circ }).
CNRM-CM6-1 Climate model developed in National Center of Meteorological Research, France⁵³. Horizontal resolution 1.4(^{circ }) by 1.4(^{circ }).

Authors of the WorldClim project prepared historical and future climate data to a uniform spatial (10 arc-minutes) and temporal resolution.

Soil data Soil data were downloaded from the SoilGrids database⁵⁴—a system for global digital soil mapping. SoilGrids provides continuous data at several depths of the spatial distribution of soil properties across the globe with selected resolution. It uses a machine learning approach to reconstruct continuous data from 230,000 soil profile observations from the WoSIS (The World Soil Information Service) database and a series of environmental covariates.

From the whole set of the data provided by SoilGrids several properties were chosen for the forecasting: relative percentage of silt (Silt, %), sand (Sand, %), a volumetric fraction of coarse fragments (CF, %), cation exchange capacity (CEC, ({text{cmol}}_{c}/{text{kg}})) and soil organic carbon (SOC, g/kg) at the depth 5–15 cm, where the HS seeds are assumed to be located. These variables are expected to be more stable over time than bioclimatic predictors; thus, chosen soil properties could be implemented for the future time the same as in the present.

Data pre-processing

All the data were transformed to the ASCII format by R script and using software DIVA-GIS following the tutorial for the preparation of WorldClim files for use in SDM (http://www.lep-net.org/wp-content/uploads/2016/08/WorldClim_to_MaxEnt_Tutorial.pdf) with unified selected resolution 340 sq.km.

Optimization of the occurrence points amount

The general problem in using the available data collected from the databases of the citizen science projects is that the points of observation are distributed non-uniformly. For instance, the frequency of the records depends on the density of the population directly. The spatial filtering of the data (reducing the number of points) can be performed to reduce the sampling bias⁵⁵. We prepared three datasets with a distance between points of 4, 7 and 10 km with 2402, 1846 and 1504 occurrence points correspondingly filtering the initial dataset. For the thinning step thin() function was used within the R package spThin with 100 iterations for each of chosen thinning distances. To understand how much data we could lose, we used the analysis of feature distribution and evaluated the general fairness of the model performance.

Pseudo-absence generation

Due to the availability only of the presence points, it is important to generate the absence points for further implementation of the selected algorithm. Although the generation of pseudo-absence points in SDM research is a widespread solution, a closer look at the literature reveals several gaps and shortcomings. Since the raw dataset of the HS distribution demonstrates strong sampling bias, the generation of pseudo-absence points using the usual ‘random’ strategy can aggravate the sampling bias problem. Thus, the combination of the ‘disk’ and ‘random’ strategies was applied for the generation of the pseudo-absence points using the biomod R package¹⁷.

The ‘disk’ strategy is established on the geographic distance works as separation from truth presence and possible absence points. The optimal geographic distance for HS was chosen as 25 km. This distance was chosen empirically by trial-and-error. We started with 18 km (because the size of the cell is 9–18 km depending on location) and finished with 50 km. Using distances such as 30–50 km lead to a positive spatial autocorrelation. Thus, we decided to set 25 km which finally provided both optimal model performance and reduced spatial autocorrelation.
The second part of the generation was based on the ‘random’ strategy with filtration: according to the different range of climate conditions on the territory of Russia, there are several places where HS is not detected, thus not growing. The selection of unsuitable places for HS related to the north of Russia, where it is might be too cold for plant species. From all amount of randomly generated generated points we selected points with condition latitude (> 64^{circ }), according to tundra board line.

Features selection procedure

To avoid over-fitting and to choose the most conscientious set of parameters for final modelling, two approaches were combined. We searched features that are not correlated with others by a selected threshold is equal to 0.8 in absolute values⁵⁶ and estimated variable importance using the Mean Decrease Gini (MDG) and the Mean Decrease Accuracy (MDA) as the result of modelling on enumerated parameters’ combinations. MDG score is related to the homogeneity of the nodes and leaves coefficient. With the rise of the MDG score the importance of the corresponding feature is also increasing. MDA describes how much accuracy decrease by removing the feature. We selected the most important features according to the MDG and MDA scores by the highest values of both metrics using a sequential search from an initial set of variables.

Modelling approach

Random forest

Choosing the appropriate method for creating the tool for accurate SDM is crucial because the overall performance could vary dramatically, depending on the selected model and particular use case. There is a limited amount of acceptable machine learning methods that can be used in SDM. Several popular methods demonstrated high performance in modelling on large areas: GBM, RF, and GLM. In particular, for modelling and prediction of the potential distribution of invasive species, GLM and RF were used⁵⁷. We decided to use RF because this model was successfully implemented for solving a variety of tasks such as predictions of animal and plant distributions, and also was used for making predictions on a large territory⁵⁸. The other important advantage that should be noticed is the straightforward interpretability of RF, which means that it is possible to evaluate the impact of each environmental parameter on the occurrence of the invasive species.

Approach to the cross-validation of the model

A unique approach for the model calibration is needed to reduce spatial autocorrelation caused by the absence of a strict sampling design. In our case, the data was split into training and testing folds using the spatial blocks technique in a scheme of 13-fold cross-validation. Random spatial splitting was performed 20 times to calibrate the model, with a distance between blocks set as 100 km. To calibrate the model we used a spatial blocks approach with random type from R package blockCV.

Evaluation of the model performance

To evaluate the performance of the model a classic approach for ecology was used—Area Under Curve (AUC) or Receiver operating characteristic (ROC), related to the independent threshold techniques¹⁶. The principle of methods lies in the standard confusion matrix, where rows and columns represent actual and predicted classes. The construction of ROC curves uses all possible thresholds to obtain different confusion matrices which leads to the reproduction of the curve with two-dimensional space: (1) on y-axis is True Positive Rate (sensitivity, recall); (2) on x-axis is False Positive Rate (equal to 1 − specificity). In our case true positive (TP, sensitivity) rate means that predicted places where HS grows correspond to actual. Similarly, true negative rate (TN, specificity) indicates correctly classified locations as absence points. In contrast, the missteps when the model predicted places as presence points for plants that are incorrect are False Positive, FP, and places where HS is absent, according to the model, while this is not true are recognised as False Negative, FN.