A 2-scale ensemble machine learning framework
Predictions of soil nutrients are based on a fully automated and fully optimized 2-scale Ensemble Machine Learning (EML) framework as implemented in the mlr package for Machine Learning (https://mlr.mlr-org.com/). The entire process can be summarized in the following eight steps (Fig. 7):
- 1.
Prepare point data, quality control all values and remove any artifacts or typos.
- 2.
Upload to Google Earth Engine, overlay the point data with the key covariates of interest and test fitting random forest or similar to get an initial estimate of relative variable importance and pre-select features of interest.
- 3.
Decide on a final list of all covariates to use in predictions, prepare covariates for predictive modeling—either using Amazon AWS or similar. Quality control all 250 m and 30 m resolution covariates and prepare Analysis-Ready data in a tiling system to speed up overlay and prediction.
- 4.
Run spatial overlay using 250 m and 30 m resolution covariates and generate regression matrices.
- 5.
Fit 250 m and 30 m resolution Ensemble Machine Learning models independently per soil property using spatial blocks of 30–100 km. Run sequentially: model fine-tuning, feature selection and stacking. Generate summary accuracy assessment, variable importance, and revise if necessary.
- 6.
Predict 250 m and 30 m resolution tiles independently using the optimized models. Downscale the 250 m predictions to 30 m resolution using Cubicsplines (GDAL).
- 7.
Combine predictions using Eq. (3) and generate pooled variance/s.d. using Eq. (4).
- 8.
Generate all final predictions as Cloud-Optimized GeoTIFFs. Upload to the server and share through API/Geoserver.
Scheme: a two-scale framework for Predictive Soil Mapping based on Ensemble Machine Learning (as implemented in the mlr and mlr3 frameworks for Machine Learning28 and based on the SuperLearner algorithm). This process is applied to the bulk of soil samples; the individual models per soil variable are then fitted using automated fine-tuning, feature selection and stacking. The map shows the distribution of training points used in this work. The training points that are publicly available can be obtained from https://gitlab.com/openlandmap/compiled-ess-point-data-sets/.
For the majority of soil properties, excluding depth to bedrock, we also use soil depth as one of the covariates so that the final models for the two scales are in the form5:
$$\begin{aligned} y(\phi, \theta, d) = d + X_1(\phi, \theta) + X_2(\phi, \theta) + \cdots + X_p(\phi, \theta) \end{aligned}$$
(1)
where y is the target variable, d is the soil sampling depth, (\phi, \theta) are geographical coordinates (northing and easting), and (X_p) are the covariates. Adding soil depth as a covariate allows for directly producing 3D predictions35, which is our preferred approach as predictions can then be produced at any depth within the standard depth interval (e.g. 0–50 cm).
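A minimal sketch of this 3D approach, using scikit-learn's Random Forest on synthetic data as a stand-in for the full ensemble (all variable names, values and the depth trend below are illustrative, not from the actual iSDAsoil models):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Synthetic training data: coordinates (phi, theta), sampling depth d,
# and a target that decreases with depth (e.g. soil organic carbon).
n = 500
phi, theta = rng.uniform(0, 10, n), rng.uniform(0, 10, n)
d = rng.uniform(0, 50, n)                      # depth in cm
y = 30 - 0.3 * d + 2 * np.sin(phi) + rng.normal(0, 1, n)

# Depth enters the feature matrix exactly like any other covariate.
X = np.column_stack([d, phi, theta])
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# A single 3D model can now predict at any depth in the 0-50 cm interval.
loc = np.array([[5.0, 5.0]])                   # one location (phi, theta)
depths = np.array([0.0, 20.0, 50.0])
preds = model.predict(np.column_stack([depths, np.repeat(loc, 3, axis=0)]))
```

Because depth is just another feature, no separate model per depth interval is needed.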
Ensemble machine learning
Ensembles are predictive models that combine predictions from two or more learners36. We implement ensembling within the mlr package by fitting a ‘meta-learner’ i.e. a learner that combines all individual learners. mlr has extensive functionality, especially for model ‘stacking’ i.e. generating ensemble predictions, and also incorporates spatial Cross-Validation37. It also provides wrapper functions to automate hyper-parameter fine-tuning and feature selection, which can all be combined into fully-automated functions to fit and optimize models and produce predictions. Parallelisation can be initiated using the parallelMap package, which automatically determines available resources and cleans up all temporary sessions38.
For stacking multiple base learners we use the SuperLearner method39, which is the most computationally intensive method but allows for an independent assessment of all individual learners through k-fold cross-validation with refitting. To speed up computing we typically use a linear model (predict.lm) as the meta-learner, so that the final formula used to derive the ensemble prediction can be directly interpreted by printing the model summary.
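The same stacking logic can be sketched with scikit-learn's StackingRegressor, which, like the SuperLearner setup described here, combines out-of-fold base-learner predictions through a meta-learner; the base learners below are illustrative stand-ins for the actual mlr configuration:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import (GradientBoostingRegressor, RandomForestRegressor,
                              StackingRegressor)
from sklearn.linear_model import ElasticNet, LinearRegression

X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)

# Base learners standing in for ranger / xgboost / glmnet etc.
base = [
    ("rf", RandomForestRegressor(n_estimators=50, random_state=0)),
    ("gbm", GradientBoostingRegressor(random_state=0)),
    ("enet", ElasticNet(alpha=0.5)),
]

# Linear meta-learner: out-of-fold base predictions (5-fold CV) are
# combined by a plain linear model, so the ensemble formula is readable.
stack = StackingRegressor(estimators=base, final_estimator=LinearRegression(), cv=5)
stack.fit(X, y)

# The meta-learner coefficients show each base learner's contribution.
print(dict(zip([name for name, _ in base], stack.final_estimator_.coef_.round(3))))
```

Printing the meta-learner coefficients mirrors printing the model summary of the lm meta-learner in R.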
The predictions in the Ensemble models described in Fig. 7 are, in principle, based on the following five Machine Learning libraries common to many soil mapping projects5.
- 1.
Ranger: fully scalable implementation of Random Forest23.
- 2.
XGboost: extreme gradient boosting40.
- 3.
Deepnet: the Open Source implementation of deep learning26.
- 4.
Cubist: the Open Source implementation of Cubist regression trees41.
- 5.
Glmnet: GLM with Lasso or Elasticnet Regularization24.
These open source libraries, with the exception of Cubist, are available through a variety of programming environments including R and Python, and also as standalone C++ libraries.
Merging coarse and fine-scale predictions
The idea of modeling soil spatial variation at different scales can be traced back to the work of McBratney42. In a multiscale model, soil variation can be considered a composite signal (Fig. 8):
$$\begin{aligned} y(\mathbf{s}_{\mathtt{B}}) = S_4(\mathbf{s}_{\mathtt{B}}) + S_3(\mathbf{s}_{\mathtt{B}}) + S_2(\mathbf{s}_{\mathtt{B}}) + S_1(\mathbf{s}_{\mathtt{B}}) + \varepsilon \end{aligned}$$
(2)
where (S_4) is the value of the target variable estimated at the coarsest scale, (S_3), (S_2) and (S_1) are the progressively finer-scale components, (\mathbf{s}_{\mathtt{B}}) is the location or block of land, and (\varepsilon) is the residual soil variation, i.e. pure noise.
Decomposition of a signal of spatial variation into four components plus noise. Based on McBratney42. See also Fig. 13 in Hengl et al.21.
In this work we used a somewhat simplified version of Eq. (2) with only two scale-components: coarse ((S_2); 250 m) and fine ((S_1); 30 m). We produce the coarse-scale and fine-scale predictions independently, then merge using a weighted average43:
$$\begin{aligned} \hat{y}(\mathbf{s}_{\mathtt{B}}) = \frac{\sum_{i=1}^{2} w_i \cdot S_i(\mathbf{s}_{\mathtt{B}})}{\sum_{i=1}^{2} w_i}, \;\; w_i = \frac{1}{\sigma_{i,\mathrm{CV}}^2} \end{aligned}$$
(3)
where (\hat{y}(\mathbf{s}_{\mathtt{B}})) is the ensemble prediction, (w_i) is the model weight and (\sigma_{i,\mathrm{CV}}^2) is the model squared prediction error obtained using cross-validation. As an example, for soil pH the coarse-scale model is somewhat more accurate with (\mathrm{RMSE}=0.463), while the 30 m covariates achieve at best (\mathrm{RMSE}=0.661); hence the weights for the 250 m model are about 2(\times) higher than for the 30 m resolution model. A step-by-step procedure explaining in detail how the 2-scale predictions are derived and merged is available at https://gitlab.com/openlandmap/spatial-predictions-using-eml. An R package landmap44 that implements the procedure in a few lines of code is also available.
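A minimal sketch of the weighted merging in Eq. (3), using the soil pH RMSE values quoted above (the per-pixel predictions are hypothetical):

```python
import numpy as np

def merge_predictions(preds, cv_rmse):
    """Merge per-scale predictions with weights w_i = 1 / sigma_i^2 (Eq. 3)."""
    w = 1.0 / np.asarray(cv_rmse) ** 2
    preds = np.asarray(preds, dtype=float)
    return (w[:, None] * preds).sum(axis=0) / w.sum()

# Soil pH example from the text: RMSE 0.463 (250 m) vs 0.661 (30 m).
w250, w30 = 1 / 0.463**2, 1 / 0.661**2
print(round(w250 / w30, 2))   # about 2x higher weight for the 250 m model

coarse = np.array([6.1, 5.8])   # hypothetical per-pixel pH predictions, 250 m
fine = np.array([6.4, 5.5])     # hypothetical per-pixel pH predictions, 30 m
merged = merge_predictions([coarse, fine], [0.463, 0.661])
```

The merged value always falls between the two input predictions, pulled toward the more accurate scale.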
Transformation of log-normally distributed nutrients and properties
For the majority of log-normally distributed (right-skewed) variables we model and predict the ln-transformed values ((\log_e(x+1))), then provide back-transformed predictions ((e^{x}-1)) to users via iSDAsoil. Note that pH is itself a log transformation of the hydrogen ion concentration.
Although ln-transformation is not required for non-linear models such as Random Forest or Gradient Boosting, we decided to apply it to give proportionally higher weights to lower values. This is, in principle, a biased decision by us, the modelers, as our interest is in improving predictions of critical values for agriculture, i.e. producing maps of nutrient deficiencies and similar (hence the focus on smaller values). If the objective of mapping were to map soil organic carbon of peatlands or similar, then the ln-transformation could have decreased the overall accuracy, although with Machine Learning models such effects are difficult to anticipate as they are highly non-linear.
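The transform pair is exactly invertible, as a quick NumPy check shows (the nutrient values below are made-up examples):

```python
import numpy as np

# Right-skewed nutrient values (e.g. extractable P in ppm; illustrative only).
x = np.array([0.0, 2.0, 15.0, 400.0])

# Model in log space: ln(x + 1) compresses large values, so errors on
# small (agronomically critical) values carry relatively more weight.
z = np.log1p(x)            # ln(x + 1), safe at x = 0

# Back-transform predictions for users: e^z - 1.
x_back = np.expm1(z)
print(np.allclose(x, x_back))   # True: the transform is exactly invertible
```

Using log1p/expm1 rather than log/exp avoids precision loss for values near zero.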
Derivation of prediction errors
We also provide per-pixel uncertainty in terms of prediction errors or prediction intervals (e.g. 50%, 68% and/or 90% probability intervals)45. Because stacking of learners is based on repeated resampling, the prediction errors (per pixel) can be determined using either:
- 1.
Quantile Regression Random Forest46, in our case by using the 4–5 base learners,
- 2.
Simplified procedure using bootstrapping, then deriving prediction errors as the standard deviation across multiple independently fitted learners1.
Both are non-parametric techniques and the prediction errors do not require any distributional assumptions or initial parameters, but come at the cost of extra computing. By default, we provide prediction errors with a probability of 67%, which corresponds approximately to the ±1 standard deviation prediction interval. Prediction errors indicate extrapolation areas and should help users minimize risks when making decisions.
For derivation of prediction interval via either Quantile Regression RF or bootstrapping, it is important to note that the individual learners must be derived using randomized subsets of data (e.g. fivefold) which are spatially separated using block Cross-Validation or similar, otherwise the results might be over-optimistic and prediction errors too narrow.
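A sketch of the bootstrap variant (option 2 above), using a plain decision tree on synthetic data as a stand-in for the actual learners; note that in real use the resamples should respect spatial blocks as described above, which is omitted here for brevity:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
X = rng.uniform(0, 10, (300, 2))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=300)

# Fit the same learner on bootstrap resamples of the training data.
preds = []
for b in range(20):
    idx = rng.integers(0, len(X), len(X))          # sample with replacement
    m = DecisionTreeRegressor(max_depth=6, random_state=b).fit(X[idx], y[idx])
    preds.append(m.predict(X))
preds = np.vstack(preds)                           # shape (20, n_pixels)

# Per-pixel prediction error = s.d. across the bootstrapped learners;
# +/- 1 s.d. around the mean gives roughly a 67% prediction interval.
mu, sigma = preds.mean(axis=0), preds.std(axis=0)
lower, upper = mu - sigma, mu + sigma
```

With spatially blocked resamples, pixels far from any training data show inflated sigma, flagging extrapolation areas.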
Schematic example of the derivation of a pooled variance ((\sigma_{\mathtt{250m+30m}})) using the 250 m and 30 m predictions and prediction errors with (a) larger and (b) smaller differences in independent predictions.
Further, the pooled variance ((\hat{\sigma}_E)) from the two independent models (250 m and 30 m scales in Fig. 7) can be derived using47:
$$\begin{aligned} \hat{\sigma}_E = \sqrt{\sum_{j=1}^{s} w_j \cdot (\hat{\sigma}_j^2 + \hat{\mu}_j^2) - \left( \sum_{j=1}^{s} w_j \cdot \hat{\mu}_j \right)^2}, \;\; \sum_{j=1}^{s} w_j = 1 \end{aligned}$$
(4)
where (\hat{\sigma}_j^2) is the prediction error variance of the independent components, (\hat{\mu}_j) is the predicted value, and (w_j) are the weights per predicted component (which need to sum to 1). If the two independent models (250 m and 30 m) produce very similar predictions so that (\hat{\mu}_{\mathtt{250}} \approx \hat{\mu}_{\mathtt{30}}), then the pooled variance approaches the weighted average of the two variances; if the independent predictions differ ((\hat{\mu}_{\mathtt{250}} \ne \hat{\mu}_{\mathtt{30}})), then the pooled variance increases proportionally to this additional difference (Fig. 9).
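Eq. (4) can be implemented directly; the pH means, s.d. values and weights below are hypothetical, chosen only to show how disagreement between the two scales inflates the pooled error:

```python
import numpy as np

def pooled_sd(mu, sigma, w):
    """Pooled s.d. of independent predictions (Eq. 4); weights must sum to 1."""
    mu, sigma, w = map(np.asarray, (mu, sigma, w))
    assert np.isclose(w.sum(), 1.0)
    return np.sqrt((w * (sigma**2 + mu**2)).sum() - (w * mu).sum() ** 2)

# Case 1: the two scales agree -> pooled variance is close to the
# weighted average of the individual variances.
s_same = pooled_sd(mu=[6.2, 6.2], sigma=[0.4, 0.6], w=[0.67, 0.33])

# Case 2: the two scales disagree -> the difference in means adds an
# extra term, so the pooled s.d. grows.
s_diff = pooled_sd(mu=[6.2, 5.4], sigma=[0.4, 0.6], w=[0.67, 0.33])
print(s_same < s_diff)   # True
```

This makes the pooled uncertainty an honest summary: it is never smaller just because two models happen to disagree.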
Accuracy assessment of final maps
We report overall average accuracy in Table 1 and Fig. 4 using spatial fivefold Cross-Validation with model refitting1,48. For each variable we then compute the following three metrics: (1) Root Mean Square Error, (2) R-square from the meta-learner, and (3) Concordance Correlation Coefficient (Fig. 4), which is derived using49:
$$\begin{aligned} \rho_c = \frac{2 \cdot \rho \cdot \sigma_{\hat{y}} \cdot \sigma_y}{\sigma_{\hat{y}}^2 + \sigma_y^2 + (\mu_{\hat{y}} - \mu_y)^2} \end{aligned}$$
(5)
where (\hat{y}) are the predicted values and y are the actual values at cross-validation points, (\mu_{\hat{y}}) and (\mu_y) are the predicted and observed means, (\sigma_{\hat{y}}) and (\sigma_y) are the corresponding standard deviations, and (\rho) is the correlation coefficient between predicted and observed values. CCC is the most appropriate performance criterion when it comes to measuring agreement between predictions and observations.
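Eq. (5) translates directly into a few lines of NumPy (the observation values below are made up for illustration):

```python
import numpy as np

def ccc(y_pred, y_true):
    """Lin's Concordance Correlation Coefficient (Eq. 5)."""
    y_pred, y_true = np.asarray(y_pred, float), np.asarray(y_true, float)
    mu_p, mu_t = y_pred.mean(), y_true.mean()
    s_p, s_t = y_pred.std(), y_true.std()            # population s.d.
    rho = np.corrcoef(y_pred, y_true)[0, 1]
    return (2 * rho * s_p * s_t) / (s_p**2 + s_t**2 + (mu_p - mu_t) ** 2)

obs = np.array([5.1, 6.3, 7.0, 5.8, 6.6])
print(round(ccc(obs, obs), 3))          # 1.0: perfect agreement
print(round(ccc(obs + 0.5, obs), 3))    # < 1: same correlation, shifted mean
```

Unlike plain correlation, CCC penalizes the systematic 0.5-unit shift in the second call, which is why it measures agreement rather than just association.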
For cross-validation we use the spatial tile ID produced in the equal-area projection system for Africa (Lambert Azimuthal EPSG:42106) as the blocking parameter in the training function in mlr. This ensures that points falling in close proximity (<30 km) are used either for training or for validation, but not both, which ultimately provides a more objective measure of accuracy for the whole of the continent48.
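The blocking logic can be emulated outside mlr with scikit-learn's GroupKFold, using tile IDs as the grouping variable (synthetic data; the 25 tiles here are illustrative):

```python
import numpy as np
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(2)
n = 200
X = rng.uniform(0, 1, (n, 3))
y = X[:, 0] + rng.normal(0, 0.1, n)

# Each point gets the ID of the spatial tile it falls in; GroupKFold keeps
# all points of a tile together, so nearby points never straddle the
# train/validation split.
tile_id = rng.integers(0, 25, n)

for train_idx, val_idx in GroupKFold(n_splits=5).split(X, y, groups=tile_id):
    shared = set(tile_id[train_idx]) & set(tile_id[val_idx])
    assert not shared      # no tile appears on both sides of the split
```

Random (non-blocked) folds would let near-duplicate neighbours leak across the split and inflate the apparent accuracy.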
Training points
For model training we used a compilation of existing data previously produced by the AfSIS project and/or other publicly available soils data (Fig. 7). The important training point datasets include:
AfSIS I and II soil samples for Tanzania, Uganda, Nigeria, Ghana: ca. 40,000 sampling locations, based upon spectral and wet chemistry data (available from: https://registry.opendata.aws/afsis/). AfSIS I dataset was prepared by ICRAF using a systematic sampling procedure50,51,
ISRIC Africa Soil Profile Database: ca. 13,000 legacy profiles collected across Africa and collated by ISRIC as part of the AfSIS project13,
LandPKS: ca. 12,000 soil profile observations, crowd sourced and collected via the LandPKS mobile app52,
IFDC: ca. 9,000 soil sampling locations across Ghana, Uganda, Rwanda and Burundi collected from various projects,
AfricaRice and TAMASA: ca. 3,000 soil sampling locations across Africa generated from field trials/surveys by AfricaRice53 and Taking Maize Agronomy to Scale in Africa (TAMASA).
In total this consists of more than 100,000 soil sites (unique locations) from over 20 datasets, measured using wet chemistry and dry spectroscopy54. The final training dataset includes between ca. 30,000–150,000 cleaned and standardized training samples depending on the variable (see Table 1).
iSDA was supported by ICRAF to leverage their extensive spectral calibration libraries in order to generate accurate and inexpensive soil property predictions from spectral data55. Analytical methods used for the soil variables included: the laser diffraction method for clay and sand fractions, the Mehlich-3 extraction for extractable nutrients, pH determined in 1:2 deionised water, eCEC determined with the cobalt hexamine method, and thermal oxidation with subtraction of inorganic carbon for soil organic carbon. We paid special attention to filtering out artifacts in the input points, filling in gaps in the point data, and leveraging expert agronomy rules. A full harmonization of the different laboratory methods used across data sets was not conducted, but we ensured that only data from comparable methods with a similar range of results were used. Different extraction or analysis methods (for example, different extractable-P methods) can easily depart from each other by factors of 2–10. For this reason we opted to split some variables into groups and/or omit measurements that are incompatible with the majority of measurements.
The training points from the LandPKS project are, in fact, non-laboratory observations, i.e. quick estimates of texture by hand. To convert the values from e.g. a clay-loam texture class to clay, silt and sand fractions we use the texture triangle centroids5, e.g. the class “clay” is converted to 20% sand, 18% silt and 63% clay, and so on. The results of converting the values are thus visible as groupings in the observed data in the accuracy plots (Fig. 4) for sand, silt, clay and coarse fragments (CF)/stone content.
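The centroid conversion amounts to a simple lookup table; the "clay" entry uses the example values quoted above, while the other centroids here are hypothetical placeholders, not the exact values used in the original work:

```python
# Illustrative centroid lookup table for hand-texture classes.
TEXTURE_CENTROIDS = {
    "clay":       {"sand": 20, "silt": 18, "clay": 63},  # example from the text
    "clay loam":  {"sand": 32, "silt": 34, "clay": 34},  # hypothetical centroid
    "sandy loam": {"sand": 65, "silt": 25, "clay": 10},  # hypothetical centroid
}

def class_to_fractions(texture_class):
    """Convert a hand-texture class to approximate sand/silt/clay percentages."""
    return TEXTURE_CENTROIDS[texture_class.lower()]

print(class_to_fractions("Clay Loam"))
```

Because every point of a given class maps to the same centroid, the converted points cluster into the visible groupings mentioned above.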
Part of the training datasets used for model building, and import and standardization rules are listed via a public repository at https://gitlab.com/openlandmap/compiled-ess-point-data-sets/. For an up-to-date overview of training point datasets used, please refer to https://isda-africa.com/isdasoil.
Covariate layers
We use an extensive stack of covariates that includes up-to-date MODIS, PROBA-V, cloud-free Sentinel-2 mosaics, Landsat data, digital terrain parameters and climatic variables. The 250 m resolution covariates include (see Supplementary material for a complete list with file names):
Digital Terrain Model (DTM)-derived surfaces (slope, profile curvature, Multiresolution Index of Valley Bottom Flatness (VBF), deviation from mean value, valley depth, negative and positive Topographic Openness and SAGA Wetness Index), all based on the MERIT-DEM56 and computed in SAGA GIS57 at varying spatial resolutions (250 m, 1 km, 2 km);
CHELSA Bioclimatic images58 downloaded from https://chelsa-climate.org/bioclim/,
SM2RAIN monthly mean and standard deviation images59 available for download from https://doi.org/10.5281/zenodo.1435912;
Long-term averaged mean monthly surface reflectances for MODIS bands 4 (NIR) and 7 (MIR) at 500 m resolution. Derived using a stack of MOD09A1 images;
Long-term averaged monthly mean and standard deviation of the MODIS land surface temperature (daytime and nighttime). Derived using a stack of MOD11A2 LST images60 which can be downloaded from https://doi.org/10.5281/zenodo.1420114;
MODIS Cloud fraction monthly images61 obtained from http://www.earthenv.org/cloud;
Solar direct and diffuse irradiation images obtained from https://globalsolaratlas.info/download;
Fraction of Absorbed Photosynthetically Active Radiation (FAPAR) at 250 m monthly for period 2014–201762 based on COPERNICUS land products that can be downloaded https://doi.org/10.5281/zenodo.1450336;
Long-term Flood hazard map for a 500-year return period63;
USGS Africa Surface Lithology map at 250 m resolution22.
CHELSA bioclimatic images include: (Bio1) annual mean temperature, (Bio2) mean diurnal temperature range, (Bio3) isothermality (day-to-night temperature oscillations relative to the summer-to-winter oscillations), (Bio4) temperature seasonality (standard deviation of monthly temperature averages), (Bio5) maximum temperature of warmest month, (Bio6) minimum temperature of coldest month, (Bio7) temperature annual range, (Bio10) mean temperature of warmest quarter, (Bio11) mean temperature of coldest quarter, (Bio12) annual precipitation amount, (Bio13) precipitation of wettest month, (Bio14) precipitation of driest month, (Bio16) precipitation of wettest quarter, (Bio17) precipitation of driest quarter. All layers were processed in the native resolution and then, if necessary, downscaled to a common grid using bicubic spline resampling in GDAL64. The USGS Africa Surface Lithology map units were converted to indicators, with some units being excluded for having too few (<5) training points.
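A sketch of the indicator conversion with a minimum-count filter, using pandas and hypothetical lithology classes (the class names and counts are made up for illustration):

```python
import pandas as pd

# Hypothetical lithology classes at training points.
litho = pd.Series(["basalt"] * 40 + ["granite"] * 30 + ["gneiss"] * 3)

# Drop units with fewer than 5 training points, then one-hot encode the rest;
# points in a dropped unit get all-zero indicators.
counts = litho.value_counts()
keep = counts[counts >= 5].index
indicators = pd.get_dummies(litho.where(litho.isin(keep)))
print(list(indicators.columns))   # ['basalt', 'granite']
```

Dropping rare units keeps the model from fitting indicator columns backed by almost no observations.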
The 30 m resolution covariates are based on Earth Observation (EO) data that we pre-selected for mapping soil nutrients over Africa, so as to stay within the project budget, using the following procedure (Fig. 7):
- 1.
Upload points to the Google Earth Engine67, overlay and fit initial Random Forest models to identify and prioritize the most important bands;
- 2.
Process the prioritized bands using Amazon AWS; this is still tens of terabytes of Sentinel data, but considerably less than if all bands had been selected and processed;
- 3.
Produce cloud free mosaics for the period 2018–2019 using Amazon AWS; download the final product as Cloud-Optimised GeoTIFFs;
- 4.
Run spatial overlay, model fitting and prediction on a local system using Solid State Drives and servers with large amounts of RAM.
We refer to this as “the hybrid Cloud-based 2-step variable selection procedure” (Fig. 7). With it we combine the power of Google Earth Engine with our own computing infrastructure to achieve customized processing.
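The variable pre-selection in step 1 can be sketched locally with a quick Random Forest importance ranking (synthetic data standing in for the Google Earth Engine point overlay; the cutoff of 8 covariates is an arbitrary illustration):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Step 1 (done in Google Earth Engine in the actual workflow, emulated here):
# fit a quick Random Forest on the point overlay and rank covariate importance.
X, y = make_regression(n_samples=400, n_features=20, n_informative=5,
                       random_state=0)
rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Step 2: keep only the top-ranked covariates for full-scale processing,
# cutting the volume of imagery that has to be prepared downstream.
top_k = 8
selected = np.argsort(rf.feature_importances_)[::-1][:top_k]
print(sorted(selected.tolist()))
```

Only the selected bands then need to be mosaicked and tiled at full 30 m resolution, which is what keeps the processing cost bounded.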
The Sentinel-2 cloud-free images were produced using the Scene Classification Mask (SCL band) for two seasons (S1 = months 1, 2, 3, 7, 8, 9, and S2 = months 4, 5, 6, 10, 11, 12) combined over the years 2018 and 2019, to minimize the number of pixels with clouds. We processed a total of 852,738 Sentinel-2 L2A scenes, or about 200 TB of raw data. Scenes were processed by splitting the African continent into 8721 tiles (2000 × 2000 pixels or 60 × 60 km). For processing these large volumes of data we used AWS EC2 Spot Instances (Auto Scaling Groups) with 3 GB of RAM per vCPU and a few TB of ephemeral (temporary) storage for satellite images. The total processing time to produce all Sentinel-2 products was ca. 100,000 h of computing. The average time required to produce one cloud-free tile per band/season ranged from ca. 90 min for B02 and B04 down to ca. 50 min for B8A, B09, B11 and B12.
For predictive mapping we use a fully-optimized High Performance Computing system (3× Scan 3XS servers) based on the Intel Xeon Gold chipset with 40 CPU cores/80 threads.