African soil properties and nutrients mapped at 30 m spatial resolution using two-scale ensemble machine learning
A 2-scale ensemble machine learning

Predictions of soil nutrients are based on a fully automated and fully optimized 2-scale Ensemble Machine Learning (EML) framework as implemented in the mlr package for Machine Learning (https://mlr.mlr-org.com/). The entire process can be summarized in the following eight steps (Fig. 7):
1.
Prepare point data, quality-control all values, and remove any artifacts or typos.
2.
Upload to Google Earth Engine, overlay the point data with the key covariates of interest, and fit a random forest or similar model to obtain an initial estimate of relative variable importance and pre-select features of interest.
3.
Decide on a final list of covariates to use in predictions and prepare them for predictive modeling, e.g. using Amazon AWS or similar. Quality-control all 250 m and 30 m resolution covariates and prepare analysis-ready data in a tiling system to speed up overlay and prediction.
4.
Run spatial overlay using 250 m and 30 m resolution covariates and generate regression matrices.
5.
Fit 250 m and 30 m resolution Ensemble Machine Learning models independently per soil property using spatial blocks of 30–100 km. Run sequentially: model fine-tuning, feature selection and stacking. Generate a summary accuracy assessment and variable importance, and revise if necessary.
6.
Predict 250 m and 30 m resolution tiles independently using the optimized models. Downscale the 250 m predictions to 30 m resolution using cubic splines (GDAL).
7.
Combine predictions using Eq. (3) and generate pooled variance/s.d. using Eq. (4).
8.
Generate all final predictions as Cloud-Optimized GeoTIFFs. Upload to the server and share through API/Geoserver.
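Step 7, the merging of the independent 250 m and 30 m predictions and the pooling of their variances (formalized below in Eqs. (3) and (4)), can be illustrated numerically. The per-pixel values below are hypothetical, chosen only to show the mechanics; the actual implementation runs in R over raster tiles:

```python
import numpy as np

# Hypothetical per-pixel predictions and cross-validation errors (s.d.)
mu_250, sigma_250 = 6.1, 0.46   # coarse-scale (250 m) prediction and error
mu_30,  sigma_30  = 6.4, 0.66   # fine-scale (30 m) prediction and error

# Eq. (3): weights are inverse squared CV errors, normalized to sum to 1
w = np.array([1 / sigma_250**2, 1 / sigma_30**2])
w = w / w.sum()
y_hat = w[0] * mu_250 + w[1] * mu_30   # merged (ensemble) prediction

# Eq. (4): pooled variance of the merged prediction; grows when the two
# independent predictions disagree
mus = np.array([mu_250, mu_30])
sigmas = np.array([sigma_250, sigma_30])
var_pooled = np.sum(w * (sigmas**2 + mus**2)) - np.sum(w * mus)**2
sigma_pooled = np.sqrt(var_pooled)
```

Note that because the coarse-scale error (0.46) is smaller, the coarse-scale prediction receives roughly twice the weight of the fine-scale one.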
Figure 7 Scheme: a two-scale framework for Predictive Soil Mapping based on Ensemble Machine Learning (as implemented in the mlr and mlr3 frameworks for Machine Learning28 and based on the SuperLearner algorithm). The process is applied to the bulk of soil samples; the individual models per soil variable are then fitted using automated fine-tuning, feature selection and stacking. The map shows the distribution of training points used in this work. The publicly available part of the training points can be obtained from https://gitlab.com/openlandmap/compiled-ess-point-data-sets/.

For the majority of soil properties, excluding depth to bedrock, we also use soil depth as one of the covariates, so that the final models for the two scales are of the form5:

$$\begin{aligned} y(\phi, \theta, d) = d + X_1(\phi, \theta) + X_2(\phi, \theta) + \cdots + X_p(\phi, \theta) \end{aligned}$$
(1)
where y is the target variable, d is the soil sampling depth, \((\phi, \theta)\) are the geographical coordinates (northing and easting), and \(X_p\) are the covariates. Adding soil depth as a covariate allows for directly producing 3D predictions35, which is our preferred approach, as predictions can then be produced at any depth within the standard depth interval (e.g. 0–50 cm).

Ensemble machine learning

Ensembles are predictive models that combine predictions from two or more learners36. We implement ensembling within the mlr package by fitting a 'meta-learner', i.e. a learner that combines all individual learners. mlr has extensive functionality, especially for model 'stacking', i.e. for generating ensemble predictions, and also incorporates spatial Cross-Validation37. It also provides wrapper functions to automate hyper-parameter fine-tuning and feature selection, which can be combined into fully automated functions to fit and optimize models and produce predictions. Parallelisation can be initiated using the parallelMap package, which automatically determines available resources and cleans up all temporary sessions38.

For stacking multiple base learners we use the SuperLearner method39, which is the most computationally intensive method but allows for an independent assessment of all individual learners through k-fold cross-validation with refitting. To speed up computing we typically use a linear model (predict.lm) as the meta-learner, so that the final formula used to derive the ensemble prediction can be interpreted directly by printing the model summary.

The predictions in the Ensemble models described in Fig. 7 are in principle based on the following five Machine Learning libraries, common to many soil mapping projects5:
1.
Ranger: fully scalable implementation of Random Forest23.
2.
XGboost: extreme gradient boosting40.
3.
Deepnet: an Open Source implementation of deep learning26.
4.
Cubist: the Open Source implementation of Cubist regression trees41.
5.
Glmnet: GLM with Lasso or Elasticnet Regularization24.
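The stacking itself is implemented in R via mlr; as an illustration only, an analogous ensemble with an interpretable linear meta-learner can be sketched in Python using scikit-learn stand-ins for the learners above (random forest, gradient boosting and elastic net). The data here are synthetic, and the production workflow uses spatially blocked rather than random cross-validation:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import (GradientBoostingRegressor, RandomForestRegressor,
                              StackingRegressor)
from sklearn.linear_model import ElasticNet, LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a regression matrix (covariates -> soil property)
X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Base learners stacked with a linear meta-learner, mirroring the SuperLearner
# setup: out-of-fold predictions of the base learners (cv=5) become the
# covariates of the meta-learner. NB: random folds here; the paper uses
# spatial blocks to avoid over-optimistic stacking.
ensemble = StackingRegressor(
    estimators=[
        ("rf", RandomForestRegressor(n_estimators=100, random_state=1)),
        ("gbm", GradientBoostingRegressor(random_state=1)),
        ("glmnet", ElasticNet(random_state=1)),
    ],
    final_estimator=LinearRegression(),
    cv=5,
)
ensemble.fit(X_train, y_train)
print(ensemble.final_estimator_.coef_)  # contribution of each base learner
```

Printing the meta-learner coefficients corresponds to printing the model summary of predict.lm mentioned above: the linear weights show how much each base learner contributes to the final prediction.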
These Open Source libraries, with the exception of Cubist, are available in a variety of programming environments, including R and Python, and also as standalone C++ libraries.

Merging coarse and fine-scale predictions

The idea of modeling soil spatial variation at different scales can be traced back to the work of McBratney42. In a multiscale model, soil variation can be considered a composite signal (Fig. 8):

$$\begin{aligned} y(\mathbf{s}_{\mathtt{B}}) = S_4(\mathbf{s}_{\mathtt{B}}) + S_3(\mathbf{s}_{\mathtt{B}}) + S_2(\mathbf{s}_{\mathtt{B}}) + S_1(\mathbf{s}_{\mathtt{B}}) + \varepsilon \end{aligned}$$
(2)
where \(S_4\) is the value of the target variable estimated at the coarsest scale, \(S_3\), \(S_2\) and \(S_1\) are the higher-order components, \(\mathbf{s}_{\mathtt{B}}\) is the location or block of land, and \(\varepsilon\) is the residual soil variation, i.e. pure noise.

Figure 8 Decomposition of a signal of spatial variation into four components plus noise. Based on McBratney42. See also Fig. 13 in Hengl et al.21.

In this work we used a somewhat simplified version of Eq. (2) with only two scale components: coarse (\(S_2\); 250 m) and fine (\(S_1\); 30 m). We produce the coarse-scale and fine-scale predictions independently, then merge them using a weighted average43:

$$\begin{aligned} \hat{y}(\mathbf{s}_{\mathtt{B}}) = \frac{\sum_{i=1}^{2} w_i \cdot S_i(\mathbf{s}_{\mathtt{B}})}{\sum_{i=1}^{2} w_i}, \;\; w_i = \frac{1}{\sigma_{i,\mathrm{CV}}^2} \end{aligned}$$
(3)
where \(\hat{y}(\mathbf{s}_{\mathtt{B}})\) is the ensemble prediction, \(w_i\) is the model weight and \(\sigma_{i,\mathrm{CV}}^2\) is the squared prediction error of the model obtained using cross-validation. In the Ensemble Models fitted for soil pH, for example, the coarse-scale model is somewhat more accurate, with \(\mathrm{RMSE}=0.463\), while the 30 m covariates achieve at best \(\mathrm{RMSE}=0.661\); hence the weights for the 250 m model are about 2\(\times\) higher than for the 30 m resolution model. A step-by-step procedure explaining in detail how the 2-scale predictions are derived and merged is available at https://gitlab.com/openlandmap/spatial-predictions-using-eml. An R package landmap44 that implements the procedure in a few lines of code is also available.

Transformation of log-normally distributed nutrients and properties

For the majority of log-normally distributed (right-skewed) variables we model and predict the ln-transformed values (\(\log_e(x+1)\)), then provide back-transformed predictions (\(e^{x}-1\)) to users via iSDAsoil. Note that pH is itself a log-transformed variable of the hydrogen ion concentration. Although ln-transformation is not required for non-linear models such as Random Forest or Gradient Boosting, we decided to apply it to give proportionally higher weights to lower values. This is, in principle, a biased decision by us, the modelers, as our interest is in improving predictions of critical values for agriculture, i.e. producing maps of nutrient deficiencies and similar (hence the focus on smaller values).
If the objective of the mapping had been to map soil organic carbon of peatlands or similar, the ln-transformation could have decreased the overall accuracy, although with Machine Learning models such effects are highly non-linear and difficult to anticipate.

Derivation of prediction errors

We also provide per-pixel uncertainty in terms of prediction errors or prediction intervals (e.g. 50%, 68% and/or 90% probability intervals)45. Because stacking of learners is based on repeated resampling, the prediction errors (per pixel) can be determined using either:
1.
Quantile Regression Random Forest46, in our case by using the 4–5 base learners,
2.
A simplified procedure using bootstrapping, then deriving prediction errors as the standard deviation of multiple independently fitted learners1.
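The second option can be sketched as follows; this is a generic Python illustration with synthetic data and a single scikit-learn learner, not the production code (which uses the R learners listed earlier, with spatially blocked rather than fully random resampling):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic stand-in for a regression matrix
X, y = make_regression(n_samples=300, n_features=8, noise=15.0, random_state=0)
rng = np.random.default_rng(0)

# Fit the same learner on several bootstrap resamples of the training data
preds = []
for _ in range(10):
    idx = rng.integers(0, len(y), size=len(y))  # rows sampled with replacement
    model = GradientBoostingRegressor(random_state=0).fit(X[idx], y[idx])
    preds.append(model.predict(X))
preds = np.vstack(preds)                 # shape: (n_resamples, n_points)

pred_mean = preds.mean(axis=0)           # ensemble prediction per point
pred_error = preds.std(axis=0, ddof=1)   # prediction error (s.d.) per point
```

The spread between the independently fitted learners serves directly as the per-point (or, in mapping, per-pixel) prediction error.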
Both are non-parametric techniques: the prediction errors do not require any distributional assumptions or initial parameters, but they come at the cost of extra computing. By default, we provide prediction errors with a probability of 67%, i.e. approximately the 1-standard-deviation upper and lower prediction interval. Prediction errors indicate extrapolation areas and should help users minimize the risks of decisions based on the maps.

For derivation of prediction intervals via either Quantile Regression RF or bootstrapping, it is important that the individual learners are derived using randomized subsets of the data (e.g. fivefold) which are spatially separated using block Cross-Validation or similar; otherwise the results might be over-optimistic and the prediction errors too narrow.

Figure 9 Schematic example of the derivation of a pooled variance (\(\sigma_{\mathtt{250m+30m}}\)) using the 250 m and 30 m predictions and prediction errors, with (a) larger and (b) smaller differences between the independent predictions.

Further, the pooled variance (\(\hat{\sigma}_E\)) from the two independent models (the 250 m and 30 m scales in Fig. 7) can be derived using47:

$$\begin{aligned} \hat{\sigma}_E = \sqrt{\sum_{j=1}^{s} w_j \cdot (\hat{\sigma}_j^2 + \hat{\mu}_j^2) - \left( \sum_{j=1}^{s} w_j \cdot \hat{\mu}_j \right)^2}, \;\; \sum_{j=1}^{s} w_j = 1 \end{aligned}$$
(4)
where \(\hat{\sigma}_j^2\) is the prediction error (variance) of each independent component, \(\hat{\mu}_j\) is the predicted value, and \(w_j\) are the weights per predicted component (which need to sum to 1). If the two independent models (250 m and 30 m) produce very similar predictions, so that \(\hat{\mu}_{\mathtt{250}} \approx \hat{\mu}_{\mathtt{30}}\), then the pooled variance approaches the weighted mean of the two variances; if the independent predictions differ (\(|\hat{\mu}_{\mathtt{250}} - \hat{\mu}_{\mathtt{30}}| > 0\)), then the pooled variance increases in proportion to this additional difference (Fig. 9).

Accuracy assessment of final maps

We report overall average accuracy in Table 1 and Fig. 4 using spatial fivefold Cross-Validation with model refitting1,48. For each variable we compute the following three metrics: (1) Root Mean Square Error (RMSE), (2) R-square from the meta-learner, and (3) the Concordance Correlation Coefficient (CCC; Fig. 4), which is derived using49:

$$\begin{aligned} \rho_c = \frac{2 \cdot \rho \cdot \sigma_{\hat{y}} \cdot \sigma_y}{\sigma_{\hat{y}}^2 + \sigma_y^2 + (\mu_{\hat{y}} - \mu_y)^2} \end{aligned}$$
(5)
where \(\hat{y}\) are the predicted values and y are the actual values at the cross-validation points, \(\mu_{\hat{y}}\) and \(\mu_y\) are the predicted and observed means, and \(\rho\) is the correlation coefficient between predicted and observed values. CCC is the most appropriate performance criterion when it comes to measuring the agreement between predictions and observations.

For Cross-Validation we use the spatial tile ID, produced in the equal-area projection system for Africa (Lambert Azimuthal; EPSG:42106), as the blocking parameter in the mlr training function. This ensures that points falling in close proximity are assigned to the same fold, so that the accuracy assessment is not over-optimistic.
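For reference, Eq. (5) translates directly into code; the arrays below are made-up values used purely to exercise the formula:

```python
import numpy as np

def ccc(y_pred, y_obs):
    """Lin's Concordance Correlation Coefficient, as in Eq. (5)."""
    mu_p, mu_o = y_pred.mean(), y_obs.mean()
    s_p, s_o = y_pred.std(), y_obs.std()  # population standard deviations
    rho = np.corrcoef(y_pred, y_obs)[0, 1]
    return 2 * rho * s_p * s_o / (s_p**2 + s_o**2 + (mu_p - mu_o)**2)

# Hypothetical observed vs. predicted values (e.g. soil pH at CV points)
y_obs = np.array([5.1, 6.2, 7.3, 5.8, 6.9])
y_pred = np.array([5.0, 6.0, 7.5, 6.1, 6.6])
print(ccc(y_pred, y_obs))
```

Unlike plain correlation, CCC penalizes both dispersion differences and the mean offset \((\mu_{\hat{y}} - \mu_y)\), which is why it reaches 1 only for perfect agreement on the 1:1 line.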