A 2-scale ensemble machine learning framework
Predictions of soil nutrients are based on a fully automated and fully optimized 2-scale Ensemble Machine Learning (EML) framework as implemented in the mlr package for Machine Learning (https://mlr.mlr-org.com/). The entire process can be summarized in the following eight steps (Fig. 7):
- 1.
Prepare point data, quality control all values and remove any artifacts or typos.
- 2.
Upload to Google Earth Engine, overlay the point data with the key covariates of interest and test fitting random forest or similar to get an initial estimate of relative variable importance and pre-select features of interest.
- 3.
Decide on a final list of all covariates to use in predictions, prepare covariates for predictive modeling—either using Amazon AWS or similar. Quality control all 250 m and 30 m resolution covariates and prepare Analysis-Ready data in a tiling system to speed up overlay and prediction.
- 4.
Run spatial overlay using 250 m and 30 m resolution covariates and generate regression matrices.
- 5.
Fit 250 m and 30 m resolution Ensemble Machine Learning models independently per soil property using spatial blocks of 30–100 km. Run sequentially: model fine-tuning, feature selection and stacking. Generate summary accuracy assessment, variable importance, and revise if necessary.
- 6.
Predict 250 m and 30 m resolution tiles independently using the optimized models. Downscale the 250 m predictions to 30 m resolution using Cubicsplines (GDAL).
- 7.
Combine predictions using Eq. (3) and generate pooled variance/s.d. using Eq. (4).
- 8.
Generate all final predictions as Cloud-Optimized GeoTIFFs. Upload to the server and share through API/Geoserver.
Scheme: a two-scale framework for Predictive Soil Mapping based on Ensemble Machine Learning (as implemented in the mlr and mlr3 frameworks for Machine Learning28 and based on the SuperLearner algorithm). This process is applied to the bulk of soil samples; the individual models per soil variable are then fitted using automated fine-tuning, feature selection and stacking. The map shows the distribution of training points used in this work. The training points that are publicly available can be obtained from https://gitlab.com/openlandmap/compiled-ess-point-data-sets/.
For the majority of soil properties, excluding depth to bedrock, we also use soil depth as one of the covariates so that the final models for the two scales are in the form5:
$$\begin{aligned} y(\phi, \theta, d) = d + X_1(\phi, \theta) + X_2(\phi, \theta) + \cdots + X_p(\phi, \theta) \end{aligned}$$
(1)
where y is the target variable, d is the soil sampling depth, (\phi, \theta) are geographical coordinates (northing and easting), and (X_p) are the covariates. Adding soil depth as a covariate allows for directly producing 3D predictions35, which is our preferred approach as predictions can then be produced at any depth within the standard depth interval (e.g. 0–50 cm).
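A minimal sketch of this 3D approach, using scikit-learn's Random Forest on synthetic data as a stand-in for the full ensemble (all variable names, values and the depth trend below are illustrative, not from the actual iSDAsoil models):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Synthetic training data: coordinates (phi, theta), sampling depth d,
# and a target that decreases with depth (e.g. soil organic carbon).
n = 500
phi, theta = rng.uniform(0, 10, n), rng.uniform(0, 10, n)
d = rng.uniform(0, 50, n)                      # depth in cm
y = 30 - 0.3 * d + 2 * np.sin(phi) + rng.normal(0, 1, n)

# Depth enters the feature matrix exactly like any other covariate.
X = np.column_stack([d, phi, theta])
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# A single 3D model can now predict at any depth in the 0-50 cm interval.
loc = np.array([[5.0, 5.0]])                   # one location (phi, theta)
depths = np.array([0.0, 20.0, 50.0])
preds = model.predict(np.column_stack([depths, np.repeat(loc, 3, axis=0)]))
```

Because depth is just another feature, no separate model per depth interval is needed.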
Ensemble machine learning
Ensembles are predictive models that combine predictions from two or more learners36. We implement ensembling within the mlr package by fitting a ‘meta-learner’ i.e. a learner that combines all individual learners. mlr has extensive functionality, especially for model ‘stacking’ i.e. generating ensemble predictions, and also incorporates spatial Cross-Validation37. It also provides wrapper functions to automate hyper-parameter fine-tuning and feature selection, which can all be combined into fully-automated functions to fit and optimize models and produce predictions. Parallelisation can be initiated using the parallelMap package, which automatically determines available resources and cleans up all temporary sessions38.
For stacking multiple base learners we use the SuperLearner method39, which is the most computationally intensive method but allows for an independent assessment of all individual learners through k-fold cross-validation with refitting. To speed up computing we typically use a linear model (predict.lm) as the meta-learner, so that the final formula used to derive the ensemble prediction can be directly interpreted by printing the model summary.
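The same stacking logic can be sketched with scikit-learn's StackingRegressor, which, like the SuperLearner setup described here, combines out-of-fold base-learner predictions through a meta-learner; the base learners below are illustrative stand-ins for the actual mlr configuration:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import (GradientBoostingRegressor, RandomForestRegressor,
                              StackingRegressor)
from sklearn.linear_model import ElasticNet, LinearRegression

X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)

# Base learners standing in for ranger / xgboost / glmnet etc.
base = [
    ("rf", RandomForestRegressor(n_estimators=50, random_state=0)),
    ("gbm", GradientBoostingRegressor(random_state=0)),
    ("enet", ElasticNet(alpha=0.5)),
]

# Linear meta-learner: out-of-fold base predictions (5-fold CV) are
# combined by a plain linear model, so the ensemble formula is readable.
stack = StackingRegressor(estimators=base, final_estimator=LinearRegression(), cv=5)
stack.fit(X, y)

# The meta-learner coefficients show each base learner's contribution.
print(dict(zip([name for name, _ in base], stack.final_estimator_.coef_.round(3))))
```

Printing the meta-learner coefficients mirrors printing the model summary of the lm meta-learner in R.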
The predictions in the Ensemble models described in Fig. 7 are, in principle, based on the following five Machine Learning libraries common to many soil mapping projects5.
- 1.
Ranger: fully scalable implementation of Random Forest23.
- 2.
XGboost: extreme gradient boosting40.
- 3.
Deepnet: the Open Source implementation of deep learning26.
- 4.
Cubist: the Open Source implementation of Cubist regression trees41.
- 5.
Glmnet: GLM with Lasso or Elasticnet Regularization24.
These open source libraries, with the exception of Cubist, are available through a variety of programming environments including R and Python, and also as standalone C++ libraries.
Merging coarse and fine-scale predictions
The idea of modeling soil spatial variation at different scales can be traced back to the work of McBratney42. In a multiscale model, soil variation can be considered a composite signal (Fig. 8):
$$\begin{aligned} y(\mathbf{s}_{\mathtt{B}}) = S_4(\mathbf{s}_{\mathtt{B}}) + S_3(\mathbf{s}_{\mathtt{B}}) + S_2(\mathbf{s}_{\mathtt{B}}) + S_1(\mathbf{s}_{\mathtt{B}}) + \varepsilon \end{aligned}$$
(2)
where (S_4) is the value of the target variable estimated at the coarsest scale, (S_3), (S_2) and (S_1) are the progressively finer-scale components, (\mathbf{s}_{\mathtt{B}}) is the location or block of land, and (\varepsilon) is the residual soil variation, i.e. pure noise.
Decomposition of a signal of spatial variation into four components plus noise. Based on McBratney42. See also Fig. 13 in Hengl et al.21.
In this work we used a somewhat simplified version of Eq. (2) with only two scale-components: coarse ((S_2); 250 m) and fine ((S_1); 30 m). We produce the coarse-scale and fine-scale predictions independently, then merge using a weighted average43:
$$\begin{aligned} \hat{y}(\mathbf{s}_{\mathtt{B}}) = \frac{\sum_{i=1}^{2} w_i \cdot S_i(\mathbf{s}_{\mathtt{B}})}{\sum_{i=1}^{2} w_i}, \;\; w_i = \frac{1}{\sigma_{i,\mathrm{CV}}^2} \end{aligned}$$
(3)
where (\hat{y}(\mathbf{s}_{\mathtt{B}})) is the ensemble prediction, (w_i) is the model weight and (\sigma_{i,\mathrm{CV}}^2) is the model squared prediction error obtained using cross-validation. As an example, for soil pH the coarse-scale model is somewhat more accurate with (\mathrm{RMSE}=0.463), while the 30 m covariates achieve at best (\mathrm{RMSE}=0.661); hence the weights for the 250 m model are about 2(\times) higher than for the 30 m resolution model. A step-by-step procedure explaining in detail how the 2-scale predictions are derived and merged is available at https://gitlab.com/openlandmap/spatial-predictions-using-eml. An R package landmap44 that implements the procedure in a few lines of code is also available.
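A minimal sketch of the weighted merging in Eq. (3), using the soil pH RMSE values quoted above (the per-pixel predictions are hypothetical):

```python
import numpy as np

def merge_predictions(preds, cv_rmse):
    """Merge per-scale predictions with weights w_i = 1 / sigma_i^2 (Eq. 3)."""
    w = 1.0 / np.asarray(cv_rmse) ** 2
    preds = np.asarray(preds, dtype=float)
    return (w[:, None] * preds).sum(axis=0) / w.sum()

# Soil pH example from the text: RMSE 0.463 (250 m) vs 0.661 (30 m).
w250, w30 = 1 / 0.463**2, 1 / 0.661**2
print(round(w250 / w30, 2))   # about 2x higher weight for the 250 m model

coarse = np.array([6.1, 5.8])   # hypothetical per-pixel pH predictions, 250 m
fine = np.array([6.4, 5.5])     # hypothetical per-pixel pH predictions, 30 m
merged = merge_predictions([coarse, fine], [0.463, 0.661])
```

The merged value always falls between the two input predictions, pulled toward the more accurate scale.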
Transformation of log-normally distributed nutrients and properties
For the majority of log-normally distributed (right-skewed) variables we model and predict the ln-transformed values ((\log_e(x+1))), then provide back-transformed predictions ((e^{x}-1)) to users via iSDAsoil. Note that pH is itself a log transformation of the hydrogen ion concentration.
Although ln-transformation is not required for non-linear models such as Random Forest or Gradient Boosting, we decided to apply it to give proportionally higher weights to lower values. This is, in principle, a biased decision by us, the modelers, as our interest is in improving predictions of critical values for agriculture, i.e. producing maps of nutrient deficiencies and similar (hence the focus on smaller values). If the objective of mapping were to map soil organic carbon of peatlands or similar, then the ln-transformation could have decreased the overall accuracy, although with Machine Learning models such effects are difficult to anticipate as they are highly non-linear.
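The transform pair is exactly invertible, as a quick NumPy check shows (the nutrient values below are made-up examples):

```python
import numpy as np

# Right-skewed nutrient values (e.g. extractable P in ppm; illustrative only).
x = np.array([0.0, 2.0, 15.0, 400.0])

# Model in log space: ln(x + 1) compresses large values, so errors on
# small (agronomically critical) values carry relatively more weight.
z = np.log1p(x)            # ln(x + 1), safe at x = 0

# Back-transform predictions for users: e^z - 1.
x_back = np.expm1(z)
print(np.allclose(x, x_back))   # True: the transform is exactly invertible
```

Using log1p/expm1 rather than log/exp avoids precision loss for values near zero.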
Derivation of prediction errors
We also provide per-pixel uncertainty in terms of prediction errors or prediction intervals (e.g. 50%, 68% and/or 90% probability intervals)45. Because stacking of learners is based on repeated resampling, the prediction errors (per pixel) can be determined using either:
- 1.
Quantile Regression Random Forest46, in our case by using the 4–5 base learners,
- 2.
Simplified procedure using bootstrapping, then deriving prediction errors as the standard deviation across multiple independently fitted learners1.
Both are non-parametric techniques and the prediction errors do not require any distributional assumptions or initial parameters, but come at the cost of extra computing. By default, we provide prediction errors with a probability of 67%, which corresponds approximately to the ±1 standard deviation prediction interval. Prediction errors indicate extrapolation areas and should help users minimize risks when making decisions.
For derivation of prediction interval via either Quantile Regression RF or bootstrapping, it is important to note that the individual learners must be derived using randomized subsets of data (e.g. fivefold) which are spatially separated using block Cross-Validation or similar, otherwise the results might be over-optimistic and prediction errors too narrow.
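A sketch of the bootstrap variant (option 2 above), using a plain decision tree on synthetic data as a stand-in for the actual learners; note that in real use the resamples should respect spatial blocks as described above, which is omitted here for brevity:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
X = rng.uniform(0, 10, (300, 2))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=300)

# Fit the same learner on bootstrap resamples of the training data.
preds = []
for b in range(20):
    idx = rng.integers(0, len(X), len(X))          # sample with replacement
    m = DecisionTreeRegressor(max_depth=6, random_state=b).fit(X[idx], y[idx])
    preds.append(m.predict(X))
preds = np.vstack(preds)                           # shape (20, n_pixels)

# Per-pixel prediction error = s.d. across the bootstrapped learners;
# +/- 1 s.d. around the mean gives roughly a 67% prediction interval.
mu, sigma = preds.mean(axis=0), preds.std(axis=0)
lower, upper = mu - sigma, mu + sigma
```

With spatially blocked resamples, pixels far from any training data show inflated sigma, flagging extrapolation areas.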
Schematic example of the derivation of a pooled variance ((\sigma_{\mathtt{250m+30m}})) using the 250 m and 30 m predictions and prediction errors with (a) larger and (b) smaller differences in independent predictions.
Further, the pooled variance ((\hat{\sigma}_E)) from the two independent models (250 m and 30 m scales in Fig. 7) can be derived using47:
$$\begin{aligned} \hat{\sigma}_E = \sqrt{\sum_{j=1}^{s} w_j \cdot (\hat{\sigma}_j^2 + \hat{\mu}_j^2) - \left( \sum_{j=1}^{s} w_j \cdot \hat{\mu}_j \right)^2}, \;\; \sum_{j=1}^{s} w_j = 1 \end{aligned}$$
(4)
where (\hat{\sigma}_j^2) is the prediction error variance of the independent components, (\hat{\mu}_j) is the predicted value, and (w_j) are the weights per predicted component (which need to sum to 1). If the two independent models (250 m and 30 m) produce very similar predictions so that (\hat{\mu}_{\mathtt{250}} \approx \hat{\mu}_{\mathtt{30}}), then the pooled variance approaches the weighted average of the two variances; if the independent predictions differ ((\hat{\mu}_{\mathtt{250}} \ne \hat{\mu}_{\mathtt{30}})), then the pooled variance increases proportionally to this additional difference (Fig. 9).
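Eq. (4) can be implemented directly; the pH means, s.d. values and weights below are hypothetical, chosen only to show how disagreement between the two scales inflates the pooled error:

```python
import numpy as np

def pooled_sd(mu, sigma, w):
    """Pooled s.d. of independent predictions (Eq. 4); weights must sum to 1."""
    mu, sigma, w = map(np.asarray, (mu, sigma, w))
    assert np.isclose(w.sum(), 1.0)
    return np.sqrt((w * (sigma**2 + mu**2)).sum() - (w * mu).sum() ** 2)

# Case 1: the two scales agree -> pooled variance is close to the
# weighted average of the individual variances.
s_same = pooled_sd(mu=[6.2, 6.2], sigma=[0.4, 0.6], w=[0.67, 0.33])

# Case 2: the two scales disagree -> the difference in means adds an
# extra term, so the pooled s.d. grows.
s_diff = pooled_sd(mu=[6.2, 5.4], sigma=[0.4, 0.6], w=[0.67, 0.33])
print(s_same < s_diff)   # True
```

This makes the pooled uncertainty an honest summary: it is never smaller just because two models happen to disagree.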
Accuracy assessment of final maps
We report overall average accuracy in Table 1 and Fig. 4 using spatial fivefold Cross-Validation with model refitting1,48. For each variable we then compute the following three metrics: (1) Root Mean Square Error, (2) R-square from the meta-learner, and (3) Concordance Correlation Coefficient (Fig. 4), which is derived using49:
$$\begin{aligned} \rho_c = \frac{2 \cdot \rho \cdot \sigma_{\hat{y}} \cdot \sigma_y}{\sigma_{\hat{y}}^2 + \sigma_y^2 + (\mu_{\hat{y}} - \mu_y)^2} \end{aligned}$$
(5)
where (\hat{y}) are the predicted values and y are the actual values at cross-validation points, (\mu_{\hat{y}}) and (\mu_y) are the predicted and observed means, (\sigma_{\hat{y}}) and (\sigma_y) are the corresponding standard deviations, and (\rho) is the correlation coefficient between predicted and observed values. CCC is the most appropriate performance criterion when it comes to measuring agreement between predictions and observations.
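Eq. (5) translates directly into a few lines of NumPy (the observation values below are made up for illustration):

```python
import numpy as np

def ccc(y_pred, y_true):
    """Lin's Concordance Correlation Coefficient (Eq. 5)."""
    y_pred, y_true = np.asarray(y_pred, float), np.asarray(y_true, float)
    mu_p, mu_t = y_pred.mean(), y_true.mean()
    s_p, s_t = y_pred.std(), y_true.std()            # population s.d.
    rho = np.corrcoef(y_pred, y_true)[0, 1]
    return (2 * rho * s_p * s_t) / (s_p**2 + s_t**2 + (mu_p - mu_t) ** 2)

obs = np.array([5.1, 6.3, 7.0, 5.8, 6.6])
print(round(ccc(obs, obs), 3))          # 1.0: perfect agreement
print(round(ccc(obs + 0.5, obs), 3))    # < 1: same correlation, shifted mean
```

Unlike plain correlation, CCC penalizes the systematic 0.5-unit shift in the second call, which is why it measures agreement rather than just association.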
For cross-validation we use the spatial tile ID produced in the equal-area projection system for Africa (Lambert Azimuthal EPSG:42106) as the blocking parameter in the training function in mlr. This ensures that points falling in close proximity (<30 km) are used either for training or for validation, but not both, which ultimately provides a more objective measure of accuracy for the whole of the continent48.
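The blocking logic can be emulated outside mlr with scikit-learn's GroupKFold, using tile IDs as the grouping variable (synthetic data; the 25 tiles here are illustrative):

```python
import numpy as np
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(2)
n = 200
X = rng.uniform(0, 1, (n, 3))
y = X[:, 0] + rng.normal(0, 0.1, n)

# Each point gets the ID of the spatial tile it falls in; GroupKFold keeps
# all points of a tile together, so nearby points never straddle the
# train/validation split.
tile_id = rng.integers(0, 25, n)

for train_idx, val_idx in GroupKFold(n_splits=5).split(X, y, groups=tile_id):
    shared = set(tile_id[train_idx]) & set(tile_id[val_idx])
    assert not shared      # no tile appears on both sides of the split
```

Random (non-blocked) folds would let near-duplicate neighbours leak across the split and inflate the apparent accuracy.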
Training points
For model training we used a compilation of existing data previously produced by the AfSIS project and/or other publicly available soils data (Fig. 7). The important training point datasets include:
AfSIS I and II soil samples for Tanzania, Uganda, Nigeria, Ghana: ca. 40,000 sampling locations, based upon spectral and wet chemistry data (available from: https://registry.opendata.aws/afsis/). AfSIS I dataset was prepared by ICRAF using a systematic sampling procedure50,51,
ISRIC Africa Soil Profile Database: ca. 13,000 legacy profiles collected across Africa and collated by ISRIC as part of the AfSIS project13,
LandPKS: ca. 12,000 soil profile observations, crowd sourced and collected via the LandPKS mobile app52,
IFDC: ca. 9,000 soil sampling locations across Ghana, Uganda, Rwanda and Burundi collected from various projects,
AfricaRice and TAMASA: ca. 3,000 soil sampling locations across Africa generated from field trials/surveys by AfricaRice53 and Taking Maize Agronomy to Scale in Africa (TAMASA).
In total this consists of more than 100,000 soil sites (unique locations) from over 20 datasets, measured using wet chemistry and dry spectroscopy54. The final training dataset includes between ca. 30,000–150,000 cleaned and standardized training samples depending on the variable (see Table 1).
iSDA was supported by ICRAF to leverage their extensive spectral calibration libraries in order to generate accurate and inexpensive soil property predictions from spectral data55. Analytical methods used for the soil variables included: the laser diffraction method for clay and sand fractions, the Mehlich-3 extraction for extractable nutrients, pH determined in 1:2 deionised water, eCEC determined with the cobalt hexamine method, and thermal oxidation with subtraction of inorganic carbon for soil organic carbon. We paid special attention to filtering out artifacts in the input points, filling in gaps in the point data, and leveraging expert agronomy rules. A full harmonization of the different laboratory methods used across data sets was not conducted, but we ensured that only data from comparable methods with a similar range of results were used. Different extraction or analysis methods (for example, different extractable-P methods) can easily depart from each other by factors of 2–10. For this reason we opted to split some variables into groups and/or omit measurements that are incompatible with the majority of measurements.
The training points from the LandPKS project are, in fact, non-laboratory observations, i.e. quick estimates of texture by hand. To convert the values from e.g. a clay-loam texture class to clay, silt and sand fractions we use the texture triangle centroids5, e.g. the class “clay” is converted to 20% sand, 18% silt and 63% clay, and so on. The results of converting the values are thus visible as groupings in the observed data in the accuracy plots (Fig. 4) for sand, silt, clay and coarse fragments (CF)/stone content.
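The centroid conversion amounts to a simple lookup table; the "clay" entry uses the example values quoted above, while the other centroids here are hypothetical placeholders, not the exact values used in the original work:

```python
# Illustrative centroid lookup table for hand-texture classes.
TEXTURE_CENTROIDS = {
    "clay":       {"sand": 20, "silt": 18, "clay": 63},  # example from the text
    "clay loam":  {"sand": 32, "silt": 34, "clay": 34},  # hypothetical centroid
    "sandy loam": {"sand": 65, "silt": 25, "clay": 10},  # hypothetical centroid
}

def class_to_fractions(texture_class):
    """Convert a hand-texture class to approximate sand/silt/clay percentages."""
    return TEXTURE_CENTROIDS[texture_class.lower()]

print(class_to_fractions("Clay Loam"))
```

Because every point of a given class maps to the same centroid, the converted points cluster into the visible groupings mentioned above.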
Part of the training datasets used for model building, and import and standardization rules are listed via a public repository at https://gitlab.com/openlandmap/compiled-ess-point-data-sets/. For an up-to-date overview of training point datasets used, please refer to https://isda-africa.com/isdasoil.
Covariate layers
We use an extensive stack of covariates that includes up-to-date MODIS, PROBA-V, cloud-free Sentinel-2 mosaics, Landsat data, digital terrain parameters and climatic variables. The 250 m resolution covariates include (see Supplementary material for a complete list with file names):
Digital Terrain Model (DTM)-derived surfaces (slope, profile curvature, Multiresolution Index of Valley Bottom Flatness (VBF), deviation from mean value, valley depth, negative and positive Topographic Openness and SAGA Wetness Index), all based on the MERIT-DEM56 and computed in SAGA GIS57 at varying spatial resolutions (250 m, 1 km, 2 km);
CHELSA Bioclimatic images58 downloaded from https://chelsa-climate.org/bioclim/,
SM2RAIN monthly mean and standard deviation images59 available for download from https://doi.org/10.5281/zenodo.1435912;
Long-term averaged mean monthly surface reflectances for MODIS bands 4 (NIR) and 7 (MIR) at 500 m resolution. Derived using a stack of MOD09A1 images;
Long-term averaged monthly mean and standard deviation of the MODIS land surface temperature (daytime and nighttime). Derived using a stack of MOD11A2 LST images60 which can be downloaded from https://doi.org/10.5281/zenodo.1420114;
MODIS Cloud fraction monthly images61 obtained from http://www.earthenv.org/cloud;
Solar direct and diffuse irradiation images obtained from https://globalsolaratlas.info/download;
Fraction of Absorbed Photosynthetically Active Radiation (FAPAR) at 250 m monthly for period 2014–201762 based on COPERNICUS land products that can be downloaded https://doi.org/10.5281/zenodo.1450336;
Long-term Flood hazard map for a 500-year return period63;
USGS Africa Surface Lithology map at 250 m resolution22.
CHELSA bioclimatic images include: (Bio1) annual mean temperature, (Bio2) mean diurnal temperature range, (Bio3) isothermality (day-to-night temperature oscillations relative to the summer-to-winter oscillations), (Bio4) temperature seasonality (standard deviation of monthly temperature averages), (Bio5) maximum temperature of warmest month, (Bio6) minimum temperature of coldest month, (Bio7) temperature annual range, (Bio10) mean temperature of warmest quarter, (Bio11) mean temperature of coldest quarter, (Bio12) annual precipitation amount, (Bio13) precipitation of wettest month, (Bio14) precipitation of driest month, (Bio16) precipitation of wettest quarter, (Bio17) precipitation of driest quarter. All layers were processed in the native resolution and then, if necessary, downscaled to a common grid using bicubic spline resampling in GDAL64. The USGS Africa Surface Lithology map units were converted to indicators, with some units being excluded for having too few (<5) training points.
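A sketch of the indicator conversion with a minimum-count filter, using pandas and hypothetical lithology classes (the class names and counts are made up for illustration):

```python
import pandas as pd

# Hypothetical lithology classes at training points.
litho = pd.Series(["basalt"] * 40 + ["granite"] * 30 + ["gneiss"] * 3)

# Drop units with fewer than 5 training points, then one-hot encode the rest;
# points in a dropped unit get all-zero indicators.
counts = litho.value_counts()
keep = counts[counts >= 5].index
indicators = pd.get_dummies(litho.where(litho.isin(keep)))
print(list(indicators.columns))   # ['basalt', 'granite']
```

Dropping rare units keeps the model from fitting indicator columns backed by almost no observations.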
The 30 m resolution covariates are based on Earth Observation (EO) data that we pre-selected for mapping soil nutrients over Africa, so as to stay within the project budget, using the following procedure (Fig. 7):
- 1.
Upload points to the Google Earth Engine67, overlay and fit initial Random Forest models to identify and prioritize the most important bands;
- 2.
Process the prioritized bands using Amazon AWS; this is still tens of terabytes of Sentinel data, but considerably less than if all bands had been selected and processed;
- 3.
Produce cloud free mosaics for the period 2018–2019 using Amazon AWS; download the final product as Cloud-Optimised GeoTIFFs;
- 4.
Run spatial overlay, model fitting and prediction on a local system using Solid State Drives and servers with large amounts of RAM.
We refer to this as “the hybrid Cloud-based 2-step variable selection procedure” (Fig. 7). With it we combine the power of Google Earth Engine with our own computing infrastructure to achieve customized processing.
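The variable pre-selection in step 1 can be sketched locally with a quick Random Forest importance ranking (synthetic data standing in for the Google Earth Engine point overlay; the cutoff of 8 covariates is an arbitrary illustration):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Step 1 (done in Google Earth Engine in the actual workflow, emulated here):
# fit a quick Random Forest on the point overlay and rank covariate importance.
X, y = make_regression(n_samples=400, n_features=20, n_informative=5,
                       random_state=0)
rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Step 2: keep only the top-ranked covariates for full-scale processing,
# cutting the volume of imagery that has to be prepared downstream.
top_k = 8
selected = np.argsort(rf.feature_importances_)[::-1][:top_k]
print(sorted(selected.tolist()))
```

Only the selected bands then need to be mosaicked and tiled at full 30 m resolution, which is what keeps the processing cost bounded.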
The Sentinel-2 cloud-free images were produced using the Scene Classification Mask (SCL band) for two seasons (S1 = months 1, 2, 3, 7, 8, 9, and S2 = months 4, 5, 6, 10, 11, 12) combined over the years 2018 and 2019, to minimize the number of pixels with clouds. We processed a total of 852,738 Sentinel-2 L2A scenes, or about 200 TB of raw data. Scenes were processed by splitting the African continent into 8721 tiles (2000 × 2000 pixels or 60 × 60 km). For processing these large volumes of data we used AWS EC2 Spot Instances (Auto Scaling Groups) with 3 GB of RAM per vCPU and a few TB of ephemeral (temporary) storage for satellite images. The total processing time to produce all Sentinel-2 products was ca. 100,000 h of computing. The average time required to produce one cloud-free tile per band/season ranged from ca. 90 min for B02 and B04 down to ca. 50 min for B8A, B09, B11 and B12.
For predictive mapping we use a fully-optimized High Performance Computing system (3× Scan 3XS servers) based on the Intel Xeon Gold chipset with 40 CPU cores/80 threads.