
On the interpretability of predictors in spatial data science: the information horizon

Meaningless predictors and spatial dependence

The variogram range of the GRF predictors strongly influences model validation. Figure 6 shows that the cross-validation R² increases with increasing range at all three study sites. The maximum prediction accuracy is reached when the range of the structurally meaningless predictors (e.g. the GRFs) is similar to the range of the soil properties. This is also why predictors that have no structural relationship to a soil property, such as photographs of faces or paintings, can produce accurate evaluation statistics.
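To make the notion of a variogram range concrete, the following is a minimal sketch of an empirical semivariogram with a crude range estimate. It assumes the classical (Matheron) estimator and equal-width lag bins; the function names, the `sill_fraction` threshold, and the range heuristic are illustrative choices, not the procedure used in the study.

```python
import math

def empirical_semivariogram(coords, values, lag_width, n_lags):
    """Classical semivariogram: mean of 0.5 * (z_i - z_j)^2 per distance bin.

    Bin k collects all point pairs whose separation lies in
    [k * lag_width, (k + 1) * lag_width).
    """
    sums = [0.0] * n_lags
    counts = [0] * n_lags
    n = len(values)
    for i in range(n):
        for j in range(i + 1, n):
            h = math.dist(coords[i], coords[j])
            k = int(h // lag_width)
            if k < n_lags:
                sums[k] += 0.5 * (values[i] - values[j]) ** 2
                counts[k] += 1
    centres = [(k + 0.5) * lag_width for k in range(n_lags)]
    # Empty bins are reported as NaN rather than zero.
    gammas = [s / c if c else float("nan") for s, c in zip(sums, counts)]
    return centres, gammas

def estimate_range(centres, gammas, sill_fraction=0.95):
    """Crude range: first lag centre whose semivariance reaches a fraction
    of the largest binned semivariance (hypothetical heuristic)."""
    finite = [g for g in gammas if not math.isnan(g)]
    sill = max(finite)
    for h, g in zip(centres, gammas):
        if not math.isnan(g) and g >= sill_fraction * sill:
            return h
    return None
```

In practice a variogram model (for example spherical or Gaussian) would be fitted to the binned values; the threshold rule here only serves to show what "range" refers to when predictor and response ranges are compared.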

The effect is more stable when 100 GRFs rather than only 10 GRFs are used for each range of spatial dependence. That is, with 100 GRFs, relatively accurate model validations are already reached with shorter-range (finer-scale) predictors than with only 10 GRFs. However, cross-validation accuracy can also be relatively high with 10 GRFs when their variogram ranges are long and thus better resemble the effects of EDFs.

Figure 6

Influence of the variogram ranges of Gaussian Random Fields (GRF) on predictive accuracy. Blue: R² values for models based on 100 GRFs; gold: R² values for models based on 10 GRFs; red: variogram range of the corresponding soil property.


We can therefore accept the first two hypotheses: using enough totally meaningless predictors whose ranges of spatial dependence are similar to or longer than that of the response variable can result in models with high predictive accuracy but zero descriptive accuracy. Both GRFs and EDFs are spatial but not environmental predictors. So, we can “interpolate” with only a few very large-scale random predictors. But when we interpolate, we cannot interpret. Moreover, when the predictors are not completely linear, as EDFs are, which is the case for GRFs with random, meaningless variations between the sample locations, smoothing between the sample points, which is the concept behind interpolation, is not guaranteed. Hence, a model based on GRFs or otherwise meaningless predictors is neither a structural model nor an interpolation model in the classical sense and must therefore be rejected in principle.
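The effect described above can be illustrated with a small, self-contained simulation. This is a sketch under several assumptions and not a reproduction of the study: the fields are approximated by sums of random cosine waves (which corresponds to a Gaussian covariance model), the `length_scale` parameter stands in for the variogram range, and a k-nearest-neighbour regressor in predictor space replaces the machine-learning model. All names, sizes, and seeds are hypothetical.

```python
import math
import random

def make_grf(length_scale, rng, n_waves=30):
    """Approximate a 2D Gaussian random field as a sum of random cosines.

    Wave numbers are drawn from N(0, 1 / length_scale^2); a larger
    length_scale gives a smoother field with longer spatial dependence.
    """
    waves = [(rng.gauss(0, 1 / length_scale),
              rng.gauss(0, 1 / length_scale),
              rng.uniform(0, 2 * math.pi)) for _ in range(n_waves)]
    amp = math.sqrt(2 / n_waves)
    return lambda x, y: amp * sum(math.cos(wx * x + wy * y + b)
                                  for wx, wy, b in waves)

def loo_r2_knn(features, target, k=5):
    """Leave-one-out R^2 of a k-NN regressor working in predictor space."""
    n = len(target)
    preds = []
    for i in range(n):
        neighbours = sorted((math.dist(features[i], features[j]), j)
                            for j in range(n) if j != i)[:k]
        preds.append(sum(target[j] for _, j in neighbours) / k)
    mean_t = sum(target) / n
    sse = sum((t - p) ** 2 for t, p in zip(target, preds))
    sst = sum((t - mean_t) ** 2 for t in target)
    return 1 - sse / sst

def experiment(predictor_scale, n_predictors=10, n_points=150, seed=0):
    """Predict a simulated 'soil property' from unrelated random fields."""
    rng = random.Random(seed)
    points = [(rng.random(), rng.random()) for _ in range(n_points)]
    response = make_grf(0.3, rng)  # simulated response on the unit square
    target = [response(x, y) for x, y in points]
    # The predictors are generated independently of the response.
    fields = [make_grf(predictor_scale, rng) for _ in range(n_predictors)]
    features = [[f(x, y) for f in fields] for x, y in points]
    return loo_r2_knn(features, target)

r2_short = experiment(predictor_scale=0.01)  # far shorter than the response
r2_long = experiment(predictor_scale=1.0)    # longer than the response
print(f"short-range GRFs: R^2 = {r2_short:.2f}")
print(f"long-range GRFs:  R^2 = {r2_long:.2f}")
```

Because smooth long-range fields vary slowly between neighbouring sample locations, points that are close in predictor space also tend to be close in geographic space, so the model effectively interpolates although none of the predictors carries environmental meaning.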

Reference models and prediction accuracy

The prediction accuracies of the different models are presented in Fig. 7. Figure 8 shows the corresponding maps.

Figure 7

Modelling cross-validation accuracies (represented by the coefficient of determination, R²) for all approaches and datasets (GMS: Gaussian mixed scaling, EDF: Euclidean distance fields, GRF: Gaussian random fields models with 100 or 10 predictors).

Figure 8

Modelling results. The first row shows the reference models (GMS: Gaussian mixed scaling, EDF: Euclidean distance fields). The other three rows show the models based on 100 Gaussian random fields (GRF) with different variogram ranges.


Generally, the GMS models produced the best results. This could be an effect of the number and structure of the predictors, the number being larger than in the merged restricted GMS + EDF dataset, and thus possibly an effect of overfitting to the dependence structure of the data23.

Models with EDFs generally performed better than models with the restricted GMS dataset. Hence, the spatial dependence of soil properties cannot be described by the restricted GMS dataset alone. Prediction accuracy increases when the EDFs are used together with the restricted GMS dataset (except for the Meuse dataset), which could be an indicator of non-stationarity.

Modelling with GRFs produced relatively high predictive accuracies (Fig. 7).

The information horizon—descriptive accuracy and relevance

Generally, the descriptive accuracy is strong if there is a causal relationship25 between the response and a predictor. The descriptive accuracy can also be high if there are associations among variables as usually inferred by statistical analysis, which can suggest potential causal relationships1.

Provided that the predictors are relevant and the algorithm or method used to generate the data is valid15,17, the results of this study indicate that the descriptive accuracy is high if the range of spatial dependence of a predictor is equal to or smaller than the range of the response variable. However, predictive accuracy can increase when predictors with longer ranges than the range of the environmental property are included, possibly due to non-stationarities in the environmental process4,26 or effects of anisotropy. Therefore, one should remove from multiscale approaches15,17,27,28,29,30 only those predictors whose variogram ranges are long with respect to the size of the study area and whose information content is below a certain minimum (Fig. 4).
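The screening logic of this paragraph can be written down as a simple decision rule. The sketch below is only an illustration: the `information` score, the thresholds `long_fraction` and `min_information`, and the label strings are hypothetical placeholders, since the text does not give numeric cut-offs.

```python
def classify_predictor(predictor_range, response_range, study_diagonal,
                       information, long_fraction=0.9, min_information=0.1):
    """Label a predictor relative to the information horizon.

    predictor_range, response_range: variogram ranges in map units.
    study_diagonal: diagonal length of the study area in map units.
    information: hypothetical score in [0, 1] for the spatial variation
        the predictor shows across the study area.
    long_fraction, min_information: illustrative thresholds only.
    """
    # Ranges up to the range of the response are treated as interpretable.
    if predictor_range <= response_range:
        return "within horizon"
    # Removal requires BOTH a very long range and too little information.
    is_long = predictor_range >= long_fraction * study_diagonal
    if is_long and information < min_information:
        return "remove"
    # Everything else may still help prediction but is harder to interpret.
    return "beyond horizon: interpret with care"
```

The rule deliberately keeps long-range predictors that still vary sufficiently across the site, because such predictors can raise predictive accuracy, for instance under non-stationarity.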

If, on the other hand, predictors show ranges of about the diagonal length of the entire study area, then they resemble the properties of the EDFs (Fig. 5) and their ranges are too long to lend themselves to interpretation. In these cases, predictors behave indistinguishably from purely spatial predictors, although they might still be interpretable in some situations.
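For readers unfamiliar with EDFs, the following sketch shows one common way to construct them. It assumes that the EDF set consists of the two coordinates plus the distances to the four corners and to the centre of the bounding rectangle; this section does not spell out the construction, so the layout and the feature names are assumptions.

```python
import math

def edf_predictors(x, y, xmin, ymin, xmax, ymax):
    """Euclidean distance fields for one location (assumed construction).

    Returns the coordinates plus the distances to the four corners and to
    the centre of the bounding rectangle of the study area.
    """
    corners = {
        "d_sw": (xmin, ymin),
        "d_nw": (xmin, ymax),
        "d_se": (xmax, ymin),
        "d_ne": (xmax, ymax),
    }
    centre = ((xmin + xmax) / 2, (ymin + ymax) / 2)
    features = {"x": x, "y": y}
    for name, corner in corners.items():
        features[name] = math.dist((x, y), corner)
    features["d_centre"] = math.dist((x, y), centre)
    return features
```

Each of these fields changes smoothly and monotonically over distances comparable to the extent of the study area, which is why environmental predictors with ranges near the diagonal length become hard to distinguish from them.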

In summary, these results confirm our third hypothesis, and show that the primary information horizon is located somewhere between the range of the variogram of the soil property and a certain minimal variation of the predictors across the study site.

Beyond the information horizon—descriptive uncertainty and contextual complexity

We recently showed that when scales are successively removed, from finer to coarser, from the full set of scales in a GMS model, prediction accuracy usually remains high, even if only the coarsest scales remain in the model4. In cases where the prediction accuracy decreases, we can assume that (i) structural information is lacking, (ii) interpolation is not the appropriate method, or (iii) the spatial dependence of the coarse-scale GMS predictors is not suitable for interpolation, e.g. if all these coarse-scale predictors show a trend in one direction only, for example in the X direction instead of in both the X and Y directions.

We also found an increase in prediction accuracy beyond the range of the variogram of soil properties in GMS and similar approaches4,15,26, when successively adding coarser scales. There are two explanations. First, not all original terrain properties show exactly the same original scale or range, which is due to the convolution functions and the general approaches to calculate terrain properties (e.g. first and second order derivatives, i.e. slope and curvature). Second, this effect might be related to non-stationarity, where coarse-scale predictors can help to “divide” the study area into zones. We tested this here by combining the restricted set of GMS predictors, which are within the information horizon, with EDFs. In all cases prediction accuracy increased. Hence, there is obviously some spatial dependence present, resulting from predictor interactions on very coarse scales or long ranges, which are beyond the information horizon, inferable by the size of the study area. In these cases, there will be some uncertainty in the descriptive accuracy when using predictors that show information contents below a certain minimum of spatial variation.

Looking specifically at the three study sites we see complex soil property formation processes due to interactions of predictors at different scales. This contextual complexity has to be taken into account when interpreting environmental predictors beyond the information horizon, as discussed above.

In Piracicaba, very coarse-scale predictors are important27. The soil formation system, however, is rather simple. It is based on rock formation, strike and dip, and subsequent erosion. In this case, coarse-scale terrain indicators for aspect are good proxies to differentiate between the two different types of parent material, even though they resemble properties of EDFs. In such cases, partial dependence models should be applied to aid interpretation27.
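As a reminder of what a partial dependence analysis computes, here is a minimal, model-agnostic sketch. The `predict` callable stands for any fitted model and is hypothetical, as are the row layout and the grid values; the cited work may use a different implementation.

```python
def partial_dependence(predict, rows, feature_index, grid):
    """Average model response when one feature is forced to each grid value.

    predict: callable taking one feature row (a list) and returning a number.
    rows: observed feature rows; the other features keep their observed values.
    feature_index: position of the feature of interest within each row.
    grid: values at which the feature of interest is evaluated.
    """
    curve = []
    for value in grid:
        total = 0.0
        for row in rows:
            modified = list(row)          # copy so the data stay unchanged
            modified[feature_index] = value
            total += predict(modified)
        curve.append(total / len(rows))   # marginalise over the other features
    return curve
```

Applied to a coarse-scale aspect indicator, for example, the resulting curve shows how the predicted soil property changes with that indicator alone, which can be checked against the known distribution of the parent materials.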

The silt content in Rhine-Hesse is controlled by local silt translocation31, which occurred in the last glacial period of the Pleistocene epoch (Würm glaciation) and which was modulated by interactions of climate and terrain. This can be described in terms of a teleconnected system32 and can be mapped by terrain only, which then serves as a proxy for that system26,27. Similar to Piracicaba, interpretations of predictors with very large ranges and relatively low information content can be reasonable. However, the descriptive uncertainty is higher compared to predictors that fall within the information horizon.

The situation for Meuse is different because a different process system dominates. The zinc content is driven by flooding events. Therefore, different and more relevant predictors, such as the distance to the river Meuse, should be used in this case9. We see that EDFs perform better than the mixed dataset (GMS restricted + EDF). This shows that the multiscale terrain predictors are not relevant here but represent noise, and can therefore serve at most as vague proxies. Another possible cause of this effect is algorithmic issues related to feature selection within the Random Forests model, which in some specific cases might occur in relation to autocorrelated predictors33, or effects due to fitting noise5.

Interestingly, in all cases the GMS models perform better than the mixed dataset (GMS restricted + EDF). This can be due either to the higher number of predictors in the GMS approach or to relevant structural predictors beyond the information horizon.

Generally, the interpretation of environmental predictors beyond the edge of the information horizon requires particular care and is subject to more uncertainty.

