The three major axes of terrestrial ecosystem function

FLUXNET data

The data used in this study belong to the FLUXNET LaThuile⁹ and FLUXNET2015 Tier 1 and Tier 2 datasets¹⁰, which make up the global network of CO₂, water vapour and energy flux measurements. We merged the two FLUXNET releases and retained the FLUXNET2015 (the most recent and with a robust quality check) version of the data when the site was present in both datasets. Croplands were removed to avoid the inclusion of sites that are heavily managed in the analysis (for example, fertilization and irrigation).

The sites used cover a wide variety of climate zones (from tropical to Mediterranean to Arctic) and vegetation types (wetlands, shrublands, grasslands, savanna, evergreen and deciduous forests). It should be noted though that tropical forests are underrepresented in the FLUXNET database (Extended Data Figs. 1, 3).

Sites were excluded in cases in which: (i) data on precipitation or radiation were not available or completely gap-filled; (ii) the calculation of functional properties failed because of low availability of measured data (see ‘Calculation of ecosystem functions from FLUXNET’); and (iii) fluxes showed clear discontinuities in time series indicating a change of instrumentation set-up (for example, changes in the height of the ultrasonic anemometer or gas analyser).

The final number of sites selected was 203 (1,484 site years). The geographical distribution is shown in Extended Data Fig. 1, the distribution in the climate space is shown in Extended Data Fig. 2 and the fraction of sites for each climate classes is reported in Extended Data Fig. 3.

For each site, we downloaded the following variables at half-hourly temporal resolution: (i) gross primary productivity (GPP, μmol CO₂ m^–² s^–¹) derived from the night-time flux partitioning²⁶ (GPP_NT_VUT_50 in FLUXNET 2015 and GPP_f in LaThuile), (ii) net ecosystem exchange (NEE, μmol CO₂ m^–² s^–¹) measurements filtered using annual friction velocity (u*, m s⁻¹) threshold (NEE_VUT_50 in FLUXNET 2015; NEE in LaThuile); (iii) latent heat (LE, W m⁻²) fluxes, which were converted to evapotranspiration (ET, mm); (iv) sensible heat (H, W m⁻²) fluxes; (v) air temperature (T_air, °C); (vi) vapour pressure deficit (VPD, hPa); (vii) global shortwave incoming radiation (SW_in, W m⁻²); viii) net radiation (R_n, W m⁻²); (ix) ground heat flux (G, W m⁻²); (x) friction velocity u* (m s⁻¹); and (xi) wind speed (u, m s⁻¹). For the energy fluxes (H, LE) we selected the fluxes not corrected for the energy balance closure to guarantee consistency between the two FLUXNET datasets (in the LaThuile dataset energy fluxes were not corrected).

The cumulative soil water index (CSWI, mm) was computed as a measure of water availability according to a previous report²⁷. Half-hourly values of transpiration estimates (T, mm) were calculated with the transpiration estimation algorithm (TEA)²⁸. The TEA has been shown to perform well against both model simulations and independent sap flow data²⁸.

For 101 sites, ecosystem scale foliar N content (N%, gN 100 g⁻¹) was computed as the community weighted average of foliar N% of the major species at the site sampled at the peak of the growing season or gathered from the literature^29,30,31,32. Foliar N% for additional sites was derived from the FLUXNET Biological Ancillary Data Management (BADM) product and/or provided by site principal investigators (Supplementary Table 1, Extended Data Fig. 1). It should be noted that this compilation of N% data might suffer from uncertainties resulting from the scaling from leaves to the eddy covariance footprint, the sampling strategy (including the position along the vertical canopy profile), the species selection and the timing of sampling. About 30% of the data comes from a coordinated effort that minimized these uncertainties^29,30, and for the others we collected N% data that were representative for the eddy covariance footprint^31,32.

Maximum leaf area index (LAI_max, m² m⁻²) and maximum canopy height (H_c, m) were also collected for 153 and 199 sites, respectively, from the literature^32,33, the BADM product, and/or site principal investigators.

Earth observation retrievals of above-ground biomass (AGB, tons of dry matter per hectare (t DM ha⁻¹)) were extracted from the GlobBiomass dataset³⁴ at its original resolution (grid cell 100 × 100 m) for each site location. All the grid cells in a 300 × 300 m and 500 × 500 m window around each location were selected to estimate the median and 95th percentiles of AGB for each site. The median of AGB was selected to avoid the contribution of potential outliers to the expected value of AGB. The analysis further explored the contribution of higher percentiles in the local variation of AGB as previous studies have highlighted the contribution of older and larger trees in uneven stand age plots to ecosystem functioning³⁵. According to the evaluation against AGB measured at 71 FLUXNET sites (Extended Data Fig. 10), we decided to use the product with median AGB values extracted from the 500 × 500 m window.

A total of 94 sites have all the data on vegetation structure (N%, LAI_max, H_c, and AGB).

The list of sites is reported in Supplementary Table 1 along with the plant functional type (PFT), Köppen-Geiger classification, coordinates, and when available N%, LAI_max, H_c and AGB.

In this study we did not make use of satellite information, with the exception of the AGB data product. Future studies will benefit from new missions such as the ECOsystem Spaceborne Thermal Radiometer Experiment on Space Station (ECOSTRESS), the fluorescence explorer (FLEX), hyperspectral, and radar and laser detection and ranging (LiDAR) missions (for example, Global Ecosystem Dynamics Investigation (GEDI)), to characterize a multivariate space of structural and functional properties.

Calculation of ecosystem functions from FLUXNET

Starting from half-hourly data, we calculated at each site a single value for each of the ecosystem functions listed below. For the calculations of functional properties we used, unless otherwise indicated, good-quality data: quality flag 0 (measured data) and 1 (good-quality gap-filled data) in the FLUXNET dataset.

Gross primary productivity at light saturation (GPP_sat)

GPP at light saturation using photosynthetically active radiation as driving radiation and 2,000 μmol m⁻² s⁻¹ as saturating light. GPP_sat represents the ecosystem-scale maximum photosynthetic CO₂ uptake^15,30,36. The GPP_sat was estimated from half-hourly data by fitting the hyperbolic light response curves with a moving window of 5 days and assigned at the centre of the moving window^30,37. For each site the 90th percentile from the GPP_sat estimates was then extracted.

Maximum net ecosystem productivity (NEP_max)

This was computed as the 90th percentile of the half-hourly net ecosystem production (NEP = −NEE) in the growing season (that is, when daily GPP is higher than 30% of the GPP amplitude). This metric represents the maximum net CO₂ uptake of the ecosystem.

Basal ecosystem respiration (Rb and Rb_max)

Basal ecosystem respiration at reference temperature of 15 °C was derived from night-time NEE measurements²⁶. Daily basal ecosystem respiration (Rb_d) was derived by fitting an Arrhenius type equation over a five-day moving window and by keeping the sensitivity to temperature parameter (E₀) fixed as in the night-time partitioning algorithms^26,38. Rb_d varies across seasons because it is affected by short-term variations in productivity^33,39, phenology⁴⁰ and water stress⁴¹. For each site, the mean of the Rb_d (Rb) and the 95^th percentile (Rb_max) were computed. The calculations were conducted with the REddyProc R package v.1.2.2 (ref. ³⁸).

Apparent carbon-use efficiency (aCUE)

The aCUE as defined in this study is the efficiency of an ecosystem to sequester the carbon assimilated with photosynthesis³⁹. aCUE is an indication of the proportion of respired carbon with respect to assimilated carbon within one season. A previous report⁶ showed that little of the variability in aCUE can be explained by climate or conventional site characteristics, and suggested an underlying control by plant, faunal and microbial traits, in addition to site disturbance history. Daily aCUE (aCUE_d) is defined as aCUE_d = 1 − (Rb_d/GPP_d), where GPP_d is daily mean GPP and Rb_d is derived as described above. For each site, aCUE was computed as the median of aCUE_d.

Metrics of water-use efficiency (WUE)

Various metrics of WUE are described below: stomatal slope or slope coefficient (G1), underlying water-use efficiency (uWUE), and water-use efficiency based on transpiration (WUE_t). The three metrics were used because they are complementary, as shown in previous studies^11,42.

Stomatal slope or slope coefficient (G1)

This is the marginal carbon cost of water to the plant carbon uptake. G1 is the key parameter of the optimal stomatal model derived previously⁴³. G1 is inversely related to leaf-level WUE. At leaf level, G1 is calculated using nonlinear regression and can be interpreted as the slope between stomatal conductance and net CO₂ assimilation, normalized for VPD and CO₂ concentration⁴³. A previous report⁴² showed the potential of the use of G1 at ecosystem scale, where stomatal conductance is replaced by surface conductance (G_s), and net assimilation by GPP. The methodology is implemented in the bigleaf R package⁴⁴. The metric was computed in the following situations: (i) incoming shortwave radiation (SW_in) greater than 200 W m⁻²; (ii) no precipitation event for the last 24 h⁴⁵, when precipitation data are available; and (iii) during the growing season: daily GPP > 30% of its seasonal amplitude⁴⁴.

Underlying water-use efficiency (uWUE)

The underlying WUE was computed following a previous method⁴⁶. uWUE is a metric of water-use efficiency that is negatively correlated to G1 at canopy scale⁴⁴:

$${rm{uWUE}}=frac{{rm{GPP}}sqrt{{rm{VPD}}}}{{rm{ET}}}.$$

uWUE was calculated using the same filtering that was applied for the calculation of G1. The median of the half-hourly retained uWUE values was computed for each site and used as a functional property.

Water-use efficiency based on transpiration (WUE_t)

The WUE based on transpiration (T) was computed to reduce the confounding effect resulting from soil evaporation^11,28:

$${{rm{WUE}}}_{{rm{t}}}=frac{{rm{GPP}}}{T},$$

where T is the mean annual transpiration calculated with the transpiration estimation algorithm (TEA) developed by in a previous study²⁸ and GPP is the mean annual GPP.

Maximum surface conductance (G
_smax)

Surface conductance (G_s) was computed by inverting the Penman–Monteith equation after calculating the aerodynamic conductance (G_a).

Among the different formulations of G_a (m s^–¹) in the literature, we chose to use here the calculation of the canopy (quasi-laminar) boundary layer conductance to heat transfer, which ranges from empirical to physically based (for example, ref. ⁴⁷). Other studies^48,49 suggested an empirical relationship between G_a, the horizontal wind speed (u) and the friction velocity, u*:

$${G}_{{rm{a}}}=frac{1}{(frac{u}{{u}^{* 2}}+6.2u{* }^{-0.67})}$$

G_s (m s⁻¹) is computed by inverting the Penman–Monteith equation:

$${G}_{{rm{s}}}=frac{{{rm{LEG}}}_{{rm{a}}}gamma }{Delta ({R}_{{rm{n}}}-G-S)+rho {C}_{{rm{p}}}{G}_{{rm{a}}}{rm{VPD}}-{rm{LE}}(Delta +gamma )}$$

where Δ is the slope of the saturation vapour pressure curve (kPa K⁻¹), ρ is the air density (kg m⁻³), C_p is the specific heat of the air (J K⁻¹ kg⁻¹), γ is the psychrometric constant (kPa K⁻¹), VPD (kPa), R_n (W m⁻²), G (W m⁻²) and S is the sum of all energy storage fluxes (W m⁻²) and set to 0 as not available in the dataset. When not available, G also was set to 0.

G_s represents the combined conductance of the vegetation and the soil to water vapour transfer. To retain the values with a clear physiological interpretation, we filtered the data as we did for the calculation of G1.

For each site, the 90th percentile of the half-hourly G_s was calculated and retained as the maximum surface conductance of each site (G_smax). G_s was computed using the bigleaf R package⁴⁴.

Maximum evapotranspiration in the growing season (ET_max)

This metric represents the maximum evapotranspiration computed as the 95th percentile of ET in the growing season and using the data retained after the same filtering applied for the G1 calculation.

Evaporative fraction (EF)

EF is the ratio between LE and the available energy, here calculated as the sum of H + LE (ref. ⁵⁰). For the calculation of EF, we used the same filtering strategy as for G1. We first calculated mean daytime EF. We then computed the EF per site as the growing season average of daytime EF. We also computed the amplitude of the EF in the growing season by calculating the interquartile distance of the distribution of mean daytime EF (EF_ampl).

Principal component analysis

A PCA was conducted on the multivariate space of the ecosystem functions. Each variable (ecosystem functional property, EFP) was standardized using z-transformation (that is, by subtracting its mean value and then dividing by its standard deviation). From the PCA results we extracted the explained variance of each component and the loadings of the EFPs, indicating the contribution of each variable to the component. We performed the PCA using the function PCA() implemented in the R package FactoMineR⁵¹.

We justify using PCA over nonlinear methods because it is an exploratory technique that is highly suited to the analysis of the data volume used in this study, whereas other nonlinear methods applied to such data would be over-parameterized. For the same reason, PCA was used in previous work concerning the global spectrum of leaf and plant traits, and fluxes^1,3,52.

To test the significance of dimensionality of the PCA, we used a previously described methodology⁵³. We used the R package ade4 (ref. ⁵⁴) and evaluated the number of significant components of the PCA to be retained to minimize both redundancy and loss of information (Supplementary Information 2). We tested the significance of the PCA loadings using a combination of the bootstrapped eigenvector method⁵⁵ and a threshold selected using the number of dimensions⁵⁶ (Supplementary Information 2).

Predictive variable importance

A random forests (RF) analysis was used to identify the vegetation structure and climate variables that contribute the most to the variability of the significant principal components, which were identified with the PCA analysis (see ‘Principal component analysis’). In the main text we refer to the results of this analysis as ‘predictive variable importance’ to distinguish this to the ‘causal variable importance’ described below.

The analysis was conducted using the following predictor variables: as structural variables, N% (gN 100 g⁻¹), LAI_max (m₂ m⁻²), AGB (t DM ha⁻¹) and H_c (m); as climatic variables, mean annual precipitation (P, mm), mean VPD during the growing season (VPD, hPa), mean shortwave radiation (SW_in, W m⁻²), mean air temperature (T_air, °C); and the cumulative soil water index (CSWI, −), as indicator of site water availability.

We used partial dependencies of variables to assess the relationship between individual predictors and the response variable (that is, PC1, PC2 and PC3).

The results from the partial dependency analysis can be used to determine the effects of individual variables on the response, without the influence of the other variables. The partial dependence function was calculated using the pdp R package⁵⁷.

The partial dependencies were calculated restricted to the values that lie within the convex hull of their training values to reduce the risk of interpreting the partial dependence plot outside the range of the data (extrapolation).

Invariant causal regression models and causal variable importance

We have quantified the dependence of the principal components on the different structural and climatic variables using nonlinear regression. Such dependencies can only be interpreted causally if the regression models are in fact causal regression models (see Supplementary Information 3 for a formal definition), which may not be the case if there are hidden confounders. To see whether the regression models allow for a causal interpretation, we use invariant causal prediction⁵⁸. This method investigates whether the regression models are stable with respect to different patterns of heterogeneity in the data, encoded by different environments (that is, subsets of the original dataset). The rationale is that a causal model, describing the full causal mechanism for the response variable, should be invariant with respect to changes in the environment if the latter does not directly influence the response variable^13,59. Other non-causal models may be invariant, too, but a non-invariant model cannot be considered causal.

How to choose the environments is a modelling choice that must satisfy the following criteria. First, it should be possible to assign each data point to exactly one environment. Second, the environments should induce heterogeneity in the data, so that, for example, the predictor variables have different distributions across environments. Third, the environments must not directly affect the response variable, only via predictors, although the distribution of the response may still change between environments. The third criterion can be verified by expert knowledge and is assumed to hold for our analysis. In addition, if it is violated, then, usually, no set is invariant⁵⁸, which can be detected from data.

In our analysis, we assigned each data point (that is, each site) to one of two environments (two subsets of the original dataset): the first includes forest sites in North America, Europe or Asia; and the second includes non-forest and forest ecosystems from South America, Africa or Oceania, and non-forest ecosystems from North America, Europe or Asia (see Supplementary Information 3.1.3.1 for details). Our choice satisfies the method’s assumption that the distribution of the predictors is different between the two environments (that is, they induce heterogeneity in the data; see Supplementary Fig. 3.1). Environments that are too small or too homogeneous do not provide any evidence against the full set of covariates being a candidate for the set of causal predictors. Other choices of environments than the one presented here yield consistent results (Supplementary Information 3.2.1, Supplementary Fig. 3.4).

For each subset of predictors, we test whether the corresponding regression model is invariant (yielding the same model fit in each environment). Although many models were rejected and considered non-invariant, the full model (with all the nine predictors and used in the predictive variable importance analysis) was accepted as invariant, establishing the full set of covariates as a reasonable candidate for the set of direct causal predictors. We used both RF (randomForest package in R⁶⁰) and generalized additive models, GAM⁶¹ (mgcv package⁶² in R) to fit the models. Both methods lead to comparable results but with a better average performance of the RF: GAM led to slightly better results than RF for PC1, whereas for PC2 and PC3 RF showed a much better model performance (Supplementary Table 3.1, Supplementary Information 3.2.2). Therefore, in the main text we showed only the results from the RF (except for PC1).

If, indeed, the considered regression models are causal, this allows us to make several statements. First, we can test for the existence of causal effects by testing for statistical significance of the respective predictors in the fitted models. Second, we can use the response curves of the fitted model to define a variable importance measure with a causal interpretation. In the main text we refer to this variable importance as ‘causal variable importance’. For details, see Supplementary Information 3.1.2. More formally, we considered the expected value of the predicted variables (the principal components) under joint interventions on all covariates (AGB, H_c, LAI_max, N%, T_air, VPD, SW_in, CSWI and P) at once, and then, to define the importance, we quantified how this expected value depends on the different covariates. We applied the same analysis to groups of vegetation structural and climate covariates (see ‘Groupwise variable importance’ in Supplementary Information 3.1.2.3, 3.2.3).

The details of the methodology and the results are described in Supplementary Information 3, in which we also provide further details on the choice of environment variable and on the statistical tests that we use to test for invariance. An overview of the invariance-based methodology is shown in Supplementary Fig. 3.1.

Land surface model runs

We run two widely used land surface models: Orchidee-CN (OCN) and Jena Scheme for Biosphere Atmosphere Coupling in Hamburg (JSBACH):

OCN

The dynamic global vegetation model OCN is a model of the coupled terrestrial carbon and nitrogen cycles^63,64, derived from the ORCHIDEE land surface model. It operates at a half-hourly timescale and simulates diurnal net carbon, heat and water exchanges, as well as nitrogen trace gas emissions, which jointly affect the daily changes in leaf area index, foliar nitrogen, and vegetation structure and growth. The main purpose of the model is to analyse the longer-term (interannual to decadal) implication of nutrient cycling for the modelling of land–climate interactions^64,65. The model can run offline, driven by observed meteorological parameters, or coupled to the global circulation model.

JSBACH

JSBACH v.3 is the land surface model of the MPI Earth System Model^66,67. The model operates at a half-hourly time step and simulates the diurnal net exchange of momentum, heat, water and carbon with the atmosphere. Daily changes in leaf area index and leaf photosynthetic capacity are derived from a prognostic scheme assuming a PFT-specific set maximum leaf area index and a set of climate responses modulating the seasonal course of leaf area index. Carbon pools are prognostic allowing for simulating the seasonal course of net land–atmosphere carbon exchanges.

We selected OCN and JSBACH because they are widely used land surface models with different structures. JSBACH is a parsimonious representation of the terrestrial energy, water and carbon exchanges used to study the coupling of land and atmosphere processes in an Earth system model⁶⁷. OCN has also been derived from the land surface model ORCHIDEE⁶⁸, but it includes a more comprehensive representation of plant physiology, including a detailed representation of the tight coupling of the C and N cycling⁶³. Both models contribute to the annual global carbon budget of the Global Carbon Project⁶⁹ and have shown good performance compared to a number of global benchmarks. OCN was further used in several model syntheses focused on the interaction between changing N deposition and CO₂ fertilization^70,71,72. Both OCN and JSBACH can operate at a half-hourly timescale and simulate net and gross carbon exchanges, water and energy fluxes, and therefore are ideal for the extraction of ecosystem functional properties, as done with the eddy covariance data.

The models were driven by half-hourly meteorological variables (shortwave and longwave downward flux, air temperature and humidity, precipitation, wind speed and atmospheric CO₂ concentrations) observed at the eddy covariance sites. OCN was furthermore driven by N deposition fields⁷³. Vegetation type, soil texture and plant available water were prescribed on the basis of site observations, but no additional site-specific parameterization was used. Both models were brought into equilibrium with respect to their ecosystem water storage and biogeochemical pools by repeatedly looping over the available site years. We added random noise (mean equal to 0 and standard deviation of 5% of the flux value) to the fluxes simulated by the models to mimic the random noise of the eddy covariance flux observations. An additional test conducted without noise addition showed only a marginal effect on the calculations of the functional properties and the ecosystem functional space.

We used runs of the JSBACH and OCN model for 48 FLUXNET sites (Supplementary Table 1). The simulated fluxes were evaluated against the observation to assess the performance of the models at the selected sites. From the model outputs and from each site we derived the ecosystem functions using the same methodology described above. Then the PCA analysis was performed on the three datasets (FLUXNET, OCN and JSBACH) and restricted to the 48 sites used to run the models. We ran the models only on the subset of sites for which the information for the parameterization and high-quality forcing was available. However, the different ecosystem functions emerge from the model structure and climatological conditions. Therefore, even with a smaller set of site we can evaluate whether models reproduce the key dimensions of terrestrial ecosystem function by comparing the PCA results from FLUXNET and the model runs.

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this paper.

The three major axes of terrestrial ecosystem function

FLUXNET data

Calculation of ecosystem functions from FLUXNET

Gross primary productivity at light saturation (GPP_sat)

Maximum net ecosystem productivity (NEP_max)

Basal ecosystem respiration (Rb and Rb_max)

Apparent carbon-use efficiency (aCUE)

Metrics of water-use efficiency (WUE)

Stomatal slope or slope coefficient (G1)

Underlying water-use efficiency (uWUE)

Water-use efficiency based on transpiration (WUE_t)

Maximum surface conductance (G
_smax)

Maximum evapotranspiration in the growing season (ET_max)

Evaporative fraction (EF)

Principal component analysis

Predictive variable importance

Invariant causal regression models and causal variable importance

Land surface model runs

OCN

JSBACH

Reporting summary

Evaluation of fish feeder manufactured from local raw materials

A new method for removing lead from drinking water

ITALIAN LANGUAGE

ENGLISH LANGUAGE

FLUXNET data

Calculation of ecosystem functions from FLUXNET

Gross primary productivity at light saturation (GP﻿Psat)

Maximum net ecosystem productivity (NEPmax)

Basal ecosystem respiration (Rb and Rbmax)

Apparent carbon-use efficiency (aCUE)

Metrics of water-use efficiency (WUE)

Stomatal slope or slope coefficient (G1)

Underlying water-use efficiency (uWUE)

Water-use efficiency based on transpiration (WUEt)

Maximum surface conductance (G smax)

Maximum evapotranspiration in the growing season (ETmax)

Evaporative fraction (EF)

Principal component analysis

Predictive variable importance

Invariant causal regression models and causal variable importance

Land surface model runs

OCN

JSBACH

Reporting summary

ITALIAN LANGUAGE

ENGLISH LANGUAGE

Gross primary productivity at light saturation (GPP_sat)

Maximum net ecosystem productivity (NEP_max)

Basal ecosystem respiration (Rb and Rb_max)

Water-use efficiency based on transpiration (WUE_t)

Maximum surface conductance (G
_smax)

Maximum evapotranspiration in the growing season (ET_max)