### General description of a geoscientific model and parameter calibration

A model for both non-dynamical and dynamical systems can be generically written for site ** i** as

$$\left\{\mathbf{y}_{t}^{i}\right\}_{t\in T}=f\left(\left\{\mathbf{x}_{t}^{i}\right\}_{t\in T},\,\boldsymbol{\varphi}^{i},\,\boldsymbol{\theta}^{i}\right)$$

(1)

where output physical predictions ($\mathbf{y}_{t}^{i}=[y_{1,t}^{i},y_{2,t}^{i},\cdots]^{T}$, with the first subscript denoting variable type) vary with time (*t*) and location (*i*), and are functions of time- and location-specific inputs ($\mathbf{x}_{t}^{i}=[x_{1,t}^{i},x_{2,t}^{i},\cdots]^{T}$), location-specific observable attributes ($\boldsymbol{\varphi}^{i}=[\varphi_{1}^{i},\varphi_{2}^{i},\cdots]^{T}$), and location-specific unobserved parameters that need to be separately determined ($\boldsymbol{\theta}^{i}=[\theta_{1}^{i},\theta_{2}^{i},\cdots]^{T}$). **θ** may be unobservable, or it may be too expensive or difficult to observe at the needed accuracy, resolution, or coverage. This formulation also applies to dynamical systems if $\mathbf{x}_{t}^{i}$ includes previous system states $\mathbf{y}_{t-1}^{i}$ (i.e., $\mathbf{y}_{t-1}^{i}\subset\mathbf{x}_{t}^{i}$), and the rest of the inputs are independent (e.g., meteorological) forcing data. In a non-dynamical system, $\mathbf{x}_{t}^{i}$ is independent of $\mathbf{y}_{t-1}^{i}$.
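To make the notation concrete, the dynamical-system case can be sketched in a few lines of Python. The linear step function below is a toy stand-in for *f*, not any real geoscientific model; all names and coefficients are hypothetical.

```python
import numpy as np

def step_model(x_t, y_prev, phi, theta):
    """One time step of a hypothetical model f: a toy linear reservoir
    whose state decays by theta[0] and gains theta[1] * forcing x_t."""
    return theta[0] * y_prev + theta[1] * x_t + 0.0 * phi.sum()

def run_model(x_series, y0, phi, theta):
    """Dynamical case: the previous state y_{t-1} feeds back into step t
    (in the paper's notation, y_{t-1} is part of the full input x_t)."""
    y, out = y0, []
    for x_t in x_series:
        y = step_model(x_t, y, phi, theta)
        out.append(y)
    return np.array(out)
```

In a non-dynamical system, `step_model` would simply ignore `y_prev`.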

Given some observations

$$\mathbf{z}_{t}^{i}=h\left(\mathbf{y}_{t}^{i}\right)+\boldsymbol{\varepsilon}_{t}^{i}$$

(2)

where *h*(·) relates model outputs to observations and $\boldsymbol{\varepsilon}_{t}^{i}=[\varepsilon_{1,t}^{i},\varepsilon_{2,t}^{i},\cdots]^{T}$ is the error between the observations ($\mathbf{z}_{t}^{i}=[z_{1,t}^{i},z_{2,t}^{i},\cdots]^{T}$) and the model predictions ($\mathbf{y}_{t}^{i}$), we adjust the model parameters so that the predictions best match the observations. This is traditionally done individually for each location (here generically referring to a gridcell, basin, site, river reach, agricultural plot, etc., depending on the model):

$$\hat{\boldsymbol{\theta}}^{i}=\arg\min_{\boldsymbol{\theta}^{i}}\sum_{t\in T}\left\Vert\boldsymbol{\varepsilon}_{t}^{i}\right\Vert^{2}=\arg\min_{\boldsymbol{\theta}^{i}}\sum_{t\in T}\left\Vert h\left(f\left(\left\{\mathbf{x}_{t}^{i}\right\}_{t\in T},\boldsymbol{\varphi}^{i},\boldsymbol{\theta}^{i}\right)\right)-\mathbf{z}_{t}^{i}\right\Vert^{2}$$

(3)

where $i\in I$ and $I=\{1,2,\ldots,N_I\}$. Note that the superscript *i* indicates that this optimization is done for each site independently.
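A minimal sketch of this classical site-by-site calibration (Eq. 3), using a hypothetical toy model and a brute-force grid search in place of a real optimizer such as SCE-UA; everything here is illustrative.

```python
import numpy as np

def toy_model(x, theta):
    # Hypothetical stand-in for h(f(...)): a simple nonlinear response.
    return theta[0] * x + theta[1] * np.sqrt(x)

def calibrate_site(x, z, grid):
    """Per-site calibration (Eq. 3): pick the theta that minimizes the
    sum of squared errors for this one site alone."""
    best, best_loss = None, np.inf
    for theta in grid:
        loss = np.sum((toy_model(x, theta) - z) ** 2)
        if loss < best_loss:
            best, best_loss = theta, loss
    return best

x = np.linspace(1.0, 10.0, 50)
true_theta = (0.7, 1.2)
z = toy_model(x, true_theta)  # synthetic "observations" without noise
grid = [(a, b) for a in np.arange(0.0, 1.01, 0.1)
        for b in np.arange(0.0, 2.01, 0.2)]
theta_hat = calibrate_site(x, z, grid)
```

In a real calibration the search would be repeated independently at every site, which is exactly the per-location cost that dPL amortizes away.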

### The process-based hydrologic model and its surrogate

The Variable Infiltration Capacity (VIC) hydrologic model has been widely used for simulating the water and energy exchanges between the land surface and atmosphere, along with related applications in climate, water resources (e.g., flood, drought, hydropower), agriculture, and others. The model simulates evapotranspiration, runoff, soil moisture, and baseflow based on conceptualized bucket formulations. Inputs to the model include daily meteorological forcings, non-meteorological data, and the parameters to be determined. Meteorological forcing data include time series of precipitation, air temperature, wind speed, atmospheric pressure, vapor pressure, and longwave and shortwave radiation. More details about VIC can be found in Liang et al.^{39}.

An LSTM was trained to reproduce the behavior of VIC as closely as possible while also allowing for gradient tracking. In theory, if a hydrologic model can be written directly on a machine learning platform (as in our HBV case), this step is not needed, but training a surrogate model is more convenient when the model is complex. To ensure the surrogate model had high fidelity in the regions of parameter space that the search algorithms explore, we iterated the training procedure multiple times. We first trained an LSTM surrogate for VIC using the forcings, attributes, and parameters from NLDAS-2 as inputs, and the VIC-simulated surface soil moisture (variable name: SOILM_lev1) and evapotranspiration (ET, variable name: EVP) as the targets of emulation. Then, as the search algorithms (SCE-UA or dPL) approached an optimum, we took the calibrated parameter sets, perturbed them by adding random noise, and retrained the network with the added data. The perturbation was done to better represent the parameter space close to optimal solutions. We repeated this procedure four times, until the NSEs of the parameters, obtained from the CPU-based VIC model, converged. At a 1/8^{2} sampling density (sampling one gridcell from each 8 × 8 patch), this results in fewer overall forward runs than a 1/8-degree NLDAS-2 simulation. Also note that this effort is needed similarly for both dPL and SCE-UA: had we not used the surrogate model, SCE-UA would also have needed to employ the O(10^{2})-times more expensive CPU-based VIC model. We evaluated the accuracy of the surrogate model: the median correlations between the VIC and surrogate simulations were 0.91 and 0.92 for soil moisture and ET, respectively (Supplementary Fig. S2). When we connected the trained surrogate model to the parameter estimation network, the weights of the surrogate model were frozen and prevented from updating by backpropagation, but gradient information could still pass through. This was implemented in the PyTorch deep learning framework^{36}.
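In PyTorch, freezing the surrogate while letting gradients pass through it can be sketched as follows. The two small networks and all tensor shapes are placeholders, not the actual surrogate or parameter-estimation network.

```python
import torch
import torch.nn as nn

# Placeholder networks (hypothetical sizes): the surrogate maps
# (forcings + estimated parameters) -> simulated output; the parameter
# network maps raw inputs A -> parameters theta.
surrogate = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))
param_net = nn.Sequential(nn.Linear(5, 16), nn.ReLU(), nn.Linear(16, 3))

# Freeze the surrogate: its weights stop updating during backpropagation,
# but gradients still flow *through* it to the parameter network.
for p in surrogate.parameters():
    p.requires_grad = False

attrs = torch.randn(4, 5)     # raw inputs to the parameter network (4 sites)
forcing = torch.randn(4, 5)   # inputs to the surrogate
obs = torch.randn(4, 1)       # observed targets z

theta = param_net(attrs)                          # estimated parameters
sim = surrogate(torch.cat([forcing, theta], 1))   # frozen surrogate forward
loss = ((sim - obs) ** 2).mean()                  # combined loss over sites
loss.backward()
```

After `backward()`, only the parameter network has accumulated gradients, which is the behavior described above.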

### The long short-term memory network

The long short-term memory network (LSTM) was originally developed in the artificial intelligence field for learning sequential data, but has recently become a popular choice for hydrologic time series data^{26}. As compared to a vanilla recurrent neural network with only one state, LSTM has two states (cell state, hidden state) and three gates (input gate, forget gate, and output gate). The cell state enables long-term memory, and the gates are trained to determine which information to carry across time steps and which information to forget. These units were collectively designed to address the notorious DL issue of the vanishing gradient, where the accumulated gradients would decrease exponentially along time steps and eventually be too small to allow effective learning^{48}. Given inputs *I*, our LSTM can be written as the following:

$$\text{Input transformation:}\quad x^{t}=\mathrm{ReLU}\left(W_{I}I^{t}+b_{I}\right)$$

(4)

$$\text{Input node:}\quad g^{t}=\tanh\left(\mathcal{D}\left(W_{gx}x^{t}\right)+\mathcal{D}\left(W_{gh}h^{t-1}\right)+b_{g}\right)$$

(5)

$$\text{Input gate:}\quad i^{t}=\sigma\left(\mathcal{D}\left(W_{ix}x^{t}\right)+\mathcal{D}\left(W_{ih}h^{t-1}\right)+b_{i}\right)$$

(6)

$$\text{Forget gate:}\quad f^{t}=\sigma\left(\mathcal{D}\left(W_{fx}x^{t}\right)+\mathcal{D}\left(W_{fh}h^{t-1}\right)+b_{f}\right)$$

(7)

$$\text{Output gate:}\quad o^{t}=\sigma\left(\mathcal{D}\left(W_{ox}x^{t}\right)+\mathcal{D}\left(W_{oh}h^{t-1}\right)+b_{o}\right)$$

(8)

$$\text{Cell state:}\quad s^{t}=g^{t}\odot i^{t}+s^{t-1}\odot f^{t}$$

(9)

$$\text{Hidden state:}\quad h^{t}=\tanh\left(s^{t}\right)\odot o^{t}$$

(10)

$$\text{Output:}\quad y^{t}=W_{hy}h^{t}+b_{y}$$

(11)

where *W* and *b* are the network weights and bias parameters, respectively, and $\mathcal{D}$ is the dropout operator, which randomly sets some of the connections to zero. The LSTM network and our whole workflow^{31} were implemented in PyTorch^{36}, an open-source machine learning framework.
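The gate equations above can be rendered as a single PyTorch module. This is a minimal illustration; fusing the four gate projections into one linear layer and the exact placement of the dropout operator are implementation choices assumed here, not taken from the research code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LSTMCellWithDropout(nn.Module):
    """One step of Eqs. (4)-(10). The four gate projections are fused into
    single linear layers for brevity; biases b_g..b_o live inside self.Wx."""
    def __init__(self, n_in, n_hid, p_drop=0.5):
        super().__init__()
        self.proj = nn.Linear(n_in, n_hid)                 # Eq. 4 (W_I, b_I)
        self.Wx = nn.Linear(n_hid, 4 * n_hid)              # W_gx, W_ix, W_fx, W_ox
        self.Wh = nn.Linear(n_hid, 4 * n_hid, bias=False)  # W_gh .. W_oh
        self.drop = nn.Dropout(p_drop)                     # the operator D

    def forward(self, I_t, h_prev, s_prev):
        x_t = F.relu(self.proj(I_t))                       # Eq. 4
        z = self.drop(self.Wx(x_t)) + self.drop(self.Wh(h_prev))
        g, i, f, o = z.chunk(4, dim=-1)
        g = torch.tanh(g)                                  # Eq. 5
        i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)  # Eqs. 6-8
        s_t = g * i + s_prev * f                           # Eq. 9
        h_t = torch.tanh(s_t) * o                          # Eq. 10
        return h_t, s_t
```

The cell state `s_t` is the additive pathway that preserves gradients across many time steps, which is how LSTM avoids the vanishing-gradient problem mentioned above.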

Here we do not use LSTM to predict the target variable. Rather, LSTM is (optionally) used to map from time series information to the parameters in our *g*_{Z} network, as described below.

### The parameter estimation network

We present two versions of the dPL framework. The first version allows us to train a parameter estimation network over selected training locations *I*_{train} where some ancillary information **A** (potentially including but not limited to attributes in *φ*^{i} used in the model) is available, for training period *T*_{train} (illustrated in Fig. 1b):

$$\hat{\boldsymbol{\theta}}^{i}=g_{A}\left(\mathbf{A}^{i}\right)\quad\text{for all }i\in I_{\mathrm{train}}$$

(12a)

$$\hat{g}_{A}(\cdot)=\arg\min_{g_{A}(\cdot)}\sum_{t\in T_{\mathrm{train}},\,i\in I_{\mathrm{train}}}\left\Vert h\left(f\left(\mathbf{x}_{t}^{i},\boldsymbol{\varphi}^{i},g_{A}\left(\mathbf{A}^{i}\right)\right)\right)-\mathbf{z}_{t}^{i}\right\Vert^{2}$$

(12b)

Essentially, we train a network (*g*_{A}) mapping from raw data (**A**) to parameters (**θ**) such that the PBM output (*f*) using **θ** best matches the observed target (**z**). We are not training *g*_{A} to predict the observations; rather, we train it to best help the PBM achieve its goal. The difference between Eq. 12 and Eq. 3 is that the loss function now combines the sum of squared differences over all sites at once.

The second version is applicable where some observations $\{\mathbf{z}_{t}^{i}\}_{t\in T}$ are also available as inputs at the test locations:

$$\hat{\boldsymbol{\theta}}^{i}=g_{Z}\left(\left\{\mathbf{x}_{t}^{i}\right\}_{t\in T},\mathbf{A}^{\prime i},\left\{\mathbf{z}_{t}^{i}\right\}_{t\in T}\right)\quad\text{for all }i\in I_{\mathrm{train}}$$

(13a)

$$\hat{g}_{Z}(\cdot)=\arg\min_{g_{Z}(\cdot)}\sum_{t\in T_{\mathrm{train}},\,i\in I_{\mathrm{train}}}\left\Vert h\left(f\left(\mathbf{x}_{t}^{i},\boldsymbol{\varphi}^{i},g_{Z}\left(\left\{\mathbf{x}_{t}^{i}\right\}_{t\in T},\mathbf{A}^{\prime i},\left\{\mathbf{z}_{t}^{i}\right\}_{t\in T}\right)\right)\right)-\mathbf{z}_{t}^{i}\right\Vert^{2}$$

(13b)

Essentially, we train a network (*g*_{Z}) that maps from attributes (**A**′), historical forcings (**x**), and historical observations ($\{\mathbf{z}_{t}^{i}\}_{t\in T}$) to a suitable parameter set (**θ**) with which the PBM output best matches the observed target (**z**) across all sites in the domain. Ancillary attributes **A**′ may be the same as or different from **A** used in *g*_{A}, and in the extreme case may be empty. Succinctly, the two versions can be written as two mappings, *g*_{A}: **A** **→** **θ** and *g*_{Z}: (**A**′, **x**, **z**) **→** **θ**. *g*_{Z} can accept time series data as inputs, and here we chose LSTM as the network structure for this unit. There is no circular logic or information leak because the historical observations ($\{\mathbf{z}_{t}^{i}\}_{t\in T}$) are from a different period (*T*) than the main training period (*T*_{train}). In practice, this distinction may not be crucial, because the PBM acts as an information barrier such that only values suitable as parameters (**θ**) can produce a reasonable loss. Because LSTM outputs a time series, the parameters were extracted only at the last time step. For *g*_{A}, only static attributes were employed, so the network structure amounts to a multilayer perceptron. After some investigation of training and test metrics, we set the hidden size of *g* to be the same as for the surrogate model.
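The last-time-step extraction for *g*_{Z} can be sketched as a small PyTorch module. Layer sizes are illustrative, and bounding the parameters with a sigmoid is an assumption made for the sketch, not a statement about the research code.

```python
import torch
import torch.nn as nn

class GZ(nn.Module):
    """Sketch of g_Z: an LSTM reads the concatenated time series of forcings
    x, attributes A', and historical observations z; the parameter set theta
    is taken from the LSTM output at the last time step only."""
    def __init__(self, n_in, n_hid, n_params):
        super().__init__()
        self.lstm = nn.LSTM(n_in, n_hid, batch_first=True)
        self.head = nn.Linear(n_hid, n_params)

    def forward(self, series):          # series: [batch, time, n_in]
        out, _ = self.lstm(series)
        # Last time step -> one static parameter vector per site,
        # squashed to (0, 1) here purely for illustration.
        return torch.sigmoid(self.head(out[:, -1, :]))
```

Replacing the LSTM with a plain MLP over static attributes gives the *g*_{A} variant described above.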

The whole network is trained using gradient descent, a first-order optimization scheme. Second-order schemes such as Levenberg–Marquardt often have large computational demands and are thus rarely used in modern DL^{49}. To allow gradient accumulation and efficient gradient-based optimization, and to further reduce the computational cost, we can either implement the PBM directly in a differentiable form, as described in the global PUB case below, or first train a DL-based surrogate model $f^{\prime}(\cdot)\simeq f(\cdot)$ and use it in the loss function in place of *f*(·),

$$\hat{g}(\cdot)=\arg\min_{g(\cdot)}\sum_{t\in T_{\mathrm{train}},\,i\in I_{\mathrm{train}}}\left\Vert h\left(f^{\prime}\left(\mathbf{x}_{t}^{i},\boldsymbol{\varphi}^{i},g(\cdot)\right)\right)-\mathbf{z}_{t}^{i}\right\Vert^{2}$$

(14)

where *g*(·) generically refers to either *g*_{A} or *g*_{Z} with their corresponding inputs. *g*_{A} can be applied wherever the ancillary inputs **A** are available, while *g*_{Z} can be applied, without additional training, over areas where forcings and observed responses (**x**, **z**) are also available:

$$\hat{\boldsymbol{\theta}}^{i}=g_{Z}\left(\left\{\mathbf{x}_{t}^{i}\right\}_{t\in T},\boldsymbol{\varphi}^{i},\left\{\mathbf{z}_{t}^{i}\right\}_{t\in T}\right)\quad\text{or}\quad\hat{\boldsymbol{\theta}}^{i}=g_{A}\left(\boldsymbol{\varphi}^{i}\right)\quad\text{for any }i\text{ and any reasonable }T$$

(15)

We tested both *g*_{A} and *g*_{Z}, which work with and without forcing-observation (**x**–**z**) pairs among the inputs, respectively. Since SMAP observations have an irregular revisit schedule of 2–3 days and neural networks cannot accept NaN inputs, we had to fill in the gaps, but simple interpolations do not consider the effects of rainfall. Here we used the near-real-time forecast method that we developed earlier^{30}. Essentially, this forecast method uses forcings and integrates recently available observations to forecast the observed variable for future time steps, achieving very high forecast accuracy (ubRMSE < 0.02). When recent observations are missing, their places are taken by the network's own predictions, thus introducing no new information. Using this method, we generated continuous SMAP observations as **z** for network *g*_{Z}.

As with most DL work, the hyperparameters of dPL needed to be tuned. We manually tuned the hidden sizes and batch size using one year of data (2015-04-01 to 2016-03-31), mostly at the 1/16^{2} sampling density (sampling one gridcell from each 16 × 16 patch). Higher sampling densities produced larger training datasets, which could be better handled by larger hidden sizes: for sampling densities of 1/16^{2}, 1/8^{2}, and 1/4^{2} we used hidden sizes of 64, 256, and 1280, respectively. We used a batch size of 300 instances, and the length of each training instance was 240 days. Given 1 or 2 years' worth of training data, the code randomly selected gridcells and 240-day periods within the training dataset to form a minibatch for training (in contrast, SCE-UA uses all available years of training data at once, as is standard practice). The network's weights were updated after computing the combined loss of each minibatch. We employed the AdaDelta optimization algorithm^{50}, for which the coefficient used to scale delta before applying it to the parameters was 0.5, the coefficient used for computing a running average of squared gradients was 0.95, and the weight decay was 0.00001. The dropout rate for the *g*_{Z} or *g*_{A} component of the network (Fig. 1) was 0.5. We normalized inputs and outputs by their CONUS-wide standard deviations.
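The minibatch sampling and optimizer settings described above can be sketched as follows. The mapping of the quoted AdaDelta coefficients onto `torch.optim.Adadelta` arguments (`lr` for the delta-scaling coefficient, `rho` for the squared-gradient running average, `weight_decay`) is our reading of that API; the placeholder model is hypothetical.

```python
import numpy as np
import torch

def sample_minibatch(n_cells, n_days, batch=300, length=240, rng=np.random):
    """Randomly pick (gridcell, start-day) pairs so that each training
    instance is a 240-day window inside the training period."""
    cells = rng.randint(0, n_cells, size=batch)
    starts = rng.randint(0, n_days - length + 1, size=batch)
    return cells, starts

# AdaDelta with the settings quoted in the text.
model = torch.nn.Linear(4, 2)  # placeholder for g_A / g_Z + surrogate
opt = torch.optim.Adadelta(model.parameters(),
                           lr=0.5, rho=0.95, weight_decay=1e-5)
```

Each sampled (cell, start) pair would index a 240-day slice of forcings and observations to form one instance of the minibatch.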

Based on Troy et al.^{44}, the calibrated parameters include the variable infiltration curve parameter (INFILT), maximum base flow velocity (Dsmax), fraction of maximum base flow velocity where nonlinear base flow begins (Ds), fraction of maximum soil moisture content above which nonlinear baseflow occurs (Ws), and variation of saturated hydraulic conductivity with soil moisture (EXPT). INFILT determines the shape of the Variable Infiltration Capacity (VIC) curve and controls the amount of available infiltration capacity. The VIC formula involving INFILT is:

$$i=i_{m}\left[1-\left(1-a_{f}\right)^{\frac{1}{\mathrm{INFILT}}}\right]$$

(16)

where *i* is infiltration capacity, *i*_{m} is maximum infiltration capacity, and *a*_{f} is the fraction of saturated area. With other parameters and *a*_{f} being the same, larger INFILT leads to a reduced infiltration rate and thus higher runoff.
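Eq. (16) and the stated effect of INFILT can be checked directly; the numeric values below are arbitrary illustrations.

```python
def infiltration_capacity(a_f, i_m, infilt):
    """Eq. (16): infiltration capacity i as a function of the saturated-area
    fraction a_f, maximum capacity i_m, and the INFILT shape parameter."""
    return i_m * (1.0 - (1.0 - a_f) ** (1.0 / infilt))
```

For a fixed *a*_{f} < 1, increasing INFILT shrinks the exponent 1/INFILT and therefore lowers *i*, consistent with the higher-runoff interpretation above.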

For all models, we collected atmospheric forcing data from the North American Land Data Assimilation System phase-II dataset (NLDAS-2)^{51}. A number of static physiographic attribute inputs were added to provide additional context for the module to better understand the input-response relationships and correctly estimate the parameters. These attributes included bulk density, soil water holding capacity, and sand, silt, and clay percentages from the International Soil Reference and Information Centre – World Inventory of Soil Emission Potentials (ISRIC-WISE) database^{52}. Other inputs were SMAP product flags indicating mountainous terrain, land cover classes, urban areas, and fraction of land surface with water (time averaged).

### SMAP data

We used the SMAP enhanced level-3 9-km resolution surface soil moisture product (see acknowledgements for access). SMAP level-3 data has an irregular revisit time of 2–3 days, which means there are irregular and densely-distributed gaps in the data. As discussed earlier, we employed the LSTM-based 1-day-ahead forecast scheme to fill the gaps^{30}. This scheme utilizes meteorological data and the most-recently-available SMAP observations to provide a near-real-time soil moisture forecast. Essentially, this scheme is a deep-learning version of data assimilation, and has achieved a very low CONUS-scale unbiased RMSE of 0.027, which revised our understanding of the random component of the SMAP data. Using this scheme, we were able to fill the gaps between SMAP observations, and provide seamless data.

To resolve the intrinsic difference between VIC-simulated soil moisture and SMAP, we followed the data assimilation literature^{53} and used dPL to estimate, along with other parameters, a linear function with a scaling term (*a*) and a bias term (*b*) between the two:

$$y_{\mathrm{SMAP}}=a\,y_{\mathrm{sur}}+b$$

(17)

It is widely known that bias correction needs to be applied before observations can be assimilated through data assimilation^{54}. *a* and *b* are estimated along with the rest of the parameters by dPL and SCE-UA.
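Purely to illustrate Eq. (17): in the paper, *a* and *b* are estimated jointly with **θ** by dPL or SCE-UA, but on synthetic, noise-free data the same linear relation could be recovered offline by ordinary least squares. The numbers below are hypothetical.

```python
import numpy as np

# Hypothetical collinear data standing in for surrogate output and SMAP.
y_sur = np.array([0.10, 0.20, 0.30, 0.40])
y_smap = 1.5 * y_sur + 0.02

# Degree-1 least-squares fit recovers the scaling a and bias b of Eq. (17).
a, b = np.polyfit(y_sur, y_smap, 1)
```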

### Satellite-based estimates of ET

MOD16A2^{55} is an 8-day composite ET product at 500-meter resolution, which is based on the Penman-Monteith equation. With this algorithm, MODIS 8-day fraction of photosynthetically active radiation is used as the fraction of vegetation cover to allocate surface net radiation between soil and vegetation; MODIS 8-day albedo and daily meteorological reanalysis data are used to calculate surface net radiation and soil heat flux; and MODIS 8-day leaf area index (LAI) and daily meteorological reanalysis data are used to estimate surface stomatal conductance, aerodynamic resistance, wet canopy, soil heat flux, and other environmental variables. MODIS land cover is used to specify the biome type for each pixel to retrieve biome-dependent constant parameters.

We did not use MOD16A2 as a learning target; the purpose here was to validate which calibration strategy led to better descriptions of overall model dynamics. MOD16A2 is not perfect, but since these are completely independent observations from those by SMAP, better agreement should still indicate better modeling of physics overall.

### Shuffled Complex Evolution for comparison

To benchmark the parameter estimation module in dPL, the SCE-UA method^{11}, introduced three decades ago but still relevant today^{56}, was also implemented as a reference method. We chose SCE-UA for comparison because it is well established and widely applied. The algorithm ranks a population of parameter sets by the objective function and partitions it into multiple subpopulations called complexes. In one iteration of SCE-UA, the complexes are evolved individually through a number of competitive evolution steps, in which reflection, contraction, and random trials are attempted, before they are shuffled and redivided into new complexes for the next iteration.

SCE-UA jobs used data from the whole training period (1 or 2 years). One SCE-UA iteration contains many VIC forward runs. Because SCE-UA uses discrete iterations involving uneven numbers of forward runs, it was not meaningful to compare the best objective function after a fixed number of runs. Instead, across different gridcells, we collected the lowest objective function (RMSE) achieved from the beginning through the end of each iteration, and took the average number of runs at the end of each iteration as the number of runs to report. We tuned the number of complexes of SCE-UA and set it to seven.

### CAMELS streamflow test

In earlier work, Mizukami et al.^{21} calibrated the VIC model using the multiscale parameter regionalization (MPR) scheme with data from 531 basins in the Catchment Attributes and Meteorology for Large-Sample Studies (CAMELS) dataset over the contiguous US (see basin locations in Supplementary Fig. S4a). They limited their scope to basins <2000 km^{2} and trained and tested on the same basins, making the experiment a test of the model's ability to generalize temporally. The calibration period of MPR was from 1 October 1999 to 30 September 2008, and the validation period was from 1 October 1989 to 30 September 1999. They calibrated transfer functions for 8 default VIC parameters and added two additional parameters (shape and timescale) for routing, which accounts for the time it takes water to travel from the catchment to the outlet in a mass-conservative manner:

$$\gamma\left(t;a,\tau\right)=\frac{1}{\Gamma(a)\tau^{a}}t^{a-1}e^{-\frac{t}{\tau}}$$

(18)

where *t* is time [T], *a* is a shape parameter [–] (*a* > 0), Γ(·) is the gamma function, and *τ* is a timescale parameter [T]^{57}.

Convolution of the gamma distribution with the runoff depth series is used to compute the fraction of runoff at the current time, which is discharged to its corresponding river segment at each future time as follows:

$$q(t)=\int_{0}^{t_{\max}}\gamma\left(s;a,\tau\right)R\left(t-s\right)ds$$

(19)

where *q*(*t*) is delayed runoff or discharge [L^{3}T^{-1}] at time step *t* [T], *R* is the model-simulated runoff from the basin [L^{3}T^{-1}], and *t*_{max} is the maximum time length for the gamma distribution [T]. To compare with Mizukami et al.^{21}, we used the same dataset, basins, and training and test periods. They reported a median NSE of 0.30 for VIC.
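A discretized sketch of Eqs. (18)–(19): the gamma distribution is evaluated on a daily grid and the weights are normalized so that routing conserves mass (the normalization and daily step are our choices for the sketch).

```python
import numpy as np
from math import gamma as gamma_fn

def gamma_weights(a, tau, t_max, dt=1.0):
    """Discretize the gamma distribution of Eq. (18) on (0, t_max] and
    normalize so that the unit hydrograph sums to 1 (mass conservation)."""
    t = np.arange(dt, t_max + dt, dt)
    w = t ** (a - 1.0) * np.exp(-t / tau) / (gamma_fn(a) * tau ** a)
    return w / w.sum()

def route(runoff, a, tau, t_max):
    """Eq. (19): convolve the runoff-depth series with the gamma unit
    hydrograph to obtain the delayed discharge."""
    w = gamma_weights(a, tau, t_max)
    return np.convolve(runoff, w)[: len(runoff)]
```

Routing a unit impulse of runoff spreads it over subsequent days while preserving the total volume.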

### Global PUB test

Beck20 presented a global-scale hydrologic dataset containing forcings, static attributes, and daily streamflow data from 4229 headwater basins across the world. They used a state-of-the-art regionalization scheme for prediction in ungauged basins (PUB), in which no data from test basins were used in the training dataset, thus testing the scheme's capability to generalize spatially. Eight attributes were used for the transfer functions in Beck20: humidity, mean annual precipitation, mean annual potential evaporation, mean normalized difference vegetation index, fraction of open water, slope, and percentages of sand and clay. They trained linear parameter transfer functions from raw predictors to 12 free parameters of a simple hydrologic model, HBV. In all, the 4229 basins were divided into three climate groups: (i) tropical, (ii) arid and temperate, and (iii) cold and polar. Transfer functions were trained for each of these groups. They ran cross validation within each group; e.g., for the arid and temperate group, they further divided the data into 10 random folds, trained the transfer functions on 9 of the 10 folds, and tested them on the 10th holdout fold, then rotated through the other folds as holdout data and reported the average metrics from these holdout basins. Because the holdout basins were randomly selected, there were always neighboring basins included in the training set. However, the sparser the training data, the further away the holdout basins were, on average, from the training basins. Hence, we could reduce the number of training basins to examine the impact of less training data on the model's ability to generalize in space.

Since the primary purpose of this part of the work was to compare dPL to Beck20's regionalization scheme, we kept the setup as similar as possible and focused on the comparison with their temperate catchment group (see their locations in Supplementary Fig. S4b), for which they reported a median Kling-Gupta model efficiency coefficient (KGE, see below) of 0.48. We similarly ran 10-fold cross validation by training *g*_{A} (here **A** includes atmospheric forcings) on 9 randomly chosen groups of basins and testing it on the 10th, then rotating the holdout group. The training and testing periods were both 2000–2016, the same as used by Beck20. We compared using 8 raw attributes (to be comparable to Beck20) as well as using all 27 available attributes to demonstrate the extensibility of the dPL scheme. To show how the test results scaled with data, we ran additional experiments in which the training basin density was systematically reduced while the testing basins were unchanged. We could afford to do this with dPL using a highly generic procedure; unfortunately, it was not practical to do the same for Beck20's regionalization scheme, which was too computationally expensive and labor-intensive. Beck20 calibrated the model on regular gridcells while dPL was applied at the basin scale. The chosen basins are small in area, so this difference should have marginal impact. In fact, comparison to past research [Beck 2016] suggests this discretization gives better metrics than the basin-based approach, meaning that this difference, if anything, is biased against dPL.

### Evaluation metrics

Four statistical metrics are commonly used to measure the performance of model simulations: bias, correlation (Corr), unbiased root-mean-square error (ubRMSE), and the Nash-Sutcliffe model efficiency coefficient (NSE). Bias is the mean difference between modeled and observed values. ubRMSE is the RMSE calculated after the bias (the systematic model error) is removed, and measures the random component of the error. Corr assesses whether a model captures the seasonality of the observations, but does not account for bias. NSE also considers bias; it is 1 for a perfect model but can be negative for poor models. In the formulas below, $\bar{y}$ is the average modeled (predicted) value over all pixels, and $\bar{y}^{\ast}$ is the average observed value over all pixels.

$$\mathrm{Bias}=\frac{\sum_{i=1}^{n}\left(y_{i}-y_{i}^{\ast}\right)}{n}$$

(20)

$$\mathrm{ubRMSE}=\sqrt{\frac{\sum_{i=1}^{n}\left[\left(y_{i}-\bar{y}\right)-\left(y_{i}^{\ast}-\bar{y}^{\ast}\right)\right]^{2}}{n}}$$

(21)

$$\mathrm{Corr}=\frac{\sum_{i=1}^{n}\left[\left(y_{i}-\bar{y}\right)\left(y_{i}^{\ast}-\bar{y}^{\ast}\right)\right]}{\sqrt{\sum_{i=1}^{n}\left(y_{i}-\bar{y}\right)^{2}}\sqrt{\sum_{i=1}^{n}\left(y_{i}^{\ast}-\bar{y}^{\ast}\right)^{2}}}$$

(22)

$$\mathrm{NSE}=1-\frac{\sum_{i=1}^{n}\left(y_{i}^{\ast}-y_{i}\right)^{2}}{\sum_{i=1}^{n}\left(y_{i}^{\ast}-\bar{y}^{\ast}\right)^{2}}$$

(23)
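Eqs. (20)–(23) translate directly into NumPy:

```python
import numpy as np

def bias(y, y_obs):
    """Eq. (20): mean difference between modeled and observed values."""
    return np.mean(y - y_obs)

def ubrmse(y, y_obs):
    """Eq. (21): RMSE of the anomalies, i.e., with the bias removed."""
    return np.sqrt(np.mean(((y - y.mean()) - (y_obs - y_obs.mean())) ** 2))

def corr(y, y_obs):
    """Eq. (22): Pearson correlation coefficient."""
    return np.corrcoef(y, y_obs)[0, 1]

def nse(y, y_obs):
    """Eq. (23): Nash-Sutcliffe model efficiency coefficient."""
    return 1.0 - np.sum((y_obs - y) ** 2) / np.sum((y_obs - y_obs.mean()) ** 2)
```

Note that a simulation with a constant offset from the observations has nonzero bias but zero ubRMSE, which is exactly the decomposition described above.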

All metrics were reported for the test period. When we evaluated *g*_{Z} on the training locations, historical observations during the training period (({left{{{{{{{bf{z}}}}}}}_{t}^{i}right}}_{tin T})) were included in the inputs. Then we used the parameters calibrated from the training period to run the model in the test period to get the reported metrics.

For the spatial generalization tests, we sampled one gridcell out of each 8×8 patch (sampling density of 1/8^{2} or s8), and we ran the trained dPL model on another gridcell in the patch (three rows to the north and three columns to the east of the training gridcell). For SCE-UA, we took the parameter sets from the nearest trained neighbor. For *g*_{A}, we sent in static attributes from the test neighbor to infer the parameters, which were evaluated over the test period. For *g*_{Z}, we sent in static attributes as well as forcings and observed soil moisture from the test neighbor, but from the training period. All models were evaluated over the test period.

To be comparable to Beck20, we also calculated the Kling-Gupta model efficiency coefficient (KGE), which is similar in magnitude to NSE:

$$\mathrm{KGE}=1-\sqrt{\left(r-1\right)^{2}+\left(\beta-1\right)^{2}+\left(\gamma-1\right)^{2}}$$

(24)

$$\beta=\frac{\mu_{s}}{\mu_{o}}\quad\mathrm{and}\quad\gamma=\frac{\sigma_{s}/\mu_{s}}{\sigma_{o}/\mu_{o}}$$

(25)

where *r* is the correlation coefficient between simulations and observations, *β* and *γ* are respectively the bias and variability ratio, *μ* and *σ* are the mean and standard deviation of runoff, and indices *s* and *o* represent simulations and observations.
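Eqs. (24)–(25) as a NumPy function:

```python
import numpy as np

def kge(sim, obs):
    """Eqs. (24)-(25): Kling-Gupta model efficiency coefficient."""
    r = np.corrcoef(sim, obs)[0, 1]              # correlation
    beta = sim.mean() / obs.mean()               # bias ratio
    gamma = (sim.std() / sim.mean()) / (obs.std() / obs.mean())  # variability ratio
    return 1.0 - np.sqrt((r - 1) ** 2 + (beta - 1) ** 2 + (gamma - 1) ** 2)
```

A perfect simulation gives KGE = 1; note that a simulation scaled by a constant keeps *r* = 1 and *γ* = 1 but is penalized through *β*.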
