## From calibration to parameter learning: Harnessing the scaling effects of big data in geoscientific modeling

General description of a geoscientific model and parameter calibrationA model for both non-dynamical and dynamical systems can be generically written for site i as$${left{{{{{{{bf{y}}}}}}}_{t}^{i}right}}_{tin T}=fleft({left{{{{{{{bf{x}}}}}}}_{t}^{i}right}}_{tin T},{{{{{{boldsymbol{varphi }}}}}}}^{{{{{{boldsymbol{i}}}}}}},{{{{{{boldsymbol{theta }}}}}}}^{{{{{{boldsymbol{i}}}}}}}right)$$

(1)

where output physical predictions (({{{{{{bf{y}}}}}}}_{t}^{i}={left[{y}_{1,t}^{i},{y}_{2,t}^{i},cdots right]}^{T}), with the first subscript denoting variable type) vary with time (t) and location (i), and are functions of time- and location-specific inputs (({{{{{{bf{x}}}}}}}_{t}^{i}={left[{x}_{1,t}^{i},{x}_{2,t}^{i},cdots right]}^{T})), location-specific observable attributes (({{{{{{boldsymbol{varphi }}}}}}}^{i}={left[{varphi }_{1}^{i}{{{{{boldsymbol{,}}}}}}{varphi }_{2}^{i}{{{{{boldsymbol{,}}}}}}{{{{{boldsymbol{cdots }}}}}}right]}^{T})), and location-specific unobserved parameters that need to be separately determined (({{{{{{boldsymbol{theta }}}}}}}^{i}={left[{theta }_{1}^{i}{{{{{boldsymbol{,}}}}}}{theta }_{2}^{i}{{{{{boldsymbol{,}}}}}}{{{{{boldsymbol{cdots }}}}}}right]}^{T})). θ may be unobservable, or it may be too expensive or difficult to observe at the needed accuracy, resolution, or coverage. This formulation also applies to dynamical systems if ({{{{{{bf{x}}}}}}}_{{{{{{boldsymbol{t}}}}}}}^{{{{{{boldsymbol{i}}}}}}}) includes previous system states ({{{{{{bf{y}}}}}}}_{t-1}^{i}) (i.e. ({{{{{{bf{y}}}}}}}_{t-1}^{i}{{{{{boldsymbol{subset }}}}}}{{{{{{bf{x}}}}}}}_{t}^{i})), and the rest of the inputs are independent (e.g. meteorological) forcing data. In a non-dynamical system, ({{{{{{bf{x}}}}}}}_{t}^{i}) is independent of ({{{{{{bf{y}}}}}}}_{t-1}^{i}).Given some observations$${{{{{{bf{z}}}}}}}_{t}^{i}=hleft({{{{{{bf{y}}}}}}}_{{t}}^{{i}}right)+{{{{{{boldsymbol{varepsilon }}}}}}}_{t}^{i}$$

(2)

where h(·) relates model outputs to observations and ({{{{{{boldsymbol{varepsilon }}}}}}}_{{{boldsymbol{t}}}}^{{{boldsymbol{i}}}}={big[{varepsilon }_{1,t}^{i},{varepsilon }_{2,t}^{i},cdots big]}^{T}) is the error between the observations (Big({{{{{{boldsymbol{z}}}}}}}_{{{{{{boldsymbol{t}}}}}}}^{{{{{{boldsymbol{i}}}}}}} = {big[{z}_{1,t}^{i},{z}_{2,t}^{i},cdots big]}^{T}Big)) and the model predictions (({{{{{{bf{y}}}}}}}_{{{{{{boldsymbol{t}}}}}}}^{{{{{{boldsymbol{i}}}}}}})), we adjust the model parameters so that the predictions best match the observations. This is traditionally done individually for each location (here generically referring to a gridcell, basin, site, river reach, agricultural plot, etc., depending on the model):$${hat{theta }}^{i}={{arg }},{{{min }}}_{{{{{{{boldsymbol{theta }}}}}}}^{i}}mathop{sum }limits_{tin T}{Vert {{{{{{boldsymbol{varepsilon }}}}}}}_{t}^{i}Vert }^{2}={{arg }},{{{min }}}_{{{{{{{boldsymbol{theta }}}}}}}^{i}}mathop{sum}limits_{tin T}{Vert h(f({{{x}_{t}^{i}}}_{tin T},{varphi }^{i},{theta }^{i}))-{z}_{t}^{i}Vert }^{2}$$

(3)

where i ∈ I and where (I=left{{1,2},ldots ,{N}_{I}right}). Note that the superscript i suggests that this optimization is done for each site independently.The process-based hydrologic model and its surrogateThe Variable Infiltration Capacity (VIC) hydrologic model has been widely used for simulating the water and energy exchanges between the land surface and atmosphere, along with related applications in climate, water resources (e.g., flood, drought, hydropower), agriculture, and others. The model simulates evapotranspiration, runoff, soil moisture, and baseflow based on conceptualized bucket formulations. Inputs to the model include daily meteorological forcings, non-meteorological data, and the parameters to be determined. Meteorological forcing data include time series of precipitation, air temperature, wind speed, atmospheric pressure, vapor pressure, and longwave and shortwave radiation. More details about VIC can be found in Liang et al.39.LSTM was trained to reproduce the behavior of VIC as closely as possible while also allowing for gradient tracking. In theory, if a hydrologic model can be written into a machine learning platform (as in our HBV case), this step is not needed, but training a surrogate model is more convenient when the model is complex. To ensure the surrogate model had high fidelity in the parameter space where the search algorithms want to explore, we iterated the training procedure multiple times. We first trained an LSTM surrogate for VIC using the forcings, attributes, and parameters from NLDAS-2 as inputs, and the VIC-simulated surface soil moisture (variable name: SOILM_lev1) and evapotranspiration (ET, variable name: EVP) as the targets of emulation. Then, as the search algorithms (SCE-UA or dPL) went near an optimum, we took the calibrated parameter sets, made perturbations of them by adding random noise to these parameters, and retrained the network with added data. The perturbation was done to better represent the parameter space close to optimal solutions. We repeated this procedure four times so that the NSEs of the parameters, obtained from the CPU-based VIC model, converged. At 1/82 sampling density (sampling one gridcell from each 8 × 8 patch), this results in fewer overall forward runs than a 1/8-degree NLDAS-2 simulation. Also note that this effort is needed similarly for both dPL and SCE-UA. If we did not use the surrogate model, SCE-UA would also have needed to employ the O(102) more expensive CPU-based VIC model. We evaluated the accuracy of the surrogate model, and the median correlations between VIC and the surrogate simulation were 0.91 and 0.92 for soil moisture and ET, respectively (Supplementary Fig. S2). When we connected the trained surrogate model to the parameter estimation network, the weights of the surrogate model were frozen and prevented from updating by backpropagation, but the gradient information could pass through. This was implemented in the PyTorch deep learning framework36.The long short-term memory networkThe long short-term memory network (LSTM) was originally developed in the artificial intelligence field for learning sequential data, but has recently become a popular choice for hydrologic time series data26. As compared to a vanilla recurrent neural network with only one state, LSTM has two states (cell state, hidden state) and three gates (input gate, forget gate, and output gate). The cell state enables long-term memory, and the gates are trained to determine which information to carry across time steps and which information to forget. These units were collectively designed to address the notorious DL issue of the vanishing gradient, where the accumulated gradients would decrease exponentially along time steps and eventually be too small to allow effective learning48. Given inputs I, our LSTM can be written as the following:$${{{{{mathrm{Input}}}}}}; {{{{{mathrm{transformation:}}}}}}quad {x}^{t}={{ReLU}}({W}_{I}{I}^{t}+{b}_{I})$$

(4)

$${{{{{mathrm{Input}}}}}}; {{{{{mathrm{node:}}}}}}quad {g}^{t}={tanh }({{{{{mathscr{D}}}}}}({W}_{{gx}}{x}^{t})+{{{{{mathscr{D}}}}}}({W}_{{gh}}{h}^{t-1})+{b}_{g})$$

(5)

$${{{{{mathrm{Input}}}}}}; {{{{{mathrm{gate:}}}}}}quad {i}^{t}=sigma ({{{{{mathscr{D}}}}}}({W}_{{ix}}{x}^{t})+{{{{{mathscr{D}}}}}}({W}_{{ih}}{h}^{t-1})+{b}_{i})$$

(6)

$${{{{{mathrm{Forget}}}}}}; {{{{{mathrm{gate:}}}}}}quad {f}^{t}=sigma ({{{{{mathscr{D}}}}}}({W}_{{fx}}{x}^{t})+{{{{{mathscr{D}}}}}}({W}_{{fh}}{h}^{t-1})+{b}_{f})$$

(7)

$${{{{{mathrm{Output}}}}}}; {{{{{mathrm{gate:}}}}}}quad {o}^{t}=sigma ({{{{{mathscr{D}}}}}}({W}_{{ox}}{x}^{t})+{{{{{mathscr{D}}}}}}({W}_{{oh}}{h}^{t-1})+{b}_{o})$$

(8)

$${{{{{mathrm{Cell}}}}}}; {{{{{mathrm{state:}}}}}}quad {s}^{t}={g}^{t}odot {i}^{t}+{s}^{t-1}odot {f}^{t}$$

(9)

$${{{{{mathrm{Hidden}}}}}}; {{{{{mathrm{state:}}}}}}quad {h}^{t}={tanh }({s}^{t})odot {o}^{t}$$

(10)

$${{{{{mathrm{Output:}}}}}}quad {y}^{t}={W}_{{hy}}{h}^{t}+{b}_{y}$$

(11)

where W and b are the network weights and bias parameters, respectively, and ({{{{{mathscr{D}}}}}}) is the dropout operator, which randomly sets some of the connections to zero. The LSTM network and our whole workflow31 were implemented in PyTorch36, an open source machine learning framework.Here we do not use LSTM to predict the target variable. Rather, LSTM is used to (optionally) map from time series information to the parameters in our gz network as described below.The parameter estimation networkWe present two versions of the dPL framework. The first version allows us to train a parameter estimation network over selected training locations Itrain where some ancillary information A (potentially including but not limited to attributes in φi used in the model) is available, for training period Ttrain (illustrated in Fig. 1b):$${hat{{{{{{boldsymbol{theta }}}}}}}}^{i}={g}_{A}left({{{{{{bf{A}}}}}}}^{i}right){{{{{{rm{for}}}}}}; {{{{{rm{all}}}}}}; iin I}_{{{{{{{mathrm{train}}}}}}}}$$

(12a)

$${hat{g}}_{A}(cdot )={{{{{rm{arg }}}}}},{{{{{{rm{min }}}}}}}_{{g}_{A}(cdot )}mathop{sum}limits_{tin T,iin {I}_{{{{{{rm{train}}}}}}}}{Vert h(f({x}_{t}^{i},{varphi }^{i},{g}_{A}({{{{{{bf{A}}}}}}}^{i})))-{z}_{t}^{i}Vert }^{2}$$

(12b)

Essentially, we train a network (gA) mapping from raw data (A) to parameters (({{{{{boldsymbol{theta }}}}}})) such that the PBM output (f) using ({{{{{boldsymbol{theta }}}}}}) best matches the observed target (({{{{{bf{z}}}}}})). We are not training to predict the observations – rather, we train gA on how to best help the PBM to achieve its goal. The difference between Eq. 12 and Eq. 3 highlights that the loss function combines the sum of squared differences for all sites at once.The second version is applicable where some observations ({left{{{{{{{bf{z}}}}}}}_{t}^{i}right}}_{tin T}) are also available as inputs at the test locations:$${hat{{{{{{boldsymbol{theta }}}}}}}}^{i}={g}_{z}left({left{{{{{{{bf{x}}}}}}}_{t}^{i}right}}_{tin T},{{{{{{bf{A}}}}}}}^{{prime} ,i},{left{{{{{{{bf{z}}}}}}}_{t}^{i}right}}_{tin T}right){{{{{rm{for}}}}}}{,,}{{{{{rm{all}}}}}}{,,}{iin I}_{{{{{{{mathrm{train}}}}}}}}$$

(13a)

$$widehat{{g}_{z}}(cdot )={{{{{rm{arg }}}}}},{{{{{{rm{min }}}}}}}_{{g}_{z}(cdot )}mathop{sum }limits_{tin {T}_{{{{{{rm{train}}}}}}},iin {I}_{{{{{{rm{train}}}}}}}}{Vert h(f({x}_{t}^{i},{varphi }^{i},{g}_{z}({{{{{{{{bf{x}}}}}}}_{t}^{i}}}_{tin T},{{{{{{bf{A}}}}}}}^{{prime} ,i},{{{z}_{t}^{i}}}_{tin T})))-{z}_{t}^{i}Vert }^{2}$$

(13b)

Essentially, we train a network (({g}_{z})) that maps from attributes (A′), historical forcings (x), and historical observations (({left{{{{{{{bf{z}}}}}}}_{t}^{i}right}}_{tin T})) to a suitable parameter set (({{{{{boldsymbol{theta }}}}}})) with which the PBM output best matches the observed target (({{{{{bf{z}}}}}})) across all sites in the domain. Ancillary attributes A′ may be the same as or different from A used in gA, and in the extreme case may be empty. Succinctly, they can be written as two mappings, gA: A → θ and gZ: (A′,x,z) → θ. gZ can accept time series data as inputs and here we choose LSTM as the network structure for this unit. There is no circular logic or information leak because the historical observations (({left{{{{{{{bf{z}}}}}}}_{t}^{i}right}}_{tin T})) are for a different period (T) than the main training period (Ttrain). In practice, this distinction may not be so crucial as the PBM acts as an information barrier such that only values suitable as parameters (({{{{{boldsymbol{theta }}}}}})) can produce a reasonable loss. As LSTM can output a time series, the parameters were extracted only at the last time step. For gA, only static attributes were employed, and so the network structure amounts to a multilayer perceptron network. After some investigation of training and test metrics, we set the hidden size of g to be the same as for the surrogate model.The whole network is trained using gradient descent, which is a first-order optimization scheme. Some second-order schemes like Levenberg–Marquardt often have large computational demand and are thus rarely used in modern DL49. To allow gradient accumulation and efficient gradient-based optimization and to further reduce the computational cost, we can either implement the PBM directly into a differentiable form, as described in the global PUB case below, or first train a DL-based surrogate model (f^{prime} left(bullet right)simeq fleft(bullet right)) and use it in the loss function instead of f(·),$$g(cdot )={{{{{rm{arg }}}}}},{{{{{{rm{min }}}}}}}_{g(cdot )}mathop{sum}limits_{tin {T}_{{{{{{rm{train}}}}}}},iin {I}_{{{{{{rm{train}}}}}}}}{Vert h(f{{mbox{‘}}}({x}_{t}^{i},{varphi }^{i},g(cdot )))-{z}_{t}^{i}Vert }^{2}$$

(14)

where (g(bullet )) generically refers to either gA or gZ with their corresponding inputs. gA can be applied wherever we can have the ancillary inputs A, while gZ can be applied over areas where forcings and observed responses (x, z) are also available, without additional training:$${hat{{{{{{boldsymbol{theta }}}}}}}}^{i}={g}_{z}({{{{{{{{bf{X}}}}}}}_{t}^{i}}}_{tin T}{{{{{boldsymbol{,}}}}}}{{{{{{boldsymbol{varphi }}}}}}}^{i},{{{{{{{{bf{Z}}}}}}}_{t}^{i}}}_{tin T}){,,}{{{{{rm{or}}}}}},,{hat{{{{{{boldsymbol{theta }}}}}}}}^{i}={g}_{A}({{{{{{boldsymbol{varphi }}}}}}}^{i}); {{{{{rm{for}}}}}},{{{{{rm{any}}}}}}, i,{{{{{rm{and}}}}}},{{{{{rm{any}}}}}},{{{{{rm{reasonable}}}}}},T$$

(15)

We tested both gA and gZ, which work with and without forcing-observation (x-z) pairs among the inputs, respectively. Since SMAP observations have an irregular revisit schedule of 2–3 days and neural networks cannot accept NaN inputs, we have to fill in the gaps, but simple interpolations do not consider the effects of rainfall. Here we used the near-real-time forecast method that we developed earlier30. Essentially, this forecast method uses forcings and integrates recently available observations to forecast the observed variable for the future time steps, achieving very high forecast accuracy (ubRMSE More