### Study sites and data collection

The data used for this study were obtained from a previous multi-site study on post-distribution FRC decay collected from refugee settlements in South Sudan, Jordan, and Rwanda^{19}. This dataset was selected because process-based models have previously been used to produce FRC targets for these sites, providing a useful comparison to the risk-based targets generated in this study. Details of the data collected at these sites, as well as important site characteristics, are included in Table 3. Two datasets were collected from Jordan: one from the summer of 2014 and one 9 months later, from the late winter of 2015. The original study treated these as two separate datasets due to differences in environmental conditions (a 10 °C difference in average temperature) and the amount of time between the two collection periods^{19}. To ensure a consistent comparison with the original study, we also treated the 2014 and 2015 data from Jordan as two distinct datasets.

The dataset for each site includes FRC as well as other water quality parameters that are routinely collected in humanitarian water system operations, including total residual chlorine, EC, water temperature, turbidity, and pH. Data were collected using paired sampling, whereby the same unit of water was sampled at the following points along the post-distribution water supply chain:

- From the tap at the point-of-distribution
- In the container immediately after collection
- In the container immediately after transport to the dwelling
- After a follow-up period of storage in the household

This study used only the measurements at the point-of-distribution and the point-of-consumption, reflecting data collection practices that are more feasible for humanitarian operations. In preparing the dataset, observations were removed if the point-of-distribution water quality did not meet humanitarian drinking water quality guidelines. Supplementary Table 2 in the Supplementary Information includes the full list of data cleaning steps used to prepare the data for use in the ANN models.

### Ethics

The initial field work in South Sudan received exemption from full ethics review by the Medical Director of Médecins sans Frontières (MSF) (Operational Centre Amsterdam), as the data collected were routine for the ongoing water supply intervention at the study site. For subsequent field studies in Jordan and Rwanda, ethics approval was obtained from the Committee for Protection of Human Subjects (CPHS) of the Institutional Review Board at the University of California, Berkeley (CPHS Protocol Number: 2014-05-6326). Informed consent was provided throughout all data collection.

### Input variable selection

Two input variable combinations were considered for predicting the output variable, the point-of-consumption FRC concentration. All variables considered are routinely monitored in humanitarian water system operations. The first input variable combination (IV1) included FRC at the water point-of-distribution and the elapsed time between the measurements at the point-of-distribution and the point-of-consumption. This combination represents the minimum set of variables that would be regularly collected under current humanitarian drinking water quality guidelines^{31}. Additionally, these are the only two variables included in the process-based model developed for these sites in a past study^{19}, so this combination allows a direct comparison of the ANN ensemble models with the process-based models. The second input variable combination (IV2) included the variables from IV1 as well as additional water quality variables measured at the point-of-distribution (directly after the water had left the distribution point): EC, water temperature, pH, and turbidity. These additional variables are recommended for collection in some humanitarian drinking water quality guidelines^{29,30,31} and, as such, may also be available in humanitarian response settings. This larger input variable set allowed us to investigate the usefulness of additional water quality variables for forecasting point-of-consumption FRC concentrations.

### Base-learner structure and architecture

The ensemble base learners (the individual ANNs in the ensemble models) were built as multi-layer perceptrons (MLPs) with a single hidden layer using the Keras 2.3.0 package^{48} in Python v3.7^{49}. This structure was selected because it has been shown to outperform other data-driven models and ANN architectures for predicting FRC in piped distribution systems^{20,21}. The weights and biases of the base learners were optimized to minimize mean squared error (MSE) using the Nadam algorithm with a learning rate of 0.1. An early stopping procedure with a patience of 10 epochs was used to prevent overfitting.
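As an illustration, a base learner of this form could be specified as follows. This is a sketch using the current tf.keras API rather than the original Keras 2.3.0 code, and the tanh activation is our assumption, as the activation function is not stated here:

```python
import tensorflow as tf

def build_base_learner(n_inputs, n_hidden):
    """Single-hidden-layer MLP base learner. Layer sizes follow the paper
    (e.g. n_inputs=2 and n_hidden=4 for IV1); the tanh activation is an
    assumption, not taken from the original study."""
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(n_hidden, activation="tanh"),
        tf.keras.layers.Dense(1),  # single output: point-of-consumption FRC
    ])
    model.build(input_shape=(None, n_inputs))
    # Minimize MSE with Nadam at a learning rate of 0.1, as in the paper
    model.compile(optimizer=tf.keras.optimizers.Nadam(learning_rate=0.1),
                  loss="mse")
    return model

# Early stopping with a patience of 10 epochs, passed to model.fit(...)
early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=10)
```
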

The hidden layer size of the base learners was determined through an exploratory analysis by consecutively doubling the hidden layer size until performance decreased or ceased to improve substantially from one iteration to the next. Based on this analysis, we selected a hidden layer size of four hidden neurons at all sites for the models using the IV1 variable combination. For the models using the IV2 input variable combination, we selected a hidden layer size of 16 hidden neurons for South Sudan and Jordan (2015), and a hidden layer size of eight hidden neurons for Jordan (2014) and Rwanda. The full results of the exploratory analysis of hidden layer size are included in Supplementary Figs 13–20 in the Supplementary Information.

### Data division

The full dataset for each site and variable combination was divided into calibration and testing subsets, with the calibration subset further subdivided into training and validation data. The testing subset was obtained by randomly sampling 25% of the overall dataset. The same testing subset was used for all base learners so that each base-learner’s testing predictions could be combined into an ensemble forecast. The training and validation data were obtained by randomly resampling from the calibration subset, with a different combination of training and validation data for each base learner to promote ensemble diversity. The ratio of calibration data used for training and validation, respectively, was selected to avoid both overfitting and underfitting through an exploratory analysis using a grid search process. In all but two cases, we selected a validation set that was twice the size of the training set, for an overall training-validation-testing split of 25–50–25%. The two exceptions were the Jordan (2014) model using the IV1 input variable combination, where we found that a training-validation-testing split of 50–25–25% produced better performance, and the Jordan (2015) model using the IV1 input variable combination, where a training-validation-testing split of 30–45–25% performed substantially better. The full results of the exploratory analysis for data division are included in Supplementary Figs 21–28 in the Supplementary Information. Descriptive statistics for the calibration and testing datasets are included in Supplementary Tables 3 and 4 of the Supplementary Information, and histograms of the input and output variables are provided in Supplementary Figs 5–12 in the Supplementary Information to provide context on the range and patterns in the data used to train the ANN base learners.
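As an illustration, the division scheme above can be sketched as follows; the function and the use of NumPy index permutation are our own, not taken from the study's code:

```python
import numpy as np

def split_indices(n_samples, rng, train_frac=0.25, val_frac=0.50,
                  test_frac=0.25):
    """Divide sample indices into a fixed testing subset and a calibration
    subset, then draw a training/validation split from the calibration
    subset. Re-drawing the calibration split with fresh randomness for
    each base learner gives every ANN a different training set."""
    idx = rng.permutation(n_samples)
    n_test = int(round(test_frac * n_samples))
    test_idx = idx[:n_test]          # shared by all base learners
    calib_idx = idx[n_test:]
    # Resample the calibration subset into training and validation data
    calib_shuffled = rng.permutation(calib_idx)
    n_train = int(round(train_frac * n_samples))
    train_idx = calib_shuffled[:n_train]
    val_idx = calib_shuffled[n_train:]
    return train_idx, val_idx, test_idx
```

With the default fractions, a 1000-sample dataset yields the 25–50–25% split of 250 training, 500 validation, and 250 testing samples.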

### Ensemble model formation

The ensemble models in this study were used to generate probabilistic forecasts of post-distribution FRC by combining the predictions of each base learner into a probability density function (pdf). Thus, for each observation of FRC at the point-of-consumption, the ensemble model outputs a pdf representing the predicted probability of point-of-consumption FRC concentrations. This pdf can then be used to identify ensemble confidence intervals (CIs) for the expected point-of-consumption FRC concentration. To ensure a good representation of the full output space in the final pdfs, two approaches were taken to ensure ensemble diversity. First, as discussed above, the data used to train the base-learner ANNs were randomly sampled from the calibration set, so each ANN was trained on a different subset of the data. Second, the initial weights and biases were randomized for each base learner in a random-start process. Both are implicit approaches to ensuring ensemble diversity: rather than creating diversity directly, they allow the differences between base learners to arise through the randomization of the training data and of the initial weights and biases^{50}.

The ensemble size (the number of base learners included in the ensemble) was also determined through an exploratory analysis using a grid search procedure. This exploratory analysis showed that, in general, performance increased with larger ensemble sizes, but improvements in performance plateaued at ensemble sizes ranging from 50 to 250 members. Based on this, a standard ensemble size of 250 members was selected for all sites and variable combinations. The full results of the exploratory analysis for ensemble size are included in Supplementary Figs 29–36 in the Supplementary Information.

### Ensemble post-processing

We used ensemble post-processing to improve the forecasts generated by the raw ensembles, applying the kernel dressing method^{51}. This method follows a two-step process: first, a kernel function centred on each base-learner prediction is fit for each observation; then, the members' kernels are summed to produce the post-processed pdf, which is a non-parametric mixture distribution function. We used a Gaussian kernel function in keeping with past studies^{27,28,38,51}, though the selection of the specific kernel function is not critical^{28}. The kernel bandwidth was defined using the best member error method, whereby the bandwidth for all kernels is the variance of the absolute error of the member prediction that is closest to each observation in the calibration dataset^{51}.
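The two steps above can be sketched as follows; this is our reading of the best member error method, with notation and function names of our own:

```python
import numpy as np
from math import erf, sqrt

def best_member_bandwidth(calib_preds, calib_obs):
    """Best-member error bandwidth: the variance of the absolute error of
    whichever member prediction is closest to each calibration
    observation. calib_preds has shape (n_obs, n_members)."""
    best_abs_err = np.abs(calib_preds - calib_obs[:, None]).min(axis=1)
    return float(np.var(best_abs_err))

def dressed_cdf(member_preds, x, sigma2):
    """Kernel-dressed forecast cdf evaluated at x: the mean of Gaussian
    cdfs centred on each member prediction, all sharing variance sigma2."""
    sd = sqrt(sigma2)
    return float(np.mean([0.5 * (1.0 + erf((x - p) / (sd * sqrt(2.0))))
                          for p in member_preds]))
```
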

### Ensemble verification and performance evaluation

We used ensemble verification metrics to evaluate the performance of the raw and post-processed ensembles for each site and variable combination. Ensemble verification metrics differ from traditional measures of performance (e.g. Nash–Sutcliffe efficiency, MSE) as they assess the performance of the probabilistic forecasts of an ensemble, whereas traditional measures typically evaluate the average performance of an ensemble model or the predictions of a deterministic model^{52}. Throughout the following section, \(O\) refers to the full set of observed FRC concentrations at the point-of-consumption and \(o_i\) refers to the \(i^{\mathrm{th}}\) observation, where there are \(I\) total observations. \(F\) refers to the full set of probabilistic forecasts for point-of-consumption FRC, where \(F_i\) is the probabilistic forecast corresponding to observation \(o_i\) and \(f_i^m\) is the prediction by the \(m^{\mathrm{th}}\) base learner in the ensemble for the \(i^{\mathrm{th}}\) observation. For the following metrics, it is assumed that the predictions of each base learner in the ensemble are sorted from low to high for each observation such that \(f_i^m \le f_i^{m+1}\) from \(m = 0\) to \(m = M\).

### Percent capture

Percent capture measures the percentage of observations that are captured within the ensemble forecast and provides a useful indication of how well the model can reproduce the full range of observed values; as such, it can indicate whether a model is underdispersed. For a raw ensemble forecast, the \(i^{\mathrm{th}}\) observation is captured if \(f_i^0 \le o_i \le f_i^M\). For a post-processed forecast, the \(i^{\mathrm{th}}\) observation is captured if the probability of \(o_i\) in the mixture distribution is greater than 0. While not commonly used for ensemble verification, a similar metric has been used for evaluating other probabilistic or possibilistic models, especially neurofuzzy networks, referred to as either the percent capture or the percent of coverage^{53,54,55,56}. The percent capture was calculated both for the overall set of observations and for observations with point-of-consumption FRC below 0.2 mg/L. The latter is a useful indicator of how well the model can predict whether water will have sufficient FRC at the point-of-consumption, and hence of the degree of confidence we can have in the risk-based targets generated using these ensemble models.
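A minimal implementation of percent capture for a raw ensemble might look like this (the function name and array layout are ours):

```python
import numpy as np

def percent_capture(predictions, observations, threshold=None):
    """Percentage of observations captured within the raw ensemble range,
    i.e. between the lowest and highest member prediction for each
    observation. If threshold is given, only observations below it are
    scored (e.g. 0.2 mg/L for the low-FRC subset)."""
    captured = ((observations >= predictions.min(axis=1)) &
                (observations <= predictions.max(axis=1)))
    if threshold is not None:
        captured = captured[observations < threshold]
    return 100.0 * captured.mean()
```
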

### CI reliability diagram

Reliability diagrams are visual indicators of ensemble reliability, where reliability refers to the similarity between the observed and forecasted probability distributions; in the ideal model, all observations plot along the 1:1 line, showing that the observed probabilities equal the forecasted probabilities. These diagrams plot the observed relative frequency of events against the forecast probability of that event. The reliability diagram has been adapted in past studies as the CI reliability diagram, which compares the frequency of observed values falling within the corresponding CI of the ensemble. For raw ensembles, the CIs are derived from the sorted forecasts of the base learners (for example, the ensemble 90% CI includes all of the forecasts between \(f^{0.05M}\) and \(f^{0.95M}\)); for post-processed ensembles, the CIs are calculated directly from the probability distribution. In this study, we extended the CI reliability diagram further by plotting the percent capture of each CI within the ensemble against the CI level. For each ensemble model, we plotted the CI reliability for the 10–100% CI levels at 10% intervals, as well as at the 95 and 99% CIs. We used this to develop a numerical score for the CI reliability diagram, calculated as the squared distance between the percentage of observations captured within each CI and the ideal percent capture in that CI. This was calculated for each CI threshold, *k*, from 10 to 100% in 10% increments, as shown in Eq. 1.

$$CI\;\mathrm{Reliability}\;\mathrm{Score} = \sum_{k = 0.1}^{1} \left( k - \mathrm{Percent}\;\mathrm{Capture}\;\mathrm{in}\;CI_k \right)^2$$

(1)

The CI reliability score measures the horizontal distance between the percent capture and the 1:1 line for each CI. The ideal value for this score is 0, indicating that all points fall on the 1:1 line. The worst possible score depends on the number of CIs included in the calculation; for this study the worst score is 3.9, which would occur only if no observations were captured in any CI of the ensembles. The CI reliability score was calculated for both the overall dataset and for forecast-observation pairs where the observed household FRC concentration was below 0.2 mg/L.
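The score in Eq. 1 can be computed for a raw ensemble as sketched below; here the CI bounds are taken as empirical quantiles of the member predictions, which is our approximation of the ranked-member indexing described above:

```python
import numpy as np

def ci_reliability_score(predictions, observations):
    """CI reliability score: squared distance between the ideal capture
    fraction k and the observed capture fraction, summed over the central
    10-100% CIs in 10% steps. predictions has shape (n_obs, n_members)."""
    score = 0.0
    for k in np.arange(0.1, 1.01, 0.1):
        lo = np.quantile(predictions, (1 - k) / 2, axis=1)
        hi = np.quantile(predictions, (1 + k) / 2, axis=1)
        captured = np.mean((observations >= lo) & (observations <= hi))
        score += (k - captured) ** 2
    return score
```

For a degenerate ensemble that captures nothing, the score reaches its maximum of sum(k^2) = 3.85, consistent with the worst score of about 3.9 quoted above.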

### Continuous Ranked Probability Score

The Continuous Ranked Probability Score (CRPS) is a common metric for evaluating probabilistic forecasts that evaluates the difference between the predicted and observed probabilities of continuous variables and is equivalent to the mean absolute error of a deterministic forecast^{57,58}. The CRPS measures not only model reliability but also sharpness, which is an indicator of how closely the ensemble predictions are clustered around the observed values. Thus, the CRPS can be a useful measure of overdispersion and can indicate whether improvements in reliability are being obtained at the expense of excess overdispersion. The CRPS is measured as the area between the forecast cumulative distribution function (cdf) and the observed cdf for each forecast-observation pairing^{58}. Since each observation is a discrete value, the observation cdf is represented with the Heaviside function \(H\{x \ge o_i\}\), a stepwise function with a value of 0 for all point-of-consumption FRC concentrations below the observed concentration and 1 for all concentrations at or above it. Eq. 2 gives the CRPS for a single forecast-observation pair. To evaluate the ensemble models, the average CRPS, \(\overline{\mathrm{CRPS}}\), is calculated by taking the mean CRPS over all forecast-observation pairs.

$$\mathrm{CRPS} = \int_{-\infty}^{\infty} \left( F_i\left( x \right) - H\left\{ x \ge o_i \right\} \right)^2 dx$$

(2)

For the post-processed probability distributions, we calculated the CRPS directly from Eq. 2 using numerical integration. For the raw ensemble, we treated the forecast cdf as a stepwise continuous function with \(N = M + 1\) bins, where each bin is bounded by two ensemble forecasts and the value in each bin is the cumulative probability^{58}. \(\overline{\mathrm{CRPS}}\) is calculated using \(\overline{g_n}\), the average width of bin \(n\) (the average difference in FRC concentration between forecast values \(m\) and \(m + 1\)), and \(\overline{o_n}\), the likelihood of the observed value being in bin \(n\)^{58}. Using these values, the \(\overline{\mathrm{CRPS}}\) for an ensemble can be calculated as^{58}:

$$\overline{\mathrm{CRPS}} = \sum_{n = 1}^{N} \overline{g_n} \left[ (1 - \overline{o_n}) p_n^2 + \overline{o_n} \left( 1 - p_n \right)^2 \right]$$

(3)

where \(p_n\) is the probability associated with each bin, \(p_n = \frac{n}{N}\)^{58}.
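For a raw ensemble, the CRPS of a single forecast-observation pair can equivalently be computed from the standard energy-form identity CRPS = E|X - o| - (1/2)E|X - X'| for the empirical distribution of member predictions, which gives the same value as integrating Eq. 2 with a stepwise cdf. A sketch (our illustration, not the study's code):

```python
import numpy as np

def crps_empirical(member_preds, obs):
    """CRPS of a raw ensemble forecast for one observation, using the
    energy-form identity for the empirical distribution of the member
    predictions. Equivalent to the area between the stepwise forecast cdf
    and the Heaviside observation cdf."""
    f = np.asarray(member_preds, dtype=float)
    term1 = np.mean(np.abs(f - obs))                    # E|X - o|
    term2 = 0.5 * np.mean(np.abs(f[:, None] - f[None, :]))  # (1/2)E|X - X'|
    return term1 - term2
```

With a single member the score collapses to the absolute error, matching the stated equivalence between CRPS and the mean absolute error of a deterministic forecast.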

### Generation of risk-based targets

To generate the risk-based FRC targets, the trained ensembles of ANNs were used to forecast the point-of-consumption FRC for a series of point-of-distribution FRC concentrations from 0.2 to 2 mg/L in 0.05 mg/L increments. For each point-of-distribution FRC concentration, the predicted risk of insufficient FRC was calculated from the forecast pdf as the cumulative probability of FRC at the point-of-consumption being below 0.2 mg/L. Using this predicted risk, the target FRC concentration for the point-of-distribution was then selected as the lowest FRC concentration at the water point-of-distribution that provides the desired level of protection. For this study, we selected the FRC concentration that resulted in negligible risk of FRC being below the 0.2 mg/L threshold (i.e. the lowest FRC concentration where the predicted risk is 0). Operationally, however, any level of protection could be used, and the risk of insufficient FRC at the point-of-consumption should be balanced against the risks associated with high FRC concentrations, such as DBP formation and taste and odour concerns.
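The target-selection scan can be sketched as follows; `risk_fn` stands in for the ensemble forecast's cumulative probability below 0.2 mg/L, and the monotone risk curve shown is purely hypothetical:

```python
import numpy as np

def select_frc_target(doses, risk_fn, max_risk=0.0):
    """Scan candidate point-of-distribution FRC doses in ascending order
    and return the lowest dose whose predicted risk of point-of-consumption
    FRC falling below 0.2 mg/L does not exceed max_risk."""
    for dose in doses:
        if risk_fn(dose) <= max_risk:
            return dose
    return None  # no dose in the scanned range meets the risk criterion

# Dose grid from the study: 0.2-2.0 mg/L in 0.05 mg/L increments
doses = np.round(np.arange(0.2, 2.0001, 0.05), 2)

# Hypothetical risk curve for illustration only: zero risk above 1.0 mg/L
risk = lambda d: max(0.0, 1.0 - d)
target = select_frc_target(doses, risk)
```
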

For comparison with the previously published results, we used a storage duration of 10 h when generating the FRC targets for South Sudan, and 24 h for all other sites^{19}. Since the IV2 model also requires values for EC, water temperature, pH, and turbidity, two scenarios were considered. First, an “average” scenario was used, in which the median observed value of each additional water quality parameter was selected. The second was a “worst-case” scenario, in which water quality conditions were unfavourable for maintaining chlorine residual. A partial correlation analysis, which assesses the correlation between an input variable and the output variable while controlling for the impacts of the other input variables, was used to determine the least favourable conditions for each input variable. The partial correlation is computed by first developing multiple linear regression predictions of both the output variable (point-of-consumption FRC) and the input variable of interest, using the remaining input variables as the predictors in the linear regression models, and then taking the Pearson correlation coefficient between the residuals of the two regression models. Partial correlation was used to assess the directionality of the effect of each additional water quality variable included in IV2, that is, whether high or low values of that input would create a worst-case scenario. Once the directionality of each variable’s impact had been established, the 95th or 5th percentile observed value of that variable at each site was used to simulate the worst-case scenario.
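The residual-based partial correlation described above can be sketched as follows (our implementation, using ordinary least squares):

```python
import numpy as np

def partial_correlation(x, y, controls):
    """Partial correlation of x and y controlling for `controls`:
    regress x and y on the controls (plus an intercept) by least squares,
    then take the Pearson correlation of the two residual series."""
    A = np.column_stack([np.ones(len(x)), controls])
    rx = x - A @ np.linalg.lstsq(A, x, rcond=None)[0]
    ry = y - A @ np.linalg.lstsq(A, y, rcond=None)[0]
    return np.corrcoef(rx, ry)[0, 1]
```

The sign of the returned coefficient gives the directionality used to pick the 95th or 5th percentile value of each variable for the worst-case scenario.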
