Zero-shot generalization for predicting viral concentrations and evaluating removal efficiencies across wastewater matrices


Abstract

Predicting viral particles on new unseen data across wastewater matrices (WMs) in aerobic membrane bioreactor (AeMBR)-based wastewater treatment plants (WWTPs) remains an open challenge due to the process drifts involved in the treatment stages. Efficient data augmentation approaches based on Markov chain (MCM), Markov chain and multivariate Gaussian (MMCM), Gaussian mixture (GMM) and Copula (CM) were proposed to generate synthetic data from physicochemical parameters, virometry, and PCR-based method. Dual-attention long short-term memory network (DA-LSTM) with new generative models was proposed to predict viral particles and evaluate the removal efficiencies across AeMBRs, thereby handling effluent processing drifts. The DA-LSTM combines attention mechanisms to adaptively adjust the weights of the features and increase the long-term memory, enabling accuracy and robustness across unseen WMs. DA-LSTM framework was tested for predicting pepper mild mottle virus and enteric viral pathogens such as total virus and adenovirus in two regions of Saudi Arabia. The log removal values were evaluated through the estimated viral concentrations. The DA-LSTM model demonstrated significant adaptability to unseen data across different WMs, maintaining robust performance despite the effluent drifts. The results showed that DA-LSTM zero-shot generalization achieved remarkable viral particles prediction performance using MMCM with a mean average coefficient of determination R2 of 0.91, and 0.97 across the sand, and MBR wastewater matrices in region R1, respectively, and R2 of 0.97 across the chlorinated effluent treatment process in region R2. Tests on total viral prediction across municipal WWTPs located in two other regions in Saudi Arabia confirmed the DA-LSTM’s effectiveness in predicting viral particle across WMs and its ability to enhance zero-shot generalization performance at the regional level.

Introduction

Viral particle concentrations in treated wastewater effluents have recently gained widespread attention due to the increased interest in reuse applications to enhance water security and ensure sustainable water management resources1. As the demand for reclaimed wastewater increases, accurately predicting and removing viral particles in wastewater treatment plants (WWTPs) becomes more important for preventing food contamination, reducing health risks, and mitigating ecological impacts. In this context, monitoring viral particles across the treated wastewater effluents and assessing the resulting log removal values (LRVs) between the influent and effluent treatment concentrations are necessary to ensure safe water reuse. However, removing viral particles from reclaimed water is challenging due to their inherently complex nature and smaller size, making them harder to remove than other biological contaminants in conventional wastewater treatment. Wastewater-based epidemiology methods used to quantify bacterial and viral pathogens are often highly specific and time consuming2,3, and do not reflect the measurement of the wastewater process variables in real time (e.g., water samples are collected on-site and then measured in off-site laboratory analysis4,5), thereby limiting their applications. There is an urgent need for efficient and rapid real-time monitoring approaches. Thus, the development of soft sensor algorithms to accurately predict viral particles and assess LRVs for contaminants in AeMBRs is important for supporting plant operators in water resource recovery facilities (WRRFs) and alleviating these issues.

Model-based and learning-based approaches have been proposed to characterize the physicochemical and biological interactions between water quality parameters and contaminant concentrations, and to assess these concentrations5,6,7,8,9,10,11. The physicochemical water quality parameters and concentrations measured in aerobic MBRs can be generally monitored and digitally processed continuously with appropriate sampling frequencies. Such bioreactor processes can be mathematically described or identified using ordinary differential equations from system identification techniques, empirical growth and mass conservation principles12. These model-based estimation approaches provide an efficient strategy for determining the variables of interest, including state, parameters, and unknown faults and disturbances7,13,14. However, the identification and model representation of accurate process models, along with the assumptions in the model derivation, such as microorganism concentrations and reaction rates of WWTP systems, are the bottlenecks of model-based bioreactor approaches. Additionally, there is no direct closed-form relationship between microbial/viral contaminants and biomass concentrations, leading to more complex model-based estimation problems. In several MBR processes, it is necessary to include microbial and viral population balance equations to capture mass transfer and cell division, and heterogeneous mixtures of cells/particles15. Building these complex reaction networks can be a challenging problem in modeling and estimation. Data-driven methods circumvent mechanistic models by learning input–output relationships and capturing dominant patterns in WWTPs (see, e.g., 5,8,9,10,16,17,18,19 and references therein). Learning-based methods have proven effective in processing input–output relationships from data and making predictions. In the context of WWTPs, some solutions have been proposed to estimate bacterial and viral concentrations using data-driven models. 
The studies in10,20 were among the first to propose estimating bacterial concentrations in WWTPs using different ML algorithms and advanced the prediction of microbial contaminants in wastewater dynamic processes. A sliding window neural network-based approach was proposed to predict bacterial concentrations in WWTPs20. Tree-based ML models were proposed to predict bacterial concentrations10. The key contributions in both ML algorithms were to identify the optimal combination of water quality input features for predicting bacterial concentrations, and then to provide an evidence-based strategy for investigating the transferability and generalizability of model-based estimation methods using dominant and minimum features to construct the aerobic membrane bioreactor model. However, the estimation of bacterial concentrations from model-based approaches comes with different challenges, including the observability conditions for nonlinear systems, which are not always guaranteed. Log removal values of the pepper mild mottle virus (PMMoV) and Norovirus GII particles were predicted using neural networks11. Flow virometry using tree-based machine learning models was proposed for rapid estimation of virus particles across various wastewater matrices21. These works advanced the prediction of bacterial and viral concentrations using datasets ranging from limited to representative, while considering the nature of membrane bioreactor (MBR) technologies (e.g., aerobic and anaerobic MBRs), and the appropriate type of ML models. However, they were limited to developing optimal models based on training, validation, and testing; consequently, they did not infer transferability for unseen testing sets, thereby limiting their model generalization capabilities in the presence of process drifts.

The lack of standard ML models capable of handling the time-varying characteristics of wastewater effluent matrices (e.g., process and distribution drifts on unseen datasets), hinders the development of effective data-driven models on unseen datasets to overcome model generalization. Few works targeted the model generalization on unseen data for predicting bacterial and viral contaminants across various wastewater matrices and WWTPs. A recent study was conducted to estimate the bacterial concentrations across various AeMBR-based WWTPs using a calibration that relied on an out-of-distribution framework22. The calibration method showed accurate prediction performances on unseen datasets from WWTPs in two regions of Saudi Arabia. The calibration method or out-of-distribution testing22 shares the same core principle of the proposed zero-shot generalization, as the pre-trained model is applied directly to unseen data with a retraining phase or condition before a downstream model enhancement analysis of unseen datasets. These two forms of zero-shot generalization approaches do not rely on a specific adaptation mechanism. In23, the authors proposed a lifelong learning framework that demonstrated excellent improvements in model generalization accuracy by integrating a knowledge-based adaptation mechanism and local ML predictor on unseen test data to predict viral particles across various WWTPs. Despite recent efforts made to enhance predictive modeling throughout the calibration and model knowledge-based adaptation in WWTPs (see, e.g., 22,23), achieving accurate viral particle prediction results and evaluating removal efficiencies on unseen datasets through a source-to-target estimation principle remains challenging, and there is no guarantee of good prediction performance for microbial and viral concentrations with the standard isolated ML models10,24. This highlights the need to improve model generalization during deployment to ensure consistent and reliable performance. 
The zero-shot generalization (ZSG) framework, combining a synergistic dual-attention mechanism and a neural network model, emerges as an alternative solution to the above challenges, particularly in source-to-target prediction tasks.

Attention mechanisms have recently been proposed to significantly improve the original features extracted in deep learning models, thereby alleviating the long-dependency issue of most neural networks, including recurrent neural networks and long short-term memory (LSTM) 25,26,27,28,29. The purpose of the ZSG method based on the attention mechanism is to address the shortcomings of traditional time series prediction methods when they are faced with long-term dependencies and multiple driving sequences. The attention mechanism assigns weights to intrinsic features to build an appropriate ML model in which the assigned weights are transferred to the target prediction tasks30,31,32. These features can then help identify highly distinguishable features in a high-dimensional space, thereby increasing the accuracy of the prediction results and their generalization capabilities with unseen test datasets. ZSG techniques are needed to handle the challenges caused by the rapid shifts in effluent treatment conditions and to adapt effectively to unseen data. The ZSG combines a synergistic dual-attention (DA) mechanism framework with an LSTM model to test the generalization performance of the trained model with unseen datasets25. The key advantage of synergistic DA-LSTM lies in its novel augmented generative models based on the Markov chain process, the global dependency of the effluent data distributions and feature spaces, and the long-term dependency on time series data using LSTM, ultimately improving the predictive modeling of unseen data via ZSG.

The present study paved the way for a ZSG technique based on a dual-attention mechanism and a novel generative model to predict total virus concentrations and associated viral particles across AeMBR wastewater matrices for the development of efficient and generalizable soft sensors. The performance of this ZSG framework was tested on four AeMBR-based WWTPs located in four regions (R1, R2, R3, R4) of Saudi Arabia, with the aim of predicting unseen viral particles across wastewater matrix datasets from each of these WWTPs. The WWTP in region R1 treats a mix of municipal and industrial wastewater, while the WWTP in R2 treats municipal wastewater. The WWTPs in regions R3 and R4 are each divided into two pilots, (A/H) and (P1/P2), respectively, and treat municipal wastewater with a process similar to that of the WWTP in region R1, although with some modifications. The primary goal of this study was to validate the prediction of viral particles on unseen datasets using the DA-LSTM with generative models across the WMs in regions (R1, R2) and to extend it to the pilots (A/H) in region R3 and (P1/P2) in region R4 for further validation and comparison. The LRVs from the estimated influent and effluent viral concentrations were also evaluated. The LRV results of the AeMBR-based WWTPs in the three regions (R1, R2, R3) showed different virus removal characteristics that were highly dependent on the treatment processes (aerobic treatment (conventional activated sludge), sand filtration, membrane (MBR), and chlorination) and on the type of virus, including its size, structure, and morphological characteristics. The key innovation of this work lies in predicting viral particles and evaluating the removal efficiencies using DA-LSTM zero-shot generalization with novel generative models by optimizing the source model across wastewater effluent drifts and WWTPs. The effluent process drifts were handled by fine-tuning the weights and parameters of DA and LSTM. 
The pre-trained model was built on the primary effluent source and then applied to a downstream zero-shot generalization on the second effluent. The prediction performance on unseen datasets was remarkably preserved, as slight differences in distribution shifts and dynamic changes occurred between the effluent source and target domains. A retraining phase is needed to streamline the viral particles prediction performance on new unseen datasets coming from the third or new clarifier. To the best of our knowledge, DA-LSTM zero-shot generalization framework has not yet been developed to predict viral particles and evaluate removal efficiencies across various wastewater matrices and WWTPs.

Materials and methods

This section presents the DA-LSTM framework for predicting viral particles in new unseen datasets through source and target prediction tasks. The methodology included integrating four generative models to generate synthetic datasets from the measured datasets. It also provided a ZSG framework based on DA-LSTM to quantify the viral particles across various wastewater matrices and to assess the LRVs through the estimated viral particles.

Aerobic membrane bioreactor plants and sample collection

AeMBR systems have shown advantages for the reduction of pathogen presence in post-treated MBR wastewater effluent compared to conventional activated sludge processes1. Despite the low particulate and high quality effluents produced by AeMBR systems, a total reduction in pathogens, including viral and microbial, is often not achieved1,33. The present study proposes a data-driven model to quantify viral particle concentrations across various wastewater matrices and to assess the log removal values of the viral species in four pilot AeMBR-based WWTPs. The description of each AeMBR-based WWTP, including its schematic representation and sampling points, is provided in the supplementary material (Texts S1.1–S1.4; Figs. S1, S2, S3, and S4). The water quality samples and viral particle concentrations were collected from AeMBR-based WWTPs located in four different regions (R1, R2, R3, R4) of Saudi Arabia. The WWTPs in R3 and R4 each had two pilots, (A, H) and (P1, P2), respectively, located within the same region. Physicochemical water quality parameters, namely pH, total dissolved solids (TDS), electroconductivity (conductivity), total suspended solids (TSS), turbidity, ammonium nitrogen (NH4-N), nitrate nitrogen (NO3-N), nitrite nitrogen (NO2-N), and chemical oxygen demand (COD) concentration, were measured in the four regions (R1, R2, R3, R4) (Table S1, Supplementary Information). Human TV, which reflects overall viral diversity regardless of viral genera, and adenovirus (AdV) were chosen as predictive parameters for enteric viral pathogens. PMMoV was chosen as the viral indicator. For more details related to the equipment and the collection of the initial samples, we refer the readers to 34, which provides a detailed analysis and processing of all the parameters involved in the source-tracking of microbial pathogens in WWTPs. 
Flow virometry and PCR-based methods (RT-qPCR) were used to measure TV, AdV, and PMMoV concentrations (Table 1) in regions R1 (Text S1.1 and Fig. S1, Supplementary Information) and R2 (Text S1.2 and Fig. S2, Supplementary Information), while TV concentrations were measured in regions R3 (Text S1.3 and Fig. S3, Supplementary Information) and R4 (Text S1.4 and Fig. S4, Supplementary Information) for WWTPs (A) and (H), and (P1) and (P2).

Table 1 Evaluation of the proposed generative models for the generated influent and aerobic effluent datasets using various quantitative measures, including the log removal value (LRV) for the MODON AeMBR-based WWTP in region R1. The best performance results of the quantitative error measures and LRVs with the lowest and matched values were highlighted in bold, respectively.

Synthetic generative models and evaluation performances

The data generative models, namely Gaussian mixture models (GMMs) (Text S2.1.3, Supplementary Information), Markov chain models (MCMs) (Text S2.1.1, Supplementary Information), extended Markov chain models (MMCMs) (Text S2.1.2, Algorithm S1, Supplementary Information), and copula models (CMs) (Text S2.1.4, Supplementary Information), were proposed to generate synthetic datasets from the measured datasets, given the limited availability of real samples. The limited available data refers to the real input–output samples collected from the WWTPs, which are few due to low pathogen concentrations. This limited data is used to generate synthetic datasets. The MMCM generative model follows the generative Markov chain proposed in 24 by modifying the probability state transition and adding noise with appropriate mean and standard deviation levels (Algorithm S1, Supplementary Information). A detailed description of these generative models, including their schematic representations and algorithms, is provided in the supplementary material (Section S2.1, Supplementary Information). These data generative models have demonstrated a strong ability to generate realistic data samples and imitate complex systems, including chemical and biological treatment processes for synthetic data augmentation22. We generated approximately 2000 samples, which were reduced to 1800 after applying a contamination level of 0.10 to remove outliers. It is important to note that a representative dataset with a satisfactory ratio between the real and synthetic datasets was generated. This ratio is adequate to provide remarkable prediction and generalization performances of viral particles across wastewater matrices and WWTPs. The generative models were carefully designed to ensure close distribution matching between the synthetic and original samples of the generating datasets, thereby guaranteeing data integrity while avoiding biases and inaccuracies. 
Qualitative and quantitative evaluation measures were proposed to ensure data integrity while controlling overfitting and avoiding data contamination or biases. First, the evaluation results included qualitative similarity performances based on principal component analysis (PCA) and t-distributed stochastic neighbor embedding (t-SNE) between the real and synthetic datasets. Second, four quantitative metrics were proposed to evaluate the similarity or dissimilarity between the real and synthetic datasets. These metrics included the maximum mean discrepancy (MMD), Fréchet inception distance (FID), Wasserstein distance (WD), and energy distance (ED). Third, the conventional log removal value was proposed for the first time as a quantitative measure to evaluate the difference in virus concentration between untreated and treated water, thereby ensuring the consistency and accuracy of the real and synthetic datasets.
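As a rough illustration of how two of these similarity scores can be computed, the sketch below estimates a squared MMD with an RBF kernel and a squared energy distance between a real and a synthetic sample, using NumPy only. The kernel bandwidth `gamma`, the function names, and the random placeholder data are assumptions made for illustration and are not the settings used in this study; FID and WD are omitted.

```python
import numpy as np

def pairwise_sq_dists(a, b):
    """Squared Euclidean distances between all rows of a and all rows of b."""
    sq = np.sum(a**2, axis=1)[:, None] + np.sum(b**2, axis=1)[None, :] - 2.0 * a @ b.T
    return np.maximum(sq, 0.0)  # clip tiny negatives caused by round-off

def mmd2_rbf(x, y, gamma=0.5):
    """Biased (V-statistic) estimate of squared MMD with kernel exp(-gamma * d^2).
    gamma is an assumed bandwidth, not a value taken from the study."""
    k_xx = np.exp(-gamma * pairwise_sq_dists(x, x)).mean()
    k_yy = np.exp(-gamma * pairwise_sq_dists(y, y)).mean()
    k_xy = np.exp(-gamma * pairwise_sq_dists(x, y)).mean()
    return k_xx + k_yy - 2.0 * k_xy

def energy_distance_sq(x, y):
    """Squared energy distance: 2 E|X-Y| - E|X-X'| - E|Y-Y'|."""
    d_xy = np.sqrt(pairwise_sq_dists(x, y)).mean()
    d_xx = np.sqrt(pairwise_sq_dists(x, x)).mean()
    d_yy = np.sqrt(pairwise_sq_dists(y, y)).mean()
    return 2.0 * d_xy - d_xx - d_yy

# Placeholder data (hypothetical): three normalized features per sample.
rng = np.random.default_rng(0)
real = rng.normal(size=(200, 3))
synthetic_close = rng.normal(size=(200, 3))          # same distribution as `real`
synthetic_far = rng.normal(loc=2.0, size=(200, 3))   # shifted distribution

mmd_close, mmd_far = mmd2_rbf(real, synthetic_close), mmd2_rbf(real, synthetic_far)
ed_close, ed_far = energy_distance_sq(real, synthetic_close), energy_distance_sq(real, synthetic_far)
```

Lower values indicate a synthetic sample that is closer to the real one, which is the sense in which the lowest scores are treated as best in Table 1.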

A series of experiments was conducted to select the architectures and hyperparameters of the generative models so as to effectively generate synthetic data from the available measurements of water quality and flow cytometry–PCR. The datasets comprised nine input variables and one or three viral particle output concentrations in R1 and R2 and TV particles in R3 and R4; both input and output variables contained limited real samples for each WWTP (Table S1, Supplementary Information). In the data preprocessing stage, the features were normalized in all cases to ensure data quality and make different features comparable. All values for the viral particle concentrations were converted to the log scale (i.e., log10 VP/L). For each prediction of viral particles in the model development and ZSG performance on unseen testing datasets, we generated approximately 2000 samples for the influent treatment process and wastewater effluent matrices (Table S1). Figure 1 shows the qualitative similarity results based on principal component analysis (PCA) and t-distributed stochastic neighbor embedding (t-SNE) between the real and synthetic data of the influent and aerobic effluent treatment processes for the MODON AeMBR-based WWTP in region R1. MMCM and CM-based generated samples exhibited a close match to the distribution of the real samples (Fig. 1). MMCM and CM performed well in terms of robustness, computational efficiency, dissimilarity, and discriminability by maintaining a good trade-off between qualitative and quantitative measures, as illustrated in Fig. 1 and Table 1. These results demonstrated the significant advantage of MMCM and CM in ensuring strong similarity performances in generating synthetic datasets that closely match the original datasets and represent true WWTP system variability. 
Although these generative models rely on a distribution to describe the occurrence of input–output values, MMCM intrinsically preserves the complex temporal dynamics, which is essential when generating large datasets.

Fig. 1

Similarity results between real and synthetic data: case studies of the influent and aerobic effluent treatment processes for MODON AeMBR-based WWTP in region R1 using PCA and t-SNE plots: (a) PCA of the influent; (b) t-SNE of the influent; (c) PCA of the aerobic effluent; (d) t-SNE of the aerobic effluent.

Table 1 provides several quantitative evaluation scores—MMD, FID, WD, and ED metrics—to evaluate the proposed generative models and assess the quantitative similarity or dissimilarity performance between the real and synthetic generative datasets. Overall, the MMCM and CM generative models outperformed the MCM, and GMM generative models through the MMD, FID, WD, and ED evaluation measures. In addition, the following conventional log removal value (LRV)

$$\text{LRV}={\log}_{10}\left({C}_{\text{influent}}\right)-{\log}_{10}\left({C}_{\text{effluent}}\right)$$

which evaluates the difference of virus concentration of untreated and treated water was proposed to ensure the consistency and accuracy of the real and synthetic datasets. The results of the quantitative LRV assessment of the generative models showed that the MMCM-LRVs and CM-LRVs were 0.54, 0.59 for TV, 1.17, 1.17 for AdV, and 1.16, 1.16 for PMMoV, and their corresponding real-LRVs were 0.59, 1.10, 1.35, respectively. Notably, the MMCM and CM generative models showed more similar LRVs to the real datasets than the MCM and GMM generative models and achieved better LRV performance across the TV, AdV, and PMMoV concentrations. These results demonstrated the significant advantage of MMCM and CM in ensuring strong similarity performances in generating synthetic datasets that closely match the original datasets and represent true WWTP system variability.
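The LRV comparison above can be reproduced with a few lines of arithmetic. The sketch below evaluates the LRV formula on a hypothetical influent/effluent pair and then computes the absolute gap between the synthetic and real LRVs quoted in this paragraph; the function names and the example concentrations are illustrative assumptions, while the LRV values are those reported in the text.

```python
import numpy as np

def log_removal_value(c_influent, c_effluent):
    """LRV = log10(C_influent) - log10(C_effluent)."""
    return np.log10(c_influent) - np.log10(c_effluent)

# Hypothetical concentrations (VP/L), chosen only to illustrate the formula.
lrv_example = log_removal_value(1e8, 1e7)  # one log of removal

# LRVs reported in the text for region R1.
real_lrv = {"TV": 0.59, "AdV": 1.10, "PMMoV": 1.35}
mmcm_lrv = {"TV": 0.54, "AdV": 1.17, "PMMoV": 1.16}
cm_lrv = {"TV": 0.59, "AdV": 1.17, "PMMoV": 1.16}

def lrv_gap(synthetic, real):
    """Absolute difference between synthetic and real LRVs, per virus."""
    return {name: round(abs(synthetic[name] - real[name]), 2) for name in real}

mmcm_gap = lrv_gap(mmcm_lrv, real_lrv)  # TV 0.05, AdV 0.07, PMMoV 0.19
cm_gap = lrv_gap(cm_lrv, real_lrv)      # TV 0.00, AdV 0.07, PMMoV 0.19
```

Because the viral concentrations are stored on the log10 scale, the same LRV can also be obtained by directly subtracting the log-scale effluent value from the log-scale influent value.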

Feature correlation

Reducing the redundancy between input features or variables in the feature selection stage is crucial to developing consistent and accurate machine learning models. This step was conducted using Pearson’s correlation metric, which analyzes the linear dependency between variables. It is formulated as follows:

$$r=\frac{{\sum }_{i=1}^{n}\left({x}_{i}-\overline{x}\right)\left({y}_{i}-\overline{y}\right)}{\sqrt{{\sum }_{i=1}^{n}{\left({x}_{i}-\overline{x}\right)}^{2}{\sum }_{i=1}^{n}{\left({y}_{i}-\overline{y}\right)}^{2}}}$$

where \({x}_{i}\) and \({y}_{i}\) are the samples of features \(x\) and \(y\), and \(\overline{x}\) and \(\overline{y}\) are the mean values of the features \(x\) and \(y\). Two input features are highly correlated when \(|r|\) is close to 1. Pearson correlation helps identify strong linear relationships between input features and reduces unnecessary redundancy between input variables in the data preprocessing stage. Figure S5 illustrates the correlation between the features of the real and generated influent and effluent datasets in region R1.

The \(r\)-value between any two input variables of the real data and of all generated datasets did not exceed the threshold of \(r=0.99\) (Fig. S5) used in 8 to eliminate features, so no feature was removed before the subsequent model development stage. These correlation results demonstrated that the proposed generative models performed well, which indicates the accuracy of these models in avoiding multicollinearity and preserving the distributions between the original and synthetic generated input features (Fig. S5). Preserving all the input features is particularly important for developing machine learning models in the source domain and achieving zero-shot generalization in the target domain across various wastewater treatment matrices.
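A minimal sketch of this redundancy check is given below: it computes Pearson's coefficient for every pair of input features and lists the pairs that exceed the 0.99 threshold. Applying the threshold to the absolute value of the coefficient, the function names, and the toy feature matrix (including the artificially duplicated column) are assumptions made for illustration.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation coefficient between two 1-D samples."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    return float(np.sum(xc * yc) / np.sqrt(np.sum(xc**2) * np.sum(yc**2)))

def redundant_pairs(features, names, threshold=0.99):
    """Return (name_i, name_j, r) for every feature pair with |r| above the threshold."""
    flagged = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            r = pearson_r(features[:, i], features[:, j])
            if abs(r) > threshold:  # assumption: threshold applied to |r|
                flagged.append((names[i], names[j], r))
    return flagged

# Toy data (hypothetical): three independent features plus a near-copy of the first.
rng = np.random.default_rng(0)
base = rng.normal(size=(500, 3))
near_copy = 2.0 * base[:, 0] + 1e-3 * rng.normal(size=500)
features = np.column_stack([base, near_copy])
names = ["pH", "TDS", "COD", "pH_rescaled"]

flagged = redundant_pairs(features, names)
```

In this toy case only the artificially duplicated pair is flagged; an empty list would correspond to the situation reported here, in which all input features are retained.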

DA-LSTM zero-shot generalization based on generative models

The DA mechanism comprises input and temporal attention layers providing key information to the neural network by assigning and weighting essential features and ignoring the contribution of irrelevant factors25,26,27 (Text S2.3.1, Supplementary Information). The core of the DA-LSTM method is to introduce two attention mechanism stages to address the shortcomings of traditional time series prediction methods when facing long-term dependencies and multiple driving sequences (Text S2.3.1–S2.3.3, Supplementary Information).

In the input stage, DA uses the input attention mechanism to dynamically select the external driving sequence that is most relevant to the prediction (Text S2.3.3, Supplementary Information). Specifically, by calculating the correlation with the previous encoder’s hidden state, the model assigns an attention weight to each driving sequence so that irrelevant information in the input sequence is suppressed, enhancing the model’s ability to focus on useful features. In the time stage, DA further processes the encoder’s hidden state through the time attention mechanism. This mechanism relies on the current state of the decoder to weigh the encoder’s hidden state at each time step to produce a weighted context vector that contains the most relevant long-term dependency information in the time series for the decoder to generate the final prediction. The introduction of the dual-stage attention mechanism, along with the Markov chain and random walk generative model, enables the model to effectively select important input features while fully capturing the long-term temporal dependencies of time series data, thereby improving prediction accuracy.
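To make the two attention stages concrete, the following NumPy sketch computes input-attention weights over the driving series and temporal-attention weights over the encoder hidden states. It is a simplified illustration under stated assumptions, not the implementation used in this study: the additive scoring form, the parameter matrices `W_h`, `U_x`, `W_d`, `U_h` and vectors `v_in`, `v_t`, the window length, and the random values standing in for learned parameters and LSTM states are all hypothetical.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

def input_attention(x_window, h_prev, W_h, U_x, v):
    """Weights over the driving series.
    x_window: (T, n_features) input window; h_prev: (m,) previous encoder hidden state."""
    scores = np.array([
        v @ np.tanh(W_h @ h_prev + U_x @ x_window[:, k])
        for k in range(x_window.shape[1])
    ])
    return softmax(scores)

def temporal_attention(enc_states, d_prev, W_d, U_h, v):
    """Weights over time steps and the resulting context vector.
    enc_states: (T, m) encoder hidden states; d_prev: (p,) current decoder state."""
    scores = np.array([v @ np.tanh(W_d @ d_prev + U_h @ h) for h in enc_states])
    beta = softmax(scores)
    return beta, beta @ enc_states

# Hypothetical sizes: window of 10 steps, 9 input features, hidden dimension 64.
T, n_features, m, p = 10, 9, 64, 64
rng = np.random.default_rng(0)
x_window = rng.normal(size=(T, n_features))
h_prev, d_prev = rng.normal(size=m), rng.normal(size=p)
enc_states = rng.normal(size=(T, m))  # stand-in for LSTM encoder outputs

alpha = input_attention(x_window, h_prev,
                        0.1 * rng.normal(size=(T, m)), 0.1 * rng.normal(size=(T, T)),
                        rng.normal(size=T))
x_weighted = alpha * x_window[-1]  # re-weighted features at the latest time step
beta, context = temporal_attention(enc_states, d_prev,
                                   0.1 * rng.normal(size=(m, p)), 0.1 * rng.normal(size=(m, m)),
                                   rng.normal(size=m))
```

In a trained model the weights `alpha` and `beta` would be learned jointly with the LSTM encoder and decoder; here they only show that each stage produces a normalized weighting, over features and over time steps respectively.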

The description of the DA-LSTM framework and the attention mechanisms and parameters involved, as well as its algorithm, are detailed in the supplementary material (Section S2.3, Supplementary Information). The DA-LSTM algorithm, based on the generative models, is provided in Algorithm S2, and its schematic representation, based on the MMCM generative model and its architecture, is depicted in Fig. 2.

Fig. 2

Schematic representation of the DA-LSTM based on the novel MMCM generative model. The flowchart includes three main parts: MMCM, DA-LSTM algorithm, and zero-shot generalization. The MMCM part is mainly divided into three sections: (a) Added Gaussian noise: data enhancement by adding a Gaussian noise distribution to the real data; (b) Data discretization: discretize the data and divide it into buckets; (c) Markov chain generator: construct a Markov chain to generate data through a random walk. The DA-LSTM part is mainly divided into two sections: (d) input attention: assigning weights to the input features of the current time step and generating a weighted feature vector helps the model select the most important input features at each time step; (e) temporal attention: assigning weights to the encoder hidden states at all time steps and aggregating them into a vector helps the decoder focus on the most relevant time step information in the encoder when predicting or updating the model; (f) Zero-shot generalization: test the generalization performance on the trained model with unseen datasets and evaluate the results through different indicators.
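The three MMCM steps named in this caption (Gaussian noise, discretization into buckets, Markov chain random walk) can be sketched for a single variable as follows. This is a deliberately simplified, univariate illustration and not Algorithm S1: the function name, the number of buckets, the noise level, the smoothing constant, and the uniform sampling within each bucket are assumptions, and the multivariate Gaussian component of MMCM is not reproduced.

```python
import numpy as np

def markov_chain_generate(series, n_samples, n_bins=10, noise_std=0.05, seed=0):
    """Generate a synthetic 1-D series from a short real series (illustrative only)."""
    rng = np.random.default_rng(seed)
    series = np.asarray(series, dtype=float)

    # (a) Added Gaussian noise, scaled relative to the spread of the real data.
    noisy = series + rng.normal(0.0, noise_std * series.std(), size=series.shape)

    # (b) Data discretization into equal-width buckets (states 0 .. n_bins - 1).
    edges = np.linspace(noisy.min(), noisy.max(), n_bins + 1)
    states = np.digitize(noisy, edges[1:-1])

    # Transition matrix from consecutive states, with light smoothing so that
    # every row is a valid probability distribution.
    counts = np.full((n_bins, n_bins), 1e-3)
    for current, following in zip(states[:-1], states[1:]):
        counts[current, following] += 1.0
    transition = counts / counts.sum(axis=1, keepdims=True)

    # (c) Markov chain generator: random walk over states, emitting a value
    # drawn uniformly inside the bucket of the current state.
    state = states[0]
    generated = np.empty(n_samples)
    for t in range(n_samples):
        generated[t] = rng.uniform(edges[state], edges[state + 1])
        state = rng.choice(n_bins, p=transition[state])
    return generated

# Hypothetical real sample: 60 log10 viral concentrations.
real_log10 = np.random.default_rng(1).normal(7.0, 0.5, size=60)
synthetic = markov_chain_generate(real_log10, n_samples=2000)
```

The call above mirrors the order of magnitude used in this study (a small real sample expanded to about 2000 synthetic samples), but the real values here are random placeholders.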

The effluent process drifts in the AeMBRs emerged with the underlying dynamic changes, multiple time scales, and distribution shifts on unseen datasets. The developed ML model must handle wastewater process drifts in soft sensor development to ensure globally representative and effective real-time prediction of viral concentrations and removal efficiencies, thereby capturing the complex interdependencies of water quality and viral particles in model development and model generalization tasks. In the DA-LSTM zero-shot generalization framework, the effluent process drifts were handled by fine-tuning the weights and parameters of DA and LSTM. The prediction performance on unseen datasets was preserved as slight differences in distribution shifts and dynamic changes occurred between the effluent source and target domains. Additionally, the model was built on the primary effluent source before a downstream zero-shot generalization on the second effluent. A retraining phase is needed to streamline the viral particles prediction performance on new unseen datasets coming from the third or new clarifier.

Model setting

For each model’s development in each region, the baseline effluent treatment process from the first filtration layer was selected, and its dataset was partitioned into 80% training and 20% testing sets. The first filtration layer was chosen to ensure the robustness of the ZSG across various wastewater effluent matrices and facilitate streamlining of the soft sensor modeling to provide real-time readings on how viral particles and associated viral concentrations would persist in the effluents of AeMBR-based WWTPs and their LRVs in practice. All experiments for the DA-LSTM model were conducted in a Python 3.11.9 environment to predict the viral concentrations from the water quality inputs, with both training and inference performed exclusively on a CPU. The key software components include TensorFlow 2.x for the deep learning architecture, NumPy, Pandas, and scikit-learn for data handling and evaluation, and Matplotlib for visualization.
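The 80/20 partition can be sketched as below. Two choices in this example are assumptions rather than details given in the text: the split keeps the original sample order (no shuffling), and the features are standardized with statistics computed on the training part only. The function name and the placeholder arrays are likewise hypothetical.

```python
import numpy as np

def split_and_scale(X, y, train_fraction=0.8):
    """Order-preserving train/test split followed by z-score scaling of the inputs."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n_train = int(round(train_fraction * len(X)))
    X_train, X_test = X[:n_train], X[n_train:]
    y_train, y_test = y[:n_train], y[n_train:]

    # Scaling statistics come from the training part only, to avoid leaking
    # information from the test part into the model inputs.
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    std[std == 0.0] = 1.0  # guard against constant features
    return (X_train - mean) / std, (X_test - mean) / std, y_train, y_test

# Placeholder data: 1800 samples, nine water quality inputs, one log10 viral output.
rng = np.random.default_rng(0)
X = rng.normal(size=(1800, 9))
y = rng.normal(7.0, 0.5, size=1800)
X_train, X_test, y_train, y_test = split_and_scale(X, y)
```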

Hyperparameter optimization and model evaluation

In the model training, the mean square error (MSE) of the testing sets was used as the loss function, while for ZSG, the coefficient of determination (\({R}^{2}\)) was chosen as the evaluation metric and objective function for the iterative optimization.

$${R}^{2}=1-\frac{{\sum }_{i=1}^{n}{\left({y}_{i}-\widehat{{y}_{i}}\right)}^{2}}{{\sum }_{i=1}^{n}{\left({y}_{i}-\overline{y}\right)}^{2}},\quad {\text{MSE}}=\frac{1}{n}{\sum }_{i=1}^{n}{\left({y}_{i}-\widehat{{y}_{i}}\right)}^{2}$$

The combination of hyperparameters was derived through multiple experiments and validated to yield optimal results, providing strong support for the core structure of the model. The optimal hyperparameter settings of the DA-LSTM and LSTM models in all experiments included the Adam optimizer with a learning rate of 0.001, a batch size of 128, a hidden layer dimension of 64, and 50 training epochs. The coefficient of determination (\({R}^{2}\)), RMSE, MAE, and MAPE were used to evaluate the performance of the DA-LSTM framework for predicting viral particles.

$${\text{RMSE}}=\sqrt{\frac{1}{n}{\sum }_{i=1}^{n}{\left({y}_{i}-\widehat{{y}_{i}}\right)}^{2}}, \quad {\text{MAE}}=\frac{1}{n}{\sum }_{i=1}^{n}\left|{y}_{i}-\widehat{{y}_{i}}\right|, \quad {\text{MAPE}}=\frac{1}{n}{\sum }_{i=1}^{n}\frac{\left|{y}_{i}-\widehat{{y}_{i}}\right|}{\left|{y}_{i}\right|},$$

where \({y}_{i}\) refers to the real output at sample \(i\), \(\widehat{{y}_{i}}\) denotes the predicted output at sample \(i\), \(\overline{y}\) is the mean value of the real outputs, and \(n\) is the number of dataset samples.
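The five metrics defined above can be computed directly with NumPy. The sketch below is illustrative only: the function name `regression_metrics` and the toy input values are hypothetical, and MAPE is returned as a fraction (not multiplied by 100), following the formula as written. MAPE is undefined when any real output equals zero.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return R2, MSE, RMSE, MAE, and MAPE as defined in the equations above."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    mse = np.mean(residuals ** 2)
    ss_res = np.sum(residuals ** 2)                    # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)     # total sum of squares
    return {
        "R2": 1.0 - ss_res / ss_tot,
        "MSE": mse,
        "RMSE": np.sqrt(mse),
        "MAE": np.mean(np.abs(residuals)),
        "MAPE": np.mean(np.abs(residuals) / np.abs(y_true)),
    }

# Toy values chosen only to exercise the formulas.
metrics = regression_metrics([2.0, 4.0, 6.0, 8.0], [2.5, 3.5, 6.5, 7.5])
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```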

Results

This section describes the case studies conducted to predict total virus (TV), adenovirus (AdV), and PMMoV particles using the DA-LSTM framework and assesses the estimated LRVs across the wastewater matrices located in four regions (R1, R2, R3, R4) of Saudi Arabia, according to the optimal hyperparameters provided in the Materials and Methods section. For each ZSG process, we built the baseline model from the first effluent clarifier, and then we performed ZSG across the second and third clarifiers from its source model (i.e., baseline model). To assess the LRV in each experiment, we developed an influent model corresponding to the baseline model following the conventional LRV calculation. The discussion of the ZSG performance is mainly focused on the coefficient of determination \({R}^{2}\) values, but we also utilized the root mean square error (RMSE), mean absolute error (MAE), and mean absolute percentage error (MAPE) to assess the performance from different angles.

Target viral particle prediction across wastewater matrices in region R1

The AeMBR-based WWTP located in region R1 includes a primary clarifier influent and the aerobic, sand, and MBR treatment processes (Text S1.1 and Fig. S1, Supplementary Information). Using the DA-LSTM approach, we first derived the baseline model (i.e., training and testing) of the aerobic effluent to predict TV, AdV, and PMMoV particles, and then performed zero-shot generalization (with unseen testing datasets) to predict the viral particles in the target effluent sand filtration and assessed the estimated LRVs. Second, we evaluated the effectiveness of the DA-LSTM for predicting the viral particles in the target effluent MBR. Figure 3 illustrates the model development, zero-shot generalization (ZSG) and LRV assessment for predicting TV particles across different wastewater matrices using DA-LSTM algorithms through the proposed generative models.

Fig. 3

Flowchart illustrating the case study on predicting TV particles for the MODON AeMBR-based WWTP in region R1 in which the model development, zero-shot prediction, and LRV assessment are highlighted: (a) Model development included the training and testing of the viral particles through the influent and aerobic treatment processes; (b) ZSG using DA-LSTM consisting of validating the developed source model from the aerobic effluent datasets to predict TV particles across sand filtration and MBR wastewater treatment matrices (i.e., target unseen datasets); (c) LRV assessment is derived through the estimated viral particle concentrations of the influent and effluent treatment processes.

Aerobic effluent baseline model

Owing to its predictive modeling and zero-shot generalization performance, we implemented the DA-LSTM to build the source model for TV, AdV, PMMoV, and MS2 particles from the generated datasets in region R1. This primary model development stage was crucial for comprehensively evaluating the baseline model (i.e., source model) and comparing the proposed synthetic generative models in region R1. The aerobic effluent, corresponding to the first filtration layer, was chosen as a baseline model to facilitate streamlined zero-shot generalization. DA-LSTMMMCM and DA-LSTMCM demonstrated superior testing performance for TV, AdV, and PMMoV, with \({R}^{2}\) values of 0.99, 0.99, 0.99 and 0.99, 0.98, 0.98, respectively (Table 2). These results are consistent with the data generation performance results (see Materials and Methods), confirming that the MMCM and copula (CM) generative models closely replicate real data and enhance model development effectiveness. In the model development stage, DA-LSTMMCM achieved testing \({R}^{2}\) values similar to those of DA-LSTMMMCM; however, its data generation performance was poor, which affected its zero-shot generalization performance (see Materials and Methods). Overall, these results indicate that DA-LSTMMMCM and DA-LSTMCM models are optimal for balancing data generation and model development and, consequently, are most suitable for zero-shot generalization. In addition, the training and validation loss results showed no sign of overfitting to synthetic patterns (Figs. S6, S8, S10, and S12, Supplementary Information).

Table 2 Testing performance results of the DA-LSTM based generative models for the aerobic model in region R1 using different performance metrics.

Target viral particle prediction for unseen sand effluent datasets

The zero-shot generalization, often referred to as source and target prediction tasks, involves testing a developed baseline model (i.e., source model) on unseen datasets, where the weights of the developed dual-stage attention baseline model (i.e., source model) are transferred to the target datasets (i.e., unseen testing sets). This testing procedure requires a retraining step or calibration, an adequate network design and fine-tuning of the DA weights and LSTM parameters. Since the aerobic effluent baseline model was used in the model development, the DA-LSTM approach inherits from the global dependency of the underlying data distributions, feature spaces as well as the relationships between features for the source and target effluent datasets to act as a single global model and generalize the viral particle estimates across WMs at the regional level. This process further shows the rationality of constructing the DA-LSTM model of the first filtration layer and generalizing its existing parameter knowledge to handle new unseen tasks, enabling zero-shot prediction of viral particles.

In the zero-shot generalization process, we conducted tests to predict TV, AdV, and PMMoV particles in unseen sand effluent filtration datasets using the DA-LSTM approach. DA-LSTMMMCM and DA-LSTMCM achieved remarkable zero-shot generalization performance in each viral community and maintained robust performance across the unseen sand effluent datasets. The \({R}^{2}\) values of the DA-LSTMMMCM and DA-LSTMCM models for TV, AdV, PMMoV and MS2 particles were 0.99, 0.80, 0.86, 0.99 and 0.87, 0.88, 0.80, 0.99, respectively (Table 3). In contrast, DA-LSTMGMM and DA-LSTMMCM showed poor prediction performance (Table S2, Supplementary Information). The model development performance results of the aerobic baseline model and DA-LSTM based ZSG on the sand effluent treatment for predicting MS2 particles are illustrated in Fig. S13. The MMCM-generated synthetic data led to a more accurate and robust zero-shot generalization model for all viral particles. These findings demonstrate the reliability of the DA-LSTMMMCM model in predicting viral particles and ensuring zero-shot generalization across the sand filtration process.

Table 3 Zero-shot generalization evaluations of the DA-LSTM-based generative models for predicting TV, AdV, PMMoV and MS2 particles on the unseen sand effluent data in region R1 from the corresponding aerobic effluent source models using different performance metrics. The R2 values highlighted in bold were used in the discussion of the generalization performance results.

Target viral particle prediction on unseen MBR effluent datasets

We performed tests on the unseen target MBR datasets to predict TV, AdV, and PMMoV particles from the source aerobic effluent model using the proposed DA-LSTM in the zero-shot generalization process. DA-LSTMMMCM and DA-LSTMCM achieved remarkable zero-shot generalization performance across the MBR treatment process and all the viral communities. The \({R}^{2}\) values of the DA-LSTMMMCM and DA-LSTMCM models for TV, AdV, and PMMoV particles were 0.98, 0.94, 0.99 and 0.93, 0.89, 0.89, respectively (Table 4). The results confirmed the significant advantages of DA-LSTMMMCM and DA-LSTMCM models, with DA-LSTMMMCM standing out as the best model, ensuring reliability across the ultrafiltration treatment process. In contrast, DA-LSTMGMM and DA-LSTMMCM showed poor prediction performance, as illustrated in Table S3, Supplementary Information. Using DA-LSTMMMCM and DA-LSTMCM models, the training and testing performance results of the aerobic baseline model, as well as the ZSG on the MBR effluent datasets for predicting TV, AdV, and PMMoV concentrations, are shown in Figs. S7, S9, and S11, respectively. These results demonstrated the advantages of DA-LSTM algorithms in maintaining accurate performance in the presence of multiple time scales and dynamic changes in the effluent treatment processes and distribution shifts involving the wastewater matrix datasets.

Table 4 Zero-shot generalization evaluations of the DA-LSTM-based generative models for predicting TV, AdV, and PMMoV particles on the unseen MBR effluent data in region R1 from the corresponding aerobic effluent source models using different performance metrics. The values highlighted in bold were used in the discussion of the generalization performance results.

Log removal value estimation from target prediction

The LRV was used to evaluate the potential risks associated with reusing treated wastewater effluents. It is a key indicator of the virus removal efficiency in WWTPs and has been well studied in prior research1,35,36. The conventional LRV is calculated as follows:

$${\text{LRV}}^{\text{MD}}={\log}_{10}\left({\text{C}}_{\text{influent}}^{\text{MD}}\right)-{\log}_{10}\left({\text{C}}_{\text{effluent}}^{\text{MD}}\right)$$
(1)
$${\text{LRV}}^{\text{ZSG}}={\log}_{10}\left({\text{C}}_{\text{influent}}^{\text{MD}}\right)-{\log}_{10}\left({\text{C}}_{\text{effluent}}^{\text{ZSG}}\right)$$
(2)

where \({\text{C}}_{\text{influent}}^{\text{MD}}\) and \({\text{C}}_{\text{effluent}}^{\text{MD}}\) represent the viral concentrations in the influent and effluent contributed by the model development, respectively. \({\text{C}}_{\text{effluent}}^{\text{ZSG}}\) values are the viral concentrations in the effluent contributed by the ZSG. Upon building the aerobic effluent model and performing zero-shot generalization across the effluent sand filtration, the DA-LSTM was trained and tested to predict the viral particles in the influent treatment process. This step is necessary for estimating and monitoring the LRVs of viral particles between the influent treatment and the aerobic and sand filtration processes.
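Equations (1) and (2) share the same form and differ only in which effluent estimate is used. The following sketch illustrates the calculation under that reading; the function name `log_removal_value` and the concentration values are hypothetical placeholders, not data from this study.

```python
import numpy as np

def log_removal_value(c_influent, c_effluent):
    """LRV = log10(C_influent) - log10(C_effluent), for strictly positive concentrations."""
    c_influent = np.asarray(c_influent, dtype=float)
    c_effluent = np.asarray(c_effluent, dtype=float)
    if np.any(c_influent <= 0) or np.any(c_effluent <= 0):
        raise ValueError("concentrations must be strictly positive")
    return np.log10(c_influent) - np.log10(c_effluent)

# Hypothetical concentrations (arbitrary units), for illustration only.
c_influent_md = 1.0e6    # influent, predicted by the model development stage
c_effluent_md = 1.0e5    # effluent, predicted by the model development stage
c_effluent_zsg = 1.0e4   # effluent, predicted by zero-shot generalization

lrv_md = float(log_removal_value(c_influent_md, c_effluent_md))    # Eq. (1)
lrv_zsg = float(log_removal_value(c_influent_md, c_effluent_zsg))  # Eq. (2)
print(lrv_md, lrv_zsg)
```

Both LRVs use the influent concentration from the model development stage, so any difference between them comes only from the effluent estimate.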

Figure 4 shows the estimated, generated, and real LRVs for TV, AdV, and PMMoV concentrations across the aerobic, sand, and MBR wastewater matrices in the MODON AeMBR-based WWTP in region R1. The estimated LRVs obtained from Eqs. (1) and (2) using DA-LSTMMMCM and DA-LSTMCM models demonstrated excellent prediction performance across all viral particle concentrations and wastewater matrices when compared with the actual generated LRVs. The estimated reductions of TV, AdV, and PMMoV particles achieved by DA-LSTMMMCM were 0.53, 1.17, 1.13 LRVs with the aerobic treatment process, 0.37, 0.44, 0.25 LRVs with the sand filtration, and 0.43, 0.77, 0.85 LRVs with the MBR treatment. The TV, AdV, and PMMoV particles showed 30%, 62%, 78% and 19%, 56%, 25% estimated viral reductions from the aerobic treatment to sand filtration, and the aerobic treatment to MBR treatment, respectively. The aerobic treatment was successful in reducing the AdV and PMMoV loads by a 1-log reduction, which showed precise alignment with the synthetic and ground truth datasets. MBR treatment achieved relatively decent viral retention over time, also aligning with the generated and ground truth datasets. The cumulative LRV sum of TV, AdV, and PMMoV particles over the three treatment processes using DA-LSTMMMCM attained an average of 1.33, 2.38, and 2.23, respectively. The cumulative LRVs of the TV, AdV and PMMoV particles were consistent with the ground truth, which was 1.26, 1.9, and 2.07, respectively (Fig. 4).
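The cumulative LRVs quoted above are the sums of the stage-wise estimates. The short check below reproduces that arithmetic from the DA-LSTMMMCM values reported in the text; the variable names are illustrative.

```python
# Stage-wise estimated LRVs reported in the text for DA-LSTM with MMCM in region R1.
stage_lrv = {
    "TV":    {"aerobic": 0.53, "sand": 0.37, "MBR": 0.43},
    "AdV":   {"aerobic": 1.17, "sand": 0.44, "MBR": 0.77},
    "PMMoV": {"aerobic": 1.13, "sand": 0.25, "MBR": 0.85},
}

# Cumulative LRV per virus: sum over the three treatment processes,
# rounded to two decimals to match the precision used in the text.
cumulative_lrv = {
    virus: round(sum(stages.values()), 2)
    for virus, stages in stage_lrv.items()
}
print(cumulative_lrv)
```

The sums match the reported cumulative values of 1.33, 2.38, and 2.23 for TV, AdV, and PMMoV, respectively.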

Fig. 4

Estimated and actual log removal values of TV, AdV, PMMoV, and MS2 concentrations for the aerobic, sand, and MBR wastewater matrices of the MODON AeMBR-based WWTP in region R1: (a) LRV of TV particles; (b) LRV of AdV particles; (c) LRV of PMMoV particles; and (d) LRV of MS2 concentrations. The aerobic effluent was chosen as a baseline model in the model development to facilitate streamlined DA-LSTM zero-shot generalization of TV, AdV, and PMMoV across sand and MBR wastewater matrices, and MS2 across the sand treatment process. The LRV results were obtained by deriving the influent model in region R1 to assess the viral concentrations and evaluating the difference in virus concentration between untreated and treated water.

Target viral particle prediction in region R2

The pilot AeMBR-based WWTP in region R2, namely K-WWTP, includes a primary clarifier influent and two effluents: the secondary clarifier (effluent) and chlorinated effluent (Chlor. effluent) treatment processes37 (Text S1.2 and Figure S2, Supplementary Information). We derived the baseline model of the secondary effluent for predicting TV, AdV, and PMMoV particles, which is essential for the ZSG across the chlorination effluent treatment. We then conducted ZSG for the target chlorination effluent to assess the viral particles and estimated the LRVs using DA-LSTM algorithms. The ZSG showed remarkable prediction performance across TV, AdV, and PMMoV particles, with an average \({R}^{2}\) of 0.97 for DA-LSTMMMCM and DA-LSTMCM (Table 5). The training and testing performance results of the effluent baseline model, as well as the ZSG on the chlorinated effluent datasets for predicting TV, AdV, and PMMoV concentrations using DA-LSTMMMCM and DA-LSTMCM models, are shown in Figs. S14, S15, and S16, Supplementary Information, respectively. These results indicate a clear advantage of DA-LSTM algorithms in predicting viral concentrations on completely unseen chlorination effluent datasets with underlying effluent and chlorination effluent drifts.

Table 5 Zero-shot generalization evaluations of the DA-LSTM-based generative models for predicting TV, AdV, and PMMoV on the unseen K-WWTP chlorination effluent (Chlor. effluent) datasets in region R2 from their respective effluent source models using different performance metrics.

We evaluated the LRVs based on the estimated models between the influent and effluents. Figure 5 compares the estimated and actual LRVs in region R2. The estimated LRVs derived from DA-LSTMMMCM and DA-LSTMCM models demonstrated remarkable agreement with the synthetic and ground truth LRVs across all viral particle concentrations. For instance, we observed good agreement between the estimated LRV of the ZSG, the LRV of the synthetic data, and the LRV of the ground truth. The estimated LRVs of the TV, AdV, and PMMoV particles achieved with DA-LSTMMMCM were 0.19, 3.25, 1.07 for the secondary clarifier (i.e., effluent), and 0.29, 2.67, 1.60 for the chlorination treatment. The secondary clarifier and chlorination effluent treatments showed lower LRVs for the TV particles, achieving less than a 1-log reduction, while the LRV contributions for PMMoV particles were close to a 1-log reduction. AdV achieved the highest viral retention, with a 3-log reduction for the secondary clarifier and a 2-log reduction for the chlorination effluent. The average cumulative LRVs of TV, AdV, and PMMoV particles over the two treatment processes using DA-LSTMMMCM were 0.48, 5.92, and 2.67, respectively. These results are consistent with the average cumulative LRV sum of TV, AdV, and PMMoV particles in the ground truth and synthetic datasets, which were 0.52, 5.90, 2.55 and 0.59, 6.12, 3.17, respectively, demonstrating the reliability and robustness of the ZSG using the DA-LSTMMMCM algorithm (Fig. 5).

Fig. 5

Estimated and actual log removal values of TV, AdV, and PMMoV concentrations for the secondary clarifier effluent and chlorinated effluent of the K-WWTP in region R2: (a) LRV of TV particles; (b) LRV of AdV particles; (c) LRV of PMMoV particles. The effluent was chosen as a baseline model in the model development to facilitate streamlined DA-LSTM zero-shot generalization of TV, AdV and PMMoV particles across the chlorinated effluent treatment process. The LRV results were obtained by deriving the influent model in region R2 to assess the viral concentrations and evaluating the difference in virus concentration between untreated and treated water.

Target TV particle prediction in region R3

The pilot AeMBR-based facility in region R3 (AH-WWTPs) comprises two WWTPs, (A) and (H), located in the same region. These two pilots were designed to treat municipal wastewater38. Their processes are similar to those of region R1, with aerobic and sand filtration treatments, although with some modifications (Text S1.3 and Fig. S3, Supplementary Information). Similar to the ZSG process in region R1, we constructed the DA-LSTM models for the (A) and (H) WWTP systems from their aerobic treatments. We conducted ZSG on their respective unseen sand filtration datasets to predict TV concentrations and estimated the LRVs using DA-LSTM algorithms. We also performed ZSG with a cross-validation between the (A) and (H) WWTPs for TV concentrations and estimated the LRVs using the DA-LSTMMMCM and DA-LSTMCM models.

Target TV particle prediction based on sand filtration datasets

We conducted ZSG using DA-LSTMMMCM and DA-LSTMCM to predict each target TV concentration of the (A) and (H) sand filtration treatments from their respective TV aerobic models. The ZSG evaluations of the two DA-LSTM-based generative models on the unseen TV sand effluents (A) and (H) demonstrated excellent prediction performance across the two sand filtration matrices, with an average \({R}^{2}\) of 0.99 in both cases (Table S4, Supplementary Information). Figures S17 and S18 illustrate the training and testing performance results of the TV particles in the aerobic treatment process (A) and the ZSG performance on the unseen sand filtration (A) using DA-LSTMMMCM and DA-LSTMCM algorithms. Both DA-LSTM algorithms were able to track effluent process drifts and showed reliable performance results across their corresponding wastewater effluent matrices. Figure 6 shows the estimated LRV of TV particles contributed by the model development and ZSG approaches across the wastewater effluent matrices for WWTPs (A) (Fig. 6a) and (H) (Fig. 6b) in region R3. We observed good convergence between the estimated LRVs of the model development and ZSG approaches, the LRVs of the generated synthetic data, and the LRVs of the ground truth (Fig. 6). Overall, the estimated LRVs across the two wastewater effluent matrices from the model development and ZSG varied from 0.04 to 0.54 logs, which did not exceed 1-log reduction, and met the LRVs of the ground truth ranging from 0.02 to 0.48 logs. This slight variation in TV removal efficiency across the two wastewater matrices may be attributed to the low contribution of the aerobic and sand treatment processes in the AeMBR-based WWTP.

Fig. 6

Comparison of the estimated, synthetic, and actual log removal values for the aerobic treatment in the model development and the sand effluent in the ZSG for WWTPs (A) and (H) in region R3: (a) LRV of TV particles for plant (A), where the aerobic effluent (A) was chosen as a baseline model in the model development to facilitate streamlined DA-LSTM zero-shot generalization of TV particles across the sand effluent treatment (A); (b) LRV of TV particles for plant (H), where the aerobic effluent (H) was chosen as a baseline model in the model development to facilitate streamlined DA-LSTM zero-shot generalization of TV particles across the sand effluent treatment (H). The LRV results were obtained by deriving the influent model in region R3 to assess the viral concentrations and evaluating the difference in virus concentration between untreated and treated water.

Target TV particle prediction with cross-validation between (A) and (H) datasets

We performed a cross-validation ZSG between plants (A) and (H) to assess the effectiveness of the proposed DA-LSTM algorithms in predicting TV concentrations across their respective wastewater matrices. The source models of the two plants were built from the aerobic effluent datasets of each plant. Then, we performed the ZSG of their corresponding aerobic effluent datasets in a cross-validation manner; that is, aerobic treatments (A) and (H) were considered in the model development, while aerobic processes for (H) and (A) were used for ZSG, respectively, as shown in Table S5. Figures S19 and S20 show the ZSG performance of DA-LSTMMMCM and DA-LSTMCM on the unseen TV test datasets from their source TV aerobic models in region R3. DA-LSTMMMCM and DA-LSTMCM achieved excellent ZSG performance with \({R}^{2}\) values ranging from 0.98 to 0.99 (Fig. S21, Supplementary Information). The estimated LRVs of DA-LSTMMMCM and DA-LSTMCM in the model development and ZSG stages are shown in Fig. 7. Notably, the estimated LRVs were in excellent agreement with the ground truth (Fig. 7), demonstrating the ZSG feasibility for efficiently predicting viral particles and handling effluent treatment process drifts across wastewater matrices from two WWTPs in the same location.

Fig. 7

Estimated and actual log removal values of TV concentrations based on sand filtration datasets: (a) Target TV particles performance in the aerobic (H) from the source TV particle aerobic model (A), the aerobic effluent (A) was chosen as a baseline model in the model development to facilitate streamlined DA-LSTM zero-shot generalization of TV concentrations across the aerobic effluent (H); (b) target TV particles in the sand (H) from the source TV particle sand model (A) in region R3: the aerobic effluent (H) was chosen as a baseline model in the model development to facilitate streamlined DA-LSTM zero-shot generalization of TV particles across the aerobic effluent (A). The LRVs were determined by deriving the influent model in region R3 to assess the viral concentrations and evaluating the difference in virus concentration between untreated and treated water.

DA-LSTM zero-shot generalization outperforms the state-of-the-art ML algorithms: Case study in region R4

To investigate the impact of the DA mechanism on the DA-LSTM framework, we compared the DA-LSTM and state-of-the-art ML algorithms for predicting TV particles to better assess their source-to-target ZSG performance and the impact of the DA mechanism on model fitness. These standard ML included Artificial Neural Network (ANN), Extreme Gradient Boosting (XGBoost), Random Forest (RF), and LSTM methods. The comparison was performed on the new pilot AeMBR-based WWTP designed to treat municipal wastewater in region R4 (P1P2-WWTPs), which has two WWTPs, (P1) and (P2), in the same region R4. The treatments at these WWTPs were similar to the pilot AeMBR in region R3, with aerobic and sand filtration treatment processes (Text S1.4 and Fig. S4, Supplementary Information).

Similar to the ZSG analysis in region R3, we built the DA-LSTM model for the baseline aerobic effluent treatment (P1), then we performed ZSG on the unseen aerobic effluent treatment (P2). We used MMCM and CM data generators to generate synthetic data for the model development and ZSG performance of DA-LSTM and standard ML algorithms to predict TV concentrations at (P1) and (P2) WWTPs. The mean square error results for predicting TV particles across the unseen aerobic (P2) from the source model (P1) showed that DA-LSTM outperformed the ANN, RF, XGBoost, and LSTM in region R4 (Fig. S22 and Table S6, Supplementary Information). Figure S21 shows the training, testing, and ZSG performance of the predicted and actual TV values for the LSTM and DA-LSTM models. Overall, DA-LSTMMMCM and DA-LSTMCM achieved excellent ZSG performance results, with the predicted values closely distributed along the trend line and with accuracy and consistency maintained across the aerobic effluent treatment (P2) (Figs. S22, S23a–c, S23d–f, Tables S6 and S7, Supplementary Information).

The estimated LRVs of LSTM in the model development and ZSG stages did not meet the ground truth, while the LRVs of DA-LSTMs remained consistent with the real samples (Fig. 8). For instance, the estimated LRVs using LSTM were 0.03 logs for LSTMMMCM and 0.07 logs for LSTMCM in the model development on the aerobic effluent treatment (P1), and 0.11 logs for LSTMMMCM and 0.13 logs for LSTMCM in the ZSG, as illustrated in Fig. 8. The LRVs for LSTMMMCM and LSTMCM achieved average deviation errors of 79.4% and 67.5% from the model development and ZSG to the ground truth LRVs, respectively. These ZSG performance results demonstrate the significant superiority of DA-LSTM over LSTM in the context of this experiment.

Fig. 8

Estimated and actual log removal values of TV particles using DA-LSTM and LSTM algorithms for the target aerobic effluent datasets (P2) from the source aerobic effluent model (P1) in region R4: (a) DA-LSTMMMCM and DA-LSTMCM; the aerobic effluent (P1) was chosen as a baseline model in the model development to facilitate streamlined DA-LSTM zero-shot generalization of TV particles across the aerobic effluent (P2); and (b) LSTMMMCM and LSTMCM: the aerobic effluent (P1) was chosen as a baseline model in the model development to facilitate streamlined DA-LSTM zero-shot generalization of TV particles across the aerobic effluent (P2). The LRVs were determined by deriving the influent model in region R4 to assess the viral concentrations and evaluating the difference in virus concentration between untreated and treated water.

Discussion

The efficiency of wastewater-based epidemiology techniques based on localized experiments for measuring viral particles is not globally representative and is often compromised due to sampling and analytical methods and, more importantly, differences in experimental systems, which delay the decision-making process. Isolated ML models for predicting viral particles are often limited to model development tasks, preventing model performance enhancement and generalization capabilities on unseen testing datasets from different wastewater matrices and WWTPs. Furthermore, the real-time prediction of viral particles should account for the rapid distribution shifts of the datasets and process drifts in the treatment conditions and variations in the experimentation. These challenges impede subsequent efforts to develop efficient zero-shot generalization methods to estimate viral particles and monitor log removal values across different wastewater matrices in real time.

We conducted case studies to predict viral particles using a ZSG approach based on a DA-LSTM algorithm to generalize the model development process across wastewater matrices from different WWTPs in Saudi Arabia. The zero-shot generalization model relies on an extended Markov chain generative model (MMCM), which is inherited from the standard Markov chain model (MCM). The MMCM generative model is a diffusive model that includes noise propagation via the input–output variables with a tendency to move to zero and defines a probability distribution over time. Adding Gaussian noise to the MCM scheme with hierarchical priors generates high-quality synthetic samples that outperform even those with the Gaussian mixture model (GMM) and autoencoders, including generative adversarial networks (Text S2.1, Supplementary Information). Integrating the DA structure into the local LSTM predictor and the MMCM and CM generative models significantly enhanced ZSG performance for predicting viral particles on unseen datasets. Combining these DA, LSTM, and generative models significantly strengthened their abilities to adapt to time-varying characteristics and distribution shifts in WMs. This further highlighted DA’s contribution when faced with process drifts in the wastewater treatment process. The validation of the unseen test sets was based on a direct testing method. It did not intrinsically rely on an adaptation, simplifying the design while reducing the computational time.
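The direct testing protocol described above (fit on the source effluent, then score on the unseen target effluent without any adaptation) can be sketched as follows. This is an illustration of the evaluation protocol only: an ordinary least-squares model stands in for the DA-LSTM, and the source and target datasets are randomly generated placeholders with a mild distribution shift. All names and values are assumptions.

```python
import numpy as np

def fit_linear(X, y):
    """Fit ordinary least squares with an intercept (stand-in for the source model)."""
    design = np.column_stack([X, np.ones(len(X))])
    weights, *_ = np.linalg.lstsq(design, y, rcond=None)
    return weights

def predict_linear(weights, X):
    """Apply the frozen source weights to new inputs."""
    design = np.column_stack([X, np.ones(len(X))])
    return design @ weights

def r2_score(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

rng = np.random.default_rng(42)
true_w = np.array([0.8, -0.5, 0.3])  # hypothetical input-output relationship

# Hypothetical source domain (e.g., the first effluent).
X_source = rng.normal(0.0, 1.0, size=(200, 3))
y_source = X_source @ true_w + rng.normal(0.0, 0.05, size=200)

# Hypothetical target domain with a slightly shifted input distribution.
X_target = rng.normal(0.3, 1.2, size=(100, 3))
y_target = X_target @ true_w + rng.normal(0.0, 0.05, size=100)

weights = fit_linear(X_source, y_source)            # trained on the source only
y_target_pred = predict_linear(weights, X_target)   # no refit on the target
r2_target = r2_score(y_target, y_target_pred)
print(f"target R2 without adaptation: {r2_target:.3f}")
```

The target data are never passed to `fit_linear`, which is what distinguishes direct testing from approaches that adapt the model to the target domain.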

The DA-LSTM zero-shot generalization model relies on historical physicochemical and flow cytometry–PCR data. These datasets can be collected in a closed-loop operation with controlled and calibrated input variables or open-loop settings without the need for a state feedback mechanism. For instance, the flow rate, shock loads, or chemical dosing input changes can be controlled with fixed values at desired steady state conditions representing the optimal WWTP system state within an operating cycle. These feedback control or optimal control routes are often needed to enhance quality production and energy demands at different levels and to monitor process anomalies in fault diagnosis problems. Herein, the most essential part in the development of a soft sensor capable of predicting viral particles in open-loop or closed-loop settings generally relies on the available measurements of the collected historical data. The latter might contain all possible conditions of the AeMBR-based WWTPs including environmental and chemical changes, and process input variations. The absence of the dynamic inputs does not necessarily mean that the changes of the process inputs were not considered in the historical data. In the model development process, the control input changes, including state control and parametric control, do not affect the prediction and estimation performances of the data-driven soft sensor models. Although multiple time scales and temporal dynamics (i.e., effluent process drifts and time-varying behaviors) were involved across wastewater matrices and WWTPs, the proposed DA-LSTM significantly strengthened its ability to adapt to these time-varying changes in the model performance enhancement by assigning specific weights to the input features that are transferred to the target unseen testing data.

DA-LSTM zero-shot generalization performance from various generative models and across different wastewater matrices

The key novelty of the proposed DA-LSTM algorithm lies in the generative models that confer the ability to handle unseen datasets. Four data generative models (GMM, MCM, MMCM, and CM) were proposed to generate synthetic datasets from the measured datasets. We evaluated these models with qualitative and quantitative metrics, including the LRVs across different treatment processes, to assess their performance in generating high-quality samples from various angles (Fig. 1). MMCM and CM greatly contributed to the model development and zero-shot generalization performance. The quantitative evaluation scores (MMD, FID, WD, and ED), which measure the similarity or dissimilarity between real and synthetic datasets, confirmed the significant advantages of the generative models, with MMCM standing out as the best model. Further, the quantitative evaluation of the LRV across the wastewater matrices confirmed the consistency and accuracy of MMCM in maintaining the LRV of the synthetic data as close as possible to the ground truth (Table 1).
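To illustrate how a similarity score between real and synthetic samples can be computed, the sketch below implements a biased estimate of the squared MMD, assuming that MMD here denotes the maximum mean discrepancy. The radial basis function kernel, the bandwidth of 1.0, and the randomly generated samples are assumptions for illustration; the settings used in this study are not specified in this section.

```python
import numpy as np

def rbf_kernel(a, b, bandwidth):
    """Gaussian (RBF) kernel matrix between the rows of a and the rows of b."""
    sq_dists = (np.sum(a ** 2, axis=1)[:, None]
                + np.sum(b ** 2, axis=1)[None, :]
                - 2.0 * a @ b.T)
    # Clip tiny negative values caused by floating-point round-off.
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * bandwidth ** 2))

def mmd_squared(real, synthetic, bandwidth=1.0):
    """Biased (V-statistic) estimate of the squared maximum mean discrepancy."""
    k_rr = rbf_kernel(real, real, bandwidth).mean()
    k_ss = rbf_kernel(synthetic, synthetic, bandwidth).mean()
    k_rs = rbf_kernel(real, synthetic, bandwidth).mean()
    return k_rr + k_ss - 2.0 * k_rs

# Hypothetical samples: one synthetic set close to the real data, one far from it.
rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, size=(300, 4))
close_synth = rng.normal(0.0, 1.0, size=(300, 4))
far_synth = rng.normal(2.0, 1.0, size=(300, 4))

mmd_close = mmd_squared(real, close_synth)
mmd_far = mmd_squared(real, far_synth)
print(f"close: {mmd_close:.4f}, far: {mmd_far:.4f}")
```

A lower value indicates that the synthetic samples are closer in distribution to the real ones, so a generator whose samples resemble `close_synth` would score better than one whose samples resemble `far_synth`.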

In the model development and ZSG stages, DA-LSTM-MMCM (DA-LSTM trained with MMCM-generated data) showed excellent source-to-target viral particle predictions on unseen datasets across various WMs from different WWTPs. The strong generalizability of DA-LSTM-MMCM for predicting viral particles was confirmed in three wastewater effluent treatment processes (aerobic, sand, and MBR) in the AeMBR-based WWTP in region R1 (Tables 3 and 4; Fig. 4; Supplementary Tables S2 and S3; Supplementary Figures S7, S9, S11, and S13), the secondary clarifier and chlorinated effluent treatment processes in region R2 (Fig. 5; Table 5; Supplementary Figures S14–S16), and the two plants (A) and (H) of the pilot AeMBR in region R3 (Figs. 6 and 7; Supplementary Figures S17–S20; Supplementary Tables S4 and S5). The analysis was further extended to compare standard ML algorithms with DA-LSTM in terms of source-to-target TV particle prediction performance and the impact of the DA mechanism on model fitness (Fig. 8; Supplementary Figure S23; Supplementary Tables S6 and S7). Overall, the results showed consistent prediction performance of the DA-LSTM algorithm across various generative models and WMs from different WWTPs in different regions.

Estimated LRVs across different wastewater matrices

The current study investigated the differences between the LRVs estimated from the target predictions of DA-LSTM-MMCM and the ground truth across WMs from different WWTPs. The direct relationship between the estimated viral concentrations and the LRV, obtained through cost-effective water quality measurements, allows faster, near-real-time monitoring of deviations from the optimal operating points. The cumulative LRVs of TV, AdV, and PMMoV particles estimated with DA-LSTM-MMCM across the aerobic, sand, and MBR treatment processes in region R1, and across the secondary clarifier and chlorinated effluent treatments in region R2, were consistent with the ground truth. The prediction consistency was also verified in region R3 for TV particles across the sand filtration process and in the two pilot AeMBR plants (A) and (H). The results revealed high consistency between the estimated LRVs and the ground truth values.
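
The relationship between concentrations and LRV can be sketched as follows, assuming the standard definition LRV = log10(C_in / C_out) for each treatment stage. The concentration values and the stage labels in the comments are hypothetical and are used only to show the arithmetic.

```python
import math

def log_removal_value(c_in, c_out):
    """LRV of one stage from inlet and outlet concentrations."""
    if c_in <= 0 or c_out <= 0:
        raise ValueError("concentrations must be positive")
    return math.log10(c_in / c_out)

def stage_lrvs(concentrations):
    """Per-stage LRVs for concentrations ordered along the treatment train."""
    return [log_removal_value(a, b)
            for a, b in zip(concentrations, concentrations[1:])]

# Hypothetical concentrations (particles/mL) at four sampling points,
# e.g. influent followed by three successive treatment stages.
train = [1e8, 1e7, 5e6, 1e4]
per_stage = stage_lrvs(train)
cumulative = sum(per_stage)
```

Because the per-stage ratios telescope, the cumulative LRV equals log10 of the first concentration divided by the last, which is log10(1e8 / 1e4) = 4 for these values. Applying the same functions to predicted and measured concentrations gives the estimated and ground-truth LRVs that are compared in the text.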

LRV results across different treatment processes are known to be influenced by several factors, including the type of membrane rejection and the biomass concentration involved in virus removal. For instance, the model development and zero-shot generalization results showed in all cases that the cumulative TV removal efficiency was relatively small across all WMs, while AdV and PMMoV removal efficiencies were substantially higher for the MBR secondary clarifier and chlorinated effluent treatment processes in region R2. This shows that TV, AdV, and PMMoV particles responded differently to treatment: TV particles were resilient and showed lower removal efficiencies, while AdV and PMMoV were more susceptible to the aerobic, MBR, and chlorination treatments. Soft sensor development allowed us to assess the TV, AdV, and PMMoV concentrations across the wastewater matrices based only on wastewater physicochemical parameters, and thereby to monitor LRV efficiencies and differences between WMs at the same WWTP sharing the same wastewater source. Our findings highlight the importance of handling wastewater effluent drifts in soft sensor development. Doing so ensures globally representative and effective estimation of viral concentrations and removal efficiencies from indirect physicochemical water quality measurements across WMs, and captures the complex interdependencies between water quality and viral particles in the model development and zero-shot generalization stages.

Zero-shot performance comparison with existing machine learning models

We conducted a comparative analysis of DA-LSTM and state-of-the-art ML algorithms for predicting TV particles in region R4. Specifically, we focused on a recent work that proposed artificial neural network (ANN) and random forest (RF) models for predicting viral particles (ref. 11), in which the authors predicted log removal values of PMMoV and Norovirus GII concentrations. It is essential to note that the ML algorithms developed in ref. 11 do not consistently demonstrate generalization performance across various wastewater matrices. Furthermore, these models do not incorporate calibration and adaptation mechanisms for unseen test datasets from various wastewater matrices and WWTPs, which limits their performance in the presence of effluent process drifts.

LSTM performed better than the standard ANN, XGBoost, and RF models; however, the standard ML models showed poor ZSG predictions and significant errors on the unseen datasets, which limits their ability to handle effluent treatment process drifts and the time-varying characteristics or shifts of wastewater matrices (Figs. S22, S23a–c, S23d–f, Tables S6 and S7, Supplementary Information). Although the ANN, XGBoost, RF, and LSTM models performed well on the training and test data (Tables S6 and S7), their performance dropped sharply on the zero-shot generalization tasks, especially when processing new unseen inputs. These standard ML algorithms did not perform adequately at this stage, which demonstrates that they still have limitations in capturing long-term dependencies and in processing the dynamically changing data encountered in new unseen input sequences. DA-LSTM performed significantly better than the standard ML models in all training, testing, and validation stages (Tables S6 and S7).
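
The gap between in-distribution and zero-shot performance can be illustrated with a small sketch. A least-squares linear model stands in for the standard ML models, and the source and target data are hypothetical random samples with a deliberately drifted input-output relationship; none of this reproduces the models or datasets of the study. The sketch only shows the evaluation protocol: fit on the source, then score on the target without retraining.

```python
import numpy as np

def r2_score(y_true, y_pred):
    """Coefficient of determination; can be negative for poor predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

rng = np.random.default_rng(0)

# Hypothetical source matrix: two features, linear response plus noise.
x_src = rng.normal(size=(200, 2))
y_src = x_src @ np.array([1.0, -2.0]) + 0.1 * rng.normal(size=200)

# Hypothetical target matrix with a drifted input-output relationship.
x_tgt = rng.normal(size=(200, 2))
y_tgt = x_tgt @ np.array([1.0, 0.5]) + 0.1 * rng.normal(size=200)

# Fit the stand-in model on the first 150 source samples only.
coef, *_ = np.linalg.lstsq(x_src[:150], y_src[:150], rcond=None)

# Held-out source samples versus the unseen target matrix.
r2_in_distribution = r2_score(y_src[150:], x_src[150:] @ coef)
r2_zero_shot = r2_score(y_tgt, x_tgt @ coef)  # no retraining on the target
```

In this toy setting the held-out source score stays high while the zero-shot score collapses, which mirrors the qualitative behavior reported for the standard ML models under process drift.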

Conclusion

The proposed ZSG approach based on a dual-stage attention mechanism marks a progressive step toward model generalization across unseen datasets in WWTPs, specifically across wastewater effluent matrices within a geographical region. The synergistic ZSG framework based on DA-LSTM relies on an extended Markov chain generative model to address the issues of adapting and calibrating to new unseen testing datasets, thereby reducing the covariate shift (i.e., process shift) between source and target prediction tasks while improving the predictive modeling performance by continuously updating the DA weights. The results demonstrated the crucial role of integrating DA-LSTM and the extended Markov chain generative model (MMCM) for accurately predicting human enteric pathogens and associated viral concentrations, and for estimating LRVs across various WMs to support rapid monitoring and response in protecting community health. Future work should focus on effectively predicting viral concentrations and virus removal efficiency by streaming the DA-LSTM prediction process under different scenarios and in the presence of abrupt changes across different technologies, including anaerobic MBR plants.

The current version of the DA-LSTM zero-shot generalization framework requires the baseline model to be retrained before a downstream model enhancement analysis on new unseen datasets, specifically when adapting the viral particle prediction across various wastewater matrices. Hence, adaptation is mainly based on a retraining phase similar to out-of-distribution testing (ref. 22), which can undermine online adaptation if constant forgetting factors or periodic intervals are not implemented.

In future studies, transfer learning with an online adaptation phase could be considered to alleviate the retraining limitation of the current version of the DA-LSTM. Such a phase could comprise a receding horizon scheme based on statistical hypothesis testing of the distribution of the error between the predicted and true values of the viral particles. Furthermore, broader geographical representation and additional datasets capturing differences in industrial activity, climate, and microbial communities will be essential to overcome the limited number of case studies available in the current datasets.
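
One possible form of such an adaptation trigger is sketched below. It is an assumption-laden illustration and not a method from this study: a fixed-length window slides over the prediction errors, and a two-sided z-test of the null hypothesis that the mean error is zero flags windows in which retraining or adaptation might be warranted. The window length, the significance level, the choice of test, and the function names are all hypothetical.

```python
import math
from statistics import NormalDist, mean, stdev

def drift_detected(residuals, alpha=0.05):
    """Two-sided z-test of H0: the mean prediction error is zero."""
    n = len(residuals)
    if n < 2:
        return False
    s = stdev(residuals)
    if s == 0.0:
        # Constant errors: any nonzero constant is a systematic bias.
        return mean(residuals) != 0.0
    z = mean(residuals) / (s / math.sqrt(n))
    critical = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    return abs(z) > critical

def receding_horizon_flags(y_true, y_pred, window=20, alpha=0.05):
    """Slide a fixed-length window over the errors and test each position."""
    errors = [p - t for t, p in zip(y_true, y_pred)]
    return [drift_detected(errors[i - window:i], alpha)
            for i in range(window, len(errors) + 1)]

# Hypothetical series: unbiased errors first, then a systematic offset.
y_true = [0.0] * 40
y_pred = ([0.1 if i % 2 == 0 else -0.1 for i in range(20)]
          + [1.1 if i % 2 == 0 else 0.9 for i in range(20)])
flags = receding_horizon_flags(y_true, y_pred, window=20)
```

The first window contains zero-mean errors and is not flagged, whereas the last window contains a persistent offset and is flagged. In practice a distribution-level test and a correction for repeated testing would likely be needed.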

Data availability statement

The datasets used and/or analyzed are available from the corresponding author upon request.

References

  1. Harb, M. & Hong, P. Y. Molecular-based detection of potentially pathogenic bacteria in membrane bioreactor (MBR) systems treating municipal wastewater: a case study. Environ. Sci. Pollut. Res. 24, 5370–5380 (2017).

  2. Grabow, W. O. K. The virology of wastewater treatment. Water Res. 2, 675–701 (1968).

  3. Corpuz, A. V. et al. Viruses in wastewater: Occurrence, abundance and detection methods. Sci. Total Environ. 745, 140910 (2020).

  4. Manti, A. et al. Bacterial cell monitoring in wastewater treatment plants by flow cytometry. Water Environ. Res. 80, 346–354 (2008).

  5. Alharbi, M., Hong, P. Y. & Laleg-Kirati, T. M. Sliding window neural network-based sensing of bacteria in wastewater treatment plants. J. Process Control 110, 35–44 (2022).

  6. Zambrano, J., Krustok, I., Nehrenheim, E. & Carlsson, B. A simple model for algae–bacteria interaction in photo-bioreactors. Algal Res. 19, 155–161 (2016).

  7. Yang, J. et al. Model-based evaluation of algal-bacterial systems for sewage treatment. J. Water Process Eng. 38, 101568 (2020).

  8. Ekundayo, T. C., Adewoyin, M. A., Ijabadeniyi, O. A., Igbinosa, E. O. & Okoh, A. I. Machine learning-guided determination of Acinetobacter density in waterbodies receiving municipal and hospital wastewater effluents. Sci. Rep. 13, 7749 (2023).

  9. Alharbi, M. S., Hong, P. Y. & Laleg-Kirati, T. M. Adaptive neural network-based monitoring of wastewater treatment plants. Proc. 2022 American Control Conference (ACC), 3204–3211 (IEEE, 2022).

  10. Aljehani, F., N’Doye, I., Hong, P.-Y., Monjed, M. K., & Laleg-Kirati, T.-M. Bacteria cells estimation in wastewater treatment plants using data-driven models. IFAC-PapersOnLine 58, 718–723 (2024). 12th IFAC Symposium on Advanced Control of Chemical Processes.

  11. Kadoya, S. et al. A soft-sensor approach for predicting an indicator virus removal efficiency of a pilot-scale anaerobic membrane bioreactor (AnMBR). J. Water Health 22, 967–977 (2024).

  12. Bastin, G. On-line estimation and adaptive control of bioreactors (Elsevier, 2013).

  13. Zambrano, J., Krustok, I., Nehrenheim, E. & Carlsson, B. A simple model for algae-bacteria interaction in photo-bioreactors. Algal Res. 19, 155–161 (2016).

  14. Dochain, D. State and parameter estimation in chemical and biochemical processes: A tutorial. J. Process Control 13(8), 801–818 (2003).

  15. Schugerl, K. & Bellgard, K.-H. Bioreactor models in Bioreaction engineering: modeling and control (Springer-Verlag, 2000).

  16. Farhi, N., Kohen, E., Mamane, H. & Shavitt, Y. Prediction of wastewater treatment quality using LSTM neural network. Environ. Technol. Innov. 23, 101632 (2021).

  17. Pisa, I., Santin, I., Morell, A., Vicario, J. L. & Vilanova, R. LSTM-based wastewater treatment plants operation strategies for effluent quality improvement. IEEE Access 7, 159773–159786 (2019).

  18. Mokhtari, H. A., Bagheri, M., Mirbagheri, S. A. & Akbari, A. Performance evaluation and modelling of an integrated municipal wastewater treatment system using neural networks. Water and Environment Journal 34, 622–634 (2020).

  19. Wang, R. et al. Model construction and application for effluent prediction in wastewater treatment plant: Data processing method optimization and process parameters integration. J. Environ. Manage. 302, 114020 (2022).

  20. Alharbi, M., Hong, P.-Y. & Laleg-Kirati, T.-M. Sliding window neural network based sensing of bacteria in wastewater treatment plants. J. Process Control 110, 35–44 (2022).

  21. Myshkevych, Y., N’Doye, I., Sanchez Medina, J., Aljehani, F., Xiong, Y., Laleg-Kirati, T.-M., & Hong, P.-Y. Combining flow virometry with tree-based machine learning models for rapid virus particle estimation in different wastewater matrices. Water Research, 123905 (2025).

  22. Aljehani, F., N’Doye, I., Hong, P.-Y., Monjed, M. K. & Laleg-Kirati, T.-M. A calibration framework toward model generalization for bacteria concentration estimation in wastewater treatment plants. Sci. Rep. 14, 31218 (2024).

  23. Chen, J., N’Doye, I., Myshkevych, Y., Aljehani, F., Hong, P.-Y., Monjed, M. K., & Laleg-Kirati, T.-M. Viral particle prediction in wastewater treatment plants using nonlinear lifelong learning models. npj Clean Water 8, 28 (2025).

  24. Alvi, M., French, T., Cardell-Oliver, R., Batstone, D. & Akhtar, N. Enhanced deep predictive modeling of wastewater plants with limited data. IEEE Trans. Industr. Inf. 20, 1920–1930 (2023).

  25. Qin, Y., Song, D., Chen, H., Cheng, W., Jiang, G. & Cottrell, G. A dual-stage attention-based recurrent neural network for time series prediction. Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, 2627–2633 (2017).

  26. Yoon, N. et al. Dual-stage attention-based LSTM for simulating performance of brackish water treatment plant. Desalination 512, 115107 (2021).

  27. An, T. et al. Adaptive prediction for effluent quality of wastewater treatment plant: improvement with a dual-stage attention-based LSTM network. J. Environ. Manage. 359, 120887 (2024).

  28. Chen, Q., Lin, N., Bu, S., Wang, H. & Zhang, B. Interpretable time-adaptive transient stability assessment based on dual-stage attention mechanism. IEEE Trans. Power Syst. 38, 2776–2790 (2023).

  29. Pan, S. J. & Yang, Q. A survey on transfer learning. IEEE Trans. Knowl. Data Eng. 22, 1345–1359 (2010).

  30. Lin, N. et al. Resistive memory-based zero-shot liquid state machine for multimodal event data learning. Nature Computational Science 7, 37–47 (2025).

  31. Meier, J. et al. Language models enable zero-shot prediction of the effects of mutations on protein function. In Advances in Neural Information Processing Systems, 29287–29303 (2021).

  32. Zheng, Y., Zhang, X., Zhou, Y., Zhang, Y., Zhang, T. & Farmani, R. Deep representation learning enables cross-basin water quality prediction under data-scarce conditions. npj Clean Water 8, 33 (2025).

  33. Van den Akker, B. et al. Validation of a full-scale membrane bioreactor and the impact of membrane cleaning on the removal of microbial indicators. Biores. Technol. 155, 432–437 (2014).

  34. Cheng, H., Monjed, M. K., Myshkevych, Y., Wang, T. & Hong, P.-Y. Accounting for the microbial assembly of each process in wastewater treatment plants (WWTPs): study of four WWTPs receiving similar influent streams. Appl. Environ. Microbiol. 90(4), e02253-e2323 (2024).

  35. Zhang, J., Zhang, J., Sano, D. & Chen, R. Comparison of activated sludge and virus interactions in aerobic and anaerobic membrane bioreactors. iScience 27 (2024).

  36. Chaudhry, R. M., Nelson, K. L. & Drewes, J. E. Mechanisms of pathogenic virus removal in a full-scale membrane bioreactor. Environ. Sci. Technol. 49, 2815–2822 (2015).

  37. Jumat, M. R. et al. Membrane bioreactor-based wastewater treatment plant in Saudi Arabia: Reduction of viral diversity, load, and infectious capacity. Water 9, 534 (2017).

  38. Timraz, K., Xiong, Y., Al Qarni, H. & Hong, P. Y. Removal of bacterial cells, antibiotic resistance genes and integrase genes by on-site hospital wastewater treatment plants: Surveillance of treated hospital effluent quality. Environ. Sci: Water Res. Techn. 3(2), 293–303 (2017).

Acknowledgements

The authors thank the MODON WWTP operation team for granting us access to various wastewater samples.

Funding

This work was supported by KAUST-MEWA SPA (REP/1/6112-01-01) and the Near Term Grand Challenge (AI) (REI/1/5233-01-01), awarded to Peiying Hong.

Author information

Contributions

J. C.: Methodology, investigation, software, validation, writing – review and editing. I. N.: Conceptualization, methodology, investigation, visualization, writing – original draft, writing – review and editing, supervision. M. K. M.: Data curation. P.-Y. H.: Conceptualization, supervision, resources, project administration, funding acquisition.

Corresponding author

Correspondence to Ibrahima N’Doye.

Ethics declarations

Competing Interests

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

Chen, J., N’Doye, I., Monjed, M.K. et al. Zero-shot generalization for predicting viral concentrations and evaluating removal efficiencies across wastewater matrices.
Sci Rep 15, 41726 (2025). https://doi.org/10.1038/s41598-025-26384-4

Keywords

  • Wastewater matrices
  • Viral particle prediction
  • Effluent process drifts
  • Log removal value
  • Zero-shot generalization
  • Generative models
  • Dual-attention long short-term memory network

