
Model generalization paradigms for predicting viral particles and evaluating removal efficiencies in anaerobic membrane bioreactor plants


Abstract

The dynamic changes in influent and effluent streams and the shifts in effluent quality across filtration layers of membrane bioreactors (MBRs) are major challenges that hinder the generalization of machine learning (ML) models developed to predict bacterial and viral contaminants in unseen data. This paper proposes two model generalization paradigms based on lifelong and zero-shot generalization frameworks for predicting viral particles and assessing log removal values (LRVs) across two anaerobic MBR (AnMBR)-based wastewater treatment plants (WWTPs) located in different cities of Saudi Arabia using physicochemical parameters, virometry, and PCR-based data. The lifelong learning approach integrates a knowledge-based adaptation module with a shared dictionary and a local ML predictor for streaming and predicting viral particles with delayed output measurements. The zero-shot generalization approach is based on a dual-attention transformer model and adaptively prioritizes key input water features through temporal and input attention mechanisms for estimating viral pathogens. Both approaches generalized robustly across unseen AnMBR-based wastewater matrices (WMs) and WWTPs. We validated them by predicting adenovirus, coliphage, CrAssphage, pepper mild mottle virus, and total virus concentrations and estimating contaminant removal performances through the LRVs across various WMs and WWTPs.

Introduction

The presence of viral particles in water reclaimed by membrane bioreactors (MBRs) poses significant concerns for sustainable water resource management and water security1. Effective wastewater-based epidemiology (WBE) techniques have been proposed to monitor treated water quality by enumerating and quantifying the concentrations of viral pathogens and indicators (i.e., culturable bacteriophages) and detecting and extracting pathogenic genomes and biomarkers from wastewater samples2,3. Although these WBE methods targeting human pathogens and indicators are highly specific, they face significant challenges in detecting viruses at low concentrations and are generally time-consuming4,5,6. These techniques do not provide real-time measurements of wastewater process variables, as water samples are typically collected on-site and analyzed off-site in laboratories7,8, which limits their practical benefits. For aerobic MBR (AeMBR) and anaerobic MBR (AnMBR) systems to be reliable and effective, achieving efficient log removal values (LRVs) over time is essential to control microbial water quality and meet water reuse safety standards. Although AnMBRs exhibit lower levels of opportunistic pathogens and less favorable conditions for natural transformation than AeMBRs, eliminating pathogens and estimating the mechanism of viral pathogen removal remain major challenges in both MBR processes. As a result, there is growing interest in effective and rapid real-time monitoring methods to predict viral concentrations9,10 and to assess and verify log removal safety with accessible water quality measurements11.

Machine learning (ML) is currently shaping the modeling, prediction, and control of complex systems12 and enabling the accelerated prediction of viral particle concentrations9,10 and pathogenic genomes in wastewater treatment plants (WWTPs)13. Computational ML methods bypass mechanistic models by learning input–output relationships and characterizing dominant patterns for estimation and prediction in WWTPs (see, e.g.,9,14,15,16,17,18,19,20). In the context of wastewater applications, data-driven methods have demonstrated a remarkable ability to estimate viral and bacterial concentrations from wastewater samples8,10,11,16. However, variations in influent and effluent streams, caused by non-stationary wastewater flow rates and changing wastewater composition, represent primary sources of process drift in MBRs. These dynamic changes are difficult to monitor and characterize over extended periods of time11,21,22. In addition, shifts in effluent quality across filtration layers, combined with instrument lags and measurement noise, cause data distribution shifts, multiple time scales, and domain drift. Further, the underlying drifts are exacerbated by the synthetic data that must be generated to compensate for the limited real data available in most WWTPs due to low pathogen concentrations. These challenges limit the generalization performance of data-driven approaches and make accurate validation on unseen test datasets difficult. In other words, even an optimal baseline ML model without adaptation may struggle to handle sudden process drifts caused by data shifts across diverse WWTPs. For instance, abrupt effluent changes can arise from varying wastewater treatment types in AnMBR systems.

In the model generalization context, a recent study was conducted to estimate bacterial concentrations across various AeMBRs using a calibration that relied on an out-of-distribution framework23. The calibration method showed accurate prediction performance on unseen datasets across AeMBR-based WWTPs in two different regions of Saudi Arabia. The calibration framework often requires a retraining step before a downstream model enhancement analysis of unseen datasets. In the model adaptation process, the partial least squares (PLS) method has recently shown excellent performance within linear streaming lifelong learning frameworks24. By extracting key latent variables, PLS effectively reduces the dimensionality of complex datasets while preserving essential information, thereby providing high accuracy and robustness in environmental modeling and prediction tasks. The long short-term memory (LSTM) and gated-recurrent unit (GRU) local predictors, which are inherited from the nonlinear dependencies of the water quality input and virus concentration output, coupled with the knowledge base adaptation framework, ensure model performance enhancement for streaming and predicting viral particle communities in AeMBR-based WWTPs9. The authors in9 proposed a lifelong learning framework that demonstrated strong model generalization accuracy improvements by integrating a knowledge-based adaptation mechanism and local ML predictor on unseen test data to predict viral particles across various AeMBR-based wastewater matrices and WWTPs.

A zero-shot generalization (ZSG) framework that incorporates attention mechanisms and neural network models has emerged as an effective alternative to lifelong learning for source-to-target prediction problems. ZSG involves testing a developed baseline ML model on unseen datasets, where the weights of the developed ML-based attention model are transferred to the target datasets. Dual-stage attention (DA) neural network models with input and temporal attention mechanisms have demonstrated superior performance compared to LSTM25. These DA approaches improve feature representation in deep learning models, mitigating the long-dependency issues that commonly affect recurrent neural networks, including LSTMs25,26,27,28,29. The goal of ZSG methods leveraging DA is to overcome the limitations of classical time series learning models in handling long-term dependencies and multiple sequences. The attention mechanism assigns weights to intrinsic features, capturing global dependencies in the underlying data distributions of the developed ML models25,30,31,32. The weighted features are highly distinguishable in a high-dimensional space, which enhances prediction accuracy and improves the ability to generalize to unseen data.
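
To make the two attention stages concrete, the following is a minimal NumPy sketch, not the authors' DA-LSTM. In a trained dual-stage attention model the relevance scores are produced by learned layers; here they are passed in as plain arrays, and the function names (`input_attention`, `temporal_attention`) are illustrative assumptions.

```python
import numpy as np

def softmax(scores):
    """Numerically stable softmax over the last axis."""
    shifted = scores - np.max(scores, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=-1, keepdims=True)

def input_attention(window, scores):
    """Reweight input features at every time step.

    window: (T, n_features) array of water quality inputs.
    scores: (T, n_features) unnormalized relevance scores
            (hypothetical; a real model learns these).
    """
    weights = softmax(scores)          # each row sums to 1 across features
    return weights * window, weights

def temporal_attention(hidden_states, scores):
    """Summarize encoder states into one context vector.

    hidden_states: (T, d) array of per-time-step hidden states.
    scores: (T,) unnormalized relevance score per time step.
    """
    weights = softmax(scores)          # sums to 1 across time steps
    return weights @ hidden_states, weights
```

With uniform scores, input attention splits the weight equally across features and temporal attention reduces to a plain average over time steps; a trained model departs from this by concentrating weight on the most informative features and time steps.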

Model generalization techniques are needed to handle the challenges caused by rapid shifts in most effluent treatment conditions and to adapt to unseen data. Although lifelong learning and ZSG-based DA approaches have been used as effective and rapid real-time monitoring tools for viral particles9 and LRVs32 in AeMBRs, prior research has not investigated their application to monitoring viral contaminants in AnMBR-based WWTPs. Contaminants such as coliphage, total virus, adenovirus, CrAssphage, and pepper mild mottle virus and their corresponding log removal efficiencies have yet to be evaluated in AnMBR-based WWTPs. Consequently, the generalization performance of data-driven soft sensors for predicting viral particles across AnMBRs and AeMBRs remains an open challenge due to the underlying process drifts and dynamic changes in WWTPs.

The present study proposes accurate and generalizable soft sensors using a lifelong learning approach and a ZSG-based DA mechanism with generative models to predict viral concentrations and evaluate virus removal capacities across AnMBR wastewater matrices and WWTPs. The performance of the model generalization paradigms was tested on the KAUST and MODON AnMBR-based WWTPs located in two cities of Saudi Arabia. The KAUST plant treats municipal wastewater, while the MODON plant treats a mix of municipal and industrial wastewater. We validated the prediction of viral particles on unseen datasets using lifelong learning and ZSG-based DA approaches coupled with the Wasserstein generative adversarial network (WGAN), extended Markov chain model (EMCM), and Copula model (CM) across the wastewater matrices of each AnMBR, and extended it across the two AnMBRs for further validation and comparison. We demonstrated that both lifelong learning and ZSG-based DA achieve accurate prediction of viral particles from various data streams and maintain robust performance across the wastewater matrices and WWTPs. Across the generative models, lifelong-LSTM and DA-LSTM accurately predicted (i) AdV particles in the KAUST and MODON AnMBR wastewater matrices; (ii) MS2 and CrAssphage particles of untreated AnMBRs from untreated AnMBRs, with a mean coefficient of determination R2 of 0.94 and 0.93 using lifelong-LSTM and DA-LSTM, respectively; (iii) PMMoV particles of treated AnMBRs from treated AnMBRs, with R2 of 0.93 and 0.98 using lifelong-LSTM and DA-LSTM, respectively; and (iv) TV particles of treated AnMBRs from untreated AnMBRs, with R2 of 0.86 and 0.91 using lifelong-LSTM and DA-LSTM, respectively. We evaluated the LRVs from the estimated influent and effluent viral concentrations. The LRVs obtained for the AnMBRs with both model generalization paradigms were consistent with the ground truth and the generated synthetic datasets. The findings of this work highlight the importance of handling wastewater process drifts in soft sensor development to ensure globally representative and effective real-time prediction of viral concentrations and removal efficiencies, thereby capturing the complex interdependencies of water quality and viral particles in model development and model generalization tasks.

Results

This section describes the case studies conducted to predict viral particles and assess the LRVs using lifelong learning and ZSG-based DA approaches across AnMBR-based wastewater matrices and WWTPs. The two WWTPs are geographically located in two different cities in Saudi Arabia. For each lifelong and ZSG-based DA approach, we constructed the baseline model of the viral particles from the influent or effluent treatment process to assess the model generalization frameworks across wastewater matrices sharing the same AnMBR wastewater source. We then extended it across different AnMBR-based WWTPs for further validation and comparison. The samples of untreated to treated wastewater within the same WWTP, untreated to different untreated AnMBR wastewater, and treated to different treated AnMBR wastewater from different treatment facilities were investigated in viral particle prediction across different AnMBR-based WWTP cases. We evaluated the LRVs through the estimated viral concentrations from the conventional LRV formula given in Eq. (1). To present the proposed model generalization results for predicting viral particles across wastewater matrices and WWTPs effectively, we illustrate in the subsequent subsections the prediction performance for adenovirus (AdV) particles, coliphage (MS2) and CrAssphage concentrations, pepper mild mottle virus (PMMoV) particles, and total virus (TV) particles.
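
As a minimal sketch of how an LRV follows from predicted concentrations: Eq. (1) is not reproduced in this section, so the snippet assumes the conventional definition, LRV = log10(C_in / C_out). The function name and the example concentrations are illustrative and are not values from the study.

```python
import numpy as np

def log_removal_value(c_in, c_out):
    """Log removal value, assuming LRV = log10(C_in / C_out).

    c_in, c_out: influent and effluent concentrations in the same
    units; scalars or arrays of matching shape.
    """
    c_in = np.asarray(c_in, dtype=float)
    c_out = np.asarray(c_out, dtype=float)
    if np.any(c_in <= 0) or np.any(c_out <= 0):
        raise ValueError("concentrations must be strictly positive")
    return np.log10(c_in / c_out)

# Hypothetical example: influent 1e6 and effluent 1e2 (same units)
# corresponds to a 4-log reduction.
example_lrv = log_removal_value(1e6, 1e2)
```

Because the LRV depends only on the ratio of the two concentrations, a constant multiplicative bias shared by the influent and effluent predictions cancels out, whereas independent errors in either prediction propagate directly into the estimate.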

Viral particle prediction and LRV assessment across wastewater matrices of the municipal KAUST AnMBR wastewater

The municipal KAUST pilot-scale AnMBR includes a primary clarifier influent, a submerged microfiltration membrane (MBR), and a nature-based filtration column (NBF) treatment process (Fig. 1). For each model generalization framework, we first constructed the influent baseline model (i.e., training and testing or model development) to predict TV, AdV, and PMMoV particles. We employed LSTM and GRU ML predictors to build the lifelong and ZSG-based DA models. Then, we conducted model generalizations on unseen testing datasets to predict viral particles and assessed the removal efficiencies across the KAUST MBR and NBF wastewater matrices (Table S7). Note that the ZSG-based DA model requires retraining of the influent baseline model before the adaptation process on the NBF clarifier. Figure 1 illustrates the model generalization flowchart for predicting viral particles using lifelong and ZSG-based DA approaches, including the model development of the influent baseline model and the LRV assessment across the KAUST MBR and NBF wastewater matrices. To assess the model development and generalization performances from different angles, we included comprehensive qualitative and quantitative evaluations across various metrics visualized in a radar chart. The expression and calculation method for each evaluation metric are detailed in the Supporting Information (Table S8).

Fig. 1: AdV particle prediction across the municipal KAUST AnMBR wastewater matrices.

Flowchart illustrating the case study for predicting AdV particles across the municipal KAUST AnMBR wastewater matrices in which the model development, the model generalization frameworks using lifelong and ZSG-based DA, and the LRV assessment are highlighted: a Model development included the testing results of AdV particles through the KAUST influent treatment process; b Lifelong-LSTM and ZSG-based DA-LSTM consisting of validating the developed KAUST influent model to predict AdV particles across the KAUST microfiltration (MBR) and KAUST nature-based filtration (NBF) treatment matrices (i.e., target unseen datasets); c LRV assessment across the wastewater matrices is derived through the predicted AdV concentrations of the KAUST influent and effluent treatment processes.

Figure 2 shows the radar chart of the model adaptation performance for predicting AdV particles across the KAUST microfiltration (MBR) effluent using lifelong-LSTM (Fig. 2a), lifelong-GRU (Fig. 2b), DA-LSTM (Fig. 2c), and DA-GRU (Fig. 2d) frameworks with the EMCM, WGAN, and CM (Table S9). As shown in the radar charts, the four algorithms (lifelong-LSTM, lifelong-GRU, DA-LSTM, and DA-GRU) achieved generalization, precision, and harmonic mean performances greater than 80% with the EMCM, WGAN, and CM across the MBR treatment process. The time efficiency of the lifelong-LSTM and lifelong-GRU showed good performance across all the generative models. The lifelong-LSTM and lifelong-GRU algorithms coupled with EMCM indicated superior deviation performance. The deviation performance of the lifelong-EMCM can be attributed to its ability to maintain accurate predictions despite wastewater process drifts. The highest computational cost was observed with the lifelong-LSTM and lifelong-GRU approaches across all three generative models. GRU has fewer gates than LSTM, which reduces the computational burden in the training stage. However, LSTM better captured the overall adaptation performance, specifically the deviation performance, which ensures better adaptation in the presence of treatment process drifts in WWTPs. Overall, the four algorithms with the three generative models achieved weighted scores greater than 80% (Table S10).

Fig. 2: Radar chart of AdV particle prediction and adaptation using various performance metrics.

Model generalization performance for predicting AdV particles using lifelong learning and ZSG-based DA frameworks with LSTM and GRU predictors and EMCM, WGAN, and CM across the KAUST MBR effluent from the baseline KAUST influent model using various metrics (Table S7): a lifelong-LSTM, b lifelong-GRU, c DA-LSTM, and d DA-GRU. The performance metrics include precision (P), generalization (G), the harmonic mean (H) of R2 with respect to the R2 value, deviation (D) with respect to the root mean square error (RMSE), time efficiency (E), and computational cost (C), corresponding to the number of floating-point operations per second (FLOPs) (Table S8).

We assessed the LRV performance to evaluate the presence of viral pathogens across the MBR and NBF wastewater matrices and to monitor the potential risks associated with the reuse of treated wastewater effluents. Figure 3 shows the estimated and actual LRVs of AdV particles across the MBR and NBF effluents from the influent baseline model of the KAUST AnMBR using lifelong and ZSG-based DA frameworks based on LSTM and GRU approaches with the EMCM, WGAN, and CM. The LRVs obtained from the estimated viral concentrations using Eq. (1) demonstrated remarkable estimation performance with the actual and generated LRVs across the KAUST MBR and NBF treatments. The cumulative LRV sum of AdV particles across the MBR and NBF treatments achieved using lifelong-LSTM, lifelong-GRU, DA-LSTM, and DA-GRU algorithms coupled with the three generative models attained an average of 4.08, 4.13, 4.12, and 4.08, respectively. The average cumulative LRV results of the AdV particles were consistent with the ground truth, which was 4.19.
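
The cumulative LRV across successive treatment steps can be sketched as follows. The snippet assumes that the cumulative value is the sum of per-stage LRVs and that each stage follows the conventional log10 ratio; under those assumptions the sum equals the overall LRV from the influent to the final effluent. The function names and concentrations are illustrative, not data from the study.

```python
import math

def stage_lrvs(concentrations):
    """Per-stage LRVs along a treatment train.

    concentrations: ordered sequence, e.g.
    [influent, MBR effluent, NBF effluent], all in the same units.
    """
    return [
        math.log10(upstream / downstream)
        for upstream, downstream in zip(concentrations, concentrations[1:])
    ]

def cumulative_lrv(concentrations):
    """Sum of the per-stage LRVs."""
    return sum(stage_lrvs(concentrations))

# Hypothetical train: 1e6 -> 1e3 after the MBR -> 1e2 after the NBF.
train = [1e6, 1e3, 1e2]
per_stage = stage_lrvs(train)       # MBR stage, then NBF stage
total = cumulative_lrv(train)       # equals log10(1e6 / 1e2)
```

The intermediate concentration cancels in the sum, so the cumulative figure is sensitive only to the first and last concentrations in the train.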

Fig. 3: LRV evaluation across the municipal KAUST wastewater matrices.

Estimated and actual LRVs of AdV particles across the municipal KAUST MBR and NBF effluents from the KAUST influent baseline model using a lifelong-LSTM, b lifelong-GRU, c DA-LSTM, and d DA-GRU models with EMCM, WGAN, and CM.

Similar to the model adaptation performances for the AdV particles, we achieved remarkable generalization performance for the TV and PMMoV particles of the KAUST AnMBR-based WWTP. The prediction performance of all four models is shown in Table S7 with the evaluation metrics for each viral community. The scatterplots in the Supporting Information in Figs. S10, S13, and S16 for lifelong-LSTM, Figs. S19, S23, and S27 for DA-LSTM and viral loads in Figs. S12, S15, and S18 for lifelong-LSTM, and Figs. S22, S26, and S30 for DA-LSTM illustrated the prediction performance results. The LRVs of TV and PMMoV particles for all the models across the MBR and NBF treatments were 0.50 and 4.80, respectively (Fig. S31 for TV particles and Fig. S32 for PMMoV particles, Supporting Information). Overall, MBR and NBF treatments using the model generalization algorithms were successful in achieving the LRVs of AdV and PMMoV by an average of 4.10 and 4.80, respectively, while TV-LRV achieved an average of 0.52. These results were consistent with the ground truth and the generated synthetic data.

Viral particle prediction and LRV assessment across wastewater matrices of the industrial and municipal MODON AnMBR wastewater

The municipal and industrial MODON AnMBR-based WWTP features a primary influent clarifier that removes grease and sediment, and a second MBR clarifier where the wastewater undergoes treatment under anaerobic conditions similar to the municipal KAUST AnMBR. The final effluent is distributed for industrial water reuse and landscaping irrigation (Fig. 4). As with the municipal KAUST AnMBR, soft sensor development consists of generalizing the baseline influent model (i.e., samples of untreated wastewater) to treated wastewater samples within the same WWTP, with the advantage of streaming the viral particle prediction from untreated to treated wastewaters. We first constructed a baseline MODON influent model for predicting TV, AdV, and PMMoV particles. We then conducted model generalization on unseen MODON MBR datasets to predict viral particles using lifelong and ZSG-based DA coupled with EMCM, WGAN, and CM. Table 1 shows the training, testing, and model generalization results for predicting TV, AdV, and PMMoV particles across the MODON MBR treatment from the influent baseline model of the MODON AnMBR with the generative models. The evaluation values (R2, RMSE, mean absolute error (MAE), and mean square error (MSE)) showed strong generalization performance in each viral community and robust performance across unseen datasets (Table 1). Lifelong-LSTM and DA-LSTM with all generative models achieved average weighted scores of 90% and 88.7%, respectively, for predicting AdV particles across the MODON MBR treatment (Tables S11 and S12).
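
For reference, the four evaluation values can be computed as in the sketch below. It uses the standard textbook definitions; the exact expressions used in the study are given in Table S8, which is not reproduced here, so the match is an assumption. The function name and the sample values are illustrative.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Standard R2, RMSE, MAE, and MSE for one prediction task."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    mse = float(np.mean(residuals ** 2))
    rmse = float(np.sqrt(mse))
    mae = float(np.mean(np.abs(residuals)))
    ss_res = float(np.sum(residuals ** 2))                    # unexplained variation
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))     # total variation
    r2 = 1.0 - ss_res / ss_tot
    return {"R2": r2, "RMSE": rmse, "MAE": mae, "MSE": mse}

# Hypothetical measured vs. predicted values (arbitrary units).
metrics = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 5.0])
```

R2 is undefined when the measured values are constant (zero total variation), which is worth checking before applying it to short evaluation windows.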

Fig. 4: AdV particle prediction across the municipal and industrial MODON AnMBR wastewater matrices.

Flowchart illustrating the case study for predicting AdV particles across the municipal and industrial MODON AnMBR wastewater matrices in which the model development, the model generalization frameworks using lifelong and ZSG-based DA, and the LRV assessment are highlighted: a Model development included the testing results of AdV particles through the MODON influent treatment process; b Lifelong-LSTM and ZSG-based DA-LSTM consisting of validating the developed MODON influent model to predict AdV particles across the MODON MBR treatment process; c LRV assessment is derived through the predicted AdV concentrations of the MODON influent and effluent treatment processes.

Table 1 Training, testing, and model generalization results for predicting TV, AdV, and PMMoV particles across the ultrafiltration effluent (MBR) from the influent model of the MODON AnMBR-based WWTP using lifelong and ZSG-based DA approaches with EMCM, WGAN, and CM

Lifelong-LSTM and DA-LSTM algorithms with the generative models achieved a 1-log reduction in AdV viral particles. These estimation results are in close agreement with the ground truth and generated datasets (Fig. 5). In addition, the mean LRV of the AdV particles for the three generative models across the MBR treatment process using lifelong-LSTM and DA-LSTM was 1.11 and 1.09, respectively (Fig. 5). These mean LRVs of AdV particles were consistent with the mean LRVs of the generated and ground truth AdV datasets, which were 1.10 and 1.17, respectively. As with AdV particles, we achieved strong generalization performance for TV and PMMoV particles in the MODON AnMBR. The prediction performance of the lifelong-LSTM and DA-LSTM models is shown in Table 1, with the evaluation metrics for each viral community. The prediction results are illustrated in the Supporting Information by the scatterplots (Figs. S33, S36, and S39 for lifelong-LSTM; Figs. S42, S45, and S48 for DA-LSTM) and the viral loads (Figs. S35, S38, and S41 for lifelong-LSTM; Figs. S44, S47, and S50 for DA-LSTM). Further, we evaluated the LRVs of TV and PMMoV across the MODON MBR effluent using lifelong-LSTM and DA-LSTM (Fig. S51 for TV concentrations and Fig. S52 for PMMoV concentrations, Supporting Information).

Fig. 5: LRV evaluation across the municipal and industrial MODON MBR treatment.

Estimated and actual LRVs of AdV particles across the municipal and industrial MODON MBR treatment from the baseline MODON influent model using a lifelong-LSTM and b DA-LSTM algorithms with EMCM, WGAN, and CM.

Viral particle prediction and virus removal efficiencies evaluation of the MODON AnMBR from the KAUST AnMBR using various treatment facilities

The prediction of viral particles across different WWTPs from raw to treated wastewater is essential for addressing microbial risk and contamination. However, the development of a rapid and cost-effective soft sensor capable of continuously monitoring and quantifying viral pathogens in real time across different WWTPs remains a great challenge. We examined three cases of predicting viral particles across two different AnMBR-based WWTPs to address the transferability of the soft sensors using lifelong-LSTM and DA-LSTM: (i) MODON untreated AnMBR from KAUST untreated AnMBR (Fig. 6a), (ii) MODON treated AnMBR from KAUST treated AnMBR (Fig. 6b), and (iii) MODON treated AnMBR from KAUST untreated AnMBR (Fig. 6c). We demonstrated the ability of the proposed lifelong-LSTM and DA-LSTM frameworks in predicting MS2, CrAssphage, TV, AdV, and PMMoV concentrations from the KAUST AnMBR to the MODON AnMBR, and evaluated the virus removal efficiency in the treated AnMBRs.

Fig. 6: Viral particle prediction and virus removal efficiencies evaluation using various treatment facilities.

Flowchart illustrating the model generalization performances for predicting MS2, PMMoV and TV particles of MODON AnMBR from KAUST AnMBR using lifelong-LSTM and ZSG-based DA models with different treatment facilities: a Model adaptation results for predicting MS2 of the MODON untreated AnMBR from the KAUST untreated AnMBR; b LRV performance results of PMMoV particles of the MODON treated AnMBR from the KAUST treated AnMBR; c LRV performance results of TV concentrations of the MODON treated AnMBR from the KAUST untreated AnMBR.

Viral particle prediction of the MODON untreated AnMBR from the KAUST untreated AnMBR

We first constructed the municipal KAUST influent model using lifelong and DA algorithms to predict MS2, CrAssphage, TV, AdV, and PMMoV particles. Then, we assessed the transferability of the lifelong and ZSG-based DA approaches by predicting these viral particles across the municipal and industrial MODON untreated AnMBR. Because MS2 and CrAssphage particles were present in both untreated AnMBRs, we present the results for these two viral particles. Figure 7 shows the predicted versus actual values of MS2 particles (Fig. 7a, b) and CrAssphage particles (Fig. 7c, d) across the MODON AnMBR influent treatment process from the KAUST AnMBR influent baseline model using lifelong and ZSG-based DA approaches coupled with the EMCM, WGAN, and CM. Lifelong-LSTM and DA-LSTM coupled with the EMCM, WGAN, and CM predicted MS2 and CrAssphage accurately across the unseen MODON influent datasets. The average R2 values for predicting MS2 and CrAssphage particles across the generative models using lifelong-LSTM were 0.96 and 0.93, respectively (Table S13). DA-LSTM achieved R2 values of 0.94 for MS2 particles and 0.92 for CrAssphage particles (Table S13). These results provide strong evidence of the generalization and robustness of the proposed model generalization algorithms. DA-LSTM with CM and DA-LSTM with WGAN achieved the highest and lowest weighted scores of 91.5% and 85.7%, respectively, for predicting MS2 concentrations (Tables S14 and S15). The viral loads of MS2 and CrAssphage were also predicted accurately across the MODON influent process (Figs. S53a, S54a, and S55a for MS2 particles using lifelong-LSTM, Figs. S53b, S54b, and S55b for CrAssphage concentrations using lifelong-LSTM, Figs. S56a, S57a, and S58a for MS2 concentrations using DA-LSTM, Figs. S56b, S57b, and S58b for CrAssphage particles using DA-LSTM).
Overall, the prediction results for the TV, AdV, and PMMoV particles showed strong generalization performance across the MODON influent AnMBR. The average R2 values for predicting TV, AdV, and PMMoV with all the generative models were 0.89, 0.91, and 0.91, respectively, using the lifelong-LSTM model, and 0.96, 0.96, and 0.97, respectively, using the DA-LSTM model (Table S16).

Fig. 7: Model generalization performance of untreated AnMBR from untreated AnMBR.

Scatterplots illustrating the predicted versus actual values of MS2 and CrAssphage concentrations across the MODON AnMBR influent treatment process from the KAUST AnMBR influent baseline model using lifelong and dual-attention frameworks with EMCM, WGAN, and CM generative models: a Lifelong-LSTM for MS2; b DA-LSTM for MS2; c Lifelong-LSTM for CrAssphage, and d DA-LSTM for CrAssphage.

Viral particle prediction of the MODON treated AnMBR from the KAUST treated AnMBR

We conducted a model generalization test to evaluate the transferability of the lifelong-LSTM and DA-LSTM models in predicting viral particles across effluent wastewater matrices, aiming to leverage these models for unseen datasets. This validation was conducted from the municipal KAUST MBR effluent to the municipal and industrial MODON MBR effluent to assess the model’s generalization ability from treated-to-treated effluents of the two AnMBRs in two different cities of Saudi Arabia (Fig. 6b). We constructed the baseline model from the KAUST MBR effluent to predict TV, AdV, and PMMoV concentrations. We then conducted a model generalization to assess the reliability and robustness of the proposed lifelong-LSTM and DA-LSTM models across the MODON MBR effluent. The results in Table 2 show the performance evaluations of lifelong learning and ZSG-based DA models with the generative models for predicting PMMoV particles across the MODON MBR treatment process. Table 3 shows the weighted score performance of these models.

Table 2 Hexagon radar chart performance comparison for predicting PMMoV particles using lifelong learning and ZSG-based DA models with the generative models across the MODON MBR treatment process
Table 3 Weighted score of the proposed model generalization algorithms for predicting PMMoV particles across the MODON MBR treatment

Overall, the models demonstrated strong generalization, precision, and harmonic cross-validation performance. Lifelong-LSTM achieved better deviation performance with all generative models, while DA-LSTM performed better on computational cost. The lifelong-LSTM and DA-LSTM algorithms with the three generative models achieved weighted scores greater than 86% (Table 3). These results indicate that the model generalization algorithms are stable and effective in achieving transferability across unseen datasets. Lifelong-LSTM with CM and DA-LSTM with WGAN achieved the highest and lowest weighted scores of 92.7% and 86.2%, respectively (Tables 2 and 3). We evaluated the LRVs of PMMoV particles from the estimated PMMoV concentrations of the two treated plants, using the influent of the KAUST AnMBR plant as the baseline (Fig. 8). These LRV results across the two treated KAUST MBR and MODON MBR effluents were consistent with the ground truth and synthetic datasets (Fig. 8). Although these two effluents come from different AnMBR sources, the LRV results demonstrate the transferability and reliability of the proposed lifelong-LSTM and DA-LSTM soft sensors in predicting viral particles from treated to treated AnMBRs.

Fig. 8: LRV evaluation of treated AnMBR from treated AnMBR.

Estimated and actual LRVs of PMMoV particles across the MODON MBR effluent treatment from the baseline KAUST MBR effluent (LRVs were calculated from the baseline KAUST influent) using a lifelong-LSTM and b DA-LSTM models with EMCM, WGAN, and CM.


Model generalization performances for predicting TV and AdV particles across the MODON MBR were evaluated from the KAUST MBR baseline model. The prediction performance is shown in Table S17 with the evaluation metrics for each viral community, and illustrated with the scatterplots (Figs. S59, S60, and S61 for lifelong-LSTM, Figs. S62, S63, and S64 for DA-LSTM, Supporting Information). The LRVs of TV and AdV concentrations across the MODON MBR treatment showed good agreement with the LRVs of the ground truth and synthetic data (Figs. S65 and S66, Supporting Information). Overall, the lifelong-LSTM and DA-LSTM models successfully generalized the viral particle prediction of the unseen MODON MBR from the developed treated KAUST MBR baseline, demonstrating the reliability and robustness of the proposed model generalization algorithms.

Viral particle prediction of the MODON treated AnMBR from the KAUST untreated AnMBR

To further assess the transferability of the lifelong-LSTM and DA-LSTM models, we conducted a validation test for predicting TV particles across the unseen MODON MBR datasets with the EMCM, WGAN, and CM. This validation test was conducted on the developed KAUST municipal influent, aiming to underscore the importance of the transferability of the soft sensor algorithms in predicting the viral particles of industrial and municipal MODON treated AnMBR from municipal KAUST untreated AnMBR (Fig. 6c). Figure 9a, b illustrate the predicted versus actual values of TV particles across the MODON MBR treatment process with lifelong-LSTM and DA-LSTM models coupled with the generative models. The results demonstrated strong generalization abilities of the MODON MBR from the KAUST influent model using lifelong-LSTM and DA-LSTM models. The average R2 values of the lifelong-LSTM and DA-LSTM models with the generative models for predicting TV across the MODON MBR were 0.82 and 0.90, respectively (Table S18). We evaluated the LRVs of TV particles using lifelong-LSTM and DA-LSTM across the MODON MBR from the KAUST influent model. Figure 9c, d illustrate the estimated and actual LRVs of TV particles across the MODON MBR from the KAUST influent model using lifelong-LSTM and DA-LSTM with the generative models. Lifelong-LSTM and DA-LSTM achieved average LRV values of 0.28 and 0.27, respectively, demonstrating a good alignment with the LRVs of the ground truth, which was 0.26 (Fig. 9c, d). All the adaptation models achieved a weighted score greater than 80% (Tables S19 and S20).

Fig. 9: Model generalization performance and LRV evaluation of treated AnMBR from untreated AnMBR.

Scatterplots illustrating the predicted versus actual values, and estimated and actual log removal values (LRVs) of TV particles with EMCM, WGAN, and CM generative models across the MODON MBR treatment process from the KAUST influent model using (a) and (c) lifelong-LSTM, respectively, and (b) and (d) DA-LSTM, respectively.


To further assess the model’s generalization abilities in predicting associated viral particles, we evaluated the prediction of AdV and PMMoV particles using lifelong-LSTM and DA-LSTM across the MODON MBR from the KAUST influent baseline. The prediction performance is shown in Table S18 with the evaluation metrics for each viral community, and illustrated with the viral loads (Figs. S67, S68, and S69 for lifelong-LSTM, and Figs. S70, S71, and S72 for DA-LSTM, Supporting Information). The LRVs of AdV and PMMoV particles across the MODON MBR treatment showed remarkable agreement with the LRVs of the ground truth and synthetic data (Figs. S73 and S74, Supporting Information).

Discussion

The changes in aerobic and anaerobic process dynamics induced by the influent and effluent streams and resulting from the non-stationary behavior of the wastewater flow rate and composition are the primary source of treatment process drift in membrane bioreactors, including AeMBRs and AnMBRs, and remain challenging to capture and monitor over time21,22,33. Additionally, the effluent quality between filtration layers induced a data distribution shift caused by instrument lags and measurement noise, resulting in multiple time scales and domain drift. These factors hindered the generalizability of the predicted viral particles using the data-driven models across unseen datasets. Our recent studies have demonstrated the need to develop model generalization paradigms for predicting viral particles to improve accuracy and transferability across wastewater matrices and plants9,10, thereby addressing these underlying process drifts in WWTPs.

In this study, we proposed lifelong learning and ZSG frameworks to predict TV, AdV, PMMoV, MS2, and CrAssphage concentrations across municipal and industrial AnMBR-based WWTPs. Our findings demonstrate the effectiveness of lifelong learning and DA models in improving the generalizability and robustness of the developed ML models on unseen datasets. Although both lifelong learning and ZSG-based DA possess generalization abilities, they differ in fundamental ways, such as model structures, adaptation mechanisms, and their performance in online and transfer learning tasks. The DA model combines a synergistic DA mechanism with an LSTM or GRU model to test the generalization performance of the trained model on unseen datasets. This process is often called ZSG. The key advantages of the synergistic DA-LSTM lie in the augmented generative models, the global dependency of the effluent data distributions and feature spaces, and the long-term dependency of the LSTM and GRU local predictors on the wastewater time series data, which improve the predictive modeling of unseen data via the ZSG principles. It is important to note that a retraining step is often required before conducting a downstream model enhancement analysis of new unseen datasets using ZSG-based DA. The DA-LSTM model was retrained to predict the viral particles across the NBF treatment process of the KAUST AnMBR. This retraining step is conducted in cases where two or more wastewater matrices exist, as in the case of the KAUST AnMBR with MBR and NBF treatments. The goal of this retraining is to initiate the online monitoring of the viral particles from the baseline influent to achieve better performance across the wastewater matrices. 
Lifelong learning with LSTM and GRU local predictors as an emerging paradigm has the unique potential to stream process monitoring with delayed output, thereby maintaining and transferring knowledge between different tasks, including new incoming viral particles in the presence of process drifts without requiring retraining conditions. The capacity of lifelong-LSTM and lifelong-GRU algorithms to efficiently predict viral particles and stream new batches across wastewater matrices and WWTPs based on wastewater physicochemical parameters confers their superior generalization ability over the ZSG for continuous online monitoring of viral contaminants in MBRs.

To comprehensively assess the model generalization performance across AnMBR wastewater matrices and WWTPs, we propose three data generative models (WGAN, EMCM, and CM) and emphasize the relevance of these models in the Methods section. The generative models were developed to generate synthetic datasets from the measured datasets to address the limited wastewater-related data due to large volumes and low pathogen concentrations34. These models were evaluated using four quantitative evaluation metrics, namely maximum mean discrepancy (MMD), Fréchet inception distance (FID), Wasserstein distance (WD), and energy distance (ED), to better assess the data augmentation of effective representative synthetic samples in the Methods section. The representative synthetic dataset refers to the measure of high-quality similarity and diversity between real and synthetic datasets35. Qualitative evaluations of the similarity, visualizing the data distributions of the real and synthetic datasets of the KAUST and MODON based AnMBRs, were performed using principal component analysis (PCA) and t-distributed stochastic neighbor embedding (t-SNE). The evaluation results of these representative datasets from the three generative models are consistent with the ground truth (Table 4, and Figs. S1, S2, S3, S4 and S5, Supporting Information). To generate consistent synthetic samples with accuracy and robustness, we also evaluated the LRV for each viral community across the wastewater matrices (Fig. 10). These results ensured a global dependency of the underlying data distributions and feature spaces for the considered model generalization task, acting as a single global model to generalize the viral particle estimates, specifically in the DA-based LSTM and GRU algorithms.

In the model development and model generalization stages, lifelong-LSTM, lifelong-GRU, DA-LSTM, and DA-GRU coupled with the generative models showed remarkable transferability for predicting TV, AdV, and PMMoV particles across the MBR and NBF wastewater matrices of the KAUST AnMBR. The generalizability performances of these model generalization frameworks for predicting viral particles were confirmed in the MBR treatment of the MODON AnMBR. Overall, the results showed consistent prediction performance for the proposed model generalization approaches across various wastewater matrices (Table S7, Supporting Information and Table 1). TV particles achieved a lower LRV in both plants, ranging from 0.50 to 0.70. The LRVs of PMMoV across the KAUST wastewater matrices were consistent and aligned with the AnMBR virus removal study11. The estimated LRVs of AdV using the lifelong and ZSG-based DA models were 2.50 for KAUST, and 1.10 for MODON across the MBR. The obtained LRV of AdV was slightly higher than the reported average LRV of AdV, which was 0.70 in ref. 36. The resulting LRVs using the ground truth MBR and NBF, and the model generalization frameworks ranged from 0.50 to 0.70 for TV and 1.04 to 4.80 for AdV enteric viruses. These results were consistent with previous studies in the AeMBR plants37, which showed LRVs ranging from 0.30 to 2.70 for TV, and 1.70 to 4.80 for AdV.

Overall, these results indicate significant advantages of using both model generalization frameworks to leverage the viral particle prediction on unseen datasets in the presence of underlying wastewater matrix drifts. Lifelong learning maintained a meaningful distribution performance with the EMCM, WGAN and CM. This distribution is slightly larger than the ZSG-based DA model offering an advantage in enhancing the transferability to unseen datasets. Further, lifelong learning preserved its streaming capability over the ZSG-based DA, which often requires a retraining step to handle new unseen datasets, such as in the NBF effluent datasets. These prediction responses of the lifelong learning result from its knowledge base adaptation and reflect its ability to generalize new incoming unseen samples without retraining and calibrating new batches.

Generalization performance across multiple sites is challenging to achieve in soft sensor development involving time series data and remains a problem. Herein, we extended the proposed lifelong-LSTM and DA-LSTM models to predict viral particles across the MODON AnMBR. We investigated three prediction cases of viral particles: (i) MODON untreated AnMBR from KAUST untreated AnMBR (Fig. 6a), (ii) MODON treated AnMBR from KAUST treated AnMBR (Fig. 6b), and (iii) MODON treated AnMBR from KAUST untreated AnMBR (Fig. 6c). These cases emphasized the importance of using lifelong-LSTM and DA-LSTM approaches to evaluate generalizability and transferability capabilities across different wastewater matrices and sites. To leverage this generalization ability, we predicted MS2 and CrAssphage particles across the MODON AnMBR influent in the first case (Fig. 7 and Table S13). Then, we evaluated the generalization performance for predicting PMMoV particles in the MODON MBR treatment from the treated KAUST MBR process (Tables 2 and 3). Finally, we implemented both lifelong-LSTM and DA-LSTM to predict TV particles of the treated MODON AnMBR from the untreated KAUST AnMBR model (Table S18). Overall, the proposed models were able to generalize from different data streams and across two different AnMBR wastewater plants.

The direct relationship between the estimated viral concentrations and the removal efficiencies as determined through cost-effective water quality measurements allows for faster and reasonably real-time monitoring of any deviations from the optimal operating points. We evaluated the cumulative LRV sum of the viral particles using the proposed models across various AnMBR wastewater matrices (Figs. 3 and 5) and AnMBR types (Figs. 8 and 9). The LRVs estimated using the lifelong-LSTM and DA-LSTM models with the generative models were consistent with the LRVs of the ground truth and synthetic datasets in all cases. These results revealed high consistency in the accuracy and transferability of the viral concentrations, which allowed for the achievement of remarkably accurate estimated LRVs. The generalization results in all cases showed that the cumulative TV removal efficiency was relatively small across all AnMBR wastewater matrices and WWTPs, while AdV and PMMoV removal efficiencies were substantially higher for the MBR and NBF effluent treatment processes. This shows that TV, AdV, and PMMoV particles responded differently to treatment. TV particles were resilient and demonstrated lower removal efficiency rates, while AdV and PMMoV were more susceptible to the MBR and NBF treatments.

This study demonstrated the development of a generalizable soft sensor to assess TV, AdV, MS2, PMMoV, and CrAssphage concentrations from various data streams and across wastewater matrices and AnMBR plants based only on wastewater physicochemical parameters, thereby monitoring LRV efficiencies between wastewater matrices at the same WWTP sharing the same wastewater source and different types of AnMBRs. The core of our findings highlighted the importance of handling wastewater process drifts in soft sensor development to ensure globally representative and effective estimation of viral concentrations and removal efficiencies, thereby capturing the complex interdependencies of water quality and viral particles in model development and model generalization tasks, while ensuring robust performance across different wastewater matrices and AnMBR technologies.

Future work will focus on integrating an online adaptation phase with a receding horizon scheme based on statistical hypothesis testing of the partial distribution error between the predicted and actual values of the viral particles to overcome the retraining limitation of the current version of ZSG based DA-LSTM. This reliance on retraining mirrors out-of-distribution testing strategies23 and hinders effective online adaptation. Future studies can investigate the cross-technology transfer learning of viral particles and bacterial cells across aerobic and anaerobic MBRs while optimizing operational strategies such as hydraulic retention time (HRT), thereby achieving stable performance and minimizing the proliferation of antibiotic resistance genes. Subsequently, the design of a cloud-based web application and integration of geographical representation with additional datasets including differences in industrial, climate, and microbial communities will be essential to extend the case studies and to encourage further soft sensor collaboration and development.

Methods

This section presents two model generalization paradigms for predicting viral particles on new unseen datasets in AnMBRs through lifelong learning and ZSG based dual-attention frameworks. The methodology includes descriptions and sampling points of AnMBRs, as well as three generative models proposed to generate synthetic datasets from the measured datasets. Additionally, four model generalization frameworks—lifelong-LSTM, lifelong-GRU, DA-LSTM, and DA-GRU—are presented to estimate viral particles and assess LRVs across various wastewater matrices and WWTPs.

Anaerobic membrane bioreactors descriptions and sample collection

AnMBR technology has attracted interest in sustainable wastewater treatment. It has become an alternative sewage treatment due to its ability to generate energy and produce minimal sludge38. Consequently, it reduces the risk of antimicrobial resistance associated with generated solid waste39 while achieving effluent quality comparable to AeMBR40. Although AnMBR displays a lower abundance of opportunistic pathogens and less suitable conditions for natural transformation assay than the AeMBR sludge39, achieving a total reduction of pathogens and estimating the associated mechanisms of viral pathogen removal remain significant challenges. The present study proposes a data-driven model to quantify the viral particle concentrations across two AnMBR-based WWTPs and to assess the log reduction values of associated viral species. Raw and treated wastewater samples were collected from various points within the two AnMBR-based WWTPs, each employing a unique combination of treatment technologies. These two WWTPs were a pilot-scale WWTP featuring an anaerobic MBR (AnMBR) coupled with a nature-based filtration column located within the King Abdullah University of Science and Technology (KAUST) (Fig. 1), and a demonstration-scale facility utilizing an anaerobic membrane bioreactor unit located in the city of Jeddah (MODON) (Fig. 4). Descriptions of the KAUST and MODON AnMBR-based WWTPs, including their schematic representations and sampling points, are provided in Figs. 1 and 4, respectively. The water quality samples and viral particle concentrations in this study were collected from these two AnMBR-based WWTPs. 
Physicochemical water quality parameters such as pH, total dissolved solids (TDS), total phosphate, electroconductivity (EC), total suspended solids (TSS), turbidity, ammonium nitrogen (NH4-N), nitrate nitrogen (NO3-N), nitrite nitrogen (NO2-N), and chemical oxygen demand (COD) concentration were measured appropriately in both AnMBRs (Tables S1, S2, and S3 for the KAUST AnMBR, and Tables S4 and S5 for the MODON, Supporting Information). Flow virometry and PCR-based methods (RT-qPCR) were used to measure TV, AdV, PMMoV, MS2, and CrAssphage concentrations in the KAUST AnMBR (Fig. 1) and MODON AnMBR (Fig. 4). TV and AdV were chosen as targets for enteric viral pathogens. PMMoV, MS2, and CrAssphage were chosen as viral indicators of fecal contamination10. MS2 and CrAssphage concentrations were present in the influent treatment processes of both AnMBR-based WWTPs.

Data generative models

The data generative models—WGAN, EMCM, and CM—were implemented to address the limited availability of real data in most WWTPs, thereby generating synthetic datasets for the model development stage and model adaptation performance on unseen datasets. These data generative models have demonstrated significant abilities to generate realistic data samples and capture nonlinear relationships between features over time32.

We conducted a series of experiments to select the architectures and hyperparameters of the WGAN, CM, and EMCM to generate synthetic data from available water quality, flow cytometry, and PCR measurements. The datasets comprised ten input variables. Three or five viral particle output concentrations were targeted, depending on the viral presence in the influent and effluent treatment processes. The input–output variables contained approximately ten to fifteen real samples for each AnMBR-based WWTP (Tables S1, S2, and S3 for the KAUST AnMBR, and Tables S4 and S5 for the MODON, Supporting Information). In the data preprocessing stage, all values of the viral particle concentrations were converted to the logarithmic scale (i.e., log10 VP/L), and the features were normalized in all cases to ensure data quality and make different features comparable. It is worth noting that when generating these synthetic samples from a limited available real dataset, it becomes essential to have a representative dataset with a satisfactory ratio between the real and synthetic datasets. Thus, we generated approximately 2000 samples, which were reduced to 1800 after applying a contamination level of 0.10 to remove outliers. This ratio is adequate to provide good prediction and generalization performances of viral particles across wastewater matrices and WWTPs. In the viral particle estimation context, each of these data generative models presents advantages and limitations.

i. The EMCM inherits from the generative Markov chain proposed in ref. 41. It includes Gaussian noise with appropriate mean and standard deviation levels for the existing real samples32. Then, it discretizes the enhanced data and constructs a transfer matrix. Finally, the random walk characteristics of the Markov chain are used to preserve the temporal dynamics and distribution of different features to generate effective synthetic datasets32. Note that the EMCM requires an appropriate choice of added noise to prevent the model from overfitting. For regression problems, it is essential to visualize the scatterplots in the model development to avoid the data clustering attributes that may occur with the EMCM model.

ii. WGAN is a variant of the generative adversarial network (GAN)42 based on the Wasserstein distance. It introduces the Wasserstein distance as a metric to mitigate the mode collapse and overfitting of traditional GANs during training, thereby improving the corresponding data generation quality43. WGAN requires meticulous processing steps to ensure accurate matching of data distribution between the synthetic and original samples. It achieves a good performance in viral particle prediction problems by maintaining well-distributed data points between the actual and predicted values9.

iii. CM model generation is a process that relies on the simulation of the data dependency structure rather than on the specific distribution form of the data44. It estimates the joint distribution of the data through maximum likelihood to enhance the dependency between data features.
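The EMCM steps listed in item i (noise enhancement, discretization, transfer matrix, random walk) can be illustrated with a small sketch for a single feature. This is a simplified illustration, not the algorithm of ref. 32: the function name `emcm_sketch`, the equal-width binning, the uniform sampling inside a bin, and all parameter values are assumptions made here for clarity.

```python
import random

def emcm_sketch(series, n_bins=4, noise_sd=0.05, n_copies=20, n_out=50, seed=0):
    """Noise-enhanced Markov chain generator for one feature (illustrative only)."""
    rng = random.Random(seed)
    # 1) Enhance the real series with Gaussian noise (several noisy copies).
    enhanced = [[x + rng.gauss(0.0, noise_sd) for x in series]
                for _ in range(n_copies)]
    # 2) Discretize the enhanced data into equal-width bins (assumed scheme).
    flat = [x for seq in enhanced for x in seq]
    lo, hi = min(flat), max(flat)
    width = (hi - lo) / n_bins

    def to_state(x):
        return max(0, min(int((x - lo) / width), n_bins - 1))

    # 3) Count transitions between consecutive states to build the transfer matrix.
    counts = [[0] * n_bins for _ in range(n_bins)]
    for seq in enhanced:
        states = [to_state(x) for x in seq]
        for a, b in zip(states, states[1:]):
            counts[a][b] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        # A state that is never left in the data falls back to a uniform row.
        matrix.append([c / total for c in row] if total else [1.0 / n_bins] * n_bins)

    # 4) Random walk over states; draw a value uniformly inside each visited bin.
    state = to_state(series[0])
    generated = []
    for _ in range(n_out):
        generated.append(lo + (state + rng.random()) * width)
        state = rng.choices(range(n_bins), weights=matrix[state])[0]
    return matrix, generated

# Hypothetical log10 concentrations for ten real samples (not data from the study).
real_series = [7.9, 8.1, 8.0, 8.3, 8.2, 8.4, 8.1, 8.0, 8.2, 8.3]
matrix, generated = emcm_sketch(real_series)
```

The noise level controls how far the generated values spread beyond the measured ones, which is why the choice of added noise matters for avoiding both overfitting and clustered outputs.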

For a detailed description of the generative models, including their schematic representations and algorithms, we refer readers to ref. 32. For each prediction of viral particles and model adaptation on unseen testing datasets, we generated two thousand samples for the influent treatment process and wastewater effluent matrices. The mean and standard deviation of the real and generated datasets of the KAUST and MODON AnMBRs demonstrated the consistency of the real and augmented input–output datasets (Tables S1, S2, and S3 for the KAUST AnMBR, and Tables S4 and S5 for the MODON, Supporting Information).
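The preprocessing described in this section (conversion to log10 VP/L and removal of outliers at a contamination level of 0.10, which reduces 2000 generated samples to 1800) can be sketched as follows. The outlier detector is not named in the text, so the distance-from-median rule below is only a stand-in, and the synthetic values are hypothetical.

```python
import math
import random
import statistics

def to_log10(concentrations):
    """Convert viral particle concentrations (VP/L) to log10 VP/L."""
    return [math.log10(c) for c in concentrations]

def drop_outliers(values, contamination=0.10):
    """Keep the (1 - contamination) fraction of values closest to the median.

    Stand-in scoring rule: the actual outlier detector is not specified.
    """
    n_drop = int(round(contamination * len(values)))
    median = statistics.median(values)
    ranked = sorted(range(len(values)), key=lambda i: abs(values[i] - median))
    keep = sorted(ranked[: len(values) - n_drop])  # restore original order
    return [values[i] for i in keep]

# Hypothetical synthetic sample of 2000 values already on the log10 scale.
rng = random.Random(0)
synthetic = [rng.gauss(8.0, 0.5) for _ in range(2000)]
cleaned = drop_outliers(synthetic, contamination=0.10)
```

With 2000 inputs and a contamination level of 0.10, `cleaned` holds 1800 values, matching the sample counts reported above.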

The quantitative evaluation scores of EMCM, WGAN, and CM were measured with the MMD, FID, WD, and ED metrics. Table 4 evaluates the proposed generative models and assesses the quantitative similarity or dissimilarity performance between the real and synthetic generative datasets. Overall, the generative models maintain remarkable similarity performance with the advantage of EMCM across all the treatment processes. We conducted qualitative comparisons to visualize the similarity performances between the real and synthetic datasets of the KAUST and MODON based AnMBRs using PCA and t-SNE. Qualitative similarity performances using PCA and t-SNE between the real and synthetic datasets were shown in Figs. S1, S2, and S3 for KAUST AnMBR, and Figs. S5 and S6 for MODON AnMBR. These results preserved a close match within a compact set of the generated samples to the distribution of the real samples. These evaluations were proposed to ensure data integrity while controlling overfitting and avoiding data contamination or biases.
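Two of the four similarity metrics are simple to compute for a single feature. The sketch below gives the one-dimensional Wasserstein distance and the energy statistic; it assumes equal sample sizes and univariate data, whereas the study evaluates multivariate datasets, and the sample values are hypothetical.

```python
def wasserstein_1d(x, y):
    """1-D Wasserstein-1 distance between two equal-size samples."""
    if len(x) != len(y):
        raise ValueError("this sketch assumes equal sample sizes")
    return sum(abs(a - b) for a, b in zip(sorted(x), sorted(y))) / len(x)

def mean_abs_diff(x, y):
    """Average |a - b| over all pairs (a in x, b in y)."""
    return sum(abs(a - b) for a in x for b in y) / (len(x) * len(y))

def energy_statistic(x, y):
    """2E|X-Y| - E|X-X'| - E|Y-Y'|; some libraries report its square root."""
    return 2 * mean_abs_diff(x, y) - mean_abs_diff(x, x) - mean_abs_diff(y, y)

# Hypothetical LRV samples, not values from the study.
real = [0.28, 0.30, 0.31, 0.29]
synthetic = [0.27, 0.29, 0.30, 0.28]

wd = wasserstein_1d(real, synthetic)
ed = energy_statistic(real, synthetic)
```

Both quantities are zero when the two samples coincide and grow as the synthetic distribution drifts from the real one, so lower values indicate closer agreement.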

Table 4 Evaluation of the generative models for the generated KAUST influent, MBR, and NBF datasets using various quantitative measures with the lowest performance values highlighted in bold

Pearson’s correlation was used to examine the linear correlation between input features of the KAUST and MODON plants (Fig. S6 for KAUST and Fig. S7 for MODON, Supporting Information). Although a strong correlation between conductivity and TDS was observed in the real influent dataset of the KAUST AnMBR, this strong correlation did not persist across the real datasets of the KAUST wastewater matrices and the real datasets of the MODON AnMBR. Herein, the baseline model for the model development and generalization tasks was constructed with the generated datasets that did not maintain this strong correlation; hence, we preserved all the input features in the model development and model generalization processes for predicting viral particles across wastewater matrices and AnMBR plants.

The possibility of having a large proportion of representative-generated data does not necessarily imply better performance or overfitting of the ML model. The ultimate goal is to ensure data integrity with proper data leakage management, thereby controlling overfitting and avoiding data contamination or biases through qualitative and quantitative evaluation measures. We evaluated the training and validation losses in each training and testing phase for the KAUST and MODON AnMBRs to ensure that the predictive modeling did not reflect overfitting to synthetic patterns. Additionally, the sample sequences were divided into training and testing. All features are normalized using the min-max method, where the scaler is fitted only on the training set, and the same transformation is applied to the test set to prevent leakage of test information. All these steps are essential to guarantee a robust model and achieve better prediction and generalization performances.
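The leakage-safe min-max normalization described above, where the scaler is fitted on the training set only and then reused on the test set, can be sketched as follows. The function names, the two example features, and the chronological split are assumptions for illustration.

```python
def fit_min_max(train_rows):
    """Return per-feature (min, max) computed from the training rows only."""
    columns = list(zip(*train_rows))
    return [(min(col), max(col)) for col in columns]

def apply_min_max(rows, bounds):
    """Scale rows with bounds learned on the training set.

    Test values outside the training range map outside [0, 1]; this is
    expected, because the test set never influences the bounds.
    """
    return [
        [(x - lo) / (hi - lo) if hi > lo else 0.0
         for x, (lo, hi) in zip(row, bounds)]
        for row in rows
    ]

# Hypothetical two-feature samples (for example pH and turbidity), in time order.
samples = [[7.0, 10.0], [7.2, 12.0], [7.4, 14.0], [7.6, 20.0]]
split = 3  # assumed chronological split: first three rows train, last row test
train, test = samples[:split], samples[split:]

bounds = fit_min_max(train)          # fitted on training data only
train_scaled = apply_min_max(train, bounds)
test_scaled = apply_min_max(test, bounds)
```

Fitting the bounds on the full dataset instead would let the test maximum shape the training features, which is the leakage this procedure avoids.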

Log removal value performance

The contaminant removal performance of the data generative models was evaluated using the LRV. LRVs are used to evaluate the potential risks associated with reusing treated wastewater effluents. They are a key indicator of virus removal efficiency in WWTPs. The LRV has been well studied in the literature1,40,45, and its conventional formula is given as follows:

$$\text{LRV}=\log_{10}\left(\mathrm{C}_{\text{influent}}\right)-\log_{10}\left(\mathrm{C}_{\text{effluent}}\right)$$
(1)

where \(C_{\text{influent}}\) and \(C_{\text{effluent}}\) stand for the viral particle concentration of the influent and effluent samples, respectively. Figure 10 illustrates the LRVs of TV, AdV, and PMMoV concentrations in the data generation stage of the KAUST AnMBR using the WGAN, CM, and EMCM, indicating the removal efficiency of these viral pathogens after treatment. The results in the quantitative LRV of the generative models in the primary membrane filtration process showed that the EMCM-LRVs, WGAN-LRVs, and CM-LRVs were 0.28, 0.26, and 0.27 for TV particles, 2.49, 2.37, and 2.34 for AdV particles, and 3.07, 2.96, and 2.93 for PMMoV particles, and their corresponding real-LRVs were 0.30, 2.52, and 2.96, respectively. From the secondary nature-based filtration column, the EMCM, WGAN, and CM achieved 0.29, 0.29, and 0.28 LRVs for TV particles, 1.67, 1.69, and 1.71 LRVs for AdV particles, and 1.73, 1.98, and 1.97 LRVs for PMMoV particles, and their corresponding real-LRVs were 0.29, 1.66, and 1.93, respectively.
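As a small worked example of Eq. (1), the LRV can be computed either from raw concentrations or directly from values already stored on the log10 scale, as used in the preprocessing stage. The function names and the concentrations below are hypothetical.

```python
import math

def log_removal_value(c_influent, c_effluent):
    """LRV = log10(C_influent) - log10(C_effluent), following Eq. (1)."""
    if c_influent <= 0 or c_effluent <= 0:
        raise ValueError("concentrations must be positive")
    return math.log10(c_influent) - math.log10(c_effluent)

def lrv_from_log10(log_influent, log_effluent):
    """Same quantity when concentrations are already in log10 VP/L."""
    return log_influent - log_effluent

# Hypothetical concentrations in VP/L, chosen only for illustration.
example_lrv = log_removal_value(1.0e8, 1.0e6)
```

A drop from 1e8 to 1e6 VP/L corresponds to an LRV of 2, that is, a 99% reduction; an LRV near 0.3, as reported for TV, corresponds to roughly a halving of the concentration.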

Overall, the generative models preserved comparable mean LRV contributions to the real datasets for TV, AdV, and PMMoV particles in both treatment processes, thereby ensuring the consistency and accuracy of the synthetic generative datasets. The LRV of the EMCM contributed well to the distribution of the real LRV viral particle distributions, with the lowest deviation (Fig. 10). The mean LRV contributions of WGAN- and CM-based generated samples tend to deviate slightly along the two treatment processes with respect to the original samples, as illustrated in Fig. 10. The moderate standard deviations of the WGAN and CM are mainly due to their key structural features and sensitivities to outliers in the original datasets. These deviations conferred advantages by consistently ensuring that these models balanced meaningful variability and diversity in the generated samples while approximating the real data distribution with less noise rejection. These quantitative LRV assessments of WGAN and CM synthetic data distributions were also relevant to the visualization results in the model development and model generalization performances in predicting viral concentrations on unseen datasets.

Fig. 10: LRV as a quantitative metric to evaluate synthetic data across wastewater matrices.

Average LRV contributions of (a) TV, (b) AdV, and (c) PMMoV particles across KAUST MBR and NBF treatment processes. The box plots compare the distributions of log removal values (LRV) from the real dataset with those generated by the EMCM, WGAN, and CM models for the KAUST MBR and NBF treatments. The diamond markers and the numbers below the plots indicate the mean LRV for each distribution.


Model generalization paradigms

This section presents the model generalization paradigms including lifelong learning and ZSG-based DA transformer frameworks for predicting viral particle concentrations in KAUST and MODON AnMBR-based WWTPs. The proposed lifelong learning framework is based on a knowledge-based adaptation module and local ML predictors while ZSG-based DA integrates attention mechanisms and local ML predictors. The ML predictors include LSTM and GRU algorithms. These generalization approaches are necessary to leverage the process drifts and distribution shifts on unseen test datasets encountered in the model performance enhancement of soft sensor development in WWTPs.

Lifelong learning is designed to handle the adaptation of multiple data sets and cross-task knowledge sharing in a progressive manner. The core idea of the lifelong learning framework based on linear regression models lies in constructing the transfer and accumulation of knowledge between different tasks through a shared parameter matrix24. Recently, the authors in9 proposed a lifelong learning method based on LSTM and GRU machine learning predictors that provided a long-term time dependency and captured underlying nonlinear viral particle patterns across various AeMBR-based WWTPs. The present study leveraged the capability of lifelong learning methods in adjusting and adapting new incoming batch data by incorporating features into the shared knowledge from the previous task during the learning process. Lifelong learning consists of a local predictor-based ML module with a shared dictionary and a knowledge-based adaptation module. Figure S8 illustrates the lifelong learning framework, which includes the two modules. For each batch, lifelong learning encodes new inputs into feature vectors, applies sparse coding, computes predictor parameters, and updates the shared dictionary. This step is repeated for new incoming batches. The algorithm predicts outputs in the online phase by periodically adapting its shared knowledge base, optimizing with the local predictor-based LSTM and GRU algorithms, updating auxiliary variables, and refining the shared matrices to improve across tasks over time9,24. Lifelong learning uses a batch-processing method that processes input data for each water discharge stage in discrete batches, thereby improving its predictive performance from batch to batch. This approach is similar to most data-driven or model-based moving-horizon estimation methods. The first evolution in a one-shot batch for predicting viral particle concentrations might not fully converge to the true values, but often yields acceptable R2 values ranging from 0.7 to 0.8. 
The detailed mathematical description, hyperparameter selection process, and algorithm are provided in the supplementary information (Section S1.2.1, Table S6, and Algorithm S1, Supporting Information). The progressive adaptation process of the lifelong learning algorithm through knowledge-based adaptation and LSTM and GRU local predictors effectively balances the accumulation and forgetting of knowledge, relying only on input data or delayed measurement output. It fully reflects the advantages and feasibility of lifelong learning models in dealing with unseen test datasets in the presence of abrupt and distribution changes in the treatment process in a continuous task environment, unlike isolated ML models.
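To make the sparse coding step over a shared dictionary more concrete, the sketch below solves a generic lasso-type problem with iterative soft-thresholding and then reconstructs the predictor parameters from the dictionary and the sparse code. The objective, the names `L`, `theta`, and `mu`, and the step size are assumptions made for illustration; the formulation actually used is the one given in Section S1.2.1 and Algorithm S1.

```python
def soft_threshold(v, t):
    """Shrink each entry of v toward zero by t."""
    return [max(abs(x) - t, 0.0) * (1.0 if x > 0 else -1.0) for x in v]

def sparse_code(L, theta, mu=0.01, step=0.1, n_iter=500):
    """ISTA for min_s 0.5 * ||theta - L s||^2 + mu * ||s||_1 (assumed objective)."""
    d, k = len(L), len(L[0])
    s = [0.0] * k
    for _ in range(n_iter):
        residual = [sum(L[i][j] * s[j] for j in range(k)) - theta[i]
                    for i in range(d)]
        grad = [sum(L[i][j] * residual[i] for i in range(d)) for j in range(k)]
        s = soft_threshold([s[j] - step * grad[j] for j in range(k)], step * mu)
    return s

def reconstruct(L, s):
    """Predictor parameters expressed through the shared dictionary: L s."""
    return [sum(row[j] * s[j] for j in range(len(s))) for row in L]

# Hypothetical shared dictionary (3 parameters, 2 atoms) and task parameters.
L = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
theta = [0.5, 0.0, 0.0]

s = sparse_code(L, theta)
theta_hat = reconstruct(L, s)
```

Only the first atom is active in the resulting code, which illustrates how a new batch can reuse a small part of the accumulated knowledge instead of relearning all parameters.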

DA learning is a synergistic framework that integrates input and temporal attention mechanisms into an ML module to improve the accuracy of time series forecasting25,27,28,32. The input attention mechanism enables the model to selectively emphasize significant input features at individual time steps and to adaptively adjust the weights of the features relevant to the forecasting task. The temporal attention mechanism determines which past time steps are most relevant to the forecasting task, allowing the model to capture the most informative history in multivariate time series data and thus forecast more accurately. The DA algorithm first initializes the attention weights over both the input features and the time steps. These weights are calculated using the softmax function, which transforms scores into probabilities. DA learning then calculates two sets of attention weights, one for the current input features and one for the past hidden states. The weighted inputs are used in the LSTM and GRU networks to update their hidden states at each time step. This cycle is repeated over the entire sequence, and the final hidden state is used for the prediction. Figure S9 illustrates DA learning with the two attention mechanisms. The detailed mathematical descriptions of the DA framework and its algorithm, including the attention mechanisms and parameters involved, are provided in the Supplementary Information (Section S1.2.2 and Algorithm S2). Compared with traditional models, the synergy of the two attention mechanisms enables the model to adaptively prioritize relevant input features and historical time steps, thereby coping with their varying significance over time. Validation on unseen test data with the DA approach is systematic: the weights of the attention-based ML model developed on the baseline data are transferred to the target datasets, a procedure referred to as ZSG.
This task is often limited to using knowledge of the current output to compute the DA weights. Consequently, it relies on the accuracy of the generative models and on fine-tuning of the DA parameters to handle distribution and process shifts. Moreover, without a retraining step, there is no performance guarantee under abrupt changes across different types of WWTPs or technologies. Even so, DA offers an efficient way to validate the developed ML models on unseen test datasets: its generalization ability for time series forecasting can be leveraged through optimal hyperparameter fine-tuning, a retraining step, and an efficient data augmentation method.
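
As a rough illustration of the two attention mechanisms, the sketch below computes softmax weights over input features and over past hidden states. It is a minimal sketch under stated assumptions rather than the DA algorithm of the Supplementary Information: the attention scores are passed in directly or obtained from a plain dot product, whereas in a dual-stage attention network they are produced by learned layers, and the names `input_attention` and `temporal_attention` are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax: non-negative weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(v - m) for v in scores]
    total = sum(exps)
    return [e / total for e in exps]

def input_attention(x_t, feature_scores):
    """Reweight the features observed at one time step.

    feature_scores is assumed to be given; a trained model would compute it
    from the previous hidden state and the feature history.
    """
    alpha = softmax(feature_scores)
    weighted = [a * x for a, x in zip(alpha, x_t)]
    return weighted, alpha

def temporal_attention(hidden_states, query):
    """Weight past hidden states by dot-product similarity to a query vector."""
    scores = [sum(h_j * q_j for h_j, q_j in zip(h, query)) for h in hidden_states]
    beta = softmax(scores)
    dim = len(hidden_states[0])
    context = [sum(b * h[j] for b, h in zip(beta, hidden_states)) for j in range(dim)]
    return context, beta

if __name__ == "__main__":
    # Three illustrative water-quality features at one time step.
    weighted, alpha = input_attention([0.8, 0.1, 0.4], [2.0, 0.0, 1.0])
    print("input weights:", [round(a, 3) for a in alpha])
    # Four past hidden states of dimension 2 and a query (e.g. the current state).
    H = [[0.1, 0.0], [0.3, 0.2], [0.5, 0.1], [0.9, 0.4]]
    context, beta = temporal_attention(H, [1.0, 1.0])
    print("temporal weights:", [round(b, 3) for b in beta])
```

In a full model the weighted inputs would feed the LSTM or GRU cell and the context vector would feed the output layer; both are omitted here to keep the attention arithmetic visible.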

To clarify how the lifelong and DA algorithms were implemented, we provide an overview of the model development and model generalization phases. Figure 11 illustrates the steps of the model generalization methods using the lifelong and ZSG models. The first step was a design phase in which we generated synthetic water quality and viral particle datasets from the real AnMBR datasets (KAUST and MODON) using WGAN, EMCM, and CM. We evaluated the quality of the generated datasets using different qualitative and quantitative metrics, including the LRV. In the second step, we used five-fold cross-validation to randomly split the augmented datasets, with 80% of the data used for training the regression models and 20% for testing them. The model development stage comprises the training, testing, and learning phases. A Bayesian optimizer was used to optimize the hyperparameters of each model by minimizing the MSE through five-fold cross-validation (Table S6). The training and validation losses were also evaluated in each training and testing phase for the KAUST and MODON AnMBRs to ensure that the predictive models did not overfit to synthetic patterns (Figs. S11, S14, and S17 for the lifelong learning, Figs. S20, S24, and S28 for ZSG-based DA with the MBR treatment, and Figs. S21, S25, and S29 for ZSG-based DA with the NBF treatment of the KAUST pilot; Figs. S34, S37, and S40 for lifelong learning, and Figs. S43, S46, and S49 for ZSG-based DA with the MBR treatment of the MODON pilot, Supplementary Information). In the final phase, the coefficient of determination R2 was chosen as both the evaluation metric and the objective function of the iterative optimization to assess the generalization performance across wastewater matrices and WWTPs using the lifelong and ZSG-based DA approaches.
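
The evaluation quantities named in this workflow can be written down compactly. The sketch below shows a five-fold split (each fold holds out 20% of the samples), the coefficient of determination, and the log removal value under its standard definition, LRV = log10(C_influent / C_effluent). The helper names and the fixed seed are assumptions for illustration; the sketch does not reproduce the authors' pipeline or the Bayesian hyperparameter search.

```python
import math
import random

def k_fold_indices(n, k=5, seed=0):
    """Yield (train, test) index lists; with k=5 each test fold is 20% of the data."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def log_removal_value(c_influent, c_effluent):
    """LRV = log10(C_influent / C_effluent); both concentrations in the same units."""
    return math.log10(c_influent / c_effluent)

if __name__ == "__main__":
    for train, test in k_fold_indices(100):
        print(len(train), len(test))
    # Illustrative numbers only, not measurements from the study.
    print(log_removal_value(1e6, 1e3))
```

An R2 of 1 means perfect agreement, while predicting the mean of the observations gives an R2 of 0, which is a convenient reference point when reading per-fold scores.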

Fig. 11: Model generalization steps.

Lifelong learning and ZSG-based DA frameworks highlighting the different steps of the model generalization from the design to the generalization phases.

Data availability

The datasets generated and/or analyzed during the current study are not publicly available due to the sensitive nature of the data but are available from the corresponding author on reasonable request.

References

  1. Harb, M. & Hong, P. Y. Molecular-based detection of potentially pathogenic bacteria in membrane bioreactor (MBR) systems treating municipal wastewater: a case study. Environ. Sci. Pollut. Res. 24, 5370–5380 (2017).

  2. Adhikari, S. & Halden, R. U. Opportunities and limits of wastewater-based epidemiology for tracking global health and attainment of UN sustainable development goals. Environ. Int. 163, 107217 (2022).

  3. Kilaru, P. et al. Wastewater surveillance for infectious disease: a systematic review. Am. J. Epidemiol. 192, 305–322 (2023).

  4. Grabow, W. O. K. The virology of wastewater treatment. Water Res. 2, 675–701 (1968).

  5. Bibby, K. et al. Metagenomics and the development of viral water quality tools. npj Clean. Water 2, 1–13 (2019).

  6. Corpuz, A. et al. Viruses in wastewater: occurrence, abundance and detection methods. Sci. Total Environ. 745, 140910 (2020).

  7. Manti, A. et al. Bacterial cell monitoring in wastewater treatment plants by flow cytometry. Water Environ. Res. 80, 346–354 (2008).

  8. Alharbi, M., Hong, P. Y. & Laleg-Kirati, T. M. Sliding window neural network-based sensing of bacteria in wastewater treatment plants. J. Process Control 110, 35–44 (2022).

  9. Chen, J. et al. Viral particle prediction in wastewater treatment plants using nonlinear lifelong learning models. npj Clean. Water 8, 1–13 (2025).

  10. Myshkevych, Y. et al. Combining flow virometry with tree-based machine learning models for rapid virus particle estimation in different wastewater matrices. Water Res. 284, 123905 (2025).

  11. Kadoya, S.-s. et al. A soft-sensor approach for predicting an indicator virus removal efficiency of a pilot-scale anaerobic membrane bioreactor (AnMBR). J. Water Health 22, 967–977 (2024).

  12. Brunton, S. L. & Kutz, J. N. Data-Driven Science and Engineering: Machine Learning, Dynamical Systems, and Control. (Cambridge University Press, 2022).

  13. Zhu, C. et al. Global diversity and distribution of antibiotic resistance genes in human wastewater treatment systems. Nat. Commun. 16, 4006 (2025).

  14. Ekundayo, T. C., Adewoyin, M. A., Ijabadeniyi, O. A., Igbinosa, E. O. & Okoh, A. I. Machine learning-guided determination of Acinetobacter density in waterbodies receiving municipal and hospital wastewater effluents. Sci. Rep. 13, 7749 (2023).

  15. Alharbi, M. S., Hong, P.-Y. & Laleg-Kirati, T.-M. Adaptive neural network-based monitoring of wastewater treatment plants. Proc. 2022 American Control Conference (ACC), 3204–3211 (ACC, 2022).

  16. Aljehani, F., N’Doye, I., Hong, P.-Y., Monjed, M. K. & Laleg-Kirati, T.-M. Bacteria cells estimation in wastewater treatment plants using data-driven models. IFAC-PapersOnLine 58, 718–723 (2024).

  17. Farhi, N., Kohen, E., Mamane, H. & Shavitt, Y. Prediction of wastewater treatment quality using LSTM neural network. Environ. Technol. Innov. 23, 101632 (2021).

  18. Pisa, I., Santin, I., Morell, A., Vicario, J. L. & Vilanova, R. LSTM-based wastewater treatment plants operation strategies for effluent quality improvement. IEEE Access 7, 159773–159786 (2019).

  19. Mokhtari, H. A., Bagheri, M., Mirbagheri, S. A. & Akbari, A. Performance evaluation and modelling of an integrated municipal wastewater treatment system using neural networks. Water Environ. J. 34, 622–634 (2020).

  20. Wang, R. et al. Model construction and application for effluent prediction in wastewater treatment plant: data processing method optimization and process parameters integration. J. Environ. Manag. 302, 114020 (2022).

  21. Mahuli, S. K., Rhinehart, R. R. & Riggs, J. B. Experimental demonstration of non-linear model-based in-line control of pH. J. Process Control 2, 145–153 (1992).

  22. Iratni, A. & Chang, N.-B. Advances in control technologies for wastewater treatment processes: status, challenges, and perspectives. IEEE/CAA J. Autom. Sin. 6, 145–153 (2019).

  23. Aljehani, F., N’Doye, I., Hong, P.-Y., Monjed, M. K. & Laleg-Kirati, T.-M. A calibration framework toward model generalization for bacteria concentration estimation in wastewater treatment plants. Sci. Rep. 14, 31218 (2025).

  24. Liu, T. et al. Lifelong learning meets dynamic processes: an emerging streaming process prediction framework with delayed process output measurement. IEEE Trans. Control Syst. Technol. 32, 384–398 (2024).

  25. An, T. et al. Adaptive prediction for effluent quality of wastewater treatment plant: improvement with a dual-stage attention-based LSTM network. J. Environ. Manag. 359, 120887 (2024).

  26. Pan, S. J. & Yang, Q. A survey on transfer learning. IEEE Trans. Knowl. Data Eng. 22, 1345–1359 (2010).

  27. Qin, Y. et al. A dual-stage attention-based recurrent neural network for time series prediction. Proc. Twenty-Sixth International Joint Conference on Artificial Intelligence, 2627–2633 (IJCAI, 2017).

  28. Yoon, N. et al. Dual-stage attention-based LSTM for simulating performance of brackish water treatment plant. Desalination 512, 115107 (2021).

  29. Chen, Q., Lin, N., Bu, S., Wang, H. & Zhang, B. Interpretable time-adaptive transient stability assessment based on dual-stage attention mechanism. IEEE Trans. Power Syst. 38, 2776–2790 (2023).

  30. Shim, E., Park, J., Hwang, Y., Choi, S. & Kim, S. Predicting reaction conditions from limited data through active transfer learning. Chem. Sci. 13, 6655–6668 (2022).

  31. Zheng, Y. et al. Deep representation learning enables cross-basin water quality prediction under data-scarce conditions. npj Clean. Water 8, 1–11 (2025).

  32. Chen, J., N’Doye, I., Monjed, M. K. & Hong, P.-Y. Zero-shot generalization for predicting viral concentrations and evaluating removal efficiencies across wastewater matrices. Sci. Rep. 15, 41726 (2025).

  33. Williams, G. L., Rhinehart, R. R. & Riggs, J. B. In-line process-model-based control of wastewater pH using dual base injection. Ind. Eng. Chem. Res. 29, 1254–1259 (1990).

  34. Aßmann, E. et al. Augmentation of wastewater-based epidemiology with machine learning to support global health surveillance. Nat. Water 3, 753–763 (2025).

  35. Zhong, S. et al. Machine learning: new ideas and tools in environmental science and engineering. Environ. Sci. Technol. 55, 12741–12754 (2021).

  36. Yin, Z., Tarabara, V. & Xagoraraki, I. Human adenovirus removal by hollow fiber membranes: effect of membrane fouling by suspended and dissolved matter. J. Membr. Sci. 482, 120–127 (2015).

  37. Jumat, M. et al. Membrane bioreactor-based wastewater treatment plant in Saudi Arabia: reduction of viral diversity, load, and infectious capacity. Water 9, 534 (2017).

  38. Wu, B. et al. Interface behavior and removal mechanisms of human pathogenic viruses in anaerobic membrane bioreactor (AnMBR). Water Res. 219, 118596 (2022).

  39. Medina, J. S. et al. Metagenomic insights in antimicrobial resistance threats in sludge from aerobic and anaerobic membrane bioreactors. Environ. Sci. Technol. 59, 5636–5646 (2025).

  40. Zhang, J., Zhang, J., Sano, D. & Chen, R. Comparison of activated sludge and virus interactions in aerobic and anaerobic membrane bioreactors. iScience 27, 111450 (2024).

  41. Alvi, M., French, T., Cardell-Oliver, R., Batstone, D. & Akhtar, N. Enhanced deep predictive modeling of wastewater plants with limited data. IEEE Trans. Ind. Inform. 20, 1920–1930 (2024).

  42. Goodfellow, I. et al. Generative adversarial nets. Advances in Neural Information Processing Systems, 2672–2680 (NIPS, 2014).

  43. Arjovsky, M., Chintala, S. & Bottou, L. Wasserstein generative adversarial networks. Proceedings of the 34th International Conference on Machine Learning, 214–223 (ICML, 2017).

  44. Stoeber, J., Joe, H. & Czado, C. Simplified pair copula constructions—limitations and extensions. J. Multivar. Anal. 119, 101–118 (2013).

  45. Chaudhry, R., Nelson, K. & Drewes, J. Mechanisms of pathogenic virus removal in a full-scale membrane bioreactor. Environ. Sci. Technol. 49, 2815–2822 (2015).

Acknowledgements

This work is supported by the Near-Term Grand Challenge (AI) REI/1/5233-01-01, and KAUST-MEWA SPA (REP/1/6112-01-01) awarded to P.-Y. Hong. We thank Yevhen Myshkevych, a Ph.D. student in Prof. Hong’s group, for his help in processing the samples and describing the pilot-scale AnMBRs, and the MODON WWTP operation team for granting us access to various wastewater samples.

Author information

Contributions

J.C.: Software, methodology, formal analysis, data curation, writing – review & editing. I.N.: Conceptualization, investigation, formal analysis, methodology, visualization, supervision, writing – original draft, writing – review & editing. J.S.M.: Data curation. S.H.S.: Data curation. P.-Y.H.: Supervision, resources, project administration, funding acquisition, conceptualization, writing – review & editing.

Corresponding author

Correspondence to Ibrahima N’Doye.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

SI_revised_v5

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

Chen, J., N’Doye, I., Sanchez Medina, J. et al. Model generalization paradigms for predicting viral particles and evaluating removal efficiencies in anaerobic membrane bioreactor plants. npj Emerg. Contam. 2, 10 (2026). https://doi.org/10.1038/s44454-026-00030-8

  • DOI: https://doi.org/10.1038/s44454-026-00030-8

