A watershed-scale potential pathogenic bacteria dataset from the Yangtze River Basin


Abstract

Microbial safety is fundamental to water quality, particularly in the Yangtze River Basin, China’s most critical drinking water source. Despite its ecological and economic importance, the basin faces significant anthropogenic pressures, including wastewater discharge, which may elevate the risk of pathogenic contamination. However, fragmented sampling efforts and limited coverage have hindered a systematic understanding of pathogenic microbial diversity and distribution across this vast ecosystem. We applied a novel bioinformatic pipeline, which leverages Genome-Specific Markers (GSMs) to accurately identify and quantify potential pathogenic taxa in metagenomic data, to 625 publicly available metagenomes spanning water, sediments, and riparian soils along the 6,300 km Yangtze River continuum. We reconstructed a potential pathogen catalog comprising 403 taxa, substantially expanding the known pathogen diversity of this large river ecosystem. We also generated richness distribution maps of potential pathogens for water, sediments, and soils along the Yangtze River. This basin-scale pathogen inventory not only establishes a baseline for potential pathogenic bacterial communities in the Yangtze River Basin but also serves as a reference library for rapid biosurveillance and risk management at genomic resolution.

Data availability

Data are available at the figshare repository (https://doi.org/10.6084/m9.figshare.30196462; ref. 29). The repository contains four items: the spatial distribution maps for water, sediment, and soils; Dataset S1, metadata of samples for pathogen detection analysis; Dataset S2, pathogens identified by GSMer in the Yangtze River Basin and their potential hosts; and Dataset S3, georeferenced sampling locations and pathogen richness used in spatial mapping. Dataset S1 lists the sources of the original metagenomic sequencing data used in this study. Dataset S2 provides the potential pathogen species identified by GSM-based matching, together with their host information.
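As a minimal sketch of how per-sample pathogen richness (the quantity mapped in Dataset S3) could be summarized by habitat, the following Python example counts distinct taxa per sample and averages the counts within each habitat. The record layout `(sample_id, habitat, taxon)`, the function name `richness_by_habitat`, and the toy detections are assumptions made for illustration; they do not reflect the actual column names or contents of the figshare files.

```python
from collections import defaultdict


def richness_by_habitat(records):
    """Return the mean pathogen richness per habitat.

    records: iterable of (sample_id, habitat, taxon) detections
             (hypothetical layout, not the dataset's real schema).
    The richness of a sample is the number of distinct taxa detected in it.
    """
    taxa_per_sample = defaultdict(set)
    habitat_of = {}
    for sample_id, habitat, taxon in records:
        taxa_per_sample[sample_id].add(taxon)  # sets ignore repeated detections
        habitat_of[sample_id] = habitat

    richness_lists = defaultdict(list)
    for sample_id, taxa in taxa_per_sample.items():
        richness_lists[habitat_of[sample_id]].append(len(taxa))

    return {h: sum(v) / len(v) for h, v in richness_lists.items()}


# Toy detections, invented for illustration only.
toy = [
    ("W1", "water", "Escherichia coli"),
    ("W1", "water", "Klebsiella pneumoniae"),
    ("W1", "water", "Escherichia coli"),  # repeated detection, counted once
    ("W2", "water", "Aeromonas hydrophila"),
    ("S1", "sediment", "Clostridium perfringens"),
]

result = richness_by_habitat(toy)
print(result)
```

Note that samples with no detections never appear in such a detection list, so they would need to be added with a richness of zero from the sample metadata before averaging.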

Code availability

The parameters of all programs used for the analysis are described in the main text. The GSM library construction code is available at https://github.com/yedeng-lab/humanpathogen-GSM.

References

  1. Hu, Y. et al. Annual trends and health risks of antibiotics and antibiotic resistance genes in a drinking water source in East China. Science of The Total Environment 791, 148152 (2021).

  2. Pandey, P. K., Kass, P. H., Soupir, M. L., Biswas, S. & Singh, V. P. Contamination of water resources by pathogenic bacteria. AMB Expr 4, 51 (2014).

  3. Oon, Y.-L. et al. Waterborne pathogens detection technologies: Advances, challenges, and future perspectives. Front. Microbiol. 14, 1286923 (2023).

  4. Liu, W. et al. Unraveling pathogen dynamics in rivers flowing into Taihu Lake: Insights from high-throughput sequencing and environmental correlations. Water Research X 29, 100406 (2025).

  5. Carraro, L., Mächler, E., Wüthrich, R. & Altermatt, F. Environmental DNA allows upscaling spatial patterns of biodiversity in freshwater ecosystems. Nat Commun 11, 3585 (2020).

  6. Deiner, K., Fronhofer, E. A., Mächler, E., Walser, J.-C. & Altermatt, F. Environmental DNA reveals that rivers are conveyer belts of biodiversity information. Nat Commun 7, 12544 (2016).

  7. Ding, J. et al. Impacts of land use on surface water quality in a subtropical river basin: A case study of the Dongjiang River Basin, southeastern China. Water 7, 4427–4445 (2015).

  8. McKee, A. M. & Cruz, M. A. Microbial and viral indicators of pathogens and human health risks from recreational exposure to waters impaired by fecal contamination. J. Sustainable Water Built Environ. 7, 03121001 (2021).

  9. Hofstra, N. Quantifying the impact of climate change on enteric waterborne pathogen concentrations in surface water. Current Opinion in Environmental Sustainability 3, 471–479 (2011).

  10. Hales, S. Climate change, extreme rainfall events, drinking water and enteric disease. Reviews on Environmental Health 34, 1–3 (2019).

  11. Seymour, J. R. & McLellan, S. L. Climate change will amplify the impacts of harmful microorganisms in aquatic ecosystems. Nat Microbiol 10, 615–626 (2025).

  12. Girones, R. et al. Molecular detection of pathogens in water–the pros and cons of molecular techniques. Water Res 44, 4325–4339 (2010).

  13. Gu, W. et al. Rapid pathogen detection by metagenomic next-generation sequencing of infected body fluids. Nat Med 27, 115–124 (2021).

  14. Gallagher, T., Phan, J. & Whiteson, K. Getting Our Fingers on the Pulse of Slow-Growing Bacteria in Hard-To-Reach Places. J Bacteriol 200, e00540–18 (2018).

  15. Aw, T. G. & Rose, J. B. Detection of pathogens in water: from phylochips to qPCR to pyrosequencing. Curr Opin Biotechnol 23, 422–430 (2012).

  16. Wang, J., Han, Y. & Feng, J. Metagenomic next-generation sequencing for mixed pulmonary infection diagnosis. BMC Pulm Med 19, 252 (2019).

  17. Tu, Q., He, Z. & Zhou, J. Strain/species identification in metagenomes using genome-specific markers. Nucleic Acids Research 42 (2014).

  18. Li, T. et al. Beyond water and soil: Air emerges as a major reservoir of human pathogens. Environment International 190, 108869 (2024).

  19. NCBI Sequence Read Archive https://identifiers.org/insdc.sra:SRP288687 (2020).

  20. NCBI Sequence Read Archive https://identifiers.org/insdc.sra:SRP217764 (2020).

  21. NCBI Sequence Read Archive https://identifiers.org/insdc.sra:SRP394638 (2023).

  22. NCBI Sequence Read Archive https://identifiers.org/insdc.sra:SRP201455 (2019).

  23. NGDC Genome Sequence Archive https://ngdc.cncb.ac.cn/gsa/browse/CRA006054 (2023).

  24. NGDC Genome Sequence Archive https://ngdc.cncb.ac.cn/gsa/browse/CRA008231 (2023).

  25. National Microbiology Data Center (NMDC) https://nmdc.cn/resource/genomics/project/detail/NMDC10020587 (2026).

  26. Bolger, A. M., Lohse, M. & Usadel, B. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics 30, 2114–2120 (2014).

  27. Wang, B. et al. Tackling Soil ARG‐Carrying Pathogens with Global‐Scale Metagenomics. Advanced Science 10, 2301980 (2023).

  28. Steinegger, M. & Söding, J. MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nat Biotechnol 35, 1026–1028 (2017).

  29. Wang, J., Wang, S., Li, T., Hou, W. & Deng, Y. A watershed-scale potential pathogenic bacteria dataset from the Yangtze River Basin. figshare https://doi.org/10.6084/m9.figshare.30196462 (2026).

Acknowledgements

This work was supported by the Opening Project of the State Key Laboratory of Geomicrobiology and Environmental Changes (51830100303), the National Key Research and Development Program of China (Grant 2022YFC3204703), and the National Natural Science Foundation of China (Grant 42277104).

Author information

Contributions

J.W. generated the data and contributed to manuscript writing and revision. S.W. and Y.D. designed the study, organized the research, and contributed to manuscript writing and revision. T.L. contributed to code writing and data analysis. W.G.H. contributed to manuscript revision.

Corresponding authors

Correspondence to Shang Wang or Ye Deng.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

S1. Metadata of samples for pathogen detection analysis

S2. Pathogens identified by GSMer in the Yangtze River Basin and their potential hosts

S3. Georeferenced sampling locations and pathogen richness used in spatial mapping

Spatial distribution maps for water, sediment and soils

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

Wang, J., Wang, S., Li, T. et al. A watershed-scale potential pathogenic bacteria dataset from the Yangtze River Basin. Sci Data (2026). https://doi.org/10.1038/s41597-026-06983-0

Source: Ecology - nature.com
