A global dataset of surface water and groundwater salinity measurements from 1980–2019
Selection criteria
Salinity is a measure of the concentration of dissolved (soluble) salts in water from all sources. It can be reported using a range of parameters (including dissolved solids fractions, total dissolved solids, chloride, electrical conductivity and salinity) and units (including ppm, mg L−1, µS cm−1 and dS m−1). Data collection here focused primarily on electrical conductivity (EC) measurements, since EC is the most widely reported salinity parameter and a main aim of this database is to provide comparable data across scales. However, total dissolved solids (TDS) is also a common salinity parameter, particularly for groundwater quality measurements. TDS and EC are correlated, and one can be estimated from the other using a conversion factor19. Regional conversion factors have been shown to produce better correlations than global factors, since the relationship between EC and TDS depends on a range of factors that may vary spatially, e.g. climate, temperature, dissolved ion concentrations and ionic strength20. Thus, to maximize data inclusion, datasets containing TDS measurements were included, but only if a regional conversion factor could be found in the literature (see Methods and Technical Validation for further description of the conversion and correlation analyses).
Multiple selection criteria were applied for each monitoring location and water type sampled. Surface waters were divided into the following categories: (i) river; and (ii) lake/reservoir. A surface water sampling location was included if it had at least 30 measurements within the selected time period (1980–2019). For groundwater, we included all measurements at each location, provided that sampling depth information was reported. This less stringent sampling frequency criterion for groundwater locations reflects the general scarcity of high frequency groundwater monitoring compared to surface water monitoring21,22. Additionally, groundwater data of low temporal resolution can still provide valuable input for first order salinity assessments, model calibration and/or hypothesis testing23. Sample depth is, however, an important variable for interpreting groundwater EC, since it has large implications for, for example, withdrawal depths for different sectoral water uses, as well as for estimation of the freshwater/saltwater lens24. This motivates prioritizing the depth availability criterion over sampling frequency for groundwaters. In addition to these criteria, all samples had to have date and coordinate (latitude, longitude) information to qualify for inclusion in the database (see Fig. 2 for a schematic flowchart of the data selection and processing steps).
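The selection rules above can be illustrated with a minimal sketch. This is not the authors' implementation (their processing was done in R); the record keys (station_id, water_type, date, lat, lon, depth_m) and the function name are hypothetical, and only the 30-measurement threshold, the 1980–2019 period, and the depth, date and coordinate requirements come from the text.

```python
from collections import Counter
from datetime import date

MIN_SURFACE_SAMPLES = 30                           # threshold stated in the text
PERIOD = (date(1980, 1, 1), date(2019, 12, 31))    # selected time period


def select_records(records):
    """Return the records that pass the selection criteria.

    Each record is a dict with hypothetical keys: station_id, water_type
    ('river', 'lake/reservoir' or 'groundwater'), date, lat, lon, depth_m.
    """
    # Every sample needs a date inside the period and both coordinates.
    valid = [
        r for r in records
        if r.get("date") is not None
        and PERIOD[0] <= r["date"] <= PERIOD[1]
        and r.get("lat") is not None
        and r.get("lon") is not None
    ]
    # Count the valid surface water measurements per station.
    counts = Counter(
        r["station_id"] for r in valid if r["water_type"] != "groundwater"
    )
    selected = []
    for r in valid:
        if r["water_type"] == "groundwater":
            # Groundwater: no frequency threshold, but depth is required.
            if r.get("depth_m") is not None:
                selected.append(r)
        elif counts[r["station_id"]] >= MIN_SURFACE_SAMPLES:
            # Surface water: station needs at least 30 measurements.
            selected.append(r)
    return selected
```

In this sketch the surface water count is taken after the date and coordinate checks; whether the original workflow counted before or after those checks is not stated in the text.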
Fig. 2
Data selection and harmonisation flowchart. The figure illustrates the processing and harmonizing steps of each dataset (divided into surface and groundwater parts) after initial data collection.
Data collection and sources
Data was collected from both surface water and groundwater monitoring locations using a combination of data sources, including: (i) global datasets, (ii) regional datasets, and (iii) datasets for individual river basins and groundwater aquifers. The regional data includes datasets spanning multiple river basins and/or groundwater aquifers, both within the same region and across regions. Most of these data are provided by governmental organizations or cross-regional data portals under environmental protection agencies or national water quality monitoring programs. The local/individual basin datasets consist of monitoring data for individual basins and were usually found through governmental agencies, river basin management commissions and research organizations, or were provided by individual researchers. Each data source is listed and briefly described below (the data source abbreviations were defined by us, for easy reference to the database terminology). A full list of the corresponding data (including their spatial and temporal resolution) for each of these sources (including their URL), divided by water type, is given in online-only Table 1.
For the database presented here, we focused on combining and harmonizing EC datasets from already available, open data sources. The reason for this is that EC is often included in broader environmental monitoring websites and/or water quality datasets, which are not identifiable as salinity datasets but are described in general water quality terms. We thus wanted to extract the salinity data component and facilitate the reuse of harmonized EC data for salinity-specific applications. Most of the datasets included in our database have original licenses that permit unrestricted reuse. Where this was not the case, or if information was lacking, we requested and were granted permission from the data owners to release the data under the CC-BY license.
Although we acknowledge that the scientific literature may contain valuable datasets, literature data was not a focus here, since it requires a different data search and extraction approach. We only incorporated pre-extracted datasets from literature reviews and syntheses when these were shared by individual researchers (reached through communication within our research community, e.g. during workshops and conferences and within our own networks and communication channels). The following subsections provide an overview of the global, regional and local salinity datasets included in the database.
Global salinity dataset
The Global River Chemistry Dataset (GLORICH) includes multiple water quality parameters for river locations around the world, assembled by researchers from Hamburg University25,26. This data is publicly available and was downloaded as a zip file from PANGAEA. The dataset includes 1.27 million samples of major compounds, nutrients, carbon species and physical properties. We extracted Specific Conductivity data (another term for EC) from the “hydrochemistry” csv file and paired it with station information (“Sampling_locations” file), for all stations that fulfilled our selection criteria.
Regional salinity datasets:
(1)
Data for Europe was collected from the European Environment Agency’s water quality database, Waterbase. Waterbase contains multiple water quality parameters for rivers, lakes and groundwater bodies throughout Europe. We extracted relevant EC and station information data using the raw disaggregated water quality data file “Waterbase_v2018_1_T_WISE4_DisaggregatedData” and the parameter code for EC (“EEA_3142-01-6”, specified as Specific Conductance). The water types were identified and distinguished using the column parameterWaterBodyCategory, where “RW” is a river, “LW” a lake and “GW” a groundwater location. Site information was extracted from the file “Waterbase_v2018_1_WISE4_MonitoringSite_DerivedData”. The groundwater EC data was matched with depth information, using the parameterSampleDepth parameter.
(2)
The Water Quality Portal (WQP) contains a range of water quality data for surface waters and groundwaters across the United States. The portal was established by the United States Geological Survey (USGS), the Environmental Protection Agency (EPA), and the National Water Quality Monitoring Council (NWQMC), and the data originate from state, federal, tribal, and local agencies. Data was downloaded in bulk for the parameter Specific conductance, for all available sites under the search criteria (i) streams, (ii) lake, reservoir, impoundment and (iii) subsurface. Station information was additionally downloaded and paired with the salinity data.
(3)
Groundwater data for the US was also gathered from the Dissolved-Solids Dataset (Qi & Harris 2017)27, by downloading the “Dissolved solids” csv file and combining it with depth information from the “AquiferDepthSources” excel file. This data is published by the ScienceBase Catalog, provided by the USGS and contains EC (and other geochemical) data that was collected with the purpose of assessing brackish groundwaters across the United States. The original dataset contains a compilation of water-quality samples from 33 sources for almost 384,000 groundwater wells across the continental U.S., Alaska, Hawaii, Puerto Rico, the U.S. Virgin Islands, Guam, and American Samoa, dating back to the early 18th century.
(4)
Groundwater data from Colorado was collected from the Department of Agriculture and Agricultural Chemicals & Groundwater Protection section (Co Gov). Data was downloaded directly from the site using a search query for statewide inorganic quality monitoring data, selecting the parameter Specific Conductance (Lab) for all available years. Site coordinate (latitude, longitude) information was not available online, but was provided on request by their groundwater monitoring specialists (Karl Mauch, personal email communication). Estimates of well sampling depth were also provided via email. The perforated interval measure (the interval between the top and bottom of the perforated section where the pump is installed) was recommended and used as depth information.
(5)
Groundwater data from California was downloaded from the GeoTracker Groundwater Ambient Monitoring and Assessment Program (GAMA), provided by the California state open data portal. The dataset includes multiple groundwater quality data from the GAMA Domestic Well (DW) and Priority Basin (PB) programs, covering locations throughout the state. The column “well_depth” was the only depth information available, and was included (and converted from feet to meters) as the Depth parameter.
(6)
Groundwater monitoring data from the Ohio Environmental Protection Agency (Ohio EPA) was downloaded from their ambient groundwater monitoring program. Monitoring of groundwater wells was established in the late 1960s and today covers more than 300 wells. Here too, the “well_depth” parameter was the only depth information available, and was included (and converted from feet to meters) as the Depth parameter.
(7)
The groundwater database from the Texas Water Development Board (TWDB) was also used to download water quality data. EC data was downloaded in bulk by groundwater aquifer (nine datasets in total). Well depths were converted from feet to meters, and where multiple measurements were reported for the same day and well, daily averages were calculated. A total of 404 wells fulfilled the selection criteria and were included in the main groundwater database.
(8)
Data for South Africa was collected from the Department of Water and Sanitation (DWS), Republic of South Africa28. Both surface waters and groundwaters are monitored as part of their National Chemical Monitoring Program. Monitoring stations and their data can be viewed and downloaded through the Water quality data exploration tool. However, due to the large amount of data for surface waters, we requested and received, via email, raw water quality data from the Resource Quality Information Services national monitoring programs for specific rivers and dams.
(9)
Surface water monitoring data for a large part of Australia is provided by the Australian Government, Bureau of Meteorology (AU Gov). Data can be queried at the Water Data Online portal, where search criteria can be specified. A search for all stations with EC data returned 1,333 stations. However, since data can only be downloaded one station at a time, we sent an email through the help desk system requesting a bulk download of all available data. The data was then provided as daily means recorded at midnight, as csv files (one file per station), together with a metadata summary file containing station information. All files were combined, and stations that fulfilled the selection criteria were included in the main database. River and lake/reservoir locations were distinguished using the “long_name” column of the data file, which always included the water type as well as the actual name of the monitoring location.
(10)
Surface water data for Australia was also synthesized from the Queensland Government Open Data Portal (QLD AU Gov). Data from QLD AU Gov was collected from the ambient estuary water quality monitoring program, which includes tidal rivers, streams and inshore waters of Central Queensland, monitored from 1993–2013. Data is available for 12 different drainage basins, reported as Specific Conductance at 25 °C. Data was downloaded as individual csv-files for each drainage basin (containing multiple sampling locations), and then combined and extracted according to the selection criteria.
(11)
Groundwater data for Australia was gathered from the Australian Government Bioregional Assessment Program (BAP). The data is provided through a collaboration between the Department of the Environment and Energy, the Bureau of Meteorology, CSIRO and Geoscience Australia. The dataset contains EC measurements of groundwater bores in the Namoi sub-region. The data is collected from groundwater bores that fell within the data management acquisition area as provided by the Bioregional Assessment to the Namoi NSW Office of Water. All data were downloaded in one csv-file.
(12)
Another groundwater dataset from Australia was collected using the groundwater data portal from WaterConnect, which provides data from the Department for Environment and Water, for South Australia. Data was queried by region, and for each region (12 regions in total) one file containing EC data for all sampled wells and one file containing site information were downloaded. The “Latest_Depth (m)” column was used for depth information, and all stations with both depth and EC measurements for a given date were included.
(13)
Additional groundwater data from Australia was downloaded using the Australian Groundwater Explorer tool (AU GwEX). Data was searched for by the parameters Water level and Salinity, downloaded by region (8 regions in total) and combined. Water level and EC data were linked to the NGIS bore data to obtain the location and attributes of the measurement wells.
(14)
Data for New Zealand was gathered from New Zealand’s Hydro Web Portal for Hydrometric and Water Quality data (NIWA). This platform provides river water quality data under the National Institute of Water and Atmospheric Research. Data was queried by searching for all available data under the parameter conductivity and time-series, in their map interface (resulting in 77 locations with time-series data). Each dataset was then added for bulk export, using the export tab and a download link, via the map interface.
(15)
Surface water quality data from the Government of Canada (Ca Gov) was downloaded from the National Long-term Water Quality Monitoring Data portal. The data include both rivers and lakes monitored for a set of physico-chemical variables, including specific conductance. Data was downloaded as csv files.
(16)
River data was also synthesized from the Government of Ontario for multiple rivers, monitored between 2000 and 2016. The data is collected by the Provincial (Stream) Water Quality Monitoring Network (PWQMN), which measures water quality in rivers and streams across Ontario. Data was downloaded as individual Excel files for each year, and then combined with site information.
(17)
Groundwater data from Argentina was downloaded from the repository of open public data of the Argentinian Republic (Dat.ar). The data is provided by the Federal Groundwater Information System SIFAS-SISAG and contains groundwater well measurements from April 2015. The data was downloaded as a main csv-file and translated from Spanish.
(18)
Groundwater data was also collected from Cambodia, using the online well database of Cambodia (WellMap). WellMap is an initiative of the Ministry of Rural Development of Cambodia, supported by the Water and Sanitation Program of the World Bank (WSP). The database is provided as a Microsoft Access database and consists of water quality data collected from rural wells throughout the country. Data was queried and extracted using the RODBC R package, which provides an interface from R to database systems. UTM coordinates were re-projected and converted to latitude and longitude, as decimal degrees, using the functions “proj4string” and “spTransform” in R.
(19)
Data from the Government of Mexico (MX Gov) was downloaded and translated (from Spanish) from one main csv file, containing both water quality and site information data. The data included both surface water locations (originally classified as rivers, streams and dams, which were reclassified to the terminology used here) and groundwater locations, monitored since 2012.
(20)
Groundwater data from Bangladesh was collected and shared by M.M. Rahman (TH Cologne, University of Applied Sciences, Institute for Technology and Resources Management in the Tropics and Subtropics). The data includes electrical conductivity and depth data synthesized from both literature and governmental sources (see specifications and references in online-only Table 1).
(21)
Groundwater EC and level data from the Swedish Geological Survey (SGU) was downloaded from environmental monitoring data, on a county basis, for all 21 counties in Sweden. EC data was extracted from environmental monitoring files, with one file per county (queried using county-specific codes and a URL link to each dataset), and combined with well water level data (downloaded in the same way as the salinity data) using matching coordinates. Data for all stations with water level information were translated to English and included in the main groundwater database.
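As an illustration of the kind of extraction described for Waterbase in item (1) above, the following sketch filters a disaggregated csv for the EC parameter code and maps the water body category codes to the database water types. It is a hedged example, not the authors' R code: the column names observedPropertyDeterminandCode, monitoringSiteIdentifier and resultObservedValue are assumptions, as is the function name; only the code “EEA_3142-01-6”, the column parameterWaterBodyCategory and the “RW”/“LW”/“GW” values come from the text.

```python
import csv
import io

EC_CODE = "EEA_3142-01-6"   # parameter code for EC given in the text
# Category codes from the text, mapped to the database water types.
WATER_TYPES = {"RW": "river", "LW": "lake/reservoir", "GW": "groundwater"}


def extract_ec_rows(csv_file, code_column="observedPropertyDeterminandCode"):
    """Keep only EC rows for rivers, lakes and groundwater.

    `code_column` is an assumed column name for the parameter code.
    """
    rows = []
    for row in csv.DictReader(csv_file):
        if row.get(code_column) != EC_CODE:
            continue                      # not an EC measurement
        category = row.get("parameterWaterBodyCategory")
        if category not in WATER_TYPES:
            continue                      # e.g. transitional or coastal waters
        row["water_type"] = WATER_TYPES[category]
        rows.append(row)
    return rows


# Tiny made-up sample, for illustration only.
sample = io.StringIO(
    "monitoringSiteIdentifier,parameterWaterBodyCategory,"
    "observedPropertyDeterminandCode,resultObservedValue\n"
    "S1,RW,EEA_3142-01-6,450\n"
    "S2,GW,EEA_3142-01-6,1200\n"
    "S3,LW,CAS_14797-55-8,3.1\n"
    "S4,TW,EEA_3142-01-6,30000\n"
)
ec_rows = extract_ec_rows(sample)
```

With the made-up sample, only the river row S1 and the groundwater row S2 are retained; the station details would then be joined from the separate site information file.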
Salinity datasets from individual river basins and groundwater aquifers:
(1)
Data for river locations within the Danube river basin was collected from the Danube River Basin Water Quality Database. This database is provided by the International Commission for the Protection of the Danube River (ICPDR) Information System Danubis (ICPDR). The database provides geochemical data for the major rivers in the Danube River Basin, and waters are sampled at a minimum frequency of 12 times per year. The data was accessed by creating an account, performing a data search for all available years and stations for the conductivity parameter, and exporting the resulting data as a csv file.
(2)
Data for the lower Murray Darling river basin was accessed through the Water Connect data portal (Waterconnect). All stations within the river basin that fulfilled the data selection criteria (six stations) were included and downloaded, one by one (using a combination of the historical EC daily readings and the Site summary files).
(3)
Groundwater TDS data for the Nile Delta aquifer (van Engelen et al.)29 was provided by Joeri van Engelen. These data comprise three datasets of TDS measurements, synthesized from literature and restricted to measurements from less than 250 m depth. Two of these datasets had unspecific dates, and samples were thus assumed to be from the 1st of each reported month (see further specification of the data in van Engelen et al.29). The TDS data was then converted to EC, using a region-specific conversion factor from literature sources (see section Conversions of TDS to EC for specifics on how this was done).
Data processing and harmonization
The overall objective of this database is to facilitate data reuse and research efforts within different fields of salinity research. Harmonization of the data was therefore a main part of the database construction. The flowchart (Fig. 2) illustrates the data selection criteria, data processing and harmonization applied to each sampling location and its associated dataset before it was added to the main database. All processing was done in R, version 3.6.0, mainly using the data.table and dplyr packages. First, we fixed missing values and other uninterpretable field values or symbols that prevented data files from being read properly (e.g. special symbols like “***”, or erroneous changes in field separators, such as from “,” to “;”). Such values were set to the standard missing data value (NA), and rows that could not be read properly were fixed or excluded. Additionally, reported salinity and depth values that were assumed to be erroneous (such as negative values, 999 and 9999, as well as depth values of zero) were removed.
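The value cleaning described above can be sketched as follows. This is an illustrative example in Python rather than the authors' R code; the function name and the use of None as the missing value marker are assumptions, while the rules (unreadable symbols become missing; negative values, 999 and 9999 are dropped; zero depths are dropped) follow the text.

```python
import math

SENTINELS = {999.0, 9999.0}   # placeholder values named in the text


def clean_value(raw, is_depth=False):
    """Return a float, or None if the value is missing or assumed erroneous."""
    try:
        value = float(str(raw).strip())
    except ValueError:
        return None               # e.g. "***", "", "NA" or other symbols
    if not math.isfinite(value):
        return None               # "nan" or "inf" strings
    if value < 0 or value in SENTINELS:
        return None               # negative or placeholder value
    if is_depth and value == 0:
        return None               # zero depth is treated as erroneous
    return value
```

For example, clean_value("***") and clean_value("9999") both give None, whereas clean_value("350.5") gives 350.5. A zero is kept for salinity values but rejected when is_depth=True.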
Since information on sampling water type and parameter nomenclature and reported units differs between regions and organizations, we re-classified water types into the three mentioned categories (river, lake/reservoir, groundwater). Where needed, we also re-named and converted other parameters and their associated units, according to the database variables listed in Table 1.
Table 1 Variable names and descriptions, including reported units, of the salinity database.
Different spatial and temporal conversions were also made (see Fig. 2). For instance, where multiple measurements per day were available, these were averaged into daily values, using the data.table package and grouping by Station_ID and Date (see Table 1 for parameter definitions). Depth conversions were also common and included conversions from feet or centimeters to meters. Regarding spatial harmonization, the coordinates of each sample were converted to decimal degrees and, if needed, re-projected to WGS 1984, using the “SpatialPoints”, “proj4string” and “spTransform” functions of the rgdal R package. If country information was missing, it was assigned from the coordinates of each station using the “map.where” function of the maps R package, or extracted from country codes (if available) using the function “countrycode”. Continent information was then assigned from country names, also using the “countrycode” function, by matching country name with continent.
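A minimal sketch of the daily averaging and depth unit conversion is given below. The authors used the data.table package in R; this standard-library Python version only illustrates the grouping logic, and the function names and unit labels are hypothetical.

```python
from collections import defaultdict

# Conversion factors to meters (1 ft = 0.3048 m exactly).
TO_METERS = {"m": 1.0, "ft": 0.3048, "cm": 0.01}


def daily_means(samples):
    """Average EC per (Station_ID, Date).

    `samples` is an iterable of (station_id, date, ec) tuples.
    """
    groups = defaultdict(list)
    for station_id, day, ec in samples:
        groups[(station_id, day)].append(ec)
    return {key: sum(values) / len(values) for key, values in groups.items()}


def depth_to_m(value, unit):
    """Convert a depth reported in 'm', 'ft' or 'cm' to meters."""
    return value * TO_METERS[unit]
```

Two readings of 100 and 200 µS cm−1 at the same station on the same day would be stored as a single daily value of 150 µS cm−1, and a well depth of 100 ft as 30.48 m.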
To assist studies that are specifically interested in coastal regions and applications, we also determined whether each sampling location was coastal or not. This analysis was done in ArcMap, using the “Near Table” analysis tool. The distance from all sampling locations to the coastline was computed (using vector data from Natural Earth: https://www.naturalearthdata.com/downloads/10m-physical-vectors/). All locations within 10 km of the coastline were classified as coastal. The identification of coastal stations was then included in each database summary file, under the column “Coastal_location” (see Table 1).
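The coastal classification can be approximated outside a GIS with the sketch below. It is a simplification and not the ArcMap “Near Table” procedure used by the authors: it assumes the coastline is available as a list of (latitude, longitude) vertices and measures great-circle distance to the nearest vertex only, not to the line segments between vertices. The function names and the mean Earth radius of 6371 km are assumptions; the 10 km threshold is from the text.

```python
import math

EARTH_RADIUS_KM = 6371.0        # mean Earth radius (assumed for this sketch)
COASTAL_THRESHOLD_KM = 10.0     # threshold stated in the text


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    # min() guards against rounding pushing sqrt(a) slightly above 1.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def is_coastal(lat, lon, coastline_points):
    """True if the nearest coastline vertex lies within 10 km."""
    nearest = min(
        haversine_km(lat, lon, c_lat, c_lon) for c_lat, c_lon in coastline_points
    )
    return nearest <= COASTAL_THRESHOLD_KM
```

With densely spaced coastline vertices the vertex-only distance is close to the true distance to the line, but for coarse coastlines it can overestimate the distance and misclassify stations near the threshold.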
Conversions of TDS to EC
We considered the inclusion of additional groundwater data, where TDS measurements could be converted to EC. The relationship between EC and other measured salinity parameters (e.g. TDS) depends on a range of conditions, such as temperature, climate and concentrations of ionic and undissociated species18. This relationship is commonly estimated according to Eq. (1).
$$EC=\frac{TDS}{f}$$
(1)
where EC is in µS cm−1, TDS is in mg L−1 and f is a conversion factor19,30. Commonly, predefined conversion factors are used without proper site-specific validation, but such estimates may be highly uncertain, due to the conditions mentioned above20. Instead, it has been shown that region-specific conversion factors may be more representative, since these have been developed from measured relationships between EC and TDS under more local to regional conditions19,20.
Because EC-TDS relationships have been reported to be more predictable when region-specific conversion factors (f) are used, we included additional groundwater TDS measurements only for regions with reported region-specific f values. This resulted in the inclusion of three additional groundwater datasets in the final database: one from Idaho31, one from California32 and one from Egypt29. Together these datasets added 3,477 sampling locations and a total of 9,654 measurements to the groundwater database. Both the original TDS data and the converted EC values are included in the database.
For the two TDS groundwater datasets from the United States, TDS was converted to EC using the region-specific conversion factor f of 0.65. This conversion factor has been developed for the continental United States, by the US Geological Survey and is widely used cross-regionally within the US20,33. For the TDS groundwater data from Egypt (from the Nile delta)29, we converted TDS to EC using the region-specific conversion factor f of 0.64. This factor value has been derived from local measurement data in the Nile delta itself34.
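As a worked illustration of Eq. (1) with the two conversion factors reported above, the sketch below converts TDS (mg L−1) to EC (µS cm−1). The function name and the region labels used as dictionary keys are hypothetical; the factors 0.65 and 0.64 are those stated in the text.

```python
# Region-specific conversion factors f reported in the text.
CONVERSION_FACTORS = {
    "US": 0.65,           # continental United States (USGS)
    "Nile Delta": 0.64,   # derived from local measurements in the Nile delta
}


def tds_to_ec(tds_mg_per_l, region):
    """Estimate EC (µS cm−1) from TDS (mg L−1) using Eq. (1): EC = TDS / f."""
    f = CONVERSION_FACTORS[region]
    return tds_mg_per_l / f
```

For example, a TDS of 650 mg L−1 in a US well corresponds to an estimated EC of 650 / 0.65 = 1000 µS cm−1, and a TDS of 640 mg L−1 in the Nile delta gives the same EC with f = 0.64.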
To validate our approach of predicting EC from TDS, we applied region-specific conversion factors (f) to other groundwater datasets that reported both TDS and EC measurements. These datasets, from both the US and Australia, showed strong correlations between predicted and measured EC (Fig. 3; R2 of 0.91–0.99), supporting the approach of using TDS and region-specific conversion factors to estimate EC (see Technical validation section).
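The validation statistic can be sketched as the squared Pearson correlation between measured EC and EC predicted from TDS. This is a plain Python illustration under stated assumptions (the authors computed correlations in R with the ggpubr package); the function names are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def validation_r2(tds, measured_ec, f):
    """R2 between measured EC and EC predicted from TDS via EC = TDS / f."""
    predicted_ec = [t / f for t in tds]
    return pearson_r(measured_ec, predicted_ec) ** 2
```

Because the Pearson correlation is unaffected by linear rescaling, this R2 describes how linear the TDS-EC relationship is rather than how well a particular f matches the data, so a complementary bias check (e.g. the mean difference between predicted and measured EC) can be informative.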
Fig. 3
Validation of converted TDS to EC for groundwaters. Time-series plot and scatter correlations of measured vs. predicted electrical conductivity (EC), using regional conversion factors. Panel (a) shows an example time series from the groundwater station with the highest number of measurements in Australia (identified using the “max” function in R; data source: Water connect, n = 538) and panel (b) shows the corresponding scatter correlation (R2 = 0.99). Panel (c) shows the correlation between measured and converted EC for the full dataset of all groundwater stations from Water connect (n = 37,819, R2 = 0.98). Panels (d) and (e) show correlations between measured and predicted EC for groundwaters in Texas (data source: TWDB, n = 59,985, R2 = 0.91) and California (data source: GAMA, n = 4,706, R2 = 0.98), respectively. All scatterplots were made in R, using the “ggscatter” function from the ggpubr package, with correlation coefficients estimated using the “pearson” method.