Developing the hydrological dependency structure between streamgage and reservoir networks
Developing the interdependency between reservoirs and streamgages across a river network (Fig. 1a) is essentially merging three data sets (attributes of dams, streamgages, and river flow lines) (Table 1) to create a unidirectional graph or network18,19. In the resulting graph, each node of the network represents a streamgage or reservoir with the connecting stems between pairs of nodes represent river reaches in-between those streamgages or reservoir pairs (Fig. 1b). In this merged unidirectional graph, point information can be stored as attributes as the node of the unidirectional graph. We represent a reservoir-streamgage network as an ‘edge list’18,19 a table of information defined by flow connectivity between network nodes (streamgage or reservoir). In this edge list, each row is a connection link (edge) from one node to another containing NHDPlusflowlines4 attribute such as unique identifiers (ComID) and distances along the river path.
Table 1 Data sources.
Full size table
We provide our dataset as an edge list that can be easily adopted in various geospatial applications. The motivation to present the combined dataset in a robust format is to enable end-users from varied backgrounds to use the dataset in a conventional network data format. For clarity, hereafter, we refer to the nodes of our unified edge list as ‘points’ which can be either a dam or a streamgage. In the rest of the article, attributes of NHDPlusV2 are summarized in Table 2 and written in italics. We use the word ‘node’ only in the context of NHDFlowlines’ connecting nodes and its attributes.
Table 2 Glossary of terms.
Full size table
Combined reservoir-streamgage data is created by traversing on the NHDPlusflowlines4 network from headwaters to the downstream-most node(s) of each watershed, in a parallel fashion. The proposed method is summarized schematically in Fig. 2. We used various C++ 20 based R packages (dplyr21 and sf 22) for data handling and geospatial analysis. Headwater flow lines that have at least one point are analyzed first. Points (reservoirs or streamgages) are ordered according to their increasing distance from headwater based on ascending values of ArbolateSum of each NHDFlowline feature. At first, for each point, its nearest NHDFlowline and its ComID are recorded. Thus, each river reach where any streamgage or reservoir is located, all vector attributes of that flow line can be accessed from NHDPlusV24 database. Nearest NHDflowlines of these points are grouped by LevelPathID, which ensures all features along a river are uniquely grouped. Next, a parallel node search along the downstream direction is carried out for each group identified in the previous step based on LevelPathID. The downstream parallel node search is essentially a depth-first-search18,19 – a network traversal technique that recursively explores downstream point features (reservoirs or streamgages) till all points along a river (grouped by LevelPathID) are identified. At this iterative step, we account for different conditions such as river junctions (e.g., divergences, convergences, or both), boundary of a watershed, isolated network, and coastlines or end of a stream reach.
Fig. 2
Schematic of the methodology. Words in italics font indicate attributes from the NHDPlusV2 database.
Full size image
For the application of the proposed algorithm to CRB, we relate spatial information (latitude, longitude) of each point (reservoirs or streamgages) along with attributes such as (1) unique identifiers (NID ID for reservoirs and USGS site number for streamgages), (2) ComID of nearest NHDFlowline, (3) distance of each point from end nodes of the nearest NHDFlowline, and (4) distance between each point and its immediate upstream and downstream point. Additional information can also be retrieved from the output edge list for specific applications depending on specific requirements of a network analysis problem. For example, to understand the cumulative effect of dams in CRB on flow alteration in the downstream river reach, a subset of network is used from the edge list where network nodes are the reservoirs and streamflow at gages located immediately downstream of these dams are the variables of interest23. Thus, the developed riverine connectivity between streamgage and reservoirs can be used to extract additional information (e.g., reservoir storage, drainage area of the streamgage, etc.) from the individual nodes.
Input data sources
For CRB, the dataset combines dams24 from the National Inventory of Dams (NID) and streamgages25 from the United States Geological Survey (USGS) (Fig. 1a, Table 1). The NID24 is a database from the US Army Corps of Engineers (USACE) documenting information about more than 90,000 water control infrastructures (mainly dams) across the CONUS and US territories. This dataset includes information about a dam’s type, purpose, size, location, and classification of downstream potential hazard. According to this dataset, there are 1344 dams within the CRB (Water Resources Regions 14 and 15), which we used for the present study. Information on dams was retrieved using the R package dams26.
The National Water Information System (NWIS) database contains point measurements of water-related data (e.g., surface water, groundwater) collected at more than 1.5 million sites across the CONUS. This dataset contains records from 1899 to present water data for the nation. Around 20,000 streamflow gage stations are included in the NWIS dataset, of which 1656 are located within CRB (Regions 14 and 15).
In order to characterize the CRB stream network, we used high-resolution NHDPlusV2 dataset4 available from ftp://ftp.horizon-systems.com/NHDplus/NHDPlusV21/Data/. Our main purpose in utilizing the NHDPlusV2 database is to obtain high-quality streamflow lines and attributes (characterized by VAA), acting as the basis of upstream and downstream identification of dam-streamgage interdependency. We consider vector data of all input spatial datasets to develop the interdependent structure between reservoirs and streamgages over the CRB. For developing the RICON17 database using Vector Processing Units (VPU’s) in NHDPlusV2 dataset for watershed regions 14 and 15, the network analysis is carried out over the entire upper and lower CRB to maintain continuity in the river network from Glenn Canyon dam (outlet of water resources region 14) to Hoover dam which is located approximately 590 kilometers downstream in region 15.
Data preparation
In geospatial data merging, if two or more-point shapefiles are used, one can simply join the attribute tables of each shapefile by a common ID. Using unique identifiers of each of the data sets, we can concatenate any string pattern (e.g., “Point.”, “Pt”) to make a simple, unique identifier (node number) for the merged points. Given the number of points or nodes NP (NP = NR + NG) where NR is the number of reservoirs and NG is the number of streamgages. At this stage, we only keep common IDs such as latitude, longitude, and point name and river information, if available, for the merged point dataset. For network analysis, we do not retrieve any other attributes to reduce the memory requirement. In data preparation, the first step is to identify the nearest river reach of each point feature (reservoirs or gages). This can be achieved easily using any available geographic information tool. We used the nearest feature identification technique in sf package22 in R along with string matching for identifying the nearest NHDflowline feature. This step can be computationally demanding, depending on the size of the datasets and geospatial tools used. Accuracy of this step is also subjected to the resolution of the river network data.
For each node (reservoir or streamgage), (1) nearest NHDflowline and (2) distance from either endpoint of the nearest flowline to the node must be recorded in this step (see data preparation panel in Fig. 1). The nearest NHDflowline feature of each geospatial point is linked by the unique identification number (ComID) of the stream reach. Once the nearest river reach is identified for each reservoir or streamgage, these points are snapped to their nearest NHDflowline. Distance between each snapped point from network headwaters is calculated using the NHDPlusV2 flowlines’ attribute Arbolate Sum, which ensures that for all upstream features, the length from headwaters is always increasing from upstream to downstream irrespective of the number of tributaries joining the main path. This distance from headwaters is simply a summation of the Arbolate Sum of the nearest NHDflowline feature and the distance of the point (dam or gage) from the end node (upstream node) of this flowline. For faster computation, identification of the nearest flowline is carried out using the R package sf 22 a C++ 20 based R14 package for simple geospatial analysis. For developing the current dataset, we identify the nearest flowline feature using the following three steps:
(1)
First, all input shapefiles are transformed to a planar coordinate system Universal Transverse Mercator (UTM). For coordinate system conversion, the UTM zone of each point should be calculated for each watershed region. For simplicity, the zone with the maximum number of points is used for the entire region. This method can lead to erroneous conversion while handling a vast watershed at once, such as Mississippi. We carried out this step separately for each VPU in the NHDPlusV2 dataset for CRB.
(2)
After this, the nearest stream reach is identified for each point, and its ComID is noted.
(3)
Next, each point’s names and the information about the nearest NHDflowline are gathered as text from point shapefiles. These river names are then compared with NHDPlusV2 GNIS_NAME (Geographic Names Information System) by a robust text comparison method. This attribute is present for the majority of the main stem rivers and their tributaries. This step considers that the point might be located below or upstream or near or at another point or flow reach as well as at a tributary or canal or ditch or dike relative to the main flow path. Whenever no matching string is found at this step based on the nearest flowline identified in step 2, a warning is issued. A warning is also issued whenever NHDPlusV2 flowlines GNIS_NAME is blank. For points where a warning is issued, additional checks are carried out. First, the distance from these points to all flowlines within a user-specified buffer distance is calculated. Suppose the nearest line feature in step 2 gave a warning. In that case, the algorithm looks for successive nearest line feature, within a buffer distance of 1 km, using a combination of distance comparison and fuzzy string matching27 between GNIS_NAMEs of these flowlines within the buffer distance and the name of a given point (USGS station name for a streamgage; and NID dam name and river names for the reservoirs). This buffer distance is carefully chosen depending on the resolution of the data, units of distance measurement, and the difference between the distances to successive lines from the point of interest. This step issues warning suggesting visual checks which may be manually handled. Visual checks are only issued whenever a new flowline is chosen using fuzzy string matching over the nearest one identified in step 2. To decide on a viable buffer distance and accuracy of the snapping algorithm, we used a subset of national gage locations provided by NHD as a benchmark for linking streamgages with NHDFlowlines. We tested a range of buffer distances 100 m, 200 m, 500 m, 1 km, and 10 km, and counted the number of erroneously assigned flowlines. For the said buffer distances, 5.8–7.5% of streamgages needed correction. The visual warnings for dams are checked manually and corrected if needed. With a 1 km buffer distance, the algorithm issued visual warnings for 45 dams out of 1344. After manually checking each of them, the nearest lines for ten dams needed to be adjusted (see Table SI 1). Overall, if a visual warning is issued for less than 10% of the points, we posit that the snapping process is satisfactory as warnings at this stage do not necessarily mean a wrong flow line is linked to a point. It may merely denote that the name of the NHDflowline feature is missing in NHDPlusV2 dataset which is acceptable as not all minor river reaches and small tributaries have an associated river name in NHDPlusV2 data. Upon the availability of higher resolution spatial data, this threshold can be made more stringent.
Identification of the nearest NHDflowline feature is critical for the overall reliability of the dataset. Additional validation of this step is carried out by comparing the ComID of nearest flow lines of each gage used in the current study, with that of the gages provided in NHDPlusV2 dataset. Errors at this step may arise from sources, such as inaccurate coordinate transformation, wrong choice of the buffering distance during point snapping. In this data preparation stage, we join the NHDPlus hydrography datasets for regions 14 and 15 and create a single input data (NHD_data.RData).
Network analysis
Initial network data is created by traversing the NHDPlusflowlines network from headwaters to the downstream most node(s) of each watershed, in a parallel fashion (Fig. 2, ‘Network Analysis’ panel). We extensively used the R14 package dplyr21 in this work. We use three NHDPlusv2 database files (.dbf files) for each watershed – NHDflowline, PlusflowlineVAA, and Plusflow – using the following steps:
(1)
In the NHDPlusV2 dataset, headwaters or upstream most stream reaches are designated by the attribute StartFlag being 1. Headwater flowlines with at least one point are analyzed first (Step 2a in Fig. 2). Points are ordered according to their increasing distance from headwater. The downstream most point of the analyzed headwater flowlines are selected, and from each such point, network analysis is carried out recursively along the downstream direction.
(2)
At this stage, points downstream of headwater lines but without any upstream points (i.e., reservoirs or streamgages) are also selected for further analysis. Nearest NHDflowlines for each of these points are grouped by LevelPathID, ensuring all features along the river segment are uniquely grouped.
(3)
Next, a parallel and recursive node search along the downstream direction is carried out for each group identified in the previous step using ‘Move downstream’ function (Fig. 2). Input arguments for this function are x– points under consideration (node numbers, ComID of the nearest flowline, dP, dHW) and ‘PATH’ – ‘main’ or ‘minor’ flow path along which the recursive node search is carried out. For a ‘main’ path (Step 2b in Fig. 2), LevelPathID of upstream and downstream NHDFlowline are the same, whereas, for a ‘minor’ path (Step 2c in Fig. 2), LevelPathID’s are different for an upstream or downstream stream reach connection. This is essentially a depth-first search along a given river stem and given PATH that builds network connectivity from upstream to downstream by recursively updating x (Fig. 2). At this iterative step, we account for different conditions such as river junctions (e.g., divergences, convergences, or both), boundary of a watershed, isolated network, and coastlines or end of a stream reach. ‘Move downstream’ function accesses the NHDPlusV2 VPU tables (PlusFlow, PlusFlowlineVAA, and NHDFlowlines).
In this method, the raw or initial output is a connectivity matrix – a NP × NP square matrix where NP is the total number of points (all dams and gages) analyzed. While the raw output in this format is easy to compute and handle, a connectivity matrix for a large dataset of a riverine network can often be a huge sparse matrix. We present our final dataset as an edge list for better memory usage and ease of incorporating additional information on points and flowlines, which is a popular and efficient representation tool of tree data structure. The edge list consists of nodes and edges information for the entire river network. The nodes are the infrastructures along the flowlines and edges are the connecting river reaches in between successive infrastructure. This highly flexible and robust edge list can be easily used in any traditional network search algorithm. Although we generate a separate edge list for each drainage basin, they can be easily merged by identifying the ComID’s of the NHDPlusflowlines that represent the basin outlet(s). More