Environmental DNA allows upscaling spatial patterns of biodiversity in freshwater ecosystems

Data collection and sequencing

In June 2016, diversity data were collected at 61 sites in a 740 km² sub-catchment of the river Thur, northeastern Switzerland. No singular rain events took place during the sampling days, and thus all sampling was performed at base-flow conditions. Sampling sites were chosen in order to proportionally represent all stream orders in the river network (Supplementary Fig. 1) and to span the complete geographical extent of the catchment. At each site, a standardized³⁵ 3-min kicknet sampling applied to three microhabitats was performed to collect benthic macroinvertebrates. The subsamples from the three microhabitats were pooled and stored in ethanol. In the laboratory, debris was removed and all individuals belonging to may-, stone-, and caddisflies (Ephemeroptera, Plecoptera, and Trichoptera, abbreviated as EPT) were identified under a stereomicroscope to the genus or species level if applicable. One sample was lost due to handling error, and thus we subsequently only had EPT kicknet sample data from 60 sites.

At the same 61 sites at which kicknet data were collected, also eDNA samples were taken. We collected three independent samples of 250 mL of river water at each site (sampled below the surface and well above the river bottom). In small streams (up to about 1-m wide), the water was taken from the middle of the stream, while in larger streams the water was taken about 0.5 m away from the shore side. All three water samples per site were filtered on site on separate GF/F filters (pore size 0.7 µm Whatman International Ltd.), which were stored on ice immediately after filtering, and frozen at −20 °C within a few hours. Subsequently, these three samples were analyzed separately and as independent replicates. To prevent cross-contamination, the eDNA samples were collected a few meters upstream of the kicknet sampling, and kicknet sampling and eDNA sampling were performed by different people.

DNA was extracted with the DNeasy Blood and Tissue Kit (Qiagen GmbH) from the filter in a lab dedicated to low DNA-concentrated environmental samples (i.e., clean lab with positive air pressure, well-defined procedures and all work conducted under sterile benches, and no handling of high-DNA concentration samples or post-PCR products). A short barcoding region of the COI gene^58,59 was targeted, and a dual-barcoded two-step PCR amplicon sequencing protocol for Illumina MiSeq was performed. In short, three PCR replicates were performed for each eDNA replicate with primers that contained an Illumina adapter-specific tail, a heterogeneity spacer and the amplicon target site. These three PCR replicates per sample replicate were pooled and indexed in a second PCR with the Nextera XT Index Kit v2 (lllumina). We measured the concentration of all indexed PCR reactions and pooled them in equimolar parts to a final library, which we ran twice on two consecutive Illumina MiSeq runs to increase sequencing depth. Raw data were demultiplexed and read quality was checked with FastQC⁶⁰. Thereafter, end-trimming (usearch, version 10.0.240) and merging (Flash, version 1.2.11) of raw reads were performed, before primer sites were removed (cutadapt, version 1.12) and reads were quality‐filtered (prinseq‐lite, version 0.20.4). Next, we used UNOISE3⁶¹, which has a built‐in error correction to reduce the influence of sequencing errors, to determine amplicon sequence variants (ZOTUs). To reduce sequence diversity, we implemented an additional clustering at 99% sequence identity. We checked ZOTUs for stop codons of the invertebrate mitochondrial code, to ensure an intact open reading frame. For the taxonomic assignment, all COI‐related sequences were downloaded from NCBI and ZOTUs were blasted against the NCBI COI collection. We extracted the top five best blast hits and then used the R packages “taxize”⁶² (version 0.9.7) and “rentrez”⁶³ (version 1.2.2) to acquire taxonomic labels. We modified the selected COI sequences with the taxonomic labels in order to index the database. The ZOTUs were then assigned to taxa using Sintax (usearch) and the NCBI COI‐based reference. Details of the data collection and the bioinformatic pipeline can be further assessed in Mächler et al.³⁶. The OTUs found in this dataset are generally saturating³⁹. We acknowledge that the primers used are targeting eukaryotic diversity in general and were not only specific to EPT taxa, which might lead to an underestimation of detections of EPT by eDNA (i.e., false absences). This, however, is conservative with respect to our findings, as we would miss taxa with eDNA where the kicknet sample indicated their presence.

Network extraction and hydrological characterization

The river network was extracted from a 25-m Digital Elevation Model (DEM) of the region provided by the Swiss Federal Institute of Topography (Swisstopo) by applying the D8 method through a TauDEM routine in a GIS software⁶⁴. A threshold of 800 pixels (0.5 km²) on contributing area was applied to identify the channelized (i.e., perennial, see ref. ⁶⁴) portion of the drainage network, whose total length equaled 751 km. We then defined a reach as a sequence of pixels of the DEM matrix, originating from a source or a confluence, directed downstream, and ending at the following confluence or at the outlet. This resulted in a discretization of the river network comprising 760 reaches of median length 0.78 km (95% equal-tailed interval of the distribution: 0.07–3.18 km). Note that a lower threshold area would have resulted in a higher number of reaches, implying a more refined discretization of the river network but also an increased computational burden for the subsequent model runs. Hence, we chose the highest value of threshold area such that: i) the extracted river network retained all reaches where sampling sites were located; ii) a qualitative comparison with the vectorial hydrographic network provided by Swisstopo resulted adequate. For modeling purposes, reaches were considered as nodes of a graph with edges following flow direction. All variables referred to a node were considered homogeneously distributed within the corresponding reach.

In order to assess values of hydrological variables for all reaches, we made use of power-law scaling relationships, a well-established and universally applicable concept in hydrology^42,43,44. In particular, river width w, river depth D and water velocity v are known to scale (both within a single cross-section and in the downstream direction) as a power-law of water discharge Q (refs. ^42,43); along the flow direction, the relationships w~Q^0.5, D~Q^0.4, w~Q^0.1 are valid over wide ranges of natural streams⁴². Moreover, water discharge scales linearly across a catchment with drainage area A (ref. ⁴⁴). Strictly speaking, the relationship Q~A holds for mean annual values of Q; however, it can be reasonably extended to values of Q averaged over shorter time windows (say, at least 1 day), provided that the time scale of flow propagation is much shorter than 1 day, and that rainfall (and the resulting runoff generation) can be considered spatially homogeneous if aggregated at a daily scale. Both assumptions are reasonable for catchments up to 10³ km² (ref. ⁶⁵), as the one here studied. Mean water discharges during the sampling days and stage-discharge relationships were available at four stations operated by the Swiss Federal Office for the Environment (FOEN) (see Fig. 1). River widths at these locations were estimated via aerial images. Power law relationships with drainage area for discharge and river width were then fitted on the four stations, yielding Q = 0.072 A^1.056 and w = 1.586A^0.526, where A is in km², Qin m³s⁻¹, and w in m. As for river depth, we discarded the station with lowest drainage area because we observed that the values of depth measured therein were highly overestimated with respect to the expected values, based on the other stations and the scaling exponent 0.4 (ref. ⁴²). Hence, we limited the fit to the three other stations and obtained D = 0.073 A^0.463, where D is in m. By assuming rectangular river’s cross-sections, we finally derived a power-law relationship linking water velocity v to drainage area: v = Q/(Dw) = 0.623 A^0.067, where v is in ms⁻¹. Notably, all scaling exponents obtained were very close to the literature values^42,44. Details on the fit of these hydrological relationships are reported in Supplementary Fig. 2.

Choice of covariates

A first set of 18 covariates was chosen as representative of morphological, geological and land cover features of the catchment. These covariates can either reflect local or upstream characteristics. Land cover covariates were evaluated as local values, because they are assumed to potentially have a role in determining local taxon suitability (see also ref. ⁶⁶). Geological covariates were calculated as upstream averaged values, because they are likely to affect the chemical composition of streamflow at a site; such process could affect local habitat suitability but is driven by the upstream catchment³⁰. Morphological covariates were obtained by analyzing the DEM of the region, while geological and land cover raster maps were provided by Swisstopo^67,68. The list of covariates is shown in Supplementary Table 1.

A second set of covariates was constituted by grouping all 760 reaches into 17 clusters according to geographical proximity and hydrological connectivity (see Fig. 1c). For each cluster, a corresponding covariate vector was defined with values equal to one for the reaches constituting that cluster, and zero otherwise. The addition of these ‘geographical’ covariates aimed at allowing eDITH to reproduce spatial patterns uncorrelated to any of the previous covariates (e.g., in the case of a taxon only inhabiting a single tributary of the catchment).

The 35 covariates were z-normalized and checked for possible multicollinearity. The 760-by-34 design matrix obtained by removing the covariate corresponding to cluster LUT of Fig. 1c (corresponding to the Luteren tributary, and arbitrarily chosen among the geographical covariates: each of them can indeed be expressed as a linear combination of the other 16 geographical covariates) yielded all pairwise correlation coefficients with absolute values lower than 0.80 and variance inflation factors lower than 10, which minimized the effects of multicollinearity⁶⁹. The LUT covariate was however kept in the design matrix to facilitate model fitting.

The eDITH model

The eDNA transport component of the eDITH model is derived from Carraro et al.³¹

$${hat{C}}_{j} = frac{1}{{Q_j}}sum_{i in gamma left( j right)} A_{S,i}exp left( { – frac{{L_{ij}}}{{v_{ij}tau }}} right)p_i,$$

(1)

where (hat C_j), the modeled eDNA concentration of a given taxon at site j, results from the summation of eDNA production rates p_i across all network nodes upstream of j. These contributions are weighted by the relative source area A_S,i and a first-order decay factor, where L_ij is the along-stream length of the path joining i to j, v_ij the average water velocity along that path, and τ a characteristic decay time. Finally, Q_j is a characteristic (mean during sampling days) value of water discharge across node j. Production rates are assumed to be proportional to the taxon’s density and expressed via an exponential link

$$p_i = p_0exp left( {{boldsymbol{beta }}^T{boldsymbol{X}}left( i right)} right),$$

(2)

where X(i) is a vector of (normalized) environmental covariates evaluated at node i, β a vector of covariate effects and p₀ is a characteristic production rate³¹. Equation (2) represents a generalized linear model, a widely used class of species distribution models⁷⁰. Note that Eq. (1) does not explicitly account for deposition of eDNA particles on the river bed. Decay and deposition are both relevant processes affecting eDNA removal, and thus detection, in stream water (see e.g., ref. ²⁵). We chose not to model these processes independently because it would be hardly possible to disentangle the roles of degradation of genetic material and that of gravity-induced deposition. Decay time τ is therefore to be seen as a characteristic time (linked to half-life T_1/2 by the relationship T_1/2= In(2)τ) for the presence of eDNA in water, regardless of the source of its depletion.

We here further hypothesize that the expected number of reads for a given taxon at a given site is proportional to the taxon’s eDNA concentration in the sample: (hat N_j = khat C_j). Equation (1) then becomes

$${hat{N}}_{j} = frac{1}{{Q_j}}sum_{i in gamma left( j right)} A_{S,i}exp left( { – frac{{L_{ij}}}{{v_{ij}tau }}} right)p_0^prime exp left( {{boldsymbol{beta }}^T{boldsymbol{X}}left( i right)} right),$$

(3)

where (p_0^prime = kp_0). Finally, by denoting with L_i and w_i the length and width of reach i, respectively, it is A_S,i= L_iw_i, (v_{ij} = sum_{k in P_{ij}} L_k/sum_{k in P_{ij}} left( {L_k/v_k} right)), where P_ij is the set of nodes constituting the path joining i to j. For the sake of clarity, all mathematical symbols used here and in the following are listed in Supplementary Table 2.

Model calibration

Measured numbers of reads at a given site are assumed to be distributed according to a geometric distribution with mean (hat N_j) obtained from Eq. (3). The geometric distribution is a special case of the negative binomial distribution, obtained by setting to one the parameter corresponding to the number of failures. Reasons for this choice are multifold: (i) it is a discrete distribution, as it is the case for read number data; (ii) it depends on a single parameter, thereby it reduces the complexity of the problem; (iii) it has null mode, which is in agreement with the data (in fact, the number of replicates with null read numbers ranged from 50 to 182 out of 183, depending on the genus); (iv) the distribution has fatter tails compared to the Poisson distribution, which is also a single-parameter, discrete distribution with null mode. Fat tails allow the interpretation of large observed numbers of reads.

Given these assumptions on the distribution of observed numbers of reads, the likelihood function reads

$$fleft( {{boldsymbol{N}}{mathrm{|}}{boldsymbol{beta }},p_{0}^{prime} ,tau } right) = mathop {prod }limits_{j in S} left[ {mathop {prod }limits_{o = 1}^3 frac{1}{{1 + hat N_j}}left( {frac{{{hat{N}}_j}}{{1 + {hat{N}}_j}}} right)^{N_{jo}}} right],$$

(4)

where S is the set of sampling sites used to calibrate the model and N is a |S|-by-3 matrix whose entries are observed numbers of reads N_jo at site j and replicate o (|S| is the cardinality of S, equal to 61 in this case).

The posterior distribution of parameters β, (p_0^prime), τ was sampled by means of an Adaptive Metropolis algorithm⁷¹. The prior distribution adopted for β was a multivariate normal with null mean and covariance matrix equal to 9 ∗ I, where I is the identity matrix of order equal to the number of components of β. We then adopted a uniform prior distribution for (p_0^prime) and a lognormal prior distribution for τ with mean equal to 2.55 h and standard deviation equal to 1.36 h, equivalent to a normal distribution for ({mathrm{ln}}left( tau right)) (with τ expressed in seconds) with mean equal to 9 and standard deviation equal to 0.5; such prior distribution was derived from a previous study³¹. For each of the 50 model runs (corresponding to the 50 EPT genera detected in the eDNA samples), Markov chains were randomly initialized; the first 5000 elements of the chains built by the Adaptive Metropolis algorithm were then discarded (burn-in phase), while the following 10,000 were retained. With the above-mentioned settings, eDITH was used to generate spatial patterns of relative density for all of the 50 genera.

Assessing detection probability and presence maps

In order to enable the comparison among maps of relative density for the different genera, we resorted to the evaluation of detection probability. This was defined as the probability that, given the relative density of a genus at a given reach predicted by eDITH, an eDNA sample taken from that reach would yield a nonzero number of reads, if that reach were unconnected from the river network. From Eq. (3), the expected number of reads of a sample taken from an unconnected reach reads

$$hat N_{U,i} = p_i^prime frac{{A_{S,i}}}{{q_i}}exp left( { – frac{{L_i}}{{v_itau }}} right),$$

(5)

where q_i is the discharge directly contributing to reach i (such that (Q_i = sum_{k in gamma left( i right)} q_k)), L_i and v_i are length and water velocity relative to reach i, respectively. Thus, according to the assumption of geometric distribution for observed read numbers, the probability of a non-zero read number for a sample with expected read number (hat N_{U,i}) is

$$P_{D,i} = frac{{hat N_{U,i}}}{{1 + hat N_{U,i}}}.$$

(6)

Detection probabilities are evaluated by using median posterior values for parameters β, (p_0^prime), τ. Finally, maps of detection probability are converted into presence maps by applying the threshold³⁶P_D,i≥ 2/3. The summation of presence maps for all genera yields the genus richness map displayed in Fig. 5a.

A number of methods for threshold selection in species distribution models exist in the literature⁷²; however, preliminary analyses (not reported) showed that methods such as the prevalence approach and the average probability/suitability approach (see ref. ⁷²) resulted in lower accuracy estimates (see below) as opposed to the fixed threshold approach. Moreover, the fixed threshold approach P_D,i≥ 2/3 was used by Mächler et al.³⁶; since the eDITH model is to be seen as a way to translate “upstream-averaged” eDNA measurements into local equivalent eDNA measurements, it appears appropriate to keep the same criterion already used with the raw eDNA data.

For a given genus that was found both in the eDNA and kicknet samples, the accuracy of a presence map obtained by eDITH with respect to the kicknet data was determined as the fraction of true positives (i.e., sites where both model and kicknet assessed presence) and true negatives (i.e., sites where both model and kicknet assessed absence) over all (60) sites where kicknet was performed. On the contrary, false positives are sites where eDITH predicts presence while kicknet indicates absence; the vice versa holds for false negatives.

We underline that the eDITH model was run for all genera that were found in at least one replicate at one site. Such an inclusive choice was operated in a bid to maximize the available information and avoid false absences. However, the estimates of occurrence of taxa performed by our model remain conservative: indeed, for the 8 genera that were never found at any site with at least 2 out of 3 nonzero read numbers (see Supplementary Data 1), the fraction of river reaches where eDITH predicted presence was always lower than 3%.

Note that calculating the accuracy between model predictions and eDNA data with the method used for the model vs. kicknet comparison would be formally wrong: indeed, eDNA data are an aggregate measure of the upstream taxon distribution, while eDITH-based presence estimates refer to a local variable, because upstream contributions have been disentangled by the model. Conversely, an adequate measure of consistency between model predictions and eDNA data must compare quantities of the same type (i.e., both referred to the upstream catchment). To this end, we utilized the goodness-of-fit test detailed below.

Goodness-of-fit test

An ad hoc goodness-of-fit test was required owing to the use of a geometric distribution to represent the observed numbers of reads. The test relies on a bootstrapping technique derived from that proposed by Mi et al.⁷³, to which the reader is referred for details on the theoretical foundation of the method.

For a given site j, let (hat N_j) be the read number predicted by eDITH (see Eq. (3)) and N_jo, o = 1,2,3 the triplet of observed read numbers at that site. The sample standard deviation of the data with respect to the predicted mean reads

$$s_j = sqrt {mathop {sum }limits_{o = 1}^3 left( {N_{jo} – hat N_j} right)^2} ,$$

(7)

while Pearson’s residuals are given by (r_{jo} = left( {N_{jo} – hat N_j} right)/s_j). Now, let (tilde N_jsim {mathrm{Geom}}left( {hat N_j} right)) be a random variable that is geometrically distributed with mean equal to (hat N_j). Let us generate a large (h = 1, …, 10⁵) number of triplets (tilde N_{jho}), o = 1, 2, 3 and compute their sample standard deviation (tilde s_{jh}) and residuals (tilde r_{jho}):

$$tilde s_{jh} = sqrt {mathop {sum }limits_{o = 1}^3 left( {tilde N_{jho} – hat N_j} right)^2} ;,tilde r_{jho} = frac{{tilde N_{jho} – hat N_j}}{{tilde s_{jh}}}.$$

(8)

Let now d_j and (tilde d_{jh}) be the sum of squared deviations of ordered residuals (of the data and of the sampled distributions, respectively) from the medians of their sampled distributions

$$d_j = mathop {sum }limits_{o = 1}^3 left( {r_{jo} – tilde r_{jo}^{left( {50} right)}} right)^2;,tilde d_{jh} = mathop {sum }limits_{o = 1}^3 left( {tilde r_{jho} – tilde r_{jo}^{left( {50} right)}} right)^2,$$

(9)

where (tilde r_{jo}^{left( {50} right)}) is the median of the residuals of the o-th components of the generated triplets (tilde N_{jho}). The p-value can thus be computed as

$$p_j^{{mathrm{GOF}}} = frac{{1 + mathop {sum }nolimits_{h = 1}^{10^5} {mathbf{1}}_{tilde d_{jh}!ge! d_j}left( h right)}}{{1 + 10^5}},$$

(10)

where ({mathbf{1}}_{tilde d_{jh}! ge! d_j}left( h right)) is an indicator function equal to one if (tilde d_{jh} ge d_j) and null otherwise. We finally assumed that the null hypothesis H₀ that the triplet N_jo, o = 1, 2, 3 is geometrically distributed with mean (hat N_j) (namely, that the model correctly reproduces the data at site j) cannot be rejected if (p_j^{GOF}) > 0.05.

Cross-validation analysis of the effect of sample size

In order to evaluate how model predictions vary as a function of data availability, we performed 9 additional simulations (termed AS), in which only subsets of the sampling sites were used to calibrate eDITH: 3 of them (simulation group termed AS1) were based on 49 out of 61 sites (~80%), 3 (AS2) were based on 37 sites (~60%) and 3 (AS3) were based on 25 sites (~40%). For each simulation, a quasi-random choice of the excluded subset of sites was operated. In particular, to ensure spatial coverage of the catchment, we first imposed that, for each AS group, the distribution of Strahler stream order values of the calibration subset was proportional to that of the complete set of sites (shown in Supplementary Fig. 2). The allocation of number of excluded sites per stream order value was determined via the D’Hondt method⁷⁴. The excluded sites were then randomly sampled while respecting the constraint on stream order-based allocation. Note that, for a given thus-obtained calibration set, eDITH was run for all 50 genera.

For these additional simulations, calibration was performed as described with regards to the complete model (CM) (see “Model calibration” section), except that Markov Chains were here shorter (burn-in phase: 1000 elements; length of the retained chain: 5000 elements) to speed up the computational process.

Finally, performance indices were calculated in analogy to CM. Specifically, these were goodness of fit and accuracy. Goodness of fit is defined as the ability of the modeled spatial patterns of expected read numbers to reproduce the observed read numbers, and is evaluated as the fraction of sites (either all sampling sites, calibration or validation subsets) where the null hypothesis that observed read numbers come from the hypothesized distribution cannot be rejected (see “Goodness-of-fit test” section). Accuracy is defined as the agreement between model-based genus presence predictions and kicknet observations, and is evaluated as the fraction of sites marked as true positives or true negatives (see “Assessing detection probability and presence maps” section). We then calculated loss of goodness of fit as the difference between the goodness of fit of the AS simulations and that of CM. Loss of accuracy is analogously calculated.

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.

Source: Ecology - nature.com