Measuring protected-area effectiveness using vertebrate distributions from leech iDNA

This section provides an overview of methods. The Supplementary Information provides additional detailed descriptions of the leech collections, laboratory processing, bioinformatics pipeline, and site-occupancy modelling. Code for our bioinformatics pipeline is available at Ji⁷² and Yu⁷³. Code for our site-occupancy modelling and analysis is available at Baker et al.⁷⁴.

Leech collections

Samples were collected during the rainy season, from July to September 2016, by park rangers from the Ailaoshan Forestry Bureau. The nature reserve is divided into 172 non-overlapping patrol areas defined by the Yunnan Forestry Survey and Planning Institute. These areas range in size from 0.5 to 12.5 km² (mean 3.9 ± sd 2.5 km²), in part reflecting accessibility (smaller areas tend to be more rugged). These patrol areas pre-existed our study, and are used in the administration of the reserve. The reserve is divided into six parts, which are managed by six cities or autonomous counties (NanHua, ChuXiong, JingDong, ZhenYuan, ShuangBai, XinPing) which assign patrol areas to the villages within their jurisdiction based on proximity. The villages establish working groups to carry out work within the patrol areas. Thus, individual park rangers might change every year, but the patrol areas and the villages responsible for them are fixed.

Each ranger was supplied with several small bags containing tubes filled with RNAlater preservative. Rangers were asked to place any leeches they could collect opportunistically during their patrols (e.g. from the ground or clothing) into the tubes, in exchange for a one-off payment of RMB 300 ( ~USD 45) for participation, plus RMB 100 if they caught one or more leeches. Multiple leeches could be placed into each tube, but the small tube sizes generally required the rangers to use multiple tubes for their collections.

A total of 30,468 leeches were collected in 3 months by 163 rangers across all 172 patrol areas. When a bag of tubes contained <100 total leeches, we reduced our DNA-extraction workload by pooling leeches from all tubes in the same plastic bag and treating them as one replicate. However, when a bag contained ≥100 total leeches, we selectively pooled some of the tubes in that bag to create five approximately equally sized replicates from the bag, to avoid any replicates containing an excessive number of leeches. Eighty-one per cent of bags contained <100 leeches, and 78% of patrol areas consisted only of bags below the threshold. Each patrol area typically returned multiple replicates, in the form of multiple bags below the threshold and/or multiple tubes from the bags above the threshold. After this pooling, the mean number of leeches per replicate was 34 (range 1–98), for a total of 893 replicates across the entire collection.

Environmental characteristics

We used ArcGIS Desktop 9.3 (Esri, Redlands, CA) and R v3.4.0⁷⁵ to calculate characteristics of each patrol area. We created 30 m raster layers for elevation, topographic position index (i.e. difference between each pixel and its surrounding pixels⁷⁶), distance to nearest road, and distance to nearest stream. We then calculated the median of the raster values for each patrol area for use as predictors in our statistical modelling (Table 4 and Supplementary Fig. 1). We also calculated distance to the Ailaoshan reserve edge as the distance of each patrol-area centroid to the nearest nature-reserve edge.

Table 4 Summary of environmental covariates.

Full size table

Laboratory processing

We extracted DNA from each replicate and then PCR-amplified two mitochondrial markers: one from the 16S rRNA gene (MT-RNR2; primers: 16Smam1 5′-CGGTTGGGGTGACCTCGGA-3′ and 16Smam2 5′-GCTGTTATCCCTAGGGTAACT-3′⁷⁷), and the other from the 12S rRNA gene (MT-RNR1; primers: 5′-ACTGGGATTAGATACCCC-3′ and 5′-YRGAACAGGCTCCTCTAG-3′ modified from Riaz et al.⁷⁸). We refer to these two markers as LSU (16S, 82–150 bp) and SSU (12S, 81–117 bp), respectively, referring to the ribosomal large subunit and small subunit that these genes code for. A third primer pair targeting the standard cytochrome c oxidase I marker⁷⁹ was tested but not adopted, as it co-amplified leech DNA and consequently returned few vertebrate reads.

The LSU primers are designed to target mammals, and the SSU primers to amplify all vertebrates. We ran ecoPCR v0.5⁸⁰ with three allowed mismatches on the Tetrapoda in the MIDORI database⁸¹ to estimate expected amplification success, B_c, for our primers. B_c is the proportion of species in the reference database that can be amplified in silico. The 16Smam primers returned high B_c values for Mammalia (99.3%), as expected, and also for Aves (96.2%), a moderate value for Amphibia (79%), and a low value for species grouped under “Reptilia” in the MIDORI database (=Crocodylia + Sphenodontia + Squamata + Testudines) (39.9%). The 12S primers returned high B_c values ( > 98%) for Mammalia, Amphibia, and Aves, and a moderate B_c value (79.8%) for “Reptilia”. We therefore expected most or all Ailaoshan mammals, birds, and amphibians to be amplifiable by one or both primers, and a lower success rate for snakes and lizards.

Primers were ordered with sample-identifying tag sequences, and we used a twin-tagging strategy to identify and remove ‘tag jumping’ errors⁸² using the DAMe protocol⁸³. From our 893 replicate tubes, we successfully PCR-amplified in triplicate 661 samples using our LSU primers and 745 samples using our SSU primers. Successful PCR amplifications were sent to Novogene (Beijing, China) for PCR-free library construction and 150 bp paired-end sequencing on an Illumina HiSeq X Ten.

Negative controls were included for each set of PCRs, and the PCR set was repeated, or ultimately abandoned, if agarose gels revealed contamination in the negative controls. We also sequenced the negative controls, because gels do not always detect very low levels of contamination. Sequences assigned to human, cow, dog, goat, pig, chicken and some wild species appeared in our sequenced negative controls, but with low PCR replication and at low read number. We used these negative controls to set DAMe filtering stringency in our bioinformatics pipeline (see next section and Supplementary Information) for all samples to levels that removed these contaminants: -y 2 for both markers (minimum number of PCRs out of 3 in which a unique read must be present), and -t 9 for LSU and -t 20 for SSU (minimum number of copies per PCR at which a unique read must appear). We also amplified and sequenced a set of positive controls containing DNA from two rodent species, Myodes glareolus and Apodemus flavicollis, along with negative controls that we verified to be contamination-free using agarose gel electrophoresis. M. glareolus and A. flavicollis have European and Western Asian distributions, and we did not detect either species in our leech samples.

Bioinformatics pipeline

The three key features of our bioinformatics pipeline were the DAMe protocol⁸³, which uses twin-tagging and three independent PCR replicates to identify and remove tag-jumped and erroneous reads, the use of two independent markers, which provides an independent check on taxonomic assignments (Supplementary Fig. 2), and the PROTAX statistical ‘wrapper’ for taxonomic assignment^84,85, which reduces overconfidence in taxonomic assignment when reference databases are incomplete, as they always are. In this case, around half of the known Ailaoshan taxa were present in the reference databases (Supplementary Data 2). Mammals and amphibians were relatively well represented: 73% of mammals and 83% of amphibians were in the LSU database, respectively 70% and 67% in the SSU database. Birds and squamates were less well captured, with 42% of birds and 53% of squamates present in the LSU database, respectively 35% and 34% in the SSU database. For OTUs that do not have reference sequences, PROTAX assigns them to higher ranks and flags them as ‘unknowns,’ allowing us to assign those OTUs to morphospecies and potentially supply taxonomy based on other information such as correlations between the datasets as described here.

After DAMe filtering, we removed residual chimeras using VSEARCH v2.9.0⁸⁶, clustered sequences into preliminary operational taxonomic units (‘pre-OTUs’) using Swarm v2.0⁸⁷, and then used the R package LULU v0.1.0⁸⁸ to merge pre-OTUs with high similarity and distribution across samples. We then used PROTAX to assign taxonomy to representative sequences from the merged pre-OTUs^33,84,85, in which we benefited from recent additions to the mitochondrial reference database for Southeast Asian mammals⁸⁹. The full pipeline is described in detail in the Supplementary Information (Assigning taxonomy to preliminary operational taxonomic units and following sections). We shared taxonomic information between the LSU and SSU datasets by making use of correlations between the datasets. To do this, we calculated pairwise correlations of LSU and SSU pre-OTUs across the 619 replicates for which both markers had been amplified and visualised the correlations as a network (Supplementary Fig. 2). If an LSU and an SSU pre-OTU occurred in (mostly) the same subset of replicates and were assigned the same higher-level taxonomies, the two pre-OTUs were deemed likely to have been amplified from the same set of leeches feeding on the same species. We manually inspected the network diagram and assigned such correlated pre-OTU pairs the same taxonomy.

We eliminated any pre-OTUs to which we were unable to assign a taxonomy; these pre-OTUs only accounted for 0.9% and 0.2% of reads in the LSU and SSU datasets respectively, and most likely represent sequencing errors rather than novel taxa. Within the LSU and SSU datasets, we merged pre-OTUs that had been assigned the same taxonomies, thus generating a final set of operational taxonomic units (OTUs) for each dataset. Finally, we removed the OTU identified as Homo sapiens from both datasets prior to analysis. Although it would be informative to map the distribution of humans across the reserve, we expect that most of the DNA came from the rangers themselves, not from other humans using the reserve.

Our final OTUs are intended to be interpreted as species-level groups, even though some cannot yet be assigned taxonomic names to species level (most likely due to incomplete reference databases). Thus, for example, the two frog OTUs Kurixalus sp1 and Kurixalus sp2 in the LSU dataset should be interpreted as two distinct Kurixalus species. Likewise, the frog OTU Megophryidae sp3 in the LSU and SSU datasets should be interpreted as a single species within Megophryidae. We therefore refer to our final OTUs as species throughout this study.

After excluding humans, the final LSU and SSU datasets comprised 18,502,593 and 84,951,011 reads respectively. These reads represented a total of 59 species across 653 replicates and 126 patrol areas in the LSU dataset, and 72 species across 740 replicates and 127 patrol areas in the SSU dataset. To assess the degree to which our iDNA approach was able to capture the breadth of vertebrate biodiversity in the park, we compared the list of species that we detected against unpublished, working species lists maintained by researchers at the Kunming Institute of Zoology.

We also attached additional metadata to our species list: we attached International Union for Conservation of Nature (IUCN) data for individual species by using the R package rredlist v0.6.0⁹⁰ to search for scientific names assigned by PROTAX. For this purpose, we treated Capricornis milneedwardsii as synonymous with Capricornis sumatraensis, in line with recent research and the latest IUCN assessment^91,92. For mammals, we used the PanTHERIA database⁹³ to obtain data on adult body mass for each species; where species-level information was not available, we used the median adult body mass from the database for the lowest taxonomic group possible.

Site-occupancy modelling

We estimated separate multispecies site-occupancy models for the LSU and SSU datasets using parameter-expanded data augmentation^46,53. These models assume that the n_LSU = 59 and n_SSU = 72 species observed in each dataset are, respectively, subsets of larger communities of size N_LSU and N_SSU species that are present in the vicinity of Ailaoshan and vulnerable to capture (e.g. fed on by leeches and amplified by the LSU and SSU primers). Although N_LSU and N_SSU are unknown, these communities can be modelled by embedding them in a larger ‘supercommunity’ of fixed size M. We set M = 200 for our final model. Values from M = 150 up to M = 474 (the latter being the total species richness for mammals, birds, non-avian reptiles and amphibians in the 1984-85 survey of Ailaoshan³⁵) produced similar estimates for N_LSU and N_SSU.

For each species in the supercommunity, our models explicitly capture (i) a ‘community process’ governing whether the species is in the Ailaoshan community or not; (ii) an ‘ecological process’ governing the presence or absence of the species in each patrol area, given that it is in the community; and (iii) an ‘observation process’ governing whether we detect the species’ DNA in each of our replicate samples, given that it is present in the patrol area. The community-, ecological- and observation processes for individual species are linked by imposing community-level parameters and priors as described below.

For the community process, each species i was assumed to be either a member of the Ailaoshan community or not. We denote this unobserved state with w_i, which was assumed to be a Bernoulli random variable governed by the community membership parameter ({{{Omega }}}_{{g}_{i}}), i.e. the probability that species i was in the Ailaoshan community:

$${w}_{i} sim {{{{{{{rm{Bernoulli}}}}}}}}({{{Omega }}}_{{g}_{i}}).$$

(1)

For the community process, we separated the species into two natural groupings – homeothermic mammals and birds, and poikilothermic amphibians and squamates – and allowed them to have different probabilities of being in the Ailaoshan community. This is denoted by the subscript on the ({{{Omega }}}_{{g}_{i}}) parameter, in which g_i represents which of these two groupings species i belongs to. This approach reflected our expectation that these groupings would differ systematically in their community probabilities, and we employed the same grouping for parameters governing the ecological and detection processes (see below for further discussion).

For the ecological process, each species i was assumed to be either present or absent in each patrol area j, and we used z_ij to denote this unobserved ecological state. We assumed the z_ij to be constant across all replicates taken from patrol area j, consistent with the samples being taken at essentially the same point in time. Any species present were assumed to be members of the Ailaoshan community (i.e. w_i = 1), so we modelled z_ij as a Bernoulli random variable governed by both w_i and an occupancy parameter ψ_ij, i.e. the probability that a species i in the community was present in patrol area j:

$${z}_{ij}| {w}_{i} sim {{{{{{{rm{Bernoulli}}}}}}}}({w}_{i}{psi }_{ij}).$$

(2)

We modelled occupancy ψ_ij as a function of elevation and distance from the reserve edge in the LSU dataset

$${{{{{{{rm{logit}}}}}}}}({psi }_{ij})={beta }_{0i}+{beta }_{1i}{{{{{{{{rm{elevation}}}}}}}}}_{j}+{beta }_{2i}{{{{{{{{rm{reserve}}}}}}}}}_{j}$$

(3)

and as a function of elevation in the SSU dataset

$${{{{{{{rm{logit}}}}}}}}({psi }_{ij})={beta }_{0i}+{beta }_{1i}{{{{{{{{rm{elevation}}}}}}}}}_{j}$$

(4)

where elevation_j is the median elevation for patrol area j, and reserve_j is the distance from the centroid of patrol area j to the nature reserve edge. We chose these specifications by running a ‘full’ model for each dataset with all five environmental covariates, and retaining only those covariates for which the 95% Bayesian confidence interval on the slope coefficient excluded zero.

We modelled observation as a Bernoulli process assuming imperfect detection but no false positives:

$${y}_{ijk}| {z}_{ij} sim {{{{{{{rm{Bernoulli}}}}}}}}({z}_{ij}{p}_{ijk}),$$

(5)

where y_ijk is the observed data, i.e. detection or non-detection of species i’s DNA in replicate k from patrol area j.

We allowed the conditional detection probability p_ijk to vary as a function of the conditional detection probability for species i per 100 leeches, r_i, and the number of leeches in the replicate, leeches_jk:

$$kern0.3pc {p}_{ijk}=1-{(1-{r}_{i})}^{{{{{{{{{rm{leeches}}}}}}}}}_{jk}/100}$$

(6)

$${{{{{{{rm{logit}}}}}}}}({r}_{i})={gamma }_{0i}$$

(7)

We allowed r_i, and its logit-scale equivalent γ_0i, to vary among species to capture e.g. variation in leech feeding preferences among taxa. We used leeches_jk/100 rather than leeches_jk to avoid computational problems arising from rounding.

Note that the detection probability p_ijk is conditional on species i being present in patrol area j, and not on species i’s DNA being present in replicate k from that site. p_ijk therefore subsumes multiple sources of imperfect detection, including those that result in species i’s DNA being absent from the replicate (e.g. the leeches in replicate k did not feed on species i, or they did so long ago and the DNA has since been digested), as well as those that result in apparent non-detection of species i DNA when it is present (e.g. failure to PCR amplify sufficiently, PCR or sequencing errors, or problems arising during bioinformatic processing). The multiple PCRs that we performed for each replicate (see Laboratory processing above, and Supplementary Information) could in principle have been used to decompose p_ijk into (i) a per-replicate probability that species i’s DNA is present in the replicate when the species is present at the site, and (ii) a per-PCR probability that species i’s DNA is detected when it present in the replicate, by adding another hierarchical level to our model^94,95,96,97. However, we instead chose to combine the results from the multiple PCRs using DAMe⁸³ prior to modelling, since DAMe is specifically designed to detect and remove errors arising in PCR and sequencing, and offers filtering options specialised to this task that we found useful.

Finally, whereas Eqs. (1) through (7) define a site-occupancy model for species i alone, we united these species-specific models with a community model for both ecological and detection processes:

$${beta }_{1i} sim {{{{{{{rm{N}}}}}}}}({mu }_{{beta }_{1}},{sigma }_{{beta }_{1}})$$

(8)

$${beta }_{2i} sim {{{{{{{rm{N}}}}}}}}({mu }_{{beta }_{2}},{sigma }_{{beta }_{2}})quad ({{{{rm{for}}}};{{{rm{the}}}};{{{rm{LSU}}}};{{{rm{model}}}};{{{rm{only}}}}})$$

(9)

$$({beta }_{0i},{gamma }_{0i}) sim {{{{{{{rm{MVN}}}}}}}}left([{mu }_{{beta }_{0}{g}_{i}},{mu }_{{gamma }_{0}{g}_{i}}],left[begin{array}{cc}{sigma }_{{beta }_{0}{g}_{i}}^{2}&rho {sigma }_{{beta }_{0}{g}_{i}}{sigma }_{{gamma }_{0}{g}_{i}} rho {sigma }_{{beta }_{0}{g}_{i}}{sigma }_{{gamma }_{0}{g}_{i}}&{sigma }_{{gamma }_{0}{g}_{i}}^{2}end{array}right]right)$$

(10)

where N() and MVN() denote normal and multivariate normal distributions. These distributions were characterised by community hyperparameters μ_• and σ_•, with separate distributions for each parameter as denoted by the first subscript. We used a multivariate normal prior for (β_0i, γ_0i) to allow non-zero covariance between species’ occupancy and detection probabilities, as we might expect if, for example, variation in abundance affects both probabilities⁴⁶.

These community models allow rare species effectively to borrow information from more common ones, producing a better overall ensemble of parameter estimates, though at the cost of shrinkage on the individual parameters^46,98,99. As for the community process described above, we separated the species into two groups – homeothermic mammals and birds, and poikilothermic amphibians and squamates – and allowed them to have different community distributions. This is denoted by the subscripts on the μ_• and σ_• community hyperparameters for the occupancy and detection intercepts, in which g_i represents which of these two groupings species i belongs to. This approach reflected our expectation that these groupings would differ systematically in occupancy probabilities (e.g. due to different habitat preferences) and in detection probabilities (e.g. due to different encounter rates with leeches, or leech feeding preferences). Alternative groupings could also be justified on biological grounds: for example, separating mammals and birds on the basis that many of the mammals are terrestrial while many of the birds are arboreal; or grouping birds and squamates together to better reflect phylogeny. Such alternative groupings did not perform well in our datasets, as most birds and squamates were observed too infrequently to provide much information on these groups by themselves, but this aspect of the model would be worth revisiting in future work.

We estimated our models using a Bayesian framework with JAGS v4.3.0¹⁰⁰. We used 5 chains of 100,000 generations, including a burn-in of 50,000. We retained all rounds (i.e. without thinning) for the posterior sample, except for where we needed to save the z matrix for beta diversity and cluster occupancy calculations (see Statistical analyses below); memory limitations prevented us from retaining all posterior samples for the z matrix, and we thinned tenfold in order to make these calculations feasible. The Supplementary Information provides details of the prior distributions used for the model parameters. From the model results we calculated posterior means and quantiles for all model parameters of interest, as well as estimated species richness for each patrol area, and number of sites occupied for each species.

Statistics

Species richness

For each dataset, we obtained estimates of overall species richness for Ailaoshan directly from the model, by summing the w_i. To assess our choice of M, we compared these overall species richness estimates for M = 100, 150 and 200.

After examining occupancy and detection estimates for each species, we used histograms to visualise the distribution of estimated species richness per patrol area (obtained for each patrol area j by summing the z_ij). We calculated median estimated species richness across the patrol areas for comparison with median observed species richness per patrol area and per replicate. We drew choropleths to visualise the spatial distribution of both observed and estimated species richness across the nature reserve.

We examined community mean occupancy and detection probabilities (see e.g. Section 11.7.2 in Kéry and Royle¹⁰¹) to help understand the effects of the site and sample covariates. For each species group g = 1, 2 (representing mammals/birds and amphibians/squamates, respectively), we calculated the posterior mean and 95% Bayesian confidence interval for community mean occupancy and detection as functions of the covariates:

$${psi }_{g}({{{{{{{rm{elevation}}}}}}}})=logi{t}^{-1}({mu }_{{beta }_{0}g}+{mu }_{{beta }_{1}}{{{{{{{rm{elevation}}}}}}}})$$

(11)

$${psi }_{g}({{{{{{{rm{reserve}}}}}}}})=logi{t}^{-1}({mu }_{{beta }_{0}g}+{mu }_{{beta }_{2}}{{{{{{{rm{reserve}}}}}}}})quad ({{{{rm{for}}}};{{{rm{the}}}};{{{rm{LSU}}}};{{{rm{model}}}};{{{rm{only}}}}})$$

(12)

$${p}_{g}({{{{{{{rm{leeches}}}}}}}})=1-{(1-{{{{{{{{rm{logit}}}}}}}}}^{-1}({mu }_{{gamma }_{0}g}))}^{{{{{{{{rm{leeches}}}}}}}}/100}$$

(13)

This approach effectively holds distance from reserve edge at zero in ψ_g(elevation), and elevation at zero in ψ_g(reserve), corresponding to the mean values for these covariates in our data, since predictors were normalised prior to modelling. To visualise variation among species in occupancy and detection response to covariates, we repeated these calculations using each species’ estimates for β₀, β₁, β₂ and γ₀ in place of the community hyperparameters to obtain the posterior means for each species.

We compared three measures of species richness between the two datasets in order to assess the extent to which the two datasets agreed on variation in richness within Ailaoshan. First, the observed species richness in each replicate; second, the observed species richness in each patrol area; and third, the estimated species richness in each patrol area (i.e. the posterior mean number of species, calculated from z_ij). For each of these measures, we computed the Pearson correlation between the datasets and tested the correlation coefficient against zero with a t-test. We also used Poisson GLMs to examine the relationship between each of these species richness measures and sampling effort: we regressed observed species richness per replicate against the log-transformed number of leeches per replicate, and we regressed both the observed and estimated species richness per patrol area against the log-transformed number of replicates per patrol area, testing the significance of the slope coefficients with t-tests.

Community composition

We explored variation in vertebrate community composition among patrol areas using posterior mean Jaccard similarities calculated from the estimated occupancy states z_ij (see Dorazio⁵³ and Kéry and Royle¹⁰¹ for other examples of this approach). We visualised the pairwise Jaccard distances (i.e. distance = (1 − similarity)) using non-metric multidimensional scaling ordinations, overlaying environmental covariates using the vegan::ordisurf function. We clustered patrol areas based on the Jaccard distances using Ward’s criterion (R function hclust(., method = “ward.D2”)). We used this clustering to split the patrol areas into three groups, which turned out to correspond to low-, intermediate-, and high-elevation sites. We used Cramer’s V to quantify the extent to which these clusters matched across the two datasets. We visualised the spatial variation in community composition within the reserve by drawing maps of Ailaoshan with patrol areas coloured by these three clusters. To help understand how vertebrate communities varied among the clusters, we used the posterior sample of the occupancy states z_ij to calculate posterior means and 95% Bayesian confidence intervals for the occupancy (i.e. fraction of patrol areas occupied) of each species in the low-, intermediate- and high-elevation site clusters.

To assess the extent to which the two datasets identified common patterns of variation in community composition across the patrol areas, we performed a co-inertia analysis on the matrices of predicted species in each patrol area in each dataset using ade4::coinertia in R. We used the RV coefficient⁵⁴ to quantify coinertia, testing its significance with the permutation test in ade4::RV.rtest with 999 permutations. We also tested for correlation between the posterior mean Jaccard distances from the two datasets using a Mantel test with 999 permutations.

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.