in

Measuring protected-area effectiveness using vertebrate distributions from leech iDNA

This section provides an overview of methods. The Supplementary Information provides additional detailed descriptions of the leech collections, laboratory processing, bioinformatics pipeline, and site-occupancy modelling. Code for our bioinformatics pipeline is available at Ji72 and Yu73. Code for our site-occupancy modelling and analysis is available at Baker et al.74.

Leech collections

Samples were collected during the rainy season, from July to September 2016, by park rangers from the Ailaoshan Forestry Bureau. The nature reserve is divided into 172 non-overlapping patrol areas defined by the Yunnan Forestry Survey and Planning Institute. These areas range in size from 0.5 to 12.5 km2 (mean 3.9 ± sd 2.5 km2), in part reflecting accessibility (smaller areas tend to be more rugged). These patrol areas pre-existed our study, and are used in the administration of the reserve. The reserve is divided into six parts, which are managed by six cities or autonomous counties (NanHua, ChuXiong, JingDong, ZhenYuan, ShuangBai, XinPing) which assign patrol areas to the villages within their jurisdiction based on proximity. The villages establish working groups to carry out work within the patrol areas. Thus, individual park rangers might change every year, but the patrol areas and the villages responsible for them are fixed.

Each ranger was supplied with several small bags containing tubes filled with RNAlater preservative. Rangers were asked to place any leeches they could collect opportunistically during their patrols (e.g. from the ground or clothing) into the tubes, in exchange for a one-off payment of RMB 300 ( ~USD 45) for participation, plus RMB 100 if they caught one or more leeches. Multiple leeches could be placed into each tube, but the small tube sizes generally required the rangers to use multiple tubes for their collections.

A total of 30,468 leeches were collected in 3 months by 163 rangers across all 172 patrol areas. When a bag of tubes contained <100 total leeches, we reduced our DNA-extraction workload by pooling leeches from all tubes in the same plastic bag and treating them as one replicate. However, when a bag contained ≥100 total leeches, we selectively pooled some of the tubes in that bag to create five approximately equally sized replicates from the bag, to avoid any replicates containing an excessive number of leeches. Eighty-one per cent of bags contained <100 leeches, and 78% of patrol areas consisted only of bags below the threshold. Each patrol area typically returned multiple replicates, in the form of multiple bags below the threshold and/or multiple tubes from the bags above the threshold. After this pooling, the mean number of leeches per replicate was 34 (range 1–98), for a total of 893 replicates across the entire collection.

Environmental characteristics

We used ArcGIS Desktop 9.3 (Esri, Redlands, CA) and R v3.4.075 to calculate characteristics of each patrol area. We created 30 m raster layers for elevation, topographic position index (i.e. difference between each pixel and its surrounding pixels76), distance to nearest road, and distance to nearest stream. We then calculated the median of the raster values for each patrol area for use as predictors in our statistical modelling (Table 4 and Supplementary Fig. 1). We also calculated distance to the Ailaoshan reserve edge as the distance of each patrol-area centroid to the nearest nature-reserve edge.

Table 4 Summary of environmental covariates.
Full size table

Laboratory processing

We extracted DNA from each replicate and then PCR-amplified two mitochondrial markers: one from the 16S rRNA gene (MT-RNR2; primers: 16Smam1 5′-CGGTTGGGGTGACCTCGGA-3′ and 16Smam2 5′-GCTGTTATCCCTAGGGTAACT-3′77), and the other from the 12S rRNA gene (MT-RNR1; primers: 5′-ACTGGGATTAGATACCCC-3′ and 5′-YRGAACAGGCTCCTCTAG-3′ modified from Riaz et al.78). We refer to these two markers as LSU (16S, 82–150 bp) and SSU (12S, 81–117 bp), respectively, referring to the ribosomal large subunit and small subunit that these genes code for. A third primer pair targeting the standard cytochrome c oxidase I marker79 was tested but not adopted, as it co-amplified leech DNA and consequently returned few vertebrate reads.

The LSU primers are designed to target mammals, and the SSU primers to amplify all vertebrates. We ran ecoPCR v0.580 with three allowed mismatches on the Tetrapoda in the MIDORI database81 to estimate expected amplification success, Bc, for our primers. Bc is the proportion of species in the reference database that can be amplified in silico. The 16Smam primers returned high Bc values for Mammalia (99.3%), as expected, and also for Aves (96.2%), a moderate value for Amphibia (79%), and a low value for species grouped under “Reptilia” in the MIDORI database (=Crocodylia + Sphenodontia + Squamata + Testudines) (39.9%). The 12S primers returned high Bc values ( > 98%) for Mammalia, Amphibia, and Aves, and a moderate Bc value (79.8%) for “Reptilia”. We therefore expected most or all Ailaoshan mammals, birds, and amphibians to be amplifiable by one or both primers, and a lower success rate for snakes and lizards.

Primers were ordered with sample-identifying tag sequences, and we used a twin-tagging strategy to identify and remove ‘tag jumping’ errors82 using the DAMe protocol83. From our 893 replicate tubes, we successfully PCR-amplified in triplicate 661 samples using our LSU primers and 745 samples using our SSU primers. Successful PCR amplifications were sent to Novogene (Beijing, China) for PCR-free library construction and 150 bp paired-end sequencing on an Illumina HiSeq X Ten.

Negative controls were included for each set of PCRs, and the PCR set was repeated, or ultimately abandoned, if agarose gels revealed contamination in the negative controls. We also sequenced the negative controls, because gels do not always detect very low levels of contamination. Sequences assigned to human, cow, dog, goat, pig, chicken and some wild species appeared in our sequenced negative controls, but with low PCR replication and at low read number. We used these negative controls to set DAMe filtering stringency in our bioinformatics pipeline (see next section and Supplementary Information) for all samples to levels that removed these contaminants: -y 2 for both markers (minimum number of PCRs out of 3 in which a unique read must be present), and -t 9 for LSU and -t 20 for SSU (minimum number of copies per PCR at which a unique read must appear). We also amplified and sequenced a set of positive controls containing DNA from two rodent species, Myodes glareolus and Apodemus flavicollis, along with negative controls that we verified to be contamination-free using agarose gel electrophoresis. M. glareolus and A. flavicollis have European and Western Asian distributions, and we did not detect either species in our leech samples.

Bioinformatics pipeline

The three key features of our bioinformatics pipeline were the DAMe protocol83, which uses twin-tagging and three independent PCR replicates to identify and remove tag-jumped and erroneous reads, the use of two independent markers, which provides an independent check on taxonomic assignments (Supplementary Fig. 2), and the PROTAX statistical ‘wrapper’ for taxonomic assignment84,85, which reduces overconfidence in taxonomic assignment when reference databases are incomplete, as they always are. In this case, around half of the known Ailaoshan taxa were present in the reference databases (Supplementary Data 2). Mammals and amphibians were relatively well represented: 73% of mammals and 83% of amphibians were in the LSU database, respectively 70% and 67% in the SSU database. Birds and squamates were less well captured, with 42% of birds and 53% of squamates present in the LSU database, respectively 35% and 34% in the SSU database. For OTUs that do not have reference sequences, PROTAX assigns them to higher ranks and flags them as ‘unknowns,’ allowing us to assign those OTUs to morphospecies and potentially supply taxonomy based on other information such as correlations between the datasets as described here.

After DAMe filtering, we removed residual chimeras using VSEARCH v2.9.086, clustered sequences into preliminary operational taxonomic units (‘pre-OTUs’) using Swarm v2.087, and then used the R package LULU v0.1.088 to merge pre-OTUs with high similarity and distribution across samples. We then used PROTAX to assign taxonomy to representative sequences from the merged pre-OTUs33,84,85, in which we benefited from recent additions to the mitochondrial reference database for Southeast Asian mammals89. The full pipeline is described in detail in the Supplementary Information (Assigning taxonomy to preliminary operational taxonomic units and following sections). We shared taxonomic information between the LSU and SSU datasets by making use of correlations between the datasets. To do this, we calculated pairwise correlations of LSU and SSU pre-OTUs across the 619 replicates for which both markers had been amplified and visualised the correlations as a network (Supplementary Fig. 2). If an LSU and an SSU pre-OTU occurred in (mostly) the same subset of replicates and were assigned the same higher-level taxonomies, the two pre-OTUs were deemed likely to have been amplified from the same set of leeches feeding on the same species. We manually inspected the network diagram and assigned such correlated pre-OTU pairs the same taxonomy.

We eliminated any pre-OTUs to which we were unable to assign a taxonomy; these pre-OTUs only accounted for 0.9% and 0.2% of reads in the LSU and SSU datasets respectively, and most likely represent sequencing errors rather than novel taxa. Within the LSU and SSU datasets, we merged pre-OTUs that had been assigned the same taxonomies, thus generating a final set of operational taxonomic units (OTUs) for each dataset. Finally, we removed the OTU identified as Homo sapiens from both datasets prior to analysis. Although it would be informative to map the distribution of humans across the reserve, we expect that most of the DNA came from the rangers themselves, not from other humans using the reserve.

Our final OTUs are intended to be interpreted as species-level groups, even though some cannot yet be assigned taxonomic names to species level (most likely due to incomplete reference databases). Thus, for example, the two frog OTUs Kurixalus sp1 and Kurixalus sp2 in the LSU dataset should be interpreted as two distinct Kurixalus species. Likewise, the frog OTU Megophryidae sp3 in the LSU and SSU datasets should be interpreted as a single species within Megophryidae. We therefore refer to our final OTUs as species throughout this study.

After excluding humans, the final LSU and SSU datasets comprised 18,502,593 and 84,951,011 reads respectively. These reads represented a total of 59 species across 653 replicates and 126 patrol areas in the LSU dataset, and 72 species across 740 replicates and 127 patrol areas in the SSU dataset. To assess the degree to which our iDNA approach was able to capture the breadth of vertebrate biodiversity in the park, we compared the list of species that we detected against unpublished, working species lists maintained by researchers at the Kunming Institute of Zoology.

We also attached additional metadata to our species list: we attached International Union for Conservation of Nature (IUCN) data for individual species by using the R package rredlist v0.6.090 to search for scientific names assigned by PROTAX. For this purpose, we treated Capricornis milneedwardsii as synonymous with Capricornis sumatraensis, in line with recent research and the latest IUCN assessment91,92. For mammals, we used the PanTHERIA database93 to obtain data on adult body mass for each species; where species-level information was not available, we used the median adult body mass from the database for the lowest taxonomic group possible.

Site-occupancy modelling

We estimated separate multispecies site-occupancy models for the LSU and SSU datasets using parameter-expanded data augmentation46,53. These models assume that the nLSU = 59 and nSSU = 72 species observed in each dataset are, respectively, subsets of larger communities of size NLSU and NSSU species that are present in the vicinity of Ailaoshan and vulnerable to capture (e.g. fed on by leeches and amplified by the LSU and SSU primers). Although NLSU and NSSU are unknown, these communities can be modelled by embedding them in a larger ‘supercommunity’ of fixed size M. We set M = 200 for our final model. Values from M = 150 up to M = 474 (the latter being the total species richness for mammals, birds, non-avian reptiles and amphibians in the 1984-85 survey of Ailaoshan35) produced similar estimates for NLSU and NSSU.

For each species in the supercommunity, our models explicitly capture (i) a ‘community process’ governing whether the species is in the Ailaoshan community or not; (ii) an ‘ecological process’ governing the presence or absence of the species in each patrol area, given that it is in the community; and (iii) an ‘observation process’ governing whether we detect the species’ DNA in each of our replicate samples, given that it is present in the patrol area. The community-, ecological- and observation processes for individual species are linked by imposing community-level parameters and priors as described below.

For the community process, each species i was assumed to be either a member of the Ailaoshan community or not. We denote this unobserved state with wi, which was assumed to be a Bernoulli random variable governed by the community membership parameter ({{{Omega }}}_{{g}_{i}}), i.e. the probability that species i was in the Ailaoshan community:

$${w}_{i} sim {{{{{{{rm{Bernoulli}}}}}}}}({{{Omega }}}_{{g}_{i}}).$$

(1)

For the community process, we separated the species into two natural groupings – homeothermic mammals and birds, and poikilothermic amphibians and squamates – and allowed them to have different probabilities of being in the Ailaoshan community. This is denoted by the subscript on the ({{{Omega }}}_{{g}_{i}}) parameter, in which gi represents which of these two groupings species i belongs to. This approach reflected our expectation that these groupings would differ systematically in their community probabilities, and we employed the same grouping for parameters governing the ecological and detection processes (see below for further discussion).

For the ecological process, each species i was assumed to be either present or absent in each patrol area j, and we used zij to denote this unobserved ecological state. We assumed the zij to be constant across all replicates taken from patrol area j, consistent with the samples being taken at essentially the same point in time. Any species present were assumed to be members of the Ailaoshan community (i.e. wi = 1), so we modelled zij as a Bernoulli random variable governed by both wi and an occupancy parameter ψij, i.e. the probability that a species i in the community was present in patrol area j:

$${z}_{ij}| {w}_{i} sim {{{{{{{rm{Bernoulli}}}}}}}}({w}_{i}{psi }_{ij}).$$

(2)

We modelled occupancy ψij as a function of elevation and distance from the reserve edge in the LSU dataset

$${{{{{{{rm{logit}}}}}}}}({psi }_{ij})={beta }_{0i}+{beta }_{1i}{{{{{{{{rm{elevation}}}}}}}}}_{j}+{beta }_{2i}{{{{{{{{rm{reserve}}}}}}}}}_{j}$$

(3)

and as a function of elevation in the SSU dataset

$${{{{{{{rm{logit}}}}}}}}({psi }_{ij})={beta }_{0i}+{beta }_{1i}{{{{{{{{rm{elevation}}}}}}}}}_{j}$$

(4)

where elevationj is the median elevation for patrol area j, and reservej is the distance from the centroid of patrol area j to the nature reserve edge. We chose these specifications by running a ‘full’ model for each dataset with all five environmental covariates, and retaining only those covariates for which the 95% Bayesian confidence interval on the slope coefficient excluded zero.

We modelled observation as a Bernoulli process assuming imperfect detection but no false positives:

$${y}_{ijk}| {z}_{ij} sim {{{{{{{rm{Bernoulli}}}}}}}}({z}_{ij}{p}_{ijk}),$$

(5)

where yijk is the observed data, i.e. detection or non-detection of species i’s DNA in replicate k from patrol area j.

We allowed the conditional detection probability pijk to vary as a function of the conditional detection probability for species i per 100 leeches, ri, and the number of leeches in the replicate, leechesjk:

$$kern0.3pc {p}_{ijk}=1-{(1-{r}_{i})}^{{{{{{{{{rm{leeches}}}}}}}}}_{jk}/100}$$

(6)

$${{{{{{{rm{logit}}}}}}}}({r}_{i})={gamma }_{0i}$$

(7)

We allowed ri, and its logit-scale equivalent γ0i, to vary among species to capture e.g. variation in leech feeding preferences among taxa. We used leechesjk/100 rather than leechesjk to avoid computational problems arising from rounding.

Note that the detection probability pijk is conditional on species i being present in patrol area j, and not on species i’s DNA being present in replicate k from that site. pijk therefore subsumes multiple sources of imperfect detection, including those that result in species i’s DNA being absent from the replicate (e.g. the leeches in replicate k did not feed on species i, or they did so long ago and the DNA has since been digested), as well as those that result in apparent non-detection of species i DNA when it is present (e.g. failure to PCR amplify sufficiently, PCR or sequencing errors, or problems arising during bioinformatic processing). The multiple PCRs that we performed for each replicate (see Laboratory processing above, and Supplementary Information) could in principle have been used to decompose pijk into (i) a per-replicate probability that species i’s DNA is present in the replicate when the species is present at the site, and (ii) a per-PCR probability that species i’s DNA is detected when it present in the replicate, by adding another hierarchical level to our model94,95,96,97. However, we instead chose to combine the results from the multiple PCRs using DAMe83 prior to modelling, since DAMe is specifically designed to detect and remove errors arising in PCR and sequencing, and offers filtering options specialised to this task that we found useful.

Finally, whereas Eqs. (1) through (7) define a site-occupancy model for species i alone, we united these species-specific models with a community model for both ecological and detection processes:

$${beta }_{1i} sim {{{{{{{rm{N}}}}}}}}({mu }_{{beta }_{1}},{sigma }_{{beta }_{1}})$$

(8)

$${beta }_{2i} sim {{{{{{{rm{N}}}}}}}}({mu }_{{beta }_{2}},{sigma }_{{beta }_{2}})quad ({{{{rm{for}}}};{{{rm{the}}}};{{{rm{LSU}}}};{{{rm{model}}}};{{{rm{only}}}}})$$

(9)

$$({beta }_{0i},{gamma }_{0i}) sim {{{{{{{rm{MVN}}}}}}}}left([{mu }_{{beta }_{0}{g}_{i}},{mu }_{{gamma }_{0}{g}_{i}}],left[begin{array}{cc}{sigma }_{{beta }_{0}{g}_{i}}^{2}&rho {sigma }_{{beta }_{0}{g}_{i}}{sigma }_{{gamma }_{0}{g}_{i}} rho {sigma }_{{beta }_{0}{g}_{i}}{sigma }_{{gamma }_{0}{g}_{i}}&{sigma }_{{gamma }_{0}{g}_{i}}^{2}end{array}right]right)$$

(10)

where N() and MVN() denote normal and multivariate normal distributions. These distributions were characterised by community hyperparameters μ and σ, with separate distributions for each parameter as denoted by the first subscript. We used a multivariate normal prior for (β0i, γ0i) to allow non-zero covariance between species’ occupancy and detection probabilities, as we might expect if, for example, variation in abundance affects both probabilities46.

These community models allow rare species effectively to borrow information from more common ones, producing a better overall ensemble of parameter estimates, though at the cost of shrinkage on the individual parameters46,98,99. As for the community process described above, we separated the species into two groups – homeothermic mammals and birds, and poikilothermic amphibians and squamates – and allowed them to have different community distributions. This is denoted by the subscripts on the μ and σ community hyperparameters for the occupancy and detection intercepts, in which gi represents which of these two groupings species i belongs to. This approach reflected our expectation that these groupings would differ systematically in occupancy probabilities (e.g. due to different habitat preferences) and in detection probabilities (e.g. due to different encounter rates with leeches, or leech feeding preferences). Alternative groupings could also be justified on biological grounds: for example, separating mammals and birds on the basis that many of the mammals are terrestrial while many of the birds are arboreal; or grouping birds and squamates together to better reflect phylogeny. Such alternative groupings did not perform well in our datasets, as most birds and squamates were observed too infrequently to provide much information on these groups by themselves, but this aspect of the model would be worth revisiting in future work.

We estimated our models using a Bayesian framework with JAGS v4.3.0100. We used 5 chains of 100,000 generations, including a burn-in of 50,000. We retained all rounds (i.e. without thinning) for the posterior sample, except for where we needed to save the z matrix for beta diversity and cluster occupancy calculations (see Statistical analyses below); memory limitations prevented us from retaining all posterior samples for the z matrix, and we thinned tenfold in order to make these calculations feasible. The Supplementary Information provides details of the prior distributions used for the model parameters. From the model results we calculated posterior means and quantiles for all model parameters of interest, as well as estimated species richness for each patrol area, and number of sites occupied for each species.

Statistics

Species richness

For each dataset, we obtained estimates of overall species richness for Ailaoshan directly from the model, by summing the wi. To assess our choice of M, we compared these overall species richness estimates for M = 100, 150 and 200.

After examining occupancy and detection estimates for each species, we used histograms to visualise the distribution of estimated species richness per patrol area (obtained for each patrol area j by summing the zij). We calculated median estimated species richness across the patrol areas for comparison with median observed species richness per patrol area and per replicate. We drew choropleths to visualise the spatial distribution of both observed and estimated species richness across the nature reserve.

We examined community mean occupancy and detection probabilities (see e.g. Section 11.7.2 in Kéry and Royle101) to help understand the effects of the site and sample covariates. For each species group g = 1, 2 (representing mammals/birds and amphibians/squamates, respectively), we calculated the posterior mean and 95% Bayesian confidence interval for community mean occupancy and detection as functions of the covariates:

$${psi }_{g}({{{{{{{rm{elevation}}}}}}}})=logi{t}^{-1}({mu }_{{beta }_{0}g}+{mu }_{{beta }_{1}}{{{{{{{rm{elevation}}}}}}}})$$

(11)

$${psi }_{g}({{{{{{{rm{reserve}}}}}}}})=logi{t}^{-1}({mu }_{{beta }_{0}g}+{mu }_{{beta }_{2}}{{{{{{{rm{reserve}}}}}}}})quad ({{{{rm{for}}}};{{{rm{the}}}};{{{rm{LSU}}}};{{{rm{model}}}};{{{rm{only}}}}})$$

(12)

$${p}_{g}({{{{{{{rm{leeches}}}}}}}})=1-{(1-{{{{{{{{rm{logit}}}}}}}}}^{-1}({mu }_{{gamma }_{0}g}))}^{{{{{{{{rm{leeches}}}}}}}}/100}$$

(13)

This approach effectively holds distance from reserve edge at zero in ψg(elevation), and elevation at zero in ψg(reserve), corresponding to the mean values for these covariates in our data, since predictors were normalised prior to modelling. To visualise variation among species in occupancy and detection response to covariates, we repeated these calculations using each species’ estimates for β0, β1, β2 and γ0 in place of the community hyperparameters to obtain the posterior means for each species.

We compared three measures of species richness between the two datasets in order to assess the extent to which the two datasets agreed on variation in richness within Ailaoshan. First, the observed species richness in each replicate; second, the observed species richness in each patrol area; and third, the estimated species richness in each patrol area (i.e. the posterior mean number of species, calculated from zij). For each of these measures, we computed the Pearson correlation between the datasets and tested the correlation coefficient against zero with a t-test. We also used Poisson GLMs to examine the relationship between each of these species richness measures and sampling effort: we regressed observed species richness per replicate against the log-transformed number of leeches per replicate, and we regressed both the observed and estimated species richness per patrol area against the log-transformed number of replicates per patrol area, testing the significance of the slope coefficients with t-tests.

Community composition

We explored variation in vertebrate community composition among patrol areas using posterior mean Jaccard similarities calculated from the estimated occupancy states zij (see Dorazio53 and Kéry and Royle101 for other examples of this approach). We visualised the pairwise Jaccard distances (i.e. distance = (1 − similarity)) using non-metric multidimensional scaling ordinations, overlaying environmental covariates using the vegan::ordisurf function. We clustered patrol areas based on the Jaccard distances using Ward’s criterion (R function hclust(., method = “ward.D2”)). We used this clustering to split the patrol areas into three groups, which turned out to correspond to low-, intermediate-, and high-elevation sites. We used Cramer’s V to quantify the extent to which these clusters matched across the two datasets. We visualised the spatial variation in community composition within the reserve by drawing maps of Ailaoshan with patrol areas coloured by these three clusters. To help understand how vertebrate communities varied among the clusters, we used the posterior sample of the occupancy states zij to calculate posterior means and 95% Bayesian confidence intervals for the occupancy (i.e. fraction of patrol areas occupied) of each species in the low-, intermediate- and high-elevation site clusters.

To assess the extent to which the two datasets identified common patterns of variation in community composition across the patrol areas, we performed a co-inertia analysis on the matrices of predicted species in each patrol area in each dataset using ade4::coinertia in R. We used the RV coefficient54 to quantify coinertia, testing its significance with the permutation test in ade4::RV.rtest with 999 permutations. We also tested for correlation between the posterior mean Jaccard distances from the two datasets using a Mantel test with 999 permutations.

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.


Source: Ecology - nature.com

Q&A: Bettina Stoetzer on envisioning a livable future

Antennae of psychodid and sphaerocerid flies respond to a high variety of floral scent compounds of deceptive Arum maculatum L.