The custom code used to clean occurrence records and construct SDMs is available at (github.com/RhettRautsaw/ VenomMaps). We used the following R16 packages for data cleaning, manipulation, species distribution modeling, and Shiny app creation: tidyverse17 readxl18, data.table19, sf20, sp21,22, rgdal23, raster24, smoothr25, ape26, phytools27, argparse28, parallel16, memuse29, dismo30, rJava31, concaveman32, spThin33, usdm34, ENMeval35, kuenm36, shiny37, leaflet38, leaflet.extras39, leaflet.extras240, RColorBrewer41, ggpubr42, ggtext43, and patchwork44.
Updating occurrence record taxonomy
Our goal was to update and reconstruct the distributions of New World pitvipers. We used the Reptile Database45 (May 2021) as our primary source for current taxonomy which included the following genera: Agkistrodon, Atropoides, Bothriechis, Bothrocophias, Bothrops, Cerrophidion, Crotalus, Lachesis, Metlapilcoatlus, Mixcoatlus, Ophryacus, Porthidium, and Sistrurus. However, to ensure we captured all New World pitvipers records, we incorporated all members of the family Viperidae (all vipers and pitvipers) into our pipeline for updating occurrence record taxonomy (i.e., to account for errors in the recorded latitude, longitude, or if subfamily was not recorded).
First, we collected global occurrence records for “Viperidae” from GBIF (downloaded 2021-08-19)46, Bison (downloaded 2021-08-19)47, HerpMapper (only New World taxa; downloaded 2021-08-19)48, Brazilian Snake Atlas49, BioWeb (downloaded 2021-07-07)50, unpublished data/databases from RMR, GJV, EPH, LRVA, MM, and CLP, and georeferenced literature records totaling 373,673 species-level records, 292,425 of which are New World pitvipers. Given the fluidity of taxonomy, records were often associated with outdated names. For example, Crotalus mitchelli pyrrhus was elevated to Crotalus pyrrhus51, but may still be recorded as the former in a given repository (e.g., GBIF). To correct taxonomy in our database, we checked records against a list of synonyms found on the Reptile Database and compared them to current taxonomy. If species and subspecies columns matched the same taxon (or no subspecies was recorded), then species IDs were not altered. If species and subspecies IDs did not match the same taxon, we updated taxonomy by minimizing the number of changes required to a given character string. We then manually checked all changes.
Constructing distribution maps
Next, we collected preliminary distribution maps from the International Union for Conservation of Nature (IUCN; downloaded 2018-11-27)52, Global Assessment of Reptile Distributions (GARD) v1.153, Heimes54, Campbell and Lamar55, and unpublished maps. We manually curated distribution maps for all New World pitvipers in QGIS using the occurrence records, previous distribution maps, and recent publications for each taxon (note that distributions for Old World Viperidae have not yet been updated). We used a digital relief map (maps-for-free.com) and The Nature Conservancy Terrestrial Ecoregions (TNG.org)56 to identify clear distribution boundaries (e.g., mountains). We then clipped the final distributions to a land boundary (GADM v3.6)57 and smoothed the distribution using the the “chaikin” method in the R package smoothr25.
Our initial taxonomy check was only concerned with records for which a subspecies was recorded and had since been elevated to species status. Therefore, many records with no assigned subspecies likely remained associated with an incorrect or outdated generic and/or specific identification. Fortunately, taxonomic changes are typically associated with changes in the species’ expected distribution. For example, when Crotalus simus was resurrected from C. durissus, the distribution of C. durissus was split: the northern portion of its range in Central America now represented the resurrected species (C. simus) and the southern portion of its range remained C. durissus55. Yet, occurrence records in Central America often remain labelled as C. durissus in data repositories. Therefore, we spatially joined records with the newly reconstructed species distribution maps to determine if they overlapped with their expected distribution (Old World taxa were joined with the GARD 1.1 distributions53).
Briefly, we developed a custom function (occ_cleaner.R) to perform the spatial join and update taxonomy. First, we calculated the distance for each record to the 20 nearest distributions within 50 km (full overlap resulted in a distance of 0 m). Next, we calculated the phylogenetic distance between the recorded species ID and each species with which that record overlapped using the tree from Zaher et al.58 and adding taxa based on recent clade-specific publications (bind.tip2.R; see github.com/RhettRautsaw/VenomMaps for full list of references and details). If records overlapped with their expected species, no changes were made. If records fell outside of their expected distribution, we filtered the potential overlapping and nearby species (within 50 km) to minimize phylogenetic distance. If multiple species were equally distant (i.e., share the same common ancestor), we attempted to minimize geographic distance. If multiple species remained equally distant in both phylogenetic and geographic distance, we flagged the record to be manually checked. We also flagged records if a species’ taxonomy had changed and records were additionally flagged as potentially dubious if the taxonomic change had a phylogenetic divergence greater than 5 million years. We manually checked all flagged records and returned records to their original species ID if species identity remained uncertain. We flagged these records as potentially dubious, along with records that fell outside of their expected distribution (within 50 km), and removed all flagged records for species distribution modeling. Our final cleaned database contained 344,998 global records, of which 275,087 were New World pitvipers.
Species distribution modeling
We attempted to infer SDMs for the 158 species of New World pitvipers currently recognized by the Reptile Database (May 2021) and additionally modeled the three subspecies of Crotalus ravus separately based on recommendations for species status elevation by Blair et al.59 for a total of 160 species. We developed a unix-executable R script (autokuenm.R) designed to take occurrence records, distribution maps, and environmental data and prepare these data for species distribution modeling with kuenm36. We chose to use kuenm – and MaxEnt v3.4.460 – because it has been shown to have good predictive power61 and fine-tuning of this algorithm has performance comparable to more computationally intensive ensembles62,63. Additionally, MaxEnt allows for flexibility in parameter selection64 and can function entirely with presence data14.
Prior to autokuenm, to account for sampling/spatial bias during SDM, we created a bias file by using the pooled New World pitviper occurrence records as representative background data65,66,67,68. Specifically, we converted occurrence records to a raster and performed two-dimensional kernel density estimation (kde2d) with the MASS package with default settings69 and rescaled the kernel density by a factor of 1000 and rounded to three decimal places. This was then used as input to factor out sampling bias by MaxEnt. We then ran autokuenm, which is designed to subset/partition the cleaned occurrence records for a given species and prepare additional files for SDM. We first defined M-areas – or areas accessible to a given species – using the World Wildlife Fund Terrestrial Ecoregions70. Biogeographic regions represent distributional limits for many species and are reasonable hypotheses for the areas accessible to a given species71,72. To do this, we created alpha hulls from the subset of occurrence records for a given species using concaveman32 with default settings. We then identified regions with at least 20% of the region covered by the alpha hull and merged these regions together to form our final M-area. All environmental layers and the bias file were cropped to this M-area which was used as the geographic extent for modeling. We then randomly selected 5% of records to function as an independent test set for final model evaluation. Next, we generated 2000 random background points across the cropped environmental layers and used ENMeval to partition occurrence records into four sets using the checkerboard2 pattern35. Note that the background points here were not used in MaxEnt. One of the four partitions was selected at random to be used as the testing set; the remaining three partitions were used for training the MaxEnt models. If the number of occurrence records in the independent test set was less than five, then we used the training partition for final model creation and used the testing partition for final model evaluation.
We tested the top-contributing variables from three sets of environmental layers: (1) bioclimatic variables, (2) EarthEnv topographic variables73, and (3) a combination of these variables. To select the top-contributing variables in each set, we wrote a custom function (SelectVariables) which used a combination of MaxEnt permutation importance and Variable Inflation Factors (VIF) to remove collinearity while keeping the variables that contributed the most to the model. Compared with variable selection via principal component analysis loadings, the permutation importance and VIF methodology demonstrated significant improvement in MaxEnt model fit. First, we designed SelectVariables to run MaxEnt using dismo::maxent with default settings and then extracted the permutation importance. We removed variables if they had 0% permutation importance. Next, we calculated VIF with usdm::vif and then iteratively removed variables by selecting the variables with two highest VIF values and removing whichever variable had the lowest permutation importance. We then recalculated VIF and repeated the process until the maximum VIF value was less than 10. Finally, we recalculated permutation importance with the remaining variables using dismo::maxent with default settings and removed variables with less than 1% permutation importance to create the final variable sets. This process was done for each species independently.
With the final environmental variable, testing, and training sets, we generated SDMs using kuenm. First, we created candidate calibration models with multiple combinations of regularization multipliers (0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1, 2, 3, 4, 5, 6, 8, 10), feature classes (l, q, h, lq, lp, lt, lh, qp, qt, qh, pt, ph, th, lqp, lqt, lqh, lpt, lph, lth, qpt, qph, qth, pth, lqpt, lqph, lqth, lpth, qpth, lqpth), and sets of environmental predictors (bioclimatic, topographic, combination) totaling 2,958 candidate models per species. We then ran each model in parallel using GNU Parallel74. Next, we evaluated the candidate models and selected the best models using statistical significance (partial ROC), prediction ability (omission rates; OR), and model complexity (AICc) with the “kuenm_ceval” function with default settings. Specifically, models were only considered if they were statistically significant and had an OR less than 5%. If no models passed the OR criteria, the models with the minimal OR were considered. Finally, any remaining models were filtered to those within 2 AICc of the top model (Supplementary Table 1). In addition to evaluating and comparing all models together, we evaluated bioclimatic-only and combination-only models separately since these two sets of environmental variables were expected to be the best performing models given the ubiquity of bioclimatic variables in species distribution modeling (Supplementary Table 1).
We generated 10 bootstrap replicates for each of the “best” calibration models using the “kuenm_mod” function. We also performed jackknifing to assess variable importance and models were output in raw format. We evaluated the final models using “kuenm_feval” with default settings. To select the best model for each comparative set (i.e., all, bioclimatic-only, and combination-only sets), we filtered the final evaluation results to minimize the OR and maximize the AUC ratio (Supplementary Table 2). If multiple models remained and were considered equally competitive, we averaged these models together (Supplementary Table 3). Because we performed three different set of comparisons, there were three “best” models per species, so we again aimed to minimize the OR and maximize the AUC ratio to select a final model for each species (Supplementary Table 4). We then converted our final models into cloglog format for visualization and threshold the models using a 10th percentile training presence cutoff (Fig. S2). Both conversion and thresholding functions are provided as R functions (raw2log, raw2clog, raster_threshold in functions.R; github.com/RhettRautsaw/VenomMaps).
Source: Ecology - nature.com