Searching the web builds fuller picture of arachnid trade

Our online sampling methods largely follow the protocols detailed in refs. 3,4, although we limited our online searches to online shops and did not extend to social media. Large portions of code are re-used directly from those papers, although we additionally provide modified code with this paper. For keyword searches and data review we used R v.4.1.1 (ref. 49) via RStudio v.1.4.1103 (ref. 50), and made wide use of functions supplied by the anytime v.0.3.9 (ref. 51), assertthat v.0.2.1 (ref. 52), dplyr v.1.0.7 (ref. 53), glue v.1.4.2 (ref. 54), lazyeval v.0.2.2 (ref. 55), lubridate v.1.7.10 (ref. 56), magrittr v.2.0.1 (ref. 57), purrr v.0.3.4 (ref. 58), reshape2 v.1.4.4 (ref. 59), stringr v.1.4.0 (ref. 60), and tidyr v.1.1.3 (ref. 61) packages; other specific package uses are listed throughout the methods description. We used the grateful v.0.0.3 (ref. 62) package to generate citations for all R packages. Code and data used to produce figures and summary data are also available at https://doi.org/10.5281/zenodo.5758541.

Website sampling and search

We searched for contemporary arachnid-selling websites using the Google search engine, targeting nine languages (English, French, Spanish, German, Portuguese, Japanese, Czech, Polish, Russian). Terms were created to be inclusive: only spiders and scorpions appeared on the initial search string, as specialist groups may exist for either, but are unlikely for smaller arachnid groups, which were often listed under “other” in online shops. Terms were selected to be encompassing, so that any site listing variants of “spider”, or mentioning arachnids in the chosen language, was captured. Whilst some groups such as tarantulas are more popular as pets, sites specialising in them rarely omit translations of “spider” and should also be captured in the search; consistent with this, Terraristika (as was shown in previous analyses of amphibians and reptiles) listed the greatest number of species, despite not being a specialist site. We used the localised versions of Google for each of these languages with the following Boolean search strings:

  • English: (Spider OR scorpion OR arachnid) AND for sale

  • French: (Araignée OR scorpion OR arachnide) AND à vendre

  • Spanish: (Araña OR escorpión OR arácnido) AND en venta

  • German: (Arachnoid OR Spinne OR Skorpion OR Spinnentier) AND zum Verkauf

  • Portuguese: (Aranha OR escorpião OR aracnídeo) AND à venda

  • Japanese: (クモ OR サソリ OR クモ型類) AND (中村彰宏 OR 販売)

  • Czech: (Pavouk OR Štír OR pavoukovec) AND prodej

  • Polish: (Pająk OR Skorpion OR pajęczak) AND sprzedaż

  • Russian: Продажа пауков OR скорпионов

We undertook these searches in a private window in the Firefox v.92.0.1 browser (ref. 63) to limit the impact of search history. These keywords were used to identify sites that may be selling arachnids, which could then be checked before a comprehensive scrape.

For each language, we downloaded the first 15 pages of results between 2021-06-06 and 2021-07-07 (or fewer in the event that the search returned fewer than 15 pages: German 8 pages and Spanish 14 pages). This resulted in ~1270 sites that could potentially be selling arachnids. After removing duplicates and simplifying the URLs (so all ended in .com, .org, .co.uk, etc.; Code S1), we reviewed each site against the following criteria (2021-07-31 to 2021-08-02): whether they sell arachnids; the type of site (trade or classified ads); the order arachnids were listed in (e.g., date or alphabetical); the best search method for gathering species appearances (see below for hierarchical search methods); a refined target URL listing the species inventory; the number of pages within the website potentially required to cycle through; and, if the search method required a crawl, whether the site explicitly forbade crawling data collection and whether we could limit the crawl’s scope with a filter on downstream URLs. Finally, we assigned all suitable sites a unique ID. We have made a censored version of the website review results available in Data S1. In addition to the systematic search for arachnid trade, we added 43 websites discovered ad hoc from links on previously discovered sites (many listed online shops), those listed in other journal articles on invertebrate trade (i.e., ref. 6), or from discussion with informed colleagues (between 2021-08-07 and 2021-09-15). After reviewing these ad hoc sites (2021-08-07 to 2021-09-15), we had a combined total of 111 sites to attempt to search for the appearance of arachnid species.
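
As an illustration of the URL simplification and de-duplication step, a minimal sketch (not the authors' Code S1; URLs and column names are placeholders) is:

    # Trim search-result URLs to their domain so duplicates collapse to one row per site
    library(dplyr)
    library(tibble)
    library(stringr)

    search_hits <- tibble(url = c(
      "https://www.example-arachnids.com/shop/tarantulas?page=2",
      "https://www.example-arachnids.com/shop/scorpions",
      "http://spinnen-beispiel.de/kaufen/index.html"
    ))

    simplified <- search_hits %>%
      mutate(domain = url %>%
               str_remove("^https?://") %>%   # drop the protocol
               str_remove("^www\\.") %>%      # drop any leading www.
               str_remove("/.*$")) %>%        # keep everything before the first slash
      distinct(domain, .keep_all = TRUE)      # one row per unique site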

Our searches of websites took one of five forms (Code S2), designed to minimise server load and limit the number of irrelevant pages searched, while ensuring we captured the pages listing species. We prioritised using the lowest/simplest search method possible for each site.

Single page or PDF

For websites that listed their entire arachnid stock on a single page, we retrieved that page using the downloader v.0.4 package (ref. 64). In cases where the inventory was listed in a PDF, we manually downloaded the PDF and used pdftools v.3.0.1 (ref. 65) to extract the text.
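
A sketch of the single-page and PDF retrieval (file names and URLs are placeholders rather than the sites actually sampled) is:

    library(downloader)
    library(pdftools)

    # Single stock page: save the raw HTML for later keyword searching
    download("https://example-arachnid-shop.com/stocklist",
             destfile = "site_001_page_1.html")

    # PDF inventory: read the (manually downloaded) PDF, one string per page
    stock_text <- pdf_text("site_002_stocklist.pdf")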

Cycle

Some websites had large stocklists split across multiple pages that could be accessed sequentially. In these cases, we used the downloader v.0.4 package (ref. 64) to retrieve each page, cycling from page 1 to the terminal page identified during the website review stage. Two sites required a slight modification to the page-cycling process, as the sequence was defined not by page numbers but by the number of adverts displayed; in these instances, we cycled through the adverts 20 at a time (i.e., matching the default number displayed by the site). For all cycling, we implemented a 10 s cooldown between requests to limit server load.
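
A minimal sketch of the page cycling, assuming hypothetical URLs of the form .../stocklist?page=N and a terminal page recorded during the website review, is:

    library(downloader)
    library(glue)

    site_id   <- "site_003"
    base_url  <- "https://example-arachnid-shop.com/stocklist?page="
    last_page <- 12                     # terminal page from the website review stage

    for (page in seq_len(last_page)) {
      download(glue("{base_url}{page}"),
               destfile = glue("{site_id}_page_{page}.html"))
      Sys.sleep(10)                     # 10 s cooldown between requests
    }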

Level 1 crawl

For websites that split their stock between multiple pages, but with no sequential ordering, we used a level 1 crawl, via the Rcrawler v.0.1.9.1 package (ref. 66), to access them all; for example, where a site had an “arachnids for sale” page but full species names existed only in linked pages (e.g., “tarantulas”, “scorpions”).
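
A hedged sketch of a level 1 crawl with Rcrawler (the start URL, output directory, and settings are illustrative, not the per-site values used) is:

    library(Rcrawler)

    Rcrawler(Website       = "https://example-arachnid-shop.com/arachnids-for-sale",
             MaxDepth      = 1,         # follow links one level from the start page
             RequestsDelay = 20,        # cooldown between requests (seconds)
             DIR           = "./site_004")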

Cycle and level 1 crawl

Some websites required a combined approach, where stock was split sequentially across pages, and the species identities (i.e., scientific names) required accessing the pages linked to from the sequential pages. In these cases, we ran the initial sequential sampling followed by a level 1 crawl.

Level 2 crawl

Where level 1 crawls were insufficient to cover all species sold on a site, we used a level 2 crawl to reach all pages listing species. This tended to be the case on websites with multiple categories to classify and split their stock (e.g., “arachnid”—“spider”—“tarantula”).

For all crawls, we used a cooldown of 20 s between requests to limit server load, and where possible we limited the scope of the crawl (i.e., linked pages to be retrieved) using a key phrase common to all stock listing pages (e.g., “/category=arachnid/”).
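
An illustrative sketch of a deeper crawl with its scope limited by a key phrase (shown via Rcrawler's crawlUrlfilter argument; the URL and filter value are placeholder patterns, not a real site structure) is:

    library(Rcrawler)

    Rcrawler(Website        = "https://example-arachnid-shop.com/",
             MaxDepth       = 2,                      # level 2 crawl
             RequestsDelay  = 20,                     # 20 s cooldown between requests
             crawlUrlfilter = "/category=arachnid/",  # only follow stock-listing URLs
             DIR            = "./site_005")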

In addition to the sampling of contemporary sites, we explored the archived pages available for https://www.terraristik.com via the Internet Archive (2002–2019; ref. 67). Terraristika had previously been shown to be a major contributor to traded species lists (ref. 4), and the website’s age and accessibility via the Internet Archive meant it was one of the few websites where temporal sampling was feasible. We retrieved pages via the Internet Archive’s Wayback Machine API (ref. 68), using code created for refs. 3,4. The code used was based on the wayback v.0.4.0 package (ref. 69), but additionally made use of the httr v.1.4.2 (ref. 70), jsonlite v.1.7.2 (ref. 71), downloader v.0.4 (ref. 64), lubridate v.1.7.10 (ref. 56), and tibble v.3.1.3 (ref. 72) packages (Code S3).
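
An illustrative query of the Internet Archive's CDX interface via httr and jsonlite (the wayback package wraps similar calls; the parameters follow the public CDX API rather than the authors' exact Code S3) is:

    library(httr)
    library(jsonlite)

    resp <- GET("http://web.archive.org/cdx/search/cdx",
                query = list(url    = "terraristik.com",
                             from   = "2002",
                             to     = "2019",
                             output = "json"))

    # First row of the returned array is the header (urlkey, timestamp, original, ...)
    snapshots <- fromJSON(content(resp, as = "text", encoding = "UTF-8"))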

Keyword generation

We relied on multiple sources to build a list of arachnid species (spiders, scorpions and uropygi). For spiders we used the WSC (ref. 18; https://wsc.nmbe.ch/dataresources; accessed 2021-09-18). We filtered the WSC dataset to remove subspecies, then used a combination of the rvest v.1.0.1 (ref. 73), dplyr v.1.0.7 (ref. 53), and stringr v.1.4.0 (ref. 60) packages (see Code S4) to query the online version of the WSC database and retrieve all synonyms for each species. Where synonyms were listed with an abbreviated genus, we replaced the abbreviation with the first instance of a genus that matched the first letter of the abbreviation.
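
A minimal sketch of the abbreviated-genus expansion (assuming synonyms are stored as plain strings; the helper and example names are illustrative) is:

    library(stringr)

    expand_genus <- function(synonym, known_genera) {
      abbrev <- str_match(synonym, "^([A-Z])\\. ")[, 2]          # e.g. "B" from "B. smithi"
      if (is.na(abbrev)) return(synonym)                         # already a full genus
      full <- known_genera[str_starts(known_genera, abbrev)][1]  # first genus matching that letter
      str_replace(synonym, "^[A-Z]\\.", full)
    }

    expand_genus("B. smithi", known_genera = c("Aphonopelma", "Brachypelma"))
    #> "Brachypelma smithi"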

We combined the WSC data with a list manually retrieved from the Scorpion Files (ref. 74; https://www.ntnu.no/ub/scorpion-files/index.php; accessed 2021-09-19). For the uropygi species, we combined species listings from the Integrated Taxonomic Information System (ITIS, ref. 75; https://www.itis.gov/servlet/SingleRpt/RefRpt?search_type=source&search_id=source_id&search_id_value=1209 and https://www.itis.gov/servlet/SingleRpt/SingleRpt?search_topic=TSN&anchorLocation=SubordinateTaxa&credibilitySort=TWG%20standards%20met&rankName=ALL&search_value=82710&print_version=SCR&source=from_print#SubordinateTaxa; accessed 2021-09-19) and the Western Australian Museum (ref. 76; http://www.museum.wa.gov.au/catalogues-beta/browse/uropygi; accessed 2021-09-19). We were unable to source reliable data on all scorpion and uropygi synonyms; therefore, we used all names listed from all sources, but noted those names considered nomina dubia. Our final keyword list contained 52,111 species and 94,184 different species names, with a mean of 1.81 ± 0.01 (SE) terms per species (range 1–61). For summaries of total species, we relied on the species classed as accepted by the species databases (WSC, Scorpion Files, ITIS and the Western Australian Museum).

Keyword search

We successfully retrieved 3020 pages from 103 websites (mean = 28.78 ± 11.42 (SE), range: 1–1077), and used a further 4668 previously archived pages. To prepare each retrieved web page for keyword searching, we removed all double spaces, html elements, and non-alphanumeric characters, replacing them with single spaces (Code S5). For this process we used the rvest v.1.0.1 (ref. 73), XML v.3.99.0.8 (ref. 77), and xml2 v.1.3.2 (ref. 78) packages. This process increased the chances that genus and species epithets would appear in a format compatible with our keyword list. The process was not able to repair abbreviated genera, or aid detection where the genus and species epithet were not reported side-by-side.
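
A sketch of this cleaning step (in the spirit of Code S5; the function name and regular expression are illustrative) is:

    library(rvest)
    library(stringr)
    library(magrittr)

    clean_page <- function(path) {
      read_html(path) %>%
        html_text() %>%                            # drop html elements, keep visible text
        str_replace_all("[^A-Za-z0-9]", " ") %>%   # non-alphanumeric characters -> spaces
        str_squish()                               # collapse repeated spaces to single spaces
    }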

Due to the large number of species, we were forced to adapt previous search methods, instead implementing a hierarchical genus-species search (Code S6). We searched each retrieved page for any mention of a genus, then only searched for species contained within detected genera. We did not differentiate whether the genus was currently accepted or historic, so if a species had ever belonged to a genus it was included in the second stage of the search. The keyword search used case-insensitive fixed string matching (via the stringr v.1.4.0 package, ref. 60). While collation string matching would have helped detect species with differently coded ligatures or diacritic marks, ligatures and diacritic marks occur infrequently in scientific names and did not warrant the considerably increased computational cost.
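
A minimal sketch of the hierarchical genus-then-species search (assuming a keyword table with hypothetical columns genus and species_name holding full binomials) is:

    library(dplyr)
    library(stringr)

    search_page <- function(page_text, keywords) {
      # Stage 1: which genera are mentioned at all (space-prefixed, case-insensitive)?
      genera_hit <- keywords %>%
        distinct(genus) %>%
        filter(str_detect(page_text, fixed(paste0(" ", genus), ignore_case = TRUE)))

      # Stage 2: search only for species belonging to the detected genera
      keywords %>%
        semi_join(genera_hit, by = "genus") %>%
        filter(str_detect(page_text, fixed(species_name, ignore_case = TRUE)))
    }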

Via the keyword search we recorded all instances of genus matches, species matches, the website ID, and the page number. We also collected the words surrounding a genus match (three prior and four after) as a means of exploring common terms that may be used to describe the genera.

We provide the compiled outputs from searching contemporary and historic pages in Data S2–S4. Prior to combining these two datasets into a final list of traded species, and summarising the overall patterns, we removed instances of spurious genus and species detections. Predominantly these were short genus names that also occur as, or at the start of, common words (e.g., “rufus”, “Dia”, “Diana”, “Mala”, “Inca”, “Pero”, “May”, “Janus”, “Yukon”, “Lucia”, “Zora”, “Beata”, “Neon”, “Prima”, “Meta”, “Patri”, “Enna”, “Maso”, “Mica”, “Perro”; a filter already required genus names to be preceded by a space, so these matches were not parts of longer species names). We are confident these genera should be excluded, as none had species detected within them.

Trade database and third-party data

We downloaded the United States Fish and Wildlife Service’s LEMIS data compiled by refs. 79,80 from https://doi.org/10.5281/zenodo.3565869 (Data S5). We filtered the LEMIS data to records where the class was listed as Arachnida (Code S6).
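
A sketch of this filter (assuming a compiled data frame, lemis, read from Data S5, with a class column as in the Zenodo release) is:

    library(dplyr)

    lemis_arachnids <- lemis %>%
      filter(tolower(class) == "arachnida")   # keep only arachnid records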

We downloaded the gross imports data from the CITES trade database website and filtered it to class Arachnida, years 1975–2021 (ref. 81; accessed 2021-09-15; Data S6), and downloaded the CITES appendices filtered to arachnids (ref. 82; Data S7).

We downloaded the IUCN Red List assessments for arachnids from https://www.iucnredlist.org (ref. 83; accessed 2021-09-15; Data S8).

Species summary and visualisation

We compiled all sources of trade data (online, LEMIS, CITES) into a single dataset detailing which genera/species had been detected in each source (Data S9 and Code S7). We used two criteria to determine detection: whether there was an exact match with an accepted genus/species name, or whether there was a match to any historically used genus/species name. Because of splits in genera, the “ANY genera” matching is likely overly generous. For broad summaries we rely on the “ANY species” name matching.

We used the cowplot v.1.1.1 (ref. 84), ggplot2 v.3.3.5 (ref. 85), ggpubr v.0.4.0 (ref. 86), ggtext v.0.1.1 (ref. 87), scales v.1.1.1 (ref. 88), scico v.1.2.0 (ref. 89), and UpSetR v.1.4.0 (ref. 90) packages to generate summary visuals (Code S8; Code S9). We added additional details to the upset plot and modified the position of plot labels using Affinity Designer v.1.10.3 (ref. 91). We also used Affinity Designer to create the uropygid silhouette for Fig. 1. We obtained public-domain-licensed spider and scorpion silhouettes from http://phylopic.org/ (https://phylopic.org/image/d7a80fdc0-311f-4bc5-b4fc-1a45f4206d27/; http://phylopic.org/image/4133ae32-753e-49eb-bd31-50c67634aca1/).

Descriptions and colours

We explored the lag time between species descriptions and their detection in LEMIS or online trade (Code S10). We relied on the description dates provided alongside the lists of species names. Unlike the broader summaries, we restricted explorations of lag times to species detected only via exact matches (operating under the assumption that newly described species traded swiftly after description would be listed under the modern accepted name). We excluded species detected only in the complementary data, as their earliest trade date was not known; therefore, our summaries of lag time are based on species detected in a particular year via either LEMIS or temporal online trade.

Following a visual inspection of sites, where we often noticed listings that included either colours or localities (e.g., “Chilobrachys spp. ‘Electric Blue’ 0.1.3”, “Chilobrachys sp. ‘Kaeng Krachan’ 0.1.0”, “Chilobrachys spp. ‘Prachuap Khiri Khan’”; Data S9), we explored the words that surrounded detected genera. After using the forcats v.0.5.1 (ref. 92), stringr v.1.4.0 (ref. 60), and tidytext v.0.3.1 (ref. 93) packages to compile common terms and remove English stop words, we determined that colour was frequently mentioned (Code S11). To filter out non-colour words, we used Wikipedia’s list of colours (https://en.wikipedia.org/wiki/List_of_colors:_N%E2%80%93Z). Once cleaned, we further removed terms that are only ambiguously colour related (e.g., “space”, “racing”, “photo”, “boy”, “bean”, “blaze”, “jungle”, “mountain”, “dune”, “web”, “colour”, “rainforest”, “tree”, “sea”). We then summarised these data as counts of instances where a genus appeared alongside a given colour term (n.b., counts are therefore affected by any underlying imbalances in how many times a site mentioned a genus). We plotted all colours using the same hex codes listed on the Wikipedia page, with the exception of “cobalt”, “grey”, “metallic”, “slate”, “electric”, “dark”, “sheen”, and “chocolate”, which required manual linking to a hex code.
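
A hedged sketch of the colour-term tally (assuming a data frame context_words with one row per word surrounding a genus match, columns genus and word, and a colour_terms vector derived from the Wikipedia list) is:

    library(dplyr)
    library(tidytext)

    data("stop_words", package = "tidytext")      # English stop-word lexicons

    colour_counts <- context_words %>%
      mutate(word = tolower(word)) %>%
      anti_join(stop_words, by = "word") %>%      # remove English stop words
      filter(word %in% colour_terms) %>%          # keep only colour words
      count(genus, word, sort = TRUE)             # genus x colour-term counts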

Summary of trade numbers

We summarised the LEMIS data using a number of filters (Code S12). Following refs. 3,4,94, we limited our summaries to items that can feasibly be considered to represent whole individuals (LEMIS codes: dead animal (BOD), live eggs (EGL), dead specimen (DEA), live specimen (LIV), specimen (SPE), whole skin (SKI), and entire animal trophy (TRO)). We describe the portion of trade that was prevented (i.e., seized, where disposition == “S”). We classed non-commercial trade as anything listed as for biomedical research (M), scientific purposes (S), or reintroduction/introduction into the wild (Y). For captive vs. wild summaries, we treated all animals bred in captivity (C and F), commercially bred (D), and specimens originating from a ranching operation (R) as originating from captivity. We only included animals listed as specimens taken from the wild (W) in wild counts. The few instances that fell outside our defined captive vs. wild categorisation were treated as “other”. For summaries of wild capture per genus, we relied entirely on LEMIS’s listings of genera, making no effort to determine synonymisations, though we did filter out those listed only as “Non-CITES entry” or NA. We used the countrycode v.1.3.0 package (ref. 95) to help plot the LEMIS countries of origin. Taxonomy represents an ongoing challenge: we were limited to recognising the species listed in the aforementioned databases, generating synonym lists from these sources, and attempting to reconcile those lists. Rapid rates of species description mean that compiling comprehensive lists is challenging; species may be traded under junior synonyms or old names, and newer descriptions may not have been added to sites (ref. 96). We were also limited to platforms that advertised using text rather than images, as images can be challenging to identify accurately.
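
A sketch of these LEMIS filters (column names follow the published LEMIS schema — description, disposition, purpose, source — and lemis_arachnids is the arachnid subset described above) is:

    library(dplyr)

    whole_individuals <- c("BOD", "EGL", "DEA", "LIV", "SPE", "SKI", "TRO")

    lemis_summary <- lemis_arachnids %>%
      filter(description %in% whole_individuals) %>%    # items representing whole individuals
      mutate(
        seized     = disposition == "S",                # trade that was prevented
        commercial = !purpose %in% c("M", "S", "Y"),    # non-commercial purposes excluded
        origin     = case_when(
          source %in% c("C", "F", "D", "R") ~ "captive",
          source == "W"                     ~ "wild",
          TRUE                              ~ "other"
        )
      )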

Mapping

Mapping species is challenging due to the lack of standardised data on species distributions. Spider distributions were mapped based on the data in the World Spider Catalog (Data S12). Firstly, the localities associated with each species were collated into four spreadsheets based on the data provided in the WSC (ref. 18; https://wsc.nmbe.ch/dataresources; accessed 2021-09-18); these listed (1) country, (2) region, (3) “to” (where the range was given as one country to another), and (4) island.

Before processing, any “introduced” localities were removed, and the four sheets were checked for simple spelling errors (in the islands file) or mislistings (i.e., regions in the islands file). Country data were cross-referenced with the country names provided by Thematic Mapping to standardise them (https://thematicmapping.org/; Data S11). This was done by uploading the data into ArcMap and using joins and relates to connect them to the standard country-name file; any entries that could not be paired were corrected to ensure all could be successfully digitised.

Regions were digitised based on accepted names of different regions and included 33 different regions (see supplements); for each of these, the standard accepted area within the region was searched online to determine the accepted boundaries. These were then selected from the Thematic Mapping data, exported, and labelled with the corresponding region. Once this was completed for all 33 regions, they were merged and exported to a geodatabase. The spreadsheet listing the regional preferences of each species was also uploaded to ArcMap 10.3, exported into the geodatabase, and connected to the regional map using joins and relates, linking the regional preferences from the spreadsheet to the shapefiles. The new dbf was then exported to provide a listing of each species and each country in the region it was connected to, and then copied into the same csv as the corrected country listings.

For preferences listed as “to”, we first separated each country listed in the “to” listings into a separate column, then developed a list of species and each of the countries listed in the “to” list (frequently five to six). These were then corrected to the standard Thematic Mapping names for both countries and the regions used in the previous section. We then merged the countries and regions file and added geometry fields in ArcMap to provide a centroid for each designated area. This table was then exported, joined, and connected to the species in the “to” file. These data were then converted to point form as a point file, and a minimum convex polygon (convex hull) was developed for each species to connect the regions between all those listed. These species-specific minimum convex polygons were then intersected with the countries from Thematic Mapping, and a dissolve was used to form a shapefile listing just species and all the countries between those ranges. This was then exported and merged with the listings from countries and regions.
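
The authors performed this step in ArcMap; purely as an R illustration of the same idea, the sketch below uses the sf package, assuming centroids is an sf POINT layer of per-species area centroids (columns species and geometry) and countries is the Thematic Mapping world borders layer (with a NAME column):

    library(sf)
    library(dplyr)

    species_mcp <- centroids %>%
      group_by(species) %>%
      summarise(geometry = st_combine(geometry)) %>%  # one multipoint per species
      st_convex_hull()                                # minimum convex polygon

    species_countries <- st_intersection(species_mcp, countries) %>%
      st_drop_geometry() %>%
      distinct(species, NAME)                         # species-country pairs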

The islands file included both independent islands (which needed names corrected, or archipelago names given) and islands that fall within a national designation. For the latter, we replaced the island name with that of the country, as species listings for such islands may be particularly poor, and tiny non-independent islands are not visible in the global-scale analysis.

This fourth database table was then merged with the former three, and duplicate entries were removed, as species often had individual countries listed in addition to regions or “to” ranges. This was then uploaded into ArcMap, exported to a geodatabase file, connected to the original Thematic Mapping file, and exported to the geodatabase to yield 134,187 connections between species and countries. This was then connected to our main analysis to include the trade status, and the CITES and IUCN Red List status, of each species for further analysis.

Scorpion data were considerably messier than those in the World Spider Catalog. Firstly, we downloaded all scorpion data from iNaturalist and GBIF (refs. 97,98; search: scorpions), removed duplicates, then cross-referenced these with the Thematic Mapping file within Quantum GIS. Species listed in regions where they were clearly not native (e.g., a species listed in the UK when the rest of that species or genus occurred in Australia) were removed, and all extinct species were excluded.

In addition, all the “update files” were downloaded from the Scorpion Files, the PDFs were collated, and the tables were extracted into Excel format using Smallpdf tools and cleaned to include just the species and country listing. This was added to the countries listed for species within refs. 99 and 100, though this was restricted to a subset of species. The data were all collated into an Excel file with the species name and country listing. This was then added to all the data from https://scorpiones.pl/maps/. These maps have good coverage of species’ countries, but are apparently no longer being updated (Jan Ove Rein, pers. comm., 2021), hence the need for further data to provide complete, up-to-date, and comprehensive coverage for all species. Country names were then standardised based on the Thematic Mapping standards (Data S13 and Data S11). Species names were then cross-referenced to those listed in the Scorpion Files; any not matching were checked as synonyms and converted to the accepted name (though the only collated data for scorpion synonyms was on French-language Wikipedia, e.g., see https://fr.wikipedia.org/wiki/Bothriurus). Once all country and species names were corrected, this provided a listing of 4059 species-country associations. These were then associated with country files in the same way as for spiders. We plotted spider and scorpion species/genera, as well as LEMIS origins, using ggplot2 (ref. 85), combining Thematic Mapping world border data (https://thematicmapping.org/) with summaries of species/genera/trade levels. Species listed in a single country (and thus more likely to be country endemics) were also counted using summary statistics, so that species most vulnerable to trade could be noted separately.

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.

