A long-term dataset on wild bee abundance in Mid-Atlantic United States
Identifier variables
To facilitate aggregation and analysis of the BIML data, we added ‘site’, ‘site-year’, ‘sampling event’, and ‘transect’ identifier variables. We defined ‘sites’ as unique combinations of latitude and longitude, and ‘site-years’ as unique combinations of site and year of sampling. Within site-years, we defined ‘sampling events’ according to the date of sampling and ‘transects’ as unique combinations of sampling event and text field notes. For some specimens, field notes included a transect ID, indicating that the BIML used multiple sets of pan traps at the same site. In other cases, field notes recorded differing sampling methods, or different information on the number of missing traps (traps that were cracked, tipped over, or otherwise compromised). If field notes recorded different methods or number of lost traps, we assumed that the BIML deployed multiple sets of traps (transects). We reviewed the field notes for all sampling events with multiple transects and reassigned these occurrences to a single transect if there was no evidence of multiple transects in the field notes.
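The nesting of these identifiers can be illustrated with a minimal Python sketch. This is not the authors' R code: the record fields (`lat`, `lon`, `date`, `notes`) and the example values are hypothetical, and the sketch assumes field notes have already been reviewed so that differing note text genuinely indicates separate transects.

```python
from datetime import date

# Hypothetical occurrence records; field names are illustrative, not the BIML schema.
occurrences = [
    {"lat": 39.03, "lon": -76.80, "date": date(2010, 5, 1), "notes": "transect A"},
    {"lat": 39.03, "lon": -76.80, "date": date(2010, 5, 1), "notes": "transect B"},
    {"lat": 39.03, "lon": -76.80, "date": date(2011, 6, 2), "notes": ""},
    {"lat": 38.50, "lon": -75.90, "date": date(2010, 5, 1), "notes": ""},
]


def add_identifiers(records):
    """Attach nested identifiers, each built from a unique combination of keys."""
    labelled = []
    for rec in records:
        site = (rec["lat"], rec["lon"])                # unique latitude/longitude
        site_year = site + (rec["date"].year,)         # site x year of sampling
        sampling_event = site_year + (rec["date"],)    # site-year x sampling date
        transect = sampling_event + (rec["notes"],)    # sampling event x field notes
        labelled.append({**rec, "site": site, "site_year": site_year,
                         "sampling_event": sampling_event, "transect": transect})
    return labelled


labelled = add_identifiers(occurrences)
n_sites = len({r["site"] for r in labelled})
n_site_years = len({r["site_year"] for r in labelled})
n_events = len({r["sampling_event"] for r in labelled})
n_transects = len({r["transect"] for r in labelled})
print(n_sites, n_site_years, n_events, n_transects)
```

Because each identifier extends the one above it, every transect maps to exactly one sampling event, site-year, and site.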
Locality and taxonomic identification
Next, we reviewed and excluded occurrences lacking critical date and locality information. We removed all occurrences lacking sampling date or latitude and longitude of sampling location and occurrences with duplicated specimen identifiers. We filtered occurrences to a limited geographic area (Maryland, Delaware, and Washington DC; Fig. 1) that represents the densest region of BIML sampling (39.6% of dataset). This filtering removed wild-bee communities collected in desert or tropical biomes, which are likely governed by very different floral resource and climate dynamics19,20, and within the Mid-Atlantic USA, limited sampling locations to a region with a consistent dominant forest type21. Bee occurrences in 1999 and 2001 represented fewer than three sites per year, so we removed these years, retaining sites sampled from 2002 to 2016.
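The order of these filters can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' R code: the record fields and region labels are hypothetical, every record sharing a duplicated specimen identifier is dropped (the text does not say whether one copy was kept), and the year filter is written as a general rule (fewer than three sites) rather than a hard-coded list of years.

```python
from collections import Counter, defaultdict
from datetime import date

# Hypothetical region labels; the real dataset may encode states differently.
REGION = {"Maryland", "Delaware", "District of Columbia"}


def rec(specimen_id, sampled, lat, lon, state):
    """Build one hypothetical occurrence record."""
    return {"id": specimen_id, "date": sampled, "lat": lat, "lon": lon, "state": state}


def clean(records, min_sites_per_year=3):
    # 1. Drop records lacking a sampling date or coordinates.
    complete = [r for r in records
                if r["date"] is not None and r["lat"] is not None and r["lon"] is not None]
    # 2. Drop every record whose specimen identifier is duplicated (assumption).
    id_counts = Counter(r["id"] for r in complete)
    unique = [r for r in complete if id_counts[r["id"]] == 1]
    # 3. Keep only the focal region.
    regional = [r for r in unique if r["state"] in REGION]
    # 4. Drop years represented by fewer than `min_sites_per_year` sites.
    sites_by_year = defaultdict(set)
    for r in regional:
        sites_by_year[r["date"].year].add((r["lat"], r["lon"]))
    return [r for r in regional
            if len(sites_by_year[r["date"].year]) >= min_sites_per_year]


records = [
    rec("a1", date(2005, 5, 1), 39.0, -76.8, "Maryland"),
    rec("a2", date(2005, 5, 1), 39.1, -76.7, "Maryland"),
    rec("a3", date(2005, 5, 2), 38.9, -75.5, "Delaware"),
    rec("a4", date(1999, 5, 2), 39.0, -76.8, "Maryland"),   # year with a single site
    rec("a5", None, 39.0, -76.8, "Maryland"),               # missing sampling date
    rec("a6", date(2005, 5, 1), 33.4, -112.0, "Arizona"),   # outside the focal region
    rec("a7", date(2005, 5, 1), 39.0, -76.8, "Maryland"),   # duplicated identifier
    rec("a7", date(2005, 5, 1), 39.0, -76.8, "Maryland"),
]
kept_ids = [r["id"] for r in clean(records)]
print(kept_ids)
```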
Fig. 1
Abundance per day per trap of wild bees at locations surveyed between 2002 and 2016 by the United States Geological Survey Native Bee Inventory and Monitoring Lab (USGS BIML).
We also filtered data to our taxa of interest. We removed non-bee occurrences (species outside superfamily Apoidea, clade Anthophila) and records lacking species-level identity, discarding occurrences identified only to family or genus (Online-only Table 1). Almost all non-bee occurrences we removed were wasps in the Vespidae, Crabronidae, and Sphecidae families, which are primarily predators, rather than pollen-collectors like most wild bees. For the transect-level dataset22 (see ‘Data Records’ below), we calculated the abundance of Apis mellifera L., then removed A. mellifera from the dataset before calculating total bee abundance per transect, because A. mellifera specimens likely originated from managed colonies and are not considered wild bees.
We verified species names by cross-referencing all species binomials with the Discover Life database23. We corrected genus and species names that were clear spelling errors (Online-only Table 1) and consulted the original data source (S. Droege) for remaining species binomials that did not exist on Discover Life. We also referenced Discover Life occurrence maps to confirm that all species in the BIML dataset occur in the Mid-Atlantic US. After these data cleaning steps, we removed six occurrences of the remaining five unknown or out-of-region species (Online-only Table 1). Some species in the BIML data were identified both individually and as part of a species set. To avoid double counting these species, we created a new variable with cleaned, mutually exclusive species names (termed ‘grouped_name’). In ‘grouped_name’, we combined singular species names with their associated species sets (Online-only Table 1). For example, we reclassified occurrences identified as Halictus ligatus/poeyi, Halictus ligatus, or Halictus poeyi to Halictus ligatus/poeyi to avoid inflating future species richness estimates when occurrences might be the same species. In the final occurrence datasets, we included the cleaned, singular names (‘name’) and cleaned, grouped names (‘grouped_name’), so future analysts can select the appropriate taxonomic aggregation for their research objectives. Voucher specimens for most species in the BIML dataset are housed in the Smithsonian collection, but some are not yet permanently archived. We suggest interested parties contact Sam Droege (current email: sdroege@usgs.gov) to access voucher specimens. We also included, to the best of our knowledge, current affiliations for individuals who identified BIML specimens (Supplementary Information, Table S1), and standardized names of identifiers (‘identifiedBy’) in the final datasets.
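The effect of the grouped names on richness estimates can be shown with a short Python sketch. It is not the authors' R code. The lookup table contains only the Halictus example given above; the remaining species sets are listed in Online-only Table 1 and are not reproduced here. The `Bombus impatiens` record is included purely as an illustrative ungrouped species.

```python
# Hypothetical lookup from singular names to their species set (one example only).
SPECIES_SETS = {
    "Halictus ligatus": "Halictus ligatus/poeyi",
    "Halictus poeyi": "Halictus ligatus/poeyi",
}


def grouped_name(name):
    """Return the species set for a name, or the name itself if it has no set."""
    return SPECIES_SETS.get(name, name)


names = [
    "Halictus ligatus",
    "Halictus poeyi",
    "Halictus ligatus/poeyi",
    "Bombus impatiens",
]
grouped = [grouped_name(n) for n in names]

richness_singular = len(set(names))    # counts the set and its members separately
richness_grouped = len(set(grouped))   # mutually exclusive categories
print(richness_singular, richness_grouped)
```

In this toy example the singular names suggest four taxa, whereas the grouped names yield two, which is the inflation the grouped variable is meant to prevent.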
Sampling method and effort
To describe sampling method and effort, we used regular expressions to extract these data from field notes. We sought to compare bee communities sampled with a standard methodology, so we discarded bee occurrences collected with vane traps or nets, only retaining occurrences sampled with pan traps (i.e., bee bowls). Using the stringr package in R24,25, we searched the text of field notes to document trap volume, trap color, total number of traps, and the number of traps missing or disturbed. The most common BIML pan-trapping method involved setting out traps of multiple colors and combining the bees in all traps into one sample. Consequently, BIML recorded trap color in field notes as the number of traps of each color used for a specific sampling event. We designed regular expressions to extract the number of traps for the eight most common colors (white, blue, yellow, pale blue, fluorescent yellow, fluorescent blue, and fluorescent pale blue). For some occurrences, our regular expressions yielded no sampling information, so we manually reviewed these field notes and recorded any data missed by the automated search.
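A minimal sketch of this kind of extraction is given below in Python (`re`), not the stringr code used by the authors. The wording of BIML field notes is not shown in this section, so the note format ("5 white, ... 2 bowls missing"), the patterns, and the function name `parse_note` are all assumptions made for illustration. The colour list uses the seven colours named above.

```python
import re

# Colours named in the text, ordered longest first so that, for example,
# "pale blue" is tried before "blue".
COLORS = [
    "fluorescent pale blue", "fluorescent yellow", "fluorescent blue",
    "pale blue", "white", "blue", "yellow",
]
# Hypothetical note format: a count followed by a colour, e.g. "5 pale blue".
COLOR_RE = re.compile(r"(\d+)\s+(" + "|".join(COLORS) + r")\b", re.IGNORECASE)
# Hypothetical note format: a count followed by a word for lost traps.
MISSING_RE = re.compile(
    r"(\d+)\s+(?:traps?\s+|bowls?\s+)?(?:missing|lost|disturbed)", re.IGNORECASE
)


def parse_note(note):
    """Extract trap counts per colour and the number of missing traps from one note."""
    by_color = {}
    for count, color in COLOR_RE.findall(note):
        key = color.lower()
        by_color[key] = by_color.get(key, 0) + int(count)
    missing = MISSING_RE.search(note)
    return {
        "by_color": by_color,
        "total": sum(by_color.values()),
        "missing": int(missing.group(1)) if missing else None,
    }


parsed = parse_note("5 white, 5 fluorescent yellow, 5 pale blue bowls; 2 bowls missing")
unparsed = parse_note("pan traps set along meadow edge")
print(parsed)
print(unparsed)
```

Notes that return no colour counts, such as the second example, correspond to the cases that would need manual review.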
Next, we simplified the trap color and volume classification to facilitate future statistical analyses. To reduce the number of trap volume categories, we rounded trap volume to the nearest 0.5 ounces, and removed trap volumes greater than 40 ounces, assuming these were errors in data entry or extraction. When the trap color or volume used at a specific site changed within a year, we manually reviewed the field notes and corrected color or volume classifications when necessary. After correcting these discrepancies, we found the BIML very rarely changed sampling methods within a year, so we filled in most missing trap color or volume information by assuming a constant sampling method for all transects within a site-year. Finally, we combined rarely used color and volume categories (fewer than 1% of transects) into an ‘Other’ category. In the archived datasets with sampling information, we included original and simplified variables for trap color and volume.
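The rounding and lumping rules can be sketched in Python as follows (again, not the authors' R code). Function names and example values are hypothetical. Two details are assumptions not stated in the text: exact ties are rounded up, and the 1% threshold is applied to each category's share of all transects.

```python
import math
from collections import Counter


def simplify_volume(volume_oz, max_oz=40):
    """Round a trap volume to the nearest 0.5 oz; treat volumes above max_oz as errors."""
    if volume_oz > max_oz:
        return None  # assumed data-entry or extraction error
    # Round half up (assumption); Python's round() would send exact ties to even.
    return math.floor(volume_oz * 2 + 0.5) / 2


def lump_rare(values, min_share=0.01):
    """Relabel categories used in fewer than min_share of transects as 'Other'."""
    counts = Counter(values)
    total = len(values)
    return [v if counts[v] / total >= min_share else "Other" for v in values]


volumes = [simplify_volume(v) for v in (3.3, 12.1, 3.25, 128.0)]
print(volumes)

# 200 hypothetical transects: two common volumes and one used only once.
transect_volumes = ["3.5"] * 150 + ["12.0"] * 49 + ["2.0"]
lumped = lump_rare(transect_volumes)
print(Counter(lumped))
```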
Lastly, we summarized sampling effort and calculated effort-adjusted abundance of wild bees. We calculated the total number of traps for each sampling event, and, when available, we also described the number of traps missing or disturbed. If there was no documentation of missing or disturbed traps, we assumed all traps were recovered successfully. When the total number of traps differed between the field notes and the ‘number of traps’ column, we selected the lower value. We defined the final number of traps in each transect as the original number minus the number missing or disturbed. We calculated the duration of sampling as the difference between the date traps were collected and the date traps were set. If traps were set and collected on the same day, we set the duration of sampling to one day. For each occurrence in the BIML dataset, we converted bee abundance to abundance day⁻¹ trap⁻¹. We conducted all data manipulation and aggregation with the R statistical and computing language (version 3.6.0)25,26.
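The effort adjustment described above amounts to dividing each count by the product of sampling days and recovered traps. A minimal Python sketch follows. It is not the authors' R code, and the function and argument names are hypothetical. It applies the stated rules: take the lower of the two trap totals, subtract missing traps (zero if undocumented), and use a minimum duration of one day.

```python
from datetime import date


def sampling_days(set_date, collected_date):
    """Days between setting and collecting traps; same-day sampling counts as one day."""
    return max((collected_date - set_date).days, 1)


def final_traps(notes_total, column_total, missing=None):
    """Lower of the two recorded totals, minus missing traps (0 if undocumented)."""
    totals = [t for t in (notes_total, column_total) if t is not None]
    if not totals:
        raise ValueError("no trap total recorded")
    return min(totals) - (missing or 0)


def abundance_per_day_per_trap(count, set_date, collected_date,
                               notes_total, column_total, missing=None):
    traps = final_traps(notes_total, column_total, missing)
    if traps <= 0:
        raise ValueError("no traps recovered")
    return count / (sampling_days(set_date, collected_date) * traps)


# 12 bees, traps out for one day, totals of 15 (notes) and 16 (column), 3 missing.
overnight = abundance_per_day_per_trap(
    12, date(2010, 5, 1), date(2010, 5, 2), notes_total=15, column_total=16, missing=3
)
# 6 bees, traps set and collected on the same day, 10 traps, none reported missing.
same_day = abundance_per_day_per_trap(
    6, date(2010, 5, 1), date(2010, 5, 1), notes_total=None, column_total=10
)
print(overnight, same_day)
```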