Joint analysis of structured and semi-structured community science data improves precision of relative abundance but not trends in birds

Data acquisition and preparation

Structured datasets

We used structured North American Breeding Bird Survey (BBS) data, which is conducted annually over > 2500 routes across the United States and Canada^11,12 during the peak of the breeding season (May and June). BBS routes were approximately 40 km long with 50 stops spaced 0.8 km apart. At each stop a 3-min point count was conducted, where all species seen or heard were recorded¹². We downloaded the entire dataset, 1966–2019, to identify each observer’s first year and account for differences in survey experience. We created a binary variable for the observers’ first year, with 1 indicating the first year they provided data, and 0 indicating all subsequent years. We then subset the data to years 2010–2019 to align with available community science data. We zero-filled BBS data by adding zeros for each species on routes in which birds were not detected in each year.

Semi-structured dataset

We used the eBird Basic Dataset as a semi-structured dataset. We used checklists within the US and Canada during June and July from 2010 to 2019. Data were filtered to impose structure on the observation process and minimize effects of unequal spatial and temporal sampling using the auk package in program R^{24,25,56,59,60}. Data were filtered to only include complete checklists where observers recorded counts of all species detected to reduce effects of preferential species reporting⁶¹. We also filtered data based on observer effort to only include checklists < 5 h in duration, < 5 km in distance travelled, and checklists with < 10 observers²⁴.

Semi-structured community science data are generally sampled unevenly across space and time. Observers tend to collect data more frequently in accessible locations close to where they live or in areas with high species diversity²⁵. Therefore, populated areas and protected areas are data rich, while expansive rural, private lands are relatively data poor. Additionally, the number of checklists submitted to eBird has increased over time. To improve spatial and temporal representation within each 1-degree grid cell block and reduce the density of checklists within populated or protected areas, we randomly sampled Canada goose data to include one detection and one non-detection checklist for each cell of a 5-km hexagonal grid in each week of June and July because of their high frequency of detections in submitted checklists²⁴. The resulting Canada goose data set consisted of 18% detections and 82% non-detections. American woodcock, black-billed cuckoo, black-throated blue warbler, loggerhead shrike, northern bobwhite and upland sandpiper were detected in < 25% of checklists submitted; there was a high proportion of non-detections compared to detections causing class imbalance. For rare species and highly imbalanced datasets, inference on species distributions may be impeded by the volume of non-detections⁶². We corrected for this class imbalance by retaining all checklists with detections and under-sampling the majority class by spatiotemporally subsampling non-detection checklists only⁶² (i.e., retaining all checklists with detections but only one non-detection checklist for each cell of the 5-km hexagonal grid per week). Resulting datasets consisted of 1–8% detections and 92–99% non-detections.

Each eBird checklist also contained survey information to accompany the associated counts. We used survey start time, duration (in minutes), distance traveled (in km), survey type (stationary or traveling), number of observers, and a unique observer ID number for each observer to account for differences among survey data. We expected our study species to be most active and easily detected in early morning and evening. To capture activity, we converted survey start times to local solar time of day (in seconds) and calculated two continuous covariates with a 24-h period by (text{cos}left(frac{2pi left(text{seconds}right)}{86400}right)) and (text{sin}left(frac{2pi left(text{seconds}right)}{86400}right)), referred to as cos(time) and sin(time). Thus, cos(time) represented night (high values) and day (low values), and sin(time) represented the first half of the day (high values) and the second half of the day (low values)⁶³.

Reference grid

We overlayed a grid of 111-km × 111-km (approximately 1° latitude × 1° longitude near the equator) grid cells using Albers Equal Area projection across Canada and the US to estimate site-specific species relative abundance and trends. We assigned BBS routes and eBird observations to cells based on route starting points and checklist locations. We used the center points of grid cells to assign each to one of the Bird Conservation Regions (BCRs) from the North American Bird Conservation Initiative⁶⁴. We defined species ranges as all BCRs that contained grid cells where the species was observed in any dataset during our study period after removing observations of likely vagrants. We calculated the Euclidean distance of each grid cell centroid to the nearest edge of these defined ranges.

Model description

For each species, we developed two models to estimate the spatiotemporal variation of species relative abundance. The first model utilized BBS data only and the second model jointly analyzed BBS and eBird data. These models were developed under a hierarchical Bayesian framework, so they shared the same process model but had different observation or data models⁶⁵.

Process model

The process model described the trends of relative abundance by expressing relative abundance as a linear function of time (year). More specifically, we assumed that

$${mu }_{i,t}^{left[gamma right]}={alpha }_{i[k]}^{left[gamma right]}+{beta }_{i[k]}^{left[gamma right]}times {YEAR}_{t},$$

(1)

and

$$logleft({gamma }_{i,t}right) sim Nleft({mu }_{i,t}^{left[gamma right]}, {sigma }^{left[gamma right]}right),$$

(2)

in which ({gamma }_{i,t}) was year-specific relative abundance for grid cell i and year t, ({mu }_{i,t}^{left[gamma right]}) was the mean of (logleft({gamma }_{i,t}right)), ({alpha }_{i[k]}^{left[gamma right]}) and ({beta }_{i[k]}^{left[gamma right]}) were grid cell-specific intercept and slope, respectively, for BCR k. The intercept and slope parameters, ({alpha }_{i[k]}^{left[gamma right]}) and ({beta }_{i[k]}^{left[gamma right]}), were random deviations from a mean intercept and slope for BCR k. For example, we had for the slope parameter (({beta }_{i[k]}^{left[gamma right]}))

$${beta }_{i[k]}^{left[gamma right]}sim text{N}left({mu }_{left[kright]}^{left[beta right]}, {sigma }^{left[beta right]}right), {mu }_{left[kright]}^{left[beta right]}sim text{N}left({{mu }^{[beta ]},uptau }^{left[beta right]}right),$$

(3)

in which ({mu }_{left[kright]}^{left[beta right]}) is the BCR-specific mean of the slope parameters (({beta }_{i[k]}^{left[gamma right]})) for BCR k and ({mu }^{[beta ]}) was the grand mean of ({beta }_{i[k]}^{left[gamma right]}) for all i and k. The priors for the mean parameters were assumed to be normally distributed with mean 0 and standard deviation 10 while the priors for the variance parameters were assumed to come from an inverse gamma distribution with shape and rate parameters 0.01. In this way we allowed the intercept and slope parameters to vary across any grid cell, but also assumed that grid cells within the same BCR have more similar values than grid cells in different BCRs⁶⁶. This allowed us to additionally estimate relative abundance and relative abundance trends in grid cells without any count data based on these relationships among grid cells in the same BCR. We used the same model for the grid cell-specific intercepts (({alpha }_{i[k]}^{left[gamma right]})) with an intercept grand mean ({mu }^{[alpha ]}), BCR-specific mean ({mu }_{left[kright]}^{left[alpha right]}), and variance parameters ({sigma }^{left[alpha right]}) and ({uptau }^{left[alpha right]}), assuming the same priors.

Note that we centered ({YEAR}_{t}) so that the middle year (i.e., between 2014 and 2015 for our 2010–2019 study period) was 0. Thus (expleft({alpha }_{i[k]}^{left[gamma right]}right)) was the mean relative abundance for grid cell i across the study period, and ({beta }_{i[k]}^{left[gamma right]}) the grid cell-specific trend of log relative abundance during the study period. We used Eq. (2) to add additional Gaussian noise to our mean log relative abundance estimates, ({mu }_{i,t}^{left[gamma right]}), to account for overdispersion in year-specific relative abundance, ({gamma }_{i,t}), in grid cell i in year t. We assumed the variance parameter ({sigma }^{[gamma ]}) came from an inverse gamma distribution with shape and rate parameters 0.01.

Observation model (BBS model)

We hierarchically linked year-specific relative abundance ((gamma)), and by extension, mean relative abundance ((alpha)) and relative abundance trends ((beta)), from the process model to BBS count data (({y}^{[BBS]})). Because BBS data contained a larger proportion of non-detections (0 counts) than predicted with a Poisson distribution (as determined by posterior predictive checks), we assumed that data were generated from a zero-inflated Poisson process such that

$${y}_{j,s,i,t}^{[BBS]}sim Poissonleft[{uplambda }_{j,s,i,t}^{left[BBSright]}times left(1-{z}_{j,s,i,t}^{[BBS]}right)right],$$

(4)

in which ({y}_{j,s,i,t}^{[BBS]}) was the BBS count for route j (sum of counts across all 50 stops) surveyed by observer s in grid cell i and year t, ({uplambda }_{j,s,i,t}^{left[BBSright]}) was the expectation of count data ({y}_{j,s,i,t}^{[BBS]}), and ({z}_{j,s,i,t}^{[BBS]}) was a Bernoulli random variable with probability ({omega }^{[BBS]}), where ({omega }^{[BBS]} sim U(0, 1)).

To achieve grid cell-level inference, we offset the grid cell relative abundance by the effective area surveyed by BBS routes to scale the grid cell relative abundance to BBS route-level relative abundance. More specifically, we assumed

$$text{log}left({uplambda }_{j,s,i,t}^{[BBS]}right)={text{log}}left({gamma }_{i,t}times (25/text{12,321})right)+{varepsilon }_{j,s}^{[BBS]}+{eta }^{[BBS]}times {I}_{j,s,t}^{[BBS]},$$

(5)

where (25/text{12,321}) was the effective area surveyed by a BBS route (~ 25 km²; 50 stops each with a 400 m survey radius¹²) divided by the area of a single grid cell (12,321 km²). To account for detection error related to observer experience, we included ({varepsilon }_{j,s}^{[BBS]}) as a route- and observer-specific error term, and ({eta }^{[BBS]}) as an additional term of observation error if route j was surveyed by a first-time observer s^12,67 (indicated by ({I}_{j,s,t}^{[BBS]})). We assumed ({varepsilon }_{j,s}^{[BBS]}) was normally distributed with mean 0 and variance ({sigma }^{[BBS]}), where ({sigma }^{[BBS]}) came from an inverse gamma distribution with shape and rate parameters 0.01. We assumed ({eta }^{[BBS]}) was normally distributed with a mean 0 and standard deviation 10.

Observation model (joint model)

In the joint model, we again linked year-specific relative abundance ((gamma)) from the process model to BBS count data using Eqs. (4) and (5). Similarly, we hierarchically linked year-specific relative abundance to eBird count data (({y}^{[eBird]})). Thus, both datasets informed year-specific relative abundance ((gamma)), mean relative abundance ((alpha)) and relative abundance trend ((beta)) estimates. We again used a zero-inflated Poisson distribution such that

$${y}_{j,s,i,t}^{[eBird]}sim Poissonleft[{uplambda }_{j,s,i,t}^{[eBird]}times left(1-{z}_{j,s,i,t}^{[eBird]}right)right],$$

(6)

in which ({y}_{j,s,i,t}^{[eBird]}) was the eBird count for checklist j surveyed by observer s in grid cell i and year t, ({uplambda }_{j,s,i,t}^{[eBird]}) was the expected count, and ({z}_{j,s,i,t}^{[eBird]}) was a Bernoulli random variable with probability ({omega }^{[eBird]}), where ({omega }^{[eBird]} sim U(0, 1)).

For eBird data, we again considered the variation among checklists. Instead of observer experience, we considered additional survey effort variables that potentially influenced the distribution of counts. We assumed

$$begin{aligned}text{log}left({uplambda }_{j,s,i,t}^{[eBird]}right)& =logleft({gamma }_{i,t}right)+{beta }_{0}^{[eBird]}+{beta }_{1}^{[eBird]}times {TYPE}_{j,s,i,t}+{beta }_{2}^{[eBird]}times {DIST}_{j,s,i,t}+ {beta }_{3}^{[eBird]}times {COSTIME}_{j,s,i,t}& quad+, {{beta }_{4}^{[eBird]}times {SINTIME}_{j,s,i,t}+beta }_{5}^{[eBird]}times {DURA}_{j,s,i,t}+{ beta }_{6}^{[eBird]}times {NOOB}_{j,s,i,t}+{varepsilon }_{s}^{[eBird]},end{aligned}$$

(7)

in which ({TYPE}_{j,s,i,t}) was survey type (i.e., 1 for point count and 0 for transect), ({DIST}_{j,s,i,t}) was survey distance, ({COSTIME}_{j,s,i,t}) and ({SINTIME}_{j,s,i,t}) were cos(time) and sin(time), respectively, of start times of the survey, ({DURA}_{j,s,i,t}) was the survey duration, and ({NOOB}_{j,s,i,t}) was the number of observers. We included ({varepsilon }_{s}^{[eBird]}) as an observer-specific error term to account for correlation among counts from the same observer. We were unable to calculate the effective area surveyed for individual eBird checklists. Instead, we included ({beta }_{0}^{[eBird]}), an additional scaling intercept to relate eBird relative abundance estimates from the checklist level to the grid cell level. We assumed ({beta }_{0-6}^{[eBird]}) were normally distributed with mean 0 and standard deviation 10. We assumed ({varepsilon }_{s}^{[eBird]}) was normally distributed with mean 0 and variance ({sigma }^{[eBird]}), where ({sigma }^{[eBird]}) came from an inverse gamma distribution with shape and rate parameters 0.01.

Additional considerations to improve model convergence

We further altered American woodcock and Canada goose joint models to improve model convergence. The BBS and eBird datasets contained fewer detections of American woodcocks than in other species; thus, we predicted that model convergence issues in the American woodcock models were due to lack of sufficient data across BBS and eBird datasets. To increase our data sample sizes, we used data from the American woodcock Singing Ground Survey (SGS) as a third data source for American woodcock models (see Supplementary Information). The American woodcock SGS is specifically designed for detecting this cryptic species, so detectability is likely higher during this survey compared to surveys not specifically designed for American woodcocks. We developed another model to jointly analyze BBS, eBird, and SGS datasets (see Supplementary Information). This model used the same process model, BBS observation model, and eBird observation model described in Eqs. (1)–(7).

Canada geese frequently form large flocks (often 100 s or 1000 s of individuals), a unique ecological trait among our study species. Thus, both BBS and eBird Canada goose data contained uniquely high counts among our data (max counts of 2900 and 3000 Canada geese in BBS and eBird data, respectively). We predicted that convergence issues in the Canada goose joint model were due to overdispersion resulting from uniquely high counts, especially in the eBird data where these counts were more frequent than in the BBS data. To account for this overdispersion, we added additional Gaussian noise to ({uplambda }_{j,s,i,t}^{[eBird]}) in Eq. (7) (see Supplementary Information).

Model implementation

We used Markov Chain Monte Carlo and a Gibbs sampler in JAGS⁶⁸ using the jagsUI package⁶⁹ in program R⁶¹. We ran at minimum 9000 iterations, discarded at minimum the first 5000 samples as burn-in, and retained at least 4000 samples with no thinning from 3 chains (exact number of iterations needed to achieve convergence varied among models). We evaluated convergence using the Gelman–Rubin statistic⁷⁰ ((widehat{R}) < 1.2) and visual assessments of traceplots. We used a modified CV calculation to assess improvements in precision for (alpha), (gamma), and (beta) (i.e., mean relative abundance, year-specific relative abundance, and trend estimates, respectively) parameters between BBS and joint models. For (alpha) and (gamma) parameters,

$$begin{array}{l}{CV}^{left[alpha right]}=left({widehat{alpha }}_{left(0.975right)}-{widehat{alpha }}_{left(0.025right)}right)/left|{widehat{alpha }}_{left(0.5right)}right| {CV}^{[gamma ]}=left({widehat{gamma }}_{(0.975)}-{widehat{gamma }}_{(0.025)}right)/left|{widehat{gamma }}_{(0.5)}right|end{array},$$

(8)

and for (beta)

$${CV}^{[beta ]}=left({widehat{beta }}_{(0.975)}-{widehat{beta }}_{(0.025)}right),$$

(9)

where ({widehat{alpha }}_{(p)}), ({widehat{gamma }}_{(p)}), and ({widehat{beta }}_{(p)}) are the pth quantiles of the posterior distribution for (alpha), (gamma), and (beta), respectively. Smaller CVs indicated more precise estimates of comparable parameters. We calculated CVs of each parameter for each grid cell. We subtracted the joint model CVs for each parameter for each grid cell from corresponding BBS-only model CVs to evaluate the magnitude of improvement in precision for each grid cell and evaluated overall improvements for each parameter by calculating the proportion of grid cells for which joint model CVs < corresponding BBS-only model CVs.