Population genetic data collection from primary data sources
Figure 4 describes the overall data collection workflow for the four datasets that comprise CaliPopGen. We first identified literature potentially containing population genetic data for California by querying the Web of Science Core Collection (https://webofknowledge.com/) for relevant literature from 1900 to 2020 with the terms: topic = (California*) AND topic = (genetic* OR genomic*) AND topic = (species OR taxa* OR population*). We included only empirical peer-reviewed literature and excluded unreviewed preprints. In using these search terms, our goal was to broadly identify genetic papers focused on California with population or species-level analyses, while avoiding purely phylogenetic studies or those focused on agricultural or model species. This resulted in 4,942 unique records.
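The Boolean structure of the search string can be illustrated offline. The following is a minimal sketch, not the Web of Science interface: it assumes `*` is a right-hand wildcard, that the three topic groups are combined with AND, and that terms within a group are combined with OR. The names `TERM_GROUPS`, `term_to_regex`, and `matches_query` are hypothetical and introduced only for illustration.

```python
import re

# The three AND-ed topic groups from the search string; '*' is a right-hand wildcard.
TERM_GROUPS = [
    ["California*"],
    ["genetic*", "genomic*"],
    ["species", "taxa*", "population*"],
]


def term_to_regex(term):
    """Turn one search term into a whole-word, case-insensitive pattern."""
    stem = re.escape(term.rstrip("*"))
    suffix = r"\w*" if term.endswith("*") else ""
    return re.compile(rf"\b{stem}{suffix}\b", re.IGNORECASE)


def matches_query(text):
    """True if every group has at least one term present in the text."""
    return all(
        any(term_to_regex(term).search(text) for term in group)
        for group in TERM_GROUPS
    )
```

Applied to a title or abstract, `matches_query` returns `True` only when all three groups are represented, which mirrors how the AND/OR structure narrows the record set before any manual screening.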
We next screened titles and abstracts to retain articles that: (1) provided data on populations of species that are self-sustaining without anthropogenic involvement; (2) included at least some eukaryote species; (3) included population(s) sampled within California; (4) mentioned measures of genetic diversity or differentiation; and (5) were not reviews (thus restricting our search to primary literature). We retained 1,869 studies after this first pass of literature screening (see Technical Validation for estimates of inter- and intra-screener bias).
Our second, more in-depth screening pass involved reading the full text of these 1,869 studies. We had two goals. First, we confirmed that retained papers fully met all five of our inclusion criteria (the first screen was very liberal with respect to these criteria, and many papers failed to meet at least one criterion after close reading). Second, we eliminated papers where the data were not presented in a way that allowed us to extract population-level information. For example, many of the more systematics-focused studies pooled samples from large, somewhat ill-defined regions (“Sierra Nevada” or “Southern California”); if such regions were larger than 50 km in a linear dimension, we deemed them unusable for making geographically informative inferences. Other studies presented summaries of population data, often in the form of phylogenetic networks or trees, but did not include information on actual population genetic parameters and therefore were not relevant to our database. We retained 528 publications after this second pass.
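The 50 km rule for pooled regions can be expressed as a simple distance check. The sketch below is one possible reading, not the procedure used during screening: it assumes a region is summarized by a handful of bounding points given as (latitude, longitude) in decimal degrees, and it takes the largest great-circle distance between any two of them as the "linear dimension". The function names and the haversine approach are assumptions introduced for illustration.

```python
import math
from itertools import combinations

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius
MAX_EXTENT_KM = 50.0         # threshold stated in the text


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def max_linear_extent_km(points):
    """Largest pairwise distance among (lat, lon) points; 0.0 for fewer than two points."""
    return max(
        (haversine_km(*p, *q) for p, q in combinations(points, 2)),
        default=0.0,
    )


def is_usable(points):
    """A pooled region is kept only if it is no larger than 50 km across."""
    return max_linear_extent_km(points) <= MAX_EXTENT_KM
```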
From this set of papers, we extracted species, locality, and genetic data for each California population or sampling locality described in each study (Fig. 3A). This included Latin binomial/trinomial, English common name, population identifiers, and geographic coordinates of sampling sites. We also noted population/sampling localities that were interpreted as composed of interspecific hybrids, and listed both parental species. We collected population genetic diversity and differentiation statistics for each unique genetic marker for each population/sampling locality; as a result, a sampling locality may have multiple entry rows, one for each locus or marker type. Parameters extracted for each population/marker combination include sample size, genetic marker type, gene targets, number of loci, years of sampling, and reported values for effective population size (Ne), expected (HE) and observed (HO) heterozygosity, nucleotide diversity (π, pi), alleles-per-locus (APL), allelic richness (AR), percent polymorphic loci (PPL), haplotype diversity (HDIV), inbreeding coefficients (e.g., FIS, FIT, GIS), and pairwise population genetic comparison parameters (FST, GST, DST, Nei’s D, Jost’s D, or phi). We note that while there are technical differences between allelic richness and alleles-per-locus, source literature often used the terms interchangeably, and we include the parameters and their values as named in the source. We define marker type as the general category of genetic marker used (e.g., “microsatellite” or “nuclear”), while gene targets are the specific locus/loci (e.g., “COI”). We present these data in two separate datasets, one containing all population-level genetic summary statistics (Dataset 1 [21], see Fig. 3C and detailed description in Table 1) and a second for estimates of pairwise genetic differentiation (Dataset 2 [21], see Fig. 3D and detailed description in Table 2).
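The one-row-per-population/marker layout can be sketched as a small record type. This is an illustration only: the field names below are hypothetical and are not the column names of the published datasets (those are described in Table 1), and the two example rows use invented placeholder values, not values from any study.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class DiversityRow:
    """One entry: a single population/sampling locality for a single marker."""
    species: str
    population_id: str
    marker_type: str                    # general category, e.g. "microsatellite"
    gene_targets: Optional[str] = None  # specific locus/loci, e.g. "COI"
    sample_size: Optional[int] = None
    he: Optional[float] = None          # expected heterozygosity
    ho: Optional[float] = None          # observed heterozygosity
    pi: Optional[float] = None          # nucleotide diversity


def rows_per_population(rows):
    """Count how many marker-specific rows each population contributes."""
    return Counter((row.species, row.population_id) for row in rows)


# Placeholder rows (invented values): the same locality appears twice,
# once per marker type, as described in the text.
rows = [
    DiversityRow("Genus species", "pop_A", "microsatellite",
                 sample_size=20, he=0.61, ho=0.58),
    DiversityRow("Genus species", "pop_A", "mitochondrial",
                 gene_targets="COI", sample_size=18, pi=0.004),
]
```

Keeping statistics that a study did not report as `None` rather than zero preserves the distinction between "not measured" and "measured as zero" when rows from different marker types are combined.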
All genetic data were extracted directly from the source literature. However, we also updated or added to the metadata for these population genetic values in several ways. We included kingdom, phylum, and a lower-level taxonomic grouping for each species (usually class), and updated scientific and common names based on the currently accepted taxonomy of the Global Biodiversity Information Facility [22]. When geographic coordinates were not provided for a sampling locality, as was frequently the case in the older literature, we used Google Maps (https://www.google.com/maps) to georeference localities based on either in-text descriptions or embedded figure maps, guided by permanent landmarks such as a bend in a river or administrative boundaries. Because this can only yield approximate coordinates, we recorded estimated accuracy as the radius of our best estimate of possible error in kilometers. If coordinates were provided in degrees/minutes/seconds, we used Google Maps to translate them to decimal degrees. In cases where coordinates were not provided and locality descriptions were too vague to determine coordinates with less than 50 km estimated coordinate error, we did not attempt to extract coordinates but still provide the genetic data. All coordinates are provided in the web Mercator projection (EPSG:3857). We excluded studies that reported genetic parameter values only for samples aggregated regionally (“Southern California” or “Sierra Nevada”). If marker type was not explicitly included, we classified marker type based on the gene targets reported, if provided.
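The two coordinate operations mentioned above reduce to short formulas. The sketch below shows the arithmetic only; the conversions in this study were done with Google Maps, and the function names here are hypothetical. It assumes the standard spherical web Mercator definition, which uses the WGS84 semi-major axis as the sphere radius.

```python
import math

EARTH_RADIUS_M = 6378137.0  # WGS84 semi-major axis, used as the sphere radius in EPSG:3857


def dms_to_decimal(degrees, minutes, seconds, hemisphere):
    """Convert degrees/minutes/seconds to signed decimal degrees.

    hemisphere is one of "N", "S", "E", "W"; south and west are negative.
    """
    value = abs(degrees) + minutes / 60 + seconds / 3600
    return -value if hemisphere.upper() in ("S", "W") else value


def lonlat_to_web_mercator(lon_deg, lat_deg):
    """Project longitude/latitude in decimal degrees to web Mercator x, y in meters."""
    x = EARTH_RADIUS_M * math.radians(lon_deg)
    y = EARTH_RADIUS_M * math.log(math.tan(math.pi / 4 + math.radians(lat_deg) / 2))
    return x, y
```

For example, `dms_to_decimal(119, 45, 0, "W")` gives -119.75, and the resulting decimal longitude and latitude can then be passed to `lonlat_to_web_mercator`.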
Life history trait data collection
To increase the utility of CaliPopGen, we also assembled data on life history traits for all animal (Dataset 3 [21]) and plant (Dataset 4 [21]) species contained in Datasets 1 and 2 [21]. We assembled trait data that have previously been shown to correlate with genetic diversity, including those related to reproduction, life cycle, and body size, as well as conservation status (e.g., [23,24,25,26]). Life history data were compiled by first referencing large online repositories, often specific to taxonomic groups, like the TRY plant trait database [27] and the Royal Botanic Gardens Kew Seed Information Database [28]. If trait data for species of interest were unavailable from these compilations, we conducted keyword literature searches for each combination of species and life history trait, and extracted data from the primary literature. When data were not available for the subspecies or species for which we had genetic data, we report values for the next closest taxonomic level, up to and including family, as available in the literature.
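The fallback to the next closest taxonomic level amounts to an ordered lookup. The following is a minimal sketch under stated assumptions: the rank order (subspecies, species, genus, family), the table layout keyed by `(rank, name)`, and the function name `lookup_trait` are all hypothetical and are not taken from the datasets themselves.

```python
# Ranks searched in order, from most to least specific; family is the ceiling.
RANKS = ["subspecies", "species", "genus", "family"]


def lookup_trait(taxonomy, trait_table, trait):
    """Return (value, rank_used) for the most specific rank that has the trait.

    taxonomy maps rank -> taxon name for one entry.
    trait_table maps (rank, taxon name) -> {trait: value}.
    Returns (None, None) when no rank up to family has a value.
    """
    for rank in RANKS:
        name = taxonomy.get(rank)
        if name is None:
            continue
        value = trait_table.get((rank, name), {}).get(trait)
        if value is not None:
            return value, rank
    return None, None
```

Returning the rank alongside the value makes it explicit when a trait was borrowed from a higher taxon, so such entries can be flagged or excluded in downstream analyses.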
For both animals and plants, we defined habitat types as marine, freshwater, diadromous, amphibious, or terrestrial. Marine species include those that are found in brackish or wetland-marine habitats, as well as bird species that primarily reside in marine habitats. Freshwater species include those that are found in wetland-freshwater habitats, as well as species that primarily reside in freshwater. The diadromous category includes fish species that are catadromous or anadromous. We considered species to be amphibious if they have an obligatory aquatic stage in their life cycle, but also spend a significant portion of their life cycle on land. Terrestrial species were defined as those that spend most of their life cycle on land and are not aquatic for any portion of their life cycle. In a few cases (e.g., waterbirds that are both freshwater and marine, semi-aquatic reptiles), a species could reasonably be placed in more than one category, and we did our best to identify the primary life history category for such taxa. If the taxonomic identity of an entry was hybrid between species or subspecies, this was noted in the speciesID column and no life history data were reported.
The CaliPopGen Animal Life History Traits Dataset 3 [21] (description of dataset in Table 3) includes habitat type, lifespan, fecundity, lifetime reproductive success, age at sexual maturity, number of breeding events per year, mode of reproduction, adult length and mass, California native status, listing status under the US Endangered Species Act (ESA), listing status under the California Endangered Species Act (CESA), and status as a California Species of Special Concern (SSC). For some traits, value ranges were recorded, for example, minimum to maximum lifespan. In other cases, we recorded single values and, when available, a definition of this single value (for example, minimum, average, or maximum lifespan). We report either the range of the age of sexual maturity (minimum to maximum), or a single value, depending on the available literature. For sexually dimorphic species, we report female adult length and mass when available, because female body size often correlates with fecundity. Across animal taxonomic groups, different measures of body size and length are often used, reflecting community consensus on how to measure size. Given this variation, we report the type of length measurement, if available, as Standard Length (SL), Fork Length (FL), Total Length (TL), Snout-to-Vent Length (SVL), Straight-Line Carapace (SLC), or Wingspan (WS).
The CaliPopGen Plant Life History Traits Dataset 4 [21] (description of dataset in Table 4) includes habitat type, lifespan, life cycle, adult height, self-compatibility, monoecious or dioecious, mode of reproduction, pollination and seed dispersal modes, mass per seed, California native status, NatureServe [29] element ranks (global and state ranks, see Table 5 for definitions), listing status under the Federal Endangered Species Act (ESA), and listing status under the California Endangered Species Act (CESA). In contrast to most animal species, plant lifespan was typically reported as a single value. We define life cycles as the following: Annual: completes full life cycle in one year; Biennial: completes full life cycle in two years; Perennial: completes full life cycle in more than two years; Perennial-Evergreen: perennial and retains functional leaves throughout the year; Perennial-Deciduous: perennial and loses all leaves synchronously for part of the year. Some species are variable (for example, have annual and biennial individuals), and in those cases we attempted to characterize the most common modality.
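The life cycle definitions above can be written as a small decision rule. This is a sketch only: categories in the dataset were assigned from the literature rather than computed, and the inputs used here (`years_to_complete` and an optional `leaf_habit` of "evergreen" or "deciduous") as well as the handling of fractional years are assumptions made for illustration.

```python
from typing import Optional


def life_cycle(years_to_complete: float, leaf_habit: Optional[str] = None) -> str:
    """Map time to complete the life cycle (and leaf habit) to a category.

    Assumption: up to one year counts as Annual and up to two years as
    Biennial; anything longer is Perennial, refined by leaf habit if known.
    """
    if years_to_complete <= 1:
        return "Annual"
    if years_to_complete <= 2:
        return "Biennial"
    if leaf_habit == "evergreen":
        return "Perennial-Evergreen"
    if leaf_habit == "deciduous":
        return "Perennial-Deciduous"
    return "Perennial"
```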
Because of the paucity of data available for chromists and fungi, we did not extract life history trait data for the relatively few species in these taxonomic groups.
Data visualization and summary
We used the R package raster (v3.1-5) to visualize the spatial extent of the data in CaliPopGen in Fig. 3. Panel (A) shows a summary plot of all unique populations of both the Population Genetic Diversity Dataset 1 [21] and the Pairwise Population Differentiation Dataset 2 [21]. Panel (B) shows the total number of unique populations in each California terrestrial ecoregion. Panel (C) depicts all data entries of Population Genetic Diversity Dataset 1 [21], summed for each 20×20 km grid cell. Panel (D) shows the density of pairwise straight lines drawn between pairs of localities in the Pairwise Population Differentiation Dataset 2 [21], depicted as the total number of lines per 20×20 km grid cell. The number of populations and species of both Datasets 1 and 2 [21] are summarized for each marine and terrestrial ecoregion in Table 6.
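The per-cell summation shown in Panel (C) is, at its core, a binning operation. The figures themselves were produced with the R package raster; the sketch below is a language-neutral illustration in Python, not that workflow. It assumes point coordinates are already in a projected system with units of meters, and the names `CELL_SIZE_M`, `cell_index`, and `entries_per_cell` are hypothetical.

```python
import math
from collections import Counter

CELL_SIZE_M = 20_000  # 20 km grid cells, as in the text


def cell_index(x_m, y_m, cell_size=CELL_SIZE_M):
    """Return the (column, row) index of the grid cell containing a point."""
    return (math.floor(x_m / cell_size), math.floor(y_m / cell_size))


def entries_per_cell(points, cell_size=CELL_SIZE_M):
    """Count data entries per grid cell for an iterable of (x, y) points in meters."""
    return Counter(cell_index(x, y, cell_size) for x, y in points)
```

Because each population/marker combination is a separate entry, a single well-studied locality can contribute many counts to its cell, so the resulting surface reflects sampling effort as well as the number of distinct populations.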
Source: Ecology - nature.com