Internet searches offer insight into early-season pollen patterns in observation-free zones

Assessment of data quality

National Allergy Bureau pollen concentration data quality

To assess the overall quality of NAB data, we analyzed gaps in data recording and percentages of missing data in daily NAB measurements from each station from January to December of each year. Availability of pollen concentration data varied widely by station, with the percent of days per year missing pollen data ranging from 0% (e.g., San Antonio, TX; 2012) to 100% (e.g., Oklahoma City, OK; 2014) (Supplementary Fig. 5A). Data were most commonly missing at the beginning of the year, at the end of the year, and on weekends (data not shown). Although NAB directs its certified pollen counting stations to collect data for a minimum of 3 days per week, gaps in pollen collection within 10 days before and after the first recorded high pollen concentration (200 grains/m³) spanned up to 10 consecutive days (Supplementary Fig. 5B). Over the span of the year, the median gap between measurements across station-years was 5 days (IQR = 3.12). The date of first available pollen concentration data ranged from day 1 of the year to day 96, with a median day of 3 (IQR = 1.27) (Supplementary Fig. 5C). For the majority of station-years (64.5%), the first day with recorded data was also the first day with a non-zero pollen count.
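The two missingness metrics described above (percent of days per year without data, and the length of gaps between measurements) can be illustrated with a minimal Python sketch. The function name `missingness_summary` and the example station are hypothetical, and the sketch assumes that a day counts as "missing" whenever no measurement was recorded for that calendar date; it is not the authors' code.

```python
from datetime import date, timedelta

def missingness_summary(observed_dates, year):
    """Return (percent of days missing, longest run of consecutive missing days)."""
    start = date(year, 1, 1)
    n_days = (date(year + 1, 1, 1) - start).days  # 365 or 366
    observed = set(observed_dates)
    missing = 0
    longest_gap = 0
    current_gap = 0
    for offset in range(n_days):
        day = start + timedelta(days=offset)
        if day in observed:
            current_gap = 0
        else:
            missing += 1
            current_gap += 1
            longest_gap = max(longest_gap, current_gap)
    return 100.0 * missing / n_days, longest_gap

# Hypothetical station that sampled on weekdays in January 2013 only.
obs = [date(2013, 1, d) for d in range(1, 32) if date(2013, 1, d).weekday() < 5]
pct_missing, longest_gap = missingness_summary(obs, 2013)
```

Restricting the loop to the 10 days before and after the first high-concentration day would give the windowed gap length reported in Supplementary Fig. 5B.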

Google Trends search data quality

We analyzed GT daily data quality per DMA region during the early pollen season, from January to June of each year. The percent of days missing GT data ranged from 0% to 93% (lowest for San Jose, CA, 2013 and highest for Midland, TX, 2012), with a median of 33% (IQR 8–51%) (Supplementary Fig. 6A). Earlier years of GT data had more days on which search volume was not quantified (referred to here as “missing”) because search volumes were lower and did not meet Google’s threshold for inclusion (Supplementary Fig. 6B). Variation was observed between GT download iterations, as GT provides a random sample of its data for each download (Supplementary Fig. 2A,B).

Factors associated with data quality

Biogeography and population characteristics were assessed for their impact on data quality, specifically overall ecoregion classification, total annual precipitation and mean spring temperature (chosen for their likely impact on pollen production and seasonality34), as well as TV-homes, a combinatorial metric for population size and media use.

With respect to ecoregion, the majority of NAB stations were classified as Eastern Temperate Forests (67.6%) or Great Plains (21.6%). Other ecoregions each represented 5% or less of NAB stations: Marine West Coast Forest, Mediterranean California, and Northwestern Forested Mountains. U.S. ecoregions not represented by NAB stations included: Northern Forests (as in Vermont), Tropical Wet Forests (as in southern Florida), North American Deserts (as in Nevada), Southern Semi-Arid Highlands (as in southeastern Arizona), and Temperate Sierras (as in southwestern New Mexico). As a whole, NAB stations in Great Plains ecoregions had slightly higher data quality (p < 0.01), with a median of 70.7% days of non-missing data (IQR = 60.3%, 89.9%), versus Eastern Temperate Forests with a median of 63.8% (IQR = 53.2%, 68.4%) across station-years. Statistical comparisons were not performed between other ecoregions due to small sample sizes; however, data by ecoregion can be viewed in Supplementary Fig. 7.

With respect to climatic factors, we evaluated mean spring temperature, total annual days of precipitation, latitude, and longitude in relation to the percent of missing GT and NAB data as well as the number of consecutive days of missing NAB data. Among all pairwise comparisons, a few significant relationships were identified. Mean spring temperature (°C) exhibited a positive correlation with non-missing NAB data (% days) [R² = 0.29, Coefficient = 1.89 (95% CI 1.31, 2.46), Supplementary Fig. 8]. This may reflect the practice, described by some pollen counting stations in northerly, colder regions, of not recording or monitoring pollen until the weather is warmer and pollen is more likely to be produced (from personal correspondence, data not shown). Mean spring temperature was strongly inversely correlated with latitude, as expected (R² = 0.73). Total annual days of precipitation was positively correlated with longitude (R² = 0.38; Coefficient = 1.39; 95% CI 1.04, 1.73), which is consistent with Köppen–Geiger dry-moist climate classifications for the continental U.S.35. No associations were detected between any other climatic factors and the data quality metrics examined via univariate regression analyses (R² ≤ 0.1).

With respect to regional population sizes and media consumption, the percent of non-missing NAB pollen concentration data was not correlated with population size, as estimated by the number of TV-homes in the associated DMA region (p = 0.29; R² < 0.01). In contrast, the percent of days missing GT data was strongly inversely correlated with the log-transformed number of TV-homes in the region [Coef = −22.6 (95% CI −24.7, −20.6); p < 0.01; R² = 0.68] (Supplementary Fig. 9).

Correlation between NAB and GT data with respect to data quality

Data quality inclusion criteria for correlation analyses

NAB total pollen concentrations from the majority of station-years showed a bimodal seasonality consisting of one larger peak early in the year and one smaller peak later in the year (see Supplementary Fig. 1 for national seasonality). For correlation analyses, we focused specifically on the period from January to June to examine the extent to which GT data correlated with the larger, early-season peak in total pollen. Of 246 GT location-matched NAB station-years, 24 (9.7%) had no NAB data recorded in the period of interest. In addition, the following station-years did not meet data quality inclusion criteria: 85 (34.5%) station-years had over 60% of days missing data during the pollen season, and an additional 32 (13.0%) station-years had over four consecutive days missing data within 10 days of the first high pollen concentration day of the year. A total of 105 station-years, representing 27 NAB stations, were ultimately included in correlation analyses. See the Supplement for visualizations of ecoregions (Supplementary Fig. 10) and geographical distribution (Supplementary Fig. 11; Interactive Map https://bit.ly/2XTlHrC)36 of NAB stations represented in the included study sample.
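The inclusion criteria above can be expressed as a short filter. The following Python sketch is illustrative only: the function names are hypothetical, a missing day is represented as `None`, the high-pollen threshold is taken as 200 grains/m³ or more, and "within 10 days" is read as 10 days before and after the first high-concentration day. How a station-year with no high-concentration day was handled is not stated in the text; the sketch assumes the gap criterion is then not applicable.

```python
HIGH_POLLEN = 200  # grains/m^3; treating "high" as >= 200 is an assumption

def longest_missing_run(values):
    """Length of the longest run of consecutive None entries."""
    longest = current = 0
    for v in values:
        current = current + 1 if v is None else 0
        longest = max(longest, current)
    return longest

def passes_inclusion(daily, max_missing_frac=0.60, max_run=4, window=10):
    """daily: one value per day of the January-June window, None where missing."""
    recorded = [v for v in daily if v is not None]
    if not recorded:
        return False  # no NAB data in the period of interest
    if 1 - len(recorded) / len(daily) > max_missing_frac:
        return False  # over 60% of days missing
    first_high = next(
        (i for i, v in enumerate(daily) if v is not None and v >= HIGH_POLLEN),
        None,
    )
    if first_high is None:
        return True  # assumption: gap criterion not applicable without a high day
    lo = max(0, first_high - window)
    hi = first_high + window + 1
    # Exclude if more than four consecutive days are missing near the first high day.
    return longest_missing_run(daily[lo:hi]) <= max_run
```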

Effects of data missingness on NAB-GT correlation strength

Daily total pollen concentrations from NAB data were compared to daily GT search counts by station-year via Spearman rank correlation. Since GT data varied widely with respect to percent of missing days of data per year, station-years were grouped into quartiles to test the effect of missingness on the ability of GT to correlate with NAB data, with quartile cutoffs at 1.8%, 14%, and 32% of days missing GT data. Significant differences were identified in correlation strength between Q1 vs. Q2 (p < 0.01) and Q2 vs. Q3 (p = 0.02). Rho values by quartile were: Q1 0.71 (IQR 0.83–0.93), Q2 0.66 (IQR 0.38–0.85), Q3 0.44 (IQR 0.04–0.77), Q4 0.23 (IQR 0.06–0.59) (Fig. 1A).
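As an illustration of the comparison described above, the following Python sketch computes Spearman's rho for one station-year and assigns the station-year to a missingness quartile using the cutoffs given in the text. It is a sketch under stated assumptions, not the authors' implementation: the function names are hypothetical, days on which either series is missing (`None`) are dropped pairwise, tied values receive average ranks, and a value exactly equal to a cutoff is placed in the lower quartile.

```python
from statistics import mean

def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks.

    Raises ZeroDivisionError if either series is constant.
    """
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    rx = average_ranks([a for a, _ in pairs])
    ry = average_ranks([b for _, b in pairs])
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5

CUTOFFS = (1.8, 14.0, 32.0)  # percent of days missing GT data, from the text

def missingness_quartile(pct_missing):
    """Return 1-4; boundary handling (<=) is an assumption."""
    for q, cutoff in enumerate(CUTOFFS, start=1):
        if pct_missing <= cutoff:
            return q
    return 4
```

Collecting `spearman_rho` values by `missingness_quartile` and taking the median within each group would yield per-quartile summaries of the kind reported in Fig. 1A.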

Figure 1

Correlation between Google Trends searches and National Allergy Bureau pollen concentration data with respect to data quality and pattern. (a) Correlation by quartiles of annual percent of missing Google Trends data. (b) Signal to noise ratio (size of peak relative to smoothing function).

Effects of GT peak signal strength on GT-NAB correlation strength

As a proxy for site-specific estimates of signal-to-noise ratio and the ability to identify peaks in GT data, the mean absolute difference between daily GT adjusted search volumes (“signal”) and a heavily smoothing LOWESS function (baseline fluctuations or “noise”) was calculated per station-year (Supplementary Fig. 12A). Station-years were separated into quartiles to test the effect of signal strength on the ability of GT to correlate with NAB data, with cutoffs at 0.19, 0.24, and 0.28. Significant differences were identified in correlation strength between Q2 and Q3 (p = 0.04) and between Q3 and Q4 (p = 0.02). Spearman’s rho values by quartile were: Q1 0.23 (IQR 0.00–0.57), Q2 0.52 (IQR 0.19–0.73), Q3 0.73 (IQR 0.43–0.86), Q4 0.83 (IQR 0.69–0.95) (Fig. 1B). The relationship between GT peak signal strength and GT-NAB correlation strength can also be visualized by scatter plot (Supplementary Fig. 12B).
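The signal-strength proxy can be sketched as the mean absolute deviation of the daily series from a heavily smoothed baseline. The Python sketch below substitutes a centered moving average for the LOWESS smoother used in the study, purely to stay within the standard library; the function names and the `half_window` value are hypothetical, and the resulting numbers are not comparable to the quartile cutoffs reported above.

```python
def moving_average(values, half_window):
    """Centered moving average; the window shrinks at the series edges."""
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half_window)
        hi = min(len(values), i + half_window + 1)
        window = values[lo:hi]
        smoothed.append(sum(window) / len(window))
    return smoothed

def peak_signal_strength(gt_values, half_window=30):
    """Mean absolute difference between the daily series and its smooth baseline.

    A moving average stands in here for the LOWESS fit described in the text.
    """
    baseline = moving_average(gt_values, half_window)
    return sum(abs(v - b) for v, b in zip(gt_values, baseline)) / len(gt_values)

# Hypothetical series: a flat one and one with a short, sharp peak.
flat = [0.5] * 60
peaked = [0.1] * 60
peaked[28:33] = [1.0] * 5
```

A series with a pronounced seasonal peak departs further from its heavily smoothed baseline than a flat or uniformly noisy one, which is why a larger value is read as a stronger peak signal.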

Effects of ecoregion and climate on GT-NAB correlation strength

When comparing the two main ecoregions represented in the study sample, correlations appeared to be somewhat stronger (p = 0.01) in Great Plains locations, among which the median rho value was 0.70 (IQR 0.39, 0.90), than in Eastern Temperate Forest locations, among which the median rho was 0.44 (IQR 0.13, 0.73). Other ecoregions were not compared due to small sample sizes, but correlation data by ecoregion are reported in Supplementary Fig. 7. We were not able to detect statistically significant associations between either precipitation or spring temperature and GT-NAB correlation strength, although numbers were small (e.g., N = 18 stations in 2013). However, scatter plot visualizations indicate that GT-NAB correlation strength may tend toward a positive association with annual precipitation and a negative association with spring temperature (Supplementary Fig. 13A,B).

Variation in correlation strength across sites

Overall, comparisons between GT search data with any amount of non-missing data and NAB pollen concentration data resulted in Spearman’s rho values that ranged from very poor (rho = −0.63) to excellent (rho = 0.98) for 105 station-years, covering 27 unique stations (see Fig. 2A–D), with a median rho of 0.24 (IQR 0.61–0.80). See Supplementary Table 2 for complete data missingness and rank-correlation values by station-year.

Figure 2

Overlay of lightly smoothed, normalized Google Trends search data (blue) and NAB pollen concentration data (orange) for representative station-years. Examples of (a) excellent, (b,c) good to moderate, and (d) poor correlation between GT and NAB data.

Season start estimates

To evaluate whether Google Trends data on search volumes for “pollen” could be used to estimate the start of the pollen season, we compared GT-calculated to NAB-calculated season starts, where start date was defined as the first date that pollen concentrations reached 5% of total annual cumulative pollen (see “Methods” for rationale and additional context).
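The season-start definition above can be written as a cumulative-sum threshold. The following Python sketch is a minimal illustration with a hypothetical function name; it assumes missing days (`None`) contribute nothing to the running total and that the same 5% rule is applied to the GT series when deriving the GT-based start.

```python
def season_start_index(daily_values, fraction=0.05):
    """Index of the first day on which the running total reaches
    `fraction` of the annual total; None if the total is zero."""
    total = sum(v for v in daily_values if v is not None)
    if total <= 0:
        return None
    running = 0.0
    for i, v in enumerate(daily_values):
        if v is not None:
            running += v
            if running >= fraction * total:
                return i
    return None

# Hypothetical daily series for one station-year (values are illustrative only).
nab = [0, 0, 1, 6, 20, 73]
gt = [0, 8, 12, 30, 30, 20]
discrepancy_days = season_start_index(gt) - season_start_index(nab)
```

With this sign convention, a negative discrepancy means the GT-derived start falls before the NAB-derived start.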

Effects of data transformations and data quality thresholding on estimation accuracy

Estimates derived from GT data were examined in the context of percent days missing and first date of available NAB data (Fig. 3A–C and Supplementary Table 3). Overall, using smoothed data, or log-transformed smoothed data, for both GT and NAB decreased the discrepancies between GT-derived and NAB-derived estimates of season start dates, compared with using untransformed data. Applying progressive inclusion criteria based on data quality also decreased the range of discrepancies between NAB- and GT-derived estimates. When examining all station-years using smoothed, log-transformed data, GT-derived start estimates preceded NAB-derived start dates, with a median difference of −24 days (IQR −39, 7). With progressive inclusion criteria applied, the median difference narrowed to −12 days (IQR −28, −3) and then to −8.5 days (IQR −21, 0). NAB-derived start dates using data from the previous year had discrepancies from the current year with a median of 2 days and an IQR within 1–2 weeks.

Figure 3

Difference in days between Google Trends- and NAB-calculated season start dates. Differences in start dates are shown for untransformed, smoothed, and log-transformed smoothed data. As a reference, differences between NAB-calculated start dates and those calculated from NAB data for the previous year are displayed as well, for (a) all available station-years, (b) additional inclusion criterion of NAB data collection beginning within the first month of the year applied, (c) additional inclusion criterion of < 20% missing GT data applied.


Source: Ecology - nature.com