Identifying and characterizing pesticide use on 9,000 fields of organic agriculture
We first identify the location of organic crop fields in Kern County and then estimate whether status as organic versus conventional fields determines pesticide use (Fig. 5).Fig. 5: Methodology overview.Figure outlines the main method steps from identifying organic fields to creating the analysis data to performing the statistical analyses. All images shown are simplified, visual representations of the datasets. CDFA refers to the California Department of Food and Agriculture, while APN is the Assessor’s Parcel Number and TRS is the Township-Range-Section. Identifying organic fields combines the created CDFA organic APN, CDFA organic TRS, and organic pesticides data layers together to create the final organic versus conventional fields layer used in the analysis data section. All analysis data layers are then inputted into the various statistical analyses.Full size imageIdentifying organic fieldsWe identified organic fields using a combination of California Department of Food and Agriculture (CDFA) records and Kern County Agricultural Commissioner’s Office spatial data (“fields shapefiles”) and pesticide use records. No single source was complete, and as such, we evaluated several different approaches to identifying organic fields.California Department of Food and Agriculture (CDFA) recordsData on the location of organic fields, per the California State Organic Program, for 2013–2019 was obtained by request from the California Department of Food and Agriculture (CDFA). The CDFA, through the State Organic Program, requires annual registration of certified organic producers who have an expected gross sale of over $5000. We were specifically interested in the pesticide aspects of organic production, which is governed in our study region by the USDA’s National List of Allowed and Prohibited Substances. The National List of Allowed and Prohibited Substances delineates which synthetic substances can be used and which natural substances cannot be used for pest control in US organic production. Besides substances specifically (dis)allowed on the National List, allowed substances include non-synthetic biological, botanical, and mineral inputs. Field location data were in the form of either Assessor’s Parcel Number (APN) or PLS System Township-Range-Section (TRS) values, though data were reported without systematic formatting. We harmonized the CDFA APN values to merge with the Kern County Assessor’s parcel shapefile (2017), which we then spatially joined with the Kern fields shapefiles. We followed a similar process with PLSS TRS values, which were then merged with the Kern County PLS Section shapefile, and joined to Kern field shapefiles. We refer to our final organic designation as “CDFA Organic”. Details of the data cleaning process are described in the Ancillary Data Processing Methods section below.Using pesticide use reports to refine organic field identificationAfter spot-checking pesticide use on CDFA Organic fields, it became clear we had not entirely eliminated conventional fields. This was likely due to variation in polygon geometries between PLSS Sections, Kern County Assessor parcels, and Kern agricultural fields data. To further refine our classification, we used field-level pesticide use, again from the Kern County Agricultural Commissioner’s Office. As thousands of pesticide products (active ingredients + adjuvants) are in use in Kern County, we took an iterative approach to eliminate fields using conventional pesticides. We first limited the universe of pesticides to those applied to fields that were CDFA Organic. We then identified the 50 most commonly used pesticide products by a number of applications, and manually classified each as organic or conventional. Having identified these products as described below, we matched them back in, eliminating fields that used conventional products and identifying as “PUR Organic” those that used only organic products. We repeated this process, hand identifying the most commonly used products and eliminating fields using conventional products until we had isolated fields using only organic products.To classify a product as organic or conventional, we first searched for each product’s U.S. EPA-registered product label, using the exact product name and EPA product registration number. If there was any indication on the label that the product was certified as organic by the Organic Materials Review Institute (OMRI), or said “for use in organic production” or “organic”, then the pesticide was identified as organic (n = 132). If there was no organic indication on the product label, we searched the OMRI certification database for products with identical names and manufacturers, and identified products as organic if such certifications existed (n = 39). If all ingredients were defined (i.e., no inert or undefined ingredients) and were known organic active ingredients, then the pesticide was identified as organic (n = 1) (Supplementary Data 1). We failed to find EPA-registered labels for three products and confirmed on the California Department of Pesticide Regulation website that they are either inactive or out of production (EPA registration numbers: 52467-50008-AA-5905, 36208-50020-AA, 2935-48-AA-120). Each of the three was rarely used (n 0) to be the same as the mechanisms determining the amount sprayed when some pesticides are used (pesticides when pesticides > 0). Double-hurdle models64 are an alternative to the Tobit model that allows for the separation of these two decisions.The mechanisms underlying the two decisions (to spray, how much to spray if spraying) can differ such that different covariates can describe each process, and the same covariates are allowed to influence the two processes in different ways (i.e., sign and magnitude can differ). The first, binary decision is usually modeled with a probit model.$${{{{{rm{P}}}}}}left(y=0|{{{{{bf{x}}}}}}right)=1-Phi left({{{{{bf{x}}}}}}gammaright)$$
(1)
Then, the second decision is modeled as a linear model with pesticide use following a lognormal distribution, conditional on positive pesticide use64$$log (y)|{{{{{bf{x}}}}}},y , > , 0 sim {{{{{rm{Normal}}}}}}({{{{{bf{x}}}}}}{{{{{mathbf{upbeta }}}}}},{sigma }^{2})$$
(2)
where Φ is the standard normal cdf, x is a vector of explanatory variables including organic status, y is pesticide use, and ({{{{{mathbf{upbeta }}}}}}) is a vector of coefficients. We use a lognormal hurdle model rather than a truncated normal hurdle model since pesticide use is highly non-normal, and Q-Q plots suggested substantial model improvement using a lognormal rather than normal distribution. In contrast to the panel data models described in the Ancillary Statistical Methods below, our estimation equation used natural log-transformed variables for pesticides (and field and farm size) rather than inverse hyperbolic sine (IHS) transformation since only positive observations are included in the second hurdle model. Following insights derived from our panel data models (Supplementary Notes), we build on the basic hurdle model concept using the farm-by-crop family interaction as a random intercept in both the first and second hurdle. We chose the farm-by-crop family interaction rather than a crossed random effect due to computational feasibility with thousands of permits and hundreds of crops, due to similarity of results to the within estimator model (i.e., fixed effects in causal inference terminology; Supplementary Notes, Supplementary Table 2), and due to AIC/BIC (Supplementary Table 3). Further, we find evidence of heteroskedasticity from both visual inspection and Levine’s test, which adds additional complications to computing crossed random effects. Thus, we proceed with the farm-by-crop family interaction in a random intercept model with cluster robust standard errors clustered at the same grouping. In doing so, observations, where the taxonomic family of the crop was unclear, were dropped. Of the 7367 fields that were dropped due to missing crop families, 6684 were uncultivated agriculture.Our data are effectively repeated cross-sections rather than a true panel since fields are defined by the farm-site-year combination and thus generally change year-to-year or when crops rotate. We model it as such. This implies we do not require observations to have no spray in all time periods, as would be the case in a double hurdle panel model. Linking field IDs over time through spatial processing is complicated by crop rotations of different size areas. Since farmers may farm multiple fields under different management systems, as we illustrate here, and have different contractual obligations at a sub-farm level, requiring farms to never use pesticides on all fields is not reflective of on-the-ground decisions.We repeated the lognormal hurdle models individually for carrots, grapes, oranges, potatoes, and onions, which were the only widely-grown crops with more than 100 organic fields. This allowed for a different slope and intercept by crop type.We conduct several robustness checks. First, we do not have data on crop yields. However, to assess the potential implications of a yield gap on our results, we modify our per hectare rates following Ponisio et al.15 as a robustness check. We group commodities into cereals, roots and tubers, oilseeds, legumes/pulses, fruits, and vegetables and assign them the Ponisio et al.15 yield gap estimates for that group. Crops that did not fall into any of the above groups (e.g., cannabis) were provided the all-crop average from Ponisio et al.15. Second, we analyze how conventional and organic differ with respect to soil quality using a within estimator approach to account for crop-specific differences in soil quality. Third, binary toxicity metrics, while valuable given the number of chemicals and endpoints of interest here, nevertheless fail to distinguish gradations of toxicity for chemicals above (or below) the regulatory threshold. As mentioned above, the data needed to calculate many aggregate indices (e.g., Pesticide Load57 and Environmental Impact Quotient58) are not readily available for all of the chemicals in our study. For completeness, we attempted to calculate the Pesticide Toxicity Index for one well-studied endpoint, fish. We supplemented data provided in Nowell et al.41 with data from Standartox42. However, only about 70% of the chemicals used in our study matched, and pesticide products used on organic fields were more likely to lack toxicity information for one or more chemicals. We briefly discuss the highly preliminary investigation, given the non-random missing toxicity data.All spatial analyses were performed in R Statistical Software v 3.5.3 and all statistical analyses were performed Stata 16 MP. For all tests, statistical significance was based on two-tailed tests with (alpha =0.05.)Ancillary data processing methodsCleaning parcel dataTo spatially locate organic fields, we needed to match the Assessor’s parcel numbers (APNs) provided in the CDFA tabular data to APNs in the Kern County Parcel shapefile (from 2017). Over 90% of the APN entries in the CDFA data were in the format [xxx-xxx-xx], though multiple APNs were often provided in the same cell separated by line breaks, semi-colons, commas, and/or spaces. We made initial edits separating values into individual cells in Microsoft Excel since formatting was highly inconsistent. Observations whose APNs were not in the [xxx-xxx-xx] were modified so that their format matched. In the R environment, dashes were inserted after the third, sixth, and eighth characters (1234567895 became 123-456-78-95) for APNs that did not already contain them. Occasionally, APN numbers were provided with dashes, but with segments of incorrect length (e.g., 12-34-567). In these instances, APN segments were either trimmed from the right or padded with a zero on the left so they matched the [xxx-xxx-xx] format. This approach yielded the greatest number of matches and was checked for accuracy as described below. Additional segments (from APNs with more than two dashes and eight numeric characters) were dropped. A handful of APNs with fewer than eight numeric characters and no dashes were dropped entirely.The edited CDFA APNs were then joined with the Kern County Assessor’s parcel shapefile, creating the “CDFA organic shapefile”. In total, 1637 of 1829 individual CDFA records joined successfully. To evaluate the accuracy of joins between CDFA tabular data, Kern County parcel, and Kern County agricultural spatial data, we spot-checked ownership information using “Company” (CDFA) and “PERMITTEE” (Kern County agricultural data) values.To then identify the crop fields within the organic parcels, we performed a spatial join between the CDFA organic shapefile and the Kern County fields shapefiles. Prior to performing the join, the CDFA parcels’ dimensions were reduced with a 50-m buffer to eliminate spatial joins between CDFA parcels and crop fields that were only touching the parcel margins. Of five different buffer widths evaluated, 50 m reduced the number of false positives and negatives, as determined by comparing the “Company” and “PERMITTEE” values. We refer to the fields that match as “APN Organic”.Cleaning PLSS Township-Range-Section valuesEach year several producers reported Township, Section, and Range (TRS) values, consistent with the PLS System (PLSS), rather than APN values. We used these TRS values to identify PLSS Sections that contained organic fields.We separated any cell containing multiple TRS values and removed any prefixes such as “S”, “Section”, “Sec.”, “T”, and “R” that would prevent joining to Kern County PLSS spatial data in Excel. In the R environment, we padded the left side of the “S” value with a 0 if it was a single digit, then concatenated the three columns into a “TRS” column. We joined TRS from the CDFA tabular data to PLSS spatial data, which identified 563 Sections as containing organic fields, from 2013 to 2019, out of a total of 664 unique TRS codes in the CDFA dataset. We then performed a spatial join between PLSS Sections that contain organic fields and Kern County fields shapefiles, to identify all agriculture fields that overlap with those Sections. Additional processing using the Pesticide Use Reports is described above.Ancillary statistical methodsWe began with a pooled ordinary least squares (OLS) model that, as the name suggests, pools observations over farms, years, and crop types. However, there may be attributes of crops or farms that may be systematically different between organic and conventional, and this systematic difference could bias our pooled OLS results. To address this, we first considered propensity score approaches but were unable to find a sufficient balance of our covariate distribution between organic and conventional fields. As an alternative, we limited our sample to fields with overlapping farmers and crop types. In other words, we focused on the subset of fields that are grown by farmers producing both organic and conventional fields and to crops that are produced both conventionally and organically. However, this shrunk our dataset by two-thirds.To leverage more of our data, we next considered panel data models as a means to address unobserved variables. We consider both within-estimator models (also known as “fixed effects” in causal inference terminology, but different from the biostatistical use of the term) and random effects models (with random intercepts), seeking to capture characteristics of the crop, grower, and year. The advantage of a within-estimator approach is that the omitted variables are removed (through differencing) and thus, they can be correlated with covariates without biasing the estimation. In other words, pesticide use and all covariates are differenced from their crop-specific mean (or crop family, farmer, etc. specific mean, depending on the model). In doing so, the propensity for certain crops (crop family, farmer) to be grown organic or to be fast or slow adopters of new technologies is removed. The disadvantage is that characteristics shared by all fields of a crop (e.g., value) are lost in the differencing, and more importantly, that the differencing is not easily translated to nonlinear models that we employ later in the analysis. Random effects are more easily translated to nonlinear models. The disadvantage of random effects is the strong assumption that the unobserved variables are uncorrelated with the covariates18,65, which is required for random effects coefficient estimates to be unbiased. Here, we see the difference in coefficient estimates between the within-estimator and random effects models are quite small (Supplementary Table 2).Random effects particularly crossed random effects with thousands of permits and hundreds of crops, introduce computational challenges due to large, sparse matrices. Further, we find evidence of heteroskedasticity from both visual inspection and Levine’s test, which adds additional complications to computing crossed random effects. We proceed using the farm-by-crop family interaction in a random intercept model with cluster robust standard errors clustered at the same grouping based on AIC/BIC (Supplementary Table 3), computational feasibility, and similarity to the within-estimator results (Supplementary Table 2). Observations, where the taxonomic family of the crop was unclear, were dropped in any models including family in either the random effects or the cluster robust standard errors. Of the 7367 fields that were dropped due to missing crop families, 6684 were uncultivated agriculture.In the panel data models, we used IHS transformations to accommodate highly non-normal pesticide (and field and farm size) data. IHS is very similar to natural log transformation66 but is defined at zero, which is important given a sizable fraction of our observations have zero pesticide use. As with log–log transformations, IHS–IHS transformation can be interpreted as elasticities. We pre-multiply pesticide use by 100 to improve estimation66, though this does not affect interpretation. As described above, we leverage insights on model specification from the panel data models, but rely on the double hurdle models to parse apart the decision to spray from the decision of how much to spray.Reporting summaryFurther information on research design is available in the Nature Research Reporting Summary linked to this article. More