Few keystone plant genera support the majority of Lepidoptera species

Data collection

This study uses a compiled data set that includes 12,072 native Lepidoptera species, 2079 native plant genera, and 24,037 different host plant-native Lepidoptera interactions from across the contiguous United States. From these data we extracted the plant information of 83 counties and a Lepidoptera list for the corresponding 25 states. The full dataset will be publicly available in a forthcoming data manuscript.

Lepidoptera range and host plant data

The Lepidoptera species data were compiled from historic citable sources (Supplementary Data 6) of range and host plant records. We originally compiled a similar list for the Mid-Atlantic region²⁹. This dataset was updated to include more states and counties to include on the National Wildlife Federation website⁴⁸. Non-plant host records (e.g., detriphagous, algae, fungi, lichen, and insect predators) are included, as well as Lepidoptera without known host plant associations, but not considered in this analysis. Plant ranges are to the county, Lepidoptera ranges are to state, and host plant-Lepidoptera records are relationships between a plant genus and Lepidoptera species. Plant genus was included as the unit of interaction because data on Lepidoptera- host plant associations are most accurate and available at the genus level. Although more specific data are occasionally available (e.g., Lepidoptera records to plant species), we limited our analysis to the genus scale in order to make equitable inferences across the geographic and ecological scope of this analysis.

Plant distribution data

The current list uses the Biota of North America Program⁴⁹ (BONAP) as its major source for plant nomenclature. We used the BONAP database as our source for plant distributions because it specifies North American plant ranges that currently occur beyond their historic native range due to anthropogenic and natural expansion (e.g. Osage orange, Maclura pomifera).

Using the BONAP, a county-level survey was made for each state used in this study. Every county within those states was surveyed and BONAP records include records from adjacent counties. We classified plants into three categories; native, non-native, and adventive. A plant species is classified as adventive if it is native to North America but not in that specified region. Each plant genus was reviewed individually in each state. Genus records that fell entirely in one category resulted in all county records being designated that category. Any plant genus that had species that fell in two or more categories was examined county by county, with adjacent records being noted but not included. The state records for each plant genus are labeled in various combinations of the three categories. County-level data designate a genus either containing native records, or only non-native. For our study, we focused only on native plants, excluding non-native and adventive records (except for parameterizing probabilities of host-plant switching, see below for details).

Eighty-three counties in 25 states (Alabama, Arizona, Arkansas, California, Colorado, Delaware, Florida, Georgia, Idaho, Illinois, Kansas, Maine, Massachusetts, Michigan, Minnesota, Montana, New York, North Dakota, Ohio, Oregon, Pennsylvania, South Carolina, Tennessee, Texas, and Utah) were examined. At least three counties in each state (except Delaware) were used. Two counties from each state were selected from dissimilar ecoregions within each state. Ecoregions were determined using the Commission for Environmental Cooperation’s Ecological Regions of North America map⁵⁰. For county selection we used the level 2 designation of terrestrial ecoregions (50 separate categories); however, for subsequent analyses, we used the level 1 designation (15 categories). As much as possible, counties that bordered other counties within an ecoregion were used to alleviate the issue of BONAP including records from adjoining states. At least one more county was added to meet the parameters of the latitude study and to fill out ecoregions. Up to five counties were used in some states. While the majority of counties were chosen without criteria beyond ecoregion status, Chase County KS was chosen based on its presence in the ‘South-central Semi-Arid Prairies’ ecoregion, as well as high natural grassland cover relative to other agriculturally dominated Kansas counties.

County data

A series of counties were selected along three primary latitude bands (Latitudes 46, 40, and 34). The latitude and longitude of each county were determined by the county seat using Google Earth. In most cases the county seat was centrally located within the state. A few county seats are not centrally located, most notably Monroe County in Florida where the county seat is Key West. We determined the land area (in km², excluding inland, coastal, Great Lakes, and territorial sea water) for each county using information from the US Census (https://www.census.gov/quickfacts/fact/note/US/LND110210). Counties varied from 415 to 26,368 km² with an average of 4519 ± 5598 SD.

Lepidoptera-host plant data

The host plant records for each Lepidoptera species include all known literature records. Not all host plant-Lepidoptera associations occur in every county, state or even the USA due to differences in plant distributions. Thus, we filtered the Lepidoptera list from each state to exclude any Lepidoptera species whose host plant did not occur in the selected county. The final dataset per county includes (1) all plant genera known to occur in the county, (2) all Lepidoptera known to use at least one plant genus that occurs in the county and (3) all host plants used by Lepidoptera that could potentially occur in the county.

Statistical methods

All analyses were conducted using program R, Version 3.5.1⁵¹.

Distribution analysis

We first determined what distribution best fit our data in order to derive parameters that could be compared among counties. We used the package ‘goft’⁵² to conduct goodness of fit tests for the Exponential, Gamma, and Pareto distributions on each county separately. Functions in the ‘goft’ package use parametric bootstrap tests for the null hypothesis that a distribution fits a tested distribution. We tested each county (n = 83) and each distribution type (n = 3) separately. Using the distribution that best fit our datasets, we then used the function ‘fitdistr’ from the ‘MASS’ package⁵³ to use maximum likelihood fitting to obtain parameters (e.g., shape α and scale θ) for the Plant-Lepidoptera distributions for each county separately.

We then tested for differences in the distribution of caterpillar richness among plants by county-level diversity and location metrics. For each county, we compared the α and θ of the distribution with county plant richness, Lepidoptera richness, ecoregion, latitude, and county land area. To test whether α or θ varied by ecoregion we used an analysis of variance (ANOVA) with Tukey’s post hoc multiple comparisons of means. To test whether α or θ changed with increasing plant richness, Lepidoptera richness, or land area, we used linear regression. Based on scatter plots of the data, no nonlinear relationships were necessary.

Network analysis

To determine conservation targets, we identified keystone plant species^24,25 using methods from Harvey et al.²⁶. To perform this analysis, we used binary networks of host-plant caterpillar interactions. Ecological networks are ideal to determine cascading extinction rates of specialists following host plant loss⁵⁴. For this analysis and our following simulation, we chose one representative county dataset from each state from our 83 available counties (25 counties total). Our method to identify target keystone species at a national scale consists of three steps.

Species richness

On a per-county basis, we first identified how many Lepidoptera species are recorded in the literature as using each plant genus for growth and reproduction.

Extinction sensitivity

We also determined the ‘extinction sensitivity’ of each plant; in other words, how many Lepidoptera species are at risk of extirpation with the loss of a host plant? In our context, the sensitivity means “specialization”, caterpillars that use only one genus of the plant are considered especially sensitive to the removal of that plant. We modified the “nb.extinct” function from Harvey et al.²⁶ so that we could calculate the number of extinct herbivores following the removal of a plant. This function was repeated for each county to acquire a number of species that were specialists to each host plant.

Network stability

Then we used a network-based approach to assess the effect of each plant genus on network stability. We used a binary matrix where each record of a caterpillar on a host plant indicates the existence of an interaction. The results given from a binary interaction network are correlated with those from a network weighted by abundance⁵⁵. Here, the community stability index represents the minimum interactions required for the system to be stable where smaller values are the most stable, i.e., the more resilient a network is to disturbance, and large values are the most unstable. We calculated the stability index as the real part of the dominant eigenvalue of a Jacobian matrix following methods from Sauve et al.²⁷ (see Appendix 1 in ref. ²⁷ for definition and details). To acquire baseline stability, we conducted 175 iterations and took the median value. We determined that 175 was the minimum number of iterations needed to acquire a stability value ± 0.001 resolution using simulations on test datasets. For this analysis we only considered woody plants in our analysis because (1) woody plants tend to host the most diverse caterpillar communities²⁹ and (2) computation time to include all plant genera was prohibitive.

To quantify the effect of each plant on total network stability, we reran the analysis, iteratively removing each plant genus and then recalculating the stability²⁶. Then we subtracted the new stability values from baseline stability to find the median change when each plant was removed where negative values indicate reductions in network stability and positive values indicate increases. We then multiplied the stability value by −1 so that increases in this metric indicated an increase in a plant’s importance to stability to make this value comparable in direction to network structure and extinction sensitivity (i.e., increases in Lepidoptera diversity and # of specialist species).

Standardizing results across counties

For each analysis (step 1–3) we scaled our final values from 0–1 using this equation for each plant in each county separately:

$$frac{{{mathrm{x}} + left| {{mathrm{minimum}}}; {{mathrm{x}}} right|}}{{{mathrm{maximum}}; {mathrm{x}} + left| {{mathrm{minimum}}}; {{mathrm{x}}} right|}}.$$

(1)

We then took the mean of our three values to obtain a final ‘score’ for each plant genus per county. Finally, we identified which plant genera had the highest values over all the counties by plotting the means for each plant genera (n = 288) and assessing outliers (values that were 1.5× the interquartile range).

Field-based host plant-caterpillar interaction data

Field sampling

To compare the results from the network analysis on literature-based data collection with results from field-based data collection, we used caterpillar interactions recorded from native host plants from Richard et al.⁵⁶ (hereafter: Mid-Atlantic dataset). Caterpillar surveys were conducted in 2011 within 8 hedgerows in New Castle County, DE and Cecil County, MD. This dataset contains plants surveyed in both native- (>95% native plant biomass, n = 4) and nonnative-dominated (>75% nonnative plant biomass, n = 4) hedgerows. Sites were all located within Mid-atlantic decidous piedmont forest, and were separated by at least 100 m.

In June–July, an observer walked a 100 m transect on days in which foliage was not wet to collect caterpillars in each site. Observers sampled all caterpillars using the total search approach⁵⁷ to methodically inspect leaves, twigs, and branches of all woody plants within a 2-m3 area along the transect. Each search was conducted for 5 m every 2 m along the 100 m transect. In total, each hedgerow treatment was searched for a total of 1000 min in both June and July. All caterpillars were identified to species or morphospecies using Wagner⁵⁷, Wagner et al.⁵⁸, and various web sources. Caterpillars that could not be identified in the field were measured and then brought to the lab to be reared to adulthood for later identification using the literature and the University of Delaware Insect Reference collection.

Data management and analysis

Because our network analysis was based on native woody plant genera, we excluded all non-native plant genera in the Mid-Atlantic dataset. We calculated the number of times each plant was searched and excluded all plant species that were searched <10 sampling occasions. That left us with caterpillar abundance data for 18 genera: Acer (A. rubrum and A. saccharum), Carpinus caroliniana, Carya sp., Cornus (C. alternifolia, C. florida, and C. racemosa), Diospyros virginiana, Fagus grandifolia, Fraxinus americana, Juglans nigra, Lindera benzoin, Liquidambar styraciflua, Liriodendron tulipifera, Platanus occidentalis, Prunus (P. americana and P. serotina), Quercus (Q. alba, Q. montana, Q. rubra, Q. velutina, Q. palustris, Q. coccinea, Q. phellos, and Q. imbricaria), Rhus (R. glabra and R. copallinum), Sassafras albidum, Ulmus (U. americana and U. rubra), and Viburnum (V. dentatum and V. prunifolium).

Prior to analysis, we filtered the Mid-Atlantic dataset to exclude all caterpillar species that weren’t definitively identified to species. Excluded species were primarily leaf miners, tiers, folders and webbers. Included individuals were composed of species from 19 separate families.

We summarised whether the 58 identified caterpillar species were observed feeding on the 18 host plants and converted the data into a binary matrix. For abundance, we standardized search efforts across the plant species by summing the total number of surveys and calculated an abundance per 10 samples. We completed the network analysis on the field-collected interaction dataset using 10,000 iterations. We compared the values from the Mid-Atlantic dataset (Field-derived scores) with that derived from the host plant data (Literature-derived scores) using a Pearson’s correlation test using the function ‘cor.test’⁵¹ and report the R-statistic.

Restoration simulations

Simulation parameters

To determine the applicability of our results, we simulated the management actions of restoration efforts involving 1–50 plant genera and 0–15 keystone plants (i.e. plants that support disproportionately high Lepidoptera richness) to demonstrate how supported Lepidoptera and interaction diversity changed when keystone plants were intentionally included in the landscape or not. To do this, we simulated random plant choices from the list of available plant genera from each of the randomly selected counties from 25 states and calculated the total Lepidoptera species richness supported (excluding duplicate species supported by >1 chosen host plant) and total interaction richness supported (i.e. all interactions between a plant and a caterpillar consumer).

Host-plant switching

We also included the potential for host plant switching by including a probability of using plants not recorded in the host plant literature. We calculated the probability of random host plant switching as:

$${P}_{{mathrm{ic}}}left( {{mathrm{host plant shift}}} right) = frac{{{E}_{{mathrm{ic}}}}}{{{N}_{mathrm{c}} – {H}_{{mathrm{ic}}}}} * {N},$$

(2)

where P_ic is the probability of host plant shifting by Lepidoptera species i in county c, E_ic is the proportion of non-native plants used by Lepidoptera species i in county c, N_c is the number of native host plants in county c, H_ic is the number of total native host plants used by Lepidoptera species i in county c, and N is the total number of plants included in the simulation (from 1 to 50). This equation gives the probability that Lepidoptera species i in county c shifted to at least one of N plants used in the simulation. Using this probability, we used the sample function in R to predict whether any of the possible Lepidoptera species were included or not in each iteration and added each unique species to those included from known hosts.

Simulation output

For each county, we iterated these scenarios over 100 iterations with random draws of plant genera and keystone genera. We chose 100 iterations for each county to accurately estimate means while maintaining computational efficiency. To standardize across counties, we calculated the percent of Lepidoptera supported out of all potential phytophagous Lepidoptera (for woody plants and herbaceous plants separately) and percent interactions in each iteration. For each county and iteration, we plotted the median value. Simulations were run for woody and herbaceous plants separately.

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.