Pollen collection and RNA extraction
Pollen is a microscopic and notoriously resistant plant product. Thus, methods to collect a sufficient and roughly equivalent volume of pollen per species, and to ensure RNA was collected from viruses both internal and external to pollen grains, were developed specifically for this work. At each of the four regions, we identified visually asymptomatic plants species that were in full flower and in high enough abundance to achieve our pollen sample minimum. Many of the pollen samples were collected from public roadsides. However, some from the California Grasslands were collected from the University of California’s McLaughlin Natural Reserve, and some from the Eastern Deciduous Agro-forest Interface were collected from the University of Pittsburgh’s Pymatuning Laboratory of Ecology. We had permission to sample in both places. In addition, we obtained permission from the USDA Forest Service to sample in the Till Ridge Cove area of the Chattahoochee-Oconee National Forest for sampling in Central Appalachia. None of the sampled plants displayed classic viral symptoms (e.g., leaf yellowing, vein clearing, leaf distortions, growth abnormalities). To achieve the broadest representation of plant species, we selected species in different families, where feasible. Also when possible, we focused primarily on perennial species to avoid any effects of life history variation. From these, we collected 30 to 50 mg of pollen from newly dehiscing anthers (3–967 fresh hermaphroditic flowers from 1–27 plants per species; Supplementary Table 3) in situ using a sterile sonic dismembrator (Fisherbrand Model 50, Fisher Scientific, Waltham, MA, USA) with a frequency of 20 Hz. We removed non-pollen tissues (e.g., anther debris) with sterile forceps. In addition to removing non-pollen debris that was visible to the naked eye in the field at the time of pollen sample collection, we conducted microscopic and gene expression analyses to confirm the purity of the pollen samples in the lab (Supplementary Methods). Visibly pure pollen from a single species was transferred to a 2-mL collection tube with Lysing Matrix D (MP Biomedicals, Irvine, CA, USA) and kept on dry ice until transported to and stored at −80°C at the University of Pittsburgh (Pittsburgh, PA, USA).
Before extracting the total RNA, we freeze-dried the pollen samples (FreeZone 4.5 Liter Benchtop Freeze Dry System, Labconco Corporation, Kansas City, MO, USA) and lysed with a TissueLyser II (Qiagen, Inc., Germantown, MD, USA) at 30 Hz with varying times for different plant species (Supplementary Table 3). We confirmed via microscopy that this protocol resulted in the breakage of ≥50% of the pollen grains in a sample. The total RNA, including dsRNA, was extracted using the Quick-RNA Plant Miniprep Extraction Kit (Zymo Research Corporation, Irvine, CA, USA), following the full manufacturer’s protocol, including the optional steps of in-column DNA digestion and inhibitor removal.
RNA sequencing
We assessed the quantity and quality of the total RNA extracted from each pollen sample with a Qubit 2.0 fluorometer (Invitrogen, ThermoFisher Scientific, Waltham, MA, USA) and with TapeStation analyses performed by the Genomics Research Core (GRC) at the University of Pittsburgh. Only samples with an RNA integrity value of ≥1.9 were used (Supplementary Table 3). Stranded RNA libraries were prepared by the GRC using the TruSeq Total RNA Library Kit (Illumina, Inc., San Diego, CA, USA), and ribosomal depletion was performed using a RiboZero Plant Leaf Kit (Illumina, Inc., San Diego, CA, USA). At the GRC, we pooled depleted RNA libraries from six species on a single lane of an Illumina NextSeq500 platform.
Pre-virus detection steps
A sequencing depth of 117–260 million 75 bp paired-end reads was achieved per sample (Supplementary Table 3). Sequences were demultiplexed and trimmed of adapter sequences. We used the Pickaxe pipeline42,60,61 to detect known and novel pollen-associated viruses. First, Pickaxe removes poor-quality raw reads42,60,61 and aligns the quality-filtered reads using the Bowtie2 aligner with default parameters62 to a subtraction library. Each customized subtraction library contained the host plant species genome or the most closely related plant genomes in the National Center for Biotechnology Information (NCBI) database, if the host plant genome was not available (Supplementary Table 7), as well as other possible contaminant genomes (e.g., the human genome)42,60,61. The subtraction libraries with 1–8 closely related plant genomes, a bioinformatically tractable amount, were used to remove plant sequences, which allows for a conservative estimate of the viruses associated with pollen to be made. The size of the subtraction libraries did not influence the number of identified viruses, as there was no correlation between library size and either estimate of virus richness (conservative: r = 0.08, P = 0.75; relaxed: r = 0.06, P = 0.77). After subtraction, only non-plant reads remained and were used for viral detection.
Known RNA virus detection, identity confirmation
With Pickaxe, we used the Bowtie2 aligner with default parameters62 (v2.3.4.2-3) to align viral non-plant reads to Viral RefSeq42,60,61 (hereafter, VRS; Index of /refseq/release/viral (nih.gov)). Each known virus reflects the top hit of an alignment to VRS42,60,61. Following Cantalupo et al.42, we considered a known virus to be present if the viral reads covered at least 20% of the top hit and aligned to it at least ten times. For viruses with segmented genomes, at least one segment was required to meet these criteria.
Contig annotation and extension; novel RNA viral genome detection, identity confirmation
Viral reads were assembled into contigs using the CLC Assembly Cell (Qiagen Digital Insights, Redwood City, CA, USA), and Pickaxe was used to remove repetitive, short (<500 base pairs), and heavily masked sequences42,60,61. Contigs that passed these quality steps were annotated following Starrett et al.61, except that the GenBank nucleotide database was also searched with Rapsearch263 (v2.22). We then we used BLASTN (NCBI) to search for overlapping regions that were at least 90% identical between contig ends to extend them, if possible.
Main criteria used to confirm the identification of novel RNA viruses (genomes and variants) were: (1) contig or extended contig length corresponded to a putative viral family13; (2) the dissimilarity of a contig or extended contig from the top BLAST or RAPSearch2 hit exceeded the percent identity threshold for its putative family, as per the ICTV species demarcation criteria13; (3) the open reading frame (ORF; https://www.ncbi.nlm.nih.gov/orffinder/, default parameters, v0.4.3) architecture of a contig or extended contig matched that (or part of that) of a putative viral family13,64; and (4) detection of at least one of the conserved viral domains (i.e., proteins; hereafter, CDs) in the ORFs of a contig or extended contig with a search of the Conserved Domain Database65 (https://www.ncbi.nlm.nih.gov/Structure/cdd/wrpsb.cgi, default parameters, v3.16-17) corresponding to the CDs of a putative viral family13. We also considered contig relative abundance (i.e., the number of reads assembled into a contig divided by the contig length; representing the overall number of reads belonging to a novel viral genome) and how much of a contig participated in the alignment with the top BLAST or RAPSearch2 hit. Novel coding-complete genomes or variants of known viruses (i.e., those that met all or nearly all the above criteria) are reported at the family level. Genome organization and coverage depth across all these are shown in diagrams drawn to a unified length scale and depth plots, respectively (Supplementary Fig. 1). Depth plots were created using Bowtie262 to align the non-plant reads to contigs and Samtools66 (v1.9) to determine coverage depth at each base. In addition to novel coding-complete viral genomes and variants of known viruses, we also report novel partial RNA viral genomes and variants. All novel viral genomes were named by the plant species in which they were discovered, the putative viral family, and a number, and novel viral variants were named after their known viral species name.
Virus richness estimation and correlations
For each pollen sample, we calculated the conservative virus richness, or the total number of known viruses, novel coding-complete viral genomes, and novel coding-complete variants. We also calculated the relaxed estimate of virus richness, which also included the novel partial RNA-dependent RNA polymerase (RdRp) CDs of both genomes and variants. Since we collected the same volume of pollen from all plant species, we determined whether sampling variation was related to virus recovery by correlating virus richness estimates with the number of individuals and flowers sampled.
We found that the conservative and relaxed estimates of virus richness were highly correlated (r = 0.96, P < 0.001), and both were correlated with the number of flowers (both r > 0.74, P < 0.001), but not the number of individuals (conservative: r = −0.24, P = 0.26; relaxed: r = −0.27, P = 0.20) sampled. These patterns were unaffected by removing outliers (i.e., plant species where >100 flowers were sampled; correlations with flowers sampled: conservative r = 0.49, P = 0.02; relaxed: r = 0.40, P = 0.05; correlations with individuals sampled: conservative: r = −0.18, P = 0.40; relaxed: r = −0.23, P = 0.28). To be conservative, however, we controlled for both flowers and individuals sampled in all the analyses of virus richness by adding them as covariates to the phylogenetically corrected linear models (see “Flower, pollen traits…” below).
Plant and viral phylogenies
We constructed a phylogeny of the plant species based on the PhytoPhylo maximum likelihood megaphylogeny of vascular plants67,68 with the R (v4.0.1) packages “ape” and “phytools”69,70. The positions of the two plant species that were not present in the megaphylogeny data set (Calochortus amabilis and Calystegia collina) were manually added to the tree according to genus-level phylogenetic relationships71,72.
To assess the taxonomic membership of coding-complete novel viral genomes and variants and known viruses, we built maximum-likelihood family-level viral phylogenies by first aligning the amino acid sequences of the novel coding-complete viral genome RdRp CDs using the MUSCLE algorithm, with default parameters in MEGA X73. We then ran 500 bootstrap replicates of the Jones-Taylor-Thornton matrix-based model with default parameters. In doing so, we applied the Neighbor-Join and BioNJ algorithms in MEGA X to a model-generated matrix of pairwise distances between each sequence, and the topology with the best log-likelihood value is reflected in the phylogenies73,74,75. Following Galbraith et al.76 to create a frame of reference in these phylogenies, we also included the top five unique BLASTP (NCBI) hits with the closest percentage identity to each RdRp sequence.
Pollen-associated RNA virus distribution
To assess the evolutionary dependence of virus richness among the plant species (the conservative and relaxed estimates separately), we tested for a phylogenetic signal using Pagel’s λ77 with the R package “phytools”70, and concluded a phylogenetic signal was present if Pagel’s λ was significantly above zero. We evaluated whether the conservative and relaxed estimates of virus richness were disproportionally distributed among the five plant subclasses by creating null models that assumed random distribution of the viruses among the plant species and shuffling virus presence (N = 1000) using the R package “vegan”78. To assess significance, we compared the observed virus richness of each plant subclass to its 95% null confidence intervals.
To assess whether the viruses included in the conservative and relaxed estimates of virus richness belonged disproportionately to the Bromoviridae, Partitiviridae, and Secoviridae viral families, we created null models that assumed a random virus distribution across all viral families and shuffled virus presence among them (N = 1000), as above. We compared the combined observed virus richness in the Bromoviridae, Partitiviridae, and Secoviridae viral families to the 95% null confidence intervals of the same group.
We visualized known virus and novel coding-complete viral genome and variant distribution across plant species, viral families, and geographic regions using the R packages “gplots,” “Heatplus,” and “RColorBrewer”79,80,81.
Flower, pollen traits, and land use as drivers of pollen-associated virus richness
To assess whether floral traits explained variation in virus richness, we recorded traits important for pollinator vector attraction (inflorescence type, flower longevity, flower size [or equivalent floral unit of attraction], floral rewards) and floral reward accessibility (flower symmetry and accessibility based on floral morphology) as described in the literature (see Supplementary Table 1). In addition, we scored two traits important for pollen grain collectability: pollen grain texture and size (diameter of the longest dimension in µm), which were determined with the aid of a light microscope (magnification 10X or 40X; Leica DM500, Leica Microsystems, Buffalo Grove, IL, USA).
Prior to analysis, we coded the levels of categorical traits as 0 or 1 as follows: the number of flowers in the inflorescence (single vs. multiple [including cyme, raceme, panicle, and heads]), rewards (pollen only vs. pollen and nectar), flower symmetry (bilateral vs. radial), reward accessibility (restricted [by morphology or time] vs. accessible), and pollen grain texture (granulate [all non-spiky] vs. echinate [spiky]). All traits were standardized (i.e., mean = 0, standard deviation = 1).
We performed a principle component analysis (PCA) on all eight floral traits (three quantitative and five ordinal, binary) described above, which yielded three dominant floral trait principal components (PC1 – 3) using the “prcomp” function in R. The PCA results were validated using a factor analysis of mixed data (FAMD) for quantitative and qualitative variables, implemented in the package “FactoMineR”82, which yielded results identical to those of the PCA. To assess whether floral traits reflected shared evolutionary histories among plant species, we tested for phylogenetic signals of PC1 – 3 using Pagel’s λ in the R package “phytools”70. We then evaluated which floral traits influenced the conservative estimate of virus richness, while accounting for the influence of geographic region, with a phylogenetically corrected linear model using the R package “nlme”83. To improve normality, we added a small constant (0.1) to the conservative estimate of virus richness prior to natural logarithm transformation. The predictors of the model included the first three floral PCs and region, with the number of flowers and individual plants sampled as covariates to account for potential variation in virus recovery. This nested linear model design, and the explicit inclusion of the phylogenetic relationships among the plant species, allowed us to treat the plant species as replicates in each region and to isolate the effects of the floral traits, while controlling for the evolutionary history of the plant species. Variance inflation factors implemented in the R package “car”84 were used to confirm the absence of multicollinearity. We assessed the statistical significance of the predictors using type III sums of squares in “car” and estimated the least-squares means (LSmeans) using the package “emmeans”85. We repeated all statistical analyses with the relaxed estimate of virus richness, and all statistical analyses were performed in R (v4.0.1).
To characterize land use for each of the geographic regions, we circumscribed buffer zones with a 0.5 km- or a 3 km-radius around the spatial location of each plant species in each region using ArcGIS Desktop (v10.7.1; Environmental Systems Research Institute, Inc.). The larger radius reflects the average foraging distance of honey bees86, which are common pollinators throughout much of the United States and in our landscapes. The foraging distances of other pollinators are also encompassed by the smaller buffer zone. To most accurately quantify the land use in each region, the land use percent cover within each circular buffer zone was calculated and averaged for three categories—agriculture, urban (impervious surface: buildings, sidewalks, roads, other hard surfaces), and natural vegetation (grassland and forest)—extracted from the National Geospatial Data Asset (NGDA) Land Use Land Cover dataset (v2014; NLCD Land Cover Change Index (CONUS) | Multi-Resolution Land Characteristics (MRLC) Consortium), which contains land use states for the years 2004 – 2011 at 0.30 × 0.30 degree spatial resolution. The estimates of land use within the buffer zones were highly correlated (r = 1.00, P < 0.001), so we present results only for the 3 km-radius buffer zone throughout the manuscript.
Reporting summary
Further information on research design is available in the Nature Research Reporting Summary linked to this article.
Source: Ecology - nature.com