Filtering threshold for handling spurious sequences
We first used bacterial communities of known composition (simplified communities) to assess the occurrence of spurious taxa and to determine at which relative abundances they begin to appear. To propose a cutoff that is potentially applicable to different 16S rRNA gene amplicon studies, we included reference data obtained with different variable regions and sequencing pipelines and originating from both in vitro an in vivo communities varying in number and type of species (max. 58) (Tables 1 and 2). To determine a filtering threshold that allowed exclusion of most spurious taxa, we recorded the relative abundance of the first spurious OTU occurring in each of the reference community datasets (Fig. 2a). Median values of approx. 0.12% relative abundance were observed (Fig. 2b). Besides one outlier in the mock communities (0.44% relative abundance), all values were below 0.25% relative abundance.
a Example of the occurrence of all molecular species detected without filtering in the gut of a gnotobiotic mouse [49]. The arrow indicates the position of the first spurious molecular species, all following taxa being considered as having a high risk of being spurious (light gray bars in the enlarged inset). b Distribution of the relative abundances of first occurring spurious molecular species (as shown in panel a) across all mock communities and samples from gnotobiotes. The orange dashes on the y-axis indicate the consensus threshold of 0.25% relative abundance, above which no spurious taxa occurred with the exception of one outlier in a mock community at a relative abundance of 0.44%. c Comparison of various standard filtering cutoffs (see explanations in the text) in terms of spurious taxa (i.e., those molecular species not matching sequences of the known species contained in the artificial communities). d Corresponding percentages of positive hits retained by the different filtering strategies, with positive hits being defined as the reference sequences found in the respective amplicon datasets. e Percentage of spurious taxa and positive hits in the same reference communities using the DADA2 pipeline for analysis based on amplicon sequence variants (ASVs) [6]. f Effect of filtering thresholds at increments of 0.05% relative abundance on the detection of spurious taxa and positive hits in all mock and gnotobiotic datasets for OTUs (upper panel) and ASVs (lower panel). Lines correspond to mean values; ribbons represent standard deviations.
Without any filtering, sequence clustering generated an average of 508 ± 355 OTUs (min. 52; max. 1081) per mock community (10–58 target species in theory) and 105 ± 50 OTUs (min, 55; max. 215) per gnotobiotic community (4–12 target species in theory). Up to 87% of these OTUs were spurious (i.e., they did not match the expected classification of species contained in the corresponding artificial community) (Fig. 2c). On average, the proportion of spurious OTUs in both the mock communities and samples from gnotobiotic mice was slightly lower after removing singletons, although this did not reach statistical significance (50.8 vs. 64.3%, p = 0.227; 57.5% vs. 65.7%; p = 0.70, pairwise comparison by t-test, including Benjamini–Hochberg correction following ANOVA). Interestingly, the proportion of spurious molecular species was higher in gnotobiotic mice independent of filtering (p < 0.001), suggesting that the matrix accompanying the defined communities (fecal material in this instance) influences the outcome. Besides the goal of removing spurious taxa, it is of course important to include as many true molecular species as possible into the analysis. Even without any cutoff, not all target species could be detected: the percentage of positive hits was 94.9% and 92.3% for mock communities and gnotobiotic mice, respectively (Fig. 2d).
Although the number of spurious taxa decreased drastically (4.0 vs. 50.8% for mock communities and 1.0 vs. 57.0% for gnotobiotes; p ≤ 0.01) after applying the proposed cutoff of 0.25% relative abundance vs. singletons removal (Fig. 2c), the number of positive hits was not affected significantly (87.2 vs. 93.7% for mock communities and 82.4 vs. 88.7% for gnotobiotes; p > 0.50) (Fig. 2d). Note that the diversity of reference communities in the gnotobiotic mice was relatively low (4–12 members; Table 2), resulting in a marked drop in the percentage of positive hit (8–25%) when even just one true member is excluded after filtering because of its low relative abundance (which is an expectable event considering a classical, exponentially decreasing distribution of species occurrence in gut environments).
We next employed the widely used ASV analysis approach to confirm the aforementioned results. Processing of the same simplified communities generated a total number of 42 ± 25 ASVs (min. 16; max. 98) for mock communities (10–58 target species) and 14 ± 8 ASVs (min. 4; max. 25) for gnotobiotes (4–12 target species). Altogether, a marked decrease in spurious taxa was observed compared with OTU clustering, with an average of 8.6 ± 11.8 and 4.4 ± 6.4% spurious sequences for mock and gnotobiotic communities, respectively (comparison of purple box plots in Fig. 2e, top panels, and Fig. 2c). Of note, the DADA2 pipeline used for the ASV approach does not infer sequence variants that are only supported by a single read (singletons) due to a lack of confidence in their existence relative to sequencing errors. Consequently, data corresponding to “no filtering” with the OTU-based approach were not generated. On average, the first spurious ASV occurred at a relative abundance of 0.10 ± 0.32%. By applying the cutoff of 0.25% relative abundance, spurious sequences were completely removed (except for three outlying samples), albeit with a slight drop in positive hits for both mock and gnotobiotic communities (Fig. 2e).
To obtain a more comprehensive view on how filtering thresholds affect the detection of spurious taxa, all datasets (mock and gnotobiotic mice) were processed using a range of relative abundance filtering thresholds (from 0 to 0.5% at increments of 0.05%) after either OTU- or ASV-based processing of raw sequence reads (Fig. 2f). These data indicate that filtering thresholds between 0.1 and 0.3% are appropriate to reduce the occurrence of spurious taxa to <10% of total OTUs at a loss of <15% positive hits. It is important to note that our intent is not to set a strict rule for data processing, and we recognize that filtering strategies must be adapted in a study-specific manner. Instead, we aim to suggest best-practice guidelines and raise awareness for the importance of proper handling of spurious sequences. To further investigate spurious taxa, the threshold of 0.25% relative abundance described in the first paragraph was kept for all further analyses.
Ecology and origin of spurious taxa
To better understand how spurious molecular species arise in amplicon datasets, we investigated the diversity and origin of sequences not matching reference sequences from the defined communities. To this end, we taxonomically classified and evaluated the occurrence of these sequences in >100,000 IMNGS-derived amplicon data [4]. Approximately half of the 678 non-redundant spurious OTUs belonged to the phylum Firmicutes, followed by Bacteroidetes and Proteobacteria. Most of these were characterized by highest prevalence in human- and mouse-derived datasets, with values reaching up to 40% in the thousands of tested samples (Fig. 3a). Over 20% of spurious molecular species detected in human and mouse samples were only found in these habitats (Fig. 3b). This distribution implies that the type of samples multiplexed with target samples within a given sequencing run (in our case mouse and human gut samples) greatly influences the occurrence of spurious OTUs in target samples. Interestingly, >600 of the 678 spurious OTUs occurred in fewer than five of the ten sequencing runs tested, with approximately 450 of them occurring in only one run (Fig. 3c). This observation indicates that the majority of spurious taxa are sporadic cross-contaminations rather than generalist artifacts across sequencing runs, suggesting that fully independent technical replicates would improve data quality. Although most of the spurious taxa were characterized by relative abundances between 0.25 and 2% in the IMNGS-amplicon datasets tested, they represented very dominant populations in a few samples (Fig. 3d).
a Taxonomic profile and ecological distribution. Inner ring: SILVA-based classification of all non-redundant spurious molecular species at the phylum and family level. Outer colored ring: sample type characterized by the highest prevalence for the given taxon. Outer bars: corresponding highest prevalence values. Only samples with relative abundances >0.25% for any given OTU were counted as positive for prevalence calculation. The total numbers of samples considered were: human, 46,153; soil, 29,864; freshwater, 13,977; mouse, 10,409; marine, 8478. b Distribution of the spurious taxa across sample types. The exclusivity of each OTU for any given sample type was assessed using a Z-test: those assumed to be non-specific for any given sample type appear in red (p < 0.05). The total number of IMNGS samples considered for each sample type with at least one of any spurious taxa matching sequences above 0.25% relative abundance was labeled as “Total” (equal numbers in panel a). The number of samples in each type covered by at least one spurious OTU with highest prevalence in this sample type was labeled as “Covered” (i.e., the remaining samples in that category contained also at least one spurious OTU, which was however characterized by highest prevalence in another sample type). c Redundancy of the spurious taxa across 10 sequencing runs. d Violin plots of the distribution of median relative abundances of all spurious molecular species within each sample type as shown in panel b. The average prevalence of the spurious taxa in each sample category is shown as mean ± SD below the x-axis. e The ZymoBIOMICS DNA Standard was sequenced as such or in combination with DNA extracts of cecal contents from germfree mice with or without pre-treatment for free DNA removal as described in detail in the methods. The stacked bar plots indicate the number of spurious taxa and positive hits in the different sample treatment categories with or without relative abundance filtering following the color codes presented in the figure panel.
To test sample matrix effects and the effect of free DNA removal commonly used for low biomass samples on the occurrence of spurious taxa, mock community DNA was used in combination with gut samples from germfree mice that were either pre-treated for free DNA removal or not (Fig. 1b). While all negative controls (DNA isolation and PCR blanks, treated and untreated control germfree samples) generated <100 processed reads, the average sequencing depth was 14,129 ± 4,682 for the target samples. The digestion of free DNA prior to extraction from germfree cecal contents tended to lower the number of spurious taxa detected; however, these spurious taxa still represented at least one third of all OTUs among the three samples tested without relative abundance filtering (Fig. 3e). Using the 0.25% cutoff provided the most congruent results with respect to the expected taxa diversity in the mock community. However, a few spurious taxa were still present when free DNA removal was employed, and the number of positive hits tended to be higher than the expected number of 8 due to the present of satellite OTUs of low abundance (Fig. 3e).
Inadequate taxa filtering inflates alpha-diversity and increases heterogeneity
Spurious taxa, such as those considered in the present study, tend to be low abundant per se: the cumulative relative abundance of spurious molecular species in the reference communities used above was approx. 1% on average. Consequently, spurious sequences are not expected to substantially influence overall composition data, even though the risk exists that authors draw attention onto spurious taxa characterized by statistical significant, yet biologically irrelevant different. In contrast, spurious sequences can have a major influence on diversity (e.g., richness and evenness for alpha-diversity and between-sample distances for beta-diversity), as presented in the next sections. To assess the effect of filtering thresholds on analysis outcomes, we used recently published amplicon data from two comprehensive studies that included a substantial number of samples analyzed by Illumina sequencing of 16S rRNA gene amplicons and for which raw datasets could be retrieved from public repositories. The study by Flores et al. [12] (hereon referred to as Study-1) focused on dynamics of human body microbiomes over time, collecting samples weekly from 85 college-age adults over a 3-month period (in the present work, we focused only on the gut samples). The second study published by Halfvarson et al. [11] (hereon referred to as Study-2) focused on shifts in the human fecal microbiota over time in patients with inflammatory bowel diseases vs. controls and consisted of 683 fecal samples from 137 individuals. We emphasize again that the purpose of the present study was not to confirm or refute data from the literature, but rather to draw attention to an analysis parameter that can profoundly affect results. In all the following analyses, outcomes after the common approach of filtering singletons after de novo OTU clustering were compared with the 0.25% cutoff introduced above (i.e., keeping only those OTUs occurring at a minimum relative abundance of 0.25% in at least one sample).
In both Study-1 and Study-2, filtering OTUs using the 0.25% cutoff led to an approximately two-fold decrease in richness, resulting in an average number of about 200 observed species per sample (Fig. 4a). Interestingly, when looking at individual variations in richness by plotting interquartile ranges (IQR) across the different time points analyzed in the studies, the 0.25% cutoff was associated with a significantly lower heterogeneity in richness (Study-1: IQR = 28.0 ± 17.8 vs. 70.6 ± 34.1, p < 0.001; Study-2: IQR = 17.0 ± 3.2 vs. 49.0 ± 10.4, p = 2.5 × 10−13) (Fig. 4a). Another helpful readout of alpha-diversity is the Shannon effective count, which accounts for the evenness of species distribution and can be, simply speaking, considered as a proxy for the number of most dominant species [21, 28]. Altogether, the trend observed for richness (less heterogeneity after 0.25% filtering) was similar when considering Shannon effective counts (data not shown). However, lower effective counts after stringent filtering (0.25%) were not significantly different for Study-2, showing that Shannon effective counts can be useful to alleviate the influence of lowly abundant species.
a Richness distribution across all individual samples and time points. The bar plots show interquartile ranges (IQR = Q3–Q1) of individual samples (rows) as a proxy for richness variation across the various time points of a given sample. IQRs were ranked by decreasing values after applying the 0.25% cutoff. Colors are: purple, singleton removal; green, 0.25% cutoff filtering (i.e., keeping only those molecular species occurring in at least one sample at a relative abundance >0.25%). b Coefficient of variations calculated on richness values obtained from six fecal samples each sequenced in triplicates in seven different sequencing runs. Sequencing reads were processed using either an OTU- or ASV-based approach (left or right box, respectively). Within runs: variations across triplicates within any given sequencing run. Across runs: variations between the same samples included in the different runs. c Richness and effective microbial richness (see definition in the text) in the ZymoBIOMICS DNA Standard at increasing sequencing depths (x-axis).
In addition to these two published studies, which focused on the analysis of different biological samples (i.e., from multiple individuals at several time points), we also analyzed triplicates of six fecal samples from healthy human adults sequenced several times in-house. This dataset, which consisted of the same samples sequenced in seven different runs, allowed us to evaluate technical reproducibility depending on filtering thresholds (Fig. 1c). Across all runs, the coefficient of variations (CVs) calculated on richness values among the triplicates of each sample within a run were on average <5% and lowest when applying the 0.25% cutoff (Fig. 4b). In contrast, CVs of the richness within samples across sequencing runs increased to 20% on average with a peak at 40% when applying singletons filtering, which dropped to approx. 10% (average) and 30% (maximum) when applying the 0.25% cutoff (Wilcoxon Mann–Whitney test, p = 0.004) (Fig. 4b). Similar results were obtained when using an ASV-based data processing strategy (Fig. 4b). These data clearly indicate that 16S rRNA gene amplicon sequencing, at least as performed in our study, generates richness values that vary markedly between sequencing runs for the same sample, especially when following a loose taxa filtering strategy such as singleton removal.
To assess the effect of sequencing depth on alpha-diversity, we sequenced the ZymoBIOMICS DNA Standard by multiplexing uniquely barcoded amplicon libraries of this reference mock community at different cluster densities. Analysis of all molecular species created after processing without filtering clearly showed that richness inflated as a function of sequencing depth (Fig. 4c). In contrast, the count of taxa occurring above 0.25% relative abundance, referred to as “effective microbial richness” (EMR), was stable and a better proxy of the true diversity within the reference community.
Between-sample comparisons are influenced by filtering strategies
We next assessed how filtering influenced beta-diversity analyses. As in the published studies selected [11, 12], a closed-reference OTU protocol was also used to obtain reference data to which filtering strategies after de novo OTU clustering could be compared.
In Study-1, the median unweighted distance across all individuals was approximately 0.5 after using reference-picking, including a broad range of within-host temporal variations (e.g., some individuals were characterized by more stable profiles than others) (Fig. 5a; left panel), as observed in the original study [12]. As expected, the strongest effect of filtering strategies was observed when using unweighted UniFrac distances: singleton removal was characterized by a higher temporal variation in profiles (median value of approx. 0.6 vs. 0.3 for the 0.25% cutoff) (Fig. 5a; middle panel). Notably, using generalized UniFrac distances decreased the difference between filtering approaches; however, it also widened the range of individual-specific temporal variability around the median, potentially enhancing the discriminatory power between “stable” and “variable” individuals (Fig. 5a; right panel).
Colors are as in Fig. 4. Brown indicates closed-reference picking. a Overtime variations in microbiota profiles for each individual from Study-1 [12] based on reference OTU-picking and unweighted UniFrac distances (left; as in the published study), de novo OTU-picking and unweighted UniFrac distances (middle) or generalized UniFrac distances (right). Bars indicate median distances across all individuals. Individuals were ordered by increasing average distance using the 0.25% cutoff and generalized UniFrac (right panel). b Differences in the phylogenetic makeup of fecal microbiota as in panel a for Study-2 [11].
In Study-2, one of the main findings in the original work was that volatility (i.e., variations overtime within individuals) was highest in patients suffering from Crohn’s disease with an ileal phenotype who underwent ileocecal resection (ICD-r) [11]. We confirmed this finding by using reference-based picking and unweighted distances, as performed in the published manuscript (Fig. 5b; left panel). However, when applying de novo clustering, this difference could only be observed when using the 0.25% cutoff in combination with unweighted distances (Fig. 5b; middle panel). The absence of significant differences when using unweighted distances after singletons removal suggests that the biological signal in this study is overwhelmed by the stochastic “noise” introduced by spurious molecular species (Fig. 5b; middle panel). The absence of differences when applying generalized distances (Fig. 5b; right panel) further suggests that individual-specific temporal variations are attributable to the presence/absence of taxa rather than to changes in composition.
Validation studies
To confirm the utility of the 0.25% cutoff inferred from the aforementioned data generated at the Core Facility Microbiome of the ZIEL Institute for Food & Health (TU Munich, Germany) and at the Institute of Medical Microbiology of the RWTH University Hospital, additional samples were processed and analyzed independently at the Joint Microbiome Facility of the Medical University of Vienna and the University of Vienna (JMF).
First, processing of a log-distributed version of the ZymoBIOMICS Microbial Community Standard (Zymo Research GmbH) containing eight bacterial strains confirmed the advantage of applying the 0.25% filtering approach. Twenty-five replicates of the same DNA sample were sequenced on five sequencing runs (1–8 replicates per run) using either the V4 region combined with CD barcoding or the V3-V4 regions with UD barcoding (two and three runs, respectively). V4/CD vs. V3-V4/UD yielded 31 ± 16 vs. 8 ± 2 ASVs (min: 13 vs. 5, max: 57 vs. 10), respectively. Spurious ASVs (i.e., all sequences with a Hamming distance to the reference >1) were greatly reduced using a 0.25% filtering step, from 73 ± 8 to 2 ± 2 and from 13 ± 15 to 0 in V4/CD vs. V3-V4/UD, respectively (Fig. 6a). This occurred at a loss of 15% true taxa in the case of V4/CD while no change was observed with V3-V4/UD (Fig. 6b). As the highest relative abundance reached by any spurious ASV was 0.28% and the true taxa detected corresponded to dominant members of the standard community, the cumulative relative abundance of true taxa was high (>98%) in all cases (Fig. 6b).
a Fraction of spurious taxa with (green) or without (violet) applying the 0.25% cutoff displayed according to the targeted 16S rRNA gene regions and barcoding strategy used. b Corresponding fraction of positive hits (i.e., amplicons matching the reference strains contained in the mock community). c Average and distribution of richness and distance values between replicates of the same soil sample processed in multiple sequencing runs. ASV amplicon sequence variant, bc barcoding, CD combinatorial dual, no. number, UD unique dual.
Second, peat soil DNA [17] was analyzed to confirm suitability of our filtering approach for non-gut samples. One identical DNA sample was sequenced on three different runs (3–5 replicates per run) using primers 515F/806R (V4 region) and CD barcoding. The ASV table was rarefied to the minimum sum count (9104) and analyzed with or without filtering (i.e., only ASVs observed at a relative abundance >0.25% in at least one replicate were kept). Richness was calculated using ampvis2 [29]. Applying the 0.25% cutoff decreased the number of observed ASVs from 408 ± 71 to 139 ± 5 and, more importantly, the IQR from 101 to 7 (Fig. 6b). Unweighted UniFrac distances within and between runs as calculated using ampvis2 were also compared before and after filtering. Sequences were aligned using MAFFT [30] and phylogeny was inferred using FastTree. Whilst the community makeup in the soil sample varied substantially between sequencing runs without additional filtering, the 0.25% cutoff reduced this variation to the level observed within runs without filtering (Fig. 6c). Replicates within a run were very similar after applying the 0.25% cutoff. Altogether, these data serve as an independent confirmation that stringent filtering delivers more stable values obtained for the exact same sample sequenced in replicates across several sequencing runs.
Source: Ecology - nature.com