Comparison between 16S rRNA and shotgun sequencing data for the taxonomic characterization of the gut microbiota

Relative species abundance distribution

In order to evaluate sample quality, we analysed both the Relative Species Abundance distribution (RSA) and the rarefaction curves. For each sample, we compared the RSA derived by shotgun and 16S sequencing. RSA histograms in logarithmic scale show that the distributions obtained by shotgun and 16S have similar shape at phylum level (Fig. 1a, b). In Fig. 1b, the 16S sample is characterized by a more patchy distribution, having identified less phyla. At phylum level, both strategies produce positively skewed samples in the log₂-transformed distributions, except for 16S outliers, because none of the phyla is significantly rare (Fig. 2a).

Figure 1

RSA histograms in logarithmic scale (Preston plots ²¹) of bacterial abundances in one sample selected as anexample (caeca25): (a) genera sampled by shotgun sequencing, (b) genera sampled by 16S rRNA sequencing, (c) phyla sampled by shotgun sequencing and (d) phyla sampled by 16S sequencing.

Full size image

Figure 2

Box plot of the RSA skewness of bacterial communities at (a) phylum level and (b) genus level. Bacterial communities are sampled with (left) shotgun sequencing and (right) 16S sequencing.

Full size image

On the other hand, at genus level, the two strategies display different shapes (Fig. 1c, d, Supplementary Fig. S1, S2, S3, S4). Indeed, the log₂-transformed distributions derived by shotgun sequencing generally have a skewness closer to zero compared to those obtained by 16S, i.e. are more symmetrical (Figure 2b): a paired Student’s t-test on the skewness shows a significant difference between them (P = 8·10^–6). This indicates that shotgun samples are characterized by a higher sampling size. According to Preston, left-skewed shapes of the RSA can be explained as artefacts of small sample size^21,22, since insufficient sampling of the original space produces a truncation of the left tail of the RSA, increasing its skewness.

In shotgun samples, the RSA skewness at genus level is related to the total number of reads (Supplementary Fig. S5): the shotgun samples with the lowest total number of reads have the largest skewness. Specifically, Supplementary Figure S5 shows that shotgun samples cluster in two groups, one characterized by a low number of reads (# reads < 500,000, 28/78 samples) and a highly skewed RSA (greater than the 16S median), and one with a high total number of reads (# reads > 500,000, 50/78 samples) and a less skewed RSA.

Noticeably, the high-skewness group includes all 9 samples from 1st day, all 15 crop samples from 14th day and 4 out of 18 crop samples from 35th day. The samples collected at day 1 were very poor in terms of biomass and the crop samples contained more feed residues than caecal samples, making the DNA extraction less efficient both in terms of DNA quantity and quality. For the comparative analysis we removed samples with less than 500,000 reads being characterized by a low quality. This choice was corroborated by the analysis of the rarefaction curves, showing that shotgun samples with less than 500,000 reads do not reach a plateau in terms of identified genera (Supplementary Fig. S6). All the 50 samples included in the comparative analysis have a total number of reads > 500,000 and a skewness lower than the median of 16S samples, indicating a good sampling depth. Since included samples were characterized by a high microbial load, we are confident to extend the results of the following analyses only to samples with few contaminant DNA and low cross-contaminations. Nonetheless, we have shown that shotgun samples have a RSA similar to 16S samples when a low number of total reads is available, thus hypothesizing that in differential analyses carried on samples with a low microbial load regime, shotgun sequencing could perform similarly to 16S sequencing or even worse.

For a balanced comparison, also 16S samples corresponding to the discarded shotgun samples were removed.

Differential analysis for the experimental conditions

Since in many situations a metagenomic analysis is used to discriminate between different experimental conditions, we compared the results of differential analysis performed on reads obtained by the two strategies. To this aim, we analysed the fold changes of genera abundances between compartments of the GI tract and between sampling times (Fig. 3 for caeca vs crop, Supplementary Fig. S7 for 14th vs. 35th day) common to both sequencing strategies (288 genera for caeca vs crop, and 246 for 14th vs. 45th day). Comparing the genera abundances between caeca and crop, 16S identified 108 statistically significant differences (adjusted P < 0.05 with DESeq2), while shotgun identified 256; 28 genera, corresponding to 9.7% of total common genera, were not identified as significantly different by either strategy. Among the 104 genera identified to be different between caeca and crop by both sequencing strategies, 93.3% (97/104) showed a concordant fold change. Concerning the genera with different abundance between sampling times, 16S detected 58 statistically significant changes (adjusted P < 0.05 with DESeq2), while 75 were detected by shotgun. The 80% (16/20) of the 20 genera with different abundance between the 14th and 35th day for both sequencing strategies showed a concordant fold change (see Table). The discrepancies seem to be related to detection issues in 16S samples: indeed, all the seven discordant changes in caeca vs crop are caused by the partial or total absence of a genus in 16S samples (as shown in Supplementary Fig. S8). This is possibly due to the fact that these genera are close to the detection limit of 16S strategy, for which we have provided an estimate in the next section. On the other hand, three of the four discordant genera in the 14th vs 35th day showed actual discrepancies between shotgun and 16S changes, not necessarily caused by detection issues (see Supplementary Fig. S9).

Figure 3

Fold changes between caeca and crop in genera identified by both strategies. Some fold changes are shrunk toward zero by the DESeq2 algorithm (see “Methods” section). Points with a statistically significant change for both strategies are represented in blue, for shotgun only in green, for 16S only in orange and without a significant change in cyan (adjusted P > 0.05 with DESeq2). Point size is the log₁₀ of average number of reads from shotgun strategy mapping to each genus. Pearson’s correlation coefficient r and regression line are computed only on points with statistically significant fold changes according to both strategies (“Both” group in figure legend and in Table 1).

Full size image

Table 1 Percentage of genera having a concordant change in abundance in shotgun and 16S samples with a statistically significant change for: both sequencing strategies (1st column), 16S only (2nd column), Shotgun only (3rd column), neither (4th column).

Full size table

Noticeably, shotgun sequencing found 152 statistically significant changes in genera abundance between caeca and crop of chickens that 16S sequencing failed to detect, while 16S found only 4 changes that shotgun sequencing did not identify (Fig. 3).

Genera detection and abundance quantification

The agreement between the taxonomic profiles estimated with the two strategies was further evaluated computing for each sample the Pearson’s correlation coefficient (r) between the taxonomic abundances of genera common to 16S and shotgun sequencing. Overall, we observed a good agreement between the taxonomic abundances found by the two strategies (Fig. 4, Supplementary Fig. S10), with an average correlation of 0.69 ± 0.03 in caeca (all p-values < 10^–11 for correlation significance between 16S-shotgun sample pairs) and 0.75 ± 0.05 in the crop (all p-values < 5·10^–5).

Figure 4

Scatter plot of 16S and shotgun genera abundances of one sample selected as example (caeca25). Histograms display stacked bars, where every column is divided in a part corresponding to the abundance of genera detected by both sequencing strategies (blue) and the other part is relative to genera detected exclusively by only one strategy (red for 16S and green for shotgun). Pearson’s correlation coefficient is computed only for the common genera. Logarithmic (log₂) scale helps to recognize that less abundant genera identified by shotgun sequencing are almost not detected by 16S sequencing.

Full size image

A larger difference is observed between the number of identified taxa by the two strategies. The histogram on the left of Fig. 4 shows that, in the selected sample, a great majority of genera detected by shotgun sequencing are not found by 16S sequencing (green portion of the bars). This phenomenon is mostly observed on the leftmost bins, that represent low abundance genera (Supplementary Fig. S11, S12, S13, S14 for single samples). These results are confirmed in Fig. 5 and Supplementary Table S1, showing that (a) all the phyla detected in 16S samples were identified in shotgun samples and (b) only a small number of genera (about 23% for caeca samples and 11% for crop) is recovered by both strategies. The percentage of reads mapping to the genera identified by both sequencing strategies is large (on average 89% for caeca and 99% for crop). This means that genera identified by both strategies are the most abundant ones, e.g. those mapped by most of the reads. In summary, shotgun sequencing always identifies more taxa than 16S sequencing, as also reflected on the rarefaction curves (Supplementary Fig. S6). Moreover, 16S sequencing predominantly detects taxa that are also identified by shotgun sequencing (133 of 187 genera and 14 of 14 phyla on average) (Fig. 5).

Figure 5

Average number of (a) phyla and (b) genera found within caeca and crop samples. The length of the error bars is equal to the standard deviation computed on all the samples.

Full size image

The relationship between abundances detected by 16S and shotgun metagenomics was further investigated fitting a linear regression model on the abundance of genera common to both strategies in each sample. We considered, for each shotgun-16S pair of samples, the logarithmic abundances obtained with 16S as independent variable and those obtained with shotgun as dependent variable, so that the intercept in this model represents the number of shotgun reads corresponding to genera that are mapped to one single read by 16S sequencing, that we consider as a detection limit. Here, samples with low number of reads (< 500,000) were included for completeness. Results show that the model intercept increases as a function of the total number of reads available in shotgun samples (Fig. 6). The regression intercept and the total number of shotgun reads are positively correlated (Pearson’s coefficient = 0.93, P < 10^–16).

Figure 6

Intercepts of shotgun vs 16S abundance linear regressions of caeca samples against the total number of reads in each set, representing the number of shotgun sequences corresponding to one 16S sequence. Error bars correspond to the confidence interval for the parameter provided by the fit.

Full size image

Hence, given the total number of reads of a shotgun sequencing sample, the genera mapped by a number of reads lower than the model intercept are the most likely to be undetected by 16S strategy. For example, considering the sample depicted in Fig. 4 with a total number of reads equal to 2,972,671 with shotgun strategy, genera with a number of reads approximately < 350 would probably not be detected by 16S sequencing, corresponding to 88% of total detected genera in that shotgun sample.

Sample stratification by experimental condition

Since in many studies an unsupervised analysis (i.e. clustering) based on taxonomic profiling is performed, in order to identify possible sample stratification due to experimental conditions or to other unknown factors, we calculated the Bray–Curtis beta diversity and performed a Principal Coordinate Analysis (PCoA). We considered for every sample a n-dimensional (n = 678) vector of abundances, considering the genera that are common to all samples (a graphical representation is showed in Supplementary Figure S15). A quantitative evaluation of the separation in the PCoA space of samples labelled by experimental conditions was obtained through the mean Silhouette Score (SS) of the samples at genus level. We compared the silhouette scores either on the totality of genera identified by shotgun (“SHOTGUN”) or 16S (“16S”) strategy, or on the subset of genera detected exclusively by shotgun (“SHOTGUNex”) or by 16S (“16Sex”), to evaluate their ability to discriminate between known experimental conditions (compartment of the GI tract and sampling time). Results show that when samples are labelled according to the compartment of the GI tract, the Silhouette Score is high (i.e., close to 1) for both strategies (see Supplementary Tab. S2 and Supplementary Fig. S16). However, the separation between groups is stronger in shotgun samples than in 16S samples. This result remains true even when the PCoA of shotgun samples is computed considering only shotgun-specific genera.

On the other hand, when grouping samples according to sampling time within the same compartment of the GI tract, Silhouette Scores are generally lower for both strategies, while still achieving a good separation (see Fig. 7). Figure 7 plots refer to the Silhouette Scores of Supplementary Table S3, in which full samples (Fig. 7a,b) reach an almost perfect separation, though the low compactness leads to worse scores with respect to GI tract labelling. An interesting result is that abundance profiles of genera found only in shotgun sequencing (SHOTGUNex) have a positive silhouette score, while 16Sex samples have a smaller silhouette score, close to 0 (a value representing low-quality clustering, very close to random). This result highlights that shotgun sequencing detects low-abundance genera that carry significant information about the experimental biological factors, while 16S-specific genera fail to correctly cluster samples based on one of the experimental factors considered (namely sampling time), and a good separation is obtained only when the most abundant genera common to both strategies are considered.

Figure 7

PCoA based on the beta-diversities between samples (Bray–Curtis metric), computed on genera abundances of caeca samples normalized by DESeq2, labelled by sampling time: 14th day (gold), 35th day (cyan). (a, b) with all genera detected respectively by shotgun (a) and 16S (b); with genera found exclusively by shotgun (c) or 16S (d).

Full size image

Source: Ecology - nature.com

Comparison between 16S rRNA and shotgun sequencing data for the taxonomic characterization of the gut microbiota

Relative species abundance distribution

Differential analysis for the experimental conditions

Genera detection and abundance quantification

Sample stratification by experimental condition

Wolbachia affects mitochondrial population structure in two systems of closely related Palaearctic blue butterflies

Ancient CO2 levels favor nitrogen fixing plants over a broader range of soil N compared to present

ITALIAN LANGUAGE

ENGLISH LANGUAGE