Intraspecies characterization of bacteria via evolutionary modeling of protein domains

Protein domains show a Gompertzian growth

The protein domain RSA distributions of 3368 bacterial genomes were obtained as detailed in the “Materials and methods” section. Briefly, for each bacterial genome we retrieved all the identifiable protein domains. Then, we computed the RSA by counting the number of protein domains belonging to each protein domain family.

Three evolutionary hypotheses were tested by fitting the empirical RSAs with the Log-Series [Eq. (7)], the Negative Binomial (Eq. (6)) and the Poisson Log-Normal (Eq. (4)) distribution (Fig. 1a). According to the Akaike Information Criterion (AIC)³⁰, in (99.97%) of bacteria the selected model was the Poisson Log-Normal (Fig. 1b). This model performed better than both the Log-Series and the Negative Binomial and described the data well, with an average (R^2) of 0.97 (minimum (R^2)=0.86). The selection of the Poisson Log-Normal model instead of the Negative Binomial or the Log-Series, implies that the protein domains evolution process is characterized by a Gompertzian density regulation function ((g(x)=gamma ln (x+epsilon ))) rather than a linear one ((g(x)=eta x)). This suggests an asymmetric process in which the proliferation rate for low abundant protein domains is faster than for the high abundant ones.

Figure 1

Fit of protein domains RSA. (a) Example of protein domains Preston plot fitted with three different distributions: the Poisson Log-Normal, the Negative Binomial and the Log-Series. Results refer to the bacterial genome (text {GCA}_000717515). The Negative Binomial and the Log-Series fit overlap. This implies that the dispersion parameter (alpha) of the Negative Binomial distribution (see Eq. (6)) is close to zero. The mean and the median of the dispersion parameter obtained for the 3368 bacterial genomes are ({2.67times 10^{-4}}) and ({2.62times 10^{-7}}), in agreement with the observed overlap. (b) Distribution of the difference between the AIC obtained with the Poisson Log-Normal model (PL) and the Log-Series (LS) or the Negative Binomial (NB) model, considering all the 3368 bacterial genomes.

Full size image

Protein domains deactivation is faster than duplication

The examination of the Poisson Log-Normal scale ((mu)) and location ((sigma ^2)) parameters (see Eq. (4) and Supplementary Material) estimated by the fitting procedure for each bacterial genome, allows us to reveal further features of the evolutionary process of protein domains.

First of all, Fig. 2 shows that (mu) has negative values in all bacterial genomes. Recalling that (mu =r/gamma), where r is the growth rate and (gamma) is the multiplicative constant of the Gompertzian function, which must be positive, this implies that the growth rate of protein domains, r, is also negative. Notice that the growth rate can be expressed as the difference between the birth and the death rate, (r=b-d). Hence, a negative r means that the death rate is greater than the birth rate ((d > b)). In the evolutionary model of protein domains, the birth rate b has the meaning of duplication rate, while the death rate d is the rate at which protein domains are deactivated. A negative r hence implies that protein domain deactivation, which is related to the accumulation of events which disrupt the coding sequence of protein domains, happens at a faster rate than the duplication of the whole protein domain sequence within the genome.

Figure 2

Distribution of species according to the model parameters. Scatter plot of Poisson Log-Normal parameters (mu) versus (sigma ^2) obtained fitting the protein domains RSAs. Only species represented by at least 10 different strains were included in the plot, for a total of 1173 bacterial genomes which belong to 48 different species. Different colors represent different species as indicated in the legend.

Full size image

Furthermore, the plot of (mu) as a function of (sigma ^2) (Fig. 2) highlights the negative linear relationship between (mu) and (sigma ^2). Such relationship can also be deduced mathematically.

Starting from the expressions (mu =r/gamma) and (sigma ^2=sigma _e^2 / 2gamma), and after simple algebraic manipulation, we can in fact obtain that (mu = 2rsigma ^2 / sigma _e^2), which explains the negative linear relationship between the two parameters.

Besides the negative relationship, the plot of (mu) versus (sigma ^2) also highlights the presence of clusters of bacterial genomes with similar ecological features, which are pictured in the plot as roughly parallel stripes (Fig. 2). When we depict bacterial strains belonging to the same species using the same color, it emerges that the stripes are related to the bacterial taxonomy. This result motivates us to introduce a new approach to bacterial phylogeny based on the ecological modeling of protein domains and the consequent estimation of the parameters (mu) and (sigma ^2).

Protein domain RSA and evolutionary distance

We propose to calculate the pairwise evolutionary distances between bacteria based on three parameters: the Poisson Log-Normal scale and location parameters discussed above ((mu) and (sigma)), and the density of protein domains in the bacterial genome. Such density describes to which extend the whole bacterial genome is populated with protein domains and it hence constitutes an additional feature of the protein domain ecological dynamics. As detailed in the Materials and Methods, the distance between bacteria is specifically computed as the 3D euclidean distance in the scaled space of (mu), (sigma), and protein domain density. In the following, we refer to such distance as ‘RSA distance’.

To evaluate the bacterial interrelationships derived from the RSA distances, we compared our results with both the bacterial taxonomic classification and the 16S rRNA gene-based phylogeny. Specifically, starting from the RSA distance matrix we computed a hierarchical clustering of bacteria and we compared the resulting clusters with those obtained from the 16S rRNA gene-based distance matrix. Both clustering results were then compared with the bacterial taxonomic classification.

Notice that the usage of both 16S rRNA phylogeny and bacterial taxonomic classification allows us to exploit the complementary information that these two approaches provide, despite their intrinsic connection. Namely, modern microbial taxonomy is mostly based on 16S rRNA gene⁶ and, on the other hand, the cutoffs commonly used in 16S rRNA phylogeny originated from phenotype-based taxonomy³¹. However, while taxonomy allows us to assign human interpretable names to bacteria, to associate such names with phenotypic properties, and to classify bacteria into a predefined hierarchy, 16S rRNA phylogeny provides a quantitative measurement of the evolutionary distance between bacteria that can be compared with the RSA distance without setting any pre-defined threshold. Moreover, the usage of 16S rRNA phylogeny allows us to investigate the bacterial relationships at the intraspecies level, for which the taxonomic classification is not available.

As detailed in the Materials and Methods, 16S rRNA distances were calculated based on the bacterial 16S rRNA gene reference sequences, following the standard procedure³². Taxonomic classification, instead, was retrieved from NCBI and included the following levels: phylum, class, order, family, genus and species. In order to obtain a comparable number of clusters from all three methods, we considered separately each taxonomic level and we cut the 16S rRNA and the RSA -based hierarchical trees so as to get a number of clusters equivalent to the number of taxa available at the selected taxonomic level.

At each taxonomic level, the Normalized Mutual Information (NMI) was used as a measurement of agreement between different clustering solutions³³. Notice that, while the theoretical range of the NMI score is the interval (left[ 0,1right]), NMI is biased towards clustering solutions with more clusters and fewer data points³⁴. Consequently, the baseline of NMI score in practise is not zero and relatively high NMI scores can be an artifact caused by the low ratio between number of bacteria and number of taxonomic groups. To make the comparison fair, we used simulations to calculate the baseline NMI for each taxonomic level (box plots of Fig. 3).

As expected by their intrinsic relationship, taxonomy and 16S rRNA phylogeny show high agreement (red dots in Fig. 3). RSA-based clusters, instead, show a certain deviation from both taxonomy (blue dots in Fig. 3) and phylogeny (green dots in Fig. 3). For both comparisons, however, the NMI scores are still evidently higher than the baseline, signifying that the RSA model captures phylogenetic signals to a certain degree. Comparing the obtained NMI scores with the baseline, we notice that the agreement between RSA-based clusters and both taxonomy and phylogeny increases at lower taxonomic levels, reaching the maximum at species level. Taking as ground truth the taxonomic classification, the total purity of the RSA-based clusters at species level is 0.65, signifying that 65% of bacteria are correctly classified.

Figure 3

Comparison between the three clustering results at different taxonomic levels. NMI scores (y-axis) are calculated as a measurement of agreement between clusters based on: RSA method and taxonomy (blue), 16S rRNA gene and taxonomy (red), RSA method and 16S rRNA gene (green). Different taxonomic levels are considered for the comparison: phylum, class, order, family, genus and species (x-axis). The box plots represent the baselines of NMI score and are based on simulations.

Full size image

To assess the robustness and stability of the RSA-based phylogeny, with regard to the choice of protein domains, we randomly selected subsamples of protein domains in different proportions (from (10%) to 90% of all protein domains). The reconstructed phylogenetic trees were then compared with the phylogenetic tree obtained using all protein domains (see Materials and Methods for details), and the correlation between the trees was calculated (see Supplementary Fig. S6). As expected, with larger proportions of protein domains taken into account, the correlation between subsample-based phylogeny and base phylogeny increases. For larger subsampling proportions, the compared phylogenetic trees are in good agreement: for a subsample with 90% of protein domains, the mean cophenetic correlation is equal to 0.74, and the mean common-nodes-correlation is equal to 0.68. We notice that the common-nodes-correlation is more stable compared to the cophenetic correlation, as expected since cophenetic correlation is affected by the height of the phylogenetic trees. The results suggest that the overall structure of the phylogenies is stable even for smaller subsampling proportions, while subsampling height of the branches correlates with the full-data height only at larger subsampling proportions.

To evaluate the intraspecies composition obtained from the RSA-based clustering, we selected the subset of species for which at least 10 different strains were present in our data (48 species). Among them, we selected the species where hierarchical clustering showed a clear separation of clusters (including outliers) and for which published literature characterizing at least some of the strains was available (6 out of 48 species). For these 6 species, we again assessed the robustness and stability of RSA phylogenies, as detailed in the “Materials and methods” section. Our results suggest (see Supplementary Fig. S7) that the subsample-based phylogenies are in good agreement with the full-data phylogenies, especially for larger subsampling proportions. We notice the correlations is larger than in the case of phylogenetic trees for randomly selected 100 bacteria (Supplementary Fig. S6), especially for certain species (i.e., Xanthomonas citri). This could be attributed to the smaller size of the phylogenetic tree. However, the species with similar phylogenetic tree size still show differences in correlation (i.e., Xanthomonas citri and Francisella tularensis), suggesting that the RSA-based distance matrix between the strains of Xanthomonas citri carries stronger phylogenetic signal. Comparing 6 observed species with the randomly sampled subsets of 100 bacteria, we can analogously conclude that the RSA-captured phylogenetic signal is stronger within the species. In the following, we discuss the results obtained for the 6 selected bacterial species in more details.

Figure 4

(Previous page.) Hierarchical clustering of bacteria at the intraspecies level, comparing solutions obtained by RSA and 16S rRNA method. Each subplot shows a tanglegram with RSA-based dendrogram on the left and 16S rRNA-based dendrogram on the right. Lines connect the same bacteria from two dendrograms. The color/type of the line represents the feature of the bacterium it connects. (a) 22 strains of Xanthomonas citri belong to two different pathovars: A (orange) and (hbox {A}^{mathrm{W}}) (purple). (b) 10 strains of Chlamydia pneumoniae are isolated from different tissues: conjuctival (yellow), respiratory (magenta) and vascular (violet). 9 strains represented with solid line are human (Homo sapiens) pathogens while the one strain represented by dashed line is koala (Phascolarctos cinereus) pathogen. (c) 14 strains of Vibrio cholerae are colored based on their karyotype. 11 strains have two circular chromosomes Chr1 ((sim 3) Mb) and Chr2 ((sim 1) Mb) (magenta). 2 strains have one (sim)4 Mb long circular chromosome (yellow). One strain has three chromosomes Chr1 ((sim)3 Mb), Chr2 ((sim)1 Mb) and Chr3 ((sim)1 Mb) (violet).

Full size image

RSA-based method distinguishes subspecies infecting different hosts

Xanthomonas citri subsp. citri (XCC) and Chlamydia pneumoniae (Cpn) are two species whose subspecies can infect different hosts. Here we show that the RSA-based method correctly discriminates such subspecies even when their divergence is not detected comparing the 16S rRNA gene sequences.

Xanthomonas citri subsp. citri (XCC) is a causal agent of citrus canker type A, a bacterial disease affecting different plants from the genus Citrus. While citrus canker A infects most citrus species, two of its variants, A* and (hbox {A}^{mathrm{W}}), have a much more limited host range with XCC pathotype (hbox {A}^{mathrm{W}}) infecting only Key lime (C. aurantifolia) and alemow (C. macrophylla)². Our data set includes 17 strains of XCC pathotype A and 5 strains of XCC pathotype (hbox {A}^{mathrm{W}})². RSA-based clustering of the 22 XCC strains identifies two separated clusters (Fig. 4a, left) which coincide with the two XCC pathotypes. Concurrently, clustering based on 16S rRNA gene fails to identify the two pathotypes of XCC (Fig. 4a, right). This suggests that even though pathotypes A and (hbox {A}^{mathrm{W}}) have different hosts, their diversification is not reflected in the variability of the 16S rRNA gene. On the other hand, modeling the protein domain RSA of the two pathotypes succesfully captures the different functions of their proteomes.

Another important aspect of the citrus canker is the geographical spread of the disease. The 22 strains of XCC included in our data set have diverse geographical origin. While all (hbox {A}^{mathrm{W}}) strains were sampled from USA, strains of pathotype A originate from USA, Brazil and China. RSA clustering of 17 A-type strains colored by their sampling location shows a geographical pattern (Supplementary Fig. S2) similar to the one obtained by Patané et al.² using a maximum likelihood tree based on 1785 concatenated unicopy genes, with the only exception of strain jx-6 ((text {GCA}_001028285)) coming from China.

For what concerns Chlamydia pneumoniae (Cpn), this is an obligate intercellular parasite which is widespread in human population and causes acute respiratory disease. Besides humans, different animal species can be infected with Chlamydia pneumoniae. Our data set includes 9 strains which infect humans (Homo sapiens) and 1 strain isolated from koala (Phascolarctos cinereus). RSA-based clustering clearly separates such isolate from the group of highly similar human isolates (Fig. 4b, left). This result is confirmed by 16S rRNA-based clustering (Fig. 4b, right) and is in agreeement with previous results in which the comparison of four human-derived isolates and the koala strain LPCoLN ((text {GCA}_000024145)) through whole-genome sequencing showed a much higher variation between human and koala-derived strains than within the human-derived strains³⁵.

Another peculiarity of Chlamydia pneumoniae is tissue tropism. The human-derived strains of Chlamydia pneumoniae can in fact be divided into conjuctival, raspiratory and vascular based on their tissue of origin. Cpn tissue tropism was the focus of the study conducted by Weinmaier et al., where whole-genome sequences of multiple Cpn strains isolated from different human anatomical sites were compared and animal isolates were used as outgroup³. Weinmaier et al. found a good agreement between the anatomical origin of strains and the maximum likelihood phylogenetic tree based on all SNPs. However, they could not obtain a clear separation between anatomical subgroups of Cpn. Our results show that the RSA-based method partially succeeds in separating subspecies related to different tissues (Fig. 4b, left). The RSA-based dendrogram, in fact, shows a cluster of four respiratory bacteria. However, it does not separate the other subspecies by infection site, suggesting that tissue tropism is not entirely captured by our method.

RSA-based method discriminates subspecies with different genome composition

In some cases, subspecies of the same species are characterized by global differences in the genome composition. This is, for example, the case of Vibrio cholerae and Buchnera aphidicola. Here, we show that the RSA-based model is able to capture such differences and to discriminate subspecies with known different genomic peculiarities.

Vibrio cholerae is the causative agent of cholera disease. Its genome is normally composed of two chromosomes: Chr1 ((sim 3) Mb) and Chr2 ((sim 1) Mb). However, some strains show a different karyotype. The two strains (1154text {-}74) ((text {GCA}_000969235)) and (10432text {-}62) ((text {GCA}_000969265)), for instance, underwent the process of chromosomal fusion and possess only one (sim 4) Mb long circular chromosome, which shows a high degree of synteny with the two chromosomes of the more common strains³⁶. The strain (text {TSY}216) ((text {GCA}001045415)), on the other hand, besides having the original two chromosomes, also contains an additional (sim 1) Mb long replicon, which does not share any conserved region with Chr1 and Chr2³⁷. For these reasons, we expect the single- and two-chromosome strains to be phylogenetically closer to each other than to the three-chromosome strain, which contains the extra replicon. The 16S rRNA gene-based clustering, however, does not identify any clear separation between the three types of strains (Fig. 4c, right). As a matter of fact, all the 16S rRNA gene copies of all the Vibrio cholerae strains included in our data set are located on (sim 3) Mb long chromosome, which shows high synteny across all strains. It is therefore not surprising that the comparison of the 16S rRNA genes does not capture the global genomic differences that exist between the considered strains. On the other hand, the results obtained with the RSA-based clustering show a clear distinction of the strains with different genomic structure (Fig. 4c, left). The reason for the success of the RSA-based method lies in the theoretical definition of RSA-based distance. In fact, the RSA-based distance depends on the Poisson Log-Normal location parameter (sigma ^2), which increases with the genome length (Supplementary Fig. S1): by definition, (sigma ^2 = sigma _e^2 / 2gamma), and, while the environmental noise (sigma _e^2) can be reasonably considered independent of the genome length, the density regulation (gamma) is expected to be stronger for smaller genomes, which repesent a scarcer environment with less resources.

Buchnera aphidicola is a bacterial species in mutualistic endosymbiotic relationship with different aphids (members of superfamily Aphidoidea). As many endosymbionts, Buchnera aphidicola underwent the process of genome reduction as an adaptation to the host-associated lifestyle and has a genome with length (<1) Mb. One of the main processes which contributed to genome reduction is gene inactivation followed by progressive gene disintegration³⁸. An intermediate step of this process is pseudogenization. For this reason, we expect the number of pseudogenes within a genome to be related to its evolutionary state and strains with similar numbers of pseudogenes to be phylogenetically similar. Among the 13 strains of Buchnera aphidicola available in our data set, we observe a high variability in the number of pseudogenes, with 12 strains having between 7 and 63, and one strain (JF98, (text {GCA}_000183305)) having 176 pseudogenes. Since the total number of genes (considering both protein coding genes and pseudogenes) is similar in all strains, this implies that JF98 has a smaller proteome compared to the others. Our results show that the divergence of strain JF98 from the others is evident from the RSA-based clustering (Supplementary Fig. S3-a, left), while it is not clear from the 16S rRNA-based one (Supplementary Fig. S3-a, right).

On the other hand, another outstanding property of the Buchnera aphidicola strains is their capability to infect different hosts. Focusing on this aspect, we notice that the RSA-based clustering only partially agrees with a grouping of the strains based on the infected host. The 16S rRNA-based method, instead, finds more coherent results (Supplementary Fig. S3-b), possibly due to the fact that endosymbionts and their hosts tend to co-evolve together once their endosymbiotic relationship starts⁴.

Distinct bacterial isolates detected by the RSA-based method

For several species in our data set, the RSA-based phylogenic tree indicates the presence of outlier strains (see Supplementary Fig. S8). For two of such species (Listeria monocytogenes and Francisella tularensis), in particular, we were able to find bibliographic information at the strain level which provides a biological explanation to our findings.

Listeria monocytogenes is a food-borne pathogenic bacterium which causes listeriosis in humans. From the 48 Listeria monocytogenes strains present in our data set, RSA-based clustering identifies a subgroup of two strains (Supplementary Fig. S4). These two strains, La111 ((text {GCA}_000382925)) and N53-1 ((text {GCA}_000382945)), seem to have very similar proteome composition which differs from that of the other strains. Holch et al. investigated strains La111 and N53-1 in their study on bacterial persistence in Listeria monocytogenes³⁹. Based on whole-genome analysis, the authors found that these two strains, which were isolated 6 years apart from different Danish fish processors, are extremely similar and collectively different from the other analyzed strains, in agreement with our RSA-based results.

Francisella tularensis, instead, is an intracellular bacterium which is a causal agent of tularemia. Our data set includes 25 Francisella tularensis strains from different subspecies: tularensis, holarctica, mediasiatica and novicida. While neither RSA- nor 16S rRNA-based clustering accurately separate subspecies of Francisella tularensis, the RSA method identifies one outlier, strain TIGB03 ((text {GCA}_000248415)) (Supplementary Fig. S5). Unlike the other 24 virulent strains, strain TIGB03 is an attenuated tularensis strain. This strain was described by Modise et al. as an attenuated O-antigen mutant of the virulent strain TI0902 ((text {GCA}_000248435))⁵. Indeed, we notice that 16S rRNA genes of these two strains are identical (Supplementary Fig. S5) and thus 16S rRNA-based clustering is unable to detect the mutant strain. Comparing whole-genome sequences of strains TIGB03 and TI0902, Modise et al. found 31 nonsynonymous point mutations and a 75.9 kb long duplicated region in the mutant strain TIGB03. However, such long duplication does not make TIGB03 an outlier in terms of genome length (the genome length of TIGB03 is 1.97 Mb and that of the other 24 strains included in our study ranges from 1.86 to 2.05 Mb). This indicates that the presence of nonsynomymous mutations led to different proteome composition, which is recognized by our method.

Intraspecies characterization of bacteria via evolutionary modeling of protein domains

Protein domains show a Gompertzian growth

Protein domains deactivation is faster than duplication

Protein domain RSA and evolutionary distance

RSA-based method distinguishes subspecies infecting different hosts

RSA-based method discriminates subspecies with different genome composition

Distinct bacterial isolates detected by the RSA-based method

The impact of restoration methods for Solidago-invaded land on soil invertebrates

Endangered animals and plants are positively or neutrally related to wild boar (Sus scrofa) soil disturbance in urban grasslands

ITALIAN LANGUAGE

ENGLISH LANGUAGE