Environmental stress leads to genome streamlining in a widely distributed species of soil bacteria

A. Strain sampling and isolation

Bradyrhizobium is a commonly occurring genus in soil [21]. Closely related Bradyrhizobium diazoefficiens (previously Bradyrhizobium japonicum) strains were isolated from soil, as previously described [20, 22]. In brief, Bradyrhizobium isolates that formed symbiotic associations with a foundational legume species in the censused region, Acacia acuminata, were isolated from soil sampled along a large region spanning ~300,000 km² in South West Australia, a globally significant biodiversity hotspot [23]. In total 60 soil samples were collected from twenty sites (3 soil samples per site; Supplementary Fig. S1) and 380 isolates were sequenced (19 isolates per site, 5 or 6 isolates per soil sample, each isolate re-plated from a single colony at least 2 times). Host A. acuminata legume plants were inoculated with field soil in controlled chamber conditions and isolates were cultured on Mannitol Yeast agar plates from root nodules (see [20, 22] for details). A total of 374 strains were included in this study after removing 5 contaminated samples and one sample that was a different Bradyrhizobium species; non- Bradyrhizobium diazoefficiens sample removal was determined from 16S rRNA sequences extracted from draft genome assemblies (Method C) using RNAmmer [24].

B. Environmental variation among sampled sites

In this study, I focus on environmental factors (temperature, rainfall, soil pH and salinity) previously identified to impact either rhizobia growth performance, functional fitness or persistence in soil [25,26,27,28] and where a directionality of rhizobial stress response could be attributed with respect to environmental variation present in the sampled region (i.e. stress occurs at high temperatures, low rainfall, high acidity and high salinity). Each environmental factor was standardised to a mean of 0 and a standard deviation of 1, and pH and rainfall scales were reversed to standardise stress responses directions so that low stress is at low values and high stress is at high values for all factors (Supplementary Fig. S2). Additionally, salinity was transformed using a log transformation (log(x + 0.01) to account for some zeroes) prior to standardisation.

C. Isolate sequencing and pangenome annotation

Illumina short reads (150 bp paired-end) were obtained and draft genome assemblies were generated using Unicycler from a previous study [29]. Resulting assemblies were of good assembly quality (99.2% of all strains had >95.0% genome completeness score according to BUSCO [30]; Table S1; assembled using reads that contained nominal 0.016 ± 0.00524% non-prokaryotic DNA content across all 374 isolates, according to Kraken classification [31]). Protein coding regions (CDS regions) were identified using Prokka [32] and assembled into a draft pangenome using ROARY [33], which produced a matrix of counts of orthologous gene clusters (i.e. here cluster refers to a set of protein-coding sequences containing all orthologous variants from all the different strains, grouped together and designated as a single putative gene). Gene clusters that occurred in 99% of strains were designated as ‘core genes’ and used to calculate the ‘efficiency of selection’ [34, 35] (measured as dN/dS, Method G.2) and population divergence measured as Fixation Index ‘Fst’, Method H) across each environmental stress factor. The identified gene clusters were then annotated using eggNOG-mapper V2 [36] and the strain by gene cluster matrix was reaggregated using the Seed ortholog ID returned by eggNOG-mapper as the protein identity. Out of the total 2,744,533 CDS regions identified in the full sample of 374 strains, eggNOG-mapper was able to assign 2,612,345 of them to 91,230 unique Seed orthologs. These 91,230 protein coding genes constituted the final protein dataset for subsequent analyses.

D. Calculation and statistical analysis of gene richness and pangenome diversity along the stress gradient

Gene richness was calculated as the total number of unique seed orthologues for each strain (i.e. genome). Any singleton genes that occurred in only a single strain, as well as ‘core’ genes that occurred in every strain (for symmetry, and because these are equally uninformative with respect to variation between strains) were removed, leaving 74,089 genes in this analysis. Gene richness (being count data) was modelled on a negative binomial distribution (glmer.nb function) as a function of each of the four environmental stressors as predictors using the lme4 package in R [37], also accounting for hierarchical structure in the data by including site and soil sample as random effects.

To rule out potentially spurious effects of assembly quality (i.e. missed gene annotations due to incomplete draft genomes) on key findings, I confirmed no significant association between gene richness and genome completeness (r = 0.042, p = 0.4224, Fig. S3).

Finally, pangenome diversity was calculated as the total number of unique genes that occurred in each soil sample (since multiple strains were isolated from a single soil sample). Pangenome diversity was modelled the same as gene richness, except here soil sample was not included as a random effect.

E. Calculation of network and duplication traits for each gene

I used the seed orthologue identifier from eggNOG-mapper annotations to query matching genes within StringDB ([38]; https://string-db.org/), which collects information on protein-protein interactions. Out of 91,230 query seed orthologues, 73,126 (~80%) returned a match in STRING. For matching seed orthologue hits, a network was created by connecting any proteins that were annotated as having pairwise interactions in the STRING database using the igraph package in R [39]. Two vertex-based network metrics were calculated for each gene: betweenness centrality, which measures a genes tendency to connect other genes in the gene network, and mean cosine similarity, which is a measure of how much a gene’s links to other genes are similar to other genes.

Betweenness centrality was calculated using igraph (functional betweenness). For mean cosine similarity, a pairwise cosine similarity was first calculated between all genes. To do this, the igraph network object was converted into a (naturally sparse yet large) adjacency matrix and the cosSparse function in qlcMatrix in R [40] was used to calculate cosine similarity between all pairs of genes. To obtain an overall cosine similarity trait value for each gene, the average pairwise cosine similarity to all other genes in the network was calculated.

Finally, gene duplication level was calculated for each gene as one additional measure of ‘redundancy’, by calculating the average number of gene duplicates found within the same strain. Duplicates were identified as CDS regions with the same Seed orthologue ID.

F. Gene distribution models

To determine how gene traits predict accessory genome distributions patterns along the stress gradients, I first calculated a model-based metric (hereafter and more specifically a standardised coefficient, ‘z-score’) of the relative tendency of each gene to be found in different soil samples across the four stress gradients (heat, salinity, acidity, and aridity). This was achieved by modelling each gene’s presence or absence in a strain as a function of the four stress gradients, with site and soil sample as a random effect, using a binomial model in lme4 (the structure of the model being the same as the gene richness model, only the response is different). To reduce computational overhead, these models were only run for the set of genes that were used in the gene richness analysis (e.g. after removing singletons and core genes), and which had matching network traits calculated (e.g. they occurred in the STRING database; n = 64,867 genes). Distribution models were run in tandem for each gene using the manyany function in the R package mvabund [41]. Standardised coefficients, or z-scores (coefficient/standard error) indicating the change in the probability of occurrence for each gene across each of the stress gradients were extracted. More negative coefficients correspond to genes that are more likely to be absent in high stress (and vice versa for positive coefficients).

To determine how network and duplication traits influence the distribution of genes across the stress gradient, I performed a subsequent linear regression model where the gene’s z-scores was the response and gene traits as predictors. The environmental stress type (i.e. acidity, aridity, heat and salinity) was included as a categorical predictor, and the interaction between stress category and gene function traits were used to infer the influence of gene function traits on gene distributions in a given stress type (see Supplementary Methods S1 for z-score transformation).

G. Quantifying molecular signals of natural selection on accessory and core genes

To examine molecular signatures of selection in accessory and core genes, I calculated dN/dS for a subsample of the total pool (n=74,089 genes), which estimates the efficiency of selection [34, 35]. Two major questions relevant to dN/dS that are addressed here require a different gene subsampling approach:

(1) Do variable environmental stress responses lead to different dN/dS patterns among accessory genes?

Here, I subsampled accessory genes (total accessory gene pool across 374 strains, 74,089) to generate and compare dN/dS among 3 categorical groups, each representing a different level of stress response based on their z-scores (accessory genes that either strongly increase, decrease or have no change in occurrence as stress increases; n = 1000 genes/category; see Supplementary Methods S2 for subsample stratification details).

For each gene, sequences were aligned using codon-aware alignment tool, MACSE v2 [42]. dN/dS was estimated by codon within each gene using Genomegamap’s Bayesian model-based approach [43], which is a phylogeny-free method optimised for within bacterial species dN/dS calculation (see Supplementary Methods S3 for dN/dS calculation/summarisation; S9 for xml configuration). The proportion of codons with dN/dS that were credibly less than 1 (purifying selection) and those credibly greater than 1 (positive selection) were analysed, with respect to the genes’ occurrence response to stress. Specifically, I modelled the proportion of codons with dN/dS < 1 using a beta regression (suitable for response data expressed as a proportion), with the stress response category as a predictor. The proportion of codons with dN/dS > 1 was overall too low to analyse in this way, so the binary outcome (a gene with any codons with dN/dS > 1 or not) was modelled using a binomial response model with the response categories as predictors (see Supplementary Methods S4 for details of both models).

(2) Does dN/dS among microbial populations vary across environmental stress?

Here, I compared the average change in dN/dS in core genes present across all environments at the population level (i.e. all isolates from one soil sample), which is used here to measure the change in the efficiency of selection across each stress gradient. Core genes were used since they occur in all soil samples, allowing a consistent set and sample size of genes to be used in the population-level dN/dS calculation. This contrasts with the previous section, which quantifies gene-level dN/dS on extant accessory genes that intrinsically have variable presence or absence across soil samples. For computational feasibility, 500 core genes were subsampled (total core 1015 genes) and, for each gene, individual strain variants were collated into a single fasta file based on soil sample membership, such that dN/dS could be calculated separately for each gene within each soil sample (i.e. 60 soil samples × 500 genes = 30,000 fasta files). Each fasta file was then aligned in MACSE and dN/dS calculated using the same methodology for accessory genes (Supplementary Method S3). This enabled the average dN/dS in a sample to be associated with soil-sample level environmental stress variables. Specifically, I modelled the mean proportion of codons with dN/dS < 1 in a soil sample (where the mean was taken over all genes in the soil sample), as a linear function of the sample’s four environmental stress variables in a multiple beta regression (see Supplementary Methods S5 for model details). There was insufficient power to analyse the proportion of codons with dN/dS > 1 due to overall rarity of positive selection (average proportion of genes where at least 1 codon with dN/dS > 1 was ~0.006). This low level of positive selection is expected for core genes which tend to be under strong selective constraint.

H. Calculation and analysis of Fixation index (Fst) along stress gradients

Using the core genome alignment (all SNPs among 1015 core genes) generated previously with ROARY, I computed pairwise environmentally-stratified Fst. Each soil sample (n = 60) was first placed into one of 5 bins based on their distances in total environmental stress space (using all four stress gradients), with the overall goal of generating roughly evenly sized bins such that the environmental similarity of stress was greater within bins than between (see Supplementary Methods S6 and Fig. S4 for clustering algorithm details). Next, fasta alignments were converted to binary SNPs using the adegenet package. Pairwise Fst between all strains (originating from a particular soil sample) within a single bin was calculated using StAMPP in R [44]. For each strain pair, the average of the two stress gradient values was assigned.

The relationship between pairwise Fst and the average stress value was evaluated using a linear regression model with each of the four stress values as predictors. Since the analysis uses pairwise data (thus violating standard independence assumptions), the significance of the relationship was determined using a permutation test (see Supplementary Methods S7 for details).

I. Chromosomal structure analysis of gene loss patterns

To gain insight into structural variation and test for regional hotspots in gene loss along the chromosome, I mapped each gene’s stress response (i.e. probability of loss or gain indicated by each genes z-score) onto a completed Bradyrhizobium genome (strain ‘36_1’ from the same set of 374 strains (Genbank CP067102.1; [45]). Putative CDS positions on the complete genome were determined using Prokka and annotated with SEED orthologue ID’s using eggNOG-mapper. Matching accessory genes derived from the full set of 374 incomplete draft genomes (n = 74,089) were mapped to positions on the complete genome (6274 matches). The magnitude of gene loss or gain (as measured by their standardised z-scores for each environment from the gene distribution models; see Method F) was then modelled across the genome using a one-dimensional spatial smoothing model. This model was implemented in R INLA [46] (www.r-inla.org), and models a response in a one-dimensional space using a Matern covariance-based random effect. The method uses an integrated nested Laplace approximation to a Bayesian posterior distribution, with a cyclical coordinate system to accommodate circular genomes. The model accounts for spatial non-independence among sites and produces a continuous posterior distribution of average z-score predictions along the entire genome, which was then used to visualise potential ‘hotspots’ of gene loss or gain. The modelling procedure was repeated, instead with gene network traits, such that model predictions of similarity and betweenness could be visualised on the reference chromosome.