Macroecological laws describe variation and diversity in microbial communities

Data

All the data sets analyzed in this work have been previously published and were obtained from EBI Metagenomics⁶⁹. Previous publications (Supplementart Table 1) report the original experiments and corresponding analysis. In order to test the robustness of the macroecological laws and the modeling framework presented in this work, we considered 7 data sets that differ not only for the biome considered, but also for the sequencing techniques and the pipeline used to process the data. Data sets were selected to represent a wide set of biomes. We considered only data sets with at least 50 samples with more than 10⁴ reads. No data set was excluded a-posteriori.

Sampling and compositional data

In order to study how (relative) abundance varies across communities and species, one needs to remove the effect of sampling noise, as it is not a biologically informative source of variation. By explicitly modeling sampling (Supplementary Note 2), one finds that the probability of observing n reads of species i in a sample with N total number of reads, is given by

$${P}_{i}(n| N)=int_{0}^{1}dx,{rho }_{i}(x),left(begin{array}{*{20}{c}} {n} {N} end{array}right){x}^{n}{(1-x)}^{N-n},,$$

(6)

where ρ_i(x) is the AFD, i.e., the probability (over communities or times) that the relative abundance of i is equal to x. Note that this equation does not assume anything about independence across species or communities. It only assumes the sampling process is carried independently across communities.

Since the random variable x_i, whose distribution is ρ_i(x), is a relative abundance, one has that ∑_ix_i = 1 (i.e., the data are compositional⁷⁰). As discussed in Supplementary Note 2, given the range of variation of the empirical relative abundances, one can substitute Eq. (6) with

$${P}_{i}(n| N)=int_{0}^{infty }dx,{rho }_{i}(x)frac{{(xN)}^{n}}{n!}{e}^{-xN},,$$

(7)

and the condition (sum)_ix_i = 1 to ({sum }_{i}{bar{x}}_{i}=1), where ({bar{x}}_{i}=mathop{int}nolimits_{0}^{infty }dx,{rho }_{i}(x)x) is the mean value of x_i. Under this assumption, one can also take the limits of the integration from 0 to ∞, instead of considering them from 0 to 1, as the contribution of the integrand from 1 to ∞ is negligible.

Note that, because of sampling, the average of a function f(x) over the pdf ρ(x) differs in general from the average of f(n/N) over P(n∣N)

$$int_{0}^{1}dx,rho (x)f(x), ne, sum _{n = 0}^{N}P(n| N)fleft(frac{n}{N}right)=int_{0}^{1}dx,rho (x)sum _{n = 0}^{N}fleft(frac{n}{N}right)frac{{(xN)}^{n}}{n!}{e}^{-xN},,$$

(8)

and the inequality becomes equality only if f(x) is linear. The important difference between the right- and the left-hand side is often neglected in the literature. In fact, the right-hand side is a good approximation of the left-hand size only in the limit xN ≫ 1, which is far from being realized in the data for most of the species. Supplementary Note 2 introduces a method to reconstruct the moments of ρ(x) from the moments of P(n∣N). More generally, I show that it is possible to infer the moment generating function of ρ(x) from the data, which allow to reconstruct the shape of the empirical ρ(x).

Excluding competitive exclusion

A Gamma-distributed AFD implies that all the species present in a community of a biome are present in all the communities from that biome. Therefore, when a species is not observed is because it is undetected due to sampling errors. I test this claim in two different ways. First, it is shown that one can in fact predict the occupancy of a species from its abundance fluctuations. Secondly, I show that a model without true absences is statistically more supported than a model where species are allowed to be absent.

The first way to test this hypothesis is to directly test its immediate prediction: if the absence is a consequence of sampling, one should be able to predict occupancy of a species (the probability that a species is present) simply from its average and variance of abundance (together with the total number of reads of each sample). In particular, assuming a Gamma AFD, the occupancy of species i is given by

$$langle {o}_{i}rangle =1-frac{1}{T}sum _{s}P(0| {N}_{s})=1-frac{1}{T}sum _{s = 1}^{T}{left(1+frac{{bar{x}}_{i}{N}_{s}}{{beta }_{i}}right)}^{-{beta }_{i}},,$$

(9)

where N_s is the total number of reads in sample s, T is the total number of samples, and ({beta }_{i}={bar{x}}_{i}^{2}/{sigma }_{{x}_{i}}^{2}). As shown in Fig. 2 and in Supplementary Fig. 3, this prediction well reproduces the observed occupancy across species. The prediction of Eq. (12) also matches the occupancy of temporal (longitudinal) data Supplementary Fig. 20.

The second, more rigorous, way to test the hypothesis that (most) species are always present is to use model selection. In this context we want to compare two (or more) models that aim at describing the observed number of reads of each species starting from alternative hypothesis. In particular I compare a purely Gamma AFD with a zero-inflated Gamma, which reads

$${varrho }_{i}(x| {vartheta }_{i},{beta }_{i},{bar{x}}_{i})={vartheta }_{i}delta (x)+(1-{vartheta }_{i})frac{1}{Gamma ({beta }_{i})}{left(frac{{beta }_{i}}{{bar{x}}_{i}}right)}^{{beta }_{i}}{x}^{{beta }_{i}-1}exp left(-{beta }_{i}frac{x}{{bar{x}}_{i}}right),,$$

(10)

where ϑ_i is the probability that a species is truly absent in a community and δ( ⋅ ) is the Dirac delta distribution. Our goal is to test whether the ϑ_is are significantly different from zero. Since the two models are nested, one can compare the maximum likelihood estimator in the case ϑ_i = 0 with the (maximum) likelihood marginalized over ϑ (which has prior μ(ϑ)). Given the number of reads ({n}_{i}^{s}) of species i in community s, with N_s total number of reads, one can compute the ratio (Supplementary Note 4)

$${ell }_{i}=frac{mathop{max }nolimits_{bar{x},beta }{prod }_{s}int dx{varrho }_{i}(x| 0,beta ,bar{x})frac{{(x{N}_{s})}^{{n}_{i}^{s}}}{{n}_{i}^{s}!}{e}^{-x{N}_{s}}}{int dvartheta ,mu (vartheta )left(mathop{max }nolimits_{bar{x},beta }{prod }_{s}int dx{varrho }_{i}(x| vartheta ,beta ,bar{x})frac{{(x{N}_{s})}^{{n}_{i}^{s}}}{{n}_{i}^{s}!}{e}^{-x{N}_{s}}right)},,$$

(11)

where μ(ϑ) is a prior over ϑ. If ℓ_i > 1, the model with ϑ_i = 0 is more strongly supported than the model with ϑ ≠ 0. Under Beta prior with parameters 0.25 and 8, one obtains that ℓ_i > 1 in 98.8% of the cases (averaged across data sets, ranging from 94.4 to 99.7%) and ℓ_i > 100 in 97.5% cases (ranging from 92.8 to 99.2%). See Supplementary Note 4 for a more detailed description of the methodology and Supplementary Fig. 6 for results obtained with other priors.

Prediction of macroecological patterns

Given laws #1, #2, and #3, the probability to observe n reads of a randomly chosen species in a sample with N total reads is

$$P(n| N)=int_{-infty }^{infty }deta ,frac{Gamma (beta +n)}{n!Gamma (beta )}{left(frac{{e}^{eta }N}{beta +{e}^{eta }N}right)}^{n}{left(frac{beta }{beta +{e}^{eta }N}right)}^{beta }frac{exp left(-frac{{(eta -mu )}^{2}}{2{sigma }^{2}}right)}{sqrt{2pi {sigma }^{2}}},,$$

(12)

where (eta ={mathrm{log}},(bar{x})). All the properties of species are fully specified by its mean abundance (bar{x}={e}^{eta }). The probability of observing n reads of species with average abundance (bar{x}) in a sample with N total number of reads is therefore

$$P(n| N,bar{x})=frac{Gamma (beta +n)}{n!Gamma (beta )}{left(frac{bar{x}N}{beta +bar{x}N}right)}^{n}{left(frac{beta }{beta +bar{x}N}right)}^{beta },.$$

(13)

The predictions for the patterns shown in Fig. 3 are reported here. The full derivation of this and other patterns is presented in Supplementary Note 8.

The total number of observed species in a sample with N total number of reads can be easily calculated using Eq. (12). The probability of not observing a species is simply P(0∣N). The expected number of distinct species 〈s(N)〉 in a sample with N reads is therefore

$$langle s(N)rangle ={s}_{tot}left(1-P(0| N)right)={s}_{tot}left(1-int_{-infty }^{infty }deta ,frac{exp left(-frac{{(eta -mu )}^{2}}{2{sigma }^{2}}right)}{sqrt{2pi {sigma }^{2}}},{left(frac{beta }{beta +{e}^{eta }N}right)}^{beta }right),,$$

(14)

where s_tot is the total number of species in the biome (including unobserved ones, see Supplementary Note 7). Note that s_tot is (substantially) larger than s_obs, the number of different species observed in the union of all the communities, which can instead be written as

$$langle {s}_{obs}rangle ={s}_{tot}left(1-int_{-infty }^{infty }deta ,frac{exp left(-frac{{(eta -mu )}^{2}}{2{sigma }^{2}}right)}{sqrt{2pi {sigma }^{2}}},{left(prod _{s = 1}^{T}frac{beta }{beta +{e}^{eta }{N}_{s}}right)}^{beta }right),.$$

(15)

Figure 3a shows that the prediction of Eq. (14) correctly matches the data (Supplementary Fig. 13).

The SAD, one of the most studied patterns in ecology and directly related to the Relative Species Abundance³⁵, is defined as the fraction of species with a given abundance. According to our model, the expected SAD is given by

$$langle {Phi }_{n}(N)rangle := frac{langle {s}_{n}(N)rangle }{langle s(N)rangle }=frac{P(n| N)}{1-P(0,N)}=frac{int_{-infty }^{infty }deta ,frac{Gamma (beta +n)}{n!Gamma (beta )}{left(frac{{e}^{eta }N}{beta +{e}^{eta }N}right)}^{n}{left(frac{beta }{beta +{e}^{eta }N}right)}^{beta }frac{exp left(-frac{{(eta -mu )}^{2}}{2{sigma }^{2}}right)}{sqrt{2pi {sigma }^{2}}}}{1-int_{-infty }^{infty }deta ,{left(frac{beta }{beta +{e}^{eta }N}right)}^{beta },frac{exp left(-frac{{(eta -mu )}^{2}}{2{sigma }^{2}}right)}{sqrt{2pi {sigma }^{2}}}},,$$

(16)

where 〈s_n(N)〉 is the number of species with n reads in a sample with N total number of reads. The cumulative SAD is defined as

$$langle {Phi }_{n}^{ ,{> },}(N)rangle := sum _{m = n}^{infty }langle {Phi }_{m}(N)rangle =frac{int mathop{int}nolimits_{-infty }^{infty },{I}_{frac{{e}^{eta }N}{beta +{e}^{eta }N}}(n,beta )frac{exp left(-frac{{(eta -mu )}^{2}}{2{sigma }^{2}}right)}{sqrt{2pi {sigma }^{2}}}}{1-mathop{int}nolimits_{-infty }^{infty }eta ,{left(frac{beta }{beta +{e}^{eta }N}right)}^{beta },frac{exp left(-frac{{(eta -mu )}^{2}}{2{sigma }^{2}}right)}{sqrt{2pi {sigma }^{2}}}},,$$

(17)

where I_p(n, β) is the regularized incomplete Beta function. Figure 3b shows that the Eq. (17) captures the empirical cumulative SAD (Supplementary Fig. 17).

The occupancy probability is defined as the probability that a species is present in a given fraction of communities. This quantity has been extensively studied in a variety of contexts (from genomics⁷¹ to Lego sets and texts⁷²) and has been more recently considered in microbial ecology³⁷. The three macroecological laws predict (see derivation in Supplementary Note 8)

$${p}_{obs}(o)=frac{mathop{int}nolimits_{-infty }^{infty }deta ,sum_{t = 1}^{T}delta left(o-1+frac{1}{T}mathop{sum }nolimits_{s = 1}^{T}{left(frac{beta }{beta +{e}^{eta }{N}_{s}}right)}^{beta }right)frac{exp left(-frac{{(eta -mu )}^{2}}{2{sigma }^{2}}right)}{sqrt{2pi {sigma }^{2}}}mathop{prod }nolimits_{s = 1}^{T}left(1-{left(frac{beta }{beta +{e}^{eta }{N}_{s}}right)}^{beta }right)}{mathop{int}nolimits_{-infty }^{infty }deta ,frac{exp left(-frac{{(eta -mu )}^{2}}{2{sigma }^{2}}right)}{sqrt{2pi {sigma }^{2}}},mathop{prod }nolimits_{s = 1}^{T}left(1-{left(frac{beta }{beta +{e}^{eta }{N}_{s}}right)}^{beta }right)},,$$

(18)

where δ( ⋅ ) is a Dirac delta function. Figure 3c compares the prediction of Eq. (18) with the data (Supplementary Fig. 15).

Occupancy (the fraction of communities where a species is found) and abundance are not independent properties, and their relative dependence is often referred to as occupancy-abundance relationship²¹ Given an average (relative) abundance (bar{x}=exp (eta )), the expected occurrence is

$${langle orangle }_{eta }=1-frac{1}{T}sum _{s = 1}^{T}P(0| {N}_{s},bar{x})=1-frac{1}{T}sum _{s = 1}^{T}{left(frac{beta }{beta +bar{x}{N}_{s}}right)}^{beta },,$$

(19)

Figure 3d shows the comparison between data and predictions (Supplementary Fig. 16). These predictions are also tested for temporal (longitudinal) data in Supplementary Figs. 22–24.

Transition probabilities in longitudinal data

For longitudinal data, in addition to the stationary AFD, one can study the probability ({rho }_{i}(x^{prime} ,t+Delta t| x,t)) that a species i has abundance (x^{prime}) at time t + Δt conditioned on having abundance x at time t. Instead of focusing on the full distribution, we study its first two (conditional) central moments, i.e. the average and variance of the abundance at t + Δt conditioned to abundance x at time t. In the analysis of the data stationarity is assumed (the distribution ({rho }_{i}(x^{prime} ,t+Delta t| x,t)) depends on Δt but not on t). I also assume that the dynamics of different species are governed by similar equations that only differ in their parameters. One would like therefore to average over species, by properly rescaling their abundances. The average over species is potentially problematic, as it could add a spurious effect to the conditional averages. For instance, only species with larger fluctuations would appear for extreme values of the initial abundance. In order to avoid these problems, instead of consider the actual abundance, its cumulative probability distribution value (calculated using the empirical AFD of each species) was used, that is referred as “quantile abundance”. This is equivalent to rank the abundances of each species over communities and use the (relative) ranking of each community instead of the abundance. A value equal to 0 corresponds to the lowest observed abundance, and a value equal to 1 to the highest. By definition, the quantile abundance is always uniformly distributed.

Ruling out demographic stochasticity

Demographic stochasticity can reproduce a Gamma AFD. A birth, death, and immigration process has a Gamma as stationary distribution³⁵. In the limit of large populations sizes, it corresponds to the following equation³⁵

$$frac{dx}{dt}=m-(d-b)x+sqrt{(b+d)x}xi (t),,$$

(20)

where m is the migration rate, while b and d are the per-capita birth and death rate. The Gaussian white noise term ξ(t) has mean zero and time-correlation (langle xi (t)xi (t^{prime} )rangle =delta (t-t^{prime} )). The stationary distribution of this process turns out to be

$$rho (x)=frac{1}{Gamma left(2frac{m}{b+d}right)}{left(frac{b+d}{2(d-b)}right)}^{-2frac{m}{b+d}}{x}^{2frac{m}{b+d}-1}exp left(-2frac{d-b}{b+d}xright),.$$

(21)

The average abundance is equal to (bar{x}=m/(d-b)), while the variance turns out to be ({sigma }_{x}^{2}=(m/2)(b+d)/{(b-d)}^{2}). The square of the coefficient of variation would therefore be equal to (b + d)/(2m).

More generally, one can assume that all the parameters are species dependent, and the population of species i is described by

$$frac{d{x}_{i}}{dt}={m}_{i}-({d}_{i}-{b}_{i}){x}_{i}+sqrt{({b}_{i}+{d}_{i}){x}_{i}}{xi }_{i}(t),,$$

(22)

where (langle {xi }_{i}(t){xi }_{j}(t^{prime} )rangle ={delta }_{ij}delta (t-t^{prime} )) was assumed.

Taylor’s Law and the wide variation of average abundance together imply that m_i/(b_i + d_i) is constant while m_i/(d_i − b_i) varies across species on several orders of magnitudes. This imposes a constraint on the variation of parameter values across species.

For instance, one can consider the scenario where species migrate to local communities from a common species pool (metacommunity). As abundance in the metacommunity varies across species the migration rate is a species-dependent quantity. Under neutrality, the per-capita birth and death rates in the local communities are constant and independent of the identity of the species. In this case m_i depends on the species, while b and d do not. One could recover the Lognormal MAD by imposing that m_i is Lognormally distributed. On the other hand, this model would fail in reproducing Taylor’s law with exponent 2, as it would predict and exponent 1.

More in general, the condition imposed on the parameters corresponds to an unnatural fine-tuned relationship between migration, birth, and death rates. Variation of the average abundance is observed across, at least, 7 orders of magnitudes. In order to reproduce this variation across species and Taylor’s law with exponent 2, the range of variability of (b_i − d_i)/(b_i + d_i) should be of the same order. It is unrealistic that the relative difference between birth and death rates, which have strong and direct connection to fundamental biological processes, vary so much across bacterial species. It is important to underline however, that the model of Eq. (22) can, in fact, for a proper parameterization, explain the observed variation of the data. But the choice of parameters explaining the empirical variation require for achieving this goal requires careful and unrealistic fine-tuning of the microscopic parameters.

Stochastic logistic model

The SLM is defined as

$$frac{d{x}_{i}}{dt}=frac{{x}_{i}}{{tau }_{i}}left(1-frac{{x}_{i}}{{K}_{i}}right)+sqrt{frac{{sigma }_{i}}{{tau }_{i}}}{x}_{i}{xi }_{i}(t),,$$

(23)

where ξ(t) is a Gaussian white noise term with mean zero and correlation (langle {xi }_{i}(t){xi }_{j}(t^{prime} )rangle ={delta }_{ij}delta (t-t^{prime} )). Taylor’s Law and the observed Lognormal MAD constraints the parameter value. The parameters 1/τ_i, K_i and σ_i are the intrinsic growth rate, the carrying capacity and the coefficient of variation of the growth-rate fluctuations. Taylor’s Law requires σ_i = σ (independently of i). Since the average abundance of the SLM is ({bar{x}}_{i}={K}_{i}(1-{sigma }_{i}/2)), if σ_i = σ, the average abundance and the carrying capacity turn out to be proportional to each other. The lognormal MAD implies therefore that the K_is are lognormally distributed. The stationary distribution corresponding to Eq. (23) reads

$${rho }_{i}(x)=frac{1}{Gamma (2{sigma }_{i}^{-1}-1)}{left(frac{2}{{K}_{i}{sigma }_{i}}right)}^{2{sigma }_{i}^{-1}-1}exp left(-frac{2}{{K}_{i}{sigma }_{i}}xright){x}^{2{sigma }_{i}^{-1}-2},.$$

(24)

The parameter τ_i does not affect stationary properties, but determines the timescale of relaxation to the stationary distribution. For small deviation of abundance from the average and for large times, the conditional expected abundance behaves as

$${langle {x}_{i}(t+Delta t)rangle }_{{x}_{i}(t)}={bar{x}}_{i}+left({x}_{i}(t)-{bar{x}}_{i}right){e}^{-frac{Delta t}{{tau }_{i}}},.$$

(25)

From the slopes of Fig. 4g one can then determine the timescales τ_i, which turn out to be approximately equal to 19 h. In Fig. 4 it was assumed τ_i = 19 h for all species.

Equation (23) can emerge as effective description of more complicated coupled equations. For instance, it is possible to show that a Lotka-Volterra system of equation with random interactions reduces to Eq. (23) (with colored noise to be self-consistently determined)⁴². If the coefficient of variation of the interaction coefficient does not increase with the number of species (e.g., if it is constant) then the Lotka-Volterra equations can be effectively approximated with Eq. (23).

The noise term in Eq. (23) can be interpreted as corresponding to environmental fluctuations. These fluctuations are typically known to have a characteristic timescale and are not white^40,41. Supplementary Note 13 and Supplementary Fig. 25 show that colored noise in Eq. (23) does not affect significantly the predictions obtained with the SLM with white noise.

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.

Source: Ecology - nature.com