Notation
The notations described in ANCOM-BC methodology are summarized in Table 1.
Table 1 Summary of notations.
Data preprocessing
We adopted the methodology of ANCOM-II22 as the preprocessing step to deal with different types of zeros before performing DA analysis.
There are instances where some taxa are systematically absent from an ecosystem. For example, a taxon present in a soil sample from a desert might be absent from a soil sample from a rain forest. In such cases, the observed zeros are called structural zeros. Let pij denote the proportion of non-zero samples for the ith taxon in the jth group, and let \({\hat{p}}_{ij}=\frac{1}{{n}_{j}}\mathop{\sum }\nolimits_{k = 1}^{{n}_{j}}I({O}_{ijk}\ne 0)\) denote the estimate of pij. In practice, we declare the ith taxon to have structural zeros in the jth group if either of the following is true:
(a)
\({\hat{p}}_{ij}=0.\)
(b)
\({\hat{p}}_{ij}-1.96\sqrt{\frac{{\hat{p}}_{ij}(1-{\hat{p}}_{ij})}{{n}_{j}}}\le 0\).
If a taxon is declared a structural zero in an experimental group, then for that specific ecosystem the taxon is not used in further analysis. Thus, suppose there are three ecosystems A, B, and C, and suppose taxon X is a structural zero in ecosystems A and B but not in C; then taxon X is declared differentially abundant in C relative to A and B and is not analyzed further. If taxon Y is a structural zero in ecosystem A but not in B and C, then we declare taxon Y differentially abundant in B relative to A as well as in C relative to A, and we compare the absolute abundance of taxon Y between B and C using the methodology described in this section. Taxa identified as structural zeros in all experimental groups are excluded from the following analyses.
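The detection rule above can be sketched in a few lines. This is a minimal illustration, not ANCOM-BC's implementation; the function name and the taxa-by-samples array layout are our own choices.

```python
import numpy as np

def structural_zeros(counts, z=1.96):
    """Flag taxa with structural zeros in one experimental group.

    counts : (n_taxa, n_samples) array of OTU counts for a single group.
    Returns a boolean vector: True if p_hat == 0 or the lower
    normal-approximation bound p_hat - z*sqrt(p_hat(1-p_hat)/n) is <= 0.
    """
    counts = np.asarray(counts)
    n = counts.shape[1]
    p_hat = (counts != 0).mean(axis=1)  # proportion of non-zero samples
    lower = p_hat - z * np.sqrt(p_hat * (1.0 - p_hat) / n)
    return (p_hat == 0) | (lower <= 0)
```

A taxon observed in only 1 of 10 samples, for instance, has \({\hat{p}}_{ij}=0.1\) and a lower bound below zero, so it is flagged.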
In a similar fashion, we address the outlier zeros as well as sampling zeros using the methodology developed in ANCOM-II22.
Model assumptions
Assumption 0.1.
$$E({O}_{ijk}\,|\,{A}_{ijk})={c}_{jk}{A}_{ijk},\\ \mathrm{Var}({O}_{ijk}\,|\,{A}_{ijk})={\sigma }_{w,ijk}^{2},$$
(1)
where \({\sigma }_{w,ijk}^{2}\) denotes the variability between specimens within the kth sample from the jth group; thus \({\sigma }_{w,ijk}^{2}\) characterizes the within-sample variability. In most microbiome studies, researchers do not obtain more than one specimen per subject at a given time; consequently, the variability between specimens within a sample is usually not estimated.
According to Assumption 0.1, in expectation the absolute abundance of a taxon in a random sample is in constant proportion to the absolute abundance in the ecosystem of the sample. In other words, the expected relative abundance of each taxon in a random sample is equal to the relative abundance of the taxon in the ecosystem of the sample.
Assumption 0.2. For each taxon i, Aijk, j = 1, …, g, k = 1, …, nj, are independently distributed with
$$E({A}_{ijk}\,|\,{\theta }_{ij})={\theta }_{ij},\\ \mathrm{Var}({A}_{ijk}\,|\,{\theta }_{ij})={\sigma }_{b,ij}^{2},$$
(2)
where \({\sigma }_{b,ij}^{2}\) = between-sample variation within group j for the ith taxon.
Assumption 0.2 states that, for a given taxon, all subjects within and between groups are independent, and that θij is a fixed parameter rather than a random variable.
Regression framework
From Assumptions 0.1 and 0.2, we have:
$$E({O}_{ijk})={c}_{jk}{\theta }_{ij},\\ \mathrm{Var}({O}_{ijk})=f({\sigma }_{w,ijk}^{2},{\sigma }_{b,ij}^{2}):={\sigma }_{t,ijk}^{2}.$$
(3)
Motivated by the above set-up, we introduce the following linear model framework for log-transformed OTU count data:
$${y}_{ijk}={d}_{jk}+{\mu }_{ij}+{\epsilon }_{ijk},$$
(4)
with
$$E({\epsilon }_{ijk})=0,\\ E({y}_{ijk})={d}_{jk}+{\mu }_{ij},\\ \mathrm{Var}({y}_{ijk})=\mathrm{Var}({\epsilon }_{ijk}):={\sigma }_{ijk}^{2}.$$
(5)
Note that, with a slight abuse of notation for simplicity of exposition, the above log-transformation of the data is inspired by the Box–Cox family of transformations23, which are routinely used in data analysis. Note also that d in the above equation is not exactly log(c) due to Jensen's inequality; it simply reflects the effect of c.
An important distinction from standard ANOVA: Before we describe the details of the proposed methodology, we would like to draw attention to a fundamental difference between the current formulation of the problem and the standard one-way ANOVA model. For simplicity, suppose we have two groups, say a treatment and a control group. Suppose also that there is only one taxon in our microbiome study and that n subjects are assigned to the treatment group and n to the control group. Suppose the researcher is interested in comparing the mean absolute abundance of the taxon in the ecosystems of the two groups. Then, under the above assumptions, the model describing the study is given by:
$${y}_{jk}={d}_{jk}+{\mu }_{j}+{\epsilon }_{jk},\quad j=1,2,\ k=1,2,\ldots ,n.$$
Then trivially the sample mean absolute abundance of the jth group is given by \({\bar{y}}_{j\cdot }=\frac{1}{n}\mathop{\sum }\nolimits_{k = 1}^{n}{y}_{jk}\) and \(E({\bar{y}}_{j\cdot })=\frac{1}{n}\mathop{\sum }\nolimits_{k = 1}^{n}{d}_{jk}+{\mu }_{j}={\bar{d}}_{j\cdot }+{\mu }_{j}\). The difference in the sample means between the two groups is \({\bar{y}}_{1\cdot }-{\bar{y}}_{2\cdot }\), with expectation \(E({\bar{y}}_{1\cdot }-{\bar{y}}_{2\cdot })=({\bar{d}}_{1\cdot }-{\bar{d}}_{2\cdot })+({\mu }_{1}-{\mu }_{2})\). Under the null hypothesis μ1 = μ2, \(E({\bar{y}}_{1\cdot }-{\bar{y}}_{2\cdot })={\bar{d}}_{1\cdot }-{\bar{d}}_{2\cdot }\ne 0\) unless \({\bar{d}}_{1\cdot }={\bar{d}}_{2\cdot }\). Thus, because of the differential sampling fractions, which are sample specific, the numerator of the standard t-test is non-zero under the null hypothesis for these microbiome data. This introduces bias and hence inflates the Type I error. On the other hand, the standard one-way ANOVA model for two groups, which is not applicable to the microbiome data described in this paper, is of the form:
$${y}_{jk}=d+{\mu }_{j}+{\epsilon }_{jk},\quad j=1,2,\ k=1,2,\ldots ,n.$$
Hence, under the null hypothesis μ1 = μ2, \(E({\bar{y}}_{1\cdot }-{\bar{y}}_{2\cdot })=0\), and the standard t-test is appropriate. In this paper we therefore develop methodology to eliminate the bias introduced by the sample-specific sampling fractions. To do so, we exploit the fact that a large number of taxa is observed on each subject and borrow information across taxa to estimate this bias, which is the essence of the following methodology.
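The bias described above is easy to see numerically. In this sketch (the values chosen for the log sampling fractions and the common mean are arbitrary illustrative choices), the naive difference of group means concentrates around \({\bar{d}}_{1\cdot }-{\bar{d}}_{2\cdot }\) rather than zero, even though μ1 = μ2:

```python
import numpy as np

rng = np.random.default_rng(0)
n, mu = 500, 2.0                        # mu1 = mu2 = mu: the null is true
d1 = rng.normal(3.0, 0.5, size=n)       # log sampling fractions, group 1
d2 = rng.normal(1.0, 0.5, size=n)       # log sampling fractions, group 2
y1 = d1 + mu + rng.normal(0.0, 0.1, n)  # y_jk = d_jk + mu_j + eps_jk
y2 = d2 + mu + rng.normal(0.0, 0.1, n)

naive = y1.mean() - y2.mean()           # concentrates near dbar_1 - dbar_2, not 0
```

Here `naive` lands near 2, the difference of the mean log sampling fractions, so a standard t-test would reject the (true) null.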
Bias and variance of the bias estimator under the null hypothesis: From the above model (equation (4)), for each j, note that \(E({\bar{y}}_{ij\cdot })={\bar{d}}_{j\cdot }+{\mu }_{ij}\) and \(E({\bar{y}}_{\cdot jk})={d}_{jk}+{\bar{\mu }}_{\cdot j}\), where the bar represents the arithmetic mean over the corresponding index. Using the least squares framework, we therefore estimate μij and djk as follows:
$${\hat{d}}_{jk}={\overline{y}}_{\cdot jk}-{\overline{y}}_{\cdot j\cdot },\quad k=1,\ldots ,{n}_{j},\ j=1,2,\ldots ,g,\\ {\hat{\mu }}_{ij}={\overline{y}}_{ij\cdot }-{\overline{\hat{d}}}_{j\cdot }={\overline{y}}_{ij\cdot },\quad i=1,\ldots ,m.$$
(6)
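The closed-form estimates in (6) amount to row and column means of the log-count matrix. A minimal sketch (the function name and taxa-by-samples layout are ours):

```python
import numpy as np

def initial_estimates(y):
    """Least-squares estimates of eq. (6) for one group.

    y : (m, n) array of log counts for group j (taxa x samples).
    Returns (d_hat, mu_hat): per-sample offsets and per-taxon means.
    """
    y = np.asarray(y, dtype=float)
    mu_hat = y.mean(axis=1)            # y_bar_{ij.}
    d_hat = y.mean(axis=0) - y.mean()  # y_bar_{.jk} - y_bar_{.j.}
    return d_hat, mu_hat

# Noise-free check: data generated exactly from the model, with the
# offsets centered so they are identifiable.
mu_true = np.array([1.0, 2.0, 3.0])
d_true = np.array([0.5, -0.5])
d_hat, mu_hat = initial_estimates(mu_true[:, None] + d_true[None, :])
```

With centered offsets and no noise, the estimates recover `d_true` and `mu_true` exactly; in general, as noted below, \({\hat{\mu }}_{ij}\) absorbs the mean offset \({\overline{d}}_{j\cdot }\).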
Note that \(E({\hat{\mu }}_{ij})=E({\overline{y}}_{ij\cdot })={\mu }_{ij}+{\overline{d}}_{j\cdot }\). Thus, for each j = 1, 2, …, g, \({\hat{\mu }}_{ij}\) is a biased estimator, and \(E({\hat{\mu }}_{i1}-{\hat{\mu }}_{i2})=({\mu }_{i1}-{\mu }_{i2})+{\overline{d}}_{1\cdot }-{\overline{d}}_{2\cdot }\). Denote \(\delta ={\overline{d}}_{1\cdot }-{\overline{d}}_{2\cdot }\). To begin with, we shall assume there are two experimental groups with a balanced design, i.e., g = 2 and n1 = n2 = n; the methodology is easily extended to unbalanced designs and multigroup settings. Suppose we have two ecosystems and, for each taxon i, i = 1, 2, …, m, we wish to test the hypothesis
$${H}_{0}:{\mu }_{i1}={\mu }_{i2},\\ {H}_{1}:{\mu }_{i1}\ne {\mu }_{i2}.$$
(7)
Under the null hypothesis, \(E({\hat{\mu }}_{i1}-{\hat{\mu }}_{i2})=\delta \ne 0\); the estimator is therefore biased. The goal of ANCOM-BC is to estimate this bias and to modify the estimator \({\hat{\mu }}_{i1}-{\hat{\mu }}_{i2}\) accordingly, so that the resulting estimator, and hence the test statistic, is asymptotically centered at zero under the null hypothesis. First, we make the following observations. Since \(E({\overline{y}}_{ij\cdot })={\overline{d}}_{j\cdot }+{\mu }_{ij}\) and \({\hat{\mu }}_{ij}={\overline{y}}_{ij\cdot }\), \({\hat{\mu }}_{ij}\) is an unbiased estimator of \({\overline{d}}_{j\cdot }+{\mu }_{ij}\). From (5) and the Lyapunov central limit theorem, we have:
$$\frac{{\hat{\mu }}_{ij}-({\mu }_{ij}+{\overline{d}}_{j\cdot })}{{\sigma }_{ij}}{\to }_{d}N(0,1)\quad \mathrm{as}\ n\to \infty ,\\ \mathrm{where}\ {\sigma }_{ij}^{2}=\mathrm{Var}({\hat{\mu }}_{ij})=\mathrm{Var}({\overline{y}}_{ij\cdot })=\frac{1}{{n}^{2}}\mathop{\sum }\limits_{k=1}^{n}{\sigma }_{ijk}^{2}.$$
(8)
Let Σjk denote the m × m covariance matrix of \({\epsilon }_{jk}={({\epsilon }_{1jk},{\epsilon }_{2jk},\ldots ,{\epsilon }_{mjk})}^{T}\), where \({\sigma }_{ii^{\prime} jk}\) is the \((i,i^{\prime} )\)th element of Σjk and \({\sigma }_{ijk}^{2}\) is the ith diagonal element. Furthermore, suppose
Assumption 0.3.
$${\sigma }_{ijk}^{2}<\infty \ \ \forall \,i,j,k,\qquad \frac{\mathop{\sum }\nolimits_{i\ne i^{\prime} }^{m}{\sigma }_{ii^{\prime} jk}}{{m}^{2}}=o(1).$$
Let \({\Delta }_{i}={\hat{\mu }}_{i1}-{\hat{\mu }}_{i2}\), let C0 denote the class of null taxa, and let C1 and C2 denote the two classes of differentially abundant taxa, with corresponding variances \({\nu }_{i0}^{2}\), \({\nu }_{i1}^{2}\), and \({\nu }_{i2}^{2}\) satisfying \({\nu }_{i1}^{2},{\nu }_{i2}^{2} > {\nu }_{i0}^{2}\). Thus, without loss of generality, for κ1, κ2 > 0, let \({\nu }_{i1}^{2}={\nu }_{i0}^{2}+{\kappa }_{1}\) and \({\nu }_{i2}^{2}={\nu }_{i0}^{2}+{\kappa }_{2}\). While this assumption is not a requirement for our method, it is reasonable to assume that the variability among differentially abundant taxa is larger than that among the null taxa; making this assumption speeds up the computation.
Assuming samples are independent between experimental groups, we first estimate \({\nu }_{i0}^{2}=\mathrm{Var}({\hat{\mu }}_{i1}-{\hat{\mu }}_{i2})=\mathrm{Var}({\hat{\mu }}_{i1})+\mathrm{Var}({\hat{\mu }}_{i2})\). Using the estimator stated in (14), we estimate \({\nu }_{i0}^{2}\) consistently as follows:
$${\hat{\nu }}_{i0}^{2}=\mathop{\sum }\limits_{j=1}^{2}{\hat{\sigma }}_{ij}^{2}=\mathop{\sum }\limits_{j=1}^{2}\frac{1}{{n}^{2}}\mathop{\sum }\limits_{k=1}^{n}{({y}_{ijk}-{\hat{d}}_{jk}-{\hat{\mu }}_{ij})}^{2}.$$
(20)
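Equation (20) is a sum over the two groups of the average squared residuals from the fitted linear model. A minimal sketch (function name and array layout are ours):

```python
import numpy as np

def nu0_sq_hat(y1, y2):
    """Estimate nu_{i0}^2 = Var(mu_hat_i1 - mu_hat_i2) per eq. (20).

    y1, y2 : (m, n) log-count arrays for the two groups.
    """
    out = 0.0
    for y in (y1, y2):
        y = np.asarray(y, float)
        n = y.shape[1]
        d_hat = y.mean(axis=0) - y.mean()    # eq. (6)
        mu_hat = y.mean(axis=1)
        resid = y - d_hat[None, :] - mu_hat[:, None]
        out = out + (resid ** 2).sum(axis=1) / n ** 2
    return out

# Noise-free data fit the model exactly, so the estimate is zero.
mu = np.array([1.0, 2.0])
d = np.array([0.3, -0.3])
y_exact = mu[:, None] + d[None, :]
nu = nu0_sq_hat(y_exact, y_exact)
```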
In all subsequent calculations, we plug in \({\hat{\nu }}_{i0}^{2}\) for \({\nu }_{i0}^{2}\). This is similar in spirit to many statistical procedures involving nuisance parameters. The following lemma is useful in the sequel.
Lemma 0.1.
\(\frac{\partial }{\partial \theta }\log f(x)={E}_{f(z| x)}\left[\frac{\partial }{\partial \theta }\log f(z)+\frac{\partial }{\partial \theta }\log f(x| z)\right]\).25
Let \(\Theta ={(\delta ,{\pi }_{1},{\pi }_{2},{\pi }_{3},{l}_{1},{l}_{2},{\kappa }_{1},{\kappa }_{2})}^{T}\) denote the set of unknown parameters; then for each taxon the log-likelihood can be reformulated using Lemma 0.1, as follows:
$$\Theta \leftarrow \arg \mathop{\max }\limits_{\Theta }\mathop{\sum }\limits_{i=1}^{m}\mathop{\sum }\limits_{r=0}^{2}{p}_{r,i}\left[\log \mathrm{Pr}(i\in {C}_{r})+\log f({\Delta }_{i}\,|\,i\in {C}_{r})\right].$$
(21)
Then the E–M algorithm is described as follows:
E-step: Compute the conditional probabilities of the latent variable. Define \({p}_{r,i}=\mathrm{Pr}(i\in {C}_{r}\,|\,{\Delta }_{i})=\frac{{\pi }_{r}\phi \left(\frac{{\Delta }_{i}-(\delta +{l}_{r})}{{\nu }_{ir}}\right)}{{\sum }_{r}{\pi }_{r}\phi \left(\frac{{\Delta }_{i}-(\delta +{l}_{r})}{{\nu }_{ir}}\right)}\), r = 0, 1, 2, i = 1, …, m; these are the conditional probabilities that an observed value follows each component distribution. Note that l0 = 0.
M-step: Maximize the likelihood function with respect to the parameters, given the conditional probabilities.
We shall denote the resulting estimator of δ by \({\hat{\delta }}_{\mathrm{EM}}\).
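The E–M iteration above can be sketched as follows. This is a toy version, not the ANCOM-BC implementation: the component standard deviations are treated as known scalars rather than estimated per taxon, and the initializers are our own crude choices.

```python
import numpy as np

def em_delta(delta_i, nu0, nu1, nu2, n_iter=200):
    """Toy E-M sketch for the bias delta of a 3-component normal mixture.

    delta_i : (m,) observed differences mu_hat_i1 - mu_hat_i2.
    nu0, nu1, nu2 : component standard deviations, treated as known here.
    """
    d = np.asarray(delta_i, float)
    delta = np.median(d)                       # most taxa are null
    l1, l2 = d.min() - delta, d.max() - delta  # crude shift initializers
    pi = np.array([0.8, 0.1, 0.1])
    nus = np.array([nu0, nu1, nu2], float)

    def phi(z):  # standard normal density
        return np.exp(-0.5 * z * z) / np.sqrt(2.0 * np.pi)

    for _ in range(n_iter):
        # E-step: membership probabilities p[r, i]
        means = np.array([delta, delta + l1, delta + l2])
        dens = pi[:, None] * phi((d[None, :] - means[:, None]) / nus[:, None]) / nus[:, None]
        p = dens / dens.sum(axis=0, keepdims=True)
        # M-step: mixing weights, then precision-weighted location updates
        pi = p.mean(axis=1)
        w = p / (nus[:, None] ** 2)
        shifts = np.array([0.0, l1, l2])       # l0 = 0
        delta = (w * (d[None, :] - shifts[:, None])).sum() / w.sum()
        if w[1].sum() > 1e-12:
            l1 = (w[1] * (d - delta)).sum() / w[1].sum()
        if w[2].sum() > 1e-12:
            l2 = (w[2] * (d - delta)).sum() / w[2].sum()
    return delta
```

On simulated data with a dominant null component, the estimate concentrates near the true δ.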
Next we estimate \(\mathrm{Var}({\hat{\delta }}_{\mathrm{EM}})\). Since the likelihood function is not a regular likelihood, it is not feasible to derive the Fisher information. Consequently, we take a simpler, pragmatic approach and derive an approximate estimator of \(\mathrm{Var}({\hat{\delta }}_{\mathrm{EM}})\) using \(\mathrm{Var}({\hat{\delta }}_{\mathrm{WLS}})\), which is defined below. Extensive simulation studies suggest that \({\hat{\delta }}_{\mathrm{EM}}\) and \({\hat{\delta }}_{\mathrm{WLS}}\) are highly correlated (Supplementary Fig. 9), so it appears reasonable to approximate \(\mathrm{Var}({\hat{\delta }}_{\mathrm{EM}})\) by \(\mathrm{Var}({\hat{\delta }}_{\mathrm{WLS}})\).
Let the cardinality of Cr be mr, r = 0, 1, 2. Then
$${\hat{\delta }}_{\mathrm{WLS}}=\frac{{\sum }_{i\in {C}_{0}}\frac{{\Delta }_{i}}{{\hat{\nu }}_{i0}^{2}}+{\sum }_{i\in {C}_{1}}\frac{{\Delta }_{i}-{\hat{l}}_{1}}{{\hat{\nu }}_{i1}^{2}}+{\sum }_{i\in {C}_{2}}\frac{{\Delta }_{i}-{\hat{l}}_{2}}{{\hat{\nu }}_{i2}^{2}}}{{\sum }_{i\in {C}_{0}}\frac{1}{{\hat{\nu }}_{i0}^{2}}+{\sum }_{i\in {C}_{1}}\frac{1}{{\hat{\nu }}_{i1}^{2}}+{\sum }_{i\in {C}_{2}}\frac{1}{{\hat{\nu }}_{i2}^{2}}}\\ =\frac{{\sum }_{i\in {C}_{0}}\frac{{\Delta }_{i}}{{\nu }_{i0}^{2}}+{\sum }_{i\in {C}_{1}}\frac{{\Delta }_{i}-{l}_{1}}{{\nu }_{i1}^{2}}+{\sum }_{i\in {C}_{2}}\frac{{\Delta }_{i}-{l}_{2}}{{\nu }_{i2}^{2}}}{{\sum }_{i\in {C}_{0}}\frac{1}{{\nu }_{i0}^{2}}+{\sum }_{i\in {C}_{1}}\frac{1}{{\nu }_{i1}^{2}}+{\sum }_{i\in {C}_{2}}\frac{1}{{\nu }_{i2}^{2}}}+{o}_{p}(1).$$
(22)
The above expression is of the form
$$\frac{{a}_{1}^{T}{x}_{1}+{a}_{2}^{T}{x}_{2}+{a}_{3}^{T}{x}_{3}}{{a}_{1}^{T}{\bf{1}}+{a}_{2}^{T}{\bf{1}}+{a}_{3}^{T}{\bf{1}}}\equiv \frac{{\alpha }^{T}u}{{\alpha }^{T}{\bf{1}}},$$
(23)
where
(1)
1 = (1, …, 1)T,
(2)
\({a}_{r}={({a}_{r1},{a}_{r2},\ldots ,{a}_{r{m}_{r}})}^{T}:={\left(\frac{1}{{\nu }_{ir}^{2}}\right)}^{T},\ i\in {C}_{r},\ r=0,1,2\),
(3)
\({x}_{r}={({x}_{r1},{x}_{r2},\ldots ,{x}_{r{m}_{r}})}^{T}:={({\Delta }_{i}-{l}_{r})}^{T},\ i\in {C}_{r},\ r=0,1,2\). Note that l0 = 0,
(4)
\(\alpha ={({\alpha }_{1},{\alpha }_{2},\ldots ,{\alpha }_{m})}^{T}\equiv {({a}_{1}^{T},{a}_{2}^{T},{a}_{3}^{T})}^{T}\),
(5)
\(u={({u}_{1},{u}_{2},\ldots ,{u}_{m})}^{T}\equiv {({x}_{1}^{T},{x}_{2}^{T},{x}_{3}^{T})}^{T}\).
For simplicity of notation, we relabel a and x by α and u, respectively. Denote Cov(x) = Cov(u) by Ω, and let \({\omega }_{ii^{\prime} }\) denote the \((i,i^{\prime} )\) element of Ω. As in Assumption 0.3, we make the following assumption:
Assumption 0.4.
$$\frac{\mathop{\sum }\nolimits_{i\ne i^{\prime} }^{m}{\omega }_{ii^{\prime} }}{{m}^{2}}=o(1).$$
(24)
Using the above expressions, we compute the variance as follows:
$$\mathrm{Var}({\hat{\delta }}_{\mathrm{WLS}})=\mathrm{Var}\left(\frac{{\alpha }^{T}u}{{\alpha }^{T}{\bf{1}}}\right)=\frac{\mathop{\sum }\nolimits_{i = 1}^{m}{\alpha }_{i}^{2}{\omega }_{ii}}{{\left(\mathop{\sum }\nolimits_{i = 1}^{m}{\alpha }_{i}\right)}^{2}}+\frac{\mathop{\sum }\nolimits_{i\ne i^{\prime} }^{m}{\alpha }_{i}{\alpha }_{i^{\prime} }{\omega }_{ii^{\prime} }}{{\left(\mathop{\sum }\nolimits_{i = 1}^{m}{\alpha }_{i}\right)}^{2}}.$$
(25)
Recall that (a) for i ∈ C0, \({\omega }_{ii}=\mathrm{Var}({\Delta }_{i})={\nu }_{i0}^{2}=O({n}^{-1})\); (b) for i ∈ C1, \({\omega }_{ii}=\mathrm{Var}({\Delta }_{i})={\nu }_{i1}^{2}={\nu }_{i0}^{2}+{\kappa }_{1}=O(1)\); and (c) for i ∈ C2, \({\omega }_{ii}=\mathrm{Var}({\Delta }_{i})={\nu }_{i2}^{2}={\nu }_{i0}^{2}+{\kappa }_{2}=O(1)\). Note that \({\alpha }_{i}=\frac{1}{\mathrm{Var}({\Delta }_{i})}=\frac{1}{{\omega }_{ii}}\); thus we have:
$$\mathrm{Var}\left(\frac{{\alpha }^{T}u}{{\alpha }^{T}{\bf{1}}}\right)=\frac{\mathop{\sum }\nolimits_{i = 1}^{m}{\alpha }_{i}^{2}{\omega }_{ii}}{{\left(\mathop{\sum }\nolimits_{i = 1}^{m}{\alpha }_{i}\right)}^{2}}+\frac{\mathop{\sum }\nolimits_{i\ne i^{\prime} }^{m}{\alpha }_{i}{\alpha }_{i^{\prime} }{\omega }_{ii^{\prime} }}{{\left(\mathop{\sum }\nolimits_{i = 1}^{m}{\alpha }_{i}\right)}^{2}}=\frac{1}{\mathop{\sum }\nolimits_{i = 1}^{m}{\alpha }_{i}}+\frac{\mathop{\sum }\nolimits_{i\ne i^{\prime} }^{m}{\alpha }_{i}{\alpha }_{i^{\prime} }{\omega }_{ii^{\prime} }}{{\left(\mathop{\sum }\nolimits_{i = 1}^{m}{\alpha }_{i}\right)}^{2}}.$$
(26)
Since \({\nu }_{i0}^{2}=O({n}^{-1})\), \({\nu }_{i1}^{2}=O(1)\), and \({\nu }_{i2}^{2}=O(1)\), it follows that a1i = O(n), a2i = a3i = O(1), and
$$\mathop{\sum }\limits_{i=1}^{m}{\alpha }_{i}={{\bf{1}}}^{T}{a}_{1}+{{\bf{1}}}^{T}{a}_{2}+{{\bf{1}}}^{T}{a}_{3}=\sum _{i\in {C}_{0}}O(n)+\sum _{i\in {C}_{1}}O(1)+\sum _{i\in {C}_{2}}O(1)\\ =O({m}_{0}n)+O({m}_{1})+O({m}_{2})\\ =O({m}_{0}n),\quad \mathrm{if}\ {m}_{0}n\ge \max \{{m}_{1},{m}_{2}\}.$$
(27)
Using these facts and Assumption 0.4 in (26), we get
$$\mathrm{Var}\left(\frac{{\alpha }^{T}u}{{\alpha }^{T}{\bf{1}}}\right)=O({m}_{0}^{-1}{n}^{-1})+\frac{\mathop{\sum }\nolimits_{i\ne i^{\prime} }^{m}{n}^{-1}{m}^{-1}{\alpha }_{i}\,{n}^{-1}{m}^{-1}{\alpha }_{i^{\prime} }\,{\omega }_{ii^{\prime} }}{{n}^{-2}{m}^{-2}{\left(\mathop{\sum }\nolimits_{i = 1}^{m}{\alpha }_{i}\right)}^{2}}\\ =O({m}_{0}^{-1}{n}^{-1})+\frac{1}{{m}^{2}}\,\frac{\mathop{\sum }\nolimits_{i\ne i^{\prime} }^{m}{n}^{-1}{\alpha }_{i}\,{n}^{-1}{\alpha }_{i^{\prime} }\,{\omega }_{ii^{\prime} }}{{\left(\mathop{\sum }\nolimits_{i = 1}^{m}{n}^{-1}{m}^{-1}{\alpha }_{i}\right)}^{2}}\\ =O({m}_{0}^{-1}{n}^{-1})+\frac{1}{{m}^{2}}\,\frac{O(1)\,o({m}^{2})}{O(1)}\\ =O({m}_{0}^{-1}{n}^{-1}).$$
(28)
Thus, under Assumption 0.4 regarding \({\omega }_{ii^{\prime} }\), the contribution of the covariance terms to the above variance expression is negligible as long as m is very large compared with n, which is usually the case. Hence
$$\mathrm{Var}({\hat{\delta }}_{\mathrm{WLS}})=\mathrm{Var}\left(\frac{{\alpha }^{T}u}{{\alpha }^{T}{\bf{1}}}\right)=O({m}_{0}^{-1}{n}^{-1}).$$
(29)
Furthermore, appealing to the Cauchy–Schwarz inequality, we get
$$\mathrm{Cov}({\hat{\mu }}_{i1}-{\hat{\mu }}_{i2},{\hat{\delta }}_{\mathrm{WLS}})\le \sqrt{\mathrm{Var}({\hat{\mu }}_{i1}-{\hat{\mu }}_{i2})\mathrm{Var}({\hat{\delta }}_{\mathrm{WLS}})}\\ \le O({n}^{-1/2})O({m}_{0}^{-1/2}{n}^{-1/2})=O({n}^{-1}{m}_{0}^{-1/2}).$$
(30)
Hence, as long as m0 is large, the contributions of \(\mathrm{Var}({\hat{\delta }}_{\mathrm{WLS}})\) and \(\mathrm{Cov}({\hat{\mu }}_{i1}-{\hat{\mu }}_{i2},{\hat{\delta }}_{\mathrm{WLS}})\) relative to \(\mathrm{Var}({\hat{\mu }}_{i1}-{\hat{\mu }}_{i2})\) are negligible.
Neglecting the covariance term in (26), and letting \({\hat{C}}_{r}\) denote the estimator of Cr, r = 0, 1, 2, from the E–M algorithm, define
$$\widehat{\mathrm{Var}}({\hat{\delta }}_{\mathrm{WLS}})=\frac{1}{{\sum }_{i\in {\hat{C}}_{0}}\frac{1}{{\hat{\nu }}_{i0}^{2}}+{\sum }_{i\in {\hat{C}}_{1}}\frac{1}{{\hat{\nu }}_{i1}^{2}}+{\sum }_{i\in {\hat{C}}_{2}}\frac{1}{{\hat{\nu }}_{i2}^{2}}},$$
(31)
an estimator of \(\mathrm{Var}({\hat{\delta }}_{\mathrm{WLS}})\) under Assumption 0.4. Then
$$\widehat{\mathrm{Var}}({\hat{\delta }}_{\mathrm{WLS}}){\to }_{p}\frac{1}{\mathop{\sum }\nolimits_{i = 1}^{m}{\alpha }_{i}}=\frac{1}{{\sum }_{i\in {C}_{0}}\frac{1}{{\nu }_{i0}^{2}}+{\sum }_{i\in {C}_{1}}\frac{1}{{\nu }_{i1}^{2}}+{\sum }_{i\in {C}_{2}}\frac{1}{{\nu }_{i2}^{2}}}\quad \mathrm{as}\ m,n\to \infty .$$
(32)
We performed extensive simulations to evaluate the bias and variance of \({\hat{\delta }}_{\mathrm{EM}}\) and \({\hat{\delta }}_{\mathrm{WLS}}\). The scatter plots (Supplementary Fig. 9) of \({\hat{\delta }}_{\mathrm{EM}}\) against \({\hat{\delta }}_{\mathrm{WLS}}\) are almost perfectly linear, with correlation coefficient nearly 1 in all cases, suggesting that \({\hat{\delta }}_{\mathrm{WLS}}\) is a very good approximation to \({\hat{\delta }}_{\mathrm{EM}}\). The variances of \({\hat{\delta }}_{\mathrm{EM}}\) and \({\hat{\delta }}_{\mathrm{WLS}}\) are roughly of order \({n}^{-1}{m}_{0}^{-1}\), as expected, and both estimators are approximately unbiased (Supplementary Table 6).
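Given class assignments from the E–M fit, the WLS estimator (22) and its plug-in variance (31) are short precision-weighted sums; a minimal sketch (function names and argument layout are ours):

```python
import numpy as np

def delta_wls(delta_i, classes, l, nu_sq):
    """Weighted least-squares bias estimator of eq. (22).

    delta_i : (m,) observed differences mu_hat_i1 - mu_hat_i2.
    classes : (m,) labels in {0, 1, 2} from the E-M fit.
    l       : component shifts (l0, l1, l2), with l0 = 0.
    nu_sq   : (m,) estimated variance for each taxon's class.
    """
    delta_i = np.asarray(delta_i, float)
    shifts = np.asarray(l, float)[np.asarray(classes)]
    w = 1.0 / np.asarray(nu_sq, float)
    return (w * (delta_i - shifts)).sum() / w.sum()

def var_delta_wls_hat(nu_sq):
    """Plug-in variance estimator of eq. (31): one over the total precision."""
    return 1.0 / (1.0 / np.asarray(nu_sq, float)).sum()

# Two null taxa at 2.0 and one taxon shifted by l2 = 3: the bias is 2.0.
est = delta_wls([2.0, 2.0, 5.0], [0, 0, 2], (0.0, -3.0, 3.0), [1.0, 1.0, 1.0])
```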
Hypothesis testing for two-group comparison
For taxon i, we test the following hypothesis
$${H}_{0}:{\mu }_{i1}={\mu }_{i2},\\ {H}_{1}:{\mu }_{i1}\ne {\mu }_{i2}$$
using the following test statistic which is approximately centered at zero under the null hypothesis:
$${W}_{i}=\frac{{\hat{\mu }}_{i1}-{\hat{\mu }}_{i2}-{\hat{\delta }}_{\mathrm{EM}}}{\sqrt{{\hat{\sigma }}_{i1}^{2}+{\hat{\sigma }}_{i2}^{2}}}.$$
(33)
From Slutsky’s theorem, we have:
$${W}_{i}{\to }_{d}N(0,1)\quad \mathrm{as}\ m,n\to \infty .$$
(34)
If the sample size is not very large and/or the number of non-null taxa is very large, then we modify the above test statistic as follows:
$${W}_{i}^{* }=\frac{{\hat{\mu }}_{i1}-{\hat{\mu }}_{i2}-{\hat{\delta }}_{\mathrm{WLS}}}{\sqrt{{\hat{\sigma }}_{i1}^{2}+{\hat{\sigma }}_{i2}^{2}+\widehat{\mathrm{Var}}({\hat{\delta }}_{\mathrm{WLS}})+2\sqrt{({\hat{\sigma }}_{i1}^{2}+{\hat{\sigma }}_{i2}^{2})\widehat{\mathrm{Var}}({\hat{\delta }}_{\mathrm{WLS}})}}}.$$
(35)
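In code, the statistics (33) and (35) differ only in which bias estimate is subtracted and whether the variance of the bias estimator enters the denominator. A sketch with our own function names:

```python
import numpy as np
from math import erf, sqrt

def ancombc_stat(mu1, mu2, s1_sq, s2_sq, delta_hat, var_delta=0.0):
    """Eq. (33) when var_delta == 0, eq. (35) otherwise."""
    base = s1_sq + s2_sq
    den = np.sqrt(base + var_delta + 2.0 * np.sqrt(base * var_delta))
    return (mu1 - mu2 - delta_hat) / den

def p_value(w):
    """Two-sided p value against the standard normal limit of eq. (34)."""
    return 2.0 * (1.0 - 0.5 * (1.0 + erf(abs(w) / sqrt(2.0))))

# Under the null, the bias-corrected statistic is centered at zero:
w0 = ancombc_stat(3.0, 1.0, 0.5, 0.5, delta_hat=2.0)
```

With `w0 = 0`, the two-sided p value is 1; at the usual 1.96 cutoff it is approximately 0.05.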
To control the FDR in the presence of multiple comparisons, we recommend adjusting the raw p values with the Holm–Bonferroni method26 or the Bonferroni27,28 correction rather than the Benjamini–Hochberg (BH) procedure29. Research has shown that the former are more appropriate for controlling the FDR when p values are not accurate30, whereas the BH procedure controls the FDR only under independence or certain special correlation structures, such as positive regression dependence among taxa29,31. In our simulation studies, since the absolute abundances for each taxon are generated independently, we compared ANCOM-BC results adjusted by the Bonferroni correction (Fig. 4) and by the BH procedure (Supplementary Fig. 10). FDR control by the Bonferroni correction is clearly more conservative, while the BH procedure keeps the FDR around the nominal level (5%); consequently, ANCOM-BC has greater power when using the BH procedure.
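For completeness, the Holm–Bonferroni step-down adjustment recommended above can be written in a few lines (a sketch, not the paper's code):

```python
import numpy as np

def holm_bonferroni(p):
    """Holm-Bonferroni step-down adjusted p values."""
    p = np.asarray(p, float)
    m = p.size
    order = np.argsort(p)
    adj = np.empty(m)
    running = 0.0
    for rank, idx in enumerate(order):
        # multiply the k-th smallest p value by (m - k + 1), enforcing
        # monotonicity of the adjusted values, and cap at 1
        running = max(running, (m - rank) * p[idx])
        adj[idx] = min(1.0, running)
    return adj
```

For example, `holm_bonferroni([0.01, 0.04, 0.03, 0.005])` yields `[0.03, 0.06, 0.06, 0.02]`.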
Hypothesis testing for multigroup comparison
In some applications, for a given taxon, researchers are interested in drawing inferences regarding DA in more than two ecosystems. For example, for a given taxon, researchers may want to test whether there exists at least one experimental group that is significantly different from others, i.e., to test
$${H}_{0,i}:{\cap }_{j\ne j^{\prime} \in \{1,\ldots ,g\}}\,{\mu }_{ij}={\mu }_{ij^{\prime} },\\ {H}_{1,i}:{\cup }_{j\ne j^{\prime} \in \{1,\ldots ,g\}}\,{\mu }_{ij}\ne {\mu }_{ij^{\prime} }.$$
Similar to the two-group comparison, after obtaining the initial estimates \({\hat{\mu }}_{ij}\) and \({\hat{d}}_{jk}\), setting the reference group r (e.g., r = 1), and obtaining the estimator of the bias term \({\hat{\delta }}_{rj}\) through the E–M algorithm, the final estimators of the mean absolute abundances of the ecosystems (in log scale) are obtained by transforming \({\hat{\mu }}_{ij}\) of (6) into:
$${\hat{\mu }}_{ij}^{* }:=\left\{\begin{array}{ll}{\hat{\mu }}_{ir}, & j=r,\\ {\hat{\mu }}_{ij}+{\hat{\delta }}_{rj}, & j\ne r\in \{1,\ldots ,g\}.\end{array}\right.$$
(36)
Thus, based on (18) and the E–M estimator of δrj, as \(m,\min ({n}_{j},{n}_{j^{\prime} })\to \infty\),
$${\hat{\mu }}_{ij}^{* }-{\hat{\mu }}_{ij^{\prime} }^{* }\,{\to }_{p}\,\left\{\begin{array}{ll}0 & \text{if taxon }i\text{ is not differentially abundant between groups }j\text{ and }j^{\prime} ,\\ {\mu }_{ij}-{\mu }_{ij^{\prime} } & \text{otherwise}.\end{array}\right.$$
(37)
Similarly, the estimator of the sampling fraction is obtained by transforming \({\hat{d}}_{jk}\) of (6) into
$${\hat{d}}_{jk}^{* }:=\left\{\begin{array}{ll}{\hat{d}}_{rk}, & j=r,\\ {\hat{d}}_{jk}-{\hat{\delta }}_{rj}, & j\ne r\in \{1,\ldots ,g\}.\end{array}\right.$$
(38)
Then, by (13) and the E–M estimator of δrj,
$${\hat{d}}_{jk}^{* }{\to }_{p}{d}_{jk}-{\overline{d}}_{r\cdot }\quad \mathrm{as}\ m,\min ({n}_{j},{n}_{j^{\prime} })\to \infty ,$$
(39)
which indicates that we can estimate the sampling fractions only up to an additive constant (\({\overline{d}}_{r\cdot }\)).
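The transformation (36) is a simple shift of each non-reference group mean by its estimated bias; a minimal sketch (the function name and vector layout are ours):

```python
import numpy as np

def correct_group_means(mu_hat, delta_hat):
    """Bias-corrected group means of eq. (36) for one taxon.

    mu_hat    : (g,) initial estimates mu_hat_{ij}, j = 1, ..., g.
    delta_hat : (g,) E-M bias estimates delta_hat_{rj} relative to the
                reference group r, encoded with delta_hat[r] = 0 so the
                reference mean is left unchanged.
    """
    return np.asarray(mu_hat, float) + np.asarray(delta_hat, float)

# Reference group r = 0 is untouched; groups 2 and 3 are shifted.
mu_star = correct_group_means([1.0, 2.5, 3.0], [0.0, -0.5, 0.5])
```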
Define the test statistic for pairwise comparison as:
$${W}_{i,jj^{\prime} }=\frac{{\hat{\mu }}_{ij}^{* }-{\hat{\mu }}_{ij^{\prime} }^{* }}{\sqrt{{\hat{\sigma }}_{ij}^{2}+{\hat{\sigma }}_{ij^{\prime} }^{2}}},\quad i=1,\ldots ,m,\ j\ne j^{\prime} \in \{1,\ldots ,g\}.$$
(40)
For computational simplicity, inspired by Williams-type tests32,33,34,35, we reformulate the global test with the following test statistic:
$${W}_{i}=\mathop{\max }\limits_{j\ne j^{\prime} \in \{1,\ldots ,g\}}| {W}_{i,jj^{\prime} }| ,\quad i=1,\ldots ,m.$$
(41)
Under the null, \({W}_{i,jj^{\prime} }{\to }_{d}N(0,1)\); thus we can construct the null distribution of Wi by simulation, i.e., for each taxon i:
(a)
Generate \({W}_{i,jj^{\prime} }^{(b)} \sim N(0,1),\ j\ne j^{\prime} \in \{1,\ldots ,g\},\ b=1,\ldots ,B.\)
(b)
Compute \({W}_{i}^{(b)}=\mathop{\max }\limits_{j\ne j^{\prime} \in \{1,\ldots ,g\}}| {W}_{i,jj^{\prime} }^{(b)}| .\)
(c)
Repeating the above steps B times (e.g., B = 1000) yields the null distribution of Wi as \({({W}_{i}^{(1)},\ldots ,{W}_{i}^{(B)})}^{T}.\)
Finally, the p value is calculated as
$${p}_{i}=\frac{1}{B}\mathop{\sum }\limits_{b=1}^{B}I({W}_{i}^{(b)} > {W}_{i}),\quad i=1,\ldots ,m,$$
(42)
and the Bonferroni correction is applied to control the FDR.
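Steps (a)-(c) and the p value of (42) can be sketched for a single taxon as follows (a sketch; the function name, seeding, and default B are our own choices):

```python
import numpy as np

def max_test_pvalue(w_pair, n_groups, B=1000, seed=0):
    """Simulated-null p value for W_i = max_{j != j'} |W_{i,jj'}| (eq. (41)).

    w_pair : observed pairwise statistics W_{i,jj'} for one taxon.
    """
    rng = np.random.default_rng(seed)
    n_pairs = n_groups * (n_groups - 1) // 2
    w_obs = np.max(np.abs(w_pair))
    # B draws of the null statistic: max of |N(0,1)| over all pairs
    w_null = np.abs(rng.standard_normal((B, n_pairs))).max(axis=1)
    return (w_null > w_obs).mean()           # eq. (42)
```

An extreme observed statistic yields a p value near 0, while an all-zero statistic yields a p value of 1.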
Reporting summary
Further information on research design is available in the Nature Research Reporting Summary linked to this article.