Null-model-based network comparison reveals core associations

Null model toolbox

We have developed a software toolbox, anuran, (a toolbox with null models for identification of nonrandom patterns in association networks) that generates random networks and assesses properties of these networks. Three types of networks can be generated in the current implementation: completely randomized networks, degree-preserving networks, and a variation of both networks that keeps a fraction of the edges fixed. Networks without a synthetic CAN, meaning they do not contain any fixed edges, are referred to as negative controls in the remainder of the manuscript, while networks with a synthetic CAN are referred to as positive control networks. In combination, these null models can generate CAN sizes for (1) the situation where all edges are entirely random, (2) the situation where taxa connecting edges are random, but the presence of an edge is not, and (3) the situation where part of a network is random but the remainder is part of a CAN.

For the completely randomized model, a network is initialized with the same nodes as the input network. Edges are then added randomly until the total edge number is equal to the number of edges in the input network. For the degree-preserving model, edges are swapped rather than removed and added back to the network, so that two edges (a, b) and (c, d) become the new edges (a, c) and (b, d). Hence, the model preserves the degree distribution found in the input network and each node has the same degree as it has in the original network, but other centralities such as the betweenness centrality can change. The user specifies both the number of random networks generated for each network (by default 10) and the number of sets (collections) of these networks (by default 50) that are sampled to calculate set sizes.

As stated previously, variations of the above two null models can be used to construct positive control networks. For this procedure, a fraction of edges is extracted from the total union of edges across all networks. For fully randomized networks, these edges are first added, then edges are added until the total number of edges in the original network is reached. For the degree-preserving randomized networks, negative control networks (with preserved degree) are first generated. Then, for each edge in the fixed core, the algorithm attempts to find two edges that can be swapped so the fixed edge is created. If this fails, a random edge is deleted and the fixed edge is introduced, so the degree is not exactly preserved. To swap the edges successfully, it is necessary that each of the nodes participating in a fixed edge has another edge not part of the fixed core. As a result, the degree distribution can change significantly for networks where nodes in the fixed core are disconnected or where the fixed core is very large compared to the positive control network.

It is possible to include nodes without significant associations in the network file as disconnected nodes (orphan nodes) by supplying the network file with the orphan nodes included as nodes without any edges. In this case, the random model reflects a situation where associations are randomly selected from all taxa. However, the degree-preserving networks are not affected by orphan nodes. The inclusion of orphan nodes leads to different estimates for set sizes for the random model that may lead to an overestimation of the significance of a CAN, as most taxa are too rare to acquire associations. Therefore, we ignored the presence of disconnected nodes in our case study.

The toolbox has been implemented in Python 3.6 and consists of both an application programming interface and command-line interface (CLI). Documentation for the toolbox has been included as a supplement (Supplementary File 1), with this and additional vignettes available through the GitHub page at https://github.com/ramellose/anuran. Currently, the CLI pipeline assesses set sizes, (rank-transformed) betweenness, degree, and closeness centrality scores and several network-level properties: degree assortativity, connectivity, diameter, radius, and average shortest path length (Fig. 1). NetworkX implementations of these centrality calculations were used [21].

The software uses a set-of-sets approach to identify CANs. A set is a specific collection of edges, such as the intersection set, which is the collection of edges present across multiple networks. The CANs are identified as differences of specific intersection sets. Hence, the toolbox specifically identifies sets and sets of sets that are likely to be of interest for microbial association networks. These sets represent collections of edges that are only present in one specific fraction of networks and distinguish between less conserved and more conserved edges.

An example with four networks is illustrated with a Venn diagram (Fig. 1c). To obtain the difference of the intersections, the set that includes one or more additional networks is subtracted from the intersection set that includes fewer networks. These sets are referred to as combinations of intersections with fractions or integers, i.e., the intersection 0.5 refers to all intersections of 50% of the networks. Similarly, set of sets are identified by a combination of intersection numbers: the set of sets 6→10 refers to the difference of intersection 6 and intersection 10 and therefore contains no edges present in at least 10 networks. For most analyses, the difference of intersections is preferred over intersections since the intersections are nested. By taking the difference, it is possible to distinguish between more and less conserved associations.

The equations for differences and k-intersections for groups of n networks are given below. The equations only refer to edge sets E, so they do not apply to numbers of matching nodes. The difference is the union of all sets D_i for 1 up to n networks, where the sets D_i contain all edges x present in an edge set E_i but not in the union of all other edge sets

$${{{mathrm{Difference}}}} = mathop {bigcup}limits_{i = 1}^n {D_i} ;{{{mathrm{where}}}};D_i = left{ {x:x in E_i,x ,notin, mathop {bigcup}limits_{begin{array}{*{20}{c}} {j = 1} {i ne j} end{array}}^n {E_j} } right}$$

The k-intersections are unions of intersections S_I. These intersections S_I are sets of groups of edge sets, where the groups I are k-permutations of n and E_i is a single edge set in I. Hence, for a total number of edge sets n, each of the groups I have size k and the collection of all possible groups is indicated as (P_k^n). For the 4-intersection for a group of 40 edge sets, the size of (P_k^n) can be calculated as the binomial coefficient (Big( {begin{array}{*{20}{c}} {40} 4 end{array}} Big)). This mathematical representation is not implemented directly in the software, as the software simply takes the set of all edges present in at least four networks and therefore ignores network identity.

Hence, a k-intersection is the union of all intersections S_I for I in (P_k^n)

$${{{mathrm{Intersection}}}} = mathop {bigcup}limits_{I in P_k^n} {S_I} ;{{{mathrm{where}}}};S_I = left{ {x:x in mathop {bigcap}limits_{i in I} {E_i} } right}$$

Since edges present in at least k networks but not in m networks represent less conserved edges, the difference of the intersections is calculated to distinguish between less conserved and more conserved edges. The difference of two intersections k and m, with S_I and S_J defined identically to S_I in the equation above is then given below

$${{{mathrm{Difference}}}};{{{mathrm{of}}}};{{{mathrm{intersections}}}} = mathop {bigcup}limits_{I in P_k^n} {S_I} backslash mathop {bigcup}limits_{J in P_m^n} {S_J} ;{{{mathrm{where}}}};k < m$$

To compare observed set sizes to set sizes of random networks, the Z-score test is carried out, which identifies set sizes in the input networks that are outside the range of set sizes inferred from groups of random networks. The SciPy normaltest implementation [22] of D’Agostino’s and Pearson’s omnibus normality test is used to test for both kurtosis and skewness [23, 24]. Since this test requires at least 20 observations, a warning is issued if the number of random networks needs to be increased.

The toolbox can also assess centrality scores across networks. To ensure that centralities are not biased by edge number, these are first converted to ranks before a Mann–Whitney U test is used to assess whether the distributions of ranks are similar across groups of observed networks and random networks. The comparisons to random networks are repeated a number of times and parameter-free p values across all comparisons are calculated from the number of successful Mann–Whitney U tests. By default, Benjamini-Hochberg multiple-testing corrections (implemented in the statsmodel package) are carried out on these p values to correct for the number of taxa [25]. The approach for network-level properties is similar, with the software currently supporting assortativity, connectivity, diameter, radius, and the average shortest path length. If the networks are ordered, the toolbox can calculate Spearman correlations of these properties to the network order. For example, users could supply networks constructed across a pH gradient. The results of all analyses are exported to tab-delimited files so they can be further analyzed and visualized in the user’s preferred statistical environment.

Finally, the toolbox includes an option for resampling networks. In this way, the resulting data show how trends in set sizes change as the number of networks is increased. The resulting data can be interpreted as a rarefaction curve, where flattening of the curve suggests that sufficient networks have been collected to identify all edges present in a specific fraction of networks.

Case studies

Gut microbial time series data were collected from 20 women each of whom donated stool samples for over a month, with a sampling frequency close to one sample per day (Vandeputte et al., submitted) [26]. These women also reported data on their menstrual cycle. For each sample, enterotype assignments were carried out as in Vandeputte et al. [27] with Dirichlet multinomial clustering. Samples were assigned to Bacteroides 1, Bacteroides 2, Ruminococcaceae, or Prevotella.

Progression through the menstrual cycle was rescaled to 28 days (the average length of a menstrual cycle) for all women. For days where there was more than one sample, only the first sample was used. Taxa present in less than 50% of participants were discarded from the analysis. Association networks were constructed with fastLSA v1.0 [28] with data rarefied to 10,000 sequences per sample, with correlations inferred across a delay of three time points (α = 0.05). Set sizes were analyzed with anuran, by generating 20 networks per observed network and resampling 100 different groups from these. Positive controls were generated 20 times, with a core size equal to 20% of the union of edges at 10% prevalence (edges present in at least two networks) and at 50% prevalence (edges present in at least ten networks). Set sizes and centralities with a p value below 0.05 for comparisons to values from random networks were considered significantly different from the random networks. The anuran toolbox was also used to assess the effect of increasing the number of participants.

The Walktrap community finding algorithm [29], implemented in the igraph R package v1.2.6 [30], was used to cluster the inferred CAN as the lack of negative edges in the CAN suggested that random walks could sufficiently identify clusters. To visualize enterotype-specific patterns of relative abundance, we computed the mean relative abundance of taxa per individual. We then took the median relative abundances across all individuals who belonged predominantly to the Ruminococcaceae enterotype, an enterotype previously linked to lower stool moisture [27], and subtracted from these all other median relative abundances, giving an estimate of taxa that had high abundance in the Ruminococcaceae enterotype compared to other enterotypes.

For the case study on the sponge microbiome, QIIME-processed data were downloaded from Moitinho et al. [31]. Samples with fewer than 1000 counts were removed and the samples were rarefied to even depth at 1034 sequences. After rarefaction, the abundance data were first filtered for 20% taxon prevalence across all samples, then once more to ensure 20% prevalence across different orders. Counts for removed taxa were retained to preserve the sample sums. After excluding host orders with fewer than 50 samples, 10 orders remained. CoNet v1.1.1 with renormalisation was then used to infer association networks (Faust and Raes [2]). Edges were generated with Pearson correlation, Spearman correlation, mutual information, Bray–Curtis dissimilarity, and Kullback–Leibler distance. Edges were included if at least one method reached significance; only edges with a combined Q-value below 0.05 (estimated using a combination of permutation and bootstrapping) were retained. The CoNet CANs were inferred with anuran generating 20 negative control random networks per host order and resampling these 100 times. For the positive controls, 20 network groups were generated with a core size equal to 20% of the union of edges at 20% prevalence (edges present in at least two networks) and at 50% prevalence (edges present in at least five networks). Set sizes and centralities with a p value below 0.05 for comparisons to values from random networks were considered significantly different from the random networks. CoNet networks were compared to FlashWeave networks [7]. FlashWeave v0.16.0 was run as FlashWeave-S (sensitive set to true and heterogeneous to false), with all other settings set to the default. To compare FlashWeave networks to CoNet networks, anuran generated five randomized networks per order-specific network and resampled these five times.

Prior research indicated that microbial abundance was a significant driver of community structure in sponges [32]. Therefore, taxa in the CAN were compared to taxa reported as indicators of high microbial abundance (HMA) or low microbial abundance (LMA) [32]. CAN network clusters were identified with manta v1.0.0 [33], as this algorithm has been designed to handle negative edges in the CAN. To run the clustering algorithm, default settings were used, except the number of iterations and permutations, which was set to 200. A Chi-squared test was used to compare HMA–LMA predictions to CAN cluster assignments (α = 0.05).

Source: Ecology - nature.com

Null-model-based network comparison reveals core associations

Null model toolbox

Case studies

Waging a two-pronged campaign against climate change

MIT.nano receives American Institute of Architects’s Top Ten Award for sustainable design

ITALIAN LANGUAGE

ENGLISH LANGUAGE