Comparing alternatives to the fixed degree sequence model for extracting the backbone of bipartite projections

Preliminaries

A bipartite network captures connections between nodes of one type (agents) and nodes of a second type (artifacts). Throughout this section, we use the ecological case of Darwin’s Finches to provide a concrete example^24,25. On his voyage to the Galapagos Islands on the H.M.S. Beagle, Darwin observed that only some species of finches lived on each island. These patterns can be represented as a bipartite network in which finch species (the agent nodes) are connected to the islands (the artifact nodes) where they are found²⁶. A bipartite network can be represented as a binary matrix in which the agents are arrayed as rows, and the artifacts are arrayed as columns. We use ({mathbf {B}}) to denote a bipartite network’s representation as a matrix, where (B_{ik}=1) if agent i is connected to artifact k, and otherwise is 0. The sequence of row sums and the sequence of column sums of ({mathbf {B}}) are called the agent and artifact degrees sequences, respectively. These sequences are among the bipartite network’s most significant features and are known to have implications for bipartite projections and backbones^15,27,28. In the ecological case, the agent degree sequence captures the number of islands where each species is found, while the artifact degree sequence captures the number of species found on each island.

The projection of a bipartite network is a weighted unipartite co-occurrence network in which a pair of agents is connected by an edge with a weight equal to their number of shared artifacts. For example, the bipartite projection of Darwin’s finch network is a species co-occurrence network in which a pair of finch species is connected by an edge with a weight equal to the number of islands where they are both found. We use ({mathbf {P}}) to denote the matrix representation of a bipartite projection, which is computed as ({{mathbf {B}}}{{mathbf {B}}}^T), where ({mathbf {B}}^T) indicates the transpose of ({mathbf {B}}). In a projection ({mathbf {P}}), (P_{ij}) indicates the number of times agents i and j were connected to the same artifact k in ({mathbf {B}}). The diagonal entries of ({mathbf {P}}), (P_{ii}), are equal to the agent degrees, but in practice are ignored.

The backbone of a bipartite projection is a binary representation of ({mathbf {P}}) that contains only the most ‘important’ or ‘significant’ edges. For example, the backbone of a species co-occurrence network connects pairs of species if they are found on a significant number of the same islands, which might be interpreted as evidence that the two species do not compete for resources and perhaps are symbiotic. We use ({mathbf {P}}’) to denote the matrix representation of the backbone of ({mathbf {P}}). Because multiple methods exist for deciding when an edge is significant and thus should be preserved in the backbone, we use (mathbf{P }^{‘{text {M}}}) denote a backbone extracted using method M. It is important to note that for a given bipartite projection, there is no ‘true’ backbone, but only backbones corresponding to specific backbone methods M. The backbone extracted using FDSM (i.e. (mathbf{P }^{‘{text{FDSM}}})) may be similar or different from a backbone extracted using another method such as SDSM (i.e. (mathbf{P }^{‘{text {SDSM}}})), and these similarities and differences depend on the information that is considered by the respective methods when determining whether edges’ weights are significant. It is these similarities and differences that we explore in the four studies below.

Backbone extraction methods that were originally developed for non-projection weighted networks are often applied to weighted bipartite projections. One simple method preserves an edge in the backbone if its weight in the projection exceeds some global threshold T. However, when (T = 0), which is common, the backbone will be dense and have a high clustering coefficient because each artifact of degree d induces (d(d-1)/2) edges in the backbone²⁹. Using (T > 0) can yield a sparser and less clustered backbone^30,31,32, but still yields highly clustered networks in which low-degree nodes are excluded while high-degree nodes are preserved¹⁹. More sophisticated methods, including the disparity filter¹⁹ and likelihood filter²⁰, aim to overcome these limitations of the global threshold method by using a different threshold for each edge based on a null model. However, all methods that were developed for non-projection weighted networks have the same shortcoming when applied to weighted bipartite projections: they ignore information about the artifacts, which is lost when generating the projection¹⁸. In the ecological case, the global threshold, disparity filter, and likelihood filter methods all decide whether two species should be connected in the backbone only by examining how many islands these two species are both found on, but do not consider the characteristics of those islands, including how many other species are found there, or even how many islands there are. Therefore, although these methods are promising for extracting the backbone from non-projection weighted networks, different methods are required for extracting the backbone from a bipartite projection.

Bipartite ensemble backbone models

Bipartite ensemble backbone models decide whether an edge’s observed weight (P_{ij}) is significantly large, and thus whether a corresponding edge should be included in the backbone by comparing it to an ensemble of random bipartite networks. Let ({mathscr {B}}) be the set of all bipartite networks (mathbf {B^*}) having the same number of agents and artifacts as ({mathbf {B}}). In the ecological case, (mathbf {B^*}) might be viewed as representing a possible world containing the same species and islands, but in which locations of species on islands is different, and likewise ({mathscr {B}}) is the set of all such possible worlds. The bipartite ensembles used in backbone models take a subset ({mathscr{B}}^{text{M}}) of ({mathscr {B}}), subject to certain constraints M, and impose a probability distribution on it. In all models except the SDSM, the uniform probability distribution is imposed on ({mathscr{B}}^{text{M}}), that is, each element of the ensemble is equally likely. The backbone is then extracted from the projection of ({mathbf {B}}) by using the distribution of edge weights arising from projections of members of the ensemble to evaluate their statistical significance.

We use (P^*_{ij}) to denote a random variable equal to ((mathbf {B^*}mathbf {B^*}^T)_{ij}) for (mathbf {B^*}~in ~{mathscr {B}}^{text {M}}). That is, (P^*_{ij}) is the number of artifacts shared by i and j in a bipartite network randomly drawn from ({mathscr {B}}^{text {M}}). In the ecological case, (P^*_{ij}) represents the number of islands that are home to both species i and j in a possible world, while the distribution of (P^*_{ij}) is the distribution of the number of islands shared by species i and j in all possible worlds.

Decisions about which edges should appear in a backbone extracted at the statistical significance level (alpha) are made by comparing (P_{ij}) to (P^*_{ij})

$$begin{aligned} P_{ij}’= {left{ begin{array}{ll} 1 &{} quad {text { if }} Pr (P^*_{ij} ge P_{ij}) < frac{alpha }{2}, 0 &{} quad {text {otherwise.}} end{array}right. } end{aligned}$$

This test includes edge (P’_{ij}) in the backbone if its weight in the observed projection (P_{ij}) is uncommonly large compared to its weight in projections of members of the ensemble (P^*_{ij}). We use a two-tailed significance test in the studies below because, in principle, an edge’s weight in the observed projection could be uncommonly larger or uncommonly smaller than its weight in projections of members of the ensemble, however a one-tailed test may also be used. In the ecological case, two species are connected in the backbone if their number of shared islands in the observed world is uncommonly large compared to their number of shared islands in all possible worlds.

There are many ways that ({mathscr {B}}) can be constrained³³, with each set of constraints describing a particular ensemble ({mathscr {B}}^{text {M}}), which is used in a particular ensemble backbone model M to yield a particular backbone ({mathbf {P}}^{‘M}). In the case of ensembles used to extract the backbone of bipartite projections, our focus in this paper, two broad types of constraints are common²³. First, ensembles can be distinguished by what they constrain: only the number of edges, the degrees of the agent nodes, the degrees of the artifact nodes, or the degrees of both the agent and artifact nodes. Second, ensembles can be distinguished by how they impose these constraints: the constraints can be satisfied exactly, or only on average. In statistical physics, ensembles that impose exact or ‘hard’ constraints are known as microcanonical, while ensembles that satisfy constraints on average or impose ‘soft’ constraints are known as canonical⁹.

Prior work on these ensembles generally adopts either a theoretical focus on the ensembles themselves, or an applied focus on the consequences of ensemble choice. In the theoretical literature, some (primarily mathematicians) have aimed to characterize the properties of ensembles, such as estimating the cardinality of the ensemble of matrices with fixed rows and columns (below, we call this ensemble ({mathscr{B}}^{{text{FDSM}}}))³⁴. Others (primarily physicists) have aimed to identify conditions under which ensembles are equivalent or non-equivalent, typically interpreting ensembles as representing thermodynamic systems^35,36,37. In the applied literature, the focus is not on identifying fundamental properties of ensembles, but instead on understanding the implications of choosing a particular ensemble when detecting a particular pattern, such as nestedness³⁸ or community structure^23,27. The present work falls into this latter group: we are not directly concerned with identifying fundamental properties of ensembles, but instead on identifying the consequences of ensemble choice, with the ultimate goal of offering practical guidance to applied researchers wishing to extract the backbone of a bipartite projection.

In the remaining subsections below, we first describe the FDSM in terms of its ensemble. We then present four potential alternative backbone models whose ensembles differ only slightly from FDSM, in terms of either what they constrain or how they impose constraints. We then turn to exploring the consequences of choosing one of these alternatives over FDSM when extracting a backbone.

Fixed degree sequence model (FDSM)

In the fixed degree sequence model (FDSM), (mathbf {B^*}~in ~{mathscr {B}}^{{text{FDSM}}}) are constrained to have the same agent and artifact degree sequences as ({mathbf {B}}). That is, FDSM constrains the degrees of both the agent and artifact nodes, and requires that these constraints are satisfied exactly, making it a tightly-constrained microcanonical ensemble. Adopting the FDSM implies, for example, that in all possible worlds a given species is found on exactly the same number of islands, and a given island is home to exactly the same number of species. The distribution of (P^*_{ij}) arising from ({mathscr {B}}^{{text{FDSM}}}) is unknown, but can be approximated by uniformly sampling (mathbf {B^*}) from ({mathscr {B}}^{text{FDSM}}), constructing (mathbf {P^*}), and saving the values (P^*_{ij}). In the studies below, we use 1000 samples of (mathbf {B^*}) generated using the ‘curveball’ algorithm, which is among the fastest methods to sample ({mathscr {B}}^{text{FDSM}}) uniformly at random^39,40. The FDSM has been used to extract the backbone of bipartite projections of, for example, movies co-liked by viewers²¹ and conference panel co-participation by scholars^41,42.

The FDSM offers an intuitively appealing approach to extracting the backbone of bipartite projections because it fully controls for both bipartite degree sequences, which are known to be responsible for many of the projection’s structural characteristics^15,16. However, because the distribution of (P^*_{ij}) must be computed via Monte Carlo sampling, it is computationally costly, making it impractical for all but relatively small bipartite projections. There are at least three distinct computational challenges. First, although the curveball algorithm is the fastest among existing methods for randomly sampling a bipartite graph with fixed degree sequences (i.e. for sampling (mathbf {B^*}) from ({mathscr {B}}^{text{FDSM}})), it still can require several seconds per sample for large graphs. Second, once a (mathbf {B^*}) has been sampled, constructing each (mathbf {P^*}) requires matrix multiplication, which must be performed repeatedly and has complexity of at least ({mathscr {O}}(n^{2.37}))⁴³. Finally, computing an edge’s p value (i.e. (Pr (P^*_{ij} ge P_{ij}))) with sufficient precision to achieve a specified familywise error rate that controls for Type-I error inflation due to multiple testing²² can require these sampling and multiplication steps to be performed a very large number of times (see Supplementary Text S2).

These computational challenges have led researchers to develop other backbone models^3,9,18. Many such models exist, however here we are focused on identifying methods that yield backbones similar to what would be obtained using FDSM, and thus which may serve as computationally-feasible alternatives to FDSM. Therefore, we consider only those models whose ensembles involve at least one of the two types of constraints imposed by FDSM. That is, we consider models that either (1) impose exact constraints, or (2) impose constraints on both the agent and artifact degrees.

Fixed fill model (FFM)

In the fixed fill model (FFM), (mathbf {B^*}~in ~{mathscr {B}}^{{text {FFM}}}) are simply constrained to contain the same number of 1s as ({mathbf {B}}). That is, the FFM constrains only the number of edges, but requires that this constraint is satisfied exactly. Adopting the FFM implies, for example, that in all possible worlds only the total number of species-island pairs is fixed, but any given species may be found on a different number of islands and any given island may be home to a different number of species. The distribution of (P^*_{ij}) arising from ({mathscr {B}}^{{text {FFM}}}) has not been described before, but is derived in Supplementary Text S1.1. We call it a Jacobi distribution because it is related to Jacobi polynomials.

Fixed row model (FRM)

In the fixed row model (FRM), (mathbf {B^*}~in ~{mathscr {B}}^{{text {FRM}}}) are constrained to have the same agent degree sequence as ({mathbf {B}}), but have unconstrained artifact degree sequences. That is, the FRM constrains the degrees of the agent nodes, and requires that this constraint is satisfied exactly. A canonical variant of the FRM, the (hbox {BiPCM}_r), also constrains the degrees of the agent nodes, but only requires this constraint to be satisfied on average; we do not consider it here because it involves neither of FDSM’s constraints⁹. Adopting the FRM for backbone extraction implies, for example, that in all possible worlds a given species is found on the same number of islands, but a given island may be home to a different number of species. The distribution of (P^*_{ij}) arising from ({mathscr {B}}^{{text {FRM}}}) is hypergeometric (see Supplementary Text S1.2), and for this reason it is sometimes referred to as the hypergeometric model^22,23,44. The FRM has been used to extract the backbone of bipartite projections of, for example, movies co-starring actors²², papers co-written by authors²², parties co-attended by women⁴⁴, majority opinions joined by Supreme Court justices⁴⁴, and microRNAs co-associated with diseases⁴⁵.

Fixed column model (FCM)

In the fixed column model (FCM), (mathbf {B^*}~in ~{mathscr {B}}^{{text {FCM}}}) are constrained to have the same artifact degree sequence as ({mathbf {B}}), but have unconstrained agent degree sequences. That is, the FCM constrains the degrees of the artifact nodes, and requires that this constraint is satisfied exactly. A canonical variant of the FCM, the (hbox {BiPCM}_c), also constrains the degrees of the artifact nodes, but only requires this constraint to be satisfied on average; we do not consider it here because it involves neither of FDSM’s constraints⁹. Adopting the FCM for backbone extraction implies, for example, that in all possible worlds a given species may be found on a different number of islands, but a given island is home to the same number of species. The distribution of (P^*_{ij}) arising from ({mathscr {B}}^{{text {FCM}}}) has not been described before, but is derived in Supplementary Text S1.3, where we show it is Poisson-binomial.

Stochastic degree sequence model (SDSM)

Finally, the stochastic degree sequence model (SDSM) takes ({mathscr {B}}^{{text {SDSM}}}) to be all binary (m times n) matrices, but also gives a process for generating these matrices with different probabilities. Each (mathbf {B^*}) is generated by filling the cells (B^*_{ik}) with a 0 or 1 depending on the outcome of an independent Bernoulli trial with probability (p^*_{ik}). The distribution of the random variable (P^*_{ij}) arising from ({mathscr {B}}^{{text {SDSM}}}) is Poisson-binomial with parameters which can be computed using the (p^*_{ik}) (see Supplementary Text S1.4)^27,46. There are many ways to choose (p^*_{ik}), but in the studies below we choose (p^*_{ik}) so that it approximates (Pr (B^*_{ik} = 1)) for (mathbf {B^*}~in ~{mathscr {B}}^{{text{FDSM}}}). This choice of (p^*_{ik}) ensures that the SDSM constrains the degrees of both the agent and artifact nodes, but only requires these constraints to be satisfied on average. Adopting such a version of SDSM implies, for example, that in each possible world a given species may be found on many or few islands and a given island may be home to many or few species, but the average number of islands on which a given species lives in all possible worlds and the average number of species that live on an given island in all possible worlds matches these values the observed world. The SDSM has been used to extract the backbone of bipartite projections of, for example, legislators co-sponsoring bills^{1,18,47,48,49}, zebrafish (Danio rerio) sharing operational taxonomic units⁵⁰, countries sharing exports³, and genes expressed in genesets⁵¹.

Source: Ecology - nature.com