in

Reconstruction of plant–pollinator networks from observational data

Network reconstruction from observational data

The typical field study of plant–pollinator interactions involves recording instances of potential pollinators (such as insects) visiting plants within a prescribed observation area and over a prescribed period of time. We will refer to these records as visitation data. Network ecologists analyze visitation data by constructing networks of plant and pollinator species, where a connection between two species indicates that a plant-pollinator interaction exists between them.

However, the meaning of edges in ecological networks is not always clear31. One popular way to transform visitation data into networks is to connect two species when they interact “enough”—say when a pollinator species is seen on the reproductive organ of a plant species a specified number of times—but in this case the precise meaning of an edge will depend on the details of the data collection and the choices made in the analysis. How many visits do we take as evidence of a plant–pollinator interaction? A single visit is probably not enough—it might well be an error or misobservation. Is two enough, or ten, or a hundred? And what about observations that were missed entirely? Other methods of analysis transform the data in different ways, for instance encoding them as weighted networks, possibly with some statistical processing along the way32. Even in this case, however, the edges still just count numbers of visits (perhaps transformed in some way), so the resulting networks are effectively histograms in disguise, recording only potential interactions rather than true biological connections.

A more principled approach to network construction begins with a clear definition of what relationship (or relationships) a network’s edges encode33. We argue that network ecology often calls for a network of preferred interactions. In the context of plant-pollinator networks the edges of such a network indicate that pollinators preferentially visit certain plant species and they encode a variety of mechanisms that constrain species interactions, such as temporal or spatial uncoupling (i.e., species that do not co-occur in either time or space), constraints due to trait mismatches (e.g., proboscis size very different from corolla size), and physiological-biochemical constraints that prevent the interactions (e.g., chemical barriers). (One can regard preferred interactions as being the opposite of the “forbidden links” described in refs. 34,35,36). Preferred interactions are arguably the relevant ones for instance when analyzing the reaction of a network to abrupt changes: when one removes a plant species from a system, for example, the pollinators that prefer it will have to modify their behavior7,37,38. The interactions we consider are binary—either a species prefers another species or it doesn’t—so the network does not encode varying strengths of interaction.

While the data gathered in a typical field study are certainly reflective of preferred interactions, they are, for many reasons, not perfect measurements of networks of preferred interactions13,17. First, there may be observational errors. While the observers performing the work are usually highly trained individuals, they may nonetheless make mistakes. They may confuse one species for another, which is particularly easy to do for small-bodied insects, or smaller species may be overlooked altogether. Observers may make correct observations but record them wrongly. And there will be statistical fluctuations in the number of visits of an insect species to a plant species over any finite time. For rare interactions there may even be no visits at all if we are unlucky. The insects themselves may also appear to make “mistakes” by visiting plants that they typically do not pollinate. These and other factors mean that the record of observed visits is an inherently untrustworthy guide to the true structure of the network of preferred interactions. Here we develop a statistical method for making estimates of network structure despite these limitations of the data.

Model of plant–pollinator data

Consider a typical plant–pollinator study in which some number np of plant species, labeled by i = 1…np, and some number na of animal pollinator species, labeled by j = 1…na, are under observation for a set amount of time, producing a record of observed visits such that Mij is the number of times plant species i is visited by pollinator species j. Collectively the Mij can be regarded as a data matrix M with np rows and na columns. This is the input to our calculation.

The unknown quantity, the thing we would like to understand, is the network of plant–pollinator interactions. We can think of this network as composed of two sets of nodes, one representing plant species and the other pollinator species, with connections or edges joining each pollinator to the plants it pollinates. In the language of network science this is a bipartite network, meaning that edges run only between nodes of unlike kinds—plants and pollinators—and never between two plants or two pollinators. Such a network can be represented by a second matrix B, called the incidence matrix, with the same size as the data matrix and elements Bij = 1 if plant i is preferentially visited by pollinator j and 0 otherwise.

The question we would like to answer is this: What is the structure of the network, represented by B, given the data M? It is not straightforward to answer this question directly, but it is relatively easy to answer the reverse question. If we imagine that we know B, then we can say what the probability is that we make a specific set of observations M. And if we can do this then the methods of Bayesian inference allow us to invert the calculation and compute B from a knowledge of M and hence achieve our goal. The procedure is as follows.

Consider a specific plant-pollinator species pair i, j. How many times do we expect to see j visit i if there is, or is not, a preferred interaction between i and j? The answer will depend on several factors. First, and most obviously, we expect the number of visits to be higher if j is in fact a pollinator of i. That is, we expect Mij to be larger if Bij = 1 than if Bij = 0. Second, we expect there to be more visits if there is greater sampling effort—for instance if the period of observation is longer or if the land area over which observations take place is larger15,16,26,27. Third, we expect to see more visits for more abundant plant and pollinator species than for less abundant ones, as demonstrated by several studies28,30. And fourth, as discussed above, we expect there to be some random variation in the number of visits, driven by fluctuations in individual behavior and the environment. These are the primary features that we incorporate into our model. It is possible to add others to handle specific situations (see ref. 39 and the Methods), but we focus on these four here.

We translate these factors into a mathematical model of plant–pollinator interaction as follows. The random variations in the numbers of visits will follow a Poisson distribution for each plant–pollinator pair i, j, parameterized by a single number, the distribution mean μij, provided only that measurements are made sufficiently far apart to be independent (which under normal conditions they will be). We expect μij to depend on the factors discussed above and we introduce additional parameters to represent this dependence. First we introduce a parameter r to represent the change in the average number of visits when two species are connected (Bij = 1), versus when they are not (Bij = 0). We write the factor by which the number of visits is increased as 1 + r with r ≥ 0, so that r = 0 implies no increase and successively larger values of r give us larger increases. Second, we represent the effect of sampling effort by an overall constant C that multiplies the mean μij. The same constant is used for all i and j, since the same sampling effort is devoted to all plant–pollinator pairs. Third, we assume that the number of visits is proportional to the abundance of the relevant plant and pollinator species: twice as many pollinators of species j, for instance, will mean twice as many visits by that species, and similarly for the abundance of the plant species13. Thus the number of visits will be proportional to σiτj, for some parameters σi and τj representing the abundances of plant i and pollinator j, respectively, in suitable units (which we will determine shortly).

Putting everything together, the mean number of observed visits to plant i by pollinator j is

$${mu }_{ij}=C{sigma }_{i}{tau }_{j}(1+r{B}_{ij}),$$

(1)

and the probability of observing exactly Mij visits is drawn from a Poisson distribution with this mean:

$$P({M}_{ij}| {mu }_{ij})=frac{{mu }_{ij}^{{M}_{ij}}}{{M}_{ij}!} {e}^{-{mu }_{ij}}.$$

(2)

This equation gives us the probability distribution of a single element Mij of the data matrix. Then, combining Eqs. (1) and (2), the data likelihood—the probability of the complete data matrix M—is given by the product over all species thus:

$$P({boldsymbol{M}}| {boldsymbol{B}},theta )=mathop{prod}limits_{i,j}frac{{left[C{sigma }_{i}{tau }_{j}(1+r{B}_{ij})right]}^{{M}_{ij}}}{{M}_{ij}!} {e}^{-C{sigma }_{i}{tau }_{j}(1+r{B}_{ij})},$$

(3)

where θ is a shorthand collectively denoting all the parameters of the model: C, r, σ and τ. Our model is thus effectively a model of an entire network, rather than single interactions, in contrast with other recent approaches to the modeling of network data reliability17,18,32.

There are two important details to note about this model. First, the definition in Eq. (1) does not completely determine C, σ, and τ because we can increase (or decrease) any of these parameters by a constant factor without changing the resulting value of μij if we simultaneously decrease (or increase) one or both of the others. In the language of statistics we say that the parameters are not “identifiable.” We can rectify this problem by fixing the normalization of the parameters in any convenient fashion. Here we do this by stipulating that σi and τj sum to one, thus:

$$mathop{sum }limits_{i=1}^{{n}_{p}}{sigma }_{i}=mathop{sum }limits_{j=1}^{{n}_{a}}{tau }_{j}=1.$$

(4)

In effect, this makes σi and τj measures of relative abundance, quantifying the fraction of individual organisms that belong to each species, rather than the total number. (This definition differs from traditional estimates of pollinator abundance that define the abundance of a pollinator species in terms of its number of observed visits.) Second, there may be other species-level effects on the observed number of visits in addition to abundance, such as the propensity for observers to overlook small-bodied pollinators. There is, at least within the data used in this paper, no way to tell these effects from true variation in abundance—no way to tell for example if there are truly fewer individuals of a species or if they are just hard to see and hence less often observed. As a result, the abundance parameters in our model actually capture a combination of effects on observation frequency. This does not affect the accuracy of the model, which works just as well either way, but it does mean that we have to be cautious about interpreting the values of the parameters in terms of actual abundance. This point is discussed further in the applications below.

Bayesian reconstruction

The likelihood of Eq. (3) tells us the probability of the data M given the network B and parameters θ. What we actually want to know is the probability of the network and parameters given the data, which we can calculate by applying Bayes’ rule in the form

$$P({boldsymbol{B}},theta | {boldsymbol{M}})=frac{P({boldsymbol{M}}| {boldsymbol{B}},theta )P({boldsymbol{B}}| theta )P(theta )}{P({boldsymbol{M}})}.$$

(5)

This is the posterior probability that the network has structure B and parameter values θ given the observations that were made. There are three important parts to the expression: the likelihood P(MB, θ), the prior probability of the network P(Bθ), and the prior probability of the parameters P(θ). The denominator P(M) we can ignore because it depends on the data alone and will be constant (and hence irrelevant for our calculations) once M is determined by the observations.

Of the three non-constant parts, the first, the likelihood, we have already discussed—it is given by Eq. (3). For the prior on the network P(Bθ) we make the conservative assumption—in the absence of any knowledge to the contrary—that all edges in the network are a priori equally likely. If we denote the probability of an edge by ρ, then the prior probability on the entire network is

$$P({boldsymbol{B}}| theta )=mathop{prod}limits_{i,j}{(1-rho )}^{1-{B}_{ij}}{rho }^{{B}_{ij}}.$$

(6)

We consider ρ an additional parameter which is to be inferred from the data and which we will henceforth include, along with our other parameters, in the set θ.

To complete Eq. (5), we also need to choose a prior P(θ) over the parameters. We expect there to be some limit on the value of r, which we impose using a minimally informative prior with finite mean (this distribution turns out to be the exponential distribution). For the remaining parameters we use uniform priors. With these choices, we then have everything we need to compute the posterior probability, Eq. (5).

Once we have the posterior probability there are a number of things we can do with it. The simplest is just to maximize it with respect to the unknown quantities B and θ to find the most likely structure for the network and the most likely parameter values, given the data. This, however, misses an opportunity for more detailed inference and can moreover give misleading results. In most cases there will be more than one value of B and θ with high probability under Eq. (5): there may be a unique maximum of the probability, a most likely value, but there are often many other values that have nearly as high probability and offer plausible network structures competitive with the most likely one. To get the most complete picture of the structure of the network we should consider all these plausible structures.

For example, if all plausible structures are similar to one another in their overall shape then we can be quite confident that this shape is reflective of the true preferred interactions between plant and pollinator species. If plausible structures are widely varying, however, then we have many different candidates for the true structure and our certainty about that structure is correspondingly lower. In other words, by considering the complete set of plausible structures we can not only make an estimate of the network structure but also say how confident we are in that estimate, in effect putting “error bars” on the network.

How do we specify these errors bars in practice? One way is to place posterior probabilities on individual edges in the network. For example, when considering the edge connecting plant i and pollinator j, we would not ask “Is there an edge?” but rather “What is the probability that there is an edge?” Within the formulation outlined above, this probability is given by the average

$$P({B}_{ij}=1| {boldsymbol{M}})=mathop{sum}limits_{{boldsymbol{B}}}int {B}_{ij}P({boldsymbol{B}},theta | {boldsymbol{M}})dtheta ,$$

(7)

where the sum runs over all possible incidence matrices and the integral over all parameter values. More generally we can compute the average of any function f(B, θ) of the matrix B and/or the parameters θ thus:

$$leftlangle f({boldsymbol{B}},theta )rightrangle =mathop{sum}limits_{{boldsymbol{B}}}int f({boldsymbol{B}},theta ) P({boldsymbol{B}},theta | {boldsymbol{M}})dtheta .$$

(8)

Functions of the matrix and functions of the parameters can both be interesting—the matrix tells us about the structure of the network but the parameters, as we will see, can also reveal important information.

Computing averages of the form (8) is unfortunately not an easy task. A closed-form expression appears out of reach and the brute-force approach of performing the sums and integrals numerically over all possible networks and parameters is computationally intractable in all but the most trivial of cases. The sum over B alone involves ({2}^{{n}_{p}{n}_{a}}) terms, which is normally a very large number.

Instead therefore we use an efficient Monte Carlo sampling technique to approximate the answers. We generate a sample of network/parameter pairs (B1, θ1), …, (Bn, θn), where each pair appears with probability proportional to the posterior distribution of Eq. (5). Then we approximate the average of f(B, θ) as

$$leftlangle f({boldsymbol{B}},theta )rightrangle simeq frac{1}{n}mathop{sum }limits_{i=1}^{n}f({{boldsymbol{B}}}_{i},{theta }_{i}).$$

(9)

Under very general conditions, this estimate will converge to the true value of the average asymptotically as the number of Monte Carlo samples n becomes large. Full details of the computations are given in Materials and Methods, and an extensive simulation study of the model is presented in Supplementary Note 1.

Checking the model

Inherent in the discussion so far is the assumption that the data can be well represented by our model. In other words, we are assuming there is at least one choice of the network B and parameters θ such that the model will generate data similar to what we see in the field. This assumption could be violated if our model is a poor one, but there is nothing in the method described above that would tell us so. To be fully confident in our results we need to be able not only to infer the network structure, but also to check whether that structure is a good match to the data. The Bayesian toolbox comes with a natural procedure for doing this. Given a set of high-probability values of B and θ generated by the method, we can use them in Eq. (3) to compute the likelihood P(MB, θ) of a data set M and then sample possible data sets from this probability distribution, in effect recreating data as they would appear if the model were in fact correct. We can then compare these data to the original field data to see if they are similar: if they are then our model has done a good job of capturing the structure in the data.

In the parlance of Bayesian statistics this approach is known as a posterior–predictive assessment40. It amounts to calculating the probability

$$P({widetilde{M}}_{ij}| {boldsymbol{M}})=mathop{sum}limits_{{boldsymbol{B}}}int P({widetilde{M}}_{ij}| {boldsymbol{B}},theta )P({boldsymbol{B}},theta | {boldsymbol{M}})dtheta$$

(10)

that pollinator species j makes ({widetilde{M}}_{ij}) visits to plant species i in artificial data sets generated by the model, averaged over many sets of values of B and θ. We can then use this probability to calculate the average value of ({widetilde{M}}_{ij}) thus:

$$langle {widetilde{M}}_{ij}rangle =mathop{sum}limits_{{widetilde{M}}_{ij}}{widetilde{M}}_{ij} P({widetilde{M}}_{ij}| {boldsymbol{M}}).$$

(11)

The averages for all plant–pollinator pairs can be thought of as the elements of a matrix (langle widetilde{{boldsymbol{M}}}rangle), which we can then compare to the actual data matrix M, or alternatively we can calculate a residue ({boldsymbol{M}}-langle widetilde{{boldsymbol{M}}}rangle). If (langle widetilde{{boldsymbol{M}}}rangle) and M are approximately equal, or equivalently if the residue is small, then we consider the model a good one.

To quantify the level of agreement between the fit and the data we can also compute the discrepancy40 between the artificial data and M as

$${X}^{2}=mathop{sum}limits_{ij}frac{{({M}_{ij}-{langle widetilde{M}_{ij}rangle })}^{2}}{{langle widetilde{M}_{ij}rangle }}.$$

(12)

Under the hypothesis that the model is correct, X2 follows a chi-squared distribution with np × na degrees of freedom40. A good fit between model and data is signified by a value of X2 that is much smaller than its expectation value of np × na. Note that the calculation of (P({widetilde{M}}_{ij}| {boldsymbol{M}})) in Eq. (10) is of the same form as the one in Eq. (8), with (f({boldsymbol{B}},theta )=P({widetilde{M}}_{ij}| {boldsymbol{B}},theta )), which means we can calculate (P({widetilde{M}}_{ij}| {boldsymbol{M}})) in the same way we calculate other average quantities, using Monte Carlo sampling and Eq. (9).

Application to visitation data sets

Well-sampled data

To demonstrate how the method works in practice, we first consider a large data set of plant–pollinator interactions gathered by Kaiser-Bunbury and collaborators41 at a set of study sites on the island of Mahé in the Seychelles. The data describe the interactions of plant and pollinator species observed over a period of eight months across eight different sites on the island. The data also include measurements of floral abundances for all observation periods and all sites. Our method for inferring network structure does not make use of the abundance measurements, but we discuss them briefly at the end of this section.

The study by Kaiser-Bunbury et al. focused particularly on the role of exotic plant species in the ecosystem and on whether restoring a site by removing exotic species would significantly impact the resilience and function of the plant–pollinator network. To help address these questions, half of the sites in the study were restored in this way while the rest were left unrestored as a control group.

As an illustration of our method we apply it to data from one of the restored sites, as observed over the course of a single month in December 2012 (the smallest time interval for which data were available). We pick the site named “Trois-Frères” because it is relatively small but also well sampled. Our calculation then proceeds as shown in Fig. 1. There were 8 plant and 21 pollinator species observed at the site during the month, giving us an 8 × 21 data matrix M as shown in Fig. 1a. (Following common convention, the plots of matrices in this paper are drawn with rows and columns ordered by decreasing numbers of observed interactions, so that the largest elements of the data matrix—the darkest squares—are in the top and left of the plot.)

Fig. 1: Illustration of the method of this paper applied to data from the study of Kaiser-Bunbury et al.41.

a We start with a data matrix M that records the number of interactions between each plant species and pollinator species. Species pairs that are never observed to interact (Mij = 0) are shown in white. b We then draw 2000 samples from the distribution of Eq. (5), four of which are shown in the figure. Each sample consists of a binary incidence matrix B, values for the relative abundances σ and τ (shown as the orange and blue bar plots, respectively), and values for the parameters C, r, and ρ (not shown). c We combine the samples using Eqs. (7)–(9) to give an estimate of the probability of each edge in the network and the complete parameter set θ. For the data set studied here our estimates of the expected values of the parameters C, r, and ρ are 〈C〉 = 20.2, 〈r〉 = 45.9, and 〈ρ〉 = 0.244.

Full size image

Now we use our Monte Carlo procedure to draw 2000 sets of incidence matrices B and parameters θ from the posterior distribution of Eq. (5) (Fig. 1b). These samples vary in their structure: some edges, like the one connecting the plant N. vanhoutteanum and the pollinator A. mellifera, are present in nearly all samples, while others, like the one between M. sechellarum and A. mellifera, appear only a small fraction of the time. Some others never occur at all. Averaging over these sampled networks we can estimate the probability, Eq. (7), that each connection exists in the network of preferred interactions between plant and animal species—see Fig. 1c. Some connections have high probability, close to 1, meaning that we have a high degree of confidence that they exist. Others have probability close to 0, meaning we have a high degree of confidence that they do not exist. And some have intermediate probabilities, meaning we are uncertain about them (such as the M. sechellarumA. mellifera connection, which has probability around 0.45). In the latter case the method is telling us that the data are not sufficient to reach a firm conclusion about these connections. Indeed, if we compare with the original data matrix M in Fig. 1a, we find that most of the uncertain connections are ones for which we have very few observations, relative to the total number of observations for these species—say Mij = 1 or 2 for species with dozens of total observations overall.

As we have mentioned, we also need to check whether the model is a good fit to the data by performing a posterior–predictive test. Figure 2 shows the results of this test. The main plot in the figure compares the values of the 40 largest elements of the original data matrix M with the corresponding elements of the generated matrix (widetilde{{boldsymbol{M}}}). In each case, the original value is well within one standard deviation of the average value generated by the test, confirming the accuracy of the model. The inset of the figure shows the residue matrix ({boldsymbol{M}}-widetilde{{boldsymbol{M}}}), which reveals no systematic bias unaccounted for by the model. The discrepancy X2 of Eq. (12) takes the value 26.94 in this case, well below the expected value of npna = 168, which indicates that the good fit is not a statistical fluke.

Fig. 2: Results of a posterior–predictive test on the data matrix M for the example data set analyzed in Fig. 1.

The main plot shows the error on the 40 largest entries of M, while the inset shows the residue matrix ({boldsymbol{M}}-langle tilde{{boldsymbol{M}}}rangle). Because the actual data M are well within one standard deviation of the posterior–predictive mean, the test confirms that the model is a good fit in this case. Error bars correspond to one standard deviation and are computed with n = 2000 samples from the posterior distribution.

Full size image

In addition to inferring the structure of the network itself, our method allows us to estimate many other quantities from the data. There are two primary methods by which we can do this. One is to look at the values of the fitted model parameters, which represent quantities such as the preference r and species abundances σ, τ. The other is to compute averages of quantities that depend on the network structure or the parameters (or both) from Eq. (9).

As an example of the former approach, consider the parameter ρ, which represents the average probability of an edge, also known as the connectance of the network. Figure 3a shows the distribution of values of this quantity over our set of Monte Carlo samples, and neatly summarizes our overall certainty about the presence or absence of edges. If we were certain about all edges in the network, then ρ would take only a single value and the distribution would be narrowly peaked. The distribution we observe, however, is somewhat broadened, indicating significant uncertainty. The most likely value of ρ, the peak of the distribution, turns out to be quite close to the value one would arrive at if one were simply to assume that every pair of species that interacts even once is connected in the network. This does not mean, however, that one could make this assumption and get good results. As we show below, the network one would derive by doing so would be badly in error in other ways.

Fig. 3: Analyses that can be performed using samples from the posterior distribution of Eq. (5).

a Distribution of the connectance ρ. Connectance values for binary networks obtained by thresholding the data matrix at Mij > 0 and Mij ≥ 5 are shown as vertical lines for reference. b Distribution of the preference parameter r. The mean value of r is 〈r〉 = 45.9 and its mode close to 40, but individual values as high as 100 are possible. c Distribution of the nestedness measure NODF. Values obtained by thresholding the data matrix at Mij > 0 and Mij > 1 are shown for reference. d Measured and estimated abundances for each of the plant species (R2 = 0.54).

Full size image

Figure 3b shows the distribution of another of the model parameters, the parameter r, which measures the extent to which pollinators prefer the plants they normally pollinate over the ones they do not. For this particular data set the most likely value of r is around 40, meaning that pollinators visit their preferred plant species about 40 times more often than non-preferred ones, indicating all other things being equal, an impressive level of selectivity on the part of the pollinators.

For the calculation of more complicated network properties we can perform an average over the value of any function f(B, θ), as long as there is an algorithm to compute it. As an example, Fig. 3c shows a calculation of the quantity known as “Nestedness based on Overlap and Decreasing Fill” (NODF), a measure of the nestedness property discussed in the introduction. This quantity measures the extent to which specialist species—those with relatively few interactions—tend to interact with a subset of the partners of generalist species42. While it is complicated to compute NODF analytically, due to the fact that one must order the species by degrees22, it is straightforward to calculate it within our framework: we simply calculate the value for each sampled network B and plot the resulting distribution. Interestingly, the most likely value of NODF is significantly different from the one we would calculate had we assumed, as discussed above, that a single interaction is sufficient to consider two species connected. On the contrary, we find that the system is almost certainly more nested than this simple analysis would conclude.

In Fig. 3d, we compare the values of our estimated floral abundance parameters σ to the measured abundances reported by Kaiser-Bunbury et al.41. These parameters are not measures of abundance in the usual sense, because they combine actual abundance (quantity or density) with other characteristics such as ease of observation. We do find a correlation between the estimated and observed abundances, but it is relatively weak (R2 = 0.54), signaling significant disagreement, on which we elaborate in the discussion section.

Undersampled data

As we have pointed out, the connections in the network about which we are most uncertain tend to be ones that are undersampled, i.e., those for which we have only a small amount of data. In an ideal world we could address this problem by taking more data, but it is rare that we have the opportunity to do this. More commonly the data have already been gathered and our task is to produce the best results we can with those data. There are nonetheless some remedies open to us, such as aggregating data over different geographical areas or time windows. In Fig. 4 we compare the edge probabilities estimated from data recorded individually at the four “restored” sites in the Mahé study during October 2012 to the edge probabilities we obtain when we aggregate these observations into a single data matrix and only then estimate the network. (We use restored sites observed during the same month because they are likely to be ecologically similar, meaning the data are measuring approximately the same system.) Comparison of the two distributions shows—as we would hope—that there are fewer uncertain edges in the aggregated network than in its disaggregated parts, i.e., there are fewer edges with probabilities in the middle of the distribution and more with probabilities close to zero or one.

Fig. 4: Illustration of the effect of data aggregation on edge uncertainty.

a Histogram of the edge probabilities P(Bij = 1M) for the four restored sites in the Mahé study as observed in October 2012 and analyzed individually. b Equivalent histogram after aggregating the data over the sites and then estimating a single network from the resulting data matrix. The horizontal lines, both drawn at fifty observations—are added merely as a guide to the eye. Note how the upper histogram has more mass near the middle of the plot, while the lower one has most of its mass close to probability zero or one, indicating greater certainty in the positions of the edges in the aggregated data.

Full size image

In other cases neither aggregation nor gathering more data is possible, for instance when reanalyzing a data set already collected by others or already maximally aggregated. Such data sets record the results of observational studies that are already over, and may contain too few observations, but our approach still allows us to perform rigorous inference in these circumstances.

For instance, Jordano et al.43 used dozens of existing plant-pollinator and plant-frugivore data sets to argue that the degree distributions of mutualistic networks have a long tail, but this conclusion is undermined by issues with undersampling. As an example, one of the data sets they studied, originally gathered by Inouye and Pyke44, records 1314 individual interactions over a period of 3 months in Kosciusko National Park, Australia, between 40 plants and 85 pollinator species, which works out to an average of 0.386 unique observations per species pair. Is this sampling effort sufficient to establish edges with certainty? As a point of reference, the data analyzed in Fig. 1 comprises 201 observations between 8 plants and 21 pollinators species for an average of 1.196 observations per pair of species, and the aggregated data of Fig. 4 contain 1.420 observations for every pair. Nonetheless, there is uncertainty about some of the connections in these reconstructed networks; this suggests that the network of Inouye and Pyke, with less than a third as much data per species pair, will contain significant uncertainty.

Even so, our method allows us to make inferences about this network. In Fig. 5, we show estimates of the degree distributions of both plant and pollinator nodes in the network obtained from the posterior distribution P(BM), along with naive estimates calculated by thresholding the (undersampled) data as in the study by Jordano et al.43. As the figure shows, the results derived from the two approaches are very different. The thresholded degree distributions were classified as scale-free by Jordano et al., but this classification no longer holds once we account for the issues with the data; the inferred degree distributions are in this case well-modeled as Poisson distributions of means 5.53 and 2.60 for plants and pollinators respectively and the power-law form is a poor fit. On the other hand, the abundance parameters of the model, shown in Fig. 5, do appear to have a broad distribution, an interesting finding that calls for a rethinking of the relationship between abundances and degree distributions. It is generally thought that interactions will tend to be evenly distributed under an even distribution of abundance13 but here the opposite seems to be true.

Fig. 5: Distributions of species-level parameters for a network of plants and pollinators in Kosciusko National Park, Australia, from the study by Inouye and Pyke44.

a Thresholded degree distributions calculated by connecting species i and j with an edge if Mij > 0. Inferred degree distributions are calculated using the method of this paper, averaging the fraction pk of nodes with a given degree k over n = 2000 Monte Carlo samples. b Inferred distributions of abundances σ and τ, calculated as a histogram over n = 2000 Monte Carlo samples of the abundance parameters of the fitted model. Error bars correspond to one standard deviation in all cases.

Full size image


Source: Ecology - nature.com

Grace Moore ’21 receives Michel David-Weill Scholarship

Revisiting a quantum past for a fusion future