NetGAM: Using generalized additive models to improve the predictive power of ecological network analyses constructed using time-series data
Our general strategy was to compare the performance of four approaches for inferring microbial associations from abundance data with overlying time-series signals. The approaches were (1) pairwise spearman correlation analysis (SCC) [1, 29], (2) Graphical lasso analysis (Glasso) [30, 31], (3) pairwise SCC analysis with a pre-processing step where seasonal and long-term splines were fit to and subtracted from each variable using a GAM (GAM-SCC), and (4) Glasso with the same GAM subtraction approach (GAM-Glasso). Our validation strategy for the GAM transformation consisted of generating mock datasets with underlying associations, masking those associations by adding seasonal and long-term signals to the abundance data, and comparing the predicted associations obtained from each network inference method to the true species-species associations.Data simulation: generating mock abundance data with time-series propertiesWe generated mock abundance datasets that had a predetermined, underlying network structure and contained long-term and seasonal species abundance patterns. First, a covariance matrix was generated to describe the relationships between species in a mock dataset (Fig. S1, Panel 1). The covariance matrices were constructed with underlying network structures that followed either a scale-free Barabási-Albert model, a random Erdős-Rényi model, or a model of network topology based on a real microbial dataset (American Gut dataset; Fig. S1) [32, 33]. The Erdős-Rényi and Barabási-Albert model datasets were generated so that each dataset contained 400 species and 200 samples, and the American Gut datasets were created so that each dataset contained 127 species and 200 samples. A random Bernoulli distribution was used to simulate the covariance matrix for the Erdős-Rényi networks. We set the probability of interactions occurring between species in a given Erdős-Rényi network to 1%. The Barabási-Albert networks were generated using the “sample_pa” function in the igraph package [34]. The “graph2prec” function in the SpiecEasi package was used to predict the covariance matrix of the American Gut dataset [33]. The covariance between species in a dataset was considered “high” or “low” when the true associations in the covariance matrix were set to 100 or 10 respectively (Fig. S1, Panel 1). These covariance matrices describe the “real”, underlying species interactions in our mock datasets.After generating a covariance matrix, the mean abundance for each species was generated from a normal distribution with a mean of 10 and a variance of 1. These mean abundance values and the covariance matrix were used to parameterize a multivariate normal distribution from which species abundance values for all 200 samples in a dataset were drawn (Fig. S1, Panel 2). The values generated from this multivariate normal distribution were the species abundance values without time-series features confounding the relationship between two associated species (Fig. S1, Panel 2).“Gradual” or “abrupt” seasonal trends were added to 0%, 25%, 50% or 100% of the species in each mock dataset. The gradual seasonal trend increased over 5 months, peaked at a specific month, and decreased over 5 months. Conversely, the abrupt seasonal signal increased over 2 months, peaked at a specific month, and decreased over 2 months (Fig. S1, Panel 3). These seasonal signals were generated by plugging a vector of consecutive integers of length 200 (Nt) into the gradual (Eq. (1)) or abrupt (Eq. (2)) seasonal equations (Fig. S1, Panel 3)…$$Gradual:S_t = left( {frac{{cos left( {N_t ast 2 ast frac{pi }{{12}}} right)}}{2}} right) + 0.5$$
(1)
$$Abrupt:,S_t = left( {left( {frac{{cos left( {N_t ast 2 ast frac{pi }{{12}}} right)}}{2}} right) + 0.5} right)^{10}$$
(2)
where N is the random vector of consecutive integers, S is the output seasonal vector, and t is the index of vectors N and S. The starting value of vector Nt was drawn at random for each species to allow the seasonal peaks to be centered at different months. Each element in the seasonal vector (St) was then multiplied by the corresponding element in the abundance vector (Xt) of a specific species to obtain mock species abundance values with a gradual or abrupt seasonal trend (Fig. S1, Panel 3).A long-term time-series trend was added to the abundance values of 0% or 50% of the species in the mock datasets (Fig. S1, Panel 4). When a long-term signal was applied to 50% of the species in a dataset, half of the species were randomly selected to have this long-term trend. Then, a vector of linear values was generated following Eq. (3) such that…$$Long – term,trend:,L_t = pm mleft( {L_{t – 1}} right) + 0.01$$
(3)
where Lt is the point in the line at the next time point and m is the slope of the line. The slope parameter (m) was generated from a random normal distribution with a mean of 0.01 and a variance of 0.01. The slope parameter (m) was also multiplied by −1 half of the time to ensure that half of the long-term trends increased over time and half decreased over time (Fig. S1, Panel 4). After generating the vector of linear values (Lt), each element of this vector was added to each element of the abundance vector (Xt) of a specific species to simulate long-term time-series trends (Fig. S1, Panel 4).Time-series predictor columns were added to each dataset after applying monthly and long-term abundance trends to a portion of the species in the mock datasets. The predictors that were used in the downstream GAM-based data transformation were the month of the year (i.e., 1–12) and the day of the time-series (i.e., 1–200). In total, we generated 100 mock datasets for every combination of conditions (84 combinations total; Table S1), resulting in 8400 mock time-series datasets that were used in the downstream count data transformation, GAM subtraction, and network analysis procedures.Data simulation: Simulating count data from abundance valuesThe 8400 time-series datasets that were generated using the methods described above were transformed to make the abundance values resemble high-throughput sequencing data because microbial time-series sampling efforts are often processed using such molecular methods (e.g., tag-sequencing, meta-omics). Analysis of high-throughput sequencing data is complicated by the compositional (i.e., relative) nature of the data and by the high number of zeros that may be prevalent in a dataset (i.e., zero-inflation; see Supplementary Information) [35, 36]. Relative abundances of different species in natural communities are also highly skewed, so that relatively few species constitute most of the organisms in a sample although many rare species are also present [37, 38]. Therefore, species abundances were first exponentiated to increase the prevalence of abundant species and to decrease the prevalence of rare species (Fig. S1, Panel 5). The exponentiated species abundance values were then converted to relative abundance values by dividing each species count by the sum of all species counts in a sample (Fig. S1, Panel 6). The resulting relative abundance values and time-series predictor variables were used in data normalization and GAM-transformation steps prior to carrying out the network analyses.Network inference: Count data normalization and GAM transformationSeveral steps were taken to back out the species-species relationships in the mock datasets. We advocate these steps to infer network structure from a real time-series dataset. A centered log-ratio (CLR) transformation was first applied to the species relative abundance values to normalize the mock species abundance data across samples using the “clr” function in the compositions package in R (Fig. 1) [39]. This transformation step is important to avoid spurious inferences induced by the inherent compositionality of relative abundance data [31, 33, 36]. In addition to the CLR transformation used in our main network iterations, we carried out additional network iterations using the modified CLR [40], cumulative sum scaling [41], and total sum scaling [42] transformations (see Supplementary Information). In all cases, the normalized dataset was copied, with one copy subjected to a subsequent GAM transformation, and the other one not GAM-transformed.Fig. 1: Steps used to carry out the GAM-based transformation of time-series species abundance data prior to carrying out pairwise spearman correlation (SCC) and graphical lasso (Glasso) ecological network analyses.The raw, species abundance data were first CLR-transformed (1). Generalized additive models (GAMs) were then fit to each species in the dataset (2) and the residuals of each GAM were checked for significant autocorrelation (3). The residuals of each GAM were extracted (4) and were used as input in the SCC and Glasso network analysis methods (5). Finally, the GAM-transformed network outputs were obtained (6; see text for additional details).Full size imageThe GAM transformation was carried out by fitting GAMs to each individual species in the dataset to remove monthly signals, long-term trends, and autocorrelation from the species abundance data. These GAMs were fit using the “gamm” function in the mgcv package in R [43, 44]. The GAMs that were used included the “month of year” parameter as a cyclical spline predictor and the “day of time-series” parameter as a penalized thin-plate spline predictor (“ts” in the mgcv package; Fig. 1), which given our one-dimensional data is analogous to a natural cubic spline. In addition, the first GAM included a continuous AR1 (“corCAR1” in the mgcv package) correlation structure term in the model. This corCAR1 model was revised for specific species when the GAM could not be resolved or when significant autocorrelation was detected in the GAM residuals (Fig. 1). The GAM revision step fit 4 new GAMs with different correlation structure terms (i.e., “AR1”, “CompSymm”, “Exp”, and “Gaus”) to the species that could not be fit using the corCAR1 model or that contained significant autocorrelation in the corCAR1 GAM residuals. Then, the correlation structure term that addressed these issues for the largest number of individuals was used as the GAM model for this group of species. After fitting a GAM to all of the species in the input dataset, the residuals of each GAM were extracted and were used as the new, GAM-transformed abundance values (Fig. 1). These GAM residuals represent species abundance values with a reduced influence of time (Fig. 2) and were used as input in the downstream GAM-SCC and GAM-Glasso network analyses.Fig. 2: A conceptual figure that demonstrates how the GAM transformation can remove seasonal signals while preserving ecologically relevant species co-occurrence patterns.In this example, the co-occurrence pattern between Species A and Species B persists even after the seasonal signals are removed by the GAM transformation.Full size imageNetwork inference: Network runs and statistical analysesThe pre-processed species abundance data with and without the GAM-removal of time-series signals were used in SCC and Glasso networks in order to compare the outputs of the SCC, GAM-SCC, Glasso, and GAM-Glasso network inference approaches (Fig. 1). Additional network iterations were also carried out using the CCLasso [45] and SPRING [40] network inference approaches (see Supplementary Information). For the SCC and Glasso network iterations, a nonparanormal transformation was applied to the species abundance datasets with and without the GAM transformation using the “huge.npn” function in the huge package in R [46]. Spearman correlation networks were then constructed by calculating the correlation between every pair of species in the mock abundance datasets. A Bonferroni-corrected p value of 0.01 was used as a cutoff to identify edges in these SCC networks. The Glasso networks were constructed by testing 30 regularization parameter values (i.e., lambdas) in each network using the “batch.pulsar” (criterion = “stars”; rep.num = 50) function in the pulsar package in R [47]. The lambda that resulted in the most stable network output was selected using the StARS method [48]. Finally, the graph that resulted from the StARS output was used to obtain a species adjacency matrix for the Glasso networks.The species-species associations predicted by the SCC, GAM-SCC, Glasso, and GAM-Glasso networks were compared to the true species-species associations and the F1 scores of the network predictions were calculated. The F1 score is a measure of classification performance (presence or absence of an edge) that accounts for uneven classes, which is essential when dealing with sparse networks. The F1 scores of the GAM-transformed networks were compared to the networks that did not undergo GAM transformation using paired Wilcoxon tests with Bonferroni correction. An adjusted p value of 0.01 was used as a cutoff to identify under what circumstances the GAM significantly improved the F1 score of a Glasso or SCC network.Network inference: Comparison of predicted network structuresAdditional networks were generated using the methods described above to compare the predicted network structures obtained from the GAM-Glasso, Glasso, GAM-SCC, and SCC approaches to the real network structures. These additional networks were constructed using smaller mock datasets to allow for better visualization of the network outputs and contained species with a gradual seasonal signal and high species-species covariance (see Supplementary Information). The average clustering coefficient and the degree distribution of these additional network outputs were calculated and used for the network structure comparisons. The average clustering coefficient of a network describes the likelihood that two species that are both associated with a third species are also associated with each other [49], and in a sense describes the “clumpiness” of a network. The network degree distributions describe the probability distribution of the number of interactions per node in a network [50]. More