Jointly modeling marine species to inform the effects of environmental change on an ecological community in the Northwest Atlantic
Species dataSpecies CPUE data were obtained from the National Oceanographic and Atmospheric Administration (NOAA) Northeast Fishery Science Center (NEFSC) U.S. NES bottom trawl survey, which, for almost 50 years, has collected abundance data for over 250 fish species in the spring and fall. The survey employs a stratified random design, with stations allocated proportionally to the stratum area. A 12 mm mesh coded liner is used to retain small-bodied and juvenile fish. All fish caught are weighed and counted18. We downloaded the data from OceanAdapt.com, which calibrates the CPUE for each species depending on survey ship. We cleaned the data for the years from 1998 to 2020, excluding years prior to 1997 due to many missing values for chlorophyll (Chla). We only included strata that were consistently sampled in the spring and fall. To account for the seasonal migrations of many of the studied species, we modeled spring and fall seasons separately. We present the results for the fall CPUE, with the spring results and presence/absence results in the supplemental materials. We selected species that were present in at least 400 tows and with a biomass of at least 0.5 kg/tow (CPUE) in more than 100 tows. Finally, we removed roughly 400 tows per season with missing environmental covariates (see below). In the fall, we selected 30 species with 5217 observations, and in the spring, we selected 24 species with 5935 observations (see Supplemental Tables S1, S2).Environmental dataThe study region includes Southern New England and The Gulf of Maine. We selected environmental covariates known to influence marine fish distributions and abundances. Depth, temperature (bottom and surface) and salinity (bottom and surface) were measured in situ during trawl surveys. Missing values were augmented with the data-assimilative HYbrid Coordinate Ocean Model (HYCOM) daily and then monthly data. HYCOM is an oceanographic model that produces 32 vertical layers including ocean temperature, salinity, sea surface height, and wind stress as well as other 3- and 4-dimensional variables. The system uses the Navy Coupled Ocean Data Assimilation (NCODA) system19 for data assimilation. NCODA uses the model forecast as a first guess in a multivariate optimal interpolation (MVOI) scheme and assimilates available satellite altimeter observations (along track obtained via the Naval Oceanographic Office Altimeter Data Fusion Center satellite) and in situ sea surface temperature as well as available in situ vertical temperature and salinity profiles from expendable bathythermographs, Argo floats, and moored buoys20. Seven HYCOM models (HYCOM + NCODA Global 1/12° Reanalysis GLBu0.08 Expts 19.0, 19.1, 90.9, 91.0, 91.1, 91.2) were temporally concatenated to create a continuous dataset of BT and salinity, ranging from 1992 to 2017. These model runs differed slightly in their configurations (time steps, advection scheme, mixing, vertical structure, slight change in NCODA, and MVOI transition to 3-dimensional analysis in 2013), but the differences are not expected to influence the applicability of the output21. The numbers of filled in missing values were 787 (7.0%) surface salinity (SSAL), 735 (6.5%) surface temperature (SST), 809 (7.2%) bottom temperature (BT), and 850 (7.6%) bottom salinity (BSAL). Chla was obtained from the MODIS satellite (monthly rasters from 2003 to 2019) on a monthly time step22, with missing values filled using the SeaWIFS satellite23 (1998 to 2009). Temperature, salinity and Chla data that were not collected in situ were downloaded using Google Earth Engine (HYCOM and MODIS)24. Benthic substrate (grain size in mm, referred to as SEDSIZE), subregion (Gulf of Maine or Southern New England), benthic land position (high, low, mid), and seabed form data (depression, high flat, high slope, low slope, mid flat, side slope, steep) were obtained from the Nature Conservancy’s Northwest Atlantic Marine Ecoregional Assessment25 (Supplemental Fig. S1).GJAMTo study the influence of the environmental covariates on the joint distribution of marine fish and invertebrate species we use the generalized joint attribute model (GJAM)12 and the corresponding R package (Version 2.5)26. Briefly, this multivariate Bayesian model allows us to jointly model the marine fish community and accounts for responses to the environment that can include combinations of continuous and discrete responses (e.g., CPUE and zeros) and the dependence between species. GJAM returns all parameters on the observation scale, in this case, CPUE and presence-absence. Products of model fitting include a species‐by‐species covariance matrix (Σ), species responses to predictor variables (B), and predicted responses. The species‐by‐species covariance matrix Σ captures residual codependence between species after removing the main structure explained by the model (also referred to as the residual correlation matrix). As a result, Σ allows for conditional prediction of some species under different scenarios for the abundances of others27.CPUE is termed continuous abundance (CA) data in GJAM, meaning that observations are continuous with discrete zeros. Let yis be the CPUE for species s at location i. For CA data GJAM expands the tobit model for (univariate) regression to the multivariate setting, where a latent variable wis is equal to yis when yis is positive and zero otherwise,$$y_{i,s}^{0} = left{ {begin{array}{*{20}l} {w_{is} ,} hfill & {w_{is} > 0quad {text{continuous}}} hfill \ {0,} hfill & {w_{is} le 0quad {text{discrete zero}}} hfill \ end{array} } right.$$
(1)
The length-S vector of all species responses wi is continuous on the real line, and thus can be modeled with a multivariate normal. The model for wi is$$begin{aligned} left. {{mathbf{w}}_{i} } right|{mathbf{x}}_{i, } {mathbf{y}}_{i} & sim ,MVNleft( {{varvec{mu}}_{i} ,{Sigma }} right) times mathop prod limits_{s = 1}^{S} {mathcal{I}}_{is} \ u_{{varvec{i}}} & = {mathbf{B}}^{prime } {mathbf{x}}_{{varvec{i}}} \ {mathcal{I}}_{is} & = mathop prod limits_{k in C} I_{is,k}^{{Ileft( {y_{is} = k} right)}} left( {1 – I_{is,k} } right)^{{Ileft( {y_{is} ne k} right)}} \ end{aligned}$$
(2)
$$begin{aligned} {mathcal{I}}_{is} & = I(w_{is} le 0)^{{Ileft( {y_{is} = 0} right)}} left[ {1 – Ileft( {w_{is} le 0} right)} right]^{{Ileft( {y_{is} > 0} right)}} \ & quad I(w_{is} > 0)^{{Ileft( {y_{is} > 0} right)}} left[ {(1 – I(w_{is} > 0)} right]^{{Ileft( {y_{is} = 0} right)}} \ end{aligned}$$where the indicator function (I(cdot )) is equal to 1 when its argument is true and 0 otherwise. For presence-absence data, ({mathbf{p}}_{{varvec{i}}{varvec{s}}}boldsymbol{ }=boldsymbol{ }left(-boldsymbol{infty },boldsymbol{ }0,boldsymbol{ }boldsymbol{infty }right).) This is equivalent to Chib and Greenberg’s28 probit model which can be written as ({mathcal{I}}_{is}=I({w}_{is} >{0)}^{Ileft({y}_{is} >0right)}I({w}_{is}le {0)}^{1-{y}_{is}}).The mean vector ({varvec{mu}}_{i} = {mathbf{B}}^{prime } {mathbf{x}}_{{varvec{i}}}) contains the Q × S matrix of coefficients B and the length-Q design vector xi. Σ is a S × S covariance matrix. There is a correlation matrix associated with Σ,$${mathbf{R}}_{{S,S^{prime } }} = frac{{{{varvec{Sigma}}}_{{S,S^{prime } }} }}{{sqrt {{{varvec{Sigma}}}_{S,S} {{varvec{Sigma}}}_{{S^{prime } ,S^{prime } }} } }}$$
(3)
The predictive distribution is obtained as$$left[tilde{Y }left| tilde{X }right.right]=int left[ tilde{Y }left| tilde{X }right.,widehat{theta }right]left[widehat{theta } left|X, Yright.right]$$
(4)
The integrand contains the likelihood (Eq. (2)) followed by the posterior distribution for parameters, (widehat{theta }= left{widehat{mathbf{B}},boldsymbol{ }widehat{{varvec{Sigma}}}right}). Input (tilde{X }) can equal X (in-sample prediction) or not (out-of-sample prediction). We fitted both CPUE (continuous abundance) and presence-absence versions of the model. As a Bayesian method, GJAM provides probabilistic estimates of parameters with full dependence in data, including jointly distributed species. Model fitting is performed using Gibbs sampling, which is a Markov chain Monte Carlo (MCMC) technique.The sensitivity of an individual response variable s to an individual predictor q is given by the coefficient βqs (individual coefficients from the B matrix). The sensitivity that applies to the full response matrix is given by$${mathbf{f}} = diagleft( {{mathbf{B}}{Sigma }^{ – 1} {mathbf{B}}^{prime } } right)$$
(5)
The Q × S matrix B contains relationships of each species to the environment, the “signal”, but not to each another. Matrix E summarizes species similarities in terms of their response to an environment (stackrel{sim }{mathbf{x}}) and is given by$${mathbf{E}=mathbf{B}}^{boldsymbol{^{prime}}}mathbf{V}mathbf{B}$$
(6)
where V is a covariance matrix for (stackrel{sim }{mathbf{x}})(a vector of predictors) and contributes the environmental component of variation in (stackrel{sim }{mathbf{y}}). Similar species in E have similar columns in B. Those similarities and differences are amplified for predictors (stackrel{sim }{mathbf{x}}) with large variance. Conversely, species differences in B do not matter for variables in X that do not vary. The covariance in predictors could come from observed data, i.e., the variance of X (see12 for more details).Prior distributions for this study are non-informative. This is particularly helpful for the covariance, lending stability to Gibbs sampling and avoiding dominance by a prior. In cases this particular case, the direction of the prior effect may be known, but the magnitude is not.Variable selectionUnlike the familiar univariate setting, variable selection has to consider which species are included in the model. In a univariate model, there is one response and perhaps a number of potential predictor variables from which to choose. As in a univariate model, variable selection focuses on predictors held in the n by p design matrix X. Rather than a response vector, the multivariate model includes the n by S response matrix Y. Unlike the univariate model, the overall fit and predictive capacity depends not only on what is in X, but also on the species that are included in Y, each of which would be best explained by a different combination of variables. Rare species having no signal will not provide cross-correlations and thus can offer little learning from an analysis. For this reason, there may be no reason to include them in model fitting. Given that many species may be rare, and rare types will not be explained by the model, there will be decisions about what variables to include on both sides of the likelihood (i.e., predictors and responses).These considerations mean that simple rules for variable selection, such as the combination yielding the lowest DIC, may not be sensible. The combination of variables that yields the lowest DIC could miss variables that are important for subsets of species. In principle, one poorly-fitted species could dominate variable selection. The best model for responses ranging from rare to abundant will depend on precisely which species are included, both rare and abundant. Thus, in order to select variables, we utilize inverse prediction—predicting the environment from species – and the overall community sensitivity12.Inverse prediction provides a comprehensive estimate of the environmental importance for the entire community, because it determines the capacity of the community to predict (through the fitted model) the environment; it inverts the model12. A variable predicted by the community explains important variation in one to many species. A variable that is not predicted by the community does not explain important variation in any of them. To look at the importance of environmental variables for the entire community, we started with the saturated model that included the predictors BT, SST, depth, BSAL, SSAL, Chla, SEDSIZE, subregion, benthic position and an interaction between depth and BT, BSAL, SST and SSAL (Fig. 1a). Sensitivity was highest for the interaction between BT and depth and lowest for Chla and sediment size (see right subpanel on Fig. 1a for sensitivity). Inverse prediction confirmed that sediment size and Chla contribute little to community biomass, because the community cannot “predict” them (see left and middle subpanels on Fig. 1a for sensitivity). Inverse prediction results from a second model (Fig. 1b) showed that SSAL and the third model for benthic position also (Fig. 1c) contribute little to the community response. Using the combination of sensitivity and inverse prediction we obtained the final model that includes BT, depth, BSAL, SST, subregion and an interaction between depth and BT, BSAL and SST (Fig. 1d). Inverse prediction indicates that the CPUE predicts the environment well. In the final model, sensitivity is highest for depth. Subregion remains as a two-level factor and there is strong inverse prediction for that variable as well (Fig. 1d). In the variable-selection stage, each model was run on the entire fall dataset for 5000 iterations and a burn-in of 800. Inverse prediction results from the spring model indicated similar patterns; thus, the same variables were used for the spring and fall.Figure 1Inverse prediction and sensitivity for combinations of environmental parameters in GJAM. Starting with the most complicated model (a), sensitivity was highest for the interaction between BT and depth and lowest for Chla and sediment size (a). Inverse prediction confirms that sediment size and Chla contribute little to community biomass (a) and those are removed in the second model (b). SSAL contributes little to community response and are removed in the third model (c), The final model (d) includes terms that have strong inverse prediction and overall sensitivity. Inverse prediction for continuous and factor variables is on the left and center of each box, and overall sensitivity is on the right.Full size imageWe compare the model selected above using inverse prediction to a model selected using the more traditional method of out-of-sample prediction. For out-of-sample prediction, we fitted all combinations of 11 environmental variables (BT, BSAL, SST, SSAL, Chla, depth, sediment size, subregion, position, seabed form) plus interaction terms between depth and SEDSIZE, BT, BSAL, SST, SSAL and chlorophyll. These models were run with 1000 iterations and a burn-in of 400. All models included BT, BSAL, SST, SSAL, chlorophyll A and depth, as these variables have been shown to be important for these species. In total, 1,024 possible models were evaluated by training each potential model on 70% of the data (n = 3652 in the fall, n = 4155 in the spring), evaluating in-sample performance with DIC, and then testing out-of-sample performance on the remaining 30% (n = 1565 in the fall, n = 1780 in the spring). The 10 models with the lowest DIC in-sample were selected, and the final model was selected out of those 10 with the lowest out-of-sample R2. The selected model for fall CPUE had the following terms: ~ BT + depth + BSAL + SST + SSAL + chla + depth*BT + depth*SEDSIZE + depth*SSAL + depth*chla + SEDSIZE + Benthic position. Recall that inverse prediction selected a simpler model including the following terms: BT + depth + BSAL + SST + Subregion + depth*BT + depth*BSAL + depth*SST. The inclusion of SEDSIZE and benthic position in the model selected via out-of-sample prediction is probably a result of these predictor variables being important for a subset of species (i.e. benthic species29), but not the community as a whole. When we have a large number of response variables, as in this study, we need to consider the variables that are more important on a community level, rather than just for a few species. Thus, we use the model selected via inverse prediction for the remainder of the study.We fitted the selected model with 70% of the data for 20,000 iterations with a burn-in of 8,000 iterations (n = 3652 in the fall, n = 4155 in the spring). Out-of-sample prediction was performed on the remaining 30% (n = 1565 in the fall, n = 1780 in the spring) of the dataset and predicted versus observed values were evaluated (Supplemental Figs. S2 and S3) as well as residual versus fitted values (Supplemental Figs. S4 and S5). As has been shown in other research30,31, aggregating noisy predictions based on similar environmental preferences can improve performance, especially for larger datasets. Thus, we generated an aggregated data set that uses a k-means clustering of predictors (Supplemental Figs. S8 and S9). We performed the same analysis for the spring and the fall as well as with the presence absence data and recorded AUC as well as precision for each species (Supplemental Figs. S6 and S7). Precision is defined as the arithmetic mean of precision (proportion of predicted presences actually observed as presences) across all threshold values (at an interval of 0.01).Final modelWe ran the final model on 100% of the data with 20,000 iterations and a burn-in of 8000 iterations for the spring and fall for CPUE as well as presence absence for a total of 4 models. From the final model we obtained coefficients for the species-environment responses, β, covariance between species in how they respond to the environment E, and the residual correlation from the fitted model, R. We subtracted the absolute values from the presence/absence residual correlation matrix from the absolute values of the CPUE residual correlation matrix to observe where these results diverged. For MCMC chains and convergence of the final model as well as example models from both methods of variable selection see Supplemental Figs. S10–S12).Comparison to SSDMsWe built single species distribution models for each species in the form of GAMs using the mgcv package in R32. GAMs are a semiparametric extension of the generalized linear model and are a common modeling technique for species distribution modeling in this ecosystem33. For each species, we ran one GAM with CPUE as the response variable with a log-linked tweedie distribution that had penalized regression splines, a REML smoothing parameter with an outer Newton optimizer, 10 knots, and omitted NAs. We also ran GAMs for each species with a binary response variable indicating species presence with a binomial error distribution and a logit link function, penalized regression splines, a REML smoothing parameter with an outer Newton optimizer, 10 knots, and omitted NAs. We compared the out of sample observed versus predicted values for GAMs versus GJAM using RMSPE, R2, AUC, and precision. Root Mean Squared Prediction Error (RMSPE) is a measure of the average squared difference between the observed and predicted values, measured in the same units as the input data (kg/tow). R2 is a measure of the average squared difference between the observed and predicted values and is unitless. R2 is calculated as (1 − sum((predicted − observed)2)/sum((observed − mean(observed))2)) The ROC curve is a measure of model performance which plots true positive rate versus false positive rate, and the area under the ROC curve (AUC) provides a single measure of accuracy. A pairwise Wilcoxon test was used to compare means. We also compare the significance of predictors in both the GJAM model and GAM models. In this example, significance is defined for GJAM as a credible interval of the beta estimation that does not cross zero, and for the GAM as a p-value less than 0.0534.Spatial and temporal autocorrelationExamining the spatial and temporal autocorrelation of the modeled residuals can help specify missing endogenous (habitat selection or density dependence) and exogenous (covariate) effects that may be missing from the model. Thus, for each species modeled, we plot the spatial autocorrelation of residuals using a semi-variogram for the year 2015 and the temporal autocorrelation of the residuals using a partial autocorrelation function (PACF). We present the results for each species in the fall in the Supplemental materials (Supplemental Figs. S27–S57).All analysis and figure creation was performed in R version 3.6.235. Figures were created using the following R packages: ggplot236, ggpubr37, corrplot38, gridExtra39, cowplot40, lessR41, and ggcorrplot42. More