Community confounding in joint species distribution models
Historically, species distributions have been modeled independently of each other because of the unavailability of multispecies datasets and computational constraints. However, ecological datasets that provide insights about collections of organisms have become prevalent over the last decade thanks to efforts like the Long Term Ecological Research Network (LTER), the National Ecological Observatory Network (NEON), and citizen science surveys1. In addition, technology has improved our ability to fit modern statistical models to these datasets that account for both species environmental preferences and interspecies dependence. These advancements have allowed for the development of joint species distribution models (JSDM)2,3,4 that can model dependence among species simultaneously with environmental drivers of occurrence and/or abundance.

Species distributions are shaped by both interspecies dynamics and environmental preferences5,6,7,8. JSDMs integrate both sources of variability and adjust uncertainty to reflect that multiple confounded factors can contribute to similar patterns in species distributions. Some have proposed that JSDMs not only account for biotic interactions but also correct estimates of association between species distributions and environmental drivers3,9, while others claim JSDMs cannot disentangle the roles of interspecies dependence and environmental drivers5. We address why JSDMs can provide inference distinct from their concomitant independent SDMs, how certain parameterizations of a JSDM induce confounding between the environmental and random species effects, and when deconfounding these effects may be appealing for computation and interpretation.

Because of the prevalence of occupancy data for biomonitoring in ecology, we focus our discussion of community confounding in JSDMs on occupancy models, although we also consider a JSDM for species density data in the simulation study.
The individual species occupancy model was first formulated by MacKenzie et al.10 and has several joint species extensions4,11,12,13,14,15,16. We chose to investigate the impacts of community confounding on the probit model since it has been widely used in the analysis of occupancy data4,13,17. We also developed a joint species extension to the Royle-Nichols model18 and consider community confounding in that model.

In what follows, we use the probit and Royle-Nichols occupancy models to improve our understanding of montane mammal communities. We show that including unstructured random species effects in either occupancy model induces confounding between the fixed environmental and random species effects. We demonstrate how to orthogonalize these effects in the model and compare the resulting inference with that from models where species are treated independently.

Unlike previous studies that have applied restricted regression techniques similar to ours, we apply these techniques in the context of well-known ecological models for species occupancy and intensity. While such approaches have been discussed in spatial statistics and environmental science, they have not been adopted in settings involving the multivariate analysis of community data. We draw parallels between restricted spatial regression and restricted JSDMs but also highlight where the methods differ in goals and outcomes. We find that the computational benefits conferred by performing restricted spatial regression also hold for some joint species distribution models.

Royle-Nichols joint species distribution model

We present a JSDM extension to the Royle-Nichols model18. The Royle-Nichols model accounts for heterogeneity in detection induced by the species’ latent intensity, a surrogate related to true species abundance. Abundance, density, and occupancy estimation often requires an explicit spatial region that is closed to emigration and immigration.
In our model, the unobservable intensity variable helps us explain heterogeneity in the frequencies with which we observe a species at different sites without making assumptions about population closure. In the “Model” section, we further discuss the distinctions between abundance and intensity in the Royle-Nichols model.

The Royle-Nichols model utilizes occupancy survey data but provides inference distinct from the basic occupancy model10. In the Royle-Nichols model, we estimate individual detection probability for homogeneous members of the population, whereas in an occupancy model, we estimate the probability of observing at least one member of the population given that the site is occupied. Furthermore, the Royle-Nichols model allows us to relate environmental covariates to the latent intensity associated with a species at a site, while in an occupancy model, environmental covariates are associated with the species’ latent probability of occupancy at a site. Species intensity and occupancy may be governed by different mechanisms, and inference from an intensity model can be distinct from that provided by an occupancy model19,20,21. Cingolani et al.20 proposed that, in plant communities, certain environmental filters preclude species from occupying a site and an additional set of filters may regulate whether a species can flourish. Hence, certain covariates that were unimportant in an occupancy model may improve predictive power in an intensity model.

Community confounding

Species distributions are shaped by environment as well as competition and mutualism within the community8,22,23. Community confounding occurs when species distributions are explained by a convolution of environmental and interspecies effects and can lead to inferential differences between a joint and single species distribution model as well as create difficulties for fitting JSDMs.
Previous studies have incorporated interspecies dependence into an occupancy model4,11,12,13,14,15,16, and others have addressed spatial confounding1,17,24,25, but none of these explicitly addressed community confounding. However, all Bayesian joint occupancy models naturally attenuate the effects of community confounding through the prior on the regression coefficients. The prior, assuming it is proper, induces regularization on the regression coefficients26 that can lessen the inferential and computational impacts of confounding27. Furthermore, latent factor models like that described by Tobler et al.4 restrict the dimensionality of the random species effect, which should also reduce confounding with the environmental effects.

We address community confounding by formulating a version of our model that orthogonalizes the environmental effects and random species effects. Orthogonalizing the fixed and random effects is common practice in spatial statistics and often referred to as restricted spatial regression27,28,29,30,31. Restricted regression has been applied to spatial generalized linear mixed models (SGLMM) for observations \(\boldsymbol{y}\), which can be expressed as

$$\boldsymbol{y} \sim [\boldsymbol{y}|\boldsymbol{\mu}, \boldsymbol{\psi}], \tag{1}$$

$$g(\boldsymbol{\mu}) = \boldsymbol{X}\boldsymbol{\beta} + \boldsymbol{\eta}, \tag{2}$$

$$\boldsymbol{\eta} \sim \mathcal{N}(\boldsymbol{0}, \boldsymbol{\Sigma}), \tag{3}$$
where \(g(\cdot)\) is a link function, \(\boldsymbol{\psi}\) are additional parameters for the data model, and \(\boldsymbol{\Sigma}\) is the covariance matrix of the spatial random effect. In the SGLMM, prior information facilitates the estimation of \(\boldsymbol{\eta}\), which would not be estimable otherwise due to its shared column space with \(\boldsymbol{\beta}\)30. This is analogous to applying a ridge penalty to \(\boldsymbol{\eta}\), which stabilizes the likelihood. Another method for fitting the confounded SGLMM is to specify a restricted version:

$$\boldsymbol{y} \sim [\boldsymbol{y}|\boldsymbol{\mu}, \boldsymbol{\psi}], \tag{4}$$

$$g(\boldsymbol{\mu}) = \boldsymbol{X}\boldsymbol{\delta} + (\boldsymbol{I}-\boldsymbol{P}_{\boldsymbol{X}})\boldsymbol{\eta}, \tag{5}$$

$$\boldsymbol{\eta} \sim \mathcal{N}(\boldsymbol{0}, \boldsymbol{\Sigma}), \tag{6}$$
where \(\boldsymbol{P}_{\boldsymbol{X}}=\boldsymbol{X}(\boldsymbol{X}'\boldsymbol{X})^{-1}\boldsymbol{X}'\) is the projection matrix onto the column space of \(\boldsymbol{X}\). In the unrestricted SGLMM, the regression coefficients \(\boldsymbol{\beta}\) and random effect \(\boldsymbol{\eta}\) in (1)–(3) compete to explain variability in the latent mean \(\boldsymbol{\mu}\) in the direction of \(\boldsymbol{X}\)27. In the restricted model, however, all variability in the direction of \(\boldsymbol{X}\) is explained solely by the regression coefficients \(\boldsymbol{\delta}\) in (4)–(6)31, and \(\boldsymbol{\eta}\) explains residual variation that is orthogonal to \(\boldsymbol{X}\). We refer to \(\boldsymbol{\beta}\) as the conditional effects because they depend on \(\boldsymbol{\eta}\), and \(\boldsymbol{\delta}\) as the unconditional effects.

Restricted regression, as specified in (4)–(6), was proposed by Reich et al.28, who described a disease-mapping example in which the inclusion of a spatial random effect rendered unimportant a covariate effect that was important in the non-spatial model. Spatial maps indicated an association between the covariate and response, making inference from the spatial model appear untenable. Reich et al.28 proposed restricted spatial regression as a method for recovering the posterior expectations of the non-spatial model and shrinking the posterior variances, which tend to be inflated for the unrestricted SGLMM.

Several modifications of restricted spatial regression have been proposed30,32,33,34,35. All restricted spatial regression methods seek to provide posterior means \(\text{E}\left(\delta_j|\boldsymbol{y}\right)\) and marginal posterior variances \(\text{Var}\left(\delta_j|\boldsymbol{y}\right)\), \(j=1,\ldots,p\), that satisfy the following two conditions36:
1. \(\text{E}\left(\boldsymbol{\delta}|\boldsymbol{y}\right) = \text{E}\left(\boldsymbol{\beta}_{\text{NS}}|\boldsymbol{y}\right)\), and
2. \(\text{Var}\left(\beta_{\text{NS},j}|\boldsymbol{y}\right) \le \text{Var}\left(\delta_{j}|\boldsymbol{y}\right) \le \text{Var}\left(\beta_{\text{Spatial},j}|\boldsymbol{y}\right)\) for \(j=1,\ldots,p\),
where \(\boldsymbol{\beta}_{\text{NS}}\) and \(\boldsymbol{\beta}_{\text{Spatial}}\) are the regression coefficients corresponding to the non-spatial and unrestricted spatial models, respectively.

The inferential impacts of spatial confounding on the regression coefficients have been debated. Hodges and Reich29 outlined five viewpoints on spatial confounding and restricted regression in the literature and refuted the following two views:
1. Adding the random effect \(\boldsymbol{\eta}\) corrects for bias in \(\boldsymbol{\beta}\) resulting from missing covariates.
2. Estimates of \(\boldsymbol{\beta}\) in a SGLMM are shrunk by the random effect and hence conservative.
The random effect \(\boldsymbol{\eta}\) can increase or decrease the magnitude of \(\boldsymbol{\beta}\), and the change may be driven by mechanisms not related to missing covariates. Therefore, we cannot assume the regression coefficients in the SGLMM will exceed those of the restricted model, nor should we regard the estimates in either model as biased due to misspecification. Confounding in the SGLMM causes \(\text{Var}\left(\beta_j|\boldsymbol{y}\right) \ge \text{Var}\left(\delta_j|\boldsymbol{y}\right)\), \(j=1,\ldots,p\), because of the shared column space of the fixed and random effects. Thus, we refer to the conditional coefficients as conservative with regard to their credible intervals, not their posterior expectations.

Reich et al.28 argued that restricted spatial regression should always be applied because the spatial random effect is generally added to improve predictions and/or correct the fixed effect variance estimate. While it may be inappropriate to orthogonalize a set of fixed effects in an ordinary linear model, orthogonalizing the fixed and random effects in a spatial model is permissible because the random effect is generally not of inferential interest. Paciorek37 provided the alternative perspective that, if confounding exists, it is inappropriate to attribute all contested variability in \(\boldsymbol{y}\) to the fixed effects. Hanks et al.31 discussed factors for deciding between the unrestricted and restricted SGLMM on a continuous spatial support. The restricted SGLMM leads to improved computational stability, but the unconditional effects are less conservative under model misspecification and more prone to type-S errors, the Bayesian analogue of Type I errors. Fitting the unrestricted SGLMM when the fixed and random effects are truly orthogonal does not introduce bias, but it will increase the fixed effect variance. Given these considerations, Hanks et al.31 suggested a hybrid approach where the conditional effects, \(\boldsymbol{\beta}\), are extracted from the restricted SGLMM.
This is possible because the restricted SGLMM is a reparameterization of the unrestricted SGLMM. This hybrid approach leads to improved computational stability but yields the more conservative parameter estimates. We describe how to implement this hybrid approach for joint species distribution models in the “Community confounding” section.

Restricted regression has also been applied in time series applications. Dominici et al.38 debiased estimates of fixed effects confounded by time using restricted smoothing splines. Without the temporal random effect, Dominici et al.38 asserted, all temporal variation in the response would be wrongly attributed to temporally correlated fixed effects. Houseman et al.39 used restricted regression to ensure identifiability of a nonparametric temporal effect and highlighted certain covariate effects that were more evident in the restricted model (i.e., the unconditional effects’ magnitude was greater). Furthermore, restricted regression is implicit in restricted maximum likelihood estimation (REML). REML is often employed for debiasing the estimate of the variance of \(\boldsymbol{y}\) in linear regression and for fitting linear mixed models that are not estimable in their unrestricted format40. Because REML is generally applied in the context of variance and covariance estimation, considerations regarding the effects of REML on inference for the fixed effects are lacking in the literature.

In ecological science, JSDMs often include an unstructured random effect like \(\boldsymbol{\eta}\) in (1)–(3) to account for interspecies dependence, and hence can also experience community confounding between \(\boldsymbol{X}\) and \(\boldsymbol{\eta}\) analogous to spatial confounding. Unlike a spatial or temporal random effect, we consider random species effects to be inferentially important, rather than a tool solely for improving predictions or a catch-all for missing covariates.
An orthogonalization approach in a JSDM attributes contested variation between the fixed effects (environmental information) and random effect (community information) to the fixed effects.

We describe how to orthogonalize the fixed and random species effects in a suite of JSDMs and present a method for detecting community confounding. In the simulation study, we test the efficacy of our method for detecting confounding, show that community confounding can lead to computational difficulties similar to those caused by spatial confounding31, and highlight that, for some models, restricted regression can improve model fitting. We also investigate the inferential implications of community confounding and restricted regression in JSDMs by comparing outputs from the SDM, unrestricted JSDM, and restricted JSDM of the Royle-Nichols and probit occupancy models fit to mammalian camera trap data. Lastly, we discuss other inferential and computational methods for confounded models and consider their appropriateness for joint species distribution modeling.
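
The reparameterization that links the restricted and unrestricted specifications can be checked numerically. The following is a minimal sketch, not the authors' implementation: it assumes a single response vector, a small simulated design matrix, and illustrative parameter values (`n`, `p`, `delta`, and `eta` are all hypothetical). It builds the projection matrix \(\boldsymbol{P}_{\boldsymbol{X}}\), confirms that the restricted random effect is orthogonal to the columns of \(\boldsymbol{X}\), and recovers conditional effects from a restricted-model draw using \(\boldsymbol{\beta} = \boldsymbol{\delta} - (\boldsymbol{X}'\boldsymbol{X})^{-1}\boldsymbol{X}'\boldsymbol{\eta}\), which follows from equating the two linear predictors.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: n sites, p covariates (including an intercept).
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])

# (X'X)^{-1} X', computed with a linear solve rather than an explicit inverse.
XtX_inv_Xt = np.linalg.solve(X.T @ X, X.T)

# Projection matrix onto the column space of X: P_X = X (X'X)^{-1} X'.
P_X = X @ XtX_inv_Xt
I_n = np.eye(n)

# One illustrative "draw" of the restricted-model parameters (assumed values).
delta = np.array([0.5, -1.0, 2.0])   # unconditional effects
eta = rng.normal(size=n)             # random effect before restriction

# Restricted linear predictor: X delta + (I - P_X) eta.
lin_pred_restricted = X @ delta + (I_n - P_X) @ eta

# Conditional effects implied by the same draw: beta = delta - (X'X)^{-1} X' eta.
beta = delta - XtX_inv_Xt @ eta

# Unrestricted linear predictor: X beta + eta.
lin_pred_unrestricted = X @ beta + eta

# The restricted random effect has no component in the direction of X.
orthogonality_gap = np.max(np.abs(X.T @ (I_n - P_X) @ eta))

# Both parameterizations give the same linear predictor.
predictor_gap = np.max(np.abs(lin_pred_restricted - lin_pred_unrestricted))

print(f"max |X'(I - P_X) eta| = {orthogonality_gap:.2e}")
print(f"max predictor difference = {predictor_gap:.2e}")
```

In an MCMC setting, the same identity would be applied to each posterior draw of \((\boldsymbol{\delta}, \boldsymbol{\eta})\) to obtain draws of \(\boldsymbol{\beta}\), which is the sense in which a hybrid approach can extract conditional effects from a restricted fit.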