The potential distribution of Bacillus anthracis suitability across Uganda using INLA

Study setting

Uganda covers 241,037 km² (4° N, 2° S, 29° E, 35° E) and averages 1100 m above sea level²⁵. Uganda has a total population of 34.6 million people, with 7.4 and 27.2 million living in urban and rural areas, respectively²⁶. It borders Lake Victoria and has an equatorial climate. The study area is mainly plateau, with a few mountains. About 20 percent of Uganda’s area is covered by swamps and water bodies, including the four Great Lakes of East Africa (Lake Edward, Lake Victoria, Lake Albert, and Lake Kyoga). The country has ten national parks housing a high diversity of wildlife and endangered species²⁶. By 2009, agriculture ranked as the second leading contributor to the country’s Gross Domestic Product²⁷.

Surveillance data of Bacillus anthracis infection

Surveillance data of livestock and human cases from 2018 were provided by the Ministry of Health (MOH) in Uganda through the Field Epidemiology and Laboratory Training Program and the Uganda Ministry of Agriculture, Animal Industry and Fisheries (MAAIF). Geographical coordinates of anthrax cases among wildlife (from 2004 to 2010) in Queen Elizabeth National Park were obtained from a recently published study⁴. These cases were mapped to show the distribution of anthrax across Uganda (Fig. 1). It is only recently that Uganda has mandated systematic anthrax surveillance across the country following the outbreak that started in 2018. GPS coordinates gathered by field personnel during outbreak responses were used to map outbreak locations. The human anthrax cases were defined based on the CDC’s clinical criteria (signs and symptoms), presumptive laboratory diagnosis (Gram staining), and confirmatory laboratory diagnosis (bacterial culture, immunohistochemistry, ELISA, and PCR)²⁸. The animal anthrax cases were also defined based on the clinical presentation, presumptive laboratory diagnosis, and confirmatory laboratory diagnosis. All cases were classified as either ‘probable’ or ‘confirmed,’ with probable defined as cases that met both the clinical and presumptive laboratory diagnostic criteria and confirmed defined as cases that met the clinical and confirmatory laboratory diagnostic criteria. A total of 497 livestock (n = 171), humans (n = 32), and wildlife cases (n = 294), both confirmed (n = 32) and probable (n = 465), occurring from 2004 to 2018 were compiled. All methods were performed in accordance with the relevant guidelines and regulations.

Figure 1

Distribution of anthrax presence and pseudo-absence locations across Uganda. The navy-blue circles show wildlife cases (n = 294) used for model training, the blue triangles show livestock cases (n = 171), and the red diamonds represent human cases (n = 32). The blue polygons show the locations of the 50 km, 75 km, and 100 km buffers which were constructed around the wildlife cases with a distance of 10 km between the buffers and the presence locations. The pink dots show the pseudo-absence points selected within the 50 km buffer, the orange dots show the pseudo-absence points selected within the 75 km buffer, and the white dots show the pseudo-absence points selected within the 100 km buffer. Prediction maps were developed using the Quantum Geographic Information System software (QGIS). URL: https://qgis.osgeo.org (2020).

Full size image

Environmental variable processing

Correlative studies of environmental risk factors for anthrax outbreaks suggest that temperature^{6,29,30,31,32,33,34,35,36,37}, precipitation^{6,29,30,31,32,33,34,35,36,37,38,39}, elevation^{6,29,31,32,34,35,37,38,39}, soil (type, calcium concentration, pH, carbon content, and moisture)^{6,30,31,33,34,36,37,39,40,41,42,43}, vegetation^{6,29,31,34,36,37,38,39,40,42,43}, and hydrology^37,44 are some of the major drivers of B. anthracis suitability. A total of 26 environmental predictors (Fig. 2) were selected for this study based on these known variables. These comprised 19 bioclimatic variables (the mean for the years 1970–2000) collected from the WorldClim database version 2 (https://www.worldclim.org/data/worldclim21.html) at a resolution of 30 s (~ 1 km²)⁴⁵. Four soil variables, including exchangeable calcium at a depth of 0–20 cm, soil water availability, soil pH (10×) in H₂O at a depth of 0 cm, and soil organic carbon at a depth of 0–5 cm, were retrieved from the International Soil Reference and Information Centre (ISRIC) data hub at a resolution of 250 m (https://data.isric.org/geonetwork/srv/eng/catalog.search#/home). Distance to permanent water bodies was derived from a global hydrology map provided by ArcGIS online version 10.6.1⁴⁶. Elevation data of 1 km² in resolution was obtained from the Global Multi-resolution Terrain Elevation Data (GMTED2010) dataset available from the United States Geological Service. Finally, the monthly Enhanced Vegetation Index (EVI) data for the years 2004, 2005, and 2010 (36 tiles in total) were obtained from The Aqua Moderate Resolution Imaging Spectroradiometer (MODIS) Vegetation Indices (MYD13A3 v.6) at a spatial resolution of 1 km (https://lpdaac.usgs.gov/products/myd13a3v006/). The single variable, mean EVI, was calculated in QGIS by averaging all 36 tiles. The EVI minimizes variations in the canopy background and maintains precision over conditions with dense vegetation.

Figure 2

Results of correlation between covariates using Pearson’s correlation test. Correlation between covariates was shown by red numbers (negative correlation) and blue numbers (positive correlation). BIO1 = Annual Mean Temperature, BIO2 = Mean Diurnal Range, BIO3 = Isothermality, BIO4 = Temperature Seasonality, BIO5 = Max Temperature of Warmest Month, BIO6 = Min Temperature of Coldest Month, BIO7 = Temperature Annual Range, BIO8 = Mean Temperature of Wettest Quarter, BIO9 = Mean Temperature of Driest Quarter, BIO10 = Mean Temperature of Warmest Quarter, BIO11 = Mean Temperature of Coldest Quarter, BIO12 = Annual Precipitation, BIO13 = Precipitation of Wettest Month, BIO14 = Precipitation of Driest Month, BIO15 = Precipitation Seasonality, BIO16 = Precipitation of Wettest Quarter, BIO17 = Precipitation of Driest Quarter, BIO18 = Precipitation of Warmest Quarter, BIO19 = Precipitation of Coldest Quarter.

Full size image

All environmental variables were resampled using the R package ‘Resample’⁴⁷ to a resolution of 1 km and clipped to the extent of Uganda. Since data sampling for wildlife data (used in model training) was not done systematically across the study area, target backgrounds buffers (polygon buffers created at certain radii from the training points and used for the random selection of pseudo-absences) were created for model calibration to reduce sampling bias. As there was no information on the sampling radius, a sensitivity analysis was conducted by creating target backgrounds using circular buffers of radius 50 km, 75 km, and 100 km around the presence points used for model training, leaving 10 km between the presence points and the various buffers (Fig. 1). Quantum Global Information System (QGIS) version 3.16 (https://qgis.org) software was then used to add 294 random pseudo-absence points within each buffer polygon giving a ratio of 1:1 for the training presences to pseudo-absences. A recent study explored how four approaches of pseudo-absence creation affect the performance of models across different species and three model types by building both terrestrial and marine models using boosted regression trees, generalised additive mixed models, and generalised linear mixed models⁴⁸. They then tested four methods for generating pseudoabsences across all the different model types: (1) correlated random walks (RW); (2) reverse correlated RW; (3) sampling pseudoabsences within a buffer area surrounding the presence points; (4) background sampling⁴⁸. The findings of the study suggested that the separation or distance in the environmental space between the presence locations and the pseudoabsences was the most significant driver of the model predictive ability and explanatory power, and thus finding was consistent across the three model types (boosted regression trees, generalised additive mixed models, and generalised linear mixed models) and both the terrestrial and marine habitats⁴⁸.

The values of the environmental variables were then extracted for each presence and pseudo-absence location using the raster package in R. With these, we did an initial data exploration to check for outliers within the covariates, collinearity, and to explore the relationships between the covariates and the response variables (presence or absence of anthrax). Cleveland dot plots were used to check for possible outliers. Following the outlier checks, variance inflation factors (VIF), pairwise plots, and Pearson correlation coefficients with correction for multiple comparisons were used to measure the statistically significant correlation between the covariates (Fig. 2). For variables that were highly correlated (correlation greater than 0.6) or those with high variance inflation (VIF > 3), only one was used in the modelling process. Five variables were selected following this analysis: Temperature seasonality (BIO4), elevation, distance to water, soil calcium, and soil water (Table 1).

Table 1 Summary of the environmental variables used.

Full size table

Modelling anthrax suitability across Uganda

QGIS v.3.16 (https://qgis.org) and the R statistical package version 4.1.0⁴⁹ were used to conduct data visualization, cleaning, and model analysis (R code used available in a Github repository: https://github.com/valentinandolo/Uganda-Spatial_Model/tree/master). The wildlife cases (n = 294) were used for model training and testing. The remaining human (n = 32) and livestock (n = 171) cases were used for model evaluation. Since the wildlife case locations were recorded from 2004 to 2010, while the livestock and human cases were recorded in 2018, the latter locations were both spatially and temporally distinct from the wildlife cases, making an excellent basis for block cross-validation of the final model performance⁵⁰. Random partitioning of the data into training and testing sets can inflate the performance of a model and underestimate the error in the spatial prediction evaluation⁵⁰. Block cross-validation uses spatial blocks that separate the testing and training datasets; thus, the method has been suggested to be a good technique for error estimation and a robust approach for measuring a model’s predictive performance⁵⁰.

The INLA package in R was applied to model the suitability of B. anthracis across Uganda. INLA calculates the spatial interaction effects using a Stochastic Partial Differential Equation (SPDE) method, which estimates a continuous Gaussian Markov Random Field (GMRF) where the correlation between two locations in space is specified by the Matérn correlation which is explained in more detail elsewhere⁵¹. The initial step in fitting an SPDE model is the creation of a Constrained Refined Delaunay Triangulation or a mesh to illustrate the spatial process⁵¹. R-INLA uses a function called ‘inla.mesh.2d()’ that applies a variety of arguments to build the mesh, these include: loc, loc.domain, boundary, max.edge, and cutoff⁵¹. The loc argument contains the point locations which provide information about the area of study and are used to create the triangulation nodes. Alternatively, a polygon of the study area can be used to identify the extent of the domain via loc.domain. We applied the point locations using the loc argument. We then specified the boundary of the mesh as a convex hull. We used the max.edge argument to specify the maximum edge length for the inner mesh domain/triangles and the outer triangles. We did this by first studying the distribution of distances between the point locations for the training presences and pseudo-absences. Most points were within a distance of about 90–100 km away from each other, thus, a possible guess for the range at which spatial autocorrelation persists was 100 km. We used a distance of 20 km as a range guess to create a finer mesh which has been shown to produce more precise models. We specified the maximum edge length for the inner triangles as 20 km divided by 5 (4 km) and the maximum edge length for the outer triangles as 20 km. Finally, the cutoff argument sets the minimum distance allowed between point locations. We divided the maximum edge length for the inner triangles by 5 (4 km divided by 5 = 0.8 km), meaning that points that were closer in distance than 0.8 km were replaced by one vertex to avoid the occurrence of small triangles.

The spatial effect, which is a numeric vector, then links each observation within the data to a spatial location, thus, accounting for region-specific variation that cannot be accounted for by the covariates. Following the recommendation by Lindgren and Rue⁵², multivariate Gaussian distributions with means of zero and a spatially defined covariance matrix were used to model the spatial effect. Several versions of Bayesian hierarchical additive models were created by estimating a Bernoulli generalized additive model (GAM) with and without spatially correlated random effects. The Bernoulli GAM is defined as shown in Eqs. (1) and (2)

$${C}_{i} sim Bernoulli left({p}_{i}right),$$

(1)

$$logit left({p}_{i}right)= alpha +sum_{j=1}^{m}{beta }_{j}left({X}_{j,i}right)+ sum_{k=1}^{l}{f}_{k}left({X}_{k,i}right)+ {mu }_{i},$$

(2)

where ({C}_{i}) denotes the observed value, such that: B. anthracis presence or absence at a given location i (i = 1, …, n; n = 588) is given as ({C}_{i}), where ({C}_{i}) =1 if B. anthracis was present, and ({C}_{i})=0 if absent. Logit is the link function for binomial family, ({p}_{i}) is the expected value of the response variable (the probability of B. anthracis suitability) at location i, α is the intercept, ({X}_{j,i}) and ({X}_{k,i}) are the j th and the k th covariates at a location i, ({beta }_{j}) are the beta coefficients, ({f}_{k}) are the smooth functions (cubic regression splines) for k th covariates, and ({mu }_{i}) is the spatial random effect at the location i⁵³. The number of variables in linear term (m) and the non-liner term (l) are different because the variables employed in each term are different. We estimated both linear and non-linear effects for the covariates. Our overall database had 294 B. anthracis pseudo-absences generated randomly across the target background buffers to match the number of species presences recorded. Because we had no prior information, a Gaussian prior distribution with a mean of zero (default no effect prior unless data is informative) was applied for all the model parameters. The posterior mean, standard deviation, and 95% credible intervals were estimated for all the parameters.

Model selection

Several different candidate models were examined. First, a baseline model was built using only the intercept. A second baseline model was then built, which included the intercept and spatial random effects only. Covariates were added to the second baseline model (intercept plus spatial effects model) without any smoothing function (i.e., only linear effects). The contribution of the spatial random effect was then re-examined by taking it out from the model. Smoothing functions were then added to all covariates, and the same procedure was repeated. Model selection was done using this forward stepwise approach. The final model was run using the three different target background buffers to identify the buffer distance with the lowest Deviance Information Criterion (DIC). The various options were assessed using the DIC⁵⁴, Watanabe-Akaike information criterion (WAIC), and the Conditional Predictive Ordinate (CPO). The DIC and WAIC were chosen because they are commonly used to assess model performance by measuring the compromise between the goodness of fit and complexity. For CPO, the logarithmic score (− mean(log (CPO)) (LCPO) was calculated and used⁵⁵. CPO can also be used to conduct internal cross-validation of models using a leave-one-out approach to evaluate the predictive performance of the model. Lower LCPO, DIC, and WAIC estimates suggest superior model performance. Thus, the favoured model had the lowest values across the 3 metrics.

Model validation and evaluation

Model validation for the favoured model was conducted using an independent evaluation dataset comprising of livestock and human outbreaks occurring at different spatial locations and 8 years after the training data. The omission rate, which indicates the proportion of positive test locations that end up in pixels predicted to be unsuitable for B. anthracis⁵⁶, was used to validate the model. A low omission rate indicates good model performance. The threshold for suitability was the probability threshold that maximized the sensitivity and specificity of the model. The sensitivity was then derived by calculating the proportion of positive test locations that end up in pixels predicted to be suitable for B. anthracis⁵⁶. A high sensitivity indicates good model performance.

Model prediction

The favoured model selected using the criteria described above was used to generate countrywide prediction maps showing the posterior mean values, standard deviation, and the 95% credible intervals of the probability of B. anthracis suitability. The raster package in R was used to make the prediction maps. Bayesian kriging was done by treating all model parameters as random variables to include uncertainty in the prediction⁵⁷. This kriging is built into the INLA framework via the SPDE, which allows a Delaunay triangulation to be constructed around the presence and absence locations within the sampling frame⁵². INLA then conducts model inference and prediction at the same time by considering the prediction points as locations missing the response variable (set to NA)⁵¹. Following successful model prediction, additional linear interpolation functions then generate the output for the whole study area scaled from 0 to 1.

Ethical approval

Ethical approval for this study was provided by the Human Biology Research Ethics Committee, University of Cambridge, UK (Ref: HBREC.2019.02) the School of Veterinary Medicine and Animal Resources Institutional Animal Care and Use Committee, Makerere University, Uganda (Ref: SVAREC/21/2019); and the Uganda National Council for Science and Technology. Informed consent was obtained from all subjects and/or their legal guardian(s). All methods were performed in accordance with the relevant guidelines and regulations.