in

# Drivers and potential distribution of anthrax occurrence and incidence at national and sub-county levels across Kenya from 2006 to 2020 using INLA

### Data sources

We analyzed records of confirmed and suspected livestock deaths attributed to anthrax occurring from 1 January 2006 to 31 December 2020 across Kenya (available online along with full code for the analysis in this paper https://github.com/spatialmodels/Kenyan_anthrax_model). The case records covering the entire country were reported from the Kenya Directorate of Veterinary Services (KDVS) located in Nairobi and the five Regional Veterinary Investigation Laboratories located in Nakuru, Eldoret, Karatina, Kericho, and Mariakani. The anthrax outbreaks were considered as any livestock (cattle, goats, sheep, pigs, camels) or wildlife deaths confirmed through clinical and laboratory diagnosis. Clinical diagnosis was defined as an acute disease accompanied by sudden death, bleeding from body orifices, swelling, lack of rigor mortis, and oedema of the neck and face in pigs. Laboratory confirmation was done through methylene blue staining to identify the characteristic bacterial capsule and the rod-shaped bacilli in clinical specimens collected from the infected carcasses.

We extracted data from old paper records of livestock anthrax cases into Microsoft Excel. These records comprised the location of the livestock outbreaks, name of the farmer, number of animals dead and herd size, species affected, date, method of diagnosis, and the details of the reporting veterinary doctor. Since the locations of livestock anthrax outbreaks were reported at sub-county/district levels (districts refer to the old naming given to current sub-counties before the rollout of the current constitution), we recorded the geographic coordinates of livestock cases at the district level. During data cleaning, we removed duplicate coordinates, outliers, and entries with missing variables. In the end, we had 540 livestock cases that we used for analysis. The spatial granularity and prolonged surveillance period of these data allow for a more detailed perspective on the major drivers of anthrax across Kenya. We also collected wildlife data from the Kenya Wildlife Service (KWS). Most of the data from KWS was lacking information on the geographic coordinates of the outbreaks, so we visited the actual locations and collected the coordinates. We recorded 20 wildlife cases that we used to validate the performance of the spatial model.

### Processing socio-economic and ecological covariates

We gathered geospatial data on ecological and socio-economic correlates of B. anthracis ecology and distribution. For the spatial model, we obtained the following variables: rainfall, vegetation, elevation, distance to permanent water bodies, and soil patterns. For the spatiotemporal models, we used human population estimates (total population, population density, and male and female population per sub-county), host population (livestock producing households, total number of indigenous, exotic dairy, and exotic beef cattle per sub-county), and agricultural practices that lead to soil disturbance (agricultural area under cultivation, number of farming households, and crop-producing households).

We chose seven environmental covariates for the spatial model based on known correlates of B. anthracis suitability identified from previous peer-reviewed studies9,10,13,15,21,22,23. These comprised three soil variables, including soil pH (× 10) in H2O at a depth of 0 cm, exchangeable calcium at a depth of 0–20 cm, and soil water availability (volume of water per unit volume of soil) retrieved at a resolution of 250 m from the International Soil Reference and Information Centre (ISRIC) data hub (https://data.isric.org/geonetwork/srv/eng/catalog.search#/home). We used the shallowest depth available because although the bacterial spores can persist in the surface soil for up to five years and indefinitely in much deeper soils24, the spores in the surface soils are more likely to trigger host infection25. We retrieved monthly Enhanced Vegetation Index (EVI) data from 1 January 2006 to 31 December 2020 (180 tiles in total) from The Aqua Moderate Resolution Imaging Spectroradiometer (MODIS) Vegetation Indices (MYD13A3 v.6) at a resolution of 1 km2 (https://lpdaac.usgs.gov/products/myd13a3v006/). The mean EVI was then calculated using QGIS by averaging all 180 tiles. EVI reduces variations in the canopy background and retains precision over dense vegetation conditions. Monthly Climate Hazards Group InfraRed Precipitation with Station data (CHIRPS) rainfall data from rain gauge and satellite observations was retrieved from the United States Geological Service (USGS) at a resolution of 0.05 degrees (https://climateserv.servirglobal.net/map). Since the rainfall data also comprised 180 tiles, the mean rainfall was calculated by averaging all 180 tiles using QGIS. We also collected data on the distance to permanent water bodies from a global hydrology map obtained from ArcGIS version 10.6.1.26 and elevation data at 1 km2 resolution from the Global Multi-resolution Terrain Elevation Data (GMTED2010) dataset available from USGS (Table 1).

For the spatiotemporal sub-county-based models, we accessed the population data per sub-county (total population, male population, female population, and population density) from the 2019 Kenyan census report provided via the Humanitarian Data Exchange platform (https://data.humdata.org/dataset/kenya-population-per-county-from-census-report-2019). We also obtained data on livestock population (numbers of exotic dairy and beef cattle, and indigenous cattle), area of agricultural land in hectares, number of farming households, and the number of households actively practicing agriculture (crop production and livestock production) aggregated to the sub-county level from the 2019 Kenya Population and Housing Census volume IV provided by the OpenAfrica platform (https://open.africa/dataset/2019-kenya-population-and-housing-census).

We conducted data exploration to check for outliers, collinearity, and the relationships between the covariates and the response variables. We used Cleveland dot plots to check for outliers. We measured collinearity using variance inflation factors (VIF), Pearson correlation coefficients, and pairs plots. For VIF scores, the covariates with scores higher than 3 were eliminated one-by-one until all the scores were equal to or less than 3. All the covariates included in the study had correlation coefficient values of less than 0.6 (Figs. 1, 2).

### Spatial model analysis

We used R version 4.1.0 together with the packages raster version 4.1.127, and R-INLA version 4.1.128 to conduct the data processing and statistical modelling. The R-INLA package applies the INLA framework in designing models. We used Quantum Global Information System (QGIS) version 3.16 (https://qgis.org) to create a 50 km buffer polygon around all the observed livestock outbreak points. We then created a 20 km2 grid within this buffer and counted the number of points within each grid cell to create a regular lattice with a given number of counts per cell. We then extracted the coordinates of the centroids of each cell to create marked locations with a given number of livestock cases per location. We essentially converted the data into a count process (number of livestock outbreaks per location). We had 95 cells with one or more counts which formed our new presence locations. We then randomly selected 95 pseudoabsences within the 50 km buffer polygon but at a distance of 10 km from the presence locations as shown in Fig. 3.

We defined a Zero-inflated Poisson (ZIP) regression model with spatially correlated random effects, implemented as a generalized additive model (GAM) with anthrax incidence as the response variable. The model is defined as shown in Eqs. (1), (2), and (3)

$${C}_{i} sim zero-inflated, Poisson left({mu }_{i},{p}_{i}right),$$

(1)

$$expectedleft({C}_{i}right)=left(1- {p}_{i}right)times {mu }_{i},$$

(2)

$$mathrm{log}left({mu }_{i}right)= alpha + sum_{j}{beta }_{j}{X}_{j,i}+ sum_{k}{delta }_{k,i}+{u}_{i},$$

(3)

where (Ci) denotes the observed number of anthrax livestock cases at location i, ({mu }_{i}) and ({p}_{i}) are parameters of the ZIP distribution. (expectedleft({C}_{i}right)) refers to the expected number of outbreaks at location i, (alpha) is the intercept, (beta) are the beta coefficients for the covariates, X is the matrix with all the covariates, (delta k) are the non-linear effects (cubic regression splines), and ({u}_{i}) is the spatial random effect at location i.

To test whether the addition of the GAM smoothers and the spatially correlated random effects improved the fit of the model, we also considered candidate models without smoothers and spatial random effects. We tested three versions of the spatial model: the first used distance to water, elevation, and EVI as linear covariates without spatial random effects, the second applied non-linear terms to elevation and EVI also without spatial random effects, and the final model was similar to the second model but with the addition of spatial random effects. We then measured the DIC values of the candidate models to select the final spatial model.

We conducted model validation by assessing the posterior distributions of the parameters and the residuals for adherence to the distributional assumptions. We checked whether the residuals were independent and normally distributed. We also plotted a sample variogram to check for any residual spatial auto-correlation using a well-defined method29. We then ran 1000 simulations to check whether the model was capable of handling zeros.

The estimated model was used to map posterior predicted distributions for the incidence of anthrax disease (plotted as mean and 95% credible intervals). We validated the model using independent evaluation data withheld from the model calibration. This evaluation dataset comprises the wildlife cases collected from KWS. We then calculated the sensitivity by estimating the proportion of wildlife case locations correctly identified by the model, using the minimum presence training threshold (minimum value of the fitted presence training points).

### Spatiotemporal model analysis

Our second objective was to investigate the socio-economic, population-based drivers of livestock anthrax risk at the sub-county level. These socioeconomic variables are usually collected at the sub-county level. Therefore, we developed a second areal model with the number of observations per sub-county as the new response variable. The occurrence data, gathered by the Kenya Directorate for Veterinary Services (KDVS), consisted of monthly case reports of livestock anthrax cases collected by all 290 sub-counties across Kenya between January 2006 to December 2020. We analyzed the whole monthly case time series from the year 2006 to 2020 and mapped out the annual counts of confirmed and suspected livestock anthrax cases across Kenya at the sub-county level to analyse the spatial and temporal trends throughout the surveillance period. The sub-county shapefiles that were used for mapping and modelling were derived from Humanitarian Data Exchange version 1.57.16 under a Creative Commons Attribution for Intergovernmental Organisations license (https://data.humdata.org/dataset/ken-administrative-boundaries).

Due to the sparsity of data, we aggregated the monthly case counts and modelled the quarterly occurrence and incidence of anthrax at the sub-county-level scale, including spatial and temporal effects, to determine the spatial socio-economic drivers of livestock anthrax disease risk across Kenya. We used R-INLA version 4.1.1 (26) to conduct the data processing and statistical modelling. We used quarterly case counts that were confirmed per sub-county across the 15 years of surveillance (2006–2020) as a measure of anthrax incidence. Due to the zero-inflated and over-dispersed nature of the distribution, which is difficult to fit incidence counts, we employed a two-stage modelling approach using the hurdle model distribution to separately model anthrax occurrence (presence or absence) across all sub-counties via logistic regression, and incidence counts using a zero-inflated Poisson distribution. We were then able separately to estimate the contributions of the various socio-ecological factors that drive disease occurrence (the presence or absence of anthrax) and total incidence counts.

We model the quarterly anthrax occurrence (n = 290 sub-counties over 60 quarters; 17,400 observations) where ({Y}_{i,t}) refers to the binary presence (denoted as 1) or absence (denoted as 0) of anthrax in sub-county i during year t, and ({P}_{i,t}) is the probability of anthrax occurrence, thus:

$${Y}_{i,t} sim Bernoullileft({P}_{i,t}right).$$

(4)

We model quarterly anthrax incidence counts ({C}_{i,t}) using a zero-inflated Poisson process with parameters ({mu }_{i,t}) and ({p}_{i,t}) (see Eq. (5)). Equation (6) denotes the expected values for the ZIP distribution at sub-county i during year t.

$${C}_{i,t} sim Zero-inflated, Poisson left({mu }_{i,t},{p}_{i,t}right),$$

(5)

$$expectedleft({C}_{i,t}right)=left(1- {p}_{i,t}right)times {mu }_{i,t}.$$

(6)

Both the Bernoulli and the ZIP distributions are modelled separately as functions of the covariates and the spatial and temporal random effects using a general linear predictor as shown in Eqs. (7) and (8):

$$logit left({P}_{i,t}right)= alpha + sum_{j}{beta }_{j}{X}_{j,i}+{u}_{i,t}+{v}_{i,t}+{y}_{i,t},$$

(7)

$$mathrm{log}left({mu }_{i,t}right)= alpha + sum_{j}{beta }_{j}{X}_{j,i}+{u}_{i,t}+{v}_{i,t}+{y}_{i,t},$$

(8)

$${y}_{i,t}= {y}_{i,t-1}+ {w}_{i,t},$$

(9)

where α denotes the intercept; (X) signifies a matrix made up of the socio-economic covariates accompanied by their linear coefficients denoted as (beta); spatiotemporal reporting trends at the sub-county level were accounted for in the models using spatially structured (({u}_{i,t}); conditional autoregressive) and unstructured noise (({v}_{i,t}); i.i.d—independent and identically distributed) random-effects specified jointly as a Besag–York–Mollie model30,31, as well as temporally structured (({y}_{i,t})) random effects of the first order where ({w}_{i,t}) is a pure noise term that is normally distribute with a mean of zero and a variance of σ2. We used uninformative priors with a Gaussian distribution for the fixed effects and penalized complexity priors for the hyperparameters of all the random effects.

For the two spatiotemporal models, we applied linear effects for all the variables: population density, total population, number of exotic dairy cattle, agricultural land area, and number of livestock producing households. We scaled the continuous covariates by standardizing them (to a mean of 0 and standard deviation of 1) before fitting the linear fixed effects.

We used R-INLA to conduct model inference and selection and used DIC to evaluate the model fit for both the occurrence and incidence models. For both models (occurrence and incidence), we created 4 candidate models, compared them, and selected the model with the lowest DIC as the final model. The candidate models included: a baseline intercept only model; a second model with the intercept and covariates; a third model with the intercept, covariates, and the spatial random effects; and a fourth model with the intercept, covariates, spatial random effects, and a temporal trend.

We evaluated the posterior distributions of the parameters and the residuals for adherence to the distributional assumptions. We assessed the residuals to check whether they were independent and normally distributed. We also plotted the residuals against the covariates to check for any non-linear patterns using a well-defined method29. We then ran 1000 simulations to check whether the model was capable of handling zeros.

### Ethics statement

Licence to conduct the research was granted by the National Council for Science, Technology, and Innovation (NACOSTI) under reference number 651983, and the Kenya Wildlife Service under reference number KWS-0003-01-21.

Source: Ecology - nature.com