Analysis of individual-level data from 2018–2020 Ebola outbreak in Democratic Republic of the Congo

Ebola dataset

The 2018–2020 DRC EVD outbreak lasted over 24 months and spread over three distinct spatial and temporal waves. Between the emergency declaration of the EVD outbreak in northern DRC on August 1, 2018 and the outbreak’s official end on June 25, 2020, the DRC Ministry of Health reported a total of 3481 cases (confirmed and probable), 1162 recoveries, and 2299 deaths¹⁶ in the provinces of North Kivu, South Kivu, and Ituri. The dataset considered here is a large subset of the entire EVD database compiled by the University of Kinshasa School of Public Health. It comprises 3117 case records (confirmed and probable) recorded between May 3, 2018, and September 12, 2019. The data included partially de-identified but still detailed patient information, such as each person’s location, the dates of symptom onset and hospitalization, and the date of discharge due to recovery or death. These individual records came from Ebola treatment centers in 24 different health zones spread across the same three provinces.

Only 6 of the 24 health zones accounted for 77.1% of all cases: Beni, Butembo, Katwa, Kalunguta, Mabalako, and Mandima. Only 9.7% of cases were under the age of 18. Females made up a slightly larger share of the cases (57.0%). Approximately 5% of the cases were health care workers. About one-third of the EVD fatalities were not identified until the patient’s death and thus were not effectively isolated from the time of infection. Although over 170,000 contacts of confirmed and probable Ebola cases had been monitored across all affected health zones for 21 days after their last known exposure by the end of the epidemic, some of the contact tracing was incomplete because insecurity prevented public health response teams from entering some communities. The overall case density map is presented in panel (A) of Fig. 1, with an animated version of the map presented in the online appendix in Fig. A.1. Notice that the high-density areas, particularly Butembo, Katwa, and Beni, are all spatially small health zones corresponding to cities or towns with larger populations.

Figure 1

DRC Ebola dataset. (A) The spatial distribution of 3481 EVD cases across the northern DRC health zones during the 2018–2020 Ebola outbreak. (B) Flowchart of the personal records up to September 12, 2019 that were available for the current analysis. The total number of available individual disease records was 3080. Map created using the open software R¹⁷ with geospatial data obtained from¹⁸.

Figure 2

Daily incidence and removal rates. Daily incidence (grey bars) and removal counts (red dots) during the DRC Ebola 2018–2020 outbreak between August 15, 2018 and September 12, 2019, along with their respective trendlines (loess smoothers). The blue trendline above the plot represents the daily effective reproduction number \(\mathcal{R}_t\), defined as the ratio of the daily number of new infections to new removals. The vertical lines indicate cut-off dates for data collection in each wave as listed in Table 1.

Table 1 Observed cases by EVD wave.

Case alerts and definitions

Since early August 2018, the DRC Ministry of Health has collaborated with several international partners to support and enhance EVD response activities through its emergency operations center in Goma. To the extent possible given regional security considerations¹⁹, response teams were deployed to interview patients and their suspected contacts using a standardized case investigation form that classified cases as suspected, probable, or confirmed. A suspected case (whether surviving or not) was defined as one with the acute onset of fever (over 100\(^{\circ}\)F) and at least three Ebola-compatible clinical signs or symptoms (headache, vomiting, anorexia, diarrhea, lethargy, stomach pain, muscle or joint aches, difficulty swallowing or breathing, hiccups, unexplained bleeding, or any sudden, unexplained death) in a North Kivu, South Kivu, or Ituri resident or in any person who had traveled to these provinces during this period and reported the signs or symptoms defined above. A patient who met the suspected case definition and died, but from whom no specimens were available, was considered a probable case. A confirmed Ebola case was defined as a suspected case with at least one positive test for Ebola virus using reverse transcription polymerase chain reaction (RT-PCR)²⁰ testing. Patients with suspected Ebola were isolated and transported to an Ebola treatment center for confirmatory testing and treatment².

Onset and removal

In our analysis of the DRC dataset, we focused on the dates of symptom onset and removal, with removal defined as either death or recovery at home or transfer to an Ebola treatment center (ETC). It was assumed that, once an individual was isolated in the treatment center, the probability of further infection spread was very small due to the strict safety protocols, and later also due to vaccination of healthcare personnel and family members who were in contact with the suspected Ebola case. We were able to access 3117 of the 3481 individual records of confirmed and probable Ebola cases. Of these 3117 records, 37 were missing both the onset and removal dates and were excluded from further analysis, leaving 3080 records. In about 30% of the remaining records, either the onset date or the removal date was missing. A detailed flow diagram summarizing the amount of missing data and the data processing leading to the final dataset is presented in panel (B) of Fig. 1. The distribution of the original and the partially imputed records across the three waves of infection is provided for further reference in Table 1.

Spatial and temporal patterns

Throughout the epidemic, the incidence rates exhibited strong spatial and temporal patterns that can be summarized as three distinct waves of infections with approximate boundaries marked by vertical lines in Fig. 2. The distribution of weekly reported cases across the most affected health zones listed in Table 1 is provided in the bar plot and in the corresponding animation in the appendix (see Figure A.1). As seen from the bar chart and the animated plot, the epidemic was initially driven largely by infections in the health zones of Beni, Mandima, and Mabalako. After several months, the incidence of new cases in these zones subsided, but the epidemic moved south to the health zones of Katwa and Butembo, where the majority of new infections were registered between weeks 22 and 45 of the epidemic (see Panel (A) in Figure A.1 in the online Appendix). In the final spatial shift, around week 49, the epidemic returned to the health zones of Beni, Mandima, and Mabalako, where it was mostly extinguished around week 60 (September 2019). Isolated Ebola cases occurred sporadically across northern DRC until the end of the outbreak was officially declared in June 2020.

The empirical patterns of incidence and removal for EVD cases are summarized in Fig. 2, with the bar and the dot plots representing the daily numbers of new infections and removals, respectively. As seen from the plot, these daily counts closely follow the three-wave temporal pattern in Table 1. This is further evident from the black and red trendlines representing the loess smoothers (see²¹). The daily ratio of new cases to removals may be interpreted as a crude estimate of the effective reproduction number \(\mathcal{R}_t\), defined more formally in (2) in Model for Data Analysis below. In particular, the blue trendline for \(\mathcal{R}_t\) indicates that, towards the end of the observed time period, the number of removals outpaced the number of new infections (\(\mathcal{R}_t < 1\)). The ability to sustain this pattern for a sufficiently long time period, mostly by increasing the rate of quarantine and ETC transfers along with ring vaccination of case contacts, was largely credited with the end of the EVD epidemic in mid-2020. The quantification of this public health intervention effect in the 2018–2020 DRC outbreak is one of the main motivations for our model-based analysis. Although the precise cut-off dates for the three waves of 2018–2020 Ebola infections are difficult to establish, the incidence data along with a simple statistical analysis (see Parameter estimation) indicate that the first wave lasted approximately until the end of February 2019, whereas the second wave ended around the end of May 2019. For the purpose of the data analysis below, the specific break dates used were February 27, 2019 and May 27, 2019, as marked by vertical lines in Fig. 2. September 12, 2019 was the cutoff date for the individual records data available from the University of Kinshasa (see Table 1).
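As a small illustration of how the break dates quoted above partition the records, the following Python sketch assigns a symptom-onset date to one of the three waves. The function name `assign_wave` and the choice to treat each break date as the inclusive last day of its wave are assumptions made for this example; the text does not state how boundary days were handled.

```python
from datetime import date

# Break dates quoted in the text. Treating each one as the inclusive last day
# of its wave is an assumption made for this sketch.
WAVE_END_DATES = [date(2019, 2, 27), date(2019, 5, 27), date(2019, 9, 12)]


def assign_wave(onset):
    """Return the wave number (1, 2 or 3) for an onset date.

    Returns None when the date falls after the data cutoff of
    September 12, 2019.
    """
    for wave, end in enumerate(WAVE_END_DATES, start=1):
        if onset <= end:
            return wave
    return None
```

For instance, under these assumptions `assign_wave(date(2019, 3, 15))` places the record in the second wave.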

Model for data analysis

The analysis of the individual-level epidemic data is based on the standard ecological model known as the SIR (susceptible-infected-removed) model, developed for the purpose of analyzing the average behavior of a large population with a homogeneous pattern of interactions¹¹,²². Although there are many variants of SIR models in the literature²³, our current analysis considers the classical Kermack-McKendrick SIR model, which assumes that the proportions of the population categorized as susceptible (\(s\)), infected (\(\iota\)), or removed (\(r\)) evolve according to the differential equations

$$\begin{aligned} \dot{s}_t&= -\beta s_t \iota_t, \\ \dot{\iota}_t&= \beta s_t \iota_t - \gamma \iota_t, \\ \dot{r}_t&= \gamma \iota_t, \end{aligned}$$

(1)

with \(s_0 = 1\), \(\iota_0 = \rho > 0\), and \(r_0 = 0\), where \(\beta > 0\) is the rate of infection, \(\gamma > 0\) is the rate of recovery, and \(\rho > 0\) is the initial amount of infection. In particular, the model implies the existence of the basic reproduction number \(\mathcal{R}_0\) (R-naught), which determines the average speed of disease spread¹¹ and is given by the formula

$$\mathcal{R}_0=\beta /\gamma .$$

If \(\mathcal{R}_0 > 1\), the proportion of infected initially rises and then subsides, with the final proportion of surviving susceptibles given by \(s_\infty = 1 - \tau > 0\), where \(\tau\) is known as the epidemic’s final size. In a typical statistical analysis, an estimate of \(\mathcal{R}_0\) is obtained by separately estimating the parameters \(\beta\) and \(\gamma\). Another important quantity related to (1) is the effective reproduction number, which is typically defined as

$$\begin{aligned} \mathcal{R}_t= \mathcal{R}_0 s_t. \end{aligned}$$

(2)
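To make the roles of (1) and (2) concrete, the following Python sketch integrates the SIR equations with a classical fourth-order Runge-Kutta scheme and evaluates \(\mathcal{R}_t = \mathcal{R}_0 s_t\) along the solution. The parameter values, the step size, and the function names `sir_rhs` and `solve_sir` are hypothetical choices for illustration only; they are not estimates or code from the analysis.

```python
import math


def sir_rhs(state, beta, gamma):
    """Right-hand side of equation (1) for state = (s, iota, r)."""
    s, i, _ = state
    return (-beta * s * i, beta * s * i - gamma * i, gamma * i)


def solve_sir(beta, gamma, rho, t_end, dt=0.01):
    """Integrate (1) from s_0 = 1, iota_0 = rho, r_0 = 0 with classical RK4.

    Returns a list of (t, s, iota, r) tuples on a uniform time grid.
    """
    def shifted(state, slope, h):
        return tuple(x + h * d for x, d in zip(state, slope))

    n_steps = int(round(t_end / dt))
    state = (1.0, rho, 0.0)
    path = [(0.0,) + state]
    for k in range(n_steps):
        k1 = sir_rhs(state, beta, gamma)
        k2 = sir_rhs(shifted(state, k1, 0.5 * dt), beta, gamma)
        k3 = sir_rhs(shifted(state, k2, 0.5 * dt), beta, gamma)
        k4 = sir_rhs(shifted(state, k3, dt), beta, gamma)
        state = tuple(
            x + dt * (a + 2 * b + 2 * c + d) / 6
            for x, a, b, c, d in zip(state, k1, k2, k3, k4)
        )
        path.append(((k + 1) * dt,) + state)
    return path


# Hypothetical parameter values, chosen only for illustration.
beta, gamma, rho = 0.3, 0.2, 0.01
r_naught = beta / gamma                      # basic reproduction number
path = solve_sir(beta, gamma, rho, t_end=400.0)

t_last, s_last, i_last, r_last = path[-1]
rt_start = r_naught * path[0][1]             # equation (2) at t = 0
rt_end = r_naught * s_last                   # equation (2) at the final time
final_size = 1.0 - s_last                    # approximates tau once iota ~ 0
```

Two properties of (1) serve as sanity checks on the numerical solution: the total \(s_t + \iota_t + r_t\) stays equal to \(1 + \rho\), and dividing the first equation in (1) by the third gives \(s_t = \exp(-\mathcal{R}_0 r_t)\).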

Although equation (1) is typically considered in the context of the average behavior of a large population, for our purposes we interpret it as defining the individual histories of infection and recovery, according to the idea of dynamic survival analysis (DSA) discussed recently in¹⁰ and²⁴ and also briefly summarized in the Appendix. With the DSA approach, we interpret equation (1) as the so-called stochastic master equation²⁵ describing the change in the probability that a randomly selected individual is, at time \(t\), susceptible, infected, or removed. These respective probabilities are represented by the scaled proportions \(s_t/(1+\rho)\), \(\iota_t/(1+\rho)\), and \(r_t/(1+\rho)\), and they evolve according to (1). As outlined in¹⁰, the DSA-based interpretation of the classical SIR equations has a number of advantages that make it particularly convenient for analyzing epidemic data consisting of individual histories of infection onsets and removals, which is exactly the type of data available in the DRC Ebola dataset. The fact that the model is individual-based also implies that we can vary the parameters \(\theta =(\beta ,\gamma ,\rho )\) to account for individual covariates and for changes in the parameter values over time, as different waves of infection sweep through the population. Finally, for the purpose of our analysis, it is also important to note that the DSA model does not require any knowledge of the size of the susceptible population subjected to the epidemic pressure. For the DRC dataset, any assumed population size would be difficult to justify due to the spatial and temporal heterogeneity of the epidemic and the frequent movements of local populations driven by political conflicts and insecurity. Another element complicating the determination of the size of the susceptible population was the ring vaccination campaign that was conducted from 2019 wherever possible in northern DRC during periods of relative stability, despite local mistrust and supply issues. This campaign ultimately resulted in over 250,000 vaccinations.

Note that, because \(s_0 = 1\), the values of \(\mathcal{R}_0\) and \(\mathcal{R}_t\) coincide for \(t = 0\). Moreover, \(s_t = \exp \left( -\mathcal{R}_0 \int _0^t \dot{r}_u \,\mathrm {d}u \right) = \exp (-\mathcal{R}_0 r_t)\) is a decreasing function of time and therefore so is \(\mathcal{R}_t\). In practice, however, this implication is problematic. Rewriting \(\mathcal{R}_t = - \dot{s}_t/ \dot{r}_t\) suggests that a crude but sensible way to estimate \(\mathcal{R}_t\) empirically is to take the ratio of the daily number of new infections to new removals. The empirical \(\mathcal{R}_t\) thus estimated will not necessarily be monotonically decreasing. In light of possibly changing parameters and effective population size, we have adopted this approach to estimating the daily effective reproduction number \(\mathcal{R}_t\) in Fig. 2.
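The ratio estimate described above can be sketched in a few lines of Python. The sketch takes integer day indices of onsets and removals, smooths the two daily count series, and divides them. A centered moving average is used here as a simple stand-in for the loess smoothers of Fig. 2, and the window length and all function names are assumptions made for illustration.

```python
from collections import Counter


def daily_counts(days, n_days):
    """Count events per day; `days` holds integer day indices in [0, n_days)."""
    counts = Counter(days)
    return [counts.get(d, 0) for d in range(n_days)]


def moving_average(values, window):
    """Centered moving average, truncated at the series boundaries."""
    half = window // 2
    smoothed = []
    for k in range(len(values)):
        chunk = values[max(0, k - half): k + half + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


def empirical_rt(onset_days, removal_days, n_days, window=7):
    """Ratio of smoothed daily onsets to smoothed daily removals.

    Days on which the smoothed removal count is zero yield None instead of
    a division by zero.
    """
    incidence = moving_average(daily_counts(onset_days, n_days), window)
    removals = moving_average(daily_counts(removal_days, n_days), window)
    return [a / b if b > 0 else None for a, b in zip(incidence, removals)]
```

Values above 1 indicate days on which new infections outpaced removals. Because nothing forces this ratio to decrease over time, it can rise again at the start of a new wave.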

Parameter estimation

We assume that, for each of the three waves of the epidemic, we have a separate and independent set of parameters \(\theta\) and that, in each wave, we observe \(n_T\) histories (records) of infection. The \(i\)-th individual history may be represented either by the times of disease onset and removal \((t_i,T_i)\) or, when one of the two is missing, by \((t_i,\circ )\) or \((\circ ,T_i)\), with \(\circ\) denoting a missing value. We assume that among the available \(n_T\) histories we have \(n\) complete records \((t_i,T_i)\), \(n_1\) incomplete ones \((t_i,\circ )\), and \(n_2\) incomplete ones \((\circ ,T_i )\). The wave-specific DSA likelihood function for the \(n\) complete data records is (see Appendix)

$$\begin{aligned} \mathcal{L}_C(\theta \vert t_1,\ldots ,t_n,T_1,\ldots ,T_n,T)=(s_T-1)^{-n}\prod _{i=1}^n \dot{s}_{t_i}\gamma ^{w_i}e^{-\gamma (T_i \wedge T -t_i)} \end{aligned}$$

(3)

where \(T\) is the available time horizon and \(w_i\) is a binary variable indicating whether the removal time \(T_i\) is observed: \(w_i = 0\) if \(T_i\) is right-censored (that is, \(T_i\wedge T =T\)) and \(w_i = 1\) otherwise. For the remaining \(n_1+n_2\) records that are partially incomplete, the wave-specific DSA likelihood function is

$$\begin{aligned} \mathcal{L}_I(\theta \vert t_1,\ldots ,t_{n_1},T_1,\ldots ,T_{n_2},T)= (s_T-1)^{-(n_1+n_2)} \gamma ^{n_2}\prod _{i=1}^{n_1} \dot{s}_{t_i} \prod _{i=1}^{n_2} (\rho e^{-\gamma T_i }-\iota _{T_i}) \end{aligned}$$

(4)

where we assume that \(T_i<T\). The overall likelihood for all \(n_T\) individual histories is obtained by multiplying (3) and (4). Note that the likelihood formulas depend on the parameter \(\beta\) only implicitly, through the values of the function \(s_t\) defined by (1). Note also that we assume \(T\) to be unique and exactly known, although in practice this may not be true, as subsequent waves of infection may be too close in time (perhaps even overlapping) to allow for a precise specification of \(T\). In our analysis below, we address this practical problem by considering several candidate values of \(T\) in each wave and then identifying those that jointly maximize the combined posterior distribution corresponding to the wave-specific likelihoods in equations (3) and (4).
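The product of (3) and (4) can be evaluated numerically once (1) has been solved on a time grid. The Python sketch below does this on the log scale, using the fact that both \(\dot{s}_t\) and \(s_T - 1\) are negative, so each ratio \(\dot{s}_{t_i}/(s_T-1)\) equals the positive quantity \(\beta s_{t_i}\iota_{t_i}/(1-s_T)\), and likewise \((\rho e^{-\gamma T_i}-\iota_{T_i})/(s_T-1)\) is positive. The RK4 solver, the linear interpolation, the step size, and all function names are assumptions of this illustration, not the implementation used in the analysis.

```python
import math


def solve_sir(beta, gamma, rho, t_end, dt=0.01):
    """RK4 solution of (1) on a uniform grid; returns the lists (s, iota)."""
    def rhs(s, i):
        return -beta * s * i, beta * s * i - gamma * i

    n_steps = max(1, int(math.ceil(t_end / dt)))
    s, i = 1.0, rho
    s_path, i_path = [s], [i]
    for _ in range(n_steps):
        a1, b1 = rhs(s, i)
        a2, b2 = rhs(s + 0.5 * dt * a1, i + 0.5 * dt * b1)
        a3, b3 = rhs(s + 0.5 * dt * a2, i + 0.5 * dt * b2)
        a4, b4 = rhs(s + dt * a3, i + dt * b3)
        s += dt * (a1 + 2 * a2 + 2 * a3 + a4) / 6
        i += dt * (b1 + 2 * b2 + 2 * b3 + b4) / 6
        s_path.append(s)
        i_path.append(i)
    return s_path, i_path


def interp(path, t, dt):
    """Linear interpolation of a grid function at time t."""
    k = min(int(t / dt), len(path) - 2)
    w = t / dt - k
    return (1 - w) * path[k] + w * path[k + 1]


def dsa_log_likelihood(theta, complete, onset_only, removal_only, T, dt=0.01):
    """Logarithm of the product of (3) and (4) for one wave.

    theta:        (beta, gamma, rho)
    complete:     list of (t_i, T_i) pairs; T_i >= T is treated as censored
    onset_only:   list of onset times t_i with missing removal time
    removal_only: list of removal times T_i (0 < T_i < T) with missing onset
    """
    beta, gamma, rho = theta
    s_path, i_path = solve_sir(beta, gamma, rho, T, dt)
    log_norm = math.log(1.0 - interp(s_path, T, dt))   # log |s_T - 1|

    def log_onset_density(t):
        # log of s'_t / (s_T - 1) = beta * s_t * iota_t / (1 - s_T)
        s_t, i_t = interp(s_path, t, dt), interp(i_path, t, dt)
        return math.log(beta * s_t * i_t) - log_norm

    ll = 0.0
    for t_i, T_i in complete:
        ll += log_onset_density(t_i) - gamma * (min(T_i, T) - t_i)
        if T_i < T:                                    # w_i = 1
            ll += math.log(gamma)
    for t_i in onset_only:
        ll += log_onset_density(t_i)
    for T_i in removal_only:
        # (rho e^{-gamma T_i} - iota_{T_i}) / (s_T - 1), with both signs flipped
        excess = interp(i_path, T_i, dt) - rho * math.exp(-gamma * T_i)
        ll += math.log(gamma) + math.log(excess) - log_norm
    return ll
```

A function of this form could be passed to a generic optimizer or sampler over \(\theta\). One built-in check is that the onset density \(\dot{s}_t/(s_T-1)\) integrates to one over \([0, T]\).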

The fitting of the model parameters \(\theta =(\beta ,\gamma , \rho )\) by maximizing the likelihood function (3) can be conveniently integrated into the Bayesian estimation framework, which allows for a more complete propagation of uncertainty and the use of external information in the statistical model. This, in turn, allows us to produce estimates that reflect all available information and uncertainty. In our DRC data analysis, the approximate posterior densities of \(\theta\) were obtained using the Hamiltonian Monte Carlo sampler²⁶ implemented in the open-source statistical software Stan²⁷ and integrated with the popular statistical analysis language R via the library RStan²⁸. For the RStan analysis, we assumed uniform (sometimes improper) prior distributions on the components of \(\theta\) as follows:

$$\begin{aligned} &\beta \in (0.15, \infty ), &\gamma \in (0, \beta ),&\rho \in (0, 1). \end{aligned}$$

(5)

The lower bound was placed on \(\beta\) based on empirical information, and the upper bound was placed on \(\gamma\) to enforce the constraint \(\mathcal{R}_0>1\). Given the wave-specific time horizons (the values of \(T\)), the set of parameters for each epidemic wave was estimated independently using 2 independent chains of 3000 iterations, with a burn-in period of 1000 iterations. The chains’ convergence was assessed using Rubin’s R statistic²⁸. The analysis resulted in approximate samples from the posterior distribution of \(\theta\) for each of the three waves of the epidemic (see e.g., Fig. 4).
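For readers unfamiliar with the convergence diagnostic mentioned above, the following Python sketch computes the classical potential scale reduction factor for a single parameter from several equal-length chains. It is a textbook form of the statistic and is only an illustration; the variant that RStan reports may differ (for example, by splitting chains), and the function name `gelman_rubin` is an assumption of this example.

```python
from statistics import mean, variance


def gelman_rubin(chains):
    """Classical potential scale reduction factor for one parameter.

    `chains` is a list of equal-length lists of post-burn-in draws,
    with at least two chains and at least two draws per chain.
    """
    n = len(chains[0])
    chain_means = [mean(chain) for chain in chains]
    within = mean(variance(chain) for chain in chains)    # W
    between = n * variance(chain_means)                   # B
    pooled = (n - 1) / n * within + between / n           # pooled variance
    return (pooled / within) ** 0.5
```

Values close to 1 suggest that the chains are sampling from the same distribution, whereas values well above 1 indicate that they have not yet mixed.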

Ethics statement on human subjects and methods

The research was conducted in accordance with the relevant guidelines and regulations of the US law and OSU Institutional Review Board. The research activities involving human subjects discussed in the paper meet the US federal exemption criteria under 45 CFR 46 and 21 CFR 56.
