We performed data acquisition, processing, analysis and visualization using Python23 version 3.8 with the packages Numpy24, Pandas25, Geopandas26, Matplotlib27, Selenium, Beautiful Soup28, SciPy14 and scikit-learn29. The functions used for specific tasks are explicitly mentioned to allow validation and replication studies.
Data acquisition and processing
Human PUUV-incidence
Hantavirus disease has been notifiable in Germany since 2001. The Robert Koch Institute collects anonymized data from the local and state public health departments and offers, via the SurvStat application2, a freely available, limited version of its database for research and informative purposes. We retrieved the reported laboratory-confirmed human PUUV-infections (n = 11,228 from 2006 to 2021; status: 2022-02-07). From the attributes available for each case, we retrieved the finest temporal and spatial resolution, i.e., the week and the year of notification, together with the district (named “County” in the English version of the SurvStat interface).
To avoid bias through underreporting, our dataset was limited to PUUV-infections since 2006; the years 2006–2021 contain 91.9% of the total cases reported from 2001 to 2021. Human PUUV-incidence was calculated as the number of infections per 100,000 people, using population data from Eurostat30. For each year, we used the population reported for January 1 of that year; the population reported for 2020 was also used for 2021.
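As an illustration of this calculation, the following minimal pandas sketch computes the incidence per 100,000 inhabitants; the district name, case counts, and population figures are placeholders, not the actual SurvStat or Eurostat values.

```python
import pandas as pd

# Illustrative inputs: annual case counts and population per district.
cases = pd.DataFrame({
    "district": ["district_A", "district_A"],
    "year": [2020, 2021],
    "cases": [120, 85],
})
population = pd.DataFrame({
    "district": ["district_A", "district_A"],
    "year": [2020, 2021],
    "population": [250_000, 250_000],  # 2021 reuses the population reported for 2020
})

# Incidence = reported infections per 100,000 inhabitants (population of 1 January).
df = cases.merge(population, on=["district", "year"])
df["incidence"] = df["cases"] / df["population"] * 100_000
```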
In the analysis, we only included districts where the total number of infections was ≥ 20 and the maximum annual incidence was ≥ 2 in the period 2006–2021. The spatial information about the infections provided by the SurvStat application refers to the district where the infection was reported. Therefore, in most cases, the reported district corresponds to the residence of the infected person, which may differ from the district of infection. To compensate partially for differences between the reported place of residence and the place of infection, we combined most of the urban districts with their surrounding rural district. The underlying assumption was that most infections reported in urban districts occurred in the neighboring or surrounding rural district. In addition, some urban and rural districts share the same health department. Supplementary Table 1 lists the combined districts.
Weather data
From the German Meteorological Service31 we retrieved grids of the following monthly weather parameters over Germany from 2004 to 2021: mean daily air temperature—Tmean, minimum daily air temperature—Tmin, and maximum daily air temperature—Tmax (all temperatures are the monthly averages of the corresponding daily values, at 2 m height above ground, in °C); total precipitation in mm—Pr, total sunshine duration in hours—SD, mean monthly soil temperature at 5 cm depth under uncovered typical soil of the location in °C—ST, and soil moisture under grass and sandy loam in percent of plant-useable water—SM. The dataset version for Tmean, Tmin, Tmax, Pr, and SD was v1.0; for ST and SM it was v0.x. The spatial resolution was 1 × 1 km².
The data acquisition was performed with the Selenium package. The processing was based on the geopandas package26, using a geospatial vector layer for the district boundaries of Germany32. Each grid was processed to obtain the average value of the parameter over each district. We first used the function within to define a mask based on the grid centers contained in the district; we then applied this mask to the grid. In this method, called “central point rasterizing”33, each rectangle of the grid was assigned to a single district, the one that contained its center. The typical processing error was estimated at about 1%, which agrees with the rasterizing error reported by Bregt et al.33; we consider this error to be most likely much smaller than the uncertainties of the grids themselves, which are caused by calculation, interpolation, and erroneous or missing observations.
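The following geopandas sketch outlines the central point rasterizing for one parameter grid; the file names and the column names (“value” for the gridded parameter, “GEN” for the district name) are assumptions for illustration, not the original DWD or district-layer identifiers.

```python
import geopandas as gpd

# Assumed file and column names, for illustration only.
districts = gpd.read_file("districts_germany.gpkg")   # district boundary polygons
grid = gpd.read_file("tmean_monthly_grid.gpkg")       # 1 x 1 km cells with a "value" column

# Represent every grid cell by its centre point.
centres = grid.geometry.centroid

district_means = {}
for _, district in districts.iterrows():
    # Mask: cells whose centre lies within the district polygon (the "within" function).
    mask = centres.within(district.geometry)
    district_means[district["GEN"]] = grid.loc[mask, "value"].mean()
```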
Data structure
Our analysis was performed at the district level, based on the annual infections acquired by aggregating the weekly cases. From each monthly weather parameter, we created 24 records, one for each month of the two previous years. Each observation in our dataset characterized one district in one year. Its target was obtained by transforming the annual incidence, as described in the following section. Each observation comprised all 168 available predictors from the weather parameters (7 parameters × 24 months), hereafter called “variables”. The naming of the variables follows the format Vx_<parameter>_<month>, where “Vx” is V1 or V2, corresponding to one or two years before, respectively; <parameter> is the abbreviation of the weather parameter (see the previous subsection “Weather data”); and <month> is the numerical value of the month, i.e., from 1 to 12.
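A short sketch of the resulting 168 variable names (the ordering is arbitrary):

```python
# 7 parameters x 2 lag years x 12 months = 168 variables.
parameters = ["Tmean", "Tmin", "Tmax", "Pr", "SD", "ST", "SM"]

variable_names = [
    f"V{lag}_{parameter}_{month}"
    for lag in (1, 2)              # V1 = one year before, V2 = two years before
    for parameter in parameters
    for month in range(1, 13)
]
assert len(variable_names) == 168  # e.g., "V1_Tmean_9": mean temperature, September of the previous year
```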
The observations for combined districts retained the label of the rural district. For their infections and populations, we aggregated the individual values, and recalculated the incidence. For their weather variables, we assigned the mean values weighted by the area of each district.
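A minimal sketch of this aggregation for one hypothetical urban/rural pair (all numbers are illustrative):

```python
import numpy as np
import pandas as pd

# Infections and populations are summed and the incidence recalculated,
# while each weather variable is averaged with the district areas as weights.
pair = pd.DataFrame({
    "district": ["urban", "rural"],
    "cases": [15, 40],
    "population": [120_000, 260_000],
    "area_km2": [110.0, 640.0],
    "V1_Tmean_9": [14.2, 13.8],
})

incidence = pair["cases"].sum() / pair["population"].sum() * 100_000
v1_tmean_9 = np.average(pair["V1_Tmean_9"], weights=pair["area_km2"])
```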
Target transformation
To consider the effects that drive the occurrence of high district-relative incidence, we discretized the incidence at the district level. When scaled to its maximum value for each district, the incidence was concentrated at the extremes of its range: about 49% of all observations fell in the range [0, 0.1) and 8% in the range [0.9, 1] (Fig. 5). We therefore chose to discretize the scaled incidence with two bins, i.e., to binarize it.
We first applied a log-transformation to the incidence values34, described in Eq. (6).
$$\text{Log-incidence} = \log_{10}\left(\text{incidence} + 1\right) \qquad (6)$$
The addition of a positive constant ensures a finite value for zero incidence; the value 1 was selected so that the log-incidence is nonnegative and a zero incidence is transformed into a zero log-incidence. This transformation aimed to increase the influence of nonzero incidence values, i.e., values that are not pronounced but still hint at a nonzero infection risk. Its effect is demonstrated in the right plot of Fig. 5, where the positive skewness of the original data is reduced, i.e., low incidence values are spread to higher values, resulting in more uniform bin heights in the range [0.05, 0.95] after the transformation. Formally, in this case the log-transformation achieves a more uniform distribution for the non-extreme incidence values.
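A minimal NumPy sketch of Eq. (6), with illustrative incidence values:

```python
import numpy as np

# Eq. (6): zero incidence maps to zero log-incidence, and low nonzero
# values are spread towards higher values.
def log_transform(incidence):
    return np.log10(np.asarray(incidence) + 1.0)

log_transform([0.0, 0.5, 2.0, 10.0])   # -> [0.0, 0.176, 0.477, 1.041]
```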
For the binarization, we performed unsupervised clustering of the log-transformed incidence separately for each district, applying the function KBinsDiscretizer of the scikit-learn package29. Our selected strategy was k-means clustering with two bins, because it does not require a pre-defined threshold and can operate with the same fixed number of bins for every district by automatically adjusting the cluster centroids.
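A sketch of the per-district binarization with KBinsDiscretizer, using illustrative incidence values:

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Binarize the log-incidence of a single district with k-means clustering
# into two bins (0 = low, 1 = high); the incidence values are illustrative.
incidence = np.array([0.0, 0.4, 1.2, 8.7, 11.3, 0.9])
log_incidence = np.log10(incidence + 1.0).reshape(-1, 1)

discretizer = KBinsDiscretizer(n_bins=2, encode="ordinal", strategy="kmeans")
risk_class = discretizer.fit_transform(log_incidence).ravel()   # e.g., [0, 0, 0, 1, 1, 0]
```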
Classification method
We concentrated only on those variable combinations that led to a linear decision boundary for the classification of our selected target. We selected support vector machines (SVM)35 with a linear kernel because they combine high performance with low model complexity, in that they return the decision boundary as a linear equation of the variables. In addition, SVM is geometrically motivated36 and expected to be less prone to outliers and overfitting than other machine-learning classification algorithms, such as logistic regression. For the complete modelling process, the regularization parameter C was set to 1, which is the default value in the applied SVC implementation of the scikit-learn package29, and the weights of both risk classes were also set to 1.
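The classifier settings described above correspond to the following sketch; X and y are random placeholders for the selected weather variables and the binary risk class, not the actual data.

```python
import numpy as np
from sklearn.svm import SVC

# Placeholder data: 200 observations with 3 variables and a binary target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X @ np.array([1.0, -0.5, 0.8]) > 0).astype(int)

# Linear kernel, C = 1, equal class weights.
model = SVC(kernel="linear", C=1.0, class_weight={0: 1, 1: 1})
model.fit(X, y)

# The linear decision boundary: one coefficient per variable plus an intercept.
coefficients, intercept = model.coef_[0], model.intercept_[0]
```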
Feature selection
Our aim was to use the smallest possible number of weather parameters as variables for a classification model with sufficient performance. To identify the optimal variable combination, we first applied an SVM with a linear kernel to all 2-variable combinations of the monthly weather variables from V2 and V1, i.e., 168 variables (7 weather parameters × 2 years × 12 months). Only for this step, the variables were scaled to their minimum and maximum values, which significantly reduced the processing time. For all following steps, the scaler was omitted, because the unscaled support vectors were required for the final model. From the total of 14,028 models, one for each unique pair (168!/(2!·(168 − 2)!)), we kept the 100 models with the best F1-score, i.e., the harmonic mean of sensitivity and precision, and counted the occurrences of each year-month combination in the variables. The best F1-score was 0.752 for the pair (V1_Tmean_9 and V2_Tmax_4); the best sensitivity was 83% for the pair (V2_Tmax_9 and V1_ST_9).
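The pairwise search can be sketched as follows; the data frame is a random placeholder for the real observations, and the in-sample F1-score is used here only for brevity, so the evaluation may differ from the scheme applied in the study.

```python
from itertools import combinations

import numpy as np
import pandas as pd
from sklearn.metrics import f1_score
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Random placeholder for the observations: 168 variables plus a binary "risk" target.
parameters = ["Tmean", "Tmin", "Tmax", "Pr", "SD", "ST", "SM"]
variable_names = [f"V{lag}_{p}_{m}" for lag in (1, 2) for p in parameters for m in range(1, 13)]

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(300, 168)), columns=variable_names)
df["risk"] = rng.integers(0, 2, size=300)

results = []
for var_a, var_b in combinations(variable_names, 2):        # 14,028 unique pairs
    X = MinMaxScaler().fit_transform(df[[var_a, var_b]])    # min-max scaling, this step only
    model = SVC(kernel="linear", C=1.0).fit(X, df["risk"])
    results.append((f1_score(df["risk"], model.predict(X)), var_a, var_b))

top_100 = sorted(results, reverse=True)[:100]               # best 100 models by F1-score
```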
The year-month combinations with more than 10% of the occurrences were: V1_9 (September of the previous year, with 49% of the occurrences), V2_9 (September of two years before, with 12%) and V2_4 (April of two years before, with 10%). To avoid sets with highly correlated variables, we formed 3-variable combinations with exactly one variable from each year-month combination (threefold Cartesian product). From the total of 343 models (7³ combinations, i.e., 7 weather parameters for each of the 3 year-month combinations), we selected the model with the best sensitivity and at least 70% precision, i.e., the variable set (V2_ST_4, V2_SD_9, and V1_ST_9). We consider that the criteria for this selection are not particularly crucial, and we expect comparable performance for most variable sets with a high F1-score, because the variables within each dimension of the Cartesian product were highly correlated. The eight variable sets with at least 70% precision and at least 80% sensitivity are shown in Supplementary Table 2.
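A sketch of the threefold Cartesian product of variable names:

```python
from itertools import product

# One variable from each selected year-month combination (V1_9, V2_9, V2_4),
# giving 7^3 = 343 triplets.
parameters = ["Tmean", "Tmin", "Tmax", "Pr", "SD", "ST", "SM"]
triplets = [
    (f"V1_{a}_9", f"V2_{b}_9", f"V2_{c}_4")
    for a, b, c in product(parameters, repeat=3)
]
assert len(triplets) == 343
```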
The SVM classifier has two hyperparameters: the regularization parameter C and the class weights. Decreasing C softens the decision boundary, so more misclassifications are allowed. Increasing the high-risk class weight, on the other hand, penalizes misclassifications of high-risk observations more heavily, which is expected to increase the sensitivity and decrease the precision. The simultaneous adjustment of both hyperparameters ensures that the resulting model has the optimal performance with respect to the preferred metric. However, to avoid overfitting, we considered a further model optimization with these two hyperparameters redundant. For completeness, we examined SVM models for different values of the hyperparameters and found that the global maximum of the F1-score lies in the region of C ≈ 0.001 and a high-risk class weight of about 1.5. Our selected values, C = 1 and a high-risk class weight of 1, give the second-best F1-score, which is a local maximum with comparable performance, largely insensitive to the selection of C within the range [0.2, 5.5].
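The grid examined for completeness can be sketched as follows; X and y are again random placeholders for the selected variable triplet and the binary target, and the exact grid ranges are assumptions.

```python
import numpy as np
from sklearn.metrics import f1_score
from sklearn.svm import SVC

# Placeholder data for the triplet of variables and the binary risk class.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = (X.sum(axis=1) > 0).astype(int)

# Scan the regularization parameter C and the high-risk class weight,
# keeping the F1-score for each combination.
scores = {
    (C, w): f1_score(y, SVC(kernel="linear", C=C, class_weight={0: 1, 1: w}).fit(X, y).predict(X))
    for C in np.geomspace(1e-4, 10, 26)
    for w in (1.0, 1.5, 2.0, 2.5)
}
best_C, best_weight = max(scores, key=scores.get)
```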
The addition of a fourth variable from V1_6 (June of the previous year) resulted in a model with higher sensitivity but lower precision and specificity (for V1_Pr_6). The highest F1-score was achieved for the quadruple (V2_ST_4, V2_SD_9, V1_ST_9, V1_Pr_6). Because the complexity increased without a significant improvement in performance, we considered a further expansion of our variable triplet unnecessary.