Data mining and model-predicting a global disease reservoir for low-pathogenic Avian Influenza (A) in the wider pacific rim using big data sets

Study area

The study area consists of the wider northern Pacific Rim, which is known to be an exchange frontier for diseases and cultures (Fig. 1)^2,9. We followed the methods outlined in^5,11,12 and specifically in¹³, drawing inference from predictions.

The international landscape investigation conducted in this study area is described in a research workflow (Fig. 2). It consists of the following steps: field work, open access data compilation, data cleaning and lab work, GIS mapping, data mining and prediction, and reflection and inference, as further described below (for clarifications or questions, please contact the authors).

Figure 2

Workflow of this study to obtain best-available AI data and to data mine and predict them with machine learning in a geographic information system (GIS) for best-possible predictions and inference for the Pacific Rim study area (IRD = Influenza Research Database; USDA = U.S. Department of Agriculture); for more details, model specifications etc. see manuscript text.


Field work

As part of the eASIA program, field sampling of AI was conducted in Russia and Japan, primarily during the fall (August) of 2016, 2017 and 2018. Fall is the season when birds have finished breeding and start to migrate southwards to their wintering sites. During that time, birds are known to disperse relatively slowly along flyways^10,12,14,15. This period has the highest known prevalence of the virus thus far⁹. In Vietnam, surveillance targeting domestic birds was conducted in summer and fall. Together with all eASIA participants, we extracted data following an agreed, compatible workflow and protocol that allowed for geo-referenced and time-referenced AI samples in the field. Hunters were not directly involved in the study (see permits for bird specimen details). In Russia, the local lab method protocol and standard procedures^16,17 were followed, resulting in 52 samples (10 LPAI presences) from 2016 and 2017 at 13 unique locations. In Japan, the respective lab method protocol was followed (details in¹⁸), resulting in 203 samples from 2016 and 2017 at 5 unique locations. In Vietnam, the lab method protocol of Japan was followed (details in¹⁹), resulting in 1,182 samples (951 LPAI presences) from 2016 and 2017 at 102 unique locations. Finally, we were also able to obtain 407 samples (395 LPAI presences) from 27 unique locations in Mongolia, also following the protocol from Japan. Alaska was not part of the field campaign, but data were available through the IRD ‘flu’ database (see details below).

All field data were compiled into one eASIA database for further analysis (Appendix 1), namely for data mining, model training and subsequent predictions with machine learning and a geographic information system (GIS; details in^9,10).

Compilations of open access AI data

To reach across the Pacific Rim for a wider and more robust inference, and to make a connection with North America and other available data, further AI data from Alaska were obtained from the IRD database online (https://www.fludb.org/brc/home.spg?decorator=influenza). This resulted in 38,517 samples (448 low-path AI presences) from 1,175 unique locations. We then queried all these data for low-path AI strains, which resulted in 110 strains and 40,837 samples from 157 host species entries that we used for this study (see Appendix 2 for details). To our knowledge, this is the largest and most diverse AI database ever compiled and analysed for the Pacific Rim (see Herrick et al. 2013 for a first model that used all AI data).

Data mining of low-path AI

We queried the obtained data for the number of low-path AI strains, host species distribution, proportion of host species carrying a specific low-path AI strain, and prevalence.
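The four summary queries named above can be sketched in a few lines of Python. This is only an illustration: the record layout, the example host species and strain labels, and the function name `summarize` are assumptions and do not come from the eASIA or IRD databases.

```python
from collections import Counter, defaultdict

# Hypothetical records: (host species, strain label, or None for a negative sample).
samples = [
    ("Anas platyrhynchos", "H3N8"),
    ("Anas platyrhynchos", None),
    ("Anas platyrhynchos", "H4N6"),
    ("Anas crecca", "H3N8"),
    ("Anas crecca", None),
    ("Larus canus", None),
]


def summarize(records):
    """Return strain count, host distribution, host share per strain, prevalence."""
    positives = [(host, strain) for host, strain in records if strain is not None]
    strains = {strain for _, strain in positives}
    samples_per_host = Counter(host for host, _ in records)

    # Collect the set of host species in which each strain was detected.
    hosts_per_strain = defaultdict(set)
    for host, strain in positives:
        hosts_per_strain[strain].add(host)

    n_hosts = len(samples_per_host)
    return {
        "n_strains": len(strains),
        "samples_per_host": dict(samples_per_host),
        # Share of all sampled host species that carried a given strain.
        "host_share_per_strain": {
            strain: len(hosts) / n_hosts for strain, hosts in hosts_per_strain.items()
        },
        # Positive samples divided by all samples.
        "prevalence": len(positives) / len(records),
    }


summary = summarize(samples)
```

Prevalence is computed here as positives over all samples; the same ratio applied to the field counts reported above would give, for example, 10/52 for Russia.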

Compilations of open access GIS data layers for the study area

GIS layers are used as predictors for model predictions in the study area. Here we used 19 global GIS layers available from earlier research (Sriram and Huettmann, unpublished, https://www.earth-syst-sci-data-discuss.net/essd-2016-65/; Table 1). For polygon outlines we used data available under our ArcGIS UAF campus license (FH). All GIS data layers were displayed for the study area in a Mercator projection using WGS84 and decimal-degree coordinates (latitude and longitude) with a precision of 6 decimals (GPS and GIS; the real-world precision is 5 decimals).
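To put the stated coordinate precision into ground units, the following sketch converts a number of decimal places in decimal degrees into approximate metres. The constant of about 111.32 km per degree of latitude is a common spherical-Earth approximation and is an assumption here, as is the function name `ground_resolution_m`.

```python
import math

# Approximate length of one degree of latitude on a spherical Earth (assumption).
METERS_PER_DEGREE_LAT = 111_320.0


def ground_resolution_m(decimals, latitude_deg=0.0):
    """Return (north-south, east-west) size in metres of the last decimal place."""
    step_deg = 10.0 ** (-decimals)
    north_south = step_deg * METERS_PER_DEGREE_LAT
    # Meridians converge towards the poles, so east-west distance shrinks with cos(lat).
    east_west = north_south * math.cos(math.radians(latitude_deg))
    return north_south, east_west


six_decimals = ground_resolution_m(6)          # roughly 0.11 m north-south
five_decimals_60n = ground_resolution_m(5, 60)  # roughly 1.1 m north-south at 60° N
```

Under this approximation, 6 decimals correspond to about a decimetre and 5 decimals to about a metre in the north-south direction, with smaller east-west values at the high latitudes of the study area.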

Table 1 List of GIS Predictors used in this study to data mine and predict low path (LP) Avian Influenza (AI) *


GIS mapping and data processing

We used commercial and open source GIS software (ArcGIS, QGIS) to operate, map and overlay all data. We imported the AI data from an ASCII table (MS Excel) into a shapefile layer of AI and overlaid it with the 19 environmental GIS layers we had available from compiled global data sets. This resulted in a data cube that was analyzed with data mining and used for modeling and predictions.
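The overlay step can be pictured as sampling every predictor layer at each AI sample location and appending the values as new columns. The sketch below is a minimal pure-Python stand-in for what ArcGIS or QGIS does; the `Grid` class, the layer names, the grid values and the coordinates are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Grid:
    """A tiny north-up raster in decimal degrees (hypothetical stand-in for a GIS layer)."""
    name: str
    west: float       # longitude of the left edge
    north: float      # latitude of the top edge
    cell_size: float  # cell width and height in degrees
    values: list      # rows ordered from north to south

    def sample(self, lon, lat):
        """Return the cell value under a point, or None outside the grid."""
        col = int((lon - self.west) // self.cell_size)
        row = int((self.north - lat) // self.cell_size)
        if 0 <= row < len(self.values) and 0 <= col < len(self.values[row]):
            return self.values[row][col]
        return None


def build_data_cube(points, layers):
    """Attach the value of every layer to every point record."""
    cube = []
    for point in points:
        record = dict(point)
        for layer in layers:
            record[layer.name] = layer.sample(point["lon"], point["lat"])
        cube.append(record)
    return cube


layers = [
    Grid("elevation_m", 140.0, 62.0, 1.0, [[120, 340], [15, 80]]),
    Grid("mean_temp_c", 140.0, 62.0, 1.0, [[-4.5, -6.0], [-1.0, -2.5]]),
]
points = [
    {"lon": 140.5, "lat": 61.5, "lpai": 1},
    {"lon": 141.2, "lat": 60.3, "lpai": 0},
    {"lon": 150.0, "lat": 50.0, "lpai": 0},  # falls outside both grids
]
cube = build_data_cube(points, layers)
```

Each row of `cube` then holds the presence or absence label together with one column per predictor, which is the table shape a tree-based learner expects.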

Modeling and predictions

The resulting data cube was imported into SPM 8.2 (https://www.minitab.com/en-us/products/spm/), where it was modeled and used for predictions. We ran a stochastic gradient boosting (TreeNet) algorithm for best-possible predictions and inference (²⁰; see also^9,10,12,21; for an R implementation see²²). As outlined in^9,12,21, we started with the software's default settings, which are known to achieve the best inference as judged by predictive performance¹³. The models used a maximum of 6 nodes per tree, a minimum of 10 cases per terminal node, 200 trees to converge, balanced class weights and ten-fold cross-validation (a repeated 90% training vs 10% testing setting), optimizing on the ROC. To avoid overfitting, we used an auto learn rate and 50% subsampling. The resulting tree model was stored as a grove and applied to an equally spaced lattice of the predictors (excluding species information). The maps were presented in GIS at a resolution of 5 km pixel size (Appendix 3).
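The evaluation scheme described above, a ten-fold split into 90% training and 10% testing cases scored by the ROC, can be illustrated without the modeling software. The sketch below is not TreeNet and does not reproduce SPM internals; the function names, the random seed and the example scores are assumptions chosen for illustration.

```python
import random


def ten_fold_indices(n_cases, n_folds=10, seed=0):
    """Yield (train, test) index lists; each case is tested exactly once."""
    indices = list(range(n_cases))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for k in range(n_folds):
        test = folds[k]
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, test


def roc_auc(labels, scores):
    """Area under the ROC curve: chance that a presence outranks an absence.

    Ties between a presence score and an absence score count as one half.
    """
    presences = [s for y, s in zip(labels, scores) if y == 1]
    absences = [s for y, s in zip(labels, scores) if y == 0]
    if not presences or not absences:
        raise ValueError("AUC needs at least one presence and one absence")
    wins = sum((p > a) + 0.5 * (p == a) for p in presences for a in absences)
    return wins / (len(presences) * len(absences))


splits = list(ten_fold_indices(100))
example_auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

In a full run, a model would be fitted on each `train` list, scored on the matching `test` list, and the resulting AUC values averaged across the ten folds.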

Model assessment data

We were able to obtain two alternative data sets on AI for an assessment of our predictions. The Influenza Research Database (IRD) has an Asian subset (n = 28,205 and 19,405) comparable to our work, which was used to confront our predictions for the study area.

Although the U.S. Department of Agriculture (USDA) has a U.S.-wide AI survey data set (3,589 for Alaska), it lacks geo-referencing with coordinates (locations are given only by county etc.) and includes only H5 and H7 avian flu columns, presumably to protect the industry. We still used this best-available alternative data set for further assessment of the model predictions.

Ethics statement

For this eASIA project, oropharyngeal and cloacal samples in Russia were collected according to the “Federal Law on Hunting and Sharing of Hunting Resources of Russian Federation # 209-ФЗ” and with the permission of local governments in the hunting regions during each hunting season. Hunted birds were provided for sampling by licensed hunters to our group during expeditions.

Fecal samples in Japan were collected with the permission of the municipality managing the sampling areas and Hokkaido University. Fecal samples in Mongolia were collected with the permission of the State Central Veterinary Laboratory, Mongolia. These samples were transferred to Japan under the permissions of the Animal Quarantine Service, Japan (27douken560-2, 28douken563-6, 29douken683-2). Swab samples in Vietnam were collected with the permission of the Department of Animal Health, Vietnam. These samples were transferred to Japan under the permissions of the Animal Quarantine Service, Japan (27douken560-3, 28douken563-1, 28douken563-4, 28douken563-5, 29douken683-3, 29douken683-4).

Data reported in the Influenza Research Database (IRD) were from samples obtained and submitted under NIH-funded avian influenza surveillance collection efforts (CEIRS) and are publicly available at www.fludb.org. This work was supported in part by a National Institute of Allergy and Infectious Diseases Centers of Excellence in Influenza Research and Surveillance (CEIRS) award, Contract HHSN272201400008C (to Eric Bortz).

For Alaska USDA data, wild bird samples primarily came from hunter-killed waterfowl, with voluntary participation from hunters. These sampling activities were covered under US Fish and Wildlife Service Federal Permit MB124992-0.

Source: Ecology - nature.com