Divide-and-conquer: machine-learning integrates mammalian and viral traits with network features to predict virus-mammal associations

Our framework to predict unknown associations between known viruses and potential mammalian hosts or susceptible species comprised three distinct perspectives: viral, mammalian and network. Each perspective produced predictions from a unique vantage point (that of each virus, each mammal, and the network connecting them, respectively). Subsequently, their results were consolidated via majority voting. This approach suggested that 20,832 (median, 90% CI = [2,736, 97,062]; hereafter values in square brackets represent 90% CI) unknown associations potentially exist between our mammals and their known viruses (18,920 [2,440, 91,517] in wild or semi-domesticated mammals). The numbers of unknown associations predicted by each perspective individually were as follows: mammalian only = 41,537 [4,275, 238,971], viral only = 21,352 [2,536, 95,630], and network only = 76,081 [27,738, 205,814]. Our results indicated a ~4.29-fold increase ([~1.43, ~16.33]) in virus-mammal associations (~4.89 [~1.5, ~19.81] in wild and semi-domesticated mammals).
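The fold increases reported here are consistent with dividing the total (known plus newly predicted) associations by the known associations, taking the 6,331 evidence-supported associations given in the network-perspective subsection as the baseline. The following is a minimal sketch under that assumption; the function name `fold_increase` is hypothetical and not taken from the authors' pipeline.

```python
def fold_increase(known: int, predicted_new: float) -> float:
    """Ratio of (known + newly predicted) associations to known associations.

    Assumption: this is how the reported fold increases were derived; the
    source text reports the values but does not spell out the formula.
    """
    if known <= 0:
        raise ValueError("known must be a positive count")
    return (known + predicted_new) / known


KNOWN_ASSOCIATIONS = 6331  # evidence-supported associations reported in the article

# Median and 90% CI bounds of newly predicted associations after voting.
for new in (20832, 2736, 97062):
    print(new, round(fold_increase(KNOWN_ASSOCIATIONS, new), 2))
```

Under this reading, the median and both interval bounds reproduce the reported ~4.29 [~1.43, ~16.33].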

Additionally, we trained an independent pipeline including only the 3,534 associations supported by evidence extracted from meta-data accompanying nucleotide sequences, as indexed in EID2 (55.82% of all associations – see Methods section and Supplementary Results 8). Our sequence-evidence pipeline indicated that 15,721 (median, 90% CI = [1,603, 88,553]) unknown associations could potentially exist (13,930 [1,298, 83,043] in wild or semi-domesticated mammals).

In the following subsections we first illustrate the mechanism of our framework via an example, then further explore the predictive power of our approach for viruses and mammals.

Example

Our multi-perspective framework generates predictions for each known or unknown virus-mammal association (2,722,656 possible associations between 1,896 viruses and 1,436 terrestrial mammals). We highlight this functionality using two examples (Fig. 1): West Nile virus (WNV), a flavivirus with a wide host range, and the bat Rousettus leschenaultii (order: Chiroptera). We first consider each of our perspectives separately, and then showcase how these perspectives are consolidated to produce final predictions.

Fig. 1: Example showcasing final and intermediate predictions of West Nile Virus (WNV), and Rousettus leschenaultii.

Panel A Top 60 predicted mammalian species susceptible to WNV. Mammals were ordered by mean probability of predictions derived from mammalian (all models), viral (WNV models) and network perspectives, and the top 60 were selected. Circles represent the following information in order: 1) Whether the association is known (documented in our sources) or not (potential or undocumented). Hosts are omitted for known associations. 2) Mean probability of the three perspectives (per association). 3) Median mammalian perspective probabilities of predicted associations. These probabilities are obtained from 3000 models (50 replicate models for each mammal), trained with viral features – SMOTE class balancing. 4) Median viral perspective probabilities of predicted associations (50 WNV replicate models trained with mammalian features – SMOTE class balancing). 5) Median network perspective probabilities of predicted associations (100 replicate models, balanced under-sampling). 6) Taxonomic order of predicted susceptible species. Orders are shortened as follows: Artiodactyla (Art), Carnivora (Crn), Chiroptera (Chp), Primates (Prm), Rodentia (Rod), and Others (Oth). Panel B Top 50 predicted viruses of R. leschenaultii. Viruses were ordered by mean probability of predictions derived from mammalian (R. leschenaultii models), viral (all models) and network perspectives. Circles as per Panel A. Baltimore represents Baltimore classification. Panel C Median probability of predicted WNV-mammal associations in each of the three perspectives per mammalian order. Points represent susceptible species predicted by voting (at least two of the three perspectives – n = 137). Median ensemble probability is computed in each perspective (50 replicate models for each virus/mammal, 100 replicate network models). Predictions derived from each perspective at 0.5 probability cut-off. Supplementary Data 1 presents full WNV results. Panel D Median probability of virus-R. leschenaultii associations in the three perspectives per Baltimore group. Points represent susceptible species predicted by voting (at least two of the three perspectives – n = 64), predictions are derived as per panel C. Supplementary Data 2 lists full results for R. leschenaultii. Supplementary Fig. 7 illustrates the results when research effort into viruses and mammals is included in mammalian and viral perspectives, respectively.

(1) The mammalian perspective: our mammalian perspective models, trained with features expressing viral traits (Table 1), suggested that a median of 90 [17, 410] unknown associations between WNV and terrestrial mammals could form when predicting virus-mammal associations based on viral features alone – a ~2.61-fold increase [~1.3, ~8.32]. Similarly, our results indicated that 64 [4, 331] new associations could form between our selected mammal (R. leschenaultii) and our viruses – a ~4.37-fold increase [~1.21, ~18.42] (Supplementary Results 4).

Table 1 Viral traits & features used to build our mammalian models.

(2) The viral perspective: our viral models, trained with features expressing mammalian traits (Table 2), indicated a median of 48 [0, 214] new hosts of WNV (~1.86-fold increase [~1, 4.82]). Results for our example mammal (R. leschenaultii) suggested that 18 [3, 76] existing viruses could be found in this host (~1.95-fold increase [~1.16, ~5.00]; Supplementary Results 5).

Table 2 Mammalian traits & features used to build our viral models.

(3) The network perspective: our network models indicated a median of 721 [448, 1,317] (~13.88 [9, 24.52] fold increase) unknown associations between WNV and terrestrial mammals, and that 246 [91, 336] existing viruses could be found in our selected host (R. leschenaultii), equivalent to a ~13.95 [~5.79, ~18.68] fold increase (Supplementary Results 6).

Because each of the above perspectives approached the problem of predicting virus-mammal associations from a different angle, the agreement between these perspectives varied. In the case of WNV: mammalian and viral perspectives achieved 92.3% agreement [72.6%–98.5%]; mammalian and network perspectives had 55.3% agreement [33.4%–69.5%]; and viral and network perspectives had 52.9% agreement [19.8%–68.7%]. In the case of R. leschenaultii these numbers were as follows: 96.15% [82.44%, 99.58%], 87.24% [76.37%, 95.04%], and 87.61% [75.90%, 95.25%], respectively. The agreements between our perspectives across the 2,722,656 possible associations were as follows: 98.04% [90.36%, 99.73%] between mammalian and viral perspectives, 96.71% [88.62%, 98.92%] between mammalian and network perspectives, and 97.11% [91.57%, 98.95%] between viral and network perspectives.
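Pairwise agreement between two perspectives can be illustrated with a short sketch. The text does not define the agreement measure precisely, so the version below assumes it is the share of candidate associations on which two perspectives make the same binary call (predicted or not predicted). The function name and the toy vectors are hypothetical.

```python
from typing import Sequence


def percent_agreement(calls_a: Sequence[int], calls_b: Sequence[int]) -> float:
    """Percentage of candidate associations where two perspectives agree.

    Assumption: each sequence holds one binary call (1 = association
    predicted, 0 = not predicted) per candidate association, in the same order.
    """
    if len(calls_a) != len(calls_b) or len(calls_a) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(calls_a, calls_b))
    return 100.0 * matches / len(calls_a)


# Toy calls for five candidate hosts of one virus (illustrative only).
mammalian_calls = [1, 0, 0, 1, 0]
viral_calls = [1, 0, 1, 1, 0]
print(percent_agreement(mammalian_calls, viral_calls))  # 80.0
```

Because most candidate associations are negatives in every perspective, a measure of this kind would be expected to be high across the full set of possible associations, and lower for a widely distributed virus such as WNV where the perspectives differ on many more candidates.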

After voting, our framework suggested that a median of 117 [15, 509] new or undetected associations could be missing between WNV and terrestrial mammals (~3.45-fold increase [~1.3, ~12.2]). Similarly, our results indicated that R. leschenaultii could be susceptible to an additional 45 [5, 235] viruses that were not captured in our input (~1.37-fold increase [~1.26, ~13.37]). Figure 1 illustrates top predicted and detected associations for WNV (Supplementary Data 1) and R. leschenaultii (Supplementary Data 2). Supplementary Results 1 illustrate results with research effort into viruses and mammals included as a predictor in our mammalian and viral perspective models, respectively. Predictions with and without research effort incorporated into models trained in these perspectives broadly agreed.
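The voting step can be sketched as follows, using the rule stated in the caption of Fig. 1: each perspective yields a prediction at a 0.5 probability cut-off, and an association is retained when at least two of the three perspectives predict it. The function name is hypothetical, and treating a probability exactly equal to the cut-off as a negative call is an assumption.

```python
def majority_vote(p_mammalian: float, p_viral: float, p_network: float,
                  cutoff: float = 0.5, min_votes: int = 2) -> bool:
    """Return True if at least `min_votes` perspectives predict the association.

    Assumption: a perspective votes "yes" when its probability is strictly
    above `cutoff`; the source only states that a 0.5 cut-off was used.
    """
    votes = sum(p > cutoff for p in (p_mammalian, p_viral, p_network))
    return votes >= min_votes


# Two of three perspectives above the cut-off: association is predicted.
print(majority_vote(0.9, 0.2, 0.7))  # True
# Only one perspective above the cut-off: association is not predicted.
print(majority_vote(0.9, 0.2, 0.3))  # False
```

Requiring two votes means a single permissive perspective cannot drive a prediction on its own, which matches the voted totals being far below the network-only totals.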

Relative importance of viral features

Our multi-perspective approach trained a suite of models for each mammalian species with two or more known viruses (n = 699, response variable = 1 if the virus is known to associate with the focal mammalian species, 0 otherwise). This enabled us to assess the relative importance (influence) of viral traits (Table 1) to each of our mammalian models. This in turn showcased variations in how these viral traits contribute to the models at the level of individual species (e.g. humans), and at an aggregated level (e.g. by order or domestication status). The results, highlighted in Fig. 2A, indicate that mean phylogenetic (median = 95.4% [75.6%, 100%]) and mean ecological (90.90% [43.50%, 100%]) distances between potential and known hosts of each virus were the top predictors of associations between the focal host and each of the input viruses. Maximum phylogenetic breadth was also important (74.70% [16.60%, 100%]).

Fig. 2: Results (viruses).

Panel A Variable importance (relative contribution) of viral traits to mammalian perspective models. Variable importance is calculated for each constituent ensemble (n = 699) of our mammalian perspective (median of a suite of 50 replicate models, trained with viral features, with SMOTE sampling), and then aggregated (mean) per each reported group (columns). Panel B Number of known and new mammalian species associated with each virus. Rabies lyssavirus was excluded from panel B to allow for better visualisation. Top 40 (by number of new hosts) are labelled. Species in bold have over 150 predicted hosts (Supplementary Data 3 lists details of these viruses including CI). Panel C Predicted number of viruses per species of wild and semi-domesticated mammals (grouped by mammalian order). The following orders (clockwise) are presented: Artiodactyla, Carnivora, Chiroptera, Perissodactyla, Primates, and Rodentia. Source of the silhouette graphics is PhyloPic.org. (Supplementary Data 4 lists aggregated results per mammalian order). Circles represent each mammalian species (with predicted viruses > 0), coloured by number of known viruses previously not associated with this species. Boxplots indicate median (centre), the 25th and 75th percentiles (bounds of box) and inter quantile range (whiskers) and are aggregated at the order level. Large red circles with error bars (90% CI) illustrate the median number of known viruses per species in each order. Number of species presented (n) is as follows: All = 1293 (Artiodactyla = 104, Carnivora = 177, Chiroptera = 548, Perissodactyla = 11, Primates = 171, and Rodentia = 282); Group I = 666 (94, 109, 156, 10, 160, 137); Group II = 371 (32, 120, 111, 1, 54, 53); Group III = 410 (87, 62, 123, 9, 51, 78); Group IV = 739 (98, 102, 221, 9, 148, 161); Group V = 1129 (87, 173, 528, 8, 107, 226); Group VI = 358 (55, 64, 30, 6, 139, 64); and Group VII = 110 (3, 2, 53, 1, 43, 8). Supplementary Fig. 8 presents results derived with research effort into mammalian hosts and viruses included in the constituent models trained in the viral and mammalian perspectives, respectively.

Mammalian host range

Our results suggested that the average mammalian host range of our viruses is 14.33 [4.78, 54.53] (average fold increase of ~3.18 [~1.23, ~9.86] in number of hosts detected per virus). Overall, RNA viruses had an average host range of 21.65 [7.01, 82.96] hosts (~4.00-fold increase [~1.34, ~14.15]). DNA viruses, on the other hand, had 7.85 [2.81, 29.47] hosts on average (~2.43 [~1.14, ~6.89] fold increase). Table 3 lists the results of our framework at Baltimore group level and for selected families and transmission routes of our viruses. Figure 2 illustrates the predicted mammalian host range of our viruses (Fig. 2B, Supplementary Data 3), and the increase in predicted number of viruses per species in species-rich mammalian orders of interest (Fig. 2C, Supplementary Data 4).

Table 3 Predicted range of susceptible mammalian species of viruses per Baltimore group, family (top 15 families, ranked by fold increase) and transmission route.

Relative importance of mammalian features

We trained a suite of models for each virus species with two or more known mammalian hosts (n = 556, response variable = 1 if the mammal is known to associate with the focal virus species, 0 otherwise). This allowed us to calculate the relative importance of mammalian traits (Table 2) to our viral models. We were also able to capture variations in how these features contribute to our viral models at various levels (e.g. Baltimore classification, or transmission route) as highlighted in Fig. 3A. Our results indicated that distances to known hosts of viruses were the top predictors of associations between the focal virus and our terrestrial mammals. The breakdown was: 1) mean phylogenetic distance – all viruses = 98.75% [93.01%, 100%], DNA = 99.48% [96.03%, 100%], RNA = [91.93%, 100%]; 2) mean ecological distance – all viruses = 94.39% [71.86%, 100%], DNA = 96.36% [80.99%, 100%], RNA = [69.48%, 100%]. In addition, life-history traits significantly improved our models, in particular: longevity (all viruses = 60.9% [12.12%, 98.88%], DNA = 68.03% [11.22%, 99.69%], RNA = [13.55%, 96.37%]); body mass (all viruses = 62.92% [5.4%, 97.65%], DNA = 72.75% [18.49%, 100%], RNA = 57.45% [4.32%, 95.5%]); and reproductive traits (all viruses = 53.37% [5.67%, 95.99%], DNA = 59.46% [8.27%, 99.32%], RNA = 50.17% [4.85%, 92.17%]).

Fig. 3: Results (Mammals).

Panel A Variable importance (relative contribution) of mammalian traits to viral perspective models. Variable importance is calculated for each constituent model (n = 556) of our viral perspective (trained with mammalian features), and then aggregated (median) per each reported group (columns). Panel B Number of known and new viruses associated with each mammal. Labelled mammals are as follows: top 4 (by number of new viruses) for each of Artiodactyla, Carnivora, Chiroptera, Primates, Rodentia, and other orders. Species in bold have 100 or more predicted viruses (Supplementary Data 5). Panel C Top 18 genera (by number of predicted wild or semi-domesticated mammalian host species) in selected orders (Other indicates results for all orders not included in the first five circles). Each order figure comprises the following circles (from outside to inside): 1) Number of hosts predicted to have an association with viruses within the viral genus. 2) Number of hosts detected to have an association. 3) Number of hosts predicted to harbour viral zoonoses (i.e. known or predicted to share at least one virus species with humans). 4) Number of hosts predicted to share viruses with domesticated mammals of economic significance (domesticated mammals in orders: Artiodactyla, Carnivora, Lagomorpha and Perissodactyla). 5) Baltimore classification of the selected genera (Supplementary Data 6). Supplementary Fig. 9 presents results derived with research effort into mammalian hosts and viruses included in the constituent models trained in the viral and mammalian perspectives, respectively.

Wild and semi-domesticated susceptible mammalian hosts of viruses

Our framework indicated a ~4.28-fold increase [~1.2, ~14.64] in the number of virus species in wild or semi-domesticated mammalian hosts (16.86 [4.95, 68.5] viruses on average per mammalian species). These results indicated an average of 13.45 [1.73, 65.04] unobserved virus species for each wild or semi-domesticated mammalian host (known viruses that are yet to be associated with these mammals). Our framework highlighted differences in the number of viruses predicted per order (Table 4). Figure 3 illustrates the predicted number of viruses in wild or semi-domesticated mammals by mammalian host range (Fig. 3B, Supplementary Data 5), and the top 18 virus genera (per number of host-virus associations) in selected orders (Fig. 3C, Supplementary Data 6). Supplementary Results 1 lists the results with the inclusion of research effort into mammalian species in our viral perspective models.

Table 4 Predicted number of viruses per top 15 orders by fold increase in number of viruses predicted in wild or semi-domesticated mammalian hosts (per species).

Network perspective – Potential motifs

We quantified the topology of the network linking virus and mammal species by means of counts of potential motifs²¹. Figure 4 illustrates how potential motifs are captured in our network. Briefly, for each virus-mammal association for which we want to make predictions (n = 2,722,656, of which 6,331 are supported by our evidence, see Methods section), we “force insert” this focal association into our network (Fig. 4A, B) and enumerate all instances of 3-node (n = 2), 4-node (n = 6), and 5-node (n = 20) potential motifs in which this association might feature if it actually existed²¹ (Fig. 4C visualises these different motifs). Following this process, a feature set is generated comprising the counts of potential motifs for all included associations. Figure 4D illustrates the count of motifs (logged) grouped by mammalian order and virus Baltimore classification.
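The idea of a potential motif can be illustrated for the simplest case. In a bipartite network, a connected 3-node subgraph containing a given virus-mammal link is a path with either a second virus or a second mammal, which is consistent with the two 3-node motifs mentioned above. The sketch below force-inserts a focal link and counts both path types. It is a simplified illustration under stated assumptions, not the authors' implementation: the function name, the dictionary keys and the toy network are hypothetical, and the 4-node and 5-node motifs are not covered.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def three_node_potential_motifs(edges: Iterable[Tuple[str, str]],
                                virus: str, mammal: str) -> Dict[str, int]:
    """Count 3-node paths through the focal virus-mammal link.

    The focal link is added to the network first ("force insertion"),
    whether or not it is already a known association.
    """
    hosts_of = defaultdict(set)    # virus  -> set of mammals
    viruses_of = defaultdict(set)  # mammal -> set of viruses
    for v, m in edges:
        hosts_of[v].add(m)
        viruses_of[m].add(v)

    # Force-insert the focal association.
    hosts_of[virus].add(mammal)
    viruses_of[mammal].add(virus)

    return {
        # focal virus - focal mammal - another virus of that mammal
        "virus-mammal-virus": len(viruses_of[mammal] - {virus}),
        # focal mammal - focal virus - another host of that virus
        "mammal-virus-mammal": len(hosts_of[virus] - {mammal}),
    }


# Toy bipartite network of (virus, mammal) links; names are placeholders.
toy_edges = [("V1", "M1"), ("V1", "M2"), ("V2", "M3"), ("V3", "M3")]
print(three_node_potential_motifs(toy_edges, "V1", "M3"))
```

In this toy network the unknown link V1-M3 would take part in two paths of each type, so its two 3-node motif features would both equal 2.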

Fig. 4: The network perspective – potential motifs (subgraphs) in our virus-host bipartite network.

A The concept of potential motif. The association TBEV-P. leo is a forced insertion into the network prior to calculating motifs for the association. B Motifs space: networks represent 2 steps and 3 steps ego networks (union) of host (here P. leo) and virus (TBEV). 1, 2 and 3 step ego networks comprise the counting space for TBEV-P. leo potential motifs. Dark grey nodes represent viruses, light grey nodes represent hosts. Size of nodes is adjusted to represent overall number of hosts or viruses with known associations to the node. Red edges represent nodes reachable from the mammal (P. leo) in 1 or 2 steps (links). Blue edges represent nodes reachable from the virus (TBEV) with 1 or 2 steps (links). Humans and rabies virus were excluded from these networks. C 3, 4 and 5-node potential motifs in our virus-host bipartite network. Circles represent viruses and squares represent mammals. Red circles represent the focal virus (v), and blue squares represent the focal mammal (m) of the association v-m for which the motifs are being counted (dashed yellow line). This association has two states: either already known (documented in EID2), or unknown. Grey lines illustrate existing associations in our network. D Motifs counts. Heatmap illustrating distribution of motif-features (counts of potential motifs per each focal association) in our bipartite network, grouped by mammalian order and Baltimore classification. The counts are logged to allow for better visualisation. E Variable importance (relative contribution) of motif-features (variables) to our network perspective models (SVM-RW). Motifs (subgraphs) are coloured by the number of nodes (K = 3, 4, 5). Boxplots indicate median (centre), the 25th and 75th percentiles (bounds of box) and inter quantile range (whiskers). Points represent variable importance in individual runs (n = 100). Research effort into both viruses and mammals is included as independent variables in our network models (coloured in yellow).

Relative importance of network (motif) features

Figure 4E illustrates that M4.1 was the most important feature in our network models (median = 100% [90.19%, 100%]), followed by M5.1 = 97.84% [89.19%, 99.93%], M5.7 = 97.22% [87.7%, 98.77%] and M4.6 = 96.75% [86.13%, 100%]. Research effort into viruses and mammals had relative importance = 90.26% [82.94%, 95.36%] and 88.42% [78.38%, 94.87%], respectively. Overall, 5-node motif-features had median relative influence = 75.06% [1.21%, 98.14%], whereas 3-node and 4-node motif-features had relative influence = 71.69% [55.76%, 85.34%] and 61.06% [27.14%, 100%], respectively. Supplementary Fig. 29 illustrates the partial dependence of network perspective models on each of our network features.

Validation

We validated our framework in three ways: 1) against a held-out test set; 2) by systematically removing selected known viral-mammalian associations and attempting to predict them; and 3) against an external data source, comprising viral-mammalian associations extracted using an exhaustive literature search targeting wild mammals and their viruses⁴,³⁰.

Our held-out test set comprised 15% of all data (randomly selected, n = 407,265; 954 known virus-mammal associations, see Methods below). We removed this set from our network, computed network features (motifs), and trained constituent models in each perspective with the remaining data. We then estimated the performance metrics of our framework against the held-out test set. Our framework achieved overall AUC = 0.938 [0.862–0.959], F1-Score = 0.284 [0.464–0.124], and TSS = 0.876 [0.724–0.918], when trained without including research effort in its mammalian and viral perspectives. When research effort was included in these perspectives, performance metrics were as follows: AUC = 0.920 [0.823, 0.944], F1-Score = 0.272 [0.526, 0.093], and TSS = 0.840 [0.646, 0.888].
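Two of the metrics above can be computed directly from a confusion matrix: the F1-score is the harmonic mean of precision and sensitivity, and the true skill statistic (TSS) is sensitivity plus specificity minus one. The sketch below uses made-up counts, not the study's results, to show how a strongly imbalanced test set can give a high TSS alongside a modest F1-score. The function name and the counts are hypothetical.

```python
from typing import Tuple


def f1_and_tss(tp: int, fp: int, tn: int, fn: int) -> Tuple[float, float]:
    """Return (F1-score, TSS) from confusion-matrix counts.

    Assumes each denominator is non-zero, i.e. the test set contains both
    positives and negatives and at least one positive prediction is made.
    """
    sensitivity = tp / (tp + fn)   # true positive rate (recall)
    specificity = tn / (tn + fp)   # true negative rate
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    tss = sensitivity + specificity - 1
    return f1, tss


# Illustrative counts only: 100 positives among 10,000 test associations.
f1, tss = f1_and_tss(tp=90, fp=400, tn=9500, fn=10)
print(f"F1 = {f1:.3f}, TSS = {tss:.3f}")
```

With so few positives, even a small false-positive rate produces many more false positives than true positives, which lowers precision and hence F1 while leaving TSS largely unaffected.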

The performance of our voting approach was better than that of any individual perspective, or combination of perspectives (Supplementary Tables 8–11). The most significant improvement was in F1-score, where the scores of individual perspectives were as follows: network = 0.104 [0.210–0.051], mammalian = 0.115 [0.009–0.064] (0.131 [0.284–0.035] with research effort), and viral = 0.181 [0.374–0.074] (0.196 [0.373–0.067] with research effort).

Additionally, we conducted a systematic test to predict removed virus-mammal associations. In this test, we removed one known virus-mammal association at a time from our framework, recalculated all inputs (including those derived from the network) and attempted to predict the removed association. Our framework succeeded in predicting 90% of removed associations (90.70% for associations removed for viruses, 89.92% for associations removed from mammals, Supplementary Results 3).

Finally, our framework predicted 84.02% [77.69%, 89.60%] of the externally obtained viral-mammalian associations (with detection quality > 0) where both host and virus were included in our pipeline, and 77.82% [68.46%, 86.51%] (any detection quality). When including research effort in our mammalian and viral perspectives, these results were: 84.47% [78.15%, 89.60%], and 78.41% [68.83%, 86.37%], respectively.

Source: Ecology - nature.com
