Predicting the potential for zoonotic transmission and host associations for novel viruses

Data collection

Virus-host data was collated from various sources. Major sources for the association databases included data shared by Olival et al⁴., Pandit et al.³, and Johnson et al.¹³. In data provided by Olival et al (assessed September 2019), host-virus associations have been assigned a score, based on detection methods and tests that are specific and more reliable. We used associations that have been identified as the most reliable (stringent data) from Olival et al⁴. In addition, a query in GenBank was run to parse out hosts reported for each GenBank submission for viruses presented in each of these three databases. Initially, for each virus name, taxonomic ID was identified using entrez.esearch function in biopython package. The taxonomic ID helped linked to the GenBank databases, identify the ICTV lineage and associated data in PubMed^20,21. NCBI TaxID closely follows the ICTV database, but some recent changes in ICTV might not always be reflected in NCBI, so we manually checked names to ensure matching. This included virus genus and family information along with a standard virus name. Host data were aggregated based on the taxonomic ID and associated standard name. Finally, for each virus, a search was completed in PubMed to compile the number of hits related to the virus and their vertebrate hosts using the search terms below. The number of PubMed hits (PMH1) were used as a proxy for sampling bias^3,13. The virus-host association data source is presented in supplementary code and data files (https://zenodo.org/record/5899054).

$$ searchterm= (+virus_name+,[Title/Abstract]) ANDleft(host,OR,hosts,OR,reservoir,OR,reservoirs,OR right. wild,OR,wildlife,OR,domestic,OR,animal,OR,animals,OR mammal,OR,bird,OR,birds,OR,aves,OR,avian,OR,avians left. OR,vertebrate,OR,vertebrates,OR,surveillance,OR,sylvaticright)$$

Along with the PubMed terms we also queried the nucleotide database on PubMed using the taxonomic ID to find the number of GenBank entries for these viruses (PMH2). A correlation analysis between the PMH1 and PMH2 of well-recognized known viruses showed a high correlation with each other for us to safely use GenBank hits for novel viruses during the prediction stage of the model (Fig. S32).

Development of ({{{{{{boldsymbol{G}}}}}}}_{{{{{{boldsymbol{c}}}}}}})

a. Centrality measures of observed network (({{{{{{boldsymbol{G}}}}}}}_{{{{{{boldsymbol{c}}}}}}}))

To test if centrality measures (degree centrality, betweenness centrality, eigenvector centrality, clustering coefficient) for viral nodes in the observed network (({G}_{c})) vary significantly between viral families, we firstly used the Kolmogorov-Smirnov (KS) test. KS test is routinely used to identify distances between cumulative distribution functions of two probability distributions and is largely used to compare degree distributions of networks^22,23. For each viral family, distributions of centrality measures (degree centrality, betweenness centrality, and eigenvector centrality) and clustering coefficient within the observed network (({G}_{c})) were compared with the distribution of all nodes in the network using the two-tailed KS test. Secondly, a linear regression model with virus family as a categorical variable and the number of PubMed hits as a covariate to adjust for sampling bias were fitted to understand associations of viral families with centrality measures.

$${centrality},{measure}={beta }_{0}{intercept}+{{beta }_{1}{Viral}{family}}_{{categorical}}+{beta }_{2}{PubMed},{hits}$$

After fitting the model, node-level permutations were implemented. For each random permutation, the output variable was randomly assigned to covariate values and the model was re-fitted. Finally, a p-value was calculated by comparing the distribution of coefficients from permutations with the original model coefficient.

Network topology feature selection

Using the observed network (({{{{{{boldsymbol{G}}}}}}}_{{{{{{boldsymbol{c}}}}}}})), multiple network topological features for all node (virus) pairs were calculated. The following are topological network features calculated. Features data type, definition and methods to calculate these features are presented in Table S3.

1. The Jaccard coefficient: a commonly used similarity metric between nodes in information retrieval, is also called an intersection of over the union for two nodes in the network. In the unipartite network generated here, it represents the proportion of common neighbor viruses from the union of neighbor viruses for two nodes. Neighbor viruses are defined as viruses with which the virus shares at least a single host.

2. Adamic/Adar (Frequency-Weighted Common Neighbors): Is the sum of inverse logarithmic degree centrality of the neighbors shared by two nodes in the network²⁴. The concept of Adamic Adar index is a weighted common neighbors for viruses in the network. Within network prediction, the index assumes that viruses with large neighborhoods have a less significant impact while predicting a connection between two viruses compared with smaller neighborhoods.

Both Jaccard and Adamic Adar coefficients have been routinely used for generalized network prediction and have shown high accuracy in predicting missing links in networks, specifically bipartite networks²⁵, the information flowing through neighborhoods formed by two nodes might not always be enough to have similar predictive power in an unipartite network. This warrants use of other topology features along with neighborhood-based features.

3. Resource allocation: Similarity score of two nodes defined by the weights of common neighbors of two nodes. Resource allocation is another measure to quantify the closeness of two nodes in the network and hence to understand the similarity of hosts they infect.

4. Preferential attachment coefficients: The mechanism of preferential attachment can be used to generate evolving scale-free networks, where the probability that a new link is connected to node x is proportional to k²⁶.

5. Betweenness centrality: For a node in the network betweenness centrality is the sum of the fraction of all-pairs shortest paths that pass through it. The feature that we used for training the supervised learning model was the absolute difference between of betweenness centralities of two nodes. The difference between the betweenness centrality represents the difference in the sharing observed by two viruses in the pair.

6. Degree centrality: The degree centrality for a node v is the fraction of nodes it is connected to. The feature that we used for training the supervised learning model was the absolute difference between degree centralities of two nodes. Unlike the difference in the betweenness centrality, the difference in degree centrality only looks at the difference in the number of observed host sharing.

7. Network clustering: All nodes were classified into community clusters using Louvain methods²⁷. A binary feature variable was generated to describe if both the nodes in the pair were part of the same cluster or not. If both viruses are from the same cluster, it represents a similar host predilection than when both viruses are not from the same cluster hence accounting for the evolutionary predilection of viruses (or virus families) to infect a certain type of host.

These topological network characteristics come with certain limitations when it comes to the unipartite network of viruses with links formed due to shared hosts and might not truly represent the flow of information between nodes as compared to a bipartite network. Therefore, to account for these limitations, we use multiple network features as weak learners in our model building characteristics summarizing the network through the use of several quantitative metrics. In addition to this, we estimated the feature importance of these metrics in predicting missing links between viruses to quantify the information pasting through these links.

Pearson’s correlation coefficients were calculated to identify highly correlated features and for choosing features for model training (Fig. S33). Virological features included in model training were categorical variables describing the virus family of both the nodes in the pair, followed by a binary variable if both the viruses belong to the same virus family. During the model development, PubMed hits generated three predictive features for each pair of viruses on which model training and predictions were conducted. These included two features representing PubMed hits for the two viruses in the pair (PubMed_V1, PubMed_V2) and the absolute difference between PubMed_V1 and PubMed_V2 to account for differences in sampling bias between the two viruses.

Cross-validation and fitting generalized boosting machine (GBMs) models

A nested-cross-validation was implemented for the binary model while simple cross-validation was implemented for the multiclass model (multiple output categories). The parameters of the binary model were first hyper-tuned using a cross-validated grid-search method. Values were tested using a grid search to find the best-performing model parameters that showed the highest sensitivity (recall). The parameters tested for hypertuning and their performance are provided in the supplementary material (supplementary results and Table S5). For further cross-validation of the overall binary model, all the viruses were randomly assigned to five groups. For each fold, the viruses assigned to a group were dropped from the data, and a temporary training network (({{{{{{boldsymbol{G}}}}}}}_{{{{{{boldsymbol{t}}}}}}}{{{{{boldsymbol{)}}}}}}) was constructed, assuming that this represented the current observed status of the virus-host community. For all possible pairs in ({{{{{{boldsymbol{G}}}}}}}_{{{{{{boldsymbol{t}}}}}}}) (both that sharing and not sharing any hosts) ten topological and viral characteristics were calculated as training features (Table S4). Categorical features were one-hot-encoded and numeric features were scaled. An XGBClassifier model with binary: logistic family was trained using the feature dataset to predict if virus pairs share hosts (1,0 encoded output). The cross-validation was also used to determine the optimum decision threshold for determining binary classification (Fig. S6) and a precision-recall curve was used to identify positive predictive value and sensitivity at the optimum threshold (Fig. S8).

The multiclass model was implemented in the same way, creating an observed network (({G}_{c})) based on species-level sharing of hosts and randomly dropping viruses to generate a training network (({G}_{t})) to train the XGboost model. The output variables were generated based on the taxonomical orders of shared hosts. A pair of viruses can share multiple hosts, hence we trained a multioutput-multiclass model. Humans were considered an independent category of taxonomical order (label) and were given a separate label from primates. For fine-tuning the multiclass model, we started with the best performing parameters of the binary model and manually tested 5 combinations of model parameters by adjusting values of the learning rate, number of estimators, maximum depth, and minimum child weight (Supplementary code and results).

We used three methods to estimate the importance of features for our binary model. Specifically, improvement in accuracy brought by branching based on the feature (gain), the percentage of times the feature appears in the XGboost tree model (weight), and the relative number of observations related to the specific feature (cover). Results for feature importance are shown in supplementary results (Fig. S10).

Missing links for novel viruses, binary and multiclass prediction

The wildlife surveillance data represented a sampling of 99,379 animals (94,723 wildlife, 4656 domesticated animals) conducted in 34 countries around the world between 2009–2019 (Table S6)¹. Specimens were tested using conventional Rt-PCR, Quantitative PCR, Sanger sequencing, and Next Generation Sequencing protocols to detect viruses from 28 virus families or taxonomic groups (Table S7). Testing resulted in 951 novel monophyletic clusters of virus sequences (referred to as novel viruses henceforth). Within 951 novel viruses, 944 novel viruses had vertebrate hosts that were identified with certainty based on barcoding methods and field identification. Host species identification was confirmed by cytochrome b (cytb) DNA barcoding using DNA extracted from the samples²⁸. We predicted the shared host links between novel viruses and known viruses using binary and multiclass models in the following steps. Out of 944 novel viruses discovered in the last ten years, we were able to generate predictions for 531 novel viruses that were detected in species already classified as hosts within the network. The remaining 413 viruses were the first detection of any virus in that species and thus host associations could not be informed by the observed network (({{{{{{boldsymbol{G}}}}}}}_{{{{{{boldsymbol{C}}}}}}})) data.

1. A new node representing the novel virus was inserted in the observed network (({{{{{{boldsymbol{G}}}}}}}_{{{{{{boldsymbol{c}}}}}}})). Using the list of species in which the novel virus was detected, new edges were created with known viruses that are also known to be found in those hosts. This generated a temporary network for the novel virus (({{{{{{boldsymbol{G}}}}}}}_{{temp}})). If the novel virus was not able to generate any edges with known viruses, meaning the host in which they have been found was never found positive for any known virus, predictions were not performed.

2. Using ({{{{{{boldsymbol{G}}}}}}}_{{temp}}) feature values were calculated for the novel virus (betweenness centrality, clustering, and degree). For all possible pairs of the novel virus with known viruses that are not yet connected with each other through an edge in ({{{{{{boldsymbol{G}}}}}}}_{{temp}}) a feature dataset was generated (Jaccard coefficient_{(novel virus, known virus)}, the difference in betweenness centrality of the novel virus and known virus, if the novel virus and known virus were in the same cluster, the difference in degree centrality_{(novel virus, known virus)}, if the novel virus and known virus were from same virus family, the difference in PubMed hits_{(novel virus, known virus)}, PubMed hits for the novel virus, PubMed hits for the known virus). Studies and nucleotide sequences for novel viruses are expected to be published and shared on PubMed’s Nucleotide database and in various peer-reviewed publications. Data associated with GenBank accession numbers and nucleotide sequences for novel viruses are presented in Supplementary Data 3 and Supplementary Data 4 respectively. At the time of development of the model, data for all viruses was not shared in a format that would reflect on PubMed’s database, we decided to use the number of unique species the virus was detected in the last ten years of wildlife surveillance conducted by the USAID PREDICT project. These detections will be reflected in PubMed’s Nucleotide database and search term eventually, hence we considered them as a proxy for search terms conducted for known viruses. Currently, evaluation of the effects of this substitution of PubMed hits with the number of detections for novel viruses is not possible with limited data on novel viruses but needs to be reevaluated as more studies are published on these novel viruses. To further evaluate the association between PubMed hits through search term and Genbank hits, we ran a generalized linear regression model with PubMed hits as dependent variable and Genbank hits as intendent variable, accounting for virus families.

$${{PubMed}}_{{Search}}left({log }right)={beta }_{0}{intercept}+{{beta }_{1}{Virus}{family}}_{{categorical}}+{beta }_{2}{Genbank},{hits},({log })$$

The results indicated that Genbank hits had statistically significant predictive value in predicting PubMed hits (β = 0.72, p < 0.005) even after accounting for various virus families. Multiple virus families showed statistically different estimates than the reference virus family (Adenoviridae) indicating a significantly different association than other virus families. Results of the generalized linear regression model are presented in Table S8.

3. Using this dataset for the novel virus, a binary presence of a link between the novel virus and known viruses was predicted using the trained binary model. The taxonomic order of the host link was predicted using the trained multiclass model.

4. For each possible link, the binary model predicted the probability of sharing a link, and the multiclass model predicted multivariate outcomes of taxonomic orders and associated probabilities. A threshold of 0.70 for the binary prediction model was used to classify if the link is present or not and only those links were explored for their corresponding multiclass model outputs.

5. The multiclass model showed higher performance for correctly classifying links as “human” hosts than other numerous avian and mammalian taxonomic orders. Hence, the multiclass model outputs were summarized into either humans or other taxonomic groups. For the novel virus, a list of known viruses with the predicted link was generated. Using the hosts of these known viruses and the taxonomic order in which the novel virus was detected, a list of most likely species was generated based on the overall frequency of the host species. For understanding the likelihood of infecting humans two factors were considered to be of importance. Firstly, the number of links where humans are predicted as shared hosts with known viruses ((n)) and the average model-predicted probability of those links. A representation was generated incorporating the probability and available model support in terms of number links to reflect the likelihood and compare viruses relative to each other.

To test if virus family, the taxonomic order of hosts in which novel viruses were detected, and the number of times the viruses were detected (equivalent to PubMed hits for known viruses) influenced node (virus) level network centrality measures in the predicted network (({{{{{{boldsymbol{G}}}}}}}_{{{{{{boldsymbol{p}}}}}}})) a linear regression model was fitted with centrality measures.

$${centrality},{measure}= {beta }_{0}{intercept}+{{beta }_{1}{Viral}{family}}_{{categorical}} +{{beta }_{2}{Host}{Order}}_{{categorical}}+{beta }_{3}{PubMed},{hits}$$

For each of the random 10,000 node-level permutations, the output variable (centrality measure) was randomly assigned to covariate values and the model was re-fitted. A p-value was calculated by comparing the distributions of coefficients with the original model coefficient. These models were fitted for degree centrality, betweenness centrality, eigenvector centrality, and clustering coefficient of novel viruses in the predicted network.

Prioritization score for novel viruses

Generalized Linear Mixed Models were used to understand the association effects of virus family, taxonomic order of the host and PubMed hits on the number of predicted human links and mean probability of the predicted links. The models were fit using glmmTMB and glm packages in R. For relative comparison of zoonotic risk and for prioritizing novel viruses for further characterization, a prioritization metric was developed based on the predicted probability of sharing the humans as hosts with known viruses (({p}_{{sharing; humans}})) and the number of predicted shared human links (({n}_{{humans}})) in the predicted network for the given virus (({G}_{{predicted}})). Distributions for both ({p}_{{sharing; humans}}) and ({n}_{{humans}}) were normalized and multiplied to generate a single score for a virus and for appropriate relative comparisons between novel viruses. To understand the behavior of the prioritization score when predicting the zoonotic risk of novel viruses, we also compared prioritization scores of known zoonotic and non-zoonotic viruses using the Kolmogorov-Smirnov test.

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.

Source: Ecology - nature.com

Predicting the potential for zoonotic transmission and host associations for novel viruses

Data collection

Development of ({{{{{{boldsymbol{G}}}}}}}_{{{{{{boldsymbol{c}}}}}}})

Network topology feature selection

Cross-validation and fitting generalized boosting machine (GBMs) models

Missing links for novel viruses, binary and multiclass prediction

Prioritization score for novel viruses

Reporting summary

Influence of suspended inorganic particles (kaolinite) on eggs and larvae of the pelagic shrimp Lucensosergia lucens

Stranded assets could exact steep costs on fossil energy producers and investors

ITALIAN LANGUAGE

ENGLISH LANGUAGE