Viruses and mammalian data
Viral genomic data
Complete sequences of coronaviruses were downloaded from GenBank44. Sequences labelled with the terms 'vaccine', 'construct', 'vector' or 'recombinant' were removed from the analyses. In addition, we removed those associated with experimental infections where possible. This resulted in a total of 3264 sequences for 411 coronavirus species or strains (i.e., viruses below species level on the NCBI taxonomy tree). Of those, 88 were sequences of coronavirus species, and 307 were sequences of strains (in 25 coronavirus species, with the total number of species included = 92). Of our included species, six in total were unclassified Coronavirinae (unclassified coronaviruses).
Selection of potential mammalian hosts of coronaviruses
We processed the metadata accompanying all sequences (including partial sequences but excluding vaccination and experimental infections) of coronaviruses uploaded to GenBank to extract information on the hosts (to species level) of these coronaviruses. We supplemented these data with species-level hosts of coronaviruses extracted from scientific publications via the ENHanCEd Infectious Diseases Database (EID2)45. This resulted in the identification of 313 known terrestrial mammalian hosts of coronaviruses (regardless of whether a complete genome was available; n = 185 mammalian species for which an association with a coronavirus with a complete genome was identified). We expanded this set of potential hosts by including terrestrial mammalian species in genera containing at least one known host of a coronavirus, and which are known to host one or more other virus species (excluding coronaviruses; information on whether a host is associated with a virus was obtained from EID2). This resulted in a total of 876 mammalian species being selected.
Quantification of viral similarities
We computed three types of similarities between each pair of viral genomes, as summarised below.
Biases and codon usage
We calculated the proportion of each nucleotide in the total coding sequence length. We computed dinucleotide and codon biases46 and codon-pair bias, measured as the codon-pair score46,47, in each of the above sequences. This enabled us to produce the following feature vectors for each genome sequence (n = 3264): nucleotide bias, dinucleotide bias, codon bias and codon-pair bias.
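To illustrate these calculations, a minimal Python sketch of the nucleotide, dinucleotide and codon bias vectors might look as follows (the study itself used R; the function name `composition_biases` is ours, and the codon-pair score is omitted for brevity):

```python
from collections import Counter
from itertools import product

def composition_biases(cds: str):
    """Return nucleotide, dinucleotide and codon proportion vectors for a
    coding sequence. A simplified sketch; dinucleotide bias is expressed as
    an observed/expected ratio under base independence."""
    cds = cds.upper()
    n = len(cds)
    # Nucleotide bias: proportion of each base in the coding sequence.
    nt = Counter(cds)
    nt_bias = [nt[b] / n for b in "ACGT"]
    # Dinucleotide bias: observed/expected ratio for each of the 16 pairs.
    di = Counter(cds[i:i + 2] for i in range(n - 1))
    di_bias = []
    for a, b in product("ACGT", repeat=2):
        expected = (nt[a] / n) * (nt[b] / n)
        observed = di[a + b] / (n - 1)
        di_bias.append(observed / expected if expected else 0.0)
    # Codon bias: proportion of each in-frame codon (64 entries).
    codons = Counter(cds[i:i + 3] for i in range(0, n - n % 3, 3))
    total_codons = sum(codons.values())
    codon_bias = [codons["".join(c)] / total_codons
                  for c in product("ACGT", repeat=3)]
    return nt_bias, di_bias, codon_bias
```

Each vector can then be compared across genomes with the cosine similarity described below.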
Secondary structure
Following alignment of sequences (using the AlignSeqs function in the R package Decipher48), we predicted the secondary structure of each sequence using the PredictHEC function in the same package48. We obtained both the states (final prediction) and the probabilities of secondary structures for each sequence. We then computed, for each 1% of the genome length, both the coverage (number of times a structure was predicted) and the mean probability of the structure (in the per cent of the genome considered). This enabled us to generate six vectors (length = 100) for each genome, representing the mean probability and coverage of each of the three possible structures: Helix (H), Beta-Sheet (E) or Coil (C).
Genome dissimilarity (distance)
We calculated pairwise dissimilarity (in effect a Hamming distance) between each pair of sequences in our set using the DistanceMatrix function in the R package Decipher48. We set this function to penalise gap-to-gap and gap-to-letter mismatches.
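The dissimilarity can be illustrated with a simple Python sketch (the study used the R function DistanceMatrix; `pairwise_dissimilarity` is a hypothetical helper operating on already-aligned sequences):

```python
def pairwise_dissimilarity(a: str, b: str) -> float:
    """Fraction of mismatched positions between two aligned sequences,
    counting gap-to-gap and gap-to-letter positions as mismatches
    (mirroring the gap-penalty settings described in the text)."""
    assert len(a) == len(b), "sequences must come from the same alignment"
    mismatches = sum(1 for x, y in zip(a, b)
                     if x != y or x == "-")  # penalise '-' even when paired
    return mismatches / len(a)
```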
Similarity quantification
We transformed the feature (trait) vectors described above into similarity matrices between coronaviruses (species or strains). This was achieved by computing the cosine similarity between these vectors in each category (e.g., codon-pair usage, H coverage, E probability). Formally, for each genomic feature (n = 10) represented by a vector as described above, this similarity was calculated as follows:
$$\mathrm{sim}_{\mathrm{genomic}_l}\left( s_m, s_n \right) = \mathrm{sim}_{\mathrm{genomic}_l}\left( \mathbf{V}_m^{f_l}, \mathbf{V}_n^{f_l} \right) = \frac{\sum_{i=1}^{d} \left( \mathbf{V}_m^{f_l}[i] \times \mathbf{V}_n^{f_l}[i] \right)}{\sqrt{\sum_{i=1}^{d} \mathbf{V}_m^{f_l}[i]^2} \times \sqrt{\sum_{i=1}^{d} \mathbf{V}_n^{f_l}[i]^2}}$$
(1)
where $s_m$ and $s_n$ are two sequences represented by the feature vectors $\mathbf{V}_m^{f_l}$ and $\mathbf{V}_n^{f_l}$ from the genomic feature space $f_l$ (e.g., codon-pair bias) of dimension $d$ (e.g., d = 100 for H coverage).
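A direct Python translation of Eq. (1), for illustration (the function name is ours):

```python
from math import sqrt

def cosine_similarity(v, w):
    """Cosine similarity between two genomic feature vectors, as in Eq. (1):
    dot product divided by the product of the vector norms."""
    dot = sum(x * y for x, y in zip(v, w))
    norm_v = sqrt(sum(x * x for x in v))
    norm_w = sqrt(sum(y * y for y in w))
    return dot / (norm_v * norm_w)
```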
We then calculated the similarity between each pair of virus strains or species (in each category) as the mean of the similarities between the genomic sequences of the two virus strains or species (e.g., the mean nucleotide bias similarity between all sequences of SARS-CoV-2 and all sequences of MERS-CoV gave the final nucleotide bias similarity between SARS-CoV-2 and MERS-CoV). This enabled us to generate 11 genomic feature similarity matrices (the above 10 vector-based features plus the genomic similarity matrix) between our input coronaviruses. Supplementary Fig. 1 illustrates the process.
Similarity network fusion (SNF)
We applied SNF49 to integrate the following similarities in order to reduce our viral genomic feature space: (1) the nucleotide, dinucleotide, codon and codon-pair usage biases were combined into one similarity matrix—genome bias similarity; and (2) the Helix (H), Beta-Sheet (E) and Coil (C) mean probability and coverage similarities (six in total) were combined into one similarity matrix—secondary structure similarity.
SNF applies an iterative nonlinear method that updates each similarity matrix according to the other matrices via a nearest-neighbour approach; it is scalable and robust to noise and data heterogeneity. The integrated matrix captures both shared and complementary information from the multiple similarities.
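The following heavily simplified Python sketch conveys the core SNF idea of diffusing each similarity matrix through its own nearest-neighbour structure towards the other matrix. It is a toy illustration under our own simplifications (no symmetrisation or convergence checks), not the published algorithm, which the authors applied via SNF49:

```python
def row_normalise(w):
    """Scale each row of a similarity matrix to sum to 1."""
    out = []
    for row in w:
        s = sum(row)
        out.append([x / s for x in row] if s else row[:])
    return out

def knn_mask(w, k):
    """Keep only each row's k largest entries (local affinity), renormalised."""
    out = []
    for row in w:
        top = sorted(range(len(row)), key=lambda j: row[j], reverse=True)[:k]
        kept = [row[j] if j in top else 0.0 for j in range(len(row))]
        s = sum(kept)
        out.append([x / s for x in kept] if s else kept)
    return out

def matmul(a, b):
    n = len(a)
    return [[sum(a[i][t] * b[t][j] for t in range(n)) for j in range(n)]
            for i in range(n)]

def transpose(a):
    return [list(r) for r in zip(*a)]

def snf_sketch(w1, w2, k=2, iters=5):
    """Toy two-matrix fusion: each matrix is repeatedly updated through its
    own kNN graph using the other matrix (P_v <- S_v P_other S_v^T),
    then the two are averaged into a single fused similarity matrix."""
    p1, p2 = row_normalise(w1), row_normalise(w2)
    s1, s2 = knn_mask(w1, k), knn_mask(w2, k)
    for _ in range(iters):
        p1, p2 = (matmul(matmul(s1, p2), transpose(s1)),
                  matmul(matmul(s2, p1), transpose(s2)))
    n = len(w1)
    return [[(p1[i][j] + p2[i][j]) / 2 for j in range(n)] for i in range(n)]
```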
Quantification of mammalian similarities
We calculated a comprehensive set of mammalian similarities. Table 3 summarises these similarities and provides justification for inclusion. Supplementary Note 1 provides full details.
Quantification of network similarities
Network construction
We processed the metadata accompanying all sequences (including partial genomes but excluding vaccination and experimental infections) of coronaviruses uploaded to GenBank44 (accessed 4 May 2020) to extract information on the hosts (to species level) of these coronaviruses. We supplemented these data with virus–host associations extracted from publications via the EID2 database45. This resulted in 1669 associations between 1108 coronaviruses and 545 hosts (including non-mammalian hosts). We transformed these associations into a bipartite network linking species and strains of coronaviruses with their hosts.
Quantification of topological features
The network constructed above summarises our knowledge to date of the associations between coronaviruses and their hosts, and its topology expresses the patterns of sharing of these viruses between various hosts and host groups. Our analytical pipeline captures this topology, and the relations between the nodes in our network, by means of node embeddings. This approach encodes each node (here either a coronavirus or a host) with its own vector representation in a continuous vector space, which, in turn, enables us to calculate similarities between two nodes based on this representation.
We adopted DeepWalk23 to compute vectorised representations of our coronaviruses and hosts from the network connecting them. DeepWalk uses truncated random walks to capture the latent topological information of the network and obtains the vector representation of its nodes (in our case, coronaviruses and their hosts) by maximising the probability of reaching the next node (i.e., the probability of a virus–host association) given the previous nodes in these walks (Supplementary Note 2 lists further details).
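The walk-generation step can be sketched in Python as follows (the skip-gram embedding step that DeepWalk trains on these walks is omitted; `truncated_random_walks` and its parameters are illustrative):

```python
import random

def truncated_random_walks(adjacency, num_walks=10, walk_length=8, seed=42):
    """Generate the truncated random walks DeepWalk feeds to a skip-gram
    model. `adjacency` maps each node (virus or host) to its neighbours
    in the bipartite virus-host network, so consecutive nodes in a walk
    alternate between viruses and hosts."""
    rng = random.Random(seed)
    walks = []
    for _ in range(num_walks):
        for start in adjacency:
            walk = [start]
            while len(walk) < walk_length:
                neighbours = adjacency[walk[-1]]
                if not neighbours:
                    break  # truncate at isolated nodes
                walk.append(rng.choice(neighbours))
            walks.append(walk)
    return walks
```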
Similarity calculations
Following the application of DeepWalk to compute the latent topological representation of our nodes, we calculated the similarity between two nodes in our network, $n$ (vectorised as $\mathbf{N}$) and $m$ (vectorised as $\mathbf{M}$), using cosine similarity as follows24,25:
$$\mathrm{sim}_{\mathrm{network}}\left( n, m \right) = \mathrm{sim}_{\mathrm{network}}\left( \mathbf{M}, \mathbf{N} \right) = \frac{\sum_{i=1}^{d} \left( m_i \times n_i \right)}{\sqrt{\sum_{i=1}^{d} m_i^2} \times \sqrt{\sum_{i=1}^{d} n_i^2}}$$
(2)
where $d$ is the dimension of the vectorised representations $\mathbf{M}$ and $\mathbf{N}$ of our nodes, and $m_i$ and $n_i$ are the components of the vectors $\mathbf{M}$ and $\mathbf{N}$, respectively.
Similarity learning meta-ensemble—a multi-perspective approach
Our analytical pipeline stacks 12 similarity learners into testable meta-ensembles. The constituent learners can be categorised by the following three ‘points of view’ (see also Supplementary Fig. 4 for a visual description):
Coronaviruses—the virus point of view
We assembled three models derived from (a) genome similarity, (b) genome biases and (c) genome secondary structure. Each of these learners gave each coronavirus–mammalian association ($v_i \to m_j$) a score, termed confidence, based on how similar the coronavirus $v_i$ is to known coronaviruses of the mammalian species $m_j$, compared to how similar $v_i$ is to all included coronaviruses. In other words, if $v_i$ is more similar (e.g., based on genome secondary structure) to coronaviruses observed in host $m_j$ than it is to all coronaviruses (both observed in $m_j$ and not), then the association $v_i \to m_j$ is given a higher confidence score. Conversely, if $v_i$ is somewhat similar to coronaviruses observed in $m_j$, and also somewhat similar to viruses not known to infect this particular mammal, then the association $v_i \to m_j$ is given a medium confidence score. The association $v_i \to m_j$ is given a lower confidence score if $v_i$ is more similar to coronaviruses not known to infect $m_j$ than to coronaviruses observed in this host.
Formally, we are given an adjacency matrix $A$ of dimensions $|\mathbf{V}| \times |\mathbf{M}|$, where $|\mathbf{V}|$ is the number of coronaviruses included in this study (for which a complete genome could be found) and $|\mathbf{M}|$ is the number of included mammals, such that for each $v_i \in \mathbf{V}$ and $m_j \in \mathbf{M}$, $a_{ij} = 1$ if an association is known to exist between the virus and the mammal, and 0 otherwise. Then, for a similarity matrix $\mathrm{sim}_{\mathrm{viral}}$ corresponding to each of the similarity matrices calculated above, a learner from the viral point of view is defined as follows24,25:
$$\mathrm{confidence}_{\mathrm{viral}}\left( v_i \to m_j \right) = \frac{\sum_{l=1,\, l \ne i}^{|\mathbf{V}|} \left( \mathrm{sim}_{\mathrm{viral}}\left( v_i, v_l \right) \times a_{lj} \right)}{\sum_{l=1,\, l \ne i}^{|\mathbf{V}|} \mathrm{sim}_{\mathrm{viral}}\left( v_i, v_l \right)}$$
(3)
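Eq. (3) amounts to a similarity-weighted vote over all other viruses; a direct Python sketch (names ours):

```python
def viral_confidence(i, j, sim, a):
    """Confidence that virus i associates with mammal j, as in Eq. (3):
    similarities from virus i to every other virus l, weighted by whether
    l is a known host association of mammal j (a[l][j] = 1), and
    normalised by the total similarity of i to all other viruses."""
    num = sum(sim[i][l] * a[l][j] for l in range(len(a)) if l != i)
    den = sum(sim[i][l] for l in range(len(a)) if l != i)
    return num / den if den else 0.0
```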
Mammals—the host point of view
We constructed seven learners from the similarities summarised in Table 3. Each of these learners calculated, for every coronavirus–mammalian association ($v_i \to m_j$), a confidence score based on how similar the mammalian species $m_j$ is to known hosts of the coronavirus $v_i$, compared to how similar $m_j$ is to mammals not associated with $v_i$. For instance, if $m_j$ is phylogenetically close to known hosts of $v_i$, and also phylogenetically distant from mammalian species not known to be associated with this coronavirus, then the phylogenetic similarity learner will assign $v_i \to m_j$ a higher confidence score. However, if $m_j$ does not overlap geographically with known hosts of $v_i$, then the geographical overlap learner will assign it a low (in effect 0) confidence score.
Formally, given the above-defined adjacency matrix $A$, and a similarity matrix $\mathrm{sim}_{\mathrm{mammalian}}$ corresponding to each of the similarity matrices summarised in Table 3, a learner from the mammalian point of view is defined as follows24,25:
$$\mathrm{confidence}_{\mathrm{mammalian}}\left( v_i \to m_j \right) = \frac{\sum_{l=1,\, l \ne j}^{|\mathbf{M}|} \left( \mathrm{sim}_{\mathrm{mammalian}}\left( m_j, m_l \right) \times a_{il} \right)}{\sum_{l=1,\, l \ne j}^{|\mathbf{M}|} \mathrm{sim}_{\mathrm{mammalian}}\left( m_j, m_l \right)}$$
(4)
Network—the network point of view
We integrated two learners based on network similarities—one for mammals and one for coronaviruses. Formally, given the adjacency matrix $A$, our two learners from the network point of view are defined as follows24:
$$\mathrm{confidence}_{\mathrm{network}_{\mathbf{V}}}\left( v_i \to m_j \right) = \frac{\sum_{l=1,\, l \ne i}^{|\mathbf{V}|} \left( \mathrm{sim}_{\mathrm{network}}\left( v_i, v_l \right) \times a_{lj} \right)}{\sum_{l=1,\, l \ne i}^{|\mathbf{V}|} \mathrm{sim}_{\mathrm{network}}\left( v_i, v_l \right)}$$
(5)
$$\mathrm{confidence}_{\mathrm{network}_{\mathbf{M}}}\left( v_i \to m_j \right) = \frac{\sum_{l=1,\, l \ne j}^{|\mathbf{M}|} \left( \mathrm{sim}_{\mathrm{network}}\left( m_j, m_l \right) \times a_{il} \right)}{\sum_{l=1,\, l \ne j}^{|\mathbf{M}|} \mathrm{sim}_{\mathrm{network}}\left( m_j, m_l \right)}$$
(6)
Ensemble construction
We combined the learners described above by stacking them into ensembles (meta-ensembles) using Stochastic Gradient Boosting (GBM). The purpose of this combination is to incorporate the three points of view, as well as varied aspects of the coronaviruses and their potential mammalian hosts, into a generalisable, robust model50. We selected GBM as our stacking algorithm following an assessment of seven machine-learning algorithms using held-out test sets (20% of known associations randomly selected, N = 5—Supplementary Fig. 14). In addition, GBM is known for its ability to handle non-linearity and high-order interactions between constituent learners51, and has been used to predict reservoirs of viruses46 and zoonotic hot-spots51. We performed the training and optimisation (tuning) of these ensembles using the caret R package52.
Sampling
Our GBM ensembles comprised 100 replicate models. Each model was trained with balanced random samples using tenfold cross-validation (Supplementary Fig. 4). Final ensembles were generated by taking the mean predictions (probabilities) of the constituent models. Predictions were calculated from the mean probability at three cut-offs: >0.5 (standard), >0.75 and ≥0.9821. The SD of the mean probability was also generated, and its values were subtracted from/added to the predictions to illustrate the variation in the underlying replicate models.
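The aggregation step can be sketched as follows (illustrative Python; `ensemble_prediction` is our name, and the strict/non-strict inequalities mirror the cut-offs stated above):

```python
from statistics import mean, stdev

def ensemble_prediction(replicate_probs, cutoffs=(0.5, 0.75, 0.9821)):
    """Aggregate the per-association probabilities of the replicate models:
    mean probability, its SD, and a binary call at each cut-off
    (> for 0.5 and 0.75, >= for the upper cut-off)."""
    p = mean(replicate_probs)
    sd = stdev(replicate_probs)
    calls = {c: (p >= c if c == cutoffs[-1] else p > c) for c in cutoffs}
    return p, sd, calls
```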
Validation and performance estimation
We validated the performance of our analytical pipeline externally against 20 held-out test sets. Each test set was generated by splitting the set of observed associations between coronaviruses and their hosts into two random sets: a training set comprising 85% of all known associations and a test set comprising the remaining 15%. These test sets were held out throughout the processes of generating similarity matrices, similarity learning and assembling our learners, and were only used to estimate the performance metrics of our analytical pipeline. This resulted in 20 runs in which our ensemble learnt using only 85% of the observed associations between our coronaviruses and their mammalian hosts. For each run, we calculated three performance metrics based on the mean probability across each set of 100 replicate models of the GBM meta-ensembles: AUC, true skill statistic (TSS) and F-score.
AUC is a threshold-independent measure of model predictive performance that is commonly used as a validation metric for host–pathogen predictive models21,46. The use of AUC has been criticised for its insensitivity to absolute predicted probability and its inclusion of a priori untenable predictions51,53, so we also calculated the TSS (TSS = sensitivity + specificity − 1)54. The F-score captures the harmonic mean of precision and recall and is often used with uneven class distributions. Our approach is relaxed with respect to false positives (unobserved associations), hence the low F-score recorded overall.
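Both additional metrics are simple functions of the confusion matrix; an illustrative Python sketch:

```python
def tss_and_f_score(tp, fp, tn, fn):
    """True skill statistic (sensitivity + specificity - 1) and F-score
    (harmonic mean of precision and recall) from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    tss = sensitivity + specificity - 1
    precision = tp / (tp + fp)
    f_score = 2 * precision * sensitivity / (precision + sensitivity)
    return tss, f_score
```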
We selected three probability cut-offs for our meta-ensemble: 0.50, 0.75 and 0.9821. One extreme of our cut-off range (0.5) maximises the ability of our ensemble to detect known associations (higher AUC, lower F-score). The other (0.9821) is calculated so that 90% of known positives are captured by our ensemble, while reducing the number of additional associations predicted (higher F-score, lower AUC).
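A cut-off of this kind can be derived from the predicted probabilities of the known positives alone; a sketch of the idea in Python (the function name and interface are ours):

```python
import math

def cutoff_for_recall(positive_probs, target_recall=0.90):
    """Largest probability cut-off at which at least `target_recall` of the
    known positives are still predicted: rank positives by predicted
    probability and return the probability of the one at the target rank."""
    ranked = sorted(positive_probs, reverse=True)
    k = math.ceil(target_recall * len(ranked))
    return ranked[k - 1]
```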
Changes in network structure
We quantified the diversity of the mammalian hosts of each coronavirus in our input by computing the mean phylogenetic distance between these hosts. Similarly, we captured the diversity of the coronaviruses associated with each mammalian species by calculating the mean (Hamming) distance between the genomes of these coronaviruses. We termed these two metrics mammalian diversity per virus and viral diversity per mammal, respectively. We aggregated both metrics at the network level by means of a simple average. This enabled us to quantify changes in these diversity metrics, at the network level, with the addition of predicted links at three probability cut-offs: >0.5, >0.75 and ≥0.9821.
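Both diversity metrics are instances of a mean pairwise distance; an illustrative Python sketch with a generic distance function:

```python
from itertools import combinations

def mean_pairwise_distance(items, dist):
    """Mean distance over all unordered pairs of items. With hosts and a
    phylogenetic distance this gives 'mammalian diversity per virus'; with
    genomes and a Hamming distance, 'viral diversity per mammal'."""
    pairs = list(combinations(items, 2))
    if not pairs:
        return 0.0
    return sum(dist(a, b) for a, b in pairs) / len(pairs)
```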
In addition, we captured changes in the structure of the bipartite network linking CoVs with their mammalian hosts, with the addition of predicted associations, by computing a comprehensive set of structural properties (Supplementary Note 3) at the probability cut-offs mentioned above and comparing the results with our original network. Here we ignore properties that deterministically change with the addition of links (e.g., degree centrality, connectance; Supplementary Table 2 lists all computed metrics and the changes in their values). Instead, we focus on non-trivial structural properties. Specifically, we capture changes in network stability by measuring its nestedness55,56,57, and we quantify non-independence in interaction patterns by means of the C-score58. Supplementary Note 3 provides full definitions of these concepts as well as of the other metrics we computed for our networks.
Reporting summary
Further information on research design is available in the Nature Research Reporting Summary linked to this article.