SARS-CoV-2 genome data sets and associated travel history
To focus on the early stage of COVID-19 spread, we analyzed SARS-CoV-2 genome sequences and metadata available in GISAID on March 10th8. We curated a data set of 305 genomes by removing error-prone sequences, keeping only genomes with appropriate metadata, and a single genome from patients with multiple genomes available. We assigned each genome a global lineage designation based on the nomenclature scheme outlined in Rambaut et al.28 using pangolin v1.1.14 (https://github.com/hCoV-2019/pangolin), lineages data release 2020-05-19 (https://github.com/hCoV-2019/lineages). We aligned the remaining genomes using MAFFT v.7.45329 and partially trimmed the 5′ and 3′ ends. All sequences were associated with exact sampling dates in their meta-information, except for one genome from Anhui with known month of sampling. Upon visualizing root-to-tip divergence as a function of sampling time, using TempEst v.1.5.330 based on an ML tree inferred with IQ-TREE v.2.0-rc131, we removed one potential outlier. The root-to-tip plots without the outlier are shown in Supplementary Fig. S3. We formally tested for temporal signal using BETS32. The final 282 genomes were sampled from 28 different countries, with Chinese samples originating from 13 provinces, one municipality (Beijing), and one special administrative area (Hong Kong), which we considered as separate locations in our (discrete) phylogeographic analyses. Phylogenetic signal in the data set was explored through likelihood mapping analysis33 (Supplementary Fig. S4).
We searched for travel history data associated with the genomes in the GISAID records, media reports, and publications and retrieved recent travel locations for 64 genomes (22.5%, Supplementary Table 2): 43 traveled/returned from Hubei (Wuhan), 1 from Beijing, 3 from China without further detail (which we associated with an appropriate ambiguity code in our phylogeographic analysis that represents all sampled Chinese locations), 2 from Singapore, 1 from Southeast Asia (which we also associated with an ambiguity code that represents all sampled Southeast Asian locations), 7 from Italy, and 7 from Iran. In this data set, Italy is better represented by recent travel locations than actual samples (n = 4) and Iran is exclusively represented by travelers returning from this country. For 46 out of the 64 genomes, we retrieved the date of travel, which represents the most recent time point at which the ancestral lineage circulated in the travel location.
In order to examine (i) to what extent our reconstructions could be updated by the genome data that has become available retrospectively for the same locations and the same time period before March 10 ~4 months after this date, and (ii) how sampling bias can be mitigated by downsampling from the larger collection of available genomes, we assembled an additional data set of 500 genomes. For this purpose, SARS-CoV-2 genomes were downloaded from GISAID on June 23, 2020 and processed according to the COG-UK pre-analysis pipeline (https://github.com/COG-UK/grapevine). Briefly, sequences were aligned to the reference sequence Wuhan-Hu-1 (Genbank accession number NC_045512) using Minimap2 v.2.1734. Problematic sites were masked (https://virological.org/t/issues-with-sars-cov-2-sequencing-data/473), and sequences with <95% coverage or an overabundance of mutations were removed. Due to the availability of a relatively large amount of genomes from Shanghai and its importance in international air travel, we considered genomes for this location instead of Fujian to maintain the same phylogeographic dimensionality (44 locations) as in the 282 genome data set. We included the same 64 genomes with travel history used in the 282 genome data set, and for the remaining 436 genomes, we performed a subsampling relative to time and geographic location from the sequences sampled before Mach 10, 2020. To maximize the temporal signal with minimal geographic bias, the genomes were selected such that the 500 sequences were distributed as evenly as possible in each epi-week for which samples were available. Within each week, sequences were sampled proportionally to the cumulative number of cases in that location on March 10. Despite this sampling procedure, the resulting number of genomes by location as a function of case counts still indicates sampling biases (Supplementary Fig. S5c). The root-to-tip divergence plot and likelihood mapping plot are shown in Supplementary Figs. S3 and S4 respectively.
Incorporating travel history in Bayesian phylogeographic inference
Discrete trait phylogeographic inference attempts to reconstruct an ancestral location-transition history along a phylogeny based on the discrete states associated with the sampled sequences. In our Bayesian approach, the phylogeny is treated as random which is critical to accommodate estimation uncertainty, when confronted with sparse sequence information. Here, we aim to augment these location-transition history reconstructions on random trees with travel history information obtained from (returning) travelers. When such information is available, the tip location state for a sequence can either be set to the location of sampling, as is done in the absence of such information, or the location from which the individual traveled (assuming that this was the location from which the infection was acquired). Neither of these options is satisfactory: using the location of sampling ignores important information about the ancestral location of the sequence, whereas using the travel location together with the collection date represents a data mismatch, and ignores the final transitions to the location of sampling. These events are particularly important when the infected traveler then produces a productive transmission chain in the sampling location.
Incorporating information about ancestral locations cannot be achieved simply through the parameterization of the discrete diffusion model, which follows a CTMC process determined by relative transition rates between all pairs of locations that applies homogeneously (or time inhomogeneously26) along the phylogeny. Instead, we need to shape the realization of this process according to the travel histories by augmenting the phylogeny with ancestral nodes that are associated with a location state (but not with a known sequence), and hence enforce that ancestral location at a particular, possibly random point in the past of a lineage. Depending on the time at which the ancestral node lies, it may fall on a terminal branch leading to the tip associated with travel history, or before nodes representing common ancestors with other taxa. Further, the location state associated with the ancestral node can also be ambiguous, allowing equal or weighted probability to be assigned to multiple possible locations35. We illustrate this procedure for an empirical example that includes 9 SARS-CoV-2 genomes in Fig. 7.
a The concept of introducing ancestral nodes associated with locations from which travelers returned. The ancestral nodes are indicated by arrows for five cases relating them to the genomes sampled from the travelers. The ancestral nodes are introduced at different times in the ancestral path of each sampled genome. b–d The results from analyses using sampling location and travel history, sampling location only, and travel origin location, respectively. The branch color reflects the modal state estimate at the child node. There is some topological variability, but only involving nodes that are poorly supported.
The empirical example includes two genomes from Hubei, four from Australia, and three from Italy. Travel history is available for five genomes (one sampled in Italy and four in Australia), and Fig. 7a demonstrates how this information is incorporated. When a sampled traveler returned from location i to location j, we denote time Ti→j as the time when the traveler started the return journey to j. At this time point in the ancestral path of the tip (indicated with arrows for the five relevant tips), we introduce an ancestral node and associate it with location i, in order to inform the reconstruction that at this point in time the lineage was in location i. The upper arrow represents the information introduced for the traveler that returned from Hubei to Italy. The same procedure is applied to the four genomes from travelers returning to Australian from Hubei, Iran, Southeast Asia, and Hubei again (from top to bottom in Fig. 7a).
In subsequent panels, we compare a travel-aware reconstruction (Fig. 7b) to a reconstruction using the standard sampling location (Fig. 7c), and a reconstruction using the location of origin for the travelers (Fig. 7d). Using the location of sampling (Fig. 7c) results in an unrealistic Australian ancestry and two transitions from Australia to Italy, likely because Australia is represented by the largest number of tips. Using the location of travel origin, Fig. 7d results in a reconstruction that better matches the travel-aware reconstruction in terms of inferring an ancestry in Hubei, but misses transitions along four tip branches and differs from the reconstruction including travel history for the upper two Italian genomes. Specifically, it implies a transition from Hubei for the Italian patient that does not have travel history.
We note that Ti→j can be treated as a random variable in case the time of traveling to the sampling location is not known (with sufficient precision). We make use of this ability for the genomes associated with a travel location, but without a clear travel time. In our Bayesian inference, we specify normal prior distributions over Ti→j informed by an estimate of time of infection and truncated to be positive (back-in-time) relative to sampling date. Specifically, we use a mean of 10 days before sampling based on a mean incubation time of 5 days36, and a constant ascertainment period of 5 days between symptom onset and testing37, and a standard deviation of 3 days to incorporate the uncertainty on the incubation time. Finally, we indicate that not only information about the sampled traveler can be incorporated, but also about prior transmission history. We apply this for two cases in our data set. One of the genomes was sampled from a German patient, who was infected after contact with someone who came from Shanghai. The person traveling from Shanghai was assumed to be infected after being visited by her parents from Wuhan a few days before she left. In this case, we incorporate Wuhan (Hubei) as an ancestral location with an associated time that accounts for the travel time from Shanghai with a number of additional days and associated uncertainty. Another genome was obtained from a French person, who had been in contact with a person who is believed to have contracted the virus at a conference in Singapore38. In this case, we incorporate Singapore as an ancestral location with a known travel time (Supplementary Table 3).
Incorporating unsampled diversity in Bayesian phylogeographic inference
To investigate how unsampled diversity may impact phylogeographic reconstructions, we include in our Bayesian inference of the 282 genome data set taxa that are associated with a location, but not with observed sequence data. We identify undersampled locations by considering the ratio of available genomes to the cumulative number of cases for each location (obtained from Our World in Data, https://ourworldindata.org/coronavirus-source-data). To keep all available data, we opt not to downsample genomes, but to add a number of unsampled taxa to specific locations in order to achieve a minimal ratio of taxa (sampled and unsampled) to cumulative number of cases. In Supplementary Fig. S5a, we plot the number of available genomes against the number of cases on March 10th, 2020 on a log–log scale. In our case, we set the minimal ratio arbitrarily to 0.005. Although higher ratios may be preferred, this comes at the expense of adding larger numbers of unsampled taxa, and hence computationally more expensive Bayesian analyses. Our choice for the minimal ratio requires adding 458 taxa for 14 locations (colored symbols in Supplementary Fig. S5b), so ~1.6 times the number of available genomes. Most of the unsampled taxa are assigned to Hubei (n = 307), followed by Italy (n = 47), Iran (n = 40), and South Korea (n = 30). For comparison, we also plot the number of available genomes against the number of cases for the 500 genome data set in Supplementary Fig. S5c.
We integrate over all possible phylogenetic placements of such taxa, using standard Markov chain Monte Carlo (MCMC) transition kernels. In the absence of sequence data, time of sampling represents an important source of information for the analysis in addition to sampling location. Here, we use epidemiological data in order to estimate a probabilistic distribution for the sampling times of unsampled taxa. Specifically, we follow Grubaugh et al.5 in estimating the number of prevalent infectious individuals on day t (Pt), by multiplying the number of incident infections up to day t by the probability that an individual who became infectious on day i was still infectious on day t:
$$P_t = mathop {sum}limits_{i = 1}^{t – 1} {I_i(1 – gamma (t – i)) + I_t} ,$$
(1)
Where γ (t −i ) is the cumulative distribution function of the infectious period. We also follow Grubaugh et al.5 in modeling the infectious period as a gamma distribution with mean 7 days and standard deviation 4.5 days. Based on the estimated distributions of prevalent infections for the relevant locations over the time period of our analysis, we specify exponential or (truncated) normal prior distributions on the sampling times of unsampled taxa (Supplementary Fig. S6), and integrate over all possible times using MCMC in the full Bayesian analysis. Using a different, small empirical example, we illustrate the concept of including unsampled diversity in phylogeographic reconstruction and how it contributes to uncovering viral migration pathways (Supplementary Text S2).
Bayesian phylogeographic inference incorporating global mobility
We implement our approach to incorporate travel history in discrete phylogeographic inference in the BEAST framework (v.1.10.439). In this framework, we assume that discrete trait data X, in our case location data associated with both sampled and unsampled taxa, and aligned molecular sequence data Y arise according to CTMC processes on a random phylogeny F with the following model posterior distribution:
$$begin{array}{l}{mathrm{Pr(}}{mathbf{F}},{mathbf{Lambda}} {mathrm{,}}{mathbf{T}},{boldsymbol{phi}} {mathrm{|}}{mathbf{X}},{mathbf{Y}}{mathrm{)}} propto {mathrm{Pr(}}{mathbf{X}}|{mathbf{F}},{mathbf{T}},{mathbf{Lambda}} {mathrm{)Pr(}}{mathbf{Y}}|{mathbf{F}},{boldsymbol{phi}} {mathrm{)}} {mathrm{ Pr}}left( {mathbf{F}} right){mathrm{Pr(}}{mathbf{T}}{mathrm{)Pr(}}{mathbf{Lambda}} {mathrm{)Pr(}}{boldsymbol{phi}} {mathrm{),}}end{array}$$
(2)
where Pr(X|F,T,Λ) and Pr(Y|F,ϕ) represent the discrete trait likelihood and sequence likelihood, respectively, T is the additional time information for the ancestral nodes, Λ and ϕ characterize the discrete trait and molecular sequence CTMC parameterizations along F, respectively.
Likelihoods Pr(X|F,T,Λ) and Pr(Y|F,ϕ) are calculated using Felsenstein’s pruning algorithm40, a computation that is efficiently performed using the high-performance BEAGLE library41. We note that for the travel histories, the ancestral locations and times are introduced only for evaluating the discrete trait location likelihood Pr(X|F,T,Λ). The ancestral locations and times do not affect the sequence likelihood Pr(Y|F,ϕ), nor the likelihood of the coalescent model we use as our tree prior Pr(F).
For the sequence data, we use the HKY nucleotide substitution CTMC model42, with a proportion of invariant sites and gamma-distributed rate variation among sites43, a strict molecular clock model, and an exponential growth coalescent tree prior. The uncertainty in collection date for 1 genome was accommodated in the inference by integrating their age over the respective month of sampling. Our discrete location diffusion model involves 44 locations, represented by a limited number of sampled (and unsampled) taxa and ancestral nodes associated with travel locations. In order to avoid having to estimate a huge number of location-transition parameters in a high-dimensional CTMC, and to further inform the phylogenetic placement of unsampled taxa, we adopt a GLM formulation of the discrete trait CTMC that parametrizes the transition rates as a function of a number of potential covariates17. As covariates, we consider (i) air travel data, (ii) geographic distance, and (iii) an estimable asymmetry coefficient for Hubei to account for the fact that the early stage of COVID-19 spread was dominated by importations from Hubei (with underdetected cases of COVID-19 probably having spread in most locations around the world25). The air travel data consist of average daily symmetric fluxes between the 44 locations in January and February, 2013 (International Air Transport Association, http://www.iata.org). The geographic distance covariate only considers distances for pairs of locations in the same continent, which are based on centroid coordinates. We estimate the effect size of each of these covariates, as well as their inclusion probability (specifying a 0.5 prior inclusion probability for each covariate).
We approximate the posterior distribution of our full probabilistic model using MCMC sampling. We run sufficiently long chains to ensure adequate effective sample sizes for continuous parameters as diagnosed, using Tracer v.1.7.144. We summarize posterior tree distributions using MCC trees, and visualize them using FigTree v.1.4.4. However, due to the limitations of single-tree representations when facing extensive phylogenetic uncertainty, we also propose new summaries below. A tutorial explaining how to perform travel-aware phylogeographic analyses in BEAST can be found at http://beast.community/travel_history.
Posterior predictive accuracy assessment
We validate the approach of incorporating travel history data through a posterior predictive accuracy assessment. Specifically, we perform a tenfold cross validation that, in each fold, holds out a 10% random partition of the travel history information (the ancestral travel location for tip sequences with travel history data) and estimates the known, but not included, location at their respective times in the past (Ti→j, generally the travel return times). Across folds, we measure the prediction accuracy for the withheld ancestral travel locations (i) when including the remaining 90% of the travel history, and (ii) when excluding all travel history data, using the original BS for multistate predictions45, defined as follows:
$$BS = frac{1}{N}mathop {sum}limits_{i = 1}^N {mathop {sum}limits_{j = 1}^K {left( {p_{ij} – x_{ij}} right)^2} },$$
(3)
where N is the number of ancestral location instances we predict in our tenfold validation, K is the number of location states, pij is the posterior probability for location state j in ancestral location instance i, and xij is the outcome for location state j in ancestral location instance i (1 for the observed location state at the ancestral travel history node and 0 for all other location states). This score represents the mean squared error for the predictions and ranges between 0 for perfect accuracy and 2 for perfect inaccuracy, and is a proper scoring rule that incorporates both discrimination and calibration, arguably the two most important characteristics of prediction46.
Phylogeographic visualizations
Due to the relatively low sequence variability over the short time scale of spread, phylogenetic reconstructions of SARS-CoV-2 are inherently uncertain, which also complicates inferring and interpreting location-transition histories. If nodes in an MCC tree are associated with low posterior support, their conditional modal state annotation will be determined by a limited number of corresponding samples from the posterior tree distribution. The addition of unsampled taxa adds an additional challenge because the absence of sequence data makes them highly volatile in phylogenetic reconstruction, reducing posterior node support to impractically low values for many nodes.
In order to marginalize over phylogenetic clustering in our visualization of phylogeographic history, we generate Markov jump estimates of the transition histories that are averaged over the entire posterior in our Bayesian inference17,47. We study the ancestral transition history of specific taxa of interest by summarizing their Markov jump estimates as trajectories over time between a number of relevant states. A new BEAST tree sample tool (TaxaMarkovJumpHistoryAnalyzer available in the BEAST codebase at https://github.com/beast-dev/beast-mcmc) and associated R package constructs these estimates. The BEAST tutorial on incorporating travel history also includes information on how to use these tools (http://beast.community/travel_history). We also visualize posterior expected Markov jumps estimates between all locations using Sankey plots and circular migration flow plots. The latter have been successfully used to visualize migration data48, including phylogeographic estimates49. When summarizing these jumps from analyses that include unsampled diversity, we ignore branches that only have unsampled taxa as descendants. We only plot jumps that have a posterior probability larger than 0.90.
Reporting summary
Further information on research design is available in the Nature Research Reporting Summary linked to this article.
Source: Ecology - nature.com