Conceptualization of organismal to tumor biogeography through clone phylogenies
We first map organismal biogeographic concepts and models to the process of migration and colonization of cancer cells during metastasis. Tumors are populations consisting of a diversity of cancer cells with different genetic profiles that may represent different lineages in the clone phylogeny. We use the example in Fig. 1, which contains a phylogeny of 17 clones found in one primary tumor (P) and four metastases (M1–M4). Events occurring along a branch in a phylogeny are anagenetic events, which include diversification, extinction, and expansion12,14. In organismal evolutionary biology, anagenetic events are not directly observed except through the fossil record. However, one can map the collection of genetic variants that likely arose on individual lineage in a phylogeny. In many cancers, sequencing of temporally sampled biopsies’ can directly reveal anagenetic events similar to the sequencing of ancient DNA in paleogenomics.
The other types of evolutionary events in the phylogeny are cladogenetic, including genetic divergences and dispersals (Fig. 1). Genetic differences observed among species and populations are the key to detect cladogenetic events reconstructed in molecular phylogenies of living descendants. In cancer, temporal sampling of biopsies can reveal cladogenetic events that produced extinct descendants.
In biogeography, genetic divergence results in the diversification of lineages within an area. Sometimes, the term duplication is used, but we avoid its further use because of the confusion it may cause in evolutionary genomics. Divergence events are also observed in a clone phylogeny, particularly when clone lineages diverge from each other within a tumor or across tumors. The exact opposite of genetic diversification can also be observed when lineages partially or fully disappear from the phylogeny. Extinction can occur due to random chance, selection, or environmental pressures. Even though extinction is rarely discussed in tumor clone phylogenetics, it happens frequently.
Phylogenies also reveal movements of lineages between locations (geographic areas or body parts) when the locations of individual cells, species, or populations are known5,6,7,8,9,15. When lineages accumulate genetic differences along a branch in the phylogeny, and the evolved lineages migrate to a new area, we observe an expansion event. Expansions differ from dispersals in such that the growth of a population occurs in the same place. This movement of cells of a clone from one location to another, where they would potentially form a metastasis, results in the dispersal of these cells of that clone to additional areas, which is modeled by a dispersal rate (d) in organismal biogeography. When a clone genetically diverges following its migration, then a distant dispersal event is said to have occurred. Similarly, when a clone diverges from the rest of the clones within a tumor and disperses to another tumor, we have observed an expansion event. Thus, clone phylogenies can give insights into the origin and trajectory of cancer cells between tumors.
When a clone is no longer present at a location, it is extinct at that location. Extinctions are modeled by an extinction rate (e) in biogeographic models. As a result of extinction, the range of descendent clones on a phylogeny can be smaller than the ancestors. Biogeography models also have a parameter (J) to consider founder events that establish new populations from a few individuals. In phylogenies, founder events can be detected if only one or a few cells are found to have moved from one location to another to start diversifying in a new area. Both distant dispersal and founder events may result in forming a new colony of cells, i.e., a new metastasis in the case of cancer cell migrations. The primary distinction between dispersal and founder events is the relative number of migrating cells. Founder events are due to one or a few cells, whereas dispersal events involve a larger number of migrating cells. Founder events are expected to be more common in tumor evolution because metastases are thought to be formed by the spread of only one or a few cancer cells. These biogeographic events have been mathematically modeled and implemented in various approaches to infer species migration events12, which are directly applicable in the inference of cancer cell migrations between tumors.
Model fits
We began by analyzing the statistical fits of six biogeographic models (Table 1) to 80 computer-simulated tumor evolutionary datasets. Simulations enable us to assess the performance of computational approaches and reveal potential caveats associated with their use because the ground truth is known. These datasets were simulated using four main clone migration schemes defined by the different number of migrating clones (1–3), the small and large number of tumor areas (5–7 tumors, m5 datasets; 8–11 tumors, m8 datasets), and the different types of source areas of migration (primary or metastasis). The following seeding scenarios reflect this complexity of the clone migration schemes: monoclonal single-source seeding (mS), polyclonal single-source seeding (pS), polyclonal multisource seeding (pM), and polyclonal reseeding (pR) (see “Methods” section).
We considered biogeographic models that weigh genetic divergence, dispersal/expansion, and extinction events differently (Table 1). We also explored the provision of including founder events in our models on the accuracy of detecting clone migrations. The parameterization of the aforementioned events results in models with two free parameters, i.e., dispersal rate (d) and extinction rate (e), and models with three free parameters by adding the founder-event speciation (J); see “Methods” section for more details.
Overall, we tested six biogeographic models for their fit to the tumor data, three models with two free parameters and three others with three free parameters. BAYAREALIKE, DEC, and DIVALIKE models have two parameters each. They are nested within their respective models that add the founder effect, resulting in a model with three free parameters (hereinafter +J models). We used the BioGeoBEARS software for all model fit analyses. In data analysis, we first inferred phylogeny of cancer cell populations (clone phylogeny) using an existing method16, followed by the use of BioGeoBEARS to infer the clone migration history in which the clone phylogeny is used along with the location of tumor sites in which each clone is observed (Fig. 2). BioGeoBEARS estimates the probabilities of annotating internal nodes with tumor locations. These annotations are then used to derive cancer cell migration paths when two adjacent nodes are annotated with different tumor locations. In these analyses, we assumed the correct clone phylogeny because our focus was not assessing the impact of errors in a phylogeny on the accuracy of clone migration inferences. We also compared the accuracy of migration histories reconstructed using biogeographic models in BioGeoBEARS with those obtained from the approaches that do not model biogeographic processes (BBM9, MACHINA5, and PathFinder7).
Data analysis pipeline using BioGeoBEARS in R14 to infer clonal migration histories.
We first conducted Likelihood Ratio Tests (LRTs) to examine the improvement offered by considering founder events in modeling tumor migrations. In this case, the fit of the BAYAREALIKE, DEC, and DIVALIKE models was compared to their +J counterparts, respectively. The null hypothesis was rejected for more than 50% of the datasets (BAYAREALIKE: 71.25%, DEC: 60%, and DIVALIKE: 53.75%; P < 0.05), which means that the model with founder-event effects provided a better fit for a majority of datasets. Interestingly, results differed between BAYAREALIKE and other models (DEC and DIVALIKE) for smaller datasets. For DEC and DIVALIKE, the null hypothesis was rejected in only 20–40% of the datasets with low complexity compared to 60–90% of the ones with more complicated migration graphs (Table 2). For BAYAREALIKE, the null hypothesis was rejected for more than 50% of the datasets in almost all seeding scenarios (except for the m5pM, which was 20%; Table 2). This indicates that a more complex model fits the data better than a simpler model.
Second, we compared the fits of these three non-nested models by using the small-sample size corrected Akaike Information Criterion (AICc). AICc suggested that BAYAREALIKE + J and DEC + J models received the best AICc scores (26.25% and 22.5% of the datasets, respectively) (Table 2). We observed that DEC and DIVALIKE fitted better for simple seeding scenarios and the small number of tumors. In contrast, BAYAREALIKE had a more consistent performance across different complexities in seeding scenarios and different numbers of tumor sites. Estimation of Bayesian Information Criterion (BIC) resulted in similar patterns predicting that more complex models fit the data the best, except for DIVALIKE and DIVALIKE + J.
Our analyses show that models accommodating founder events fit tumor sequencing data better, especially when many tumors were sampled. Still, we observed that more sophisticated models (+J) did not fit better for datasets with a small number of tumor sites and in the presence of reseeding.
Estimation of ranges at ancestral nodes
In organismal biogeography, researchers often expect and explore the hypothesis of vicariance, i.e., the genetic divergence of a population due to geographic isolation typically caused by a physical barrier. Vicariance causes the split of an area into multiple ones, a type of biogeographic process that is not reasonable when examining cancer cells’ movements in a tumor site. Vicariance events are incorporated in the biogeographic models examined. This showed as inference of multiple ranges at ancestral nodes. To better understand this, suppose for example clones a and b are sampled in tumors A and B, respectively. In that case, there are three ancestral range possibilities: two are dispersals from areas A and/or B, and one is vicariance from area AB.
In our data analysis, five of the biogeographic models, except for BAYAREALIKE + J, predicted multiple ranges at ancestral nodes of the clone phylogeny instead of a single ancestor area (Fig. 3). DEC and DIVALIKE models predicted many multiple ranges (average of 14.74 and 15.53, respectively). These models prefer vicariance, so they end up inferring multiple ranges at the ancestral nodes to explain clone divergences and movements. However, vicariance is unlikely to be a major force in tumor clonal biogeography because of the lack of a physical medium in the same anatomical site that would not allow the dispersal of cancer cells within an area. Therefore, the use of DEC and DIVALIKE may result in erroneous inferences of clone migrations. In fact, this problem is remedied in analysis in which the founder effect is considered. DEC + J and DIVALIKE + J predict less than one multiple range (0.76 and 0.34, respectively; Fig. 3). The average number of multiple ranges was low for the BAYAREALIKE model (average of 1.12), and it became zero in BAYAREALIKE + J.
The average number of multiple ranges suggested at ancestral nodes differed among the two and three free parameter biogeographic models.
Accuracy of migration paths inferred with biogeographic models
We assessed the accuracy of clone migration inferences by using F1-scores (Fig. 4). The F1-score considers true positive (correct), false positive (wrong), and false negative (missing) inference of migration paths. In our evaluations, these represent correct, erroneous, and not-inferred clone migration paths.
Performance of the seven biogeographic models used for clone migration inferences as measured by F1-scores. Mean values are depicted above the box plots. Differences in F1-scores were examined through a t-test and are marked when significant.
Overall, the highest average F1-score was 0.82 for DIVALIKE, followed by DEC and BBM (0.81 and 0.79, respectively; Fig. 4). All other models produced lower accuracy (0.67–0.69). Although the differences in mean F1-scores between BBM and the best method (DIVALIKE) were statistically significant, only BBM produced perfect inferences (F1 equal to 1.0) for some datasets. The best-fitting model (BAYAREALIKE + J) performed significantly worse than some others (0.67). The +J versions of the three models performed significantly worse, with DEC and DIVALIKE producing more accurate migration paths than DEC + J and DIVALIKE + J. There was no difference between the performance of BAYAREALIKE and BAYAREALIKE + J. Therefore, the best-fitting model was identified due to over-parameterization. The use of +J will not improve the accuracy of the migration inferences in tumor data analysis. Furthermore, the method that does not use model biogeographic processes (BBM) was among the most accurate. Also, other methods without modeling of biogeographic processes (MACHINA5 and PathFinder7) were previously reported to perform similar or better than BBM for inferring migration events of cancer cells9.
To further dissect the accuracy differences, we examined the effect of the number of tumors (m5 and m8 datasets) and the complexity of the seeding scenario (mS, pS, pM, and pR) on the clone migration inferences (Fig. 5). F1-scores were slightly worse for datasets with higher than those with a small number of tumors (m5; 0.66–0.95, and m8; 0.66–0.89).
Performance of the seven biogeographic models in inferring clone migrations as measured by F1-scores. Mean values are shown above the box plots. Accuracies are shown based on the number of tumors within datasets (5–7 tumors for m5 and 8–11 tumors for m8) and on the complexity of migration schemes within datasets (monoclonal single- source seeding, mS; polyclonal single-source seeding, pS; polyclonal multisource seeding, pM; polyclonal reseeding, pR). Differences in values of F1-score on tumor count and seeding scenario were examined through t-test and are marked when significant (*, **, ***, ****).
F1-scores for most complex seeding scenarios (pM and pR; 0.70–0.72) were worse than single-source seeding (0.75–0.95) in the BBM model. The number of tumors seems to dictate the accuracy of clone migration inferences by BBM in the simplest data (mS and pS), but not in most complex ones (pM and pR). That is not the case for the rest of the models, as their performances remain poor but steady across complexities of metastatic migrations.
Next, we examined the numbers of correct, erroneous, and not-inferred migration paths depending on the source- and recipient-area of the migration for each biogeographic method used. In the graphs presented in Fig. 6, density plots show the number of the datasets (y-axis) in which the inferred number of migration paths were observed (x-axis). Four sets of plots are shown: (i) all migration paths; (ii) primary to metastasis paths, P → M; (iii) metastasis to metastasis paths, M → M; and (iv) metastasis to primary paths, M → P. Within each panel, three graphs are included to display the propensity of a given method to infer correct paths (top), erroneous paths (middle), and paths not-inferred (bottom).
Correct, erroneous, and not-inferred migration paths inferred using seven biogeographic models. Here, we show the results depending on the source-area of the migration: all migration paths, primary to metastasis (P → M), metastasis to metastasis (M → M), and metastasis to primary (M → P).
For example, in the graph showing all the migration paths for the BBM model, we observe that many correct migration paths were reconstructed for many datasets. Furthermore, there are only a few wrong paths inferred for a smaller number of datasets. A similar pattern was seen for true paths not-inferred (Fig. 6). For P → M migrations, many biogeographic models performed equally well, with a few notable trends. BBM showed a smaller tendency not to infer correct P → M paths. DEC and DIVALIKE reconstructed a smaller number of correct and incorrect paths compared to other methods, resulting in a larger number of datasets with many P → M paths remaining undetected. All biogeographic models failed to detect many M → M paths. BBM, DEC, and DIVALIKE produced fewer wrong M → M paths. Still, DEC and DIVALIKE inferred fewer correct migration paths as well. Interestingly, +J models produce more correct and erroneous M → P paths than their counterparts without the J parameter. However, BBM produced fewer wrong and not-inferred paths, making it the most accurate model.
Biogeographic analyses of breast cancer metastases
We inferred clone migration histories for two patients (A1 and A7) with basal-like breast cancer13 (Figs. 7 and 8, respectively). We begin with patient A1’s dataset consisting of nine clones from one primary tumor (breast) and four metastases (adrenal, lung, spinal, and liver). The phylogeny and tumor sampling locations are shown in Fig. 7a.
Analysis of Patient A1 with basal-like breast cancer13. (a) Clone phylogeny and tumor location of each clone reported in the original study. (b–k) Clone migration histories predicted by non-biogeographic and biogeographic models. Colors correspond to the tumor location where clones were sampled from. Values on the top right are AICcs.
Analysis of Patient A7 with basal-like breast cancer13. (a) Clone phylogeny and tumor location of each clone reported in the original study. (b–k) Clone migration histories predicted by non-biogeographic and biogeographic models. Colors correspond to the tumor location where clones were sampled from. Values on the top right are AICcs. Clone migration history inferred by BAYAREALIKE (panel e) predicted the origin of metastasis to be from lung tissue, but the estimated probability value of lung at the root of the tree was very low and similar to that for breast tissue (primary tumor). We show the migration history starting from breast tissue (dotted line) because the primary tumor was found there (e).
For the A1 patient, Hoadley et al.13 suggested that clones from the primary tumor seeded all the metastases13. Similar to the analysis of computer-simulated datasets, +J models fit better in the Likelihood Ratio Tests comparing BAYAREALIKE, DEC, and DIVALIKE with their +J counterparts, respectively (2∆lnL/BIC of 4.5/5.5, 9.5/10.5, and 10/10.9, respectively; P < 0.05) (Fig. 7). The use of the BAYAREALIKE model suggested that clones from the primary tumor seeded all metastases. More than one primary clone seeded some metastases (clones C4 and C7, Fig. 7a). This pattern of a primary-to-metastasis spread is similar to that inferred by Hoadley et al.13 and by methods that do not involve biogeographic modeling (BBM, PathFinder, and MACHINA; Fig. 7b–d). Importantly, however, not all model combinations and methods produce identical migration history, which was similar to the trends observed in the analysis of computer-simulated datasets.
The second dataset analyzed came from patient A7, which consisted of ten clones sampled from one primary (breast) and five metastatic tumors (brain, lung, rib, liver, and kidney). Figure 8a shows the phylogenetic relationship of clones and their sampling sites of patient A7. Hoadley et al.13 suggested that the breast’s primary tumor directly seeded all the metastases for this patient13. While different types of migration graphs were reconstructed by the biogeograhic models, none of them predicted a migration history in which the primary tumor seeded all the metastases, contrary to Hoadley et al.13 suggestion (Fig. 8). Again, models with +J fit the data the best, consistent
with the results seen in the analysis of the computer-simulated datasets (2ΔlnL/BIC of 4.6/5.6, 2.3/3.3, and 5.5/6.5, respectively; P < 0.05).
All biogeographic models predicted that metastases seeded other metastases much more frequently than the primary tumors in patient A7 (Fig. 8e–k). In all migration history reconstructions, one or more clones migrated between metastases. Similarly, methods that do not consider biogeographic models (BBM, MACHINA, and PathFinder) predicted many migration paths between metastatic tumor sites. For example, there were two migration events from the lung to brain metastases and one reseeding event from the brain to the lung when BBM was used (Fig. 8h). Most of the methods showed slightly different clone migration histories from each other. However, the seeding of kidney metastasis by the liver metastasis was inferred in all analyses, and the rib metastasis received clones from both liver and lung metastases. The lung metastasis also contributed clones to the brain metastasis. Some biogeographic models predicted reseeding of the primary tumor (breast) by metastases, but this pattern was not universally observed. Overall, non-biogeographic models agree with the type of migration patterns inferred by the biogeographic models indicating metastasis-to-metastasis and not a primary-to-metastasis spread.
Source: Ecology - nature.com