Data
Pathosystem
WMV is widespread in cucurbit crops, mostly in temperate and Mediterranean climatic regions of the world16. WMV has a wide host range including some legumes, orchids and many weeds that can be alternative hosts16. Like other potyviruses, it is non-persistently transmitted by at least 30 aphid species16. In temperate regions, WMV causes summer epidemics on cucurbit crops, and it can overwinter in several common non-cucurbit weeds when no crops are present16,34. WMV has been common in France for more than 40 years, causing mosaics on leaves and fruits in melon, but mostly mild symptoms on zucchini squash. Since 2000, new symptoms were observed in southeastern France on zucchini squash: leaf deformations and mosaics, as well as fruit discoloration and deformations that made them unmarketable. This new agronomic problem was correlated to the introduction of new molecular groups of WMV strains. At least four new groups have emerged since 2000 and they have rapidly replaced the native “classical” strains, causing important problems for the producers35. These new groups, hereafter “emerging strains” (ES) are significantly more related molecularly to worldwide strains than to any other isolates from the French populations36. As emphasised in35, this supports that the new group of emerging strains has arisen through introductions, mostly from Southeastern Asia, rather than through local differentiation.
In this study, we focus on the pathosystem corresponding to a classical strain (CS) and four emerging strains (ESk, (k = 1, ldots ,4)) of WMV and their cucurbit hosts.
Study area and sampling
The study area, located in Southeastern France, is included in a rectangle of about 25,000 km2 and is bounded on the South by the Mediterranean Sea. Between 2004 and 2008, the presence of WMV had been monitored in collaboration with farmers, farm advisers and seed companies. Each year, cultivated host plants were collected in different fields and at different dates between May 1st and September 30th. In total, more than two thousand plant samples were collected over the entire study area. All plant samples were analyzed in the INRAE Plant Pathology Unit to confirm the presence of WMV and determine the molecular type of the virus strain causing the infection (see35 for detail on field and laboratory protocols). All infected host plants were cucurbits, mostly melon and different squashes (e.g., zucchini, pumpkins).
Observations
In the absence of individual geographic coordinates, all infected host plants were attributed to the centroid of the municipality (French administrative unit, median size about 10 km2) where they have been collected. Then for one date, one observation corresponded to a municipality in which at least one infected host plant was sampled. Table 1 summarizes for each year, the number of observations (i.e. number of municipalities), the number of infected plants sampled and the proportion of each WMV strain (CS, and ES1 to ES4) found in the infected host plants. Errors in assignment of virus samples to the CS or ES strains was negligible because of the large genetic distance separating them: 5 to 10% nucleotide divergence both in the fragment used in the study and in complete genomes35, also precluding the possibility of local jumps between groups by accumulation of mutations.
Landscape
To approximate the density of WMV host plants over the study area, we used 2006 land use data (i.e. BD Ocsol 2006 PACA and LR) produced by the CRIGE PACA (http://www.crige-paca.org/) and the Association SIG-LR (http://www.siglr.org/lassociation/la-structure.html). Based on satellite images, land use is determined at a spatial resolution of 1/50,000 using an improved three-level hierarchical typology derived from the European Corine Land Cover nomenclature. Here we used the third hierarchical level of the BD Ocsol typology (i.e. 42 land use classes) to classify the entire study area in three habitats: (1) WMV-susceptible crops, (2) habitats unfavorable to WMV host plants (e.g. forests, industrial and commercial units…) and, (3) non-terrestrial habitat (i.e. water). The proportion of WMV-susceptible crops was then computed within all cells of a raster covering the entire study area, with a spatial resolution of (1.4 times 1.4) km2. These proportions were used to approximate host plant density (zleft( {varvec{x}} right)), which was normalized, so that (zleft( {varvec{x}} right) = 0) corresponds to an absence of host plants and (zleft( {varvec{x}} right) = 1) to the maximum density of host plants (Fig. 1).
Mechanistic-statistical model
The general modeling strategy is based on a mechanistic-statistical approach12,22,33. This type of approach combines a mechanistic model describing the dynamics under investigation with a probabilistic model conditional on the dynamics, describing how the measurements have been collected. This method that has already proved its theoretical effectiveness in determining dispersal parameters using simulated genetic data12 aims at bridging the gap between the data and the model for the determination of virus dynamics.
Here, the mechanistic part of the model describes the spatio-temporal dynamics of the virus strains, given the model parameters (demographic parameters, introduction dates/sites). This allows us to compute the expected proportions of the five types of virus strains (CS and ES1 to ES4) at each date and site of observation. The probabilistic part of the mechanistic-statistical model describes the conditional distribution of the observed proportions of the virus strains, given the expected proportions. Using this approach, it is then possible to derive a numerically tractable formula for the likelihood function associated with the model parameters.
Population dynamics
The model is segmented into two stages: (1) the intra-annual stage describes the dispersal and growth of the five virus strains during the summer epidemics on cucurbit crops, and the competition between them, during a period ranging from May 1st (noted (t = 0)) to September 30 (noted (t = t_{f}), (t_{f} = 153) days); (2) the inter-annual stage describes the winter decay of the different strains when no crops are present and the virus overwinters in weeds. We denote by (c^{n} left( {t,{varvec{x}}} right)) and (e_{k}^{n} left( {t,{varvec{x}}} right)) the densities of classical strain (CS) and emerging strains (ESk, (k = 1, ldots ,4)), at position ({varvec{x}}) and at time (t) of year (n.)
Dynamics of the classical strain before the first introduction events
Before the introduction of the first emerging strain, the intra-annual dynamics of the population CS is described by a standard diffusion model with logistic growth: (partial_{t} c^{n} = D{Delta }c^{n} + rc^{n} left( {zleft( {varvec{x}} right) – c^{n} } right)). Here, ({Delta }) is the Laplace 2D diffusion operator (sum of the second derivatives with respect to coordinate). This operator describes uncorrelated random walk movements of the viruses, with the coefficient (D) measuring the mobility of the viruses (e.g.,26,37). The term (r zleft( {varvec{x}} right)) is the intrinsic growth rate (i.e., growth rate in the absence of competition) of the population, which depends on the density of host plants (zleft( {varvec{x}} right)) and on a coefficient (r) (intrinsic growth rate at maximum host density). Under these assumptions, the carrying capacity at a position ({varvec{x}}) is equal to (zleft( {varvec{x}} right)), which means that the population densities are expressed in units of the maximum host population density. The model is initialized by setting (c^{1980} left( {0,{varvec{x}}} right) = (1 – m_{c} ) zleft( {varvec{x}} right)), where (m_{c}) is the winter decay rate of the CS (see the description of the inter-annual stage below). In other terms, we assume that the CS density is at the carrying capacity in 1979, i.e., 5 years after its first detection and 20 years before the first detections of ESs38.
Introduction events
The ESs are introduced during years noted (n_{k} ge 1981), at the beginning of the intra-annual stage (other dates of introduction within the intra-annual stage would lead—at most—to a one-year lag in the dynamics). Their densities are (0) before introduction: (e_{k}^{n} = 0) for (n < n_{k}). Once introduced, the initial density of any ES is assumed to be 1/10th of the carrying capacity at the introduction point (other values have been tested without much effect, see Supplementary Fig. S1), with a decreasing density as the distance from this point increases:
$$e_{k}^{{n_{k} }} left( {0,x} right) = frac{{zleft( {varvec{x}} right)}}{10}exp left( { – frac{|{{varvec{x}} – {varvec{X}}_{{varvec{k}}}|^{2} }}{{2sigma^{2} }}} right),$$
where ({varvec{X}}_{{varvec{k}}}) is the location of introduction of the strain (k.) In our computations, we took (sigma = 5) km for the standard deviation.
Intra-annual dynamics after the first introduction event
Intra-annual dynamics were described by a neutral competition model with diffusion (properties of this model have been analyzed in [54]):
$$left{ {begin{array}{*{20}c} {partial_{t} c^{n} left( {t,x} right) = DDelta c^{n} + rc^{n} left( {zleft( {varvec{x}} right) – c^{n} – mathop sum limits_{i = 1}^{4} e_{i}^{n} left( {t,{varvec{x}}} right)} right)} {partial_{t} e_{k}^{n} left( {t,x} right) = DDelta e_{k}^{n} + re_{k}^{n} left( {zleft( {varvec{x}} right) – c^{n} – mathop sum limits_{i = 1}^{4} e_{i}^{n} left( {t,{varvec{x}}} right)} right)} end{array} } right.,$$
for (t = 0 ldots t_{f}) and for all introduced emerging strains, i.e. all (k) such that (n ge n_{k} .) We assume reflecting boundary conditions, meaning that the population flows vanish at the boundary of the study area, due to truly reflecting boundaries (e.g., sea coast in the southern part of the site) or symmetric inward and outward fluxes26. In addition, in order to limit the number of unknown parameters, and in the absence of precise knowledge on the differences between the strains, we assume here that the diffusion, competition and growth coefficients are common to all the strains during the intra-annual stage (see the discussion for more details on this assumption).
Inter-annual dynamics
The population densities at time (t = 0) of year (n) are connected with those of year (n – 1,) at time (t = t_{f} ,) through the following formulas:
$$left{ {begin{array}{*{20}c} {c^{n} left( {0,{varvec{x}}} right) = left( {1 – m_{c} } right)c^{n – 1} left( {t_{f} ,{varvec{x}}} right) hbox{ for } n ge 1981} {e_{k}^{n} left( {0,{varvec{x}}} right) = left( {1 – m_{e} } right)e_{k}^{n – 1} left( {t_{f} ,{varvec{x}}} right) hbox{ for }n ge n_{k} + 1} end{array} } right.$$
with (m_{c}) and (m_{e}) the winter decay rates of the CS and ESs strains (note that (m_{e}) is common to all of the ESs). Estimation of CS and ES decay rates provides an assessment of the competitive advantage of one type of strain vs the other.
Numerical computations
The intra-annual dynamics were solved using Comsol Multiphysics time-dependent solver, which is based on a finite element method (FEM). The triangular mesh which was used for our computations is available as Supplementary Fig. S2.
Probabilistic model for the observation process
During the years (n = 2004, ldots ,2008), (I_{n}) observations were made (see Section Observations above and Table 1). They consist in counting data, that we denote by (C_{i}) and (E_{k,i}) for (k = 1, ldots ,4) and (i = 1, ldots ,I_{n}), corresponding to the number of samples infected by the CS and ESs strains, respectively, at each date of observation and location (left( {t_{i} ,{varvec{x}}_{i} } right)). Note that these dates and locations depend on the year of observation (n). More generally, the above quantities should be noted (C_{i}^{n} , E_{k,i}^{n} , t_{i}^{n} , {varvec{x}}_{i}^{n} 😉 for simplicity, the index (n) is omitted in the sequel, unless necessary.
We denote by (V_{i} = C_{i} + mathop sum nolimits_{k = 1}^{4} E_{k,i}) the total number of infected samples observed at (left( {t_{i} ,{varvec{x}}_{i} } right)). The conditional distribution of the vector (left( {C_{i} ,E_{1,i} ,E_{2,i} ,E_{3,i} ,E_{4,i} } right)), given (V_{i}) can be described by a multinomial distribution ({mathcal{M}}left( {V_{i} ,{varvec{p}}_{i} } right)) with ({varvec{p}}_{i} = left( {p_{i}^{c} ,p_{i}^{{e_{1} }} ,p_{i}^{{e_{2} }} ,p_{i}^{{e_{3} }} ,p_{i}^{{e_{4} }} } right)) the vector of the respective proportions of each strain in the population at (left( {t_{i} ,{varvec{x}}_{i} } right)). We chose to work conditionally to (V_{i}) because the sample sizes are not related to the density of WMV.
Statistical inference
Unknown parameters
We denote by ({{varvec{Theta}}}) the vector of unknown parameters: the diffusion coefficient (D,) the intrinsic growth rate at maximum host density (r), the winter decay rates ((m_{c} , m_{e} )) and the locations ((x_{k} in {mathbb{R}}^{2})) and years ((n_{k})) of introduction, for (k = 1, ldots ,4.) Thus ({{varvec{Theta}}} in {mathbb{R}}^{16} .)
Computation of a likelihood
Given the set of parameters ({{varvec{Theta}}}), the densities (c^{n} left( {t,{varvec{x}}|{{varvec{Theta}}}} right)) and (e_{k}^{n} left( {t,{varvec{x}}|{{varvec{Theta}}}} right)) can be computed explicitly with the mechanistic model for population dynamics. Thus, at a given year (n), at (left( {t_{i} ,x_{i} } right)), the parameter ({varvec{p}}_{i}) of the multinomial distribution ({mathcal{M}}left( {V_{i} ,{varvec{p}}_{i} } right)) writes:
$$p_{i}^{c} left( {{varvec{Theta}}} right) = frac{{c^{n} left( {t_{i} ,{varvec{x}}_{i} |{{varvec{Theta}}}} right)}}{{c^{n} left( {t_{i} ,{varvec{x}}_{i} |{{varvec{Theta}}}} right) + mathop sum nolimits_{i = 1}^{4} e_{i}^{n} left( {t_{i} ,{varvec{x}}_{i} {|}{{varvec{Theta}}}} right)}}, p_{i}^{{e_{k} }} left( {{varvec{Theta}}} right) = frac{{e_{k}^{n} left( {t_{i} ,{varvec{x}}_{i} |{{varvec{Theta}}}} right)}}{{c^{n} left( {t_{i} ,{varvec{x}}_{i} |{{varvec{Theta}}}} right) + mathop sum nolimits_{i = 1}^{4} e_{i}^{n} (t_{i} ,{varvec{x}}_{i} |{{varvec{Theta}}})}}.$$
The probability (P(C_{i} ,E_{1,i} ,E_{2,i} ,E_{3,i} ,E_{4,i} |{{varvec{Theta}}},{text{V}}_{{text{i}}} )) of the observed outcome (C_{i} ,E_{1,i} ,E_{2,i} ,E_{3,i} ,E_{4,i}) is then
$$Pleft( {C_{i} ,E_{1,i} ,E_{2,i} ,E_{3,i} ,E_{4,i} {|}{{varvec{Theta}}},{text{V}}_{{text{i}}} } right) = frac{{left( {V_{i} } right)!}}{{C_{i} ! mathop prod nolimits_{k = 1}^{4} E_{k,i} !}}left( {p_{i}^{c} left( {{varvec{Theta}}} right)} right)^{{C_{i} }} mathop prod limits_{k = 1}^{4} (p_{i}^{{e_{k} }} left( {{varvec{Theta}}} right))^{{E_{k,i} }} .$$
Assuming that the observations during each year and at each date/location are independent from each other conditionally on the virus strain proportions, we get the following formula for the likelihood:
$${mathcal{L}}left( {{varvec{Theta}}} right) = mathop prod limits_{n = 2004}^{2008} mathop prod limits_{{i = 1, ldots , I_{n} }} Pleft( {C_{i} ,E_{1,i} ,E_{2,i} ,E_{3,i} ,E_{4,i} {|}{{varvec{Theta}}},{text{V}}_{{text{i}}} } right).$$
A priori constraints on the parameters
By definition and for biological reasons, the parameter vector ({{varvec{Theta}}}) satisfies some constraints. First, (D in left( {10^{ – 4} ,10} right){text{ km}}^{2} /{text{day}}), (r in left( {0.1,1} right) {text{day}}^{ – 1} ,) and (m_{c} , m_{e} in left{ {0,0.1,0.2, ldots ,0.9} right},) (see Supplementary Note S7 for a biological interpretation of these values). Second, we assumed that the locations of introductions ({varvec{X}}_{{varvec{k}}}) belong to the study area. To facilitate the estimation procedure, the points ({varvec{X}}_{{varvec{k}}}) were searched in a regular grid with 20 points (see Supplementary Fig. S3), and the dates of introduction (n_{k}) were searched in (left{ {1985,1990,1995,2000} right}.)
Inference procedure
Due to the important computation time (4 min in average for one simulation of the model on an Intel(R) Core(R) CPU i7-4790 @ 3.60 GHz), we were not able to compute an a posteriori distribution of the parameters in a Bayesian framework. Rather, we used a simulated annealing algorithm to compute the maximum likelihood estimate (MLE), i.e., the parameter ({{varvec{Theta}}}^{*}) which leads to the highest log-likelihood. This is an iterative algorithm, which constructs a sequence (({{varvec{Theta}}}_{j} )_{j ge 1}) converging in probability towards a MLE. It is based on an acceptance-rejection procedure, where the acceptance rate depends on the current iteration (j) through a “cooling rate” ((alpha )). Empirically, a good trade-off between quality of optimization and time required for computation (number of iterations) is obtained with exponential cooling rates of the type (T_{0} alpha^{j}) with (0 < alpha < 1) and some constant (T_{0} gg 1) (this cooling schedule was first proposed in= 39 = 39). Too rapid cooling ((alpha ll 1)) results in a system frozen into a state far from the optimal one, whereas too slow cooling ((alpha approx 1)) leads to important computation times due to very slow convergence. Here, we ran (6) parallel sequences with cooling rates (alpha in left{ {0.995,0.999,0.9995} right}). For this type of algorithm, there are no general rules for the choice of the stopping criterion [HenJac03], which should be heuristically adapted to the considered optimization problem. Here, our stopping criterion was that ({{varvec{Theta}}}_{j}) remained unchanged during 500 iterations. The computations took about 100 days (CPU time).
Confidence intervals and goodness-of-fit
To assess the model’s goodness-of-fit, 95% confidence regions were computed for the observations (left( {C_{i} ,E_{1,i} ,E_{2,i} ,E_{3,i} ,E_{4,i} } right)) at each date/location (left( {t_{i} ,{varvec{x}}_{i} } right),) and for each year of observation. The confidence regions were computed by assessing the probability of each possible outcome of the observation process, at each date/location, based on the computed proportions ({varvec{p}}_{i} = left( {p_{i}^{c} ,p_{i}^{{e_{1} }} ,p_{i}^{{e_{2} }} ,p_{i}^{{e_{3} }} ,p_{i}^{{e_{4} }} } right)), corresponding to the output of the mechanistic model using the MLE ({{varvec{Theta}}}^{user2{*}}) and given the total number of infected samples (V_{i}). Then, we checked if the observations belonged to the 95% most probable outcomes.
Source: Ecology - nature.com