Data selection and handling: death data
For mortality due to COVID-19, we used time series provided by the New York Times12. We selected the New York Times dataset because it is rigorously curated. We analyzed individually only counties that had recorded 100 or more deaths by 23 May 2020. The threshold of 100 was a balance between including more counties and obtaining reliable estimates of r(t): preliminary simulations showed that time series with low numbers of deaths would bias r(t) estimates (Supplementary Fig. 2). We did not want to use the maximum daily number of deaths as a selection criterion, because this could lead to selecting counties based on data from a single day. It would also involve some circularity, because the information obtained, r(t), would be directly related to the criterion used to select datasets. Therefore, we used the threshold of 100 cumulative deaths. The District of Columbia was treated as a county. Also, because the New York Times dataset aggregated the five boroughs of New York City, we treated them as a single county. For counties with fewer than 100 deaths, we aggregated mortality to the state level to create a single time series. For thirteen states (AK, DE, HI, ID, ME, MT, ND, NH, SD, UT, VT, WV, and WY), the aggregated time series did not reach 100 deaths and were therefore not analyzed.
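As an illustration of this selection step, the sketch below uses the New York Times county-level file (us-counties.csv); the column names (date, county, state, fips, cases, deaths, with cases and deaths cumulative) reflect that file's published format, but the code is a sketch rather than the authors' pipeline:

```r
## Sketch of the county-selection step (not the authors' code).
## Assumes the New York Times file us-counties.csv with columns
## date, county, state, fips, cases, deaths (cumulative counts).
library(dplyr)

nyt <- read.csv("us-counties.csv", stringsAsFactors = FALSE) %>%
  filter(as.Date(date) <= as.Date("2020-05-23"))

## Counties analyzed individually: >= 100 cumulative deaths by 23 May 2020.
totals <- nyt %>%
  group_by(state, county) %>%
  summarise(total_deaths = max(deaths), .groups = "drop")

selected <- totals %>% filter(total_deaths >= 100)

## Remaining counties are aggregated to a single state-level time series.
aggregated <- nyt %>%
  anti_join(selected, by = c("state", "county")) %>%
  group_by(state, date) %>%
  summarise(deaths = sum(deaths), .groups = "drop")
```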
Data selection and handling: explanatory county-level variables
County-level variables were collected from several public data sources36,37,38,39,40,41,42. We selected socio-economic variables a priori in part to represent a broad set of population characteristics.
Time series analysis: time series model
We used a time-varying autoregressive model15,56 designed explicitly to estimate the rate of increase of a variable using nonlinear, state-dependent error terms16. We assume in our analyses that the susceptible proportion of the population represented by a time series is close to one, and therefore there is no decrease in the infection rate caused by a pool of individuals who were infected, recovered, and were then immune to further infection.
The model is
$$x(t) = r(t-1) + x(t-1)$$
(1a)
$$r(t) = r(t-1) + \omega_r(t)$$
(1b)
$$D(t) = \exp\left(x(t) + \phi(t)\right)$$
(1c)
Here, x(t) is the unobserved, log-transformed value of daily deaths at time t, and D(t) is the observed count that depends on the observation uncertainty described by the random variable ϕ(t). Because a few of the datasets that we analyzed had zeros, we replaced zeros with 0.5 before log-transformation. The model assumes that the death count increases exponentially at rate r(t), where the latent state variable r(t) changes through time as a random walk with ωr(t) ~ N(0, σ²r). We assume that the count data follow a quasi-Poisson distribution. Thus, the expectation of counts at time t is exp(x(t)), and the variance is proportional to this expectation.
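For concreteness, the following sketch simulates a time series from equation (1) under illustrative parameter values (not estimates from any county); the quasi-Poisson observation step is approximated by a negative binomial with variance proportional to the mean:

```r
## Sketch: simulate daily death counts from the state-space model in Eq. (1).
## Parameter values are illustrative, not estimates from the paper.
set.seed(1)
n_days  <- 100
sigma_r <- 0.02    # sd of the random walk in r(t)
phi_dis <- 2       # quasi-Poisson dispersion (variance = phi * mean)
r0      <- 0.25    # initial spread rate
x0      <- log(1)  # initial log daily deaths

r <- numeric(n_days); x <- numeric(n_days)
r[1] <- r0
x[1] <- x0
for (t in 2:n_days) {
  r[t] <- r[t - 1] + rnorm(1, 0, sigma_r)  # Eq. (1b): random walk in r(t)
  x[t] <- x[t - 1] + r[t - 1]              # Eq. (1a): growth of log daily deaths
}

## Eq. (1c): observation step, approximating a quasi-Poisson distribution
## (variance = phi * mean) with a negative binomial.
mu <- exp(x)
D  <- rnbinom(n_days, mu = mu, size = mu / (phi_dis - 1))
```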
We fit the model using the extended Kalman filter to compute the maximum likelihood57,58. In addition to the parameters σ²r and σ²ϕ, we estimated the initial value of r(t) at the start of the time series, r0, and the initial value of x(t), x0. The estimation also requires terms for the variances of x0 and r0, which we assumed were zero and σ²r, respectively. In the validation using simulated data (Supplementary Methods: Simulation model), we found that the estimate of σ²r was too often absorbed at zero. To eliminate this absorption, we imposed a minimum of 0.02 on σ²r.
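As an illustration of this fitting step, the sketch below gives one possible extended Kalman filter for equation (1); it is not the authors' implementation, and the handling of the first observation and the parameterization of the dispersion are simplifying assumptions:

```r
## Sketch of an extended Kalman filter for Eq. (1); one possible construction,
## not the authors' code. State is s = (x, r); the observed count D(t) has mean
## exp(x(t)) and quasi-Poisson variance phi * exp(x(t)).
ekf_negloglik <- function(par, D) {
  sigma2_r <- max(exp(par[1]), 0.02)   # variance of the random walk, floored at 0.02
  phi      <- exp(par[2])              # quasi-Poisson dispersion
  r0       <- par[3]
  x0       <- par[4]

  Fm <- matrix(c(1, 0, 1, 1), 2, 2)    # x(t) = x(t-1) + r(t-1); r(t) = r(t-1)
  Q  <- diag(c(0, sigma2_r))           # process noise enters only through r(t)
  s  <- matrix(c(x0, r0), 2, 1)
  P  <- diag(c(0, sigma2_r))           # initial variances of x0 and r0, as in the text
  nll <- 0
  for (t in seq_along(D)) {
    s <- Fm %*% s                      # prediction step
    P <- Fm %*% P %*% t(Fm) + Q
    mu <- exp(s[1, 1])                 # linearize h(s) = exp(x) at the predicted state
    H  <- matrix(c(mu, 0), 1, 2)
    S  <- as.numeric(H %*% P %*% t(H)) + phi * mu  # innovation variance
    K  <- P %*% t(H) / S
    innov <- D[t] - mu
    s <- s + K * innov                 # update step
    P <- (diag(2) - K %*% H) %*% P
    nll <- nll + 0.5 * (log(2 * pi * S) + innov^2 / S)  # Gaussian approximation
  }
  nll
}

## Maximum likelihood via numerical optimization (starting values are illustrative):
## fit <- optim(c(log(0.02), log(1), 0.2, 0), ekf_negloglik, D = D)
```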
Time series analysis: parametric bootstrapping
To generate approximate confidence intervals for the time-varying estimates of r(t) (Eq. 1b), we used a parametric bootstrap designed to simulate datasets with the same characteristics as the real data, which were then refit with the autoregressive model. We used bootstrapping to obtain confidence intervals because an initial simulation study showed that standard methods, such as obtaining the variance of r(t) from the Kalman filter, gave confidence intervals that were too narrow when the number of counts was small. Furthermore, parametric bootstrapping can reveal bias and other features of a model, such as the lags we found during model fitting (Supplementary Fig. 1a, b).
Changes in r(t) consist of unbiased day-to-day variation and the biased deviations that lead to longer-term changes in r(t). The bootstrap treats the day-to-day variation as a random variable while preserving the biased deviations that generate longer-term changes in r(t). Specifically, the bootstrap was performed by calculating the differences between successive estimates of r(t), Δr(t) = r(t) − r(t−1), and then standardizing to remove the bias, Δrs(t) = Δr(t) − E[Δr(t)], where E[] denotes the expected value. The sequence Δrs(t) was fit with an autoregressive time-series model with time lag 1, AR(1), to preserve any shorter-term autocorrelation in the data. For the bootstrap, a new time series was simulated from this AR(1) model, Δρ(t), and then standardized, Δρs(t) = Δρ(t) − E[Δρ(t)]. The simulated time series for the spread rate was constructed as ρ(t) = r(t) + Δρs(t)/√2, where dividing by √2 accounts for the fact that Δρs(t) was calculated from the difference between successive values of r(t). A new time series of count data, ξ(t), was then generated using equation (1) with the parameters from fitting the data. Finally, the statistical model was fit to the reconstructed ξ(t). In this refitting, we fixed the variance in r(t), σ²r, to the value estimated from the data. Therefore, the bootstrap confidence intervals are conditional on the estimate of σ²r.
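A schematic of one bootstrap replicate is given below, assuming r_est and x_est are the fitted values of r(t) and x(t); the final refitting step with the extended Kalman filter is omitted, and the Poisson observation step is a simplification of the quasi-Poisson model:

```r
## Sketch of one parametric-bootstrap replicate for r(t); illustrative only.
## r_est: fitted values of r(t); x_est: fitted values of x(t).
one_bootstrap <- function(r_est, x_est) {
  dr  <- diff(r_est)
  drs <- dr - mean(dr)                      # remove the bias (longer-term trend)
  ar1 <- arima(drs, order = c(1, 0, 0), include.mean = FALSE)

  ## Simulate day-to-day variation with the same AR(1) structure, then standardize.
  drho  <- arima.sim(n = length(drs),
                     list(ar = coef(ar1)["ar1"]),
                     sd = sqrt(ar1$sigma2))
  drhos <- drho - mean(drho)

  ## Reconstructed spread rate: keep the fitted trajectory, add resampled
  ## day-to-day variation; divide by sqrt(2) because differences of successive
  ## r(t) values have twice the variance of r(t) itself.
  rho <- r_est
  rho[-1] <- r_est[-1] + drhos / sqrt(2)

  ## Regenerate counts from Eq. (1) and return them for refitting
  ## (Poisson used here in place of the quasi-Poisson observation step).
  x <- cumsum(c(x_est[1], rho[-length(rho)]))
  rpois(length(x), lambda = exp(x))
}
```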
Time series analysis: calculating R0
We derived estimates of R(t) directly from r(t) using the Dublin-Lotka equation21 from demography. This equation is derived from a convolution of the distribution of births under the assumption of exponential population growth. In our case, the “birth” of COVID-19 is the secondary infection of susceptible hosts leading to death, and the assumption of exponential population growth is equivalent to assuming that the initial rate of spread of the disease is exponential, as is the case in equation 1. Thus,
$$R(t) = 1\Big/\sum_{\tau} \mathrm{e}^{-r(t)\tau}\, p(\tau)$$
(2)
where p(τ) is the distribution of the proportion of secondary infections caused by a primary infection that occurred τ days previously. We used the distribution of p(τ) from Li et al.59 that had an average serial interval of T0 = 7.5 days; smaller or larger values of T0, and greater or lesser variance in p(τ), will decrease or increase R(t) but will not change the pattern in R(t) through time. Note that the uncertainty in the distribution of serial times for COVID-19 is a major reason why we focus on estimating r0, rather than R0: the estimates of r0 are not contingent on time distributions that are poorly known. Computing R(t) from r(t) also does not depend on the mean or variance in time between secondary infection and death. We report values of R(t) at dates that are offset by 18 days, the average length of time between initial infection and death given by Zhou et al.60.
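For illustration, the conversion in equation (2) can be computed as below, assuming a discretized gamma serial-interval distribution with mean 7.5 days; the standard deviation of 3.4 days is an assumption here, and the paper uses the distribution reported by Li et al.59:

```r
## Sketch: convert r(t) to R(t) with Eq. (2), using a discretized gamma
## serial-interval distribution (mean 7.5 days; the SD of 3.4 days is an
## illustrative assumption, not necessarily the distribution used in the paper).
serial_mean <- 7.5
serial_sd   <- 3.4
tau   <- 1:21
shape <- (serial_mean / serial_sd)^2
rate  <- serial_mean / serial_sd^2
p_tau <- dgamma(tau, shape = shape, rate = rate)
p_tau <- p_tau / sum(p_tau)            # normalize the discretized distribution

r_to_R <- function(r) 1 / sum(exp(-r * tau) * p_tau)

R_t <- sapply(r_est, r_to_R)           # r_est: vector of fitted r(t) values
```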
Time series analysis: initial date of the time series
Many time series consisted of initial periods containing zeros that were uninformative. As the initial date for the time series, we chose the day on which the estimated daily death count exceeded 1. To estimate the daily death count, we fit a Generalized Additive Mixed Model (GAMM) to the death data while accounting for autocorrelation and greater measurement error at low counts using the R package mgcv61. We used this procedure, rather than using a threshold of the raw death count, because the raw death count will include variability due to sampling small numbers of deaths. Applying the GAMM to “smooth” over the variation in count data gives a well-justified method for standardizing the initial dates for each time series.
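A sketch of this rule is given below, assuming a data frame df with columns day (days since the first record) and deaths (daily death counts); the particular smoother and correlation structure are plausible choices rather than the authors' exact specification:

```r
## Sketch: choose the starting date as the first day on which the GAMM-estimated
## daily death count exceeds 1.
library(mgcv)

fit <- gamm(deaths ~ s(day),
            family = quasipoisson(link = "log"),
            correlation = corAR1(form = ~ day),
            data = df)

fitted_deaths <- predict(fit$gam, type = "response")
start_day <- df$day[min(which(fitted_deaths > 1))]
```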
Time series analysis: validation
We performed extensive simulations to validate the time-series analysis approach (Supplementary Methods: Simulation model).
Regression analysis for r0
We applied a Generalized Least Squares (GLS) regression model to explain the variation in estimates of r0 from the 160 county and county-aggregate time series:
$$r_0 = b_0 + b_1\,\mathrm{start.date} + b_2 \log(\mathrm{pop.size}) + b_3\,\mathrm{pop.den}^{0.25} + \varepsilon$$
(3)
where start.date is the Julian date of the start of the time series, log(pop.size) and pop.den^0.25 are the log-transformed population size and 0.25-power-transformed population density of the county or county-aggregate, respectively, and ε is a multivariate Gaussian random variable with covariance matrix σ²Σ. We used the transforms log(pop.size) and pop.den^0.25 to account for nonlinear relationships with r0; these transforms gave the highest maximum likelihood for the overall regression. The covariance matrix contains a spatial correlation matrix of the form C = uI + (1−u)S(g), where u is the nugget and S(g) contains elements exp(−dij/g), where dij is the distance between spatial locations and g is the range62. To incorporate differences in the precision of the estimates of r0 among time series, we weighted by the vector of their standard errors, s, so that Σ = diag(s) * C * diag(s), where * denotes matrix multiplication. With this weighting, the overall scaling term for the variance, σ², will equal 1 if the residual variance of the regression model matches the squared standard errors of the estimates of r0 from the time series. We fit the regression model with the function gls() in the R package nlme63.
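A possible reconstruction of this call is sketched below; the column names (lon, lat for county coordinates, se2 for squared standard errors of r0) and the use of Euclidean distance on coordinates are assumptions for illustration:

```r
## Sketch of the GLS fit in Eq. (3) with an exponential spatial correlation
## (with nugget) and fixed weights from the squared standard errors of r0.
## Column names (lon, lat, se2, start.date, pop.size, pop.den) are assumptions.
library(nlme)

fit_gls <- gls(r0 ~ start.date + log(pop.size) + I(pop.den^0.25),
               correlation = corExp(form = ~ lon + lat, nugget = TRUE),
               weights = varFixed(~ se2),
               data = county_data,
               method = "ML")

summary(fit_gls)   # a residual sigma near 1 indicates the residual variance
                   # matches the standard errors of the r0 estimates
```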
To make predictions for new values of r0, we used the relationship
$$\hat{e}_i = \bar{e} + \mathbf{v}_i \mathbf{V}^{-1}\left(\epsilon_i - \bar{e}\right)$$
(4)
where εi is the GLS residual for data i, êi is the predicted residual, ē is the mean of the GLS residuals, V is the covariance matrix for the data other than i, and vi is a row vector containing the covariances between data i and the other data in the dataset64. This equation was used for three purposes. First, we used it to compute R²pred for the regression model by removing each data point, recomputing êi, and using these values to compute the predicted residual variance23. Second, we used it to obtain predicted values of r0, and subsequently R0, for the 160 counties and county-aggregates for which r0 was also estimated from time series. Third, we used equation (4) to obtain predicted values of r0, and hence predicted R0, for all other counties. We also calculated the variance of the estimates from64:
$$\hat{v}_i = \sigma^2 - \mathbf{v}_i \mathbf{V}^{-1} \mathbf{v}_i^{\mathrm{T}}$$
(5)
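The leave-one-out predictions in equations (4) and (5) can be sketched as below, assuming Sigma is the fitted covariance matrix and e is the vector of GLS residuals; this is an illustration of the formulas, not the authors' code:

```r
## Sketch of Eqs. (4) and (5): conditional (kriging-type) prediction of each
## residual from all other residuals. Sigma: fitted covariance matrix; e: GLS residuals.
loo_predict <- function(e, Sigma) {
  n <- length(e)
  e_hat <- numeric(n)
  v_hat <- numeric(n)
  for (i in seq_len(n)) {
    v_i   <- Sigma[i, -i, drop = FALSE]  # covariances between i and the other data
    V     <- Sigma[-i, -i]               # covariance matrix excluding i
    e_bar <- mean(e[-i])
    w     <- solve(V, e[-i] - e_bar)
    e_hat[i] <- e_bar + as.numeric(v_i %*% w)                        # Eq. (4)
    v_hat[i] <- Sigma[i, i] - as.numeric(v_i %*% solve(V, t(v_i)))   # Eq. (5), with
                                         # Sigma[i, i] in the role of sigma^2 for datum i
  }
  list(e_hat = e_hat, v_hat = v_hat)
}

## R2_pred can then be computed from the predicted residuals, e.g.
## 1 - var(e - loo_predict(e, Sigma)$e_hat) / var(e).
```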
Predicted values of R0 were mapped using the R package usmap65.
Regression analysis for SARS-CoV-2 effects on r0
The GISAID metadata27 for SARS-CoV-2 contains the clade and state-level location of strains in the USA; strains G, GH, and GR carry the G614 mutation. For each state, we limited the SARS-CoV-2 genomes to those collected no more than 30 days after the onset of the outbreak that we used as the starting point of the time series from which we estimated r0; from these genomes (totaling 5290 across all states), we calculated the proportion that carried the G614 mutation. We limited the analyses to the 28 states that had five or more genome samples. For each state, we selected the estimates of r0 from the county or county-aggregate representing the greatest number of deaths. We fit these estimates of r0 with a weighted least-squares (LS) model as in equation (3), with additional variables for strain. Figure 3 was constructed using the R packages usmap65 and scatterpie66.
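A hedged sketch of this state-level regression is given below; the data frame state_data and the variable names prop_G614 (proportion of early genomes carrying G614) and se (standard error of r0) are illustrative assumptions, and the authors' strain terms may differ:

```r
## Sketch of a weighted LS regression of r0 on the Eq. (3) covariates plus the
## proportion of early genomes carrying the G614 mutation. Variable names
## (prop_G614, se) are illustrative assumptions.
fit_strain <- lm(r0 ~ start.date + log(pop.size) + I(pop.den^0.25) + prop_G614,
                 weights = 1 / se^2,
                 data = state_data)
summary(fit_strain)
```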
Statistics and reproducibility
The statistics for this study are summarized in the preceding sections of the “Methods”. No experiments were conducted, so experimental reproducibility is not an issue. Nonetheless, we repeated analyses using alternative datasets giving county-level characteristics, and also an alternative dataset on SARS-CoV-2 strains (Supplementary Methods: Analysis of Nextstrain metadata of SARS-CoV-2 strains), and all of the conclusions were the same.
Reporting summary
Further information on research design is available in the Nature Research Reporting Summary linked to this article.