Data selection and handling: death data
For mortality due to COVID-19, we used time series provided by the New York Times12. We selected the New York Times dataset because it is rigorously curated. We analyzed individually only counties that had recorded 100 or more deaths by 23 May 2020. The threshold of 100 was a balance between including more counties and obtaining reliable estimates of r(t): preliminary simulations showed that time series with low numbers of deaths would bias r(t) estimates (Supplementary Fig. 2). We did not want to use the maximum daily number of deaths as a selection criterion, because this could lead to selecting counties based on data from a single day. It would also involve some circularity, because the information obtained, r(t), would be directly related to the criterion used to select datasets. Therefore, we used the threshold of 100 cumulative deaths. The District of Columbia was treated as a county. Also, because the New York Times dataset aggregated the five boroughs of New York City, we treated them as a single county. For counties with fewer than 100 deaths, we aggregated mortality to the state level to create a single time series. For thirteen states (AK, DE, HI, ID, ME, MT, ND, NH, SD, UT, VT, WV, and WY), the aggregated time series did not reach 100 deaths and were therefore not analyzed.
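As an illustration of this selection step, the sketch below uses the New York Times county-level file (us-counties.csv); the column names (date, county, state, fips, cases, deaths, with cases and deaths cumulative) reflect that file's published format, but the code is a sketch rather than the authors' pipeline:

```r
## Sketch of the county-selection step (not the authors' code).
## Assumes the New York Times file us-counties.csv with columns
## date, county, state, fips, cases, deaths (cumulative counts).
library(dplyr)

nyt <- read.csv("us-counties.csv", stringsAsFactors = FALSE) %>%
  filter(as.Date(date) <= as.Date("2020-05-23"))

## Counties analyzed individually: >= 100 cumulative deaths by 23 May 2020.
totals <- nyt %>%
  group_by(state, county) %>%
  summarise(total_deaths = max(deaths), .groups = "drop")

selected <- totals %>% filter(total_deaths >= 100)

## Remaining counties are aggregated to a single state-level time series.
aggregated <- nyt %>%
  anti_join(selected, by = c("state", "county")) %>%
  group_by(state, date) %>%
  summarise(deaths = sum(deaths), .groups = "drop")
```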
Data selection and handling: explanatory county-level variables
County-level variables were collected from several public data sources36,37,38,39,40,41,42. We selected socio-economic variables a priori in part to represent a broad set of population characteristics.
Time series analysis: time series model
We used a time-varying autoregressive model15,56 designed explicitly to estimate the rate of increase of a variable using nonlinear, state-dependent error terms16. We assume in our analyses that the susceptible proportion of the population represented by a time series is close to one, and therefore there is no decrease in the infection rate caused by a pool of individuals who were infected, recovered, and were then immune to further infection.
The model is
$$x(t) = r(t-1) + x(t-1)$$
(1a)
$$r(t) = r(t-1) + \omega_r(t)$$
(1b)
$$D(t) = \exp\left(x(t) + \phi(t)\right)$$
(1c)
Here, x(t) is the unobserved, log-transformed value of daily deaths at time t, and D(t) is the observed count that depends on the observation uncertainty described by the random variable ϕ(t). Because a few of the datasets that we analyzed had zeros, we replaced zeros with 0.5 before log-transformation. The model assumes that the death count increases exponentially at rate r(t), where the latent state variable r(t) changes through time as a random walk with ωr(t) ~ N(0, σ²r). We assume that the count data follow a quasi-Poisson distribution. Thus, the expectation of counts at time t is exp(x(t)), and the variance is proportional to this expectation.
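For concreteness, the following sketch simulates a time series from equation (1) under illustrative parameter values (not estimates from any county); the quasi-Poisson observation step is approximated by a negative binomial with variance proportional to the mean:

```r
## Sketch: simulate daily death counts from the state-space model in Eq. (1).
## Parameter values are illustrative, not estimates from the paper.
set.seed(1)
n_days  <- 100
sigma_r <- 0.02    # sd of the random walk in r(t)
phi_dis <- 2       # quasi-Poisson dispersion (variance = phi * mean)
r0      <- 0.25    # initial spread rate
x0      <- log(1)  # initial log daily deaths

r <- numeric(n_days); x <- numeric(n_days)
r[1] <- r0
x[1] <- x0
for (t in 2:n_days) {
  r[t] <- r[t - 1] + rnorm(1, 0, sigma_r)  # Eq. (1b): random walk in r(t)
  x[t] <- x[t - 1] + r[t - 1]              # Eq. (1a): growth of log daily deaths
}

## Eq. (1c): observation step, approximating a quasi-Poisson distribution
## (variance = phi * mean) with a negative binomial.
mu <- exp(x)
D  <- rnbinom(n_days, mu = mu, size = mu / (phi_dis - 1))
```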
We fit the model using the extended Kalman filter to compute the maximum likelihood57,58. In addition to the parameters σ²r and σ²ϕ, we estimated the initial value of r(t) at the start of the time series, r0, and the initial value of x(t), x0. The estimation also requires terms for the variances of x0 and r0, which we assumed were zero and σ²r, respectively. In the validation using simulated data (Supplementary Methods: Simulation model), we found that the estimate of σ²r was too often absorbed at zero. To eliminate this absorption, we imposed a minimum of 0.02 on σ²r.
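As an illustration of this fitting step, the sketch below gives one possible extended Kalman filter for equation (1); it is not the authors' implementation, and the handling of the first observation and the parameterization of the dispersion are simplifying assumptions:

```r
## Sketch of an extended Kalman filter for Eq. (1); one possible construction,
## not the authors' code. State is s = (x, r); the observed count D(t) has mean
## exp(x(t)) and quasi-Poisson variance phi * exp(x(t)).
ekf_negloglik <- function(par, D) {
  sigma2_r <- max(exp(par[1]), 0.02)   # variance of the random walk, floored at 0.02
  phi      <- exp(par[2])              # quasi-Poisson dispersion
  r0       <- par[3]
  x0       <- par[4]

  Fm <- matrix(c(1, 0, 1, 1), 2, 2)    # x(t) = x(t-1) + r(t-1); r(t) = r(t-1)
  Q  <- diag(c(0, sigma2_r))           # process noise enters only through r(t)
  s  <- matrix(c(x0, r0), 2, 1)
  P  <- diag(c(0, sigma2_r))           # initial variances of x0 and r0, as in the text
  nll <- 0
  for (t in seq_along(D)) {
    s <- Fm %*% s                      # prediction step
    P <- Fm %*% P %*% t(Fm) + Q
    mu <- exp(s[1, 1])                 # linearize h(s) = exp(x) at the predicted state
    H  <- matrix(c(mu, 0), 1, 2)
    S  <- as.numeric(H %*% P %*% t(H)) + phi * mu  # innovation variance
    K  <- P %*% t(H) / S
    innov <- D[t] - mu
    s <- s + K * innov                 # update step
    P <- (diag(2) - K %*% H) %*% P
    nll <- nll + 0.5 * (log(2 * pi * S) + innov^2 / S)  # Gaussian approximation
  }
  nll
}

## Maximum likelihood via numerical optimization (starting values are illustrative):
## fit <- optim(c(log(0.02), log(1), 0.2, 0), ekf_negloglik, D = D)
```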
Time series analysis: parametric bootstrapping
To generate approximate confidence intervals for the time-varying estimates of r(t) (Eq. 1b), we used a parametric bootstrap designed to simulate datasets with the same characteristics as the real data, which were then refit with the autoregressive model. We used bootstrapping to obtain confidence intervals because an initial simulation study showed that standard methods, such as obtaining the variance of r(t) from the Kalman filter, gave confidence intervals that were too narrow when the number of counts was small. Furthermore, parametric bootstrapping can reveal bias and other features of a model, such as the lags we found during model fitting (Supplementary Fig. 1a, b).
Changes in r(t) consist of unbiased day-to-day variation and the biased deviations that lead to longer-term changes in r(t). The bootstrap treats the day-to-day variation as a random variable while preserving the biased deviations that generate longer-term changes in r(t). Specifically, the bootstrap was performed by calculating the differences between successive estimates of r(t), Δr(t) = r(t) − r(t−1), and then standardizing to remove the bias, Δrs(t) = Δr(t) − E[Δr(t)], where E[] denotes the expected value. The sequence Δrs(t) was fit with an autoregressive time-series model with time lag 1, AR(1), to preserve any shorter-term autocorrelation in the data. For the bootstrap, a new time series was simulated from this AR(1) model, Δρ(t), and then standardized, Δρs(t) = Δρ(t) − E[Δρ(t)]. The simulated time series for the spread rate was constructed as ρ(t) = r(t) + Δρs(t)/√2, where dividing by √2 accounts for the fact that Δρs(t) was calculated from the difference between successive values of r(t). A new time series of count data, ξ(t), was then generated using equation (1) with the parameters from fitting the data. Finally, the statistical model was fit to the reconstructed ξ(t). In this refitting, we fixed the variance in r(t), σ²r, to the value estimated from the data. Therefore, the bootstrap confidence intervals are conditional on the estimate of σ²r.
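A schematic of one bootstrap replicate is given below, assuming r_est and x_est are the fitted values of r(t) and x(t); the final refitting step with the extended Kalman filter is omitted, and the Poisson observation step is a simplification of the quasi-Poisson model:

```r
## Sketch of one parametric-bootstrap replicate for r(t); illustrative only.
## r_est: fitted values of r(t); x_est: fitted values of x(t).
one_bootstrap <- function(r_est, x_est) {
  dr  <- diff(r_est)
  drs <- dr - mean(dr)                      # remove the bias (longer-term trend)
  ar1 <- arima(drs, order = c(1, 0, 0), include.mean = FALSE)

  ## Simulate day-to-day variation with the same AR(1) structure, then standardize.
  drho  <- arima.sim(n = length(drs),
                     list(ar = coef(ar1)["ar1"]),
                     sd = sqrt(ar1$sigma2))
  drhos <- drho - mean(drho)

  ## Reconstructed spread rate: keep the fitted trajectory, add resampled
  ## day-to-day variation; divide by sqrt(2) because differences of successive
  ## r(t) values have twice the variance of r(t) itself.
  rho <- r_est
  rho[-1] <- r_est[-1] + drhos / sqrt(2)

  ## Regenerate counts from Eq. (1) and return them for refitting
  ## (Poisson used here in place of the quasi-Poisson observation step).
  x <- cumsum(c(x_est[1], rho[-length(rho)]))
  rpois(length(x), lambda = exp(x))
}
```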
Time series analysis: calculating R0
We derived estimates of R(t) directly from r(t) using the Dublin-Lotka equation21 from demography. This equation is derived from a convolution of the distribution of births under the assumption of exponential population growth. In our case, the “birth” of COVID-19 is the secondary infection of susceptible hosts leading to death, and the assumption of exponential population growth is equivalent to assuming that the initial rate of spread of the disease is exponential, as is the case in equation 1. Thus,
$$R(t) = 1\Big/\sum_{\tau} \mathrm{e}^{-r(t)\tau}\, p(\tau)$$
(2)
where p(τ) is the distribution of the proportion of secondary infections caused by a primary infection that occurred τ days previously. We used the distribution of p(τ) from Li et al.59 that had an average serial interval of T0 = 7.5 days; smaller or larger values of T0, and greater or lesser variance in p(τ), will decrease or increase R(t) but will not change the pattern in R(t) through time. Note that the uncertainty in the distribution of serial times for COVID-19 is a major reason why we focus on estimating r0, rather than R0: the estimates of r0 are not contingent on time distributions that are poorly known. Computing R(t) from r(t) also does not depend on the mean or variance in time between secondary infection and death. We report values of R(t) at dates that are offset by 18 days, the average length of time between initial infection and death given by Zhou et al.60.
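For illustration, the conversion in equation (2) can be computed as below, assuming a discretized gamma serial-interval distribution with mean 7.5 days; the standard deviation of 3.4 days is an assumption here, and the paper uses the distribution reported by Li et al.59:

```r
## Sketch: convert r(t) to R(t) with Eq. (2), using a discretized gamma
## serial-interval distribution (mean 7.5 days; the SD of 3.4 days is an
## illustrative assumption, not necessarily the distribution used in the paper).
serial_mean <- 7.5
serial_sd   <- 3.4
tau   <- 1:21
shape <- (serial_mean / serial_sd)^2
rate  <- serial_mean / serial_sd^2
p_tau <- dgamma(tau, shape = shape, rate = rate)
p_tau <- p_tau / sum(p_tau)            # normalize the discretized distribution

r_to_R <- function(r) 1 / sum(exp(-r * tau) * p_tau)

R_t <- sapply(r_est, r_to_R)           # r_est: vector of fitted r(t) values
```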
Time series analysis: initial date of the time series
Many time series consisted of initial periods containing zeros that were uninformative. As the initial date for the time series, we chose the day on which the estimated daily death count exceeded 1. To estimate the daily death count, we fit a Generalized Additive Mixed Model (GAMM) to the death data while accounting for autocorrelation and greater measurement error at low counts using the R package mgcv61. We used this procedure, rather than using a threshold of the raw death count, because the raw death count will include variability due to sampling small numbers of deaths. Applying the GAMM to “smooth” over the variation in count data gives a well-justified method for standardizing the initial dates for each time series.
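A sketch of this rule is given below, assuming a data frame df with columns day (days since the first record) and deaths (daily death counts); the particular smoother and correlation structure are plausible choices rather than the authors' exact specification:

```r
## Sketch: choose the starting date as the first day on which the GAMM-estimated
## daily death count exceeds 1.
library(mgcv)

fit <- gamm(deaths ~ s(day),
            family = quasipoisson(link = "log"),
            correlation = corAR1(form = ~ day),
            data = df)

fitted_deaths <- predict(fit$gam, type = "response")
start_day <- df$day[min(which(fitted_deaths > 1))]
```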
Time series analysis: validation
We performed extensive simulations to validate the time-series analysis approach (Supplementary Methods: Simulation model).
Regression analysis for r0
We applied a Generalized Least Squares (GLS) regression model to explain the variation in estimates of r0 from the 160 county and county-aggregate time series:
$$r_0 = b_0 + b_1\,\mathrm{start.date} + b_2 \log(\mathrm{pop.size}) + b_3\,\mathrm{pop.den}^{0.25} + \varepsilon$$
(3)
where start.date is the Julian date of the start of the time series, log(pop.size) and pop.den^0.25 are the log-transformed population size and 0.25-power-transformed population density of the county or county-aggregate, respectively, and ε is a multivariate Gaussian random variable with covariance matrix σ²Σ. We used the transforms log(pop.size) and pop.den^0.25 to account for nonlinear relationships with r0; these transforms gave the highest maximum likelihood for the overall regression. The covariance matrix contains a spatial correlation matrix of the form C = uI + (1−u)S(g), where u is the nugget and S(g) contains elements exp(−dij/g), where dij is the distance between spatial locations and g is the range62. To incorporate differences in the precision of the estimates of r0 among time series, we weighted by the vector of their standard errors, s, so that Σ = diag(s) * C * diag(s), where * denotes matrix multiplication. With this weighting, the overall scaling term for the variance, σ², will equal 1 if the residual variance of the regression model matches the squared standard errors of the estimates of r0 from the time series. We fit the regression model with the function gls() in the R package nlme63.
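A possible reconstruction of this call is sketched below; the column names (lon, lat for county coordinates, se2 for squared standard errors of r0) and the use of Euclidean distance on coordinates are assumptions for illustration:

```r
## Sketch of the GLS fit in Eq. (3) with an exponential spatial correlation
## (with nugget) and fixed weights from the squared standard errors of r0.
## Column names (lon, lat, se2, start.date, pop.size, pop.den) are assumptions.
library(nlme)

fit_gls <- gls(r0 ~ start.date + log(pop.size) + I(pop.den^0.25),
               correlation = corExp(form = ~ lon + lat, nugget = TRUE),
               weights = varFixed(~ se2),
               data = county_data,
               method = "ML")

summary(fit_gls)   # a residual sigma near 1 indicates the residual variance
                   # matches the standard errors of the r0 estimates
```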
To make predictions for new values of r0, we used the relationship
$$\hat{e}_i = \bar{e} + \mathbf{v}_i \mathbf{V}^{-1}\left(\epsilon_i - \bar{e}\right)$$
(4)
where εi is the GLS residual for data i, êi is the predicted residual, ē is the mean of the GLS residuals, V is the covariance matrix for the data other than i, and vi is a row vector containing the covariances between data i and the other data in the dataset64. This equation was used for three purposes. First, we used it to compute R²pred for the regression model by removing each data point, recomputing êi, and using these values to compute the predicted residual variance23. Second, we used it to obtain predicted values of r0, and subsequently R0, for the 160 counties and county-aggregates for which r0 was also estimated from time series. Third, we used equation (4) to obtain predicted values of r0, and hence predicted R0, for all other counties. We also calculated the variance of the estimates from64:
$$\hat{v}_i = \sigma^2 - \mathbf{v}_i \mathbf{V}^{-1} \mathbf{v}_i^{\mathrm{T}}$$
(5)
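The leave-one-out predictions in equations (4) and (5) can be sketched as below, assuming Sigma is the fitted covariance matrix and e is the vector of GLS residuals; this is an illustration of the formulas, not the authors' code:

```r
## Sketch of Eqs. (4) and (5): conditional (kriging-type) prediction of each
## residual from all other residuals. Sigma: fitted covariance matrix; e: GLS residuals.
loo_predict <- function(e, Sigma) {
  n <- length(e)
  e_hat <- numeric(n)
  v_hat <- numeric(n)
  for (i in seq_len(n)) {
    v_i   <- Sigma[i, -i, drop = FALSE]  # covariances between i and the other data
    V     <- Sigma[-i, -i]               # covariance matrix excluding i
    e_bar <- mean(e[-i])
    w     <- solve(V, e[-i] - e_bar)
    e_hat[i] <- e_bar + as.numeric(v_i %*% w)                        # Eq. (4)
    v_hat[i] <- Sigma[i, i] - as.numeric(v_i %*% solve(V, t(v_i)))   # Eq. (5), with
                                         # Sigma[i, i] in the role of sigma^2 for datum i
  }
  list(e_hat = e_hat, v_hat = v_hat)
}

## R2_pred can then be computed from the predicted residuals, e.g.
## 1 - var(e - loo_predict(e, Sigma)$e_hat) / var(e).
```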
Predicted values of R0 were mapped using the R package usmap65.
Regression analysis for SARS-CoV-2 effects on r0
The GISAID metadata27 for SARS-CoV-2 contains the clade and state-level location of strains in the USA; strains G, GH, and GR carry the G614 mutation. For each state, we limited the SARS-CoV-2 genomes to those collected no more than 30 days after the onset of the outbreak that we used as the starting point of the time series from which we estimated r0; from these genomes (totaling 5290 across all states), we calculated the proportion that carried the G614 mutation. We limited the analyses to the 28 states that had five or more genome samples. For each state, we selected the estimates of r0 from the county or county-aggregate representing the greatest number of deaths. We fit these estimates of r0 with a weighted least-squares (LS) model as in equation (3), with additional variables for strain. Figure 3 was constructed using the R packages usmap65 and scatterpie66.
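A hedged sketch of this state-level regression is given below; the data frame state_data and the variable names prop_G614 (proportion of early genomes carrying G614) and se (standard error of r0) are illustrative assumptions, and the authors' strain terms may differ:

```r
## Sketch of a weighted LS regression of r0 on the Eq. (3) covariates plus the
## proportion of early genomes carrying the G614 mutation. Variable names
## (prop_G614, se) are illustrative assumptions.
fit_strain <- lm(r0 ~ start.date + log(pop.size) + I(pop.den^0.25) + prop_G614,
                 weights = 1 / se^2,
                 data = state_data)
summary(fit_strain)
```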
Statistics and reproducibility
The statistics for this study are summarized in the preceding sections of the “Methods”. No experiments were conducted, so experimental reproducibility is not an issue. Nonetheless, we repeated analyses using alternative datasets giving county-level characteristics, and also an alternative dataset on SARS-CoV-2 strains (Supplementary Methods: Analysis of Nextstrain metadata of SARS-CoV-2 strains), and all of the conclusions were the same.
Reporting summary
Further information on research design is available in the Nature Research Reporting Summary linked to this article.