A review of existing despiking procedures
Among despiking algorithms for raw, high-frequency EC data, a popular approach was developed by Vickers and Mahrt6 (hereinafter VM97). The method consists of estimating the sample mean and standard deviation in overlapping temporal windows 5 min wide. The window slides point by point, and any data point deviating from the sample mean by more than \(\pm 3.5\sigma\) (sample standard deviations) is flagged as a spike. The method is highly sensitive to the masking effect (where less extreme spikes go undetected because of the presence of more extreme ones), which is why the procedure is iterated, increasing the threshold by 0.1 at each pass, until no more spikes are detected.
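A minimal, unoptimized sketch of a VM97-style pass is shown below, assuming 10 Hz data (a 5-min window of 3000 points) and setting flagged points to NA before the next, more permissive pass; it is an illustration, not the authors' implementation.

```r
vm97_despike <- function(x, width = 3000, z0 = 3.5, z_step = 0.1, max_iter = 20) {
  spikes <- logical(length(x))
  half   <- floor(width / 2)
  z      <- z0
  for (pass in seq_len(max_iter)) {
    found <- FALSE
    for (t in seq_along(x)) {
      idx <- max(1, t - half):min(length(x), t + half)
      m   <- mean(x[idx], na.rm = TRUE)
      s   <- sd(x[idx], na.rm = TRUE)
      if (!spikes[t] && !is.na(x[t]) && !is.na(s) && abs(x[t] - m) > z * s) {
        spikes[t] <- TRUE
        found     <- TRUE
      }
    }
    x[spikes] <- NA      # exclude detected spikes from subsequent passes
    if (!found) break    # stop when a full pass finds no new spikes
    z <- z + z_step      # raise the threshold by 0.1 for the next pass
  }
  spikes
}
```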
A revised version of the VM97 procedure was proposed by Metzger et al.14 (hereinafter M12), who suggested replacing the mean and standard deviation with more robust estimates, namely the median and the median absolute deviation (MAD), respectively. The authors found that this method reliably removed spikes that were not detected by VM97, showing superior performance.
To reduce the high computational burden of the windowed computations prescribed by the VM97 algorithm, Mauder et al.7 (hereinafter M13) proposed estimating the median and MAD over the whole flux averaging period (usually 30 or 60 min). M13 suggested flagging as spikes those observations deviating from the median by more than \(\pm 7\cdot\)MAD. This approach was selected as the candidate method in the data processing scheme of the ICOS ecosystem stations15.
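The following sketch illustrates the M13-style rule for a single flux averaging period, using the unscaled MAD as in the description above; it is an illustration, not the authors' code.

```r
## Flag points deviating from the block median by more than k * MAD
## (k = 7 as suggested by M13), for one 30- or 60-min averaging period.
m13_despike <- function(x, k = 7) {
  med   <- median(x, na.rm = TRUE)
  mad_x <- mad(x, constant = 1, na.rm = TRUE)   # raw (unscaled) MAD
  abs(x - med) > k * mad_x
}
```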
Starkenburg et al.9 recommended the approach developed by Brock16 (hereinafter BR86) as the best method for despiking EC data. This algorithm is currently implemented in the processing pipeline adopted by the National Ecological Observatory Network (NEON, https://www.neonscience.org). It is based on a two-stage procedure: the first step consists of extracting the signal by means of a rolling third-order median filter, which replaces the center value in the window with the median of all points within the window; the second step identifies spikes by analyzing the histogram of the differences between the raw signal and the median-filtered signal. Specifically, the differences are initially binned into 25 classes. The first bins with zero counts on either side of the histogram are then identified, and points in the original signal that lie beyond the empty bins are flagged as spikes. If no bin with zero counts is found, the number of bins is doubled (for example from 25 to 51, with one bin added to retain an odd number, because the mean of the differences, which is expected to be close to zero, should fall into the central bin of the histogram). The procedure is iterated, increasing the number of bins, until the bin width falls below the resolution of the acquiring instrument.
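A sketch of a BR86-style detection step follows, under stated assumptions: a 3-point running median for signal extraction, 25 starting bins, and a user-supplied instrument resolution; this is not the NEON implementation.

```r
br86_despike <- function(x, resolution = 0.01, n_bins = 25) {
  signal <- stats::runmed(x, k = 3)          # third-order rolling median filter
  d      <- x - signal                       # raw minus median-filtered signal
  repeat {
    breaks <- seq(min(d), max(d), length.out = n_bins + 1)
    counts <- hist(d, breaks = breaks, plot = FALSE)$counts
    centre <- ceiling(n_bins / 2)            # central bin (mean of d ~ 0)
    lo <- which(counts[1:centre] == 0)                     # empty bins, left side
    hi <- which(counts[centre:n_bins] == 0) + centre - 1   # empty bins, right side
    if (length(lo) > 0 || length(hi) > 0) {
      lo_thr <- if (length(lo) > 0) breaks[max(lo)]     else -Inf
      hi_thr <- if (length(hi) > 0) breaks[min(hi) + 1] else  Inf
      return(d < lo_thr | d > hi_thr)        # points beyond the first empty bins
    }
    n_bins <- 2 * n_bins + 1                 # double the bins, keeping an odd count
    if (diff(range(d)) / n_bins < resolution) return(rep(FALSE, length(x)))
  }
}
```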
The proposed despiking algorithm
In order to define a modeling framework suitable for representing a sequence \((x_t)_{t \in \mathbb{Z}}\) of observed raw EC data, indexed by time t and contaminated by spikes, we assume the following component model:
$$\begin{aligned} x_t = \mu_t + v_t + s_t, \end{aligned}$$
(1)
where \(\mu_t\) denotes the low-frequency component (signal); \(v_t\) the deviations from the signal level (residuals), whose variability \(\sigma_t^2\) is allowed to change slowly over time; and \(s_t\) the spike-generating mechanism, which is zero most of the time but occasionally generates large absolute values.
To achieve unbiased estimates of both the signal and the scale parameter \(\sigma_t\) when data are contaminated by errors, the use of robust estimators is required. One of the most popular measures of robustness of a statistical procedure is the breakdown point, which represents the proportion of outlying data points an estimator can resist before giving a biased result. The maximum breakdown point is 50%, since, if more than half of the observations are contaminated, it is not possible to distinguish between the distribution of good data and the distribution of outlying data. In these terms, the arithmetic mean has a breakdown point of 0% (i.e. the mean can be made arbitrarily large by changing a single data point), whereas the median has a breakdown point of 50% (i.e. it becomes biased only when 50% or more of the data are large outliers).
The proposed despiking procedure (hereinafter RobF) makes use of robust functionals whose breakdown point is 50% and consists of three stages (see Fig. 1). In the first step, the signal \(\mu_t\) is extracted by means of the repeated median (RM) regression technique10,17. The second step involves the estimation of the time-varying scale parameter \(\sigma_t\) by means of the \(Q_n\) estimator12. A detailed description of these robust functionals is provided in the following sections. Spikes are detected in the third step, through the examination of outlier scores calculated as:
$$\begin{aligned} z_t=\frac{x_t-\mu_t}{\sigma_t}. \end{aligned}$$
(2)
Any value of \(|z_t|\) exceeding a pre-fixed threshold \(z_{th}\) is considered a spike. The choice of the threshold should be based on the distribution of the outlier scores, which can vary over time. In this work \(z_{th}\) was set equal to 5, which means that for Normal- and Laplace-distributed data there is a 1 in 3.5 million and a 1 in 300 chance, respectively, that an anomalous value is the result of a statistical fluctuation over the spectrum of plausible values. Once detected, spikes are removed and replaced by the \(\mu_t\) estimates obtained by the RM filter.
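A minimal sketch of this detection-and-replacement stage is shown below, assuming pre-computed vectors of signal (mu) and scale (sigma) estimates; the function name and interface are illustrative.

```r
robf_flag <- function(x, mu, sigma, z_th = 5) {
  z     <- (x - mu) / sigma       # outlier scores, Eq. (2)
  spike <- abs(z) > z_th          # threshold rule
  x[spike] <- mu[spike]           # replace spikes with the RM signal estimate
  list(x = x, spike = spike, z = z)
}
```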
Repeated median filter
The idea underlying moving-time-window approaches is to approximate the signal underlying the observed data by means of local estimates of the data level at the center of the window.
To this end, we fit a local linear trend11 of the form
$$\begin{aligned} \mu_{t+i}=\mu_t+i\beta_t, \quad i=-k,\ldots,k, \quad \mathrm{to} \quad \{x_{t-k},\ldots,x_{t+k}\}, \end{aligned}$$
(3)
where k is the parameter defining the time window of length \(n=2k+1\), whereas \(\mu_t\) and \(\beta_t\) are estimated by means of the RM filter10 as
$$\begin{aligned} \tilde{\mu}_t^{RM}&=\text{med}\bigl(x_{t-k}+k\tilde{\beta}_t,\ldots,x_{t+k}-k\tilde{\beta}_t\bigr),\\ \tilde{\beta}_t^{RM}&=\text{med}_{i=-k,\ldots,k} \Bigl(\text{med}_{j=-k,\ldots,k,\, j \ne i}\, \frac{x_{t+i}-x_{t+j}}{i-j}\Bigr). \end{aligned}$$
(4)
The only parameter required for the application of the RM filter is k, which controls how many neighbouring points are included in the estimation of \(\mu_t\). Its choice depends not only on the characteristics of the time series, but also on the situations the procedure needs to handle. For despiking purposes, k has to be chosen as a trade-off between the duration of periods over which trends can be assumed to be approximately linear and the maximum number of consecutive outliers the estimator can resist before returning biased results.
Results of previous studies18 evaluating the performance of the RM filter in the removal of patches of impulsive noise showed that the RM resists up to 30% of consecutive outliers without being substantially affected. Therefore, the minimal window width should be larger than three times the maximal length of the outlier patches to be removed.
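For illustration, signal extraction with the RM filter can be carried out with the robfilter package as in the sketch below (a window of 51 points, i.e. k = 25, is assumed); the exact output structure may vary with the package version.

```r
library(robfilter)

width <- 51                                   # n = 2k + 1, here k = 25 (5 s at 10 Hz)
fit   <- robreg.filter(x, width = width, method = "RM")
mu_rm <- fit$level$RM                         # repeated median estimates of the local level
```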
Accordingly, the optimal time window width is selected through a preliminary analysis of the data distribution. Specifically, the time series undergoes a preliminary de-trending procedure in which the trend is approximated by a fifth-degree polynomial whose parameters are estimated via iteratively re-weighted least squares (IWLS) regression. The optimal window width is then set equal to 4 times the maximum number of values exceeding \(\pm 3\cdot s_g\) in 30 s intervals, where \(s_g\) is the (global) standard deviation estimated by the \(Q_n\) estimator on the de-trended data. To prevent cases where few or no data exceed the threshold values, a minimum window width of 5 s is imposed (i.e. 51 time steps for data sampled at a 10 Hz acquisition frequency).
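The window-width selection can be sketched as follows, with MASS::rlm used here as a stand-in for the IWLS polynomial regression and 10 Hz sampling assumed; the authors' exact implementation may differ.

```r
library(MASS)        # rlm: robust regression fitted by IWLS
library(robustbase)  # Qn scale estimator

select_width <- function(x, fs = 10) {
  tt   <- seq_along(x)
  fit  <- rlm(x ~ poly(tt, 5))               # robust fifth-degree polynomial detrend
  res  <- residuals(fit)
  s_g  <- Qn(res)                            # global robust scale of de-trended data
  blocks <- split(res, ceiling(tt / (30 * fs)))          # 30-s intervals
  n_exc  <- sapply(blocks, function(b) sum(abs(b) > 3 * s_g))
  width  <- max(4 * max(n_exc), 5 * fs + 1)  # at least 5 s (51 steps at 10 Hz)
  if (width %% 2 == 0) width <- width + 1    # keep an odd window length
  width
}
```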
\(Q_n\) scale estimator
Beyond the ability of the filter adopted for signal extraction, the effectiveness of a despiking strategy also depends on the robustness of the scale parameter \(\sigma_t\), which is of fundamental importance for deriving the outlier scores. Raw EC time series cannot be assumed to be identically distributed, as variability may change over time owing to changes in turbulence regimes and heterogeneity of the flux footprint area. In such situations, global estimates of the scale parameter are unrepresentative of the local variability and, consequently, the spike detection procedure becomes ineffective. To cope with this feature, the scale parameter \(\sigma_t\) was estimated in rolling time windows whose width was set equal to that adopted for the signal extraction. As a robust estimate of \(\sigma_t\), we used the \(Q_n\) estimator12
$$\begin{aligned} Q_n=2.2219\,\{|x_i-x_j|;\, i<j\}_{(q)}, \end{aligned}$$
(5)
with \(q=\binom{h}{2}\approx\binom{n}{2}/4\) and \(h=\lfloor \frac{n}{2} \rfloor +1\), where \(\lfloor \cdot \rfloor\) rounds down to the nearest integer19. This scale estimator corresponds approximately to the first quartile of all pairwise distances between any two data points. Compared to other robust scale estimators, such as the MAD and the interquartile distance (IQD), it is a location-free estimator, i.e. it does not implicitly rely on a symmetric distribution. Moreover, its efficiency is higher, especially when identical measurements occur, e.g. due to the limited resolution of the measuring instruments13.
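A rolling \(Q_n\) estimate can be sketched as follows, assuming the robustbase implementation and the zoo package for the moving window, with the same width used for the RM signal extraction.

```r
library(robustbase)   # Qn()
library(zoo)          # rollapply()

width   <- 51
sigma_t <- rollapply(x, width = width, FUN = Qn,
                     align = "center", partial = TRUE)
```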
Software implementation
The application of the robust estimators previously described is hampered by their high computational demands20. For example, the slope \(\tilde{\beta}_t^{RM}\) within the time window is estimated by taking repeated medians of pairwise slopes, namely an inner median with one observation held fixed, followed by the median of all these inner medians. As a consequence, the RM estimator may be computationally intensive when applied to large time windows. To cope with this problem, we used the algorithm developed by Bernholt et al.17 and implemented in the R package robfilter21, which updates the RM filter using stored information from the previous time window. Since estimates do not have to be recalculated from scratch for each window, the computational demand is significantly reduced. For the \(Q_n\) estimator, the implementation available in the R package robustbase22 was used.
To further reduce the computational time, the proposed despiking procedure was run in parallel mode. The parallelization consists of the simultaneous processing of 5-min segments of the time series (provided the window width is less than 1 min). Such an improvement makes the computation of both the RM filter and the \(Q_n\) scale estimator much faster and therefore suitable for the analysis of high-dimensional data such as raw EC time series.
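The parallel scheme can be sketched as follows, assuming 10 Hz data split into 5-min chunks of 3000 points, a Unix-alike system for parallel::mclapply, and a hypothetical despike_chunk() wrapper around the RM/\(Q_n\) stages outlined above.

```r
library(parallel)

chunks   <- split(x, ceiling(seq_along(x) / 3000))          # 5-min segments at 10 Hz
despiked <- mclapply(chunks, despike_chunk,                 # despike_chunk(): hypothetical
                     mc.cores = max(1, detectCores() - 1))  # wrapper for the RM/Qn stages
x_clean  <- unlist(despiked, use.names = FALSE)
```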
Monte Carlo experiment
Simulation design
Performance evaluation of despiking algorithms is a difficult task because the true labels (good data/spike) of individual data points are not always available. To overcome this limitation, we carried out a Monte Carlo simulation study. Simulations provide full control of the spike-generating mechanism and, consequently, make a more objective performance evaluation of the despiking algorithms feasible.
To mimic the stochastic properties of raw, high-frequency EC data as closely as possible, synthetic time series were simulated from autoregressive integrated moving average (ARIMA(p,d,q)) processes with a time-varying error structure, the latter generated by means of the component generalized autoregressive conditional heteroskedasticity (CGARCH) model23. The ARIMA modeling framework simulates the conditional mean process, whereas the CGARCH model mimics the non-constant conditional variance process.
The basic form of an ARIMA(p,d,q) process can be written as:
$$\begin{aligned} \nabla^d X_t = c + \sum_{i=1}^p \phi_i \nabla^d X_{t-i} + \sum_{j=0}^q \theta_j \varepsilon_{t-j}, \end{aligned}$$
(6)
where p is the order of the AR part and q is the order of the MA part, \(\nabla^d\) denoting the difference operator of order d. The process \(X_t\) is stationary if and only if \(d=0\), in which case it reduces to an ARMA(p,q) process. If \(d>0\), then the process \(X_t\) is said to be integrated of order d, meaning that \(X_t\) needs to be differenced d times to achieve stationarity. To allow for heteroskedasticity, we assume that \(\varepsilon_t= \sigma_t e_t\), where \(e_t\) is a sequence of independently and identically distributed variables with mean 0 and variance 1, and \(\sigma_t^2\) is the conditional variance, which is allowed to vary with time.
The latter was simulated by means of a CGARCH process, which can be written as:
$$\begin{aligned} \sigma_t^2&=q_t + \sum_{i=1}^r \alpha_i (\varepsilon_{t-i}^2 - q_{t-i}) + \sum_{j=1}^s \beta_j (\sigma_{t-j}^2 -q_{t-j}),\\ q_t&=\omega + \eta_{11} q_{t-1} + \eta_{21} (\varepsilon_{t-1}^2 - \sigma_{t-1}^2), \end{aligned}$$
(7)
where \(\omega\), \(\alpha_i\), \(\beta_j\), \(\eta_{11}\), \(\eta_{21}\) are strictly positive coefficients; \(q_t\) is the permanent (long-run) component of the conditional variance, allowed to vary with time following first-order autoregressive dynamics. The difference between the conditional variance and its trend, \(\sigma_t^2 - q_t\), is the transitory (short-run) component of the conditional variance. The conditions ensuring non-negativity of the conditional variance23 are related to the stationarity conditions \(\alpha_i + \beta_j < 1\) and \(\eta_{11} < 1\) (these quantities provide a measure of the persistence of the transitory and permanent components, respectively).
Model order specification and parameter estimation were performed by analyzing real EC data (more details are provided in the “Results and discussion” section). With this modelling framework, we simulated 18,000 values, corresponding to raw EC data sampled at a 10 Hz scanning frequency over a 30-min interval. Simulations were executed in the R v.4.0.2 programming environment using the tools implemented in the rugarch package24.
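As an illustration of this simulation step, the sketch below uses rugarch with an ARMA(2,2) conditional mean and a csGARCH(1,1) variance and purely illustrative parameter values (parameter names follow rugarch's csGARCH specification); in the study, model orders and parameters were estimated from real EC data.

```r
library(rugarch)

spec <- ugarchspec(
  variance.model = list(model = "csGARCH", garchOrder = c(1, 1)),
  mean.model     = list(armaOrder = c(2, 2), include.mean = TRUE),
  distribution.model = "norm",
  fixed.pars = list(mu = 0, ar1 = 0.5, ar2 = 0.2, ma1 = 0.3, ma2 = 0.1,
                    omega = 0.01, alpha1 = 0.05, beta1 = 0.85,
                    eta11 = 0.99, eta21 = 0.05)   # illustrative values only
)

sim <- ugarchpath(spec, n.sim = 18000, m.sim = 1)  # one 30-min series at 10 Hz
x   <- as.numeric(fitted(sim))                     # simulated series with time-varying variance
```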
Once simulated, the synthetic time series were intentionally corrupted with 180 spiky data points (1% of the 18,000-point sample). Two macro-scenarios were considered. In the first scenario (S1), isolated or consecutive spike events of short duration were generated. In particular, 180 spike locations were randomly selected so as to obtain 30 single spikes, 30 double (two consecutive points) spike events, and 30 triple (three consecutive points) spike events. In the second scenario (S2), the time series were instead contaminated by impulsive peaks of longer duration. To this end, spike locations were obtained by randomly selecting five blocks of 50 consecutive data points. Once located, spikes were generated by multiplying the corresponding time series values (after mean removal) by a factor of 10, so as to obtain magnitudes similar to those commonly encountered in real, observed EC data. To simulate consecutive spike events as required by the S2 scenario, the generated spiky data points were taken in absolute terms. Each scenario was replicated 99 times.
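The S1 contamination scheme might be sketched as follows, under the assumptions above (amplification of the mean-removed value by a factor of 10, with 30 single, 30 double and 30 triple events); block placement is simplified and overlaps between events are not excluded.

```r
contaminate_s1 <- function(x, n_single = 30, n_double = 30, n_triple = 30) {
  x_c   <- x
  truth <- logical(length(x))
  lengths <- c(rep(1, n_single), rep(2, n_double), rep(3, n_triple))
  starts  <- sample(seq_len(length(x) - 3), length(lengths))
  for (i in seq_along(starts)) {
    idx <- starts[i]:(starts[i] + lengths[i] - 1)
    x_c[idx]   <- (x[idx] - mean(x)) * 10    # amplify the mean-removed values
    truth[idx] <- TRUE                       # ground-truth spike labels
  }
  list(x = x_c, truth = truth)
}
```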
Metrics
The ability of the despiking algorithms was assessed by comparing the artificial spikes inserted into the time series with the spikes identified by each method. More specifically, with reference to the \(2\times 2\) confusion matrix reported in Table 1, a valid despiking procedure maximizes decisions of type true positive (TP) while, at the same time, keeping decisions of type false negative (FN) and false positive (FP) at the lowest possible levels. This trade-off can be measured in terms of Precision and Recall, which are commonly used for measuring the effectiveness of set-based retrieval25. For any given threshold value, the Precision is defined as the fraction of reported spikes that truly turn out to be spikes:
$$\begin{aligned} \text{Precision}=\frac{\text{TP}}{\text{TP}+\text{FP}}, \end{aligned}$$
(8)
while the Recall is correspondingly defined as the fraction of ground-truth spikes that have been reported as spikes:
$$\begin{aligned} \text{Recall}=\frac{\text{TP}}{\text{TP}+\text{FN}}. \end{aligned}$$
(9)
As a measure that combines Precision and Recall, we consider the balanced F1-Score, the harmonic mean of the two indices above, given by:
$$\begin{aligned} \text{F1-Score}=2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}. \end{aligned}$$
(10)
We have \(0\le \text{F1-Score} \le 1\), where 0 implies that no spikes are detected and 1 indicates that all, and only, the spikes are detected. The closer the F1-Score is to 1, the greater the effectiveness of the despiking method.
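A minimal sketch of these metrics, assuming logical vectors of ground-truth and detected spikes (function and variable names are illustrative):

```r
spike_metrics <- function(truth, detected) {
  tp <- sum(truth & detected)     # true positives
  fp <- sum(!truth & detected)    # false positives
  fn <- sum(truth & !detected)    # false negatives
  precision <- tp / (tp + fp)
  recall    <- tp / (tp + fn)
  f1        <- 2 * precision * recall / (precision + recall)
  c(Precision = precision, Recall = recall, F1 = f1)
}
```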
In addition to the previously outlined metrics, the variance of each (simulated) uncorrupted time series was compared with the variance estimated after application of the despiking procedure.
For an overall evaluation of the performance of the despiking algorithms, the Friedman test26, at a significance level of \(\alpha =0.05\), was applied, followed by a post-hoc test based on the procedure introduced by Nemenyi27. The Friedman test is a non-parametric statistical test, equivalent to a repeated-measures ANOVA, which can be used to compare the performances of several algorithms28. Its null hypothesis is that there are no significant differences between the performances of the considered algorithms. Provided that significant differences are detected by the Friedman test (that is, the null hypothesis is rejected), the Nemenyi test can be used for pairwise multiple comparisons of the considered algorithms. The Nemenyi test is similar to the post-hoc Tukey test for ANOVA, and its output consists of a critical difference (CD) threshold. To this end, ranks are assigned to the algorithms: for each data set, the best-performing algorithm gets the lowest (best) rank, and ranks are then averaged across data sets. The mean performance of two despiking algorithms is judged to be significantly different if the corresponding average ranks differ by at least the critical difference (the graphical output of the Nemenyi test was produced using tools provided in the R package tsutils, https://CRAN.R-project.org/package=tsutils).
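The overall comparison can be sketched as follows, assuming a matrix perf of F1-Scores with one row per simulated series (block) and one column per despiking algorithm; this is an illustration of the testing workflow rather than the exact analysis script.

```r
library(tsutils)

perf <- as.matrix(perf)              # rows: simulated series, columns: algorithms
friedman.test(perf)                  # base R Friedman rank-sum test
nemenyi(perf, conf.level = 0.95)     # post-hoc Nemenyi test with mean ranks and CD
```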