Experimental data
The dataset used in this study is the global long-term air quality indicator data of 5577 regions from 2010 to 2014 extracted by Betancourt et al.14 based on the TOAR database (https://gitlab.jsc.fz-juelich.de/esde/machine-learning/aq-bench/-/blob/master/resources/AQbench_dataset.csv)29. As shown in Fig. 3, the monitoring sites span 15 regions and are mainly concentrated in North America (NAM), Europe (EUR) and East Asia (EAS). The dataset includes geographical information for each monitoring site, such as longitude, latitude, region and altitude, as well as site environment information, such as population density, nighttime light intensity and vegetation coverage. Because factors such as the intensity of industrial and human activity are difficult to quantify directly, environmental variables such as average nighttime light intensity and population density serve as proxies for them. The ozone indicator is built from hourly ozone concentrations recorded at air quality observation points in each region, aggregating each site's ozone time series over one-year periods into a single value; the long aggregation period averages out short-term weather fluctuations. The experimental data comprise 35 input variables, including 4 categorical attributes and 31 continuous attributes. The target variable is the average ozone concentration in each region from 2010 to 2014. The specific variable names and descriptions14 are given in the supplementary materials. Four-fifths of the samples were used as the training set and one-fifth as the test set.
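A minimal sketch of loading the AQ-Bench dataset and creating the 4/5 : 1/5 split described above is given below. The raw-file form of the repository URL and the target column name are assumptions; consult the dataset documentation for the exact schema.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Raw-file form of the repository URL cited above (path assumed).
URL = ("https://gitlab.jsc.fz-juelich.de/esde/machine-learning/aq-bench/"
       "-/raw/master/resources/AQbench_dataset.csv")

df = pd.read_csv(URL)

# Assumed name for the 2010-2014 average ozone target column.
TARGET = "o3_average_values"
X = df.drop(columns=[TARGET])
y = df[TARGET]

# Four-fifths of the samples for training, one-fifth for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
print(X_train.shape, X_test.shape)
```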
Results of BO-XGBoost-RFE
For feature selection, XGBoost-RFE is combined with fivefold cross-validation: the feature set retained at each RFE stage is evaluated by cross-validation, with the mean absolute error (MAE) as the criterion, to determine the number of features that yields the lowest MAE. At the same time, a Bayesian optimization algorithm tunes the hyperparameters of XGBoost-RFE, yielding the feature subset with the lowest cross-validated MAE. The main parameters of the XGBoost model in this article are learning_rate, n_estimators, max_depth, gamma, reg_alpha, reg_lambda, colsample_bytree, and subsample; all parameters used in the model are listed in the supplementary material. Within the given parameter ranges, Bayesian optimization was run for 100 iterations with the MAE of the XGBoost-RFE fivefold cross-validation as the objective function, giving the hyperparameter combination with the lowest MAE and the corresponding optimal feature subset. The iterative process of Bayesian optimization is shown in Fig. 4.
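The sketch below illustrates one way to implement this loop, using scikit-optimize's Gaussian-process minimizer as the Bayesian optimizer and scikit-learn's RFECV for the recursive elimination with fivefold cross-validation; it reuses `X_train` and `y_train` from the data-loading sketch. The parameter ranges are illustrative (the paper's ranges are in Table 1), and this is an assumed implementation, not necessarily the authors' exact code.

```python
import numpy as np
from skopt import gp_minimize
from skopt.space import Real, Integer
from sklearn.feature_selection import RFECV
from xgboost import XGBRegressor

PARAM_NAMES = ["learning_rate", "n_estimators", "max_depth", "gamma",
               "reg_alpha", "reg_lambda", "colsample_bytree", "subsample"]

# Illustrative search ranges; the actual ranges are given in Table 1.
space = [
    Real(0.01, 0.3),    # learning_rate
    Integer(50, 500),   # n_estimators
    Integer(3, 10),     # max_depth
    Real(0.0, 5.0),     # gamma
    Real(0.0, 1.0),     # reg_alpha
    Real(0.0, 1.0),     # reg_lambda
    Real(0.5, 1.0),     # colsample_bytree
    Real(0.5, 1.0),     # subsample
]

def objective(params):
    kwargs = dict(zip(PARAM_NAMES, params))
    kwargs["n_estimators"] = int(kwargs["n_estimators"])
    kwargs["max_depth"] = int(kwargs["max_depth"])
    # Recursive feature elimination scored by fivefold cross-validated MAE.
    rfe = RFECV(XGBRegressor(**kwargs), step=1, cv=5,
                scoring="neg_mean_absolute_error")
    rfe.fit(X_train, y_train)
    # Return the lowest CV MAE over all candidate subset sizes.
    return -rfe.cv_results_["mean_test_score"].max()

# 100 iterations of Bayesian optimization, as in the text.
result = gp_minimize(objective, space, n_calls=100, random_state=42)
print("lowest CV MAE:", result.fun)

# Refit RFE with the best hyperparameters to recover the selected subset.
best = dict(zip(PARAM_NAMES, result.x))
best["n_estimators"] = int(best["n_estimators"])
best["max_depth"] = int(best["max_depth"])
final_rfe = RFECV(XGBRegressor(**best), step=1, cv=5,
                  scoring="neg_mean_absolute_error").fit(X_train, y_train)
selected_features = list(X_train.columns[final_rfe.support_])
```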
The parameter range and optimized value of XGBoost-RFE are shown in Table 1. The XGBoost-RFE feature selection results under the above optimized hyperparameters are shown in Fig. 5. The number of features in the feature subset with the lowest mean absolute error is 22, and the MAE is 2.410.
Additionally, the XGBoost-RFE feature selection model without Bayesian optimization is compared with the algorithm in this study. The default parameters of the underlying XGBoost model are learning_rate = 0.3, max_depth = 6, gamma = 0, colsample_bytree = 1, subsample = 1, reg_alpha = 0, and reg_lambda = 1. The comparison results are shown in Table 2. Without parameter tuning, the cross-validated MAE of XGBoost-RFE is larger than that of the algorithm in this study, and the resulting feature subset also has a higher dimension.
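For reference, the untuned baseline of Table 2 can be reproduced by running the same elimination procedure with XGBoost left at its library defaults; a minimal sketch, again assuming the `X_train`/`y_train` split from above:

```python
from sklearn.feature_selection import RFECV
from xgboost import XGBRegressor

baseline = XGBRegressor()  # library defaults, e.g. learning_rate=0.3, max_depth=6
rfe_base = RFECV(baseline, step=1, cv=5, scoring="neg_mean_absolute_error")
rfe_base.fit(X_train, y_train)
print("features kept:", rfe_base.n_features_,
      "| CV MAE:", -rfe_base.cv_results_["mean_test_score"].max())
```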
Prediction results
To test the prediction accuracy of the prediction model with the optimal subset obtained by BO-XGBoost-RFE, three indexes, MAE, RMSE and R2, are used to evaluate the prediction results, and the expressions are as follows:
$$MAE = \frac{1}{n}\sum_{i = 1}^{n} \left| y_{i} - \widehat{y_{i}} \right|$$

(8)

$$RMSE = \sqrt{\frac{1}{n}\sum_{i = 1}^{n} \left( y_{i} - \widehat{y_{i}} \right)^{2}}$$

(9)

$$R^{2} = 1 - \frac{\sum_{i = 1}^{n} \left( \widehat{y_{i}} - y_{i} \right)^{2}}{\sum_{i = 1}^{n} \left( y_{i} - \overline{y} \right)^{2}}$$

(10)
Here n indicates the number of samples, \(y_{i}\) is the true value, \(\widehat{y_{i}}\) is the predicted value, and \(\overline{y}\) indicates the mean of the observed values.
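These three metrics (Eqs. 8–10) can be computed directly with scikit-learn, as in the following helper used in the later sketches:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def evaluate(y_true, y_pred):
    """Return (MAE, RMSE, R^2) for a set of predictions."""
    mae = mean_absolute_error(y_true, y_pred)
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    r2 = r2_score(y_true, y_pred)
    return mae, rmse, r2
```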
The Bayesian-optimized XGBoost-RFE feature selection algorithm in this study is compared with using the full feature set and with features selected by the Pearson correlation coefficient, which measures the linear correlation between two variables. For the Pearson-based selection, features whose correlation with the target variable was below 0.1 were removed, and from each pair of features with a mutual correlation above 0.9, one was deleted to avoid multicollinearity.
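One way to implement this Pearson filter is sketched below (an assumed implementation of the rule just described; categorical attributes are assumed to be numerically encoded):

```python
import pandas as pd

def pearson_filter(X: pd.DataFrame, y: pd.Series,
                   target_thresh: float = 0.1,
                   collinear_thresh: float = 0.9) -> list:
    # Drop features weakly correlated with the target (|r| < 0.1).
    corr_with_target = X.corrwith(y).abs()
    kept = corr_with_target[corr_with_target >= target_thresh].index.tolist()

    # From each pair with |r| > 0.9, keep the first and drop the second.
    corr = X[kept].corr().abs()
    to_drop = set()
    for i, a in enumerate(kept):
        for b in kept[i + 1:]:
            if a not in to_drop and b not in to_drop \
                    and corr.loc[a, b] > collinear_thresh:
                to_drop.add(b)
    return [c for c in kept if c not in to_drop]
```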
XGBoost, random forest, support vector regression (SVR), and KNN were used to predict ozone concentration with the full feature set, the features selected by Pearson's correlation coefficient, and the features selected by BO-XGBoost-RFE. Using the evaluation metrics described above, the prediction performance of the four models on the three feature sets can be compared before and after dimensionality reduction. The MAE, RMSE and R2 results of each prediction model are shown in Table 3.
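A sketch of this comparison grid is shown below; it assumes `pearson_filter`, `evaluate`, and the RFE subset `selected_features` from the earlier sketches, with default model settings standing in for whatever configurations the study actually used.

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor
from xgboost import XGBRegressor

models = {
    "XGBoost": XGBRegressor(),
    "Random forest": RandomForestRegressor(random_state=42),
    "SVR": SVR(),
    "KNN": KNeighborsRegressor(),
}
feature_sets = {
    "full": list(X_train.columns),
    "Pearson": pearson_filter(X_train, y_train),
    "BO-XGBoost-RFE": selected_features,  # subset from the RFE stage above
}

# Fit every model on every feature set and report MAE, RMSE and R^2.
for fs_name, cols in feature_sets.items():
    for m_name, model in models.items():
        model.fit(X_train[cols], y_train)
        mae, rmse, r2 = evaluate(y_test, model.predict(X_test[cols]))
        print(f"{m_name:14s} | {fs_name:14s} | "
              f"MAE={mae:.3f} RMSE={rmse:.3f} R2={r2:.3f}")
```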
Among the four prediction models, random forest has the lowest MAE and RMSE and the highest R2 on all three feature sets and therefore the best prediction performance. For all four models, the prediction accuracy based on the Pearson correlation features is lower than that based on BO-XGBoost-RFE, indicating that selecting features by correlation alone cannot accurately extract the important variables. Although the RMSE of the support vector regression model based on BO-XGBoost-RFE is slightly worse than that based on the full features, the prediction accuracy of XGBoost, random forest and KNN after BO-XGBoost-RFE feature selection is higher than that based on either the full features or the Pearson correlation. Random forest obtained the highest prediction accuracy: relative to the Pearson-based and full-feature models, its MAE based on BO-XGBoost-RFE is 5.0% and 1.4% lower, its RMSE is reduced by 5.1% and 1.8%, and its R2 is improved by 4.3% and 1.4%, respectively. The XGBoost model achieved the greatest improvement in accuracy: compared with the Pearson-based and full-feature models, its MAE was reduced by 5.9% and 1.7%, its RMSE by 5.2% and 1.7%, and its R2 improved by 4.9% and 1.4%, respectively. This indicates that feature selection based on BO-XGBoost-RFE effectively extracts the important features, improves prediction accuracy across multiple prediction models, and provides better dimensionality reduction.
Figure 6 shows the feature importances obtained from the random forest prediction model, reflecting how strongly each variable influences the prediction of the global multi-year average near-surface ozone concentration. Ranked by importance, the variables with the greatest influence on the prediction results are altitude, relative altitude and latitude, followed by nighttime light intensity within a 5 km radius, population density and nitrogen dioxide concentration, while the proxy variables for vegetation cover have a comparatively weak effect on ozone concentration prediction.
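The importance ranking plotted in Fig. 6 can be extracted from a fitted random forest as follows; a minimal sketch assuming the training split and `selected_features` from the earlier sketches.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(random_state=42)
rf.fit(X_train[selected_features], y_train)

# Impurity-based importance of each selected feature, highest first.
importances = pd.Series(rf.feature_importances_, index=selected_features)
print(importances.sort_values(ascending=False))
```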