LightGBM
Before explaining LightGBM23, it is necessary to introduce XGBoost24, which is likewise based on the gradient boosting decision tree (GBDT) algorithm30. XGBoost integrates multiple classification and regression trees (CARTs) to compensate for the limited prediction accuracy of a single CART. As an improved boosting algorithm based on GBDT, it is popular for its high processing speed, high regression accuracy and ability to handle large-scale data31. However, XGBoost uses a presorting algorithm to find split points, which consumes considerable memory during computation and seriously hinders cache optimization.
LightGBM improves on XGBoost. It uses a histogram algorithm to find the best split point, which occupies less memory and lowers the complexity of finding splits. The flow of the histogram algorithm for finding the optimal split point is shown in Fig. 3.
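The following is a minimal NumPy sketch of the idea behind histogram-based split finding for a single feature; the binning scheme, gain formula and function name are illustrative assumptions and do not reproduce LightGBM's actual implementation.

```python
import numpy as np

def histogram_best_split(feature, gradients, hessians, n_bins=255, reg_lambda=1.0):
    """Illustrative sketch of histogram-based split finding for one feature."""
    # 1. Bucket continuous values into at most n_bins discrete bins.
    edges = np.quantile(feature, np.linspace(0, 1, n_bins + 1)[1:-1])
    bins = np.searchsorted(edges, feature)

    # 2. Build the histogram: per-bin sums of gradients and hessians.
    g_hist = np.bincount(bins, weights=gradients, minlength=n_bins)
    h_hist = np.bincount(bins, weights=hessians, minlength=n_bins)
    g_total, h_total = g_hist.sum(), h_hist.sum()

    # 3. Scan the bins once and keep the candidate split with the largest gain.
    best_gain, best_bin = -np.inf, None
    g_left = h_left = 0.0
    for b in range(n_bins - 1):
        g_left += g_hist[b]
        h_left += h_hist[b]
        g_right, h_right = g_total - g_left, h_total - h_left
        gain = (g_left ** 2 / (h_left + reg_lambda)
                + g_right ** 2 / (h_right + reg_lambda)
                - g_total ** 2 / (h_total + reg_lambda))
        if gain > best_gain:
            best_gain, best_bin = gain, b
    return best_bin, best_gain
```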
Moreover, LightGBM abandons the level-wise decision tree growth strategy used by most GBDT tools in favor of a leaf-wise algorithm with a depth limit. This leaf-by-leaf growth strategy reduces the loss more per split and achieves better accuracy. However, decision trees in boosting algorithms may grow too deep during training, leading to overfitting. LightGBM therefore adds a maximum depth limit to the leaf-wise growth strategy to prevent this while maintaining high computational efficiency. In summary, LightGBM is well suited to industrial practice in terms of both accuracy and speed, and it is also a very suitable base model for our tide level prediction task. The level-by-level and leaf-by-leaf growth strategies are shown in Fig. 4.
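For reference, the leaf-wise strategy and its depth limit are exposed through LightGBM's standard parameters; the values below are illustrative, not the ones used in this study (those are given later).

```python
import lightgbm as lgb

# Leaf-wise growth is LightGBM's default; num_leaves bounds the number of leaves
# grown per tree, and max_depth caps the tree depth to limit overfitting.
model = lgb.LGBMRegressor(
    boosting_type="gbdt",
    num_leaves=31,      # leaf-wise: always grow the leaf with the largest loss reduction
    max_depth=8,        # depth limit on the leaf-wise strategy
    learning_rate=0.05,
    n_estimators=100,
)
```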
CNN-BiGRU
Convolutional neural network
A convolutional neural network (CNN) is a deep feedforward neural network characterized by local connections and weight sharing. It was first used in the field of computer vision, where it achieved great success32,33. In recent years, CNNs have also been widely used in time series processing. For example, Bai et al.34 proposed the temporal convolutional network (TCN), built from convolutional layers and residual connections, which performs no worse than recurrent neural networks such as LSTM on some time series analysis tasks. A convolutional neural network is generally composed of convolution layers, pooling layers and a fully connected layer; its structure is shown in Fig. 5. A pooling layer is usually placed after the convolution layers. A maximum pooling layer retains the strong features produced by the convolution operation and discards the weak ones, reducing the number of parameters in the network and helping to avoid overfitting.
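As a small illustration of the pooling step, max pooling with a window of 2 keeps only the strongest response in each window (the values here are arbitrary):

```python
import numpy as np

feature_map = np.array([0.1, 0.9, 0.3, 0.2, 0.8, 0.05])
pooled = feature_map.reshape(-1, 2).max(axis=1)   # -> [0.9, 0.3, 0.8]
```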
Bidirectional GRU
In previous tide level prediction studies, bidirectional long short-term memory (BiLSTM) networks35 have achieved good results. However, in our subsequent experiments the bidirectional gated recurrent unit (BiGRU) achieved higher prediction accuracy than BiLSTM, so we use the BiGRU network for the subsequent prediction tasks.
The GRU network36 adds a gating mechanism to control information updating in a recurrent neural network. Unlike LSTM, a GRU has only two gates: the update gate \(z_{t}\) and the reset gate \(r_{t}\).
The recurrent unit structure of the GRU network is shown in Fig. 6.
Each unit of GRU is calculated as follows:
$$z_{t}=\sigma \left(W_{z}x_{t}+U_{z}h_{t-1}+b_{z}\right)$$
(7)
$$r_{t}=\sigma \left(W_{r}x_{t}+U_{r}h_{t-1}+b_{r}\right)$$
(8)
$$\widetilde{h}_{t}=\tanh \left(W_{h}x_{t}+U_{h}\left(r_{t}\odot h_{t-1}\right)+b_{h}\right)$$
(9)
$$h_{t}=z_{t}\odot h_{t-1}+\left(1-z_{t}\right)\odot \widetilde{h}_{t}$$
(10)
In the above formulas, \(z_{t}\) denotes the update gate, which controls how much information from the previous state \(h_{t-1}\) (without a nonlinear transformation) is retained when computing the current state \(h_{t}\), as well as how much information \(h_{t}\) accepts from the candidate state \(\widetilde{h}_{t}\). \(r_{t}\) denotes the reset gate, which controls whether the computation of the candidate state \(\widetilde{h}_{t}\) depends on the previous state \(h_{t-1}\). \(\sigma\) is the standard sigmoid activation function, \(\tanh(\cdot)\) is the hyperbolic tangent activation function, and \(\odot\) denotes the Hadamard product. The input weight matrices of the update gate, reset gate, and candidate-state layer are \(W_{z}, W_{r}, W_{h}\); the corresponding recurrent weight matrices are \(U_{z}, U_{r}, U_{h}\); and the corresponding bias vectors are \(b_{z}, b_{r}, b_{h}\).
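A minimal NumPy sketch of one GRU recurrence step following Eqs. (7)–(10) is given below; the parameter names, dictionary layout and shapes are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, params):
    """One GRU recurrence step following Eqs. (7)-(10)."""
    Wz, Uz, bz = params["Wz"], params["Uz"], params["bz"]
    Wr, Ur, br = params["Wr"], params["Ur"], params["br"]
    Wh, Uh, bh = params["Wh"], params["Uh"], params["bh"]

    z_t = sigmoid(Wz @ x_t + Uz @ h_prev + bz)               # update gate, Eq. (7)
    r_t = sigmoid(Wr @ x_t + Ur @ h_prev + br)               # reset gate, Eq. (8)
    h_tilde = np.tanh(Wh @ x_t + Uh @ (r_t * h_prev) + bh)   # candidate state, Eq. (9)
    h_t = z_t * h_prev + (1.0 - z_t) * h_tilde               # new hidden state, Eq. (10)
    return h_t
```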
A bidirectional gated recurrent unit (BiGRU) network37 combines two GRUs whose information propagates in opposite directions, each with its own parameters; it fits the data in the forward and backward directions separately and then joins the results from the two directions. BiGRU can capture sequence patterns that a unidirectional GRU may miss. The structure of BiGRU is shown in Fig. 7.
Let \(h_{t}^{(1)}\) be the BiGRU's forward hidden state vector at time \(t\) and \(h_{t}^{(2)}\) its backward hidden state vector at time \(t\); \(\sigma\) denotes the standard sigmoid activation function, and \(\oplus\) denotes vector concatenation. The output \(y_{t}\) of a BiGRU network is then calculated as follows:
$$h_{t}=h_{t}^{(1)}\oplus h_{t}^{(2)}$$
(11)
$$y_{t}=\sigma \left(h_{t}\right)$$
(12)
CNN-BiGRU prediction model
Because CNNs have significant advantages in extracting useful features from an image or a sequence and BiGRU is good at processing time series, we combine the two to build the CNN-BiGRU model. The model consists of an input layer, a convolution layer, a BiGRU network layer, a dropout layer, a fully connected layer and an output layer. The CNN and BiGRU layers are the core of the model, while the dropout layer helps to avoid overfitting. The CNN part consists of two one-dimensional convolution (Conv1D) layers and a one-dimensional maximum pooling (MaxPooling1D) layer. The BiGRU takes the output sequence of the CNN part as its input and is set as a single-hidden-layer structure. The structure of the CNN-BiGRU combination model is shown in Fig. 8.
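A hedged sketch of this architecture with tf.keras is shown below; the filter counts, kernel sizes, GRU units and dropout rate are assumptions, since the text does not specify them.

```python
from tensorflow.keras import layers, models

def build_cnn_bigru(window_len, n_features, gru_units=64, dropout_rate=0.2):
    """Sketch of the CNN-BiGRU architecture described above (layer sizes assumed)."""
    model = models.Sequential([
        # CNN part: two Conv1D layers followed by one MaxPooling1D layer
        layers.Conv1D(64, kernel_size=3, padding="same", activation="relu",
                      input_shape=(window_len, n_features)),
        layers.Conv1D(32, kernel_size=3, padding="same", activation="relu"),
        layers.MaxPooling1D(pool_size=2),
        # single-hidden-layer BiGRU fed with the CNN feature sequence
        layers.Bidirectional(layers.GRU(gru_units)),
        layers.Dropout(dropout_rate),   # guards against overfitting
        layers.Dense(1),                # fully connected output: predicted tide level
    ])
    model.compile(optimizer="adam", loss="mse")
    return model
```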
Variable weight combination model
LightGBM performs well when analyzing and predicting relatively stationary tide level time series. However, owing to environmental factors such as air pressure, wind and terrain, most observed tide level sequences are not always stationary, which requires the prediction model to "extrapolate" reasonably from the sample observations, that is, to generate values that do not appear in the sample. Because LightGBM is a tree-based model, its predictions are bounded by the minimum and maximum values of the training sequences, so it cannot accurately predict situations or tidal trends that did not appear in previous observations. The CNN-BiGRU model, as a neural network, has no such limitation in theory and can capture trend information that may be hidden in the tide level series. We therefore assign an appropriate weight to each base model to build a combination model and improve the accuracy of tide level prediction.
Principle of the residual weight combination model and improved variable weight combination model
To improve the prediction accuracy of the combination model, a simple and effective idea is to determine the base models' weights according to the error between the predicted and true values. This method is called the residual weight method, and its weights are determined by the following formulas:
$$g\left(x_{t}\right)=\sum_{i=1}^{m}\omega_{i}\left(t-1\right)f_{i}\left(x_{t}\right)$$
(13)
$$\omega_{i}\left(t-1\right)=\frac{\frac{1}{\overline{\varphi_{i}}\left(t-1\right)}}{\sum_{i=1}^{m}\frac{1}{\overline{\varphi_{i}}\left(t-1\right)}}$$
(14)
$$\sum_{i=1}^{m}\omega_{i}\left(t-1\right)=1,\quad \omega_{i}\left(t-1\right)\ge 0$$
(15)
where \(\omega_{i}(t-1)\) denotes the weight of the \(i\)th model at moment \(t-1\), \(f_{i}(x_{t})\) denotes the prediction of the \(i\)th model at moment \(t\), \(g(x_{t})\) denotes the prediction of the combination model at moment \(t\), and \(\overline{\varphi_{i}}(t-1)\) is the sum of squared prediction errors of the \(i\)th model at moment \(t-1\).
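A minimal sketch of the weight calculation in Eq. (14), assuming two base models and using illustrative error sums:

```python
import numpy as np

def residual_weights(sq_error_sums):
    """Eq. (14): weights inversely proportional to each base model's
    sum of squared prediction errors at moment t-1."""
    inv = 1.0 / np.asarray(sq_error_sums, dtype=float)
    return inv / inv.sum()   # weights are nonnegative and sum to 1, Eq. (15)

# Example with two base models (error sums 0.8 and 0.4): the more accurate
# model receives the larger weight -> array([0.333..., 0.666...]).
w = residual_weights([0.8, 0.4])
```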
Our LightGBM-CNN-BiGRU combination model is based on an improved residual weight method, which we call the variable weight combination model. We use the weights calculated by formula (14) to compute a series of new weights via formula (16). The new weights take the residual weight changes over \(d\) time steps into account by averaging the old weights over the past \(d\) time steps, which improves the stability of the residual weight method.
$$\omega_{j}\left(t\right)=\frac{1}{d}\sum_{k=1}^{d}\omega_{i}\left(t-k\right),\quad d=4$$
(16)
After obtaining the two series of weights through formula (14) and formula (16), we take the absolute error between the combined prediction and the true value at moment \(t\) under the original weights and under the new weights as \(\delta_{i,t}\) and \(\delta_{j,t}\), respectively:
$$\delta_{i,t}=\left|\sum_{i=1}^{m}\omega_{i}\left(t\right)f_{i}\left(x_{t}\right)-y_{t}\right|$$
(17)
$$\delta_{j,t}=\left|\sum_{i=1}^{m}\omega_{j}\left(t\right)f_{i}\left(x_{t}\right)-y_{t}\right|$$
(18)
Comparing \(\delta_{i,t}\) and \(\delta_{j,t}\): if \(\delta_{i,t}>\delta_{j,t}\), the combination model uses the new weights \(\omega_{j}(t)\) in place of the original weights \(\omega_{i}(t)\); otherwise, the weights of the combination model remain unchanged.
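The update rule of Eqs. (16)–(18) can be sketched as follows; the function name and data layout are assumptions.

```python
import numpy as np

def variable_weight_update(weight_history, preds_t, y_t, d=4):
    """Sketch of the improved variable-weight rule (Eqs. 16-18).

    weight_history : list of past weight vectors omega_i, most recent last
    preds_t        : base-model predictions f_i(x_t) at moment t
    y_t            : observed value at moment t
    """
    w_old = np.asarray(weight_history[-1])          # original weights omega_i(t)
    w_new = np.mean(weight_history[-d:], axis=0)    # averaged weights omega_j(t), Eq. (16)

    delta_i = abs(float(np.dot(w_old, preds_t)) - y_t)   # Eq. (17)
    delta_j = abs(float(np.dot(w_new, preds_t)) - y_t)   # Eq. (18)

    # Adopt the averaged weights only if they reduce the absolute error.
    return w_new if delta_i > delta_j else w_old
```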
Parameter optimization of the combination model
Because the LightGBM-CNN-BiGRU combination model is a variable weight combination of the prediction results from two base models, its performance can be improved directly by optimizing the hyperparameters of the two base models separately. We mainly use the grid search algorithm and K-fold cross validation to optimize the parameters. Grid search improves a model's performance by iterating over a given set of candidate parameters. With K-fold cross validation, we can score the LightGBM model on the training set and optimize its hyperparameters easily. The final parameters of the LightGBM model are num_leaves = 26, learning_rate = 0.05, and n_estimators = 46.
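A hedged sketch of this tuning procedure with scikit-learn's GridSearchCV is shown below; the candidate grids, the number of folds and the X_train/y_train variables are assumptions, with the final reported values included among the candidates.

```python
from sklearn.model_selection import GridSearchCV
import lightgbm as lgb

param_grid = {
    "num_leaves": [16, 26, 31],
    "learning_rate": [0.01, 0.05, 0.1],
    "n_estimators": [46, 100, 200],
}
search = GridSearchCV(
    estimator=lgb.LGBMRegressor(),
    param_grid=param_grid,
    scoring="neg_mean_squared_error",
    cv=5,                      # K-fold cross validation on the training set
)
search.fit(X_train, y_train)   # X_train, y_train: training set (assumed already prepared)
print(search.best_params_)     # e.g. num_leaves=26, learning_rate=0.05, n_estimators=46
```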
For the CNN-BiGRU network, we mainly improve the prediction accuracy by adjusting the size and number of hidden layers in the BiGRU structure, and we prevent overfitting by changing the dropout ratio and tracking the network's validation loss during training.
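A minimal sketch of this training setup, assuming the build_cnn_bigru helper from the earlier sketch and pre-split X_train/y_train and X_val/y_val arrays; the window length, batch size and patience are illustrative.

```python
from tensorflow.keras.callbacks import EarlyStopping

# Track the validation loss during training and stop when it no longer improves,
# which limits overfitting alongside the dropout layer.
early_stop = EarlyStopping(monitor="val_loss", patience=10, restore_best_weights=True)
model = build_cnn_bigru(window_len=24, n_features=1)   # sizes are illustrative
history = model.fit(
    X_train, y_train,
    validation_data=(X_val, y_val),
    epochs=200,
    batch_size=64,
    callbacks=[early_stop],
)
```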
The LightGBM and CNN-BiGRU variable weight combination model
The workflow of our tide level prediction model is shown in Fig. 9. It mainly includes data preprocessing; training, optimization and prediction of the base models; construction of a variable weight combination prediction model; and evaluation and analysis of the combination model’s performance.
- (1)
Data preprocessing: The quality of the data directly determines the upper limit of a machine learning model's prediction and generalization ability, and standard, clean, continuous data are conducive to model training. The data used in this study are from the Irish National Tide Gauge Network and have all undergone quality control. We filled in a small number of missing values and normalized the data to speed up model training; a brief end-to-end sketch of these preprocessing, splitting and evaluation steps follows this list.
- (2)
Construction and optimization of the base models: We divide the dataset into a training set, a validation set and a test set in the proportion 7:1:2 and train the LightGBM and CNN-BiGRU models on the training set. We optimize the parameters and monitor overfitting by tracking the network's validation loss during training. Finally, we feed the data into the two trained base models and obtain the prediction results of each single base model.
- (3)
Construction of the variable weight combination model: Based on the prediction results of the two base models obtained in step (2), we calculate the weight of each base model according to the improved variable weight combination method and then obtain the prediction results of the variable weight combination model.
- (4)
Model evaluation and analysis: According to the model evaluation indexes, the variable weight combination model is compared with the other base models to analyze its prediction performance after the improvement.
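A brief end-to-end sketch of steps (1)–(4) under stated assumptions: the file and column names are hypothetical, and the evaluation metrics (RMSE and MAE) are assumed, since the text does not list the evaluation indexes.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_squared_error, mean_absolute_error

# (1) Preprocessing: fill the few missing values and normalize to speed up training.
series = pd.read_csv("tide_levels.csv")["water_level"]   # hypothetical file and column names
series = series.interpolate()                            # fill small gaps
scaler = MinMaxScaler()
scaled = scaler.fit_transform(series.to_numpy().reshape(-1, 1)).ravel()

# (2) Split into training, validation and test sets in the proportion 7:1:2.
n = len(scaled)
train, val, test = np.split(scaled, [int(0.7 * n), int(0.8 * n)])

# (3) Train the two base models and combine their predictions with the variable
#     weights (see the earlier sketches); omitted here.

# (4) Evaluate the combination model; RMSE and MAE are assumed metrics.
def evaluate(y_true, y_pred):
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    mae = mean_absolute_error(y_true, y_pred)
    return rmse, mae
```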