A study on energy consumption analysis and prediction of electric bus at intersections considering driving behavior

Abstract

The relationship between driving behavior and the energy consumption of pure Electric Buses (E-Bus) passing through intersections is unclear. In this study, naturalistic driving data from two Bus Rapid Transit (BRT) routes were collected to quantify and analyze this relationship as the bus passes through an intersection. It is further proposed that predicting energy consumption after first distinguishing whether the bus stops at the intersection yields more accurate predictions. The method has three steps: first, statistical analysis is used to compare the energy consumption of the stopping and non-stopping samples; second, correlation analysis and linear regression are used to identify the parameters significantly related to the stopping decision and to energy consumption; finally, machine learning methods are used to build a classification model of whether the bus stops at an intersection and a prediction model of the energy consumption at the intersection. The results show that XGBoost-KNN predicts whether the bus stops with an accuracy of 84.4%, which is higher than that of KNN or XGBoost alone. For energy consumption prediction, GBDT has the lowest accuracy; for XGBoost and SVM, which are more accurate, distinguishing whether the bus stops improves the prediction accuracy. Furthermore, after this distinction, SVM outperformed XGBoost in R², MAE, and RMSE. The results provide a new perspective for studying the relationship between driving behavior and energy consumption of pure electric buses at intersections. They also open the way for further research on whether the energy consumption models remain applicable when moving from BRT to more complex mixed traffic environments.

Introduction

Motivation

Pure Electric Vehicles (EVs) are considered a means to reduce pollution in the transportation sector and decrease dependence on highly polluting, scarce oil, and they have gradually replaced conventional fuel vehicles¹. According to statistics, Chinese pure EV ownership has reached 12.594 million, accounting for 77.8% of the new energy vehicle market, and has become the mainstay (64.8%) of urban passenger transportation². Moreover, pure Electric Buses (E-Bus), as the primary means of public transportation in cities, are the subject of research regarding their energy consumption. This research addresses environmental concerns and results in energy savings and time efficiency, alleviates “mileage anxiety”, reduces energy expenditure cost, and provides social benefits³.

Literature review

Some scholars have studied the energy consumption of pure EVs, starting from an analysis of the influencing factors and their relationship with energy consumption⁴. In the human-vehicle-road environment system, the factors affecting the energy consumption of EVs consist of internal vehicle factors (vehicle design parameters⁵, traveling characteristics⁶, driver’s cabin assistive devices⁷, etc.), external vehicle factors (environmental temperatures⁸, route planning⁹, charging characteristics¹⁰, etc.), and driver factors¹¹ (accelerator and brake pedal openings, etc.). These factors lead to variations in vehicle parameters such as speed, acceleration, and jerk, resulting in differences in energy consumption¹²,¹³. For instance, Ullah et al.¹⁴ used correlation analysis to identify the parameters related to energy consumption, including travel distance, road gradient, and other parameters, as inputs, and used the advanced Machine Learning (ML) models eXtreme Gradient Boosting (XGBoost) and Light Gradient Boosting Machine (LightGBM) to predict energy consumption.

Intersections draw attention because vehicles frequently slow down, accelerate, or stop, resulting in much higher energy consumption than on normal roadways¹⁵. A fundamental open question is how to predict the energy consumption at an intersection. Many estimation methods for fuel or energy consumption have been applied at intersections, offering valuable insights for understanding the energy consumption of pure EVs. For instance, methods such as direct measurement¹⁶, the VT-Micro model¹⁷, Vehicle Specific Power (VSP)¹⁸, the Vehicle operating State Analysis Method (VSAM)¹⁹, the Delay-Based Analysis Method (DBAM)²⁰, and the Microscopic Simulation-based Analysis Method (MSAM)²¹ all serve the purpose of traffic control and management at the macro level. However, most of these methods estimate fuel consumption at the micro level based on vehicle parameters such as speed and acceleration/deceleration²². They often overlook the differences in driver behaviors or assume uniform driving behaviors¹⁵, thus failing to adequately consider the role of driving behavior in energy consumption estimation. Most drivers believe that in areas with “dilemma decisions”, such as intersections, improper operations can also increase energy consumption. For example, sudden braking will lead to higher energy consumption.

In recent years, there has been a growing trend in intersection studies to consider driving behavior as a potential factor affecting fuel consumption²³. This approach acknowledges that driving behavior reflects driving styles, driving habits, and other factors⁵,²⁴,²⁵, including in the inbound and outbound sections²⁶. For instance, Tang et al.²⁷ proposed a model that incorporates driving behavior, and sensitivity analysis showed that driving behavior can significantly affect capacity and total fuel consumption. Therefore, it is essential to consider driving behavior when conducting energy consumption prediction studies. On a macroscopic level, it is crucial to analyze the relationship between driving behavior and energy consumption, while, on a microscopic level, the driver’s brake and accelerator pedal operation must be considered when building an energy consumption model²⁸.

Another significant issue is the decision between “Stopping” and “Non-stopping” at intersections, which is a crucial choice that any driver must make²⁹. The decision is affected by several factors, such as signal changes and traffic flow at intersections²⁵. An increase in the number of vehicle stops can result in higher fuel consumption³⁰. In recent research on intersections, whether for connected vehicles³¹ or non-connected vehicles²³, and whether with or without signalization³²,³³, the focus has been on optimizing speed profiles³⁴. This optimization allows vehicles or convoys to pass smoothly through intersections at a constant speed, thereby avoiding “Stopping”³⁵. This approach improves traffic efficiency and reduces energy loss, which suggests that “Stopping” or “Non-stopping” at intersections has a serious impact on energy consumption. For instance, Xia et al.³⁶ showed in a simulator study that stop-and-go driving results in 14% higher emissions than constant-speed driving. However, the specific energy consumption difference between “Stopping” and “Non-stopping” at intersections under naturalistic driving of all-electric buses is not yet clear.

Research objectives and contribution

In summary, to address the above problems, the objective of this paper is to develop an energy consumption prediction model for pure electric public transportation that considers driving behavior. The utility of this model lies in its ability to predict the energy consumption of passing an intersection ahead of time, even when the specific intersection is unknown. It accomplishes this by using the driving behavior parameters of passing through that intersection as a basis for discriminating between energy-saving and energy-consuming scenarios³⁷. Such predictions have various practical applications, including the establishment of personalized eco-driving strategies, the design of Environmentally-friendly Advanced Driver Assistance Systems (EADAS) to improve the energy utilization of the vehicle, and the formulation of efficient traffic management strategies²³,³⁰,³⁸. Unlike the road sections considered in previous energy consumption studies (especially station entry and exit sections), a bus at an intersection may pass directly without stopping. Therefore, we mainly explore whether pure E-Bus behavior at intersections needs to be divided into two cases (stopping and waiting, and passing without stopping) when analyzing the difference in energy consumption and predicting the energy consumption.

Therefore, in the study of the relationship between driving behavior and energy consumption at intersections, this paper aims to introduce a new approach: A (first classify to predict whether to stop here at the intersection and then predict energy consumption) is superior to B (directly predicting energy consumption). This approach seeks to provide new insights into eco-friendly driving strategies at intersections. When it comes to choosing prediction models, both statistical and machine learning methods have their own strengths and weaknesses. Statistical models are widely employed due to their good theoretical interpretability but may exhibit lower prediction accuracy for nonlinear and complex problems. As for machine learning models, they offer higher prediction accuracy but often lack interpretability due to black-box information processing¹⁴. To solve these problems, we first perform statistical linear regression to analyze the explainable variables, and then apply traditional machine learning models, such as explainable Gradient Boosting Decision Tree (GBDT), XGBoost, and Support Vector Machine (SVM), to explore and solve the prediction of EV energy consumption.

The contributions of this study are as follows: first, we analyze and compare the energy consumption of pure electric buses in the “stopping” and “non-stopping” scenarios at intersections using statistical methods to determine the importance and necessity of considering whether the bus stops, which has not been done before; second, we select longitudinal kinematic indicators that may be related to vehicle stopping to predict whether a pure electric bus stops at an intersection; finally, we build machine learning models to predict the energy consumption in the “stopping” and “non-stopping” scenarios, and compare them with the scenario that does not distinguish whether the bus stops, to conclude which energy consumption prediction method is more beneficial. Our study provides a new way of thinking about energy consumption prediction at intersections for pure electric buses, which is crucial for optimizing driving behavior and developing eco-driving strategies.

Paper organization

The remaining parts of the paper are organized as follows: Sect. 2 describes the sources and preprocessing of the E-Bus intersection data. Sect. 3 compares the energy consumption at intersections with and without stopping, highlighting the need to predict energy consumption separately for each scenario. Sect. 4 selects the prediction parameters related to the stopping decision by correlation analysis and then proposes the Extreme Gradient Boosting-K-nearest Neighbor (XGBoost-KNN)-based classification model to predict whether the bus stops at an intersection. Sect. 5 screens the prediction parameters of energy consumption through linear regression, and establishes multiple machine learning-based energy consumption prediction models for the stopping, non-stopping, and undistinguished cases. After k-fold cross-validation, a comparative analysis is carried out. Finally, discussion and conclusions are presented in Sect. 6. The technical flowchart of this study is shown in Fig. 1.

Fig. 1

Technical flowchart of this study.

Data sources & pre-processing

This research selected two dedicated Bus Rapid Transit (BRT) routes to collect natural driving data from E-Bus. These routes have designated bus lanes in most sections and are negligibly affected by other vehicles. Consequently, the operating data, obtained from these routes, is more representative of the driver’s driving behaviors compared to buses operating in a mixed-traffic environment.

Data acquisition

Vehicle driving status data is collected using on-board surveillance video equipment and sensors installed in the E-Bus. The basic technical parameters of the E-Bus are shown in Table 1. The onboard data acquisition system consists of an all-in-one video camera, a vehicle controller, an information fusion controller, and video and sensing units. Real-time data acquisition and analysis is achieved by monitoring the vehicle’s status and capturing video data. The data encompasses multiple channels, primarily including vehicle operation status data, driver operation data, motor status data, battery status data, and radar data, with some other parameters listed in Table 2. A unified vehicle model was used for all drivers to minimize the influence of vehicle conditions (such as weight and tire pressure) and technical configurations (such as regenerative braking systems) on energy consumption during the analysis. The sampling frequency is 2 Hz, which satisfies the data analysis requirements of this study.

Table 1 Basic parameters of the E-Bus.

Table 2 Parameters of the onboard data acquisition system.

Data pre-processing

When an E-Bus approaches and exits an intersection, it will either stop and wait or pass directly without stopping. Points about 100 m before and after the intersection were designated as the start and end of each sample. The data was processed using the Kvaser Memorator Tool, resulting in the selection of 583 samples of E-Bus passing through intersections.

Data screening

The data acquisition system collects several data types, some of which are not related to this study. Thus, it is necessary to filter this data. Referring to previous research³⁹, the main parameters of the data set include t, v, θ₁, θ₂, a, J, I, V, T, and ω, where t represents the sampling time, v is the vehicle speed (km/h), θ₁ is the accelerator pedal opening (%), θ₂ is the brake pedal opening (%), a is the longitudinal acceleration (m/s²), J is the jerk (m/s³), I is the pack current (A), V is the pack voltage (V), T is the motor torque (N·m), and ω is the motor speed (r/min). Jerk is the rate of change of acceleration⁴⁰. Pack current and pack voltage refer to the total current and voltage output from the battery system.

Parameter supplementation

The objective of this study is to determine the energy consumption of the E-Bus. Moreover, the energy consumption rate is obtained from the total energy and total driving distance of the journey. Therefore, the energy consumption rate obtained in this study is based on the calculation of the following parameters:

First, the instantaneous energy consumption E_i (kWh) is calculated as follows:

$$E_{i}=\frac{I_{i}\times V_{i}\times t_{i}}{3.6\times 10^{6}}$$

(1)

where I_i is the instantaneous total current (measured in A), V_i represents the instantaneous total voltage (measured in V), t_i denotes the sampling time interval (measured in s), and i is the sample index.

Next, the instantaneous distance traveled S_i(m) is computed:

$$S_{i}=v_{i}\,\Delta t+0.5\,a_{i}\,\Delta t^{2}$$

(2)

where v_i is the instantaneous velocity (recorded in km/h; it must be expressed in m/s for S_i to be obtained in m), a_i is the instantaneous acceleration (measured in m/s²), and Δt is the sampling interval (measured in s).

Therefore, the Energy Consumption per hundred kilometers (EC₁₀₀) of a pure E-Bus, passing through an intersection, is defined as follows:

$$\text{EC}_{100}=\frac{\sum_{i=1}^{n}E_{i}}{\sum_{i=1}^{n}S_{i}}\times 10^{3}$$

(3)

Finally, the instantaneous energy consumption rate, Energy Per Kilometer (E_EPKi), is given as follows:

$$E_{\text{EPK}_{i}}=\frac{E_{i}}{S_{i}}\times 10^{3}$$

(4)
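The calculation in Eqs. (1) to (3) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' processing code: the function name `energy_rate_kwh_per_km` is hypothetical, the default interval of 0.5 s follows from the 2 Hz sampling frequency, and the conversion of speed from km/h to m/s is an assumption made so that the distance comes out in metres. With energy in kWh and distance in m, the factor 10³ gives a rate in kWh/km.

```python
def energy_rate_kwh_per_km(current_a, voltage_v, speed_kmh, accel_ms2, dt_s=0.5):
    """Return (total energy in kWh, total distance in m, rate in kWh/km).

    Hypothetical helper: one call corresponds to one pass through an
    intersection, with one list entry per sample.
    """
    total_energy_kwh = 0.0
    total_distance_m = 0.0
    for i_a, v_v, v_kmh, a in zip(current_a, voltage_v, speed_kmh, accel_ms2):
        # Eq. (1): I * V * t is in joules; 3.6e6 J equals 1 kWh.
        total_energy_kwh += i_a * v_v * dt_s / 3.6e6
        # Eq. (2): assumption, speed is converted from km/h to m/s first.
        v_ms = v_kmh / 3.6
        total_distance_m += v_ms * dt_s + 0.5 * a * dt_s ** 2
    if total_distance_m == 0:
        raise ValueError("total distance is zero; the rate is undefined")
    # Eq. (3): kWh per metre times 1e3 gives kWh per kilometre.
    rate = total_energy_kwh / total_distance_m * 1e3
    return total_energy_kwh, total_distance_m, rate


# Toy input: 2 s of cruising at 36 km/h while drawing 100 A at 600 V.
energy, distance, rate = energy_rate_kwh_per_km(
    [100.0] * 4, [600.0] * 4, [36.0] * 4, [0.0] * 4
)
print(round(distance, 3), round(rate, 3))
```

A negative pack current during regenerative braking makes the corresponding E_i negative, which is how a whole pass can end up with a negative rate.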

Intersection energy consumption analysis and comparison

To understand the energy consumption distribution pattern of pure E-Bus while crossing intersections, we first examined the overall distribution and then investigated whether stopping or non-stopping significantly affects consumption. For the latter, we applied an independent samples t-test⁴¹ to determine whether the samples need to be divided into stopping and non-stopping groups for further discussion.

Distribution of energy consumption at intersections

Fig. 2 displays the histogram of the energy consumption rate of pure E-Bus passing through an intersection, with a mean value of 1.096 kWh/km (Standard Deviation (SD) = 0.802) and a range of -2.253 to 4.284 kWh/km. Moreover, 82.5% of the samples have an energy consumption rate varying between 0 and 2 kWh/km, showing that most of the points are close to the mean value. It is worth noting that, for very few points, the recovered energy is greater than the consumed energy.

Fig. 2

Histogram of energy consumption rate at intersections.

Comparison of energy consumption of stopping and non-stopping

Among the 583 captured samples, there are 291 stopping samples and 292 non-stopping samples, and their distributions are displayed in Fig. 3. The independent samples t-test shows a significant difference in energy consumption between the stopping and non-stopping groups when passing through an intersection, t(581) = 9.325, p < 0.001, Cohen’s d = 0.772. The energy consumption when buses stop at the intersection (Mean = 1.386, SD = 0.751) is considerably higher than when they do not stop (Mean = 0.808, SD = 0.746). This observation indicates that although an E-Bus that comes to a stop when entering an intersection can recover more energy through braking than a bus that does not stop, it needs additional energy to accelerate back to the target speed when exiting the intersection, i.e., greater acceleration and more energy consumption than in the non-stopping samples. This significant difference in energy consumption between the two groups underscores the importance of categorizing them for further analysis.
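The reported test statistics can be checked from the group means, standard deviations, and sample sizes alone. The sketch below assumes the equal-variance (pooled) form of the independent samples t-test, which matches the reported 581 degrees of freedom; the function name is hypothetical, and small deviations from the published values are expected because the inputs are rounded to three decimals.

```python
import math


def pooled_t_and_cohens_d(mean1, sd1, n1, mean2, sd2, n2):
    """Student's t (equal variances assumed) and Cohen's d from summary statistics."""
    dof = n1 + n2 - 2
    # Pooled standard deviation of the two groups.
    pooled_sd = math.sqrt(((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / dof)
    cohens_d = (mean1 - mean2) / pooled_sd
    t_stat = (mean1 - mean2) / (pooled_sd * math.sqrt(1.0 / n1 + 1.0 / n2))
    return t_stat, cohens_d, dof


# Stopping group first, non-stopping group second (values from the text).
t_stat, cohens_d, dof = pooled_t_and_cohens_d(1.386, 0.751, 291, 0.808, 0.746, 292)
print(dof, round(t_stat, 2), round(cohens_d, 2))
```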

Fig. 3

Stopping vs. Non-stopping energy consumption rate while passing through intersections.

Intersection stopping and non-stopping classification prediction model based on XGBoost-KNN

The analysis in Sect. 3 showed that there is a significant difference in energy consumption between the samples where buses stop and those where they do not. Therefore, in this section, a correlation analysis¹⁴ is first employed to identify the input parameters for the model that predicts whether the bus stops. Second, the data from the first 5 s near the intersection is used to predict whether a pure electric bus will come to a stop or continue without stopping when passing through the intersection.

Correlation analysis

To identify the specific prediction parameters of whether to stop, we focused on longitudinal kinematic indicators that might be associated with stopping behavior. We analyzed their correlation with whether to stop, and selected the parameters with significant correlation. As illustrated in Fig. 4, the correlation matrix reveals that the correlation coefficients between whether to stop and the following parameters are significant: speed (0.65), accelerator pedal opening (0.33), brake pedal opening (-0.43), and acceleration (0.63). All of these correlations are statistically significant (p < 0.001). Consequently, we selected speed, accelerator pedal opening, brake pedal opening, and acceleration as the predictive parameters for determining whether a bus will come to a stop at an intersection or continue without stopping.
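As an illustration of this screening step, the Pearson coefficient between each kinematic indicator and a binary stop label can be computed as below. The data are invented toy values, and the label coding (1 = passes without stopping, 0 = stops) is an assumption chosen so that the signs agree with those reported above; the source does not state the coding.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Toy samples (hypothetical): mean values over the window near the intersection.
features = {
    "speed": [8.0, 12.0, 10.0, 35.0, 40.0, 30.0],            # km/h
    "brake_pedal": [30.0, 25.0, 20.0, 0.0, 0.0, 0.0],        # %
    "acceleration": [-1.2, -0.9, -1.0, 0.1, 0.3, 0.0],       # m/s^2
}
passed_without_stopping = [0, 0, 0, 1, 1, 1]  # assumed label coding

correlations = {
    name: pearson(values, passed_without_stopping)
    for name, values in features.items()
}
for name, r in correlations.items():
    print(f"{name}: {r:+.2f}")
```

With a binary label this is the point-biserial correlation; a significance test on each coefficient would be needed to reproduce the p-values.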

Fig. 4

Correlation matrix analysis.

Intersection stopping and Non-stopping prediction model

Determining whether a bus will stop or not is essentially a binary prediction problem. Common machine learning algorithms for binary classification include decision trees, random forests, KNN, and XGBoost. The parameters closely related to stopping behaviour in the correlation analysis in Sect. 4.1 were used as inputs, namely speed, accelerator pedal opening, brake pedal opening, and acceleration. We developed a model based on the XGBoost-KNN algorithm for predicting whether a bus stops when passing through an intersection, and compared it with each algorithm used individually (i.e., XGBoost and KNN). The algorithmic principles of XGBoost and KNN are presented below.

Algorithmic theory

Principles of the XGBoost algorithm

XGBoost is an ensemble learning algorithm based on GBDT¹¹. It iteratively generates tree models from the features to predict the target. Most traditional classification models improve accuracy mainly by enlarging the training dataset. As an ensemble model, XGBoost defines the loss function more precisely, achieves better accuracy, and also controls the complexity of the model, thus maintaining computational speed. Moreover, it does not require much feature engineering and is widely used in the field of driving behaviour prediction, especially in industrial areas⁴²,⁴³. XGBoost has powerful non-linear fitting capabilities, enabling it to capture complex relationships between non-linear features (such as changes in speed, pedal opening, and acceleration near intersections) and accurately classify whether a vehicle is stopping. Additionally, it includes regularisation terms (such as the lambda and alpha parameters used to control regularisation strength), which control model complexity, help prevent overfitting on relatively small datasets (583 samples), and enhance generalisation ability.

XGBoost is an additive model consisting of multiple base models. For the i-th sample, the output after t iterations is given by (5):

$$\hat{y}_{i}^{(t)}=\sum_{k=1}^{t}f_{k}(x_{i})=\hat{y}_{i}^{(t-1)}+f_{t}(x_{i})$$

(5)

where ŷ_i^(t) represents the prediction of sample i after the t-th iteration, ŷ_i^(t−1) denotes the accumulated prediction of the first t−1 regression trees, and f_t(x_i) is the prediction of the t-th regression tree.

The objective function is denoted as follows:

$$\text{Obj}=\sum_{i=1}^{n}l(y_{i},\hat{y}_{i})+\sum_{k=1}^{t}\Omega(f_{k})$$

(6)

where l is the loss function and Ω represents the penalty on the complexity of each tree. To get a better prediction model, XGBoost fits each new tree to the residuals left by the previous trees, i.e., the model’s prediction for the i-th sample is represented as follows:

$$\hat{y}_{i}^{(t)}=\hat{y}_{i}^{(t-1)}+f_{t}(x_{i})$$

(7)

The objective function at the t-th iteration can then be rewritten as follows:

$$\text{Obj}^{(t)}=\sum_{i=1}^{n}l\left(y_{i},\hat{y}_{i}^{(t)}\right)+\sum_{k=1}^{t}\Omega(f_{k})=\sum_{i=1}^{n}l\left(y_{i},\hat{y}_{i}^{(t-1)}+f_{t}(x_{i})\right)+\Omega(f_{t})+\text{constant}$$

(8)

Because the trees from the first t−1 iterations are already fixed, their penalties are constant, and the regularisation term splits into a variable part and a constant part:

$$\sum_{k=1}^{t}\Omega(f_{k})=\Omega(f_{t})+\text{constant}$$

(9)

According to the second-order Taylor approximation, the loss term can be expanded as follows:

$$l\left(y_{i},\hat{y}_{i}^{(t-1)}+f_{t}(x_{i})\right)\approx l\left(y_{i},\hat{y}_{i}^{(t-1)}\right)+g_{i}f_{t}(x_{i})+\frac{1}{2}h_{i}f_{t}^{2}(x_{i})$$

(10)

where g_i is the first-order derivative of the loss function and h_i is the second-order derivative of the loss function, both taken with respect to the previous prediction:

$$g_{i}=\frac{\partial l\left(y_{i},\hat{y}_{i}^{(t-1)}\right)}{\partial \hat{y}_{i}^{(t-1)}}$$

(11)

$$h_{i}=\frac{\partial^{2} l\left(y_{i},\hat{y}_{i}^{(t-1)}\right)}{\partial \left(\hat{y}_{i}^{(t-1)}\right)^{2}}$$

(12)

By substituting the second-order expansion into the objective function, the objective can be approximated as follows:

$$\text{Obj}^{(t)}\approx\sum_{i=1}^{n}\left[l\left(y_{i},\hat{y}_{i}^{(t-1)}\right)+g_{i}f_{t}(x_{i})+\frac{1}{2}h_{i}f_{t}^{2}(x_{i})\right]+\Omega(f_{t})+\text{constant}$$

(13)

The constant terms, including the loss of the previous prediction, do not affect the optimization. Removing all of them yields the following objective function:

$$\text{Obj}^{(t)}\approx\sum_{i=1}^{n}\left[g_{i}f_{t}(x_{i})+\frac{1}{2}h_{i}f_{t}^{2}(x_{i})\right]+\Omega(f_{t})$$

(14)

The final steps consist of calculating the values of g_i and h_i for the loss function at each step, optimizing the objective function, adding the new tree to the additive model, and finally obtaining the overall model. In the XGBoost model selection process, the parameters are first initialised by setting the initial ranges of parameters such as the maximum tree depth (max_depth), the learning rate (learning_rate), and the number of estimators (n_estimators). Then a hyperparameter tuning method such as grid search or random search is used to find the best parameter combination within the given ranges, based on metrics such as accuracy, precision, and recall.
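To make the role of g_i and h_i concrete, the sketch below evaluates them for the binary logistic loss and then computes the weight of a single leaf that minimises the objective in (14). Two assumptions are made that the text does not state: the loss is the logistic loss on a raw (log-odds) score, for which g_i = p_i − y_i and h_i = p_i(1 − p_i), and the penalty is an L2 term ½λw² on the leaf weight w, which gives the closed-form optimum w* = −Σg_i / (Σh_i + λ). All function names are illustrative.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def grad_hess_logloss(y_true, raw_pred):
    """First and second derivatives of the logistic loss w.r.t. the raw score."""
    grads, hessians = [], []
    for y, z in zip(y_true, raw_pred):
        p = sigmoid(z)
        grads.append(p - y)            # g_i
        hessians.append(p * (1.0 - p))  # h_i
    return grads, hessians


def optimal_leaf_weight(grads, hessians, reg_lambda=1.0):
    """Minimiser of sum(g*w + 0.5*h*w^2) + 0.5*lambda*w^2 for one leaf."""
    return -sum(grads) / (sum(hessians) + reg_lambda)


# Four samples in one leaf, all starting from a raw score of 0 (p = 0.5).
labels = [1, 1, 0, 1]
g, h = grad_hess_logloss(labels, [0.0, 0.0, 0.0, 0.0])
leaf_w = optimal_leaf_weight(g, h, reg_lambda=1.0)
print(leaf_w)
```

Because three of the four labels are positive, the summed gradient is negative and the leaf weight is positive, pushing the raw score of these samples towards the positive class in the next iteration.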

Principle of KNN algorithm

KNN is a classical non-parametric machine learning model with a mature theory. It can be used for both classification and regression problems, does not require a training phase, and is simple to use and fast to compute⁴⁴,⁴⁵. Since KNN is a distance-based lazy learning algorithm, it is suitable for datasets with moderate feature dimensions, limited sample sizes, and well-defined class distributions. Given the limited data volume (583 samples) in this study, it has the advantage of strong distribution adaptability: regardless of whether the data is linearly separable or non-linearly distributed, it can attempt to find suitable neighbours for classification, making it suitable for complex data distributions such as stopping at intersections. At the same time, it can avoid overfitting to a certain extent, ensuring that the model has reasonable generalisation capability for classifying whether the bus stops⁴⁶.

The algorithm is given a training dataset T, where x_i is the feature vector of an instance and y_i is the category of the instance:

$$T=\left\{(x_{1},y_{1}),(x_{2},y_{2}),\dots,(x_{n},y_{n})\right\}$$

(15)

In the output process, the k points closest to x are located in the training set T according to the given distance metric, and the neighborhood of x covering these k points is N_k(x); furthermore, the category y of x is decided in N_k(x) according to the following classification rule:

$$y=\arg\max_{c_{j}}\sum_{x_{i}\in N_{k}(x)}I(y_{i}=c_{j}),\quad i=1,2,\cdots,N;\; j=1,2,\cdots,K$$

(16)

where I is the indicator function, i.e., I is 1 when y_i = c_j; otherwise I is 0.

The Euclidean distance is expressed as shown in (17):

$$d(A,B)=\sqrt{(x_{1a}-x_{1b})^{2}+\cdots+(x_{na}-x_{nb})^{2}}=\sqrt{\sum_{i=1}^{n}(x_{ia}-x_{ib})^{2}}$$

(17)

That is, the distance between points A and B is measured in an n-dimensional space, and the values on the coordinate axes x₁, x₂, …, x_n correspond to the n features of the sample data. The model selection process of the KNN algorithm is relatively simple: different values of K are tried and the resulting model performance is compared.
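The classification rule in (16) with the Euclidean distance in (17) can be sketched as follows. The training points are invented toy values with two features (speed in km/h and acceleration in m/s²), and the labels "stop" and "pass" are illustrative names. In practice the features would normally be scaled to comparable ranges before computing distances, which this sketch omits.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two feature vectors, as in Eq. (17)."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def knn_predict(train_x, train_y, query, k=3):
    """Majority vote among the k training points closest to the query."""
    nearest = sorted(
        zip(train_x, train_y), key=lambda pair: euclidean(pair[0], query)
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy training set (hypothetical values): [speed, acceleration].
train_x = [[5.0, -1.2], [8.0, -0.9], [6.0, -1.0],
           [35.0, 0.1], [40.0, 0.3], [38.0, 0.0]]
train_y = ["stop", "stop", "stop", "pass", "pass", "pass"]

print(knn_predict(train_x, train_y, [7.0, -1.1], k=3))
print(knn_predict(train_x, train_y, [37.0, 0.2], k=3))
```

Trying several values of `k` and comparing validation accuracy corresponds to the model selection step described above.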

Characteristics of the XGBoost-KNN algorithm

The XGBoost-KNN algorithm is a two-stage combination of XGBoost and KNN. Typically, XGBoost can extract a variety of features, depending on the dataset and problem domain. These may include numerical features (e.g., age, income, etc.), encoded categorical features (e.g., one-hot encoding, label encoding), statistical features (e.g., mean, variance, etc.), and features derived from tree-model feature importance. The extracted features are transformed into new feature vectors, which are used as inputs to the KNN algorithm. Each data point can be represented as a feature vector, and the KNN algorithm determines the neighbouring points based on the distances (e.g., Euclidean distance, Manhattan distance, etc.) between these feature vectors.

KNN improves the performance of XGBoost in the following ways: ① XGBoost is a tree-based ensemble model that makes predictions by constructing multiple decision trees. Although it performs well globally, its predictions may not be accurate enough in some local regions, whereas the KNN algorithm predicts from the information of local neighbouring points and can therefore compensate for XGBoost's limited use of local information; ② XGBoost may be sensitive to noisy data, while KNN can reduce the effect of noise by majority voting (for classification problems) or by averaging the results of multiple nearby samples (for regression problems). When the prediction results of XGBoost are disturbed by noise, KNN can provide a relatively stable correction and improve the overall noise resistance; ③ Combining XGBoost and KNN exploits the advantages of two different types of algorithms: XGBoost is good at handling large-scale data and high-dimensional features and can quickly find the global decision boundary, while KNN has an advantage in predicting small samples and local regions. Fusing the two can improve the generalisation ability and robustness of the model.

In summary, XGBoost-KNN combines XGBoost and KNN so that their advantages complement each other. XGBoost is good at mining deep-level features and non-linear relationships in the data and can perform feature-engineering-like transformations on the original data to improve the quality of the features. KNN can then make more accurate neighbor-based classification judgments from the transformed, higher-quality features. The combination brings the strengths of both into play and compensates for the deficiencies that may exist when either is used alone. For example, KNN alone may not capture complex non-linear relationships well, while the features transformed by XGBoost help KNN handle these situations.

Therefore, in this study, based on the results in Sect. 4.1, four features are used as predictor variables for predicting whether a bus will stop at an intersection: vehicle speed, accelerator pedal opening, brake pedal opening, and acceleration. Firstly, the prediction labels (stopping or non-stopping) are one-hot encoded. The four predictor variables are then input into the XGBoost module, which does not output classification results; instead, it performs feature extraction on the four input variables and converts them into new feature vectors. These vectors are used as inputs to the KNN algorithm, which ultimately outputs the classification results.
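The text does not state how the XGBoost stage produces the new feature vectors. One common construction, assumed here purely for illustration, is to encode each sample by the leaf it reaches in every tree and to run KNN in that leaf-index space. The sketch below uses hand-written one-split stumps as a stand-in for trained boosted trees; the thresholds, sample values, and function names are all hypothetical.

```python
from collections import Counter

# Stand-in for trained boosted trees: each "tree" is a one-split stump
# given as (feature index, threshold). Values are hypothetical, not fitted.
STUMPS = [(0, 20.0), (1, 10.0), (2, 15.0), (3, -0.5)]

def leaf_embedding(x):
    """Stage 1: map raw features to the leaf index (0 or 1) reached in each stump."""
    return tuple(int(x[f] > thr) for f, thr in STUMPS)

def hamming(a, b):
    """Number of stumps in which two samples fall into different leaves."""
    return sum(ai != bi for ai, bi in zip(a, b))

def two_stage_predict(train_X, train_y, x, k=3):
    """Stage 2: KNN majority vote in the leaf-index space."""
    z = leaf_embedding(x)
    emb = [leaf_embedding(t) for t in train_X]
    nearest = sorted(range(len(emb)), key=lambda i: hamming(emb[i], z))[:k]
    return Counter(train_y[i] for i in nearest).most_common(1)[0][0]

# Hypothetical samples: (speed km/h, accelerator %, brake %, acceleration m/s^2).
train_X = [(8, 0, 45, -1.8), (12, 2, 30, -1.2), (15, 0, 25, -0.9),
           (38, 30, 0, 0.3), (42, 25, 0, 0.1), (35, 20, 0, 0.0)]
train_y = [1, 1, 1, 0, 0, 0]  # 1 = stopping, 0 = non-stopping

print(two_stage_predict(train_X, train_y, (10, 0, 40, -1.5)))  # 1
print(two_stage_predict(train_X, train_y, (40, 28, 0, 0.2)))   # 0
```

In a real pipeline the stumps would be replaced by the trees of a fitted boosting model, and the distance metric and k would be tuned on held-out data.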

Evaluation indicators

The performance of the binary classification model is assessed using the confusion matrix. For a classification problem with K classes, the confusion matrix has a size of K × K. The binary confusion matrix is shown in Table 3, which lists the True Positives (TP), False Negatives (FN), True Negatives (TN), and False Positives (FP).

Table 3 Confusion matrix.


Furthermore, the accuracy, precision, and recall of the model are calculated according to (18)-(20), respectively:

$$\text{Accuracy}=\frac{TP+TN}{TP+TN+FP+FN}$$

(18)

$$\text{Precision}=\frac{TP}{TP+FP}$$

(19)

$$\text{Recall}=\frac{TP}{TP+FN}$$

(20)

Meanwhile, the Receiver Operating Characteristic (ROC) curve is also used to evaluate the performance of the model. The ROC curve illustrates the trade-off between the true positive rate and the false positive rate under different threshold settings. The Area Under the Curve (AUC), i.e., the area under the ROC curve, typically ranges from 0.5 (no better than random guessing) to 1 (perfect classification), with a higher AUC indicating better classification performance.
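As a small worked illustration of (18)-(20) and of the AUC, the sketch below computes the three metrics from confusion-matrix counts and obtains the AUC through its rank interpretation (the fraction of positive/negative pairs in which the positive sample receives the higher score). The counts and scores are hypothetical and are not taken from Fig. 5 or Table 4.

```python
def classification_metrics(tp, fn, fp, tn):
    """Accuracy, precision and recall from confusion-matrix counts, Eqs. (18)-(20)."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return accuracy, precision, recall

def auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical counts and scores, for illustration only.
acc, prec, rec = classification_metrics(tp=40, fn=10, fp=5, tn=45)
print(acc, prec, rec)
print(auc([1, 1, 1, 0, 0, 0], [0.9, 0.8, 0.4, 0.5, 0.3, 0.1]))
```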

Model results

Fig. 5 shows the confusion matrices for the two classifications using KNN, XGBoost, and XGBoost-KNN models. XGBoost-KNN incorrectly predicted seven non-stopping samples as stopping and ten stopping samples as non-stopping. In comparison with XGBoost-KNN, KNN predicted more non-stopping samples as stopping whereas XGBoost predicted more stopping samples as non-stopping.

Fig. 5 Confusion matrices.


Table 4 presents the prediction results of the different models, demonstrating that the XGBoost-KNN model outperforms KNN and XGBoost, achieving an accuracy, precision, and recall of 84.4%, 83.1%, and 87.5%, respectively. Moreover, Fig. 6 shows the ROC curves of the prediction models, where XGBoost-KNN has the highest AUC value (0.884). These findings highlight the superior performance of XGBoost-KNN over KNN or XGBoost alone.

Table 4 Prediction results of different models.


Fig. 6 ROC curves of the prediction models.


Intersection energy consumption prediction model based on machine learning

This section builds on the findings of Sect. 4 and focuses on predicting energy consumption from data upstream of the intersection, with and without distinguishing between stopping and non-stopping. Initially, we identify the parameters that significantly affect energy consumption at intersections using multiple linear regression with a backward stepwise elimination strategy⁴⁷. Subsequently, we use these significant parameters as inputs for energy consumption prediction with several common machine learning algorithms. In addition, we predict energy consumption directly without distinguishing between stopping and non-stopping, so that the two approaches can be compared.

Multiple linear regression analysis of energy consumption

Driving behavior parameters and vehicle motion parameters of pure E-Buses approaching intersections were selected to analyze the energy consumed when passing through intersections. These comprise speed, accelerator and brake pedal opening, acceleration, pedal travel, and accelerated acceleration (jerk), together with their descriptive statistics, for a total of 30 parameters. A backward stepwise elimination strategy was applied to build a multiple regression model (with the significance level α set to 0.05) and thereby identify the statistically significant influences on energy consumption. These significant parameters were also used as inputs for the subsequent energy consumption predictions detailed in Sect. 5.2.

When the multiple regression analysis was performed without distinguishing between stopping and non-stopping, F = 41.339, p < 0.001, and R² = 0.558. The regression model identified six statistically significant factors that influence the energy consumption rate; collectively, these six factors explain 55.8% of the variance in the energy consumption rate. Table 5 lists the coefficients of the multiple regression analysis for each predictor. In descending order of the absolute value of the standardized coefficient, the predictors are: the maximum value of speed, the mean value of accelerator pedal opening, the maximum value of acceleration, the proportion of accelerator pedal pressing, the mean value of acceleration, and the standard deviation of the absolute value of acceleration. The sensitivity of the energy consumption rate to the six indexes follows the same order, from strongest to weakest: the larger the absolute value of the standardized coefficient, the higher the sensitivity.
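The ranking by standardized coefficient can be illustrated with a short sketch. It assumes the usual definition, in which each raw coefficient b_j is rescaled by s_xj / s_y (the sample standard deviations of the predictor and of the response) so that predictors measured in different units become comparable. The data, the coefficient values, and the names `speed_max` and `acc_mean` are hypothetical and are not the values in Table 5.

```python
import statistics

def standardized_coefficients(coefs, X_columns, y):
    """Rescale each raw coefficient b_j by s_xj / s_y (assumed standard definition)."""
    s_y = statistics.stdev(y)
    return {name: b * statistics.stdev(X_columns[name]) / s_y
            for name, b in coefs.items()}

# Hypothetical predictors and response (energy consumption rate in kWh/km).
X_columns = {
    "speed_max": [30.0, 35.0, 40.0, 45.0, 50.0],
    "acc_mean": [0.1, 0.3, 0.2, 0.5, 0.4],
}
y = [1.4, 1.2, 1.1, 0.9, 0.8]
coefs = {"speed_max": -0.03, "acc_mean": 0.2}  # hypothetical fitted raw coefficients

beta = standardized_coefficients(coefs, X_columns, y)
# Sensitivity ranking: largest absolute standardized coefficient first.
ranking = sorted(beta, key=lambda name: abs(beta[name]), reverse=True)
print(ranking)
```

A negative standardized coefficient indicates that energy consumption falls as the predictor increases; only the absolute value is used for the sensitivity ordering.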

Table 5 Regression analysis of energy consumption rates without distinguishing between stopping and non-stopping.


After distinguishing between stopping and non-stopping, regression analysis was performed separately for each case. For the stopping samples, F = 23.165, p < 0.001, and R² = 0.688; the regression model screened out three statistically significant influences on the energy consumption rate, namely, the maximum value of speed, the mean value of positive acceleration, and the mean value of negative acceleration. Together, these explain 68.8% of the total variance. Table 6 lists the multiple regression coefficients for each predictor. In descending order of the absolute value of the standardized coefficient, the predictors are: maximum value of speed, mean value of negative acceleration, and mean value of positive acceleration. The sensitivity of the energy consumption rate to the three indicators follows the same order, from strongest to weakest: the larger the absolute value, the higher the sensitivity. Moreover, the larger the maximum value of speed and the mean value of negative acceleration, the smaller the energy consumption; and the smaller the mean value of positive acceleration, the smaller the energy consumption.

Table 6 Regression analysis of energy consumption rates for stopping.


For the non-stopping samples, F = 32.624, p < 0.001, and R² = 0.579; the regression model screened out eight statistically significant influences on the energy consumption rate, namely, the maximum value of speed, the standard deviation of speed, the mean value of accelerator pedal opening, the proportion of accelerator pedal pressing, the minimum value of acceleration, the mean value of acceleration, the mean value of positive acceleration, and the maximum value of accelerated acceleration. Together, these explain 57.9% of the total variance. Table 7 lists the coefficients of the multiple regression analysis for each predictor. In descending order of the absolute value of the standardized coefficient, the predictors are: maximum value of speed, mean value of acceleration, mean value of accelerator pedal opening, standard deviation of speed, mean value of positive acceleration, proportion of accelerator pedal pressing, maximum value of accelerated acceleration, and minimum value of acceleration. The sensitivity of the energy consumption rate to the eight indexes follows the same order, from strongest to weakest: the larger the absolute value, the higher the sensitivity.

Table 7 Regression analysis of energy consumption rates for non-stopping.


Intersection energy consumption prediction model

In this section, the parameters found in Sect. 5.1 to have a significant impact on energy consumption are used as inputs, and three machine learning algorithms (GBDT, XGBoost, and SVM) are used for regression prediction of energy consumption. The coefficient of determination (R²), the mean absolute error (MAE), and the root mean square error (RMSE) are used as performance evaluation metrics. In addition, to ensure the reliability and robustness of the model performance, k-fold cross-validation (k = 5) is adopted to evaluate and compare the three methods.

Algorithmic theory

Principle of GBDT regression algorithm

GBDT can handle high-dimensional data directly through the iterative training of multiple decision trees. This study involves multiple dimensions of driving behavior features (such as speed, acceleration, and different pedal openings), and GBDT can gradually capture the complex nonlinear relationships that often exist between these features and energy consumption.

The GBDT prediction function⁴⁸ is expressed as follows:

$$F(x,w)=\sum_{t=0}^{n}\rho_t h_t(x,\omega_t)=\sum_{t=0}^{n}f_t(x,\omega_t)$$

(21)

where x is the input sample, h_t is the t-th regression tree, ω_t denotes the parameters of the t-th regression tree, and ρ_t is the weight of the t-th regression tree.

When considering N samples, the optimal value of the prediction function is the following:

$$F^{*}=\arg\min\sum_{i=0}^{N}L\left(y_i,F(x_i,w)\right)$$

(22)

where L is the loss function.

The prediction function after iteration is expressed as follows:

$$F_t(x)=F_{t-1}(x)+f_t(x,\omega_t)$$

(23)

If the loss function satisfies the error convergence condition or the number of regression trees t reaches the preset value, the iteration is terminated; otherwise, the iteration continues.
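The iteration in (21)-(23) can be sketched in plain Python under simplifying assumptions: a single input feature, squared-error loss (so that each new tree is fitted to the current residuals), one-split regression stumps as the trees h_t, and a constant weight ρ_t = 0.1. The data and all function names are hypothetical; this is not the configuration used in the study.

```python
def fit_stump(xs, residuals):
    """Least-squares one-split tree on 1-D inputs: (threshold, left value, right value)."""
    best = None
    for thr in sorted(set(xs))[:-1]:  # drop the largest value so the right side is never empty
        left = [r for x, r in zip(xs, residuals) if x <= thr]
        right = [r for x, r in zip(xs, residuals) if x > thr]
        lv, rv = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, thr, lv, rv)
    return best[1:]

def predict_stump(stump, x):
    thr, lv, rv = stump
    return lv if x <= thr else rv

def fit_gbdt(xs, ys, n_trees, rho=0.1):
    """Boosting loop: F_t = F_{t-1} + rho * h_t, stopping after n_trees trees."""
    f0 = sum(ys) / len(ys)  # F_0: constant initial prediction
    preds = [f0] * len(xs)
    stumps = []
    for _ in range(n_trees):
        # For squared loss, the negative gradient is the residual y - F_{t-1}(x).
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + rho * predict_stump(stump, x) for p, x in zip(preds, xs)]
    return f0, rho, stumps

def predict_gbdt(model, x):
    f0, rho, stumps = model
    return f0 + rho * sum(predict_stump(s, x) for s in stumps)

def train_mse(model, xs, ys):
    return sum((y - predict_gbdt(model, x)) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Hypothetical data: maximum speed (km/h) vs. energy consumption rate (kWh/km).
xs = [10, 15, 20, 25, 30, 35, 40, 45]
ys = [1.6, 1.5, 1.3, 1.2, 1.0, 0.8, 0.7, 0.6]
for n in (1, 10, 100):
    print(n, round(train_mse(fit_gbdt(xs, ys, n), xs, ys), 4))
```

The training error shrinks as trees are added, which is why a preset number of trees or a convergence threshold is needed as the stopping rule.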

Principle of XGBoost regression algorithm

The reasons for choosing XGBoost are basically the same as those mentioned above. It is computationally efficient and can quickly complete model training and prediction. Moreover, it scales well and can conveniently handle larger data sets and more feature dimensions. It can effectively prevent overfitting, ensure the generalization ability of the model under a limited amount of data, and at the same time improve the stability and robustness of the model, so that the energy consumption predictions for different types of samples (stopping, non-stopping, and overall samples) remain consistent and accurate.

The XGBoost output expression was given in (5) in Sect. 4.2, and the corresponding objective function¹⁴ can be written as follows:

$$\text{Obj}^{(p)}=\sum_{i=1}^{n}l(\hat{y}_i,y_i)+\sum_{k=1}^{p}\sigma(f_k)$$

(24)

where l denotes the loss function, n represents the number of observations used, p is the number of trees, and σ indicates the regularization term, as explained below:

$$\sigma(f)=\gamma T+\frac{1}{2}\lambda\parallel\omega\parallel^{2}$$

(25)

where T is the number of leaves in the tree, ω is the vector of leaf scores, γ represents the minimum loss reduction required to further split a leaf node, and λ indicates the regularization parameter.
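To make the roles of γ and λ concrete, the following sketch evaluates the regularization term in (25) for a tree described only by its leaf scores, and the objective in (24) for a small ensemble. Squared-error loss is assumed for l, and the leaf scores, predictions, and parameter values are hypothetical.

```python
def regularization(leaf_scores, gamma, lam):
    """Eq. (25): gamma * T + 0.5 * lambda * ||omega||^2 for one tree."""
    T = len(leaf_scores)  # number of leaves
    return gamma * T + 0.5 * lam * sum(w * w for w in leaf_scores)

def objective(y_true, y_pred, trees, gamma, lam):
    """Eq. (24) with squared-error loss (an assumption): data fit plus per-tree penalties."""
    loss = sum((yp - yt) ** 2 for yt, yp in zip(y_true, y_pred))
    return loss + sum(regularization(w, gamma, lam) for w in trees)

# Hypothetical ensemble of two trees, each given by its leaf scores.
trees = [[0.2, -0.1, 0.4], [0.05, -0.05]]
print(regularization(trees[0], gamma=1.0, lam=1.0))
print(objective([1.0, 1.5], [1.1, 1.3], trees, gamma=1.0, lam=1.0))
```

A larger γ penalizes every additional leaf, and a larger λ shrinks the leaf scores; both push the model toward simpler trees.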

Principle of SVM regression (SVR) algorithm

SVR is applicable to small samples and nonlinear problems. Firstly, the data volume in this study is 593 samples, which is a relatively small sample; SVM performs well in small-sample scenarios and helps avoid overfitting. Secondly, it has good generalization ability. Based on the principle of structural risk minimization, SVM seeks the hyperplane with the largest margin; for energy consumption prediction, this means that it can make relatively accurate predictions for unseen driving behavior samples. Finally, it is robust: SVM is relatively insensitive to noise and outliers in the data and can therefore still predict energy consumption relatively accurately from driving behavior features.

The mathematical description of SVR⁴⁹ is as follows:

$$\min\ \frac{1}{2}\parallel\omega\parallel^{2}+C\sum_{i=1}^{n}L_{\epsilon}\left(f(x_i)-y_i\right)$$

(26)

where ω is the hyperplane weight vector, C represents the penalty coefficient (always positive), L_ε denotes the insensitive loss function, and ε indicates the loss threshold.

By introducing the slack variables ξ_i ≥ 0 and ξ_i^* ≥ 0, the optimization problem is rewritten as follows:

$$\min\ \frac{1}{2}\parallel\omega\parallel^{2}+C\sum_{i=1}^{n}(\xi_i+\xi_i^{*})$$

(27)

By introducing the Lagrange multipliers α_i^* and α_i, the solution form of the SVR can be finally obtained as follows:

$$f(x)=\sum_{i=1}^{n}(\alpha_i^{*}-\alpha_i)\,\kappa(x_i,x)+b=\omega x+b$$

(28)

where κ(x_i, x) is the introduced kernel function and b is the constant bias term.
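The two ingredients of (26)-(28) that are easiest to misread are the ε-insensitive loss and the kernel expansion of the prediction. The sketch below shows both, assuming the standard form L_ε(r) = max(0, |r| − ε) and an RBF kernel (the kernel listed in the hyperparameter settings). The support vectors, dual coefficients, and bias are hypothetical values, not a fitted model.

```python
import math

def eps_insensitive(residual, eps):
    """L_eps: zero inside the tube |residual| <= eps, linear outside it."""
    return max(0.0, abs(residual) - eps)

def rbf_kernel(u, v, gamma):
    """kappa(u, v) = exp(-gamma * ||u - v||^2)."""
    return math.exp(-gamma * sum((ui - vi) ** 2 for ui, vi in zip(u, v)))

def svr_predict(support_vectors, dual_coefs, b, x, gamma):
    """Eq. (28): f(x) = sum_i (alpha_i^* - alpha_i) * kappa(x_i, x) + b."""
    return sum(c * rbf_kernel(sv, x, gamma)
               for sv, c in zip(support_vectors, dual_coefs)) + b

# Hypothetical model: two support vectors with dual coefficients (alpha_i^* - alpha_i).
support_vectors = [(0.0,), (1.0,)]
dual_coefs = [0.5, -0.5]
print(eps_insensitive(0.05, eps=0.1))   # inside the tube, no penalty
print(eps_insensitive(-0.30, eps=0.1))  # outside the tube, penalized linearly
print(svr_predict(support_vectors, dual_coefs, b=1.0, x=(0.0,), gamma=1.0))
```

Errors smaller than ε cost nothing, which is what makes the fitted function depend only on the samples lying on or outside the tube.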

Evaluation indicators

To compare the prediction models quantitatively, three indicators are used: R², MAE, and RMSE. R² measures the proportion of the variation in the target variable that is explained by the model; the larger its value, the better the model fits the data. MAE and RMSE measure the prediction error, i.e., the degree of deviation between the predicted and actual values; the smaller they are, the better the model performs.

$$R^{2}=1-\frac{\text{SSE}}{\text{SST}}=1-\frac{\sum_{i=1}^{n}(y_i-\hat{y}_i)^{2}}{\sum_{i=1}^{n}(y_i-\bar{y})^{2}}$$

(29)

$$\text{MAE}=\frac{1}{n}\sum_{i=1}^{n}\left|y_i-\hat{y}_i\right|$$

(30)

$$\text{RMSE}=\sqrt{\text{MSE}}=\sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i-\hat{y}_i)^{2}}$$

(31)

where SSE is the sum of squares of residuals, SST represents the total sum of squares, y_i indicates the true observation, ŷ_i denotes the fitted value, and ȳ is the mean of the true observations.
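A direct transcription of (29)-(31) into Python may help when checking reported values by hand. The observations and predictions below are hypothetical and serve only to exercise the function.

```python
import math

def regression_metrics(y_true, y_pred):
    """R^2, MAE and RMSE as defined in Eqs. (29)-(31)."""
    n = len(y_true)
    mean_y = sum(y_true) / n
    sse = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))  # residual sum of squares
    sst = sum((yt - mean_y) ** 2 for yt in y_true)               # total sum of squares
    r2 = 1 - sse / sst
    mae = sum(abs(yt - yp) for yt, yp in zip(y_true, y_pred)) / n
    rmse = math.sqrt(sse / n)
    return r2, mae, rmse

# Hypothetical observations and predictions (kWh/km).
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 2.0, 2.5, 4.0]
r2, mae, rmse = regression_metrics(y_true, y_pred)
print(r2, mae, rmse)
```

Because RMSE squares the errors before averaging, it is never smaller than MAE and reacts more strongly to a few large errors.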

Modelling training

To address the potential sensitivity of the hold-out method to random data partitioning and to ensure a more robust evaluation of the model’s generalization ability, this study further adopted k-fold (k = 5) cross-validation. The entire dataset with a sample size of N = 583 was randomly divided into 5 mutually exclusive subsets of approximately equal size, each fold containing approximately 117 samples. Meanwhile, the ratio of “stopping” to “non-stopping” samples in each subset was kept consistent with that of the original dataset to avoid class imbalance bias. In each iteration, 4 subsets were used as the training set, with the model hyperparameters kept consistent with the original settings (GBDT: n_estimators = 100, learning_rate = 0.1, max_depth = 3; XGBoost: n_estimators = 100, learning_rate = 0.1, max_depth = 3, min_child_weight = 1; SVM: C = 1.0, kernel = 'rbf', degree = 3, gamma = 'auto'); the remaining subset served as the test set for evaluating the model’s performance. The above process was repeated 5 times, ensuring that each subset was used as the test set exactly once. Finally, the performance metrics (coefficient of determination R², mean absolute error (MAE), and root mean square error (RMSE)) from the 5 iterations were aggregated, and their mean values and standard deviations were calculated to characterize the stable performance and variability of the model.
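The stratified partitioning described above can be sketched as follows. The procedure (shuffling each class separately and dealing its samples round-robin across the folds) is one simple way to keep the class ratio equal in every fold and is an assumption, not necessarily the routine used in the study; the label vector is hypothetical.

```python
import random
from collections import defaultdict

def stratified_kfold(labels, k=5, seed=0):
    """Return k lists of test indices whose class ratios match the full data set."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, lab in enumerate(labels):
        by_class[lab].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled indices of each class round-robin across the folds.
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds

# Hypothetical labels: 1 = stopping, 0 = non-stopping.
labels = [1] * 50 + [0] * 50
folds = stratified_kfold(labels, k=5)
for test_idx in folds:
    held_out = set(test_idx)
    train_idx = [i for i in range(len(labels)) if i not in held_out]
    # A model would be fitted on train_idx and scored on test_idx here;
    # the per-fold metrics are then averaged and their spread reported.
    print(len(train_idx), len(test_idx), sum(labels[i] for i in test_idx))
```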

Model results

As can be seen from Table 8: (1) For the prediction of energy consumption when “not distinguishing between stopping and non-stopping”, GBDT has the lowest R², while the R² of XGBoost is slightly higher than that of SVM, by 0.55%. GBDT has the largest MAE, and the MAE of XGBoost is slightly smaller than that of SVM, by 0.87%. GBDT has the largest RMSE, and the RMSE of XGBoost is smaller than that of SVM by 17.77%. Thus, when “not distinguishing between stopping and non-stopping”, GBDT gives the poorest predictions and XGBoost the best (although in R² and MAE it is only slightly better than SVM).

(2) When the data are divided into “stopping” and “non-stopping” samples for energy consumption prediction, the advantages of SVM are significant. For the stopping samples, GBDT has the lowest R², while the R² of SVM is slightly higher than that of XGBoost, by 1.73%. GBDT has the largest MAE, and the MAE of SVM is smaller than that of XGBoost by 9.89%. GBDT has the largest RMSE, and the RMSE of SVM is slightly smaller than that of XGBoost, by 1.44%. Thus, for the stopping samples, GBDT gives the poorest predictions and SVM the best. For the non-stopping samples, GBDT has the lowest R², and the R² of SVM is higher than that of XGBoost by 12.43%. GBDT has the largest MAE, and the MAE of SVM is smaller than that of XGBoost by 4.30%. GBDT has the largest RMSE, and the RMSE of SVM is slightly smaller than that of XGBoost, by 0.79%. Thus, for the non-stopping samples, GBDT again gives the poorest predictions and SVM the best.

In conclusion, when stopping and non-stopping are not distinguished, the XGBoost model predicts energy consumption best. When stopping and non-stopping samples are distinguished, SVM gives the best predictions.

Table 8 Comparative performance of the models.


Fig. 7 Dot plot of actual and predicted energy consumption, with and without distinguishing between stopping and non-stopping, for different models. (a) GBDT, not distinguishing; (b) GBDT, stopping; (c) GBDT, non-stopping; (d) XGBoost, not distinguishing; (e) XGBoost, stopping; (f) XGBoost, non-stopping; (g) SVM, not distinguishing; (h) SVM, stopping; (i) SVM, non-stopping.


Fig. 7 plots the actual and predicted energy consumption of the E-Bus for a random test sample under each regression model, with the X-axis representing the observation number in the test sample and the Y-axis representing the energy consumption (target value).

Discussion and conclusion

When pure E-Buses travel through urban intersections, they are prone to stop-and-go driving due to traffic conditions, signals, and other factors, which leads to higher energy consumption near intersections. Avoiding stops and passing through intersections at relatively gentle or constant speeds helps reduce energy consumption at intersections and improve overall energy utilization. Driver behavior affects driving safety and ride comfort as well as energy consumption, with frequent acceleration and deceleration leading to increased energy consumption. Quantifying the impact of driving behavior on energy consumption at intersections therefore has important implications for the development of eco-driving strategies. We further hypothesize that predicting energy consumption after distinguishing between stopping and non-stopping yields a more accurate estimate of intersection energy consumption. In this study, the natural driving data of pure E-Buses are used to analyze the difference in energy consumption between “stopping” and “non-stopping” at intersections, to establish a classification model based on XGBoost-KNN, to clarify the correlation between driving behavior and energy consumption, and to build energy consumption prediction models based on three machine learning regression algorithms in order to investigate whether prediction with this distinction outperforms prediction without it.

Previous studies have sought an optimal speed profile to avoid “stopping” at intersections. Such a profile is conducive to reducing energy consumption; however, the difference in energy consumption with and without stopping has not been quantified. In contrast, this study finds that most energy consumption rates of pure electric buses at intersections are in the range of 0–2 kWh/km, with a mean value of 1.096 kWh/km. The energy consumption of “stopping” (1.386 kWh/km) is 71.5% higher than that of “non-stopping” (0.808 kWh/km), a significant difference. This suggests that the braking energy recovered during stopping is much smaller than the energy consumed during acceleration after stopping. This effect may be similar to the effect of target speed on energy consumption during exit, where higher acceleration out of the intersection leads to a sharper increase in energy consumption. Therefore, we propose, design, and validate the idea that predicting energy consumption may be more accurate when samples are first categorized by whether the vehicle stops. For the classification problem of whether to stop or not, the input parameters of the classification model, namely, speed, accelerator pedal opening, brake pedal opening, and acceleration, were determined through correlation analysis. They showed a high degree of correlation with stopping, probably because all of them are longitudinal motion parameters directly related to whether the vehicle stops (i.e., whether the speed reaches zero). We use the data sequence of about 5 s when entering the intersection to predict whether the bus will stop at the intersection, and the comparison between the models reveals that XGBoost-KNN achieves a prediction accuracy of 84.4%, higher than that of KNN and XGBoost.
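As a quick check of the 71.5% figure quoted above, taking the non-stopping rate as the baseline (an assumption about how the percentage was computed):

$$\frac{1.386-0.808}{0.808}=\frac{0.578}{0.808}\approx 0.715$$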

We attempt to use the data upstream of the intersection to predict the total energy consumption at the intersection. This will help improve the eco-friendliness of driving behavior and may support future eco-driving strategies in the connected environment. It is worth noting that most early studies focused on speed and acceleration/deceleration parameters to predict fuel or energy consumption at intersections, ignoring behavioral parameters that reflect driver heterogeneity. This study therefore shows through statistical analysis that driving behavior parameters have a significant effect on energy consumption, determines the sensitivity of these parameters by multiple regression analysis, and analyzes how the influencing factors change between cases. Stopping samples and non-stopping samples showed different relationships. For the stopping samples, the larger the maximum value of speed and the average value of negative acceleration, the smaller the energy consumption; moreover, the smaller the average value of positive acceleration, the smaller the energy consumption. This indicates that, for the stopping samples alone, the larger the speed and the smaller the acceleration (i.e., the larger the deceleration) at the time of entry into the intersection, the larger the recovered energy; at the same time, because the target speeds at the time of exit from the intersection are the same, the differences in energy expended are small, which leads to a reduction in energy consumption. For the non-stopping samples, the larger the maximum value of velocity, the minimum value of acceleration, and the average value of positive acceleration, the smaller the energy consumption. In addition, the smaller the standard deviation of velocity, the average value of accelerator pedal opening, the ratio of accelerator pedal pressing, the average value of acceleration, and the maximum value of acceleration, the smaller the energy consumption. This means that, without stopping, a larger velocity and acceleration when entering the intersection imply that less energy is needed to reach the target speed again when exiting; moreover, a smaller standard deviation of velocity, a smaller accelerator pedal opening, and so on indicate gentler acceleration, which leads to lower energy consumption.

To quantify the relationship between driving behavior and energy consumption at intersections, this study used the GBDT, XGBoost, and SVM machine learning regression algorithms to predict energy consumption, and added prediction without distinguishing between stopping and non-stopping as a comparison. The results show that when energy consumption is predicted directly without distinguishing between stopping and non-stopping scenarios, the XGBoost model achieves the best performance in terms of R², MAE, and RMSE. Specifically, its R² is 17.35% higher than that of GBDT and 0.55% higher than that of SVM; its MAE is 18.30% lower than that of GBDT and 0.87% lower than that of SVM; and its RMSE is 23.42% lower than that of GBDT and 17.77% lower than that of SVM. This may be because XGBoost, as an ensemble tree model, can better capture the nonlinear relationships in the global data distribution when processing mixed data (including the complex patterns of both stopping and non-stopping), and it is more robust, especially when there are many interacting feature dimensions. However, when energy consumption is predicted after distinguishing between stopping and non-stopping, SVM performs best. In the stopping samples, its R² is 42.62% higher than that of GBDT and 1.73% higher than that of XGBoost; its MAE is 22.12% lower than that of GBDT and 9.89% lower than that of XGBoost; and its RMSE is 10.75% lower than that of GBDT and 1.44% lower than that of XGBoost. In the non-stopping samples, its R² is 54.4% higher than that of GBDT and 12.43% higher than that of XGBoost; its MAE is 14.06% lower than that of GBDT and 4.30% lower than that of XGBoost; and its RMSE is 2.19% lower than that of GBDT and 0.79% lower than that of XGBoost. This may be related to the characteristics of SVM, which handles nonlinear problems through kernel function mapping. In scenarios where the sample distribution is relatively homogeneous (such as only stopping or only non-stopping samples), it can capture the subtle patterns of local data more accurately, and it is especially suitable for fitting complex relationships with small samples. The feature distributions of the stopping and non-stopping sub-samples (such as the dynamic ranges of acceleration and pedal operations) provide precisely the conditions under which SVM performs well. In addition, in both the stopping and the non-stopping samples, the prediction accuracy of SVM is higher than that of XGBoost without the distinction. This supports the feasibility, when facing an unknown intersection, of first predicting whether the bus will stop from the data on entering the intersection and then predicting energy consumption from the data upstream of the intersection. It should be noted that the results of this study are mainly based on the dedicated lanes of the Bus Rapid Transit (BRT) system, and their applicability in more complex mixed traffic environments still requires further research.

From the perspectives of policy, practice, and application, the research has three main implications. Firstly, it can guide the formulation of eco-driving strategies. Specific operational guidelines, such as “avoid stopping at intersections where possible” and “avoid rapid acceleration when starting”, can be extracted from the relationship between driving behaviors and energy consumption, providing standardized energy-saving driving norms for public transport operators. In addition, the stopping probability predicted by the model can support an “energy consumption-sensitive signal control system” that dynamically adjusts the green light duration at intersections to reduce energy consumption. Secondly, it can serve as a basis for the eco-driving training that bus companies give their drivers. By identifying whether the bus stops at an intersection, the associated energy consumption, and the relationship between the characteristic parameters and energy consumption, low-energy driving behaviors (such as not stopping and avoiding rapid acceleration) can be incorporated into training courses for bus drivers, and an interactive simulation training system can be developed to deliver the training. For example, a pilot project can be carried out in cooperation with a bus company, in which fifty drivers receive three months of dedicated training; comparing the energy consumption data before and after the training can then verify that the “prediction-feedback” strategy reduces energy consumption at intersections. It is also recommended that the transportation department formulate an “Eco-Driving Operation Specification for Electric Buses” and take the “energy consumption risk index” output by the model as a performance assessment indicator for drivers (for example, adding 5% to the performance score for every 10% reduction in energy consumption).
Finally, for the development of in-vehicle driver feedback systems or eco-driving applications, the energy consumption prediction model can be made lightweight and deployed on the in-vehicle ECU (Electronic Control Unit) or on smartphone terminals. A standardized CAN bus interface can be developed to obtain data such as vehicle speed and acceleration in real time. Combined with an in-vehicle multimodal feedback mechanism (voice prompts, dashboard visualization), real-time energy consumption data can be pushed to drivers to guide their actions. For example, when the vehicle detects that the traffic light ahead is green and sufficient green time remains, the system recommends maintaining the current speed and coasting through. A further open question is how to combine driving safety and comfort with an eco-driving strategy, for example through a two-level warning system for safety and eco-driving at intersections⁵⁰.
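
The green-light advisory example can be written as a simple rule. The sketch below is only an illustration of that rule: the function name, its parameters, the prompt wording, and the 2-second safety margin are assumptions, not part of the study or of any in-vehicle system.

```python
def advise(distance_m, speed_mps, signal_is_green, green_remaining_s, margin_s=2.0):
    """Return a driver prompt for the approach to a signalized intersection.

    All inputs and the safety margin are illustrative assumptions,
    not calibrated values.
    """
    if speed_mps <= 0.0:
        # A stationary vehicle has no meaningful arrival time.
        return "no advice"
    if not signal_is_green:
        return "prepare to stop"
    # Time needed to reach the stop line at the current speed.
    time_to_stop_line_s = distance_m / speed_mps
    if time_to_stop_line_s + margin_s <= green_remaining_s:
        return "maintain speed and coast through"
    return "ease off and prepare to stop"
```

For instance, a bus 100 m from the stop line at 10 m/s needs 10 s to arrive; with 15 s of green remaining the rule recommends maintaining speed, whereas with 11 s remaining the margin is not met and the rule advises easing off. A deployed system would also need the safety and comfort checks mentioned above.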

This study also has some limitations regarding the robustness and generality of the model. First, the factors taken into account are limited. Many other factors can affect energy consumption, such as the personal attributes of drivers (driving experience, gender, age) and the external environment (route alignment, traffic conditions, weather, road gradient). Second, the amount of data used for training influences the prediction accuracy of the model, and the data volume in this study is relatively small. Expanding the data volume while paying attention to class imbalance would help improve prediction accuracy. Finally, in a connected environment, the timing of traffic light changes affects drivers’ operating behaviors and consequently energy consumption, so taking it into account could also improve the accuracy of the model.

Future research should focus on multi-factor integrated models. These models should not only incorporate more data (such as road gradients, traffic densities, weather conditions, and multiple routes) but also take driver information into account. Qualitative and quantitative analyses that combine subjective and objective evidence should also be carried out, drawing on the insights of drivers and transportation planning stakeholders. For example, networked data from the traffic system could be used to analyze how traffic signals change as vehicles pass through intersections, and the stop or non-stop decision could be predicted from those signal changes to improve the accuracy of the model. Methods such as deep learning⁵¹ and reinforcement learning⁵² can also be adopted to further optimize the energy consumption model, and multiple validation methods can be used for cross-validation. In addition, as cars with different levels of autonomous driving become more popular, eco-driving strategies⁵³ in the human-machine co-driving mode should also be considered, so that they can be better applied in real-time driver feedback systems or eco-driving assistance applications.
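
As one concrete form of the cross-validation mentioned here, a k-fold procedure can be sketched as follows. This is a minimal sketch under stated assumptions: a mean-value baseline stands in for whichever energy model is being validated, mean absolute error is the only metric shown, and the energy values and function names are hypothetical.

```python
from statistics import mean


def k_fold_indices(n, k):
    """Yield (train_idx, test_idx) pairs for k contiguous folds over range(n)."""
    start = 0
    for i in range(k):
        # Spread the remainder over the first n % k folds.
        size = n // k + (1 if i < n % k else 0)
        test = list(range(start, start + size))
        train = [j for j in range(n) if j < start or j >= start + size]
        yield train, test
        start += size


def cross_validated_mae(ys, k):
    """Average the mean absolute error of a mean-value baseline over k folds."""
    fold_errors = []
    for train, test in k_fold_indices(len(ys), k):
        baseline = mean(ys[j] for j in train)  # "fit": mean of training targets
        fold_errors.append(mean(abs(ys[j] - baseline) for j in test))
    return mean(fold_errors)


# Hypothetical per-intersection energy values in kWh, NOT data from the study.
ENERGY_KWH = [0.30, 0.28, 0.26, 0.12, 0.14, 0.16]
```

`cross_validated_mae(ENERGY_KWH, 3)` averages the error over three held-out folds. The folds here are contiguous and unshuffled to keep the sketch short; with ordered data such as this, shuffling (or stratifying by stop and non-stop) before splitting would normally be preferable.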

Data availability

The data used to support the findings of this study are available from the corresponding author upon request.

References

Zhao, X., Zhao, X., Yu, Q., Ye, Y. & Yu, M. Development of a representative urban driving cycle construction methodology for electric vehicles: A case study in Xi’an. Transp. Res. Part. D: Transp. Environ., 81, (2020).
Traffic Administration Bureau of the Ministry of Public Security of PRC, China. [Online]. Available: https://www.mps.gov.cn/n2254314/n6409334/c9106375/content.html, Accessed on 2023.
Mintsis, E., Vlahogianni, E. I. & Mitsakis, E. Dynamic Eco-Driving near signalized intersections: systematic review and future research directions. J. Transp. Eng. Part. A: Syst., 146, 4, (2020).
Younes, Z., Boudet, L., Suard, F., Gérard, M. & Rioux, R. Analysis of the Main Factors Influencing the Energy Consumption of Electric Vehicles, In: 2013 IEEE International Electric Machines & Drives Conference (IEMDC), Chicago, United States, May .6556260. (2013).
Berry, I. M. The effects of driving style and vehicle performance on the real-world fuel consumption of us light-duty vehicles, Ph.D. dissertation, Dept. Mech. Eng., Massachusetts Inst. Technol., Cambridge, MA, USA, (2010).
Zhang, R. & Yao, E. Electric vehicles’ energy consumption Estimation with real driving condition data. Transp. Res. Part. D: Transp. Environ. 41, 177–187 (2015).
Kambly, K. & Bradley, T. H. Geographical and Temporal differences in electric vehicle range due to cabin conditioning energy consumption. J. Power Sources. 275, 468–475 (2015).
Al-Wreikat, Y., Serrano, C. & Sodré, J. R. Effects of ambient temperature and trip characteristics on the energy consumption of an electric vehicle. Energy 238, (2022).
Lajunen, A. Energy consumption and cost-benefit analysis of hybrid and electric City buses. Transp. Res. Part. C: Emerg. Technol. 38, 1–15 (2014).
Zou, Y., Wei, S., Sun, F., Hu, X. & Shiao, Y. Large-scale deployment of electric taxis in Beijing: A real-world analysis, Energy, 100, 25–39, (2016).
Zhang, Y., Fu, R., Guo, Y. & Yuan, W. Environmental screening model of driving behavior for an electric bus entering and leaving stops. Transportation Research Part D: Transport Environ. 112, (2022).
Fiori, C., Ahn, K. & Rakha, H. A. Power-based electric vehicle energy consumption model: Model development and validation. Appl. Energy 168, 257–268 (2016).
Qiu, C. & Wang, G. New evaluation methodology of regenerative braking contribution to energy efficiency improvement of electric vehicles. Energy. Conv. Manag. 119, 389–398 (2016).
Ullah, I. et al. A comparative performance of machine learning algorithm to predict electric vehicles energy consumption: A path towards sustainability. Energy Environ. 33 (8), 1583–1612 (2021).
Wu, L., Ci, Y., Wang, Y. & Chen, P. Fuel consumption at the oversaturated signalized intersection considering queue effects: A case study in Harbin, China, Energy, vol. 192, (2020).
Garder, P. Fuel consumption at a modern roundabout vs. a signalized intersection: a case study comparing two similar intersections in Bangor, Maine, In: Proceedings of the 91st annual meeting of transportation research board, Washington, DC., USA, 1–11. (2012).
Ahn, K., Rakha, H., Trani, A. & Aerde, M. V. Estimating vehicle fuel consumption and emissions based on instantaneous speed and acceleration levels. J. Transp. Eng. 192 (2), 182–190 (2002).
Yao, Z., Wei, H., Liu, H. & Li, Z. Statistical vehicle specific power profiling for Urban freeways. Procedia – Soc. Behavioral Sci. 96, 2927–2938, (2013).
Zhang, W., Wang, W., Yin, H. & Hu, G. Study on vehicle fuel consumption of signalized intersections. J. Southeast. Univ. (Natural Sci. Ed.) 32 (2), 249–251 (2022).
Xiang, Q., Wang, W. & Lu, J. Research on fuel consumption at signalized intersection. J. Highway Transp. Res. Dev. 21 (12), 100–102 (2004).
Feng, Y., Zhang, Y., Leng, J., Sun, G. & Feng, Y. Model for evaluation on fuel economy of urban intersection. J. Harbin Inst. Technol. 18 (3), 79–83 (2011).
Akcelik, R., Bayley, C., Bowyer, D. P. & Biggs, D. C. A hierarchy of vehicle fuel consumption models. Traffic Eng. Control. 24 (10), 491–495 (1983).
Bakibillah, A. S. M., Kamal, M. A. S., Tan, C. P., Hayakawa, T. & Imura, J. I. Event-driven stochastic eco-driving strategy at signalized intersections from self-driving data, IEEE Trans. Veh. Technol., 68, 9, 8557–8569, (2019).
Knowles, M., Scott, H. & Baglee, D. The effect of driving style on electric vehicle performance, economy and perception. Int. J. Electr. Hybrid Veh. 4 (3), 228–247 (2012).
Liu, K., Liu, D., Li, C. & Yamamoto, T. Eco-speed guidance for the mixed traffic of electric vehicles and internal combustion engine vehicles at an isolated signalized intersection, Sustainability. 11, 20, (2019).
Yan, Y. & Fan, Y. Influence of the driver style difference in the acceleration process on the energy consumption of the EV bus. Adv. Eng. Res. (AER). 82, 186–190 (2017).
Tang, T. Q., Yi, Z. Y., Zhang, J. & Zheng, N. Modelling the driving behaviour at a signalised intersection with the information of remaining green time. IET Intel. Transport Syst. 11 (9), 596–603 (2017).
Ping, P., Qin, W., Xu, Y., Miyajima, C. & Takeda, K. Impact of driver behavior on fuel consumption: classification, evaluation and prediction using machine learning, IEEE Access 7 78515–78532, (2019).
Albool, I. et al. Fuel consumption at signalized intersections: investigating the impact of different signal indication settings, Case Studies on Transport Policy, 13, (2023).
Yang, H., Almutairi, F. & Rakha, H. Eco-Driving at signalized intersections: A multiple signal optimization approach. IEEE Trans. Intell. Transp. Syst. 22 (5), 2943–2955 (2021).
Wei, X. et al. Co-optimization method of speed planning and energy management for fuel cell vehicles through signalized intersections. J. Power Sources, 518, (2022).
Liu, H., Zhang, Y., Zhang, Y. & Zhang, K. Evaluating impacts of intelligent transit priority on intersection energy and emissions, Transp. Res. Part D: Transport Environ., 86, (2020).
Dong, H. et al. Predictive energy-efficient driving strategy design of connected electric vehicle among multiple signalized intersections. Transp. Res. Part. C: Emerg. Technol., 137, (2022).
Meng, X. & Cassandras, C. G. Eco-Driving of autonomous vehicles for nonstop crossing of signalized intersections. IEEE Trans. Autom. Sci. Eng. 19 (1), 320–331 (2022).
Hesami, S., De Cauwer, C., Rombaut, E., Vanhaverbeke, L. & Coosemans, T. Energy-Optimal speed control for autonomous electric vehicles Up- and downstream of a signalized intersection. World Electr. Veh. J., 14, 2, (2023).
Xia, H. et al. Field operational testing of ECO-approach technology at a fixed-time signalized intersection, In: 2012 15th International IEEE Conference on Intelligent Transportation Systems, Anchorage, Alaska, USA, 16–19. (2012).
Musiał, J. et al. Assessment and analysis of road transport driver’s behavior in terms of eco-driving, MATEC Web of Conferences, 332, (2021).
Zhang, J., Tang, T. Q., Yan, Y. & Qu, X. Eco-driving control for connected and automated electric vehicles at signalized intersections with wireless charging. Appl. Energy 282 (2021).
Hong, J. et al. Data-driven multi-dimension driving safety evaluation for real-world electric vehicles, IEEE Trans. Veh. Technol., 73, 7, 9721–9733, July (2024).
Murphey, Y. L., Milton, R. & Kiliaris, L. Driver’s style classification using jerk analysis. In: 2009 IEEE Workshop on Computational Intelligence in Vehicles and Vehicular Systems, Nashville, TN, USA, 23–28, (2009).
Louie, J. F. & Mouloua, M. Predicting distracted driving: The role of individual differences in working memory. Appl. Energy 74, 154–161 (2019).
Lu, Y., Fu, X., Guo, E. & Tang, F. XGBoost algorithm-based monitoring model for urban driving stress: combining driving behaviour, driving environment, and route familiarity. IEEE Access. 9, 21921–21938 (2021).
Nan, S., Tu, R., Li, T., Sun, J. & Chen, H. From driving behavior to energy consumption: A novel method to predict the energy consumption of electric bus. Energy 261, 125188 (2022).
Karri, S. L., De Silva, L. C. & Lai, D. T. C. Classification and prediction of driving behaviour at a traffic intersection using SVM and KNN. SN COMPUT. 2, 209 (2021).
Khalid, N. W. & Abdullah, W. D. Enhancing Road Safety with Smartphone-Based Machine Learning Driver Behavior Classification and Aggressive Driving Detection Using Feature Reduction Methods, In: International Conference On Innovative Computing And Communication, 111–139, (Springer Nature Singapore, 2024).
Xiao, J. SVM and KNN ensemble learning for traffic incident detection. Phys. A: Stat. Mech. Its Appl. 517, 29–35 (2019).
Ledger, S., Bennett, J. M., Chekaluk, E. & Batchelor, J. Cognitive function and driving: important for young and old alike. Transportation Research Part F: Traffic Psychology and Behaviour 60, 262–273 (2019).
Zhu, Q. & Yang, P. Research on artificial intelligence prediction algorithm of photovoltaic power station output based on GBDT regression. Power Syst. Big Data. 24 (11), 16–22 (2021).
Zhang, Z., Zhang, Z., Li, Y. & Zhang, S. Prediction of lateral spread for hot strip finishing mill based on SVR model. J. Iron Steel Res. 35 (8), 1017–1024 (2023).
Zhang, Y., Li, X., Yu, Q. & Yan, X. Developing a two-stage auditory warning system for safe driving and eco-driving at signalized intersections: A driving simulation study. Accid. Anal. Prev., 175, (2022).
Meng, D. et al. Incentive-Driven partial offloading and resource allocation in vehicular edge computing networks. IEEE Internet Things J., (2024).
Zhou, H., Jiang, K., He, S., Min, G. & Wu, J. Distributed deep Multi-Agent reinforcement learning for cooperative edge caching in Internet-of-Vehicles. IEEE Trans. Wireless Commun. 22 (12), 9595–9609 (2023).
Zhang, Y., Fu, R., Guo, Y. & Yuan, W. Eco-driving strategy for connected electric buses at the signalized intersection with a station. Transport. Res. Part D: Transport Environ. 128, (2024).

Acknowledgements

This work was supported by the Natural Science Basic Research Program of Shaanxi Province (2023-JC-YB-596). We are grateful for that support.

Author information

Authors and Affiliations

Vocational and Technical College, Xianyang Normal University, Xianyang, 712099, China
Aihong Lyu
School of Automobile, Chang’an University, Xi’an, 710018, China
Huiming Zhang & Yali Zhang
China Academy of Transportation Sciences, Beijing, 100029, China
Cheng Li
Guangzhou Bus Group Co., Ltd, Guangzhou, 510098, China
Yaowu Chen

Authors

Aihong Lyu
Huiming Zhang
Yali Zhang
Cheng Li
Yaowu Chen

Contributions

Conceptualization, A. L.; methodology, H. Z.; investigation, C. L. and Y. C.; data curation, Y. Z.; writing-original draft, H. Z.; project administration, A. L.; writing-review and editing, A. L. and Y. Z. All authors have read and agreed to the published version of the manuscript.

Corresponding authors

Correspondence to
Huiming Zhang or Yali Zhang.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

Lyu, A., Zhang, H., Zhang, Y. et al. A study on energy consumption analysis and prediction of electric bus at intersections considering driving behavior. Sci Rep 15, 44755 (2025). https://doi.org/10.1038/s41598-025-28835-4

Received: 29 February 2024
Accepted: 12 November 2025
Published: 29 December 2025
Version of record: 29 December 2025
DOI: https://doi.org/10.1038/s41598-025-28835-4

Keywords

Intersections
Energy consumption analysis
Driving behavior
Stopping
Non-stopping
Machine learning

Source: Ecology - nature.com
