This is a self-archived – parallel published version of this article in the 

publication archive of the University of Vaasa. It might differ from the original. 

Short-term traffic flow prediction based on 

whale optimization algorithm optimized 

BiLSTM_Attention 

Author(s): Xu, Xing; Liu, Chengxing; Zhao, Yun; Lv, Xiaoshu 

Title: Short-term traffic flow prediction based on whale optimization algorithm 

optimized BiLSTM_Attention 

Year: 2022 

Version: Accepted manuscript 

Copyright ©2022 Wiley. This is the peer reviewed version of the following article: 

Xu, X., Liu, C., Zhao, Y. & Lv, X. (2022). Short-term traffic flow prediction 

based on whale optimization algorithm optimized BiLSTM_Attention. 

Concurrency and Computation: Practice and Experience 34(10), e6782, 

which has been published in final form at 

https://doi.org/10.1002/cpe.6782. This article may be used for non-

commercial purposes in accordance with Wiley Terms and Conditions for 

Use of Self-Archived Versions. This article may not be enhanced, enriched 

or otherwise transformed into a derivative work, without express 

permission from Wiley or by statutory rights under applicable legislation. 

Copyright notices must not be removed, obscured or modified. The article 

must be linked to Wiley’s version of record on Wiley Online Library and 

any embedding, framing or otherwise making available the article or pages 

thereof by third parties from platforms, services and websites other than 

Wiley Online Library must be prohibited. 

Please cite the original version: 


Xu, X., Liu, C., Zhao, Y. & Lv, X. (2022). Short-term traffic flow 

prediction based on whale optimization algorithm optimized 

BiLSTM_Attention. Concurrency and Computation: Practice and 

Experience 34(10), e6782. https://doi.org/10.1002/cpe.6782 

 
Short-term traffic flow prediction based on whale optimization 

algorithm optimized BiLSTM_Attention 

 
Xing Xu¹, Chengxing Liu¹, Yun Zhao², Xiaoshu Lv³,⁴

¹ School of Mechanical and Energy Engineering, Zhejiang University of Science and Technology, Hangzhou, China

² School of Information and Electronic Engineering, Zhejiang University of Science and Technology, Hangzhou, China

³ Department of Electrical Engineering and Energy Technology, University of Vaasa, Vaasa, Finland

⁴ Department of Civil Engineering, Aalto University, Espoo, Finland

 
Abstract

With the growth in population and vehicles, traffic flow has become more complex, and uncertain disruptions occur more often. Accurate prediction of urban traffic flow is important for intelligent decision-making and warning but remains a challenge. Many researchers have applied neural network methods, such as convolutional neural networks and recurrent neural networks, to traffic flow prediction modeling. However, training a conventional network to obtain the best network parameters and structure is difficult, because different hyperparameters lead to different network structures. Therefore, this article proposes a traffic flow prediction model based on the whale optimization algorithm (WOA) optimized BiLSTM_Attention structure to solve this problem. The traffic flow is first predicted using the BiLSTM_Attention network, which is then optimized using the WOA to obtain its four best parameters: the learning rate, the training times, and the numbers of nodes of the two hidden layers. Finally, the four best parameters are used to build a WOA_BiLSTM_Attention model. The proposed model is compared with both conventional neural network models and a neural network model optimized by the WOA. Based on the evaluation metrics of MAPE, RMSE, MAE, and R², the WOA_BiLSTM_Attention model proposed in this article presents the best performance.

KEY WORDS

attention, BiLSTM, prediction, traffic flow, whale optimization algorithm

1 INTRODUCTION

1.1 Background

Traffic flow is an important indicator of urban development and operation status. Predicting, adjusting, and controlling traffic flow is of great significance to urban management.¹ How to effectively predict traffic flow trends for early intervention is one of the key issues in decision-making for traffic management.² Traffic flow data collected at fixed observation points often include many attributes, such as traffic speed, vehicle flow, passing duration, and specific road conditions. Due to the large number of vehicles and complex transport networks, the data are characterized by large size, high variation, and high dynamics. In particular, the recent development of smart cities³ has led to more and more sensors being installed on vehicles, enabling communication between vehicles and peripheral equipment at any time, such as vehicle to vehicle (V2V)⁴ communication, vehicle to infrastructure (V2I)⁵ communication, and vehicle to everything (V2X) information exchanges. Hence, vehicles can send their position information at any time. This Internet of Vehicles technology builds on the integration of technologies such as GPS navigation,⁶,⁷ and it generates massive amounts of traffic flow data.⁸ These large traffic datasets lay the foundation for traffic flow prediction. Accurate traffic flow prediction also allows urban terminals to carry out better traffic planning and to judge the operation condition of traffic networks in advance.

Related work

In recent years, various modeling methods have been applied to build stable and accurate traffic flow prediction models. The Autoregressive Integrated Moving Average (ARIMA) model, the Gaussian process, and the Kalman filter constitute the major part of conventional methods. For example, Liu⁹ applied the ARIMA model to study rail traffic flow. Emami¹⁰ applied the Kalman filter to filter traffic flow fluctuations and predict traffic flow. These conventional models usually adopt linear methods that are too simple to reflect the non-linear factors involved in short-term traffic flow prediction, resulting in low prediction accuracy.

Machine learning and artificial intelligence are at the core of a second group of traffic flow prediction algorithms. Such algorithms include artificial neural networks (ANN),¹¹ extreme learning machines (ELM), and support vector regression (SVR). For example, Zhang¹² used the ELM non-iterative algorithm to predict air traffic flow. Castro-Neto¹³ used the SVR algorithm to predict short-term traffic flow. These methods have great advantages in tackling non-linear problems, but they cannot capture temporal correlation and they predict poorly on large-scale data. Deep learning models have been further introduced in the load prediction area, including the deep neural network (DNN),¹⁴ the stacked AutoEncoder (SAE),¹⁵ and the convolutional neural network (CNN).¹⁶ Although these DNNs achieve higher load prediction accuracy than shallow ANNs, they also require manual extraction of temporal characteristics. If the temporal characteristics of traffic flow data are ignored, manual extraction of characteristics will affect the continuity of the load data and thus reduce the prediction accuracy of the models.

 
1.2 Major contribution 

 
To meet the requirement of high accuracy in traffic flow prediction, a whale optimization algorithm (WOA) optimized BiLSTM_Attention (“BiLSTM_A” for short) model is proposed in this article. BiLSTM is a bidirectional LSTM network structure that can process time series better than the unidirectional LSTM and can extract features from traffic flow data in two directions.¹⁷ Attention is a weight mechanism used to capture the different weights of hidden layers and thereby to overcome the long-term dependency problem of networks such as RNNs and RNN-based improved networks when the input time series is long. Once the network framework is established, obtaining the best hyperparameters of the network structure is difficult. Therefore, the BiLSTM_A model is optimized using the WOA to obtain the best parameters, so as to build a WOA_BiLSTM_Attention (“WOA_BiLSTM_A” for short) model.

 
2 RELEVANT THEORIES 

 
2.1 BiLSTM short-term traffic flow prediction based on improved WOA optimized attention mechanism 

 
This article proposes a BiLSTM_A model optimized using the WOA. Once the framework of a model is established, it is often difficult to obtain the best hyperparameters directly in a single attempt. Even with the same framework, networks with different hyperparameters can differ greatly in accuracy. Therefore, to predict short-term traffic flow, a WOA_BiLSTM_A model is proposed based on the BiLSTM, the attention mechanism, and the WOA. The model is optimized using the WOA to improve the adaptability of its hyperparameters. The experimental results show that the optimized network performs better than the comparison networks.

 
2.2 Whale optimization algorithm 

 
The WOA is a new heuristic optimization algorithm that mimics the hunting behavior of whales. Whales use a special hunting method called the bubble-net hunting strategy.¹⁸ The bubble-net is shown in Figure 1. The WOA involves the following three stages: encircling prey, bubble-net attacking, and search for prey.¹⁹ The process is shown in Figure 2.

 
2.2.1 Encircling prey 

 
This stage mainly mimics the behavior of whales encircling the prey during hunting. To describe the behavior, the following model is proposed:

$$\vec{D} = \left| \vec{C} \cdot \vec{X}^{*}(t) - \vec{X}(t) \right|$$

$$\vec{X}(t+1) = \vec{X}^{*}(t) - \vec{A} \cdot \vec{D}$$

where $\vec{A}$ and $\vec{C}$ are coefficient vectors, $\vec{A} = 2\vec{a} \cdot \vec{r}_1 - \vec{a}$, $\vec{C} = 2 \cdot \vec{r}_2$, and $\vec{a}$ decreases to 0 during the search according to $\vec{a} = 2 - 2t/T_{max}$. $T_{max}$ is the maximum number of iterations, $\vec{r}_1$ and $\vec{r}_2$ are random vectors in [0, 1], $t$ is the current iteration number, $\vec{X}^{*}(t)$ is the vector of the best whale position found so far, $\vec{X}(t)$ is the vector of the current whale position, and $|\cdot|$ denotes the absolute value. If a better solution is found, $\vec{X}^{*}$ is updated in that iteration.

FIGURE 1 Whale hunting behavior diagram

FIGURE 2 WOA algorithm flow chart

 
2.2.2 Hunting behavior 

 
Whales swim toward the prey along a spiral path, and the hunting behavior is expressed as follows:

$$\vec{X}(t+1) = \vec{D}' \cdot e^{bl} \cdot \cos(2\pi l) + \vec{X}^{*}(t)$$

where $\vec{D}' = \left| \vec{X}^{*}(t) - \vec{X}(t) \right|$ is the distance between a whale and the prey, $b$ is a constant defining the shape of the logarithmic spiral, and $l$ is a uniformly distributed random number in [−1, 1]. Whales swim around the prey within a shrinking circle and along a spiral-shaped path simultaneously. So, we assume a probability $P_i$ of selecting the shrinking encircling method and a probability $1 - P_i$ of selecting the spiral model to update the position of whales. With both probabilities set to 0.5 and $p$ a random number in [0, 1], the mathematical model is described as follows:

$$\vec{X}(t+1) = \begin{cases} \vec{X}^{*}(t) - \vec{A} \cdot \vec{D}, & \text{if } p < 0.5 \\ \vec{D}' \cdot e^{bl} \cdot \cos(2\pi l) + \vec{X}^{*}(t), & \text{if } p \geq 0.5 \end{cases}$$

When the value of $\vec{A}$ is in [−1, 1], the new position of a whale can be anywhere between its current position and the prey's position. The algorithm specifies that when $|\vec{A}| < 1$, the whales attack the prey.

 
2.2.3 Search for prey 

 
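The mathematical model for this phase is as follows:

$$\vec{D} = \left| \vec{C} \cdot \vec{X}_{rand} - \vec{X}(t) \right|$$

$$\vec{X}(t+1) = \vec{X}_{rand} - \vec{A} \cdot \vec{D}$$

where $\vec{X}_{rand}$ is the position vector of a whale selected at random from the current population. This update is used when $|\vec{A}| \geq 1$, so that the whales search away from the current best position.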
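The three update rules of Sections 2.2.1 to 2.2.3 can be combined into one loop. The following is a minimal sketch in plain Python, not the authors' implementation. The function name `woa_minimize`, the spiral constant `b = 1`, the clipping of positions to the search bounds, and the use of one scalar A and C per whale (rather than a vector) are all assumptions made for illustration.

```python
import math
import random


def woa_minimize(fitness, lower, upper, n_whales=5, max_iter=10, b=1.0, seed=0):
    """Minimize `fitness` over the box [lower, upper] with a basic WOA loop."""
    rng = random.Random(seed)

    def clip(x):
        # Assumption: positions leaving the search box are clipped to its bounds.
        return [min(max(v, lo), hi) for v, lo, hi in zip(x, lower, upper)]

    # Random initial population inside the bounds.
    whales = [[rng.uniform(lo, hi) for lo, hi in zip(lower, upper)]
              for _ in range(n_whales)]
    best = min(whales, key=fitness)[:]
    best_fit = fitness(best)

    for t in range(max_iter):
        a = 2.0 - 2.0 * t / max_iter  # decreases linearly from 2 towards 0
        new_whales = []
        for x in whales:
            p = rng.random()
            l = rng.uniform(-1.0, 1.0)
            A = 2.0 * a * rng.random() - a  # scalar form of A = 2a*r1 - a
            C = 2.0 * rng.random()          # scalar form of C = 2*r2
            if p < 0.5:
                # |A| < 1: encircle the best whale; otherwise search around a random whale.
                target = best if abs(A) < 1.0 else rng.choice(whales)
                new_x = [tg - A * abs(C * tg - xi) for tg, xi in zip(target, x)]
            else:
                # Spiral update around the best position found so far.
                new_x = [abs(bs - xi) * math.exp(b * l) * math.cos(2.0 * math.pi * l) + bs
                         for bs, xi in zip(best, x)]
            new_whales.append(clip(new_x))
        whales = new_whales
        # Keep the best solution found so far.
        for x in whales:
            f = fitness(x)
            if f < best_fit:
                best, best_fit = x[:], f
    return best, best_fit
```

For example, `woa_minimize(lambda x: sum(v * v for v in x), [-5.0, -5.0], [5.0, 5.0])` searches for the minimum of a simple quadratic function. The default population size and iteration count mirror the values used later in Section 4.3.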



FIGURE 3 LSTM cell structure diagram

 
2.3 Attention mechanism 

 
The attention mechanism mimics human attention and is a means of allocating attentional resources. In some cases, people concentrate on what is worthy of attention at certain moments. In this process, they often ignore other areas so as to obtain more details worthy of attention and suppress other useless information. The core principle of this algorithm is how to rationally and skillfully direct attention to the information concerned, ignore irrelevant information, and amplify the necessary information to the maximum extent.²⁷ The attention structure is shown in Figure 4, where $x_t$ ($t \in [1, n]$) represents the input of the network, $h_t$ ($t \in [1, n]$) is the output of the hidden layer for the corresponding input, $\alpha_t$ ($t \in [1, n]$) is the attention probability distribution value assigned by the attention mechanism to the hidden layer output, and $y$ is the output value of the network with the attention mechanism.

FIGURE 4 Attention structure diagram

Conventional encoder-decoder RNN models often have the problem that input series, regardless of their length, are all encoded into a vector representation with a fixed length, while decoding is limited by that fixed-length vector representation. This problem badly restricts the performance of the models, especially when the input series are long. The attention mechanism removes the limitation that conventional encoder-decoders rely on an internal fixed-length vector during encoding. It retains the intermediate output results of the encoder for the input series and then trains a model to selectively learn from these inputs and correlate the output series with them when the model produces its output. In other words, the probability of each item in the output series depends on which items are selected from the input series. This is the core principle of the attention mechanism.

 
3 ALGORITHM PROCESS 

 
3.1 BiLSTM_A model 

 
The attention mechanism attracted much attention when it was proposed. Just as human beings attach more importance to some information, the attention mechanism can assign weights to the information obtained and perform summation based on the weights.²⁸ As a result, the attention method is highly interpretable and effective. In the early stage, researchers integrated the attention mechanism and the BiLSTM to address text translation and classification problems.²⁹ In this article, the BiLSTM is used to process traffic flow data, which are essentially time series. The network structure of the BiLSTM_A model is shown in Figure 5.

The BiLSTM_A structure is divided into the following five layers:

1. Input Layer: inputting series. The series may be character series or time series, or a combination of both. In this article, the input traffic flow is a time series.

2. Embedding Layer: mapping each time series into a low-dimensional vector. The embedding layer of the model covers the embedding of time series words and the embedding of relative position codes. A vector may be randomly initialized or a trained vector may be used.

3. LSTM Layer: using the BiLSTM to obtain advanced features from the previous step.

4. Attention Layer: generating a weight vector and multiplying the features of each time step by it to combine them into a single time series feature vector.

5. Output Layer: finally, producing the output from the time series feature vector.

As shown in Figure 5, the main difference between conventional BiLSTM models and the BiLSTM_A model is that in the latter a structure called the Attention Layer is inserted after the BiLSTM layer and before the full connection to the softmax layer. The Attention structure first calculates the weight of each position of the time series in the BiLSTM outputs, then performs a weighted summation and uses the sum as the representation vector of the time series, and finally conducts output and prediction.

FIGURE 5 The network structure of BiLSTM_A model

The calculation formula of Attention is as follows:

$$M = \tanh(H)$$

$$\alpha = \mathrm{softmax}\left(\omega^{T} M\right)$$

$$r = H \alpha^{T}$$

where $H \in \mathbb{R}^{d_\omega \times T}$ is the matrix of BiLSTM output vectors, $T$ is the length of the input time series, and the weighted sum of these output vectors constitutes the time series representation $r$. $d_\omega$ is the dimension of the time series vector, $\omega$ is a trained parameter vector, $\omega^{T}$ is its transpose, and the dimensions of $\omega$, $\alpha$, and $r$ are $d_\omega$, $T$, and $d_\omega$, respectively.

The final time series representation is obtained from the following formula:

$$h^{*} = \tanh(r)$$

Finally, a softmax classifier takes the hidden state $h^{*}$ as input and predicts $\hat{y}$, the label from a discrete set of classes $Y$. The loss function is the negative log-likelihood of the true label, with an L2 regularization term in which $\lambda$ is the regularization parameter. In this article, overfitting is alleviated through both dropout and L2 regularization.
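The attention weighting described in this section (scores from tanh-transformed hidden states, a softmax over time steps, then a weighted sum) can be sketched in plain Python. This is an illustration only, not the authors' code. The function name `attention_pool` is hypothetical, and the hidden states are stored as a list of T vectors of length d, which is the transpose of the d × T matrix used in the text.

```python
import math


def attention_pool(H, w):
    """H: list of T hidden vectors, each of length d. w: parameter vector of length d.

    Returns (alpha, r): the attention weights over the T time steps and the
    weighted sum of the hidden vectors.
    """
    # Score of each time step: w^T tanh(h_t).
    scores = [sum(wi * math.tanh(hi) for wi, hi in zip(w, h_t)) for h_t in H]
    # Softmax over time steps, shifted by the maximum for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    alpha = [e / total for e in exps]
    # Weighted sum of hidden vectors: r_j = sum_t alpha_t * H[t][j].
    d = len(w)
    r = [sum(alpha[t] * H[t][j] for t in range(len(H))) for j in range(d)]
    return alpha, r
```

With a zero parameter vector every time step receives the same weight, so `r` reduces to the plain average of the hidden vectors. A trained `w` shifts weight toward the time steps that matter most for the prediction.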

 
3.2 Hyperparameters of WOA optimized BiLSTM_A 

 
The WOA helps the BiLSTM_A algorithm avoid being trapped in a local best solution and improves the accuracy of parameter optimization.

The WOA optimization obtains the maximum or minimum value of a fitness function. In this article, the mean square error between the BiLSTM_A network output and the actual value is used as the fitness function to be minimized. The formula is given below:

$$\mathrm{MSE} = \frac{1}{N} \sum_{i=1}^{N} \left( y_i - \hat{y}_i \right)^{2}$$

where $y_i$ is the ith actual value, $\hat{y}_i$ is the ith predicted value, and $N$ is the total number of samples. The more accurate the predicted values, the smaller the loss values obtained.


FIGURE 6 WOA_BiLSTM_A algorithm flow chart

The flow chart of the WOA_BiLSTM_A short-term traffic flow prediction model is shown in Figure 6.

According to Figure 6, the WOA optimized BiLSTM_A algorithm includes three parts: WOA, BiLSTM_A, and data. The WOA part is the detailed flow of the whale algorithm. The BiLSTM_A part implements the detailed prediction algorithm. The data part is used to process data. The parts exchange parameters during model training. The prediction process of the model is described below:

1. Initialize the BiLSTM_A model parameters.

2. Initialize the WOA population. Build a set of values with the four variables (iter, $\alpha$, n1, n2) and put them into the WOA as optimization parameters. The four parameters represent the number of iterations, the learning rate, and the numbers of nodes of the two hidden layers of the BiLSTM network.

3. Assign the initialized values, as the historical best values, to the parameters of BiLSTM_A and train the model. Then, predict on the test set and output the mean square error between the actual output value and the expected output value as the TrainingLoss.

4. Set the TrainingLoss obtained from the conventional BiLSTM_A training as the system requirement. Adjust the parameters with the WOA according to the fitness to update the best solution of the population, and obtain the loss value of the WOA optimized model.

5. If the loss value of the model optimized by the WOA is less than the TrainingLoss, which means that the requirement is met, output the final prediction model and parameter values.

6. If the loss value is not less than the TrainingLoss and the maximum number of iterations has not been reached, update the parameters and carry out training again. Otherwise, stop training.
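The control flow of steps 3 to 6 can be sketched as follows. This is an illustration under stated assumptions, not the authors' code. `optimize_hyperparameters`, `evaluate`, and `propose` are hypothetical names: `evaluate` stands for training BiLSTM_A with the given parameters and returning the test-set mean square error, and `propose` stands in for the WOA position update.

```python
def optimize_hyperparameters(evaluate, propose, initial_params, max_iter):
    """Search loop following steps 3 to 6.

    evaluate(params) -> loss, e.g. the test-set MSE after training the model.
    propose(best_params, iteration) -> candidate params (stand-in for the WOA update).
    """
    # Step 3: the loss of the initial parameters becomes the TrainingLoss.
    training_loss = evaluate(initial_params)
    best_params, best_loss = initial_params, training_loss

    for iteration in range(max_iter):
        # Step 4: generate a candidate and evaluate its fitness.
        candidate = propose(best_params, iteration)
        loss = evaluate(candidate)
        if loss < best_loss:
            best_params, best_loss = candidate, loss
        # Step 5: stop as soon as the requirement is met.
        if best_loss < training_loss:
            break
        # Step 6: otherwise continue until the maximum number of iterations.

    return best_params, best_loss, training_loss
```

The loop stops either when a candidate beats the baseline loss or when the iteration budget is exhausted, which is the stopping rule described in steps 5 and 6.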

 
4 PREDICTION BASED ON THE EXPERIMENT OF WOA OPTIMIZED BILSTM_A 

 
4.1 Data processing and experiment conditions 
 

The development of smart cities has turned the acquisition of data, including lane inspection and tracking as well as the location, speed, and direction of vehicles, from theory into reality.³⁰ Acquiring massive traffic flow data was only theoretical in the past and is now possible.³¹ In this context, traffic flow data from California State Route 24 are used in this article. Refer to Figure 7 for the range of the data. The data were sampled every 5 min. The data from 2000 time intervals after 00:00:00 on May 1, 2014 are selected for the experiment in this article. There are 2000 × 24 = 48,000 values in total.

FIGURE 7 Range of the data

TABLE 1 Data division table

Classification    Input dimension size    Output dimension size
Training set      (47832, 24)             (47832, 1)
Test set          (144, 24)               (144, 1)

 
4.2 Experiment conditions and parameters 

 
Python 3.7 is used in the experiment. The hardware platform is an Intel(R) Core(TM) i9-10900X CPU @ 3.70 GHz with 64 GB of memory, a 1 TB solid-state drive, and an NVIDIA GeForce RTX 2080 Ti graphics card. TensorFlow 2.2 and TensorFlow 1.14 are used to train and test the model.³² Because the data are time series, a rolling series model is built. Specifically, the values of data 1 to n are the input and the value of data n + 1 is the label output; the values of data 2 to n + 1 are the input and the value at time n + 2 is the label output, and so on. To facilitate the training of the model network, the original data are standardized with the StandardScaler method. StandardScaler rescales the data to a mean of 0 and a standard deviation of 1;³³ the conversion function is as follows:

$$x^{*} = \frac{x - \mu}{\sigma}$$

where $\mu$ is the mean of all the sample data and $\sigma$ is the standard deviation of all the sample data.

Here, n = 24. The original data are divided into a training set and a test set. The sizes of the resulting sets are shown in Table 1.
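The standardization and rolling-series construction described above can be sketched in plain Python. This is an illustration, not the authors' pipeline. The function names are hypothetical, the standard deviation is the population standard deviation (as in scikit-learn's StandardScaler), and taking the last 144 windows as the test set is an assumption; it reproduces the sizes in Table 1, since 48,000 − 24 = 47,976 windows and 47,976 − 144 = 47,832.

```python
import math


def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mu = sum(values) / len(values)
    sigma = math.sqrt(sum((v - mu) ** 2 for v in values) / len(values))
    return [(v - mu) / sigma for v in values], mu, sigma


def rolling_windows(series, n=24):
    """Inputs are n consecutive values; the label is the value that follows them."""
    X = [series[i:i + n] for i in range(len(series) - n)]
    y = [series[i + n] for i in range(len(series) - n)]
    return X, y


def split_train_test(X, y, n_test=144):
    """Assumption: the last n_test windows form the test set."""
    return X[:-n_test], y[:-n_test], X[-n_test:], y[-n_test:]
```

Applied to a series of 48,000 values with n = 24, `rolling_windows` yields 47,976 input windows, and `split_train_test` then leaves 47,832 training windows and 144 test windows.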

 
4.3 Comparison of models 

 
To measure the optimization effect of the algorithm proposed in this article more objectively, an experiment is carried out by comparing the WOA_BiLSTM_A network with the BP neural network, LSTM neural network, CNN_LSTM_Attention network (CLSTM_A network), WOA optimized CLSTM_A network (W_CLSTM_A network), and BiLSTM_A network. Apart from the number of iterations, all the hyperparameters of the BP network, LSTM network, CLSTM_A network, and BiLSTM_A network are set to the same values, so as to reasonably analyze the influence of each model hyperparameter on the network structure. Each network is trained until it converges. The parameters of the four comparison models are shown in Table 2.

As shown in Table 2, the numbers of nodes in the first and second layers of the four models are all 100, and the first layer of the CLSTM_A network is a one-dimensional convolution layer.

Refer to Figure 8 for the loss function curves of the four networks.

As shown in Figure 8, the BP neural network converges after three rounds of training. The loss value of the LSTM neural network drops quickly before the 20th round of training and slowly afterwards until both the training set and the validation set converge. The loss value of the CLSTM_A network drops quickly before the 50th round of training and gradually converges afterwards. The loss value of the BiLSTM_A network drops quickly before the 75th round and gradually converges afterwards.

 
TABLE 2 Comparison model parameter table

Network type    Nodes in the first layer    Nodes in the second layer    Loss function    Optimizer    Number of iterations    Batch_size
BP              100                         100                          mse              adam         100                     128
LSTM            100                         100                          mse              adam         100                     128
CLSTM_A         100                         100                          mse              adam         200                     128
BiLSTM_A        100                         100                          mse              adam         200                     128

FIGURE 8 Compare network loss function curve

FIGURE 9 WOA optimized CLSTM_A fitness curve

 
To verify the effectiveness of the WOA, this article also optimizes the CLSTM_A model with the WOA and uses it as the comparison model for WOA_BiLSTM_A.³⁴ The initial population size of the WOA is set to 5, the number of iterations is set to 10, and the number of dimensions is set to 4. For the learning rate, the training times, and the numbers of nodes of the two hidden layers of the CLSTM_A and the BiLSTM_A, we set the optimization lower bound to [0.001, 10, 1, 1] and the optimization upper bound to [0.01, 200, 200, 200].
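A four-dimensional whale position has to be mapped to usable hyperparameters before a model can be trained. The sketch below shows one way to do this and is not the authors' code. The function name `decode_position` and the dictionary keys are hypothetical, and the component order (learning rate, training times, first-layer nodes, second-layer nodes) is an assumption based on the bounds above and the column order of Table 3.

```python
# Search bounds from the text, assumed order:
# [learning rate, training times, nodes in layer 1, nodes in layer 2].
LOWER = [0.001, 10, 1, 1]
UPPER = [0.01, 200, 200, 200]


def decode_position(position, lower=LOWER, upper=UPPER):
    """Clip a continuous position to the bounds and round the integer-valued entries."""
    clipped = [min(max(v, lo), hi) for v, lo, hi in zip(position, lower, upper)]
    return {
        "learning_rate": clipped[0],        # stays continuous
        "epochs": int(round(clipped[1])),   # training times
        "units_1": int(round(clipped[2])),  # nodes in the first hidden layer
        "units_2": int(round(clipped[3])),  # nodes in the second hidden layer
    }
```

Rounding is needed because the WOA moves whales in a continuous space, while training times and node counts must be integers.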

The WOA is used for parameter optimization to obtain the best learning rate, training times, and numbers of the nodes of the two hidden layers 

of the CLSTM_A and BiLSTM_A models optimized by WOA. After simulation calculation, the fitness convergence curves of the WOA in the process 

of optimizing the two models are obtained. Refer to Figures 9 and 10. 

According to Figures 9 and 10, the CLSTM_A model has the best fitness value at the fourth iteration, and the corresponding minimum mean 

square error is 0.0139; the BiLSTM_A model has the best fitness value at the second iteration and the corresponding minimum mean square error 

is 0.0134. The best parameter combinations of the CLSTM_A and the BiLSTM_A optimized by WOA are given in Table 3. 

The best parameter combinations of the optimized CLSTM_A and BiLSTM_A are used for simulation calculation and the loss function curves 

are shown in Figure 11. 

According to Figure 11, in the case of the best parameter combinations, the loss functions can finally converge in a certain range after a quick 

drop in the early stage, demonstrating the effectiveness of the simulation. 

 
4.4 Comparison of experiment results 

 
FIGURE 10 WOA optimized BiLSTM_A fitness curve

TABLE 3 Optimal parameter combination of two models after WOA optimization

Network type    Optimal learning rate    Number of iterations    Nodes in the first layer    Nodes in the second layer
W_CLSTM_A       0.00621                  36                      26                          177
WOA_BiLSTM_A    0.00687                  78                      62                          191

The best parameters found for the WOA_BiLSTM_A model are a learning rate of 0.00687, 78 iterations, 62 nodes in the first layer, and 191 nodes in the second layer. Using the same division into training set and test set, the BP neural network, LSTM neural network, CLSTM_A network, W_CLSTM_A network, and BiLSTM_A network are used as comparison networks for WOA_BiLSTM_A in the experiment. The MAPE, RMSE, and MAE³⁵,³⁶ of the predicted and actual values and the linear regression coefficient of determination R² are used to evaluate the errors, so as to better reflect the distance between the predicted values and the actual values of the different models. The MAPE, RMSE, and MAE formulas are given below, where the predicted values are $\hat{y} = \{\hat{y}_1, \hat{y}_2, \hat{y}_3, \ldots, \hat{y}_n\}$ and the actual values are $y = \{y_1, y_2, y_3, \ldots, y_n\}$.

$$\mathrm{MAPE} = \frac{1}{n} \sum_{i=1}^{n} \left| \frac{y_i - \hat{y}_i}{y_i} \right|$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^{2}}$$

$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|$$

 
FIGURE 11 CLSTM_A and BiLSTM_A optimal parameter combination loss function curve

TABLE 4 Error analysis table

                Evaluating indicator
Network type    MAPE      RMSE       MAE        R²
BP              0.0623    26.5267    21.0652    0.9139
LSTM            0.0382    16.8932    12.7013    0.9621
BiLSTM_A        0.0378    17.4511    12.9509    0.9638
CLSTM_A         0.0395    17.6930    12.7199    0.9617
W_CLSTM_A       0.0383    17.4843    13.2746    0.9626
WOA_BiLSTM_A    0.0361    17.1479    12.4787    0.9640

 
For MAPE, RMSE, and MAE, the smaller the values are, the more accurate the prediction results will be.

The linear regression coefficient of determination R² is defined as follows:

$$R^{2} = \frac{SSR}{SST} = \frac{\sum_{i=1}^{n} \left( \hat{y}_i - \bar{y} \right)^{2}}{\sum_{i=1}^{n} \left( y_i - \bar{y} \right)^{2}}$$

where SSR is the regression sum of squares, SST is the total sum of squares, and $\bar{y}$ is the mean of the actual values $y$. The larger the value of R² is, the more accurate the prediction results will be.
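The four evaluation metrics can be computed with a few lines of plain Python. This is a sketch for illustration, not the authors' evaluation script. The function names are hypothetical, MAPE is returned as a fraction (consistent with the magnitudes reported in Table 4), and R² follows the SSR/SST definition given above.

```python
import math


def mape(y, y_hat):
    """Mean absolute percentage error as a fraction; actual values must be non-zero."""
    return sum(abs((a - p) / a) for a, p in zip(y, y_hat)) / len(y)


def rmse(y, y_hat):
    """Root mean square error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(y, y_hat)) / len(y))


def mae(y, y_hat):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(y, y_hat)) / len(y)


def r2(y, y_hat):
    """Coefficient of determination in the SSR / SST form."""
    mean_y = sum(y) / len(y)
    ssr = sum((p - mean_y) ** 2 for p in y_hat)  # regression sum of squares
    sst = sum((a - mean_y) ** 2 for a in y)      # total sum of squares
    return ssr / sst
```

For actual values [100, 200, 300] and predictions [110, 190, 300], the absolute percentage errors are 0.10, 0.05, and 0, so `mape` returns 0.05.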

The comparison of error results from these evaluation metrics is shown in Table 4.

As shown in Table 4, the WOA_BiLSTM_A network achieves the best MAPE, MAE, and R² among the six networks, while its RMSE (17.1479) is second only to that of the LSTM network (16.8932). Even when the WOA optimized CLSTM_A model is added to the comparison experiment, the WOA_BiLSTM_A performs better on all four metrics.

The trained model is then applied to the test set. Figure 12 compares the predicted values of the six models (BP, LSTM, BiLSTM_A, CLSTM_A, W_CLSTM_A, and WOA_BiLSTM_A) with the actual values. It can be seen from Figure 12 that the accuracy of the WOA_BiLSTM_A is slightly higher than that of the other networks.

 
5 CONCLUSION 

 
At present, deep learning technology is developing rapidly. Building on research by many scholars on deep learning technology in the field of traffic flow prediction,³⁷ this article proposes a BiLSTM_A traffic flow prediction network model optimized using the WOA. The BiLSTM network is effective in extracting time series features. On this basis, the attention mechanism is used to capture the different weights of the hidden layers of the BiLSTM network. Once a network architecture is built, it is difficult to directly obtain the best hyperparameters for the data at hand in a single attempt. Therefore, the WOA is used to optimize the parameters of the network structure and obtain four parameters: the best learning rate, the training times, and the numbers of nodes of the two hidden layers. Finally, the best parameters are imported into the BiLSTM_A network, and after training, the traffic flow prediction network WOA_BiLSTM_A is built.

FIGURE 12 Comparison between the predicted value of each network and the real value

A dataset from a Californian highway is used to train and validate the WOA_BiLSTM_A network; the BP, LSTM, CLSTM_A, BiLSTM_A, and W_CLSTM_A networks are used as the comparison models; and MAPE, RMSE, MAE, and R² are used as the evaluation metrics. The results indicate that the network proposed in this article performs better than the comparison networks on most metrics. Here, the BiLSTM neural network, the attention mechanism, and the WOA are integrated, and as a result, the accuracy of the short-term traffic flow prediction model is further improved. The significance of this article is that optimizing the hyperparameters with the WOA can improve network accuracy across different network frameworks.

To solve the problems arising in the application of the WOA_BiLSTM_A model to traffic flow prediction, future efforts can be made in the two 

aspects below: 

 
1. In view of the high time complexity of the WOA in parameter optimization, the algorithm may be improved or other better algorithms may be 

used for parameter optimization to reduce the time complexity. 

2. The network structure adopted in this article only uses nodes in two layers. Next, a deeper network structure may be used and more parameters 

may be optimized to improve the stability of the model in various cases. 

 
ACKNOWLEDGMENTS 

This research was supported by the National Key Research and Development Program of China (2019YFE0126100) and the Key Research and Development Program in Zhejiang Province of China (2019C54005).

 
DATA AVAILABILITY STATEMENT 

The data that support the findings of this study are openly available from the California Department of Transportation at https://pems.dot.ca.gov/.

 
 
How to cite this article: Xu X, Liu C, Zhao Y, Lv X. Short-term traffic flow prediction based on whale optimization algorithm 

optimized BiLSTM_Attention. Concurrency Computat Pract Exper. 2022;34(10):e6782. doi: 10.1002/cpe.6782