Arya Ashrafi
Reinforcement Learning for Decentralized Energy Systems
Vaasa 2024
School of Technology & Innovation
Master's thesis in Smart Energy

Acknowledgments

I would like to express my gratitude to my supervisor, Prof. Miadreza Shafiekhah, for all of his help and counsel during the writing of this thesis. His guidance not only played a crucial role in molding this study but also served as a motivation and a standard for scholarly integrity and excellence. Dr. Petri Välisuo, my co-supervisor, has my sincere gratitude as well. His insights have broadened my viewpoint and improved my comprehension of the issue.

I express my sincere appreciation to my family in Iran and here in Vaasa. Their constant support and unshakable faith in my abilities have been the cornerstones of my journey. I owe them the perseverance and tenacity that have allowed me to succeed in my academic career. Without their continuous support, I would not be here today. For that, I am incredibly grateful.

This work was supported by the Horizon Europe project DiTArtIS (Network of Excellence in Digital Technologies and AI Solutions for Electromechanical and Power Systems Applications), grant agreement number 101079242.

Arya Ashrafi

UNIVERSITY OF VAASA
School of Technology & Innovation
Author: Arya Ashrafi
Title of the Thesis: Reinforcement Learning for Decentralized Energy Systems
Degree: Master of Science
Program: Smart Energy
Supervisor: Prof. Miadreza Shafiekhah and Dr. Petri Välisuo
Year: 2024
Number of Pages: 66

ABSTRACT:
With the rise of electric vehicles (EVs) and smart grid technology, sophisticated energy management systems that guarantee sustainability and efficiency are required. This thesis presents an in-depth examination of reinforcement learning (RL) algorithms in a simulated smart grid system featuring prosumer-generated renewable energy and embedded EV charging stations. The study assesses how well the Deep Deterministic Policy Gradient (DDPG), Proximal Policy Optimization (PPO), and Rule-Based Control (RBC) algorithms manage the energy dynamics of 50 prosumer nodes and 40 EVs over a 24-hour period using a Markov decision process framework. The RL algorithms interact with the environment to learn sequential decision-making processes that maximize the overall reward, with a particular focus on balancing energy production, consumption, and vehicle charging demands. The simulation results reveal DDPG's strength in cost-efficient grid energy purchasing and effective state of charge (SOC) management, PPO's potential through exploratory learning, and RBC's advantage in minimizing energy wastage. The findings point towards the necessity of intelligent energy management strategies that not only minimize costs and maximize the use of renewable energy but also enhance the operational efficiency and sustainability of the smart grid and EV ecosystems.
KEYWORDS: Smart Grid, Reinforcement Learning, Electric Vehicles, Energy Management, Demand Response

Contents
1 Introduction
2 Background
2.1 Smart Charging & Demand Response
2.2 Smart Electricity Market
2.2.1 Electricity Market Dynamics
2.2.2 Types of Electricity Markets
2.3 Machine Learning for Energy Management
3 Literature Review
4 Methodology
4.1 Problem Statement
4.2 Reinforcement Learning
4.3 Modeling & Formulation
4.3.1 RL Environment
4.3.2 Markov decision process (MDP)
4.4 RL Policy
4.4.1 PPO Policy
4.4.2 DDPG Policy
4.4.3 Rule-Based Controller
5 Numerical Studies
5.1 Daily Energy Dynamics
5.2 Solar Energy Surplus
5.3 Tariffs
5.4 EVs Presence during the day
6 Results (Simulation Implementation)
6.1 An Evaluation of DDPG, PPO, and RBC
6.2 Learning Trajectories of DDPG and PPO
6.3 The Cost-Effectiveness of Policies
6.4 Energy Wastage
6.5 SOC Management
7 Discussion
8 Conclusion
References

Figures
Figure 1. Methodology Workflow: RL-Based EV Charging Framework
Figure 2. Consumption (Load) and Generated Electricity by RE (Renewable)
Figure 3. The Remaining RE after consumption
Figure 4. Present cars over a 24-hour period
Figure 5. The Reward obtained by each agent in different policies
Figure 6. DDPG and PPO Policy Reward Training
Figure 7. Amount of electricity needed to be purchased from the grid
Figure 8. Wasted RE in three policies
Figure 9. Cumulative SOC in three policies

Tables
Table 1. The Pseudocode of the RL Policies and Final Evaluation
Table 2. The price of electricity at different times with six kinds of Tariffs
Table 3. The Pseudocode of the RL Environment
Table 4. The Pseudocode of EV charging utilities
Table 5. The cost of electricity purchased from the power grid using PV
Table 6. The cost of electricity purchased from the power grid without PV
Table 7. Wastage of Electricity Generated by RE

Abbreviations
ANN Artificial Neural Network
DDPG Deep Deterministic Policy Gradient
DRL Deep Reinforcement Learning
EM-SA Energy Management System Aggregator
EV Electric Vehicle
MDP Markov Decision Process
ML Machine Learning
MPC Model Predictive Control
POMDP Partially Observable Markov Decision Process
PER Prioritized Experience Replay
PPO Proximal Policy Optimization
PV Photovoltaic
RBC Rule-Based Control
RES Renewable Energy Sources
RL Reinforcement Learning
RNN Recurrent Neural Network
SAC Soft Actor-Critic
SOC State of Charge

1 Introduction

The increasing availability of electric vehicles (EVs) around the world offers tremendous potential as well as significant obstacles, especially when it comes to EV charging and how it affects the electrical grid. The increased use of EVs places more burden on our electrical infrastructure, especially during periods of high demand. This increasing demand has the potential to worsen system congestion, increase the probability of blackouts, reduce the effectiveness of power distribution, and increase consumer electricity costs. These difficulties are exacerbated in places with a high concentration of EVs, as local grid bottlenecks may require major infrastructure modifications to handle the extra load, highlighting the pressing need for creative solutions (Khaki, 2019). In this context, photovoltaic (PV) systems emerge as an economically and environmentally sound addition to conventional energy sources for household use and EV charging.
The transition to renewable energy sources, such as solar electricity, is essential for promot- ing environmental sustainability and preventing climate change. There is a crucial shift to renewable energy sources to mitigate climate change and advance environmental sus- tainability (Cohen, 2019). PV installations have the potential to significantly lower elec- tricity prices over the long run. The long-term cost savings from producing one's elec- tricity can more than offset the initial high setup costs of PV systems. This increases the attraction and viability of having an electric vehicle by making "fueling" an EV much less expensive for owners. Moreover, EV charging through PV system integration improves resilience and energy independence. A PV system can offer a dependable alternative energy supply during blackouts or times of heavy demand when the grid is stressed, guaranteeing that EVs can still be charged and essential home operations can go on with- out interruption (Esfandyari, 2019). While PV systems provide a sustainable energy source for charging EVs in Yao (2024), their intermittent nature poses challenges for consistent EV charging. Solar energy pro- duction rises during the day, usually during a time when there is less need for charging, and falls off in the evening when usage usually increases. To ensure that EVs can be 9 charged with green energy effectively, this mismatch necessitates creative solutions, such as energy storage devices or smart charging techniques, to store extra solar energy generated during the day for use during peak charging hours. According to Li (2021), technologies like demand response and smart charging offer workable answers to these problems. Smart charging systems optimize the charging pro- cess to lower peak demand and more effectively incorporate renewable energy sources by adjusting the charging rate or time of EVs based on grid conditions, PV generation, and EV owner preferences. To further balance the load and maximize the usage of re- newable energy, demand response systems might encourage EV owners to charge their vehicles during off-peak hours or when there is excess renewable energy available. To ensure that the grid can support the expanding EV market while maximizing the use of renewable energy sources, advances in grid infrastructure, regulatory frameworks, and consumer engagement techniques are necessary for the widespread application of these technologies in Li (2021). In this case, Lan (2021) looks into how renewable energy sources can be integrated with demand response and smart charging programs for electric vehicles (EVs), particularly when machine learning (ML) is used. It offers a cutting-edge strategy for improving grid stability and energy consumption optimization. With its ability to forecast energy use, optimize charging schedules, and manage the intermittent nature of renewable energy sources like solar and wind power, machine learning can greatly increase the efficacy and efficiency of these programs. To optimize energy usage in residential neighborhoods and integrate renewable energy sources for EV charging, this research builds upon previous work to create a powerful reinforcement learning (RL) system specifically designed for 50 residential participants to manage their energy consumption. These residential participants who are considered to be prosumers in this case generate their electricity through photovoltaic (PV) systems installed in their homes, in addition to consuming it. 
Forty of these homes use electric 10 vehicles (EVs), which adds a big factor to the energy management equation. This study's main goal is to design a system that will allow these 40 EVs to prioritize the usage of renewable energy by charging from excess energy produced by separate PV systems. When the energy generated by the PV panels is not enough to satisfy the demands of charging, the system will help to obtain the necessary electricity from the grid. Due to the dynamic nature of this dual-source charging approach, energy management be- comes more complex, requiring an intelligent system that can make decisions in real- time based on fluctuating energy prices and availability. To traverse this intricate energy terrain, the study investigates the utilization of two so- phisticated reinforcement learning algorithms: Proximal Policy Optimization (PPO) and Deep Deterministic Policy Gradients (DDPG). To provide a baseline for comparison, these algorithms will be assessed against a traditional Rule-Based Control (RBC) system. The comparison analysis will evaluate the effect of renewable energy generation on overall energy management and EV charging efficiency across a range of scenarios, including those without PV systems and those with them. The investigation's key component is its capacity to assess and contrast the efficacy of PPO and DDPG algorithms in handling prosumers' energy demands when using EVs and PV systems. The goal of the study is to determine the best RL approach for minimizing EV charging expenses and optimizing the use of locally produced renewable energy by examining various policy outcomes. 11 2 Background 2.1 Smart Charging & Demand Response The management of electric vehicle (EV) charging through rate and timing adjustments in response to grid needs and renewable energy availability is known as smart charging. By doing this, the grid is not as strained during peak hours, and more clean energy gen- erated by PV systems is optimally utilized (Yao et al., 2017). Energy distribution and man- agement take on new dimensions with the integration of photovoltaic (PV) systems and electric vehicles (EVs) into the electric grid. Utilizing these technologies to their fullest potential, maintaining grid stability, and maximizing energy use require smart charging and demand response programs (Radu, 2019). Controlling EV charging rates and sched- ules to coincide with periods of low electricity demand or high renewable energy gener- ation is made possible by smart charging. This optimizes the utilization of clean energy produced by PV systems and lessens the load on the grid during peak hours. Programs that encourage consumers to modify their energy use in response to grid con- ditions further improve this, which are called demand response programs. Demand re- sponse (DR) programs include techniques such as price incentives or notifications during periods of high demand, that persuade customers to adjust their energy usage in re- sponse to grid conditions (Nezamoddini & Wang, 2016). The purpose of these programs is to maintain equilibrium between the supply and demand of power in the smart grid, hence enhancing stability and mitigating potential hazards. For example, EV owners can be incentivized to charge their cars during peak solar production periods, absorbing ex- tra generation and avoiding cutoff. In the same way, customers can lower their consump- tion or postpone billing to help balance the grid during periods of high demand and low renewable generation. 
By encouraging a more robust and effective energy system, these tactics lessen dependency on fossil fuels and accelerate the transition to a sustainable energy future (Ferro, 2018). 12 The adoption of demand response and smart charging programs for PV systems and EVs adds a degree of efficiency and flexibility that has the potential to completely change conventional power markets. With the help of these initiatives, energy producers, cus- tomers, and the grid may interact more dynamically, creating new market opportunities and models. 2.2 Smart Electricity Market To maximize electricity generation, distribution, and consumption, cutting-edge technol- ogy and market mechanisms combine in the smart electricity market, which is a creative and dynamic segment of the energy business. In the context of electric vehicles (EVs), which are contributing a growing amount to the demand for electricity, this market is especially appropriate. Below is the examination of the main features of the smart power market, such as payment structures, types, and market dynamics, with an emphasis on how these factors relate to EV charging. 2.2.1 Electricity Market Dynamics The smart energy market is a marketplace where utility companies, independent power producers, and consumers—including owners of electric vehicles—trade electricity. These transactions occur when buying orders and selling offers, or matched asks and bids, occur in an auction. The features of these markets are designed to efficiently bal- ance supply and demand. Different auction techniques are used to decide the price of energy, depending on the type of market. Dynamic pricing is crucial for owners of electric vehicles (EVs) who may choose to charge their cars during off-peak hours to save money on electricity (Morstyn, 2018). 2.2.2 Types of Electricity Markets Day-Ahead Markets: In these markets, agreements are settled one or more days before electricity is actually delivered. Users can therefore plan their output or consumption 13 based on expected demand. This may mean scheduling EV charging sessions during times when lower expenses are predicted. Spot Markets: These markets provide a faster reaction time by enabling transactions to be finished up to five minutes before delivery. This makes it possible to respond quickly to sudden changes in supply or demand, such as those resulting from EVs being charged in huge quantities. Operating Reserve Markets: These markets have the fastest reaction times, ranging from minutes to seconds, and provide backup resources to maintain grid stability. This is especially important for electric vehicles (EVs), as ubiquitous charging can result in con- siderable increases in the amount of electricity used. Most power markets operate under the merit order concept, which orders energy sources to be sent in increasing cost order until the demand is met. The price at which supply and demand are equal is known as the clearing price. Since it ensures that the most cost-effective resources are used first, EV owners may benefit from this notion by having their power expenses cut during times of strong renewable energy production (Al-Gabalawy, 2021). 2.3 Machine Learning for Energy Management A sophisticated method for handling the challenges involved in integrating PV installa- tions and EV charging into the smart grid is machine learning (ML). 
ML algorithms can forecast demand and optimize energy distribution by examining trends in energy gener- ation, consumption, and a host of other variables, including user behavior and meteor- ological conditions. For instance, Shin (2020) states that predictive models capable of predicting bursts in solar radiation can be used to strategically schedule EV charging ses- sions to align with these periods, thus optimizing the usage of renewable energy. To avoid grid overload and boost system efficiency, ML can also dynamically modify charg- ing rates in real time based on grid circumstances and the availability of solar electricity. Moreover, Abdullah, et al. (2021) suggest that reinforcement learning, a subset of ma- chine learning, can be applied to create control techniques that adjust over time, learn- ing from past actions to make more accurate decisions about the distribution and 14 consumption of energy. As the grid and its users' needs change, energy management solutions must be able to adapt as well. This is because renewable energy sources are naturally unpredictable, and managing their inherent variability and changing consump- tion patterns is critical. All things considered, the employment of machine learning in this situation not only raises the energy system's operational efficiency but also increases customer comfort and happiness, opening the door for EV and PV system adoption on a larger scale. Here is a list of why utilizing ML can be beneficial in the context of smart charging and demand response: Adaptive Smart Charging with Machine Learning: Machine learning algorithms can an- alyze vast amounts of data from the grid, EVs, and renewable energy sources to predict optimal charging times (Lan, 2021). ML models may dynamically modify EV charging schedules to periods when renewable energy production is high, minimizing depend- ency on non-renewable energy sources and cutting consumer power bills. This is achieved by identifying patterns in energy consumption and generation. Demand Response Optimization: To lower total energy demand, demand response pro- grams try to motivate customers to use less energy during off-peak hours or times when renewable energy sources are available. Machine learning enhances demand response technics by predicting times of peak demand and peaks in renewable energy output. ML models can automatically send signals to EVs and other smart appliances. This will help to alter their energy use in real-time and prevent over-load to maintain grid balance. Integration of Renewable Energy Sources: Renewable energy sources are difficult to in- tegrate into the grid because of their intermittent nature. By forecasting renewable en- ergy generation using machine learning algorithms based on historical data and weather projections, grid operators can more effectively plan for energy storage and delivery. An ML model might, for instance, anticipate a decrease in solar power generation owing to 15 cloud cover and proactively modify EV charging schedules to maintain grid balance (Shin, 2020). Enhancing Grid Stability: Smart grids can become more adaptable and resilient to vari- ations in the supply and demand of energy by utilizing machine learning to assess data from several sources, such as energy consumption patterns, EV battery conditions, and projections of renewable energy generation. According to Mololoth et al. 
(2023), ma- chine learning (ML) can aid in the dynamic management of energy flows, preventing EV charging from adding to grid instability and facilitating the efficient distribution of re- newable energy throughout the network. 16 3 Literature Review Within the domain of optimizing EV charging in liberalized energy markets, scholars have investigated several approaches with the goal of optimizing cost-saving advantages for EV owners in the face of fluctuating pricing structures and user behaviors. Foster (2013) emphasizes how important it is to incorporate smart grid data to provide EV owners with cost-saving options based on their current level of charge and planned departure time. Better cost optimization and increased charge decision flexibility in the face of price vol- atility are both made possible by this combination. Radu (2019) provides a charging management plan for electric vehicles (EVs) to facilitate distributed generation and renewable energy integration. To attain a scheduled aggre- gated power profile, the technique takes into account the preferences of electric vehicle users. This is done in an effort to lessen the impact of intermittent energy sources like solar and wind power. It is intended to store the excess energy in EV batteries, which will increase electrical network reliability and enable widespread integration of distributed energy resources (DER). This strategy makes it possible for users of flexible electric vehi- cles to take part in demand response programs, which could be extremely important for enhancing the reliability and effectiveness of future smart grids. Shin (2020) elaborates on this idea of adaptability and suggests a decentralized method of controlling EV charg- ing stations that have energy storage and photovoltaic (PV) systems installed. This meth- odology makes use of the surplus energy stored in EVs from other stations to enable autonomous control of charging stations through the use of multi-agent deep reinforce- ment learning. This decentralized approach, which reflects a move toward more inde- pendent and effective charging infrastructure management, not only allows for flexibility in decision-making but also reduces total costs through distributed coordination. Current research emphasizes how important cutting-edge technologies are to improving the energy management of smart grids, especially when electric vehicles (EVs) are in- cluded. Examples of these technologies include blockchain and machine learning (ML). The growing interest in machine learning (ML)-driven smart charging techniques for 17 electric vehicles (EVs) holds great potential for improving demand-side management and cutting operational costs. For instance, López (2018) developed an intelligent EV charg- ing approach that uses deep learning to maximize energy savings and optimize charging times. Their method eliminates the need for projections of future energy prices or vehi- cle usage by utilizing data on driving habits, the environment, and energy pricing. Their methodology's core is dynamic programming for the examination of historical data, which trains deep neural networks to make economical charging decisions in real-time. According to their research, deep learning can produce charging charges that are almost exactly like the ideal costs that were determined in the past, if not identical. In the realm of ML, reinforcement learning (RL) has gained vast attention for the energy management of EVs and for providing demand response programs (Abdullah, et al. 
(2021) and Mololoth, et al. (2023)). In line with that, Vázquez-Canteli (2019) investigated de- mand response using reinforcement learning (RL) in smart grids with an emphasis on home energy systems. Their thorough analysis of RL techniques for a range of compo- nents, such as smart appliances and EVs, highlights how flexible RL is to the dynamics of the environment and how well it can integrate user feedback. Moreover, Li (2022) sug- gests a workable energy management plan that uses Deep Reinforcement Learning (DRL) to optimize EV charging expenses. The study aims to improve real-time charging man- agement efficiency in the face of fluctuating electricity tariffs and user behaviors by in- tegrating an improved Recurrent Neural Network (RNN) architecture and Deep Deter- ministic Policy Gradient (DDPG) algorithm. Arwa (2021) uses a Markov Decision Process formulation to explore energy management and optimal scheduling in the setting of EV charging stations coupled with renewable energy. Through the introduction of an enhanced Q-learning algorithm for managing dy- namic stochastic problems and grid tariff models, the study seeks to reduce the cost of purchasing electricity and encourage self-generation via photovoltaic systems. The inte- gration of renewable energy sources with local generation and consumption highlights a significant step towards sustainable and efficient charging infrastructures, despite the 18 obstacles presented by grid prices and variability in renewable energy supply. The Q- learning technique is used to precisely anticipate loads across a range of scenarios, which helps to optimize plug-in hybrid Electric Vehicle (EV) charging stations This highlights the significance of adaptive and efficient charging strategies in dynamic energy environ- ments. To optimize operation and reduce expenses, Cai (2023) offers a performance-driven ap- proach to energy management in residential microgrid systems that integrates Rein- forcement Learning (RL) and Model Predictive Control (MPC) algorithms. This reflects the increased interest in using RL techniques to optimize the utilization of renewable energy sources as the study presents a viable paradigm for effective energy management in residential microgrid systems by integrating the benefits of both techniques, demon- strating the possibility for collaborative approaches to address local energy concerns. Ye (2020) looks at a bilevel optimization model that may be used to create bidding strate- gies in deregulated power markets. To deal with non-convexities, the model uses Deep Reinforcement Learning (DRL) to help make cost-effective judgments about charging and unplugging electric vehicles. It emphasizes the role that DRL techniques play in enabling adaptive and responsive decision-making strategies, underscoring the significance of on- going feedback and strategy refinement in dynamic market situations. Also, Yan (2021) addresses the challenge of optimizing EV charging schedules in the face of fluctuating electricity costs and user behavior by utilizing Deep Reinforcement Learning (DRL) tech- niques in conjunction with a Markov Decision Process (MDP) framework. Sequential de- cision-making based on driving behavior is made easier by using off-policy algorithms within the soft-actor critic (SAC) framework. The ultimate goal is to find affordable charg- ing solutions that are in line with user preferences and market dynamics. 
Furthermore, Dabbaghjamanesh (2021) illustrates the superior accuracy of Q-learning by utilizing recurrent neural network (RNN) and artificial neural network (ANN) forecasting outcomes. This is especially relevant in scenarios with coordinated and uncoordinated charging behaviors, underscoring the significance of precise load forecasting for efficient charging optimization. The Energy Management System Aggregator (EM-SA) is part of a demand response program to maximize power sales to the grid while limiting purchases to maximize advantages. The EM-SA addresses challenges related to renewable energy integration and variable weather conditions by utilizing a two-level Model Predictive Control (MPC) framework and Reinforcement Learning (RL) algorithms for real-time decision-making. Qiu (2020) promotes the optimization of EVs' continuous charging and discharging levels through the use of deep reinforcement learning algorithms, with a focus on the Deep Deterministic Policy Gradient (DDPG) method. Prioritized experience replay (PER) approaches are used to improve learning efficiency, which leads to more resilient and flexible charging schemes. This study also emphasizes how crucial it is to solve the EV pricing issue that aggregators are facing and how EV flexibility affects average electricity bills and aggregator revenue.

There are also new opportunities for demand management and the safe integration of renewable energy sources thanks to Mololoth et al.'s (2023) investigation of the synergistic potential of blockchain and machine learning for upcoming smart grids. On the other hand, some researchers and industry stakeholders have turned to advanced RL control techniques in conjunction with photovoltaic (PV) systems. The coupling of RL with PV systems offers a synergistic approach, leveraging solar energy to charge EVs efficiently while minimizing costs and environmental impact. Huang (2020) uses a hybrid wind-solar storage system in conjunction with a Deep Reinforcement Learning (DRL) control technique to improve prediction quality and optimization while addressing the uncertainties related to the production of renewable energy. This study shows that DRL algorithms are effective in controlling uncertainties related to renewable energy, even when there is a lack of high-dimensional data. This helps power systems utilize renewable energy sources more consistently and profitably. Härtel (2023) utilizes Reinforcement Learning (RL) in solar battery storage to maximize energy use and operational effectiveness in these systems. The study enhances the performance of photovoltaic battery storage systems by utilizing recurrent neural networks (RNNs) and proximal policy optimization (PPO).

4 Methodology

4.1 Problem Statement

This thesis concerns how electric vehicles are charged at charging stations that have access to renewable energy and can also purchase electricity from the grid. The renewable energy for these stations is provided by solar panels, which can only generate energy during certain hours of the day. A portion of the generated energy is also used for the daily consumption of the station itself, referred to as the load.
Therefore, only a part of the generated energy can be sold for charging electric vehicles. Depending on the imme- diate needs of each station, more electricity can be purchased from the grid if needed for charging. Since purchasing electricity from the grid is costly, the goal of this issue is to charge the vehicles in a way that incurs the lowest cost. In this scenario, the simulation leverages authentic data on tariffs, the production of re- newable energy (specifically from solar panels), and the grid load for each station, sourced from the actual energy consumption and solar power generation records of households. These households are deemed prosumers, reflecting their dual role as both energy consumers and producers within the network. This data also includes the energy usage of electric vehicles associated with these households, based on a study conducted in Portugal (Faia, 2021). In the problem of charging electric vehicles, the task is to reduce the consumption costs of the stations. While a desired output cannot be specified for each vehicle, for example, it's not possible to state that a vehicle should receive a specific amount of charge at a particular time of day to minimize costs. Solving such problems requires learning a pro- cedure that decides on the charging of each vehicle at every step, considering the varia- ble conditions of the problem. For instance, the moment each vehicle arrives at the sta- tion, given the energy level in the device's battery, the instantaneous energy production from the solar panels, and the cost of purchasing energy from the power grid, a decision to charge or not charge a vehicle is made. 22 Given the characteristics mentioned regarding the electric vehicle charging problem, re- inforcement learning is the most suitable method for finding the optimal solution to the problem. 4.2 Reinforcement Learning Reinforcement learning (RL) is a learning method that is about an action that is supposed to maximize the reward. It differs from supervised learning in that neither the dataset nor the correct decisions are labeled by an external supervisor. Moreover, the purpose of RL is to find the best possible actions that are not defined by learning and exploring based on the environment which the agent is interacting in. It is also not quite similar to unsupervised learning although it is sometimes classified within this category. The rea- son is that RL is not trying to find a hidden structure within the unlabeled data but it is used for maximizing the reward that the agent is trying to achieve (Sutton & Barto, 2018). Unlike supervised algorithms, which define a specific desired output for each input, there exists a category of problems where a precise output cannot be established, and the only performance criterion for an agent in these environments is a descriptive quan- titative value known as a reward (Co-Reyes et al., 2020). In other words, the reward is the environment's response to the agent's actions. In many environments, a maximum or minimum for the reward cannot be defined, and learning in these problems can be modeled as maximizing the cumulative reward received. Regarding the termination of an episode in a simulated environment, environments can be divided into continuous and episodic. In episodic environments, one or more specific states are considered as the agent's terminal state in the environment, whereas in continuous environments, no terminal state is defined for the agent. 
A complete cycle of the agent's performance in an episodic environment is called an episode. A cycle in the problem of charging electric vehicles can be simplified to be confined to a single day.

Overall, addressing this problem requires a simulation environment and an agent tasked with optimizing the objectives. In reinforcement learning problems, a simulated environment and a reinforcement learning agent interact sequentially with each other. At each step, considering the current state of the environment and the values of the environment's parameters, the agent selects an appropriate action. The execution of that action then changes the current state of the environment and also causes the environment to return a quantitative value indicating the effectiveness of the action performed (Sutton & Barto, 2018). The goal of reinforcement learning is to find actions that create the most significant effect in the environment, in other words, to maximize the cumulative value returned from the environment. In this thesis, many agents are involved in a dynamic and uncertain environment, and reinforcement learning is utilized as a trial-and-error machine learning method that interacts with this environment and makes sequential decisions to learn how to maximize the reward.

In the next step, the simulation environment will be outlined. Then, various reinforcement learning algorithms will be detailed, assessing the strengths of each. The conclusion will present the results derived from each algorithm, showcased through graphs and analyses. To this end, it is necessary to model the reinforcement learning variables in the problem at hand and reduce the charging costs of the vehicles using reinforcement learning algorithms. In the modeled scenario, the function that selects the appropriate action at each step is called a policy, and the best-learned policy is referred to as the optimal policy.

4.3 Modeling & Formulation

Such problems are formulated in RL with the Markov decision process (MDP), which is suited to stochastic processes that are difficult to predict. When outcomes are partly random and partly under the control of an agent, an MDP provides a mathematical framework for modeling the decision-making process. In an MDP, four key elements must be defined for a given environment, namely states, actions, transition probabilities, and rewards. This environment is supposed to act as an artificial world that closely represents the real one, within which the agent can interact to achieve its goal over time (Karatzinis et al., 2022).

4.3.1 RL Environment

The environment is crucial to reinforcement learning (RL) because it interacts with the learning agent by supplying the scenarios that the agent must navigate. In essence, the environment is a model or simulation of the real world, or a particular component of it, that establishes the parameters under which the agent must carry out its duties. This environment can range from virtual simulations for training autonomous vehicles to game settings like chess, or more complex scenarios such as managing energy distribution in a power grid (Narvekar et al., 2020). Environments can be classified into episodic, where interactions are divided into separate episodes with clear endpoints, or continuous, where the interaction goes on indefinitely without predefined ends.
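To make this interaction concrete, the following is a minimal sketch of how such an episodic environment could be laid out in Python with the Gym interface that is also used later in Table 1, anticipating the characteristics listed next. The class name, the state and action layouts, the toy PV profile, and the per-step charge increment are illustrative assumptions rather than the exact environment implemented in this thesis; only the one-day horizon of 96 quarter-hour steps, the 40 EVs, and the FIXED T1 tariff value reflect the setting described in Sections 5.1-5.3.

```python
# Illustrative sketch only: a minimal episodic Gym-style environment for the
# EV-charging setting described above. Class and attribute names are
# hypothetical and do not reproduce the thesis implementation.
import numpy as np
import gym
from gym import spaces


class EVChargingEnvSketch(gym.Env):
    """One simulated day split into 96 quarter-hour steps (see Section 5.1)."""

    def __init__(self, n_evs=40, horizon=96):
        super().__init__()
        self.n_evs = n_evs
        self.horizon = horizon
        # Observation: each EV's battery SOC plus two shared signals
        # (surplus PV energy and grid tariff) -- a simplified state vector.
        self.observation_space = spaces.Box(
            low=0.0, high=np.inf, shape=(n_evs + 2,), dtype=np.float32)
        # Action: a continuous charging decision in [0, 1] per EV.
        self.action_space = spaces.Box(
            low=0.0, high=1.0, shape=(n_evs,), dtype=np.float32)
        self.t = 0

    def reset(self):
        self.t = 0
        self.soc = np.full(self.n_evs, 0.5, dtype=np.float32)  # placeholder SOC
        return self._observe()

    def _observe(self):
        pv_surplus = max(np.sin(np.pi * self.t / self.horizon), 0.0)  # toy PV profile
        tariff = 0.1456                                               # FIXED T1 (Table 2)
        return np.concatenate([self.soc, [pv_surplus, tariff]]).astype(np.float32)

    def step(self, action):
        # Charge the batteries and penalize energy that must be bought from the grid.
        charged = np.clip(action, 0.0, 1.0) * 0.05
        self.soc = np.clip(self.soc + charged, 0.0, 1.0)
        obs = self._observe()
        grid_energy = max(charged.sum() - obs[-2], 0.0)   # demand not covered by PV
        reward = -obs[-1] * grid_energy                   # negative purchase cost
        self.t += 1
        done = self.t >= self.horizon                     # episodic: one day
        return obs, float(reward), done, {}
```

An agent can then be run against this interface in the usual way: reset once per simulated day, then step 96 times until the episode terminates.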
Each RL environment must follow certain rules and characteristics to be acceptable and readable as an environment. The characteristics of a general RL environment are summarized below (Moussaoui, Akkad, and Benslimane, 2023):

• State Space (S): The environment defines the state space, which includes all possible situations the agent might encounter. The complexity of an environment often correlates with the size and diversity of its state space.
• Action Space (A): This covers every course of action the agent may pursue. Since the action space's design dictates the extent of the agent's interactions with the surroundings, it can have a substantial impact on the agent's learning and performance.
• Reward Function (R): The environment determines the rewards, which are responses to the agent's actions aimed at achieving specific goals. The reward structure is crucial as it guides the learning process by signaling to the agent which actions are beneficial toward achieving its objectives.
• Transition Function (T): This refers to how the environment changes in response to the agent's actions. The dynamics can be deterministic (predictable outcomes) or stochastic (random elements influencing the outcomes), affecting the agent's strategy for learning and decision-making.
• Policy (π): A policy is a strategy used by the agent, defined as a mapping from states to actions. The policy determines the action that an agent will take in a given state.
• Episode: Many RL problems involve interactions that can be divided into smaller units called episodes. Every episode has a starting state and a terminal state at its conclusion. Tasks without a clear endpoint are known as non-episodic (or continuing).
• Discount Factor (γ): The impact of future rewards is determined by the discount factor, which has a value between 0 and 1. When the factor is near 1, the agent is considered far-sighted because it values rewards that lie further in the future, whereas a factor of 0 makes the agent short-sighted since it only considers current rewards.
• Objective Function: The objective function in reinforcement learning typically involves maximizing the cumulative reward, which could be the sum of rewards in the case of a finite horizon, or the discounted sum of rewards in the case of an infinite horizon.

To provide an overview of the methodology in this thesis, a detailed flowchart has been made to illustrate the implementation processes of the proposed framework (Figure 1). This flowchart encapsulates the key steps, ranging from the initialization of the environment and the calculation of the main parameters to the implementation of the reinforcement learning (RL) policies and the assessment of their performance.

Figure 1. Methodology Workflow: RL-Based EV Charging Framework

4.3.2 Markov decision process (MDP)

As previously stated, a large number of RL problems can be examined using the MDP framework. An environment in an MDP is made of several unique states, and an agent is only ever allowed to be in one of these states at a time. The agent in a Markov Decision Process can also select from a range of possible actions at any given time. Any action performed may cause the agent's status inside the state space to change. Since the MDP is a stochastic process, a probabilistic transition table governs how it will change states when an action is performed.
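As a toy illustration of such a transition table, each action carries its own row-stochastic matrix, and the next state is sampled from the row of the current state. The state and action labels and the probabilities below are invented purely for exposition and are not part of the thesis environment; formula (1) below formalizes exactly this quantity.

```python
# Toy illustration of a probabilistic transition table P_a(s, s'). The numbers
# and the state/action labels are made up and are not part of the thesis model.
import numpy as np

rng = np.random.default_rng(0)

states = ["low_soc", "mid_soc", "high_soc"]           # hypothetical states
actions = ["idle", "charge"]                          # hypothetical actions

# P[a][s, s'] = probability of moving from state s to s' when action a is taken.
P = {
    "idle":   np.array([[0.9, 0.1, 0.0],
                        [0.1, 0.8, 0.1],
                        [0.0, 0.1, 0.9]]),
    "charge": np.array([[0.2, 0.7, 0.1],
                        [0.0, 0.3, 0.7],
                        [0.0, 0.0, 1.0]]),
}

s = 0                                                 # start in "low_soc"
a = "charge"
s_next = rng.choice(len(states), p=P[a][s])           # sample s' ~ P_a(s, .)
print(f"{states[s]} --{a}--> {states[s_next]}")
```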
This model works very well in situations where decisions have to be made sequentially under uncertainty, and each option could have a different effect depending on the system's current state. The construction of strategies that can adapt and function well under a variety of settings is made possible by the stochastic character of MDPs, which represent the randomness and unpredictability of real-world environments (Narvekar et al., 2020). The mathematical formulas and principles of an MDP in RL are explained accordingly.

Formula (1) represents the transition probability in the context of an MDP, which is a core concept in Reinforcement Learning (RL) (Sutton and Barto, 2018). $P_a(s, s')$ denotes the probability of transitioning from state $s$ to state $s'$ given that action $a$ is taken. In an MDP, this transition probability is crucial because it encapsulates the dynamics of the environment, describing how the environment responds at a future time $t+1$ based on the agent's current state and action at time $t$.

$$P_a(s, s') = \Pr\!\left(s_{t+1} = s' \mid s_t = s,\ a_t = a\right) \tag{1}$$

Every action taken in an MDP that causes a state to change is referred to as a "step." Every action the agent performs in the environment results in a numerical value being returned to it as a reward, which indicates how effective the action was. The amount of the reward is determined by the agent's current state, the action taken, and the agent's subsequent state following the action, and is denoted $R_a(s, s')$. It is essential to remember that the environment alone calculates the reward; the agent is unaware of the process or the rationale behind the value that is returned. As a result, the knowledge that the agent learns from consists of its current state, the action that it chose, and the reward that it received. This configuration emphasizes the reinforcement learning principle of learning from interaction rather than predefined rules by guaranteeing that the agent's learning is entirely dependent on the input it receives from the environment. By analyzing which behaviors result in greater rewards in particular states over time, the agent optimizes its behavior and adjusts and improves its plan based on actual data rather than theoretical models. Reinforcement learning systems use this ongoing process of interaction and adjustment as the core learning mechanism.

To utilize the MDP framework for modeling our problem, a precise definition of the parameters and variables involved in this process is needed. An MDP can be represented by a tuple $(s, a, p, r)$, where $s$ represents the states, $a$ the actions, $p$ the transition probabilities between states, and $r$ the rewards. The state represents the condition of an agent with respect to its environment and can encompass any number of variables. In other words, the state can be considered the agent's perception of its surrounding environment. For example, for an autonomous vehicle, all the input values from the sensors form the vehicle's understanding of its surrounding environment. Thus, the decision-making of an agent to choose an action depends solely on the state it is in. Additionally, the input to a policy is the agent's state. For a policy to improve and be able to choose better actions in each state, it needs to maximize the rewards obtained. Therefore, the improvement of a policy entirely depends on the reward an agent receives for a particular action in a specific state.
Hence, having values of state, action, and reward can enable training an agent within an environment. In scenarios where the environment is not fully known to us, the transition probabilities between states are not available. Such environments are called Partially Observable Markov Decision Processes (POMDPs), and finding the optimal policy in these environments is more challenging than in MDP environments. Our evidence for learning in POMDP environments is represented by a tuple $(s, a, s', r, d)$, where $s'$ represents the next state and $d$ indicates the end of a complete cycle of the environment's execution. In both MDP and POMDP settings, the key to effective reinforcement learning lies in accurately capturing and utilizing the dynamics of state transitions and reward mechanisms to iteratively refine the decision-making policy. This iterative learning process focuses on maximizing the cumulative rewards across episodes, adapting the policy based on feedback from the environment to optimize outcomes under given constraints and uncertainties.

Formula (2) represents the value of a state in the environment according to the Bellman equations explained in Sutton and Barto (2018):

$$V^{\pi}(s) = \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t} R(s_t, a_t) \,\middle|\, s_0 = s\right] \tag{2}$$

Here, $s$ represents the state, $a$ represents the action, and $R$ is the function calculating the reward. Additionally, $\gamma$ is the discount factor, and $\pi$ represents the policy adopted by the agent. The formula above gives the value of each state under the chosen policy. Since our environment is a POMDP, instead of calculating $V$ values, formula (3) has to be solved to find the optimal policy (Qiu et al., 2020).

$$Q^{\pi}(s, a) = \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t} R(s_t, a_t) \,\middle|\, s_0 = s,\ a_0 = a\right] \tag{3}$$

The $Q$-function, also known as the action-value function, quantifies the expected return of taking an action $a$ in a state $s$ and thereafter following a policy $\pi$. Given a function that provides a good approximation of the $Q$ value, at each step the action that produces the highest $Q$ value for the current state can be selected as the optimal action for that state. Qiu et al. (2020) rewrite the above equation as follows (formula (4)):

$$Q^{\pi}(s_0, a_0) = \mathbb{E}_{\pi}\!\left[R(s_0, a_0) + \sum_{t=1}^{\infty} \gamma^{t} R(s_t, a_t)\right] = \mathbb{E}_{\pi}\!\left[R(s_0, a_0) + \gamma V^{\pi}(s_1)\right] \tag{4}$$

where $\mathbb{E}_{\pi}$ denotes the expected value given that the agent follows policy $\pi$ after taking action $a_0$ in state $s_0$, and $\sum_{t=1}^{\infty} \gamma^{t} R(s_t, a_t)$ represents the sum of discounted rewards received over the future after the initial state and action at time $t = 0$. $\gamma$ is the discount factor, and $R(s_t, a_t)$ is the reward received after executing action $a_t$ in state $s_t$.

If the above equation is modified as follows (formula (5)), the calculation of the $Q$ value no longer depends on the transition probabilities between states. This method is known as Q-learning.

$$Q(s, a) = R(s, a) + \gamma \max_{a'} Q(s', a') \tag{5}$$

where $s'$ is the next state that the agent occupies as a result of performing action $a$ in state $s$. To learn the $Q$ function, formula (6) can be used (Arwa & Folly, 2021):

$$Q_{\text{new}}(s, a) = (1 - \alpha)\, Q_{\text{old}}(s, a) + \alpha \left(R(s, a) + \gamma \max_{a'} Q(s', a')\right) \tag{6}$$

where $\alpha$ is the learning rate of the algorithm. As the equation shows, at each step the $Q$ value for a state-action pair is the sum of a fraction of its current value and a fraction of the predicted target value. Choosing an appropriate learning rate is crucial for effectively learning the agent's behavior.
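The update rule in formula (6) can be made concrete with a few lines of code. The snippet below is a toy, discrete example intended only to illustrate the update; the thesis itself uses the PPO and DDPG policies described in the next section rather than a tabular Q-table, and the transition values used here are invented.

```python
# Illustrative tabular Q-learning update implementing formula (6).
import numpy as np

n_states, n_actions = 5, 2
alpha, gamma = 0.1, 0.99          # learning rate and discount factor
Q = np.zeros((n_states, n_actions))

def q_update(s, a, r, s_next):
    """One application of formula (6) for an observed transition (s, a, r, s')."""
    target = r + gamma * Q[s_next].max()            # R(s,a) + gamma * max_a' Q(s',a')
    Q[s, a] = (1 - alpha) * Q[s, a] + alpha * target

# Example: a single observed transition with an invented reward.
q_update(s=0, a=1, r=-0.5, s_next=2)
print(Q[0, 1])   # -0.05 after one update from an all-zero table
```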
4.4 RL Policy

Given that the electricity at the station is supplied from two different sources at varying costs, the decision on how much charge each vehicle should receive during each visit to the station can significantly affect the costs incurred. Therefore, selecting an appropriate policy for the amount of charging for each vehicle can be envisioned as an optimization problem aimed at enhancing efficiency and reducing costs. By considering the condition of each vehicle upon entering and exiting a station, as well as the source of the charge for that vehicle, a numerical metric can be defined to evaluate the performance of the station with respect to that specific vehicle.

For solving the defined RL problem, three policies have been used. PPO and DDPG belong to the category of reinforcement learning algorithms, while the RBC algorithm belongs to the category of rule-based algorithms.

4.4.1 PPO Policy

The PPO policy is one of the most commonly used algorithms in reinforcement learning and has become a standard RL algorithm in many companies. This policy prevents sudden changes in the policy and, unlike the DDPG policy, is simple to implement. It also has high stability, and changes in hyperparameters do not significantly affect the algorithm's performance, ensuring it consistently performs well (Schulman et al., 2017). Additionally, in terms of data needed for training, PPO requires fewer data points than many other reinforcement learning algorithms, which reduces the learning time.

Specifically, this algorithm tries to prevent sudden changes in policy so that the updated policy deviates only slightly from the current policy, avoiding the abrupt policy shifts common in off-policy methods. Changes in the policy are also clipped by a predetermined limit. It is noteworthy that in this algorithm (Cheng et al., 2018), instead of predicting the Q-value, the advantage value is used (formula (16)).

$$E_{\text{clip}}(\theta) = \hat{\mathbb{E}}_t\!\left[\min\!\left(r_t(\theta)\,\hat{A}_t,\ \mathrm{clip}\big(r_t(\theta),\, 1-\varepsilon,\, 1+\varepsilon\big)\,\hat{A}_t\right)\right] \tag{16}$$

where $\theta$ represents the policy parameters, $r_t$ is the likelihood ratio of actions under the new policy relative to the old one, and $\hat{A}_t$ represents the advantage, defined as (formula (17)):

$$A^{\pi}(s, a) = Q^{\pi}(s, a) - V^{\pi}(s) \tag{17}$$

A short numerical sketch of this clipped objective is given below.

4.4.2 DDPG Policy

Zhang et al. (2019) present this policy as a combination of Q-learning and policy-learning algorithms. It consists of two separate steps, one for predicting the Q-value and another for deriving the policy from Q-values. This policy falls under the actor-critic category of algorithms. Having an actor, this category possesses the strengths of policy-based algorithms, and having a critic, it holds the strengths of value-based algorithms. Therefore, for its implementation, two separate deep neural networks can be used for learning the Q-values and the policy.

DDPG is especially suitable for environments where the dimension of the state or action space is high. As an off-policy learning algorithm, it requires fewer data points for learning. Moreover, off-policy learning ensures that learning and action selection are handled by two separate networks, leading to more stability for the actor network (Zhang et al., 2019).

4.4.3 Rule-Based Controller

Rule-based algorithms are typically used in environments where sufficient knowledge of the environment exists, and a set of predefined rules can be written for them (Karmaker et al., 2023). The implementation of such rules is very straightforward and usually stems from complete human knowledge of an environment and its governing policies.
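As forward-referenced above, the following is a small numerical sketch of how the clipped surrogate objective in formulas (16)-(17) behaves. The action probabilities and advantage values are invented for illustration; a real PPO implementation (for example, the Stable-Baselines3 PPO used later in Table 1) computes these quantities from the policy network and the collected rollouts.

```python
# Numerical sketch of the clipped surrogate objective in formulas (16)-(17).
import numpy as np

eps = 0.2                                            # clipping limit (epsilon)

# Hypothetical per-step quantities from a rollout:
pi_new = np.array([0.40, 0.10, 0.70])                # pi_theta(a_t | s_t)
pi_old = np.array([0.25, 0.20, 0.65])                # pi_theta_old(a_t | s_t)
advantage = np.array([1.5, -0.8, 0.3])               # A_hat_t = Q - V (formula 17)

ratio = pi_new / pi_old                              # r_t(theta)
unclipped = ratio * advantage
clipped = np.clip(ratio, 1 - eps, 1 + eps) * advantage

# Formula (16): average of the element-wise minimum (maximized during training).
objective = np.minimum(unclipped, clipped).mean()
print(round(float(objective), 4))
```

Note how the minimum keeps the pessimistic term whenever the ratio drifts outside the clipping band, which is what discourages large policy updates.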
In the rule-based method, learning does not occur, and the vehicle charging is determined by the time at which it detaches from the charger. If the vehicle leaves the station in less than three hours, it is charged at full capacity; otherwise, depending on the availability of renewable energy, the station proceeds to charge the vehicle.

$$\text{action}_t^{\,i} = \begin{cases} 1 & T_{\text{leave},t}^{\,i} \le 3 \\[4pt] \dfrac{G_t + G_{t+1}}{2} & T_{\text{leave},t}^{\,i} > 3 \end{cases} \tag{18}$$

where $G_t$ is the first element of the state, representing the battery energy level. Table 1 highlights the key steps taken in this thesis for implementing the DDPG, PPO, and RBC policies.

Table 1. The Pseudocode of the RL Policies and Final Evaluation

Algorithm 1 RL Policies and Final Evaluation
1: Initialize DDPG/PPO/RBC environment
2: Import necessary libraries (`gym`, `stable_baselines3`, `RBC`, etc.)
3: Create directories for saving models and logs

### **DDPG & PPO Policies**
4: Create and train model
5: - Initialize custom environment using `gym.make`
6: - Optionally set up action noise for DDPG
7: - Create model (`DDPG` or `PPO`) with `MlpPolicy` and custom environment
8: - Train and save model periodically
9: Evaluate and use model
10: - Evaluate the model using `evaluate_policy`
11: - Test trained agent by predicting and stepping through the environment

### **RBC Algorithm**
1: RBC class
2: - Create `select_action` function
3: - Determine departure time and solar radiation
4: - Set action based on departure time and radiation
5: - Return selected action
6: Run RBC algorithm
7: - Parse arguments for custom environment
8: - Initialize custom environment and reset state
9: - Loop until done:
10: - Select action using RBC
11: - Step through environment and update state
12: - Track rewards

### **Evaluation and Comparison**
1: Evaluate models
2: - Loop through evaluation episodes
3: - For each episode:
4: - Evaluate DDPG model and record rewards
5: - Evaluate PPO model and record rewards
6: - Evaluate RBC algorithm and record rewards
7: Plot and show results
8: - Calculate mean rewards for DDPG, PPO, and RBC with `np.mean(final_reward)`
9: - Plot and display the reward comparison graph with `pl.plot(final_reward)`

5 Numerical Studies

5.1 Daily Energy Dynamics

In this environment, there are 40 electric vehicles and 50 prosumers. Each household is equipped with solar panels that produce a specific amount of electricity during the day depending on weather conditions. Each station is also connected to the power grid and can purchase electricity from it, if there is a shortage in production, and use it for charging vehicles. The load imposed on the network for each of these nodes and the amount of electricity produced by each of these prosumers are pre-defined and available in Faia et al. (2021), specified in the PV.csv and Load.csv files. The time step defined in this environment is 15 minutes, and the total simulation time is 24 hours, meaning that 96 steps are carried out in this simulation. The amount of electricity consumed and produced by a prosumer helps to better understand the environment and the performance of each agent.

Figure 2. Consumption (Load) and Generated Electricity by RE (Renewable)

Figure 2 illustrates the daily profile of energy consumption and production of the mentioned 50 prosumers. The vertical axis measures the energy in kilowatt-hours (kWh), depicting the household's energy dynamics.
Upon examination of Figure 2, it is evident that renewable energy production—denoted by the blue line—experiences a significant in- crease during the central daylight hours. This pattern is indicative of solar power gener- ation, which aligns with the peak insolation periods typically observed in middle lati- tudes. The load curve, depicted by the orange line, illustrates daily fluctuations in energy consumption. It reveals a rise in energy use during the early morning as people prepare for work, followed by a more significant increase in the evening, which lasts until about 9 o'clock. This evening peak likely results from household members returning home and using various appliances simultaneously. A critical observation from the graph is the intersection points of the load and renewable lines, which divide the transition between net consumption and net production phases. During the early morning and late evening hours (00:00-05:00 and 20:00-23:45), the household's energy consumption surpasses its production, thereby necessitating the procurement of additional electricity from the grid. Conversely, the middle of the day (07:30-17:30) is characterized by an excess of renew- able energy production over consumption. This surplus provides an opportunity for the household to act as an energy provider, potentially channeling excess electricity for stor- age or ancillary uses such as charging electric vehicles. The profile also reveals potential for optimization. Energy storage solutions could be employed during the production sur- plus to alleviate the demand during deficit periods. This would enhance energy inde- pendence and contribute to a more balanced and self-sustaining household energy sys- tem. 5.2 Solar Energy Surplus The strength of the decision-making policy is when the renewable energy is more than the energy consumed by the prosumers. Now, to calculate the amount of energy a 36 prosumer can use to charge electric vehicles, the remaining renewable energy at each moment must be calculated. Figure 3. The Remaining RE after consumption Figure 3 provides a visual representation of the residual renewable energy available at a prosumer household, postulated after the subtraction of consumed load from the gross renewable production over 24 hours. The peak suggests that the highest availability of surplus renewable energy occurs in the central daylight hours, providing the most op- portune time for energy-intensive activities such as charging electric vehicles (EVs) in this case. The surplus initiates at time-step 05:00, peaks around time-step 12:30, and wanes by time-step 17:30. The data presented in the chart implies that there exists a window be- tween these intervals where the generated solar energy exceeds the household's con- sumption demands. Given that the total number of electric vehicles is 40, it can be in- ferred that the household's energy management system must prioritize vehicle charging schedules to align with the surplus energy availability. The variance in vehicle activity 37 over the day affects the charging demand. Hence, strategic scheduling is paramount to ensure that the vehicles are adequately charged when most active while optimizing the use of surplus renewable energy. 5.3 Tariffs According to Faia (et al., 2021), Table 2 demonstrates the price of electricity with six different tariffs at different times of the day, with values stored in the Tariff.csv file as well. 
To streamline the simulation, a single fixed tariff, "FIXED T1", is applied to the cost of electricity procured from the grid. Prices are constant within the time ranges shown, at a 15-minute resolution.

Table 2. The price of electricity at different times with six kinds of Tariffs

TIME RANGE      PERIOD     FIXED T1   HOURLY T1   FIXED T2   HOURLY T2   TRI-HOURLY T2
00:00 - 08:00   Off-Peak   0.1456     0.0923      0.1548     0.0968      0.0942
08:15 - 09:00   Peak       0.1456     0.1833      0.1548     0.2027      0.1715
09:15 - 10:30   On-Peak    0.1456     0.1833      0.1548     0.2027      0.2942
10:45 - 16:00   Peak       0.1456     0.1833      0.1548     0.2027      0.1715
16:15 - 20:30   On-Peak    0.1456     0.1833      0.1548     0.2027      0.2942
20:45 - 22:00   Peak       0.1456     0.1833      0.1548     0.2027      0.1715
22:15           Off-Peak   0.1456     0.0923      0.1548     0.0968      0.1715
22:30 - 23:45   Off-Peak   0.1456     0.0923      0.1548     0.0968      0.0942

In this thesis, it is assumed that all electric vehicles share identical specifications regarding battery capacity (45 kWh), the period required for a full charge, the charging and discharging rate (8 kWh), and the efficiency of the charging and discharging process (90%). Moreover, the amount of charge stored in each vehicle's battery at its first departure must be available; these values are saved in the SOC.csv file. The variables in this problem are the battery charge levels of each vehicle. At any given time, a vehicle may stop at one of the charging stations. Therefore, the simulation environment must be provided with each vehicle's time of arrival at and departure from the station, which are saved in the Evolution.csv file and can be observed in Figure 4.

5.4 EVs Presence during the day

Figure 4. Present cars over a 24-hour period

Figure 4 depicts the fluctuation in the quantity of electric vehicles (EVs) present within the simulated environment over the specified day. The graph shows sharp variations in the number of vehicles, ranging from as few as 10 to the maximum fleet size of 40. Notably, there are pronounced dips in the vehicle count, indicating times when a significant portion of the fleet is not present, presumably because the vehicles are in use. Tracking the presence of each car at each time step t is crucial for precisely calculating the penalties in the algorithm and for better understanding how the EVs move. The construction of the state of the Reinforcement Learning (RL) environment depends on this tracking. Based on this presence data, the EVs that depart within the next 15 minutes are extracted for the action phase. The SOC of each EV that is expected to depart within the next fifteen minutes is also extracted for additional analysis and the reward calculations.
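As a brief illustration of how the tariff profile of Table 2 might be read into the simulation (the same CSV loading step appears in Algorithm 3 later on), a minimal sketch follows; the column name "FIXED T1" and the exact file layout are assumptions based on the table above, not the thesis code.

```python
# Illustrative sketch: read the 15-minute tariff profile and select the fixed tariff.
# Column names and layout are assumed from Table 2.
import pandas as pd

tariffs = pd.read_csv("Tariffs.csv")      # assumed: one row per 15-minute step (96 rows)

price_flag = "FIXED T1"                   # tariff applied throughout the simulation
price = tariffs[price_flag].to_numpy()    # e.g. a constant 0.1456 under the fixed tariff

assert len(price) == 96, "expected one price per 15-minute step over 24 hours"
```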
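The sketch below shows one possible way of extracting, at a given time step, the EVs that leave within the next 15 minutes together with their SOC, as described above. The array layout (one row per EV, one column per 15-minute step) and a single next departure per vehicle are simplifying assumptions for illustration; the thesis implementation reads this information from Evolution.csv and SOC.csv.

```python
# Illustrative sketch of the departing-EV extraction described above.
# Array names and shapes are assumptions, not the thesis's own variables.
import numpy as np

def departing_evs(present_cars, departure_step, soc, t):
    """Return the indices and SOC of EVs that leave within the next time step.

    present_cars:   (n_cars, 96) 0/1 matrix, 1 if the EV is at the station at step t
    departure_step: (n_cars,) next departure step of each EV
    soc:            (n_cars, 96) state of charge per step, as a fraction in [0, 1]
    t:              current time step (0..95)
    """
    leaving = np.where((present_cars[:, t] == 1) & (departure_step == t + 1))[0]
    return leaving, soc[leaving, t]
```

In the environment, the resulting indices feed the action phase, and the corresponding SOC values enter the reward calculation described below.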
Each EV is an agent that enters a station with a specific amount of charge and either increases or decreases its battery charge. At the start of each simulation, when the environment is reinitialized, the values of the load, the produced energy, and the electricity prices at each step of the algorithm are read from the simulation files. Table 3 shows the core functions of the environment, such as initialization, the step function, resetting, generating observations, and setting the seed.

Table 3. The Pseudocode of the RL Environment

Algorithm 2 Reinforcement Learning Environment
1: Define Environment
2: Create an environment class using `gym.Env`
3: Initialize EV parameters and observation/action spaces
4: EV Charging Environment (Gym Environment)
5: Initialization (`__init__`):
6: Initialize key variables (`number_of_cars`, `number_of_days`, etc.)
7: Set EV parameters (e.g., `EV_capacity`, `charging_rate`)
8: Define action and observation spaces using `spaces.Box`
9: Initialize random seed and other variables
10: Step Function (`step`):
11: Simulate actions using `Simulate_Actions3.RL_Actions`
12: Update and record relevant metrics (`reward_History`, `Grid_trade_Evol`, etc.)
13: Check if the episode is done and save results using `savemat`
14: Reset Function (`reset`):
15: Reset environment state and load initial values using `loadmat`
16: Initialize necessary variables (`timestep`, `done`, etc.)
17: Get Observations (`get_obs`):
18: Calculate the state of the environment using `Simulate_Station3.RL_States`
19: Return the current state as a numpy array
20: Seed Function (`seed`):
21: Set random seed for reproducibility using `seeding.np_random`
22: Close (`close`)

Given the problem definition, reinforcement learning algorithms can be used to optimize the decision-making about the manner and amount of charging for the vehicles. For this purpose, a clear definition of state, reward, and action is needed:

• State: The state of the environment at any moment is a vector twice the length of the number of vehicles present in the simulation, which includes the amount of charge stored in each vehicle's battery and the movement time of each vehicle.

• Reward: The defined reward consists of three parts that determine the total reward at each stage. The first part is the electricity purchased from the power grid. At each step, the total energy required by the electric vehicles is calculated and the total remaining renewable energy is subtracted from it to determine the amount of electricity that needs to be purchased. Multiplying this amount by the cost of electricity at that time then gives the cost of purchasing energy from the power grid. It should be noted that this amount is considered a penalty term and needs to be minimized for optimization. To calculate the purchased energy, the total remaining renewable energy is first computed according to formula (7):

r_t = \mathrm{ReLU}\left( \sum_{i \in stations} \left[ E_{renewable}^{t} - Load^{t} \right] \right)    (7)

Formula (7) represents the total remaining renewable energy generated by the prosumers, where E_{renewable}^{t} indicates the renewable energy generated at each household and Load^{t} indicates the consumption load of that prosumer. The ReLU function is defined in formula (8):

\mathrm{ReLU}(x) = \begin{cases} x & x \ge 0 \\ 0 & \text{otherwise} \end{cases}    (8)
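A small numerical illustration of formulas (7) and (8) is given below, assuming the per-prosumer generation and load profiles are held as arrays of shape (50, 96) in kWh per 15-minute step; the array names are illustrative.

```python
# Sketch of formulas (7)-(8); array shapes and names are assumptions for illustration.
import numpy as np

def relu(x):
    # Formula (8): keep positive values, clip negative values to zero
    return np.maximum(x, 0.0)

def remaining_renewable(renewable, load, t):
    # Formula (7): surplus renewable energy aggregated over all prosumers at step t;
    # as in the formula, the ReLU is applied to the aggregate difference.
    return relu(np.sum(renewable[:, t] - load[:, t]))
```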
Having the amount of remaining renewable energy at any moment, formula (9) determines the energy each vehicle requires for charging, from which the deficit that must be purchased from the power grid is calculated:

E_{charge} = \min\left( AvR,\; (1 - SOC) \times Capacity \right)    (9)

E_{charge} is the energy required to charge the battery. AvR is the available charging rate (8 kWh), based on the average charging rate extracted from Faia et al. (2021). Capacity is the total energy capacity of the battery, which is considered to be 45 kWh for all the EVs here, according to the average capacity of EVs in Faia et al. (2021).

In the defined environment, there is no possibility of selling electricity from the stations to the power grid. That is why the second term of the reward is a penalty on the amount of energy that is produced by the prosumers but not consumed by the vehicles. Formula (10) is likewise a penalty term, introduced to maintain the exploration and exploitation process of the RL environment:

P_{RE}^{t} = \mathrm{ReLU}\left( r_t - \sum_{i \in Cars} E_i^{t} \right) \times \frac{Price_t}{2}    (10)

P_{RE}^{t} is the penalty associated with the unused renewable energy at time t, E_i^{t} represents the total energy injected into or taken from the vehicles by the stations, and r_t denotes the total remaining renewable energy at moment t.

In the defined environment, in an optimal state, a vehicle that visits a charging station should not leave the station with a low charge and should benefit from the services as much as possible. Therefore, a penalty is considered in this thesis for leaving the station with a low charge. This penalty is calculated according to formula (11):

P_{SOC} = \begin{cases} \left[ (1 - SOC_i) \times 2 \right]^{2} \\ (1 - SOC_i) \times 2 \\ (1 - SOC_i) \times 3 \\ (1 - SOC_i) \times 5 \end{cases}    (11)

where SOC_i \in [0, 1] represents the remaining charge of each vehicle. This function imposes a penalty based on how far the vehicle's battery is below full charge (100%) when it leaves the charging station. The closer SOC_i is to 1 (or 100%), the smaller the penalty. The exact penalty is determined by multiplying the deficit from full charge by one of several factors (2 squared, 2, 3, or 5), which correspond to different tiers or conditions of the penalty. This policy ensures that EVs on the road maintain a higher average charge level, keeping the system more reliable for EV transportation and supporting the stability of the grid in a vehicle-to-grid (V2G) system.

Given that all three parts considered for the reward are penalty terms, the reward of the environment is defined as formula (12):

reward = -1 \times \left( P_{EV} + P_{RE} + P_{SOC} \right)    (12)

• Action: The action for each vehicle is a number in the range a_i \in [-1, +1], where positive numbers indicate the injection of electricity from the station into the vehicle and negative numbers indicate the injection of electricity from the vehicle back into the station. Furthermore, the numbers are proportional: +1 indicates a full charge of an empty battery, and -1 indicates a complete discharge of a fully charged battery. To calculate how each vehicle is charged by the station, formula (9) is used; based on it, the lower the charge level of a vehicle, the higher the rate it receives from the station, although this rate cannot exceed the average rate of 8 kWh. To calculate the discharge of each vehicle, formula (13) is used:

E_{charge} = \min\left( AvR,\; SOC \times Capacity \right)    (13)

Then, to apply the calculated energies to the charge level of each vehicle, formula (14) is used:

SOC_{new} = SOC_{old} + \frac{a \times E_{charge}}{Capacity}    (14)

where SOC_{old} is the previous charge level of the vehicle, SOC_{new} is the new charge level of the vehicle, and a is the applied action. It should be noted that in the above formula the resulting battery state of charge must remain within the range of 0 to 1.
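A compact sketch of how formulas (9), (13), and (14) combine into a single SOC update is shown below. The constants follow the EV specifications given in Chapter 5, and the 90% charging/discharging efficiency is omitted for brevity, so this is an illustrative simplification rather than the thesis implementation.

```python
# Illustrative SOC update combining formulas (9), (13), and (14).
# The 90% charging/discharging efficiency is omitted here for brevity.
AVR = 8.0        # available charging/discharging rate per step (kWh)
CAPACITY = 45.0  # battery capacity (kWh)

def apply_action(soc, action):
    """Update one EV's state of charge for an action a in [-1, +1]."""
    if action >= 0:
        # Formula (9): energy the battery can still absorb, limited by the rate
        energy = min(AVR, (1.0 - soc) * CAPACITY)
    else:
        # Formula (13): energy the battery can still deliver, limited by the rate
        energy = min(AVR, soc * CAPACITY)
    # Formula (14): proportional update, kept within the admissible SOC range
    return min(max(soc + action * energy / CAPACITY, 0.0), 1.0)
```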
A detailed presentation of this section can be found in Table 4, which focuses on supporting functions and data processing utilities, such as energy calculations, initial value settings, action simulations, and state calculations.

Table 4. The Pseudocode of EV charging utilities

Algorithm 3 EV Charging Utilities
1: Energy_Calculation
2: load data from CSV files:
3: - tariffs_data = pd.read_csv('Tariffs.csv')
4: - pv_data = pd.read_csv('PV.csv')
5: - load_data = pd.read_csv('Load.csv')
6: define function Energy_Data(self):
7: set experiment parameters:
8: - days_of_experiment = 1
9: - number_of_prosumers = 50
10: - price_flag = self.price_flag # which tariff is used in the simulation
11: - pv_flag = self.pv_flag # whether PV generation is used or not
12: initialize arrays:
13: - Renewable = np.zeros([number_of_prosumers, 96])
14: - Load = np.zeros([number_of_prosumers, 96])
15: for each prosumer do
16: for each timestep do
17: calculate the remaining RE after using it for consumption
18: calculate the remaining load after using RE
19: determine price structure based on price flag
20: create Price array based on chosen tariff
21: return Load, Renewable, Price
22: end procedure
23: Initial_Values
24: define function EVs_Calculations():
25: load data from CSV file (Moves)
26: initialize data structures:
27: - arrival = {}
28: - departure = {}
29: - SOC = np.zeros([number_of_cars, 96])
30: for each car do
31: for each timestep do
32: identify changes in movement state
33: assign SOC based on departure duration
34: handle last timestep departure
35: ensure all cars are fully charged at midnight
36: calculate presence and evolution of cars
37: return SOC, ArrivalT, DepartureT, present_cars, evolution_of_cars
38: end procedure
39: Action_Simulation
40: define function RL_Actions(self, actions):
41: set current hour based on timestep
42: initialize EV charging demands and SOC
43: for each car do
44: determine charging or discharging action
45: calculate energy demand or supply
46: update SOC for next timestep
47: compute rewards based on penalties:
48: - Penalty_EV
49: - Penalty_RE
50: - Penalty_SOC
51: calculate final reward
52: return reward, Grid_trade, RE_surplus, Penalty_SOC, SOC
53: end procedure
54: Station_State
55: define function RL_States(self):
56: initialize key variables:
57: - SOC = self.SOC
58: - Arrival = self.Invalues['ArrivalT']
59: - Departure = self.Invalues['DepartureT']
60: - present_cars = self.Invalues['present_cars']
61: identify cars departing soon
62: compute hours until departure
63: calculate SOC for each car
64: return leave, Departure_hour, Battery
65: end procedure
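To make the reward assembly of formulas (10)-(12) and the penalty steps of Algorithm 3 (Penalty_EV, Penalty_RE, Penalty_SOC) concrete, a simplified sketch follows. The grid-purchase term and the single-tier SOC penalty are illustrative readings of the description above, not the exact thesis implementation.

```python
# Simplified per-step reward, following the penalty structure described above.
import numpy as np

def step_reward(r_t, ev_energy, price_t, departing_soc):
    """r_t: surplus renewable energy at step t (formula (7)), in kWh.
    ev_energy: energy injected into (positive) or drawn from (negative) each EV, kWh.
    price_t: electricity price at step t.
    departing_soc: SOC in [0, 1] of the EVs leaving at this step."""
    total_ev = float(np.sum(ev_energy))

    # Penalty_EV: cost of the energy that must be bought from the grid
    p_ev = max(total_ev - r_t, 0.0) * price_t

    # Penalty_RE, formula (10): unused renewable energy weighted by half the price
    p_re = max(r_t - total_ev, 0.0) * price_t / 2.0

    # Penalty_SOC, formula (11): here reduced to a single tier for simplicity
    p_soc = sum((1.0 - s) * 2.0 for s in departing_soc)

    # Formula (12): the reward is the negated sum of the three penalties
    return -(p_ev + p_re + p_soc)
```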
6 Results (Simulation Implementation)

6.1 An Evaluation of DDPG, PPO, and RBC

Figure 5. The Reward obtained by each agent in different policies

Figure 5 offers a comparative assessment of three distinct policies, Deep Deterministic Policy Gradient (DDPG), Proximal Policy Optimization (PPO), and Rule-Based Control (RBC), in terms of their reward outcomes over a sequence of 100 evaluation episodes. The 'Reward' axis represents the numerical score accrued by each algorithm, signifying its effectiveness within the simulated environment after the learning phase. The 'Evaluation episodes' axis provides the temporal sequence of tests, which is indicative of the policies' performance over time.

A cursory inspection of Figure 5 reveals that the DDPG algorithm, depicted by the blue line, consistently achieves the highest rewards throughout the evaluation episodes, indicating its superior performance in this specific environment. It maintains a notable degree of stability with minimal fluctuation in reward, suggesting that its policy is both effective and reliable over successive episodes. The advantage of DDPG most likely stems from its ability to handle continuous action spaces gracefully and from its deterministic approach to decision making, which is crucial for controlling the timing and rate of charging in a setting with EV charging stations. Moreover, Ornstein-Uhlenbeck action noise was incorporated into the algorithm to balance exploration and exploitation effectively, allowing it to sustain high performance through an efficient search of the action space around similar states.

The PPO algorithm, represented by the orange line, outperforms the RBC algorithm but exhibits significant volatility in its performance. This on-policy algorithm may have trouble adjusting its policy in a complicated reward landscape such as the EV charging simulation. The high degree of reward variability in PPO may be worsened by the use of multi-epoch stochastic gradient ascent in its policy optimization, which may not always converge optimally in the specified environmental configuration.

In contrast, the green line corresponding to the RBC algorithm shows the least variability among the three, though it consistently yields the lowest rewards. Rule-based systems exhibit this pattern because they follow predetermined rules instead of adjusting to feedback from the environment. Because it lacks learning and adaptive processes, RBC's performance is consistent, which indicates how robust it is. Nonetheless, its consistently lower rewards show that the RBC's static rules fall short of the adaptive algorithms in capturing the environment's complexities and optimizing decision-making.

When examining stability during the evaluation phase, it is evident that the DDPG and RBC algorithms demonstrate a degree of dependability, with the former also achieving superior performance. On the other hand, the variability of the PPO algorithm's rewards indicates a potential for improvement in policy consistency. To further scrutinize the comparative efficiency of the DDPG and PPO algorithms, an analysis beyond the scope of Figure 5 would be required, focusing on the reward trajectories during the learning phase. Such an analysis would elucidate the learning progression and the eventual stabilization of policy performance.

6.2 Learning Trajectories of DDPG and PPO

Figure 6. DDPG and PPO Policy Reward Training

Figure 6 provides a comparison of the learning trajectories of the DDPG and PPO algorithms throughout training episodes, with the number of episodes reaching one million. The vertical axis tracks the cumulative reward gained during training.

The DDPG algorithm, indicated by the blue line, demonstrates rapid convergence towards a local optimum early in the training process. The architecture of DDPG incorporates noise processes; however, since its policy is deterministic, it naturally focuses on immediately exploiting the knowledge it gains in order to maximize rewards. This quick stabilization of reward suggests that DDPG efficiently bootstraps knowledge from its experiences, leveraging it to achieve a policy that yields consistent results.
The algorithm's exploration tends to diminish as it converges to what it considers an optimal policy, based on its experience. This may cause the algorithm to plateau prematurely and thus fail to sufficiently examine possibly better policies outside the neighbourhood of its present policy.

Conversely, the PPO algorithm, illustrated by the orange line, maintains exploration throughout its training process because it is on-policy, which is why it shows a more gradual increase in rewards over the training episodes. Every update is based on new data, and the clipping method aids a deeper exploration of the policy space, preventing the policy from prematurely converging to local optima that are not optimal. This trend suggests that PPO continues to explore the policy space throughout the training process, potentially leading to a better overall understanding of the environment. Such persistent exploration could be advantageous in environments where the initially apparent local optima are not the best possible policies.

Exploration (looking for new information) and exploitation (using existing knowledge to maximize reward) must be balanced at the heart of reinforcement learning. The DDPG method favors quick exploitation, which works well when the optimal course of action does not change much. On the other hand, PPO's approach of persistently examining the policy space enhances its flexibility; however, it can lead to delayed convergence in situations where a good policy is readily apparent.

6.3 The Cost-Effectiveness of Policies

To analyze the economic impact of each policy, Figure 7 integrates time-variable electricity pricing into the cost calculation, multiplying the quantity of electricity purchased from the grid at each time step by the electricity price.

Figure 7. Amount of electricity needed to be purchased from the grid

The DDPG algorithm, represented by the blue line, shows the most stable and lowest costs throughout, indicating its efficiency in utilizing the available renewable energy and minimizing reliance on the grid. This suggests a well-optimized policy that has learned to predict and match the household's energy consumption with the production of renewable energy. With DDPG, EVs are charged when solar energy is plentiful and nearly free, reducing the need to purchase electricity from the grid during expensive peak-demand hours. Because DDPG is able to learn, it can adjust its charging approach to the average daily solar energy curve so as to save the most money.

In contrast, the exploratory tendency of PPO can result in less consistency than DDPG in how well it matches charging with peak solar times, even though it is equally capable of responding to solar availability. The results may fluctuate, indicating a higher level of performance variability as a result. PPO may investigate strategies that occasionally move charging to less-than-ideal times to see whether they could produce better long-term results.

On the other hand, RBC does not optimally reduce costs, as its static nature does not adapt to real-time changes in solar output. It may cause energy to be drawn from the grid at high-cost times, or underuse abundant solar energy during low-demand times, if the rules are overly rigid or out of sync with real consumption patterns.
Table 5. The cost of electricity purchased from the power grid using PV

                 PPO     DDPG    RBC
Total Cost (€)   6,378   1,533   13,086

The data in Table 5 quantify the total cost of each policy, providing a comparison of their economic impact. With DDPG incurring the lowest total cost of 1,533€, it drastically undercuts the expenses of PPO and RBC, which are 6,378€ and 13,086€ respectively. The cost savings are substantial when considering the hypothetical scenario without solar energy, where costs would escalate to 14,439€ for DDPG, 16,648€ for PPO, and 18,849€ for RBC according to Table 6, demonstrating the significant financial advantage of integrating renewable energy sources into the energy management system.

Table 6. The cost of electricity purchased from the power grid without PV

                 PPO      DDPG     RBC
Total Cost (€)   16,648   14,439   18,849

The analyses of both Figure 7 and Table 5 highlight the superior performance of the DDPG algorithm in reducing electricity purchase costs, evidencing the potential of reinforcement learning algorithms to offer notable economic benefits. However, there is an observed inefficiency across all policies concerning the distribution of surplus energy. The simulation constraints, specifically the inability to sell excess electricity back to the grid, lead to a scenario where surplus renewable energy is not fully utilized, resulting in wastage. In an optimized real-world application, policies would ideally be designed to incorporate mechanisms such as energy storage systems or grid feedback to capture and redistribute surplus energy, thereby maximizing the utilization of renewable energy production and further enhancing cost savings.

6.4 Energy Wastage

Figure 8. Wasted RE in three policies

Figure 8 illustrates the wastage costs associated with the unused renewable energy (RE) under the DDPG, PPO, and RBC energy management policies over 100 episodes. The chart delineates the amount of surplus renewable energy that was not utilized for any beneficial purpose, such as charging electric vehicles (EVs) or selling back to the grid, and that is therefore considered 'wasted' in this context.

The RBC policy, depicted by the green line, exhibits significantly lower peaks of wastage costs compared to the other two algorithms, although it is a less flexible and adaptive method than the machine learning-based policies. This indicates that the RBC's rule-based approach has been pre-configured for more efficient use of the generated renewable energy under the constraints of the simulation. This approach enables RBC to be designed precisely to use renewable energy sources before resorting to grid electricity. Such rules can result in a more efficient use of the generated electricity if they are designed to maximize the immediate use of renewable energy as it is created (for example, by arranging charging during the hours of peak solar production). However, the performance of the RBC method varies between episodes because its fixed rules do not match the fluctuating conditions of renewable energy production (owing to meteorological and seasonal variations) and consumption patterns.

Table 7. Wastage of Electricity Generated by RE

                 PPO      DDPG     RBC
Total Cost (€)   26,994   26,519   9,349

In Table 7, the total wastage costs quantified over the evaluation period further emphasize the findings from Figure 8.
The RBC algorithm results in the least total wastage (9,349€), significantly outperforming the machine learning-based DDPG (26,519€) and PPO (26,994€) algorithms in terms of reducing energy wastage. This could imply that while the machine learning algorithms are more adept at optimizing cost through purchasing decisions, they are not as efficient in handling the distribution of surplus energy within the simulation's parameters. DDPG and PPO both rely on learning from the environment to optimize their policies, and during early episodes these algorithms may not effectively match the supply of renewable energy with the demand for electricity.

While the comparison of rewards is a common metric for evaluating the performance of reinforcement learning algorithms, Figure 8 and Table 7 underscore the importance of considering additional operational variables, such as energy wastage, when assessing overall system efficiency. This approach recognizes that the optimal functioning of an energy management system must balance multiple objectives, including cost minimization and sustainable energy utilization.

6.5 SOC Management

Figure 9. Cumulative SOC in three policies

Figure 9 presents a temporal visualization of the cumulative state of charge (SOC) ratios for a fleet of electric vehicles (EVs) managed under the three mentioned policies. Each SOC ratio ranges from -1 to +1, with the aggregate at each 15-minute time step providing a snapshot of the fleet's overall charge status.

The graph portrays the DDPG algorithm as maintaining a relatively balanced SOC, avoiding extremes of undercharging or overcharging the fleet. This suggests an optimal energy distribution strategy that aligns vehicle charging needs with the availability of renewable energy, thus ensuring that vehicles are adequately charged for use without incurring unnecessary energy wastage. The SOC trajectories of the PPO and RBC algorithms show greater fluctuations. Both algorithms experience moments where the cumulative SOC dips or peaks more sharply, implying periods of potential overcharging or under-utilization of the fleet's battery capacity. However, it is worth noting that neither PPO nor RBC consistently underperforms compared to DDPG, as the three trajectories occasionally cross over one another.

The data on the presence of cars (Figure 4) show significant variations in the number of cars that need to be charged at various time intervals. In response, the best charging approach would adjust the SOC levels as necessary: lower the energy output during peak hours to save energy or divert it to other uses, and raise the SOC during off-peak hours to meet increased demand without overshooting or wasting energy. This behavior points to an advanced energy management system, embodied by policies such as DDPG, which appears to adjust to fluctuating vehicle counts while maximizing energy use and making sure cars are charged appropriately without wasting or consuming excessive amounts of energy.

7 Discussion

The exploration of reinforcement learning in the operation of smart grid energy systems, as observed in this study, is a testament to the transformative potential of machine learning in the energy sector. With each algorithm revealing distinct strengths, the study paints a comprehensive picture of the current capabilities and the areas ripe for improvement.
Currently, any excess energy not used for charging electric vehicles (EVs) is considered wasted, representing an opportunity loss both economically and in terms of energy resource management. Future studies could focus on two main strategies to address this issue:

• Grid Feed-In Systems: Incorporating the option to sell surplus electricity back to the grid would not only prevent wastage but could also provide a financial return to the prosumers and contribute to the overall stability of the grid.

• Battery Energy Storage Systems (BESS): The installation of battery storage systems offers another viable solution for capturing and retaining surplus energy. By storing excess production, energy can be utilized during periods of high demand or low generation, thereby enhancing the reliability and resilience of the smart grid system.

Building on the existing research, further algorithmic refinement could enable a more nuanced balance between immediate reward maximization and long-term strategic planning. Adaptive algorithms that can respond to real-time pricing and engage with BESS and grid feed-in options would significantly advance smart grid management capabilities. Expanding the scope to include variable renewable energy sources (such as wind energy), larger-scale simulations, and behavioral models of human energy consumption will enhance the realism and applicability of the research. Furthermore, the integration of these technologies must be considered within the context of the evolving policy and regulatory environment, which will undoubtedly shape the operational framework of future smart grids.

8 Conclusion

In this thesis, a dynamic and uncertain environment populated by 50 prosumer nodes and 40 electric vehicles (EVs) is examined, each interacting within a smart grid framework over a 24-hour simulation period. Reinforcement learning (RL), characterized by its trial-and-error approach, interacts with this environment to make sequential decisions that aim to maximize the reward. By framing the problem within the structure of an MDP, we engage with a system whose outcomes are probabilistically determined and inherently difficult to predict. Each household in the simulation is equipped with solar panels that generate variable amounts of electricity, contingent upon weather conditions. Additionally, each station is connected to the power grid, allowing for the purchase of electricity to supplement any shortfall in production, which can then be used to charge the EVs. The vehicles enter the station with varying charge levels, and the challenge lies in either increasing or decreasing the battery charge efficiently. The simulation is structured into 96 time steps of 15 minutes each, ensuring a detailed resolution of the energy dynamics.

Given this complex backdrop, the study applies RL algorithms to navigate the intricate decisions surrounding the charging strategies for EVs. These decisions hinge on a well-defined set of states, rewards, and actions that determine the algorithms' efficiency in optimizing for both the individual vehicle requirements and the broader energy demands of the grid. Three policies are used in this framework, namely DDPG, PPO, and RBC.
The comparative evaluation of the Deep Deterministic Policy Gradient (DDPG), Proximal Policy Optimization (PPO), and Rule-Based Control (RBC) algorithms demonstrated that each possesses distinct advantages and challenges in managing the delicate balance between energy consumption, production, and storage. DDPG emerged as a robust approach for cost-effective energy management, adeptly minimizing the need for grid electricity purchases by effectively aligning energy consumption with renewable production. Its rapid convergence suggests a strong capability for bootstrapping and immediate application of learned policies, though this may come at the cost of reduced exploration of the policy space. While PPO, with its continuous exploration, has the potential for better adaptability in dynamic environments, its performance was marked by greater volatility in cost-effectiveness and energy utilization. This characteristic could potentially yield improvements over time as the algorithm further explores the environment. On the other hand, the RBC algorithm's performance highlighted the benefits of predefined rules in reducing energy wastage, emphasizing the value of predictable and stable energy management strategies in certain contexts. While not as economically efficient in reducing costs as its machine learning counterparts, its effectiveness in minimizing wastage presents a compelling case for its inclusion in a hybrid strategy.

When considering the overall sustainability and economic viability of smart grid systems, it is clear that a multifaceted approach is necessary. The integration of renewable energy not only provides an avenue for cost savings but also presents challenges in terms of energy distribution and storage. The inability to return surplus energy to the grid in this simulation underscored the importance of effective energy storage solutions or alternative strategies to fully capitalize on the generated renewable energy. Through the lens of EV fleet management, the study accentuated the importance of managing the state of charge (SOC) levels to prevent overcharging and to ensure that the vehicles are charged in line with renewable energy availability, thereby enhancing the operational efficiency of the fleet.

Overall, the analyses suggest that while reinforcement learning algorithms such as DDPG and PPO show promise in optimizing for economic gains and adaptability, the predictability and stability of rule-based systems like RBC offer valuable benefits. An integrated approach that combines the predictive and learning capabilities of machine learning with the consistency of rule-based algorithms may yield the most effective smart grid energy management system, one that maximizes renewable energy usage, minimizes costs, and promotes sustainability in the evolving landscape of energy distribution and consumption.

References

Al-Gabalawy, M. (2021). Reinforcement learning for the optimization of electric vehicle virtual power plants. International Transactions on Electrical Energy Systems, 38(8). doi:10.1002/2050-7038.12951

Arwa, E., & Folly, K. (2021, June). Improved Q-learning for Energy Management in a Grid-tied PV Microgrid. SAIEE Africa Research Journal, 112(2), 77-88. doi:10.23919/SAIEE.2021.9432896

Cai, W., Kordabad, A., & Gros, S. (2023, March). Energy management in residential microgrid using model predictive control-based reinforcement learning and Shapley value. Engineering Applications of Artificial Intelligence, 119, 105793. doi:10.1016/j.engappai.2022.105793
Chen, G., Peng, Y., & Zhang, M. (2018). An Adaptive Clipping Approach for Proximal Policy Optimization. ArXiv.

Chiş, A., Lundén, J., & Koivunen, V. (2017, May). Reinforcement Learning-Based Plug-in Electric Vehicle Charging With Forecasted Price. IEEE Transactions on Vehicular Technology, 66(5), 3674-3684. doi:10.1109/TVT.2016.2603536

Cohen, J., Azarova, V., Kollmann, A., & Reichl, J. (2019). Q-complementarity in household adoption of photovoltaics and electricity-intensive goods: The case of electric vehicles. Energy Economics, 567-577. doi:10.1016/J.ENECO.2019.08.004

Co-Reyes, J., Sanjeev, S., Berseth, G., Gupta, A., & Levine, S. (2020). Ecological Reinforcement Learning. ArXiv.

Dabbaghjamanesh, M., Moeini, A., & Kavousi-Fard, A. (2021, June). Reinforcement Learning-Based Load Forecasting of Electric Vehicle Charging Station Using Q-Learning Technique. IEEE Transactions on Industrial Informatics, 17(6), 4229-4237. doi:10.1109/TII.2020.2990397

Esfandyari, A., Norton, B., Conlon, M., & McCormack, S. (2019). Performance of a campus photovoltaic electric vehicle charging station in a temperate climate. Solar Energy, 762-771. doi:10.1016/J.SOLENER.2018.12.005

Faia, R., Soares, J., Fotouhi Ghazvini, M., Franco, J., & Vale, Z. (2021). Local Electricity Markets for Electric Vehicles: An Application Study Using a Decentralized Iterative Approach. Frontiers in Energy Research, 9. doi:10.3389/fenrg.2021.705066

Ferro, G., Laureri, F., Miniciardi, R., & Robba, M. (2018). An optimization model for electrical vehicles scheduling in a smart grid. Sustainable Energy, Grids and Networks, 14, 62-70. doi:10.1016/j.segan.2018.04.002

Foster, J., & Caramanis, M. (2013, Aug.). Optimal Power Market Participation of Plug-In Electric Vehicles Pooled by Distribution Feeder. IEEE Transactions on Power Systems, 28(3), 2065-2076. doi:10.1109/TPWRS.2012.2232682

Härtel, F., & Bocklisch, T. (2023). Minimizing Energy Cost in PV Battery Storage Systems Using Reinforcement Learning. IEEE Access, 11, 39855-39865. doi:10.1109/ACCESS.2023.3267978

Huang, S., Yang, M., Zhang, C., Yun, J., Gao, Y., & Li, P. (2020). A Control Strategy Based on Deep Reinforcement Learning Under the Combined Wind-Solar Storage System. 2020 IEEE 3rd Student Conference on Electrical Machines and Systems (SCEMS), 819-824. doi:10.1109/SCEMS48876.2020.9352436

Karatzinis, G., Korkas, C., Terzopoulos, M., Tsaknakis, C., Stefanopoulou, A., Michailidis, I., & Kosmatopoulos, E. (2022). Chargym: An EV Charging Station Model for Controller Benchmarking. In Artificial Intelligence Applications and Innovations (pp. 241-252).

Karmaker, A., Hossain, M., Pota, H., Onen, A., & Jung, J. (2023). Energy Management System for Hybrid Renewable Energy-Based Electric Vehicle Charging Station. IEEE Access, 11, 27793-27805. doi:10.1109/ACCESS.2023.3259232

Khaki, B., Chung, Y., Chu, C., & Gadh, R. (2019). Probabilistic Electric Vehicle Load Management in Distribution Grids. 2019 IEEE Transportation Electrification Conference and Expo (ITEC), 1-6. doi:10.1109/ITEC.2019.8790535

Lan, T., Jermsittiparsert, K., Alrashood, S., Rezaei, M., Al-Ghussain, L., & Mohamed, M. (2021). An Advanced Machine Learning Based Energy Management of Renewable Microgrids Considering Hybrid Electric Vehicles' Charging Demand. Energies, 14(3), 569. doi:10.3390/EN14030569

Li, S., et al. (2022, May). Electric Vehicle Charging Management Based on Deep Reinforcement Learning. Journal of Modern Power Systems and Clean Energy, 10(3), 719-730. doi:10.35833/MPCE.2020.000460
Li, Y., Han, M., Yang, Z., & Li, G. (2021). Coordinating Flexible Demand Response and Renewable Uncertainties for Scheduling of Community Integrated Energy Systems With an Electric Vehicle Charging Station: A Bi-Level Approach. IEEE Transactions on Sustainable Energy, 12(4), 2321-2331. doi:10.1109/TSTE.2021.3090463

López, K., Gagné, C., & Gardner, M. (2018). Demand-Side Management Using Deep Learning for Smart Charging of Electric Vehicles. IEEE Transactions on Smart Grid, 1-1. doi:10.1109/TSG.2018.2808247

Mololoth, V., Saguna, S., & Åhlund, C. (2023). Blockchain and Machine Learning for Future Smart Grids: A Review. Energies, 16, 528. doi:10.3390/en16010528

Morstyn, T., Teytelboym, A., & Mcculloch, M. (2018). Matching Markets with Contracts for Electric Vehicle Smart Charging. IEEE Power & Energy Society General Meeting (PESGM), 1-5. doi:10.1109/PESGM.2018.8586361

Moussaoui, H., Akkad, N., & Benslimane, M. (2023). Reinforcement Learning: A review. International Journal of Computing and Digital Systems. doi:10.12785/ijcds/1301118

Najafi, S., Shafie-khah, M., Siano, P., Wei, W., & Catalão, P. (2019, December). Reinforcement learning method for plug-in electric vehicle bidding. IET Smart Grid, 2(4), 529-536. doi:10.1049/iet-stg.2018.0297

Narvekar, S., Peng, B., Leonetti, M., Sinapov, J., Taylor, M., & Stone, P. (2020). Curriculum Learning for Reinforcement Learning Domains: A Framework and Survey. ArXiv. Retrieved from abs/2003.04960

Nezamoddini, N., & Wang, Y. (2016). Risk management and participation planning of electric vehicles in smart grids for demand response. Energy, 116, 836-850. doi:10.1016/J.ENERGY.2016.10.002

Ojand, K., & Dagdougui, H. (2022, Jan.). Q-Learning-Based Model Predictive Control for Energy Management in Residential Aggregator. IEEE Transactions on Automation Science and Engineering, 19(1), 70-81. doi:10.1109/TASE.2021.3091334

Qiu, D., Ye, Y., Papadaskalopoulos, D., & Strbac, G. (2020, Sept.-Oct.). A Deep Reinforcement Learning Method for Pricing Electric Vehicles With Discrete Charging Levels. IEEE Transactions on Industry Applications, 56(5), 5901-5912. doi:10.1109/TIA.2020.2984614

Radu, A., Eremia, M., & Toma, L. (2019). Optimal charging coordination of electric vehicles considering distributed energy resources. 2019 IEEE Milan PowerTech, 1-6. doi:10.1109/PTC.2019.8810756

Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). Proximal Policy Optimization Algorithms. ArXiv. Retrieved from abs/1707.06347

Shin, M., Choi, D.-H., & Kim, J. (2020, May). Cooperative Management for PV/ESS-Enabled Electric Vehicle Charging Stations: A Multiagent Deep Reinforcement Learning Approach. IEEE Transactions on Industrial Informatics, 16(5), 3493-3503. doi:10.1109/TII.2019.2944183

Sutton, R., & Barto, A. (2018). Reinforcement learning: an introduction. Westchester Publishing Services.

Vandael, S., Claessens, B., Ernst, D., Holvoet, T., & Deconinck, G. (2015, July). Reinforcement Learning of Heuristic EV Fleet Charging in a Day-Ahead Electricity Market. IEEE Transactions on Smart Grid, 6(4), 1795-1805. doi:10.1109/TSG.2015.2393059

Vázquez-Canteli, J. R., & Nagy, Z. (2019). Reinforcement learning for demand response: A review of algorithms and modeling techniques. Applied Energy, 235, 1072-1089. doi:10.1016/j.apenergy.2018.11.002
Wan, Z., Li, H., He, H., & Prokhorov, D. (2019, Sept.). Model-Free Real-Time EV Charging Scheduling Based on Deep Reinforcement Learning. IEEE Transactions on Smart Grid, 10(5), 5246-5257. doi:10.1109/TSG.2018.2879572

Yan, L., Chen, X., Zhou, J., Chen, Y., & Wen, J. (2021, Nov.). Deep Reinforcement Learning for Continuous Electric Vehicles Charging Control With Dynamic User Behaviors. IEEE Transactions on Smart Grid, 12(6), 5124-5134. doi:10.1109/TSG.2021.3098298

Yao, L., Lim, W., & Tsai, T. (2017). A Real-Time Charging Scheme for Demand Response in Electric Vehicle Parking Station. IEEE Transactions on Smart Grid, 8, 52-62. doi:10.1109/TSG.2016.2582749

Yao, M., Da, D., Lu, X., & Wang, Y. (2024). A Review of Capacity Allocation and Control Strategies for Electric Vehicle Charging Stations with Integrated Photovoltaic and Energy Storage Systems. World Electric Vehicle Journal, 15(3), 101. doi:10.3390/wevj15030101

Ye, X., Ji, T., Li, M., & Wu, Q. (2018). Optimal Control Strategy for Plug-in Electric Vehicles Based on Reinforcement Learning in Distribution Networks. 2018 International Conference on Power System Technology (POWERCON), 1706-1711. doi:10.1109/POWERCON.2018.8602101

Ye, Y., Qiu, D., Sun, M., Papadaskalopoulos, D., & Strbac, G. (2020, March). Deep Reinforcement Learning for Strategic Bidding in Electricity Markets. IEEE Transactions on Smart Grid, 11(2), 1343-1355. doi:10.1109/TSG.2019.2936142

Zhang, Z., Chen, J., Chen, Z., & Li, W. (2019). Asynchronous Episodic Deep Deterministic Policy Gradient: Toward Continuous Control in Computationally Complex Environments. IEEE Transactions on Cybernetics, (99), 1-10. doi:10.1109/TCYB.2019.2939174