Reinforcement learning for data center energy efficiency optimization: A systematic literature review and research roadmap

Hussain Kahil a,*, Shiva Sharma b, Petri Välisuo a, Mohammed Elmusrati a

a School of Technology and Innovation, University of Vaasa, Wolffintie 32, Vaasa, 65200, Finland
b School of Technology, Vaasa University of Applied Sciences, Wolffintie 30, Vaasa, 65200, Finland

HIGHLIGHTS
• Discusses using Reinforcement Learning (RL) for data center cooling systems.
• Discusses using RL for data center information and communication technology (ICT) systems.
• Provides a deep critical analysis of the energy optimization results.
• Presents comprehensive data extraction covering experimental setups and benchmarks.
• Explores future directions in RL for optimizing energy in data center environments.

ARTICLE INFO
Keywords: Data center; Energy efficiency optimization; Cooling system; ICT system; Reinforcement learning (RL); Deep reinforcement learning (DRL)

ABSTRACT
With today's challenges posed by climate change, global attention is increasingly focused on reducing energy consumption within sustainable communities. As significant energy consumers, data centers represent a crucial area for research in energy efficiency optimization. To address this issue, various algorithms have been employed to develop sophisticated solutions for data center systems. Recently, Reinforcement Learning (RL) and its advanced counterpart, Deep Reinforcement Learning (DRL), have demonstrated promising potential in improving data center energy efficiency. However, a comprehensive review of the deployment of these algorithms remains limited. In this systematic review, we explore the application of RL/DRL algorithms for optimizing data center energy efficiency, with a focus on optimizing the operation of cooling systems and Information and Communication Technology (ICT) processes, including task scheduling, resource allocation, virtual machine (VM) consolidation/placement, and network traffic control. Following the Preferred Reporting Items for Systematic review and Meta-Analysis (PRISMA) protocol, we provide a detailed overview of the methodologies and objectives of 65 identified studies, along with an in-depth analysis of their energy-related results. We also summarize key aspects of these studies, including benchmark comparisons, experimental setups, datasets, and implementation platforms. Additionally, we present a structured qualitative comparison of the Markov Decision Process (MDP) elements for joint optimization studies. Our findings highlight vital research gaps, including the lack of real-time validation for developed algorithms and the absence of multi-scale standardized metrics for reporting energy efficiency improvements. Furthermore, we propose joint optimization of multi-system objectives as a promising direction for future research.

* Corresponding author.
Email addresses: hussain.kahil@uwasa.fi (H. Kahil), shiva.sharma@vamk.fi (S. Sharma), petri.valisuo@uwasa.fi (P. Välisuo), mohammed.elmusrati@uwasa.fi (M. Elmusrati).
https://doi.org/10.1016/j.apenergy.2025.125734
Received 10 January 2025; Received in revised form 25 February 2025; Accepted 14 March 2025
Applied Energy 389 (2025) 125734
Available online 25 March 2025
0306-2619/© 2025 The Author(s). Published by Elsevier Ltd.
This is an open access article under the CC BY license ( http://creativecommons.org/licenses/by/4.0/ ). H. Kahil, S. Sharma, P. Välisuo et al. Nomenclature A3C Asynchronous advantage actor-critic AC Actor-critic ACO Ant Colony Optimization ACS Ant Colony System ADVMC Adaptive DRL based VM Consolidation AFED-EF Adaptive Four-threshold Energy-aware VM Deployment ARLCA Advanced RL Consolidation Agent ATES Aquifer Thermal Energy Storage AVMC Autonomous VM Consolidation AVT Active Ventilation Tile BDQ Branching Dueling Q-Network BF Best Fit BFD Best Fit Decreasing CARPO Correlation-AwaRe Power Optimization CCO Cooling Control Optimization CDRL Constrained DRL CFD Computational Fluid Dynamics CFWS Cost and carbon Footprint through Workload Shifting CNN Convolutional Neural Network CSLB Crow Search-based Load Balancing CVP Chemical reaction optimization-VMP-Permutation CW Chilled Water D3QN Dueling Deep Q Network DAG Directed Acyclic Graph DBC Deadline and Budget Constrained DCI Dynamic Control Interval DCN Data Center Network DDPG Deep Deterministic Policy Gradient DL Deep Learning DPPE Data Center Performance Per Energy DPSO Discrete Particle Swarm Optimization DQN Deep Q-Network DRL Deep Reinforcement Learning DSTS Dynamic Stochastic Task Scheduling DTA DRL-based Task Migration DTH-MF Dynamic Threshold Maximum Fit DTM Dynamic Thermal Management DUE De-underestimation Validation Mechanism DX Direct Expansion ECA Enclosed Cold Aisle EDF Earliest Deadline First EMVO Enhanced Multi-Verse Optimizer EOM Energy Optimization Module EQBFD Energy-efficient and QoS-aware BFD ERE Energy Reuse Effectiveness ERLFC Eco-friendly RL in Federated Cloud ETAS Energy and Thermal-Aware Scheduling ETF Earliest Time First ETHC Elastic Task Handler over hybrid Cloud EVCT Energy-efficient VM minimum Cut Theory EVMM Energy-aware VM Migration FCT Flow Completion Time FERPTS Fast and Energy-aware Resource Provisioning and Task Scheduling FF First Fit FFD First Fit Decreasing FFO FireFly Optimization FIFO First-In-First-Out GA Genetic Algorithm GCD Google Cluster Dataset GEC Green Energy Coefficient GJO Golden Jackal Optimization GMPR Greedy Minimizing Power consumption and Resource wastage GRF Generalized Resource-Fair GRR Generalized Round Robin GRVMP Greedy Randomized VM Placement HDDL Heterogeneous Distributed Deep Learning HDRL Hierarchical DRL HEFT Heterogeneous Earliest Time First HGP Heteroscedastic Gaussian Processes HM Host Machine HVAC Heating, Ventilation, and Air Conditioning ICA Imperialist Competitive Algorithm ICO IT Control Optimization ICT Information and communication Technology IGGA Improved Grouping Genetic Algorithm IQR Inter-Quartile Range ITEE IT Equipment Energy ITEU IT Equipment Utilization JCO Joint IT and Cooling Control Optimization Algorithm KMI-MRCU K-Means clustering algorithm-Midrange-Interquartile range LECC Location, Energy, Carbon and Cost-aware vm placement LR Logistic Regression LRR Local regression robust LSTM Long Short-Term Memory MAD Median Absolute Deviation MAGNETIC Multi-AGent machine learNing-based approach for Energy efficienT dynamIc Consolidation MBAC Model-Based Actor-Critic MBHC MBRL-based HVAC control MBRL Model-Based RL MCP Modified Critical Path MCTS Monte Carlo Tree Search MDP Markov Decision Process MFFD Modified First Fit Decreasing MGGA Multi-objective Genetic Algorithm MILP Mixed Integer linear programming MIMT Minimization of Migration based on Tesa MLF Minimum Load First MMT Minimum Migration Time MOACO Multi-Objective Ant Colony Optimization MOPSO 
Multi-Objective Particle Swarm Optimization MPC Model Predictive Control MSP Multi-Set Point MVO Multi-Verse Optimizer NFV Network Function Virtualization NPA Non-Power-Aware NSGA-II Non-dominated Sorting Genetic Algorithm II OCA Open Cold Aisle OEMACS Order Exchange and Migration Ant Colony System PABFD Power-aware Best Fit Decreasing PADQN PArametrized Deep Q-Network PETS Probabilistic Ensembles with Trajectory Sampling PID Proportional-Integral-Derivative PM Physical Machine PPO Proximal Policy Optimization PRISMA Preferred Reporting Items for Systematic review and Meta-Analysis PSO Particle Swarm Optimization PUE Power Usage Effectiveness QEEC Q-learning Energy-Efficient Cloud computing QL Q-learning RAC Resource Allocation in container-based Clouds RDHX Rear Door Heat Exchangers RES Renewable Energy Systems RH Relative Humidity RLR Robust Logistic Regression RP Residual Physics RR Round Robin RTP Real-Time Pricing SAC Soft Actor Critic SARSA State-Action-Reward-State-Action SDAEM Stacked De-noising Auto-encoders with Multilayer Perception SDN Software-Defined Networking SFC Service Function Chaining SLA Service Level Agreement SO Snake Optimizer SSP Single-Set Point TDBS Task Duplication-Based Scheduling TPM Traffic Prediction Module TRPO Trust Region Policy Optimization UP Utilization Prediction-aware UPS Uninterruptible Power Supply VDN Value Decomposition Network VDT-UMC VM-based Dynamic Threshold and Minimum Correlation of Host Utilization VM Virtual Machine VMC VM Consolidation VMP VM Placement VMPMBBO Multi-objective Biogeography-Based Optimization VMTA VM Traffic burst VPBAR VM scheduling Based on Poisson Arrival Rate VPME VM Placement with Maximizing Energy efficiency WUE Water Usage Effectiveness

1. Introduction

The digitalization of society and the emergence of new AI technologies have increased the overall demand for computing power. This growth has made data centers a critical infrastructure that supports our modern digital ecosystems. The rise in the use of technologies such as the Internet of Things (IoT), cloud computing, big data, and artificial intelligence (AI) has increased the workload of data centers, which now require even more computing resources to meet demand. Data centers form the backbone of modern digital infrastructure, and their high energy consumption has substantial financial and environmental implications. According to the International Energy Agency [1], data centers consumed an estimated 460 terawatt hours (TWh) of electricity in 2022, with projections indicating that this could exceed 1000 TWh by 2026. In the European Union (EU), data centers consumed approximately 45–65 TWh of electricity in 2022, representing 1.8 % to 2.6 % of the total electricity consumption of the EU for that year [2].

This substantial energy consumption contributes to increased operational costs and has significant environmental consequences, including large amounts of greenhouse gas emissions [3] and increased strain on power grids [4]. Therefore, improving energy efficiency in data centers has become a critical issue, requiring intelligent and automated solutions capable of dynamically adapting to real-time demands. Among the many emerging technologies, Reinforcement Learning (RL) and its subset, Deep Reinforcement Learning (DRL), have gained attention as promising techniques for optimizing energy efficiency within complex environments like data centers.
These algorithms enable systems to learn optimal policies by interacting with dynamic environ - ments, making them suitable for resource allocation, task scheduling, and heating and cooling management. A study conducted by Jayanetti et al. [5] demonstrates the significant potential of RL/DRL for minimiz - ing energy consumption and reducing operational costs. The data center architecture comprises three main systems: informa - tion and communication technology (ICT), cooling, and power supply systems. Today’s data centers are vast, complex, and highly sophisti - cated, powered by a diverse ecosystem of ICT devices. These range from high-performance servers equipped with heterogeneous computing processors, such as CPUs, GPUs, and specialized accelerators, to arrays of memory units and storage solutions. In addition to computational infrastructure, the cooling system is critical in sustaining data center functionality. Its complexity arises from integrating multiple subsystems designed to regulate thermal conditions and protect highly sensitive ICT equipment from overheating. Efficient cooling is a fundamental aspect of data center operations, directly impacting energy consumption, op - erational costs, and system reliability. Due to the high heat dissipation of modern ICT equipment, data center cooling systems are designed to maintain optimal temperatures, prevent hardware failures, and enhance overall performance. A typical data center cooling system consists of multiple compo - nents, including chillers, pumps, fans, heat exchangers, and cooling towers, which work together to regulate temperature and ensure ef - ficient heat dissipation. These systems can generally be classified into air-based and liquid-based cooling solutions. Air-based cooling relies on Computer Room Air Conditioning (CRAC) units [6] and Computer Room Air Handlers (CRAH) [7]. Liquid-based cooling, in contrast to air-based methods, incorporates technologies such as direct-to-chip cooling [8], spray/immersion cooling [9]. These approaches significantly enhance thermal management by efficiently dissipating heat and directly cooling critical components. Recently, localized heat exchanger solutions, such as in-row, rear-door cooling and in-rack, have gained popularity due to their efficiency in high-density environments. In-row cooling places cooling units between server racks, reducing airflow distance and im - proving cooling efficiency [10]. Rear Door Heat Exchangers (RDHX), on the other hand, attach cooling units directly to the back of racks, cap - turing and dissipating heat immediately as it exits the servers. These strategies enhance cooling performance while minimizing energy waste by targeting heat removal close to the source [11]. Free cooling is an energy-efficient heat rejection method that uses low ambient air or water temperature with a dry cooler or heat ex - changer. Depending on the ambient media, the free cooling is also known as water-side or air-side economizer [12]. Heat pumps [13] and thermal energy storage [14] are increasingly being adopted to enhance energy efficiency and overall performance of heat reuse. Fig. 1 provides a schematic diagram of the data center cooling and heat rejection and reuse systems. Solutions based on RL/DRL techniques enable adaptive, real-time decision making which has significant potential for enhancing energy ef - ficiency through optimization in the complex data center environments. 
Despite all these promising developments, the adoption of RL/DRL for minimizing energy consumption in data centers faces various challenges, including the complexity of modeling data center environments, managing computational costs, and ensuring scalability [15]. To address these challenges, innovative and intelligent solutions are required that can adapt to complex and dynamic environments in real-time. Several previous reviews on the use of RL/DRL have been conducted for general applications [16] rather than analyzing a holistically integrated RL/DRL framework with a specific system, which this paper aims to examine. Additionally, few studies have provided systematic evaluations of RL/DRL across data center functions, leaving a gap in understanding these algorithms' capabilities in real-world environments.

Fig. 1. Schematic diagram of data center cooling, heat rejection, and heat reuse system options. Black and blue arrows show heat flows in air and liquid, respectively. The grey arrow shows that the heat exchangers at the bottom of the middle box are localized closer to the heat source, whereas those at the top are far from it. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

In this systematic literature review, we aim to investigate the recent advancements and applications of RL/DRL for enhancing energy efficiency in data centers by analyzing the literature using the PRISMA framework. The main objective of this work is to explore and evaluate the diverse potential applications of RL/DRL as a tool for optimizing energy efficiency in data centers while also synthesizing and consolidating existing research knowledge on their implementation in such facilities. Furthermore, this study aims to achieve the following specific objectives:
• Investigate and assess key applications of RL/DRL in data centers: This review aims to provide a comprehensive analysis of how RL/DRL algorithms have been applied to solve various energy efficiency challenges in data centers. To achieve this, we categorize RL/DRL applications by data center subsystems, giving readers insights into their roles and effectiveness.
• Evaluate and summarize each identified study in terms of algorithm type, the specific research problem addressed, primary objectives, and energy-efficiency outcomes, along with the benchmarks employed for performance evaluation, enabling a deeper understanding of the current state of research.
• Summarize details about the execution aspects of the identified studies: the implementation environment, dataset source, dataset type, and the platforms or frameworks utilized, offering insights into the practical considerations and resources required for implementing future studies.
• Utilize the identified joint optimization studies to present comprehensive guidelines for formulating the Markov Decision Process (MDP) elements, providing readers with a clear overview and foundational knowledge to construct such frameworks in future research.
• Identify technical and practical challenges in the current research direction.
By investigating the essential issues related to RL/DRL usage in data centers, we aim to provide an in-depth view of the barriers that limit the broader use of these techniques in the data center industry.
• Highlight other objectives integrated with the energy efficiency problem in the identified studies, to address multi-objective optimization, thereby comprehensively ensuring sustainable and cost-effective operations in modern data centers.
• Explore research gaps, open issues, and future directions to propose a strategic roadmap for advancing the practical deployment of RL/DRL techniques in optimizing data center energy efficiency.

Through the above-mentioned objectives, this review aims to contribute a structured synthesis of RL/DRL applications for data center energy efficiency, identify persistent challenges, and chart a course for future research to address existing limitations and enhance the practical utility of RL/DRL techniques in data centers.

The remainder of this paper is organized as follows. Section 2 compares previous related reviews with this study. Section 3 provides a comprehensive background on RL/DRL algorithms. Section 4 outlines the research methodology. Section 5 explores the relevant literature in detail. Section 6 offers an overview of additional objectives combined with energy efficiency. Section 7 discusses the identified research gaps and open challenges, and suggests future directions. Finally, Section 8 concludes this review.

2. Related reviews

Several existing reviews focus on the energy efficiency of the data center cooling system as a key objective. Chang et al. [17] explore cooling system optimization strategies in data centers by utilizing bibliometric methods. Their review examines the utilization of RL as a cooling control strategy for energy efficiency applications. Additionally, Shaqour et al. [18] investigate the literature on using DRL algorithms for HVAC energy management in data centers, which are considered a subgroup of smart buildings.

In contrast, other reviews target the energy efficiency of ICT systems in data centers. Gari et al. [19] evaluate the effectiveness of RL algorithms for data center scaling and scheduling purposes in the literature, while partially addressing energy consumption as an optimization objective. Magotra et al. [20] provide a comprehensive overview of using VM consolidation to enhance data center energy efficiency. This review surveys the research problem based on architecture and VM consolidation steps. Zhou et al. [21] present DRL-based approaches for resource scheduling in the cloud, highlighting their advantages, challenges, and future directions. Recently, Hou et al. [22] provided a specialized review on leveraging DRL algorithms for energy-efficient task scheduling in cloud computing. This study conducts an in-depth investigation of the Markov Decision Process (MDP) model components. Singh et al. [23] summarize previous empirical studies on multiple objectives in ICT systems, such as task scheduling and VM consolidation, to enhance energy efficiency while maintaining system performance.

Furthermore, other reviews combine cooling and ICT systems as the core topic of their review. Lin et al. [24] explore previous efforts to achieve green-aware data centers from five different perspectives: workload management, virtual resource management, energy management, thermal management, and waste heat recovery. Long et al.
[25] outline performance evaluation metrics for data center energy efficiency through ICT systems and infrastructure, including cooling and power supply systems. Conversely, Zhang et al. [26] address the joint optimization of cooling and ICT systems to achieve effective data center management under a set of evaluation metrics, including thermal conditions, energy consumption, and response delay.

Although these reviews address energy efficiency objectives in data centers based on RL/DRL algorithms from different perspectives, there remains a gap in the existing literature due to the absence of a systematic overview of RL/DRL applications for improving the energy efficiency of data center systems. Additionally, there appears to be a lack of research addressing joint optimization using RL/DRL for energy efficiency objectives. Moreover, previous reviews do not sufficiently discuss experimental setups, including data sources and types used, and the implementation platforms. Our research introduces a systematic literature review that examines the use of RL/DRL for energy efficiency objectives across the main data center systems: cooling and ICT systems. We aim to explore recent advancements in this field to gain deeper insights, identify research gaps, and suggest future directions. Table 1 summarizes and compares related reviews and our work, emphasizing how our study differs from previous research.

Table 1
Related reviews on DC energy efficiency, and comparison with our review. Columns, left to right: General focus (Data center; Energy efficiency; RL/DRL approaches) | System specific (Cooling system; ICT system; Joint optimization) | Review outcomes (Energy reporting; Algorithm comparisons; Benchmark comparisons; Experimental setup).
[17]: ● ● ● | ● × × | ● ◑ × ×
[18]: ● ● ● | ● × × | ● ◑ × ×
[19]: ● ◑ ● | × ◑ × | ◑ ● × ×
[20]: ● ● ◑ | × ◑ × | ◑ ◑ ◑ ●
[21]: ● ◑ ● | × ◑ × | ◑ ● ● ●
[22]: ● ● ● | × ◑ × | ● ● ● ●
[23]: ● ◑ × | × ◑ × | ◑ × × ×
[24]: ● ◑ ◑ | ◑ ◑ × | ◑ ● × ×
[25]: ● ● × | ◑ ◑ × | ◑ ● × ×
[26]: ● ◑ ◑ | × × ● | ● ● × ◑
Current review: ● ● ● | ● ● ● | ● ● ● ●
● – Topic addressed in detail/self-contained, ◑ – Topic partially addressed (i.e., not self-contained; requires additional reading for deep understanding), × – Topic not addressed.

3. Overview of RL/DRL algorithms

Reinforcement learning (RL) stands out as a machine learning technique developed by the computational intelligence community. It is inspired by natural learning mechanisms, in which organisms adjust their future behavior based on feedback from interactions with the environment. Fundamentally, RL is a closed-loop approach aimed at maximizing the cumulative reward, allowing the decision-maker or agent to learn and adapt over time. However, the actions taken by the learning agent influence its future inputs. The RL algorithm establishes an interactive relationship with the dynamic environment, allowing the agent to perform actions, observe the states of the environment, and receive feedback in the form of rewards and punishments. In most practical cases, the agent's actions may not only influence the immediate reward but also shape the ultimate reward. In this closed-loop learning approach, the absence of explicit instructions for taking actions and the uncertainty of future consequences are the key features of RL. These characteristics position RL algorithms as an integration of adaptive and optimal control techniques [27,28]. Fig. 2 illustrates the general framework of RL algorithms.
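Before this loop is formalized as an MDP in the next paragraphs, the short Python sketch below shows the closed-loop agent–environment interaction of Fig. 2 in code. The toy environment, its dynamics, and the random placeholder policy are invented purely for illustration and do not correspond to any data center model from the reviewed studies.

```python
import random

class ToyEnvironment:
    """Minimal stand-in for a dynamic environment with a handful of states."""
    def __init__(self, n_states=5):
        self.n_states = n_states
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # Toy dynamics: the action nudges the state up or down, with some noise.
        self.state = max(0, min(self.n_states - 1,
                                self.state + action + random.choice([-1, 0, 1])))
        reward = -abs(self.state - 2)          # reward peaks when the state is near 2
        done = self.state == self.n_states - 1
        return self.state, reward, done

def random_policy(state, actions=(-1, 0, 1)):
    """Placeholder policy: picks an action uniformly at random."""
    return random.choice(actions)

env = ToyEnvironment()
state, total_reward = env.reset(), 0.0
for t in range(20):                            # one finite episode
    action = random_policy(state)              # agent selects an action
    state, reward, done = env.step(action)     # environment returns next state and reward
    total_reward += reward                     # accumulate the (undiscounted) return
    if done:
        break
print(f"episode return: {total_reward:.1f}")
```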
Let us consider a typical reinforcement learning scenario within a fully observable, stationary, stochastic environment, where the agent interacts with the environment by fully and accurately observing the current state. At each discrete time step, the agent selects an action based only on the current state to maximize the cumulative reward over time. The representation of this scenario is given by:

Fig. 2. RL framework.

• States (S): The set of all possible states of the environment that the agent can observe.
S = \{s_1, s_2, \ldots, s_n\}    (1)
• Actions (A): The set of all available actions that the agent can take in a given state.
A = \{a_1, a_2, \ldots, a_n\}    (2)
• Transition probabilities (P): The probability of moving to a future state s' given the current state s and action a, which may differ over time due to dynamic changes.
P_t(s' \mid s, a) = \Pr(S_{t+1} = s' \mid S_t = s, A_t = a)    (3)
• Reward function (R): The immediate reward that the agent receives when taking action a in state s at time t, which may differ over time due to dynamic changes.
R_t(s, a) = \mathbb{E}(\text{reward} \mid S_t = s, A_t = a)    (4)
• Policy function (\pi): This function determines the agent's future behavior by defining the probability of taking action a in state s at time t, which may differ over time due to dynamic changes.
\pi_t(s, a) = \Pr(A_t = a \mid S_t = s)    (5)
• Discount factor (\gamma): It determines the weight of future rewards compared to immediate rewards.
0 \leq \gamma \leq 1    (6)
When the value of the discount factor is close to 0, the RL agent focuses on the immediate reward, while a value close to 1 makes the RL agent focus on future rewards.
• Objective (cumulative reward): The ultimate goal of the RL agent is to identify the trajectories that maximize the expected discounted reward:
G_t = R_{t+1} + \gamma R_{t+2} + \gamma^{2} R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1}    (7)

The tuple \{S, A, P, R, \gamma\} formulates the Markov decision process (MDP) representation for the proposed stationary stochastic environment. In the MDP framework, at each time step t, the agent interacts with the environment by observing the current state s_t \in S and choosing the action a_t \in A according to the policy function \pi_t(s_t, a_t), while estimating the probability of transitioning to a specific next state or taking a specific action using the transition probability model P_t(s' \mid s, a). After taking the action, the agent obtains a reward r_t \in R and transitions to the next state. The aim of reinforcement learning is to design the agent's learning process to find the optimal policy that maximizes the expected cumulative reward over time, G_t, considering the environment dynamics defined by the MDP [29–31]. However, the aforementioned process is not trivial. This challenge can be addressed recursively by introducing the state value function (V-function):
V^{\pi}(s) = \mathbb{E}_{\tau \sim \pi}\left[ \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1} \right] = \sum_{a} \pi(a \mid s) \sum_{s'} P(s' \mid s, a)\left[ R(s, a) + \gamma V^{\pi}(s') \right] = \mathbb{E}_{\pi}\left[ R_{t+1} + \gamma V^{\pi}(S_{t+1}) \mid S_t = s \right]    (8)
where \tau: (s_0, a_0, s_1, a_1, \ldots, a_{t-1}, s_t) represents the interaction trajectory of the RL agent. Similarly, the expected return of taking a specific action a in a given state s while following the policy \pi is given by the state-action value function (Q-function):
Q^{\pi}(s, a) = \mathbb{E}_{\pi}\left[ R_{t+1} + \gamma Q^{\pi}(S_{t+1}, A_{t+1}) \mid S_t = s, A_t = a \right]    (9)
Eqs. (8) and (9) are referred to as the Bellman equations [32], which are considered the fundamental formulas for tackling the decision-making process of an RL agent.
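As a concrete illustration of the V-function and the Bellman expectation equation (8), the short sketch below performs iterative policy evaluation on a tiny, made-up MDP. The transition probabilities, rewards, and uniform policy are illustrative assumptions only and are not taken from any reviewed study.

```python
import numpy as np

# Toy MDP invented for illustration: 3 states, 2 actions.
# P[s, a, s'] is the transition probability, R[s, a] the expected immediate reward.
n_states, n_actions, gamma = 3, 2, 0.9
P = np.zeros((n_states, n_actions, n_states))
P[0, 0] = [0.8, 0.2, 0.0]; P[0, 1] = [0.1, 0.9, 0.0]
P[1, 0] = [0.0, 0.6, 0.4]; P[1, 1] = [0.0, 0.1, 0.9]
P[2, 0] = [0.0, 0.0, 1.0]; P[2, 1] = [0.0, 0.0, 1.0]
R = np.array([[0.0, -1.0], [0.0, -1.0], [5.0, 5.0]])
pi = np.full((n_states, n_actions), 0.5)          # fixed uniform policy pi(a|s)

# Iterative policy evaluation: repeatedly apply the Bellman expectation backup of Eq. (8).
V = np.zeros(n_states)
for _ in range(500):
    V_new = np.zeros(n_states)
    for s in range(n_states):
        for a in range(n_actions):
            V_new[s] += pi[s, a] * (R[s, a] + gamma * P[s, a] @ V)
    if np.max(np.abs(V_new - V)) < 1e-8:          # stop once the values have converged
        V = V_new
        break
    V = V_new
print("V^pi =", np.round(V, 3))
```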
The optimal V-function and Q-function are indicated by the maximum value across all states, V^{*}(s) = \max_{\pi} V^{\pi}(s), or across all state–action pairs, Q^{*}(s, a) = \max_{\pi} Q^{\pi}(s, a). In all MDP cases, at least one optimal policy always exists, and the value functions V(s) and Q(s, a) of all optimal policies are the same. As a result, optimizing the Q-function yields the optimal policy of the MDP:
\pi^{*}(a \mid s) = \begin{cases} 1 & \text{if } a = \arg\max_{a \in A} Q^{*}(s, a) \\ 0 & \text{otherwise} \end{cases}    (10)

To obtain a solution to the MDP problem using RL techniques, two main categories of methods are used. Model-free RL algorithms allow an agent to learn a policy purely from interactions with the environment, without explicitly constructing a model of the environment's dynamics. The other category is called model-based RL algorithms and leverages a model of the environment, which can be given or learned. This model typically includes the transition probability function (3) and the reward function (4), allowing the agent to plan actions before execution [33].

Value-based algorithms are among the most popular model-free RL methods, where the agent estimates state-action values and represents them as a table (referred to as a Q-table or policy table) to optimize its decision-making. The most well-known value-based algorithms used for smaller MDP problems are tabular methods: Q-learning, in which the agent updates the table based on the maximum possible future reward (off-policy learning), making it more exploratory [34], and state-action-reward-state-action (SARSA) [35], where the agent updates the Q-table according to the actual action taken (on-policy learning), leading to more conservative behavior.

On the other hand, model-based RL leverages a model of the environment to update the Q-table of state-action pairs. This approach can be classified into two main categories based on how the environment model is acquired. In the first category, the agent learns the model through its interactions with the environment, as in the Dynamic Q-learning (Dyna-Q) algorithm [36]. In the second category, the model is provided to the agent, as seen in Monte Carlo Tree Search (MCTS) [37].

However, RL algorithms face scalability limitations when applied in large-scale learning environments. They often struggle with an extensive state space and continuous action space, leading to inefficiencies in the exploration–exploitation trade-off, slow convergence, and difficulties in learning optimal policies. To address the limitations of traditional Reinforcement Learning (RL) methods, the computational intelligence community has developed Deep Reinforcement Learning (DRL), which integrates advancements in deep neural networks. In DRL algorithms, deep learning techniques are employed to construct at least one of the following agent components: the value functions (8), (9), the policy function (5), the transition model (3), and the reward function (4). Such representations are essential when the RL agent interacts with environments characterized by a high-dimensional state space and a continuous action space. DRL is a powerful tool for achieving an end-to-end goal-directed learning process [38,39]. Figs. 3 and 4 present a comprehensive classification of the most popular RL/DRL algorithms based on their respective model types.
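To make the tabular value-based methods described above concrete, the sketch below runs Q-learning with an epsilon-greedy behavior policy on a toy chain environment. The environment, learning rate, and other hyperparameters are invented for the example; the commented line shows how the update target would change for on-policy SARSA.

```python
import random

# Toy deterministic chain environment invented for illustration:
# states 0..4, actions 0 (left) / 1 (right); reaching state 4 pays +10 and ends the episode.
def step(state, action):
    next_state = min(4, state + 1) if action == 1 else max(0, state - 1)
    reward = 10.0 if next_state == 4 else -1.0
    return next_state, reward, next_state == 4

alpha, gamma, epsilon = 0.1, 0.9, 0.1
Q = {(s, a): 0.0 for s in range(5) for a in (0, 1)}    # tabular Q-values

def epsilon_greedy(state):
    if random.random() < epsilon:                      # explore
        return random.choice((0, 1))
    return max((0, 1), key=lambda a: Q[(state, a)])    # exploit

for episode in range(500):
    state, done = 0, False
    while not done:
        action = epsilon_greedy(state)
        next_state, reward, done = step(state, action)
        # Q-learning (off-policy): bootstrap from the greedy value of the next state.
        target = reward + (0.0 if done else gamma * max(Q[(next_state, a)] for a in (0, 1)))
        # SARSA (on-policy) would instead use the action actually selected next:
        #   target = reward + gamma * Q[(next_state, epsilon_greedy(next_state))]
        Q[(state, action)] += alpha * (target - Q[(state, action)])
        state = next_state

print({s: round(max(Q[(s, a)] for a in (0, 1)), 2) for s in range(5)})
```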
Another crucial aspect of RL/DRL algorithms is the type of policy used during the training process. The focus here is to determine whether the behavior policy – defined as the policy interacting with the environment to collect training data – and the target policy – which represents the final policy that the agent is aiming to learn – are identical. On-policy methods utilize the collected data directly for the next round of policy optimization, meaning that the behavior and target policies are the same. However, in off-policy methods, the generated training data is stored in a buffer during the interaction with the environment. Then, during training, this stored data – which may have been gathered from previous policies – is used to improve the target policy. In this case, the behavior policy is not the same as the target policy. The advantages of on-policy methods include greater stability and faster convergence, a balanced exploration–exploitation rate, and ease of implementation, while off-policy methods offer better performance in complex environments and greater adaptability to changing policies.

Fig. 3. RL/DRL model-free algorithms (value-based, policy-based, and actor-critic families).

Fig. 4. RL/DRL model-based algorithms (with the environment model either learned or given).

Finally, RL/DRL algorithms are used to solve a wide range of optimization problems, from playing simple computer games to controlling highly complex, large-scale configurations such as transportation networks and energy systems [40–42]. Both RL and DRL offer the advantage of real-time adaptability and dynamic responsiveness compared to traditional control methods. However, without prior knowledge about the studied environment, they may encounter slow convergence and failures during the initial phases of operation [43,44].
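The distinction between behavior and target policies becomes tangible in the replay buffer used by off-policy methods such as DQN, DDPG, SAC, and TD3. The sketch below is a generic illustration; the buffer capacity, batch size, and dummy transitions are arbitrary choices for the example and are not taken from any reviewed study. An on-policy method such as PPO would instead discard the collected batch after each policy update rather than reusing old transitions.

```python
import random
from collections import deque

class ReplayBuffer:
    """Minimal replay buffer of the kind used by off-policy DRL methods."""
    def __init__(self, capacity=10_000):
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        # Transitions collected by the *behavior* policy are stored for later reuse.
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        # The *target* policy is trained on transitions that may come from older policies.
        return random.sample(self.buffer, min(batch_size, len(self.buffer)))

buffer = ReplayBuffer()
for t in range(1000):                       # dummy interaction loop filling the buffer
    buffer.add(state=t % 5, action=random.choice((0, 1)), reward=-1.0,
               next_state=(t + 1) % 5, done=False)
batch = buffer.sample()
print(len(batch), "transitions sampled for one off-policy update")
```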
4. Materials and methods

The methodology of this review was structured following the PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) framework to ensure transparency, rigor, and reproducibility [45].

4.1. Research questions

The main aim of this review is to synthesize recent advancements in RL/DRL techniques for improving energy efficiency in data centers. To provide a comprehensive understanding of this topic, this study focuses on answering the following research questions based on the identified papers.
• RQ1: What data center subsystems (e.g., cooling, ICT equipment, power supply) are targeted by the RL/DRL algorithms?
• RQ2: Which RL/DRL algorithms are utilized for energy optimization in data centers?
• RQ3: What experimental setups and dataset sources (e.g., real-world deployments or simulations) are commonly used?
• RQ4: What specific research problems are addressed using RL/DRL algorithms?
• RQ5: What are the primary objectives addressed in the identified studies?
• RQ6: What benchmarks are used to evaluate the achieved results in terms of energy efficiency?
• RQ7: What tools, frameworks, or platforms are employed to implement RL/DRL algorithms in this context?
• RQ8: What metrics are used to measure and report the effectiveness of RL/DRL algorithms in improving energy efficiency?

4.2. Search strategy

4.2.1. Literature resources
To ensure that all recent and relevant studies are covered, the search was carried out in five major and well-established academic databases, known for their extensive repositories of peer-reviewed studies in computer science, engineering, and energy systems. Given that the scope of this review is relatively new, the covered time frame is limited to publications from 2019 to August 2024. To maintain high quality and credibility, only peer-reviewed journal articles from the databases listed below were selected.
• IEEE Xplore
• Scopus
• ScienceDirect
• Web of Science
• ACM Digital Library

4.2.2. Search terms (keywords)
To ensure the high quality of this study, search queries were systematically designed using Boolean operators and keywords relevant to RL/DRL and energy efficiency in data centers. A representative search string was: ("data center" OR "data centers") AND ("energy-aware" OR "energy utilization" OR "energy saving" OR "energy efficiency") AND ("reinforcement learning" OR "RL"). Fig. 5 shows the search strategy used in this study.

4.3. Search process and selection criteria

To ensure the relevance and quality of the included studies, the PRISMA framework guided the article identification process, which involved four distinct stages:
1. Identification: Studies were retrieved using search queries across the selected databases.
2. Screening: The titles and abstracts were screened to eliminate irrelevant studies and duplicates.
3. Eligibility: Full-text articles were reviewed against the inclusion and exclusion criteria.
4. Inclusion: The final set of studies that met all quality assessment criteria was selected for detailed analysis.
A PRISMA flow diagram (Fig. 6) illustrates the selection process, documenting the number of studies identified, screened, excluded, and included.

Fig. 5. Search strategy to get relevant papers.

Fig. 6. Systematic literature review process stages: removal of duplicates, removal based on abstract and keywords, removal based on inclusion and exclusion criteria, addition of new articles found in the references, and finally removal of those that did not match the quality criteria (164 records identified in total; 65 studies finally selected).

4.3.1. Inclusion and exclusion criteria
• Inclusion criteria: To ensure the inclusion of high-quality and relevant studies, the following criteria were applied:
– Only peer-reviewed journal articles published between 2019 and August 2024.
– Studies explicitly applying RL/DRL algorithms for energy efficiency in data center environments.
– Studies presenting measurable outcomes, such as increased energy savings or improved Power Usage Effectiveness (PUE).
– Studies focusing on specific or joint subsystems (e.g., cooling systems, ICT equipment, and/or power supply).
– Only the most recent version of a study was included when duplicate publications were identified. • Exclusion criteria: To facilitate the filtering of irrelevant studies, the following criteria were used: – Non-peer-reviewed studies, including conference papers, review articles, and opinion pieces. – Studies not addressing RL/DRL-based methods for energy opti - mization in data center environment. – Studies lacking empirical evidence or quantitative metrics. – Studies without full-text availability, making it impossible to assess the study’s relevance and quality. – Studies focused on very small-scale experimental setups, as they lack applicability to real-world data center environments. 4.3.2. Quality assessment criteria and rating system To ensure the final selection of identified articles are robust and reliable, a rigorous and systematic quality assessment process was implemented, based on the clearly defined criteria listed below: • Clear and comprehensive documentation of the RL/DRL methods utilized, ensuring transparency in their implementation. • Explicit definition and justification of the targeted subsystem’s rele - vance within the study. • Logical coherence in identifying the research problem and aligning it with the stated objectives. • Methodological rigor in the design of experimental setups, including appropriate baseline comparisons and validation techniques. • Implementation of well-defined metrics to assess energy efficiency, such as increased energy savings or improvements in Power Usage Effectiveness (PUE). • Thorough comparative analysis of RL/DRL techniques against al - ternative benchmark methods to highlight their effectiveness and advantages. Only studies that achieved a perfect score of 6 out of 6 on these criteria were included in the final synthesis. 4.4. Data extraction and synthesis A comprehensive data extraction and synthesis template was com - pleted for each identified study to ensure that all selected studies addressed the review’s research questions. The extracted data were or - ganized into a synthesis card and stored in an Excel file for further use throughout the systematic review stages. Table 2 summarizes the data extraction and synthesis card used to gather the necessary information from the identified studies. To present the findings of this review, visual representations, such as pie charts, bar charts, and Venn diagrams, were created. Additionally, tables were utilized to systematically summarize and provide a detailed analysis of each identified study. This systematic approach provides a clear and structured framework for synthesizing and interpreting the collected data, while also highlighting research gaps, addressing challenges, and identifying future directions [46]. 4.5. Threats to validity The following threats to validity were acknowledged: 1. Publication bias: The focus on peer-reviewed journals may ex - clude innovative but unpublished studies. 2. Database coverage: Relevant articles from less-accessible databases or gray literature might have been missed. 3. Variability in reporting: Differences in methodologies and re - porting standards across studies could limit comparability. Applied Energy 389 (2025) 125734 8 H. Kahil, S. Sharma, P. Välisuo et al. Table 2 Data extraction template. 
Category Unique Identifier (ID) Study Title Authors Names Publication Venue Publication Year DC Subsystem Applications (RQ1) RL/DRL Algorithm Type (RQ2) Experimental Setup (RQ3) Research Problems (RQ4) Main Objectives (RQ5) Benchmark Algorithms (RQ6) Platforms and Frameworks (RQ7) Energy Efficiency Outcomes (RQ8) MDP Elements in Joint Optimization Studies Abstract Keywords Other Performance Metrics To mitigate these threats, standardized inclusion criteria were applied, and article selection and data extraction were independently verified by multiple reviewers. 5. Results and discussions In this section, we discuss and present the findings of this review. First, we summarize the fundamental details of each identified study, including the study title, authors names, publication venue, and publi - cation year. These details facilitated the systematic organization of this review, with each study assigned a unique identifier (ID) for easy refer - ence during the data analysis and extraction process. Next, we provide a comprehensive analysis, highlighting key perspectives such as the stud - ied subsystems, the RL/DRL algorithms applied, and the types of models utilized, offering valuable insights into the state-of-the-art. Then, we conduct a deeper synthesis, classifying the studies based on the sub - systems they targeted. This categorization helped obtain quantitative and qualitative data to address the research questions for each subsys - tem. We focus our discussion on more detailed and specific information regarding the research problems, study objectives, and experimental setup, benchmark comparisons, the platforms used, and energy-related outcomes. Finally, we summarize the construction of Markov Decision Process (MDP) elements in joint optimization studies. Additionally, we reference related works to further support and contextualize the purpose and findings of this review. 5.1. Overview of the final identified studies In this review, we identify 65 journal articles that apply RL/DRL algorithms to improve the energy efficiency of at least one major data center system. The publication venues and years of these articles are summarized in Table 3. Given that the research topic of this review is relatively new, all selected studies were published between 2020 and 2024, as shown in Fig. 7. Taking a broader look at the selected studies reveals that over 60 % focus entirely on the ICT system, exploring opportunities to en - hance energy efficiency by leveraging RL/DRL algorithms from various perspectives. In contrast, approximately 21 % of the papers focus exclu - sively on the data center cooling system. Furthermore, the remaining studies examine combinations of multiple data center systems. Fig. 8 provides a detailed overview of the specific systems addressed in each selected paper. In the following paragraphs, we will explore the RL/DRL algorithms used in the selected studies of this review. 
For the cooling system: Since the cooling system of data centers is characterized by a high-dimensional state space and a continuous ac - tion space MDP, all selected studies employed DRL methods, primarily focusing on model-free algorithms, including: • Soft Actor-Critic Algorithm (SAC) [66,87] • Deep Deterministic Policy Gradient (DDPG) [58,70] • Twin Delayed DDPG (TD3) [93] • Proximal Policy Optimization (PPO) [60] • Trust Region Policy Optimization (TRPO) [93] • Deep Q-Network (DQN) [69,75,101,104] However, two studies used model-based algorithms: Model-Based Actor-Critic (MBAC) [49] to propose a safe cooling mode adhering to strict thermal constraints, and Probabilistic Ensembles with Trajectory Sampling (PETS) [102], in which the study makes a comparison be - tween four different algorithms: two model-free off-policy algorithms: A DQN variant called Branching Dueling Q-Network (BDQ) and SAC, one model-free on-policy algorithm (PPO), and one model-based algorithm (PETS). For ICT system: Due to the discrete nature of certain ICT processes, such as task scheduling and resource allocation, the Q-learning algo - rithm has been employed in multiple studies to handle the ICT MDP environment [62,79,105]. This approach allows Q-values to be updated independently from the action selection and execution, enabling the algorithm to capture delayed feedback more accurately. As a result, this method enhances the learning rate and accelerates the convergence process. Alternatively, DQN is commonly proposed for handling more complex ICT systems, as reported in [50,54,81]. However, other DRL algorithms are also used, such as: • Actor-Critic (AC) [72,107] • Soft Actor-Critic Algorithm (SAC) [55,65] • Proximal Policy Optimization (PPO) [67,103] • Asynchronous Actor-Critic Agents (A3C) [76] • Deep Deterministic Policy Gradient (DDPG) [86] For combining systems studies: As the complexity of the MDP prob - lem increases when multiple systems are present, with a combination of discrete and continuous state spaces, along with high-dimensional action spaces, traditional RL approaches become less effective. In re - sponse to these challenges, all selected studies addressing the integration of multiple data center systems employed DRL algorithms. Notable DRL algorithms used in these studies include: • Actor-Critic Algorithm (AC) [47] • Soft Actor-Critic Algorithm (SAC) [48] • Deep Q-Network (DQN) and its extensions [52,61,80,91,98] • Deep Deterministic Policy Gradient (DDPG) [91,92] Fig. 9 illustrates the distribution of various RL/DRL algorithms in the selected studies. Q-learning and DQN were the most frequently cited algorithms, appearing in 60 % of studies, followed by SAC (eight stud - ies), PPO (four studies), DDPG (four studies), and AC/A3C (four studies). About 9 % of studies employed other algorithms. Table 4 categorizes the algorithms implemented in the selected studies based on the utilized model type. According to Figs. 3 and 4, nearly 98 % of the algorithms employed are model-free, divided into three main groups: value-based algorithms, policy-based algorithms, and actor-critic algorithms. Only two studies utilized model-based al - gorithms, likely due to the complexity involved in accurately modeling a data center system. Some studies used more than one RL/DRL method, causing them to appear in multiple categories in the table. The following sections will provide a detailed analysis of these algorithms and their applications. Applied Energy 389 (2025) 125734 9 H. Kahil, S. Sharma, P. Välisuo et al. 
Table 3 The selected studies. ID Authors Publication venue DC application (RQ1) Year S1 Jayanetti et al. IEEE Transactions on Parallel and Distributed Systems Integrating power supply and ICT systems 2024 S2 Biemann et al. IEEE Internet of Things Journal Integrating cooling and power supply sys- tems 2023 S3 Wan et al. IEEE Transactions on Emerging Topics in Computational Intelligence Cooling system 2023 S4 Lou et al. IEEE Transactions on Network and Service Management ICT system 2023 S5 Ran et al. IEEE Transactions on Services Computing Integrating cooling and ICT systems 2023 S6 Ran et al. IEEE Transactions on Services Computing Integrating cooling and ICT systems 2023 S7 Zeng et al. IEEE Transactions on Parallel and Distributed Systems ICT system 2022 S8 Kang et al. IEEE Transactions on Network and Service Management ICT system 2022 S9 Pham et al. IEEE Access ICT system 2021 S10 Yi et al. IEEE Transactions on Parallel and Distributed Systems ICT system ICT system 2020 S11 Ding et al. IEEE Access 2020 S12 Li et al. IEEE Transactions on Cybernetics Cooling system 2020 S13 Cheng et al. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems ICT system 2020 S14 Leindals et al. Energy and AI Cooling system 2024 S15 Zhao et al. IEEE Transactions on Sustainable Computing Integrating power supply and ICT systems 2024 S16 Ghasemi et al. Cluster Computing ICT system 2024 S17 Ghasemi et al. Computing ICT system 2024 S18 Bhatt et al. International Journal of Advanced Computer Science and Applications ICT system 2024 S19 Zhang et al. IEEE Transactions on Network and Service Management ICT system 2024 S20 Guo et al. Applied Energy Cooling system 2024 S21 Yang et al. Bouaouda et al. Journal of Supercomputing ICT system 2024 S22 Sustainability ICT system 2024 S23 Chen et al. Measurement and Control Cooling system 2024 S24 Wang et al. ACM Transactions on Cyber-Physical Systems Cooling system 2024 S25 Aghasi et al. Computer Networks Integrating cooling and ICT systems 2023 S26 Wang et al. Journal of Cloud Computing ICT system 2023 S27 Wang et al. Computer Networks ICT system 2023 S28 Ghasemi et al. Cluster Computing ICT system 2023 S29 Huang et al. Energies Cooling system 2023 S30 Wei et al. Journal of King Saud University – Computer and Information Sciences ICT system 2023 S31 Liu et al. Applied Energy ICT system 2023 S32 Ahamed et al. Sensors ICT system 2023 S33 Ma et al. IEEE Transactions on Industrial Informatics ICT system 2023 S34 Simin et al. Journal of Intelligent and Fuzzy Systems Integrating cooling and ICT systems ICT system 2023 S35 Nagarajan et al. Expert Systems 2023 S36 Yang et al. KSII Transactions on Internet and Information Systems ICT system 2022 S37 Pandey et al. Mobile Information Systems ICT system 2022 S38 Shaw et al. Information Systems ICT system 2022 S39 Yan et al. Computers and Electrical Engineering ICT system 2022 S40 Wang et al. Computer Networks ICT system 2022 S41 Mahbod et al. Applied Energy Cooling system 2022 S42 Abbas et al. Physical Communication ICT system 2022 S43 Uma et al. Transactions on Emerging Telecommunications Technologies ICT system 2022 S44 Wang et al. Future Generation Computer Systems ICT system 2021 S45 Zhou et al. IEEE Network Integrating cooling and ICT systems 2021 S46 Chi et al. Energies Integrating cooling and ICT systems 2021 S47 Biemann et al. Applied Energy Cooling system 2021 S48 Ding et al. Future Generation Computer Systems ICT system 2020 S49 Peng et al. Cluster Computing ICT system 2020 S50 Hu et al. 
Electronics Integrating power supply and ICT systems 2020 S51 Qin et al. Applied Intelligence ICT system 2020 S52 Yang et al. Journal of Building Engineering Integrating cooling, ICT, and power supply systems 2024 S53 Lin et al. IEEE Access ICT system 2020 S54 Caviglione et al. Soft Computing ICT system 2021 S55 Le et al. ACM Transactions on Sensor Networks Cooling system 2021 S56 Zhang et al. Applied Energy Cooling system 2023 S57 Li et al. CCF Transactions on High Performance Computing ICT system 2021 S58 Wan et al. IEEE Intelligent Systems Cooling system 2021 S59 Haghshenas et al. IEEE Transactions on Services Computing ICT system 2022 S60 Zhang et al. IEEE Transactions on Cybernetics Cooling system 2024 S61 Sun et al. Computer Networks ICT system 2020 S62 Asghari et al. Computer Networks ICT system 2020 S63 Siddesha et al. Cluster Computing ICT system 2022 S64 Asghari et al. Soft Computing ICT system 2020 S65 Zhang et al. Expert Systems with Applications Cooling system 2023 Ref [47] [48] [49] [50] [51] [52] [53] [54] [55] [56] [57] [58] [59] [60] [61] [62] [63] [64] [65] [66] [67] [68] [69] [70] [71] [72] [73] [74] [75] [76] [77] [78] [79] [80] [81] [82] [83] [84] [85] [86] [87] [88] [89] [90] [91] [92] [93] [94] [95] [96] [97] [98] [99] [100] [101] [102] [103] [104] [105] [106] [107] [108] [109] [110] [111]

Fig. 7. Publication year distribution of selected studies (2020: 12, 2021: 9, 2022: 12, 2023: 18, 2024: 14).

Fig. 8. The sub-systems focused on in the selected studies: 40 studies focus on ICT optimization, 14 on cooling optimization, and 11 are joint studies integrating multiple systems, including the power supply system.

Fig. 9. Distribution of algorithms utilized in this review (DQN 33.8 %, Q-learning 26.2 %, SAC 12.3 %, PPO 6.2 %, DDPG 6.2 %, AC/A3C 6.2 %, other methods 9.1 %).

5.2. Comparison of RL/DRL algorithms applied to cooling systems

Cooling systems account for approximately 40 % of energy consumption in data centers [112]. Reducing the energy consumption of this non-ICT support system will improve the power usage effectiveness (PUE) of the data center. Furthermore, optimizing the operation of the cooling systems can significantly influence the thermal conditions and cooling flow of ICT devices, leading to further reductions in the total energy consumption of the entire data center [113]. In this section, we will analyze the selected articles that use RL/DRL techniques to optimize the operation of the cooling system in the data center with the aim of reducing energy consumption. The following analysis not only provides an overview of how RL/DRL methods are applied to data center cooling systems, but also investigates the specific aspects of each selected study in detail. This includes the formulation of the research problem and objectives, the energy-related outcomes, the benchmark comparisons, and the experimental setup.

5.2.1. The research problem and objective formulation

As illustrated in Fig. 8, 14 papers discussing cooling systems were identified. These studies focused on two main research problems (RQ4), each with different objectives (RQ5). The dominant category of the research problem focuses on optimizing cooling system operations in various scenarios to improve energy efficiency by utilizing various RL/DRL approaches.
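Across both research-problem categories, the optimization objective is usually encoded in the MDP reward as a trade-off between cooling energy and thermal-constraint violations. The following sketch is a hypothetical example of such a reward function; the variable names, temperature limit, and penalty weight are illustrative assumptions rather than a formulation taken from any specific study.

```python
def cooling_reward(cooling_power_kw, rack_inlet_temps_c,
                   t_max_c=27.0, penalty_weight=10.0):
    """Hypothetical reward: negative cooling power minus a penalty for
    rack inlet temperatures that exceed the allowed threshold."""
    overshoot = sum(max(0.0, t - t_max_c) for t in rack_inlet_temps_c)
    return -cooling_power_kw - penalty_weight * overshoot

# Example: a colder setpoint uses more power but avoids thermal violations.
print(cooling_reward(55.0, [24.5, 25.1, 26.0]))   # -55.0 (no violation)
print(cooling_reward(42.0, [27.8, 28.4, 26.5]))   # -42.0 - 10*(0.8 + 1.4) = -64.0
```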
An interesting configuration involves using DRL to optimize the data center cooling system integrated with an active thermal management framework. For example, [60] explores the balance of aquifer thermal energy storage (ATES) while minimizing the total cost and maintaining the temperature range of the servers using a DRL agent. Similarly, [104] introduces Active Ventilation Tiles (AVTs) controllers to enhance the operation of the rack cooling system, achieving a trade-off between energy consumption and rack supply temperature distribution. An alternative scenario is integrating RL/DRL algorithms with prior physical knowledge to enhance the cooling system's energy efficiency. The study [75] integrates this knowledge by using big data, IoT sensor networks, and a digital twin model with the DRL algorithm. By leveraging historical and real-time data, this approach employs a Long Short-Term Memory (LSTM) network to predict temperatures, enabling the utilization of the DQN algorithm to effectively reduce the energy consumption of the cooling systems. Due to the strong relationship between the energy efficiency of data center cooling systems and the ambient temperatures at their locations, several studies have extensively investigated the efficiency of DRL algorithms in reducing cooling system energy consumption in tropical climates. Specifically, the study [101] focuses on optimizing the supply air temperature and relative humidity in a free-cooled tropical data center under defined boundaries, while [87] explores a single-agent DRL strategy with a floating set point approach to reduce the temperature threshold for tropical data centers based on a whole-building evaluation method. The study [69] proposes a multi-set point approach based on the DQN algorithm (DQN-MSP) to enable precise cooling control of the CRAC unit's air temperature, offering significant improvements in data center cooling energy consumption. Another key research direction in this category of literature emphasizes designing and comparing multiple state-of-the-art DRL algorithms for optimizing energy consumption while maintaining thermal conditions, as demonstrated in studies [66,93,102].

Meanwhile, the second research category shifts attention to the reliability of safety-aware DRL strategies, with the core aim of minimizing the energy consumption of the data center cooling system. These strategies are designed to ensure strict adherence to both soft and hard constraints during the learning and operational phases. In [58], the study develops an end-to-end off-policy DDPG agent to optimize the cooling system using unprocessed and high-dimensional input data directly. Additionally, the study introduces the de-underestimation (DUE) validation mechanism for the critic network to address underestimation of overheating risks. In [106], the study focuses on incorporating residual physics using thermodynamic principles to guide the DRL agent's exploration process by estimating the desirable range of actions, ensuring future action safety. In addition, the study [49] develops safe cooling system operation by utilizing a model-based actor-critic DRL (MBAC) algorithm using two different models: a system transition model to predict the future system state, and a risk model to estimate the negative effects of executing an action.
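A pattern shared by several of these safety-aware designs is that actions proposed by the learning policy are checked and, if necessary, rectified or projected back into a safe region before being applied to the cooling equipment. The sketch below is a simplified, hypothetical illustration of such post-hoc action rectification; the setpoint bounds, the predicted-temperature check, and the back-off step are invented for the example and do not reproduce the mechanism of any particular study.

```python
def rectify_action(proposed_setpoint_c, predicted_max_inlet_c,
                   setpoint_bounds=(16.0, 27.0), inlet_limit_c=27.0, backoff_c=1.0):
    """Clamp the supply-air setpoint to its allowed range and, if the predicted
    rack inlet temperature would violate the thermal limit, lower the setpoint."""
    low, high = setpoint_bounds
    safe = min(max(proposed_setpoint_c, low), high)        # hard box constraint
    if predicted_max_inlet_c > inlet_limit_c:              # predicted soft-constraint violation
        safe = max(low, safe - backoff_c)                  # fall back to more aggressive cooling
    return safe

print(rectify_action(29.0, predicted_max_inlet_c=26.0))    # 27.0 (clamped to the upper bound)
print(rectify_action(24.0, predicted_max_inlet_c=28.3))    # 23.0 (backed off due to predicted risk)
```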
Table 4. Classification of algorithms by model type and study IDs.
Category | Algorithm type (RQ2) | Study IDs
Value-Based Model-Free | DQN | S4, S6, S7, S8, S10, S13, S15, S23, S29, S32, S34, S35, S37, S39, S42, S45, S49, S50, S53, S54, S55, S58, S22
Value-Based Model-Free | Q-learning | S11, S16, S17, S18, S22, S25, S28, S33, S38, S43, S44, S48, S51, S59, S62, S63, S64
Value-Based Model-Free | D3QN | S56
Value-Based Model-Free | PADQN | S5, S27, S45
Value-Based Model-Free | SARSA | S38
Value-Based Model-Free | BDQ | S56
Policy-Based Model-Free | PPO | S14, S21, S36, S57, S60, S65, S47
Policy-Based Model-Free | TRPO | S47
Policy-Based Model-Free | Monte Carlo (REINFORCE) | S31
Actor-Critic (AC) Model-Free | SAC | S2, S9, S19, S20, S41, S56, S60, S65, S47
Actor-Critic (AC) Model-Free | A3C | S30
Actor-Critic (AC) Model-Free | AC | S1, S26, S61, S35
Actor-Critic (AC) Model-Free | DDPG | S12, S24, S40, S46, S35, S45, S60
Actor-Critic (AC) Model-Free | TD3 | S47, S56
Learned Model-Based | PETS | S56
Learned Model-Based | MBAC | S3
Given Model-Based | None identified | –
Meanwhile, the second research category shifts attention to the reliability of safety-aware DRL strategies, with the core aim of minimizing the energy consumption of the data center cooling system. These strategies are designed to ensure strict adherence to both soft and hard constraints during the learning and operational phases. In [58], the study develops an end-to-end off-policy DDPG agent to optimize the cooling system using unprocessed and high-dimensional input data directly. Additionally, the study introduces the de-underestimation (DUE) validation mechanism for the critic network to address underestimation of overheating risks. In [106], the study focuses on incorporating residual physics using thermodynamic principles to guide the DRL agent's exploration process by estimating the desirable range of actions, ensuring future action safety. In addition, the study [49] develops safe cooling system operation by utilizing a model-based actor-critic DRL (MBAC) algorithm using two different models: a system transition model to predict the future system state, and a risk model to estimate the negative effects of executing an action. Furthermore, the paper [70] utilizes offline imitation learning and online post-hoc rectification techniques to develop three different versions of a safety-aware DDPG controller for the data center cooling system. Alternatively, the study [111] leverages techniques like Lagrangian-based constrained DRL (CDRL) and reward shaping to satisfy soft constraints through extensive online learning. Also, within the same study, hard constraints are addressed by a parameterized shielding DRL algorithm (DRL-S), which projects unsafe actions onto safe action spaces. The ultimate goal of these studies in the second category is to design a safe cooling system for data centers, reducing energy consumption while effectively maintaining thermal constraints. The insights from this section are summarized in Table A.8.
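The action-shielding idea mentioned for the hard-constraint studies can be sketched very simply: before a proposed setpoint is executed, it is projected onto a range that a thermal bound predicts to be safe. The linear one-step model and all coefficients below are assumptions for illustration only, not the parameterized shield used in the reviewed study.

```python
# Illustrative sketch of projecting an unsafe action onto a safe action space.
import numpy as np

def shield_setpoint(proposed, t_return, it_load_kw, t_max=27.0,
                    k_mix=0.6, k_load=0.02, lo=16.0, hi=24.0):
    """Assumed one-step model: t_return' = k_mix*setpoint + (1-k_mix)*t_return + k_load*it_load_kw.
    safe_upper is the largest setpoint keeping the predicted return temp below t_max;
    the proposed setpoint is clipped into [lo, min(hi, safe_upper)]."""
    safe_upper = (t_max - (1 - k_mix) * t_return - k_load * it_load_kw) / k_mix
    return float(np.clip(proposed, lo, min(hi, safe_upper)))

# The shield only intervenes when the proposal would violate the bound:
print(shield_setpoint(proposed=23.5, t_return=26.0, it_load_kw=200.0))  # -> 21.0
```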
5.2.2. The energy related outcomes
The primary motivation of this study is to address how the proposed RL/DRL algorithms enhance the energy efficiency of data centers (RQ8). The results related to energy efficiency have been carefully and thoroughly analyzed. Given the diversity of research problems and objectives addressed in the identified cooling system studies, the reporting methods for energy-related outcomes vary significantly. Some studies express the improvements in energy consumption when implementing the RL/DRL algorithm as a percentage reduction in energy consumption, compared to the baseline controller (e.g., DefaultE+) [93,102,111]. In addition, other studies compare the energy saving percentage of their proposed RL/DRL strategies to some other benchmark controllers, including DRL and non-DRL algorithms [49,70,87]. Energy efficiency is also reported in terms of improvements in key data center performance metrics, such as power usage effectiveness (PUE), compared to baseline controllers (e.g., DefaultE+) [58] or state-of-the-art controllers [66], while other studies use the PUE to evaluate the differences in energy consumption before and after applying the proposed RL/DRL algorithms [75]. Other studies focus on energy cost reductions rather than energy consumption savings [60]. Moreover, combining RL/DRL strategies with advanced setups, such as AVT systems [104] and physics-guided DRL with shielding [106], highlights the potential of RL/DRL in performing a trade-off analysis between energy efficiency and system performance. Furthermore, some studies demonstrate energy savings while maintaining thermal constraints, either by increasing the average supply air temperature of the CRAC units [69] or by raising the temperature and relative humidity thresholds [101]. A more detailed analysis of additional objectives combined with energy efficiency will be provided in Section 6. A detailed summary of this section's findings is presented in Table A.8.
5.2.3. The benchmark comparisons
The distribution of benchmark algorithms used in the cooling system studies for energy-related results comparison is illustrated in Fig. 10.
Fig. 10. Number of benchmarks in the literature for the cooling system.
Analyzing the statistical data reveals two distinct groups. The first group involves the use of DRL algorithms, due to their adaptability, as benchmarks for comparison, with PPO being the most widely used, appearing in eight studies. Other prominent DRL algorithms include SAC (used five times), TRPO (used four times), DDPG (used three times), DQN (used three times), and TD3 (used twice). The second group consists of non-DRL algorithms, where the built-in EnergyPlus baseline controller (DefaultE+) was used in five studies, the classical PID controller was used twice, and the optimal model predictive controller (MPC) was used three times. Other DRL and non-DRL algorithms, including those used as benchmarks only once, are also considered. Table 5 outlines the benchmark algorithm comparisons (RQ6) for each selected cooling system study, including both DRL and non-DRL algorithms.
5.2.4. The experimental setup
Among the 14 selected cooling system studies, only one study directly implemented the proposed DRL strategy on a real-world data center [104]. In contrast, the remaining studies tested the designed DRL algorithms in simulated environments, highlighting a gap in direct real-world application and validation. These simulations utilized either real-world datasets, synthetic datasets, or a hybrid approach combining both. The EnergyPlus building energy simulation program [117] emerged as a primary tool for simulating energy consumption in data center cooling systems, often integrated with various Python libraries to implement DRL agents. Other simulation environments utilized include Computational Fluid Dynamics (CFD) simulators such as OpenFOAM [118] and 6SigmaRoom [119], which offer detailed modeling of airflow and thermal dynamics. Furthermore, MATLAB, along with its advanced toolboxes like Simulink and Simscape, was frequently employed to simulate the operational processes of data center cooling systems, providing a robust platform for evaluating control strategies and optimizing system performance. Table 5 presents a comprehensive overview of the experimental setup, including the environment, dataset source and type (RQ3), and platform (RQ7) for all identified studies on cooling systems.
Table 5. Selected cooling system studies experimental setup.
ID | Environment (RQ3) | Data source (RQ3) | Data type (RQ3) | Benchmarks (RQ6) | Platform (RQ7)
S3 | Simulation | Simulated a typical data center room with Alibaba's 2018 cluster data | Real-world | MBRL-MPC, MBHC | Unspecified CFD simulator, Python (PyTorch)
S12 | Simulation | National Supercomputing Centre (NSCC) of Singapore | Real-world | DefaultE+, Two-stage (TS), A3C, TRPO | EnergyPlus, Python (Scipy)
S14 | Simulation | Naviair data center (the Danish airspace control company) | Real-world | No reward PPO, Delayed reward PPO, Uniform future PPO, Trend-based future policy to estimate the return | Python (OpenAI Gym)
S20 | Simulation | Simulated liquid-cooled data center with an unspecified real-world dataset | Real-world | PID, MPC, DQN, TRPO, PPO | Matlab (Simscape)
S23 | Simulation | Simulated a small data center with a real-world dataset from the PlanetLab system | Real-world | DQN-SSP, PPO-MSP, DDPG-MSP | 6SigmaRoom, CloudsimPy, Python
S24 | Simulation | Four simulated configurations of CW- and DX-cooled data centers under two climate conditions | Synthetic | For the first three proposed controllers: DefaultE+, Reward shaping DDPG, Simplex DDPG, Projection post-hoc rectification DDPG; for the fourth controller: PID, Vanilla DDPG, Reward shaping DDPG | EnergyPlus, OpenFOAM, Python (OpenAI Gym and PyTorch)
S29 | Simulation | Simulation of a real-world data center room located in Shenzhen | Real-world | Comparison of DC energy efficiency metrics before and after the DRL strategy | 6SigmaRoom, Autodesk Revit, Python
S41 | Simulation | Simulated mid-tier stand-alone data center located in a tropical climate region | Synthetic | DefaultE+, Load Aware, Temperature Aware, Joint-IT, Multi-Agent DRL, TD3, PPO, TRPO, various versions of SAC | EnergyPlus, Python
S47 | Simulation | Simulated medium-sized DC with two zones, a direct expansion cooling coil, and a chilled water cooling coil | Synthetic and real-world | DefaultE+, TD3, PPO, TRPO, SAC | EnergyPlus, Python (OpenAI Gym)
S55 | Simulation | A real free-cooled data center located in a tropical zone | Real-world | Hysteresis-based controller, MPC | Matlab, Python (Keras and TensorFlow)
S56 | Simulation | Simulated data center test bed developed in [114] | Synthetic | DefaultE+, PETS, BDQ, PPO, SAC | EnergyPlus, Python (PyTorch)
S58 | Real-time | Inner Mongolia Meteorological Information Center (IMMIC) | Real-world | DL, DN, DQN | Python (TensorFlow), Real-time
S60 | Simulation | Simulated data center test bed developed in [115] | Synthetic | SAC, RP-SAC, DDPG, RP-DDPG, PPO, RP-PPO, Lagrangian-based safe DRL, Physics | EnergyPlus, Python (PyTorch, TensorFlow)
S65 | Simulation | Simulated data center test bed developed in [116] | Synthetic | DefaultE+, PPO, SAC, PPO-Lag | EnergyPlus
5.3. Comparison of RL/DRL algorithms applied to ICT systems
Over the past few years, data centers have grown significantly in size and complexity, driven by the rapid advancements in ICT systems. These advancements involve a wide range of devices, including high-performance servers, processing units such as CPUs and GPUs, advanced memory units, and storage arrays [120]. This technological progress has enabled data centers to support more complex operations, such as training large language models (LLMs) and real-time data processing. As a result, improving the energy efficiency of ICT systems has become a critical priority, not only to enhance the performance and scalability of data centers but also to minimize energy consumption and operational costs. In this section, we comprehensively examine the role of RL/DRL algorithms in tackling energy efficiency challenges within ICT systems as identified in the literature.
5.3.1. The research problem and objective formulation
The majority of the identified papers in this review focus on ICT systems, specifically 40 studies. The research problems (RQ4) and objectives (RQ5) of these studies can be categorized into the following areas:
Scheduling optimization: A considerable number of existing studies discuss the scheduling optimization challenge in a DC environment using RL/DRL approaches; however, few studies have explored the energy efficiency aspects of applying these algorithms. The three main types of RL/DRL algorithms applied to the scheduling optimization problem in the identified studies are: job scheduling, task scheduling, and resource scheduling.
Job scheduling: Job scheduling refers to the process of assigning and allocating the entire arriving job, which may consist of one or multiple tasks, to the DC resources, aiming to manage workloads with a high-level approach. Traditional job scheduling mechanisms often struggle to cope with extensive, heterogeneous DC environments, especially in cases involving long-lasting jobs. This limitation leads to inefficiencies in energy consumption and resource management. Three studies [56,77,85] have addressed this challenge by proposing RL/DRL algorithms. The primary approach to handling this challenge dynamically involves considering real-world constraints, such as job dependencies and QoS levels, to minimize energy consumption and carbon emissions in data centers.
Task scheduling: Tasks are the components of jobs that typically need to be performed in a specific order due to their interdependence. Task scheduling refers to the process of managing the execution of individual tasks within a job in a low-level approach. The main objective of the task scheduling studies is to select the optimal DC resource for task execution, ensuring compliance with time and QoS constraints. Ten studies were identified that discuss the task scheduling problem, highlighting three main approaches (a minimal formulation sketch follows the list):
• Dependency- and workflow-oriented RL/DRL task scheduling approaches [72,82,94,110].
• Heterogeneous cloud DC online RL/DRL task scheduling approaches [67,103,109].
• Adaptive and hybrid RL/DRL task scheduling approaches [50,54,59].
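The sketch below illustrates, under stated assumptions, how energy-aware task scheduling is commonly framed for a value-based agent: the state concatenates per-server utilization with the incoming task's demand, the action selects a target server, and the reward penalizes the marginal increase in power draw. The power model (idle servers drawn down to zero, a linear utilization term otherwise) and all constants are illustrative, not taken from any specific study.

```python
# Minimal sketch of an energy-aware scheduling step for a value-based RL agent.
import numpy as np

P_IDLE, P_MAX = 100.0, 250.0      # per-server power bounds in watts (assumed)

def server_power(util):
    """Linear power model for active servers; servers with zero load are treated as powered down."""
    active = util > 0
    return np.where(active, P_IDLE + (P_MAX - P_IDLE) * util, 0.0)

def step(util, task_demand, action):
    """Place one task on server `action`; return the next state and an energy-based reward."""
    before = server_power(util).sum()
    new_util = util.copy()
    new_util[action] = min(1.0, new_util[action] + task_demand)
    after = server_power(new_util).sum()
    reward = -(after - before)    # penalize the marginal power increase
    state = np.append(new_util, task_demand)
    return state, reward

util = np.array([0.0, 0.7, 0.5])  # server 0 is asleep; packing onto active servers is cheaper
state, reward = step(util, task_demand=0.3, action=2)
```

Because waking an idle server adds its idle power, the reward naturally favors consolidation onto already-active machines, which is the behavior the reviewed scheduling studies aim to learn.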
Resource scheduling: While task and job scheduling focus on the DC workload, resource scheduling concentrates on the physical (e.g., servers) or virtual (e.g., VM) infrastructure level of the DC. The main aim of the resource scheduling process is to maximize resource utilization, and it does not directly consider job and task dependencies. Two studies specifically focused on addressing the resource scheduling problem [89,95].
Virtual machine and container management: The virtualization of physical resources in data centers to meet the growing demands of workloads has received significant attention from researchers in recent years. Two primary technologies are commonly employed for virtualization: hardware-level virtualization, in which each virtual machine (VM) utilizes a hypervisor to run its own operating system and applications, and operating system (OS)-level virtualization, which leverages the host system's kernel to create containers that share the host's resources [121]. In this review, we selected 14 studies focused on managing VMs and containers using RL/DRL algorithms that present energy efficiency results. These studies address three key areas: VM consolidation, VM and container placement, and VM replacement.
VM consolidation: This refers to reducing the number of physical machines (PMs) required to operate the data center workload. The process includes three stages: workload detection (overutilization and underutilization), VM selection, and VM placement (a simple sketch of these stages follows at the end of this subsection). By running multiple VMs on fewer PMs, several objectives can be achieved, including optimizing ICT resources, reducing operational costs, and minimizing energy consumption. Five studies in this review collection discuss the VM consolidation problem in data centers using RL/DRL algorithms with two main approaches:
• Centralized adaptive RL/DRL strategies [53,57,84,88]
• Multi-agent RL strategies [105]
VM and container placement: This is a sub-process of consolidation, where the objective is solely to decide the optimal location (PM) for a VM. It is applied at the PM (host) level rather than at the DC system level. Eight studies have been identified on this topic: seven for VM placement and only one for container placement [68].
VM replacement: This refers to the process of reassigning an already placed VM to a new physical machine (PM). This process is triggered by changes in the current state (e.g., overloading, failures). It is also considered a sub-process of VM consolidation, enabling VM migration. Among the selected studies, only one specifically addressed this issue, proposing a novel approach that combines fuzzy logic with an RL algorithm to enhance decision-making and adaptability in this process [74].
Two studies combine the two aforementioned categories as a research problem, focusing on VM scheduling by allocating tasks or jobs to VMs assigned to hosts, leveraging RL/DRL algorithms to optimize the scheduling process [79,90].
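The three consolidation stages described above can be sketched as a simple pipeline. The static threshold, the minimum-size VM selection rule, and the best-fit placement below are illustrative heuristics only; the reviewed studies replace one or more of these stages with learned RL/DRL policies.

```python
# Illustrative consolidation pipeline: overload detection, VM selection, VM placement.
from dataclasses import dataclass, field

@dataclass
class Host:
    capacity: float = 1.0
    vms: list = field(default_factory=list)   # each VM is represented by its CPU demand
    def util(self):
        return sum(self.vms) / self.capacity

def detect_overloaded(hosts, upper=0.8):
    """Stage 1: static-threshold overload detection."""
    return [h for h in hosts if h.util() > upper]

def select_vm(host):
    """Stage 2: pick the smallest VM to migrate (a simple selection policy)."""
    vm = min(host.vms)
    host.vms.remove(vm)
    return vm

def place_vm(vm, hosts, upper=0.8):
    """Stage 3: place on the most-utilized host that stays below the threshold (best fit)."""
    candidates = [h for h in hosts if h.util() + vm / h.capacity <= upper]
    target = max(candidates, key=lambda h: h.util())
    target.vms.append(vm)
    return target

hosts = [Host(vms=[0.5, 0.4]), Host(vms=[0.2]), Host(vms=[])]
for h in detect_overloaded(hosts):
    place_vm(select_vm(h), hosts)   # migrates the 0.4 VM onto the partially loaded host
```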
DCN traffic control: Data Center Networks (DCNs) play a critical role in ensuring the smooth operation of ICT systems. However, they often suffer from bandwidth surges, which degrade data center performance and significantly increase energy consumption. Traditional methods to address these issues are limited in their adaptability and fail to dynamically handle sudden network traffic fluctuations, leading to substantial energy waste. RL/DRL algorithms offer effective approaches to tackle these challenges. Four studies have been identified that explore solutions to this problem, each employing a unique structural RL/DRL approach:
• Combining LSTM networks for traffic prediction and proactive RL/DRL agents to optimize traffic control and energy efficiency [73,86].
• Formulating the problem as an MILP model to define the optimal solution space and integrating RL/DRL algorithms to find near-optimal solutions dynamically [55].
• Employing Software-Defined Networking (SDN) and RL/DRL to dynamically schedule traffic flows, aiming to reduce energy consumption while maintaining an optimal Flow Completion Time (FCT) [107].
Multi-objective framework: Five studies are identified here that address job/task scheduling, task offloading, and resource allocation as multi-objective research problems. The resources considered in these studies include containers [65,81], multi-user, multi-data center resources [99], and general data center resources [83,108]. A detailed summary of the identified ICT studies' research problems (RQ4) and objectives (RQ5) is presented in Table A.9.
5.3.2. The energy related outcomes
As energy efficiency is the primary focus of this review, a comprehensive analysis of the energy efficiency outcomes of using RL/DRL algorithms in ICT systems in the identified studies is presented in Table A.9. This table answers this review's RQ8 and demonstrates that the proposed RL/DRL algorithms consistently outperform baseline and benchmark non-RL/DRL methods in terms of energy efficiency. The reported energy efficiency improvements range from small percentages (1 %–3 %) to significant enhancements (over 60 %), depending on the specified scenario and context, such as varying VM/task loads, DCN traffic sizes, or the use of real-world or synthetic datasets. The majority of the studies reported achieving energy efficiency as a percentage improvement when compared to benchmark algorithms.
Additionally, some studies highlighted energy efficiency enhancements in terms of scalability and dataset-based performance. For instance, studies [62,108] focus on performance across diverse datasets and scalability metrics. Other studies compare the achieved energy savings in multiple experimental setups or configurations. [67] investigated task scheduling across three distinct scenarios with 10, 50, and 100 servers, examining the impact of server configurations on energy efficiency. [88] explored VM consolidation under different workloads, assessing its impact on resource utilization and energy consumption. [86] analyzed DCN traffic control with both more than 70 nodes and fewer than 70 nodes, assessing performance across different network sizes. [109] conducted task scheduling across two different task counts and varying numbers of VMs, evaluating the performance under diverse configurations. In addition, a few studies presented a generalized approach without explicitly referencing benchmark algorithms. For instance, [74] reported energy savings in a generalized context, providing insights into the potential applicability of the proposed RL algorithm.
5.3.3. The benchmark comparisons
Each research problem discussed in the identified studies of the ICT system was compared to baseline or state-of-the-art benchmark methods commonly used in the respective problem domain. As presented in Fig. 11, the most commonly used baseline method for scheduling optimization studies was the RANDOM method.
In this method, jobs/tasks/VMs were assigned to resources without considering optimization criteria. This approach is simple and achieves unbiased scheduling; however, it is inefficient as it overlooks critical DC metrics such as energy efficiency, quality of service (QoS), and workload balancing. This method was used in seven studies as a baseline for comparison with the proposed RL/DRL algorithms in the identified scheduling optimization studies.
Fig. 11. Number of benchmarks in the literature for the ICT system.
Additionally, heuristic-based algorithms were widely used as benchmarks to evaluate the proposed RL/DRL algorithms for various ICT research problems. For scheduling research problems, the Round-Robin (RR) method was highlighted as the primary heuristic-based method for performance comparison. Greedy algorithms, including First-Fit (FF), Best-Fit (BF), and their variants, were the main benchmarks for VM management research problems. ElasticTree was a common benchmark for DC network traffic control problems. Approximately 78 additional heuristic-based algorithms were employed as comparison methods across all the research problems discussed in ICT systems. Meta-heuristic methods were also used 23 times as evaluation methods. These included Particle Swarm Optimization (PSO), Ant Colony Optimization (ACO), Genetic Algorithms (GA), and their variants, applied to various ICT system research problems. Machine learning algorithms were occasionally utilized in a limited number of identified studies as benchmark methods, particularly for VM management. Other RL/DRL algorithms developed in previous studies were used 31 times for comparison with newly proposed algorithms, demonstrating internal comparisons within RL/DRL approaches in the identified studies. Finally, some specially designed algorithms were also employed. Table 6 outlines the benchmark algorithm (RQ6) comparisons for each selected study on ICT systems.
5.3.4. The experimental setup
CloudSim [122] and its variant WorkflowSim [123], an extended and optimized version of CloudSim designed for dependent task workflows, were used as simulation environments in approximately 50 % of the identified studies focusing on scheduling optimization and VM management research problems in DC ICT systems. In addition to these tools, programming languages such as Java and Python were frequently employed for simulation experiments in multiple studies. MATLAB was used as the simulation environment in four studies. However, six studies did not specify the simulation environment used. On the other hand, several real-world datasets from large-scale data centers like Google, Wikipedia, and Alibaba, as well as smaller data centers such as the National Supercomputing Centre (NSCC) of Singapore and the Nottingham University Data Center, were utilized as data sources in the identified studies. Moreover, well-known datasets such as PlanetLab and the CoMon project were also employed for simulation experiments. Synthetic datasets were another key data source, enabling controlled and customized testing scenarios.
Table 6 provides a comprehensive overview of the experimental setup, encompassing the simulation environment, the sources and types of datasets (RQ3), and the platforms used (RQ7) in all the identified studies on ICT systems.
5.4. Comparison of RL/DRL algorithms applied to optimizing integrated data center systems
Developing an accurate, intelligent, and real-time DC environment requires seamless integration of all systems, including the cooling, ICT, and power supply systems. The joint optimization of these systems has become a promising research direction, aiming to achieve multiple objectives across multiple systems using advanced optimization strategies. Among these, RL/DRL algorithms have emerged as powerful approaches, demonstrating significant potential in addressing the complexities of integrated DC systems. This section delves into a detailed analysis of 11 identified studies that leverage RL/DRL algorithms for the joint optimization of DC systems. A vital aspect of the identified studies in this section lies in formulating these studies as a multi-objective research problem across multiple systems. To further enrich this discussion and align with the growing interest in this field, we define the key elements of the Markov Decision Process (MDP) models employed in these studies, highlighting their critical role in achieving efficient and effective system integration.
Various identified studies explored the integration of ICT operation optimization with energy-efficient cooling system control as a research problem (RQ4) with different objectives (RQ5). [71] investigates the implementation of a decentralized strategy to simultaneously optimize the cooling system and the VM placement. Additionally, scheduling optimization combined with cooling system control is another prominent research focus. Task scheduling was discussed in [52,91,92], whereas job scheduling was examined in [52,80]. In both cases, the scheduling process is integrated with the optimization of the cooling system. On the other hand, three studies [47,61,96] examined the workflow scheduling of DCs powered by renewable energy systems (RES). The primary objective in these studies is to optimize energy consumption from RES during the execution of DC workloads. In study [48], a DRL strategy was applied to optimize the cooling system by integrating it with the power supply system using real-time electricity pricing (RTP). Finally, global optimization using a multi-agent approach to enhance energy efficiency across more than two DC systems was addressed in a recent study [98].
The majority of the identified papers report results related to the energy efficiency (RQ8) of the developed RL/DRL algorithms as a percentage of energy savings compared to baseline algorithms. For example, [48] reported a slight improvement in energy savings compared to a PID controller, while [91] compared the energy efficiency results of the proposed algorithm with a controller designed based on domain expert knowledge, achieving up to 30 % energy savings. Another method of reporting energy efficiency results involves using data center efficiency metrics, such as PUE. This approach was demonstrated in [52,80], where the proposed RL/DRL algorithms enhanced energy efficiency compared to benchmark algorithms. Table A.10 provides an overview of the research problem (RQ4), related objectives (RQ5), and energy-related outcomes (RQ8) of the identified joint optimization studies.
The proposed joint RL/DRL algorithms were compared against vari - ous benchmark algorithms. Multiple studies evaluated the performance of the developed RL/DRL strategies against state-of-the-art individ - ual optimization techniques, such as ICT algorithms (e.g., random or Applied Energy 389 (2025) 125734 15 H. Kahil, S. Sharma, P. Välisuo et al. Table 6 Selected ICT system experimental setup. ID Environment (RQ3) Data source (RQ3) Data type (RQ3) Benchmarks (RQ6) Platform (RQ7) S4 Simulation Google cluster Real-world RR, B, MAD, DRL-DTM, DRL-DTA NA S7 Simulation Google cluster Real-world FF, MFFD, PABFD, RL-DC, UP-VMC EnergyPlus, CloudSim S8 Simulation Google cluster Real-world RR, HDRL, DRL-Cloud, MO-DQN Python (TensorFlow) S9 Simulation Abilene, Geant, and Synthetic topology datasets Synthetic and Real-world TEDO, TEDI Java, Python (TensorFlow) S10 Simulation National Supercomputing Center (NSCC) of Singapore Real-world RR, Job consolidator, Online optimizer with two different reward functions NA S11 Simulation CoMon Project Real-world LR-MMT, VDT-UMC, DTH-MF CloudSim S13 Simulation Google cluster Real-world RR, HDRL, DRL-Cloud NA S16 Simulation Amazon EC2 and Simulated dataset Synthetic and Real-world FFD, BFD, GRVMP, GMPR, NSGA-II, RLVMP CloudSim S17 Simulation Simulated dataset Synthetic VMPMORL, EVCT, VPME, AFED-EF CloudSim S18 Simulation GWA-T-12 Bitbrains Real-world MOPSO, MOACO, VMPORL MATLAB S19 Simulation Simulated tasks following an exponential workload distribution Synthetic Cloud, PREM, RANDOM, REQ Python (PyTorch and Gym) S21 Simulation Open-source: BitBrains, Scientific workflows: Ligo, Montage, Cybershake Real-world RR, RF, GRR, GRF, Tetris, RLScheduler, ACS NA S22 Simulation Simulated dataset Synthetic GA, ACO, SA, FFD Java, CloudSim S26 Simulation Google cluster Real-world RR, RANDOM, SO, GJO NA S27 Simulation Simulated dataset using a K-port FatTree topology Synthetic Greedy-ElasticTree, LSTM+DRL, DDPG Python (TensorFlow, Keras) S28 Simulation Nottingham University, Gaussian distribution Synthetic datasets Synthetic and Real-world MOVMrB, RLVMrB, VMPMORL CloudSim S30 Simulation PlanetLab dataset, Amazon EC2 instance configurations Synthetic and Real-world MOVMrB, RLVMrB, ADVMC MATLAB S31 Simulation Alibaba Cloud Real-world FIFO, Ideal MPC, Tetris Python (TensorFlow) Python (PyTorch, Gymnasium, Scikit-learn) S32 Simulation Azure 2017 workload Real-world HGP, IQR-MMT, MAD-MMT, RLR-MMT, GA S33 Simulation Ligo, Genome, Cybershake, Montage, and Sipht datasets Real-world EcoCloud, KMI-MRCU, AFED-EF Java S35 Simulation Simulated two common datasets Synthetic DSTS, LSTM, RF, CNN CloudSim S36 Simulation Alibaba Cluster Real-world EINFORCE, FF, RANDOM, Tetris Python (TensorFlow, NumPy, Matplotlib) S37 Simulation Simulated dataset Synthetic Small task sizes: Load Aware, FFO-EVMM, MIMT, DQN. Medium task sizes: FFO-EVMM, MIMT, L-No-Deaf, Worn-Dear, DQN. 
Larger task sizes: FFO-EVMM, MIMT, multiple PSO variants, DBC, EDF CloudSim S38 Simulation PlanetLab dataset Real-world PowerAware VM consolidation CloudSim S39 Simulation Simulated dataset Synthetic RANDOM, RR, EDF Python (PyTorch) S40 Simulation Packet trace files from three data centers, generated using Wireshark Real-world Shortest-path-based routing, Gurobi optimizer Python (Keras, TensorFlow) S42 Simulation PlanetLab Monitoring Real-world IQR, MAD, THR, LR, PABFD CloudSim S43 Simulation Simulated dataset Synthetic RoFFR, CSLB, TDBS WorkflowSim, Python S44 Simulation 1998 FIFA World Cup Dataset, UNSW-17 Network Traffic Dataset Real-world VPBAR, LRR-MMT, DTH-MF, VMTA, Megh, EQBFD-0.1, EQBFD-0.3 CloudSim S48 Simulation Simulated dataset Synthetic MMS-RANDOM, MMS-FAIR, MMS-GREEDY CloudSim S49 Simulation Google cluster Real-world RANDOM, Round Robin (RR), MoPSO Python (TensorFlow) S51 Simulation Simulated dataset Synthetic Multi-objective optimization algorithms: MGGA, VMPACS, VMPMBBO, ICA-VMPLC, CVP. Single-objective optimization algorithms: FFD, OEMACS MATLAB S53 Simulation Simulated dataset Synthetic Job scheduling: RANDOM, RR, Greedy, MoPSO Resource allocation: RANDOM, RR, MLF, FERPTS Python (TensorFlow) S54 Simulation Production-quality cloud DC, simulated dataset Synthetic and Real-world FF, Dot Product, Norm2 heuristics Python (NumPy, PyTorch) S57 Simulation Google cluster Real-world Tetris, H2O-Cloud NA S59 Simulation CoMon Project (PlanetLab data) Real-world NPA, PABFD, IGGA, E-Eco CloudSim S61 Simulation Wikipedia trace files Real-world ElasticTree, CARPO, FCTcon, Optimal (it is not practical in use) Python (Keras) S62 Simulation Montage, Cybershake, Sipht, Inspiral datasets generated using the Pegasus Workflow Generator Real-world MPC, ETF, Lr-RL, Q-SCH, QL-HEFT CloudSim S63 Simulation Google Cloud Jobs dataset (GoCJ) Real-world PSO, MVO, EMVO MATLAB, Python (PyTorch) S64 Simulation Sipht, Inspiral, Cybershake datasets generated using the Pegasus Workflow Generator Real-world MPC, ETF WorkflowSim Applied Energy 389 (2025) 125734 16 H. Kahil, S. Sharma, P. Välisuo et al. Table 7 Selected integrated studies experimental setup. ID Environment (RQ3) Data source (RQ3) Data type (RQ3) Benchmarks (RQ6) Platform (RQ7) S1 Simulation Pegasus workflow framework Synthetic Random, Green-Opt (Greedy), Common-Actor CloudSim, Python (Keras) S2 Simulation Weather: Collected from Denmark Electricity pricing: Danish electricity spot market Real-world Other RL Controllers (For SAC and PPO), PID controller EnergyPlus S5 Simulation LLNL Thunder Real-world ICO, MPC, Joint optimization (JCO), Original- DQN Matlab, 6SigmaDCX, TensorFlow S6 Simulation LLNL Thunder Real-world PADQN, E-QL Matlab, 6SigmaDCX, TensorFlow S15 Simulation Workload: Google Cluster dataset (GCD). Renewable energy: National Renewable Energy Laboratory (NREL)/NE-3000 wind tur- bines. Electricity Price: The US EIA. 
Carbon Footprint: The US Department of Energy Electricity Emission Factors | Real-world | Greenpacker, LECC, ADVMC, ADVMC-RES | Python (S15 row, continued from the previous page)
S25 | Simulation | PlanetLab, Google Cluster | Real-world | DeepEE, Deep-Q with LSTM, ETAS, Improved Genetic, Hierarchical Deep-Q, MPC | CloudSim integrated with four CRAC units and perforated floor tiles to simulate realistic cooling dynamics
S34 | Simulation | A simulation-based dataset | Synthetic | Schedule: Single-agent method, Hybrid DQN, Independent DQN, Original DQN | 6SigmaDC, CloudSimPy
S45 | Combining real-world and simulation | Operational data from Singapore's National Supercomputing Centre | Real-world | Expert-domain-knowledge-based algorithm; heuristic algorithms for independent IT or cooling optimization; thermal-unaware scheduling (traditional task scheduling without considering thermal dynamics) | 6SigmaRoom, EnergyPlus
S46 | Simulation | Google Cluster data | Real-world | Random, RR, PowerTrade, DeepEE | Python (OpenAI Gym and TensorFlow), Matlab
S50 | Simulation | Wiki data center | Real-world | Static, Random, K-means | Python (PyTorch)
S52 | Simulation | Simulated dataset | Synthetic | Non-optimization: no algorithm-based control; non-algorithm optimization: logic-based manual controls | NA
heuristic approaches) and traditional cooling control methods, including PID and Model Predictive Control (MPC). Additionally, other studies compared the results with joint optimization approaches. Furthermore, several studies benchmarked the outcomes against other RL/DRL algorithms proposed in previous research.
The tools discussed in Sections 5.2.4 and 5.3.4 were similarly employed in the joint optimization studies to create simulation environments. These include the EnergyPlus building energy simulation program [117] and the Computational Fluid Dynamics (CFD) simulator 6SigmaRoom [119], which were utilized for cooling systems. CloudSim [122] served as a simulation environment for the ICT system. Furthermore, Python, along with its extensive libraries, served as the main programming language for implementing RL/DRL algorithms, while MATLAB was also employed in several studies for simulation and analytical tasks. Table 7 summarizes details of the experimental setups in the joint optimization literature: simulation environments (RQ3), platforms (RQ7), and benchmarks (RQ6).
5.5. The MDP elements
As detailed in Section 3, the Markov Decision Process (MDP) provides the foundational structure for modeling the RL/DRL environment. The key components of the MDP are the state space {S}, the action space {A}, and the reward function {R}. In the context of the identified joint optimization problem, the MDP features a large and complex state space, as well as a mixed action space encompassing both discrete and continuous actions. Furthermore, the reward function guiding the RL/DRL agent in these studies consists of multiple terms to capture the various systems within the DC environment. This highlights that the MDP for joint optimization studies is considerably more complex than in studies addressing only one system. Table A.11 provides a comprehensive summary of the MDP components in the joint optimization studies.
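As a rough illustration of this added complexity, the sketch below shows the kind of composite state, mixed discrete/continuous action, and weighted multi-term reward that characterize joint optimization MDPs. The specific fields, penalty terms, and weights are assumptions for illustration only; the formulations actually used by the reviewed studies are summarized in Table A.11.

```python
# Illustrative joint-optimization MDP elements: composite state, mixed action,
# and a reward combining IT energy, cooling energy, and constraint penalties.
from dataclasses import dataclass

@dataclass
class JointState:
    server_utils: tuple      # per-server CPU utilization (ICT side)
    inlet_temps: tuple       # per-rack inlet temperatures in degC (cooling side)
    queue_len: int           # pending tasks
    outdoor_temp: float      # ambient condition

@dataclass
class JointAction:
    target_server: int       # discrete ICT decision: where the next task goes
    supply_setpoint: float   # continuous cooling decision (degC)

def joint_reward(it_kw, cooling_kw, sla_violations, max_inlet, t_limit=27.0,
                 w_it=1.0, w_cool=1.0, w_sla=5.0, w_temp=10.0):
    """Weighted multi-term reward; less negative is better. Weights are illustrative."""
    thermal_excess = max(0.0, max_inlet - t_limit)
    return -(w_it * it_kw + w_cool * cooling_kw
             + w_sla * sla_violations + w_temp * thermal_excess)

r = joint_reward(it_kw=180.0, cooling_kw=60.0, sla_violations=2, max_inlet=27.4)
```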
6. Other objectives combined with energy efficiency in the identified studies
Besides energy efficiency objectives in the identified studies, other objectives have been investigated. It is essential to highlight these objectives, which will shape the direction of future efforts in the field of multi-objective optimization. The RL/DRL algorithms have proven their effectiveness in resolving the conflicts between objectives in several identified works. For instance, in [95], the multi-objective optimization aims to balance the energy consumption of various numbers of tasks (between 100 and 250 tasks) and the average task makespan. Moreover, [92] examines the classical trade-off between quality of service (QoS), resource utilization, and energy consumption. Fig. 12 outlines a taxonomy of other optimization objectives integrated with enhancing the energy efficiency of the data center systems.
Fig. 12. Taxonomy of the objectives other than energy efficiency targeted in the included articles: environmental impact (reduce carbon emissions, improve RES utilization, balance cost-benefit trade-offs), system performance (minimize total makespan, reduce average waiting time, improve response time, maximize resource utilization, maintain air temperature distributions, improve task completion rates), reliability management (minimize SLA violations, address thermal threshold conditions, reduce hotspots, balance temperature dispersion, maintain a stable CPU utilization level, improve quality of service), and algorithmic performance (assess average rewards, minimize computational overheads, enhance training and reproducibility).
Although the majority of the identified studies address data center energy efficiency enhancement as the core research objective, some studies combine this objective with other environmental metrics, which can directly improve the operation mode of the data center and reduce its negative impact on the surrounding ecosystems in terms of carbon footprint and RES utilization. In contrast, the identified studies examine the proposed RL/DRL strategies for ICT and cooling in terms of system performance. In one dimension, these strategies refine time-related aspects, including makespan (execution time), waiting time, and response time. In another dimension, they enhance resource efficiency, including resource utilization, cooling system air temperature distributions, and task completion rates. Another objectives category integrated with the energy efficiency improvement objective in many studies is resilience optimization. For ICT systems, these objectives include optimizing Service Level Agreement (SLA) violations, Quality of Service (QoS), and consistent CPU utilization. Resilience is also vital for data center cooling systems, involving aspects such as addressing thermal threshold conditions, reducing hotspots, and balancing temperature dispersion. Finally, the objectives related to the performance of the proposed RL/DRL methods are addressed in several identified works, with a primary focus on key aspects such as reward optimization, computational efficiency, and the reproducibility of the algorithms employed. These objectives emphasize the significant impact of choosing rewards as an indicator of effective decision-making, boosting the scalability and real-time applicability of these algorithms by minimizing computational requirements, and validating the accuracy and universal applicability of these algorithms.
7. Research gaps, open challenges and future directions
Many of the studies reviewed in this work have recognized the potential of utilizing RL/DRL strategies for optimizing data center systems, demonstrating their success in reducing energy consumption and enhancing overall performance. However, despite these advancements, several obstacles must be addressed and overcome to achieve more effective, robust, and stable solutions. Based on the current review, this section identifies key open challenges and explores research gaps and potential future directions across three critical dimensions: real-time validation, standardized energy efficiency reporting metrics, and advancements in DRL multi-agent approaches.
7.1. Dependence on simulated environments
Exploring the experimental setups (RQ3), tools, frameworks, and platforms (RQ7) discussed in the literature reveals a growing trend of leveraging real-world datasets from diverse sources to construct simulated environments using a variety of software tools and programming languages. These environments are then used to evaluate the developed RL/DRL algorithms in terms of achieving the research objectives. However, a critical gap remains in the implementation of real-time validation, which is essential to prove the practical capability and robustness of the proposed algorithms. Real-time validation serves as a vital benchmark to ensure that the algorithms can operate effectively under dynamic conditions. The absence of real-time validation in 95 % of the identified studies in this review represents one of the most significant limitations in the current adoption of RL/DRL strategies for cooling and ICT systems. The main barrier to adopting a real-world validation framework in data center applications is the reward-shaped behavior of the RL/DRL agent [31], which can cause substantial safety and security problems, where failure could lead to system damage or even data center service disruptions [124]. Also, the complex and varying nature of real-time energy management scenarios further complicates the application of RL/DRL on a larger scale, as existing models may lack the adaptability needed to respond effectively to varying load and environmental conditions. To cope with real-world validation, integration between simulation implementation and real-world testing is needed to enable more reliable and scalable deployments. This approach can mitigate safety risks and enhance the security of RL/DRL algorithms. In summary, while the implementation of RL/DRL algorithms in simulated environments has proven invaluable, advancing to real-time validation is an open research gap for unlocking the full potential of these algorithms in data center environments.
7.2. Evaluating the energy efficiency results
The energy-related results (RQ8) in the identified literature highlight the current trend of reporting these results based on the percentage reduction in energy consumption compared to baseline or benchmark results. However, more comprehensive and multi-scale reporting metrics are needed to reflect the functionality of the proposed optimization algorithms in enhancing energy efficiency. Advanced data-center-specific energy efficiency metrics can be employed, including:
• Power usage effectiveness (PUE): This was the first data center energy efficiency metric, proposed by [125]. It measures the ratio of the data center's total energy consumption to the ICT system energy consumption.
The PUE value is defined as a dimensionless quantity between 1.0 and infinity [126]. A PUE value close to 1.0 indicates optimal energy efficiency, meaning that most of the consumed energy is dedicated to ICT operations rather than other systems. Although used in some of the studies identified in this review, this metric does not account for objectives related to other data center systems, such as integrating renewable energy sources (RES) into the power supply or implementing waste heat recovery in cooling systems.
• Energy reuse effectiveness (ERE): This metric, introduced by [127], evaluates energy recovery in data centers by dividing the difference between total energy and reused energy by the ICT energy consumption. In an ideal scenario where ERE = 0, all waste heat is effectively recovered within the data center.
• Data center energy productivity (DCeP): This metric integrates both the energy consumption of infrastructure systems, including cooling, and the ICT system in order to evaluate the data center's energy efficiency [128]. DCeP is calculated as the ratio of useful work performed to the total energy consumption of the data center. Since data centers in various industries perform different types of work, defining useful work cannot be generalized and depends on the specific application.
• Data center performance per energy (DPPE): This metric combines four data center energy efficiency metrics: IT Equipment Energy (ITEE), IT Equipment Utilization (ITEU), Green Energy Coefficient (GEC), and Data Center Infrastructure Efficiency (DCiE) [129]. It measures the energy consumption performance output relative to each unit inside the data center.
Comprehensive details about these metrics, along with additional related methods, can be found in specialized review articles dedicated to data center energy efficiency and performance evaluation [25,112,130]. Combining multiple metrics to report the energy efficiency results targeted by the proposed RL/DRL optimization strategies is essential. This approach ensures a comprehensive and multi-scale evaluation of performance across various dimensions.
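The first two metrics are simple ratios of measured energy totals, as the following short sketch shows. The numerical values are illustrative; the formulas follow the definitions given above (PUE = total energy / ICT energy; ERE = (total energy − reused energy) / ICT energy).

```python
# Computing PUE and ERE from measured energy totals (example figures only).
def pue(total_kwh: float, ict_kwh: float) -> float:
    return total_kwh / ict_kwh

def ere(total_kwh: float, reused_kwh: float, ict_kwh: float) -> float:
    return (total_kwh - reused_kwh) / ict_kwh

total, ict, reused = 1500.0, 1000.0, 200.0   # example annual energy figures
print(f"PUE = {pue(total, ict):.2f}, ERE = {ere(total, reused, ict):.2f}")
# -> PUE = 1.50, ERE = 1.30; ERE falling below PUE reflects the recovered waste heat.
```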
7.3. Multi-agent DRL algorithms
The emerging developments in data centers have led to increasingly complex cooling and ICT systems. These sophisticated systems require innovative optimization approaches to achieve key operational objectives such as resource allocation, job/task scheduling, and efficient and safe thermal management. However, optimizing each system independently has proven insufficient in achieving the overarching goal of enhancing the overall operational performance of data centers. Multi-agent Deep Reinforcement Learning (MADRL) algorithms, which enable agents to interact with other agents operating within the same environment, present a promising approach for improving optimization and achieving more efficient solutions. While MADRL has previously been utilized in fields such as UAV optimization [131,132], power grid management [133,134], and games [135,136], there is a growing interest in leveraging MADRL in other fields to address multiple objectives across multi-system environments. Several studies identified in Section 5.4 have explored the application of MADRL algorithms in data centers. However, achieving a global optimization of data center operations that accounts for all upstream and downstream energy consumption and recovery objectives remains a significant challenge. The primary complexity of employing MADRL in data center environments arises from the heterogeneous nature of the systems. While the cooling system operates as a continuous process, ICT systems are generally discrete. Combining the optimization of these systems using MADRL requires careful and accurate selection of suitable DRL algorithms tailored to these characteristics. A comprehensive review of the application of MADRL in optimizing multi-system environments can be found in [137–139].
Other common challenges in deploying more advanced algorithms to optimize data center energy efficiency include accurately modeling data center environments, managing computational costs, and achieving scalability. These challenges have restricted the application of advanced algorithms, such as RL/DRL, to small, isolated use cases in small-scale data centers, rather than facilitating broader adoption in large-scale facilities. In large-scale data centers, RL/DRL could be holistically integrated, as demonstrated in the implementation of Green Data Center Cooling Control using Physics-guided Safe RL in [91]. However, some previous studies have highlighted that traditional RL/DRL methods face considerable barriers in large-scale applications due to high computational demands and slow convergence rates, which further limit their scalability and practical implementation in real-world settings [140].
8. Conclusion
In this work, we carried out a systematic literature review following the PRISMA protocol, focusing on optimizing the energy efficiency of data centers by leveraging RL/DRL algorithms. This review examines recent research from different perspectives in the context of the two main systems of the data center: the cooling system and the ICT system. A comprehensive analysis and synthesis of the 65 identified studies was conducted, addressing eight major research questions (RQs): the targeted system, the employed RL/DRL algorithm, the reporting metrics for energy-related outcomes, the experimental setups (including dataset source and type and the environment), the research problems, the main objectives, the benchmark algorithm comparisons, and the platforms and software used. A more in-depth discussion was conducted on the literature regarding joint optimization, where the MDP elements were analyzed in detail. Additionally, a brief investigation of other objectives combined with energy efficiency was explored. Finally, we comprehensively analyzed the research gaps, open challenges, and future directions from three different standpoints. We hope this in-depth review will serve as a valuable roadmap for upcoming research in the field of optimizing the energy efficiency and performance of current and future data centers.
CRediT authorship contribution statement
Hussain Kahil: Writing – original draft, writing – review & editing, visualization, conceptualization, methodology, formal analysis, investigation, resources, and data curation. Shiva Sharma: Writing – original draft, writing – review & editing, visualization, conceptualization, and methodology. Petri Välisuo: Writing – review & editing, visualization, supervision, and project administration. Mohammed Elmusrati: Writing – review & editing and supervision.
Declaration of generative AI and AI-assisted technologies in the writing process During the preparation of this work the authors used ChatGPT in order to improve the readability and language of the manuscript. After using this tool, the authors reviewed and edited the content as needed and take full responsibility for the content of the publication. Declaration of competing interest The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper. Acknowledgement This project has received funding from the European Union – NextGenerationEU instrument and is funded by the Academy of Finland under grant number 353562. Appendix A. Related tables See Tables A.8–A.11. Applied Energy 389 (2025) 125734 19 H. Kahil, S. Sharma, P. Välisuo et al. Table A.8 Selected cooling system studies objectives and outcomes. ID Research problem (RQ4) Objectives (RQ5) Energy related results (RQ8) S3 DC cooling system safe operations Reduce energy while maintaining thermal constraints using transition and risk models OCA configuration: 18.18 % energy savings compared to 11.92 % for the MBHC and 15.63 % for MBRL-MPC. ECA configuration: 10.94 %, and 13.18 % energy savings compared to the benchmark controllers, while reducing thermal violations by up to 48 % S12 DC cooling system safe operations Optimize cooling system via DDPG with neu- ral networks and DUE validation to mitigate overheating risks Simulation: Improved PUE by approximately 11 %. Reducing cooling costs. Trace-based experiment: 15 % cooling energy reduction S14 DC efficient cooling system optimization Minimize total energy costs while maintaining server temperature and balancing the aquifer thermal energy storage (ATES) The Delayed algorithm increased energy cost by 1 %, while Cool and Cool 2.0 reduced it by 1.2 % and 12.5 %, respectively, compared to the baseline. Only Cool 2.0 achieved ATES balance, while the others negatively affected it S20 DC efficient cooling system optimization Optimize PUE and chip thermal stability while improving adaptability In the first scenario, SAC agent reduced energy consumption 32.23 %, 9.86 %, 10.77 %, 6.95 %, 1.83 % while enhancing the PUE by: 0.051, 0.011, 0.013, 0.008, 0.002 compared to: PID, MPC, DQN, TRPO, PPO controllers, respectively. In the second scenario, SAC (300 s control interval) saved 3.45 % energy vs. SAC final value state. DCI-SAC (combined value state) achieved 4.4 % savings over final value state DCI-SAC and 9.48 % over SAC with a PUE reduction of 0.01 S23 DC efficient cooling system optimization Optimize CRAC unit supply air temperature using a DRL multi-set point (DQN-MSP) method for thermal recognition and energy efficiency DQN-MSP outperformed other benchmark DRL algorithms by 5.7 %, 2.4 %, and 4.2 %, in terms of reducing energy consumption. DQN-MSP increased the average supply air temperature of the CRAC units while maintaining thermal constraints S24 DC cooling system safe operations Develop a safety-aware DRL strategy using imita- tion learning and rectification to optimize energy and maintain thermal constraints In the EnergyPlus environment, the three proposed Safari controllers achieved between 18 % and 26.6 % energy savings compared to the benchmarks while minimizing violations. 
In the OpenFOAM environment with non-uniform temperature distribution, the proposed Safari-4 controller achieved approximately 14 % energy savings compared to PID while maintaining thermal safety constraints S29 DC efficient cooling system optimization Integrating big data, IoT, and LSTM network with DRL to optimize energy consumption Achieve improvements in the energy-saving metrics of the data center, specifically in PUE and WUE by around 2.55 % S41 DC efficient cooling system optimization Optimizing DC energy consumption using a float- ing set point and whole-building evaluation in tropical climates In an environment setup proposed by [141], the proposed algorithm achieved 13.8 % energy savings, comparable to the 13 % savings recorded with SAC de veloped in that study. However, the other DRL algorithms used in [ - 141] (TD3, PPO, and TRPO) achieved slightly better performance. It achieved energy savings by 5.5 % (full-load IT) and 3 % (part-load IT), which demonstrated that the server fan identified as the primary energy-saving component S47 DC efficient cooling system optimization Compare four state-of-the-art DRL algorithms for optimizing cooling in a medium-sized DC under varying weather conditions Achieve energy savings of 13 % for SAC, 19 % for TD3, 18.3 % for PPO, and 14.6 % for TRPO compared with the baseline model-based EnergyPlus controller (DefaultE+) Reduce energy consumption across all scenarios compared to the benchmark hysteresis-based controller. As lower thresholds require more cooling demand, Energy consumption rose as temperature and RH thresholds decreased S55 DC efficient cooling system optimization Minimize energy consumption of air-free cooling system in a tropical DC while maintaining the supply air temperature and relative humidity (RH) S56 DC efficient cooling system optimization Evaluation of four state-of-the-art DRL algorithms for dynamic thermal management in DC cooling systems, focusing on balancing energy consump- tion and thermal constraints across different scenarios The evaluation highlights the DRL algorithms ability to balance energy ef- ficiency and constraint satisfaction. Compared to the baseline EnergyPlus controller (DefaultE+), the proposed DRL strategy demonstrates energy sav- ings influenced by the selected reward function, achieving up to 8.84 % in some cases S58 DC efficient cooling system optimization Optimize rack-level cooling by implementing multi-active ventilation tiles (AVTs) con- trollers, while integrating Dyna architecture for energy-temperature trade-off Energy consumption was reduced by optimizing fan speeds and minimizing temperature variances. While exact percentages are not specified, a trade-off analysis between energy efficiency and AVT system performance was conducted by adjusting the weight parameter 𝜔. Once stabilized, increasing 𝜔 prioritized energy savings by reducing the average fan speed S60 DC cooling system safe operations Introduce Residual Physics DRL approach (RP- SDRL) leverages thermodynamics to estimate desirable action ranges for guiding DRL explo- ration. It enhances learning and mitigates unsafe actions The annual energy consumption comparison for the entire DC, including all sub- systems, was conducted between the proposed SAC, RP-SAC, and the baseline physics algorithm. 
Results confirm that SAC and RP-SAC achieved over 10 % higher energy savings than the physics-based method S65 DC cooling system safe operations Propose a shielding DRL algorithm (DRL-S) that leverages techniques such as Lagrangian- based CDRL and Reward Shaping. Shielding can transform unsafe actions into safe action spaces Compared to DefaultE+, PPO-based algorithms achieved better energy savings than the SAC-based algorithms. The best performance related to energy saving came from the proposed PPO-Lag algorithm, which can save between 8.12 % and 12.45 % of the energy compared to the baseline. PPO10 achieved the best energy saving among the reward shaping algorithms at 10.67 %. The PPO1 and SAC1 were excluded from the comparison as they cannot effectively handle the violations Applied Energy 389 (2025) 125734 20 H. Kahil, S. Sharma, P. Välisuo et al. Table A.9 Selected ICT system studies: objectives and outcomes. ID Research problem (RQ4) Objectives (RQ5) Energy related results (RQ8) S4 Task scheduling Optimize DC energy consumption while main - taining a high level of QoS using DRL hybrid task scheduling framework Outperform the benchmark BF strategy by saving 14 % of energy. This is achieved by scheduling tasks more efficiently, thereby minimizing the number of active servers S7 VM consolidation Reduce energy by optimal selection and placement of VM using adaptive DRL algorithm Achieve up to a 125.24 % improvement in energy consumption Compared to MFFD S8 Task scheduling Decrease task response time and increase resource utilization using an adaptive DRL task scheduling framework The results show a 61.9 % improvement in energy consumption over the RR method when using the Google Cluster, while achieving 2.46 times better performance than RR when using the synthetic dataset, with faster response times in both cases S9 DC network traffic control Optimize NFV traffic for SFCs in DC, aiming to minimize energy consumption and maximize cost efficiency while meeting delay-guarantee constraints In both scenarios: fixed and dynamic SFCs, the energy consumption of both proposed models (TEDI, and TEDO) was nearly identical. However, the TEDI outperforms the TEDO in terms of computation time in dynamic scenarios S10 Job scheduling Reduce energy consumption and improve thermal conditions in compute-intensive, long-lasting jobs DC environments Optimize resource allocation by achieving the balance between server energy consumption and application performance utilizing an adaptive RL framework Effectively reduce energy consumption by more than 10 % and improved the running temperature of the processors by over 4 ◦C, while maintaining the same job processing throughput S11 VM consolidation Reduce energy consumption compared to the LR-MMT and VDT-UMC methods. However, the DTH-MF method achieve the same level of energy consumption as the proposed PPR-RL approach. 
Table A.9 Selected ICT system studies: objectives and outcomes.
ID | Research problem (RQ4) | Objectives (RQ5) | Energy-related results (RQ8)
S4 | Task scheduling | Optimize DC energy consumption while maintaining a high level of QoS using a DRL hybrid task-scheduling framework | Outperforms the benchmark BF strategy by saving 14 % of energy, achieved by scheduling tasks more efficiently and thereby minimizing the number of active servers
S7 | VM consolidation | Reduce energy by optimal selection and placement of VMs using an adaptive DRL algorithm | Achieves up to a 125.24 % improvement in energy consumption compared to MFFD
S8 | Task scheduling | Decrease task response time and increase resource utilization using an adaptive DRL task-scheduling framework | Shows a 61.9 % improvement in energy consumption over the RR method on the Google Cluster dataset and 2.46 times better performance than RR on the synthetic dataset, with faster response times in both cases
S9 | DC network traffic control | Optimize NFV traffic for SFCs in the DC, aiming to minimize energy consumption and maximize cost efficiency while meeting delay-guarantee constraints | In both fixed and dynamic SFC scenarios, the energy consumption of the two proposed models (TEDI and TEDO) was nearly identical; however, TEDI outperforms TEDO in computation time in dynamic scenarios
S10 | Job scheduling | Reduce energy consumption and improve thermal conditions in compute-intensive, long-lasting-job DC environments; optimize resource allocation by balancing server energy consumption and application performance using an adaptive RL framework | Effectively reduces energy consumption by more than 10 % and improves the running temperature of the processors by over 4 °C, while maintaining the same job-processing throughput
S11 | VM consolidation | - | Reduces energy consumption compared to the LR-MMT and VDT-UMC methods; the DTH-MF method, however, achieves the same level of energy consumption as the proposed PPR-RL approach. Nonetheless, DTH-MF uses temperature as the upper limit, which can affect the thermal conditions of hardware, making it unsuitable for real-world DCs
S13 | Task scheduling | Improve task scheduling while maintaining QoS using a hierarchical and hybrid online DRL agent in a warehouse-scale DC | Achieves up to a 47.88 % improvement in energy consumption compared to baseline state-of-the-art methods
S16 | VM placement | Cluster VMs using the K-means algorithm and employ a multi-reward RL approach to reduce energy consumption and improve resource utilization | Achieves 14 % better results than the GMPR algorithm on a synthetic dataset and up to 26 % improvement on real data (Amazon EC2), resulting in a significant reduction in energy consumption
S17 | VM placement | Reduce energy by proposing VMPMFuzzyORL (which uses a fuzzy system to create the reward signal) and MRRL (which utilizes the K-means clustering method) | VMPMFuzzyORL improves energy usage by 1 %-3 % compared to the benchmark methods, while MRRL achieves an energy-consumption improvement of 4 %-6 %
S18 | VM placement | Minimize the number of active physical machines (PMs) and reduce energy consumption and resource fragmentation by proposing a multi-objective RL approach | The proposed framework achieves up to a 17 % improvement in energy consumption compared to other benchmark algorithms
S19 | Task offloading/Resource (container) allocation | Develop a two-stage optimizer (ETHC) using a DRL agent with Lyapunov optimization to minimize energy consumption and public cloud DC rental costs | Outperforms other benchmark methods in maintaining energy consumption below the selected threshold (except for CLOUD, as no tasks are processed in the on-premises DC). Compared to REQ, the proposed ETHC incurs approximately 6 % additional cost while achieving a 40 % reduction in energy consumption
S21 | Task scheduling | Simultaneously optimize energy consumption and QoS in heterogeneous cloud DCs by implementing a DAG-based hierarchical DRL strategy | Outperforms other benchmark methods, achieving between 2.7 % and 21.6 % less energy consumption in three different scenarios with 10, 50, and 100 servers
S22 | Container placement | Reduce the number of active hosts while efficiently utilizing containers to enhance energy efficiency and curb the environmental impact in DCs | Outperforms the other baseline metaheuristic and FFD heuristic algorithms in terms of energy efficiency across all eight DC configurations
S26 | Task scheduling | Enhance energy efficiency and mitigate carbon emissions in a federated cloud DC environment (ERLFC) by managing task dependencies in DAGs | Depending on the dataset size, achieves 1 %-9 % better energy efficiency than traditional methods (RR and random) and reduces energy consumption by 5 %-26 % compared to heuristic algorithms (SO and GJO)
S27 | DC network traffic control | Optimize DC energy consumption by developing SmartDCN, a PAS-DQN framework that integrates a dynamic traffic predictor (TPM) using LSTM and an intelligent optimizer (EOM) with parametrized DRL | For a network of 250 servers, SmartDCN achieves energy savings of up to 11 %, 8 %, and 4 % compared to the three benchmark methods, respectively. It also demonstrates superior scalability, achieving higher energy savings under the same network load for larger network sizes
S28 | VM replacement | Enhance load balancing in two dimensions, between HMs and within each HM, using two fuzzy algorithms, Fuzzy-MOVMrB and Fuzzy-RLVMrB, to address non-dominance limitations | Although the main goal of the proposed methods is not to directly address energy consumption, the strategy relying on fuzzy logic and RL (Fuzzy-RLVMrB) outperforms the other methods in reducing energy consumption
S30 | VM placement | Achieve load balancing among HMs while limiting SLA violations and reducing energy consumption by proposing the VMP-A3C strategy; additionally, optimize the number of HMs through dynamic consolidation | Outperforms the other three DRL algorithms (A2C, DQN, and PPO) by 3.1 %, 1.6 %, and 1.4 %, respectively. It also achieves superior energy efficiency across different VM-count scenarios, delivering the best performance with 38.4 % more energy savings than MOVMrB and 7.6 % more than ADVMC
S31 | Job scheduling | Reduce operational costs in large-scale heterogeneous DCs under continuous job arrivals, considering job dependencies and QoS constraints | Outperforms the FIFO algorithm by 37 %, the Tetris algorithm by 30 %, and the ideal MPC algorithm by 1.6 % in terms of energy consumption costs
S32 | VM placement | Offer comprehensive solutions for high energy consumption, SLA violations, VM migrations, and migration duration in geo-distributed cloud DC environments | Achieves better energy efficiency than other benchmark algorithms, with improvements of about 5.5 %
S33 | VM scheduling | Optimize multiple objectives simultaneously, including energy consumption, performance costs, and SLA violations, for Industrial IoT (IIoT) DCs | Achieves the lowest energy consumption across five scenarios with varying VM counts: 33.3 % lower than EcoCloud, 15.19 % lower than KMI-MRCU for VM counts over 1000, and 10.79 % lower than AFED-EF for VM counts over 1000
S35 | Task scheduling/Resource allocation | Achieve energy efficiency and minimize SLA violations by demonstrating that container-based environments enhance DC energy consumption more effectively than VM-based methods | Outperforms state-of-the-art methods in energy consumption while improving the cloud DC's PUE. The proposed MADRL-RAC approach consumes only 29.637 W for 5000 nodes, compared with 49.346 W (DSTS), 43.926 W (LSTM), 38.625 W (RF), and 35.287 W (CNN)
S36 | Task scheduling | Enhance DeepEnergyJSV's performance by deploying the PPO algorithm to schedule hybrid tasks (independent and interdependent) in cloud DCs, improving energy efficiency | Outperforms state-of-the-art algorithms in energy efficiency, achieving maximum energy reductions of 7.81 %, 8.93 %, 13.86 %, and 8.69 % compared to REINFORCE, FF, Random, and Tetris, respectively
S37 | Task scheduling/Resource allocation | Enhance energy efficiency and scalability in DCs by proposing an LSTM model for load prediction, followed by a combination of DQN and DPSO | Outperforms all benchmark algorithms in minimizing energy consumption across all cases. For example, in Case 1, the proposed framework reduced energy consumption by 2.63 %, 5.24 %, 41.61 %, and 65.26 % compared to Load Aware, traditional DQN, FFO-EVMM, and MIMT, respectively
S38 | VM consolidation | Reduce energy consumption, enhance SLA, and optimize DC resources by developing ARLCA, an autonomous RL agent interacting with the environment | Both proposed Q-learning policies consume more energy than the SARSA policies within the same selection strategy. Additionally, softmax policies consumed less energy than 𝜖-greedy policies, though the difference was minimal, ranging from 0.02 % to 0.2 %
S39 | Job scheduling | Optimize energy consumption while maintaining the required high level of QoS for jobs received in real time in a cloud DC environment | Confirms the superiority of the proposed strategy over baseline methods in energy cost. For example, in scenarios with varying mean arrival rates, baseline methods had similar energy costs, while the proposed algorithm achieved a 60 %-70 % reduction
S40 | DC network traffic control | Address the challenge of achieving energy efficiency in DCNs by introducing a DRL framework for bandwidth allocation and routing optimization under dynamic, real-time flow demands | Outperforms heuristic baseline algorithms by up to 7.4 % and by 4 % on average in energy consumption. For small-scale DCNs, the Gurobi optimizer provides more energy-efficient solutions; however, when the number of network nodes exceeds 70, its convergence time becomes unacceptable, making results unavailable
S42 | VM consolidation | Reduce energy consumption and minimize SLA violations by developing a DRL-based VM consolidation (AVMC) strategy that addresses the challenge of continuous dynamic workloads | Confirms the superior performance of the proposed strategy over other methods. AVMC consumed 161.14, 138.31, 153, and 140.78 kWh for the four workloads, achieving energy savings of 6 %-15 % compared with benchmark algorithms
S43 | Resource scheduling | Ensure scalability, adaptability, and efficient resource allocation under varying workloads to reduce computational complexity, optimize task execution, enhance energy efficiency, and improve system performance | Compares across VM numbers ranging from 5 to 20. While energy consumption increases with higher VM counts, the proposed strategy consistently achieves the lowest energy consumption, outperforming state-of-the-art algorithms in energy savings
S44 | VM scheduling | Achieve energy efficiency in large-scale cloud DCs using a DRL-based VM scheduling framework, leveraging robust QoS features extracted through SDAE to enhance the scheduling algorithm | Outperforms benchmark algorithms in reducing energy consumption across two scenarios. On the World Cup dataset, the proposed SDAEM-MMQ algorithm saves 4.7 % and 22 % more energy than the other benchmark algorithms
S48 | Task scheduling | Enhance response time, CPU utilization, and energy efficiency in DCs with QEEC, a two-step RL scheduling framework featuring centralized task dispatching and dynamic local prioritization | Reduces energy consumption compared to benchmark methods and achieves the best energy savings among state-of-the-art Q-learning methods
S49 | Resource scheduling | Enhance QoS levels and improve energy savings in complex cloud DC environments by proposing a DQN algorithm | Achieves lower energy consumption than traditional benchmarks (Random and RR) across varying task numbers. However, it consumes slightly more energy than MoPSO for fewer than 200 tasks and outperforms MoPSO when tasks exceed 200
S51 | VM placement | Explore the NP-hard optimization problem of balancing two objectives, minimizing energy consumption and reducing resource wastage in DCs, while also addressing weight selection using the Chebyshev scalarization method | Outperforms multi-objective benchmark algorithms in nearly all scenarios by consuming less energy and reducing resource waste more effectively. It also demonstrates superior performance over single-objective algorithms across various scenarios
S53 | Job scheduling/Resource allocation | Achieve multi-objective optimization, including energy efficiency and job-delay reduction, in large-scale cloud computing environments | Shows that the energy consumption of the proposed scheduling algorithm (HDDL) is nearly equal to that of the Greedy algorithm among all baseline methods. In global job scheduling and resource allocation optimization, the HDDL-DQN framework outperforms other algorithms in energy efficiency by 5.7 % and 9.7 % compared to MLF and FERPTS, respectively
S54 | VM placement | Enhance Quality of Experience (QoE) and reduce data center energy consumption | In all four scenarios, the proposed algorithm outperforms the FF state-of-the-art algorithm in overall energy consumption. The dot-product method surpasses the proposed algorithm in two scenarios but is statistically equivalent to DRL-VMP. Norm2 outperforms the proposed algorithm in only one scenario, where the workload has a constant mean; in this case, simple heuristic methods are recommended
S59 | VM consolidation | Optimize energy consumption while maintaining QoS by proposing a centralized-distributed multi-agent RL algorithm called MAGNETIC | Using three synthetic traces, the results confirm that the proposed algorithm outperforms benchmark algorithms, reducing energy consumption by 58 %, 10 %, and 15 % compared to NPA, PABFD, and E-Eco, respectively, while maintaining QoS
S57 | Task scheduling | Improve energy consumption and reduce waiting time in a large-scale heterogeneous cloud DC | Small-scale DC (80,000 tasks): the proposed DRL algorithm achieved 23.8 % energy savings, while H2O-Cloud achieved 16.7 %, compared to Tetris. Medium-scale DC (120,000 tasks): the proposed DRL algorithm achieved 27.5 % energy savings, while H2O-Cloud achieved 21.4 %, compared to Tetris. Large-scale DC (240,000 tasks): the proposed algorithm achieved 35.4 % and H2O-Cloud 24.3 % energy savings compared to Tetris. However, Tetris achieved better QoS, resulting in shorter waiting times across all scenarios. Additionally, as scale and heterogeneity increase, the proposed algorithm effectively reduces energy consumption while maintaining QoS in task scheduling
S61 | DC network traffic control | Dynamically consolidate traffic without prior knowledge to enhance DCN energy efficiency | Shows that algorithms without FCT constraints (ElasticTree and CARPO) performed better than the proposed approach (SmartFCT). However, when considering FCT, the proposed scheme outperforms the benchmark method (FCTcon) by 11.3 %, 11.7 %, and 12.2 % on traffic datasets 1 to 3, respectively. Additionally, it achieves energy savings very close to the optimal algorithm across all datasets
S62 | Job scheduling/Resource allocation | Reduce energy consumption, lower operational costs, decrease makespan, and optimize resource allocation | Outperforms all benchmark algorithms across datasets by achieving lower energy consumption
S63 | Task scheduling | Reduce energy consumption while balancing throughput, resource utilization, and makespan in heterogeneous cloud computing environments | The energy aspect was evaluated in two scenarios, varying task numbers and varying VM counts; in both cases the proposed algorithm outperformed other methods in energy reduction. Simulations showed that as task or VM numbers increased, the algorithm improved load balancing and optimized resource utilization more effectively, minimizing energy consumption
S64 | Task scheduling | Minimize makespan, reduce energy consumption, lower operational costs, and maximize resource utilization | Reduces energy consumption compared to the MCP and ETF benchmarks across all datasets by optimizing resource count and frequency
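Many of the ICT-side studies in Table A.9 cast scheduling as repeatedly choosing a server for the next task and rewarding the agent with the negative power increase caused by that placement, typically on top of the common linear server power model P(u) = P_idle + (P_max - P_idle) * u. The sketch below is a minimal tabular Q-learning illustration of that formulation under assumed parameters; it is not the implementation of any specific study (such as QEEC or ARLCA), and the constants, state encoding, and helper names are hypothetical.

```python
# Hypothetical sketch: energy-aware task placement with tabular Q-learning.
# Reward = negative increase in cluster power caused by each placement decision.
import random

P_IDLE, P_MAX, N_SERVERS, CAPACITY = 100.0, 250.0, 4, 10  # illustrative values

def power(tasks_on_server):
    # Linear utilization-to-power model; an empty server is assumed switched off.
    if tasks_on_server == 0:
        return 0.0
    u = tasks_on_server / CAPACITY
    return P_IDLE + (P_MAX - P_IDLE) * u

def schedule_episode(q, eps, rng, n_tasks=20):
    load = [0] * N_SERVERS
    total_power = 0.0
    for _ in range(n_tasks):
        state = tuple(load)
        feasible = [s for s in range(N_SERVERS) if load[s] < CAPACITY]
        if rng.random() < eps:
            action = rng.choice(feasible)                     # explore
        else:
            action = max(feasible, key=lambda s: q.get((state, s), 0.0))
        # Reward: negative increase in cluster power caused by this placement.
        delta_p = power(load[action] + 1) - power(load[action])
        reward = -delta_p
        load[action] += 1
        total_power += delta_p
        next_state = tuple(load)
        best_next = max((q.get((next_state, s), 0.0)
                         for s in range(N_SERVERS) if next_state[s] < CAPACITY),
                        default=0.0)
        old = q.get((state, action), 0.0)
        q[(state, action)] = old + 0.1 * (reward + 0.95 * best_next - old)
    return total_power  # equals the cluster power after all placements

rng, q = random.Random(0), {}
for episode in range(300):                                    # decaying exploration
    schedule_episode(q, eps=max(0.05, 1.0 - episode / 200), rng=rng)
print("cluster power with greedy policy:", schedule_episode(q, eps=0.0, rng=rng), "W")
```

Because the first task placed on an idle server pays the full idle power while later tasks only add the utilization-dependent increment, a policy trained with this kind of reward tends to consolidate load onto fewer active servers, which is the mechanism behind several of the savings reported above.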
Table A.10 Selected integrated studies: objectives and outcomes.
ID | Research problem (RQ4) | Objectives (RQ5) | Energy-related results (RQ8)
S1 | Workflow scheduling/RES-powered DC | Reduce brown energy usage and optimize workflow execution across geo-distributed cloud DCs with a multi-agent DRL approach: a global scheduler assigns workflows to DCs, and a local scheduler allocates tasks to physical machines | Minimizes energy consumption by 47 % while outperforming the baseline algorithms
S2 | Load shifting/DC efficient cooling system optimization | Assess the flexibility of the DC cooling system by implementing the SAC-LSTM strategy to optimize costs, shifting cooling demand to low-price periods via RTP while maintaining temperature constraints | Using real-world weather data and electricity price signals over multiple years, the proposed controller reduces energy costs by 2 %-4 % compared to PID and SAC controllers while maintaining the DC temperature within the desired range
S5 | Task scheduling/DC efficient cooling system optimization | Propose the DeepEE framework using a Parametrized Action Space Deep Q-Network (PADQN) to handle high-dimensional state-space and discrete-continuous action-space issues in the ICT and cooling systems, enhancing PUE, preventing rack overheating, and balancing load distribution | Improves power consumption for the IT and cooling systems compared to baseline algorithms using PUE metrics, saving 7 % over ICO, 15 % over CCO, 10 % over JCO, and 5 % over O-DQN
S6 | Job scheduling/DC efficient cooling system optimization | Propose the E-DRL strategy to make decisions based on critical events rather than time intervals, enhancing PUE, reducing regulating decisions, stabilizing system performance, and improving ICT-cooling coordination by addressing time-constant differences | Using two types of workload traces, the designed algorithm achieved better PUE than benchmark algorithms, including PADQN with time-driven cooling intervals (10, 300, and 900 s) and E-QL
S15 | Workflow scheduling/RES-powered DCs | Develop a novel multi-objective framework, CFWS, based on DRL to balance energy costs and carbon emissions across geo-distributed cloud DCs; it enables workload shifting via VM migration while maximizing renewable energy utilization | Reduces brown energy consumption by 5.67 % to 13.22 % compared to benchmark methods while maintaining the same energy usage. Additionally, it allocates more renewable energy and lowers carbon emissions
S25 | VM placement/DC efficient cooling system optimization (thermal awareness) | Handle vast state-action spaces and random delayed feedback in the DC environment with a scalable, hierarchical RL approach, improving energy efficiency, maintaining thermal conditions, and satisfying SLAs | Outperforms other benchmark methods by over 17 % in total energy savings
S34 | Job scheduling/DC efficient cooling system optimization | Propose a DRL strategy (MADDQN) with a two-agent structure to optimize job scheduling and cooling. Each agent has an action network (ActNet) for local data and an evaluation network (EvalNet) for global data, enabling centralized training and decentralized execution. This ensures ideal room temperature, meets workload deadlines, and reduces energy consumption | Outperforms other DQN algorithms in energy efficiency, achieving the best PUE, the lowest total energy consumption, reduced hot spots, and improved scalability for larger DC configurations
S45 | Task scheduling/DC efficient cooling system optimization | Propose a DRL-based framework to optimize IT and facility operations in DCs. The DRL algorithm interacts with the physical DC system by continuously collecting real-time states and applying control actions across three areas: thermal-aware task scheduling (DDPG optimizes resource allocation with thermal considerations), load-aware target cooling (DQN manages CRAC airflow based on IT workload), and IT-facility optimization (PADQN coordinates IT and facility operations for global efficiency) | Cooling energy savings: up to a 15 % reduction in cooling energy for air-cooled systems and up to 30 % for water-cooled systems. Overall energy efficiency: joint IT and facility optimization achieved up to 15 % total energy savings compared to baseline manual control. Task scheduling: thermal-aware task scheduling reduced IT power consumption by 9 %
S46 | Task scheduling/DC efficient cooling system optimization | Propose a multi-agent framework (MAC3C) to jointly optimize IT infrastructure and cooling operations, enhancing DC energy efficiency. The framework interacts with the real-time DC environment instead of sub-system models to observe the dynamic state space and generate corresponding discrete-continuous actions | Compared to traditional methods (Random and RR), the proposed framework achieves significant energy savings. Additionally, MAC3C consumes 42.82 % and 18.95 % less energy than the joint optimization approaches DeepEE and PowerTrade, respectively
S50 | Workflow scheduling/RES-powered DCs | Propose a green workload framework that simultaneously optimizes two objectives: maximizing fuel cell utilization benefits and minimizing power budget fragmentation | Confirms the proposed algorithm's effectiveness in reducing energy consumption when the number of DCs exceeds two, demonstrating its efficiency in high-dimensional upper-level environments. Additionally, in low-level DC environments with varying rack numbers, the algorithm consumes less energy and minimizes power budget fragmentation. It achieves energy savings of up to 7.5 %, 5.2 %, and 4.3 % compared to benchmark methods
S52 | Workflow scheduling/DC efficient cooling system optimization/Battery charge and discharge | Address the challenge of dynamically optimizing DC operations over a year with a DRL framework that combines D3QN and VDN in a multi-agent system to train three DRL agents optimizing battery charging/discharging, computational workload distribution, and waste-heat utilization from the cooling system, achieving global integration and efficiency | Confirms that the proposed framework improves energy efficiency compared to pre-optimization, achieving an 18.37 % reduction in renewable energy waste, a 9.78 % improvement in operational cost efficiency, a 4.01 % reduction in electricity consumption, and a 29.74 % decrease in grid electricity consumption. Additionally, results show that algorithmic optimization outperforms non-algorithmic methods in reducing renewable energy waste and operational costs
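Several of the joint IT-cooling studies above, and the MDP summaries in Table A.11 that follows (for example S5 and S6), use the facility-level PUE as the core reward signal and subtract penalties for rack overheating and server overloading. The function below is a hedged illustration of that reward shape; the thresholds, weights, and signature are assumptions made for the example rather than values reported by the studies.

```python
# Illustrative sketch (assumed constants, not any study's actual code): a
# PUE-based reward with overheating and overloading penalties.
def pue(it_power_kw: float, cooling_power_kw: float, other_power_kw: float = 0.0) -> float:
    """Power Usage Effectiveness: total facility power divided by IT power."""
    return (it_power_kw + cooling_power_kw + other_power_kw) / it_power_kw

def joint_reward(it_power_kw, cooling_power_kw, rack_inlet_temps_c, server_utils,
                 temp_limit_c=27.0, util_limit=0.9, k_temp=1.0, k_util=1.0):
    # Lower PUE is better, so the base reward is its negative.
    r = -pue(it_power_kw, cooling_power_kw)
    # Penalize overheated racks and overloaded servers, as in the S5/S6 formulations.
    r -= k_temp * sum(max(0.0, t - temp_limit_c) for t in rack_inlet_temps_c)
    r -= k_util * sum(max(0.0, u - util_limit) for u in server_utils)
    return r

print(joint_reward(it_power_kw=400.0, cooling_power_kw=160.0,
                   rack_inlet_temps_c=[24.5, 26.0, 28.2], server_utils=[0.7, 0.95]))
```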
Table A.11 Overview of the main MDP elements in each joint-optimization selected work.

S1
State space {S}: For the global agent: green-energy surplus or deficit levels of the i-th data center, the average processing speed of a server in the i-th data center, the current utilization level of the i-th data center, the CPU requirement of the j-th task, and the memory requirement of the j-th task. For the local scheduler: the processing speed of the 𝜆i-th server in data center dci, the current utilization level of the 𝜆i-th server in data center dci, the CPU requirement of the j-th task, and the memory requirement of the j-th task
Action space {A}: For the global agent: the selected data center (the local scheduler) where the task will be executed. For the local scheduler: the selection of a server mi on which the task will be executed
Reward function {R}: For the global agent: the current green-energy deficit or surplus of the selected data center, and the local scheduler that utilizes the physical machine to execute the task. For the local scheduler: energy efficiency and execution time
Constraints: NA
Discount factor: Unspecified

S2
State space {S}: Outdoor (dry-bulb) air temperature, the indoor air temperatures of both data center zones, the IT equipment power demand, the HVAC equipment power demand, the real-time electricity price, and the hourly electricity cost
Action space {A}: The common setpoint of the outlet air temperature of the evaporative coolers and cooling coils, as well as the mass flow rates of the fans in the air handler (i.e., fan mass flow rate) for both zones of the data center
Reward function {R}: Keeping the temperatures in a recommended range in each data center zone, and the electricity cost of operating the data center through RTP
Constraints: NA
Discount factor: 0.99 for 1 day, 0.997 for 2 weeks

S5
State space {S}: The number of required CPU cores of the candidate task i, the airflow rates, the IT workload states, and the thermal states
Action space {A}: The action for task scheduling is to select a proper server; the action for regulating the cooling facilities is to adjust the airflow rate
Reward function {R}: The PUE of the data center, with penalties for overheating of each rack n and overloading of each server k
Constraints: The number of available CPUs must be greater than the required number of CPUs to perform the dispatched task. The airflow rate for each of the J ACUs must be greater than 0 and less than or equal to the maximum airflow rate that the ACUs can supply
Discount factor: 0.99

S6
State space {S}: Instead of a state space, this study uses an event space, including: job-dispatching events, activated when the length of the job queue is greater than zero; state-transition events, representing the common state transitions of CPU cores, server utilization, power consumption, the two inlet temperatures, and the outlet temperature; change-ratio events, which sense the change ratio of utilization and outlet temperature for each rack; and boundary events, which handle critical scenarios, including hotspot events and over-provisioning events
Action space {A}: The action for task scheduling is to select a proper server; the action for regulating the cooling facilities is to adjust the airflow rate
Reward function {R}: The PUE of the data center, with a penalty for overheating of each rack n and a penalty for overloading of each server k
Constraints: The number of available CPUs must be greater than the required number of CPUs to perform the dispatched task. The airflow rate for each of the J ACUs must be greater than 0 and less than or equal to the maximum airflow rate that the ACUs can supply
Discount factor: 0.99

S15
State space {S}: The CPU utilization of the j-th host of DC k
Action space {A}: Selecting VM migration schemes (migrated VM, destination DC, and PM) to balance overloaded or underloaded PMs
Reward function {R}: The energy cost and the carbon footprint
Constraints: The CPU capacity
Discount factor: Unspecified

S25
State space {S}: A tree structure to distribute VM requests among the servers; the inner nodes of the tree serve a double role as VM request distributors and states of the Markov model
Action space {A}: Picking a route from a particular node
Reward function {R}: Thermal: server inlet temperature and CPU temperature. Power: power consumption relative to utilization rate
Constraints: Maximum VM numbers: a constraint can be set on the maximum number of VMs that can be placed on each PM to prevent overloading, while putting other PMs into a sleep state
Discount factor: 0.99

S34
State space {S}: For the ICT system: the current server resource state and the job request resources in the queue. For the cooling system: server inlet air temperature, CRAC return air temperature, and CRAC set temperature
Action space {A}: For the ICT system: assigning job j in the queue to a server. For the cooling system: the temperature regulation action
Reward function {R}: For the ICT system: minimize the energy consumption of the servers on the premise of avoiding hot spots. For the cooling system: minimize the cooling energy consumption while ensuring the safe operation of the servers
Constraints: NA
Discount factor: Unspecified

S45
State space {S}: For load-aware target cooling: power consumption and workload of the ICT subsystem, along with cooling subsystem power consumption and ambient temperature. For thermal-aware task scheduling: power consumption of the ICT and cooling systems, and IT subsystem temperatures. For iterative IT-cooling optimization: observations from the discrete-space ICT subsystem and the continuous-space cooling subsystem
Action space {A}: For load-aware target cooling: airflow rate adjustment and pump flow rate adjustment. For thermal-aware task scheduling: assigning a task to a specific server in a thermal-aware manner. For iterative IT-cooling optimization: both kinds of control actions simultaneously
Reward function {R}: For load-aware target cooling: optimize the trade-off between IT workload, ambient temperature, and facility energy cost. For thermal-aware task scheduling: minimize IT and facility power consumption while maintaining server temperature and computing throughput. For iterative IT-cooling optimization: jointly control the IT and facility subsystems to balance energy consumption and efficiency
Constraints: NA
Discount factor: Unspecified

S46
State space {S}: The available resources of each server, the power consumption of each server, the adaptability scores of each server for the current task, the outlet air temperature of each rack, the power consumption of each rack, the supply air temperature and flow rate, and the requested resources of the current task
Action space {A}: For the ICT system: scheduling the task to a server. For the cooling system: the supply air temperature and the flow rate
Reward function {R}: The direct power influence of actions, measured by the change in IT and cooling power before and after execution; the waiting time of the current task; the available resources that the target server holds; and the outlet air temperature of the servers
Constraints: The supply air temperature range and flow rate of the CRACs should respect upper and lower thresholds, and the optimization should consider both factors alongside task scheduling to maximize data center power savings
Discount factor: 0.99

S50
State space {S}: Global scheduler: green-energy surplus/deficit levels of the i-th data center, the average server processing speed in the i-th data center, the current utilization level of the i-th data center, and the CPU and memory requirements of the j-th task. Local scheduler: the processing speed of the 𝜆i-th server in the data center, the current utilization level of the 𝜆i-th server in the data center, and the CPU and memory requirements of the j-th task
Action space {A}: Global scheduler: the selection of a data center (and hence a local scheduler) to which the task will be submitted for execution. Local scheduler: the selection of a server on which the task will be executed
Reward function {R}: Global scheduler: maximize green-energy utilization by balancing surplus/deficit in the selected data center, plus the reward from the local scheduler for successful task allocation. Local scheduler: a weighted function of the task execution time and the corresponding energy consumption during the execution of the task
Constraints: NA
Discount factor: Unspecified

S52
State space {S}: For the ICT system (computational workload): ambient temperature, municipal electricity price, average occupancy rate, battery capacity, wind power, and photovoltaic generation. For the cooling system (heating temperature of the surrounding buildings): ambient temperature, municipal electricity price, battery capacity, wind power, and photovoltaic generation. For the power system (battery charge/discharge): ambient temperature, municipal electricity price, battery capacity, wind power, photovoltaic generation, and total electricity consumption of ICT and cooling
Action space {A}: The adjustment parameter of computational workload scheduling; the adjustment parameter of the heating temperature of the surrounding buildings; and the adjustment parameter of battery charge and discharge, i.e., the change of state of charge (SOC)
Reward function {R}: The global reward (reducing renewable energy waste and operational cost) is used as a common reward for evaluating the optimization situation. For the ICT and cooling systems: total electricity consumption and grid electricity consumption
Constraints: NA
Discount factor: Unspecified
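For reference, the discount factor listed for each study in Table A.11 is the γ that weights future rewards in the standard discounted-return objective, and the constraint rows correspond to the constrained-MDP form optimized by the safe-operation studies (for example, the Lagrangian-based CDRL in S65). The following is the textbook formulation rather than any single study's notation:

```latex
% Standard discounted-return objective; \gamma is the discount factor of Table A.11,
% r the energy-related reward, c a constraint cost (e.g., a thermal violation),
% and d the allowed constraint budget in the constrained variant.
\begin{align}
  J(\pi) &= \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t)\right],
  \qquad 0 \le \gamma < 1,\\
  \pi^{*} &= \arg\max_{\pi} J(\pi)
  \quad \text{s.t.} \quad
  \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, c(s_t, a_t)\right] \le d.
\end{align}
```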
Data availability

Data will be made available on request.

References

[1] International Energy Agency. Analysis and forecast to 2026. IEA Report; 2024. https://www.iea.org/reports/electricity-2024.
[2] Kamiya G, Bertoldi P, et al. Energy consumption in data centres and broadband communication networks in the EU. European Commission, Joint Research Centre; 2024.
[3] Andrae AS, Edler T. On global electricity usage of communication technology: trends to 2030. Challenges 2015;6(1):117–57.
[4] Zhang Y, Tang H, Li H, Wang S. Unlocking the flexibilities of data centers for smart grid services: optimal dispatch and design of energy storage systems under progressive loading. Energy 2025;316:134511.
[5] Jayanetti A, Halgamuge S, Buyya R. Deep reinforcement learning for energy and time optimized scheduling of precedence-constrained tasks in edge-cloud computing environments. Fut Gener Comput Syst 2022 Dec;137:14–30.
[6] Iyengar M, Schmidt R, Caricari J. Reducing energy usage in data centers through control of room air conditioning units. In: 2010 12th IEEE Intersociety Conference on Thermal and Thermomechanical Phenomena in Electronic Systems. IEEE; 2010. p. 1–11.
[7] Kumar R, Khatri SK, Diván MJ. Data center air handling unit fan speed optimization using machine learning techniques. In: 2021 9th International conference on reliability, infocom technologies and optimization (trends and future directions) (ICRITO). IEEE; 2021. p. 1–10.
[8] Marcinichen JB, Olivier JA, Thome JR. On-chip two-phase cooling of datacenters: cooling system and energy recovery evaluation. Appl Therm Eng 2012;41:36–51.
[9] Wang H, Yuan X, Zhang K, Lang X, Chen H, Yu H, et al. Performance evaluation and optimization of data center servers using single-phase immersion cooling. Int J Heat Mass Transfer 2024;221:125057.
[10] Gao T, Sammakia BG, Geer J, Murray B, Tipton R, Schmidt R. Comparative analysis of different in row cooler management configurations in a hybrid cooling data center. In: International electronic packaging technical conference and exhibition; vol. 56888. American Society of Mechanical Engineers; 2015. p. V001T09A011.
[11] Shalom Simon V, Modi H, Sivaraju KB, Bansode P, Saini S, Shahi P, et al. Feasibility study of rear door heat exchanger for a high capacity data center. In: International electronic packaging technical conference and exhibition; vol. 86557. American Society of Mechanical Engineers; 2022. p. V001T01A018.
[12] Deymi-Dashtebayaz M, Namanlo SV, Arabkoohsar A. Simultaneous use of air-side and water-side economizers with the air source heat pump in a data center for cooling and heating production. Appl Therm Eng 2019;161:114133.
[13] Jang Y, Lee D, Kim J, Ham SH, Kim Y. Performance characteristics of a waste-heat recovery water-source heat pump system designed for data centers and residential area in the heating dominated region. J Build Eng 2022;62:105416.
[14] Oró E, Depoorter V, Pflugradt N, Salom J. Overview of direct air free cooling and thermal energy storage potential energy savings in data centres. Appl Therm Eng 2015;85:100–10.
[15] Bousnina D, Guerassimoff G. Deep reinforcement learning for optimal energy management of multi-energy smart grids. In: Nicosia G, Ojha V, La Malfa E, La Malfa G, Jansen G, Pardalos PM, Giuffrida G, Umeton R, editors. Machine learning, optimization, and data science. Cham: Springer International Publishing; 2022. p. 15–30.
[16] Sutton RS, Barto AG. Reinforcement learning: an introduction. Cambridge: MIT Press; 1998.
[17] Chang Q, Huang Y, Liu K, Xu X, Zhao Y, Pan S. Optimization control strategies and evaluation metrics of cooling systems in data centers: a review. Sustainability 2024;16(16):7222.
[18] Shaqour A, Hagishima A. Systematic review on deep reinforcement learning-based energy management for different building types. Energies 2022;15(22):8663.
[19] Garí Y, Monge DA, Pacini E, Mateos C, Garino CG. Reinforcement learning-based application autoscaling in the cloud: a survey. Eng Appl Artif Intel 2021;102:104288.
[20] Magotra B, Malhotra D, Dogra AK. Adaptive computational solutions to energy efficiency in cloud computing environment using VM consolidation. Arch Comput Methods Eng 2023;30(3):1789–818.
[21] Zhou G, Tian W, Buyya R, Xue R, Song L. Deep reinforcement learning-based methods for resource scheduling in cloud computing: a review and future directions. Artif Intell Rev 2024;57(5):124.
[22] Hou H, Jawaddi SNA, Ismail A. Energy efficient task scheduling based on deep reinforcement learning in cloud environment: a specialized review. Fut Gener Comput Syst 2024;151:214–31.
[23] Singh S, Kumar R, Singh D. An empirical investigation of task scheduling and VM consolidation schemes in cloud environment. Comput Sci Rev 2023;50:100583.
[24] Lin W, Lin J, Peng Z, Huang H, Lin W, Li K. A systematic review of green-aware management techniques for sustainable data center. Sustain Comput Inf Syst 2024;100989.
[25] Long S, Li Y, Huang J, Li Z, Li Y. A review of energy efficiency evaluation technologies in cloud data centers. Energy Build 2022;260:111848.
[26] Zhang W, Wen Y, Wong YW, Toh KC, Chen C-H. Towards joint optimization over ICT and cooling systems in data centre: a survey. IEEE Commun Surv Tutor 2016;18(3):1596–616.
[27] Mnih V, Kavukcuoglu K, Silver D, Rusu AA, Veness J, Bellemare MG, et al. Human-level control through deep reinforcement learning. Nature 2015 Feb;518:529–33.
[28] Frank LL, Vrabie D, VamVoudakis KG. Reinforcement learning and feedback control: using natural decision methods to design optimal adaptive controllers. IEEE Control Syst 2012 Dec;32:76–105.
[29] Busoniu L, Babuska R, De Schutter B, Ernst D. Reinforcement learning and dynamic programming using function approximators. CRC Press; 2017 July.
[30] Zanini E. Markov decision processes. Citeseer; 2014.
[31] Ladosz P, Weng L, Kim M, Oh H. Exploration in deep reinforcement learning: a survey. Info Fusion 2022 Sep;85:1–22.
[32] Bellman R. Dynamic programming. Science 1966;153(3731):34–7.
[33] Kaelbling LP, Littman ML, Moore AW. Reinforcement learning: a survey. J Artif Intell Res 1996 May;4:237–85.
[34] Watkins CJ, Dayan P. Q-learning. Mach Learn 1992;8:279–92.
[35] Rummery GA, Niranjan M. On-line Q-learning using connectionist systems, vol. 37. Cambridge, UK: University of Cambridge, Department of Engineering; 1994.
[36] Sutton RS. Dyna, an integrated architecture for learning, planning, and reacting. ACM Sigart Bull 1991;2(4):160–3.
[37] Winands MH, Björnsson Y, Saito J-T. Monte-Carlo tree search solver. In: Computers and games: 6th international conference, CG 2008, Beijing, China, September 29–October 1, 2008. Proceedings 6; Springer; 2008. p. 25–36.
[38] Wang X, Wang S, Liang X, Zhao D, Huang J, Xu X, et al. Deep reinforcement learning: a survey. IEEE Trans Neural Netw Learn Syst 2024 Apr;35:5064–78.
[39] Li Y. Deep reinforcement learning: an overview. arXiv:1701.07274; 2018 Nov.
[40] Shao K, Tang Z, Zhu Y, Li N, Zhao D. A survey of deep reinforcement learning in video games. arXiv:1912.10944; 2019 Dec.
[41] Parvez Farazi N, Zou B, Ahamed T, Barua L. Deep reinforcement learning in transportation research: a review. Transp Res Interdiscip Perspect 2021 Sep;11:100425.
[42] Cao D, Hu W, Zhao J, Zhang G, Zhang B, Liu Z, et al. Reinforcement learning and its applications in modern power and energy systems: a review. J Mod Power Syst Clean Energy 2020 Nov;8:1029–42.
[43] Haarnoja T, Zhou A, Abbeel P, Levine S. Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor. In: Proceedings of the 35th international conference on machine learning. PMLR; 2018 July. p. 1861–70. ISSN: 2640-3498.
[44] Kurte K, Munk J, Kotevska O, Amasyali K, Smith R, McKee E, et al. Evaluating the adaptability of reinforcement learning based HVAC control for residential houses. Sustainability 2020 Sep;12:7727.
[45] Liberati A, Altman DG, Tetzlaff J, Mulrow C, Gøtzsche PC, Ioannidis JP, et al. The PRISMA statement for reporting systematic reviews and meta-analyses of studies that evaluate health care interventions: explanation and elaboration. Ann Intern Med 2009;151(4):W-65.
[46] Munn Z, Tufanaru C, Aromataris E. JBI's systematic reviews: data extraction and synthesis. AJN Am J Nurs 2014;114(7):49–54.
[47] Jayanetti A, Halgamuge S, Buyya R. Multi-agent deep reinforcement learning framework for renewable energy-aware workflow scheduling on distributed cloud data centers. IEEE Trans Parallel Distrib Syst 2024;35(4):604–15.
[48] Biemann M, Gunkel PA, Scheller F, Huang L, Liu X. Data center HVAC control harnessing flexibility potential via real-time pricing cost optimization using reinforcement learning. IEEE Internet Things J 2023;10(15):13876–94.
[49] Wan J, Duan Y, Gui X, Liu C, Li L, Ma Z. SafeCool: safe and energy-efficient cooling management in data centers with model-based reinforcement learning. IEEE Trans Emerg Top Comput Intel 2023;7(6):1621–35.
[50] Lou J, Tang Z, Jia W. Energy-efficient joint task assignment and migration in data centers: a deep reinforcement learning approach. IEEE Trans Netw Serv Manage 2023;20(2):961–73.
[51] Ran Y, Hu H, Wen Y, Zhou X. Optimizing energy efficiency for data center via parameterized deep reinforcement learning. IEEE Trans Serv Comput 2023;16(2):1310–23.
[52] Ran Y, Zhou X, Hu H, Wen Y. Optimizing data center energy efficiency via event-driven deep reinforcement learning. IEEE Trans Serv Comput 2023;16(2):1296–309.
[53] Zeng J, Ding D, Kang XK, Xie H, Yin Q. Adaptive DRL-based virtual machine consolidation in energy-efficient cloud data center. IEEE Trans Parall Distrib Syst 2022;33(11):2991–3002.
[54] Kang K, Ding D, Xie H, Yin Q, Zeng J. Adaptive DRL-based task scheduling for energy-efficient cloud computing. IEEE Trans Netw Serv Manage 2022;19(4):4948–61.
[55] Pham T-M. Traffic engineering based on reinforcement learning for service function chaining with delay guarantee. IEEE Access 2021;9:121583–92.
[56] Yi D, Zhou X, Wen Y, Tan R. Efficient compute-intensive job allocation in data centers via deep reinforcement learning. IEEE Trans Parall Distrib Syst 2020;31(6):1474–85.
[57] Ding W, Luo F, Gu C, Lu H, Zhou Q. Performance-to-power ratio aware resource consolidation framework based on reinforcement learning in cloud data centers. IEEE Access 2020;8:15472–83.
[58] Li Y, Wen Y, Tao D, Guan K. Transforming cooling optimization for green data center via deep reinforcement learning. IEEE Trans Cybern 2020;50(5):2002–13.
[59] Cheng M, Li J, Bogdan P, Nazarian S. H2O-Cloud: a resource and quality of service-aware task scheduling framework for warehouse-scale data centers. IEEE Trans Comput Aided Des Integr Circuits Syst 2020;39(10):2925–37.
[60] Leindals L, Grønning P, Dominković DF, Junker RG. Context-aware reinforcement learning for cooling operation of data centers with an aquifer thermal energy storage. Energy AI 2024;17:100395.
[61] Zhao D, Zhou J-T, Li K. CFWS: DRL-based framework for energy cost and carbon footprint optimization in cloud data centers. IEEE Trans Sustain Comput 2025;10(1):95–107.
[62] Ghasemi A, Keshavarzi A. Energy-efficient virtual machine placement in heterogeneous cloud data centers: a clustering-enhanced multi-objective, multi-reward reinforcement learning approach. Clust Comput 2024;27(10):14149–66.
[63] Ghasemi A, Toroghi Haghighat A, Keshavarzi A. Enhancing virtual machine placement efficiency in cloud data centers: a hybrid approach using multi-objective reinforcement learning and clustering strategies. Computing 2024;106(9):2897–922.
[64] Bhatt C, Singhal S. Multi-objective reinforcement learning for virtual machines placement in cloud computing. Int J Adv Comput Sci Appl 2024;15(3):1051–8.
[65] Zhang J, Yu H, Fan G, Li Z. Elastic task offloading and resource allocation over hybrid cloud: a reinforcement learning approach. IEEE Trans Netw Serv Manage 2024;21(2):1983–97.
[66] Guo Y, Qu S, Wang C, Xing Z, Duan K. Optimal dynamic thermal management for data center via soft actor-critic algorithm with dynamic control interval and combined-value state space. Appl Energy 2024;373:123815.
[67] Yang W, Zhao M, Li J, Zhang X. Energy-efficient DAG scheduling with DVFS for cloud data centers. J Supercomput 2024;80(10):14799–823.
[68] Bouaouda A, Afdel K, Abounacer R. Unveiling genetic reinforcement learning (GRLA) and hybrid attention-enhanced gated recurrent unit with random forest (HAGRU-RF) for energy-efficient containerized data centers empowered by solar energy and AI. Sustainability 2024;16(11):4438.
[69] Chen Y, Guo W, Liu J, Shen S, Lin J, Cui D. A multi-setpoint cooling control approach for air-cooled data centers using the deep Q-network algorithm. Meas Control 2024;57(6):782–93.
[70] Wang R, Cao Z, Zhou X, Wen Y, Tan R. Green data center cooling control via physics-guided safe reinforcement learning. ACM Trans Cyber-Phys Syst 2024;8(2):1–26.
[71] Aghasi A, Jamshidi K, Bohlooli A, Javadi B. A decentralized adaptation of model-free Q-learning for thermal-aware energy-efficient virtual machine placement in cloud data centers. Comput Netw 2023;224:109624.
[72] Wang Z, Chen S, Bai L, Gao J, Tao J, Bond RR, et al. Reinforcement learning based task scheduling for environmentally sustainable federated cloud computing. J Cloud Comp 2023;12(1):174.
[73] Wang T, Fan X, Cheng K, Du X, Cai H, Wang Y. Parameterized deep reinforcement learning with hybrid action space for energy efficient data center networks. Comput Netw 2023;235:109989.
[74] Ghasemi A, Toroghi Haghighat A, Keshavarzi A. Enhanced multi-objective virtual machine replacement in cloud data centers: combinations of fuzzy logic with reinforcement learning and biogeography-based optimization algorithms. Clust Comput 2023;26(6):3855–68.
[75] Huang N, Li X, Xu Q, Chen R, Chen H, Chen A. Artificial intelligence-based temperature twinning and pre-control for data center airflow organization. Energies 2023;16(16):6063.
[76] Wei P, Zeng Y, Yan B, Zhou J, Nikougoftar E. VMP-A3C: virtual machines placement in cloud computing based on asynchronous advantage actor-critic algorithm. J King Saud Univ Comput Inf Sci 2023;35(5):101549.
[77] Liu W, Yan Y, Sun Y, Mao H, Cheng M, Wang P, et al. Online job scheduling scheme for low-carbon data center operation: an information and energy nexus perspective. Appl Energy 2023;338:120918.
[78] Ahamed Z, Khemakhem M, Eassa F, Alsolami F, Basuhail A, Jambi K. Deep reinforcement learning for workload prediction in federated cloud environments. Sensors 2023;23(15):6911.
[79] Ma X, Xu H, Gao H, Bian M, Hussain W. Real-time virtual machine scheduling in industry IoT network: a reinforcement learning method. IEEE Trans Ind Inf 2023;19(2):2129–39.
[80] Simin W, Lulu Q, Chunmiao M, Weiguo W. Research on overall energy consumption optimization method for data center based on deep reinforcement learning. J Intell Fuzzy Syst 2023;44(5):7333–49.
[81] Nagarajan S, Rani PS, Vinmathi MS, Subba Reddy V, Saleth ALM, Abdus Subhahan D. Multi agent deep reinforcement learning for resource allocation in container-based clouds environments. Expert Syst 2025;42(1):e13362.
[82] Yang Y, He C, Yin B, Wei Z, Hong B. Cloud task scheduling based on proximal policy optimization algorithm for lowering energy consumption of data center. KSII Trans Internet Inf Syst 2022;16(6):1877–91.
[83] Pandey NK, Diwakar M, Shankar A, Singh P, Khosravi MR, Kumar V. Energy efficiency strategy for big data in cloud environment using deep reinforcement learning. Mob Inf Syst 2022;2022:1–11.
[84] Shaw R, Howley E, Barrett E. Applying reinforcement learning towards automating energy efficient virtual machine consolidation in cloud data centers. Inf Syst 2022;107:101722.
[85] Yan J, Huang Y, Gupta A, Gupta A, Liu C, Li J, et al. Energy-aware systems for real-time job scheduling in cloud data centers: a deep reinforcement learning approach. Comp Electr Eng 2022;99:107688.
[86] Wang Y, Li Y, Wang T, Liu G. Towards an energy-efficient data center network based on deep reinforcement learning. Comput Netw 2022;210:108939.
[87] Mahbod MHB, Chng CB, Lee PS, Chui CK. Energy saving evaluation of an energy efficient data center using a model-free reinforcement learning approach. Appl Energy 2022;322:119392.
[88] Abbas K, Hong J, Tu NV, Yoo J-H, Hong JW-K. Autonomous DRL-based energy efficient VM consolidation for cloud data centers. Phys Commun 2022;55:101925.
[89] Uma J, Vivekanandan P, Shankar S. Optimized intellectual resource scheduling using deep reinforcement Q-learning in cloud computing. Trans Emerg Tel Tech 2022;33(5):e4463.
[90] Wang B, Liu F, Lin W. Energy-efficient VM scheduling based on deep reinforcement learning. Fut Gener Comput Syst 2021;125:616–28.
[91] Zhou X, Wang R, Wen Y, Tan R. Joint IT-facility optimization for green data centers via deep reinforcement learning. IEEE Netw 2021;35(6):255–62.
[92] Chi C, Ji K, Song P, Marahatta A, Zhang S, Zhang F, et al. Cooperatively improving data center energy efficiency based on multi-agent deep reinforcement learning. Energies 2021;14(8):2071.
[93] Biemann M, Scheller F, Liu X, Huang L. Experimental evaluation of model-free reinforcement learning algorithms for continuous HVAC control. Appl Energy 2021;298:117164.
[94] Ding D, Fan X, Zhao Y, Kang K, Yin Q, Zeng J. Q-learning based dynamic task scheduling for energy-efficient cloud computing. Fut Gener Comput Syst 2020;108:361–71.
[95] Peng Z, Lin J, Cui D, Li Q, He J. A multi-objective trade-off framework for cloud resource scheduling based on the deep Q-network algorithm. Clust Comput 2020;23(4):2753–67.
[96] Hu X, Sun Y. A deep reinforcement learning-based power resource management for fuel cell powered data centers. Electronics 2020;9(12):2054.
[97] Qin Y, Wang H, Yi S, Li X, Zhai L. Virtual machine placement based on multi-objective reinforcement learning. Appl Intell 2020;50(8):2370–83.
[98] Yang D, Wang X, Shen R, Li Y, Gu L, Zheng R, et al. Global optimization strategy of prosumer data center system operation based on multi-agent deep reinforcement learning. J Build Eng 2024;91:109519.
[99] Lin J, Cui D, Peng Z, Li Q, He J. A two-stage framework for the multi-user multi-data center job scheduling and resource allocation. IEEE Access 2020;8:197863–74.
[100] Caviglione L, Gaggero M, Paolucci M, Ronco R. Deep reinforcement learning for multi-objective placement of virtual machines in cloud datacenters. Soft Comput 2021;25(19):12569–88.
[101] Le DV, Wang R, Liu Y, Tan R, Wong Y-W, Wen Y. Deep reinforcement learning for tropical air free-cooled data center control. ACM Trans Sen Netw 2021;17(3):1–28.
[102] Zhang Q, Zeng W, Lin Q, Chng C-B, Chui C-K, Lee P-S. Deep reinforcement learning towards real-world dynamic thermal management of data centers. Appl Energy 2023;333:120561.
[103] Li J, Zhang X, Wei Z, Wei J, Ji Z. Energy-aware task scheduling optimization with deep reinforcement learning for large-scale heterogeneous systems. CCF Trans HPC 2021;3(4):383–92.
[104] Wan J, Zhou J, Gui X. Intelligent rack-level cooling management in data centers with active ventilation tiles: a deep reinforcement learning approach. IEEE Intell Syst 2021;36(6):42–52.
[105] Haghshenas K, Pahlevan A, Zapater M, Mohammadi S, Atienza D. MAGNETIC: multi-agent machine learning-based approach for energy efficient dynamic consolidation in data centers. IEEE Trans Serv Comput 2022;15(1):30–44.
[106] Zhang Q, Mahbod MHB, Chng C-B, Lee P-S, Chui C-K. Residual physics and post-posed shielding for safe deep reinforcement learning method. IEEE Trans Cybern 2024;54(2):865–76.
[107] Sun P, Guo Z, Liu S, Lan J, Wang J, Hu Y. SmartFCT: improving power-efficiency for data center networks with deep reinforcement learning. Comput Netw 2020;179:107255.
[108] Asghari A, Sohrabi MK, Yaghmaee F. A cloud resource management framework for multiple online scientific workflows using cooperative reinforcement learning agents. Comput Netw 2020;179:107340.
[109] Siddesha K, Jayaramaiah GV, Singh C. A novel deep reinforcement learning scheme for task scheduling in cloud computing. Clust Comput 2022;25(6):4171–88.
[110] Asghari A, Sohrabi MK, Yaghmaee F. Online scheduling of dependent tasks of cloud's workflows to enhance resource utilization and reduce the makespan using multiple reinforcement learning-based agents. Soft Comput 2020;24(21):16177–99.
[111] Zhang Q, Chng C-B, Chen K, Lee P-S, Chui C-K. DRL-S: toward safe real-world learning of dynamic thermal management in data center. Expert Syst Appl 2023;214:119146.
[112] Shao X, Zhang Z, Song P, Feng Y, Wang X. A review of energy efficiency evaluation metrics for data centers. Energy Build 2022;271:112308.
[113] Jin C, Bai X, Yang C, Mao W, Xu X. A review of power consumption models of servers in data centers. Appl Energy 2020;265:114806.
[114] Moriyama T, Magistris GD, Tatsubori M, Pham T-H, Munawar A, Tachibana R. Reinforcement learning testbed for power-consumption optimization. In: Proceedings of Asia Simulation conference (AsiaSim), Kyoto, Japan; 2018. p. 45–59.
[115] Phan L, Lin C-X. A multi-zone building energy simulation of a data center model with hot and cold aisles. Energy Build 2014;77:364–76.
[116] Sun K, Luo N, Luo X, Hong T. Prototype energy models for data centers. Energy Build 2021;231:110603.
[117] U.S. Department of Energy. EnergyPlus: energy simulation software; 2024. https://energyplus.net/ [Accessed 18.11.2024].
[118] OpenFOAM Foundation. OpenFOAM: the open source CFD toolbox; 2024. https://www.openfoam.com/ [Accessed 18.11.2024].
[119] Cadence Design Systems. Cadence reality digital twin platform; 2024. https://www.cadence.com/en_US/home/tools/reality-digital-twin.html [Accessed 19.11.2024].
[120] Van Geet O, Sickinger D. Best practices guide for energy-efficient data center design. Technical Report. Golden, CO (United States): National Renewable Energy Laboratory (NREL); 2024.
[121] Sharma P, Chaufournier L, Shenoy P, Tay Y. Containers and virtual machines at scale: a comparative study. In: Proceedings of the 17th international middleware conference; 2016. p. 1–13.
[122] Calheiros RN, Ranjan R, Beloglazov A, De Rose CA, Buyya R. CloudSim: a toolkit for modeling and simulation of cloud computing environments and evaluation of resource provisioning algorithms. Softw Pract Exp 2011;41(1):23–50.
[123] Chen W, Deelman E. Workflowsim: a toolkit for simulating scientific workflows in distributed environments. In: 2012 IEEE 8th international conference on E-science. IEEE; 2012. p. 1–8.
[124] He H, Meng X, Wang Y, Khajepour A, An X, Wang R, et al. Deep reinforcement learning based energy management strategies for electrified vehicles: recent advances and perspectives. Renew Sustain Energy Rev 2024;192:114248.
[125] Metrics GG. Describing datacenter power efficiency. Technical Committee White Paper. The Green Grid; 2007.
[126] Horner N, Azevedo I. Power usage effectiveness in data centers: overloaded and underachieving. Electr J 2016;29(4):61–9.
[127] Patterson M, Tschudi B, Vangeet O, Cooley J, Azevedo D. ERE: a metric for measuring the benefit of reuse energy from a data center. White Paper 29; 2010.
[128] Sego LH, Marquez A, Rawson A, Cader T, Fox K, Gustafson WI Jr, et al. Implementing the data center energy productivity metric. ACM J Emerg Technol Comput Syst 2012;8(4):1–22.
[129] Green I. New data center energy efficiency evaluation index DPPE (datacenter performance per energy). Measurement guidelines (ver 2.05). 2012 Mar.
[130] Reddy VD, Setz B, Rao GSV, Gangadharan G, Aiello M. Metrics for sustainable data centers. IEEE Trans Sustain Comput 2017;2(3):290–303.
[131] Pham HX, La HM, Feil-Seifer D, Nefian A. Cooperative and distributed reinforcement learning of drones for field coverage. arXiv preprint arXiv:1803.07250; 2018.
[132] Qie H, Shi D, Shen T, Xu X, Li Y, Wang L. Joint optimization of multi-UAV target assignment and path planning based on multi-agent reinforcement learning. IEEE Access 2019;7:146264–72.
[133] Biagioni D, Zhang X, Wald D, Vaidhynathan D, Chintala R, King J, et al. PowerGridworld: a framework for multi-agent reinforcement learning in power systems. In: Proceedings of the thirteenth ACM international conference on future energy systems; 2022. p. 565–70.
[134] Wang J, Xu W, Gu Y, Song W, Green TC. Multi-agent reinforcement learning for active voltage control on power distribution networks. Adv Neural Inf Process Syst 2021;34:3271–84.
[135] Terry J, Black B, Grammel N, Jayakumar M, Hari A, Sullivan R, et al. PettingZoo: gym for multi-agent reinforcement learning. Adv Neural Inf Process Syst 2021;34:15032–43.
[136] Yang Y, Wang J. An overview of multi-agent reinforcement learning from game theoretical perspective. arXiv preprint arXiv:2011.00583; 2020.
[137] Oroojlooy A, Hajinezhad D. A review of cooperative multi-agent deep reinforcement learning. Appl Intell 2023;53(11):13677–722.
[138] Canese L, Cardarilli GC, Di Nunzio L, Fazzolari R, Giardino D, Re M, et al. Multi-agent reinforcement learning: a review of challenges and applications. Appl Sci 2021;11(11):4948.
[139] Ibrahim AM, Yau K-LA, Chong Y-W, Wu C. Applications of multi-agent deep reinforcement learning: models and algorithms. Appl Sci 2021;11(22):10870.
[140] Wang F, Wang X, Sun S. A reinforcement learning level-based particle swarm optimization algorithm for large-scale optimization. Inf Sci (NY) 2022;602:298–312.
[141] Biemann M, Scheller F, Liu X, Huang L. Experimental evaluation of model-free reinforcement learning algorithms for continuous HVAC control. Appl Energy 2021;298:117164.