Contents lists available at ScienceDirect

Applied Energy

journal homepage: www.elsevier.com/locate/apen

Reinforcement learning for data center energy efficiency optimization: A systematic literature review and research roadmap

Hussain Kahil a,∗, Shiva Sharma b, Petri Välisuo a, Mohammed Elmusrati a

a School of Technology and Innovation, University of Vaasa, Wolffintie 32, Vaasa, 65200, Finland
b School of Technology, Vaasa University of Applied Sciences, Wolffintie 30, Vaasa, 65200, Finland

HIGHLIGHTS

• Discusses using Reinforcement Learning (RL) for data center cooling systems.
• Discusses using RL for data center information and communication technology (ICT) systems.
• Provides a deep critical analysis of the energy optimization results.
• Presents comprehensive data extraction on experimental setups and benchmarks.
• Explores future directions in RL for optimizing energy in data center environments.

ARTICLE INFO

Keywords: Data center; Energy efficiency optimization; Cooling system; ICT system; Reinforcement learning (RL); Deep reinforcement learning (DRL)

ABSTRACT

With today's challenges posed by climate change, global attention is increasingly focused on reducing energy consumption within sustainable communities. As significant energy consumers, data centers represent a crucial area for research in energy efficiency optimization. To address this issue, various algorithms have been employed to develop sophisticated solutions for data center systems. Recently, Reinforcement Learning (RL) and its advanced counterpart, Deep Reinforcement Learning (DRL), have demonstrated promising potential in improving data center energy efficiency. However, a comprehensive review of the deployment of these algorithms remains limited.
In this systematic review, we explore the application of RL/DRL algorithms for optimizing data center energy efficiency, with a focus on optimizing the operation of cooling systems and Information and Communication Technology (ICT) processes, including task scheduling, resource allocation, virtual machine (VM) consolidation/placement, and network traffic control. Following the Preferred Reporting Items for Systematic review and Meta-Analysis (PRISMA) protocol, we provide a detailed overview of the methodologies and objectives of 65 identified studies, along with an in-depth analysis of their energy-related results. We also summarize key aspects of these studies, including benchmark comparisons, experimental setups, datasets, and implementation platforms. Additionally, we present a structured qualitative comparison of the Markov Decision Process (MDP) elements for joint optimization studies. Our findings highlight vital research gaps, including the lack of real-time validation for developed algorithms and the absence of multi-scale standardized metrics for reporting energy efficiency improvements. Furthermore, we propose joint optimization of multi-system objectives as a promising direction for future research.

∗ Corresponding author.
Email addresses: hussain.kahil@uwasa.fi (H. Kahil), shiva.sharma@vamk.fi (S. Sharma), petri.valisuo@uwasa.fi (P. Välisuo), mohammed.elmusrati@uwasa.fi (M. Elmusrati).

https://doi.org/10.1016/j.apenergy.2025.125734
Received 10 January 2025; Received in revised form 25 February 2025; Accepted 14 March 2025
Applied Energy 389 (2025) 125734
Available online 25 March 2025
0306-2619/© 2025 The Author(s). Published by Elsevier Ltd. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).
Nomenclature

A3C  Asynchronous advantage actor-critic
AC  Actor-critic
ACO  Ant Colony Optimization
ACS  Ant Colony System
ADVMC  Adaptive DRL based VM Consolidation
AFED-EF  Adaptive Four-threshold Energy-aware VM Deployment
ARLCA  Advanced RL Consolidation Agent
ATES  Aquifer Thermal Energy Storage
AVMC  Autonomous VM Consolidation
AVT  Active Ventilation Tile
BDQ  Branching Dueling Q-Network
BF  Best Fit
BFD  Best Fit Decreasing
CARPO  Correlation-AwaRe Power Optimization
CCO  Cooling Control Optimization
CDRL  Constrained DRL
CFD  Computational Fluid Dynamics
CFWS  Cost and carbon Footprint through Workload Shifting
CNN  Convolutional Neural Network
CSLB  Crow Search-based Load Balancing
CVP  Chemical reaction optimization-VMP-Permutation
CW  Chilled Water
D3QN  Dueling Deep Q Network
DAG  Directed Acyclic Graph
DBC  Deadline and Budget Constrained
DCI  Dynamic Control Interval
DCN  Data Center Network
DDPG  Deep Deterministic Policy Gradient
DL  Deep Learning
DPPE  Data Center Performance Per Energy
DPSO  Discrete Particle Swarm Optimization
DQN  Deep Q-Network
DRL  Deep Reinforcement Learning
DSTS  Dynamic Stochastic Task Scheduling
DTA  DRL-based Task Migration
DTH-MF  Dynamic Threshold Maximum Fit
DTM  Dynamic Thermal Management
DUE  De-underestimation Validation Mechanism
DX  Direct Expansion
ECA  Enclosed Cold Aisle
EDF  Earliest Deadline First
EMVO  Enhanced Multi-Verse Optimizer
EOM  Energy Optimization Module
EQBFD  Energy-efficient and QoS-aware BFD
ERE  Energy Reuse Effectiveness
ERLFC  Eco-friendly RL in Federated Cloud
ETAS  Energy and Thermal-Aware Scheduling
ETF  Earliest Time First
ETHC  Elastic Task Handler over hybrid Cloud
EVCT  Energy-efficient VM minimum Cut Theory
EVMM  Energy-aware VM Migration
FCT  Flow Completion Time
FERPTS  Fast and Energy-aware Resource Provisioning and Task Scheduling
FF  First Fit
FFD  First Fit Decreasing
FFO  FireFly Optimization
FIFO  First-In-First-Out
GA  Genetic Algorithm
GCD  Google Cluster Dataset
GEC  Green Energy Coefficient
GJO  Golden Jackal Optimization
GMPR  Greedy Minimizing Power consumption and Resource wastage
GRF  Generalized Resource-Fair
GRR  Generalized Round Robin
GRVMP  Greedy Randomized VM Placement
HDDL  Heterogeneous Distributed Deep Learning
HDRL  Hierarchical DRL
HEFT  Heterogeneous Earliest Time First
HGP  Heteroscedastic Gaussian Processes
HM  Host Machine
HVAC  Heating, Ventilation, and Air Conditioning
ICA  Imperialist Competitive Algorithm
ICO  IT Control Optimization
ICT  Information and Communication Technology
IGGA  Improved Grouping Genetic Algorithm
IQR  Inter-Quartile Range
ITEE  IT Equipment Energy
ITEU  IT Equipment Utilization
JCO  Joint IT and Cooling Control Optimization Algorithm
KMI-MRCU  K-Means clustering algorithm-Midrange-Interquartile range
LECC  Location, Energy, Carbon and Cost-aware VM placement
LR  Logistic Regression
LRR  Local Regression Robust
LSTM  Long Short-Term Memory
MAD  Median Absolute Deviation
MAGNETIC  Multi-AGent machine learNing-based approach for Energy efficienT dynamIc Consolidation
MBAC  Model-Based Actor-Critic
MBHC  MBRL-based HVAC Control
MBRL  Model-Based RL
MCP  Modified Critical Path
MCTS  Monte Carlo Tree Search
MDP  Markov Decision Process
MFFD  Modified First Fit Decreasing
MGGA  Multi-objective Genetic Algorithm
MILP  Mixed Integer Linear Programming
MIMT  Minimization of Migration based on Tesa
MLF  Minimum Load First
MMT  Minimum Migration Time
MOACO  Multi-Objective Ant Colony Optimization
MOPSO  Multi-Objective Particle Swarm Optimization
MPC  Model Predictive Control
MSP  Multi-Set Point
MVO  Multi-Verse Optimizer
NFV  Network Function Virtualization
NPA  Non-Power-Aware
NSGA-II  Non-dominated Sorting Genetic Algorithm II
OCA  Open Cold Aisle
OEMACS  Order Exchange and Migration Ant Colony System
PABFD  Power-aware Best Fit Decreasing
PADQN  PArametrized Deep Q-Network
PETS  Probabilistic Ensembles with Trajectory Sampling
PID  Proportional-Integral-Derivative
PM  Physical Machine
PPO  Proximal Policy Optimization
PRISMA  Preferred Reporting Items for Systematic review and Meta-analysis
PSO  Particle Swarm Optimization
PUE  Power Usage Effectiveness
QEEC  Q-learning Energy-Efficient Cloud computing
QL  Q-learning
RAC  Resource Allocation in container-based Clouds
RDHX  Rear Door Heat Exchangers
RES  Renewable Energy Systems
RH  Relative Humidity
RLR  Robust Logistic Regression
RP  Residual Physics
RR  Round Robin
RTP  Real-Time Pricing
SAC  Soft Actor Critic
SARSA  State-Action-Reward-State-Action
SDAEM  Stacked De-noising Auto-encoders with Multilayer Perception
SDN  Software-Defined Networking
SFC  Service Function Chaining
SLA  Service Level Agreement
SO  Snake Optimizer
SSP  Single-Set Point
TDBS  Task Duplication-Based Scheduling
TPM  Traffic Prediction Module
TRPO  Trust Region Policy Optimization
UP  Utilization Prediction-aware
UPS  Uninterruptible Power Supply
VDN  Value Decomposition Network
VDT-UMC  VM-based Dynamic Threshold and Minimum Correlation of Host Utilization
VM  Virtual Machine
VMC  VM Consolidation
VMP  VM Placement
VMPMBBO  Multi-objective Biogeography-Based Optimization
VMTA  VM Traffic burst
VPBAR  VM scheduling Based on Poisson Arrival Rate
VPME  VM Placement with Maximizing Energy efficiency
WUE  Water Usage Effectiveness

1. Introduction

The digitalization of society and the emergence of new AI technologies have increased the overall demand for computing power.
This growth has made data centers a critical infrastructure that supports our modern digital ecosystems. The rise of technologies such as the Internet of Things (IoT), cloud computing, big data, and artificial intelligence (AI) has increased the workload of data centers, which now require even more computing resources to meet demand. Data centers form the backbone of modern digital infrastructure, and their high energy consumption has substantial financial and environmental implications. According to the International Energy Agency [1], data centers consumed an estimated 460 terawatt hours (TWh) of electricity in 2022, with projections indicating that this could exceed 1000 TWh by 2026. In the European Union (EU), data centers consumed approximately 45–65 TWh of electricity in 2022, representing 1.8 % to 2.6 % of the total electricity consumption of the EU for that year [2].

This substantial energy consumption contributes to increased operational costs and has significant environmental consequences, including large amounts of greenhouse gas emissions [3] and increased strain on power grids [4]. Therefore, improving energy efficiency in data centers has become a critical issue, requiring intelligent and automated solutions capable of dynamically adapting to real-time demands.

Among the many emerging technologies, Reinforcement Learning (RL) and its subset, Deep Reinforcement Learning (DRL), have gained attention as promising techniques for optimizing energy efficiency within complex environments like data centers. These algorithms enable systems to learn optimal policies by interacting with dynamic environments, making them suitable for resource allocation, task scheduling, and heating and cooling management. A study conducted by Jayanetti et al. [5] demonstrates the significant potential of RL/DRL for minimizing energy consumption and reducing operational costs.
The data center architecture comprises three main systems: information and communication technology (ICT), cooling, and power supply systems. Today's data centers are vast, complex, and highly sophisticated, powered by a diverse ecosystem of ICT devices. These range from high-performance servers equipped with heterogeneous computing processors, such as CPUs, GPUs, and specialized accelerators, to arrays of memory units and storage solutions. In addition to the computational infrastructure, the cooling system is critical in sustaining data center functionality. Its complexity arises from integrating multiple subsystems designed to regulate thermal conditions and protect highly sensitive ICT equipment from overheating. Efficient cooling is a fundamental aspect of data center operations, directly impacting energy consumption, operational costs, and system reliability. Due to the high heat dissipation of modern ICT equipment, data center cooling systems are designed to maintain optimal temperatures, prevent hardware failures, and enhance overall performance.

A typical data center cooling system consists of multiple components, including chillers, pumps, fans, heat exchangers, and cooling towers, which work together to regulate temperature and ensure efficient heat dissipation. These systems can generally be classified into air-based and liquid-based cooling solutions. Air-based cooling relies on Computer Room Air Conditioning (CRAC) units [6] and Computer Room Air Handlers (CRAH) [7]. Liquid-based cooling, in contrast to air-based methods, incorporates technologies such as direct-to-chip cooling [8] and spray/immersion cooling [9]. These approaches significantly enhance thermal management by efficiently dissipating heat and directly cooling critical components. Recently, localized heat exchanger solutions, such as in-row, in-rack, and rear-door cooling, have gained popularity due to their efficiency in high-density environments.
In-row cooling places cooling units between server racks, reducing airflow distance and improving cooling efficiency [10]. Rear Door Heat Exchangers (RDHX), on the other hand, attach cooling units directly to the back of racks, capturing and dissipating heat immediately as it exits the servers. These strategies enhance cooling performance while minimizing energy waste by targeting heat removal close to the source [11].

Free cooling is an energy-efficient heat rejection method that uses low ambient air or water temperatures with a dry cooler or heat exchanger. Depending on the ambient medium, free cooling is also known as a water-side or air-side economizer [12]. Heat pumps [13] and thermal energy storage [14] are increasingly being adopted to enhance the energy efficiency and overall performance of heat reuse. Fig. 1 provides a schematic diagram of the data center cooling and heat rejection and reuse systems.

Solutions based on RL/DRL techniques enable adaptive, real-time decision making, which has significant potential for enhancing energy efficiency through optimization in complex data center environments. Despite these promising developments, the adoption of RL/DRL for minimizing energy consumption in data centers faces various challenges, including the complexity of modeling data center environments, managing computational costs, and ensuring scalability [15]. To address these challenges, innovative and intelligent solutions are required that can adapt to complex and dynamic environments in real time. Several previous reviews on the use of RL/DRL have been conducted for general applications [16] rather than analyzing a holistically integrated RL/DRL framework with a specific system, which this paper aims to examine. Additionally, few studies have provided systematic evaluations of RL/DRL across data center functions, leaving a gap in understanding
Fig. 1. Schematic diagram of data center cooling, heat rejection, and heat reuse system options. Black and blue arrows show heat flows in air and liquid, respectively. The grey arrow shows that the heat exchangers at the bottom of the middle box are localized closer to the heat source, whereas those at the top are far from it. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

these algorithms' capabilities in real-world environments. In this systematic literature review, we aim to investigate the recent advancements and applications of RL/DRL for enhancing energy efficiency in data centers by analyzing the literature using the PRISMA framework. The main objective of this work is to explore and evaluate the diverse potential applications of RL/DRL as a tool for optimizing energy efficiency in data centers while also synthesizing and consolidating existing research knowledge on their implementation in such facilities. Furthermore, this study aims to achieve the following specific objectives:

• Investigate and assess key applications of RL/DRL in data centers: This review aims to provide a comprehensive analysis of how RL/DRL algorithms have been applied to solve various energy efficiency challenges in data centers. To achieve this, we categorize RL/DRL applications by data center subsystems, giving readers insights into their roles and effectiveness.

• Evaluate and summarize each identified study in terms of algorithm type, the specific research problem addressed, primary objectives, and energy-efficiency outcomes, along with the benchmarks employed for performance evaluation, enabling a deeper understanding of the current state of research.
• Summarize details about the execution aspects of the identified studies: the implementation environment, dataset source, dataset type, and the platforms or frameworks utilized, offering insights into the practical considerations and resources required for implementing future studies.

• Utilize the identified joint optimization studies to present comprehensive guidelines for formulating the Markov Decision Process (MDP) elements, providing readers with a clear overview and foundational knowledge to construct such frameworks in future research.

• Identify technical and practical challenges in the current research direction. By investigating the essential issues related to RL/DRL usage in data centers, we aim to provide an in-depth view of the barriers that limit the broader use of these techniques in the data center industry.

• Highlight other objectives integrated with the energy efficiency problem in the identified studies, to address multi-objective optimization, thereby comprehensively ensuring sustainable and cost-effective operations in modern data centers.

• Explore research gaps, open issues, and future directions to propose a strategic roadmap for advancing the practical deployment of RL/DRL techniques in optimizing data center energy efficiency.

Through the above-mentioned objectives, this review aims to contribute a structured synthesis of RL/DRL applications for data center energy efficiency, identify persistent challenges, and chart a course for future research to address existing limitations and enhance the practical utility of RL/DRL techniques in data centers.

The remainder of this paper is organized as follows. Section 2 compares previous related reviews with this study. Section 3 provides a comprehensive background on RL/DRL algorithms. Section 4 outlines the research methodology. Section 5 explores the relevant literature in detail. Section 6 offers an overview of additional objectives combined with energy efficiency.
Section 7 discusses the identified research gaps and open challenges, and suggests future directions. Finally, Section 8 concludes this review.

2. Related reviews

Several existing reviews focus on the energy efficiency of the data center cooling system as a key objective. Chang et al. [17] explore cooling system optimization strategies in data centers using bibliometric methods. Their review examines the utilization of RL as a cooling control strategy for energy efficiency applications. Additionally, Shaqour et al. [18] investigate the literature on using DRL algorithms for HVAC energy management in data centers, which are considered a subgroup of smart buildings.

In contrast, other reviews target the energy efficiency of ICT systems in data centers. Gari et al. [19] evaluate the effectiveness of RL algorithms for data center scaling and scheduling purposes in the literature, while partially addressing energy consumption as an optimization objective. Magotra et al. [20] provide a comprehensive overview of using VM consolidation to enhance data center energy efficiency. This review surveys the research problem based on architecture and VM consolidation steps. Zhou et al. [21] present DRL-based approaches for resource scheduling in the cloud, highlighting their advantages, challenges, and future directions. Recently, Hou et al. [22] provided a specialized review on leveraging DRL algorithms for energy-efficient task scheduling in cloud computing. This study conducts an in-depth investigation of the Markov Decision Process (MDP) model components. Singh et al. [23] summarize previous empirical studies on multiple objectives in ICT systems, such as task scheduling and VM consolidation, to enhance energy efficiency while maintaining system performance.

Furthermore, other reviews combine cooling and ICT systems as the core topic of their review. Lin et al.
[24] explore previous efforts to achieve green-aware data centers from five different perspectives: workload management, virtual resource management, energy management, thermal management, and waste heat recovery. Long et al. [25] outline performance evaluation metrics for data center energy efficiency through ICT systems and infrastructure, including cooling and power supply systems. Conversely, Zhang et al. [26] address the joint optimization of cooling and ICT systems to achieve effective data center management under a set of evaluation metrics, including thermal conditions, energy consumption, and response delay.

Although these reviews address energy efficiency objectives in data centers based on RL/DRL algorithms from different perspectives, there remains a gap in the existing literature due to the absence of a systematic overview of RL/DRL applications for improving the energy efficiency of data center systems. Additionally, there appears to be a lack of research addressing joint optimization using RL/DRL for energy efficiency objectives. Moreover, previous reviews do not sufficiently discuss experimental setups, including the data sources and types used, and the implementation platforms. Our research introduces a systematic literature review that examines the use of RL/DRL for energy efficiency objectives across the main data center systems: cooling and ICT systems. We aim to explore recent advancements in this field to gain deeper insights, identify research gaps, and suggest future directions. Table 1 summarizes and compares related reviews and our work, emphasizing how our study differs from previous research.

Table 1
Related reviews on DC energy efficiency, and comparison with our review.
Columns are grouped as General focus (Data center, Energy efficiency, RL/DRL approaches), System specific (Cooling system, ICT system, Joint optimization), and Review outcomes (Energy reporting, Algorithm comparisons, Benchmark comparisons, Experimental setup).

Reference | Data center | Energy efficiency | RL/DRL approaches | Cooling system | ICT system | Joint optimization | Energy reporting | Algorithm comparisons | Benchmark comparisons | Experimental setup
[17] | ● | ● | ● | ● | × | × | ● | ◑ | × | ×
[18] | ● | ● | ● | ● | × | × | ● | ◑ | × | ×
[19] | ● | ◑ | ● | × | ◑ | × | ◑ | ● | × | ×
[20] | ● | ● | ◑ | × | ◑ | × | ◑ | ◑ | ◑ | ●
[21] | ● | ◑ | ● | × | ◑ | × | ◑ | ● | ● | ●
[22] | ● | ● | ● | × | ◑ | × | ● | ● | ● | ●
[23] | ● | ◑ | × | × | ◑ | × | ◑ | × | × | ×
[24] | ● | ◑ | ◑ | ◑ | ◑ | × | ◑ | ● | × | ×
[25] | ● | ● | × | ◑ | ◑ | × | ◑ | ● | × | ×
[26] | ● | ◑ | ◑ | × | × | ● | ● | ● | × | ◑
Current review | ● | ● | ● | ● | ● | ● | ● | ● | ● | ●

● – Topic addressed in detail/self-contained, ◑ – Topic partially addressed (i.e., not self-contained, requires additional reading for deep understanding), × – Topic not addressed.

3. Overview of RL/DRL algorithms

Reinforcement learning (RL) stands out as a machine learning technique developed by the computational intelligence community. It is inspired by natural learning mechanisms, in which organisms adjust their future behavior based on feedback from interactions with the environment. Fundamentally, RL is a closed-loop approach aimed at maximizing the cumulative reward, allowing the decision-maker, or agent, to learn and adapt over time. The actions taken by the learning agent influence its future inputs. The RL algorithm establishes an interactive relationship with the dynamic environment, allowing the agent to perform actions, observe the states of the environment, and receive feedback in the form of rewards and punishments. In most practical cases, the agent's actions may influence not only the immediate reward but also the ultimate reward. In this closed-loop learning approach, the absence of explicit instructions for taking actions and the uncertainty of future consequences are the key features of RL. These characteristics position RL algorithms as an integration of adaptive and optimal control techniques [27,28]. Fig. 2 illustrates the general framework of RL algorithms.
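The closed-loop interaction described above can be sketched in a few lines of Python. The two-state environment and the hand-crafted policy below are invented purely for illustration; they are not taken from any of the reviewed studies.

```python
import random

# Minimal sketch of the RL interaction loop, with a hypothetical
# two-state environment invented for illustration.
class ToyEnvironment:
    def __init__(self):
        self.state = 0

    def step(self, action):
        # Feedback: +1 when the action differs from the current state,
        # -1 otherwise; the next state is then drawn at random.
        reward = 1.0 if action != self.state else -1.0
        self.state = random.choice([0, 1])
        return self.state, reward

random.seed(0)
env = ToyEnvironment()
state, total_reward = env.state, 0.0
for t in range(100):
    action = 1 - state                 # agent: observe state, pick action
    state, reward = env.step(action)   # environment: next state + reward
    total_reward += reward             # reward closes the feedback loop
print(total_reward)  # prints 100.0: this policy earns +1 at every step
```

Here the reward is the only guidance available; an RL algorithm would learn the policy from this feedback signal rather than hard-coding it, as the formalization below makes precise.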
Let us consider a typical reinforcement learning scenario within a fully observable, stationary, stochastic environment, where the agent interacts with the environment by fully and accurately observing the current state. At each discrete time step, the agent selects an action based only on the current state to maximize the cumulative reward over time. The representation of this scenario is given by:

• States (S): The set of all possible states of the environment that the agent can observe.

S = {s_1, s_2, ..., s_n}   (1)

Fig. 2. RL framework: the agent applies actions (A) to the environment and observes the resulting states (S) and rewards (R).

• Actions (A): The set of all available actions that the agent can take in a given state.

A = {a_1, a_2, ..., a_n}   (2)

• Transition probabilities (P): The probability of moving to a future state s' given the current state s and action a, which may differ over time due to dynamic changes.

P_t(s' | s, a) = P(S_{t+1} = s' | S_t = s, A_t = a)   (3)

• Reward function (R): The immediate reward that the agent receives when taking action a in state s at time t, which may differ over time due to dynamic changes.

R_t(s, a) = E(reward | S_t = s, A_t = a)   (4)

• Policy function (π): This function determines the agent's behavior by defining the probability of taking action a in state s at time t, which may differ over time due to dynamic changes.

π_t(s, a) = P(A_t = a | S_t = s)   (5)

• Discount factor (γ): It determines the weight of future rewards compared to immediate rewards.

0 ≤ γ ≤ 1   (6)

When the value of the discount factor is close to 0, the RL agent focuses on the immediate reward, while a value close to 1 makes the RL agent focus on future rewards.
• Objective (cumulative reward): The ultimate goal of the RL agent is to identify the trajectories that maximize the expected discounted reward:

G_t = R_{t+1} + γ R_{t+2} + γ² R_{t+3} + ... = Σ_{k=0}^{∞} γ^k R_{t+k+1}   (7)

The tuple {S, A, P, R, γ} formulates the Markov decision process (MDP) representation for the proposed stationary stochastic environment. In the MDP framework, at each time step t, the agent interacts with the environment by observing the current state s_t ∈ S, choosing the action a_t ∈ A according to the policy function π_t(s_t, a_t), while estimating the probability of transitioning to a specific next state or taking a specific action using the transition probability model P_t(s' | s, a). After taking the action, the agent obtains a reward r_t ∈ R and transitions to the next state. The aim of reinforcement learning is to design the agent's learning process to find the optimal policy that maximizes the expected cumulative reward over time, G_t, considering the environment dynamics defined by the MDP [29–31].

However, the aforementioned process is not trivial. This challenge can be addressed recursively by introducing the state value function (V-function):

V^π(s) = E_{(s_t, a_t, ...) ∼ τ} [ Σ_{k=0}^{∞} γ^k R_{t+k+1} ]
       = Σ_{a_t} π(a_t | s_t) Σ_{s_{t+1}} P(s_{t+1} | s_t, a_t) [ R_{t+1} + γ V^π(s_{t+1}) ]
       = E_π [ R_{t+1} + γ V^π(S_{t+1}) | S_t = s ]   (8)

where τ: (s_0, a_0, s_1, a_1, ..., a_{t-1}, s_t) represents the interaction trajectory of the RL agent. Similarly, the expected return of taking a specific action a in a given state s while following the policy π is given by the state-action value function (Q-function):

Q^π(s, a) = E_π [ R_{t+1} + γ Q^π(S_{t+1}, A_{t+1}) | S_t = s, A_t = a ]   (9)

Eqs. (8) and (9) are referred to as the Bellman equations [32], which are considered the fundamental formulas for tackling the decision-making process of an RL agent.
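As an illustration of how the Bellman equations are used in practice, the following sketch runs value iteration on a small hypothetical MDP. The transition probabilities and reward table below are invented purely for illustration and are not drawn from any of the reviewed studies.

```python
GAMMA = 0.9
N_S, N_A = 3, 2

# Hypothetical MDP, invented for illustration:
# P[a][s] lists (next_state, probability) pairs; R[s][a] is the reward.
P = [
    [[(0, 0.8), (1, 0.2)], [(0, 0.1), (1, 0.8), (2, 0.1)], [(1, 0.2), (2, 0.8)]],  # action 0
    [[(1, 0.9), (2, 0.1)], [(1, 0.1), (2, 0.9)], [(0, 0.1), (2, 0.9)]],            # action 1
]
R = [[0.0, 0.5], [0.0, 1.0], [1.0, 0.0]]

V = [0.0] * N_S
for _ in range(1000):
    # Bellman optimality backup: Q(s,a) = R(s,a) + gamma * sum_s' P(s'|s,a) V(s')
    Q = [[R[s][a] + GAMMA * sum(p * V[s2] for s2, p in P[a][s])
          for a in range(N_A)] for s in range(N_S)]
    V_new = [max(q) for q in Q]
    if max(abs(x - y) for x, y in zip(V_new, V)) < 1e-10:
        break
    V = V_new

# Greedy policy extracted from the converged Q-function.
policy = [Q[s].index(max(Q[s])) for s in range(N_S)]
```

Each sweep applies the backup V(s) ← max_a [R(s,a) + γ Σ_{s'} P(s'|s,a) V(s')]; because this operator is a γ-contraction, V converges to the optimal value function, and taking the argmax over actions recovers an optimal policy.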
The optimal V-function and Q-function are defined by the maximum value over all policies: V*(s) = max_π V^π(s) over all states, and Q*(s, a) = max_π Q^π(s, a) over all state-action pairs. In all MDP cases, at least one optimal policy always exists, and the value functions V(s) and Q(s, a) of all optimal policies are the same. As a result, optimizing the Q-function yields the optimal policy of the MDP:

π*(a | s) = 1 if a = argmax_{a' ∈ A} Q*(s, a'), and 0 otherwise.   (10)

To obtain a solution to the MDP problem using RL techniques, two main categories of methods are used. Model-free RL algorithms allow an agent to learn a policy purely from interactions with the environment, without explicitly constructing a model of the environment's dynamics. The other category, model-based RL algorithms, leverages a model of the environment, which can be given or learned. This model typically includes the transition probability function (3) and the reward function (4), allowing the agent to plan actions before execution [33].

Value-based algorithms are among the most popular model-free RL methods, where the agent estimates state-action values and represents them as a table (referred to as a Q-table or policy table) to optimize its decision-making. The most well-known value-based algorithms used for smaller MDP problems are the tabular methods: Q-learning, in which the agent updates the table based on the maximum possible future reward (off-policy learning), making it more exploratory [34], and state-action-reward-state-action (SARSA) [35], where the agent updates the Q-table according to the actual action taken (on-policy learning), leading to more conservative behavior.

On the other hand, model-based RL leverages a model of the environment to update the Q-table of state-action pairs. This approach can be classified into two main categories based on how the environment model is acquired.
In the first category, the agent learns the model through its interactions with the environment, as in the Dynamic Q-learning (Dyna-Q) algorithm [36]. In the second category, the model is provided to the agent, as seen in Monte Carlo Tree Search (MCTS) [37]. However, RL algorithms face scalability limitations when applied in large-scale learning environments. They often struggle with extensive state spaces and continuous action spaces, leading to inefficiencies in the exploration–exploitation trade-off, slow convergence, and difficulties in learning optimal policies.

To address the limitations of traditional Reinforcement Learning (RL) methods, the computational intelligence community has developed Deep Reinforcement Learning (DRL), which integrates advancements in deep neural networks. In DRL algorithms, deep learning techniques are employed to construct at least one of the following agent components: the value functions (8), (9), the policy function (5), the transition model (3), and the reward function (4). Such representations are essential when the RL agent interacts with environments characterized by a high-dimensional state space and a continuous action space. DRL is a powerful tool for achieving an end-to-end goal-directed learning process [38,39]. Figs. 3 and 4 present a comprehensive classification of the most popular RL/DRL algorithms based on their respective model types. Another crucial aspect of RL/DRL algorithms is the type of policy used during the training process. The focus here is to determine whether

Fig. 3. RL/DRL model-free algorithms: value-based methods (tabular: Q-learning, SARSA; deep learning: DQN; temporal difference: TD(0)), actor-critic methods (DDPG, TD3, SAC, DSAC, A2C/A3C), and policy-based methods (basic policy gradient: REINFORCE, VPG; advanced policy gradient: PPO, TRPO).
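The tabular model-free methods in Fig. 3 can be made concrete with a short sketch. The five-state corridor environment and all hyperparameters below are invented purely for illustration; the update line implements the off-policy Q-learning rule, and the comment notes how SARSA would differ.

```python
import random

# Tabular Q-learning on a hypothetical five-state corridor (states 0..4,
# action 0 = left, action 1 = right, reward +1 on reaching state 4).
N_STATES, ALPHA, GAMMA, EPS = 5, 0.5, 0.9, 0.2

def step(s, a):
    s2 = max(0, min(N_STATES - 1, s + (1 if a == 1 else -1)))
    return s2, (1.0 if s2 == N_STATES - 1 else 0.0)

def eps_greedy(Q, s):
    if random.random() < EPS:
        return random.randrange(2)            # explore
    return 0 if Q[s][0] >= Q[s][1] else 1     # exploit

random.seed(0)
Q = [[0.0, 0.0] for _ in range(N_STATES)]
for _ in range(200):
    s = 0
    while s != N_STATES - 1:
        a = eps_greedy(Q, s)
        s2, r = step(s, a)
        # Q-learning (off-policy): bootstrap from the greedy next value.
        # SARSA (on-policy) would instead bootstrap from Q[s2][a2], where
        # a2 is the action the behavior policy actually picks in s2.
        Q[s][a] += ALPHA * (r + GAMMA * max(Q[s2]) - Q[s][a])
        s = s2
```

After training, the greedy values max_a Q(s, a) approach γ^(3−s) for s = 0..3, i.e. they increase toward the rewarding state; the ε-greedy behavior policy is what supplies the exploration that the exposition above identifies as essential.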
[Fig. 4 taxonomy: model-based algorithms either learn the model (model ensemble: GPS, VAML, PAML; planning-oriented: PETS, Dyna-Q, Deep Dyna-Q; policy optimization: MBPO, MBVE, MBAC) or are given the model (residual augmentation: RCE, EBD, Residual-Q; policy learning with rollouts: Dyna-DDPG, SAC-MBR, ME-TRPO; planning-oriented: MuZero, AlphaZero).]
Fig. 4. RL/DRL model-based algorithms.

the behavior policy – defined as the policy interacting with the environment to collect training data – and the target policy – which represents the final policy that the agent is aiming to learn – are identical. On-policy methods utilize the collected data directly for the next round of policy optimization, meaning that the behavior and target policies are the same. In off-policy methods, however, the training data generated during interaction with the environment is stored in a buffer. During training, this stored data – which may have been gathered under previous policies – is used to update the target policy; in this case, the behavior policy is not the same as the target policy. The advantages of on-policy methods include greater stability and faster convergence, balanced exploration–exploitation rates, and ease of implementation, while off-policy methods offer better performance in complex environments and greater adaptability to changing policies.

Finally, RL/DRL methods are used to solve a wide range of optimization problems, from playing simple computer games to controlling highly complex large-scale configurations such as transportation networks and energy systems [40–42]. Both RL and DRL offer real-time adaptability and dynamic responsiveness compared to traditional control methods. However, without prior knowledge of the studied environment, they may suffer slow convergence and failures during the initial phases of operation [43,44].
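The data-handling difference between the two settings can be made concrete with a minimal experience replay buffer, the storage structure that off-policy methods rely on. This is an illustrative sketch, not tied to any reviewed study.

```python
import random
from collections import deque

class ReplayBuffer:
    """Off-policy methods store transitions (possibly collected under
    older behavior policies) and sample them later to train the target
    policy. On-policy methods would instead use each freshly collected
    batch once and then discard it."""

    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)  # oldest entries evicted first

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        # Sampled transitions may originate from outdated behavior
        # policies, which is exactly why this is off-policy training data.
        return random.sample(self.buffer, batch_size)
```

The bounded `deque` mirrors the common practice of discarding the oldest experience once the buffer is full, limiting how stale the stored behavior-policy data can become.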
4. Materials and methods

The methodology of this review was structured following the PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) framework to ensure transparency, rigor, and reproducibility [45].

4.1. Research questions

The main aim of this review is to synthesize recent advancements in RL/DRL techniques for improving energy efficiency in data centers. To provide a comprehensive understanding of this topic, this study focuses on answering the following research questions based on the identified papers.

• RQ1: What data center subsystems (e.g., cooling, ICT equipment, power supply) are targeted by the RL/DRL algorithms?
• RQ2: Which RL/DRL algorithms are utilized for energy optimization in data centers?
• RQ3: What experimental setups and dataset sources (e.g., real-world deployments or simulations) are commonly used?
• RQ4: What specific research problems are addressed using RL/DRL algorithms?
• RQ5: What are the primary objectives addressed in the identified studies?
• RQ6: What benchmarks are used to evaluate the achieved results in terms of energy efficiency?
• RQ7: What tools, frameworks, or platforms are employed to implement RL/DRL algorithms in this context?
• RQ8: What metrics are used to measure and report the effectiveness of RL/DRL algorithms in improving energy efficiency?

4.2. Search strategy

4.2.1. Literature resources

To ensure that all recent and relevant studies are covered, the search was carried out in five major and well-established academic databases, known for their extensive repositories of peer-reviewed studies in computer science, engineering, and energy systems. Given that the scope of this review is relatively new, the covered time frame is limited to publications from 2019 to August 2024. To maintain high quality and credibility, only peer-reviewed journal articles from the databases listed below were selected.
• IEEE Xplore
• Scopus
• ScienceDirect
• Web of Science
• ACM Digital Library

4.2.2. Search terms (keywords)

To ensure the high quality of this study, search queries were systematically designed using Boolean operators and keywords relevant to RL/DRL and energy efficiency in data centers. A representative search string was: (“data center” OR “data centers”) AND (“energy-aware” OR “energy utilization” OR “energy saving” OR “energy efficiency”) AND (“reinforcement learning” OR “RL”). Fig. 5 shows the search strategy used in this study.

4.3. Search process and selection criteria

To ensure the relevance and quality of the included studies, the PRISMA framework guided the article identification process, which involved four distinct stages:

1. Identification: Studies were retrieved using search queries across the selected databases.
2. Screening: The titles and abstracts were screened to eliminate irrelevant studies and duplicates.
3. Eligibility: Full-text articles were reviewed against the inclusion and exclusion criteria.
4. Inclusion: The final set of studies that met all quality assessment criteria was selected for detailed analysis.

A PRISMA flow diagram (Fig. 6) illustrates the selection process, documenting the number of studies identified, screened, excluded, and included.

Fig. 5. Search strategy to get relevant papers.

[Fig. 6 flow diagram: records retrieved per database – IEEE Xplore 25, Scopus 100, ScienceDirect 14, Web of Science 20, ACM Digital Library 5 (164 in total after duplicate removal); abstract and keyword screening excluded 53 (111 remaining); inclusion criteria excluded 40 (71 remaining); a further 12 were excluded (59 remaining); 18 additional relevant articles were found from the references (77 in total); the final quality check excluded 12, leaving 65 selected studies.]
Fig. 6.
Systematic literature review process stages: removal of duplicates, removal based on abstract and keywords, removal based on inclusion and exclusion criteria, addition of new articles found from the references, and, finally, removal of those that did not match the quality criteria.

4.3.1. Inclusion and exclusion criteria

• Inclusion criteria: To ensure the inclusion of high-quality and relevant studies, the following criteria were applied:
– Only peer-reviewed journal articles published between 2019 and August 2024.
– Studies explicitly applying RL/DRL algorithms for energy efficiency in data center environments.
– Studies presenting measurable outcomes, such as increased energy savings or improved Power Usage Effectiveness (PUE).
– Studies focusing on specific or joint subsystems (e.g., cooling systems, ICT equipment, and/or power supply).
– Only the most recent version of a study was included when duplicate publications were identified.
• Exclusion criteria: To facilitate the filtering of irrelevant studies, the following criteria were used:
– Non-peer-reviewed studies, including conference papers, review articles, and opinion pieces.
– Studies not addressing RL/DRL-based methods for energy optimization in data center environments.
– Studies lacking empirical evidence or quantitative metrics.
– Studies without full-text availability, making it impossible to assess the study's relevance and quality.
– Studies focused on very small-scale experimental setups, as they lack applicability to real-world data center environments.

4.3.2. Quality assessment criteria and rating system

To ensure that the final selection of identified articles is robust and reliable, a rigorous and systematic quality assessment process was implemented, based on the clearly defined criteria listed below:

• Clear and comprehensive documentation of the RL/DRL methods utilized, ensuring transparency in their implementation.
• Explicit definition and justification of the targeted subsystem's relevance within the study.
• Logical coherence in identifying the research problem and aligning it with the stated objectives.
• Methodological rigor in the design of experimental setups, including appropriate baseline comparisons and validation techniques.
• Implementation of well-defined metrics to assess energy efficiency, such as increased energy savings or improvements in Power Usage Effectiveness (PUE).
• Thorough comparative analysis of RL/DRL techniques against alternative benchmark methods to highlight their effectiveness and advantages.

Only studies that achieved a perfect score of 6 out of 6 on these criteria were included in the final synthesis.

4.4. Data extraction and synthesis

A comprehensive data extraction and synthesis template was completed for each identified study to ensure that all selected studies addressed the review's research questions. The extracted data were organized into a synthesis card and stored in an Excel file for further use throughout the systematic review stages. Table 2 summarizes the data extraction and synthesis card used to gather the necessary information from the identified studies. To present the findings of this review, visual representations, such as pie charts, bar charts, and Venn diagrams, were created. Additionally, tables were utilized to systematically summarize and provide a detailed analysis of each identified study. This systematic approach provides a clear and structured framework for synthesizing and interpreting the collected data, while also highlighting research gaps, addressing challenges, and identifying future directions [46].

4.5. Threats to validity

The following threats to validity were acknowledged:

1. Publication bias: The focus on peer-reviewed journals may exclude innovative but unpublished studies.
2. Database coverage: Relevant articles from less-accessible databases or gray literature might have been missed.
3. Variability in reporting: Differences in methodologies and reporting standards across studies could limit comparability.

Table 2
Data extraction template. Categories:
• Unique Identifier (ID)
• Study Title
• Authors Names
• Publication Venue
• Publication Year
• DC Subsystem Applications (RQ1)
• RL/DRL Algorithm Type (RQ2)
• Experimental Setup (RQ3)
• Research Problems (RQ4)
• Main Objectives (RQ5)
• Benchmark Algorithms (RQ6)
• Platforms and Frameworks (RQ7)
• Energy Efficiency Outcomes (RQ8)
• MDP Elements in Joint Optimization Studies
• Abstract
• Keywords
• Other Performance Metrics

To mitigate these threats, standardized inclusion criteria were applied, and article selection and data extraction were independently verified by multiple reviewers.

5. Results and discussions

In this section, we discuss and present the findings of this review. First, we summarize the fundamental details of each identified study, including the study title, authors' names, publication venue, and publication year. These details facilitated the systematic organization of this review, with each study assigned a unique identifier (ID) for easy reference during the data analysis and extraction process. Next, we provide a comprehensive analysis, highlighting key perspectives such as the studied subsystems, the RL/DRL algorithms applied, and the types of models utilized, offering valuable insights into the state of the art. Then, we conduct a deeper synthesis, classifying the studies based on the subsystems they targeted. This categorization helped obtain quantitative and qualitative data to address the research questions for each subsystem. We focus our discussion on more detailed and specific information regarding the research problems, study objectives, experimental setups, benchmark comparisons, platforms used, and energy-related outcomes.
Finally, we summarize the construction of Markov Decision Process (MDP) elements in joint optimization studies. Additionally, we reference related works to further support and contextualize the purpose and findings of this review.

5.1. Overview of the final identified studies

In this review, we identified 65 journal articles that apply RL/DRL algorithms to improve the energy efficiency of at least one major data center system. The publication venues and years of these articles are summarized in Table 3. Given that the research topic of this review is relatively new, all selected studies were published between 2020 and 2024, as shown in Fig. 7. Taking a broader look at the selected studies reveals that over 60 % focus entirely on the ICT system, exploring opportunities to enhance energy efficiency by leveraging RL/DRL algorithms from various perspectives. In contrast, approximately 21 % of the papers focus exclusively on the data center cooling system. The remaining studies examine combinations of multiple data center systems. Fig. 8 provides a detailed overview of the specific systems addressed in each selected paper. In the following paragraphs, we explore the RL/DRL algorithms used in the selected studies of this review.
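The subsystem percentages quoted above follow directly from the counts in Fig. 8; a small tally (illustrative only) reproduces them:

```python
from collections import Counter

# Subsystem focus of the 65 selected studies (counts taken from Fig. 8).
subsystem_counts = Counter({"ICT": 40, "Cooling": 14, "Joint": 11})

total = sum(subsystem_counts.values())  # 65 studies in total
shares = {k: round(100 * v / total, 1) for k, v in subsystem_counts.items()}
# ICT 61.5 %, Cooling 21.5 %, Joint 16.9 %
```

This matches the "over 60 %" ICT share and the roughly 21 % cooling share reported in the text.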
For the cooling system: Since the cooling system of data centers is characterized by an MDP with a high-dimensional state space and a continuous action space, all selected studies employed DRL methods, primarily focusing on model-free algorithms, including:

• Soft Actor-Critic (SAC) [66,87]
• Deep Deterministic Policy Gradient (DDPG) [58,70]
• Twin Delayed DDPG (TD3) [93]
• Proximal Policy Optimization (PPO) [60]
• Trust Region Policy Optimization (TRPO) [93]
• Deep Q-Network (DQN) [69,75,101,104]

However, two studies used model-based algorithms: Model-Based Actor-Critic (MBAC) [49], to propose a safe cooling mode adhering to strict thermal constraints, and Probabilistic Ensembles with Trajectory Sampling (PETS) [102], in which the study compares four different algorithms: two model-free off-policy algorithms, namely a DQN variant called Branching Dueling Q-Network (BDQ) and SAC; one model-free on-policy algorithm (PPO); and one model-based algorithm (PETS).

For the ICT system: Due to the discrete nature of certain ICT processes, such as task scheduling and resource allocation, the Q-learning algorithm has been employed in multiple studies to handle the ICT MDP environment [62,79,105]. This approach allows Q-values to be updated independently from action selection and execution, enabling the algorithm to capture delayed feedback more accurately. As a result, this method enhances the learning rate and accelerates the convergence process. Alternatively, DQN is commonly proposed for handling more complex ICT systems, as reported in [50,54,81].
However, other DRL algorithms are also used, such as:

• Actor-Critic (AC) [72,107]
• Soft Actor-Critic (SAC) [55,65]
• Proximal Policy Optimization (PPO) [67,103]
• Asynchronous Advantage Actor-Critic (A3C) [76]
• Deep Deterministic Policy Gradient (DDPG) [86]

For combined-system studies: As the complexity of the MDP problem increases when multiple systems are present, with a combination of discrete and continuous state spaces along with high-dimensional action spaces, traditional RL approaches become less effective. In response to these challenges, all selected studies addressing the integration of multiple data center systems employed DRL algorithms. Notable DRL algorithms used in these studies include:

• Actor-Critic (AC) [47]
• Soft Actor-Critic (SAC) [48]
• Deep Q-Network (DQN) and its extensions [52,61,80,91,98]
• Deep Deterministic Policy Gradient (DDPG) [91,92]

Fig. 9 illustrates the distribution of various RL/DRL algorithms in the selected studies. Q-learning and DQN were the most frequently cited algorithms, together appearing in 60 % of studies, followed by SAC (eight studies), PPO (four studies), DDPG (four studies), and AC/A3C (four studies). About 9 % of studies employed other algorithms. Table 4 categorizes the algorithms implemented in the selected studies based on the utilized model type. According to Figs. 3 and 4, nearly 98 % of the algorithms employed are model-free, divided into three main groups: value-based algorithms, policy-based algorithms, and actor-critic algorithms. Only two studies utilized model-based algorithms, likely due to the complexity involved in accurately modeling a data center system. Some studies used more than one RL/DRL method, causing them to appear in multiple categories in the table. The following sections provide a detailed analysis of these algorithms and their applications.

Table 3
The selected studies.
ID | Authors | Publication venue | DC application (RQ1) | Year
S1 | Jayanetti et al. | IEEE Transactions on Parallel and Distributed Systems | Integrating power supply and ICT systems | 2024
S2 | Biemann et al. | IEEE Internet of Things Journal | Integrating cooling and power supply systems | 2023
S3 | Wan et al. | IEEE Transactions on Emerging Topics in Computational Intelligence | Cooling system | 2023
S4 | Lou et al. | IEEE Transactions on Network and Service Management | ICT system | 2023
S5 | Ran et al. | IEEE Transactions on Services Computing | Integrating cooling and ICT systems | 2023
S6 | Ran et al. | IEEE Transactions on Services Computing | Integrating cooling and ICT systems | 2023
S7 | Zeng et al. | IEEE Transactions on Parallel and Distributed Systems | ICT system | 2022
S8 | Kang et al. | IEEE Transactions on Network and Service Management | ICT system | 2022
S9 | Pham et al. | IEEE Access | ICT system | 2021
S10 | Yi et al. | IEEE Transactions on Parallel and Distributed Systems | ICT system | 2020
S11 | Ding et al. | IEEE Access | ICT system | 2020
S12 | Li et al. | IEEE Transactions on Cybernetics | Cooling system | 2020
S13 | Cheng et al. | IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems | ICT system | 2020
S14 | Leindals et al. | Energy and AI | Cooling system | 2024
S15 | Zhao et al. | IEEE Transactions on Sustainable Computing | Integrating power supply and ICT systems | 2024
S16 | Ghasemi et al. | Cluster Computing | ICT system | 2024
S17 | Ghasemi et al. | Computing | ICT system | 2024
S18 | Bhatt et al. | International Journal of Advanced Computer Science and Applications | ICT system | 2024
S19 | Zhang et al. | IEEE Transactions on Network and Service Management | ICT system | 2024
S20 | Guo et al. | Applied Energy | Cooling system | 2024
S21 | Yang et al. | Journal of Supercomputing | ICT system | 2024
S22 | Bouaouda et al. | Sustainability | ICT system | 2024
S23 | Chen et al. | Measurement and Control | Cooling system | 2024
S24 | Wang et al. | ACM Transactions on Cyber-Physical Systems | Cooling system | 2024
S25 | Aghasi et al. | Computer Networks | Integrating cooling and ICT systems | 2023
S26 | Wang et al. | Journal of Cloud Computing | ICT system | 2023
S27 | Wang et al. | Computer Networks | ICT system | 2023
S28 | Ghasemi et al. | Cluster Computing | ICT system | 2023
S29 | Huang et al. | Energies | Cooling system | 2023
S30 | Wei et al. | Journal of King Saud University – Computer and Information Sciences | ICT system | 2023
S31 | Liu et al. | Applied Energy | ICT system | 2023
S32 | Ahamed et al. | Sensors | ICT system | 2023
S33 | Ma et al. | IEEE Transactions on Industrial Informatics | ICT system | 2023
S34 | Simin et al. | Journal of Intelligent and Fuzzy Systems | Integrating cooling and ICT systems | 2023
S35 | Nagarajan et al. | Expert Systems | ICT system | 2023
S36 | Yang et al. | KSII Transactions on Internet and Information Systems | ICT system | 2022
S37 | Pandey et al. | Mobile Information Systems | ICT system | 2022
S38 | Shaw et al. | Information Systems | ICT system | 2022
S39 | Yan et al. | Computers and Electrical Engineering | ICT system | 2022
S40 | Wang et al. | Computer Networks | ICT system | 2022
S41 | Mahbod et al. | Applied Energy | Cooling system | 2022
S42 | Abbas et al. | Physical Communication | ICT system | 2022
S43 | Uma et al. | Transactions on Emerging Telecommunications Technologies | ICT system | 2022
S44 | Wang et al. | Future Generation Computer Systems | ICT system | 2021
S45 | Zhou et al. | IEEE Network | Integrating cooling and ICT systems | 2021
S46 | Chi et al. | Energies | Integrating cooling and ICT systems | 2021
S47 | Biemann et al. | Applied Energy | Cooling system | 2021
S48 | Ding et al. | Future Generation Computer Systems | ICT system | 2020
S49 | Peng et al. | Cluster Computing | ICT system | 2020
S50 | Hu et al. | Electronics | Integrating power supply and ICT systems | 2020
S51 | Qin et al. | Applied Intelligence | ICT system | 2020
S52 | Yang et al. | Journal of Building Engineering | Integrating cooling, ICT, and power supply systems | 2024
S53 | Lin et al. | IEEE Access | ICT system | 2020
S54 | Caviglione et al. | Soft Computing | ICT system | 2021
S55 | Le et al. | ACM Transactions on Sensor Networks | Cooling system | 2021
S56 | Zhang et al. | Applied Energy | Cooling system | 2023
S57 | Li et al. | CCF Transactions on High Performance Computing | ICT system | 2021
S58 | Wan et al. | IEEE Intelligent Systems | Cooling system | 2021
S59 | Haghshenas et al. | IEEE Transactions on Services Computing | ICT system | 2022
S60 | Zhang et al. | IEEE Transactions on Cybernetics | Cooling system | 2024
S61 | Sun et al. | Computer Networks | ICT system | 2020
S62 | Asghari et al. | Computer Networks | ICT system | 2020
S63 | Siddesha et al. | Cluster Computing | ICT system | 2022
S64 | Asghari et al. | Soft Computing | ICT system | 2020
S65 | Zhang et al. | Expert Systems with Applications | Cooling system | 2023
(References: studies S1–S65 correspond, in order, to citations [47]–[111].)

[Fig. 7 bar chart: 12 studies published in 2020, 9 in 2021, 12 in 2022, 18 in 2023, and 14 in 2024.]
Fig. 7. Publication year distribution of selected studies.

Fig. 8. The sub-systems focused on in selected studies: 40 studies focus on ICT optimization, 14 on cooling optimization, and 11 are joint studies integrating multiple systems, including the power supply system.

[Fig. 9 pie chart: DQN 33.8 %, Q-learning 26.2 %, SAC 12.3 %, PPO 6.2 %, DDPG 6.2 %, AC/A3C 6.2 %, other methods 9.1 %.]
Fig. 9. Distribution of algorithms utilized in this review.

5.2. Comparison of RL/DRL algorithms applied to cooling systems

Cooling systems account for approximately 40 % of energy consumption in data centers [112]. Reducing the energy consumption of this non-ICT support system will improve the power usage effectiveness (PUE) of the data center.
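Since PUE recurs throughout the outcome reporting surveyed below, its standard definition (total facility energy divided by the energy delivered to ICT equipment) is worth making explicit; the following minimal computation is for illustration only and is not taken from any reviewed study.

```python
def pue(total_facility_energy_kwh: float, it_equipment_energy_kwh: float) -> float:
    """Power Usage Effectiveness: total facility energy divided by the
    energy consumed by ICT equipment. 1.0 is the ideal lower bound;
    lower values indicate a more efficient facility."""
    if it_equipment_energy_kwh <= 0:
        raise ValueError("ICT equipment energy must be positive")
    return total_facility_energy_kwh / it_equipment_energy_kwh

# Example: a facility drawing 1400 kWh to deliver 1000 kWh of ICT load
# has PUE = 1.4; the 400 kWh overhead is dominated by cooling.
```

Reducing cooling energy lowers the numerator while leaving the ICT denominator unchanged, which is why cooling optimization maps directly onto PUE improvement.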
Furthermore, optimizing the operation of the cooling systems can significantly influence the thermal conditions and cooling flow of ICT devices, leading to further reductions in the total energy consumption of the entire data center [113]. In this section, we analyze the selected articles that use RL/DRL techniques to optimize the operation of the data center cooling system with the aim of reducing energy consumption. The following analysis not only provides an overview of how RL/DRL methods are applied to data center cooling systems, but also investigates the specific aspects of each selected study in detail. This includes the formulation of the research problem and objectives, the energy-related outcomes, the benchmark comparisons, and the experimental setup.

5.2.1. The research problem and objective formulation

As illustrated in Fig. 8, 14 papers discussing cooling systems were identified. These studies focused on two main research problems (RQ4), each with different objectives (RQ5). The dominant category of research problem focuses on optimizing cooling system operations in various scenarios to improve energy efficiency by utilizing various RL/DRL approaches. An interesting configuration involves using DRL to optimize the data center cooling system integrated with an active thermal management framework. For example, [60] explores the balance of aquifer thermal energy storage (ATES) while minimizing the total cost and maintaining the temperature range of the servers using a DRL agent. Similarly, [104] introduces Active Ventilation Tile (AVT) controllers to enhance the operation of the rack cooling system, achieving a trade-off between energy consumption and rack supply temperature distribution. An alternative scenario is integrating RL/DRL algorithms with prior physical knowledge to enhance the cooling system's energy efficiency.
The study [75] integrates this knowledge by using big data, IoT sensor networks, and a digital twin model with the DRL algorithm. By leveraging historical and real-time data, this approach employs a Long Short-Term Memory (LSTM) network to predict temperatures, enabling the DQN algorithm to effectively reduce the energy consumption of the cooling systems. Due to the strong relationship between the energy efficiency of data center cooling systems and the ambient temperatures at their locations, several studies have extensively investigated the efficiency of DRL algorithms in reducing cooling system energy consumption in tropical climates. Specifically, the study [101] focuses on optimizing the supply air temperature and relative humidity in a free-cooled tropical data center under defined boundaries, while [87] explores a single-agent DRL strategy with a floating set point approach to reduce the temperature threshold for tropical data centers based on a whole-building evaluation method. The study [69] proposes a multi-set-point approach based on the DQN algorithm (DQN-MSP) to enable precise cooling control of the CRAC unit's air temperature, offering significant improvements in data center cooling energy consumption. Another key research direction in this category emphasizes designing and comparing multiple state-of-the-art DRL algorithms for optimizing energy consumption while maintaining thermal conditions, as demonstrated in studies [66,93,102]. Meanwhile, the second research category shifts attention to the reliability of safety-aware DRL strategies, with the core aim of minimizing the energy consumption of the data center cooling system. These strategies are designed to ensure strict adherence to both soft and hard constraints during the learning and operational phases.
In [58], the study develops an end-to-end off-policy DDPG agent to optimize the cooling system using unprocessed, high-dimensional input data directly. Additionally, the study introduces a de-underestimation (DUE) validation mechanism for the critic network to address underestimation of overheating risks. In [106], the study focuses on incorporating residual physics based on thermodynamic principles to guide the DRL agent's exploration process by estimating the desirable range of actions, ensuring the safety of future actions. In addition, the study [49] develops safe cooling system operation by utilizing a model-based actor-critic DRL (MBAC) algorithm with two different models: a system transition model to predict the future system state, and a risk model to estimate the negative effects of executing an action. Furthermore, the paper [70] utilizes offline imitation learning and online post-hoc rectification techniques to develop three different versions of a safety-aware DDPG controller for the data center cooling system. Alternatively,

Table 4
Classification of algorithms by model type and study IDs.
Value-based, model-free:
• DQN: S4, S6, S7, S8, S10, S13, S15, S23, S29, S32, S34, S35, S37, S39, S42, S45, S49, S50, S53, S54, S55, S58, S22
• Q-learning: S11, S16, S17, S18, S22, S25, S28, S33, S38, S43, S44, S48, S51, S59, S62, S63, S64
• B3QN: S56
• PADQN: S5, S27, S45
• SARSA: S38
• BDQ: S56
Policy-based, model-free:
• PPO: S14, S21, S36, S57, S60, S65, S47
• TRPO: S47
• Monte Carlo (REINFORCE): S31
Actor-critic, model-free:
• SAC: S2, S9, S19, S20, S41, S56, S60, S65, S47
• A3C: S30
• AC: S1, S26, S61, S35
• DDPG: S12, S24, S40, S46, S35, S45, S60
• TD3: S47, S56
Learned model-based:
• PETS: S56
• MBAC: S3
Given model-based: none identified.

the study [111] leverages techniques such as Lagrangian-based constrained DRL (CDRL) and reward shaping to satisfy soft constraints through extensive online learning. Also, within the same study, hard constraints are addressed by a parameterized shielding DRL algorithm (DRL-S), which projects unsafe actions onto safe action spaces. The ultimate goal of the studies in this second category is to design a safe cooling system for data centers, reducing energy consumption while effectively maintaining thermal constraints. The insights from this section are summarized in Table A.8.

5.2.2. The energy-related outcomes

The primary motivation of this study is to address how the proposed RL/DRL algorithms enhance the energy efficiency of data centers (RQ8). The results related to energy efficiency have been carefully and thoroughly analyzed. Given the diversity of research problems and objectives addressed in the identified cooling system studies, the reporting methods for energy-related outcomes vary significantly. Some studies express the improvements achieved by the RL/DRL algorithm as a percentage reduction in energy consumption compared to the baseline controller (e.g., DefaultE+) [93,102,111].
In addition, other studies compare the energy saving percentage of their proposed RL/DRL strategies with other benchmark controllers, including DRL and non-DRL algorithms [49,70,87]. Energy efficiency is also reported in terms of improvements in key data center performance metrics, such as power usage effectiveness (PUE), compared to baseline controllers (e.g., DefaultE+) [58] or state-of-the-art controllers [66], while other studies use the PUE to evaluate the differences in energy consumption before and after applying the proposed RL/DRL algorithms [75]. Other studies focus on energy cost reductions rather than energy consumption savings [60]. Moreover, combining RL/DRL strategies with advanced setups, such as AVT systems [104] and physics-guided DRL with shielding [106], highlights the potential of RL/DRL in performing a trade-off analysis between energy efficiency and system performance. Furthermore, some studies demonstrate energy savings while maintaining thermal constraints, either by increasing the average supply air temperature of the CRAC units [69] or by raising the temperature and relative humidity thresholds [101]. A more detailed analysis of additional objectives combined with energy efficiency is provided in Section 6. A detailed summary of this section's findings is presented in Table A.8.

5.2.3. The benchmark comparisons

The distribution of benchmark algorithms used in the cooling system studies for energy-related results comparison is illustrated in Fig. 10.

[Fig. 10 bar chart, number of benchmarks per algorithm: PID 2, MPC 3, DQN 3, TRPO 4, PPO 8, DefaultE+ 6, DDPG 3, SAC 5, TD3 2, other non-DRL 6, other DRL 7.]
Fig. 10. Number of benchmarks in the literature for cooling system.

Analyzing the statistical data reveals two distinct groups. The first group involves the use of DRL algorithms, owing to their adaptability, as benchmarks for comparison, with PPO being the most widely used, appearing in eight studies.
Other prominent DRL algorithms include SAC (used five times), TRPO (used four times), DDPG (used three times), DQN (used three times), and TD3 (used twice). The second group consists of non-DRL algorithms, where the built-in EnergyPlus baseline controller (DefaultE+) was used in five studies, the classical PID controller was used twice, and the optimal model predictive controller (MPC) was used three times. Other DRL and non-DRL algorithms, including those used as benchmarks only once, are also considered. Table 5 outlines the benchmark algorithm comparisons (RQ6) for each selected cooling system study, including both DRL and non-DRL algorithms.

5.2.4. The experimental setup

Among the 14 selected cooling system studies, only one study directly implemented the proposed DRL strategy on a real-world data center [104]. In contrast, the remaining studies tested the designed DRL algorithms in simulated environments, highlighting a gap in direct real-world application and validation. These simulations utilized either real-world datasets, synthetic datasets, or a hybrid approach combining both. The EnergyPlus building energy simulation program [117] emerged as a primary tool for simulating energy consumption in data

Table 5
Selected cooling system studies experimental setup.
ID Environment (RQ3) Data source (RQ3) Data type (RQ3) Benchmarks (RQ6) Platform (RQ7) S3 Simulation Simulated a typical data center room with Alibaba’s 2018 cluster data Real-world MBRL-MPC, MBHC Unspecified CFD simulator, Python (PyTorch) S12 Simulation National Super Computing Centre (NSCC) of Singapore Real-world DefaultE+, Two-stage (TS), A3C, TRPO EnergyPlus, Python (Scipy) S14 Simulation Naviair data center (the Danish airspace control company) Real-world No reward PPO, Delayed reward PPO, Uniform future PPO, Trend-based future policy to estimate the return Python (OpenAI Gym) S20 Simulation Simulated liquid-cooled data cen- ter with unspecified real-world data set Simulated a small data center with a real-world dataset from the PlanetLab system Real-world PID, MPC, DQN, TRPO, PPO Matlab (Simscape) S23 Simulation Real-world DQN-SSP, PPO-MSP, DDPG-MSP 6SigmaRoom, CloudsimPy, Python S24 Simulation Four simulated configurations of CW- and DX-cooled data centers under two climate conditions Synthetic For the first three proposed controllers: DefaultE+, Reward shaping DDPG, Simplex DDPG, Projection post-hoc rectification DDPG For the fourth controller: PID, Vanilla DDPG, Reward shaping DDPG EnergyPlus, OpenFOAM, Python (OpenAI Gym and PyTorch) S29 Simulation Simulation for real-world data center room located in Shenzhen Simulated mid-tier stand-alone data center located in a tropical climate region Real-world Comparison of DC energy efficiency metrics before and after the DRL strategy 6SigmaRoom, Autodesk Revit, Python S41 Simulation Synthetic DefaultE+, Load Aware, Temperature Aware, Joint-IT, Multi-Agent DRL, TD3, PPO, TRP, various versions of SAC EnergyPlus, Python S47 Simulation Simulated medium-sized DC with two zones, a direct expansion cooling coil, and a chilled water cooling coil Synthetic and real-world DefaultE+, TD3, PPO, TRPO, SAC EnergyPlus, Python (OpenAI Gym) S55 Simulation A real free-cooled data center located in a tropical zone Real-world 
Hysteresis-based controller, MPC Matlab, Python (Keras and TensorFlow) S56 Simulation Simulated data center test bed developed in [114] Synthetic DefaultE+, PETS, BDQ, PPO, SAC EnergyPlus, Python (PyTorch) S58 Real-time Inner Mongolia Meteorological Information Center (IMMIC) Real-world DL, DN, DQN Python (TensorFlow), Real- time S60 Simulation Simulated data center test bed developed in [115] Synthetic SAC, RP-SAC, DDPG, RP-DDPG, PPO, RP- PPO, Lagrangian-based safe DRL, Physics EnergyPlus, Python (PyTorch, TensorFlow) S65 Simulation Simulated data center test bed developed in [116] Synthetic DefaultE+, PPO, SAC, PPO-Lag EnergyPlus center cooling systems, often integrated with various Python libraries to implement DRL agents. Other simulation environments utilized include Computational Fluid Dynamics (CFD) simulators such as OpenFOAM [118] and 6SigmaRoom [119], which offer detailed modeling of airflow and thermal dynamics. Furthermore, MATLAB, along with its advanced toolboxes like Simulink and Simscape, was frequently employed to sim - ulate the operational processes of data center cooling systems, providing a robust platform for evaluating control strategies and optimizing sys - tem performance. Table 5 presents a comprehensive overview of the experimental setup, including the environment, dataset source and type (RQ3), and platform (RQ7) for all identified studies on cooling systems. 5.3. Comparison of RL/DRL algorithms applied to ICT systems Over the past few years, data centers have grown significantly in size and complexity driven by the rapid advancements in ICT sys - tems. The advancements involve a wide range of devices, including high-performance servers, processing units such as CPUs and GPUs, advanced memory units, and storage arrays [120]. This technological progress has enabled data centers to support more complex operations, such as training large language models (LLMs) and real-time data pro - cessing. 
As a result, improving the energy efficiency of ICT systems has become a critical priority, not only to enhance the performance and scalability of data centers but also to minimize energy consumption and operational costs. In this section, we comprehensively examine the role of RL/DRL algorithms in tackling energy efficiency challenges within ICT systems as identified in the literature.

5.3.1. The research problem and objective formulation

The majority of the papers identified in this review focus on ICT systems, specifically 40 studies. The research problems (RQ4) and objectives (RQ5) of these studies can be categorized into the following areas:

Scheduling optimization: A considerable number of existing studies discuss the scheduling optimization challenge in a DC environment using RL/DRL approaches; however, few studies have explored the energy efficiency aspects of applying these algorithms. The three main types of scheduling optimization problems addressed by RL/DRL algorithms in the identified studies are job scheduling, task scheduling, and resource scheduling.

Job scheduling: Job scheduling refers to assigning and allocating an entire arriving job, which may consist of one or multiple tasks, to the DC resources, managing workloads at a high level. Traditional job scheduling mechanisms often struggle to cope with extensive, heterogeneous DC environments, especially in cases involving long-lasting jobs. This limitation leads to inefficiencies in energy consumption and resource management. Three studies [56,77,85] have addressed this challenge by proposing RL/DRL algorithms. The primary approach involves handling this challenge dynamically by considering real-world constraints, such as job dependencies and QoS levels, to minimize energy consumption and carbon emissions in data centers.
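To make the mechanics concrete, the following is a minimal, self-contained sketch, not drawn from any reviewed study, of how a tabular Q-learning agent can learn an energy-aware job-to-server assignment. The quadratic power model, constants, and names are illustrative assumptions only.

```python
import random

# Hypothetical illustration: tabular Q-learning assigning arriving jobs to
# one of three servers. State = tuple of server loads, action = target
# server, reward = negative incremental power cost. The quadratic cost is
# an assumed model that penalizes concentrating jobs on one machine.
random.seed(0)

N_SERVERS = 3
JOBS_PER_EPISODE = 6
PER_JOB_W = 30.0                 # assumed per-job power coefficient

def place(loads, server):
    """Return the new load vector and the incremental power cost."""
    new_loads = list(loads)
    new_loads[server] += 1
    cost = PER_JOB_W * new_loads[server] ** 2
    return tuple(new_loads), cost

Q = {}                           # (state, action) -> value
ALPHA, GAMMA, EPS = 0.1, 0.9, 0.2

def greedy(state):
    values = [Q.get((state, a), 0.0) for a in range(N_SERVERS)]
    return values.index(max(values))

for _ in range(500):             # training episodes
    state = (0,) * N_SERVERS
    for _ in range(JOBS_PER_EPISODE):
        action = greedy(state) if random.random() > EPS \
            else random.randrange(N_SERVERS)
        next_state, cost = place(state, action)
        target = -cost + GAMMA * max(
            Q.get((next_state, a), 0.0) for a in range(N_SERVERS))
        Q[(state, action)] = Q.get((state, action), 0.0) + ALPHA * (
            target - Q.get((state, action), 0.0))
        state = next_state

# Roll out the learned greedy policy on a fresh episode.
state = (0,) * N_SERVERS
for _ in range(JOBS_PER_EPISODE):
    state, _ = place(state, greedy(state))
print(state)   # with enough training, placement tends toward balanced loads
```

Real studies replace the toy power model with measured or simulated server power curves and the Q-table with a neural approximator (DQN and beyond), but the state/action/reward loop is the same.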
Task scheduling: Tasks are the components of jobs that typically need to be performed in a specific order due to their interdependence. Task scheduling refers to managing the execution of individual tasks within a job at a low level. The main objective of the task scheduling studies is to select the optimal DC resource for task execution, ensuring compliance with time and QoS constraints. Ten studies were identified that discuss the task scheduling problem, highlighting three main approaches:

• Dependency- and workflow-oriented RL/DRL task scheduling approaches [72,82,94,110].
• Heterogeneous cloud DC online RL/DRL task scheduling approaches [67,103,109].
• Adaptive and hybrid RL/DRL task scheduling approaches [50,54,59].

Resource scheduling: While task and job scheduling focus on the DC workload, resource scheduling concentrates on the physical (e.g., servers) or virtual (e.g., VM) infrastructure level of the DC. The main aim of resource scheduling is to maximize resource utilization, and it does not directly consider job and task dependencies. Two studies specifically focused on the resource scheduling problem [89,95].

Virtual machines and containers management: The virtualization of physical resources in data centers to meet the growing demands of workloads has received significant attention from researchers in recent years. Two technologies are commonly employed for virtualization: hardware-level virtualization, in which each virtual machine (VM) runs its own operating system and applications on top of a hypervisor, and operating system (OS)-level virtualization, which leverages the host system's kernel to create containers that share the host's resources [121].
In this review, we selected 14 studies that focus on managing VMs and containers using RL/DRL algorithms and present energy efficiency results. These studies address three key areas: VM consolidation, VM and container placement, and VM replacement.

VM consolidation: This refers to reducing the number of physical machines (PMs) required to operate the data center workload. The process includes three stages: workload detection (overutilization and underutilization), VM selection, and VM placement. By running multiple VMs on fewer PMs, several objectives can be achieved, including optimizing ICT resources, reducing operational costs, and minimizing energy consumption. Five studies in this review discuss the VM consolidation problem in data centers using RL/DRL algorithms, with two main approaches:

• Centralized adaptive RL/DRL strategies [53,57,84,88].
• Multi-agent RL strategies [105].

VM and container placement: This is a sub-process of consolidation, where the objective is solely to decide the optimal location (PM) for a VM. It is applied at the PM (host) level rather than at the DC system level. Eight studies have been identified on this topic: seven on VM placement and one on container placement [68].

VM replacement: This refers to reassigning an already placed VM to a new physical machine (PM). The process is triggered by changes in the current state (e.g., overloading, failures). It is also considered a sub-process of VM consolidation, enabling VM migration. Among the selected studies, only one specifically addressed this issue, proposing a novel approach that combines fuzzy logic with an RL algorithm to enhance decision-making and adaptability [74].

Two studies combine the two aforementioned categories as a research problem, focusing on VM scheduling by allocating tasks or jobs to VMs assigned to hosts, leveraging RL/DRL algorithms to optimize the scheduling process [79,90].
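Greedy bin-packing heuristics such as First-Fit are among the non-RL baselines most often cited for VM placement in this literature. The sketch below shows a First-Fit-Decreasing (FFD) variant under an assumed one-dimensional CPU-demand model; the function name and numbers are illustrative, not taken from any specific paper.

```python
# Hedged sketch of a First-Fit-Decreasing (FFD) VM-placement baseline.
# Demands and capacity are in abstract CPU units (an assumption); real
# studies use multi-dimensional resources (CPU, RAM, bandwidth).

def ffd_place(vm_demands, host_capacity):
    """Place VMs on as few hosts as possible.

    Sort VMs by decreasing demand, put each into the first host with
    enough residual capacity, and open a new host if none fits.
    Returns a list of per-host VM-demand lists.
    """
    residual = []       # free capacity per open host
    placement = []      # VM demands assigned to each host
    for demand in sorted(vm_demands, reverse=True):
        for i, free in enumerate(residual):
            if demand <= free:
                residual[i] -= demand
                placement[i].append(demand)
                break
        else:
            residual.append(host_capacity - demand)
            placement.append([demand])
    return placement

# Fewer active hosts generally means less idle power, which is why
# consolidation studies report energy alongside active-host counts.
plan = ffd_place([5, 7, 3, 2, 4, 6], host_capacity=10)
print(len(plan), plan)   # 3 [[7, 3], [6, 4], [5, 2]]
```

An RL/DRL placement agent is then evaluated by how much further it reduces active hosts (and thus energy) relative to such a heuristic under dynamic workloads.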
DCN traffic control: Data Center Networks (DCNs) play a critical role in ensuring the smooth operation of ICT systems. However, they often suffer from bandwidth surges, which degrade data center performance and significantly increase energy consumption. Traditional methods are limited in their adaptability and fail to handle sudden network traffic fluctuations dynamically, leading to substantial energy waste. RL/DRL algorithms offer effective approaches to tackle these challenges. Four studies have been identified that explore solutions to this problem, each employing a distinct structural RL/DRL approach:

• Combining LSTM networks for traffic prediction with proactive RL/DRL agents to optimize traffic control and energy efficiency [73,86].
• Formulating the problem as an MILP model to define the optimal solution space and integrating RL/DRL algorithms to find near-optimal solutions dynamically [55].
• Employing Software-Defined Networking (SDN) and RL/DRL to dynamically schedule traffic flows, aiming to reduce energy consumption while maintaining an optimal Flow Completion Time (FCT) [107].

Multi-objective framework: Five studies address job/task scheduling, task offloading, and resource allocation as multi-objective research problems. The resources considered in these studies include containers [65,81], multi-user, multi-data-center resources [99], and general data center resources [83,108]. A detailed summary of the identified ICT studies' research problems (RQ4) and objectives (RQ5) is presented in Table A.9.

5.3.2. The energy related outcomes

As energy efficiency is the primary focus of this review, a comprehensive analysis of the energy efficiency outcomes of using RL/DRL algorithms in ICT systems in the identified studies is presented in Table A.9.
This table answers this review's RQ8 and demonstrates that the proposed RL/DRL algorithms consistently outperform baseline and benchmark non-RL/DRL methods in terms of energy efficiency. The reported energy efficiency improvements range from small percentages (1 %–3 %) to significant enhancements (over 60 %), depending on the specified scenario and context, such as varying VM/task loads, DCN traffic sizes, or the use of real-world or synthetic datasets. The majority of the studies reported energy efficiency as a percentage improvement over benchmark algorithms.

Additionally, some studies highlighted energy efficiency enhancements in terms of scalability and dataset-based performance. For instance, studies [62,108] focus on performance across diverse datasets and scalability metrics. Other studies compare the achieved energy savings across multiple experimental setups or configurations. [67] investigated task scheduling across three distinct scenarios with 10, 50, and 100 servers, examining the impact of server configurations on energy efficiency. [88] explored VM consolidation under different workloads, assessing its impact on resource utilization and energy consumption. [86] analyzed DCN traffic control with both more than 70 nodes and fewer than 70 nodes, assessing performance across different network sizes. [109] conducted task scheduling across two different task counts and varying numbers of VMs, evaluating performance under diverse configurations.

In addition, a few studies presented a generalized approach without explicitly referencing benchmark algorithms. For instance, [74] reported energy savings in a generalized context, providing insights into the potential applicability of the proposed RL algorithm.

5.3.3. The benchmark comparisons

Each research problem discussed in the identified ICT system studies was compared to baseline or state-of-the-art benchmark methods commonly used in the respective problem domain. As presented in Fig. 11, the most commonly used baseline method for scheduling optimization studies was the RANDOM method, in which jobs/tasks/VMs are assigned to resources without considering any optimization criteria. This approach is simple and achieves unbiased scheduling; however, it is inefficient, as it overlooks critical DC metrics such as energy efficiency, quality of service (QoS), and workload balancing. This method was used as a baseline for comparison with the proposed RL/DRL algorithms in seven of the identified scheduling optimization studies.

[Fig. 11. Number of benchmarks in the literature for ICT systems: bar chart of benchmark counts per algorithm family (Random, Heuristic, Meta-heuristic, RL/DRL, Other, ML).]

Additionally, heuristic-based algorithms were widely used as benchmarks to evaluate the proposed RL/DRL algorithms for various ICT research problems. For scheduling research problems, the Round-Robin (RR) method was the primary heuristic-based method for performance comparison. Greedy algorithms, including First-Fit (FF), Best-Fit (BF), and their variants, were the main benchmarks for VM management research problems, while Elastic-Tree was a common benchmark for DC network traffic control problems. Approximately 78 additional heuristic-based algorithms were employed as comparison methods across all the research problems discussed in ICT systems. Meta-heuristic methods were also used 23 times as evaluation methods.
These included Particle Swarm Optimization (PSO), Ant Colony Optimization (ACO), Genetic Algorithms (GA), and their variants, applied to various ICT system research problems. Machine learning algorithms were occasionally utilized as benchmark methods in a limited number of identified studies, particularly for VM management. Other RL/DRL algorithms developed in previous studies were used 31 times for comparison with newly proposed algorithms, demonstrating internal comparisons within RL/DRL approaches in the identified studies. Finally, some specially designed algorithms were also employed. Table 6 outlines the benchmark algorithm comparisons (RQ6) for each selected study on ICT systems.

5.3.4. The experimental setup

CloudSim [122] and its variant WorkflowSim [123], an extended and optimized version of CloudSim designed for dependent task workflows, were used as simulation environments in approximately 50 % of the identified studies focusing on scheduling optimization and VM management research problems in DC ICT systems. In addition to these tools, programming languages such as Java and Python were frequently employed for simulation experiments in multiple studies. MATLAB was used as the simulation environment in four studies, while six studies did not specify the simulation environment used. On the other hand, several real-world datasets from large-scale data centers such as Google, Wikipedia, and Alibaba, as well as smaller data centers such as the National Supercomputing Centre (NSCC) of Singapore and the Nottingham University Data Center, were utilized as data sources in the identified studies. Moreover, well-known datasets such as PlanetLab and the CoMon project were also employed for simulation experiments. Synthetic datasets were another key data source, enabling controlled and customized testing scenarios.
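For reference, the two simplest baselines discussed above, RANDOM and Round-Robin, can be sketched in a few lines; the task and resource counts below are placeholders, not values from any reviewed study.

```python
import random

# Minimal sketches of the RANDOM and Round-Robin (RR) baselines that the
# identified studies compare RL/DRL schedulers against.

def random_assign(n_tasks, n_resources, seed=42):
    """Assign each task to a uniformly random resource, ignoring load."""
    rng = random.Random(seed)
    return [rng.randrange(n_resources) for _ in range(n_tasks)]

def round_robin_assign(n_tasks, n_resources):
    """Cycle through resources in a fixed order, ignoring load."""
    return [i % n_resources for i in range(n_tasks)]

print(round_robin_assign(8, 3))   # [0, 1, 2, 0, 1, 2, 0, 1]
print(random_assign(8, 3))        # unbiased but load- and energy-oblivious
```

Neither baseline observes server state, which is precisely the gap the RL/DRL schedulers exploit when they report energy savings over these methods.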
Table 6 provides a comprehensive overview of the experimental setup, encompassing the simulation environment, the sources and types of datasets (RQ3), and the platforms used (RQ7) in all the identified studies on ICT systems.

5.4. Comparison of RL/DRL algorithms applied to optimizing integrated data center systems

Developing an accurate, intelligent, and real-time DC environment requires seamless integration of all systems, including the cooling, ICT, and power supply systems. The joint optimization of these systems has become a promising research direction, aiming to achieve multiple objectives across multiple systems using advanced optimization strategies. Among these, RL/DRL algorithms have emerged as powerful approaches, demonstrating significant potential in addressing the complexities of integrated DC systems. This section provides a detailed analysis of the 11 identified studies that leverage RL/DRL algorithms for the joint optimization of DC systems. A vital aspect of these studies lies in their formulation as multi-objective research problems across multiple systems. To further enrich this discussion and align with the growing interest in this field, we define the key elements of the Markov Decision Process (MDP) models employed in these studies, highlighting their critical role in achieving efficient and effective system integration.

Various identified studies explored the integration of ICT operation optimization with energy-efficient cooling system control as a research problem (RQ4) with different objectives (RQ5). [71] investigates a decentralized strategy to simultaneously optimize the cooling system and VM placement. Additionally, scheduling optimization combined with cooling system control is another prominent research focus: task scheduling was discussed in [52,91,92], whereas job scheduling was examined in [52,80].
In both cases, the scheduling process is integrated with the optimization of the cooling system. On the other hand, three studies [47,61,96] examined the workflow scheduling of DCs powered by renewable energy systems (RES). The primary objective in these studies is to optimize energy consumption from RES during the execution of DC workloads. In study [48], a DRL strategy was applied to optimize the cooling system by integrating it with the power supply system using real-time electricity pricing (RTP). Finally, global optimization using a multi-agent approach to enhance energy efficiency across more than two DC systems was addressed in a recent study [98].

The majority of the identified papers report results related to the energy efficiency (RQ8) of the developed RL/DRL algorithms as a percentage of energy savings compared to baseline algorithms. For example, [48] reported a slight improvement in energy savings compared to a PID controller, while [91] compared the energy efficiency results of the proposed algorithm with a controller designed based on domain expert knowledge, achieving up to 30 % energy savings. Another method of reporting energy efficiency results involves using data center efficiency metrics, such as PUE. This approach was demonstrated in [52,80], where the proposed RL/DRL algorithms enhanced energy efficiency compared to benchmark algorithms. Table A.10 provides an overview of the research problem (RQ4), related objectives (RQ5), and energy-related outcomes (RQ8) of the identified joint optimization studies.

The proposed joint RL/DRL algorithms were compared against various benchmark algorithms. Multiple studies evaluated the performance of the developed RL/DRL strategies against state-of-the-art individual optimization techniques, such as ICT algorithms (e.g., random or heuristic approaches) and traditional cooling control methods, including PID and Model Predictive Control (MPC). Additionally, other studies compared the results with joint optimization approaches. Furthermore, several studies benchmarked the outcomes against other RL/DRL algorithms proposed in previous research.

Table 6. Selected ICT system experimental setup.

| ID | Environment (RQ3) | Data source (RQ3) | Data type (RQ3) | Benchmarks (RQ6) | Platform (RQ7) |
|----|-------------------|-------------------|-----------------|------------------|----------------|
| S4 | Simulation | Google cluster | Real-world | RR, B, MAD, DRL-DTM, DRL-DTA | NA |
| S7 | Simulation | Google cluster | Real-world | FF, MFFD, PABFD, RL-DC, UP-VMC | EnergyPlus, CloudSim |
| S8 | Simulation | Google cluster | Real-world | RR, HDRL, DRL-Cloud, MO-DQN | Python (TensorFlow) |
| S9 | Simulation | Abilene, Geant, and synthetic topology datasets | Synthetic and real-world | TEDO, TEDI | Java, Python (TensorFlow) |
| S10 | Simulation | National Supercomputing Centre (NSCC) of Singapore | Real-world | RR, Job consolidator, Online optimizer with two different reward functions | NA |
| S11 | Simulation | CoMon Project | Real-world | LR-MMT, VDT-UMC, DTH-MF | CloudSim |
| S13 | Simulation | Google cluster | Real-world | RR, HDRL, DRL-Cloud | NA |
| S16 | Simulation | Amazon EC2 and simulated dataset | Synthetic and real-world | FFD, BFD, GRVMP, GMPR, NSGA-II, RLVMP | CloudSim |
| S17 | Simulation | Simulated dataset | Synthetic | VMPMORL, EVCT, VPME, AFED-EF | CloudSim |
| S18 | Simulation | GWA-T-12 Bitbrains | Real-world | MOPSO, MOACO, VMPORL | MATLAB |
| S19 | Simulation | Simulated tasks following an exponential workload distribution | Synthetic | Cloud, PREM, RANDOM, REQ | Python (PyTorch and Gym) |
| S21 | Simulation | Open-source: BitBrains; scientific workflows: Ligo, Montage, Cybershake | Real-world | RR, RF, GRR, GRF, Tetris, RLScheduler, ACS | NA |
| S22 | Simulation | Simulated dataset | Synthetic | GA, ACO, SA, FFD | Java, CloudSim |
| S26 | Simulation | Google cluster | Real-world | RR, RANDOM, SO, GJO | NA |
| S27 | Simulation | Simulated dataset using a K-port FatTree topology | Synthetic | Greedy-ElasticTree, LSTM+DRL, DDPG | Python (TensorFlow, Keras) |
| S28 | Simulation | Nottingham University, Gaussian-distribution synthetic datasets | Synthetic and real-world | MOVMrB, RLVMrB, VMPMORL | CloudSim |
| S30 | Simulation | PlanetLab dataset, Amazon EC2 instance configurations | Synthetic and real-world | MOVMrB, RLVMrB, ADVMC | MATLAB |
| S31 | Simulation | Alibaba Cloud | Real-world | FIFO, Ideal MPC, Tetris | Python (TensorFlow) |
| S32 | Simulation | Azure 2017 workload | Real-world | HGP, IQR-MMT, MAD-MMT, RLR-MMT, GA | Python (PyTorch, Gymnasium, Scikit-learn) |
| S33 | Simulation | Ligo, Genome, Cybershake, Montage, and Sipht datasets | Real-world | EcoCloud, KMI-MRCU, AFED-EF | Java |
| S35 | Simulation | Simulated two common datasets | Synthetic | DSTS, LSTM, RF, CNN | CloudSim |
| S36 | Simulation | Alibaba cluster | Real-world | REINFORCE, FF, RANDOM, Tetris | Python (TensorFlow, NumPy, Matplotlib) |
| S37 | Simulation | Simulated dataset | Synthetic | Small task sizes: Load Aware, FFO-EVMM, MIMT, DQN; medium task sizes: FFO-EVMM, MIMT, L-No-Deaf, Worn-Dear, DQN; larger task sizes: FFO-EVMM, MIMT, multiple PSO variants, DBC, EDF | CloudSim |
| S38 | Simulation | PlanetLab dataset | Real-world | PowerAware VM consolidation | CloudSim |
| S39 | Simulation | Simulated dataset | Synthetic | RANDOM, RR, EDF | Python (PyTorch) |
| S40 | Simulation | Packet trace files from three data centers, generated using Wireshark | Real-world | Shortest-path-based routing, Gurobi optimizer | Python (Keras, TensorFlow) |
| S42 | Simulation | PlanetLab Monitoring | Real-world | IQR, MAD, THR, LR, PABFD | CloudSim |
| S43 | Simulation | Simulated dataset | Synthetic | RoFFR, CSLB, TDBS | WorkflowSim, Python |
| S44 | Simulation | 1998 FIFA World Cup Dataset, UNSW-17 Network Traffic Dataset | Real-world | VPBAR, LRR-MMT, DTH-MF, VMTA, Megh, EQBFD-0.1, EQBFD-0.3 | CloudSim |
| S48 | Simulation | Simulated dataset | Synthetic | MMS-RANDOM, MMS-FAIR, MMS-GREEDY | CloudSim |
| S49 | Simulation | Google cluster | Real-world | RANDOM, Round Robin (RR), MoPSO | Python (TensorFlow) |
| S51 | Simulation | Simulated dataset | Synthetic | Multi-objective optimization algorithms: MGGA, VMPACS, VMPMBBO, ICA-VMPLC, CVP; single-objective optimization algorithms: FFD, OEMACS | MATLAB |
| S53 | Simulation | Simulated dataset | Synthetic | Job scheduling: RANDOM, RR, Greedy, MoPSO; resource allocation: RANDOM, RR, MLF, FERPTS | Python (TensorFlow) |
| S54 | Simulation | Production-quality cloud DC, simulated dataset | Synthetic and real-world | FF, Dot Product, Norm2 heuristics | Python (NumPy, PyTorch) |
| S57 | Simulation | Google cluster | Real-world | Tetris, H2O-Cloud | NA |
| S59 | Simulation | CoMon Project (PlanetLab data) | Real-world | NPA, PABFD, IGGA, E-Eco | CloudSim |
| S61 | Simulation | Wikipedia trace files | Real-world | ElasticTree, CARPO, FCTcon, Optimal (not practical in use) | Python (Keras) |
| S62 | Simulation | Montage, Cybershake, Sipht, Inspiral datasets generated using the Pegasus Workflow Generator | Real-world | MPC, ETF, Lr-RL, Q-SCH, QL-HEFT | CloudSim |
| S63 | Simulation | Google Cloud Jobs dataset (GoCJ) | Real-world | PSO, MVO, EMVO | MATLAB, Python (PyTorch) |
| S64 | Simulation | Sipht, Inspiral, Cybershake datasets generated using the Pegasus Workflow Generator | Real-world | MPC, ETF | WorkflowSim |

Table 7. Selected integrated studies' experimental setup.

| ID | Environment (RQ3) | Data source (RQ3) | Data type (RQ3) | Benchmarks (RQ6) | Platform (RQ7) |
|----|-------------------|-------------------|-----------------|------------------|----------------|
| S1 | Simulation | Pegasus workflow framework | Synthetic | Random, Green-Opt (Greedy), Common-Actor | CloudSim, Python (Keras) |
| S2 | Simulation | Weather: collected from Denmark; electricity pricing: Danish electricity spot market | Real-world | Other RL controllers (for SAC and PPO), PID controller | EnergyPlus |
| S5 | Simulation | LLNL Thunder | Real-world | ICO, MPC, Joint optimization (JCO), Original-DQN | Matlab, 6SigmaDCX, TensorFlow |
| S6 | Simulation | LLNL Thunder | Real-world | PADQN, E-QL | Matlab, 6SigmaDCX, TensorFlow |
| S15 | Simulation | Workload: Google Cluster dataset (GCD); renewable energy: National Renewable Energy Laboratory (NREL)/NE-3000 wind turbines; electricity price: the US EIA; carbon footprint: the US Department of Energy Electricity Emission Factors | Real-world | Greenpacker, LECC, ADVMC, ADVMC-RES | Python |
| S25 | Simulation | PlanetLab, Google cluster | Real-world | DeepEE, Deep-Q with LSTM, ETAS, Improved Genetic, Hierarchical Deep-Q, MPC | CloudSim integrated with four CRAC units and perforated floor tiles to simulate realistic cooling dynamics |
| S34 | Simulation | A simulation-based dataset | Synthetic | Schedule: single-agent method, Hybrid DQN, Independent DQN, Original DQN | 6SigmaDC, CloudSimPy |
| S45 | Combining real-world and simulation | Operational data from Singapore's National Supercomputing Centre | Real-world | Expert-domain-knowledge algorithm; heuristic algorithms for independent IT or cooling optimization; thermal-unaware scheduling (traditional task scheduling without considering thermal dynamics) | 6SigmaRoom, EnergyPlus |
| S46 | Simulation | Google Cluster data | Real-world | Random, RR, PowerTrade, DeepEE | Python (OpenAI Gym and TensorFlow), Matlab |
| S50 | Simulation | Wiki data center | Real-world | Static, Random, K-means | Python (PyTorch) |
| S52 | Simulation | Simulated dataset | Synthetic | Non-optimization: no algorithm-based control; non-algorithm optimization: logic-based manual controls | NA |

The tools discussed in Sections 5.2.4 and 5.3.4 were similarly employed in the joint optimization studies to create simulation environments. These include the EnergyPlus building energy simulation program [117] and the Computational Fluid Dynamics (CFD) simulator 6SigmaRoom [119], which were utilized for cooling systems, while CloudSim [122] served as a simulation environment for the ICT system.
Furthermore, Python, along with its extensive libraries, served as the main programming language for implementing RL/DRL algorithms, while MATLAB was also employed in several studies for simulation and analytical tasks. Table 7 summarizes the experimental setups in the joint optimization literature: simulation environments (RQ3), platforms (RQ7), and benchmarks (RQ6).

5.5. The MDP elements

As detailed in Section 3, the Markov Decision Process (MDP) provides the foundational structure for modeling the RL/DRL environment. The key components of the MDP are the state space {𝑆}, the action space {𝐴}, and the reward function {𝑅}. In the context of the identified joint optimization problems, the MDP features a large and complex state space, as well as a mixed action space encompassing both discrete and continuous actions. Furthermore, the reward function guiding the RL/DRL agent in these studies consists of multiple terms to capture the various systems within the DC environment. This highlights that the MDP for joint optimization studies is considerably more complex than in studies addressing only one system. Table A.11 provides a comprehensive summary of the MDP components in the joint optimization studies.

6. Other objectives combined with energy efficiency in the identified studies

Besides energy efficiency, other objectives have been investigated in the identified studies. It is essential to highlight these objectives, which will shape the direction of future efforts in the field of multi-objective optimization. RL/DRL algorithms have proven effective in resolving conflicts between objectives in several identified works. For instance, in [95], the multi-objective optimization aims to balance the energy consumption of varying numbers of tasks (between 100 and 250 tasks) against the average task makespan.
Moreover, [92] examines the classical trade-off between quality of service (QoS), resource utilization, and energy consumption. Fig. 12 outlines a taxonomy of other optimization objectives integrated with enhancing the energy efficiency of data center systems. Although the majority of the identified studies address data center energy efficiency enhancement as the core research objective, some studies combine this objective with other environmental metrics, which can directly improve the operating mode of the data center and reduce its negative impact on the surrounding ecosystems in terms of carbon footprint and RES utilization. In contrast, other identified studies examine the proposed RL/DRL strategies for ICT and cooling in terms of system performance. In one dimension, these strategies refine time-related aspects, including

[Fig. 12. Taxonomy of other optimization objectives:
• Environmental Impact: reduce carbon emissions; improve RES utilization; balance cost-benefit trade-offs.
• System Performance: minimize total makespan; reduce average waiting time; improve response time; maximize resource utilization; maintain air temperature distributions; improve task completion rates.
• Reliability Management: minimize SLA violations; address thermal threshold conditions; reduce hotspots; balance temperature dispersion; maintain a stable CPU utilization level; improve Quality of Service (QoS).
• Algorithmic Performance: assess average rewards.]
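In practice, studies combining the objectives in this taxonomy typically fold them into a single scalar reward for the RL/DRL agent. Below is a hedged sketch of such a weighted multi-term reward; the weights, normalization constants, and function name are illustrative assumptions, not values from any reviewed study.

```python
# Illustrative multi-term reward combining energy, time, and reliability
# objectives. Reference values e_ref/t_ref scale each penalty to roughly
# [0, 1] so that no single objective dominates (all values are assumptions).

def joint_reward(energy_kwh, makespan_s, sla_violations,
                 w_energy=0.5, w_time=0.3, w_sla=0.2,
                 e_ref=100.0, t_ref=600.0):
    """Negative weighted sum of normalized penalty terms."""
    return -(w_energy * energy_kwh / e_ref
             + w_time * makespan_s / t_ref
             + w_sla * sla_violations)

# Lower energy and makespan with no SLA violations yields a higher reward:
good = joint_reward(energy_kwh=60.0, makespan_s=300.0, sla_violations=0)
bad = joint_reward(energy_kwh=90.0, makespan_s=540.0, sla_violations=2)
print(good, bad)   # -0.45  -1.12
```

The choice of weights encodes the trade-off policy itself, which is one reason the review's call for standardized multi-scale energy metrics matters: reported savings are only comparable when the reward structure is disclosed.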