Contents lists available at ScienceDirect

Applied Energy

journal homepage: www.elsevier.com/locate/apen

Reinforcement learning for data center energy efficiency optimization: A systematic literature review and research roadmap

Hussain Kahil a,∗, Shiva Sharma b, Petri Välisuo a, Mohammed Elmusrati a

a School of Technology and Innovation, University of Vaasa, Wolffintie 32, Vaasa, 65200, Finland
b School of Technology, Vaasa University of Applied Sciences, Wolffintie 30, Vaasa, 65200, Finland

HIGHLIGHTS

• Discusses using Reinforcement Learning (RL) for data center cooling systems.
• Discusses using RL for data center information and communication technology (ICT) systems.
• Provides a deep critical analysis of the energy optimization results.
• Presents comprehensive data extraction on experimental setups and benchmarks.
• Explores future directions in RL for optimizing energy in data center environments.

ARTICLE INFO

Keywords: Data center; Energy efficiency optimization; Cooling system; ICT system; Reinforcement learning (RL); Deep reinforcement learning (DRL)

ABSTRACT

With today's challenges posed by climate change, global attention is increasingly focused on reducing energy consumption within sustainable communities. As significant energy consumers, data centers represent a crucial area for research in energy efficiency optimization. To address this issue, various algorithms have been employed to develop sophisticated solutions for data center systems. Recently, Reinforcement Learning (RL) and its advanced counterpart, Deep Reinforcement Learning (DRL), have demonstrated promising potential in improving data center energy efficiency. However, a comprehensive review of the deployment of these algorithms remains limited.
In this systematic review, we explore the application of RL/DRL algorithms for optimizing data center energy efficiency, with a focus on optimizing the operation of cooling systems and Information and Communication Technology (ICT) processes, including task scheduling, resource allocation, virtual machine (VM) consolidation/placement, and network traffic control. Following the Preferred Reporting Items for Systematic review and Meta-Analysis (PRISMA) protocol, we provide a detailed overview of the methodologies and objectives of 65 identified studies, along with an in-depth analysis of their energy-related results. We also summarize key aspects of these studies, including benchmark comparisons, experimental setups, datasets, and implementation platforms. Additionally, we present a structured qualitative comparison of the Markov Decision Process (MDP) elements for joint optimization studies. Our findings highlight vital research gaps, including the lack of real-time validation for developed algorithms and the absence of multi-scale standardized metrics for reporting energy efficiency improvements. Furthermore, we propose joint optimization of multi-system objectives as a promising direction for future research.

∗ Corresponding author.
Email addresses: hussain.kahil@uwasa.fi (H. Kahil), shiva.sharma@vamk.fi (S. Sharma), petri.valisuo@uwasa.fi (P. Välisuo), mohammed.elmusrati@uwasa.fi (M. Elmusrati).

https://doi.org/10.1016/j.apenergy.2025.125734
Received 10 January 2025; Received in revised form 25 February 2025; Accepted 14 March 2025
Applied Energy 389 (2025) 125734
Available online 25 March 2025
0306-2619/© 2025 The Author(s). Published by Elsevier Ltd. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).
Nomenclature

A3C  Asynchronous advantage actor-critic
AC  Actor-critic
ACO  Ant Colony Optimization
ACS  Ant Colony System
ADVMC  Adaptive DRL based VM Consolidation
AFED-EF  Adaptive Four-threshold Energy-aware VM Deployment
ARLCA  Advanced RL Consolidation Agent
ATES  Aquifer Thermal Energy Storage
AVMC  Autonomous VM Consolidation
AVT  Active Ventilation Tile
BDQ  Branching Dueling Q-Network
BF  Best Fit
BFD  Best Fit Decreasing
CARPO  Correlation-AwaRe Power Optimization
CCO  Cooling Control Optimization
CDRL  Constrained DRL
CFD  Computational Fluid Dynamics
CFWS  Cost and carbon Footprint through Workload Shifting
CNN  Convolutional Neural Network
CSLB  Crow Search-based Load Balancing
CVP  Chemical reaction optimization-VMP-Permutation
CW  Chilled Water
D3QN  Dueling Deep Q Network
DAG  Directed Acyclic Graph
DBC  Deadline and Budget Constrained
DCI  Dynamic Control Interval
DCN  Data Center Network
DDPG  Deep Deterministic Policy Gradient
DL  Deep Learning
DPPE  Data Center Performance Per Energy
DPSO  Discrete Particle Swarm Optimization
DQN  Deep Q-Network
DRL  Deep Reinforcement Learning
DSTS  Dynamic Stochastic Task Scheduling
DTA  DRL-based Task Migration
DTH-MF  Dynamic Threshold Maximum Fit
DTM  Dynamic Thermal Management
DUE  De-underestimation Validation Mechanism
DX  Direct Expansion
ECA  Enclosed Cold Aisle
EDF  Earliest Deadline First
EMVO  Enhanced Multi-Verse Optimizer
EOM  Energy Optimization Module
EQBFD  Energy-efficient and QoS-aware BFD
ERE  Energy Reuse Effectiveness
ERLFC  Eco-friendly RL in Federated Cloud
ETAS  Energy and Thermal-Aware Scheduling
ETF  Earliest Time First
ETHC  Elastic Task Handler over hybrid Cloud
EVCT  Energy-efficient VM minimum Cut Theory
EVMM  Energy-aware VM Migration
FCT  Flow Completion Time
FERPTS  Fast and Energy-aware Resource Provisioning and Task Scheduling
FF  First Fit
FFD  First Fit Decreasing
FFO  FireFly Optimization
FIFO  First-In-First-Out
GA  Genetic Algorithm
GCD  Google Cluster Dataset
GEC  Green Energy Coefficient
GJO  Golden Jackal Optimization
GMPR  Greedy Minimizing Power consumption and Resource wastage
GRF  Generalized Resource-Fair
GRR  Generalized Round Robin
GRVMP  Greedy Randomized VM Placement
HDDL  Heterogeneous Distributed Deep Learning
HDRL  Hierarchical DRL
HEFT  Heterogeneous Earliest Time First
HGP  Heteroscedastic Gaussian Processes
HM  Host Machine
HVAC  Heating, Ventilation, and Air Conditioning
ICA  Imperialist Competitive Algorithm
ICO  IT Control Optimization
ICT  Information and Communication Technology
IGGA  Improved Grouping Genetic Algorithm
IQR  Inter-Quartile Range
ITEE  IT Equipment Energy
ITEU  IT Equipment Utilization
JCO  Joint IT and Cooling Control Optimization Algorithm
KMI-MRCU  K-Means clustering algorithm-Midrange-Interquartile range
LECC  Location, Energy, Carbon and Cost-aware VM placement
LR  Logistic Regression
LRR  Local Regression Robust
LSTM  Long Short-Term Memory
MAD  Median Absolute Deviation
MAGNETIC  Multi-AGent machine learNing-based approach for Energy efficienT dynamIc Consolidation
MBAC  Model-Based Actor-Critic
MBHC  MBRL-based HVAC Control
MBRL  Model-Based RL
MCP  Modified Critical Path
MCTS  Monte Carlo Tree Search
MDP  Markov Decision Process
MFFD  Modified First Fit Decreasing
MGGA  Multi-objective Genetic Algorithm
MILP  Mixed Integer Linear Programming
MIMT  Minimization of Migration based on Tesa
MLF  Minimum Load First
MMT  Minimum Migration Time
MOACO  Multi-Objective Ant Colony Optimization
MOPSO  Multi-Objective Particle Swarm Optimization
MPC  Model Predictive Control
MSP  Multi-Set Point
MVO  Multi-Verse Optimizer
NFV  Network Function Virtualization
NPA  Non-Power-Aware
NSGA-II  Non-dominated Sorting Genetic Algorithm II
OCA  Open Cold Aisle
OEMACS  Order Exchange and Migration Ant Colony System
PABFD  Power-aware Best Fit Decreasing
PADQN  PArametrized Deep Q-Network
PETS  Probabilistic Ensembles with Trajectory Sampling
PID  Proportional-Integral-Derivative
PM  Physical Machine
PPO  Proximal Policy Optimization
PRISMA  Preferred Reporting Items for Systematic review and Meta-analysis
PSO  Particle Swarm Optimization
PUE  Power Usage Effectiveness
QEEC  Q-learning Energy-Efficient Cloud computing
QL  Q-learning
RAC  Resource Allocation in container-based Clouds
RDHX  Rear Door Heat Exchangers
RES  Renewable Energy Systems
RH  Relative Humidity
RLR  Robust Logistic Regression
RP  Residual Physics
RR  Round Robin
RTP  Real-Time Pricing
SAC  Soft Actor Critic
SARSA  State-Action-Reward-State-Action
SDAEM  Stacked De-noising Auto-encoders with Multilayer Perception
SDN  Software-Defined Networking
SFC  Service Function Chaining
SLA  Service Level Agreement
SO  Snake Optimizer
SSP  Single-Set Point
TDBS  Task Duplication-Based Scheduling
TPM  Traffic Prediction Module
TRPO  Trust Region Policy Optimization
UP  Utilization Prediction-aware
UPS  Uninterruptible Power Supply
VDN  Value Decomposition Network
VDT-UMC  VM-based Dynamic Threshold and Minimum Correlation of Host Utilization
VM  Virtual Machine
VMC  VM Consolidation
VMP  VM Placement
VMPMBBO  Multi-objective Biogeography-Based Optimization
VMTA  VM Traffic burst
VPBAR  VM scheduling Based on Poisson Arrival Rate
VPME  VM Placement with Maximizing Energy efficiency
WUE  Water Usage Effectiveness

1. Introduction

The digitalization of society and the emergence of new AI technologies have increased the overall demand for computing power.
This growth has made data centers a critical infrastructure that supports our modern digital ecosystems. The rise of technologies such as the Internet of Things (IoT), cloud computing, big data, and artificial intelligence (AI) has increased the workload of data centers, which now require even more computing resources to meet demand. Data centers form the backbone of modern digital infrastructure, and their high energy consumption has substantial financial and environmental implications. According to the International Energy Agency [1], data centers consumed an estimated 460 terawatt hours (TWh) of electricity in 2022, with projections indicating that this could exceed 1000 TWh by 2026. In the European Union (EU), data centers consumed approximately 45–65 TWh of electricity in 2022, representing 1.8 % to 2.6 % of the total electricity consumption of the EU for that year [2].

This substantial energy consumption contributes to increased operational costs and has significant environmental consequences, including large amounts of greenhouse gas emissions [3] and increased strain on power grids [4]. Therefore, improving energy efficiency in data centers has become a critical issue, requiring intelligent and automated solutions capable of dynamically adapting to real-time demands.

Among the many emerging technologies, Reinforcement Learning (RL) and its subset, Deep Reinforcement Learning (DRL), have gained attention as promising techniques for optimizing energy efficiency within complex environments like data centers. These algorithms enable systems to learn optimal policies by interacting with dynamic environments, making them suitable for resource allocation, task scheduling, and heating and cooling management. A study conducted by Jayanetti et al. [5] demonstrates the significant potential of RL/DRL for minimizing energy consumption and reducing operational costs.
The data center architecture comprises three main systems: information and communication technology (ICT), cooling, and power supply systems. Today's data centers are vast, complex, and highly sophisticated, powered by a diverse ecosystem of ICT devices. These range from high-performance servers equipped with heterogeneous computing processors, such as CPUs, GPUs, and specialized accelerators, to arrays of memory units and storage solutions. In addition to the computational infrastructure, the cooling system is critical in sustaining data center functionality. Its complexity arises from integrating multiple subsystems designed to regulate thermal conditions and protect highly sensitive ICT equipment from overheating. Efficient cooling is a fundamental aspect of data center operations, directly impacting energy consumption, operational costs, and system reliability. Due to the high heat dissipation of modern ICT equipment, data center cooling systems are designed to maintain optimal temperatures, prevent hardware failures, and enhance overall performance.

A typical data center cooling system consists of multiple components, including chillers, pumps, fans, heat exchangers, and cooling towers, which work together to regulate temperature and ensure efficient heat dissipation. These systems can generally be classified into air-based and liquid-based cooling solutions. Air-based cooling relies on Computer Room Air Conditioning (CRAC) units [6] and Computer Room Air Handlers (CRAH) [7]. Liquid-based cooling, in contrast to air-based methods, incorporates technologies such as direct-to-chip cooling [8] and spray/immersion cooling [9]. These approaches significantly enhance thermal management by efficiently dissipating heat and directly cooling critical components. Recently, localized heat exchanger solutions, such as in-row, in-rack, and rear-door cooling, have gained popularity due to their efficiency in high-density environments.
In-row cooling places cooling units between server racks, reducing airflow distance and improving cooling efficiency [10]. Rear Door Heat Exchangers (RDHX), on the other hand, attach cooling units directly to the back of racks, capturing and dissipating heat immediately as it exits the servers. These strategies enhance cooling performance while minimizing energy waste by targeting heat removal close to the source [11].

Free cooling is an energy-efficient heat rejection method that uses low ambient air or water temperatures with a dry cooler or heat exchanger. Depending on the ambient medium, free cooling is also known as a water-side or air-side economizer [12]. Heat pumps [13] and thermal energy storage [14] are increasingly being adopted to enhance the energy efficiency and overall performance of heat reuse. Fig. 1 provides a schematic diagram of the data center cooling and heat rejection and reuse systems.

Solutions based on RL/DRL techniques enable adaptive, real-time decision making, which has significant potential for enhancing energy efficiency through optimization in complex data center environments. Despite these promising developments, the adoption of RL/DRL for minimizing energy consumption in data centers faces various challenges, including the complexity of modeling data center environments, managing computational costs, and ensuring scalability [15]. To address these challenges, innovative and intelligent solutions are required that can adapt to complex and dynamic environments in real time. Several previous reviews on the use of RL/DRL have been conducted for general applications [16] rather than analyzing a holistically integrated RL/DRL framework with a specific system, which this paper aims to examine. Additionally, few studies have provided systematic evaluations of RL/DRL across data center functions, leaving a gap in understanding
Fig. 1. Schematic diagram of data center cooling, heat rejection, and heat reuse system options. Black and blue arrows show heat flows in air and liquid, respectively. The grey arrow shows that the heat exchangers at the bottom of the middle box are localized closer to the heat source, whereas those at the top are far from it. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

these algorithms' capabilities in real-world environments. In this systematic literature review, we aim to investigate the recent advancements and applications of RL/DRL for enhancing energy efficiency in data centers by analyzing the literature using the PRISMA framework. The main objective of this work is to explore and evaluate the diverse potential applications of RL/DRL as a tool for optimizing energy efficiency in data centers while also synthesizing and consolidating existing research knowledge on their implementation in such facilities. Furthermore, this study aims to achieve the following specific objectives:

• Investigate and assess key applications of RL/DRL in data centers: This review aims to provide a comprehensive analysis of how RL/DRL algorithms have been applied to solve various energy efficiency challenges in data centers. To achieve this, we categorize RL/DRL applications by data center subsystems, giving readers insights into their roles and effectiveness.

• Evaluate and summarize each identified study in terms of algorithm type, the specific research problem addressed, primary objectives, and energy-efficiency outcomes, along with the benchmarks employed for performance evaluation, enabling a deeper understanding of the current state of research.
• Summarize details about the execution aspects of the identified studies: the implementation environment, dataset source, dataset type, and the platforms or frameworks utilized, offering insights into the practical considerations and resources required for implementing future studies.

• Utilize the identified joint optimization studies to present comprehensive guidelines for formulating the Markov Decision Process (MDP) elements, providing readers with a clear overview and foundational knowledge to construct such frameworks in future research.

• Identify technical and practical challenges in the current research direction. By investigating the essential issues related to RL/DRL usage in data centers, we aim to provide an in-depth view of the barriers that limit the broader use of these techniques in the data center industry.

• Highlight other objectives integrated with the energy efficiency problem in the identified studies, to address multi-objective optimization, thereby comprehensively ensuring sustainable and cost-effective operations in modern data centers.

• Explore research gaps, open issues, and future directions to propose a strategic roadmap for advancing the practical deployment of RL/DRL techniques in optimizing data center energy efficiency.

Through the above-mentioned objectives, this review aims to contribute a structured synthesis of RL/DRL applications for data center energy efficiency, identify persistent challenges, and chart a course for future research to address existing limitations and enhance the practical utility of RL/DRL techniques in data centers.

The remainder of this paper is organized as follows. Section 2 compares previous related reviews with this study. Section 3 provides a comprehensive background on RL/DRL algorithms. Section 4 outlines the research methodology. Section 5 explores the relevant literature in detail. Section 6 offers an overview of additional objectives combined with energy efficiency.
Section 7 discusses the identified research gaps and open challenges, and suggests future directions. Finally, Section 8 concludes this review.

2. Related reviews

Several existing reviews focus on the energy efficiency of the data center cooling system as a key objective. Chang et al. [17] explore cooling system optimization strategies in data centers using bibliometric methods. Their review examines the utilization of RL as a cooling control strategy for energy efficiency applications. Additionally, Shaqour et al. [18] investigate the literature on using DRL algorithms for HVAC energy management in data centers, which are considered a subgroup of smart buildings.

In contrast, other reviews target the energy efficiency of ICT systems in data centers. Gari et al. [19] evaluate the effectiveness of RL algorithms for data center scaling and scheduling purposes in the literature, while partially addressing energy consumption as an optimization objective. Magotra et al. [20] provide a comprehensive overview of using VM consolidation to enhance data center energy efficiency. This review surveys the research problem based on architecture and VM consolidation steps. Zhou et al. [21] present DRL-based approaches for resource scheduling in the cloud, highlighting their advantages, challenges, and future directions. Recently, Hou et al. [22] provided a specialized review on leveraging DRL algorithms for energy-efficient task scheduling in cloud computing. This study conducts an in-depth investigation of the Markov Decision Process (MDP) model components. Singh et al. [23] summarize previous empirical studies on multiple objectives in ICT systems, such as task scheduling and VM consolidation, to enhance energy efficiency while maintaining system performance.

Furthermore, other reviews combine cooling and ICT systems as the core topic of their review. Lin et al.
[24] explore previous efforts to achieve green-aware data centers from five different perspectives: workload management, virtual resource management, energy management, thermal management, and waste heat recovery. Long et al. [25] outline performance evaluation metrics for data center energy efficiency through ICT systems and infrastructure, including cooling and power supply systems. Conversely, Zhang et al. [26] address the joint optimization of cooling and ICT systems to achieve effective data center management under a set of evaluation metrics, including thermal conditions, energy consumption, and response delay.

Although these reviews address energy efficiency objectives in data centers based on RL/DRL algorithms from different perspectives, there remains a gap in the existing literature due to the absence of a systematic overview of RL/DRL applications for improving the energy efficiency of data center systems. Additionally, there appears to be a lack of research addressing joint optimization using RL/DRL for energy efficiency objectives. Moreover, previous reviews do not sufficiently discuss experimental setups, including the data sources and types used, and the implementation platforms. Our research introduces a systematic literature review that examines the use of RL/DRL for energy efficiency objectives across the main data center systems: cooling and ICT systems. We aim to explore recent advancements in this field to gain deeper insights, identify research gaps, and suggest future directions. Table 1 summarizes and compares related reviews and our work, emphasizing how our study differs from previous research.

Table 1
Related reviews on DC energy efficiency, and comparison with our review.
Columns are grouped as General focus (Data center, Energy efficiency, RL/DRL approaches), System specific (Cooling system, ICT system, Joint optimization), and Review outcomes (Energy reporting, Algorithm comparisons, Benchmark comparisons, Experimental setup).

Reference | Data center | Energy efficiency | RL/DRL approaches | Cooling system | ICT system | Joint optimization | Energy reporting | Algorithm comparisons | Benchmark comparisons | Experimental setup
[17] | ● | ● | ● | ● | × | × | ● | ◑ | × | ×
[18] | ● | ● | ● | ● | × | × | ● | ◑ | × | ×
[19] | ● | ◑ | ● | × | ◑ | × | ◑ | ● | × | ×
[20] | ● | ● | ◑ | × | ◑ | × | ◑ | ◑ | ◑ | ●
[21] | ● | ◑ | ● | × | ◑ | × | ◑ | ● | ● | ●
[22] | ● | ● | ● | × | ◑ | × | ● | ● | ● | ●
[23] | ● | ◑ | × | × | ◑ | × | ◑ | × | × | ×
[24] | ● | ◑ | ◑ | ◑ | ◑ | × | ◑ | ● | × | ×
[25] | ● | ● | × | ◑ | ◑ | × | ◑ | ● | × | ×
[26] | ● | ◑ | ◑ | × | × | ● | ● | ● | × | ◑
Current review | ● | ● | ● | ● | ● | ● | ● | ● | ● | ●

● – Topic addressed in detail/self-contained, ◑ – Topic partially addressed (i.e., not self-contained, requires additional reading for deep understanding), × – Topic not addressed.

3. Overview of RL/DRL algorithms

Reinforcement learning (RL) stands out as a machine learning technique developed by the computational intelligence community. It is inspired by natural learning mechanisms, in which organisms adjust their future behavior based on feedback from interactions with the environment. Fundamentally, RL is a closed-loop approach aimed at maximizing the cumulative reward, allowing the decision-maker, or agent, to learn and adapt over time. The actions taken by the learning agent influence its future inputs. The RL algorithm establishes an interactive relationship with the dynamic environment, allowing the agent to perform actions, observe the states of the environment, and receive feedback in the form of rewards and punishments. In most practical cases, the agent's actions may influence not only the immediate reward but also the ultimate reward. In this closed-loop learning approach, the absence of explicit instructions for taking actions and the uncertainty of future consequences are the key features of RL. These characteristics position RL algorithms as an integration of adaptive and optimal control techniques [27,28]. Fig. 2 illustrates the general framework of RL algorithms.
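The closed-loop interaction described above can be sketched in a few lines of Python. The two-state environment and the hand-crafted policy below are invented purely for illustration; they are not taken from any of the reviewed studies.

```python
import random

# Minimal sketch of the RL interaction loop, with a hypothetical
# two-state environment invented for illustration.
class ToyEnvironment:
    def __init__(self):
        self.state = 0

    def step(self, action):
        # Feedback: +1 when the action differs from the current state,
        # -1 otherwise; the next state is then drawn at random.
        reward = 1.0 if action != self.state else -1.0
        self.state = random.choice([0, 1])
        return self.state, reward

random.seed(0)
env = ToyEnvironment()
state, total_reward = env.state, 0.0
for t in range(100):
    action = 1 - state                 # agent: observe state, pick action
    state, reward = env.step(action)   # environment: next state + reward
    total_reward += reward             # reward closes the feedback loop
print(total_reward)  # prints 100.0: this policy earns +1 at every step
```

Here the reward is the only guidance available; an RL algorithm would learn the policy from this feedback signal rather than hard-coding it, as the formalization below makes precise.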
Let us consider a typical reinforcement learning scenario within a fully observable, stationary, stochastic environment, where the agent interacts with the environment by fully and accurately observing the current state. At each discrete time step, the agent selects an action based only on the current state to maximize the cumulative reward over time. The representation of this scenario is given by:

• States (S): The set of all possible states of the environment that the agent can observe.

S = {s_1, s_2, ..., s_n}   (1)

Fig. 2. RL framework: the agent applies actions (A) to the environment and observes the resulting states (S) and rewards (R).

• Actions (A): The set of all available actions that the agent can take in a given state.

A = {a_1, a_2, ..., a_n}   (2)

• Transition probabilities (P): The probability of moving to a future state s' given the current state s and action a, which may differ over time due to dynamic changes.

P_t(s' | s, a) = P(S_{t+1} = s' | S_t = s, A_t = a)   (3)

• Reward function (R): The immediate reward that the agent receives when taking action a in state s at time t, which may differ over time due to dynamic changes.

R_t(s, a) = E(reward | S_t = s, A_t = a)   (4)

• Policy function (π): This function determines the agent's behavior by defining the probability of taking action a in state s at time t, which may differ over time due to dynamic changes.

π_t(s, a) = P(A_t = a | S_t = s)   (5)

• Discount factor (γ): It determines the weight of future rewards compared to immediate rewards.

0 ≤ γ ≤ 1   (6)

When the value of the discount factor is close to 0, the RL agent focuses on the immediate reward, while a value close to 1 makes the RL agent focus on future rewards.
• Objective (cumulative reward): The ultimate goal of the RL agent is to identify the trajectories that maximize the expected discounted reward:

G_t = R_{t+1} + γ R_{t+2} + γ² R_{t+3} + ... = Σ_{k=0}^{∞} γ^k R_{t+k+1}   (7)

The tuple {S, A, P, R, γ} formulates the Markov decision process (MDP) representation for the proposed stationary stochastic environment. In the MDP framework, at each time step t, the agent interacts with the environment by observing the current state s_t ∈ S, choosing the action a_t ∈ A according to the policy function π_t(s_t, a_t), while estimating the probability of transitioning to a specific next state or taking a specific action using the transition probability model P_t(s' | s, a). After taking the action, the agent obtains a reward r_t ∈ R and transitions to the next state. The aim of reinforcement learning is to design the agent's learning process to find the optimal policy that maximizes the expected cumulative reward over time, G_t, considering the environment dynamics defined by the MDP [29–31].

However, the aforementioned process is not trivial. This challenge can be addressed recursively by introducing the state value function (V-function):

V^π(s) = E_{(s_t, a_t, ...) ∼ τ} [ Σ_{k=0}^{∞} γ^k R_{t+k+1} ]
       = Σ_{a_t} π(a_t | s_t) Σ_{s_{t+1}} P(s_{t+1} | s_t, a_t) [ R_{t+1} + γ V^π(s_{t+1}) ]
       = E_π [ R_{t+1} + γ V^π(S_{t+1}) | S_t = s ]   (8)

where τ: (s_0, a_0, s_1, a_1, ..., a_{t-1}, s_t) represents the interaction trajectory of the RL agent. Similarly, the expected return of taking a specific action a in a given state s while following the policy π is given by the state-action value function (Q-function):

Q^π(s, a) = E_π [ R_{t+1} + γ Q^π(S_{t+1}, A_{t+1}) | S_t = s, A_t = a ]   (9)

Eqs. (8) and (9) are referred to as the Bellman equations [32], which are considered the fundamental formulas for tackling the decision-making process of an RL agent.
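As an illustration of how the Bellman equations are used in practice, the following sketch runs value iteration on a small hypothetical MDP. The transition probabilities and reward table below are invented purely for illustration and are not drawn from any of the reviewed studies.

```python
GAMMA = 0.9
N_S, N_A = 3, 2

# Hypothetical MDP, invented for illustration:
# P[a][s] lists (next_state, probability) pairs; R[s][a] is the reward.
P = [
    [[(0, 0.8), (1, 0.2)], [(0, 0.1), (1, 0.8), (2, 0.1)], [(1, 0.2), (2, 0.8)]],  # action 0
    [[(1, 0.9), (2, 0.1)], [(1, 0.1), (2, 0.9)], [(0, 0.1), (2, 0.9)]],            # action 1
]
R = [[0.0, 0.5], [0.0, 1.0], [1.0, 0.0]]

V = [0.0] * N_S
for _ in range(1000):
    # Bellman optimality backup: Q(s,a) = R(s,a) + gamma * sum_s' P(s'|s,a) V(s')
    Q = [[R[s][a] + GAMMA * sum(p * V[s2] for s2, p in P[a][s])
          for a in range(N_A)] for s in range(N_S)]
    V_new = [max(q) for q in Q]
    if max(abs(x - y) for x, y in zip(V_new, V)) < 1e-10:
        break
    V = V_new

# Greedy policy extracted from the converged Q-function.
policy = [Q[s].index(max(Q[s])) for s in range(N_S)]
```

Each sweep applies the backup V(s) ← max_a [R(s,a) + γ Σ_{s'} P(s'|s,a) V(s')]; because this operator is a γ-contraction, V converges to the optimal value function, and taking the argmax over actions recovers an optimal policy.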
The optimal V-function and Q-function are defined by the maximum value over all policies: V*(s) = max_π V^π(s) over all states, and Q*(s, a) = max_π Q^π(s, a) over all state-action pairs. In all MDP cases, at least one optimal policy always exists, and the value functions V(s) and Q(s, a) of all optimal policies are the same. As a result, optimizing the Q-function yields the optimal policy of the MDP:

π*(a | s) = 1 if a = argmax_{a' ∈ A} Q*(s, a'), and 0 otherwise.   (10)

To obtain a solution to the MDP problem using RL techniques, two main categories of methods are used. Model-free RL algorithms allow an agent to learn a policy purely from interactions with the environment, without explicitly constructing a model of the environment's dynamics. The other category, model-based RL algorithms, leverages a model of the environment, which can be given or learned. This model typically includes the transition probability function (3) and the reward function (4), allowing the agent to plan actions before execution [33].

Value-based algorithms are among the most popular model-free RL methods, where the agent estimates state-action values and represents them as a table (referred to as a Q-table or policy table) to optimize its decision-making. The most well-known value-based algorithms used for smaller MDP problems are the tabular methods: Q-learning, in which the agent updates the table based on the maximum possible future reward (off-policy learning), making it more exploratory [34], and state-action-reward-state-action (SARSA) [35], where the agent updates the Q-table according to the actual action taken (on-policy learning), leading to more conservative behavior.

On the other hand, model-based RL leverages a model of the environment to update the Q-table of state-action pairs. This approach can be classified into two main categories based on how the environment model is acquired.
In the first category, the agent learns the model through its interactions with the environment, as in the Dynamic Q-learning (Dyna-Q) algorithm [36]. In the second category, the model is provided to the agent, as seen in Monte Carlo Tree Search (MCTS) [37]. However, RL algorithms face scalability limitations when applied in large-scale learning environments. They often struggle with extensive state spaces and continuous action spaces, leading to inefficiencies in the exploration–exploitation trade-off, slow convergence, and difficulties in learning optimal policies.

To address the limitations of traditional Reinforcement Learning (RL) methods, the computational intelligence community has developed Deep Reinforcement Learning (DRL), which integrates advancements in deep neural networks. In DRL algorithms, deep learning techniques are employed to construct at least one of the following agent components: the value functions (8), (9), the policy function (5), the transition model (3), and the reward function (4). Such representations are essential when the RL agent interacts with environments characterized by a high-dimensional state space and a continuous action space. DRL is a powerful tool for achieving an end-to-end goal-directed learning process [38,39]. Figs. 3 and 4 present a comprehensive classification of the most popular RL/DRL algorithms based on their respective model types. Another crucial aspect of RL/DRL algorithms is the type of policy used during the training process. The focus here is to determine whether

Fig. 3. RL/DRL model-free algorithms: value-based methods (tabular: Q-learning, SARSA; deep learning: DQN; temporal difference: TD(0)), actor-critic methods (DDPG, TD3, SAC, DSAC, A2C/A3C), and policy-based methods (basic policy gradient: REINFORCE, VPG; advanced policy gradient: PPO, TRPO).
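The tabular model-free methods in Fig. 3 can be made concrete with a short sketch. The five-state corridor environment and all hyperparameters below are invented purely for illustration; the update line implements the off-policy Q-learning rule, and the comment notes how SARSA would differ.

```python
import random

# Tabular Q-learning on a hypothetical five-state corridor (states 0..4,
# action 0 = left, action 1 = right, reward +1 on reaching state 4).
N_STATES, ALPHA, GAMMA, EPS = 5, 0.5, 0.9, 0.2

def step(s, a):
    s2 = max(0, min(N_STATES - 1, s + (1 if a == 1 else -1)))
    return s2, (1.0 if s2 == N_STATES - 1 else 0.0)

def eps_greedy(Q, s):
    if random.random() < EPS:
        return random.randrange(2)            # explore
    return 0 if Q[s][0] >= Q[s][1] else 1     # exploit

random.seed(0)
Q = [[0.0, 0.0] for _ in range(N_STATES)]
for _ in range(200):
    s = 0
    while s != N_STATES - 1:
        a = eps_greedy(Q, s)
        s2, r = step(s, a)
        # Q-learning (off-policy): bootstrap from the greedy next value.
        # SARSA (on-policy) would instead bootstrap from Q[s2][a2], where
        # a2 is the action the behavior policy actually picks in s2.
        Q[s][a] += ALPHA * (r + GAMMA * max(Q[s2]) - Q[s][a])
        s = s2
```

After training, the greedy values max_a Q(s, a) approach γ^(3−s) for s = 0..3, i.e. they increase toward the rewarding state; the ε-greedy behavior policy is what supplies the exploration that the exposition above identifies as essential.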
[Fig. 4 taxonomy: model-based algorithms either learn the model (model ensemble: GPS, VAML, PAML; planning-oriented: PETS, Dyna-Q, Deep Dyna-Q; policy optimization: MBPO, MBVE, MBAC) or are given the model (residual augmentation: RCE, EBD, Residual-Q; policy learning with rollouts: Dyna-DDPG, SAC-MBR, ME-TRPO; planning-oriented: MuZero, AlphaZero).]
Fig. 4. RL/DRL model-based algorithms.

the behavior policy – defined as the policy interacting with the environment to collect training data – and the target policy – which represents the final policy that the agent is aiming to learn – are identical. On-policy methods utilize the collected data directly for the next round of policy optimization, meaning that the behavior and target policies are the same. In off-policy methods, however, the training data generated during interaction with the environment is stored in a buffer. During training, this stored data – which may have been gathered under previous policies – is used to update the target policy; in this case, the behavior policy is not the same as the target policy. The advantages of on-policy methods include greater stability and faster convergence, balanced exploration–exploitation rates, and ease of implementation, while off-policy methods offer better performance in complex environments and greater adaptability to changing policies.

Finally, RL/DRL methods are used to solve a wide range of optimization problems, from playing simple computer games to controlling highly complex large-scale configurations such as transportation networks and energy systems [40–42]. Both RL and DRL offer real-time adaptability and dynamic responsiveness compared to traditional control methods. However, without prior knowledge of the studied environment, they may suffer slow convergence and failures during the initial phases of operation [43,44].
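The data-handling difference between the two settings can be made concrete with a minimal experience replay buffer, the storage structure that off-policy methods rely on. This is an illustrative sketch, not tied to any reviewed study.

```python
import random
from collections import deque

class ReplayBuffer:
    """Off-policy methods store transitions (possibly collected under
    older behavior policies) and sample them later to train the target
    policy. On-policy methods would instead use each freshly collected
    batch once and then discard it."""

    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)  # oldest entries evicted first

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        # Sampled transitions may originate from outdated behavior
        # policies, which is exactly why this is off-policy training data.
        return random.sample(self.buffer, batch_size)
```

The bounded `deque` mirrors the common practice of discarding the oldest experience once the buffer is full, limiting how stale the stored behavior-policy data can become.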
4. Materials and methods

The methodology of this review was structured following the PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) framework to ensure transparency, rigor, and reproducibility [45].

4.1. Research questions

The main aim of this review is to synthesize recent advancements in RL/DRL techniques for improving energy efficiency in data centers. To provide a comprehensive understanding of this topic, this study focuses on answering the following research questions based on the identified papers.

• RQ1: What data center subsystems (e.g., cooling, ICT equipment, power supply) are targeted by the RL/DRL algorithms?
• RQ2: Which RL/DRL algorithms are utilized for energy optimization in data centers?
• RQ3: What experimental setups and dataset sources (e.g., real-world deployments or simulations) are commonly used?
• RQ4: What specific research problems are addressed using RL/DRL algorithms?
• RQ5: What are the primary objectives addressed in the identified studies?
• RQ6: What benchmarks are used to evaluate the achieved results in terms of energy efficiency?
• RQ7: What tools, frameworks, or platforms are employed to implement RL/DRL algorithms in this context?
• RQ8: What metrics are used to measure and report the effectiveness of RL/DRL algorithms in improving energy efficiency?

4.2. Search strategy

4.2.1. Literature resources

To ensure that all recent and relevant studies are covered, the search was carried out in five major and well-established academic databases, known for their extensive repositories of peer-reviewed studies in computer science, engineering, and energy systems. Given that the scope of this review is relatively new, the covered time frame is limited to publications from 2019 to August 2024. To maintain high quality and credibility, only peer-reviewed journal articles from the databases listed below were selected.
• IEEE Xplore
• Scopus
• ScienceDirect
• Web of Science
• ACM Digital Library

4.2.2. Search terms (keywords)

To ensure the high quality of this study, search queries were systematically designed using Boolean operators and keywords relevant to RL/DRL and energy efficiency in data centers. A representative search string was: (“data center” OR “data centers”) AND (“energy-aware” OR “energy utilization” OR “energy saving” OR “energy efficiency”) AND (“reinforcement learning” OR “RL”). Fig. 5 shows the search strategy used in this study.

4.3. Search process and selection criteria

To ensure the relevance and quality of the included studies, the PRISMA framework guided the article identification process, which involved four distinct stages:

1. Identification: Studies were retrieved using search queries across the selected databases.
2. Screening: The titles and abstracts were screened to eliminate irrelevant studies and duplicates.
3. Eligibility: Full-text articles were reviewed against the inclusion and exclusion criteria.
4. Inclusion: The final set of studies that met all quality assessment criteria was selected for detailed analysis.

A PRISMA flow diagram (Fig. 6) illustrates the selection process, documenting the number of studies identified, screened, excluded, and included.

Fig. 5. Search strategy to get relevant papers.

[Fig. 6 flow diagram: records retrieved per database – IEEE Xplore 25, Scopus 100, ScienceDirect 14, Web of Science 20, ACM Digital Library 5 (164 in total after duplicate removal); abstract and keyword screening excluded 53 (111 remaining); inclusion criteria excluded 40 (71 remaining); a further 12 were excluded (59 remaining); 18 additional relevant articles were found from the references (77 in total); the final quality check excluded 12, leaving 65 selected studies.]
Fig. 6.
Systematic literature review process stages: removal of duplicates, removal based on abstract and keywords, removal based on inclusion and exclusion criteria, addition of new articles found from the references, and, finally, removal of those that did not match the quality criteria.

4.3.1. Inclusion and exclusion criteria

• Inclusion criteria: To ensure the inclusion of high-quality and relevant studies, the following criteria were applied:
– Only peer-reviewed journal articles published between 2019 and August 2024.
– Studies explicitly applying RL/DRL algorithms for energy efficiency in data center environments.
– Studies presenting measurable outcomes, such as increased energy savings or improved Power Usage Effectiveness (PUE).
– Studies focusing on specific or joint subsystems (e.g., cooling systems, ICT equipment, and/or power supply).
– Only the most recent version of a study was included when duplicate publications were identified.
• Exclusion criteria: To facilitate the filtering of irrelevant studies, the following criteria were used:
– Non-peer-reviewed studies, including conference papers, review articles, and opinion pieces.
– Studies not addressing RL/DRL-based methods for energy optimization in data center environments.
– Studies lacking empirical evidence or quantitative metrics.
– Studies without full-text availability, making it impossible to assess the study's relevance and quality.
– Studies focused on very small-scale experimental setups, as they lack applicability to real-world data center environments.

4.3.2. Quality assessment criteria and rating system

To ensure that the final selection of identified articles is robust and reliable, a rigorous and systematic quality assessment process was implemented, based on the clearly defined criteria listed below:

• Clear and comprehensive documentation of the RL/DRL methods utilized, ensuring transparency in their implementation.
• Explicit definition and justification of the targeted subsystem's relevance within the study.
• Logical coherence in identifying the research problem and aligning it with the stated objectives.
• Methodological rigor in the design of experimental setups, including appropriate baseline comparisons and validation techniques.
• Implementation of well-defined metrics to assess energy efficiency, such as increased energy savings or improvements in Power Usage Effectiveness (PUE).
• Thorough comparative analysis of RL/DRL techniques against alternative benchmark methods to highlight their effectiveness and advantages.

Only studies that achieved a perfect score of 6 out of 6 on these criteria were included in the final synthesis.

4.4. Data extraction and synthesis

A comprehensive data extraction and synthesis template was completed for each identified study to ensure that all selected studies addressed the review's research questions. The extracted data were organized into a synthesis card and stored in an Excel file for further use throughout the systematic review stages. Table 2 summarizes the data extraction and synthesis card used to gather the necessary information from the identified studies. To present the findings of this review, visual representations, such as pie charts, bar charts, and Venn diagrams, were created. Additionally, tables were utilized to systematically summarize and provide a detailed analysis of each identified study. This systematic approach provides a clear and structured framework for synthesizing and interpreting the collected data, while also highlighting research gaps, addressing challenges, and identifying future directions [46].

4.5. Threats to validity

The following threats to validity were acknowledged:

1. Publication bias: The focus on peer-reviewed journals may exclude innovative but unpublished studies.
2. Database coverage: Relevant articles from less-accessible databases or gray literature might have been missed.
3. Variability in reporting: Differences in methodologies and reporting standards across studies could limit comparability.

Table 2
Data extraction template. Categories:
• Unique Identifier (ID)
• Study Title
• Authors Names
• Publication Venue
• Publication Year
• DC Subsystem Applications (RQ1)
• RL/DRL Algorithm Type (RQ2)
• Experimental Setup (RQ3)
• Research Problems (RQ4)
• Main Objectives (RQ5)
• Benchmark Algorithms (RQ6)
• Platforms and Frameworks (RQ7)
• Energy Efficiency Outcomes (RQ8)
• MDP Elements in Joint Optimization Studies
• Abstract
• Keywords
• Other Performance Metrics

To mitigate these threats, standardized inclusion criteria were applied, and article selection and data extraction were independently verified by multiple reviewers.

5. Results and discussions

In this section, we discuss and present the findings of this review. First, we summarize the fundamental details of each identified study, including the study title, authors' names, publication venue, and publication year. These details facilitated the systematic organization of this review, with each study assigned a unique identifier (ID) for easy reference during the data analysis and extraction process. Next, we provide a comprehensive analysis, highlighting key perspectives such as the studied subsystems, the RL/DRL algorithms applied, and the types of models utilized, offering valuable insights into the state of the art. Then, we conduct a deeper synthesis, classifying the studies based on the subsystems they targeted. This categorization helped obtain quantitative and qualitative data to address the research questions for each subsystem. We focus our discussion on more detailed and specific information regarding the research problems, study objectives, experimental setups, benchmark comparisons, platforms used, and energy-related outcomes.
Finally, we summarize the construction of Markov Decision Process (MDP) elements in joint optimization studies. Additionally, we reference related works to further support and contextualize the purpose and findings of this review.

5.1. Overview of the final identified studies

In this review, we identified 65 journal articles that apply RL/DRL algorithms to improve the energy efficiency of at least one major data center system. The publication venues and years of these articles are summarized in Table 3. Given that the research topic of this review is relatively new, all selected studies were published between 2020 and 2024, as shown in Fig. 7. Taking a broader look at the selected studies reveals that over 60 % focus entirely on the ICT system, exploring opportunities to enhance energy efficiency by leveraging RL/DRL algorithms from various perspectives. In contrast, approximately 21 % of the papers focus exclusively on the data center cooling system. The remaining studies examine combinations of multiple data center systems. Fig. 8 provides a detailed overview of the specific systems addressed in each selected paper. In the following paragraphs, we explore the RL/DRL algorithms used in the selected studies of this review.
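The subsystem percentages quoted above follow directly from the counts in Fig. 8; a small tally (illustrative only) reproduces them:

```python
from collections import Counter

# Subsystem focus of the 65 selected studies (counts taken from Fig. 8).
subsystem_counts = Counter({"ICT": 40, "Cooling": 14, "Joint": 11})

total = sum(subsystem_counts.values())  # 65 studies in total
shares = {k: round(100 * v / total, 1) for k, v in subsystem_counts.items()}
# ICT 61.5 %, Cooling 21.5 %, Joint 16.9 %
```

This matches the "over 60 %" ICT share and the roughly 21 % cooling share reported in the text.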
For the cooling system: Since the cooling system of data centers is characterized by an MDP with a high-dimensional state space and a continuous action space, all selected studies employed DRL methods, primarily focusing on model-free algorithms, including:

• Soft Actor-Critic (SAC) [66,87]
• Deep Deterministic Policy Gradient (DDPG) [58,70]
• Twin Delayed DDPG (TD3) [93]
• Proximal Policy Optimization (PPO) [60]
• Trust Region Policy Optimization (TRPO) [93]
• Deep Q-Network (DQN) [69,75,101,104]

However, two studies used model-based algorithms: Model-Based Actor-Critic (MBAC) [49], to propose a safe cooling mode adhering to strict thermal constraints, and Probabilistic Ensembles with Trajectory Sampling (PETS) [102], in which the study compares four different algorithms: two model-free off-policy algorithms, namely a DQN variant called Branching Dueling Q-Network (BDQ) and SAC; one model-free on-policy algorithm (PPO); and one model-based algorithm (PETS).

For the ICT system: Due to the discrete nature of certain ICT processes, such as task scheduling and resource allocation, the Q-learning algorithm has been employed in multiple studies to handle the ICT MDP environment [62,79,105]. This approach allows Q-values to be updated independently from action selection and execution, enabling the algorithm to capture delayed feedback more accurately. As a result, this method enhances the learning rate and accelerates the convergence process. Alternatively, DQN is commonly proposed for handling more complex ICT systems, as reported in [50,54,81].
However, other DRL algorithms are also used, such as:

• Actor-Critic (AC) [72,107]
• Soft Actor-Critic (SAC) [55,65]
• Proximal Policy Optimization (PPO) [67,103]
• Asynchronous Advantage Actor-Critic (A3C) [76]
• Deep Deterministic Policy Gradient (DDPG) [86]

For combined-system studies: As the complexity of the MDP problem increases when multiple systems are present, with a combination of discrete and continuous state spaces along with high-dimensional action spaces, traditional RL approaches become less effective. In response to these challenges, all selected studies addressing the integration of multiple data center systems employed DRL algorithms. Notable DRL algorithms used in these studies include:

• Actor-Critic (AC) [47]
• Soft Actor-Critic (SAC) [48]
• Deep Q-Network (DQN) and its extensions [52,61,80,91,98]
• Deep Deterministic Policy Gradient (DDPG) [91,92]

Fig. 9 illustrates the distribution of various RL/DRL algorithms in the selected studies. Q-learning and DQN were the most frequently cited algorithms, together appearing in 60 % of studies, followed by SAC (eight studies), PPO (four studies), DDPG (four studies), and AC/A3C (four studies). About 9 % of studies employed other algorithms. Table 4 categorizes the algorithms implemented in the selected studies based on the utilized model type. According to Figs. 3 and 4, nearly 98 % of the algorithms employed are model-free, divided into three main groups: value-based algorithms, policy-based algorithms, and actor-critic algorithms. Only two studies utilized model-based algorithms, likely due to the complexity involved in accurately modeling a data center system. Some studies used more than one RL/DRL method, causing them to appear in multiple categories in the table. The following sections provide a detailed analysis of these algorithms and their applications.

Table 3
The selected studies.
ID | Authors | Publication venue | DC application (RQ1) | Year
S1 | Jayanetti et al. | IEEE Transactions on Parallel and Distributed Systems | Integrating power supply and ICT systems | 2024
S2 | Biemann et al. | IEEE Internet of Things Journal | Integrating cooling and power supply systems | 2023
S3 | Wan et al. | IEEE Transactions on Emerging Topics in Computational Intelligence | Cooling system | 2023
S4 | Lou et al. | IEEE Transactions on Network and Service Management | ICT system | 2023
S5 | Ran et al. | IEEE Transactions on Services Computing | Integrating cooling and ICT systems | 2023
S6 | Ran et al. | IEEE Transactions on Services Computing | Integrating cooling and ICT systems | 2023
S7 | Zeng et al. | IEEE Transactions on Parallel and Distributed Systems | ICT system | 2022
S8 | Kang et al. | IEEE Transactions on Network and Service Management | ICT system | 2022
S9 | Pham et al. | IEEE Access | ICT system | 2021
S10 | Yi et al. | IEEE Transactions on Parallel and Distributed Systems | ICT system | 2020
S11 | Ding et al. | IEEE Access | ICT system | 2020
S12 | Li et al. | IEEE Transactions on Cybernetics | Cooling system | 2020
S13 | Cheng et al. | IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems | ICT system | 2020
S14 | Leindals et al. | Energy and AI | Cooling system | 2024
S15 | Zhao et al. | IEEE Transactions on Sustainable Computing | Integrating power supply and ICT systems | 2024
S16 | Ghasemi et al. | Cluster Computing | ICT system | 2024
S17 | Ghasemi et al. | Computing | ICT system | 2024
S18 | Bhatt et al. | International Journal of Advanced Computer Science and Applications | ICT system | 2024
S19 | Zhang et al. | IEEE Transactions on Network and Service Management | ICT system | 2024
S20 | Guo et al. | Applied Energy | Cooling system | 2024
S21 | Yang et al. | Journal of Supercomputing | ICT system | 2024
S22 | Bouaouda et al. | Sustainability | ICT system | 2024
S23 | Chen et al. | Measurement and Control | Cooling system | 2024
S24 | Wang et al. | ACM Transactions on Cyber-Physical Systems | Cooling system | 2024
S25 | Aghasi et al. | Computer Networks | Integrating cooling and ICT systems | 2023
S26 | Wang et al. | Journal of Cloud Computing | ICT system | 2023
S27 | Wang et al. | Computer Networks | ICT system | 2023
S28 | Ghasemi et al. | Cluster Computing | ICT system | 2023
S29 | Huang et al. | Energies | Cooling system | 2023
S30 | Wei et al. | Journal of King Saud University – Computer and Information Sciences | ICT system | 2023
S31 | Liu et al. | Applied Energy | ICT system | 2023
S32 | Ahamed et al. | Sensors | ICT system | 2023
S33 | Ma et al. | IEEE Transactions on Industrial Informatics | ICT system | 2023
S34 | Simin et al. | Journal of Intelligent and Fuzzy Systems | Integrating cooling and ICT systems | 2023
S35 | Nagarajan et al. | Expert Systems | ICT system | 2023
S36 | Yang et al. | KSII Transactions on Internet and Information Systems | ICT system | 2022
S37 | Pandey et al. | Mobile Information Systems | ICT system | 2022
S38 | Shaw et al. | Information Systems | ICT system | 2022
S39 | Yan et al. | Computers and Electrical Engineering | ICT system | 2022
S40 | Wang et al. | Computer Networks | ICT system | 2022
S41 | Mahbod et al. | Applied Energy | Cooling system | 2022
S42 | Abbas et al. | Physical Communication | ICT system | 2022
S43 | Uma et al. | Transactions on Emerging Telecommunications Technologies | ICT system | 2022
S44 | Wang et al. | Future Generation Computer Systems | ICT system | 2021
S45 | Zhou et al. | IEEE Network | Integrating cooling and ICT systems | 2021
S46 | Chi et al. | Energies | Integrating cooling and ICT systems | 2021
S47 | Biemann et al. | Applied Energy | Cooling system | 2021
S48 | Ding et al. | Future Generation Computer Systems | ICT system | 2020
S49 | Peng et al. | Cluster Computing | ICT system | 2020
S50 | Hu et al. | Electronics | Integrating power supply and ICT systems | 2020
S51 | Qin et al. | Applied Intelligence | ICT system | 2020
S52 | Yang et al. | Journal of Building Engineering | Integrating cooling, ICT, and power supply systems | 2024
S53 | Lin et al. | IEEE Access | ICT system | 2020
S54 | Caviglione et al. | Soft Computing | ICT system | 2021
S55 | Le et al. | ACM Transactions on Sensor Networks | Cooling system | 2021
S56 | Zhang et al. | Applied Energy | Cooling system | 2023
S57 | Li et al. | CCF Transactions on High Performance Computing | ICT system | 2021
S58 | Wan et al. | IEEE Intelligent Systems | Cooling system | 2021
S59 | Haghshenas et al. | IEEE Transactions on Services Computing | ICT system | 2022
S60 | Zhang et al. | IEEE Transactions on Cybernetics | Cooling system | 2024
S61 | Sun et al. | Computer Networks | ICT system | 2020
S62 | Asghari et al. | Computer Networks | ICT system | 2020
S63 | Siddesha et al. | Cluster Computing | ICT system | 2022
S64 | Asghari et al. | Soft Computing | ICT system | 2020
S65 | Zhang et al. | Expert Systems with Applications | Cooling system | 2023
(References: studies S1–S65 correspond, in order, to citations [47]–[111].)

[Fig. 7 bar chart: 12 studies published in 2020, 9 in 2021, 12 in 2022, 18 in 2023, and 14 in 2024.]
Fig. 7. Publication year distribution of selected studies.

Fig. 8. The sub-systems focused on in selected studies: 40 studies focus on ICT optimization, 14 on cooling optimization, and 11 are joint studies integrating multiple systems, including the power supply system.

[Fig. 9 pie chart: DQN 33.8 %, Q-learning 26.2 %, SAC 12.3 %, PPO 6.2 %, DDPG 6.2 %, AC/A3C 6.2 %, other methods 9.1 %.]
Fig. 9. Distribution of algorithms utilized in this review.

5.2. Comparison of RL/DRL algorithms applied to cooling systems

Cooling systems account for approximately 40 % of energy consumption in data centers [112]. Reducing the energy consumption of this non-ICT support system will improve the power usage effectiveness (PUE) of the data center.
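Since PUE recurs throughout the outcome reporting surveyed below, its standard definition (total facility energy divided by the energy delivered to ICT equipment) is worth making explicit; the following minimal computation is for illustration only and is not taken from any reviewed study.

```python
def pue(total_facility_energy_kwh: float, it_equipment_energy_kwh: float) -> float:
    """Power Usage Effectiveness: total facility energy divided by the
    energy consumed by ICT equipment. 1.0 is the ideal lower bound;
    lower values indicate a more efficient facility."""
    if it_equipment_energy_kwh <= 0:
        raise ValueError("ICT equipment energy must be positive")
    return total_facility_energy_kwh / it_equipment_energy_kwh

# Example: a facility drawing 1400 kWh to deliver 1000 kWh of ICT load
# has PUE = 1.4; the 400 kWh overhead is dominated by cooling.
```

Reducing cooling energy lowers the numerator while leaving the ICT denominator unchanged, which is why cooling optimization maps directly onto PUE improvement.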
Furthermore, optimizing the operation of the cooling systems can significantly influence the thermal conditions and cooling flow of ICT devices, leading to further reductions in the total energy consumption of the entire data center [113]. In this section, we analyze the selected articles that use RL/DRL techniques to optimize the operation of the data center cooling system with the aim of reducing energy consumption. The following analysis not only provides an overview of how RL/DRL methods are applied to data center cooling systems, but also investigates the specific aspects of each selected study in detail. This includes the formulation of the research problem and objectives, the energy-related outcomes, the benchmark comparisons, and the experimental setup.

5.2.1. The research problem and objective formulation

As illustrated in Fig. 8, 14 papers discussing cooling systems were identified. These studies focused on two main research problems (RQ4), each with different objectives (RQ5). The dominant category of research problem focuses on optimizing cooling system operations in various scenarios to improve energy efficiency by utilizing various RL/DRL approaches. An interesting configuration involves using DRL to optimize the data center cooling system integrated with an active thermal management framework. For example, [60] explores the balance of aquifer thermal energy storage (ATES) while minimizing the total cost and maintaining the temperature range of the servers using a DRL agent. Similarly, [104] introduces Active Ventilation Tile (AVT) controllers to enhance the operation of the rack cooling system, achieving a trade-off between energy consumption and rack supply temperature distribution. An alternative scenario is integrating RL/DRL algorithms with prior physical knowledge to enhance the cooling system's energy efficiency.
The study [75] integrates this knowledge by using big data, IoT sensor networks, and a digital twin model with the DRL algorithm. By leveraging historical and real-time data, this approach employs a Long Short-Term Memory (LSTM) network to predict temperatures, enabling the DQN algorithm to effectively reduce the energy consumption of the cooling systems. Due to the strong relationship between the energy efficiency of data center cooling systems and the ambient temperatures at their locations, several studies have extensively investigated the efficiency of DRL algorithms in reducing cooling system energy consumption in tropical climates. Specifically, the study [101] focuses on optimizing the supply air temperature and relative humidity in a free-cooled tropical data center under defined boundaries, while [87] explores a single-agent DRL strategy with a floating set point approach to reduce the temperature threshold for tropical data centers based on a whole-building evaluation method. The study [69] proposes a multi-set-point approach based on the DQN algorithm (DQN-MSP) to enable precise cooling control of the CRAC unit's air temperature, offering significant improvements in data center cooling energy consumption. Another key research direction in this category emphasizes designing and comparing multiple state-of-the-art DRL algorithms for optimizing energy consumption while maintaining thermal conditions, as demonstrated in studies [66,93,102]. Meanwhile, the second research category shifts attention to the reliability of safety-aware DRL strategies, with the core aim of minimizing the energy consumption of the data center cooling system. These strategies are designed to ensure strict adherence to both soft and hard constraints during the learning and operational phases.
In [58], the study develops an end-to-end off-policy DDPG agent to optimize the cooling system using unprocessed, high-dimensional input data directly. Additionally, the study introduces a de-underestimation (DUE) validation mechanism for the critic network to address underestimation of overheating risks. In [106], the study focuses on incorporating residual physics based on thermodynamic principles to guide the DRL agent's exploration process by estimating the desirable range of actions, ensuring the safety of future actions. In addition, the study [49] develops safe cooling system operation by utilizing a model-based actor-critic DRL (MBAC) algorithm with two different models: a system transition model to predict the future system state, and a risk model to estimate the negative effects of executing an action. Furthermore, the paper [70] utilizes offline imitation learning and online post-hoc rectification techniques to develop three different versions of a safety-aware DDPG controller for the data center cooling system. Alternatively,

Table 4
Classification of algorithms by model type and study IDs.
Value-based, model-free:
• DQN: S4, S6, S7, S8, S10, S13, S15, S23, S29, S32, S34, S35, S37, S39, S42, S45, S49, S50, S53, S54, S55, S58, S22
• Q-learning: S11, S16, S17, S18, S22, S25, S28, S33, S38, S43, S44, S48, S51, S59, S62, S63, S64
• B3QN: S56
• PADQN: S5, S27, S45
• SARSA: S38
• BDQ: S56
Policy-based, model-free:
• PPO: S14, S21, S36, S57, S60, S65, S47
• TRPO: S47
• Monte Carlo (REINFORCE): S31
Actor-critic, model-free:
• SAC: S2, S9, S19, S20, S41, S56, S60, S65, S47
• A3C: S30
• AC: S1, S26, S61, S35
• DDPG: S12, S24, S40, S46, S35, S45, S60
• TD3: S47, S56
Learned model-based:
• PETS: S56
• MBAC: S3
Given model-based: none identified.

the study [111] leverages techniques such as Lagrangian-based constrained DRL (CDRL) and reward shaping to satisfy soft constraints through extensive online learning. Also, within the same study, hard constraints are addressed by a parameterized shielding DRL algorithm (DRL-S), which projects unsafe actions onto safe action spaces. The ultimate goal of the studies in this second category is to design a safe cooling system for data centers, reducing energy consumption while effectively maintaining thermal constraints. The insights from this section are summarized in Table A.8.

5.2.2. The energy-related outcomes

The primary motivation of this study is to address how the proposed RL/DRL algorithms enhance the energy efficiency of data centers (RQ8). The results related to energy efficiency have been carefully and thoroughly analyzed. Given the diversity of research problems and objectives addressed in the identified cooling system studies, the reporting methods for energy-related outcomes vary significantly. Some studies express the improvements achieved by the RL/DRL algorithm as a percentage reduction in energy consumption compared to the baseline controller (e.g., DefaultE+) [93,102,111].
In addition, other studies compare the energy saving percentage of their proposed RL/DRL strategies with other benchmark controllers, including DRL and non-DRL algorithms [49,70,87]. Energy efficiency is also reported in terms of improvements in key data center performance metrics, such as power usage effectiveness (PUE), compared to baseline controllers (e.g., DefaultE+) [58] or state-of-the-art controllers [66], while other studies use the PUE to evaluate the differences in energy consumption before and after applying the proposed RL/DRL algorithms [75]. Other studies focus on energy cost reductions rather than energy consumption savings [60]. Moreover, combining RL/DRL strategies with advanced setups, such as AVT systems [104] and physics-guided DRL with shielding [106], highlights the potential of RL/DRL in performing a trade-off analysis between energy efficiency and system performance. Furthermore, some studies demonstrate energy savings while maintaining thermal constraints, either by increasing the average supply air temperature of the CRAC units [69] or by raising the temperature and relative humidity thresholds [101]. A more detailed analysis of additional objectives combined with energy efficiency is provided in Section 6. A detailed summary of this section's findings is presented in Table A.8.

5.2.3. The benchmark comparisons

The distribution of benchmark algorithms used in the cooling system studies for energy-related results comparison is illustrated in Fig. 10.

[Fig. 10 bar chart, number of benchmarks per algorithm: PID 2, MPC 3, DQN 3, TRPO 4, PPO 8, DefaultE+ 6, DDPG 3, SAC 5, TD3 2, other non-DRL 6, other DRL 7.]
Fig. 10. Number of benchmarks in the literature for cooling system.

Analyzing the statistical data reveals two distinct groups. The first group involves the use of DRL algorithms, owing to their adaptability, as benchmarks for comparison, with PPO being the most widely used, appearing in eight studies.
Other prominent DRL algorithms include SAC (used five times), TRPO (used four times), DDPG (used three times), DQN (used three times), and TD3 (used twice). The second group consists of non-DRL algorithms, where the built-in EnergyPlus baseline controller (DefaultE+) was used in five studies, the classical PID controller was used twice, and the optimal model predictive controller (MPC) was used three times. Other DRL and non-DRL algorithms, including those used as benchmarks only once, are also considered. Table 5 outlines the benchmark algorithm comparisons (RQ6) for each selected cooling system study, including both DRL and non-DRL algorithms.

5.2.4. The experimental setup

Among the 14 selected cooling system studies, only one study directly implemented the proposed DRL strategy on a real-world data center [104]. In contrast, the remaining studies tested the designed DRL algorithms in simulated environments, highlighting a gap in direct real-world application and validation. These simulations utilized either real-world datasets, synthetic datasets, or a hybrid approach combining both. The EnergyPlus building energy simulation program [117] emerged as a primary tool for simulating energy consumption in data

Table 5
Selected cooling system studies experimental setup.
ID Environment (RQ3) Data source (RQ3) Data type (RQ3) Benchmarks (RQ6) Platform (RQ7) S3 Simulation Simulated a typical data center room with Alibaba’s 2018 cluster data Real-world MBRL-MPC, MBHC Unspecified CFD simulator, Python (PyTorch) S12 Simulation National Super Computing Centre (NSCC) of Singapore Real-world DefaultE+, Two-stage (TS), A3C, TRPO EnergyPlus, Python (Scipy) S14 Simulation Naviair data center (the Danish airspace control company) Real-world No reward PPO, Delayed reward PPO, Uniform future PPO, Trend-based future policy to estimate the return Python (OpenAI Gym) S20 Simulation Simulated liquid-cooled data cen- ter with unspecified real-world data set Simulated a small data center with a real-world dataset from the PlanetLab system Real-world PID, MPC, DQN, TRPO, PPO Matlab (Simscape) S23 Simulation Real-world DQN-SSP, PPO-MSP, DDPG-MSP 6SigmaRoom, CloudsimPy, Python S24 Simulation Four simulated configurations of CW- and DX-cooled data centers under two climate conditions Synthetic For the first three proposed controllers: DefaultE+, Reward shaping DDPG, Simplex DDPG, Projection post-hoc rectification DDPG For the fourth controller: PID, Vanilla DDPG, Reward shaping DDPG EnergyPlus, OpenFOAM, Python (OpenAI Gym and PyTorch) S29 Simulation Simulation for real-world data center room located in Shenzhen Simulated mid-tier stand-alone data center located in a tropical climate region Real-world Comparison of DC energy efficiency metrics before and after the DRL strategy 6SigmaRoom, Autodesk Revit, Python S41 Simulation Synthetic DefaultE+, Load Aware, Temperature Aware, Joint-IT, Multi-Agent DRL, TD3, PPO, TRP, various versions of SAC EnergyPlus, Python S47 Simulation Simulated medium-sized DC with two zones, a direct expansion cooling coil, and a chilled water cooling coil Synthetic and real-world DefaultE+, TD3, PPO, TRPO, SAC EnergyPlus, Python (OpenAI Gym) S55 Simulation A real free-cooled data center located in a tropical zone Real-world 
Hysteresis-based controller, MPC Matlab, Python (Keras and TensorFlow) S56 Simulation Simulated data center test bed developed in [114] Synthetic DefaultE+, PETS, BDQ, PPO, SAC EnergyPlus, Python (PyTorch) S58 Real-time Inner Mongolia Meteorological Information Center (IMMIC) Real-world DL, DN, DQN Python (TensorFlow), Real- time S60 Simulation Simulated data center test bed developed in [115] Synthetic SAC, RP-SAC, DDPG, RP-DDPG, PPO, RP- PPO, Lagrangian-based safe DRL, Physics EnergyPlus, Python (PyTorch, TensorFlow) S65 Simulation Simulated data center test bed developed in [116] Synthetic DefaultE+, PPO, SAC, PPO-Lag EnergyPlus center cooling systems, often integrated with various Python libraries to implement DRL agents. Other simulation environments utilized include Computational Fluid Dynamics (CFD) simulators such as OpenFOAM [118] and 6SigmaRoom [119], which offer detailed modeling of airflow and thermal dynamics. Furthermore, MATLAB, along with its advanced toolboxes like Simulink and Simscape, was frequently employed to sim - ulate the operational processes of data center cooling systems, providing a robust platform for evaluating control strategies and optimizing sys - tem performance. Table 5 presents a comprehensive overview of the experimental setup, including the environment, dataset source and type (RQ3), and platform (RQ7) for all identified studies on cooling systems. 5.3. Comparison of RL/DRL algorithms applied to ICT systems Over the past few years, data centers have grown significantly in size and complexity driven by the rapid advancements in ICT sys - tems. The advancements involve a wide range of devices, including high-performance servers, processing units such as CPUs and GPUs, advanced memory units, and storage arrays [120]. This technological progress has enabled data centers to support more complex operations, such as training large language models (LLMs) and real-time data pro - cessing. 
As a result, improving the energy efficiency of ICT systems has become a critical priority, not only to enhance the performance and scalability of data centers but also to minimize energy consumption and operational costs. In this section, we comprehensively examine the role of RL/DRL algorithms in tackling energy efficiency challenges within ICT systems as identified in the literature.

5.3.1. The research problem and objective formulation

The majority of the papers identified in this review focus on ICT systems, specifically 40 studies. The research problems (RQ4) and objectives (RQ5) of these studies can be categorized into the following areas:

Scheduling optimization: A considerable number of existing studies discuss the scheduling optimization challenge in a DC environment using RL/DRL approaches; however, few studies have explored the energy efficiency aspects of applying these algorithms. The three main types of scheduling optimization problems addressed by RL/DRL algorithms in the identified studies are job scheduling, task scheduling, and resource scheduling.

Job scheduling: Job scheduling refers to assigning and allocating an entire arriving job, which may consist of one or multiple tasks, to the DC resources, managing workloads at a high level. Traditional job scheduling mechanisms often struggle to cope with extensive, heterogeneous DC environments, especially in cases involving long-lasting jobs. This limitation leads to inefficiencies in energy consumption and resource management. Three studies [56,77,85] have addressed this challenge by proposing RL/DRL algorithms. The primary approach involves handling this challenge dynamically by considering real-world constraints, such as job dependencies and QoS levels, to minimize energy consumption and carbon emissions in data centers.
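To make the mechanics concrete, the following is a minimal, self-contained sketch, not drawn from any reviewed study, of how a tabular Q-learning agent can learn an energy-aware job-to-server assignment. The quadratic power model, constants, and names are illustrative assumptions only.

```python
import random

# Hypothetical illustration: tabular Q-learning assigning arriving jobs to
# one of three servers. State = tuple of server loads, action = target
# server, reward = negative incremental power cost. The quadratic cost is
# an assumed model that penalizes concentrating jobs on one machine.
random.seed(0)

N_SERVERS = 3
JOBS_PER_EPISODE = 6
PER_JOB_W = 30.0                 # assumed per-job power coefficient

def place(loads, server):
    """Return the new load vector and the incremental power cost."""
    new_loads = list(loads)
    new_loads[server] += 1
    cost = PER_JOB_W * new_loads[server] ** 2
    return tuple(new_loads), cost

Q = {}                           # (state, action) -> value
ALPHA, GAMMA, EPS = 0.1, 0.9, 0.2

def greedy(state):
    values = [Q.get((state, a), 0.0) for a in range(N_SERVERS)]
    return values.index(max(values))

for _ in range(500):             # training episodes
    state = (0,) * N_SERVERS
    for _ in range(JOBS_PER_EPISODE):
        action = greedy(state) if random.random() > EPS \
            else random.randrange(N_SERVERS)
        next_state, cost = place(state, action)
        target = -cost + GAMMA * max(
            Q.get((next_state, a), 0.0) for a in range(N_SERVERS))
        Q[(state, action)] = Q.get((state, action), 0.0) + ALPHA * (
            target - Q.get((state, action), 0.0))
        state = next_state

# Roll out the learned greedy policy on a fresh episode.
state = (0,) * N_SERVERS
for _ in range(JOBS_PER_EPISODE):
    state, _ = place(state, greedy(state))
print(state)   # with enough training, placement tends toward balanced loads
```

Real studies replace the toy power model with measured or simulated server power curves and the Q-table with a neural approximator (DQN and beyond), but the state/action/reward loop is the same.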
Task scheduling: Tasks are the components of jobs that typically need to be performed in a specific order due to their interdependence. Task scheduling refers to managing the execution of individual tasks within a job at a low level. The main objective of the task scheduling studies is to select the optimal DC resource for task execution, ensuring compliance with time and QoS constraints. Ten studies were identified that discuss the task scheduling problem, highlighting three main approaches:

• Dependency- and workflow-oriented RL/DRL task scheduling approaches [72,82,94,110].
• Heterogeneous cloud DC online RL/DRL task scheduling approaches [67,103,109].
• Adaptive and hybrid RL/DRL task scheduling approaches [50,54,59].

Resource scheduling: While task and job scheduling focus on the DC workload, resource scheduling concentrates on the physical (e.g., servers) or virtual (e.g., VM) infrastructure level of the DC. The main aim of resource scheduling is to maximize resource utilization, and it does not directly consider job and task dependencies. Two studies specifically focused on the resource scheduling problem [89,95].

Virtual machines and containers management: The virtualization of physical resources in data centers to meet the growing demands of workloads has received significant attention from researchers in recent years. Two technologies are commonly employed for virtualization: hardware-level virtualization, in which each virtual machine (VM) runs its own operating system and applications on top of a hypervisor, and operating system (OS)-level virtualization, which leverages the host system's kernel to create containers that share the host's resources [121].
In this review, we selected 14 studies that focus on managing VMs and containers using RL/DRL algorithms and present energy efficiency results. These studies address three key areas: VM consolidation, VM and container placement, and VM replacement.

VM consolidation: This refers to reducing the number of physical machines (PMs) required to operate the data center workload. The process includes three stages: workload detection (overutilization and underutilization), VM selection, and VM placement. By running multiple VMs on fewer PMs, several objectives can be achieved, including optimizing ICT resources, reducing operational costs, and minimizing energy consumption. Five studies in this review discuss the VM consolidation problem in data centers using RL/DRL algorithms, with two main approaches:

• Centralized adaptive RL/DRL strategies [53,57,84,88].
• Multi-agent RL strategies [105].

VM and container placement: This is a sub-process of consolidation, where the objective is solely to decide the optimal location (PM) for a VM. It is applied at the PM (host) level rather than at the DC system level. Eight studies have been identified on this topic: seven on VM placement and one on container placement [68].

VM replacement: This refers to reassigning an already placed VM to a new physical machine (PM). The process is triggered by changes in the current state (e.g., overloading, failures). It is also considered a sub-process of VM consolidation, enabling VM migration. Among the selected studies, only one specifically addressed this issue, proposing a novel approach that combines fuzzy logic with an RL algorithm to enhance decision-making and adaptability [74].

Two studies combine the two aforementioned categories as a research problem, focusing on VM scheduling by allocating tasks or jobs to VMs assigned to hosts, leveraging RL/DRL algorithms to optimize the scheduling process [79,90].
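Greedy bin-packing heuristics such as First-Fit are among the non-RL baselines most often cited for VM placement in this literature. The sketch below shows a First-Fit-Decreasing (FFD) variant under an assumed one-dimensional CPU-demand model; the function name and numbers are illustrative, not taken from any specific paper.

```python
# Hedged sketch of a First-Fit-Decreasing (FFD) VM-placement baseline.
# Demands and capacity are in abstract CPU units (an assumption); real
# studies use multi-dimensional resources (CPU, RAM, bandwidth).

def ffd_place(vm_demands, host_capacity):
    """Place VMs on as few hosts as possible.

    Sort VMs by decreasing demand, put each into the first host with
    enough residual capacity, and open a new host if none fits.
    Returns a list of per-host VM-demand lists.
    """
    residual = []       # free capacity per open host
    placement = []      # VM demands assigned to each host
    for demand in sorted(vm_demands, reverse=True):
        for i, free in enumerate(residual):
            if demand <= free:
                residual[i] -= demand
                placement[i].append(demand)
                break
        else:
            residual.append(host_capacity - demand)
            placement.append([demand])
    return placement

# Fewer active hosts generally means less idle power, which is why
# consolidation studies report energy alongside active-host counts.
plan = ffd_place([5, 7, 3, 2, 4, 6], host_capacity=10)
print(len(plan), plan)   # 3 [[7, 3], [6, 4], [5, 2]]
```

An RL/DRL placement agent is then evaluated by how much further it reduces active hosts (and thus energy) relative to such a heuristic under dynamic workloads.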
DCN traffic control: Data Center Networks (DCNs) play a critical role in ensuring the smooth operation of ICT systems. However, they often suffer from bandwidth surges, which degrade data center performance and significantly increase energy consumption. Traditional methods are limited in their adaptability and fail to handle sudden network traffic fluctuations dynamically, leading to substantial energy waste. RL/DRL algorithms offer effective approaches to tackle these challenges. Four studies have been identified that explore solutions to this problem, each employing a distinct structural RL/DRL approach:

• Combining LSTM networks for traffic prediction with proactive RL/DRL agents to optimize traffic control and energy efficiency [73,86].
• Formulating the problem as an MILP model to define the optimal solution space and integrating RL/DRL algorithms to find near-optimal solutions dynamically [55].
• Employing Software-Defined Networking (SDN) and RL/DRL to dynamically schedule traffic flows, aiming to reduce energy consumption while maintaining an optimal Flow Completion Time (FCT) [107].

Multi-objective framework: Five studies address job/task scheduling, task offloading, and resource allocation as multi-objective research problems. The resources considered in these studies include containers [65,81], multi-user, multi-data-center resources [99], and general data center resources [83,108]. A detailed summary of the identified ICT studies' research problems (RQ4) and objectives (RQ5) is presented in Table A.9.

5.3.2. The energy related outcomes

As energy efficiency is the primary focus of this review, a comprehensive analysis of the energy efficiency outcomes of using RL/DRL algorithms in ICT systems in the identified studies is presented in Table A.9.
This table answers this review's RQ8 and demonstrates that the proposed RL/DRL algorithms consistently outperform baseline and benchmark non-RL/DRL methods in terms of energy efficiency. The reported energy efficiency improvements range from small percentages (1 %–3 %) to significant enhancements (over 60 %), depending on the specified scenario and context, such as varying VM/task loads, DCN traffic sizes, or the use of real-world or synthetic datasets. The majority of the studies reported energy efficiency as a percentage improvement over benchmark algorithms.

Additionally, some studies highlighted energy efficiency enhancements in terms of scalability and dataset-based performance. For instance, studies [62,108] focus on performance across diverse datasets and scalability metrics. Other studies compare the achieved energy savings across multiple experimental setups or configurations. [67] investigated task scheduling across three distinct scenarios with 10, 50, and 100 servers, examining the impact of server configurations on energy efficiency. [88] explored VM consolidation under different workloads, assessing its impact on resource utilization and energy consumption. [86] analyzed DCN traffic control with both more than 70 nodes and fewer than 70 nodes, assessing performance across different network sizes. [109] conducted task scheduling across two different task counts and varying numbers of VMs, evaluating performance under diverse configurations.

In addition, a few studies presented a generalized approach without explicitly referencing benchmark algorithms. For instance, [74] reported energy savings in a generalized context, providing insights into the potential applicability of the proposed RL algorithm.

5.3.3. The benchmark comparisons

Each research problem discussed in the identified ICT system studies was compared to baseline or state-of-the-art benchmark methods commonly used in the respective problem domain. As presented in Fig. 11, the most commonly used baseline method for scheduling optimization studies was the RANDOM method, in which jobs/tasks/VMs are assigned to resources without considering any optimization criteria. This approach is simple and achieves unbiased scheduling; however, it is inefficient, as it overlooks critical DC metrics such as energy efficiency, quality of service (QoS), and workload balancing. This method was used as a baseline for comparison with the proposed RL/DRL algorithms in seven of the identified scheduling optimization studies.

[Fig. 11. Number of benchmarks in the literature for ICT systems: bar chart of benchmark counts per algorithm family (Random, Heuristic, Meta-heuristic, RL/DRL, Other, ML).]

Additionally, heuristic-based algorithms were widely used as benchmarks to evaluate the proposed RL/DRL algorithms for various ICT research problems. For scheduling research problems, the Round-Robin (RR) method was the primary heuristic-based method for performance comparison. Greedy algorithms, including First-Fit (FF), Best-Fit (BF), and their variants, were the main benchmarks for VM management research problems, while Elastic-Tree was a common benchmark for DC network traffic control problems. Approximately 78 additional heuristic-based algorithms were employed as comparison methods across all the research problems discussed in ICT systems. Meta-heuristic methods were also used 23 times as evaluation methods.
These included Particle Swarm Optimization (PSO), Ant Colony Optimization (ACO), Genetic Algorithms (GA), and their variants, applied to various ICT system research problems. Machine learning algorithms were occasionally utilized as benchmark methods in a limited number of identified studies, particularly for VM management. Other RL/DRL algorithms developed in previous studies were used 31 times for comparison with newly proposed algorithms, demonstrating internal comparisons within RL/DRL approaches in the identified studies. Finally, some specially designed algorithms were also employed. Table 6 outlines the benchmark algorithm comparisons (RQ6) for each selected study on ICT systems.

5.3.4. The experimental setup

CloudSim [122] and its variant WorkflowSim [123], an extended and optimized version of CloudSim designed for dependent task workflows, were used as simulation environments in approximately 50 % of the identified studies focusing on scheduling optimization and VM management research problems in DC ICT systems. In addition to these tools, programming languages such as Java and Python were frequently employed for simulation experiments in multiple studies. MATLAB was used as the simulation environment in four studies, while six studies did not specify the simulation environment used. On the other hand, several real-world datasets from large-scale data centers such as Google, Wikipedia, and Alibaba, as well as smaller data centers such as the National Supercomputing Centre (NSCC) of Singapore and the Nottingham University Data Center, were utilized as data sources in the identified studies. Moreover, well-known datasets such as PlanetLab and the CoMon project were also employed for simulation experiments. Synthetic datasets were another key data source, enabling controlled and customized testing scenarios.
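For reference, the two simplest baselines discussed above, RANDOM and Round-Robin, can be sketched in a few lines; the task and resource counts below are placeholders, not values from any reviewed study.

```python
import random

# Minimal sketches of the RANDOM and Round-Robin (RR) baselines that the
# identified studies compare RL/DRL schedulers against.

def random_assign(n_tasks, n_resources, seed=42):
    """Assign each task to a uniformly random resource, ignoring load."""
    rng = random.Random(seed)
    return [rng.randrange(n_resources) for _ in range(n_tasks)]

def round_robin_assign(n_tasks, n_resources):
    """Cycle through resources in a fixed order, ignoring load."""
    return [i % n_resources for i in range(n_tasks)]

print(round_robin_assign(8, 3))   # [0, 1, 2, 0, 1, 2, 0, 1]
print(random_assign(8, 3))        # unbiased but load- and energy-oblivious
```

Neither baseline observes server state, which is precisely the gap the RL/DRL schedulers exploit when they report energy savings over these methods.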
Table 6 provides a comprehensive overview of the experimental setup, encompassing the simulation environment, the sources and types of datasets (RQ3), and the platforms used (RQ7) in all the identified studies on ICT systems.

5.4. Comparison of RL/DRL algorithms applied to optimizing integrated data center systems

Developing an accurate, intelligent, and real-time DC environment requires seamless integration of all systems, including the cooling, ICT, and power supply systems. The joint optimization of these systems has become a promising research direction, aiming to achieve multiple objectives across multiple systems using advanced optimization strategies. Among these, RL/DRL algorithms have emerged as powerful approaches, demonstrating significant potential in addressing the complexities of integrated DC systems. This section provides a detailed analysis of the 11 identified studies that leverage RL/DRL algorithms for the joint optimization of DC systems. A vital aspect of these studies lies in their formulation as multi-objective research problems across multiple systems. To further enrich this discussion and align with the growing interest in this field, we define the key elements of the Markov Decision Process (MDP) models employed in these studies, highlighting their critical role in achieving efficient and effective system integration.

Various identified studies explored the integration of ICT operation optimization with energy-efficient cooling system control as a research problem (RQ4) with different objectives (RQ5). [71] investigates a decentralized strategy to simultaneously optimize the cooling system and VM placement. Additionally, scheduling optimization combined with cooling system control is another prominent research focus: task scheduling was discussed in [52,91,92], whereas job scheduling was examined in [52,80].
In both cases, the scheduling process is integrated with the optimization of the cooling system. On the other hand, three studies [47,61,96] examined the workflow scheduling of DCs powered by renewable energy systems (RES). The primary objective in these studies is to optimize energy consumption from RES during the execution of DC workloads. In study [48], a DRL strategy was applied to optimize the cooling system by integrating it with the power supply system using real-time electricity pricing (RTP). Finally, global optimization using a multi-agent approach to enhance energy efficiency across more than two DC systems was addressed in a recent study [98].

The majority of the identified papers report results related to the energy efficiency (RQ8) of the developed RL/DRL algorithms as a percentage of energy savings compared to baseline algorithms. For example, [48] reported a slight improvement in energy savings compared to a PID controller, while [91] compared the energy efficiency results of the proposed algorithm with a controller designed based on domain expert knowledge, achieving up to 30 % energy savings. Another method of reporting energy efficiency results involves using data center efficiency metrics, such as PUE. This approach was demonstrated in [52,80], where the proposed RL/DRL algorithms enhanced energy efficiency compared to benchmark algorithms. Table A.10 provides an overview of the research problem (RQ4), related objectives (RQ5), and energy-related outcomes (RQ8) of the identified joint optimization studies.

The proposed joint RL/DRL algorithms were compared against various benchmark algorithms. Multiple studies evaluated the performance of the developed RL/DRL strategies against state-of-the-art individual optimization techniques, such as ICT algorithms (e.g., random or heuristic approaches) and traditional cooling control methods, including PID and Model Predictive Control (MPC). Additionally, other studies compared the results with joint optimization approaches. Furthermore, several studies benchmarked the outcomes against other RL/DRL algorithms proposed in previous research.

Table 6. Selected ICT system experimental setup.

| ID | Environment (RQ3) | Data source (RQ3) | Data type (RQ3) | Benchmarks (RQ6) | Platform (RQ7) |
|----|-------------------|-------------------|-----------------|------------------|----------------|
| S4 | Simulation | Google cluster | Real-world | RR, B, MAD, DRL-DTM, DRL-DTA | NA |
| S7 | Simulation | Google cluster | Real-world | FF, MFFD, PABFD, RL-DC, UP-VMC | EnergyPlus, CloudSim |
| S8 | Simulation | Google cluster | Real-world | RR, HDRL, DRL-Cloud, MO-DQN | Python (TensorFlow) |
| S9 | Simulation | Abilene, Geant, and synthetic topology datasets | Synthetic and real-world | TEDO, TEDI | Java, Python (TensorFlow) |
| S10 | Simulation | National Supercomputing Centre (NSCC) of Singapore | Real-world | RR, Job consolidator, Online optimizer with two different reward functions | NA |
| S11 | Simulation | CoMon Project | Real-world | LR-MMT, VDT-UMC, DTH-MF | CloudSim |
| S13 | Simulation | Google cluster | Real-world | RR, HDRL, DRL-Cloud | NA |
| S16 | Simulation | Amazon EC2 and simulated dataset | Synthetic and real-world | FFD, BFD, GRVMP, GMPR, NSGA-II, RLVMP | CloudSim |
| S17 | Simulation | Simulated dataset | Synthetic | VMPMORL, EVCT, VPME, AFED-EF | CloudSim |
| S18 | Simulation | GWA-T-12 Bitbrains | Real-world | MOPSO, MOACO, VMPORL | MATLAB |
| S19 | Simulation | Simulated tasks following an exponential workload distribution | Synthetic | Cloud, PREM, RANDOM, REQ | Python (PyTorch and Gym) |
| S21 | Simulation | Open-source: BitBrains; scientific workflows: Ligo, Montage, Cybershake | Real-world | RR, RF, GRR, GRF, Tetris, RLScheduler, ACS | NA |
| S22 | Simulation | Simulated dataset | Synthetic | GA, ACO, SA, FFD | Java, CloudSim |
| S26 | Simulation | Google cluster | Real-world | RR, RANDOM, SO, GJO | NA |
| S27 | Simulation | Simulated dataset using a K-port FatTree topology | Synthetic | Greedy-ElasticTree, LSTM+DRL, DDPG | Python (TensorFlow, Keras) |
| S28 | Simulation | Nottingham University, Gaussian-distribution synthetic datasets | Synthetic and real-world | MOVMrB, RLVMrB, VMPMORL | CloudSim |
| S30 | Simulation | PlanetLab dataset, Amazon EC2 instance configurations | Synthetic and real-world | MOVMrB, RLVMrB, ADVMC | MATLAB |
| S31 | Simulation | Alibaba Cloud | Real-world | FIFO, Ideal MPC, Tetris | Python (TensorFlow) |
| S32 | Simulation | Azure 2017 workload | Real-world | HGP, IQR-MMT, MAD-MMT, RLR-MMT, GA | Python (PyTorch, Gymnasium, Scikit-learn) |
| S33 | Simulation | Ligo, Genome, Cybershake, Montage, and Sipht datasets | Real-world | EcoCloud, KMI-MRCU, AFED-EF | Java |
| S35 | Simulation | Simulated two common datasets | Synthetic | DSTS, LSTM, RF, CNN | CloudSim |
| S36 | Simulation | Alibaba cluster | Real-world | REINFORCE, FF, RANDOM, Tetris | Python (TensorFlow, NumPy, Matplotlib) |
| S37 | Simulation | Simulated dataset | Synthetic | Small task sizes: Load Aware, FFO-EVMM, MIMT, DQN; medium task sizes: FFO-EVMM, MIMT, L-No-Deaf, Worn-Dear, DQN; larger task sizes: FFO-EVMM, MIMT, multiple PSO variants, DBC, EDF | CloudSim |
| S38 | Simulation | PlanetLab dataset | Real-world | PowerAware VM consolidation | CloudSim |
| S39 | Simulation | Simulated dataset | Synthetic | RANDOM, RR, EDF | Python (PyTorch) |
| S40 | Simulation | Packet trace files from three data centers, generated using Wireshark | Real-world | Shortest-path-based routing, Gurobi optimizer | Python (Keras, TensorFlow) |
| S42 | Simulation | PlanetLab Monitoring | Real-world | IQR, MAD, THR, LR, PABFD | CloudSim |
| S43 | Simulation | Simulated dataset | Synthetic | RoFFR, CSLB, TDBS | WorkflowSim, Python |
| S44 | Simulation | 1998 FIFA World Cup Dataset, UNSW-17 Network Traffic Dataset | Real-world | VPBAR, LRR-MMT, DTH-MF, VMTA, Megh, EQBFD-0.1, EQBFD-0.3 | CloudSim |
| S48 | Simulation | Simulated dataset | Synthetic | MMS-RANDOM, MMS-FAIR, MMS-GREEDY | CloudSim |
| S49 | Simulation | Google cluster | Real-world | RANDOM, Round Robin (RR), MoPSO | Python (TensorFlow) |
| S51 | Simulation | Simulated dataset | Synthetic | Multi-objective optimization algorithms: MGGA, VMPACS, VMPMBBO, ICA-VMPLC, CVP; single-objective optimization algorithms: FFD, OEMACS | MATLAB |
| S53 | Simulation | Simulated dataset | Synthetic | Job scheduling: RANDOM, RR, Greedy, MoPSO; resource allocation: RANDOM, RR, MLF, FERPTS | Python (TensorFlow) |
| S54 | Simulation | Production-quality cloud DC, simulated dataset | Synthetic and real-world | FF, Dot Product, Norm2 heuristics | Python (NumPy, PyTorch) |
| S57 | Simulation | Google cluster | Real-world | Tetris, H2O-Cloud | NA |
| S59 | Simulation | CoMon Project (PlanetLab data) | Real-world | NPA, PABFD, IGGA, E-Eco | CloudSim |
| S61 | Simulation | Wikipedia trace files | Real-world | ElasticTree, CARPO, FCTcon, Optimal (not practical in use) | Python (Keras) |
| S62 | Simulation | Montage, Cybershake, Sipht, Inspiral datasets generated using the Pegasus Workflow Generator | Real-world | MPC, ETF, Lr-RL, Q-SCH, QL-HEFT | CloudSim |
| S63 | Simulation | Google Cloud Jobs dataset (GoCJ) | Real-world | PSO, MVO, EMVO | MATLAB, Python (PyTorch) |
| S64 | Simulation | Sipht, Inspiral, Cybershake datasets generated using the Pegasus Workflow Generator | Real-world | MPC, ETF | WorkflowSim |

Table 7. Selected integrated studies' experimental setup.

| ID | Environment (RQ3) | Data source (RQ3) | Data type (RQ3) | Benchmarks (RQ6) | Platform (RQ7) |
|----|-------------------|-------------------|-----------------|------------------|----------------|
| S1 | Simulation | Pegasus workflow framework | Synthetic | Random, Green-Opt (Greedy), Common-Actor | CloudSim, Python (Keras) |
| S2 | Simulation | Weather: collected from Denmark; electricity pricing: Danish electricity spot market | Real-world | Other RL controllers (for SAC and PPO), PID controller | EnergyPlus |
| S5 | Simulation | LLNL Thunder | Real-world | ICO, MPC, Joint optimization (JCO), Original-DQN | Matlab, 6SigmaDCX, TensorFlow |
| S6 | Simulation | LLNL Thunder | Real-world | PADQN, E-QL | Matlab, 6SigmaDCX, TensorFlow |
| S15 | Simulation | Workload: Google Cluster dataset (GCD); renewable energy: National Renewable Energy Laboratory (NREL)/NE-3000 wind turbines; electricity price: the US EIA; carbon footprint: the US Department of Energy Electricity Emission Factors | Real-world | Greenpacker, LECC, ADVMC, ADVMC-RES | Python |
| S25 | Simulation | PlanetLab, Google cluster | Real-world | DeepEE, Deep-Q with LSTM, ETAS, Improved Genetic, Hierarchical Deep-Q, MPC | CloudSim integrated with four CRAC units and perforated floor tiles to simulate realistic cooling dynamics |
| S34 | Simulation | A simulation-based dataset | Synthetic | Schedule: single-agent method, Hybrid DQN, Independent DQN, Original DQN | 6SigmaDC, CloudSimPy |
| S45 | Combining real-world and simulation | Operational data from Singapore's National Supercomputing Centre | Real-world | Expert-domain-knowledge algorithm; heuristic algorithms for independent IT or cooling optimization; thermal-unaware scheduling (traditional task scheduling without considering thermal dynamics) | 6SigmaRoom, EnergyPlus |
| S46 | Simulation | Google Cluster data | Real-world | Random, RR, PowerTrade, DeepEE | Python (OpenAI Gym and TensorFlow), Matlab |
| S50 | Simulation | Wiki data center | Real-world | Static, Random, K-means | Python (PyTorch) |
| S52 | Simulation | Simulated dataset | Synthetic | Non-optimization: no algorithm-based control; non-algorithm optimization: logic-based manual controls | NA |

The tools discussed in Sections 5.2.4 and 5.3.4 were similarly employed in the joint optimization studies to create simulation environments. These include the EnergyPlus building energy simulation program [117] and the Computational Fluid Dynamics (CFD) simulator 6SigmaRoom [119], which were utilized for cooling systems, while CloudSim [122] served as a simulation environment for the ICT system.
Furthermore, Python, along with its extensive libraries, served as the main programming language for implementing RL/DRL algorithms, while MATLAB was also employed in several studies for simulation and analytical tasks. Table 7 summarizes the experimental setups in the joint optimization literature: simulation environments (RQ3), platforms (RQ7), and benchmarks (RQ6).

5.5. The MDP elements

As detailed in Section 3, the Markov Decision Process (MDP) provides the foundational structure for modeling the RL/DRL environment. The key components of the MDP are the state space {𝑆}, the action space {𝐴}, and the reward function {𝑅}. In the context of the identified joint optimization problems, the MDP features a large and complex state space, as well as a mixed action space encompassing both discrete and continuous actions. Furthermore, the reward function guiding the RL/DRL agent in these studies consists of multiple terms to capture the various systems within the DC environment. This highlights that the MDP for joint optimization studies is considerably more complex than in studies addressing only one system. Table A.11 provides a comprehensive summary of the MDP components in the joint optimization studies.

6. Other objectives combined with energy efficiency in the identified studies

Besides energy efficiency, other objectives have been investigated in the identified studies. It is essential to highlight these objectives, which will shape the direction of future efforts in the field of multi-objective optimization. RL/DRL algorithms have proven effective in resolving conflicts between objectives in several identified works. For instance, in [95], the multi-objective optimization aims to balance the energy consumption of varying numbers of tasks (between 100 and 250 tasks) against the average task makespan.
Moreover, [92] examines the classical trade-off between quality of service (QoS), resource utilization, and energy consumption. Fig. 12 outlines a taxonomy of other optimization objectives integrated with enhancing the energy efficiency of data center systems. Although the majority of the identified studies address data center energy efficiency enhancement as the core research objective, some studies combine this objective with other environmental metrics, which can directly improve the operating mode of the data center and reduce its negative impact on the surrounding ecosystems in terms of carbon footprint and RES utilization. In contrast, other identified studies examine the proposed RL/DRL strategies for ICT and cooling in terms of system performance. In one dimension, these strategies refine time-related aspects, including

[Fig. 12. Taxonomy of other optimization objectives:
• Environmental Impact: reduce carbon emissions; improve RES utilization; balance cost-benefit trade-offs.
• System Performance: minimize total makespan; reduce average waiting time; improve response time; maximize resource utilization; maintain air temperature distributions; improve task completion rates.
• Reliability Management: minimize SLA violations; address thermal threshold conditions; reduce hotspots; balance temperature dispersion; maintain a stable CPU utilization level; improve Quality of Service (QoS).
• Algorithmic Performance: assess average rewards.]
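In practice, studies combining the objectives in this taxonomy typically fold them into a single scalar reward for the RL/DRL agent. Below is a hedged sketch of such a weighted multi-term reward; the weights, normalization constants, and function name are illustrative assumptions, not values from any reviewed study.

```python
# Illustrative multi-term reward combining energy, time, and reliability
# objectives. Reference values e_ref/t_ref scale each penalty to roughly
# [0, 1] so that no single objective dominates (all values are assumptions).

def joint_reward(energy_kwh, makespan_s, sla_violations,
                 w_energy=0.5, w_time=0.3, w_sla=0.2,
                 e_ref=100.0, t_ref=600.0):
    """Negative weighted sum of normalized penalty terms."""
    return -(w_energy * energy_kwh / e_ref
             + w_time * makespan_s / t_ref
             + w_sla * sla_violations)

# Lower energy and makespan with no SLA violations yields a higher reward:
good = joint_reward(energy_kwh=60.0, makespan_s=300.0, sla_violations=0)
bad = joint_reward(energy_kwh=90.0, makespan_s=540.0, sla_violations=2)
print(good, bad)   # -0.45  -1.12
```

The choice of weights encodes the trade-off policy itself, which is one reason the review's call for standardized multi-scale energy metrics matters: reported savings are only comparable when the reward structure is disclosed.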