Aalto-Setälä Eemil

Design Assessment of a Conceptual Virtualization Architecture for OM690 at Olkiluoto 3

Vaasa 2025
School of Technology and Innovations
Master’s Thesis in Computing Sciences

VAASAN YLIOPISTO (University of Vaasa)
School of Technology and Innovations
Author: Aalto-Setälä Eemil
Title of the thesis: Design Assessment of a Conceptual Virtualization Architecture for OM690 at Olkiluoto 3
Degree: Master of Science in Technology
Degree Programme: Master's Programme in Computing Sciences
Specialization: Automation and Information Technology
Supervisor: Jouni Lampinen
Year of graduation: 2025
Pages: 76

TIIVISTELMÄ (Finnish abstract):
Hardware obsolescence is a threat to the lifecycle of industrial automation systems at nuclear power plants. This thesis analytically evaluates a virtualization-based conceptual design for the OM690 system at Olkiluoto 3, in which legacy SPARC servers are replaced through virtualization on the Red Hat OpenShift platform. The scope is a planning-stage assessment: architectural suitability, single-fault tolerance, cybersecurity, and long-term maintainability in an isolated network. Implementation of the design and performance testing are not part of the work. The central research question is: Is the proposed architecture suitable as a basis for the OM690 modernization, and what architectural and operational boundary conditions does its implementation require? The result is a structured assessment and recommendations for carrying out the modernization project.

The study is based on standards, vendor documentation, and failure analysis that focuses in particular on the system's network, storage, and compute. The work is structured by mapping key requirements to the technical mechanisms that implement them. In addition, an illustrative framework for verifying fault tolerance is presented. Operational aspects such as identity and access management, monitoring, log collection, and lifecycle automation in an environment isolated from the internet are also addressed in order to maintain long-term control and traceability.

The thesis concludes that the design is feasible in theory, provided that three layers support one another: a deterministic and segmented network with room-level isolation; synchronous storage replication with a witness that prevents split-brain writes; and a control plane and workload recovery order that prioritizes non-redundant services. With controlled lifecycle automation and tiered backups, the design is maintainable and auditable. Recommended next steps are validating fault tolerance under plant-representative conditions, drills that restore the system to a known and secure state, and defining acceptance thresholds for recovery performance.

AVAINSANAT (keywords): Plant Automation, Virtualization, High Availability, Fault Tolerance, Lifecycle Automation, OpenShift

UNIVERSITY OF VAASA
School of Technology and Innovations
Author: Aalto-Setälä Eemil
Title of the thesis: Design Assessment of a Conceptual Virtualization Architecture for OM690 at Olkiluoto 3
Degree: Master of Science in Technology
Degree Programme: Master's Programme in Computing Sciences
Supervisor: Jouni Lampinen
Year: 2025
Pages: 76

ABSTRACT:
Industrial control systems in the nuclear sector face lifecycle risk from hardware obsolescence. This thesis analytically evaluates a conceptual virtualization blueprint for the OM690 system at the Olkiluoto 3 nuclear power plant, replacing legacy SPARC-based servers with a platform built on Red Hat OpenShift. The scope is planning-stage, and the thesis focuses on architectural suitability, fault tolerance, and long-term sustainment in an air-gapped environment. Implementation and performance testing are out of scope. The central research question is: Is the proposed virtualization architecture for OM690 viable with respect to long-term maintainability, fault tolerance, and cybersecurity, and what architectural design principles and regulatory considerations are required to achieve that viability? The outcome is a structured assessment and a set of recommendations.

The study uses standards-grounded reasoning, vendor documentation, and failure-mode analysis across network, storage, and compute layers. Evidence is organized as requirements-to-mechanisms mappings and an illustrative verification framework indicating what should be measured later. Operational aspects like identity and access control, monitoring, logging, and offline lifecycle automation are addressed to maintain long-term security and traceability in a disconnected environment.

The thesis concludes that the blueprint is viable if three layers reinforce one another: a deterministic, segmented network with room-level isolation; synchronous storage replication with quorum/witness to avoid split-brain; and control plane and workload recovery sequencing that prioritizes non-redundant roles. With controlled automation and tiered backups, the design appears maintainable and auditable. Recommended next steps are plant-representative validation of failover behavior, restore drills to a known secure state, and acceptance thresholds for recovery characteristics.

KEYWORDS: Plant Automation, Virtualization, High Availability, Fault Tolerance, Lifecycle Automation, OpenShift

Contents
1 Introduction
  1.1 Methodology and Scope
2 Theoretical Background
  2.1 Introduction to Virtualization
  2.2 Types of Virtualization and Hypervisors
  2.3 Traditional vs. Cloud-Native Virtualization
  2.4 Red Hat OpenShift Virtualization
  2.5 Real-time and Safety-Critical Considerations for Virtualization
  2.6 Resilience and System Architecture Concepts
  2.7 High Availability and Failover
    2.7.1 Definitions and Expectations
    2.7.2 Failover Models
  2.8 Infrastructure Management and Automation
    2.8.1 Red Hat Satellite
    2.8.2 Ansible
    2.8.3 Terraform
3 Current System and Virtualization Feasibility
  3.1 System Constraints and Virtualization Implications
  3.2 Virtualization Readiness
  3.3 Overview of Network and Storage Topology
  3.4 Communication and Integration Challenges
  3.5 Migration Risks and Mitigation Drivers
  3.6 Conceptual Virtualized Architecture
4 Architectural Resilience and Failover Analysis
  4.1 Component-Level Criticality and Recovery Priorities
  4.2 OpenShift Control Plane and Redundancy Patterns
    4.2.1 Control Plane Topology and Quorum Rationale
    4.2.2 Workload Recovery Semantics (Platform-Level)
    4.2.3 Comparative Assessment
  4.3 Failure-Mode Impact and Recovery Patterns
    4.3.1 Network-Level Resilience
    4.3.2 Storage-Level Resilience
    4.3.3 Compute-Node Resilience
    4.3.4 Methods for Quick Recovery of High Urgency Workloads
  4.4 Failover Verification
5 Implementation Considerations for Operational Continuity
  5.1 Lifecycle Automation
  5.2 Backup and Long-Term Data Recovery
  5.3 Integrating the New Platform into Operational Workflows
  5.4 Operational Evidence of Resilience
6 Results and Discussion
  6.1 Overall Viability of the Target Architecture
  6.2 Control Plane Placement
  6.3 Storage and the Role of the Witness
  6.4 Recovery Semantics
  6.5 Network Core and Failure Profile
  6.6 Automation Boundaries that Fit the Stack
  6.7 Data Protection Beyond Replication
  6.8 Limitations
7 Conclusions
References

Figures
Figure 1. Comparison of virtualization execution methods
Figure 2. Traditional vs. cloud-native virtualization
Figure 3. Sources of latency and mitigation paths in virtualization
Figure 4. Current OM690 high-level overview
Figure 5. Byte order mismatch and conceptual workflow for endian conversion
Figure 6. High-level core architecture of the conceptual virtualized platform
Figure 7. Proposed dual vPC topology with redundancy
Figure 8. Metro-Volume storage diagram (Dell Technologies, 2024)
Figure 9. Illustrative Infrastructure-as-Code Pipeline for Lifecycle Management

Tables
Table 1. etcd quorum majority and failure tolerance
Table 2. Failover verification framework

Abbreviations
API Application Programming Interface
CI/CD Continuous Integration / Continuous Delivery
CNI Container Network Interface, Kubernetes networking plugin model
CRD Custom Resource Definition, extends the Kubernetes API
CSI Container Storage Interface, standard for Kubernetes storage
DM-Multipath Device-Mapper Multipath, Linux multipathing for storage
DR Disaster Recovery, restore service after large disruptions
DS Diagnostic Server
EPLC Enhanced Passive Listening Compatibility
ES Engineering Station
ESXi VMware ESXi (type-1 hypervisor)
etcd Distributed key-value store used by Kubernetes control plane
FAR Fence Agents Remediation (OpenShift fencing operator)
FT Fault Tolerance, continue service transparently through faults
GLB Global Load Balancer (steers traffic/site failover)
GUI Graphical User Interface
HA High Availability (minimize downtime via redundancy/failover)
HBA Host Bus Adapter (storage interface on hosts)
HCL HashiCorp Configuration Language (Terraform)
HMI Human-Machine Interface (operator displays/screens)
HRP High-Speed Redundancy Protocol (Siemens ring redundancy)
IaC Infrastructure as Code
IEC International Electrotechnical Commission (publisher of standards)
IOPS Input/Output Operations Per Second
IP Internet Protocol
iSCSI Internet Small Computer Systems Interface (block storage over IP)
I&C Instrumentation and Control
KVM Kernel-based Virtual Machine (Linux hypervisor)
LACP Link Aggregation Control Protocol
LTO-9 Linear Tape-Open, generation 9 (tape storage)
LUN Logical Unit Number (block storage identifier)
MAC Media Access Control (link-layer address)
MLAG Multi-Chassis Link Aggregation (active/active across two switches)
NIC Network Interface Card
NX-OS Operating System of Cisco Nexus Switches
OCP Red Hat OpenShift Container Platform
OL3 Olkiluoto 3 nuclear power plant unit
OM1 Multimode optical fibre grade (used by the plant ring)
OM690 OL3 automation system (target system being modernized)
OOB Out-of-Band (separate management path)
OT Operator Terminal
OVN Open Virtual Network
PU Processing Unit
PVC PersistentVolumeClaim (Kubernetes storage object)
QEMU Quick EMUlator (emulator/virtualizer used with KVM)
RBAC Role-Based Access Control
RHCOS Red Hat Enterprise Linux CoreOS (OpenShift node OS)
RHEL Red Hat Enterprise Linux
RPO Recovery Point Objective
RSTP Rapid Spanning Tree Protocol
RTO Recovery Time Objective
RTT Round-Trip Time (latency metric)
RWO ReadWriteOnce (Kubernetes volume access mode)
RWX ReadWriteMany (Kubernetes volume access mode)
SCC Security Context Constraint (OpenShift security primitive)
SDN Software-Defined Networking
SFS Finnish Standards Association
SIEM Security Information and Event Management
SNR Self Node Remediation (OpenShift remediation operator)
SPARC Scalable Processor Architecture (legacy CPU platform)
SR-IOV Single Root I/O Virtualization (NIC virtualization)
SSH Secure Shell (remote management protocol)
STP Spanning Tree Protocol
STUK Radiation and Nuclear Safety Authority
SU Server Unit
TVO Teollisuuden Voima Oyj (OL3 operator)
UDN User Defined Networking
vCenter VMware vCenter Server (central VM management)
VLAN Virtual Local Area Network
VM Virtual Machine
VMI VirtualMachineInstance (KubeVirt runtime object)
vPC virtual PortChannel (Cisco)
vSphere VMware vSphere (virtualization suite)
WSUS Windows Server Update Services
XU External Unit
YAML “YAML Ain’t Markup Language” (data serialization format)
YVL Finnish nuclear regulatory guide series

1 Introduction

Industrial control systems in safety-critical sectors like nuclear power are defined by long operational lifespans, which inevitably produce lifecycle challenges as underlying hardware and software age. The OM690 automation system at the Olkiluoto 3 (OL3) nuclear power plant, which is part of the plant’s process information and control system, is now facing these challenges. It currently relies on an aging set of SPARC-based servers running a deprecated operating system. This creates operational risks including hardware failure, loss of vendor support, and scarce replacement parts. A project to modernize OM690 through virtualization is therefore being planned.

The constraints of the legacy platform are both technical and architectural. Tight coupling of software to specific hardware prevents modern redundancy patterns, and the proprietary SPARC stack raises migration hurdles. Fragility is most acute in disaster recovery. With original server models out of support, even well-maintained backups are of limited use without compatible hardware. These factors call for a strategy that goes beyond hardware replacement toward a resilient and maintainable system architecture.

The purpose of this thesis is to determine whether a virtualization-based modernization blueprint for OM690 can meet nuclear-grade continuity and long-term maintainability requirements at OL3, and to provide recommendations for the project. The evaluation is preliminary and analytical, focusing on architectural suitability, single-fault tolerance, cybersecurity posture, and lifecycle sustainability in an air-gapped environment. Implementation, performance benchmarking, cost analysis, and regulator-witnessed validation are out of scope.
This thesis assesses a proposed reference architecture for virtualizing OM690 on Red Hat OpenShift and justifies its suitability through analytical methods. These methods include standards tracing, failure-mode reasoning, and feasibility checks. The thesis does not design the proposed architecture; rather, its contribution is an independent, standards-grounded appraisal and a set of recommendations for OL3.

Chapter 2 summarizes theoretical background in virtualization, high availability, and automation. Chapter 3 analyzes the legacy OM690 system and its constraints. Chapter 4 evaluates a multi-layer resilient architecture and performs failure-mode reasoning across network, storage, and compute domains. Chapter 5 discusses implementation considerations for operational continuity. These include lifecycle automation, data protection, operational workflows, observability, and a framework for sustaining the platform long-term.

1.1 Methodology and Scope

The thesis is an evaluative, analytical case study of a proposed architecture for OM690. The design is assessed via:
• Feasibility and support boundaries: checking behaviors and constraints of OpenShift/KubeVirt and related components against the proposed design.
• Lifecycle sustainability: examining automation, backup/recovery, observability, and long-term lifecycle governance.
• Requirements coverage: mapping requirements to concrete architectural mechanisms.
• Failure-mode reasoning: analyzing how the network, storage, and compute layers behave under realistic faults.

The central research question is: Is the proposed virtualization architecture for OM690 at OL3 viable with respect to long-term maintainability, single-fault tolerance, and cybersecurity, and what architectural design principles and regulatory considerations are required to achieve that viability?

Evidence is drawn from standards, vendor documentation, and technical literature. Where hard metrics are unavailable pre-deployment, the thesis specifies what should be measured, and how, during future validation. The thesis contributes an evidence-based justification of the proposed architecture and a set of operational recommendations for sustaining it.

2 Theoretical Background

This chapter establishes the theoretical foundation for the thesis. It introduces core virtualization concepts, compares platforms and deployment models, and outlines key considerations related to high availability, real-time performance, and fault tolerance. Additionally, it explores automation tools and system architecture patterns useful in managing modernized industrial systems. The foundation supports the technical analysis developed in the later chapters.

2.1 Introduction to Virtualization

Industrial automation and control systems have historically relied on discrete embedded devices (e.g., PLCs) for sensing, actuation and control, which are typically complemented by supervisory servers and stations. As plants have grown in complexity under the Industry 4.0 umbrella, the demand for flexibility, scalability and resilience has risen along with IT-side softwarization trends like virtualization and containerization (Queiroz et al., 2023, pp. 1–3).

Virtualization abstracts hardware so that multiple operating systems or applications can run concurrently on the same host. This typically happens under a hypervisor (type-1 or type-2) or, at the OS level, under container engines. Classic distinctions include full virtualization versus para-virtualization, but more recently, OS-based (container) virtualization has matured alongside these models (Lee et al., 2019, p. 2). In industrial contexts, these mechanisms are becoming increasingly relevant as a strategic response to aging infrastructure and lifecycle constraints, because they enable consolidation and decouple software stacks from specific hardware platforms (Queiroz et al., 2023, pp. 2–3).
Reported advantages include hardware independence and improved manageability via centralized control, faster provisioning and snapshotting, as well as workload isolation that supports modular software lifecycle practices (Queiroz et al., 2023, pp. 2–3). At the same time, networking for virtualized industrial systems must preserve functional transparency and real-time guarantees. This is an area where existing research notes challenges when bridging legacy fieldbuses and Ethernet-centric stacks within virtualized environments (Lee et al., 2019, p. 2).

Additionally, there are limitations that are relevant to virtualization in safety-critical environments. One of these is that general-purpose schedulers and container runtimes are not designed for hard real-time constraints. For example, achieving determinism often requires specific techniques like CPU pinning, real-time kernels (e.g., PREEMPT_RT) or co-kernel approaches (Queiroz et al., 2023, p. 15). Even lightweight containers can exhibit latency and jitter that must be characterized and mitigated for time-sensitive control paths (Queiroz et al., 2023, p. 8). The shift to virtual networking and storage also introduces integration work to maintain isolation, quality-of-service and predictable I/O behavior for legacy protocols (Lee et al., 2019, p. 2).

In this thesis, virtualization is treated as the foundational abstraction that enables decoupling from aging hardware while supporting managed redundancy and lifecycle controls. The later analytical chapters apply these general properties to the OM690 context and account for its constraints.

2.2 Types of Virtualization and Hypervisors

Virtualization in server systems is commonly realized in two ways. One is hardware-level virtualization with a hypervisor, where each virtual machine runs its own guest operating system and application stack on virtual hardware that the hypervisor exposes (Queiroz et al., 2023, p. 4; Sharma et al., 2016, p. 2). Hypervisors are usually classed as type 1 when they run directly on the host hardware and type 2 when they run on top of a conventional operating system. Type 1 designs keep the stack thin and are a common choice for production, whereas type 2 designs are often used in development and testing where convenience matters more than strict determinism (Queiroz et al., 2023, p. 4; Sharma et al., 2016, p. 2). The other is operating-system-level virtualization with containers, where applications run as isolated processes that share a single kernel through namespaces and control groups (Queiroz et al., 2023, p. 6; Sharma et al., 2016, p. 2). Figure 1 illustrates the difference between the virtualization models.

Figure 1. Comparison of virtualization execution methods

On bare metal, operating systems and applications run directly on physical servers, which makes it the performance benchmark by eliminating virtualization overhead and scheduler indirection (Giallorenzo et al., 2021, p. 12). Containers tend to start fast and carry low overhead because they avoid a separate guest kernel (Sharma et al., 2016, p. 7). The trade-off is a weaker isolation boundary than a full hypervisor provides, which is an important difference when workloads have strict isolation or predictability needs (Queiroz et al., 2023, pp. 4–6; Sharma et al., 2016, p. 12).

Performance studies show a mixed picture that depends on the workload and the resource under pressure. Containers can approach bare-metal results for CPU and memory tasks (Giallorenzo et al., 2021, pp. 12–14; Lin et al., 2018, p. 5). Full system virtualization adds some overhead, which is more visible in I/O paths (Giallorenzo et al., 2021, pp. 8, 13–14). VM disk and network I/O often degrade sooner than CPU (Giallorenzo et al., 2021, pp. 8, 13).
Containers can also suffer under contention or copy-on-write paths, and large co-location tests report higher cross-workload interference in some container profiles because the kernel is shared (Sharma et al., 2016, p. 12; Lin et al., 2018, p. 5).

Security and isolation follow from those mechanics. VMs benefit from a distinct guest kernel and a hardened hypervisor boundary, which can help when workloads are untrusted or must be strongly separated. Containers gain efficiency from the shared kernel but expand the blast radius of kernel-level defects. Strong platform hardening and policy are therefore preconditions for containerized services with elevated assurance needs. Bare metal still provides physical isolation when a host is dedicated to one role (Queiroz et al., 2023, pp. 4–6; Sharma et al., 2016, p. 12).

Resource efficiency also differs. Containers usually reach the highest packing density because there is no guest operating system per workload (Queiroz et al., 2023, p. 5; Sharma et al., 2016, p. 12). VMs still consolidate better than one-app-per-server bare metal, though the guest footprint caps peak density relative to containers (Sharma et al., 2016, p. 2). In Kubernetes-native stacks such as OpenShift Virtualization, running VMs alongside containers can improve overall utilization because one scheduler pools compute for both models (Red Hat, 2025a, pp. 232–236).

Management follows the same pattern. Bare-metal estates often require manual lifecycle work, while VM platforms add image-level operations. In OpenShift Virtualization the GUI and the Kubernetes APIs allow declarative definitions and image workflows such as the Containerized Data Importer, while patching inside the guest remains the VM owner’s task (Red Hat, 2025a, pp. 232–236, 346–349). Containers move further toward image-based delivery and API-driven rollout and rollback, which streamlines changes but adds responsibility for the container platform itself (Lin et al., 2018, p. 3; Sharma et al., 2016, p. 9).

High availability (HA) behavior follows from these building blocks. On bare metal, HA often relies on external clustering frameworks that handle fencing and shared-storage orchestration. Under OpenShift Virtualization, a failed node leads the scheduler to restart the VM on a healthy node and reattach its persistent storage (Red Hat, 2025a, pp. 354–361), and live migration is available for planned maintenance and balancing (Red Hat, 2025a, pp. 347–349). In container workloads, controllers replace failed instances quickly (Šimon et al., 2023, p. 12), while stateful failover and in-place moves need careful storage and identity design compared with VM semantics (Sharma et al., 2016, p. 12).

Later chapters use these distinctions when discussing OM690. Safety-adjacent roles favor VM isolation and predictable restart behavior, while some supporting services can use containers under the same platform when the risk profile and dependencies allow it (Šimon et al., 2023, p. 13).

2.3 Traditional vs. Cloud-Native Virtualization

Industrial virtualization platforms broadly fall into two paradigms: traditional hypervisor-centric stacks and cloud-native orchestration frameworks. Understanding their architectural philosophies helps when assessing modernization options.

Traditional platforms such as VMware vSphere follow a VM-centric model built around a proprietary type-1 hypervisor. They are installed directly on server hardware, with centralized management through a toolchain such as vCenter Server (Faddom, 2023). This approach emphasizes stability, feature maturity, and established enterprise workflows (Faddom, 2023).

Cloud-native platforms such as Red Hat OpenShift, by contrast, adopt declarative, API-driven orchestration. In this model, VM lifecycle management is integrated into the Kubernetes control plane via KubeVirt, where VMs are represented as custom resources and scheduled like pods (Red Hat, 2024a; Red Hat, 2025a, pp. 57–62). This enables a single control plane for both VMs and containers. As a result, operational practices center on automation, composability, and policy-driven management (Red Hat, 2025a, pp. 25, 33).

A key architectural distinction is the separation versus unification of control planes. vSphere separates VM management from other application orchestration systems, whereas OpenShift Virtualization operates VMs and containers under one orchestrator. Figure 2 illustrates this difference by contrasting a vCenter/ESXi stack with an OpenShift/KubeVirt stack that unifies VM and container control under Kubernetes (Red Hat, 2024a; Red Hat, 2025a, p. 25).

Figure 2. Traditional vs. cloud-native virtualization

Security and isolation are enforced differently as well. Traditional hypervisors provide isolation at the VM boundary with minimal kernel sharing. Kubernetes-native virtualization retains VM isolation while introducing shared orchestration components that must themselves be secured and audited (Red Hat, 2025a, p. 38). This is an operational consideration reflected in OpenShift guidance on platform hardening and access control.

High availability and failover patterns also differ. Traditional stacks commonly rely on dedicated HA modules and hypervisor clustering (Faddom, 2023), whereas Kubernetes-native platforms achieve continuity through declarative scheduling, rescheduling on failure, and storage abstraction (e.g., PVCs and live-migration support in KubeVirt) (Red Hat, 2025a, pp. 29, 62, 359–374).

Finally, traditional platforms lean on GUI-centric operations and fixed administrative workflows for lifecycle management (Faddom, 2023). Cloud-native platforms focus on infrastructure-as-code, CI/CD, and role-based automation through APIs and YAML descriptors (Red Hat, 2025a, pp. 25, 33).
In short, the traditional model emphasizes separation of concerns and mature, VM-centric tooling, while the cloud-native model prioritizes unified orchestration, automation, and scalability. These differences frame the set of trade-offs for OM690, where a unified control plane may reduce operational fragmentation without discarding the VM isolation properties required for critical workloads.

2.4 Red Hat OpenShift Virtualization

Red Hat OpenShift Virtualization is an add-on capability of the OpenShift Container Platform (OCP). It enables Kubernetes to define and run virtual machines alongside containers, often described as container-native virtualization (Red Hat, 2024a). The feature is implemented through KubeVirt, which extends the Kubernetes API with virtualization-specific Custom Resource Definitions (CRDs), including VirtualMachineInstance (VMI) for runtime and VirtualMachine (VM) for lifecycle management (Red Hat, 2024a; Red Hat, 2025a, pp. 57–58). VMs are launched as pods on OpenShift worker nodes using the KVM hypervisor. For nodes that host VMs, Red Hat Enterprise Linux CoreOS (RHCOS) is the supported host operating system (Red Hat, 2024a; Red Hat, 2025a, p. 38).

Red Hat Enterprise Linux (RHEL) is the general-purpose enterprise OS used inside many VMs and on traditional servers. RHCOS, by contrast, is an image-based, cluster-managed variant of RHEL maintained by OpenShift’s Machine Config Operator (Red Hat, 2025f, p. 25; Red Hat, 2024a). In this thesis, RHEL therefore refers to guest OSs inside VMs or to non-cluster hosts, and RHCOS refers to the node OS for the OpenShift control plane and VM-hosting workers.
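
To make the VirtualMachine custom resource described in this section more concrete, the following Python sketch assembles a minimal manifest as a plain dictionary and prints it as JSON, a format that `oc` and `kubectl` accept alongside YAML. It is an illustration under stated assumptions, not the configuration proposed for OM690: the helper `build_vm_manifest`, the VM name `om690-example`, the namespace `plant-vms`, and the claim name are hypothetical, and the field layout follows the upstream KubeVirt `kubevirt.io/v1` schema as commonly documented.

```python
import json


def build_vm_manifest(name: str, namespace: str, cores: int, memory: str) -> dict:
    """Return a minimal KubeVirt VirtualMachine manifest as a plain dict."""
    return {
        "apiVersion": "kubevirt.io/v1",
        # VirtualMachine is the lifecycle object; KubeVirt derives the
        # running VirtualMachineInstance (VMI) from its template.
        "kind": "VirtualMachine",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            # Ask the platform to keep the VM running, including after a failure.
            "runStrategy": "Always",
            "template": {
                "spec": {
                    "domain": {
                        "cpu": {"cores": cores},
                        "resources": {"requests": {"memory": memory}},
                        "devices": {
                            "disks": [
                                {"name": "rootdisk", "disk": {"bus": "virtio"}}
                            ]
                        },
                    },
                    # Hypothetical claim name; the PVC itself would be
                    # provisioned separately through a CSI storage class.
                    "volumes": [
                        {
                            "name": "rootdisk",
                            "persistentVolumeClaim": {"claimName": f"{name}-root"},
                        }
                    ],
                }
            },
        },
    }


if __name__ == "__main__":
    manifest = build_vm_manifest("om690-example", "plant-vms", cores=2, memory="4Gi")
    print(json.dumps(manifest, indent=2))
```

Keeping the definition declarative means the same manifest can be version-controlled and reviewed before it reaches the cluster, which fits the infrastructure-as-code practices discussed in Section 2.3.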
Because it is Kubernetes-native, OpenShift Virtualization integrates with the broader ecosystem used by OCP: storage via the Container Storage Interface (CSI), advanced networking via Multus for multiple network attachments, and platform-level security and policy through standard Kubernetes constructs, while platform services like observability are surfaced through the same control plane (Red Hat, 2025a, pp. 25, 310–312, 342). In addition, the platform provides a practical bridge for legacy VM workloads, which allows them to be hosted without immediate refactoring while a gradual transition to containerized components can be evaluated (Red Hat, 2024a).

For the OM690 context, this combination of VMs and containers managed under one orchestrator offers a solid foundation that subsequent chapters evaluate in terms of resilience, operability, and lifecycle governance.

2.5 Real-time and Safety-Critical Considerations for Virtualization

Virtualization technologies, including container-native approaches like OpenShift Virtualization, offer significant benefits for system modernization. However, their application within industrial safety-critical environments like nuclear power plants demands consideration of their impact on real-time performance. This is because aspects such as timing predictability, communication latency, and execution determinism remain crucial for control loops and safety functions (Queiroz et al., 2023, p. 7). Standard virtualization introduces layers of software abstraction and resource sharing that were not primarily designed for hard real-time guarantees, which presents challenges for adoption in time-sensitive industrial systems (Aqasizade et al., 2024, p. 1; Queiroz et al., 2023, p. 1).

The introduction of virtualization layers alters the execution environment compared to bare metal. Factors contributing to increased latency and jitter include the host OS or hypervisor scheduler managing CPU access, which can introduce delays and unpredictable prioritization (Queiroz et al., 2023, p. 9), and the additional processing required for I/O operations traversing virtualization pathways (Queiroz et al., 2023, p. 10). Furthermore, the sharing of resources among multiple VMs or containers running on the same nodes can lead to contention and performance interference (Queiroz et al., 2023, p. 10; Sharma et al., 2016, p. 1). These challenges are particularly acute for real-time industrial networks requiring guaranteed message timing and low latency (Lee et al., 2019, p. 1).

To mitigate these challenges, various techniques focus on enhancing the real-time capabilities of the underlying Linux and KVM environment. Figure 3 visualizes the primary sources of latency across layers. At the host OS level, employing real-time Linux kernel variants or specific tunings can improve scheduling predictability (Queiroz et al., 2023, p. 15). Within OpenShift, the Node Tuning Operator and Performance Profiles allow applying these low-latency tunings, managing CPU affinity, and reserving resources on specific nodes (Red Hat, 2025d, Chapter 5.4).

Figure 3. Sources of latency and mitigation paths in virtualization

For individual virtual machines managed by OpenShift Virtualization, resource guarantees can be configured via their Kubernetes resource definitions. This includes dedicating specific CPU cores (CPU pinning) and configuring large memory pages (huge pages) to minimize latency and improve predictability (Red Hat, 2025a, pp. 279, 288). These features leverage underlying Linux control groups and CPU sets to isolate workloads (Queiroz et al., 2023, p. 16; Sharma et al., 2016, p. 5).
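
As a minimal sketch of how such per-VM guarantees could be expressed, the Python snippet below adds dedicated CPU placement and huge pages to a VirtualMachine manifest held as a dictionary. The helper `apply_low_latency_settings`, the VM name, and the chosen values are hypothetical; the field names `dedicatedCpuPlacement` and `hugepages.pageSize` follow the upstream KubeVirt schema, and whether a given page size is usable is assumed to depend on how the hosting nodes have been tuned.

```python
import copy


def apply_low_latency_settings(vm: dict, cores: int, hugepage_size: str) -> dict:
    """Return a copy of a VirtualMachine manifest with pinned CPUs and huge pages."""
    tuned = copy.deepcopy(vm)  # leave the caller's manifest untouched
    domain = tuned["spec"]["template"]["spec"]["domain"]

    # CPU pinning: request exclusive host cores for the guest's vCPUs.
    cpu = domain.setdefault("cpu", {})
    cpu["cores"] = cores
    cpu["dedicatedCpuPlacement"] = True

    # Huge pages: back guest memory with large pages of the given size.
    domain.setdefault("memory", {})["hugepages"] = {"pageSize": hugepage_size}
    return tuned


# Hypothetical baseline manifest, reduced to the fields touched above.
base_vm = {
    "apiVersion": "kubevirt.io/v1",
    "kind": "VirtualMachine",
    "metadata": {"name": "om690-example"},
    "spec": {
        "template": {
            "spec": {"domain": {"resources": {"requests": {"memory": "4Gi"}}}}
        }
    },
}

tuned_vm = apply_low_latency_settings(base_vm, cores=4, hugepage_size="1Gi")
print(tuned_vm["spec"]["template"]["spec"]["domain"])
```

Settings like these only declare the intent; the resulting latency and jitter would still need to be measured on representative hardware.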
Additionally, careful network design, potentially using Single Root I/O Virtualization (SR-IOV) or dedicated network interfaces managed via Multus CNI, alongside coordinated task scheduling within applications, can help manage end-to-end communication delays (Lee et al., 2019, p. 8; Red Hat, 2025a, p. 316).

In safety-critical environments such as OL3, non-deterministic timing is a safety concern rather than a performance issue. Use of virtualization in safety-related functions should therefore apply the necessary mitigations and verify timing behavior under normal and faulted conditions, demonstrating bounded latency/jitter for the relevant control cycles and deterministic recovery for defined failure modes (Queiroz et al., 2023, p. 2). These expectations are consistent with nuclear control standards and are treated in this thesis as validation requirements.

2.6 Resilience and System Architecture Concepts

System resilience typically rests on three related ideas: high availability (HA), fault tolerance (FT), and disaster recovery (DR). HA seeks to minimize downtime through redundancy and rapid failover, FT masks faults so service continues transparently, and DR restores service after larger-scale disruption (e.g., site loss) (Luca, 2024, pp. 2–6). All three are implemented by eliminating single points of failure and by adding controlled mechanisms for detection, isolation, and recovery (Luca, 2024, pp. 3, 12).

Within OpenShift, HA is achieved when control plane quorum and failure-domain placement rules are respected. The distributed control plane (API servers and etcd) and the scheduler's ability to reschedule workloads to healthy nodes provide the platform behavior. For virtual machines managed by OpenShift Virtualization, this means a failed node triggers a cold restart of the VM pod on another node, assuming shared or reattachable storage (Red Hat, 2025a, pp. 29, 359–374).
For multi-site strategies, OpenShift supports either a stretched cluster (one logical cluster across rooms/sites) or independent clusters. The latter avoids cross-site quorum/latency pitfalls at the cost of more operational complexity (Gurijala & Sullivan, 2022).

Modern datacenter networks typically remove single points of failure by making links and even entire switch chassis redundant. Multi-Chassis Link Aggregation (MLAG) exposes an active-active port-channel across two upstream switches so a downstream host can lose a link or a whole switch without losing connectivity. Cisco's implementation, virtual PortChannel (vPC), presents two Nexus switches as one logical device to the host (Cisco, 2024). A vPC peer-link carries state synchronization, while a separate peer-keepalive distinguishes a peer-link failure from a switch failure to prevent split-brain (Cisco, 2024, pp. 255, 284). At the host edge, LACP with suspend-individual helps in avoiding black-holing by disabling out-of-sync members automatically (Cisco, 2024, p. 238).

Siemens High-Speed Redundancy Protocol (HRP) is designed to provide rapid recovery on ring network topologies (used at OL3). When attaching such rings to an Ethernet core, Enhanced Passive Listening Compatibility (EPLC) relays spanning-tree changes without running STP on ring ports (Siemens, 2018). This preserves the ring's deterministic behavior while keeping the wider network loop-free (Siemens, 2018).

In OpenShift, segmentation and attachment to plant VLANs are modeled explicitly: Multus CNI allows secondary network interfaces for pods/VMs so traffic can be bound to the correct broadcast domain/VLAN (Red Hat, 2025a, p. 346). User-Defined Networking, discussed later, can reference those Multus attachments for reusable connectivity.

For stateful workloads, storage must be available across failures and consistent across sites.
At room/site scale, synchronous replication provides a Recovery Point Objective (RPO) of zero by committing each write on both sides before acknowledging to the host (Avrillier, 2023). In active-active metro designs, a witness in a third fault domain arbitrates during partitions so only one side remains writable, preventing split-brain and data divergence (Itzikr, 2023).

At the host level, DM-Multipath maintains multiple paths (HBAs/switches/array ports) to the same logical unit number and reroutes transparently on path failure. Common policies include round-robin path selection, while queueing and path checker settings govern behavior during transient loss (Red Hat, 2025e, pp. 21–22). OpenShift Virtualization consumes such storage via PersistentVolumeClaims, and live migration is supported when storage is accessible to both source and destination nodes, while cross-site restarts depend on the replication layer's failover semantics (Red Hat, 2025a, pp. 62, 359–374).

Resilience also depends on scope-limiting faults and traffic. VLANs carve a physical fabric into isolated broadcast domains to reduce blast radius and enforce policy separation (Basan, 2024). This is foundational for both performance and security in safety-critical environments. In OpenShift, this segregation is represented in workload definitions (e.g., Multus network-attachment definitions) so that VM/pod interfaces map to the intended VLANs (Red Hat, 2025a, p. 312). These architectural concepts form the theoretical backbone for the failover patterns analyzed later.

2.7 High Availability and Failover

Building on the redundancy and architectural concepts introduced in Section 2.6, this section provides a focused exploration of high availability (HA) and failover in the context of modern virtualized infrastructures, especially OpenShift Virtualization.
While Section 2.6 addressed general principles, this section emphasizes how these translate into specific configurations and operational models. Understanding these mechanisms is critical for ensuring service continuity and regulatory compliance in safety-critical systems, where predictable, bounded performance and safe, well-governed recovery are required outcomes.

2.7.1 Definitions and Expectations

High availability refers to a system's ability to remain operational with minimal downtime, even amidst component failures (Somasekaram et al., 2021, p. 1). In industrial automation, HA also implies predictable failover, unobstructed operator interaction, and platform behavior that does not degrade safety functions. These principles are important when considering platforms like OpenShift Virtualization for critical systems.

Failover is the mechanism by which a system detects a failure and transitions operations to a redundant or standby component. It is a core HA technique expected to be fast, reliable, and transparent where appropriate, preventing data loss, preserving control states, and ensuring safety margins are not degraded during the transition (Gurijala & Sullivan, 2022).

In nuclear plant environments, these mechanisms must align with regulatory standards such as IEC 62443 and Finnish YVL E.7 and A.12, which mandate system integrity, zonal isolation, and deterministic behavior under defined fault conditions.

2.7.2 Failover Models

Failover models in HA clusters can be broadly categorized. Active-active configurations involve multiple nodes processing workloads simultaneously, redistributing traffic upon failure (Somasekaram et al., 2021, p. 5). Active-passive models, often favored for their simplicity, feature one operational node and a synchronized standby ready for immediate takeover (Somasekaram et al., 2021, p. 6).
In virtualized infrastructures like the planned OpenShift Virtualization at OL3, these models apply with specific mechanisms for virtual machine (VM) high availability. For VMs managed by OpenShift Virtualization, HA is primarily orchestrated by the underlying Kubernetes platform. Red Hat documentation notes that when an OpenShift worker node hosting a VM fails, Kubernetes detects the failure and attempts to reschedule the VM (which runs as a Kubernetes pod) onto another healthy worker node. This automatic restart process assumes fencing of the failed node and that the VM's disks reside on shared or re-attachable storage accessible by multiple nodes (Red Hat, 2025a, pp. 62, 408–409). Live migration of VMs between OpenShift nodes without service interruption is supported and is typically used for planned maintenance or workload balancing, not as the primary cross-site failover tool (Red Hat, 2025a, pp. 29, 359–374).

Failover functionality extends beyond compute. Network and storage systems also require rapid, deterministic recovery. At the network level, VLAN segmentation and redundant links can be paired with deterministic core designs so that link/switch faults do not cause identity conflicts or prolonged reconvergence (Cisco, 2024). Similarly, storage failover depends on synchronized replication and path redundancy; solutions such as distributed block storage or software-defined storage platforms (e.g., Ceph-based OpenShift Data Foundation, where appropriate) enable data availability despite individual node or disk failures (Somasekaram et al., 2021, pp. 11–18; Red Hat, 2025a, p. 343).

The selected failover model impacts system complexity, testability, and response time. Active-active configurations, for example, need robust monitoring and quorum enforcement to prevent split-brain (Somasekaram et al., 2021, p. 8).
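The split-brain concern can be made concrete with a deliberately simplified arbitration model, sketched below in Python. It assumes two storage sides and a witness in a third fault domain, and it assumes that the witness grants write ownership to a preferred side when both sides can still reach it. The side names and the tie-break rule are hypothetical and do not describe any specific vendor's algorithm.

```python
from itertools import product
from typing import Set

SIDES = ("room_a", "room_b")  # hypothetical names for the two storage sides


def writable_sides(replication_link_up: bool,
                   a_sees_witness: bool,
                   b_sees_witness: bool,
                   preferred: str = "room_a") -> Set[str]:
    """Return the sides allowed to accept writes in this simplified model."""
    if replication_link_up:
        # Both sides can mirror every write synchronously, so both stay active.
        return set(SIDES)
    # Partition: only sides that can reach the witness may ask for ownership.
    candidates = [side for side, sees in zip(SIDES, (a_sees_witness, b_sees_witness)) if sees]
    if not candidates:
        # Neither side can prove ownership, so both suspend writes.
        return set()
    # The witness grants ownership to exactly one side (assumed tie-break: preferred side).
    winner = preferred if preferred in candidates else candidates[0]
    return {winner}


def no_split_brain_possible() -> bool:
    """Check every partition scenario: at most one side may remain writable."""
    return all(
        len(writable_sides(False, a, b)) <= 1
        for a, b in product((True, False), repeat=2)
    )
```

In this model a side that is isolated from both its peer and the witness stops accepting writes, which trades availability for consistency; this is the behavior that fault-injection testing would need to confirm on real equipment.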
Validation of failover strategies relies on fault-injection frameworks executed under controlled conditions to confirm that recovery behaves as designed and that bounded outcomes are achieved.

2.8 Infrastructure Management and Automation

Modernizing OM690 on a software-defined platform increases the volume and cadence of routine change: systems must be provisioned, configured, patched, and observed in a way that remains deterministic and auditable over decades. In such settings, infrastructure automation should be treated as an enabling abstraction that turns environment state into versioned, reviewable artefacts. Because the environment is planned to be air-gapped (offline), an offline content path is necessary. This section introduces three potential tool families for lifecycle management in a virtualized OM690. The operational patterns that use them are analyzed later in Chapter 5.

2.8.1 Red Hat Satellite

Red Hat Satellite is a management platform for Red Hat Enterprise Linux (RHEL) systems that consolidates content distribution, patching, and configuration into a centrally governed service (Red Hat, 2025b, p. 9). Satellite mirrors operating-system content from upstream sources to a local repository of record and exposes it to enrolled RHEL hosts (Red Hat, 2025b, pp. 10, 58). For distributed estates it can project this content close to consumers via Capsule Servers, which replicate from the central Satellite and reduce update latency across sites (Red Hat, 2025b, p. 12).

Change control is expressed through Content Views and Lifecycle Environments: administrators compose versioned snapshots of repositories and promote those snapshots through environments such as Development → Test → Production, ensuring that identical content is consumed at each stage (Red Hat, 2025b, pp. 11–13).
Satellite also supports fully disconnected operation, either by importing content on removable media or by synchronizing from a staging Satellite in a differently zoned network using Inter-Satellite Synchronization (ISS) (Red Hat, 2025b, pp. 47–50). These properties make it a useful reference model for maintaining a verifiable baseline in air-gapped or highly segmented facilities.

In an OpenShift-based modernization its role is specific: Satellite manages traditional RHEL guests (e.g., virtual machines that run a full RHEL user space) but does not manage OpenShift worker nodes running Red Hat Enterprise Linux CoreOS (RHCOS), which are maintained by OpenShift's own cluster operators (e.g., the Machine Config Operator) (Red Hat, 2025f, pp. 25, 41).

2.8.2 Ansible

Ansible provides a declarative, agentless approach to configuration management and orchestration: desired system state is described in human-readable YAML playbooks, and the engine applies those changes idempotently over secure transports (Red Hat, 2025c, p. 7). Because it operates without a resident agent and encodes changes as text, Ansible is well suited to environments that value repeatability, reviewability, and minimal software footprint on managed hosts.

In OpenShift contexts, Ansible is relevant in multiple ways. First, it is commonly used to automate platform-adjacent tasks like installing and configuring clusters, applying post-install settings, or driving day-2 configuration in a controlled manner. Second, when OpenShift Virtualization is present, Ansible collections can interact with the KubeVirt API to define or manage virtual machines as part of broader workflows, aligning VM lifecycle actions with the same declarative model used for other infrastructure components (Red Hat, 2025a, p. 59; Red Hat, 2025c, p. 6).

2.8.3 Terraform

While Ansible manages software configuration, Terraform can be used to provision the underlying resources themselves.
It frames infrastructure as code (IaC): administrators declare the desired state of underlying resources (compute, networks, storage), and a provider-based engine plans and applies the minimal set of changes required to reach that state. A persisted state file enables drift detection and safe, incremental change (HashiCorp, n.d.).

In a virtualized OM690, such a tool can be useful in bootstrapping and evolving the substrate on which OpenShift runs. This can include server definitions, VLANs and subnets, or storage allocations, so that environments can be reproduced and reviewed as code across pre-production and production settings. As with the other tools, Terraform's operation is discussed further in Chapter 5.

3 Current System and Virtualization Feasibility

This chapter outlines the relevant existing OM690 infrastructure and assesses its readiness for virtualization. It describes the current hardware and software landscape, highlights key limitations in the legacy systems, and identifies factors that influence virtualization feasibility. The chapter analyzes how well the current architecture enables a transition to virtualized platforms while serving as further foundation for the later chapters.

3.1 System Constraints and Virtualization Implications

OM690 is an automation and supervisory information system in OL3. It provides operator display/HMI, information and control functions, alarm handling, processing of process data, interfaces to field and system components, long-term archiving, and administrative and maintenance capabilities, and it also interfaces with other I&C or monitoring systems (Areva, 2023). The current OM690 automation system at OL3 mainly runs on aging hardware infrastructure that is becoming unsustainable from a lifecycle perspective. The system contains a hybrid mix of SPARC-based servers and x86-64 platforms.
The SPARC servers are the operational backbone, while the x86-64 servers handle some of the peripheral and diagnostic workloads (Areva, 2023).

The SPARC-based OM690 infrastructure includes about 30 physical servers that run automation functions. These include 24 Operator Terminal (OT) servers for human-machine interface functions, two processing units (PU) for data processing and calculations, two server units (SU) for data archiving and retrieval, an external unit (XU) for signal exchange with external monitoring systems, and various engineering and diagnostic stations (TVO, 2024). Most of these servers operate as bare-metal systems that run Solaris 10, an operating system that is now deprecated and only minimally supported in modern virtualization environments. Many of these SPARC servers are now out of vendor support, and sourcing compatible replacement hardware is difficult. This amplifies the risk of system outages and other issues. The system also includes about 10 physical x86-64 servers. These systems already run on modern operating systems and are better positioned for virtualization, particularly within the Red Hat OpenShift Virtualization framework that is being considered. Figure 4 provides a high-level overview of this hardware layout.

Figure 4. Current OM690 high-level overview

OM690 is deployed as a role-dedicated server model: operator interfaces, process data handling, long-term storage, engineering configuration, and diagnostics operate on distinct physical hosts (TVO, 2024). This approach has offered operational clarity and fault isolation, but it leads to poor hardware utilization and administrative overhead. Each system requires manual lifecycle management, which contributes to configuration drift and inconsistent patch levels.

These issues are made more prominent by tight hardware-software coupling and legacy dependencies, especially within the SPARC-based workloads.
Hardware abstraction layers are in some cases absent, and in those cases a server failure results in complete loss of the function unless redundancy is in place. Similar constraints affect network and storage subsystems, as the current topology lacks many modern features that are foundational in virtualized environments. For avoidance of doubt, the current SPARC/Solaris servers are not candidates for direct virtualization. The migration path assumes Siemens-supplied x86-64 releases of OM690 components that will be deployed on the new, virtualized platform.

3.2 Virtualization Readiness

The readiness of existing OM690 system components for virtualization differs between SPARC-based and x86-64 platforms. This is mainly due to architectural compatibility, operating system support, and the lifecycle status of both hardware and software. Systems based on SPARC have limitations when transitioning to modern virtualization platforms. For example, Red Hat's virtualization stack does not list SPARC-based operating systems as certified guests (Red Hat, n.d.-a), and support for older systems like Solaris 10 is deprecated across most hypervisors. As a result, the modernization strategy does not involve direct virtualization of SPARC-based workloads. Instead, these are planned to be migrated to x86-64 hardware within OpenShift Virtualization clusters.

A major technical challenge in this migration is endianness. SPARC platforms use a big-endian architecture, where the most significant byte of a multi-byte value is stored at the lowest memory address (Oracle, 2018). By contrast, x86-64 systems use little-endian encoding, where the least significant byte occupies the lowest memory address (HP, 2009). This difference affects the interpretation of fixed-width binary files, which cannot be reliably used unless properly converted.
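The effect of this mismatch, and one way of correcting it, can be shown with a minimal Python sketch using the standard struct module, in which '>' selects big-endian and '<' little-endian byte order. The record layout (a 32-bit timestamp, a 32-bit float value, and a 16-bit status word) is a hypothetical schema chosen for illustration and does not describe the actual OM690 archive format.

```python
import struct

# Hypothetical fixed-width record: uint32 timestamp, float32 value, uint16 status.
RECORD_LAYOUT = "IfH"
BIG_ENDIAN = struct.Struct(">" + RECORD_LAYOUT)     # as written on SPARC
LITTLE_ENDIAN = struct.Struct("<" + RECORD_LAYOUT)  # as expected on x86-64


def misread_uint32(value: int) -> int:
    """Show what a little-endian reader sees when given big-endian bytes."""
    return struct.unpack("<I", struct.pack(">I", value))[0]


def convert_archive(data: bytes) -> bytes:
    """Re-encode every record field by field from big-endian to little-endian."""
    if len(data) % BIG_ENDIAN.size != 0:
        raise ValueError("archive length is not a whole number of records")
    converted = bytearray()
    for fields in BIG_ENDIAN.iter_unpack(data):
        # Each field is decoded with its own width, so bytes are never
        # swapped across field boundaries.
        converted += LITTLE_ENDIAN.pack(*fields)
    return bytes(converted)
```

Because the conversion is driven by the schema rather than by a blanket byte swap, the same routine could be re-run and compared against the source archive as part of an audit trail, provided the real record layouts are known.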
A 32-bit integer representing a timestamp, for example, will yield incorrect values on a little-endian system if read without translation (HP, 2009, p. 9).

The problem is particularly critical in nuclear instrumentation contexts, where precision, traceability, and auditability are essential. Binary archives generated on SPARC cannot be consumed by x86-64 virtual machines without risk of silent data corruption or operational misinterpretation. Conversion requires tooling that accurately transforms data fields.

A potential method to address endianness is Python's built-in struct module, which allows byte-order specification for decoding and re-encoding binary values. Figure 5 illustrates this type of schema-aware conversion workflow, where a big-endian source file is read, processed according to a schema, and written as a new little-endian file. The process involves reading the original data in big-endian format ('>'), transforming the fields as required, and writing the result in little-endian format ('<') (Python Software Foundation, n.d.).

Figure 5. Byte order mismatch and conceptual workflow for endian conversion

Several alternative approaches exist, including runtime byte-swapping middleware or conversion to architecture-neutral intermediate formats. However, for large volumes of historical data, a structured offline pipeline could be the most practical and auditable solution (HP, 2009, p. 14; Oracle, 2018).

Endianness is not the only challenge. Some legacy SPARC applications were compiled as 32-bit binaries, despite the underlying hardware supporting 64-bit operation. Migrating such applications to a 64-bit Linux platform is infeasible without recompilation, which is often blocked by unavailable source code or incompatible library dependencies (HP, 2009, p. 7).

By contrast, the x86-64-based OM690 systems are considerably more virtualization-ready.
These platforms already run supported operating systems and can typically be rehosted onto OpenShift Virtualization clusters with minimal modification.

3.3 Overview of Network and Storage Topology

The current OM690 infrastructure at OL3 operates on networking and storage architecture that reflects the technological standards of its original deployment period. While functional, this legacy topology brings limitations on scalability, redundancy, and general virtualization readiness.

The OM690 network architecture includes a combination of legacy and industrial networking technologies. As depicted in Figure 4, core communication occurs over the ring networks, which utilize OM1 multimode fiber operating at 100 Mbps. These networks were optimized for the deterministic and time-sensitive communication required by control and monitoring systems.

This topology, while suitable for its original bare-metal deployment, lacks features required for virtualization, and is limited in terms of redundancy, as modern protocols are not consistently deployed. As virtual machines are introduced, their integration should respect the behavior of the control network and ensure no unintentional traffic leakage across VLANs, particularly when using software-defined networking mechanisms (Red Hat, 2025a, pp. 278–281, 295).

To enable compatibility with plant VLANs, OCP nodes intended to host virtual machines could be provisioned with multiple physical interfaces and/or SR-IOV capabilities, where needed, to ensure virtual interfaces maintain proper behavior under plant constraints. The configuration of NetworkAttachmentDefinition resources in Kubernetes, for example, allows binding VM interfaces to appropriate VLANs (Red Hat, 2025a, pp. 78–80). However, careful planning is needed to align MAC address visibility, broadcast domains, and redundancy expectations between virtual and physical switches (Johnson et al., 2023, pp. 14–15).
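How a VM interface might be bound to a plant VLAN can be sketched as follows. This minimal Python example builds a NetworkAttachmentDefinition that wraps a Linux bridge CNI configuration with a VLAN tag. The resource name, namespace, bridge name, and VLAN ID are hypothetical placeholders, and the choice of the bridge plugin is an assumption; an actual design might instead use SR-IOV or another attachment type.

```python
import json


def build_vlan_attachment(name: str, namespace: str, bridge: str, vlan_id: int) -> dict:
    """Return a NetworkAttachmentDefinition binding a secondary interface to one VLAN."""
    if not 1 <= vlan_id <= 4094:
        raise ValueError("VLAN ID must be in the range 1-4094")
    cni_config = {
        "cniVersion": "0.3.1",
        "name": name,
        "type": "bridge",      # Linux bridge CNI plugin (assumed attachment type)
        "bridge": bridge,      # pre-existing node bridge, hypothetical name
        "vlan": vlan_id,       # tag traffic for the intended broadcast domain
        "macspoofchk": True,   # restrict the interface to its assigned MAC address
    }
    return {
        "apiVersion": "k8s.cni.cncf.io/v1",
        "kind": "NetworkAttachmentDefinition",
        "metadata": {"name": name, "namespace": namespace},
        # The CNI configuration is carried as a JSON string in spec.config.
        "spec": {"config": json.dumps(cni_config)},
    }


if __name__ == "__main__":
    # Hypothetical example values, not the plant's VLAN plan.
    print(json.dumps(build_vlan_attachment("plant-vlan-100", "om690-example", "br-plant", 100), indent=2))
```

Generating such definitions from a reviewed VLAN plan, rather than writing them by hand, is one way to keep the virtual and physical broadcast domains aligned.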
The storage infrastructure supporting OM690 today lacks many abstraction, redundancy, and orchestration features that are foundational in modern virtual environments. Specifically, there is minimal support for high availability at the storage layer. Failures typically lead to manual recovery processes, which is not ideal in a virtualized high-availability context.

In a future OpenShift Virtualization deployment, persistent storage for virtual machines and containerized workloads could be provisioned using Kubernetes-compatible volumes, for example via the Container Storage Interface (CSI) (Red Hat, 2025a, pp. 342–343). A potential option would be to integrate enterprise storage systems with CSI drivers.

Critical workloads that require live migration capabilities or automated failover could be backed by shared storage accessible to all participating nodes in the OpenShift cluster. The current architecture does not yet support this pattern and might require an overlay or replacement solution that introduces shared block-level volumes with consistent performance and redundancy.

While the current OM690 network and storage topology has met the requirements of the original system design, it lacks some capabilities required to support modern virtualization. Network speeds, interface flexibility, and protocol support should be updated to support dynamic and redundant virtual machine communication. Similarly, the storage layer will evolve from locally attached devices to shared, resilient storage solutions that enable live migration and failure recovery without data loss.

These gaps are not insurmountable, and several paths exist for alignment. Network modernization could include deploying OCP clusters on updated switching infrastructure with VLAN-aware physical interfaces and appropriate SR-IOV configurations. Storage upgrades could introduce CSI-backed resilient storage pools for both VM disk images and container volumes (Red Hat, 2025d, pp. 258–265; Red Hat, 2025a, pp. 332–335). These measures will help ensure that the virtualized OM690 system meets the fault tolerance, performance, and regulatory requirements applicable to a safety-critical nuclear facility.

3.4 Communication and Integration Challenges

The transition from legacy bare-metal infrastructure to a virtualized environment based on the proposed OpenShift Container Platform (OCP) also introduces technical and operational integration challenges. These include maintaining reliable communication between newly deployed virtual components and existing plant systems coupled to deterministic, time-sensitive industrial protocols. The effort should also address legacy software compatibility, uphold cybersecurity boundaries within the new environment, and ensure operational transparency.

One primary communication challenge is the protocol gap between the modern, software-defined environment and legacy control system protocols. Introducing virtual machines managed by OpenShift Virtualization running on OCP nodes introduces complexity in preserving timing determinism. For instance, replacing tightly timed applications with virtualized equivalents may introduce network jitter or latency. This potentially affects system synchronization or safety loop timing (Burnicki, 2024, p. 3).

Moreover, there is an existing segmented network architecture and deterministic fieldbus behavior that imposes constraints on integrating virtual systems. While VMs in OpenShift Virtualization can potentially be connected to existing plant VLANs using features like Multus CNI (Red Hat, 2025a, pp. 214–218), OCP's own software-defined networking and virtual switching layers require precise configuration. Care must be taken with MAC addresses, broadcast domains, and redundancy between virtual and physical network layers to prevent traffic leakage or degraded performance (Johnson et al., 2023, pp. 14–15).
OpenShift 4.17 introduced a feature known as User Defined Networking (UDN), which allows for the creation of multiple, isolated layer-2 networks at the cluster level, defined through a new UserDefinedNetwork Custom Resource Definition (CRD) (OKD, n.d.). In the long term, this could offer a more native and simplified approach to network segmentation, potentially reducing the reliance on per-namespace NetworkAttachmentDefinition configurations (OKD, n.d.). UDN could provide more robust, centrally managed isolation so that safety-related traffic can never share a broadcast domain with maintenance utilities, even inside the virtual fabric. However, its adoption is not assumed in this design.

Cybersecurity is another concern. Introducing an OpenShift Container Platform environment for virtualization adds new attack surfaces, including the OCP control plane, worker nodes, datastore, and the container subsystem itself. These should be hardened and isolated according to standards, so that every human, process, and device operates within assigned roles. Per IEC 62443-3-3 SR 2.1, the platform should enforce authorization on all interfaces for all users/roles; YVL A.12 §4.3 further requires role-limited administration and traceable logging of accesses and changes (IEC, 2013; STUK, 2021).

This could involve OCP-specific security measures such as configuring Role-Based Access Control (RBAC), Security Context Constraints (SCCs), NetworkPolicies, and securing container image sources and registries (Red Hat, 2025d, pp. 254–270). It is essential that the virtualized environment does not blur boundaries between trusted plant zones and external interfaces, especially given OCP's capabilities for integrating with automation pipelines. This IT/OT convergence is a well-documented matter that requires careful management.

Lastly, remapping physical I/O channels to virtual systems needs careful consideration.
While many planned VMs are supervisory, virtualizing control components raises concerns about latency and determinism. Even without virtualizing safety-critical I&C loops, data routing from these systems through the virtualized upper-layer systems on OCP requires careful design (Johnson et al., 2023, p. 9).

3.5 Migration Risks and Mitigation Drivers

While the preceding sections have detailed the architectural and operational limitations of the current OM690 infrastructure, it is equally important to frame these as risks that shape the feasibility and sequencing of virtualization efforts. The most critical risks stem from legacy platform discontinuities, particularly the incompatibility between SPARC and x86-64 systems, and the obsolescence of Solaris 10. This creates an unavoidable need for migration.

Endianness and binary compatibility issues also move beyond technical nuisances into project risk territory. If handled incorrectly, byte-order mismatches in historical data could compromise safety-critical logs or misrepresent operational telemetry. Similarly, legacy 32-bit applications may prove unmigratable if source code is unavailable, forcing redesigns under tight timelines. Another compounding risk is the fragility of disaster recovery: IEC 62443-3-3 SR 7.4 requires that the system can be restored to a known secure state after any disruption (IEC, 2013). With the old SPARC Sun Fire servers no longer supported and replacement parts scarce, even well-maintained backup images become unusable without compatible hardware.

Security concerns round out the risk landscape. Integrating OpenShift introduces an entirely new control layer that must be hardened according to nuclear cybersecurity standards (IEC, 2013; STUK, 2021). This brings technical work and regulatory overhead, especially around network segmentation, role-based access, and patch hygiene across virtual and physical domains.
To manage these risks, the migration strategy should proceed in phased, validated stages. Each workload should be tested in a controlled simulator environment before production use, and data conversion processes must be both verifiable and auditable. The risk is not only technical, but procedural as well.

3.6 Conceptual Virtualized Architecture

To address the issues of the legacy system, a modernized architecture built on Red Hat OpenShift Virtualization has been planned. It establishes a software-defined foundation where workloads are decoupled from physical hardware and managed as virtual machines or containers within a Kubernetes-native environment. The core platform is planned to consist of standardized x86-64 compute hosts (Dell PowerEdge), network switches (Cisco Nexus), and iSCSI-based shared storage systems. These are to be deployed in two computer rooms, each housing three OpenShift nodes, for six nodes in total. Storage arrays in each room are planned to participate in a metro cluster configuration with synchronous replication, ensuring data consistency between sites.

The OM690 components are planned to be hosted as virtual machines or containers on OpenShift. Key ancillary services are planned to include Active Directory and WSUS for identity and update control, Red Hat Satellite and EfficientIP for configuration and IP address management, and Veeam with LTO-9 for data protection and long-term archival. In this architecture, automation can be supported by an infrastructure-as-code pipeline with specific tools, which will be analyzed in Chapter 5.

Figure 6 summarizes the high-level architectural layout of the conceptual new system, showing physical infrastructure, virtualized platform layers, ancillary services, and network segmentation. This architecture aims to resolve the legacy platform's lifecycle limitations but will require careful planning to meet operational and regulatory demands.
The techniques that would make this architecture work in terms of resilience are examined in Chapter 4 and translated into operational practices in Chapter 5.

Figure 6. High-level core architecture of the conceptual virtualized platform

4 Architectural Resilience and Failover Analysis

This chapter analyzes failover mechanisms and how they could be implemented in a virtualized OM690 infrastructure, especially in terms of network, storage, and compute nodes. It focuses on architectural strategies, tools, and validation methods that ensure fault tolerance and operational continuity.

4.1 Component-Level Criticality and Recovery Priorities

Not all components within the OM690 platform have equal importance in terms of their role or their recovery requirements. To clarify the implications for system design and failover validation, the criticality of components is viewed here from two angles:

• Functional criticality refers to how essential a component's role is to the safe and continuous operation of the plant.
• Recovery criticality reflects how urgently a specific instance of a component must be restored after failure, which depends on its redundancy and operational impact.

This distinction is useful in the platform's fault tolerance considerations, and it should guide both infrastructure design and testing priorities. Processing Units (PUs) are functionally extremely critical components, as they run real-time nuclear programs that control and monitor plant systems (TVO, 2024). Any disruption to PU logic could compromise safety functions. This is why PUs are deployed as a high-availability hot standby pair in the legacy system (TVO, 2024) and are planned to operate this way in the virtualized system as well. Each PU has a standby partner that immediately assumes control upon failure (TVO, 2024). As such, the recovery criticality of an individual PU VM is not high from the perspective of the virtualized platform.
The failover will occur at the application level, and the virtualized infrastructure mainly needs to ensure that split-brain is avoided and that at least one PU always remains operational.

Server Units (SUs) handle data archiving and contribute to system transparency and diagnostics (TVO, 2024). They too are redundant, because a prolonged failure could result in data gaps affecting post-event analysis and regulatory traceability. Because of this redundancy, they also have a lower recovery criticality.

Operator Terminals (OTs) provide the primary interface between plant personnel and the I&C systems (TVO, 2024). While they do not execute control logic, their availability needs to be ensured for situational awareness and timely operator intervention. One OT server typically drives four operator displays (TVO, 2024). If the server fails, a significant portion of the control room interface goes offline, including alarm visibility, process trend graphs, and manual control interfaces. Therefore, OT servers have moderate functional criticality but high recovery criticality, and their restart procedures should be prioritized.

The External Unit (XU) acts as a communication hub for as many as 16 external systems and lacks built-in redundancy (TVO, 2024). If the XU server fails, the links to all of these external systems and processes are interrupted at once. This makes it a top-priority component from a recovery criticality perspective, even though it does not itself execute core control functions.

Other components such as diagnostic servers (DS), engineering stations (ES), and log servers are less critical for immediate plant operation (TVO, 2024). They support maintenance and post-event analysis and can tolerate longer recovery windows. However, the network that underpins these systems is a backbone element, and its reliability is necessary to keep all subsystems operational and synchronized.
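The two-axis view above can be captured in a small data structure. The following Python sketch is illustrative only: the numeric ranks are assumptions chosen to mirror the qualitative statements in this section (they are not values from TVO documentation), and the functional ranks for SU and XU in particular are placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    name: str
    functional: int  # illustrative rank, 3 = most essential role
    recovery: int    # illustrative rank, 3 = restore this instance first
    redundant: bool  # True if an application-level partner takes over


# Ranks are assumptions that paraphrase the text, not measured values.
COMPONENTS = [
    Component("PU", functional=3, recovery=1, redundant=True),
    Component("SU", functional=2, recovery=1, redundant=True),
    Component("OT", functional=2, recovery=2, redundant=False),
    Component("XU", functional=2, recovery=3, redundant=False),
    Component("DS", functional=1, recovery=0, redundant=False),
    Component("ES", functional=1, recovery=0, redundant=False),
]


def restart_order(components):
    """Order by recovery criticality first, then functional criticality, then name."""
    return sorted(components, key=lambda c: (-c.recovery, -c.functional, c.name))


print([c.name for c in restart_order(COMPONENTS)])
```

Sorting on recovery criticality rather than functional criticality puts the non-redundant XU and OT ahead of the PU, which is the point of separating the two axes.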
Standards establish the backdrop for this criticality assessment. YVL E.7 requires that safety-classified systems continue functioning under failure conditions, with deterministic and testable recovery behavior (STUK, 2019). IEC 62443-3-3 (SR 3.6) further mandates that recovery from disruptions must be predictable and must not compromise operational or data integrity (IEC, 2013). The OM690 architecture must therefore not only tolerate faults but do so in a way that aligns with both the function-specific and recovery-specific expectations of the nuclear safety domain.

Although OpenShift offers a container-native infrastructure, a container-first model is challenging to implement for many OM690 workloads. Some components, like the PUs, exhibit real-time constraints, tightly coupled behavior, and platform-specific binaries. These characteristics make them hard to containerize effectively, because containerization favors stateless and decoupled application design (Queiroz et al., 2023, pp. 5–15). For this reason, an appropriate model is VM-centric at the workload layer and container-native at the infrastructure level. That said, many services can be containerized, and containerization is preferred where it is feasible. The VM-centric stance applies only to services and servers deemed unsuitable for containerization.

In short, the platform's design should balance deterministic behavior, rapid interface restoration, and communication survivability. Failover logic should be tiered accordingly, and validation activities should reflect the differing recovery criticality of components.

4.2 OpenShift Control Plane and Redundancy Patterns

As stated earlier, the virtualized OM690 environment is planned to run on six OpenShift nodes split between two computer rooms. To meet OL3 recovery targets, the OpenShift platform must keep its control plane available through a room-level failure. The resilience of this layer is a prerequisite for any higher-level HA.
This section outlines two cluster patterns, a single stretched cluster across two rooms and two independent clusters, and explains how control plane quorum, workload protection, and storage behavior differ. For OM690, this choice directly affects how fast services can be restarted after faults. The project is going forward with a single stretched cluster; the dual-cluster alternative is retained here as an analyzed option for completeness and future contingency.

4.2.1 Control Plane Topology and Quorum Rationale

In a stretched cluster design, a single OpenShift cluster would place its control plane members across both rooms. It remains one logical control plane backed by one etcd (key-value store) quorum. By contrast, a dual-cluster design would deploy two independent clusters, one per room, with no shared quorum. etcd requires a strict majority of voting members for any change (etcd, 2025). Odd-sized control planes are therefore preferred because they maximize fault tolerance for a given size, while even-sized layouts do not increase the number of simultaneous failures tolerated and add failure surface (etcd, 2025). Table 1 illustrates this principle.

Table 1. etcd quorum majority and failure tolerance

etcd voters   Majority required   Failures tolerated
3             2                   1
4             3                   1
5             3                   2

As the table demonstrates, a four-member control plane still needs three members for quorum and thus tolerates only one failure (etcd, 2025). In terms of fault tolerance it therefore behaves exactly like a three-member control plane, while adding parts that can fail. With only two rooms, no stretched topology can guarantee write continuity through an arbitrary room loss unless a third site arbitrates. Control plane placement only improves the odds: the cluster keeps quorum only when the surviving room holds the majority (Gurijala & Sullivan, 2022).
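The majority rule behind Table 1, and its consequence for a two-room layout, can be checked with a few lines of Python. This is a minimal sketch of the arithmetic only; the function names and the example placement of two voters in one room and one in the other are illustrative assumptions, not part of the OM690 plan.

```python
def majority(voters: int) -> int:
    """Smallest strict majority of `voters` voting members."""
    return voters // 2 + 1


def failures_tolerated(voters: int) -> int:
    """Voting members that can fail while a strict majority remains."""
    return voters - majority(voters)


def keeps_quorum(voters_per_room: dict[str, int], lost_room: str) -> bool:
    """True if the rooms other than `lost_room` still hold a strict majority."""
    total = sum(voters_per_room.values())
    surviving = total - voters_per_room[lost_room]
    return surviving >= majority(total)


# Reproduces the rows of Table 1: voters, majority required, failures tolerated.
for n in (3, 4, 5):
    print(n, majority(n), failures_tolerated(n))

# Hypothetical two-room placement of three voters.
placement = {"room_a": 2, "room_b": 1}
for room in placement:
    outcome = "quorum kept" if keeps_quorum(placement, room) else "quorum lost"
    print(f"{room} lost -> {outcome}")
```

Whatever the split between two rooms, at least one room holds half or more of the voters, so losing that room always breaks the strict majority. Placement decides only which room that is.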
To remain fully read/write after the loss of either room, two approaches exist: operate dual clusters with GLB steering, or place control plane voters across three sites so a majority always survives a single-room loss (Gurijala & Sullivan, 2022). Without one of those patterns, a stretched two-room control plane cannot guarantee continuity of writes on arbitrary room loss.

If the room holding the majority fails in a stretched layout, the cluster loses quorum. The API switches to read-only, controllers and the scheduler cannot persist changes, and KubeVirt cannot reschedule VMs (etcd, 2025). Existing pods (container workloads) and VMs may keep running on their current hosts, but no new recovery actions occur until operators restore quorum, for example by using the documented etcd quorum-restoration procedure or, if the majority cannot be recovered, by restoring from an etcd backup (Red Hat, 2025k, pp. 391–392). For OM690 this means restarts cannot begin until quorum is restored, as the delay sits on the control plane path, not on storage. Learner members do not change this outcome, as they are non-voting and must be explicitly promoted after catching up (etcd, 2025).

A stretched cluster is viable for OM690 when inter-room latency remains in the single-digit milliseconds and voter placement is planned so that loss of the non-majority room preserves quorum for routine restarts. Day-2 operations should watch etcd peer round-trip time and disk fsync latency, as sustained high values can trigger disruptive leader elections (Red Hat, 2025j, pp. 14–15). At the storage layer, stretch patterns carry their own limits. Red Hat's guidance calls for ≤ 10 ms round-trip time between sites in stretch configurations (Red Hat, 2025i, pp. 108–109).

A dual-cluster alternative would avoid cross-room quorum coupling entirely, as each room keeps a full, independent control plane.
Therefore, if one room is lost, the other remains fully read/write and a global load balancer (GLB) steers traffic while applications fail over according to their replication strategy (Gurijala & Sullivan, 2022). In practice this preserves the platform's ability to restart services immediately in the surviving room, at the cost of operating two clusters and maintaining GLB-based runbooks.

Finally, any node intended to run or migrate VMs should use Red Hat Enterprise Linux CoreOS (RHCOS). Generic RHEL workers may join for container-only workloads but cannot host or migrate VMs (Red Hat, 2025a, p. 59). Standardizing VM hosting on RHCOS simplifies fencing integrations and intra-cluster KubeVirt operations, regardless of whether the platform is stretched or dual-cluster.

4.2.2 Workload Recovery Semantics (Platform-Level)

For infrastructure-led failover, OpenShift Virtualization protects VMs primarily via cold restart on healthy nodes, with volume reattachment orchestrated by Kubernetes/KubeVirt. Cross-room live migration is out of scope here (Gurijala & Sullivan, 2022). For OM690 this is not ideal for the OTs and the XU, but it is acceptable if their Recovery Time Objectives (RTOs), that is, their maximum acceptable downtimes, are met. The PU and SU, on the other hand, do not rely on VM-level recovery for continuity because they run as redundant pairs. For containerized services that are architected for replication, active/active operation can be achieved by running replicas in both rooms and steering traffic via the GLB. This is distinct from services deployed as VMs, which have to remain cold-restart oriented (Gurijala & Sullivan, 2022).

Storage behavior (replication mode, arbitration/witness, path management, and RWX/RWO implications for live migration) is analyzed in Chapter 4.3.2, which specifies the metro-replication model, the role of the third-site witness, and host multipathing considerations (Dell Technologies, 2024; Red Hat, 2025e).
In a dual-cluster design, state consistency mechanisms (infrastructure-level vs. application-level replication) are likewise summarized in Chapter 4.3.2. At the platform layer, failover remains a restart/switchover in the target cluster rather than live migration of a running VM (Gurijala & Sullivan, 2022).

4.2.3 Comparative Assessment

A stretched cluster keeps the operational surface small (one control plane) and, under low inter-room latency with favorable voter placement, can reschedule workloads transparently when paired with synchronous metro-replication (RPO = 0). The trade-off is quorum sensitivity: if the majority room is lost, the minority partition stops serving writes and scheduling until a majority is restored (etcd, 2025; Dell Technologies, 2024, pp. 6–9). A third-site storage witness is recommended in the stretched storage design to arbitrate a single writer and prevent split-brain (see Chapter 4.3.2 for the storage-layer rationale and parameters). The witness does not, however, influence etcd quorum (Dell Technologies, 2024, pp. 6–13, 44–55; etcd, 2025).

If uninterrupted control plane writes during a room loss are required across two rooms, a dual-cluster approach behind a global load balancer would be the safer choice, because each room then retains an independent, read/write control plane. If a stretched, single-cluster model is preferred for operational simplicity, continuous writes through any single-room loss would require placing control plane voters across three sites (for example 2+2+1), observing etcd latency budgets (e.g., p99 peer RTT and fsync), and validating that site-to-site RTT remains within Red Hat's stretch guidance (≤ 10 ms) for steady leader election behavior (Gurijala & Sullivan, 2022; etcd, 2025; Red Hat, 2025j, p. 5; Red Hat, 2025i, Chapter 3.5).

In either pattern, control plane stability is primarily sensitive to etcd performance.
Red Hat documentation implies that a practical approach is to monitor key indicators at high percentiles, such as the 99th percentile (p99) for peer round-trip time (RTT) and disk fsync latency, rather than anchoring on a single average value. Persistently elevated p99 values for these metrics can lead to disruptive etcd leader elections and should be avoided (Red Hat, 2025j, p. 5). For stretched storage, Red Hat reference materials cite ≤ 10 ms RTT between sites as a planning budget (Red Hat, 2025j; Red Hat, 2025i, § 3.5, pp. 8–9). These values can be treated as planning targets, but the actual figures still need to be validated.

For OM690, if continuous control plane writes during a room loss are required, a dual-cluster approach behind a GLB would be the safer choice. If a brief platform-level pause is acceptable and operational simplicity is paramount, a stretched cluster is also viable, provided inter-site latency targets are met and sufficient quorum-restore procedures are in place (etcd, 2025; Red Hat, 2025j, Chapter 4.4).

4.3 Failure-Mode Impact and Recovery Patterns

Ensuring that no single fault can compromise system-wide functionality is a core requirement in nuclear instrumentation and control (I&C). This section evaluates how the proposed OM690 virtualization architecture meets that objective by analyzing its resilience across three key fault domains: network, storage, and compute nodes. Rather than treating failure tolerance as a singular feature, the analysis demonstrates how layered recovery mechanisms work together to uphold availability targets. The emphasis is placed on realistic failure scenarios, the system's potential responses, and the architectural design choices that enable recovery.

4.3.1 Network-Level Resilience

To satisfy requirements within the OM690 upgrade, the network must continue operating despite failures such as the loss of one computer room or of any single switch or link.
The core is therefore planned to use four Cisco Nexus switches, two per room. The core design needs explicit choices that bound the failure blast radius and keep forwarding predictable.

As illustrated in Figure 7, each room is planned to contain one vPC domain (a pair of Nexus switches). Here, the two switches in Room A form vPC Domain A, and the two in Room B form vPC Domain B. A single Nexus switch can participate in only one vPC domain at a time. Therefore, designs where each switch peers with two vPC peers are not supported by the Nexus OS (NX-OS) and would risk split-brain behavior (Cisco, 2024, p. 260). The two room-local vPC domains should preferably interconnect with routed Layer 3 links for fault containment (Cisco, 2024, pp. 288–291). Alternatively, when a Layer 2 stretch is truly necessary, they can connect via a supported vPC-to-vPC interconnection. These choices constrain failure propagation within the network: room-local faults stay local, and cross-room links are engineered for deterministic reconvergence.

Figure 7. Proposed dual vPC topology with redundancy

vPC peer-keepalives run out-of-band (OOB), ideally on the mgmt0 interface with dedicated virtual routing and forwarding, so the peer-link never carries keepalives (Cisco, 2024, pp. 259, 262–263). This arrangement lets the peers distinguish a failed switch from a failed peer-link and prevents dual-active conditions. This is critical in a safety context such as OM690, where ambiguous forwarding states are unacceptable. To avoid blackholing in a peer-link-down event, orphan-connected devices should preferentially attach to the vPC primary (or use the orphan-port-protection setting), as the secondary suspends its vPC member ports when the peer-link is lost but the keepalive remains up (Firewall.cx, n.d.).

Within each room, the vPC pair presents itself as one logical switch to downstream devices, so uplinks run active-active rather than blocking.
This increases available bandwidth and avoids spanning-tree failover delays during single-link faults (Cisco, 2024, p. 255). Spanning tree can remain enabled as a backstop: NX-OS runs STP on both vPC peers, with the primary coordinating vPC-facing ports on the secondary. Setting the pair as STP root/secondary and treating peer-link ports as STP network ports minimizes topological surprises and speeds convergence when something fails (Cisco, 2024, pp. 273–274). As a result, common faults do not change the forwarding model seen by hosts.

Each OpenShift and storage node exposes physical NICs (Figure 7, bond0 and bond1) and forms LACP port-channels to the room-local vPC pair. LACP in active mode detects unidirectional or mis-cabled members and automatically removes them from service; NX-OS explicitly recommends LACP for vPC member links (Cisco, 2024, pp. 238–240, 272). Compared to static port-channels, this avoids silent black holes in the network and reduces operator intervention during faults, which improves mean time to recovery and keeps packet loss bounded to the detection window. In combination, LACP on the host edge and STP on the vPC pair detect and isolate asymmetric link failures quickly, so bandwidth is reduced but traffic remains available (Cisco, 2024, pp. 238–240, 273–274).

Plant networks are planned to remain on Siemens Scalance rings, which use the High-Speed Redundancy Protocol (HRP). To attach those rings to the core without compromising ring timing, the design can use Enhanced Passive Listening Compatibility (EPLC). With EPLC, the Scalance devices relay spanning-tree change messages toward the core, so the core flushes its MAC tables and can reconverge within a few seconds after a ring break. STP processing stays disabled on the HRP ports, and the RSTP segment connects to more than one ring node in Passive Listening mode (Siemens, 2018). The effect is that the ring's latency budget is preserved even while the broader fabric re-optimizes.
Potential network failure profiles and impacts in this design include:

• High-probability, low-impact: Single switch or single peer-link failure. Traffic continues in the surviving vPC domain(s); OOB keepalives and STP prevent dual-active, and dual-homed hosts keep forwarding on remaining members (Cisco, 2024, pp. 255–258, 272–274). Expected operational outcome: platform stays online; no VMs/pods are restarted, and no workloads are moved. Hosts keep using the remaining links.
• Medium-probability, moderate-impact: One switch fails in each room simultaneously. Each vPC domain still has one peer alive; the inter-domain vPC-to-vPC links preserve connectivity between rooms. Expected operational outcome: capacity reduced but services continue.
• Low-probability, high-impact: Both switches in one room fail. Single-attached devices in that room are isolated; dual-homed servers retain connectivity through their links into the surviving room's vPC domain. Expected operational outcome: services fail over to the other room but remain online.

Because servers are planned to be dual-attached and storage should replicate synchronously between rooms (analyzed in Chapter 4.3.2), changes or upgrades on one room's switch pair can be executed without workload interruption. This assumes that the cross-room interconnect and configuration parity described above are maintained (Cisco, 2024, pp. 255, 272–274). In short, resilience in the network can be built in layers: vPC contains local faults, LACP hardens the host edge, EPLC protects deterministic rings, and cross-room links provide a controlled path of last resort. Together these techniques align the network with A12 (§4.1) and IEC 62443-3-3 (SR 5.1, SR 5.2) single-failure tolerance and deterministic recovery expectations (STUK, 2021; IEC, 2013).

4.3.2 Storage-Level Resilience

The integrity and availability of the storage backend are fundamental to the virtualized OM690 system.
Both computer rooms are planned to host a Dell PowerStore array. Metro volume technology presents them as a single active-active storage pool that replicates every write synchronously, resulting in an RPO of zero (Dell Technologies, 2024, pp. 6–10). Each OpenShift node reaches the arrays through two independent iSCSI fabrics, and Device-Mapper Multipath (DM-Multipath) balances I/O across all available paths (Red Hat, 2025e, pp. 9–10). Figure 8 illustrates this arrangement: two PowerStore arrays participate in metro sessions for each volume, a third-site witness arbitrates which side is writable, and hosts access the arrays through target port groups (TPGs), continuing I/O on the preferred side via active-optimized paths.

Figure 8. Metro-Volume storage diagram (Dell Technologies, 2024)

At the host level, DM-Multipath monitors path health and transparently redirects traffic when a path component, such as an HBA, switch, or front-end port, fails. Dell's Linux examples use the "queue-length 0" path selector, which spreads I/O across all active-optimized paths (Dell Technologies, 2024, p. 85). Red Hat documents the queue-length 0 selector as an equally safe alternative for block devices (Red Hat, 2025e, pp. 9–10, 29, 33–38). Either policy sustains throughput during maintenance and avoids I/O contention during mass path recovery.

Metro volume's arbitration logic can be enhanced by a third-site witness, introduced in PowerStoreOS 3.6. The witness continuously tests both arrays and the inter-site links. If communication is lost, only the side acknowledged by the witness retains the write role, which prevents split-brain corruption (Dell Technologies, 2024, pp. 13, 44–45). To operate in synchronous mode, Dell specifies round-trip latency below 5 ms and at least 250 Mb/s per concurrent replication and migration flow (Dell Technologies, 2024, pp. 12–13, 26, 44, 59). The planned OM690 storage architecture should be within those limits.
It should be noted that metro volume's single-writer arbitration protects VM disks but is not a multi-writer mechanism. Therefore, two-room active/active must be implemented at the application/data-replication layer for services that require it.

Because every write is synchronously committed, the design contains the three principal fault scenarios. If one array fails, the witness demotes it and DM-Multipath continues over the surviving paths with no client impact. If an entire room is lost, the surviving array still serves data and OpenShift starts the affected VMs on local hosts. If the inter-room link alone fails, the witness grants write access to only one site while the other remains read-only until synchronization is restored (Dell Technologies, 2024, pp. 9–10, 44–45). This eliminates any risk of data divergence.

OpenShift Virtualization consumes the metro volume through block-mode PersistentVolumeClaims (PVCs). PowerStore metro volume is a block-only capability (Dell Technologies, 2024); the CSI storage class used here presents RWO PVCs rather than RWX. As a result, live migration of KubeVirt VMs is not available with this storage class; recovery uses cold restart with volume reattachment (Red Hat, 2024b, p. 26). A cross-room transfer still requires powering off the VM on the failed site and booting it on the surviving hosts (Red Hat, 2025a, pp. 23–24, 64, 347). Declaring this limitation up front prevents overstating the benefit of metro volume while remaining compliant with requirements (IEC 62443-3-3 SR 7.3, SR 7.4) on state preservation (IEC, 2013).

Synchronous replication protects data, but workloads also need their front-end IPs. A Global Load Balancer (GLB) can monitor health probes from both rooms and pin each client session to the preferred site (Gurijala & Sullivan, 2022). It makes switchover decisions based on application and endpoint health, while witness status is an input, not the sole trigger (Gurijala & Sullivan, 2022).
The GLB therefore completes the switchover without human intervention and avoids snapshot divergence by ensuring that new requests always land on the authoritative copy. Finally, the witness appliance itself, if used, becomes a critical dependency, and it should be deployed on an independent management VLAN (Dell Technologies, 2024).

In summary, host-level multipathing, metro volume synchronous mirroring with a witness, and GLB-based traffic steering together provide a clear, testable path from single-cable faults up to full computer-room loss. This meets recovery requirements in SR 7.3 and 7.4 of IEC 62443-3-3 (IEC, 2013).

4.3.3 Compute-Node Resilience

While resilient network and storage infrastructures are foundational, the physical servers in the OM690 virtualization cluster represent another critical fault domain. To demonstrate the architecture's ability to tolerate such failures, consider a scenario in which one of the OpenShift Virtualization hosts suffers a complete outage. This node is assumed to run a mix of virtual machines, including a Processing Unit (PU), Operator Terminals (OT), and the External Unit (XU) server. The scenario therefore tests both functionally critical and recovery-critical components, as defined in Chapter 4.1. Where suitable services are containerized, pod rescheduling typically restores instances faster than VM cold restarts, but this applies only to workloads that can meet OM690's determinism and integration constraints in container form.

The system's response begins at the application layer. When the PU on the failed node becomes unavailable, the hot standby PU takes over the role. This switchover happens within the PU pair and should not wait for any OpenShift remediation. The infrastructure's only task is to enforce fencing to prevent split-brain and to ensure that at least one PU instance remains available (Red Hat, 2025g, Chapter 1).
The cluster moves the node from Ready to Unknown after about 50 s of missed heartbeats, and Node Health Check waits a further 40 s before creating the Self Node Remediation (SNR) object, so fencing starts roughly 90 s after the first symptom (Red Hat, 2025g, p. 7).

Concurrently, the OpenShift control plane should recognize the failed node as non-responsive. The Node Health Check operator creates an SNR resource, which reboots the unhealthy node via its watchdog and cordons it so that workloads are rescheduled. Because SNR can only restart the node locally, sites that demand a full hardware power-off can attach the Fence Agents Remediation (FAR) operator to the same Node Health Check pipeline; FAR then performs the required out-of-band fencing (Red Hat, 2025g, pp. 21–23; medik8s, n.d.). The SNR safe-time (safeTimeToAssumeNodeRebootedSeconds) can be tuned to balance data-corruption risk against recovery speed (Red Hat, 2025g, p. 15).

Once the failed node is fenced, the infrastructure layer recovers less-resilient workloads. The OT VMs are rescheduled to a healthy node for a cold restart (Gurijala & Sullivan, 2022). Each OT drives up to four operator displays and stores its unique HMI layout and short-term archive locally; its loss therefore removes alarms, trend graphs, and manual-control capability for a defined control-room segment. Because displays are not interchangeable across OTs, rapid restoration of the exact OT VM is essential to preserve operator awareness and intervention capability.

Equally urgent is the recovery of the XU server, which serves as a communications hub for up to sixteen external systems (TVO, 2024). XU failure disrupts those links and can trigger cascading plant alarms.
As a single-instance component, its recovery-time objective is among the most stringent in the system, so the scheduler should give the XU VM the highest restart priority on an XU-capable host once fencing is confirmed and the disk has been reattached (Gurijala & Sullivan, 2022). This assumes that the XU is not host-bound and can be restarted on another node.

Throughout the incident, the SU pair, which is planned to be hosted on different nodes, remains unaffected, demonstrating effective fault-domain segregation. The scenario illustrates the tiered recovery model from Chapter 4.1: application-layer continuity for redundant components (PU), fast VM restarts for high recovery-critical roles (OT, XU), and deferred recovery for non-critical services such as diagnostics or engineering stations.

Demonstrating that a single host can fail without loss of safety function satisfies the requirements (IEC 62443-3-3 SR 7.1, SR 7.4) that physical and functional redundancy be implemented so the system remains operable even after a single-component failure (IEC, 2013).

4.3.4 Methods for Quick Recovery of High Urgency Workloads

As established, Operator Terminals (OT) and the External Unit (XU) are unique, non-redundant services: when one fails, a control room segment or multiple external links are unavailable until that exact instance runs again. Because there is no twin to fail over to, continuity depends on a fast, deterministic cold restart with the same identity and data on a capable host. Several technical and procedural levers can be used to achieve this rapid recovery.

One such lever is placement. Defining a labeled pool of OT- and XU-capable nodes gives the scheduler multiple safe landing spots as soon as fencing completes. Anti-affinity for OTs helps spread terminals so that a single host failure does not remove an entire cluster of displays. This keeps landing-zone constraints explicit without pinning a given instance to a single machine.
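As an illustration of the placement lever, the Python sketch below builds the scheduling-related fragment of a KubeVirt VirtualMachine template as a plain dictionary. The field names (nodeSelector, affinity.podAntiAffinity, topologyKey) are standard Kubernetes scheduling fields that KubeVirt passes through to the virt-launcher pod; the label keys under om690.example and the role value are hypothetical and would need to follow the plant's own naming scheme.

```python
import json

# Hypothetical label keys, not taken from the OM690 design.
POOL_LABEL = "om690.example/urgent-pool"  # set on OT- and XU-capable nodes
ROLE_LABEL = "om690.example/role"         # set on the VM template


def vm_placement(role: str) -> dict:
    """Return the scheduling fragment for a VirtualMachine's spec.template."""
    return {
        "metadata": {"labels": {ROLE_LABEL: role}},
        "spec": {
            # Restrict the VM to the labeled pool instead of one named host.
            "nodeSelector": {POOL_LABEL: "true"},
            "affinity": {
                "podAntiAffinity": {
                    # Never co-locate two VMs of the same role on one host.
                    "requiredDuringSchedulingIgnoredDuringExecution": [
                        {
                            "labelSelector": {"matchLabels": {ROLE_LABEL: role}},
                            "topologyKey": "kubernetes.io/hostname",
                        }
                    ]
                }
            },
        },
    }


print(json.dumps(vm_placement("ot"), indent=2))
```

A required anti-affinity rule leaves a VM unschedulable when there are more same-role VMs than eligible hosts, which could block a restart after a node loss. If the pool is small, the preferredDuringSchedulingIgnoredDuringExecution form may be the safer trade-off.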
Another useful lever is ensuring capacity headroom. Holding at least one host's worth of spare CPU and memory in each VM pool converts a node loss into an immediate restart rather than a queued request. However, the headroom only pays off if the scheduler prefers the urgent workloads.

Deterministic fencing and rescheduling should be driven through Node Health Check with Self Node Remediation, so that a failed node reboots (or is power-fenced when FAR is attached) and its workloads are eligible for restart without human intervention. Safe-time tuning balances corruption risk against recovery speed; this pipeline makes the failed-node → restart-eligible transition examined earlier predictable.

On the scheduling side, assigning a high PriorityClass to targeted virt-launcher pods ensures they preempt lower-priority work if necessary. In practice, this turns reserved capacity into guaranteed behavior: once fencing confirms the node is out, the scheduler places urgent workloads first on surviving hosts.
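How headroom and priority interact can be shown with a toy model. The sketch below is not the Kubernetes scheduler: it is a simplified first-fit placement written for illustration, it ignores preemption, and all host names, VM sizes, and priority values are invented assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VM:
    name: str
    cpu: int       # requested vCPUs
    mem_gib: int   # requested memory in GiB
    priority: int  # higher value restarts first (stand-in for a PriorityClass)
    host: str      # host the VM currently runs on


def restart_plan(vms, capacity, failed_host):
    """Place the failed host's VMs on surviving hosts, highest priority first.

    Returns (placed, queued): a name -> host mapping and a list of VM names
    that found no host with enough free CPU and memory.
    """
    free = {}
    for host, (cpu, mem) in capacity.items():
        if host == failed_host:
            continue
        used_cpu = sum(v.cpu for v in vms if v.host == host)
        used_mem = sum(v.mem_gib for v in vms if v.host == host)
        free[host] = [cpu - used_cpu, mem - used_mem]

    placed, queued = {}, []
    victims = sorted((v for v in vms if v.host == failed_host),
                     key=lambda v: -v.priority)
    for vm in victims:
        for host, (cpu_free, mem_free) in free.items():
            if cpu_free >= vm.cpu and mem_free >= vm.mem_gib:
                free[host][0] -= vm.cpu
                free[host][1] -= vm.mem_gib
                placed[vm.name] = host
                break
        else:
            queued.append(vm.name)
    return placed, queued


# Invented example: three 16-vCPU / 64-GiB hosts, node-a1 fails.
CAPACITY = {"node-a1": (16, 64), "node-a2": (16, 64), "node-a3": (16, 64)}
VMS = [
    VM("xu", 4, 16, priority=100, host="node-a1"),
    VM("ot-1", 4, 16, priority=90, host="node-a1"),
    VM("diag", 8, 16, priority=10, host="node-a1"),
    VM("su-1", 8, 32, priority=50, host="node-a2"),
    VM("su-2", 12, 32, priority=50, host="node-a3"),
]

placed, queued = restart_plan(VMS, CAPACITY, "node-a1")
print("placed:", placed)
print("queued:", queued)
```

In this example the remaining headroom is enough for the XU and the OT but not for the diagnostic VM, so the low-priority workload is the one that waits. With the priorities reversed, the same headroom would be consumed by the diagnostic VM first, which is why headroom and priority have to be designed together.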