Aalto-Setälä Eemil 

Design Assessment of a Conceptual Virtualization 
Architecture for OM690 at Olkiluoto 3 

 
Vaasa 2025 

School of Technology and Innovations 
Master’s Thesis in Computing Sciences 



 
UNIVERSITY OF VAASA 
School of Technology and Innovations 

Author: Aalto-Setälä Eemil 
Title of the thesis: Design Assessment of a Conceptual Virtualization Architecture for OM690 at Olkiluoto 3 
Degree: Master of Science in Technology 
Degree Programme: Master's Programme in Computing Sciences 
Field of study: Automation and Information Technology 
Supervisor: Jouni Lampinen 
Year of graduation: 2025 Pages: 76 

 
ABSTRACT: 
Hardware obsolescence is a threat to the lifecycle of industrial automation systems in nuclear power plants. This thesis analytically evaluates a virtualization-based conceptual design for the OM690 system at Olkiluoto 3, in which legacy SPARC servers are replaced through virtualization on the Red Hat OpenShift platform. The scope of the work is a planning-stage assessment: architectural suitability, single-fault tolerance, cybersecurity, and long-term maintainability in an isolated network. Implementation of the design and performance testing are not included. The central research question is: Is the proposed architecture a suitable basis for the OM690 modernization, and what architectural and operational boundary conditions does its implementation require? The result is a structured assessment and recommendations for carrying out the modernization project. 

The study is based on standards, vendor documentation, and a failure analysis that focuses in particular on the system's network, storage, and compute layers. The work is structured by mapping key requirements to the technical mechanisms that implement them. In addition, an illustrative framework for verifying fault tolerance is presented. Operational aspects, such as identity and access management, monitoring, log collection, and lifecycle automation in an environment isolated from the internet, are also addressed in order to maintain long-term control and traceability. 

The thesis concludes that the design is feasible in theory, provided that three layers support one another: a deterministic, segmented network with room-level isolation; synchronous storage replication with a witness that prevents split-brain writes; and a recovery sequence for the control plane and workloads that prioritizes non-redundant services. With controlled lifecycle automation and tiered backups, the design is maintainable and auditable. Recommended next steps are validating fault tolerance under conditions representative of the plant environment, drills for restoring the system to a known secure state, and defining acceptance thresholds for recovery performance. 
 
 
KEYWORDS: Plant Automation, Virtualization, High Availability, Fault Tolerance, Lifecycle Automation, OpenShift 



 
UNIVERSITY OF VAASA 
School of Technology and Innovations 

Author: Aalto-Setälä Eemil 
Title of the thesis: Design Assessment of a Conceptual Virtualization Architecture for OM690 at Olkiluoto 3 
Degree: Master of Science in Technology 
Degree Programme: Master's Programme in Computing Sciences 
Supervisor: Jouni Lampinen  
Year: 2025 Pages: 76 

 
ABSTRACT: 
 
Industrial control systems in the nuclear sector face lifecycle risk from hardware obsolescence. This thesis analytically evaluates a conceptual virtualization blueprint for the OM690 system at the Olkiluoto 3 nuclear power plant, which replaces legacy SPARC-based servers with a platform built on Red Hat OpenShift. The scope is limited to the planning stage, and the thesis focuses on architectural suitability, fault tolerance, and long-term sustainment in an air-gapped environment. Implementation and performance testing are out of scope. The central research question is: Is the proposed virtualization architecture for OM690 viable with respect to long-term maintainability, fault tolerance, and cybersecurity, and what architectural design principles and regulatory considerations are required to achieve that viability? The outcome is a structured assessment and a set of recommendations. 

The study uses standards-grounded reasoning, vendor documentation, and failure-mode analysis across the network, storage, and compute layers. Evidence is organized as requirements-to-mechanisms mappings and an illustrative verification framework that indicates what should be measured later. Operational aspects such as identity and access control, monitoring, logging, and offline lifecycle automation are addressed to maintain long-term security and traceability in a disconnected environment. 

The thesis concludes that the blueprint is viable if three layers reinforce one another: a deterministic, segmented network with room-level isolation; synchronous storage replication with quorum/witness to avoid split-brain; and control-plane and workload recovery sequencing that prioritizes non-redundant roles. With controlled automation and tiered backups, the design appears maintainable and auditable. Recommended next steps are plant-representative validation of failover behavior, restore drills to a known secure state, and the definition of acceptance thresholds for recovery characteristics. 
 
 
KEYWORDS: Plant Automation, Virtualization, High Availability, Fault Tolerance, Lifecycle Automation, OpenShift 



Contents 

1 Introduction 
1.1 Methodology and Scope 
2 Theoretical Background 
2.1 Introduction to Virtualization 
2.2 Types of Virtualization and Hypervisors 
2.3 Traditional vs. Cloud-Native Virtualization 
2.4 Red Hat OpenShift Virtualization 
2.5 Real-time and Safety-Critical Considerations for Virtualization 
2.6 Resilience and System Architecture Concepts 
2.7 High Availability and Failover 
2.7.1 Definitions and Expectations 
2.7.2 Failover Models 
2.8 Infrastructure Management and Automation 
2.8.1 Red Hat Satellite 
2.8.2 Ansible 
2.8.3 Terraform 
3 Current System and Virtualization Feasibility 
3.1 System Constraints and Virtualization Implications 
3.2 Virtualization Readiness 
3.3 Overview of Network and Storage Topology 
3.4 Communication and Integration Challenges 
3.5 Migration Risks and Mitigation Drivers 
3.6 Conceptual Virtualized Architecture 
4 Architectural Resilience and Failover Analysis 
4.1 Component-Level Criticality and Recovery Priorities 
4.2 OpenShift Control Plane and Redundancy Patterns 
4.2.1 Control Plane Topology and Quorum Rationale 
4.2.2 Workload Recovery Semantics (Platform-Level) 
4.2.3 Comparative Assessment 
4.3 Failure-Mode Impact and Recovery Patterns 
4.3.1 Network-Level Resilience 
4.3.2 Storage-Level Resilience 
4.3.3 Compute-Node Resilience 
4.3.4 Methods for Quick Recovery of High Urgency Workloads 
4.4 Failover Verification 
5 Implementation Considerations for Operational Continuity 
5.1 Lifecycle Automation 
5.2 Backup and Long-Term Data Recovery 
5.3 Integrating the New Platform into Operational Workflows 
5.4 Operational Evidence of Resilience 
6 Results and Discussion 
6.1 Overall Viability of the Target Architecture 
6.2 Control Plane Placement 
6.3 Storage and the Role of the Witness 
6.4 Recovery Semantics 
6.5 Network Core and Failure Profile 
6.6 Automation Boundaries that Fit the Stack 
6.7 Data Protection Beyond Replication 
6.8 Limitations 
7 Conclusions 
References 

  

Figures 
 
Figure 1. Comparison of virtualization execution methods 
Figure 2. Traditional vs. cloud-native virtualization 
Figure 3. Sources of latency and mitigation paths in virtualization 
Figure 4. Current OM690 high-level overview 
Figure 5. Byte order mismatch and conceptual workflow for endian conversion 
Figure 6. High-level core architecture of the conceptual virtualized platform 
Figure 7. Proposed dual vPC topology with redundancy 
Figure 8. Metro-Volume storage diagram (Dell Technologies, 2024) 
Figure 9. Illustrative Infrastructure-as-Code Pipeline for Lifecycle Management 

 
Tables 
 
Table 1. etcd quorum majority and failure tolerance 
Table 2. Failover verification framework 

 
Abbreviations 
 
API   Application Programming Interface 

CI/CD   Continuous Integration / Continuous Delivery 

CNI   Container Network Interface, Kubernetes networking plugin model  

CRD   Custom Resource Definition, extends the Kubernetes API 

CSI   Container Storage Interface standard for Kubernetes storage 

DM‑Multipath  Device‑Mapper Multipath, Linux multipathing for storage 

DR   Disaster Recovery, restore service after large disruptions 

DS   Diagnostic Server 

EPLC   Enhanced Passive Listening Compatibility 

ES   Engineering Station 

ESXi   VMware ESXi (type‑1 hypervisor) 

etcd    Distributed key‑value store used by Kubernetes control plane 

FAR   Fence Agents Remediation (OpenShift fencing operator) 

FT   Fault Tolerance, continue service transparently through faults 

GLB   Global Load Balancer (steers traffic/site failover) 

GUI   Graphical User Interface 



HA   High Availability (minimize downtime via redundancy/failover) 

HBA   Host Bus Adapter (storage interface on hosts) 

HCL   HashiCorp Configuration Language (Terraform) 

HMI   Human-Machine Interface (operator displays/screens) 

HRP   High‑Speed Redundancy Protocol (Siemens ring redundancy) 

IaC   Infrastructure as Code 

IEC   International Electrotechnical Commission (publisher of standards) 

IOPS   Input/Output Operations Per Second 

IP   Internet Protocol 

iSCSI   Internet Small Computer Systems Interface (block storage over IP) 

I&C   Instrumentation and Control 

KVM   Kernel‑based Virtual Machine (Linux hypervisor) 

LACP   Link Aggregation Control Protocol 

LTO‑9   Linear Tape‑Open, generation 9 (tape storage) 

LUN   Logical Unit Number (block storage identifier) 

MAC   Media Access Control (link‑layer address) 

MLAG   Multi‑Chassis Link Aggregation (active/active across two switches) 

NIC   Network Interface Card  

NX‑OS   Operating System of Cisco Nexus Switches 

OCP   Red Hat OpenShift Container Platform 

OL3   Olkiluoto 3 nuclear power plant unit 

OM1   Multimode optical fibre grade (used by the plant ring) 

OM690  OL3 automation system (target system being modernized) 

OOB   Out‑of‑Band (separate management path) 

OT   Operator Terminal 

OVN   Open Virtual Network 

PU   Processing Unit 

PVC   PersistentVolumeClaim (Kubernetes storage object) 

QEMU   Quick EMUlator (emulator/virtualizer used with KVM) 

RBAC   Role‑Based Access Control 

RHCOS  Red Hat Enterprise Linux CoreOS (OpenShift node OS) 

RHEL   Red Hat Enterprise Linux 

RPO   Recovery Point Objective 

RSTP   Rapid Spanning Tree Protocol 

RTO   Recovery Time Objective 

RTT   Round‑Trip Time (latency metric) 

RWO   ReadWriteOnce (Kubernetes volume access mode) 

RWX   ReadWriteMany (Kubernetes volume access mode) 

SCC   Security Context Constraint (OpenShift security primitive) 

SDN   Software‑Defined Networking 



SFS   Finnish Standards Association 

SIEM   Security Information and Event Management 

SNR   Self Node Remediation (OpenShift remediation operator) 

SPARC   Scalable Processor Architecture (legacy CPU platform) 

SR‑IOV  Single Root I/O Virtualization (NIC virtualization) 

SSH   Secure Shell (remote management protocol) 

STP   Spanning Tree Protocol 

STUK   Radiation and Nuclear Safety Authority 

SU   Server Unit 

TVO   Teollisuuden Voima Oyj (OL3 operator) 

UDN   User Defined Networking  

vCenter  VMware vCenter Server (central VM management) 

VLAN   Virtual Local Area Network 

VM   Virtual Machine 

VMI   VirtualMachineInstance (KubeVirt runtime object) 

vPC   virtual PortChannel (Cisco) 

vSphere  VMware vSphere (virtualization suite) 

WSUS   Windows Server Update Services 

XU   External Unit 

YAML    “YAML Ain’t Markup Language” (data serialization format) 

YVL    Finnish nuclear regulatory guide series  

  

1 Introduction 

Industrial control systems in safety-critical sectors like nuclear power are defined by long operational lifespans, which inevitably produce lifecycle challenges as underlying hardware and software age. The OM690 automation system at the Olkiluoto 3 (OL3) nuclear power plant, which is part of the plant’s process information and control system, is now facing these challenges. It currently relies on an aging set of SPARC-based servers running a deprecated operating system. This creates operational risks including hardware failure, loss of vendor support, and scarce replacement parts. A project to modernize OM690 through virtualization is therefore being planned. 

 
The constraints of the legacy platform are both technical and architectural. Tight coupling of software to specific hardware prevents modern redundancy patterns, and the proprietary SPARC stack raises migration hurdles. Fragility is most acute in disaster recovery. With original server models out of support, even well-maintained backups are of limited use without compatible hardware. These factors call for a strategy that goes beyond hardware replacement toward a resilient and maintainable system architecture. 

 
The purpose of this thesis is to determine whether a virtualization-based modernization blueprint for OM690 can meet nuclear-grade continuity and long-term maintainability requirements at OL3, and to provide recommendations for the project. The evaluation is preliminary and analytical, focusing on architectural suitability, single-fault tolerance, cybersecurity posture, and lifecycle sustainability in an air-gapped environment. Implementation, performance benchmarking, cost analysis, and regulator-witnessed validation are out of scope. 

 
This thesis assesses a proposed reference architecture for virtualizing OM690 on Red Hat OpenShift and justifies its suitability through analytical methods. These methods include standards tracing, failure-mode reasoning, and feasibility checks. The thesis does not include the design of the proposed architecture; rather, its contribution is an independent, standards-grounded appraisal and a set of recommendations for OL3. 



 
Chapter 2 summarizes theoretical background in virtualization, high availability, and automation. Chapter 3 analyzes the legacy OM690 system and its constraints. Chapter 4 evaluates a multi-layer resilient architecture and performs failure-mode reasoning across network, storage, and compute domains. Chapter 5 discusses implementation considerations for operational continuity. These include lifecycle automation, data protection, operational workflows, observability, and a framework for sustaining the platform long-term. Chapter 6 presents and discusses the results, and Chapter 7 concludes the thesis. 

 
1.1 Methodology and Scope 

The thesis is an analytical, evaluative case study of a proposed architecture for OM690. The design is assessed via: 

• Feasibility and support boundaries: checking behaviors and constraints of OpenShift/KubeVirt and related components against the proposed design. 

• Lifecycle sustainability: examining automation, backup/recovery, observability, and long-term lifecycle governance. 

• Requirements coverage: mapping requirements to concrete architectural mechanisms. 

• Failure-mode reasoning: analyzing how the network, storage, and compute layers behave under realistic faults. 
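
The requirements-coverage step can be pictured as a small data structure. The following is a minimal sketch only, assuming a hypothetical `Requirement` record and an `uncovered` helper that are not part of the thesis; the entries paraphrase the abstract, and one requirement is deliberately left unmapped to show how a gap would surface.

```python
from dataclasses import dataclass, field


@dataclass
class Requirement:
    """One requirement and the mechanisms claimed to satisfy it (hypothetical record)."""
    name: str
    mechanisms: list[str] = field(default_factory=list)


def uncovered(requirements: list[Requirement]) -> list[str]:
    """Return the names of requirements that have no mapped mechanism."""
    return [r.name for r in requirements if not r.mechanisms]


# Illustrative entries only; the wording paraphrases the thesis abstract.
requirements = [
    Requirement("single-fault tolerance", [
        "segmented network with room-level isolation",
        "synchronous storage replication with witness",
        "recovery sequencing for control plane and workloads",
    ]),
    Requirement("long-term maintainability", [
        "controlled lifecycle automation",
        "tiered backups",
    ]),
    Requirement("cybersecurity"),  # left unmapped on purpose for the demonstration
]

print(uncovered(requirements))
```

A mapping of this kind makes missing coverage visible early; it says nothing about whether a listed mechanism is sufficient, which is what the failure-mode reasoning addresses.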

The central research question is: Is the proposed virtualization architecture for OM690 at OL3 viable with respect to long-term maintainability, single-fault tolerance, and cybersecurity, and what architectural design principles and regulatory considerations are required to achieve that viability? 

Evidence is drawn from standards, vendor documentation, and technical literature. Where hard metrics are unavailable pre-deployment, the thesis specifies what should be measured and how during future validation. The thesis contributes an evidence-based justification of the proposed architecture and a set of operational recommendations for sustaining it. 



2 Theoretical Background 

This chapter establishes the theoretical foundation for the thesis. It introduces core virtualization concepts, compares platforms and deployment models, and outlines key considerations related to high availability, real-time performance, and fault tolerance. Additionally, it explores automation tools and system architecture patterns useful in managing modernized industrial systems. The foundation supports the technical analysis developed in the later chapters. 

 
2.1 Introduction to Virtualization 

Industrial automation and control systems have historically relied on discrete embedded devices (e.g., PLCs) for sensing, actuation and control, which are typically complemented by supervisory servers and stations. As plants have grown in complexity under the Industry 4.0 umbrella, the demand for flexibility, scalability and resilience has risen along with IT-side softwarization trends like virtualization and containerization (Queiroz et al., 2023, pp. 1–3). 

 
Virtualization abstracts hardware so that multiple operating systems or applications can run concurrently on the same host. This typically happens under a hypervisor (type-1 or type-2) or, at the OS level, under container engines. Classic distinctions include full virtualization versus para-virtualization, but more recently, OS-based (container) virtualization has matured alongside these models (Lee et al., 2019, p. 2). In industrial contexts, these mechanisms are becoming increasingly relevant as a strategic response to aging infrastructure and lifecycle constraints, because they enable consolidation and decouple software stacks from specific hardware platforms (Queiroz et al., 2023, pp. 2–3). 

 
Reported advantages include hardware independence and improved manageability via centralized control, faster provisioning and snapshotting, as well as workload isolation that supports modular software lifecycle practices (Queiroz et al., 2023, pp. 2–3). At the same time, networking for virtualized industrial systems must preserve functional transparency and real-time guarantees. This is an area where existing research notes challenges when bridging legacy fieldbuses and Ethernet-centric stacks within virtualized environments (Lee et al., 2019, p. 2). 

 
Several limitations are also relevant to virtualization in safety-critical environments. General-purpose schedulers and container runtimes are not designed for hard real-time constraints, so achieving determinism often requires specific techniques like CPU pinning, real-time kernels (e.g., PREEMPT_RT) or co-kernel approaches (Queiroz et al., 2023, p. 15). Even lightweight containers can exhibit latency and jitter that must be characterized and mitigated for time-sensitive control paths (Queiroz et al., 2023, p. 8). The shift to virtual networking and storage also introduces integration work to maintain isolation, quality-of-service and predictable I/O behavior for legacy protocols (Lee et al., 2019, p. 2). 
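
To make "characterizing latency and jitter" concrete, the sketch below measures how late a periodic sleep wakes up on a general-purpose system. It is an illustration under stated assumptions, not a method from the thesis: the function name `measure_wakeup_jitter`, the 1 ms period, and the sample count are hypothetical choices, and a Python loop only hints at the behavior that dedicated real-time test tools would quantify properly.

```python
import statistics
import time


def measure_wakeup_jitter(period_s: float = 0.001, samples: int = 200) -> dict:
    """Sleep for a fixed period repeatedly and record how late each wake-up is."""
    lateness_us = []
    for _ in range(samples):
        start = time.perf_counter()
        time.sleep(period_s)
        elapsed = time.perf_counter() - start
        # Lateness is the time spent beyond the requested period, in microseconds.
        lateness_us.append((elapsed - period_s) * 1e6)
    lateness_us.sort()
    return {
        "samples": samples,
        "mean_us": statistics.fmean(lateness_us),
        "p99_us": lateness_us[int(0.99 * (samples - 1))],
        "max_us": lateness_us[-1],
    }


if __name__ == "__main__":
    print(measure_wakeup_jitter())
```

Running the same loop on bare metal, inside a VM, and inside a container on a loaded host would give a rough first impression of the gap between typical and worst-case wake-up delay, which is the quantity that matters for time-sensitive control paths.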

 
In this thesis, virtualization is treated as the foundational abstraction that enables decoupling from aging hardware while supporting managed redundancy and lifecycle controls. The later analytical chapters apply these general properties to the OM690 context and account for its constraints. 

 
2.2 Types of Virtualization and Hypervisors 

Virtualization in server systems is commonly realized in two ways. The first is hardware-level virtualization with a hypervisor, where each virtual machine runs its own guest operating system and application stack on virtual hardware that the hypervisor exposes (Queiroz et al., 2023, p. 4; Sharma et al., 2016, p. 2). Hypervisors are usually classed as type 1 when they run directly on the host and type 2 when they run on top of a conventional operating system. Type 1 designs keep the stack thin and are a common choice for production, whereas type 2 designs are often used in development and testing where convenience matters more than strict determinism (Queiroz et al., 2023, p. 4; Sharma et al., 2016, p. 2). The second is operating-system-level virtualization with containers, where applications run as isolated processes that share a single kernel through namespaces and control groups (Queiroz et al., 2023, p. 6; Sharma et al., 2016, p. 2). Figure 1 illustrates the difference between these virtualization models. 

 
Figure 1. Comparison of virtualization execution methods 

 
On bare metal, operating systems and applications run directly on physical servers, which makes it the performance benchmark by eliminating virtualization overhead and scheduler indirection (Giallorenzo et al., 2021, p. 12). Containers tend to start fast and carry low overhead because they avoid a separate guest kernel (Sharma et al., 2016, p. 7). The trade-off is a weaker isolation boundary than a full hypervisor provides, which is an important difference when workloads have strict isolation or predictability needs (Queiroz et al., 2023, pp. 4–6; Sharma et al., 2016, p. 12). 

 
Performance studies show a mixed picture that depends on the workload and the resource under pressure. Containers can approach bare-metal results for CPU and memory tasks (Giallorenzo et al., 2021, pp. 12–14; Lin et al., 2018, p. 5). Full system virtualization adds some overhead, which is more visible in I/O paths (Giallorenzo et al., 2021, pp. 8, 13–14). VM disk and network I/O often degrade sooner than CPU (Giallorenzo et al., 2021, pp. 8, 13). Containers can also suffer under contention or copy-on-write paths, and large co-location tests report higher cross-workload interference in some container profiles because the kernel is shared (Sharma et al., 2016, p. 12; Lin et al., 2018, p. 5). 



 
Security and isolation line up with those mechanics. VMs benefit from a distinct guest kernel and a hardened hypervisor boundary, which can help when workloads are untrusted or must be strongly separated. Containers gain efficiency from the shared kernel but expand the blast radius of kernel-level defects. Strong platform hardening and policy are therefore preconditions for containerized services with elevated assurance needs. Bare metal still provides physical isolation when a host is dedicated to one role (Queiroz et al., 2023, pp. 4–6; Sharma et al., 2016, p. 12). 

 
Resource efficiency also differs. Containers usually reach the highest packing density because there is no guest operating system per workload (Queiroz et al., 2023, p. 5; Sharma et al., 2016, p. 12). VMs still consolidate better than one-app-per-server bare metal, though the guest footprint caps peak density relative to containers (Sharma et al., 2016, p. 2). In Kubernetes-native stacks such as OpenShift Virtualization, running VMs alongside containers can improve overall utilization because one scheduler pools compute for both models (Red Hat, 2025a, pp. 232–236). 

Management follows the same pattern. Bare-metal estates often require manual lifecycle work, while VM platforms add image-level operations. In OpenShift Virtualization the GUI and the Kubernetes APIs allow declarative definitions and image workflows such as the Containerized Data Importer, while patching inside the guest remains the VM owner’s task (Red Hat, 2025a, pp. 232–236, 346–349). Containers move further toward image-based delivery and API-driven rollout and rollback, which streamlines changes but adds responsibility for the container platform itself (Lin et al., 2018, p. 3; Sharma et al., 2016, p. 9). 

 
High availability (HA) follows from these building blocks. On bare metal, HA often relies on external clustering frameworks that handle fencing and shared-storage orchestration. Under OpenShift Virtualization, a failed node leads the scheduler to restart the VM on a healthy node and reattach its persistent storage (Red Hat, 2025a, pp. 354–361), and live migration is available for planned maintenance and balancing (Red Hat, 2025a, pp. 347–349). In container workloads, controllers replace failed instances quickly (Šimon et al., 2023, p. 12), while stateful failover and in-place moves need careful storage and identity design compared with VM semantics (Sharma et al., 2016, p. 12). 

Later chapters use these distinctions when discussing OM690. Safety-adjacent roles favor VM isolation and predictable restart behavior, while some supporting services can use containers under the same platform when the risk profile and dependencies allow it (Šimon et al., 2023, p. 13). 
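
The restart-on-failure behavior described above can be sketched as a toy placement model. This is a simplified illustration and not the OpenShift scheduler: the `reschedule` function, the slot-based capacity model, and the node and VM names are all hypothetical, and fencing and storage reattachment are left out entirely.

```python
def reschedule(placement: dict[str, str], capacity: dict[str, int],
               failed_node: str) -> dict[str, str]:
    """Move VMs off a failed node onto healthy nodes with free slots.

    placement maps VM name -> node name; capacity maps node -> max VM count.
    VMs that cannot be placed anywhere are mapped to "Pending".
    """
    healthy = {n: c for n, c in capacity.items() if n != failed_node}
    load = {n: 0 for n in healthy}
    result = {}
    # VMs already running on healthy nodes stay where they are.
    for vm, node in placement.items():
        if node != failed_node:
            result[vm] = node
            load[node] += 1
    # Displaced VMs restart on the least-loaded healthy node that still has room.
    for vm, node in sorted(placement.items()):
        if node == failed_node:
            candidates = [n for n in healthy if load[n] < healthy[n]]
            if candidates:
                target = min(candidates, key=lambda n: (load[n], n))
                result[vm] = target
                load[target] += 1
            else:
                result[vm] = "Pending"
    return result


placement = {"vm-a": "node-1", "vm-b": "node-1", "vm-c": "node-2"}
capacity = {"node-1": 2, "node-2": 2, "node-3": 2}
print(reschedule(placement, capacity, failed_node="node-1"))
```

Even this toy version shows why spare capacity matters: if the surviving nodes have no free slots, displaced VMs remain pending rather than restarting.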

 
2.3 Traditional vs. Cloud-Native Virtualization 

Industrial virtualization platforms broadly fall into two paradigms: traditional hypervisor-centric stacks and cloud-native orchestration frameworks. Understanding their architectural philosophies helps when assessing modernization options. 

 
Traditional platforms such as VMware vSphere follow a VM-centric model built around a proprietary type-1 hypervisor. The hypervisor is installed directly on server hardware, with centralized management through a toolchain such as vCenter Server (Faddom, 2023). This approach emphasizes stability, feature maturity, and established enterprise workflows (Faddom, 2023). 

 
Cloud-native platforms such as Red Hat OpenShift, by contrast, adopt declarative, API-driven orchestration. In this model, VM lifecycle management is integrated into the Kubernetes control plane via KubeVirt, where VMs are represented as custom resources and scheduled like pods (Red Hat, 2024a; Red Hat, 2025a, pp. 57–62). This enables a single control plane for both VMs and containers. As a result, operational practices center on automation, composability, and policy-driven management (Red Hat, 2025a, pp. 25, 33). 

 
A key architectural distinction is the separation versus unification of control planes. vSphere separates VM management from other application orchestration systems, whereas OpenShift Virtualization operates VMs and containers under one orchestrator. Figure 2 illustrates this difference by contrasting a vCenter/ESXi stack with an OpenShift/KubeVirt stack that unifies VM and container control under Kubernetes (Red Hat, 2024a; Red Hat, 2025a, p. 25). 

 
Figure 2. Traditional vs. cloud-native virtualization 

 
Security and isolation are enforced differently as well. Traditional hypervisors provide isolation at the VM boundary with minimal kernel sharing. Kubernetes-native virtualization retains VM isolation while introducing shared orchestration components that must themselves be secured and audited (Red Hat, 2025a, p. 38). This operational consideration is reflected in OpenShift guidance on platform hardening and access control. 

 
High availability and failover patterns also differ. Traditional stacks commonly rely on dedicated HA modules and hypervisor clustering (Faddom, 2023), whereas Kubernetes-native platforms achieve continuity through declarative scheduling, rescheduling on failure, and storage abstraction (e.g., PVCs and live-migration support in KubeVirt) (Red Hat, 2025a, pp. 29, 62, 359–374). 

 
Finally, traditional platforms lean on GUI-centric operations and fixed administrative workflows for lifecycle management (Faddom, 2023). Cloud-native platforms focus on infrastructure-as-code, CI/CD, and role-based automation through APIs and YAML descriptors (Red Hat, 2025a, pp. 25, 33). 

 
In short, the traditional model emphasizes separation of concerns and mature, VM-centric tooling, while the cloud-native model prioritizes unified orchestration, automation, and scalability. These differences frame the set of trade-offs for OM690, where a unified control plane may reduce operational fragmentation without discarding the VM isolation properties required for critical workloads. 

 
2.4 Red Hat OpenShift Virtualization 

Red Hat OpenShift Virtualization is an add-on capability of the OpenShift Container Platform (OCP). It enables Kubernetes to define and run virtual machines alongside containers, often described as container-native virtualization (Red Hat, 2024a). The feature is implemented through KubeVirt, which extends the Kubernetes API with virtualization-specific Custom Resource Definitions (CRDs), including VirtualMachineInstance (VMI) for runtime and VirtualMachine (VM) for lifecycle management (Red Hat, 2024a; Red Hat, 2025a, pp. 57–58). VMs are launched as pods on OpenShift worker nodes using the KVM hypervisor. For nodes that host VMs, Red Hat Enterprise Linux CoreOS (RHCOS) is the supported host operating system (Red Hat, 2024a; Red Hat, 2025a, p. 38). 
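
As an illustration of how a VM becomes a declarative Kubernetes resource, the sketch below builds a VirtualMachine manifest as a plain Python dictionary. It is a minimal example under assumptions: the helper `virtual_machine`, the VM name, and the PVC name are hypothetical, and the field layout follows the upstream KubeVirt `kubevirt.io/v1` API as commonly documented, so it should be checked against the platform version actually in use.

```python
import json


def virtual_machine(name: str, memory: str, cores: int, pvc_name: str) -> dict:
    """Build a KubeVirt VirtualMachine manifest as a plain dictionary."""
    return {
        "apiVersion": "kubevirt.io/v1",
        "kind": "VirtualMachine",
        "metadata": {"name": name},
        "spec": {
            # Ask the platform to keep the VM running and restart it after failures.
            "runStrategy": "Always",
            "template": {
                "spec": {
                    "domain": {
                        "cpu": {"cores": cores},
                        "resources": {"requests": {"memory": memory}},
                        "devices": {
                            "disks": [{"name": "rootdisk", "disk": {"bus": "virtio"}}],
                        },
                    },
                    # The root disk is backed by an existing PersistentVolumeClaim.
                    "volumes": [{
                        "name": "rootdisk",
                        "persistentVolumeClaim": {"claimName": pvc_name},
                    }],
                },
            },
        },
    }


# Hypothetical names, chosen only for this example.
vm = virtual_machine("om690-example", memory="8Gi", cores=4, pvc_name="om690-example-root")
print(json.dumps(vm, indent=2))
```

The VirtualMachine object carries the desired lifecycle state, while the running instance is represented by a separate VirtualMachineInstance that the platform creates from the template.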

 
Red Hat Enterprise Linux (RHEL) is the general-purpose enterprise OS used inside many VMs and on traditional servers. OpenShift nodes that host VMs run Red Hat Enterprise Linux CoreOS (RHCOS), which is an image-based, cluster-managed variant of RHEL maintained by OpenShift’s Machine Config Operator (Red Hat, 2025f, p. 25; Red Hat, 2024a). In this thesis, RHEL therefore refers to guest operating systems inside VMs or on non-cluster hosts, and RHCOS refers to the node OS for the OpenShift control plane and VM-hosting workers. 

 
Because it is Kubernetes-native, OpenShift Virtualization integrates with the broader ecosystem used by OCP: storage via the Container Storage Interface (CSI), advanced networking via Multus for multiple network attachments, and platform-level security and policy through standard Kubernetes constructs, while platform services like observability are surfaced through the same control plane (Red Hat, 2025a, pp. 25, 310–312, 342). In addition, the platform provides a practical bridge for legacy VM workloads, which allows them to be hosted without immediate refactoring while a gradual transition to containerized components can be evaluated (Red Hat, 2024a). 

 
For the OM690 context, this combination of VMs and containers managed under one orchestrator offers a solid foundation that subsequent chapters evaluate in terms of resilience, operability, and lifecycle governance. 

 
2.5 Real-time and Safety-Critical Considerations for Virtualization 

Virtualization technologies, including container-native approaches like OpenShift Virtualization, offer significant benefits for system modernization. However, their application within industrial safety-critical environments like nuclear power plants demands consideration of their impact on real-time performance, because timing predictability, communication latency, and execution determinism remain crucial for control loops and safety functions (Queiroz et al., 2023, p. 7). Standard virtualization introduces layers of software abstraction and resource sharing that were not primarily designed for hard real-time operation. General-purpose virtualization therefore does not provide hard real-time guarantees by default, which presents challenges for adoption in time-sensitive industrial systems (Aqasizade et al., 2024, p. 1; Queiroz et al., 2023, p. 1). 

 
The introduction of virtualization layers alters the execution environment compared to 

bare metal. Factors contributing to increased latency and jitter include the host OS or 

hypervisor scheduler managing CPU access, which can introduce delays and unpredicta-

ble prioritization (Queiroz et al., 2023, p. 9), and the additional processing required for 

I/O operations traversing virtualization pathways (Queiroz et al., 2023, p. 10). Further-

more, the sharing of resources among multiple VMs or containers running on the same 

nodes can lead to contention and performance interference (Queiroz et al., 2023, p. 10; 



Sharma et al., 2016, p. 1). These challenges are particularly acute for real-time industrial 

networks requiring guaranteed message timing and low latency (Lee et al., 2019, p. 1). 

 
To mitigate these challenges, various techniques focus on enhancing the real-time capa-

bilities of the underlying Linux and KVM environment. Figure 3 visualizes the primary 

sources of latency across layers. At the host OS level, employing real-time Linux kernel 

variants or specific tunings can improve scheduling predictability (Queiroz et al., 2023, p. 

15). Within OpenShift, the Node Tuning Operator and Performance Profiles make it possible to

apply these low-latency tunings, manage CPU affinity, and reserve resources on specific

nodes (Red Hat, 2025d, Chapter 5.4). 

 
Figure 3. Sources of latency and mitigation paths in virtualization 

 
For individual virtual machines managed by OpenShift Virtualization, resource guaran-

tees can be configured via their Kubernetes resource definitions. This includes dedicating 

specific CPU cores (CPU pinning) and configuring large memory pages (huge pages) to 

minimize latency and improve predictability (Red Hat, 2025a, pp. 279, 288). These fea-

tures leverage underlying Linux control groups and CPU sets to isolate workloads (Quei-

roz et al., 2023, p. 16; Sharma et al., 2016, p. 5). Additionally, careful network design, 

potentially using Single Root I/O Virtualization (SR-IOV) or dedicated network interfaces 

managed via Multus CNI, alongside coordinated task scheduling within applications, can 



help manage end-to-end communication delays (Lee et al., 2019, p. 8; Red Hat, 2025a, 

p. 316). 
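
As an illustration only, the resource guarantees described above could be expressed in a VirtualMachine definition along the following lines, written here as a Python dictionary so that it can be rendered to JSON or YAML. The VM name and sizing are hypothetical, and the field names (dedicatedCpuPlacement, hugepages.pageSize) are assumed to match the KubeVirt API version in use; they should be verified against the deployed OpenShift Virtualization release.

```python
import json

# Hypothetical VM name and sizing; field names are assumed to follow the
# KubeVirt VirtualMachine API and must be checked against the target release.
vm_manifest = {
    "apiVersion": "kubevirt.io/v1",
    "kind": "VirtualMachine",
    "metadata": {"name": "om690-example-vm"},
    "spec": {
        "runStrategy": "Always",
        "template": {
            "spec": {
                "domain": {
                    "cpu": {
                        "cores": 4,
                        # Request exclusive host cores (CPU pinning).
                        "dedicatedCpuPlacement": True,
                    },
                    "memory": {
                        # Back guest memory with huge pages.
                        "hugepages": {"pageSize": "1Gi"},
                    },
                    "resources": {"requests": {"memory": "8Gi"}},
                    "devices": {},
                },
            },
        },
    },
}

# Serialize for review or for submission through a cluster client.
manifest_json = json.dumps(vm_manifest, indent=2)
```

Such a definition typically only has the intended effect on nodes that have been prepared for it, for example through a performance profile that reserves CPUs and pre-allocates huge pages.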

 
In safety-critical domains such as OL3, non-deterministic timing is a safety concern rather 

than a performance issue. Use of virtualization in safety-related functions should there-

fore apply the necessary mitigations and verify timing behavior under normal and 

faulted conditions, demonstrating bounded latency/jitter for the relevant control cycles 

and deterministic recovery for defined failure modes (Queiroz et al., 2023, p. 2). These 

expectations are consistent with nuclear control standards and are treated in this thesis 

as validation requirements. 
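
A minimal sketch of how such a timing check could be evaluated is given below. The function name, the sample values, and the 10 ms and 2 ms limits are illustrative assumptions rather than OM690 requirements, and jitter is taken here simply as the spread between the largest and smallest observed latency.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class TimingVerdict:
    max_latency_ms: float
    jitter_ms: float
    within_bounds: bool


def check_timing(samples_ms: Sequence[float],
                 latency_bound_ms: float,
                 jitter_bound_ms: float) -> TimingVerdict:
    """Compare worst-case latency and peak-to-peak jitter against bounds."""
    if not samples_ms:
        raise ValueError("at least one latency sample is required")
    worst = max(samples_ms)
    jitter = worst - min(samples_ms)
    ok = worst <= latency_bound_ms and jitter <= jitter_bound_ms
    return TimingVerdict(worst, jitter, ok)


# Hypothetical measurements (ms) taken under normal and faulted conditions.
normal = [4.0, 4.5, 5.0, 4.25]
faulted = [4.0, 9.5, 12.0, 4.5]

normal_verdict = check_timing(normal, latency_bound_ms=10.0, jitter_bound_ms=2.0)
faulted_verdict = check_timing(faulted, latency_bound_ms=10.0, jitter_bound_ms=2.0)
```

The check judges the worst observed sample rather than the average, because a bounded-latency claim is broken by a single outlier.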

 
2.6 Resilience and System Architecture Concepts 

System resilience typically rests on three related ideas: high availability (HA), fault toler-

ance (FT), and disaster recovery (DR). HA seeks to minimize downtime through redun-

dancy and rapid failover, FT masks faults so service continues transparently, and DR re-

stores service after larger-scale disruption (e.g., site loss) (Luca, 2024, pp. 2–6). All three 

are implemented by eliminating single points of failure and by adding controlled mech-

anisms for detection, isolation, and recovery (Luca, 2024, pp. 3, 12). 

 
Within OpenShift, HA is achieved when control plane quorum and failure‑domain place-

ment rules are respected. The distributed control plane (API servers and etcd) and the 

scheduler’s ability to reschedule workloads to healthy nodes provide the platform be-

havior. For virtual machines managed by OpenShift Virtualization, this means a failed 

node triggers a cold restart of the VM pod on another node, assuming shared or re-attachable

storage (Red Hat, 2025a, pp. 29, 359–374). For multi-site strategies, 

OpenShift supports either a stretched cluster (one logical cluster across rooms/sites) or 

independent clusters. The latter avoids cross-site quorum/latency pitfalls at the cost of 

more operational complexity (Gurijala & Sullivan, 2022). 
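
The quorum pitfall of a stretched cluster can be made concrete with simple majority arithmetic. The sketch below is a generic model of a majority-based voting group such as etcd; the member placements are illustrative assumptions and do not describe the planned OL3 layout.

```python
from typing import Mapping


def quorum_size(members: int) -> int:
    """Smallest majority of a voting group with the given member count."""
    if members < 1:
        raise ValueError("a voting group needs at least one member")
    return members // 2 + 1


def tolerated_failures(members: int) -> int:
    """How many members can fail while a majority remains."""
    return members - quorum_size(members)


def survives_site_loss(members_per_site: Mapping[str, int], lost_site: str) -> bool:
    """True if the remaining sites still hold a majority of all members."""
    total = sum(members_per_site.values())
    remaining = total - members_per_site[lost_site]
    return remaining >= quorum_size(total)


# Hypothetical placements of three voting members.
two_rooms = {"room_a": 2, "room_b": 1}
three_domains = {"room_a": 1, "room_b": 1, "third_domain": 1}

loses_quorum = not survives_site_loss(two_rooms, "room_a")
keeps_quorum = all(survives_site_loss(three_domains, site) for site in three_domains)
```

With members in only two rooms, one room always holds the majority, so losing that room halts the group. Spreading members over three fault domains removes that asymmetry.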

 

Modern datacenter networks typically remove single points of failure by making links 

and even entire switch chassis redundant. Multi-Chassis Link Aggregation (MLAG) ex-

poses an active-active port-channel across two upstream switches so a downstream host 

can lose a link or a whole switch without losing connectivity. Cisco’s implementation, 

virtual PortChannel (vPC), presents two Nexus switches as one logical device to the host 

(Cisco, 2024). A vPC peer-link carries state synchronization, while a separate peer-

keepalive distinguishes a peer-link failure from a switch failure to prevent split-brain 

(Cisco, 2024, pp. 255, 284). At the host edge, LACP with suspend-individual helps avoid

black-holing by automatically disabling out-of-sync members (Cisco, 2024, p. 

238). 

 
Siemens High-Speed Redundancy Protocol (HRP) is designed to provide rapid recovery 

on ring network topologies (used at OL3). When attaching such rings to an Ethernet core, 

Enhanced Passive Listening Compatibility (EPLC) relays spanning-tree changes without 

running STP on ring ports (Siemens, 2018). This preserves the ring’s deterministic behav-

ior while keeping the wider network loop-free (Siemens, 2018). 

 
In OpenShift, segmentation and attachment to plant VLANs are modeled explicitly: Mul-

tus CNI allows secondary network interfaces for pods/VMs so traffic can be bound to the 

correct broadcast domain/VLAN (Red Hat, 2025a, p. 346). User‑Defined Networking, dis-

cussed later, can reference those Multus attachments for reusable connectivity. 

 
For stateful workloads, storage must be available across failures and consistent across 

sites. At room/site scale, synchronous replication provides Recovery Point Objective 

(RPO) of zero by committing each write on both sides before acknowledging to the host 

(Avrillier, 2023). In active-active metro designs, a witness in a third fault domain arbi-

trates during partitions so only one side remains writable, preventing split-brain and data 

divergence (Itzikr, 2023). 
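
The two mechanisms in this paragraph can be sketched as a toy model. It is a deliberate simplification and not the behavior of any specific storage array: the class and function names are hypothetical, and the fixed preference for side "a" when both sides reach the witness is an assumption.

```python
from typing import List, Optional


class Replica:
    """One side of a mirrored volume; 'log' stands in for committed blocks."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.log: List[bytes] = []
        self.reachable = True

    def commit(self, block: bytes) -> None:
        self.log.append(block)


def synchronous_write(local: Replica, remote: Replica, block: bytes) -> bool:
    """Acknowledge the host only if both sides committed the block."""
    if not (local.reachable and remote.reachable):
        return False  # nothing is acknowledged, so no acknowledged data is lost
    local.commit(block)
    remote.commit(block)
    return True


def writable_side(a_reaches_witness: bool, b_reaches_witness: bool) -> Optional[str]:
    """During an inter-site partition, let at most one side stay writable."""
    if a_reaches_witness:
        return "a"  # assumed fixed preference if both sides reach the witness
    if b_reaches_witness:
        return "b"
    return None     # neither side reaches the witness: both stop writing


room_a, room_b = Replica("room_a"), Replica("room_b")
acked = synchronous_write(room_a, room_b, b"block-1")
```

Because an acknowledgement is only returned after both commits, every acknowledged block exists on both sides, which is what an RPO of zero means in this model.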

 

At the host level, DM-Multipath maintains multiple paths (HBAs/switches/array ports) to 

the same logical unit number and reroutes transparently on path failure. Common poli-

cies include round-robin path selection, while queueing and path checker settings gov-

ern behavior during transient loss (Red Hat, 2025e, pp. 21–22). OpenShift Virtualization 

consumes such storage via PersistentVolumeClaims, and live migration is supported 

when storage is accessible to both source and destination nodes, while cross-site restarts 

depend on the replication layer’s failover semantics (Red Hat, 2025a, pp. 62, 359–374). 
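
The round-robin policy with transparent rerouting can be illustrated with a small model. This is a conceptual sketch, not DM-Multipath code or configuration; the class name and path labels are hypothetical, and where the real subsystem can queue I/O during a total path loss, the sketch simply raises an error.

```python
from typing import Dict, List


class MultipathDevice:
    """Toy model of round-robin path selection that skips failed paths."""

    def __init__(self, paths: List[str]) -> None:
        if not paths:
            raise ValueError("at least one path is required")
        self._paths = list(paths)
        self._healthy: Dict[str, bool] = {p: True for p in self._paths}
        self._cursor = 0

    def mark_failed(self, path: str) -> None:
        self._healthy[path] = False

    def mark_restored(self, path: str) -> None:
        self._healthy[path] = True

    def select_path(self) -> str:
        """Return the next healthy path in round-robin order."""
        count = len(self._paths)
        for offset in range(count):
            index = (self._cursor + offset) % count
            candidate = self._paths[index]
            if self._healthy[candidate]:
                self._cursor = (index + 1) % count
                return candidate
        raise RuntimeError("no healthy path to the logical unit")


device = MultipathDevice(["path_a", "path_b"])
first_three = [device.select_path() for _ in range(3)]
device.mark_failed("path_b")
after_failure = [device.select_path() for _ in range(2)]
```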

 
Resilience also depends on scope-limiting faults and traffic. VLANs carve a physical fabric 

into isolated broadcast domains to reduce blast radius and enforce policy separation 

(Basan, 2024). This is foundational for both performance and security in safety-critical 

environments. In OpenShift, this segregation is represented in workload definitions (e.g., 

Multus network-attachment definitions) so that VM/pod interfaces map to the intended 

VLANs (Red Hat, 2025a, p. 312). These architectural concepts form the theoretical back-

bone for the failover patterns analyzed later. 

 
2.7 High Availability and Failover 

Building on the redundancy and architectural concepts introduced in Section 2.6, this 

section provides a focused exploration of high availability (HA) and failover in the context 

of modern virtualized infrastructures, especially OpenShift Virtualization. While Sec-

tion 2.6 addressed general principles, this section emphasizes how these translate into 

specific configurations and operational models. Understanding these mechanisms is crit-

ical for ensuring service continuity and regulatory compliance in safety‑critical systems, 

where predictable, bounded performance and safe, well‑governed recovery are required 

outcomes. 

 
2.7.1 Definitions and Expectations 

High availability refers to a system’s ability to remain operational with minimal downtime, 

even amidst component failures (Somasekaram et al., 2021, p. 1). In industrial 



automation, HA also implies predictable failover, unobstructed operator interaction, and 

that platform behavior does not degrade safety functions. These principles are im-

portant when considering platforms like OpenShift Virtualization for critical systems. 

 
Failover is the mechanism by which a system detects a failure and transitions operations 

to a redundant or standby component. It is a core HA technique expected to be fast, 

reliable, and transparent where appropriate, preventing data loss, preserving control 

states, and ensuring safety margins are not degraded during the transition (Gurijala & 

Sullivan, 2022). 

 
In nuclear plant environments, these mechanisms must align with regulatory standards 

such as IEC 62443 and Finnish YVL E.7 and A.12, which mandate system integrity, zonal 

isolation, and deterministic behavior under defined fault conditions. 

 
2.7.2 Failover Models 

Failover models in HA clusters can be broadly categorized. Active‑active configurations 

involve multiple nodes processing workloads simultaneously, redistributing traffic upon 

failure (Somasekaram et al., 2021, p. 5). Active‑passive models, often favored for their 

simplicity, feature one operational node and a synchronized standby ready for immedi-

ate takeover (Somasekaram et al., 2021, p. 6). 

 
In virtualized infrastructures like the planned OpenShift Virtualization at OL3, these mod-

els apply with specific mechanisms for virtual machine (VM) high availability. For VMs 

managed by OpenShift Virtualization, HA is primarily orchestrated by the underlying Ku-

bernetes platform. Red Hat documentation notes that when an OpenShift worker node 

hosting a VM fails, Kubernetes detects the failure and attempts to reschedule the VM 

(which runs as a Kubernetes pod) onto another healthy worker node. This automatic 

restart process assumes fencing of the failed node and that the VM’s disks reside on 

shared or re‑attachable storage accessible by multiple nodes (Red Hat, 2025a, 

pp. 62, 408–409). Live migration of VMs between OpenShift nodes without service 



interruption is supported and is typically used for planned maintenance or workload bal-

ancing, not as the primary cross‑site failover tool (Red Hat, 2025a, pp. 29, 359–374). 

 
Failover functionality extends beyond compute. Network and storage systems also re-

quire rapid, deterministic recovery. At the network level, VLAN segmentation and redun-

dant links can be paired with deterministic core designs so that link/switch faults do not 

cause identity conflicts or prolonged reconvergence (Cisco, 2024). Similarly, storage failover

depends on synchronized replication and path redundancy; solutions such as distributed

block storage or software‑defined storage platforms (e.g., the Ceph‑based OpenShift Data

Foundation, where it suits the deployment) enable data availability 

despite individual node or disk failures (Somasekaram et al., 2021, pp. 11–18; Red 

Hat, 2025a, p. 343). 

 
The selected failover model impacts system complexity, testability, and response time. 

Active‑active configurations, for example, need robust monitoring and quorum enforce-

ment to prevent split‑brain (Somasekaram et al., 2021, p. 8). Validation of failover strat-

egies relies on fault‑injection frameworks executed under controlled conditions to con-

firm that recovery behaves as designed and that bounded outcomes are achieved. 

 
2.8 Infrastructure Management and Automation 

Modernizing OM690 on a software‑defined platform increases the volume and cadence 

of routine change: systems must be provisioned, configured, patched, and observed in a 

way that remains deterministic and auditable over decades. In such settings, infrastruc-

ture automation should be treated as an enabling abstraction that turns environment 

state into versioned, reviewable artefacts. Because the environment is planned to be air‑

gapped (offline), an offline content path is necessary. This section introduces three po-

tential tool families for lifecycle management in a virtualized OM690. The operational 

patterns that use them are analyzed later in Chapter 5. 

 

2.8.1 Red Hat Satellite 

Red Hat Satellite is a management platform for Red Hat Enterprise Linux (RHEL) systems 

that consolidates content distribution, patching, and configuration into a centrally gov-

erned service (Red Hat, 2025b, p. 9). Satellite mirrors operating‑system content from up-

stream sources to a local repository of record and exposes it to enrolled RHEL hosts (Red 

Hat, 2025b, pp. 10, 58). For distributed estates it can project this content close to con-

sumers via Capsule Servers, which replicate from the central Satellite and reduce update 

latency across sites (Red Hat, 2025b, p. 12). 

 
Change control is expressed through Content Views and Lifecycle Environments: admin-

istrators compose versioned snapshots of repositories and promote those snapshots 

through environments such as Development → Test → Production, ensuring that iden-

tical content is consumed at each stage (Red Hat, 2025b, pp. 11–13). Satellite also sup-

ports fully disconnected operation, either by importing content on removable media or 

by synchronizing from a staging Satellite in a differently zoned network using Inter‑Satellite

Synchronization (ISS) (Red Hat, 2025b, pp. 47–50). These properties make it a useful

reference model for maintaining a verifiable baseline in air‑gapped or highly segmented

facilities.
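
The promotion idea can be modelled in a few lines. The sketch below is a conceptual model only and does not use the Satellite API or its command-line tooling; the class names and package strings are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple

LIFECYCLE = ("Development", "Test", "Production")


@dataclass(frozen=True)
class ContentViewVersion:
    """Immutable snapshot of repository content."""
    version: int
    packages: Tuple[str, ...]


@dataclass
class LifecyclePath:
    published: Dict[str, ContentViewVersion] = field(default_factory=dict)

    def publish(self, snapshot: ContentViewVersion) -> None:
        """A new snapshot always enters at the first environment."""
        self.published[LIFECYCLE[0]] = snapshot

    def promote(self, source: str) -> str:
        """Copy the snapshot reference, unchanged, to the next environment."""
        index = LIFECYCLE.index(source)
        if index + 1 >= len(LIFECYCLE):
            raise ValueError(f"{source} is the last environment")
        if source not in self.published:
            raise ValueError(f"nothing published in {source}")
        target = LIFECYCLE[index + 1]
        self.published[target] = self.published[source]
        return target


path = LifecyclePath()
path.publish(ContentViewVersion(1, ("kernel-example-1.0", "openssl-example-3.0")))
path.promote("Development")
path.promote("Test")
```

Because the snapshot is frozen and only its reference moves between environments, the content that reaches Production is exactly the content that was tested.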

 
In an OpenShift‑based modernization its role is specific: Satellite manages traditional 

RHEL guests (e.g., virtual machines that run a full RHEL user space) but does not manage 

OpenShift worker nodes running Red Hat Enterprise Linux CoreOS (RHCOS), which are 

maintained by OpenShift’s own cluster operators (e.g., the Machine Config Operator) 

(Red Hat, 2025f, pp. 25, 41). 

 
2.8.2 Ansible 

Ansible provides a declarative, agentless approach to configuration management and 

orchestration: desired system state is described in human‑readable YAML playbooks, 

and the engine applies those changes idempotently over secure transports (Red Hat, 



2025c, p. 7). Because it operates without a resident agent and encodes changes as text, 

Ansible is well suited to environments that value repeatability, reviewability, and minimal 

software footprint on managed hosts. 

 
In OpenShift contexts, Ansible is relevant in multiple ways. First, it is commonly used to 

automate platform‑adjacent tasks like installing and configuring clusters, applying post‑

install settings, or driving day‑2 configuration in a controlled manner. Second, when 

OpenShift Virtualization is present, Ansible collections can interact with the KubeVirt API 

to define or manage virtual machines as part of broader workflows, aligning VM lifecycle 

actions with the same declarative model used for other infrastructure components (Red 

Hat, 2025a, p. 59; Red Hat, 2025c, p. 6). 

 
2.8.3 Terraform 

While Ansible manages software configuration, Terraform can be used to provision the 

underlying resources themselves. It frames infrastructure as code (IaC): administrators 

declare the desired state of underlying resources (compute, networks, storage), and a 

provider‑based engine plans and applies the minimal set of changes required to reach 

that state. A persisted state file enables drift detection and safe, incremental change 

(HashiCorp, n.d.).  
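
The plan step and drift detection can be illustrated conceptually as follows. This is a Python model of the idea, not Terraform configuration (which is written in HCL) and not Terraform's internal algorithm; the resource names and attributes are hypothetical.

```python
from typing import Any, Dict, List

Resources = Dict[str, Dict[str, Any]]


def plan(desired: Resources, recorded: Resources) -> Dict[str, List[str]]:
    """Return the resource names to create, update, and delete."""
    return {
        "create": sorted(name for name in desired if name not in recorded),
        "update": sorted(name for name in desired
                         if name in recorded and desired[name] != recorded[name]),
        "delete": sorted(name for name in recorded if name not in desired),
    }


def detect_drift(recorded: Resources, actual: Resources) -> List[str]:
    """Names whose real attributes no longer match the recorded state."""
    return sorted(name for name in recorded if actual.get(name) != recorded[name])


# Hypothetical declared configuration, persisted state, and observed reality.
desired = {"vlan_110": {"id": 110, "mtu": 9000}, "vlan_120": {"id": 120, "mtu": 1500}}
recorded = {"vlan_110": {"id": 110, "mtu": 1500}, "vlan_999": {"id": 999, "mtu": 1500}}
actual = {"vlan_110": {"id": 110, "mtu": 1500}, "vlan_999": {"id": 999, "mtu": 9000}}

changes = plan(desired, recorded)
drifted = detect_drift(recorded, actual)
```

Resources that already match the declared state appear in none of the three lists, which is what keeps the applied change set minimal.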

 
In a virtualized OM690, such a tool can be useful in bootstrapping and evolving the sub-

strate on which OpenShift runs. This can include server definitions, VLANs and subnets, 

or storage allocations, so that environments can be reproduced and reviewed as code 

across pre‑production and production settings. As with the other tools, Terraform’s op-

eration is discussed further in Chapter 5. 



3 Current System and Virtualization Feasibility 

This chapter describes the relevant existing OM690 infrastructure and assesses its readiness

for virtualization. It summarizes the current hardware and software landscape, highlights

key limitations in the legacy systems, and identifies factors that influence virtualization

feasibility. The analysis of how well the current architecture supports a transition to

virtualized platforms also lays the foundation for the later chapters.

 
3.1 System Constraints and Virtualization Implications 

OM690 is an automation and supervisory information system in OL3. It provides opera-

tor display/HMI, information and control functions, alarm handling, processing of pro-

cess data, interfaces to field and system components, long-term archiving, and adminis-

trative and maintenance capabilities, and it also interfaces with other I&C or monitoring 

systems (Areva, 2023). The current OM690 automation system at OL3 mainly runs on 

aging hardware infrastructure that is becoming unsustainable from a lifecycle perspec-

tive. The system contains a hybrid mix of SPARC-based servers and x86-64 platforms. The 

SPARC servers are the operational backbone, while the x86-64 servers handle some of 

the peripheral and diagnostic workloads (Areva, 2023). 

 
The SPARC-based OM690 infrastructure includes about 30 physical servers that run automation

functions. These include 24 Operator Terminal (OT) servers for human-machine interface

functions, two processing units (PU) for data processing and calculations, two server units

(SU) for data archiving and retrieval, an external unit (XU) for signal exchange between

external monitoring systems, and various engineering and diagnostic stations 

(TVO, 2024). Most of these servers operate as bare-metal systems that run Solaris 10, 

which is an operating system that is now deprecated and only minimally supported in 

modern virtualization environments. Many of these SPARC servers are now out of vendor 

support, and sourcing compatible replacement hardware is difficult. This amplifies the 

risk of system outages and other issues. The system also includes about 10 physical x86-



64 servers. These systems already run on modern operating systems and are better po-

sitioned for virtualization, particularly within the Red Hat OpenShift Virtualization frame-

work that is being considered. Figure 4 provides a high-level overview of this hardware 

layout. 

 
Figure 4. Current OM690 high-level overview 

 
OM690 is deployed as a role-dedicated server model: operator interfaces, process data 

handling, long-term storage, engineering configuration, and diagnostics operate on dis-

tinct physical hosts (TVO, 2024). This approach has offered operational clarity and fault 

isolation, but it leads to poor hardware utilization and administrative overhead. Each 

system requires manual lifecycle management, which contributes to configuration drift 

and inconsistent patch levels.  

 
These issues are made more prominent by tight hardware-software coupling and legacy 

dependencies, especially within the SPARC-based workloads. Hardware abstraction layers

are in some cases absent, so a server failure there results in complete loss of the function

unless redundancy is in place. Similar constraints affect network and storage 



subsystems, as the current topology lacks many modern features that are foundational 

in virtualized environments. For avoidance of doubt, the current SPARC/Solaris servers 

are not candidates for direct virtualization. The migration path assumes Siemens‑sup-

plied x86‑64 releases of OM690 components that will be deployed on the new, virtual-

ized platform. 

 
3.2 Virtualization Readiness 

The readiness of existing OM690 system components for virtualization differs between 

SPARC-based and x86-64 platforms. This is mainly due to architectural compatibility, op-

erating system support, and the lifecycle status of both hardware and software. Systems 

based on SPARC have limitations when transitioning to modern virtualization platforms. 

For example, Red Hat's virtualization stack does not list SPARC-based operating systems 

as certified guests (Red Hat, n.d.-a), and support for older systems like Solaris 10 is

deprecated across most hypervisors. As a result, the modernization strategy does not involve 

direct virtualization of SPARC-based workloads. Instead, these are planned to be mi-

grated to x86-64 hardware within OpenShift Virtualization clusters. 

 
A major technical challenge in this migration is endianness. SPARC platforms use a big-

endian architecture, where the most significant byte of a multi-byte value is stored at 

the lowest memory address (Oracle, 2018). By contrast, x86-64 systems use little-endian 

encoding, where the least significant byte occupies the lowest memory address (HP, 

2009). This difference affects the interpretation of fixed-width binary files, which cannot 

be reliably used unless properly converted. A 32-bit integer representing a timestamp, 

for example, will yield incorrect values on a little-endian system if read without transla-

tion (HP, 2009, p. 9). 
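
The effect is easy to reproduce. In the short example below, the timestamp value is an arbitrary illustration and not plant data; it is written as a big-endian 32-bit integer and then read back with both byte orders.

```python
import struct

timestamp = 1_700_000_000            # example Unix time (November 2023)
raw = struct.pack(">I", timestamp)   # bytes as a big-endian writer stores them

correct = struct.unpack(">I", raw)[0]   # 1700000000
misread = struct.unpack("<I", raw)[0]   # 15815525, a date in 1970
```

The misread value is still a plausible-looking integer, so nothing fails at read time. This is why such errors can remain silent unless the conversion is validated explicitly.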

 
The problem is particularly critical in nuclear instrumentation contexts, where precision, 

traceability, and auditability are essential. Binary archives generated on SPARC cannot 

be consumed by x86-64 virtual machines without risk of silent data corruption or 



operational misinterpretation. Conversion requires tooling that accurately transforms 

data fields.  

 
A potential method to address endianness is Python’s built-in struct module, which al-

lows byte-order specification for decoding and re-encoding binary values. Figure 5 illus-

trates this type of schema-aware conversion workflow, where a big-endian source file is 

read, processed according to a schema, and written as a new little-endian file. The pro-

cess involves reading the original data in big-endian format ('>'), transforming the fields 

as required, and writing the result in little-endian format ('<') (Python Software Founda-

tion, n.d.). 
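
A minimal sketch of such a schema-aware conversion is shown below. The record layout (a 32-bit timestamp, a 16-bit signal identifier, and a 64-bit floating-point value) is a hypothetical example and not the actual OM690 archive format, which would have to be taken from the vendor's specification.

```python
import struct

# Hypothetical record layout: uint32 timestamp, uint16 signal id, float64 value.
RECORD_FIELDS = "IHd"
BIG = struct.Struct(">" + RECORD_FIELDS)      # source byte order (SPARC)
LITTLE = struct.Struct("<" + RECORD_FIELDS)   # target byte order (x86-64)


def convert_records(data: bytes) -> bytes:
    """Decode big-endian records and re-encode them as little-endian."""
    if len(data) % BIG.size != 0:
        raise ValueError("input length is not a whole number of records")
    out = bytearray()
    for fields in BIG.iter_unpack(data):
        out += LITTLE.pack(*fields)
    return bytes(out)


# Two example records with made-up values.
source = BIG.pack(1_700_000_000, 42, 21.5) + BIG.pack(1_700_000_001, 43, -3.25)
converted = convert_records(source)
decoded = list(LITTLE.iter_unpack(converted))
```

A schema is needed because the fields have different widths: each field must be swapped on its own boundaries, so reversing the whole file or swapping fixed four-byte groups would corrupt the 16-bit and 64-bit fields. Rejecting input whose length is not a whole number of records gives an auditable failure instead of a silently truncated archive.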

 
Figure 5. Byte order mismatch and conceptual workflow for endian conversion 

 
Several alternative approaches exist, including runtime byte-swapping middleware or 

conversion to architecture-neutral intermediate formats. However, for large volumes of 

historical data, a structured offline pipeline could be the most practical and auditable 

solution (HP, 2009, p. 14; Oracle, 2018). 

 
Endianness is not the only challenge. Some legacy SPARC applications were compiled as 

32-bit binaries, despite the underlying hardware supporting 64-bit operation. Migrating 

such applications to a 64-bit Linux platform is infeasible without recompilation, which is 

often blocked by unavailable source code or incompatible library dependencies (HP, 

2009, p. 7). 



 
By contrast, the x86-64-based OM690 systems are considerably more virtualization-

ready. These platforms already run supported operating systems and can typically be 

rehosted onto OpenShift Virtualization clusters with minimal modification. 

 
3.3 Overview of Network and Storage Topology 

The current OM690 infrastructure at OL3 operates on networking and storage architec-

ture that reflects the technological standards of its original deployment period. While 

functional, this legacy topology brings limitations on scalability, redundancy, and general 

virtualization readiness.  

 
The OM690 network architecture includes a combination of legacy and industrial net-

working technologies. As depicted in Figure 4, core communication occurs over the ring 

networks, which utilize OM1 multimode fiber operating at 100 Mbps. These networks 

were optimized for the deterministic and time-sensitive communication required by con-

trol and monitoring systems.  

 
This topology, while suitable for its original bare-metal deployment, lacks features re-

quired for virtualization, and is limited in terms of redundancy, as modern protocols are 

not consistently deployed. As virtual machines are introduced, their integration should 

respect the behavior of the control network and ensure no unintentional traffic leakage 

across VLANs, particularly when using software-defined networking mechanisms (Red 

Hat, 2025a, pp. 278–281, 295). 

 
To enable compatibility with plant VLANs, OCP nodes intended to host virtual machines 

could be provisioned with multiple physical interfaces and/or SR-IOV capabilities, where 

needed, to ensure virtual interfaces maintain proper behavior under plant constraints. 

The configuration of NetworkAttachmentDefinition resources in Kubernetes, for exam-

ple, allows binding VM interfaces to appropriate VLANs (Red Hat, 2025a, pp. 78–80). 

However, careful planning is needed to align MAC address visibility, broadcast domains, 



and redundancy expectations between virtual and physical switches (Johnson et al., 

2023, pp. 14–15). 
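
For illustration, a NetworkAttachmentDefinition that binds a secondary interface to a VLAN could be generated as follows. The resource name, namespace, bridge name, and VLAN ID are hypothetical. The CNI configuration keys are assumed to follow the upstream bridge CNI plugin; the plugin type and fields that OpenShift Virtualization expects should be confirmed in the Red Hat documentation for the release in use.

```python
import json
from typing import Any, Dict


def vlan_attachment(name: str, namespace: str, bridge: str, vlan_id: int) -> Dict[str, Any]:
    """Build a NetworkAttachmentDefinition manifest for one VLAN."""
    if not 1 <= vlan_id <= 4094:
        raise ValueError("VLAN ID must be in the range 1-4094")
    cni_config = {
        "cniVersion": "0.3.1",
        "name": name,
        "type": "bridge",      # assumed plugin type; verify for the platform
        "bridge": bridge,      # host bridge wired to the plant-facing uplink
        "vlan": vlan_id,
    }
    return {
        "apiVersion": "k8s.cni.cncf.io/v1",
        "kind": "NetworkAttachmentDefinition",
        "metadata": {"name": name, "namespace": namespace},
        # The CNI configuration is embedded as a JSON string.
        "spec": {"config": json.dumps(cni_config)},
    }


# Hypothetical values for one plant VLAN.
attachment = vlan_attachment("plant-vlan-110", "om690", "br-plant", 110)
```

Generating these definitions from one function keeps the mapping from VLAN IDs to attachments reviewable in a single place, which supports the alignment between virtual and physical switching described above.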

 
The storage infrastructure supporting OM690 today lacks many abstraction, redundancy, 

and orchestration features that are foundational in modern virtual environments. Spe-

cifically, there is minimal support for high availability at the storage layer. Failures typi-

cally lead to manual recovery processes, which is not ideal in a virtualized high-availabil-

ity context. 

 
In a future OpenShift Virtualization deployment, persistent storage for virtual machines 

and containerized workloads could be provisioned using Kubernetes-compatible vol-

umes, for example via the Container Storage Interface (CSI) (Red Hat, 2025a, pp. 342–

343). A potential option would be to integrate enterprise storage systems with CSI driv-

ers. 

 
Critical workloads that require live migration capabilities or automated failover could be 

backed by shared storage accessible to all participating nodes in the OpenShift cluster. 

The current architecture does not yet support this pattern and might require an overlay 

or replacement solution that introduces shared block-level volumes with consistent per-

formance and redundancy. 

 
While the current OM690 network and storage topology has met the requirements of 

the original system design, it lacks some capabilities required to support modern virtu-

alization. Network speeds, interface flexibility, and protocol support should be updated 

to support dynamic and redundant virtual machine communication. Similarly, the stor-

age layer will evolve from locally attached devices to shared, resilient storage solutions 

that enable live migration and failure recovery without data loss. 

 
These gaps are not insurmountable, and several paths exist for alignment. Network mod-

ernization could include deploying OCP clusters on updated switching infrastructure with 



VLAN-aware physical interfaces and appropriate SR-IOV configurations. Storage up-

grades could introduce CSI-backed resilient storage pools for both VM disk images and 

container volumes (Red Hat, 2025d, pp. 258–265; Red Hat, 2025a, pp. 332–335). These 

measures will help ensure that the virtualized OM690 system meets the fault tolerance, 

performance, and regulatory requirements applicable to a safety-critical nuclear facility. 

 
3.4 Communication and Integration Challenges 

The transition from legacy bare-metal infrastructure to a virtualized environment based 

on the proposed OpenShift Container Platform (OCP) also introduces technical and op-

erational integration challenges. These include maintaining reliable communication be-

tween newly deployed virtual components and existing plant systems coupled to deter-

ministic, time-sensitive industrial protocols. The effort should also address legacy soft-

ware compatibility, uphold cybersecurity boundaries within the new environment, and 

ensure operational transparency. 

 
One primary communication challenge is the protocol gap between the modern, soft-

ware-defined environment and legacy control system protocols. Introducing virtual ma-

chines managed by OpenShift Virtualization running on OCP nodes introduces complex-

ity in preserving timing determinism. For instance, replacing tightly timed applications 

with virtualized equivalents may introduce network jitter or latency. This potentially af-

fects system synchronization or safety loop timing (Burnicki, 2024, p. 3). 

 
Moreover, the existing segmented network architecture and deterministic fieldbus behavior

impose constraints on integrating virtual systems. While VMs in OpenShift Virtualization

can potentially be connected to existing plant VLANs using features like Multus CNI (Red

Hat, 2025a, pp. 214–218), OCP's own software-defined networking and virtual switching

layers require precise configuration. Care must be taken 

with MAC addresses, broadcast domains, and redundancy between virtual and physical 

network layers to prevent traffic leakage or degraded performance (Johnson et al., 2023, 

pp. 14–15). 



 
OpenShift 4.17 introduced a feature known as User Defined Networking (UDN), which 

allows for the creation of multiple, isolated layer-2 networks at the cluster level, defined 

through a new UserDefinedNetwork Custom Resource Definition (CRD) (OKD, n.d.). In the 

long term, this could offer a more native and simplified approach to network segmenta-

tion, potentially reducing the reliance on per-namespace NetworkAttachmentDefinition 

configurations (OKD, n.d.). UDN could provide more robust, centrally managed isolation 

so that safety-related traffic can never share a broadcast domain with maintenance util-

ities, even inside the virtual fabric. However, its adoption is not assumed in this design.  

 
Cybersecurity is another concern. Introducing an OpenShift Container Platform environ-

ment for virtualization adds new attack surfaces, including the OCP control plane, worker 

nodes, datastore, and the container subsystem itself. These should be hardened and isolated

according to the applicable standards, so that every human, process, and device operates

only within its assigned role. Per IEC 62443‑3‑3 SR 2.1, the platform should enforce

authorization on all interfaces for all users/roles; YVL A.12 §4.3 further requires role‑limited

administration and traceable logging of accesses and changes (IEC, 2013; STUK, 

2021). 

 
This could involve OCP-specific security measures such as configuring Role-Based Access 

Control (RBAC), Security Context Constraints (SCCs), NetworkPolicies, and securing con-

tainer image sources and registries (Red Hat, 2025d, pp. 254–270). It is essential that the 

virtualized environment does not blur boundaries between trusted plant zones and ex-

ternal interfaces, especially given OCP's capabilities for integrating with automation 

pipelines. This IT/OT convergence is a well-documented matter that requires careful 

management.  

 
Lastly, remapping physical I/O channels to virtual systems needs careful consideration. 

While many planned VMs are supervisory, virtualizing control components raises con-

cerns about latency and determinism. Even without virtualizing safety-critical I&C loops, 



data routing from these systems through the virtualized upper-layer systems on OCP re-

quires careful design (Johnson et al., 2023, p. 9). 

 
3.5 Migration Risks and Mitigation Drivers 

While the preceding sections have detailed the architectural and operational limitations 

of the current OM690 infrastructure, it is equally important to frame these as risks that 

shape the feasibility and sequencing of virtualization efforts. The most critical risks stem 

from legacy platform discontinuities, particularly the incompatibility between SPARC and 

x86-64 systems, and the obsolescence of Solaris 10. This creates an unavoidable need 

for migration.  

 
Endianness and binary compatibility issues also move beyond technical nuisances into 

project risk territory. If handled incorrectly, byte-order mismatches in historical data 

could compromise safety-critical logs or misrepresent operational telemetry. Similarly, 

legacy 32-bit applications may prove unmigratable if source code is unavailable, forcing 

redesigns under tight timelines. Another compounding risk is the fragility of disaster

recovery: IEC 62443-3-3 SR 7.4 requires that the system can be restored to a known

secure state after any disruption (IEC, 2013). With the old SPARC Sun

Fire servers no longer supported and replacement parts scarce, even well-maintained 

backup images become unusable without compatible hardware.  

 
Security concerns round out the risk landscape. Integrating OpenShift introduces an en-

tirely new control layer that must be hardened according to nuclear cybersecurity stand-

ards (IEC, 2013; STUK, 2021). This brings technical work and regulatory overhead, espe-

cially around network segmentation, role-based access, and patch hygiene across virtual 

and physical domains. To manage these risks, the migration strategy should proceed in 

phased, validated stages. Each workload should be tested in a controlled simulator envi-

ronment before production use, and data conversion processes must be both verifiable 

and auditable. The risk is not only technical, but procedural as well. 

 
3.6 Conceptual Virtualized Architecture 

To address the issues of the legacy system, a modernized architecture built on Red Hat 

OpenShift Virtualization has been planned. It establishes a software-defined foundation 

where workloads are decoupled from physical hardware and managed as virtual ma-

chines or containers within a Kubernetes-native environment. The core platform is 

planned to consist of standardized x86-64 compute hosts (Dell PowerEdge), network 

switches (Cisco Nexus), and iSCSI-based shared storage systems. These are to be de-

ployed in two computer rooms, each housing three OpenShift nodes, with six in total. 

Storage arrays in each room are planned to participate in a metro cluster configuration 

with synchronous replication, ensuring data consistency between sites. 

 
The OM690 components are planned to be hosted as virtual machines or containers on 

OpenShift. Key ancillary services are planned to include Active Directory and WSUS for 

identity and update control, Red Hat Satellite and EfficientIP for configuration and IP ad-

dress management, and Veeam with LTO-9 for data protection and long-term archival. 

In this architecture, automation can be supported by an infrastructure-as-code pipeline 

with specific tools, which will be analyzed in Chapter 5.  

 
Figure 6 summarizes the high-level architectural layout of the conceptual new system, 

showing physical infrastructure, virtualized platform layers, ancillary services, and net-

work segmentation. This architecture aims to resolve the legacy platform’s lifecycle

limitations but will require careful planning to meet operational and regulatory demands. 

The techniques that would make this architecture work in terms of resilience are exam-

ined in Chapter 4 and translated into operational practices in Chapter 5. 

 
 
Figure 6. High-level core architecture of the conceptual virtualized platform 


4 Architectural Resilience and Failover Analysis 

This chapter analyzes failover mechanisms and how they could be implemented in a vir-

tualized OM690 infrastructure, especially in terms of network, storage and compute 

nodes. It focuses on architectural strategies, tools, and validation methods that ensure 

fault tolerance and operational continuity. 

 
4.1 Component‑Level Criticality and Recovery Priorities 

Not all components within the OM690 platform have equal importance in terms of their 

role or their recovery requirements. To clarify the implications for system design and 

failover validation, the criticality of components is viewed here from two angles: 

• Functional criticality refers to how essential a component’s role is to the safe and 

continuous operation of the plant. 

• Recovery criticality reflects how urgently a specific instance of a component must 

be restored after failure, which depends on its redundancy and operational im-

pact. 

 
This distinction is useful in the platform’s fault tolerance considerations, and it should 

guide both infrastructure design and testing priorities. Processing Units (PUs) are func-

tionally extremely critical components, as they run real-time nuclear programs that con-

trol and monitor plant systems (TVO, 2024). Any disruption to PU logic could compromise 

safety functions. This is why PUs are deployed as a high availability hot standby pair con-

figuration in the legacy system (TVO, 2024) and are planned to operate this way in the 

virtualized system as well. Each PU has a standby partner that immediately assumes control

upon failure (TVO, 2024). As such, the recovery criticality of an individual PU VM is not

high in terms of the virtualized platform. The failover will occur at the application level, 

and the virtualized infrastructure mainly needs to ensure that split-brain is avoided and 

that at least one PU always remains operational. 

 
Server Units (SUs) handle data archiving and contribute to system transparency and di-

agnostics (TVO, 2024). They too are redundant because a prolonged failure could result 

in data gaps affecting post-event analysis and regulatory traceability. Due to the redun-

dancy, they too have a lower recovery criticality.   

 
Operator Terminals (OTs) provide the primary interface between plant personnel and the 

I&C systems (TVO, 2024). While they do not execute control logic, their availability needs 

to be ensured for situational awareness and timely operator intervention. One OT server 

typically drives four operator displays (TVO, 2024). If the server fails, a significant portion 

of the control room interface goes offline. This includes loss of alarm visibility, process 

trend graphs, and manual control interfaces. Therefore, OT servers have moderate func-

tional criticality but high recovery criticality, and their restart procedures should be pri-

oritized. 

 
The External Unit (XU) acts as a communication hub for up to 16 external systems and

lacks built-in redundancy (TVO, 2024). If the XU server fails, the links to all of these

external systems are interrupted until the server is restored. This makes the XU a

top-priority component from a recovery criticality perspective, even though it does not

itself execute core control functions.

 
Other components such as diagnostic servers (DS), engineering stations (ES), and log 

servers are less critical for immediate plant operation (TVO, 2024). They support

maintenance and post-event analysis and can tolerate longer recovery windows. However,

the network that underpins these systems is a backbone element, and its reliability is

necessary for keeping all subsystems operational and synchronized.

 
Standards establish the backdrop for this criticality assessment. YVL E.7 requires that 

safety-classified systems continue functioning under failure conditions, with determinis-

tic and testable recovery behavior (STUK, 2019). IEC 62443-3-3 (SR 3.6) further mandates 

that recovery from disruptions must be predictable and must not compromise 


operational or data integrity (IEC, 2013). The OM690 architecture must therefore not 

only tolerate faults but do so in a way that aligns with both the function-specific and 

recovery-specific expectations of the nuclear safety domain. 

 
Although OpenShift offers a container-native infrastructure, a container-first model is 

challenging to implement for many OM690 workloads. Some components, like the PUs, 

exhibit real-time constraints, tightly coupled behavior, and platform-specific binaries. 

These characteristics make effective containerization difficult, because containerization

favors stateless and decoupled application design (Queiroz et al., 2023, pp. 5–15). For this

reason, an appropriate model is VM-centric at the workload layer and container-native at

the infrastructure level. That said, many services can, and preferably should, be

containerized. The VM-centric stance applies only to services and servers that are deemed

unsuitable for containerization.

 
In short, the platform’s design should balance deterministic behavior, rapid interface res-

toration, and communication survivability. Failover logic should be tiered accordingly, 

and validation activities should reflect the differing recovery criticality of components.  
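
The tiering described in this section can be illustrated with a small data model. The following Python sketch is only an illustration: the tier names, the numeric ranks and the redundancy flags for DS and ES are assumptions introduced here, while the relative ordering (XU first, then OT, then the redundant PU and SU, then supporting services) paraphrases the assessment above.

```python
from dataclasses import dataclass

# Illustrative ranks only; lower rank means "restart sooner".
RECOVERY_RANK = {"top": 0, "high": 1, "low": 2, "deferred": 3}


@dataclass(frozen=True)
class Component:
    name: str
    redundant: bool            # True if an application-level partner exists
    recovery_criticality: str  # key into RECOVERY_RANK


COMPONENTS = [
    Component("PU", redundant=True, recovery_criticality="low"),
    Component("SU", redundant=True, recovery_criticality="low"),
    Component("OT", redundant=False, recovery_criticality="high"),
    Component("XU", redundant=False, recovery_criticality="top"),
    # Redundancy of DS and ES is not stated in the text; False is an assumption.
    Component("DS", redundant=False, recovery_criticality="deferred"),
    Component("ES", redundant=False, recovery_criticality="deferred"),
]


def restart_order(components):
    """Return component names ordered by how urgently a failed instance is restarted."""
    ranked = sorted(components, key=lambda c: RECOVERY_RANK[c.recovery_criticality])
    return [c.name for c in ranked]


print(restart_order(COMPONENTS))
```

A validation plan could use such an ordering to decide which failover tests are run first and which restart times are measured most strictly.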

 
4.2 OpenShift Control Plane and Redundancy Patterns 

As stated earlier, the virtualized OM690 environment is planned to run on six OpenShift 

nodes split between two computer rooms. To meet OL3 recovery targets, the OpenShift 

platform must keep its control plane available through a room-level failure. The resili-

ence of this layer is a prerequisite for any higher-level HA. This section outlines two clus-

ter patterns: a single stretched cluster across two rooms and two independent clusters, 

and explains how control plane quorum, workload protection, and storage behavior dif-

fer. For OM690, this choice directly affects how fast services can be restarted after faults. 

The project is going forward with a single stretched cluster, with the dual‑cluster alter-

native retained here as an analyzed option for completeness and future contingency. 

 
4.2.1 Control Plane Topology and Quorum Rationale 

In a stretched cluster design, a single OpenShift cluster would place its control plane 

members across both rooms. It remains one logical control plane backed by one etcd 

(key-value store) quorum. By contrast, a dual-cluster design would deploy two independ-

ent clusters, one per room, with no shared quorum. etcd requires a strict majority of 

voting members for any change (etcd, 2025). Odd-sized control planes are therefore pre-

ferred because they maximize fault tolerance for a given size, while even-sized layouts 

do not increase the number of simultaneous failures tolerated and add failure surface 

(etcd, 2025). Table 1 illustrates this principle. 

 
Table 1. etcd quorum majority and failure tolerance 

etcd voters    Majority required    Failures tolerated
3              2                    1
4              3                    1
5              3                    2

 
As the table demonstrates, a four-member control plane still needs three members for 

quorum and thus tolerates only one failure (etcd, 2025). It therefore offers the same fault

tolerance as a three-member control plane while adding a member that can fail. With only

two rooms and no third site hosting a voter, no stretched topology can guarantee write

continuity through an arbitrary room loss. Voter placement only improves the odds when the

surviving room holds the majority (Gurijala & Sullivan, 2022). 
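
The arithmetic behind Table 1 can be reproduced in a few lines. The sketch below is a minimal illustration of the strict-majority rule; the function names are chosen here for readability and are not part of etcd.

```python
def quorum(voters: int) -> int:
    """Smallest number of voting members that forms a strict majority."""
    return voters // 2 + 1


def failures_tolerated(voters: int) -> int:
    """Number of voters that can fail while a majority still remains."""
    return voters - quorum(voters)


for n in (3, 4, 5):
    print(n, quorum(n), failures_tolerated(n))
```

The loop prints one row per cluster size and shows that moving from three to four voters raises the required majority without raising the number of tolerated failures.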

 
To remain fully read/write after the loss of either room, two approaches exist: operate 

dual clusters with GLB steering, or place control plane voters across three sites so a ma-

jority always survives a single-room loss (Gurijala & Sullivan, 2022). Without one of those 

patterns, a stretched two-room control plane cannot guarantee continuity of writes on 

arbitrary room loss. 
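
The effect of voter placement can be checked in the same way. The following Python sketch is a simplified model that only counts voters per site; the site names and the example layouts are assumptions for illustration and do not describe the planned OM690 placement.

```python
def survives_site_loss(voters_per_site: dict[str, int]) -> dict[str, bool]:
    """For each site, report whether etcd keeps a majority if that site is lost."""
    total = sum(voters_per_site.values())
    majority = total // 2 + 1
    return {
        site: (total - count) >= majority
        for site, count in voters_per_site.items()
    }


# Two rooms, three voters: only the loss of the minority room is survivable.
print(survives_site_loss({"room_a": 2, "room_b": 1}))

# Three sites in a 2+2+1 layout: any single site can be lost.
print(survives_site_loss({"room_a": 2, "room_b": 2, "site_c": 1}))
```

With two sites, at least one entry is always False regardless of how the voters are split, which is the reason a third site or a dual-cluster design is needed for write continuity through an arbitrary room loss.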

 
If the room holding the majority fails in a stretched layout, the cluster loses quorum. The 

API becomes read-only, controllers and the scheduler cannot persist changes, and

KubeVirt cannot reschedule VMs (etcd, 2025). Existing pods (container workloads) and VMs

may keep running on their current hosts, but no new recovery actions occur until

operators restore quorum, for example by using the documented etcd quorum-restoration

procedure or, if the majority cannot be recovered, by restoring from an etcd backup (Red

Hat, 2025k, pp. 391–392). For OM690 this means that VM restarts cannot begin until quorum

is restored; the delay sits on the control plane path, not on storage. Learner members

do not change this outcome, as they are non-voting and must be explicitly promoted 

after catching up (etcd, 2025). 

 
A stretched cluster is viable for OM690 when inter-room latency remains in the single-

digit milliseconds and voter placement is planned so that loss of the non-majority room 

preserves quorum for routine restarts. Day-2 operations should watch etcd peer

round-trip time and disk fsync (flush-to-disk) latency, as sustained high values can trigger disruptive

leader elections (Red Hat, 2025j, pp. 14–15). At the storage layer, stretch patterns carry 

their own guardrails. Red Hat’s guidance calls for ≤ 10 ms round-trip time between sites 

in stretch configurations (Red Hat, 2025i, pp. 108–109). 

 
A dual-cluster alternative would avoid cross-room quorum coupling entirely, as each 

room keeps a full, independent control plane. Therefore, if one room is lost, the other 

remains fully read/write and a global load balancer (GLB) steers traffic while applications 

fail over according to their replication strategy (Gurijala & Sullivan, 2022). In practice this 

preserves the platform’s ability to restart services immediately in the surviving room, at 

the cost of operating two clusters and maintaining GLB-based runbooks. 

 
Finally, any node intended to run or migrate VMs should use Red Hat Enterprise Linux 

CoreOS (RHCOS). Generic RHEL workers may join for container-only workloads but can-

not host or migrate VMs (Red Hat, 2025a, p. 59). Standardizing VM-hosting on RHCOS 


simplifies fencing integrations and intra-cluster KubeVirt operations, regardless of 

whether the platform is stretched or dual-cluster. 

 
4.2.2 Workload Recovery Semantics (Platform-Level) 

For infrastructure‑led failover, OpenShift Virtualization protects VMs primarily via cold 

restart on healthy nodes with volume reattachment orchestrated by Kubernetes/Ku-

beVirt. Cross‑room live migration is out of scope here (Gurijala & Sullivan, 2022). For 

OM690 this is not ideal for the OTs and XU, but it is acceptable if their Recovery Time

Objectives (RTOs), i.e. their maximum acceptable downtimes, are met. The PUs and SUs, on the

other hand, do not rely on VM‑level recovery for continuity because they run as redundant pairs. For

containerized services that are architected for replication, active/active can be achieved 

by running replicas in both rooms and steering traffic via the GLB. This is distinct from 

services deployed as VMs, which have to remain cold‑restart oriented (Gurijala & Sulli-

van, 2022). 

 
Storage behavior (replication mode, arbitration/witness, path management, and 

RWX/RWO implications for live migration) is analyzed in Chapter 4.3.2, which specifies 

the metro‑replication model, the role of the third‑site witness, and host multipathing 

considerations (Dell Technologies, 2024; Red Hat, 2025e). In a dual‑cluster design, state 

consistency mechanisms (infrastructure‑level vs. application‑level replication) are like-

wise summarized in Chapter 4.3.2; at the platform layer, failover remains a re-

start/switchover in the target cluster rather than live migration of a running VM (Gurijala 

& Sullivan, 2022). 

 
4.2.3 Comparative Assessment 

A stretched cluster keeps the operational surface small (one control plane) and, under 

low inter‑room latency with favorable voter placement, can reschedule workloads trans-

parently when paired with synchronous metro‑replication (RPO = 0). The trade‑off is 

quorum sensitivity: if the majority room is lost, the minority partition stops serving 


writes and scheduling until a majority is restored (etcd, 2025; Dell Technologies, 2024, 

pp. 6–9). A third‑site storage witness is recommended in the stretched storage design to

arbitrate a single writer and prevent split‑brain (see Chapter 4.3.2 for the storage‑layer

rationale and parameters). The witness does not, however, influence etcd quorum (Dell Technologies,

2024, pp. 6–13, 44–55; etcd, 2025). 

 
If uninterrupted control plane writes during a room loss are required across two rooms, 

a dual‑cluster approach behind a global load balancer would be the safer choice because 

in it each room retains an independent, read/write control plane. If a stretched, single‑

cluster model is preferred for operational simplicity, continuous writes through any sin-

gle‑room loss would require placing control plane voters across three sites (for example 

2+2+1), observing etcd latency budgets (e.g., p99 peer RTT and fsync), and validating 

that site‑to‑site RTT remains within Red Hat’s stretch guidance (≤ 10 ms) for steady 

leader election behavior (Gurijala & Sullivan, 2022; etcd, 2025; Red Hat, 2025j, p. 5; Red 

Hat, 2025i, Chapter 3.5). 

 
In either pattern, control plane stability is primarily sensitive to etcd performance. Red 

Hat documentation implies that a practical approach is to monitor key indicators at high 

percentiles, such as the 99th percentile (p99) for peer round-trip time (RTT) and disk 

fsync latency, rather than anchoring on a single average value. Persistently elevated p99 

values for these metrics can lead to disruptive etcd leader elections and should be 

avoided (Red Hat, 2025j, p. 5). For stretched storage, Red Hat reference materials cite

≤ 10 ms RTT between sites as a planning budget (Red Hat, 2025j; Red Hat, 2025i, § 3.5,

pp. 8–9). These values can be treated as planning targets, but the actual figures would

need to be measured and validated on the implemented platform.
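
The difference between an average and a high percentile can be shown with a short calculation. The Python sketch below uses a nearest-rank p99 and synthetic round-trip samples; the sample values are invented for illustration, and only the 10 ms budget comes from the guidance cited above.

```python
def p99(samples):
    """Nearest-rank 99th percentile of a non-empty list of measurements."""
    if not samples:
        raise ValueError("at least one sample is required")
    ordered = sorted(samples)
    rank = (99 * len(ordered) + 99) // 100  # integer ceil(0.99 * n)
    return ordered[rank - 1]


def within_budget(samples_ms, budget_ms):
    """True if the p99 of the samples does not exceed the planning budget."""
    return p99(samples_ms) <= budget_ms


# Synthetic example: 98 fast round trips and 2 slow ones (values are invented).
rtt_ms = [2.0] * 98 + [30.0] * 2
mean_ms = sum(rtt_ms) / len(rtt_ms)

print(f"mean = {mean_ms:.2f} ms, p99 = {p99(rtt_ms):.1f} ms")
print("within 10 ms budget:", within_budget(rtt_ms, 10.0))
```

In this example the mean stays well below the budget while the p99 exceeds it, which is why monitoring on averages alone could hide the latency spikes that matter for leader election stability.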

 
For OM690, if continuous control plane writes during a room loss are required, a dual-

cluster approach behind a GLB would be a safer choice. If a brief platform-level pause is 

acceptable and operational simplicity is paramount, a stretched cluster is also viable, 


provided inter-site latency targets are met and sufficient quorum-restore procedures are 

in place (etcd, 2025; Red Hat, 2025j, Chapter 4.4). 

 
4.3 Failure-Mode Impact and Recovery Patterns 

Ensuring that no single fault can compromise system-wide functionality is a core require-

ment in nuclear instrumentation and control (I&C). This section evaluates how the pro-

posed OM690 virtualization architecture meets that objective by analyzing its resilience 

across three key fault domains: network, storage, and compute nodes. Rather than treat-

ing failure tolerance as a singular feature, the analysis demonstrates how layered recov-

ery mechanisms work together to uphold availability targets. The emphasis is placed on 

realistic failure scenarios, the system’s potential responses, and the architectural design 

choices that enable recovery. 

 
4.3.1 Network-Level Resilience 

To satisfy the requirements of the OM690 upgrade, the network must continue operating

despite failures such as the loss of one computer room or of any single switch or link. The

core is therefore planned to use four Cisco Nexus switches, two per room. The core design

must include explicit choices that bound the blast radius of a failure and keep

forwarding predictable.

 
As illustrated in Figure 7, each room is planned to contain one vPC domain (a pair of 

Nexus switches). Here, the two switches in Room A form vPC Domain A, and the two in 

Room B form vPC Domain B. A single Nexus switch can participate in only one vPC do-

main at a time. Therefore, designs where each switch peers with two vPC peers are not 

supported by the Nexus OS (NX-OS) and would risk split-brain behavior (Cisco, 2024,

p. 260). The two room-local vPC domains should preferably be interconnected with routed

Layer 3 links for fault containment (Cisco, 2024, pp. 288–291). Alternatively, when a Layer 2

stretch is truly necessary, they can connect via a supported vPC-to-vPC interconnection.


These choices constrain failure propagation within the network: room-local faults stay

local, and cross-room links are engineered for deterministic reconvergence.

 
Figure 7. Proposed dual vPC topology with redundancy 

 
vPC peer-keepalives should run out-of-band (OOB), ideally on the mgmt0 interface in a

dedicated virtual routing and forwarding (VRF) instance, so the peer-link never carries keepalives (Cisco,

2024, pp. 259, 262–263). This arrangement lets the peers distinguish a failed switch from

a failed peer-link and prevents dual-active conditions. This is critical in a safety context

such as OM690, where ambiguous forwarding states are unacceptable. To avoid

blackholing in a peer‑link‑down event, orphan‑connected devices should preferentially attach to

the vPC primary (or use the orphan‑port‑protection setting), as the secondary suspends

its vPC member ports when the peer‑link is lost but the keepalive remains up (Firewall.cx,

n.d.).

 
Within each room, the vPC pair presents itself as one logical switch to downstream de-

vices, so uplinks run active-active rather than blocking. This increases available band-

width and avoids spanning-tree failover delays during single-link faults (Cisco, 2024, p. 

255). Spanning tree can remain enabled as a backstop: NX-OS runs STP on both vPC peers, 

with the primary coordinating vPC-facing ports on the secondary. Setting the pair as STP 

root/secondary and treating peer-link ports as STP network ports minimizes topological 


surprises and speeds convergence when something fails (Cisco, 2024, pp. 273–274). The 

outcome of this is that common faults do not change the forwarding model seen by hosts. 

 
Each OpenShift and storage node exposes physical NICs (Figure 7, bond0 and bond1) and 

forms LACP port-channels to the room-local vPC pair. LACP in active mode detects unidi-

rectional or mis-cabled members and automatically removes them from service; NX-OS 

explicitly recommends LACP for vPC member links (Cisco, 2024, pp. 238–240, 272). Com-

pared to static port-channels, this avoids silent black holes in the network and reduces 

operator intervention during faults, which improves mean-time-to-recovery and keeps 

packet loss bounded to the detection window. In combination, LACP on the host edge 

and STP on the vPC pair detect and isolate asymmetric link failures quickly, so bandwidth 

is reduced but traffic remains available (Cisco, 2024, pp. 238–240, 273–274). 

 
Plant networks are planned to remain on Siemens Scalance rings, which use the High Speed

Redundancy Protocol (HRP). To attach those rings to the core without compromising ring timing,

the design can use Enhanced Passive Listening Compatibility (EPLC). With EPLC, the Scalance

switches relay spanning-tree topology change messages toward the core, so the core flushes

its MAC tables and can reconverge within a few seconds of a ring break. STP processing is

disabled on the HRP ports, and the RSTP segment connects to more than one ring node in

Passive Listening mode (Siemens, 2018). The effect is that the ring’s latency budget is preserved even

while the broader fabric re-optimizes. 

 
Potential network failure profiles and impacts in this design include:  

• High-probability, low-impact: Single switch or single peer-link failure. Traffic con-

tinues in the surviving vPC domain(s); OOB keepalives and STP prevent dual-ac-

tive, and dual-homed hosts keep forwarding on remaining members (Cisco, 2024, 

pp. 255–258, 272–274). Expected operational outcome: platform stays online; no 

VMs/pods are restarted, and no workloads are moved. Hosts keep using the re-

maining links.  


• Medium-probability, moderate-impact: One switch fails in each room simultane-

ously. Each vPC domain still has one peer alive; the inter-domain vPC-to-vPC links 

preserve connectivity between rooms. Expected operational outcome: capacity 

reduced but services continue. 

• Low-probability, high-impact: Both switches in one room fail. Single-attached de-

vices in that room are isolated; dual-homed servers retain connectivity through 

their links into the surviving room’s vPC domain. Expected operational outcome: 

services fail over to the other room but remain online. 
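
The failure profiles above can be explored with a simple reachability model. The Python sketch below is a deliberately reduced illustration: switch and host names are hypothetical, the cross-room interconnect is assumed to be a full mesh between the two vPC domains, and each host is assumed to attach only to its room-local switch pair. Servers with additional uplinks into the other room, as mentioned in the last profile, are not modeled.

```python
from collections import deque

# Hypothetical names: two switches per room, one representative host per room.
SWITCH_ROOM = {"A1": "A", "A2": "A", "B1": "B", "B2": "B"}
HOST_ROOM = {"host-a": "A", "host-b": "B"}


def build_links():
    """Return the set of undirected links in the assumed topology."""
    links = {frozenset(("A1", "A2")), frozenset(("B1", "B2"))}  # vPC peer-links
    for a in ("A1", "A2"):          # assumed full-mesh cross-room interconnect
        for b in ("B1", "B2"):
            links.add(frozenset((a, b)))
    for host, room in HOST_ROOM.items():  # hosts dual-attach to the local pair
        for switch, switch_room in SWITCH_ROOM.items():
            if switch_room == room:
                links.add(frozenset((host, switch)))
    return links


def reachable(src, dst, failed=()):
    """Breadth-first search over links that do not touch a failed switch."""
    failed = set(failed)
    links = [link for link in build_links() if not (link & failed)]
    seen, queue = {src}, deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            return True
        for link in links:
            if node in link:
                (other,) = link - {node}
                if other not in seen:
                    seen.add(other)
                    queue.append(other)
    return False


for scenario in [("A1",), ("A1", "B2"), ("A1", "A2")]:
    print(scenario, reachable("host-a", "host-b", failed=scenario))
```

Under these assumptions, a single switch failure and one failure per room leave the two hosts connected, while the loss of both switches in one room isolates the hosts attached only to that room.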

 
Because servers are planned to be dual-attached and storage should replicate

synchronously between rooms (analyzed in Chapter 4.3.2), changes or upgrades on one room’s

pair can be executed without workload interruption. This is assuming the cross-room 

interconnect and configuration parity described above are maintained (Cisco, 2024, pp. 

255, 272–274). In short, resilience in the network can be built in layers: vPC contains 

local faults, LACP hardens the host edge, EPLC protects deterministic rings, and cross-

room links provide a controlled path of last resort. Together these techniques align the 

network with the YVL A.12 (§4.1) and IEC 62443‑3‑3 (SR 5.1, SR 5.2) single-failure tolerance and

deterministic recovery expectations (STUK, 2021; IEC, 2013). 

 
4.3.2 Storage-Level Resilience 

The integrity and availability of the storage backend are fundamental to the virtualized 

OM690 system. Both computer rooms are planned to host a Dell PowerStore array. 

Metro volume technology presents them as a single active-active storage pool that rep-

licates every write synchronously, resulting in an RPO of zero (Dell Technologies, 2024, 

pp. 6–10). Each OpenShift node reaches the arrays through two independent iSCSI fab-

rics, and Device-Mapper Multipath (DM-Multipath) balances I/O across all available 

paths (Red Hat, 2025e, pp. 9–10). Figure 8 illustrates this arrangement: two PowerStore 

arrays participate in metro sessions for each volume, a third‑site witness arbitrates 

which side is writable, and hosts access the arrays through target port groups (TPGs), 

continuing I/O on the preferred side via active‑optimized paths. 


 
Figure 8. Metro-Volume storage diagram (Dell Technologies, 2024) 

 
At the host level, DM-Multipath monitors path health and transparently redirects traffic 

when a path component such as an HBA, switch, or front-end port fails. Dell’s Linux examples use

“queue-length 0”, which spreads I/O across all active-optimized paths (Dell Technologies, 2024,

p. 85). Red Hat documents the queue-length 0 selector as an equally safe alternative for 

block devices (Red Hat, 2025e, pp. 9–10, 29, 33–38). Either policy sustains throughput 

during maintenance and avoids I/O contention during mass path recovery. 

 
Metro volume’s arbitration logic can be enhanced by a third-site witness introduced in 

PowerStoreOS 3.6. The witness continuously tests both arrays and the inter-site links. If 

communication is lost, only the side acknowledged by the witness retains the write role, 

which prevents split-brain corruption (Dell Technologies, 2024, pp. 13, 44–45). To oper-

ate in synchronous mode Dell specifies round-trip latency below 5 ms and at least 250 

Mb/s per concurrent replication + migration flow (Dell Technologies, 2024, pp. 12–13,

26, 44, 59). The planned OM690 storage architecture should stay within those limits.

It should be noted that metro volume’s single-writer arbitration protects VM disks but is 

not a multi-writer mechanism. Therefore, two-room active/active must be implemented 

at the application/data-replication layer for services that require it. 
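
The link requirements quoted above lend themselves to a simple pre-check. The Python sketch below encodes the two limits as constants; treating the bandwidth figure as a total that scales linearly with the number of concurrent flows is an interpretation made here, and the function name and example inputs are hypothetical.

```python
MAX_RTT_MS = 5.0            # synchronous mode: round-trip latency must be below this
MIN_MBPS_PER_FLOW = 250.0   # minimum bandwidth per concurrent flow


def metro_link_ok(rtt_ms: float, bandwidth_mbps: float, concurrent_flows: int) -> bool:
    """Check a measured inter-room link against the quoted planning limits."""
    if concurrent_flows < 1:
        raise ValueError("at least one flow is required")
    enough_bandwidth = bandwidth_mbps >= MIN_MBPS_PER_FLOW * concurrent_flows
    return rtt_ms < MAX_RTT_MS and enough_bandwidth


# Hypothetical measurements, for illustration only.
print(metro_link_ok(rtt_ms=1.2, bandwidth_mbps=1000.0, concurrent_flows=4))
print(metro_link_ok(rtt_ms=1.2, bandwidth_mbps=900.0, concurrent_flows=4))
```

Such a check would belong in the acceptance tests for the inter-room links rather than in production monitoring, since it only validates that the planning assumptions hold.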

 
Because every write is synchronously committed, the design contains the three principal 

fault scenarios. If one array fails, the witness demotes it and DM-Multipath continues

over the surviving paths with no client impact. If an entire room is lost, the surviving

array still serves data and OpenShift starts the affected VMs on local hosts. If the

inter-room link alone fails, the witness grants write access to only one site while the other

remains read-only until synchronization is restored (Dell Technologies, 2024, pp. 9–

10, 44–45). This eliminates any risk of data divergence. 
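
The three scenarios can be summarized as a small decision function. The Python sketch below is a simplified model of the behavior described in this section, not Dell's arbitration algorithm; the role names and the witness_prefers parameter are assumptions introduced for illustration.

```python
from enum import Enum


class Role(Enum):
    READ_WRITE = "read-write"
    READ_ONLY = "read-only"
    DOWN = "down"


def arbitrate(a_up: bool, b_up: bool, link_up: bool, witness_prefers: str = "A"):
    """Return the role of each array for a given combination of failures."""
    if not a_up or not b_up:
        # Array or room failure: the surviving side keeps serving writes.
        return {
            "A": Role.READ_WRITE if a_up else Role.DOWN,
            "B": Role.READ_WRITE if b_up else Role.DOWN,
        }
    if link_up:
        # Healthy metro session: both sides accept I/O.
        return {"A": Role.READ_WRITE, "B": Role.READ_WRITE}
    # Inter-room link failure: the witness keeps exactly one writable side.
    return {
        site: Role.READ_WRITE if site == witness_prefers else Role.READ_ONLY
        for site in ("A", "B")
    }


print(arbitrate(a_up=False, b_up=True, link_up=True))
print(arbitrate(a_up=True, b_up=True, link_up=False, witness_prefers="B"))
```

The property that matters for split-brain prevention is visible in the last branch: whenever the arrays cannot see each other, at most one side remains writable.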

 
OpenShift Virtualization consumes the metro volume through block‑mode Persis-

tentVolumeClaims (PVCs). PowerStore metro volume is a block‑only capability (Dell 

Technologies, 2024); the CSI storage class used here presents RWO PVCs rather than 

RWX. As a result, live migration of KubeVirt VMs is not available with this storage class; 

recovery uses cold restart with volume reattachment (Red Hat, 2024b, p. 26). A cross-

room transfer still requires powering off the VM on the failed site and booting it on the 

surviving hosts (Red Hat, 2025a, pp. 23–24, 64, 347). Declaring this limitation up front 

prevents overstating the benefit of metro volume while remaining compliant with re-

quirements (IEC 62443‑3‑3 SR 7.3, SR 7.4) on state preservation (IEC, 2013). 

 
Synchronous replication protects data, but workloads also need their front-end IPs. A 

Global Load Balancer (GLB) can monitor health probes from both rooms and pin each 

client session to the preferred site (Gurijala & Sullivan, 2022). It makes switchover deci-

sions based on application and endpoint health, while witness status is an input, not the 

sole trigger (Gurijala & Sullivan, 2022). The GLB therefore completes the switchover 

without human intervention and avoids snapshot divergence by ensuring that new

requests always land on the authoritative copy. Finally, the witness appliance, if used,

becomes a critical dependency in its own right and should be deployed on an independent

management VLAN (Dell Technologies, 2024).

 
In summary, host-level multipathing, metro volume synchronous mirroring with a wit-

ness, and GLB-based traffic steering together provide a clear, testable path from single-


cable faults up to full computer-room loss. This meets recovery requirements in SR 7.3 

and 7.4 of IEC 62443-3-3 (IEC, 2013). 

 
4.3.3 Compute-Node Resilience 

While resilient network and storage infrastructures are foundational, the physical serv-

ers in the OM690 virtualization cluster represent another critical fault domain. To 

demonstrate the architecture’s ability to tolerate such failures, we can consider a sce-

nario in which one of the OpenShift Virtualization hosts suffers a complete outage. This 

node is assumed to run a mix of virtual machines, including a Processing Unit (PU), Op-

erator Terminals (OT) and the External Unit (XU) server. The scenario therefore tests both 

functionally critical and recovery-critical components, as defined in Chapter 4.1. Where

suitable services are containerized, pod rescheduling typically restores instances

faster than VM cold restarts, but this applies only to workloads that can meet OM690’s 

determinism and integration constraints in container form. 

 
The system’s response begins at the application layer. When the PU on the failed node 

becomes unavailable, the hot standby PU takes over the role. This switchover happens 

within the PU pair and should not wait for any OpenShift remediation. The infrastruc-

ture’s only task is to enforce fencing to prevent split-brain and to ensure that at least one 

PU instance remains available (Red Hat, 2025g, Chapter 1). The cluster moves the node

from Ready to Unknown after about 50 s of missed heartbeats, and Node Health Check

waits a further 40 s before creating the Self Node Remediation (SNR) object, so fencing 

starts roughly 90 s after the first symptom (Red Hat, 2025g, p. 7). 
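
The timing described above can be written out as a simple budget. In the Python sketch below, the 50 s and 40 s defaults come from the text; the assumption that a VM cold restart follows only after the remediation safe-time has elapsed is a simplification, and the safe-time and boot durations in the example are placeholders, not measured or recommended values.

```python
def fencing_start_s(unknown_after_s: int = 50, nhc_wait_s: int = 40) -> int:
    """Seconds from the first missed heartbeat until remediation is triggered."""
    return unknown_after_s + nhc_wait_s


def estimated_restart_s(safe_time_s: int, vm_boot_s: int,
                        unknown_after_s: int = 50, nhc_wait_s: int = 40) -> int:
    """Rough time until a cold-restarted VM is running again on another node."""
    return fencing_start_s(unknown_after_s, nhc_wait_s) + safe_time_s + vm_boot_s


print(fencing_start_s())
# Placeholder inputs: both values below are hypothetical.
print(estimated_restart_s(safe_time_s=180, vm_boot_s=120))
```

Comparing such an estimate with the recovery time objectives of the OT and XU servers indicates how much of the budget is consumed by detection and fencing before the VM restart even begins.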

 
Concurrently, the OpenShift control plane should recognize the failed node as non-re-

sponsive. The Node Health Check operator creates an SNR resource, which reboots the 

unhealthy node via its watchdog and cordons it so that workloads are rescheduled. Be-

cause SNR can only restart the node locally, sites that demand a full hardware power-off 

can attach the Fence Agents Remediation (FAR) operator to the same Node Health Check 

pipeline; FAR then performs the required out-of-band fencing (Red Hat, 2025g, pp. 21–


23; medik8s, n.d.). The SNR safe-time (safeTimeToAssumeNodeRebootedSeconds) can 

be tuned to balance data-corruption risk against recovery speed (Red Hat, 2025g, p. 15). 

 
Once the failed node is fenced, the infrastructure layer recovers less-resilient workloads. 

The OT VMs are rescheduled to a healthy node for a cold restart (Gurijala & Sullivan, 

2022). Each OT drives up to four operator displays and stores its unique HMI layout and 

short-term archive locally; its loss therefore removes alarms, trend graphs and manual-

control capability for a defined control-room segment. Because displays are not inter-

changeable across OTs, rapid restoration of the exact OT VM is essential to preserve op-

erator awareness and intervention capability. 

 
Equally urgent is the recovery of the XU server, which serves as a communications hub 

for up to sixteen external systems (TVO, 2024). XU failure disrupts those links and can 

trigger cascading plant alarms. As a single-instance component, its recovery-time objec-

tive is among the most stringent in the system, so the scheduler should give the XU VM 

the highest restart priority on an XU-capable host once fencing is confirmed and the disk

has been reattached (Gurijala & Sullivan, 2022). This assumes that the XU is not host-

bound and can be restarted on another node.  

 
Throughout the incident, the SU pair, planned to be hosted on different nodes, remains

unaffected, demonstrating effective fault-domain segregation. The scenario illustrates 

the tiered recovery model from Chapter 4.1: application-layer continuity for redundant 

components (PU), fast VM restarts for highly recovery-critical roles (OT, XU), and deferred

recovery for non-critical services such as diagnostics or engineering stations. 
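
The tiered recovery model can be illustrated with a minimal sketch that splits the workloads of a failed node into those covered by application-layer failover and those that need an infrastructure restart, ordered by urgency. The role ranks, the workload names and the "DIAG" role label are hypothetical and chosen only for illustration; placing the XU ahead of the OTs follows the restart priority argued above.

```python
from typing import List, NamedTuple, Tuple


class Workload(NamedTuple):
    name: str
    role: str


REDUNDANT_ROLES = {"PU"}           # continuity handled by the hot standby
RESTART_RANK = {"XU": 0, "OT": 1}  # fast VM restart tier, XU first
DEFERRED_RANK = 2                  # everything else is restarted last


def plan_recovery(failed: List[Workload]) -> Tuple[List[str], List[str]]:
    """Return (application-layer failovers, ordered infrastructure restarts)."""
    failover = [w.name for w in failed if w.role in REDUNDANT_ROLES]
    restarts = sorted(
        (w for w in failed if w.role not in REDUNDANT_ROLES),
        key=lambda w: (RESTART_RANK.get(w.role, DEFERRED_RANK), w.name),
    )
    return failover, [w.name for w in restarts]


failed_node = [
    Workload("PU-A", "PU"),
    Workload("OT-3", "OT"),
    Workload("DIAG-1", "DIAG"),
    Workload("XU", "XU"),
    Workload("OT-1", "OT"),
]
failover, restarts = plan_recovery(failed_node)
print(failover)  # ['PU-A']
print(restarts)  # ['XU', 'OT-1', 'OT-3', 'DIAG-1']
```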

 
Demonstrating that a single host can fail without loss of safety function satisfies the re-

quirements (IEC 62443-3-3 SR 7.1, SR 7.4) that physical and functional redundancy be 

implemented so the system remains operable even after a single-component failure (IEC, 

2013).  

 
4.3.4 Methods for Quick Recovery of High Urgency Workloads 

As established, Operator Terminals (OT) and the External Unit (XU) are unique, non-re-

dundant services: when one fails, a control room segment or multiple external links are 

unavailable until that exact instance runs again. Because there is no twin to fail over to, 

continuity depends on a fast, deterministic cold restart with the same identity and data

on a capable host. Several technical and procedural levers can be used to achieve this 

rapid recovery. 

One such lever is placement. Defining a labeled pool of OT- and XU-capable

nodes gives the scheduler multiple safe landing spots as soon as fencing completes. Anti-

affinity for OTs helps spread terminals so a single host failure does not remove an entire 

cluster of displays. This keeps landing-zone constraints explicit without pinning a given 

instance to a single machine. 
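
The placement logic can be sketched as a small selection function. This is not the OpenShift scheduler; it only mimics the two ideas in the paragraph, a labeled pool of capable nodes and a soft anti-affinity preference. The label naming scheme ("ot-capable", "xu-capable") and the node names are assumptions made for the example.

```python
from typing import Dict, List, Optional, Set


def choose_node(role: str,
                node_labels: Dict[str, Set[str]],
                hosted_roles: Dict[str, List[str]]) -> Optional[str]:
    """Pick a landing node for a VM of the given role, or None if no node qualifies."""
    required_label = f"{role.lower()}-capable"  # hypothetical label naming scheme
    candidates = [n for n, labels in node_labels.items() if required_label in labels]
    if not candidates:
        return None
    # Soft anti-affinity: prefer the node hosting the fewest VMs of the same role;
    # the node name breaks ties so that the choice is deterministic.
    return min(candidates, key=lambda n: (hosted_roles.get(n, []).count(role), n))


# Surviving nodes after a hypothetical failure of node-b.
node_labels = {
    "node-a": {"ot-capable", "xu-capable"},
    "node-c": {"ot-capable"},
    "node-d": {"ot-capable", "xu-capable"},
}
hosted_roles = {"node-a": ["OT", "OT"], "node-c": ["OT"], "node-d": ["OT", "OT"]}

print(choose_node("OT", node_labels, hosted_roles))  # node-c
print(choose_node("XU", node_labels, hosted_roles))  # node-a
```

Because the pool is defined by labels rather than by host names, adding a further capable node only requires labeling it; no VM definition has to change.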

Another useful lever is ensuring capacity headroom. Holding at least one host’s worth 

of spare CPU and memory in each VM pool converts a node loss into an immediate restart

rather than a queued request. However, the headroom only pays off if the scheduler 

prefers the urgent workloads. 
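
A headroom rule of this kind can be checked with a simple calculation. The sketch below assumes hypothetical node sizes and resource requests, and it compares aggregate spare capacity only; it ignores fragmentation, so a real check would also have to confirm that each individual VM fits on some surviving node.

```python
from typing import Dict, Tuple

Resources = Tuple[float, float]  # (CPU cores, memory in GiB)


def tolerates_single_node_loss(capacity: Dict[str, Resources],
                               requested: Dict[str, Resources]) -> bool:
    """True if, for every node, the survivors' spare capacity covers its requests."""
    for failed in capacity:
        survivors = [n for n in capacity if n != failed]
        spare_cpu = sum(capacity[n][0] - requested[n][0] for n in survivors)
        spare_mem = sum(capacity[n][1] - requested[n][1] for n in survivors)
        need_cpu, need_mem = requested[failed]
        if need_cpu > spare_cpu or need_mem > spare_mem:
            return False
    return True


# Three identical hypothetical nodes with 32 cores and 256 GiB each.
capacity = {n: (32, 256) for n in ("node-a", "node-b", "node-c")}
comfortable = {n: (20, 160) for n in capacity}
tight = {n: (24, 200) for n in capacity}

print(tolerates_single_node_loss(capacity, comfortable))  # True
print(tolerates_single_node_loss(capacity, tight))        # False
```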

Deterministic fencing and rescheduling should be driven through Node Health Check with

Self Node Remediation so that a failed node reboots (or is power-fenced when FAR is 

attached) and its workloads are eligible for restart without human intervention. Safe-

time tuning balances corruption risk against recovery speed; this pipeline makes the 

previously examined failed-node → restart-eligible transition predictable.
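
For orientation, a NodeHealthCheck resource that wires Node Health Check to Self Node Remediation could look roughly like the manifest below, written here as a Python dictionary so that it can be serialized and inspected. The resource name, the template name and namespace, the worker selector and the minHealthy value are assumptions for illustration and should be checked against the installed operator version; the 40 s duration mirrors the wait discussed in the failure scenario.

```python
import json

# Sketch of a NodeHealthCheck manifest; names and values marked below are assumptions.
node_health_check = {
    "apiVersion": "remediation.medik8s.io/v1alpha1",
    "kind": "NodeHealthCheck",
    "metadata": {"name": "om690-worker-nhc"},  # hypothetical name
    "spec": {
        "minHealthy": "51%",  # assumed; pauses remediation if too many nodes are unhealthy
        "selector": {
            "matchExpressions": [
                {"key": "node-role.kubernetes.io/worker", "operator": "Exists"}
            ]
        },
        "remediationTemplate": {
            "apiVersion": "self-node-remediation.medik8s.io/v1alpha1",
            "kind": "SelfNodeRemediationTemplate",
            "name": "self-node-remediation-automatic-strategy-template",  # assumed
            "namespace": "openshift-workload-availability",               # assumed
        },
        "unhealthyConditions": [
            {"type": "Ready", "status": "False", "duration": "40s"},
            {"type": "Ready", "status": "Unknown", "duration": "40s"},
        ],
    },
}

print(json.dumps(node_health_check, indent=2))
```

Keeping the durations in a single reviewed manifest makes the detection-to-fencing delay an explicit, version-controlled design parameter rather than an operator default.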

On the scheduling side, assigning a high PriorityClass to targeted virt-launcher pods en-

sures they preempt lower-priority work if necessary. In practice, this turns reserved ca-

pacity into guaranteed behavior: once fencing confirms the node is out, the scheduler 

places urgent workloads first on surviving hosts. 
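
As a minimal sketch of this lever, the two dictionaries below show a PriorityClass and the fragment of a VirtualMachine definition that references it. The class name, the priority value and the VM name are hypothetical, and the VM fragment is intentionally incomplete: it shows only the field relevant to scheduling priority, not a deployable definition.

```python
import json

# Hypothetical PriorityClass for urgent OM690 restarts.
priority_class = {
    "apiVersion": "scheduling.k8s.io/v1",
    "kind": "PriorityClass",
    "metadata": {"name": "om690-urgent-restart"},  # hypothetical name
    "value": 1000000,                              # assumed; higher wins
    "globalDefault": False,
    "preemptionPolicy": "PreemptLowerPriority",
    "description": "Urgent single-instance OM690 workloads (XU, OT).",
}

# Fragment of a VirtualMachine; only the priority reference is shown.
xu_vm_fragment = {
    "apiVersion": "kubevirt.io/v1",
    "kind": "VirtualMachine",
    "metadata": {"name": "xu-vm"},  # hypothetical name
    "spec": {
        "template": {
            "spec": {"priorityClassName": priority_class["metadata"]["name"]}
        }
    },
}

print(json.dumps([priority_class, xu_vm_fragment], indent=2))
```

Referencing the class by name from the VM definition keeps the priority decision in one place, so that raising or lowering the urgency of a whole group of workloads is a single change.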

