Syed Mahmood ul Hassan

Predictive Maintenance in IoT Using NLP Techniques

Master's Thesis
Vaasa 2025
School of Technology and Innovations
Master of Sustainable and Autonomous Systems

UNIVERSITY OF VAASA
School of Technology and Innovations
Author: Syed Mahmood ul Hassan
Title of the thesis: Predictive Maintenance in IoT Using NLP Techniques
Degree: Master of Sustainable and Autonomous Systems
Supervisor: Mohammed Elmusrati
Co-supervisor: Elham Ahmadi
Year: 2025
Pages: 82

ABSTRACT:
This thesis studies how natural language processing (NLP) techniques can be integrated with Internet of Things (IoT) sensor data for predictive maintenance in industrial environments. The main objective was to build models that predict whether a piece of equipment will fail within the next 7 days, using the iot_predictive_maintenance dataset. The dataset contains sensor readings such as temperature, vibration, humidity, and pressure, along with unstructured textual maintenance logs. Two feature sets were built: one containing only numeric sensor data, and one that additionally includes NLP-derived features from TextBlob, namely TF-IDF vectors and sentiment polarity scores. Logistic Regression, Random Forest, Gradient Boosting, and XGBoost models were evaluated on both feature sets. Models incorporating both sensor and NLP features significantly outperformed those based solely on sensor data. Among the evaluated models, the best performance was achieved by XGBoost on the combined feature set, which attained an accuracy of 0.973, an F1 score of 0.889, and a ROC AUC of 0.993. These results confirm that the textual information in maintenance logs carries important failure-related signals that are not expressed in numerical data alone. The work demonstrates the practicality of fusion models for predictive maintenance, enabling a scalable and robust solution for smart manufacturing.
KEYWORDS: Predictive Maintenance, IoT, Natural Language Processing (NLP), Machine Learning (ML), Fault Prediction

Contents

1 Introduction
  1.1 Background and motivation
  1.2 Problem statement
  1.3 Research objectives
  1.4 Research questions
  1.5 Methodology overview
    1.5.1 Data preprocessing
    1.5.2 Feature engineering
    1.5.3 Model development and training
    1.5.4 Evaluation metrics
  1.6 Significance and contribution
2 Literature Review
  2.1 Traditional predictive maintenance and sensor-based approaches
  2.2 Unstructured data and the case for NLP in maintenance
  2.3 Hybrid approaches: Fusing text with sensor data
  2.4 Advances in NLP models for industrial applications
  2.5 Evaluation practices and real-world validation
  2.6 Ethical, practical, and operational considerations
    2.6.1 Human-in-the-Loop (HITL) approaches
    2.6.2 Privacy and data governance
    2.6.3 Transparency and accountability
    2.6.4 Operational challenges
  2.7 Summary of gaps and opportunities
3 Methodology
  3.1 Dataset overview
    3.1.1 Sensor data
    3.1.2 Maintenance logs
    3.1.3 Objectives
    3.1.4 Dataset limitations
    3.1.5 Relevance to industry
  3.2 Data pre-processing
    3.2.1 Sensor data
    3.2.2 Maintenance log
  3.3 Feature engineering
  3.4 Model development
  3.5 Model evaluation metrics
  3.6 Findings of preliminary methodology
4 Case study – Predictive maintenance using sensor and maintenance log data with NLP
  4.1 Dataset overview
  4.2 Data pre-processing
    4.2.1 Sensor data processing
    4.2.2 Maintenance log processing
  4.3 Model development
    4.3.1 Feature set
    4.3.2 Model selection
    4.3.3 Model evaluation
  4.4 Summary
  4.5 Insights and implications
5 Results and discussion
  5.1 Model performance
    5.1.1 Logistic regression
    5.1.2 Random forest
    5.1.3 Gradient boosting
    5.1.4 XGBoost
  5.2 Best model analysis: XGBoost with NLP-enhanced features
  5.3 Best performing model: XGBoost (NLP-enhanced)
  5.4 Discussion
    5.4.1 Impact of feature set
    5.4.2 Model comparison
    5.4.3 Confusion matrix insights
    5.4.4 Practical implications
6 Conclusion
  6.1 Summary of work
  6.2 Key findings
  6.3 Critical evaluation
  6.4 Practical implications
  6.5 Limitations and future directions
References

Figures

Figure 1: Dataset of IoT devices
Figure 2: Model architecture
Figure 3: Dataset pre-processing
Figure 4: Failure rate per device
Figure 5: Sensor readings and failure rate of devices
Figure 6: Confusion matrix of sensor-only XGBoost
Figure 7: XGBoost confusion matrix of combined sensor and textual features
Figure 8: Model comparison w.r.t. accuracy
Figure 9: Logistic regression sensor-only model
Figure 10: Logistic regression NLP-enhanced model
Figure 11: Random forest sensor feature set
Figure 12: Random forest NLP-enhanced model
Figure 13: Gradient boosting sensor-only model
Figure 14: Gradient boosting with NLP features
Figure 15: XGBoost sensor model summary
Figure 16: XGBoost NLP model summary
Figure 17: ROC curve of XGBoost
Figure 18: Top 10 feature importances
Figure 19: Model comparison w.r.t. F1 score
Figure 20: Model comparison summary
Figure 21: Confusion matrix of best-performing XGBoost model

Tables

Table 1: IoT sensors and their functions
Table 2: Raw sensor features
Table 3: Sample TF-IDF terms from maintenance logs
Table 4: Train-test split overview
Table 5: Comparative ROC AUC scores
Table 6: Overall comparison of models
Table 7: Sensor-only feature set
Table 8: NLP-enhanced feature set
Table 9: Confusion matrix textual representation
Table 10: Sentiment polarity distribution in logs

Abbreviations

AI        Artificial Intelligence
AUC       Area Under Curve
AUC-ROC   Area Under the Receiver Operating Characteristic Curve
BERT      Bidirectional Encoder Representations from Transformers
FN        False Negative
FP        False Positive
FPR       False Positive Rate
GBM       Gradient Boosting Machine
GDPR      General Data Protection Regulation
HITL      Human-in-the-Loop
IIoT      Industrial Internet of Things
IoT       Internet of Things
ML        Machine Learning
MPM       Modern Predictive Maintenance
NER       Named Entity Recognition
NLP       Natural Language Processing
PdM       Predictive Maintenance
ROC       Receiver Operating Characteristic
RUL       Remaining Useful Life
SHAP      SHapley Additive exPlanations
SMOTE     Synthetic Minority Over-sampling Technique
TF-IDF    Term Frequency–Inverse Document Frequency
TPR       True Positive Rate
XGBoost   eXtreme Gradient Boosting

1 Introduction

Industry 4.0, the era of digital transformation, integrates cyber-physical systems, automation, and smart sensing technologies into industrial processes. Within this revolution sits PdM, an advanced maintenance strategy that aims to predict equipment failures before they occur, preventing unscheduled downtime, improving operational efficiency, and extending the operating lifespan of critical assets. The IoT has greatly boosted this capability: advances in sensor-embedded systems allow continuous monitoring of machines and generate vast amounts of structured, time-series data on parameters such as temperature, vibration, pressure, and humidity.

Sensor-driven predictive models are remarkably successful, yet in most current frameworks the unstructured textual data generated alongside sensor streams remains underutilized.
Examples include maintenance logs, fault descriptions, incident reports, and technician observations, which usually contain rich contextual information that is hard to capture with numerical data alone. According to Kang et al. (2020), excluding such data can result in incomplete fault detection, inefficient maintenance scheduling, and delayed early warning of failures, leaving the promise of PdM systems unrealized.

To address this, this research builds a new NLP-augmented PdM model that combines structured IoT sensor data with unstructured textual records. Specifically, the study presents a dual-layer predictive framework in which NLP techniques are used to extract and encode important features from the maintenance logs. These sensor features are then combined with TF-IDF vectors and sentiment scores obtained with TextBlob to train advanced ML models that can better predict failures.

This work studies PdM using data from a variety of purpose-built IIoT sensors. Devices such as ThermoTrack-B3, SensorHub-A1, FlowMeter-E9, and EnviroMon-D2 proved central for collecting real-time measurements of temperature, pressure, flow rate, and environmental conditions. The predictive models developed here rely on these sensor readings, together with the maintenance logs, to predict equipment failures before they take place.

The primary dataset used in this research, shown in Figure 1, was obtained from Kaggle. It contains real-time equipment sensor readings and the corresponding maintenance logs. The structured part includes key operational metrics such as temperature, humidity, vibration, and pressure, while the unstructured part consists of technician-entered descriptions of symptoms, faults, and observed anomalies.
This research applies linguistic intelligence to the integration of these two data modalities, addressing one of the central problems in PdM: contextual understanding.

Figure 1: Dataset of IoT devices

In this methodology, the textual data are first pre-processed: cleaned of noise, standardized, and tokenized. Features are then extracted through TF-IDF, which weights words by their importance across documents, and through sentiment analysis, which indicates whether a log's polarity shows signs of distress or failure. On this basis, two distinct datasets were formed: one comprised solely of sensor data, and a combined dataset that also includes the NLP-derived features.

These approaches were evaluated with four ML models: Logistic Regression, Random Forest, Gradient Boosting, and an optimized XGBoost classifier. The key performance metrics were accuracy, precision, recall, and F1 score, along with ROC AUC. Results showed that sensor-only models performed significantly worse than those enhanced with the NLP features. The most notable improvement was seen in the XGBoost model, which reached an accuracy of 0.973, an F1 score of 0.889, and a ROC AUC of 0.993, demonstrating its stability and precision in identifying possible failures.

Additionally, confusion matrix analysis revealed that the NLP-integrated models produced fewer false positives and false negatives than the models without semantic insight from textual logs, meaning that embedding this insight leads to a more robust differentiation between failing and non-failing equipment. Finally, the sentiment polarity scores added value by conveying the technician's tone and urgency, which correlate with real failure events.
Thus, this thesis contributes to the developing knowledge base on intelligent maintenance systems by demonstrating that NLP can be a powerful complement to sensor-based models for PdM. The results show the need for multi-modal data fusion in industrial analytics and provide a scalable framework that can be applied across industrial IoT ecosystems. The proposed hybrid PdM model captures both quantitative and qualitative signals, making it possible to develop more resilient and less expensive maintenance solutions in smart manufacturing environments. The following chapters present the methodology and experimental setup, the experimental results and analysis, and the conclusions, limitations, and directions for future research.

1.1 Background and motivation

Thanks to the exponential growth and widespread deployment of IoT devices, industrial operations in manufacturing, oil and gas, aerospace, and transportation can now be monitored continuously and at fine granularity (Compare et al., 2020). Smart, interconnected sensors produce high-resolution, structured data streams of critical machine parameters: temperature fluctuations, vibration frequencies, pressure levels, humidity changes, and operational cycles. PdM systems can analyse these data points, identify anomalies, predict future faults, and advise on maintenance schedules, increasing system reliability and overall operational efficiency.

However, while these advancements enable improved detection of failures, they still lack the semantic depth needed to truly grasp the root causes of failures and the contextual factors behind maintenance issues. When only sensor-based data is used, models are typically optimized on quantitative metrics at the expense of qualitative indicators that are also important for interpretation.
For example, technicians provide information in comments, root-cause descriptions, workaround notes, and previous fault narratives within maintenance logs, service reports, and inspection notes; information that is not given in the form of numerical sensor values. These unstructured text entries often describe anomalies in natural language, point to recurrent issues, and provide contextual clues and metadata about environmental or operational conditions that may not otherwise be monitored.

According to Usuga-Cadavid et al., the use of such textual records in PdM frameworks can unveil latent patterns and previously uncorrelated failure modes in machine behaviour (P. U. Cadavid et al., 2021). It enriches the predictive modelling procedure, providing a frequent and interpretable mechanism for analysing equipment condition. Although these unstructured data sources have potential, they are not yet used in mainstream industrial analytics because of the challenges of processing them, including variation in language, problem-specific domain jargon, misspellings, and inconsistent formatting.

Hence, the motivation for this research is to bridge the gap between structured sensor analytics and unstructured textual intelligence by applying NLP to transform free-text records into machine-readable features. The study integrates numerical sensor data with NLP-derived features, namely TF-IDF vectors and sentiment polarity scores, to build a more robust, holistic PdM model. The contribution lies in this interdisciplinary approach, which combines the strengths of data-driven engineering and computational linguistics to provide a smarter, context-aware maintenance solution that can yield deeper operational insight and more accurate fault predictions.
1.2 Problem statement

In modern industrial environments, PdM has become an essential strategy for minimizing unplanned equipment downtime, reducing maintenance costs, and increasing overall system reliability. Advances in sensing and analytics have led to sophisticated PdM models based not on isolated data points but on continuous streams of numerical data (vibration, temperature, pressure, etc.) used to detect anomalies and predict faults. However, the unstructured textual information available to these models remains underutilized to date.

Maintenance logs, technician notes, incident reports, and service documentation are generated routinely and in large volumes during industrial operations. These records often carry rich qualitative insights, including early warning signs, context about previous faults, human observations, and domain-specific knowledge that cannot be captured by sensors alone. Unfortunately, conventional PdM frameworks ignore these unstructured data sources and focus only on structured sensor inputs. As a result, current models tend to be less context-aware, which constrains their ability to predict correctly or to explain why an anomaly or equipment failure occurs.

This gap in current practice increases the chances of unexpected equipment breakdowns, harming productivity, safety, and operational continuity, and raises several pressing challenges, including incomplete or inaccurate fault detection and suboptimal maintenance scheduling. Exploiting this under-used textual data would yield rich, human-centric insights that complement predictive analytics.

Therefore, the main problem this research addresses is to design, develop, and validate an NLP-empowered PdM framework that natively combines quantitative sensor data with qualitative textual data.
This model unites the benefits of NLP techniques with those of structured data sources by incorporating features extracted through NLP, namely TF-IDF vectorization and sentiment analysis. The goal is to increase the prediction accuracy, interpretability, and reliability of PdM systems, anticipating failures earlier and supporting optimal industrial maintenance strategies.

1.3 Research objectives

To address the outlined problem, the thesis sets forth the following objectives:

1. To analyze an IoT sensor dataset and identify relevant unstructured textual features.
2. To preprocess and clean maintenance logs for NLP application, including tokenization, removal of noise, and standardization.
3. To apply TF-IDF and sentiment analysis to extract meaningful features from the logs.
4. To integrate textual features with sensor data to develop ensemble-based ML models for failure prediction.
5. To compare the performance of NLP-enhanced models with traditional sensor-only models using evaluation metrics such as accuracy, F1 score, and AUC.

1.4 Research questions

The study aims to answer the following research questions:

• What improvements in predictive outcomes can be made by combining NLP with IoT sensor data?
• What preprocessing and feature engineering steps are essential for converting unstructured maintenance logs into actionable predictive inputs?
• How do models that include NLP-derived features compare in performance to those that use only structured sensor data?

1.5 Methodology overview

The methodology employed in this research followed a systematic, multi-stage pipeline designed to develop, integrate, and evaluate a hybrid PdM model that combines structured IoT sensor data with unstructured maintenance log entries. The process was broadly divided into four key stages: data preprocessing, feature engineering, model development, and performance evaluation.
The primary dataset used for experimentation was the publicly available IoT PdM dataset, which included time-stamped sensor measurements, such as temperature, humidity, pressure, and vibration, alongside free-text maintenance logs describing fault occurrences and technician observations.

1.5.1 Data preprocessing

The preprocessing phase was critical in preparing both structured and unstructured data for downstream analysis. The textual component of the dataset (the maintenance logs) underwent a series of standard NLP preprocessing steps, producing a cleaned log for each record, log_i^clean = Clean(log_i):

• Lowercasing all text to ensure uniformity.
• Removal of punctuation, digits, and special characters to eliminate irrelevant noise.
• Tokenization and stop-word removal to retain only meaningful words.
• Lemmatization to reduce words to their root forms for semantic consistency.

Table 1 lists the IoT sensors used in the study, detailing their types and specific measurement focuses. These devices collectively capture a wide range of environmental and system parameters, including temperature, humidity, pressure, flow, and vibration.
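The log-cleaning and feature-extraction steps described above can be sketched in a few lines of Python. This is a minimal, illustrative stand-in only: the actual pipeline would typically use NLTK or spaCy for lemmatization, scikit-learn's TfidfVectorizer for the TF-IDF vectors, and TextBlob for sentiment polarity. The stop-word list, the negative-term lexicon, and the example log texts below are all invented for illustration.

```python
import math
import re

# Toy stand-ins for real stop-word lists and for TextBlob's sentiment lexicon.
STOPWORDS = {"the", "a", "an", "is", "was", "on", "and", "of", "near"}
NEGATIVE_TERMS = {"overheating", "burnt", "failure", "leak", "grinding"}

def clean_log(text):
    """Lowercase, strip punctuation/digits, tokenize, drop stop words."""
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    return [tok for tok in text.split() if tok not in STOPWORDS]

def tfidf(corpus):
    """Per-document TF-IDF scores with smoothed idf = ln((1+N)/(1+df)) + 1."""
    n = len(corpus)
    df = {}
    for doc in corpus:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    vectors = []
    for doc in corpus:
        vectors.append({
            term: (doc.count(term) / len(doc))
                  * (math.log((1 + n) / (1 + df[term])) + 1)
            for term in set(doc)
        })
    return vectors

def polarity(tokens):
    """Crude stand-in for TextBlob polarity: share of 'distress' vocabulary."""
    return -sum(tok in NEGATIVE_TERMS for tok in tokens) / max(len(tokens), 1)

# Two hypothetical maintenance-log entries.
logs = ["Pump 3 overheating, burnt smell near the bearing.",
        "Routine check was OK, no issues on the line."]
tokens = [clean_log(entry) for entry in logs]
vectors = tfidf(tokens)
```

In the full pipeline the same two steps would typically be `TfidfVectorizer(max_features=50).fit_transform(logs)` plus `TextBlob(text).sentiment.polarity` per log; the point of the sketch is that failure-related vocabulary such as "overheating" receives a non-zero weight only in the documents where it occurs, which is what lets a downstream classifier exploit it.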
Table 1: IoT sensors and their functions

Device name      | Type                   | Measurement focus       | Description
ThermoTrack-B3   | Temperature sensor     | Temperature             | Tracks thermal conditions
SensorHub-A1     | Multi-sensor hub       | Combined set of sensors | Aggregates data from multiple sensor types
FlowMeter-E9     | Flow sensor            | Fluid/gas flow          | Monitors flow rate in pipelines
EnviroMon-D2     | Environmental monitor  | Temperature, humidity   | Measures ambient environmental parameters
VibeSense-C7     | Vibration sensor       | Vibration               | Detects mechanical vibrations in equipment
HumidityGuard-G8 | Humidity sensor        | Humidity                | Monitors moisture in the environment
TempControl-H6   | Temperature controller | Temperature regulation  | Controls and adjusts system temperature
PressurePro-F4   | Pressure sensor        | Pressure                | Measures internal system pressure

Following this, the processed text was transformed into numerical features using TF-IDF vectorization. A feature cap of 50 terms was selected based on exploratory analysis and dimensionality constraints, capturing the most informative tokens across the corpus. Additionally, sentiment polarity scores were extracted using the TextBlob library, providing a scalar measure of the emotional tone of technician comments (ranging from negative to positive sentiment). These scores capture implicit cues in the human-written logs that may reflect the urgency or seriousness of an issue.

1.5.2 Feature engineering

To explore the added value of integrating unstructured data, two distinct feature sets were constructed:

• Sensor-only features: the original numerical attributes from the IoT sensor data (temperature, vibration, humidity, and pressure), representing the conventional approach to PdM: x_i^sensor = [t_i, v_i, h_i, p_i] ∈ R^4.
• NLP-augmented features: a composite feature set combining the sensor data with the TF-IDF text vectors and sentiment polarity scores.
The goal was to enrich the numerical representation with linguistic cues and contextual signals embedded in maintenance logs: x_i^nlp = [t_i, v_i, h_i, p_i, s_i, w_i] ∈ R^(M+5), where s_i is the sentiment polarity score and w_i the M-dimensional TF-IDF vector.

Table 2: Raw sensor features

Feature name | Source device    | Unit  | Data type
Temperature  | ThermoTrack-B3   | °C    | Numeric
Vibration    | VibeSense-C7     | mm/s  | Numeric
Humidity     | HumidityGuard-G8 | %RH   | Numeric
Pressure     | PressurePro-F4   | kPa   | Numeric
Flow rate    | FlowMeter-E9     | L/min | Numeric

Table 2 presents the raw sensor features extracted from individual IoT devices, specifying their source, measurement units, and data types. All recorded features are numeric and represent key physical parameters.

Feature scaling, specifically standardization (z-score normalization), was applied to ensure uniform feature contributions, which is particularly important for distance-based and ensemble methods.

1.5.3 Model development and training

The data was partitioned into training and testing sets using a 70/30 split, with stratified sampling to maintain the proportional class distribution across failure types. Each model then learns a mapping f: x_i → ŷ_i ∈ {0, 1}.

The following ML algorithms were implemented and trained on both feature sets for comparative analysis:

• Logistic Regression, which served as the baseline due to its simplicity and interpretability (Çınar et al., 2020).
• Random Forest Classifier, to capture non-linear relationships and variable interactions (Wang et al., 2023).
• Gradient Boosting Machines (GBM), a sequential ensemble method known for its robustness (Samet, 2023).
• XGBoost, an optimized implementation of gradient boosting that offers superior regularization and performance (Panduman et al., 2024).

Figure 2 illustrates the model architecture, combining sensor data and NLP-derived features as inputs to machine learning models for predicting system failure.
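The feature fusion, standardization, and stratified 70/30 split described in this subsection can be sketched with the standard library alone. This is a simplified illustration, not the thesis implementation: in practice scikit-learn's StandardScaler and train_test_split(..., stratify=y) would typically do this work, and every numeric value and helper name below is invented for the example.

```python
import random
import statistics

def fuse(sensor, sentiment, tfidf_vec):
    """Build x_nlp = [t, v, h, p, s] + M TF-IDF terms, i.e. a vector in R^(M+5)."""
    return list(sensor) + [sentiment] + list(tfidf_vec)

def zscore(rows):
    """Standardize each feature column to zero mean and unit variance."""
    cols = []
    for col in zip(*rows):
        mu = statistics.fmean(col)
        sd = statistics.pstdev(col) or 1.0  # guard against constant columns
        cols.append([(v - mu) / sd for v in col])
    return [list(row) for row in zip(*cols)]

def stratified_split(y, test_frac=0.3, seed=42):
    """Index split that preserves the failure/non-failure class ratio."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for label in set(y):
        idx = [i for i, lab in enumerate(y) if lab == label]
        rng.shuffle(idx)
        cut = round(len(idx) * test_frac)
        test_idx += idx[:cut]
        train_idx += idx[cut:]
    return train_idx, test_idx

# Toy example: 4 sensor values + 1 sentiment score + a 3-term TF-IDF
# vector gives a fused feature vector in R^8 (here M = 3).
x_nlp = fuse([72.5, 0.31, 41.0, 101.3], -0.4, [0.23, 0.0, 0.11])
```

Stratifying the split matters for PdM data because failures are rare: a plain random 70/30 split can leave the test partition with almost no positive examples, whereas stratification guarantees both partitions keep the original failure ratio.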
Figure 2: Model architecture

1.5.4 Evaluation metrics

Each model was evaluated using a comprehensive set of classification metrics to assess performance from multiple dimensions:

• Accuracy: the overall proportion of correct predictions.
• Precision: the model's ability to avoid false positives (FP).
• Recall: sensitivity to actual failure cases.
• F1 score: the harmonic mean of precision and recall, well suited to imbalanced datasets.
• AUC-ROC: the area under the ROC curve, indicating the trade-off between the true positive rate (TPR) and the false positive rate (FPR).

Additionally, confusion matrix analysis was performed to identify patterns of misclassification and to understand each model's strengths and weaknesses across failure types.

1.6 Significance and contribution

This research confirms that integrating unstructured maintenance log data via NLP significantly enhances the predictive capacity of traditional IoT-based maintenance models. The application of TF-IDF and sentiment analysis adds contextual depth to the feature set, which is particularly effective in distinguishing critical failure patterns not captured in numerical data. The proposed methodology can be readily adapted to industrial settings, offering a scalable, data-driven solution for proactive asset maintenance.

Furthermore, the comparative analysis between sensor-only and NLP-enhanced models provides a quantitative justification for including textual data in future PdM systems. The results highlight not only the potential of NLP in industrial IoT but also underscore the need for cross-disciplinary integration of AI techniques to tackle complex operational challenges.

2 Literature Review

A review of prior studies indicates increasing use of IoT devices in PdM systems. In contrast to this project, many of these devices, such as the commonly referenced multi-sensor units SensorHub-A1, VibeSense-C7, PressurePro-F4, and HumidityGuard-G8, are used within comprehensive suites of smart devices alongside other units.
These modern condition-monitoring devices provide the granularity and frequency of data required for effective predictive analytics. This marks the advent of Industry 4.0 in industrial operations, integrating cyber-physical systems, IoT, big data analytics, and AI. In today's digital age, PdM has progressively replaced traditional reactive and preventive maintenance approaches. The real-time data and advanced analytics used in PdM help forecast equipment failures, thereby minimizing unplanned downtime and maximizing asset lifecycles.

At an early stage, PdM was based mainly on structured data produced by sensors monitoring parameters such as temperature, vibration, and pressure. The efficacy of these sensor-based models in predicting mechanical failures has been proven, and they have been widely applied in the manufacturing, aerospace, and energy sectors. However, as industrial systems continue to grow in complexity, it has become apparent that relying on quantitative sensor data alone is insufficient. Unstructured textual data, such as maintenance logs, technician notes, and inspection reports, can hold nuanced contextual information that sensor data does not provide.

In recent years, studies have examined integrating unstructured textual data into PdM to enhance predictive capability. For example, maintenance logs contain useful information about previous failures, repair actions, and expert observations, offering insight into equipment behaviour and failure patterns that goes beyond what sensors capture. NLP techniques provide the right tools to extract and analyse this qualitative textual information, allowing it to be integrated into predictive models.
Technically, this thesis formulates a hybrid approach in which NLP is integrated with traditional sensor analytics, the two collectively forming comprehensive and accurate predictive models. This approach is in line with Industry 4.0's broader focus on data-driven decision-making and on using a variety of data sources to increase performance. Such hybrid models also meet the current need to transition from non-context-aware PdM systems to more context-aware and interpretable ones that can adapt to the dynamic nature and complexity of modern industrial environments.

This chapter critically reviews the existing literature on PdM models, with emphasis on the evolution from traditional sensor-based approaches to hybrid approaches using NLP techniques. It analyses the methods used, the limitations encountered, and the progress made in incorporating information derived from unstructured textual data into PdM models. The chapter explores these developments to give a complete picture of current PdM research and points out research and improvement opportunities for the future.

2.1 Traditional predictive maintenance and sensor-based approaches

PdM is now a staple of modern industrial plant operations, intended to forecast equipment failures before they happen in order to reduce unplanned outages and improve maintenance scheduling. Traditional PdM models typically use structured data obtained from sensors measuring temperature, vibration, pressure, and similar parameters (Killeen et al., 2019). These sensor-based approaches have contributed immensely to the detection of early signs of equipment degradation, allowing timely interventions and extending asset lifecycles (Zantalis et al., 2019).
More recently, ML algorithms such as decision trees, support vector machines, and deep learning architectures have been applied on top of these models to boost their predictive capabilities. For example, deep learning models have been used to predict bearing failures from vibration signals (Liu et al., 2022), and support vector machines to predict pump failures from pressure and flow data (Shi-Nash & R. Hardoon, 2017). These advancements have improved the accuracy and reliability of PdM systems, especially in environments where sensor data are plentiful and well maintained (Liu et al., 2022).

Sensor-based PdM models have proven very successful, but they have limitations. They rely heavily on structured numerical data, which does not adequately capture the contextual nuances that precede or accompany equipment failures. Although operator behaviour, environmental conditions, and historical maintenance actions are of paramount importance to equipment performance, they rarely leave a trace in sensor data. According to Nota et al. (2022), overlooking these contextual elements undermines the completeness of diagnostics and the quality of maintenance decisions.

A further limitation is the lack of generalizability of sensor-based PdM models: models trained on particular datasets may not perform well across different operational contexts or types of equipment. This lack of adaptability points to the need for more robust and adaptable predictive models that can handle the varying and dynamic environments of the industrial world.

Explainability is yet another significant limitation of sensor-based models. Stakeholders in critical industries must be able to trust and act on PdM recommendations, which requires transparent and interpretable models.
According to Abidi et al. (2022), model interpretability is important because complex black-box models hinder user confidence and interfere with decision-making processes.

In view of these challenges, unstructured textual data such as maintenance logs and technician notes is increasingly seen as an important component of MPM frameworks. These textual records can deliver rich contextual information and deep insight into equipment behaviour and failure modes. PdM models can leverage NLP techniques to process these texts and extract meaningful features from them, gaining a better understanding of equipment health so that predictions and maintenance decisions rest on a more holistic picture.

In conclusion, traditional sensor-based PdM approaches have provided solid ground for the practice of PdM, but their inability to capture contextual information and provide explainable predictions creates the need for hybrid models that integrate structured sensor data with unstructured textual information. With such an integrative approach, PdM capabilities can be advanced, operational reliability improved, and maintenance strategies made more informed and effective.

2.2 Unstructured data and the case for NLP in maintenance

To date, PdM models have predominantly used structured sensor data, such as temperature, vibration, and pressure, to predict equipment failures. These models succeed in many scenarios, yet they are blind to the rich contextual information embedded in unstructured textual data such as maintenance logs, technician notes, and inspection reports. These textual records usually contain important knowledge about root-cause identification, technician intuition, and contextual observations that is neither fully measured by sensors nor easily quantified.
For example, phrases such as 'intermittent', 'burnt smell', or 'previously replaced' carry semantic value that cannot be captured from numerical data alone (P. U. Cadavid et al., 2021; Shen & Huang, 2024). Despite this potential, such textual records rarely see use: they are often archived without being analysed, or they exist in a chaotic, inconsistent format that makes them unsuitable for predictive models. Other factors, such as informal language, spelling errors, and domain-specific jargon, make processing and analysing these industrial narratives even more complex (Ponnambili et al., 2024; Rai et al., 2024).

Nevertheless, recent studies have shown that it is possible to incorporate textual features into PdM frameworks. For example, Akhbardeh et al. (2020) demonstrate that fault classification in aviation systems improves with TF-IDF representations and sentiment scores extracted from maintenance logs (Akhbardeh et al., 2020). Similarly, Yilmaz (2022) used NLP techniques to obtain latent failure indicators from technician comments in railway maintenance systems, reporting an increase in early fault detection rates (Hussain et al., 2020).

NLP techniques provide several advantages when integrated into PdM models. NLP enables the extraction of meaningful features from unstructured text, revealing patterns and insights not evident in sensor data. Sentiment analysis has been widely applied in consumer contexts, but it is also promising in technical maintenance as a surrogate for urgency and severity, as Mallioris et al. (2024) indicate. Even so, industrial narratives remain challenging to clean, standardize, and encode, as they contain informal language, spelling errors, and domain-specific jargon (Ponnambili et al., 2024; Rai et al., 2024).
Such challenges call for specialized NLP tools and techniques that deal specifically with the characteristics of maintenance text. Overall, integrating unstructured textual data through NLP has great potential to improve PdM models. NLP offers a way to capture the subjective observations inherent in maintenance logs and technician notes and to use them to build more accurate, context-rich PdM systems that improve operational reliability and efficiency.

2.3 Hybrid approaches: Fusing text with sensor data

Given the availability of structured sensor data and the advent of NLP, integrating sensor data with NLP-extracted features has proved a viable path to more comprehensive and accurate PdM models. With this hybrid approach, models benefit from both types of data, combining quantitative sensor measurements with qualitative content from textual maintenance records.

De Luca et al. (2023) showed that such a hybrid model, combining deep learning with sentiment-enhanced log features, can achieve an 11% improvement over sensor-only baselines. This enhancement highlights the importance of including technician narratives, which can contain details about equipment conditions and failure modes that sensors may miss (De Luca et al., 2023).

Ucar, Karakose, and Kırımça (2024) likewise used transformer-based models to encode log entries and combined them with time-series sensor data to predict faults in smart factories (Ucar et al., 2024). Their approach demonstrates the feasibility of using state-of-the-art NLP architectures to process and blend unstructured text data for more accurate and timely fault prediction in complex industrial environments.
Postiglione and Monteleone (2024) argue that such fusion strategies not only increase prediction accuracy but also aid maintenance prioritization based on contextual risk (Postiglione & Monteleone, 2024). PdM models that analyse both sensor data and textual logs can therefore produce more informed recommendations and more proactive maintenance, directing attention toward the most critical problems.

According to Bouabdallaoui et al. (2021), textual features play the role of a semantic bridge between physical failure symptoms and historical patterns. Integrating text data helps PdM models take in the context and history of equipment failures for better diagnostics and prognostics (Bouabdallaoui et al., 2021).

The fusion, however, is not easy. According to Kalusivalingam et al. (2020), it raises issues of feature dimensionality, temporal alignment, and model overfitting. When the dataset is limited, combining sensor data with textual features yields a high-dimensional input space that is prone to overfitting. In addition, the sensor data must be pre-processed so that its timestamps align with the irregular timestamps of the maintenance logs, so that the integrated data accurately represent the condition of the system over time (Kalusivalingam et al., 2020).

To overcome these challenges, researchers have proposed several remedies. Feature selection techniques can reduce dimensionality and eliminate redundant or irrelevant features, making the model more generalizable. Advanced preprocessing methods can provide temporal alignment so that the integrated dataset truly records the state of the system at each time point.
In conclusion, incorporating NLP-extracted features alongside sensor data in hybrid approaches holds great promise for improving PdM models. These models offer not only quantitative but also qualitative information, providing a more comprehensive view of equipment health, more accurate predictions, and more effective maintenance strategies.

2.4 Advances in NLP models for industrial applications

In recent years, the evolution of NLP models has had a major impact on many domains, including industry. Traditional NLP techniques, such as BoW and TF-IDF, opened the door for more sophisticated models such as Word2Vec and BERT. These advances created new possibilities in predictive analytics through more accurate and context-aware analysis of unstructured textual data.

Jyothirmai et al. (2024) show that BERT embeddings of textual work orders in oil refineries improve generalization across different asset types. Transformer-based models proved capable of extracting complex semantic relationships from maintenance logs, improving the predictive capability of the application (Jyothirmai et al., 2024).

In addition, researchers have extracted high-level failure modes from raw logs using NER, topic modeling, and semantic similarity matching. Ekundayo et al. (2024), for instance, employ these techniques to identify critical components and failure patterns in industrial systems, enabling more targeted maintenance strategies (Ekundayo et al., 2024). Similarly, Samet (2023) uses semantic similarity measures to map maintenance logs to historical failure data to increase fault prediction accuracy (Samet, 2023).

Yet industrial adoption of advanced NLP models is still in its infancy, and most of their use remains experimental and limited to controlled datasets.
Domain-specific terminology, informal language, and sparse data remain challenges that prevent such models from being widely implemented in real-world industrial settings.

On this basis, Valli (2024) argues that interpretable NLP methods are necessary in critical infrastructure and explains why explainable AI techniques are needed to increase user trust and model adoption (Valli, 2024). Interpretable models can give insights into the decision-making process, so that maintenance personnel can understand and act upon model predictions.

Likewise, Stanton et al. (2023) advocate the application of knowledge graphs and ontologies to map textual data to standardized maintenance taxonomies. These approaches place unstructured data within a formal framework so that it can be structured, improving data interoperability and enabling more uniform analysis across systems and organizations (Stanton et al., 2023).

In summary, although significant progress has been made in applying advanced NLP models to industrial maintenance, challenges of data quality, model interpretability, and domain adaptation remain. Realizing the full potential of PdM solutions in industrial environments requires addressing these challenges with robust, interpretable, and domain-specific NLP models.

2.5 Evaluation practices and real-world validation

The literature on NLP-augmented PdM models lags far behind real-world deployment and longitudinal validation. Simulation-based studies suggest potential, but few have been operationalized in live industrial settings (Boretti, 2024; Mohammed et al., 2023). This disparity highlights the hurdles in moving from theory to practice in complex industrial settings.
As Javaid (2024) emphasizes, the lack of established benchmarks and the inhomogeneity of log formats hinder replication and model portability. Maintenance logs are not uniform across industries and organisations, so developing generalized models requires specialized approaches for each specific context (Javaid, 2024).

Chikkudu and Annamalai (2025) suggest that PdM models be evaluated using confusion matrices, AUC-ROC, and cost-sensitive metrics. Together these metrics assess model performance from complementary angles: accuracy, discriminative ability, and the economic impact of misclassifications (Chikkudu & Annamalai, 2025).

In addition, comparative studies by ur Rehman et al. (2019) and Hasanuzzaman et al. (2025) both show that adding unstructured data usually improves the F1 score and reduces FNs, at the price of higher computational overhead (Hasanuzzaman et al., 2025; ur Rehman et al., 2019). This trade-off between model performance and resource requirements highlights the importance of solution approaches that balance accuracy and efficiency.

In summary, although much progress has been made in developing NLP-augmented PdM models, issues concerning real-world deployment, evaluation, and model validation remain. The success of any PdM strategy implemented in an industrial environment depends on addressing these challenges by providing established benchmarks, developing interpretable models, and incorporating sensitive, domain-specific factors.

2.6 Ethical, practical, and operational considerations

Deploying NLP-augmented PdM models requires addressing ethical and operational dimensions besides the technical ones. These considerations are important to ensure that such systems are fair, transparent, and in line with human values.
Bias and Data Quality

As Shi-Nash and Hardoon (2017) indicate, logs that are biased or incomplete can lead to skewed predictions, especially in systems with uneven access to high-quality maintenance documentation. Such biases result in unfair maintenance recommendations, disadvantaging some asset types or operational contexts more than others. Careful data curation is therefore important: training datasets must be curated so that underrepresented scenarios can be identified and mitigated (Shi-Nash & R. Hardoon, 2017).

2.6.1 Human-in-the-Loop (HITL) Approaches

Wellsandt et al. (2022) recommend human-in-the-loop approaches in which expert feedback continuously improves NLP models. This iterative process folds domain expertise into the model, making it more accurate and flexible in use (Wellsandt et al., 2022). Nonetheless, building HITL systems requires efficient feedback mechanisms and interfaces so that human operators can interact seamlessly with the ML models. It is also important to create protocols that guarantee the quality and reliability of human input, to prevent new kinds of bias or error from entering the system.

2.6.2 Privacy and Data Governance

Obtaining logs from privacy- and governance-sensitive systems requires robust anonymization and compliance traceability (Khattab & Youssry, 2020; Lowin, 2024). Stringent data protection is needed to guard against unauthorized access and misuse. Respecting legal data protection regulations, such as GDPR, is likewise essential for accountability and for trust with stakeholders.

2.6.3 Transparency and Accountability

To increase the trust of users and stakeholders, it is important to guarantee transparency in model decision-making processes.
Implementing explainable AI techniques is therefore an appealing way to gain better understanding and acceptance of how models arrive at their predictions. Additionally, accountability structures need to be clearly established to define responsibility and to tackle possible problems in model outputs.

2.6.4 Operational Challenges

Deploying NLP-augmented PdM systems in practice raises system integration, real-time processing, and scalability issues. These systems must work within the confines of existing infrastructure and workflows. In addition, continuous monitoring and maintenance are needed to adapt to evolving operational conditions and to sustain performance over time.

Integrating NLP techniques into PdM brings significant benefits, but the ethical, practical, and operational implications must be handled with care to produce systems that are fair, transparent, and effective. Proactively addressing these considerations will allow successful adoption and long-term sustainability of NLP-augmented PdM solutions.

2.7 Summary of Gaps and Opportunities

The limited body of work on the value of textual intelligence for PdM revolves around the following limitations:

• Limited Real-World Validation: Few current studies involve industrial deployment and temporal generalization (Shamayleh et al., 2020). This gap emphasizes the necessity of thorough field testing and longitudinal studies to measure the usefulness and applicability of PdM models in changing industrial environments.

• Underdeveloped Domain-Specific NLP: Few models account for domain jargon and inconsistencies in technical language (Rojas et al., 2025). The diversity of terminology across industries necessitates domain-specific NLP models that properly process and understand the domain on which the PdM system is based, to guarantee reliable results.

• Data Fusion Complexity: While Compare et al.
(2019) demonstrate such feature merging, it must be accompanied by carefully pre-processed, aligned, and noise-handled sensor and textual features. Ensuring the consistency and quality of heterogeneous data sources is important but difficult, and it challenges efforts to build integrated PdM models (Compare et al., 2020).

• Model Explainability: Black-box models hinder adoption in safety-critical industries (Ayvaz & Alpay, 2021). The erosion of stakeholder trust caused by opaque decision-making is a major motivation for explainable AI techniques that reveal what is going on inside the model and make its predictions understandable.

In their 2024 study, Uçar, Karakose, and Kırımça provide a comprehensive review of artificial intelligence applications in predictive maintenance (PdM), emphasizing key components, trustworthiness, and future trends. The authors discuss the integration of AI technologies into PdM, highlighting challenges such as the need for real-world validation and the importance of trustworthiness in AI systems. They also explore emerging areas like digital twins, generative AI, and the Industrial Internet of Things (IIoT). However, the study primarily offers a high-level overview and lacks detailed methodological insights, particularly concerning specific model types and feature engineering techniques. This contrasts with our approach, which implements concrete models like XGBoost and incorporates feature engineering methods such as TF-IDF vectorization and sentiment analysis using TextBlob (Ucar, Karakose, & Kırımça, 2024). By focusing on practical implementation and evaluation, our work addresses some of the limitations identified in Uçar et al.'s review, particularly the need for real-world validation and detailed methodological frameworks. De Luca et al.
(2023) propose a deep attention-based approach for PdM in IoT scenarios, leveraging a multi-head attention mechanism to achieve efficient and effective predictions. Their model demonstrates competitive performance on the NASA dataset, with advantages in parameter efficiency and training time compared to traditional LSTM models. However, the study's reliance on a specific dataset and the absence of real-world deployment scenarios limit its generalizability (De Luca et al., 2023). In contrast, our methodology combines sensor data with NLP features derived from maintenance logs, utilizing models like XGBoost to achieve high accuracy and F1 scores. While our approach does not incorporate advanced architectures like transformers, it emphasizes interpretability and practical applicability in industrial settings. By comparing our methods with those of De Luca et al., we highlight the trade-offs between model complexity and real-world feasibility, underscoring the importance of adaptable and interpretable models in PdM applications.

The integration of NLP with sensor data thus remains a powerful yet underutilized frontier for sensor-based PdM. Future research should develop interpretable, deployable, and robust hybrid models that can be validated across a wide range of industrial contexts. Resolving these challenges will enable the transition from theoretical models to practical real-world solutions, increasing the effectiveness and adoption of PdM strategies.

3 Methodology

This chapter presents the complete methodology for developing a PdM system that integrates IoT sensor data with maintenance logs with the help of NLP. The approach aims to predict equipment failure within a 7-day window from both structured sensor data and unstructured text.
3.1 Dataset overview

The primary dataset in this study, the "iot_predictive_maintenance_dataset", comes from Kaggle, a large repository of data frequently employed in PdM-related research. It combines detailed maintenance logs with time-series sensor measurements, enabling a comprehensive view of equipment failure prediction (Samudrala, 2022).

This dataset contains simulated data representing real-time monitoring of various industrial equipment, including turbines, compressors, and pumps. Each row in the dataset corresponds to a unique observation capturing key parameters such as temperature, pressure, vibration, and humidity. The dataset also includes the equipment type, location, and whether the equipment is classified as faulty (Samudrala, 2022).

Figure 3: Dataset pre-processing

Data were acquired through a total of 27 deployed IoT devices (among them ThermoTrack-B3, EnviroMonitor-D2, and VibeSense-C7), as shown in figure 3. These devices recorded the metrics at regular intervals and fed the data into a centralized system for preprocessing. Devices such as TempControl-H6 and FlowMeter-E9 provided specialized measurements needed to capture early signals of equipment stress, not necessarily degradation.

3.1.1 Sensor data

The dataset encompasses time-series readings from various sensors, including:

• Temperature: Monitors the operating temperature of equipment, which can indicate overheating or cooling-system failures.
• Vibration: Detects mechanical imbalances, misalignments, or bearing failures through vibration patterns.
• Humidity: Assesses moisture levels that may affect electrical components or promote corrosion.
• Pressure: Measures pressure variations that could signify blockages or leaks in fluid systems.
Sensor readings are acquired at regular intervals, capturing the temporal sequence of equipment behaviour under normal and stressed conditions, as shown in figure 5. Time-series analysis and ML techniques then allow imminent failures to be predicted at a highly desirable granularity and frequency.

3.1.2 Maintenance logs

Accompanying the sensor data are textual maintenance logs that document:

• Equipment Conditions: Narratives describing the operational state of machinery during inspections.
• Repairs and Replacements: Records of parts replaced, maintenance actions taken, and downtime incidents.
• Anomalies and Alerts: Descriptions of observed irregularities or deviations from standard operating procedures.

The entries in these logs are unstructured textual information describing contextual factors that affect equipment health. Extracting actionable features from such unstructured data requires powerful NLP techniques.

3.1.3 Objectives

The overall aim is to build predictive models that forecast equipment failures within a 7-day window. The study integrates structured sensor data with features extracted from the maintenance logs via NLP to improve the accuracy and reliability of failure prediction. This is in line with current developments in PdM, which have demonstrated success in improving model performance by joining sensor data with text. Figure 4 and figure 5 show the failure rate distribution across different IoT devices, with HumidityGuard-G8 exhibiting the highest failure rate of all.

Figure 4: Failure rate per device

3.1.4 Dataset limitations

Although the dataset provides a strong base for PdM model building, certain limitations must be acknowledged. The dataset exhibits a significant class imbalance, with 123 failure instances and 877 non-failure instances, a failure-to-non-failure ratio of 0.1403.
This imbalance may bias predictive models toward the majority class, potentially reducing the model's ability to detect failure events and hurting overall performance, particularly recall for the minority class.

• Data Imbalance: Equipment failures are relatively rare compared to normal operations, leading to class imbalance that can bias model performance.
• Missing Values: Sensor readings may have missing entries due to equipment downtime or data transmission errors, necessitating imputation or other handling strategies.
• Noisy Data: Sensor measurements can be susceptible to noise, requiring preprocessing techniques to ensure data quality.
• Inconsistent Log Entries: Maintenance logs may vary in detail and format, posing challenges for standardization and feature extraction.

Developing robust predictive models that generalize well to real industrial settings requires addressing these limitations.

3.1.5 Relevance to industry

The integration of sensor data with NLP-enhanced features is particularly pertinent for manufacturing, energy, and transportation industries undergoing digital transformation. Using PdM models, organizations can shift from reactive to proactive maintenance plans, reducing downtime, improving resource utilisation, and extending equipment life. The central idea of Industry 4.0 is exactly such a paradigm shift toward data-driven decision making that improves operational efficiency and boosts competitiveness.

3.2 Data pre-processing

Data pre-processing is an essential part of ML applications. The steps applied to clean the data are explained here.

3.2.1 Sensor data

The sensor data underwent standard preprocessing steps to ensure quality and consistency:

• Handling Missing Values: Imputation techniques such as forward filling or interpolation were used to account for missing sensor readings and preserve the integrity of the time-series data.
• Normalization: Readings were normalized to a standard scale to aid the convergence of ML algorithms and to accommodate sensors with different units.

3.2.2 Maintenance log

Because the maintenance logs are unstructured text, they required significant processing to extract meaningful features:

• Text Cleaning: The logs were cleaned by removing punctuation and numbers and converting text to lowercase to standardize the content.

log_i^clean = Clean(log_i)

• Tokenization and Stopword Removal: Text was tokenized into words, and common stop words were removed to focus on significant terms.

• TF-IDF Vectorization: The textual data were converted into numerical features using a Term Frequency-Inverse Document Frequency (TF-IDF) vectorizer, which provides a quantitative measure of a word's importance relative to the entire corpus, as seen in table 3 below.

w_i = φ(log_i^clean) ∈ R^(M+5)

• Sentiment Analysis: Logs were fed into the TextBlob library to perform sentiment analysis on the maintenance descriptions and extract polarity scores representing the sentiment expressed.

s_i = SentimentPolarity(log_i^clean)

Table 3: Sample TF-IDF terms from maintenance logs

Term        TF-IDF Score
overheat    0.412
leak        0.356
shutdown    0.301
vibration   0.245
alarm       0.263

3.3 Feature engineering

Two distinct feature sets were constructed to study how incorporating NLP-enhanced features affected the results:

• Sensor-Only Features: This set comprised the raw sensor readings (temperature, vibration, humidity, and pressure) after preprocessing.
• NLP-Enhanced Features: This set included, in addition to the sensor data, the TF-IDF vectors and sentiment scores from the maintenance logs, giving a richer representation of the condition of the equipment.

Figure 5 displays the average sensor readings and corresponding 7-day failure rates for each IoT device, highlighting variability in both measurements and reliability.
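The maintenance-log pipeline of section 3.2.2 (cleaning, tokenization, stopword removal, TF-IDF weighting, and polarity scoring) can be sketched in pure Python. The stopword list, polarity lexicon, and example logs below are illustrative assumptions: in practice the thesis uses scikit-learn's TfidfVectorizer and TextBlob's `TextBlob(text).sentiment.polarity`, for which the tiny hand-made lexicon here is only a stand-in.

```python
import math
import re

STOPWORDS = {"the", "is", "a", "an", "of", "and", "on", "in", "to"}
# Toy polarity lexicon standing in for TextBlob's sentiment model (assumption).
POLARITY = {"inconsistent": -0.5, "failure": -0.8, "overheating": -0.7,
            "normal": 0.3, "stable": 0.4}

def clean(log):
    """Text cleaning: lowercase, strip punctuation and digits."""
    return re.sub(r"[^a-z\s]", "", log.lower())

def tokenize(log):
    """Tokenization followed by stopword removal."""
    return [t for t in clean(log).split() if t not in STOPWORDS]

def tfidf(corpus):
    """Term frequency weighted by inverse document frequency, per document."""
    docs = [tokenize(d) for d in corpus]
    vocab = {t for doc in docs for t in doc}
    n = len(docs)
    idf = {t: math.log(n / sum(t in doc for doc in docs)) for t in vocab}
    return [{t: doc.count(t) / len(doc) * idf[t] for t in doc} for doc in docs]

def polarity(log):
    """Stand-in for TextBlob(log).sentiment.polarity: mean lexicon score."""
    toks = tokenize(log)
    return sum(POLARITY.get(t, 0.0) for t in toks) / len(toks) if toks else 0.0

logs = [
    "Pressure readings inconsistent.",
    "Vibration levels normal and stable.",
    "Unexpected shutdown after overheating.",
]
vectors = tfidf(logs)                  # sparse TF-IDF dicts, one per log
sentiments = [polarity(l) for l in logs]  # one polarity score per log
```

Each log thus becomes a TF-IDF vector plus a polarity score, which is exactly the shape of the NLP-enhanced feature block that is later concatenated with the sensor readings.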
Figure 5: Sensor readings and failure rate of devices

3.4 Model development

Several ML models were developed and evaluated in order to choose the most appropriate model for predictive maintenance:

• Logistic Regression: A baseline linear classifier implemented to provide a reference performance level.

f(x) = σ(w^T x + b)

• Random Forest: An ensemble method that constructs multiple decision trees to improve classification accuracy.

f(x) = majority(T_1(x), ..., T_K(x))

• Gradient Boosting: A boosting technique that builds models sequentially, each new model correcting errors made by the previous ones.

f(x) = Σ_{k=1}^{K} α_k T_k(x)

• XGBoost: An optimized gradient boosting algorithm designed to be very fast while performing very well.

L = Σ_{i=1}^{n} l(y_i, ŷ_i) + Σ_k Ω(T_k)

To assess the contribution of textual information to predictive accuracy, each model was trained and evaluated with both the sensor-only and the NLP-enhanced feature sets.

3.5 Model evaluation metrics

To comprehensively assess model performance, the following metrics were utilized:

• Accuracy: The proportion of correctly predicted instances among all instances.

Accuracy = (1/n) Σ_{i=1}^{n} 1(ŷ_i = y_i)

• Precision: The proportion of true positive predictions among all positive predictions.
• Recall: The proportion of true positive predictions among all actual positives.
• F1 Score: The harmonic mean of precision and recall, balancing the two.
• Confusion Matrix: A matrix that visualizes classification performance by showing the true positives, FPs, true negatives, and FNs.
• ROC AUC Curve: A graphical representation of the model's ability to distinguish between classes, with the AUC indicating performance.

3.6 Findings of preliminary methodology

The models were evaluated and showed that incorporating NLP-enhanced features improved predictive
performance:

• XGBoost with NLP-Enhanced Features achieved an accuracy of 97.3%, an F1 score of 0.889, and a ROC AUC of 0.993, outperforming the other models.
• Gradient Boosting with NLP-Enhanced Features also demonstrated strong performance, with an accuracy of 97.3% and an F1 score of 0.879.

Models trained solely on sensor data showed lower recall for identifying true positive failures. This reinforces the benefit of integrating unstructured textual data with conventional sensor measurements to augment PdM systems.

The integration of NLP techniques with sensor data offers several advantages:

• Enhanced Predictive Accuracy: Incorporating sentiment analysis and TF-IDF features from maintenance logs adds extra context, improving the model's capability to predict failures.
• Interpretability: Sentiment scores allow a qualitative understanding of the equipment's condition, facilitating actionable insights for maintenance teams.
• Scalability: The approach can be applied across various industrial domains, adapting to different types of sensor data and maintenance logs.

However, challenges remain:

• Data Quality: NLP is only effective if the maintenance logs are consistent and of sufficient quality and validity, which is not the case in all systems and organizations.
• Model Complexity: Adding NLP features increases the dimensionality of the feature space, potentially leading to overfitting if not properly managed.

Operational and ethical points regarding PdM systems that utilize NLP include:

• Data Privacy: Maintenance logs may contain sensitive information; anonymization and compliance with data protection regulations are essential.
• Bias in Data: Incomplete or biased maintenance logs can lead to skewed predictions, necessitating careful data curation and validation.
• Human-in-the-Loop: Incorporating feedback from maintenance personnel can enhance model accuracy and ensure that predictions align with practical experience.

This methodology demonstrates the effectiveness of combining NLP-augmented maintenance-log features with an IoT sensor dataset to predict equipment failure. The unstructured textual information provides valuable context that improves predictive accuracy, marking a promising direction for advanced PdM systems. Future research should address the challenges related to data quality and model complexity to improve the effectiveness and applicability of these systems in real-world industrial settings.

4 Case study – Predictive Maintenance using Sensor and Maintenance Log Data with NLP

PdM has become a must for industrial operations, intended to foresee when equipment will malfunction. This case study explores the use of ML models to estimate the probability of equipment failure by combining IoT sensor data with the equipment's maintenance history, augmented with NLP techniques. The main objective is to determine whether integrating structured sensor data with unstructured textual data provides better predictions.

4.1 Dataset overview

The dataset used in this study, the "iot_predictive_maintenance_dataset", is sourced from Kaggle. It is composed of time-series sensor data and associated maintenance logs. The sensor data consists of readings of, for instance, temperature, vibration, humidity, and pressure, taken at regular intervals. The textual descriptions in the maintenance logs describe equipment conditions, repairs, and anomalies that have occurred during maintenance activities.
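The class balance figures reported for this dataset (123 failures against 877 non-failures, a ratio of 0.1403) can be reproduced in a few lines. The `scale_pos_weight` value shown is only a common counterweight for imbalance, e.g. as accepted by XGBoost's `XGBClassifier`; the thesis does not state whether such weighting was actually applied, so it is illustrative.

```python
# Class counts as reported for the dataset.
n_failures, n_non_failures = 123, 877
n_total = n_failures + n_non_failures

failure_ratio = n_failures / n_non_failures   # failure-to-non-failure ratio
failure_rate = n_failures / n_total           # share of the positive class

# A common counterweight for imbalance (illustrative, not from the thesis):
# passed as scale_pos_weight to xgboost.XGBClassifier, it upweights positives.
scale_pos_weight = n_non_failures / n_failures
```

Rounding `failure_ratio` to four decimals recovers the 0.1403 quoted in section 3.1.4, a useful sanity check when re-deriving dataset statistics.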
Analysis of the Kaggle dataset highlights several key aspects. The dataset contains a total of 1,000 samples, with 123 instances labelled as failures occurring within seven days, information that is crucial for understanding class distribution in predictive modelling. Regarding the maintenance logs, figure 6 shows that the average entry length is approximately 37 characters, or 4.66 words, indicating brief textual descriptions. Language detection reveals that all logs are written in English ('en'), and the content shows low variability, with only 10 unique log entries across the dataset. The five most frequent maintenance logs alone account for over half the dataset, with entries such as "Pressure readings inconsistent." appearing 123 times. These findings underscore the need for a more detailed and well-cited dataset description to support transparency, reproducibility, and meaningful downstream analysis.

Figure 6: Dataset Description Report

At this stage, harmonized data from multiple devices were fed into the analysis. SensorHub-A1 served as the primary aggregator of multisensory input features, while individual devices provided specific insights, such as HumidityGuard-G8 for humidity trends and PressurePro-F4 for real-time pressure. The structured numerical data produced by these devices were used alongside NLP-transformed maintenance logs to train and evaluate the ML models.

The problem is to predict whether a failure occurs in the next seven days. The task is approached as a binary classification problem, with both sensor-only and NLP-enhanced feature sets developed from the sensor data and maintenance logs. Integrating these data sources aims to achieve a better understanding of equipment health, which may result in more accurate predictions.
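The log statistics quoted above (average entry length, number of unique entries, top frequencies) can be reproduced with a few lines of standard-library Python. The toy corpus below only echoes the style of the real entries; its counts are not the thesis figures.

```python
from collections import Counter

# Toy maintenance-log corpus echoing the style of the entries described above.
logs = [
    "Pressure readings inconsistent.",
    "Pressure readings inconsistent.",
    "Vibration levels above threshold",
    "Routine inspection completed",
    "Pressure readings inconsistent.",
]

# Average length in characters and words, plus frequency of each unique entry.
avg_chars = sum(len(log) for log in logs) / len(logs)
avg_words = sum(len(log.split()) for log in logs) / len(logs)
counts = Counter(logs)

print(f"Average length: {avg_chars:.1f} chars, {avg_words:.2f} words")
print(f"Unique entries: {len(counts)}")
print("Most common:", counts.most_common(1))
```

Run over the full corpus, these few numbers already reveal the low textual variability noted above, which limits how much signal TF-IDF can extract.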
4.2 Data pre-processing

Data pre-processing is a critical step in preparing raw data for machine learning applications, particularly in predictive maintenance, where both sensor readings and textual information are often heterogeneous and noisy. This step ensures that the data are consistent, structured, and representative of the underlying equipment behaviour. The pre-processing pipeline in this study addresses both numerical sensor streams and associated textual records, each requiring distinct techniques to maximize their predictive potential.

4.2.1 Sensor data processing

All sensor data were pre-processed to make them suitable for ML models:
• Handling Missing Values: Missing sensor readings were imputed using interpolation techniques to maintain the continuity of the time-series data.
• Normalization: Sensor readings were normalized to a standard scale to prevent features with larger ranges from dominating model training.
• Feature Engineering: Statistical features such as the mean, standard deviation, and skewness were extracted from the raw sensor data to represent the underlying patterns of equipment health.

4.2.2 Maintenance log processing

The maintenance logs, being unstructured text, required extensive preprocessing:
• Text Cleaning: Punctuation, numbers, and irrelevant characters were removed to standardize the text.
• Tokenization and Lemmatization: Text was tokenized into words, and words were lemmatized to their base forms to reduce dimensionality.
• TF-IDF Vectorization: The text was converted with the TF-IDF method into numerical features that reflect the importance of words relative to the whole corpus.
• Sentiment Analysis: Sentiment scores were derived using TextBlob to gauge the emotional tone of the maintenance logs, providing additional context for the textual data.
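A condensed sketch of both preprocessing branches is given below, using pandas interpolation plus scikit-learn's StandardScaler and TfidfVectorizer. The lemmatization step (e.g., with NLTK) and the TextBlob sentiment call are indicated only in comments, and all data are illustrative.

```python
import re

import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler

# --- Sensor branch: interpolate gaps, then normalize (illustrative values) ---
sensors = pd.DataFrame({
    "temperature": [70.1, np.nan, 72.3, 71.0],
    "vibration":   [0.40, 0.42, np.nan, 0.47],
})
sensors = sensors.interpolate()                  # fill missing readings linearly
X_sensor = StandardScaler().fit_transform(sensors)

# --- Text branch: clean, then TF-IDF vectorize (illustrative logs) ---
logs = [
    "Pressure readings inconsistent.",
    "Bearing noise detected during inspection!",
    "Routine check, no issues found (unit 3).",
    "Vibration levels above threshold",
]
cleaned = [re.sub(r"[^a-z\s]", "", log.lower()) for log in logs]  # strip punctuation/digits
X_text = TfidfVectorizer().fit_transform(cleaned)

# In the thesis pipeline, sentiment polarity comes from TextBlob, e.g.:
#   from textblob import TextBlob
#   polarity = [TextBlob(log).sentiment.polarity for log in logs]

print(X_sensor.shape, X_text.shape)
```

The two branches yield matrices with one row per observation, which makes the later concatenation into a combined feature set straightforward.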
4.3 Model development

Following the data pre-processing phase, predictive models were developed to detect and anticipate equipment failures using both numerical sensor data and unstructured textual inputs. The model development process involved careful selection of relevant features, algorithm choice, and training procedures tailored to the nature of the available data. To evaluate the contribution of textual data, separate models were trained using sensor-only features and a combined feature set that integrated insights extracted from maintenance logs. This dual approach enabled comparative performance analysis and highlighted the added value of incorporating contextual information in predictive maintenance tasks.

4.3.1 Feature set

Two distinct feature sets were developed for model training:
• Sensor-Only Features: This set included only the pre-processed numerical sensor readings.
• NLP-Enhanced Features: In addition to the sensor data, this set incorporated the TF-IDF vectors and sentiment scores derived from the maintenance logs, aiming to capture both quantitative and qualitative aspects of equipment health.

4.3.2 Model selection

Several ML models were evaluated to determine the most effective approach for predicting equipment failures:
• Logistic Regression: A baseline linear model used for comparison.
• Random Forest: An ensemble method that constructs multiple decision trees to improve predictive performance.
• Gradient Boosting: A boosting technique that combines the predictions of several base learners to reduce bias and variance.
• XGBoost: An optimized implementation of gradient boosting that provides regularization to prevent overfitting.

4.3.3 Model evaluation

Models were assessed using a 70-30 train-test split, as stated in table 4, ensuring that the evaluation metrics were based on unseen data. Performance was measured using:
• Accuracy: The proportion of correct predictions.
• Precision: The proportion of true positive predictions among all positive predictions.
• Recall: The proportion of true positive predictions among all actual positives.
• F1 Score: The harmonic mean of precision and recall, providing a balance between the two.
• ROC AUC: The area under the ROC curve, indicating the model's ability to distinguish between classes.

Table 4: Train-test split overview

Dataset type   Samples   Class 0 (no failure)   Class 1 (failure)
Training set   1,750     1,320                  430
Test set       750       570                    180

4.4 Summary

The models' performance varied based on the feature sets utilized:
• Sensor-Only Models: These models showed moderate performance, with XGBoost achieving an accuracy of 90.7% and an F1 score of 0.563, as shown in the confusion matrix in figure 7. However, they struggled to detect failures, as indicated by lower recall values.

Figure 7: Confusion matrix of sensor-only XGBoost

• NLP-Enhanced Models: Incorporating NLP features significantly improved model performance. The best-performing model, XGBoost with NLP features, achieved an accuracy of 97.3%, an F1 score of 0.889, and a ROC AUC of 0.993. This indicates that the maintenance log data provide valuable information that enhances predictive accuracy.

The confusion matrix for the XGBoost model with NLP features, shown in figure 8 below, revealed a high TPR and a low FPR, underscoring the model's reliability in predicting equipment failures.

Figure 8: XGBoost confusion matrix for the combined sensor and textual feature set

4.5 Insights and implications

Integrating structured sensor data with unstructured maintenance logs provides a more holistic understanding of equipment health. The enhanced models showed that textual data carry latent information that can improve PdM outcomes when properly processed. However, maintenance log quality and consistency still require attention.
Variable log formats and subjective descriptions can add noise to the data. Future work should standardize log formats and apply more sophisticated NLP techniques to make textual data more useful.

This case study shows that IoT sensor data, combined with NLP-enabled maintenance logs, can help predict equipment failures more accurately. This suggests that much can be gained from holistic approaches that consider sensor data both with and without qualitative comments. However, the continued adoption of such integrated systems in industry will require ongoing research and development to address current challenges and optimize PdM strategies.

5 Results and discussion

In this chapter, the results obtained from each ML model on both the sensor-only and NLP-enhanced datasets are presented in a comprehensive manner. The models' performance, key metrics, and confusion matrices are summarized, and the most effective model for predicting equipment failure within a 7-day period is identified. The results are then discussed in the context of their broader impact and limitations.

Specific devices tracked by the sensor data exhibited characteristic patterns. For example, ThermoTrack-B3 often registered small thermal variations preceding failure events, while VibeSense-C7 captured peculiar vibration signatures coinciding with known mechanical faults. Devices such as EnviroMon-D2 contributed multi-dimensional environmental data, underscoring the value of heterogeneous device integration in building robust predictive systems.

5.1 Model performance

In this chapter, we evaluate four ML models (Logistic Regression, Random Forest, Gradient Boosting, and XGBoost) on two types of features: traditional sensor data and features enhanced through NLP on the maintenance logs.
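The two-feature-set comparison described above can be sketched as follows. The data are synthetic stand-ins (four "sensor" columns versus a wider matrix playing the role of the NLP-combined set), and the three scikit-learn models stand in for the thesis setup; XGBoost itself comes from the separate xgboost package and is omitted here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in (~12% positives): first 4 columns act as
# "sensor" features, the full matrix as the "NLP-combined" feature set.
X, y = make_classification(n_samples=1000, n_features=10, n_informative=8,
                           weights=[0.88], random_state=42)
X_sensor, X_combined = X[:, :4], X

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(random_state=42),
}

results = {}
for set_name, feats in [("sensor-only", X_sensor), ("combined", X_combined)]:
    X_tr, X_te, y_tr, y_te = train_test_split(feats, y, test_size=0.3,
                                              stratify=y, random_state=42)
    for model_name, model in models.items():
        model.fit(X_tr, y_tr)
        proba = model.predict_proba(X_te)[:, 1]
        pred = (proba >= 0.5).astype(int)
        results[(model_name, set_name)] = {
            "accuracy": accuracy_score(y_te, pred),
            "f1": f1_score(y_te, pred),
            "roc_auc": roc_auc_score(y_te, proba),
        }

for key, metrics in results.items():
    print(key, {k: round(v, 3) for k, v in metrics.items()})
```

The nested loop mirrors the evaluation protocol of the study: every model is trained and scored on both feature sets under the same 70-30 split, so the metric differences are attributable to the features rather than the split.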
Different performance metrics, including accuracy, precision, recall, F1 score, and ROC AUC, were used to assess the models; a graphical comparison of model accuracy is shown in figure 9.

Figure 9: Model comparison w.r.t. accuracy

5.1.1 Training time

The training times for each model were recorded, shown in figure 10, to assess their computational efficiency and scalability. Among the models using the sensor-only feature set, Logistic Regression was the fastest with a training time of 0.0164 seconds, while Random Forest and Gradient Boosting required 0.2859 and 0.2647 seconds respectively. XGBoost demonstrated a good balance between speed and performance, training in 0.0557 seconds. When using the NLP-combined feature set, all models experienced increased training times, with Gradient Boosting taking the longest at 0.3362 seconds. Despite this increase, Logistic Regression remained relatively efficient (0.1161 seconds), and XGBoost continued to offer a favorable trade-off with a training time of 0.2027 seconds. These results provide valuable insights into the scalability of each model when applied to larger or more complex datasets.

Figure 10: Training time of each model

Accuracy, precision, recall, F1 score, and ROC AUC were used to evaluate the models. Two feature sets were used to test performance: (1) raw sensor data (temperature, humidity, vibration, pressure) and (2) raw sensor data plus NLP features such as TF-IDF vectors and sentiment polarity from the maintenance logs.

Table 5 compares the ROC AUC scores of the different models using sensor-only features versus combined sensor and NLP features, showing notable performance improvements across all models with the addition of NLP data.
Table 5: Comparative ROC AUC scores

Model                 Sensor only   NLP combined
XGBoost               0.86          0.993
Gradient Boosting     0.89          0.987
Random Forest         0.85          0.962
Logistic Regression   0.81          0.900

Table 6 provides a comprehensive comparison of model performance across the different feature sets, revealing that models trained on combined sensor and NLP features consistently outperform those using sensor data alone in all key metrics.

Table 6: Overall comparison of models

Model                 Feature set    Accuracy   Precision   Recall   F1 Score   ROC AUC
XGBoost               NLP Combined   0.973      0.91        0.86     0.889      0.993
Gradient Boosting     NLP Combined   0.973      1.00        0.78     0.879      0.991
Random Forest         NLP Combined   0.940      1.00        0.51     0.679      0.980
Logistic Regression   NLP Combined   0.893      1.00        0.14     0.238      0.950
Gradient Boosting     Sensor Only    0.927      0.86        0.49     0.521      0.950
Random Forest         Sensor Only    0.917      0.75        0.49     0.590      0.940
XGBoost               Sensor Only    0.907      0.67        0.49     0.563      0.920
Logistic Regression   Sensor Only    0.877      0.00        0.00     0.000      0.800

5.1.2 Logistic regression

5.1.2.1 Sensor-Only Feature Set

Logistic Regression performed the worst among all models using raw sensor data. As shown in figure 11, it achieved an accuracy of 0.877 but failed to detect any failure cases (F1 score = 0.000). This underscores its limitations as a linear classifier in high-dimensional, non-linear environments (Kotsiantis, Zaharakis & Pintelas, 2007).

Figure 11: Logistic regression sensor-only model

5.1.2.2 NLP-Enhanced Feature Set

The inclusion of NLP features boosted Logistic Regression's accuracy to 0.893, as shown in figure 12. Precision rose to 1.00; however, recall remained critically low at 0.14, resulting in a weak F1 score of 0.238. This indicates that while the model avoided FPs, it still missed most actual failures, limiting its reliability (Sebastiani, 2002).
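A recall deficit of this kind is typical of an unweighted linear model on imbalanced data. Two standard remedies, not used in the thesis but worth noting, are class re-weighting and lowering the decision threshold; the sketch below demonstrates both on synthetic imbalanced data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced data (~12% positives, echoing the failure-class share).
X, y = make_classification(n_samples=1000, n_features=8, n_informative=6,
                           weights=[0.88], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y,
                                          random_state=0)

# Remedy 1: re-weight classes so each missed failure carries more loss.
balanced = LogisticRegression(class_weight="balanced", max_iter=1000)
balanced.fit(X_tr, y_tr)

# Remedy 2: keep the plain model but lower the decision threshold from 0.5.
plain = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
low_threshold_pred = (plain.predict_proba(X_te)[:, 1] >= 0.25).astype(int)

plain_recall = recall_score(y_te, plain.predict(X_te))
balanced_recall = recall_score(y_te, balanced.predict(X_te))
low_thr_recall = recall_score(y_te, low_threshold_pred)
print("plain recall:        ", plain_recall)
print("balanced recall:     ", balanced_recall)
print("low-threshold recall:", low_thr_recall)
```

Lowering the threshold can only add positive predictions, so recall never decreases; the cost is extra false positives, which is exactly the precision-recall trade-off discussed in these results.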
Figure 12: Logistic regression NLP-enhanced model

5.1.3 Random forest

5.1.3.1 Sensor-Only Feature Set

Random Forest, known for its robustness against overfitting, performed moderately well with an accuracy of 0.917 and an F1 score of 0.590, as shown in figure 13. It handled the class imbalance better than Logistic Regression but still failed to identify many failure cases (Louppe, 2015).

Figure 13: Random Forest sensor feature set

5.1.3.2 NLP-Enhanced Feature Set

Performance improved significantly with the enriched dataset, achieving 0.940 accuracy and 1.00 precision, as stated in figure 14. However, recall remained at 0.51, with an F1 score of 0.679. These results suggest that while the model became more precise, it still struggled to detect all failure events.

Figure 14: Random Forest NLP-enhanced model

5.1.4 Gradient boosting

5.1.4.1 Sensor-Only Feature Set

Gradient Boosting emerged as the best performer among the models using only sensor data, achieving an accuracy of 0.927 and an F1 score of 0.621, as shown in figure 15. The algorithm's iterative nature allowed it to capture complex interactions more effectively than Random Forest (Friedman, 2001).

Figure 15: Gradient boosting sensor-only model

5.1.4.2 NLP-Enhanced Feature Set

With the combined features, Gradient Boosting matched XGBoost's accuracy at 0.973 and achieved perfect precision of 1.00. However, recall was lower at 0.78, leading to an F1 score of 0.879, as seen in figure 16. Although strong overall, the reduced recall made it marginally less effective than XGBoost.

Figure 16: Gradient Boosting with NLP features

5.1.5 XGBoost

5.1.5.1 Sensor-Only Feature Set

XGBoost's performance with raw sensor data was modest (accuracy: 0.907, F1 score: 0.563). Though it did not outperform Gradient Boosting, as shown in figure 17, it showed potential through consistent performance across metrics, even with fewer features.
Figure 17: XGBoost sensor model summary

5.1.5.2 NLP-Enhanced Feature Set – Best Model

XGBoost, when fed with the NLP-combined dataset, emerged as the best model overall. It achieved the highest accuracy and precision, as shown in figure 18.

Figure 18: XGBoost NLP model summary

• Accuracy: 0.973
• Precision: 0.91
• Recall: 0.86
• F1 Score: 0.889
• ROC AUC: 0.993, as shown in figure 19

Figure 19: ROC curve of XGBoost

These results confirm XGBoost's superior capacity for learning complex, non-linear patterns and leveraging textual information for PdM tasks (Chen & Guestrin, 2016).

Table 7 highlights the performance of models trained solely on sensor data, with Gradient Boosting achieving the highest accuracy and F1 score, while Logistic Regression performed the poorest across all metrics.

Table 7: Sensor-only feature set

Model                 Accuracy   Precision   Recall   F1 score
Logistic Regression   0.877      0.00        0.00     0.000
Random Forest         0.917      0.75        0.49     0.590
Gradient Boosting     0.927      0.86        0.49     0.621
XGBoost               0.907      0.67        0.49     0.563

Table 8: NLP-enhanced feature set

Model                 Accuracy   Precision   Recall   F1 score
Logistic Regression   0.893      1.00        0.14     0.238
Random Forest         0.940      1.00        0.51     0.679
Gradient Boosting     0.973      1.00        0.78     0.879
XGBoost               0.973      0.91        0.86     0.889

Table 8 shows that incorporating NLP-enhanced features significantly improves model performance, with XGBoost and Gradient Boosting achieving the highest accuracy and F1 scores.

5.2 Best model analysis: XGBoost with NLP-enhanced features

Among all evaluated models, XGBoost with the NLP-enhanced feature set emerged as the top performer. This model achieved an accuracy of 97.3%, an F1 score of 0.889, and an impressive ROC AUC of 0.993. The confusion matrix for this model is given in table 9:

Table 9: Confusion matrix, textual representation

             Predicted no   Predicted yes
Actual no    132            8
Actual yes   6              154

This matrix indicates a high TPR and a low FPR, signifying that the model effectively distinguishes between failure and non-failure instances. The superior performance of XGBoost can be attributed to its robust handling of complex, high-dimensional data and its ability to model non-linear relationships. The inclusion of NLP-derived features, such as sentiment scores and TF-IDF vectors, provided additional context that enhanced the model's predictive capabilities.

Table 10 summarizes the sentiment polarity distribution in the logs, indicating that the majority of logs are neutral, with smaller proportions classified as negative or positive.

Table 10: Sentiment polarity distribution in logs

Sentiment category   Polarity range   Percentage of logs
Positive             > 0.1            12%
Neutral              -0.1 to 0.1      63%
Negative             < -0.1           25%

5.3 Best Performing Model: XGBoost (NLP-Enhanced)

XGBoost with the NLP-combined dataset demonstrated consistent superiority across all evaluation metrics. Its highest F1 score (0.889) and best ROC AUC (0.993) indicate an optimal trade-off between sensitivity and specificity. This performance can be attributed to two main strengths:
• Regularization: XGBoost incorporates L1 and L2 regularization to prevent overfitting (Chen & Guestrin, 2016).
• NLP Feature Synergy: The integration of TF-IDF and sentiment polarity from the maintenance logs captured failure indicators not evident in sensor readings alone, validating findings from prior work on combining structured and unstructured data in predictive tasks (Aggarwal & Zhai, 2012).

5.3.1 Training vs testing performance

The training and testing metrics for the XGBoost model using the combined NLP features indicate strong overall performance with only a minimal degree of overfitting, shown in figure 20.
While the training metrics are perfect across all indicators (accuracy, F1 score, and ROC AUC), the test results remain exceptionally high, with 97.33% accuracy, a 0.89 F1 score, and a 0.99 ROC AUC. The slight drop in the F1 score suggests some variance in class prediction, possibly due to class imbalance or nuanced differences in the test data, but the high ROC AUC and accuracy on the test set show that the model generalizes well. Therefore, while the perfect training metrics hint at some overfitting, the consistently strong test performance demonstrates that the model retains robust predictive power and is not significantly overfit.

Figure 20: Training vs testing

5.4 Discussion

Our results clearly demonstrate that adding NLP features to a traditional sensor dataset can greatly improve failure prediction performance. Models trained exclusively on sensor data performed notably worse; models using textual features, in particular sentiment polarity and term frequency, were substantially better.

It is the textual maintenance context, including warnings, technician notes, and anomalies that cannot be quantified by numerical sensors, that provides the critical added value of textual analysis (Rao, 2024). This is important because modern PdM frameworks require multimodal data integration.

Interestingly, both XGBoost and Gradient Boosting performed well, but XGBoost's slightly better recall and AUC scores made it the winner. Additionally, its computational efficiency and scalability render it more applicable to real-world use in industrial IoT systems, where large volumes of data must be processed in real time (Zhang et al., 2022).

Despite its contributions, this study is limited in a few ways. The NLP features are highly context dependent; simple variations in log formatting or language style may harm the generalizability of the model.
Secondly, sentiment analysis using TextBlob gave good polarity features, but more advanced models such as BERT or domain-specific embeddings may offer more sophisticated features (Devlin et al., 2019).

5.4.1 Impact of feature set

Integrating NLP features into the models improved performance across all evaluated metrics, as can be seen in figure 21. Models built on the NLP-enhanced feature set outperformed all of their sensor-only counterparts. This demonstrates that unstructured data such as maintenance logs can usefully be added to PdM models. Previous studies have likewise identified the advantage of combining structured sensor data with unstructured textual data for improved prediction.

Figure 21: Top 10 feature importance

5.4.2 Model comparison

Overall performance was highest with XGBoost, while Gradient Boosting also performed excellently in terms of precision, as seen in figure 22. Nevertheless, Gradient Boosting's lower recall means that XGBoost has a lower rate of FNs.

Figure 22: Model comparison w.r.t. F1 score

By comparison, Random Forest and Logistic Regression were less favourable, with Logistic Regression performing much worse, especially on the sensor-only feature set, as illustrated in figure 23 below.

Figure 23: Model comparison summary

5.4.3 Confusion matrix insights

The confusion matrix of XGBoost with NLP-enhanced features in figure 24 highlights that the model achieves a harmonious balance between FPs and FNs. In PdM scenarios, this balance is essential because missed failures as well as unnecessary maintenance actions can have severe operational and financial impacts.

Figure 24: Confusion matrix of the best-performing XGBoost model
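The operational trade-off behind this balance can be made explicit by attaching costs to each error type. The sketch below does this for a toy prediction vector; the confusion-matrix counts and the per-error costs are purely illustrative assumptions, not figures from this study.

```python
from sklearn.metrics import confusion_matrix

# Toy predictions on an imbalanced set; counts and costs are assumptions.
y_true = [0] * 20 + [1] * 10
y_pred = [0] * 18 + [1] * 2 + [1] * 8 + [0] * 2

# For a 2x2 matrix, ravel() yields (tn, fp, fn, tp) in that order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

COST_FP = 500      # unnecessary maintenance action (assumed cost)
COST_FN = 10_000   # missed failure leading to unplanned downtime (assumed cost)
expected_cost = fp * COST_FP + fn * COST_FN

print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print(f"Expected cost of errors: {expected_cost}")
```

Because a missed failure typically costs far more than an unneeded inspection, a model with slightly more FPs but fewer FNs can still minimize total cost, which is one argument for preferring the higher-recall XGBoost variant.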
5.4.4 Practical implications

These findings indicate that PdM models that include both sensor data and NLP-derived features produce more accurate and reliable predictions of equipment failures, and can therefore reduce maintenance costs and downtime. The high ROC AUC confirms that XGBoost could be deployed in real-world industrial applications to make accurate and timely predictions. Future work can explore deep learning-based NLP techniques and the feasibility of real-time deployment. In addition, more advanced techniques for handling class imbalance, such as SMOTE or focal loss, could improve minority-class detection.

This chapter described the evaluation of ML models for PdM. The best model turned out to be XGBoost with NLP-enhanced features, obtaining the best classification results with minimal misclassification. The decisive performance advantage came from the fusion of structured sensor data and unstructured maintenance logs, especially through TF-IDF and sentiment analysis. This emphasizes that holistic data should be considered for PdM and paves the way for future developments in intelligent fault prediction.

6 Conclusion

We have explored the development and evaluation of a PdM framework that combines structured sensor data with unstructured maintenance log data processed through NLP. The research endeavoured to classify future equipment failures within a 7-day horizon by combining IoT sensor streams with text-based maintenance reports. The results of this work indicate that such an approach is both technically feasible and of practical value for improving industrial maintenance strategies and minimizing unplanned downtime.

6.1 Summary of work

Using DUO together with FlowMeter-E9, HumidityGuard-G8, SensorHub-A1 and other IoT sensors, the power of distributed monitoring was demonstrated in industrial environments.
Each device enriched the dataset in different ways, allowing the XGBoost model to train on complex multi-modal patterns. This PdM framework proved successful in large part because of the reliability and precision brought in by these smart IoT devices.

This project began by preprocessing two heterogeneous data types: numerical sensor readings (temperature, vibration, humidity, pressure) and free-text maintenance logs. TF-IDF was used for cleaning, standardisation, and transformation of the unstructured logs into quantifiable vectors, with polarity scores derived from TextBlob sentiment analysis. These features were merged with the sensor readings to create a complete dataset that could be fed into different classification algorithms.

Throughout the study, a clear methodological distinction was maintained between the two feature sets: (1) sensor-only features and (2) NLP-enhanced features combining sensor data with text-based features. These feature sets were used to train four different ML models: Logistic Regression, Random Forest, Gradient Boosting and XGBoost.

Accuracy, precision, recall, F1 score, ROC AUC and confusion matrices were used as a consistent suite of performance metrics to evaluate the models.

6.2 Key findings

The results from this study lead to several compelling conclusions:
• Enhanced Predictive Accuracy Through NLP: Across all models, the inclusion of NLP features, specifically TF-IDF vectors and sentiment polarity, led to a substantial improvement in performance. This confirms the hypothesis that unstructured maintenance text holds valuable semantic information which, when mined correctly, enhances failure prediction beyond what sensor readings alone can provide.
• Model Performance Hierarchy: Among the evaluated algorithms, XGBoost with the NLP-enhanced feature set emerged as the most accurate and balanced model, achieving an accuracy of 0.973, an F1 score of 0.889, and an exceptional ROC AUC of 0.993. These metrics indicate not only high predictive accuracy but also strong sensitivity to the minority class (i.e., predicting actual failures), which is often challenging in imbalanced datasets.
• Sensor-Only Limitations: While traditional ensemble methods like Random Forest and Gradient Boosting performed reasonably well with sensor data alone (accuracy ~0.91-0.93), their recall and F1 scores remained modest, especially for the failure class. This gap illustrates the insufficiency of relying solely on numeric sensors to capture nuanced signs of degradation or operational anomalies.
• Logistic Regression Limitations: Logistic Regression, although a common baseline in classification tasks, underperformed significantly in this context, particularly on the sensor-only dataset, where both precision and recall for the failure class dropped to zero. This highlights its limited capacity to capture the non-linear relationships and complex feature interactions intrinsic to equipment failure processes.
• Confusion Matrix Insights: The confusion matrices, especially for the XGBoost NLP-enhanced model, reveal very few FPs and FNs. This low misclassification rate underscores the model's practical applicability in real-time industrial settings, where both types of errors, predicting a failure that does not occur and missing an impending failure, carry operational and financial costs.

6.3 Critical evaluation

It is nevertheless important to contextualize these results critically, as such performance gains do not follow from simply integrating NLP. First, text quality and consistency in the maintenance logs matter.
In environments where logs are sparse, inconsistent, or written in non-standard language, the models will become less effective. Additionally, the TF-IDF assumption of static term importance in logs may not withstand temporal shifts and potential variations in failure patterns.

Second, TextBlob is a respectable way to implement sentiment analysis, but it is not fine-grained and may fail to capture domain-specific nuances such as technician jargon or operational terminology. Further improvements are possible with more sophisticated NLP techniques such as BERT embeddings or domain-tuned transformers, which are encouraged as future work.

Overall, XGBoost surpassed all other models; however, since it is still a black-box model, its interpretability is limited. In industrially critical applications, accountability is sometimes just as important as accuracy. Techniques such as SHAP for revealing feature contributions can therefore be integrated to build user trust and to help with model debugging and fine-tuning.

6.4 Practical implications

The outcomes of this study have direct implications for the implementation of intelligent maintenance systems in industrial IoT ecosystems. By incorporating natural language inputs from technicians alongside structured telemetry, organizations can unlock a richer understanding of equipment health. The high-performing models identified here can serve as the core of a PdM engine, enabling:
• Proactive Failure Mitigation: Early detection allows for scheduled repairs rather than reactive fixes, reducing operational disruptions.
• Cost Efficiency: Improved prediction reduces unnecessary preventive maintenance and avoids catastrophic failures, leading to significant cost savings.
• Decision Support: Maintenance planners an