Syed Mahmood ul Hassan

Predictive Maintenance in IoT Using NLP Techniques

Master's Thesis
Vaasa 2025
School of Technology and Innovations
Master of Sustainable and Autonomous Systems

UNIVERSITY OF VAASA
School of Technology and Innovations
Author: Syed Mahmood ul Hassan
Title of the thesis: Predictive Maintenance in IoT Using NLP Techniques
Degree: Master of Sustainable and Autonomous Systems
Supervisor: Mohammed Elmusrati
Co-supervisor: Elham Ahmadi
Year: 2025
Pages: 82

ABSTRACT:
This thesis studies how natural language processing (NLP) techniques can be integrated with Internet of Things (IoT) sensor data for predictive maintenance in industrial environments. The main objective was to build models that predict whether a piece of equipment will fail within the next 7 days, using the iot_predictive_maintenance dataset. The dataset contains sensor readings such as temperature, vibration, humidity, and pressure, along with unstructured textual maintenance logs. Two feature sets were built: one containing only numeric sensor data, and one that additionally includes NLP-derived features from TextBlob, namely TF-IDF vectors and sentiment polarity scores. Logistic Regression, Random Forest, Gradient Boosting, and XGBoost models were evaluated on both feature sets. Models incorporating both sensor and NLP features significantly outperformed those based solely on sensor data. Among the evaluated models, the best performance was achieved by XGBoost on the combined feature set, which attained an accuracy of 0.973, an F1 score of 0.889, and a ROC AUC of 0.993. These results confirm that the textual information in maintenance logs carries important failure-related signals that are not expressed in numerical data alone. The work demonstrates the practicality of fusion models for predictive maintenance, enabling a scalable and robust solution for smart manufacturing.
KEYWORDS: Predictive Maintenance, IoT, Natural Language Processing (NLP), Machine Learning (ML), Fault Prediction

Contents

1 Introduction
  1.1 Background and motivation
  1.2 Problem statement
  1.3 Research objectives
  1.4 Research questions
  1.5 Methodology overview
    1.5.1 Data preprocessing
    1.5.2 Feature engineering
    1.5.3 Model development and training
    1.5.4 Evaluation metrics
  1.6 Significance and contribution
2 Literature Review
  2.1 Traditional predictive maintenance and sensor-based approaches
  2.2 Unstructured data and the case for NLP in maintenance
  2.3 Hybrid approaches: Fusing text with sensor data
  2.4 Advances in NLP models for industrial applications
  2.5 Evaluation practices and real-world validation
  2.6 Ethical, practical, and operational considerations
    2.6.1 Human-in-the-Loop (HITL) approaches
    2.6.2 Privacy and data governance
    2.6.3 Transparency and accountability
    2.6.4 Operational challenges
  2.7 Summary of gaps and opportunities
3 Methodology
  3.1 Dataset overview
    3.1.1 Sensor data
    3.1.2 Maintenance logs
    3.1.3 Objectives
    3.1.4 Dataset limitations
    3.1.5 Relevance to industry
  3.2 Data pre-processing
    3.2.1 Sensor data
    3.2.2 Maintenance log
  3.3 Feature engineering
  3.4 Model development
  3.5 Model evaluation metrics
  3.6 Findings of preliminary methodology
4 Case study – Predictive maintenance using sensor and maintenance log data with NLP
  4.1 Dataset overview
  4.2 Data pre-processing
    4.2.1 Sensor data processing
    4.2.2 Maintenance log processing
  4.3 Model development
    4.3.1 Feature set
    4.3.2 Model selection
    4.3.3 Model evaluation
  4.4 Summary
  4.5 Insights and implications
5 Results and discussion
  5.1 Model performance
    5.1.1 Logistic regression
    5.1.2 Random forest
    5.1.3 Gradient boosting
    5.1.4 XGBoost
  5.2 Best model analysis: XGBoost with NLP-enhanced features
  5.3 Best performing model: XGBoost (NLP-enhanced)
  5.4 Discussion
    5.4.1 Impact of feature set
    5.4.2 Model comparison
    5.4.3 Confusion matrix insights
    5.4.4 Practical implications
6 Conclusion
  6.1 Summary of work
  6.2 Key findings
  6.3 Critical evaluation
  6.4 Practical implications
  6.5 Limitations and future directions
References

Figures

Figure 1: Dataset of IoT devices
Figure 2: Model architecture
Figure 3: Dataset pre-processing
Figure 4: Failure rate per device
Figure 5: Sensor readings and failure rate of devices
Figure 6: Confusion matrix of sensor-only XGBoost
Figure 7: XGBoost confusion matrix of combined sensor and textual features
Figure 8: Model comparison w.r.t. accuracy
Figure 9: Logistic regression sensor-only model
Figure 10: Logistic regression NLP-enhanced model
Figure 11: Random forest sensor feature set
Figure 12: Random forest NLP-enhanced model
Figure 13: Gradient boosting sensor-only model
Figure 14: Gradient boosting with NLP features
Figure 15: XGBoost sensor model summary
Figure 16: XGBoost NLP model summary
Figure 17: ROC curve of XGBoost
Figure 18: Top 10 feature importances
Figure 19: Model comparison w.r.t. F1 score
Figure 20: Model comparison summary
Figure 21: Confusion matrix of best-performing XGBoost model

Tables

Table 1: IoT sensors and their functions
Table 2: Raw sensor features
Table 3: Sample TF-IDF terms from maintenance logs
Table 4: Train-test split overview
Table 5: Comparative ROC AUC scores
Table 6: Overall comparison of models
Table 7: Sensor-only feature set
Table 8: NLP-enhanced feature set
Table 9: Confusion matrix textual representation
Table 10: Sentiment polarity distribution in logs

Abbreviations

AI        Artificial Intelligence
AUC       Area Under Curve
AUC-ROC   Area Under the Receiver Operating Characteristic Curve
BERT      Bidirectional Encoder Representations from Transformers
FN        False Negative
FP        False Positive
FPR       False Positive Rate
GBM       Gradient Boosting Machine
GDPR      General Data Protection Regulation
HITL      Human-in-the-Loop
IIoT      Industrial Internet of Things
IoT       Internet of Things
ML        Machine Learning
MPM       Modern Predictive Maintenance
NER       Named Entity Recognition
NLP       Natural Language Processing
PdM       Predictive Maintenance
ROC       Receiver Operating Characteristic
RUL       Remaining Useful Life
SHAP      SHapley Additive exPlanations
SMOTE     Synthetic Minority Over-sampling Technique
TF-IDF    Term Frequency–Inverse Document Frequency
TPR       True Positive Rate
XGBoost   eXtreme Gradient Boosting

1 Introduction

Industry 4.0, the era of digital transformation, integrates cyber-physical systems, automation, and smart sensing technologies into industrial processes. Within this revolution sits PdM, an advanced maintenance strategy that aims to predict equipment failures before they occur, preventing unscheduled downtime, improving operational efficiency, and extending the operating lifespan of critical assets. The IoT has greatly boosted this capability: advances in sensor-embedded systems allow continuous monitoring of machines and generate vast amounts of structured, time-series data on parameters such as temperature, vibration, pressure, and humidity.

Sensor-driven predictive models are remarkably successful, yet in most current frameworks the unstructured textual data generated alongside sensor streams remains underutilized.
Examples include maintenance logs, fault descriptions, incident reports, and technician observations, which usually contain rich contextual information that is hard to capture with numerical data alone. According to Kang et al. (2020), excluding such data can result in incomplete fault detection, inefficient maintenance scheduling, and delayed early warning of failures, leaving the promise of PdM systems unrealized.

To address this, this research builds a new NLP-augmented PdM model that combines structured IoT sensor data with unstructured textual records. Specifically, the study presents a dual-layer predictive framework in which NLP techniques are used to extract and encode important features from the maintenance logs. These sensor features are then combined with TF-IDF vectors and sentiment scores obtained with TextBlob to train advanced ML models that can better predict failures.

This work studies PdM using data from a variety of purpose-built IIoT sensors. Devices such as ThermoTrack-B3, SensorHub-A1, FlowMeter-E9, and EnviroMon-D2 proved central for collecting real-time measurements of temperature, pressure, flow rate, and environmental conditions. The predictive models developed here rely on these sensor readings, together with the maintenance logs, to predict equipment failures before they take place.

The primary dataset used in this research, shown in Figure 1, was obtained from Kaggle. It contains real-time equipment sensor readings and the corresponding maintenance logs. The structured part includes key operational metrics such as temperature, humidity, vibration, and pressure, while the unstructured part consists of technician-entered descriptions of symptoms, faults, and observed anomalies.
This research applies linguistic intelligence to the integration of these two data modalities, addressing one of the central problems in PdM: contextual understanding.

Figure 1: Dataset of IoT devices

In this methodology, the textual data are first pre-processed: cleaned of noise, standardized, and tokenized. Features are then extracted through TF-IDF, which weights words by their importance across documents, and through sentiment analysis, which indicates whether a log's polarity shows signs of distress or failure. On this basis, two distinct datasets were formed: one comprised solely of sensor data, and a combined dataset that also includes the NLP-derived features.

These approaches were evaluated with four ML models: Logistic Regression, Random Forest, Gradient Boosting, and an optimized XGBoost classifier. The key performance metrics were accuracy, precision, recall, and F1 score, along with ROC AUC. Results showed that sensor-only models performed significantly worse than those enhanced with the NLP features. The most notable improvement was seen in the XGBoost model, which reached an accuracy of 0.973, an F1 score of 0.889, and a ROC AUC of 0.993, demonstrating its stability and precision in identifying possible failures.

Additionally, confusion matrix analysis revealed that the NLP-integrated models produced fewer false positives and false negatives than the models without semantic insight from textual logs, meaning that embedding this insight leads to a more robust differentiation between failing and non-failing equipment. Finally, the sentiment polarity scores added value by conveying the technician's tone and urgency, which correlate with real failure events.
Thus, this thesis contributes to the developing knowledge base on intelligent maintenance systems by demonstrating that NLP can be a powerful complement to sensor-based models for PdM. The results show the need for multi-modal data fusion in industrial analytics and provide a scalable framework that can be applied across industrial IoT ecosystems. The proposed hybrid PdM model captures both quantitative and qualitative signals, making it possible to develop more resilient and less expensive maintenance solutions in smart manufacturing environments. The following chapters present the methodology and experimental setup, the experimental results and analysis, and the conclusions, limitations, and directions for future research.

1.1 Background and motivation

Thanks to the exponential growth and widespread deployment of IoT devices, industrial operations in manufacturing, oil and gas, aerospace, and transportation can now be monitored continuously and at fine granularity (Compare et al., 2020). Smart, interconnected sensors produce high-resolution, structured data streams of critical machine parameters: temperature fluctuations, vibration frequencies, pressure levels, humidity changes, and operational cycles. PdM systems can analyse these data points, identify anomalies, predict future faults, and advise on maintenance schedules, increasing system reliability and overall operational efficiency.

However, while these advancements enable improved detection of failures, they still lack the semantic depth needed to truly grasp the root causes of failures and the contextual factors behind maintenance issues. When only sensor-based data is used, models are typically optimized on quantitative metrics at the expense of qualitative indicators that are also important for interpretation.
For example, technicians provide information in comments, root-cause descriptions, workaround notes, and previous fault narratives within maintenance logs, service reports, and inspection notes; information that is not given in the form of numerical sensor values. These unstructured text entries often describe anomalies in natural language, point to recurrent issues, and provide contextual clues and metadata about environmental or operational conditions that may not otherwise be monitored.

According to Usuga-Cadavid et al., the use of such textual records in PdM frameworks can unveil latent patterns and previously uncorrelated failure modes in machine behaviour (P. U. Cadavid et al., 2021). It enriches the predictive modelling procedure, providing a frequent and interpretable mechanism for analysing equipment condition. Although these unstructured data sources have potential, they are not yet used in mainstream industrial analytics because of the challenges of processing them, including variation in language, problem-specific domain jargon, misspellings, and inconsistent formatting.

Hence, the motivation for this research is to bridge the gap between structured sensor analytics and unstructured textual intelligence by applying NLP to transform free-text records into machine-readable features. The study integrates numerical sensor data with NLP-derived features, namely TF-IDF vectors and sentiment polarity scores, to build a more robust, holistic PdM model. The contribution lies in this interdisciplinary approach, which combines the strengths of data-driven engineering and computational linguistics to provide a smarter, context-aware maintenance solution that can yield deeper operational insight and more accurate fault predictions.
1.2 Problem statement

In modern industrial environments, PdM has become an essential strategy for minimizing unplanned equipment downtime, reducing maintenance costs, and increasing overall system reliability. Advances in sensing and analytics have led to sophisticated PdM models based not on isolated data points but on continuous streams of numerical data (vibration, temperature, pressure, etc.) used to detect anomalies and predict faults. However, the unstructured textual information available to these models remains underutilized to date.

Maintenance logs, technician notes, incident reports, and service documentation are generated routinely and in large volumes during industrial operations. These records often carry rich qualitative insights, including early warning signs, context about previous faults, human observations, and domain-specific knowledge that cannot be captured by sensors alone. Unfortunately, conventional PdM frameworks ignore these unstructured data sources and focus only on structured sensor inputs. As a result, current models tend to be less context-aware, which constrains their ability to predict correctly or to explain why an anomaly or equipment failure occurs.

This gap in current practice increases the chances of unexpected equipment breakdowns, harming productivity, safety, and operational continuity, and raises several pressing challenges, including incomplete or inaccurate fault detection and suboptimal maintenance scheduling. Exploiting this under-used textual data would yield rich, human-centric insights that complement predictive analytics.

Therefore, the main problem this research addresses is to design, develop, and validate an NLP-empowered PdM framework that natively combines quantitative sensor data with qualitative textual data.
This model unites the benefits of NLP techniques with those of structured data sources by incorporating features extracted through NLP, namely TF-IDF vectorization and sentiment analysis. The goal is to increase the prediction accuracy, interpretability, and reliability of PdM systems, anticipating failures earlier and supporting optimal industrial maintenance strategies.

1.3 Research objectives

To address the outlined problem, the thesis sets forth the following objectives:

1. To analyze an IoT sensor dataset and identify relevant unstructured textual features.
2. To preprocess and clean maintenance logs for NLP application, including tokenization, removal of noise, and standardization.
3. To apply TF-IDF and sentiment analysis to extract meaningful features from the logs.
4. To integrate textual features with sensor data to develop ensemble-based ML models for failure prediction.
5. To compare the performance of NLP-enhanced models with traditional sensor-only models using evaluation metrics such as accuracy, F1 score, and AUC.

1.4 Research questions

The study aims to answer the following research questions:

• What improvements in predictive outcomes can be made by combining NLP with IoT sensor data?
• What preprocessing and feature engineering steps are essential for converting unstructured maintenance logs into actionable predictive inputs?
• How do models that include NLP-derived features compare in performance to those that use only structured sensor data?

1.5 Methodology overview

The methodology employed in this research followed a systematic, multi-stage pipeline designed to develop, integrate, and evaluate a hybrid PdM model that combines structured IoT sensor data with unstructured maintenance log entries. The process was broadly divided into four key stages: data preprocessing, feature engineering, model development, and performance evaluation.
The primary dataset used for experimentation was the publicly available IoT PdM dataset, which included time-stamped sensor measurements, such as temperature, humidity, pressure, and vibration, alongside free-text maintenance logs describing fault occurrences and technician observations.

1.5.1 Data preprocessing

The preprocessing phase was critical in preparing both structured and unstructured data for downstream analysis. The textual component of the dataset (the maintenance logs) underwent a series of standard NLP preprocessing steps, producing a cleaned log for each record, log_i^clean = Clean(log_i):

• Lowercasing all text to ensure uniformity.
• Removal of punctuation, digits, and special characters to eliminate irrelevant noise.
• Tokenization and stop-word removal to retain only meaningful words.
• Lemmatization to reduce words to their root forms for semantic consistency.

Table 1 lists the IoT sensors used in the study, detailing their types and specific measurement focuses. These devices collectively capture a wide range of environmental and system parameters, including temperature, humidity, pressure, flow, and vibration.
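The log-cleaning and feature-extraction steps described above can be sketched in a few lines of Python. This is a minimal, illustrative stand-in only: the actual pipeline would typically use NLTK or spaCy for lemmatization, scikit-learn's TfidfVectorizer for the TF-IDF vectors, and TextBlob for sentiment polarity. The stop-word list, the negative-term lexicon, and the example log texts below are all invented for illustration.

```python
import math
import re

# Toy stand-ins for real stop-word lists and for TextBlob's sentiment lexicon.
STOPWORDS = {"the", "a", "an", "is", "was", "on", "and", "of", "near"}
NEGATIVE_TERMS = {"overheating", "burnt", "failure", "leak", "grinding"}

def clean_log(text):
    """Lowercase, strip punctuation/digits, tokenize, drop stop words."""
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    return [tok for tok in text.split() if tok not in STOPWORDS]

def tfidf(corpus):
    """Per-document TF-IDF scores with smoothed idf = ln((1+N)/(1+df)) + 1."""
    n = len(corpus)
    df = {}
    for doc in corpus:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    vectors = []
    for doc in corpus:
        vectors.append({
            term: (doc.count(term) / len(doc))
                  * (math.log((1 + n) / (1 + df[term])) + 1)
            for term in set(doc)
        })
    return vectors

def polarity(tokens):
    """Crude stand-in for TextBlob polarity: share of 'distress' vocabulary."""
    return -sum(tok in NEGATIVE_TERMS for tok in tokens) / max(len(tokens), 1)

# Two hypothetical maintenance-log entries.
logs = ["Pump 3 overheating, burnt smell near the bearing.",
        "Routine check was OK, no issues on the line."]
tokens = [clean_log(entry) for entry in logs]
vectors = tfidf(tokens)
```

In the full pipeline the same two steps would typically be `TfidfVectorizer(max_features=50).fit_transform(logs)` plus `TextBlob(text).sentiment.polarity` per log; the point of the sketch is that failure-related vocabulary such as "overheating" receives a non-zero weight only in the documents where it occurs, which is what lets a downstream classifier exploit it.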
Table 1: IoT sensors and their functions

Device name      | Type                   | Measurement focus       | Description
ThermoTrack-B3   | Temperature sensor     | Temperature             | Tracks thermal conditions
SensorHub-A1     | Multi-sensor hub       | Combined set of sensors | Aggregates data from multiple sensor types
FlowMeter-E9     | Flow sensor            | Fluid/gas flow          | Monitors flow rate in pipelines
EnviroMon-D2     | Environmental monitor  | Temperature, humidity   | Measures ambient environmental parameters
VibeSense-C7     | Vibration sensor       | Vibration               | Detects mechanical vibrations in equipment
HumidityGuard-G8 | Humidity sensor        | Humidity                | Monitors moisture in the environment
TempControl-H6   | Temperature controller | Temperature regulation  | Controls and adjusts system temperature
PressurePro-F4   | Pressure sensor        | Pressure                | Measures internal system pressure

Following this, the processed text was transformed into numerical features using TF-IDF vectorization. A feature cap of 50 terms was selected based on exploratory analysis and dimensionality constraints, capturing the most informative tokens across the corpus. Additionally, sentiment polarity scores were extracted using the TextBlob library, providing a scalar measure of the emotional tone of technician comments (ranging from negative to positive sentiment). These scores capture implicit cues in the human-written logs that may reflect the urgency or seriousness of an issue.

1.5.2 Feature engineering

To explore the added value of integrating unstructured data, two distinct feature sets were constructed:

• Sensor-only features: the original numerical attributes from the IoT sensor data (temperature, vibration, humidity, and pressure), representing the conventional approach to PdM: x_i^sensor = [t_i, v_i, h_i, p_i] ∈ R^4.
• NLP-augmented features: a composite feature set combining the sensor data with the TF-IDF text vectors and sentiment polarity scores.
The goal was to enrich the numerical representation with linguistic cues and contextual signals embedded in maintenance logs: x_i^nlp = [t_i, v_i, h_i, p_i, s_i, w_i] ∈ R^(M+5), where s_i is the sentiment polarity score and w_i the M-dimensional TF-IDF vector.

Table 2: Raw sensor features

Feature name | Source device    | Unit  | Data type
Temperature  | ThermoTrack-B3   | °C    | Numeric
Vibration    | VibeSense-C7     | mm/s  | Numeric
Humidity     | HumidityGuard-G8 | %RH   | Numeric
Pressure     | PressurePro-F4   | kPa   | Numeric
Flow rate    | FlowMeter-E9     | L/min | Numeric

Table 2 presents the raw sensor features extracted from individual IoT devices, specifying their source, measurement units, and data types. All recorded features are numeric and represent key physical parameters.

Feature scaling, specifically standardization (z-score normalization), was applied to ensure uniform feature contributions, which is particularly important for distance-based and ensemble methods.

1.5.3 Model development and training

The data was partitioned into training and testing sets using a 70/30 split, with stratified sampling to maintain the proportional class distribution across failure types. Each model then learns a mapping f: x_i → ŷ_i ∈ {0, 1}.

The following ML algorithms were implemented and trained on both feature sets for comparative analysis:

• Logistic Regression, which served as the baseline due to its simplicity and interpretability (Çınar et al., 2020).
• Random Forest Classifier, to capture non-linear relationships and variable interactions (Wang et al., 2023).
• Gradient Boosting Machines (GBM), a sequential ensemble method known for its robustness (Samet, 2023).
• XGBoost, an optimized implementation of gradient boosting that offers superior regularization and performance (Panduman et al., 2024).

Figure 2 illustrates the model architecture, combining sensor data and NLP-derived features as inputs to machine learning models for predicting system failure.
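The feature fusion, standardization, and stratified 70/30 split described in this subsection can be sketched with the standard library alone. This is a simplified illustration, not the thesis implementation: in practice scikit-learn's StandardScaler and train_test_split(..., stratify=y) would typically do this work, and every numeric value and helper name below is invented for the example.

```python
import random
import statistics

def fuse(sensor, sentiment, tfidf_vec):
    """Build x_nlp = [t, v, h, p, s] + M TF-IDF terms, i.e. a vector in R^(M+5)."""
    return list(sensor) + [sentiment] + list(tfidf_vec)

def zscore(rows):
    """Standardize each feature column to zero mean and unit variance."""
    cols = []
    for col in zip(*rows):
        mu = statistics.fmean(col)
        sd = statistics.pstdev(col) or 1.0  # guard against constant columns
        cols.append([(v - mu) / sd for v in col])
    return [list(row) for row in zip(*cols)]

def stratified_split(y, test_frac=0.3, seed=42):
    """Index split that preserves the failure/non-failure class ratio."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for label in set(y):
        idx = [i for i, lab in enumerate(y) if lab == label]
        rng.shuffle(idx)
        cut = round(len(idx) * test_frac)
        test_idx += idx[:cut]
        train_idx += idx[cut:]
    return train_idx, test_idx

# Toy example: 4 sensor values + 1 sentiment score + a 3-term TF-IDF
# vector gives a fused feature vector in R^8 (here M = 3).
x_nlp = fuse([72.5, 0.31, 41.0, 101.3], -0.4, [0.23, 0.0, 0.11])
```

Stratifying the split matters for PdM data because failures are rare: a plain random 70/30 split can leave the test partition with almost no positive examples, whereas stratification guarantees both partitions keep the original failure ratio.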
Figure 2: Model architecture

1.5.4 Evaluation metrics

Each model was evaluated using a comprehensive set of classification metrics to assess performance from multiple dimensions:

• Accuracy: the overall proportion of correct predictions.
• Precision: the model's ability to avoid false positives (FP).
• Recall: sensitivity to actual failure cases.
• F1 score: the harmonic mean of precision and recall, well suited to imbalanced datasets.
• AUC-ROC: the area under the ROC curve, indicating the trade-off between the true positive rate (TPR) and the false positive rate (FPR).

Additionally, confusion matrix analysis was performed to identify patterns of misclassification and to understand each model's strengths and weaknesses across failure types.

1.6 Significance and contribution

This research confirms that integrating unstructured maintenance log data via NLP significantly enhances the predictive capacity of traditional IoT-based maintenance models. The application of TF-IDF and sentiment analysis adds contextual depth to the feature set, which is particularly effective in distinguishing critical failure patterns not captured in numerical data. The proposed methodology can be readily adapted to industrial settings, offering a scalable, data-driven solution for proactive asset maintenance.

Furthermore, the comparative analysis between sensor-only and NLP-enhanced models provides a quantitative justification for including textual data in future PdM systems. The results highlight not only the potential of NLP in industrial IoT but also underscore the need for cross-disciplinary integration of AI techniques to tackle complex operational challenges.

2 Literature Review

A review of prior studies indicates increasing use of IoT devices in PdM systems. In contrast to this project, many of these devices, such as the commonly referenced multi-sensor units SensorHub-A1, VibeSense-C7, PressurePro-F4, and HumidityGuard-G8, are used within comprehensive suites of smart devices alongside other units.
These modern condition-monitoring devices provide the granularity and frequency of data required for effective predictive analytics. This marks the advent of Industry 4.0 in industrial operations, integrating cyber-physical systems, IoT, big data analytics, and AI. In today's digital age, PdM has progressively replaced traditional reactive and preventive maintenance approaches. The real-time data and advanced analytics used in PdM help forecast equipment failures, thereby minimizing unplanned downtime and maximizing asset lifecycles.

At an early stage, PdM was based mainly on structured data produced by sensors monitoring parameters such as temperature, vibration, and pressure. The efficacy of these sensor-based models in predicting mechanical failures has been proven, and they have been widely applied in the manufacturing, aerospace, and energy sectors. However, as industrial systems continue to grow in complexity, it has become apparent that relying on quantitative sensor data alone is insufficient. Unstructured textual data, such as maintenance logs, technician notes, and inspection reports, can hold nuanced contextual information that sensor data does not provide.

In recent years, studies have examined integrating unstructured textual data into PdM to enhance predictive capability. For example, maintenance logs contain useful information about previous failures, repair actions, and expert observations, offering insight into equipment behaviour and failure patterns that goes beyond what sensors capture. NLP techniques provide the right tools to extract and analyse this qualitative textual information, allowing it to be integrated into predictive models.
Technically, this thesis formulates a hybrid approach in which NLP is integrated with traditional sensor analytics, the two collectively forming comprehensive and accurate predictive models. This approach is in line with Industry 4.0's broader focus on data-driven decision-making and on using a variety of data sources to increase performance. Such hybrid models also meet the current need to transition from non-context-aware PdM systems to more context-aware and interpretable ones that can adapt to the dynamic nature and complexity of modern industrial environments.

This chapter critically reviews the existing literature on PdM models, with emphasis on the evolution from traditional sensor-based approaches to hybrid approaches using NLP techniques. It analyses the methods used, the limitations encountered, and the progress made in incorporating information derived from unstructured textual data into PdM models. The chapter explores these developments to give a complete picture of current PdM research and points out research and improvement opportunities for the future.

2.1 Traditional predictive maintenance and sensor-based approaches

PdM is now a staple of modern industrial plant operations, intended to forecast equipment failures before they happen in order to reduce unplanned outages and improve maintenance scheduling. Traditional PdM models typically use structured data obtained from sensors measuring temperature, vibration, pressure, and similar parameters (Killeen et al., 2019). These sensor-based approaches have contributed immensely to the detection of early signs of equipment degradation, allowing timely interventions and extending asset lifecycles (Zantalis et al., 2019).
More recently, ML algorithms such as decision trees, support vector machines, and deep learning architectures have been applied on top of these models to boost their predictive capabilities. For example, deep learning models have been used to predict bearing failures from vibration signals (Liu et al., 2022), and support vector machines to predict pump failures from pressure and flow data (Shi-Nash & R. Hardoon, 2017). These advancements have improved the accuracy and reliability of PdM systems, especially in environments where sensor data are plentiful and well maintained (Liu et al., 2022).

Sensor-based PdM models have proven very successful, but they have limitations. They rely heavily on structured numerical data, which does not adequately capture the contextual nuances that precede or accompany equipment failures. Although operator behaviour, environmental conditions, and historical maintenance actions are of paramount importance to equipment performance, they rarely leave a trace in sensor data. According to Nota et al. (2022), overlooking these contextual elements undermines the completeness of diagnostics and the quality of maintenance decisions.

A further limitation is the lack of generalizability of sensor-based PdM models: models trained on particular datasets may not perform well across different operational contexts or types of equipment. This lack of adaptability points to the need for more robust and adaptable predictive models that can handle the varying and dynamic environments of the industrial world.

Explainability is yet another significant limitation of sensor-based models. Stakeholders in critical industries must be able to trust and act on PdM recommendations, which requires transparent and interpretable models.
According to Abidi et al. (2022), model interpretability is important because complex black-box models hinder user confidence and interfere with decision-making processes.

In view of these challenges, unstructured textual data such as maintenance logs and technician notes is increasingly seen as an important component of MPM frameworks. These textual records can deliver rich contextual information and deep insight into equipment behaviour and failure modes. PdM models can leverage NLP techniques to process these texts and extract meaningful features from them, gaining a better understanding of equipment health so that predictions and maintenance decisions rest on a more holistic picture.

In conclusion, traditional sensor-based PdM approaches have provided solid ground for the practice of PdM, but their inability to capture contextual information and provide explainable predictions creates the need for hybrid models that integrate structured sensor data with unstructured textual information. With such an integrative approach, PdM capabilities can be advanced, operational reliability improved, and maintenance strategies made more informed and effective.

2.2 Unstructured data and the case for NLP in maintenance

To date, PdM models have predominantly used structured sensor data, such as temperature, vibration, and pressure, to predict equipment failures. These models succeed in many scenarios, yet they are blind to the rich contextual information embedded in unstructured textual data such as maintenance logs, technician notes, and inspection reports. These textual records usually contain important knowledge about root-cause identification, technician intuition, and contextual observations that is neither fully measured by sensors nor easily quantified.
For example, phrases such as 'intermittent', 'burnt smell', or 'previously replaced' carry semantic value that cannot be captured from numerical data alone (P. U. Cadavid et al., 2021; Shen & Huang, 2024). Despite this potential, such textual records rarely see use: they are often archived without being analysed, or they exist in a chaotic, inconsistent format that makes them unsuitable for predictive models. Other factors, such as informal language, spelling errors, and domain-specific jargon, make processing and analysing these industrial narratives even more complex (Ponnambili et al., 2024; Rai et al., 2024).

Nevertheless, recent studies have shown that it is possible to incorporate textual features into PdM frameworks. For example, Akhbardeh et al. (2020) demonstrate that fault classification in aviation systems improves with TF-IDF representations and sentiment scores extracted from maintenance logs (Akhbardeh et al., 2020). Similarly, Yilmaz (2022) used NLP techniques to obtain latent failure indicators from technician comments in railway maintenance systems, reporting an increase in early fault detection rates (Hussain et al., 2020).

NLP techniques provide several advantages when integrated into PdM models. NLP enables the extraction of meaningful features from unstructured text, revealing patterns and insights not evident in sensor data. Sentiment analysis has been widely applied in consumer contexts, but it is also promising in technical maintenance as a surrogate for urgency and severity, as Mallioris et al. (2024) indicate. Even so, industrial narratives remain challenging to clean, standardize, and encode, as they contain informal language, spelling errors, and domain-specific jargon (Ponnambili et al., 2024; Rai et al., 2024).
Such challenges call for specialized NLP tools and techniques that deal specifically with the characteristics of maintenance text. Overall, integrating unstructured textual data through NLP has great potential to improve PdM models. NLP offers a way to capture the subjective observations inherent in maintenance logs and technician notes and to use them to build more accurate, context-rich PdM systems that improve operational reliability and efficiency.

2.3 Hybrid approaches: Fusing text with sensor data

Given the availability of structured sensor data and the advent of NLP, integrating sensor data with NLP-extracted features has proved a viable path to more comprehensive and accurate PdM models. With this hybrid approach, models benefit from both types of data, combining quantitative sensor measurements with qualitative content from textual maintenance records.

De Luca et al. (2023) showed that such a hybrid model, combining deep learning with sentiment-enhanced log features, can achieve an 11% improvement over sensor-only baselines. This enhancement highlights the importance of including technician narratives, which can contain details about equipment conditions and failure modes that sensors may miss (De Luca et al., 2023).

Ucar, Karakose, and Kırımça (2024) likewise used transformer-based models to encode log entries and combined them with time-series sensor data to predict faults in smart factories (Ucar et al., 2024). Their approach demonstrates the feasibility of using state-of-the-art NLP architectures to process and blend unstructured text data for more accurate and timely fault prediction in complex industrial environments.
Postiglione and Monteleone (2024) argue that such fusion strategies not only increase prediction accuracy but also aid maintenance prioritization based on contextual risk (Postiglione & Monteleone, 2024). PdM models that analyse both sensor data and textual logs can therefore produce more informed recommendations and more proactive maintenance, directing attention toward the most critical problems.

According to Bouabdallaoui et al. (2021), textual features play the role of a semantic bridge between physical failure symptoms and historical patterns. Integrating text data helps PdM models take in the context and history of equipment failures for better diagnostics and prognostics (Bouabdallaoui et al., 2021).

The fusion, however, is not easy. According to Kalusivalingam et al. (2020), it raises issues of feature dimensionality, temporal alignment, and model overfitting. When the dataset is limited, combining sensor data with textual features yields a high-dimensional input space that is prone to overfitting. In addition, the sensor data must be pre-processed so that its timestamps align with the irregular timestamps of the maintenance logs, so that the integrated data accurately represent the condition of the system over time (Kalusivalingam et al., 2020).

To overcome these challenges, researchers have proposed several remedies. Feature selection techniques can reduce dimensionality and eliminate redundant or irrelevant features, making the model more generalizable. Advanced preprocessing methods can provide temporal alignment so that the integrated dataset truly records the state of the system at each time point.
In conclusion, incorporating NLP-extracted features alongside sensor data in hybrid approaches holds great promise for improving PdM models. These models offer not only quantitative but also qualitative information, providing a more comprehensive view of equipment health, more accurate predictions, and more effective maintenance strategies.

2.4 Advances in NLP models for industrial applications

In recent years, the evolution of NLP models has had a major impact on many domains, including industry. Traditional NLP techniques, such as BoW and TF-IDF, opened the door for more sophisticated models such as Word2Vec and BERT. These advances created new possibilities in predictive analytics through more accurate and context-aware analysis of unstructured textual data.

Jyothirmai et al. (2024) show that BERT embeddings of textual work orders in oil refineries improve generalization across different asset types. Transformer-based models proved capable of extracting complex semantic relationships from maintenance logs, improving the predictive capability of the application (Jyothirmai et al., 2024).

In addition, researchers have extracted high-level failure modes from raw logs using NER, topic modeling, and semantic similarity matching. Ekundayo et al. (2024), for instance, employ these techniques to identify critical components and failure patterns in industrial systems, enabling more targeted maintenance strategies (Ekundayo et al., 2024). Similarly, Samet (2023) uses semantic similarity measures to map maintenance logs to historical failure data to increase fault prediction accuracy (Samet, 2023).

Yet industrial adoption of advanced NLP models is still in its infancy, and most of their use remains experimental and limited to controlled datasets.
Domain-specific terminology, informal language, and sparse data remain challenges that prevent such models from being widely implemented in real-world industrial settings.

On this basis, Valli (2024) argues that interpretable NLP methods are necessary in critical infrastructure and explains why explainable AI techniques are needed to increase user trust and model adoption (Valli, 2024). Interpretable models can give insights into the decision-making process, so that maintenance personnel can understand and act upon model predictions.

Likewise, Stanton et al. (2023) advocate the application of knowledge graphs and ontologies to map textual data to standardized maintenance taxonomies. These approaches place unstructured data within a formal framework so that it can be structured, improving data interoperability and enabling more uniform analysis across systems and organizations (Stanton et al., 2023).

In summary, although significant progress has been made in applying advanced NLP models to industrial maintenance, challenges of data quality, model interpretability, and domain adaptation remain. Realizing the full potential of PdM solutions in industrial environments requires addressing these challenges with robust, interpretable, and domain-specific NLP models.

2.5 Evaluation practices and real-world validation

The literature on NLP-augmented PdM models lags far behind real-world deployment and longitudinal validation. Simulation-based studies suggest potential, but few have been operationalized in live industrial settings (Boretti, 2024; Mohammed et al., 2023). This disparity highlights the hurdles in moving from theory to practice in complex industrial settings.
As Javaid (2024) emphasizes, the lack of established benchmarks and the inhomogeneity of log formats hinder replication and model portability. Maintenance logs are not uniform across industries and organisations, so developing generalized models requires specialized approaches for each specific context (Javaid, 2024).

Chikkudu and Annamalai (2025) suggest that PdM models be evaluated using confusion matrices, AUC-ROC, and cost-sensitive metrics. Together these metrics assess model performance from complementary angles: accuracy, discriminative ability, and the economic impact of misclassifications (Chikkudu & Annamalai, 2025).

In addition, comparative studies by ur Rehman et al. (2019) and Hasanuzzaman et al. (2025) both show that adding unstructured data usually improves the F1 score and reduces FNs, at the price of higher computational overhead (Hasanuzzaman et al., 2025; ur Rehman et al., 2019). This trade-off between model performance and resource requirements highlights the importance of solution approaches that balance accuracy and efficiency.

In summary, although much progress has been made in developing NLP-augmented PdM models, issues concerning real-world deployment, evaluation, and model validation remain. The success of any PdM strategy implemented in an industrial environment depends on addressing these challenges by providing established benchmarks, developing interpretable models, and incorporating sensitive, domain-specific factors.

2.6 Ethical, practical, and operational considerations

Deploying NLP-augmented PdM models requires addressing ethical and operational dimensions besides the technical ones. These considerations are important to ensure that such systems are fair, transparent, and in line with human values.
Bias and Data Quality

As Shi-Nash and Hardoon (2017) indicate, logs that are biased or incomplete can lead to skewed predictions, especially in systems with uneven access to high-quality maintenance documentation. Such biases result in unfair maintenance recommendations, disadvantaging some asset types or operational contexts more than others. Careful data curation is therefore important: training datasets must be curated so that underrepresented scenarios can be identified and mitigated (Shi-Nash & R. Hardoon, 2017).

2.6.1 Human-in-the-Loop (HITL) Approaches

Wellsandt et al. (2022) recommend human-in-the-loop approaches in which expert feedback continuously improves NLP models. This iterative process folds domain expertise into the model, making it more accurate and flexible in use (Wellsandt et al., 2022). Nonetheless, building HITL systems requires efficient feedback mechanisms and interfaces so that human operators can interact seamlessly with the ML models. It is also important to create protocols that guarantee the quality and reliability of human input, to prevent new kinds of bias or error from entering the system.

2.6.2 Privacy and Data Governance

Obtaining logs from privacy- and governance-sensitive systems requires robust anonymization and compliance traceability (Khattab & Youssry, 2020; Lowin, 2024). Stringent data protection is needed to guard against unauthorized access and misuse. Respecting legal data protection regulations, such as GDPR, is likewise essential for accountability and for trust with stakeholders.

2.6.3 Transparency and Accountability

To increase the trust of users and stakeholders, it is important to guarantee transparency in model decision-making processes.
Implementing explainable AI techniques is therefore an appealing way to gain better understanding and acceptance of how models arrive at their predictions. Additionally, accountability structures need to be clearly established to define responsibility and to tackle possible problems in model outputs.

2.6.4 Operational Challenges

Deploying NLP-augmented PdM systems in practice raises system integration, real-time processing, and scalability issues. These systems must work within the confines of existing infrastructure and workflows. In addition, continuous monitoring and maintenance are needed to adapt to evolving operational conditions and to sustain performance over time.

Integrating NLP techniques into PdM brings significant benefits, but the ethical, practical, and operational implications must be handled with care to produce systems that are fair, transparent, and effective. Proactively addressing these considerations will allow successful adoption and long-term sustainability of NLP-augmented PdM solutions.

2.7 Summary of Gaps and Opportunities

The limited body of work on the value of textual intelligence for PdM revolves around the following limitations:

• Limited Real-World Validation: Few current studies involve industrial deployment and temporal generalization (Shamayleh et al., 2020). This gap emphasizes the necessity of thorough field testing and longitudinal studies to measure the usefulness and applicability of PdM models in changing industrial environments.

• Underdeveloped Domain-Specific NLP: Few models account for domain jargon and inconsistencies in technical language (Rojas et al., 2025). The diversity of terminology across industries necessitates domain-specific NLP models that properly process and understand the domain on which the PdM system is based, to guarantee reliable results.

• Data Fusion Complexity: While Compare et al.
(2019) demonstrate such feature merging, it must be accompanied by carefully pre-processed, aligned, and noise-handled sensor and textual features. Ensuring the consistency and quality of heterogeneous data sources is important but difficult, and it challenges efforts to build integrated PdM models (Compare et al., 2020).

• Model Explainability: Black-box models hinder adoption in safety-critical industries (Ayvaz & Alpay, 2021). The erosion of stakeholder trust caused by opaque decision-making is a major motivation for explainable AI techniques that reveal what is going on inside the model and make its predictions understandable.

In their 2024 study, Uçar, Karakose, and Kırımça provide a comprehensive review of artificial intelligence applications in predictive maintenance (PdM), emphasizing key components, trustworthiness, and future trends. The authors discuss the integration of AI technologies into PdM, highlighting challenges such as the need for real-world validation and the importance of trustworthiness in AI systems. They also explore emerging areas like digital twins, generative AI, and the Industrial Internet of Things (IIoT). However, the study primarily offers a high-level overview and lacks detailed methodological insights, particularly concerning specific model types and feature engineering techniques. This contrasts with our approach, which implements concrete models like XGBoost and incorporates feature engineering methods such as TF-IDF vectorization and sentiment analysis using TextBlob (Ucar, Karakose, & Kırımça, 2024). By focusing on practical implementation and evaluation, our work addresses some of the limitations identified in Uçar et al.'s review, particularly the need for real-world validation and detailed methodological frameworks. De Luca et al.
(2023) propose a deep attention-based approach for PdM in IoT scenarios, leveraging a multi-head attention mechanism to achieve efficient and effective predictions. Their model demonstrates competitive performance on the NASA dataset, with advantages in parameter efficiency and training time compared to traditional LSTM models. However, the study's reliance on a specific dataset and the absence of real-world deployment scenarios limit its generalizability (De Luca et al., 2023). In contrast, our methodology combines sensor data with NLP features derived from maintenance logs, utilizing models like XGBoost to achieve high accuracy and F1 scores. While our approach does not incorporate advanced architectures like transformers, it emphasizes interpretability and practical applicability in industrial settings. By comparing our methods with those of De Luca et al., we highlight the trade-offs between model complexity and real-world feasibility, underscoring the importance of adaptable and interpretable models in PdM applications.

The integration of NLP with sensor data thus remains a powerful yet underutilized frontier for sensor-based PdM. Future research should develop interpretable, deployable, and robust hybrid models that can be validated across a wide range of industrial contexts. Resolving these challenges will enable the transition from theoretical models to practical real-world solutions, increasing the effectiveness and adoption of PdM strategies.

3 Methodology

This chapter presents the complete methodology for developing a PdM system that integrates IoT sensor data with maintenance logs with the help of NLP. The approach aims to predict equipment failure within a 7-day window from both structured sensor data and unstructured text.
3.1 Dataset overview

The primary dataset in this study, the "iot_predictive_maintenance_dataset", comes from Kaggle, a large repository of data frequently employed in PdM-related research. It combines detailed maintenance logs with time-series sensor measurements, enabling a comprehensive view of equipment failure prediction (Samudrala, 2022).

This dataset contains simulated data representing real-time monitoring of various industrial equipment, including turbines, compressors, and pumps. Each row in the dataset corresponds to a unique observation capturing key parameters such as temperature, pressure, vibration, and humidity. The dataset also includes the equipment type, location, and whether the equipment is classified as faulty (Samudrala, 2022).

Figure 3: Dataset pre-processing

Data were acquired through a total of 27 deployed IoT devices (among them ThermoTrack-B3, EnviroMonitor-D2, and VibeSense-C7), as shown in figure 3. These devices recorded the metrics at regular intervals and fed the data into a centralized system for preprocessing. Devices such as TempControl-H6 and FlowMeter-E9 provided specialized measurements needed to capture early signals of equipment stress, not necessarily degradation.

3.1.1 Sensor data

The dataset encompasses time-series readings from various sensors, including:

• Temperature: Monitors the operating temperature of equipment, which can indicate overheating or cooling-system failures.
• Vibration: Detects mechanical imbalances, misalignments, or bearing failures through vibration patterns.
• Humidity: Assesses moisture levels that may affect electrical components or promote corrosion.
• Pressure: Measures pressure variations that could signify blockages or leaks in fluid systems.
Sensor readings are acquired at regular intervals, capturing the temporal sequence of equipment behaviour under normal and stressed conditions, as shown in figure 5. Time-series analysis and ML techniques then allow imminent failures to be predicted at a highly desirable granularity and frequency.

3.1.2 Maintenance logs

Accompanying the sensor data are textual maintenance logs that document:

• Equipment Conditions: Narratives describing the operational state of machinery during inspections.
• Repairs and Replacements: Records of parts replaced, maintenance actions taken, and downtime incidents.
• Anomalies and Alerts: Descriptions of observed irregularities or deviations from standard operating procedures.

The entries in these logs are unstructured textual information describing contextual factors that affect equipment health. Extracting actionable features from such unstructured data requires powerful NLP techniques.

3.1.3 Objectives

The overall aim is to build predictive models that forecast equipment failures within a 7-day window. The study integrates structured sensor data with features extracted from the maintenance logs via NLP to improve the accuracy and reliability of failure prediction. This is in line with current developments in PdM, which have demonstrated success in improving model performance by joining sensor data with text. Figure 4 and figure 5 show the failure rate distribution across different IoT devices, with HumidityGuard-G8 exhibiting the highest failure rate of all.

Figure 4: Failure rate per device

3.1.4 Dataset limitations

Although the dataset provides a strong base for PdM model building, certain limitations must be acknowledged. The dataset exhibits a significant class imbalance, with 123 failure instances and 877 non-failure instances, a failure-to-non-failure ratio of 0.1403.
This imbalance may bias predictive models toward the majority class, potentially reducing the model's ability to detect failure events and hurting overall performance, particularly recall for the minority class.

• Data Imbalance: Equipment failures are relatively rare compared to normal operations, leading to class imbalance that can bias model performance.
• Missing Values: Sensor readings may have missing entries due to equipment downtime or data transmission errors, necessitating imputation or other handling strategies.
• Noisy Data: Sensor measurements can be susceptible to noise, requiring preprocessing techniques to ensure data quality.
• Inconsistent Log Entries: Maintenance logs may vary in detail and format, posing challenges for standardization and feature extraction.

Developing robust predictive models that generalize well to real industrial settings requires addressing these limitations.

3.1.5 Relevance to industry

The integration of sensor data with NLP-enhanced features is particularly pertinent for manufacturing, energy, and transportation industries undergoing digital transformation. Using PdM models, organizations can shift from reactive to proactive maintenance plans, reducing downtime, improving resource utilisation, and extending equipment life. The central idea of Industry 4.0 is exactly such a paradigm shift toward data-driven decision making that improves operational efficiency and boosts competitiveness.

3.2 Data pre-processing

Data pre-processing is an essential part of ML applications. The steps applied to clean the data are explained here.

3.2.1 Sensor data

The sensor data underwent standard preprocessing steps to ensure quality and consistency:

• Handling Missing Values: Imputation techniques such as forward filling or interpolation were used to account for missing sensor readings and preserve the integrity of the time-series data.
• Normalization: Readings were normalized to a standard scale to aid the convergence of ML algorithms and to accommodate sensors with different units.

3.2.2 Maintenance log

Because the maintenance logs are unstructured text, they required significant processing to extract meaningful features:

• Text Cleaning: The logs were cleaned by removing punctuation and numbers and converting text to lowercase to standardize the content.

log_i^clean = Clean(log_i)

• Tokenization and Stopword Removal: Text was tokenized into words, and common stop words were removed to focus on significant terms.

• TF-IDF Vectorization: The textual data were converted into numerical features using a Term Frequency-Inverse Document Frequency (TF-IDF) vectorizer, which provides a quantitative measure of a word's importance relative to the entire corpus, as seen in table 3 below.

w_i = φ(log_i^clean) ∈ R^(M+5)

• Sentiment Analysis: Logs were fed into the TextBlob library to perform sentiment analysis on the maintenance descriptions and extract polarity scores representing the sentiment expressed.

s_i = SentimentPolarity(log_i^clean)

Table 3: Sample TF-IDF terms from maintenance logs

Term        TF-IDF Score
overheat    0.412
leak        0.356
shutdown    0.301
vibration   0.245
alarm       0.263

3.3 Feature engineering

Two distinct feature sets were constructed to study how incorporating NLP-enhanced features affected the results:

• Sensor-Only Features: This set comprised the raw sensor readings (temperature, vibration, humidity, and pressure) after preprocessing.
• NLP-Enhanced Features: This set included, in addition to the sensor data, the TF-IDF vectors and sentiment scores from the maintenance logs, giving a richer representation of the condition of the equipment.

Figure 5 displays the average sensor readings and corresponding 7-day failure rates for each IoT device, highlighting variability in both measurements and reliability.
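The maintenance-log pipeline of section 3.2.2 (cleaning, tokenization, stopword removal, TF-IDF weighting, and polarity scoring) can be sketched in pure Python. The stopword list, polarity lexicon, and example logs below are illustrative assumptions: in practice the thesis uses scikit-learn's TfidfVectorizer and TextBlob's `TextBlob(text).sentiment.polarity`, for which the tiny hand-made lexicon here is only a stand-in.

```python
import math
import re

STOPWORDS = {"the", "is", "a", "an", "of", "and", "on", "in", "to"}
# Toy polarity lexicon standing in for TextBlob's sentiment model (assumption).
POLARITY = {"inconsistent": -0.5, "failure": -0.8, "overheating": -0.7,
            "normal": 0.3, "stable": 0.4}

def clean(log):
    """Text cleaning: lowercase, strip punctuation and digits."""
    return re.sub(r"[^a-z\s]", "", log.lower())

def tokenize(log):
    """Tokenization followed by stopword removal."""
    return [t for t in clean(log).split() if t not in STOPWORDS]

def tfidf(corpus):
    """Term frequency weighted by inverse document frequency, per document."""
    docs = [tokenize(d) for d in corpus]
    vocab = {t for doc in docs for t in doc}
    n = len(docs)
    idf = {t: math.log(n / sum(t in doc for doc in docs)) for t in vocab}
    return [{t: doc.count(t) / len(doc) * idf[t] for t in doc} for doc in docs]

def polarity(log):
    """Stand-in for TextBlob(log).sentiment.polarity: mean lexicon score."""
    toks = tokenize(log)
    return sum(POLARITY.get(t, 0.0) for t in toks) / len(toks) if toks else 0.0

logs = [
    "Pressure readings inconsistent.",
    "Vibration levels normal and stable.",
    "Unexpected shutdown after overheating.",
]
vectors = tfidf(logs)                  # sparse TF-IDF dicts, one per log
sentiments = [polarity(l) for l in logs]  # one polarity score per log
```

Each log thus becomes a TF-IDF vector plus a polarity score, which is exactly the shape of the NLP-enhanced feature block that is later concatenated with the sensor readings.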
Figure 5: Sensor readings and failure rate of devices

3.4 Model development

Several ML models were developed and evaluated in order to choose the most appropriate model for predictive maintenance:

• Logistic Regression: A baseline linear classifier implemented to provide a reference performance level.

f(x) = σ(w^T x + b)

• Random Forest: An ensemble method that constructs multiple decision trees to improve classification accuracy.

f(x) = majority(T_1(x), ..., T_K(x))

• Gradient Boosting: A boosting technique that builds models sequentially, each new model correcting errors made by the previous ones.

f(x) = Σ_{k=1}^{K} α_k T_k(x)

• XGBoost: An optimized gradient boosting algorithm designed to be very fast while performing very well.

L = Σ_{i=1}^{n} l(y_i, ŷ_i) + Σ_k Ω(T_k)

To assess the contribution of textual information to predictive accuracy, each model was trained and evaluated with both the sensor-only and the NLP-enhanced feature sets.

3.5 Model evaluation metrics

To comprehensively assess model performance, the following metrics were utilized:

• Accuracy: The proportion of correctly predicted instances among all instances.

Accuracy = (1/n) Σ_{i=1}^{n} 1(ŷ_i = y_i)

• Precision: The proportion of true positive predictions among all positive predictions.
• Recall: The proportion of true positive predictions among all actual positives.
• F1 Score: The harmonic mean of precision and recall, balancing the two.
• Confusion Matrix: A matrix that visualizes classification performance by showing the true positives, FPs, true negatives, and FNs.
• ROC AUC Curve: A graphical representation of the model's ability to distinguish between classes, with the AUC indicating performance.

3.6 Findings of preliminary methodology

The models were evaluated and showed that incorporating NLP-enhanced features improved predictive
performance:

• XGBoost with NLP-Enhanced Features achieved an accuracy of 97.3%, an F1 score of 0.889, and a ROC AUC of 0.993, outperforming the other models.
• Gradient Boosting with NLP-Enhanced Features also demonstrated strong performance, with an accuracy of 97.3% and an F1 score of 0.879.

Models trained solely on sensor data showed lower recall for identifying true positive failures. This reinforces the benefit of integrating unstructured textual data with conventional sensor measurements to augment PdM systems.

The integration of NLP techniques with sensor data offers several advantages:

• Enhanced Predictive Accuracy: Incorporating sentiment analysis and TF-IDF features from maintenance logs adds extra context, improving the model's capability to predict failures.
• Interpretability: Sentiment scores allow a qualitative understanding of the equipment's condition, facilitating actionable insights for maintenance teams.
• Scalability: The approach can be applied across various industrial domains, adapting to different types of sensor data and maintenance logs.

However, challenges remain:

• Data Quality: NLP is only effective if the maintenance logs are consistent and of sufficient quality and validity, which is not the case in all systems and organizations.
• Model Complexity: Adding NLP features increases the dimensionality of the feature space, potentially leading to overfitting if not properly managed.

Operational and ethical points regarding PdM systems that utilize NLP include:

• Data Privacy: Maintenance logs may contain sensitive information; anonymization and compliance with data protection regulations are essential.
• Bias in Data: Incomplete or biased maintenance logs can lead to skewed predictions, necessitating careful data curation and validation.
• Human-in-the-Loop: Incorporating feedback from maintenance personnel can enhance model accuracy and ensure that predictions align with practical experience.

This methodology demonstrates the effectiveness of combining NLP-augmented maintenance-log features with an IoT sensor dataset to predict equipment failure. The unstructured textual information provides valuable context that improves predictive accuracy, marking a promising direction for advanced PdM systems. Future research should address the challenges related to data quality and model complexity to improve the effectiveness and applicability of these systems in real-world industrial settings.

4 Case study – Predictive Maintenance using Sensor and Maintenance Log Data with NLP

PdM has become a must for industrial operations, intended to foresee when equipment will malfunction. This case study explores the use of ML models to estimate the probability of equipment failure by combining IoT sensor data with the equipment's maintenance history, augmented with NLP techniques. The main objective is to determine whether integrating structured sensor data with unstructured textual data provides better predictions.

4.1 Dataset overview

The dataset used in this study, the "iot_predictive_maintenance_dataset", is sourced from Kaggle. It is composed of time-series sensor data and associated maintenance logs. The sensor data consists of readings of, for instance, temperature, vibration, humidity, and pressure, taken at regular intervals. The textual descriptions in the maintenance logs describe equipment conditions, repairs, and anomalies that have occurred during maintenance activities.
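The class balance figures reported for this dataset (123 failures against 877 non-failures, a ratio of 0.1403) can be reproduced in a few lines. The `scale_pos_weight` value shown is only a common counterweight for imbalance, e.g. as accepted by XGBoost's `XGBClassifier`; the thesis does not state whether such weighting was actually applied, so it is illustrative.

```python
# Class counts as reported for the dataset.
n_failures, n_non_failures = 123, 877
n_total = n_failures + n_non_failures

failure_ratio = n_failures / n_non_failures   # failure-to-non-failure ratio
failure_rate = n_failures / n_total           # share of the positive class

# A common counterweight for imbalance (illustrative, not from the thesis):
# passed as scale_pos_weight to xgboost.XGBClassifier, it upweights positives.
scale_pos_weight = n_non_failures / n_failures
```

Rounding `failure_ratio` to four decimals recovers the 0.1403 quoted in section 3.1.4, a useful sanity check when re-deriving dataset statistics.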
Analysis of the Kaggle dataset highlights several key aspects. The dataset contains a total of 1,000 samples, with 123 instances labelled as failures occurring within seven days, information that is crucial for understanding class distribution in predictive modelling. Regarding the maintenance logs, figure 6 shows that the average entry length is approximately 37 characters, or 4.66 words, indicating brief textual descriptions. Language detection reveals that all logs are written in English ('en'), and the content shows low variability, with only 10 unique log entries across the dataset. The five most frequent maintenance logs alone account for over half the dataset, with entries such as "Pressure readings inconsistent." appearing 123 times. These findings underscore the need for a more detailed and well-cited dataset description to support transparency, reproducibility, and meaningful downstream analysis.

Figure 6: Dataset Description Report

At this stage, harmonized data from multiple devices were fed into the analysis. SensorHub-A1 served as the primary aggregator of multisensory input features, while individual devices provided specific insights, such as HumidityGuard-G8 for humidity trends and PressurePro-F4 for real-time pressure. The structured numerical data produced by these devices were used alongside NLP-transformed maintenance logs to train and evaluate the ML models.

The problem is to predict whether a failure occurs in the next seven days. The task is approached as a binary classification problem, with both sensor-only and NLP-enhanced feature sets developed from the sensor data and maintenance logs. Integrating these data sources aims to achieve a better understanding of equipment health, which may result in more accurate predictions.
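The log statistics quoted above (average entry length, number of unique entries, top frequencies) can be reproduced with a few lines of standard-library Python. The toy corpus below only echoes the style of the real entries; its counts are not the thesis figures.

```python
from collections import Counter

# Toy maintenance-log corpus echoing the style of the entries described above.
logs = [
    "Pressure readings inconsistent.",
    "Pressure readings inconsistent.",
    "Vibration levels above threshold",
    "Routine inspection completed",
    "Pressure readings inconsistent.",
]

# Average length in characters and words, plus frequency of each unique entry.
avg_chars = sum(len(log) for log in logs) / len(logs)
avg_words = sum(len(log.split()) for log in logs) / len(logs)
counts = Counter(logs)

print(f"Average length: {avg_chars:.1f} chars, {avg_words:.2f} words")
print(f"Unique entries: {len(counts)}")
print("Most common:", counts.most_common(1))
```

Run over the full corpus, these few numbers already reveal the low textual variability noted above, which limits how much signal TF-IDF can extract.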
4.2 Data pre-processing

Data pre-processing is a critical step in preparing raw data for machine learning applications, particularly in predictive maintenance, where both sensor readings and textual information are often heterogeneous and noisy. This step ensures that the data are consistent, structured, and representative of the underlying equipment behaviour. The pre-processing pipeline in this study addresses both numerical sensor streams and associated textual records, each requiring distinct techniques to maximize their predictive potential.

4.2.1 Sensor data processing

All sensor data were pre-processed to make them suitable for ML models:
• Handling Missing Values: Missing sensor readings were imputed using interpolation techniques to maintain the continuity of the time-series data.
• Normalization: Sensor readings were normalized to a standard scale to prevent features with larger ranges from dominating model training.
• Feature Engineering: Statistical features such as the mean, standard deviation, and skewness were extracted from the raw sensor data to represent the underlying patterns of equipment health.

4.2.2 Maintenance log processing

The maintenance logs, being unstructured text, required extensive preprocessing:
• Text Cleaning: Punctuation, numbers, and irrelevant characters were removed to standardize the text.
• Tokenization and Lemmatization: Text was tokenized into words, and words were lemmatized to their base forms to reduce dimensionality.
• TF-IDF Vectorization: The text was converted with the TF-IDF method into numerical features that reflect the importance of words relative to the whole corpus.
• Sentiment Analysis: Sentiment scores were derived using TextBlob to gauge the emotional tone of the maintenance logs, providing additional context for the textual data.
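A condensed sketch of both preprocessing branches is given below, using pandas interpolation plus scikit-learn's StandardScaler and TfidfVectorizer. The lemmatization step (e.g., with NLTK) and the TextBlob sentiment call are indicated only in comments, and all data are illustrative.

```python
import re

import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler

# --- Sensor branch: interpolate gaps, then normalize (illustrative values) ---
sensors = pd.DataFrame({
    "temperature": [70.1, np.nan, 72.3, 71.0],
    "vibration":   [0.40, 0.42, np.nan, 0.47],
})
sensors = sensors.interpolate()                  # fill missing readings linearly
X_sensor = StandardScaler().fit_transform(sensors)

# --- Text branch: clean, then TF-IDF vectorize (illustrative logs) ---
logs = [
    "Pressure readings inconsistent.",
    "Bearing noise detected during inspection!",
    "Routine check, no issues found (unit 3).",
    "Vibration levels above threshold",
]
cleaned = [re.sub(r"[^a-z\s]", "", log.lower()) for log in logs]  # strip punctuation/digits
X_text = TfidfVectorizer().fit_transform(cleaned)

# In the thesis pipeline, sentiment polarity comes from TextBlob, e.g.:
#   from textblob import TextBlob
#   polarity = [TextBlob(log).sentiment.polarity for log in logs]

print(X_sensor.shape, X_text.shape)
```

The two branches yield matrices with one row per observation, which makes the later concatenation into a combined feature set straightforward.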
4.3 Model development

Following the data pre-processing phase, predictive models were developed to detect and anticipate equipment failures using both numerical sensor data and unstructured textual inputs. The model development process involved careful selection of relevant features, algorithm choice, and training procedures tailored to the nature of the available data. To evaluate the contribution of textual data, separate models were trained using sensor-only features and a combined feature set that integrated insights extracted from maintenance logs. This dual approach enabled comparative performance analysis and highlighted the added value of incorporating contextual information in predictive maintenance tasks.

4.3.1 Feature set

Two distinct feature sets were developed for model training:
• Sensor-Only Features: This set included only the pre-processed numerical sensor readings.
• NLP-Enhanced Features: In addition to the sensor data, this set incorporated the TF-IDF vectors and sentiment scores derived from the maintenance logs, aiming to capture both quantitative and qualitative aspects of equipment health.

4.3.2 Model selection

Several ML models were evaluated to determine the most effective approach for predicting equipment failures:
• Logistic Regression: A baseline linear model used for comparison.
• Random Forest: An ensemble method that constructs multiple decision trees to improve predictive performance.
• Gradient Boosting: A boosting technique that combines the predictions of several base learners to reduce bias and variance.
• XGBoost: An optimized implementation of gradient boosting that provides regularization to prevent overfitting.

4.3.3 Model evaluation

Models were assessed using a 70-30 train-test split, as stated in table 4, ensuring that the evaluation metrics were based on unseen data. Performance was measured using:
• Accuracy: The proportion of correct predictions.
• Precision: The proportion of true positive predictions among all positive predictions.
• Recall: The proportion of true positive predictions among all actual positives.
• F1 Score: The harmonic mean of precision and recall, providing a balance between the two.
• ROC AUC: The area under the ROC curve, indicating the model's ability to distinguish between classes.

Table 4: Train-test split overview

Dataset type   Samples   Class 0 (no failure)   Class 1 (failure)
Training set   1,750     1,320                  430
Test set       750       570                    180

4.4 Summary

The models' performance varied based on the feature sets utilized:
• Sensor-Only Models: These models showed moderate performance, with XGBoost achieving an accuracy of 90.7% and an F1 score of 0.563, as shown in the confusion matrix in figure 7. However, they struggled to detect failures, as indicated by lower recall values.

Figure 7: Confusion matrix of sensor-only XGBoost

• NLP-Enhanced Models: Incorporating NLP features significantly improved model performance. The best-performing model, XGBoost with NLP features, achieved an accuracy of 97.3%, an F1 score of 0.889, and a ROC AUC of 0.993. This indicates that the maintenance log data provide valuable information that enhances predictive accuracy.

The confusion matrix for the XGBoost model with NLP features, shown in figure 8 below, revealed a high TPR and a low FPR, underscoring the model's reliability in predicting equipment failures.

Figure 8: XGBoost confusion matrix for the combined sensor and textual feature set

4.5 Insights and implications

Integrating structured sensor data with unstructured maintenance logs provides a more holistic understanding of equipment health. The enhanced models showed that textual data carry latent information that can improve PdM outcomes when properly processed. However, maintenance log quality and consistency still require attention.
Variable log formats and subjective descriptions can add noise to the data. Future work should standardize log formats and apply more sophisticated NLP techniques to make textual data more useful.

This case study shows that IoT sensor data, combined with NLP-enabled maintenance logs, can help predict equipment failures more accurately. This suggests that much can be gained from holistic approaches that consider sensor data both with and without qualitative comments. However, the continued adoption of such integrated systems in industry will require ongoing research and development to address current challenges and optimize PdM strategies.

5 Results and discussion

In this chapter, the results obtained from each ML model on both the sensor-only and NLP-enhanced datasets are presented in a comprehensive manner. The models' performance, key metrics, and confusion matrices are summarized, and the most effective model for predicting equipment failure within a 7-day period is identified. The results are then discussed in the context of their broader impact and limitations.

Specific devices tracked by the sensor data exhibited characteristic patterns. For example, ThermoTrack-B3 often registered small thermal variations preceding failure events, while VibeSense-C7 captured peculiar vibration signatures coinciding with known mechanical faults. Devices such as EnviroMon-D2 contributed multi-dimensional environmental data, underscoring the value of heterogeneous device integration in building robust predictive systems.

5.1 Model performance

In this chapter, we evaluate four ML models (Logistic Regression, Random Forest, Gradient Boosting, and XGBoost) on two types of features: traditional sensor data and features enhanced through NLP on the maintenance logs.
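The two-feature-set comparison described above can be sketched as follows. The data are synthetic stand-ins (four "sensor" columns versus a wider matrix playing the role of the NLP-combined set), and the three scikit-learn models stand in for the thesis setup; XGBoost itself comes from the separate xgboost package and is omitted here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in (~12% positives): first 4 columns act as
# "sensor" features, the full matrix as the "NLP-combined" feature set.
X, y = make_classification(n_samples=1000, n_features=10, n_informative=8,
                           weights=[0.88], random_state=42)
X_sensor, X_combined = X[:, :4], X

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(random_state=42),
}

results = {}
for set_name, feats in [("sensor-only", X_sensor), ("combined", X_combined)]:
    X_tr, X_te, y_tr, y_te = train_test_split(feats, y, test_size=0.3,
                                              stratify=y, random_state=42)
    for model_name, model in models.items():
        model.fit(X_tr, y_tr)
        proba = model.predict_proba(X_te)[:, 1]
        pred = (proba >= 0.5).astype(int)
        results[(model_name, set_name)] = {
            "accuracy": accuracy_score(y_te, pred),
            "f1": f1_score(y_te, pred),
            "roc_auc": roc_auc_score(y_te, proba),
        }

for key, metrics in results.items():
    print(key, {k: round(v, 3) for k, v in metrics.items()})
```

The nested loop mirrors the evaluation protocol of the study: every model is trained and scored on both feature sets under the same 70-30 split, so the metric differences are attributable to the features rather than the split.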
Different performance metrics, including accuracy, precision, recall, F1 score, and ROC AUC, were used to assess the models; a graphical comparison of model accuracy is shown in figure 9.

Figure 9: Model comparison w.r.t. accuracy

5.1.1 Training time

The training times for each model were recorded, shown in figure 10, to assess their computational efficiency and scalability. Among the models using the sensor-only feature set, Logistic Regression was the fastest with a training time of 0.0164 seconds, while Random Forest and Gradient Boosting required 0.2859 and 0.2647 seconds respectively. XGBoost demonstrated a good balance between speed and performance, training in 0.0557 seconds. When using the NLP-combined feature set, all models experienced increased training times, with Gradient Boosting taking the longest at 0.3362 seconds. Despite this increase, Logistic Regression remained relatively efficient (0.1161 seconds), and XGBoost continued to offer a favorable trade-off with a training time of 0.2027 seconds. These results provide valuable insights into the scalability of each model when applied to larger or more complex datasets.

Figure 10: Training time of each model

Accuracy, precision, recall, F1 score, and ROC AUC were used to evaluate the models. Two feature sets were used to test performance: (1) raw sensor data (temperature, humidity, vibration, pressure) and (2) raw sensor data plus NLP features such as TF-IDF vectors and sentiment polarity from the maintenance logs.

Table 5 compares the ROC AUC scores of the different models using sensor-only features versus combined sensor and NLP features, showing notable performance improvements across all models with the addition of NLP data.
Table 5: Comparative ROC AUC scores

Model                 Sensor only   NLP combined
XGBoost               0.86          0.993
Gradient Boosting     0.89          0.987
Random Forest         0.85          0.962
Logistic Regression   0.81          0.900

Table 6 provides a comprehensive comparison of model performance across the different feature sets, revealing that models trained on combined sensor and NLP features consistently outperform those using sensor data alone in all key metrics.

Table 6: Overall comparison of models

Model                 Feature set    Accuracy   Precision   Recall   F1 Score   ROC AUC
XGBoost               NLP Combined   0.973      0.91        0.86     0.889      0.993
Gradient Boosting     NLP Combined   0.973      1.00        0.78     0.879      0.991
Random Forest         NLP Combined   0.940      1.00        0.51     0.679      0.980
Logistic Regression   NLP Combined   0.893      1.00        0.14     0.238      0.950
Gradient Boosting     Sensor Only    0.927      0.86        0.49     0.521      0.950
Random Forest         Sensor Only    0.917      0.75        0.49     0.590      0.940
XGBoost               Sensor Only    0.907      0.67        0.49     0.563      0.920
Logistic Regression   Sensor Only    0.877      0.00        0.00     0.000      0.800

5.1.2 Logistic regression

5.1.2.1 Sensor-Only Feature Set

Logistic Regression performed the worst among all models using raw sensor data. As shown in figure 11, it achieved an accuracy of 0.877 but failed to detect any failure cases (F1 score = 0.000). This underscores its limitations as a linear classifier in high-dimensional, non-linear environments (Kotsiantis, Zaharakis & Pintelas, 2007).

Figure 11: Logistic regression sensor-only model

5.1.2.2 NLP-Enhanced Feature Set

The inclusion of NLP features boosted Logistic Regression's accuracy to 0.893, as shown in figure 12. Precision rose to 1.00; however, recall remained critically low at 0.14, resulting in a weak F1 score of 0.238. This indicates that while the model avoided FPs, it still missed most actual failures, limiting its reliability (Sebastiani, 2002).
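A recall deficit of this kind is typical of an unweighted linear model on imbalanced data. Two standard remedies, not used in the thesis but worth noting, are class re-weighting and lowering the decision threshold; the sketch below demonstrates both on synthetic imbalanced data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced data (~12% positives, echoing the failure-class share).
X, y = make_classification(n_samples=1000, n_features=8, n_informative=6,
                           weights=[0.88], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y,
                                          random_state=0)

# Remedy 1: re-weight classes so each missed failure carries more loss.
balanced = LogisticRegression(class_weight="balanced", max_iter=1000)
balanced.fit(X_tr, y_tr)

# Remedy 2: keep the plain model but lower the decision threshold from 0.5.
plain = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
low_threshold_pred = (plain.predict_proba(X_te)[:, 1] >= 0.25).astype(int)

plain_recall = recall_score(y_te, plain.predict(X_te))
balanced_recall = recall_score(y_te, balanced.predict(X_te))
low_thr_recall = recall_score(y_te, low_threshold_pred)
print("plain recall:        ", plain_recall)
print("balanced recall:     ", balanced_recall)
print("low-threshold recall:", low_thr_recall)
```

Lowering the threshold can only add positive predictions, so recall never decreases; the cost is extra false positives, which is exactly the precision-recall trade-off discussed in these results.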
Figure 12: Logistic regression NLP-enhanced model

5.1.3 Random forest

5.1.3.1 Sensor-Only Feature Set

Random Forest, known for its robustness against overfitting, performed moderately well with an accuracy of 0.917 and an F1 score of 0.590, as shown in figure 13. It handled the class imbalance better than Logistic Regression but still failed to identify many failure cases (Louppe, 2015).

Figure 13: Random Forest sensor feature set

5.1.3.2 NLP-Enhanced Feature Set

Performance improved significantly with the enriched dataset, achieving 0.940 accuracy and 1.00 precision, as stated in figure 14. However, recall remained at 0.51, with an F1 score of 0.679. These results suggest that while the model became more precise, it still struggled to detect all failure events.

Figure 14: Random Forest NLP-enhanced model

5.1.4 Gradient boosting

5.1.4.1 Sensor-Only Feature Set

Gradient Boosting emerged as the best performer among the models using only sensor data, achieving an accuracy of 0.927 and an F1 score of 0.621, as shown in figure 15. The algorithm's iterative nature allowed it to capture complex interactions more effectively than Random Forest (Friedman, 2001).

Figure 15: Gradient boosting sensor-only model

5.1.4.2 NLP-Enhanced Feature Set

With the combined features, Gradient Boosting matched XGBoost's accuracy at 0.973 and achieved perfect precision of 1.00. However, recall was lower at 0.78, leading to an F1 score of 0.879, as seen in figure 16. Although strong overall, the reduced recall made it marginally less effective than XGBoost.

Figure 16: Gradient Boosting with NLP features

5.1.5 XGBoost

5.1.5.1 Sensor-Only Feature Set

XGBoost's performance with raw sensor data was modest (accuracy: 0.907, F1 score: 0.563). Though it did not outperform Gradient Boosting, as shown in figure 17, it showed potential through consistent performance across metrics, even with fewer features.
Figure 17: XGBoost sensor model summary

5.1.5.2 NLP-Enhanced Feature Set – Best Model

XGBoost, when fed with the NLP-combined dataset, emerged as the best model overall. It achieved the highest accuracy and precision, as shown in figure 18.

Figure 18: XGBoost NLP model summary

• Accuracy: 0.973
• Precision: 0.91
• Recall: 0.86
• F1 Score: 0.889
• ROC AUC: 0.993, as shown in figure 19

Figure 19: ROC curve of XGBoost

These results confirm XGBoost's superior capacity for learning complex, non-linear patterns and leveraging textual information for PdM tasks (Chen & Guestrin, 2016).

Table 7 highlights the performance of models trained solely on sensor data, with Gradient Boosting achieving the highest accuracy and F1 score, while Logistic Regression performed the poorest across all metrics.

Table 7: Sensor-only feature set

Model                 Accuracy   Precision   Recall   F1 score
Logistic Regression   0.877      0.00        0.00     0.000
Random Forest         0.917      0.75        0.49     0.590
Gradient Boosting     0.927      0.86        0.49     0.621
XGBoost               0.907      0.67        0.49     0.563

Table 8: NLP-enhanced feature set

Model                 Accuracy   Precision   Recall   F1 score
Logistic Regression   0.893      1.00        0.14     0.238
Random Forest         0.940      1.00        0.51     0.679
Gradient Boosting     0.973      1.00        0.78     0.879
XGBoost               0.973      0.91        0.86     0.889

Table 8 shows that incorporating NLP-enhanced features significantly improves model performance, with XGBoost and Gradient Boosting achieving the highest accuracy and F1 scores.

5.2 Best model analysis: XGBoost with NLP-enhanced features

Among all evaluated models, XGBoost with the NLP-enhanced feature set emerged as the top performer. This model achieved an accuracy of 97.3%, an F1 score of 0.889, and an impressive ROC AUC of 0.993. The confusion matrix for this model is given in table 9:

Table 9: Confusion matrix, textual representation

             Predicted no   Predicted yes
Actual no    132            8
Actual yes   6              154

This matrix indicates a high TPR and a low FPR, signifying that the model effectively distinguishes between failure and non-failure instances. The superior performance of XGBoost can be attributed to its robust handling of complex, high-dimensional data and its ability to model non-linear relationships. The inclusion of NLP-derived features, such as sentiment scores and TF-IDF vectors, provided additional context that enhanced the model's predictive capabilities.

Table 10 summarizes the sentiment polarity distribution in the logs, indicating that the majority of logs are neutral, with smaller proportions classified as negative or positive.

Table 10: Sentiment polarity distribution in logs

Sentiment category   Polarity range   Percentage of logs
Positive             > 0.1            12%
Neutral              -0.1 to 0.1      63%
Negative             < -0.1           25%

5.3 Best Performing Model: XGBoost (NLP-Enhanced)

XGBoost with the NLP-combined dataset demonstrated consistent superiority across all evaluation metrics. Its highest F1 score (0.889) and best ROC AUC (0.993) indicate an optimal trade-off between sensitivity and specificity. This performance can be attributed to two main strengths:
• Regularization: XGBoost incorporates L1 and L2 regularization to prevent overfitting (Chen & Guestrin, 2016).
• NLP Feature Synergy: The integration of TF-IDF and sentiment polarity from the maintenance logs captured failure indicators not evident in sensor readings alone, validating findings from prior work on combining structured and unstructured data in predictive tasks (Aggarwal & Zhai, 2012).

5.3.1 Training vs testing performance

The training and testing metrics for the XGBoost model using the combined NLP features indicate strong overall performance with only a minimal degree of overfitting, shown in figure 20.
While the training metrics are perfect across all indicators (accuracy, F1 score, and ROC AUC), the test results remain exceptionally high, with 97.33% accuracy, a 0.89 F1 score, and a 0.99 ROC AUC. The slight drop in the F1 score suggests some variance in class prediction, possibly due to class imbalance or nuanced differences in the test data, but the high ROC AUC and accuracy on the test set show that the model generalizes well. Therefore, while the perfect training metrics hint at some overfitting, the consistently strong test performance demonstrates that the model retains robust predictive power and is not significantly overfit.

Figure 20: Training vs testing

5.4 Discussion

Our results clearly demonstrate that adding NLP features to a traditional sensor dataset can greatly improve failure prediction performance. Models trained exclusively on sensor data performed notably worse; models using textual features, in particular sentiment polarity and term frequency, were substantially better.

It is the textual maintenance context, including warnings, technician notes, and anomalies that cannot be quantified by numerical sensors, that provides the critical added value of textual analysis (Rao, 2024). This is important because modern PdM frameworks require multimodal data integration.

Interestingly, both XGBoost and Gradient Boosting performed well, but XGBoost's slightly better recall and AUC scores made it the winner. Additionally, its computational efficiency and scalability render it more applicable to real-world use in industrial IoT systems, where large volumes of data must be processed in real time (Zhang et al., 2022).

Despite its contributions, this study is limited in a few ways. The NLP features are highly context dependent; simple variations in log formatting or language style may harm the generalizability of the model.
Secondly, sentiment analysis using TextBlob gave good polarity features, but more advanced models such as BERT or domain-specific embeddings may offer more sophisticated features (Devlin et al., 2019).

5.4.1 Impact of feature set

Integrating NLP features into the models improved performance across all evaluated metrics, as can be seen in figure 21. Models built on the NLP-enhanced feature set outperformed all of their sensor-only counterparts. This demonstrates that unstructured data such as maintenance logs can usefully be added to PdM models. Previous studies have likewise identified the advantage of combining structured sensor data with unstructured textual data for improved prediction.

Figure 21: Top 10 feature importance

5.4.2 Model comparison

Overall performance was highest with XGBoost, while Gradient Boosting also performed excellently in terms of precision, as seen in figure 22. Nevertheless, Gradient Boosting's lower recall means that XGBoost has a lower rate of FNs.

Figure 22: Model comparison w.r.t. F1 score

By comparison, Random Forest and Logistic Regression were less favourable, with Logistic Regression performing much worse, especially on the sensor-only feature set, as illustrated in figure 23 below.

Figure 23: Model comparison summary

5.4.3 Confusion matrix insights

The confusion matrix of XGBoost with NLP-enhanced features in figure 24 highlights that the model achieves a harmonious balance between FPs and FNs. In PdM scenarios, this balance is essential because missed failures as well as unnecessary maintenance actions can have severe operational and financial impacts.

Figure 24: Confusion matrix of the best-performing XGBoost model
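The operational trade-off behind this balance can be made explicit by attaching costs to each error type. The sketch below does this for a toy prediction vector; the confusion-matrix counts and the per-error costs are purely illustrative assumptions, not figures from this study.

```python
from sklearn.metrics import confusion_matrix

# Toy predictions on an imbalanced set; counts and costs are assumptions.
y_true = [0] * 20 + [1] * 10
y_pred = [0] * 18 + [1] * 2 + [1] * 8 + [0] * 2

# For a 2x2 matrix, ravel() yields (tn, fp, fn, tp) in that order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

COST_FP = 500      # unnecessary maintenance action (assumed cost)
COST_FN = 10_000   # missed failure leading to unplanned downtime (assumed cost)
expected_cost = fp * COST_FP + fn * COST_FN

print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print(f"Expected cost of errors: {expected_cost}")
```

Because a missed failure typically costs far more than an unneeded inspection, a model with slightly more FPs but fewer FNs can still minimize total cost, which is one argument for preferring the higher-recall XGBoost variant.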
5.4.4 Practical implications

These findings indicate that PdM models that include both sensor data and NLP-derived features produce more accurate and reliable predictions of equipment failures, and can therefore reduce maintenance costs and downtime. The high ROC AUC confirms that XGBoost could be deployed in real-world industrial applications to make accurate and timely predictions. Future work can explore deep learning-based NLP techniques and the feasibility of real-time deployment. In addition, more advanced techniques for handling class imbalance, such as SMOTE or focal loss, could improve minority-class detection.

This chapter described the evaluation of ML models for PdM. The best model turned out to be XGBoost with NLP-enhanced features, obtaining the best classification results with minimal misclassification. The decisive performance advantage came from the fusion of structured sensor data and unstructured maintenance logs, especially through TF-IDF and sentiment analysis. This emphasizes that holistic data should be considered for PdM and paves the way for future developments in intelligent fault prediction.

6 Conclusion

We have explored the development and evaluation of a PdM framework that combines structured sensor data with unstructured maintenance log data processed through NLP. The research endeavoured to classify future equipment failures within a 7-day horizon by combining IoT sensor streams with text-based maintenance reports. The results of this work indicate that such an approach is both technically feasible and of practical value for improving industrial maintenance strategies and minimizing unplanned downtime.

6.1 Summary of work

Using DUO together with FlowMeter-E9, HumidityGuard-G8, SensorHub-A1 and other IoT sensors, the power of distributed monitoring was demonstrated in industrial environments.
Each device enriched the dataset in different ways, allowing the XGBoost model to train on complex multi-modal patterns. This PdM framework proved successful in large part because of the reliability and precision brought in by these smart IoT devices.

This project began by preprocessing two heterogeneous data types: numerical sensor readings (temperature, vibration, humidity, pressure) and free-text maintenance logs. TF-IDF was used for cleaning, standardisation, and transformation of the unstructured logs into quantifiable vectors, with polarity scores derived from TextBlob sentiment analysis. These features were merged with the sensor readings to create a complete dataset that could be fed into different classification algorithms.

Throughout the study, a clear methodological distinction was maintained between the two feature sets: (1) sensor-only features and (2) NLP-enhanced features combining sensor data with text-based features. These feature sets were used to train four different ML models: Logistic Regression, Random Forest, Gradient Boosting and XGBoost.

Accuracy, precision, recall, F1 score, ROC AUC and confusion matrices were used as a consistent suite of performance metrics to evaluate the models.

6.2 Key findings

The results from this study lead to several compelling conclusions:
• Enhanced Predictive Accuracy Through NLP: Across all models, the inclusion of NLP features, specifically TF-IDF vectors and sentiment polarity, led to a substantial improvement in performance. This confirms the hypothesis that unstructured maintenance text holds valuable semantic information which, when mined correctly, enhances failure prediction beyond what sensor readings alone can provide.
• Model Performance Hierarchy: Among the evaluated algorithms, XGBoost with the NLP-enhanced feature set emerged as the most accurate and balanced model, achieving an accuracy of 0.973, an F1 score of 0.889, and an exceptional ROC AUC of 0.993. These metrics indicate not only high predictive accuracy but also strong sensitivity to the minority class (i.e., predicting actual failures), which is often challenging in imbalanced datasets.
• Sensor-Only Limitations: While traditional ensemble methods like Random Forest and Gradient Boosting performed reasonably well with sensor data alone (accuracy ~0.91-0.93), their recall and F1 scores remained modest, especially for the failure class. This gap illustrates the insufficiency of relying solely on numeric sensors to capture nuanced signs of degradation or operational anomalies.
• Logistic Regression Limitations: Logistic Regression, although a common baseline in classification tasks, underperformed significantly in this context, particularly on the sensor-only dataset, where both precision and recall for the failure class dropped to zero. This highlights its limited capacity to capture the non-linear relationships and complex feature interactions intrinsic to equipment failure processes.
• Confusion Matrix Insights: The confusion matrices, especially for the XGBoost NLP-enhanced model, reveal very few FPs and FNs. This low misclassification rate underscores the model's practical applicability in real-time industrial settings, where both types of errors, predicting a failure that does not occur and missing an impending failure, carry operational and financial costs.

6.3 Critical evaluation

It is nevertheless important to contextualize these results critically, as such performance gains do not follow from simply integrating NLP. First, text quality and consistency in the maintenance logs matter.
In environments where logs are sparse, inconsistent, or written in non-standard language, the models will become less effective. Additionally, the TF-IDF assumption of static term importance in logs may not withstand temporal shifts and potential variations in failure patterns.

Second, TextBlob is a respectable way to implement sentiment analysis, but it is not fine-grained and may fail to capture domain-specific nuances such as technician jargon or operational terminology. Further improvements are possible with more sophisticated NLP techniques such as BERT embeddings or domain-tuned transformers, which are encouraged as future work.

Overall, XGBoost surpassed all other models; however, since it is still a black-box model, its interpretability is limited. In industrially critical applications, accountability is sometimes just as important as accuracy. Techniques such as SHAP for revealing feature contributions can therefore be integrated to build user trust and to help with model debugging and fine-tuning.

6.4 Practical implications

The outcomes of this study have direct implications for the implementation of intelligent maintenance systems in industrial IoT ecosystems. By incorporating natural language inputs from technicians alongside structured telemetry, organizations can unlock a richer understanding of equipment health. The high-performing models identified here can serve as the core of a PdM engine, enabling:
• Proactive Failure Mitigation: Early detection allows for scheduled repairs rather than reactive fixes, reducing operational disruptions.
• Cost Efficiency: Improved prediction reduces unnecessary preventive maintenance and avoids catastrophic failures, leading to significant cost savings.
• Decision Support: Maintenance planners an