Tuomas Niemelä
Demand forecasting in the retail environment
A comparative study of LightGBM, XGBoost, and MLP models
Vaasa 2025
School of Technology and Innovations
Master of Science in Economics and Business Administration
Industrial Management

UNIVERSITY OF VAASA
School of Technology and Innovations
Author: Tuomas Niemelä
Title of the thesis: Demand forecasting in the retail environment: A comparative study of LightGBM, XGBoost, and MLP models
Degree: Master of Science in Economics and Business Administration
Degree Programme: Industrial Management
Supervisor: Petri Helo
Year: 2025
Pages: 103

ABSTRACT:
Accurate demand forecasting is a critical operational factor in the retail environment, as organizational decision-making and management are increasingly dependent on it. Accurate forecasts enable strategic planning, inventory optimization, increased customer satisfaction, and reduction of surplus and waste. While advanced machine learning (ML) models are recognized for producing accurate forecasts, current literature often focuses on comparing algorithmic efficiency without sufficiently examining the contribution of external features to forecast accuracy. This thesis aims to address this research gap by investigating how external variables, such as unemployment and inflation, influence the predictive accuracy of ML models and how feature selection affects their performance. The study conducts a comparative analysis of three algorithms: LightGBM, XGBoost, and Multilayer Perceptron (MLP). The models are tested and compared in relation to one another and benchmarked against a 52-week seasonal naïve forecast. The comparative analysis is based on comparing forecasts made with different feature sets, evaluating forecast accuracy using various error and performance metrics.
The empirical part of the research applies quantitative methods using simulated and anonymized time series data representing weekly sales figures from a U.S.-based retail chain operating in forty-five locations. The dataset covers approximately three years and includes seven original variables, consisting of macroeconomic, temporal, and store-specific features. Additional features were engineered to capture lagged and interaction effects within the data. The methodology involves data preprocessing, feature engineering, a 65:35 train-test split, hyperparameter optimization, and evaluation using the RMSE, MAE, MASE, and R² metrics. Permutation feature importance is used to assess the contribution of different features. The findings indicate that all machine learning models significantly outperformed the seasonal naïve baseline, demonstrating their capability to produce more accurate forecasts. Gradient boosting models achieved the best overall performance, with LightGBM outperforming XGBoost by a slight margin, while the MLP model showed the weakest performance and the highest computational cost. In answer to the research questions, the results confirm that feature selection has a decisive effect on model performance. Lag features representing short-term temporal dependencies were found to dominate feature importance scores across all models. The optimal lag length was identified as one week, while macroeconomic variables such as unemployment and inflation showed limited significance in short-term forecasts. MLP was the only model for which holiday-related features showed notable importance.
KEYWORDS: Demand forecasting, machine learning, retail analytics, feature importance, LightGBM, XGBoost, MLP, time series analysis

VAASAN YLIOPISTO
School of Technology and Innovations
Author: Tuomas Niemelä
Title of the thesis: Demand forecasting in the retail environment: A comparative study of LightGBM, XGBoost, and MLP models
Degree: Master of Science in Economics and Business Administration
Degree Programme: Master's Programme in Industrial Management
Field of study: Industrial Management
Supervisor: Petri Helo
Year of graduation: 2025
Pages: 103

TIIVISTELMÄ (abstract in Finnish, translated):
Accurate demand forecasting is considered a critical operational factor in retail, on which organizational decision-making and management are increasingly dependent. Accurate forecasts enable strategic planning, inventory optimization, improved customer satisfaction, and the reduction of surplus and waste. Although advanced machine learning models are known for producing accurate forecasts, the current literature often focuses on comparing the efficiency of algorithms without sufficiently examining the effect of external factors on forecast accuracy. The purpose of this thesis is to address this gap in earlier research by investigating how external variables, such as unemployment and inflation, affect the predictive accuracy of ML models and how feature selection affects their performance. The study carries out a comparative analysis of three algorithms: LightGBM, XGBoost, and MLP. The analysis is based on comparing forecasts made with different feature sets and on evaluating forecast accuracy using various error and performance metrics. The methodology includes data preprocessing, the creation of new data features, splitting the dataset into training and test data, hyperparameter optimization, and validation of the error and performance metrics.
The empirical part of the study applies quantitative methods using simulated and anonymized time series data consisting of the weekly sales figures of a U.S. retail chain, collected from 45 different locations. The data covers a period of approximately three years and contains eight original variables, consisting of macroeconomic, temporal, and store-specific features. The effect of the variables on forecast accuracy is measured with the permutation method. The results show that the machine learning models performed significantly better than the seasonal naïve benchmark, demonstrating their ability to produce more accurate forecasts than traditional forecasting models. Gradient boosting models achieved the best overall performance, with LightGBM performing slightly better than XGBoost. The MLP model, in turn, performed the weakest. The results confirm that feature selection has a decisive effect on model performance. Lag features representing short-term temporal dependencies proved to be the most important features in all models. The optimal lag length was found to be one week, whereas macroeconomic variables such as unemployment and inflation proved to be of limited significance in short-term forecasts.
KEYWORDS: Demand forecasting, machine learning, retail analytics, feature importance, LightGBM, XGBoost, MLP, time series analysis

Contents
1 Introduction
  1.1 Research questions and purpose
  1.2 Objectives and clear limitations
  1.3 Structure of the paper
2 Literature review
  2.1 Fundamentals of forecasting
  2.2 Demand forecasting
  2.3 Artificial intelligence
    2.3.1 AI as a driver of efficiency
    2.3.2 Machine learning methods
  2.4 Creating forecasts with machine learning
    2.4.1 Feature-based forecasts
    2.4.2 Promotions and seasonality
    2.4.3 Uncertainty and difficulties in demand forecasting
3 Methodology and data
  3.1 Data
    3.1.1 Data acquisition and preparation
    3.1.2 Data cleaning and preprocessing
  3.2 Feature engineering
  3.3 Train-test split
  3.4 Exploratory data analysis
  3.5 Machine Learning Models
  3.6 Feature Importance Analysis
4 Results
  4.1 Data analysis results
    4.1.1 Exploratory data analysis
    4.1.2 Outlier detection
    4.1.3 Train-test split
  4.2 Hyperparameter optimization
    4.2.1 Hyperparameters of LightGBM and XGBoost
    4.2.2 Hyperparameters of MLP
  4.3 Models’ predictive performance
    4.3.1 Baseline (seasonal naïve)
    4.3.2 Light gradient-boosting machine (LightGBM)
    4.3.3 Extreme gradient boosting (XGBoost)
    4.3.4 Multilayer Perceptron (MLP)
  4.4 Feature importance and interpretation
  4.5 Comparative analysis
    4.5.1 Comparison of predictions and actual sales
    4.5.2 Residual Analysis
5 Conclusions
  5.1 Summary of comparative analysis
  5.2 Main findings and recommendations for future research
References
Appendices
  Appendix 1. Functions created in data analysis
  Appendix 2. Parameter grids
  Appendix 3. Summary of the dataset
  Appendix 4. LightGBM results with different feature sets
  Appendix 5. XGBoost results with different feature sets
  Appendix 6. MLP results with different feature sets
  Appendix 7. Exploratory data analysis

Figures
Figure 1. Retail trade volume and turnover (Eurostat, 2024).
Figure 2. Common Data Patterns (Sanders, 2015, p. 23).
Figure 3. Components of data (Sanders, 2015, p. 24).
Figure 4. Venn diagram depicting the relationship between statistical concepts.
Figure 5. Decision tree architecture, adapted from Mohri et al. (2017, p. 7).
Figure 6. Random forest architecture, adapted from Jiang et al. (2016, p. 58).
Figure 7. Gradient boosting architecture, adapted from Xu et al. (2023, p. 3).
Figure 8. ANN Architecture, adapted from Bre et al. (2018, p. 1430).
Figure 9. Weekly Sales, 12-Week Moving Average.
Figure 10. Feature Correlation Heatmap with Spearman’s Correlation.
Figure 11. Outlier detection with interquartile range.
Figure 12. Boxplot of Weekly_Sales.
Figure 13. Histogram of Unemployment data.
Figure 14. Visualization of train-test split with different ratios (80:20 vs. 65:35).
Figure 15. LightGBM permutation feature importance.
Figure 16. XGBoost permutation feature importance.
Figure 17. MLP permutation feature importance.
Figure 18. Actual weekly sales vs. predicted weekly sales.
Figure 19. Residual Plots for each model.
Figure 20. Summary of results.

Tables
Table 1. Principles of forecasting according to Armstrong (2001, pp. 61–66).
Table 2. Forecasting principles according to Sanders (2015, pp. 18–19).
Table 3. Conclusion of advantages of demand forecasting.
Table 4. Explanations for each variable of the dataset.
Table 5. LightGBM cross-validation results.
Table 6. LightGBM final test results.
Table 7. XGBoost cross-validation results.
Table 8. XGBoost final test results.
Table 9. MLP cross-validation results.
Table 10. MLP final test results.
Table 11. Feature configuration.
Table 12. Comparison of the models' error (RMSE, MAE, MASE, R²).

Algorithms
Code 1. LightGBM with optimized parameters.
Code 2. XGBoost with optimized parameters.
Code 3. MLP with optimized hyperparameters.

Equations
Equation 1. Equation of root mean square error (RMSE).
Equation 2. Equation of mean absolute error (MAE).
Equation 3. Equation of mean absolute scaled error (MASE).
Equation 4. Equation of R-squared (R²).
Equation 5. Equation of seasonal naïve forecast.

1 Introduction

The structures of consumer-driven industries have been reshaped over the past few decades as competition has intensified and market dynamics have increased due to globalization driven by free trade. The operational structures of global organizations have grown into complex entities, covering everything from the procurement of raw materials to the sale of the final product. As the operational structures of organizations expand, their administration and management become increasingly complex, as the volume and dimensionality of the data affecting strategic planning increase. Based on current literature, organizational decision-making and management are increasingly dependent on accurate demand forecasts to keep up with growing competition. However, despite technological advances and the availability of data, the complex nature and unpredictability of demand challenge consistent forecasting in both research and practice.

Accurate demand forecasting is a critical operational factor in supporting strategic planning (Caniato et al., 2005; Lima et al., 2024; Mircetic et al., 2022). It is crucial for the planning of functional processes, such as financing, logistics, inventory management, marketing, and production of a profitable business (Lima et al., 2024; Mircetic et al., 2022, p. 2514). Forecasting demand can help optimize inventory, increase customer satisfaction, and reduce waste (Ganguly & Mukherjee, 2024, p. 884), making it an essential part of effective organizational management. A study of the global retail market estimated that inefficiencies in inventory management alone cost retailers 1.106 billion U.S.
dollars yearly (IHL, 2015, as cited in Disney et al., 2021). Inefficiencies in organizational structures can increase costs that could be minimized by systematic optimization. According to a report by McKinsey & Company (2022), advanced forecasting and digital process optimization can significantly improve operational efficiency. The report states that companies were able to improve the accuracy of their demand forecasts from 60% to 90% by replacing manual forecasts with machine learning models. Furthermore, an intelligent procure-to-pay process reduced processing time from days to minutes and achieved 15–20% lower costs. Similarly, the Institute of Business Forecasting and Planning (2018) reported that a 15% increase in forecasting accuracy can yield a 3% pre-tax improvement. Regardless of its size, a forecast error is significant in terms of a company’s results. Even a one percent improvement in the forecast was able to improve the results of a company with a turnover of 50 million by 1.52 million during its fiscal year. Additionally, accurate demand forecasting can reduce annual operating expenditure by 7% (Mitra et al., 2022, p. 3). Forecasts are therefore significant drivers in a changing competitive environment.

Figure 1 provides an overview of the volatility of the retail market in the euro area over the past 10 years. The figure displays two indices: trade volume and turnover, both excluding motor vehicles and motorcycles. The trade volume index indicates the inflation-adjusted real trade volume, and the turnover index measures nominal retail turnover. Both indices are monthly, seasonally and calendar-adjusted data from the European Economic Area (EEA) and selected non-EU states.

Figure 1. Retail trade volume and turnover (Eurostat, 2024).

The statistics illustrate how the macroeconomic effects triggered by the pandemic are reflected in consumer behavior.
Strong volatility, starting from 2020, continues until 2022, after which the graph reveals the impact of inflation in the euro area. Nominal turnover would indicate that retail trade accelerated at the end of 2022, although in reality, according to the inflation-adjusted retail sales volume, trade slowed down and declined in 2022–2023. The fluctuations of retail sales on a macroeconomic scale underscore the importance of effective forecasting. The volume and turnover indices also illustrate why, when making a forecast, it is necessary to be aware of which variables are used as inputs in the predictive analysis.

Due to the increased volume and dimensionality of data, traditional forecasting methods usually fall short of the requirements for efficient forecasting (Mediavilla et al., 2022, p. 1126), yet many forecasts in the retail sector are still made manually based on experience and intuition (Falatouri et al., 2022, p. 995). Current literature focuses on advanced forecasting methods, which are often based on machine learning (ML). These state-of-the-art algorithms can process large volumes of data and are efficient in finding causal connections between independent variables (Ganjare et al., 2023, p. 2237). Studies have shown that different ML methods can outperform traditional forecasts in specific prediction problems (Schmid et al., 2025, p. 2). Petropoulos and colleagues (2025) discuss the current state of retail demand forecasting. They depict the current situation as twofold, using the concrete example of a U.S.-based retail company operating at over 10,000 different locations and managing approximately 200,000 unique stock keeping units (SKUs).
Field tests and forecasting competitions have proven that advanced algorithms can be used to make accurate predictions of demand, but as the scale increases, computational constraints, the complexity of forecasts, and a lack of practicality create clear limitations for effective forecasting (Petropoulos et al., 2025, p. 1564). Forecasting demand can be effective for an individual store or product, but in modern, data-driven organizations, forecasting is not just a single process but a multi-level system that extends across different hierarchies of the business. Furthermore, the purpose of demand forecasting is not only to estimate the sales of a single SKU, but also to estimate, for instance, regional purchasing power over a specific period. Such estimates can influence strategic decisions such as the location of new business premises (Petropoulos et al., 2025, p. 1564).

According to Yasir et al. (2024), the factors affecting demand are typically divided into internal and external. The internal factors are endogenous, organization-specific variables, and the external variables include, for instance, geographic location, temperature, seasons, and holidays (Falatouri et al., 2022, p. 995) as well as macroeconomic indicators, such as interest rates, trade volumes, prevailing employment rates, and exchange rates (Yasir et al., 2024, p. 2868). External variables have proven to be valuable in time series forecasting (Abolghasemi, Hurley, et al., 2020, p. 2; Falatouri et al., 2022, p. 995), but research on the significance of macroeconomic factors appears to be limited.

Despite the recognized value of forecasting in business operations, its implementation is relatively limited due to its complex nature. According to Schneider et al. (2021), the difficulty of forecasting stems from the fact that measuring demand is not explicit.
They state that the accuracy of a forecast is influenced by a plethora of potentially relevant variables, making it difficult to identify which factors truly improve forecasting accuracy. Additionally, traditional methods are usually too unsophisticated, and advanced methods are still in the early stages of development, especially in complex decision-making situations (Schneider et al., 2021, p. 218). Abolghasemi and colleagues (2020) support this statement by concluding that when developing predictive models, it is important to find a balance between complexity and accuracy to maintain the efficiency and reliability of the model without unnecessary data usage.

This thesis focuses on the features used in demand forecasting, i.e., the external factors based on which the target (dependent) variable is predicted. According to previous research, forecasting is crucial in terms of operational efficiency. Advanced models, such as algorithms based on artificial intelligence, have also been found to produce accurate predictions. However, there are relatively few studies focusing on the relationship between the external factors used in predictions and the forecasting models. The study tests three AI-based models for making predictions and examines how the models use the features of the same time series dataset.

1.1 Research questions and purpose

Demand arises from the need for a specific good, product, or service. This need is influenced by various external factors, such as purchasing power, seasonality, and trends. Consequently, the final investment or purchase decision depends on this need, as well as on, e.g., price, quality, economic situation, and substitutes. Thus, demand is influenced by numerous external variables, on the basis of which companies make important investment and operational decisions. Demand forecasting as a topic therefore combines two highly complex areas: demand and forecasting.
Currently, most research focuses on comparing the efficiency and predictive accuracy of different algorithms and does not take external features into account when forecasting sales or demand (Deng et al., 2025, p. 156). Even when studies address forecasting together with specific exogenous factors related to demand, the emphasis is often solely on minimizing prediction error, without considering the contribution of the exogenous features. Huber and Stuckenschmidt (2020) focused their research on analyzing forecast accuracy on specific calendar and holiday-related days in the retail domain. They used various external features in their predictive analyses, such as store location and type, temporal characteristics (lags, rolling medians, etc.), and sales promotions and special days. Their evaluation is based on comparing the models against baseline forecasts and comparing the forecast error margins of the methods used. The study did not investigate the impact of the external features on forecast accuracy. However, in their proposal for further research, the authors suggest that industry-related insights could be explored by studying the contributions of external features. Furthermore, Deng et al. (2025) support the proposal by stating that analysis of the consumer business domain is insufficient, and thus research on the features of the retail market is needed to optimize model structure. They also conducted their research on retail, comparing a model combining LightGBM and Prophet to single prediction models such as LSTM, SVR, and ARIMA. Additionally, Falatouri et al. (2022) compared machine learning models for an Austrian retail company and examined the effects from a supply chain management perspective. The results showed that profitability was increased by minimizing waste and increasing sales numbers.
Nevertheless, the study also notes that future research could focus on examining the impact of external features, such as calendar events, weather, or the availability of substitute products, to provide domain-specific insights.

This thesis aims to address the research gap identified in previous research on retail demand forecasting by clarifying the contribution of external factors to the accuracy of forecasts made using machine learning models. The analyzed data comes from a U.S.-based retail chain. The data includes weekly sales figures from stores in 45 different locations, as well as data on contextual, macroeconomic, and temporal variables. This study aims to examine how machine learning models leverage external variables, such as unemployment, inflation, and fuel price data, in demand forecasting. This is examined by producing predictions with three different machine learning algorithms. The algorithms are used to create different models by providing them with different sets of features, allowing the predictions obtained with different features to be compared. Additionally, the results of the best feature set are used to produce a feature importance analysis using the permutation method. Motivated by this, by previous research, and by the background of the study, the research questions are as follows:

RQ1: How do the algorithms evaluate external features in their predictions?
RQ2: How does feature selection affect the predictive accuracy?

The research questions are based on the purpose and background of the thesis, and they guide the research toward filling the research gap identified within the field of study. The empirical part of the thesis is carried out by producing predictions with four different predictive models, one of which serves as the baseline for the machine learning algorithms. The machine learning algorithms used are LightGBM, XGBoost, and Multilayer Perceptron (MLP), with the seasonal naïve forecast serving as the baseline.
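
The permutation method mentioned above can be illustrated with a minimal sketch. It is not the implementation used in this thesis: the toy data, the two feature names (a one-week sales lag and an unemployment rate), the stand-in model `toy_model`, and the choice of RMSE as the score are all assumptions made for illustration. The idea is to shuffle one feature column at a time and record how much the prediction error grows; a larger increase suggests that the model relies more heavily on that feature.

```python
import random


def rmse(y_true, y_pred):
    """Root mean square error between two equally long sequences."""
    return (sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)) ** 0.5


def permutation_importance(predict, X, y, n_repeats=10, seed=0):
    """Mean increase in RMSE when a single feature column is shuffled."""
    rng = random.Random(seed)
    baseline = rmse(y, [predict(row) for row in X])
    importances = []
    for j in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            # Shuffle only column j, leaving the other features untouched.
            column = [row[j] for row in X]
            rng.shuffle(column)
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            permuted = rmse(y, [predict(row) for row in X_perm])
            increases.append(permuted - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances


# Hypothetical toy data: each row is [sales_lag_1, unemployment_rate].
data_rng = random.Random(42)
X = [[data_rng.uniform(50, 150), data_rng.uniform(4, 10)] for _ in range(200)]
y = [0.9 * lag + 0.1 * unemp for lag, unemp in X]


def toy_model(row):
    # Stand-in for a fitted model; it depends mostly on the lag feature.
    return 0.9 * row[0] + 0.1 * row[1]


scores = permutation_importance(toy_model, X, y)
for name, score in zip(["sales_lag_1", "unemployment_rate"], scores):
    print(f"{name}: RMSE increase {score:.3f}")
```

In this toy setting the lag feature receives a much larger score than the unemployment rate, simply because the stand-in model was constructed to depend mainly on the lag.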

1.2 Objectives and clear limitations

The objective of this thesis is to create a clear framework for how demand, its forecasting, and advanced predictive models are interconnected. Furthermore, the study aims to answer how the prediction models used in the study handle external variables when creating predictions. The thesis starts with a literature review, which examines the basic theories of the subject areas and previous research on the topic. In the literature review, the key concepts and terminology are explained to build an understanding of the fundamentals of forecasting, demand forecasting, and applications of artificial intelligence. Next, the thesis reviews previous research, which allows for a more detailed examination of topics related to demand forecasting.

In the quantitative part of the research, the topics studied in the literature review are examined in practice. Three different machine learning models are tested and compared with each other, as well as against a seasonal naïve forecast. The aim is to examine the accuracy of the predictions using various error and performance metrics and to produce results that are as objective as possible. These metrics allow the predictive accuracy of models with different feature sets to be compared, which in turn yields quantitative results on how external factors affect predictive accuracy. Last, the importance of features is measured by permutation feature importance, which depicts the contribution of a single feature by determining how much the model relies on that feature.

The clear limitations of the thesis are related to the characteristics of the data, the methodology of this study, and model-specific technicalities which may impact the results. When conducting time series analysis, it would be desirable to have as much data available as possible, as data is needed both in the training phase and in the testing phase of the model.
The time series spans only three years and consists of 6435 data points. Some algorithms, such as neural network architectures, require a lot of data to function properly. Furthermore, the data is not based on actual values, as it is a simulated and anonymized dataset, which reduces the relevance of the research results to real-life situations. The study also lacks qualitative input that could support its objective and the generalizability of its results to real-life situations. The technical limitations are related to the tuning of model hyperparameters. Even if the study were repeated using the same dataset, data features, and algorithms, the results obtained could be different if the hyperparameters or their values were changed. To conclude, the thesis and its results are highly theoretical due to the limitations mentioned.

1.3 Structure of the paper

The thesis consists of five main chapters: introduction, literature review, methodology, results, and conclusions. The introduction presents the background for this thesis, as well as the purpose, research questions, objectives, and clear limitations of the study. Second, the literature review presents the rationale for the topic, the main concepts, and the relevant terminology. After the key concepts, it delves deeper into the research area by examining previous research papers on the topic and the algorithms used. This is followed by the methodology chapter, which explains the unit of analysis, process steps, data, and tools used to obtain the results. Its purpose is to provide the reader with guidelines on how this study was conducted. After the methodology, the results and how they were obtained are explained. The results chapter presents the concrete quantitative results of the analysis, examines them in more depth through visualizations and comparisons, and provides answers to the research questions.
Finally, the conclusions chapter summarizes the findings of this research, generalizes the results beyond this thesis and its data, and provides suggestions for further research.

2 Literature review

Forecasting demand is a crucial aspect of managing a business. According to Punia et al. (2020), demand forecasting is done to solve two main problems of companies in the manufacturing, retail, or distribution business: deciding the quantities for production or orders, and allocating resources. Furthermore, Sillanpää and Liesiö (2018), in their study on demand forecasting in the retail business, state that forecasts are required to effectively plan operations and incoming orders. They emphasize the need for information when applying demand forecasting in businesses and state that the data is most easily gathered from point-of-sale (POS) and stock keeping unit (SKU) sales data. However, creating forecasts solely from historical sales data can be problematic, since demand is influenced by numerous external variables. For example, forecast methods using POS data can be unreliable in the case of reschedules (Sillanpää and Liesiö, 2018, p. 4169). Because demand is a difficult variable to measure (Mitra et al., 2022, p. 3), a lot of research can be found on how forecasting methods are created and implemented.

The purpose of this chapter is to review the fundamentals of forecasting, which are then used as a basis for the discussion on demand forecasting to create a theoretical framework for the thesis. Next, machine learning models are examined, along with the principles of the models used in this thesis and how they are used in demand forecasting. Finally, previous research related to the topic is examined.

2.1 Fundamentals of forecasting

Forecasting is about making predictions of the future.
In the book Forecasting Fundamentals (2015), Nada Sanders explains that forecasting is all about predictions, whether of the future weather, the outcome of tomorrow’s match, or who will win the election. Sanders describes forecasting as the most important aspect of decision-making and stresses that inaccurate forecasting can leave a business in a very unfortunate state and even lead to bankruptcy. Forecasting influences many managerial decisions, such as the need for workers, how much inventory is enough and when to order more, what resources will be available, or how much production is appropriate for a given period (Sanders, 2015, pp. 4–5).

Armstrong (2001) proposes seven concrete principles for forecasting, with the intention of improving judgment by reducing the bias and/or inconsistency of the forecast. The principles are displayed in table 1.

Table 1. Principles of forecasting according to Armstrong (2001, pp. 61–66).
Principle | Effect on the forecast
1. Using checklists | Increases consistency
2. Defining and delimiting precise criteria | Increases consistency and efficiency, minimizes bias
3. Comparison and evaluation of previous forecasts | Increases consistency and minimizes bias
4. Visualization of results for interpretation | Minimizes bias and decreases error
5. Utilizing patterns and trend lines | Increases consistency
6. Using multiple forecast methods | Increases robustness
7. Peer reviews | Minimizes bias

Armstrong states that two of the seven, checklists (1) and the utilization of trend lines (5), increase the consistency of the forecast method. Using checklists encourages systematic consideration of the relevant variables, and utilizing trend lines when making judgmental forecasts helps to visualize the data, thus enabling pattern recognition and supporting more consistent decision-making. In contrast, the use of graphs for interpretation (4) and peer review of the probability of success (7) serve to minimize the bias within the forecast.
Armstrong suggests that studying data in graphic rather than tabular form increases forecasting accuracy and decreases error. Having peers review the probability of success reduces human error, which typically appears as overconfidence and increases the error of the forecast. Principles (2) and (3) help both in decreasing bias and in increasing consistency. Defining and delimiting precise criteria removes unnecessary variables from the forecast, making the forecast more efficient. Records from previous forecasts give forecasters a way to obtain cognitive feedback. However, Armstrong underlines that it is important to use the records appropriately to provide a reliable assessment (Armstrong, 2001, pp. 70–71).

Sanders (2015) proposes a different perspective on the forecasting principles with three main ideas. Whereas Armstrong’s seven principles are more traditional and concrete methods that should be considered when making predictions, Sanders’ criteria are suitable for forecasting at a general level, for example with ML models. Sanders’ principles are displayed in table 2.

Table 2. Forecasting principles according to Sanders (2015, pp. 18–19).
Forecasting principle | Rationale of the principle
1. Forecasts are rarely perfect | The goal of a good forecast is to minimize error, not to forecast perfectly
2. Forecasting clusters is more accurate than individual items | The overall variance can be minimized by diversification
3. Short-term forecasting is more accurate than long-term | Shorter time horizons involve less uncertainty, making short-term forecasts more reliable

Sanders (2015) also proposes a six-step forecasting process, which starts with deciding what to forecast in order to identify the real problem. The process continues with data cleaning, identifying data patterns, selecting models, generating the forecast, and measuring forecast accuracy.
Sanders highlights the importance of setting clear delimitations and focusing solely on the variables being forecasted. The process starts with identifying the core issue that the forecast is meant to solve. For instance, if unexpected demand causes a seller to run out of stock midway through the day, leaving customer needs unfulfilled, the sales data will underestimate the actual demand for that day. Thus, forecasts of sales and of demand require different approaches, although at first they seem to be measurable with the same parameters. To conclude, a clear consensus on what the forecast is for is crucial to the relevance and reliability of the forecast results (Sanders, 2015, pp. 20‒21). Determining the core issue provides the forecaster with a framework for data collection at the most detailed level possible.

Figure 2. Common Data Patterns (Sanders, 2015, p. 23).

The patterns (level or horizontal, trend, seasonality, and cycles) help the forecaster choose the right forecasting model, which in turn is more likely to produce a reliable forecast. Sanders (2015) explains that the more clearly the trend is identified through data analysis, the more accurate the forecast is likely to be. This is because greater random variance increases the difficulty of producing reliable forecasts, as depicted in figure 3 below.

Figure 3. Components of data (Sanders, 2015, p. 24).

After cleaning and analyzing the data, the forecasting model can be selected. The selection might not be straightforward, since there are plenty of different models for different types of datasets and patterns. Sanders highlights four factors that should be considered when choosing the model: forecast horizon, data patterns, the availability and quantity of data, and the required accuracy. After deciding which model to apply, the forecast can be generated, and the forecast results can be analyzed and interpreted.
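The additive view of the data components named above (level, trend, seasonality, and random variation) can be illustrated with a minimal Python sketch. All names and numbers here are hypothetical and chosen for illustration only; the sketch assumes an additive series and seasonal offsets that sum to zero, so that averaging over one full season cancels the seasonal component and leaves level plus trend.

```python
def compose_series(level, trend, seasonal, n_periods, noise=None):
    """Build y[t] = level + trend * t + seasonal[t % m] + noise[t] (additive assumption)."""
    m = len(seasonal)
    noise = noise or [0.0] * n_periods
    return [level + trend * t + seasonal[t % m] + noise[t] for t in range(n_periods)]


def season_window_means(series, m):
    """Average every run of m consecutive values.

    When the seasonal offsets sum to zero, each window of one full season
    cancels them, so the result follows level + trend only.
    """
    return [sum(series[i:i + m]) / m for i in range(len(series) - m + 1)]


# Hypothetical quarterly pattern (m = 4) with offsets that sum to zero.
seasonal = [5.0, -5.0, 10.0, -10.0]
series = compose_series(level=100.0, trend=2.0, seasonal=seasonal, n_periods=8)
smoothed = season_window_means(series, m=4)

print(series)    # [105.0, 97.0, 114.0, 96.0, 113.0, 105.0, 122.0, 104.0]
print(smoothed)  # [103.0, 105.0, 107.0, 109.0, 111.0]
```

The smoothed values rise by exactly the trend of 2.0 per period because no noise was added. Passing a `noise` list would show how random variance blurs the same pattern.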
When the possible forecasting errors are identified and the model provides accurate results, the dataset should be updated as new relevant data becomes available (Sanders, 2015, pp. 25-26).

Demand forecasting can be implemented with either qualitative or quantitative methods. Qualitative methods, such as historical analogies, market research, questionnaires, and the Delphi technique, are used to predict market demand (Mitra et al., 2022, p. 58). Quantitative methods rely on numerical and measurable data, and most demand forecasting methods evaluate causal relationships between independent and dependent variables through regression analysis. Other data-driven models in use are time-series models such as exponential smoothing and moving average methods (Mitra et al., 2022, p. 58; Punia et al., 2020, pp. 2-3). This thesis focuses solely on quantitative demand forecasting.

2.2 Demand forecasting

In today’s global economy, organizations need to be efficient in terms of cost optimization, information flows, delivery, information transparency, and development to be able to keep up with global competition (Abolghasemi, Beh, et al., 2020, p. 2; Feizabadi, 2022, p. 121; Mediavilla et al., 2022, pp. 1126–1127; Mitra et al., 2022, p. 2). Globalization has driven the trend of outsourcing, which increases the complexity and internationality of supply chains and lengthens procurement lead times (Feizabadi, 2022, pp. 121–122). When making purchases or planning production volumes, a company needs a plan for how much product on the shelf or in production is sufficient. Too much supply is costly in terms of storage costs, surplus, and labor costs. Insufficient supply, on the other hand, means lost revenue, decreased customer satisfaction and loyalty, loss of goodwill, and possible overstocking against future demand (Abolghasemi, Beh, et al., 2020, p. 1; Punia et al., 2020, pp. 1–2).
Demand forecasting can help address many of these issues by providing insight into decision making, according to Abolghasemi et al. (2020). However, demand is highly volatile and not an easy variable to predict, since it is affected by various exogenous and endogenous factors (Mitra et al., 2022, p. 3). Despite demand being highly dependent on exogenous factors, Falatouri (2022) states that many retailers still conduct demand forecasting manually, making decisions based on their individual biases. This can lead to inaccuracies, because uncontrollable external factors such as price volatility, market cannibalization, or consumer behavior affect market demand. Efficient demand forecasting is conducted through strict processes and predictive analyses (Kilimci et al., 2019, pp. 1–2), utilizing advanced technologies and resources such as big data (Pereira & Frazzon, 2021, pp. 3–5).

Table 3. Conclusion of advantages of demand forecasting.
Authors | Advantages of demand forecasting
Abolghasemi et al. (2020) | Help in addressing volatility over the entire demand series, mitigating issues in upstream supply chains and increasing cost-efficiency
Feizabadi (2022) | Improve efficiency across the processes of entire supply chain
Ho et al. (2025) | Optimize storage, increase customer satisfaction, and process efficiency. Mitigates stockouts and overstocking
Huber & Stuckenschmidt (2020) | Increase in competitive advantage. Minimize the amount of discarded goods and waste
Jackson et al. (2024) | Increase strategic decision-making efficiency and cost efficiency
Khan et al. (2020) | Help enterprises to formulate market strategies, increase inventory turn-over rates, customer satisfaction, and transparency. Reduce waste and overall costs
Kilimci et al. (2019) | Increased cost efficiency by minimizing excessive stocks and stockouts. Increasing customer satisfaction
Lay et al. (2018) | Alleviates stockout and overstocking and increases customer satisfaction, which enables companies to gain sustainable competitive advantage

Table 3 illustrates how different researchers depict the effectiveness of demand forecasting. The broad effects highlight the demand for accurate forecasts and advanced forecasting methods. Furthermore, the increasing availability of data complicates the processes and slows down data processing if an organization lacks the necessary means to harness the data.

2.3 Artificial intelligence

Artificial intelligence (AI) is ubiquitous, regardless of the industry, as are the buzzwords associated with it, such as Machine Learning, Large Language Models, Deep Learning, and Big Data. Many researchers highlight how AI and its sub-areas work in different industries as drivers of efficiency (Dell’Acqua et al., 2023; Fosso Wamba et al., 2024; Jackson et al., 2024; Krakowski et al., 2023; Wasserbacher & Spindler, 2022).

Figure 4. Venn diagram depicting the relationship between statistical concepts.

Figure 4 provides a conceptual illustration of how the subareas related to artificial intelligence are interconnected. It can be concluded that AI refers to a field of computer science focused on creating models that perform tasks typically requiring human intelligence (Krakowski et al., 2023, p. 1426). Artificial intelligence has gained popularity as Large Language Models (LLMs) became more common with the release of ChatGPT by OpenAI in 2022 (Jackson et al., 2024, p. 1). LLMs quickly gained remarkable attention because of their generative capabilities (Dell’Acqua et al., 2023, p. 3; Jackson et al., 2024, p. 6120). They are best known for their ability to provide users with human-like answers and for their creative and analytical capabilities (Dell’Acqua et al., 2023, p. 1), which can be utilized to complement or substitute human work.
2.3.1 AI as a driver of efficiency

The integration of AI into human work is seen as an opportunity to complement the efficiency of individuals (Dell’Acqua et al., 2023, p. 1). AI affects human cognition and problem-solving ability and reduces the marginal cost of human thinking and reasoning, similar to how the internet lowered the cost of information sharing (Dell’Acqua et al., 2023, p. 18; Jackson et al., 2024, p. 6120). Applications of artificial intelligence enable machines to learn autonomously, which gives them the ability to cooperate with humans in problem solving and decision making (Krakowski et al., 2023, p. 1426). According to Krakowski and others (2023), AI’s ability to mimic the cognitive skills of humans is a unique capability in the field of technology. Individual cognitive abilities have traditionally been difficult to duplicate and scarce in supply; AI’s ability to provide cognitive skills is therefore a major advantage, since it can be utilized in decision making and managerial tasks (Krakowski et al., 2023, p. 1427). The added value of AI is not straightforward to calculate, but some researchers have examined ways to estimate it.

Efficiency can be increased by harnessing artificial intelligence: according to Jackson et al. (2024), AI tools can lead to a surge in productivity, enhance cognitive work efficiency, support logistics and warehouse management, and even help in negotiating optimal contracts. From a demand forecasting perspective, implementing AI methods can increase the accuracy of the forecast, which in turn increases supply chain resiliency (Mediavilla et al., 2022, p. 1130), improves order-picking performance, and enables an accurate response to upcoming demand spikes or uptrends (Ho et al., 2025, p. 2).
Jackson et al. (2024) further examine the capabilities of artificial intelligence and identify six core characteristics that distinguish AI from traditional technologies and that collectively serve to define and explain the concept of AI. The core characteristics are Learning, Perception, Prediction, Interaction, Adaptation, and Reasoning (Jackson et al., 2024, p. 6123), of which learning, prediction, and reasoning are by far the most important for demand forecasting.

Krakowski and others (2023) study the additional value of artificial intelligence from the perspective of the resource-based view, which defines an organization’s competitive advantage based on the availability and volume of its resources. Traditionally, cognitive skills are considered difficult to duplicate, scarce in supply, heterogeneously distributed across individuals, and decisive in decision making and problem solving. Thus, from the perspective of the resource-based view, they are regarded as valuable organizational resources. However, the predicted added value of AI contradicts this, as AI’s ability to learn and perform cognitive tasks undermines the irreplaceability that has made cognitive skills and abilities valuable resources. In addition, Krakowski and others note that technological resources such as AI are generally subject to relatively few constraints on imitation and that the marginal cost of reproducing them is almost negligible (Krakowski et al., 2023, p. 1426). By contrast, when AI is studied as a complement to individuals’ cognitive skills, it can further enhance cognitive resources and thus add value for the user. The unique capabilities of AI make it difficult to evaluate theoretically, which has led to a discussion about AI’s potential as both a substitute for and a complement to cognitive workers.

2.3.2 Machine learning methods

Machine Learning is a subcategory of artificial intelligence (Ganjare et al., 2023, p. 2237; Mediavilla et al., 2022, p.
1126), and it is being harnessed in different industries to analyze masses of data. Industries and applications utilizing ML include search engines, finance, logistics, e-commerce, and inventory management. Examples of the tasks ML is used for are fraud detection, detecting spam emails, predictive analyses, optimizing inventory levels, and providing a personalized feed for the consumer on an e-commerce platform (Ganjare et al., 2023, p. 2237). Barua et al. (2020) generalize that machine learning is a way to teach computers to learn new tasks naturally from experience, resembling the way organisms acquire new knowledge. In ML, the computer uses a computational method, such as an algorithm, to learn directly from the dataset without a predetermined equation, unlike traditional statistical methods. Compared to conventional methods, ML provides faster and more accurate analyses of large data sets and better tools for predictive data analysis, whereas conventional statistical models describe relationships between variables based on predetermined models, such as regression (Ganjare et al., 2023, pp. 2236–2237; Rajula et al., 2020, pp. 1–2).

There are three general categories of machine learning, which differ in terms of the quality of the data and the methods used to teach the machine: supervised, unsupervised, and reinforcement learning (Barua et al., 2020). Barua et al. further divide supervised learning into two sub-categories: classic supervised learning and ensemble learning. In addition to the three main categories, Wasserbacher and Spindler (2022) propose semi-supervised learning as a fourth category of machine learning. Semi-supervised methods include models that utilize small amounts of labeled data in addition to unlabeled data. They aim to enhance supervised learning in environments where labeled data is scarce (Wasserbacher & Spindler, 2022, p. 67).
However, semi-supervised methods are not widely covered in current literature.

In contrast with supervised learning, unsupervised learning models are not trained to create predictions based on pre-defined and labeled data. The task of unsupervised learning is to identify patterns and relationships within unlabeled data, without prior knowledge or predefined labels about what the data represents (Barua et al., 2020, pp. 2–3). The only known parameter of the unlabeled dataset is the joint distribution (Wasserbacher & Spindler, 2022, p. 67). According to Wasserbacher and Spindler (2022), the benefit of unsupervised learning is the recognition of previously unseen insights in the data. As an example, they give customer segmentation based on consumers’ demographic characteristics, socio-economic status, and behavior. These kinds of hidden patterns and features are primarily recognized by clustering and principal component analysis (PCA) (Barua et al., 2020, p. 2). For instance, in a study conducted by Kılıç Sarıgül et al. (2025), two clustering algorithms were utilized to categorize unlabeled data. They used the K-means and Mean-Shift algorithms to cluster three datasets separately, identifying three distinct clusters (low, medium, and high) by their performance metrics. By assigning cluster-based labels, the researchers were then able to analyze the data further with supervised learning algorithms to predict future shipment performance. This hybrid approach enabled proactive operational management by allowing early identification of underperforming shipments (Kılıç Sarıgül et al., 2025, pp. 22–23). Thus, the unsupervised learning algorithms were used to organize an unlabeled dataset so that it could be processed by supervised learning algorithms to model the relationship between input data and output values.
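The hybrid idea described above, clustering unlabeled data first and then training a supervised learner on the resulting labels, can be sketched in a few lines of Python. This is a simplified illustration and not the method of the cited study: the data are invented, the k-means routine works on a single variable with evenly spaced starting centroids, and a 1-nearest-neighbour rule stands in for the supervised algorithm.

```python
def kmeans_1d(values, k, n_iter=20):
    """Plain k-means on scalar values (k >= 2) with evenly spaced starting centroids."""
    lo, hi = min(values), max(values)
    centroids = [lo + (hi - lo) * i / (k - 1) for i in range(k)]
    assignment = [0] * len(values)
    for _ in range(n_iter):
        # Assignment step: each value joins its closest centroid.
        assignment = [min(range(k), key=lambda c: abs(v - centroids[c])) for v in values]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [v for v, a in zip(values, assignment) if a == c]
            if members:
                centroids[c] = sum(members) / len(members)
    return centroids, assignment


def nearest_neighbor_predict(train_x, train_y, x):
    """1-nearest-neighbour classifier: copy the label of the closest training point."""
    idx = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[idx]


# Hypothetical data: delay (hours) is the unlabeled performance metric,
# distance (km) is an input feature known before the shipment leaves.
delay = [1.0, 1.5, 2.0, 10.0, 11.0, 12.0, 30.0, 31.0, 29.0]
distance = [50, 60, 55, 300, 320, 310, 900, 950, 920]

# Step 1 (unsupervised): group shipments by delay into three clusters.
centroids, clusters = kmeans_1d(delay, k=3)
names = ["low", "medium", "high"]  # centroids start sorted and stay ordered for this data
labels = [names[c] for c in clusters]

# Step 2 (supervised): predict the delay class of a new shipment from its distance.
print(centroids)                                        # [1.5, 11.0, 30.0]
print(nearest_neighbor_predict(distance, labels, 880))  # high
```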
In reinforcement learning, a learning agent learns by trial and error from the environment in which it operates (Barua et al., 2020, p. 2; Wasserbacher & Spindler, 2022, p. 67). Sutton and Barto (2015) state that the basic idea in reinforcement learning is to capture the most important aspects of a real problem by giving the learning agent feedback from the learning environment. The main distinguishing aspects of reinforcement learning are the closed-loop reward system, the lack of direct instructions for the learning agent, and the question of where and for how long the consequences of actions affect the agent’s performance. Because reinforcement learning does not use pre-labeled examples and data, it is sometimes regarded as a subset of unsupervised learning algorithms (Sutton & Barto, 2015, p. 3). Based on the reward system of the learning environment, after a failed attempt the agent decides the best action to take in order to succeed in the given task and to maximize the numerical reward (Barua et al., 2020, p. 3). Sutton and Barto (2015) present the exploration-exploitation dilemma, which highlights a trade-off inherent in reinforcement learning. The learning agent must prefer past actions that it has found effective in producing reward, but it must also try new, unseen actions in order to discover such effective actions. The agent must therefore exploit well-proven actions while at the same time exploring possibly better actions to select in the future. Unlike in supervised learning, there are no instructions in reinforcement learning on how to proceed. The learning agent must decide how to approach the task purely on the basis of data gathered through the reward system.

In supervised learning the machine is taught from a training set of labeled data, provided by a human annotator or domain expert (Sutton & Barto, 2015, p. 2).
Supervised learning has two main steps: first, training the model with a predefined training dataset, and second, evaluating the model with a separate, unseen dataset, which is referred to as test data. The predefined data is called labeled data. The purpose of using labeled training data is to teach the model to recognize the relationship between the input data and the correct output values (Kılıç Sarıgül et al., 2025, p. 14) so that it can create accurate predictions, which can inform decisions based on unseen data (Barua et al., 2020, p. 2; Sutton & Barto, 2015, p. 2; Wasserbacher & Spindler, 2022, p. 67). An example of where supervised learning can be utilized is predicting future sales based on known input variables, such as date, historical prices and sales, and the availability of competitors’ products (Wasserbacher & Spindler, 2022, p. 67).

Barua and others divide supervised learning into two categories, classic supervised learning and ensemble learning. Classic supervised learning involves training a single model using predefined labeled learning data. The most common classic supervised algorithms include regression analysis, k-nearest neighbors (KNN), Artificial Neural Networks (ANN), decision trees, and Support Vector Machines (SVM) (Barua et al., 2020, p. 2). Chaudhuri and others (2021) also state that decision tree, SVM, ANN, and random forest are among the most used analytical methods for forecasting. Interpretability, ease of use, and the ability to handle categorical variables have popularized decision tree and random forest over ANN. On the other hand, ANN’s capability to handle multidimensional datasets and its efficient use of resources are superior compared to the easier-to-use random forest and decision tree (Chaudhuri et al., 2021, p. 3).

2.3.2.1 Decision tree, random forest, bagging, and boosting

Decision tree is a conceptually simple but effective and versatile (Barua et al., 2020, p. 2) non-parametric supervised learning algorithm.
It is capable of both classification and regression tasks, and it forms a flowchart-like hierarchical structure consisting of a root node, internal nodes, branches, and leaf nodes. In a tree, each node splits the data into subsets based on feature values, and the branches represent the outcomes of these tests (Barua et al., 2020, p. 2). Its straightforward implementation makes the decision tree fast at producing a forecast (Barua et al., 2020, p. 2), and together with its interpretability this has made it a widespread model for forecasting-related applications (Chaudhuri et al., 2021, p. 3). Barua and others (2020) present a case study by Mohri and Haghshenas (2017), where a decision tree-based algorithm was used to determine when the use of shipping containers is optimal. The input variables included, for instance, the price, weight, value, and distance of the shipment. The most important variables were item perishability, value of goods, distance, destination, and point of departure. On the other hand, according to Chaudhuri and others (2021), decision trees can be relatively unstable, as their tendency to overfit makes them sensitive to variance and noise in the data. However, this sensitivity can be addressed by composing multiple decision trees into one ensemble learning model such as random forest (Huber & Stuckenschmidt, 2020, p. 1426). Ensemble methods improve accuracy and robustness while reducing variance (Barua et al., 2020, p. 2) compared to training a single decision tree.

Figure 5. Decision tree architecture, adapted from Mohri et al. (2017, p. 7).

Ensemble learning combines the predictions of several individual models to form a composite model. It uses the outputs of the individual models, referred to as base learners or weak learners, to produce its own predictions.
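To make the tree structure and the idea of combining base learners concrete, the following minimal Python sketch fits the smallest possible regression tree (a root node with one split and two leaves, often called a stump) and then averages several stumps trained on bootstrap resamples. The data, function names, and the choice of squared error as the split criterion are assumptions made for illustration, not the setup of any cited study.

```python
import random


def fit_stump(xs, ys):
    """Root node with one split: choose the threshold that minimises squared error of two leaf means."""
    pairs = sorted(zip(xs, ys))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # cannot split between identical feature values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
        sse = sum((y - left_mean) ** 2 for y in left) + sum((y - right_mean) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    if best is None:
        raise ValueError("need at least two distinct x values")
    return best[1:]


def predict_stump(stump, x):
    """Follow the branch chosen by the test at the root and return the leaf value."""
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def fit_bagged_stumps(xs, ys, n_models, seed=0):
    """Train each base learner on a bootstrap resample (sampling with replacement)."""
    rng = random.Random(seed)
    n = len(xs)
    stumps = []
    while len(stumps) < n_models:
        idx = [rng.randrange(n) for _ in range(n)]
        if len({xs[i] for i in idx}) < 2:
            continue  # resample: a stump needs two distinct x values
        stumps.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return stumps


def predict_ensemble(stumps, x):
    """Composite prediction: the average of the base learners' outputs."""
    return sum(predict_stump(s, x) for s in stumps) / len(stumps)


# Hypothetical data: discount (%) versus weekly unit sales.
discount = [0, 5, 10, 30, 40, 50]
sales = [100, 110, 105, 200, 210, 190]

print(fit_stump(discount, sales))  # (20.0, 105.0, 200.0)
stumps = fit_bagged_stumps(discount, sales, n_models=25)
print(predict_ensemble(stumps, 45))
```

The single stump splits at a 20% discount and predicts the mean sales on each side. The averaged ensemble smooths over the differences between the resampled stumps, which is the variance-reducing effect attributed to ensembles above.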
Barua and others (2020) state that ensemble methods use a two-step process: first, developing a population of base models from training data, and second, combining the base models to form the composite predictor (Barua et al., 2020, p. 2). Kilimci and others (2019) name the two steps of ensemble learning “ensemble generation and ensemble integration” and state that learning methods are combined to boost system performance (Kilimci et al., 2019, p. 2). Hastie and others (2009) also split ensemble learning into two tasks: creating a population of base learners and combining them to form the composite. The idea is to build a prediction model that combines the strengths of simple learning models (e.g. a single decision tree) into a more efficient predictive model (e.g. random forest). Thus, the underlying concept of ensemble learning methods is to develop an enhanced form of supervised learning from base learners, enabling more efficient predictive modeling that compensates for the weaknesses of individual models.

Figure 6. Random forest architecture, adapted from Jiang et al. (2016, p. 58).

Common ensemble techniques include bootstrap aggregating (bagging), boosting, and stacking, each of which employs a different strategy to achieve better performance by improving the accuracy of predictions and the robustness of the model while minimizing bias and variance (Barua et al., 2020, p. 2). Boosting methods (e.g. LightGBM and XGBoost), for instance, train a series of weak learners and compile the predictions of subsequent learners as the sum of the trained simple models (Huber & Stuckenschmidt, 2020, p. 1426). For example, AdaBoost has been used in conjunction with SVMs to enhance predictive performance, as referenced by Ghareeb and others (2020, p. 1).
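The boosting principle described above, a sequence of weak learners whose outputs are summed, can be sketched as follows. This is a deliberately simplified illustration for squared-error loss with one-split trees fitted to the current residuals; it is not the algorithm implemented in LightGBM or XGBoost, which add regularization and far more elaborate tree construction. The data, function names, and the learning rate of 0.5 are assumptions.

```python
def fit_stump(xs, targets):
    """One-split regression tree: pick the threshold minimising squared error of two leaf means."""
    pairs = sorted(zip(xs, targets))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i][0] == pairs[i - 1][0]:
            continue
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [t for _, t in pairs[:i]]
        right = [t for _, t in pairs[i:]]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((t - lm) ** 2 for t in left) + sum((t - rm) ** 2 for t in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, lm, rm)
    return best[1:]


def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def boost(xs, ys, n_rounds, learning_rate):
    """Start from the mean, then repeatedly fit a stump to the residuals of the current model."""
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * predict_stump(stump, x) for p, x in zip(preds, xs)]
    return base, learning_rate, stumps


def predict(model, x):
    """Final prediction: base value plus the scaled sum of all weak learners."""
    base, learning_rate, stumps = model
    return base + learning_rate * sum(predict_stump(s, x) for s in stumps)


def mse(model, xs, ys):
    return sum((y - predict(model, x)) ** 2 for x, y in zip(xs, ys)) / len(ys)


# Hypothetical training data.
xs = [1, 2, 3, 4, 5, 6]
ys = [10, 12, 15, 30, 32, 35]

for rounds in (0, 1, 20):
    model = boost(xs, ys, n_rounds=rounds, learning_rate=0.5)
    print(rounds, round(mse(model, xs, ys), 3))
```

The training error shrinks as rounds are added, because each new stump corrects part of what the previous ones left unexplained. Error on the training data says nothing about generalization, which is why boosting libraries expose parameters that limit the number and size of trees.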
Whereas boosting trains its learners sequentially and focuses on reducing bias and variance by improving the accuracy of the model, bagging combines homogeneous weak learners that are trained independently and in parallel on random subsets of the training data, which primarily prevents overfitting by reducing variance. Random forest is a well-known bagging method, which combines individual simple decision trees and is suitable for both regression and classification tasks (Ghareeb et al., 2020, p. 1).

Figure 7. Gradient boosting architecture, adapted from Xu et al. (2023, p. 3).

Whilst bagging and boosting usually combine homogeneous weak learners, stacking utilizes heterogeneous learners, leveraging the strengths of different algorithms. The objective is to optimally incorporate the results of weak learners to improve the ability to make accurate predictions on new, unseen data (Ghareeb et al., 2020, p. 1). According to Ghareeb and others (2020), multistep prediction methods like stacking are more sensitive to errors due to their complexity. However, the same complexity is also said to be an advantage, making stacking more effective in forecasting complex datasets compared to other ensemble models.

2.3.2.2 Artificial Neural Networks

Artificial Neural Networks (ANN) consist of artificial neurons connected to each other in an arranged series of layers. The artificial neurons in ANNs are usually referred to as units, and a single ANN system can consist of dozens to millions of units, depending on how complex the neural network is (Barua et al., 2020, p. 2; Seyedan & Mafakheri, 2020, p. 12). The neurons are connected by synapses, which together construct the layers of the neural network. Neural networks usually include three types of layers: an input layer, a hidden layer, and an output layer. ANNs with more than one, usually multiple, hidden layers are called Deep Neural Networks (DNNs) (Punia et al., 2020, p. 4).
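The layer structure described above can be made concrete with a minimal forward pass through a network with one hidden layer, written in plain Python. The weights, the input, and the choice of ReLU as the hidden-layer activation are assumptions for illustration; a trained model would learn its weights from data.

```python
def relu(values):
    """Rectified linear activation: negative unit outputs are set to zero."""
    return [max(0.0, v) for v in values]


def dense(inputs, weights, biases):
    """One fully connected layer: each unit computes a weighted sum of all inputs plus a bias."""
    return [sum(w * x for w, x in zip(unit_weights, inputs)) + b
            for unit_weights, b in zip(weights, biases)]


def mlp_forward(x, hidden_w, hidden_b, out_w, out_b):
    """Input layer -> hidden layer (ReLU) -> single linear output unit for regression."""
    hidden = relu(dense(x, hidden_w, hidden_b))
    return dense(hidden, out_w, out_b)[0]


# Hypothetical network: 2 inputs, 2 hidden units, 1 output.
x = [1.0, 2.0]
hidden_w = [[0.5, -1.0],   # weights of hidden unit 1
            [1.0, 1.0]]    # weights of hidden unit 2
hidden_b = [0.0, -1.0]
out_w = [[3.0, 0.5]]
out_b = [0.25]

print(mlp_forward(x, hidden_w, hidden_b, out_w, out_b))  # 1.25
```

Here the first hidden unit receives a negative weighted sum (-1.5) and is switched off by the activation, so only the second unit (2.0) contributes to the output: 0.5 * 2.0 + 0.25 = 1.25. Stacking further `dense` and `relu` calls between input and output would turn the same sketch into a deep network.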
The information processing of artificial neural networks resembles the way animals make decisions (outputs) based on a learned logical model of information processing (Barua et al., 2020, p. 2). Deep neural network models are used for complex problems, such as image recognition, and they form the core architecture of large language models (GPT-4, Llama 3, etc.). These models typically require a lot of computational resources to train and run. The common architecture of an artificial neural network is displayed in figure 8 below.

Figure 8. ANN Architecture, adapted from Bre et al. (2018, p. 1430).

This thesis uses a Multilayer Perceptron (MLP) regressor because of its relatively fast operation and training times, and because it is an effective model for analyzing time series data. Although different neural network-based models share the same type of architecture (Figure 8), there can be significant differences in their learning processes according to Ramos et al. (2023, p. 672). The relatively small amount of data used in the project might have been insufficient for highly complex neural network models, such as LSTM (long short-term memory) or RNN (recurrent neural network) models, which operate with a different, more sophisticated memory concept compared to MLP (Ramos et al., 2023, pp. 672–673).

2.4 Creating forecasts with machine learning

The use of machine learning in forecasting has received a huge amount of attention in recent years. Forecasts produced by machines are objective and mitigate human error (bias). Even though machine-based methods are applied by humans, most bias can be eliminated from a forecast with the right parameters and restrictions. Machine-based methods are particularly well suited to forecasting demand, as demand itself is a multidimensional concept.
First, there are different types of demand influenced by context and temporal features (Armstrong, 2001), and even if the form of demand is known, it is influenced by numerous macro- and microeconomic factors, not all of which are quantitative, i.e., some of them cannot be measured (Schneider et al., 2021, p. 218). Therefore, it is important to recognize demand forecasting as a complex domain and to consider it from different perspectives. Falatouri et al. (2022) state that the objective of forecasting is to discover data patterns and provide accurate forecasts, and that machine learning is one of the most viable tools for processing data into accurate and transparent output. They propose that machine learning methods for demand forecasting can be divided into three categories: time series analysis, regression-based methods, and supervised and unsupervised methods. Furthermore, they state that demand forecasting is done at long-term or short-term levels, short-term forecasts covering six to twelve months and long-term forecasts more than a year (Falatouri et al., 2022, p. 994).

2.4.1 Feature-based forecasts

Li and others (2023) researched the connection between intermittent demand and inventory management accuracy. Their research discusses the characteristics of demand, emphasizing that demand is very commonly intermittent, i.e. often zero, which is rarely considered when studying demand. Predicting intermittent demand is particularly difficult due to the uncertainty caused by the stochasticity and timing of demand. Previous literature has proposed methods to solve the problem of intermittent forecasting by dividing the time series into different intervals, but more recently accuracy has been improved by machine learning methods, such as artificial neural networks. According to them, combining different forecasting methods has proved efficient compared to individual methods, providing better or equal results.
However, their own study concentrates on improving forecasting by engineering new features based on time series data. Li et al. (2023) produced a forecasting model based on XGBoost, in which they utilize features derived from time series selected for intermittent demand. The feature-based model produced accurate predictions for variables that have an immediate impact on inventory management decisions (L. Li et al., 2023, p. 7568).

In Feizabadi’s (2022) study on demand forecasting using autoregressive models and neural networks, demand forecasting is found to be largely influenced by the characteristics (features) of the type of product and industry. Feizabadi gives metal products as an example of so-called functional products. Functional products have less product variety, longer life cycles, lower profit margins, and lower inventory risk. Feizabadi notes that it is easier to predict demand for these products downstream in the supply chain, closer to the consumer, than upstream, where demand is created mainly by organizational suppliers and buyers. However, when moving from downstream to upstream in the supply chain, updating demand forecasts is the single biggest cause of demand-supply mismatches and inefficiencies (Feizabadi, 2022, pp. 119–121), as separate parts of the supply chain update their own demand forecasts based on the purchase signal generated by the end customer. Updating demand data based solely on a signal from the end customer, i.e. sales, is inaccurate on the scale of the whole supply chain. Therefore, more traditional methods, such as simple regression, are inadequate for forecasting demand. Machine learning can help organizations predict demand better by dealing with complex dependencies, even between causal factors with a non-linear relationship.

2.4.2 Promotions and seasonality

Various promotions are common in the domain of retailing.
Promotions aim to increase sales during a specific seasonal period through various means, including price reductions, advertising campaigns, or free gifts with a purchase when specific conditions are met (e.g. a minimum amount spent). Promotions are typically timed to coincide with seasonal holidays or events, such as Christmas, Thanksgiving, or Black Friday. Sales promotions tend to increase short-term sales and consumption, causing sudden fluctuations in demand patterns. Furthermore, sales surges are usually followed by a downward trend, which is explained by consumers stockpiling (non-perishable) goods. Therefore, the consequences of promotions do not merely follow the traditional law of supply and demand, in which demand increases as price decreases; they also have more complex consequences due to consumer dynamics, which further adds to the complexity of demand forecasting (Abolghasemi, Hurley, et al., 2020, p. 3).

According to Abolghasemi et al. (2020), the level of demand can vary significantly during promotions, such as seasonal holiday weeks or campaigns, compared to non-promotional periods. They find that the variation in demand can increase by up to 6000% during different promotions in a high-variance time series. They discuss how the behavior of a time series can be explained by analyzing and identifying its features. They analyze six features specific to time series: seasonality, stationarity, non-linearity, skewness, kurtosis, and spectral entropy. The first three depict whether the time series is dependent on time, what kinds of seasonal patterns are present, and whether the time series is non-linear. Of the latter three, skewness depicts how close the pattern is to a normal distribution, kurtosis explains whether the distribution is heavy-tailed or light-tailed, and spectral entropy is used to display the unpredictability of the data (2020, pp. 6–7).
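Two of the features named above, skewness and kurtosis, have simple moment-based definitions that can be sketched in plain Python. The sketch below uses the population (biased) moment estimators and reports excess kurtosis, i.e. kurtosis minus 3 so that a normal distribution scores zero. These are common conventions assumed here for illustration and are not necessarily the estimators used in the cited study. The two series are invented.

```python
def central_moment(values, order):
    """Population central moment: mean of (value - mean) raised to the given order."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** order for v in values) / len(values)


def skewness(values):
    """Third standardized moment: 0 for a symmetric series, positive for a long right tail."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


def excess_kurtosis(values):
    """Fourth standardized moment minus 3: positive values indicate heavy tails."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3.0


# Hypothetical weekly sales: a steady series and one with a promotional spike.
steady = [1, 2, 3, 4, 5]
promo_spike = [10, 10, 10, 10, 100]

print(skewness(steady))                         # 0.0
print(round(skewness(promo_spike), 6))          # 1.5
print(round(excess_kurtosis(steady), 6))        # -1.3
print(round(excess_kurtosis(promo_spike), 6))   # 0.25
```

A single promotional peak is enough to pull the skewness well above zero, which illustrates why such features help to characterize promotion-driven series.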
Their results conclude that uncertainty can be reduced and volatility controlled by combining forecasts from various models. Some models in their study produce relatively accurate forecasts during non-promotional periods but drastically overfit during promotional periods with high variance, and this occasional overfitting results in a low average forecast accuracy. The results also addressed the accuracy and efficiency of an artificial neural network (ANN) in time series forecasting. They describe their ANN results as contradicting the literature, as the network did not generalize well on the volatile time series, and propose that a different network architecture, a more relevant selection of data (features), and a larger quantity of data could improve the ANN's performance.

In daily retail, the relevant calendric days are usually special promotional days rather than the public holidays that are often considered in time series methods. Huber and Stuckenschmidt (2020) conducted research on machine learning (ML) in forecasting daily retail demand on specific calendric days. The study focused on a company with a large distribution network comprising over 100 individual retailers, each of which requires daily demand forecasts for its business operations. They compared a set of three machine learning models: MLP as a feed-forward ANN, LSTM (long short-term memory) as a recurrent ANN, and LightGBM representing gradient-boosted regression trees (GBRTs). The models were evaluated by comparing forecast errors, measured with MAE and MASE. According to their research (2020, pp. 1435–1437), ML methods provided higher accuracy, with improvements in the range of 10% to 20% or more over time series models such as a regularized linear regression model. Their conclusion was that ANNs were the strongest in forecasting daily demand, with LSTM as the top performer, followed by MLP, and with LightGBM performing worst in the comparison.
However, the models were retrained to fit the data, and no sophisticated hyperparameter optimization methods were used in their study. This can increase the probability of overfitting and weaken generalization.

2.4.3 Uncertainty and difficulties in demand forecasting

The aim of demand forecasting is to create accurate forecasts for a specific future period to optimize supply as accurately as possible. Accurate forecasts help companies optimize their operations by reducing excess costs, such as inventory costs or waste. Despite the benefits of accurate forecasts, forecasting aims to minimize error rather than seek perfect prediction, as forecasts always involve inherent aleatoric uncertainty, which cannot be eliminated. Epistemic uncertainty, however, stems from a lack of knowledge and can potentially be reduced, which is why demand forecasts are produced.

According to Hüllermeier and Waegeman (2021, p. 458), uncertainty can be roughly divided into two types: aleatoric and epistemic. Aleatoric (statistical) uncertainty is caused by inherent randomness and therefore always involves a stochastic component that cannot be eliminated by any statistical method. Epistemic (systematic) uncertainty refers to uncertainty caused by ignorance or lack of knowledge. In other words, epistemic uncertainty can in principle be reduced, whereas aleatoric uncertainty always remains when making predictions. In statistical fields, uncertainty is traditionally treated as a probabilistic concept, which often fails to distinguish explicitly between the two types.

Seyedan and Mafakheri (2020) examine supply chain demand forecasting from the perspective of big data analytics. According to them, uncertainty is a key problem in supply chains, and they note a common misconception in forecasting: that variables such as cost, capacity, and demand are generally known parameters.
In reality, these variables are subject to uncertainty due to external factors related to customer demand, deliveries, delivery times, and risks. Demand uncertainty plays a significant role, as it affects, for instance, process scheduling, planning, and distribution. Forecasting demand is the primary means of reducing supply chain-wide uncertainty and minimizing the disruptions it causes, such as the bullwhip effect.

Feizabadi (2022) also addresses risk as a key challenge in supply chains. According to the study, uncertainty mainly occurs in the form of demand uncertainty, which increases the imbalance between supply and demand. The fundamental argument of the study assumes that three key sources of uncertainty are involved in forecasting demand: model uncertainty, parameter uncertainty, and data uncertainty. Traditional approaches to managing demand uncertainty include make-to-stock (inventory buffer) and make-to-order production methods. Additionally, advanced forecasting models, such as ANNs, are effective in predictive analysis because of their ability to handle large volumes of varying data. Due to the complexity of uncertainty, prediction models that use a single algorithm are unable to address all sources of uncertainty simultaneously, which has made ensemble and hybrid prediction models common in the forecasting domain.

However, when examining uncertainty from a machine learning perspective, Hüllermeier and Waegeman (2021) state that the distinction between aleatoric and epistemic uncertainty might be unnecessary. For instance, in supervised learning, where the agent is a learning algorithm that is forced to provide decisions or predictions, the distinction between the two is irrelevant.
However, in cases where a decision can be postponed or rejected altogether, this reasoning might not always apply.

Inefficiencies in the supply chain caused by uncertainty can create negative effects that accumulate as they move up the supply chain. A primary example is the bullwhip effect, which occurs when different parties in the supply chain, such as individual companies (Sanders, 2015, p. 13), make their own, often unsuccessful, demand forecasts based on fluctuations in downstream demand. This means that small changes in consumer demand can cause much larger fluctuations in orders and stock levels upstream in the chain, amplifying the demand variance (Tai et al., 2022, p. 5). The phenomenon is exacerbated by poor information flow and inconsistent forecasting in different parts of the organization. For instance, inadequate information flow, long lead times, a distorted view of demand, or inefficient inventory management can lead to recurring problems within the organization. The bullwhip effect causes over-stocking and poor customer service (Tai et al., 2022, p. 1), creating uncertainty that erodes operational efficiency (Disney et al., 2021, p. 5810).

Accurate demand forecasting plays a crucial role in reducing the probability of the bullwhip effect. According to Pereira and Frazzon (2021), the cornerstone of an accurate forecast is choosing the right forecasting method and using machine learning algorithms to produce it, as machine learning provides better demand predictability, which gives managers a better chance of identifying consumer needs. This in turn encourages managers to make more confident decisions, which minimizes mismatches between supply and demand (Pereira & Frazzon, 2021, p. 11). According to Ganjare et al. (2023), the bullwhip effect can be prevented by careful inventory management and accurate order-up and replenishment strategies.
To conclude, the literature suggests that transparent information flow across the supply chain, combined with highly accurate forecasting, is key to preventing the bullwhip effect.

A common conception in forecasting is that the longer the forecast period, the lower the forecast accuracy. Saoud, Kourentzes and Boylan (2025) discuss forecast uncertainty in the context of demand uncertainty and the bullwhip effect, emphasizing prevailing uncertainty as the primary driver of costs in the supply chain. They also highlight the impact of the forecast horizon length on uncertainty: forecast uncertainty increases as the forecast horizon lengthens, because errors accumulate over a longer period of time and the number of factors affecting the forecast variables increases.

Cerqueira et al. (2025) examine the impact of the forecast horizon on the performance of forecast models. They discuss the robustness of forecast models in volatile environments where anomalies are present. They conclude that the model with the best overall accuracy is not necessarily the best choice for handling unexpected variation. In their results, modern neural network models outperformed traditional statistical models only in long-term forecasts; in the short term, there was no significant difference. This emphasizes that the length of the horizon is a critical factor when selecting a model.

3 Methodology and data

This section discusses the different stages of this thesis in chronological order. The objective of this section is to explain the unit of analysis, process steps, data, and tools used to obtain the results. The first subsections discuss how and where the data was acquired, the different stages of data pre-processing, and the statistical methods used in the pre-processing stage. Next, the algorithms selected for this study and the ways in which they were utilized are presented.
Finally, the comparison of the models is discussed by explaining the selection of error and performance metrics and identifying which data features were the most important for each model in the prediction process.

3.1 Data

The data for this project was sourced from a publicly available dataset on Kaggle.com. The dataset is in tabular format and consists of simulated and anonymized weekly sales data from a U.S.-based retail chain. In addition to store-specific features and a date column for temporal information, the dataset includes variables depicting exogenous factors that influence demand. In this thesis, following standard machine learning terminology, these external influencing factors, as well as the temporal and store-specific factors used as inputs for the forecasting models, are referred to as features (Van Wyk, 2023, p. 7). This dataset forms the basis for this study and enables the examination of how these features impact demand forecasts generated by machine learning models.

In the first stage of data preprocessing, the nature of the data is analyzed by validating the data types of the dataset. The dataset consists of 6435 rows and 8 columns; thus, it contains a total of 51 480 individual values. The data type is float or integer for the numerical features and object for the date feature (Appendix 1). The numerical features are Store, Weekly_Sales, Holiday_Flag, Temperature, CPI, Unemployment and Fuel_Price. The explanations for the variables can be found in Table 4 below.

Table 4. Explanations for each variable of the dataset.
Variable        Explanation
Store           Indicates number of the store
Date            The week of sales
Weekly_Sales    Sales for the given store from one week
Holiday_Flag    Indicates whether the week is a special holiday week (1 = holiday week, 0 = non-holiday week)
Temperature     Temperature (in °F) during the week in the region of the store
Fuel_Price      Cost of fuel in the region
CPI             Consumer Price Index
Unemployment    Prevailing regional unemployment rate

Holiday weeks mark four prominent U.S. holidays and events, which are the Super Bowl, Labor Day, Thanksgiving, and Christmas, as defined in the original Kaggle dataset (2021) documentation.

An examination of the dataset reveals that the numerical variables Store and Holiday_Flag require conversion into categorical features to prevent the models from misinterpreting their numerical codes as ordered values. For example, this ensures that Store 45 is not interpreted as being of greater value than Store 1. Furthermore, as the Date column only refers to the week of sales, Day, Week, Month, and Year columns were derived from it to provide more precise temporal values for time series analysis.

3.1.1 Data acquisition and preparation

All data processing for this thesis was performed using the Python programming language in Jupyter Notebook. Jupyter is an open-source web-based interactive computing environment used for data science (The Jupyter Notebook — Jupyter Notebook 7.5.0b0 Documentation, 2015). The data was analyzed using several libraries: pandas and NumPy for data handling, Matplotlib and Seaborn for visualization, and scikit-learn for machine learning methods. LightGBM and XGBoost were applied for the gradient boosting models.

Data processing begins with importing the necessary libraries into the data processing environment, after which the dataset file is read into Python with the pandas library for further processing.
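The loading and preparation steps described above can be sketched as follows. This is a minimal illustration rather than the code used in the thesis: it assumes the dataset is a CSV file whose dates are written as day-month-year, the helper name `load_sales` is hypothetical, and the three sample rows contain made-up values that only mimic the column layout of Table 4.

```python
import io

import pandas as pd

# Made-up rows that only mimic the column layout of Table 4 (not real data).
SAMPLE_CSV = """Store,Date,Weekly_Sales,Holiday_Flag,Temperature,Fuel_Price,CPI,Unemployment
1,05-02-2010,1500000.0,0,42.0,2.5,211.0,8.1
1,12-02-2010,1600000.0,1,38.0,2.6,211.2,8.1
2,05-02-2010,2100000.0,0,40.0,2.5,210.8,8.3
"""


def load_sales(source) -> pd.DataFrame:
    """Read the weekly sales table and derive calendar columns (hypothetical helper)."""
    df = pd.read_csv(source)
    # Assumption: dates are stored as day-month-year strings.
    df["Date"] = pd.to_datetime(df["Date"], format="%d-%m-%Y")
    # Identifier-like codes become categories so they are not read as ordered numbers.
    df["Store"] = df["Store"].astype("category")
    df["Holiday_Flag"] = df["Holiday_Flag"].astype("category")
    # Calendar columns derived from the weekly date.
    df["Day"] = df["Date"].dt.day
    df["Week"] = df["Date"].dt.isocalendar().week.astype(int)
    df["Month"] = df["Date"].dt.month
    df["Year"] = df["Date"].dt.year
    return df.sort_values(["Store", "Date"]).reset_index(drop=True)


sales = load_sales(io.StringIO(SAMPLE_CSV))
print(sales.dtypes)
```

In practice, the `io.StringIO` stand-in would be replaced by the path to the downloaded file.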
3.1.2 Data cleaning and preprocessing

After the initial overview of the dataset, the data pre-processing phase continued by checking the dataset for missing values, duplicated rows, and possible outliers. Pre-processing was done to ensure the quality of the raw data and to prevent the distorted analysis that would result from incorrect or biased data (Çetin & Yıldız, 2022, p. 300). Cleaning the dataset is an essential part of data analysis (Slater & Hasson, 2025, p. 723); thus, data analysis began with detecting missing or duplicated data and identifying any anomalies in the data. Any empty or duplicated rows would have been deleted from the dataset, but the dataset chosen for this project did not include any, so no further processing was required at this point of the analysis.

After handling missing and duplicated values, the dataset was checked for possible outliers to ensure a robust data analysis. Outlier detection began with a visual inspection to help select the most efficient and objective method for identifying outliers (Alves et al., 2024a, p. 5). The numerical variables were assessed with box plots for outlier visualization and histograms for displaying the skewness of each variable. In addition, outliers were identified using the interquartile range (IQR) method. All detected outliers were further validated to determine whether the deviations were caused by errors or represented natural variation.

3.2 Feature engineering

The next task in data pre-processing is feature engineering, in which new features are derived from the existing dataset in a form that the machine learning algorithm can benefit from. Kampezidou and others (2024, p. 388) state that new features can be produced by generating, transforming, and combining existing features.
The purpose of feature engineering is to improve the performance of the machine learning models by minimizing generalization and training errors. Some of the new features were created solely for exploratory data analysis. These features were dropped from the dataset before the predictive analysis with the machine learning algorithms to prevent data leakage, which can make the results unreliable. Data leakage refers to a situation where information from the test data is mixed into the training set, resulting in misleadingly good results on the test data but poorer generalization, which makes the model useless for real-world problems (Liu et al., 2022, p. 13).

Lag and Rolling features were engineered as they can increase the performance of a time series analysis, especially when using tree-based algorithms (Kampezidou et al., 2024, p. 388). Since the data covers a relatively short period of three years, the Lag and Rolling features were created for time periods of one, four, and eight weeks. The four- and eight-week features were created to reflect time periods of approximately one and two months. Lag and Rolling features can improve the model's training efficiency and its ability to predict seasonal patterns and trends (Tam et al., 2025, p. 23).

In addition to the temporal features, Interaction features were engineered to complement the temporal ones, capturing non-linear relationships and providing insight into how the original features of the dataset relate to one another. New features were engineered separately for EDA and for predictive analysis, because not all of the features needed for EDA could be used in the predictive analysis due to possible data leakage and biased parameter estimates.
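A minimal sketch of lag and rolling features for the one-, four-, and eight-week windows is given below. It is an illustration under stated assumptions, not the thesis code: the function name `add_lag_and_rolling` and the column names `Lag_*` and `Rolling_Mean_*` are hypothetical, the rolling statistic is assumed to be a mean, and the demo values are synthetic. Features are computed per store, and the rolling window is shifted by one week so that no feature uses the sales value of the week being predicted.

```python
import pandas as pd


def add_lag_and_rolling(df, target="Weekly_Sales", windows=(1, 4, 8)):
    """Add per-store lag and rolling-mean features (hypothetical names)."""
    out = df.sort_values(["Store", "Date"]).copy()
    grouped = out.groupby("Store", observed=True)[target]
    for w in windows:
        # Sales observed w weeks earlier in the same store.
        out[f"Lag_{w}"] = grouped.shift(w)
        # Mean of the w weeks that end one week before the current row,
        # so the current week's sales never enter its own feature.
        out[f"Rolling_Mean_{w}"] = grouped.transform(
            lambda s, w=w: s.shift(1).rolling(w).mean()
        )
    return out


# Synthetic demo: two stores, ten weeks each.
weeks = pd.date_range("2010-02-05", periods=10, freq="W-FRI")
demo = pd.DataFrame(
    {
        "Store": [1] * 10 + [2] * 10,
        "Date": list(weeks) * 2,
        "Weekly_Sales": [float(v) for v in range(10, 110, 10)]
        + [float(v) for v in range(1000, 2000, 100)],
    }
)
features = add_lag_and_rolling(demo)
print(features.head(10))
```

The first rows of each store contain missing values because no earlier weeks exist; such rows would typically be dropped or imputed before training, which is a modeling choice the sketch leaves open.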
3.3 Train-test split

When conducting predictive analysis on time series data, it is crucial to split the dataset into training and test sets before engineering new features and training the algorithm. The dataset was split into training and test sets while preserving the chronological order of the data. The ratio for the train-test split was 65 to 35, meaning 65 % of the data was used as the training set, and the remaining 35 % as the test data, which the models predict. The split was conducted as a time series split, as a different split method could break the time dependency and impact the reliability of the time series analysis.

3.4 Exploratory data analysis

Exploratory data analysis (EDA) was carried out using descriptive statistics to explore the patterns, possible trends, and overall structure of the dataset, and to visualize outlier detection and the train-test split. EDA started by summarizing the data with tables and bar charts. Histograms were also used to visualize the characteristics of the dataset and the skewness of numerical features. Correlation coefficients between individual features were visualized with a feature correlation heatmap. Outliers were visualized with boxplots, the interquartile range (IQR), and tables, and possible trend patterns were visualized with a 12-week moving average of weekly sales. Holiday and non-holiday dates were visualized with a simple pie chart and a histogram.

3.5 Machine Learning Models

The following predictive algorithms were used in this thesis: simple naïve, 52-week seasonal naïve, multilayer perceptron (MLP) neural network, LightGBM, and XGBoost. Based on previous literature and research results, the gradient boosting models LightGBM and XGBoost were selected as the primary comparison targets. Naïve models were selected as baselines to put the predictive performance of the more advanced algorithms into perspective.
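The chronological 65:35 split from Section 3.3 and the 52-week seasonal naïve baseline named above can be sketched as follows. The thesis does not show this code, so the details are assumptions: the helper names `chronological_split` and `seasonal_naive` are hypothetical, the cutoff is taken on distinct week dates so that all stores share the same boundary, and the demo values are synthetic.

```python
import pandas as pd


def chronological_split(df, train_share=0.65, date_col="Date"):
    """Split by date so that every training week precedes every test week."""
    dates = df[date_col].drop_duplicates().sort_values().reset_index(drop=True)
    cutoff = dates.iloc[int(len(dates) * train_share)]
    train = df[df[date_col] < cutoff]
    test = df[df[date_col] >= cutoff]
    return train, test


def seasonal_naive(df, season=52, target="Weekly_Sales"):
    """Forecast each week with the same store's sales from `season` weeks earlier."""
    ordered = df.sort_values(["Store", "Date"])
    return ordered.groupby("Store")[target].shift(season)


# Synthetic demo: two stores, 100 weeks each, sales equal to a running index.
weeks = pd.date_range("2010-02-05", periods=100, freq="W-FRI")
demo = pd.DataFrame(
    {
        "Store": [1] * 100 + [2] * 100,
        "Date": list(weeks) * 2,
        "Weekly_Sales": [float(i) for i in range(100)]
        + [float(1000 + i) for i in range(100)],
    }
)
demo["Naive_52"] = seasonal_naive(demo)
train, test = chronological_split(demo)
print(len(train), len(test))
```

The first 52 weeks of each store have no seasonal naïve forecast, which is one reason a baseline of this kind can only be evaluated on the later part of the series.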
Since it was assumed that the advanced machine learning models would outperform simple naïve predictions, one more multivariate algorithm was selected to make the comparison more comprehensive. A neural network-based algorithm, the multilayer perceptron (MLP), was chosen due to its efficiency in processing time series data and its computational efficiency compared to other neural network-based algorithms, such as LSTM (long short-term memory) (Ramos et al., 2023).

The model training process started with pre-processing of the dataset. After the initial check for outliers, empty rows, and duplicates, the dataset was split into train and test sets. Once the train-test split was done, new features were engineered to support the predictive performance of the models. Hyperparameter optimization was conducted using randomized search for the gradient boosted models (LGBM and XGB) as well as for the neural network model (MLP).

The models were evaluated using four metrics: mean absolute error (MAE), root mean square error (RMSE), mean absolute scaled error (MASE), and R-squared (R2). Root mean square error (Equation 1) measures the magnitude of prediction errors, with lower values indicating better accuracy (Kannadasan, 2025, p. 25).

Equation 1. Equation of root mean square error (RMSE).

$$RMSE = \sqrt{\frac{\sum_{i=1}^{n} (y_i - x_i)^2}{n}}$$

where $n$ is the number of data points, $y_i$ is the actual value for data point $i$, and $x_i$ is the predicted value for data point $i$.

Mean absolute error (Equation 2) measures the average absolute difference between the predicted and actual values, indicating how close the prediction is to the reference point.

Equation 2. Equation of mean absolute error (MAE).

$$MAE = \frac{\sum_{i=1}^{n} |y_i - x_i|}{n}$$

where $n$ is the number of data points, $y_i$ is the actual value for data point $i$, and $x_i$ is the predicted value for data point $i$.
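As a small worked example of Equations 1 and 2, the sketch below computes both metrics with NumPy for four made-up actual and predicted values. The function names and the numbers are illustrative assumptions and are not taken from the thesis results.

```python
import numpy as np


def rmse(actual, predicted):
    """Root mean square error: square root of the mean squared difference."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))


def mae(actual, predicted):
    """Mean absolute error: mean of the absolute differences."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.mean(np.abs(actual - predicted)))


# Made-up values; the absolute errors are 10, 10, 30, and 20.
actual = [100.0, 200.0, 300.0, 400.0]
predicted = [110.0, 190.0, 330.0, 380.0]

mae_value = mae(actual, predicted)    # (10 + 10 + 30 + 20) / 4 = 17.5
rmse_value = rmse(actual, predicted)  # sqrt((100 + 100 + 900 + 400) / 4) = sqrt(375)
print(mae_value, rmse_value)
```

Because the errors are squared before averaging, RMSE is never smaller than MAE, and the gap widens when a few errors are much larger than the rest, as with the error of 30 here.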
Mean absolute scaled error (Equation 3) indicates the effectiveness of the predictive model by comparing its mean absolute error (MAE) with the MAE of a naïve forecast (Huber & Stuckenschmidt, 2020, p. 1430).

Equation 3. Equation of mean absolute scaled error (MASE).

$$MASE = \frac{MAE}{MAE_{\text{naïve}}}$$

where $MAE$ is the mean absolute error of the prediction and $MAE_{\text{naïve}}$ is the MAE of the naïve forecast.

The coefficient of determination, also referred to as R-squared (R2), quantifies the proportion of variance in the dependent variable that can be explained by the independent variables (Chicco et al., 2021, p. 2).

Equation 4. Equation of R-squared (R2).

$$R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$$

where $\hat{y}_i$ is the prediction for data point $i$ (denoted $x_i$ in Equations 1 and 2), $y_i$ is the actual value for data point $i$, and $\bar{y}$ is the mean of all the observations (Scikit-Learn., 2025).

3.6 Feature Importance Analysis

Feature importance analysis is particularly important when the subject of analysis is highly dependent on exogenous factors. The feature importance method chosen for this project was permutation feature importance, as it can be applied to evaluate any model, regardless of its operating principles, meaning it can be applied to both gradient boosting and neural network models. According to Yagmur and others (2024), it measures the significance of a feature by how much the model's performance metrics (e.g., MAE, RMSE, R2) change when the values of a single feature are randomly permuted, thus cutting its connection to the target variable. In other words, the permutation importance method can be used to determine the contribution of external variables to the prediction of weekly sales. The permutation method was applied manually to test all models with different feature configurations, to show which features work best together and which features repeatedly contribute the most regardless of the configuration.
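The permutation procedure described above can be sketched in a model-agnostic way as follows. This is an illustrative version under stated assumptions, not the thesis implementation: the function name `permutation_importance_manual` is hypothetical, importance is measured as the average increase in MAE over several shuffles, and the "model" in the demo is a toy function that uses only the first of two synthetic features.

```python
import numpy as np


def mae(y_true, y_pred):
    """Mean absolute error between two arrays."""
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))


def permutation_importance_manual(predict, X, y, n_repeats=10, seed=0):
    """Average increase in MAE when one feature column at a time is shuffled."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    baseline = mae(y, predict(X))
    importances = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        increases = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            # Shuffling breaks the link between feature j and the target.
            X_perm[:, j] = rng.permutation(X_perm[:, j])
            increases.append(mae(y, predict(X_perm)) - baseline)
        importances[j] = np.mean(increases)
    return importances


# Toy demo: the "model" relies on column 0 and ignores column 1.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 2))
y_demo = 3.0 * X_demo[:, 0]


def toy_predict(X):
    return 3.0 * X[:, 0]


importances = permutation_importance_manual(toy_predict, X_demo, y_demo)
print(importances)
```

Any fitted model can be passed in through its prediction function, which is what makes the approach usable for both gradient boosting and neural network models. scikit-learn also provides `sklearn.inspection.permutation_importance` for estimators that follow its interface.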
The testing started with the baseline features of the dataset, after which newly engineered features were gradually added to the feature configuration. This allowed the identification of the most important features, from which the final set of variables could be selected.

4 Results

In this section, the results from each part of the empirical study are presented. The interpretation of results begins with exploratory data analysis, after which the predictive performance and feature importance metrics are evaluated. The results are first evaluated separately for each model, after which they are analyzed relative to one another. The model performance section examines the accuracy of the models, presenting the error and forecast metrics in tabular form. The feature importance and interpretation section analyzes how the models utilized external features in their forecasts and provides the results from the feature importance analysis. Finally, the prediction accuracies of the models are compared with each other in a comparative analysis, where the prediction accuracy of each model is visualized with respect to actual sales.

4.1 Data analysis results

The data used in this thesis was collected from an open source. The dataset contains anonymized, simulated weekly sales data from forty-five stores in different regions, as well as exogenous factors indicating the prevailing economic conditions during the sales week. Before statistical data analysis, the usability of the dataset was ensured through data preprocessing and data validation methods. The steps are as follows: (1) preprocessing and validating the data, (2) outlier detection, (3) data cleaning, (4) new feature engineering, (5) normalizing of the data, and (6) balancing the dataset (Van Wyk, 2023, pp. 7–8).

4.1.1 Exploratory data analysis

Data analysis began with visualizing the dataset to explain and understand anomalies and the statistical nature of the data.
Alves and others (2024b, p. 2) state that exploratory data analysis is essential for gaining deeper insight into the dataset, as it helps identify outliers and inliers, as well as derive information on key variables and features of the data. In this thesis, the exploratory data analysis includes analysis of the time series to understand possible trends and patterns in the sales data.

The exploratory data analysis began with a visual examination of the dataset. The nature of the numerical features was examined based on the distribution of the variables. It was found that most of the data was close to normally distributed, except for Unemployment (see Appendix 7), which was heavily skewed. Furthermore, the relationships between the target variable, Weekly_Sales, and continuous numerical features, such as Temperature and Fuel_Price, were examined using scatter plots. This preliminary analysis suggested that the Unemployment data might contain outliers.

Figure 9 depicts the average weekly sales for the dataset, and the orange line represents the 12-week moving average, providing a fundamental perspective on the time series data. The graph clearly indicates that sales peaked before and during December, after which there is a sharp drop in sales right before January. This observation can be explained by the holiday season at the end of the year, which includes Thanksgiving and Christmas; the other holidays flagged in the dataset (2021) are Labor Day in September and the Super Bowl in February.

Figure 9. Weekly Sales, 12-Week Moving Average.

The time series analysis provides valuable insight when forecasting sales and demand. Neba and others (2024, p. 11) state that identifying seasonal trends and patterns is important for forecasting accuracy, resource allocation and inventory management, promotional strategies, and financial planning.
While identifying trends supports the above-mentioned areas, trends can also increase consumer uncertainty, which challenges forecasting accuracy and creates a risk of inefficiencies in inventory management (Ratra & Seth, 2025, p. 1024). Furthermore, Caniato et al. (2005, pp. 39–40) studied demand variability and divided it into three main drivers: seasonal, promotional, and random (i.e. unpredictable) variability. They state that the variability of demand derives mostly from seasonal and promotional variability, further emphasizing the importance of understanding the patterns in the time series data to be forecasted.

Pairwise correlations between features were examined with Spearman correlation, and the results are visualized as a correlation heatmap in Figure 10. The heatmap illustrates the correlation coefficient for each pair of features, indicating how strong the correlation between two variables is. The coefficient takes values between -1 and 1: negative values indicate a negative correlation, positive values indicate a positive correlation, and a value of zero means there is no correlation between the variables (J. Li et al., 2024, p. 152). The bar on the right side of the figure explains the color palette, with a deeper color indicating a stronger correlation.

Figure 10. Feature Correlation Heatmap with Spearman's Correlation.

The correlation between the variables is, on average, weak according to the correlation matrix. The matrix shows that the correlation between weekly sales and features such as CPI, fuel price, temperature, and unemployment is between -0.07 and 0.03, i.e. close to zero. Despite this, the models are expected to find dependencies in the data during the prediction process. It is important to note that the correlations between interaction features and their source features, such as Fuel_Temp_Interaction with Fuel_Price and Temperature, are not meaningful findings, as Fuel_Temp_Interaction was derived from these two features.
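The Spearman correlation matrix behind a heatmap of this kind can be computed with pandas as sketched below. The values are made up for illustration, and defining Fuel_Temp_Interaction as the product of Fuel_Price and Temperature is an assumption, since the thesis does not state the formula in this section.

```python
import pandas as pd

# Made-up values chosen so that the rank relationships are easy to verify.
demo = pd.DataFrame(
    {
        "Weekly_Sales": [5.0, 4.0, 3.0, 2.0, 1.0],
        "Temperature": [30.0, 40.0, 50.0, 60.0, 70.0],
        "Fuel_Price": [2.5, 2.6, 2.9, 3.5, 4.8],
    }
)
# Assumption: the interaction feature is a simple product of its two parents.
demo["Fuel_Temp_Interaction"] = demo["Fuel_Price"] * demo["Temperature"]

# Spearman correlation works on ranks, so any monotonic relationship,
# linear or not, yields a coefficient of +1 or -1.
corr = demo.corr(method="spearman")
print(corr.round(2))
```

In this toy data the interaction feature correlates perfectly with both of its source features simply because it was built from them, which illustrates why such coefficients should not be read as independent findings. The resulting matrix could be passed to a plotting function such as `seaborn.heatmap` to draw the figure.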
4.1.2 Outlier detection

Outlier detection is done to detect unexpected observations that deviate significantly from the rest of the data (Reunanen et al., 2020, p. 287) and thereby to sustain the usability of the dataset. Multivariate datasets often have many outliers, which is why, according to Alves and others (2024b), outlier detection is a principal task during data processing. They emphasize that detecting outliers is often essential, but the outlier detection method must be efficient and objective. They also note that removing the detected outliers is not always necessary (Alves et al., 2024b, p. 5).

The outlier detection method used in this thesis included outlier visualization with boxplots, after which the interquartile range (IQR) was applied as a computational method. IQR and boxplots both use the same formula to determine the upper and lower quartiles (Tukey, 1977, pp. 43–44). The interquartile range is defined by subtracting the first quartile from the third (Q3 - Q1), and the inner fences are set at 1.5 times the IQR below the first quartile and above the third quartile (Tukey, 1977, pp. 43–44). As outlier detection depends on the context, the flagged values were evaluated in light of domain knowledge.

The IQR method was applied by creating a function (see Appendix 1) to calculate the interquartile range and detect outliers for each column passed to the function. The function was subsequently applied to all numerical features in the dataset. Finally, the output of the code displays all the values that fall below the lower bound or above the upper bound. Outliers detected with IQR are depicted in Figure 11.

Figure 11. Outlier detection with interquartile range.

Only two of the five numerical columns included notable outliers. Before the possible outliers were removed from the data, the reasons for the deviation were checked to ensure that the observations were valid. The possible outliers in Weekly_Sales are displayed in Figure 12 with a boxplot.

Figure 12.
Boxplot of Weekly_Sales. The dots represent the datapoints that fall outside the upper (or lower) bounds (fences) (Tukey, 1977, pp. 43–44). With Weekly_Sales, there were 34 datapoints fal