Tuomas Niemelä 

Demand forecasting in the retail environment 

A comparative study of LightGBM, XGBoost, and MLP models 

 
Vaasa 2025 

School of Technology and Innovations  
Master of Science in Economics and Business Administration 

Industrial Management 


UNIVERSITY OF VAASA 
School of Technology and Innovations 
Author: Tuomas Niemelä 
Title of the thesis: Demand forecasting in the retail environment: A comparative study of LightGBM, XGBoost, and MLP models
Degree: Master of Science in Economics and Business Administration 
Degree Programme: Industrial Management  
Supervisor: Petri Helo  
Year: 2025 Pages: 103 

ABSTRACT:  
Accurate demand forecasting is a critical operational factor in the retail environment, as 
organizational decision-making and management are increasingly dependent on it. Accurate 
forecasts enable strategic planning, inventory optimization, increased customer satisfaction, and 
reduction of surplus and waste. While advanced machine learning (ML) models are recognized 
for producing accurate forecasts, current literature often focuses on comparing algorithmic 
efficiency without sufficiently examining the contribution of external features to forecast 
accuracy. 
 
This thesis aims to address this research gap by investigating how external variables, such as 
unemployment and inflation, influence the predictive accuracy of ML models and how feature 
selection affects their performance. The study conducts a comparative analysis of three 
algorithms: LightGBM, XGBoost, and Multilayer Perceptron (MLP). The models are tested and 
compared in relation to one another and benchmarked against a 52-week seasonal naïve 
forecast. The comparative analysis is based on comparing forecasts made with different feature 
sets, evaluating forecast accuracy using various error and performance metrics. 
 
The empirical part of the research applies quantitative methods using simulated and anonymized 
time series data representing weekly sales figures from a U.S.-based retail chain operating in 
forty-five locations. The dataset covers approximately three years and includes seven original 
variables, consisting of macroeconomic, temporal, and store-specific features. Additional 
features were engineered to capture lagged and interaction effects within the data. The 
methodology involves data preprocessing, new feature engineering, a 65:35 train-test split, 
hyperparameter optimization, and evaluation using RMSE, MAE, MASE, and R² metrics.
Permutation feature importance is used to assess the contribution of different features. 
 
The findings indicate that all machine learning models significantly outperformed the seasonal 
naïve baseline, demonstrating their capability to produce more accurate forecasts. Gradient 
boosting models achieved the best overall performance, with LightGBM outperforming XGBoost
by a slight margin, while the MLP model provided the weakest performance and the highest
computational cost. Answering the research questions, the results confirm that feature selection 
has a decisive effect on model performance. Lag features representing short-term temporal 
dependencies were found to dominate feature importance scores across all models. The optimal 
lag length was identified as one week, while macroeconomic variables such as unemployment 
and inflation showed limited significance in short-term forecasts. MLP was the only model for 
which holiday-related features showed notable importance. 
 

KEYWORDS: Demand forecasting, machine learning, retail analytics, feature importance, 
LightGBM, XGBoost, MLP, time series analysis 

  
UNIVERSITY OF VAASA
School of Technology and Innovations
Author: Tuomas Niemelä
Title of the thesis: Demand forecasting in the retail environment: A comparative study of LightGBM, XGBoost, and MLP models
Degree: Master of Science in Economics and Business Administration
Degree Programme: Master's Programme in Industrial Management
Field of study: Industrial Management
Supervisor: Petri Helo
Year of graduation: 2025 Pages: 103

ABSTRACT:
Accurate demand forecasting is regarded as a critical operational factor in retail, and
organizational decision-making and management increasingly depend on it. Accurate forecasts
enable strategic planning, inventory optimization, improved customer satisfaction, and the
reduction of surplus and waste. Although advanced machine learning models are known for
producing accurate forecasts, the current literature often concentrates on comparing the
efficiency of algorithms without sufficiently examining the effect of external factors on
forecast accuracy.

The purpose of this thesis is to address this gap in earlier research by examining how external
variables, such as unemployment and inflation, affect the predictive accuracy of ML models and
how feature selection affects their performance. The study carries out a comparative analysis of
three algorithms: LightGBM, XGBoost, and MLP. The analysis is based on comparing forecasts made
with different feature sets and assessing forecast accuracy with various error and performance
metrics. The methodology comprises data preprocessing, the creation of new data features,
splitting the dataset into training and test data, hyperparameter optimization, and validation
of the error and performance metrics.

The empirical part of the study applies quantitative methods to simulated and anonymized time
series data consisting of the weekly sales figures of a U.S. retail chain, collected from 45
locations. The data cover a period of approximately three years and include eight original
variables, which consist of macroeconomic, temporal, and store-specific features. The effect of
the variables on forecast accuracy is measured with the permutation method.

The results show that the machine learning models performed significantly better than the
seasonal naïve benchmark, which demonstrates their ability to produce more accurate forecasts
than traditional forecasting models. Gradient boosting models achieved the best overall
performance, with LightGBM performing slightly better than XGBoost. The MLP model, in turn,
performed the weakest. The results confirm that feature selection has a decisive effect on model
performance. Lag features representing short-term temporal dependencies proved to be the most
important features in all models. The optimal lag length was found to be one week, whereas
macroeconomic variables, such as unemployment and inflation, proved to be of limited
significance in short-term forecasts.


KEYWORDS: Demand forecasting, machine learning, retail analytics, feature importance,
LightGBM, XGBoost, MLP, time series analysis


Contents  
1 Introduction 8 

1.1 Research questions and purpose 12 

1.2 Objectives and clear limitations 14 

1.3 Structure of the paper 15 

2 Literature review 16 

2.1 Fundamentals of forecasting 16 

2.2 Demand forecasting 21 

2.3 Artificial intelligence 23 

2.3.1 AI as a driver of efficiency 24 

2.3.2 Machine learning methods 26 

2.4 Creating forecasts with machine learning 35 

2.4.1 Feature-based forecasts 35 

2.4.2 Promotions and seasonality 37 

2.4.3 Uncertainty and difficulties in demand forecasting 38 

3 Methodology and data 42 

3.1 Data 42 

3.1.1 Data acquisition and preparation 43 

3.1.2 Data cleaning and preprocessing 44 

3.2 Feature engineering 45 

3.3 Train-test split 46 

3.4 Exploratory data analysis 46 

3.5 Machine Learning Models 46 

3.6 Feature Importance Analysis 49 

4 Results 50 

4.1 Data analysis results 50 

4.1.1 Exploratory data analysis 50 

4.1.2 Outlier detection 54 

4.1.3 Train-test split 57 

4.2 Hyperparameter optimization 59 


4.2.1 Hyperparameters of LightGBM and XGBoost 60 

4.2.2 Hyperparameters of MLP 62 

4.3 Models’ predictive performance 63 

4.3.1 Baseline (seasonal naïve) 63 

4.3.2 Light gradient-boosting machine (LightGBM) 65 

4.3.3 Extreme gradient boosting (XGBoost) 67 

4.3.4 Multilayer Perceptron (MLP) 68 

4.4 Feature importance and interpretation 70 

4.5 Comparative analysis 76 

4.5.1 Comparison of predictions and actual sales 76 

4.5.2 Residual Analysis 78 

5 Conclusions 81 

5.1 Summary of comparative analysis 81 

5.2 Main findings and recommendations for future research 83 

References 87 

Appendices 98 

Appendix 1. Functions created in data analysis 98 

Appendix 2. Parameter grids 98 

Appendix 3. Summary of the dataset 99 

Appendix 4. LightGBM Results with different feature sets 100 

Appendix 5. XGBoost results with different feature sets 101 

Appendix 6. MLP results with different feature sets 102 

Appendix 7. Exploratory data analysis 103 

  
Figures  
 
Figure 1. Retail trade volume and turnover (Eurostat, 2024). 9 

Figure 2. Common Data Patterns (Sanders, 2015, p. 23). 19 

Figure 3. Components of data (Sanders, 2015, p. 24). 20 

Figure 4. Venn diagram depicting the relationship between statistical concepts. 23 

Figure 5. Decision tree architecture, adapted from Mohri et al. (2017, p. 7). 30 

Figure 6. Random forest architecture, adapted from Jiang et al. (2016, p. 58). 31 

Figure 7. Gradient boosting architecture, adapted from Xu et al. (2023, p. 3). 32 

Figure 8. ANN Architecture, adapted from Bre et al. (2018, p. 1430). 34 

Figure 9. Weekly Sales, 12-Week Moving Average. 51 

Figure 10. Feature Correlation Heatmap with Spearman’s Correlation. 53 

Figure 11. Outlier detection with interquartile range. 54 

Figure 12. Boxplot of Weekly_Sales. 55 

Figure 13. Histogram of Unemployment data. 56 

Figure 14. Visualization of train-test split with different ratios (80:20 vs. 65:35). 58 

Figure 15. LightGBM permutation feature importance. 73 

Figure 16. XGBoost permutation feature importance. 74 

Figure 17. MLP permutation feature importance. 75 

Figure 18. Actual weekly sales vs. predicted weekly sales. 77 

Figure 19. Residual Plots for each model. 79 

Figure 20. Summary of results. 85 

 
Tables 
 
Table 1. Principles of forecasting according to Armstrong (2001, pp. 61–66). 17 

Table 2. Forecasting principles according to Sanders (2015, pp. 18–19). 18 

Table 3. Conclusion of advantages of demand forecasting. 22 

Table 4. Explanations for each variable of the dataset. 43 

Table 5. LightGBM cross-validation results. 66 

Table 6. LightGBM final test results. 66 


Table 7. XGBoost cross-validation results. 67 

Table 8. XGBoost final test results. 68 

Table 9. MLP cross-validation results. 69 

Table 10. MLP final test results. 70 

Table 11. Feature configuration. 72 

Table 12. Comparison of the models' error (RMSE, MAE, MASE, R2). 77 

 
Algorithms 
 
Code 1. LightGBM with optimized parameters. 61 

Code 2. XGBoost with optimized parameters. 61 

Code 3. MLP with optimized hyperparameters. 63 

 
Equations 
 
Equation 1. Equation of root mean square error (RMSE). 47 

Equation 2. Equation of mean absolute error (MAE). 48 

Equation 3. Equation of mean absolute scaled error (MASE). 48 

Equation 4. Equation of R-squared (R2). 48 

Equation 5. Equation of seasonal naïve forecast. 64 

 
1 Introduction  

The structures of consumer-driven industries have been reshaped over the past few decades

as competition has intensified and market dynamics have accelerated due to globalization driven

by free trade. The operational structures of global organizations have grown into

complex entities, covering everything from the procurement of raw materials to the

sale of the final product. As the operational structures of organizations expand, their

administration and management become increasingly complex, because the volume and

dimensionality of the data affecting strategic planning increase. Based on current literature,

organizational decision-making and management are increasingly dependent on

accurate demand forecasts to keep pace with growing competition. However, despite

technological advances and the availability of data, the complex nature and unpredictability

of demand continue to challenge consistent forecasting in both research and practice.

 
Accurate demand forecasting is a critical operational factor in supporting strategic

planning (Caniato et al., 2005; Lima et al., 2024; Mircetic et al., 2022). It is crucial for the 

planning of functional processes, such as financing, logistics, inventory management, 

marketing, and production of a profitable business (Lima et al., 2024; Mircetic et al., 

2022, p. 2514). Forecasting demand can help optimize inventory, increase customer 

satisfaction, and reduce waste (Ganguly & Mukherjee, 2024, p. 884), making it an 

essential part of effective organizational management. A study of the global retail market

estimated that inefficiencies in inventory management alone cost retailers 1.106 billion 

U.S. dollars yearly (IHL, 2015, as cited in Disney et al., 2021). Inefficiencies in 

organizational structures can increase costs that could be minimized by systematic 

optimization. According to a report by McKinsey & Company (2022), advanced

forecasting and digital process optimization can significantly improve operational

efficiency. The report states that companies were able to improve the accuracy of their

demand forecasts from 60% to 90% by replacing manual forecasts with machine learning 

models. Furthermore, an intelligent procure-to-pay process reduced processing time 

from days to minutes and achieved 15–20% lower costs. Similarly, the Institute of


Business Forecasting and Planning (2018) reported that a 15% increase in forecasting

accuracy can yield a 3% higher pre-tax improvement. Regardless of its size, a forecast

error is significant in terms of a company’s results. Even a one percent improvement in 

the forecast was able to improve the results of a company with a turnover of 50 million 

by 1.52 million during its fiscal year. Additionally, accurate demand forecasting can 

reduce the annual operating expenditure by 7% (Mitra et al., 2022, p. 3). Forecasts are 

therefore significant drivers in a changing competitive environment. Figure 1 illustrates

the volatility of the retail market in the euro area

over the past 10 years. The figure displays two indices: trade volume and turnover,

excluding motor vehicles and motorcycles. The trade volume index indicates inflation-adjusted

real trade volume, and the turnover index measures nominal retail turnover. Both indices

are monthly, seasonally and calendar-adjusted data from the European Economic Area

(EEA) and selected non-EU states.  

 
Figure 1. Retail trade volume and turnover (Eurostat, 2024). 

 
The statistics illustrate how the macroeconomic effects triggered by the pandemic are 

reflected in consumer behavior. Strong volatility, starting from 2020, continues until 

2022, after which the graph reveals the impact of inflation in the euro area. Nominal 

turnover would indicate that retail trade accelerated at the end of 2022, although in 

reality, according to inflation-adjusted retail sales volume, trade slowed down and 

declined in 2022–2023. The fluctuations of retail sales on a macroeconomic scale

underscore the importance of effective forecasting. The gap between the volume and turnover

indices also illustrates why, when making a forecast, it is necessary to be aware

of which variables are used as inputs in the predictive analysis.
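
The difference between a nominal and an inflation-adjusted index can be made concrete with a small worked example. The sketch below assumes a simple ratio deflation (nominal index divided by a price index) and uses hypothetical index values. The function name `deflate` and all numbers are illustrative assumptions and do not reproduce the Eurostat data or its compilation method.

```python
def deflate(nominal_index, price_index, base=100.0):
    """Convert a nominal index into a real (volume) index by ratio deflation."""
    if len(nominal_index) != len(price_index):
        raise ValueError("series must have the same length")
    return [base * n / p for n, p in zip(nominal_index, price_index)]


# Hypothetical yearly values (base year = 100); not taken from Figure 1.
nominal_turnover = [100.0, 104.0, 110.0]
price_level = [100.0, 105.0, 115.0]

real_volume = deflate(nominal_turnover, price_level)

for nominal, real in zip(nominal_turnover, real_volume):
    print(f"nominal={nominal:6.1f}  real={real:6.1f}")
```

In this toy series the nominal turnover rises every year while the deflated volume falls, so a model fed only the nominal series would see growth where real trade is contracting.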

 
Due to the increased volume and dimensionality of data, traditional forecasting methods 

usually fall short of the requirements for efficient forecasting (Mediavilla et al., 2022, p. 1126),

yet many forecasts are still conducted manually based on experience and intuition in the 

retail sector (Falatouri et al., 2022, p. 995). Current literature focuses on advanced 

forecasting methods, which are often based on machine learning (ML). These state-of-

the-art algorithms can process large volumes of data and are efficient in finding causal 

connections between independent variables (Ganjare et al., 2023, p. 2237). Studies have 

shown that different ML methods can outperform traditional forecasts in specific 

prediction problems (Schmid et al., 2025, p. 2). Petropoulos and colleagues (2025) 

discuss the current state of retail demand forecasting. They depict the current situation 

as being twofold by providing a concrete example of a U.S.-based retail company

operating in over 10,000 different locations and managing approximately 200,000

unique stock keeping units (SKUs). Field tests and forecasting competitions have proven 

that advanced algorithms can be used to make accurate predictions on demand, but as 

the scale increases, computational constraints, the complexity of forecasts, and lack of 

practicality create clear limitations for effective forecasting (Petropoulos et al., 2025, p. 

1564). 

 
Forecasting demand can be effective for an individual store or product, but in modern, 

data-driven organizations, forecasting is not just a single process, but a multi-level 


system that extends across different hierarchies of the business. Furthermore, the 

purpose of demand forecasting is not to estimate the sales of a single SKU, but rather to 

make an estimation of, for instance, regional purchasing power over a specific period. 

Such estimates can influence strategic decisions such as the location of new business

premises (Petropoulos et al., 2025, p. 1564). According to Yasir et al. (2024), the factors 

affecting demand are typically divided into internal and external. The internal factors are 

endogenous and organization-specific variables, and the external variables include, for 

instance, geographic location, temperature, seasons, and holidays (Falatouri et al., 2022, 

p. 995) as well as macroeconomic indicators, such as interest rates, trade volumes, 

prevailing employment rates, and exchange rates (Yasir et al., 2024, p. 2868). The 

external variables have proven to be valuable when conducting time series forecasts 

(Abolghasemi, Hurley, et al., 2020, p. 2; Falatouri et al., 2022, p. 995), but research on the

significance of macroeconomic factors appears to be limited. 

 
Despite the recognized value of forecasting in business operations, its implementation is 

relatively limited due to its complex nature. According to Schneider et al. (2021), the

difficulty of forecasting stems from the fact that demand cannot be measured explicitly. They

state that the accuracy of an effective forecast is influenced by a plethora of potentially 

relevant variables, making it difficult to identify which factors truly improve forecasting 

accuracy. Additionally, traditional methods are usually too unsophisticated and 

advanced methods are still in the early stages of development, especially in complex 

decision-making situations (Schneider et al., 2021, p. 218). Abolghasemi and colleagues 

(2020) support the statement by concluding that when developing predictive models, it 

is important to find a balance between complexity and accuracy to maintain the 

efficiency and reliability of the model without unnecessary data usage. 

 
This thesis focuses on the features used in demand forecasting, i.e., the external factors 

based on which the target (dependent) variable is predicted. According to previous 

research, forecasting is crucial in terms of operational efficiency. Advanced models, such 

as algorithms based on artificial intelligence, have also been found to produce accurate 


predictions. However, there are relatively few studies focusing on the relationship 

between external factors used in predictions and the forecasting models. This study tests

three AI-based models for making predictions and examines how the models handle the

features of the same time series dataset.

 
1.1 Research questions and purpose 

Demand arises from the need for a specific good, product, or service. This need is 

influenced by various external factors, such as purchasing power, seasonality, and trends. 

Consequently, the final investment or purchase decision depends on this need, as well

as on factors such as price, quality, the economic situation, and substitutes. Thus, demand is influenced

by numerous external variables, on the basis of which companies make important 

investment and operational decisions. When examining demand forecasting, the topic 

combines two highly complex areas: demand and forecasting. Currently, most research 

focuses on comparing the efficiency and predictive accuracy of different algorithms, and 

does not take external features into account when forecasting sales or demand (Deng et 

al., 2025, p. 156). Although studies focus on forecasting and specific exogenous factors 

related to demand, the emphasis is often solely on prediction error minimization without 

considering the contribution of the exogenous features. 

 
Huber and Stuckenschmidt (2020) focused their research on analyzing forecast accuracy 

on specific calendar and holiday-related days in the retail domain. They used various external

features in their predictive analyses, such as store location and type, temporal 

characteristics (lag, rolling medians, etc.), and sales promotions and special days. The 

evaluation is based on comparing models in relation to the baseline forecasts and 

comparing the forecast error margins of the methods used. The study did not investigate 

the impact of external features used on forecast accuracy. However, they suggest in their 

proposal for further research that industry-related insights could be explored by studying 

the contributions of external features. Furthermore, Deng and others (2025) support the 

proposal by stating that analysis of the consumer business domain is insufficient; thus,


research on features of the retail market to optimize model structure is needed. They 

also conducted their research on retail by comparing a model consisting of LightGBM 

and Prophet to single prediction models such as LSTM, SVR, and ARIMA. Additionally,

Falatouri et al. (2022) compared machine learning models for an Austrian retail company and

examined the effects from a supply chain management perspective. The results showed

that profitability was increased by minimizing waste and increasing sales numbers.

Nevertheless, they also note that future studies could focus on examining the impact of

external features, such as calendar events, weather, or availability of substitute products 

to provide domain-specific insights. 
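
The temporal characteristics mentioned above, such as lags and rolling medians, can be sketched in a few lines. The example below is a minimal illustration in plain Python under assumed settings: a one-week lag, a four-week window, and made-up sales values. The helper `add_temporal_features` is hypothetical; it is not the feature pipeline of Huber and Stuckenschmidt (2020) or of this thesis.

```python
from statistics import median


def add_temporal_features(sales, lag=1, window=4):
    """Build one feature row per week using only earlier weeks (no look-ahead)."""
    rows = []
    start = max(lag, window)  # first week with a full lag and a full window behind it
    for t in range(start, len(sales)):
        rows.append({
            "week": t,
            "target": sales[t],
            f"lag_{lag}": sales[t - lag],
            f"rolling_median_{window}": median(sales[t - window:t]),
        })
    return rows


# Made-up weekly sales for a single store.
weekly_sales = [100, 120, 90, 110, 130, 95, 105]
features = add_temporal_features(weekly_sales)

for row in features:
    print(row)
```

Both features are computed from weeks strictly before the target week, which keeps information from the forecast period out of the inputs.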

 
This thesis aims to address the research gap identified in previous research regarding 

retail demand forecasting by clarifying the contribution of external factors to the

accuracy of forecasts made using machine learning models. The analyzed data come

from a U.S.-based retail chain. The data include weekly sales figures from stores in 45

different locations, as well as data on contextual, macroeconomic, and temporal 

variables. This study aims to examine how machine learning models leverage external 

variables, such as unemployment, inflation, and fuel price data, in demand forecasting. 

This is examined by generating predictions with three different machine learning

algorithms. The algorithms are used to create different models by providing them with 

data with different sets of features, allowing the predictions obtained with different 

features to be compared. Additionally, the results of the best feature set are then used 

to produce a feature importance analysis using the permutation method. Motivated by 

this, previous research, and background of the study, the research questions are as 

follows: 

 
RQ1: How do the algorithms evaluate external features in their predictions? 

RQ2: How does feature selection affect the predictive accuracy? 

 
The research questions are based on the purpose and background of the thesis, and they 

guide the research to fill the research gap identified within the field of study. The 


empirical part of the thesis is carried out by generating predictions with four different

predictive models, one of which serves as the baseline for the machine

learning algorithms. Machine learning algorithms used include LightGBM, XGBoost, and 

Multilayer Perceptron (MLP), with the seasonal naïve forecast serving as the baseline. 
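
To make the role of the baseline concrete, the following sketch shows a seasonal naïve forecast together with the four metrics named in the abstract (RMSE, MAE, MASE, and R²). It is an illustration under stated assumptions rather than the thesis implementation: MASE is scaled here by the in-sample error of the seasonal naïve forecast, which is one common convention and may differ from the definition given in the methodology, and a season length of 4 with toy values is used only to keep the example short (the thesis uses 52 weeks). All function names are hypothetical.

```python
import math


def seasonal_naive(history, horizon, season_length=52):
    """Forecast each future step with the value observed one season earlier."""
    if len(history) < season_length:
        raise ValueError("need at least one full season of history")
    last_season = history[-season_length:]
    return [last_season[h % season_length] for h in range(horizon)]


def rmse(actual, predicted):
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def mae(actual, predicted):
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mase(actual, predicted, train, season_length=52):
    """MAE divided by the in-sample MAE of the seasonal naive forecast (assumed scaling)."""
    naive_errors = [abs(train[t] - train[t - season_length])
                    for t in range(season_length, len(train))]
    scale = sum(naive_errors) / len(naive_errors)
    return mae(actual, predicted) / scale


def r2(actual, predicted):
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Toy series with a season length of 4 (two training seasons, one test season).
train = [10, 20, 30, 40, 12, 22, 32, 42]
actual = [14, 24, 34, 44]

forecast = seasonal_naive(train, horizon=len(actual), season_length=4)
print("forecast:", forecast)
print("RMSE:", rmse(actual, forecast))
print("MAE:", mae(actual, forecast))
print("MASE:", mase(actual, forecast, train, season_length=4))
print("R2:", r2(actual, forecast))
```

Under this scaling, a MASE below 1 means a model's error is smaller than the baseline's in-sample error, which makes the metric convenient for comparing models against the seasonal naïve forecast.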

 
1.2 Objectives and clear limitations 

The objective of this thesis is to create a clear framework for how demand, its forecasting, 

and advanced predictive models are interconnected. Furthermore, the study aims to 

answer how the prediction models used in the study handle external variables in the 

creation of predictions. The thesis starts with a literature review, which examines the 

basic theories of the subject areas and previous research on the topic. During the

literature review, the key concepts and terminology are explained to build an

understanding of the fundamentals of forecasting, demand forecasting, and the

applications of artificial intelligence. Next, the thesis reviews previous research, which 

allows for a more detailed examination of topics related to demand forecasting. 

 
In the quantitative part of the research, the topics studied in the literature review are examined in

practice. Three different machine learning models are tested and compared with each

other, as well as in relation to a seasonal naïve forecast. The aim is to examine the

accuracy of the predictions using various error and performance metrics and to produce

results that are as objective as possible. These metrics allow the comparison of the

predictive accuracy of models with different feature sets. This, in turn, allows the

generation of quantitative results on how external factors affect predictive accuracy. Finally,

the importance of features is measured by permutation feature importance, which

depicts the contribution of a single feature by determining how much the model relies

on that feature.
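
The idea behind permutation feature importance can be sketched without any machine learning library: shuffle one feature column at a time and record how much the prediction error grows. The example below is a minimal sketch under stated assumptions. The "fitted model" is a hypothetical hand-written function, the feature names (`sales_lag_1`, `unemployment`, `fuel_price`) and the random data are illustrative, and MAE is used as the error measure; none of this is the configuration used in the thesis.

```python
import random


def mean_absolute_error(actual, predicted):
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def permutation_importance(predict, rows, targets, n_repeats=10, seed=0):
    """Return the mean increase in MAE when each feature column is shuffled."""
    rng = random.Random(seed)
    baseline = mean_absolute_error(targets, [predict(r) for r in rows])
    importances = {}
    for name in rows[0]:
        increases = []
        for _ in range(n_repeats):
            shuffled = [r[name] for r in rows]
            rng.shuffle(shuffled)  # break the link between this feature and the target
            permuted = [{**r, name: v} for r, v in zip(rows, shuffled)]
            error = mean_absolute_error(targets, [predict(r) for r in permuted])
            increases.append(error - baseline)
        importances[name] = sum(increases) / n_repeats
    return importances


# Hypothetical stand-in for a fitted model: relies strongly on last week's
# sales, weakly on unemployment, and ignores the fuel price entirely.
def toy_model(row):
    return 0.9 * row["sales_lag_1"] - 2.0 * row["unemployment"]


data_rng = random.Random(42)
rows = [
    {
        "sales_lag_1": data_rng.uniform(50, 150),
        "unemployment": data_rng.uniform(5, 10),
        "fuel_price": data_rng.uniform(2, 4),
    }
    for _ in range(200)
]
targets = [toy_model(r) for r in rows]  # noise-free targets, so the baseline error is 0

scores = permutation_importance(toy_model, rows, targets)
for name, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{name:15s} {score:8.3f}")
```

A feature the model does not use receives an importance of zero, because shuffling it leaves the predictions unchanged, while features the model leans on produce a larger error increase.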

 
The clear limitations of the thesis are related to the characteristics of the data, the 

methodology of this study, and model-specific technicalities which may impact the 


results. When conducting time series analysis, it is desirable to have as much

data available as possible, as data are needed both in the training phase and in the testing

phase of the model. The time series spans only three years, consisting of 6,435 data points.

Some algorithms, such as neural network architectures, require a lot of data to function 

properly. Furthermore, the data is not based on actual values, as it is a simulated and

anonymized dataset, which reduces the relevance of the research results to real-life

situations. In addition, the study lacks qualitative input that could support its objective

and the generalizability of its results to real-life situations. Technical limitations are

related to tuning of model hyperparameters. Even if the study were repeated using the 

same dataset, data features, and algorithms, the results obtained could be different if 

the hyperparameters or their values are changed. To conclude, the thesis and its results 

are highly theoretical due to the mentioned limitations. 

 
1.3 Structure of the paper 

The structure of the thesis consists of five main chapters, which are introduction, 

literature review, methodology, results, and conclusions. The introduction chapter 

presents the background for this thesis, as well as the purpose, research questions, 

objectives, and clear limitations of this study. Second, the literature review presents the 

rationale for the topic, the main concepts, and the relevant terminology. After key 

concepts, it delves deeper into the research area at a more advanced level by examining 

previous research papers on the topic and the used algorithms. This is followed by the 

methodology of this paper, which explains the unit of analysis, process steps, data, and 

tools used to obtain the results. Its purpose is to provide the reader with guidelines on 

how this study was conducted. After the methodology, the results and how they were 

obtained are explained. The results section presents the concrete quantitative results of

the analysis, examines them in more depth through visualizations and comparisons, and

provides answers to the research questions. Finally, the conclusions section summarizes

the findings of this research, generalizes the results beyond this thesis and its data, and

provides suggestions for further research.


2 Literature review 

Forecasting demand is a crucial aspect of managing a business. According to Punia et al.

(2020), demand forecasting addresses two main problems of companies in the

manufacturing, retail, or distribution business: deciding the quantities for

production or orders and allocating resources. Furthermore, Sillanpää and Liesiö (2018),

in their study on demand forecasting in retail business, state that forecasts are required

to effectively plan operations and incoming orders. They emphasize the need for

information to apply demand forecasting in businesses and state that the data is most

easily gathered from point-of-sale (POS) and stock keeping unit (SKU) sales data.

However, creating forecasts based solely on historical sales data can be problematic, since

demand is influenced by numerous external variables. For example, forecast methods

using POS data can be unreliable in case of reschedules (Sillanpää and Liesiö, 2018, p.

4169). Because demand is a difficult variable to measure (Mitra et al., 2022, p. 3), a lot of

research can be found on the topic: how forecasting methods are implemented and how 

they are created. 

 
The purpose of this section is to review the fundamentals of forecasting, which will be 

used to continue discussion on demand forecasting to create a theoretical framework 

for the thesis. Next, machine learning models are examined, along with the principles of 

the models used in this thesis, and how they are used in demand forecasting. Finally, 

previous research related to the topic is examined. 

 
2.1 Fundamentals of forecasting 

Forecasting is about making predictions of the future. In the book Forecasting

Fundamentals (2015), Nada Sanders explains that forecasting is all about predictions,

whether it is the future weather, the outcome of tomorrow’s match or who will win the

election. Sanders says that forecasting is the most important aspect of decision making and

stresses that inaccurate forecasting can leave a business in a very unfortunate state and even lead to


bankruptcy. Forecasting influences many managerial decisions such as the need for 

workers, how much inventory is enough and when to order more, what resources will 

be available, or how much production is appropriate for a given period (Sanders, 2015, 

pp. 4-5). 

 
Armstrong (2001) proposes seven concrete principles for forecasting, intended to

improve judgment by reducing bias and/or inconsistency in the forecast. The principles

are displayed in Table 1.

 
Table 1. Principles of forecasting according to Armstrong (2001, pp. 61–66). 

Principle | Effect on the forecast

1. Using checklists | Increases consistency

2. Defining and delimiting precise criteria | Increases consistency and efficiency, minimizes bias

3. Comparison and evaluation of previous forecasts | Increases consistency and minimizes bias

4. Visualization of results for interpretation | Minimizes bias and decreases error

5. Utilizing patterns and trend lines | Increases consistency

6. Using multiple forecast methods | Increases robustness

7. Peer reviews | Minimizes bias

 
Armstrong states that two of the seven, the use of checklists (1) and the utilization of trend lines (5),

increase the consistency of the forecast method. The use of checklists emphasizes systematic

consideration of relevant variables, and the utilization of trend lines when making

judgmental forecasts helps visualize the data, thus enabling pattern

recognition and supporting more consistent decision-making. In contrast, (4) the use of graphs

for interpretation and (7) peer-reviewing the probability of success serve to minimize

the bias within the forecast. Armstrong suggests that studying data in graphic rather than

tabular form increases forecasting accuracy and decreases error. Having peers


18 

 
reviewing the probability of success decreases the amount of human error, which is 

usually shown as overconfident forecasts, thus increasing error of the forecast. Principles 

(2) and (3) help in both, decreasing bias and increasing consistency. Defining and 

delimitating precise criteria removes unnecessary variables from the forecast, making 

the forecast more efficient. Records from previous forecasts give forecasters a way to 

obtain cognitive feedback. However, Armstrong underlines that it is important to use the 

records appropriately to provide a reliable assessment (Armstrong, 2001, pp. 70-71). 

 
Sanders (2015) proposes a different perspective on the forecasting principles with three 

main ideas. Whereas Armstrong’s seven principles for forecasting are more traditional 

and concrete methods that should be considered when making predictions, Sanders’ 

criteria are suitable for forecasting at a general level, with ML models for example. 

Sanders’ principles are displayed in table 2. 

 
Table 2. Forecasting principles according to Sanders (2015, pp. 18–19). 

Forecasting principle                                           Rationale of the principle
1. Forecasts are rarely perfect                                 The goal of a good forecast is to minimize error, not to forecast perfectly
2. Forecasting clusters is more accurate than individual items  The overall variance can be minimized by diversification
3. Short-term forecasting is more accurate than long-term       Shorter time horizons involve less uncertainty, making short-term forecasts more reliable
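
The second principle in table 2 can be illustrated numerically. The following sketch is not taken from Sanders (2015); it uses hypothetical weekly sales for three items in one cluster and compares the coefficient of variation (standard deviation divided by mean) of each item with that of the aggregated cluster. The item names and figures are assumptions made for illustration only.

```python
from statistics import mean, pstdev

# Hypothetical weekly unit sales for three items in one product cluster.
item_sales = {
    "item_a": [12, 18, 9, 21, 15, 11, 19, 15],
    "item_b": [20, 14, 25, 13, 18, 23, 12, 19],
    "item_c": [8, 13, 10, 6, 12, 9, 14, 8],
}


def coefficient_of_variation(series):
    """Relative variability: population standard deviation divided by the mean."""
    return pstdev(series) / mean(series)


# Aggregate the cluster by summing the items week by week.
cluster_sales = [sum(week) for week in zip(*item_sales.values())]

item_cv = {name: coefficient_of_variation(s) for name, s in item_sales.items()}
cluster_cv = coefficient_of_variation(cluster_sales)

print(item_cv)
print(cluster_cv)
```

In this constructed example the aggregated series varies far less, relative to its mean, than any single item. The size of the effect depends on how strongly the items are correlated; perfectly correlated items would gain nothing from aggregation.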

 
Sanders (2015) also proposes a six-step forecasting process: deciding what to forecast,
cleaning the data, identifying data patterns, selecting models, generating the forecast,
and measuring forecast accuracy. The process starts with identifying the core issue that
the forecast is meant to solve, and Sanders highlights the importance of setting clear
delimitations and focusing solely on the variables being forecasted. For instance, if
unexpected demand causes a seller to run out of stock midway through the day, leaving
customer needs unfulfilled, the sales data will underestimate the actual demand for that
day. Thus, forecasts of sales and of demand require different approaches, although at
first they seem to be measurable with the same parameters. To conclude, a clear
consensus on what the forecast is for is crucial for the relevance and reliability of the
forecast results (Sanders, 2015, pp. 20–21). Determining the core issue provides the
forecaster with a framework for data collection at the most detailed level possible.
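
The difference between sales and demand in the stock-out scenario can be made concrete with a small calculation. The sketch below is an illustration with hypothetical hourly figures, not an example from Sanders (2015): once the opening stock is exhausted, the recorded sales stay at zero even though customers keep asking for the product.

```python
# Hypothetical hourly demand (units requested) and opening stock for one day.
hourly_demand = [5, 8, 12, 15, 10, 6]
opening_stock = 30


def observed_sales(demand, stock):
    """Sales per hour when unmet demand is lost once the stock runs out."""
    sales = []
    for units in demand:
        sold = min(units, stock)  # cannot sell more than what is left
        sales.append(sold)
        stock -= sold
    return sales


hourly_sales = observed_sales(hourly_demand, opening_stock)
total_demand = sum(hourly_demand)
total_sales = sum(hourly_sales)
lost_sales = total_demand - total_sales

print(hourly_sales, total_sales, total_demand, lost_sales)
```

A model trained on the recorded sales alone would learn a daily level of 30 units, whereas the underlying demand in this example is 56 units.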

 
Figure 2. Common Data Patterns (Sanders, 2015, p. 23). 

 
The patterns, level or horizontal, trend, seasonality, and cycles help the forecaster to 

choose the right forecasting model, which in turn is more likely to produce a more 

reliable forecast. Sanders (2015) explains that the clearer the trend is identified through 

data analysis, the more accurate the forecast is likely to be. This is because greater 

random variance increases the difficulty of producing reliable forecasts, as depicted in 

figure 3 below. 

 
Figure 3. Components of data (Sanders, 2015, p. 24). 

 
After cleaning and analyzing the data, the forecasting model can be selected. The 

selection might not be straightforward since there are plenty of different models for 

different types of datasets and patterns. Sanders highlights four factors that should be 

considered when choosing the model: forecast horizon, data patterns, the availability 

and quantity of data, and the required accuracy. After deciding which model to apply, 

the forecast can be generated, and the forecast results can be analyzed and interpreted. 

When the possible forecasting errors are identified and the model provides accurate 

results, the dataset should be updated as new relevant data is available (Sanders, 2015, 

pp. 25-26). 

 
Demand forecasting can be implemented with either qualitative or quantitative methods.
Qualitative methods, such as historical analogies, market research, questionnaires, and
the Delphi technique, are used to predict market demand (Mitra et al., 2022, p. 58).
Quantitative methods rely on numerical and measurable data, and most of them evaluate
causal relationships between independent and dependent variables through regression
analysis. Other data-driven approaches include time-series models, such as exponential
smoothing and moving average methods (Mitra et al., 2022, p. 58; Punia et al., 2020,
pp. 2–3). This thesis focuses solely on quantitative demand forecasting.
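
The two time-series methods mentioned above can be sketched in a few lines. The example below is a minimal illustration with hypothetical weekly sales; the window length of three weeks and the smoothing factor of 0.5 are assumptions chosen for readability, not values used in this thesis.

```python
def moving_average_forecast(history, window):
    """Forecast the next period as the mean of the last `window` observations."""
    recent = history[-window:]
    return sum(recent) / len(recent)


def exponential_smoothing_forecast(history, alpha):
    """Simple exponential smoothing: level = alpha * y + (1 - alpha) * level."""
    level = history[0]  # initialize the level with the first observation
    for value in history[1:]:
        level = alpha * value + (1 - alpha) * level
    return level


weekly_sales = [100, 104, 98, 110, 108]  # hypothetical weekly unit sales
ma_forecast = moving_average_forecast(weekly_sales, window=3)
ses_forecast = exponential_smoothing_forecast(weekly_sales, alpha=0.5)

print(ma_forecast, ses_forecast)
```

The moving average weights the last three weeks equally, whereas exponential smoothing gives the most recent week the largest weight and lets the influence of older weeks decay geometrically.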

 
2.2 Demand forecasting 

In today’s global economy, organizations need to be efficient in terms of cost
optimization, information flows, delivery, information transparency, and development to
keep up with global competition (Abolghasemi, Beh, et al., 2020, p. 2; Feizabadi, 2022,
p. 121; Mediavilla et al., 2022, pp. 1126–1127; Mitra et al., 2022, p. 2). Globalization has
driven the trend of outsourcing, which increases supply chain complexity and
internationality and lengthens procurement lead times (Feizabadi, 2022, pp. 121–122).
When making purchases or planning production volumes, a company needs a plan for how
much product on the shelf or in production is sufficient. Too much supply is costly in
terms of storage costs, surplus, and labor costs. Insufficient supply, on the other hand,
means lost revenue, decreased customer satisfaction and loyalty, loss of goodwill, and
possible overstocking against future demand (Abolghasemi, Beh, et al., 2020, p. 1; Punia
et al., 2020, pp. 1–2).

 
Demand forecasting can help address many of these issues by providing insight for
decision making, according to Abolghasemi et al. (2020). However, demand is highly
volatile and not an easy variable to predict, since it is affected by various exogenous and
endogenous factors (Mitra et al., 2022, p. 3). Despite demand being highly dependent
on exogenous factors, Falatouri et al. (2022) state that many retailers still conduct
demand forecasting manually, basing decisions on their individual biases. This often
leads to inaccuracies, because uncontrollable external factors, such as price volatility,
market cannibalization, or consumer behavior, affect market demand. Efficient demand
forecasting relies on strict processes and predictive analyses (Kilimci et al., 2019,
pp. 1–2) and utilizes advanced technologies and resources such as big data (Pereira &
Frazzon, 2021, pp. 3–5).

 
Table 3. Conclusion of advantages of demand forecasting. 

 
Authors                          Advantages of demand forecasting
Abolghasemi et al. (2020)        Help in addressing volatility over the entire demand series,
                                 mitigating issues in upstream supply chains and increasing
                                 cost-efficiency
Feizabadi (2022)                 Improve efficiency across the processes of entire supply chain
Ho et al. (2025)                 Optimize storage, increase customer satisfaction, and process
                                 efficiency
                                 Mitigates stockouts and overstocking
Huber & Stuckenschmidt (2020)    Increase in competitive advantage
                                 Minimize the amount of discarded goods and waste
Jackson et al. (2024)            Increase strategic decision-making efficiency and cost
                                 efficiency
Khan et al. (2020)               Help enterprises to formulate market strategies, increase
                                 inventory turn-over rates, customer satisfaction, and
                                 transparency. Reduce waste and overall costs
Kilimci et al. (2019)            Increased cost efficiency by minimizing excessive stocks and
                                 stockouts
                                 Increasing customer satisfaction
Lay et al. (2018)                Alleviates stockout and overstocking and increases customer
                                 satisfaction, which enables companies to gain sustainable
                                 competitive advantage


Table 3 illustrates how different researchers describe the benefits of demand
forecasting. The broad effects highlight the need for accurate forecasts and advanced
forecasting methods. At the same time, the increasing availability of data complicates
the processes and slows down data processing if an organization lacks the necessary
means to harness the data.

 
2.3 Artificial intelligence 

Artificial intelligence (AI) is ubiquitous, regardless of the industry, as are all the 

buzzwords associated with it such as Machine Learning, Large Language Models, Deep 

Learning, and Big Data. Many researchers highlight how AI and its sub-areas work in 

different industries as drivers of efficiency (Dell’Acqua et al., 2023; Fosso Wamba et al., 

2024; Jackson et al., 2024; Krakowski et al., 2023; Wasserbacher & Spindler, 2022). 

 
Figure 4. Venn diagram depicting the relationship between statistical concepts. 

 
Figure 4 provides a conceptual illustration of how the subareas related to artificial 

intelligence are interconnected. It can be concluded that AI refers to a field of computer 

science focused on creating models performing tasks typically requiring human 

intelligence (Krakowski et al., 2023, p. 1426). Artificial intelligence has gained popularity

as Large Language Models (LLMs) became more common with the release of ChatGPT

by OpenAI in 2022 (Jackson et al., 2024, p. 1). They quickly gained remarkable attention

because of their generative capabilities (Dell’Acqua et al., 2023, p. 3; Jackson et al., 2024, 

p. 6120). LLMs are best known for their ability to provide their users with human-like 

answers, and creative and analytical capabilities (Dell’Acqua et al., 2023, p. 1) which can 

be utilized to complement or substitute human work. 

 
2.3.1 AI as a driver of efficiency 

The integration of AI into human work is seen as an opportunity to complement the
efficiency of individuals (Dell’Acqua et al., 2023, p. 1). AI affects human cognition and
problem-solving ability and reduces the marginal cost of human thinking and reasoning,
similar to how the internet lowered the cost of information sharing (Dell’Acqua et al.,
2023, p. 18; Jackson et al., 2024, p. 6120). Applications of artificial intelligence enable
machines to learn autonomously, which gives them the ability to cooperate with humans
in problem solving and decision making (Krakowski et al., 2023, p. 1426). According to
Krakowski and others (2023), AI’s ability to mimic human cognitive skills is a unique
capability in the field of technology. Individual cognitive abilities have traditionally
been scarce and difficult to duplicate, so AI’s ability to provide cognitive skills is a
considerable advantage, since it can be utilized in decision making and managerial tasks
(Krakowski et al., 2023, p. 1427). The added value of AI is not straightforward to
calculate, but some researchers have examined possibilities to estimate it.

 
Efficiency can be increased by harnessing artificial intelligence: according to Jackson
et al. (2024), AI tools can lead to a surge in productivity, enhance cognitive work
efficiency, support logistics and warehouse management, and even help in negotiating
optimal contracts. From a demand forecasting perspective, implementing AI methods can
increase forecast accuracy, which in turn increases supply chain resiliency (Mediavilla
et al., 2022, p. 1130), improves order-picking performance, and helps respond accurately
to upcoming demand spikes or uptrends (Ho et al., 2025, p. 2). Jackson et al. (2024)
further examine the capabilities of artificial intelligence and identify six core
characteristics that distinguish AI from traditional technologies and that collectively
serve to define and explain the concept of AI: learning, perception, prediction,
interaction, adaptation, and reasoning (Jackson et al., 2024, p. 6123). Of these,
learning, prediction, and reasoning are the most important for demand forecasting.

 
Krakowski and others (2023) study the added value of artificial intelligence from the
perspective of the resource-based view, which explains organizations’ competitive
advantage by the availability and volume of resources. Traditionally, cognitive skills are
considered difficult to duplicate, scarce in supply, heterogeneously distributed across
individuals, and decisive in decision making and problem solving. Thus, from the
perspective of the resource-based view, they are regarded as valuable organizational
resources. However, the potential added value of AI challenges this, as AI’s ability to
learn and perform cognitive tasks undermines the irreplaceability that has made
cognitive skills and abilities valuable resources. In addition, Krakowski and others note
that technological resources such as AI are generally subject to relatively few
constraints on imitation and that the marginal cost of reproducing them is almost
negligible (Krakowski et al., 2023, p. 1426). On the other hand, when AI is studied as a
complement to individuals’ cognitive skills, it can further enhance cognitive resources
and thus add value for the user. The unique capabilities of AI make it difficult to
evaluate theoretically, which has led to a discussion about AI’s potential as both a
substitute for and a complement to cognitive workers.

 
2.3.2 Machine learning methods 

Machine learning is a subcategory of artificial intelligence (Ganjare et al., 2023, p. 2237;
Mediavilla et al., 2022, p. 1126), and it is being harnessed in different industries to
analyze large volumes of data. ML is utilized, for instance, in search engines, finance,
logistics, e-commerce, and inventory management, for tasks such as fraud detection, spam
email detection, predictive analyses, inventory level optimization, and personalized
feeds for consumers on e-commerce platforms (Ganjare et al., 2023, p. 2237). Barua et al.
(2020) generalize that machine learning is a way to teach computers to learn new tasks
naturally from experience, resembling the way organisms acquire new knowledge. In ML,
the computer uses a computational method, such as an algorithm, to learn directly from
the dataset without a predetermined equation, unlike traditional statistical methods.
Compared to conventional methods, ML provides faster and more accurate analyses of
large datasets and better tools for predictive analysis, whereas conventional statistical
models describe relationships between variables based on predetermined models, such as
regression (Ganjare et al., 2023, pp. 2236–2237; Rajula et al., 2020, pp. 1–2).

 
There are three general categories of machine learning, which differ in the quality of
the data and the methods used to teach the machine: supervised, unsupervised, and
reinforcement learning (Barua et al., 2020). Barua et al. further divide supervised
learning into two sub-categories: classic supervised learning and ensemble learning. In
addition to the three main categories, Wasserbacher and Spindler (2022) propose
semi-supervised learning as a fourth category of machine learning. Semi-supervised
methods utilize small amounts of labeled data in addition to unlabeled data, and they
aim to enhance supervised learning in environments where labeled data is scarce
(Wasserbacher & Spindler, 2022, p. 67). However, semi-supervised methods are not as
widely covered in the current literature.


In contrast with supervised learning, unsupervised learning models are not trained to 

create predictions based on pre-defined and labeled data. The task of unsupervised 

learning is to identify patterns and relationships within unlabeled data, without prior 

knowledge or predefined labels about what the data represents (Barua et al., 2020, pp. 

2–3). The only known parameter of the unlabeled dataset is the joint distribution 

(Wasserbacher & Spindler, 2022, p. 67). The benefit of unsupervised learning is the

recognition of previously unseen insights in the data, according to Wasserbacher and

Spindler (2022). They give customer segmentation based on consumers’ demographic

characteristics, socio-economic status, and behavior as an example of a task that

unsupervised learning can perform.

 
These kinds of hidden patterns and features are primarily recognized by clustering and
principal component analysis (PCA) (Barua et al., 2020, p. 2). For instance, in a study
conducted by Kılıç Sarıgül et al. (2025), two clustering algorithms were utilized to
categorize unlabeled data. They used the K-means and Mean-Shift algorithms to cluster
three datasets separately, and three distinct clusters (low, medium, and high) were
identified based on performance metrics. By assigning cluster-based labels, the
researchers were then able to analyze the data further with supervised learning
algorithms to predict future shipment performance. This hybrid approach enabled
proactive operational management by allowing early identification of underperforming
shipments (Kılıç Sarıgül et al., 2025, pp. 22–23). Thus, the unsupervised learning
algorithms were used to organize an unlabeled dataset so that it could be processed by
supervised learning algorithms to model the relationship between input data and output
values.
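
The clustering step of such a hybrid approach can be sketched as follows. The example is a simplified, one-dimensional K-means written for illustration; it is not the implementation used by Kılıç Sarıgül et al. (2025), and the performance scores, starting centroids, and label names are hypothetical.

```python
def k_means_1d(values, centroids, iterations=10):
    """Plain k-means on one-dimensional data with given starting centroids."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Move each centroid to the mean of its members (keep it if empty).
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids


def assign_labels(values, centroids, names):
    """Label each value with the name of its nearest centroid."""
    return [names[min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))]
            for v in values]


# Hypothetical on-time delivery rates (%) for unlabeled shipments.
scores = [52, 55, 58, 71, 74, 76, 93, 95, 97]
centroids = sorted(k_means_1d(scores, centroids=[50.0, 75.0, 100.0]))
labels = assign_labels(scores, centroids, ["low", "medium", "high"])

print(centroids)
print(labels)
```

The resulting labels can then serve as the target variable for a supervised model, which is the second stage of the hybrid approach described above.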

 
In reinforcement learning, a learning agent learns by trial and error from the
environment in which it operates (Barua et al., 2020, p. 2; Wasserbacher & Spindler,
2022, p. 67). Sutton and Barto (2015) state that the basic idea of reinforcement learning
is to capture the most important aspects of a real problem by giving the learning agent
feedback from the learning environment. The main distinguishing aspects of
reinforcement learning are the closed-loop reward system, the lack of direct
instructions for the learning agent, and the fact that the consequences of actions play
out over extended periods and affect the agent’s later performance. Because
reinforcement learning does not rely on pre-labeled examples, it is sometimes regarded
as a kind of unsupervised learning, although Sutton and Barto treat it as a separate
paradigm (Sutton & Barto, 2015, p. 3).

 
Based on the reward system of the learning environment, the agent chooses the action it
expects to maximize the numerical reward for the given task (Barua et al., 2020, p. 3).
Sutton and Barto (2015) present the exploration-exploitation dilemma, which highlights
a challenge specific to reinforcement learning. The learning agent must prefer past
actions that it has found to be effective in producing reward, but to find such actions
it must also try new, previously unselected actions. The agent must therefore exploit
proven actions while exploring for actions that may be better in the future. Unlike
supervised learning, reinforcement learning provides no instructions on how to proceed:
the learning agent must decide how to approach the task purely based on data gathered
through the reward system.
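
The exploration-exploitation dilemma can be illustrated with an epsilon-greedy agent, one of the simplest strategies discussed in the reinforcement learning literature. The sketch below is an assumption-laden toy example: the three actions, their fixed noise-free rewards, the exploration rate, and the random seed are all hypothetical.

```python
import random


def run_bandit(true_rewards, epsilon, steps, seed=0):
    """Epsilon-greedy agent on a bandit with fixed (noise-free) rewards per action."""
    rng = random.Random(seed)
    estimates = [0.0] * len(true_rewards)  # current value estimate per action
    pulls = [0] * len(true_rewards)        # how often each action was chosen
    for _ in range(steps):
        if rng.random() < epsilon:
            action = rng.randrange(len(true_rewards))   # explore a random action
        else:
            action = estimates.index(max(estimates))    # exploit the best estimate
        reward = true_rewards[action]
        pulls[action] += 1
        # Incremental mean update of the estimate for the chosen action.
        estimates[action] += (reward - estimates[action]) / pulls[action]
    return estimates, pulls


true_rewards = [1.0, 2.0, 3.0]  # hypothetical rewards; the last action is best
greedy_estimates, greedy_pulls = run_bandit(true_rewards, epsilon=0.0, steps=1000)
explore_estimates, explore_pulls = run_bandit(true_rewards, epsilon=0.1, steps=1000)

print(greedy_pulls)
print(explore_pulls)
```

With no exploration, the agent keeps repeating the first action that produced any reward and never discovers the better alternatives. With a small exploration rate it sacrifices some reward on random trials but finds the best action.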

 
In supervised learning, the machine is taught from a training set of labeled data
provided by a human annotator or domain expert (Sutton & Barto, 2015, p. 2). Supervised
learning has two main steps: first training the model with a predefined training
dataset, and then evaluating the model with a separate, unseen dataset, referred to as
test data. The predefined data is called labeled data, and the purpose of using it is to
teach the model to recognize the relationship between the input data and the correct
output values (Kılıç Sarıgül et al., 2025, p. 14) in order to create accurate predictions
that can inform decisions based on unseen data (Barua et al., 2020, p. 2; Sutton & Barto,
2015, p. 2; Wasserbacher & Spindler, 2022, p. 67). An example of where supervised
learning can be utilized is predicting future sales based on known input variables, such
as date, historical prices and sales, and availability of competitors’ products
(Wasserbacher & Spindler, 2022, p. 67).
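
The two steps of supervised learning can be sketched with the simplest possible model. The example below fits a straight line by ordinary least squares on a training portion of hypothetical price and sales data and evaluates it on held-out observations. The data, the single input variable, and the split point are assumptions made for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return mean_y - slope * mean_x, slope


def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical labeled data: price (input) and units sold (known output).
prices = [10, 11, 12, 13, 14, 15, 16, 17]
units = [200, 190, 181, 169, 160, 151, 139, 130]

# Step 1: train on the first six observations only.
intercept, slope = fit_line(prices[:6], units[:6])

# Step 2: evaluate on the two held-out (unseen) observations.
test_predictions = [intercept + slope * p for p in prices[6:]]
test_mae = mean_absolute_error(units[6:], test_predictions)

print(intercept, slope, test_mae)
```

The error on the held-out observations, rather than the error on the training data, indicates how well the learned relationship generalizes to unseen data.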

 
Barua and others divide supervised learning into two categories: classic supervised
learning and ensemble learning. Classic supervised learning involves training a single
model using predefined labeled training data. The most common classic supervised
algorithms include regression analyses, k-nearest neighbors (KNN), artificial neural
networks (ANN), decision trees, and support vector machines (SVM) (Barua et al., 2020,
p. 2). Chaudhuri and others (2021) also state that decision tree, SVM, ANN, and random
forest are among the most used analytical methods for forecasting. Interpretability,
ease of use, and the ability to handle categorical variables have made decision tree and
random forest more popular than ANN. On the other hand, ANN’s capability to handle
multidimensional datasets and its efficient use of resources are superior to those of
the easier-to-use random forest and decision tree (Chaudhuri et al., 2021, p. 3).

 
2.3.2.1 Decision tree, random forest, bagging, and boosting 

A decision tree is a conceptually simple but effective and versatile (Barua et al., 2020,
p. 2) non-parametric supervised learning algorithm. It is capable of both classification
and regression tasks, and it forms a flowchart-like hierarchical structure consisting of a
root node, internal nodes, branches, and leaf nodes. Each internal node tests a feature
value and splits the data into subsets, and the branches represent the outcomes of these
tests (Barua et al., 2020, p. 2). The straightforward implementation makes a decision tree
fast in producing a forecast (Barua et al., 2020, p. 2), and together with its
interpretability this has made it a widespread model for forecasting-related
applications (Chaudhuri et al., 2021, p. 3). Barua and others (2020) present a case study
by Mohri and Haghshenas (2017), in which a decision tree-based algorithm was used to
determine when the use of shipping containers is optimal. The input variables included,
for instance, the price, weight, value, and distance of the shipment. The most important
variables were item perishability, value of goods, distance, destination, and point of
departure. On the other hand, decision trees can be relatively unstable, as they are
sensitive to variance and noise in the data due to their tendency to overfit, according
to Chaudhuri and others (2021). However, this sensitivity can be addressed by combining
multiple decision trees into one ensemble learning model, such as random forest (Huber &
Stuckenschmidt, 2020, p. 1426). Ensemble methods improve accuracy and robustness
while reducing variance (Barua et al., 2020, p. 2) compared to training a single decision
tree.
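
How a single node of a regression tree chooses its split can be sketched as follows. The example searches one feature for the threshold that minimizes the summed squared error of the two resulting subsets. It is a simplified illustration of the general principle, not the procedure of any particular library, and the discount and sales figures are hypothetical.

```python
def sse(values):
    """Sum of squared errors around the mean (0 for an empty list)."""
    if not values:
        return 0.0
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)


def best_split(xs, ys):
    """Find the threshold on one feature that minimizes the total squared error."""
    best = None
    for threshold in sorted(set(xs))[1:]:
        left = [y for x, y in zip(xs, ys) if x < threshold]
        right = [y for x, y in zip(xs, ys) if x >= threshold]
        error = sse(left) + sse(right)
        if best is None or error < best[1]:
            best = (threshold, error, sum(left) / len(left), sum(right) / len(right))
    return best


# Hypothetical data: discount (%) and weekly units sold.
discounts = [0, 0, 5, 5, 20, 20, 25, 25]
units = [100, 104, 102, 98, 150, 154, 148, 152]
threshold, error, left_mean, right_mean = best_split(discounts, units)

print(threshold, error, left_mean, right_mean)
```

In this example the node splits at a discount of 20 % and predicts 101 units below the threshold and 151 units at or above it. A full tree repeats the same search recursively within each subset, which is also why deep trees can fit noise in the training data.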

 
Figure 5. Decision tree architecture, adapted from Mohri et al. (2017, p. 7).

 
Ensemble learning combines the predictions of several individual models to form a
composite model. It uses the outputs of the individual models, referred to as base
learners or weak learners, to produce its own predictions. Barua and others (2020) state
that ensemble methods use a two-step process: first developing a population of base
models from training data, and second, combining the base models to form the
composite predictor (Barua et al., 2020, p. 2). Kilimci and others (2019) name the two
steps “ensemble generation and ensemble integration” and state that learning methods
are combined to boost system performance (Kilimci et al., 2019, p. 2). Hastie and others
(2009) also split ensemble learning into two tasks: creating a population of base
learners and combining them into a composite predictor. The idea is to build a
prediction model that combines the strengths of simple learning models (e.g. a single
decision tree) into a more effective predictive model (e.g. random forest). Thus, the
underlying concept of ensemble learning is to develop an enhanced form of supervised
learning from base learners, one that enables more effective predictive modeling by
compensating for the weaknesses of individual models.

 
Figure 6. Random forest architecture, adapted from Jiang et al. (2016, p. 58). 

 
Common ensemble techniques include bootstrap aggregating (bagging), boosting, and
stacking, each of which employs a different strategy to improve the accuracy of
predictions and the robustness of the model by reducing bias and variance (Barua et al.,
2020, p. 2). Boosting methods (e.g. LightGBM and XGBoost), for instance, train a series
of weak learners sequentially and form the final prediction as the sum of the trained
simple models (Huber & Stuckenschmidt, 2020, p. 1426). For example, AdaBoost has been
used in conjunction with SVMs to enhance predictive performance, as referenced by
Ghareeb and others (2020, p. 1).
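
The sequential nature of boosting can be sketched with a bare-bones gradient boosting loop for squared error, in which each new weak learner (a single-split tree, or stump) is fitted to the residuals of the current ensemble. This is a simplified illustration of the principle only. LightGBM and XGBoost add regularization, more elaborate tree construction, and many engineering optimizations that are not shown here. The data, the number of rounds, and the learning rate are hypothetical.

```python
def fit_stump(xs, residuals):
    """Best single-split regressor (weak learner) under squared error."""
    best = None
    for t in sorted(set(xs))[1:]:
        left = [r for x, r in zip(xs, residuals) if x < t]
        right = [r for x, r in zip(xs, residuals) if x >= t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x < t else rm


def boost(xs, ys, rounds, learning_rate):
    """Sequentially fit stumps to the residuals of the current ensemble."""
    base = sum(ys) / len(ys)  # the initial prediction is the mean of the target
    stumps = []

    def predict(x):
        return base + learning_rate * sum(s(x) for s in stumps)

    errors = []  # training mean squared error before each round and at the end
    for _ in range(rounds):
        residuals = [y - predict(x) for x, y in zip(xs, ys)]
        errors.append(sum(r ** 2 for r in residuals) / len(ys))
        stumps.append(fit_stump(xs, residuals))
    errors.append(sum((y - predict(x)) ** 2 for x, y in zip(xs, ys)) / len(ys))
    return predict, errors


xs = [1, 2, 3, 4, 5, 6, 7, 8]          # hypothetical feature values
ys = [10, 12, 11, 30, 32, 31, 50, 52]  # hypothetical target values
model, training_errors = boost(xs, ys, rounds=20, learning_rate=0.5)

print(training_errors[0], training_errors[-1])
```

Each round corrects part of what the previous rounds left unexplained, so the training error shrinks step by step. The learning rate controls how much of each correction is applied.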

 
Whereas boosting trains learners sequentially and focuses on reducing bias and variance
to improve the accuracy of the model, bagging trains homogeneous weak learners
independently and in parallel on random subsets of the training data and primarily
prevents overfitting by reducing variance. Random forest is a well-known bagging
method, which combines individual simple decision trees and is suitable for both
regression and classification tasks (Ghareeb et al., 2020, p. 1).
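
Bagging can be sketched in the same style. In the example below, each weak learner is a single-split tree trained independently on its own bootstrap sample, and the ensemble prediction is the average of the individual predictions. The sketch omits the random feature selection that a random forest adds on top of bagging, and the data, the number of models, and the random seed are hypothetical.

```python
import random


def fit_stump(xs, ys):
    """Single-split regression tree; falls back to the mean if no split exists."""
    overall = sum(ys) / len(ys)
    best = (float("inf"), None, overall, overall)
    for t in sorted(set(xs))[1:]:
        left = [y for x, y in zip(xs, ys) if x < t]
        right = [y for x, y in zip(xs, ys) if x >= t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if err < best[0]:
            best = (err, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: overall if t is None else (lm if x < t else rm)


def bagged_stumps(xs, ys, n_models, seed=0):
    """Train each stump independently on a bootstrap sample of the training data."""
    rng = random.Random(seed)
    n = len(xs)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        models.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return models


def bagged_predict(models, x):
    """Aggregate by averaging the predictions of all base learners."""
    return sum(m(x) for m in models) / len(models)


xs = [1, 2, 3, 4, 5, 6, 7, 8]          # hypothetical feature values
ys = [10, 12, 11, 13, 30, 32, 31, 33]  # hypothetical target values
models = bagged_stumps(xs, ys, n_models=25)
low_prediction = bagged_predict(models, 2)
high_prediction = bagged_predict(models, 7)

print(low_prediction, high_prediction)
```

Because the base learners do not depend on each other, they could be trained in parallel, in contrast to the sequential training used in boosting.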

 
Figure 7. Gradient boosting architecture, adapted from Xu et al. (2023, p. 3).

 
Whilst bagging and boosting usually combine homogeneous weak learners, stacking
utilizes heterogeneous learners, leveraging the strengths of different algorithms. The
objective is to optimally incorporate the results of weak learners to improve the ability
to make accurate predictions on new, unseen data (Ghareeb et al., 2020, p. 1). According
to Ghareeb and others (2020), multistep methods like stacking are more sensitive to
errors due to their complexity. However, this complexity is also said to make stacking
more effective than other ensemble models in forecasting complex datasets.


2.3.2.2 Artificial Neural Networks 

Artificial Neural Networks (ANN) consist of artificial neurons that are connected to each
other and arranged in a series of layers. The artificial neurons in ANNs are usually
referred to as units, and a single ANN can consist of dozens to millions of units,
depending on how complex the neural network is (Barua et al., 2020, p. 2; Seyedan &
Mafakheri, 2020, p. 12). The neurons are connected by synapses, which together construct
the layers of the network. Neural networks usually include three types of layers: an
input layer, one or more hidden layers, and an output layer. ANNs with multiple hidden
layers are called Deep Neural Networks (DNNs) (Punia et al., 2020, p. 4). The information
processing of artificial neural networks resembles the way animals make decisions
(outputs) based on a learned logical model of information processing (Barua et al., 2020,
p. 2). Deep neural network models are used for complex problems, such as image
recognition, and they form the core architecture of large language models (GPT-4,
Llama 3, etc.). These models typically require a lot of computational resources to train
and run. The common architecture of an artificial neural network is displayed in figure 8
below.

 
Figure 8. ANN Architecture, adapted from Bre et al. (2018, p. 1430). 

 
This thesis uses a Multilayer Perceptron (MLP) regressor because of its relatively fast
training and operation, and because it can be applied effectively to time series data.
Although different neural network-based models share the same type of architecture
(Figure 8), there can be significant differences in their learning processes, according
to Ramos et al. (2023, p. 672). The relatively small amount of data used in the project
might have been insufficient for more complex neural network models, such as recurrent
neural networks (RNN) and their long short-term memory (LSTM) variant, which use a
different, more sophisticated memory concept than MLP (Ramos et al., 2023,
pp. 672–673).
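
The way an MLP turns inputs into a prediction can be sketched with a forward pass through one hidden layer. The weights below are hand-picked, hypothetical values chosen so that the arithmetic is easy to follow. In practice the weights are learned from training data by backpropagation, which is not shown, and the network used in this thesis is not reproduced here.

```python
def relu(values):
    """Rectified linear unit: negative activations are set to zero."""
    return [max(0.0, v) for v in values]


def dense(inputs, weights, biases):
    """One fully connected layer: each unit is a weighted sum of inputs plus a bias."""
    return [sum(w * x for w, x in zip(unit_weights, inputs)) + b
            for unit_weights, b in zip(weights, biases)]


def mlp_forward(inputs, hidden_w, hidden_b, output_w, output_b):
    """Input layer -> one hidden layer with ReLU -> single linear output (regression)."""
    hidden = relu(dense(inputs, hidden_w, hidden_b))
    return dense(hidden, output_w, output_b)[0]


# Hypothetical, hand-picked weights: 2 inputs, 3 hidden units, 1 output.
hidden_w = [[0.5, -0.2], [0.1, 0.4], [-0.3, 0.2]]
hidden_b = [0.0, 0.1, 0.05]
output_w = [[1.0, 2.0, -1.0]]
output_b = [0.5]

# Example inputs: a scaled lagged sales value and a scaled external feature (assumed).
prediction = mlp_forward([1.0, 2.0], hidden_w, hidden_b, output_w, output_b)

print(prediction)
```

Unlike recurrent architectures, the MLP has no internal memory: any information about past weeks must be supplied explicitly as input features, for example as lagged sales values.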

 
2.4 Creating forecasts with machine learning 

The use of machine learning in forecasting has received a great deal of attention in
recent years. Forecasts produced by machines are more objective and mitigate human error
(bias). Although machine-based methods are applied by humans, most bias can be
eliminated from a forecast with the right parameters and restrictions. Machine-based
methods are particularly well suited to forecasting demand, as demand itself is a
multidimensional concept. First, there are different types of demand influenced by
context and temporal features (Armstrong, 2001), and even if the form of demand is
known, it is influenced by numerous macro- and microeconomic factors, not all of which
are quantitative, i.e., some of them cannot be measured (Schneider et al., 2021, p. 218).
Therefore, it is important to recognize demand forecasting as a complex domain and to
consider it from different perspectives.

 
Falatouri et al. (2022) state that the objective of forecasting is to discover data
patterns and provide accurate forecasts, and that machine learning is one of the most
viable tools for processing data into accurate and transparent output. They propose that
machine learning methods for demand forecasting can be divided into three categories:
time series analysis, regression-based methods, and supervised and unsupervised
methods. Furthermore, they state that demand forecasting is done at short-term and
long-term levels, with short-term forecasts covering six to twelve months and long-term
forecasts more than a year (Falatouri et al., 2022, p. 994).

 
2.4.1 Feature-based forecasts 

Li and others (2023) researched the connection between intermittent demand and
inventory management accuracy. Their research discusses the characteristics of demand,
emphasizing that demand is very commonly intermittent, i.e. often zero, which is rarely
considered when studying demand. Predicting intermittent demand is particularly
difficult due to the uncertainty caused by the stochasticity and timing of demand.
Previous literature has proposed methods to solve the problem of intermittent demand
forecasting by dividing the time series into different intervals, but more recently
accuracy has been improved by machine learning methods, such as artificial neural
networks. According to Li and others, combining different forecasting methods has proved
to provide results that are better than or equal to those of individual methods.
However, their own study concentrates on improving forecasting by engineering new
features from time series data. Li et al. (2023) produced a forecasting model based on
XGBoost, in which they utilize features derived from time series and selected for
intermittent demand. The feature-based model produced accurate predictions on
variables that have an immediate impact on inventory management decisions (L. Li et al.,
2023, p. 7568).

 
According to Feizabadi’s (2022) study on demand forecasting using autoregressive models and
neural networks, demand forecasting is largely influenced by the characteristics
(features) of the product type and industry. The study uses metal products as an example
of so-called functional products, which have less product variety, longer
life cycles, lower profit margins, and lower inventory risk. Feizabadi notes that it is easier to
predict demand for these products downstream in the supply chain, closer to the consumer,
than upstream, where demand is created mainly by organizational suppliers and buyers.
However, he also notes that, when moving from downstream to upstream in the supply chain,

updating demand forecasts is the single biggest cause of demand-supply mismatches 

and inefficiencies (Feizabadi, 2022, pp. 119–121), as separate parts of the supply chain 

update their own demand forecasts based on the purchase signal generated by the end 

customer. Updating demand data based solely on a signal from the end-customer, i.e. 

sales, is inaccurate on the scale of the whole supply chain. Therefore, more traditional 

methods, such as simple regression, are inadequate for forecasting demand.

Machine learning can help organizations to better predict demand by dealing with 

complex dependencies even between causal factors with a non-linear relationship. 

 
2.4.2 Promotions and seasonality 

Various promotions are not uncommon in the domain of retailing. Promotions aim to 

increase sales during a specific seasonal period through various means, including price 

reductions, advertising campaigns, or free gifts for a purchase when specific conditions 

are met (e.g. minimum amount spent). Promotions are typically timed to coincide with 

seasonal holidays or events, such as Christmas, Thanksgiving, or Black Friday. Sales 

promotions tend to increase short-term sales and consumption, causing sudden 

fluctuations in demand patterns. Furthermore, sales surges are usually followed by a
downward trend, which is explained by consumers stockpiling (non-perishable) goods.
Therefore, the consequences of promotions do not merely follow the traditional law of
supply and demand, with demand increasing as price decreases, but they also have more
complex consequences due to consumer dynamics, which further adds to the
complexity of demand forecasting (Abolghasemi, Hurley, et al., 2020, p. 3).

 
According to Abolghasemi et al. (2020), the level of demand can vary significantly during

promotions, such as seasonal holiday weeks or campaigns, compared to non-

promotional periods. They find that the variation in demand can increase by up to 6000% 

during different promotions in a high variance time series. They discuss how the 

behavior of a time series can be explained by an analysis and identification of its features. 

They analyze six features specific to time series: seasonality, stationarity, non-linearity,
skewness, kurtosis, and spectral entropy. The first three describe what kinds of seasonal
patterns are present, whether the properties of the time series depend on time, and whether
the time series is non-linear. Of the latter three, skewness describes how asymmetric the
distribution is compared with a normal distribution, kurtosis describes whether the
distribution is heavy-tailed or light-tailed, and spectral entropy measures the unpredictability of the
data (2020, pp. 6–7). Their results indicate that uncertainty can be reduced and
volatility controlled by combining forecasts from various models. Some models in their

study result in relatively accurate forecasts during non-promotional periods but 

drastically overfit during promotional periods with high variance. The occasional 

overfitting results in a low average forecast accuracy. The results also highlighted the 


accuracy and efficiency of an artificial neural network (ANN) in time series forecasting. 

However, they describe their ANN results as contradicting the literature, as the network did
not generalize well on the volatile time series, and they propose that a different network
architecture, a more relevant selection of features, and a larger quantity of data could improve the
ANN’s performance.

 
In daily retail the calendric days usually stand for special promotional days rather than 

public holidays, which are often considered in time series methods. Huber and 

Stuckenschmidt (2020) conducted research on machine learning (ML) in forecasting of 

daily retail demand on specific calendric days. The study focused on a company with a 

large distribution network comprising over 100 individual retailers, each of which 

requires daily demand forecasts for its business operations. They compared a set of 

three machine learning models: MLP as a feed-forward ANN, LSTM (long short-term
memory) as a recurrent ANN, and LightGBM representing gradient-boosted

regression trees (GBRTs). The models were evaluated by comparison of forecast errors, 

displayed with MAE and MASE. According to their research (2020, pp. 1435–1437), ML 

methods provided higher accuracy, being more than 10% to 20% more accurate
than time series models such as a regularized linear regression model. Their

conclusion was that ANNs were the strongest in forecasting daily demand, with LSTM as 

the top performer, followed by MLP, with LightGBM performing worst in the comparison.

However, the models were retrained to fit the data, and no sophisticated hyperparameter
optimization methods were used in their study. This can increase the probability of
overfitting and of poor generalization.

 
2.4.3 Uncertainty and difficulties in demand forecasting 

The aim of demand forecasting is to create accurate forecasts for a specific future period 

to optimize supply as accurately as possible. Accurate forecasts help companies to 

optimize their operations by reducing excess costs, such as inventory costs or waste. 

Despite the benefits of accurate forecasts, forecasts aim to minimize error rather than 


seek perfect prediction, as forecasts always involve inherent aleatoric uncertainty, which 

cannot be eliminated. Epistemic uncertainty, however, stems from a lack of knowledge 

and can potentially be reduced, which is why demand forecasts are produced. According 

to Hüllermeier and Waegeman (2021, p. 458) uncertainty can be roughly divided into 

two: aleatoric and epistemic uncertainty. Aleatoric (statistical) uncertainty is caused by 

inherent randomness. Aleatoric uncertainty therefore always involves a stochastic factor 

that cannot be eliminated by any statistical methods. Epistemic (systematic) uncertainty 

refers to uncertainty caused by ignorance or lack of knowledge. In other words, 

epistemic uncertainty can in principle be reduced, whereas aleatoric uncertainty always prevails

when making predictions. In statistical fields, uncertainty is traditionally treated as a 

probabilistic concept, which often fails to explicitly distinguish the types of uncertainty.  

 
Seyedan and Mafakheri (2020) examine supply chain demand forecasting from the 

perspective of big data analytics. They identify uncertainty as a key
problem in supply chains and note a common misconception in forecasting,
namely that variables such as cost, capacity, and demand are known
parameters. In reality, these variables are subject to uncertainty due to external factors
related to customer demand, deliveries, delivery times, and risks. Demand uncertainty
plays a significant role, as it affects, for instance, process scheduling, planning, and
distribution. Forecasting demand is therefore the primary means of reducing supply chain-wide
uncertainty and minimizing the disruptions it causes, such as the bullwhip effect.

 
Feizabadi (2022) also addresses risk as a key challenge in supply chains. According to

the study, uncertainty mainly occurs in the form of demand uncertainty, which increases 

the imbalance between supply and demand. The fundamental argument of the study 

assumes that there are three key factors involved in forecasting demand: model 

uncertainty, parameter uncertainty, and data uncertainty. Traditional approaches to 

managing demand uncertainty include pull methods, i.e. make-to-stock (inventory 


buffer) and make-to-order production methods. Additionally, advanced forecasting 

models, such as ANNs, are effective in predictive analysis for their ability to manage large 

volumes of varying data. Due to the complexity of uncertainty, prediction models that 

use a single algorithm are unable to address all sources of uncertainty simultaneously, 

which is why ensemble and hybrid prediction models are common in the
forecasting domain. However, when examining uncertainty from a machine learning
perspective, Hüllermeier and Waegeman (2021) state that the distinction between
aleatoric and epistemic uncertainty might be unnecessary. For instance, in supervised
learning, where the agent is a learning algorithm that is forced to provide a decision or
prediction, the source of the uncertainty is irrelevant. However, in cases where a decision
can be postponed or rejected altogether, the distinction may still matter.

 
Inefficiencies in the supply chain caused by uncertainty can create negative
effects that accumulate as they move up the supply chain. A primary example
of this is the bullwhip effect, which occurs when different parties in the supply chain,
such as individual companies (Sanders, 2015, p. 13), make their own, often unsuccessful,
demand forecasts based on fluctuations in downstream demand. This means that small

changes in consumer demand can cause much larger fluctuations in orders and stock 

levels upstream of the chain, amplifying the demand variance (Tai et al., 2022, p. 5). The 

phenomenon is exacerbated by poor information flow and inconsistent forecasting at 

different parts of the organization. For instance, inadequate information flow, long lead 

times, a distorted view of demand or inefficient inventory management can lead to 

recurring problems within the organization. The bullwhip effect causes over-stocking and 

poor customer service (Tai et al., 2022, p. 1), creating uncertainty that reduces
operational efficiency (Disney et al., 2021, p. 5810).

 
Accurate demand forecasting plays a crucial role in reducing the probability of the 

bullwhip effect. According to Pereira and Frazzon (2021), the cornerstone of creating an

accurate forecast is choosing the right forecasting method and using machine learning 

algorithms to make the forecast, as machine learning provides better demand 


predictability, which gives managers a better chance of identifying consumer needs. This
encourages managers to make more confident decisions, which in turn minimizes
mismatches between supply and demand (Pereira & Frazzon, 2021, p. 11). According to

Ganjare et al. (2023), the bullwhip effect can be prevented by careful inventory 

management, accurate order-up and replenishment strategies. To conclude, the 

literature suggests that transparent information flow across the supply chain, combined 

with highly accurate forecasting, is key to preventing the bullwhip effect. 

 
A common conception in forecasting is that the longer the forecast period, the lower the 

forecast accuracy. Saoud, Kourentzes and Boylan (2025) discuss forecast uncertainty. 

They base their study on demand uncertainty and the bullwhip effect, emphasizing 

prevailing uncertainty as the primary driver of costs in the supply chain. However, they 

also highlight the impact of the forecast horizon length on uncertainty. Based on the 

study, forecast uncertainty increases as the forecast horizon lengthens. This is because 

errors accumulate over a longer period of time and the number of parameters affecting 

the variables to be forecast increases. Cerqueira et al. (2025) examine the impact of the

forecast horizon on the performance of forecast models. They discuss the robustness of 

forecast models in volatile environments where anomalies are present. They conclude 

that, when selecting a model for predictive analysis, the best-performing model overall is not
necessarily the best choice for handling unexpected variation. In their study, modern neural
network models outperformed traditional statistical models only in long-term forecasts; in the
short term, there was no significant difference. This emphasizes that the length of the horizon is a

critical factor when selecting a model. 


3 Methodology and data 

This section discusses the different stages of this thesis in chronological order. The 

objective of this section is to explain the unit of analysis, process steps, data, and tools 

used to obtain the results. The first subsections discuss how and where the data was
acquired, the different stages of data pre-processing, and the statistical methods used in
the pre-processing stage. Next, the algorithms selected for this study and the way they were
utilized are presented. Finally, the comparison of the models is discussed by explaining
the selection of error and performance metrics and how the most important data features
for each model were identified.

 
3.1 Data 

The data for this project was sourced from a publicly available dataset from Kaggle.com. 

The dataset is in tabular format and consists of simulated and anonymized weekly sales 

data from a U.S.-based retail chain. The dataset includes various variables depicting

exogenous factors influencing demand in addition to the store-specific features and date 

columns for temporal information. In this thesis, following standard machine learning 

terminology, these external influencing factors, as well as temporal and store-specific 

factors used as inputs for the forecasting models, are referred to as features (Van Wyk, 

2023, p. 7). This dataset forms the basis for this study and enables the examination of 

how these features impact demand forecasts generated by machine learning models. 

 
In the first stage of data preprocessing, the nature of the data was analyzed by validating
the data types of the dataset. The dataset consists of 6435 rows and 8 columns; thus,
the dataset has a total of 51 480 data values (6435 × 8). The data types for numerical features are
float and integer, and object for the date feature (Appendix 1). The numerical features
are Store, Weekly_Sales, Holiday_Flag, Temperature, CPI, Unemployment and Fuel_Price.
The explanations for the variables can be found in Table 4 below.

 
Table 4. Explanations for each variable of the dataset.

Variable      | Explanation
Store         | Indicates the number of the store
Date          | The week of sales
Weekly_Sales  | Sales for the given store from one week
Holiday_Flag  | Indicates whether the week is a special holiday week (1 = holiday week, 0 = non-holiday week)
Temperature   | Temperature (in °F) during the week for the region of the store
Fuel_Price    | Cost of fuel in the region
CPI           | Consumer Price Index
Unemployment  | Prevailing regional unemployment rate

 
Holiday weeks mark the four most prominent holidays in the U.S. which are Super Bowl, 

Labor Day, Thanksgiving, and Christmas, as defined in the original Kaggle dataset (2021) 

documentation. An examination of the dataset reveals that the numerical variables Store
and Holiday_Flag require conversion into categorical features to prevent the models
from misinterpreting their numerical codes as ordered values. For example, this ensures
that Store 45 is not interpreted as having a greater value than Store 1. Furthermore, as the
Date column only refers to the week of sales, Day, Week, Month, and Year columns were
derived from the Date column to provide more detailed temporal values for time series analysis.
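The conversions described above can be sketched with pandas as follows. This is a minimal illustration rather than the exact code used in the thesis: the toy rows and the day-month-year date format are assumptions, and only the column names come from Table 4.

```python
import pandas as pd

# Toy rows in the column layout of Table 4 (all values are illustrative only).
df = pd.DataFrame({
    "Store": [1, 1, 45],
    "Date": ["05-02-2010", "12-02-2010", "05-02-2010"],  # assumed day-month-year strings
    "Weekly_Sales": [1000.0, 1100.0, 900.0],
    "Holiday_Flag": [0, 1, 0],
})

# Parse the object-typed Date column into real timestamps.
df["Date"] = pd.to_datetime(df["Date"], format="%d-%m-%Y")

# Treat store identifiers and the holiday flag as unordered categories,
# so that Store 45 is not read as "larger" than Store 1.
df["Store"] = df["Store"].astype("category")
df["Holiday_Flag"] = df["Holiday_Flag"].astype("category")

# Derive calendar parts from the parsed date.
df["Day"] = df["Date"].dt.day
df["Week"] = df["Date"].dt.isocalendar().week.astype(int)
df["Month"] = df["Date"].dt.month
df["Year"] = df["Date"].dt.year
```

Whether the categorical columns are subsequently one-hot encoded or passed to the models as native categories depends on the model, and is not fixed by this sketch.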

 
3.1.1 Data acquisition and preparation 

All data processing for this thesis was performed using the Python programming language in

Jupyter Notebook. Jupyter is an open-source web-based interactive computing 

environment used for data science (The Jupyter Notebook — Jupyter Notebook 7.5.0b0 

Documentation, 2015). The data was analyzed using several libraries: pandas and NumPy 

for data handling, Matplotlib and Seaborn for visualization, and scikit-learn for machine 

learning methods. LightGBM and XGBoost were applied for gradient boosting models. 


Data processing began with importing the necessary libraries into the data processing
environment, after which the dataset file was read into Python with the pandas
library for further processing.

 
3.1.2 Data cleaning and preprocessing 

After the initial overview of the dataset, the data pre-processing phase continued by 

checking the dataset for missing values, duplicated rows, and possible outliers. Pre-

processing was done to ensure the quality of the raw data to prevent distorted analysis 

that would result from incorrect or biased data (Çetin & Yıldız, 2022, p. 300). Cleaning 

the dataset is an essential part of data analysis (Slater & Hasson, 2025, p. 723); thus, 

data analysis began with data cleaning by detecting missing or duplicated data, and 

identifying any anomalies in the data. Any empty or duplicated rows found would have been
deleted from the dataset. However, the dataset chosen for this project did
not include any empty or duplicated rows, so no further processing was required at this
point of the analysis.
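A check of this kind can be sketched with pandas as shown below. The toy frame is an assumption made for illustration and deliberately contains one missing value and one duplicated row, unlike the actual dataset, which contained neither.

```python
import numpy as np
import pandas as pd

# Toy data with one missing value and one exact duplicate (illustrative only).
df = pd.DataFrame({
    "Store": [1, 1, 2, 2, 2],
    "Weekly_Sales": [100.0, 120.0, np.nan, 90.0, 90.0],
    "Holiday_Flag": [0, 1, 0, 0, 0],
})

# Count missing values per column and fully duplicated rows.
missing_per_column = df.isna().sum()
n_duplicates = int(df.duplicated().sum())

# Drop incomplete and duplicated rows; on a clean dataset this is a no-op.
cleaned = df.dropna().drop_duplicates().reset_index(drop=True)
```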

 
After handling missing and duplicated values, the dataset was checked for possible 

outliers to ensure a robust data analysis. Outlier detection began with a visual inspection 

to help select the most efficient and objective method for identifying outliers (Alves et 

al., 2024a, p. 5). The numerical variables were assessed with box plots for
outlier visualization and histograms for displaying the skewness of each variable. In
addition, outliers were identified using the interquartile range (IQR) method. All detected

outliers were further validated to determine whether the deviations were caused by 

errors or represented natural variation.  

 
3.2 Feature engineering  

The next task in data pre-processing is feature engineering, in which new features
are derived from the existing dataset in a form that the
machine learning algorithm can benefit from. Kampezidou and others (2024, p. 388)
state that new features can be produced by generating, transforming, and combining
existing features. The purpose of feature engineering is to improve the efficiency of
the machine learning models used by minimizing generalization and training errors.

 
Some of the new features were created solely for exploratory data analysis. These 

features were dropped from the dataset before moving forward to the predictive 

analysis with the machine learning algorithms. This was done to prevent data
leakage during predictive analysis, which can produce unreliable results. Data leakage
refers to a situation where information from the test data is mixed into the training set,
resulting in misleadingly good results on the test data but reducing the generalization of the
model, making it useless in real-world problems (Liu et al., 2022, p. 13).

 
Lag and Rolling features were engineered as they can increase the performance of a time 

series analysis especially when using tree-based algorithms (Kampezidou et al., 2024, p. 

388). Since the data covers a relatively short period of three years, the Lag and Rolling 

features were created for time periods of one, four, and eight weeks. The four- and eight-

week features were created to reflect time periods of approximately one and two 

months. Lag and Rolling features can improve the model’s training efficiency and 

performance to predict seasonal patterns and trends (Tam et al., 2025, p. 23). 
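Lag and rolling features of this kind can be sketched with pandas as follows. The feature names (Lag_1, Rolling_Mean_4, and so on) and the use of a rolling mean are assumptions made for illustration. The rolling window is computed on values shifted by one week, so that a row's own sales never enter its features, and all operations are grouped by store so that no values cross store boundaries.

```python
import pandas as pd

# Toy weekly series for two stores (illustrative values only).
df = pd.DataFrame({
    "Store": [1] * 10 + [2] * 10,
    "Date": list(pd.date_range("2010-02-05", periods=10, freq="7D")) * 2,
    "Weekly_Sales": [float(v) for v in range(10, 20)] + [float(v) for v in range(100, 110)],
}).sort_values(["Store", "Date"]).reset_index(drop=True)

sales_by_store = df.groupby("Store")["Weekly_Sales"]

for weeks in (1, 4, 8):
    # Sales observed `weeks` weeks earlier in the same store.
    df[f"Lag_{weeks}"] = sales_by_store.shift(weeks)
    # Mean of the previous `weeks` weeks, excluding the current week.
    df[f"Rolling_Mean_{weeks}"] = sales_by_store.transform(
        lambda s, w=weeks: s.shift(1).rolling(w).mean()
    )
```

The first rows of each store contain missing values for these features, because no earlier observations exist; how those rows are handled (dropped or imputed) is a separate modelling decision.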

 
In addition to the temporal features, Interaction features were engineered to 

complement the temporal ones, capturing non-linear relationships and providing

insight into how the original features of the dataset relate to one another. New features 

were engineered separately for both EDA and predictive analysis, because not all new 

engineered features needed for EDA could be used in the predictive analysis due to 

possible data leakage and biased parameter estimates. 


3.3 Train-test split 

When conducting predictive analysis on time series data, it is crucial to split the dataset 

into training and test sets before feature engineering and model training.
The data was split into training and test sets while preserving the chronological
order of the data. The train-test ratio was 65 to 35, meaning that 65 % of the data
was used as the training set and the remaining 35 % as the test set, which the models
predict. A chronological (time series) split was used because a different split method,
such as random sampling, could break the time dependency and undermine the reliability of
the time series analysis.
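A chronological 65/35 split can be sketched as below. The helper name chronological_split and the choice to split on a date cutoff shared by all stores are assumptions made for illustration; the thesis does not specify the exact implementation.

```python
import pandas as pd

def chronological_split(df, date_col="Date", train_share=0.65):
    """Split on a date cutoff so that every test week is later than every training week."""
    weeks = sorted(df[date_col].unique())
    n_train_weeks = int(round(len(weeks) * train_share))
    cutoff = weeks[n_train_weeks - 1]  # last week that belongs to the training set
    train = df[df[date_col] <= cutoff]
    test = df[df[date_col] > cutoff]
    return train, test

# Toy panel: two stores observed over the same 20 weeks (illustrative only).
dates = pd.date_range("2010-02-05", periods=20, freq="7D")
df = pd.DataFrame({
    "Store": [1] * 20 + [2] * 20,
    "Date": list(dates) * 2,
    "Weekly_Sales": range(40),
})

train, test = chronological_split(df)
```

Splitting on a shared cutoff keeps the same weeks in the test set for every store, which makes the per-store forecasts directly comparable.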

 
3.4 Exploratory data analysis 

Exploratory data analysis (EDA) was conducted using descriptive statistics to explore the patterns,

possible trends, and overall structure of the dataset, and to visualize outlier detection 

and the train-test split. EDA started by summarizing the data with tables and bar charts. 

Histograms were also used to visualize the characteristics of the dataset and the 

skewness of numerical features. Correlation and correlation coefficients between 

individual features were visualized by a feature correlation heatmap. Outliers were 

visualized with boxplots, interquartile range (IQR), and tables, and possible trend 

patterns were visualized with a 12-week moving average of weekly sales. Holidays and 

non-holiday dates were visualized with a simple pie chart and a histogram.

 
3.5 Machine Learning Models 

The following predictive algorithms were used in this thesis: simple naïve, 52-week 

seasonal naïve, Multilayer Perceptron (MLP) neural network, LightGBM, and XGBoost. Based

on previous literature and research results, the gradient boosting models LightGBM and 


XGBoost were selected as the primary comparison targets for this prediction. 

Naïve models were selected as baseline models to provide perspective when compared 

with more advanced algorithms to display the differences in predictive performances. 

Since it was assumed that the advanced machine learning models would outperform 

simple naïve predictions in the comparison, one more multivariate algorithm was 

selected for comparison to make the comparison more comprehensive. A neural
network-based algorithm, the multilayer perceptron (MLP), was chosen due to its efficiency
in processing time series data, as well as its computational efficiency compared to
other neural network-based algorithms, such as LSTM (long short-term memory)
(Ramos et al., 2023).
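The two naïve baselines can be sketched with NumPy as follows. The function names and the toy series are assumptions made for illustration; the sketch only shows the general idea of repeating the last observed value (simple naïve) or the value from 52 weeks earlier (seasonal naïve).

```python
import numpy as np

def simple_naive(history, horizon):
    """Forecast every future week with the last observed value."""
    return np.full(horizon, float(history[-1]))

def seasonal_naive(history, horizon, season_length=52):
    """Forecast each future week with the value observed one season earlier."""
    history = np.asarray(history, dtype=float)
    if len(history) < season_length:
        raise ValueError("need at least one full season of history")
    last_season = history[-season_length:]
    repeats = int(np.ceil(horizon / season_length))
    # Repeat the last observed season as often as needed to cover the horizon.
    return np.tile(last_season, repeats)[:horizon]

history = np.arange(104, dtype=float)  # two years of toy weekly sales
forecast = seasonal_naive(history, horizon=60)
flat = simple_naive(history, horizon=3)
```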

 
The model training process started with pre-processing of the dataset. After the initial 

check for outliers, empty rows, and duplicates, the dataset was split into train and test 

sets. Once the train-test split was done, new features were engineered to support the
predictive performance of the models used. Hyperparameter optimization was
conducted using randomized search for the gradient boosting models (LGBM & XGB) as well
as for the neural network model (MLP). The models were evaluated using
four metrics: mean absolute error (MAE), root mean square error
(RMSE), mean absolute scaled error (MASE), and R-squared (R²).
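A randomized hyperparameter search on time-ordered data can be sketched with scikit-learn as shown below. Several parts are assumptions made for illustration: GradientBoostingRegressor stands in for LightGBM and XGBoost, the search space and the number of iterations are arbitrary, and the use of TimeSeriesSplit with MAE as the scoring metric is one reasonable choice rather than the configuration used in the thesis.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV, TimeSeriesSplit

# Synthetic, time-ordered toy data (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=120)

# Hypothetical search space; each draw samples one value per parameter.
param_distributions = {
    "n_estimators": [50, 100, 200],
    "learning_rate": [0.01, 0.05, 0.1],
    "max_depth": [2, 3, 4],
}

search = RandomizedSearchCV(
    estimator=GradientBoostingRegressor(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,                               # number of sampled configurations
    scoring="neg_mean_absolute_error",      # scikit-learn maximizes, hence the sign
    cv=TimeSeriesSplit(n_splits=3),         # each validation fold follows its training fold in time
    random_state=0,
)
search.fit(X, y)
best_params = search.best_params_
```

Using a time-aware cross-validation scheme inside the search keeps the validation folds later than the training folds, in line with the chronological train-test split.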

 
Root mean square error (Equation 1) measures the magnitude of prediction errors, with 

lower values indicating better accuracy (Kannadasan, 2025, p. 25). 

 
Equation 1. Equation of root mean square error (RMSE).

$$\mathrm{RMSE} = \sqrt{\frac{\sum_{i=1}^{n} (y_i - x_i)^2}{n}}$$

where n is the number of data points, y_i is the actual value for
data point i, and x_i is the predicted value for data point i.

 
Mean absolute error (Equation 2) measures the average absolute difference between the
predicted and actual values, indicating how close the predictions are to the actual values.

 
Equation 2. Equation of mean absolute error (MAE).

$$\mathrm{MAE} = \frac{\sum_{i=1}^{n} |y_i - x_i|}{n}$$

where n is the number of data points, y_i is the actual value for
data point i, and x_i is the predicted value for data point i.

 
Mean absolute scaled error (Equation 3) indicates the effectiveness of the predictive
model by comparing its mean absolute error (MAE) with the MAE of a naïve forecast

(Huber & Stuckenschmidt, 2020, p. 1430). 

 
Equation 3. Equation of mean absolute scaled error (MASE).

$$\mathrm{MASE} = \frac{\mathrm{MAE}}{\mathrm{MAE}_{\text{naïve}}}$$

where MAE is the mean absolute error of the prediction and
MAE_naïve is the MAE of the naïve forecast.

 
The coefficient of determination, also referred to as R-squared (R2), quantifies the 

amount of variance in the dependent variable that can be explained by the independent

variables (Chicco et al., 2021, p. 2). 

 
Equation 4. Equation of R-squared (R²).

$$R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$$

where ŷ_i is the prediction for data point i, y_i represents the
actual value of data point i, and ȳ is the mean of all the
observations (Scikit-Learn., 2025).
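The four metrics can be sketched in NumPy as follows. The function names and the toy values are assumptions made for illustration. R² is computed according to the scikit-learn definition, i.e. one minus the ratio of the residual sum of squares to the total sum of squares, and MASE is computed as the ratio of the model's MAE to the MAE of a naïve forecast over the same period.

```python
import numpy as np

def rmse(y, x):
    """Root mean square error between actual values y and predictions x."""
    return float(np.sqrt(np.mean((y - x) ** 2)))

def mae(y, x):
    """Mean absolute error between actual values y and predictions x."""
    return float(np.mean(np.abs(y - x)))

def mase(y, x, x_naive):
    """MAE of the model divided by the MAE of a naive forecast for the same period."""
    return mae(y, x) / mae(y, x_naive)

def r_squared(y, x):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y - x) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Toy values (illustrative only).
actual = np.array([10.0, 12.0, 14.0, 16.0])
predicted = np.array([11.0, 12.0, 13.0, 18.0])
naive = np.array([8.0, 10.0, 12.0, 14.0])
```

A MASE below one indicates that the model's average absolute error is smaller than that of the naïve forecast, while a value above one indicates that the naïve forecast is more accurate.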

 
3.6 Feature Importance Analysis 

Feature importance analysis is particularly important when the subject of analysis is 

highly dependent on exogenous factors. The feature importance method chosen for this
project was permutation feature importance, as it can be applied to evaluate
any model, regardless of its operating principles. According to Yagmur and others (2024),
it measures the significance of a feature by how much the model’s performance metrics
(e.g., MAE, RMSE, R²) change when the values of a single feature are randomly permuted,
thus breaking its connection to the target variable. In other words, the permutation
importance method can be used to determine the contribution of external variables to the
prediction of weekly sales.

 
Because the method is model-agnostic, it was applied to both the
gradient boosting and neural network models. The permutation method was applied
manually to test all models with different feature configurations, to show which features
work best together and which features repeatedly contribute the most
regardless of the configuration. The testing started with the baseline features of the

dataset, after which new engineered features were gradually added into the feature 

configuration. This allowed the identification of the most important features, by which 

the final set of variables could be selected. 
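A manual permutation importance calculation can be sketched as follows. The function name, the use of MAE as the reference metric, the number of repeats, and the least-squares stand-in model are all assumptions made for illustration; any fitted model that exposes a prediction function could be used in its place.

```python
import numpy as np

def permutation_importance_mae(predict, X, y, n_repeats=10, seed=0):
    """Average increase in MAE when one feature column at a time is shuffled."""
    rng = np.random.default_rng(seed)
    baseline = np.mean(np.abs(y - predict(X)))
    importances = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        increases = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            # Shuffling column j breaks its link to the target while
            # keeping its marginal distribution unchanged.
            X_perm[:, j] = rng.permutation(X_perm[:, j])
            increases.append(np.mean(np.abs(y - predict(X_perm))) - baseline)
        importances[j] = np.mean(increases)
    return importances

# Toy data: the first feature matters most, the third not at all.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = 5.0 * X[:, 0] + 0.5 * X[:, 1]

# Stand-in model: ordinary least squares fitted with NumPy.
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
predict = lambda data: data @ coef

importances = permutation_importance_mae(predict, X, y)
```

In practice the importance should be computed on held-out data, so that it reflects what the model relies on when generalizing rather than what it memorized during training.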

 
4 Results 

In this section, the results from each section of the empirical part of the thesis are 

presented. The interpretation of results begins with exploratory data analysis, after 

which the predictive performance and feature importance metrics are evaluated. The 

models’ results are first evaluated separately for each model, after which the results are 

analyzed relative to one another. The model performance section delves into the 

accuracy of the models, presenting the error and forecast metrics in tabular form. The 

feature importance and interpretation section analyzes how the models utilized external 

features in their forecasts and provides the results from feature importance analysis. 

Finally, the prediction accuracies of the models are compared with each other under 

comparative analysis where the prediction accuracy of each model is visualized with 

respect to actual sales. 

 
4.1 Data analysis results 

The data used in this thesis was collected from an open source. The dataset contains 

anonymized simulated weekly sales data from forty-five different stores from different 

regions, as well as exogenous factors describing the prevailing economic conditions during the
sales week. Before statistical data analysis, the usability of the dataset was ensured

through data preprocessing and data validation methods.  The steps are as follows: (1) 

preprocessing and validating the data, (2) outlier detection, (3) data cleaning, (4) new 

feature engineering, (5) normalizing of the data, and (6) balancing the dataset (Van Wyk, 

2023, pp. 7–8). 

 
4.1.1 Exploratory data analysis 

Data analysis began with visualizing the dataset to explain and understand anomalies 

and the statistical nature of the data. Alves and others (2024b, p. 2) state that 


exploratory data analysis is essential for gaining deeper insight into the dataset, as it helps
identify outliers and inliers and derive information on key variables and

features of the data. In this thesis, the exploratory data analysis includes analysis of the 

time series to understand possible trends and patterns of the sales data. The exploratory 

data analysis began with a visual examination of the dataset. The nature of the numerical 

features of the data was examined based on the distribution of the variables. Based on 

this, it was found that most of the data was close to normally distributed, except for 

Unemployment (see Appendix 7), which was heavily skewed. Furthermore, the
relationships between the target variable, Weekly_Sales, and continuous numerical
features, such as Temperature and Fuel_Price, were examined using scatter plots. This

preliminary analysis suggested that Unemployment data might contain outliers.  

 
Figure 9 depicts the average weekly sales for the dataset, and the orange line represents 

the 12-week moving average, providing a fundamental perspective on the time series 

data. The graph clearly indicates that sales peak before and during December, after
which there is a sharp drop in sales right before January. This observation can be
explained by the holiday season: of the four holidays flagged in the dataset (2021), namely
Super Bowl, Labor Day, Thanksgiving, and Christmas, the latter two fall in the last quarter of
the year.

 
Figure 9. Weekly Sales, 12-Week Moving Average. 

 
The time series analysis provides valuable insight when forecasting sales and demand. 

Neba and others (2024) state that identifying seasonal trends and patterns is important 

for forecasting accuracy, resource allocation and inventory management, promotional 

strategies, and financial planning (Neba et al., 2024, p. 11). At the same time,
trends can increase consumer uncertainty, which challenges forecasting
accuracy and creates a risk of inefficiencies in inventory management (Ratra & Seth, 2025, p.
1024). Furthermore, Caniato et al. (2005, pp. 39–40) studied demand variability and
divided it into three main drivers: seasonal, promotional, and random (i.e. unpredictable)

variability. They state that the variability of demand is mostly derived from the seasonal 

and promotional variability, further emphasizing the importance of understanding the 

patterns in the time series data to be forecasted. 

 
Pairwise correlations between features were examined with Spearman correlation, and
the results are visualized as a correlation heatmap in Figure 10. The heatmap
illustrates the correlation coefficient for each pair of features, indicating how strong

the correlation between two variables is. The value for the correlation is between -1 and 

1, negative values indicating a negative correlation, positive values indicating a positive 

correlation, and a value of zero means there is no correlation between the variables (J. 

Li et al., 2024, p. 152). The bar on the right side of the figure explains the color palette, 

giving a deeper color for a stronger correlation. 
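As an illustration of how a Spearman coefficient is obtained, the following minimal sketch ranks both variables (tied values share the mean of their ranks) and then computes the Pearson correlation of the ranks. It is not the implementation used in this thesis: the function names and the toy values are hypothetical, and in practice a library routine would normally be used instead.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1 to n; tied values receive the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the value at position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical toy values, not taken from the thesis dataset
fuel_price = [2.5, 2.7, 3.1, 3.4, 3.9]
weekly_sales = [1.2, 1.5, 1.4, 1.9, 2.3]
print(round(spearman(fuel_price, weekly_sales), 3))
```

Because only the ranks enter the calculation, any strictly increasing relationship yields a coefficient of 1 even when it is not linear, which is the main difference from the Pearson coefficient.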

 

 
Figure 10. Feature Correlation Heatmap with Spearman’s Correlation. 

 
The correlation between the variables is, on average, weak according to the correlation

matrix. The matrix shows that the correlation between weekly sales and features such

as CPI, fuel price, temperature, and unemployment is between -0.07 and 0.03, i.e. close to

zero. Despite this, the models are expected to find dependencies in the data during

training, since a pairwise coefficient only captures a monotonic relationship between two

variables. It is important to note that the correlations between interaction features and

their components, such as Fuel_Temp_Interaction with Fuel_Price and Temperature, are not

informative, as Fuel_Temp_Interaction was derived from these two features and is therefore

correlated with them by construction.

 

 
4.1.2 Outlier detection 

Outlier detection is performed to identify unexpected observations that deviate

significantly from the rest of the data (Reunanen et al., 2020, p. 287) and thereby to

sustain the usability of the dataset. Multivariate datasets often contain many outliers,

which is why, according to Alves and others (2024b), outlier detection is a principal task

during data processing. They emphasize that detecting outliers is often essential, but the

detection method must be efficient and objective. They also describe the removal of

outliers as potentially imperative, meaning that removing the outliers is not always

necessary (Alves et al., 2024b, p. 5).

 
The outlier detection method used in this thesis included outlier visualization with

boxplots, after which the interquartile range (IQR) was applied as a computational method.

The IQR method and boxplots both use the same computational formula to determine the upper

and lower quartiles (Tukey, 1977, pp. 43–44). The interquartile range is obtained by

subtracting the first quartile from the third (Q3-Q1), and the inner fences are set at 1.5

times the IQR below the first quartile and above the third quartile (Tukey, 1977, pp.

43–44). As outlier detection is dependent on the context, the flagged values were

evaluated in light of domain knowledge.
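Written out, the fences described above take the following form, where Q1 and Q3 denote the first and third quartiles. The symbols L and U for the lower and upper inner fences are notation introduced here for illustration only.

```latex
\begin{align*}
\mathrm{IQR} &= Q_3 - Q_1 \\
L &= Q_1 - 1.5 \cdot \mathrm{IQR} \\
U &= Q_3 + 1.5 \cdot \mathrm{IQR}
\end{align*}
```

An observation x is flagged as a possible outlier when x < L or x > U.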

 
The IQR method was applied by creating a function (see Appendix 1) to calculate the 

interquartile range and detect outliers for each column passed to the function. The 

function was subsequently applied to all numerical features in the dataset. Finally, the

output of the code displays all values that fall below the lower bound or above the upper

bound. Outliers detected with IQR are depicted in figure 11.
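The procedure described above can be sketched roughly as follows. This is not the function from Appendix 1 but a minimal illustration under stated assumptions: the function name iqr_outliers, the parameter k, the quartile interpolation method, and the toy columns are all hypothetical, and the actual implementation may compute quartiles differently.

```python
from statistics import quantiles


def iqr_outliers(values, k=1.5):
    """Return (lower_bound, upper_bound, outliers) for a sequence of numbers.

    Assumption: quartiles use linear interpolation over the observed data
    (method="inclusive"); other quartile conventions give slightly
    different bounds.
    """
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1                 # interquartile range, Q3 - Q1
    lower = q1 - k * iqr          # lower inner fence
    upper = q3 + k * iqr          # upper inner fence
    outliers = [v for v in values if v < lower or v > upper]
    return lower, upper, outliers


# Hypothetical toy columns, not taken from the thesis dataset
columns = {
    "Weekly_Sales": [10, 12, 11, 13, 12, 11, 40],
    "Temperature": [50, 55, 60, 58, 52, 57, 54],
}

# Apply the function to every numerical column and show the flagged values
for name, column_values in columns.items():
    lower, upper, flagged = iqr_outliers(column_values)
    print(f"{name}: bounds=({lower}, {upper}), flagged={flagged}")
```

Returning the bounds together with the flagged values keeps the decision about removal separate from detection, so that each flagged observation can still be checked against domain knowledge before anything is dropped.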

 
Figure 11. Outlier detection with interquartile range. 

 

 
Only two of the five numerical columns included notable outliers. Before the possible

outliers were removed from the data, the reasons for the deviation were checked to 

ensure that the observations were valid. The possible outliers in Weekly_Sales are 

displayed in figure 12 with a boxplot. 

 
Figure 12. Boxplot of Weekly_Sales. 

 
The dots represent the datapoints that fall outside the upper (or lower) bounds (fences) 

(Tukey, 1977, pp. 43–44). With Weekly_Sales, there were 34 datapoints fal